### 1. Feature Engineering

Raw text cannot be fed directly into the LSTM model. We first have to engineer features from it before moving on to the modeling step.

In [None]:
!git clone https://github.com/FIAPON/fiap-deep-learning.git

In [None]:
%cd /content/fiap-deep-learning/RNNS

In [1]:
# Imports
import numpy as np
import pandas as pd

In [2]:
# Load the dataset
data = pd.read_csv('dataset/tweets.csv')

In [3]:
data.head()

Unnamed: 0,Date,Time,Tweet_Text,Type,Media_Type,Hashtags,Tweet_Id,Tweet_Url,twt_favourites_IS_THIS_LIKE_QUESTION_MARK,Retweets,Unnamed: 10,Unnamed: 11
0,16-11-11,15:26:37,Today we express our deepest gratitude to all ...,text,photo,ThankAVet,7.97e+17,https://twitter.com/realDonaldTrump/status/797...,127213,41112,,
1,16-11-11,13:33:35,Busy day planned in New York. Will soon be mak...,text,,,7.97e+17,https://twitter.com/realDonaldTrump/status/797...,141527,28654,,
2,16-11-11,11:14:20,Love the fact that the small groups of protest...,text,,,7.97e+17,https://twitter.com/realDonaldTrump/status/797...,183729,50039,,
3,16-11-11,2:19:44,Just had a very open and successful presidenti...,text,,,7.97e+17,https://twitter.com/realDonaldTrump/status/796...,214001,67010,,
4,16-11-11,2:10:46,A fantastic day in D.C. Met with President Oba...,text,,,7.97e+17,https://twitter.com/realDonaldTrump/status/796...,178499,36688,,


All we need is the **Tweet_Text** field. Let's combine all rows into a single text corpus by concatenating the tweets, separated by two newlines:

In [4]:
text = '\n\n'.join(data['Tweet_Text'].values)

To reduce the size of our feature space and the training time, we remove rare characters (those that appear fewer than 300 times in the corpus):

In [5]:
from collections import Counter
import re

In [6]:
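cntr = Counter(text)
# Characters that appear fewer than 300 times in the corpus
rare = {c for c, n in cntr.items() if n < 300}
# Filter in a single pass; building a regex character class from raw characters breaks on '^' or a backslash
text = ''.join(c for c in text if c not in rare)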
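
To see what the rare-character filter does, here is a minimal sketch on a toy string. The helper name `drop_rare_chars`, the toy corpus, and the threshold of 3 are assumptions made for illustration; the notebook applies the same idea to the full corpus with a threshold of 300.

```python
from collections import Counter


def drop_rare_chars(corpus, min_count):
    """Remove every character that occurs fewer than min_count times."""
    counts = Counter(corpus)
    keep = {c for c, n in counts.items() if n >= min_count}
    return ''.join(c for c in corpus if c in keep)


# Hypothetical toy corpus: 'a', 'b' and the space are frequent, the rest are rare
toy = 'aaaa bbbb [c] ^d'
cleaned = drop_rare_chars(toy, min_count=3)
print(repr(cleaned))
```

Note that the rare characters here include `[`, `]` and `^`, which have a special meaning inside a regular expression character class. Filtering character by character sidesteps any escaping.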

Here is what the beginning of the corpus looks like:

In [7]:
text[:1000]

'Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://t.co/wPk7QWpK8Z\n\nBusy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!\n\nLove the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!\n\nJust had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!\n\nA fantastic day in D.C. Met with President Obama for first time. Really good meeting, great chemistry. Melania liked Mrs. O a lot!\n\nHappy 241st birthday to the U.S. Marine Corps! Thank you for your service!! https://t.co/Lz2dhrXzo4\n\nSuch a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before\n\nWatching the returns at 9:45pm.\n#ElectionNight #MAGA__ https://t.co/HfuJeRZ

In [8]:
print('Total characters in corpus:', len(text))
chars = sorted(list(set(text)))
print('Total unique characters:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

Total characters in corpus: 857177
Total unique characters: 78
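
The two dictionaries are inverses of each other: one maps a character to its integer index, the other maps the index back to the character. A small sketch on a toy string shows the round trip. The `toy_*` names and the text are assumptions for illustration only.

```python
toy_text = 'hello world'

# Same construction as in the cell above, on a toy vocabulary
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

# Encode the text as integer indices, then decode it back
encoded = [toy_char_indices[c] for c in toy_text]
decoded = ''.join(toy_indices_char[i] for i in encoded)

print(toy_chars)
print(encoded)
print(decoded)
```

Because the vocabulary is sorted, the mapping is deterministic across runs, which matters when a saved model is reloaded later and its output indices have to be decoded with the same table.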


Now, let's cut the text into semi-redundant sequences of **maxlen** characters so that it can be fed into an LSTM model. Each sequence is paired with the character that follows it, which is the target the model learns to predict:

In [9]:
maxlen = 50
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences))

Number of sequences: 285709
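
The sliding window is easier to follow on a short string. The sketch below wraps the same loop in a hypothetical helper, `make_windows` (not part of the notebook), and also checks that the number of sequences reported above follows directly from the corpus length, `maxlen` and `step`.

```python
def make_windows(corpus, maxlen, step):
    """Return (windows, targets): each window of maxlen chars and the char after it."""
    windows, targets = [], []
    for i in range(0, len(corpus) - maxlen, step):
        windows.append(corpus[i:i + maxlen])
        targets.append(corpus[i + maxlen])
    return windows, targets


# Toy example: windows start at positions 0 and 3
windows, targets = make_windows('abcdefghij', maxlen=4, step=3)
print(windows, targets)

# With the notebook's values: 857177 characters, maxlen=50, step=3
expected_count = len(range(0, 857177 - 50, 3))
print(expected_count)
```

Consecutive windows overlap by `maxlen - step` characters, which is why the sequences are called semi-redundant. A smaller `step` yields more training examples at the cost of more memory.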


Next, let's vectorize the sequences as one-hot arrays:

In [10]:
# np.bool was removed in NumPy 1.24; the built-in bool is the equivalent dtype
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
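
A toy version of the vectorization makes the array layout concrete: `X` has shape (sequences, time steps, vocabulary size) and `y` has shape (sequences, vocabulary size). The three-character vocabulary and the `*_toy` names below are assumptions for illustration. The last lines estimate the memory taken by the full `X` from the sizes printed earlier, assuming one byte per boolean.

```python
import numpy as np

toy_chars = ['a', 'b', 'c']
toy_index = {c: i for i, c in enumerate(toy_chars)}
toy_sentences = ['ab', 'bc']   # two windows of length 2
toy_next = ['c', 'a']          # the character that follows each window

X_toy = np.zeros((len(toy_sentences), 2, len(toy_chars)), dtype=bool)
y_toy = np.zeros((len(toy_sentences), len(toy_chars)), dtype=bool)
for i, sentence in enumerate(toy_sentences):
    for t, char in enumerate(sentence):
        X_toy[i, t, toy_index[char]] = True
    y_toy[i, toy_index[toy_next[i]]] = True

# Decoding with argmax recovers the original windows
decoded = [''.join(toy_chars[k] for k in row.argmax(axis=1)) for row in X_toy]
print(X_toy.shape, y_toy.shape, decoded)

# Rough size of the full X: 285709 sequences x 50 steps x 78 characters
full_bytes = 285709 * 50 * 78
print(round(full_bytes / 1e9, 2), 'GB')
```

At roughly 1.1 GB for `X` alone, the boolean dtype is worth keeping: the same array stored as `float64` would be eight times larger.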

### 2. LSTM Model

In [11]:
import random
import sys
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop

Using TensorFlow backend.


Let's create some reusable functions that generate text from our generative model.

In [12]:
cntr = Counter(text)
cntr_sum = sum(cntr.values())
char_probs = list(map(lambda c: cntr[c] / cntr_sum, chars))

In [13]:
def sample(preds):
    preds = np.asarray(preds).astype('float64')
    preds = preds / np.sum(preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
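
`sample` draws the next character at random according to the predicted probabilities, instead of always picking the most likely one. The sketch below repeats the function so that it runs on its own and checks, on a hypothetical three-class distribution, that the empirical frequencies approach the probabilities. The fixed random seed and the number of draws are arbitrary choices for the demonstration.

```python
import numpy as np


def sample(preds):
    """Draw one index at random, with probability proportional to preds."""
    preds = np.asarray(preds).astype('float64')
    preds = preds / np.sum(preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


np.random.seed(0)  # only to make the demonstration repeatable
toy_preds = [0.7, 0.2, 0.1]
draws = [sample(toy_preds) for _ in range(10000)]
freqs = np.bincount(draws, minlength=3) / len(draws)
print(freqs)

# For comparison, greedy decoding would always return the same index
greedy = int(np.argmax(toy_preds))
print(greedy)
```

Sampling is what keeps the generated text from looping on the same most probable characters, at the price of occasionally picking an unlikely one.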

In [14]:
def generate(model, length, seed=''):
    
    if len(seed) != 0:
        sys.stdout.write(seed)
    
    generated = seed
    sentence = seed[-maxlen:]  # the model only sees the last maxlen characters

    for _ in range(length):
        x = np.zeros((1, maxlen, len(chars)))

        padding = maxlen - len(sentence)

        for p in range(padding):
            x[0, p] = char_probs  # pad with the prior character frequencies

        for t, char in enumerate(sentence):
            x[0, padding + t, char_indices[char]] = 1.

        preds = model.predict(x, verbose=0)[0]
        next_index = sample(preds)
        next_char = indices_char[next_index]

        # Slide the window; a short seed grows until it fills maxlen characters
        sentence = (sentence + next_char)[-maxlen:]
        generated += next_char
        
        sys.stdout.write(next_char)
        sys.stdout.flush()
        
    return generated
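
When the seed is shorter than `maxlen`, `generate` fills the leading time steps with the overall character frequencies and one-hot encodes the seed at the end of the window. The sketch below builds such an input for a toy vocabulary. The helper `build_input`, the `toy_*` values, and the truncation of overly long seeds are assumptions for illustration.

```python
import numpy as np

toy_chars = ['a', 'b', 'c']
toy_index = {c: i for i, c in enumerate(toy_chars)}
toy_probs = [0.5, 0.3, 0.2]  # hypothetical character frequencies
toy_maxlen = 4


def build_input(seed):
    """Return a (1, toy_maxlen, vocab) array: frequency padding, then the one-hot seed."""
    seed = seed[-toy_maxlen:]              # keep at most toy_maxlen characters
    x = np.zeros((1, toy_maxlen, len(toy_chars)))
    padding = toy_maxlen - len(seed)
    x[0, :padding] = toy_probs             # leading steps carry the prior distribution
    for t, char in enumerate(seed):
        x[0, padding + t, toy_index[char]] = 1.0
    return x


x = build_input('cb')
print(x[0])
```

Every time step sums to 1, so the padded rows look to the network like "any character, weighted by how common it is" rather than like an all-zero input it never saw during training.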

Now, let's build the graph of our neural network. Then we train the model and print a generated sample after every epoch. At the end of training, we save the model so that we can quickly reuse it later.
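
For a sense of scale, the number of trainable parameters can be worked out by hand. The sketch assumes the architecture defined in the next cell (two stacked LSTM layers of 128 units followed by a dense softmax layer) and the 78-character vocabulary found earlier, and it uses the standard LSTM formulation with four gates and one bias vector per gate. The helper names are hypothetical.

```python
def lstm_params(n_inputs, n_hidden):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (n_inputs + n_hidden + 1) * n_hidden


def dense_params(n_inputs, n_outputs):
    """Weight matrix plus one bias per output unit."""
    return n_inputs * n_outputs + n_outputs


n_chars, n_hidden = 78, 128
layer1 = lstm_params(n_chars, n_hidden)    # first LSTM reads one-hot characters
layer2 = lstm_params(n_hidden, n_hidden)   # second LSTM reads the first layer's outputs
output = dense_params(n_hidden, n_chars)   # softmax over the vocabulary
total = layer1 + layer2 + output
print(layer1, layer2, output, total)
```

If the model is built with Keras, `model.summary()` should report the same per-layer counts, which is a quick way to confirm that the input shape and layer sizes are what we intended.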

In [15]:
from os.path import isfile
from keras.models import load_model

MODEL_PATH = 'stacked-lstm-2-layers-128-hidden.h5'

if isfile(MODEL_PATH):
    model = load_model(MODEL_PATH)
else:
    N_HIDDEN = 128

    # Model
    model = Sequential()
    model.add(LSTM(N_HIDDEN, dropout = 0.1, input_shape = (maxlen, len(chars)), return_sequences = True))
    model.add(LSTM(N_HIDDEN, dropout = 0.1))
    model.add(Dense(len(chars), activation = 'softmax'))

    # Optimizer
    optimizer = RMSprop(lr = 0.01)
    
    # Compile
    model.compile(loss = 'categorical_crossentropy', optimizer = optimizer)

    # Train one epoch at a time and print a generated sample after each one
    for iteration in range(1, 40):
        print('\n')
        print('-' * 50)
        print('\nIteração', iteration)
        model.fit(X, y, batch_size=3000, epochs=1)

        print('\n-------------------- Tweet Gerado Pelo Modelo Nesta Iteração ---------------------\n')

        rand = np.random.randint(len(text) - maxlen)
        seed = text[rand:rand + maxlen]
        generate(model, 400, seed)

    model.save(MODEL_PATH)



--------------------------------------------------

Iteração 1
Epoch 1/1

-------------------- Tweet Gerado Pelo Modelo Nesta Iteração ---------------------

ht at 6PM. http://t.co/iEh3RHp98S

"Labor Unions GMA,hlot TP NAAC,TAAEAAAAAARAA ATRNAAAAA!AERTAAIEAAOARPA
AAAEAAASCE!AROSNAA DASALlIAEPRARAAANMIA WADDAMAAAAAGAPNETAAAALEEARPAASRMRRINAA.NI ONRA@ANMSRPNNIADAANAANC 
MARS EADARHANNANAIAAATAEAN IAPANANARAA!AAS APhAAGAlARE6ANANIAGNA AARAAAPAPTANALAAAGLTAAAGICAASARATAAARAD

AASATANRa: AAS@EDAwDYAM AEDATAEANN SASAAADAATONLRA! ANAIIADDAATAEIASAA!AAAGACRaAACAARAISEAAACARINE
AA AEAAI!DANAEAAEA TAAMANSAATAKA

--------------------------------------------------

Iteração 2
Epoch 1/1

-------------------- Tweet Gerado Pelo Modelo Nesta Iteração ---------------------

mon Core. Get rid of Common Core--- keep education Carnaca in the Cibaut, Dhanl Cread &amp; bae ton, set? https://t.co/Knd412MYg

"@DotmllnN O: ErineTere @bkowstmy: -Don, for. @CNCNN on @GoxN Cara
https://t.co/D2vFL7VZ"

"Mazaigbe


-------------------- Tweet Gerado Pelo Modelo Nesta Iteração ---------------------

 and enthusiasm is something I will never forget.

"@lranyCNarane:  @Retenlaws Right offeed usions! @Waysic_Oblay #BigLeagueTruth #WITG https://t.co/tDWwsamMgl"

Colkes tonight! They is a call _ with John like Clinton!

Spe low in All of the want  - @oricyJurn Kaine is quanting to high his pacying that Nufers is 10,0000 Riblizanies, Montras: http:/_

Carby fent Sundusty of @ueliePOCline nation point sperit of @cantonnelorsyjam mentive instite. Gr

--------------------------------------------------

Iteração 14
Epoch 1/1

-------------------- Tweet Gerado Pelo Modelo Nesta Iteração ---------------------

ttp://t.co/x1swUrw8l1

Via @AP March,2013: Jeb said. Donald Trump vast exactos not be Moke Ellbs a Wishoussi: Urst First numbers in the president! Weran women people!

Underson the based the presidential cluss some ark country! No lhan the ppoples more.

"@A00 American BOT increditiving the Ald my well 


-------------------- Tweet Gerado Pelo Modelo Nesta Iteração ---------------------

ff to a good start!

"@HouseCracka: Donald Trump is lost a LASEDY results and Mances, we must be President. I fun!

Via Tuesday I Walla Trunp is for ALL points looks 5.M. I get such great ection - Before Wisconsin @LindseyGrahamSC - http://t.co/EEcpua8I1 https://t.co/7cpPz6O9Cd"

Just last nike-watch States in Mexyn, hape amazered homean and will winner your nice more our country. Liberay!

The bOSh has up won up Second who guy greated tired on @

--------------------------------------------------

Iteração 26
Epoch 1/1

-------------------- Tweet Gerado Pelo Modelo Nesta Iteração ---------------------

ve plans. She could care less what Kaine thinks."

Love its unden McSide _ncrassed Senator that wells it with the plany crowds. Clinton is the people and knownelt from talk that Bill to hishone in 2016 https://t.co/XWIIl4pni"

"@gaortod717: @FoxNews is a believe on @chuck5o  Made beck for the @cred nomi


-------------------- Tweet Gerado Pelo Modelo Nesta Iteração ---------------------

against it!"

Just released that international gang I want a big of Bernie! #Trump2016 https://t.co/lILqNOmVtq #Trump2016 #Trump2016 https://t.co/tRCm6k1ePP https://t.co/3VfJzcuR0"

Trad thishally in Leadership on Oheo is the unagering, TPP with Markare Novennt Politics 43 .12 Stien is The all of Trump wonderd U.S. overnoke at 8pmE that 16pr unlike that many Waynes was stopping with a Tuesday  #MakeAmericaGreatAgain #Trump2016 https://t.co/KBDMVY

--------------------------------------------------

Iteração 38
Epoch 1/1

-------------------- Tweet Gerado Pelo Modelo Nesta Iteração ---------------------

Belt" was created by politicians like the Clintons win the supports. Its the people of their 9 peace about.

Dishonelies speaks the lampine twath first @mitd_Restigal eeficsion, list people. Thank you!

"@ChanFoler: Hinner @realDonaldTrump You believe a great phace isupature reporting @realDonaldTrump h

Now let's try out the model!

Using the first sentence of this <a href='https://twitter.com/realDonaldTrump/status/890764622852173826'>tweet</a> as the seed, let's try to continue Trump's sentence and see what interesting things our model comes up with:

In [16]:
sample_tweet_start = 'Go Republican Senators, Go!'
_ = generate(model, 200, sample_tweet_start)

Go Republican Senators, Go! You are life this ganges!

"@MagiceJlzcen44: When and have phoning that he sells TeleJusianakal.  Debates2

Great job! #Trump2016
https://t.co/MZIqkdD3hN"

Credate whose the floridative in CLUSH Supp