Now that we know how to preprocess text and build a classifier, let's explore architectures that process text sequences more effectively.
[GRU](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU)
[LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM)
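
To make "processing the sequence" concrete, the sketch below runs one common formulation of a GRU update over a short sequence of embedding vectors in plain NumPy. It is only an illustration: the weights are random, the names (`gru_step`, `init_gate`, `encode`) are hypothetical, and gate conventions vary between references, so the Keras `GRU` layer may differ in detail.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU update for a single time step (one common formulation)."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(x_t @ Wz + h_prev @ Uz + bz)              # update gate
    r = sigmoid(x_t @ Wr + h_prev @ Ur + br)              # reset gate
    h_cand = np.tanh(x_t @ Wh + (r * h_prev) @ Uh + bh)   # candidate state
    return z * h_prev + (1.0 - z) * h_cand                # blend old and candidate state

rng = np.random.default_rng(0)
embedding_dim, units, timesteps = 16, 8, 5   # illustrative sizes

def init_gate():
    # Random input weights, recurrent weights, and a zero bias for one gate
    return (rng.normal(scale=0.1, size=(embedding_dim, units)),
            rng.normal(scale=0.1, size=(units, units)),
            np.zeros(units))

params = init_gate() + init_gate() + init_gate()

def encode(sequence):
    """Run the GRU over the whole sequence and return the final hidden state."""
    h = np.zeros(units)
    for x_t in sequence:
        h = gru_step(x_t, h, params)
    return h

# Stand-in for the embedded tokens of one sentence: (timesteps, embedding_dim)
sequence = rng.normal(size=(timesteps, embedding_dim))
h_final = encode(sequence)
print(h_final.shape)
```

The final state is a fixed-size summary that is updated one token at a time, so feeding the same tokens in reverse order generally produces a different summary.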

As a first step, let's revisit the sentiment classification task we explored in the previous lesson.

In [3]:
import tensorflow as tf
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np


filepath_dict = {'yelp':   'data/sentiment/yelp_labelled.txt',
                 'amazon': 'data/sentiment/amazon_cells_labelled.txt',
                 'imdb':   'data/sentiment/imdb_labelled.txt'}

df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source  # Add another column filled with the source name
    df_list.append(df)

df = pd.concat(df_list)
df = df.sample(frac=1).reset_index(drop=True)

perc_train = 0.85
len_train = int(len(df)*perc_train)

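# Drop the trailing 'source' column; .copy() avoids pandas' SettingWithCopyWarning later on
dataset_train = df.iloc[0:len_train, :-1].copy()
dataset_test = df.iloc[len_train:, :-1].copy()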

stopwords = [ "a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ]

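stopword_set = set(stopwords)

def remove_stopwords(text):
    # Lowercase before comparing so that "The" is filtered like "the"
    return ' '.join(word for word in text.split() if word.lower() not in stopword_set)

# Apply the same filtering to both splits so train and test are preprocessed identically
dataset_train = dataset_train.assign(sentence=dataset_train['sentence'].apply(remove_stopwords))
dataset_test = dataset_test.assign(sentence=dataset_test['sentence'].apply(remove_stopwords))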


max_vocab_size = 500   # Maximum vocabulary size
oov_token = '<OOV>'   # Token used for words not found in the vocabulary

tokenizer = Tokenizer(num_words=max_vocab_size, lower=True, oov_token = oov_token)
tokenizer.fit_on_texts(dataset_train.loc[:, 'sentence'])

dataset_train_sequences = tokenizer.texts_to_sequences(dataset_train.loc[:,'sentence'])
dataset_test_sequences = tokenizer.texts_to_sequences(dataset_test.loc[:,'sentence'])


maxlen = 500  # Maximum sentence length, in tokens
padding_type = 'post'
truncating_type = 'post'

dataset_train_sequences = pad_sequences(dataset_train_sequences, maxlen = maxlen, padding=padding_type, truncating=truncating_type)
dataset_test_sequences = pad_sequences(dataset_test_sequences, maxlen = maxlen, padding=padding_type, truncating=truncating_type)

train_seqs = dataset_train_sequences
train_labels = np.array(dataset_train.loc[:, 'label'])
test_seqs = dataset_test_sequences
test_labels = np.array(dataset_test.loc[:, 'label'])
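
With `maxlen = 500`, most of each padded row is likely to be zeros, since review sentences are usually far shorter than 500 tokens. A quick way to check is to look at the sequence lengths before padding. The sketch below does this on made-up toy sequences; the `toy_sequences` values and the 95th-percentile rule of thumb are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

# Hypothetical tokenized sentences (lists of word indices), standing in for
# the output of tokenizer.texts_to_sequences(...) before padding
toy_sequences = [[5, 12, 7], [3, 9], [8, 1, 4, 6, 2, 11], [10]]
maxlen = 500  # same value as in the notebook

lengths = np.array([len(seq) for seq in toy_sequences])

# A length that covers 95% of the sentences, rounded up
suggested_maxlen = int(np.ceil(np.percentile(lengths, 95)))

# Share of positions that would be padding after pad_sequences(..., maxlen=maxlen)
padding_fraction = 1.0 - lengths.sum() / (len(toy_sequences) * maxlen)

print(suggested_maxlen, round(padding_fraction, 3))
```

In the notebook, the same computation would have to run on the output of `texts_to_sequences` before it is overwritten by `pad_sequences`.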


In [8]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Flatten

embedding_dim = 16

model = Sequential()
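# With num_words=max_vocab_size the tokenizer never emits an index >= max_vocab_size,
# so the embedding table does not need one row per word in word_index
model.add(Embedding(max_vocab_size, embedding_dim, input_length=maxlen))
model.add(Flatten())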
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
optimizer = 'adam'
loss = 'binary_crossentropy'
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
num_epochs = 20
model.fit(train_seqs,train_labels, epochs = num_epochs, validation_data=(test_seqs,test_labels), verbose=2 )

Train on 2335 samples, validate on 413 samples
Epoch 1/20
2335/2335 - 1s - loss: 0.6954 - accuracy: 0.4921 - val_loss: 0.6939 - val_accuracy: 0.4528
Epoch 2/20
2335/2335 - 1s - loss: 0.6588 - accuracy: 0.6347 - val_loss: 0.6059 - val_accuracy: 0.7458
Epoch 3/20
2335/2335 - 1s - loss: 0.4536 - accuracy: 0.8069 - val_loss: 0.4830 - val_accuracy: 0.7772
Epoch 4/20
2335/2335 - 1s - loss: 0.3312 - accuracy: 0.8548 - val_loss: 0.5063 - val_accuracy: 0.7627
Epoch 5/20
2335/2335 - 1s - loss: 0.2785 - accuracy: 0.8895 - val_loss: 0.5177 - val_accuracy: 0.7676
Epoch 6/20
2335/2335 - 1s - loss: 0.2265 - accuracy: 0.9088 - val_loss: 0.5747 - val_accuracy: 0.7530
Epoch 7/20
2335/2335 - 1s - loss: 0.1857 - accuracy: 0.9306 - val_loss: 0.5989 - val_accuracy: 0.7530
Epoch 8/20
2335/2335 - 1s - loss: 0.1551 - accuracy: 0.9430 - val_loss: 0.6609 - val_accuracy: 0.7433
Epoch 9/20
2335/2335 - 1s - loss: 0.1220 - accuracy: 0.9559 - val_loss: 0.8222 - val_accuracy: 0.7240
Epoch 10/20
2335/2335 - 1s - loss: 

<tensorflow.python.keras.callbacks.History at 0x7f2f646d9f10>
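
The log above suggests overfitting: training loss keeps falling while validation loss bottoms out early and then climbs. The snippet below simply copies the loss values of the nine complete epochs from the log and locates the epoch with the lowest validation loss. It is a reading aid, not part of the original notebook.

```python
import numpy as np

# Loss values copied from the nine complete epochs of the training log
train_loss = np.array([0.6954, 0.6588, 0.4536, 0.3312, 0.2785,
                       0.2265, 0.1857, 0.1551, 0.1220])
val_loss = np.array([0.6939, 0.6059, 0.4830, 0.5063, 0.5177,
                     0.5747, 0.5989, 0.6609, 0.8222])

best_epoch = int(np.argmin(val_loss)) + 1   # epochs are 1-indexed in the log
gap = val_loss - train_loss                 # grows as the model overfits

print("best epoch by val_loss:", best_epoch)
print("val/train gap per epoch:", np.round(gap, 4))
```

One way to stop near that point automatically is to pass `tf.keras.callbacks.EarlyStopping(monitor='val_loss', restore_best_weights=True)` in the `callbacks` argument of `model.fit`.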

Now let's evaluate whether using RNNs makes a difference for this task.

Build a new architecture with GRUs and/or LSTMs. Does this affect the model's accuracy?



In [16]:
from tensorflow.keras.layers import LSTM

model = Sequential()
# Token indices never reach max_vocab_size, so that is a sufficient input_dim
model.add(Embedding(max_vocab_size, embedding_dim, input_length=maxlen))
model.add(LSTM(32))
# Dense expects an integer number of units, hence floor division
model.add(Dense(len(tokenizer.word_index) // 2, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.01)))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
num_epochs = 20
model.fit(train_seqs,train_labels, epochs = num_epochs, validation_data=(test_seqs,test_labels), verbose=2 )

Train on 2335 samples, validate on 413 samples
Epoch 1/20
2335/2335 - 17s - loss: 0.8148 - accuracy: 0.4809 - val_loss: 0.6935 - val_accuracy: 0.4528
Epoch 2/20
2335/2335 - 16s - loss: 0.6932 - accuracy: 0.5032 - val_loss: 0.6934 - val_accuracy: 0.4528
Epoch 3/20
2335/2335 - 17s - loss: 0.6932 - accuracy: 0.5032 - val_loss: 0.6934 - val_accuracy: 0.4528
Epoch 4/20
2335/2335 - 19s - loss: 0.6932 - accuracy: 0.5032 - val_loss: 0.6936 - val_accuracy: 0.4528
Epoch 5/20


KeyboardInterrupt: 
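
In the run above the loss sits at about 0.6932, which is ln 2, the value a binary classifier gets when it outputs 0.5 for every example. One plausible cause (an assumption, not verified against this dataset) is the padding: sequences are post-padded to 500 tokens, so the LSTM reads a long run of identical padding steps after the real words before emitting its final state. Keras can skip those steps if the `Embedding` layer is created with `mask_zero=True`. The sketch below, with hypothetical toy inputs and a helper named `encode` that is not in the notebook, checks that a masked LSTM returns the same final state for a sentence with and without trailing padding.

```python
import numpy as np
from tensorflow.keras.layers import Embedding, LSTM

vocab_size = 500      # mirrors max_vocab_size in the notebook
embedding_dim = 16

# mask_zero=True marks index 0 (the padding value) as "to be skipped"
embedding = Embedding(vocab_size, embedding_dim, mask_zero=True)
lstm = LSTM(32)

# Toy token ids: the same sentence without and with trailing padding
short = np.array([[4, 8, 15]])
padded = np.array([[4, 8, 15, 0, 0, 0, 0]])

def encode(batch):
    """Embed a batch and return the LSTM's final state, honoring the padding mask."""
    x = embedding(batch)
    mask = embedding.compute_mask(batch)   # True where the token id is not 0
    return np.asarray(lstm(x, mask=mask))

state_short = encode(short)
state_padded = encode(padded)

# With masking, the padded steps leave the state untouched
same = np.allclose(state_short, state_padded, atol=1e-5)
print(same)
```

In the model above this would amount to passing `mask_zero=True` to `Embedding`; a smaller `maxlen` or `padding='pre'` are other options worth trying. Whether any of them restores learning here has to be checked by rerunning the cell.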