# Recurrent Networks


## Sentiment analysis (binary classification)

Example based on https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/

Original dataset at http://ai.stanford.edu/~amaas/data/sentiment/

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Find out NVIDIA GPU model (randomly assigned by Colab)
# Deep learning performance: K80 < P4 < T4 < P100
# https://ai-benchmark.com/ranking_deeplearning.html
gpu = !nvidia-smi -L
print('Not using GPU' if 'failed' in gpu[0] else gpu[0].split(' (')[0])

# For more GPU information, run:
#!nvidia-smi

GPU 0: Tesla K80


In [None]:
# load the dataset but only keep the top n words
# (maximum is 88588)
num_words = 5000
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=num_words)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [None]:
# dataset is an array of lists of different lengths
print(x_train.shape, x_test.shape)
x_train[:3]

(25000,) (25000,)


array([list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 2, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 2, 19, 178, 32]),
       list([1, 194, 1153, 194, 2, 78, 228, 5, 6, 1463, 4369,

In [None]:
# get vocabulary (see imdb.load_data() docstring for info on special characters)
words = {i+3: w for w, i in imdb.get_word_index().items()}
words[0] = '_'          # padding
words[1] = '<START>'    # marks the start of a sequence
words[2] = '<REMOVED>'  # replaces words that were cut out because of the `num_words` or `skip_top` limit
words[3] = '<NOT_USED>'

def decode(x):
    return ' '.join([words[i] for i in x])

for i in sorted(words)[:20]:
    print(str(i),words[i])

0 _
1 <START>
2 <REMOVED>
3 <NOT_USED>
4 the
5 and
6 a
7 of
8 to
9 is
10 br
11 in
12 it
13 i
14 this
15 that
16 was
17 as
18 for
19 with
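
The `i+3` offset used when building `words` comes from the way `imdb.load_data()` stores the reviews: by default it reserves the lowest indices for special tokens and shifts every word rank by `index_from=3`. The sketch below mimics that convention on a toy vocabulary. The names `toy_word_index`, `toy_words` and `encode` are hypothetical and exist only for this illustration; this is not the Keras implementation.

```python
# Toy illustration of the IMDB index convention, assuming the load_data()
# defaults start_char=1, oov_char=2, index_from=3.
toy_word_index = {'the': 1, 'film': 2, 'was': 3, 'brilliant': 4}  # word -> frequency rank

def encode(text, word_index, num_words=None, index_from=3):
    """Encode a lowercase, space-separated text into shifted word indices."""
    ids = [1]  # <START>
    for w in text.split():
        idx = word_index[w] + index_from
        # shifted indices at or above num_words become the out-of-vocabulary index 2
        ids.append(idx if num_words is None or idx < num_words else 2)
    return ids

# Inverse mapping, built the same way as `words` in the cell above
toy_words = {i + 3: w for w, i in toy_word_index.items()}
toy_words.update({0: '_', 1: '<START>', 2: '<REMOVED>'})

encoded = encode('the film was brilliant', toy_word_index, num_words=6)
decoded = ' '.join(toy_words[i] for i in encoded)
print(encoded)
print(decoded)
```

With `num_words=6`, only shifted indices below 6 survive, so the two rarest toy words decode as `<REMOVED>`.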


In [None]:
i = 0
print('Sample (word indices):',x_train[i],'\n')
print('Sample (text):',decode(x_train[i]),'\n')
print('Label:',y_train[i])

Sample (word indices): [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 2, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 2, 19, 178, 32] 

Sample (text): <START> this film was just brillian

In [None]:
i = 2
print('Sample (word indices):',x_train[i],'\n')
print('Sample (text):',decode(x_train[i]),'\n')
print('Label:',y_train[i])

Sample (word indices): [1, 14, 47, 8, 30, 31, 7, 4, 249, 108, 7, 4, 2, 54, 61, 369, 13, 71, 149, 14, 22, 112, 4, 2401, 311, 12, 16, 3711, 33, 75, 43, 1829, 296, 4, 86, 320, 35, 534, 19, 263, 4821, 1301, 4, 1873, 33, 89, 78, 12, 66, 16, 4, 360, 7, 4, 58, 316, 334, 11, 4, 1716, 43, 645, 662, 8, 257, 85, 1200, 42, 1228, 2578, 83, 68, 3912, 15, 36, 165, 1539, 278, 36, 69, 2, 780, 8, 106, 14, 2, 1338, 18, 6, 22, 12, 215, 28, 610, 40, 6, 87, 326, 23, 2300, 21, 23, 22, 12, 272, 40, 57, 31, 11, 4, 22, 47, 6, 2307, 51, 9, 170, 23, 595, 116, 595, 1352, 13, 191, 79, 638, 89, 2, 14, 9, 8, 106, 607, 624, 35, 534, 6, 227, 7, 129, 113] 

Sample (text): <START> this has to be one of the worst films of the <REMOVED> when my friends i were watching this film being the target audience it was aimed at we just sat watched the first half an hour with our jaws touching the floor at how bad it really was the rest of the time everyone else in the theatre just started talking to each other leaving or generally 

In [None]:
# truncate and pad input sequences
input_len = 500
x_train = pad_sequences(x_train, maxlen=input_len)
x_test = pad_sequences(x_test, maxlen=input_len)
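
By default `pad_sequences` pads and truncates at the start of each sequence (`padding='pre'`, `truncating='pre'`), which is why the padded samples printed in the next cells begin with zeros. A minimal NumPy sketch of that default behaviour follows; `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Illustrative re-implementation of pad_sequences with its default 'pre' settings."""
    out = np.full((len(sequences), maxlen), value, dtype='int32')
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # nothing to copy; the row stays fully padded
        trunc = seq[-maxlen:]            # 'pre' truncation keeps the last maxlen items
        out[row, -len(trunc):] = trunc   # 'pre' padding right-aligns the sequence
    return out

padded = pad_pre([[1, 14, 22], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```

The short sequence is left-padded with zeros, while the long one loses its first two items.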

In [None]:
# now each sample is an array of the same length and x is a 2D array
print(x_train.shape, x_test.shape)

(25000, 500) (25000, 500)


In [None]:
i = 0
print('Sample (word indices):')
print(x_train[i],'\n')
print('Sample (text):',decode(x_train[i]),'\n')
print('Label:',y_train[i])

Sample (word indices):
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    

In [None]:
i = 2
print('Sample (word indices):')
print(x_train[i],'\n')
print('Sample (text):',decode(x_train[i]),'\n')
print('Label:',y_train[i])

Sample (word indices):
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    

## LSTM

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.optimizers import Adam

In [None]:
# create the model
def get_model():
  model = Sequential(
      [
       Embedding(num_words, 32, input_length=input_len),
       LSTM(100),
       Dense(1, activation='sigmoid')
      ]
  )
  return model

get_model().summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 500, 32)           160000    
                                                                 
 lstm (LSTM)                 (None, 100)               53200     
                                                                 
 dense (Dense)               (None, 1)                 101       
                                                                 
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
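
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for each layer type; the helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # one dim-sized vector per vocabulary index
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates (input, forget, cell candidate, output), each with
    # input weights, recurrent weights and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

counts = [embedding_params(5000, 32), lstm_params(32, 100), dense_params(100, 1)]
print(counts, sum(counts))
```

The three values should match the `Param #` column above, and their sum the reported total.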


In [None]:
# train for a single epoch
model = get_model()
model.compile(loss='binary_crossentropy', optimizer=Adam(0.001), metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=64, epochs=1, validation_data=(x_test, y_test));



In [None]:
# continue training for another epoch
model.fit(x_train, y_train, batch_size=64, epochs=1, validation_data=(x_test, y_test));



In [None]:
# continue training for another epoch
model.fit(x_train, y_train, batch_size=64, epochs=1, validation_data=(x_test, y_test));



This dataset is large enough that a few epochs are already sufficient for good performance.


For a more precise analysis, it would be useful to measure performance on a validation set several times within a single epoch.
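
One possible way to do this, sketched under assumptions, is to split the training set into contiguous chunks and call `fit` once per chunk, so that Keras evaluates on the validation data after each one. `chunk_bounds` and `fit_in_chunks` are hypothetical helpers. The `'val_accuracy'` history key is what TensorFlow 2.x is expected to produce when the model is compiled with `metrics=['accuracy']`.

```python
def chunk_bounds(n_samples, n_chunks):
    """Split range(n_samples) into n_chunks contiguous (start, stop) pairs."""
    edges = [round(k * n_samples / n_chunks) for k in range(n_chunks + 1)]
    return list(zip(edges[:-1], edges[1:]))

def fit_in_chunks(model, x, y, x_val, y_val, n_chunks=5, batch_size=64):
    """Make one pass over (x, y), evaluating on the validation data after each chunk."""
    val_acc = []
    for start, stop in chunk_bounds(len(x), n_chunks):
        history = model.fit(x[start:stop], y[start:stop], batch_size=batch_size,
                            epochs=1, validation_data=(x_val, y_val), verbose=0)
        # assumed key name; it depends on the metrics passed to compile()
        val_acc.append(history.history['val_accuracy'][-1])
    return val_acc

print(chunk_bounds(25000, 5))
```

With a compiled model, `fit_in_chunks(model, x_train, y_train, x_test, y_test)` would return five validation accuracies for one epoch. Each evaluation runs over the full validation data, so a smaller validation subset may be preferable in practice.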

## 1D CNN

Here a 1D convolutional layer followed by max pooling is used to shorten the sequences fed to the LSTM (from 500 to 250 timesteps), which speeds up training.

In [None]:
from tensorflow.keras.layers import Conv1D, MaxPooling1D

In [None]:
# create the model
def get_model():
  model = Sequential(
      [
       Embedding(num_words, 32, input_length=input_len),
       Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'),
       MaxPooling1D(pool_size=2),
       LSTM(100),
       Dense(1, activation='sigmoid')
      ]
  )
  return model

get_model().summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
conv1d (Conv1D)              (None, 500, 32)           3104      
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 250, 32)           0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 101       
Total params: 216,405
Trainable params: 216,405
Non-trainable params: 0
_________________________________________________________________
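
The convolution's parameter count and the halved sequence length in the summary follow from the layer settings. The sketch below uses the usual formulas for `padding='same'` convolutions and default max pooling; the helper names are hypothetical.

```python
import math

def conv1d_params(in_channels, filters, kernel_size):
    # one kernel of shape (kernel_size, in_channels) per filter, plus one bias per filter
    return kernel_size * in_channels * filters + filters

def conv1d_same_length(steps, stride=1):
    # padding='same' yields ceil(steps / stride) timesteps
    return math.ceil(steps / stride)

def maxpool1d_length(steps, pool_size):
    # MaxPooling1D defaults: strides=pool_size, padding='valid'
    return (steps - pool_size) // pool_size + 1

conv_params = conv1d_params(in_channels=32, filters=32, kernel_size=3)
steps = maxpool1d_length(conv1d_same_length(500), pool_size=2)
print(conv_params, steps)
```

The convolution keeps all 500 timesteps, so the reduction to 250 comes entirely from the pooling layer. The LSTM then unrolls over half as many steps with the same number of parameters.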


In [None]:
# train for a single epoch
model = get_model()
model.compile(loss='binary_crossentropy', optimizer=Adam(0.001), metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=64, epochs=1, validation_data=(x_test, y_test));



## Other variations

* Replace `LSTM` with `GRU`
* Use a bidirectional layer, for example `Bidirectional(GRU(100))`
* Return the outputs of all timesteps, not just the last one, and apply global pooling over the timesteps:
```python
GRU(100, return_sequences=True),
GlobalMaxPooling1D(),  # or GlobalAveragePooling1D()
```
* Add more layers
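
For the pooling variation in the list above, this NumPy sketch shows what each option computes on a recurrent layer's per-timestep outputs. The `outputs` array is a made-up stand-in with arbitrary values, and no masking is assumed.

```python
import numpy as np

# Stand-in for the output of a recurrent layer with return_sequences=True,
# with shape (batch, timesteps, units).
outputs = np.array([[[0.1, -0.5],
                     [0.9,  0.2],
                     [0.3,  0.4]]])

last_step = outputs[:, -1, :]      # what the layer returns with return_sequences=False
global_max = outputs.max(axis=1)   # equivalent of GlobalMaxPooling1D
global_avg = outputs.mean(axis=1)  # equivalent of GlobalAveragePooling1D (without masking)

print(last_step.shape, global_max.shape, global_avg.shape)
print(global_max)
```

All three options give one vector per sample, so the final `Dense` layer is unchanged. Pooling lets every timestep contribute directly to the prediction.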




## Using pre-trained word embeddings



Follow the tutorial at https://keras.io/examples/nlp/pretrained_word_embeddings/

Besides GloVe, which is used in the tutorial, other popular word embeddings include Word2Vec and FastText.
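
Whichever embedding is chosen, the core step is to build a matrix whose row `i` holds the pre-trained vector of the word with index `i`. The sketch below illustrates this with made-up 2-dimensional vectors. `build_embedding_matrix`, `toy_vectors` and `toy_index_to_word` are hypothetical names, and real vectors would be loaded from a file as the linked tutorial describes.

```python
import numpy as np

def build_embedding_matrix(index_to_word, vectors, dim):
    """Row i holds the pre-trained vector of index_to_word[i]; unknown words stay at zero."""
    matrix = np.zeros((max(index_to_word) + 1, dim))
    hits = 0
    for i, word in index_to_word.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[i] = vec
            hits += 1
    return matrix, hits

# Made-up vectors standing in for GloVe, Word2Vec or FastText entries
toy_vectors = {'the': np.array([0.1, 0.2]), 'film': np.array([0.3, 0.4])}
toy_index_to_word = {0: '_', 1: '<START>', 2: '<REMOVED>', 4: 'the', 5: 'film', 6: 'brilliant'}

matrix, hits = build_embedding_matrix(toy_index_to_word, toy_vectors, dim=2)
print(matrix.shape, hits)
```

In this notebook, the `words` dictionary restricted to indices below `num_words` could play the role of `toy_index_to_word`. The resulting matrix would then initialise the `Embedding` layer.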

Libraries that make it easier to work with embeddings include [spaCy](https://spacy.io) and [Gensim](https://radimrehurek.com/gensim).

More recent models use context-dependent embeddings, such as the well-known [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)).

Other interesting tutorials can be found at https://keras.io/examples/nlp/