<h1>RNN - LSTM x GRU</h1>
<p>The goal of this notebook is to present a simple example of applying recurrent neural networks (RNNs) to text analysis (NLP). Here we use deep learning to determine, from a comment written about a given movie, whether that comment is positive or not. The data come from IMDB and are available at the link below:</p>
<p><a href="https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences">IMDB review data</a></p>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import keras

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense, Activation, SpatialDropout1D, Embedding, LSTM, GRU
from keras.callbacks import EarlyStopping

Using TensorFlow backend.


In [2]:
df = pd.read_table('imdb.txt', header=0)
df.head()

Unnamed: 0,msg,positive
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1


In [3]:
df.msg[5]

"The rest of the movie lacks art, charm, meaning... If it's about emptiness, it works I guess because it's empty.  "

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 2 columns):
msg         748 non-null object
positive    748 non-null int64
dtypes: int64(1), object(1)
memory usage: 11.8+ KB


In [5]:
df['positive'].value_counts()

1    386
0    362
Name: positive, dtype: int64

In [6]:
numero_maximo_palavras = 1000
sequencia_maxima_palavras = 100
dimensao = 100

In [7]:
tokenizer = Tokenizer(num_words=numero_maximo_palavras, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(df['msg'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 3236 unique tokens.
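<p>To make the tokenization step concrete, here is a minimal pure-Python sketch of the idea: lowercase the text, replace the filtered punctuation with spaces, rank words by frequency, and keep only the words whose rank is below <code>num_words</code>. It is an approximation for illustration, not Keras's implementation; the helper names (<code>tokenize</code>, <code>build_word_index</code>, <code>to_sequence</code>) and the two sample sentences are hypothetical.</p>

```python
from collections import Counter

# Same punctuation set the notebook passes as `filters`.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'
TRANSLATION = str.maketrans({ch: " " for ch in FILTERS})


def tokenize(text):
    """Lowercase, turn filtered punctuation into spaces, split on whitespace."""
    return text.lower().translate(TRANSLATION).split()


def build_word_index(texts):
    """Rank words by frequency: 1 is the most frequent, 0 stays free for padding."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    ranked = counts.most_common()  # ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_sequence(text, word_index, num_words):
    """Encode a text, dropping unknown words and words ranked >= num_words."""
    ranks = (word_index.get(word) for word in tokenize(text))
    return [r for r in ranks if r is not None and r < num_words]


# Hypothetical mini-corpus, not taken from imdb.txt.
texts = ["A very, very slow movie.", "Very good movie!"]
word_index = build_word_index(texts)
print(word_index)
print(to_sequence("Very slow movie", word_index, num_words=3))
```

<p>This is why the notebook reports 3236 unique tokens while the model only ever sees indices below <code>numero_maximo_palavras</code>: the full vocabulary is counted, but rarer words are dropped when the texts are encoded.</p>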


In [8]:
X = tokenizer.texts_to_sequences(df['msg'].values)
X = pad_sequences(X, maxlen=sequencia_maxima_palavras)
Y = df['positive'].values
print('Shape of data tensor:', X.shape)

Shape of data tensor: (748, 100)
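<p>The padding step turns variable-length index lists into a fixed-width matrix. The sketch below mimics the default behaviour of <code>pad_sequences</code> (zeros added on the left, long sequences truncated from the left) using NumPy only. The function name <code>pad_pre</code> and the sample sequences are hypothetical.</p>

```python
import numpy as np


def pad_pre(sequences, maxlen):
    """Left-pad with zeros and keep only the last `maxlen` items of each sequence."""
    padded = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all zeros
        trimmed = seq[-maxlen:]
        padded[row, -len(trimmed):] = trimmed
    return padded


# Hypothetical sequences: one shorter and one longer than maxlen.
example = pad_pre([[5, 2], [1, 2, 3, 4, 5]], maxlen=4)
print(example)
```

<p>With <code>maxlen=100</code> and short review sentences, most rows of <code>X</code> are mostly zeros on the left, followed by the word indices.</p>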


In [None]:
ts = 0.20
rs = 100
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=ts, random_state=rs)
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)

In [None]:
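epochs = 20           # maximum number of training epochs
batch_size = 10       # samples per gradient update
patience = 3          # epochs without improvement before early stopping
vs = 0.1              # fraction of the training data held out for validation
md = 0.0001           # minimum change in the monitored value that counts as improvement
monitor = 'val_loss'  # quantity watched by EarlyStopping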
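<p>Since the split cell has no recorded output, the following arithmetic sketch estimates how many samples end up in each subset. It assumes that scikit-learn rounds the test size up and that Keras takes the validation samples from the end of the training data after truncating the training share; treat the numbers as an estimate under those assumptions.</p>

```python
import math

n_samples = 748          # rows reported by df.info()
test_size = 0.20         # ts
validation_split = 0.1   # vs
batch_size = 10

# Assumption: the test set size is rounded up, the rest is training data.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

# Assumption: the fitting share is truncated, the remainder is validation.
n_fit = int(n_train * (1 - validation_split))
n_val = n_train - n_fit

steps_per_epoch = math.ceil(n_fit / batch_size)
print(n_train, n_test, n_fit, n_val, steps_per_epoch)
```

<p>Under these assumptions only a few hundred sentences are used for fitting, which is one reason the validation loss can be noisy on this dataset.</p>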

In [None]:
modelLSTM = Sequential()
# Map each word index to a trainable dense vector of size `dimensao`.
modelLSTM.add(Embedding(numero_maximo_palavras, dimensao, input_length=X.shape[1]))
# Drop entire embedding channels instead of individual values.
modelLSTM.add(SpatialDropout1D(0.1))
modelLSTM.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
# One sigmoid unit outputs P(positive), the range binary_crossentropy expects.
modelLSTM.add(Dense(1, activation='sigmoid'))
modelLSTM.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
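<p>Binary cross-entropy treats the network output as a probability, so the last layer should produce values strictly between 0 and 1. The NumPy sketch below contrasts a sigmoid output with a ReLU output on the same pre-activations. The logits and labels are made up for illustration, and the loss function is a simplified stand-in for the Keras one (it clips predictions to avoid <code>log(0)</code>).</p>

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def relu(z):
    return np.maximum(z, 0.0)


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped into (0, 1) first."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(losses))


# Hypothetical pre-activations of the last layer and their true labels.
logits = np.array([-2.0, 0.5, 3.0])
labels = np.array([0.0, 1.0, 1.0])

print("sigmoid outputs:", sigmoid(logits))  # always inside (0, 1)
print("relu outputs:   ", relu(logits))     # can be exactly 0 or exceed 1
print("loss with sigmoid:", binary_crossentropy(labels, sigmoid(logits)))
```

<p>A ReLU output can be exactly 0 for every negative pre-activation, where its gradient is also 0, and it can exceed 1, where it no longer reads as a probability. A single sigmoid unit avoids both problems.</p>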

In [None]:
early_stopping = EarlyStopping(monitor=monitor, patience=patience, min_delta=md)
history = modelLSTM.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size, validation_split=vs, callbacks=[early_stopping])
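<p>The early-stopping rule used in the call above can be sketched in a few lines: training stops once the monitored loss has failed to improve by at least <code>min_delta</code> for <code>patience</code> consecutive epochs. This is a simplified illustration of the rule, not the Keras callback itself. The function name <code>early_stopping_epoch</code> and the loss values are hypothetical.</p>

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0001):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss  # real improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)  # patience never ran out


# Hypothetical validation losses: the best value is reached at epoch 3.
stop_epoch = early_stopping_epoch([0.69, 0.60, 0.55, 0.56, 0.57, 0.58, 0.50])
print(stop_epoch)
```

<p>Note that in this made-up series the loss would have improved again at epoch 7, which the rule never sees. A larger <code>patience</code> trades longer training for a lower risk of stopping too early.</p>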

In [None]:
plt.title('Loss')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.legend()
plt.show()
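<p>The notebook creates <code>X_test</code> and <code>Y_test</code> but never scores them. As a rough sketch of how predicted probabilities could be turned into an accuracy figure, the NumPy helper below thresholds them at 0.5. The name <code>binary_accuracy</code> and the probabilities are hypothetical; in practice the probabilities would come from something like <code>modelLSTM.predict(X_test)</code>.</p>

```python
import numpy as np


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true)))


# Hypothetical probabilities and labels, not model output.
probs = [0.9, 0.2, 0.6, 0.4]
labels = [1, 0, 0, 1]
accuracy = binary_accuracy(labels, probs)
print(accuracy)
```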

In [None]:
# modelGRU = Sequential()
# modelGRU.add(Embedding(numero_maximo_palavras, dimensao, input_length=X.shape[1]))
# modelGRU.add(SpatialDropout1D(0.2))
# modelGRU.add(GRU(10, dropout=0.2, recurrent_dropout=0.2))
# modelGRU.add(Dense(1, activation='sigmoid'))
# modelGRU.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# epochs = 10
# batch_size = 15

# history = modelGRU.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,validation_split=0.1,callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
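<p>One practical difference between the two recurrent layers is their size: an LSTM has four gate-like weight blocks, a GRU has three. The sketch below counts trainable parameters with the textbook formulas. The function names are hypothetical, and the <code>reset_after</code> flag models the GRU variant with a second bias vector, which is an assumption about the installed Keras version.</p>

```python
def lstm_params(input_dim, units):
    """Four blocks, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def gru_params(input_dim, units, reset_after=False):
    """Three blocks; the reset_after variant carries two bias vectors per block."""
    biases = 2 * units if reset_after else units
    return 3 * (units * (input_dim + units) + biases)


embedding_dim = 100  # `dimensao` in the notebook

print("LSTM(100):", lstm_params(embedding_dim, 100))
print("GRU(100): ", gru_params(embedding_dim, 100))
print("GRU(10):  ", gru_params(embedding_dim, 10))
print("GRU(10), reset_after:", gru_params(embedding_dim, 10, reset_after=True))
```

<p>Because the commented-out GRU cell uses 10 units while the LSTM uses 100, any comparison between the two runs mixes the effect of the cell type with the effect of model size. Matching the unit count gives a fairer LSTM versus GRU comparison.</p>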

In [None]:
# plt.title('Loss')
# plt.plot(history.history['loss'], label='train')
# plt.plot(history.history['val_loss'], label='validation')
# plt.legend()
# plt.show()