# Sentiment Classification on IMDb with Recurrent Neural Networks

Read the instructions and solve the problem below.

## About the IMDB movie reviews sentiment classification dataset
Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that, for instance, the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".

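As a convention, "0" does not stand for a specific word. With the default `imdb.load_data` settings used below, 0 is left free for padding, 1 marks the start of a review (`start_char`), out-of-vocabulary words are encoded as 2 (`oov_char`), and the frequency ranks of actual words are shifted by 3 (`index_from`).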
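
The frequency-ranked encoding and the "top N, minus the top K" filter described above can be illustrated on a toy corpus. This is a minimal sketch in plain Python, not the Keras loader: `build_index` and `encode` are hypothetical helpers, ranks start at 1 with no offset, and the placeholder id `oov=0` is chosen only for illustration (the Keras loader makes it configurable through `oov_char`).

```python
from collections import Counter


def build_index(texts):
    """Rank words by frequency: the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending count; break ties alphabetically so the result is deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}


def encode(text, word_index, num_words=None, skip_top=0, oov=0):
    """Replace each word by its rank; ranks outside (skip_top, num_words] become `oov`."""
    encoded = []
    for word in text.split():
        rank = word_index.get(word)
        in_vocab = rank is not None and rank > skip_top
        if num_words is not None:
            in_vocab = in_vocab and rank <= num_words
        encoded.append(rank if in_vocab else oov)
    return encoded


corpus = ["the movie was great", "the movie was bad", "the plot was thin"]
word_index = build_index(corpus)

full = encode("the movie was great", word_index)
filtered = encode("the movie was great", word_index, num_words=3, skip_top=1)
print(full)      # [1, 3, 2, 5]
print(filtered)  # [0, 3, 2, 0]
```

In the filtered version, "the" (rank 1) is dropped as one of the most common words and "great" (rank 5) falls outside the top 3, so both collapse to the placeholder.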

In [32]:
from __future__ import print_function

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, Activation, TimeDistributed
from keras.datasets import imdb

In [2]:
max_features = 20000
# cut texts after this number of words (among top max_features most common words)
maxlen = 80
batch_size = 32

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

x_train = x_train[:1500]
y_train = y_train[:1500]
x_test = x_test[:10000]
y_test = y_test[:10000]

print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)


Loading data...
1500 train sequences
10000 test sequences
Pad sequences (samples x time)
x_train shape: (1500, 80)
x_test shape: (10000, 80)
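
Every review ends up with exactly `maxlen = 80` integers, which is why the shapes above are `(1500, 80)` and `(10000, 80)`. The sketch below mimics, in plain Python, the default behavior of `pad_sequences` (`padding='pre'`, `truncating='pre'`, `value=0`); `pad_pre` is a hypothetical helper written for illustration, not a Keras function.

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` items of long ones.

    Assumes maxlen >= 1.
    """
    out = []
    for seq in seqs:
        kept = list(seq)[-maxlen:]                          # truncate from the front
        out.append([value] * (maxlen - len(kept)) + kept)   # pad at the front
    return out


toy = [[5, 8], [1, 2, 3, 4, 5, 6]]
padded = pad_pre(toy, maxlen=4)
print(padded)  # [[0, 0, 5, 8], [3, 4, 5, 6]]
```

Note that long reviews lose their beginning, not their end, so with these defaults the model only sees the last 80 tokens of each review.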


# Naive Baseline 

In [3]:
import keras
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(keras.layers.Lambda(lambda x: keras.backend.mean(x, axis=1)))
model.add(Dense(128, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

Build model...


In [4]:
print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=5,
          validation_split=.1)
          #validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Train...
Train on 1350 samples, validate on 150 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test score: 0.48458081040382384
Test accuracy: 0.7897


# Design a solution based on RNNs

Try to outperform the above baseline by using a model based on RNNs. You can tweak and explore most of the elements in this notebook. For example, you could use more data to train your model, change the number of epochs, the optimizer, etc. The only restriction is to keep the same test set.

Ideally, your model should outperform the baseline, or at least come very close to its performance.

### Note: Provide a brief explanation of your solution and of why your proposal is or is not working. This description is going to be AS important as the code that you write.

In [83]:
# Write your code here.
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, return_sequences=True))  # one 128-d output per time step; input shape is inferred from the Embedding layer
model.add(TimeDistributed(Dense(128))) #Check names to see how to load weights
model.add(keras.layers.Lambda(lambda x: keras.backend.mean(x, axis=1)))
model.add(Dense(128, activation='relu'))
model.add(keras.layers.Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.summary()

optimizer = keras.optimizers.RMSprop(lr=1e-3)

model.compile(loss='binary_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])

Build model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_40 (Embedding)     (None, None, 128)         2560000   
_________________________________________________________________
lstm_34 (LSTM)               (None, None, 128)         131584    
_________________________________________________________________
time_distributed_9 (TimeDist (None, None, 128)         16512     
_________________________________________________________________
lambda_27 (Lambda)           (None, 128)               0         
_________________________________________________________________
dense_73 (Dense)             (None, 128)               16512     
_________________________________________________________________
dropout_29 (Dropout)         (None, 128)               0         
_________________________________________________________________
dense_74 (Dense)             (None, 1)                 129   

As a first proposal for improving the sentiment classifier on the IMDB dataset, we added an LSTM layer (the usual choice) so that the network could, to some extent, take the context of each review into account. The LSTM returns one output per input token, with the aim of letting the network recognize the syntactic structure of a good or a bad review. This output is then processed by another dense layer of 128 units, to which 50% Dropout was added so that the model would not overfit the training data. Finally, the result is passed to the last layer, which classifies the sentiment. The optimizer was also changed (RMSprop with a learning rate of 1e-3). The results are shown in the next cell.

Results: The network performed worse than the naive baseline, reaching a test accuracy below 75% (0.7413). We observe overfitting to the training data: although val_acc reached almost 82%, accuracy on the test data did not improve. We believe this is due to a poor network architecture, and that our attempt could be improved if we knew how to overcome the overfitting.
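
The overfitting noted above is easier to appreciate by comparing the model's size with the amount of training data. The sketch below recomputes the per-layer parameter counts from the summary using the standard formulas for embedding, LSTM, and dense layers, then relates the total to the 1,350 training samples reported in the training log. The helper names are hypothetical, and the total is plain arithmetic on the listed counts, not a value printed by the notebook.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


counts = {
    "embedding": embedding_params(20000, 128),
    "lstm": lstm_params(128, 128),
    "time_distributed_dense": dense_params(128, 128),
    "dense": dense_params(128, 128),
    "output": dense_params(128, 1),
}
total = sum(counts.values())
train_samples = 1350  # 1,500 reviews minus the 10% validation split

for name, n in counts.items():
    print(f"{name:>24}: {n:>9,}")
print(f"{'total':>24}: {total:>9,}")
print(f"parameters per training sample: {total / train_samples:.0f}")
```

With roughly two thousand trainable parameters per training review, most of them in the embedding table, a large gap between training and test performance is not surprising. Training on more of the available data, which the task statement explicitly allows, is one plausible remedy.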

In [84]:
print('Train...')

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=5,
          validation_split=.1)
          #validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Train...
Train on 1350 samples, validate on 150 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test score: 0.9990566026687622
Test accuracy: 0.7413


In [80]:
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(keras.layers.Lambda(lambda x: keras.backend.mean(x, axis=1)))
model.add(keras.layers.Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

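# NOTE: this RMSprop optimizer is defined but never used. compile() below is
# given 'adam', so the run reported for this model was trained with Adam.
optimizer = keras.optimizers.RMSprop(lr=2e-4)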

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

Build model...


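Second attempt: Dropout was added to the naive baseline network, ahead of the penultimate layer, to counter overfitting, and an RMSprop optimizer with a lower learning rate was defined. Note, however, that the `compile` call in the cell above still passes `'adam'`, so the run reported below was trained with Adam. The results are shown below.

Results: test accuracy appears to improve slightly, reaching about 79% (0.7919, compared with 0.7897 for the baseline). A difference this small may well be within run-to-run variation.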
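
For readers unfamiliar with what the Dropout layer does to the pooled embedding, here is a minimal sketch of "inverted" dropout in plain Python. It assumes the usual convention, which Keras also follows: during training each unit is zeroed with probability `rate` and the survivors are scaled by `1 / (1 - rate)`, while at evaluation time the layer is the identity. The `dropout` function is a hypothetical stand-in, not the Keras implementation.

```python
import random


def dropout(values, rate, training, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(values)  # identity at evaluation time
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
pooled = [1.0] * 10000  # stand-in for activations entering the Dropout layer

dropped = dropout(pooled, rate=0.5, training=True, rng=rng)
mean_after = sum(dropped) / len(dropped)
print(sorted(set(dropped)))   # [0.0, 2.0]
print(round(mean_after, 1))   # close to the original mean of 1.0
```

Because survivors are rescaled, the expected activation is unchanged, so the same weights can be used at test time without any correction.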

In [81]:
print('Train...')

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=5,
          validation_split=.1)
          #validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Train...
Train on 1350 samples, validate on 150 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test score: 0.458798303937912
Test accuracy: 0.7919

