
# Movie-Review

![](https://images.unsplash.com/photo-1524985069026-dd778a71c7b4?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1051&q=80)

In this exercise, you will compare a classical NLP approach with a sequential (RNN) approach on the IMDB movie-review dataset.

First, download the dataset from `tensorflow.keras.datasets.imdb`, keeping the 10,000 most frequent words (if you run into memory issues, you can go down to 5,000 words).

In [2]:
# TODO: Load the dataset
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.layers import Dense, SimpleRNN, Embedding
from tensorflow.keras.models import Sequential

In [3]:
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=10000)

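Explore the dataset: you can use the function `imdb.get_word_index()` to map indices back to words and display some reviews. Be careful: with the default `load_data` settings, the indices `0`, `1`, `2` and `3` are reserved (padding, start of sequence, unknown word, unused) and do not stand for any word, so every real word index is shifted by 3 relative to its rank in `imdb.get_word_index()`.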

In [4]:
# TODO: Explore the data, display some sentences
word2idx = imdb.get_word_index()

In [5]:
idx2word = {v: k for k, v in word2idx.items()}

In [6]:
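def to_text(vec):
    # Note: idx2word is keyed by word rank, while the dataset indices are
    # shifted by 3 (reserved indices), so this direct lookup is misaligned.
    lst_word = [idx2word[i] for i in vec]
    return lst_word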

In [7]:
X_train[0]

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 2,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 2,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 2,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 5244,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 2,
 8,
 4,
 107,
 117,
 5952,
 15,
 256,
 4,
 2,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 2,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 7486,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5,
 144,
 30,
 5535,
 18,

In [8]:
to_text(X_train[0])

['the',
 'as',
 'you',
 'with',
 'out',
 'themselves',
 'powerful',
 'lets',
 'loves',
 'their',
 'becomes',
 'reaching',
 'had',
 'journalist',
 'of',
 'lot',
 'from',
 'anyone',
 'to',
 'have',
 'after',
 'out',
 'atmosphere',
 'never',
 'more',
 'room',
 'and',
 'it',
 'so',
 'heart',
 'shows',
 'to',
 'years',
 'of',
 'every',
 'never',
 'going',
 'and',
 'help',
 'moments',
 'or',
 'of',
 'every',
 'chest',
 'visual',
 'movie',
 'except',
 'her',
 'was',
 'several',
 'of',
 'enough',
 'more',
 'with',
 'is',
 'now',
 'current',
 'film',
 'as',
 'you',
 'of',
 'mine',
 'potentially',
 'unfortunately',
 'of',
 'you',
 'than',
 'him',
 'that',
 'with',
 'out',
 'themselves',
 'her',
 'get',
 'for',
 'was',
 'camp',
 'of',
 'you',
 'movie',
 'sometimes',
 'movie',
 'that',
 'with',
 'scary',
 'but',
 'and',
 'to',
 'story',
 'wonderful',
 'that',
 'in',
 'seeing',
 'in',
 'character',
 'to',
 'of',
 '70s',
 'musicians',
 'with',
 'heart',
 'had',
 'shadows',
 'they',
 'of',
 'here',
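
The words printed above do not form a readable review because, with the default `index_from=3`, the indices returned by `imdb.load_data` are shifted by 3 relative to the ranks in `imdb.get_word_index()`. A minimal sketch of a decoder that accounts for this shift is given below. The helper name `decode_review`, the placeholder tokens and the toy `word_index` are assumptions for illustration; in the notebook you would pass `imdb.get_word_index()` instead.

```python
def decode_review(encoded, word_index, index_from=3):
    """Map dataset indices back to words, honouring the reserved indices."""
    # Reserved indices under the default Keras settings (assumed here).
    reserved = {0: "<PAD>", 1: "<START>", 2: "<UNK>", 3: "<UNUSED>"}
    # word_index maps word -> rank (1 = most frequent); shift by index_from.
    idx2word = {rank + index_from: word for word, rank in word_index.items()}
    idx2word.update(reserved)
    return " ".join(idx2word.get(i, "<UNK>") for i in encoded)


# Toy vocabulary standing in for imdb.get_word_index() (hypothetical data).
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}
decoded = decode_review([1, 4, 5, 6, 2, 7], toy_word_index)
print(decoded)  # <START> the film was <UNK> brilliant
```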
 

## Classical NLP

Make a prediction using classical NLP tools, BOW and TF-IDF, followed by a classification model. Choose a random forest or gradient boosting, and perform a grid search for hyperparameter optimization.

*Warning: you are used to manipulating words, but here they are already encoded as integers.*
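
Because each review is already a list of integer word indices, a bag-of-words vector is simply a count of how often each index occurs. The sketch below illustrates this with NumPy on made-up sequences; `bag_of_words` and the toy data are hypothetical names, not part of the exercise.

```python
import numpy as np


def bag_of_words(sequences, vocab_size):
    """Return a (n_docs, vocab_size) matrix of raw word-index counts.

    Assumes every index in every sequence is smaller than vocab_size.
    """
    bow = np.zeros((len(sequences), vocab_size), dtype=np.int32)
    for row, seq in enumerate(sequences):
        # bincount counts occurrences of each index; minlength fixes the width.
        bow[row] = np.bincount(seq, minlength=vocab_size)
    return bow


toy_sequences = [[1, 4, 4, 7], [1, 5, 2]]
bow = bag_of_words(toy_sequences, vocab_size=8)
print(bow)
```

On the full dataset a dense 25,000 × 10,000 count matrix takes about 1 GB with 32-bit counts, so a sparse representation may be preferable.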

In [9]:
import numpy as np

In [10]:
### TODO: Perform classification using NLP tools
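from sklearn.feature_extraction.text import TfidfVectorizer
# The reviews are already tokenised into integers, so the analyzer returns each
# review unchanged. `stop_words` only applies when analyzer='word', so it is omitted.
vectorizer = TfidfVectorizer(analyzer=lambda x: x)
# Note: .toarray() builds a dense matrix, which can use a lot of memory.
tf_idf_train = vectorizer.fit_transform(X_train).toarray()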

In [11]:
tf_idf_test = vectorizer.transform(X_test).toarray()

In [12]:
### TODO: Perform classification using NLP tools
from lightgbm import LGBMClassifier

In [14]:
lgbm = LGBMClassifier(n_jobs=-1)

In [15]:
lgbm.fit(tf_idf_train, y_train)

LGBMClassifier()

In [17]:
y_pred = lgbm.predict(tf_idf_test)

In [18]:
from sklearn.metrics import classification_report

In [19]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      0.85      0.86     12500
           1       0.86      0.88      0.87     12500

    accuracy                           0.87     25000
   macro avg       0.87      0.87      0.87     25000
weighted avg       0.87      0.87      0.87     25000
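
The exercise statement also asks for a grid search, which the cells above do not perform. A minimal sketch of how one could be set up with scikit-learn is shown below, using a random forest and a tiny made-up dataset so that it runs quickly. The grid values, the `identity` helper and the toy data are assumptions for illustration, not tuned settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def identity(tokens):
    """Analyzer that returns the already-tokenised review unchanged."""
    return tokens


# Toy data (hypothetical): index 10 marks positive reviews, 11 negative ones.
toy_X = [[10, 20, 21, 20], [11, 20, 22], [10, 23], [11, 22, 23, 24, 20]] * 5
toy_y = [1, 0, 1, 0] * 5

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer=identity)),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Small illustrative grid; the values are assumptions, not recommendations.
param_grid = {
    "clf__n_estimators": [10, 50],
    "clf__max_depth": [3, None],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(toy_X, toy_y)
print(search.best_params_, search.best_score_)
```

On the real data you would call `search.fit(X_train, y_train)` and then evaluate `search.best_estimator_` on the test set.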



What accuracy did you reach? Let's see if you can do better with an RNN.

## RNN

Since you will use sequences, you will have to choose a sequence length.

First, you can check the min, max and average length of the sequences.

In [24]:
# TODO: compute basic descriptive statistics of the length of sequences
np.mean([len(seq) for seq in X_train])

238.71364

In [25]:
np.min([len(seq) for seq in X_train])

11

In [26]:
np.max([len(seq) for seq in X_train])

2494

Now pad the sequences: choose a length related to the mean or median sequence length.
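
The median and a high percentile are often more informative than the mean when a few reviews are very long. The sketch below shows, on made-up sequences, how a candidate length could be checked against the share of reviews it covers in full; the variable names and toy data are hypothetical, and in the notebook `toy_sequences` would be replaced by the unpadded `X_train`.

```python
import numpy as np

# Toy sequences standing in for the unpadded X_train (hypothetical data).
toy_sequences = [[1, 2, 3], [1, 2], [1, 2, 3, 4, 5, 6, 7, 8], [1, 2, 3, 4]]
lengths = np.array([len(seq) for seq in toy_sequences])

median_len = int(np.median(lengths))       # robust to a few very long reviews
p90_len = int(np.percentile(lengths, 90))  # length covering roughly 90% of reviews

# Share of sequences that would fit entirely within a candidate maxlen.
candidate_maxlen = median_len
coverage = float(np.mean(lengths <= candidate_maxlen))
print(median_len, p90_len, coverage)
```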

In [75]:
# TODO: Make the padding
from tensorflow.keras.preprocessing import sequence

X_train = sequence.pad_sequences(X_train,
                                 value=0,
                                 padding='post', # to add zeros at the end
                                 truncating='post', # to cut the end of long sequences
                                 maxlen=238) # the length we want

X_test = sequence.pad_sequences(X_test,
                                value=0,
                                padding='post', # to add zeros at the end
                                truncating='post', # to cut the end of long sequences
                                maxlen=238) # the length we want
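
To make the effect of `padding='post'` and `truncating='post'` concrete, here is a small NumPy sketch that mimics that behaviour on two made-up sequences. `pad_post` is a hypothetical helper written for illustration; it is not the Keras implementation.

```python
import numpy as np


def pad_post(sequences, maxlen, value=0):
    """Mimic padding='post', truncating='post' on a list of integer lists."""
    padded = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        kept = seq[:maxlen]              # 'post' truncation drops the tail
        padded[row, :len(kept)] = kept   # 'post' padding fills the end
    return padded


toy_padded = pad_post([[5, 6], [1, 2, 3, 4, 5]], maxlen=4)
print(toy_padded)
```

With post-padding, a plain RNN reads the padding zeros last, just before producing its output. Using `padding='pre'` or masking the zeros in the embedding layer are common alternatives that may be worth trying.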

Now build an RNN with, for example, two layers of 32 units. Do not forget the embedding layer at the start and the sigmoid layer at the end for binary classification. Warning: the training might take several minutes! You can choose to have fewer layers and/or units.

In [80]:
# TODO: Build your model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense, Embedding


def my_RNN():

    model = Sequential()
    # The input_dim is the number of different words we have in our corpus: here 10000
    # The input_length is the length of our sequences: here 238 thanks to padding
    model.add(Embedding(input_dim=10000, output_dim=32, input_length=238))

    # We add two layers of RNN 
    model.add(SimpleRNN(units=8, return_sequences=True))
    model.add(SimpleRNN(units=8, return_sequences=False))
    
    # Finally we add a sigmoid
    model.add(Dense(units=1, activation='sigmoid'))

    return model
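
To get a feel for the size of this model, its trainable parameters can be counted by hand; assuming default layer settings (biases enabled), the total should agree with what `model.summary()` reports. The helper functions below are illustrative and not part of Keras.

```python
def embedding_params(input_dim, output_dim):
    # One output_dim-sized vector per vocabulary entry.
    return input_dim * output_dim


def simple_rnn_params(input_dim, units):
    # Input kernel + recurrent kernel + bias.
    return units * (input_dim + units + 1)


def dense_params(input_dim, units):
    # Weights + bias.
    return units * (input_dim + 1)


total_params = (
    embedding_params(10000, 32)   # 320,000
    + simple_rnn_params(32, 8)    # 328
    + simple_rnn_params(8, 8)     # 136
    + dense_params(8, 1)          # 9
)
print(total_params)  # 320473
```

Almost all of the parameters sit in the embedding layer, so changing the number of RNN units has little effect on the model's size.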

In [77]:
model = my_RNN()

In [78]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [81]:
model.fit(X_train, y_train, validation_split=.15, epochs=10, batch_size=64)

Epoch 1/10
Epoch 2/10

KeyboardInterrupt: 

Finally compile and train the model on the training data.

In [None]:
# TODO: Compile and fit your model

You can have a look at TensorBoard as usual.

As usual, compute the accuracy.

In [None]:
# TODO: Compute the accuracy of your model
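
One way to compute the accuracy is to threshold the sigmoid outputs at 0.5 and compare them with the labels. The sketch below does this on made-up probabilities; `accuracy_from_probabilities` and the toy arrays are hypothetical, and in the notebook the probabilities would come from `model.predict(X_test)`.

```python
import numpy as np


def accuracy_from_probabilities(probs, y_true, threshold=0.5):
    """Threshold sigmoid outputs and compare them with the true labels."""
    y_pred = (np.asarray(probs).ravel() >= threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true)))


# Made-up probabilities standing in for model.predict(X_test).
toy_probs = np.array([[0.9], [0.2], [0.6], [0.4]])
toy_labels = np.array([1, 0, 0, 0])
toy_accuracy = accuracy_from_probabilities(toy_probs, toy_labels)
print(toy_accuracy)  # 0.75
```

Since the model is compiled with an accuracy metric, `model.evaluate(X_test, y_test)` should report the same quantity directly.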

You might want to improve your results by tuning the hyperparameters: change the number of layers and units, add dropout, try another optimizer or mini-batch size, adjust the data preprocessing, and so on.
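
As one possible starting point, the sketch below builds a variant of the model with input dropout on the RNN layers, masking of the padding index, an explicit learning rate and early stopping. The function name `build_regularized_rnn` and all hyperparameter values are assumptions chosen for illustration, not settings known to improve the score.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, Embedding, SimpleRNN
from tensorflow.keras.models import Sequential


def build_regularized_rnn(vocab_size=10000, embed_dim=32, units=8, dropout=0.2):
    model = Sequential([
        # mask_zero=True makes downstream layers skip the padding index 0.
        Embedding(input_dim=vocab_size, output_dim=embed_dim, mask_zero=True),
        SimpleRNN(units, dropout=dropout, return_sequences=True),
        SimpleRNN(units, dropout=dropout),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Stop when the validation loss stops improving and keep the best weights.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=2, restore_best_weights=True
)

variant = build_regularized_rnn()
# Dummy batch of two padded reviews, used only to check the output shape.
dummy_batch = np.random.randint(1, 10000, size=(2, 238))
dummy_output = variant.predict(dummy_batch, verbose=0)
print(dummy_output.shape)  # (2, 1)
```

Training would then look like `variant.fit(X_train, y_train, validation_split=0.15, epochs=10, batch_size=64, callbacks=[early_stopping])`.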