# IMDb Sentiment Analysis

## Technical Preliminaries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf
from tensorflow import keras

keras.utils.set_random_seed(42)

## Problem Setup

We will work with the famous IMDb dataset of movie reviews.

The datasets we will work with have just two columns:
* the text of the review
* a label of 1 or 0 indicating a positive or negative review

Our task is to develop models to predict the sentiment from the review text.

As you will soon see, we only have 50 reviews in the training set! Given this small dataset, what's the best way to build an accurate model? That's what we will try to answer in this homework!


## Data Prep



In [None]:
train_df = pd.read_csv('https://www.dropbox.com/s/seqzwmzfpq50kyn/train_df.csv?dl=1', index_col=0)
test_df = pd.read_csv('https://www.dropbox.com/s/dssjsrxr9zx43rq/test_df.csv?dl=1', index_col=0)

In [None]:
print(f"""
Train samples: {train_df.shape[0]}
Test samples: {test_df.shape[0]}
""")


Train samples: 50
Test samples: 500



In [None]:
train_df.head()

Unnamed: 0,label,text
7069,0,"b""If derivative and predictable rape-revenge t..."
16664,0,"b'Unimaginably stupid, redundant and humiliati..."
3362,0,b'This is the kind of movie which shows the pa...
165,0,"b""This is by far THE WORST movie i have ever w..."
13898,0,"b""What a load of rubbish.. I can't even begin ..."


What's the proportion of positive and negative labels?

In [None]:
train_df['label'].value_counts() / train_df.shape[0]

0    0.5
1    0.5
Name: label, dtype: float64

Nice, it is balanced.

In [None]:
# Let's turn the target into a dummy vector
y_train = pd.get_dummies(train_df['label']).to_numpy()
y_test = pd.get_dummies(test_df['label']).to_numpy()

In [None]:
y_train[:10]

array([[1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0]], dtype=uint8)

## Problem 1: Bag-of-Words Baseline Model

Please follow the instructions in the HW PDF and complete the cells below.

In [None]:
# First, we configure a Text Vectorization layer using the default
# standardization and multi-hot encoding

# Set the maximum number of tokens
max_tokens = 1000

# Configure the text vectorization layer
text_vectorization = keras.layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="multi_hot")

In [None]:
# Let's adapt the Text Vectorization layer using the training corpus
text_vectorization.adapt(train_df['text'])

In [None]:
# We vectorize our input with the adapted Text Vectorization layer
x_train = text_vectorization(train_df['text'])
x_test = text_vectorization(test_df['text'])
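
To make the encoding concrete, here is a minimal pure-Python sketch of a multi-hot bag-of-words vector. It only approximates what the layer does (lowercasing, stripping punctuation, splitting on whitespace, and reserving index 0 for out-of-vocabulary tokens). The helper names `standardize`, `build_vocab`, and `multi_hot` and the two-review corpus are made up for illustration and are not part of Keras.

```python
import re
from collections import Counter

def standardize(text):
    # Lowercase and drop punctuation, roughly like the layer's default standardization
    return re.sub(r"[^\w\s]", "", text.lower())

def build_vocab(corpus, max_tokens):
    # Count tokens over the whole corpus and keep the most frequent ones
    counts = Counter(tok for doc in corpus for tok in standardize(doc).split())
    # Index 0 is reserved for out-of-vocabulary tokens, so max_tokens - 1 slots remain
    kept = [tok for tok, _ in counts.most_common(max_tokens - 1)]
    return {tok: i + 1 for i, tok in enumerate(kept)}

def multi_hot(text, vocab, max_tokens):
    # Set a 1 at the index of every token that occurs, regardless of how often
    vec = [0] * max_tokens
    for tok in standardize(text).split():
        vec[vocab.get(tok, 0)] = 1
    return vec

corpus = ["A great movie", "a terrible, terrible movie"]  # toy corpus
vocab = build_vocab(corpus, max_tokens=5)
encoded = multi_hot("great great film", vocab, max_tokens=5)
print(encoded)
```

Here "great" appears twice but still contributes a single 1, and the unseen word "film" lands in the out-of-vocabulary slot at index 0.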

In [None]:
# Build a baseline NN model with one hidden layer that has 8 neurons.

inputs = keras.Input(shape=(max_tokens, ))
x = keras.layers.Dense(8, activation="relu")(inputs)
outputs = keras.layers.Dense(2, activation="softmax")(x)
model = keras.Model(inputs, outputs)

model.summary()

Model: "model_25"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_26 (InputLayer)       [(None, 1000)]            0         
                                                                 
 dense_62 (Dense)            (None, 8)                 8008      
                                                                 
 dense_63 (Dense)            (None, 2)                 18        
                                                                 
Total params: 8,026
Trainable params: 8,026
Non-trainable params: 0
_________________________________________________________________


In [None]:
num_param_dense = 8 * 1000 + 8   #8008
output = 8*2 + 2                 #18
total = num_param_dense + output
print(total)                     #8026

8026
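
The same arithmetic generalizes to any stack of `Dense` layers: each layer has one weight per input-unit pair plus one bias per unit. The helpers below (`dense_params` and `mlp_params`, both hypothetical names) are a small sketch of that rule.

```python
def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair plus one bias per unit
    return n_inputs * n_units + n_units

def mlp_params(layer_sizes):
    # layer_sizes lists the input width followed by the width of each Dense layer
    return sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))

baseline = mlp_params([1000, 8, 2])       # the baseline model above
two_hidden = mlp_params([1000, 8, 8, 2])  # same model with a second 8-unit hidden layer
print(baseline)    # 8026
print(two_hidden)  # 8098
```

Dropout layers add no parameters, so they can be left out of `layer_sizes`.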


In [None]:
# Compile model using Adam
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Fit model on the training data with 10 epochs and batch size of 32
model.fit(x=x_train, y=y_train,
          epochs=10,
          batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f92e5ac6ec0>

In [None]:
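# Loss and accuracy on the test data, then on the training data.
# Only the value of the last expression is shown as the cell output.
model.evaluate(x=x_test, y=y_test)
model.evaluate(x=x_train, y=y_train)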



[0.453866571187973, 0.9200000166893005]
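
The first number in the `model.evaluate` output is the categorical cross-entropy loss and the second is the accuracy. As a rough illustration of how that loss is computed for a single example, here is a pure-Python sketch. The function names are chosen for this example and are not the Keras implementation, which also averages over the batch.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # y_true is one-hot, so only the true class's log-probability contributes
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

probs = softmax([0.0, 0.0])                      # an undecided two-class prediction
loss = categorical_crossentropy([1, 0], probs)   # equals ln(2)
print(probs, loss)
```

For two classes, a model that always outputs 50/50 has a loss of ln 2 (about 0.693), which is a useful reference point when reading reported loss values.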

## Problem 2: Improve the baseline model

In this problem, we will define and train more complex models to try to increase the accuracy on our test dataset. Try different model variants by:
- Changing the number of hidden units.
- Adding another hidden layer.
- Adding dropout.
- Changing the number of epochs.
- Using bigrams instead of unigrams.

To guide your search for the best parameters, note how the accuracy changes on both train and test data.
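
To see what the bigram option changes, here is a small sketch of n-gram generation. To my understanding, passing an integer such as `ngrams=2` to `TextVectorization` produces n-grams up to that size (unigrams plus bigrams), and the hypothetical helper `ngrams_up_to` below mimics that on an already tokenized review.

```python
def ngrams_up_to(tokens, n):
    # Collect every contiguous run of 1..n tokens, shortest runs first
    grams = []
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[start:start + size]))
    return grams

features = ngrams_up_to("not very good".split(), 2)
print(features)  # ['not', 'very', 'good', 'not very', 'very good']
```

Bigrams such as "not very" let a bag-of-words model capture a little word order, at the cost of a much larger candidate vocabulary competing for the same `max_tokens` slots.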

In [None]:
# Begin your code here
max_token_number = 1000

# Multiple iterations with different number of neurons and epochs
params = [[8,10],
          [16,10],
          [32,10],
          [64,10],
          [128,10],
          [256,10],
          [8,20],
          [16,20],
          [32,20],
          [64,20],
          [128,20],
          [256,20]]

results_dict = {'Hidden units'   :[],
                'Epochs'         :[],
                'Hidden Layers'  :[],
                'Test accuracy'  :[],
                'Train Accuracy' :[],
                'Delta Accuracy' :[]
                }

for p in params:
  neuron_number = p[0]
  no_epochs = p[1]

  text_vectorization = keras.layers.TextVectorization(
      ngrams=2,
      max_tokens=max_token_number,
      output_mode="multi_hot",
  )

  text_vectorization.adapt(train_df['text'])
  x_train = text_vectorization(train_df['text'])
  x_test = text_vectorization(test_df['text'])

  inputs = keras.Input(shape=(max_token_number,))
  x = keras.layers.Dense(neuron_number, activation="relu")(inputs)
  x = keras.layers.Dropout(0.5)(x)
  outputs = keras.layers.Dense(2, activation="softmax")(x)

  modelOneHidden = keras.Model(inputs, outputs)
  modelOneHidden.summary()
  modelOneHidden.compile(optimizer="adam",
                        loss="categorical_crossentropy",
                        metrics=["accuracy"])

  modelOneHidden.fit(x=x_train, y=y_train,
            epochs=no_epochs,
            batch_size=32)

  modelOneHidden.evaluate(x=x_test, y=y_test)

  #----------------------------------------------------------

  inputs = keras.Input(shape=(max_token_number,))
  x = keras.layers.Dense(neuron_number, activation="relu")(inputs)
  x = keras.layers.Dense(neuron_number, activation="relu")(x)
  x = keras.layers.Dropout(0.5)(x)
  outputs = keras.layers.Dense(2, activation="softmax")(x)

  modelTwoHidden = keras.Model(inputs, outputs)
  modelTwoHidden.summary()
  modelTwoHidden.compile(optimizer="adam",
                loss="categorical_crossentropy",
                metrics=["accuracy"])

  modelTwoHidden.fit(x=x_train, y=y_train,
            epochs=no_epochs,
            batch_size=32)

  # Evaluate each model once on test and train data, then report and store the results
  for n_layers, m in [(1, modelOneHidden), (2, modelTwoHidden)]:
    test_acc = m.evaluate(x=x_test, y=y_test)[1]
    train_acc = m.evaluate(x=x_train, y=y_train)[1]
    delta = train_acc - test_acc

    print(f"Hidden layers = {n_layers}; max_token_number = {max_token_number}; "
          f"neuron_number = {neuron_number}; no_epochs = {no_epochs}")
    print(f"Test accuracy = {test_acc}; Train accuracy = {train_acc}; Delta = {delta}")

    results_dict['Hidden units'].append(neuron_number)
    results_dict['Epochs'].append(no_epochs)
    results_dict['Hidden Layers'].append(n_layers)
    results_dict['Test accuracy'].append(test_acc)
    results_dict['Train Accuracy'].append(train_acc)
    results_dict['Delta Accuracy'].append(delta)

print(results_dict)


Model: "model_26"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_27 (InputLayer)       [(None, 1000)]            0         
                                                                 
 dense_64 (Dense)            (None, 8)                 8008      
                                                                 
 dropout_24 (Dropout)        (None, 8)                 0         
                                                                 
 dense_65 (Dense)            (None, 2)                 18        
                                                                 
Total params: 8,026
Trainable params: 8,026
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Model: "model_27"
___________________________________________________________

In [None]:
results_df = pd.DataFrame.from_dict(results_dict)

In [None]:
from google.colab import files
results_df.to_csv('Model_Results.csv', index = False, encoding = 'utf-8-sig')
files.download('Model_Results.csv')

## Problem 3: Use a pre-trained model

Next, we will use a famous pre-trained model called [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)).

In [None]:
!pip install -q -U tensorflow-text  # install the package for NLP tasks in TensorFlow

In [None]:
import tensorflow_hub as hub
import tensorflow_text

bert_preprocess = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'

bert_layers = 12
bert_units = 768
bert_heads = 12

bert_encoder = f'https://tfhub.dev/tensorflow/bert_en_uncased_L-{bert_layers}_H-{bert_units}_A-{bert_heads}/4'

Let's look at some examples of the text processing required for BERT:

In [None]:
bert_preprocess_model = hub.KerasLayer(bert_preprocess)

text_test = ['This is comedy as it once was and comparing this with the two remakes.']
text_preprocessed = bert_preprocess_model(text_test)
print(f'Word Ids   : {text_preprocessed["input_word_ids"]}')

text_test = ['Although I rated this movie a 2 for showing a complete lack of effort in trying to create a quality horror film it was a 10 on the unintentional funny scale.']
text_preprocessed = bert_preprocess_model(text_test)
print(f'Word Ids   : {text_preprocessed["input_word_ids"]}')

Word Ids   : [[  101  2023  2003  4038  2004  2009  2320  2001  1998 13599  2023  2007
   1996  2048 12661  2015  1012   102     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0]]
Word Ids   : [[  101  2348  1045  6758  2023  3185  1037  1016  2005  4760  1037  3143
   3768  1997  3947  1999  2667  2000  3443  1037  3737  5469  2143  2009
   2001  1037  2184  2006  1996  4

Notice how every example starts with the token `101` and ends with `102`, followed by `[PAD]` tokens (id `0`) that fill the sequence to a fixed length. `101` and `102` are special tokens that can be used for different purposes: `101` is the classification token, `[CLS]`, and `102` is the separator token, `[SEP]`. For this exercise, we are only concerned with the first one, `[CLS]`.
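
The layout of the word ids can be reproduced by hand. The sketch below is a simplified stand-in for the packing step, not the TF Hub implementation: the function name `pack_inputs` is hypothetical, the default length of 128 simply matches the 128-entry output above, and the real preprocessing model also returns `input_type_ids`.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def pack_inputs(token_ids, seq_length=128):
    # Leave room for [CLS] and [SEP], truncating inputs that are too long
    body = token_ids[:seq_length - 2]
    word_ids = [CLS_ID] + body + [SEP_ID]
    n_real = len(word_ids)
    padding = seq_length - n_real
    # Pad with zeros and mark real tokens with 1 in the mask
    input_word_ids = word_ids + [PAD_ID] * padding
    input_mask = [1] * n_real + [0] * padding
    return input_word_ids, input_mask

# Token ids of the first example sentence, copied from the output above
tokens = [2023, 2003, 4038, 2004, 2009, 2320, 2001, 1998, 13599, 2023, 2007,
          1996, 2048, 12661, 2015, 1012]
word_ids, mask = pack_inputs(tokens)
print(word_ids[:20])
print(sum(mask))  # 18 real tokens: 16 word pieces plus [CLS] and [SEP]
```

The mask is what lets the encoder ignore the padding positions.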

In [None]:
max_length = 512
preprocessor = hub.load(bert_preprocess)
encoder = hub.KerasLayer(bert_encoder, trainable=False)

def bert_textvect(x):
  # Tokenize raw strings and pack them into fixed-length BERT inputs
  # (input_word_ids, input_mask, input_type_ids), each of length max_length
  text_input = keras.layers.Input(shape=(), dtype=tf.string)
  tokenized_input = hub.KerasLayer(preprocessor.tokenize)(text_input)
  bert_pack_inputs = hub.KerasLayer(preprocessor.bert_pack_inputs, arguments=dict(seq_length=max_length))
  packed = bert_pack_inputs([tokenized_input])
  model = keras.Model(text_input, packed)
  return model.predict(x)

def bert_features(x):
  # Run the frozen BERT encoder and return the [CLS] embedding of each review
  inputs = dict(
    input_word_ids=keras.layers.Input(shape=(max_length,), dtype=tf.int32),
    input_mask=keras.layers.Input(shape=(max_length,), dtype=tf.int32),
    input_type_ids=keras.layers.Input(shape=(max_length,), dtype=tf.int32),
  )

  # sequence_output has one vector per token; position 0 is the [CLS] token
  cls_embedding = encoder(inputs)['sequence_output'][:, 0, :]
  model = keras.Model(inputs, cls_embedding)
  return model.predict(x)


In [None]:
X_bert_train = bert_textvect(train_df['text'])
X_bert_test = bert_textvect(test_df['text'])

features_train = bert_features(X_bert_train)
features_test = bert_features(X_bert_test)

7069     b"If derivative and predictable rape-revenge t...
16664    b'Unimaginably stupid, redundant and humiliati...
3362     b'This is the kind of movie which shows the pa...
165      b"This is by far THE WORST movie i have ever w...
13898    b"What a load of rubbish.. I can't even begin ...
16719    b"Bob Clampett's 'Porky's Poor Fish' is a so-s...
9417     b'I loved the first "Azumi" movie. I\'ve seen ...
13277    b'This TV-made thriller is all talk, little ac...
8861     b'Hi folks<br /><br />Forget about that movie....
3330     b'I gave this more than a 1 because I did thin...
1321     b"An OK flick, set in Mexico, about a hit-man ...
10135    b"This movie wasn't awful but it wasn't very g...
3518     b"After coming off the first one you think the...
1504     b"<br /><br />I didn't see They Call Me Trinit...
17526    b'I wished I\'d taped MEN IN WHITE so I could ...
5392     b'This is a fascinating film--especially to ol...
9571     b"I first saw this movie when I was about 10 y.

Next, we'll use the BERT output embedding for the CLS token as input to train a simple neural net.

(BTW, the model below includes a `Dropout` layer with a rate of 0.1. Feel free to remove it or change the rate and see how that affects the model's accuracy.)
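
As a reminder of where that CLS embedding comes from, the encoder's `sequence_output` holds one vector per token, and the feature we keep is the vector at position 0. The NumPy sketch below uses random numbers and toy sizes (the notebook's real sizes are 512 tokens and 768 units), so it only illustrates the slicing.

```python
import numpy as np

batch, seq_len, hidden = 2, 6, 4   # toy sizes chosen for this example
rng = np.random.default_rng(0)
sequence_output = rng.normal(size=(batch, seq_len, hidden))

# Position 0 along the sequence axis holds the [CLS] token's vector
cls_embedding = sequence_output[:, 0, :]
print(cls_embedding.shape)  # (2, 4)
```

Each review is thus reduced to a single fixed-size vector, which is why the classifier's input shape is `(bert_units,)`.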

In [None]:
hidden_units = 64

# Input
input = keras.Input(shape=(bert_units, ))


x = keras.layers.Dense(hidden_units)(input)
x = keras.layers.Dense(hidden_units)(x)
x = keras.layers.Dropout(0.1)(x)

output = keras.layers.Dense(2, activation='softmax')(x)

# Model
model = keras.Model(input, output)

model.summary()

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

batch_size = 32
epochs = 20

# Fit model
model.fit(x=features_train, y=y_train,
          epochs=epochs,
          batch_size=batch_size)

Model: "model_20"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_31 (InputLayer)       [(None, 768)]             0         
                                                                 
 dense_5 (Dense)             (None, 64)                49216     
                                                                 
 dense_6 (Dense)             (None, 64)                4160      
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense_7 (Dense)             (None, 2)                 130       
                                                                 
Total params: 53,506
Trainable params: 53,506
Non-trainable params: 0
_________________________________________________________________
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20

<keras.callbacks.History at 0x7f180422b8b0>

In [None]:
print(model.evaluate(features_test, y_test))

[0.6586838960647583, 0.7319999933242798]


In [None]:
review_text = ''' I found 'A Beautiful Mind' to be a captivating and thought-provoking film that offers
                  a unique perspective on mental illness and the human experience. The film's nuanced portrayal
                   of the complexities of the mind and its impact on relationships is both powerful and poignant.
                   Russell Crowe's performance as John Nash is simply brilliant, and the film's exploration of Nash's
                   genius and struggle with schizophrenia is both engaging and heartbreaking. Overall,
                   'A Beautiful Mind' is a deeply moving and inspiring film that offers a powerful message of hope and resilience.'''

review_text = [review_text]
review_text_preprocessed = bert_textvect(review_text)
features_review_text = bert_features(review_text_preprocessed)
model.predict(features_review_text)



array([[0.01647703, 0.98352295]], dtype=float32)
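
To turn these probabilities into a label, take the index of the largest value. Because the targets were built with `pd.get_dummies`, column 0 corresponds to label 0 (negative) and column 1 to label 1 (positive). A minimal sketch, with `class_names` and `predicted_class` as names introduced here for illustration:

```python
class_names = ["negative", "positive"]  # column 0 is label 0, column 1 is label 1

def predicted_class(probs):
    # Pick the index of the largest probability (argmax)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return class_names[best], probs[best]

name, confidence = predicted_class([0.01647703, 0.98352295])
print(name, confidence)  # positive 0.98352295
```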

In [None]:
print(y_test)

[[0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]