# Sentiment Analysis IMDB

This notebook shows a simple, straightforward way to reach 90% accuracy on the IMDB dataset. It is not the only way to achieve this accuracy.

## Load Data

In [1]:
import nlp_proj_utils as utils
import pandas as pd

pd.set_option('max_colwidth', 500)  # Set display column width to show more content

# Load dataset, download if necessary
train, test = utils.get_imdb_dataset()

# Get a sample (head) of the data frame
train.sample(3)

data already available, skip downloading.
imdb loaded successfully.


Unnamed: 0,text,sentiment
9675,"I admit to being somewhat jaded about the movie genre of a young child softening the heart of his/her reluctant guardian. I've seen enough of them  Baby Boom, Kolya, About a Boy, Mostly Martha, and to some extent, Whale Rider  to expect to be bored by the formula. What held my attention in The King of Masks was the grimness of the setting: small-town China in the 1930's. Extreme poverty was the norm, and girl children were considered so worthless to poor parents that they killed them at bi...",pos
20233,"The person who wrote the glowing review of this misguided project must be related to the writer/director/star--or is, in fact, the same person as it defies rational thinking that this movie would be appealing to anyone not connected to a very tightly woven inner circle. How about this? You want to make a movie--tell a story; entertain; draw me in with vivid characters. Sure, you can do it artfully without bowing to the commercial elements designed for mass appeal. However, do not address ele...",neg
7609,"A prison cell.Four prisoners-Carrere,a young company director accused of fraud,35 year old transsexual in the process of his transformation, Daisy,a 20 year-old mentally challenged idiot savant and Lassalle,a 60 year-old intellectual who murdered his wife.Behind a stone slab in the cell,mysteriously pulled loose,they discovered a book:the diary of a former prisoner,Danvers,who occupied the cell at the beginning of the century.The diary contains magic formulas that supposedly enable prisoners...",pos


## Prepare Data 

In this part, I remove all HTML tags, punctuation, and stopwords from the dataset. To reach a higher accuracy, I select the 3,000 most common words in the training data, and only words in this list are kept for further analysis. The preprocessing steps are:
1. Remove HTML tags (`<br />` in this case) from the review text
2. Remove punctuation (replace with whitespace)
3. Split the review text into tokens
4. Remove tokens that are considered "stopwords"
5. Lowercase and lemmatize the remaining tokens

In [2]:
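import string
import nltk

# Needs the NLTK 'stopwords', 'wordnet' and 'punkt' resources (fetch once with nltk.download)

# Translation table that maps every punctuation character to a space
transtbl = str.maketrans(string.punctuation, ' '*len(string.punctuation))
# A set gives fast membership tests when filtering tokens
stopwords = set(nltk.corpus.stopwords.words('english'))
lemmatizer = nltk.WordNetLemmatizer()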

In [3]:
def preprocessing(line: str) -> str:
    """
    Take a text input and return the preprocessed string.
    i.e.: preprocessed tokens concatenated by whitespace
    """
    line = line.replace('<br />','').translate(transtbl)
    
    tokens = [lemmatizer.lemmatize(t.lower(),'v')
              for t in nltk.word_tokenize(line)
              if t.lower() not in stopwords]
    
    return ' '.join(tokens)

preprocessing("I bought several books yesterday<br /> and I really love them!")

'buy several book yesterday really love'
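
The cleaning steps can also be tried without NLTK. The following is a simplified, standard-library-only sketch: `toy_preprocessing` and `TOY_STOPWORDS` are hypothetical names, the stopword list is a tiny stand-in for NLTK's English list, and lemmatization is skipped, so the result differs from the output above ("bought"/"books" are not reduced to "buy"/"book").

```python
import string

# Map every punctuation character to a single space (same idea as `transtbl`)
punct_to_space = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

# Tiny illustrative stopword list; the notebook uses NLTK's full English list
TOY_STOPWORDS = {'i', 'and', 'them', 'the', 'a'}


def toy_preprocessing(line: str) -> str:
    """Strip <br /> tags and punctuation, lowercase, and drop toy stopwords."""
    # This sketch replaces the tag with a space so neighbouring words cannot fuse
    line = line.replace('<br />', ' ').translate(punct_to_space)
    tokens = [t.lower() for t in line.split() if t.lower() not in TOY_STOPWORDS]
    return ' '.join(tokens)


cleaned = toy_preprocessing("I bought several books yesterday<br /> and I really love them!")
print(cleaned)
```

The tag is removed before punctuation is translated because `<br />` itself contains punctuation characters.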

In [4]:
from tqdm.notebook import tqdm_notebook  # public module; tqdm._tqdm_notebook is deprecated
tqdm_notebook.pandas()

for df in train, test:
    df['text_prep'] = df['text'].progress_apply(preprocessing)

In [5]:
train.sample(2)

Unnamed: 0,text,sentiment,text_prep
14657,"Julia Stiles is a talented young actress, who with guidance from a reputable agent has a lot of potential. Obviously, the person who guided her into this travesty is not someone who cares anything about her career. I sat in the theater surrounded by teenagers who left in droves to find another movie to sneak into wondering who thought this movie would appeal to anyone. It was poorly written, the casting director could only have put 1 or 2 minutes of effort into the characters and the directo...",neg,julia stiles talented young actress guidance reputable agent lot potential obviously person guide travesty someone care anything career sit theater surround teenagers leave droves find another movie sneak wonder think movie would appeal anyone poorly write cast director could put 1 2 minutes effort character director obviously care
3926,"As a long time fan of Peter O'Donnell's greatest creation, I watched this film on DVD with no great hopes of enjoyment; indeed I expected to be reaching in disgust for the remote control within fifteen minutes. But instead I thoroughly enjoyed this production, and I especially enjoyed and appreciated how the producers and director succeeded in telling the Modesty Blaise back story. They managed to avoid the trap of making a (bad) film version of the books we are all so familiar with, choosin...",pos,long time fan peter donnell greatest creation watch film dvd great hop enjoyment indeed expect reach disgust remote control within fifteen minutes instead thoroughly enjoy production especially enjoy appreciate producers director succeed tell modesty blaise back story manage avoid trap make bad film version book familiar choose instead concentrate period modesty life allude novels production value student cinematography yes film film tight financial time budget maybe show spoil viewer enjoym...


### Keep the most common words

In [6]:
all_words = [w for text in tqdm_notebook(train['text_prep']) 
             for w in text.split()]

In [7]:
# Use FreqDist to get count for each word
voca = nltk.FreqDist(all_words)
print(voca)

<FreqDist with 65102 samples and 3025774 outcomes>


In [8]:
voca.most_common(10)

[('film', 48184),
 ('movie', 44024),
 ('one', 26785),
 ('make', 23568),
 ('like', 22361),
 ('see', 20792),
 ('get', 18140),
 ('time', 16167),
 ('good', 15140),
 ('character', 14172)]

In [9]:
topwords = [word for word, _ in voca.most_common(3000)]
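
`nltk.FreqDist` is a subclass of `collections.Counter`, so the same top-N selection can be sketched with the standard library alone. The reviews below are made-up toy data, not taken from the dataset.

```python
from collections import Counter

# Toy, already-preprocessed reviews (illustrative only)
toy_reviews = ['good film good cast', 'bad film', 'good movie']

# Flatten all reviews into one list of words, as done for `all_words`
toy_words = [w for text in toy_reviews for w in text.split()]

# Count occurrences and keep the 2 most frequent words
toy_counts = Counter(toy_words)
top2 = [word for word, _ in toy_counts.most_common(2)]

print(toy_counts.most_common(2))
print(top2)
```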

In [10]:
# Imports for building and training the model
import numpy as np
from tensorflow.keras.models import Model  
from tensorflow.keras.layers import Dense, Input, Dropout, LSTM, Activation, Embedding
from tensorflow.keras.preprocessing import sequence

import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'


np.random.seed(1)


In [11]:
word_to_index, word_to_vec_map = utils.load_glove_vecs()

data already available, skip downloading.
loading glove... this may take a while...
glove loaded successfully.


### Select the first 200 words for embedding

In [12]:
maxlen = 200
print('max number of words in a sentence:', maxlen)

max number of words in a sentence: 200


In [13]:
# Convert training/testing features into index list
train_text = utils.sentences_to_indices(train['text_prep'], word_to_index, maxlen, topwords)
test_text = utils.sentences_to_indices(test['text_prep'], word_to_index, maxlen, topwords)

In [14]:
train_text

array([[251034., 160418., 306501., ...,      0.,      0.,      0.],
       [ 77324., 181890., 251034., ...,      0.,      0.,      0.],
       [336968., 148224., 236880., ...,      0.,      0.,      0.],
       ...,
       [268508.,  61762., 251057., ...,      0.,      0.,      0.],
       [134390.,  44995.,  74804., ...,      0.,      0.,      0.],
       [125377., 251057., 303435., ...,      0.,      0.,      0.]])
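
The implementation of `utils.sentences_to_indices` is not shown in this notebook. The sketch below is a hypothetical stand-in, written under the assumptions suggested by the output above: words outside `topwords` are dropped, each review is cut to its first `maxlen` words, and shorter reviews are right-padded with 0. The real helper may differ (it returns a float NumPy array, for example).

```python
def toy_sentences_to_indices(sentences, word_to_index, maxlen, topwords):
    """Hypothetical stand-in for utils.sentences_to_indices (real behavior may differ)."""
    keep = set(topwords)  # set lookup is faster than scanning a list
    rows = []
    for sentence in sentences:
        # Keep only frequent words that also have an index in the vocabulary
        ids = [word_to_index[w] for w in sentence.split()
               if w in keep and w in word_to_index]
        ids = ids[:maxlen]                            # truncate long reviews
        rows.append(ids + [0] * (maxlen - len(ids)))  # right-pad short ones with 0
    return rows


toy_index = {'film': 1, 'good': 2, 'really': 3}
toy_rows = toy_sentences_to_indices(
    ['good film', 'really really good obscure film'],
    toy_index,
    maxlen=4,
    topwords=['film', 'good', 'really'],
)
print(toy_rows)
```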

Convert the labels to 0 (negative) and 1 (positive):

In [15]:
train_y = train['sentiment'].apply(lambda x: 1 if x == 'pos' else 0)
test_y = test['sentiment'].apply(lambda x: 1 if x == 'pos' else 0)

### Embedding layer

In [16]:
def pretrained_embedding_layer(word_to_index, word_to_vec_map):
    """
    Build and return a Keras Embedding Layer given word_to_vec mapping and word_to_index mapping
    
    Args:
        word_to_index (dict[str->int]): map from a word to its index in vocabulary
        word_to_vec_map (dict[str->np.ndarray]): map from a word to a vector with shape (N,) where N is the length of a word vector (50 in our case)

    Return:
        Keras.layers.Embedding: Embedding layer
    """
    
    # Add 1 so that the largest word index fits in the matrix (row 0 is left as zeros for padding)
    vocab_len = len(word_to_index) + 1  
    emb_dim = next(iter(word_to_vec_map.values())).shape[0]  # length of one word vector
    
    # Initialize the embedding matrix as a numpy array of zeros of shape (vocab_len, dimensions of word vectors = emb_dim)
    emb_matrix = np.zeros((vocab_len, emb_dim))
    
    # Set each row "index" of the embedding matrix to be the word vector representation of the "index"th word of the vocabulary
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]

    # Define the Keras embedding layer with matching input/output sizes. trainable=False keeps the pre-trained vectors frozen.
    return Embedding(
        input_dim=vocab_len,
        output_dim=emb_dim,
        trainable=False,  # Indicating this is a pre-trained embedding 
        weights=[emb_matrix])
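
What the frozen embedding layer does at lookup time can be sketched without Keras. The example below uses plain lists and made-up 3-dimensional vectors, and it assumes word indices start at 1 so that row 0 stays all zeros for padding. `toy_matrix` and `embed` are illustrative names only.

```python
# Made-up vocabulary and 3-dimensional "word vectors" (illustrative only)
toy_word_to_index = {'film': 1, 'good': 2}
toy_word_to_vec = {'film': [0.1, 0.2, 0.3], 'good': [0.4, 0.5, 0.6]}

emb_dim = 3
rows = len(toy_word_to_index) + 1  # +1 leaves row 0 for padding

# Start from zeros, then copy each word vector into the row given by its index
toy_matrix = [[0.0] * emb_dim for _ in range(rows)]
for word, index in toy_word_to_index.items():
    toy_matrix[index] = list(toy_word_to_vec[word])


def embed(indices):
    """Look up one vector per index, as an embedding layer does for one sequence."""
    return [toy_matrix[i] for i in indices]


embedded = embed([2, 1, 0])  # "good film <pad>"
print(embedded)
```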

## Build a LSTM Model

I will use a two-layer LSTM model, with dropout after each LSTM layer, to classify the reviews.

In [17]:
def build_model(input_dim, word_to_index, word_to_vec_map):
    """
    Build and return the Keras model
    
    Args:
        input_dim: The dim of input layer
        word_to_vec_map (dict[str->np.ndarray]): map from a word to a vector with shape (N,) where N is the length of a word vector (50 in our case)
        word_to_index (dict[str->int]): map from a word to its index in vocabulary
    
    Returns:
        Keras.models.Model: 2-layer LSTM model
    """
    
    # Input layer
    sentence_indices = Input(shape=(input_dim,), dtype='int32')
    
    # Build embedding layer
    embedding_layer = pretrained_embedding_layer(word_to_index, word_to_vec_map)
    embeddings = embedding_layer(sentence_indices)   
    
    # 2-layer LSTM
    X = LSTM(128, return_sequences=True, recurrent_dropout=0.5)(embeddings)  # N -> N RNN: returns the hidden state at every time step
    X = Dropout(rate=0.8)(X)
    X = LSTM(128, recurrent_dropout=0.5)(X)  # N -> 1 RNN
    X = Dropout(rate=0.8)(X)
    X = Dense(1, activation='sigmoid')(X)
    
    # Create and return model
    model = Model(inputs=sentence_indices, outputs=X)
    
    return model

In [18]:
imdb_model = build_model(
    maxlen, 
    word_to_index, 
    word_to_vec_map)


In [19]:
imdb_model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 192)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 192, 50)           20000050  
_________________________________________________________________
lstm (LSTM)                  (None, 192, 128)          91648     
_________________________________________________________________
dropout (Dropout)            (None, 192, 128)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                49408     
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense (Dense)                (None, 1)                 65    
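
The parameter counts in the summary can be checked by hand with the standard formulas for these layer types. The vocabulary size of 400,001 rows is an inference from 20,000,050 / 50, not a value stated in the notebook. The printed summary appears to come from a run with input length 192 and a 64-unit second LSTM, whereas the code above uses `maxlen = 200` and two 128-unit layers, so the last line also shows what the second LSTM would contribute with 128 units.

```python
def embedding_params(vocab_len, emb_dim):
    """One emb_dim-sized vector per vocabulary row."""
    return vocab_len * emb_dim


def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units


print(embedding_params(400001, 50))  # embedding row in the summary
print(lstm_params(50, 128))          # first LSTM: 50-d input, 128 units
print(lstm_params(128, 64))          # second LSTM as printed: 64 units
print(dense_params(64, 1))           # sigmoid output layer as printed
print(lstm_params(128, 128))         # second LSTM with 128 units, as in build_model
```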

## Compile the Model

In [20]:
imdb_model.compile(
    loss='binary_crossentropy', 
    optimizer='adam',
    metrics=['accuracy'])
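
For reference, the quantity that `binary_crossentropy` minimizes can be computed by hand. This is a minimal pure-Python sketch of the usual formula, averaged over samples. Clipping probabilities with a small `eps` is an assumption made here to avoid `log(0)`, and the details of Keras's own implementation may differ.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep p away from 0 and 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Confident and correct predictions give a small loss ...
loss_confident = binary_crossentropy([1, 0], [0.9, 0.1])
# ... while confident but wrong predictions are penalized heavily
loss_wrong = binary_crossentropy([1, 0], [0.1, 0.9])

print(loss_confident, loss_wrong)
```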


In [None]:
history = imdb_model.fit(
    train_text, 
    train_y, 
    epochs=200,
    shuffle=True,
    validation_data=(test_text, test_y)  # Keras expects a tuple (x_val, y_val)
)

utils.plot_history(history, ['loss', 'val_loss'])

utils.plot_history(history, ['acc', 'val_acc'])

imdb_model.evaluate(train_text, train_y)
imdb_model.evaluate(test_text, test_y)

Train on 25000 samples, validate on 25000 samples
Epoch 1/200
...
Epoch 65/200

## Callbacks

Callbacks (aka hooks) are functions called at set points during training, by default once per epoch, that help you monitor and log the training process. We will use two common callbacks here: `EarlyStopping` and `ModelCheckpoint`. The first stops training once the monitored metric has stopped improving, which limits overfitting. The second saves the best model found so far.

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.callbacks import ModelCheckpoint

In [None]:
early_stoppping_hook = EarlyStopping(
    monitor='val_loss',  # what metrics to track
    patience=20,  # maximum number of epochs allowed without improvement on the monitored metric
)

CPK_PATH = 'model_cpk.hdf5'    # path to store checkpoint

model_cpk_hook = ModelCheckpoint(
    CPK_PATH,
    monitor='val_loss', 
    save_best_only=True,  # Only keep the best model
)
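
To make the `patience` setting concrete, here is a framework-free sketch of the early-stopping logic, simplified from what Keras does internally. `simulate_early_stopping` is a hypothetical helper and the validation losses are made-up numbers. Training stops once the loss has failed to improve for `patience` consecutive epochs, and the best epoch is the one a `save_best_only` checkpoint would have kept.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (epoch at which training stops, epoch with the best loss), 1-based."""
    best_loss = float('inf')
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember it (a checkpoint would be saved here)
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # stop early
    return len(val_losses), best_epoch    # ran through all epochs


stopped_at, best_epoch = simulate_early_stopping(
    [0.60, 0.50, 0.52, 0.55, 0.40], patience=2)
print(stopped_at, best_epoch)
```

In this toy run, training stops before the final, lower loss is ever reached, which is the trade-off of a small `patience`. A larger value such as 20 tolerates longer plateaus.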

## Train the Model, Hope for the Best

In [None]:
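history = imdb_model.fit(
    train_text,
    train_y,
    epochs=200,
    shuffle=True,
    validation_data=(test_text, test_y),  # Keras expects a tuple (x_val, y_val)
    callbacks=[early_stoppping_hook, model_cpk_hook]  # enable early stopping and checkpointing
)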
print('Training finished')

## Evaluation

Load the best checkpointed weights and evaluate the model:

In [None]:
# Load the model checkpoint
imdb_model.load_weights(CPK_PATH)

# Accuracy on validation 
imdb_model.evaluate(test_text, test_y)
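
`evaluate` returns the loss followed by the compiled metrics, here accuracy. As a rough sketch of how binary accuracy is derived from the sigmoid outputs, the example below thresholds made-up probabilities at 0.5. `accuracy_from_probs` is an illustrative helper, not part of Keras.

```python
def accuracy_from_probs(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    preds = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(pred == y) for pred, y in zip(preds, y_true))
    return correct / len(y_true)


# Made-up labels and predicted probabilities (illustrative only)
toy_acc = accuracy_from_probs([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6])
print(toy_acc)
```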