# IMDB Movie Review analysis

The IMDB Movie Review Dataset is a standard dataset for text classification and sentiment analysis, in which each document (a movie review) is labeled as either positive or negative. The dataset and its description can be found at:

https://www.kaggle.com/iarunava/imdb-movie-reviews-dataset

In this exercise, you are to develop two text classification models for this dataset, one using a vanilla RNN and one using an LSTM.

For each of the models, you only need to consider the architecture in which the recurrent units are connected as a chain. As the extracted text feature, you can take either the final state/output of the chain or a mean pooling of all outputs of the recurrent units. The extracted feature is then passed to a logistic regression classifier. Each input word in a document must enter the RNN models as a word embedding vector. The word embedding vector can be pre-trained, which you can download (e.g., from https://nlp.stanford.edu/projects/glove for vectors trained with GloVe), or randomly initialized.
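
The two feature-extraction options described above can be sketched in plain Python, independent of any deep learning library. The function names (`rnn_outputs`, `final_state`, `mean_pool`, `logistic`) and the toy weights are illustrative assumptions, not part of the assignment.

```python
import math

def rnn_outputs(embedded, W_x, W_h, b):
    """Run a chain of tanh recurrent units; return the hidden state at every step."""
    state_dim = len(b)
    h = [0.0] * state_dim  # initial state
    outputs = []
    for x in embedded:  # one embedding vector per word
        h = [
            math.tanh(
                sum(W_x[j][k] * x[k] for k in range(len(x)))
                + sum(W_h[j][k] * h[k] for k in range(state_dim))
                + b[j]
            )
            for j in range(state_dim)
        ]
        outputs.append(h)
    return outputs

def final_state(outputs):
    """Option 1: use the last hidden state as the text feature."""
    return outputs[-1]

def mean_pool(outputs):
    """Option 2: average the hidden states over all time steps."""
    return [sum(column) / len(outputs) for column in zip(*outputs)]

def logistic(feature, w, b):
    """Logistic regression on the extracted feature."""
    z = sum(wi * fi for wi, fi in zip(w, feature)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy example: 3 words, embedding dimension 2, state dimension 2.
embedded = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
W_x = [[0.5, -0.1], [0.2, 0.3]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]
outputs = rnn_outputs(embedded, W_x, W_h, b)
p_last = logistic(final_state(outputs), w=[1.0, -1.0], b=0.0)
p_mean = logistic(mean_pool(outputs), w=[1.0, -1.0], b=0.0)
print(p_last, p_mean)
```

Both options produce a feature of the same size (the state dimension), so the classifier on top does not change between them.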

You can use either the TensorFlow or the PyTorch deep learning library in this homework.

A key hyper-parameter in the setup of your models is the state dimension, for which you should investigate the following options: 20, 50, 100, 200, 500. For each setting of the state dimension, tune the remaining hyper-parameters of each model to obtain the best classification result (on the testing set), and report these results in a table. In your report, also describe the setup and hyper-parameter settings of each model. Submit your report together with your code in a single zip file.
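
Selecting the best run per state dimension and laying the results out as a table can be sketched as follows. The helpers `best_per_state` and `to_markdown` are hypothetical, and the `runs` records contain made-up placeholder numbers for illustration only, not results from this notebook.

```python
def best_per_state(runs):
    """Keep the highest-accuracy run for each state dimension."""
    best = {}
    for run in runs:
        state = run["state"]
        if state not in best or run["acc"] > best[state]["acc"]:
            best[state] = run
    return best

def to_markdown(best):
    """Format the best runs as a markdown table, one row per state dimension."""
    lines = ["| State | LR | Batch | Dropout | Test acc. |",
             "|---|---|---|---|---|"]
    for state in sorted(best):
        lines.append(
            "| {state} | {lr} | {batch} | {dropout} | {acc:.3f} |".format(**best[state])
        )
    return "\n".join(lines)

# Placeholder records (hypothetical numbers, for illustration only).
runs = [
    {"state": 20, "lr": 0.01, "batch": 200, "dropout": 0.1, "acc": 0.50},
    {"state": 20, "lr": 0.001, "batch": 100, "dropout": 0.2, "acc": 0.60},
    {"state": 50, "lr": 0.01, "batch": 200, "dropout": 0.1, "acc": 0.55},
]
table = to_markdown(best_per_state(runs))
print(table)
```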

## Loading dependencies

In [10]:
from gensim.models import KeyedVectors
import gensim
import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Embedding, Dense, Flatten
from keras.layers import Input, LSTM, Dropout, SimpleRNN
from keras.models import Sequential, Model
from keras import optimizers
import matplotlib
import matplotlib.pyplot as plt
import os
import numpy as np
from pprint import pprint
import gc

## Loading data 

In [13]:
# training data loading
reviews = os.walk("./aclImdb/train/")
texts = []
ratings = []
for path, dir_list, file_list in reviews:
    for file_name in file_list:
        file_name = os.path.join(path, file_name)
        if '.txt' not in file_name:
            continue
        with open(file_name, "r", encoding='utf-8') as f:
            text = []
            for line in f:
                # tokenize and lowercase the line, accumulating the words of one review
                text += gensim.utils.simple_preprocess(line)
        # file names look like <id>_<rating>.txt; keep the rating part
        rating = file_name.split('_')[1]
        rating = rating.split('.')[0]
        texts.append(text)
        ratings.append(rating)

print(len(texts))
print(len(ratings))
print(texts[0][:5])
print(ratings[0])

X_train = texts
y_train = ratings
# del texts,ratings
# gc.collect()

25000
25000
['story', 'of', 'man', 'who', 'has']
3
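
Review files in `aclImdb` are named `<id>_<rating>.txt`, which is what the `split` calls above rely on. A slightly more defensive way to extract the rating is sketched here with a hypothetical helper, `rating_from_path`: it parses only the base name, so underscores elsewhere in the path cannot interfere, and it returns `None` for files that do not match the pattern. The example paths are illustrative.

```python
import os
import re

# <id>_<rating>.txt, both parts numeric
_REVIEW_NAME = re.compile(r"^(\d+)_(\d+)\.txt$")

def rating_from_path(path):
    """Return the integer rating encoded in a review file name, or None."""
    match = _REVIEW_NAME.match(os.path.basename(path))
    return int(match.group(2)) if match else None

print(rating_from_path("./aclImdb/train/neg/0_3.txt"))   # 3
print(rating_from_path("./aclImdb/train/urls_neg.txt"))  # None
```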


In [15]:
# test data loading
reviews = os.walk("./aclImdb/test/")
texts = []
ratings = []
for path, dir_list, file_list in reviews:
    for file_name in file_list:
        file_name = os.path.join(path, file_name)
        if '.txt' not in file_name:
            continue
        with open(file_name, "r", encoding='utf-8') as f:
            text = []
            for line in f:
                # do some pre-processing and combine list of words for each review text
                text += gensim.utils.simple_preprocess(line)
        # file names look like <id>_<rating>.txt; keep the rating part
        rating = file_name.split('_')[1]
        rating = rating.split('.')[0]
        texts.append(text)
        ratings.append(rating)

print(len(texts))
print(len(ratings))
print(texts[0][:5])
print(ratings[0])

X_test = texts
y_test = ratings
# del texts,ratings
# gc.collect()

25000
25000
['once', 'again', 'mr', 'costner', 'has']
2


## Preprocessing data 

In [16]:
text_all = list(X_train + X_test)
labels_all = list(y_train + y_test)
labels_all = [int(a)>= 7 for a in labels_all]
del X_train, X_test, y_train, y_test
gc.collect()
#tokenization, maximum length 500
length_max=500
tok = Tokenizer()
tok.fit_on_texts(text_all)
word_index = tok.word_index
sequences = tok.texts_to_sequences(text_all)
# print('Found %s unique tokens.' % len(word_index))
data = pad_sequences(sequences, padding='post', maxlen=length_max)
# express labels with one-hot matrix
labels = to_categorical(np.asarray(labels_all))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)
print(labels[0])
X_train, y_train = data[:25000], labels[:25000]
X_test, y_test = data[25000:], labels[25000:]

# del text_all, labels_all
# gc.collect()

Shape of data tensor: (50000, 500)
Shape of label tensor: (50000, 2)
[1. 0.]
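
To make the effect of `pad_sequences(sequences, padding='post', maxlen=length_max)` concrete: with the Keras defaults, sequences longer than `maxlen` are truncated from the front (`truncating='pre'`), while shorter ones are zero-padded at the end. The pure-Python sketch below mimics that behaviour together with the one-hot label encoding; `pad_post` and `one_hot_label` are illustrative helpers, not Keras functions.

```python
def pad_post(seq, maxlen, value=0):
    """Keep the last `maxlen` tokens, then right-pad with `value`."""
    kept = seq[-maxlen:]
    return kept + [value] * (maxlen - len(kept))

def one_hot_label(rating, threshold=7):
    """Return [1.0, 0.0] for a negative review and [0.0, 1.0] for a positive one."""
    positive = int(rating) >= threshold
    return [0.0, 1.0] if positive else [1.0, 0.0]

print(pad_post([5, 8, 2], maxlen=5))           # [5, 8, 2, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
print(one_hot_label("3"))                      # [1.0, 0.0]
```

One consequence is that for reviews longer than 500 tokens the model sees only the final 500 words.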


## Embedding matrix setup

In [19]:
embeddings_index = {}
glove = './glove.6B/glove.6B.50d.txt'
with open(glove, "r", encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
   

print('There are {0} word vectors and one vector for unknown word.'.format(
       len(embeddings_index)-1))


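embedding_dim = 50
# Row 0 stays all-zero for the padding index; row i holds the vector of word i.
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is None:
        # Word not in GloVe: fall back to the vector of the token "unknown".
        embedding_vector = embeddings_index.get("unknown")
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector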
del embeddings_index
gc.collect()

There are 400000 word vectors and one vector for unknown word.


38
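
Words that are missing from GloVe do not get a vector of their own, so it can be worth checking how much of the vocabulary is actually covered. The sketch below uses a tiny in-memory stand-in for the GloVe file; `parse_glove_lines`, `build_matrix`, and the toy vectors are assumptions for illustration.

```python
def parse_glove_lines(lines):
    """Parse 'word v1 v2 ...' lines into a word -> vector dictionary."""
    index = {}
    for line in lines:
        values = line.split()
        index[values[0]] = [float(v) for v in values[1:]]
    return index

def build_matrix(word_index, embeddings, dim):
    """Row i holds the vector of the word with index i; row 0 is the padding row."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    hits = 0
    for word, i in word_index.items():
        vector = embeddings.get(word)
        if vector is not None:
            matrix[i] = vector
            hits += 1
    return matrix, hits / len(word_index)

toy_glove = ["good 0.1 0.2", "bad -0.1 -0.2"]
toy_word_index = {"good": 1, "bad": 2, "costner": 3}
matrix, coverage = build_matrix(toy_word_index, parse_glove_lines(toy_glove), dim=2)
print(matrix)
print(round(coverage, 2))
```

A coverage close to zero would indicate that the lookup is not filling the matrix at all.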

## Vanilla RNN

In [48]:
def vanilla_rnn(num_words, state, lra, dropout, num_outputs=2, emb_dim=50, input_length=500):
    model = Sequential()
    model.add(Embedding(input_dim=num_words + 1, output_dim=emb_dim, input_length=input_length, trainable=False, weights=[embedding_matrix]))
    model.add(SimpleRNN(units=state, return_sequences=False))
    model.add(Dropout(dropout))
    model.add(Dense(num_outputs, activation='sigmoid'))
    rmsprop = optimizers.RMSprop(lr = lra)
    model.compile(loss = 'binary_crossentropy', optimizer = rmsprop, metrics = ['accuracy'])
    return model

## LSTM

In [49]:
def lstm_rnn(num_words, state, lra, dropout, num_outputs=2, emb_dim=50, input_length=500):
    model = Sequential()
    model.add(Embedding(input_dim=num_words + 1, output_dim=emb_dim, input_length=input_length, trainable=False, weights=[embedding_matrix]))
    model.add(LSTM(state))
    model.add(Dropout(dropout))
    model.add(Dense(num_outputs, activation='sigmoid'))
    rmsprop = optimizers.RMSprop(lr = lra)
    model.compile(loss='binary_crossentropy', optimizer=rmsprop, metrics=['accuracy'])
    return model
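
Since the state dimension is the main hyper-parameter here, it helps to see how it drives model size. Assuming the standard Keras layer definitions (one bias vector per gate, no peephole connections), the trainable parameter counts of the recurrent and dense layers can be computed as below. The function names are illustrative, and the frozen embedding layer is excluded.

```python
def simple_rnn_params(state, emb_dim):
    """Input weights + recurrent weights + bias of a single tanh recurrent layer."""
    return state * (state + emb_dim) + state

def lstm_params(state, emb_dim):
    """An LSTM has four such weight sets: input, forget, and output gates plus the cell candidate."""
    return 4 * simple_rnn_params(state, emb_dim)

def dense_params(state, num_outputs=2):
    """Weights + bias of the classifier layer on top of the extracted feature."""
    return state * num_outputs + num_outputs

for state in [20, 50, 100, 200, 500]:
    print(state, simple_rnn_params(state, 50), lstm_params(state, 50), dense_params(state))
```

The recurrent weight count grows quadratically with the state dimension, which is one reason the larger settings train noticeably more slowly.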

## Run model 

In [34]:
def runModel(state, lr, batch, dropout, model, epoch=5):
    num_words = len(word_index)
    if model == "lstm":
        model = lstm_rnn(num_words, state, lr, dropout)
    elif model == "vanilla":
        model = vanilla_rnn(num_words, state, lr, dropout)
        epoch = 10  # the vanilla RNN is trained for more epochs
    else:
        raise ValueError("unknown model type: %s" % model)

    #model.summary()
    # Note: the test set doubles as validation data here.
    history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=epoch, batch_size=batch, verbose=1)

    testscore = model.evaluate(X_test, y_test, verbose=0)
    print('Test loss:', testscore[0])
    print('Test accuracy:', testscore[1])

    return [history.history, testscore]
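
With a two-unit sigmoid output and `binary_crossentropy`, Keras may resolve the metric string `'accuracy'` to an element-wise binary accuracy rather than an argmax-based one. This depends on the Keras version, so treat it as an assumption to verify. The sketch below, using hypothetical helper names and toy predictions, shows how the two definitions can differ.

```python
def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of individual output units on the correct side of the threshold."""
    correct = total = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            correct += int((p > threshold) == (t > threshold))
            total += 1
    return correct / total

def argmax_accuracy(y_true, y_pred):
    """Fraction of samples whose highest-scoring class matches the label."""
    hits = sum(
        max(range(len(p)), key=p.__getitem__) == max(range(len(t)), key=t.__getitem__)
        for t, p in zip(y_true, y_pred)
    )
    return hits / len(y_true)

y_true = [[1.0, 0.0], [0.0, 1.0]]
y_pred = [[0.9, 0.8], [0.2, 0.3]]  # independent sigmoid units can be high or low together
print(binary_accuracy(y_true, y_pred))  # 0.5
print(argmax_accuracy(y_true, y_pred))  # 1.0
```

When reporting results, it is worth stating which of the two definitions the accuracy figure refers to.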

## tune

In [35]:
def hypruns(state, comb, repeats, model):
    history = []
    testscore = []

    for i in range(repeats):
        l, b, d = comb
        print("state %s, lr %s, batch %s, dropout %s." %(state, l, b, d))
        res = runModel(state, l, b, d, model)
        
        if res:
            history.append(res[0])
            testscore.append(res[1])
    
    # take avg of testscore
    testscore = list(np.mean(np.array(testscore), axis=0))
    hyps = [state] + comb
    
    return [history, testscore, hyps]

In [36]:
def tunehyps(states, combs, repeats, model):
    res = []
    hist = []
    # make sure the output directory for this model exists
    out_dir = './experiments/' + model
    os.makedirs(out_dir, exist_ok=True)
    for state in states:
        for comb in combs:
            history, testscore, hyps = hypruns(state, comb, repeats, model)
            res.append(testscore + hyps)
            hist.append(history)
        s = str(res)
        d = str(hist)
        print('res', s)
        print('hist', d)

        # save testscore to file
        with open(out_dir + '/testscore_state_' + str(state) + '.txt', 'w', encoding="utf-8") as fout:
            fout.write(s)

        # save history to file
        with open(out_dir + '/history_state_' + str(state) + '.txt', 'w', encoding="utf-8") as fout:
            fout.write(d)
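
Results written as the string form of a Python list are awkward to read back. One alternative, sketched here under the assumption that the scores and hyper-parameters are JSON-serializable numbers, is to store each state's results as JSON. `save_results` and `load_results` are hypothetical helpers, the row values are placeholders, and the example writes into a temporary directory rather than `./experiments/`.

```python
import json
import os
import tempfile

def save_results(out_dir, model_name, state, results):
    """Write one state's results to <out_dir>/<model_name>/testscore_state_<state>.json."""
    os.makedirs(os.path.join(out_dir, model_name), exist_ok=True)
    path = os.path.join(out_dir, model_name, "testscore_state_{}.json".format(state))
    with open(path, "w", encoding="utf-8") as fout:
        json.dump(results, fout, indent=2)
    return path

def load_results(path):
    """Read the results back as nested Python lists."""
    with open(path, "r", encoding="utf-8") as fin:
        return json.load(fin)

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder row: [loss, accuracy, state, lr, batch, dropout]
    rows = [[0.69, 0.5, 20, 0.01, 200, 0.1]]
    path = save_results(tmp, "lstm", 20, rows)
    restored = load_results(path)
    print(restored == rows)
```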

In [53]:
states = [20, 50, 100, 200, 500]
lrs = [0.1, 0.01, 0.001]
batches = [100, 200, 500]
dropouts = [0.1, 0.2, 0.5]
repeats = 1
models = ["lstm", "vanilla"]

np.random.seed(42)
# full grid of [lr, batch, dropout] combinations
combs = [[l, b, d] for l in lrs for b in batches for d in dropouts]
# override: restrict this run to a single combination
combs = [[0.01, 200, 0.1]]
for model in models:
    tunehyps(states, combs, repeats, model)

state 20, lr 0.01, batch 200, dropout 0.1.
Train on 25000 samples, validate on 25000 samples
Epoch 1/5
 3200/25000 [==>...........................] - ETA: 1:03 - loss: 0.6939 - accuracy: 0.5017

KeyboardInterrupt: 
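
The full grid of (learning rate, batch size, dropout) combinations for the values listed above can be enumerated compactly with `itertools.product`. This is a sketch of the enumeration only, with no training involved.

```python
from itertools import product

lrs = [0.1, 0.01, 0.001]
batches = [100, 200, 500]
dropouts = [0.1, 0.2, 0.5]

# One [lr, batch, dropout] list per grid point, in the same order as nested loops.
combs = [list(c) for c in product(lrs, batches, dropouts)]
print(len(combs))  # 27
print(combs[0])    # [0.1, 100, 0.1]
```

With five state dimensions and two model types, the full grid amounts to 270 training runs, so running a subset of combinations first is a reasonable way to narrow the search.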