

## Introduction

**Project name:** Sentiment analysis for IMDB reviews <br>
Author: Tomasz Ostaszewicz <br>
Implementation: pandas, matplotlib, scikit-learn, keras.

**Description:** This notebook performs sentiment analysis of IMDB movie reviews. The aim is to predict whether a review from IMDB is positive or negative (a binary classification problem). Several preprocessing schemes were combined with classical machine learning and deep learning models, including:

1. Classical machine learning models:
    1. Data preprocessing: removing stop words, removing punctuation, expanding contractions, stemming
    2. Input data from: count features, TF-IDF features
    3. Models: Logistic Regression, SVM, Random Forest, Naive Bayes, XGBoost
2. Deep learning models:
    1. Data preprocessing: removing punctuation
    2. Input data from: embeddings (from the GloVe model)
    3. Models: Simple RNN, LSTM, LSTM Bidirectional, LSTM Bidirectional with dropout, GRU, GRU Bidirectional, GRU Bidirectional with dropout

Conclusions of the analysis are presented at the end of each section.

import numpy as np
import pandas as pd
import nltk
import string
import re
import matplotlib.pyplot as plt

from sklearn.base import TransformerMixin, ClassifierMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.metrics import accuracy_score, roc_auc_score

from xgboost.sklearn import XGBClassifier

import itertools
import os
import IPython

pd.set_option('display.max_rows', 1000)

## Downloading input data

Create directories for input and output data.

def create_dir(d):
    if not os.path.exists(d):
        os.makedirs(d)
    return None

create_dir("01 Input data")
create_dir("02 Output data")

Download data manually from Kaggle into subfolder "01 Input data" created above.
Use this link to download data: https://www.kaggle.com/c/word2vec-nlp-tutorial/data
Two files are needed: labeledTrainData.tsv.zip, testData.tsv.zip
Extract those files into "01 Input data" subdirectory.
Names of extracted files are: labeledTrainData.tsv, testData.tsv

Manually download the file glove.6B.zip with GloVe vectors into the subfolder "01 Input data" created above.
Use this link to download data: http://nlp.stanford.edu/data/glove.6B.zip
Extract file: glove.6B.300d.txt into "01 Input data" subdirectory.
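
The extraction steps above can also be scripted. The following is a minimal sketch using the standard-library `zipfile` module. The helper name `extract_member` is hypothetical, and the sketch assumes the archives have already been downloaded into "01 Input data".

```python
import os
import zipfile


def extract_member(zip_path, member, target_dir):
    """Extract a single file from a zip archive into target_dir and return its path."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.extract(member, path=target_dir)


# Assumed usage (the archive must already exist in the input directory):
# extract_member("01 Input data/glove.6B.zip", "glove.6B.300d.txt", "01 Input data")
```

Extracting only the needed member avoids unpacking the other GloVe files contained in the same archive.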

## Input data analysis

Read data.

kaggle_input_data_train = pd.read_csv('01 Input data/labeledTrainData.tsv', sep='\t', index_col = 'id')
kaggle_input_data_test = pd.read_csv('01 Input data/testData.tsv', sep='\t', index_col = 'id')

# Checking data types for train (info() prints its report itself and returns None)
kaggle_input_data_train.info()

# Checking individual rows
print(kaggle_input_data_train.head(5))
print(kaggle_input_data_train.sample(5))
print(kaggle_input_data_train.tail(5))

# Checking data types for test
kaggle_input_data_test.info()

# Checking individual rows
print(kaggle_input_data_test.head(5))
print(kaggle_input_data_test.sample(5))
print(kaggle_input_data_test.tail(5))

**Conclusions from reading the train and test data:**
1. The TSV files are read correctly.
2. There are no null values in the review and sentiment columns.

Target variable distribution check.

print(kaggle_input_data_train.sentiment.value_counts())
print(kaggle_input_data_train.sentiment.value_counts(normalize=True))

Conclusion:
Each target class accounts for 50% of the observations, so there is no class imbalance problem in the data.

## Sample split

Assumptions about sample split:
1. Number of observations in the labeled Kaggle training data: 25 000
2. Split:
   a. 20 000 observations for the training sample,
   b. 5 000 observations for the hold-out sample
3. Training is done on 20 000 observations using 3-fold cross-validation for classical models
4. The test dataset from Kaggle (kaggle_input_data_test) does not contain the target variable. It will be used as an additional independent verification of a few of the best models

# Stratified split into the training and hold-out samples
X_train, X_test, y_train, y_test = train_test_split(kaggle_input_data_train.review, kaggle_input_data_train.sentiment,
                                                    test_size=5000, random_state=123,
                                                    stratify=kaggle_input_data_train.sentiment)

print(y_train.shape, y_train.sum())
print(y_test.shape, y_test.sum())
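
To illustrate what the `stratify` argument does in the split above, here is a small sketch on synthetic labels. The arrays `X_toy` and `y_toy` are made up for illustration and are not part of the IMDB data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 60 negatives and 40 positives (synthetic data)
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_a, X_b, y_a, y_b = train_test_split(
    X_toy, y_toy, test_size=20, random_state=123, stratify=y_toy
)

# With stratification both parts keep the 40% share of positives
print(y_a.mean(), y_b.mean())
# 0.4 0.4
```

Without `stratify`, the class shares in the two parts would vary with the random seed.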

## Data preprocessing

Assumption:
Data preprocessing is done once on the whole training sample, as a separate step, to speed up cross-validation.
Data preprocessing, especially stemming, is time-consuming.

Create dict that will be used for expanding contractions.

# This contraction dict was copied from:
# https://gist.github.com/nealrs/96342d8231b75cf4bb82

contractons_dict = {
  "ain't": "am not",
  "aren't": "are not",
  "can't": "cannot",
  "can't've": "cannot have",
  "'cause": "because",
  "could've": "could have",
  "couldn't": "could not",
  "couldn't've": "could not have",
  "didn't": "did not",
  "doesn't": "does not",
  "don't": "do not",
  "hadn't": "had not",
  "hadn't've": "had not have",
  "hasn't": "has not",
  "haven't": "have not",
  "he'd": "he would",
  "he'd've": "he would have",
  "he'll": "he will",
  "he'll've": "he will have",
  "he's": "he is",
  "how'd": "how did",
  "how'd'y": "how do you",
  "how'll": "how will",
  "how's": "how is",
  "i'd": "i would",
  "i'd've": "i would have",
  "i'll": "i will",
  "i'll've": "i will have",
  "i'm": "i am",
  "i've": "i have",
  "isn't": "is not",
  "it'd": "it had",
  "it'd've": "it would have",
  "it'll": "it will",
  "it'll've": "it will have",
  "it's": "it is",
  "let's": "let us",
  "ma'am": "madam",
  "mayn't": "may not",
  "might've": "might have",
  "mightn't": "might not",
  "mightn't've": "might not have",
  "must've": "must have",
  "mustn't": "must not",
  "mustn't've": "must not have",
  "needn't": "need not",
  "needn't've": "need not have",
  "o'clock": "of the clock",
  "oughtn't": "ought not",
  "oughtn't've": "ought not have",
  "shan't": "shall not",
  "sha'n't": "shall not",
  "shan't've": "shall not have",
  "she'd": "she would",
  "she'd've": "she would have",
  "she'll": "she will",
  "she'll've": "she will have",
  "she's": "she is",
  "should've": "should have",
  "shouldn't": "should not",
  "shouldn't've": "should not have",
  "so've": "so have",
  "so's": "so is",
  "that'd": "that would",
  "that'd've": "that would have",
  "that's": "that is",
  "there'd": "there had",
  "there'd've": "there would have",
  "there's": "there is",
  "they'd": "they would",
  "they'd've": "they would have",
  "they'll": "they will",
  "they'll've": "they will have",
  "they're": "they are",
  "they've": "they have",
  "to've": "to have",
  "wasn't": "was not",
  "we'd": "we had",
  "we'd've": "we would have",
  "we'll": "we will",
  "we'll've": "we will have",
  "we're": "we are",
  "we've": "we have",
  "weren't": "were not",
  "what'll": "what will",
  "what'll've": "what will have",
  "what're": "what are",
  "what's": "what is",
  "what've": "what have",
  "when's": "when is",
  "when've": "when have",
  "where'd": "where did",
  "where's": "where is",
  "where've": "where have",
  "who'll": "who will",
  "who'll've": "who will have",
  "who's": "who is",
  "who've": "who have",
  "why's": "why is",
  "why've": "why have",
  "will've": "will have",
  "won't": "will not",
  "won't've": "will not have",
  "would've": "would have",
  "wouldn't": "would not",
  "wouldn't've": "would not have",
  "y'all": "you all",
  "y'alls": "you alls",
  "y'all'd": "you all would",
  "y'all'd've": "you all would have",
  "y'all're": "you all are",
  "y'all've": "you all have",
  "you'd": "you had",
  "you'd've": "you would have",
  "you'll": "you will",
  "you'll've": "you will have",
  "you're": "you are",
  "you've": "you have"
}

Extend the contraction dict with other quote characters that are used in reviews.

contractons_dict_ext = {}
for k, v in contractons_dict.items():
    contractons_dict_ext[k] = v
    contractons_dict_ext[k.replace("'", "`")] = v
    contractons_dict_ext[k.replace("'", "´")] = v

def rep_contr(match):
    'Function expands contractions using dictionary: contractons_dict_ext'
    return contractons_dict_ext[match.group(0)]
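
When contractions are expanded with a single regular expression built from dictionary keys, the order of the alternatives matters: Python's `re` module tries alternatives from left to right, so a short key such as "can't" listed before "can't've" matches first. The sketch below shows one way to handle this by sorting keys longest-first and escaping them. The dict `toy_contractions` and the helper `expand` are hypothetical and used only for illustration.

```python
import re

toy_contractions = {  # hypothetical subset, for illustration only
    "can't": "cannot",
    "can't've": "cannot have",
    "i'm": "i am",
}

# Longest keys first, so "can't've" is tried before its prefix "can't"
keys_longest_first = sorted(toy_contractions, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(k) for k in keys_longest_first))


def expand(text):
    """Replace every contraction found in text with its expanded form."""
    return pattern.sub(lambda m: toy_contractions[m.group(0)], text)


print(expand("i'm sure they can't've seen it"))
# i am sure they cannot have seen it
```

Compiling the pattern once also avoids rebuilding it for every review.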

class TextStemmer(TransformerMixin):
    def __init__(self, stemmer = 'nltk_porter_stemmer', stop_words_list=nltk.corpus.stopwords.words('english'), punctuation_list=string.punctuation, expand_contractions = True):
        self.params_ = {'stemmer': stemmer, 'stop_words_list': stop_words_list, 'punctuation_list': punctuation_list, 'expand_contractions': expand_contractions}
        
    def fit(self, X, y=None, **kwargs):
        return self
        
    def transform(self, X,  **kwargs):
        
        # make text lower case
        X = [text.lower() for text in X] 
        
        if self.params_['stemmer'] == None and self.params_['stop_words_list'] == None \
            and self.params_['punctuation_list'] == None \
            and (self.params_['expand_contractions'] == None or self.params_['expand_contractions'] == False):
            # if there aren't any transformations return X
                return X
        else:
            
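            # expand contractions; longer keys are tried first so that e.g. "can't've" is not
            # matched as "can't", and keys are escaped so that they are treated literally
            if not (self.params_['expand_contractions'] == None or self.params_['expand_contractions'] == False):
                contr_pattern = re.compile('|'.join(re.escape(k) for k in sorted(contractons_dict_ext, key=len, reverse=True)))
                X = [contr_pattern.sub(rep_contr, text) for text in X]
            
            # split text into separate words and remove the single quote sign (') if it is the first character of a word. Change words to lower case
            X =[[word[1:].lower() if word.startswith("'") else word.lower() for word in nltk.word_tokenize(text)] for text in X]            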
            
          
            # remove punctuation
            if self.params_['punctuation_list'] != None:
                X = [[word for word in text if word not in self.params_['punctuation_list']] for text in X]
            
            # remove stopwords
            if self.params_['stop_words_list'] != None:
                X = [[word for word in text if word not in self.params_['stop_words_list']] for text in X]
                
            # use stemmer
            if self.params_['stemmer'] == 'nltk_porter_stemmer':
                stemmer = nltk.PorterStemmer()
                X = [[stemmer.stem(word) for word in text] for text in X]
            
            # put separate words together into text
            X = [" ".join(text) for text in X]
            
            return X

Define pipelines for data preprocessing.

pipe_dict_preproc_data = {}

stemmer_dict = {
               'expcontr_rempunc_remsw_stemmer' : TextStemmer(expand_contractions=True, punctuation_list=string.punctuation, stop_words_list=nltk.corpus.stopwords.words('english'), stemmer = 'nltk_porter_stemmer' ),
               'expcontr_norempunc_noremsw_nostemmer' : TextStemmer(expand_contractions=True, punctuation_list=None, stop_words_list=None, stemmer=None),
               'noexpcontr_norempunc_noremsw_nostemmer' : TextStemmer(expand_contractions=None, punctuation_list=None, stop_words_list=None, stemmer=None)
               }

vectorizer_dict = {
                  'countvect5000' : CountVectorizer(max_features=5000, min_df=0.001, max_df=0.5),
                  'countvect1000' : CountVectorizer(max_features=1000, min_df=0.001, max_df=0.5),
                  'tfidvect5000' : TfidfVectorizer(max_features=5000, min_df=0.001, max_df=0.5),
                  'tfidvect1000' : TfidfVectorizer(max_features=1000, min_df=0.001, max_df=0.5)
                  }

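import copy

for stemmer_key, stemmer_obj in stemmer_dict.items():
    for vectorizer_key, vectorizer_obj in vectorizer_dict.items():
        
        pipe_key = stemmer_key + '_' + vectorizer_key
        # each pipeline gets its own copy of the vectorizer; a shared instance would be refitted
        # by every pipeline that uses it, so only the last fitted vocabulary would survive
        pipe = Pipeline([('stemmer', stemmer_obj), ('vectorizer', copy.deepcopy(vectorizer_obj))])
        pipe_dict_preproc_data[pipe_key] = pipe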


# Example of a manual entry in the dict pipe_dict_preproc_data (the loop above is used for convenience)
# pipe_key = 'stemmer_sw_punc_tfidvect5000'
# pipe = Pipeline([('stemmer', TextStemmer(stemmer=None)), ('vectorizer', TfidfVectorizer(max_features=5000, min_df=0.001, max_df=0.5))])
# pipe_dict_preproc_data[pipe_key] = pipe

Use the pipelines defined above to transform the data into the form used for model estimation.

X_train_dict_preprocessed = {k : v.fit_transform(X_train) for (k, v) in pipe_dict_preproc_data.items()}
X_test_dict_preprocessed = {k : v.transform(X_test) for (k, v) in pipe_dict_preproc_data.items()}
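
The two dictionary comprehensions above call `fit_transform` on the training sample and only `transform` on the hold-out sample. The sketch below illustrates why, on made-up documents: the vocabulary is learned from the training texts alone, and words seen only in the hold-out texts are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["good movie", "bad movie"]   # toy data
test_docs = ["good plot"]                  # "plot" never appears in training

vect = CountVectorizer()
X_tr = vect.fit_transform(train_docs)   # learns the vocabulary: bad, good, movie
X_te = vect.transform(test_docs)        # reuses it; unseen words are dropped

print(sorted(vect.vocabulary_))
# ['bad', 'good', 'movie']
print(X_te.toarray())
# [[0 1 0]]
```

Both matrices therefore share the same columns, which is what a model fitted on the training matrix expects.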

## Classic model estimation setting

Define dictionaries to store estimated models.

gs_dict_gs = {}
gs_dict_best_estimator = {}
gs_dict_best_estimator_measures = {}
df_dict_gs_detail_res = {}
df_classic_models_res = None

def run_grid_search(pipe_key, pipe, grid_params):
    'Function runs grid search and collects results'
    
    for data_key, X_train_preproc in X_train_dict_preprocessed.items():

        dict_key = data_key + ' & ' + pipe_key
        print("Estimating: ", data_key, pipe_key)

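        # cv=3 is set explicitly to match the 3-fold cross-validation stated in the sample split assumptions
        grid_search = GridSearchCV(pipe, grid_params, cv=3, n_jobs=-1, scoring={'accuracy_score' : 'accuracy', 'roc_auc_score': 'roc_auc' }, refit='accuracy_score', return_train_score=True)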
        %timeit -r 1 grid_search.fit(X_train_preproc, y_train)

        # save grid search results
        gs_dict_gs[dict_key] = grid_search
        gs_dict_best_estimator[dict_key] = grid_search.best_estimator_
        gs_dict_best_estimator_measures[dict_key] = grid_search.best_score_
        
        df_dict_gs_detail_res[pipe_key] = detail_est_res_to_df(gs_dict_gs, pipe_key)
    
    IPython.display.display(df_dict_gs_detail_res[pipe_key])
    

def detail_est_res_to_df(dict_est_results, pipe_key, gs_dict_gs=gs_dict_gs):
    '''
    Function converts dictionary with grid search results to pandas data frame in order to 
    analyze grid search parameters vs model performance
    '''
    
    # filter input dictionary by pipe_key
    dict_est_results = {k : v for (k,v) in dict_est_results.items() if k.endswith('& ' + pipe_key) == True}
   
    df_list = []
    # for each gs result create data frame containing gs parameters and measures values
    for dict_key, gs_res in dict_est_results.items():
        df_list.append(
                pd.concat((
                    pd.Series([dict_key.split('&')[0].strip() for x in range(len(dict_est_results[dict_key].cv_results_['params']))], name='input_data'),
                    pd.Series([dict_key.split('&')[1].strip() for x in range(len(dict_est_results[dict_key].cv_results_['params']))], name='model_name'),
                    pd.DataFrame(dict_est_results[dict_key].cv_results_['params']), 
                    
                    pd.Series( [(0 if param != dict_est_results[dict_key].best_params_ else 1) \
                                for param in dict_est_results[dict_key].cv_results_['params']], name = 'best_param_ind' ),
                    
                    pd.Series(gs_dict_gs[dict_key].cv_results_['mean_train_accuracy_score'], name = 'mean_train_accuracy_score'),
                    pd.Series(gs_dict_gs[dict_key].cv_results_['mean_test_accuracy_score'], name = 'mean_test_accuracy_score'),
                    pd.Series(gs_dict_gs[dict_key].cv_results_['mean_train_accuracy_score'] - gs_dict_gs[dict_key].cv_results_['mean_test_accuracy_score'], name = 'mean_train_test_accuracy_score_dif'),
                    
                    pd.Series(gs_dict_gs[dict_key].cv_results_['mean_train_roc_auc_score'], name = 'mean_train_roc_auc_score'),
                    pd.Series(gs_dict_gs[dict_key].cv_results_['mean_test_roc_auc_score'], name = 'mean_test_roc_auc_score'),
                    pd.Series(gs_dict_gs[dict_key].cv_results_['mean_train_roc_auc_score'] - gs_dict_gs[dict_key].cv_results_['mean_test_roc_auc_score'], name = 'mean_train_test_roc_auc_score_dif')
                ),
                    axis=1))
    
    
    return pd.concat(df_list)

def est_res_to_df(dict_est_results):
    'Function converts dictionary with grid search results to pandas data frame'
    
    col_index = dict_est_results.keys()
    col_means_val = [v.best_score_ for v in dict_est_results.values()]
    col_best_params = [v.best_params_ for v in dict_est_results.values()]
    dict_index_best_param = {k: (v.cv_results_['params'].index(v.best_params_)) for k,v in gs_dict_gs.items()}
    
    
    col_train_acc_score = { k1 : gs_dict_gs[k1].cv_results_['mean_train_accuracy_score'][v1] for (k1 , v1) in dict_index_best_param.items()}
    col_test_acc_score = { k1 : gs_dict_gs[k1].cv_results_['mean_test_accuracy_score'][v1] for (k1 , v1) in dict_index_best_param.items()}

    col_train_roc_auc_score = { k1 : gs_dict_gs[k1].cv_results_['mean_train_roc_auc_score'][v1] for (k1 , v1) in dict_index_best_param.items()}
    col_test_roc_auc_score = { k1 : gs_dict_gs[k1].cv_results_['mean_test_roc_auc_score'][v1] for (k1 , v1) in dict_index_best_param.items()}
    
    df_est_results = pd.DataFrame(pd.Series(data=col_means_val, index = col_index, name='measure_value'))
    # .values is used because Index.get_values() was removed from pandas
    df_est_results=df_est_results.assign(input_data = df_est_results.index.str.extract('(.*)&.*', expand=False).values)
    df_est_results=df_est_results.assign(model_name = df_est_results.index.str.extract('.*&(.*)', expand=False).values)
    df_est_results=df_est_results.assign(best_params = col_best_params)
    df_est_results=df_est_results.assign(mean_train_accuracy_score = col_train_acc_score.values())
    df_est_results=df_est_results.assign(mean_test_accuracy_score = col_test_acc_score.values())
    df_est_results['mean_train_test_accuracy_score_dif'] = df_est_results['mean_train_accuracy_score'] - df_est_results['mean_test_accuracy_score']

    df_est_results=df_est_results.assign(mean_train_roc_auc_score = col_train_roc_auc_score.values())
    df_est_results=df_est_results.assign(mean_test_roc_auc_score = col_test_roc_auc_score.values())
    df_est_results['mean_train_test_auc_roc_score_dif'] = df_est_results['mean_train_roc_auc_score'] - df_est_results['mean_test_roc_auc_score']

    df_est_results.reset_index(inplace=True, drop=True)
    
    return df_est_results
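
The helper functions above read keys such as `mean_test_accuracy_score` from `cv_results_`. With multi-metric scoring, these key names are derived from the names given in the `scoring` dict. The sketch below shows this on synthetic data; `X_toy`, `y_toy` and the tiny parameter grid are assumptions made only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, linearly separable toy data
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(60, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

gs = GridSearchCV(
    LogisticRegression(),
    {"C": [0.1, 1.0]},
    scoring={"accuracy_score": "accuracy", "roc_auc_score": "roc_auc"},
    refit="accuracy_score",      # metric used to pick best_params_ and best_score_
    return_train_score=True,     # needed for the mean_train_* keys
    cv=3,
)
gs.fit(X_toy, y_toy)

# Keys follow the pattern mean_{train|test}_<name given in the scoring dict>
score_keys = sorted(k for k in gs.cv_results_ if k.startswith("mean_t"))
print(score_keys)
```

Here `best_score_` is the mean cross-validated value of the metric named in `refit`, taken at `best_index_`.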

## Logistic Regression

pipe_key = "LOG_REG"
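# the liblinear solver is set explicitly because it supports both the l1 and l2 penalties used in the grid below
pipe = Pipeline([('model', LogisticRegression(solver='liblinear'))])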
grid_params = [{'model__C':[0.01, 0.1, 1, 10], 'model__penalty':['l1', 'l2']}]

run_grid_search(pipe_key, pipe, grid_params)

pipe_key = "svd_LOG_REG"
pipe = Pipeline([('selection', TruncatedSVD()), ('model', LogisticRegression())])
grid_params = [{'selection__n_components':[10, 50, 100], 'model__C':[0.01, 0.1, 1, 10]}]

run_grid_search(pipe_key, pipe, grid_params)

## SVM

# Due to the long computation time, the SVM model was estimated only on variables resulting from SVD decomposition
pipe_key = "svm"
pipe = Pipeline([('selection', TruncatedSVD()), ('model', SVC())])
grid_params = [{'selection__n_components':[500], 'model__C':[ 20, 50, 100]}]

run_grid_search(pipe_key, pipe, grid_params)

## Random Forest

pipe_key = "random_forest"
pipe = Pipeline([('model', RandomForestClassifier())])
grid_params = [{'model__min_samples_leaf':[1, 2, 5, 10, 20, 50], 'model__n_estimators':[10, 100, 500, 1000]}]

run_grid_search(pipe_key, pipe, grid_params)

## Naive Bayes

pipe_key = "naive_bayes"
pipe = Pipeline([('model', MultinomialNB())])
grid_params = [{'model__alpha':[0, 1, 10, 50]}]

run_grid_search(pipe_key, pipe, grid_params)


## XGBoost

pipe_key = "xgboost"
pipe = Pipeline([('model', XGBClassifier())])
grid_params = [{'model__n_estimators':[100, 500, 1000], 'model__learning_rate': [0.05, 0.1, 0.3, 1]}]

run_grid_search(pipe_key, pipe, grid_params)

## Summarize classic models

df_classic_models_res = est_res_to_df(gs_dict_gs)
df_classic_models_res.to_csv("02 Output data/df_classic_models_est_results.csv")
df_classic_models_res

Show 10 best results.

df_classic_models_res.sort_values('mean_test_accuracy_score', ascending=False).head(10)

def create_df_dict_gs_detail_res2():
    'Function creates df with detail estimation results unified for all classical model types'
    
    df_dict_gs_detail_res2 = {}
    for model_id, df_res in df_dict_gs_detail_res.items():
        df = df_dict_gs_detail_res[model_id].reset_index()
        df_dict_gs_detail_res2[model_id]=df.melt(id_vars=['index', 'input_data', 'model_name', 'best_param_ind', 'mean_train_accuracy_score',\
               'mean_test_accuracy_score', 'mean_train_test_accuracy_score_dif',\
               'mean_train_roc_auc_score', 'mean_test_roc_auc_score',\
               'mean_train_test_roc_auc_score_dif'],
               var_name='Parameter_name', value_name='Parameter_value')
    df_detail_res = pd.concat(df_dict_gs_detail_res2.values(),axis=0)
    return df_detail_res

Save concatenated detail estimation results.

df_detail_res_to_csv = create_df_dict_gs_detail_res2()
df_detail_res_to_csv.to_csv("02 Output data/df_classic_models_est_results_details.csv", sep=";", decimal=',')

Save estimation results to files (without unification done above).

for model_id, df_res in df_dict_gs_detail_res.items():
    df_dict_gs_detail_res[model_id].to_csv("02 Output data/df_classic_models_est_results_details_" + model_id + ".csv", sep=";", decimal=',')

Model performance comparison.

df_detail_res_to_plot = df_detail_res_to_csv[df_detail_res_to_csv.best_param_ind==1].sort_values(['mean_test_accuracy_score'], ascending = False)
df_detail_res_to_plot.boxplot(column='mean_test_accuracy_score', by='model_name', figsize=(10,10), vert=False)
plt.show()

### Conclusions for the plot above

The best model is Logistic Regression, which is slightly surprising. It gives the best accuracy score (0.87850).
XGBoost and SVM are slightly worse. Better hyperparameter tuning for XGBoost and SVM could possibly improve their performance, at least to the level of Logistic Regression.
Using SVD to reduce the variables before Logistic Regression gives unsatisfactory results.

df_detail_res_to_plot2 = df_detail_res_to_plot
df_detail_res_to_plot2['num_words'] = df_detail_res_to_plot2.input_data.str[-4:]
df_detail_res_to_plot2.boxplot(column='mean_test_accuracy_score', by=['num_words', 'input_data'], figsize=(10,10), rot=0, vert=False)
plt.show()

### Conclusions

The data above are grouped by number of words (5000 words in the first part, 1000 words in the second part).
Using 1000 words as the basis for estimation gives noticeably worse results than using 5000 words.
Preprocessing the data (expanding contractions, removing punctuation and stop words, and stemming) gives slightly worse performance than using data that are not preprocessed in any way. In the chart above, the TF-IDF and count vectorizers are interleaved. The count vectorizer gives slightly worse performance than the TF-IDF vectorizer.

## Recurrent Neural Networks

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import SimpleRNN, Dense, Embedding, LSTM, GRU, GlobalAveragePooling1D, Bidirectional
from keras.callbacks import EarlyStopping, ModelCheckpoint
import tensorflow as tf
from keras import backend as K
from keras import regularizers

num_cores = 4

# use CPU or GPU
num_CPU = 1
num_GPU = 0 #set 1 if you want to run calculations on GPU

config = tf.ConfigProto(intra_op_parallelism_threads=num_cores,\
        inter_op_parallelism_threads=num_cores, allow_soft_placement=True,\
        device_count = {'CPU' : num_CPU, 'GPU' : num_GPU})
session = tf.Session(config=config)
K.set_session(session)

class Rnn_Tokenizer(TransformerMixin):
    'Assigns integers to words and creates integer sequences (by transform method)'
    
    def __init__(self, num_words = None, max_seq_length = None):
        self.num_words_ = num_words
        self.max_seq_length_ = max_seq_length
        
    def fit(self, X, max_seq_length = None):
        
        self.tokenizer = Tokenizer(num_words = self.num_words_)
        self.tokenizer.fit_on_texts(X)
        
        if self.max_seq_length_ == None:
            self.max_seq_length_ = max([len(l) for l in self.tokenizer.texts_to_sequences(X)])
        
        return self
        
    def transform(self, X, **kwargs):            
        int_text = self.tokenizer.texts_to_sequences(X)
        return pad_sequences(int_text, maxlen=self.max_seq_length_)
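
With its default settings, Keras `pad_sequences` both pads and truncates at the beginning of each sequence, so the last `maxlen` tokens of a long review are the ones that are kept. The pure-Python sketch below mimics that default behaviour. The helper `pad_pre` is hypothetical and is not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` items of each sequence."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                            # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)   # pad at the front
    return padded


print(pad_pre([[5, 3], [7, 1, 4, 9, 2]], maxlen=4))
# [[0, 0, 5, 3], [1, 4, 9, 2]]
```

The padding value 0 never collides with a real word because the Keras `Tokenizer` assigns word indices starting from 1.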
        
        

par_max_num_words = 200000 # maximum number of words that can be used in model estimation

Calculate the maximum number of words in a review (after removing punctuation) for the concatenated train and test sets.

rnn_pipe = Pipeline([('nostemmer_nosw_punc', TextStemmer(stemmer=None, stop_words_list=None, punctuation_list=string.punctuation))
                     , ('rnn_tokenizer', Rnn_Tokenizer(num_words=par_max_num_words))
                     ])
rnn_pipe.fit_transform(np.concatenate((kaggle_input_data_train['review'].values, kaggle_input_data_test['review'].values)))
par_max_seq_length_calculated = rnn_pipe.steps[1][1].max_seq_length_
par_max_seq_length_calculated

par_max_seq_length=400 # manually set maximum sequence length (= maximum number of words in a review) to reduce computation time

### Read embeddings

Read the embeddings and keep only the words that occur in the reviews (train and test sets). This step creates two numpy arrays: emb_words (array of words) and emb_vectors (array of embedding values).

par_embedding_length = 300

with open('01 Input data/glove.6B.' + str(par_embedding_length) + 'd.txt', encoding="utf8") as f:
    i = 1
    emb_words = np.empty((par_max_num_words+1), dtype='object')
    emb_vectors = np.zeros((par_max_num_words+1, par_embedding_length))
    emb_words[0] = ''
    for line in f:

        line_list = line.split(' ')
        
        
        try:
            rnn_pipe.steps[1][1].tokenizer.word_index[line_list[0]]    
        except KeyError:
            pass
        else:
            if i <= par_max_num_words:
                emb_words[i] = line_list[0]
                emb_vectors[i] = line_list[1:]
            i = i + 1
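
Each line of a GloVe text file holds a token followed by its space-separated vector components. The sketch below parses such lines and keeps only the words found in a vocabulary. It uses a made-up two-line sample with 3 dimensions instead of the real 300-dimensional file, and the names `parse_glove_line` and `vocabulary` are hypothetical.

```python
import io


def parse_glove_line(line):
    """Split one GloVe text line into (word, list of floats)."""
    parts = line.rstrip("\n").split(" ")
    return parts[0], [float(x) for x in parts[1:]]


# Hypothetical sample in the GloVe text format (3 dimensions for brevity)
sample = io.StringIO("the 0.1 -0.2 0.3\nmovie 0.5 0.0 -1.5\n")
vocabulary = {"movie"}   # stands in for the tokenizer's word_index keys

kept = {}
for line in sample:
    word, vector = parse_glove_line(line)
    if word in vocabulary:     # keep only words that occur in the reviews
        kept[word] = vector

print(kept)
# {'movie': [0.5, 0.0, -1.5]}
```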

Sample of first 1000 words in embeddings.

for i in range(min(1000, par_max_num_words)):
    print(emb_words[i])

Check which words are in reviews and aren't in embeddings.

words_in_reviews_not_in_emb = []
for word in rnn_pipe.steps[1][1].tokenizer.word_index.keys():
    if word not in emb_words:
        words_in_reviews_not_in_emb.append(word)

words_in_reviews_not_in_emb
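
Testing membership against a numpy array scans the whole array for every word, which can be slow for a vocabulary of this size. A possible alternative, sketched below with made-up word lists, is to build a `set` once and test against it.

```python
review_words = ["movie", "gr8", "plot"]       # hypothetical stand-in for word_index keys
embedding_words = ["the", "movie", "plot"]    # hypothetical stand-in for emb_words

known = set(embedding_words)                  # built once; lookups are O(1) on average
missing = [w for w in review_words if w not in known]

print(missing)
# ['gr8']
```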

Preliminary analysis indicated that:
1. words with a single quote sign (') at the beginning, such as "'real" and "'love", appear in the list
2. contractions such as "it´s", "didn't", "don't", "i'm" appear quite often


To address these issues, the TextStemmer class was extended so that:
Ad 1. A single quote sign (') appearing at the beginning of a word is removed.
Ad 2. Common contractions are expanded to full words.
The preliminary list contained 32832 words in total. After applying steps 1 and 2, the list contains 27947 words.
The words remaining on the list are usually spelling errors or uncommon words.

# Number of words that occur in reviews but not in the embeddings
len(words_in_reviews_not_in_emb)

### Data preprocessing for RNN

Create Pipelines for data preprocessing.

dict_rnn_pipe = {}

text_stemmer_key = 'nostemmer_nosw_punc_expcontr'
text_stemmer_obj = TextStemmer(stemmer=None, stop_words_list=None, punctuation_list=string.punctuation, expand_contractions = True)


rnn_tokenizer_key = 'rnn_tokenizer_1000'
rnn_tokenizer_obj = Rnn_Tokenizer(num_words=1000, max_seq_length=par_max_seq_length)
rnn_pipe = Pipeline([(text_stemmer_key, text_stemmer_obj), (rnn_tokenizer_key, rnn_tokenizer_obj)])
dict_rnn_pipe[text_stemmer_key + '__' + rnn_tokenizer_key] = rnn_pipe


rnn_tokenizer_key = 'rnn_tokenizer_5000'
rnn_tokenizer_obj = Rnn_Tokenizer(num_words=5000, max_seq_length=par_max_seq_length)
rnn_pipe = Pipeline([(text_stemmer_key, text_stemmer_obj), (rnn_tokenizer_key, rnn_tokenizer_obj)])
dict_rnn_pipe[text_stemmer_key + '__' + rnn_tokenizer_key] = rnn_pipe


rnn_tokenizer_key = 'rnn_tokenizer_20000'
rnn_tokenizer_obj = Rnn_Tokenizer(num_words=20000, max_seq_length=par_max_seq_length)
rnn_pipe = Pipeline([(text_stemmer_key, text_stemmer_obj), (rnn_tokenizer_key, rnn_tokenizer_obj)])
dict_rnn_pipe[text_stemmer_key + '__' + rnn_tokenizer_key] = rnn_pipe

Run Pipelines for data preprocessing.

rnn_X_train_dict_preprocessed = {}
rnn_X_test_dict_preprocessed = {}

for pipe_key, pipe in dict_rnn_pipe.items():
    rnn_X_train_dict_preprocessed[pipe_key] = pipe.fit_transform(X_train)
    rnn_X_test_dict_preprocessed[pipe_key] = pipe.transform(X_test)

def create_embeddding_weights(tokenizer, emb_words, emb_vectors):
    '''
    Function creates embedding matrix to be used in Embedding layer and creates dictionary with
    words which do not have representation in embedding vector
    '''

    embeddings = np.zeros((tokenizer.num_words + 1, par_embedding_length))
    words_not_in_embeddings = {}
    
    for word, word_int in tokenizer.word_index.items():
        if word_int <= tokenizer.num_words:
            if np.nonzero(emb_words==word)[0].shape[0] == 0:
                words_not_in_embeddings[word] = word_int
            else:
                embeddings[word_int, :] = emb_vectors[np.nonzero(emb_words==word)[0]]
    return embeddings, words_not_in_embeddings
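
The idea behind the embedding matrix can be shown on a toy example: row `i` holds the vector of the word with tokenizer index `i`, row 0 stays zero for padding, and words without a pretrained vector keep a zero row. The sketch below uses a plain dict lookup instead of array comparisons. All names and values in it (`glove`, `word_index`, `num_words`, `emb_dim`) are made up for illustration.

```python
import numpy as np

emb_dim = 3                                                     # toy dimension
glove = {"good": [0.1, 0.2, 0.3], "bad": [-0.1, -0.2, -0.3]}    # hypothetical vectors
word_index = {"good": 1, "bad": 2, "gr8": 3, "rare": 4}         # 1-based, most frequent first
num_words = 3                                                   # vocabulary size limit

weights = np.zeros((num_words + 1, emb_dim))   # row 0 is reserved for padding
missing = {}
for word, idx in word_index.items():
    if idx > num_words:
        continue                    # word falls outside the vocabulary limit
    if word in glove:
        weights[idx] = glove[word]  # copy the pretrained vector into its row
    else:
        missing[word] = idx         # no pretrained vector: row stays zero

print(weights.shape, missing)
# (4, 3) {'gr8': 3}
```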

Create embedding matrices to be used in Embedding layer and create dictionaries with words which do not have representation in embedding vector.

rnn_X_train_embeddings = {}
rnn_X_test_embeddings = {}
rnn_X_train_words_not_in_embeddings = {}


for pipe_key, pipe in dict_rnn_pipe.items():
    
    tokenizer = dict_rnn_pipe[pipe_key].steps[1][1].tokenizer
    rv = create_embeddding_weights(tokenizer, emb_words, emb_vectors)
    
    rnn_X_train_embeddings[pipe_key] = rv[0]
    rnn_X_test_embeddings[pipe_key] = rv[0]
    rnn_X_train_words_not_in_embeddings[pipe_key] = rv[1]

del rv

### Estimation of RNN models

def create_rnn_models(dict_rnn_pipe, rnn_embeddings):
    'Function defines RNN models'
    
    rnn_models_dict = {}
    
    for pipe_key, pipe in dict_rnn_pipe.items(): 
    
        model_key = "Simple RNN"
        
        model = Sequential()
        model.add(Embedding(dict_rnn_pipe[pipe_key].steps[1][1].tokenizer.num_words+1, par_embedding_length, input_length=par_max_seq_length, trainable=False, weights=[rnn_embeddings[pipe_key]]))
        model.add(SimpleRNN(100))
        model.add(Dense(1, activation="sigmoid"))

        model.compile(loss="binary_crossentropy",optimizer="adam",metrics=["accuracy"])
        rnn_models_dict[pipe_key + '__' + model_key] = model
        
        
        model_key = "LSTM"
        
        model = Sequential()
        model.add(Embedding(dict_rnn_pipe[pipe_key].steps[1][1].tokenizer.num_words+1, par_embedding_length, input_length=par_max_seq_length, trainable=False, weights=[rnn_embeddings[pipe_key]]))
        model.add(LSTM(100))
        model.add(Dense(1, activation="sigmoid"))

        model.compile(loss="binary_crossentropy",optimizer="adam",metrics=["accuracy"])
        rnn_models_dict[pipe_key + '__' + model_key] = model
        
        model_key = "LSTM Bidirectional"
        
        model = Sequential()
        model.add(Embedding(dict_rnn_pipe[pipe_key].steps[1][1].tokenizer.num_words+1, par_embedding_length, input_length=par_max_seq_length, trainable=False, weights=[rnn_embeddings[pipe_key]]))
        model.add(Bidirectional(LSTM(100)))
        model.add(Dense(1, activation="sigmoid"))

        model.compile(loss="binary_crossentropy",optimizer="adam",metrics=["accuracy"])
        rnn_models_dict[pipe_key + '__' + model_key] = model

                
        model_key = "LSTM Bidirectional with dropout"
        
        model = Sequential()
        model.add(Embedding(dict_rnn_pipe[pipe_key].steps[1][1].tokenizer.num_words+1, par_embedding_length, input_length=par_max_seq_length, trainable=False, weights=[rnn_embeddings[pipe_key]]))
        model.add(Bidirectional(LSTM(100,dropout=0.3,recurrent_dropout=0.3)))
        model.add(Dense(1, activation="sigmoid"))

        model.compile(loss="binary_crossentropy",optimizer="adam",metrics=["accuracy"])
        rnn_models_dict[pipe_key + '__' + model_key] = model

        
        
        model_key = "GRU"
        
        model = Sequential()
        model.add(Embedding(dict_rnn_pipe[pipe_key].steps[1][1].tokenizer.num_words+1, par_embedding_length, input_length=par_max_seq_length, trainable=False, weights=[rnn_embeddings[pipe_key]]))
        model.add(GRU(100))
        model.add(Dense(1, activation="sigmoid"))

        model.compile(loss="binary_crossentropy",optimizer="adam",metrics=["accuracy"])
        rnn_models_dict[pipe_key + '__' + model_key] = model

        model_key = "GRU Bidirectional"
        
        model = Sequential()
        model.add(Embedding(dict_rnn_pipe[pipe_key].steps[1][1].tokenizer.num_words+1, par_embedding_length, input_length=par_max_seq_length, trainable=False, weights=[rnn_embeddings[pipe_key]]))
        model.add(Bidirectional(GRU(100)))
        model.add(Dense(1, activation="sigmoid"))

        model.compile(loss="binary_crossentropy",optimizer="adam",metrics=["accuracy"])
        rnn_models_dict[pipe_key + '__' + model_key] = model
        
                  
        model_key = "GRU Bidirectional with dropout"
        
        model = Sequential()
        model.add(Embedding(dict_rnn_pipe[pipe_key].steps[1][1].tokenizer.num_words+1, par_embedding_length, input_length=par_max_seq_length, trainable=False, weights=[rnn_embeddings[pipe_key]]))
        model.add(Bidirectional(GRU(100,dropout=0.3,recurrent_dropout=0.3)))
        model.add(Dense(1, activation="sigmoid"))

        model.compile(loss="binary_crossentropy",optimizer="adam",metrics=["accuracy"])
        rnn_models_dict[pipe_key + '__' + model_key] = model


    return rnn_models_dict
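The seven model definitions above differ only in the recurrent cell, in whether it is wrapped in `Bidirectional`, and in the dropout rate. One possible way to make that explicit is to describe the variants as plain data and generate the dictionary keys from it. The sketch below covers only this bookkeeping, not the Keras layer construction. `MODEL_SPECS`, `enumerate_model_keys` and the pipeline keys are hypothetical names introduced for illustration.

```python
from itertools import product

# Plain-data description of the seven variants: (cell type, bidirectional, dropout rate).
MODEL_SPECS = {
    "Simple RNN": ("SimpleRNN", False, 0.0),
    "LSTM": ("LSTM", False, 0.0),
    "LSTM Bidirectional": ("LSTM", True, 0.0),
    "LSTM Bidirectional with dropout": ("LSTM", True, 0.3),
    "GRU": ("GRU", False, 0.0),
    "GRU Bidirectional": ("GRU", True, 0.0),
    "GRU Bidirectional with dropout": ("GRU", True, 0.3),
}

def enumerate_model_keys(pipe_keys, specs=MODEL_SPECS):
    """Return {dict key: spec} for every (pipeline, model) pair."""
    return {
        f"{pipe_key}__{name}": specs[name]
        for pipe_key, name in product(pipe_keys, specs)
    }

# Hypothetical pipeline keys, following the '<preprocessing>__<tokenizer>' pattern.
pipe_keys = ["nostemmer__rnn_tokenizer_1000", "nostemmer__rnn_tokenizer_20000"]
model_keys = enumerate_model_keys(pipe_keys)
print(len(model_keys))
```

A single builder function could then read each spec and assemble the corresponding `Sequential` model, so that a change to the embedding layer or the optimizer has to be made in one place only.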

Create dict with RNN models.

rnn_models_dict = create_rnn_models(dict_rnn_pipe, rnn_X_train_embeddings)

Fit models and evaluate performance.

par_batch_size = [32, 128]
epochs = 100

input_data = []
model_name = []
acc_train = []
acc_test = []
batch_size_list = []

for model_key, model in rnn_models_dict.items():
    
    for bs in par_batch_size:
        
        print("\n******************************************")
        print("Model:", model_key)
        print("Batch size:", bs)
    
        data_key = re.search(r'^(.+__.+)__(.+)$', model_key).group(1)
        model_key2 = re.search(r'^(.+__.+)__(.+)$', model_key).group(2)
        input_data.append(data_key)
        model_name.append(model_key2)

        early_stopping = EarlyStopping(patience=3,monitor="val_loss")
        take_best = ModelCheckpoint("weights.h5py",save_best_only=True)
        rnn_models_dict[model_key].fit(rnn_X_train_dict_preprocessed[data_key], y_train, callbacks=[early_stopping, take_best], validation_split=0.25, batch_size=bs, epochs=epochs)
        rnn_models_dict[model_key].load_weights("weights.h5py")
        os.remove("weights.h5py")
        acc_train.append(rnn_models_dict[model_key].evaluate(rnn_X_train_dict_preprocessed[data_key], y_train)[1])
        acc_test.append(rnn_models_dict[model_key].evaluate(rnn_X_test_dict_preprocessed[data_key], y_test)[1])
        batch_size_list.append(bs)
        
        print("Results")
        print("Model:", model_key)
        print("Batch size:", bs)
        print("Train:", acc_train[-1])
        print("Test:", acc_test[-1])
    
df_rnn_model_results = pd.DataFrame({'input_data': input_data, 'model_name': model_name, 'batch_size':batch_size_list, 'train_accuracy_score':acc_train, 'test_accuracy_score':acc_test})
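The interplay of `EarlyStopping(patience=3, monitor="val_loss")` and `ModelCheckpoint(..., save_best_only=True)` in the loop above can be illustrated with a simplified pure-Python simulation. It is a sketch of the general idea, not the actual Keras logic, and the validation losses are made-up numbers.

```python
def simulate_early_stopping(val_losses, patience=3):
    """Return (epoch at which training stops, epoch with the best validation loss)."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: this is the point where a checkpoint would be saved.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation losses; the last value is never reached.
val_losses = [0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.39]
stopped_at, best = simulate_early_stopping(val_losses)
print(stopped_at, best)
```

Training stops after three epochs without improvement, and reloading the checkpoint afterwards restores the weights from the best epoch rather than from the last one.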

### RNN models' results

df_rnn_model_results

Save estimation results.

df_rnn_model_results.to_csv("02 Output data/df_rnn_model_est_results.csv")

Create df for boxplots and show 10 best results.

df_rnn_model_results2 = df_rnn_model_results.copy()
df_rnn_model_results2['num_words'] = df_rnn_model_results2['input_data'].str.extract(r'(\d+)$', expand=False).astype(float)
df_rnn_model_results2.sort_values('test_accuracy_score', ascending=False).head(10)
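The results table depends on two regular expressions: one splits a model key into the input-data key and the model name, the other pulls the trailing number of words out of the input-data key. The short sketch below shows both on a single key. The key itself is a hypothetical example of the `<preprocessing>__<tokenizer>__<model>` pattern, since the real pipeline keys are defined earlier in the notebook.

```python
import re

# Hypothetical model key following the pattern used in rnn_models_dict.
key = "nostemmer__rnn_tokenizer_20000__GRU Bidirectional"

# Split into the input-data part (everything before the last '__') and the model name.
match = re.search(r'^(.+__.+)__(.+)$', key)
data_key, model_name = match.group(1), match.group(2)

# Extract the trailing digits of the input-data key as the vocabulary size.
num_words = float(re.search(r'(\d+)$', data_key).group(1))

print(data_key, "|", model_name, "|", num_words)
```

The first pattern assumes that the input-data key itself contains exactly one `__` separator; a key with a different structure would be split differently or not matched at all.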

Compare the impact of batch size on estimation results.

df_rnn_model_results2.boxplot(column = 'test_accuracy_score', by = ['model_name', 'batch_size'], vert = False, figsize=(10,10))
plt.show()

Compare model performance vs. number of words.

df_rnn_model_results2.boxplot(column = 'test_accuracy_score', by = ['model_name', 'batch_size', 'num_words'], vert = False, figsize=(10,10))
plt.show()

Compare model performance vs. batch size.

df_rnn_model_results2.boxplot(column = 'test_accuracy_score', by = ['model_name', 'num_words', 'batch_size'], vert = False, figsize=(10,10))
plt.show()

**Conclusions for RNN models' results:**

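1. Batch size 128 usually gives better results than batch size 32, but takes much more time to estimate. Note that each model is fitted with batch size 32 first and then, without re-initialization, with batch size 128, so the batch size 128 runs start from already trained weights.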
2. Within each model type, estimation on 20000 words usually gives the best results (compared to 1000 and 5000 words).
3. The best accuracy on the test sample is reached by the GRU Bidirectional with dropout model estimated with batch size 128 and 20000 words.

## Compare classical models with RNN models

# Mark best estimation result for each model as model_rank = 1
df_rnn_model_results2['model_rank'] = df_rnn_model_results2.groupby(['model_name'])['test_accuracy_score'].rank(ascending=False)

df_classic_models_res2 = df_classic_models_res.copy()
df_classic_models_res2['model_rank'] = df_classic_models_res2.groupby(['model_name'])['mean_test_accuracy_score'].rank(ascending=False)
df_classic_models_res2.rename(columns={'mean_test_accuracy_score':'test_accuracy_score'}, inplace=True)

# Print best results for each model
df_model_comparison = pd.concat([df_rnn_model_results2[['model_rank', 'model_name', 'test_accuracy_score']], df_classic_models_res2[['model_rank', 'model_name', 'test_accuracy_score']]], axis=0)
df_model_comparison[df_model_comparison['model_rank']==1.].sort_values(['test_accuracy_score'], ascending=False)
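One detail of the ranking step above may be worth checking: `rank` uses `method='average'` by default, so if two runs of the same model tie for the best score, both receive rank 1.5 and the filter `model_rank == 1` drops that model altogether. The toy example below (made-up scores, assumed column names matching the notebook) contrasts the default with `method='min'`.

```python
import pandas as pd

# Made-up results: the two best GRU runs are tied.
results = pd.DataFrame({
    "model_name": ["GRU", "GRU", "GRU", "LSTM", "LSTM"],
    "test_accuracy_score": [0.90, 0.90, 0.88, 0.89, 0.87],
})

grouped = results.groupby("model_name")["test_accuracy_score"]
avg_rank = grouped.rank(ascending=False)                # default: method='average'
min_rank = grouped.rank(ascending=False, method="min")  # tied rows share the lowest rank

best_with_average = results[avg_rank == 1]["model_name"].tolist()
best_with_min = results[min_rank == 1]["model_name"].tolist()
print(best_with_average, best_with_min)
```

With `method='min'` every tied best run is kept; `method='first'` would keep exactly one row per model instead.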

**Conclusion** <br>
Deep learning models perform better than classical models on the IMDB dataset (except for Simple RNN).

## Data preparation for Kaggle

Now that the best preprocessing steps and models for the IMDB data set are known,
the steps above are repeated below to score the test data for independent verification by Kaggle.

Prepare combined set (X_all = X_train + X_test).

X_all = kaggle_input_data_train.review
y_all = kaggle_input_data_train.sentiment

Dictionaries to store results.

kaggle_pipes = {}
kaggle_accuracy_score = {} #collects accuracy score on train data set
kaggle_roc_auc_score = {} #collects ROC AUC score on train data set
kaggle_predictions_on_kaggle_test_data = {} #dictionary of predictions to be sent to Kaggle

Estimate best logistic regression model.

pipe_key = 'log_reg'

pipe = Pipeline([('noexpcontr_norempunc_noremsw_nostemmer', TextStemmer(expand_contractions=None, punctuation_list=None, stop_words_list=None, stemmer=None)),
                ('tfidvect5000', TfidfVectorizer(max_features=5000, min_df=0.001, max_df=0.5)),
                ('log_reg', LogisticRegression(C=1, penalty = 'l2'))
                ])
kaggle_pipes[pipe_key + "_trained_on_X_train"] = pipe.fit(X_train, y=y_train)
kaggle_accuracy_score[pipe_key + "_trained_on_X_train"] = accuracy_score(y_train, pipe.predict(X_train))
kaggle_roc_auc_score[pipe_key + "_trained_on_X_train"] = roc_auc_score(y_train, pipe.predict_proba(X_train)[:,1])

kaggle_predictions_on_kaggle_test_data[pipe_key + "_trained_on_X_train"] = pd.DataFrame(pipe.predict(kaggle_input_data_test.review), index=kaggle_input_data_test.index)


pipe = Pipeline([('noexpcontr_norempunc_noremsw_nostemmer', TextStemmer(expand_contractions=None, punctuation_list=None, stop_words_list=None, stemmer=None)),
                ('tfidvect5000', TfidfVectorizer(max_features=5000, min_df=0.001, max_df=0.5)),
                ('log_reg', LogisticRegression(C=1, penalty = 'l2'))
                ])
kaggle_pipes[pipe_key + "_trained_on_X_all"] = pipe.fit(X_all, y=y_all)
kaggle_accuracy_score[pipe_key + "_trained_on_X_all"] = accuracy_score(y_all, pipe.predict(X_all))
kaggle_roc_auc_score[pipe_key + "_trained_on_X_all"] = roc_auc_score(y_all, pipe.predict_proba(X_all)[:,1])
kaggle_predictions_on_kaggle_test_data[pipe_key + "_trained_on_X_all"] = pd.DataFrame(pipe.predict(kaggle_input_data_test.review), index=kaggle_input_data_test.index)

Show logistic regression accuracy and ROC AUC scores on the train data sets.

print("Accuracy score:\n", kaggle_accuracy_score)
print("AUC ROC score:\n", kaggle_roc_auc_score)
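Because the Kaggle score reported at the end of this notebook is ROC AUC, it may matter whether a submission contains hard 0/1 labels (as returned by `pipe.predict`) or probabilities (as returned by `pipe.predict_proba`). The sketch below uses a small pairwise ROC AUC function and made-up numbers to show that thresholding can lower the score even when the ranking of the probabilities is perfect. The helper `roc_auc` is written here for illustration and is not part of the notebook.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: the probabilities rank every positive above every negative.
y_true = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.45, 0.9]
hard_labels = [int(p >= 0.5) for p in probabilities]  # thresholding at 0.5

auc_from_probabilities = roc_auc(y_true, probabilities)
auc_from_labels = roc_auc(y_true, hard_labels)
print(auc_from_probabilities, auc_from_labels)
```

In this toy case the probabilities give an AUC of 1.0 while the thresholded labels give 0.75, because thresholding turns the distinction between the scores 0.4 and 0.45 into a tie.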

Create embeddings matrices.

kaggle_rnn_embeddings = {}
kaggle_rnn_transformed_data = {}

pipe_data = Pipeline([('nostemmer_nosw_punc_expcontr', TextStemmer(stemmer=None, stop_words_list=None, punctuation_list=string.punctuation, expand_contractions = True)),
                ('rnn_tokenizer_20000', Rnn_Tokenizer(num_words=20000, max_seq_length=par_max_seq_length))])

kaggle_rnn_transformed_data['X_train'] = pipe_data.fit_transform(X_train, y_train)
kaggle_rnn_transformed_data['X_kaggle_trained_on_X_train'] = pipe_data.transform(kaggle_input_data_test.review)
kaggle_rnn_embeddings['X_train'] = create_embeddding_weights(pipe_data.steps[1][1].tokenizer, emb_words, emb_vectors)[0]


pipe_data = Pipeline([('nostemmer_nosw_punc_expcontr', TextStemmer(stemmer=None, stop_words_list=None, punctuation_list=string.punctuation, expand_contractions = True)),
                ('rnn_tokenizer_20000', Rnn_Tokenizer(num_words=20000, max_seq_length=par_max_seq_length))])

kaggle_rnn_transformed_data['X_all'] = pipe_data.fit_transform(X_all, y_all)
kaggle_rnn_transformed_data['X_kaggle_trained_on_X_all'] = pipe_data.transform(kaggle_input_data_test.review)
kaggle_rnn_embeddings['X_all'] = create_embeddding_weights(pipe_data.steps[1][1].tokenizer, emb_words, emb_vectors)[0]



Estimate best GRU model.

pipe_key = 'GRU'
bs = 128 #batch size
epochs = 100 #maximum number of epochs
y_data = {}
y_data["X_train"] = y_train
y_data["X_all"] = y_all

for data_key in ['X_train', 'X_all']:

    model = Sequential()
    model.add(Embedding(pipe_data.steps[1][1].tokenizer.num_words+1, par_embedding_length, input_length=par_max_seq_length, trainable=False, weights=[kaggle_rnn_embeddings[data_key]]))
    model.add(Bidirectional(GRU(100,dropout=0.3,recurrent_dropout=0.3)))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy",optimizer="adam",metrics=["accuracy"])

    early_stopping = EarlyStopping(patience=3, monitor="val_loss")
    take_best = ModelCheckpoint("weights.h5py", save_best_only=True)
    model.fit(kaggle_rnn_transformed_data[data_key], y_data[data_key], callbacks=[early_stopping, take_best], validation_split=0.25, batch_size=bs, epochs=epochs)

    model.load_weights("weights.h5py")
    os.remove("weights.h5py")
    kaggle_accuracy_score[pipe_key + "_trained_on_" + data_key] = model.evaluate(kaggle_rnn_transformed_data[data_key], y_data[data_key])[1]
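    # model.predict returns the sigmoid probabilities; newer Keras releases no longer provide predict_proba
    kaggle_roc_auc_score[pipe_key + "_trained_on_" + data_key] = roc_auc_score(y_data[data_key], model.predict(kaggle_rnn_transformed_data[data_key]).ravel())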
    kaggle_predictions_on_kaggle_test_data[pipe_key + "_trained_on_" + data_key] = pd.DataFrame(model.predict(kaggle_rnn_transformed_data['X_kaggle_trained_on_' + data_key]), index=kaggle_input_data_test.index[:len(kaggle_rnn_transformed_data['X_kaggle_trained_on_' + data_key])])

Output data for Kaggle competition.

for dict_key, kaggle_pred in kaggle_predictions_on_kaggle_test_data.items():
    kaggle_pred.columns = ['sentiment']
    kaggle_pred.to_csv('02 Output data/kaggle_pred_' + dict_key + '.csv')

The CSV files created above are submitted to Kaggle for independent verification.

# Show models' metrics on train sets
kaggle_results = pd.concat([pd.Series(kaggle_accuracy_score, name = 'train_accuracy_score'), pd.Series(kaggle_roc_auc_score, name='train_roc_auc_score')], axis = 1)

# Values of roc_auc_score calculated by Kaggle are entered by hand below
kaggle_results = kaggle_results.assign(kaggle_roc_auc_score = pd.Series({
    "log_reg_trained_on_X_train":0.88140,
    "log_reg_trained_on_X_all": 0.88176,
    "GRU_trained_on_X_train":0.96035,
    "GRU_trained_on_X_all":0.96446}))

kaggle_results

A ROC AUC of 0.96446 gives 39th position on the Kaggle tutorial competition leaderboard (among 528 teams). <br>
Link to the tutorial competition: <br>
https://www.kaggle.com/c/word2vec-nlp-tutorial

## Final conclusions

#### Classical models:

1. The best classical model is Logistic Regression, which is slightly surprising.
2. XGBoost and SVM are slightly worse (SVM is estimated on 500 variables resulting from SVD).
3. Better hyperparameter tuning for XGBoost and SVM could possibly improve their performance (at least to the logistic regression level).
4. Using SVD to preselect variables for Logistic Regression gives unsatisfactory results.
5. Logistic regression gives the best test accuracy score (0.87850).
6. Using 1000 words as a base for estimation gives noticeably worse results than estimation on the 5000 most popular words.
7. Preprocessing the data (expanding contractions, removing punctuation and stop words, and stemming) gives slightly worse performance than using data that are not preprocessed in any way.
8. Using the count vectorizer gives slightly worse performance than using the TF-IDF vectorizer.


#### RNN models:

1. Batch size 128 usually gives better results than batch size 32, but takes much more time to estimate.
2. Within each model type, estimation on 20000 words usually gives the best results (compared to 1000 and 5000 words).
3. The best accuracy on the test sample (0.90000) is reached by the GRU Bidirectional with dropout model estimated with batch size 128 and 20000 words.

#### Classical models vs RNN models:

1. Deep learning models perform better than classical models on the IMDB dataset (except for Simple RNN).

#### Kaggle tutorial competition:

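1. A ROC AUC of 0.96446 (from the GRU Bidirectional with dropout model) gives 39th position on the Kaggle tutorial competition leaderboard (among 528 teams).
2. ROC AUC for the logistic regression model is 0.88176, which is markedly lower than the result from the GRU Bidirectional with dropout model (0.96446). Note, however, that the logistic regression submissions contain hard 0/1 labels from `pipe.predict`, while the GRU submissions contain probabilities from `model.predict`, so the two Kaggle scores are not directly comparable.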