<a href="https://colab.research.google.com/github/HannaKi/Sentiment-analysis-with-IMDB-data/blob/main/Sentiment_analysis_with_IMDB_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [42]:
import pandas as pd
import numpy as np
import random
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, plot_confusion_matrix

from pprint import pprint
# from time import time
# import logging

## Load and inspect the data

The data consists of IMDb reviews collected for a Stanford University research project (https://www.aclweb.org/anthology/P11-1015/). To learn more about the data, please visit the web page or read the README file printed below.

For my purposes the test data is big enough, and I will use it for training, validating and testing the model.

In [43]:
%%bash
wget -nc http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar -xf aclImdb_v1.tar.gz

File ‘aclImdb_v1.tar.gz’ already there; not retrieving.



In [44]:
%%bash
cd aclImdb
cat README | head -1000

Large Movie Review Dataset v1.0

Overview

This dataset contains movie reviews along with their associated binary
sentiment polarity labels. It is intended to serve as a benchmark for
sentiment classification. This document outlines how the dataset was
gathered, and how to use the files provided. 

Dataset 

The core dataset contains 50,000 reviews split evenly into 25k train
and 25k test sets. The overall distribution of labels is balanced (25k
pos and 25k neg). We also include an additional 50,000 unlabeled
documents for unsupervised learning. 

In the entire collection, no more than 30 reviews are allowed for any
given movie because reviews for the same movie tend to have correlated
ratings. Further, the train and test sets contain a disjoint set of
movies, so no significant performance is obtained by memorizing
movie-unique terms and their associated with observed labels.  In the
labeled train/test sets, a negative review has a score <= 4 out of 10,
and a positive review has a scor

In [45]:
%%bash
cd aclImdb/test/neg
for f in *txt; do echo "$f"; done > /content/neg_file_names.txt # overwrites the list; appending with >> would duplicate every file name each time the cell is re-run

In [46]:
%%bash
cd aclImdb/test/pos
for f in *txt; do echo "$f"; done > /content/pos_file_names.txt # overwrites the list; appending with >> would duplicate every file name each time the cell is re-run

In [47]:
def open_file_by_looping(path, file):
  """Read every review listed in `file` (one file name per line) from the directory `path`."""
  reviews = []
  with open(file) as name_file:
    for filename in name_file:
      fname = path + filename.rstrip()
      # each file holds exactly one review, so read it as a single string
      with open(fname, encoding="utf-8") as review_file:
        reviews.append(review_file.read())
  return reviews

neg_reviews = open_file_by_looping("aclImdb/test/neg/", "neg_file_names.txt")
print("Number of negative reviews:", len(neg_reviews))

pos_reviews = open_file_by_looping("aclImdb/test/pos/", "pos_file_names.txt")
print("Number of positive reviews:", len(pos_reviews))

reviews=neg_reviews+pos_reviews
len(reviews)

Number of negative reviews: 25000
Number of positive reviews: 25000


50000

Data quality affects the performance of all machine learning algorithms and neural networks. Poor data cannot be rescued even by a sophisticated algorithm. In this case our data is balanced (each class has an equal 50 % share) and well behaved in many respects.

This is seldom true in real-life applications. In those cases class imbalance must be handled, for example with stratified sampling or by giving different weights to different classes. If the data is grouped or the data records are not independent, even more caution is needed in the training process, since test data may "leak" into the training data and lead to overly optimistic performance measures.

For these reasons one should get familiar with a new dataset before rushing into the further steps of modeling. If the data turns out to be imbalanced, grouped etc., we can fix the issues arising from the nature of the data before feeding it to the algorithm, or at least take them into account when analysing the results.
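A first look at a labeled text dataset can be sketched in a few lines. The helper names `class_shares` and `count_duplicates` are made up for this illustration, and the toy data is hypothetical; the sketch only uses the standard library and does not touch the IMDb reviews.

```python
from collections import Counter

def class_shares(labels):
    """Return the share of each class label as a fraction of all labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

def count_duplicates(texts):
    """Count how many items are exact repeats of an earlier item."""
    return len(texts) - len(set(texts))

# Toy data (hypothetical), just to show the two checks
toy_labels = ["neg", "neg", "pos", "pos", "pos", "neg"]
toy_texts = ["bad", "awful", "great", "great", "nice", "bad"]

print(class_shares(toy_labels))     # balanced data gives equal shares
print(count_duplicates(toy_texts))  # repeated records can leak from train to test
```

Exact duplicates are only the simplest form of dependence between records; grouped data (for example several reviews of the same movie) needs a group-aware split instead.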

In [48]:
reviews=neg_reviews+pos_reviews
len(reviews)

# make labels for the reviews:
labels = ['neg']*len(neg_reviews) + ['pos']*len(pos_reviews)
print(len(labels))

# make shuffled indices and shuffle both of labels and reviews with them
indices = list(range(len(labels)))
random.shuffle(indices)

labels = [labels[index] for index in indices]
reviews = [reviews[index] for index in indices]

for label, text in zip(labels[:10], reviews[:10]):
  print("label:", label, "\ntext:", text, "\n")

50000
label: neg 
text: The murder of the Red Comyn in Grayfriars Abbey was a long way from one of the most horrendous things ever done in the Scottish War of Independence and fights (and killing) in churches wasn't unusual at all. Not that much later Robert Bruces wife, daughter, two of his sisters were captured during a fight in a church in which people were killed. And comparing it to the massacre of Berwick in which the English slaughtered at least 8000 non-combatants (some, yes, in churches) is ridiculous.<br /><br />That said this is not a well-made movie. It is slightly antidote to the absolutely RIDICULOUS sniveling representation of Robert Bruce in Braveheart. Whatever Bruce was, it wasn't a wuss.<br /><br />Too bad that they didn't do a better job of this because someone should make a really GOOD movie of a war that is so amazing that it sounds like something someone made up going from complete defeat at the Battle of Methven to a secret return from hiding to a long guerrilla

In [49]:
# Remove HTML tags

import re
pattern1=r"<br /><br />" 
reviews = [re.sub(pattern1, " ", item) for item in reviews]

for label, text in zip(labels[:10], reviews[:10]):
  print("label:", label, "\ntext:", text, "\n")

label: neg 
text: The murder of the Red Comyn in Grayfriars Abbey was a long way from one of the most horrendous things ever done in the Scottish War of Independence and fights (and killing) in churches wasn't unusual at all. Not that much later Robert Bruces wife, daughter, two of his sisters were captured during a fight in a church in which people were killed. And comparing it to the massacre of Berwick in which the English slaughtered at least 8000 non-combatants (some, yes, in churches) is ridiculous. That said this is not a well-made movie. It is slightly antidote to the absolutely RIDICULOUS sniveling representation of Robert Bruce in Braveheart. Whatever Bruce was, it wasn't a wuss. Too bad that they didn't do a better job of this because someone should make a really GOOD movie of a war that is so amazing that it sounds like something someone made up going from complete defeat at the Battle of Methven to a secret return from hiding to a long guerrilla war to Bannockburn. This is

In [50]:
# s = r"\s\tWord"
# prog = re.compile(s)
# prog

# re.sub(some_regex, some_replacement.replace('\\', '\\\\'), input_string)

In [51]:
# import re
# pattern1=r"<br /><br />" # something fishy is going on with HTML tags. Get rid of them
# s = "\\'"
# print(s)
# # pattern2=re.escape(r"\'")
# # print("p2:", pattern2)
# pattern3 = s
# print(pattern3)
# # print(pattern3.replace('\\', '\\\\'))
# print(chr(39))

# fixed = re.sub(pattern3, chr(39), reviews[8])
# fixed_n = re.sub(pattern1, " ", fixed)
# fixed_n

## Train-Dev-Test split the data

In [52]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(reviews, labels, test_size=0.33)

print(len(X_train), len(y_train))

for label, text in zip(y_train[:10], X_train[:10]):
  print("label:", label, "\ntext:", text, "\n")

33500 33500
label: neg 
text: Without saying how it ended, it is sufficient to say that the whole thing degenerates from about five minutes before the end. If the standard had been maintained throughout, the movie would be worth a seven. One wonders in a way why a woman was added to the cast. (Well - not really!) The premise is a good one The situation the victims find themselves in is pretty terrifying and it's rather well done, but you get the impression the makers of the film lost interest towards the end, or as a previous contributor said, they changed writers and handed over to someone else. 

label: neg 

label: pos 
text: In the year 2000 (keep in mind, this is two years ago, not four), two men had the motivation to create the most miraculous piece of art on this side of the Mississippi. Thanks to Jere Cunningham and Tom Flynn, the world can now enjoy Second String, a delicious TV movie depicting a tale of a rag-tag gang of second stringers (thus the title) who are thrust into t

## Preprocessing: Tfidf Vectorizer 

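Since we are dealing with text data, we need to transform it into a numeric format that a basic SVM can handle. For that purpose I use the scikit-learn TfidfVectorizer, which turns each review into a vector of tf-idf weights, one per vocabulary term.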
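To make the tf-idf weighting concrete, here is a minimal pure-Python sketch on three toy documents. It assumes plain whitespace tokenization and the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is my understanding of the scikit-learn defaults; the function name `tfidf_rows` is hypothetical and the real vectorizer tokenizes differently.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return (vocabulary, rows): one L2-normalised tf-idf weight list per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(documents)
    # document frequency: in how many documents does each token occur?
    df = Counter(token for tokens in tokenized for token in set(tokens))
    # smoothed inverse document frequency (assumed to match the sklearn default)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights = [tf[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows

vocab, rows = tfidf_rows(["good movie", "bad movie", "good good plot"])
print(vocab)
for row in rows:
    print([round(w, 2) for w in row])
```

A term that is rare across documents ("bad") gets a larger weight than one that is common ("movie"), which is the point of the idf factor.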

In [53]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=50000)
fm_train = vectorizer.fit_transform(X_train)
fm_test = vectorizer.transform(X_test)

In [54]:
# the number of columns (features) is capped by max_features of the vectorizer

print(f"We have {fm_train.shape[0]} rows and {fm_train.shape[1]} columns in the training data")
print(f"And {fm_test.shape[0]} rows and {fm_test.shape[1]} columns in the test data")


We have 33500 rows and 50000 columns in the training data
And 16500 rows and 50000 columns in the test data


In [55]:
# type of the input data is scipy.sparse.csr.csr_matrix
print(type(fm_train))

# What does it mean? It looks like this:
print(fm_train[0:2])

# Each printed entry is one non-zero cell of the tf-idf matrix:
# a (row, column) index pair followed by the tf-idf weight.
# The row index is the document index (the number of the review) and the column
# index identifies the token

<class 'scipy.sparse.csr.csr_matrix'>
  (0, 13199)	0.09941007866240618
  (0, 40011)	0.09720000564462355
  (0, 29653)	0.07463746999238682
  (0, 18383)	0.16408338868089525
  (0, 49305)	0.13276218932659853
  (0, 7366)	0.13785030116220048
  (0, 43712)	0.05078988207012413
  (0, 36551)	0.09847205302615367
  (0, 9260)	0.2506423047981331
  (0, 32344)	0.13157953301868805
  (0, 2368)	0.040067198120549485
  (0, 29370)	0.05194563556205639
  (0, 44535)	0.13083017720105566
  (0, 21816)	0.11904501361315997
  (0, 25191)	0.1099687038650605
  (0, 14964)	0.043764256249960866
  (0, 29093)	0.028977915286299553
  (0, 25677)	0.14185450401662508
  (0, 20897)	0.1416972537134192
  (0, 16614)	0.06403974351093289
  (0, 49637)	0.04386083341751073
  (0, 6095)	0.03683884375959961
  (0, 12148)	0.08953019366645623
  (0, 33836)	0.09601396927197847
  (0, 1717)	0.05690738615936637
  :	:
  (1, 2368)	0.0330230885226977
  (1, 29370)	0.06421981328863342
  (1, 25191)	0.04531769643008442
  (1, 14964)	0.018035088252543168
  (1,

In [56]:
# Rows are mostly zeros because a single review contains only a small fraction of the vocabulary.
# This is why a sparse format is used instead of a dense one:
print(fm_train[0:2].todense())

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
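What the sparse (CSR) format stores instead of all those zeros can be sketched in pure Python. The function `dense_to_csr` and the toy matrix are hypothetical illustrations; SciPy's own implementation is of course more elaborate.

```python
def dense_to_csr(rows):
    """Convert a list of dense rows to the three CSR arrays."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for column, value in enumerate(row):
            if value != 0:
                data.append(value)      # keep only the non-zero values
                indices.append(column)  # and remember their column
        indptr.append(len(data))        # where the next row starts in `data`
    return data, indices, indptr

dense = [
    [0.0, 0.7, 0.0, 0.0, 0.7],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
data, indices, indptr = dense_to_csr(dense)
print(data)     # non-zero values, row by row
print(indices)  # column index of each stored value
print(indptr)   # row i occupies data[indptr[i]:indptr[i + 1]]
```

With 50000 columns and only a few hundred distinct words per review, storing three short arrays is far cheaper than storing every zero.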


In [57]:
# In the vectorizer vocabulary we have the original words as key value pairs, where the 
# word is the key and matrix index is the value:

for (idx, item) in enumerate(vectorizer.vocabulary_.items()):
  print("Key:", item[0], "\tValue:",item[1])
  if (idx==8):
    break

Key: without 	Value: 48956
Key: saying 	Value: 36990
Key: how 	Value: 20045
Key: it 	Value: 22300
Key: ended 	Value: 13422
Key: is 	Value: 22232
Key: sufficient 	Value: 41989
Key: to 	Value: 44188
Key: say 	Value: 36985


## Finding the best model with GridSearch Cross Validation

The model can be tuned by exploring the hyperparameters one by one or with nested for-loops. It will, however, become a frustrating task to keep track of the hyperparameter combinations and the performance values obtained. A more systematic way to do this is to use grid search (GS).

GridSearchCV lets us define a grid (one list of candidate values per hyperparameter) and evaluates every combination. The idea is to find a sweet spot (the best performance measure) by adjusting the grid. K-fold CV also introduces a new setting that affects the results, namely the number of folds.
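How a grid expands into candidates can be sketched with `itertools.product`. The parameter names and values below mirror the grid used in the search cell further down in this notebook; the `clf__C` list is written without NumPy on the assumption that `np.logspace(-1, 1, num=5, endpoint=False)` yields the exponents -1, -0.6, -0.2, 0.2 and 0.6.

```python
from itertools import product

grid = {
    "vec__max_features": [20000, 50000],
    "vec__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "clf__C": [10 ** (-1 + 0.4 * i) for i in range(5)],
}

# every combination of one value per hyperparameter is one candidate
names = list(grid)
candidates = [dict(zip(names, values)) for values in product(*grid.values())]

n_folds = 3
print(len(candidates))            # number of hyperparameter combinations
print(len(candidates) * n_folds)  # number of model fits with 3-fold CV
print(candidates[0])
```

The number of fits grows multiplicatively with every hyperparameter added, which is why the grid is worth keeping small.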

GS uses K-fold cross-validation (CV) to find the best performing model. The training data is divided into k subsets ("folds"); each fold is used once for validation while the remaining k-1 folds are used for training, and the scores are averaged over the folds. Cross-validation is also useful when the data set is small and we would like to "have our cake and eat it too".
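The fold bookkeeping itself is simple, as this pure-Python sketch shows. The generator `k_fold_indices` is a hypothetical helper, not the scikit-learn implementation, and it does no shuffling (the reviews in this notebook were already shuffled earlier).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # distribute the remainder so that fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

for train, validation in k_fold_indices(n_samples=10, k=3):
    print("train:", train, "validation:", validation)
```

Every sample lands in exactly one validation fold, so all of the data is used for both training and validation, just never in the same fit.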

For the task I have chosen the Linear Support Vector Classifier. With multiple classes, where the number of classes is *n*, LinearSVC implements a “one-vs-the-rest” multi-class strategy and thus trains *n* models ([Scikit-learn](https://scikit-learn.org/stable/modules/svm.html#svm-classification)). At prediction time every classifier scores the item, and the item is assigned to the class whose classifier gives the highest decision score. In a binary task such as ours a single model is enough. Other possible models for a text classification problem are, for example, K-Nearest-Neighbors and Multinomial Naive Bayes. Different classifiers can also be compared with GS.
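A minimal sketch of one-vs-rest prediction with linear models, assuming the usual rule that an item goes to the class whose classifier returns the highest decision score. The three classes, the weights and the feature vector are invented for illustration only.

```python
def decision_score(weights, bias, features):
    """Linear decision function w . x + b of one binary classifier."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def one_vs_rest_predict(models, features):
    """Score the item with every class-vs-rest model and pick the highest score."""
    scores = {label: decision_score(w, b, features) for label, (w, b) in models.items()}
    return max(scores, key=scores.get), scores

# Hypothetical (weights, bias) pairs, one per class
models = {
    "neg": ([-1.0, 0.5], 0.0),
    "neutral": ([0.2, 0.2], -0.1),
    "pos": ([1.0, -0.5], 0.0),
}
label, scores = one_vs_rest_predict(models, features=[0.9, 0.1])
print(label)
print(scores)
```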

A simple pipeline is built for both the preprocessor (vectorizer) and the classifier so that we are able to find the best hyperparameters for both of them at once.

Sources:

[GridSearchCV documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

[GridSearchCV example 1](https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html) 

[GridSearchCV example 2](https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html) 

[SVM documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)

[GridSearchCV scoring parameters](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) 

#### TfidfVectorizer with SVM

In [59]:
costs = np.logspace(-1, 1, num=5, endpoint = False)

pipeline = Pipeline([
    ('vec', TfidfVectorizer()),
    ('clf', LinearSVC()),
])

parameters = {
    #'vec__binary': (True, False), # Previous runs revealed this does not seem to matter
    'vec__max_features': (20000, 50000),
    'vec__ngram_range': ((1, 1), (1, 2), (1, 3)),  
    'clf__C': (costs), 
}
# find the best parameters for both the feature extraction and the classifier
print("Running grid search...")
# n_jobs=-1: use as many cores as possible
# cv=3: three folds (this is rather few, but it speeds things up)
gridsearch = GridSearchCV(pipeline, parameters, cv=3, verbose=1, n_jobs=-1)
gridsearch.fit(X_train, y_train)
print("Grid search done!")
print()
print(f"Best score: {gridsearch.best_score_:0.2}")
print("Best of the observed hyperparameters:")
best_parameters = gridsearch.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
      print(f"{param_name}: {best_parameters[param_name]}")

Running grid search...
Fitting 3 folds for each of 30 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed: 14.1min


KeyboardInterrupt: ignored

So was the selected model clearly the best?

In [None]:
# set column visibility in pandas df
pd.set_option("max_colwidth", None)

# extract mean score for each parameter combination trained
means = gridsearch.cv_results_['mean_test_score'] 

GSCV_results = pd.DataFrame(list(zip(means, gridsearch.cv_results_['params'])), 
               columns =['Score', 'Parameters']) 
# sort by the score
GSCV_results.sort_values(by="Score", ascending=False, inplace=True)
print(GSCV_results.head(7))

No! This seemed to be a tight race.

##### Performance evaluation

In [None]:
# classifier=grid_search.best_estimator_
# classifier.fit(feature_matrix_train, train_label)

predictions = gridsearch.predict(X_test)
acc = accuracy_score(y_test, predictions)
# conf = confusion_matrix(test_labels, predictions)

print(f"Test accuracy: {acc:0.2f}")
print()
# note: here we have to feed in the raw test texts, not a feature matrix, since the estimator is a pipeline, not a bare classifier!
plot_confusion_matrix(gridsearch.best_estimator_, X_test, y_test, cmap='Greens', values_format='d')  
plt.title("Confusion matrix")
plt.show()

plot_confusion_matrix(gridsearch.best_estimator_, X_test, y_test, cmap='Blues', normalize='true')  
plt.title("Normalized confusion matrix")
plt.show()

In [None]:
print(classification_report(y_test, predictions))

The model handled both classes well. This can be seen from the confusion matrix and from the classification report, where precision and recall are in balance for both of the labels.
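For reference, the per-class numbers in a classification report can be recomputed by hand. The sketch below uses a hypothetical helper `precision_recall_f1` and six made-up predictions, not the results of this notebook.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Compute precision, recall and F1 for the class `positive`."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0  # how many predicted positives are right
    recall = tp / (tp + fn) if tp + fn else 0.0     # how many true positives were found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (hypothetical)
y_true = ["pos", "pos", "pos", "neg", "neg", "neg"]
y_pred = ["pos", "pos", "neg", "neg", "neg", "pos"]

for label in ("neg", "pos"):
    print(label, precision_recall_f1(y_true, y_pred, positive=label))
```

When precision and recall are close to each other for both classes, as in this toy case, the classifier is not favouring one label at the expense of the other.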

#### Simple Neural Network


In [None]:
# redo the split to overwrite old variables
X_train, X_test, y_train, y_test = train_test_split(reviews, labels, test_size=0.33)

To build a NN we need to

1.   turn the sparse feature matrix into tensors
2.   know the shape of the input layer (number of features)
3.   know the shape of the output layer (number of classes)

TfidfVectorizer gives the second one and LabelEncoder (for example) the third one (or just `len(set(y_train))`).
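What a label encoder does, and how the output size follows from it, can be sketched without any library. The helper `fit_label_encoder` is hypothetical; it assumes the classes are sorted before being numbered, which is how I understand scikit-learn's LabelEncoder to behave.

```python
def fit_label_encoder(labels):
    """Return the sorted class list and a label -> integer mapping."""
    classes = sorted(set(labels))
    to_index = {label: index for index, label in enumerate(classes)}
    return classes, to_index

classes, to_index = fit_label_encoder(["pos", "neg", "neg", "pos"])

encoded = [to_index[label] for label in ["neg", "pos", "pos"]]  # labels -> integers
decoded = [classes[index] for index in encoded]                 # integers -> labels
output_size = len(classes)                                      # units in the output layer

print(classes, encoded, decoded, output_size)
```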

"Keras models can be used in scikit-learn by wrapping them with the KerasClassifier or KerasRegressor class."(https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/)

In [None]:
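# 1) scipy sparse matrix to TF sparse tensor
import tensorflow as tf
import numpy as np

def convert_sparse_matrix_to_sparse_tensor(X):
    coo = X.tocoo()
    # one (row, column) pair per non-zero value; np.mat is deprecated, so use column_stack
    indices = np.column_stack((coo.row, coo.col)).astype(np.int64)
    # reorder the entries into the row-major order TensorFlow expects
    return tf.sparse.reorder(tf.SparseTensor(indices, coo.data, coo.shape))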

Since the vectorizer determines the input shape of the NN, we do not optimize it as a part of the pipeline.

In [None]:
# 2) size of input 
vectorizer = TfidfVectorizer(max_features=100000)

ft_matrix = vectorizer.fit_transform(X_train)
ft_matrix.shape # so we need the second dimension for building the nn
input_size = ft_matrix.shape[1]
input_size

In [None]:
# 3) size_of_output_layer
# use encoded labels when fitting the model and for testing
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder() #Turns class labels into integers
class_numbers_train = label_encoder.fit_transform(y_train)

print("class_numbers shape=", class_numbers_train.shape)
print("class labels", label_encoder.classes_) #this will let us translate back from indices to labels

output_size = len(label_encoder.classes_)
print("Shape of output layer:", output_size)

In [None]:
from keras.models import Model, Sequential
from keras.layers import Input, Dense, Dropout
from keras import optimizers

def build_sequential_nn(input_size=100, output_size=2, hiddenlayer_size=200, drop_out=0.3, learning_rate=0.001):
  # let's make 200 the default size of the hidden layer
  model = Sequential()
  model.add(Input(shape=(input_size,)))
  model.add(Dense(hiddenlayer_size, activation="tanh"))
  model.add(Dropout(rate=drop_out)) # Dropout regularizer to avoid overfitting
  # output layer: one softmax probability per class, as sparse_categorical_crossentropy expects
  model.add(Dense(output_size, activation="softmax"))
  model.compile(optimizer=optimizers.Adam(learning_rate=learning_rate), loss="sparse_categorical_crossentropy", metrics=['accuracy'])
  return model

model = build_sequential_nn(input_size=input_size, output_size=output_size)
model.summary()

In [None]:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.preprocessing import FunctionTransformer
# from keras.callbacks import EarlyStopping
import time

pipeline = Pipeline([
    ('trans', FunctionTransformer(convert_sparse_matrix_to_sparse_tensor)), # wrapper for custom function
    ('clf', KerasClassifier(build_fn=build_sequential_nn)), # wrapper for Keras model
])

parameters = {
    'clf__hiddenlayer_size': (200, 300), 
    'clf__input_size': ([input_size]), # GS sets ALL the params
    'clf__output_size': ([output_size]),
    'clf__batch_size': (64, 265),
    'clf__drop_out': (0.2, 0.4),
    'clf__epochs': (3, 5), # do not use early stopping callback, number of epochs is best treated as a hyper parameter: https://stackoverflow.com/questions/48127550/early-stopping-with-keras-and-sklearn-gridsearchcv-cross-validation
    'clf__learning_rate': (0.001, 0.01) # 0.001 is default for Adam
}

t0=time.time()
print("Running grid search...")
gridsearch = GridSearchCV(pipeline, parameters, verbose=1, n_jobs=1) # n_jobs=1: run the fits one at a time (use a GPU if available); n_jobs=-1 would use as many CPU cores as possible
gridsearch.fit(ft_matrix, class_numbers_train)
print()

print(f"Best score: {gridsearch.best_score_:0.2}")

print("Best of the observed hyperparameters:")
best_parameters = gridsearch.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
      print(f"{param_name}: {best_parameters[param_name]}")

t1=time.time()

In [None]:
print(f"Time elapsed {(t1-t0)/60:0.3} minutes")

In [None]:
means = gridsearch.cv_results_['mean_test_score'] 

GSCV_results = pd.DataFrame(list(zip(means, gridsearch.cv_results_['params'])), 
               columns =['Score', 'Parameters']) 
# sort by the score
GSCV_results.sort_values(by="Score", ascending=False, inplace=True)
print(GSCV_results.head(7))

##### Performance evaluation

In [None]:
import seaborn as sns

# prepare test data
ftm_test=vectorizer.transform(X_test) # model needs to be Sequential for predicting
class_numbers_test = label_encoder.transform(y_test)

# predict
raw_predictions = gridsearch.predict(ftm_test)
predictions=label_encoder.inverse_transform(raw_predictions)

# results
acc = accuracy_score(y_test, predictions)
print(f"Test accuracy: {acc:0.2f}")
print()
cf_mat = tf.math.confusion_matrix(
    class_numbers_test, raw_predictions, num_classes=None, weights=None
)

def plot_cf_matrix(mat):
  sns.heatmap(mat, annot=True, fmt='d', xticklabels=label_encoder.classes_, yticklabels=label_encoder.classes_, cmap='Greens')
  plt.title("Confusion matrix for test data", fontsize = 16)
  plt.ylabel("True class", fontsize = 14)
  plt.xlabel("Predicted class", fontsize = 14)

plot_cf_matrix(cf_mat)

In [None]:
print(classification_report(y_test, predictions))