### IMDB Movie Review

Context: The IMDB dataset contains 25K movie reviews for natural language processing or text analytics. It is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 12,500 highly polar movie reviews for training and 12,500 for testing. If you are facing memory issues, use less data (e.g. 6K reviews), but make sure to use an equal number of positive and negative reviews, and state clearly in the notebook if you have used a reduced dataset. For more information on the dataset, see the following link:
http://ai.stanford.edu/~amaas/data/sentiment/
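
If a reduced dataset is needed, one way to keep the classes equal is to sample the same number of rows from each sentiment group. The sketch below uses a small made-up DataFrame with the same column names as this notebook; `balanced_subset` and `n_per_class` are illustrative names, not part of the assignment.

```python
import pandas as pd

# Toy stand-in for the real data; the column names match the notebook,
# but the rows here are made up for illustration.
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "sentiment": ["positive"] * 12 + ["negative"] * 8,
})

def balanced_subset(df, n_per_class, seed=0):
    """Return n_per_class randomly chosen rows for each sentiment label."""
    return (
        df.groupby("sentiment")
          .sample(n=n_per_class, random_state=seed)
          .reset_index(drop=True)
    )

subset = balanced_subset(toy, n_per_class=5)
print(subset["sentiment"].value_counts())
```

For a 6K subset of the real data, the same call with `n_per_class=3000` would give 3,000 positive and 3,000 negative reviews.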

Task: The goal of this project is to classify each review as positive or negative (binary sentiment classification).

In [1]:
#Load the libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import nltk

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from nltk.tokenize import word_tokenize,sent_tokenize

import re, string

from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

import warnings
warnings.filterwarnings('ignore')

In [6]:
#importing the data
imdb_data=pd.read_excel('IMDB_dataset.xlsx')
imdb_data.head(10)

Unnamed: 0,review,sentiment
0,I thought this was a wonderful way to spend ti...,positive
1,"Probably my all-time favorite movie, a story o...",positive
2,I sure would like to see a resurrection of a u...,positive
3,"This show was an amazing, fresh & innovative i...",negative
4,Encouraged by the positive comments about this...,negative
5,Phil the Alien is one of those quirky films wh...,negative
6,I saw this movie when I was about 12 when it c...,negative
7,So im not a big fan of Boll's work but then ag...,negative
8,This a fantastic movie of three prisoners who ...,positive
9,This movie made it into one of my top 10 most ...,negative


#### Exploratory Data Analysis

In [7]:
#Summary of the dataset
imdb_data.describe()

Unnamed: 0,review,sentiment
count,25000,25000
unique,24898,2
top,"When i got this movie free from my job, along ...",positive
freq,3,12500


In [8]:
# What is the shape of the dataset?

print("Data has {} rows and {} columns".format(len(imdb_data), len(imdb_data.columns)))

Data has 25000 rows and 2 columns


In [9]:
# How many positive/negative sentiments are there?

print("Out of {} rows, {} are positive, {} are negative".format(len(imdb_data),
                                                       len(imdb_data[imdb_data['sentiment']=='positive']),
                                                       len(imdb_data[imdb_data['sentiment']=='negative'])))

Out of 25000 rows, 12500 are positive, 12500 are negative


In [10]:
# How much missing data is there?

print("Number of null in sentiment: {}".format(imdb_data['sentiment'].isnull().sum()))
print("Number of null in review: {}".format(imdb_data['review'].isnull().sum()))

Number of null in sentiment: 0
Number of null in review: 0


Observations: the dataset has 25,000 rows, split evenly into 12,500 positive and 12,500 negative reviews. There are no null values. Note that `describe()` reports 24,898 unique reviews, so 102 rows repeat a review that appears elsewhere in the dataset.
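
The `describe()` output shows 24,898 unique reviews out of 25,000, which means some reviews are repeated. A quick way to count and optionally drop such repeats is sketched below on made-up rows. Whether to drop them is a judgment call; this notebook keeps them.

```python
import pandas as pd

# Made-up rows with one repeated review, for illustration only.
toy = pd.DataFrame({
    "review": ["great film", "awful film", "great film", "fine"],
    "sentiment": ["positive", "negative", "positive", "positive"],
})

# Rows whose review text already appeared earlier in the frame
n_repeats = toy.duplicated(subset="review").sum()

# Keep only the first occurrence of each review
deduped = toy.drop_duplicates(subset="review").reset_index(drop=True)

print(n_repeats, len(deduped))
```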

### Create function to remove punctuation, tokenize, remove stopwords, and stem

In [11]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [12]:
#Define function for removing special characters
def remove_special_characters(text):
    #Keep only ASCII letters, digits and whitespace
    #(the range must be A-Z; A-z would also match [ \ ] ^ _ and `)
    pattern=r'[^a-zA-Z0-9\s]'
    text=re.sub(pattern,'',text)
    return text
#Apply function on review column
imdb_data['review']=imdb_data['review'].apply(remove_special_characters)
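
One detail worth checking in a character class like the one used for stripping special characters: the range `A-z` is not the same as `A-Z`. In ASCII, the characters `[`, `\`, `]`, `^`, `_` and the backtick sit between `Z` and `a`, so `A-z` silently keeps them. The sample string below is made up to show the difference.

```python
import re

sample = "Good_movie [really] 10/10!"

wide = re.sub(r"[^a-zA-z0-9\s]", "", sample)    # A-z range
narrow = re.sub(r"[^a-zA-Z0-9\s]", "", sample)  # A-Z range

print(wide)    # Good_movie [really] 1010
print(narrow)  # Goodmovie really 1010
```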

In [13]:
stop_words = set(stopwords.words('english'))
ps = nltk.PorterStemmer()

In [14]:
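def clean_text(text):
    #Lowercase and drop any remaining punctuation, character by character
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    #Split on runs of non-word characters (raw string avoids an invalid-escape warning)
    tokens = re.split(r'\W+', text)
    #Remove stopwords, then stem what is left
    text = [ps.stem(word) for word in tokens if word not in stop_words]
    return text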
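
A small caveat about splitting with `\W+`: when the text starts or ends with a non-word character such as a space, `re.split` returns empty strings at the edges. An empty string is not a stop word, so it may survive as a token. The sketch below shows the effect and one possible filter; the filter is a suggestion, not something this notebook applies.

```python
import re

text = " loved it "

tokens = re.split(r"\W+", text)
print(tokens)      # ['', 'loved', 'it', '']

# Drop empty strings before stop-word removal and stemming
non_empty = [t for t in tokens if t]
print(non_empty)   # ['loved', 'it']
```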

### Applying TfidfVectorizer

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(imdb_data['review'])
print(X_tfidf.shape)

(25000, 93806)


The reviews are now cleaned and vectorized with TfidfVectorizer. The sentiment labels are left as the strings 'positive' and 'negative', which scikit-learn classifiers accept directly. Next, I'll use GridSearchCV on the Random Forest and Gradient Boosting classifiers.
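
Because the TF-IDF matrix here is fitted on all 25,000 reviews before cross-validation, each fold's vocabulary and IDF weights have already seen that fold's validation reviews, which can make the cross-validation scores slightly optimistic. A possible alternative, sketched on made-up reviews, is to put the vectorizer and classifier in a scikit-learn `Pipeline` so the vectorizer is refitted inside every fold. The step names `tfidf` and `rf` and the small grid are arbitrary choices for this sketch.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy reviews, only to keep the sketch runnable.
docs = [
    "wonderful film loved it", "great acting great story",
    "a fantastic touching movie", "brilliant and funny",
    "really enjoyable picture", "superb cast and script",
    "terrible film hated it", "awful acting dull story",
    "a boring pointless movie", "bad and unfunny",
    "really tedious picture", "weak cast and script",
]
labels = ["positive"] * 6 + ["negative"] * 6

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),                    # refitted on each training fold
    ("rf", RandomForestClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"rf__n_estimators": [5, 10], "rf__max_depth": [3, None]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(docs, labels)
print(search.best_params_)
```

On the real data, `TfidfVectorizer(analyzer=clean_text)` could take the place of the default vectorizer, and the raw `review` column would be passed to `fit` instead of a precomputed matrix.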

### Using GridSearchCV on Random Forest & Gradient Boosting Classifier.

In [16]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

### Exploring Random Forest parameter settings using GridSearchCV

In [17]:
rf = RandomForestClassifier()
param = {'n_estimators': [10, 150, 300],
        'max_depth': [30, 60, 90, None]}

gs = GridSearchCV(rf, param, cv=5, n_jobs=-1)
gs_fit = gs.fit(X_tfidf, imdb_data['sentiment'])
pd.DataFrame(gs_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
11,470.035593,64.896608,2.378719,0.568143,,300,"{'max_depth': None, 'n_estimators': 300}",0.8574,0.8692,0.8582,0.8572,0.857,0.8598,0.004718,1
8,431.824259,7.395252,2.613095,0.161608,90.0,300,"{'max_depth': 90, 'n_estimators': 300}",0.8578,0.8672,0.8564,0.855,0.8544,0.85816,0.004671,2
5,330.721986,3.518083,2.320362,0.201412,60.0,300,"{'max_depth': 60, 'n_estimators': 300}",0.8546,0.8704,0.8544,0.8532,0.8534,0.8572,0.006622,3
4,159.588523,1.445921,1.173218,0.049049,60.0,150,"{'max_depth': 60, 'n_estimators': 150}",0.8532,0.8606,0.8492,0.8504,0.852,0.85308,0.003999,4
7,219.450775,3.564048,1.393798,0.154384,90.0,150,"{'max_depth': 90, 'n_estimators': 150}",0.8604,0.8616,0.8496,0.8432,0.8502,0.853,0.006988,5


From the GridSearchCV results for Random Forest, I chose max_depth=60 and n_estimators=150 for the final model. This balances fit time against mean test score: the configurations with n_estimators=300 scored slightly higher (0.857 to 0.860 versus 0.853) but took roughly two to three times as long to fit (about 330 to 470 seconds versus about 160 seconds). Since the differences in test score are small, these settings seemed the most balanced.
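
The trade-off between fit time and score can also be written as an explicit rule: keep every configuration whose mean test score is within some tolerance of the best, then take the fastest one. The sketch below applies that rule to the five rows of the table above. The tolerance of 0.01 is an assumed threshold, not something taken from the notebook.

```python
# (max_depth, n_estimators, mean_fit_time in seconds, mean_test_score),
# copied from the top five rows of the grid-search table.
results = [
    (None, 300, 470.035593, 0.85980),
    (90,   300, 431.824259, 0.85816),
    (60,   300, 330.721986, 0.85720),
    (60,   150, 159.588523, 0.85308),
    (90,   150, 219.450775, 0.85300),
]

TOLERANCE = 0.01  # assumed threshold, not taken from the notebook

best_score = max(score for *_, score in results)

# Configurations that score within TOLERANCE of the best one
close_enough = [r for r in results if best_score - r[3] <= TOLERANCE]

# Among those, pick the one with the shortest mean fit time
choice = min(close_enough, key=lambda r: r[2])
print(choice)  # (60, 150, 159.588523, 0.85308)
```

With this tolerance, the rule lands on max_depth=60 and n_estimators=150, the same settings chosen above.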

### Exploring GradientBoostingClassifier parameter settings using GridSearchCV

In [29]:
gb = GradientBoostingClassifier()
param = {
    'n_estimators': [100, 150], 
    'max_depth': [7, 11, 15],
    'learning_rate': [0.1]
}

clf = GridSearchCV(gb, param, cv=5, n_jobs=-1)
cv_fit = clf.fit(X_tfidf, imdb_data['sentiment'])
pd.DataFrame(cv_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
1,988.903262,13.486077,0.132199,0.011215,0.1,7,150,"{'learning_rate': 0.1, 'max_depth': 7, 'n_esti...",0.8444,0.8534,0.847,0.843,0.839,0.84536,0.004781,1
3,1592.629991,56.766265,0.244802,0.039276,0.1,11,150,"{'learning_rate': 0.1, 'max_depth': 11, 'n_est...",0.8446,0.85,0.8416,0.839,0.8444,0.84392,0.003667,2
5,1876.156588,317.179466,0.183998,0.034802,0.1,15,150,"{'learning_rate': 0.1, 'max_depth': 15, 'n_est...",0.8374,0.8488,0.8382,0.835,0.8404,0.83996,0.004745,3
2,1005.008991,39.940957,0.145399,0.026013,0.1,11,100,"{'learning_rate': 0.1, 'max_depth': 11, 'n_est...",0.8374,0.8492,0.8346,0.8338,0.8382,0.83864,0.005532,4
0,620.5542,25.199149,0.096601,0.003826,0.1,7,100,"{'learning_rate': 0.1, 'max_depth': 7, 'n_esti...",0.8402,0.8436,0.8352,0.8346,0.8334,0.8374,0.003872,5


From the GridSearchCV results for GradientBoostingClassifier, I chose max_depth=7 and n_estimators=150 for the final model. This combination had the highest mean test score (0.845) and the shortest fit time of the three n_estimators=150 configurations, so it seemed the most balanced of the top five.

Next, I'll perform the final evaluation of the models.

In [19]:
#splitting data into train and test
X_train, X_test, y_train, y_test = train_test_split(imdb_data['review'], imdb_data['sentiment'], test_size=0.5)

In [20]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(12500,)
(12500,)
(12500,)
(12500,)


The data is now split into 12,500 training and 12,500 testing reviews, as instructed.
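
The split above is purely random, so the two halves are not guaranteed to contain exactly equal numbers of positive and negative reviews, and the split changes on every run. If an exact class balance and a repeatable split are wanted, `train_test_split` accepts `stratify` and `random_state`. The sketch below uses made-up rows, and the seed value is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up balanced toy data with the notebook's column names.
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "sentiment": ["positive"] * 10 + ["negative"] * 10,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["review"], toy["sentiment"],
    test_size=0.5,
    stratify=toy["sentiment"],  # keep the class ratio identical in both halves
    random_state=42,            # arbitrary seed so the split is repeatable
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```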

In [21]:
#vectorize text
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
tfidf_vect_fit = tfidf_vect.fit(X_train)

tfidf_train = tfidf_vect_fit.transform(X_train)
tfidf_test = tfidf_vect_fit.transform(X_test)

X_train_vect = pd.DataFrame(tfidf_train.toarray())
X_test_vect = pd.DataFrame(tfidf_test.toarray())

X_train_vect.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,62303,62304,62305,62306,62307,62308,62309,62310,62311,62312
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
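
Calling `.toarray()` turns the sparse TF-IDF output into a dense matrix, which is costly at this size. A rough estimate for one 12,500 x 62,313 matrix, assuming the default float64 values:

```python
rows, cols = 12_500, 62_313   # shape of X_train_vect shown above
bytes_per_value = 8           # float64, the default dtype TfidfVectorizer produces

dense_bytes = rows * cols * bytes_per_value
print(f"{dense_bytes / 1e9:.2f} GB per dense matrix")  # 6.23 GB per dense matrix
```

The train and test matrices together need about twice that. Both `RandomForestClassifier` and `GradientBoostingClassifier` document sparse matrices as valid input to `fit`, so passing `tfidf_train` and `tfidf_test` directly should avoid the dense copies if memory is a problem.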


### Final evaluation of models

In [23]:
import time
rf = RandomForestClassifier(n_estimators=150, max_depth=60, n_jobs=-1)

start = time.time()
rf_model = rf.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = rf_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='positive', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

Fit time: 228.149 / Predict time: 24.48 ---- Precision: 0.851 / Recall: 0.847 / Accuracy: 0.849


In [31]:
gb = GradientBoostingClassifier(learning_rate=0.1, max_depth=7, n_estimators=150)

start = time.time()
gb_model = gb.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = gb_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='positive', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

Fit time: 27916.753 / Predict time: 24.319 ---- Precision: 0.83 / Recall: 0.858 / Accuracy: 0.841


In the final evaluation, the Random Forest classifier performed best on this dataset. It had higher precision (0.851 versus 0.83) and accuracy (0.849 versus 0.841), and it fit in about 228 seconds, compared with about 27,917 seconds for the GradientBoostingClassifier. The GradientBoostingClassifier had better recall (0.858 versus 0.847), but the RandomForestClassifier performed best overall.
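
The evaluation cells compute `fscore` but do not print it. Since precision and recall pull in opposite directions for the two models, the F1 score (their harmonic mean) gives a single number to compare. The sketch below recomputes it from the rounded precision and recall printed above, so the results are approximate.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

rf_f1 = f1(0.851, 0.847)   # Random Forest
gb_f1 = f1(0.83, 0.858)    # Gradient Boosting

print(round(rf_f1, 3), round(gb_f1, 3))  # 0.849 0.844
```

On this measure the Random Forest is still slightly ahead, which is consistent with the conclusion above.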