IMDB Review Sentiment Analysis

In this assignment, we explore a dataset for binary sentiment classification. The original dataset contains 50,000 reviews: 25,000 positive and 25,000 negative.

The star rating of a review is a good proxy for its sentiment. For example, labels can be pre-assigned as follows:
1. At least 7 out of 10 stars => positive (label=1)
2. At most 4 out of 10 stars => negative (label=0)
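The labeling rule above can be sketched as a small function. The name `label_from_stars` is hypothetical, and returning `None` for 5 and 6 stars is an assumption: the rule does not say what happens to those ratings, so this sketch simply leaves them unlabeled.

```python
def label_from_stars(stars):
    """Map a 1-10 star rating to a sentiment label (hypothetical helper)."""
    if not 1 <= stars <= 10:
        raise ValueError("stars must be between 1 and 10")
    if stars >= 7:
        return 1      # positive
    if stars <= 4:
        return 0      # negative
    return None       # 5 or 6 stars: not covered by the rule (assumption)


for stars in (2, 5, 9):
    print(stars, "->", label_from_stars(stars))
```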

Here, we sample the original dataset to create a smaller dataset for you.

In [None]:
# import CountVectorizer (it is instantiated later) and supporting libraries
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import warnings
warnings.filterwarnings("ignore") #ignore any warnings

In [None]:
# load the training and testing sets from CSV files
import pandas as pd
df_train = pd.read_csv("train_IMDB.csv")
df_test  = pd.read_csv("test_IMDB.csv")

In [None]:
df_train.head()

Unnamed: 0,review,sentiment,label
0,This is the first review I've wrote for IMDb s...,positive,1
1,"""The missing star"", who competed for the Golde...",positive,1
2,"Deep Water (2006) ****<br /><br />""It is indif...",positive,1
3,"I'm a rather pedestrian person, with somewhat ...",positive,1
4,"The film is very fast-moving, bizarre and colo...",negative,0


In [None]:
df_test.head()

Unnamed: 0,review,sentiment,label
0,This movie was made by a bunch of white guys t...,negative,0
1,"Despite a silly premise,ridiculous plot device...",positive,1
2,Why do they insist on making re-makes of great...,negative,0
3,The One and the Only!<br /><br />The only real...,positive,1
4,"While I hold its predecessor, ""Fast Times At R...",positive,1


In [None]:
X_train = df_train.review
y_train = df_train.label
X_test = df_test.review
y_test = df_test.label

## Task 1: Check the data shape (2 points)

How many training and testing examples do we have?

In [None]:
# write your answer
print(df_train.shape) #25000 rows and 3 columns
print(df_test.shape) #15000 rows and 3 columns

(25000, 3)
(15000, 3)


### Vectorizing the IMDB data using Bag-of-Words method
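To make the bag-of-words idea concrete, here is a minimal standard-library sketch of what a count vectorizer with a capped vocabulary does: lowercase the text, split it into tokens of two or more word characters, keep the most frequent words, and count them per document. The helper names (`tokenize`, `fit_vocabulary`, `transform`) are hypothetical, and this is an illustration under those assumptions, not scikit-learn's implementation (tie-breaking among equally frequent words, for example, may differ).

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters


def tokenize(text):
    return TOKEN.findall(text.lower())


def fit_vocabulary(docs, max_features):
    """Keep the max_features most frequent words, indexed alphabetically."""
    totals = Counter()
    for doc in docs:
        totals.update(tokenize(doc))
    top = [word for word, _ in totals.most_common(max_features)]
    return {word: index for index, word in enumerate(sorted(top))}


def transform(docs, vocab):
    """Return one row of word counts per document (a dense document-term matrix)."""
    rows = []
    for doc in docs:
        counts = Counter(t for t in tokenize(doc) if t in vocab)
        rows.append([counts.get(word, 0) for word in vocab])
    return rows


toy_docs = ["a great great movie", "a bad movie", "great acting"]
vocab = fit_vocabulary(toy_docs, max_features=2)
matrix = transform(toy_docs, vocab)
print(vocab)
print(matrix)
```

Words outside the fitted vocabulary ("bad", "acting") are dropped, which is why a small vocabulary discards much of the information in each review.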

In [None]:
# instantiate the vectorizer. We set the vocab size to 200
num_features = 200
vect = CountVectorizer(max_features=num_features)

In [None]:
# fit on training data and transform to vector (document-term matrix)
X_train_dtm = vect.fit_transform(X_train)

In [None]:
# examine the document-term matrix
print(X_train_dtm)

  (0, 163)	4
  (0, 82)	3
  (0, 153)	10
  (0, 57)	1
  (0, 174)	2
  (0, 58)	2
  (0, 144)	2
  (0, 192)	2
  (0, 100)	1
  (0, 31)	2
  (0, 116)	2
  (0, 11)	3
  (0, 177)	1
  (0, 83)	2
  (0, 67)	2
  (0, 80)	2
  (0, 54)	2
  (0, 16)	1
  (0, 189)	1
  (0, 169)	3
  (0, 61)	1
  (0, 162)	1
  (0, 148)	1
  (0, 157)	1
  (0, 107)	1
  :	:
  (24999, 153)	6
  (24999, 144)	2
  (24999, 192)	1
  (24999, 31)	1
  (24999, 11)	3
  (24999, 177)	4
  (24999, 83)	4
  (24999, 169)	4
  (24999, 113)	1
  (24999, 103)	3
  (24999, 145)	1
  (24999, 70)	1
  (24999, 139)	1
  (24999, 121)	1
  (24999, 33)	1
  (24999, 117)	1
  (24999, 45)	1
  (24999, 63)	1
  (24999, 159)	1
  (24999, 118)	2
  (24999, 126)	1
  (24999, 4)	1
  (24999, 55)	1
  (24999, 166)	1
  (24999, 37)	1
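
Each line of the printout above has the form `(document index, vocabulary index)` followed by a count; for example, `(0, 163)  4` means the word at vocabulary index 163 occurs 4 times in review 0. Only nonzero counts are listed. The sketch below reproduces that layout for a small dense matrix; `nonzero_entries` and `toy_matrix` are hypothetical names used only for illustration.

```python
def nonzero_entries(dense_rows):
    """Yield (row, column, count) for every nonzero count in a dense matrix."""
    for r, row in enumerate(dense_rows):
        for c, count in enumerate(row):
            if count != 0:
                yield (r, c, count)


toy_matrix = [
    [2, 0, 1],
    [0, 0, 3],
]
entries = list(nonzero_entries(toy_matrix))
for r, c, count in entries:
    print(f"  ({r}, {c})\t{count}")
```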


In [None]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)

# examine the document-term matrix
print(X_test_dtm)

  (0, 1)	1
  (0, 3)	1
  (0, 6)	3
  (0, 11)	3
  (0, 14)	4
  (0, 15)	1
  (0, 16)	2
  (0, 17)	5
  (0, 18)	1
  (0, 20)	1
  (0, 30)	6
  (0, 31)	3
  (0, 32)	2
  (0, 33)	2
  (0, 34)	3
  (0, 37)	2
  (0, 42)	1
  (0, 45)	1
  (0, 50)	1
  (0, 52)	1
  (0, 54)	2
  (0, 59)	1
  (0, 60)	1
  (0, 61)	1
  (0, 65)	1
  :	:
  (14998, 168)	1
  (14998, 169)	1
  (14998, 170)	1
  (14998, 178)	1
  (14998, 183)	1
  (14998, 184)	1
  (14998, 194)	1
  (14999, 2)	1
  (14999, 11)	1
  (14999, 19)	1
  (14999, 49)	1
  (14999, 68)	1
  (14999, 79)	1
  (14999, 83)	1
  (14999, 110)	1
  (14999, 114)	1
  (14999, 116)	1
  (14999, 137)	1
  (14999, 146)	1
  (14999, 148)	1
  (14999, 153)	1
  (14999, 163)	2
  (14999, 169)	2
  (14999, 177)	1
  (14999, 195)	1


## Task 2: Use logistic regression (3 points)

1. Train the model using X_train_dtm
2. Test the model using X_test_dtm
3. Report the training time (**CPU Time**)
4. Report the testing accuracy
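
Note that CPU time and wall-clock time are different quantities: `time.time()` and `time.perf_counter()` measure elapsed real time, while `time.process_time()` measures the CPU time used by the current process. A minimal sketch that records both is shown below; the helper name `timed` is hypothetical.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds


total, wall_s, cpu_s = timed(sum, range(1_000_000))
print(f"wall: {wall_s:.3f} s, cpu: {cpu_s:.3f} s")
```

In this notebook the same helper could wrap the training call, for example `timed(logreg.fit, X_train_dtm, y_train)`.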

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# instantiate a logistic regression model
logreg = LogisticRegression()

train the model and report the training time

In [None]:
# write your code
import time

t0 = time.time()
logreg.fit(X_train_dtm, y_train)
t_logr_200 = round(time.time() - t0, 3)  # elapsed seconds, rounded to 3 decimals
print("training time of Logistic regression with 200 features is :", t_logr_200, "s")


training time of Logistic regression with 200 features is : 0.829 s


test the model and report the testing accuracy

In [None]:
# write your answer
ypred=logreg.predict(X_test_dtm)
a_logr_200 = metrics.accuracy_score(y_test, ypred)
print('accuracy of Logistic regression with 200 features is :' , a_logr_200)

accuracy of Logistic regression with 200 features is : 0.7653333333333333
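
Accuracy here is simply the fraction of test reviews whose predicted label matches the true label. With 15,000 test reviews, the value printed above is consistent with 11,480 correct predictions. A minimal pure-Python sketch of the computation follows; the function name `accuracy` and the toy labels are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


toy_true = [1, 0, 1, 1, 0]
toy_pred = [1, 0, 0, 1, 0]
print(accuracy(toy_true, toy_pred))   # 4 of 5 correct -> 0.8
print(round(11480 / 15000, 4))        # -> 0.7653
```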


## Task 3: Use Linear SVM (3 points)

1. Train the model using X_train_dtm
2. Test the model using X_test_dtm
3. Report the training time (**CPU Time**)
4. Report the testing accuracy
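
Both classifiers learn a linear decision function; the main difference is the loss they minimize. Logistic regression uses the logistic (log) loss, while `LinearSVC` uses the squared hinge loss by default. The sketch below evaluates both as a function of the margin `y * f(x)` with `y` in {-1, +1}; the function names are hypothetical and regularization is left out.

```python
import math


def logistic_loss(margin):
    """Log loss for a single example, written in terms of the margin."""
    return math.log1p(math.exp(-margin))


def squared_hinge_loss(margin):
    """Squared hinge loss: zero once the margin reaches 1."""
    return max(0.0, 1.0 - margin) ** 2


for m in (-1.0, 0.0, 1.0, 2.0):
    print(f"margin={m:+.1f}  logistic={logistic_loss(m):.4f}  "
          f"squared_hinge={squared_hinge_loss(m):.4f}")
```

The squared hinge loss ignores examples that are already beyond the margin, whereas the logistic loss keeps a small penalty on every example.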

In [None]:
from sklearn.svm import LinearSVC
svm_linear = LinearSVC()

train the model and report the training time

In [None]:
# write your code
t0 = time.time()
svm_linear.fit(X_train_dtm, y_train)
# stop the timer only after fit() returns; stopping it before fit() reports about 0.0 s
t_SVM_200 = round(time.time() - t0, 3)
print("training time of Linear SVM with 200 features is :", t_SVM_200, "s")  # seconds, rounded to 3 decimals

training time of Linear SVM with 200 features is : 0.0 s


test the model and report the testing accuracy

In [None]:
# write your answer
ypred=svm_linear.predict(X_test_dtm)
a_SVM_200 = metrics.accuracy_score(y_test, ypred)
print('accuracy of Linear SVM with 200 features is :' , a_SVM_200)

accuracy of Linear SVM with 200 features is : 0.7633333333333333


## Task 4:  Set vocab size to 2000  (3 points)

1. Change the number of features to 2000, i.e., a new vocabulary containing 2000 words.
2. Convert the text data into vectors via the new vocabulary.
3. Re-train Logistic regression (default settings as Task 2).
4. Report the training time and testing accuracy.

In [None]:
# write your code
vect = CountVectorizer(max_features=2000)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# instantiate a logistic regression model
logreg = LogisticRegression()

report the training time and testing accuracy

In [None]:
# write your answer

t0 = time.time()
logreg.fit(X_train_dtm, y_train)
t_logr_2000 = round(time.time() - t0, 3)
print("training time of Logistic regression with 2000 features is :", t_logr_2000, "s")  # seconds, rounded to 3 decimals

# predict with the retrained model; reusing ypred from Task 3 would only repeat the SVM accuracy
ypred = logreg.predict(X_test_dtm)
a_logr_2000 = metrics.accuracy_score(y_test, ypred)
print('accuracy of Logistic regression with 2000 features is :' , a_logr_2000)

training time of Logistic regression with 2000 features is : 1.331 s
accuracy of Logistic regression with 2000 features is : 0.7633333333333333


## Task 5:  Model Comparison (4 points)

In the above three tasks, we have trained three models. 
1. Compare their performance. Which model achieved the highest accuracy? Which model took the longest to train? Try to explain your findings.
2. Relative to the Task 2 model, the Task 3 model changed the classification model (from Logistic Regression to SVM), while Task 4 changed the text features (from 200 words to 2000 words). Which change improved the model accuracy more? Try to explain why.


In [None]:
# write your answer
print('Accuracy of Logistic Regression with 200 features is :',a_logr_200,'Training time of Logistic Regression with 200 features is :',t_logr_200)
print('Accuracy of Logistic Regression with 2000 features is :',a_logr_2000,'Training time of Logistic Regression with 2000 features is :',t_logr_2000)
print('Accuracy of LinearSVM with 200 features is :',a_SVM_200,'Training time of LinearSVM with 200 features is :',t_SVM_200)



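# Logistic regression trained on 2000 features took the longest to train (1.331 s, versus 0.829 s with 200 features),
# most likely because the model has ten times as many weights to fit.
# A linear SVM maximizes the margin between the two classes, which can make it more robust,
# although here its accuracy (0.7633) is close to that of logistic regression with 200 features (0.7653).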

Accuracy of Logistic Regression with 200 features is : 0.7653333333333333 Training time of Logistic Regression with 200 features is : 0.829
Accuracy of Logistic Regression with 2000 features is : 0.7633333333333333 Training time of Logistic Regression with 2000 features is : 1.331
Accuracy of LinearSVM with 200 features is : 0.7633333333333333 Training time of LinearSVM with 200 features is : 0.0


## Task 6 (Bonus)

Try to improve the model performance further. 

If your solution is able to achieve a testing accuracy that is over 88%, you will get 2 extra points. 

In [None]:
# write your code

# Grid search cross validation
from sklearn.model_selection import GridSearchCV


vect = CountVectorizer(max_features=20000)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

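# candidate values: C from 0.001 to 1000 in powers of ten; penalty l1 (lasso) or l2 (ridge)
# note: in recent scikit-learn versions the default lbfgs solver supports only l2,
# so the l1 candidates need a solver such as solver="liblinear" to be fitted
grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l1", "l2"]}
logreg = LogisticRegression()
logreg_cv = GridSearchCV(logreg, grid, cv=10)
logreg_cv.fit(X_train_dtm, y_train)

print("tuned hyperparameters :(best parameters) ", logreg_cv.best_params_)

# best_score_ is the mean 10-fold cross-validated accuracy on the training data, not the testing accuracy
print("accuracy :", logreg_cv.best_score_ * 100)

tuned hyperparameters :(best parameters)  {'C': 0.1, 'penalty': 'l2'}
accuracy : 88.72000000000003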
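
To get a feel for the cost of this search, the sketch below enumerates the same grid using only the standard library and counts the model fits: 7 values of `C` times 2 penalties gives 14 candidates, each trained once per fold. The variable names are illustrative; the one additional refit on the full training set assumes `GridSearchCV`'s default `refit=True`.

```python
from itertools import product

C_values = [10.0 ** e for e in range(-3, 4)]   # same values as np.logspace(-3, 3, 7)
penalties = ["l1", "l2"]
n_folds = 10

candidates = list(product(C_values, penalties))
n_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", n_fits, "cross-validation fits (plus one final refit)")
```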


In [None]:
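# The best cross-validation accuracy was 88.72%, obtained by Logistic Regression with {'C': 0.1, 'penalty': 'l2'} and 20000 features.
# The testing accuracy still has to be measured on the test set, e.g. with logreg_cv.score(X_test_dtm, y_test).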