# Nafisur Rahman
nafisur21@gmail.com<br>
https://www.linkedin.com/in/nafisur-rahman

# Sentiment Analysis
Predicting the sentiment (positive or negative) of IMDB movie reviews.

## About this Project
This is a Kaggle project based on the Kaggle dataset "Bag of Words Meets Bags of Popcorn". The original dataset is available from the Stanford website: http://ai.stanford.edu/~amaas/data/sentiment/.<br>
The labeled dataset consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment is binary: reviews with an IMDB rating < 5 have a sentiment score of 0, and reviews with a rating >= 7 have a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. The data fields are:
* id - Unique ID of each review
* sentiment - Sentiment of the review; 1 for positive reviews and 0 for negative reviews
* review - Text of the review
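
The labeling rule described above can be sketched as a small function. This is only an illustration: the name `rating_to_sentiment` is hypothetical, and treating ratings of 5 and 6 as unlabeled is an assumption drawn from the two ranges quoted above, not something the dataset files expose directly.

```python
from typing import Optional

def rating_to_sentiment(rating: int) -> Optional[int]:
    """Map an IMDB star rating to the binary sentiment label (hypothetical helper)."""
    if rating < 5:
        return 0      # negative review
    if rating >= 7:
        return 1      # positive review
    return None       # assumption: ratings of 5 and 6 fall outside both ranges

labels = [rating_to_sentiment(r) for r in (1, 4, 5, 6, 7, 10)]
print(labels)
```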

## A. Loading libraries and Dataset

### Importing Packages

In [51]:
import nltk
import re
import numpy as np # linear algebra
import pandas as pd # data processing
import random
import matplotlib.pyplot as plt

from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from string import punctuation
from nltk.stem import SnowballStemmer
stemmer=SnowballStemmer('english')

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from nltk.tokenize import word_tokenize

%matplotlib inline

### Loading the dataset

In [2]:
raw_data_train=pd.read_csv('labeledTrainData.tsv',sep='\t')
raw_data_test=pd.read_csv('testData.tsv',sep='\t')

### Basic visualization of dataset

In [3]:
print(raw_data_train.shape)
print(raw_data_test.shape)

(25000, 3)
(25000, 2)


In [4]:
raw_data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
id           25000 non-null object
sentiment    25000 non-null int64
review       25000 non-null object
dtypes: int64(1), object(2)
memory usage: 586.0+ KB


In [5]:
raw_data_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
id        25000 non-null object
review    25000 non-null object
dtypes: object(2)
memory usage: 390.7+ KB


In [6]:
raw_data_train.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |


In [7]:
raw_data_test.head()

| | id | review |
|---|---|---|
| 0 | 12311_10 | Naturally in a film who's main themes are of m... |
| 1 | 8348_2 | This movie is a disaster within a disaster fil... |
| 2 | 5828_4 | All in all, this is a movie for kids. We saw i... |
| 3 | 7186_2 | Afraid of the Dark left me with the impression... |
| 4 | 12128_7 | A very accurate depiction of small time mob li... |


In [8]:
raw_data_train['review'][0]

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

## B. Data Cleaning and Text Preprocessing

Removing tags and markup

In [9]:
from bs4 import BeautifulSoup
soup=BeautifulSoup(raw_data_train['review'][0],'lxml').text
soup

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 mi

Removing non-letters

In [10]:
import re
re.sub('[^a-zA-Z]',' ',raw_data_train['review'][0])

'With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay  br    br   Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him  br    br   The actual feature film bit when it finally sta
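
The output above still contains stray `br` tokens because the regex ran on the raw text, before the tags were stripped. A minimal sketch of why the order matters follows. The sample string is a shortened excerpt of the review, and the tag-stripping regex `<[^>]+>` is only a stand-in for BeautifulSoup, used here so the snippet has no dependencies.

```python
import re

raw = "drugs are bad m'kay.<br /><br />Visually impressive"

# Regex applied to the raw text: the tag name survives as a stray token.
letters_only_raw = re.sub('[^a-zA-Z]', ' ', raw).split()

# Replace tags with a space first, then drop the non-letters.
no_tags = re.sub(r'<[^>]+>', ' ', raw)
letters_only_clean = re.sub('[^a-zA-Z]', ' ', no_tags).split()

print(letters_only_raw)
print(letters_only_clean)
```

Replacing each tag with a space also keeps `m'kay.` and `Visually` apart, whereas simply deleting the tags would join them.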

Word tokenization

In [13]:
from nltk.tokenize import word_tokenize
word_tokenize((raw_data_train['review'][0]).lower())[0:20]

['with',
 'all',
 'this',
 'stuff',
 'going',
 'down',
 'at',
 'the',
 'moment',
 'with',
 'mj',
 'i',
 "'ve",
 'started',
 'listening',
 'to',
 'his',
 'music',
 ',',
 'watching']

Removing stopwords

In [14]:
from nltk.corpus import stopwords
from string import punctuation
Cstopwords=set(stopwords.words('english')+list(punctuation))

In [22]:
[w for w in word_tokenize(raw_data_train['review'][0]) if w not in Cstopwords][0:20]

['With',
 'stuff',
 'going',
 'moment',
 'MJ',
 "'ve",
 'started',
 'listening',
 'music',
 'watching',
 'odd',
 'documentary',
 'watched',
 'The',
 'Wiz',
 'watched',
 'Moonwalker',
 'Maybe',
 'want',
 'get']
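
Note that `'With'` and `'The'` survive in the output above: the stop-word list is lowercase, and the tokens were not lowercased before filtering. A small sketch of the difference, using a hypothetical mini stop list in place of `stopwords.words('english')` so it runs without NLTK data:

```python
# Hypothetical mini stop list standing in for stopwords.words('english')
stop_words = {'with', 'all', 'this', 'the', 'at', 'down'}
tokens = ['With', 'all', 'this', 'stuff', 'going', 'down', 'at', 'The', 'moment']

# Case-sensitive membership test: capitalized stop words slip through.
case_sensitive = [w for w in tokens if w not in stop_words]

# Comparing the lowercased token removes them as well.
case_insensitive = [w for w in tokens if w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```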

### Defining a function that performs all preprocessing steps in one go

In [23]:
import re
from string import punctuation

from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

stemmer=SnowballStemmer('english')
lemma=WordNetLemmatizer()
Cstopwords=set(stopwords.words('english')+list(punctuation))

def clean_review(df):
    """Return a list of cleaned review strings, one per entry of df."""
    review_corpus=[]
    for review in df:
        # 1. strip HTML tags and markup
        review=BeautifulSoup(review,'lxml').text
        # 2. replace every non-letter character with a space
        review=re.sub('[^a-zA-Z]',' ',review)
        # 3. lowercase and tokenize
        review=word_tokenize(review.lower())
        # alternative (disabled): stem and drop stop words instead of lemmatizing
        #review=[stemmer.stem(w) for w in review if w not in Cstopwords]
        # 4. lemmatize; lemmatize() treats every word as a noun by default,
        #    which is why 'was' and 'has' show up as 'wa' and 'ha' in the output
        review=[lemma.lemmatize(w) for w in review]
        review_corpus.append(' '.join(review))
    return review_corpus

In [24]:
df=raw_data_train['review']
clean_train_review_corpus=clean_review(df)
clean_train_review_corpus[0]

'with all this stuff going down at the moment with mj i ve started listening to his music watching the odd documentary here and there watched the wiz and watched moonwalker again maybe i just want to get a certain insight into this guy who i thought wa really cool in the eighty just to maybe make up my mind whether he is guilty or innocent moonwalker is part biography part feature film which i remember going to see at the cinema when it wa originally released some of it ha subtle message about mj s feeling towards the press and also the obvious message of drug are bad m kay visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring some may call mj an egotist for consenting to the making of this movie but mj and most of his fan would say that he made it for the fan which if true is really nice of him the actual feature film bit when it finally start is only on for minute or so excluding

In [25]:
df1=raw_data_test['review']
clean_test_review_corpus=clean_review(df1)
clean_test_review_corpus[0]

'naturally in a film who s main theme are of mortality nostalgia and loss of innocence it is perhaps not surprising that it is rated more highly by older viewer than younger one however there is a craftsmanship and completeness to the film which anyone can enjoy the pace is steady and constant the character full and engaging the relationship and interaction natural showing that you do not need flood of tear to show emotion scream to show fear shouting to show dispute or violence to show anger naturally joyce s short story lends the film a ready made structure a perfect a a polished diamond but the small change huston make such a the inclusion of the poem fit in neatly it is truly a masterpiece of tact subtlety and overwhelming beauty'

In [26]:
df=raw_data_train
df['clean_review']=clean_train_review_corpus
df.head()

| | id | sentiment | review | clean_review |
|---|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... | with all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... | the classic war of the world by timothy hines ... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... | the film start with a manager nicholas bell gi... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... | it must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... | superbly trashy and wondrously unpretentious s... |


## C. Creating Features
1. Bag of Words (CountVectorizer)
2. TF (term frequency)
3. TF-IDF (term frequency, inverse document frequency)

### 1. Bag of Words model

In [27]:
from sklearn.feature_extraction.text import CountVectorizer

To limit the size of the feature vectors, we choose a maximum vocabulary size. Below, we keep the 20,000 most frequent unigrams and bigrams among those that appear in at least 5 reviews. Note that stop words have not been removed: the stop-word filter in clean_review is commented out.
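
To make the idea of a bag of words over unigrams and bigrams concrete, here is a dependency-free sketch on three toy sentences. It is not how CountVectorizer is implemented. The helper `ngrams`, the toy documents, and `min_df = 2` are all assumptions chosen for illustration, and the cap on vocabulary size (`max_features`) is left out for brevity.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Yield all word n-grams with n_min <= n <= n_max, joined by spaces."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield ' '.join(tokens[i:i + n])

docs = ['not worth the time', 'worth the wait', 'not bad']

# Document frequency: in how many documents each n-gram appears.
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(ngrams(doc.split())))

# Keep n-grams seen in at least min_df documents (2 here, for the toy data).
min_df = 2
vocab = sorted(term for term, df in doc_freq.items() if df >= min_df)

# One count vector per document, with columns ordered like vocab.
matrix = []
for doc in docs:
    counts = Counter(ngrams(doc.split()))
    matrix.append([counts[term] for term in vocab])

print(vocab)
print(matrix)
```

Bigrams such as `worth the` become columns of their own, which is what later lets a linear model weight phrases like `not bad` separately from `not` and `bad`.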

In [28]:
cv=CountVectorizer(max_features=20000,min_df=5,ngram_range=(1,2))

In [29]:
X1=cv.fit_transform(df['clean_review'])
X1.shape

(25000, 20000)

In [30]:
train_data_features = X1.toarray()
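
Converting to a dense array is convenient but memory-hungry at this size. A back-of-the-envelope estimate, assuming 64-bit integer counts (8 bytes per value):

```python
n_rows, n_cols = 25000, 20000
bytes_per_value = 8  # assumption: counts are stored as 64-bit integers

dense_bytes = n_rows * n_cols * bytes_per_value
dense_gib = dense_bytes / 1024**3

print(dense_bytes)
print(round(dense_gib, 2))
```

Estimators such as MultinomialNB and LogisticRegression also accept the sparse matrix `X1` directly, so the dense copy can be skipped if memory is tight.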

In [32]:
y=df['sentiment'].values
y.shape

(25000,)

## D. Machine Learning

### Splitting data into Training and Test set

In [37]:
X=train_data_features

In [38]:
# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(20000, 20000) (20000,)
(5000, 20000) (5000,)


In [39]:
# average positive reviews in train and test
print('mean positive review in train : {0:.3f}'.format(np.mean(y_train)))
print('mean positive review in test : {0:.3f}'.format(np.mean(y_test)))

mean positive review in train : 0.502
mean positive review in test : 0.490
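
The two class ratios differ slightly because the split is purely random. If identical ratios are wanted, `train_test_split` accepts a `stratify` argument. The idea behind stratification can be sketched in plain Python as below. The function `stratified_split` and the toy labels are hypothetical and only illustrate the principle.

```python
import random

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with the same class ratio in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut off the same share for testing.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [0, 1] * 50  # toy data: 50 negatives, 50 positives
train_idx, test_idx = stratified_split(labels)
train_mean = sum(labels[i] for i in train_idx) / len(train_idx)
test_mean = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_mean, test_mean)
```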


### 1. Naive Bayes Classifier

In [42]:
from sklearn.naive_bayes import MultinomialNB
model_nb=MultinomialNB()
model_nb.fit(X_train,y_train)
y_pred_nb=model_nb.predict(X_test)
print('accuracy for Naive Bayes Classifier :',accuracy_score(y_test,y_pred_nb))
print('confusion matrix for Naive Bayes Classifier:\n',confusion_matrix(y_test,y_pred_nb))

accuracy for Naive Bayes Classifier : 0.8726
confusion matrix for Naive Bayes Classifier:
 [[2240  308]
 [ 329 2123]]


In [43]:
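# get the feature names as numpy array
# (scikit-learn 1.0+ provides cv.get_feature_names_out(); get_feature_names() was later removed)
feature_names = np.array(cv.get_feature_names())

# Sort the coefficients from the model
# (for MultinomialNB these are per-word log probabilities for the positive class,
#  available as model_nb.feature_log_prob_[1] in recent scikit-learn versions)
sorted_coef_index = model_nb.coef_[0].argsort()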

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['this trash' 'this turkey' 'slater' 'save yourself' 'even worth'
 'worst acting' 'this pile' 'just awful' 'this garbage' 'wayans']

Largest Coefs: 
['the' 'and' 'of' 'to' 'is' 'it' 'in' 'that' 'this' 'film']
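
The "largest" list is dominated by common words, which is consistent with these values being log P(word | positive class): frequent words score high regardless of sentiment. A ranking by the log-ratio between the two classes is usually more informative. The sketch below uses made-up counts (not taken from the IMDB data) and a hypothetical helper `log_prob` to show the difference.

```python
import math

# Hypothetical per-class word counts (not taken from the IMDB data)
counts = {
    'the':       {0: 500, 1: 520},
    'film':      {0: 90,  1: 110},
    'awful':     {0: 40,  1: 2},
    'excellent': {0: 3,   1: 45},
}

def log_prob(word, label, alpha=1.0):
    """Laplace-smoothed log P(word | class), as in multinomial Naive Bayes."""
    total = sum(c[label] for c in counts.values())
    return math.log((counts[word][label] + alpha) / (total + alpha * len(counts)))

# Ranking by log P(word | positive): frequent words come first.
by_positive_prob = sorted(counts, key=lambda w: log_prob(w, 1), reverse=True)

# Ranking by log P(word | positive) - log P(word | negative): sentiment words come first.
by_log_ratio = sorted(counts, key=lambda w: log_prob(w, 1) - log_prob(w, 0), reverse=True)

print(by_positive_prob)
print(by_log_ratio)
```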


### 2. Random Forest

In [44]:
from sklearn.ensemble import RandomForestClassifier

In [45]:
model_rf=RandomForestClassifier(random_state=0)

In [33]:
%%time
from sklearn.model_selection import GridSearchCV
parameters = {'n_estimators':[100,200],'criterion':['entropy','gini'],
              'min_samples_leaf':[2,5,7],
              'max_depth':[5,6,7]
               }
grid_search = GridSearchCV(estimator = model_rf,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print('Best Accuracy :',best_accuracy)
print('Best parameters:\n',best_parameters)

Best Accuracy : 0.81605
Best parameters:
 {'criterion': 'gini', 'max_depth': 7, 'min_samples_leaf': 2, 'n_estimators': 200}
Wall time: 27min 4s
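
The long run time follows from the size of the search. Counting the combinations in the grid above and multiplying by the number of folds gives the number of models that had to be trained:

```python
from itertools import product

parameters = {
    'n_estimators': [100, 200],
    'criterion': ['entropy', 'gini'],
    'min_samples_leaf': [2, 5, 7],
    'max_depth': [5, 6, 7],
}
n_folds = 10

# Every combination of one value per parameter.
combinations = list(product(*parameters.values()))
n_fits = len(combinations) * n_folds

print(len(combinations), n_fits)
```

With the default `refit=True`, GridSearchCV trains one more model on the full training data using the best combination.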


In [48]:
%%time
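# note: this uses the default hyperparameters, not the best_parameters found by the grid search,
# and no random_state, so the result can vary between runs
model_rf=RandomForestClassifier()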
model_rf.fit(X_train,y_train)
y_pred_rf=model_rf.predict(X_test)
print('accuracy for Random Forest Classifier :',accuracy_score(y_test,y_pred_rf))
print('confusion matrix for Random Forest Classifier:\n',confusion_matrix(y_test,y_pred_rf))

accuracy for Random Forest Classifier : 0.768
confusion matrix for Random Forest Classifier:
 [[2082  466]
 [ 694 1758]]
Wall time: 11.5 s


### 3. Logistic Regression

In [49]:
from sklearn.linear_model import LogisticRegression as lr

In [50]:
model_lr=lr(random_state=0)

In [52]:
%%time
model_lr=lr(penalty='l2',C=1.0,random_state=0)
model_lr.fit(X_train,y_train)
y_pred_lr=model_lr.predict(X_test)
print('accuracy for Logistic Regression :',accuracy_score(y_test,y_pred_lr))
print('confusion matrix for Logistic Regression:\n',confusion_matrix(y_test,y_pred_lr))
print('F1 score for Logistic Regression :',f1_score(y_test,y_pred_lr))
print('Precision score for Logistic Regression :',precision_score(y_test,y_pred_lr))
print('recall score for Logistic Regression :',recall_score(y_test,y_pred_lr))
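# note: passing hard 0/1 predictions gives a single-threshold AUC;
# model_lr.predict_proba(X_test)[:,1] would give the usual score-based AUC
print('AUC: ', roc_auc_score(y_test, y_pred_lr))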

accuracy for Logistic Regression : 0.8856
confusion matrix for Logistic Regression:
 [[2244  304]
 [ 268 2184]]
F1 score for Logistic Regression : 0.884210526316
Precision score for Logistic Regression : 0.877813504823
recall score for Logistic Regression : 0.890701468189
AUC:  0.885696103011
Wall time: 7.48 s
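
All of the scores above can be recomputed by hand from the confusion matrix, which is a useful sanity check. The sketch below assumes the scikit-learn layout (rows are true labels, columns are predicted labels). It also shows that an AUC computed from hard 0/1 predictions reduces to the mean of recall and specificity.

```python
# Confusion matrix reported above: rows are true labels, columns are predictions
tn, fp = 2244, 304
fn, tp = 268, 2184

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)             # true positive rate
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)        # true negative rate

# With a single threshold the ROC curve has one interior point,
# so the area under it is the mean of recall and specificity.
auc_hard = (recall + specificity) / 2

print(accuracy, precision, recall, f1, auc_hard)
```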


In [53]:
# get the feature names as numpy array
# (scikit-learn 1.0+ provides cv.get_feature_names_out(); get_feature_names() was later removed)
feature_names = np.array(cv.get_feature_names())

# Sort the coefficients from the model
sorted_coef_index = model_lr.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['awful' 'disappointment' 'worst' 'waste' 'poorly' 'boring' 'disappointing'
 'not worth' 'poor' 'worst movie']

Largest Coefs: 
['excellent' 'perfect' 'surprisingly' 'not bad' 'enjoyable'
 'definitely worth' 'delightful' 'wonderful' 'superb' 'refreshing']
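
The slicing used to pick the extremes is easy to misread, so here is the same pattern on a tiny example with made-up coefficients. The plain-Python `sorted(range(...))` call plays the role of `argsort`, and the names and values are hypothetical.

```python
coefs = [0.2, -1.5, 0.9, -0.3, 1.7]                  # hypothetical coefficients
names = ['film', 'awful', 'great', 'dull', 'superb']

# Indices that would sort the coefficients in ascending order (like argsort).
order = sorted(range(len(coefs)), key=lambda i: coefs[i])

# The first k indices point at the k most negative coefficients.
smallest = [names[i] for i in order[:2]]

# [:-3:-1] walks backwards from the end and stops before the third-from-last
# element, so it returns the 2 largest, ordered from largest to smallest.
largest = [names[i] for i in order[:-3:-1]]

print(smallest)
print(largest)
```

In general `[:-(k+1):-1]` returns the top k entries, which is why the cell above uses `[:-11:-1]` for the top 10.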
