<a href="https://colab.research.google.com/github/Stallians/ML-Projects/blob/master/SentimentAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IMDB Movie Review Sentiment Analysis

**First version**:  
~70% accuracy on both train and test data. Next step: interpret the trained model to find out which features are relevant.  
  1.1. Realised that weights for NB can't be read off directly. (Should've used a linear model instead)

**Second version**:  
Used Logistic Regression (LR) so that I can interpret the weights and improve the feature vectors.  
~73% accuracy on train and test data (0.74092, 0.7352) // suspected slight overfit

**TODO**  
1. Review data can be cleaned thoroughly. Some HTML elements (e.g. `<br />`) can be removed.
2. Find better values for the Tfidf `min_df` and `max_df`.
3. There is more than one review for any given movie. Can this be used as a feature? Figure out.
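
For TODO item 1, a minimal sketch of what the cleaning step could look like. The helper name `clean_review` is hypothetical and not part of the notebook; it only assumes the reviews contain simple tags such as the `<br />` breaks visible in the raw text (which also leak into the vocabulary as `br` and `br br`).

```python
import html
import re

# Matches any HTML tag, e.g. the "<br />" line breaks seen in the raw reviews.
TAG_RE = re.compile(r"<[^>]+>")

def clean_review(text):
    """Strip HTML tags, unescape entities, and collapse whitespace (hypothetical helper)."""
    text = TAG_RE.sub(" ", text)   # replace tags with a space so neighbouring words don't merge
    text = html.unescape(text)     # e.g. "&amp;" -> "&"
    return re.sub(r"\s+", " ", text).strip()

sample = "Finally found it yesterday.<br /><br />Anybody who appreciates a romantic Movie SHOULD SEE IT."
print(clean_review(sample))
```

If this approach fits, it could be applied to the `review` column before vectorizing, e.g. `train['review'].map(clean_review)`.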


In [0]:
import pandas as pd
import numpy as np
import os

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## Get the data


In [2]:
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

--2019-12-20 18:12:48--  https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2019-12-20 18:12:52 (20.5 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



In [0]:
# extracting the data
import tarfile
tar = tarfile.open('aclImdb_v1.tar.gz','r:gz')
tar.extractall()
tar.close()

## Read data

In [0]:
test_path_pos = 'aclImdb/test/pos/'
test_path_neg = 'aclImdb/test/neg/'
train_path_pos = 'aclImdb/train/pos/'
train_path_neg = 'aclImdb/train/neg/'
train_reviews, test_reviews = list(), list()

In [0]:
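# read every review file in a directory into a list of [id, text, sentiment]
def read_reviews(filedir, sentiment):
  temp_list = []
  for tfilen in os.listdir(filedir):
    filepath = os.path.join(filedir, tfilen)
    # the context manager closes the file; the dataset is assumed to be UTF-8 encoded
    with open(filepath, encoding='utf-8') as f:
      rev = f.read()
    temp_list.append([tfilen[:-4], rev, sentiment])  # strip the '.txt' extension
  return temp_list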

In [0]:
train_reviews+=read_reviews(train_path_pos, 1)
train_reviews+=read_reviews(train_path_neg, 0)

test_reviews+=read_reviews(test_path_pos,1)
test_reviews+=read_reviews(test_path_neg,0)

In [0]:
# note: assert statements are skipped when Python runs with optimisation (-O)
assert len(train_reviews) == 25000
assert len(test_reviews) == 25000

In [10]:
train_reviews[:5]

[['10152_9',
  'I have looked forward to seeing this since I first saw it listed in her work. Finally found it yesterday 2/13/02 on Lifetime Movie Channel.<br /><br />Jim Larson\'s comments about it being a "sweet funny story of 2 people crossing paths" were dead on. Writers probably shouldn\'t get a bonus, everyone else SRO for making the movie.<br /><br />Anybody who appreciates a romantic Movie SHOULD SEE IT.<br /><br />Natasha\'s screen presence is so warm and her smile so electric, to say nothing of her beauty, that anything she is in goes on my favorite list. Her TV and print interviews that I have seen are just as refreshing and well worth looking for.<br /><br />God Bless her, her family and future endeavors.<br /><br />This movie doesn\'t seem to available in DVD or video yet, but I would be the first to buy it and I think others would too.',
  1],
 ['3284_10',
  'My name is John Mourby and this is my story about Paperhouse: In May 2003 I saw Alfred Hitchcock\'s psycho, I was 

In [11]:
test_reviews[-5:]

[['4622_1',
  'This is the one movie that represents all that is bad in the movie business. The actors are pathetic and the script is awful. The special effects, if there are any, are so badly done that it would have been better to do it with cartoons instead. Besides that it\'s great! I think the creators of the movie meant it to have humor, but the only time i was laughing was when I saw Patrick S. with long hair and the colorful costumes that every one had. The scenes at the end were good but they were not a part of the movie. In the end you will ask yourself "why did I waste my time and money with that crap when I could have watched the plants growing or the clouds moving". I don\'t think that I am some critic or anything but this is a truly lame movie! DO NOT WATCH! DANGER OF STUPIDITY OVERLOAD!',
  0],
 ['217_3',
  "I have to admit that I am disappointed after seeing this movie. I had expected so much more from the trailers. The movie was absolutely horrible. It lacked a real sto

In [0]:
train = pd.DataFrame(data=train_reviews, columns=['filename','review', 'sentiment'])
test = pd.DataFrame(data=test_reviews, columns=['filename','review', 'sentiment'])

In [0]:
train_X, train_y = train.iloc[:,[0,1]], train.iloc[:,[2]]
test_X, test_y = test.iloc[:,[0,1]], test.iloc[:,[2]]

In [14]:
train_X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
filename    25000 non-null object
review      25000 non-null object
dtypes: object(2)
memory usage: 390.8+ KB


In [15]:
test_X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
filename    25000 non-null object
review      25000 non-null object
dtypes: object(2)
memory usage: 390.8+ KB


## Transforming text into vectors

In [0]:
tfidf = TfidfVectorizer(min_df=0.2, max_df=0.8, ngram_range=(1,2))

In [0]:
features = tfidf.fit_transform(train_X.review.tolist())

In [0]:
test_features = tfidf.transform(test_X.review.tolist())

In [0]:
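# get_feature_names() was removed in scikit-learn 1.2; versions before 1.0 only have the old name
fdf = pd.DataFrame(features.todense(), columns=tfidf.get_feature_names_out())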

In [0]:
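# get_feature_names() was removed in scikit-learn 1.2; versions before 1.0 only have the old name
test_fdf = pd.DataFrame(test_features.todense(), columns=tfidf.get_feature_names_out())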

In [21]:
print(fdf.shape)
print(test_fdf.shape)

(25000, 119)
(25000, 119)
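
With `min_df=0.2`, a term is kept only if it appears in at least 20% of the 25,000 training reviews (5,000 documents), which is why only 119 features survive. The sketch below uses a made-up five-document corpus (not the IMDB data) to show how `min_df` and `max_df` change the vocabulary size; the helper `vocab_size` is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only
docs = [
    "the movie was great",
    "the movie was bad",
    "the plot was thin",
    "the acting was great",
    "a truly wonderful film",
]

def vocab_size(min_df, max_df):
    """Fit a vectorizer on the toy corpus and return how many terms it keeps."""
    vec = TfidfVectorizer(min_df=min_df, max_df=max_df)
    vec.fit(docs)
    return len(vec.vocabulary_)

# Integer values are absolute document counts, floats are fractions of the corpus
for min_df, max_df in [(1, 1.0), (2, 1.0), (1, 0.7)]:
    print(f"min_df={min_df}, max_df={max_df}: {vocab_size(min_df, max_df)} terms")
```

Raising `min_df` drops rare terms, while lowering `max_df` drops terms that appear in almost every document. A much smaller `min_df` than 0.2 would probably be a reasonable starting point for tuning on the real data.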


In [22]:
fdf.head()

Unnamed: 0,about,acting,after,all,also,an,and the,any,are,as,at,at the,bad,be,because,been,being,br,br br,br the,but,by,can,characters,could,do,don,even,film,first,for,for the,from,get,good,great,had,has,have,he,...,really,see,seen,she,so,some,story,than,the film,the movie,their,them,then,there,they,think,this film,this is,this movie,time,to be,to the,too,up,very,was,watch,way,we,well,were,what,when,which,who,will,with,with the,would,you
0,0.069793,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.059272,0.053753,0.0,0.0,0.0,0.058643,0.0,0.0,0.09722,0.572792,0.286409,0.0,0.049667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.176238,0.099734,0.0,0.0,0.086126,0.0,0.0,0.0,0.0,0.11705,0.0,...,0.0,0.079468,0.095079,0.0939,0.131304,0.0,0.081736,0.0,0.0,0.085061,0.0,0.0,0.0,0.0,0.0,0.094309,0.0,0.0,0.076121,0.0,0.0,0.0,0.092396,0.0,0.0,0.0,0.0,0.0,0.0,0.08216,0.086137,0.0,0.0,0.0,0.06686,0.0,0.0,0.0,0.158623,0.0
1,0.188229,0.0,0.124122,0.0,0.118545,0.043236,0.050116,0.0,0.0,0.0,0.0,0.0,0.0,0.079079,0.0,0.0,0.0,0.231719,0.115865,0.0,0.301386,0.0,0.09847,0.0,0.0,0.0,0.059167,0.0,0.199465,0.0,0.100867,0.0,0.0,0.0,0.0,0.0,0.166973,0.0,0.03946,0.047374,...,0.0,0.0,0.0,0.379865,0.08853,0.049172,0.110219,0.0,0.112949,0.0,0.0,0.0,0.125249,0.0,0.046217,0.0,0.0575,0.056367,0.0,0.103347,0.0,0.053382,0.124594,0.051853,0.103711,0.217013,0.0,0.060978,0.0,0.0,0.058077,0.048418,0.0,0.0,0.04508,0.0,0.034218,0.0,0.053475,0.080926
2,0.081259,0.0,0.0,0.0,0.0,0.223979,0.0,0.0,0.138019,0.125167,0.072128,0.0,0.0,0.068277,0.0,0.0,0.0,0.133378,0.066692,0.0,0.057826,0.076437,0.085019,0.0,0.0,0.101072,0.0,0.0,0.0,0.0,0.058059,0.104449,0.0,0.0,0.0,0.0,0.0,0.249184,0.0,0.409026,...,0.0,0.185046,0.110699,0.109326,0.152874,0.084911,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.097334,0.088627,0.0,0.0,0.0,0.21515,0.0,0.0,0.062457,0.0,0.105298,0.211946,0.191315,0.0,0.0,0.0,0.0,0.0,0.103803,0.059088,0.0,0.092341,0.209615
3,0.380086,0.0,0.050127,0.0,0.047875,0.034922,0.121439,0.0,0.161395,0.029273,0.0,0.0,0.049703,0.09581,0.097088,0.094534,0.0,0.249549,0.12478,0.0,0.135241,0.0,0.079535,0.0,0.0,0.0,0.095581,0.0,0.0,0.0,0.027157,0.0,0.0,0.0,0.039727,0.0,0.044955,0.0,0.063744,0.0,...,0.0,0.0,0.103558,0.0,0.143014,0.0,0.311589,0.0,0.0,0.0,0.0,0.050781,0.0,0.149492,0.0,0.15408,0.0,0.045528,0.0,0.0,0.0,0.086235,0.0,0.083764,0.0,0.029214,0.0,0.049253,0.495687,0.044744,0.0,0.078216,0.041052,0.0,0.036411,0.048554,0.082914,0.04932,0.086385,0.032682
4,0.0,0.289908,0.0,0.094047,0.0,0.0,0.0,0.0,0.0,0.246428,0.09467,0.0,0.0,0.089616,0.0,0.0,0.0,0.350126,0.175071,0.0,0.151798,0.0,0.0,0.0,0.0,0.0,0.0,0.119214,0.180834,0.0,0.076204,0.0,0.100346,0.0,0.0,0.135646,0.126147,0.0,0.0,0.107372,...,0.124853,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.130323,0.127754,0.116325,0.0,0.120475,0.12099,0.0,0.0,0.0,0.327905,0.0,0.0,0.0,0.0,0.131631,0.10974,0.0,0.0,0.102172,0.0,0.077554,0.0,0.0,0.0


In [23]:
train_y.sentiment.shape

(25000,)

## Modeling

### Using Naive Bayes

In [0]:
nb = MultinomialNB()

In [0]:
nb=nb.fit(fdf, train_y.sentiment)

In [0]:
predictions=nb.predict(fdf)

In [27]:
# score on training data
accuracy_score(train_y.sentiment, predictions)

0.713

In [28]:
# score on testing data
test_predictions = nb.predict(test_fdf)
print(accuracy_score( test_y.sentiment, test_predictions))

0.70752
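
On interpreting Naive Bayes: it has no `coef_` weights to read off directly, but a fitted `MultinomialNB` does expose `feature_log_prob_`, the per-class log-probabilities of each feature. One possible proxy for a "weight" is the difference between the two class rows. The sketch below is an assumption about how that could be done, shown on a made-up four-document corpus rather than the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus, invented for illustration only
docs = ["great movie", "great film", "bad movie", "bad film"]
labels = [1, 1, 0, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
nb_toy = MultinomialNB().fit(X, labels)

# Rows follow nb_toy.classes_: row 0 is class 0 (negative), row 1 is class 1 (positive)
log_ratio = nb_toy.feature_log_prob_[1] - nb_toy.feature_log_prob_[0]

# Column order of the feature matrix, recovered from the fitted vocabulary
words = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
scores = dict(zip(words, log_ratio))

for word, score in sorted(scores.items(), key=lambda p: p[1]):
    print(f"{word:>6} {score:+.3f}")
```

A positive value means the word is more likely under the positive class, a negative value the opposite, and a value near zero means the word does not discriminate between the classes.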


### Using linear model

In [0]:
lr = LogisticRegression()

In [30]:
lr = lr.fit(fdf, train_y.sentiment)



In [0]:
train_predictions_lr = lr.predict(fdf)

In [32]:
# score on training data for logistic (not 'linear') regression
accuracy_score(train_y.sentiment, train_predictions_lr)

0.74092

In [33]:
# score on testing data for logistic regression
test_predictions_lr = lr.predict(test_fdf)
print(accuracy_score(test_y.sentiment, test_predictions_lr))

0.7352


## Interpretations from Linear Model
- based on weights from logistic regression

In [34]:
print(lr.coef_.shape)

(1, 119)


In [0]:
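# get_feature_names() was removed in scikit-learn 1.2; versions before 1.0 only have the old name
interpret = pd.DataFrame.from_dict({"word":tfidf.get_feature_names_out(),
                                   "weight":lr.coef_[0].tolist() })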

In [36]:
interpret.nlargest(5,'weight')

Unnamed: 0,word,weight
35,great,5.752874
108,well,3.297956
4,also,2.542649
72,one of,2.096197
114,will,2.071102


In [37]:
interpret.nsmallest(5,'weight')

Unnamed: 0,word,weight
12,bad,-7.402652
66,no,-3.59022
27,even,-2.902132
1,acting,-2.747727
78,plot,-2.742868
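
To make the reading of these weights concrete: in a binary logistic regression, the score for a review is the weighted sum of its TF-IDF features plus the intercept, and the sigmoid of that score is the predicted probability of sentiment 1. So a positive weight (like `great`) pushes towards positive, a negative one (like `bad`) towards negative. The sketch below checks this on a made-up four-document corpus; it is an illustration, not the notebook's model.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus, invented for illustration only
docs = ["great movie", "great film", "bad movie", "bad film"]
labels = [1, 1, 0, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

# Pair each vocabulary term with its learned coefficient
words = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
weights = dict(zip(words, clf.coef_[0]))
print(weights)

# Recompute the positive-class probability by hand: sigmoid(x . w + b)
manual_score = X.toarray() @ clf.coef_[0] + clf.intercept_[0]
manual_prob = 1.0 / (1.0 + np.exp(-manual_score))

# Should match the model's own probability for class 1
print(np.allclose(manual_prob, clf.predict_proba(X)[:, 1]))
```

Note that the magnitudes depend on the feature scaling and the regularisation strength, so they are best compared within one model rather than across models.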


# Notes

In [0]:
'''
First run:
~70% accuracy on both train and test data. Next step: interpret the trained model to find out which features are relevant.
  * realised that weights for NB can't be read off directly. (Should've used a linear model instead)

Second version:
tried linear regression (a linear model) so that I could interpret the weights and improve the feature vectors
linear regression ran, but I could not remember how to use it for binary classification

Should have used a logistic model instead.
++ changed the model to logistic regression
  Second run:
  ~73% accuracy on train and test data (0.74092, 0.7352) // suspected slight overfitting
'''


In [0]:
"""
Tips:
1. To get the values of hyperparameters, coefficients, or weights of trained models/transformers/vectorizers,
look at the "Attributes" section of the class's scikit-learn documentation.
"""
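
As a companion to the tip above: scikit-learn's convention is that values learned during `fit` are stored in attributes whose names end with an underscore. A small sketch for listing them on a fitted object follows; the helper `fitted_attributes` is hypothetical, and it only sees attributes stored on the instance, so values exposed as properties (such as `idf_` on `TfidfVectorizer`) will not show up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def fitted_attributes(estimator):
    """Names of public instance attributes ending in '_' (hypothetical helper)."""
    return sorted(
        name for name in vars(estimator)
        if name.endswith("_") and not name.startswith("_")
    )

# Tiny made-up example so the objects have something to learn
texts = ["good film", "bad film"]
vec = TfidfVectorizer().fit(texts)
clf = LogisticRegression().fit(vec.transform(texts), [1, 0])

print(fitted_attributes(vec))
print(fitted_attributes(clf))
```

The documentation's "Attributes" section remains the authoritative list; this is only a quick way to explore an object interactively.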