# Sentiment Analysis on IMDB Movie Reviews using Scikit-Learn

This project uses Scikit-Learn to classify movie reviews as positive or negative. We use Logistic Regression, which outputs the probability that a review is positive.
If a review's rating is at least 7, it is a positive review (output is 1).
If a review's rating is less than 4, it is considered negative (output is 0).
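
As a rough illustration of how logistic regression turns a raw score into a probability and then into a 0/1 label, the sketch below applies the logistic (sigmoid) function to a few made-up scores and thresholds the result at 0.5. The scores and the `to_label` helper are hypothetical and are not part of this project's code.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def to_label(z, threshold=0.5):
    """Return 1 (positive) if the probability reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0

# Made-up scores: strongly negative, neutral, strongly positive
for score in (-2.0, 0.0, 3.0):
    print(score, round(sigmoid(score), 3), to_label(score))
```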

We will also use the natural language processing library nltk to stem the words in the reviews before feature extraction, and finally evaluate the model using sklearn.

### Task 1 : Importing the dataset

The dataset consists of 25000 positive reviews and 25000 negative reviews. We will use the pandas library to load the dataset.

In [1]:
import pandas as pd


In [2]:
df = pd.read_csv('movie_data.csv')
df.head()

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


### Task 2 : Data preparation

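We will use Python's regular expression library (re) to remove HTML tags and unnecessary symbols from the reviews, while keeping emoticons such as :) since they can carry sentiment. For example, the end of the first review contains HTML line breaks: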

In [3]:
df.loc[0,'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

In [4]:
import re

In [13]:
def prepare(txt):
    
    # remove all HTML tags
    txt = re.sub(r'<[^>]*>', '', txt)
    
    # find all emoticons such as :) ;-( =D :P in the review
    emojis = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', txt)
    
    # lowercase the text, replace runs of non-word characters with a space,
    # then append the emoticons (with any '-' removed) to the end
    txt = re.sub(r'[\W]+', ' ', txt.lower()) + \
          ' '.join(emojis).replace('-', '')
    
    return txt

In [14]:
prepare(df.loc[0,'review'][-50:])

'is seven title brazil not available'

In [17]:
prepare('Hi :) Welcome here')

'hi welcome here:)'
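
To see the two regular expressions at work separately, here is a small sketch that uses the same patterns as prepare on a made-up review string. The sample text and the names `TAG` and `EMOTICON` are only for illustration.

```python
import re

# Same patterns as in prepare(), compiled once for reuse
TAG = re.compile(r'<[^>]*>')
EMOTICON = re.compile(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)')

sample = 'Great film :-) but the ending <b>dragged</b> :('  # made-up review text

# Step 1: strip HTML tags
no_tags = TAG.sub('', sample)

# Step 2: collect emoticons and drop the optional '-' from each
emoticons = [e.replace('-', '') for e in EMOTICON.findall(no_tags)]

print(no_tags)
print(emoticons)
```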

In [18]:
# Apply the prepare function to review column in the dataset
df['review'] = df['review'].apply(prepare)

### Task 3 : Tokenization of documents

Documents can contain several forms of the same word; for example, run can also appear as running or runners. So we use nltk (the Natural Language Toolkit) to stem the words.
Stemming trims the ends of words to reduce them to a base form.

In [20]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

In [21]:
# split the words in a sentence
def spl(txt):
    return txt.split()

In [23]:
# stemming takes the individual words and stems them
def stemming(txt):
    return [porter.stem(word) for word in txt.split()]
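
To give a feel for what trimming word endings means, the sketch below implements a deliberately tiny suffix stripper. It is a toy, not the Porter algorithm (which applies many more rules), and the `toy_stem` helper is hypothetical.

```python
def toy_stem(word):
    """Strip one common suffix; a toy illustration, NOT the Porter stemmer."""
    for suffix in ('ing', 'ed', 's'):
        # Only strip if at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[:-len(suffix)]
            # Undo a doubled final consonant, e.g. "runn" -> "run"
            if stem[-1] == stem[-2] and stem[-1] not in 'aeioulsz':
                stem = stem[:-1]
            return stem
    return word

words = ['running', 'runs', 'run', 'jumped', 'is']
stems = [toy_stem(w) for w in words]
print(stems)
```

Different surface forms collapse onto one token, which shrinks the vocabulary the vectorizer has to handle.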

### Task 4 : Converting the documents to tf-idf values

First, the documents are converted to a feature matrix in which each row is a document and each column counts how often a given word occurs in that document. Because most words do not occur in most documents, this matrix is mostly zeros and is stored as a sparse matrix.

#### Bag of words model -- CountVectorizer

In [26]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

In [27]:
count = CountVectorizer()
docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining,the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

In [28]:
# count.vocabulary_ is a dictionary that maps each unique word in the documents
# to a numeric column index.
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [29]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


For example, 'weather' is assigned index 8. It occurs once in each of the last two documents and not in the first, so column 8 is 0, 1, 1.
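
The same counts can be reproduced by hand, which may help make the matrix less opaque. The sketch below assumes a simplified tokenizer (lowercase, words of two or more characters) and an alphabetically sorted vocabulary; it is an approximation of what CountVectorizer does by default, not its actual implementation.

```python
import re
from collections import Counter

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining,the weather is sweet, and one and one is two']

def tokenize(doc):
    # Lowercase and keep runs of two or more word characters
    return re.findall(r'\b\w\w+\b', doc.lower())

# One word-count table per document
counts = [Counter(tokenize(d)) for d in docs]

# Vocabulary: each unique word mapped to a column index, in sorted order
vocab = {word: i for i, word in enumerate(sorted(set().union(*counts)))}

# Dense count matrix: one row per document, one column per vocabulary word
matrix = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
```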

#### Tf-idf

Some words, such as 'is' and 'the', appear in almost every document and carry little information. Tf-idf (term frequency-inverse document frequency) is used to down-weight such words.

In [30]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf=True,norm='l2',smooth_idf=True)
np.set_printoptions(precision=2)
print(tfidf.fit_transform(bag).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]
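
The values above can be checked by hand. The sketch below starts from the count matrix printed earlier and assumes the smoothed formula that scikit-learn documents for smooth_idf=True, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. It is a plain-Python approximation for illustration, not the library's code.

```python
import math

# Count matrix from the CountVectorizer output (rows = documents)
counts = [
    [0, 1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 1],
    [2, 3, 2, 1, 1, 1, 2, 1, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents that contain each term
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    # Weight each raw count by its idf, then scale the row to unit L2 length
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])

for row in tfidf:
    print([round(v, 2) for v in row])
```

Words that occur in all three documents ('is', 'the') get the minimum idf of 1, while words that occur in only one document ('and', 'one', 'two') get the largest weight.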


TfidfVectorizer combines both steps, counting term frequencies and applying the idf weighting, so we can apply it directly to our dataset.

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(strip_accents = None,
                       lowercase = False,   # reviews were already lowercased by prepare()
                       preprocessor = None,
                       tokenizer = stemming,
                       use_idf = True,
                       norm = 'l2',
                       smooth_idf = True)
# review dataset
X = tfidf.fit_transform(df.review)

#sentiment dataset
Y = df.sentiment.values

### Task 5 : Logistic Regression model

First we split the dataset into training and test sets using sklearn.

In [32]:
from sklearn.model_selection import train_test_split
# shuffle=False keeps the original row order: the first half is used for training, the second half for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5, shuffle=False)
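
With shuffling turned off and half of the rows held out, the split amounts to cutting the data in the middle. The sketch below shows the idea on ten made-up row indices; it is a simplified stand-in and not scikit-learn's code. Because the rows are not shuffled, the class balance of each half depends on how the rows are ordered in the CSV file.

```python
import math

rows = list(range(10))   # made-up row indices standing in for the reviews
test_size = 0.5

# Number of held-out rows, rounded up; the remaining rows are used for training
n_test = math.ceil(len(rows) * test_size)
n_train = len(rows) - n_test

# No shuffling: the first block trains, the last block tests
train, test = rows[:n_train], rows[n_train:]
print(train)
print(test)
```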

We use sklearn's LogisticRegressionCV, which tunes the regularization strength using cross-validation.

In [33]:
import pickle
from sklearn.linear_model import LogisticRegressionCV

#Logistic Regression model
clf = LogisticRegressionCV(cv=5,
                          scoring='accuracy',
                          random_state=0,
                          n_jobs=-1,
                          verbose=3,
                          max_iter=300).fit(X_train,Y_train)

# use pickle to save the fitted classifier to a file on disk
with open('saved_model.sav', 'wb') as saved_model:
    pickle.dump(clf, saved_model)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:  3.9min remaining:  5.8min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  3.9min finished
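
For readers unfamiliar with 5-fold cross-validation, the sketch below shows how the training rows can be divided into five folds, each serving once as the validation set. It uses plain contiguous folds and a hypothetical `k_fold_indices` helper; to my understanding, LogisticRegressionCV uses stratified folds by default, which additionally preserve the class balance in each fold.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

# Ten made-up samples split into five folds
folds = list(k_fold_indices(10, 5))
for train_idx, val_idx in folds:
    print(val_idx, train_idx)
```

Each candidate regularization strength is scored on every validation fold, and the value with the best average score is kept.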


### Task 6 : Model Evaluation

In [34]:
# load the saved classifier
with open('saved_model.sav', 'rb') as f:
    saved_clf = pickle.load(f)

In [35]:
#calculate accuracy
print("Accuracy is {}".format(saved_clf.score(X_test,Y_test)))

Accuracy is 0.89608
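
Accuracy here is simply the fraction of test reviews whose predicted label matches the true label. The sketch below computes it by hand on a few made-up labels; the `accuracy` helper and the label lists are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true = [1, 0, 0, 1, 0]   # made-up true sentiments
y_pred = [1, 0, 1, 1, 0]   # made-up predictions with one mistake

print(accuracy(y_true, y_pred))
```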
