# Introduction

In this __Machine Learning__ project, we'll build a binary classifier for the __IMDB__ movie review dataset that assigns each review to one of two categories: __negative__ or __positive__ sentiment. We'll compare several popular text classification algorithms: the __Naive Bayes__ family, a linear support vector machine (__SVM__), and plain old __logistic regression__.

# Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import re
# stopwords and word_tokenize rely on NLTK data packages ('stopwords' and 'punkt'), fetched with nltk.download()
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer, PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import pickle

# Reading and exploring data

In [2]:
data = pd.read_csv('IMDB-Dataset.csv')
print(data.shape)
data.head()

(50000, 2)


| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [4]:
data.sentiment.value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

# Data Preprocessing

#### Label encoding the sentiments

In [5]:
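# Map the string labels to integers and assign the result back,
# rather than relying on an in-place replace through attribute access
data['sentiment'] = data['sentiment'].map({'positive': 1, 'negative': 0})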
data.head(10)

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | 1 |
| 1 | A wonderful little production. <br /><br />The... | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 |
| 3 | Basically there's a family where a little boy ... | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 |
| 5 | Probably my all-time favorite movie, a story o... | 1 |
| 6 | I sure would like to see a resurrection of a u... | 1 |
| 7 | This show was an amazing, fresh & innovative i... | 0 |
| 8 | Encouraged by the positive comments about this... | 0 |
| 9 | If you like original gut wrenching laughter yo... | 1 |


#### Removing html elements from the text

In [6]:
TAG_RE = re.compile(r'<.*?>')  # compiled once; non-greedy, so it matches one tag at a time

def clean(text):
    return TAG_RE.sub('', text)

data.review = data.review.apply(clean)
data.review[0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows wo

#### Removing special characters from the text

In [7]:
def is_special(text):
    # Keep letters and digits; replace every other character with a space
    return ''.join(ch if ch.isalnum() else ' ' for ch in text)

data.review = data.review.apply(is_special)
data.review[0]

'One of the other reviewers has mentioned that after watching just 1 Oz episode you ll be hooked  They are right  as this is exactly what happened with me The first thing that struck me about Oz was its brutality and unflinching scenes of violence  which set in right from the word GO  Trust me  this is not a show for the faint hearted or timid  This show pulls no punches with regards to drugs  sex or violence  Its is hardcore  in the classic use of the word It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary  It focuses mainly on Emerald City  an experimental section of the prison where all the cells have glass fronts and face inwards  so privacy is not high on the agenda  Em City is home to many  Aryans  Muslims  gangstas  Latinos  Christians  Italians  Irish and more    so scuffles  death stares  dodgy dealings and shady agreements are never far away I would say the main appeal of the show is due to the fact that it goes where other shows wo
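
The output of the HTML step contains glued words such as "me.The", because the tags are deleted rather than replaced by a space. The following is a minimal sketch of the two cleaning steps on a short sample string; the helper names `strip_tags` and `replace_non_alnum` and the sample text are illustrative assumptions, not part of the notebook. It suggests that the special-character step separates those words again, since the full stop becomes a space:

```python
import re

TAG_RE = re.compile(r'<.*?>')  # same non-greedy pattern as in clean()

def strip_tags(text):
    # Hypothetical stand-in for clean(): delete every HTML tag
    return TAG_RE.sub('', text)

def replace_non_alnum(text):
    # Hypothetical stand-in for is_special(): non-alphanumerics become spaces
    return ''.join(ch if ch.isalnum() else ' ' for ch in text)

sample = "happened with me.<br /><br />The first thing, you'll see"
no_tags = strip_tags(sample)
cleaned = replace_non_alnum(no_tags)

print(no_tags)   # happened with me.The first thing, you'll see
print(cleaned)   # happened with me The first thing  you ll see
```

Apostrophes are also turned into spaces, which is why contractions such as "you'll" show up as "you ll" in the output above.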

#### Converting text to lower case

In [8]:
def to_lower(text):
    return text.lower()

data.review = data.review.apply(to_lower)
data.review[0]

'one of the other reviewers has mentioned that after watching just 1 oz episode you ll be hooked  they are right  as this is exactly what happened with me the first thing that struck me about oz was its brutality and unflinching scenes of violence  which set in right from the word go  trust me  this is not a show for the faint hearted or timid  this show pulls no punches with regards to drugs  sex or violence  its is hardcore  in the classic use of the word it is called oz as that is the nickname given to the oswald maximum security state penitentary  it focuses mainly on emerald city  an experimental section of the prison where all the cells have glass fronts and face inwards  so privacy is not high on the agenda  em city is home to many  aryans  muslims  gangstas  latinos  christians  italians  irish and more    so scuffles  death stares  dodgy dealings and shady agreements are never far away i would say the main appeal of the show is due to the fact that it goes where other shows wo

#### Removing words with length less than 3

In [9]:
def rem_shorter(text):
    # Tokenize, then keep only words with three or more characters
    words = word_tokenize(text)
    return ' '.join([word for word in words if len(word) > 2])

data.review = data.review.apply(rem_shorter)
data.review[0]

'one the other reviewers has mentioned that after watching just episode you hooked they are right this exactly what happened with the first thing that struck about was its brutality and unflinching scenes violence which set right from the word trust this not show for the faint hearted timid this show pulls punches with regards drugs sex violence its hardcore the classic use the word called that the nickname given the oswald maximum security state penitentary focuses mainly emerald city experimental section the prison where all the cells have glass fronts and face inwards privacy not high the agenda city home many aryans muslims gangstas latinos christians italians irish and more scuffles death stares dodgy dealings and shady agreements are never far away would say the main appeal the show due the fact that goes where other shows wouldn dare forget pretty pictures painted for mainstream audiences forget charm forget romance doesn mess around the first episode ever saw struck nasty was s
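
The fragments "wouldn" and "doesn" in the output above come from this length filter acting on contractions that were split earlier. Here is a small sketch under simplifying assumptions: `str.split()` stands in for `word_tokenize`, and the function name `remove_short_words` and the sample string are hypothetical.

```python
def remove_short_words(text, min_len=3):
    # str.split() is an assumed stand-in for nltk's word_tokenize
    return ' '.join(w for w in text.split() if len(w) >= min_len)

sample = "you ll be hooked  it is called oz  other shows wouldn t dare"
result = remove_short_words(sample)
print(result)  # you hooked called other shows wouldn dare
```

Short but meaningful tokens such as "oz" are dropped along with the noise, which is the trade-off this step accepts.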

#### Stemming the tokens using a Porter stemmer

In [10]:
stemmer = PorterStemmer()

def stem_words(text):
    
    words = word_tokenize(text)
    return ' '.join([stemmer.stem(w) for w in words])

data.review = data.review.apply(stem_words)
data.review[0]

'one the other review ha mention that after watch just episod you hook they are right thi exactli what happen with the first thing that struck about wa it brutal and unflinch scene violenc which set right from the word trust thi not show for the faint heart timid thi show pull punch with regard drug sex violenc it hardcor the classic use the word call that the nicknam given the oswald maximum secur state penitentari focus mainli emerald citi experiment section the prison where all the cell have glass front and face inward privaci not high the agenda citi home mani aryan muslim gangsta latino christian italian irish and more scuffl death stare dodgi deal and shadi agreement are never far away would say the main appeal the show due the fact that goe where other show wouldn dare forget pretti pictur paint for mainstream audienc forget charm forget romanc doesn mess around the first episod ever saw struck nasti wa surreal couldn say wa readi for but watch more develop tast for and got accu

#### Removing stopwords

In [11]:
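# Build the stopword set once instead of on every call
stop_words = set(stopwords.words('english'))

def rem_stopwords(text):
    # Stemming has already run, so stemmed forms such as 'thi', 'wa' and 'ha'
    # no longer match the stopword list and remain in the output below
    words = word_tokenize(text)
    return ' '.join([w for w in words if w not in stop_words])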

data.review = data.review.apply(rem_stopwords)
data.review[0]

'one review ha mention watch episod hook right thi exactli happen first thing struck wa brutal unflinch scene violenc set right word trust thi show faint heart timid thi show pull punch regard drug sex violenc hardcor classic use word call nicknam given oswald maximum secur state penitentari focus mainli emerald citi experiment section prison cell glass front face inward privaci high agenda citi home mani aryan muslim gangsta latino christian italian irish scuffl death stare dodgi deal shadi agreement never far away would say main appeal show due fact goe show dare forget pretti pictur paint mainstream audienc forget charm forget romanc mess around first episod ever saw struck nasti wa surreal say wa readi watch develop tast got accustom high level graphic violenc violenc injustic crook guard sold nickel inmat kill order get away well manner middl class inmat turn prison bitch due lack street skill prison experi watch may becom comfort uncomfort view get touch darker side'
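
The tokens "thi", "wa" and "ha" in the output above suggest that the order of the two steps matters. The sketch below uses a toy stopword set and a toy stem table (both assumptions; the table only copies stems visible in the notebook output and stands in for PorterStemmer) to compare stemming before filtering with filtering before stemming:

```python
# Toy stand-ins: a real run would use stopwords.words('english') and PorterStemmer
STOP_WORDS = {"this", "was", "has", "the", "a"}
TOY_STEMS = {"this": "thi", "was": "wa", "has": "ha",
             "reviewers": "review", "mentioned": "mention"}

def toy_stem(word):
    # Look up the stem, falling back to the word itself
    return TOY_STEMS.get(word, word)

def drop_stopwords(words):
    return [w for w in words if w not in STOP_WORDS]

words = "this was the show the reviewers mentioned".split()

stem_then_filter = drop_stopwords([toy_stem(w) for w in words])
filter_then_stem = [toy_stem(w) for w in drop_stopwords(words)]

print(stem_then_filter)  # ['thi', 'wa', 'show', 'review', 'mention']
print(filter_then_stem)  # ['show', 'review', 'mention']
```

Filtering first removes "this" and "was" while they still match the list; whether that changes the accuracies reported later is not something the notebook measures.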

#### Creating a bag of words (BOW) from the corpus

In [12]:
y = np.array(data.sentiment.values)

# Keep the 1000 most frequent terms; toarray() makes the matrix dense, which GaussianNB requires
cv = CountVectorizer(max_features = 1000)
X = cv.fit_transform(data.review).toarray()

print("X.shape = ",X.shape)
print("y.shape = ",y.shape)

X.shape =  (50000, 1000)
y.shape =  (50000,)
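
To make the shape (50000, 1000) concrete, here is a rough pure-Python sketch of what a bag of words with a feature cap does. It is a simplification and not scikit-learn's implementation: the function `bag_of_words` and the toy corpus are assumptions, and whitespace splitting replaces CountVectorizer's own tokenizer.

```python
from collections import Counter

def bag_of_words(docs, max_features):
    tokenized = [doc.split() for doc in docs]
    # Total frequency of every term across the corpus
    totals = Counter(tok for doc in tokenized for tok in doc)
    # Keep the most frequent terms, then order the vocabulary alphabetically
    vocab = sorted(term for term, _ in totals.most_common(max_features))
    # One row per document, one column per vocabulary term
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return vocab, rows

docs = ["good good film", "bad film", "good plot bad bad bad"]
vocab, X_toy = bag_of_words(docs, max_features=3)

print(vocab)  # ['bad', 'film', 'good']
print(X_toy)  # [[0, 1, 2], [1, 1, 0], [3, 0, 1]]
```

The rarest term ("plot") falls outside the cap and is ignored, just as every term beyond the 1000 most frequent is ignored in the notebook.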


#### Preparing train, test data

In [13]:
trainx,testx,trainy,testy = train_test_split(X,y,test_size=0.3,random_state=9)
print("Train shapes : X = {}, y = {}".format(trainx.shape,trainy.shape))
print("Test shapes : X = {}, y = {}".format(testx.shape,testy.shape))

Train shapes : X = (35000, 1000), y = (35000,)
Test shapes : X = (15000, 1000), y = (15000,)
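
The 35000/15000 sizes follow directly from test_size=0.3 applied to 50000 rows. A minimal sketch of a shuffled split on indices is shown below; `simple_split` is a hypothetical helper that does not reproduce scikit-learn's exact shuffling, so the individual rows it selects will differ from those chosen by train_test_split.

```python
import math
import random

def simple_split(n_samples, test_size, seed):
    # Shuffle the row indices reproducibly, then cut off the test portion
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_samples)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_split(50000, test_size=0.3, seed=9)
print(len(train_idx), len(test_idx))  # 35000 15000
```

Fixing the seed plays the same role as random_state=9 in the notebook: repeated runs produce the same partition.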


## Model Training

For this supervised classification task, we train the following five models and compare their accuracy on the test set to determine the best one.

1. Gaussian Naive Bayes
2. Multinomial Naive Bayes 
3. Bernoulli Naive Bayes
4. Logistic Regression
5. Linear Support Vector Machine (SVM), trained with stochastic gradient descent
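
The last two models in the list are both linear classifiers and differ mainly in the loss they minimise. As a sketch, assuming labels recoded as $y \in \{-1, +1\}$ and a linear score $s = w^\top x + b$ (this notation is introduced here for illustration and does not appear in the notebook), the per-sample losses are:

```latex
\ell_{\text{log}}(y, s) = \log\bigl(1 + e^{-y s}\bigr)
\qquad\text{and}\qquad
\ell_{\text{hinge}}(y, s) = \max\bigl(0,\; 1 - y s\bigr),
\qquad s = w^\top x + b .
```

The hinge loss is zero once a sample is classified correctly with a margin of at least 1, while the logistic loss keeps decreasing without ever reaching zero. Both are typically combined with an L2 penalty on $w$.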

In [14]:
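# Naive Bayes variants (alpha=1.0 is Laplace smoothing)
gnb = GaussianNB()
mnb = MultinomialNB(alpha=1.0, fit_prior=True)
bnb = BernoulliNB(alpha=1.0, fit_prior=True)

# max_iter=100 is too few iterations for the solver to converge here, hence the warning below
lr = LogisticRegression(penalty='l2', max_iter=100, C=1)
# With hinge loss, SGDClassifier fits a linear SVM
svm = SGDClassifier(loss='hinge', max_iter=100)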

gnb.fit(trainx,trainy)
mnb.fit(trainx,trainy)
bnb.fit(trainx,trainy)
lr.fit(trainx, trainy)
svm.fit(trainx, trainy)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


SGDClassifier(max_iter=100)

#### Performing predictions on the test data

In [15]:
ypg = gnb.predict(testx)
ypm = mnb.predict(testx)
ypb = bnb.predict(testx)
ypl = lr.predict(testx)
yps = svm.predict(testx)

#### Analysing performance metrics

In [16]:
print('----Accuracy for different models on test data----')
print("Gaussian = ",accuracy_score(testy,ypg))
print("Multinomial = ",accuracy_score(testy,ypm))
print("Bernoulli = ",accuracy_score(testy,ypb))
print("Logistic Regression = ",accuracy_score(testy,ypl))
print("SVM = ",accuracy_score(testy,yps))

----Accuracy for different models on test data----
Gaussian =  0.7831333333333333
Multinomial =  0.8258
Bernoulli =  0.8354666666666667
Logistic Regression =  0.8652
SVM =  0.8606
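
Accuracy is simply the fraction of test reviews whose predicted label matches the true label; for example, 0.8652 on 15000 test reviews corresponds to 12978 correct predictions. A minimal sketch of the computation, using a hypothetical `accuracy` function and toy labels rather than the notebook's data:

```python
def accuracy(y_true, y_pred):
    # Count positions where the prediction equals the true label
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
acc = accuracy(y_true, y_pred)
print(acc)  # 0.6
```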


In [17]:
# Classification report for Logistic regression
print(classification_report(testy, ypl))

              precision    recall  f1-score   support

           0       0.87      0.86      0.86      7525
           1       0.86      0.87      0.87      7475

    accuracy                           0.87     15000
   macro avg       0.87      0.87      0.87     15000
weighted avg       0.87      0.87      0.87     15000



In [18]:
# Classification report for SVM
print(classification_report(testy, yps))

              precision    recall  f1-score   support

           0       0.87      0.84      0.86      7525
           1       0.85      0.88      0.86      7475

    accuracy                           0.86     15000
   macro avg       0.86      0.86      0.86     15000
weighted avg       0.86      0.86      0.86     15000
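
The precision, recall and f1-score columns in the reports above are derived from counts of true positives, false positives and false negatives for each class. The sketch below computes them for one class on toy labels; the function `precision_recall_f1` and the label lists are assumptions for illustration and are unrelated to the notebook's predictions.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Calling the function with positive=0 gives the row for the negative class, mirroring the two rows of the classification report.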



# Summary

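Logistic Regression achieves the highest test accuracy of the five models (0.8652, against 0.8606 for the SVM), with f1-scores of 0.86 and 0.87 for the two classes, so we go ahead with the __Logistic Regression model__ for this data. The margin over the SVM is small, and the convergence warning during training suggests the Logistic Regression result may still change with a larger max_iter.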