
In [1]:
import numpy as np 
import pandas as pd 
import nltk

In [2]:
data=pd.read_csv('IMDB Dataset.csv')


In [3]:
data.shape
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
## Review length in characters before preprocessing
data["pre_process_len"]=data["review"].str.len()

### STEPS TO CLEAN THE REVIEWS :
 - Remove HTML tags
 - Remove special characters (punctuation) and convert to lowercase
 - Remove stopwords
 - Stemming with PorterStemmer
 - Convert words to vectors (bag of words with CountVectorizer)
 - Encode the target labels
 
### Train Model
 - Train-test-split
 - Modeling

In [5]:
## Remove HTML tags
import re

# Compile once; the non-greedy pattern matches a single tag such as <br />
HTML_TAG=re.compile(r'<.*?>')

def remove_html_tag(text):
    return HTML_TAG.sub('',text)

data['review']=data['review'].apply(remove_html_tag)
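Replacing a tag with an empty string glues together the words on either side of it whenever the review has no space around the tag. A minimal sketch of one possible variant, which substitutes a space and then collapses repeated whitespace, is shown here; the helper name `strip_tags_keep_spacing` and the sample sentence are made up for illustration.

```python
import re

HTML_TAG = re.compile(r'<.*?>')

def strip_tags_keep_spacing(text):
    """Replace each tag with a space, then collapse runs of whitespace."""
    no_tags = HTML_TAG.sub(' ', text)
    return ' '.join(no_tags.split())

# Made-up sample with no space between the sentence and the tags
sample = "A wonderful little production.<br /><br />The filming technique"

print(HTML_TAG.sub('', sample))         # words on both sides of the tags are joined
print(strip_tags_keep_spacing(sample))  # words stay separated
```

Whether the extra space matters depends on the data; in reviews where tags are already surrounded by spaces, both versions produce the same tokens after splitting.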

In [6]:
## Remove punctuation and convert to lowercase
import string
def remove_punctuation(sentence):
    # Keep every character that is not ASCII punctuation, lowercased
    review=[ch.lower() for ch in sentence if ch not in string.punctuation]
    return ''.join(review)

data['review']=data['review'].apply(remove_punctuation)
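The character-by-character loop can also be written with `str.translate`, which removes all punctuation in a single pass over the string. This is only a sketch of an equivalent approach for ordinary English text; `remove_punctuation_fast` and the sample sentence are illustrative names, not part of the notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_fast(sentence):
    """Delete punctuation with one translate call, then lowercase."""
    return sentence.translate(PUNCT_TABLE).lower()

sample = "Wow!!! It's a GREAT film, isn't it?"  # made-up sentence
print(remove_punctuation_fast(sample))
```

Note that apostrophes are removed as well, so contractions such as "isn't" become "isnt".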

In [7]:
## Remove stopwords
# Requires the NLTK stopword corpus: nltk.download('stopwords')
from nltk.corpus import stopwords

# Build the set once; set lookup is much faster than scanning a list per word
stop_words=set(stopwords.words('english'))

def remove_stopwords(sentence):
    review=[word for word in sentence.split() if word not in stop_words]
    return ' '.join(review)

data['review']=data['review'].apply(remove_stopwords)
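The order of the cleaning steps affects which words survive. Because punctuation is stripped before stopwords are filtered, a contraction loses its apostrophe and no longer matches a stopword entry that contains one. The sketch below illustrates this with a small hypothetical stopword set standing in for the NLTK list; the set contents and helper names are assumptions made for the example.

```python
import string

# Hypothetical stand-in for the NLTK list, which also contains contractions
STOP_WORDS = {"the", "is", "a", "it", "isn't", "don't"}

def drop_stopwords(text, stop_words=STOP_WORDS):
    """Keep only the whitespace-separated words that are not stopwords."""
    return ' '.join(w for w in text.split() if w not in stop_words)

def strip_punct(text):
    """Delete ASCII punctuation characters."""
    return text.translate(str.maketrans('', '', string.punctuation))

review = "the plot isn't bad"  # made-up review

punct_first = drop_stopwords(strip_punct(review))  # order used in the notebook
stop_first = strip_punct(drop_stopwords(review))   # alternative order

print(punct_first)  # "isnt" survives because it no longer matches "isn't"
print(stop_first)   # the contraction is removed as a stopword
```

Neither order is necessarily better for sentiment analysis: keeping "isnt" preserves a negation cue, while removing it shortens the vocabulary.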

In [8]:
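## Stemming
from nltk import PorterStemmer 
ps=PorterStemmer() 

# Note: ps.stem expects a single word. Applied to a whole review, it treats the
# string as one token, so individual words are left unstemmed (as in the output below).
# To stem word by word instead:
# data['review']=data['review'].apply(lambda s: ' '.join(ps.stem(w) for w in s.split()))
data['review']=data['review'].apply(ps.stem)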

In [9]:
## Review length in characters after preprocessing
data["post_process_len"]=data["review"].str.len()   ## length in characters after preprocessing
data["reduction_percent"]=round((data["post_process_len"]/data["pre_process_len"])*100)  ## remaining length as a % of the original length
print(data["reduction_percent"].mean())      ## mean is about 63.5 %: that share of the text remains, i.e. a reduction of about 36.5 %

data.head()

63.53842


Unnamed: 0,review,sentiment,pre_process_len,post_process_len,reduction_percent
0,one reviewers mentioned watching 1 oz episode ...,positive,1761,1158,66.0
1,wonderful little production filming technique ...,positive,998,655,66.0
2,thought wonderful way spend time hot summer we...,positive,926,588,63.0
3,basically theres family little boy jake thinks...,negative,748,459,61.0
4,petter matteis love time money visually stunni...,positive,1317,863,66.0
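Despite its name, the `reduction_percent` column holds the share of the original length that remains after cleaning, not the share that was removed. The arithmetic for the first row of the table above can be checked by hand, as in this short sketch (the variable names are illustrative):

```python
pre_len, post_len = 1761, 1158  # first row of the table above

retained_percent = post_len / pre_len * 100   # what the column stores
reduction_percent = 100 - retained_percent    # what was actually removed

print(round(retained_percent))   # 66
print(round(reduction_percent))  # 34
```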


In [10]:
## Encode the sentiment column: positive -> 1, negative -> 0
sentiment_mapper={"positive":1,"negative":0}
# Assign the result back instead of calling replace(..., inplace=True) on a column,
# which relies on chained assignment
data['sentiment']=data['sentiment'].map(sentiment_mapper)
data.head()

labels=data['sentiment'].to_numpy()
type(labels)
labels

Unnamed: 0,review,sentiment,pre_process_len,post_process_len,reduction_percent
0,one reviewers mentioned watching 1 oz episode ...,1,1761,1158,66.0
1,wonderful little production filming technique ...,1,998,655,66.0
2,thought wonderful way spend time hot summer we...,1,926,588,63.0
3,basically theres family little boy jake thinks...,0,748,459,61.0
4,petter matteis love time money visually stunni...,1,1317,863,66.0


## Bag of Words model

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
# Keep the 10,000 most frequent terms; fit_transform returns a sparse document-term matrix
c_vectorizer=CountVectorizer(max_features=10000)
x=c_vectorizer.fit_transform(data["review"])

#c_vectorizer.get_feature_names_out()  # feature names (get_feature_names() in older scikit-learn)
#c_vectorizer.vocabulary_              # term -> column index mapping (dict)
#c_vectorizer.fixed_vocabulary_        # False: the vocabulary was learned from the data, not supplied by the user

In [16]:
x.shape
#np.unique(x[0])
type(x)

scipy.sparse.csr.csr_matrix
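To make the bag-of-words representation concrete, the following standard-library sketch builds a tiny document-term count matrix by hand. It only illustrates the idea: `CountVectorizer` additionally applies its own tokenization, lowercasing, and the `max_features` cutoff, and it stores the result as a sparse matrix. The two documents are made up.

```python
from collections import Counter

docs = ["wonderful little production", "little boy thinks little"]  # made-up documents

# Vocabulary: every distinct token, sorted so the column order is deterministic
vocabulary = sorted({token for doc in docs for token in doc.split()})

def to_counts(doc):
    """Return one row of the matrix: the count of each vocabulary term in doc."""
    counts = Counter(doc.split())
    return [counts.get(token, 0) for token in vocabulary]

matrix = [to_counts(doc) for doc in docs]

print(vocabulary)  # column labels
print(matrix)      # one row per document
```

Each row has one entry per vocabulary term, which is why most entries are zero for real reviews and a sparse format is used.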

In [20]:
## Split data: 75 % train, 25 % test (no random_state is set, so the split differs between runs)
from sklearn.model_selection import train_test_split 
x_train,x_test,y_train,y_test=train_test_split(x,labels,train_size=0.75)

### Modeling:

In [24]:
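from sklearn.linear_model import LogisticRegression

# C=0.1 applies stronger L2 regularization than the default C=1.0.
# max_iter is left at its default (100), which triggers the convergence warning below;
# passing a larger value such as max_iter=1000 is the usual remedy.
lr=LogisticRegression(C=0.1)
lr.fit(x_train,y_train)

ypred=lr.predict(x_test)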

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [35]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB,MultinomialNB,BernoulliNB
from sklearn.metrics import accuracy_score


gnb=GaussianNB()
bnb=BernoulliNB(alpha=1.0,fit_prior=True)
mnb=MultinomialNB(alpha=1.0,fit_prior=True)


gnb.fit(x_train.toarray(),y_train)   # GaussianNB requires a dense array
bnb.fit(x_train,y_train)             # BernoulliNB and MultinomialNB accept the sparse matrix directly
mnb.fit(x_train,y_train)

MultinomialNB()

In [37]:
ypg=gnb.predict(x_test.toarray())
ypb=bnb.predict(x_test)
ypm=mnb.predict(x_test)

print("Gaussian =",accuracy_score(y_test,ypg))
print("Bernoulli = ",accuracy_score(y_test,ypb))
print("Multinomial = ",accuracy_score(y_test,ypm))

Gaussian = 0.73976
Bernoulli =  0.85352
Multinomial =  0.85016


In [28]:
## 15-fold cross-validation with cross_val_score
## (run on the held-out split x_test/y_test, so each fold refits lr on 14/15 of that split)
from sklearn.model_selection import cross_val_score
scores=cross_val_score(lr,x_test,y_test,cv=15,scoring="accuracy") 
np.average(scores)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

0.8665590936614455
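For intuition about what `cv=15` does with the 12,500 test samples, the sketch below generates plain contiguous folds using only the standard library. It is a simplified illustration: for a classifier, scikit-learn uses stratified folds when given an integer `cv`, which this sketch does not attempt. The function name `kfold_indices` is hypothetical.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(12500, 15))
sizes = [len(test_idx) for _, test_idx in folds]
print(sizes[:6])  # [834, 834, 834, 834, 834, 833]
```

Each of the 15 models is therefore trained on roughly 11,667 samples and scored on about 833, and the reported value is the mean of those 15 accuracies.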

In [32]:
## Confusion Matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,ypred)

array([[5444,  720],
       [ 699, 5637]], dtype=int64)

In [31]:
## Classification Report
from sklearn.metrics import classification_report
print(classification_report(y_test,ypred))


              precision    recall  f1-score   support

           0       0.89      0.88      0.88      6164
           1       0.89      0.89      0.89      6336

    accuracy                           0.89     12500
   macro avg       0.89      0.89      0.89     12500
weighted avg       0.89      0.89      0.89     12500
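The figures in the classification report follow directly from the confusion matrix shown earlier, where rows are true labels and columns are predicted labels. The following sketch recomputes them by hand; the variable names are illustrative.

```python
# Confusion matrix from the output above: rows are true labels, columns are predictions
tn, fp = 5444, 720   # true class 0: predicted 0, predicted 1
fn, tp = 699, 5637   # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

precision_1 = tp / (tp + fp)   # of all predicted positives, how many were right
recall_1 = tp / (tp + fn)      # of all true positives, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(f"accuracy    = {accuracy:.2f}")
print(f"class 1: precision = {precision_1:.2f}, recall = {recall_1:.2f}, f1 = {f1_1:.2f}")
print(f"class 0: precision = {precision_0:.2f}, recall = {recall_0:.2f}")
```

The row sums (6164 and 6336) match the `support` column of the report.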

