# IMDB Dataset of 50K Movie Reviews

This notebook uses the IMDB dataset of 50K movie reviews for text classification with Multinomial Naive Bayes.

This is a dataset for binary sentiment classification.

We provide a set of 25,000 highly polar movie reviews for training and 25,000 for testing.

For more information about the dataset, see:
http://ai.stanford.edu/~amaas/data/sentiment/

Data Source: https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

## [0]. Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split , KFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix , accuracy_score
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Tutorial about Python regular expressions: https://pymotw.com/2/re/
import re
import string
# Library for stopwords
from nltk.corpus import stopwords
# Library for Stemmer
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
# Library for Lemmatizer
from nltk.stem.wordnet import WordNetLemmatizer

# from tqdm import tqdm
from tqdm.notebook import tqdm
import os

## [1]. Reading Data

### [1.1] Loading the data

The dataset is available as a .csv file.

Here we only want the overall sentiment of each review (positive or negative).

In [2]:
# importing the dataset
df = pd.read_csv('IMDB dataset.csv')
# check the structure of the dataset and whether it has any null values
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB
None


In [3]:
# printing the top 5 rows to see what the data looks like
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
print(df.sentiment.value_counts())      # print(dataset['sentiment'].value_counts())

positive    25000
negative    25000
Name: sentiment, dtype: int64


In [5]:
df_count = df.groupby(['review', 'sentiment']).size().reset_index(name='count')
df_count

Unnamed: 0,review,sentiment,count
0,A Turkish Bath sequence in a film noir loc...,positive,1
1,"!!! Spoiler alert!!!<br /><br />The point is, ...",negative,1
2,!!!! MILD SPOILERS !!!!<br /><br />The premise...,negative,1
3,!!!! MILD SPOILERS !!!!<br /><br />With the ex...,negative,1
4,!!!! POSSIBLE MILD SPOILER !!!!!<br /><br />As...,negative,1
...,...,...,...
49577,{Possible spoilers coming up... you've been fo...,positive,1
49578,{rant start} I didn't want to believe them at ...,negative,1
49579,~~I was able to see this movie yesterday morni...,positive,1
49580,Film auteur Stephan Woloszczuk explores th...,positive,1


In [6]:
df_count['count'].sort_values( axis=0, ascending=False)

26260    5
11782    4
48038    3
27652    3
30884    3
        ..
32982    1
32981    1
32980    1
32979    1
0        1
Name: count, Length: 49582, dtype: int64

### Conclusion
1. The dataset has 2 columns: review and sentiment
2. The review is whatever the user wrote, and the sentiment records whether they liked the movie or not
3. This is a binary classification dataset: the sentiment is positive if the user liked the movie and negative if the user disliked it
4. The dataset has no null values and is balanced
5. We observe duplicated rows, so data cleaning is needed
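
The duplicate count behind point 5 can also be checked directly with `DataFrame.duplicated`. A minimal sketch on a hypothetical three-row frame (the rows are made up for illustration and stand in for the real `df`):

```python
import pandas as pd

# hypothetical toy frame; the notebook itself works on the 50,000-row IMDB frame `df`
toy = pd.DataFrame({
    "review": ["great film", "great film", "dull plot"],
    "sentiment": ["positive", "positive", "negative"],
})

# rows that repeat an earlier (review, sentiment) pair
n_duplicates = int(toy.duplicated(subset=["review", "sentiment"]).sum())
print(n_duplicates)

# keep only the first occurrence of each pair
deduplicated = toy.drop_duplicates(subset=["review", "sentiment"], keep="first")
print(len(deduplicated))
```

On the real data the same two calls would report how many rows are repeats and how many remain after dropping them.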

## [2] Exploratory Data Analysis

### [2.1] Data Cleaning: Deduplication
It is observed (as shown in the table above) that the reviews data had duplicate entries. Hence it was necessary to remove duplicates in order to get unbiased results for the analysis of the data. Following is an example:

In [7]:
print(df_count.iloc[26260])

review       Loved today's show!!! It was a variety and not...
sentiment                                             positive
count                                                        5
Name: 26260, dtype: object


In [8]:
# Deduplication of entries (subset given as a list of column labels)
df.drop_duplicates(subset=['review','sentiment'], keep='first', inplace=True, ignore_index=False)

# after deduplication again check the data
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49582 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     49582 non-null  object
 1   sentiment  49582 non-null  object
dtypes: object(2)
memory usage: 1.1+ MB


Next, the categorical label is replaced with a numeric value:
- 1 for positive
- 0 for negative

In [9]:
# creating another column, with column name as response positive ---> 1 and negative ---> 0 
df['response'] = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)
df.head()

Unnamed: 0,review,sentiment,response
0,One of the other reviewers has mentioned that ...,positive,1
1,A wonderful little production. <br /><br />The...,positive,1
2,I thought this was a wonderful way to spend ti...,positive,1
3,Basically there's a family where a little boy ...,negative,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,1


In [10]:
# firstly dropping sentiment column and then rename response column to sentiment
del df['sentiment']
df=df.rename(columns = {'response':'sentiment'})
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


In [11]:
((1-((df['sentiment'].size*1.0)/(50000*1.0)))*100)

0.8360000000000034

In [12]:
# Checking what % of the data was lost to deduplication
print('Total lost due to deduplication {} %'.format((1-((df['sentiment'].size*1.0)/(50000*1.0)))*100))

Total lost due to deduplication 0.8360000000000034 %


### Categorical value converted to Numerical value
1. Sentiment column value changed from categorical to numerical
2. Positive value converted to 1 and negative value converted to 0

## [3] Text Pre-processing

### [3.1] Splitting the dataset into the Training set and Test set

The data is split before any text pre-processing is done.

In [17]:
# split the dataset into train and test set
x_train, x_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.3, random_state=42)

In [18]:
print('no of row of training dataset review:',len(x_train))
print('no of row of training dataset sentiment:',len(y_train))

no of row of training dataset review: 34707
no of row of training dataset sentiment: 34707


In [19]:
print('no of row of test dataset review:',len(x_test))
print('no of row of test dataset sentiment:',len(y_test))

no of row of test dataset review: 14875
no of row of test dataset sentiment: 14875


Now that deduplication is finished, the data needs some preprocessing before we go further with the analysis and build the prediction model.

In the preprocessing phase we do the following, in this order:

1. Remove the HTML tags
2. Remove punctuation and a limited set of special characters such as , or . or #
3. Check that the word consists of English letters only and is not alphanumeric
4. Check that the word is longer than 2 characters (as it was researched that there is no adjective with 2 letters)
5. Convert the word to lowercase
6. Remove stopwords
7. Finally, apply the lemmatizer
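
The steps above can be sketched as a single function. This is only an illustration under stated assumptions: `TOY_STOP_WORDS` is a tiny hypothetical stop-word set (the notebook uses NLTK's English list), only the `n't` contraction is expanded, and lemmatization is left out so the sketch runs without NLTK data.

```python
import re

# hypothetical, tiny stop-word list; the notebook uses NLTK's English list
TOY_STOP_WORDS = {"the", "was", "and", "this", "but", "is"}

def preprocess(review, stop_words=TOY_STOP_WORDS):
    text = re.sub(r"<.*?>", " ", review)          # 1. strip HTML tags
    text = re.sub(r"[^A-Za-z0-9']+", " ", text)   # 2. drop punctuation, keep apostrophes
    text = re.sub(r"n't", " not", text)           #    expand one common contraction
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)    #    drop any remaining apostrophes
    tokens = []
    for word in text.split():
        # 3-4. letters only, longer than two characters
        if word.isalpha() and len(word) > 2:
            word = word.lower()                   # 5. lowercase
            if word not in stop_words:            # 6. stop-word filter
                tokens.append(word)               # 7. lemmatization omitted in this sketch
    return " ".join(tokens)

cleaned = preprocess("The plot was thin,<br /><br />but I didn't mind: 10/10!")
print(cleaned)  # plot thin did not mind
```

Punctuation is removed in two passes so that apostrophes survive long enough for contractions to be expanded, which is the same ordering the helper functions below rely on.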

Define a helper function for each of these cleaning steps

In [20]:
# function to clear all html tags
def clearHtml(sentence):
    cleanr = re.compile('<.*?>')
    cleanText = re.sub(cleanr,' ',sentence)
    return cleanText       # output is in string
# Function to clear all extra symbols except single quote
def clearPunc(sentence):
#     cleaned = re.sub(r'[?|!\|"|#|.|,|)|(\||\\|/|:]',r' ',sentence)
    cleaned = re.sub('[^A-Za-z0-9\']+', ' ', sentence)
    return cleaned        # output is in string

In [21]:
# function to convert common contractions to Normal words
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [22]:
# function to remove all other single quotes after treating common contraction
def clearRestSingleQuotes(sentence):
    cleaned = re.sub('[^A-Za-z0-9]+', ' ', sentence)
    return cleaned  

In [23]:
stop_words =set(stopwords.words('english'))    # set of stopwords
print(stop_words)

{'a', "should've", 'ours', 'i', "that'll", 'are', 'herself', 'what', 'had', 'into', "you're", 'wasn', 'any', 'because', 'aren', 'while', 'be', 'out', 'each', 'll', 'ain', 'd', 'this', 'above', 'off', 'whom', 'will', 'again', 'same', 'an', 'than', 'so', 'only', 'shan', 'its', 'were', "aren't", "haven't", 'with', 'other', 'theirs', 'after', 'under', 'weren', 'both', "needn't", 'the', "shouldn't", 'been', 'ourselves', 'these', 'we', 'at', 'them', 'below', "isn't", 'such', 'mustn', 'can', 'am', 'few', 'our', "hadn't", 'isn', "you've", 'between', "you'd", 'should', 'he', 'why', 'from', 'until', 'y', 'him', 'before', 'don', 'and', 'no', 'her', 'myself', 'down', 'needn', 'nor', "mustn't", 'yours', 'm', 'ma', 'on', 'if', 'is', 'but', "it's", 'over', 'just', "didn't", 'was', 'having', 'very', 'me', 'all', 'o', 'to', 'hadn', 'have', 'those', 'wouldn', 'she', 'himself', 'then', 'when', "you'll", 'won', "hasn't", 'of', 'they', 'which', "don't", 'own', 'how', 'most', 'his', 'that', "she's", 'during

In [24]:
df['review'][1]      #html tags are present in review data. Example is given below

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

In [25]:
from tqdm.notebook import tqdm
i=0
lemmatizer_word = WordNetLemmatizer() 
x_train_pre_processed = []
all_positive_word_train = []
all_negative_words_train = []
for sent in tqdm(x_train.values):
    filtered_sentence=[]
#     print(sent)
    sent = clearHtml(sent)
    sent=  clearPunc(sent)
    sent = decontracted(sent)
    sent = clearRestSingleQuotes(sent)
    for words in sent.split():
        for clear_words in clearPunc(words).split():
            if clear_words.isalpha() and len(clear_words) > 2:
                if (clear_words.lower() not in stop_words):
#                     s = (ps.stem(clear_words.lower())).encode('utf8')      #PorterStemmer
#                     s = (sno.stem(clear_words.lower())).encode('utf8')      #Snowball Stemmer
                      s = (lemmatizer_word.lemmatize(clear_words.lower())).encode('utf8')      # Lemmantizer
                      filtered_sentence.append(s)
                      if (y_train.values)[i]:
                            all_positive_word_train.append(s)
                      else:
                            all_negative_words_train.append(s)
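    # join once per review, after all of its words have been processed
    # (renamed from `str`, which shadowed the built-in)
    cleaned_review = b" ".join(filtered_sentence)
#     print(cleaned_review)
#     print('-'*90)
    x_train_pre_processed.append(cleaned_review)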
    i+=1

In [26]:
print(x_train_pre_processed[1])
print('-'*90) 
print(all_positive_word_train[0:10])
print('-'*90)
freq_dist_positive =nltk.FreqDist(all_positive_word_train)
print('Most positive words',freq_dist_positive.most_common(20))
print('-'*90)
print(all_negative_words_train[0:10])
print('-'*90)
freq_dist_negative =nltk.FreqDist(all_negative_words_train)
print('Most negative words',freq_dist_negative.most_common(20))

b'know film made react viscerally perhaps character unlikable compelling enough care perhaps disorganized storyline perhaps fact rob lowe wore long dangly earring eyeliner perhaps point movie break song perhaps never perhaps everything garish hyperbole perhaps character pump fist driving away camera fade know made hate mean trying watch willing find'
------------------------------------------------------------------------------------------
[b'loved', b'movie', b'beginning', b'end', b'musician', b'let', b'drug', b'get', b'way', b'thing']
------------------------------------------------------------------------------------------
Most positive words [(b'film', 34495), (b'movie', 31163), (b'one', 19737), (b'like', 12607), (b'time', 11215), (b'good', 10515), (b'story', 9798), (b'character', 9720), (b'would', 9093), (b'great', 9083), (b'see', 8876), (b'well', 8847), (b'get', 7777), (b'make', 7718), (b'also', 7581), (b'really', 7387), (b'scene', 7084), (b'life', 6948), (b'show', 6711), (b'even

### Observation
- A few words, such as 'good', appear among the most frequent words of both the positive and the negative reviews. 'good' is a positive word in everyday use, so in negative reviews it may be preceded by 'not', which was removed as a stopword.
- To check this, I remove 'not' from the set of stopwords and re-run the code.
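
Why this matters for the bigram features built later can be shown with a few lines of plain Python. The three cleaned reviews below are hypothetical; the sketch counts bigrams once with 'not' kept and once with 'not' filtered out.

```python
from collections import Counter

# hypothetical cleaned reviews, for illustration only
reviews = ["not good story", "good story", "not like ending"]

def bigrams(tokens):
    """Return adjacent word pairs joined by a space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

with_not = Counter()
without_not = Counter()
for review in reviews:
    tokens = review.split()
    with_not.update(bigrams(tokens))
    without_not.update(bigrams([t for t in tokens if t != "not"]))

print(with_not["not good"], with_not["good story"])        # 1 2
print(without_not["not good"], without_not["good story"])  # 0 2
```

With 'not' filtered out, the negated review contributes the same "good story" bigram as the positive one, so the negation is no longer visible to the model.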

In [27]:
#removing stop words like "not" should be avoided before building n-grams

print(len(stop_words))
stop_words.remove('not')       #removing not words from stop words
print(len(stop_words))

179
178


In [28]:
from tqdm.notebook import tqdm
i=0
lemmatizer_word = WordNetLemmatizer() 
x_train_pre_processed = []
all_positive_word_train = []
all_negative_words_train = []
for sent in tqdm(x_train.values):
    filtered_sentence=[]
#     print(sent)
    sent = clearHtml(sent)
    sent=  clearPunc(sent)
    sent = decontracted(sent)
    sent = clearRestSingleQuotes(sent)
    for words in sent.split():
        for clear_words in clearPunc(words).split():
            if clear_words.isalpha() and len(clear_words) > 2:
                if (clear_words.lower() not in stop_words):
#                     s = (ps.stem(clear_words.lower())).encode('utf8')      #PorterStemmer
#                     s = (sno.stem(clear_words.lower())).encode('utf8')      #Snowball Stemmer
                      s = (lemmatizer_word.lemmatize(clear_words.lower())).encode('utf8')      # Lemmantizer
                      filtered_sentence.append(s)
                      if (y_train.values)[i]:
                            all_positive_word_train.append(s)
                      else:
                            all_negative_words_train.append(s)
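    # join once per review, after all of its words have been processed
    # (renamed from `str`, which shadowed the built-in)
    cleaned_review = b" ".join(filtered_sentence)
#     print(cleaned_review)
#     print('-'*90)
    x_train_pre_processed.append(cleaned_review)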
    i+=1

In [29]:
print(x_train_pre_processed[1])
print('-'*90) 
print(all_positive_word_train[0:10])
print('-'*90)
freq_dist_positive =nltk.FreqDist(all_positive_word_train)
print('Most positive words',freq_dist_positive.most_common(20))
print('-'*90)
print(all_negative_words_train[0:10])
print('-'*90)
freq_dist_negative =nltk.FreqDist(all_negative_words_train)
print('Most negative words',freq_dist_negative.most_common(20))

b'not know film made react viscerally perhaps character unlikable not compelling enough care perhaps disorganized storyline perhaps fact rob lowe wore long dangly earring eyeliner perhaps point movie break song perhaps never perhaps everything garish hyperbole perhaps character pump fist driving away camera fade not know made hate mean trying watch not willing find'
------------------------------------------------------------------------------------------
[b'loved', b'movie', b'beginning', b'end', b'musician', b'let', b'drug', b'get', b'way', b'thing']
------------------------------------------------------------------------------------------
Most positive words [(b'not', 38169), (b'film', 34495), (b'movie', 31163), (b'one', 19737), (b'like', 12607), (b'time', 11215), (b'good', 10515), (b'story', 9798), (b'character', 9720), (b'would', 9093), (b'great', 9083), (b'see', 8876), (b'well', 8847), (b'get', 7777), (b'make', 7718), (b'also', 7581), (b'really', 7387), (b'scene', 7084), (b'life'

In [30]:
from tqdm.notebook import tqdm
i=0
lemmatizer_word = WordNetLemmatizer() 
x_test_pre_processed = []
all_positive_words_test = []
all_negative_words_test = []
for sent in tqdm(x_test.values):
    filtered_sentence=[]
#     print(sent)
    sent = clearHtml(sent)
    sent=  clearPunc(sent)
    sent = decontracted(sent)
    sent = clearRestSingleQuotes(sent)
    for words in sent.split():
        for clear_words in clearPunc(words).split():
            if clear_words.isalpha() and len(clear_words) > 2:
                if (clear_words.lower() not in stop_words):
#                     s = (ps.stem(clear_words.lower())).encode('utf8')      #PorterStemmer
#                     s = (sno.stem(clear_words.lower())).encode('utf8')      #Snowball Stemmer
                      s = (lemmatizer_word.lemmatize(clear_words.lower())).encode('utf8')      # Lemmantizer
                      filtered_sentence.append(s)
                      if (y_test.values)[i]:
                            all_positive_words_test.append(s)
                      else:
                            all_negative_words_test.append(s)
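    # join once per review, after all of its words have been processed
    # (renamed from `str`, which shadowed the built-in)
    cleaned_review = b" ".join(filtered_sentence)
#     print(cleaned_review)
#     print('-'*90)
    x_test_pre_processed.append(cleaned_review)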
    i+=1

In [31]:
print(x_test_pre_processed[1])
print('-'*90) 
print(all_positive_words_test[0:10])
print('-'*90)
freq_dist_positive =nltk.FreqDist(all_positive_words_test)
print('Most positive words',freq_dist_positive.most_common(20))
print('-'*90)
print(all_negative_words_test[0:10])
print('-'*90)
freq_dist_negative =nltk.FreqDist(all_negative_words_test)
print('Most negative words',freq_dist_negative.most_common(20))

b'guest future tell fascinating story time travel friendship battle good evil small budget child actor special effect something spielberg lucas learn sixth grader kolya nick gerasimov find time machine basement decrepit building travel year future discovers near perfect utopian society robot play guitar write poetry everyone kind people enjoy everything technology offer alice daughter prominent scientist invented device called mielophone allows read mind human animal device put good bad use depending whose hand fall two evil space pirate saturn want rule universe attempt steal mielophone fall hand century school boy nick pirate hot track travel back time followed pirate alice chaos confusion funny situation follow luckless pirate try blend earthling alice enrolls school nick go demonstrates superhuman ability class catch alice not know nick look like pirate also pirate able change appearance turn literally anyone hmm wonder james cameron got idea terminator get nick mielophone first ex

### [3.2] Preprocessing Review Summary

- Our earlier observation was correct: 'not' appears 50031 times among the negative words, so it probably occurs in phrases such as "not good" or "not like".

## [4] Featurization

### [4.1] Bag of Words (BOW)

In [40]:
# creating BOW with unigrams and bigrams
count_vect = CountVectorizer(ngram_range=(1,2))
x_train_count = count_vect.fit_transform(x_train_pre_processed)
x_test_count = count_vect.transform(x_test_pre_processed)
print(x_train_count.get_shape())
print(x_test_count.get_shape())

(34707, 2218175)
(14875, 2218175)


In [41]:
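# getting the list of all features (words and bigrams)
# note: scikit-learn 1.2 removed get_feature_names(); use get_feature_names_out() there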
count_vect.get_feature_names()

['aaa',
 'aaa ball',
 'aaa even',
 'aaa favorite',
 'aaa jawani',
 'aaa level',
 'aaa not',
 'aaa yeah',
 'aaaaaaaaaaaahhhhhhhhhhhhhh',
 'aaaaaaaaaaaahhhhhhhhhhhhhh hurting',
 'aaaaaaaargh',
 'aaaaaaah',
 'aaaaaaah saw',
 'aaaaaaahhhhhhggg',
 'aaaaagh',
 'aaaaagh scene',
 'aaaaah',
 'aaaaah movie',
 'aaaaah never',
 'aaaaahhhh',
 'aaaaahhhh get',
 'aaaaatch',
 'aaaaatch kah',
 'aaaaaw',
 'aaaaaw cry',
 'aaaahhhhhh',
 'aaaahhhhhh terrible',
 'aaaahhhhhhh',
 'aaaahhhhhhh run',
 'aaaarrgh',
 'aaaarrgh former',
 'aaaawwwwww',
 'aaaawwwwww well',
 'aaaggghhhhhhh',
 'aaaggghhhhhhh not',
 'aaah',
 'aaah friggin',
 'aaah leg',
 'aaahhhhhhh',
 'aaahhhhhhh scene',
 'aaall',
 'aaall way',
 'aaam',
 'aaam going',
 'aaargh',
 'aaargh bad',
 'aaargh dead',
 'aaargh not',
 'aab',
 'aab tak',
 'aachen',
 'aachen palm',
 'aachen two',
 'aada',
 'aada adhura',
 'aag',
 'aag actually',
 'aag break',
 'aag director',
 'aag fail',
 'aag figure',
 'aag fire',
 'aag hit',
 'aag jugnu',
 'aag make',
 'aag nev

In [42]:
feature_array = np.array(count_vect.get_feature_names())
feature_array[5000:5660]

array(['absurd quite', 'absurd rather', 'absurd rating',
       'absurd reaction', 'absurd reference', 'absurd regard',
       'absurd renny', 'absurd result', 'absurd revelation',
       'absurd ride', 'absurd ridiculous', 'absurd road',
       'absurd romance', 'absurd rule', 'absurd said', 'absurd sake',
       'absurd scenario', 'absurd scheider', 'absurd seemed',
       'absurd self', 'absurd serious', 'absurd setting', 'absurd sight',
       'absurd silliness', 'absurd sinister', 'absurd situation',
       'absurd sivaji', 'absurd sometimes', 'absurd sound',
       'absurd soundtrack', 'absurd space', 'absurd stagebound',
       'absurd stealing', 'absurd story', 'absurd stupid', 'absurd style',
       'absurd sucker', 'absurd suddenly', 'absurd superman',
       'absurd surreal', 'absurd take', 'absurd ted', 'absurd tedious',
       'absurd terrible', 'absurd theory', 'absurd thing', 'absurd think',
       'absurd though', 'absurd thre', 'absurd tired', 'absurd trash',
       'a

In [43]:
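# NOTE: this sorts the feature names alphabetically and reverses the order, so it returns
# the last 10 features in alphabetical order, not the 10 most frequent ones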
responses = feature_array
tfidf_sorting = np.argsort(responses).flatten()[::-1]

n = 10
top_n = feature_array[tfidf_sorting][:n]
top_n

array(['zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz',
       'zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz excuse',
       'zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz',
       'zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz ooops',
       'zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz', 'zzzzzzzzzzzzzzzzzz imdb',
       'zzzzzzzzzzzzzzzzzz', 'zzzzzzzzzzzzz way', 'zzzzzzzzzzzzz',
       'zzzzzzzzzzzz pop'], dtype='<U145')

## [5] Modelling

### [5.1] Selecting the right hyperparameter i.e. alpha using 10-fold C.V.

In [44]:
# 10-fold splitter (not passed to cross_val_score below, which uses cv=10 directly)
kf = KFold(n_splits=10)
kf

KFold(n_splits=10, random_state=None, shuffle=False)

In [45]:
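# NOTE: 00.1 evaluates to 0.1, so alpha = 0.1 is tested twice (0.001 was probably intended)
alphas = [0.0001,00.1,0.1,1,10,20,200,2000,20000,2000000]    # List of some alpha values
for i in alphas:
    # NOTE: the value labelled "standard deviation" is computed with .mean(), not .std(),
    # so the output shows the mean accuracy twice (as a percentage and as a fraction)
    print('''Multinomial model with alpha = {} have mean {} and standard deviation {}''' \
          .format(i,(cross_val_score(MultinomialNB(alpha=i),x_train_count,y_train,cv=10).mean())*100,cross_val_score(MultinomialNB(alpha=i),x_train_count,y_train,cv=10).mean()))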

Multinomial model with alpha = 0.0001 have mean 85.14994723675876 and standard deviation 0.8514994723675876
Multinomial model with alpha = 0.1 have mean 87.52123440246355 and standard deviation 0.8752123440246355
Multinomial model with alpha = 0.1 have mean 87.52123440246355 and standard deviation 0.8752123440246355
Multinomial model with alpha = 1 have mean 87.9361195313661 and standard deviation 0.879361195313661
Multinomial model with alpha = 10 have mean 87.49239935339084 and standard deviation 0.8749239935339084
Multinomial model with alpha = 20 have mean 87.1149557843208 and standard deviation 0.871149557843208
Multinomial model with alpha = 200 have mean 84.9050377894402 and standard deviation 0.8490503778944021
Multinomial model with alpha = 2000 have mean 79.04454197272253 and standard deviation 0.7904454197272253
Multinomial model with alpha = 20000 have mean 68.15912247797102 and standard deviation 0.6815912247797102
Multinomial model with alpha = 2000000 have mean 65.272114

### Conclusion 
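- The Multinomial model with alpha = 1 has the highest mean accuracy (87.94%). The value printed as "standard deviation" is the same mean expressed as a fraction, because the cell calls `.mean()` twice, so no standard deviation is actually reported.
- From alpha = 0.0001 to alpha = 1 the mean accuracy increases; beyond alpha = 1 it starts to decrease.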
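
A sketch of the same search that computes the fold scores once per alpha and reports `.std()` alongside `.mean()`. It runs on a hypothetical random count matrix (`X_toy`, `y_toy`) with 5 folds, not on the review data, so its numbers say nothing about the results above.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
# hypothetical toy count matrix: 100 documents, 20 "words"
X_toy = rng.integers(0, 5, size=(100, 20))
y_toy = np.array([0, 1] * 50)
# let class 1 use the first word more often so there is some signal to learn
X_toy[y_toy == 1, 0] += 5

alphas = [0.0001, 0.001, 0.1, 1, 10]
results = {}
for alpha in alphas:
    # one cross-validation run per alpha; mean and std come from the same scores
    scores = cross_val_score(MultinomialNB(alpha=alpha), X_toy, y_toy, cv=5)
    results[alpha] = (scores.mean(), scores.std())
    print(f"alpha = {alpha}: mean accuracy {scores.mean():.4f}, std {scores.std():.4f}")

best_alpha = max(results, key=lambda a: results[a][0])
print("best alpha:", best_alpha)
```

Reusing one `scores` array also halves the work compared with calling `cross_val_score` twice per alpha.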

### [5.2] Select the model with the most suitable hyperparameter

In [46]:
model = MultinomialNB(alpha=1)
model.fit(x_train_count,y_train)

MultinomialNB(alpha=1)

In [47]:
model.score(x_train_count,y_train)

0.9959086063330164

In [48]:
y_pred = model.predict(x_test_count) 
y_pred

array([0, 1, 1, ..., 1, 0, 0], dtype=int64)

## [6] Performance Metrics

### [6.1] Confusion Matrix

In [49]:
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[6639  765]
 [ 984 6487]]


### [6.2] Accuracy

In [50]:
accuracy_score(y_test, y_pred)

0.8824201680672269
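
The accuracy can be reproduced by hand from the confusion matrix in [6.1], which also gives precision and recall for the positive class. The sketch assumes scikit-learn's layout (rows are true labels, columns are predicted labels, in the order 0 then 1).

```python
# confusion matrix reported in [6.1]: rows = true label, columns = predicted label
tn, fp = 6639, 765    # true negatives, false positives
fn, tp = 984, 6487    # false negatives, true positives

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the reviews predicted positive, the share that are positive
recall = tp / (tp + fn)      # of the positive reviews, the share that were found

print(f"total test reviews = {total}")
print(f"accuracy  = {accuracy:.4f}")   # 0.8824
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
```

The total of 14875 matches the size of the test set, and the accuracy agrees with the value returned by `accuracy_score`.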

## [7] Summary

- The dataset is balanced.
- It is a binary classification problem: the output is either positive or negative, so the label can be encoded as a numeric value (1 or 0).
- The dataset contains duplicate rows; we detect and drop them, keeping only the first occurrence.
- The word 'not' should not be removed with the other stopwords, because reviewers are likely to use combinations such as "not good" and "not like".
- The Bag of Words technique is used to convert text to vectors.
- The dataset is split into training and test sets in a 70:30 ratio.
- Naive Bayes is used for practice purposes.
- Multinomial Naive Bayes in particular is chosen because it is suitable for classification with discrete features (e.g., word counts for text classification).
- 10-fold cross-validation is used to select the best hyperparameter, i.e. alpha = 1.
- The Multinomial Naive Bayes model is trained on the training data and scores 0.9959 on that same data.
- Two performance metrics are used:
  1. Confusion matrix
  2. Accuracy, since the dataset is balanced
- The model has an accuracy of 0.8824 on the test data, which is average performance and noticeably below the training score.