# IMDB Dataset of 50K Movie Reviews

This notebook classifies the sentiment of 50K IMDB movie reviews using Multinomial Naive Bayes.

This is a dataset for binary sentiment classification.

The original dataset provides 25,000 highly polar movie reviews for training and 25,000 for testing.

For more information about the dataset, see:
http://ai.stanford.edu/~amaas/data/sentiment/

Data Source: https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

# Data Preprocessing

## 1. Import Required Libraries

In [34]:
import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from sklearn.naive_bayes import MultinomialNB

# Tutorial about Python regular expressions: https://pymotw.com/2/re/
import re
import string
# Library for stopwords
from nltk.corpus import stopwords
# Library for Stemmer
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
# Library for Lemmatizer
from nltk.stem.wordnet import WordNetLemmatizer

# from tqdm import tqdm
from tqdm.notebook import tqdm
import os

In [2]:
# importing the dataset
df = pd.read_csv('IMDB dataset.csv')

In [3]:
# check the structure of the dataset and whether it contains any null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [4]:
# print the first 5 rows to see what the data looks like
df.head()

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


In [5]:
print(df.sentiment.value_counts())      # print(dataset['sentiment'].value_counts())

positive    25000
negative    25000
Name: sentiment, dtype: int64


### Observation 1.1
1. The dataset has 2 columns: review and sentiment.
2. The review column holds the text written by the user; the sentiment column records whether they liked the movie.
3. This is a binary classification problem: sentiment is positive if the user liked the movie and negative if they disliked it.
4. The dataset has no null values and is balanced (25,000 reviews per class).

In [6]:
print(len(set(df.review)))    #remove

49582


In [7]:
print(df.groupby('review')['sentiment'].nunique())    

review
A Turkish Bath sequence in a film noir located in New York in the 50's, that must be a hint at something ! Something that curiously, in all the previous comments, no one has pointed out , but seems to me essential to the understanding of this movie <br /><br />the Turkish Baths sequence: a back street at night, the entrance of a sleazy sauna, and Scalise wrapped in a sheet, getting his thighs massaged. Steve, the masseur is of the young rough boxer ( Beefcake!) type , and another guy, a bodyguard? finishes dressing up. Dixon obviously hates what he sees there and gets rough right away. We know he has a reputation for roughing up suspects. Good cop but getting out of control easy. Why is it that he hates them so much ? <br /><br />Could it be that he hates himself. This part of himself he inherited from his father ? That dark side that could lead him right at the end of the sidewalk, into the gutter ? What if that dark side lurked within a "closet" ? Remember : whenever Dixon

In [8]:
df.duplicated()    #remove

0        False
1        False
2        False
3        False
4        False
         ...  
49995    False
49996    False
49997    False
49998    False
49999    False
Length: 50000, dtype: bool

In [9]:
print(df.review.value_counts())    #remove

Loved today's show!!! It was a variety and not solely cooking (which would have been great too). Very stimulating and captivating, always keeping the viewer peeking around the corner to see what was coming up next. She is as down to earth and as personable as you get, like one of us which made the show all the more enjoyable. Special guests, who are friends as well made for a nice surprise too. Loved the 'first' theme and that the audience was invited to play along too. I must admit I was shocked to see her come in under her time limits on a few things, but she did it and by golly I'll be writing those recipes down. Saving time in the kitchen means more time with family. Those who haven't tuned in yet, find out what channel and the time, I assure you that you won't be disappointed.                                                                                                                                                                                                                

In [10]:
# drop exact duplicate rows (same review text and same sentiment), keeping the first occurrence
df.drop_duplicates(subset=['review','sentiment'], keep='first', inplace=True, ignore_index=False)

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49582 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     49582 non-null  object
 1   sentiment  49582 non-null  object
dtypes: object(2)
memory usage: 1.1+ MB
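
The duplicate handling above can be illustrated without pandas. The following is a minimal plain-Python sketch, not the notebook's implementation: the helper names `drop_exact_duplicates` and `conflicting_reviews` and the toy rows are assumptions made up for illustration. It separates two cases that are easy to confuse: rows that repeat both text and label (safe to drop) and rows that repeat the text with a different label (which the `groupby('review')['sentiment'].nunique()` check is meant to surface).

```python
from collections import defaultdict

def drop_exact_duplicates(rows):
    """Keep the first occurrence of each (review, sentiment) pair, preserving order."""
    seen = set()
    kept = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            kept.append(row)
    return kept

def conflicting_reviews(rows):
    """Return review texts that appear with more than one distinct label."""
    labels_by_review = defaultdict(set)
    for review, sentiment in rows:
        labels_by_review[review].add(sentiment)
    return sorted(text for text, labels in labels_by_review.items() if len(labels) > 1)

# Toy rows, made up for illustration
rows = [
    ("great film", "positive"),
    ("great film", "positive"),   # exact duplicate: dropped
    ("so-so plot", "negative"),
    ("so-so plot", "positive"),   # same text, conflicting label: kept and flagged
]
deduped = drop_exact_duplicates(rows)
conflicts = conflicting_reviews(deduped)
print(len(deduped), conflicts)
```

In this notebook the row count after `drop_duplicates` (49582) equals the number of unique review texts, which suggests no review text survives with two different labels.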


In [12]:
# add a numeric 'response' column: positive -> 1, negative -> 0
df['response'] = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)
df.head()

                                              review sentiment  response
0  One of the other reviewers has mentioned that ...  positive         1
1  A wonderful little production. <br /><br />The...  positive         1
2  I thought this was a wonderful way to spend ti...  positive         1
3  Basically there's a family where a little boy ...  negative         0
4  Petter Mattei's "Love in the Time of Money" is...  positive         1


In [13]:
df.columns = ['review','sentiment']    #remove
df.head()

ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements

In [14]:
# drop the original sentiment column, then rename the response column to sentiment
del df['sentiment']
df=df.rename(columns = {'response':'sentiment'})
df.head()

                                              review  sentiment
0  One of the other reviewers has mentioned that ...          1
1  A wonderful little production. <br /><br />The...          1
2  I thought this was a wonderful way to spend ti...          1
3  Basically there's a family where a little boy ...          0
4  Petter Mattei's "Love in the Time of Money" is...          1


### Categorical value converted to Numerical value
1. Sentiment column value changed from categorical to numerical
2. Positive value converted to 1 and negative value converted to 0
3. Initially storing the value 

## 2. Text pre-processing

In [15]:
# function to clear all html tags
def clearHtml(sentence):
    cleanr = re.compile('<.*?>')
    cleanText = re.sub(cleanr,' ',sentence)
    return cleanText       # output is in string
# Function to clear all extra symbols except single quote
def clearPunc(sentence):
#     cleaned = re.sub(r'[?|!\|"|#|.|,|)|(\||\\|/|:]',r' ',sentence)
    cleaned = re.sub('[^A-Za-z0-9\']+', ' ', sentence)
    return cleaned        # output is in string

In [16]:
# function to convert common contractions to Normal words
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [17]:
# function to remove all other single quotes after treating common contraction
def clearRestSingleQuotes(sentence):
    cleaned = re.sub('[^A-Za-z0-9]+', ' ', sentence)
    return cleaned  

In [18]:
# remove
count_tag=0
for sent in range(len(df['review'].values)):
    list_of_tag = re.findall('<.*?>',df['review'].values[sent])
    count_tag+=len(list_of_tag)
print(count_tag)

200456


In [19]:
# partially remove
stop_words =set(stopwords.words('english'))    # set of stopwords
sno = nltk.stem.SnowballStemmer('english')
ps = PorterStemmer()
print(stop_words)
print('-'*125)
print(sno.stem('quantative'))
print('-'*125)
print(ps.stem('quantative'))

{'above', 'which', 'their', 'not', 'our', 'or', 'all', 'doing', 'as', 'hadn', 'had', 'ours', 'him', 'who', 'because', 'out', 'hasn', 'has', 'why', 'then', 'ourselves', 'any', 'very', "aren't", 'if', 'isn', 'shouldn', 'now', 'only', 'over', 's', "don't", "she's", 'before', 'can', 'such', 'this', 'during', 'weren', "weren't", 'down', 'of', 'aren', "you'd", 'they', 'didn', "that'll", 'we', 'll', 'at', 'haven', 'there', 'into', 'won', 'she', "mightn't", 've', 'her', 'after', "mustn't", 'between', 'mustn', "hadn't", 'below', "wouldn't", 'in', 'but', 'when', 'other', 'off', 'for', 'don', 'with', 'once', 'i', 'here', 'few', 'yourselves', 'an', 'from', 'wouldn', 'just', 'be', "should've", 'me', 'have', 'was', 'will', 'theirs', "didn't", 'itself', "won't", 'he', 'having', 'to', "needn't", 'under', "wasn't", 'hers', 'most', "isn't", 'been', 'being', 't', 'yourself', 'while', 'same', 'up', 'more', "you've", 'ain', 'the', 'my', 'until', "you'll", "doesn't", 'yours', 'how', 'do', 'couldn', "it's", 

In [20]:
#remove
sample_text = df['review'].values[1]
sample_text      # a single review string

In [21]:
#remove
sample_text= clearHtml(sample_text)
sample_text= clearPunc(sample_text)
sample_text

In [22]:
# remove
sample_text2 = df['review'].values[2]
sample_text2     # a single review string

In [23]:
# remove
sample_text = decontracted(sample_text)
sample_text

In [24]:
#remove
sample_text = clearRestSingleQuotes(sample_text)
sample_text

In [25]:
#sample_text is ready now 
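
The cleaning steps defined above run in a fixed order: strip HTML, strip punctuation except apostrophes, expand contractions, then drop the remaining apostrophes. The sketch below chains the same regular expressions into one hypothetical helper, `clean_review` (a name introduced here for illustration), and applies it to a made-up sentence. The final `strip()` is an addition for readability and is not part of the notebook's helpers.

```python
import re

def clean_review(text):
    """Apply the cleaning regexes in the same order as the notebook's helpers."""
    text = re.sub(r"<.*?>", " ", text)             # strip HTML tags such as <br />
    text = re.sub(r"[^A-Za-z0-9']+", " ", text)    # keep letters, digits and apostrophes
    text = re.sub(r"won't", "will not", text)      # specific contractions first
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)            # then the general patterns
    text = re.sub(r"'re", " are", text)
    text = re.sub(r"'s", " is", text)
    text = re.sub(r"'d", " would", text)
    text = re.sub(r"'ll", " will", text)
    text = re.sub(r"'t", " not", text)
    text = re.sub(r"'ve", " have", text)
    text = re.sub(r"'m", " am", text)
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)     # drop any remaining apostrophes
    return text.strip()

sample = "I can't say it's bad.<br /><br />We've seen worse, and I won't complain!"
print(clean_review(sample))
# I can not say it is bad We have seen worse and I will not complain
```

The order matters: apostrophes have to survive the first punctuation pass, otherwise `can't` would already be split into `can t` before the contraction rules see it.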

## 2.1 Splitting the dataset into the Training set and Test set

In [26]:
# split the dataset into train and test set
x_train, x_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.3, random_state=42)

In [27]:
y_train.head()   #remove

8980     0
35912    0
47554    0
10799    1
38610    0
Name: sentiment, dtype: int64

In [28]:
print('no of row of training dataset review:',len(x_train))
print('no of row of training dataset sentiment:',len(y_train))

no of row of training dataset review: 34707
no of row of training dataset sentiment: 34707


In [29]:
print('no of row of test dataset review:',len(x_test))
print('no of row of test dataset sentiment:',len(y_test))

no of row of test dataset review: 14875
no of row of test dataset sentiment: 14875


In [30]:
x_train.head(20)     #remove

8980     The Blob starts with one of the most bizarre t...
35912    I don't know what it was about this film that ...
47554    I'd heard about this movie a while ago from a ...
10799    I loved this movie from beginning to end.I am ...
38610    Not even 'lesser' Hitch, but simply a bad movi...
19307    What an insult to the SA film industry! I have...
14227    The Treasure Island DVD should be required vie...
8251     The 1998 version of "Psycho" needed to be set ...
48666    When "The Net" was first being advertised, the...
44247    I've read every book to date in the left behin...
48081    This movie is excellent!Angel is beautiful and...
10798    There should be a rule that states quite clear...
15332    This scary and rather gory adaptation of Steph...
16630    And I'm serious! Truly one of the most fantast...
24900    Not that I dislike childrens movies, but this ...
44542    I happened upon a rare copy of this early Almo...
49903    This is one of the most hateful and cruel movi.

In [31]:
df['review'][1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

In [32]:
from tqdm.notebook import tqdm
lemmatizer_word = WordNetLemmatizer()
x_train_pre_processed = []
all_positive_word_train = []
all_negative_words_train = []
# clean every training review and collect its words by sentiment label
for sent, label in tqdm(zip(x_train.values, y_train.values), total=len(x_train)):
    filtered_sentence = []
    sent = clearHtml(sent)
    sent = clearPunc(sent)
    sent = decontracted(sent)
    sent = clearRestSingleQuotes(sent)
    for word in sent.split():
        # keep alphabetic words longer than 2 characters that are not stop words
        if word.isalpha() and len(word) > 2 and word.lower() not in stop_words:
            # s = (ps.stem(word.lower())).encode('utf8')      # PorterStemmer
            # s = (sno.stem(word.lower())).encode('utf8')     # Snowball Stemmer
            s = lemmatizer_word.lemmatize(word.lower()).encode('utf8')      # Lemmatizer
            filtered_sentence.append(s)
            if label:
                all_positive_word_train.append(s)
            else:
                all_negative_words_train.append(s)
    # join once per review, after all of its words have been processed
    x_train_pre_processed.append(b" ".join(filtered_sentence))

HBox(children=(FloatProgress(value=0.0, max=34707.0), HTML(value='')))




In [37]:
print(x_train_pre_processed[1])
print('-'*90) 
print(all_positive_word_train[0:10])
print('-'*90)
freq_dist_positive =nltk.FreqDist(all_positive_word_train)
print('Most positive words',freq_dist_positive.most_common(20))
print('-'*90)
print(all_negative_words_train[0:10])
print('-'*90)
freq_dist_negative =nltk.FreqDist(all_negative_words_train)
print('Most negative words',freq_dist_negative.most_common(20))

b'know film made react viscerally perhaps character unlikable compelling enough care perhaps disorganized storyline perhaps fact rob lowe wore long dangly earring eyeliner perhaps point movie break song perhaps never perhaps everything garish hyperbole perhaps character pump fist driving away camera fade know made hate mean trying watch willing find'
------------------------------------------------------------------------------------------
[b'loved', b'movie', b'beginning', b'end', b'musician', b'let', b'drug', b'get', b'way', b'thing']
------------------------------------------------------------------------------------------
Most positive words [(b'film', 34495), (b'movie', 31163), (b'one', 19737), (b'like', 12607), (b'time', 11215), (b'good', 10515), (b'story', 9798), (b'character', 9720), (b'would', 9093), (b'great', 9083), (b'see', 8876), (b'well', 8847), (b'get', 7777), (b'make', 7718), (b'also', 7581), (b'really', 7387), (b'scene', 7084), (b'life', 6948), (b'show', 6711), (b'even

### Observation
- Some words, such as 'good', appear among the most frequent words of both positive and negative reviews even though they are positive in everyday use. A likely explanation is that they were preceded by 'not' (as in 'not good'), and 'not' was removed along with the other stop words.
- To test this, 'not' is removed from the stop-word set and the preprocessing is re-run.

In [38]:
#removing stop words like "not" should be avoided before building n-grams

print(len(stop_words))
stop_words.remove('not')       #removing not words from stop words
print(len(stop_words))

179
178
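
To make the effect of this change concrete, here is a small plain-Python sketch. The stop-word set below is a tiny hand-written stand-in (an assumption; the notebook uses NLTK's English list), and `filter_tokens` is a hypothetical helper that mirrors only the stop-word and length filters of the preprocessing loop.

```python
def filter_tokens(text, stop_words):
    """Lower-case, split on whitespace, and drop stop words and tokens of length <= 2."""
    return [w for w in text.lower().split() if w not in stop_words and len(w) > 2]

# Tiny hand-written stop-word set, for illustration only
stop_words = {"this", "is", "a", "not", "the", "was"}

review = "this movie was not good"
with_not_removed = filter_tokens(review, stop_words)
kept_not = filter_tokens(review, stop_words - {"not"})
print(with_not_removed)  # ['movie', 'good']
print(kept_not)          # ['movie', 'not', 'good']
```

With 'not' filtered out, a negative review contributes the token 'good' on its own; keeping 'not' lets a later bigram feature such as 'not good' carry the negation.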


In [39]:
from tqdm.notebook import tqdm
lemmatizer_word = WordNetLemmatizer()
x_train_pre_processed = []
all_positive_word_train = []
all_negative_words_train = []
# re-run the training-set cleaning now that 'not' is no longer a stop word
for sent, label in tqdm(zip(x_train.values, y_train.values), total=len(x_train)):
    filtered_sentence = []
    sent = clearHtml(sent)
    sent = clearPunc(sent)
    sent = decontracted(sent)
    sent = clearRestSingleQuotes(sent)
    for word in sent.split():
        # keep alphabetic words longer than 2 characters that are not stop words
        if word.isalpha() and len(word) > 2 and word.lower() not in stop_words:
            # s = (ps.stem(word.lower())).encode('utf8')      # PorterStemmer
            # s = (sno.stem(word.lower())).encode('utf8')     # Snowball Stemmer
            s = lemmatizer_word.lemmatize(word.lower()).encode('utf8')      # Lemmatizer
            filtered_sentence.append(s)
            if label:
                all_positive_word_train.append(s)
            else:
                all_negative_words_train.append(s)
    # join once per review, after all of its words have been processed
    x_train_pre_processed.append(b" ".join(filtered_sentence))

HBox(children=(FloatProgress(value=0.0, max=34707.0), HTML(value='')))




In [40]:
print(x_train_pre_processed[1])
print('-'*90) 
print(all_positive_word_train[0:10])
print('-'*90)
freq_dist_positive =nltk.FreqDist(all_positive_word_train)
print('Most positive words',freq_dist_positive.most_common(20))
print('-'*90)
print(all_negative_words_train[0:10])
print('-'*90)
freq_dist_negative =nltk.FreqDist(all_negative_words_train)
print('Most negative words',freq_dist_negative.most_common(20))

b'not know film made react viscerally perhaps character unlikable not compelling enough care perhaps disorganized storyline perhaps fact rob lowe wore long dangly earring eyeliner perhaps point movie break song perhaps never perhaps everything garish hyperbole perhaps character pump fist driving away camera fade not know made hate mean trying watch not willing find'
------------------------------------------------------------------------------------------
[b'loved', b'movie', b'beginning', b'end', b'musician', b'let', b'drug', b'get', b'way', b'thing']
------------------------------------------------------------------------------------------
Most positive words [(b'not', 38169), (b'film', 34495), (b'movie', 31163), (b'one', 19737), (b'like', 12607), (b'time', 11215), (b'good', 10515), (b'story', 9798), (b'character', 9720), (b'would', 9093), (b'great', 9083), (b'see', 8876), (b'well', 8847), (b'get', 7777), (b'make', 7718), (b'also', 7581), (b'really', 7387), (b'scene', 7084), (b'life'

### Observation
- The earlier observation holds: 'not' appears 50031 times among the negative words, so it probably occurs in phrases such as 'not good' or 'not like'.

In [44]:
from tqdm.notebook import tqdm
lemmatizer_word = WordNetLemmatizer()
x_test_pre_processed = []
all_positive_words_test = []
all_negative_words_test = []
# apply the same cleaning to every test review
for sent, label in tqdm(zip(x_test.values, y_test.values), total=len(x_test)):
    filtered_sentence = []
    sent = clearHtml(sent)
    sent = clearPunc(sent)
    sent = decontracted(sent)
    sent = clearRestSingleQuotes(sent)
    for word in sent.split():
        # keep alphabetic words longer than 2 characters that are not stop words
        if word.isalpha() and len(word) > 2 and word.lower() not in stop_words:
            # s = (ps.stem(word.lower())).encode('utf8')      # PorterStemmer
            # s = (sno.stem(word.lower())).encode('utf8')     # Snowball Stemmer
            s = lemmatizer_word.lemmatize(word.lower()).encode('utf8')      # Lemmatizer
            filtered_sentence.append(s)
            if label:
                all_positive_words_test.append(s)
            else:
                all_negative_words_test.append(s)
    # join once per review, after all of its words have been processed
    x_test_pre_processed.append(b" ".join(filtered_sentence))

HBox(children=(FloatProgress(value=0.0, max=14875.0), HTML(value='')))




In [45]:
print(x_test_pre_processed[1])
print('-'*90) 
print(all_positive_words_test[0:10])
print('-'*90)
freq_dist_positive =nltk.FreqDist(all_positive_words_test)
print('Most positive words',freq_dist_positive.most_common(20))
print('-'*90)
print(all_negative_words_test[0:10])
print('-'*90)
freq_dist_negative =nltk.FreqDist(all_negative_words_test)
print('Most negative words',freq_dist_negative.most_common(20))

b'guest future tell fascinating story time travel friendship battle good evil small budget child actor special effect something spielberg lucas learn sixth grader kolya nick gerasimov find time machine basement decrepit building travel year future discovers near perfect utopian society robot play guitar write poetry everyone kind people enjoy everything technology offer alice daughter prominent scientist invented device called mielophone allows read mind human animal device put good bad use depending whose hand fall two evil space pirate saturn want rule universe attempt steal mielophone fall hand century school boy nick pirate hot track travel back time followed pirate alice chaos confusion funny situation follow luckless pirate try blend earthling alice enrolls school nick go demonstrates superhuman ability class catch alice not know nick look like pirate also pirate able change appearance turn literally anyone hmm wonder james cameron got idea terminator get nick mielophone first ex

In [None]:
len(x_train_pre_processed) #remove

In [None]:
#remove
# save the cleaned training reviews to CSV (decoded from bytes to text)
clean_df = pd.DataFrame({'review': [s.decode('utf8') for s in x_train_pre_processed]})
clean_df.to_csv("clean_data.csv", index=False)

In [213]:
#remove
clean_df = pd.read_csv('clean_data.csv')
clean_df.head()


## Featurization

In [46]:
# create a bag-of-words (BoW) representation with unigrams and bigrams
count_vect = CountVectorizer(ngram_range=(1,2))
x_train_count = count_vect.fit_transform(x_train_pre_processed)
x_test_count = count_vect.transform(x_test_pre_processed)
print(x_train_count.get_shape())
print(x_test_count.get_shape())

(34707, 2218175)
(14875, 2218175)
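
To show what a unigram-plus-bigram bag of words contains, here is a plain-Python sketch using `collections.Counter`. It is an illustration in the spirit of `ngram_range=(1,2)`, not scikit-learn's implementation: `unigram_bigram_counts` is a hypothetical helper, and it splits on whitespace, whereas `CountVectorizer`'s default tokenizer also drops single-character tokens.

```python
from collections import Counter

def unigram_bigram_counts(text):
    """Count single tokens and adjacent token pairs in one document."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)

counts = unigram_bigram_counts("not good not funny")
print(sorted(counts.items()))
```

Each distinct unigram or bigram becomes one feature column, which is why the vocabulary above grows to more than two million features: most bigrams occur in only a handful of reviews.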


In [48]:
#remove
count_vect = CountVectorizer(ngram_range=(1,2))
x_train_count = count_vect.fit_transform(x_train_pre_processed)
x_train_count.get_shape()

(34707, 2218175)

In [78]:
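# remove
# Caution: fitting a second vectorizer on the test set builds a different vocabulary,
# so the feature count no longer matches the training matrix. The result is stored
# under a separate name so it does not overwrite x_test_count.
count_vect_test = CountVectorizer(ngram_range=(1,2))
x_test_count_refit = count_vect_test.fit_transform(x_test_pre_processed)
x_test_count_refit.get_shape()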

(14875, 1116928)

In [47]:
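# list all features (unigrams and bigrams)
# note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2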
count_vect.get_feature_names()

['aaa',
 'aaa ball',
 'aaa even',
 'aaa favorite',
 'aaa jawani',
 'aaa level',
 'aaa not',
 'aaa yeah',
 'aaaaaaaaaaaahhhhhhhhhhhhhh',
 'aaaaaaaaaaaahhhhhhhhhhhhhh hurting',
 'aaaaaaaargh',
 'aaaaaaah',
 'aaaaaaah saw',
 'aaaaaaahhhhhhggg',
 'aaaaagh',
 'aaaaagh scene',
 'aaaaah',
 'aaaaah movie',
 'aaaaah never',
 'aaaaahhhh',
 'aaaaahhhh get',
 'aaaaatch',
 'aaaaatch kah',
 'aaaaaw',
 'aaaaaw cry',
 'aaaahhhhhh',
 'aaaahhhhhh terrible',
 'aaaahhhhhhh',
 'aaaahhhhhhh run',
 'aaaarrgh',
 'aaaarrgh former',
 'aaaawwwwww',
 'aaaawwwwww well',
 'aaaggghhhhhhh',
 'aaaggghhhhhhh not',
 'aaah',
 'aaah friggin',
 'aaah leg',
 'aaahhhhhhh',
 'aaahhhhhhh scene',
 'aaall',
 'aaall way',
 'aaam',
 'aaam going',
 'aaargh',
 'aaargh bad',
 'aaargh dead',
 'aaargh not',
 'aab',
 'aab tak',
 'aachen',
 'aachen palm',
 'aachen two',
 'aada',
 'aada adhura',
 'aag',
 'aag actually',
 'aag break',
 'aag director',
 'aag fail',
 'aag figure',
 'aag fire',
 'aag hit',
 'aag jugnu',
 'aag make',
 'aag nev

In [347]:
feature_array = np.array(count_vect.get_feature_names())
feature_array[5000:5660]

array(['absurd quite', 'absurd rather', 'absurd rating',
       'absurd reaction', 'absurd reference', 'absurd regard',
       'absurd renny', 'absurd result', 'absurd revelation',
       'absurd ride', 'absurd ridiculous', 'absurd road',
       'absurd romance', 'absurd rule', 'absurd said', 'absurd sake',
       'absurd scenario', 'absurd scheider', 'absurd seemed',
       'absurd self', 'absurd serious', 'absurd setting', 'absurd sight',
       'absurd silliness', 'absurd sinister', 'absurd situation',
       'absurd sivaji', 'absurd sometimes', 'absurd sound',
       'absurd soundtrack', 'absurd space', 'absurd stagebound',
       'absurd stealing', 'absurd story', 'absurd stupid', 'absurd style',
       'absurd sucker', 'absurd suddenly', 'absurd superman',
       'absurd surreal', 'absurd take', 'absurd ted', 'absurd tedious',
       'absurd terrible', 'absurd theory', 'absurd thing', 'absurd think',
       'absurd though', 'absurd thre', 'absurd tired', 'absurd trash',
       'a

In [349]:
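# NOTE: np.argsort on the feature names sorts them alphabetically, so this shows the
# last 10 names in reverse alphabetical order, not the 10 most frequent words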
responses = feature_array
tfidf_sorting = np.argsort(responses).flatten()[::-1]

n = 10
top_n = feature_array[tfidf_sorting][:n]
top_n

array(['zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz',
       'zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz excuse',
       'zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz',
       'zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz ooops',
       'zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz', 'zzzzzzzzzzzzzzzzzz imdb',
       'zzzzzzzzzzzzzzzzzz', 'zzzzzzzzzzzzz way', 'zzzzzzzzzzzzz',
       'zzzzzzzzzzzz pop'], dtype='<U145')
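
Ranking terms by how often they occur needs the counts themselves rather than the feature names. A minimal plain-Python sketch follows; `top_terms` and the toy documents are assumptions for illustration. With a `CountVectorizer` matrix, the analogous step would be to sum each column and sort the totals.

```python
from collections import Counter

def top_terms(documents, n):
    """Return the n most frequent whitespace-separated tokens across all documents."""
    totals = Counter()
    for doc in documents:
        totals.update(doc.lower().split())
    return totals.most_common(n)

docs = ["good film good cast", "bad film", "good story"]
print(top_terms(docs, 2))  # [('good', 3), ('film', 2)]
```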

In [350]:
# 10-Fold 

In [49]:
# 10-fold cross-validation splitter (no shuffling)
kf = KFold(n_splits=10) #, shuffle=True, random_state=25)
kf

KFold(n_splits=10, random_state=None, shuffle=False)

In [50]:
# remove
print('{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))
for iteration,data in enumerate(kf.split(range(10)), start=1):
#     print(data)
#     print(iteration)
#     print(iteration,data)
    print('{!s:^9} {!s:^58} {!s:^30}'.format(iteration, data[0],data[1]))

Iteration                   Training set observations                   Testing set observations
    1                        [1 2 3 4 5 6 7 8 9]                                  [0]              
    2                        [0 2 3 4 5 6 7 8 9]                                  [1]              
    3                        [0 1 3 4 5 6 7 8 9]                                  [2]              
    4                        [0 1 2 4 5 6 7 8 9]                                  [3]              
    5                        [0 1 2 3 5 6 7 8 9]                                  [4]              
    6                        [0 1 2 3 4 6 7 8 9]                                  [5]              
    7                        [0 1 2 3 4 5 7 8 9]                                  [6]              
    8                        [0 1 2 3 4 5 6 8 9]                                  [7]              
    9                        [0 1 2 3 4 5 6 7 9]                                  [8]              
   

In [98]:

# remove
def get_score (model,X_train,y_train,X_test,y_test):
    # fit on the training fold and return the mean accuracy on the held-out fold
    model.fit(X_train,y_train)
    return model.score(X_test,y_test)

In [58]:
# remove
# manual 10-fold loop on the vectorized training data; kf.split yields positional
# (train, validation) index arrays, so the label Series is indexed with .iloc
for train_idx, val_idx in kf.split(x_train_count):
    fold_model = MultinomialNB()
    fold_model.fit(x_train_count[train_idx], y_train.iloc[train_idx])
    print(fold_model.score(x_train_count[val_idx], y_train.iloc[val_idx]))

In [69]:
alphas = [0.0001, 0.1, 1, 10, 20, 200, 2000, 20000, 2000000]
for alpha in alphas:
    # run 10-fold cross-validation once per alpha and report the mean accuracy in percent
    scores = cross_val_score(MultinomialNB(alpha=alpha), x_train_count, y_train, cv=10)
    print('Multinomial model with alpha = {} has mean accuracy (%) {}'.format(alpha, scores.mean() * 100))

Multinomial model with alpha = 0.0001 has mean accuracy (%) 85.14994723675876
Multinomial model with alpha = 0.1 has mean accuracy (%) 87.52123440246355
Multinomial model with alpha = 1 has mean accuracy (%) 87.9361195313661
Multinomial model with alpha = 10 has mean accuracy (%) 87.49239935339084
Multinomial model with alpha = 20 has mean accuracy (%) 87.1149557843208
Multinomial model with alpha = 200 has mean accuracy (%) 84.9050377894402
Multinomial model with alpha = 2000 has mean accuracy (%) 79.04454197272253
Multinomial model with alpha = 20000 has mean accuracy (%) 68.15912247797102
Multinomial model with alpha = 2000000 has mean accuracy (%) 65.272114
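
Reporting the spread across folds requires the per-fold scores from a single cross-validation run, from which both the mean and the standard deviation are taken. The plain-Python sketch below shows the idea: `kfold_indices` mimics contiguous, unshuffled folds, and `summarize` computes the two statistics. Both helper names and the fold accuracies are made up for illustration; `statistics.pstdev` is the population standard deviation, the same convention as NumPy's default `std`.

```python
import statistics

def kfold_indices(n_samples, n_splits):
    """Yield (train, validation) index lists for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size

def summarize(scores):
    """Return (mean, population standard deviation) of per-fold scores."""
    return statistics.mean(scores), statistics.pstdev(scores)

folds = list(kfold_indices(10, 5))
print(folds[0][1])  # [0, 1]

# Hypothetical per-fold accuracies, made up for illustration
mean, std = summarize([0.80, 0.90, 0.85, 0.85, 0.85])
print(round(mean, 3), round(std, 3))
```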

In [72]:
model = MultinomialNB(alpha=1)
model.fit(x_train_count,y_train)

MultinomialNB(alpha=1)
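
The `alpha` parameter of `MultinomialNB` is an additive (Laplace/Lidstone) smoothing term. The sketch below applies the standard formula $(count + \alpha) / (total + \alpha \cdot V)$ to made-up word counts for a single class; `smoothed_likelihoods` is a hypothetical helper, not scikit-learn code. It gives one plausible reading of the accuracy drop at very large `alpha`: the word probabilities flatten toward uniform and stop discriminating between classes.

```python
def smoothed_likelihoods(word_counts, alpha):
    """Additive smoothing: (count + alpha) / (total + alpha * vocab_size)."""
    total = sum(word_counts.values())
    vocab_size = len(word_counts)
    return {w: (c + alpha) / (total + alpha * vocab_size) for w, c in word_counts.items()}

# Made-up word counts for one class
counts = {"great": 8, "boring": 1, "plot": 1}

for alpha in (0.0001, 1, 2000000):
    probs = smoothed_likelihoods(counts, alpha)
    print(alpha, {w: round(p, 3) for w, p in probs.items()})
```

With `alpha = 1` the estimate for 'great' is 9/13, still well above the others; with `alpha = 2000000` all three words end up close to 1/3.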

In [74]:
model.score(x_train_count,y_train)

0.9959086063330164

In [110]:
model.score(x_test_count,y_test)

0.8824201680672269

In [75]:
y_proba = model.predict_proba(x_test_count)   # class probabilities: columns are [P(negative), P(positive)]
y_proba

array([[1.00000000e+00, 5.69418085e-23],
       [3.23408478e-06, 9.99996766e-01],
       [2.74476561e-07, 9.99999726e-01],
       ...,
       [3.76080015e-29, 1.00000000e+00],
       [9.79189752e-01, 2.08102481e-02],
       [1.00000000e+00, 6.68614728e-18]])

In [76]:
y_pred = model.predict(x_test_count) 
y_pred

array([0, 1, 1, ..., 1, 0, 0], dtype=int64)

In [77]:
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[6639  765]
 [ 984 6487]]
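
The confusion matrix above yields more than accuracy. As a small worked example using only the four reported counts, the sketch below derives accuracy, precision, recall and F1 for the positive class. It assumes scikit-learn's layout: true labels in rows, predicted labels in columns, label 0 (negative) first.

```python
# Confusion matrix reported above: rows are true labels, columns are predictions,
# with label 0 (negative) first and label 1 (positive) second.
cm = [[6639, 765],
      [984, 6487]]

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of reviews predicted positive, the share that are positive
recall = tp / (tp + fn)      # of truly positive reviews, the share that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The model misses more positive reviews (984 false negatives) than it wrongly flags negative ones as positive (765 false positives), so recall for the positive class is lower than precision.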


In [78]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.8824201680672269