# Extracting the Dataset from the Kaggle API

In [1]:
import opendatasets as od

dataset='https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews'
od.download(dataset)

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: ankit117kr
Your Kaggle Key: ········
Downloading imdb-dataset-of-50k-movie-reviews.zip to .\imdb-dataset-of-50k-movie-reviews


100%|██████████| 25.7M/25.7M [00:31<00:00, 844kB/s] 





In [1]:
import pandas as pd
import numpy as np
text = pd.read_csv('imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
text.head()

   review                                             sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


Visualizing the review column

# Exploratory Data Analysis

In [2]:
text.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [3]:
text.describe(include='object')

        review                                             sentiment
count   50000                                              50000
unique  49582                                              2
top     Loved today's show!!! It was a variety and not...  positive
freq    5                                                  25000


Here we can see that only 49,582 of the 50,000 reviews are unique, so the dataset contains a few hundred duplicate rows. We count them and remove them from the dataset.

In [4]:
text.duplicated().sum()

418

In [5]:
text.drop_duplicates(inplace= True)
print(text.duplicated().sum())

0


Checking for missing values in the dataset

In [6]:
text.isnull().sum()

review       0
sentiment    0
dtype: int64

Finding the distribution of the target column

In [7]:
text['sentiment'].value_counts()

positive    24884
negative    24698
Name: sentiment, dtype: int64

# Performing Text Pre-Processing

Exploring the review column of the dataset

In [8]:
text['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

Removing the HTML tags from the review column using a regular expression

In [9]:
import re
def remove_html(txt):
    para=re.compile(r"<.*?>")
    return para.sub('',txt)

text['review']=text['review'].apply(remove_html)
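Substituting an empty string joins the words on either side of a tag, so `me.<br /><br />The` becomes `me.The`. Below is a minimal sketch of one possible alternative, assuming a single space is an acceptable replacement. The name `remove_html_keep_space` is hypothetical and not part of the original notebook.

```python
import re

# Hypothetical variant of remove_html: each tag becomes a space instead of ''
TAG_RE = re.compile(r"<.*?>")
WS_RE = re.compile(r"\s+")

def remove_html_keep_space(txt):
    # Replace each tag with a space so neighbouring words stay separated
    no_tags = TAG_RE.sub(" ", txt)
    # Collapse the runs of whitespace that the substitution leaves behind
    return WS_RE.sub(" ", no_tags).strip()

sample = "what happened with me.<br /><br />The first thing"
print(remove_html_keep_space(sample))
```

Whether the extra whitespace handling is worth it depends on how much the merged tokens matter for the downstream model.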

Converting all text to lower case, because string comparison is case-sensitive and words such as "Movie" and "movie" would otherwise be treated as different tokens

In [10]:
text['review']=text['review'].str.lower()

Replacing emoji with their text names, because the raw emoji characters carry no meaning for the algorithm

In [12]:
import emoji

def replace_emoji(txt):
    # demojize turns each emoji into its text name, e.g. ':thumbs_up:'
    return emoji.demojize(txt)

text['review']=text['review'].apply(replace_emoji)

Removing punctuation from the text

In [13]:
import string
print('All these mentioned sign will be removed ',string.punctuation)

All these mentioned sign will be removed  !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [14]:
def remove_punc(txt):
    # Delete every character listed in string.punctuation
    for char in string.punctuation: 
        txt=txt.replace(char,'')
    return txt

text['review']= text['review'].apply(remove_punc)
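The loop above calls `str.replace` once per punctuation character. A possibly faster equivalent, sketched here with the standard `str.translate` method, deletes all of them in a single pass. The name `remove_punc_fast` is hypothetical.

```python
import string

# Translation table that maps every punctuation character to None (deletion)
PUNC_TABLE = str.maketrans("", "", string.punctuation)

def remove_punc_fast(txt):
    # One pass over the string instead of one str.replace call per character
    return txt.translate(PUNC_TABLE)

print(remove_punc_fast("you'll be hooked. they are right, as this is..."))
```

Because apostrophes are removed too, contractions such as `you'll` become `youll`.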

In [15]:
text['review'][0]

'one of the other reviewers has mentioned that after watching just 1 oz episode youll be hooked they are right as this is exactly what happened with methe first thing that struck me about oz was its brutality and unflinching scenes of violence which set in right from the word go trust me this is not a show for the faint hearted or timid this show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the wordit is called oz as that is the nickname given to the oswald maximum security state penitentary it focuses mainly on emerald city an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda em city is home to manyaryans muslims gangstas latinos christians italians irish and moreso scuffles death stares dodgy dealings and shady agreements are never far awayi would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare forget pretty pictur

After this basic cleaning the text still contains stop words. We remove them because they contribute little to sentiment analysis and only enlarge the vocabulary that has to be vectorized.

In [16]:
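from nltk.corpus import stopwords

# Requires the NLTK stop-word corpus: nltk.download('stopwords')
sw_list = stopwords.words('english')
sw_set = set(sw_list)  # set membership is much faster than scanning a list

# Drop every stop word and re-join the remaining tokens
text['review'] = text['review'].apply(lambda x: " ".join(item for item in x.split() if item not in sw_set))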
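Because punctuation was stripped before this step, contractions such as `you'll` have already become `youll` and no longer match the NLTK entries, which keep their apostrophes. The sketch below illustrates how the order of the two steps changes the result. It uses a tiny hand-written stop-word set as a stand-in for the NLTK list, and the helper names are hypothetical.

```python
import string

# Tiny stand-in for the NLTK English stop-word list (assumption: illustrative subset only)
STOP_WORDS = {"you'll", "be", "they", "are", "as", "this", "is", "wouldn't"}
PUNC_TABLE = str.maketrans("", "", string.punctuation)

def drop_stop_words(txt):
    # Keep only whitespace-separated tokens that are not stop words
    return " ".join(tok for tok in txt.split() if tok not in STOP_WORDS)

def strip_punc(txt):
    # Delete all punctuation characters in one pass
    return txt.translate(PUNC_TABLE)

raw = "you'll be hooked, they are right"

# Order used in the notebook: punctuation first, then stop words
punc_first = drop_stop_words(strip_punc(raw))
# Alternative order: stop words first, then punctuation
stop_first = strip_punc(drop_stop_words(raw))

print(punc_first)
print(stop_first)
```

With punctuation removed first, `youll` survives. With stop words removed first, it does not.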

In [17]:
text['review'][0]

'one reviewers mentioned watching 1 oz episode youll hooked right exactly happened methe first thing struck oz brutality unflinching scenes violence set right word go trust show faint hearted timid show pulls punches regards drugs sex violence hardcore classic use wordit called oz nickname given oswald maximum security state penitentary focuses mainly emerald city experimental section prison cells glass fronts face inwards privacy high agenda em city home manyaryans muslims gangstas latinos christians italians irish moreso scuffles death stares dodgy dealings shady agreements never far awayi would say main appeal show due fact goes shows wouldnt dare forget pretty pictures painted mainstream audiences forget charm forget romanceoz doesnt mess around first episode ever saw struck nasty surreal couldnt say ready watched developed taste oz got accustomed high levels graphic violence violence injustice crooked guards wholl sold nickel inmates wholl kill order get away well mannered middle 

# Feature Engineering

Training a Word2Vec model

In [32]:
import gensim
from nltk import sent_tokenize
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

In [33]:
text_corpus = [] 
for i  in text['review']:
    raw_sent =  sent_tokenize(i)
    for sent in raw_sent:
        text_corpus.append(simple_preprocess(sent))

print(len(text_corpus))

49582


In [34]:
model= Word2Vec(window=10,
                min_count=3,
                vector_size=200,
                epochs=10)

In [35]:
model.build_vocab(text_corpus)
print("total number of documents or text pieces ",model.corpus_count)

total number of documents or text pieces  49582


In [36]:
model.train(text_corpus,total_examples=model.corpus_count, epochs=10)

(54925013, 58676190)

In [37]:
print("Total no of words in the model's vocabulary",len(model.wv.index_to_key))

Total no of words in the model's vocabulary 59327


Converting each review into a 200-dimensional vector

In [38]:
def document_vector(doc):
    doc=[word for word in doc.split() if word in model.wv.index_to_key]
    return np.mean(model.wv[doc], axis=0)
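`model.wv.index_to_key` is a plain list, so every `word in ...` test scans the vocabulary linearly, which probably accounts for much of the runtime of this step. Below is a dependency-free sketch of the same averaging with constant-time lookups. The dictionary `toy_vectors` is a hypothetical stand-in for the trained model; with gensim 4 the corresponding membership test would be `word in model.wv.key_to_index`. Returning a zero vector for a review with no known words is an added assumption, not something the notebook does.

```python
# Toy stand-in for the trained embeddings (assumption: 3 dimensions instead of 200)
toy_vectors = {
    "good": [1.0, 0.0, 2.0],
    "movie": [3.0, 2.0, 0.0],
}
DIM = 3

def document_vector_sketch(doc, vectors=toy_vectors, dim=DIM):
    # Dict membership is a hash lookup, unlike scanning a list of vocabulary words
    known = [vectors[word] for word in doc.split() if word in vectors]
    if not known:
        # Assumption: fall back to a zero vector when no word is in the vocabulary
        return [0.0] * dim
    # Average each dimension across the word vectors (what np.mean(..., axis=0) does)
    return [sum(column) / len(known) for column in zip(*known)]

print(document_vector_sketch("good movie unknownword"))
print(document_vector_sketch("unknownword"))
```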

In [39]:
from tqdm import tqdm
x=[]
for doc in tqdm(text['review'].values):
    x.append(document_vector(doc))

100%|██████████| 49582/49582 [36:21<00:00, 22.73it/s]  


In [40]:
x=np.array(x)
x.shape

(49582, 200)

Creating the dependent variable

In [42]:
y = text['sentiment']

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

y = encoder.fit_transform(y)

y

array([1, 1, 1, ..., 0, 0, 0])
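`LabelEncoder` assigns integer codes in sorted order of the class labels, so here `negative` maps to 0 and `positive` to 1. A dependency-free sketch of that mapping follows. The helper name `encode_labels` is hypothetical.

```python
def encode_labels(labels):
    # Sort the distinct labels, then number them 0..n-1 (mirrors LabelEncoder's ordering)
    classes = sorted(set(labels))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[label] for label in labels], mapping

codes, mapping = encode_labels(["positive", "positive", "negative"])
print(mapping)
print(codes)
```

On the fitted encoder, `encoder.classes_` lists the labels in the same order.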

Splitting the data into training and test sets

In [43]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=1)
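As a quick check on the resulting set sizes: with a float `test_size`, scikit-learn rounds the number of test samples up and assigns the rest to training. The arithmetic can be sketched as follows, with variable names chosen for illustration.

```python
import math

n_samples = 49582      # rows left after dropping duplicates
test_size = 0.2

# With a float test_size, the test count is rounded up
n_test = math.ceil(test_size * n_samples)
# The remaining rows form the training set
n_train = n_samples - n_test

print(n_train, n_test)
```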

# Model Building

In [44]:
from sklearn.metrics import accuracy_score,confusion_matrix
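For reference, here is a dependency-free sketch of what these two metrics compute for binary labels. The function names and toy labels are hypothetical; the notebook itself uses the scikit-learn implementations.

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def binary_confusion(y_true, y_pred):
    # Rows are true classes (0, 1); columns are predicted classes (0, 1)
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

y_true_toy = [1, 0, 1, 1, 0]
y_pred_toy = [1, 0, 0, 1, 1]
print(accuracy(y_true_toy, y_pred_toy))
print(binary_confusion(y_true_toy, y_pred_toy))
```

The confusion matrix separates the two kinds of error, false positives and false negatives, which a single accuracy figure hides.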

Trying out different types of machine learning algorithms with their default settings

In [45]:
from sklearn.naive_bayes import GaussianNB 
gnb = GaussianNB()

gnb.fit(x_train,y_train)
y_pred = gnb.predict(x_test)
accuracy_score(y_test,y_pred)

0.7854189775133609

In [70]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_jobs=-1)

rf.fit(x_train,y_train)
y_pred = rf.predict(x_test)
accuracy_score(y_test,y_pred)

0.8458203085610567

In [51]:
from sklearn.svm import SVC
sv = SVC()

sv.fit(x_train,y_train)
y_pred = sv.predict(x_test)
accuracy_score(y_test,y_pred)

0.8867601089039024

In [52]:
from sklearn.ensemble import AdaBoostClassifier
abc=AdaBoostClassifier()

abc.fit(x_train,y_train)
y_pred = abc.predict(x_test)
accuracy_score(y_test,y_pred)

0.8388625592417062

In [55]:
from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()

bnb.fit(x_train,y_train)
y_pred = bnb.predict(x_test)
accuracy_score(y_test,y_pred)

0.7702934355147726

Performing hyperparameter tuning of the model

In [56]:
from sklearn.model_selection import GridSearchCV

In [72]:
param_grid = { 
            'n_estimators':[90,100,110],
            'criterion': ['gini',"entropy","log_loss"],
            'max_features':["sqrt", "log2", None],
            'bootstrap': [True,False],
            'random_state': [42,None]
            
}
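Before launching a grid search it helps to estimate how many models will be trained. The sketch below counts the candidates in the grid above, assuming the default 5-fold cross-validation that `GridSearchCV` uses when `cv` is not specified.

```python
from itertools import product

param_grid = {
    'n_estimators': [90, 100, 110],
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False],
    'random_state': [42, None],
}

# Every combination of one value per parameter is a candidate
n_candidates = len(list(product(*param_grid.values())))
cv_folds = 5  # assumption: GridSearchCV's default when cv is not given
print(n_candidates, n_candidates * cv_folds)
```

Each of those fits trains a full random forest on 200-dimensional vectors, so trimming the grid (for example by dropping `random_state`) would shorten the search considerably.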

In [None]:
reg = GridSearchCV(rf,param_grid=param_grid,n_jobs=-1)
reg.fit(x_train,y_train)

In [None]:
print(reg.best_score_)
reg.best_params_

GaussianNB

Score = 0.7875204840539519

parameter = {'priors': [0.35, 0.65]}

In [None]:
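# Candidate values for SVC (illustrative choices; each option must be one SVC accepts)
param_grid = { 
            'C':[90,100,110],
            'kernel': ['linear',"rbf","poly"],    # valid SVC kernels
            'gamma':["scale", "auto"],            # SVC accepts 'scale', 'auto' or a float
            'class_weight': [None,"balanced"],    # None, 'balanced' or a dict
            'max_iter': [-1,1000]                 # -1 means no iteration limit
            
}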