INTRODUCTION

This notebook uses the IMDB dataset of 50K movie reviews.

The dataset has two columns, review and sentiment; the sentiment column labels each review as positive or negative. I will train several machine learning algorithms to predict the sentiment of a given movie review.

IMPORTING LIBRARIES

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

from bs4 import BeautifulSoup

PROCESSING DATASET

In [3]:
import os
for direc, _, files in os.walk('/Users/input'):
    for file in files:
        print(os.path.join(direc, file))

In [4]:
Dataset = pd.read_csv(r"\Users\shiva\Downloads\arch\IMDB Dataset.csv")      # Load the full dataset; it is split into train and test sets below
Dataset

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative




SPLITTING TRAINING & TESTING SET FROM THE DATA

In [5]:
from sklearn.model_selection import train_test_split

train,test = train_test_split(Dataset, test_size = 0.33, random_state = 42)

In [6]:
train_x, train_y = train['review'], train['sentiment']
test_x, test_y = test['review'], test['sentiment']

train_y.value_counts()       #sentiment count


negative    16792
positive    16708
Name: sentiment, dtype: int64




EXPLORATORY DATA ANALYSIS

In [7]:
Dataset.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000


ELIMINATING NOISE & HTML TEXTS

In [8]:
def pipe_html(text):
    soup = BeautifulSoup(text, "html.parser")                  #Removing HTML texts
    return soup.get_text()


def denoise_text(text):                                       #Removing noise
    text = pipe_html(text)
    return text

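# Note: train_x and test_x were split off in In [6], before this step, so this line
# cleans Dataset only. Apply denoise_text to train_x and test_x as well (or clean
# before splitting) for the models to see the cleaned text.
Dataset['review'] = Dataset['review'].apply(denoise_text)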
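
One caveat with get_text(): by default it joins the text on either side of a removed tag with no separator, so words separated only by <br /> tags end up glued together. Below is a minimal sketch of a variant that passes a separator. The helper name strip_html and the sample snippet are made up for illustration and are not part of the original notebook.

```python
from bs4 import BeautifulSoup

def strip_html(text, separator=" "):
    """Drop HTML tags, joining the remaining text pieces with `separator`."""
    soup = BeautifulSoup(text, "html.parser")
    # split()/join collapses any runs of whitespace left behind by removed tags
    return " ".join(soup.get_text(separator=separator).split())

sample = "Loved the cast<br /><br />Hated the plot"   # made-up review snippet

no_sep = BeautifulSoup(sample, "html.parser").get_text()   # default: no separator
with_sep = strip_html(sample)

print(repr(no_sep))
print(repr(with_sep))
```

With the default call, "cast" and "Hated" merge into a single token, which the vectorizer would then treat as a new, meaningless word.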



BAG OF WORDS (BoW) MODEL

The bag-of-words model extracts features from raw text so that machine learning algorithms can use them. These algorithms take numerical vectors as input, so each text is converted into a vector based on how often each word appears in it; this process is called vectorization.

I am using the Term Frequency - Inverse Document Frequency (TF-IDF) scheme for vectorization, which scales each word count by how rare the word is across all reviews.
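
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF using the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each row, which is what TfidfVectorizer applies with its default settings. The whitespace tokenizer, the function name tfidf_rows and the toy reviews are simplifying assumptions; the real vectorizer tokenizes with a regular expression and can drop stop words.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalised)."""
    tokenised = [doc.lower().split() for doc in docs]   # simplistic tokenizer (assumption)
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenised:
        tf = Counter(tokens)                             # raw term counts in this document
        raw = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({term: w / norm for term, w in raw.items()})
    return rows

toy_reviews = ["great movie great cast", "boring movie", "great soundtrack"]  # made-up examples
weights = tfidf_rows(toy_reviews)

for term, weight in sorted(weights[0].items()):
    print(f"{term}: {weight:.3f}")
```

In the first toy review, "cast" receives a higher weight than "movie" even though each occurs once, because "cast" appears in fewer documents.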


In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')
train_x_vector = tfidf.fit_transform(train_x)

test_x_vector = tfidf.transform(test_x)

In [10]:
pd.DataFrame.sparse.from_spmatrix(train_x_vector,
                                  index=train_x.index,
                                  columns=tfidf.get_feature_names_out())

Unnamed: 0,00,000,00000000000,00000001,00001,000dm,000s,001,003830,007,...,übermenschlich,überwoman,ünel,üvegtigris,üzümcü,ýs,þorleifsson,þór,יגאל,כרמון
23990,0.0,0.052519,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8729,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3451,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2628,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
38352,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11284,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
44732,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
38158,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
860,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


DETERMINING SUITABLE MACHINE LEARNING MODEL

Here, I will use three supervised learning algorithms: decision tree, logistic regression and support vector machine. Supervised learning fits this task because every review (input) comes with a known sentiment label (output). The model with the highest accuracy will be selected for further analysis.
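
As a rough sketch of how such a comparison could be run on the training data alone, the snippet below scores each candidate with cross-validation, which keeps the test set untouched until the final assessment. The tiny corpus, the candidates dictionary and cv=3 are assumptions made for illustration; the notebook itself compares test-set scores instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up corpus so the sketch runs on its own; swap in train_x / train_y.
texts = [
    "a wonderful and moving film",
    "great acting and a great story",
    "I loved every minute of it",
    "an excellent and funny movie",
    "beautiful photography and a strong cast",
    "one of the best films this year",
    "a dull and boring film",
    "terrible acting and a weak story",
    "I hated every minute of it",
    "an awful and unfunny movie",
    "bad dialogue and a lazy plot",
    "one of the worst films this year",
]
labels = ["positive"] * 6 + ["negative"] * 6

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=42),
    "logistic regression": LogisticRegression(max_iter=500),
    "linear SVM": SVC(kernel="linear"),
}

cv_scores = {}
for name, model in candidates.items():
    # The pipeline refits the vectorizer inside every fold, so no fold
    # sees vocabulary statistics from its own held-out reviews.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    cv_scores[name] = cross_val_score(pipeline, texts, labels, cv=3).mean()

for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

The scores on twelve toy sentences mean nothing in themselves; the point is the structure of the loop.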

1. DECISION TREE

In [11]:
from sklearn.tree import DecisionTreeClassifier

Dtree = DecisionTreeClassifier()
Dtree.fit(train_x_vector, train_y)

2. LOGISTIC REGRESSION

In [12]:
from sklearn.linear_model import LogisticRegression

LogReg = LogisticRegression()
LogReg.fit(train_x_vector,train_y)

In [13]:
Logr = LogisticRegression(penalty='l2',max_iter=500,C=1,random_state=42)

Logr_tfidf = Logr.fit(train_x_vector,train_y)
print(Logr_tfidf)

LogisticRegression(C=1, max_iter=500, random_state=42)


3. SUPPORT VECTOR MACHINE (SVM)

In [14]:
from sklearn.svm import SVC
SVM = SVC(kernel='linear')
SVM.fit(train_x_vector, train_y)


print(SVM.predict(tfidf.transform(['A good movie'])))
print(SVM.predict(tfidf.transform(['An excellent movie'])))
print(SVM.predict(tfidf.transform(['I did not like this movie at all I gave this movie away'])))

['positive']
['positive']
['negative']


ASSESSMENT OF MODEL

1. AVERAGE ACCURACY

In [15]:

print(Dtree.score(test_x_vector, test_y))

print(LogReg.score(test_x_vector, test_y))

print(SVM.score(test_x_vector, test_y))

0.7246666666666667
0.8918181818181818
0.8938787878787878


As per the above results, the support vector machine achieved the highest test accuracy (0.894), narrowly ahead of logistic regression (0.892) and well ahead of the decision tree (0.725). Therefore, I will use the SVM model for further assessment.

2. F1 SCORE

The F1 score is the harmonic mean of precision and recall. Computing it separately for each class (average=None) shows how well the model performs on positive and on negative reviews, rather than only overall.


In [16]:
from sklearn.metrics import f1_score

f1_score(test_y,SVM.predict(test_x_vector),
          labels = ['positive','negative'],average=None)

array([0.89553129, 0.89217316])

3. CLASSIFICATION REPORT

The classification report summarizes precision, recall, F1 score and support for each class of the support vector machine model.


In [17]:
from sklearn.metrics import classification_report

print(classification_report(test_y,
                            SVM.predict(test_x_vector),
                            labels = ['positive','negative']))

              precision    recall  f1-score   support

    positive       0.89      0.91      0.90      8292
    negative       0.90      0.88      0.89      8208

    accuracy                           0.89     16500
   macro avg       0.89      0.89      0.89     16500
weighted avg       0.89      0.89      0.89     16500



4. CONFUSION MATRIX

The confusion matrix compares the actual sentiment labels with those predicted by the SVM model. With labels = ['positive', 'negative'], rows are the actual classes and columns are the predicted classes.

In [19]:
from sklearn.metrics import confusion_matrix

con_matrix = confusion_matrix(test_y,
                           SVM.predict(test_x_vector),
                           labels = ['positive', 'negative'])
con_matrix

array([[7505,  787],
       [ 964, 7244]], dtype=int64)
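
The four counts in this matrix are enough to recompute the accuracy, precision, recall and F1 values reported earlier. The following sketch does that by hand; the counts are copied from the output above, while the variable names and the helper prf are introduced here for illustration only.

```python
# Counts copied from the confusion matrix above
# (rows = actual class, columns = predicted class, order: positive, negative).
tp, fn = 7505, 787    # actual positive: predicted positive / predicted negative
fp, tn = 964, 7244    # actual negative: predicted positive / predicted negative

total = tp + fn + fp + tn
accuracy = (tp + tn) / total

def prf(hits, false_alarms, misses):
    """Precision, recall and F1 for one class."""
    precision = hits / (hits + false_alarms)
    recall = hits / (hits + misses)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

pos = prf(tp, fp, fn)   # treating "positive" as the class of interest
neg = prf(tn, fn, fp)   # treating "negative" as the class of interest

print(f"accuracy = {accuracy:.4f}")
print("positive: precision={:.2f} recall={:.2f} f1={:.4f}".format(*pos))
print("negative: precision={:.2f} recall={:.2f} f1={:.4f}".format(*neg))
```

The results agree with the score, f1_score and classification_report outputs shown in the earlier cells.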

INFERENCE


My overall analysis shows that the support vector machine has the highest accuracy among the supervised learning models tested.
The accuracy of the model could be improved by -

~ Refined data pre-processing

~ Avoiding data loss during vectorization and converting back to scalar form
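
As one possible pre-processing refinement: scikit-learn's built-in English stop-word list contains "not", so stop_words='english' removes negations before the model sees them. The sketch below, on two made-up sentences, contrasts that setting with a vectorizer that keeps stop words and adds word pairs through ngram_range=(1, 2). Whether this improves accuracy on the IMDB data has not been tested here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I did not like this movie", "I really like this movie"]  # made-up examples

# Setting used in the notebook: English stop words are dropped
with_stop_words = TfidfVectorizer(stop_words="english").fit(docs)

# Alternative: keep stop words and add two-word phrases as features
with_bigrams = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)

print("'not' kept with stop_words='english':", "not" in with_stop_words.vocabulary_)
print("'not like' kept with ngram_range=(1, 2):", "not like" in with_bigrams.vocabulary_)
```

Bigrams enlarge the vocabulary considerably, so parameters such as min_df or max_features may be needed to keep the matrix manageable on 50K reviews.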
