

**Objective:** 
Our primary objective is to develop a sentiment analysis model that automatically classifies IMDb movie reviews as positive or negative. By analyzing these reviews, we aim to uncover patterns in how sentiment is expressed and to build a predictive model for sentiment classification. 

**Dataset Overview:** 
The dataset contains exactly two columns: "review", which holds the text of each movie review, and "sentiment", which labels the review as either "positive" or "negative". This mirrors the usual sentiment analysis setup, where the label must be inferred from the text alone. 

**Significance:** 
Sentiment analysis holds practical value in understanding public opinion, guiding decision-making, and gauging audience reactions. Analyzing movie reviews demonstrates the application of sentiment analysis in the entertainment industry, aiding filmmakers and studios in assessing audience perceptions.

In [1]:
import pandas as pd

In [2]:
#import the dataset
df=pd.read_csv("/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv")

**Data Exploration**

In [3]:
#Shape of the dataset
df.shape

(50000, 2)

In [4]:
df.sample(10)

Unnamed: 0,review,sentiment
33671,This 2003 made for TV movie was shown on a wom...,negative
1067,I've never laughed and giggled so much in my l...,positive
16243,It's a shame House Calls isn't better known. I...,positive
24430,this is a dreadful adaption of Charles Kingsle...,negative
722,I only heard about Driving Lessons through the...,positive
34100,"We have all been asking ourselves ""why don't t...",negative
3040,I never thought I would absolutly hate an Arno...,negative
18922,I really can't see why people seem to dislike ...,positive
43683,Spacecamp is one of the movies that kids just ...,positive
33065,"Writer & director Jay Andrews, a.k.a. Jim Wyno...",negative


In [5]:
#Data type of the columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [6]:
#Determining the no. of missing values
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [7]:
df.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000


In [8]:
#Count the no. of positive and negative reviews
df['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

**Text Preprocessing**

In [9]:
#Lowercase the review 
df['review']= df['review'].str.lower()
df

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
...,...,...
49995,i thought this movie did a down right good job...,positive
49996,"bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,i am a catholic taught in parochial elementary...,negative
49998,i'm going to have to disagree with the previou...,negative


In [10]:
#Remove the HTML tags
import re
def remove_html_tags(text):
    # Non-greedy match of anything between '<' and '>', e.g. '<br />'
    pattern = re.compile(r'<.*?>')
    return pattern.sub('', text)


In [11]:
df['review']=df['review'].apply(remove_html_tags)
df

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
...,...,...
49995,i thought this movie did a down right good job...,positive
49996,"bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,i am a catholic taught in parochial elementary...,negative
49998,i'm going to have to disagree with the previou...,negative
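
Note that `remove_html_tags` substitutes an empty string, so a tag that sits directly between two words fuses them together. A minimal sketch of one way to avoid this, assuming a hypothetical helper `remove_html_tags_spaced` that is not part of the notebook: replace each tag with a space and then collapse repeated whitespace.

```python
import re

TAG_PATTERN = re.compile(r'<.*?>')

def remove_html_tags_spaced(text):
    """Replace each HTML tag with a space, then collapse runs of whitespace."""
    without_tags = TAG_PATTERN.sub(' ', text)
    return ' '.join(without_tags.split())

# Hypothetical sample with no space around the line-break tags
sample = "great film.<br /><br />the cast is superb"

print(TAG_PATTERN.sub('', sample))       # tags deleted outright: words fuse
print(remove_html_tags_spaced(sample))   # tags replaced by a single space
```

Whether this matters depends on how often tags appear without surrounding spaces in the reviews, which the notebook does not measure.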


In [12]:
#Remove URLs
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)

In [13]:
df['review']=df['review'].apply(remove_url)
df

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
...,...,...
49995,i thought this movie did a down right good job...,positive
49996,"bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,i am a catholic taught in parochial elementary...,negative
49998,i'm going to have to disagree with the previou...,negative


In [14]:
import string
exclude=string.punctuation
exclude

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [15]:
#Remove Punctuations
def remove_punc1(text, exclude):
    return text.translate(str.maketrans('', '', exclude))

In [16]:
df['review']=df['review'].apply(remove_punc1, exclude=exclude)
df

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production the filming tech...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive
...,...,...
49995,i thought this movie did a down right good job...,positive
49996,bad plot bad dialogue bad acting idiotic direc...,negative
49997,i am a catholic taught in parochial elementary...,negative
49998,im going to have to disagree with the previous...,negative
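
`str.maketrans('', '', exclude)` builds a table that deletes every character in `exclude`, which is why "there's" becomes "theres" in the output above. The same deletion also fuses words joined by a hyphen or slash. A hedged sketch of an alternative, using a hypothetical helper `remove_punc_spaced` that is not in the notebook: delete apostrophes as before, but turn every other punctuation mark into a space.

```python
import string

# Translation table: every punctuation mark becomes a space,
# except the apostrophe, which is deleted (mapped to None).
PUNCT_TABLE = {ord(ch): ' ' for ch in string.punctuation}
PUNCT_TABLE[ord("'")] = None

def remove_punc_spaced(text):
    """Strip punctuation without fusing the words it separated."""
    return ' '.join(text.translate(PUNCT_TABLE).split())

# Hypothetical sample sentence
sample = "a well-made film/series, isn't it?"

# Behaviour of the notebook's approach: all punctuation deleted
deleted = sample.translate(str.maketrans('', '', string.punctuation))
print(deleted)
print(remove_punc_spaced(sample))
```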


In [17]:
#Chat word treatment
chat_words_str = """
AFAIK=As Far As I Know
AFK=Away From Keyboard
ASAP=As Soon As Possible
ATK=At The Keyboard
ATM=At The Moment
A3=Anytime, Anywhere, Anyplace
BAK=Back At Keyboard
BBL=Be Back Later
BBS=Be Back Soon
BFN=Bye For Now
B4N=Bye For Now
BRB=Be Right Back
BRT=Be Right There
BTW=By The Way
B4=Before
CU=See You
CUL8R=See You Later
CYA=See You
FAQ=Frequently Asked Questions
FC=Fingers Crossed
FWIW=For What It's Worth
FYI=For Your Information
GAL=Get A Life
GG=Good Game
GN=Good Night
GMTA=Great Minds Think Alike
GR8=Great!
G9=Genius
IC=I See
ICQ=I Seek you (also a chat program)
ILU=I Love You
IMHO=In My Honest/Humble Opinion
IMO=In My Opinion
IOW=In Other Words
IRL=In Real Life
KISS=Keep It Simple, Stupid
LDR=Long Distance Relationship
LOL=Laughing Out Loud
LTNS=Long Time No See
L8R=Later
MTE=My Thoughts Exactly
M8=Mate
NRN=No Reply Necessary
OIC=Oh I See
PRT=Party
PRW=Parents Are Watching
ROFL=Rolling On The Floor Laughing
ROFLOL=Rolling On The Floor Laughing Out Loud
ROTFLMAO=Rolling On The Floor Laughing My A.. Off
SK8=Skate
STATS=Your sex and age
ASL=Age, Sex, Location
THX=Thank You
TTFN=Ta-Ta For Now!
TTYL=Talk To You Later
U=You
U2=You Too
U4E=Yours For Ever
WB=Welcome Back
WTG=Way To Go!
WUF=Where Are You From?
W8=Wait...
7K=Sick:-D Laugher
"""


# Build a lookup table mapping each lowercased abbreviation to its expansion
chat_words = {}
for line in chat_words_str.strip().split('\n'):
    shorthand, expansion = line.lower().split('=', 1)
    chat_words[shorthand] = expansion

# Replace each chat abbreviation with its expansion; other words pass through unchanged
def expand_chat_words(text, chat_words):
    words = text.split()
    expanded_words = [chat_words.get(word.lower(), word) for word in words]
    return ' '.join(expanded_words)

# Apply the function to the 'review' column
df['review'] = df['review'].apply(expand_chat_words, chat_words=chat_words)
df


Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production the filming tech...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive
...,...,...
49995,i thought this movie did a down right good job...,positive
49996,bad plot bad dialogue bad acting idiotic direc...,negative
49997,i am a catholic taught in parochial elementary...,negative
49998,im going to have to disagree with the previous...,negative
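
Because the lookup is done token by token, any ordinary word that happens to match an abbreviation is expanded too. Entries such as KISS, STATS, and FAQ are plain English words that can appear in a movie review. The sketch below illustrates the effect on a three-entry excerpt of the table and shows one possible guard, a hypothetical `skip` set of tokens to leave untouched. The helper name and the `skip` parameter are assumptions for illustration, not part of the notebook.

```python
# Small excerpt of the abbreviation table, lowercased as in the cell above
chat_words = {
    "imo": "in my opinion",
    "kiss": "keep it simple, stupid",
    "u": "you",
}

def expand_chat_words_safe(text, chat_words, skip=frozenset()):
    """Expand abbreviations token by token, leaving tokens listed in `skip` untouched."""
    return ' '.join(
        word if word.lower() in skip else chat_words.get(word.lower(), word)
        for word in text.split()
    )

# Hypothetical review text
review = "imo the final kiss scene was lovely"

print(expand_chat_words_safe(review, chat_words))
print(expand_chat_words_safe(review, chat_words, skip={"kiss"}))
```

The first call expands "kiss" into an unrelated phrase (and reintroduces a comma after punctuation was already removed); the second leaves it alone.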


In [18]:
#Remove the stop words
# Requires the NLTK stopwords corpus; run nltk.download('stopwords') once if it is not installed
from nltk.corpus import stopwords


In [19]:
# Get the English stopwords as a set for faster membership checking
stop_words = set(stopwords.words('english'))

# Function to remove stop words from a review
def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word.lower() not in stop_words])



In [20]:
df['review']=df['review'].apply(remove_stopwords)
df

Unnamed: 0,review,sentiment
0,one reviewers mentioned watching 1 oz episode ...,positive
1,wonderful little production filming technique ...,positive
2,thought wonderful way spend time hot summer we...,positive
3,basically theres family little boy jake thinks...,negative
4,petter matteis love time money visually stunni...,positive
...,...,...
49995,thought movie right good job wasnt creative or...,positive
49996,bad plot bad dialogue bad acting idiotic direc...,negative
49997,catholic taught parochial elementary schools n...,negative
49998,im going disagree previous comment side maltin...,negative
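
NLTK's English stop word list includes negations such as "no", "nor", and "not", so removing every stop word can erase the difference between "good" and "not good". A minimal sketch of keeping negations, assuming a small hand-picked stop word set in place of NLTK's full list and a hypothetical helper `remove_stopwords_keep_negations`:

```python
# Hand-picked subset for illustration; the notebook uses NLTK's full English list
stop_words = {"the", "was", "and", "i", "it", "did", "not", "no"}
negations = {"not", "no", "nor", "never"}

def remove_stopwords_keep_negations(text, stop_words, negations):
    """Drop stop words but keep any word listed in `negations`."""
    to_drop = stop_words - negations
    return ' '.join(word for word in text.split() if word.lower() not in to_drop)

# Hypothetical review text
review = "the plot was not good and i did not enjoy it"

# Behaviour of the notebook's approach: every stop word removed
plain = ' '.join(word for word in review.split() if word.lower() not in stop_words)
print(plain)
print(remove_stopwords_keep_negations(review, stop_words, negations))
```

Whether keeping negations improves this particular model is not tested here; with unigram TF-IDF features the gain may be small.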


In [21]:
#Tokenization
# word_tokenize needs NLTK's Punkt tokenizer data; fetch it with nltk.download if it is missing
import nltk
from nltk.tokenize import word_tokenize
df['review']=df['review'].apply(word_tokenize)

In [22]:
#Apply Lemmatization to the review column
!pip install spacy
!python -m spacy download en_core_web_sm

import spacy

# Load the spaCy language model
nlp = spacy.load('en_core_web_sm')

# Lemmatize a list of tokens by joining them into one string and running it through spaCy
def lemmatize_tokens(tokens):
    return [token.lemma_ for token in nlp(" ".join(tokens))]

# Apply lemmatization to the tokenized 'review' column
df['review'] = df['review'].apply(lemmatize_tokens)
df

caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl6StatusC1EN10tensorflow5error4CodeESt17basic_string_viewIcSt11char_traitsIcEENS_14SourceLocationE']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZTVN10tensorflow13GcsFileSystemE']
Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 92.3 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')




Unnamed: 0,review,sentiment
0,"[one, reviewer, mention, watch, 1, oz, episode...",positive
1,"[wonderful, little, production, filming, techn...",positive
2,"[think, wonderful, way, spend, time, hot, summ...",positive
3,"[basically, there, s, family, little, boy, jak...",negative
4,"[petter, matteis, love, time, money, visually,...",positive
...,...,...
49995,"[think, movie, right, good, job, be, not, crea...",positive
49996,"[bad, plot, bad, dialogue, bad, act, idiotic, ...",negative
49997,"[catholic, teach, parochial, elementary, schoo...",negative
49998,"[I, m, go, disagree, previous, comment, side, ...",negative


**Text Representation**

In [23]:
#TF-IDF representation
from sklearn.feature_extraction.text import TfidfVectorizer
df['review'] = df['review'].apply(' '.join) #Convert the review from a list to a string format
tfidf=TfidfVectorizer()
x=tfidf.fit_transform(df['review'])
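
To make the weighting concrete, the sketch below recomputes TF-IDF by hand on a three-document toy corpus. It assumes scikit-learn's documented defaults for `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`, raw term counts), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each document vector is scaled to unit length. The corpus and the helper `tfidf_vector` are hypothetical, and tokenization is simplified to whitespace splitting.

```python
import math
from collections import Counter

# Hypothetical toy corpus
docs = ["good movie", "bad movie", "good good plot"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Vocabulary in sorted order, and the number of documents containing each term
vocab = sorted({term for tokens in tokenized for term in tokens})
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def tfidf_vector(tokens):
    """Return the L2-normalised TF-IDF vector of one document over `vocab`."""
    term_counts = Counter(tokens)
    raw = [term_counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

print(vocab)
for doc, tokens in zip(docs, tokenized):
    print(doc, [round(value, 3) for value in tfidf_vector(tokens)])
```

Terms that occur in fewer documents ("bad", "plot") receive a higher idf than terms that occur in more ("good", "movie").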


In [24]:
#Encode the sentiment labels as integers: positive -> 1, negative -> 0
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
y=le.fit_transform(df['sentiment'])
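
`LabelEncoder` assigns integers according to the sorted order of the class names, which is why "negative" maps to 0 and "positive" to 1. A short sketch on a hypothetical four-label list shows how to confirm the mapping through `classes_` and reverse it with `inverse_transform`:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels in arbitrary order
labels = ["positive", "negative", "negative", "positive"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(list(le.classes_))                       # class names in sorted order; index = encoded value
print(encoded.tolist())                        # integer codes for the input labels
print(le.inverse_transform([0, 1]).tolist())   # map codes back to class names
```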

In [25]:
#Split the data into the training and the testing set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train,y_test=train_test_split(x,y,test_size=0.2, random_state=42)
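
In the cells above, the vectorizer is fitted on all 50,000 reviews before the split, so vocabulary and document frequencies from the test reviews leak into the training features. A minimal sketch of an alternative, assuming a hypothetical eight-review toy corpus: split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline, so that TF-IDF statistics are learned from the training portion only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; 1 = positive, 0 = negative
texts = [
    "wonderful film with a moving story",
    "great acting and a clever plot",
    "loved every minute of this movie",
    "brilliant direction and superb cast",
    "terrible film with a dull story",
    "awful acting and a boring plot",
    "hated every minute of this movie",
    "poor direction and wooden cast",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, keeping the class balance in both parts
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# fit() learns the TF-IDF vocabulary and the classifier from the training texts only
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, y_train)
preds = clf.predict(test_texts)

# The learned vocabulary contains only words seen in the training texts
vectorizer = clf.named_steps["tfidfvectorizer"]
train_words = set(" ".join(train_texts).split())
print(set(vectorizer.vocabulary_) <= train_words)
print(len(preds), "predictions for", len(test_texts), "test reviews")
```

On a dataset of this size the effect on the reported scores is probably small, but the pipeline keeps the evaluation free of test-set information.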

**Model Building**

In [26]:
#Build the Logistic Regression Model
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
model.fit(x_train,y_train)

In [27]:
#Make the predictions
y_pred=model.predict(x_test)

**Model Evaluation**

In [28]:

from sklearn.metrics import accuracy_score,confusion_matrix, precision_score, recall_score, f1_score

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)


In [29]:
print("Accuracy:",accuracy*100, "%")
print("Confusion Matrix:")
print(cm)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

Accuracy: 88.86 %
Confusion Matrix:
[[4336  625]
 [ 489 4550]]
Precision: 0.8792270531400966
Recall: 0.9029569358999802
F1-score: 0.8909340121401997


**Model Performance:**
On the 10,000-review test set (20% of the data), the logistic regression model achieved the following results:

Accuracy: 88.86%

Precision: 87.92%

Recall: 90.30%

F1-score: 89.09%

Accuracy is the share of all test reviews classified correctly; precision, recall, and F1-score are computed for the positive class (label 1).

**Interpretation and Significance:**
The confusion matrix breaks the predictions down as follows:

True Positives: 4550

False Positives: 625

True Negatives: 4336

False Negatives: 489


A precision of 87.92% means that 4550 of the 5175 reviews predicted as positive were actually positive, while a recall of 90.30% means that the model identified 4550 of the 5039 actual positive reviews. The model makes somewhat more false-positive errors (625) than false-negative errors (489), so it leans slightly toward predicting positive.
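
The four counts listed above determine every reported metric. As a quick check, the sketch below recomputes accuracy, precision, recall, and F1-score directly from them using the standard definitions; the variable names are illustrative.

```python
# Counts taken from the confusion matrix above
tn, fp, fn, tp = 4336, 625, 489, 4550

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all reviews classified correctly
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-score:  {f1:.4f}")
```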