# Data Acquisition

In [1]:
!kaggle datasets download lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other
imdb-dataset-of-50k-movie-reviews.zip: Skipping, found more recently modified local copy (use --force to force download)


In [2]:
!unzip imdb-dataset-of-50k-movie-reviews.zip

Archive:  imdb-dataset-of-50k-movie-reviews.zip
replace IMDB Dataset.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: IMDB Dataset.csv        


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
temp_df = pd.read_csv('IMDB Dataset.csv')

In [5]:
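# Take an explicit copy of the first 10,000 rows so later edits do not act on a slice of temp_df
df = temp_df.iloc[:10000].copy()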

In [6]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [7]:
df.shape

(10000, 2)

# Text Pre-processing

In [8]:
df.review[1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

In [9]:
df.isnull().sum()

| column | null count |
|---|---|
| review | 0 |
| sentiment | 0 |


In [10]:
df.sentiment.value_counts()

| sentiment | count |
|---|---|
| positive | 5028 |
| negative | 4972 |


In [11]:
df.duplicated().sum()

17

In [12]:
# Reassign rather than mutate in place; inplace=True on a frame derived from a slice triggers SettingWithCopyWarning
df = df.drop_duplicates()


In [13]:
df.shape

(9983, 2)

## Text processing

In [14]:
# 1) Remove HTML tags

import re

# Non-greedy match of anything between '<' and '>', e.g. '<br />'
TAG_PATTERN = re.compile(r'<.*?>')

def remove_tags(text):
    return TAG_PATTERN.sub('', text)


In [15]:
df.review = df.review.apply(remove_tags)



In [16]:
df.review[1]

'A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well done.'
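
The behaviour of the non-greedy pattern `<.*?>` can be checked on a short string. The sketch below is only illustrative: the sample text and the helper name `strip_tags` are made up for the example. It also shows one possible variant, substituting a space and collapsing whitespace, which avoids gluing words together when a tag has no surrounding spaces.

```python
import re

TAG_PATTERN = re.compile(r"<.*?>")  # non-greedy: each match stops at the first '>'

def strip_tags(text, replacement=""):
    """Replace every HTML-like tag in `text` with `replacement`."""
    return TAG_PATTERN.sub(replacement, text)

sample = "Great cast.<br /><br />Weak ending."  # hypothetical review snippet

glued = strip_tags(sample)                           # tags removed outright
spaced = " ".join(strip_tags(sample, " ").split())   # tags become spaces, then collapse

print(glued)   # Great cast.Weak ending.
print(spaced)  # Great cast. Weak ending.
```

In the review shown above the tags are already surrounded by spaces, so both variants would give the same words there.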

In [17]:
# 2) Lowercase conversion

df.review = df.review.str.lower()



In [18]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | one of the other reviewers has mentioned that ... | positive |
| 1 | a wonderful little production. the filming tec... | positive |
| 2 | i thought this was a wonderful way to spend ti... | positive |
| 3 | basically there's a family where a little boy ... | negative |
| 4 | petter mattei's "love in the time of money" is... | positive |


In [19]:
# 3) Stopword removal
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
# A set gives constant-time membership tests inside remove_stopwords
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [20]:
def remove_stopwords(text):
    text = [word for word in text.split() if word not in stop_words]
    return ' '.join(text)

In [21]:
df.review = df.review.apply(remove_stopwords)



In [22]:
# 4) Remove punctuation
import string
def remove_punctuations(text):
    return text.translate(str.maketrans('', '', string.punctuation))
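
Stopwords are removed before punctuation in this notebook, so a stopword that is directly followed by punctuation (for example `it,`) does not match the list and survives as a bare word once the punctuation is stripped. The sketch below illustrates that interaction. It uses a tiny hand-written stopword set as a stand-in for the NLTK list, which is an assumption made so the example runs without downloads.

```python
import string

STOP_WORDS = {"the", "is", "a", "it"}  # tiny stand-in for the NLTK list (assumption)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_stopwords(text):
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

def remove_punctuation(text):
    return text.translate(PUNCT_TABLE)

text = "loved it, the ending is great!"  # hypothetical sentence

stop_first = remove_punctuation(remove_stopwords(text))   # order used in this notebook
punct_first = remove_stopwords(remove_punctuation(text))  # alternative order

print(stop_first)   # loved it ending great
print(punct_first)  # loved ending great
```

Neither order is strictly better: stripping punctuation first also removes apostrophes, so contractions no longer match any apostrophe-containing entries in a stopword list.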

In [23]:
df.review = df.review.apply(remove_punctuations)



In [24]:
# 5) Spelling mistakes

from textblob import TextBlob
def correct_spell(text):
    return str(TextBlob(text).correct().string)

In [25]:
# df.review = df.review.apply(correct_spell)

In [26]:
# 6) Stemming

from nltk.stem import PorterStemmer
ps = PorterStemmer()

In [27]:
def stemming(text):
    return " ".join([ps.stem(word) for word in text.split()])

In [28]:
# df.review = df.review.apply(stemming)

# Feature Engineering

In [29]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | one reviewers mentioned watching 1 oz episode ... | positive |
| 1 | wonderful little production filming technique ... | positive |
| 2 | thought wonderful way spend time hot summer we... | positive |
| 3 | basically theres family little boy jake thinks... | negative |
| 4 | petter matteis love time money visually stunni... | positive |


In [30]:
x = df.drop('sentiment', axis = 1)
y = df.sentiment

In [31]:
from sklearn.preprocessing import LabelEncoder

In [32]:
le = LabelEncoder()
y = le.fit_transform(y)

In [33]:
y

array([1, 1, 1, ..., 0, 0, 1])
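
`LabelEncoder` assigns integer codes in the sorted order of the class labels, which is why `positive` appears as 1 and `negative` as 0 in the array above. The plain-Python sketch below mimics that mapping on a made-up list of labels. It is not the scikit-learn implementation; in the notebook itself the fitted mapping can be read from `le.classes_`.

```python
labels = ["positive", "positive", "negative", "positive"]  # hypothetical sample

classes = sorted(set(labels))                       # sorted class names
to_index = {c: i for i, c in enumerate(classes)}    # class name -> integer code
encoded = [to_index[label] for label in labels]

print(to_index)  # {'negative': 0, 'positive': 1}
print(encoded)   # [1, 1, 0, 1]
```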

In [34]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)

In [35]:
x_train.shape

(7986, 1)

In [36]:
x_test.shape

(1997, 1)
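
The two shapes follow from how a fractional `test_size` is turned into a row count: the test share is rounded up and the training set receives the remainder. A short arithmetic check under that assumption:

```python
import math

n_samples = 9983   # rows left after dropping duplicates
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # fractional test size is rounded up
n_train = n_samples - n_test                # the training set gets the rest

print(n_train, n_test)  # 7986 1997
```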

In [37]:
# Bag of Words (BoW): fit the vocabulary on x_train, then reuse it for x_test

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

x_train_bow = cv.fit_transform(x_train.review).toarray()
x_test_bow = cv.transform(x_test.review).toarray()
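
The fit-on-train, transform-on-test pattern above can be sketched in plain Python. The toy corpus and the helper `to_counts` are made up for illustration, and whitespace splitting is a simplification of the tokenisation `CountVectorizer` performs. The point is that the vocabulary comes from the training documents only, so words first seen at test time are ignored.

```python
from collections import Counter

train_docs = ["good movie good cast", "bad movie"]   # hypothetical toy corpus
test_docs = ["good plot"]                            # 'plot' is unseen in training

# Vocabulary is learned from the training documents only, in sorted order.
vocab = sorted({word for doc in train_docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

def to_counts(doc):
    """Count in-vocabulary words and lay them out in vocabulary order."""
    counts = Counter(w for w in doc.split() if w in index)
    return [counts[w] for w in vocab]

train_bow = [to_counts(d) for d in train_docs]
test_bow = [to_counts(d) for d in test_docs]

print(vocab)      # ['bad', 'cast', 'good', 'movie']
print(train_bow)  # [[0, 1, 2, 1], [1, 0, 0, 1]]
print(test_bow)   # [[0, 0, 1, 0]]
```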

# Modelling

In [38]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

In [40]:
gnb.fit(x_train_bow, y_train)

In [41]:
y_pred = gnb.predict(x_test_bow)

In [42]:
from sklearn.metrics import accuracy_score, confusion_matrix

In [43]:
accuracy_score(y_test, y_pred)

0.656484727090636

In [44]:
confusion_matrix(y_test, y_pred)

array([[697, 255],
       [431, 614]])
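
The confusion matrix can be unpacked by hand. In scikit-learn's layout, rows are true classes and columns are predicted classes, and with the encoding used here class 1 is `positive`. The sketch below recomputes the accuracy reported above and adds precision and recall for the positive class:

```python
cm = [[697, 255],
      [431, 614]]   # rows: true class (0, 1); columns: predicted class (0, 1)

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total    # share of all reviews classified correctly
precision = tp / (tp + fp)      # of predicted positives, how many are positive
recall = tp / (tp + fn)         # of true positives, how many were found

print(round(accuracy, 4))   # 0.6565
print(round(precision, 4))  # 0.7066
print(round(recall, 4))     # 0.5876
```

The low recall shows where most of the error sits: 431 of the 1045 positive test reviews are predicted as negative.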

In [45]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

rf.fit(x_train_bow,y_train)
y_pred = rf.predict(x_test_bow)
accuracy_score(y_test,y_pred)

0.842764146219329

# Using TF-IDF

In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()

x_train_tfidf = tfidf.fit_transform(x_train.review).toarray()
x_test_tfidf = tfidf.transform(x_test.review).toarray()
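
For intuition about what the vectorizer computes, here is a plain-Python sketch of the weighting, assuming the default `TfidfVectorizer` settings (raw term counts, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, then L2 normalisation of each row). The toy corpus and helper names are made up, and whitespace splitting stands in for the real tokenisation.

```python
import math
from collections import Counter

docs = ["good movie", "bad movie"]   # hypothetical toy corpus
vocab = sorted({w for d in docs for w in d.split()})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
doc_freq = {w: sum(w in d.split() for d in docs) for w in vocab}
# Smoothed inverse document frequency.
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_vector(doc):
    """Term count times IDF, scaled so the vector has unit L2 norm."""
    counts = Counter(doc.split())
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm else raw

for d in docs:
    print(d, [round(v, 3) for v in tfidf_vector(d)])
```

A word that occurs in every document (`movie` here) gets the minimum IDF of 1, so the rarer words `good` and `bad` carry more weight in their rows.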

In [47]:
rf = RandomForestClassifier()

rf.fit(x_train_tfidf,y_train)
y_pred = rf.predict(x_test_tfidf)
accuracy_score(y_test,y_pred)

0.842764146219329