In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
import re
import spacy
import nltk

import os
print(os.listdir("../data"))
import warnings
warnings.filterwarnings('ignore')

['IMDB Dataset.csv']


In [2]:
df=pd.read_csv('../data/IMDB Dataset.csv')
print("Data shape: ", df.shape)  # (rows, columns): one row per review
df

Data shape:  (50000, 2)


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [3]:
df['clean_review'] = df['review']
df['clean_review'].head()

0    One of the other reviewers has mentioned that ...
1    A wonderful little production. <br /><br />The...
2    I thought this was a wonderful way to spend ti...
3    Basically there's a family where a little boy ...
4    Petter Mattei's "Love in the Time of Money" is...
Name: clean_review, dtype: str

In [4]:
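def clean_text(s: str) -> str:
    """
    Clean raw review text with regular expressions.
    :param s: raw review string
    :return: lowercased text containing only letters and single spaces
    """

    s = s.lower()
    s = re.sub(r'<.*?>', ' ', s)  # Replace HTML tags such as <br /> with a space
    s = re.sub(r'[^a-zA-Z]', ' ', s)  # Replace punctuation, digits, and other non-letters with a space
    s = re.sub(r'\s+', ' ', s)  # Collapse runs of whitespace into one space
    return s.strip()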
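
To make the effect of each step concrete, the sketch below repeats `clean_text` and runs it on a short made-up string (the sample text is an assumption, not a row from the dataset). Because every non-letter is replaced with a space, contractions are split, which is why the cleaned reviews contain fragments such as `there s`.

```python
import re

def clean_text(s: str) -> str:
    s = s.lower()
    s = re.sub(r'<.*?>', ' ', s)      # HTML tags -> space
    s = re.sub(r'[^a-zA-Z]', ' ', s)  # non-letters -> space
    s = re.sub(r'\s+', ' ', s)        # collapse whitespace
    return s.strip()

# Hypothetical sample: contains HTML tags, digits, punctuation, and contractions
sample = "You'll LOVE it.<br /><br />10/10, wouldn't skip!"
cleaned = clean_text(sample)
print(cleaned)  # you ll love it wouldn t skip
```

The rating `10/10` disappears entirely, so if numeric scores matter for a later model, they would need to be extracted before this step.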

In [5]:
df['clean_review'] = df['clean_review'].apply(clean_text)
df['clean_review'].head()

0    one of the other reviewers has mentioned that ...
1    a wonderful little production the filming tech...
2    i thought this was a wonderful way to spend ti...
3    basically there s a family where a little boy ...
4    petter mattei s love in the time of money is a...
Name: clean_review, dtype: str

In [6]:
from nltk.corpus import stopwords

# Use a separate name so the imported `stopwords` module is not overwritten
stop_words = set(stopwords.words('english'))

def remove_stopwords(text: str) -> str:
    """
    Remove English stopwords from text.
    :param text: cleaned, whitespace-separated text
    :return: text with stopwords removed
    """
    return ' '.join(w for w in text.split() if w not in stop_words)

# Negations such as "not" and "no" carry sentiment, so removing stopwords can sometimes hurt the model
df['clean_review_nostop'] = df['clean_review'].apply(remove_stopwords)
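
One possible way to limit the loss of sentiment cues is to take the negation words out of the stopword set before filtering. The sketch below is only an illustration: `demo_stopwords`, `negations`, and `remove_words` are hypothetical names, and the tiny word list stands in for NLTK's full English list so that the example runs on its own.

```python
# Tiny illustrative stopword list (an assumption for this demo;
# the notebook itself uses NLTK's full English list)
demo_stopwords = {"is", "on", "the", "not", "no", "nor"}
negations = {"not", "no", "nor"}

def remove_words(text: str, words: set) -> str:
    """Drop every whitespace-separated token that appears in `words`."""
    return ' '.join(w for w in text.split() if w not in words)

phrase = "privacy is not high on the agenda"
all_removed = remove_words(phrase, demo_stopwords)
negation_kept = remove_words(phrase, demo_stopwords - negations)

print(all_removed)    # privacy high agenda
print(negation_kept)  # privacy not high agenda
```

With the full list the phrase loses its negation, while subtracting `negations` keeps "not" in place. Whether this helps a given model is something to check empirically.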

In [7]:
# spaCy for Lemmatization
nlp = spacy.load("en_core_web_sm")

def lemmatize(text: str) -> str:
    return ' '.join([token.lemma_ for token in nlp(text)])

In [8]:
df['clean_review_lemma'] = df['clean_review'].apply(lemmatize)
df['clean_review_lemma'].head()

0    one of the other reviewer have mention that af...
1    a wonderful little production the filming tech...
2    I think this be a wonderful way to spend time ...
3    basically there s a family where a little boy ...
4    petter mattei s love in the time of money be a...
Name: clean_review_lemma, dtype: str

In [9]:
df['len_raw'] = df['review'].apply(lambda x: len(x.split()))
df['len_clean'] = df['clean_review'].apply(lambda x: len(x.split()))
df['len_nostop'] = df['clean_review_nostop'].apply(lambda x: len(x.split()))

In [17]:
print("Word count - Raw:", df['len_raw'].head())
print("Word count - Clean text: ",df['len_clean'].head())
print("Word count - Remove stopword: ", df['len_nostop'].head())

Word count - Raw: 0    307
1    162
2    166
3    138
4    230
Name: len_raw, dtype: int64
Word count - Clean text:  0    313
1    160
2    167
3    133
4    228
Name: len_clean, dtype: int64
Word count - Remove stopword:  0    162
1     86
2     84
3     64
4    125
Name: len_nostop, dtype: int64


Removing stopwords cuts the word count roughly in half (for example, 313 → 162 words for the first review).
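
As a quick check on the size of the reduction, the sketch below recomputes the share of words removed from the five `len_clean` and `len_nostop` values printed above. It covers only those five reviews, so it is not a statistic for the whole dataset.

```python
# Word counts copied from the head() output above (first five reviews only)
len_clean = [313, 160, 167, 133, 228]
len_nostop = [162, 86, 84, 64, 125]

# Fraction of words dropped by stopword removal for each review
removed = [1 - n / c for c, n in zip(len_clean, len_nostop)]

for c, n, r in zip(len_clean, len_nostop, removed):
    print(f"{c} -> {n} words: {r:.0%} removed")

mean_removed = sum(removed) / len(removed)
print(f"Mean share removed: {mean_removed:.1%}")
```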

In [24]:
pd.set_option("display.max_colwidth", None)
df['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [25]:
df['clean_review'][0]

'one of the other reviewers has mentioned that after watching just oz episode you ll be hooked they are right as this is exactly what happened with me the first thing that struck me about oz was its brutality and unflinching scenes of violence which set in right from the word go trust me this is not a show for the faint hearted or timid this show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the word it is called oz as that is the nickname given to the oswald maximum security state penitentary it focuses mainly on emerald city an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda em city is home to many aryans muslims gangstas latinos christians italians irish and more so scuffles death stares dodgy dealings and shady agreements are never far away i would say the main appeal of the show is due to the fact that it goes where other shows wouldn t dare forget pretty p

In [26]:
df['clean_review_nostop'][0]

'one reviewers mentioned watching oz episode hooked right exactly happened first thing struck oz brutality unflinching scenes violence set right word go trust show faint hearted timid show pulls punches regards drugs sex violence hardcore classic use word called oz nickname given oswald maximum security state penitentary focuses mainly emerald city experimental section prison cells glass fronts face inwards privacy high agenda em city home many aryans muslims gangstas latinos christians italians irish scuffles death stares dodgy dealings shady agreements never far away would say main appeal show due fact goes shows dare forget pretty pictures painted mainstream audiences forget charm forget romance oz mess around first episode ever saw struck nasty surreal say ready watched developed taste oz got accustomed high levels graphic violence violence injustice crooked guards sold nickel inmates kill order get away well mannered middle class inmates turned prison bitches due lack street skill

Still readable, but the positive sentiment is a little harder to detect in `df['clean_review_nostop']`.

In [32]:
df['clean_review'][3]

'basically there s a family where a little boy jake thinks there s a zombie in his closet his parents are fighting all the time this movie is slower than a soap opera and suddenly jake decides to become rambo and kill the zombie ok first of all when you re going to make a film you must decide if its a thriller or a drama as a drama the movie is watchable parents are divorcing arguing like in real life and then we have jake with his closet which totally ruins all the film i expected to see a boogeyman similar movie and instead i watched a drama with some meaningless thriller spots out of just for the well playing parents descent dialogs as for the shots with jake just ignore them'

In [31]:
df['clean_review_nostop'][3]

'basically family little boy jake thinks zombie closet parents fighting time movie slower soap opera suddenly jake decides become rambo kill zombie ok first going make film must decide thriller drama drama movie watchable parents divorcing arguing like real life jake closet totally ruins film expected see boogeyman similar movie instead watched drama meaningless thriller spots well playing parents descent dialogs shots jake ignore'

The negative sentiment is still detectable after stopword removal (e.g. "totally ruins film", "meaningless").

In [33]:
df['clean_review'][49996]

'bad plot bad dialogue bad acting idiotic directing the annoying porn groove soundtrack that ran continually over the overacted script and a crappy copy of the vhs cannot be redeemed by consuming liquor trust me because i stuck this turkey out to the end it was so pathetically bad all over that i had to figure it was a fourth rate spoof of springtime for hitler the girl who played janis joplin was the only faint spark of interest and that was only because she could sing better than the original if you want to watch something similar but a thousand times better then watch beyond the valley of the dolls'

In [34]:
df['clean_review_nostop'][49996]

'bad plot bad dialogue bad acting idiotic directing annoying porn groove soundtrack ran continually overacted script crappy copy vhs cannot redeemed consuming liquor trust stuck turkey end pathetically bad figure fourth rate spoof springtime hitler girl played janis joplin faint spark interest could sing better original want watch something similar thousand times better watch beyond valley dolls'

Here too the negative sentiment is still detectable after stopword removal (e.g. "bad plot bad dialogue bad acting").

## Day 2 – Cleaning Summary

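- Lowercasing and HTML tag removal (e.g. `<br />`) reduce noise in the text
- Removing punctuation and numbers simplifies the text for TF-IDF, but it also splits contractions ("you'll" becomes "you ll")
- Stopword removal roughly halves the word count in the sampled reviews, but may remove sentiment cues (e.g. "not")
- Keep both versions (`clean_review` and `clean_review_nostop`) for model comparison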
