In [1]:
import pandas as pd
import numpy as np

In [4]:
df = pd.read_csv('data/imdb_dataset.csv')
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [10]:
df.shape

(50000, 2)

In [11]:
df['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [15]:
df.duplicated().sum()

np.int64(418)

In [18]:
# let's compute the length of each review, measured in words
df['review_length'] = df['review'].apply(lambda x: len(x.split()))
df

Unnamed: 0,review,sentiment,review_length
0,One of the other reviewers has mentioned that ...,positive,307
1,A wonderful little production. <br /><br />The...,positive,162
2,I thought this was a wonderful way to spend ti...,positive,166
3,Basically there's a family where a little boy ...,negative,138
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,230
...,...,...,...
49995,I thought this movie did a down right good job...,positive,194
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,112
49997,I am a Catholic taught in parochial elementary...,negative,230
49998,I'm going to have to disagree with the previou...,negative,212


In [20]:
df['review_length'].describe()

count    50000.000000
mean       231.156940
std        171.343997
min          4.000000
25%        126.000000
50%        173.000000
75%        280.000000
max       2470.000000
Name: review_length, dtype: float64

In [21]:
df.loc[3]

review           Basically there's a family where a little boy ...
sentiment                                                 negative
review_length                                                  138
Name: 3, dtype: object

In [24]:
# let's see one sample review
df.loc[3, 'review']

"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them."

## Text Preprocessing
Now that we have an idea of what the data looks like, let's perform some preprocessing.

#### Removing Duplicates

In [26]:
df = df.drop_duplicates().reset_index(drop=True)
print('After removing duplicates, length of the dataset : ', len(df))

After removing duplicates, length of the dataset :  49582


#### Removing HTML Tags

In [29]:
# first let's check how many reviews contain HTML tags
import re
def has_html(text):
    return bool(re.search(r'<.*?>', text))

print('Number of reviews with HTML tags : ', df['review'].apply(has_html).sum())

Number of reviews with HTML tags :  28968


In [30]:
def remove_html(text):
    return re.sub(r'<.*?>', ' ', text)

df['review'] = df['review'].apply(remove_html)
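
As a quick sanity check, the sketch below applies the same pattern to a made-up string (the sample text is an assumption, not a row from the dataset). Replacing each tag with a space rather than an empty string keeps the words on either side of `<br />` from being fused together.

```python
import re

def remove_html(text):
    # Replace every tag-like span with a space so neighbouring words stay separated
    return re.sub(r'<.*?>', ' ', text)

sample = "Great start.<br /><br />Weak ending."  # hypothetical sample text
cleaned = remove_html(sample)
print(repr(cleaned))    # 'Great start.  Weak ending.' (two spaces, one per tag)

# Optional: collapse the repeated spaces left behind by consecutive tags
collapsed = ' '.join(cleaned.split())
print(repr(collapsed))  # 'Great start. Weak ending.'
```

The extra spaces are harmless for the later steps in this notebook, because the tokenising regex and the vectorizer both ignore runs of whitespace.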

#### Removing Punctuation

In [36]:
import string

print(string.punctuation)
exclude = string.punctuation

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [37]:
def remove_punctuation(text):
    return text.translate(str.maketrans('','',exclude))

df['review'] = df['review'].apply(remove_punctuation)
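
One thing to keep in mind: `str.translate` with a deletion table removes punctuation without leaving a gap, so words joined only by punctuation get fused. The sketch below shows this on a made-up string, next to a possible alternative that maps punctuation to spaces. `punctuation_to_space` is a hypothetical helper for illustration and is not used elsewhere in this notebook.

```python
import string

exclude = string.punctuation

def remove_punctuation(text):
    # Same approach as the cell above: delete every punctuation character
    return text.translate(str.maketrans('', '', exclude))

def punctuation_to_space(text):
    # Hypothetical alternative: turn each punctuation character into a space
    return text.translate(str.maketrans(exclude, ' ' * len(exclude)))

sample = "slow...and suddenly, there's a zombie"  # made-up sample text

deleted = remove_punctuation(sample)
spaced = ' '.join(punctuation_to_space(sample).split())

print(deleted)  # slowand suddenly theres a zombie
print(spaced)   # slow and suddenly there s a zombie
```

Neither variant is strictly better: deleting keeps contractions in one piece ("theres") but fuses "slow...and", while spacing separates words correctly but splits contractions into fragments.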

In [38]:
# to ensure consistency, let's convert everything to lowercase
df['review'] = df['review'].str.lower()

#### Removing Stopwords

In [39]:
from nltk.corpus import stopwords

In [40]:
stop_words = set(stopwords.words('english'))
# we can add custom words to this stopwords list
stop_words.update(['br', 'film', 'movie', 'one', 'character'])  # Common in movie reviews

In [41]:
def remove_stopwords(text):
    words = re.findall(r'\b[a-zA-Z]{2,}\b', text)
    return ' '.join(w for w in words if w not in stop_words)

df['review'] = df['review'].apply(remove_stopwords)
# .map() would also work here instead of .apply(); for a Series any speed difference is usually small

Let's explain how this regex works:
* \b: This matches a "word boundary." It ensures that the pattern only matches whole words, never a fragment inside a longer token (e.g., it matches the standalone word "is" but does not extract the "is" inside "this").
* [a-zA-Z]: This is a character set that matches any single lowercase or uppercase letter.
* {2,}: This is a quantifier that requires the character set to repeat two or more times. It is a simple way to filter out very short words, such as "a" or "i". In other words, only words with at least 2 letters are kept.
* Additionally, .apply() runs this function on each review, one at a time.
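
To make the behaviour concrete, here is a small sketch that runs the same pattern on a made-up sentence. The tiny `stop_words` set is a stand-in for the NLTK list, used only so the example runs without any downloads.

```python
import re

# Stand-in for the NLTK stopword set (assumption for illustration only)
stop_words = {'this', 'is', 'of', 'the'}

def remove_stopwords(text):
    # Keep alphabetic tokens of 2+ letters, then drop the stopwords
    words = re.findall(r'\b[a-zA-Z]{2,}\b', text)
    return ' '.join(w for w in words if w not in stop_words)

sample = "this is a b2b thriller rated 3 of 10"  # made-up sample text

tokens = re.findall(r'\b[a-zA-Z]{2,}\b', sample)
print(tokens)                    # ['this', 'is', 'thriller', 'rated', 'of']
print(remove_stopwords(sample))  # thriller rated
```

Note that "a", "3" and "10" are dropped by the pattern itself, and so is the mixed token "b2b", since no run of two or more letters in it sits between two word boundaries. Because punctuation was deleted in an earlier step, it may also be worth checking whether contractions such as "dont" or "theres" survive the stopword filter.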

## Text Representation (TF-IDF)

In [42]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [43]:
tv = TfidfVectorizer(max_features=5000, ngram_range=(1,2))  # Unigrams + bigrams
X = tv.fit_transform(df['review'])

#### For now, let's train a model with MultinomialNB

In [44]:
# Labels
y = df['sentiment']

# Now train your model! (fitted on the full dataset; no held-out test split yet)
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB().fit(X, y)

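What does .fit_transform() do?

It performs two steps in one call:

* .fit():
  * Scans all cleaned reviews
  * Builds a vocabulary of up to 5000 terms (unigrams and bigrams), keeping the terms that occur most often across the corpus
  * Learns an IDF weight for each term, so that terms appearing in fewer reviews count for more
* .transform():
  * Converts each review into a numerical vector
  * Each number is the TF-IDF score of one term in that review
* 🎯 Output: X is a sparse matrix of shape (49582, 5000)
  → 49,582 rows (the reviews left after removing duplicates), 5000 columns (term scores)

This is the feature matrix, the input to the ML model.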
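
The two steps can be sketched in plain Python on a toy corpus. This is a simplified illustration, not scikit-learn's implementation: it assumes the library's default settings (smoothed IDF, L2-normalised rows), works on documents that are already tokenised, and skips n-grams and sparse storage. The names `fit`, `transform` and `docs` are hypothetical.

```python
import math
from collections import Counter

def fit(docs, max_features=None):
    """Learn a vocabulary and one IDF weight per term."""
    term_counts = Counter(t for doc in docs for t in doc)      # corpus-wide counts
    doc_counts = Counter(t for doc in docs for t in set(doc))  # document frequency
    # Keep only the most frequent terms, in the spirit of max_features
    kept = [t for t, _ in term_counts.most_common(max_features)]
    vocab = {t: i for i, t in enumerate(sorted(kept))}
    n = len(docs)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + doc_counts[t])) + 1 for t in vocab}
    return vocab, idf

def transform(docs, vocab, idf):
    """Turn each document into an L2-normalised TF-IDF row."""
    rows = []
    for doc in docs:
        counts = Counter(t for t in doc if t in vocab)
        row = [0.0] * len(vocab)
        for t, c in counts.items():
            row[vocab[t]] = c * idf[t]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm for v in row] if norm else row)
    return rows

# Toy corpus: three tiny "reviews", already tokenised
docs = [["good", "plot"], ["bad", "plot", "bad"], ["good", "acting", "good"]]

vocab, idf = fit(docs, max_features=3)  # "acting" is the rarest term and is dropped
X = transform(docs, vocab, idf)

print(sorted(vocab))        # ['bad', 'good', 'plot']
print(len(X), len(X[0]))    # 3 3
```

Each row of `X` corresponds to one document and each column to one vocabulary term, which mirrors the (reviews, terms) shape of the real matrix. In the last document only "good" is in the vocabulary, so after normalisation its row has a single non-zero entry.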