This is the data preprocessing notebook.

The preprocessing steps are:
1. Checking the dataset for NA values
2. Checking the dataset for duplicate records


We import the libraries required to preprocess the data.

In [1]:
import pandas as pd

We load the IMDB Movie Reviews CSV and explore the dataset's length, columns and class balance.
- The dataset contains 50,000 records.
- The dataset has 2 columns: review and sentiment. The "review" column holds the review text (i.e. str). The "sentiment" column holds the corresponding label, which is either positive or negative.
- The dataset is balanced, with 25,000 records for each sentiment.

In [2]:
df = pd.read_csv("IMDB Dataset.csv")

print(df.head())
print()
print(df.describe())

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive

                                                   review sentiment
count                                               50000     50000
unique                                              49582         2
top     Loved today's show!!! It was a variety and not...  positive
freq                                                    5     25000


In [3]:
#Length
print(f'\nThe length of the dataframe is {len(df)}.')
print()

#Columns
print(df.columns)

#Balance
df["sentiment"].value_counts()


The length of the dataframe is 50000.

Index(['review', 'sentiment'], dtype='object')


sentiment
positive    25000
negative    25000
Name: count, dtype: int64

In [4]:
#Checking the dataset for NA values 
df.isna().sum()

review       0
sentiment    0
dtype: int64
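
One caveat with the NA check: `isna()` flags only true missing values, so an empty or whitespace-only review would still be counted as present. The sketch below illustrates this on a made-up frame (the `toy` data and the `is_blank` name are hypothetical, not part of this notebook); the same test could be run on `df["review"]` if that case is a concern.

```python
import pandas as pd

# Hypothetical toy data: one real review, one true missing value,
# one empty string, and one whitespace-only string
toy = pd.DataFrame({"review": ["fine film", None, "", "   "]})

# isna() flags only the true missing value
n_missing = int(toy["review"].isna().sum())

# Treat NA, empty, and whitespace-only strings as blank
is_blank = toy["review"].fillna("").str.strip().eq("")
n_blank = int(is_blank.sum())

print(n_missing, n_blank)
```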

In [5]:
#Checking the dataset for duplicate records
print(df["review"].duplicated().value_counts())

#Remove the 418 duplicate records (drop_duplicates() compares full rows) and reset the index of the deduplicated dataset
df = df.drop_duplicates()
df = df.reset_index(drop = True)


review
False    49582
True       418
Name: count, dtype: int64
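
The check above looks only at the `review` column, while `drop_duplicates()` with no arguments compares whole rows. The two agree for this dataset (49,582 rows remain after the drop), but they would differ if the same review text appeared with two different labels. A minimal sketch on made-up data (the `toy` frame is hypothetical, not taken from the IMDB dataset) shows the difference, and how the `subset` argument makes the intent explicit:

```python
import pandas as pd

# Hypothetical toy data: "good film" appears twice with the same label,
# "odd film" appears twice with conflicting labels
toy = pd.DataFrame({
    "review": ["good film", "good film", "odd film", "odd film", "bad film"],
    "sentiment": ["positive", "positive", "positive", "negative", "negative"],
})

# Duplicates judged on the review text only
n_text_dupes = int(toy["review"].duplicated().sum())

# Duplicates judged on the whole row (the drop_duplicates() default)
n_row_dupes = int(toy.duplicated().sum())

# Dropping on the review column alone keeps the first occurrence of each text
deduped = toy.drop_duplicates(subset="review").reset_index(drop=True)

print(n_text_dupes, n_row_dupes, len(deduped))
```

Whether to keep the first occurrence of a conflicting pair or to drop both is a modelling decision; the sketch only shows where the two checks diverge.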


In [7]:
#Check the class balance again. The dataset is no longer exactly balanced, but it remains very close: 24884 positive and 24698 negative records (about 50.2% vs 49.8%).
df["sentiment"].value_counts()

sentiment
positive    24884
negative    24698
Name: count, dtype: int64
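
To put a number on how close to balanced the deduplicated data is, the class shares can be computed from the counts above. This is a small sketch that copies the two counts from the output; `shares` and `imbalance_ratio` are illustrative names, and on the real dataframe `df["sentiment"].value_counts(normalize=True)` would give the shares directly.

```python
import pandas as pd

# Class counts copied from the output above
counts = pd.Series({"positive": 24884, "negative": 24698}, name="count")

# Share of each class in the deduplicated dataset
shares = counts / counts.sum()

# Ratio of the larger class to the smaller one (1.0 means perfectly balanced)
imbalance_ratio = counts.max() / counts.min()

print(shares.round(4))
print(round(imbalance_ratio, 4))
```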