This is the data preprocessing notebook.

The preprocessing steps, in the order the pipeline applies them, are:
1. Checking the dataset for NA values
2. Checking the dataset for duplicate records
3. Removing irrelevant data (e.g. HTML tags and special characters)
4. Correcting spelling errors and standardising words to lowercase
5. Tokenizing the textual data
6. Removing stopwords (i.e. common words like "the" and "a")
7. Lemmatization: reducing each token to its base form
8. Representing the textual data in a suitable model (e.g. Bag of Words, TF-IDF vectors)


We install and import the libraries required to preprocess the data.

In [1]:
!pip install nltk
!pip install textblob

import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from textblob import TextBlob

#Download stopwords and punkt tokenizer
nltk.download('punkt')
nltk.download('stopwords')



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\arona\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\arona\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

We load the IMDB Movie Reviews dataset from its CSV file and explore its length, columns and class balance.
- The dataset contains 50,000 records, of which 49,582 reviews are unique.
- The dataset has 2 columns: review and sentiment. The "review" column holds the text (str) of each IMDB movie review. The "sentiment" column holds the corresponding label, which is either positive or negative.
- The dataset is balanced, with 25,000 records for each sentiment.

In [2]:
df = pd.read_csv("IMDB Dataset.csv")

print(df.head())
print()
print(df.describe())

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive

                                                   review sentiment
count                                               50000     50000
unique                                              49582         2
top     Loved today's show!!! It was a variety and not...  positive
freq                                                    5     25000


In [3]:
#Length
print(f'\nThe length of the dataframe is {len(df)}.')
print()

#Columns
print(df.columns)

#Balance
df["sentiment"].value_counts()


The length of the dataframe is 50000.

Index(['review', 'sentiment'], dtype='object')


positive    25000
negative    25000
Name: sentiment, dtype: int64

In [4]:
#Checking the dataset for NA values 
df.isna().sum()

review       0
sentiment    0
dtype: int64

In [5]:
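#Checking the dataset for duplicate records (compared on the review text)
print(df["review"].duplicated().value_counts())

#Remove the 418 duplicate records and reset the index of the deduplicated dataset
#Note: drop_duplicates() compares whole rows (review and sentiment); here it removes the same 418 records
df = df.drop_duplicates()
df = df.reset_index(drop = True)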


False    49582
True       418
Name: review, dtype: int64
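
The check above counts duplicates in the review column only, whereas `drop_duplicates()` without arguments compares entire rows. On this dataset the two agree, but they can differ when the same review text appears with different labels. A minimal sketch on a made-up toy frame (the data and the variable names are hypothetical, not taken from the IMDB file):

```python
import pandas as pd

# Hypothetical toy data: row 1 repeats row 0 exactly;
# row 3 repeats the text of row 2 but carries a different label.
toy = pd.DataFrame({
    "review": ["great film", "great film", "dull plot", "dull plot"],
    "sentiment": ["positive", "positive", "negative", "positive"],
})

# Duplicates judged on the review text alone
n_text_dupes = int(toy["review"].duplicated().sum())

# Duplicates judged on the whole row (review and sentiment together)
n_row_dupes = int(toy.duplicated().sum())

# Whole-row deduplication keeps the conflicting pair; subset="review" does not
dedup_rows = toy.drop_duplicates().reset_index(drop=True)
dedup_text = toy.drop_duplicates(subset="review").reset_index(drop=True)

print(n_text_dupes, n_row_dupes, len(dedup_rows), len(dedup_text))
```

Passing `subset="review"` makes the removal use the same criterion as the check, at the cost of silently keeping only the first label when labels conflict.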


In [6]:
#Check the class balance again. The dataset remains nearly balanced, with 24,884 positive and 24,698 negative records.
df["sentiment"].value_counts()

positive    24884
negative    24698
Name: sentiment, dtype: int64

Define the preprocessing functions below.

In [7]:
#Tokenization
def tokenize(text):
    return nltk.word_tokenize(text)

In [8]:
#Removal of irrelevant data
def clean_text(text):
    # Replace HTML tags with a space so that words on either side of a tag are not fused
    text = re.sub(r'<.*?>', ' ', text)
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text
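
The replacement string used for HTML tags matters when a tag sits directly between two words, as in `production.<br /><br />The`. A minimal sketch comparing an empty replacement with a single space on a made-up sample string (the helper names `strip_tags` and `letters_only` are hypothetical):

```python
import re

def strip_tags(text, replacement):
    """Remove HTML tags, substituting `replacement` for each tag."""
    return re.sub(r'<.*?>', replacement, text)

def letters_only(text):
    """Keep ASCII letters and whitespace, then collapse runs of whitespace."""
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample in the style of the reviews
sample = "A wonderful production.<br /><br />The acting was great!"

fused = letters_only(strip_tags(sample, ''))    # tag replaced with nothing
spaced = letters_only(strip_tags(sample, ' '))  # tag replaced with a space

print(fused)   # A wonderful productionThe acting was great
print(spaced)  # A wonderful production The acting was great
```

With an empty replacement the two neighbouring words merge into a single token, which later steps cannot separate again.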

In [9]:
#Removal of Stopwords
stop_words = set(stopwords.words('english'))
def remove_stopwords(tokens):
    return [word for word in tokens if word not in stop_words]
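
One interaction worth checking: the cleaning step strips apostrophes before stopwords are removed, so a contraction such as "don't" becomes "dont" and no longer matches the stopword entry "don't". A minimal sketch of the effect, assuming a small hand-written stopword subset and plain `str.split()` in place of the NLTK list and tokenizer (both are stand-ins for illustration only):

```python
import re

# Hand-written subset standing in for stopwords.words('english') (assumption)
stop_subset = {"i", "do", "don't", "the", "it"}

def strip_non_letters(text):
    """Same character filter as the cleaning step: keep letters and whitespace."""
    return re.sub(r'[^a-zA-Z\s]', '', text)

raw = "i don't like the ending"

# Whitespace split stands in for nltk.word_tokenize (assumption)
tokens_after_cleaning = strip_non_letters(raw).split()
kept = [word for word in tokens_after_cleaning if word not in stop_subset]

print(kept)  # ['dont', 'like', 'ending']
```

Whether the surviving "dont" is a problem depends on the downstream model. For sentiment analysis, keeping negations can in fact be useful.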

In [10]:
#Lemmatization
nltk.download('wordnet', quiet=True)  # WordNetLemmatizer needs the WordNet corpus
lemmatizer = WordNetLemmatizer()
def lemmatize_words(tokens):
    return [lemmatizer.lemmatize(word) for word in tokens]

In [11]:
#Correcting spelling errors and standardising words to lowercase
def correct_spelling(text):
    blob = TextBlob(text)
    return str(blob.correct()).lower()

In [12]:
# Full text preprocessing pipeline
def preprocess_text(text):
    text = clean_text(text)  # Clean the text
    text = correct_spelling(text)  # Correct spelling and convert to lowercase
    tokens = tokenize(text)  # Tokenize
    tokens = remove_stopwords(tokens)  # Remove stopwords
    tokens = lemmatize_words(tokens)  # Lemmatize
    return ' '.join(tokens)  # Return processed text

Apply the preprocessing functions to the dataset.

In [None]:
# Apply preprocessing to the entire dataset
from joblib import Parallel, delayed
from tqdm import tqdm

# Run preprocess_text over all reviews in parallel.
# 'n_jobs=-1' uses all available CPU cores. tqdm wraps the input iterable, so the bar
# tracks how many reviews have been dispatched to workers; 'total=len(df)' sets its length.
df['cleaned_review'] = Parallel(n_jobs=-1)(
    delayed(preprocess_text)(text) for text in tqdm(df['review'], total=len(df))
)

  0%|          | 192/49582 [01:29<7:02:46,  1.95it/s]
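
The progress bar above reports about 1.95 reviews per second, which puts the full run at roughly 7 hours. Before launching a job of that size it can help to time the pipeline on a small sample and extrapolate. A minimal sketch, assuming a linear single-process extrapolation (the helper `estimate_total_hours` is hypothetical and not part of the notebook):

```python
import time

def estimate_total_hours(func, sample, n_total):
    """Time `func` on each item of `sample` and scale linearly to `n_total` items."""
    start = time.perf_counter()
    for item in sample:
        func(item)
    elapsed = time.perf_counter() - start
    return elapsed / len(sample) * n_total / 3600

# Cross-check against the progress bar: 49,582 reviews at about 1.95 reviews/s
hours_from_rate = 49582 / 1.95 / 3600
print(f"{hours_from_rate:.1f} hours")

# Stand-in call with a trivial function; in the notebook one would pass
# preprocess_text and a slice such as df['review'].head(50) instead.
demo_hours = estimate_total_hours(str.lower, ["A", "B"], 1000)
```

The estimate ignores parallel speed-up and per-review length differences, so it is only a rough upper guide.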

Once the data has been processed, we save it as a CSV file.

In [None]:
#Save the preprocessed data as a CSV file
df.to_csv("IMDB Dataset Processed.csv", index=False)  # index=False avoids writing the row index as an extra column

We represent the textual data as Bag of Words or TF-IDF vectors for use when training the models.
We then split the Bag of Words representation into train and test sets.

In [None]:
#Representing the textual data in a suitable model (i.e. Bag of Words, TF-IDF Vectors)

#Represent the text data using Bag of Words
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(df['cleaned_review'])

#Alternatively, represent the text data using TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(df['cleaned_review'])


#Split the data into training and test sets. Stratifying on the sentiment labels keeps the class proportions the same in both sets.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

#Labels (i.e. Sentiment)
y = df['sentiment']
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X_bow, y_encoded, test_size=0.2, random_state=42, stratify=y)
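
Fitting a vectorizer on the full dataset before splitting lets the test reviews influence the vocabulary and IDF weights. A common alternative is to split the raw text first and fit the vectorizer on the training portion only. A minimal sketch of that ordering, using a made-up toy corpus in place of `df['cleaned_review']` and the encoded labels (the texts, labels and variable names below are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the cleaned reviews and their encoded labels
texts = ["great movie", "loved acting", "wonderful story", "brilliant cast",
         "terrible plot", "boring film", "awful script", "poor pacing"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, stratified on the labels
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(text_train)  # learn vocabulary and IDF from training text only
X_test = tfidf.transform(text_test)        # reuse the fitted vocabulary for the test text

print(X_train.shape, X_test.shape)
```

The same pattern applies to `CountVectorizer`. Words that appear only in the test text are ignored by `transform`, which mirrors what happens with unseen reviews at prediction time.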