# Dataset preparation

This notebook prepares a dataset for IMDb sentiment analysis
from raw reviews.
The raw data is available on [Kaggle](https://www.kaggle.com/datasets/pawankumargunjan/imdb-review).

Importing libraries and downloading necessary packages.

In [1]:
import os
import re
import random
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from tqdm.notebook import tqdm
import pandas as pd
from bs4 import BeautifulSoup

In [2]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Maks\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Maks\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Maks\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

The default train-test split divides the data in half: 25,000 training instances and 25,000 test instances.
A custom train-test split with our own ratio leaves more data for training and may give better results.
For this purpose, I merged all the training and test file paths into one source containing 50,000 reviews.

Additionally, we have 50,000 unlabeled instances that can be used for unsupervised training.
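
As a minimal sketch of such a custom split, the labels can be divided per class so that both parts keep the same class balance. The function name `stratified_split`, the 80/20 ratio, and the seed are assumptions for illustration and do not appear elsewhere in this notebook; the toy labels stand in for the 50,000 merged ones.

```python
import random

def stratified_split(labels, test_ratio=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class with the same ratio."""
    rng = random.Random(seed)

    # Group the row indices by their label.
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy stand-in for the merged labels: 10 positive and 10 negative reviews.
toy_labels = ["pos"] * 10 + ["neg"] * 10
train_idx, test_idx = stratified_split(toy_labels, test_ratio=0.2)
print(len(train_idx), len(test_idx))  # 16 4
```

The returned indices could then be used to select rows from a dataframe of reviews, for example with `DataFrame.iloc`.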

In [3]:
filepaths_train_pos = []
for dirpath, dirnames, filenames in os.walk("D:\\Datasets\\aclImdb\\train\\pos\\"):
    for file in filenames:
        filepaths_train_pos.append(os.path.join(dirpath, file))

filepaths_train_neg = []
for dirpath, dirnames, filenames in os.walk("D:\\Datasets\\aclImdb\\train\\neg\\"):
    for file in filenames:
        filepaths_train_neg.append(os.path.join(dirpath, file))
    
filepaths_test_pos = []
for dirpath, dirnames, filenames in os.walk("D:\\Datasets\\aclImdb\\test\\pos\\"):
    for file in filenames:
        filepaths_test_pos.append(os.path.join(dirpath, file))

filepaths_test_neg = []
for dirpath, dirnames, filenames in os.walk("D:\\Datasets\\aclImdb\\test\\neg\\"):
    for file in filenames:
        filepaths_test_neg.append(os.path.join(dirpath, file))
        
sup_filepaths = []
sup_filepaths.extend(filepaths_train_pos)
sup_filepaths.extend(filepaths_train_neg)
sup_filepaths.extend(filepaths_test_pos)
sup_filepaths.extend(filepaths_test_neg)

print(f"Number of supervised filepaths: {len(sup_filepaths)}")

unsup_filepaths = []
for dirpath, dirnames, filenames in os.walk("D:\\Datasets\\aclImdb\\train\\unsup\\"):
    for file in filenames:
        unsup_filepaths.append(os.path.join(dirpath, file))
        
print(f"Number of unsupervised filepaths: {len(unsup_filepaths)}")

Number of supervised filepaths: 50000
Number of unsupervised filepaths: 50000
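
The five loops above differ only in the directory they walk, so they could be folded into one helper. The sketch below is one possible way to do that, not the code this notebook ran: `list_files` is a hypothetical name, the result is sorted to make the order reproducible, and the directory tree is a temporary stand-in with placeholder files.

```python
import os
import tempfile

def list_files(root):
    """Collect the paths of all files under root, in sorted order."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)

# Build a tiny stand-in for the train/test and pos/neg folders.
with tempfile.TemporaryDirectory() as root:
    for split in ("train", "test"):
        for label in ("pos", "neg"):
            folder = os.path.join(root, split, label)
            os.makedirs(folder)
            with open(os.path.join(folder, "example.txt"), "w", encoding="utf-8") as f:
                f.write("placeholder review")

    # Merge the four labeled folders into one list, as the cell above does.
    sup_paths_demo = []
    for split in ("train", "test"):
        for label in ("pos", "neg"):
            sup_paths_demo.extend(list_files(os.path.join(root, split, label)))

    n_found = len(sup_paths_demo)

print(n_found)  # 4
```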


Extracting reviews and labels from txt files

In [4]:
%%time
sup_text = []
labels = []
for filepath in tqdm(sup_filepaths):
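    # Assumes the review files are UTF-8 encoded; without an explicit encoding,
    # open() falls back to the locale's default encoding.
    with open(filepath, 'r', encoding='utf-8') as file:
        sup_text.append(file.read())
        # The name of the parent directory ('pos' or 'neg') is the label.
        labels.append(os.path.basename(os.path.dirname(filepath)))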

CPU times: total: 30.5 s
Wall time: 6min 58s
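
The label is taken from the name of the directory that contains each file. The short sketch below illustrates this with `ntpath`, the Windows flavor of `os.path`, so that it behaves the same on any operating system; `label_from_path` and the file names are hypothetical and serve only as an illustration.

```python
import ntpath  # Windows-style path handling, usable on any OS

def label_from_path(filepath):
    """Return the name of the directory that directly contains the file."""
    return ntpath.basename(ntpath.dirname(filepath))

# Hypothetical file names inside the dataset folders.
example_paths = [
    "D:\\Datasets\\aclImdb\\train\\pos\\review_a.txt",
    "D:\\Datasets\\aclImdb\\test\\neg\\review_b.txt",
]
example_labels = [label_from_path(p) for p in example_paths]
print(example_labels)  # ['pos', 'neg']
```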


In [5]:
%%time
unsup_text = []
for filepath in tqdm(unsup_filepaths):
    # Assumes UTF-8 encoded files, as for the supervised reviews.
    with open(filepath, 'r', encoding='utf-8') as file:
        unsup_text.append(file.read())

CPU times: total: 27.8 s
Wall time: 5min 56s


Creating a Pandas dataframe with review text and corresponding labels for supervised data, and another dataframe with review text for unsupervised data.

In [6]:
df_supervised_text = pd.DataFrame({"review_text":sup_text, "label":labels})

df_unsupervised_text = pd.DataFrame({"review_text":unsup_text})

In [7]:
df_supervised_text.head()

| | review_text | label |
|---|---|---|
| 0 | Bromwell High is a cartoon comedy. It ran at t... | pos |
| 1 | Homelessness (or Houselessness as George Carli... | pos |
| 2 | Brilliant over-acting by Lesley Ann Warren. Be... | pos |
| 3 | This is easily the most underrated film inn th... | pos |
| 4 | This is not the typical Mel Brooks film. It wa... | pos |


In [8]:
df_unsupervised_text.head()

| | review_text |
|---|---|
| 0 | I admit, the great majority of films released ... |
| 1 | Take a low budget, inexperienced actors doubli... |
| 2 | Everybody has seen 'Back To The Future,' right... |
| 3 | Doris Day was an icon of beauty in singing and... |
| 4 | After a series of silly, fun-loving movies, 19... |


We can reduce the number of tokens per review by removing stop words.

In [9]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

As we can see, the stop word list consists of common words. For a sentiment classification task, however, it is reasonable to keep some of them in the text, such as 'are', 'aren't', 'did', 'didn't', 'does', 'doesn't': removing these words can completely change the meaning of a sentence.
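
A small sketch makes the effect visible. The stop word lists here are hand-written miniatures chosen for illustration (an assumption, far shorter than the NLTK list), and `remove_stopwords` is a hypothetical helper.

```python
def remove_stopwords(sentence, stop_words):
    """Drop every whitespace-separated word that appears in stop_words."""
    return " ".join(w for w in sentence.lower().split() if w not in stop_words)

# Miniature stop word list that, like the default one, contains negations.
full_stop_words = {"the", "this", "is", "was", "a", "not", "did", "i"}
# The same list with the negation and auxiliary words taken out of it.
sentiment_stop_words = full_stop_words - {"not", "did", "is", "was"}

review = "This film was not good"
with_full_list = remove_stopwords(review, full_stop_words)
with_custom_list = remove_stopwords(review, sentiment_stop_words)
print(with_full_list)    # film good
print(with_custom_list)  # film was not good
```

With the full list the negative review turns into what looks like a positive one; with the reduced list the negation survives.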

In [12]:
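custom_stopwords = set(stopwords.words('english'))

# Keep auxiliary verbs, modal verbs, and their negated forms in the text:
# dropping them can flip the sentiment of a sentence.
custom_stopwords -= set([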
    'are', 'aren', "aren't", 
    'could', 'couldn', "couldn't", 
    'did', 'didn', "didn't", 
    'does', 'doesn', "doesn't", 
    'had', 'hadn', "hadn't", 
    'has', 'hasn', "hasn't", 
    'have', 'haven', "haven't", 
    'is', 'isn', "isn't", 
    'might', 'ma', 'mightn', "mightn't", 
    'must', 'mustn', "mustn't", 
    'need', 'needn', "needn't", 
    'shall', 'shan', "shan't", 
    'should', 'shouldn', "shouldn't", 
    'was', 'wasn', "wasn't", 
    'were', 'weren', "weren't", 
    'will', 'won', "won't", 
    'would', 'wouldn', "wouldn't", 
    'do', 'don', "don't", 
    't', 's',
    
    # Intensifiers and negations ('should' is already listed above)
    'too', 'very', 'no', 'not', 'against', 'nor', 'ain',
])

# Also treat the clitics that nltk.word_tokenize splits off contractions as stop words.
custom_stopwords.update([
    "'s", "'ve", "'ll", "'m"
])

print(custom_stopwords)

{'m', 'am', "'m", 'o', 'her', "you're", 'a', 'own', 'about', 'why', 'their', 'hers', 'him', 'where', 'now', 'here', 'yours', 'd', "that'll", "'ll", 'each', 'ourselves', "you'll", 'some', 'these', 'down', 'to', 'them', 'whom', 'or', 'between', 'are', 'as', 'herself', 'itself', 'both', 'because', 'same', 'most', 'and', 'they', 'myself', 'been', 'by', 'only', 'having', 'me', 'off', 'be', 'for', 'while', 'yourselves', 'into', 'himself', 'y', "'s", 'doing', 'i', 'of', 'other', 'can', 'it', "it's", 'yourself', 'theirs', 'up', 'at', 'so', 're', 'this', 'in', 'ours', 'that', 'how', 'with', 'below', 'few', "should've", 'until', 'our', 'when', 'after', "you'd", 'out', 'during', 'under', 'an', 'there', 'above', 'more', 'what', 'through', 'the', 'your', 'she', 'my', 'who', 'then', 'those', "she's", 'from', 'which', 'than', 'you', 'once', 'again', 'any', 'his', 'further', 'll', 'over', 'all', 'its', 'if', "'ve", 'themselves', "you've", 'we', 'he', 'such', 've', 'before', 'on', 'being', 'just', 'but

We will create functions for preprocessing the text and counting its tokens. The raw data contains HTML markup, so we use the bs4 library to strip it. Next, we normalize punctuation with the re library, convert the text to lowercase, remove the words in the custom stop word list, and lemmatize the remaining words. Lemmatization maps inflected forms to a common base form, which reduces the variety among sentences that are similar in meaning but differ in word forms.

In [38]:
lemmatizer = WordNetLemmatizer()

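def clean_text_fn(x):
    # Strip HTML markup such as <br /> tags.
    x = BeautifulSoup(x, 'html.parser').get_text()

    # Treat ';' and ':' as sentence breaks, drop all other special characters,
    # pad periods with spaces so they become separate tokens, and collapse whitespace.
    x = re.sub(r"[;:]", '.', x)
    x = re.sub(r"[^a-zA-Z0-9'!?.,;: ]", '', x)
    x = re.sub(r'\.', ' . ', x)
    x = re.sub(r'\s+', ' ', x)

    x = x.lower()

    words = nltk.word_tokenize(x)

    # Remove stop words; the words taken out of custom_stopwords stay in the text.
    words = [word for word in words if word not in custom_stopwords]

    # Lemmatize each word as a verb first, then as a noun (the default part of speech).
    words = [lemmatizer.lemmatize(word, "v") for word in words]
    words = [lemmatizer.lemmatize(word) for word in words]

    return ' '.join(words)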

def get_length(x):
    return len(x.split())
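
To see what the regular-expression steps do on their own, the sketch below applies the same four substitutions and the lowercasing to a made-up sentence, without tokenization, stop word removal, or lemmatization. The helper name `normalize_punctuation`, the sample text, and the final `strip()` are assumptions added for the illustration.

```python
import re

def normalize_punctuation(text):
    """Apply only the regex steps of the cleaning function, then lowercase."""
    text = re.sub(r"[;:]", ".", text)                  # ';' and ':' become periods
    text = re.sub(r"[^a-zA-Z0-9'!?.,;: ]", "", text)   # drop every other symbol
    text = re.sub(r"\.", " . ", text)                  # make each '.' a separate token
    text = re.sub(r"\s+", " ", text)                   # collapse repeated whitespace
    return text.lower().strip()

sample = "Great cast; weak plot... (2/10)"
normalized = normalize_punctuation(sample)
print(normalized)  # great cast . weak plot . . . 210
```

Note that the rating "2/10" becomes "210", because the slash is removed without leaving a space; the same happens to hyphenated words.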

The following cell applies the cleaning function to a random review.

In [56]:
test_text = random.choice(df_supervised_text["review_text"].values)
print(test_text)
print("\n")
print(clean_text_fn(test_text))
print("\n")
print(f"The length of raw review:        {len(test_text.split())}")
print(f"The length of cleaned review:    {len(clean_text_fn(test_text).split())}")

can any movie become more naive than this? you cant believe a piece of this script. and its ssooooo predictable that you can tell the plot and the ending from the first 10 minutes. the leading actress seems like she wants to be Barbie (but she doesn't make it, the doll has MORE acting skills).<br /><br />the easiness that the character passes and remains in a a music school makes the phantom of the opera novel seem like a historical biography. i wont even comment on the shallowness of the characters but the ONE good thing of the film is Madsen's performance which manages to bring life to a melo-like one-dimensional character.<br /><br />The movie is so cheesy that it sticks to your teeth. i can think some 13 year old Britney-obsessed girls shouting "O, do give us a break! If we want fairy tales there is always the Brothers Grimm book hidden somewhere in the attic". I gave it 2 instead of one only for Virginia Madsen.


movie become naive ? cant believe piece script . ssooooo predictabl

Adding the cleaned text and the number of tokens to the corresponding pandas dataframes.

In [57]:
%%time
df_supervised_text["cleaned_review_text"] = df_supervised_text["review_text"].apply(clean_text_fn)
df_supervised_text["number_of_tokens"] = df_supervised_text["cleaned_review_text"].apply(get_length)


CPU times: total: 6min 1s
Wall time: 6min 10s


In [58]:
%%time
df_unsupervised_text["cleaned_review_text"] = df_unsupervised_text["review_text"].apply(clean_text_fn)
df_unsupervised_text["number_of_tokens"] = df_unsupervised_text["cleaned_review_text"].apply(get_length)


CPU times: total: 6min 11s
Wall time: 6min 21s


The prepared supervised dataframe contains the raw reviews, labels, cleaned reviews, and the number of tokens in each cleaned review. The prepared unsupervised dataframe has the same columns except for the labels.

In [59]:
df_supervised_text.head()

| | review_text | label | cleaned_review_text | number_of_tokens |
|---|---|---|---|---|
| 0 | Bromwell High is a cartoon comedy. It ran at t... | pos | bromwell high be cartoon comedy . run time pro... | 113 |
| 1 | Homelessness (or Houselessness as George Carli... | pos | homelessness houselessness george carlin state... | 263 |
| 2 | Brilliant over-acting by Lesley Ann Warren. Be... | pos | brilliant overact lesley ann warren . best dra... | 110 |
| 3 | This is easily the most underrated film inn th... | pos | be easily underrate film inn brook cannon . su... | 90 |
| 4 | This is not the typical Mel Brooks film. It wa... | pos | be not typical mel brook film . be much le sla... | 81 |


In [60]:
df_unsupervised_text.head()

| | review_text | cleaned_review_text | number_of_tokens |
|---|---|---|---|
| 0 | I admit, the great majority of films released ... | admit , great majority film release say 1933 n... | 109 |
| 1 | Take a low budget, inexperienced actors doubli... | take low budget , inexperienced actor double p... | 113 |
| 2 | Everybody has seen 'Back To The Future,' right... | everybody have see 'back future , ' right ? wh... | 238 |
| 3 | Doris Day was an icon of beauty in singing and... | doris day be icon beauty sing act warm voice g... | 72 |
| 4 | After a series of silly, fun-loving movies, 19... | series silly , funloving movie , 1955 be big y... | 144 |


Creating CSV files that contain the prepared dataframes.

In [61]:
df_supervised_text.to_csv("supervised_with_stop_words.csv", index=False)
df_unsupervised_text.to_csv("unsupervised_with_stop_words.csv", index=False)