# Dataset Preprocessing

## Import The Necessary Libraries

In [35]:
import subprocess
import sys

# List of required packages
required_packages = [
    "pandas", "spacy", "nltk"
]

def install_package(package):
    """Installs a package using pip if it's not already installed."""
    try:
        __import__(package)
        print(f"{package} is already installed.")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Check and install missing packages
for package in required_packages:
    install_package(package)

pandas is already installed.
spacy is already installed.
nltk is already installed.
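
The check above works because the pip name and the import name are identical for `pandas`, `spacy`, and `nltk`; that is not true of every package (for example, `scikit-learn` is imported as `sklearn`). The following is a minimal sketch of an availability check that takes the import name explicitly and uses `importlib.util.find_spec`, so the module is located without being imported. The helper name `is_installed` is hypothetical and not part of this notebook.

```python
import importlib.util

def is_installed(import_name):
    """Return True if a top-level module with this import name can be located."""
    return importlib.util.find_spec(import_name) is not None

# 'json' ships with Python; the second name is deliberately nonexistent
print(is_installed("json"))
print(is_installed("surely_not_a_real_module_name_123"))
```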


## First Dataset: `others_reviews_df.pkl`

In this section, we load and explore the `others_reviews_df.pkl` dataset.
The goal is to understand the structure of the DataFrame, in particular to identify the column that contains the review text, so that we can apply the preprocessing required for keyword extraction with KeyBERT.

In [36]:
import pandas as pd

# Load the dataset
df = pd.read_pickle('../Dataset/others_reviews_df.pkl')

# Display available columns and the first rows
print("Available columns:", df.columns.tolist())
df.head()

Available columns: ['Review_ID', 'Movie_ID', 'Movie_Title', 'Rating', 'Review_Date', 'Review_Title', 'Review_Text', 'Helpful_Votes', 'Total_Votes']


| | Review_ID | Movie_ID | Movie_Title | Rating | Review_Date | Review_Title | Review_Text | Helpful_Votes | Total_Votes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 9637661 | tt6751668 | Parasite | 5.0 | 23 February 2024 | Solid Film Craftsmanship, Trash Story | I'm genuinely baffled this film won not only b... | 3.0 | 8.0 |
| 1 | 5510542 | tt6751668 | Parasite | 10.0 | 26 February 2020 | MASTERPIECE | Just watch it. It has everything; entertainmen... | 3.0 | 5.0 |
| 2 | 5182892 | tt6751668 | Parasite | 10.0 | 12 October 2019 | First Hit: I really enjoyed this story as it d... | First Hit: I really enjoyed this story as it d... | 24.0 | 40.0 |
| 3 | 5499682 | tt6751668 | Parasite | 9.0 | 21 February 2020 | If you love cliché stories this movie is not f... | I was not expecting that much of this movie. N... | 2.0 | 5.0 |
| 4 | 6094155 | tt6751668 | Parasite | 8.0 | 14 September 2020 | Amazing. | Good acting, cinematography, twists and screen... | 0.0 | 0.0 |


### Preprocessing of `Review_Text` for KeyBERT

### Installing and Using `en_core_web_sm`

`en_core_web_sm` is a lightweight English language model provided by **spaCy**. It includes essential NLP features such as:
- **Tokenization**: Splits text into individual words.
- **Part-of-Speech (POS) Tagging**: Assigns grammatical categories to words.
- **Lemmatization**: Converts words to their base forms.
- **Named Entity Recognition (NER)**: Identifies entities like names, dates, and locations.

In our case, we use `en_core_web_sm` specifically for **lemmatization**, which standardizes words by reducing them to their base (dictionary) form. This improves the quality of keyword extraction with **KeyBERT**, because inflected variants of the same word (for example, "film" and "films") collapse into a single candidate keyword.

Before using it, we need to install the model.

In [37]:
import spacy
import warnings
warnings.simplefilter("ignore", category=UserWarning)

# Check if en_core_web_sm is already installed
try:
    spacy.load("en_core_web_sm")
    print("spaCy model 'en_core_web_sm' is already installed.")
# Install the model if it's not already installed
except OSError:
    print("Downloading 'en_core_web_sm' model...")
    subprocess.check_call([sys.executable, "-m", "spacy", "download", "en_core_web_sm"])
    print("Model 'en_core_web_sm' installed successfully.")

spaCy model 'en_core_web_sm' is already installed.


### Preprocessing the Reviews

To prepare the review text for keyword extraction with KeyBERT, we apply several preprocessing steps:
- Convert all text to **lowercase**.
- Remove **punctuation** and **special characters**.
- Remove **common stopwords** to focus on meaningful words.
- Apply **lemmatization** or **stemming** to standardize word forms.
- (**Optional**) Filter out very short reviews that may not provide useful keywords.
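
The optional last step is not implemented in this notebook. The following is a minimal sketch of one way to do it, assuming a review counts as too short when it has fewer than `min_words` whitespace-separated tokens after preprocessing. The helper name `has_enough_words` and the threshold of 3 are illustrative assumptions, not values taken from the original analysis.

```python
def has_enough_words(text, min_words=3):
    """Return True if text has at least min_words whitespace-separated tokens."""
    if not isinstance(text, str):  # Treat missing or non-string values as too short
        return False
    return len(text.split()) >= min_words

# Small illustrative sample (not taken from the dataset files)
reviews = ["masterpiece", "good act cinematography twist screenplay", None]
kept = [r for r in reviews if has_enough_words(r)]
print(kept)
```

Once a processed text column exists in a DataFrame, the same predicate could serve as a boolean mask through `Series.apply`.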

**NOTE:**

Both lemmatization and stemming reduce words to a base form, but they differ in important ways:
- Lemmatization uses vocabulary and morphological analysis to find the dictionary form (lemma) of a word. It is more accurate but slower.
    - Example: “running” → “run”, “better” → “good”.
- Stemming strips suffixes with heuristic rules, without considering meaning, and can produce strings that are not real words. It is faster but less precise.
    - Example (Porter stemmer): “running” → “run”, “studies” → “studi”, “better” → “better”.

For KeyBERT, lemmatization is preferable as it preserves readable and correct words.
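
The contrast can be illustrated with a deliberately simplified sketch. This is a toy: `toy_lemmatize` is a lookup in a hypothetical three-word dictionary and `toy_stem` strips a few suffixes blindly, so neither reproduces spaCy's lemmatizer or the Porter algorithm. It only shows why dictionary-based reduction yields real words while rule-based stripping may not.

```python
# Toy illustration only: neither spaCy's lemmatizer nor the Porter algorithm.
TOY_LEMMAS = {"running": "run", "studies": "study", "better": "good"}  # hypothetical mini-dictionary

def toy_lemmatize(word):
    """Look the word up in a dictionary of known lemmas; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

def toy_stem(word):
    """Strip the first matching suffix, with no knowledge of the word's meaning."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

for w in ["running", "studies", "better"]:
    print(w, "-> lemma:", toy_lemmatize(w), "| stem:", toy_stem(w))
```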

In [38]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Load NLP tools
nlp = spacy.load("en_core_web_sm")                      # Load spaCy's English pipeline, used here for lemmatization
nltk.download('stopwords')                              # Download NLTK's stopwords to remove them from text
stop_words = set(stopwords.words('english'))            # Get the list of stopwords in English
stemmer = PorterStemmer()                               # Initialize Porter Stemmer

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/manuelemustari/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [39]:
print(stop_words)

{"you'll", 'she', 'all', 'against', 'while', 'both', 'each', 'o', 'yourself', 'can', 'our', 'very', 'those', 'will', 'shan', 'some', 'ain', 'below', 'doesn', 't', "it'll", 'how', 'or', 'which', 'are', 'ma', 'mightn', "it'd", 'we', "we're", 'during', 'themselves', 'he', 'ours', "needn't", "haven't", "she's", 'isn', 'am', 'aren', 'yourselves', 'myself', "hasn't", 'him', "wouldn't", 'through', 'hers', 'being', 'in', 'didn', 'don', "won't", 'theirs', 'then', "aren't", 'haven', "she'll", "you've", 'doing', 'needn', 'been', "she'd", 'won', 'by', 'should', 's', 'as', 'again', 'down', "we've", 'mustn', 'so', "mightn't", 'there', 'hadn', "weren't", 'because', 'at', 'their', 'who', 'with', 'few', "they're", 'shouldn', 'where', 'were', 'an', 'having', 'any', "he's", 'out', 'nor', 'other', 'here', 'of', 'does', 'hasn', 'before', "isn't", "mustn't", 'once', 'only', 'such', 'up', "wasn't", 'further', 'if', 'above', 'll', 'more', 'my', "hadn't", "we'd", 'a', 'wasn', 'you', 'than', "i've", 'most', 'no

In [40]:
# Clean Text Function
def clean_text(text):
    """ Convert text to lowercase and remove special characters """
    if not isinstance(text, str):  # Handle non-string values
        return ""
    
    text = text.lower()  # Lowercase
    text = re.sub(r'[^a-z\s]', '', text)  # Keep only a-z and whitespace (drops punctuation, digits, accented letters)
    return text

# Lemmatization Function
def lemmatize_text(text):
    """ Apply lemmatization using spaCy """
    # Run the spaCy pipeline (tokenization, tagging, lemmatization)
    doc = nlp(text)

    # Keep the lemma of every token that is not a stopword, then join
    return " ".join([token.lemma_ for token in doc if token.text not in stop_words])

# Stemming Function
def stem_text(text):
    """ Apply stemming using Porter Stemmer """
    # Tokenize the text
    words = text.split()
    # Stem each word and join
    return " ".join([stemmer.stem(word) for word in words if word not in stop_words])

# Preprocess Reviews Function
def preprocess_text(text, method="lemma"):
    """
    Preprocess text by cleaning and applying either lemmatization or stemming.
    - 'lemma' applies lemmatization using spaCy.
    - 'stem' applies stemming using NLTK.
    """
    cleaned = clean_text(text)

    if method == "lemma":
        return lemmatize_text(cleaned)
    elif method == "stem":
        return stem_text(cleaned)
    else:
        raise ValueError("Invalid method. Choose 'lemma' or 'stem'.")
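
Because the pattern `[^a-z\s]` in `clean_text` keeps only unaccented lowercase letters, accented characters are deleted rather than transliterated, so `cliché` becomes `clich` (visible in the processed titles further down). The following is a minimal sketch of a possible variant, not used in this notebook, that decomposes accented characters with `unicodedata.normalize` before applying the same regular expression. The name `clean_text_ascii` is hypothetical.

```python
import re
import unicodedata

def clean_text_ascii(text):
    """Lowercase, fold accented letters to ASCII, then drop non-letter characters."""
    if not isinstance(text, str):  # Handle non-string values
        return ""
    # NFKD splits 'é' into 'e' plus a combining accent; the ASCII encode drops the accent
    folded = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z\s]", "", folded.lower())

print(clean_text_ascii("If you love cliché stories"))
```

Swapping this in would change the processed output, so results produced with the original `clean_text` would need to be regenerated.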

In [41]:
from tqdm import tqdm

# Enable tqdm for pandas apply
tqdm.pandas()

# Create a copy of the original DataFrame
df_processed = df.copy()

# Apply lemmatization to review body text
df_processed['Processed_Review_Text'] = df_processed['Review_Text'].progress_apply(lambda x: preprocess_text(x, method="lemma"))

# Apply lemmatization to review titles (same method as the review text)
df_processed['Processed_Review_Title'] = df_processed['Review_Title'].progress_apply(lambda x: preprocess_text(x, method="lemma"))

# Display a preview of the result
df_processed[['Review_Title', 'Processed_Review_Title', 'Review_Text', 'Processed_Review_Text']].head()

100%|██████████| 15132/15132 [12:19<00:00, 20.47it/s]
100%|██████████| 15132/15132 [01:10<00:00, 213.60it/s]


| | Review_Title | Processed_Review_Title | Review_Text | Processed_Review_Text |
|---|---|---|---|---|
| 0 | Solid Film Craftsmanship, Trash Story | solid film craftsmanship trash story | I'm genuinely baffled this film won not only b... | genuinely baffle film good foreign film good d... |
| 1 | MASTERPIECE | masterpiece | Just watch it. It has everything; entertainmen... | watch everything entertainment comedy thrill h... |
| 2 | First Hit: I really enjoyed this story as it d... | first hit really enjoy story dive hilarious ab... | First Hit: I really enjoyed this story as it d... | first hit really enjoy story dive hilarious ab... |
| 3 | If you love cliché stories this movie is not f... | love clich story movie | I was not expecting that much of this movie. N... | expect much movie normally film nominate oscar... |
| 4 | Amazing. | amazing | Good acting, cinematography, twists and screen... | good act cinematography twist screenplay side ... |


### Saving the First Preprocessed Dataset

After completing the text preprocessing steps, we save the resulting DataFrame as a `.pkl` file to ensure consistency with the original dataset format.  
The output filename uses the same base name as the original file, prefixed with `preprocessed_`, allowing us to distinguish it while preserving traceability.
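
The naming convention can also be derived from the input path instead of being typed by hand. The following is a minimal sketch using `pathlib`; the helper name `preprocessed_path` is hypothetical, and `PurePosixPath` is used only so that the printed result does not depend on the operating system.

```python
from pathlib import PurePosixPath

def preprocessed_path(input_path):
    """Return the input path with 'preprocessed_' prefixed to the file name."""
    path = PurePosixPath(input_path)
    return str(path.with_name("preprocessed_" + path.name))

print(preprocessed_path("../Dataset/others_reviews_df.pkl"))
```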

In [42]:
# Save the processed DataFrame to a new pickle file
output_path = '../Dataset/preprocessed_others_reviews_df.pkl'
df_processed.to_pickle(output_path)
print(f"Preprocessed dataset saved to: {output_path}")

Preprocessed dataset saved to: ../Dataset/preprocessed_others_reviews_df.pkl


## Second Dataset: `sw_reviews_df.pkl`

We now apply the same preprocessing pipeline to the second dataset, `sw_reviews_df.pkl`.  
The objective is to clean and standardize both the review text and the review titles using the previously defined functions:
- Lemmatization is applied to the main review text.
- Lemmatization is also applied to the review titles.

In [None]:
# Load the second dataset
df_sw = pd.read_pickle('../Dataset/sw_reviews_df.pkl')

# Enable tqdm for pandas apply
tqdm.pandas()

# Create a processed copy
df_sw_processed = df_sw.copy()

# Apply preprocessing
df_sw_processed['Processed_Review_Text'] = df_sw_processed['Review_Text'].progress_apply(lambda x: preprocess_text(x, method="lemma"))
df_sw_processed['Processed_Review_Title'] = df_sw_processed['Review_Title'].progress_apply(lambda x: preprocess_text(x, method="lemma"))

# Display a preview of the result
df_sw_processed[['Review_Title', 'Processed_Review_Title', 'Review_Text', 'Processed_Review_Text']].head()

FileNotFoundError: [Errno 2] No such file or directory: '../Dataset/sw_reviews.pkl'
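
A `FileNotFoundError` like the one above is raised when the path passed to `pd.read_pickle` does not exist. The following is a minimal sketch of a check that could run before loading and that reports which pickle files are actually present in the target directory. The helper name `require_file` is hypothetical and not part of this notebook.

```python
from pathlib import Path

def require_file(path):
    """Return path as a Path if it is an existing file, else raise with a hint."""
    p = Path(path)
    if p.is_file():
        return p
    # List the .pkl files next to the expected location, if that directory exists
    siblings = sorted(f.name for f in p.parent.glob("*.pkl")) if p.parent.is_dir() else []
    raise FileNotFoundError(f"{p} not found. Pickle files in {p.parent}: {siblings}")

try:
    require_file("../Dataset/sw_reviews_df.pkl")
except FileNotFoundError as err:
    print(err)
```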

### Saving the Second Preprocessed Dataset

The preprocessed version of `sw_reviews_df.pkl` is saved in `.pkl` format using the same naming convention as before.  
The file is named `preprocessed_sw_reviews_df.pkl`, matching the output path in the code below and keeping names consistent across datasets.

In [None]:
# Save the processed DataFrame to a new pickle file
output_path = '../Dataset/preprocessed_sw_reviews_df.pkl'
df_sw_processed.to_pickle(output_path)
print(f"Preprocessed dataset saved to: {output_path}")