# Preprocessing of Review Texts for Transformer-Based Keyword Extraction

This section focuses on applying essential **text preprocessing** to a collection of movie review datasets located in the `Reviews_By_Movie` folder. Each file in the folder is a `.pkl` dataset containing raw user reviews for a specific movie, including:

- The 9 *Star Wars* episodes: `SW_Episode1.pkl` to `SW_Episode9.pkl`  
- Other films: `HarryPotter.pkl`, `IndianaJones.pkl`, `LaLaLand.pkl`, `Parasite.pkl`, `GoodBadUgly.pkl`, `Oppenheimer.pkl`  

For each dataset, a new column called `Preprocessed_Review` will be created, containing the **cleaned and normalized version** of the original review text.

### Context: Transformer-based Keyword Extraction

Since these reviews will later be processed by a Transformer-based keyword extraction model (specifically **KeyBERT** with the `all-MiniLM-L6-v2` embedding backend), the preprocessing is deliberately **minimal but targeted**.

The model's tokenizer already takes care of tokenization, subword splitting, casing, and truncation. The goal of this preprocessing is therefore not to reshape the text dramatically, but to improve its quality and consistency, especially for keyword selection.

### Preprocessing Steps


#### 1. **Punctuation Spacing Normalization**
Inserts a missing space after punctuation marks (e.g., `.`, `!`, `?`), but **only if** the next character is a word character, so existing spacing and numeric formatting are left alone.

- `"hello.great"` → `"hello. great"`
- Numbers are preserved: `$300,000.00` remains unchanged.
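
The rule above can be sketched with a single regular expression. This is a minimal illustration, not the code from `preprocessing.py`: the function name, the exact set of punctuation marks, and the choice to require a *letter* (rather than any word character) after the mark are assumptions made here so that digits inside numbers are never separated.

```python
import re

# Punctuation followed directly by a letter. Digits are excluded on purpose
# (an assumption of this sketch) so that "$300,000.00" is not split.
_MISSING_SPACE = re.compile(r"([.!?,;:])(?=[^\W\d_])")


def normalize_punctuation_spacing(text: str) -> str:
    """Insert a single space after punctuation glued to the next word."""
    return _MISSING_SPACE.sub(r"\1 ", text)


print(normalize_punctuation_spacing("hello.great"))
print(normalize_punctuation_spacing("Costs $300,000.00!Wow"))
```

A pattern this simple would also split abbreviations such as `e.g.` and URLs, so a production version would likely need extra guards.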

#### 2. **Typo Correction**
Typo correction is handled using the `autocorrect` library, with an enhanced strategy to deal with common limitations of naive spell-checking.  
Special attention is paid to **letter repetitions**, which often occur in user-generated reviews (e.g., `"loooong"`, `"baaad"`, `"amazzing"`).

**How it works:**
- **Valid English words** (based on frequency in the `wordfreq` lexicon) are left unchanged.
- **Whitespace is normalized** to collapse multiple spaces.
- **Capitalized words** not at the beginning of a sentence are assumed to be **proper nouns** and left untouched.
- Words containing a run of **3+ identical letters** have that run reduced to 2 letters, then to 1 if necessary, with a validity check after each step.
- If still unrecognized, the word is passed to `autocorrect` as a last resort.
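
The decision order described above can be sketched for a single word as follows. This is an illustrative sketch, not the actual `Preprocessor` code: `correct_word`, `is_valid`, `fallback`, and `sentence_start` are hypothetical names. In the notebook's setting, `is_valid` could wrap `zipf_frequency(word, "en")` from `wordfreq` with some cutoff (the cutoff is not given in the source), and `fallback` could be an `autocorrect` `Speller` instance. Toy stand-ins are used here so the example runs without extra dependencies.

```python
import re
from typing import Callable

# A run of three or more identical characters, e.g. the "ooo" in "looong".
_LONG_RUN = re.compile(r"(.)\1{2,}")


def correct_word(
    word: str,
    is_valid: Callable[[str], bool],
    fallback: Callable[[str], str],
    sentence_start: bool = False,
) -> str:
    """Apply the checks in order and stop at the first one that succeeds."""
    # Capitalized mid-sentence words are assumed to be proper nouns.
    if word[:1].isupper() and not sentence_start:
        return word

    lowered = word.lower()
    if is_valid(lowered):
        return word

    # Shrink long runs to two letters first, then to one.
    for replacement in (r"\1\1", r"\1"):
        candidate = _LONG_RUN.sub(replacement, lowered)
        if candidate != lowered and is_valid(candidate):
            # Keep the capital letter of a sentence-initial word.
            return candidate.capitalize() if word[:1].isupper() else candidate

    # Last resort: hand the word to the general-purpose spell-checker.
    return fallback(word)


# Toy stand-ins (assumptions) so the sketch runs on its own.
vocabulary = {"good", "bad", "long", "stunning"}
is_valid = vocabulary.__contains__
fallback = lambda w: w  # a real spell-checker would go here

for raw in ["good", "baaad", "loooong", "stunnnning"]:
    print(raw, "->", correct_word(raw, is_valid, fallback))
```

Checking validity before touching the word is what keeps `"good"` from ever reaching the spell-checker, and trying the two-letter form before the one-letter form is what lets `"stunnnning"` resolve to `"stunning"` rather than a shorter, invalid string.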

This avoids common pitfalls of naive spell-checking, such as:
- `"good"` being miscorrected to `"god"`
- `"baad"` corrected to `"band"` instead of `"bad"`

Repetition reduction also fixes elongated words directly, e.g. `"stunnnning"` → `"stunning"`.

#### 3. **Nonsense and Empty Review Removal**
Short or unintelligible reviews (e.g., only numbers, emojis, or symbols) are removed using a character ratio heuristic.  
This helps eliminate non-informative inputs from downstream analysis.
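
A character ratio heuristic of this kind might look like the sketch below. The function name and both thresholds are assumptions chosen for illustration; the values actually used in `preprocessing.py` are not shown in this notebook.

```python
def is_nonsense(text: str, min_alpha_ratio: float = 0.5, min_letters: int = 3) -> bool:
    """Return True when a review is empty or mostly non-alphabetic."""
    # Ignore whitespace so that padding does not affect the ratio.
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return True

    letters = sum(ch.isalpha() for ch in visible)
    # Reject texts with too few letters or dominated by digits/symbols.
    return letters < min_letters or letters / len(visible) < min_alpha_ratio


print(is_nonsense("This movie blew me away."))
print(is_nonsense("Trash" + "!" * 70))
```

With these illustrative thresholds, a review such as a short phrase followed by a long run of exclamation marks or dots is rejected because letters make up less than half of its visible characters.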

### Why **Lemmatization Is Not Applied**

Although lemmatization (e.g., `running` → `run`) is common in traditional NLP pipelines, it is **intentionally omitted** here.

#### BERT does not need it:
Transformer models like BERT are pre-trained on raw, natural text. Their **subword tokenization** and contextual embeddings already relate inflected forms of a word to each other, so lemmatization adds little and can **break expected token patterns**.  
For instance:
  - `"was"` → `"be"` (tense is lost)  
  - `"actors"` → `"actor"` (number is lost)  

These changes may reduce model performance in embedding generation.

#### KeyBERT prefers natural text:
KeyBERT ranks candidate keywords by comparing their embeddings with the embedding of the whole document, so it performs best with text that reflects how humans write. Lemmatization can distort the input and degrade semantic richness.

> **Note**: Stop word removal is **not performed here**, as it is handled by the **KeyBERT extensions** during keyword scoring.  
> This keeps the preprocessing flexible and model-agnostic.


## Setup: Installing and Importing Required Libraries

In [8]:
import subprocess
import sys

# List of required packages
required_packages = [
    "pandas",                         
    "spacy",           
    "autocorrect",
    "wordfreq",
    "tqdm"
]

def install_package(package):
    """Install a package with pip if it cannot be imported.

    Note: this assumes the import name matches the pip name, which holds
    for every package in `required_packages`.
    """
    try:
        __import__(package)
        print(f"{package} is already installed.")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Check and install missing packages
for package in required_packages:
    install_package(package)


pandas is already installed.
spacy is already installed.
autocorrect is already installed.
wordfreq is already installed.
tqdm is already installed.


In [9]:
# Standard library (subprocess and sys are needed for the model download below)
import os
import re
import subprocess
import sys

# Data handling and progress bars
import pandas as pd
from tqdm import tqdm

# Text processing
from autocorrect import Speller
from wordfreq import zipf_frequency  # type: ignore

# spaCy for NLP tasks
import spacy

# Load the English language model in spaCy (download if not present)
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print("Downloading 'en_core_web_sm' model...")
    subprocess.check_call([sys.executable, "-m", "spacy", "download", "en_core_web_sm"])
    nlp = spacy.load("en_core_web_sm")

### Importing the Custom Preprocessor

The next cell imports the `Preprocessor` class from the custom `preprocessing.py` module.  
The class encapsulates all the text cleaning operations required to prepare review texts before passing them to a Transformer-based model.  
It provides methods for typo correction, punctuation normalization, and filtering of nonsensical content, and is applied to each review in the dataset.

In [10]:
from preprocessing import Preprocessor  # Custom preprocessor module

## Batch Preprocessing of Movie Review Datasets

In this step, we apply the custom `Preprocessor` class to all movie review datasets stored in the `Reviews_By_Movie` folder.  
Each `.pkl` file corresponds to a different movie and contains a column named `Review_Text` with raw user reviews.

For each dataset, the following operations are performed:

- The reviews are preprocessed using the `Preprocessor.preprocess_review()` pipeline, which includes:
  - Typo correction with repetition reduction and intelligent spell-checking
  - Punctuation spacing normalization to ensure clean token boundaries
  - Nonsense or empty review filtering, based on character composition

- The cleaned review is stored in a new column called `Preprocessed_Review`.

- Any rows where preprocessing failed (e.g., unintelligible or empty reviews) are removed.

- The updated dataset is saved back to disk, overwriting the original file.

- A summary is printed showing the number of reviews before and after preprocessing. The discarded reviews are printed as well, so that each removal can be inspected.

> **Note**: Lemmatization is not applied, since the downstream model (KeyBERT) relies on Transformer embeddings, which already cope with inflected word forms. Lemmatization may reduce semantic richness and interfere with subword tokenization.
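
The batch loop relies on one property of `preprocess_review()`: it returns either a cleaned string or `None`, so rejected rows can be found with `isna()` and dropped. The sketch below illustrates that contract only. The real method lives in `preprocessing.py` and is not shown here, so the signature, the step order, and the toy stand-in steps are all assumptions.

```python
from typing import Callable, Iterable, Optional


def preprocess_review(
    text: object,
    steps: Iterable[Callable[[str], str]],
    should_reject: Callable[[str], bool],
) -> Optional[str]:
    """Run the cleaning steps in order; return None for rejected reviews."""
    if not isinstance(text, str):
        return None  # e.g. a missing value instead of a review
    for step in steps:
        text = step(text)
    text = " ".join(text.split())  # collapse repeated whitespace
    return None if should_reject(text) else text


# Hypothetical stand-ins for the real cleaning and filtering steps.
steps = [str.strip, lambda t: t.replace(".great", ". great")]
reject = lambda t: not any(ch.isalpha() for ch in t)

print(preprocess_review("  hello.great   movie ", steps, reject))
print(preprocess_review("!!! 123", steps, reject))
```

Returning `None` rather than an empty string keeps the rejection signal distinct from a valid but short review.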

In [None]:
# Folder path
folder_path = "../Dataset/Reviews_By_Movie"

# Initialize preprocessor
pre = Preprocessor()

# Process each .pkl file in the folder
for filename in os.listdir(folder_path):
    if filename.endswith(".pkl"):
        file_path = os.path.join(folder_path, filename)

        # Load dataset
        df = pd.read_pickle(file_path)
        original_count = len(df)

        # Apply preprocessing
        tqdm.pandas(desc=f"Processing {filename}")
        df["Preprocessed_Review"] = df["Review_Text"].progress_apply(pre.preprocess_review)

        # Identify and print removed rows
        removed = df[df["Preprocessed_Review"].isna()]
        if not removed.empty:
            print(f"\n--- Removed reviews from {filename} ---")
            for idx, row in removed.iterrows():
                print(f"[Review ID: {idx}] {row['Review_Text']}\n")

        # Drop rows where preprocessing returned None
        df.dropna(subset=["Preprocessed_Review"], inplace=True)
        new_count = len(df)

        # Save back to disk (overwrite)
        df.to_pickle(file_path)

        # Summary
        print(f"{filename}: {original_count} → {new_count} valid reviews after preprocessing\n")

Processing GoodBadUgly.pkl: 100%|██████████| 1430/1430 [06:11<00:00,  3.85it/s]



--- Removed reviews from GoodBadUgly.pkl ---
[Review ID: 40764] Exelent ........
...
....
....
.....
......
......

GoodBadUgly.pkl: 1430 → 1429 valid reviews after preprocessing



Processing Parasite.pkl: 100%|██████████| 3702/3702 [10:23<00:00,  5.94it/s] 


Parasite.pkl: 3702 → 3702 valid reviews after preprocessing



Processing SW_Episode2.pkl: 100%|██████████| 3880/3880 [18:52<00:00,  3.43it/s]  


SW_Episode2.pkl: 3880 → 3880 valid reviews after preprocessing



Processing SW_Episode3.pkl: 100%|██████████| 3876/3876 [18:42<00:00,  3.45it/s] 



--- Removed reviews from SW_Episode3.pkl ---
[Review ID: 15381] Unbearably dry writing with poor acting delivery can't match the magic of the originals. CGI shines.Screenplay...................................... 2 / 10 Acting............................................... 3 Cinematography/VFX............................ 10 Sound............................................... 10 Editing................................................ 3 Music....................................................... 8 Timeless Utility................................... 4 Total.................................................... 40 / 70 ~= 5.7 (rounded to 6) Verdict................................................. Enjoyable for some fans of the series.

SW_Episode3.pkl: 3876 → 3875 valid reviews after preprocessing



Processing Oppenheimer.pkl: 100%|██████████| 4375/4375 [23:38<00:00,  3.08it/s]  



--- Removed reviews from Oppenheimer.pkl ---
[Review ID: 45716] I really liked Oppenheimer. But! I was kinda disappointed by the atomic bomb scene because i thought it would be more impressive! He should have actually made it more like a real nuke, that's what i think.... . . . . . . . . . . .. . .. . . . . . . . . . . . . . . . . . . . . . .. . .. . .. . .. . . .. . . . . . .. .. . . . . . . . .. . . . . .. . . . . . . . . . . . . . . . . . .. . . . .. .. . . .. . . .. . .. .. . . . . . .. . .. . . . . . . . .. . .. .. . . . .. . . . . . . .. . . . .. . . . . . . . .. . . . . . . . . . . .. .. . .. . . . .. .. .. . . . . . .. . . . . . . . . . . . . . . ..

Oppenheimer.pkl: 4375 → 4374 valid reviews after preprocessing



Processing SW_Episode1.pkl: 100%|██████████| 4094/4094 [18:54<00:00,  3.61it/s] 


SW_Episode1.pkl: 4094 → 4094 valid reviews after preprocessing



Processing IndianaJones.pkl: 100%|██████████| 1197/1197 [05:04<00:00,  3.94it/s]


IndianaJones.pkl: 1197 → 1197 valid reviews after preprocessing



Processing SW_Episode4.pkl: 100%|██████████| 2158/2158 [08:14<00:00,  4.37it/s]



--- Removed reviews from SW_Episode4.pkl ---
[Review ID: 2140] The Wild West and samurai in space. Williams' score, Lucas' direction, and a phenomenal cast captivate.Screenplay...................................... 10 / 10 Acting...............................................9 Cinematography............................... 10 Sound................................................... 9 Editing................................................ 8 Score...................................................... 10 Timeless Utility................................... 9 Total.................................................... 65 / 70 ~= 9.3 (rounded to 9) Verdict................................................. Canonical.

SW_Episode4.pkl: 2158 → 2157 valid reviews after preprocessing



Processing SW_Episode5.pkl: 100%|██████████| 1507/1507 [05:23<00:00,  4.66it/s]



--- Removed reviews from SW_Episode5.pkl ---
[Review ID: 2351] Romance among the stars. Hope lost and miraculously scavenged. One of the biggest plot twists ever.Screenplay...................................... 10 / 10 Acting............................................... 9 Cinematography................................ 10 Sound................................................. 10 Editing................................................ 10 Score.................................................... 10 Timeless Utility................................... 10 Total.................................................... 69 / 70 ~= 9.9 (which I rounded to 10) Verdict................................................. Timeless Masterpiece.

[Review ID: 2520] Star wars...............................................................................

SW_Episode5.pkl: 1507 → 1505 valid reviews after preprocessing



Processing SW_Episode7.pkl: 100%|██████████| 4860/4860 [26:52<00:00,  3.01it/s]  


SW_Episode7.pkl: 4860 → 4860 valid reviews after preprocessing



Processing SW_Episode6.pkl: 100%|██████████| 1017/1017 [04:38<00:00,  3.65it/s]



--- Removed reviews from SW_Episode6.pkl ---
[Review ID: 4351] Fun finale to the Original Trilogy with quirkiness the other two movies lack. Cast gels well.Screenplay...................................... 7 / 10 Acting............................................... 8 Cinematography................................ 8 Sound................................................. 9 Editing................................................ 6 Music.................................................... 10 Timeless Utility................................... 7 Total.................................................... 55 / 70 ~= 7.9 (rounded to 8) Verdict................................................. Recommended.

SW_Episode6.pkl: 1017 → 1016 valid reviews after preprocessing



Processing LaLaLand.pkl: 100%|██████████| 2369/2369 [09:05<00:00,  4.34it/s]



--- Removed reviews from LaLaLand.pkl ---
[Review ID: 48466] Such a Great film!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

[Review ID: 48947] A perfect blend of music, sights, sounds, and brilliant performances.
What more could anyone ask for?Screenplay...................................... 10 / 10 Acting............................................... 10 Cinematography............................... 10 Sound................................................... 10 Editing................................................ 10 Score.................................................... 10 Timeless Utility................................. 10
Total.................................................... 70 / 70 = 10
Verdict................................................. Masterpiece

LaLaLand.pkl: 2369 → 2367 valid reviews after preprocessing



Processing HarryPotter.pkl: 100%|██████████| 2059/2059 [08:00<00:00,  4.29it/s]



--- Removed reviews from HarryPotter.pkl ---
[Review ID: 43169] 👌🏻👌🏻👌🏻👌🏻👌🏻👌🏻👌🏻👌🏻👌🏻👌🏻👌🏻👌🏻👌🏻👌🏻👌🏻👌🏻👌🏻👌🏻👌🏻👌🏻👌🏻

HarryPotter.pkl: 2059 → 2058 valid reviews after preprocessing



Processing SW_Episode8.pkl: 100%|██████████| 6909/6909 [40:34<00:00,  2.84it/s]   



--- Removed reviews from SW_Episode8.pkl ---
[Review ID: 22795] Worst piece of dumpster trash I've ever seen!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

[Review ID: 23910] You've lost the idea.. Very Shame...
.................................................................................................................................................................................................................................................................................................................................................................................................

[Review ID: 24725] Trash!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

[Review ID: 25595] Bad movie_______________________________________________________________________________________________________________________________________________________________________________________________________________

Processing SW_Episode9.pkl: 100%|██████████| 7891/7891 [21:13<00:00,  6.20it/s]  

SW_Episode9.pkl: 7891 → 7891 valid reviews after preprocessing






In [12]:
# Load the first file in the folder and display its first 10 rows to verify
first_file_path = os.path.join(folder_path, os.listdir(folder_path)[0])
df_first = pd.read_pickle(first_file_path)
df_first.head(10)

Unnamed: 0,Review_ID,Movie_ID,Movie_Title,Rating,Review_Date,Review_Title,Review_Text,Helpful_Votes,Total_Votes,Preprocessed_Review
39894,4931608,tt0060196,"The Good, the Bad and the Ugly",9.0,13 June 2019,Good movie,If The Good The Bad and The Ugly was not 2hrs ...,0.0,1.0,If The Good The Bad and The Ugly was not 2hrs ...
39895,92838,tt0060196,"The Good, the Bad and the Ugly",,23 April 2004,"Clint, Lee, and Eli",In his third and final go around as the laconi...,2.0,3.0,In his third and final go around as the iconic...
39896,92840,tt0060196,"The Good, the Bad and the Ugly",,28 April 2004,John Wayne and Gene Autrey can go lasso themse...,"Clint Eastwood and Sergio Leone OWN the ""Ameri...",0.0,1.0,"Clint Eastwood and Sergio Leone OWN the ""Ameri..."
39897,2311851,tt0060196,"The Good, the Bad and the Ugly",,17 September 2010,"Never seen so many men, wasted so badly","""Nostalgia is a product of dissatisfaction and...",11.0,18.0,"""Nostalgia is a product of dissatisfaction and..."
39898,92820,tt0060196,"The Good, the Bad and the Ugly",8.0,16 November 2003,A Classic Western Movie With an Unforgettable ...,Three bad guys  the chaser of rewards Joe (Cl...,10.0,19.0,Three bad guys the chaser of rewards Joe (Clin...
39899,92672,tt0060196,"The Good, the Bad and the Ugly",10.0,1 February 2000,The Greatest Film Ever,This is a virtually flawless film. Stylish and...,0.0,1.0,This is a virtually flawless film. Stylish and...
39900,2466802,tt0060196,"The Good, the Bad and the Ugly",,30 July 2011,"The long, the unsynchronised and the improbable","Has 'The Good, The Bad And The Ugly' acquired ...",1.0,4.0,"Has 'The Good, The Bad And The Ugly' acquired ..."
39901,3501397,tt0060196,"The Good, the Bad and the Ugly",9.0,9 July 2016,"Masterful, immersive, and showing off all that...",This classic western Film by legendary directo...,0.0,4.0,This classic western Film by legendary directo...
39902,2041196,tt0060196,"The Good, the Bad and the Ugly",10.0,22 March 2009,A classic western,This movie blew me away. It introduces the goo...,1.0,1.0,This movie blew me away. It introduces the goo...
39903,92856,tt0060196,"The Good, the Bad and the Ugly",9.0,19 July 2004,A revolutionary and exciting Spaghetti Western,"This is a bemusing, violent and stylish Wester...",16.0,25.0,"This is a refusing, violent and stylish Wester..."
