# Preprocessing of Review Texts for Transformer-Based Keyword Extraction

This section applies essential **text preprocessing** to the movie review datasets stored in the `Reviews_By_Movie` folder. Each file in the folder is a `.pkl` dataset containing the raw user reviews for one movie:

- The 9 *Star Wars* episodes: `SW_Episode1.pkl` to `SW_Episode9.pkl`  
- Other films: `HarryPotter.pkl`, `IndianaJones.pkl`, `LaLaLand.pkl`, `Parasite.pkl`, `GoodBadUgly.pkl`, `Oppenheimer.pkl`  

For each dataset, a new column called `Preprocessed_Review` will be created, containing the **cleaned and normalized version** of the original review text.

Since these reviews will be later processed by a Transformer-based keyword extraction model (specifically **KeyBERT** with the `all-MiniLM-L6-v2` embedding backend), the preprocessing is deliberately **minimal but targeted**.

The model's tokenizer already handles aspects such as tokenization, lowercasing, and truncation. The goal of this preprocessing is therefore not to reshape the text dramatically, but to improve its quality and consistency, especially for keyword selection.

The following operations are applied:

- **Typo correction**  
  Typo correction is handled using the `autocorrect` library, but with an **enhanced logic** to deal with common limitations of naive spell-checking. In particular, special care is taken with **letter repetitions**, which often occur in user-generated reviews (e.g., `"loooong"`, `"baaad"`, `"amazzing"`).  

  The logic works as follows:
  1. **Words already recognized as valid English** (based on frequency from the `wordfreq` lexicon) are left unchanged.

  2. **Whitespace normalization** is also applied, collapsing multiple consecutive spaces into one (e.g., `"What     a mess"` → `"What a mess"`).

  3. Words that begin with a **capital letter and are not at the start of the sentence** are assumed to be **proper nouns** and are not altered, to preserve named entities like `"Harry"` or `"Oppenheimer"`.

  4. If a word contains a **run of three or more identical letters** (e.g., `"stunnnning"`), the run is first reduced to **two** letters (`"stunning"`) and, if the result is still not a valid word, to **one** (`"stuning"`), checking validity at each step.

  5. If a word contains a **doubled letter** (e.g., `"baad"`), it is tentatively reduced to a single letter (`"bad"`) **only if** the resulting word is frequent enough to count as valid. Otherwise, the original word is passed to `autocorrect` as a fallback.

  This layered strategy handles cases that naive spell-checking tends to get wrong:
  - `"good"` is left unchanged instead of being reduced to `"god"`
  - `"baad"` is corrected to `"bad"` rather than `"band"`
  - `"stunnnning"` is restored to `"stunning"` by reducing repetitions first

  The result is a **more accurate and robust correction process**, which avoids over-correcting valid words while still handling noisy user input effectively.

- **Punctuation spacing normalization**  
  Inserts a space after punctuation marks (e.g., `.`, `!`, `?`) **only if** the next character is a word character. This avoids token fusion issues (e.g., `hello.great` → `hello. great`) while preserving expressive punctuation (e.g., `!!!`) and numbers expressing values (e.g., `$300,000`).

- **Nonsense and empty review removal**  
  Short or unintelligible reviews (e.g., only numbers, symbols, or emojis) are discarded using a character ratio heuristic, to avoid noise in downstream tasks.

- **Lemmatization**  
  Words are reduced to their base (dictionary) form using **spaCy**, improving the semantic clarity and consistency of text (e.g., `running` → `run`, `actors` → `actor`).
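
The layered typo-correction logic described in the list above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual `Preprocessor` code: the function names, the toy vocabulary, and the `sentence_start` flag are all assumptions. The validity check and the fallback corrector are passed in as plain callables so that a `wordfreq`-based check and an `autocorrect` speller could be plugged in.

```python
import re
from typing import Callable, Optional

# Toy vocabulary standing in for a real lexicon lookup (assumption for the demo).
TOY_VOCAB = {"good", "bad", "stunning", "long", "amazing", "what", "a", "mess"}


def is_valid_word(word: str) -> bool:
    """Hypothetical validity check; a frequency-based lookup could replace it."""
    return word.lower() in TOY_VOCAB


def collapse_runs(word: str, max_run: int) -> str:
    """Shorten every run of an identical letter to at most `max_run` letters."""
    pattern = r"([A-Za-z])\1{%d,}" % max_run
    return re.sub(pattern, lambda m: m.group(1) * max_run, word)


def correct_word(
    word: str,
    is_valid: Callable[[str], bool] = is_valid_word,
    fallback: Optional[Callable[[str], str]] = None,
    sentence_start: bool = False,
) -> str:
    # Valid words are left unchanged (protects "good" from becoming "god").
    if is_valid(word):
        return word
    # Capitalized words that are not sentence-initial are treated as proper nouns.
    if word[:1].isupper() and not sentence_start:
        return word
    if re.search(r"([A-Za-z])\1{2,}", word):
        # Runs of three or more: try two letters first, then one.
        for max_run in (2, 1):
            candidate = collapse_runs(word, max_run)
            if is_valid(candidate):
                return candidate
    elif re.search(r"([A-Za-z])\1", word):
        # Doubled letter: accept the single-letter form only if it is valid.
        candidate = collapse_runs(word, 1)
        if is_valid(candidate):
            return candidate
    # Otherwise hand the original word to the generic spell-checker, if any.
    return fallback(word) if fallback is not None else word


for raw in ["good", "baad", "stunnnning", "loooong", "Harry"]:
    print(raw, "->", correct_word(raw))
```

In the notebook's setting, `is_valid` could be something like `lambda w: zipf_frequency(w, "en") >= 3.0` (the threshold is a guess, not a value taken from the source) and `fallback` could be an `autocorrect.Speller()` instance, which is callable on a string.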

This lightweight pipeline ensures that the text is **clean, meaningful, and semantically compact**, while still preserving the structure expected by transformer-based models.
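
The spacing normalization and the nonsense filter can be illustrated with a few lines of standard-library Python. The sketch below is an assumption-laden approximation rather than the project's implementation: it adds a space after punctuation only when a letter follows (so that `$300,000` stays intact), and the `min_letter_ratio` and `min_letters` thresholds are invented for the example.

```python
import re


def normalize_spacing(text: str) -> str:
    """Add a space after punctuation that is glued to a following letter,
    then collapse runs of whitespace into a single space."""
    # Lookahead on letters only, so "!!!" and "$300,000" are left untouched.
    text = re.sub(r"([.!?,;:])(?=[A-Za-z])", r"\1 ", text)
    return re.sub(r"\s+", " ", text).strip()


def is_nonsense(text: str, min_letter_ratio: float = 0.5, min_letters: int = 3) -> bool:
    """Flag reviews made mostly of digits, symbols, or emojis.

    Both thresholds are illustrative assumptions, not values from the source.
    """
    stripped = "".join(text.split())  # ignore whitespace when counting
    if not stripped:
        return True
    letters = sum(ch.isalpha() for ch in stripped)
    return letters < min_letters or letters / len(stripped) < min_letter_ratio


print(normalize_spacing("hello.great"))
print(normalize_spacing("What     a mess"))
print(is_nonsense("12345 !!!"), is_nonsense("Great movie, 10/10!"))
```

A letter-only lookahead is a simplification: it would still split abbreviations such as `e.g.` or bare domain names, so a production version would likely need extra guards.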

> **Note**: Stop word removal is **not performed here**, as it is managed internally by the **KeyBERT extensions**, which apply stop word filtering during keyword scoring. This keeps the preprocessing stage model-agnostic and allows for richer semantic extraction.


## Setup: Installing and Importing Required Libraries

In [1]:
import subprocess
import sys

# List of required packages
required_packages = [
    "pandas",                         
    "spacy",           
    "autocorrect",
    "wordfreq",
    "tqdm"
]


def install_package(package):
    """Installs a package using pip if it's not already installed."""
    try:
        __import__(package)
        print(f"{package} is already installed.")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Check and install missing packages
for package in required_packages:
    install_package(package)


pandas is already installed.
spacy is already installed.
autocorrect is already installed.
wordfreq is already installed.
tqdm is already installed.
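
The check in `install_package` relies on `__import__`, which actually imports each package and only works when the pip name and the import name coincide (true for the five packages listed here). A possible variant, sketched below under those caveats, uses `importlib.util.find_spec` to test for a package without importing it. The `IMPORT_NAMES` mapping and the function names are illustrative and not part of the notebook.

```python
import importlib.util
import subprocess
import sys

# Map pip names to import names where they differ (illustrative entry only).
IMPORT_NAMES = {"scikit-learn": "sklearn"}


def is_installed(package: str) -> bool:
    """Return True if the package's top-level module can be found."""
    module_name = IMPORT_NAMES.get(package, package)
    return importlib.util.find_spec(module_name) is not None


def ensure_installed(package: str) -> None:
    """Install the package with pip only when it is missing."""
    if is_installed(package):
        print(f"{package} is already installed.")
    else:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
```

Calling `ensure_installed(name)` for each entry of `required_packages` would then reproduce the behaviour of the cell above.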


In [2]:
# Standard imports for preprocessing
import pandas as pd

# Text processing
import re
from autocorrect import Speller
from wordfreq import zipf_frequency # type: ignore

# spaCy for lemmatization
import spacy

# Load the English language model in spaCy (download if not present)
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print("Downloading 'en_core_web_sm' model...")
    subprocess.check_call([sys.executable, "-m", "spacy", "download", "en_core_web_sm"])
    nlp = spacy.load("en_core_web_sm")

# tqdm for progress bars
from tqdm import tqdm

# os library for file operations
import os

### Importing the Custom Preprocessor

This cell imports the `Preprocessor` class from the custom `preprocessing.py` module.  
The class encapsulates all the text cleaning operations required to prepare review texts before passing them to a Transformer-based model.  
It provides methods for typo correction, punctuation normalization, lemmatization, and filtering of nonsensical content, and will be applied to each review in the dataset.

In [3]:
from preprocessing import Preprocessor  # Custom preprocessor module
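
The contents of `preprocessing.py` are not shown in this notebook. As a rough mental model only, a class of this kind could chain its cleaning steps and return `None` for reviews it rejects, which is the contract the batch step relies on when it drops missing values. The class below is a hypothetical stand-in (`SketchPreprocessor`, `steps`, and `reject` are invented names), not the real `Preprocessor`.

```python
from typing import Callable, List, Optional


class SketchPreprocessor:
    """Hypothetical stand-in for the custom Preprocessor class."""

    def __init__(self, steps: List[Callable[[str], str]], reject: Callable[[str], bool]):
        self.steps = steps    # cleaning functions applied in order
        self.reject = reject  # predicate flagging nonsensical reviews

    def preprocess_review(self, text: object) -> Optional[str]:
        # Non-string or rejected input yields None, so the row can be dropped later.
        if not isinstance(text, str) or self.reject(text):
            return None
        for step in self.steps:
            text = step(text)
        return text or None


sketch = SketchPreprocessor(
    steps=[str.strip, lambda t: " ".join(t.split())],
    reject=lambda t: not any(ch.isalpha() for ch in t),
)
print(sketch.preprocess_review("  What     a mess  "))
print(sketch.preprocess_review("12345 !!!"))
```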

## Batch Preprocessing of Movie Review Datasets

In this step, we apply the custom `Preprocessor` class to **all movie review datasets** stored in the `Reviews_By_Movie` folder.  
Each `.pkl` file corresponds to a different movie and contains a column named `Review_Text` with raw user reviews.

For each dataset, the following operations are performed:

- The reviews are preprocessed using the `Preprocessor.preprocess_review()` pipeline, which includes:
  - **Typo correction**
  - **Punctuation spacing normalization**
  - **Nonsense or empty review filtering**
  - **Lemmatization** (via spaCy)
- The cleaned review is stored in a new column called `Preprocessed_Review`.
- Any rows where preprocessing failed (e.g., meaningless reviews) are removed.
- The updated dataset is **saved back to disk, overwriting the original file**.
- A summary is printed showing the number of reviews before and after preprocessing.

This batch step ensures that **all datasets are ready for transformer-based keyword extraction** using models like KeyBERT, with improved quality and consistency.


In [None]:
import os
import pandas as pd
from tqdm import tqdm
from Preprocessing.preprocessing import Preprocessor  # Make sure the path and module name are correct

# Folder path
folder_path = "../Dataset/Reviews_By_Movie"

# Initialize preprocessor
pre = Preprocessor()

# Process each .pkl file in the folder
for filename in os.listdir(folder_path):
    if filename.endswith(".pkl"):
        file_path = os.path.join(folder_path, filename)

        # Load dataset
        df = pd.read_pickle(file_path)
        original_count = len(df)

        # Apply preprocessing
        tqdm.pandas(desc=f"Processing {filename}")
        df["Preprocessed_Review"] = df["Review_Text"].progress_apply(pre.preprocess_review)

        # Identify and print removed rows
        removed = df[df["Preprocessed_Review"].isna()]
        if not removed.empty:
            print(f"\n--- Removed reviews from {filename} ---")
            for idx, row in removed.iterrows():
                print(f"[Review ID: {idx}] {row['Review_Text']}\n")

        # Drop rows where preprocessing returned None
        df.dropna(subset=["Preprocessed_Review"], inplace=True)
        new_count = len(df)

        # Save back to disk (overwrite)
        df.to_pickle(file_path)

        # Summary
        print(f"{filename}: {original_count} → {new_count} valid reviews after preprocessing\n")


Processing GoodBadUgly.pkl: 100%|██████████| 1430/1430 [09:04<00:00,  2.63it/s]


GoodBadUgly.pkl: 1430 → 1429 valid reviews after preprocessing



Processing Parasite.pkl: 100%|██████████| 3702/3702 [10:41<00:00,  5.77it/s] 


Parasite.pkl: 3702 → 3702 valid reviews after preprocessing



Processing SW_Episode2.pkl:   6%|▌         | 221/3880 [01:55<31:53,  1.91it/s]  


Unexpected exception formatting exception. Falling back to standard exception


Traceback (most recent call last):
  File "/Users/manuelemustari/Library/Python/3.9/lib/python/site-packages/IPython/core/interactiveshell.py", line 3550, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/var/folders/3k/39cgj5q54l52b9t9s6kj73kr0000gn/T/ipykernel_7269/4228999202.py", line 18, in <module>
    df["Preprocessed_Review"] = df["Review_Text"].progress_apply(pre.preprocess_review)
  File "/Users/manuelemustari/Library/Python/3.9/lib/python/site-packages/tqdm/std.py", line 917, in inner
    return getattr(df, df_function)(wrapper, **kwargs)
  File "/Users/manuelemustari/Library/Python/3.9/lib/python/site-packages/pandas/core/series.py", line 4917, in apply
    return SeriesApply(
  File "/Users/manuelemustari/Library/Python/3.9/lib/python/site-packages/pandas/core/apply.py", line 1427, in apply
    return self.apply_standard()
  File "/Users/manuelemustari/Library/Python/3.9/lib/python/site-packages/pandas/core/apply.py", line 1507, in apply_standard
  