# Preprocessing of Review Texts for Transformer-Based Keyword Extraction

This section applies essential **text preprocessing** to the movie review datasets located in the `Reviews_By_Movie` folder. Each file in the folder is a `.pkl` dataset containing the raw user reviews for one movie, including:

- The 9 *Star Wars* episodes: `SW_Episode1.pkl` to `SW_Episode9.pkl`  
- Other films: `HarryPotter.pkl`, `IndianaJones.pkl`, `LaLaLand.pkl`, `Parasite.pkl`, `GoodBadUgly.pkl`, `Oppenheimer.pkl`  

For each dataset, a new column called `Preprocessed_Review` will be created, containing the **cleaned and normalized version** of the original review text.

### Context: Transformer-based Keyword Extraction

Since these reviews will later be processed by a Transformer-based keyword extraction model (specifically **KeyBERT** with the `all-MiniLM-L6-v2` embedding backend), the preprocessing is deliberately **minimal but targeted**.

Transformers internally handle many aspects like tokenization, subword modeling, casing, and truncation. Therefore, the goal of this preprocessing is not to reshape the text dramatically, but to improve its quality and consistency, especially for keyword selection.

### Preprocessing Steps


#### 1. **Punctuation Spacing Normalization**
Inserts a space after a punctuation mark (e.g., `.`, `!`, `?`) **only if** it is immediately followed by the start of a word. Punctuation inside numbers is left alone.

- `"hello.great"` → `"hello. great"`
- Numbers are preserved: `$300,000.00` remains unchanged.
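
A minimal sketch of this rule, assuming a single regular expression is enough. The function name `normalize_punctuation_spacing` and the exact pattern are illustrative and are not taken from `preprocessing.py`. The pattern treats only letters as the start of a word, so a digit after a comma or period never triggers an insertion.

```python
import re

# Punctuation followed directly by a letter (any script). Digits and
# underscores are excluded, so "300,000.00" is never split.
_MISSING_SPACE = re.compile(r"([.!?,;:])(?=[^\W\d_])")


def normalize_punctuation_spacing(text: str) -> str:
    """Insert one space after punctuation that is glued to the next word."""
    return _MISSING_SPACE.sub(r"\1 ", text)


print(normalize_punctuation_spacing("hello.great"))
print(normalize_punctuation_spacing("It cost $300,000.00!Wow"))
```

One known limitation of such a simple pattern: dotted abbreviations like `U.S.A` would also be split, so a real implementation may need an exception list.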

#### 2. **Typo Correction**
Typo correction is handled using the `autocorrect` library, with an enhanced strategy to deal with common limitations of naive spell-checking.  
Special attention is paid to **letter repetitions**, which often occur in user-generated reviews (e.g., `"loooong"`, `"baaad"`, `"amazzing"`).

**How it works:**
- **Valid English words** (based on frequency in the `wordfreq` lexicon) are left unchanged.
- **Whitespace is normalized** to collapse multiple spaces.
- **Capitalized words** not at the beginning of a sentence are assumed to be **proper nouns** and left untouched.
- Words with **3+ repeated letters** are reduced to 2, then to 1 if necessary, checking at each step for validity.
- If still unrecognized, the word is passed to `autocorrect` as a last resort.

This strategy handles cases where naive spell-checking tends to fail:
- `"good"` is left unchanged instead of being miscorrected to `"god"`
- `"baad"` is corrected to `"bad"` rather than `"band"`
- `"stunnnning"` is reduced to `"stunning"`
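
The order of checks described above can be sketched as follows. This is an illustrative reconstruction, not the code in `preprocessing.py`: the names `correct_word`, `repetition_candidates`, `is_valid` and `fallback` are hypothetical, and the dictionary lookup and spell-checker are passed in as plain callables so the example runs on a toy vocabulary.

```python
import re
from typing import Callable, Iterator

# A run of three or more identical letters, e.g. "ooo" in "loooong".
_LONG_RUN = re.compile(r"([A-Za-z])\1{2,}")


def repetition_candidates(word: str) -> Iterator[str]:
    """Yield the word with long runs cut to two letters, then to one."""
    yield _LONG_RUN.sub(r"\1\1", word)
    yield _LONG_RUN.sub(r"\1", word)


def correct_word(
    word: str,
    is_valid: Callable[[str], bool],
    fallback: Callable[[str], str],
    sentence_start: bool = False,
) -> str:
    """Apply the checks in order and stop at the first one that settles the word."""
    if is_valid(word.lower()):
        return word                      # already a known word
    if word[:1].isupper() and not sentence_start:
        return word                      # likely a proper noun
    for candidate in repetition_candidates(word):
        if candidate != word and is_valid(candidate.lower()):
            return candidate             # fixed by reducing repeated letters
    return fallback(word)                # last resort: generic spell-checker


# Toy stand-ins so the sketch runs without external data.
vocabulary = {"long", "bad", "stunning", "good"}
known = lambda w: w in vocabulary
unchanged = lambda w: w

print(correct_word("loooong", known, unchanged))     # long
print(correct_word("stunnnning", known, unchanged))  # stunning
print(correct_word("Tatooine", known, unchanged))    # Tatooine
```

In the setting of this notebook, `is_valid` could plausibly be built on `zipf_frequency(word, "en")` from `wordfreq` compared against a frequency threshold, and `fallback` could be a `Speller(lang="en")` instance from `autocorrect`. The threshold value is a tuning choice that is not specified here.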

#### 3. **Nonsense and Empty Review Removal**
Short or unintelligible reviews (e.g., only numbers, emojis, or symbols) are removed using a character ratio heuristic.  
This helps eliminate non-informative inputs from downstream analysis.
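
A character ratio heuristic of this kind might look like the sketch below. The function name `is_nonsense` and both thresholds (`min_alpha_ratio`, `min_letters`) are assumptions chosen for illustration; the values used by the actual `Preprocessor` are not stated in this notebook.

```python
def is_nonsense(text: str, min_alpha_ratio: float = 0.5, min_letters: int = 3) -> bool:
    """Return True when too few of the visible characters are letters."""
    visible = [ch for ch in text if not ch.isspace()]
    letters = sum(ch.isalpha() for ch in visible)
    if letters < min_letters:
        return True                      # empty, or only digits/symbols/emojis
    return letters / len(visible) < min_alpha_ratio


print(is_nonsense("12345 !!!"))                    # True
print(is_nonsense("Exelent ........\n...\n...."))  # True
print(is_nonsense("Great movie, loved it."))       # False
```

A ratio over visible characters keeps the check independent of review length, while the minimum letter count catches empty or near-empty inputs before any division takes place.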

### Why **Lemmatization Is Not Applied**

Although lemmatization (e.g., `running` → `run`) is common in traditional NLP pipelines, it is **intentionally omitted** here.

#### BERT does not need it:
Transformer models like BERT are pre-trained on raw, natural text and already handle semantic normalization via **subword tokenization**. Lemmatization can **break expected token patterns**.  
For instance:

- `"was"` → `"be"`
- `"actors"` → `"actor"`

These changes may reduce the quality of the generated embeddings.

#### KeyBERT prefers natural text:
KeyBERT works **directly on Transformer embeddings** of the text, and performs best with input that reflects how humans write. Lemmatization can distort the input and degrade semantic richness.

> **Note**: Stop word removal is **not performed here**, as it is handled by the **KeyBERT extensions** during keyword scoring.  
> This keeps the preprocessing flexible and model-agnostic.


## Setup: Installing and Importing Required Libraries

In [None]:
import subprocess
import sys

# List of required packages
required_packages = [
    "pandas",                         
    "spacy",           
    "autocorrect",
    "wordfreq",
    "tqdm"
]

def install_package(package):
    """Installs a package using pip if it's not already installed."""
    try:
        __import__(package)
        print(f"{package} is already installed.")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Check and install missing packages
for package in required_packages:
    install_package(package)


pandas is already installed.
spacy is already installed.
autocorrect is already installed.
wordfreq is already installed.
tqdm is already installed.


In [None]:
# Standard library
import os          # file operations
import re          # regular expressions for text cleaning
import subprocess  # used to download the spaCy model if it is missing
import sys

# Third-party
import pandas as pd
import spacy
from autocorrect import Speller
from tqdm import tqdm  # progress bars
from wordfreq import zipf_frequency  # type: ignore

# Load the English language model in spaCy (download if not present)
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print("Downloading 'en_core_web_sm' model...")
    subprocess.check_call([sys.executable, "-m", "spacy", "download", "en_core_web_sm"])
    nlp = spacy.load("en_core_web_sm")

### Importing the Custom Preprocessor

This cell imports the `Preprocessor` class from the custom `preprocessing.py` module.  
The class encapsulates all the text cleaning operations required to prepare review texts before passing them to a Transformer-based model.  
It provides methods for typo correction, punctuation normalization and filtering of nonsensical content, and will be applied to each review in the dataset.

In [3]:
from preprocessing import Preprocessor  # Custom preprocessor module

## Batch Preprocessing of Movie Review Datasets

In this step, we apply the custom `Preprocessor` class to all movie review datasets stored in the `Reviews_By_Movie` folder.  
Each `.pkl` file corresponds to a different movie and contains a column named `Review_Text` with raw user reviews.

For each dataset, the following operations are performed:

- The reviews are preprocessed using the `Preprocessor.preprocess_review()` pipeline, which includes:
  - Typo correction with repetition reduction and intelligent spell-checking
  - Punctuation spacing normalization to ensure clean token boundaries
  - Nonsense or empty review filtering, based on character composition

- The cleaned review is stored in a new column called `Preprocessed_Review`.

- Any rows where preprocessing failed (e.g., unintelligible or empty reviews) are removed.

- The updated dataset is saved back to disk, overwriting the original file.

- A summary is printed showing the number of reviews before and after preprocessing. Optionally, the discarded reviews can be inspected for transparency.

> **Note**: Lemmatization is not applied, since the downstream model (KeyBERT) is based on Transformer embeddings, which already handle word normalization internally. Lemmatization may reduce semantic richness and interfere with subword tokenization.

In [5]:
import os
import pandas as pd
from tqdm import tqdm
from preprocessing import Preprocessor

# Folder path
folder_path = "../Dataset/Reviews_By_Movie"

# Initialize preprocessor
pre = Preprocessor()

# Process each .pkl file in the folder
for filename in os.listdir(folder_path):
    if filename.endswith(".pkl"):
        file_path = os.path.join(folder_path, filename)

        # Load dataset
        df = pd.read_pickle(file_path)
        original_count = len(df)

        # Apply preprocessing
        tqdm.pandas(desc=f"Processing {filename}")
        df["Preprocessed_Review"] = df["Review_Text"].progress_apply(pre.preprocess_review)

        # Identify and print removed rows
        removed = df[df["Preprocessed_Review"].isna()]
        if not removed.empty:
            print(f"\n--- Removed reviews from {filename} ---")
            for idx, row in removed.iterrows():
                print(f"[Review ID: {idx}] {row['Review_Text']}\n")

        # Drop rows where preprocessing returned None
        df.dropna(subset=["Preprocessed_Review"], inplace=True)
        new_count = len(df)

        # Save back to disk (overwrite)
        df.to_pickle(file_path)

        # Summary
        print(f"{filename}: {original_count} → {new_count} valid reviews after preprocessing\n")


Processing GoodBadUgly.pkl: 100%|██████████| 1430/1430 [07:50<00:00,  3.04it/s]



--- Removed reviews from GoodBadUgly.pkl ---
[Review ID: 40764] Exelent ........
...
....
....
.....
......
......

GoodBadUgly.pkl: 1430 → 1429 valid reviews after preprocessing



Processing Parasite.pkl:   4%|▍         | 140/3702 [00:51<21:55,  2.71it/s]


KeyboardInterrupt: 

In [7]:
import pandas as pd
from preprocessing import Preprocessor  # Make sure the module is importable

# Path to the file
file_path = "../Dataset/Reviews_By_Movie/Parasite.pkl"

# Load the dataset
df = pd.read_pickle(file_path)

# Instantiate the preprocessor
pre = Preprocessor()

# Apply the pipeline to a single review (positional index 456)
sample_review = df["Review_Text"].iloc[456]
processed = pre.preprocess_review(sample_review)

# Print the result
print("\n=== PREPROCESSING ===")
print("Original:", sample_review)
print("Processed:", processed)



=== PREPROCESSING ===
Original: A poor South Korean family lives in a ramshackle semi-basement apartment and gets by on hustles and cons. One day the son manages to get a job tutoring the child of a wealthy family. He sees an opportunity to get his parents and sister jobs in the household too. Soon all of them of are in and life is looking much rosier. Then fate throws them a curveball.Superb. Written and directed by Bong Joon Ho who gave us the superb crime-drama Memories of Murder plus the entertaining Snowpiercer, Parasite is a great mix of comedy and drama, pathos and social commentary.It starts off in very entertaining fashion as we meet the family that are experts in cons and manipulation. It is quite funny and initially they just seem like slackers. However, after a while it is quite awe-inspiring to see the work that goes into their deceptions.As the movie progresses it becomes darker and darker. The plot takes on a few surprising twists and turns and the two families involved