# Preprocessing of Review Texts for Transformer-Based Keyword Extraction

This section applies basic but essential preprocessing to the movie review datasets stored in the `Review_By_Movie` folder.
The folder contains one `.pkl` file per movie:

- `SW_Episode1.pkl`
- `SW_Episode2.pkl`
- `SW_Episode3.pkl`
- `SW_Episode4.pkl`
- `SW_Episode5.pkl`
- `SW_Episode6.pkl`
- `SW_Episode7.pkl`
- `SW_Episode8.pkl`
- `SW_Episode9.pkl`
- `HarryPotter.pkl`
- `IndianaJones.pkl`
- `LaLaLand.pkl`
- `Parasite.pkl`
- `GoodBadUgly.pkl`
- `Oppenheimer.pkl`

The preprocessing step adds a new column named `Preprocessed_Review` to each dataset, containing the cleaned version of the review text.
Because the processed reviews will later be passed to `KeyBERT`, which relies on a Transformer embedding model (here `all-MiniLM-L6-v2`), only minimal preprocessing is needed.
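A minimal sketch of how the new column could be added to every file. It assumes each `.pkl` file holds a pandas DataFrame and that the raw text lives in a column named `Review`. That column name is hypothetical, since the actual schema is not stated here. The `clean_fn` argument stands in for whatever cleaning function is used, and the demo runs on a temporary folder rather than on `Review_By_Movie`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def add_preprocessed_column(folder, clean_fn, text_col="Review"):
    """Add a `Preprocessed_Review` column to every .pkl DataFrame in `folder`.

    `text_col` is a hypothetical column name; adjust it to the real schema.
    Returns the names of the files that were updated.
    """
    updated = []
    for path in sorted(Path(folder).glob("*.pkl")):
        df = pd.read_pickle(path)
        # Apply the cleaning function to each raw review.
        df["Preprocessed_Review"] = df[text_col].apply(clean_fn)
        df.to_pickle(path)  # overwrites the file in place
        updated.append(path.name)
    return updated


# Demo on a throwaway folder so that no real dataset is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "Demo.pkl"
    pd.DataFrame({"Review": ["Good Film", "BAD"]}).to_pickle(demo_path)
    # str.lower is only a stand-in for the real cleaning function.
    updated = add_preprocessed_column(tmp, str.lower)
    result = pd.read_pickle(demo_path)
```

Overwriting the files in place keeps the folder layout unchanged; writing to a separate output folder would be the safer choice if the raw files must be preserved.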

Transformer embeddings are generally robust to text noise, and the model's tokenizer handles tokenization, lowercasing, and truncation internally. However, to improve the quality of the extracted keywords, the following custom preprocessing steps are applied:

- **Typo correction** for common misspellings.  
- **Punctuation spacing normalization**: insert a space after a punctuation mark **only** when it is immediately followed by a word character, and **not** when it is followed by other punctuation (e.g., `hello.could` → `hello. could`, but `!!!` is left unchanged).  
- **Removal of nonsensical or empty reviews**, such as strings with only symbols, numbers, or unintelligible text.  
- **Lemmatization**, to reduce inflected words to their base form, improving the consistency of the text fed to the model.
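The punctuation spacing rule above can be sketched with a single regular expression. This is an illustrative version, not the code in `preprocessing.py`: the function name is hypothetical, only `.`, `!` and `?` are handled, and the lookahead matches letters rather than any word character so that decimals such as `3.5` are not split.

```python
import re

# Sentence punctuation directly followed by a letter, e.g. "hello.could".
# Assumption: only ".", "!" and "?" are handled; the real Preprocessor may differ.
_PUNCT_BEFORE_LETTER = re.compile(r"([.!?])(?=[A-Za-z])")


def normalize_punct_spacing(text):
    """Insert one space after ., ! or ? when a letter follows immediately."""
    return _PUNCT_BEFORE_LETTER.sub(r"\1 ", text)


print(normalize_punct_spacing("hello.could"))  # hello. could
print(normalize_punct_spacing("!!!"))          # !!!
```

In a run such as `!!!The`, only the last mark is followed by a letter, so the run stays intact and a single space is added after it. URLs and abbreviations such as `e.g.` would also be split by this pattern, so a fuller version might need exceptions.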

This light preprocessing aims to clean the text just enough to improve embedding quality, without interfering with the structure expected by the transformer model.
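The removal of empty or symbol-only reviews could look roughly like the sketch below. The exact criteria used by `Preprocessor` are not shown in this notebook, so the function name and the `min_words` threshold are assumptions chosen for illustration only.

```python
import re

# Alphabetic tokens of at least two letters count as "words" here.
_WORD = re.compile(r"[A-Za-z]{2,}")


def is_nonsense(text, min_words=2):
    """Return True for None, empty, or symbol/number-only input.

    `min_words` is a hypothetical threshold, not taken from the Preprocessor.
    """
    if not isinstance(text, str) or not text.strip():
        return True
    # Too few alphabetic tokens means there is nothing useful to embed.
    return len(_WORD.findall(text)) < min_words
```

This heuristic only catches missing or non-alphabetic content. Detecting unintelligible text made of letters (for example random keystrokes) would need a dictionary or language-model check, which is not attempted here.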


## Setup: Installing and Importing Required Libraries

In [1]:
import subprocess
import sys

# List of required packages
required_packages = [
    "pandas",
    "tqdm",
    "nltk",
    "spacy",
    "textblob",
    "autocorrect",
]

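def install_package(package):
    """Install a package with pip if it cannot be imported.

    Note: the check assumes the pip name equals the import name,
    which holds for the packages in `required_packages`.
    """
    try:
        __import__(package)
        print(f"{package} is already installed.")
    except ImportError:
        print(f"Installing {package}...")
        # Use the current interpreter so the package lands in this environment.
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])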

# Check and install missing packages
for package in required_packages:
    install_package(package)


pandas is already installed.
tqdm is already installed.
nltk is already installed.
spacy is already installed.
textblob is already installed.
autocorrect is already installed.




In [2]:
# === Core Libraries ===
import re                      # Regular expressions for text cleaning

from autocorrect import Speller  # For spell checking
from textblob import TextBlob

### Importing the Custom Preprocessor

This cell imports the `Preprocessor` class from the custom `preprocessing.py` module.  
The class encapsulates all the text cleaning operations required to prepare review texts before passing them to a Transformer-based model.  
It provides methods for typo correction, punctuation normalization, lemmatization, and filtering of nonsensical content, and will be applied to each review in the dataset.

In [3]:
from preprocessing import Preprocessor  # Custom preprocessor module

## Testing the Preprocessor on a Sample Review

In this cell, the `Preprocessor` class is instantiated and applied to a few sample inputs: a noisy but valid review, punctuation-only strings, a single misspelled word, a whitespace-only string, and `None`.  
The test checks each stage of the pipeline: punctuation normalization, typo correction, nonsense filtering, and lemmatization.  
The output shows whether the cleaned text is suitable for Transformer-based keyword extraction and which inputs are rejected (returned as `None`).

In [5]:
pre = Preprocessor()

samples = [
    "This movi was amaaazing!!!The direction is...well,not good.I think.",
    "....!!!???",
    "!!!",
    "goood!!",
    "           ",
    None
]

for s in samples:
    print("Original:", repr(s))
    print("Processed:", pre.preprocess_review(s))
    print("---")


Original: 'This movi was amaaazing!!!The direction is...well,not good.I think.'
Processed: This move was amaaazing!!! The direction is... well,not good. I think.
---
Original: '....!!!???'
Processed: None
---
Original: '!!!'
Processed: None
---
Original: 'goood!!'
Processed: None
---
Original: '           '
Processed: None
---
Original: None
Processed: None
---
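Since `preprocess_review` returns `None` for rejected inputs in the output above, rows with a missing `Preprocessed_Review` can be dropped before keyword extraction. The following is a small sketch under that assumption; the helper name and the `Review` column in the demo data are hypothetical.

```python
import pandas as pd


def keep_valid_reviews(df, col="Preprocessed_Review"):
    """Drop rows whose preprocessed text is missing (None/NaN) and reindex."""
    return df.dropna(subset=[col]).reset_index(drop=True)


# Tiny hand-made frame mimicking the shape of a processed dataset.
demo = pd.DataFrame({
    "Review": ["Great film.", "!!!", None],  # hypothetical raw column
    "Preprocessed_Review": ["Great film.", None, None],
})
cleaned = keep_valid_reviews(demo)
```

Resetting the index keeps row positions contiguous, which helps if later steps pair reviews with keyword results by position.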
