# Text Preprocessing Test Notebook

This notebook is designed to **test and demonstrate the behavior of a lightweight text preprocessing pipeline** implemented in the `Preprocessor` class. The preprocessing steps are tailored to prepare movie reviews for transformer-based keyword extraction using models like **KeyBERT** with **`all-MiniLM-L6-v2`** embeddings.

The main preprocessing operations include:
- **Typo correction** using `autocorrect`
- **Punctuation-spacing normalization**, ensuring readability for tokenizers
- **Nonsense/empty review filtering**, removing unusable entries
- **Lemmatization** using **spaCy** for efficient text normalization

Each step is tested with controlled input examples to verify correctness and robustness before applying the pipeline to full datasets.


## Setup: Installing and Importing Required Libraries

In [1]:
import subprocess
import sys

# List of required packages
required_packages = [
    "pandas",
    "spacy",
    "textblob",
    "autocorrect",
    "wordfreq"
]

def install_package(package):
    """Installs a package using pip if it's not already installed."""
    try:
        __import__(package)
        print(f"{package} is already installed.")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Check and install missing packages
for package in required_packages:
    install_package(package)


pandas is already installed.
spacy is already installed.
textblob is already installed.
autocorrect is already installed.
wordfreq is already installed.


In [None]:
# Standard imports for preprocessing
import pandas as pd

# Text processing
import re
from textblob import TextBlob
from autocorrect import Speller
from wordfreq import zipf_frequency # type: ignore

# spaCy for lemmatization
import spacy

# Load the English language model in spaCy (download if not present)
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print("Downloading 'en_core_web_sm' model...")
    subprocess.check_call([sys.executable, "-m", "spacy", "download", "en_core_web_sm"])
    nlp = spacy.load("en_core_web_sm")


### Importing the Custom Preprocessor

This cell imports the `Preprocessor` class from the custom `preprocessing.py` module.  
The class encapsulates all the text cleaning operations required to prepare review texts before passing them to a Transformer-based model.  
It provides methods for typo correction, punctuation normalization, lemmatization, and filtering of nonsensical content, and will be applied to each review in the dataset.

In [3]:
from preprocessing import Preprocessor 

### Test 1 – Typo Correction

This test evaluates the typo correction capabilities of the `Preprocessor` class.

The input consists of sample sentences that contain common spelling mistakes, such as:
- `"amazng"` instead of `"amazing"`
- `"dirction"` instead of `"direction"`
- `"absolutly"` instead of `"absolutely"`

The goal is to determine whether the typo correction module can successfully recover the correct words. Accurate correction is essential for downstream tasks such as keyword extraction and semantic embedding, as many NLP models rely on semantically clean inputs.

The implemented method uses a combination of heuristics and the `autocorrect` library:
1. **Initial Check**: Each word's **Zipf frequency** score (from `wordfreq`) is checked first. Words with a high frequency (Zipf > 3.5) are assumed to be valid and are left unchanged.

2. **Character Repetition Handling**:
   - If a word contains a run of 3 or more identical consecutive letters (e.g., `"loooong"`), the run is first reduced to 2 letters (→ `"loong"`), then to 1 (→ `"long"`), with a validity check at each step.
   - The first reduced form that passes the validity check is kept.
   
3. **Fallback**: If no valid form is found, the word is passed through `autocorrect` for a final correction attempt.

This approach helps mitigate common issues where `autocorrect` alone may produce incorrect words (e.g., `"baad"` → `"band"` instead of `"bad"`). By combining frequency checks with controlled repetition reduction, the system achieves more reliable corrections without distorting valid words.
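
The three steps above can be sketched at the level of a single word as follows. This is a minimal illustration, not the actual `Preprocessor.correct_typos` code: the names `collapse_repeats`, `correct_word`, `is_known`, and `fallback` are hypothetical, and the validity check and spell checker are passed in as plain callables so the control flow can be read on its own.

```python
import re
from typing import Callable

# A run of three or more identical letters, e.g. "oooo" in "loooong".
_LONG_RUN = re.compile(r"([A-Za-z])\1{2,}")


def collapse_repeats(word: str, keep: int) -> str:
    """Shrink every run of 3+ identical letters down to `keep` letters."""
    return _LONG_RUN.sub(lambda m: m.group(1) * keep, word)


def correct_word(word: str,
                 is_known: Callable[[str], bool],
                 fallback: Callable[[str], str]) -> str:
    """Hypothetical word-level version of the strategy described above."""
    # Step 1: words that pass the validity check are returned untouched.
    if is_known(word):
        return word
    # Step 2: for long runs, try the 2-letter form first, then the 1-letter form.
    if _LONG_RUN.search(word):
        for keep in (2, 1):
            candidate = collapse_repeats(word, keep)
            if is_known(candidate):
                return candidate
    # Step 3: nothing validated, so defer to the spell checker.
    return fallback(word)


# Possible wiring (assumes wordfreq and autocorrect are installed):
#   from wordfreq import zipf_frequency
#   from autocorrect import Speller
#   spell = Speller(lang="en")
#   correct_word("goooood", lambda w: zipf_frequency(w.lower(), "en") > 3.5, spell)
```

Passing the two checks as arguments is only a convenience for this sketch; it makes the ordering of the steps testable without loading a frequency table or a spelling model.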

In [4]:
# Initialize the preprocessor
pre = Preprocessor()

# Sample reviews with typos
typo_reviews = [
    "This movi was amazng!",
    "The dirction of the film is goooood.",
    "Charactrs were not believabl.",
    "Absolutly stunning performnce by the lead actr.",
]

# Apply typo correction only
print("=== Typo Correction Test ===\n")
for i, review in enumerate(typo_reviews, 1):
    corrected = pre.correct_typos(review)
    print(f"Review {i}:\nOriginal:  {review}\nCorrected: {corrected}\n")


=== Typo Correction Test ===

Review 1:
Original:  This movi was amazng!
Corrected: This move was amazing!

Review 2:
Original:  The dirction of the film is goooood.
Corrected: The direction of the film is good.

Review 3:
Original:  Charactrs were not believabl.
Corrected: Characters were not believable.

Review 4:
Original:  Absolutly stunning performnce by the lead actr.
Corrected: Absolutely stunning performance by the lead actor.

Review 1 shows a limitation: `movi` becomes `move` rather than `movie`. Both words are a single edit away from the input, so a context-free spell checker cannot tell which one was intended.



### Test 2 – Punctuation Spacing Normalization

This test focuses on evaluating the punctuation spacing normalization step implemented in the `Preprocessor` class.

The goal is to ensure that a space is inserted after punctuation marks (e.g., `.`, `,`, `!`, `?`, `;`, `:`) **only if** they are directly followed by an alphanumeric character.  
This is intended to improve sentence readability and avoid unintended word merging, which may negatively impact tokenization and embedding models.

Examples of input include:
- `"This movie is great!Amazing direction."` → `"This movie is great! Amazing direction."`
- `"Wait...what?Really?"` → `"Wait... what? Really?"`

Only the punctuation normalization function is applied in this test to evaluate its isolated behavior.
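
One way to express the rule above is a single regular expression with a lookahead. The sketch below is an assumption about how such a step could be written, not the body of `Preprocessor.normalize_spacing`; the function name `space_after_punctuation` is hypothetical.

```python
import re

# One of the listed punctuation marks, directly followed by a letter or digit.
# The lookahead (?=...) checks the next character without consuming it.
_PUNCT_THEN_ALNUM = re.compile(r"([.,!?;:])(?=[A-Za-z0-9])")


def space_after_punctuation(text: str) -> str:
    """Insert one space after punctuation that is glued to the next word."""
    return _PUNCT_THEN_ALNUM.sub(r"\1 ", text)


print(space_after_punctuation("This movie is great!Amazing direction."))
print(space_after_punctuation("Wait...what?Really?"))
```

Because only the last mark in a run such as `...` is followed by an alphanumeric character, the run itself stays intact. Taken literally, the alphanumeric rule also splits decimals such as `3.5` into `3. 5`; a variant that only looks ahead for letters would avoid that.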

In [5]:
# Instantiate the Preprocessor
pre = Preprocessor()

# Sample reviews with punctuation issues
sample_texts = [
    "This movie is great!Amazing direction.",
    "Wait...what?Really?",
    "Incredible,unbelievable!Must watch.",
    "I loved it.The actors were amazing.",
    "Terrible;no plot,no logic;just noise."
]

# Apply only punctuation spacing normalization
for i, text in enumerate(sample_texts, 1):
    normalized = pre.normalize_spacing(text)
    print(f"Original {i}: {text}")
    print(f"Normalized {i}: {normalized}\n")



Original 1: This movie is great!Amazing direction.
Normalized 1: This movie is great! Amazing direction.

Original 2: Wait...what?Really?
Normalized 2: Wait... what? Really?

Original 3: Incredible,unbelievable!Must watch.
Normalized 3: Incredible, unbelievable! Must watch.

Original 4: I loved it.The actors were amazing.
Normalized 4: I loved it. The actors were amazing.

Original 5: Terrible;no plot,no logic;just noise.
Normalized 5: Terrible; no plot, no logic; just noise.



### Test 3 – Nonsense Detection

This test evaluates the ability of the `Preprocessor` class to detect and flag **nonsensical or low-quality reviews**.

The implemented logic marks a review as *nonsense* if it satisfies at least one of the following conditions:
- The text is **too short** (e.g., fewer than 10 characters).
- The **ratio of alphabetic characters** to total characters is very low (e.g., dominated by symbols or numbers).

This filtering step is essential to discard meaningless entries that could negatively affect downstream tasks such as embedding generation or keyword extraction.

We isolate and apply only the **nonsense detection** module in this test, checking how it handles various inputs including:
- Empty strings  
- Symbol-only content  
- Short but meaningful phrases  
- Number-dominated text  

Each input is labeled as either `OK` (valid) or `NONSENSE` (to be discarded).
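
The two conditions can be sketched as a small standalone function. This is not the code behind `Preprocessor.is_nonsense`: the name `looks_like_nonsense` is hypothetical, the 10-character minimum comes from the example above, and the 0.5 ratio threshold is an assumed value chosen only for illustration.

```python
def looks_like_nonsense(text: str,
                        min_length: int = 10,
                        min_alpha_ratio: float = 0.5) -> bool:
    """Flag text that is too short or mostly non-alphabetic.

    min_alpha_ratio is an assumed threshold, not a value taken from
    the Preprocessor class.
    """
    stripped = text.strip()
    # Empty or very short inputs are rejected before any ratio is computed,
    # which also avoids dividing by zero.
    if not stripped or len(stripped) < min_length:
        return True
    # Share of characters that are letters; digits, symbols, spaces and
    # emoji all count against the ratio.
    letters = sum(ch.isalpha() for ch in stripped)
    return letters / len(stripped) < min_alpha_ratio


for sample in ["Ok", "1234567890", "This is fine."]:
    print(repr(sample), "NONSENSE" if looks_like_nonsense(sample) else "OK")
```

Checking the length first means the ratio only matters for inputs such as `"1234567890"`, which are long enough to pass the first test but contain no letters at all.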


In [6]:
# Instantiate the Preprocessor
from preprocessing import Preprocessor
pre = Preprocessor()

# Test cases for nonsense detection
samples = [
    "!!!...??",               # Only punctuation
    "1234567890",             # Only numbers
    "Ok",                     # Too short
    "This is fine.",          # Valid sentence
    "....",                   # Dots only
    "!!!!????....",           # Random punctuation
    "The movie was good.",    # Proper review
    "👍🏻👍🏻👍🏻"                   # Emoji only
]

# Apply nonsense detection logic
for i, sample in enumerate(samples, 1):
    result = pre.is_nonsense(sample)
    status = "NONSENSE" if result else "OK"
    print(f"Sample {i}: '{sample}' → {status}\n")


Sample 1: '!!!...??' → NONSENSE

Sample 2: '1234567890' → NONSENSE

Sample 3: 'Ok' → NONSENSE

Sample 4: 'This is fine.' → OK

Sample 5: '....' → NONSENSE

Sample 6: '!!!!????....' → NONSENSE

Sample 7: 'The movie was good.' → OK

Sample 8: '👍🏻👍🏻👍🏻' → NONSENSE



### Test 4 – Lemmatization

In this test, we evaluate the **lemmatization** capability of the `Preprocessor` class, implemented using `spaCy`.

Lemmatization is the process of reducing words to their base or dictionary form (lemma), which helps normalize textual data. For instance:
- `"running"` → `"run"`
- `"cars"` → `"car"`
- `"was"` → `"be"`

This normalization is critical for downstream NLP tasks such as:
- Keyword extraction
- Embedding generation
- Clustering or classification

In this test, we apply **only the lemmatization step**, using a variety of phrases containing inflected forms of verbs and nouns, and inspect whether the transformations are performed correctly.
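
A lemmatized sentence can be rebuilt from spaCy tokens by joining each token's lemma with its original trailing whitespace. The sketch below assumes token objects that expose spaCy's `text`, `lemma_`, and `whitespace_` attributes; the function name `join_lemmas` is hypothetical. It also restores a leading capital, because the outputs in this notebook keep the original capitalization; how `Preprocessor.lemmatize_text` actually achieves that is not shown here.

```python
def join_lemmas(tokens) -> str:
    """Rebuild a sentence from lemmas, keeping spacing and initial capitals.

    `tokens` is any iterable of objects with spaCy-style `text`, `lemma_`
    and `whitespace_` attributes, for example a spaCy Doc.
    """
    pieces = []
    for tok in tokens:
        lemma = tok.lemma_
        # Re-apply an initial capital that the lemma lost (e.g. "Cats" -> "Cat").
        if tok.text[:1].isupper() and lemma[:1].islower():
            lemma = lemma[:1].upper() + lemma[1:]
        # whitespace_ is the space (or empty string) that followed the token
        # in the original text, so punctuation stays attached as before.
        pieces.append(lemma + tok.whitespace_)
    return "".join(pieces)


# Possible usage (assumes spaCy and the en_core_web_sm model are installed):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   join_lemmas(nlp("The cats are running in the gardens."))
```

Joining on `whitespace_` instead of a single space is what keeps commas and full stops attached to the preceding word.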

In [7]:
# Sample sentences with inflected forms
lemmatization_samples = [
    "The cats are running in the gardens.",
    "She was eating apples.",
    "They have been thinking about it.",
    "He walks, talks, and smiles.",
    "Children played with toys yesterday."
]

# Apply lemmatization
print("=== Lemmatization Test ===")
for i, text in enumerate(lemmatization_samples, 1):
    result = pre.lemmatize_text(text)
    print(f"Sample {i}:\nOriginal    → {text}\nLemmatized  → {result}\n")


=== Lemmatization Test ===
Sample 1:
Original    → The cats are running in the gardens.
Lemmatized  → The cat be run in the garden.

Sample 2:
Original    → She was eating apples.
Lemmatized  → She be eat apple.

Sample 3:
Original    → They have been thinking about it.
Lemmatized  → They have be think about it.

Sample 4:
Original    → He walks, talks, and smiles.
Lemmatized  → He walk, talk, and smile.

Sample 5:
Original    → Children played with toys yesterday.
Lemmatized  → Child play with toy yesterday.



### Test 5 – Full Preprocessing Pipeline

This test evaluates the **entire preprocessing pipeline** implemented in the `Preprocessor` class.  
The pipeline includes all previously tested steps, executed in sequence:

1. **Typo Correction** → Fixes common spelling mistakes by combining Zipf-frequency checks and repeated-letter reduction with `autocorrect` as a fallback.
2. **Punctuation Spacing Normalization** → Inserts a space after a punctuation mark, but only when it is directly followed by an alphanumeric character (e.g., `"Hello.This"` → `"Hello. This"`).
3. **Nonsense Detection** → Removes reviews that are too short or composed mostly of symbols and digits.
4. **Lemmatization** → Converts words to their base form using `spaCy` while preserving punctuation formatting and casing.

We apply this pipeline to a variety of noisy reviews to verify its effectiveness on:
- Misspelled words
- Incorrect punctuation
- Empty or meaningless reviews
- Complex but valid inputs

This final test validates the correctness and stability of the preprocessing logic before applying it at scale to our review datasets.
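
The sequence of steps listed above can be sketched as a small driver function. It is an illustration under assumptions, not the body of `Preprocessor.preprocess_review`: the name `run_pipeline` and its parameters are hypothetical, the four steps are passed in as callables, and the nonsense check is assumed to run at exactly the position given in the numbered list.

```python
from typing import Callable, Optional


def run_pipeline(review: str,
                 correct_typos: Callable[[str], str],
                 normalize_spacing: Callable[[str], str],
                 is_nonsense: Callable[[str], bool],
                 lemmatize: Callable[[str], str]) -> Optional[str]:
    """Apply the four steps in order; return None for discarded reviews."""
    text = correct_typos(review)
    text = normalize_spacing(text)
    # Reviews flagged as nonsense are dropped before the costlier
    # lemmatization step runs.
    if is_nonsense(text):
        return None
    return lemmatize(text)


# Possible usage with the class tested in this notebook:
#   pre = Preprocessor()
#   run_pipeline(review, pre.correct_typos, pre.normalize_spacing,
#                pre.is_nonsense, pre.lemmatize_text)
```

Returning `None` for rejected reviews matches how the test cell below distinguishes removed entries from processed ones.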


In [9]:
# Define sample reviews for the full pipeline
samples = [
    "This movie is absoltly amazng!The charactrs were believabl.",
    "Whaat??Noo...thiiiiis is baad dirction!!!",
    "1234 .... 🤖🤖🤖 ???",  # nonsense
    "I was stunned.The performnce was stunning.",
    "LOOOOOVED the filmmmm!!!!!The end was...unexpected!",
    "bad.",  # likely to be nonsense
    "It is not very bad,but not that good either."
]

# Apply full preprocessing pipeline
for i, sample in enumerate(samples, 1):
    result = pre.preprocess_review(sample)
    status = "REMOVED (nonsense)" if result is None else f"Processed: {result}"
    print(f"\n=== Sample {i} ===\nOriginal:  {sample}\n{status}")



=== Sample 1 ===
Original:  This movie is absoltly amazng!The charactrs were believabl.
Processed: This movie be absolutely amazing! The character be believable.

=== Sample 2 ===
Original:  Whaat??Noo...thiiiiis is baad dirction!!!
Processed: What?? No... this be bad direction!!!

=== Sample 3 ===
Original:  1234 .... 🤖🤖🤖 ???
REMOVED (nonsense)

=== Sample 4 ===
Original:  I was stunned.The performnce was stunning.
Processed: I be stun. The performance be stunning.

=== Sample 5 ===
Original:  LOOOOOVED the filmmmm!!!!!The end was...unexpected!
Processed: Love the film!!!!! The end be... unexpected!

=== Sample 6 ===
Original:  bad.
REMOVED (nonsense)

=== Sample 7 ===
Original:  It is not very bad,but not that good either.
Processed: It be not very bad, but not that good either.
