# Text Preprocessing Test Notebook

This notebook is designed to **test and demonstrate the behavior of a lightweight text preprocessing pipeline** implemented in the `Preprocessor` class. The preprocessing steps are tailored to prepare movie reviews for transformer-based keyword extraction using models like **KeyBERT** with **`all-MiniLM-L6-v2`** embeddings.

The main preprocessing operations include:
- **Punctuation-spacing normalization**, ensuring better readability and compatibility with tokenizer expectations  
- **Typo correction** using `autocorrect`, with enhanced handling of repeated letters and proper nouns 
- **Nonsense/empty review filtering**, removing short or unintelligible entries that would reduce model quality  

Each step is tested with controlled input examples to verify correctness and robustness before applying the pipeline to full datasets.

> **Note**:  
> - **Lemmatization is not performed**, as transformer models (like MiniLM) internally manage word variation through subword tokenization and contextual embeddings.  
> - **Stop word removal is also skipped**, since this is handled during keyword selection by custom KeyBERT extensions.  
> This ensures compatibility with downstream transformer models and preserves the contextual richness needed for accurate keyword extraction.


## Setup: Installing and Importing Required Libraries

In [1]:
import subprocess
import sys

# List of required packages
required_packages = [
    "pandas",
    "spacy",
    "autocorrect",
    "wordfreq",
]

def install_package(package):
    """Installs a package using pip if it's not already installed."""
    try:
        __import__(package)
        print(f"{package} is already installed.")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Check and install missing packages
for package in required_packages:
    install_package(package)


pandas is already installed.
spacy is already installed.
autocorrect is already installed.
wordfreq is already installed.


In [2]:
# Standard imports for preprocessing
import pandas as pd

# Text processing
import re
from autocorrect import Speller
from wordfreq import zipf_frequency # type: ignore

# spaCy for NLP tasks
import spacy

# Load the English language model in spaCy (download if not present)
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print("Downloading 'en_core_web_sm' model...")
    subprocess.check_call([sys.executable, "-m", "spacy", "download", "en_core_web_sm"])
    nlp = spacy.load("en_core_web_sm")


### Importing the Custom Preprocessor

This cell imports the `Preprocessor` class from the custom `preprocessing.py` module.  
The class encapsulates all the text cleaning operations required to prepare review texts before passing them to a Transformer-based model.  
It provides methods for typo correction, punctuation normalization and filtering of nonsensical content, and will be applied to each review in the dataset.

In [3]:
from preprocessing import Preprocessor 

### Test 1 – Typo Correction

This test evaluates the typo correction capabilities of the `Preprocessor` class.

The input consists of sentences with common spelling errors such as:
- `"amazng"` → `"amazing"`  
- `"dirction"` → `"direction"`  
- `"absolutly"` → `"absolutely"`

Typo correction is a key step in improving the quality of keyword extraction and semantic embeddings.  
The logic implemented combines several techniques:

- **Whitespace normalization**: collapses multiple consecutive spaces into a single space.

- **Proper noun preservation**: capitalized words that are not at the beginning of a sentence are excluded from correction to avoid altering named entities.

- **Character repetition handling**:
   - If a word contains a run of three or more identical consecutive letters (e.g., "loooong"), the run is first reduced to two letters (→ "loong"), then to one (→ "long"), with a validity check at each step.
   - The first reduced form that is a valid word is kept.

- **Autocorrect fallback**: if no valid form is found through the above steps, the word is passed to `autocorrect` for correction.

This combined approach prevents overcorrection (e.g., `"baad"` becoming `"band"`) and enhances the robustness of the text preprocessing pipeline.
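
The rules above can be sketched as follows. This is a minimal illustration, not the `Preprocessor` implementation: `KNOWN_WORDS`, `is_valid`, `collapse_repeats`, and `sketch_correct` are hypothetical names, the toy vocabulary stands in for whatever validity check the class actually uses, and the `fallback` argument stands in for the spell checker.

```python
import re

# Hypothetical stand-in vocabulary; the real validity check is not shown here.
KNOWN_WORDS = {"the", "film", "by", "was", "long", "good", "bad", "book"}

# A run of three or more identical consecutive letters, e.g. "ooo" in "loooong".
REPEAT_RUN = re.compile(r"([A-Za-z])\1{2,}")


def is_valid(word):
    """Assumed validity check: membership in the toy vocabulary."""
    return word.lower() in KNOWN_WORDS


def collapse_repeats(word):
    """Shrink long letter runs to 2, then to 1; return the first valid form or None."""
    if not REPEAT_RUN.search(word):
        return None
    for keep in (2, 1):
        candidate = REPEAT_RUN.sub(lambda m: m.group(1) * keep, word)
        if is_valid(candidate):
            return candidate
    return None


def sketch_correct(text, fallback=lambda word: word):
    """Apply the rules token by token; `fallback` stands in for the spell checker."""
    tokens = re.sub(r"\s+", " ", text).strip().split(" ")
    corrected = []
    sentence_start = True
    for token in tokens:
        if token[:1].isupper() and not sentence_start:
            corrected.append(token)  # likely a proper noun: leave untouched
        elif is_valid(token):
            corrected.append(token)
        else:
            corrected.append(collapse_repeats(token) or fallback(token))
        sentence_start = token.endswith((".", "!", "?"))
    return " ".join(corrected)


print(sketch_correct("The   film by Nolaaan was goooood"))
# The film by Nolaaan was good
```

In this sketch the default `fallback` returns the word unchanged; a callable such as an `autocorrect` `Speller` instance could be passed instead. Tokens with attached punctuation are not handled here, which a real implementation would need to address.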


In [4]:
# Initialize the preprocessor
pre = Preprocessor()

# Sample reviews with typos
typo_reviews = [
    "This movi was amazng",
    "The dirction of the film is goooood",
    "Charactrs were not believabl",
    "Absolutly stunning      performnce by the lead actr",
]

# Apply typo correction only
print("=== Typo Correction Test ===\n")
for i, review in enumerate(typo_reviews, 1):
    corrected = pre.correct_typos(review)
    print(f"Review {i}:\nOriginal:  {review}\nCorrected: {corrected}\n")


=== Typo Correction Test ===

Review 1:
Original:  This movi was amazng
Corrected: This move was amazing

Review 2:
Original:  The dirction of the film is goooood
Corrected: The direction of the film is good

Review 3:
Original:  Charactrs were not believabl
Corrected: Characters were not believable

Review 4:
Original:  Absolutly stunning      performnce by the lead actr
Corrected: Absolutely stunning performance by the lead actor



### Test 2 – Punctuation Spacing Normalization

This test evaluates the punctuation spacing normalization step of the `Preprocessor` class.

The objective is to ensure that a **space is inserted after punctuation marks** (such as `.`, `,`, `!`, `?`, `;`, `:`) **only when appropriate**.  
Specifically, a space is added **only if** the punctuation is **directly followed by a letter or underscore**, and **not** by a digit or another punctuation mark.

This normalization improves **readability** and prevents the **merging of adjacent words**, which could negatively affect downstream tasks like tokenization or embedding.  
At the same time, it leaves numeric formats and runs of punctuation intact. For example:
- `"Hello.This"` → `"Hello. This"` 
- `"Wow!!!Great"` → `"Wow!!! Great"` 
- `"Price is $200,000.00"` → remains unchanged

By applying this rule selectively, the pipeline keeps sentence structure clean without corrupting numerical data or stylistic emphasis.
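
One way to express this rule is a single regular expression with a lookahead. The sketch below is an assumption about how such a rule could be written, not the code behind `normalize_spacing`; the function name `add_space_after_punctuation` is hypothetical.

```python
import re

# One of . , ! ? ; : directly followed by a letter or underscore.
# [^\W\d] means "a word character that is not a digit".
PUNCT_BEFORE_LETTER = re.compile(r"([.,!?;:])(?=[^\W\d])")


def add_space_after_punctuation(text):
    """Insert a space after punctuation only when a letter or underscore follows."""
    return PUNCT_BEFORE_LETTER.sub(r"\1 ", text)


for sample in ["Hello.This", "Wow!!!Great", "Price is $200,000.00"]:
    print(add_space_after_punctuation(sample))
# Hello. This
# Wow!!! Great
# Price is $200,000.00
```

Because the lookahead does not consume the following character, only the last mark in a run such as `!!!` receives a space. Note that a rule this simple also splits abbreviations like `e.g.` into `e. g.`.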

In [5]:
# Instantiate the Preprocessor
pre = Preprocessor()

# Sample reviews with punctuation issues
sample_texts = [
    "This movie is great!Amazing direction.",
    "Wait...what?Really?",
    "Incredible,unbelievable!Must watch.",
    "I loved it.The actors were amazing.",
    "I paid 300,000$ this house!!!It was worth it.",
]

# Apply only punctuation spacing normalization
for i, text in enumerate(sample_texts, 1):
    normalized = pre.normalize_spacing(text)
    print(f"Original {i}: {text}")
    print(f"Normalized {i}: {normalized}\n")



Original 1: This movie is great!Amazing direction.
Normalized 1: This movie is great! Amazing direction.

Original 2: Wait...what?Really?
Normalized 2: Wait... what? Really?

Original 3: Incredible,unbelievable!Must watch.
Normalized 3: Incredible, unbelievable! Must watch.

Original 4: I loved it.The actors were amazing.
Normalized 4: I loved it. The actors were amazing.

Original 5: I paid 300,000$ this house!!!It was worth it.
Normalized 5: I paid 300,000$ this house!!! It was worth it.



### Test 3 – Nonsense Detection

This test evaluates the ability of the `Preprocessor` class to detect and flag **nonsensical or low-quality reviews**.

The implemented logic marks a review as *nonsense* if it satisfies at least one of the following conditions:
- The text is **too short** (e.g., fewer than 10 characters).
- The **ratio of alphabetic characters** to total characters is very low (e.g., dominated by symbols or numbers).

This filtering step is essential to discard meaningless entries that could negatively affect downstream tasks such as embedding generation or keyword extraction.

We isolate and apply only the **nonsense detection** module in this test, checking how it handles various inputs including:
- Empty strings  
- Symbol-only content  
- Short but meaningful phrases  
- Number-dominated text  

Each input is labeled as either `OK` (valid) or `NONSENSE` (to be discarded).
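
The two conditions can be sketched in a few lines. The function name `looks_like_nonsense` and both default thresholds are assumptions for illustration: the 10-character minimum follows the example given above, and the 0.5 alphabetic ratio is a placeholder, since the value used by `is_nonsense` is not stated here.

```python
def looks_like_nonsense(text, min_length=10, min_alpha_ratio=0.5):
    """Flag text that is too short or mostly non-alphabetic (thresholds are assumed)."""
    stripped = text.strip()
    # Empty or too-short input is rejected before computing any ratio.
    if not stripped or len(stripped) < min_length:
        return True
    alpha_count = sum(ch.isalpha() for ch in stripped)
    return alpha_count / len(stripped) < min_alpha_ratio


for sample in ["", "Ok", "1234567890", "!!!!????....", "The movie was good."]:
    status = "NONSENSE" if looks_like_nonsense(sample) else "OK"
    print(f"{sample!r} -> {status}")
```

With these assumed thresholds, `"1234567890"` passes the length check but is rejected by the ratio check, while `"The movie was good."` passes both.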


In [6]:
# Instantiate the Preprocessor
from preprocessing import Preprocessor
pre = Preprocessor()

# Test cases for nonsense detection
samples = [
    "!!!...??",               # Only punctuation
    "1234567890",             # Only numbers
    "Ok",                     # Too short
    "This is fine.",          # Valid sentence
    "....",                   # Dots only
    "!!!!????....",           # Random punctuation
    "The movie was good.",    # Proper review
    "👍🏻👍🏻👍🏻"                   # Emoji only
]

# Apply nonsense detection logic
for i, sample in enumerate(samples, 1):
    result = pre.is_nonsense(sample)
    status = "NONSENSE" if result else "OK"
    print(f"Sample {i}: '{sample}' → {status}\n")


Sample 1: '!!!...??' → NONSENSE

Sample 2: '1234567890' → NONSENSE

Sample 3: 'Ok' → NONSENSE

Sample 4: 'This is fine.' → OK

Sample 5: '....' → NONSENSE

Sample 6: '!!!!????....' → NONSENSE

Sample 7: 'The movie was good.' → OK

Sample 8: '👍🏻👍🏻👍🏻' → NONSENSE



### Test 4 – Full Preprocessing Pipeline

This test evaluates the **entire preprocessing pipeline** implemented in the `Preprocessor` class.  
The pipeline includes all previously tested steps, executed in sequence:

1. **Typo Correction** → Fixes common spelling mistakes.
2. **Punctuation Spacing Normalization** → Ensures correct spacing after punctuation marks, but only when the mark is directly followed by a letter or underscore.
3. **Nonsense Detection** → Removes reviews that are too short or composed mostly of symbols and digits.

We apply this pipeline to a variety of noisy reviews, including slang, emoji, and acronyms, to check how it behaves end to end.
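
The sequencing can be sketched as a small composition helper: cleaning steps run in order, and a final check decides whether the review is returned or dropped as `None`. Everything here is hypothetical scaffolding (`make_pipeline` and the stand-in steps are not part of `Preprocessor`); it only illustrates the control flow.

```python
import re


def make_pipeline(steps, is_rejected):
    """Build a function that runs `steps` in order and returns None for rejected text."""
    def run(text):
        for step in steps:
            text = step(text)
        return None if is_rejected(text) else text
    return run


# Simplified stand-ins for the real cleaning steps and filter.
def collapse_spaces(text):
    return re.sub(r"\s+", " ", text).strip()


def space_after_punctuation(text):
    return re.sub(r"([.,!?;:])(?=[^\W\d])", r"\1 ", text)


def too_short(text):
    return len(text) < 10


pipeline = make_pipeline([collapse_spaces, space_after_punctuation], too_short)

reviews = ["Great   film!Loved it.", "??"]
cleaned = [result for result in map(pipeline, reviews) if result is not None]
print(cleaned)
# ['Great film! Loved it.']
```

Returning `None` for rejected reviews keeps the filtering decision separate from the cleaning steps, so callers can drop or log removed entries as they see fit.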

In [12]:
# Define sample reviews for the full pipeline
samples = [
    "This movie is absoltly      amazng!The charactrs were believabl(not all of theeeem).",
    "Whaat??Noo...thiiiiis is baad dirction!!!",
    "1234 .... 🤖🤖🤖 ???",  # nonsense
    "OMG thisss   moviiee wazzz sooo 😭😭baaad...but kinda goood??!!(I thinkkk).LOVEDDDD ittttt 😅 4 realzz!!!!!!!",
    "LOOOOOVED the filmmmm!!!!!I see it on VHS and the end was...unexpected!",
    "It is nots very          baaaaad,burt        not thatttt goad eithaer."
]

# Apply full preprocessing pipeline
for i, sample in enumerate(samples, 1):
    result = pre.preprocess_review(sample)
    status = "REMOVED (nonsense)" if result is None else f"Processed: {result}"
    print(f"\n=== Sample {i} ===\nOriginal:  {sample}\n{status}")



=== Sample 1 ===
Original:  This movie is absoltly      amazng!The charactrs were believabl(not all of theeeem).
Processed: This movie is absolutely amazing! The characters were believable (not all of them).

=== Sample 2 ===
Original:  Whaat??Noo...thiiiiis is baad dirction!!!
Processed: What?? No... this is bad direction!!!

=== Sample 3 ===
Original:  1234 .... 🤖🤖🤖 ???
REMOVED (nonsense)

=== Sample 4 ===
Original:  OMG thisss   moviiee wazzz sooo 😭😭baaad...but kinda goood??!!(I thinkkk).LOVEDDDD ittttt 😅 4 realzz!!!!!!!
Processed: OMG this movie jazz soo bad... but kinda good??!! (I think). LOVED it 4 really!!!!!!!

=== Sample 5 ===
Original:  LOOOOOVED the filmmmm!!!!!I see it on VHS and the end was...unexpected!
Processed: LOVED the film!!!!! I see it on HS and the end was... unexpected!

=== Sample 6 ===
Original:  It is nots very          baaaaad,burt        not thatttt goad eithaer.
Processed: It is not very bad, but not that good either.
