# XGBoost Preprocessing Pipeline Visualization

This notebook demonstrates how Romanian reviews are preprocessed at each step of the XGBoost pipeline.

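Each review is shown at five stages (labelled STEP 0 to STEP 4 in the printed output):
1. **Original Text** - Raw review as scraped
2. **After Cleaning** - Lowercased, with whitespace collapsed and some punctuation (commas, hyphens) removed
3. **After Tokenization** - Split on whitespace, keeping only tokens within the configured length bounds (2 to 50 characters here)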
4. **After Lemmatization** - Words reduced to base forms, stopwords removed
5. **After Stemming** - Further reduction to word stems
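
The stages listed above can be sketched as a chain of small functions. This is a minimal, self-contained illustration and not the project's implementation: `TOY_STOPWORDS` and `TOY_SUFFIXES` are tiny hypothetical stand-ins for the resources behind `RomanianLemmatizer` and `RomanianStemmer`, the cleaning rules are assumptions, and real lemmatization is omitted (only stopword removal is shown).

```python
import re

# Hypothetical, deliberately tiny resources for illustration only.
TOY_STOPWORDS = {"de", "si", "in", "cu", "nu", "am", "este"}
TOY_SUFFIXES = ("ului", "ele", "ul", "le", "a", "e", "i")


def clean(text: str) -> str:
    """Lowercase, replace commas and hyphens with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[,\-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str, min_len: int = 2, max_len: int = 50) -> list:
    """Split on whitespace and keep tokens within the inclusive length bounds."""
    return [t for t in text.split() if min_len <= len(t) <= max_len]


def drop_stopwords(tokens: list) -> list:
    """Remove tokens found in the toy stopword set."""
    return [t for t in tokens if t not in TOY_STOPWORDS]


def toy_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> str:
    """Run clean -> tokenize -> stopword removal -> stemming, then rejoin."""
    tokens = drop_stopwords(tokenize(clean(text)))
    return " ".join(toy_stem(t) for t in tokens)


result = preprocess("Telefoanele stau bine, si e foarte usor")
```

Each function takes the previous stage's output, so any stage can be inspected on its own, which is the same idea the rest of this notebook applies to the real components.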

In [13]:
# Setup
import sys
from pathlib import Path
import pandas as pd

# Add src to path
sys.path.insert(0, str(Path.cwd().parent / 'src'))

# Import preprocessing components
from preprocessing.cleaner import TextCleaner
from preprocessing.tokenizer import WhitespaceTokenizer
from preprocessing.lemmatizer import RomanianLemmatizer
from preprocessing.stemmer import RomanianStemmer
from preprocessing.pipeline import PreprocessingPipeline
from utils.config import load_config

print("Modules imported successfully")

Modules imported successfully


In [14]:
# Load a sample of reviews from LaRoSeDa dataset
from huggingface_hub import hf_hub_download

print("Loading LaRoSeDa dataset...")
train_file = hf_hub_download(
    repo_id="universityofbucharest/laroseda",
    filename="laroseda/train/0000.parquet",
    repo_type="dataset",
    revision="refs/convert/parquet"
)

df = pd.read_parquet(train_file)
print(f"Loaded {len(df)} reviews")
print(f"\nColumns: {list(df.columns)}")
df.head()

Loading LaRoSeDa dataset...
Loaded 12000 reviews

Columns: ['index', 'title', 'content', 'starRating']


|   | index | title | content | starRating |
|---|-------|-------|---------|------------|
| 0 | 11262 | Foarte slab | ca aspect este foarte frumoasa dar cine vrea s... | 1 |
| 1 | 3890 | Foarte multumit | se incarca repede si tine 3 incarcari complete... | 5 |
| 2 | 9413 | Țeapa de zile mari!!! | in primul rand nu este de stica dupa cum spune... | 1 |
| 3 | 9350 | Nu merita cumparate | nu merita cumparate... sunt create cu limitare... | 1 |
| 4 | 7126 | Recomand | un ceas excelent. face cam tot ce ai nevoie ca... | 5 |


In [15]:
# Select two contrasting examples: one negative and one positive review

# Get one negative (1-2 stars) and one positive (4-5 stars)
negative_sample = df[df['starRating'].isin([1, 2])].iloc[1]
positive_sample = df[df['starRating'].isin([4, 5])].iloc[3]

# Combine title and content as done in training
sample_reviews = [
    {
        'sentiment': 'NEGATIVE',
        'stars': negative_sample['starRating'],
        'text': f"{negative_sample['title']} {negative_sample['content']}"
    },
    {
        'sentiment': 'POSITIVE',
        'stars': positive_sample['starRating'],
        'text': f"{positive_sample['title']} {positive_sample['content']}"
    }
]

print("Selected samples:")
for i, review in enumerate(sample_reviews, 1):
    print(f"\n{i}. {review['sentiment']} ({review['stars']} stars)")
    print(f"   Length: {len(review['text'])} chars")
    print(f"   Preview: {review['text'][:100]}...")

Selected samples:

1. NEGATIVE (1 stars)
   Length: 293 chars
   Preview: Țeapa de zile mari!!! in primul rand nu este de stica dupa cum spune producatorul, este una de plast...

2. POSITIVE (5 stars)
   Length: 163 chars
   Preview: Recomand am acest produs de aproape jumatate de an timp in care nu am avut probleme cu el. s-a fixat...
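
The selection above treats 1-2 stars as negative and 4-5 stars as positive. A small helper can make that mapping explicit. `stars_to_sentiment` is a hypothetical name, and returning `None` for any other rating is an assumption, since the notebook does not show whether such ratings occur or how they would be handled.

```python
from typing import Optional


def stars_to_sentiment(stars: int) -> Optional[str]:
    """Map a star rating to the coarse label used in this notebook."""
    if stars in (1, 2):
        return "NEGATIVE"
    if stars in (4, 5):
        return "POSITIVE"
    # Assumption: ratings outside 1-2 and 4-5 are left unlabeled.
    return None


labels = [stars_to_sentiment(s) for s in (1, 5, 3)]
```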


## Step-by-Step Preprocessing

Now let's see how each component transforms the text.

In [16]:
# Load preprocessing config (same as used in training)
preprocessing_config = load_config('../configs/preprocessing_config.yaml')
print("Preprocessing configuration:")
print(preprocessing_config)

Preprocessing configuration:
{'preprocessing': {'language': 'romanian', 'lowercase': True, 'remove_stopwords': True, 'lemmatize': True, 'stem': True, 'min_token_length': 2, 'max_token_length': 50}}
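
The `min_token_length` and `max_token_length` settings decide which tokens survive tokenization. The sketch below shows how such a filter might behave. It is an assumption, not the actual `WhitespaceTokenizer` code, and `filter_by_length` is a hypothetical helper.

```python
def filter_by_length(tokens, min_token_length=2, max_token_length=50):
    """Keep tokens whose length lies within the inclusive bounds."""
    return [t for t in tokens if min_token_length <= len(t) <= max_token_length]


# After cleaning, "s-a fixat" has become "s a fixat".
tokens = "s a fixat din prima".split()
kept = filter_by_length(tokens)
```

With these bounds the single-character tokens `s` and `a` are dropped while two-character tokens such as `am` are kept, which matches the tokenization output for the positive review in this notebook.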


In [17]:
# Initialize each component separately to see individual effects

# Extract the preprocessing section from config
config = preprocessing_config['preprocessing']

cleaner = TextCleaner(
    lowercase=config['lowercase']
)

tokenizer = WhitespaceTokenizer(
    min_token_length=config['min_token_length'],
    max_token_length=config['max_token_length']
)

lemmatizer = RomanianLemmatizer(
    remove_stopwords=config['remove_stopwords']
)

stemmer = RomanianStemmer()

print("All preprocessing components initialized")

All preprocessing components initialized


In [18]:
def visualize_preprocessing(text, sentiment_label):
    """
    Show each preprocessing step for a single review.
    """
    print("="*100)
    print(f"SENTIMENT: {sentiment_label}")
    print("="*100)
    
    # Step 0: Original
    print(f"\n📝 STEP 0: ORIGINAL TEXT")
    print(f"-" * 100)
    print(text[:500])  # Show first 500 chars
    if len(text) > 500:
        print(f"... (total {len(text)} chars)")
    
    # Step 1: Cleaning
    cleaned = cleaner.clean(text)
    print(f"\n🧹 STEP 1: AFTER CLEANING (lowercase, normalize)")
    print(f"-" * 100)
    print(cleaned[:500])
    if len(cleaned) > 500:
        print(f"... (total {len(cleaned)} chars)")
    
    # Step 2: Tokenization
    tokens = tokenizer.tokenize(cleaned)
    print(f"\n✂️ STEP 2: AFTER TOKENIZATION (split, filter length)")
    print(f"-" * 100)
    print(f"Token count: {len(tokens)}")
    print(f"Tokens: {' | '.join(tokens[:30])}")
    if len(tokens) > 30:
        print(f"... and {len(tokens) - 30} more tokens")
    
    # Step 3: Lemmatization (also removes stopwords)
    lemmatized = lemmatizer.lemmatize(tokens)
    print(f"\n🔤 STEP 3: AFTER LEMMATIZATION (base forms + stopword removal)")
    print(f"-" * 100)
    print(f"Token count: {len(lemmatized)} (removed {len(tokens) - len(lemmatized)} stopwords)")
    print(f"Tokens: {' | '.join(lemmatized[:30])}")
    if len(lemmatized) > 30:
        print(f"... and {len(lemmatized) - 30} more tokens")
    
    # Step 4: Stemming
    stemmed = stemmer.stem(lemmatized)
    print(f"\n🌱 STEP 4: AFTER STEMMING (further reduction)")
    print(f"-" * 100)
    print(f"Token count: {len(stemmed)}")
    print(f"Tokens: {' | '.join(stemmed[:30])}")
    if len(stemmed) > 30:
        print(f"... and {len(stemmed) - 30} more tokens")
    
    # Final output
    final_text = ' '.join(stemmed)
    print(f"\n✨ FINAL PREPROCESSED TEXT")
    print(f"-" * 100)
    print(final_text[:500])
    if len(final_text) > 500:
        print(f"... (total {len(final_text)} chars)")
    
    print(f"\n📊 SUMMARY:")
    print(f"   Original length: {len(text)} chars, ~{len(text.split())} words")
    print(f"   After tokenization: {len(tokens)} tokens")
    print(f"   After lemmatization: {len(lemmatized)} tokens ({100*(len(tokens)-len(lemmatized))/max(len(tokens), 1):.1f}% removed)")
    print(f"   After stemming: {len(stemmed)} tokens")
    print(f"   Final text length: {len(final_text)} chars")
    print(f"   Reduction: {100*(len(text)-len(final_text))/max(len(text), 1):.1f}%")
    print()
    
    return final_text

In [19]:
# Visualize preprocessing for NEGATIVE review
negative_preprocessed = visualize_preprocessing(
    sample_reviews[0]['text'],
    f"{sample_reviews[0]['sentiment']} ({sample_reviews[0]['stars']} stars)"
)

SENTIMENT: NEGATIVE (1 stars)

📝 STEP 0: ORIGINAL TEXT
----------------------------------------------------------------------------------------------------
Țeapa de zile mari!!! in primul rand nu este de stica dupa cum spune producatorul, este una de plastic de nici 10 lei   calitatea este 0 a mea a venit gata zgariata plus de asta nu se lipeste pe ecran sub nici o forma desi am urmat atent instructiunile   daca nu aveti pe ce da bani comandati!

🧹 STEP 1: AFTER CLEANING (lowercase, normalize)
----------------------------------------------------------------------------------------------------
țeapa de zile mari!!! in primul rand nu este de stica dupa cum spune producatorul este una de plastic de nici 10 lei calitatea este 0 a mea a venit gata zgariata plus de asta nu se lipeste pe ecran sub nici o forma desi am urmat atent instructiunile daca nu aveti pe ce da bani comandati!

✂️ STEP 2: AFTER TOKENIZATION (split, filter length)
---------------------------------------------

In [21]:
# Visualize preprocessing for POSITIVE review
positive_preprocessed = visualize_preprocessing(
    sample_reviews[1]['text'],
    f"{sample_reviews[1]['sentiment']} ({sample_reviews[1]['stars']} stars)"
)

SENTIMENT: POSITIVE (5 stars)

📝 STEP 0: ORIGINAL TEXT
----------------------------------------------------------------------------------------------------
Recomand am acest produs de aproape jumatate de an timp in care nu am avut probleme cu el. s-a fixat din prima, telefoanele stau bine si e foarte usor de manevrat.

🧹 STEP 1: AFTER CLEANING (lowercase, normalize)
----------------------------------------------------------------------------------------------------
recomand am acest produs de aproape jumatate de an timp in care nu am avut probleme cu el. s a fixat din prima telefoanele stau bine si e foarte usor de manevrat.

✂️ STEP 2: AFTER TOKENIZATION (split, filter length)
----------------------------------------------------------------------------------------------------
Token count: 29
Tokens: recomand | am | acest | produs | de | aproape | jumatate | de | an | timp | in | care | nu | am | avut | probleme | cu | el. | fixat | din | prima | telefoanele | stau | bine | s