# 🧹 Text Preprocessing & Feature Engineering

**Project:** Restaurant Sentiment Analysis  
**Author:** Akakinad  
**Date:** January 29, 2026  
**Objective:** Prepare text data for machine learning by cleaning, tokenizing, and vectorizing reviews

---

## Table of Contents
1. Setup & Data Loading
2. Text Cleaning
3. Remove Stopwords
4. TF-IDF Vectorization
5. Train-Test Split
6. Save Processed Data

---

## 1️⃣ Setup & Data Loading

Import libraries and load the cleaned dataset from Phase 1.

In [1]:
# Import essential libraries
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


In [2]:
# Download required NLTK data
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

print("✅ NLTK data downloaded!")

✅ NLTK data downloaded!


In [3]:
# Load the cleaned dataset from Phase 1
df = pd.read_csv('./data/processed/reviews_with_features.csv')

print("✅ Dataset loaded successfully!")
print(f"📊 Shape: {df.shape}")
print(f"\nFirst 3 rows:")
df.head(3)

✅ Dataset loaded successfully!
📊 Shape: (996, 4)

First 3 rows:


|   | Review | Liked | review_length | word_count |
|---|--------|-------|---------------|------------|
| 0 | Wow... Loved this place. | 1 | 24 | 4 |
| 1 | Crust is not good. | 0 | 18 | 4 |
| 2 | Not tasty and the texture was just nasty. | 0 | 41 | 8 |


## 2️⃣ Text Cleaning

Lowercase each review, strip punctuation and digits, and collapse extra whitespace so the text is ready for analysis.

In [4]:
# Function to clean text
def clean_text(text):
    """
    Clean and normalize text for NLP processing
    
    Steps:
    1. Convert to lowercase
    2. Remove punctuation and numbers
    3. Remove extra whitespace
    """
    # Convert to lowercase
    text = text.lower()
    
    # Remove punctuation and numbers, keep only letters and spaces
    text = re.sub(r'[^a-z\s]', '', text)
    
    # Remove extra whitespace
    text = ' '.join(text.split())
    
    return text

print("✅ Text cleaning function created!")

✅ Text cleaning function created!


In [5]:
# Test the cleaning function
sample_review = df['Review'].iloc[0]

print("BEFORE Cleaning:")
print(f"'{sample_review}'")
print("\nAFTER Cleaning:")
print(f"'{clean_text(sample_review)}'")

BEFORE Cleaning:
'Wow... Loved this place.'

AFTER Cleaning:
'wow loved this place'


In [6]:
# Apply cleaning to all reviews
df['Review_Clean'] = df['Review'].apply(clean_text)

print("✅ All reviews cleaned!")
print(f"\nDataset now has {len(df.columns)} columns:")
print(df.columns.tolist())

✅ All reviews cleaned!

Dataset now has 5 columns:
['Review', 'Liked', 'review_length', 'word_count', 'Review_Clean']


In [7]:
# Display first 5 reviews before and after cleaning
print("Before vs After Cleaning:")
print("=" * 80)

for i in range(5):
    print(f"\n{i+1}. ORIGINAL: {df['Review'].iloc[i]}")
    print(f"   CLEANED:  {df['Review_Clean'].iloc[i]}")

Before vs After Cleaning:

1. ORIGINAL: Wow... Loved this place.
   CLEANED:  wow loved this place

2. ORIGINAL: Crust is not good.
   CLEANED:  crust is not good

3. ORIGINAL: Not tasty and the texture was just nasty.
   CLEANED:  not tasty and the texture was just nasty

4. ORIGINAL: Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.
   CLEANED:  stopped by during the late may bank holiday off rick steve recommendation and loved it

5. ORIGINAL: The selection on the menu was great and so were the prices.
   CLEANED:  the selection on the menu was great and so were the prices
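
`clean_text` assumes every value is a string and that only unaccented ASCII letters matter. The sketch below probes a few edge cases. `clean_text_safe` is a hypothetical wrapper, not part of this notebook, that returns an empty string for missing values instead of raising.

```python
import re


def clean_text(text):
    """Same logic as the notebook's clean_text: lowercase, keep a-z and spaces."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    return ' '.join(text.split())


def clean_text_safe(text):
    """Hypothetical wrapper: treat non-string values (None, NaN) as empty text."""
    if not isinstance(text, str):
        return ''
    return clean_text(text)


# Digits and punctuation are deleted rather than replaced by spaces,
# so "10/10" vanishes and "don't" collapses to "dont".
print(clean_text("10/10, don't miss it!"))   # dont miss it
# Accented letters fall outside a-z and are dropped as well.
print(clean_text("Great caf\u00e9"))         # great caf
# A missing value would raise AttributeError inside clean_text itself.
print(repr(clean_text_safe(float('nan'))))   # ''
```

If the dataset could contain missing reviews, applying the guarded version (or dropping those rows first) avoids a failure in `df['Review'].apply(...)`.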


## 3️⃣ Remove Stopwords

Remove common words that don't carry sentiment (the, and, is, etc.)  
**Important:** We keep negation words (not, never, no) as they're critical for sentiment!

In [15]:
# Get English stopwords
stop_words = set(stopwords.words('english'))

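# CRITICAL: Keep negation words - they're important for sentiment!
# Note: clean_text() strips apostrophes, so contractions arrive here as
# "dont", "wont", etc. The apostrophe forms below only matter if this
# filter is ever applied to uncleaned text.
negation_words = {'not', 'no', 'nor', 'never', 'neither', 'nobody', 
                  'nothing', 'nowhere', "don't", "didn't", "doesn't", 
                  "won't", "wouldn't", "shouldn't", "couldn't", "can't"}

# Remove negation words from stopwords
stop_words = stop_words - negation_words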

# Function to remove stopwords
def remove_stopwords(text):
    """Remove common English stopwords from text (keeping negations)"""
    words = text.split()
    filtered_words = [word for word in words if word not in stop_words]
    return ' '.join(filtered_words)

print(f"✅ Stopword removal function created!")
print(f"📊 Total stopwords: {len(stop_words)}")
print(f"⚠️ Kept {len(negation_words)} negation words for sentiment")

✅ Stopword removal function created!
📊 Total stopwords: 188
⚠️ Kept 16 negation words for sentiment


In [16]:
# Remove stopwords from cleaned reviews
df['Review_Processed'] = df['Review_Clean'].apply(remove_stopwords)

print("✅ Stopwords removed from all reviews!")
print(f"\nDataset now has {len(df.columns)} columns:")
print(df.columns.tolist())

✅ Stopwords removed from all reviews!

Dataset now has 6 columns:
['Review', 'Liked', 'review_length', 'word_count', 'Review_Clean', 'Review_Processed']


In [17]:
# Verify "not" is preserved
test_idx = 1  # "Crust is not good"
print("Verification that 'not' is preserved:")
print("=" * 60)
print(f"ORIGINAL:  {df['Review'].iloc[test_idx]}")
print(f"CLEANED:   {df['Review_Clean'].iloc[test_idx]}")
print(f"PROCESSED: {df['Review_Processed'].iloc[test_idx]}")
print("\n✅ 'not' successfully preserved!")

Verification that 'not' is preserved:
ORIGINAL:  Crust is not good.
CLEANED:   crust is not good
PROCESSED: crust not good

✅ 'not' successfully preserved!
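
Because `clean_text` strips apostrophes before this step, contractions reach the filter as `dont`, `wont`, and so on. The sketch below uses a small stand-in stop word set (an assumption, so it runs without the NLTK download) to show which negation entries can actually match cleaned text.

```python
import re

# Stand-in for the NLTK list (assumption: a tiny subset chosen for illustration).
toy_stop_words = {'is', 'the', 'this', 'i', 'not', "don't", 'don', 't'}
negation_words = {'not', "don't"}
stop_words = toy_stop_words - negation_words


def clean_text(text):
    """Same logic as the notebook's clean_text."""
    text = re.sub(r'[^a-z\s]', '', text.lower())
    return ' '.join(text.split())


def remove_stopwords(text):
    """Same logic as the notebook's remove_stopwords."""
    return ' '.join(w for w in text.split() if w not in stop_words)


review = "I don't like this, the crust is not good."
cleaned = clean_text(review)
print(cleaned)                    # i dont like this the crust is not good
print(remove_stopwords(cleaned))  # dont like crust not good

# Tokens that can occur after cleaning never contain an apostrophe.
reachable = {w for w in negation_words if "'" not in w}
print(reachable)                  # {'not'}
```

In this sketch `dont` survives only because the apostrophe-free form is absent from the stop word set, not because `"don't"` was whitelisted. The same pattern appears in the TF-IDF vocabulary later in this notebook, which contains `wont` and `wouldnt`.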


In [24]:
# Show the complete processing pipeline
print("Complete Text Processing Pipeline:")
print("=" * 80)

for i in range(5):
    print(f"\n{i+1}. ORIGINAL:  {df['Review'].iloc[i]}")
    print(f"   CLEANED:   {df['Review_Clean'].iloc[i]}")
    print(f"   PROCESSED: {df['Review_Processed'].iloc[i]}")

Complete Text Processing Pipeline:

1. ORIGINAL:  Wow... Loved this place.
   CLEANED:   wow loved this place
   PROCESSED: wow loved place

2. ORIGINAL:  Crust is not good.
   CLEANED:   crust is not good
   PROCESSED: crust not good

3. ORIGINAL:  Not tasty and the texture was just nasty.
   CLEANED:   not tasty and the texture was just nasty
   PROCESSED: not tasty texture nasty

4. ORIGINAL:  Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.
   CLEANED:   stopped by during the late may bank holiday off rick steve recommendation and loved it
   PROCESSED: stopped late may bank holiday rick steve recommendation loved

5. ORIGINAL:  The selection on the menu was great and so were the prices.
   CLEANED:   the selection on the menu was great and so were the prices
   PROCESSED: selection menu great prices


In [20]:
# Calculate how many words were removed
df['processed_word_count'] = df['Review_Processed'].apply(lambda x: len(x.split()))

print("Word Count Comparison:")
print("=" * 60)
print(f"Average words BEFORE stopword removal: {df['word_count'].mean():.1f}")
print(f"Average words AFTER stopword removal:  {df['processed_word_count'].mean():.1f}")
print(f"\n✅ Reduced by ~{df['word_count'].mean() - df['processed_word_count'].mean():.1f} words per review")

Word Count Comparison:
Average words BEFORE stopword removal: 10.9
Average words AFTER stopword removal:  5.8

✅ Reduced by ~5.1 words per review


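## 4️⃣ TF-IDF Vectorization

Convert text to numerical features using Term Frequency-Inverse Document Frequency (TF-IDF).

**What is TF-IDF?**
- Weights each word by how often it appears in a review (term frequency), scaled by how rare it is across all reviews (inverse document frequency)
- Words that appear in many reviews get lower weights
- Words concentrated in a few reviews get higher weights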
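
As a rough illustration of those weights, the sketch below computes TF-IDF by hand for three toy reviews. It assumes the formula scikit-learn documents for the `TfidfVectorizer` defaults: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2-normalized rows. The toy reviews and the helper name `tfidf_rows` are illustrative, and the whitespace tokenization is a simplification (the default tokenizer in scikit-learn also ignores single-character tokens).

```python
import math
from collections import Counter

docs = ["crust not good", "wow loved place", "not tasty texture nasty"]


def tfidf_rows(documents):
    """Hand-rolled TF-IDF: raw term counts * smoothed IDF, then L2 normalization."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {term: counts[term] * idf[term] for term in counts}
        norm = math.sqrt(sum(value ** 2 for value in raw.values()))
        rows.append({term: value / norm for term, value in raw.items()})
    return rows


rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    print(doc, {term: round(weight, 3) for term, weight in row.items()})
```

In the first toy review, `not` occurs in two of the three documents and therefore ends up with a lower weight than `crust` or `good`, which each occur in only one.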

In [25]:
# Import TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

print("✅ TF-IDF Vectorizer imported!")

✅ TF-IDF Vectorizer imported!


In [27]:
# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer(max_features=1000)  # Keep the 1000 most frequent terms across all reviews

# Fit and transform the processed reviews
X = vectorizer.fit_transform(df['Review_Processed'])

# Extract target variable
y = df['Liked'].values

print("✅ Text vectorized successfully!")
print(f"📊 Shape of feature matrix: {X.shape}")
print(f"   - {X.shape[0]} reviews")
print(f"   - {X.shape[1]} features (words)")

✅ Text vectorized successfully!
📊 Shape of feature matrix: (996, 1000)
   - 996 reviews
   - 1000 features (words)


In [28]:
# Get the feature names (words used by vectorizer)
feature_names = vectorizer.get_feature_names_out()

print(f"Total vocabulary: {len(feature_names)} words")
print(f"\nFirst 20 features:")
print(feature_names[:20])
print(f"\nLast 20 features:")
print(feature_names[-20:])

Total vocabulary: 1000 words

First 20 features:
['absolutely' 'acknowledged' 'actually' 'added' 'ago' 'ala' 'albondigas'
 'allergy' 'almonds' 'almost' 'alone' 'also' 'although' 'always' 'amazing'
 'amazingrge' 'ambiance' 'ambience' 'amount' 'ample']

Last 20 features:
['wont' 'word' 'work' 'world' 'worse' 'worst' 'worth' 'would' 'wouldnt'
 'wow' 'wrap' 'wrong' 'year' 'years' 'yet' 'youd' 'youre' 'yum' 'yummy'
 'zero']


## 5️⃣ Train-Test Split

Split data into training (80%) and testing (20%) sets for model evaluation.

In [29]:
# Import train-test split function
from sklearn.model_selection import train_test_split

print("✅ Train-test split function imported!")

✅ Train-test split function imported!


In [30]:
# Split data: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42,
    stratify=y  # Keep same ratio of positive/negative in both sets
)

print("✅ Data split successfully!")
print(f"\nTraining set: {X_train.shape[0]} reviews ({X_train.shape[0]/len(df)*100:.1f}%)")
print(f"Testing set:  {X_test.shape[0]} reviews ({X_test.shape[0]/len(df)*100:.1f}%)")
print(f"\nTarget distribution in training set:")
print(f"  Positive: {y_train.sum()} ({y_train.sum()/len(y_train)*100:.1f}%)")
print(f"  Negative: {len(y_train) - y_train.sum()} ({(len(y_train) - y_train.sum())/len(y_train)*100:.1f}%)")

✅ Data split successfully!

Training set: 796 reviews (79.9%)
Testing set:  200 reviews (20.1%)

Target distribution in training set:
  Positive: 399 (50.1%)
  Negative: 397 (49.9%)
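
This notebook fits the vectorizer on all 996 reviews and splits afterwards, so the vocabulary and IDF weights are influenced by the test reviews. Whether that matters here is a judgment call. A common alternative, sketched below on toy data, splits the raw text first and fits `TfidfVectorizer` on the training portion only. The toy reviews and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 5 positive and 5 negative processed reviews.
texts = [
    "wow loved place", "great food", "amazing service", "tasty fresh pasta",
    "friendly staff great prices", "crust not good", "not tasty texture nasty",
    "slow service", "food cold bland", "never going back",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Split the raw text first so the test reviews stay unseen.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Learn the vocabulary and IDF weights from the training reviews only.
vectorizer = TfidfVectorizer(max_features=1000)
X_train = vectorizer.fit_transform(text_train)
# Reuse the fitted vocabulary; words unseen during training are ignored.
X_test = vectorizer.transform(text_test)

print(X_train.shape, X_test.shape)
```

Both matrices share the same columns because `transform` reuses the vocabulary learned by `fit_transform`, which is also how new reviews would be vectorized at prediction time.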


## 6️⃣ Save Processed Data

Save the vectorized data and preprocessing objects for model training.

In [33]:
# Import pickle to save Python objects
import pickle

# Save the vectorizer (we'll need it for new predictions)
with open('./models/tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)

# Save the processed data
with open('./data/processed/train_test_data.pkl', 'wb') as f:
    pickle.dump({
        'X_train': X_train,
        'X_test': X_test,
        'y_train': y_train,
        'y_test': y_test,
        'feature_names': feature_names
    }, f)

print("✅ All processed data saved successfully!")
print(f"\n📁 Saved files:")
print(f"   1. models/tfidf_vectorizer.pkl")
print(f"   2. data/processed/train_test_data.pkl")

✅ All processed data saved successfully!

📁 Saved files:
   1. models/tfidf_vectorizer.pkl
   2. data/processed/train_test_data.pkl
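
For the next phase, the saved objects can be read back with `pickle.load`. The helper below is a minimal sketch: `load_pickle` is a hypothetical name, and the commented usage lines assume the two paths written by the cell above. Pickle files can execute arbitrary code when loaded, so only open files you created yourself.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load one pickled object from disk (only use with trusted files)."""
    with Path(path).open('rb') as f:
        return pickle.load(f)


# Intended usage with the files saved by this notebook:
# vectorizer = load_pickle('./models/tfidf_vectorizer.pkl')
# data = load_pickle('./data/processed/train_test_data.pkl')
# X_train, y_train = data['X_train'], data['y_train']
```

New reviews would need the same `clean_text` and `remove_stopwords` steps before `vectorizer.transform`, since the vocabulary was learned from processed text.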


## 📊 Phase 2 Complete!

### Summary of Preprocessing Steps:

1. ✅ **Text Cleaning:** Converted to lowercase, removed punctuation and digits
2. ✅ **Stopword Removal:** Removed common words (kept negations!)
3. ✅ **TF-IDF Vectorization:** Converted text to 1000 numerical features
4. ✅ **Train-Test Split:** 80% train (796 reviews), 20% test (200 reviews)
5. ✅ **Data Saved:** Ready for model training in Phase 3!

---

### Next Steps (Phase 3):
- Train multiple classification models
- Compare model performance
- Evaluate accuracy, precision, and recall
- Select the best model