# Module 07: Text Feature Engineering

**Difficulty**: ⭐⭐ Intermediate  
**Estimated Time**: 70 minutes  
**Prerequisites**: Module 06 (Datetime Feature Engineering)

## Learning Objectives

By the end of this notebook, you will be able to:

1. Clean and preprocess text data for machine learning
2. Convert text to numerical features using CountVectorizer (bag-of-words)
3. Apply TF-IDF vectorization to weight terms by importance
4. Create n-gram features to capture word sequences
5. Understand basic word embeddings and when to use them
6. Compare bag-of-words vs TF-IDF performance on sentiment analysis

## 1. Why Text Feature Engineering Matters

**Text data is everywhere, but models need numbers!** Typical applications:
- Customer reviews and sentiment analysis
- Email spam detection
- Document classification
- Chatbot intent recognition
- Social media analysis

**The challenge**: 
- Computers don't understand words like "excellent" or "terrible"
- We need to convert text → numbers while preserving meaning

**Common approaches**:
1. **Bag-of-Words (CountVectorizer)**: Count word occurrences
2. **TF-IDF**: Weight words by importance
3. **N-grams**: Capture word sequences
4. **Word Embeddings**: Dense vector representations

In this module, we'll use **customer review sentiment analysis** to demonstrate these techniques.

## 2. Setup

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
from collections import Counter

# Text processing
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Set random seed for reproducibility
np.random.seed(42)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.precision', 3)

print("✓ Setup complete!")

## 3. Create Customer Review Dataset

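We'll create a small synthetic dataset of product reviews with positive and negative sentiment. Each class consists of 15 template reviews repeated 20 times, so the data contains many exact duplicates. Treat the accuracy numbers later in this module as illustrative rather than realistic.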

In [None]:
# Positive reviews
positive_reviews = [
    "This product is amazing! Highly recommend it to everyone.",
    "Excellent quality and fast shipping. Very satisfied with my purchase.",
    "Love it! Works perfectly and exactly as described.",
    "Outstanding product. Great value for money. Will buy again!",
    "Fantastic! Exceeded my expectations. Five stars!",
    "Best purchase I've made this year. Absolutely love it.",
    "Incredible product quality. Customer service was also excellent.",
    "Perfect! Just what I needed. Highly satisfied.",
    "Wonderful experience. Product arrived quickly and works great.",
    "Amazing quality! Worth every penny. Definitely recommend.",
    "Superb product. Easy to use and very effective.",
    "Brilliant! This has made my life so much easier.",
    "Exceptional quality. Can't believe how good this is.",
    "Perfect solution to my problem. Very happy with this.",
    "Great product! Exactly what was advertised. No complaints."
] * 20  # Repeat to get more data

# Negative reviews
negative_reviews = [
    "Terrible product. Broke after one day. Complete waste of money.",
    "Very disappointed. Not as described. Requesting refund.",
    "Poor quality. Would not recommend to anyone.",
    "Awful! Stopped working after a week. Don't buy this.",
    "Horrible experience. Product is defective and support is terrible.",
    "Waste of money. Cheap quality and doesn't work properly.",
    "Disappointed with this purchase. Not worth the price.",
    "Bad product. Shipping took forever and item was damaged.",
    "Poor craftsmanship. Fell apart immediately. Very unhappy.",
    "Terrible! Nothing like the description. Total scam.",
    "Not recommended. Broke on first use. Asking for refund.",
    "Worst purchase ever. Cheap materials and poor design.",
    "Useless product. Doesn't do what it claims. Very angry.",
    "Horrible quality. Save your money and buy something else.",
    "Completely unsatisfied. Product is garbage. Don't waste your time."
] * 20  # Repeat to get more data

# Create dataframe
reviews_df = pd.DataFrame({
    'review_text': positive_reviews + negative_reviews,
    'sentiment': ['positive'] * len(positive_reviews) + ['negative'] * len(negative_reviews)
})

# Shuffle the data
reviews_df = reviews_df.sample(frac=1, random_state=42).reset_index(drop=True)

print(f"Created dataset with {len(reviews_df)} reviews")
print(f"\nSentiment distribution:")
print(reviews_df['sentiment'].value_counts())
print(f"\nSample reviews:")
reviews_df.head(10)

## 4. Text Preprocessing and Cleaning

**Before converting text to features, we need to clean it**:
- Convert to lowercase
- Remove punctuation and special characters
- Remove extra whitespace
- (Optional) Remove stop words
- (Optional) Stemming/Lemmatization
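
The optional stop-word step in the list above can be sketched as follows. This is a minimal illustration rather than part of the module's pipeline: the helper name `remove_stop_words` is hypothetical, and it assumes scikit-learn's built-in `ENGLISH_STOP_WORDS` set is an acceptable stop-word list for your text.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def remove_stop_words(text):
    """Drop common English stop words from already-lowercased text (hypothetical helper)."""
    tokens = text.split()
    kept = [tok for tok in tokens if tok not in ENGLISH_STOP_WORDS]
    return ' '.join(kept)

example = "this product is amazing and the shipping was fast"
cleaned_example = remove_stop_words(example)
print(cleaned_example)
```

In the rest of this module the same effect is obtained by passing `stop_words='english'` to the vectorizers. Stemming and lemmatization need an external library such as NLTK and are not shown here.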

In [None]:
def clean_text(text):
    """
    Clean text data for feature extraction.
    
    Steps:
    1. Convert to lowercase
    2. Remove URLs
    3. Remove email addresses
    4. Remove punctuation, digits, and other non-letter characters
    5. Remove extra whitespace
    """
    # Convert to lowercase
    text = text.lower()
    
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    
    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)
    
    # Keep only letters and spaces (remove punctuation and numbers)
    text = re.sub(r'[^a-z\s]', ' ', text)
    
    # Remove extra whitespace
    text = ' '.join(text.split())
    
    return text

# Apply cleaning
reviews_df['review_clean'] = reviews_df['review_text'].apply(clean_text)

print("Text cleaning examples:")
print("\nOriginal vs Cleaned:")
for i in range(3):
    print(f"\nExample {i+1}:")
    print(f"Original: {reviews_df.iloc[i]['review_text']}")
    print(f"Cleaned:  {reviews_df.iloc[i]['review_clean']}")

In [None]:
# Analyze word frequency in positive vs negative reviews
positive_text = ' '.join(reviews_df[reviews_df['sentiment']=='positive']['review_clean'])
negative_text = ' '.join(reviews_df[reviews_df['sentiment']=='negative']['review_clean'])

positive_words = Counter(positive_text.split())
negative_words = Counter(negative_text.split())

print("Most common words in POSITIVE reviews:")
for word, count in positive_words.most_common(10):
    print(f"  {word}: {count}")

print("\nMost common words in NEGATIVE reviews:")
for word, count in negative_words.most_common(10):
    print(f"  {word}: {count}")

print("\nNotice how sentiment is reflected in word choice!")

## 5. Technique 1: Bag-of-Words (CountVectorizer)

**Bag-of-Words approach**:
- Count how many times each word appears in each document
- Create a matrix where rows = documents, columns = words
- Ignore word order (that's why it's a "bag")

**Example**:
```
Doc 1: "I love this product"
Doc 2: "I hate this product"

Vocabulary: [I, love, hate, this, product]

Vector for Doc 1: [1, 1, 0, 1, 1]  # Has "love", no "hate"
Vector for Doc 2: [1, 0, 1, 1, 1]  # Has "hate", no "love"
```

In [None]:
# Create CountVectorizer
count_vectorizer = CountVectorizer(
    max_features=100,  # Keep only top 100 most frequent words
    min_df=2,  # Word must appear in at least 2 documents
    max_df=0.9,  # Ignore words in more than 90% of documents
    stop_words='english'  # Remove common English stop words
)

# Fit and transform the cleaned text
bow_features = count_vectorizer.fit_transform(reviews_df['review_clean'])

print(f"Bag-of-Words feature matrix shape: {bow_features.shape}")
print(f"  - {bow_features.shape[0]} reviews")
print(f"  - {bow_features.shape[1]} unique words (features)")
print(f"\nFeature sparsity: {(1 - bow_features.nnz / (bow_features.shape[0] * bow_features.shape[1])) * 100:.1f}%")
print("(Most entries are 0 because each review uses only a small subset of vocabulary)")

In [None]:
# Examine the vocabulary
vocabulary = count_vectorizer.get_feature_names_out()
print(f"Vocabulary (first 30 words):")
print(vocabulary[:30])

# Show bag-of-words representation for a sample review
sample_idx = 0
sample_review = reviews_df.iloc[sample_idx]['review_clean']
sample_vector = bow_features[sample_idx].toarray()[0]

print(f"\nSample review: '{sample_review}'")
print(f"\nBag-of-Words representation (non-zero features only):")
for word, count in zip(vocabulary, sample_vector):
    if count > 0:
        print(f"  {word}: {count}")

## 6. Technique 2: TF-IDF (Term Frequency-Inverse Document Frequency)

**Problem with Bag-of-Words**: All words are treated equally!
- "the", "is", "and" appear frequently but are not informative
- "excellent" or "terrible" appear less but are very informative

**TF-IDF solution**: Weight words by how unique/important they are

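**Formula** (textbook form):
```
TF-IDF = (Term Frequency) × (Inverse Document Frequency)

TF = (# times word appears in document) / (total words in document)
IDF = log(total documents / documents containing word)
```

**Note**: scikit-learn's `TfidfVectorizer` uses a variant by default. TF is the raw count, IDF is smoothed as `ln((1 + n) / (1 + df)) + 1` (where `n` is the number of documents and `df` the number of documents containing the word), and each document vector is L2-normalized.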

**Result**:
- Common words (appear everywhere) → low score
- Rare but informative words → high score
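
To make the weighting concrete, the sketch below recomputes TF-IDF by hand on a three-document toy corpus and checks it against `TfidfVectorizer`. It assumes scikit-learn's default settings (raw counts as TF, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 row normalization), which differ slightly from the textbook formula. The corpus and variable names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_corpus = [
    "excellent product",
    "terrible product",
    "excellent service product",
]

# Raw term counts (rows = documents, columns = alphabetically sorted terms)
counts = CountVectorizer().fit_transform(toy_corpus).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed IDF, matching TfidfVectorizer's default (smooth_idf=True)
idf_manual = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF (raw count) times IDF, then L2-normalize each document row
tfidf_manual = counts * idf_manual
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

toy_tfidf = TfidfVectorizer()
tfidf_sklearn = toy_tfidf.fit_transform(toy_corpus).toarray()

print(np.allclose(idf_manual, toy_tfidf.idf_))      # manual IDF matches sklearn
print(np.allclose(tfidf_manual, tfidf_sklearn))     # manual TF-IDF matches sklearn
```

Here "product" appears in every document, so it gets the smallest possible IDF, while "service" and "terrible" each appear in a single document and are weighted more heavily.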

In [None]:
# Create TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(
    max_features=100,
    min_df=2,
    max_df=0.9,
    stop_words='english'
)

# Fit and transform
tfidf_features = tfidf_vectorizer.fit_transform(reviews_df['review_clean'])

print(f"TF-IDF feature matrix shape: {tfidf_features.shape}")
print(f"  - {tfidf_features.shape[0]} reviews")
print(f"  - {tfidf_features.shape[1]} unique words (features)")

In [None]:
# Compare Bag-of-Words vs TF-IDF for same review
sample_idx = 0
sample_review = reviews_df.iloc[sample_idx]['review_clean']

bow_vector = bow_features[sample_idx].toarray()[0]
tfidf_vector = tfidf_features[sample_idx].toarray()[0]
vocabulary_tfidf = tfidf_vectorizer.get_feature_names_out()

print(f"Sample review: '{sample_review}'\n")
print(f"{'Word':<20} {'Bag-of-Words':<15} {'TF-IDF':<15}")
print("-" * 50)

for word, bow_val, tfidf_val in zip(vocabulary_tfidf, bow_vector, tfidf_vector):
    if bow_val > 0 or tfidf_val > 0:
        print(f"{word:<20} {bow_val:<15.0f} {tfidf_val:<15.4f}")

print("\nNotice: TF-IDF gives different weights to words based on their importance!")

In [None]:
# Visualize top TF-IDF scores for positive vs negative reviews
positive_indices = reviews_df[reviews_df['sentiment']=='positive'].index
negative_indices = reviews_df[reviews_df['sentiment']=='negative'].index

# Average TF-IDF scores across all positive/negative reviews
positive_tfidf_mean = np.asarray(tfidf_features[positive_indices].mean(axis=0)).ravel()
negative_tfidf_mean = np.asarray(tfidf_features[negative_indices].mean(axis=0)).ravel()

# Get top words for each sentiment
top_positive_idx = positive_tfidf_mean.argsort()[-10:][::-1]
top_negative_idx = negative_tfidf_mean.argsort()[-10:][::-1]

top_positive_words = [(vocabulary_tfidf[i], positive_tfidf_mean[i]) for i in top_positive_idx]
top_negative_words = [(vocabulary_tfidf[i], negative_tfidf_mean[i]) for i in top_negative_idx]

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

words, scores = zip(*top_positive_words)
axes[0].barh(words, scores, color='lightgreen', edgecolor='black')
axes[0].set_xlabel('Average TF-IDF Score')
axes[0].set_title('Top Words in Positive Reviews', fontsize=12, fontweight='bold')
axes[0].invert_yaxis()

words, scores = zip(*top_negative_words)
axes[1].barh(words, scores, color='lightcoral', edgecolor='black')
axes[1].set_xlabel('Average TF-IDF Score')
axes[1].set_title('Top Words in Negative Reviews', fontsize=12, fontweight='bold')
axes[1].invert_yaxis()

plt.tight_layout()
plt.show()

print("Notice how different words characterize positive vs negative sentiment!")

## 7. Technique 3: N-grams

**Problem**: Bag-of-Words ignores word order!
- "not good" is very different from "good"
- "highly recommend" has different meaning than just "highly" or "recommend"

**Solution: N-grams** = sequences of N consecutive words
- **Unigrams** (1-gram): Single words ["not", "good"]
- **Bigrams** (2-gram): Word pairs ["not good"]
- **Trigrams** (3-gram): Word triplets ["not very good"]

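**Trade-off**: N-grams capture context but increase feature space dramatically!

**Caveat**: scikit-learn removes stop words *before* it builds n-grams. With `stop_words='english'` (used in the cells below), the word "not" is dropped, so a bigram like "not good" is never formed. If negation matters for your task, keep stop words or supply a custom list.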

In [None]:
# Create TF-IDF with bigrams
tfidf_bigram = TfidfVectorizer(
    ngram_range=(1, 2),  # Use both unigrams and bigrams
    max_features=200,  # More features because we have bigrams
    min_df=2,
    stop_words='english'
)

bigram_features = tfidf_bigram.fit_transform(reviews_df['review_clean'])

print(f"TF-IDF with bigrams feature matrix: {bigram_features.shape}")
print(f"  - {bigram_features.shape[0]} reviews")
print(f"  - {bigram_features.shape[1]} features (unigrams + bigrams)")

In [None]:
# Show examples of bigrams captured
vocabulary_bigram = tfidf_bigram.get_feature_names_out()

# Find bigrams (contain space)
bigrams_only = [word for word in vocabulary_bigram if ' ' in word]

print(f"Total bigrams captured: {len(bigrams_only)}")
print(f"\nSample bigrams:")
for bigram in bigrams_only[:30]:
    print(f"  '{bigram}'")

print("\nNotice phrases like 'highly recommend', 'waste money', etc.")
print("These capture sentiment better than individual words!")

In [None]:
# Find most important bigrams for each sentiment
positive_indices = reviews_df[reviews_df['sentiment']=='positive'].index
negative_indices = reviews_df[reviews_df['sentiment']=='negative'].index

positive_bigram_mean = np.asarray(bigram_features[positive_indices].mean(axis=0)).ravel()
negative_bigram_mean = np.asarray(bigram_features[negative_indices].mean(axis=0)).ravel()

# Get top bigrams
bigram_indices = [i for i, word in enumerate(vocabulary_bigram) if ' ' in word]

top_positive_bigrams = sorted(
    [(vocabulary_bigram[i], positive_bigram_mean[i]) for i in bigram_indices],
    key=lambda x: x[1],
    reverse=True
)[:10]

top_negative_bigrams = sorted(
    [(vocabulary_bigram[i], negative_bigram_mean[i]) for i in bigram_indices],
    key=lambda x: x[1],
    reverse=True
)[:10]

print("Top bigrams in POSITIVE reviews:")
for bigram, score in top_positive_bigrams:
    print(f"  '{bigram}': {score:.4f}")

print("\nTop bigrams in NEGATIVE reviews:")
for bigram, score in top_negative_bigrams:
    print(f"  '{bigram}': {score:.4f}")

## 8. Model Performance Comparison

Let's compare sentiment classification performance using different text features:
1. **Bag-of-Words** (CountVectorizer)
2. **TF-IDF** (unigrams only)
3. **TF-IDF + Bigrams** (unigrams + bigrams)
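
One caveat for comparisons like this: if the vectorizer is fit on all documents before splitting, vocabulary and IDF statistics from the test data leak into training. A common way to avoid that is to put the vectorizer and the classifier into a single pipeline, so each cross-validation fold fits the vectorizer on its training part only. The sketch below shows the pattern on a tiny made-up dataset. The texts, labels, and `toy_` names are illustrative assumptions, and the resulting scores say nothing about real-world performance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Tiny illustrative dataset (made up for this sketch)
toy_texts = [
    "great quality love it", "excellent value works perfectly",
    "amazing product highly recommend", "fantastic fast shipping",
    "wonderful experience very happy", "superb easy to use",
    "terrible broke after one day", "poor quality waste of money",
    "awful stopped working", "horrible defective product",
    "very disappointed requesting refund", "useless cheap materials",
]
toy_labels = [1] * 6 + [0] * 6

# The vectorizer lives inside the pipeline, so every CV fold
# learns its vocabulary and IDF weights from the training part only
toy_pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)

toy_scores = cross_val_score(toy_pipeline, toy_texts, toy_labels, cv=3)
print(f"CV accuracy: {toy_scores.mean():.3f} (+/- {toy_scores.std():.3f})")
```

The same pipeline object can be passed to `train_test_split`-based evaluation by calling `fit` on the raw training texts and `score` on the raw test texts.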

In [None]:
# Prepare labels
y = (reviews_df['sentiment'] == 'positive').astype(int)

# Split data
test_size = 0.25

# Feature sets
feature_sets = {
    'Bag-of-Words': bow_features,
    'TF-IDF (unigrams)': tfidf_features,
    'TF-IDF (unigrams + bigrams)': bigram_features
}

results = []

for name, features in feature_sets.items():
    # Split
    X_train, X_test, y_train, y_test = train_test_split(
        features, y, test_size=test_size, random_state=42, stratify=y
    )
    
    # Train Logistic Regression
    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    
    # Evaluate
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    
    # Cross-validation (the vectorizers were fit on all reviews above, so these scores are optimistic)
    cv_scores = cross_val_score(model, features, y, cv=5)
    
    results.append({
        'Method': name,
        'Num Features': features.shape[1],
        'Train Accuracy': train_acc,
        'Test Accuracy': test_acc,
        'CV Accuracy (mean)': cv_scores.mean(),
        'CV Accuracy (std)': cv_scores.std()
    })

results_df = pd.DataFrame(results)
print("\nSentiment Classification Performance:")
print("="*90)
results_df

In [None]:
# Visualize performance comparison
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Accuracy comparison
x = np.arange(len(results_df))
width = 0.35

axes[0].bar(x - width/2, results_df['Train Accuracy'], width, 
           label='Train Accuracy', color='skyblue', edgecolor='black')
axes[0].bar(x + width/2, results_df['Test Accuracy'], width, 
           label='Test Accuracy', color='salmon', edgecolor='black')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Model Accuracy by Feature Type', fontsize=12, fontweight='bold')
axes[0].set_xticks(x)
axes[0].set_xticklabels(results_df['Method'], rotation=15, ha='right')
axes[0].legend()
axes[0].set_ylim([0.8, 1.0])
axes[0].grid(True, alpha=0.3, axis='y')

# Cross-validation scores
axes[1].bar(results_df['Method'], results_df['CV Accuracy (mean)'], 
           yerr=results_df['CV Accuracy (std)'],
           color='lightgreen', edgecolor='black', capsize=5)
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Cross-Validation Accuracy (with std)', fontsize=12, fontweight='bold')
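axes[1].set_xticks(range(len(results_df)))  # Fix tick positions before relabeling to avoid a matplotlib warning
axes[1].set_xticklabels(results_df['Method'], rotation=15, ha='right')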
axes[1].set_ylim([0.8, 1.0])
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\nKey observations:")
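print("- On this small, duplicated dataset all three feature sets may score close to 100%")
print("- On real data, TF-IDF often (but not always) outperforms simple Bag-of-Words")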
print("- Adding bigrams can improve performance by capturing context")
print("- Trade-off: More features = more complexity")

In [None]:
# Detailed classification report for best model
best_method = results_df.loc[results_df['Test Accuracy'].idxmax(), 'Method']
best_features = feature_sets[best_method]

X_train, X_test, y_train, y_test = train_test_split(
    best_features, y, test_size=0.25, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"Best performing method: {best_method}\n")
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])
plt.title(f'Confusion Matrix - {best_method}', fontsize=12, fontweight='bold')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

## 9. Introduction to Word Embeddings

**Limitation of Bag-of-Words and TF-IDF**:
- Each word is independent (no semantic similarity)
- "excellent" and "outstanding" are treated as completely different
- Very high-dimensional sparse vectors

**Word Embeddings**: Dense vector representations that capture semantic meaning
- Words with similar meanings have similar vectors
- Typically 50-300 dimensions (vs 1000s for BoW)
- Pre-trained models available (Word2Vec, GloVe, FastText)

**Example**:
```
word2vec["excellent"] ≈ word2vec["outstanding"]
word2vec["terrible"] ≈ word2vec["horrible"]
```

**When to use**:
- ✅ Small datasets (leverage pre-trained knowledge)
- ✅ Need semantic similarity
- ✅ Deep learning models
- ❌ Less necessary when traditional ML with TF-IDF already works well
- ❌ Less suitable when interpretability is important
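
"Similar vectors" is usually measured with cosine similarity. The sketch below illustrates the idea with hand-made 3-dimensional vectors. These numbers are invented for illustration and do not come from Word2Vec, GloVe, or any trained model. Real embeddings would be loaded from a pre-trained file and have far more dimensions.

```python
import numpy as np

# Hand-made 3-dimensional vectors, invented for illustration only
toy_embeddings = {
    "excellent":   np.array([0.9, 0.8, 0.1]),
    "outstanding": np.array([0.85, 0.75, 0.2]),
    "terrible":    np.array([-0.8, -0.7, 0.1]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction, -1 = opposite)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_synonyms = cosine_similarity(toy_embeddings["excellent"], toy_embeddings["outstanding"])
sim_antonyms = cosine_similarity(toy_embeddings["excellent"], toy_embeddings["terrible"])

print(f"excellent vs outstanding: {sim_synonyms:.3f}")
print(f"excellent vs terrible:    {sim_antonyms:.3f}")
```

In a bag-of-words space the same two comparisons would both give zero similarity, because each word occupies its own independent dimension.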

In [None]:
# Simple demonstration of embedding concept (using random embeddings)
# In practice, you'd use pre-trained embeddings like Word2Vec or GloVe

print("Conceptual comparison: Sparse BoW vs Dense Embeddings\n")

# Bag-of-Words: Sparse, high-dimensional
vocab_size = 10000
bow_vector = np.zeros(vocab_size)
bow_vector[[42, 156, 1523, 8932]] = 1  # Only 4 words present

print(f"Bag-of-Words representation:")
print(f"  Dimension: {len(bow_vector)}")
print(f"  Non-zero values: {np.count_nonzero(bow_vector)}")
print(f"  Sparsity: {(1 - np.count_nonzero(bow_vector) / len(bow_vector)) * 100:.1f}%")
print(f"  Sample values: {bow_vector[:20]}\n")

# Word Embedding: Dense, low-dimensional
embedding_dim = 100
embedding_vector = np.random.randn(embedding_dim)  # Dense vector

print(f"Word Embedding representation:")
print(f"  Dimension: {len(embedding_vector)}")
print(f"  Non-zero values: {np.count_nonzero(embedding_vector)}")
print(f"  Sparsity: {(1 - np.count_nonzero(embedding_vector) / len(embedding_vector)) * 100:.1f}%")
print(f"  Sample values: {embedding_vector[:20]}\n")

print("Key differences:")
print("- Embeddings are DENSE (most values non-zero)")
print("- Embeddings are LOWER dimensional (100 vs 10,000)")
print("- Embeddings capture SEMANTIC meaning")
print("\nNote: This is just a conceptual demo. Real embeddings require training!")

## 10. Exercise Section

### Exercise 1: Email Spam Classification

Create a spam detection system using TF-IDF features.

In [None]:
# Exercise 1: Spam detection dataset

spam_emails = [
    "Congratulations! You've won $1000000. Click here to claim now!",
    "FREE PRIZE! Limited time offer. Act now!",
    "Earn money fast! Work from home opportunity.",
    "Click here for amazing deals! Buy now!",
    "You are a winner! Claim your prize today!"
] * 30

legitimate_emails = [
    "Meeting scheduled for tomorrow at 2pm in conference room.",
    "Please review the attached document and provide feedback.",
    "Reminder: Project deadline is next Friday.",
    "Thank you for your purchase. Your order will ship soon.",
    "Team lunch on Thursday. Please confirm your attendance."
] * 30

email_data = pd.DataFrame({
    'email_text': spam_emails + legitimate_emails,
    'is_spam': [1] * len(spam_emails) + [0] * len(legitimate_emails)
}).sample(frac=1, random_state=42).reset_index(drop=True)

print(f"Email dataset: {len(email_data)} emails")
print(f"Spam: {email_data['is_spam'].sum()}, Legitimate: {(1-email_data['is_spam']).sum()}")

# TODO: 
# 1. Clean the email text
# 2. Create TF-IDF features with bigrams
# 3. Train a classifier (Naive Bayes works well for text)
# 4. Evaluate accuracy

# Your code here:


In [None]:
# Solution to Exercise 1

# 1. Clean email text
email_data['email_clean'] = email_data['email_text'].apply(clean_text)

# 2. Create TF-IDF features with bigrams
tfidf = TfidfVectorizer(
    ngram_range=(1, 2),
    max_features=100,
    min_df=2,
    stop_words='english'
)
X = tfidf.fit_transform(email_data['email_clean'])
y = email_data['is_spam']

# 3. Train classifier
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y  # stratify keeps the spam/legitimate ratio in both splits
)
model = MultinomialNB()
model.fit(X_train, y_train)

# 4. Evaluate
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
y_pred = model.predict(X_test)

print(f"Spam Detection Results:")
print(f"Train Accuracy: {train_acc:.3f}")
print(f"Test Accuracy: {test_acc:.3f}\n")
print(classification_report(y_test, y_pred, target_names=['Legitimate', 'Spam']))

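# Show top spam indicators: features far more likely under the spam class than the legitimate class
feature_names = tfidf.get_feature_names_out()
log_prob_ratio = model.feature_log_prob_[1] - model.feature_log_prob_[0]
top_spam_features = log_prob_ratio.argsort()[-10:][::-1]
print("\nTop spam indicators:")
for idx in top_spam_features:
    print(f"  '{feature_names[idx]}'")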

### Exercise 2: Compare Different N-gram Ranges

Experiment with different n-gram settings and see which works best.

In [None]:
# Exercise 2: N-gram comparison

# TODO: Test these n-gram ranges on the review sentiment data:
# 1. (1, 1) - unigrams only
# 2. (2, 2) - bigrams only
# 3. (1, 2) - unigrams + bigrams
# 4. (1, 3) - unigrams + bigrams + trigrams
#
# Compare their performance on sentiment classification

# Your code here:


In [None]:
# Solution to Exercise 2

ngram_configs = [
    (1, 1, 'Unigrams only'),
    (2, 2, 'Bigrams only'),
    (1, 2, 'Unigrams + Bigrams'),
    (1, 3, 'Unigrams + Bigrams + Trigrams')
]

ngram_results = []

for min_n, max_n, description in ngram_configs:
    # Create vectorizer
    vectorizer = TfidfVectorizer(
        ngram_range=(min_n, max_n),
        max_features=200,
        min_df=2,
        stop_words='english'
    )
    
    # Transform
    X = vectorizer.fit_transform(reviews_df['review_clean'])
    y = (reviews_df['sentiment'] == 'positive').astype(int)
    
    # Train and evaluate
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42, stratify=y
    )
    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    
    ngram_results.append({
        'N-gram Range': f'({min_n}, {max_n})',
        'Description': description,
        'Num Features': X.shape[1],
        'Test Accuracy': model.score(X_test, y_test)
    })

ngram_results_df = pd.DataFrame(ngram_results)
print("N-gram Range Comparison:")
print("="*70)
print(ngram_results_df)

# Visualize
plt.figure(figsize=(10, 5))
plt.bar(ngram_results_df['Description'], ngram_results_df['Test Accuracy'], 
        color='steelblue', edgecolor='black')
plt.ylabel('Test Accuracy')
plt.title('Performance vs N-gram Range', fontsize=12, fontweight='bold')
plt.xticks(rotation=15, ha='right')
plt.ylim([0.9, 1.0])
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\nObservations:")
print("- Bigrams alone may underperform (lose single word info)")
print("- Combining unigrams + bigrams often works well")
print("- Trigrams add complexity with diminishing returns")

### Exercise 3: Custom Text Cleaning

Enhance the text cleaning function to handle more edge cases.

In [None]:
# Exercise 3: Advanced text cleaning

messy_reviews = [
    "OMG!!! This is THE BEST product EVER!!! 5 stars ⭐⭐⭐⭐⭐",
    "Sooooo disappointed :( Waste of $$$. Contact support@company.com",
    "Check out my review at http://example.com/review123",
    "Product received on 01/15/2024. Working perfectly!!!"
]

# TODO: Create an enhanced clean_text function that:
# 1. Handles repeated characters ("sooooo" → "soo": collapse runs of 3+ identical characters to two)
# 2. Removes emojis
# 3. Removes URLs
# 4. Removes email addresses
# 5. Removes numbers and dates

def enhanced_clean_text(text):
    # Your code here
    pass

# Test your function
# for review in messy_reviews:
#     print(f"Original: {review}")
#     print(f"Cleaned:  {enhanced_clean_text(review)}\n")

In [None]:
# Solution to Exercise 3

def enhanced_clean_text(text):
    """
    Enhanced text cleaning with additional preprocessing.
    """
    # Convert to lowercase
    text = text.lower()
    
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    
    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)
    
    # Remove dates (simple pattern)
    text = re.sub(r'\d{1,2}/\d{1,2}/\d{2,4}', '', text)
    
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    
    # Handle repeated characters (e.g., "sooooo" → "soo")
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)
    
    # Remove emojis and special unicode characters
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    
    # Keep only letters and spaces
    text = re.sub(r'[^a-z\s]', ' ', text)
    
    # Remove extra whitespace
    text = ' '.join(text.split())
    
    return text

print("Enhanced text cleaning examples:\n")
for review in messy_reviews:
    print(f"Original: {review}")
    print(f"Cleaned:  {enhanced_clean_text(review)}\n")

print("Improvements:")
print("✓ Removed URLs and emails")
print("✓ Normalized repeated characters")
print("✓ Removed emojis and special characters")
print("✓ Removed numbers and dates")

## 11. Summary

### Key Takeaways

1. **Text must be converted to numbers** for machine learning
   - Models can't understand words directly
   - Feature engineering bridges the gap

2. **Four core text vectorization techniques**:
   - **Bag-of-Words (CountVectorizer)**: Simple word counts
   - **TF-IDF**: Weight words by importance (usually better than BoW)
   - **N-grams**: Capture word sequences and context
   - **Word Embeddings**: Dense semantic representations

3. **Text preprocessing is critical**:
   - Lowercase conversion
   - Remove punctuation, URLs, special characters
   - Handle repeated characters
   - Remove stop words (optional)

4. **TF-IDF typically outperforms Bag-of-Words**:
   - Down-weights common uninformative words
   - Up-weights words that are frequent in a document but rare across the corpus
   - Good default choice for traditional ML

5. **N-grams capture context but increase complexity**:
   - Unigrams + bigrams often optimal
   - Trigrams+ have diminishing returns
   - Trade-off between performance and feature space size

### When to Use Each Method

**Bag-of-Words (CountVectorizer)**:
- ✅ Simple baseline
- ✅ Naive Bayes classifier
- ✅ Raw term counts matter more than how rare a term is across documents

**TF-IDF**:
- ✅ Most text classification tasks
- ✅ Document similarity
- ✅ Information retrieval
- ✅ Works well with SVM, Logistic Regression

**N-grams**:
- ✅ Context matters ("not good" vs "good")
- ✅ Sentiment analysis
- ✅ Phrase detection

**Word Embeddings**:
- ✅ Small datasets (use pre-trained)
- ✅ Need semantic similarity
- ✅ Deep learning models
- ✅ Multi-language tasks

### Best Practices

1. **Always clean text first**: Garbage in = garbage out
2. **Start with TF-IDF unigrams**: Good baseline
3. **Add bigrams if context matters**: Sentiment, negation, phrases
4. **Limit vocabulary size**: Use max_features to control complexity
5. **Remove stop words**: Often helps, but test both ways
6. **Consider min_df and max_df**: Filter rare and common words
7. **Evaluate on held-out test set**: Fit the vectorizer on training data only, so the score reflects truly unseen text

### What's Next?

**Module 08**: Feature Selection Methods - Learn to identify and keep only the most important features

### Additional Resources

- [Scikit-learn Text Feature Extraction](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)
- [TF-IDF Explained](https://monkeylearn.com/blog/what-is-tf-idf/)
- [Word Embeddings Guide](https://machinelearningmastery.com/what-are-word-embeddings/)
- [NLTK Documentation](https://www.nltk.org/)

---

**Congratulations!** You've completed Module 07. You now understand:
- How to clean and preprocess text data
- Bag-of-Words and TF-IDF vectorization
- N-grams for capturing word sequences
- When to use each text feature engineering method
- The performance impact of different approaches

Ready to learn feature selection? Let's move to **Module 08: Feature Selection Methods**!