# 📝 Bag of Words and TF-IDF

In this notebook, we'll explore two fundamental text representation techniques:
- **Bag of Words (BoW)**: Simple word frequency counting
- **TF-IDF**: Term Frequency-Inverse Document Frequency weighting

These techniques convert text into numerical features that machine learning algorithms can understand.

## 📚 Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

# Set style for plots (the 'seaborn-v0_8' style name requires matplotlib >= 3.6)
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

## 📊 Sample Dataset Creation

Let's create a sample dataset to understand BoW and TF-IDF concepts:

In [None]:
# Sample movie reviews dataset
documents = [
    "This movie is fantastic and amazing",
    "Great movie with excellent acting",
    "I love this movie so much",
    "This is a terrible movie",
    "Worst movie I have ever seen",
    "I hate this boring movie",
    "The acting is superb in this film",
    "Amazing cinematography and great story",
    "Poor storyline and bad acting",
    "Excellent direction and fantastic cast"
]

# Labels: 1 = positive, 0 = negative
labels = [1, 1, 1, 0, 0, 0, 1, 1, 0, 1]

# Create DataFrame
df = pd.DataFrame({
    'text': documents,
    'sentiment': labels
})

print("Sample Dataset:")
print(df)

## 🎒 Bag of Words (BoW)

Bag of Words builds a vocabulary of all unique words in the corpus and represents each document by how often each vocabulary word occurs in it, ignoring word order.
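
To make the counting step concrete, here is a minimal from-scratch sketch. It assumes two toy documents (hypothetical, not part of the dataset above) and plain whitespace tokenization, which is simpler than what `CountVectorizer` does.

```python
from collections import Counter

# Two toy documents (hypothetical, for illustration only)
toy_docs = ["great movie great cast", "boring movie"]

# Tokenize by lowercasing and splitting on whitespace (a simplification:
# CountVectorizer uses a regex tokenizer and can also drop stop words)
tokenized = [doc.lower().split() for doc in toy_docs]

# Vocabulary: the sorted unique tokens; each one becomes a column
vocab = sorted({tok for toks in tokenized for tok in toks})

# One count vector per document, aligned with the vocabulary order
bow = [[Counter(toks)[word] for word in vocab] for toks in tokenized]

print(vocab)  # ['boring', 'cast', 'great', 'movie']
print(bow)    # [[0, 1, 2, 1], [1, 0, 0, 1]]
```

Each row is one document and each column is one vocabulary word, which is the same layout `CountVectorizer` produces as a sparse matrix.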

In [None]:
# Create Bag of Words vectorizer
bow_vectorizer = CountVectorizer(
    lowercase=True,           # Convert to lowercase
    stop_words='english',     # Remove English stop words
    max_features=100          # Limit vocabulary size
)

# Fit and transform the text data
bow_matrix = bow_vectorizer.fit_transform(documents)

# Get feature names (vocabulary)
feature_names = bow_vectorizer.get_feature_names_out()

print(f"Vocabulary size: {len(feature_names)}")
print(f"Vocabulary: {list(feature_names)}")
print(f"\nBoW matrix shape: {bow_matrix.shape}")

In [None]:
# Convert to dense matrix for visualization
bow_dense = bow_matrix.toarray()

# Create DataFrame for better visualization
bow_df = pd.DataFrame(bow_dense, columns=feature_names)
bow_df['document'] = [f"Doc {i+1}" for i in range(len(documents))]
bow_df['text'] = documents

print("Bag of Words Matrix:")
print(bow_df[['document'] + list(feature_names[:10])].head())

## 📊 Visualize Word Frequencies

In [None]:
# Calculate word frequencies across all documents
word_freq = np.sum(bow_dense, axis=0)
word_freq_df = pd.DataFrame({
    'word': feature_names,
    'frequency': word_freq
}).sort_values('frequency', ascending=False)

# Plot word frequencies
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.barplot(data=word_freq_df.head(10), x='frequency', y='word')
plt.title('Top 10 Most Frequent Words (BoW)')
plt.xlabel('Frequency')

# Show word distribution
plt.subplot(1, 2, 2)
plt.hist(word_freq, bins=10, alpha=0.7, color='skyblue')
plt.title('Distribution of Word Frequencies')
plt.xlabel('Frequency')
plt.ylabel('Number of Words')

plt.tight_layout()
plt.show()

## 🔢 TF-IDF (Term Frequency-Inverse Document Frequency)

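TF-IDF weights each word by two factors:
- **Term Frequency (TF)**: How often a word appears in a document
- **Inverse Document Frequency (IDF)**: How rare a word is across all documents; words that appear in fewer documents get a higher weight

**Formula**: TF-IDF = TF × IDF

Note: scikit-learn's `TfidfVectorizer` uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2-normalizes each document vector by default, so its scores differ from a plain TF × log(n / df) calculation.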
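
As a rough check of the formula, the sketch below recomputes TF-IDF by hand and compares it with `TfidfVectorizer`. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, raw counts as TF) and uses three toy documents that are hypothetical, not taken from the dataset above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical, for illustration only)
toy_docs = ["great movie", "terrible movie", "great great cast"]

# TF: raw term counts per document
counts = CountVectorizer().fit_transform(toy_docs).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed IDF, as scikit-learn computes it when smooth_idf=True
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF x IDF, then L2-normalize each row (norm='l2' is the default)
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

sk = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual, sk))  # True
```

Because of the smoothing and the normalization, the library's scores will not match a textbook TF × log(n / df) calculation, even though the idea is the same.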

In [None]:
# Create TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words='english',
    max_features=100,
    ngram_range=(1, 2)        # Include unigrams and bigrams
)

# Fit and transform the text data
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Get feature names
tfidf_features = tfidf_vectorizer.get_feature_names_out()

print(f"TF-IDF vocabulary size: {len(tfidf_features)}")
print(f"Sample features: {list(tfidf_features[:15])}")
print(f"\nTF-IDF matrix shape: {tfidf_matrix.shape}")

In [None]:
# Convert to dense matrix
tfidf_dense = tfidf_matrix.toarray()

# Create DataFrame for visualization
tfidf_df = pd.DataFrame(tfidf_dense, columns=tfidf_features)
tfidf_df['document'] = [f"Doc {i+1}" for i in range(len(documents))]

print("TF-IDF Matrix (first 5 features):")
print(tfidf_df[['document'] + list(tfidf_features[:5])].round(3))

## 📊 Compare BoW vs TF-IDF

In [None]:
# Compare scores for a specific word across documents
word_to_compare = 'movie'

if word_to_compare in feature_names:
    bow_idx = list(feature_names).index(word_to_compare)
    bow_scores = bow_dense[:, bow_idx]
    
    if word_to_compare in tfidf_features:
        tfidf_idx = list(tfidf_features).index(word_to_compare)
        tfidf_scores = tfidf_dense[:, tfidf_idx]
        
        # Create comparison DataFrame
        comparison_df = pd.DataFrame({
            'Document': [f"Doc {i+1}" for i in range(len(documents))],
            'Text': [doc[:50] + '...' if len(doc) > 50 else doc for doc in documents],
            'BoW Score': bow_scores,
            'TF-IDF Score': tfidf_scores.round(3)
        })
        
        print(f"Comparison for word '{word_to_compare}':")
        print(comparison_df)
    else:
        print(f"Word '{word_to_compare}' not found in TF-IDF features")
else:
    print(f"Word '{word_to_compare}' not found in BoW features")

## 🤖 Machine Learning with BoW and TF-IDF

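Let's compare how BoW and TF-IDF perform in a classification task. With only 10 documents (and a 3-document test set), the accuracies below illustrate the workflow; they are not a reliable measure of either technique.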

In [None]:
# Split data for training and testing
# stratify keeps both classes represented in the tiny train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    documents, labels, test_size=0.3, random_state=42, stratify=labels
)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")

In [None]:
# Train models with BoW features
bow_vectorizer_train = CountVectorizer(lowercase=True, stop_words='english')
X_train_bow = bow_vectorizer_train.fit_transform(X_train)
X_test_bow = bow_vectorizer_train.transform(X_test)

# Naive Bayes with BoW
nb_bow = MultinomialNB()
nb_bow.fit(X_train_bow, y_train)
y_pred_nb_bow = nb_bow.predict(X_test_bow)

# Logistic Regression with BoW
lr_bow = LogisticRegression(random_state=42)
lr_bow.fit(X_train_bow, y_train)
y_pred_lr_bow = lr_bow.predict(X_test_bow)

print("BoW Results:")
print(f"Naive Bayes Accuracy: {accuracy_score(y_test, y_pred_nb_bow):.3f}")
print(f"Logistic Regression Accuracy: {accuracy_score(y_test, y_pred_lr_bow):.3f}")

In [None]:
# Train models with TF-IDF features
tfidf_vectorizer_train = TfidfVectorizer(lowercase=True, stop_words='english')
X_train_tfidf = tfidf_vectorizer_train.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer_train.transform(X_test)

# Naive Bayes with TF-IDF
nb_tfidf = MultinomialNB()
nb_tfidf.fit(X_train_tfidf, y_train)
y_pred_nb_tfidf = nb_tfidf.predict(X_test_tfidf)

# Logistic Regression with TF-IDF
lr_tfidf = LogisticRegression(random_state=42)
lr_tfidf.fit(X_train_tfidf, y_train)
y_pred_lr_tfidf = lr_tfidf.predict(X_test_tfidf)

print("TF-IDF Results:")
print(f"Naive Bayes Accuracy: {accuracy_score(y_test, y_pred_nb_tfidf):.3f}")
print(f"Logistic Regression Accuracy: {accuracy_score(y_test, y_pred_lr_tfidf):.3f}")
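
A single train/test split on 10 documents is very noisy. One possible alternative, sketched below, wraps each vectorizer and the classifier in a pipeline and scores both on the same stratified folds. The choice of 4 folds and the `results` dictionary are assumptions made for this illustration, not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Same sample reviews and labels as in the dataset cell above
documents = [
    "This movie is fantastic and amazing",
    "Great movie with excellent acting",
    "I love this movie so much",
    "This is a terrible movie",
    "Worst movie I have ever seen",
    "I hate this boring movie",
    "The acting is superb in this film",
    "Amazing cinematography and great story",
    "Poor storyline and bad acting",
    "Excellent direction and fantastic cast",
]
labels = [1, 1, 1, 0, 0, 0, 1, 1, 0, 1]

# 4 folds is an assumption; the smaller class has only 4 documents
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)

vectorizers = {
    "BoW": CountVectorizer(stop_words="english"),
    "TF-IDF": TfidfVectorizer(stop_words="english"),
}

results = {}
for name, vectorizer in vectorizers.items():
    # The pipeline refits the vectorizer inside each fold, so no test
    # vocabulary leaks into training
    model = make_pipeline(vectorizer, LogisticRegression(random_state=42))
    results[name] = cross_val_score(model, documents, labels, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy {results[name].mean():.3f} over {len(results[name])} folds")
```

Even with cross-validation, a dataset this small cannot show which representation is better; the point is only that both are evaluated the same way.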

## 📊 Feature Importance Analysis

In [None]:
# Get feature importance from Logistic Regression (TF-IDF)
feature_names_tfidf = tfidf_vectorizer_train.get_feature_names_out()
coefficients = lr_tfidf.coef_[0]

# Create DataFrame with feature importance
feature_importance = pd.DataFrame({
    'feature': feature_names_tfidf,
    'coefficient': coefficients,
    'abs_coefficient': np.abs(coefficients)
}).sort_values('abs_coefficient', ascending=False)

# Plot top positive and negative features
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Top positive features (positive sentiment indicators)
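# Keep only features with a positive coefficient so the plot title stays accurate
top_positive = feature_importance[feature_importance['coefficient'] > 0].nlargest(10, 'coefficient')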
ax1.barh(range(len(top_positive)), top_positive['coefficient'], color='green', alpha=0.7)
ax1.set_yticks(range(len(top_positive)))
ax1.set_yticklabels(top_positive['feature'])
ax1.set_title('Top Positive Sentiment Features')
ax1.set_xlabel('Coefficient Value')

# Top negative features (negative sentiment indicators)
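# Keep only features with a negative coefficient so the plot title stays accurate
top_negative = feature_importance[feature_importance['coefficient'] < 0].nsmallest(10, 'coefficient')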
ax2.barh(range(len(top_negative)), top_negative['coefficient'], color='red', alpha=0.7)
ax2.set_yticks(range(len(top_negative)))
ax2.set_yticklabels(top_negative['feature'])
ax2.set_title('Top Negative Sentiment Features')
ax2.set_xlabel('Coefficient Value')

plt.tight_layout()
plt.show()

## 🔍 Testing with New Examples

In [None]:
# Test with new examples
new_reviews = [
    "This is an amazing and fantastic film",
    "Terrible movie with poor acting",
    "Great storyline and excellent direction",
    "Worst film I have ever watched"
]

# Transform new reviews
new_reviews_tfidf = tfidf_vectorizer_train.transform(new_reviews)

# Make predictions
predictions = lr_tfidf.predict(new_reviews_tfidf)
probabilities = lr_tfidf.predict_proba(new_reviews_tfidf)

# Display results
results_df = pd.DataFrame({
    'Review': new_reviews,
    'Predicted_Sentiment': ['Positive' if p == 1 else 'Negative' for p in predictions],
    'Confidence': [max(prob) for prob in probabilities]
})

print("Predictions on New Reviews:")
for idx, row in results_df.iterrows():
    print(f"\nReview: {row['Review']}")
    print(f"Sentiment: {row['Predicted_Sentiment']}")
    print(f"Confidence: {row['Confidence']:.3f}")

## 📋 Key Takeaways

### **Bag of Words (BoW):**
- ✅ Simple and intuitive
- ✅ Fast to compute
- ✅ Good baseline for text classification
- ❌ Ignores word importance (every word counts the same, however common it is)
- ❌ High-dimensional and sparse

### **TF-IDF:**
- ✅ Weights words by importance
- ✅ Often performs better than BoW, though this depends on the data and model
- ✅ Good for information retrieval
- ❌ More complex than BoW
- ❌ Still loses word order information
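
The word-order limitation is easy to see with a small sketch. The two sentences below are hypothetical; they use the same words in a different order, so both representations give them identical vectors.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Same words, different order and arguably different meaning (hypothetical pair)
pair = ["good not bad", "bad not good"]

bow_pair = CountVectorizer().fit_transform(pair).toarray()
tfidf_pair = TfidfVectorizer().fit_transform(pair).toarray()

same_bow = np.array_equal(bow_pair[0], bow_pair[1])
same_tfidf = np.allclose(tfidf_pair[0], tfidf_pair[1])

print(same_bow, same_tfidf)  # True True
```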

### **When to Use:**
- **BoW**: Simple classification tasks, quick prototypes
- **TF-IDF**: When common words should count for less, such as information retrieval or classification where BoW underperforms
- **Both**: Good starting points before trying deep learning approaches

### **Next Steps:**
- Experiment with n-grams (bigrams, trigrams)
- Try different preprocessing techniques
- Explore word embeddings (Word2Vec, GloVe)
- Compare with deep learning approaches