# Spam Detection using Logistic Regression

This notebook demonstrates how to build a spam detection system using logistic regression.
We'll use text feature extraction techniques and logistic regression (a linear model for classification)
to classify emails as spam or not spam.

## Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score, roc_curve
from sklearn.pipeline import Pipeline
import kagglehub
import warnings
warnings.filterwarnings('ignore')

# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

## Load Real Spam Email Dataset

We'll use a real spam email dataset from Kaggle to make our analysis more realistic and robust.
This dataset contains actual spam and ham (legitimate) emails. "Ham" is the community's name for the negative (non-spam) label, presumably as a cute little pun on spam (though to be fair, I think spam is delicious, so I'm not sure I see where they're coming from).

In [None]:
# Download the spam email dataset from Kaggle
print("Downloading spam email dataset from Kaggle...")
path = kagglehub.dataset_download("jackksoncsie/spam-email-dataset")
print("Path to dataset files:", path)

In [None]:
# Load the dataset
import os
# List files in the dataset directory
dataset_files = os.listdir(path)
print("Files in dataset:", dataset_files)

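# Load the first CSV file in the dataset directory (sorted for a stable choice)
csv_files = sorted(f for f in dataset_files if f.lower().endswith('.csv'))
dataset_file = csv_files[0]
df = pd.read_csv(os.path.join(path, dataset_file))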
print(f"\nLoaded dataset: {dataset_file}")
print(f"Dataset shape: {df.shape}")
print(f"Column names: {list(df.columns)}")
print("\nSome random rows:")
print(df.sample(20))

In [None]:
# Rename columns for ease of labeling
text_col = 'text'
label_col = 'spam'
df = df.rename(columns={text_col: 'message', label_col: 'label'})

# Check unique values in label column
print(f"\nUnique labels: {df['label'].unique()}")
print(f"Label counts:\n{df['label'].value_counts()}")

In [None]:
# Remove any rows with missing values first (NaN labels cannot be cast to int)
df = df.dropna(subset=['message', 'label'])

# Ensure labels are integers
df['label'] = df['label'].astype(int)

print(f"\nFinal dataset shape: {df.shape}")
print(f"Spam messages: {df[df['label'] == 1].shape[0]}")
print(f"Ham messages: {df[df['label'] == 0].shape[0]}")
print(f"Spam ratio: {df['label'].mean():.3f}")

# Show sample messages
print("\nSample spam messages:")
spam_samples = df[df['label'] == 1]['message'].head(3)
for i, msg in enumerate(spam_samples, 1):
    print(f"{i}. {msg[:100]}..." if len(msg) > 100 else f"{i}. {msg}")

print("\nSample ham messages:")
ham_samples = df[df['label'] == 0]['message'].head(3)
for i, msg in enumerate(ham_samples, 1):
    print(f"{i}. {msg[:100]}..." if len(msg) > 100 else f"{i}. {msg}")

## Exploratory Data Analysis

In [None]:
# Display first few messages
print("Sample messages:")
print("Spam messages:")
print(df[df['label'] == 1]['message'].head(3).values)
print("Ham messages:")
print(df[df['label'] == 0]['message'].head(3).values)

## Understanding Bag of Words (BoW)

Before we build our model, let's understand how **Bag of Words** works step by step.
Bag of Words is a fundamental text representation technique that converts text into numerical vectors: each vocabulary word gets a one-hot vector, and a message's feature vector is the sum of the one-hot vectors of its words, which amounts to a vector of word counts.
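As a minimal sketch of that idea, the snippet below sums one-hot vectors for a tiny made-up vocabulary. The names `toy_vocab`, `toy_word_to_idx`, and `one_hot` are hypothetical and exist only for this illustration.

```python
import numpy as np

# Hypothetical toy vocabulary, chosen only for illustration
toy_vocab = ["free", "money", "now"]
toy_word_to_idx = {word: idx for idx, word in enumerate(toy_vocab)}

def one_hot(word):
    """Return the one-hot vector for a word in the toy vocabulary."""
    vec = np.zeros(len(toy_vocab), dtype=int)
    vec[toy_word_to_idx[word]] = 1
    return vec

tokens = ["free", "free", "money"]

# The bag-of-words vector is the sum of the tokens' one-hot vectors,
# so each position ends up holding that word's count
bow_vector = sum(one_hot(tok) for tok in tokens)
print(bow_vector)  # [2 1 0]
```

"free" appears twice, "money" once, and "now" not at all, so the summed vector is simply the count of each vocabulary word.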

### Step 1: Building the Vocabulary

First, let's manually walk through how bag of words is constructed using a few sample messages.

In [None]:
# Sample messages for demonstration
demo_messages = [
    "Free money now!",
    "Meeting at noon",
    "Free lunch offer",
    "Money back guarantee"
]

print("Demo messages:")
for i, msg in enumerate(demo_messages):
    print(f"{i+1}. '{msg}'")

# Step 1: Tokenization - split into words
print("\nStep 1: Tokenization")
print("=" * 30)
tokenized_messages = []
for i, msg in enumerate(demo_messages):
    tokens = msg.lower().split()  # Simple tokenization (punctuation stays attached, e.g. 'now!')
    tokenized_messages.append(tokens)
    print(f"Message {i+1}: {tokens}")

# Step 2: Build vocabulary (unique words)
print("\nStep 2: Building Vocabulary")
print("=" * 30)
vocabulary = set()
for tokens in tokenized_messages:
    vocabulary.update(tokens)

vocabulary = sorted(list(vocabulary))  # Sort for consistent ordering
print(f"Vocabulary: {vocabulary}")
print(f"Vocabulary size: {len(vocabulary)}")

# Step 3: Create word-to-index mapping
print("\nStep 3: Word-to-Index Mapping")
print("=" * 30)
word_to_idx = {word: idx for idx, word in enumerate(vocabulary)}
for word, idx in word_to_idx.items():
    print(f"'{word}' -> index {idx}")
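
The whitespace tokenization used above keeps punctuation attached to words. As a rough comparison, here is a sketch that tokenizes the same message with a regular expression matching runs of two or more word characters, which is the default `token_pattern` of scikit-learn's `CountVectorizer`.

```python
import re

message = "Free money now!"

# Whitespace splitting keeps the exclamation mark attached to "now"
split_tokens = message.lower().split()
print(split_tokens)  # ['free', 'money', 'now!']

# Runs of two or more word characters; punctuation is dropped
regex_tokens = re.findall(r"(?u)\b\w\w+\b", message.lower())
print(regex_tokens)  # ['free', 'money', 'now']
```

With whitespace splitting, "now!" and "now" would become two different vocabulary entries, which is usually not what you want.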

In [None]:
# Step 4: Convert messages to numerical vectors
print("Step 4: Converting Messages to Vectors")
print("=" * 40)

bow_matrix = []
for i, tokens in enumerate(tokenized_messages):
    # Initialize vector with zeros
    vector = [0] * len(vocabulary)
    
    # Count occurrences of each word
    for token in tokens:
        if token in word_to_idx:
            idx = word_to_idx[token]
            vector[idx] += 1
    
    bow_matrix.append(vector)
    print(f"\nMessage {i+1}: '{demo_messages[i]}'")
    print(f"Tokens: {tokens}")
    print(f"Vector: {vector}")
    
    # Show which positions correspond to which words
    non_zero_positions = [(idx, count) for idx, count in enumerate(vector) if count > 0]
    print(f"Non-zero positions: {[(vocabulary[idx], count) for idx, count in non_zero_positions]}")

In [None]:
# Visualize the Bag of Words matrix (pandas is already imported as pd)

bow_df = pd.DataFrame(bow_matrix, columns=vocabulary)
bow_df.index = [f"Msg {i+1}" for i in range(len(demo_messages))]

print("\nBag of Words Matrix:")
print("=" * 50)
print(bow_df)

# Visualize as heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(bow_df, annot=True, cmap='Blues', fmt='d', cbar_kws={'label': 'Word Count'})
plt.title('Bag of Words Matrix Visualization\n(Rows = Messages, Columns = Words)')
plt.xlabel('Words in Vocabulary')
plt.ylabel('Messages')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

### Key Insights from the Manual BoW Construction:

1. **Vocabulary Creation**: We collect all unique words across all documents
2. **Vector Size**: Each document becomes a vector of size = vocabulary size
3. **Word Counts**: Each position in the vector represents the count of that word
4. **Sparsity**: Most positions are 0 (words don't appear in most documents)
5. **Order Independence**: "free money" and "money free" will wind up as the same.

Now let's use scikit-learn's CountVectorizer to do this automatically for our spam detection task. CountVectorizer is scikit-learn's built-in implementation of bag of words.

## Data Preprocessing and Feature Engineering

We'll use **Bag of Words (CountVectorizer)** to convert text messages
into numerical features that our logistic regression model can understand.

In [None]:
# Split the data
X = df['message']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
print(f"Training set spam ratio: {y_train.mean():.2f}")
print(f"Test set spam ratio: {y_test.mean():.2f}")
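
The `stratify=y` argument asks `train_test_split` to keep the class proportions the same in both splits. A small sketch on made-up labels (20% positive; all `*_toy` names are hypothetical) illustrates the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 20 positives, 80 negatives
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

# Both splits keep the 20% positive rate
print(y_train_toy.mean(), y_test_toy.mean())
```

Without stratification, a small test set could by chance end up with noticeably more or less spam than the training set.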

## Model Training

We'll use a pipeline that combines Bag of Words vectorization with Logistic Regression.

In [None]:
# Create and train the model pipeline
pipeline = Pipeline([
    ('bow', CountVectorizer(
        max_features=1000,  # Keep only the 1000 most frequent terms
        stop_words='english',  # Remove common English stop words
        lowercase=True,  # Convert to lowercase
        ngram_range=(1, 1)  # Unigrams only; use (1, 2) to add bigrams
    )),
    ('classifier', LogisticRegression(
        random_state=42,
        max_iter=1000
    ))
])

# Train the model
print("Training the model...")
pipeline.fit(X_train, y_train)
print("Model training completed!")

## Examining the Bag of Words Matrix for Our Dataset

Let's look at how our actual spam/ham messages are converted to bag of words vectors.

In [None]:
import random

# Get the fitted vectorizer and examine a random sample
bow_vectorizer = pipeline.named_steps['bow']
feature_names = bow_vectorizer.get_feature_names_out()

print(f"Vocabulary size: {len(feature_names)}")
print(f"Sample of 20 random words: {random.sample(list(feature_names), 20)}")

In [None]:
# Visualize the bag of words matrix for a subset of messages
# Select a few messages for visualization
sample_indices = [0, 1, 5, 6, 10, 15]  # Arbitrary row positions; their labels are looked up below
sample_msgs = [df.iloc[i]['message'] for i in sample_indices]
sample_lbls = [df.iloc[i]['label'] for i in sample_indices]

# Transform these messages
bow_matrix = bow_vectorizer.transform(sample_msgs)

# Convert to dense array and create DataFrame
bow_dense = bow_matrix.toarray()

# Only show features that appear in at least one of these messages
feature_mask = bow_dense.sum(axis=0) > 0
active_features = feature_names[feature_mask]
active_bow_matrix = bow_dense[:, feature_mask]

# Create DataFrame for visualization
bow_viz_df = pd.DataFrame(
    active_bow_matrix,
    columns=active_features,
    index=[f"{'SPAM' if lbl else 'HAM'} {i+1}" for i, lbl in enumerate(sample_lbls)]
)

print(f"\nBag of Words Matrix for {len(sample_msgs)} sample messages:")
print(f"Showing {len(active_features)} features that appear in these messages")
print("=" * 70)

# Show the matrix
if len(active_features) <= 20:  # If few features, show all
    print(bow_viz_df)
    
    # Visualize as heatmap
    plt.figure(figsize=(15, 8))
    sns.heatmap(bow_viz_df, annot=True, cmap='Blues', fmt='d', cbar_kws={'label': 'Word Count'})
    plt.title('Bag of Words Matrix for Sample Messages\n(Rows = Messages, Columns = Words)')
    plt.xlabel('Words')
    plt.ylabel('Messages')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
else:  # If many features, show top ones
    # Show features with highest variance across these sample messages
    feature_variance = bow_viz_df.var(axis=0)
    top_features = feature_variance.nlargest(15).index
    
    print(bow_viz_df[top_features])
    
    # Visualize as heatmap
    plt.figure(figsize=(15, 8))
    sns.heatmap(bow_viz_df[top_features], annot=True, cmap='Blues', fmt='d', cbar_kws={'label': 'Word Count'})
    plt.title('Bag of Words Matrix for Sample Messages\n(Top 15 Most Variable Features)')
    plt.xlabel('Words')
    plt.ylabel('Messages')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()

## Implementing Logistic Regression with Gradient Descent

Before we evaluate our scikit-learn model, let's implement logistic regression from scratch
using gradient descent to understand what's happening under the hood.

### Mathematical Foundation

Logistic regression uses the sigmoid function to map any real number to a probability between 0 and 1:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Where $z = \mathbf{w}^T \mathbf{x} + b$ (weights times features plus bias)
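
To get a feel for the sigmoid, here is a quick sketch that evaluates it at a few points. The standalone function name `sigmoid_demo` is a hypothetical helper for this illustration only.

```python
import numpy as np

def sigmoid_demo(z):
    """Plain sigmoid, without the overflow clipping used later in the class."""
    return 1 / (1 + np.exp(-z))

z_values = np.array([-5.0, 0.0, 5.0])
probs = sigmoid_demo(z_values)
print(probs)  # approximately [0.0067 0.5 0.9933]
```

Large negative scores map close to 0, a score of 0 maps to exactly 0.5, and large positive scores map close to 1. The function is also symmetric: $\sigma(-z) = 1 - \sigma(z)$.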

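The cost function (log-loss) we want to minimize, writing $\hat{y}^{(i)} = \sigma(\mathbf{w}^T \mathbf{x}^{(i)} + b)$ for the predicted probability of example $i$, is:
$$J(\mathbf{w}, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + (1-y^{(i)}) \log(1-\hat{y}^{(i)}) \right]$$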
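
For reference, here is a sketch of the gradients that gradient descent needs. Write $\hat{y}^{(i)} = \sigma(\mathbf{w}^T \mathbf{x}^{(i)} + b)$ for the predicted probability of example $i$, and use the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. Applying the chain rule to the log-loss, the $\hat{y}(1-\hat{y})$ factors cancel and we are left with:

$$\frac{\partial J}{\partial \mathbf{w}} = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right) \mathbf{x}^{(i)}, \qquad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)$$

In matrix form the first expression is $\frac{1}{m} X^T (\hat{\mathbf{y}} - \mathbf{y})$, which is what the `dw` and `db` lines in the implementation below compute.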

In [None]:
# Implement logistic regression from scratch
class LogisticRegressionGD:
    """Logistic Regression using Gradient Descent"""
    
    def __init__(self, learning_rate=0.01, max_iterations=1000, tolerance=1e-6):
        self.learning_rate = learning_rate
        self.max_iterations = max_iterations
        self.tolerance = tolerance
        self.weights = None
        self.bias = None
        self.cost_history = []
        
    def sigmoid(self, z):
        """Sigmoid activation function"""
        # Clip z to prevent overflow
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))
    
    def compute_cost(self, y_true, y_pred):
        """Compute logistic regression cost (log-loss)"""
        # Add small epsilon to prevent log(0)
        epsilon = 1e-15
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        
        m = len(y_true)
        cost = -1/m * np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        return cost
    
    def fit(self, X, y):
        """Train the logistic regression model using gradient descent"""
        # Initialize parameters
        m, n = X.shape
        self.weights = np.zeros(n)
        self.bias = 0
        
        # Gradient descent
        for i in range(self.max_iterations):
            # Forward pass
            z = X.dot(self.weights) + self.bias
            predictions = self.sigmoid(z)
            
            # Compute cost
            cost = self.compute_cost(y, predictions)
            self.cost_history.append(cost)
            
            # Compute gradients
            dw = (1/m) * X.T.dot(predictions - y)
            db = (1/m) * np.sum(predictions - y)
            
            # Update parameters
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db
            
            # Check for convergence
            if i > 0 and abs(self.cost_history[-2] - self.cost_history[-1]) < self.tolerance:
                print(f"Converged after {i+1} iterations")
                break
                
        print(f"Final cost: {cost:.6f}")
        return self
    
    def predict_proba(self, X):
        """Predict class probabilities"""
        z = X.dot(self.weights) + self.bias
        return self.sigmoid(z)
    
    def predict(self, X):
        """Make binary predictions"""
        return (self.predict_proba(X) >= 0.5).astype(int)
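
One way to sanity-check a hand-written gradient is to compare it with finite differences. The sketch below does this on random toy data with standalone helper functions (`log_loss_toy` and `analytic_grad_w` are hypothetical names that mirror the formulas in the class rather than calling it).

```python
import numpy as np

def sigmoid_toy(z):
    return 1 / (1 + np.exp(-z))

def log_loss_toy(w, b, X, y):
    """Mean log-loss for weights w and bias b."""
    p = sigmoid_toy(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad_w(w, b, X, y):
    """Gradient of the log-loss with respect to w: (1/m) X^T (p - y)."""
    return X.T @ (sigmoid_toy(X @ w + b) - y) / len(y)

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = (rng.random(50) < 0.5).astype(float)
w_toy = rng.normal(size=3)
b_toy = 0.1

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric_grad = np.zeros_like(w_toy)
for j in range(len(w_toy)):
    step = np.zeros_like(w_toy)
    step[j] = eps
    loss_plus = log_loss_toy(w_toy + step, b_toy, X_toy, y_toy)
    loss_minus = log_loss_toy(w_toy - step, b_toy, X_toy, y_toy)
    numeric_grad[j] = (loss_plus - loss_minus) / (2 * eps)

max_diff = np.max(np.abs(numeric_grad - analytic_grad_w(w_toy, b_toy, X_toy, y_toy)))
print(f"Largest gap between numeric and analytic gradient: {max_diff:.2e}")
```

If the two gradients disagree by more than a tiny amount, the analytic formula (or its implementation) probably has a bug.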

### Training Our Custom Model

Let's train our gradient descent implementation on the same bag of words features:

In [None]:
# Get the bag of words features from our pipeline
X_train_bow = pipeline.named_steps['bow'].transform(X_train).toarray()
X_test_bow = pipeline.named_steps['bow'].transform(X_test).toarray()

print(f"Training set shape: {X_train_bow.shape}")
print(f"Test set shape: {X_test_bow.shape}")

# Train our custom logistic regression
print("Training custom logistic regression with gradient descent...")
custom_lr = LogisticRegressionGD(learning_rate=0.1, max_iterations=1000)
custom_lr.fit(X_train_bow, y_train.values)

### Visualizing the Learning Process

In [None]:
# Plot the cost function over iterations
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(custom_lr.cost_history)
plt.title('Cost Function During Training')
plt.xlabel('Iteration')
plt.ylabel('Cost (Log-Loss)')
plt.grid(True, alpha=0.3)

# Zoom in on the last part of training
plt.subplot(1, 2, 2)
start_idx = max(0, len(custom_lr.cost_history) - 200)
plt.plot(range(start_idx, len(custom_lr.cost_history)), custom_lr.cost_history[start_idx:])
plt.title('Cost Function (Last 200 Iterations)')
plt.xlabel('Iteration')
plt.ylabel('Cost (Log-Loss)')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Comparing Our Implementation with Scikit-learn

In [None]:
# Make predictions with our custom model
custom_y_pred = custom_lr.predict(X_test_bow)
custom_y_pred_proba = custom_lr.predict_proba(X_test_bow)

# Get scikit-learn predictions for comparison
sklearn_y_pred = pipeline.predict(X_test)
sklearn_y_pred_proba = pipeline.predict_proba(X_test)[:, 1]

# Compare accuracies
custom_accuracy = accuracy_score(y_test, custom_y_pred)
sklearn_accuracy = accuracy_score(y_test, sklearn_y_pred)

print("Model Comparison:")
print("=" * 40)
print(f"Custom Gradient Descent Accuracy: {custom_accuracy:.4f}")
print(f"Scikit-learn Accuracy:           {sklearn_accuracy:.4f}")
print(f"Difference:                      {abs(custom_accuracy - sklearn_accuracy):.4f}")

# Compare AUC scores
custom_auc = roc_auc_score(y_test, custom_y_pred_proba)
sklearn_auc = roc_auc_score(y_test, sklearn_y_pred_proba)

print(f"\nCustom Gradient Descent AUC:     {custom_auc:.4f}")
print(f"Scikit-learn AUC:                {sklearn_auc:.4f}")
print(f"Difference:                      {abs(custom_auc - sklearn_auc):.4f}")

### Visualizing Model Comparison

In [None]:
# Create comparison visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# ROC Curves comparison
fpr_custom, tpr_custom, _ = roc_curve(y_test, custom_y_pred_proba)
fpr_sklearn, tpr_sklearn, _ = roc_curve(y_test, sklearn_y_pred_proba)

axes[0, 0].plot(fpr_custom, tpr_custom, color='red', lw=2, 
                label=f'Custom GD (AUC = {custom_auc:.3f})')
axes[0, 0].plot(fpr_sklearn, tpr_sklearn, color='blue', lw=2, 
                label=f'Scikit-learn (AUC = {sklearn_auc:.3f})')
axes[0, 0].plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
axes[0, 0].set_xlim([0.0, 1.0])
axes[0, 0].set_ylim([0.0, 1.05])
axes[0, 0].set_xlabel('False Positive Rate')
axes[0, 0].set_ylabel('True Positive Rate')
axes[0, 0].set_title('ROC Curve Comparison')
axes[0, 0].legend(loc="lower right")

# Prediction probability comparison
axes[0, 1].scatter(sklearn_y_pred_proba, custom_y_pred_proba, alpha=0.6)
axes[0, 1].plot([0, 1], [0, 1], 'r--', lw=2)
axes[0, 1].set_xlabel('Scikit-learn Predicted Probability')
axes[0, 1].set_ylabel('Custom GD Predicted Probability')
axes[0, 1].set_title('Prediction Probability Comparison')
axes[0, 1].grid(True, alpha=0.3)

# Feature weights comparison (top 20 features)
feature_names = pipeline.named_steps['bow'].get_feature_names_out()
sklearn_coef = pipeline.named_steps['classifier'].coef_[0]

# Get top 20 features by absolute weight in sklearn model
top_indices = np.argsort(np.abs(sklearn_coef))[-20:]
top_features = feature_names[top_indices]
sklearn_top_weights = sklearn_coef[top_indices]
custom_top_weights = custom_lr.weights[top_indices]

x_pos = np.arange(len(top_features))
width = 0.35

axes[1, 0].barh(x_pos - width/2, sklearn_top_weights, width, 
                label='Scikit-learn', alpha=0.8)
axes[1, 0].barh(x_pos + width/2, custom_top_weights, width, 
                label='Custom GD', alpha=0.8)
axes[1, 0].set_yticks(x_pos)
axes[1, 0].set_yticklabels(top_features)
axes[1, 0].set_xlabel('Weight Value')
axes[1, 0].set_title('Top 20 Feature Weights Comparison')
axes[1, 0].legend()

# Weight correlation
axes[1, 1].scatter(sklearn_coef, custom_lr.weights, alpha=0.6)
axes[1, 1].plot([sklearn_coef.min(), sklearn_coef.max()], 
                [sklearn_coef.min(), sklearn_coef.max()], 'r--', lw=2)
axes[1, 1].set_xlabel('Scikit-learn Weights')
axes[1, 1].set_ylabel('Custom GD Weights')
axes[1, 1].set_title('All Feature Weights Correlation')
axes[1, 1].grid(True, alpha=0.3)

# Calculate correlation
correlation = np.corrcoef(sklearn_coef, custom_lr.weights)[0, 1]
axes[1, 1].text(0.05, 0.95, f'Correlation: {correlation:.4f}', 
                transform=axes[1, 1].transAxes, fontsize=12,
                bbox=dict(boxstyle="round,pad=0.3", facecolor="yellow", alpha=0.5))

plt.tight_layout()
plt.show()

print(f"\nWeight correlation between models: {correlation:.4f}")

### Some takeaways

1. **Similar Performance**: Our gradient descent implementation achieves very similar results to scikit-learn
2. **Learning Process**: We can visualize how the cost function decreases during training
3. **Weight Correlation**: The learned weights are highly correlated between implementations. *Why might they be different?*
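
One likely contributor to the weight differences is regularization: scikit-learn's `LogisticRegression` applies an L2 penalty by default (`C=1.0`), while our gradient descent version has no penalty (and may also stop before fully converging). The sketch below, on made-up toy data, compares the default setting with a very weak penalty; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))

# Noisy labels so the two classes are not perfectly separable
true_w = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y_toy = (X_toy @ true_w + rng.normal(scale=1.0, size=200) > 0).astype(int)

default_model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)        # C=1.0, L2 penalty
weak_penalty = LogisticRegression(C=1e6, max_iter=1000).fit(X_toy, y_toy)  # nearly unregularized

default_norm = np.linalg.norm(default_model.coef_)
weak_norm = np.linalg.norm(weak_penalty.coef_)
print(f"Weight norm with C=1.0: {default_norm:.3f}")
print(f"Weight norm with C=1e6: {weak_norm:.3f}")
```

The penalized model ends up with smaller weights, so even two correct implementations can produce different coefficients when their regularization differs.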

### Why Use Scikit-learn in Practice?

It'll do the thing you want quickly, with a well-tested solver. But sometimes you'll need something customized, and then you may want to break out the explicit solution. Besides, if you don't understand what it's doing under the hood, things can go very awry.

## Model Evaluation

In [None]:
# Make predictions (using scikit-learn model)
y_pred = pipeline.predict(X_test)
y_pred_proba = pipeline.predict_proba(X_test)[:, 1]

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
auc_score = roc_auc_score(y_test, y_pred_proba)

print(f"Accuracy: {accuracy:.3f}")
print(f"AUC Score: {auc_score:.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))
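
The precision and recall values in the classification report come straight from the confusion matrix counts. Here is a small sketch on made-up labels and predictions (the `*_toy` arrays are hypothetical) that computes them by hand and checks them against scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions (1 = spam, 0 = ham)
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred_toy = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision_manual = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall_manual = tp / (tp + fn)     # of the actual spam, how much was caught

print(precision_manual, precision_score(y_true_toy, y_pred_toy))  # 0.8 0.8
print(recall_manual, recall_score(y_true_toy, y_pred_toy))        # 0.8 0.8
```

For a spam filter, low precision means legitimate mail gets flagged, while low recall means spam gets through, so it is worth looking at both rather than accuracy alone.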

## Visualization of Results

In [None]:
# Create visualization plots
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0, 0])
axes[0, 0].set_title('Confusion Matrix')
axes[0, 0].set_xlabel('Predicted')
axes[0, 0].set_ylabel('Actual')
axes[0, 0].set_xticklabels(['Ham', 'Spam'])
axes[0, 0].set_yticklabels(['Ham', 'Spam'])

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
axes[0, 1].plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {auc_score:.3f})')
axes[0, 1].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
axes[0, 1].set_xlim([0.0, 1.0])
axes[0, 1].set_ylim([0.0, 1.05])
axes[0, 1].set_xlabel('False Positive Rate')
axes[0, 1].set_ylabel('True Positive Rate')
axes[0, 1].set_title('ROC Curve')
axes[0, 1].legend(loc="lower right")

# Feature Importance (Top Bag of Words features)
feature_names = pipeline.named_steps['bow'].get_feature_names_out()
coefficients = pipeline.named_steps['classifier'].coef_[0]

# Get top 10 positive and negative coefficients
top_positive_idx = np.argsort(coefficients)[-10:]
top_negative_idx = np.argsort(coefficients)[:10]

top_features = np.concatenate([top_negative_idx, top_positive_idx])
top_coeffs = coefficients[top_features]
top_feature_names = [feature_names[i] for i in top_features]

colors = ['red' if coeff < 0 else 'blue' for coeff in top_coeffs]
axes[1, 0].barh(range(len(top_coeffs)), top_coeffs, color=colors, alpha=0.7)
axes[1, 0].set_yticks(range(len(top_coeffs)))
axes[1, 0].set_yticklabels(top_feature_names)
axes[1, 0].set_xlabel('Coefficient Value')
axes[1, 0].set_title('Top 20 Features (Red=Ham, Blue=Spam)')

# Prediction probabilities distribution
axes[1, 1].hist(y_pred_proba[y_test == 0], alpha=0.7, label='Ham', bins=10, density=True)
axes[1, 1].hist(y_pred_proba[y_test == 1], alpha=0.7, label='Spam', bins=10, density=True)
axes[1, 1].set_xlabel('Spam Probability')
axes[1, 1].set_ylabel('Density')
axes[1, 1].set_title('Distribution of Prediction Probabilities')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

## Testing with New Messages

In [None]:
# Test the model with some new messages
test_messages = [
    "Congratulations! You've won a free iPhone! Click here to claim!",
    "Hey, can we meet for lunch tomorrow?",
    "URGENT: Your bank account needs verification!",
    "The meeting is scheduled for 2 PM in conference room A",
    "Get rich quick with this amazing opportunity!"
]

print("Testing with new messages:")
print("=" * 50)

for i, message in enumerate(test_messages, 1):
    prediction = pipeline.predict([message])[0]
    probability = pipeline.predict_proba([message])[0]
    
    label = "SPAM" if prediction == 1 else "HAM"
    spam_prob = probability[1]
    
    print(f"\nMessage {i}: {message}")
    print(f"Prediction: {label}")
    print(f"Spam Probability: {spam_prob:.3f}")

## Model Interpretation

Let's examine what the model learned by looking at the most important features.

In [None]:
# Get feature importance
feature_names = pipeline.named_steps['bow'].get_feature_names_out()
coefficients = pipeline.named_steps['classifier'].coef_[0]

# Create a DataFrame for easier analysis
feature_importance = pd.DataFrame({
    'feature': feature_names,
    'coefficient': coefficients
})

# Sort by coefficient value
feature_importance = feature_importance.sort_values('coefficient', key=abs, ascending=False)

print("Top 15 Most Important Features:")
print("=" * 40)
print("Positive coefficients indicate spam-like features")
print("Negative coefficients indicate ham-like features")
print()

for idx, row in feature_importance.head(15).iterrows():
    direction = "SPAM" if row['coefficient'] > 0 else "HAM"
    print(f"{row['feature']:20} | {row['coefficient']:8.3f} | {direction}")
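
A logistic regression coefficient can also be read as an odds multiplier: holding the other features fixed, each additional occurrence of a word multiplies the odds of spam by $e^{\text{coefficient}}$. The sketch below uses made-up coefficient values, not ones learned from this dataset.

```python
import numpy as np

# Hypothetical coefficients, for illustration only
toy_coefficients = {"free": 1.2, "meeting": -0.7}

odds_multipliers = {word: np.exp(coef) for word, coef in toy_coefficients.items()}

for word, multiplier in odds_multipliers.items():
    print(f"{word:10} | coefficient {toy_coefficients[word]:+.2f} | odds x {multiplier:.2f}")
```

Here a coefficient of 1.2 multiplies the spam odds by about 3.3 per occurrence, while a coefficient of -0.7 roughly halves them.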

## Summary and Conclusions

In this notebook, we successfully built a spam detection system using logistic regression. However, can you see why this approach might fail if I were to deploy it and not retrain it frequently?

### Why This Model Might Fail in Production Without Retraining

Let's demonstrate with some examples of how spammers might evolve:

In [None]:
# Examples of evolved spam that would likely bypass our model
evolved_spam_examples = [
    # Character substitution
    "Fr33 M0n3y N0W! C1ick h3r3 for amaz1ng d3als!",
    
    # New technology terms (post-training)
    "Exclusive NFT drop! Mint your crypto fortune today!",
    
    # Creative spacing that breaks words into single characters
    "F R E E   G I F T S   A V A I L A B L E   N O W",
    
    # New scam types
    "Your Amazon Prime account needs verification. Click here to avoid suspension.",
    
    # Social media style
    "OMG bestie! This side hustle is literally printing money 💰💰💰 DM me!"
]

print("Testing our model on evolved spam examples:")
print("=" * 60)

for i, message in enumerate(evolved_spam_examples, 1):
    prediction = pipeline.predict([message])[0]
    probability = pipeline.predict_proba([message])[0]
    
    label = "SPAM" if prediction == 1 else "HAM"
    spam_prob = probability[1]
    
    print(f"\nExample {i}: {message}")
    print(f"Prediction: {label}")
    print(f"Spam Probability: {spam_prob:.3f}")
    
    # Count how many distinct vocabulary terms the vectorizer finds in this message
    bow_vector = pipeline.named_steps['bow'].transform([message])
    recognized_terms = bow_vector.nnz  # non-zero entries = distinct known terms
    total_words = len(message.split())  # rough count of whitespace-separated words
    
    print(f"Vocabulary terms found: {recognized_terms} (out of {total_words} whitespace-separated words)")
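
To see why some of these messages slip past the vocabulary, it helps to look at the tokens the vectorizer actually produces. This sketch builds an analyzer with the same `lowercase` and `stop_words` options as the pipeline's vectorizer; no fitted model is needed just to tokenize.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same preprocessing options as the pipeline's vectorizer (no fitting required)
demo_analyzer = CountVectorizer(lowercase=True, stop_words='english').build_analyzer()

leet_tokens = demo_analyzer("Fr33 M0n3y N0W!")
spaced_tokens = demo_analyzer("F R E E   G I F T S")

print(leet_tokens)    # ['fr33', 'm0n3y', 'n0w']
print(spaced_tokens)  # []
```

Character substitution produces tokens such as "fr33" that are unlikely to be in a vocabulary learned from normally spelled text, and the spaced-out message yields no tokens at all because the default token pattern ignores single-character words.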

In [None]:
print("Analysis complete!")
print(f"Final model accuracy: {accuracy:.1%}")
print(f"Final model AUC score: {auc_score:.3f}")