# Spam vs. Ham: Building a Spam Detection System 📧

Hi there! 👋 In this notebook, we're going to build a machine learning model to detect spam emails.
We'll start by exploring our dataset, visualizing the data to understand the patterns, and then we'll clean the text data. 
Finally, we'll train a couple of models and see how well they perform. Let's get started!

## 1. Import Libraries and Load Data 📚
First things first, we need to import the necessary tools. We'll be using `pandas` for data manipulation, `matplotlib` and `seaborn` for visualization, and `sklearn` for the magic (machine learning).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Setting plot styles for better aesthetics
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

Now, let's load our dataset. We have a file named `spam_emails_data.csv`. Let's peek inside.

In [None]:
# Load the dataset
df = pd.read_csv('artifacts/spam_emails_data.csv')

# Display the first few rows
df.head()

## 2. Exploratory Data Analysis (EDA) 🔍
Before we dive into modeling, it's crucial to understand our data. Let's look at the shape of the data and check for any missing values.

In [None]:
# Check dataset shape
print(f"Dataset Shape: {df.shape}")

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Check for duplicates
print("\nDuplicate Entries:", df.duplicated().sum())

In [None]:
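# Drop rows with missing text or label (the later steps call len() and join() on the text),
# then remove duplicates to avoid biasing our model.
df = df.dropna(subset=['text', 'label']).drop_duplicates()
print(f"Shape after removing missing rows and duplicates: {df.shape}")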
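
To make these checks concrete, here is a minimal sketch on a tiny made-up frame. The `toy` DataFrame and its rows are invented for illustration; only the `text` and `label` column names come from our dataset.

```python
import pandas as pd

# Hypothetical toy data: one exact duplicate row and one missing text value
toy = pd.DataFrame({
    "label": ["Ham", "Spam", "Spam", "Ham"],
    "text": ["see you at noon", "win a free prize", "win a free prize", None],
})

n_missing = int(toy["text"].isnull().sum())  # rows with no text
n_dupes = int(toy.duplicated().sum())        # rows identical to an earlier row

# Drop rows with missing text, then drop exact duplicates
toy_clean = toy.dropna(subset=["text"]).drop_duplicates()

print(n_missing, n_dupes, toy_clean.shape)
```

Note that `duplicated()` only flags the second and later copies of a row, so the first occurrence is always kept.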

### Visualizing the Ham / Spam Emails


In [None]:
# Count of Spam vs Ham
plt.figure(figsize=(8, 6))
sns.countplot(x='label', data=df, palette='viridis')
plt.title('Distribution of Spam vs. Ham Emails', fontsize=16)
plt.xlabel('Label', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.show()
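
The plot shows the class balance visually, but exact numbers help when deciding whether accuracy alone is a fair metric. A small sketch, using an invented `toy_labels` series in place of `df['label']`:

```python
import pandas as pd

# Hypothetical labels standing in for df['label']
toy_labels = pd.Series(["Ham", "Ham", "Ham", "Spam"], name="label")

counts = toy_labels.value_counts()                # absolute count per class
shares = toy_labels.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares.round(2))
```

If one class dominates, a model can reach high accuracy just by predicting that class, which is why we look at precision and recall later on.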

### Feature Engineering: Message Length
I'm curious: do spam emails tend to be longer or shorter than regular emails? Let's create a new feature `message_length` and find out.

In [None]:
df['message_length'] = df['text'].apply(len)
df.head()

In [None]:
# Plotting the length distribution
plt.figure(figsize=(12, 6))
sns.histplot(data=df, x='message_length', hue='label', element='step', stat='density', common_norm=False)
plt.title('Message Length Distribution by Label', fontsize=16)
plt.xlabel('Message Length', fontsize=12)
plt.show()
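
A histogram can hide the actual numbers, so it may help to summarize length per label as well. This is a rough sketch on invented messages (the `toy_msgs` frame is hypothetical); on the real data you would run the same `groupby` on `df`.

```python
import pandas as pd

# Hypothetical messages; only the column names mirror the notebook
toy_msgs = pd.DataFrame({
    "label": ["Ham", "Ham", "Spam", "Spam"],
    "text": ["ok", "see you soon", "claim your free prize now", "win cash"],
})
toy_msgs["message_length"] = toy_msgs["text"].str.len()

# Mean, median, and max length for each label
length_stats = toy_msgs.groupby("label")["message_length"].agg(["mean", "median", "max"])
print(length_stats)
```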

### Word Cloud Visualization ☁️
Let's visualize the most common words in Spam and Ham emails using WordClouds.

In [None]:
from wordcloud import WordCloud

# Combine all text for Spam and Ham
spam_text = " ".join(df[df['label'] == 'Spam']['text'])
ham_text = " ".join(df[df['label'] == 'Ham']['text'])

# Create WordClouds
wc_spam = WordCloud(width=800, height=400, background_color='black', colormap='Reds').generate(spam_text)
wc_ham = WordCloud(width=800, height=400, background_color='white', colormap='Greens').generate(ham_text)

# Plotting
plt.figure(figsize=(16, 8))

plt.subplot(1, 2, 1)
plt.imshow(wc_spam, interpolation='bilinear')
plt.axis('off')
plt.title('Spam Email Word Cloud', fontsize=16)

plt.subplot(1, 2, 2)
plt.imshow(wc_ham, interpolation='bilinear')
plt.axis('off')
plt.title('Ham Email Word Cloud', fontsize=16)

plt.show()
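
Word clouds are great for a first impression, but they don't give exact counts. As a complement, here is a minimal sketch that counts the most frequent tokens with the standard library. The `top_words` helper and the example messages are made up for illustration.

```python
from collections import Counter

def top_words(texts, n=3):
    """Return the n most frequent lowercase, whitespace-separated tokens."""
    counts = Counter()
    for t in texts:
        counts.update(t.lower().split())
    return counts.most_common(n)

# Hypothetical spam-like messages
spam_examples = ["Free prize inside", "claim your FREE prize", "free entry"]
top_spam = top_words(spam_examples, n=2)
print(top_spam)
```

On the notebook's data, something like `top_words(df[df['label'] == 'Spam']['text'])` would give the same kind of list for the real spam emails.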

## 3. Text Preprocessing 🧹
Computers understand numbers, not words. But before we convert text to numbers, we need to clean it up. 
We will:
1. Convert to lowercase.
2. Remove punctuation and special characters.
3. Remove stopwords (common words like 'the', 'is', etc.).

In [None]:
import string
from nltk.corpus import stopwords
import nltk

# Try to download stopwords if not present
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

# Build the stopword set once; calling stopwords.words() for every word is very slow
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    # Remove punctuation
    nopunc = ''.join(char for char in text if char not in string.punctuation)

    # Convert to lowercase and remove stopwords
    clean_words = [word for word in nopunc.lower().split() if word not in STOP_WORDS]

    return " ".join(clean_words)

# Apply the cleaning function (this might take a moment!)
print("Cleaning text data... this might take a few seconds.")
df['clean_text'] = df['text'].apply(clean_text)
print("Text cleaning done!")
df[['text', 'clean_text']].head()
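
To see the three cleaning steps on a single message, here is a self-contained sketch. It uses a tiny hand-picked stopword set (`MINI_STOPWORDS`, an assumption made so the example runs without NLTK) and `str.translate` to strip punctuation; the function name `clean_text_sketch` is hypothetical.

```python
import string

# Hypothetical mini stopword list; the notebook uses NLTK's English list instead
MINI_STOPWORDS = {"the", "is", "a", "to", "you"}

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text_sketch(text):
    """Lowercase, strip punctuation, and drop stopwords."""
    no_punct = text.lower().translate(PUNCT_TABLE)
    return " ".join(w for w in no_punct.split() if w not in MINI_STOPWORDS)

example = clean_text_sketch("Congratulations!!! You WON a prize, click here to claim.")
print(example)  # congratulations won prize click here claim
```

`str.translate` handles the whole string in one pass, which tends to be faster than filtering character by character.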

## 4. Vectorization (Feature Extraction) 🔢
Now we convert our cleaned text into numerical features using `TfidfVectorizer` (Term Frequency-Inverse Document Frequency). TF-IDF downweights words that appear in many documents, such as common words that are not on the stopword list, so they contribute less than rarer, more distinctive words.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=3000) # Limit to top 3000 features to keep the model lightweight
# Keep the result sparse: MultinomialNB and SVC accept sparse input, and a dense array wastes memory
X = tfidf.fit_transform(df['clean_text'])

# Encode the target variable
y = df['label'].map({'Spam': 1, 'Ham': 0})

print(f"Feature Matrix Shape: {X.shape}")
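
To make the downweighting concrete, here is a small sketch on three invented documents. With scikit-learn's default smoothing, the IDF of a term is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is how many of them contain the term. The `toy_docs` corpus and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "now" appears in every document, "free" in only one
toy_docs = ["free prize now", "meeting now", "lunch now"]

toy_vec = TfidfVectorizer()
toy_weights = toy_vec.fit_transform(toy_docs)  # sparse matrix, one row per document

vocab = toy_vec.vocabulary_  # maps each term to its column index
idf_now = toy_vec.idf_[vocab["now"]]
idf_free = toy_vec.idf_[vocab["free"]]

print(f"idf('now')  = {idf_now:.3f}")
print(f"idf('free') = {idf_free:.3f}")
```

Because "now" appears everywhere, it gets the lowest possible IDF, while "free" gets a higher weight and so stands out more in the feature matrix.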

## 5. Model Training 🤖
We'll split our data into training and testing sets, then train two models:
1. **Multinomial Naive Bayes**: A classic algorithm for text classification.
2. **Support Vector Machine (SVC)**: Often performs well on high-dimensional data.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Split the data, keeping the Spam/Ham ratio the same in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training Data Shape: {X_train.shape}")
print(f"Testing Data Shape: {X_test.shape}")
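
One subtlety: above, the TF-IDF vocabulary and IDF weights were fitted on every row before splitting, so the test set influenced the features. A common way to avoid this is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline. The sketch below shows the idea on a handful of invented messages; `toy_texts`, `toy_y`, and `pipe` are illustrative names, not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical messages: 1 = Spam, 0 = Ham
toy_texts = [
    "win free cash now", "free prize claim today",
    "urgent offer free entry", "claim your cash prize",
    "meeting moved to noon", "lunch with the team",
    "see you at the office", "notes from the meeting",
]
toy_y = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so the test messages stay unseen
txt_train, txt_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_y, test_size=0.25, random_state=42, stratify=toy_y
)

# The pipeline fits TF-IDF on the training text only, then trains the classifier
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe.fit(txt_train, toy_y_train)
toy_preds = pipe.predict(txt_test)

print(list(zip(txt_test, toy_preds)))
```

A fitted pipeline also makes it easy to score new raw emails later, since `pipe.predict` takes plain strings.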

### Training Naive Bayes

In [None]:
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)
nb_pred = nb_model.predict(X_test)

print("Naive Bayes Accuracy:", accuracy_score(y_test, nb_pred))
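
An accuracy number is easier to judge next to a baseline. As a rough sketch, scikit-learn's `DummyClassifier` with `strategy="most_frequent"` always predicts the majority class; any real model should beat it. The labels below are invented (8 Ham, 2 Spam), and the feature matrix is a placeholder because the dummy model ignores it.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels: 8 Ham (0) and 2 Spam (1)
toy_true = np.array([0] * 8 + [1] * 2)
X_placeholder = np.zeros((10, 1))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, toy_true)
baseline_acc = accuracy_score(toy_true, baseline.predict(X_placeholder))

print(f"Majority-class baseline accuracy: {baseline_acc:.2f}")
```

Here the baseline already reaches 0.80 without catching a single spam email, which is a good reminder to look beyond accuracy.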

### Training Support Vector Machine (SVM)

In [None]:
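# A linear kernel often works well for high-dimensional text features.
# probability=True enables predict_proba (used for the ROC curve later) at extra training cost.
svm_model = SVC(probability=True, kernel='linear')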
svm_model.fit(X_train, y_train)
svm_pred = svm_model.predict(X_test)

print("SVM Accuracy:", accuracy_score(y_test, svm_pred))
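
`SVC` with a linear kernel can get slow as the number of emails grows. If that becomes a problem, one option to consider is `LinearSVC`, which also fits a linear SVM but uses a different solver. It has no `predict_proba`, so the sketch below uses `decision_function` scores instead. The toy messages and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical messages: 1 = Spam, 0 = Ham
svm_texts = [
    "win free cash now", "free prize claim now",
    "meeting at noon tomorrow", "lunch with the team",
]
svm_labels = [1, 1, 0, 0]

X_svm = TfidfVectorizer().fit_transform(svm_texts)

linear_svm = LinearSVC(random_state=42).fit(X_svm, svm_labels)

# Signed distance to the separating hyperplane: positive means "Spam" here
svm_scores = linear_svm.decision_function(X_svm)
svm_train_preds = linear_svm.predict(X_svm)

print(svm_scores.round(2), svm_train_preds)
```

These scores can be passed to `roc_curve` in place of probabilities, since the ROC curve only depends on how the examples are ranked.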

## 6. Model Evaluation 📊
Let's dive deeper than just accuracy. We'll look at the Confusion Matrix and Classification Report for the SVM model. Naive Bayes usually scores close to it, so compare the two accuracies printed above before settling on one.

In [None]:
# Confusion Matrix for SVM
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, svm_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False, 
            xticklabels=['Ham', 'Spam'], yticklabels=['Ham', 'Spam'])
plt.title('Confusion Matrix (SVM)', fontsize=16)
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('Actual', fontsize=12)
plt.show()

In [None]:
print("Classification Report (SVM):\n")
print(classification_report(y_test, svm_pred, target_names=['Ham', 'Spam']))
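
The numbers in the classification report come straight from the confusion matrix. Here is a worked sketch with an invented matrix in scikit-learn's layout (rows are actual classes, columns are predicted, in the order Ham, Spam). The counts are made up and are not results from our model.

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual, columns = predicted, order [Ham, Spam]
toy_cm = np.array([[90, 10],
                   [5, 45]])
tn, fp, fn, tp = toy_cm.ravel()

precision = tp / (tp + fp)  # of the emails flagged as spam, how many really were spam
recall = tp / (tp + fn)     # of the real spam emails, how many we caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For a spam filter, false positives (ham sent to the spam folder) are often the more painful error, so precision on the Spam class deserves a close look.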

### ROC Curve
An ROC curve helps us visualize the trade-off between the true positive rate and the false positive rate as we vary the decision threshold.

In [None]:
from sklearn.metrics import roc_curve, auc

y_prob = svm_model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)', fontsize=16)
plt.legend(loc="lower right")
plt.show()
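
To get a feel for what the AUC measures, here is a tiny sketch with four invented scores. The AUC equals the fraction of (spam, ham) pairs in which the spam example receives the higher score. The `toy_` variables are illustrative only.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted spam scores
toy_y_true = [0, 0, 1, 1]
toy_y_score = [0.1, 0.4, 0.35, 0.8]

toy_fpr, toy_tpr, toy_thresholds = roc_curve(toy_y_true, toy_y_score)
toy_auc = roc_auc_score(toy_y_true, toy_y_score)

# 3 of the 4 (spam, ham) pairs are ranked correctly, so the AUC is 0.75
print(f"AUC = {toy_auc:.2f}")
```

An AUC of 0.5 corresponds to the dashed diagonal in the plot (random ranking), and 1.0 means every spam email scores higher than every ham email.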

## Conclusion 🏁
We successfully built a Spam Detection model using this email dataset. 

- We cleaned the text data to remove noise.
- We visualized the data to find insights (like message length).
- We trained a powerful SVM model (and a Naive Bayes baseline).
- The SVM model showed excellent performance with high accuracy and a strong ROC AUC score.

This model could be the start of a real-world spam filter! Thanks for following along. 🎉