In [2]:
%pip install scikit-learn numpy requests

Note: you may need to restart the kernel to use updated packages.


In [3]:
import os, tarfile, urllib.request

# Download and extract spam/ham datasets
def get_data():
    os.makedirs("data", exist_ok=True)
    
    urls = [
        ("https://spamassassin.apache.org/old/publiccorpus/20030228_spam.tar.bz2", "spam.tar.bz2"),
        ("https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2", "ham.tar.bz2")
    ]
    
    for url, file in urls:
        path = f"data/{file}"
        if not os.path.exists(path):
            urllib.request.urlretrieve(url, path)
        # Open the archive in a context manager so the file handle is closed afterwards
        with tarfile.open(path) as tar:
            tar.extractall("data")

get_data()



# Email Spam Detection

## Step 1: Getting Data 

- I downloaded **real email data** from the SpamAssassin public corpus
- This data has **examples** of both spam and good (ham) emails


Downloaded: spam.tar.bz2 (500 spam emails) 
Downloaded: ham.tar.bz2 (2,500 good emails)
Extracted to: data/ folder



- **Spam emails**: so the computer learns what bad emails look like
- **Ham emails**: so the computer learns what good emails look like
  

data/
├── spam/           ← 500 bad emails
└── easy_ham/       ← 2,500 good emails
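
Newer Python versions warn when `extractall` is called without an extraction filter. Here is a minimal sketch of a more defensive extraction step. The helper name `extract_archive` is hypothetical, the `filter="data"` option is only used when the running Python provides `tarfile.data_filter`, and the demo builds a tiny throwaway archive instead of touching the real corpus:

```python
import os
import tarfile
import tempfile

def extract_archive(archive_path, dest_dir):
    """Extract a tar archive, using the safer 'data' filter when available."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path) as tar:  # compression is auto-detected
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest_dir, filter="data")
        else:
            # Older Python without extraction filters
            tar.extractall(dest_dir)

# Demo on a throwaway archive (not the real SpamAssassin files)
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "0001.txt")
    with open(src, "w") as f:
        f.write("Subject: hi\n\nhello")
    archive = os.path.join(tmp, "demo.tar")
    with tarfile.open(archive, "w") as tar:
        tar.add(src, arcname="spam/0001.txt")
    out_dir = os.path.join(tmp, "data")
    extract_archive(archive, out_dir)
    extracted = sorted(os.listdir(os.path.join(out_dir, "spam")))

print(extracted)
```

The same helper could stand in for the one-line `extractall` call in `get_data()`, since `tarfile.open` detects the `.bz2` compression on its own.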


In [4]:
import numpy as np
from sklearn.model_selection import train_test_split

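# Get file paths and labels
# Note: os.listdir returns every file in the folder. Each corpus folder appears to
# include a non-email `cmds` listing file, which would explain a total of 3,002
# below rather than 3,000.
spam_files = [f"data/spam/{f}" for f in os.listdir("data/spam")]
ham_files = [f"data/easy_ham/{f}" for f in os.listdir("data/easy_ham")]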

# Combine files and labels (1=spam, 0=ham)
X = spam_files + ham_files
y = [1] * len(spam_files) + [0] * len(ham_files)

# Split data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Total: {len(X)}, Train: {len(X_train)}, Test: {len(X_test)}")

Total: 3002, Train: 2401, Test: 601


## Step 2: Teaching the Computer with Labels 

- This is called **"Supervised Learning"** because we supervise (teach) the computer

**How I labeled my emails:**
```python
# Like flashcards for the computer:
# email1.txt → SPAM (label = 1)
# email2.txt → SPAM (label = 1)
# email3.txt → GOOD (label = 0)
# email4.txt → GOOD (label = 0)
```

**Train/test split:**
- **Training set (80%)**: the computer learns from these
- **Test set (20%)**: I check whether the computer learned correctly


**Why split the data?**
- Without emails it has never seen, I couldn't tell whether the computer just memorized the training emails instead of actually learning patterns

**My dataset**: 3,002 total emails → 2,401 for training, 601 for testing
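
To make the split idea concrete, here is a rough pure-Python sketch of an 80/20 shuffle split. It is a simplified stand-in for `train_test_split`, not scikit-learn's implementation, so it will not reproduce the same shuffle order. The function name, file names, and labels are all placeholders:

```python
import math
import random

def simple_train_test_split(items, labels, test_size=0.2, seed=42):
    """Shuffle the indices, then hold out a `test_size` share for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_tr = [items[i] for i in train_idx]
    X_te = [items[i] for i in test_idx]
    y_tr = [labels[i] for i in train_idx]
    y_te = [labels[i] for i in test_idx]
    return X_tr, X_te, y_tr, y_te

# Placeholder file names and labels standing in for the real 3,002 paths
toy_files = [f"email_{i}.txt" for i in range(3002)]
toy_labels = [1 if i % 6 == 0 else 0 for i in range(3002)]

X_tr, X_te, y_tr, y_te = simple_train_test_split(toy_files, toy_labels)
print(f"Train: {len(X_tr)}, Test: {len(X_te)}")
```

No email ends up in both lists, which is the whole point: the test emails stay unseen until evaluation.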

In [None]:
import re
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

# Clean text function
def clean_text(text):
    text = text.lower()
    text = re.sub(r'http\S+|www\S+|https\S+', 'URL', text)
    text = re.sub(r'\d+', 'NUMBER', text)
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text

# Load and clean email
def load_email(filename):
    with open(filename, "r", encoding="latin-1") as f:
        content = f.read()
        body = content[content.find('\n\n'):] if '\n\n' in content else content
        return clean_text(body)

# Process training data
X_train_cleaned = [load_email(f) for f in X_train]

# Create pipeline and transform data
pipeline = Pipeline([('vectorizer', CountVectorizer(stop_words='english'))])
X_train_transformed = pipeline.fit_transform(X_train_cleaned)

print(f"Shape: {X_train_transformed.shape}")

Shape: (2401, 52508)
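
`load_email` keeps everything after the first blank line, because that is where an email's headers end and its body begins. Here is a small sketch of that idea on a made-up message, compared with Python's standard-library `email` parser as a possible alternative. The sample text is invented, and real messages (for example multipart ones) would need more care:

```python
import email

# A made-up raw email: headers, a blank line, then the body
raw_email = (
    "From: someone@example.com\n"
    "Subject: Hello\n"
    "\n"
    "This is the body.\n"
    "Second line.\n"
)

# Approach used in load_email: everything from the first blank line onward
idx = raw_email.find("\n\n")
body_by_split = raw_email[idx:] if idx != -1 else raw_email

# Alternative: let the standard-library parser separate headers from body
message = email.message_from_string(raw_email)
body_by_parser = message.get_payload()  # a string for simple, non-multipart messages
subject = message["Subject"]

print(repr(body_by_split.strip()))
print(repr(body_by_parser.strip()))
print(subject)
```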


## Step 3: Cleaning the Text  



### **BEFORE vs AFTER Example:**
```
Original email:
"VISIT WWW.SPAM.COM NOW!!! Call 123-456-7890 for $$$MONEY$$$"

Step 1: Make lowercase
"visit www.spam.com now!!! call 123-456-7890 for $$$money$$$"

Step 2: Replace websites with 'URL'
"visit URL now!!! call 123-456-7890 for $$$money$$$"

Step 3: Replace numbers with 'NUMBER' 
"visit URL now!!! call NUMBER-NUMBER-NUMBER for $$$money$$$"

Step 4: Remove special characters (!@#$%^&*)
"visit URL now    call NUMBER NUMBER NUMBER for    money   "

Step 5: Fix extra spaces
"visit URL now call NUMBER NUMBER NUMBER for money"
```

 - After cleaning the text, the computer can see the patterns more easily.
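
The walkthrough above can be reproduced with the same regular expressions as `clean_text`. This sketch uses a hypothetical helper, `clean_text_steps`, that records the text after each step so the intermediate results can be inspected:

```python
import re

def clean_text_steps(text):
    """Return the text after each cleaning step, in order."""
    steps = []
    text = text.lower()
    steps.append(text)                                       # Step 1: lowercase
    text = re.sub(r'http\S+|www\S+|https\S+', 'URL', text)
    steps.append(text)                                       # Step 2: websites -> URL
    text = re.sub(r'\d+', 'NUMBER', text)
    steps.append(text)                                       # Step 3: digits -> NUMBER
    text = re.sub(r'\W', ' ', text)
    steps.append(text)                                       # Step 4: special characters -> spaces
    text = re.sub(r'\s+', ' ', text)
    steps.append(text)                                       # Step 5: collapse repeated spaces
    return steps

steps = clean_text_steps("VISIT WWW.SPAM.COM NOW!!! Call 123-456-7890 for $$$MONEY$$$")
for number, result in enumerate(steps, start=1):
    print(f"Step {number}: {result!r}")
```

One detail the example glosses over: the final string keeps a single trailing space, because the trailing `$$$` becomes spaces that are collapsed but not removed. Calling `.strip()` on the result would drop it.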

**Converting words to numbers (CountVectorizer):**
```
📝 Clean emails: ["free money now", "get money fast", "hello friend"]

🔢 Computer sees this table:
           free  money  get  fast  hello  friend  now
Email 1:    1     1     0    0     0      0      1
Email 2:    0     1     1    1     0      0      0  
Email 3:    0     0     0    0     1      1      0
```
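
The table can be rebuilt by hand in a few lines. This is a simplified stand-in for what `CountVectorizer` does, not the library's code: it splits on spaces and sorts the vocabulary alphabetically, so the columns come out in a different order than in the illustration above. The notebook's vectorizer also drops English stop words, so some very common words would not get a column there.

```python
from collections import Counter

emails = ["free money now", "get money fast", "hello friend"]

# Vocabulary: every distinct word, sorted alphabetically
vocabulary = sorted({word for text in emails for word in text.split()})

def to_counts(text):
    """Count how often each vocabulary word appears in one email."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

matrix = [to_counts(text) for text in emails]

print(vocabulary)
for row in matrix:
    print(row)
```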

In [6]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Train model
model = MultinomialNB()
model.fit(X_train_transformed, y_train)

# Process test data and predict
X_test_cleaned = [load_email(f) for f in X_test]
X_test_transformed = pipeline.transform(X_test_cleaned)
y_pred = model.predict(X_test_transformed)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}")
print(classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))

Accuracy: 95.67%
              precision    recall  f1-score   support

         Ham       0.95      1.00      0.97       482
        Spam       0.99      0.79      0.88       119

    accuracy                           0.96       601
   macro avg       0.97      0.89      0.93       601
weighted avg       0.96      0.96      0.95       601



## Naive Bayes! 

- It can read emails and decide if they're spam or not


Training phase:
```python
model.fit(X_train_transformed, y_train)
```
1. I showed it 2,401 emails together with their answers (labels)
2. It learned word patterns, for example:
   - "Words like 'free', 'money', 'click' often appear in SPAM"
   - "Words like 'meeting', 'project', 'thanks' often appear in GOOD emails"


Testing phase:
```python
y_pred = model.predict(X_test_transformed)
```
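
To show what happens inside `fit` and `predict`, here is a tiny from-scratch sketch of multinomial Naive Bayes with add-one smoothing. It follows the same idea as `MultinomialNB` but is not scikit-learn's code. The training sentences and function names are made up for illustration:

```python
import math
from collections import Counter

# Toy training data (made up): 1 = spam, 0 = ham
train = [
    ("free money now", 1),
    ("click free prize", 1),
    ("project meeting today", 0),
    ("thanks for the meeting", 0),
]

def fit_naive_bayes(examples, alpha=1.0):
    """Count words per class; alpha is the smoothing amount."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab, alpha

def predict_naive_bayes(model, text):
    """Pick the class with the highest log-probability score."""
    word_counts, class_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)  # class prior
        for word in text.split():
            if word in vocab:  # words never seen in training are ignored
                numerator = word_counts[label][word] + alpha
                denominator = total_words + alpha * len(vocab)
                score += math.log(numerator / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

nb_model = fit_naive_bayes(train)
pred_spam = predict_naive_bayes(nb_model, "free prize money")
pred_ham = predict_naive_bayes(nb_model, "meeting about the project")
print(pred_spam, pred_ham)
```

Working with logarithms avoids multiplying many tiny probabilities together, which would otherwise underflow on long emails.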

## Results

**Overall Score: 95.67% Accuracy!** 
- AI got 575 out of 601 emails correct

**Detailed Performance:**
- **Ham (Good emails)**: about 100% kept ✅ (recall rounds to 1.00, so almost no important emails were blocked)
- **Spam (Bad emails)**: 79% caught ✅ (about 21% of spam got through)

**In real life:**
- Out of 100 emails: AI correctly sorts about 96 emails
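
The numbers in the report all come from four counts. The notebook does not print them, so the counts below are an assumption: they are a set of whole numbers consistent with the rounded report (575 of 601 correct, 119 spam, 482 ham). This sketch shows how accuracy, precision, and recall follow from them:

```python
# Counts inferred from the rounded report (an assumption, not printed by the notebook)
ham_kept, ham_flagged = 481, 1      # true ham: predicted ham / predicted spam
spam_caught, spam_missed = 94, 25   # true spam: predicted spam / predicted ham

total = ham_kept + ham_flagged + spam_caught + spam_missed
accuracy = (ham_kept + spam_caught) / total

# Recall: of the emails that really are in a class, how many did the model find?
spam_recall = spam_caught / (spam_caught + spam_missed)
ham_recall = ham_kept / (ham_kept + ham_flagged)

# Precision: of the emails the model put in a class, how many really belong there?
spam_precision = spam_caught / (spam_caught + ham_flagged)
ham_precision = ham_kept / (ham_kept + spam_missed)

print(f"Accuracy: {accuracy:.2%}")
print(f"Spam precision {spam_precision:.2f}, recall {spam_recall:.2f}")
print(f"Ham  precision {ham_precision:.2f}, recall {ham_recall:.2f}")
```

Under these counts the ham recall is just below 1.00, which is why "100%" is best read as a rounded figure.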


In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Train Logistic Regression model
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_transformed, y_train)

# Predict and evaluate
y_pred_lr = lr_model.predict(X_test_transformed)
accuracy_lr = accuracy_score(y_test, y_pred_lr)

print("--- Logistic Regression Results ---")
print(f"Accuracy: {accuracy_lr:.2%}")
print(classification_report(y_test, y_pred_lr, target_names=['Ham', 'Spam']))

--- Logistic Regression Results ---
Accuracy: 97.84%
              precision    recall  f1-score   support

         Ham       0.98      1.00      0.99       482
        Spam       0.98      0.91      0.94       119

    accuracy                           0.98       601
   macro avg       0.98      0.95      0.96       601
weighted avg       0.98      0.98      0.98       601



## Comparing Different Models 

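**Logistic Regression vs Naive Bayes:** on the same 601 test emails, Logistic Regression reached 97.84% accuracy and Naive Bayes reached 95.67%.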


**Naive Bayes:**
- This model acts like a probability calculator. Its main goal is to determine the probability that an email is spam, given the specific words inside it.
- It studies the training emails to learn the probability of each word appearing in spam versus ham.
- The "Naive" Assumption: The model's key feature (and weakness) is that it treats every word as independent. This means it assumes the presence of one word has no effect on another. It analyzes "free" and "money" separately, without understanding that their appearance together is extra suspicious. This simplification is why it's called "naive."
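
Written as a formula, the independence assumption means the spam score is the overall spam rate multiplied by one factor per word. This is the standard textbook form, with $w_1, \dots, w_n$ standing for the words in the email:

$$
P(\text{spam} \mid w_1, \dots, w_n) \;\propto\; P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})
$$

The same expression is computed for ham, and the email gets whichever label scores higher. Each $P(w_i \mid \text{spam})$ is estimated from how often the word appears in the spam training emails.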

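**Logistic Regression:**
- Instead of estimating a separate probability for each word, this model learns a weight (or importance score) for every word in its vocabulary.
- To classify a new email, it adds up the weights of all the words it contains. A high positive total score means "spam," while a negative score means "ham."
- Because all the weights are fitted together, it can balance words that tend to appear together instead of counting their evidence twice, which Naive Bayes cannot do.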
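
The scoring step can be sketched in a few lines. The weights and bias below are invented for illustration and are not taken from the trained `lr_model`. The sigmoid function turns the total score into a probability between 0 and 1:

```python
import math

# Hypothetical weights, made up for illustration (not from the trained model)
weights = {"free": 1.8, "money": 1.2, "click": 0.9, "meeting": -1.5, "thanks": -1.1}
bias = -0.5

def spam_probability(text):
    """Add up the weights of the words present, then squash with a sigmoid."""
    score = bias + sum(weights.get(word, 0.0) for word in text.split())
    return 1.0 / (1.0 + math.exp(-score))

p_spam = spam_probability("free money click")
p_ham = spam_probability("thanks for the meeting")
print(f"{p_spam:.2f} {p_ham:.2f}")
```

A probability above 0.5 corresponds to a positive total score (spam), and below 0.5 to a negative one (ham). Words without a weight, such as "for" and "the" here, simply contribute nothing.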