# 📘 P2.2.2.2 – Supervised Learning

## Topic: Spam Email Classifier Example (Classification)
---


## 🎯 Learning Objectives

By the end of this notebook, you will be able to:

- See the steps in a real ML pipeline (vectorize → split → train → predict → evaluate)
- See how text becomes numbers (vectorization) for the model
- Build a simple spam classifier and check how well it works


## 📌 What is classification?

**Classification** means predicting a **category** or **label** for something: not a number, but a choice from a fixed set of options.

- We train a model using **labeled examples**: many inputs that already have the correct category.
- The model learns patterns from those examples and then assigns a category to **new** inputs it has never seen.

**Some real-world examples:**

| Problem | Input | Categories (labels) |
|--------|--------|---------------------|
| Spam detection | Email text | Spam / Not spam |
| Sentiment | Review or tweet | Positive / Negative / Neutral |
| Image recognition | Photo | Cat / Dog / Bird / … |
| Disease screening | Test results, symptoms | Positive / Negative |
| Fraud detection | Transaction | Fraud / Not fraud |
| Support ticket | Message | Urgent / Normal / Low |

In this notebook we focus on **one** of these: **spam vs not spam** for emails. The same idea (labeled data → train → predict a category) applies to all classification problems.

## 📝 Problem Statement

We want to build a program that classifies emails as **Spam** or **Ham** using machine learning. (*Ham* means not spam, that is, a normal email.) The goal is to automate the detection of unwanted emails.


**Why is this important?**

- Spam emails waste time and can be dangerous. Automating detection helps keep inboxes clean and safe.


## 🤖 Choosing the Model & Why

We use the **Naive Bayes** model because:
- It works well for text classification
- It is fast and simple
- It handles word frequencies efficiently

**Why not other models?**
- Decision Trees, SVM, etc. can also be used, but Naive Bayes is a classic baseline for spam detection: it trains quickly and tends to perform well on word-count features
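
To make "handles word frequencies" concrete, here is a minimal sketch of how a multinomial Naive Bayes classifier could score an email by hand. It is an illustration under stated assumptions, not scikit-learn's implementation: the helper names `train` and `log_scores` are hypothetical, text is split on whitespace, all six toy emails from this notebook are used for training, and add-one (Laplace) smoothing with `alpha=1.0` is assumed.

```python
import math
from collections import Counter

# Toy training data: the six emails used later in this notebook (lowercased).
emails = [
    ("win money now", "Spam"),
    ("limited offer win big", "Spam"),
    ("win a free prize now", "Spam"),
    ("meeting tomorrow at office", "Ham"),
    ("project discussion scheduled", "Ham"),
    ("let us plan the meeting", "Ham"),
]


def train(examples):
    """Count documents per class and word occurrences per class (hypothetical helper)."""
    class_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for text, label in examples:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.split():
            counts[word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def log_scores(text, class_counts, word_counts, vocabulary, alpha=1.0):
    """Return log P(class) + sum of log P(word | class) for each class."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        # Smoothed denominator: words seen in this class + alpha per vocabulary word.
        denominator = sum(word_counts[label].values()) + alpha * len(vocabulary)
        for word in text.split():
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            score += math.log((word_counts[label][word] + alpha) / denominator)
        scores[label] = score
    return scores


model = train(emails)
scores = log_scores("win free money now", *model)
prediction = max(scores, key=scores.get)  # class with the highest log score
print(scores)
print("Prediction:", prediction)
```

The class with the higher log score wins. Smoothing keeps a word that never appeared in one class from driving that class's probability to zero.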

## 🛠️ Example: Spam Email Classifier Pipeline

This example shows the steps:
1. Convert emails to numbers (vectorization)
2. Split data into train/test
3. Train Naive Bayes model
4. Predict on test set and evaluate (accuracy, confusion matrix, classification report)
5. Predict on a **new** email

*We use a **tiny dataset** (6 emails) so the flow is easy to follow; in practice you would use thousands of labeled emails.*
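
Step 1 (vectorization) can be sketched in plain Python as a bag-of-words count. This is only an illustration of the idea: `build_vocabulary` and `to_count_vector` are hypothetical helpers, and scikit-learn's `CountVectorizer` does more (for example, its default tokenizer also drops single-character tokens and returns a sparse matrix).

```python
from collections import Counter


def build_vocabulary(texts):
    """Collect every distinct lowercase word and fix an order for the columns."""
    words = {word for text in texts for word in text.lower().split()}
    return sorted(words)


def to_count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs; unknown words are dropped."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


emails = ["Win money now", "Meeting tomorrow at office"]
vocabulary = build_vocabulary(emails)
vectors = [to_count_vector(email, vocabulary) for email in emails]

print(vocabulary)
for email, vector in zip(emails, vectors):
    print(vector, "<-", email)
```

Each email becomes a row of numbers with one column per vocabulary word, which is the form the model can train on.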


**Before running the program:** install scikit-learn, the library we use for vectorization, models, and evaluation:

```bash
pip install scikit-learn
```

*We need **scikit-learn** because it provides all of the following in one place, so we can focus on the pipeline instead of writing everything from scratch:*
- vectorization (text → numbers)
- the train–test split
- the Naive Bayes model
- evaluation metrics

```python
"""
Spam Email Classifier using Scikit-learn
----------------------------------------
This program classifies emails as Spam or Ham
using text vectorization and Naive Bayes.

"""

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


def main():
    print("SPAM EMAIL CLASSIFIER")
    print("----------------------")

    # Dataset
    emails = [
        "Win money now",
        "Limited offer win big",
        "Win a free prize now",
        "Meeting tomorrow at office",
        "Project discussion scheduled",
        "Let us plan the meeting"
    ]

    labels = [
        "Spam",
        "Spam",
        "Spam",
        "Ham",
        "Ham",
        "Ham"
    ]

    # Convert text to numerical features
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(emails)

    # Train-Test Split
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.33, random_state=42
    )

    # Train model
    model = MultinomialNB()
    model.fit(X_train, y_train)

    # Predictions
    predictions = model.predict(X_test)

    # Evaluation
    print("\nAccuracy:", accuracy_score(y_test, predictions))
    print("\nConfusion Matrix:\n", confusion_matrix(y_test, predictions))
    print("\nClassification Report:\n", classification_report(y_test, predictions))

    # Predict new email
    new_email = ["Win free money now"]
    new_email_vector = vectorizer.transform(new_email)
    new_prediction = model.predict(new_email_vector)

    print("\nNew Email:", new_email[0])
    print("Prediction:", new_prediction[0])


if __name__ == "__main__":
    main()
```
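
The program above fits the vectorizer on all six emails before splitting, so the vocabulary is learned partly from the test emails. A common alternative, sketched here as one option rather than as the notebook's method, is to split the raw text first and chain both steps with scikit-learn's `make_pipeline`, so the vectorizer only sees training emails. The `stratify=labels` argument is an added assumption that keeps one Spam and one Ham email in the two-email test set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "Win money now",
    "Limited offer win big",
    "Win a free prize now",
    "Meeting tomorrow at office",
    "Project discussion scheduled",
    "Let us plan the meeting",
]
labels = ["Spam", "Spam", "Spam", "Ham", "Ham", "Ham"]

# Split the raw text first, so nothing about the test emails is learned.
# stratify=labels is an assumption added here to balance the tiny test set.
text_train, text_test, y_train, y_test = train_test_split(
    emails, labels, test_size=0.33, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer and the classifier on training text only.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(text_train, y_train)

print("Test accuracy:", model.score(text_test, y_test))

# The pipeline accepts raw text directly; no separate transform step is needed.
prediction = model.predict(["Win free money now"])[0]
print("Prediction:", prediction)
```

With only six emails the test accuracy is not meaningful either way; the point of the sketch is the order of operations.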

## 📊 Understanding Accuracy & Evaluation Metrics

- **Accuracy:** Fraction of predictions that are correct. `accuracy_score` returns a value between 0 and 1 (for example, 0.5 means half of the test emails were classified correctly). Higher is better.

- **Confusion Matrix:** Shows how many emails of each actual class were assigned to each predicted class. *Rows = actual label, columns = predicted label.*

- **Classification Report:** Precision, recall, and F1-score per class, for a deeper look than accuracy alone.

**Why we use these:**
- To measure how well the model works
- To spot where it goes wrong
- To decide whether it is reliable enough for real use
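
The metrics above can be computed by hand, which shows how they relate to each other. This is a minimal sketch: the labels in `y_true` and `y_pred` are made-up illustration data (not output from the classifier in this notebook), `confusion_counts` is a hypothetical helper, and Spam is treated as the positive class.

```python
def confusion_counts(y_true, y_pred, positive="Spam"):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1
        elif actual != positive and predicted == positive:
            fp += 1
        elif actual == positive and predicted != positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical labels: one spam email is missed, no ham email is flagged.
y_true = ["Spam", "Spam", "Spam", "Ham", "Ham", "Ham", "Ham", "Spam"]
y_pred = ["Spam", "Ham", "Spam", "Ham", "Ham", "Ham", "Ham", "Spam"]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / len(y_true)                  # correct / total
precision = tp / (tp + fp) if (tp + fp) else 0.0    # of predicted spam, how much was spam
recall = tp / (tp + fn) if (tp + fn) else 0.0       # of actual spam, how much was caught
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)             # harmonic mean of the two

# Rows = actual label, columns = predicted label, in the order [Ham, Spam].
matrix = [[tn, fp], [fn, tp]]

print("Confusion matrix:", matrix)
print("Accuracy:", accuracy)
print("Precision:", precision, "Recall:", recall, "F1:", round(f1, 3))
```

Here precision is perfect because nothing was wrongly flagged as spam, while recall is lower because one spam email slipped through. Accuracy alone would hide that difference.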

---
## 📝 Key Takeaways

- **Spam classifier** = supervised **classification**: we have emails with labels (spam/ham), we vectorize text → split → train → predict → evaluate.

- **Vectorization** (text → numbers) and **train–test split** from Core Concepts appear here in a real pipeline.

- We evaluate with **accuracy**, **confusion matrix** (rows = actual, columns = predicted), and **classification report**.
