# Introduction to Classification Models

**What is Classification?**
- Predicting categories (like spam/not spam) instead of numbers
- A fundamental machine learning task

**Real-World Examples:**
- 📧 Email spam detection
- 🏥 Medical diagnosis (healthy/sick)
- 📸 Image recognition (cat/dog)
- 💳 Fraud detection (fraudulent/legitimate)

**Today's Goal:** Learn 5 simple but powerful classification models!

## Preparing Our Data

### Why Split Data?
- 📚 Training set (70-80%): Teach the model
- 🧪 Test set (20-30%): Evaluate real performance
- Prevents "cheating": a model scored on data it has already seen can just memorize the answers
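
To make the "memorizing answers" point concrete, here is a minimal sketch. It assumes the same kind of synthetic data used throughout this notebook and uses a 1-nearest-neighbor model as a stand-in for a pure memorizer (the name `memorizer` is just for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with the same settings as the rest of this notebook
X, y = make_classification(n_samples=1000, n_features=10,
                           n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# A 1-nearest-neighbor model effectively memorizes the training set:
# each training point's closest neighbor is itself
memorizer = KNeighborsClassifier(n_neighbors=1)
memorizer.fit(X_train, y_train)

train_acc = memorizer.score(X_train, y_train)  # scored on memorized answers
test_acc = memorizer.score(X_test, y_test)     # scored on unseen examples
print(f"Train accuracy: {train_acc:.2f}")
print(f"Test accuracy:  {test_acc:.2f}")
```

The training score is perfect by construction, so only the test score says anything about how the model handles new data.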

### Data Scaling
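- Distance- and gradient-based models (k-NN, logistic regression) work best with features on the same scale
- Tree-based models (decision trees, random forests) don't need it
- Like comparing apples to apples 🍎🍏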

In [1]:
# Set up our practice data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Create synthetic dataset (1000 examples, 10 features)
X, y = make_classification(n_samples=1000, n_features=10, 
                           n_classes=2, random_state=42)

# Split data (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features for models that need it
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
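
As a quick sanity check on the scaling step above, the sketch below rebuilds the same data and confirms what `StandardScaler` does: after fitting on the training set, each training feature has mean 0 and standard deviation 1. The test set is transformed with the training statistics, so its means are only close to 0.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10,
                           n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)

print("Train means:", np.round(train_means, 2))
print("Train stds: ", np.round(train_stds, 2))
print("Test means: ", np.round(X_test_scaled.mean(axis=0), 2))
```

Fitting the scaler on the training set alone keeps information about the test set from leaking into the model.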

## Model 1: Logistic Regression

📊 **How it works:** 
- Finds a straight-line (linear) boundary between categories
- Passes a weighted sum of the features through the sigmoid function to get a probability, then picks the more likely class

👍 **Best for:**
- When the relationship is somewhat linear
- Quick baseline model

👎 **Limitations:**
- Struggles with complex, non-linear patterns

🔍 **Example Use:** Email spam detection

In [2]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)
accuracy = log_reg.score(X_test_scaled, y_test)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.85
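
The sigmoid step described above can be sketched by hand. The weights, bias, and feature values here are hypothetical numbers chosen for illustration, not values learned by `log_reg`.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for a two-feature model (illustrative values)
weights = np.array([1.5, -2.0])
bias = 0.5

# One example with two feature values
x = np.array([2.0, 1.0])

z = np.dot(weights, x) + bias              # weighted sum: 1.5*2.0 - 2.0*1.0 + 0.5 = 1.5
probability = sigmoid(z)                   # probability of class 1
predicted_class = int(probability >= 0.5)  # threshold at 0.5

print(f"z = {z:.2f}, P(class 1) = {probability:.3f}, predicted class = {predicted_class}")
```

A weighted sum of 0 maps to a probability of exactly 0.5, which is why the decision boundary is the set of points where the weighted sum is zero.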


## Model 2: k-Nearest Neighbors (k-NN)

📊 **How it works:** 
- "Birds of a feather flock together" 🐦🐦
- Assigns the majority class among the k closest training points

👍 **Best for:**
- Simple, intuitive approach
- Works well with clear groupings

👎 **Limitations:**
- Slow predictions on large datasets (every new point is compared with the stored training points)
- Needs a carefully chosen k

🔍 **Example Use:** Handwriting recognition

In [3]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)  # Try k=5 neighbors
knn.fit(X_train_scaled, y_train)
accuracy = knn.score(X_test_scaled, y_test)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.82
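
To show the voting idea without the library, here is a minimal from-scratch sketch. The helper `knn_predict` and the toy data are made up for illustration, and Euclidean distance is an assumption (it is also the default for `KNeighborsClassifier`).

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote of its k nearest neighbors."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance to each point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())           # count labels among the neighbors
    return votes.most_common(1)[0][0]

# Tiny hypothetical dataset: two well-separated clusters in 2D
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                  [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_low = knn_predict(X_toy, y_toy, np.array([1.8, 1.6]), k=3)
pred_high = knn_predict(X_toy, y_toy, np.array([6.2, 6.4]), k=3)
print(pred_low, pred_high)
```

Note that there is no training step: all the work happens at prediction time, which is where the speed cost on large datasets comes from.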


## Model 3: Decision Trees

📊 **How it works:** 
- Series of yes/no questions like a flowchart ❓→✓
- Builds the tree by picking, at each step, the feature split that best separates the classes

👍 **Best for:**
- Easy to understand and visualize
- Handles non-linear relationships

👎 **Limitations:**
- Can overfit easily
- Unstable: small changes in the data can produce a very different tree

🔍 **Example Use:** Assigning customers to predefined segments

In [4]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=3)  # Limit tree depth
tree.fit(X_train, y_train)
accuracy = tree.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.86
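
To see the "flowchart" of yes/no questions, one option is scikit-learn's `export_text`. The sketch below uses a shallower tree (`max_depth=2`) and generic feature names such as `feature_0`, both chosen here only to keep the printout short.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1000, n_features=10,
                           n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# A very shallow tree so the printed rules stay readable
small_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
small_tree.fit(X_train, y_train)

# Generic names, since the synthetic features have no real meaning
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
rules = export_text(small_tree, feature_names=feature_names)
print(rules)
```

Each indented line is one question ("is this feature at or below a threshold?"), and each `class:` line is the answer the tree gives at that leaf.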


## Model 4: Random Forests

📊 **How it works:** 
- Many decision trees voting together 🌳🌳🌳→✉
- Usually more accurate than a single tree, because individual trees' mistakes tend to cancel out

👍 **Best for:**
- Complex problems
- Less prone to overfitting than a single tree

👎 **Limitations:**
- Less interpretable
- Slower than single trees

🔍 **Example Use:** Fraud detection

In [5]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100)  # 100 trees; pass random_state for repeatable results
forest.fit(X_train, y_train)
accuracy = forest.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.89
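
The voting idea can be sketched by asking each tree in a fitted forest for its own prediction and taking a simple majority. This is a simplification: scikit-learn's `forest.predict` averages the trees' predicted probabilities rather than counting hard votes, so the two can occasionally disagree. The forest size of 25 is an arbitrary odd number chosen here to avoid ties.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10,
                           n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

forest = RandomForestClassifier(n_estimators=25, random_state=42)
forest.fit(X_train, y_train)

# Each tree's prediction for every test example: shape (n_trees, n_test)
tree_votes = np.array([t.predict(X_test) for t in forest.estimators_])

# Hard majority vote: with labels 0/1, class 1 wins when more than half the trees pick it
majority_vote = (tree_votes.mean(axis=0) > 0.5).astype(int)

tree_accuracies = [(votes == y_test).mean() for votes in tree_votes]
vote_accuracy = (majority_vote == y_test).mean()

print(f"Single trees:  {min(tree_accuracies):.2f} to {max(tree_accuracies):.2f}")
print(f"Majority vote: {vote_accuracy:.2f}")
```

Comparing the range of single-tree accuracies with the voted accuracy gives a feel for how much the ensemble helps on this data.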


## Model 5: Naive Bayes

📊 **How it works:** 
- Uses Bayes' Theorem to compute the probability of each class 📈
- "Naive" because it assumes the features are independent of each other within each class

👍 **Best for:**
- Text classification (usually the multinomial variant; the Gaussian variant used below suits continuous features)
- Very fast training

👎 **Limitations:**
- The independence assumption is often violated in real data

🔍 **Example Use:** Sentiment analysis (positive/negative reviews)

In [6]:
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)
accuracy = nb.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.83
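
For the text use case mentioned above, a minimal sketch might pair `CountVectorizer` (word counts) with `MultinomialNB`. The four reviews and their labels are invented for illustration, and a real sentiment model would need far more data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy reviews: 1 = positive, 0 = negative
reviews = [
    "great movie loved it",
    "wonderful acting great plot",
    "terrible movie hated it",
    "awful plot boring acting",
]
labels = [1, 1, 0, 0]

# Turn each review into a vector of word counts
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(reviews)

text_nb = MultinomialNB()
text_nb.fit(X_counts, labels)

# Classify new text using the vocabulary learned from the training reviews
new_reviews = ["great wonderful", "terrible boring"]
predictions = text_nb.predict(vectorizer.transform(new_reviews))
print(predictions)
```

The model treats each word as independent evidence for a class, which is exactly the "naive" assumption, and it still separates these toy reviews.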


## Model Comparison

| Model            | Pros                      | Cons                      | Best For                  |
|------------------|---------------------------|---------------------------|---------------------------|
| Logistic Reg.    | Simple, fast              | Linear boundary only      | Baseline models           |
| k-NN             | Intuitive, no training    | Slow prediction           | Small, clear groupings    |
| Decision Tree    | Easy to interpret         | Overfits easily           | Business rules            |
| Random Forest    | Powerful, robust          | Complex, slower           | Complex problems          |
| Naive Bayes      | Very fast, good for text  | Strong assumptions        | Text classification       |
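
To put numbers next to the table, one possible sketch trains all five models on the same split in a single loop. The pipelines scale features only for the models that need it. The fixed `random_state` values are an addition here for repeatability, so the scores may differ slightly from the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10,
                           n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Scaling is bundled into a pipeline for the two models that benefit from it
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Naive Bayes": GaussianNB(),
}

# Fit each model on the training set and score it on the held-out test set
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = model.score(X_test, y_test)

# Print from best to worst test accuracy
for name, acc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<20} {acc:.2f}")
```

A single train/test split gives a rough ranking only, and small differences between models on 300 test examples should not be over-read.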
