# 🧠 Supervised Learning: Classification
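Training data: \( \{(x^1, y^1), (x^2, y^2), \dots, (x^n, y^n)\} \)

\( x^i \in \mathbb{R}^d, \quad y^i \in \{+1, -1\} \)

Algorithm outputs a model \( f : \mathbb{R}^d \rightarrow \{+1, -1\} \)

\( \text{Loss} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(f(x^i) \neq y^i) \)

\( f(x) = \text{sign}(\mathbf{w}^\top \mathbf{x} + b) \)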

## 📌 Goal

> Predict whether a house has **more than 3 rooms** from its **area** and **price**.

This is a **binary classification** problem where:
- Output (`y`) is either `+1` (Yes) or `-1` (No)

---

## 📊 Training Data

We use pairs of feature vectors and labels:

$$
\left\{ (x^1, y^1), (x^2, y^2), \dots, (x^n, y^n) \right\}
$$

- \( x^i \in \mathbb{R}^d \): each feature vector (e.g., area, price)
- \( y^i \in \{+1, -1\} \): binary label indicating the class

---

## 🧮 Model Function

The model is a **linear classifier**:

$$
f : \mathbb{R}^d \rightarrow \{+1, -1\}
$$

It uses the **sign function** to make predictions:

$$
f(x) = \text{sign}(\mathbf{w}^\top \mathbf{x} + b)
$$

Where:
- \( \mathbf{w} \): weight vector  
- \( b \): bias term  
- \( \text{sign}(z) = \begin{cases}
+1 & \text{if } z \geq 0 \\
-1 & \text{otherwise}
\end{cases} \)
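
As a minimal sketch, the prediction rule can be written in a few lines of plain Python. The names `sign` and `predict` and the example weights are illustrative assumptions, not values from the lecture. How a score of exactly 0 is labelled is a convention; this sketch returns `+1` for it.

```python
def sign(z):
    """Return +1 for non-negative scores and -1 otherwise (tie-break is an assumption)."""
    return 1 if z >= 0 else -1


def predict(w, b, x):
    """Linear classifier: sign(w . x + b) for a single feature vector x."""
    score = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return sign(score)


# Hypothetical weights for a 2-feature input (area, price); not from the lecture.
w = [0.5, -1.0]
b = 0.25
print(predict(w, b, [2.0, 0.5]))  # score = 1.0 - 0.5 + 0.25 = 0.75
print(predict(w, b, [1.0, 2.0]))  # score = 0.5 - 2.0 + 0.25 = -1.25
```

The first call lands on the positive side of the hyperplane and the second on the negative side, so the two predictions differ.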

---

## ⚖️ Loss Function (0-1 Loss)

To evaluate model performance:

$$
\text{Loss} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(f(x^i) \neq y^i)
$$

- It measures the fraction of predictions that are incorrect.  
- \( \mathbf{1}(\cdot) \) is the **indicator function**, which returns 1 if the condition is true, otherwise 0.
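
The formula translates directly into code. The sketch below assumes a model is any Python callable that maps a feature vector to `+1` or `-1`; the threshold model and the four toy points are made up for illustration.

```python
def zero_one_loss(model, xs, ys):
    """Fraction of points where model(x) differs from the true label y."""
    mistakes = sum(1 for x, y in zip(xs, ys) if model(x) != y)
    return mistakes / len(xs)


# Hypothetical one-feature threshold model and toy data.
model = lambda x: 1 if x[0] >= 0 else -1
xs = [[-2.0], [-1.0], [1.0], [3.0]]
ys = [-1, 1, 1, 1]

# Only the second point (feature -1.0, label +1) is misclassified.
print(zero_one_loss(model, xs, ys))
```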

---

## 📝 Intuition

- We're drawing a **decision boundary (line/hyperplane)** to separate two classes.
- Works well when data is **linearly separable**.
- Outputs **only class labels** (`+1` or `-1`), not probabilities.


## Evaluating Learned Models: Test Data
- The learning algorithm uses the training data \( (x^1, y^1), \dots, (x^n, y^n) \) to produce a model \( f \).
- The learned model must not be evaluated on the training data itself, because a low training loss does not show that the model works on new inputs.
- Instead, evaluate on **test data**: examples that are not part of the training data.
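
A small sketch of why training data gives a misleading evaluation: a hypothetical "memorizer" that simply stores its training examples has zero training loss but still errs on held-out points. The helper names, the default guess of `-1`, and the particular train/test split are assumptions made for this illustration.

```python
def make_memorizer(train_pairs, default=-1):
    """A model that looks up stored training points and guesses `default` elsewhere."""
    table = {tuple(x): y for x, y in train_pairs}
    return lambda x: table.get(tuple(x), default)


def error_rate(model, pairs):
    """0-1 loss of `model` on a list of (x, y) pairs."""
    return sum(1 for x, y in pairs if model(x) != y) / len(pairs)


# Hypothetical split of six labelled 2D points into train and test.
train = [((0, 0), 1), ((1, 0), 1), ((4, 4), -1), ((4, 3), -1)]
test = [((0, 1), 1), ((3, 4), -1)]  # neither point appears in train

model = make_memorizer(train)
print("training loss:", error_rate(model, train))
print("test loss:", error_rate(model, test))
```

The training loss is 0 by construction, so only the test loss says anything about how the model behaves on new inputs.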

## Model Selection: Validation Data
- A learning algorithm only finds the "best" model within the collection of models supplied by the practitioner.
- How do we choose the right collection of models?
- This is called **model selection**. It uses another subset of the data, called **validation data**, that is distinct from both the training and test data.
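
One way to picture model selection, as a sketch under stated assumptions: fit one model per candidate collection on the training data, then keep the collection whose fitted model has the lowest loss on the validation data. The two collections (constant classifiers and thresholds on the first feature), the data, and the function names are all hypothetical.

```python
def error_rate(model, pairs):
    """0-1 loss of `model` on a list of (x, y) pairs."""
    return sum(1 for x, y in pairs if model(x) != y) / len(pairs)


def fit(collection, train):
    """'Learning': return the member of the collection with the lowest training loss."""
    return min(collection, key=lambda m: error_rate(m, train))


# Two hypothetical model collections.
constants = [lambda x: 1, lambda x: -1]
thresholds = [lambda x, t=t: 1 if t - x[0] >= 0 else -1 for t in (1, 2, 3)]

train = [((0, 0), 1), ((1, 0), 1), ((4, 4), -1), ((4, 3), -1)]
validation = [((0, 1), 1), ((3, 4), -1)]  # distinct from the training points

# Fit on training data only, then compare the fitted models on validation data.
fitted = {"constants": fit(constants, train), "thresholds": fit(thresholds, train)}
val_losses = {name: error_rate(m, validation) for name, m in fitted.items()}
best = min(val_losses, key=val_losses.get)
print(val_losses, "-> choose", best)
```

Test data plays no part in this choice; it stays untouched until the collection and the model have been fixed.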

---


# 📘 Machine Learning Foundations – Lecture: Classification

Welcome to another lecture on **Machine Learning Foundations**.

In the last lecture, we introduced the **supervised learning paradigm** and the **regression learning problem**.

---

## 🎯 Today's Topic: Classification

We'll now explore the **classification problem** and look at a few examples.

### 🔍 Example:
Suppose you're given the `area` and `price` of a house, and you want to **predict whether the number of rooms** is:
- `> 3` (more than 3)
- `≤ 3` (3 or fewer)

This is a **classification problem**, because the prediction output is **categorical** (either of two labels).

---

## 📊 Training Data Format

Just like regression, we use:
- Inputs: \( \mathbf{x}_i \in \mathbb{R}^d \)
- Labels: \( y_i \in \{+1, -1\} \)

> ⚠️ Unlike regression (where \( y_i \in \mathbb{R} \)), here the outputs are just two discrete values.

The **model** is a function from inputs to labels:
\[
f: \mathbb{R}^d \rightarrow \{+1, -1\}
\]

---

## ✅ Evaluating a Classification Model

The goal is:
\[
f(\mathbf{x}_i) = y_i \quad \text{for all } i
\]

In practice, a model may not classify every point correctly, so we measure how often it is wrong.

### 📉 Loss Function: 0-1 Loss
\[
\text{Loss}(f) = \frac{1}{n} \sum_{i=1}^n \mathbb{I}[f(\mathbf{x}_i) \ne y_i]
\]

Where:
- \( \mathbb{I}[\cdot] \) is an indicator function (1 if true, 0 otherwise)
- The loss counts the **fraction of misclassified points**

---

## 📏 Linear Classifiers

In regression, we had:
\[
f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b
\]

But for classification:
\[
f(\mathbf{x}) = \text{sign}(\mathbf{w}^\top \mathbf{x} + b)
\]

This is called a **linear separator**. It maps inputs to either \( +1 \) or \( -1 \).

---

## 📌 Simple 2D Example

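Let \( d = 2 \), so each input is a point \( \mathbf{x} = (x_1, x_2) \) in the plane. We have 6 data points:

### 📌 Data Points:
| Point | Coordinates \( (x_1, x_2) \) | Label  |
|-------|---------------|--------|
| \( \mathbf{x}_1 \) | (0, 0)        | +1     |
| \( \mathbf{x}_2 \) | (1, 0)        | +1     |
| \( \mathbf{x}_3 \) | (0, 1)        | +1     |
| \( \mathbf{x}_4 \) | (4, 4)        | -1     |
| \( \mathbf{x}_5 \) | (3, 4)        | -1     |
| \( \mathbf{x}_6 \) | (4, 3)        | -1     |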

### 🖼️ Visualization (Rough Description):
- Red points: positive class
- Blue points: negative class

All red points are in the bottom-left, and all blue points are in the top-right of the 2D plane.

---

## 🧪 Two Models to Compare

Let's define two classifiers:

### Model f:
\[
f(\mathbf{x}) = \text{sign}(2 - x_1)
\]

### Model g:
\[
g(\mathbf{x}) = \text{sign}(x_1 - 2x_2)
\]

We will compute the loss for each.

---

## 🧮 Evaluate Model f

Apply \( f(\mathbf{x}) = \text{sign}(2 - x_1) \) to each point:

| x     | f(x)         | y    | Correct? |
|-------|--------------|------|----------|
| (0,0) | +1           | +1   | ✅       |
| (1,0) | +1           | +1   | ✅       |
| (0,1) | +1           | +1   | ✅       |
| (4,4) | -1           | -1   | ✅       |
| (3,4) | -1           | -1   | ✅       |
| (4,3) | -1           | -1   | ✅       |

🟢 All predictions correct → **Loss = 0**
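
The table can be reproduced with a short script. This sketch hard-codes the six points from the table and takes `sign(0) = +1`, a convention that does not affect model f because none of its scores is zero.

```python
def sign(z):
    # Tie-break assumption: a score of exactly 0 is labelled +1.
    return 1 if z >= 0 else -1


points = [(0, 0), (1, 0), (0, 1), (4, 4), (3, 4), (4, 3)]
labels = [1, 1, 1, -1, -1, -1]


def f(x):
    """Model f: depends only on the first coordinate x_1."""
    return sign(2 - x[0])


predictions = [f(x) for x in points]
loss = sum(p != y for p, y in zip(predictions, labels)) / len(points)

for x, p, y in zip(points, predictions, labels):
    print(x, p, y, "correct" if p == y else "wrong")
print("loss:", loss)
```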

---

## 🧮 Evaluate Model g

Apply \( g(\mathbf{x}) = \text{sign}(x_1 - 2x_2) \):

| x     | g(x)         | y    | Correct? |
|-------|--------------|------|----------|
| (0,0) | +1           | +1   | ✅       |
| (1,0) | +1           | +1   | ✅       |
| (0,1) | -1           | +1   | ❌       |
| (4,4) | -1           | -1   | ✅       |
| (3,4) | -1           | -1   | ✅       |
| (4,3) | -1           | -1   | ✅       |

🔴 One mistake → **Loss = 1/6**
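
As a check on the table, here are the raw scores \( x_1 - 2x_2 \) for the six points before the sign is taken:

\[
\begin{aligned}
(0,0) &: \quad 0 - 2 \cdot 0 = 0 \\
(1,0) &: \quad 1 - 2 \cdot 0 = 1 \\
(0,1) &: \quad 0 - 2 \cdot 1 = -2 \\
(4,4) &: \quad 4 - 2 \cdot 4 = -4 \\
(3,4) &: \quad 3 - 2 \cdot 4 = -5 \\
(4,3) &: \quad 4 - 2 \cdot 3 = -2
\end{aligned}
\]

Only \( (0,1) \) gets a negative score despite its \( +1 \) label, which is the single mistake. The point \( (0,0) \) scores exactly 0, so it lies on the decision boundary; the table counts it as \( +1 \), that is, it takes \( \text{sign}(0) = +1 \).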

---

## 🏁 Conclusion

- **f** has lower loss → learning algorithm will **prefer model f**
- In general, the learning algorithm picks the function with the lowest training loss from the collection of classifiers under consideration (here, linear separators)
- For visualization, the **input space is split into regions**:
  - Region where \( f(\mathbf{x}) = +1 \)
  - Region where \( f(\mathbf{x}) = -1 \)
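
Choosing between f and g by training loss, and then looking at the two regions of the preferred model, can be sketched as follows. The dictionary of candidates and the 5×5 integer grid are illustrative assumptions, and `sign(0)` is taken as `+1`, matching the (0,0) row of the table for g.

```python
def sign(z):
    # Tie-break assumption: a score of exactly 0 is labelled +1.
    return 1 if z >= 0 else -1


points = [(0, 0), (1, 0), (0, 1), (4, 4), (3, 4), (4, 3)]
labels = [1, 1, 1, -1, -1, -1]

models = {
    "f": lambda x: sign(2 - x[0]),
    "g": lambda x: sign(x[0] - 2 * x[1]),
}


def zero_one_loss(model):
    """Fraction of the six training points that `model` misclassifies."""
    return sum(model(x) != y for x, y in zip(points, labels)) / len(points)


losses = {name: zero_one_loss(model) for name, model in models.items()}
best = min(losses, key=losses.get)
print(losses, "-> prefer", best)

# Text picture of the input space for the preferred model:
# '+' marks grid points predicted +1, '-' marks grid points predicted -1.
rows = [
    "".join("+" if models[best]((x1, x2)) == 1 else "-" for x1 in range(5))
    for x2 in range(4, -1, -1)
]
print("\n".join(rows))
```

Every printed row is identical because f depends only on \( x_1 \): its decision boundary is the vertical line \( x_1 = 2 \).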

---

✅ **Key Takeaway**:
Classification involves predicting labels from a finite set. The loss function typically measures the number (or fraction) of misclassifications. Linear separators like \( \text{sign}(\mathbf{w}^\top \mathbf{x} + b) \) are powerful and interpretable models for simple classification problems.


## 📁 Training, Validation, and Test Data

In supervised machine learning, we split our dataset into three parts:

---

### 🔧 1. Training Data

- **Used to train the model**
- The model learns patterns, weights, and relationships from this data
- The training process **minimizes loss** on this data
- **Risk**: A model that fits the training data too closely may memorize it instead of learning patterns that generalize (overfitting)

---

### 🧪 2. Validation Data

- **Used during training** to tune the model
- Helps in choosing the **best model or hyperparameters**
- Not used to update the model's parameters (weights) directly
- Helps detect **overfitting**: if training accuracy is high but validation accuracy is low, the model might not generalize well

---

### 🎯 3. Test Data

- **Used only after training and model selection are complete**
- It measures the **final performance** of the model on **unseen data**
- Helps us understand how the model will perform in the real world
- We **never train or tune** the model using test data

---

### ✅ Summary Table

| Dataset        | Used For                            | Model Sees It?               | Purpose                              |
|----------------|-------------------------------------|------------------------------|--------------------------------------|
| Training       | Learning model parameters (weights) | ✅ Yes                       | Learn patterns                       |
| Validation     | Tuning model & hyperparameters      | ⚠️ Indirectly (tuning only) | Check generalization during training |
| Test           | Final evaluation                    | ❌ No                        | Check true generalization            |

---

⚠️ **Important**:
- Never use test data to make training decisions.
- Validation data can be used multiple times while tuning.
- Test data should be used **only once** after everything is finalized.
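
A minimal sketch of a three-way split using only the standard library. The 60/20/20 proportions, the fixed seed, and the function name `three_way_split` are assumptions for illustration; the notes do not prescribe particular sizes.

```python
import random


def three_way_split(data, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then cut into disjoint train / validation / test lists."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test


data = list(range(10))  # stand-in for ten labelled examples
train, validation, test = three_way_split(data)
print(len(train), len(validation), len(test))
```

Because the three lists are slices of a single shuffled copy, no example can appear in more than one of them.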

