# Datasets and Learning Paradigms

## What is a Dataset?

A dataset is a collection of examples (data points or samples) containing input-output mappings that the model aims to learn.

![Dataset Overview](images/dataset-overview.png)

**Key Concepts:**

- Each example consists of:
  - **Input** `x`: a feature vector (measurements/attributes)
  - **Output** `y`: the desired target (label or value), only available in supervised learning

- Inputs are typically organized into a **Feature Matrix** `X`:
  $$
  X = \begin{bmatrix}
  x_{11} & x_{12} & \dots & x_{1D} \\
  x_{21} & x_{22} & \dots & x_{2D} \\
  \vdots & \vdots & \ddots & \vdots \\
  x_{N1} & x_{N2} & \dots & x_{ND}
  \end{bmatrix}
  \quad \Rightarrow \quad
  X = [x_1, x_2, \dots, x_N]^T \in \mathbb{R}^{N \times D}
  $$

  - `N` = number of samples
  - `D` = number of features (dimensions)
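As a minimal sketch of this layout, the snippet below stores a tiny feature matrix as a plain Python list of rows. The numbers and the variable names (`x_1`, `feature_0`) are illustrative assumptions, not part of any dataset referenced here.

```python
# Toy feature matrix: each row is one sample, each column is one feature.
# The values are made up for illustration.
X = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [6.3, 3.3, 6.0, 2.5],
]

N = len(X)      # number of samples (rows)
D = len(X[0])   # number of features (columns)

# Row i is the feature vector x_i of sample i.
x_1 = X[0]

# Column j collects feature j across all samples.
feature_0 = [row[0] for row in X]

print(N, D)  # 3 4
```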

### What is a Feature?

A **feature** is a measurable property or attribute relevant to the problem.

**Classic Example: Iris Dataset** (Classification of 3 iris flower species)

![Iris Features](images/features-iris.png)

- Features: Sepal Length, Sepal Width, Petal Length, Petal Width
- Dataset size: 150 samples → $X \in \mathbb{R}^{150 \times 4}$

## Training and Test Datasets

![Training and Test Split](images/train-test-split.png)

![Data Distribution](images/data-distribution.png)

- We always split data into **at least two distinct sets**:
  - **Training Dataset**: Used to learn the function `f(x)`
  - **Test Dataset**: Used to evaluate performance after training

**Important Rules:**

- Training and test examples **must be different**
- Why? If the model sees test data during training, we **overestimate** its real-world performance (like giving a student the exam answers as homework!)
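One simple way to obtain two disjoint sets is to shuffle the examples once and cut the list in two. The sketch below assumes a hypothetical helper `split_train_test` and a 20% test fraction; both are illustrative choices, not a prescribed recipe.

```python
import random

def split_train_test(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples and split them into disjoint train and test lists."""
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    shuffled = list(examples)       # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical (x, y) pairs used only to demonstrate the split.
examples = [(i, 2 * i) for i in range(10)]
train, test = split_train_test(examples)

print(len(train), len(test))  # 8 2
```

Because every example ends up in exactly one of the two lists, no test example can leak into training.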

**Assumptions for Good Generalization:**

- Training and test samples are drawn from the **same underlying data distribution** $P_{data}$
- Samples are **independent and identically distributed (i.i.d.)**

## Supervised Learning

![Supervised Learning](images/supervised-learning.png)

Supervised learning = learning from **labeled** examples (both `x` and correct `y` are provided).

- Covers both **classification** and **regression** problems

In matrix form:
- Feature matrix: $X \in \mathbb{R}^{N \times D}$
- Target matrix: $Y \in \mathbb{R}^{N \times K}$ (K = number of outputs/classes)
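For classification, one common way to build the target matrix `Y` is one-hot encoding: row `i` has a 1 in the column of the correct class and 0 elsewhere. The sketch below assumes class labels are integers from `0` to `K - 1`; the helper name `one_hot` is hypothetical.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into an N x K one-hot target matrix."""
    return [
        [1.0 if k == label else 0.0 for k in range(num_classes)]
        for label in labels
    ]

labels = [0, 2, 1, 2]             # made-up labels for N = 4 samples
Y = one_hot(labels, num_classes=3)

for row in Y:
    print(row)
```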

**Common Supervised Algorithms:**

- Linear models
- Neural networks
- Support vector machines
- Naive Bayes
- K-nearest neighbors
- Random forests

## Unsupervised Learning

Unsupervised learning = learning from **unlabeled** data (only inputs `x` are available).

> Even without labels, we can discover useful structure in the data.

### Common Tasks

1. **Clustering**  
   Group similar data points together into clusters.

   ![Clustering Example](images/clustering-example.png)

   - We detect "close" points and assign them to the same cluster.
   - No predefined labels: clusters represent different types of inputs.

2. **Probability Density Estimation**  
   Estimate the underlying data distribution $P_{data}$.

   ![Density Estimation Example](images/density-estimation.png)
   - Learn how likely different points are.
   - Useful for anomaly detection, generating new samples, etc.
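To make the clustering idea concrete, here is a minimal sketch of one simple clustering procedure (k-means, chosen here as an assumption; the text above does not prescribe a specific algorithm). It works on 1-D points and alternates between assigning each point to its closest center and moving each center to the mean of its points. The data, the initial centers, and the function names are illustrative.

```python
def closest_center(point, centers):
    """Index of the center nearest to `point` (absolute distance in 1-D)."""
    return min(range(len(centers)), key=lambda k: abs(point - centers[k]))

def k_means_1d(points, initial_centers, n_iters=10):
    """Minimal 1-D k-means: alternate assignment and center update."""
    centers = list(initial_centers)
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster of its closest center.
        assignments = [closest_center(p, centers) for p in points]
        # Update step: move each center to the mean of its assigned points.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = sum(members) / len(members)
    return centers, [closest_center(p, centers) for p in points]

points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]   # two visibly separate groups
centers, assignments = k_means_1d(points, initial_centers=[0.0, 6.0])
print(centers, assignments)
```

No labels are used anywhere: the cluster indices come only from which points are close to each other.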

**Feature Matrix (same as supervised):**  
$X \in \mathbb{R}^{N \times D}$ (no `Y` matrix)

**Applications:**
- Customer segmentation
- Anomaly detection
- Data compression (dimensionality reduction)
- Preprocessing for supervised tasks

## Reinforcement Learning (Brief Overview)

![Reinforcement Learning](images/reinforcement-learning.png)

- No fixed labeled dataset
- Learning by **interaction** with an environment
- Agent takes **actions** → receives **rewards/penalties**
- Goal: Maximize cumulative reward

**Applications:**
- Game playing (AlphaGo)
- Robotics
- Self-driving cars
- Drug discovery

---



# Basic Components of a Machine Learning Model

A machine learning system has a few fundamental building blocks. Understanding these helps demystify how models work.

## Machine Learning Algorithm

### Core Idea

A machine learning **algorithm** is a procedure that finds a function $\hat{f}$ mapping inputs to outputs:

$$\hat{y} = \hat{f}(x)$$

Where:
- $x \in \mathbb{R}^D$: input feature vector (D dimensions)
- $\hat{y} \in \mathbb{R}^K$: predicted output (K dimensions, e.g., class probabilities or values)

The algorithm does **not** search all possible functions in the universe. Instead, it searches within a restricted set called the **hypothesis space** $\mathcal{H}$.

### Hypothesis Space $\mathcal{H}$

- $\mathcal{H}$ = the set of all functions (models) that the algorithm is capable of selecting.
- Each function $f_i \in \mathcal{H}$ is a **possible solution** (a candidate model).

    ![Hypothesis Space Illustration](images/hypothesis-space.png)

Examples of hypothesis spaces:
- Linear models: $\mathcal{H}$ = all straight lines (or hyperplanes)
- Decision trees: $\mathcal{H}$ = all possible tree structures
- Neural networks: $\mathcal{H}$ = all networks with a given architecture

### Objective (Loss) Function $J$

For every candidate function $f_i$, we evaluate how well it performs on the training data using an **objective function**:

$$J(Y, f_i(X))$$

- Measures the error or "cost" of predictions
- Lower $J$ = better fit to training data

Common examples:
- Mean Squared Error (regression)
- Cross-Entropy Loss (classification)

### Training Process

The algorithm explores the hypothesis space $\mathcal{H}$ to find the function $\hat{f}$ that **minimizes** the objective:

$$\hat{f} = \arg\min_{f \in \mathcal{H}} J(Y, f(X))$$

In practice:
- We iteratively test different functions (or adjust parameters)
- Move toward functions with lower training error
- Stop when we find a function that "explains the training data well"
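The $\arg\min$ can be sketched directly when the hypothesis space is small. Below, as an assumption for illustration, $\mathcal{H}$ contains only the functions $f_w(x) = w \cdot x$ for a handful of slopes $w$, the objective $J$ is the mean squared error, and the data are made up.

```python
def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Made-up training data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Hypothesis space: f_w(x) = w * x for w in {0.0, 0.5, ..., 4.0}.
candidate_ws = [0.5 * i for i in range(9)]

def J(w):
    """Objective: training MSE of the candidate f_w."""
    return mse(ys, [w * x for x in xs])

# arg min over the hypothesis space: the slope with the lowest objective.
w_hat = min(candidate_ws, key=J)
print(w_hat, J(w_hat))
```

Each value of `w` picks out one function in the hypothesis space, so searching over parameters and searching over functions are the same thing here.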

## Summary of Key Components

| Component              | Symbol/Name             | Role                                                                 |
|------------------------|-------------------------|----------------------------------------------------------------------|
| **Input**              | $x$                     | Features of a data point                                             |
| **Output/Prediction**  | $\hat{y}$               | Model's predicted target                                             |
| **Model/Function**     | $\hat{f}$ or $f \in \mathcal{H}$ | Mapping learned by the algorithm                                     |
| **Hypothesis Space**   | $\mathcal{H}$           | All possible models the algorithm can choose from                    |
| **Objective Function** | $J$                     | Measures how good a candidate model is on training data              |
| **Training Data**      | $(X, Y)$                | Used to evaluate $J$ and guide the search in $\mathcal{H}$            |

These components are present in essentially **every** machine learning algorithm, whether it's linear regression, a neural network, or clustering (in the unsupervised case there is no $Y$, so the objective depends on $X$ alone).

---



# Objective (Loss) Function

The **objective function** (also called **loss function**, **cost function**, or **criterion**) quantifies how well a candidate model fits the training data.

## Key Ideas

- Training aims to find a function $\hat{f}$ that "fits well" the training data.
- To measure "how good" the predictions are, we define an **objective function** $J$:

$$J : \mathbb{R}^{N \times K} \times \mathbb{R}^{N \times K} \to \mathbb{R}, \qquad (Y, f(X)) \mapsto J(Y, f(X))$$

- By convention, we usually **minimize** $J$ (lower = better).
- If we minimize it, it is often called **loss** or **error**.

### Empirical Risk Minimization

Most ML algorithms compute the objective as an average loss over training samples:

$$J(Y, f(X)) = \frac{1}{N} \sum_{i=1}^{N} L(f(x_i), y_i)$$

- $L(f(x_i), y_i)$ = loss for a single example (also called **individual loss** or **criterion**)
- The overall training goal is **empirical risk minimization**: minimize the average loss on training data.
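The averaging formula translates almost line by line into code. In this sketch the per-example loss $L$ is passed in as a function, so the same `empirical_risk` helper works with different losses; the helper names, the candidate model `f`, and the data are all assumptions made for illustration.

```python
def empirical_risk(loss, f, xs, ys):
    """Average per-example loss L(f(x_i), y_i) over the training set."""
    return sum(loss(f(x), y) for x, y in zip(xs, ys)) / len(xs)

def squared_loss(prediction, target):
    return (prediction - target) ** 2

def absolute_loss(prediction, target):
    return abs(prediction - target)

def f(x):
    """Hypothetical candidate model."""
    return 2.0 * x

xs = [0.0, 1.0, 2.0]
ys = [0.0, 2.0, 5.0]   # the model is exact on the first two, off by 1 on the last

risk_squared = empirical_risk(squared_loss, f, xs, ys)    # (0 + 0 + 1) / 3
risk_absolute = empirical_risk(absolute_loss, f, xs, ys)  # (0 + 0 + 1) / 3
print(risk_squared, risk_absolute)
```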

## Common Loss Functions

### For Regression: Mean Squared Error (MSE)



A popular choice for regression:

$$MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - f(x_i))^2$$

- Measures squared distance between predictions and targets
- Intuitively penalizes larger errors more
- Lower MSE = better fit
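A small worked example shows why squaring penalizes larger errors more. Both prediction sets below (made-up numbers) have the same total absolute error of 4, but concentrating that error in a single point gives a much higher MSE.

```python
def mse(targets, predictions):
    """Mean squared error: average of squared differences."""
    n = len(targets)
    return sum((y - p) ** 2 for y, p in zip(targets, predictions)) / n

targets = [0.0, 0.0, 0.0, 0.0]
many_small_errors = [1.0, 1.0, 1.0, 1.0]   # four errors of size 1
one_large_error = [0.0, 0.0, 0.0, 4.0]     # a single error of size 4

print(mse(targets, many_small_errors))  # 1.0
print(mse(targets, one_large_error))    # 4.0
```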

### For Classification

#### Accuracy (Hard Metric)

The fraction of correctly classified examples:

$$Acc = \frac{N_{correct}}{N_{tot}}$$

- Easy to interpret
- Common evaluation metric (especially on test set)
- But it's a **hard** metric: it only counts right or wrong and doesn't reflect confidence
- Two models can have the same accuracy but very different confidence levels

#### Categorical Cross-Entropy (Soft Metric)


A "softer" and preferred training loss:

$$CCE = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \ln(\hat{p}_{ik})$$

- Uses **one-hot** labels and **softmax probabilities** $\hat{p}$
- Penalizes confident wrong predictions heavily
- Lower cross-entropy = better: it ranges from 0 (perfect) to $+\infty$ (very bad)

**Key Advantage:**  
When two models have the same accuracy, the one that assigns higher probability to the correct classes gets the lower (better) cross-entropy.
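The sketch below compares two hypothetical models on the same two examples. Both pick the correct class every time, so their accuracy is identical, but the more confident one has a clearly lower cross-entropy. The probabilities are made up, and the helper names are assumptions.

```python
import math

def categorical_cross_entropy(Y, P):
    """Y: one-hot labels, P: predicted probabilities, both N x K lists."""
    total = 0.0
    for y_row, p_row in zip(Y, P):
        for y, p in zip(y_row, p_row):
            if y > 0:  # terms with y = 0 contribute nothing to the sum
                total += y * math.log(p)
    return -total / len(Y)

def accuracy(Y, P):
    """Fraction of rows where the most probable class is the true class."""
    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)
    hits = sum(argmax(y) == argmax(p) for y, p in zip(Y, P))
    return hits / len(Y)

Y = [[1, 0, 0], [0, 1, 0]]
P_confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]
P_hesitant = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]]

print(accuracy(Y, P_confident), accuracy(Y, P_hesitant))
print(categorical_cross_entropy(Y, P_confident))
print(categorical_cross_entropy(Y, P_hesitant))
```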

## Summary of Common Objectives

| Task           | Common Objective/Loss                  | Notes                                      |
|----------------|----------------------------------------|--------------------------------------------|
| **Regression** | Mean Squared Error (MSE)               | Sensitive to outliers, easy to optimize    |
| **Classification** | Categorical Cross-Entropy (CCE)    | Standard for multi-class, uses probabilities |
| **Classification (evaluation)** | Accuracy                      | Simple, interpretable, but "hard" metric   |

The choice of objective function is crucial: it directly guides what the algorithm learns!

---


# Optimization in Machine Learning

Optimization is the process of finding the best model (function) from the hypothesis space that minimizes the objective function on the training data.


## The Optimization Problem

We want to solve:

$$\hat{f} = \arg\min_{f \in \mathcal{H}} J(Y, f(X))$$

- Search within the hypothesis space $\mathcal{H}$
- Find the function with the **lowest** objective value $J$
- This procedure is performed by an **optimizer**

The loss landscape can be complex (high-dimensional, non-convex), so exact closed-form solutions are often unavailable. Instead, we use iterative algorithms.

## Naive Optimization Example: Curve Fitting

Let's see a simple (naive) example to understand the idea.

### Setup

- Training set: scattered points $(x_i, y_i)$
- Goal: Find $f(x): \mathbb{R} \to \mathbb{R}$ that fits the data well

### Hypothesis Space (Toy Example)

**Naive approach:** try every function in the hypothesis space and keep the one that best explains the training data.

We limit ourselves to only 4 candidate functions:

1. $f(x) = x$
2. $f(x) = e^x$
3. $f(x) = \sin(x)$
4. $f(x) = \cos(x)$

### Naive Approach

- Evaluate **Mean Squared Error (MSE)** for each candidate on the training data
- Pick the one with the lowest MSE

Results:
- $f(x) = \sin(x)$ has the lowest training MSE (~0.013) → **Winner!**

### Check Generalization

- Apply the winning function to a **test set** (new points)
- Compute test MSE (~0.010) → very low
- Conclusion: The learned function generalizes well 😊

**Note:** This naive "try all" approach only works for tiny hypothesis spaces. Realistic hypothesis spaces are far too large to enumerate (a model with continuous parameters defines infinitely many functions), so we need smarter optimizers.
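The whole naive procedure fits in a few lines. The sketch below uses synthetic data (a sine curve plus small uniform noise), which is an assumption: it is not the dataset behind the MSE values quoted above, so the printed numbers will differ from them.

```python
import math
import random

def mse(f, points):
    """Mean squared error of function f on a list of (x, y) points."""
    return sum((y - f(x)) ** 2 for x, y in points) / len(points)

rng = random.Random(0)  # fixed seed for reproducibility

def make_points(xs):
    # Synthetic data (assumption): y = sin(x) plus noise in [-0.1, 0.1].
    return [(x, math.sin(x) + rng.uniform(-0.1, 0.1)) for x in xs]

train = make_points([0.3 * i for i in range(20)])
test = make_points([0.3 * i + 0.15 for i in range(20)])   # new, unseen points

# The toy hypothesis space: the four candidate functions.
hypothesis_space = {
    "x": lambda x: x,
    "exp": math.exp,
    "sin": math.sin,
    "cos": math.cos,
}

# Naive optimization: evaluate every candidate and keep the best one.
train_mse = {name: mse(f, train) for name, f in hypothesis_space.items()}
best_name = min(train_mse, key=train_mse.get)

# Generalization check on the held-out test points.
test_mse = mse(hypothesis_space[best_name], test)
print(best_name, train_mse[best_name], test_mse)
```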

## Real-World Optimization

In practice, we use **iterative optimization algorithms** like:

- **Gradient Descent** (and variants: SGD, Adam, RMSProp)
- These efficiently navigate the high-dimensional loss landscape
- Start from a random initialization
- Repeatedly update model parameters to reduce $J$
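As a first taste, here is a minimal sketch of plain gradient descent on a one-parameter model $f_w(x) = w \cdot x$ with MSE as the objective. The data, the learning rate, and the number of steps are assumptions picked by hand for this toy problem, and the start value is fixed instead of random so the run is reproducible.

```python
# Made-up training data generated by y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def J(w):
    """Objective: MSE of the model f_w(x) = w * x on the training data."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad_J(w):
    """Derivative of J with respect to w: -(2/N) * sum x_i * (y_i - w * x_i)."""
    return -2.0 * sum(x * (y - w * x) for x, y in zip(xs, ys)) / len(xs)

w = 0.0                # fixed start for reproducibility (usually random)
learning_rate = 0.05   # step size, chosen by hand for this toy problem

for step in range(100):
    # Move the parameter a small step against the gradient to reduce J.
    w -= learning_rate * grad_J(w)

print(w, J(w))
```

Each update moves `w` in the direction that lowers $J$, so the loop approaches the slope that generated the data without ever enumerating candidate functions.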

We'll cover Gradient Descent in detail next!

---

