# Datasets and Learning Paradigms

## What is a Dataset?

A dataset is a collection of examples (also called data points or samples). In supervised learning, each example pairs an input with the output that the model should learn to predict.

![Dataset Overview](images/dataset-overview.png)

**Key Concepts:**

- Each example consists of:
  - **Input** `x`: a feature vector (measurements/attributes)
  - **Output** `y`: the desired target (label or value), only available in supervised learning

- Inputs are typically organized into a **Feature Matrix** `X`:
  $$
  X = \begin{bmatrix}
  x_{11} & x_{12} & \dots & x_{1D} \\
  x_{21} & x_{22} & \dots & x_{2D} \\
  \vdots & \vdots & \ddots & \vdots \\
  x_{N1} & x_{N2} & \dots & x_{ND}
  \end{bmatrix}
  \quad \Rightarrow \quad
  X = [x_1, x_2, \dots, x_N]^T \in \mathbb{R}^{N \times D}
  $$

  - `N` = number of samples
  - `D` = number of features (dimensions)
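The row-per-sample layout above can be sketched in NumPy. This is a minimal illustration, not part of the original notes: the three feature vectors are made-up values chosen only to show the shapes.

```python
import numpy as np

# Illustrative toy data (assumed values): N = 3 samples, D = 4 features each.
x1 = np.array([5.1, 3.5, 1.4, 0.2])
x2 = np.array([4.9, 3.0, 1.4, 0.2])
x3 = np.array([6.2, 2.9, 4.3, 1.3])

# Stack the feature vectors as rows: X = [x_1, x_2, x_3]^T
X = np.stack([x1, x2, x3])

N, D = X.shape
print(N, D)      # 3 4
print(X[1])      # second sample x_2 (one row)
print(X[:, 0])   # first feature across all samples (one column)
```

Indexing a row returns one sample; indexing a column returns one feature measured across all samples.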

### What is a Feature?

A **feature** is a measurable property or attribute relevant to the problem.

**Classic Example: Iris Dataset** (Classification of 3 iris flower species)

![Iris Features](images/features-iris.png)

- Features: Sepal Length, Sepal Width, Petal Length, Petal Width
- Dataset size: 150 samples → $X \in \mathbb{R}^{150 \times 4}$

## Training and Test Datasets

![Training and Test Split](images/train-test-split.png)

![Data Distribution](images/data-distribution.png)

- We always split data into **at least two distinct sets**:
  - **Training Dataset**: Used to learn the function `f(x)`
  - **Test Dataset**: Used to evaluate performance after training

**Important Rules:**

- Training and test examples **must not overlap**
- Why? If the model sees test data during training, the test score **overestimates** its real-world performance (like giving a student the exam answers as homework!)

**Assumptions for Good Generalization:**

- Training and test samples are drawn from the **same underlying data distribution** $P_{data}$
- Samples are **independent and identically distributed (i.i.d.)**: each sample is drawn from that distribution independently of the others
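A random split into disjoint training and test sets might look like the sketch below. The helper name `train_test_split`, the 20% test fraction, and the toy arrays are assumptions made for illustration. Shuffling before splitting is a common way to make both sets look like draws from the same distribution when the samples are i.i.d.

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle sample indices, then cut them into disjoint train/test parts."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    indices = rng.permutation(n_samples)          # random order of 0..N-1
    n_test = int(round(test_fraction * n_samples))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data (assumed): 10 samples with 4 features, labels 0..9 so that
# every sample can be identified after the split.
X = np.arange(40, dtype=float).reshape(10, 4)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, X_test.shape)   # (8, 4) (2, 4)
```

Because each index lands in exactly one of the two slices, no example can appear in both sets.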

## Supervised Learning

![Supervised Learning](images/supervised-learning.png)

Supervised learning = learning from **labeled** examples (both `x` and correct `y` are provided).

- Covers both **classification** and **regression** problems

In matrix form:
- Feature matrix: $X \in \mathbb{R}^{N \times D}$
- Target matrix: $Y \in \mathbb{R}^{N \times K}$ (K = number of output values per sample, e.g. the number of classes when labels are one-hot encoded)
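For classification, one common way to obtain a target matrix of shape $N \times K$ is one-hot encoding, where each row has a single 1 in the column of the correct class. The sketch below assumes integer class labels in `0..K-1`. The label values are made up for illustration.

```python
import numpy as np

# Assumed toy labels: N = 4 samples, K = 3 classes.
labels = np.array([0, 2, 1, 2])
K = 3

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing it with the labels builds Y row by row.
Y = np.eye(K)[labels]

print(Y.shape)   # (4, 3)
print(Y)
```

Taking `Y.argmax(axis=1)` recovers the original integer labels.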

**Common Supervised Algorithms:**

- Linear models
- Neural networks
- Support vector machines
- Naive Bayes
- K-nearest neighbors
- Random forests

## Unsupervised Learning

Unsupervised learning = learning from **unlabeled** data (only inputs `x` are available).

> Even without labels, we can discover useful structure in the data.

### Common Tasks

1. **Clustering**  
   Group similar data points together into clusters.

   ![Clustering Example](images/clustering-example.png)

   - We detect "close" points and assign them to the same cluster.
   - There are no predefined labels; each cluster is interpreted as a different type of input.

2. **Probability Density Estimation**  
   Estimate the underlying data distribution $P_{data}$.

   ![Density Estimation Example](images/density-estimation.png)

   - Learn how likely different points are.
   - Useful for anomaly detection, generating new samples, etc.
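The clustering task can be sketched with k-means, one possible clustering algorithm (the notes do not prescribe a specific one). The function name `kmeans`, the fixed iteration count, and the two synthetic blobs are assumptions made for illustration.

```python
import numpy as np

def assign(X, centroids):
    """Return, for each sample, the index of its nearest centroid."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, n_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct samples chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        assignments = assign(X, centroids)
        for j in range(k):
            members = X[assignments == j]
            if len(members) > 0:              # keep old centroid if cluster is empty
                centroids[j] = members.mean(axis=0)
    return assign(X, centroids), centroids

# Synthetic data (assumed): two well-separated blobs of 20 points each.
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(20, 2))
X = np.vstack([blob_a, blob_b])

assignments, centroids = kmeans(X, k=2)
print(assignments)
```

No labels are used anywhere: the cluster indices are arbitrary, and only the grouping of points is meaningful.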

**Feature Matrix (same as supervised):**  
$X \in \mathbb{R}^{N \times D}$ (no `Y` matrix)
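As a minimal sketch of density estimation from a feature matrix alone, the code below fits a single multivariate Gaussian to `X` and evaluates its log-density. The Gaussian model is an assumption chosen for simplicity, since real data often needs a more flexible estimator. The function names and the synthetic data are illustrative.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate the mean vector and covariance matrix from the rows of X."""
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)   # rowvar=False: columns are features
    return mean, cov

def log_density(x, mean, cov):
    """Log of the multivariate Gaussian density at a single point x."""
    d = mean.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)   # squared Mahalanobis distance
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

# Synthetic unlabeled data (assumed): 500 samples, 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))

mean, cov = fit_gaussian(X)
typical = log_density(np.array([0.0, 0.0]), mean, cov)
outlier = log_density(np.array([6.0, 6.0]), mean, cov)
print(typical > outlier)   # True
```

A point with very low estimated density, like the second one here, is a candidate anomaly.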

**Applications:**
- Customer segmentation
- Anomaly detection
- Data compression (dimensionality reduction)
- Preprocessing for supervised tasks

## Reinforcement Learning (Brief Overview)

![Reinforcement Learning](images/reinforcement-learning.png)

- No fixed labeled dataset
- Learning by **interaction** with an environment
- Agent takes **actions** → receives **rewards/penalties**
- Goal: Maximize cumulative reward
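The interaction loop can be illustrated with a two-armed bandit, a deliberately tiny stand-in for an environment. Everything here is an assumption for illustration: the reward probabilities, the epsilon-greedy action rule, and the number of steps. Real reinforcement learning problems also involve states, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical environment: action i pays reward 1 with probability
# reward_probs[i]. The agent does not know these values.
reward_probs = np.array([0.2, 0.8])
n_actions = len(reward_probs)

Q = np.zeros(n_actions)                  # estimated value of each action
counts = np.zeros(n_actions, dtype=int)  # how often each action was taken
epsilon = 0.1                            # probability of exploring
total_reward = 0.0

for step in range(2000):
    # Agent chooses an action: explore at random, or exploit the best estimate.
    if rng.random() < epsilon:
        action = int(rng.integers(n_actions))
    else:
        action = int(np.argmax(Q))

    # Environment responds with a reward.
    reward = float(rng.random() < reward_probs[action])

    # Agent updates its running-average estimate for that action.
    counts[action] += 1
    Q[action] += (reward - Q[action]) / counts[action]
    total_reward += reward

print(Q, counts, total_reward)
```

There is no labeled dataset: the only feedback is the reward received after each action, and the agent's estimates improve as it interacts.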

**Applications:**
- Game playing (AlphaGo)
- Robotics
- Self-driving cars
- Drug discovery

---

