# Problem formulation

- Data
- Model
- Loss
- Optimization algorithm

## Data

### Inputs

- $m$: number of samples (each with its associated expected result) in the dataset.
- $n$: number of features. A **feature** is an attribute (a property) of the data samples.
- $\pmb{x}^{(i)}$: data **sample** (example), vector of $n$ features.
- $x^{(i)}_j$: value of the $j$th feature for the $i$th data sample.

$$\pmb{x}^{(i)} = \begin{pmatrix}
       \ x^{(i)}_1 \\
       \ x^{(i)}_2 \\
       \ \vdots \\
       \ x^{(i)}_n
     \end{pmatrix} \in \pmb{R}^n$$

### Design matrix

- $\pmb{X}$: matrix of shape (*samples, features*) expected by most ML algorithms, called the *design matrix*.
  - First dimension is for the $m$ samples.
  - Second dimension is for the $n$ features of each sample.

$$\pmb{X} = \begin{bmatrix}
       \ \pmb{x}^{(1)T} \\
       \ \pmb{x}^{(2)T} \\
       \ \vdots \\
       \ \pmb{x}^{(m)T} \\
     \end{bmatrix} = 
\begin{bmatrix}
       \ x^{(1)}_1 & x^{(1)}_2 & \cdots & x^{(1)}_n \\
       \ x^{(2)}_1 & x^{(2)}_2 & \cdots & x^{(2)}_n \\
       \ \vdots & \vdots & \ddots & \vdots \\
       \ x^{(m)}_1 & x^{(m)}_2 & \cdots & x^{(m)}_n
     \end{bmatrix} \in \pmb{R}^{m \times n} $$

### Multidimensional data: reshaping

A bitmap image can be represented as a 3D multidimensional array (*height, width, color_channels*).

A video can be represented as a 4D multidimensional array (*frames, height, width, color_channels*).

Such data must be **reshaped** (here, *flattened*) into a vector before being fed to most ML models.

![Image to vector](images/image2vector.jpeg)
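Flattening can be sketched with NumPy's `reshape`. The image size (4x4 pixels, 3 channels) and the batch of 10 images are arbitrary values chosen for the example.

```python
import numpy as np

# Hypothetical 4x4 RGB image: (height, width, color_channels).
image = np.zeros((4, 4, 3), dtype=np.uint8)

# Flatten the 3D array into a 1D vector of 4 * 4 * 3 = 48 features.
vector = image.reshape(-1)
print(vector.shape)  # (48,)

# Hypothetical batch of 10 such images: (samples, height, width, color_channels).
images = np.zeros((10, 4, 4, 3), dtype=np.uint8)

# Keep the sample dimension and flatten the rest to get a (samples, features) matrix.
X = images.reshape(len(images), -1)
print(X.shape)  # (10, 48)
```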

### Multidimensional data: scaling

Individual pixel values for images and videos are typically integers in the $[0, 255]$ range.

Scaling them to floats in the $[0,1]$ range is common practice.
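A minimal sketch of this scaling with NumPy, using a few made-up pixel values:

```python
import numpy as np

# Hypothetical pixel values stored as 8-bit unsigned integers.
pixels = np.array([0, 64, 128, 255], dtype=np.uint8)

# Convert to floats first, then divide by the maximum possible value.
scaled = pixels.astype(np.float32) / 255.0

print(scaled.dtype)                # float32
print(scaled.min(), scaled.max())  # 0.0 1.0
```

Converting to a float type before dividing makes the output type explicit instead of relying on NumPy's default promotion rules.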

### Targets

- $y^{(i)}$: **target**, or **label** (expected result) for a data sample.
- $\pmb{y}$: vector of labels for all $m$ samples in a dataset.

$$\pmb{y} = \begin{pmatrix}
       \ y^{(1)} \\
       \ y^{(2)} \\
       \ \vdots \\
       \ y^{(m)}
     \end{pmatrix} \in \pmb{R}^m$$

## Model

### Definition

The representation learnt from data during training is called a **model**. It defines the relationship between inputs and outputs, and thus produces results from data. Most (but not all) ML systems are model-based.

[![Extract from the book Hands-on Machine Learning with Scikit-Learn & TensorFlow by A. Géron](images/instance_model_learning.png)](https://github.com/ageron/handson-ml2)

### Hypothesis function

- $\pmb{\theta}$ (sometimes denoted $\pmb{\omega}$): vector of the model's internal parameters.
- $h_\theta()$: model's prediction function (*hypothesis function*), using the model parameters $\pmb{\theta}$ to define the relationship between features and labels.
- $y'^{(i)}$ (sometimes noted $\hat{y}^{(i)}$): hypothesis function output (model prediction).

$$y'^{(i)} = h_\theta(\pmb{x}^{(i)})$$ 
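To make the notation concrete, here is a sketch of a hypothesis function. It assumes a linear model without bias term, which is only one possible choice and is not prescribed by the definitions above. The function name `h` and all values are illustrative.

```python
import numpy as np

def h(theta, x):
    """Linear hypothesis (illustrative choice): dot product of parameters and features."""
    return float(np.dot(theta, x))

# Hypothetical parameters and sample with n = 3 features.
theta = np.array([0.5, -1.0, 2.0])
x_i = np.array([2.0, 1.0, 3.0])

# Prediction for this sample: 0.5*2 - 1*1 + 2*3
y_pred = h(theta, x_i)
print(y_pred)  # 6.0
```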

### Multiclass classification

* $\pmb{y}^{(i)}$ and $\pmb{y}'^{(i)}$ are vectors with as many elements as the number of predicted classes $K$.
* $\pmb{y}^{(i)}$ (the *ground truth*) is a **binary vector** of $K$ values. $y^{(i)}_k$ is equal to 1 if the $i$th sample's class corresponds to $k$, 0 otherwise.
* $\pmb{y}'^{(i)}$ is a **probability vector** of $K$ values, computed by the model. $y'^{(i)}_k$ represents the probability that the $i$th sample belongs to class $k$.

$$\pmb{y}^{(i)} = \begin{pmatrix}
       \ y^{(i)}_1 \\
       \ y^{(i)}_2 \\
       \ \vdots \\
       \ y^{(i)}_K
     \end{pmatrix} \in \pmb{R}^K
\;\;\;
\pmb{y}'^{(i)} = \begin{pmatrix}
       \ y'^{(i)}_1 \\
       \ y'^{(i)}_2 \\
       \ \vdots \\
       \ y'^{(i)}_K
     \end{pmatrix} \in \pmb{R}^K$$

### Model lifecycle

There are two (repeatable) phases:

- **Training**: using training input samples, the model learns to find a relationship between features and labels.
- **Inference**: the trained model is used to make predictions.

### Hyperparameters

- Many ML models also have user-defined properties called **hyperparameters**.
  - Examples: maximum depth of a decision tree, number of layers of a neural network...
- Unlike internal parameters, they are not automatically updated during training.
- Hyperparameters directly affect the model's performance and must be tuned during the training and tuning steps.

## Loss

### Definition

- $\mathcal{L}_{\pmb{X, y}}(\pmb{\theta})$, sometimes denoted $\mathcal{J}_{\pmb{X, y}}(\pmb{\theta})$: **loss** (or **cost**) function that quantifies the difference, often called **error**, between expected results (called *ground truth*) and actual results computed by the model.
- During model training, the input dataset $\pmb{X}$ and the expected results $\pmb{y}$ are treated as constants. The loss depends solely on the model parameters $\pmb{\theta}$. To simplify notation, the loss function will be written $\mathcal{L}(\pmb{\theta})$.
- Different loss functions exist. The choice depends on the learning type. See [Losses](./losses) for details.
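As an example of a loss seen as a function of $\pmb{\theta}$ only, the sketch below uses the mean squared error with a linear model. Both are assumptions chosen for illustration, and the function name `mse_loss` and the data are hypothetical.

```python
import numpy as np

def mse_loss(theta, X, y):
    """Mean squared error between linear predictions X @ theta and targets y."""
    predictions = X @ theta
    return float(np.mean((predictions - y) ** 2))

# Hypothetical dataset, fixed during training. The first column of ones acts as a bias feature.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

# With X and y fixed, the loss only changes with the parameters.
loss_bad = mse_loss(np.array([0.0, 0.0]), X, y)   # all predictions are 0
loss_good = mse_loss(np.array([0.0, 2.0]), X, y)  # predictions match the targets
print(loss_bad > loss_good)  # True
print(loss_good)             # 0.0
```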

## Optimization algorithm

### Definition

- Used during the training phase.
- Objective: find the set of model parameters $\pmb{\theta}^{*}$ that minimizes the loss.
- For each learning type, several algorithms of various complexity exist.

[![Optimization: linear regression](images/LossSideBySide.png)](https://developers.google.com/machine-learning/crash-course/reducing-loss/an-iterative-approach)
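One such algorithm is batch gradient descent, sketched below for a linear model with a mean squared error loss. The algorithm, model, and loss are assumptions chosen for illustration. The names `gradient_descent`, `learning_rate`, and `n_steps` are hypothetical, and the last two are hyperparameters picked by hand for this small dataset.

```python
import numpy as np

def mse_loss(theta, X, y):
    """Mean squared error between linear predictions X @ theta and targets y."""
    return float(np.mean((X @ theta - y) ** 2))

def gradient_descent(X, y, learning_rate=0.1, n_steps=2000):
    """Minimize the MSE loss by repeatedly stepping against its gradient."""
    m, n = X.shape
    theta = np.zeros(n)  # initial parameters
    for _ in range(n_steps):
        # Gradient of the MSE loss with respect to theta.
        gradient = (2.0 / m) * X.T @ (X @ theta - y)
        theta -= learning_rate * gradient
    return theta

# Hypothetical dataset generated by y = 2 * x. The first column of ones acts as a bias feature.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

theta_star = gradient_descent(X, y)
print(np.round(theta_star, 3))       # close to [0, 2]
print(mse_loss(theta_star, X, y))    # close to 0
```

A learning rate that is too large makes the iterations diverge, while one that is too small slows convergence, which is why this value has to be tuned for each problem.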