All these problems can be expressed as the **minimization** of a loss function $l : \Omega \to \mathbb{R}$, where $\Omega$ is the **parameter space**.

Once we have the expression of the loss function, we can use an optimizer to find the best parameters.

# Regression

A regression problem is already formulated as a minimization problem:

Given a point cloud $(x_k, y_k)$ in $\mathbb{R}^N \times \mathbb{R}^M$, we define a model $m_{\theta} : \mathbb{R}^N \to \mathbb{R}^M$ that depends on the parameters $\theta \in \Omega$ and that is used to fit this point cloud.

An example of **loss function** is the squared $\ell_2$-norm of the residuals, averaged over the $K$ points of the cloud:
$$
J(\theta) = \frac{1}{2 K} \sum_{k=1}^K \left\| m_{\theta}(x_k) - y_k \right\|_2^2 .
$$

Thus we search for the best parameters $\theta^\star$, those that minimize $J$:
$$
\theta^\star \in \arg\,\min_{\theta \in \Omega} J(\theta) .
$$
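As an illustration, the loss $J$ can be evaluated with NumPy as sketched below. The linear model, the storage of $\theta$ as a pair `(W, b)` and the sample data are assumptions made for this example only; the square is read as the squared Euclidean norm of each residual.

```python
import numpy as np

def linear_model(theta, xs):
    """Hypothetical model m_theta(x) = W x + b, applied to each row of xs."""
    W, b = theta
    return xs @ W.T + b

def regression_loss(theta, xs, ys):
    """J(theta) = 1/(2K) * sum over k of ||m_theta(x_k) - y_k||^2."""
    residuals = linear_model(theta, xs) - ys   # shape (K, M)
    return float(np.sum(residuals ** 2) / (2 * xs.shape[0]))

# Made-up cloud of K = 3 points in R^2 x R^1, lying exactly on y = x_0 + 2 x_1 + 1
xs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
ys = np.array([[1.0], [2.0], [3.0]])

theta_exact = (np.array([[1.0, 2.0]]), np.array([1.0]))  # fits the cloud exactly
theta_zero = (np.zeros((1, 2)), np.zeros(1))             # predicts 0 everywhere

loss_exact = regression_loss(theta_exact, xs, ys)
loss_zero = regression_loss(theta_zero, xs, ys)
print(loss_exact, loss_zero)
```

The parameters that reproduce the cloud exactly reach the lowest possible value of $J$, namely zero, while the all-zero parameters give a strictly larger loss.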

# Classification

For a classification problem with $M$ classes, we can define a **score function** $s_{\theta} : \mathbb{R}^N \to \mathbb{R}^M$ that assigns, to each input in $\mathbb{R}^N$, one score per class.

The predicted class is the one with the highest score, so that the model $m_{\theta} : \mathbb{R}^N \to \{1, \dots, M\}$ can be expressed as:

$$
m_{\theta}(x) = \arg\!\max_{k \in \{1, \dots, M\}} s_{\theta}(x)_k
$$
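A minimal sketch of this prediction rule, assuming a linear score function $s_{\theta}(x) = W x + b$ (this choice, like the numerical values, is only an illustration):

```python
import numpy as np

def scores(theta, x):
    """Hypothetical linear score function s_theta(x) = W x + b."""
    W, b = theta
    return W @ x + b

def predict(theta, x):
    """Predicted class in {1, ..., M}: the class with the highest score."""
    # np.argmax returns a 0-based index, hence the +1 to match the text.
    return int(np.argmax(scores(theta, x))) + 1

# M = 3 classes, N = 2 input dimensions (made-up parameters)
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.zeros(3)
theta = (W, b)

x = np.array([0.5, 2.0])
predicted = predict(theta, x)   # the scores are [0.5, 2.0, -2.5]
print(predicted)
```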

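The associated **loss function** should be defined so that its gradient with respect to the parameters is not zero almost everywhere. This rules out any expression based directly on $m_{\theta}$: the predicted class is piecewise constant in $\theta$, so a small change of the parameters usually leaves it unchanged.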
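The following sketch illustrates the issue with finite differences. It assumes a bias-free linear score function $s_{\theta}(x) = W x$ and a 0/1 misclassification loss built directly on the predicted class; both are hypothetical choices for this example, and classes are numbered from 0 in the code.

```python
import numpy as np

def misclassification(W, x, y):
    """0/1 loss based on the predicted class (y is a 0-based class index)."""
    return float(np.argmax(W @ x) != y)

W = np.array([[1.0, 0.0], [0.0, 1.0]])
x = np.array([2.0, 1.0])
y = 1   # the scores are [2, 1]: class 0 is predicted, so the loss is 1

# Finite-difference estimate of the gradient with respect to each entry of W
eps = 1e-6
base = misclassification(W, x, y)
finite_differences = np.zeros_like(W)
for idx in np.ndindex(W.shape):
    W_shifted = W.copy()
    W_shifted[idx] += eps
    finite_differences[idx] = (misclassification(W_shifted, x, y) - base) / eps

print(base, finite_differences)
```

Although the sample is misclassified, every estimated partial derivative is zero, so a gradient-based optimizer would get no indication of how to improve the parameters.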

Many common loss functions exist, see:
- http://cs231n.github.io/linear-classify/#loss,
- https://indico.mathrice.fr/event/153/contribution/0/material/slides/0.pdf,
- https://indico.mathrice.fr/event/153/contribution/1/material/1/0.pdf.

The following loss functions are expressed for a single input/output sample. The complete loss function can then be obtained, for example, by taking the mean over all samples.

## Support Vector Machine

For a given input $x_i$ with expected output $y_i$, SVM uses the following loss function $L_i$:

$$
L_i = \sum_{j \neq y_i} \max(0, s_{\theta}(x_i)_j - s_{\theta}(x_i)_{y_i} + \Delta)
$$

where $\Delta$ is a positive margin required between the score of the correct class and the scores of all other classes.
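A minimal NumPy sketch of $L_i$ for one sample, given its score vector. The function name, the 0-based class index and the example scores are assumptions of this illustration.

```python
import numpy as np

def svm_loss(s, y, delta=1.0):
    """Hinge loss L_i for one sample.

    s: vector of the M scores s_theta(x_i); y: 0-based index of the correct class.
    """
    margins = np.maximum(0.0, s - s[y] + delta)
    margins[y] = 0.0   # the sum runs over j != y_i only
    return float(np.sum(margins))

s = np.array([3.2, 5.1, -1.7])   # made-up scores for M = 3 classes

# Correct class 0: class 1 scores higher, so the margin is violated.
loss_wrong = svm_loss(s, y=0, delta=1.0)
# Correct class 1: it beats every other class by more than delta.
loss_right = svm_loss(s, y=1, delta=1.0)
print(loss_wrong, loss_right)
```

The loss vanishes as soon as the correct class outscores every other class by at least $\Delta$; otherwise each violated margin contributes linearly.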


## SoftMax

For a given input $x_i$ with expected output $y_i$, the SoftMax uses the following loss function $L_i$:

$$
L_i = - \log \left( \frac{ \exp(s_{\theta}(x_i)_{y_i}) }{ \sum_j \exp(s_{\theta}(x_i)_j)} \right)
$$

where the scores are interpreted as unnormalized log-probabilities: the fraction is the normalized probability that the model assigns to the correct class.
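A minimal sketch of this loss for one sample. Subtracting the maximum score before exponentiating is a common trick to avoid overflow; it does not change the value of the fraction. The function name and the 0-based class index are assumptions of this illustration.

```python
import numpy as np

def softmax_loss(s, y):
    """Loss L_i = -log(softmax(s)[y]) for one sample.

    s: vector of the M scores s_theta(x_i); y: 0-based index of the correct class.
    """
    shifted = s - np.max(s)   # same ratio, but exp() cannot overflow
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[y])

# With equal scores, every class has probability 1/M, so the loss is log(M).
loss_uniform = softmax_loss(np.zeros(3), y=0)

# A very large score stays finite thanks to the shift.
loss_confident = softmax_loss(np.array([1000.0, 0.0]), y=0)
print(loss_uniform, loss_confident)
```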

## Regularization

To obtain a model with good properties, we may want to avoid large parameter values (the reasons are outside the scope of this course).
Moreover, the minimizer may not be unique. For example, with the SVM loss function and a score function that is linear in $\theta$, if a set of parameters $\theta$ reaches a zero loss, then any scaled set $\alpha \theta$ with $\alpha \geq 1$ also reaches a zero loss and is therefore also a minimizer.

In that case, a regularization term that penalizes large parameter values can be added to the loss function, e.g.:

$$
L(\theta) = \frac{1}{I} \sum_{i=1}^I L_i + \lambda \frac{1}{K} \sum_{k=1}^K \theta_k^2
$$
for $I$ samples and $K$ parameters, where $\lambda \geq 0$ weights the regularization term.
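A minimal sketch of this regularized loss, assuming that the per-sample losses $L_i$ have already been computed and that the parameters are stored in a flat array (the names and values below are hypothetical):

```python
import numpy as np

def regularized_loss(sample_losses, theta, lam):
    """L(theta) = mean of the L_i + lam * mean of the squared parameters."""
    data_term = np.mean(sample_losses)     # (1/I) * sum_i L_i
    penalty = lam * np.mean(theta ** 2)    # lam * (1/K) * sum_k theta_k^2
    return float(data_term + penalty)

sample_losses = np.array([0.0, 2.0, 1.0])   # I = 3 made-up per-sample losses
theta = np.array([1.0, -2.0, 0.0, 3.0])     # K = 4 made-up parameters

total = regularized_loss(sample_losses, theta, lam=0.1)
# Same data term, parameters scaled by 2: only the penalty grows.
total_scaled = regularized_loss(sample_losses, 2.0 * theta, lam=0.1)
print(total, total_scaled)
```

With equal data terms, the scaled parameters get a strictly larger total loss, which is how the penalty discriminates between parameter sets that fit the data equally well.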