# Training

When training a machine learning model, one of the most important decisions is the choice of loss function. Recall that a loss function, $Λ(w)$, measures the amount of error that our model, $F(Δ, w)$, produces on some training set $Δ$. The loss function therefore plays a crucial role in the success of the model. Equally important is how we choose to minimize this loss function, that is, how we optimize the performance of the model.

## Loss Functions

* Mean Squared Error: $$
Λ(w) = \frac{1}{N}\sum_{i=1}^{N} (y_i - F(x_i, w))^2
$$
is the most common loss function for predicting continuous values. Here $N$ is the number of training samples and $y_i$ is the target value for input $x_i$.
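As a minimal sketch of the formula above, the mean squared error can be computed with NumPy as follows. The linear model `linear_model` and the toy data are assumptions made only for illustration; they are not part of the notes.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical model F(x, w) = w[0] * x + w[1], used only for illustration.
def linear_model(x, w):
    return w[0] * x + w[1]

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])  # targets generated by y = 2x + 1

# Weights that reproduce the targets exactly give zero loss.
loss_exact = mse_loss(y, linear_model(x, np.array([2.0, 1.0])))

# A wrong slope gives residuals 0, 1, 2, 3 and therefore a positive loss.
loss_off = mse_loss(y, linear_model(x, np.array([1.0, 1.0])))

print(loss_exact, loss_off)
```

Because the residuals are squared, large errors dominate the average, which is one reason this loss is sensitive to outliers.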

## Density Modeling and Likelihood of Data

* The "likelihood of the data" refers to the probability of observing the given data given the parameters of the model. In other words, given a model with certain parameters, how likely is it that we would observe the data that we have?

* In density modeling, the standard loss is the negative log-likelihood of the data.

* We use the log-likelihood because a likelihood is a product of many probabilities and can become an extremely small number; taking the logarithm turns the product into a sum and helps with numerical stability. We take the negative because a loss function is minimized, whereas the likelihood should be maximized.

For example, suppose we have a model with parameters $\theta$ and a dataset $\mathbf{X} = \{x_1, x_2, \ldots, x_n\}$ of independent samples. The log-likelihood of the data is given by:
$$
\log p(\mathbf{X} | \theta) = \sum_{i=1}^{n} \log p(x_i | \theta)
$$
And the loss function, which is the negative log-likelihood, is:
$$
-\log p(\mathbf{X} | \theta) = -\sum_{i=1}^{n} \log p(x_i | \theta)
$$

In training the model, we aim to find the model parameters $\theta$ that minimize this loss. This is equivalent to maximizing the likelihood of the data, hence the term "maximum likelihood estimation".
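The following sketch illustrates the negative log-likelihood for one concrete density model. The choice of a one-dimensional Gaussian, the synthetic data, and the function names are all assumptions made for illustration. It also shows why the logarithm matters numerically: the raw product of densities underflows to zero, while the sum of log-densities stays finite.

```python
import numpy as np

def gaussian_log_pdf(x, mu, sigma):
    """Log-density of each sample under a normal distribution N(mu, sigma^2)."""
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - (x - mu) ** 2 / (2.0 * sigma**2)

def negative_log_likelihood(x, mu, sigma):
    """Negative log-likelihood of independent samples: -sum_i log p(x_i | theta)."""
    return -np.sum(gaussian_log_pdf(x, mu, sigma))

# Synthetic data, assumed here purely for the example.
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=1000)

# For a Gaussian the maximum likelihood estimates have a closed form:
# the sample mean and the population standard deviation (ddof=0).
mu_mle = data.mean()
sigma_mle = data.std()

nll_mle = negative_log_likelihood(data, mu_mle, sigma_mle)
nll_other = negative_log_likelihood(data, 0.0, 1.0)

# The raw likelihood is a product of 1000 small numbers and underflows.
raw_likelihood = np.prod(np.exp(gaussian_log_pdf(data, mu_mle, sigma_mle)))

print(nll_mle < nll_other, raw_likelihood)
```

For models without a closed-form solution, the same loss would instead be minimized iteratively, for example with gradient descent.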

## Cross Entropy

* For classification problems, the usual choice of model is one which outputs a single column vector whose length corresponds to the number of classes, with one score per class. These values are unnormalized scores rather than probabilities; the vector is known as the ***logit*** vector.

* Suppose $X$ is some input and $Y$ is the class which we wish to predict. Using the model $F$, we compute an estimate of the posterior probabilities: $$\hat{P}(Y = y | X = x) = \frac{\exp(F(x,w)_y)}{\sum_z \exp(F(x,w)_z)} $$
This is the estimated probability of $Y=y$ conditioned on $X=x$. It is called the ***softmax***, or ***softargmax***, of the logits.



* More precisely, we want to train the model so that it maximizes the probability of the true classes. Equivalently, it must minimize the cross-entropy, that is:
$$
\begin{aligned}
Λ_{\text{ce}}(w) &= -\frac{1}{N} \sum_{n=1}^{N} \log \hat{P}(Y = y_n | X = x_n) \\
&= \frac{1}{N} \sum_{n=1}^{N} -\log\left(\frac{\exp(F(x_n,w)_{y_n})}{\sum_z \exp(F(x_n,w)_z)}\right) \\
&= \frac{1}{N} \sum_{n=1}^{N} Λ'_{\text{ce}}(F(x_n,w),y_n)
\end{aligned}
$$
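The softmax and cross-entropy computations above can be sketched in NumPy as follows. The example logits and labels are made up for illustration, and subtracting the row maximum before exponentiating is a common numerical-stability trick that is assumed here rather than taken from the notes; it leaves the softmax value unchanged.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def cross_entropy(logits, labels):
    """Mean of -log P_hat(Y = y_n | X = x_n) over the N rows of `logits`."""
    log_probs = log_softmax(logits)
    n = logits.shape[0]
    # Pick, for each row n, the log-probability of its true class y_n.
    return -log_probs[np.arange(n), labels].mean()

# Made-up logits F(x_n, w) for N = 3 samples and 3 classes.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.0, 0.0, 0.0],
                   [1000.0, 0.0, 0.0]])  # large value would overflow a naive exp
labels = np.array([0, 2, 0])

probs = np.exp(log_softmax(logits))  # each row sums to 1
loss = cross_entropy(logits, labels)

print(probs.sum(axis=1), loss)
```

When all logits in a row are equal, the predicted distribution is uniform and the per-sample loss equals the logarithm of the number of classes, which is a useful sanity check at the start of training.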


## Contrastive Loss

* Suppose we wish to learn a distance function such that the distance from a data sample $x_a$ of a certain semantic class to any other sample $x_b$ of the same class is smaller than its distance to any sample $x_c$ that is not in that class. This process is called ***metric learning***.

* To achieve this, the loss function most often used is a ***contrastive loss*** function.



Here is an example of contrastive loss:

$$
\mathcal{L}(x, x^{+}, x^{-}) = \frac{1}{2N} \sum_{i=1}^{N} [d(f(x_i), f(x_i^{+}))^{2} + \max(0, \text{margin} - d(f(x_i), f(x_i^{-})))^{2}]
$$

Where:
* $x$ is an anchor sample
* $x^{+}$ is a positive sample
* $x^{-}$ is a negative sample
* $f()$ is an embedding function
* $d()$ is a distance function (e.g. Euclidean distance)
* $\text{margin}$ is a margin parameter
* $N$ is the number of (anchor, positive, negative) sample triplets.


The contrastive loss aims to make the distance between an anchor and its positive sample small, while pushing the distance between the anchor and its negative sample beyond the margin. Once a negative sample is farther away than the margin, its term is zero and it no longer contributes to the loss. This helps learn useful embeddings where similar samples are close and dissimilar samples are far apart in the embedding space.
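A minimal NumPy sketch of the formula above is given below. It assumes the embeddings $f(x)$ have already been computed and are passed in as arrays, and it assumes $d$ is the Euclidean distance; the function name and the toy vectors are illustrative only.

```python
import numpy as np

def contrastive_loss(anchor, positive, negative, margin=1.0):
    """Contrastive loss over N triplets of embeddings, each array of shape (N, D)."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)  # d(f(x_i), f(x_i^+))
    d_neg = np.linalg.norm(anchor - negative, axis=1)  # d(f(x_i), f(x_i^-))
    # Positives are pulled in; negatives are penalized only inside the margin.
    per_triplet = d_pos**2 + np.maximum(0.0, margin - d_neg) ** 2
    return per_triplet.sum() / (2.0 * anchor.shape[0])

anchor = np.array([[0.0, 0.0]])

# Positive coincides with the anchor and the negative is far away: zero loss.
loss_good = contrastive_loss(anchor, np.array([[0.0, 0.0]]), np.array([[3.0, 4.0]]))

# Positive at distance 1 and negative at distance 0.5 (inside the margin of 1).
loss_bad = contrastive_loss(anchor, np.array([[0.0, 1.0]]), np.array([[0.5, 0.0]]))

print(loss_good, loss_bad)
```

In practice the embedding function is a trainable network and the loss is differentiated with respect to its weights; that part is omitted here.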

## Engineering the Loss

* Note that it is also possible to add specific terms to the loss function which depend on certain trainable parameters of the model so that certain outcomes are favorable.

* For example, consider the weight decay regularization technique, which adds a term to the loss function that penalizes large weights. This helps prevent overfitting by encouraging smaller, more diffuse weight values that generalize better.

* The L2 norm regularization has the effect of pulling weights towards the origin. The strength of this pull is controlled by the hyperparameter $\lambda$. With a larger $\lambda$, the regularization effect is stronger.

* So in summary, weight decay regularization helps prevent overfitting and improves generalization performance of neural network models. The penalty on large weights encourages diffuse weights suitable for generalization.



Here is a mathematical formulation of weight decay regularization:

$$
\mathcal{L}(w; x, y) = \ell(f(x;w), y) + \frac{\lambda}{2}\lVert w \rVert_2^2
$$

Where:
* $\ell$ is the loss function (e.g. cross-entropy loss)
* $f(x;w)$ is the model with weights $w$
* $(x, y)$ is the training sample and label
* $\lambda$ is the regularization strength hyperparameter
* $\lVert w \rVert_2^2$ is the L2 norm squared (sum of squared weights)
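To make the "pull towards the origin" concrete, here is a small sketch of the penalty term and its gradient. The function names, the learning rate `lr`, and the example weights are assumptions for illustration. Since the gradient of $\frac{\lambda}{2}\lVert w \rVert_2^2$ is $\lambda w$, a gradient step on the penalty alone multiplies every weight by $(1 - \text{lr} \cdot \lambda)$.

```python
import numpy as np

def l2_penalty(w, lam):
    """The regularization term (lambda / 2) * ||w||_2^2."""
    return 0.5 * lam * np.sum(w**2)

def regularized_loss(data_loss, w, lam):
    """Total loss: data term l(f(x; w), y) plus the L2 penalty."""
    return data_loss + l2_penalty(w, lam)

def l2_penalty_grad(w, lam):
    """Gradient of the penalty with respect to w, which is lambda * w."""
    return lam * w

w = np.array([3.0, -4.0])  # example weights, ||w||^2 = 25
lam = 0.1                  # regularization strength
lr = 0.5                   # hypothetical learning rate

penalty = l2_penalty(w, lam)
total = regularized_loss(data_loss=2.0, w=w, lam=lam)  # data loss value is made up

# One gradient step on the penalty alone shrinks w towards the origin.
w_new = w - lr * l2_penalty_grad(w, lam)

print(penalty, total, w_new)
```

A larger `lam` increases both the penalty and the shrink factor per step, which matches the statement that a larger $\lambda$ gives a stronger regularization effect.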