![Train the Model banner](./images/5_train_the_model.png)

# 5. Train the Model

## 5.1. Objective Function (Loss/Cost Function)

The objective function, also known as the loss or cost function, measures how well the model's predictions match the true labels in the training data. The goal is to minimize this function during training.

The selection of a loss function is not one-size-fits-all. It requires a deep understanding of the problem, the nature of the data, the distribution of the target variable, and the specific goals of the analysis.

### 5.1.1. Regression Example

In regression tasks, where the goal is to predict a continuous value, the difference between the predicted and actual values is of primary concern. Common loss functions for regression include:

#### Mean Squared Error (MSE)

\begin{equation*}
\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\end{equation*}

Suitable for problems where large errors are particularly undesirable: because each error is squared, large errors have a disproportionately large impact on the loss.

Examples where MSE is preferred over MAE:
- **Medical diagnosis**: In medical applications, large errors in diagnosis or treatment can have severe consequences. MSE heavily penalizes large errors, making it a more appropriate metric for such critical domains.
- **Financial risk management**: In finance, large errors in risk estimation or portfolio optimization can lead to substantial losses. MSE's emphasis on large errors makes it a better choice for managing financial risks.
- **Structural engineering**: In structural design, large errors in load or stress calculations can lead to catastrophic failures. MSE's sensitivity to large errors is desirable for ensuring safety margins.
- **Image and signal processing**: In applications like image compression or signal denoising, large errors can significantly degrade the output quality. MSE is commonly used here (it is, for example, the basis of the PSNR metric) because it penalizes large pixel or sample errors more heavily than MAE.

#### Mean Absolute Error (MAE)

\begin{equation*}
\frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
\end{equation*}

Useful when every error should contribute to the loss in direct proportion to its magnitude, so that large errors receive no extra weight.

Examples where MAE is preferred over MSE:
- **Forecasting sales or revenue**: In business settings, large errors in forecasting sales or revenue may not be substantially worse than smaller errors, as long as the overall trend is captured accurately. MAE treats all errors equally, making it a suitable metric in such cases.
- **Measuring sensor errors**: When dealing with sensor data, large errors or outliers may be caused by temporary malfunctions or noise. MAE is more robust to such outliers and provides a better measure of the typical error.
- **Evaluating navigation systems**: In navigation applications, small and large errors in distance estimation may have similar consequences (e.g., missing a turn). MAE captures the average error without heavily penalizing large deviations.
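
To make the contrast between the two losses concrete, here is a minimal sketch in plain Python. The data values and the helper names `mse` and `mae` are illustrative assumptions, not taken from any library. Both sets of predictions have the same MAE, but the set that concentrates its error in a single point has a four times larger MSE.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute residuals."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical targets and predictions (illustrative values only).
y_true = [10.0, 12.0, 14.0, 16.0]
small_errors = [11.0, 13.0, 15.0, 17.0]  # every prediction is off by 1
one_outlier = [10.0, 12.0, 14.0, 20.0]   # a single prediction is off by 4

print(mse(y_true, small_errors), mae(y_true, small_errors))  # 1.0 1.0
print(mse(y_true, one_outlier), mae(y_true, one_outlier))    # 4.0 1.0
```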


### 5.1.2. Classification Example

In classification tasks, where the goal is to categorize inputs into classes, the focus is on the discrepancy between the predicted class probabilities and the actual class labels. Common loss functions for classification include:

#### Log Loss (Logistic Loss)

\begin{equation*}
L(y, f(x)) = -\left[ y \log(f(x)) + (1 - y) \log(1 - f(x)) \right]
\end{equation*}

Typically used for binary classification problems, where the goal is to predict the probability of an instance belonging to one of two classes. In the formula:
- $y$ is the true binary label (0 or 1)
- $f(x)$ is the predicted probability of the positive class (between 0 and 1)

Some examples include:
- Email spam detection (spam or not spam)
- Credit risk modeling (default or not default)
- Disease diagnosis (diseased or healthy)
- Fraud detection (fraudulent or legitimate transaction).
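
A minimal sketch of the log loss for a single example, assuming plain Python and a hypothetical helper named `log_loss`. The clipping constant `eps` is an assumption added only to keep the logarithm finite. The example shows that a confident wrong prediction is penalized far more than a confident correct one.

```python
import math

def log_loss(y, p, eps=1e-12):
    """Binary log loss for one example; y is 0 or 1, p is the predicted P(y = 1)."""
    # Clip p away from 0 and 1 so log() never receives 0 (eps is an assumed value).
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

confident_right = log_loss(1, 0.9)  # true label 1, predicted probability 0.9
confident_wrong = log_loss(1, 0.1)  # true label 1, predicted probability 0.1
print(round(confident_right, 4), round(confident_wrong, 4))  # 0.1054 2.3026
```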

#### Hinge Loss

\begin{equation*}
L(y, f(x)) = \max(0, 1 - y \cdot f(x))
\end{equation*}

Typically used in maximum-margin classification problems, particularly with Support Vector Machines (SVMs). It is suitable for binary classification tasks where the goal is to maximize the margin between the two classes. In the formula:
- $y$ is the true label or target value (-1 or 1)
- $f(x)$ is the predicted value or decision function output
 
Some examples include:
- Text classification (e.g., sentiment analysis)
- Image classification (e.g., object detection)
- Bioinformatics (e.g., protein classification)
- Anomaly detection.
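
The hinge loss can be sketched in a few lines of plain Python. The function name `hinge_loss` and the example scores are illustrative assumptions. Note that the loss is zero only when the example is on the correct side of the decision boundary with a margin of at least 1.

```python
def hinge_loss(y, score):
    """Hinge loss for one example; y is -1 or +1, score is the raw decision value f(x)."""
    return max(0.0, 1.0 - y * score)

outside_margin = hinge_loss(+1, 2.5)  # correct side, beyond the margin: no loss
inside_margin = hinge_loss(+1, 0.3)   # correct side, but inside the margin
wrong_side = hinge_loss(-1, 0.3)      # wrong side of the decision boundary
print(outside_margin)  # 0.0
```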

#### Cross-Entropy Loss

\begin{equation*}
L = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})
\end{equation*}

Generalization of Log Loss to multiclass classification problems, where each instance belongs to one of several classes. In the formula:
- $M$ is the number of classes
- $y_{o,c}$ is a binary indicator: 1 if class $c$ is the correct label for observation $o$, and 0 otherwise
- $p_{o,c}$ is the predicted probability that observation $o$ belongs to class $c$

Some examples include:
- Image classification (e.g., classifying images into multiple categories)
- Natural language processing (e.g., text categorization, language modeling)
- Speech recognition
- Recommender systems (e.g., predicting user preferences among multiple items)
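
As a rough illustration, the cross-entropy for one observation can be computed as follows. The helper name `cross_entropy`, the clipping constant `eps`, and the probability values are assumptions made for this sketch. With two classes the result matches the binary log loss of the same prediction.

```python
import math

def cross_entropy(one_hot, probs, eps=1e-12):
    """Cross-entropy for one observation, given a one-hot label and class probabilities."""
    # max(p, eps) guards against log(0); eps is an assumed value.
    return -sum(y * math.log(max(p, eps)) for y, p in zip(one_hot, probs))

three_class = cross_entropy([1, 0, 0], [0.7, 0.2, 0.1])
two_class = cross_entropy([1, 0], [0.9, 0.1])  # same value as log loss with y = 1, f(x) = 0.9
print(round(three_class, 4))  # 0.3567
print(round(two_class, 4))    # 0.1054
```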

-----

## 5.2. Optimization Algorithms

To find the model parameters (weights and biases) that minimize the loss function, optimization algorithms are used. 

### 5.2.1. Gradient Descent

Gradient Descent is a fundamental optimization algorithm for minimizing a function. The core idea is to move iteratively towards a minimum by updating the parameters in the direction opposite to the gradient of the function at the current point.

The algorithm works as follows:
1. Start with an initial guess for the parameters of the function you want to minimize.
2. Calculate the gradient (slope) of the function at the current parameter values. The gradient points in the direction of the greatest increase of the function.
3. Update the parameter values by taking a step in the opposite direction of the gradient, scaled by a learning rate. This moves the parameters towards the minimum of the function.
4. Repeat steps 2 and 3 until convergence, which means the minimum of the function is reached or the algorithm can no longer make progress.

The key idea is that moving the parameters in the direction opposite to the gradient decreases the function value, provided the step is small enough. The learning rate determines the step size at each iteration: a smaller rate leads to slower but more stable convergence, while a larger rate may overshoot the minimum or even diverge.

Gradient descent is widely used in machine learning to train models like neural networks by minimizing a cost/loss function that measures the difference between predicted and actual outputs. The model's weights and biases are the parameters adjusted by gradient descent to minimize this cost function over the training data.

While powerful, gradient descent can get stuck in local minima for non-convex functions and may require techniques like momentum or mini-batch updates for better performance on complex optimization landscapes.

#### The Gradient Descent Update Rule

The update rule for the parameters can be expressed as:

\begin{equation*}
\theta_{\text{new}} = \theta_{\text{old}} - \eta \cdot \nabla_\theta J(\theta)
\end{equation*}

- $\theta$ represents the parameters of the function we are trying to minimize.

- $J(\theta)$ is the cost function, a function of the parameters $\theta$.

- $\nabla_\theta J(\theta)$ denotes the gradient of the cost function with respect to the parameters $\theta$. This gradient points in the direction of the steepest ascent of the cost function.

- $\eta$ is the learning rate, a positive scalar determining the size of the step we take on each iteration. It controls how much we adjust the parameters by in the direction opposite to the gradient.

- $\theta_{\text{new}}$ and $\theta_{\text{old}}$ are the values of the parameters after and before the update, respectively. 
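
The update rule translates almost directly into code. The following is a minimal sketch, assuming a one-dimensional cost $J(\theta) = (\theta - 3)^2$ chosen purely for illustration; the function name `gradient_descent` and the stopping tolerance `tol` are likewise assumptions.

```python
def gradient_descent(grad, theta0, eta=0.1, max_steps=1000, tol=1e-10):
    """Repeat theta_new = theta_old - eta * grad(theta_old) until the step is tiny."""
    theta = theta0
    for _ in range(max_steps):
        step = eta * grad(theta)
        theta = theta - step
        if abs(step) < tol:  # assumed convergence test: the update no longer changes theta
            break
    return theta

# Gradient of the illustrative cost J(theta) = (theta - 3)**2.
grad_J = lambda theta: 2.0 * (theta - 3.0)

theta_min = gradient_descent(grad_J, theta0=0.0)
print(theta_min)  # close to 3.0, the minimizer of J
```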

#### The Gradient

The gradient of a function at a point is a vector pointing in the direction of the steepest ascent of the function at that point. For a function $f(x, y)$ with two variables, the gradient is:

\begin{equation*}
\nabla f(x, y) = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right)
\end{equation*}

For a multivariate function $f(\mathbf{x})$ where $\mathbf{x} = (x_1, x_2, ..., x_n)$, the gradient is a vector of partial derivatives:

\begin{equation*}
\nabla f(\mathbf{x}) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, ..., \frac{\partial f}{\partial x_n} \right)
\end{equation*}
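
When an analytic gradient is not at hand, the partial derivatives can be approximated numerically. The sketch below uses central differences; the function name `numerical_gradient`, the step `h`, and the test function are assumptions for illustration. For $f(x, y) = x^2 + 3y^2$ the analytic gradient is $(2x, 6y)$, which the approximation should reproduce closely.

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate each partial derivative of f at x with a central difference."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2.0 * h))
    return grad

# Illustrative function with known gradient (2x, 6y).
f = lambda v: v[0] ** 2 + 3.0 * v[1] ** 2

approx = numerical_gradient(f, [1.0, 2.0])
print(approx)  # approximately [2.0, 12.0]
```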

#### Cost Function

A common choice for the cost function in regression problems is the Mean Squared Error (MSE), which for $m$ observations is defined as:

\begin{equation*}
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
\end{equation*}

- $h_\theta(x^{(i)})$ is the hypothesis function, representing the predicted output for input $x^{(i)}$ with parameters $\theta$.

- $y^{(i)}$ is the actual output for the $i$-th observation.

- The factor of $\frac{1}{2}$ is often included to simplify the derivative of the cost function with respect to the parameters.
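
As a sketch of how this cost function is used, assume a linear hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$ (an assumption; the text does not fix a particular hypothesis). The factor $\frac{1}{2}$ cancels the 2 that appears when differentiating the square, so the partial derivatives are simply the mean residual and the mean of residual times input. The data and names below are illustrative.

```python
def cost(theta0, theta1, xs, ys):
    """J(theta) = 1/(2m) * sum of squared residuals for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def cost_gradient(theta0, theta1, xs, ys):
    """Partial derivatives of J with respect to theta0 and theta1."""
    m = len(xs)
    residuals = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    g0 = sum(residuals) / m
    g1 = sum(r * x for r, x in zip(residuals, xs)) / m
    return g0, g1

# Hypothetical noise-free data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

initial_cost = cost(0.0, 0.0, xs, ys)
theta0, theta1, eta = 0.0, 0.0, 0.05
for _ in range(3000):
    g0, g1 = cost_gradient(theta0, theta1, xs, ys)
    theta0, theta1 = theta0 - eta * g0, theta1 - eta * g1

print(initial_cost)    # 16.5
print(theta0, theta1)  # close to 1.0 and 2.0
```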

#### Learning Rate

The learning rate $\eta$ is crucial for the convergence of Gradient Descent. If it's too large, the algorithm might overshoot the minimum. If it's too small, convergence might be very slow. There's no one-size-fits-all value for $\eta$; it often requires tuning.

These formulas encapsulate the mathematical foundation of the Gradient Descent algorithm, illustrating how it iteratively adjusts parameters to find the minimum of a function by moving in the direction opposite to the gradient.

![image](https://www.jeremyjordan.me/content/images/2018/02/Screen-Shot-2018-02-24-at-11.47.09-AM.png)
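
The effect of the learning rate can be reproduced on a toy problem. This sketch assumes the cost $J(\theta) = \theta^2$ and three hand-picked values of $\eta$; both are illustrative choices. Each update multiplies $\theta$ by $(1 - 2\eta)$, so a tiny rate barely moves, a moderate rate converges quickly, and a rate above 1 makes the iterates grow.

```python
def run_gradient_descent(eta, theta0=1.0, n_steps=20):
    """Minimize J(theta) = theta**2, whose gradient is 2 * theta."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - eta * 2.0 * theta
    return theta

too_small = run_gradient_descent(eta=0.01)  # still far from the minimum at 0
reasonable = run_gradient_descent(eta=0.4)  # very close to 0
too_large = run_gradient_descent(eta=1.1)   # overshoots; magnitude grows every step

print(too_small, reasonable, too_large)
```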

### 5.2.2. Stochastic Gradient Descent (SGD)

SGD is a popular optimization algorithm that updates model parameters using a single learning rate for all parameters. It calculates gradients and updates parameters using one training example or a small batch of examples at a time. This approach introduces some noise during training, which can lead to better generalization performance on novel data, albeit at the cost of slower convergence.

SGD can optionally use momentum to accelerate progress in the relevant direction and dampen oscillations. Although it often converges more slowly than adaptive optimizers, SGD can potentially generalize better, trading a slightly worse fit on the training set for better performance on unseen data.
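
A minimal sketch of mini-batch SGD with momentum, assuming a linear model $\theta_0 + \theta_1 x$ with squared-error loss and noise-free toy data. The function name `sgd_momentum`, the hyperparameter values, and the particular momentum form (velocity accumulates gradients, then scales by the learning rate) are assumptions; libraries differ in the exact formulation.

```python
import random

def sgd_momentum(xs, ys, eta=0.005, beta=0.9, batch_size=2, n_epochs=500, seed=0):
    """Fit y = theta0 + theta1 * x using shuffled mini-batches and a momentum term."""
    rng = random.Random(seed)
    theta0, theta1 = 0.0, 0.0
    v0, v1 = 0.0, 0.0  # velocity: decaying sum of past gradients
    indices = list(range(len(xs)))
    for _ in range(n_epochs):
        rng.shuffle(indices)  # visit the examples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            residuals = [(theta0 + theta1 * xs[i]) - ys[i] for i in batch]
            g0 = sum(residuals) / len(batch)
            g1 = sum(r * xs[i] for r, i in zip(residuals, batch)) / len(batch)
            v0 = beta * v0 + g0
            v1 = beta * v1 + g1
            theta0 -= eta * v0
            theta1 -= eta * v1
    return theta0, theta1

# Hypothetical noise-free data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
fit0, fit1 = sgd_momentum(xs, ys)
print(fit0, fit1)  # should approach 1.0 and 2.0
```

Setting `beta=0.0` recovers plain mini-batch SGD, which makes it easy to compare the two variants on the same data.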

### 5.2.3. Adaptive Moment Estimation (Adam Optimizer)

The Adam optimizer is an adaptive learning rate algorithm: it adapts the effective step size of each parameter individually, based on estimates of the first and second moments of that parameter's gradients. It maintains exponential moving averages of the gradients and of the squared gradients, so momentum is built in.

Like SGD, Adam is usually applied to mini-batches of data, so it updates parameters far more often than full-batch gradient descent, which helps on large datasets. It is often less sensitive to the choice of hyperparameters (such as the initial learning rate) and frequently converges faster than plain SGD. However, Adam may settle in suboptimal minima without careful tuning.
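
The moving averages described above can be sketched as follows. This is a simplified, full-gradient illustration of the Adam update with bias correction, using the commonly quoted defaults for `beta1`, `beta2`, and `eps`; the function name `adam_minimize`, the learning rate, and the quadratic test cost are assumptions for this example.

```python
import math

def adam_minimize(grad, theta, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    """Minimize a function given its gradient, using per-parameter Adam updates."""
    theta = list(theta)
    m = [0.0] * len(theta)  # moving average of gradients (first moment)
    v = [0.0] * len(theta)  # moving average of squared gradients (second moment)
    for t in range(1, n_steps + 1):
        g = grad(theta)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            m_hat = m[i] / (1.0 - beta1 ** t)  # bias correction for the zero initialization
            v_hat = v[i] / (1.0 - beta2 ** t)
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Illustrative cost J = (a - 3)**2 + 10 * (b + 1)**2 with very different curvature per parameter.
grad_J = lambda th: [2.0 * (th[0] - 3.0), 20.0 * (th[1] + 1.0)]

theta_adam = adam_minimize(grad_J, [0.0, 0.0])
print(theta_adam)  # should be near [3.0, -1.0]
```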

-----

## 5.3. Overfitting and Underfitting

It's crucial to address the concepts of overfitting and underfitting when training models. Underfitting is caused by high bias (oversimplified model), while overfitting is caused by high variance (overly complex model that captures noise). The goal is to find the right balance between bias and variance by selecting an appropriate model complexity that can capture the true patterns in the data without overfitting.

![Train the Model underfitting and overfitting](./images/5_train_the_model_underfit_overfit.png)

### 5.3.1. Overfitting (high variance and low bias)

Overfitting occurs when the model learns the training data too well, including the noise, and fails to generalize to new unseen data. This leads to poor performance on the test/validation set despite high training accuracy.

Variance refers to the amount that the model's predictions fluctuate when trained on different subsets of the training data. High variance indicates that the model is overly complex and sensitive to noise in the training data.

Possible reasons are:
- The model is too complex for the data (for example, a very deep decision tree or a very deep or wide neural network will often overfit);
- Too many features but a small number of training examples.
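
A small experiment can make the gap between training and test error visible. The sketch below uses a 1-nearest-neighbour predictor as a stand-in for an overly flexible model (this choice, the synthetic data, and all names are assumptions for illustration). Because it memorizes every training point, noise included, its training error is zero while its error on fresh data is not.

```python
import random

rng = random.Random(0)

def make_data(n):
    """Hypothetical data: y = 2x plus Gaussian noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 2.0) for x in xs]
    return xs, ys

def predict_1nn(x, train_x, train_y):
    """Return the target of the closest training point (pure memorization)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def mse(xs, ys, predict):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = make_data(30)
test_x, test_y = make_data(30)
model = lambda x: predict_1nn(x, train_x, train_y)

train_mse = mse(train_x, train_y, model)
test_mse = mse(test_x, test_y, model)
print(train_mse, test_mse)  # training error is zero; test error is clearly larger
```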

#### Methods to prevent overfitting

- **Regularization techniques**: Force the learning algorithm to build a less complex model. In practice, this often leads to slightly higher bias but significantly lower variance, a trade-off known in the literature as the "bias-variance tradeoff". Types:
    - **L1 (Lasso) Regularization**: Adds a penalty proportional to the sum of the absolute values of the weights to the loss, driving some weights to exactly zero and producing sparse models. In practice this acts as "feature selection" by deciding which features are essential for prediction and which are not.
    
    - **L2 (Ridge) Regularization**: Adds a penalty proportional to the sum of the squared weights to the loss, keeping all weights non-zero but small.
    
    - For **Neural Networks**:
    
        - **Dropout**: Randomly drops units from the neural network during training to prevent co-adaptation of features.
        
        - **Batch Normalization**: Normalizes layer inputs by subtracting the mini-batch mean and dividing by the mini-batch standard deviation. This enables higher learning rates, reduces internal covariate shift, and can speed up convergence and improve generalization when training deep neural networks.


- **Cross-validation**: Splitting the data into training, validation, and test sets, or rotating the validation role across $k$ folds of the training data (k-fold cross-validation). The validation data is used to tune hyperparameters and monitor for overfitting during training.

- **Early Stopping**: Stop training when the validation error starts increasing, even if the training error is still decreasing.
 
- **Data augmentation**: Increasing the size and diversity of the training data by applying transformations like flipping, rotating, or adding noise. This helps the model generalize better.

- **Reducing model complexity**: Using a simpler model with fewer parameters (linear instead of polynomial regression), a simpler kernel (linear kernel instead of RBF), or techniques like pruning to remove unnecessary connections (e.g. neural network with fewer layers/units).

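- **Ensemble methods**: Combining multiple models, for example through bagging or boosting, to reduce overfitting. Bagging in particular reduces variance by averaging models trained on different resamples of the data.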

- **Dimensionality reduction**: Reduce the dimensionality of the data being used, so the model has less "noise" that can be picked up. E.g. instead of using 4-D samples, apply PCA to reduce it to 2-D, and check whether the model generalizes better.
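
The different behaviour of the L1 and L2 penalties listed above can be seen in the simplest possible setting: a model with a single weight $w$ and no intercept, where both penalized problems have a closed-form solution. The objective scaling, function names, and data below are assumptions for this sketch. With the same penalty strength, L1 sets the weight of a weakly informative feature to exactly zero, while L2 only shrinks it.

```python
def soft_threshold(value, lam):
    """Shrink value towards zero by lam, clamping at zero."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso_weight(xs, ys, lam):
    # Minimizes 0.5 * sum((y - w*x)**2) + lam * |w| for a single weight w.
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, lam) / sxx

def ridge_weight(xs, ys, lam):
    # Minimizes 0.5 * sum((y - w*x)**2) + 0.5 * lam * w**2 for a single weight w.
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
weak_ys = [0.1, 0.2, 0.3]    # weakly informative feature (unregularized w = 0.1)
strong_ys = [2.0, 4.0, 6.0]  # strongly informative feature (unregularized w = 2.0)
lam = 2.0

print(lasso_weight(xs, weak_ys, lam), ridge_weight(xs, weak_ys, lam))      # L1 gives exactly 0.0
print(lasso_weight(xs, strong_ys, lam), ridge_weight(xs, strong_ys, lam))  # both shrink, neither is 0
```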

### 5.3.2. Underfitting (high bias and low variance)

Underfitting happens when the model is too simple and cannot capture the underlying patterns in the data, resulting in poor performance on both training and test sets.

Bias refers to the error introduced by overly simplistic assumptions in the learning algorithm. It is the inability of the model to capture the true underlying relationship between the input features and target variable.

Possible reasons are:
- The model is too simple for the data (for example a linear model can often underfit);
- The features you engineered are not informative enough.

#### Methods to prevent underfitting

- **Increasing model complexity**: Using a more complex model with more parameters or layers to capture the underlying patterns in the data.

- **Feature engineering**: Adding more relevant features or transforming existing ones to better represent the data.

- **Removing noise**: Cleaning and preprocessing the data to remove irrelevant or noisy features.

- **Increasing training time**: Training the model for more epochs or iterations to allow it to learn the patterns better.

- **Reducing regularization**: Decreasing the regularization strength if it is causing underfitting by overly constraining the model.
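
As an illustration of how feature engineering can address underfitting, the sketch below fits a straight line to data that actually follows $y = x^2$ (an assumed toy dataset; the helper names are hypothetical). On the raw feature the line cannot capture the pattern and the error stays high even on the training data; once the input is transformed to $x^2$, the same simple model fits perfectly.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    b = covariance / variance
    a = mean_y - b * mean_x
    return a, b

def mse(xs, ys, a, b):
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data following y = x**2.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x ** 2 for x in xs]

# Linear model on the raw feature: too simple for the pattern.
a, b = fit_line(xs, ys)
underfit_mse = mse(xs, ys, a, b)

# Same linear model on an engineered feature z = x**2.
squared = [x ** 2 for x in xs]
a2, b2 = fit_line(squared, ys)
engineered_mse = mse(squared, ys, a2, b2)

print(underfit_mse, engineered_mse)  # 12.0 0.0
```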