<details>
  <summary>Supervised Learning Steps</summary>
    
1. Data Collection
   * 1.1\. Data Sources
   * 1.2\. Data Collection Considerations
2. Data Exploration and Preparation
   * 2.1\. Data Exploration
   * 2.2\. Data Preparation/Cleaning
3. Split Data into Training and Test Sets
   * 3.1\. Holdout Method
   * 3.2\. Cross Validation
   * 3.3\. Data Leakage
   * 3.4\. Best Practices
4. Choose a Supervised Learning Algorithm
   * 4.1\. Consider algorithm categories
   * 4.2\. Evaluate algorithm characteristics
   * 4.3\. Try multiple algorithms
5. Train the Model
   * 5.1\. Objective Function (Loss/Cost Function)
   * 5.2\. Optimization Algorithms
   * 5.3\. Overfitting and Underfitting
6. Evaluate Model Performance
   * 6.1\. Evaluate Model Performance
   * 6.2\. Performance Metrics for Classification Models
   * 6.3\. Interpreting and Reporting Model Performance
7. Model Tuning and Selection
   * 7.1\. Hyperparameter Tuning
   * 7.2\. Ensemble Methods
</details>

# 5. Train the Model

![image.png](https://pbs.twimg.com/media/D3SwgeEWAAAaSEv.jpg)

## 5.1. Objective Function (Loss/Cost Function)

The objective function, also known as the loss or cost function, measures how well the model's predictions match the true labels in the training data. The goal is to minimize this function during training.

The selection of a loss function is not one-size-fits-all. It requires a deep understanding of the problem, the nature of the data, the distribution of the target variable, and the specific goals of the analysis.

### 5.1.1. Regression Example

In regression tasks, where the goal is to predict a continuous value, the difference between the predicted and actual values is of primary concern. Common loss functions for regression include:

**Mean Squared Error (MSE)**: $\frac{1}{n} \sum_{i=1}^n (y_i-\hat{y}_i)^2$

- Suitable for problems where large errors are particularly undesirable since they are squared and thus have a disproportionately large impact. The squaring operation amplifies larger errors.

- Examples where MSE is preferred over MAE:
    - **Medical diagnosis**: In medical applications, large errors in diagnosis or treatment can have severe consequences. MSE heavily penalizes large errors, making it a more appropriate metric for such critical domains.
    - **Financial risk management**: In finance, large errors in risk estimation or portfolio optimization can lead to substantial losses. MSE's emphasis on large errors makes it a better choice for managing financial risks.
    - **Structural engineering**: In structural design, large errors in load or stress calculations can lead to catastrophic failures. MSE's sensitivity to large errors is desirable for ensuring safety margins.
    - **Image and signal processing**: In applications like image compression or signal denoising, large errors can significantly degrade the output quality. MSE is commonly used here (it is the basis of the PSNR metric), although it is only a rough proxy for perceived quality.

**Mean Absolute Error (MAE)**: $\frac{1}{n} \sum_{i=1}^n |y_i-\hat{y}_i|$

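- Useful when the cost of an error grows linearly with its size. Each error contributes in proportion to its magnitude, so a few large errors (outliers) do not dominate the loss.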

- Examples where MAE is preferred over MSE:
    - **Forecasting sales or revenue**: In business settings, large errors in forecasting sales or revenue may not be disproportionately worse than smaller errors, as long as the overall trend is captured accurately. MAE weights each error by its size rather than its square, making it a suitable metric in such cases.
    - **Measuring sensor errors**: When dealing with sensor data, large errors or outliers may be caused by temporary malfunctions or noise. MAE is more robust to such outliers and provides a better measure of the typical error.
    - **Evaluating navigation systems**: In navigation applications, small and large errors in distance estimation may have similar consequences (e.g., missing a turn). MAE captures the average error without heavily penalizing large deviations.
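
The difference between the two regression losses can be checked numerically. The sketch below is a minimal NumPy illustration, not a reference implementation; the function names `mse` and `mae` and the toy arrays are assumptions made up for this example. It computes both losses twice: once when every prediction is off by 1, and once when a single prediction is off by 10.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical targets and predictions; every prediction is off by exactly 1.
y_true = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
y_pred = np.array([11.0, 11.0, 15.0, 15.0, 19.0])
mse_clean, mae_clean = mse(y_true, y_pred), mae(y_true, y_pred)

# Same predictions, except the last one is now off by 10 (one large error).
y_pred_outlier = y_pred.copy()
y_pred_outlier[-1] = 28.0
mse_outlier = mse(y_true, y_pred_outlier)
mae_outlier = mae(y_true, y_pred_outlier)

print(f"MSE: {mse_clean:.1f} -> {mse_outlier:.1f}")  # MSE: 1.0 -> 20.8
print(f"MAE: {mae_clean:.1f} -> {mae_outlier:.1f}")  # MAE: 1.0 -> 2.8
```

One large error multiplies the MSE by about 21 but the MAE by less than 3, which is the sensitivity difference described above.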


### 5.1.2. Classification Example

In classification tasks, where the goal is to categorize inputs into classes, the focus is on the discrepancy between the predicted class probabilities and the actual class labels. Common loss functions for classification include:

**Log Loss (Logistic Loss)**: $L(y, f(x)) = -[y \log(f(x)) + (1 - y) \log(1 - f(x))]$

- Typically used for binary classification problems, where the goal is to predict the probability of an instance belonging to one of two classes. Where:
    - y is the true binary label (0 or 1)
    - f(x) is the predicted probability of the positive class (between 0 and 1)

- Some examples include: Email spam detection (spam or not spam); Credit risk modeling (default or not default); Disease diagnosis (diseased or healthy); Fraud detection (fraudulent or legitimate transaction).
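
As a rough sketch of the formula above, the following NumPy function averages the log loss over a batch. The name `binary_log_loss`, the clipping constant `eps`, and the toy labels are assumptions for illustration; the clipping only guards against `log(0)` and is not part of the definition.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-12):
    """Average log loss for labels in {0, 1} and predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so the logarithms stay finite.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))

y = [1, 0, 1, 0]  # hypothetical true labels

# Confident and correct predictions give a small loss...
loss_right = binary_log_loss(y, [0.9, 0.1, 0.9, 0.1])
# ...while confident and wrong predictions are penalized heavily.
loss_wrong = binary_log_loss(y, [0.1, 0.9, 0.1, 0.9])

print(loss_right, loss_wrong)
```

Here every example in the first call contributes $-\log(0.9)$ and every example in the second contributes $-\log(0.1)$, so the second loss is roughly twenty times larger.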

**Hinge Loss**: $L(y, f(x)) = \max(0, 1 - y \cdot f(x))$

- Typically used in maximum-margin classification problems, particularly with Support Vector Machines (SVMs). It is suitable for binary classification tasks where the goal is to maximize the margin between the two classes. Where:
    - y is the true label or target value (-1 or 1)
    - f(x) is the predicted value or decision function output
 
- Some examples include: Text classification (e.g., sentiment analysis); Image classification (e.g., object detection); Bioinformatics (e.g., protein classification); Anomaly detection.
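
The hinge loss formula can be sketched in a few lines of NumPy. The function name `hinge_loss` and the example scores are assumptions for illustration. The point to notice is that examples classified correctly with a margin of at least 1 contribute zero loss.

```python
import numpy as np

def hinge_loss(y_true, scores):
    """Per-example hinge loss for labels in {-1, +1} and raw decision scores."""
    y = np.asarray(y_true, dtype=float)
    s = np.asarray(scores, dtype=float)
    return np.maximum(0.0, 1.0 - y * s)

y = np.array([1, 1, -1, -1])               # hypothetical true labels
scores = np.array([2.0, 0.5, -3.0, 0.2])   # hypothetical decision function outputs

losses = hinge_loss(y, scores)
# Example 1: correct, margin 2.0  -> loss 0.0
# Example 2: correct, margin 0.5  -> loss 0.5 (inside the margin)
# Example 3: correct, margin 3.0  -> loss 0.0
# Example 4: wrong side (y*f = -0.2) -> loss 1.2
print(losses)
```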

**Cross-Entropy Loss**: $L = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})$

- Generalization of Log Loss for multiclass classification problems, where instances can belong to one of several classes. Where:
    - M is the number of classes
    - $y_{o,c}$ is a binary indicator (0 or 1) that equals 1 if class label c is the correct classification for observation o
    - $p_{o,c}$ is the predicted probability that observation o is of class c

- Some examples include: Image classification (e.g., classifying images into multiple categories); Natural language processing (e.g., text categorization, language modeling); Speech recognition; Recommender systems (e.g., predicting user preferences among multiple items).
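
A minimal NumPy sketch of the multiclass formula follows. It assumes the model outputs raw scores (logits) that are turned into probabilities with a softmax, which is a common setup but is not stated in the text above; the names `softmax` and `cross_entropy` and the toy logits are likewise assumptions.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    z = logits - np.max(logits, axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(y_onehot, probs, eps=1e-12):
    """Per-observation cross-entropy: -sum_c y[o, c] * log(p[o, c])."""
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    return -np.sum(y_onehot * np.log(probs), axis=1)

# Two hypothetical observations and M = 3 classes.
logits = np.array([[2.0, 1.0, 0.1],    # model favors class 0
                   [0.0, 0.0, 0.0]])   # model is completely undecided
labels = np.array([0, 2])              # true class indices
y_onehot = np.eye(3)[labels]           # one-hot encoding of the labels

probs = softmax(logits)
losses = cross_entropy(y_onehot, probs)
print(losses)
```

For the undecided row every class gets probability 1/3, so its loss is $\log 3$. With $M = 2$ the same sum reduces to the binary log loss.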

-----

## 5.2. Optimization Algorithms

To find the model parameters (weights and biases) that minimize the loss function, optimization algorithms are used. Some popular ones:

- **Gradient Descent**: Updates parameters in the direction of the negative gradient of the loss function.
- **Stochastic Gradient Descent (SGD)**: Estimates the gradient from a single example or subset of examples instead of the full dataset, allowing faster iterations.
- **Adam Optimizer**: An extension of SGD that adapts the learning rate for each parameter using running estimates of the first and second moments of the gradient, which often leads to faster convergence.
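
To make the first two update rules concrete, here is a minimal sketch that fits a linear model under MSE with full-batch gradient descent and with mini-batch SGD. The synthetic data, learning rates, step counts, and function names are all assumptions chosen for illustration; Adam is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for this sketch): y = 3x + 2 plus small Gaussian noise.
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0.0, 0.1, size=200)
Xb = np.hstack([X, np.ones((200, 1))])  # append a constant column for the bias

def mse_gradient(w, Xb, y):
    """Gradient of the mean squared error with respect to the weights w."""
    residual = Xb @ w - y
    return 2.0 * Xb.T @ residual / len(y)

def gradient_descent(Xb, y, lr=0.1, steps=500):
    """Full-batch gradient descent: every step uses the whole dataset."""
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        w -= lr * mse_gradient(w, Xb, y)  # step along the negative gradient
    return w

def minibatch_sgd(Xb, y, lr=0.05, epochs=50, batch_size=20, seed=1):
    """Mini-batch SGD: every step estimates the gradient from a random subset."""
    batch_rng = np.random.default_rng(seed)
    w = np.zeros(Xb.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = batch_rng.permutation(n)  # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w -= lr * mse_gradient(w, Xb[idx], y[idx])
    return w

w_gd = gradient_descent(Xb, y)
w_sgd = minibatch_sgd(Xb, y)
print("GD  [slope, bias]:", w_gd)
print("SGD [slope, bias]:", w_sgd)
```

Both runs should land close to the slope 3 and bias 2 used to generate the data. Each SGD step touches only 20 examples, which is what makes its iterations cheaper, at the price of a noisier path.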

-----

## 5.3. Overfitting and Underfitting

It's crucial to address the concepts of overfitting and underfitting when training models. Underfitting is caused by high bias (oversimplified model), while overfitting is caused by high variance (overly complex model that captures noise). The goal is to find the right balance between bias and variance by selecting an appropriate model complexity that can capture the true patterns in the data without overfitting.

### 5.3.1. Overfitting (high variance and low bias)

Overfitting occurs when the model learns the training data too well, including the noise, and fails to generalize to new unseen data. This leads to poor performance on the test/validation set despite high training accuracy.

Variance refers to the amount that the model's predictions fluctuate when trained on different subsets of the training data. High variance indicates that the model is overly complex and sensitive to noise in the training data.
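
This definition of variance can be estimated empirically. The sketch below is an illustration under assumed settings (a noisy sine curve, a pool of 200 points, random subsets of 30, polynomial degrees 1 and 9): it refits a simple and a complex polynomial on many random subsets and measures how much their predictions at fixed inputs move around.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data pool: a sine curve observed with Gaussian noise.
x_pool = rng.uniform(-1.0, 1.0, size=200)
y_pool = np.sin(np.pi * x_pool) + rng.normal(0.0, 0.5, size=200)

# Fixed inputs at which predictions from different fits are compared.
x_grid = np.linspace(-1.0, 1.0, 50)

def prediction_variance(degree, n_repeats=200, subset_size=30):
    """Average variance of predictions across fits on random training subsets."""
    preds = np.empty((n_repeats, x_grid.size))
    for r in range(n_repeats):
        idx = rng.choice(x_pool.size, size=subset_size, replace=False)
        coeffs = np.polyfit(x_pool[idx], y_pool[idx], deg=degree)
        preds[r] = np.polyval(coeffs, x_grid)
    # Variance over the repeated fits at each grid point, averaged over the grid.
    return float(preds.var(axis=0).mean())

var_simple = prediction_variance(degree=1)
var_complex = prediction_variance(degree=9)
print("degree 1 prediction variance:", var_simple)
print("degree 9 prediction variance:", var_complex)
```

Under these assumptions the degree-9 model's predictions fluctuate much more from subset to subset than the straight line's, which is the high-variance behavior associated with overfitting.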

Possible reasons are:
- The model is too complex for the data (for example, a very deep decision tree or a very deep or wide neural network often overfits);
- Too many features but a small number of training examples.

<u>**Methods to prevent overfitting**:</u>

- **Regularization techniques**: Force the learning algorithm to build a less complex model, typically by adding a penalty on the size of the weights to the loss function. In practice, this often leads to slightly higher bias but significantly lower variance, a trade-off known in the literature as the "bias-variance tradeoff". Types:
    - **L1 (Lasso) Regularization**: Adds a penalty proportional to the sum of the absolute values of the weights, which drives some weights to exactly zero and yields sparse models. In practice this acts as "feature selection" by deciding which features are essential for prediction and which are not.
    
    - **L2 (Ridge) Regularization**: Adds a penalty proportional to the sum of the squared weights, which shrinks all weights toward zero but generally leaves them non-zero.
    
    - For **Neural Networks**:
    
        - **Dropout**: Randomly drops units from the neural network during training to prevent co-adaptation of features.
        
        - **Batch Normalization**: Normalizes layer inputs by subtracting the mini-batch mean and dividing by the mini-batch standard deviation. This enables higher learning rates and faster convergence when training deep neural networks and often improves generalization; it was originally motivated as a way to reduce internal covariate shift.


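- **Cross-validation**: Repeatedly splitting the training data into training and validation folds (for example, k-fold), so that each example is used for validation exactly once. The validation results are used to tune hyperparameters and to detect overfitting, while a separate test set is kept for the final evaluation.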

- **Early Stopping**: Stop training when the validation error starts increasing, even if the training error is still decreasing.
 
- **Data augmentation**: Increasing the size and diversity of the training data by applying transformations like flipping, rotating, or adding noise. This helps the model generalize better.

- **Reducing model complexity**: Using a simpler model with fewer parameters (linear instead of polynomial regression, or a neural network with fewer layers/units), a simpler kernel (linear kernel instead of RBF), or techniques like pruning to remove unnecessary branches or connections.

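- **Ensemble methods**: Combining multiple models. Bagging averages models trained on resampled versions of the data and mainly reduces variance. Boosting adds weak learners sequentially and mainly reduces bias, so it can still overfit if run for too many rounds.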

- **Dimensionality reduction**: Reduce the dimensionality of the data being used, so the model has less "noise" that can be picked up. E.g. instead of using 4-D samples, apply PCA to reduce it to 2-D, and check whether the model generalizes better.
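
The different effects of the L1 and L2 penalties listed above can be seen in a special case with closed-form solutions. The sketch below assumes orthonormal features ($X^\top X = I$), objectives $\tfrac{1}{2}\|y - Xw\|^2 + \lambda\|w\|_1$ and $\tfrac{1}{2}\|y - Xw\|^2 + \tfrac{\lambda}{2}\|w\|_2^2$, and a made-up vector of unregularized weights `w_ols`. Under those assumptions the L1 solution is a soft-thresholding of the unregularized weights and the L2 solution is a uniform rescaling; with correlated features neither shortcut applies and an iterative solver is needed.

```python
import numpy as np

def soft_threshold(w, lam):
    """L1 solution under orthonormal features: shrink by lam, clip at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

# Hypothetical unregularized (least-squares) weights.
w_ols = np.array([3.0, -1.5, 0.4, -0.2, 0.05])
lam = 0.5  # regularization strength (assumed value)

w_l1 = soft_threshold(w_ols, lam)  # weights with |w| <= lam become exactly zero
w_l2 = w_ols / (1.0 + lam)         # every weight shrinks by the same factor

print("L1:", w_l1, "non-zero weights:", np.count_nonzero(w_l1))
print("L2:", w_l2, "non-zero weights:", np.count_nonzero(w_l2))
```

The L1 solution keeps only the two largest weights, which is the feature-selection effect, while the L2 solution keeps all five weights but makes each one smaller.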

### 5.3.2. Underfitting (high bias and low variance)

Underfitting happens when the model is too simple and cannot capture the underlying patterns in the data, resulting in poor performance on both training and test sets.

Bias refers to the error introduced by overly simplistic assumptions in the learning algorithm. It is the inability of the model to capture the true underlying relationship between the input features and target variable.

Possible reasons are:
- The model is too simple for the data (for example, a linear model fitted to a strongly nonlinear relationship will underfit);
- The features you engineered are not informative enough.

<u>**Methods to prevent underfitting**:</u>

- **Increasing model complexity**: Using a more complex model with more parameters or layers to capture the underlying patterns in the data.

- **Feature engineering**: Adding more relevant features or transforming existing ones to better represent the data.

- **Removing noise**: Cleaning and preprocessing the data to remove irrelevant or noisy features.

- **Increasing training time**: Training the model for more epochs or iterations to allow it to learn the patterns better.

- **Reducing regularization**: Decreasing the regularization strength if it is causing underfitting by overly constraining the model.
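
A small sketch of the first remedy, increasing model complexity, is shown below. The quadratic toy data, noise level, and helper name `training_mse` are assumptions for illustration. A straight line cannot follow the curvature, so even its training error stays high, while adding a squared term removes most of that error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data with a quadratic relationship plus Gaussian noise.
x = np.linspace(-3.0, 3.0, 100)
y = x ** 2 + rng.normal(0.0, 0.5, size=x.size)

def training_mse(degree):
    """Fit a polynomial of the given degree and return its MSE on the training data."""
    coeffs = np.polyfit(x, y, deg=degree)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

mse_linear = training_mse(1)     # too simple: underfits
mse_quadratic = training_mse(2)  # matches the shape of the data

print("degree 1 training MSE:", mse_linear)
print("degree 2 training MSE:", mse_quadratic)
```

High error on the training set itself is the signature of underfitting; an overfit model would instead show low training error and high validation error.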