# Machine Learning Background

  Developed by **Armin Norouzi** and **Farhad Davaripour**

## 1. Machine Learning Background

### 1.1 Machine Learning definition

Machine learning (ML) is the study of computer algorithms that improve automatically through experience and the use of data. It is considered a subfield of artificial intelligence (AI). ML algorithms build a model from sample data, referred to as training data, in order to make predictions or decisions without being explicitly programmed to do so. They are used in a wide variety of applications, including medicine, robotics, email filtering, speech recognition, and computer vision, where developing conventional algorithms for the required tasks is difficult or infeasible. Note that, for the foreseeable future, the primary purpose of using AI/ML will be to augment human intelligence rather than to replace it, which is why it is recommended to use the term augmented intelligence in place of artificial intelligence.

In general, ML can be divided into three main categories: 
1. **Supervised learning:** Supervised learning algorithms build a mathematical model from a set of labelled data (i.e., data containing both the inputs and the known outputs). Supervised learning algorithms can be further categorized into shallow learning algorithms, such as linear regression and logistic regression, and deep learning algorithms, such as neural networks.
2. **Unsupervised learning:** Unsupervised learning algorithms take a set of unlabelled data that contains only inputs and find structure within the data, e.g., by grouping/clustering the data into a certain number of groups/clusters.
3. **Reinforcement learning:** Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.

The scope of this course is limited to (supervised) shallow learning and unsupervised learning; (supervised) deep learning and reinforcement learning are not discussed.

### 1.2 Machine Learning Workflow
The ML workflow could be categorized into the following steps:

1. **Retrieving data:** This step refers to collecting data from multiple sources. As the ML model performance is greatly impacted by the data quality, it is of vital importance to collect reliable and validated data. Note that data validation involves confirming the data quality/distribution with the Subject Matter Experts (SMEs), which is done by the data readiness team.
2. **Data preparation (preprocessing):** This step involves (i) data cleaning, i.e., handling missing (also called null) and duplicate values; (ii) checking data types and performing type conversions; and (iii) Exploratory Data Analysis (EDA), i.e., gaining insights from the data distribution using visualization techniques. Note that it is strongly advised to analyze duplicate or null values before dropping them from the dataset. This step is discussed in more detail in Section 1.4.
3. **Feature engineering:** Features are characteristics, or independent variables, used to predict the dependent (target) variable. Feature engineering, which follows data preparation, is the practice of adding new features or adjusting existing ones in order to enhance ML model performance. New features can be built with the goal of simplifying and speeding up data transformations while improving model accuracy. Feature engineering is essential because, regardless of the data type or the model architecture, the quality of the features has a direct effect on model performance.
4. **Model (re)training:** This step involves model selection, model evaluation, and hyperparameter tuning, where:
(i) Model selection is an iterative process of assessing different ML algorithms and finding the one leading to the best model performance.
(ii) Model evaluation includes splitting the processed data into the train and test sets. The model is trained against the training set and evaluated against the test set. The model evaluation is done using different statistical methods and metrics. 
(iii) Hyperparameter tuning is the process of identifying the combination of hyperparameters (settings of the ML algorithm that are chosen before training rather than learned from the data) leading to the highest model performance.
5. **Inference:** This step employs the selected ML model with optimized hyperparameters to make a prediction on the (incoming/live) data the model has never seen before. 
6. **Deployment:** In this step, the ML model becomes available to the end user. This step involves monitoring the ML model as well as the incoming data distribution. The ML model needs to be retrained (step 4) at fixed time intervals (e.g., every quarter), when there is data drift (a noticeable difference between the incoming data distribution and the distribution of the training dataset), or when there is a significant shift in the model evaluation metrics.

### 1.3 Data Definitions

#### 1.3.1 Data Types
The following list describes four commonly used data types in ML applications:

1. **Numerical data:** Numerical data is quantitative (discrete or continuous) data that can be measured, such as age or salary.
2. **Categorical data:** Categorical data is generally of the string type and is used to group a dataset into multiple buckets, for instance, based on gender or ethnicity.
3. **Time series:** Time series data consists of data points recorded over a period of time, often at a fixed frequency (equally spaced time intervals), for instance, the average temperature in Calgary every 5 minutes. An important characteristic of time series data is the correlation between the value at each time and the values at previous timestamps.
4. **Text data:** Text data encompasses words, sentences, and paragraphs.

All the examples in this course are provided using the first two data types (numerical and categorical).

#### 1.3.2 Data Structure
Data can be unstructured (e.g., multimedia) or structured (e.g., time and date). The former comes in a wide range of formats, whereas the latter follows a predefined format. Hence, structured data can be stored in traditional databases (e.g., relational databases), whereas storing unstructured data is more challenging and often requires more expertise.

### 1.4 Data Preparation

The data preparation step covers the following tasks.

1. **Imputation**

The primary objective of imputation is to deal with missing values, which can arise from a variety of circumstances, including human error, data flow interruptions, and privacy concerns. The imputation step is crucial, as missing values can degrade the performance of ML models.
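As a minimal sketch of one possible imputation strategy, the snippet below fills missing entries with the median of the observed values. The helper name `impute_median`, the use of `None` to mark missing values, and the sample ages are assumptions made for illustration; other strategies (mean, mode, model-based) may suit a given dataset better.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill_value = median(observed)
    return [fill_value if v is None else v for v in values]


# Hypothetical feature with two missing entries.
ages = [25, None, 31, 40, None, 28]
imputed_ages = impute_median(ages)
print(imputed_ages)  # [25, 29.5, 31, 40, 29.5, 28]
```

The median is used here because it is less sensitive to extreme values than the mean.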

2. **Handling Outliers**

Outlier handling, which is closely related to unsupervised anomaly detection, refers to identifying and treating outliers in a dataset. The effect of handling outliers can be significant, depending on the ML algorithm; for instance, linear regression is highly susceptible to outliers. The primary techniques for dealing with outliers are listed below:

**Removal:** Samples identified as outliers are removed from the dataset. The downside of this method is that if outliers occur across numerous features, a large portion of the data is lost.

**Replacing values:** Alternatively, outliers could be treated as missing values and replaced using a proper imputation technique.

**Capping:** A minimum and a maximum threshold are set for the data range, and values outside that range are replaced with the nearest threshold. Note that the thresholds should be determined by subject matter experts (SMEs).
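The removal and capping options can be sketched as follows. Since no SME-provided thresholds are available here, the bounds come from the common 1.5 × IQR rule, which is an assumption of this example and not a recommendation from the course; the function names and sample data are likewise hypothetical.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) thresholds using the k * IQR rule (an assumed stand-in for SME limits)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, low, high):
    """Removal: drop every value outside [low, high]."""
    return [v for v in values if low <= v <= high]


def cap_outliers(values, low, high):
    """Capping: clip every value into [low, high]."""
    return [min(max(v, low), high) for v in values]


data = [10, 12, 11, 13, 12, 11, 95]
low, high = iqr_bounds(data)
print(low, high)                         # 8.0 16.0
print(remove_outliers(data, low, high))  # [10, 12, 11, 13, 12, 11]
print(cap_outliers(data, low, high))     # [10, 12, 11, 13, 12, 11, 16.0]
```

Removal shrinks the dataset, whereas capping keeps every sample but changes the extreme values.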

**Fourier Transformation:** This method is primarily used in time series problems where the data is provided for a particular period of time, generally in consistent time intervals. Using the Fourier transform, the data could be converted into the frequency domain, and the frequencies with minimal spectral power are excluded from the data. The data will then be converted back into the time domain using an inverse Fourier transform.
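A rough sketch of this idea is given below, using a naive discrete Fourier transform written with the standard library so that the example is self-contained (in practice an FFT routine such as `numpy.fft.fft` would normally be used). The `keep_ratio` parameter, the function names, and the synthetic signal with a single spike are assumptions made for illustration.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def inverse_dft(spectrum):
    """Naive inverse DFT, returning the real part of the reconstruction."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def fourier_filter(signal, keep_ratio=0.1):
    """Zero out frequencies whose power is below keep_ratio * (maximum power)."""
    spectrum = dft(signal)
    power = [abs(c) ** 2 for c in spectrum]
    threshold = keep_ratio * max(power)
    filtered = [c if p >= threshold else 0 for c, p in zip(spectrum, power)]
    return inverse_dft(filtered)


# Synthetic time series: two cycles of a sine wave with one outlier spike.
n = 64
clean = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
noisy = list(clean)
noisy[20] += 3.0

smoothed = fourier_filter(noisy)
print(round(noisy[20] - clean[20], 3))     # size of the spike before filtering
print(round(smoothed[20] - clean[20], 3))  # much smaller residual after filtering
```

The spike spreads its energy thinly over all frequencies, so discarding the low-power frequencies removes most of it while keeping the dominant periodic component.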

3. **Log Transform**

Log Transform is used to handle/compress variables that span several orders of magnitude (e.g., loan or income) or to turn a skewed distribution into a normal or less-skewed distribution.   

$ (x, y) \Rightarrow (\log(x), \log(y)) $
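The compression effect can be illustrated with a short sketch. The income values below are made up, and the natural logarithm is assumed; note that the logarithm is only defined for positive values, so data containing zeros is often transformed with `math.log1p` (the logarithm of 1 + x) instead.

```python
import math

# Hypothetical incomes spanning several orders of magnitude.
incomes = [20_000, 35_000, 50_000, 120_000, 2_500_000]

log_incomes = [math.log(v) for v in incomes]

print(max(incomes) / min(incomes))                   # 125.0 (ratio of largest to smallest raw value)
print(round(max(log_incomes) - min(log_incomes), 3)) # 4.828 (spread after the log transform)
```

The transform is invertible with `math.exp`, so predictions made on the log scale can be mapped back to the original units.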

4. **One-hot encoding**

One-hot encoding is the conversion of categorical data into numeric values so that ML and statistical algorithms can be applied to the data. It creates one new binary (dummy) feature for every unique category within a variable; each dummy feature takes the value 0 or 1 (zero means False and one means True).
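A minimal, library-free sketch of one-hot encoding is shown below; the helper name `one_hot_encode`, the `is_` prefix for the dummy columns, and the sample colours are assumptions for illustration. In practice a library routine such as `pandas.get_dummies` is typically used.

```python
def one_hot_encode(values):
    """Return the sorted categories and one dict of 0/1 dummy features per input value."""
    categories = sorted(set(values))
    rows = [{f"is_{c}": int(v == c) for c in categories} for v in values]
    return categories, rows


colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)

print(categories)  # ['blue', 'green', 'red']
print(encoded[0])  # {'is_blue': 0, 'is_green': 0, 'is_red': 1}
```

Each row contains exactly one 1, in the column of the category that the original value belonged to.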

5. **Scaling**

Many ML algorithms are sensitive to the range of the feature values, so numeric features are commonly rescaled to a comparable range before training a predictive model.

**Normalization:** Normalization is a technique employed to bring numeric features into a similar range. For instance, min-max normalization scales all values to the range between 0 and 1.

$ x_n = \frac{x- x_{min}}{x_{max}-x_{min}} $

**Standardization:** Standardization (or z-score normalization) is the process of centring values around 0 and rescaling them so that the standard deviation $\sigma$ becomes 1. In other words, the mean $\mu$ is subtracted from each data point, and the result is divided by the standard deviation.

$ x_s = \frac{x- \mu}{\sigma} $
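Both scaling formulas can be sketched directly with the standard library. The function names and salary values are made up for illustration, and the population standard deviation (`statistics.pstdev`) is assumed for $\sigma$; a constant feature would cause a division by zero in either function and needs separate handling.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Scale values to [0, 1] using (x - x_min) / (x_max - x_min)."""
    x_min, x_max = min(values), max(values)
    return [(v - x_min) / (x_max - x_min) for v in values]


def standardize(values):
    """Centre values at 0 with unit standard deviation using (x - mu) / sigma."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed here)
    return [(v - mu) / sigma for v in values]


salaries = [30_000, 45_000, 60_000, 90_000]

normalized = min_max_normalize(salaries)
standardized = standardize(salaries)

print(normalized)  # [0.0, 0.25, 0.5, 1.0]
print([round(z, 3) for z in standardized])
```

After standardization the values have a mean of approximately 0 and a standard deviation of approximately 1, up to floating-point rounding.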

### 1.5 Model training 
The term “model training” refers to minimizing the loss function (e.g., mean squared error) for a given ML algorithm over a set of data using an optimization method (e.g., gradient descent). Considering a linear model $f(x) = \theta^T x$, the goal is to find the combination of model parameters ($\theta$) that leads to the lowest cost and, consequently, the highest accuracy.

#### 1.5.1 Loss (cost) function
The loss (cost) function $L(Y, f(x))$ is a mathematical function used to penalize the deviation between the ML predictions and the actual values. The most common loss function is the mean squared error (MSE), which averages the squared error over the $n$ training examples:
$$
L(Y, f(x)) = \frac{1}{n} \sum_{i=1}^{n} \left(Y_i - f(x_i)\right)^2
$$
The MSE is often preferred over the mean absolute error because it is differentiable everywhere and amplifies large errors. In both cases, positive and negative errors cannot cancel each other out.
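To make the effect of squaring concrete, the sketch below computes the mean squared error and, for comparison, the mean absolute error on a small made-up example (the function names and numbers are illustrative assumptions).

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 11.0]  # errors: +1, 0, -4

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)

print(round(mse, 3))  # 5.667
print(round(mae, 3))  # 1.667
```

The single large error of 4 contributes 16 of the 17 units of total squared error, which shows how squaring amplifies large deviations.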

#### 1.5.2 Optimization method
The most commonly used optimization algorithm is gradient descent. Gradient descent iteratively searches for a local minimum by moving in the opposite direction of the (approximate) gradient of the loss function. In other words, the model parameters ($\theta$) are updated according to the gradient of the cost function in a way that reduces the cost (prediction error). For instance, if there is only one feature (independent variable) in the model, then $f(x) = \theta_0 + \theta_1 x$. Using gradient descent, both $\theta_0$ and $\theta_1$ are updated at each iteration using the equations below, where $\alpha$ is the learning rate (step size), until a local minimum is reached:
$$
\theta_0 = \theta_0 - \alpha \frac{\partial}{\partial \theta_0} L(\theta_0, \theta_1)
$$
$$
\theta_1 = \theta_1 - \alpha \frac{\partial}{\partial \theta_1} L(\theta_0, \theta_1)
$$
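The two update equations can be sketched for a single-feature linear model with an MSE loss as follows. The function name, the learning rate `alpha=0.05`, the iteration count, and the toy data (generated from $y = 1 + 2x$) are assumptions chosen so that this small example converges; they are not tuned recommendations.

```python
def gradient_descent(xs, ys, alpha=0.05, n_iters=2000):
    """Fit f(x) = theta0 + theta1 * x by minimizing the mean squared error."""
    theta0, theta1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iters):
        # Residuals f(x_i) - y_i under the current parameters.
        residuals = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to theta0 and theta1.
        grad0 = 2 / n * sum(residuals)
        grad1 = 2 / n * sum(r * x for r, x in zip(residuals, xs))
        # Both gradients are computed before either parameter is updated.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 1 + 2x

theta0, theta1 = gradient_descent(xs, ys)
print(round(theta0, 4), round(theta1, 4))  # 1.0 2.0
```

If the learning rate is too large, the updates overshoot and diverge; if it is too small, many more iterations are needed.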

### 1.6 Model assessment 
When sufficient data is available, the best practice is to split the data randomly into three sets: a training set, a (cross-)validation set, and a test (hold-out) set. The training set is used to train the ML model. The validation set is used for model selection, as it provides an estimate of the prediction error of each candidate model. Finally, the test set is used to estimate the generalization error of the selected model on data that the model has never seen. The proportion of data allocated to each set varies with the problem; however, as a rule of thumb, 60%, 20%, and 20% of the data are allocated to the training, validation, and test sets, respectively. Note that in a random split, special attention should be given to duplicate values, as the same example could appear in more than one of the three sets, which is considered data leakage and makes the measured model performance unrealistically optimistic.
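A 60/20/20 random split can be sketched as below. The function name, the fixed seed, and the toy rows are assumptions for illustration, and exact duplicates are simply dropped before splitting to avoid leakage; as noted in Section 1.2, duplicates should normally be analyzed before being discarded.

```python
import random


def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=42):
    """Drop exact duplicates, shuffle, and split rows into train/validation/test lists."""
    unique_rows = list(dict.fromkeys(rows))  # keeps first occurrence; rows must be hashable
    rng = random.Random(seed)
    rng.shuffle(unique_rows)
    n_test = int(len(unique_rows) * test_frac)
    n_val = int(len(unique_rows) * val_frac)
    test = unique_rows[:n_test]
    val = unique_rows[n_test:n_test + n_val]
    train = unique_rows[n_test + n_val:]
    return train, val, test


# Ten distinct (feature, label) rows plus two exact duplicates.
rows = [(i, i % 3) for i in range(10)] + [(0, 0), (1, 1)]
train, val, test = train_val_test_split(rows)

print(len(train), len(val), len(test))  # 6 2 2
```

Because duplicates are removed first, no row can end up in more than one of the three sets.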

#### 1.6.1 Tuning model: Bias and Variance trade-off

A key step in assessing model prediction is to check for bias, variance, and model complexity. Note that there is a trade-off between minimizing bias and minimizing variance. Hence, the best model performance is obtained by balancing these two sources of error, which leads to an optimal model complexity.

A short description of the bias and variance is provided below.

1. **Bias:** Bias corresponds to the systematic discrepancy between model predictions and actual values. High bias means the model cannot capture the underlying structure of the data; hence, the model underfits (is oversimplified), leading to a high error on the training, validation, and test sets.

2. **Variance:** Variance corresponds to the sensitivity of the model to the particular training data it was fitted on. A model with high variance follows the training data too closely, including its noise, and as a result, the prediction error on the training set is minimal. However, this leads to a high generalization error on the test set.

Based on the trade-off between bias and variance, a model could be underfitting or overfitting. Underfitting is the condition in which the model has high bias and low variance. It could be due to insufficient training data or the use of an inappropriate model (e.g., a linear model for nonlinear data). Conversely, if a model has high variance and low bias, the model overfits. Such models attempt to follow the noise and outliers in the data. Overfitting could be due to using an overly complex ML algorithm (e.g., using a high-degree polynomial model to fit linear data) or having a high number of outliers within the data. Note that some ML algorithms are more prone to overfitting or underfitting, which will be discussed in the following lectures.
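The two extremes can be illustrated with a small experiment. The sketch below uses a k-nearest-neighbour regressor purely as a convenient stand-in (it is not one of the algorithms covered above): with `k=1` the model memorizes the training set (high variance), whereas with `k` equal to the training set size it always predicts the training mean (high bias). The synthetic data, function names, and values of `k` are assumptions made for illustration.

```python
import math
import random


def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def knn_mse(xs, ys, train_x, train_y, k):
    """Mean squared error of the k-NN predictions on the points (xs, ys)."""
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


def make_data(rng, n):
    """Synthetic data: y = sin(x) plus Gaussian noise (made up for this demo)."""
    xs = [rng.uniform(0, 6) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_data(rng, 60)
val_x, val_y = make_data(rng, 60)

results = {}
for k in (1, 7, 60):
    train_error = knn_mse(train_x, train_y, train_x, train_y, k)
    val_error = knn_mse(val_x, val_y, train_x, train_y, k)
    results[k] = (train_error, val_error)
    print(f"k={k:2d}  train MSE={train_error:.3f}  validation MSE={val_error:.3f}")
```

With `k=1` the training error is exactly zero while the validation error is not, which is the signature of overfitting; with `k=60` the errors on both sets are typically high, which is the signature of underfitting. An intermediate `k` usually gives the lowest validation error.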

Figure 1 shows the interaction between bias, variance, and model complexity and their effect on the prediction error on the training versus the test set. Note that the intersection between the two curves is where the bias and variance of the model are in good balance, and the model neither overfits nor underfits. Figure 2 shows examples of underfitting, overfitting, and optimized models.

<p align="center">
<img  src="https://github.com/Synthetic-Data-for-Responsible-AI/appliedML4dummies/blob/fd-dev/Figures/model_complexity.png?raw=true"  width="500" height="300">
<figcaption align = "center"><b>Fig.1 Interaction between bias, variance, and model complexity and their effect on the prediction error</b></figcaption></p>

<p align="center">
<img  src="https://github.com/Synthetic-Data-for-Responsible-AI/appliedML4dummies/blob/fd-dev/Figures/overfitting_underfitting_examples.png?raw=true"  width="800" height="300">
<figcaption align = "center"><b>Fig.2 Examples of overfitting, underfitting, and optimized models</b></figcaption></p>

## References

[1] Géron, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O'Reilly Media, Inc., 2019.

[2] https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229

[3] https://towardsdatascience.com/what-is-feature-engineering-importance-tools-and-techniques-for-machine-learning-2080b0269f10
