# Supervised Learning: Models and Concepts

Supervised learning is an area of machine learning in which the chosen algorithm attempts to fit a target using given input. A training data set containing labels is provided to the algorithm. Based on a massive set of data, it will learn a rule that it uses to predict labels on new observations. In other words, supervised learning algorithms receive historical data and the task of finding the relationship that has the best predictive power.

There are two varieties of supervised learning algorithms: regression and classification algorithms. Regression-based supervised learning methods attempt to predict outputs based on input variables. Classification-based supervised learning methods identify the category to which a dataset belongs. Classification algorithms are based on probability, that is, the output is the category for which the algorithm finds the highest probability that the data set belongs to it. Regression algorithms, in contrast, estimate the output of problems that have an infinite number of solutions (continuous set of possible outputs).

In the context of finance, supervised learning models represent one of the most widely used classes of machine learning models. Many algorithms that are widely applied in algorithmic trading use supervised learning models, as they can be trained efficiently, are relatively robust to noise in financial data, and have strong links to finance theory.

Regression-based algorithms have been leveraged by academic and industry researchers to develop numerous asset pricing models. Such models are used to predict returns over multiple periods and to identify significant factors that drive returns on assets. There are many other use cases for regression-based supervised learning in portfolio management and derivatives pricing.

Classification-based algorithms, on the other hand, have been applied in several areas of finance that require predicting a categorical response. Among them are fraud detection, default prediction, credit scoring, directional prediction of asset price movements, and buy/sell recommendations. There are many other use cases for classification-based supervised learning, such as portfolio management and algorithmic trading.

---

##### Topics that will be covered:

- basic concepts about supervised learning models;
- how to implement different supervised learning models in Python;
- how to optimize models and identify their ideal parameters using grid search;
- overfitting versus underfitting and bias versus variance;
- strengths and weaknesses of the different supervised learning models;
- how to use multiple models, deep learning and ANNs for regression and classification;
- how to select a model based on several factors, including performance;
- evaluation metrics for classification and regression models;
- how to perform cross-validation;

---

## Supervised Learning Models: Overview

The problems of predictive classification modeling are different from those of predictive regression modeling in that classification is the task of predicting a discrete class label and regression is the task of predicting a continuous quantity. However, they both share the same concept of using known variables to make predictions, and there are many things that overlap between the models. Therefore, the classification and regression models are presented together.

Some models can be used for both classification and regression with minor modifications. These are *K-nearest neighbors*, decision trees, support vector machines, ensemble bagging and boosting methods, and ANNs (including deep neural networks). However, some models, such as linear regression and logistic regression, cannot (at least not easily) be used for both types of problems.

We will analyze the following details:

- model theory;
- implementation in **Scikit-learn** or **Keras**;
- grid search for different models;
- pros and cons of the models.

### Linear Regression (Ordinary Least Squares)

*Linear regression* (ordinary least squares regression - OLS) is perhaps one of the most well-known and understood algorithms in statistics and machine learning. Linear regression is a linear model, that is, a model that assumes a linear relationship between the input variables (*x*) and the single output variable (*y*). Your goal is to train a linear model to predict a new *y*, considering a previously unobserved *x*, with as few errors as possible.

Our model will be a function that predicts *y* given *x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>i</sub>*:

*y = β<sub>0</sub> + β<sub>1</sub>x<sub>1</sub> + ... + β<sub>i</sub>x<sub>i</sub>*

where *β<sub>0</sub>* is called the intercept, and *β<sub>1</sub> ... β<sub>i</sub>* are the regression coefficients.

#### Implementation in Python

```Python
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
```

#### Training a model

As we mentioned previously, training a model basically means recovering the model parameters while minimizing the cost (loss) function. The two steps to training a linear regression model are:

*Define a cost function (or loss function)*

- measures the level of inaccuracy present in the model predictions. The *sum of the squares of the residuals (SQR)*, also known as the residual sum of squares (RSS), measures the sum of the squared differences between the actual and predicted values, and is the cost function for linear regression.

- equation:
    - $SQR = \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2$

In this equation, $\beta_0$ is the intercept; $\beta_1, ..., \beta_p$ are the regression coefficients; $n$ is the number of observations and $p$ the number of input variables; and $x_{ij}$ represents the value of the $j$-th variable for the $i$-th observation.

*Find the parameters that minimize loss*

- that is, making our model as accurate as possible. Graphically, in two dimensions, this results in a line of best fit. In higher dimensions, we would have hyperplanes. Mathematically, we look at the difference between each actual data point *(y)* and our model's prediction ($\hat{y}$). We square these differences to avoid negative numbers and to penalize large differences, then add them up (or take their average). This is a metric of how well the line fits our data.
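The two steps above can be sketched in plain Python for the simplest case of a single input variable, where the minimizing parameters have a well-known closed form (slope equals the covariance of *x* and *y* divided by the variance of *x*). The function names and the data below are assumptions made up for this illustration, not part of any library.

```Python
# Minimal sketch: one-feature OLS fitted with the closed-form solution.

def fit_simple_ols(xs, ys):
    """Return (beta_0, beta_1) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel out
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    beta_1 = s_xy / s_xx
    beta_0 = mean_y - beta_1 * mean_x
    return beta_0, beta_1


def sum_squared_residuals(xs, ys, beta_0, beta_1):
    """SQR: sum of squared differences between actual and predicted y."""
    return sum((y - (beta_0 + beta_1 * x)) ** 2 for x, y in zip(xs, ys))


# Illustrative data (made up for this example), roughly y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

beta_0, beta_1 = fit_simple_ols(xs, ys)
sqr_fit = sum_squared_residuals(xs, ys, beta_0, beta_1)
# Any other line, such as y = 3x, has a cost at least as large
sqr_other = sum_squared_residuals(xs, ys, 0.0, 3.0)
print(beta_0, beta_1, sqr_fit, sqr_other)
```

With several input variables the same idea applies, but the minimization is done with matrix algebra, which is what `LinearRegression` handles internally.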

#### Grid search

The general idea of grid search is to create a grid with all possible combinations of hyperparameters and train the model using each of them. Hyperparameters are external characteristics of the model: they can be regarded as model settings and, unlike model parameters, are not estimated from the data. These hyperparameters are tuned during the grid search to obtain better model performance.

Because its search is exhaustive, grid search is guaranteed to find the best combination within the grid. The disadvantage is that the size of the grid grows exponentially with the addition of more parameters or values considered.

The **GridSearchCV** class in the *model_selection* module of the **sklearn** package makes it easy to systematically evaluate all combinations of hyperparameter values that we would like to test.

The first step is to create a model object. Then, we define a dictionary in which the keys name the hyperparameters and the values list the parameter settings to be tested. For linear regression, the hyperparameter is *fit_intercept*, which is a Boolean variable that determines whether or not to calculate the *intercept* for this model. If set to False, no intercept will be used in the calculations:

```Python
model = LinearRegression()
params_grid = {'fit_intercept': [True, False]}
```

The second step is to instantiate the **GridSearchCV** object and provide the estimator object and parameter grid, as well as a scoring method and a cross-validation choice, to the initialization method. Cross-validation is a resampling procedure used to evaluate learning models, and the scoring parameter is the model's evaluation metric.

We can then fit GridSearchCV:

```Python
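from sklearn.model_selection import GridSearchCV
# `fold` is the cross-validation setting (e.g., a number of folds), defined beforehand
grid = GridSearchCV(estimator=model, param_grid=params_grid, scoring='r2', cv=fold)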
grid_result = grid.fit(X, y)
```
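As a self-contained illustration of the two steps, the sketch below runs the same grid search end to end. The synthetic data, the random seed, and the choice of a shuffled 5-fold `KFold` for `fold` are assumptions made only for this example.

```Python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data (an assumption for illustration): y = 5 + 3x + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 5.0 + 3.0 * X[:, 0] + rng.normal(0, 1.0, size=200)

# Step 1: model object and grid of hyperparameter values to test
model = LinearRegression()
params_grid = {'fit_intercept': [True, False]}

# Step 2: 5-fold cross-validation; a fixed seed keeps the runs repeatable
fold = KFold(n_splits=5, shuffle=True, random_state=0)
grid = GridSearchCV(estimator=model, param_grid=params_grid, scoring='r2', cv=fold)
grid_result = grid.fit(X, y)

print(grid_result.best_params_)  # combination with the best mean CV score
print(grid_result.best_score_)   # mean cross-validated R^2 of that combination
```

Because the data were generated with a nonzero intercept, the search should prefer `fit_intercept=True`; the attributes `best_params_` and `best_score_` hold the winning combination and its score.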

#### Advantages and disadvantages

In terms of advantages, linear regression is easy to understand and interpret. However, it may not work well when there is a non-linear relationship between predicted and predictor variables. Linear regression has a tendency to *overfit*, and when a large number of features are present, it may not handle irrelevant features well. It also requires that the data follow certain assumptions, such as the absence of multicollinearity. If these assumptions do not hold, we will not be able to trust the results obtained.

### Regularized Regression

When a linear regression model contains many independent variables, its coefficients will not be well determined and the model will have a tendency to fit the training data (data used to create the model) extremely well, but to fit the test data (data used to test the quality level of the model) poorly. This is known as overfitting or high variance.

A popular technique for controlling overfitting is *regularization*, which involves adding a *penalty* term so that the error or loss function discourages the coefficients from reaching large values. In simple terms, regularization is the penalty mechanism that applies shrinkage to model parameters (bringing them close to zero), in order to create a model with greater prediction and interpretation accuracy. Regularized regression has two advantages over linear regression:

*Predictive accuracy*

- a model that performs better on the test data is one that generalizes well from the training data. A model with too many parameters may attempt to fit noise that is specific to the training data. By shrinking some coefficients or setting them to zero, we give up the ability to fit complex models (higher bias) in exchange for a more generalizable model (lower variance).

*Interpretation*

- a large number of predictors can complicate the interpretation or communication of the overall picture of results. It may be preferable to sacrifice some detail to limit the model to a smaller subset of parameters with the strongest effects.

Common ways to regularize a linear regression model are these:

*L1 or Lasso regularization*

- *Lasso regularization* performs *L1 regularization* by adding a multiple of the sum of the absolute values of the coefficients to the cost function (SQR/RSS) of the linear regression, as mentioned previously. The equation for Lasso regularization can be represented as follows:

$\text{Cost Function} = SQR + \gamma \times \sum_{j=1}^{p} |\beta_{j}|$

- L1 regularization can lead to zero coefficients (i.e., some features are completely neglected for the output evaluation). The higher the value of $\gamma$, the more coefficients are shrunk to zero. This can eliminate some features entirely and give us a subset of predictors, reducing the complexity of the model. Thus, Lasso regression not only helps in reducing overfitting but can also help in feature selection. Predictors whose coefficients are not shrunk to zero are the important ones, and, in this way, L1 regularization allows feature selection (sparse selection). The regularization parameter ($\gamma$) can be controlled, and a zero value of $\gamma$ produces the basic linear regression equation.

A lasso regression model can be built using the *Lasso* class from the **sklearn** Python package, as we will show below:

```Python
from sklearn.linear_model import Lasso
model = Lasso()
model.fit(X, y)
```

*L2 or Ridge regularization*

- *Ridge regression* performs *L2 regularization* by adding a multiple of the sum of the squared coefficients to the cost function of the linear regression. The equation for Ridge regularization can be represented like this:

$\text{Cost Function} = SQR + \gamma \times \sum_{j=1}^{p} \beta_{j}^{2}$

- Ridge regression places restrictions on coefficients. The penalty term (scaled by $\gamma$) regularizes the coefficients so that, if they take on large values, the optimization function is penalized. Thus, Ridge regression shrinks the coefficients and helps reduce model complexity. Shrinking the coefficients leads to lower variance and a lower error value. Therefore, Ridge regression reduces the complexity of the model, but does not reduce the number of variables; it only diminishes their effect. When $\gamma$ is closer to zero, the cost function becomes similar to the linear regression cost function. Therefore, the weaker the restrictions on the features (low $\gamma$), the more the model will resemble a linear regression model.

A Ridge regression model can be built using the *Ridge* class from the **sklearn** Python package, as shown in the following code:

```Python
from sklearn.linear_model import Ridge
model = Ridge()
model.fit(X, y)
```

*Elastic net*

- *Elastic nets* add regularization terms to the model, combining L1 and L2 regularization, as the following equation shows:

$\text{Cost Function} = SQR + \gamma \times \left(\frac{1 - \alpha}{2} \times \sum_{j=1}^{p} \beta_{j}^{2} + \alpha \times \sum_{j=1}^{p} |\beta_{j}|\right)$

- in addition to establishing and choosing a value of $\gamma$, elastic net also allows us to calibrate the alpha parameter, where $\alpha = 0$ corresponds to ridge and $\alpha = 1$ to lasso. Therefore, we can choose a $\alpha$ value between *0* and *1* to optimize the elastic net. Effectively this will shrink some coefficients and set some to *0* for sparse selection.

An elastic net regression model can be built using the *ElasticNet* class from the **sklearn** Python package, as we will show below:

```Python
from sklearn.linear_model import ElasticNet
model = ElasticNet()
model.fit(X, y)
```

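For all regularized regressions, $\gamma$ is the essential parameter to be calibrated during the grid search in Python; in **sklearn** it corresponds to the *alpha* argument of these classes. In an elastic net, $\alpha$ can be an additional parameter to be calibrated; it corresponds to the *l1_ratio* argument of *ElasticNet*.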
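To make the three cost functions concrete, the sketch below evaluates each penalty for one set of coefficients in plain Python. The function names, the coefficient values, and the SQR value are hypothetical and chosen only for illustration; the formulas follow the equations of this section, with `gamma` and `alpha` playing the roles of $\gamma$ and $\alpha$.

```Python
def l1_penalty(betas):
    """Sum of the absolute values of the coefficients (Lasso term)."""
    return sum(abs(b) for b in betas)


def l2_penalty(betas):
    """Sum of the squared coefficients (Ridge term)."""
    return sum(b ** 2 for b in betas)


def lasso_cost(sqr, betas, gamma):
    return sqr + gamma * l1_penalty(betas)


def ridge_cost(sqr, betas, gamma):
    return sqr + gamma * l2_penalty(betas)


def elastic_net_cost(sqr, betas, gamma, alpha):
    # alpha = 1 gives the Lasso penalty; alpha = 0 leaves only the L2 term
    mix = (1 - alpha) / 2 * l2_penalty(betas) + alpha * l1_penalty(betas)
    return sqr + gamma * mix


# Hypothetical values: coefficients beta_1..beta_4 (intercept excluded) and SQR
betas = [0.5, -2.0, 0.0, 3.0]
sqr = 10.0

print(lasso_cost(sqr, betas, gamma=0.1))
print(ridge_cost(sqr, betas, gamma=0.1))
print(elastic_net_cost(sqr, betas, gamma=0.1, alpha=0.5))
```

Setting `gamma=0` in any of the three functions returns the plain SQR, which is the unregularized linear regression cost. Note that with the formula as written here, `alpha=0` yields half of the Ridge penalty because of the division by 2.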

### Logistic Regression

*Logistic Regression* is one of the most widely used algorithms for classification. The logistic regression model arises from the desire to model the probabilities of the output classes given a function that is linear in *x*, while ensuring that the output probabilities add up to 1 and remain between 0 and 1, which is what we expect from the probabilities.

If we train a linear regression model with several examples in which *y = 0 or 1*, we may end up predicting some probabilities that are less than zero or greater than one, which does not make sense. Therefore, we use a logistic regression model (or *logit* model), which is a modification of linear regression that guarantees an output probability between 0 and 1 by applying the *sigmoid* function.

The following equation shows the logistic regression model. Similar to linear regression, input values *(X)* are combined linearly using weights or coefficient values to predict an output value *(y)*. The output of the equation is a probability that is transformed into a binary value (*0* or *1*) to obtain the model's prediction:

$$
y = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)}
$$

$y$ is the predicted output, $\beta_0$ is the bias or intercept term, and $\beta_1, \dots, \beta_n$ are the coefficients for the input values $x_1, \dots, x_n$. Each column in the input data has an associated coefficient $\beta$ (a constant real value) that must be learned from the training data.

In logistic regression, the cost function is basically a metric of how often we predict one when the true response is zero, or vice versa. Logistic regression coefficients are trained using techniques such as the maximum likelihood estimator (or MLE) to predict values close to *1* for the default class and close to *0* for the other class.
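The transformation from a linear combination to a probability, and then to a binary prediction, can be sketched in plain Python. The function names, the coefficients, and the 0.5 threshold below are assumptions for illustration; the sigmoid is written in its two algebraically equivalent forms, $1 / (1 + e^{-z})$ and $e^{z} / (1 + e^{z})$, picking whichever avoids numerical overflow.

```Python
import math


def sigmoid(z):
    """Logistic function: maps any real z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form, used for negative z so that exp() cannot overflow
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(x, beta_0, betas):
    """Probability of class 1 for one observation x (a list of inputs)."""
    z = beta_0 + sum(b * xi for b, xi in zip(betas, x))
    return sigmoid(z)


def predict_label(x, beta_0, betas, threshold=0.5):
    """Turn the probability into a binary prediction (0 or 1)."""
    return 1 if predict_proba(x, beta_0, betas) >= threshold else 0


# Hypothetical coefficients and a single observation
beta_0, betas = -1.0, [0.8, -0.4]
x = [2.0, 1.0]
print(predict_proba(x, beta_0, betas), predict_label(x, beta_0, betas))
```

In practice the coefficients are not set by hand but estimated from the training data, for example by maximum likelihood as described above.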

A logistic regression model can be built using the *LogisticRegression* class from the **sklearn** Python package, as shown below:

```Python
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X, y)
```

#### Hyperparameters

*Regularization (penalty* in **sklearn**)
- like linear regression, logistic regression can have regularization, which can be *L1*, *L2*, or *elasticnet*. The values in the **sklearn** library are *[l1, l2, elasticnet]*

*Regularization strength (C* in **sklearn**)
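- this parameter controls the strength of the regularization. In **sklearn**, *C* is the inverse of the regularization strength, so smaller values mean stronger regularization. Good values to try can be *[100, 10, 1.0, 0.1, 0.01]*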

#### Advantages and disadvantages

Considering the advantages, the logistic regression model is easy to implement, has good interpretability, and performs very well on linearly separable classes. The output of the model is a probability, which provides more insights and can be used for ranking. The model has a small number of hyperparameters. Although there may be a risk of overfitting, this can be addressed using *L1/L2* regularization, similar to how we address overfitting in linear regression models.

In terms of disadvantages, the model may overfit when fed with a large number of features. Logistic regression can only learn linear functions and is less suited to complex relationships between features and target variables. Furthermore, it may not handle irrelevant features well, especially if they are strongly correlated.

### Support Vector Machine

The goal of the *support vector machine* (SVM) algorithm is to maximize the margin, which is defined as the distance between the separating hyperplane (or decision boundary) and the training samples that are closest to this hyperplane, the so-called support vectors. The margin is calculated as the perpendicular distance from the line to only the closest points. Thus, the SVM calculates a boundary with maximum margin that leads to a homogeneous division of all data points.

In practice, the data is messy and cannot be separated perfectly with a hyperplane. The constraint of maximizing the margin of the line separating the classes must be relaxed. A change is made to allow some training data points to violate the separating line. An additional set of coefficients is introduced, which gives the margin some slack in each dimension. A calibration parameter is introduced, simply called *C*, which defines the magnitude of relaxation allowed in all dimensions. In this formulation, the larger the value of *C*, the more violations of the hyperplane are allowed. Note that the *C* argument in **sklearn** is defined the other way around, as a penalty on violations, so larger values allow fewer of them.

In some cases, it is not possible to find a hyperplane or a linear decision boundary, and kernels are used. A kernel is simply a transformation of the input data that allows the SVM algorithm to process the data more easily. Using kernels, the data is projected into a higher dimension to classify it better.

SVM is used for both classification and regression. This is possible by converting the original optimization problem into a dual problem. For regression, the trick is to reverse the objective. Instead of trying to fit the widest possible street between two classes while limiting margin violations, SVM regression tries to fit as many instances as possible on the street while limiting margin violations. The width of the street is controlled by a hyperparameter.

SVM regression and classification models can be built using the Python **sklearn** package, as shown in the following code:

```Python
# Regression
from sklearn.svm import SVR
model = SVR()
model.fit(X, y)
```

```Python
# Classification
from sklearn.svm import SVC
model = SVC()
model.fit(X, y)
```

#### Hyperparameters

The following key parameters are provided in the sklearn implementation of SVM and can be adjusted during grid search:

*Kernels*
- the choice of kernel controls how the input variables will be projected. There are many kernel options, but *linear* and *RBF* are the most common;

*Penalty*
- the penalty parameter (*C* in **sklearn**) tells the SVM optimization how much you want to avoid misclassifying each training example. For larger values of the penalty parameter, the optimization will choose a hyperplane with a smaller margin. Good values to try lie on a logarithmic scale from 10 to 1,000.
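A grid search over these two hyperparameters might look like the following sketch. The synthetic dataset, the 5-fold cross-validation, and the accuracy metric are assumptions chosen for illustration; the kernel names and the range of *C* follow the values mentioned above.

```Python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary classification data (an assumption for illustration)
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

model = SVC()
params_grid = {
    'kernel': ['linear', 'rbf'],  # how the inputs are projected
    'C': [10, 100, 1000],         # penalty on margin violations
}

# Every kernel/C combination is evaluated with 5-fold cross-validation
grid = GridSearchCV(estimator=model, param_grid=params_grid,
                    scoring='accuracy', cv=5)
grid_result = grid.fit(X, y)

print(grid_result.best_params_)
print(grid_result.best_score_)
```

With real data the features would normally be scaled first, since SVM is sensitive to feature scale; the synthetic features here are already on comparable scales.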

#### Advantages and disadvantages

In terms of advantages, the SVM is quite robust against overfitting, especially in a higher-dimensional space. It handles nonlinear relationships very well, with many kernel options to choose from. Furthermore, there is no distributional requirement for the data.

Considering the disadvantages, SVM can be inefficient to train and requires a lot of memory to run and calibrate. It does not perform well on large data sets. It requires feature scaling of the data. There are also many hyperparameters and their meanings are often not intuitive.

### K-Nearest Neighbors

*K-nearest neighbors* (KNN) is considered a "lazy learner" because there is no learning required by the model. For a new data point, predictions are made by searching the entire training set for the *K* most similar instances (the neighbors) and summarizing the output variable for these *K* instances.

To determine which *K* instances in the training data set are most similar to the new input, a distance measure is used. The most popular is the *Euclidean distance*, which is calculated as the square root of the sum of the squared differences between a point *a* and a point *b* over all input attributes *i* and is represented as:


$$ 
d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
$$

Euclidean distance is a good distance metric to use if the input variables are similar in type.

Another distance metric is the *Manhattan distance*, where the distance between point *a* and point *b* is represented as:

$$
d(a, b) = {\sum_{i=1}^{n} |a_i - b_i|}
$$

This is a good metric to use if the input variables are not similar in type. The steps of KNN can be summarized as follows:

1. Choose the number of *K* and a distance metric;
2. Find the *K-nearest neighbors* of the sample you want to classify;
3. Assign the class label by majority vote.
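The three steps can be sketched in plain Python as follows. The function names and the tiny two-class dataset are assumptions made up for this illustration; the two distance functions implement the Euclidean and Manhattan formulas given above.

```Python
import math
from collections import Counter


def euclidean(a, b):
    """Square root of the sum of squared differences over all attributes."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute differences over all attributes."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def knn_predict(train_X, train_y, new_point, k=3, distance=euclidean):
    # Step 2: rank the training points by distance and keep the k nearest
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: distance(pair[0], new_point))
    neighbors = ranked[:k]
    # Step 3: majority vote among the labels of the k neighbors
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Made-up training data: two well-separated groups
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_y = ['A', 'A', 'A', 'B', 'B', 'B']

# Step 1: choose K and the distance metric, then predict
print(knn_predict(train_X, train_y, (2.0, 2.0), k=3, distance=euclidean))
print(knn_predict(train_X, train_y, (6.0, 7.0), k=3, distance=manhattan))
```

For regression, the same neighbor search applies, but the last step would average the neighbors' output values instead of taking a vote.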

KNN regression and classification models can be built using the Python **sklearn** package, as demonstrated:

```Python
# Classification
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(X, y)
```

```Python
# Regression
from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor()
model.fit(X, y)
```

#### Hyperparameters

The following key parameters are present in the **sklearn** implementation of KNN and can be adjusted during grid search:

*Number of neighbors*
- the most important hyperparameter for KNN is the number of neighbors (*n_neighbors*). Good values are between 1 and 20;

*Distance metric*
- it may also be interesting to test different distance metrics to choose the composition of the neighborhood. Good values are *euclidean* and *manhattan*.

#### Advantages and disadvantages

In terms of advantages, there is no training involved and therefore no learning phase. Since the algorithm does not require training before making predictions, new data can be added seamlessly without impacting the accuracy of the algorithm. It is intuitive and easy to understand. The model handles multi-class classification naturally and can learn complex decision boundaries. KNN is effective if the training data is large. With a sufficiently large *K*, it can also be fairly robust to noisy data.

Considering disadvantages, the distance metric to choose is not obvious and in many cases difficult to justify. KNN performs poorly with high-dimensional datasets. It is expensive and slow to predict new instances, as the distance to all training points has to be recalculated. With a small *K*, it is sensitive to noise in the dataset, so missing values usually need to be imputed and outliers removed beforehand. Furthermore, feature scaling (standardization and normalization) is required before applying the KNN algorithm to any dataset; otherwise, it may generate wrong predictions.

### Linear Discriminant Analysis

The goal of the *linear discriminant analysis* (LDA) algorithm is to project data into a space with fewer dimensions so that class separability is maximized and within-class variance is minimized.

During training of the LDA model, the statistical properties (mean and covariance matrix) of each class are computed. They are estimated based on the following assumptions about the data:

- the data has a normal distribution, so that each variable has the shape of a bell curve when plotted;

- each attribute has the same variance, and the values of each variable vary around the mean by approximately the same amount.

To make predictions, LDA estimates the probability that a new set of inputs belongs to each class. The output class is the one with the highest probability.

#### Python implementation and hyperparameters

The LDA classification model can be built using the **sklearn** package, as follows:

```Python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
model = LinearDiscriminantAnalysis()
model.fit(X, y)
```

The key hyperparameter for the LDA model is the number of components for dimensionality reduction, represented by *n_components* in **sklearn**.
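The sketch below shows what *n_components* does in practice: it projects the inputs onto a lower-dimensional space and still produces class probabilities for prediction. The three-class synthetic data and the class centers are assumptions made up for this illustration. In **sklearn**, *n_components* cannot exceed the smaller of the number of classes minus one and the number of features, which is 2 here.

```Python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic data (an assumption): 3 classes, 4 features, normally distributed
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [3, 3, 0, 0], [0, 3, 3, 0]], dtype=float)
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 4)) for c in centers])
y = np.repeat([0, 1, 2], 50)

# Project the 4 features onto 2 discriminant components
model = LinearDiscriminantAnalysis(n_components=2)
X_projected = model.fit_transform(X, y)
print(X_projected.shape)

# Class probabilities for the first observation; the prediction is the argmax
proba = model.predict_proba(X[:1])
print(proba, model.predict(X[:1]))
```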

#### Advantages and disadvantages

The advantages are that LDA is a relatively simple model, with a quick and easy implementation. The disadvantages are that it requires feature scaling and involves complex matrix operations.

### Classification and Regression Trees

In general terms, the purpose of a tree-building analysis algorithm is to determine a set of logical *if/then* (split) conditions that allow for accurate prediction or classification of cases. Classification and regression trees are attractive models if interpretability is important to us. We can think of this model as decomposing our data and making decisions based on a series of questions asked. This algorithm is the foundation of ensemble methods such as random forest and gradient boosting.

#### Representation

The model can be represented by a *binary tree* or *decision tree*, where each node is an input variable *x* with a split point and each leaf contains an output value *y* that is used for prediction.

#### Learning a CART model

Creating a binary tree is actually a process of splitting the input space. A *greedy approach* called *recursive binary splitting* is used to split the space. This is a numerical procedure in which all the values are lined up and different split points are tested using a cost (loss) function. The split with the best cost (lowest cost, since we minimize it) is selected. All input variables and all possible split points are evaluated and chosen in a greedy manner (i.e., the best split point is chosen each time).

For predictive regression modeling problems, the cost function that is minimized to choose the split points is the *sum of squared errors* over all training samples.

$\sum_{i=1}^{n} (y_i - \text{prediction}_i)^2$

where $y_i$ is the output for the $i$-th training sample and $\text{prediction}_i$ is the predicted output for it. For classification, the *Gini index* is used; it gives an indication of the purity level of the leaf nodes (i.e., how mixed the training data assigned to each node is) and is defined as:

$G = \sum_{k=1}^{K} p_k \times (1 - p_k)$

where *G* is the Gini cost across all *K* classes, and $p_k$ is the proportion of training instances in the node that belong to class *k*. A node in which all instances belong to the same class (perfect class purity) will have *G = 0*, while a node that has a *50 - 50* split of classes for a binary classification problem (worst purity) will have *G = 0.5*.
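Both cost functions, and the greedy search for a split point on a single input variable, can be sketched in plain Python. All names and data below are assumptions for illustration. One further assumption: when scoring a classification split, each side's Gini value is weighted by its number of samples, a common convention that the text above does not spell out.

```Python
from collections import Counter


def gini(labels):
    """Gini cost of one node: sum over classes of p_k * (1 - p_k)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def weighted_gini(labels):
    """Gini cost scaled by node size, so the two sides can be added up."""
    return len(labels) * gini(labels)


def sse(values):
    """Sum of squared errors around the node mean (the node's prediction)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys, cost):
    """Try the midpoint between each pair of sorted x values, keep the cheapest."""
    pairs = sorted(zip(xs, ys), key=lambda pair: pair[0])
    best_threshold, best_cost = None, float('inf')
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical x values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = cost(left) + cost(right)
        if total < best_cost:
            best_threshold, best_cost = threshold, total
    return best_threshold, best_cost


# Made-up data: one input variable, a class target and a numeric target
xs = [1, 2, 3, 10, 11, 12]
ys_class = ['a', 'a', 'a', 'b', 'b', 'b']
ys_reg = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]

print(gini(['a', 'a', 'a', 'a']), gini(['a', 'b', 'a', 'b']))
print(best_split(xs, ys_class, weighted_gini))  # classification split
print(best_split(xs, ys_reg, sse))              # regression split
```

A full tree would apply `best_split` to every input variable, keep the best one, and then repeat the search recursively on each side of the split.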

#### Stopping Criteria

The recursive binary split procedure described above needs to know when to stop splitting as it descends the tree of training data. The most common stopping procedure is to use a minimum count of the number of training instances assigned to each leaf node. If the count is less than some minimum, then the split is not accepted and the node is taken as a final leaf node.

#### Pruning the tree

The stopping criterion is important, as it greatly influences the performance of the tree. Pruning can be used after the tree has been learned to further improve performance. The complexity of a decision tree is defined as the number of splits in it. Simpler trees are preferred, as they are faster to execute and easier to understand, consume less memory and storage during processing, and are less likely to overfit the data. The quickest and simplest method of pruning is to analyze each leaf node in the tree and evaluate the effect of removing it using a test set. A leaf node is removed only if doing so results in a drop in the overall cost function on the entire test set. The removal of nodes can be stopped when no further improvements can be made.

#### Implementation

CART regression and classification models can be built using the **sklearn** package, as shown in the following code:

```Python
# Classification
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X, y)
```

```Python
# Regression
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
model.fit(X, y)
```

#### Hyperparameters

CART has many hyperparameters. However, the main one is the maximum depth of the tree model, represented by *max_depth* in the **sklearn** package. Good values can range from *2 to 30*, depending on the number of features in the data.
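The effect of *max_depth* can be seen by comparing training and test accuracy at a few depths, as in the sketch below. The synthetic dataset, the 70/30 split, and the depths tried are assumptions chosen only for illustration.

```Python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {}
for depth in [2, 5, 10, None]:  # None lets the tree grow until leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for this depth
    results[depth] = (model.score(X_train, y_train), model.score(X_test, y_test))

for depth, (train_acc, test_acc) in results.items():
    print(depth, train_acc, test_acc)
```

An unrestricted tree typically fits the training set perfectly while its test accuracy lags behind, which is the overfitting behavior that limiting the depth (or pruning) is meant to control.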

#### Advantages and disadvantages

In terms of advantages, CART is easy to interpret and can be adapted to learn complex relationships. It requires little data preparation, and the data usually does not need to be scaled. Feature importance is built in due to the way decision nodes are created. It performs well on large data sets. It works on both regression and classification problems.

In terms of disadvantages, CART has a tendency to overfit unless pruning is used. It can be non-robust, meaning that small changes in the training set can lead to quite significant differences in the hypothesis function that is learned. In general, it performs worse than ensemble models.

### Ensemble Models