# Supervised Learning: Models and Concepts

Supervised learning is an area of machine learning in which the chosen algorithm attempts to fit a target using given inputs. A training data set containing labels is provided to the algorithm. From this labeled data, it learns a rule that it uses to predict labels for new observations. In other words, supervised learning algorithms receive historical data and are given the task of finding the relationship that has the best predictive power.

There are two varieties of supervised learning algorithms: regression and classification. Regression-based methods attempt to predict a continuous output from the input variables. Classification-based methods identify the category to which an observation belongs. Many classification algorithms are based on probability: the output is the category to which the algorithm assigns the highest probability of containing the observation. Regression algorithms, in contrast, estimate outputs for problems that have an infinite number of possible solutions (a continuous set of possible outputs).

In the context of finance, supervised learning models represent one of the most widely used classes of machine learning models. Many algorithms that are widely applied in algorithmic trading use supervised learning models, as they can be trained efficiently, are relatively robust to noise in financial data, and have strong links to finance theory.

Regression-based algorithms have been leveraged by academic and industry researchers to develop numerous asset pricing models. Such models are used to predict returns over multiple periods and to identify significant factors that drive returns on assets. There are many other use cases for regression-based supervised learning in portfolio management and derivatives pricing.

Classification-based algorithms, on the other hand, have been applied in several areas within finance that require predicting a categorical response. Among them are fraud detection, default prediction, credit scoring, directional prediction of asset price movements, and buy/sell recommendations. There are many other use cases for classification-based supervised learning in portfolio management and algorithmic trading.

---

##### Topics that will be covered:

- basic concepts about supervised learning models;
- how to implement different supervised learning models in Python;
- how to optimize models and identify their ideal parameters using grid search;
- overfitting versus underfitting and bias versus variance;
- strengths and weaknesses of the different supervised learning models;
- how to use multiple models, deep learning and ANNs for regression and classification;
- how to select a model based on several factors, including performance;
- evaluation metrics for classification and regression models;
- how to perform cross-validation.

---

## Supervised Learning Models: Overview

The problems of predictive classification modeling are different from those of predictive regression modeling in that classification is the task of predicting a discrete class label and regression is the task of predicting a continuous quantity. However, they both share the same concept of using known variables to make predictions, and there are many things that overlap between the models. Therefore, the classification and regression models are presented together.

Some models can be used for both classification and regression with minor modifications. These include *K-nearest neighbors*, decision trees, support vector machines, ensemble bagging and boosting methods, and ANNs (including deep neural networks). However, some models, such as linear regression and logistic regression, cannot (at least not easily) be used for both types of problems.

We will analyze the following details:

- model theory;
- implementation in **Scikit-learn** or **Keras**;
- grid search for different models;
- pros and cons of the models.

### Linear Regression (Ordinary Least Squares)

*Linear regression* (ordinary least squares regression - OLS) is perhaps one of the most well-known and understood algorithms in statistics and machine learning. Linear regression is a linear model, that is, a model that assumes a linear relationship between the input variables (*x*) and the single output variable (*y*). Your goal is to train a linear model to predict a new *y*, considering a previously unobserved *x*, with as few errors as possible.

Our model will be a function that predicts *y*, given *x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>i</sub>*:

*y = β<sub>0</sub> + β<sub>1</sub>x<sub>1</sub> + ... + β<sub>i</sub>x<sub>i</sub>*

where *β<sub>0</sub>* is called the intercept, and *β<sub>1</sub> ... β<sub>i</sub>* are the regression coefficients.

#### Implementation in Python

```Python
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
```
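As a minimal sketch of the snippet above, assuming a synthetic data set (the names `rng`, `true_beta`, `x_new`, and the chosen coefficient values are illustrative and do not come from the original text), the following fits the model and inspects the estimated intercept and coefficients:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 1.5 + 2.0*x1 - 3.0*x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_beta = np.array([2.0, -3.0])
y = 1.5 + X @ true_beta + rng.normal(scale=0.1, size=200)

model = LinearRegression()
model.fit(X, y)

print(model.intercept_)  # estimate of beta_0
print(model.coef_)       # estimates of beta_1 and beta_2

# Predict y for a previously unobserved x
x_new = np.array([[0.5, -1.0]])
y_pred = model.predict(x_new)
print(y_pred)
```

Because the noise is small relative to the signal, the estimates should land close to the values used to generate the data.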

#### Training a model

As we mentioned previously, training a model basically means finding the model parameters that minimize the cost (loss) function. The two steps to training a linear regression model are:

*Define a cost function (or loss function)*

- measures the level of inaccuracy in the model predictions. The *sum of the squares of the residuals (SQR)*, as defined previously, is the sum of the squared differences between the actual and predicted values, and is the cost function for linear regression.

- equation:
    - $SQR = \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2$

In this equation, $\beta_0$ is the intercept; $\beta_1, ..., \beta_p$ are the regression coefficients; and $x_{ij}$ represents the $i$-th observation of the $j$-th variable, with $n$ observations and $p$ variables.

*Find the parameters that minimize loss*

- that is, making our model as accurate as possible. Graphically, in two dimensions, this results in a line of best fit. In higher dimensions, we would have hyperplanes. Mathematically, we look at the difference between each actual data point *(y)* and our model's prediction ($\hat{y}$). We square these differences to avoid negative numbers and to penalize large differences, then add them up (or average them). The result is a metric of how well the line fits our data.
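The two steps above can be sketched directly in NumPy. This is an illustrative example under stated assumptions: the data are synthetic, the helper `sqr` is a hypothetical name introduced here, and the minimization uses `np.linalg.lstsq` as one possible least-squares solver rather than the method any particular library uses internally.

```python
import numpy as np

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 0.5 + X @ np.array([1.0, -2.0, 0.3]) + rng.normal(scale=0.2, size=100)

def sqr(beta0, beta, X, y):
    """Step 1: sum of squared residuals for intercept beta0 and coefficients beta."""
    residuals = y - beta0 - X @ beta
    return float(np.sum(residuals ** 2))

# Step 2: find the parameters that minimize the SQR.
# A column of ones lets the intercept be estimated with the coefficients.
X_design = np.column_stack([np.ones(len(X)), X])
solution, *_ = np.linalg.lstsq(X_design, y, rcond=None)
beta0_hat, beta_hat = solution[0], solution[1:]

sqr_min = sqr(beta0_hat, beta_hat, X, y)
sqr_perturbed = sqr(beta0_hat + 0.1, beta_hat, X, y)  # any other parameters give a larger SQR
mse = sqr_min / len(y)  # averaging the SQR gives the mean squared error

print(sqr_min, sqr_perturbed, mse)
```

Moving any parameter away from the least-squares solution increases the SQR, which is what "minimizing the loss" means here.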

#### Grid search

The general idea of grid search is to create a grid with all possible combinations of hyperparameter values and train the model using each of them. Hyperparameters are external characteristics of the model: they can be considered model settings and, unlike model parameters, are not estimated from the data. These hyperparameters are tuned during the grid search to obtain better model performance.

Because the search is exhaustive, grid search is guaranteed to find the best combination among those in the grid. The disadvantage is that the size of the grid grows exponentially with the number of hyperparameters, and it also grows with the number of values considered for each.
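To make the growth of the grid concrete, here is a small sketch that counts combinations with `ParameterGrid` from `sklearn.model_selection`. The hyperparameter names and values in `small_grid` and `large_grid` are hypothetical choices for illustration only.

```python
from sklearn.model_selection import ParameterGrid

# Two hyperparameters: 3 values x 2 values
small_grid = {'alpha': [0.01, 0.1, 1.0], 'fit_intercept': [True, False]}

# Adding two more hyperparameters multiplies the number of combinations
large_grid = {**small_grid, 'l1_ratio': [0.2, 0.5, 0.8], 'max_iter': [1000, 5000]}

n_small = len(ParameterGrid(small_grid))  # 3 * 2 = 6
n_large = len(ParameterGrid(large_grid))  # 3 * 2 * 3 * 2 = 36
print(n_small, n_large)
```

With 5-fold cross-validation, each combination is trained five times, so the larger grid would already require 180 model fits.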

The **GridSearchCV** class in the *model_selection* module of the **sklearn** package makes it easy to systematically evaluate all combinations of hyperparameter values that we would like to test.

The first step is to create a model object. Then, we define a dictionary in which the keys name the hyperparameters and the values list the settings to be tested. For linear regression, the hyperparameter is *fit_intercept*, a Boolean variable that determines whether or not to calculate the *intercept* for this model. If set to False, no intercept will be used in the calculations:

```Python
model = LinearRegression()
params_grid = {'fit_intercept': [True, False]}
```

The second step is to instantiate the **GridSearchCV** object, passing the estimator object and the parameter grid, as well as a scoring method and a cross-validation choice, to its initialization method. Cross-validation is a resampling procedure used to evaluate learning models, and the scoring parameter is the model's evaluation metric.

We can then fit GridSearchCV:

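```Python
from sklearn.model_selection import GridSearchCV
# `fold` is the cross-validation choice (a number of folds or a splitter), defined beforehand
grid = GridSearchCV(estimator=model, param_grid=params_grid, scoring='r2', cv=fold)
grid_result = grid.fit(X, y)
```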
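Putting both steps together, a self-contained sketch might look as follows. It assumes a synthetic data set and uses a 5-fold `KFold` splitter for `fold`; both are illustrative choices, not part of the original text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data with a non-zero intercept (an assumption for illustration)
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 3))
y = 4.0 + X @ np.array([1.0, 0.5, -1.5]) + rng.normal(scale=0.3, size=150)

# Step 1: model object and parameter grid
model = LinearRegression()
params_grid = {'fit_intercept': [True, False]}

# Step 2: grid search with R^2 scoring and 5-fold cross-validation
fold = KFold(n_splits=5, shuffle=True, random_state=0)
grid = GridSearchCV(estimator=model, param_grid=params_grid, scoring='r2', cv=fold)
grid_result = grid.fit(X, y)

print(grid_result.best_params_)                     # best hyperparameter combination
print(grid_result.best_score_)                      # mean cross-validated R^2 of that combination
print(grid_result.cv_results_['mean_test_score'])   # mean score for every combination
```

Since the data were generated with an intercept of 4.0, the search should prefer `fit_intercept=True`.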

#### Advantages and disadvantages

In terms of advantages, linear regression is easy to understand and interpret. However, it may not work well when there is a non-linear relationship between the predicted and predictor variables. Linear regression tends to *overfit* when a large number of features are present, and it may not handle irrelevant features well. It also requires that the data satisfy certain assumptions, such as the absence of multicollinearity. If these assumptions do not hold, we will not be able to trust the results obtained.

### Regularized Regression

When a linear regression model contains many independent variables, its coefficients will not be well determined and the model will have a tendency to fit the training data (data used to create the model) extremely well, but to fit the test data (data used to test the quality of the model) poorly. This is known as overfitting or high variance.
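The effect can be reproduced with a small experiment. This is only a sketch under stated assumptions: the data are synthetic, with 40 features of which only the first one actually drives the target, and there are few observations relative to the number of features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n_samples, n_features = 60, 40
X = rng.normal(size=(n_samples, n_features))
# Only the first feature matters; the other 39 are irrelevant
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=n_samples)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
r2_train = model.score(X_train, y_train)  # R^2 on the data used to fit the model
r2_test = model.score(X_test, y_test)     # R^2 on held-out data
print(r2_train, r2_test)
```

The gap between the training and test scores is the signature of overfitting: the extra coefficients are used to fit noise in the training sample.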

A popular technique for controlling overfitting is *regularization*, which involves adding a *penalty* term to the error or loss function to discourage the coefficients from reaching large values. In simple terms, regularization is a penalty mechanism that applies shrinkage to model parameters (bringing them close to zero), in order to create a model with better predictive accuracy and interpretability. Regularized regression has two advantages over linear regression:

*Predictive accuracy*

- a model that performs better on the test data is one that has generalized from the training data. A model with too many parameters may attempt to fit noise specific to the training data. By shrinking some coefficients or setting them to zero, we give up the ability to fit complex models (higher bias) in exchange for a more generalizable model (lower variance).

*Interpretation*

- a large number of predictors can complicate the interpretation or communication of the overall picture of results. It may be preferable to sacrifice some detail to limit the model to a smaller subset of parameters with the strongest effects.

Common ways to regularize a linear regression model are these:

*L1 or Lasso regularization*

- *Lasso regularization* performs *L1 regularization* by adding a multiple of the sum of the absolute values of the coefficients to the cost function (SQR/RSS) of the linear regression, as mentioned previously. The equation for Lasso regularization can be represented as follows:

$\text{Cost Function} = SQR + \gamma \times \sum_{j=1}^{p} |\beta_{j}|$

- L1 regularization can lead to zero coefficients (i.e., some features are completely ignored when computing the output). The higher the value of $\gamma$, the more coefficients are shrunk to zero. This can eliminate some features entirely and give us a subset of predictors, reducing the complexity of the model. Thus, Lasso regression not only helps reduce overfitting but can also help with feature selection. Predictors whose coefficients are not shrunk to zero are the ones the model treats as important, and, in this way, L1 regularization allows feature selection (sparse selection). The regularization parameter ($\gamma$) can be controlled, and a value of zero for $\gamma$ produces the basic linear regression equation.

A lasso regression model can be built using the *Lasso* class from the **sklearn** Python package, as we will show below:

```Python
from sklearn.linear_model import Lasso
model = Lasso()
model.fit(X, y)
```
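The sparsity effect can be illustrated with a short sketch. In scikit-learn, the `alpha` argument of `Lasso` plays the role of $\gamma$ (note that scikit-learn also divides the squared-error term by twice the number of samples, so the numeric values are not directly comparable to the equation above). The synthetic data and the `alpha` values tried are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
# Only the first three features influence y; the other seven are irrelevant
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

# Count how many coefficients are exactly zero as the penalty grows
for alpha in [0.01, 0.1, 1.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    n_zero = int(np.sum(model.coef_ == 0))
    print(alpha, n_zero)

weak = Lasso(alpha=0.01).fit(X, y)    # light penalty
strong = Lasso(alpha=1.0).fit(X, y)   # heavy penalty
print(weak.coef_)
print(strong.coef_)
```

A heavier penalty should zero out at least as many coefficients as a lighter one, and the surviving coefficients are pulled toward zero.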

*L2 or Ridge regularization*

- *Ridge regression* performs *L2 regularization* by adding a multiple of the sum of the squared coefficients to the cost function of the linear regression. The equation for Ridge regularization can be represented like this:

$\text{Cost Function} = SQR + \gamma \times \sum_{j=1}^{p} \beta_{j}^{2}$

- Ridge regression places restrictions on the coefficients. The penalty parameter ($\gamma$) regularizes the coefficients so that, if they take on large values, the optimization function is penalized. Thus, Ridge regression shrinks the coefficients and helps reduce model complexity. Shrinking the coefficients leads to lower variance and can lead to a lower error on new data. Ridge regression therefore reduces the complexity of the model but does not reduce the number of variables; it only diminishes their effect. When $\gamma$ is close to zero, the cost function becomes similar to the linear regression cost function. Therefore, the weaker the restrictions on the features (low $\gamma$), the more the model will resemble a linear regression model.

A Ridge regression model can be built using the *Ridge* class from the **sklearn** Python package, as shown in the following code:

```Python
from sklearn.linear_model import Ridge
model = Ridge()
model.fit(X, y)
```
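As an illustrative sketch of this behavior, the following compares Ridge fits at several penalty strengths with an ordinary least squares fit. The `alpha` argument of scikit-learn's `Ridge` plays the role of $\gamma$; the synthetic data and the `alpha` values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
y = X @ np.array([4.0, -3.0, 2.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)

# Track the size of the coefficient vector as the penalty grows
norms = {}
for alpha in [1e-6, 10.0, 1000.0]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    norms[alpha] = float(np.linalg.norm(ridge.coef_))
    n_exact_zero = int(np.sum(ridge.coef_ == 0))
    print(alpha, norms[alpha], n_exact_zero)

# With a penalty close to zero, Ridge is nearly identical to linear regression
near_ols = Ridge(alpha=1e-6).fit(X, y)
print(np.max(np.abs(near_ols.coef_ - ols.coef_)))
```

The coefficients shrink as `alpha` grows, but unlike with Lasso they are generally not set exactly to zero, so all variables remain in the model.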

*Elastic net*

- *Elastic nets* add regularization terms to the model, combining L1 and L2 regularization, as shown in the equation:

$\text{Cost Function} = SQR + \gamma \times \left(\frac{1 - \alpha}{2} \times \sum_{j=1}^{p} \beta_{j}^{2} + \alpha \times \sum_{j=1}^{p} |\beta_{j}|\right)$
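An elastic net model can be sketched with the `ElasticNet` class from **sklearn**. Note the naming clash: scikit-learn's `alpha` argument corresponds to $\gamma$ in the equation (overall penalty strength), while its `l1_ratio` argument corresponds to $\alpha$ (the mix between L1 and L2); scikit-learn also scales the squared-error term by the number of samples. The synthetic data and parameter values below are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 8))
# Only the first two features influence y
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Same overall strength, different L1/L2 mix
mostly_l2 = ElasticNet(alpha=0.5, l1_ratio=0.1).fit(X, y)  # behaves more like Ridge
mostly_l1 = ElasticNet(alpha=0.5, l1_ratio=0.9).fit(X, y)  # behaves more like Lasso

zeros_l2 = int(np.sum(mostly_l2.coef_ == 0))
zeros_l1 = int(np.sum(mostly_l1.coef_ == 0))
print(zeros_l2, zeros_l1)
print(mostly_l2.coef_)
print(mostly_l1.coef_)
```

Raising `l1_ratio` moves the model toward Lasso-like behavior, with more coefficients set exactly to zero; lowering it moves the model toward Ridge-like shrinkage.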

