# Supervised Learning: Regression

Regression-based machine learning is a predictive form of modeling in which the goal is to model the relationship between a target and one or more predictor variables in order to estimate a continuous set of possible outcomes. Regression models are among the most widely used machine learning models in finance.

One of the areas of focus for analysts in financial institutions (and finance in general) is predicting investment opportunities, typically by predicting asset prices and returns. Regression-based machine learning models are inherently well-suited in this context. They help financial and investment managers understand the properties of the predicted variable and its relationship to other variables, and also help them identify significant factors that drive asset returns. This helps investors estimate return profiles, trading costs, technical and financial infrastructure investments required, and consequently the risk profile and profitability of a strategy or portfolio.

With the availability of large amounts of data and processing techniques, regression-based machine learning is not limited to asset price prediction. These models are applied to a wide range of areas within finance, including portfolio management, insurance and instrument pricing, hedging, and risk management.

We will cover three areas of finance with the case studies, namely asset price prediction, instrument pricing, and portfolio management. All the case studies follow the seven-step model development process presented above; these steps include:
1. defining the problem;
2. loading the data;
3. performing exploratory analysis;
4. data preparation;
5. model evaluation;
6. feature engineering;
7. model tuning.

A substantial number of asset modeling and prediction problems in the financial industry involve a time component and the estimation of a continuous output. As such, it is also important to address *time series models*. In its broadest form, time series analysis is about inferring what has happened to a series of data points in the past and attempting to predict what will happen to it in the future. There has been much comparison and debate in academia and industry regarding the differences between supervised regression and time series models. Most time series models are *parametric* (i.e., a known function is assumed to represent the data), while most supervised regression models are *nonparametric*. Time series models primarily use historical values of the predicted variable for prediction, whereas supervised learning algorithms use *exogenous variables* as predictors. However, supervised regression can incorporate historical data of the predicted variable through a time-delay approach (that is, by using lagged values as features), and a time series model (such as ARIMAX) can use exogenous variables for prediction. Thus, time series and supervised regression models are similar in that both can use exogenous variables as well as historical data of the predicted variable to make predictions. In terms of the final output, both estimate a continuous set of possible outcomes for a variable.

Since time series models are more closely aligned with supervised regression than with supervised classification, we will cover the concepts of time series models here, but separately. We will also demonstrate how we can use time series models with financial data to predict future values. Additionally, some deep learning models (such as LSTM) can be used directly for time series forecasting.
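The time-delay approach mentioned above can be sketched in a few lines. This is a minimal illustration, not a method prescribed by the case studies: the helper name `make_lagged_dataset`, the toy `prices` list, and the choice of two lags are all assumptions made for the example.

```python
import numpy as np

def make_lagged_dataset(series, n_lags):
    """Turn a 1-D series into (X, y): each row of X holds the n_lags values preceding y."""
    series = np.asarray(series, dtype=float)
    # Column 0 is the oldest lag, the last column is the most recent lag
    X = np.column_stack(
        [series[lag : len(series) - n_lags + lag] for lag in range(n_lags)]
    )
    y = series[n_lags:]
    return X, y

# Hypothetical toy data, used only to show the resulting shapes
prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
X, y = make_lagged_dataset(prices, n_lags=2)
print(X)  # rows of [y_{t-2}, y_{t-1}]
print(y)  # the matching targets y_t
```

Once the series is in this form, any supervised regression model can be trained on `X` and `y`, and exogenous variables can be appended to `X` as extra columns.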

In “Case Study 1: Stock Price Prediction,” we will demonstrate one of the most common forecasting problems in finance: predicting stock returns. In addition to accurately predicting future stock prices, the purpose of this case study is to examine a machine learning-based framework for predicting a general asset class in finance. In it, we will explore several machine learning and time series concepts, and also focus on visualization and model tuning.

In “Case Study 2: Derivatives Pricing,” we will dive into derivatives pricing using supervised regression and show how machine learning techniques can be applied to traditional quantitative analysis problems. When compared to traditional derivatives pricing models, machine learning techniques can lead to faster pricing without relying on several impractical assumptions. Efficient numerical computation using machine learning could be increasingly useful in areas such as financial risk management, where a trade-off between efficiency and accuracy is often unavoidable.

In “Case Study 3: Investor Risk Tolerance and Robo Advisors,” we will illustrate a supervised regression-based framework for estimating investor risk tolerance. In this case study, we will build a robo advisor dashboard in Python and implement the risk tolerance prediction model in it. We will demonstrate how such models can lead to the automation of portfolio management processes, including the use of robo advisors for investment management. The purpose is to illustrate how machine learning can be used to overcome the problems of traditional risk tolerance profiling and risk tolerance questionnaires, which suffer from several behavioral biases.

In “Case Study 4: Forward Curve Forecasting,” we will use a supervised regression-based framework to forecast multiple tenors of the forward curve simultaneously. We will demonstrate how machine learning models can produce several tenors at the same time to model the yield curve.

---

What we will do next

- applying and comparing different time series and machine learning models;
- interpreting the models and results. Understanding the potential for overfitting and underfitting and the intuition behind linear versus nonlinear models;
- preparing and transforming data to be used in machine learning models;
- selecting and engineering features to improve model performance;
- using data visualization and exploration to understand the outputs;
- tuning algorithms to improve model performance. Understanding, implementing, and tuning time series models, such as ARIMA, for forecasting;
- structuring a problem statement related to portfolio management and behavioral finance in a regression-based machine learning framework;
- understanding how deep learning-based models, such as LSTM, can be used for time series forecasting.

---

## Time Series Models

A *time series* is a sequence of numbers that are ordered by a time index. Let's look at the following aspects of time series models, which we will delve into later in the case studies:

- the components of a time series;
- the autocorrelation and stationarity of time series;
- traditional time series models (such as ARIMA);
- the use of deep learning models for time series modeling;
- the conversion of time series data into a supervised learning framework.

### The Parts of a Time Series

A time series can be divided into the following components:

*Trend Component*
- A trend is a consistent directional movement in a time series. Trends are either *deterministic* or *stochastic*. A deterministic trend has an underlying rationale that we can identify, while a stochastic trend is a random feature of the series that we are unlikely to be able to explain. Trends commonly appear in financial series, and many trading models use sophisticated trend-identification algorithms.

*Seasonal Component*
- Many time series contain seasonal variation. This is especially true for time series that represent business sales or weather levels. In quantitative finance, we often see seasonal variation, especially in series related to the holiday season or annual temperature variation (such as natural gas).

We can write the components of a time series $y_t$ as:

$y_t = S_t + T_t + R_t$

where $S_t$ is the seasonal component, $T_t$ is the trend component, and $R_t$ represents the remaining component of the time series that was not captured by the other two components.

##### Implementation

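```Python
import statsmodels.api as sm

# y: a time series of weekly observations; period=52 assumes a yearly seasonal cycle.
# `period` replaces the deprecated `freq` argument of earlier statsmodels releases.
sm.tsa.seasonal_decompose(y, period=52).plot()
```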
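To make the additive form $y_t = S_t + T_t + R_t$ concrete, the following sketch performs a simple hand-rolled decomposition with NumPy on a synthetic series. It is an illustration of the idea under stated assumptions, not the statsmodels implementation: the seasonal period of 12, the synthetic data, and all variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
period = 12                      # assumed seasonal period (e.g., monthly data)
n = 10 * period
t = np.arange(n)

# Synthetic series: linear trend + sinusoidal seasonality + noise
y = 0.5 * t + 10 * np.sin(2 * np.pi * t / period) + rng.normal(0, 1, n)

# Trend T_t: centered moving average spanning one full seasonal cycle
weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
trend = np.convolve(y, weights, mode="valid")   # defined away from the edges only
half = period // 2
detrended = y[half : n - half] - trend

# Seasonal S_t: average detrended value at each position in the cycle,
# shifted so that the seasonal effects sum to zero over one cycle
positions = t[half : n - half] % period
seasonal_profile = np.array(
    [detrended[positions == k].mean() for k in range(period)]
)
seasonal_profile -= seasonal_profile.mean()
seasonal = seasonal_profile[positions]

# Remainder R_t: whatever trend and seasonality do not explain
remainder = detrended - seasonal
```

By construction, `trend + seasonal + remainder` reproduces the observed values wherever the moving average is defined, which mirrors the additive formula.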

### Autocorrelation and Stationarity

When we are given one or more time series, it is relatively straightforward to decompose them into trend, seasonality, and residual components. However, there are other aspects that come into play when working with time series data, especially in finance.

#### Autocorrelation

There are many situations in which consecutive elements of a time series exhibit correlation. That is, the behavior of sequential points in the series affects one another in a dependent manner. *Autocorrelation* is the similarity between observations as a function of the time lag between them. Such relationships can be modeled using an autoregression model. The term *autoregression* indicates that it is a regression of the variable against itself.

In an autoregression model, we forecast the variable of interest using a linear combination of past values of the variable.

Thus, an autoregressive model of order *p* can be written as:

$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \epsilon_t$

where $\epsilon_t$ is white noise. An autoregressive model is like a multiple regression, but with lagged values of $y_t$ as predictors. We refer to this as an *AR(p)* model, an autoregressive model of order *p*. Autoregressive models are remarkably flexible at handling a wide range of different time series patterns.
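As a rough illustration of the autoregressive model above, the sketch below simulates an AR(2) process, measures its lag-1 autocorrelation, and recovers the coefficients by ordinary least squares. The parameter values, the sample size, and the helper `autocorrelation` are assumptions chosen for the example; dedicated time series libraries would normally be used for estimation.

```python
import numpy as np

rng = np.random.default_rng(1)
c, phi1, phi2 = 0.5, 0.5, 0.3     # assumed "true" parameters, for illustration only
n = 5000

# Simulate y_t = c + phi1 * y_{t-1} + phi2 * y_{t-2} + eps_t
eps = rng.normal(size=n)
y = np.zeros(n)
for t in range(2, n):
    y[t] = c + phi1 * y[t - 1] + phi2 * y[t - 2] + eps[t]

def autocorrelation(series, lag):
    """Sample autocorrelation of a 1-D series at a positive lag."""
    x = np.asarray(series, dtype=float) - np.mean(series)
    return float(np.dot(x[lag:], x[:-lag]) / np.dot(x, x))

lag1_autocorr = autocorrelation(y, 1)   # clearly positive for this process

# Fit the AR(2) model by least squares: regress y_t on [1, y_{t-1}, y_{t-2}]
X = np.column_stack([np.ones(n - 2), y[1:-1], y[:-2]])
c_hat, phi1_hat, phi2_hat = np.linalg.lstsq(X, y[2:], rcond=None)[0]

print(lag1_autocorr, c_hat, phi1_hat, phi2_hat)
```

With a sample this long, the estimated coefficients should land near the values used in the simulation, which is the sense in which an AR model is "a multiple regression on lagged values."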

#### Stationarity

A time series is considered stationary if its statistical properties do not change over time. A time series with a trend or seasonality is therefore not stationary, since the trend or seasonality affects the value of the series at different times. A white noise series, on the other hand, is stationary: it does not matter when you observe it, as it looks much the same at any point in time.

The figure below shows some examples of stationary and non-stationary series.

<figure>
    <img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pJUAANRS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1147/1%2Am0E2_nOE1oFhMc1L01oKsg.png" width="600">
    <figcaption>Stationary and Non-stationary graphs</figcaption>
</figure>

In the second graph, we can clearly see that the mean changes (increases) over time, resulting in an upward trend. Thus, it is a non-stationary series. For a series to be classified as stationary, it should not exhibit a trend. Moving on to the third graph, we do not see a trend in the series, but its variance is a function of time. A stationary series should have constant variance; therefore, this series is also non-stationary. In the last graph, the spread narrows as time increases, suggesting that the covariance is a function of time.

Looking at the first graph, the mean, variance, and covariance are constant over time. This is what a stationary time series looks like. Predicting future values of such a series is easier. Most statistical models require the series to be stationary in order to make effective and accurate predictions.

The two main reasons behind the non-stationarity of a time series are trend and seasonality. To use time series forecasting models, we typically convert any non-stationary series to stationary, which makes modeling easier since the statistical properties do not change over time.
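A crude way to see what "statistical properties do not change over time" means is to compare summary statistics across different parts of a series. The sketch below does this for a noise series and a trending series. It is only an informal diagnostic under assumed synthetic data (the helper `half_stats` is hypothetical); formal checks, such as the augmented Dickey-Fuller test, are normally preferred.

```python
import numpy as np

def half_stats(series):
    """Return (mean, variance) for the first and the second half of a series."""
    x = np.asarray(series, dtype=float)
    first, second = x[: len(x) // 2], x[len(x) // 2 :]
    return (first.mean(), first.var()), (second.mean(), second.var())

rng = np.random.default_rng(5)
n = 2000
white_noise = rng.normal(size=n)                      # stationary by construction
trending = 0.01 * np.arange(n) + rng.normal(size=n)   # mean drifts upward over time

print(half_stats(white_noise))   # similar mean and variance in both halves
print(half_stats(trending))      # the mean differs markedly between halves
```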

#### Differencing

Differencing is one of the methods used to transform a time series into a stationary series. In this method, we compute the difference between consecutive terms in the series. Differencing is typically performed to remove a varying mean. Mathematically, it can be expressed as:

$y'_t = y_t - y_{t-1}$

where $y_t$ is the value at time *t* and $y'_t$ is the differenced value.

When the differenced series is white noise, the original series is referred to as a non-stationary series of degree one.
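The effect of taking the difference of consecutive terms can be checked on a random walk, a standard example of a series whose first difference is pure noise. This is a minimal sketch on synthetic data; the starting level of 100 and the variable names are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
shocks = rng.normal(size=1000)

# Random walk: each value is the previous value plus a random shock,
# so the level wanders and the series is non-stationary
random_walk = 100 + np.cumsum(shocks)

# First difference: value at time t minus value at time t-1
differenced = np.diff(random_walk)

# The differenced series is exactly the sequence of shocks (minus the first one)
print(np.allclose(differenced, shocks[1:]))
```

Note that one observation is lost each time the series is differenced, because the first value has no predecessor.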

### Traditional Time Series Models

There are many ways to model a time series for forecasting. Most time series models aim to incorporate the trend, seasonality, and remainder components, while also addressing the autocorrelation and stationarity embedded in the time series. For example, the autoregressive (AR) model discussed earlier addresses autocorrelation in a time series.

One of the most widely used models in time series forecasting is the ARIMA model.

#### ARIMA

If we combine stationarity with autoregression and a moving average model, we will obtain an *ARIMA* model, which is an acronym for AutoRegressive Integrated Moving Average, and has the following components:

*AR(p)*
- represents autoregression, that is, the regression of the time series onto itself, as we saw earlier, with the assumption that the current values of the series depend on their previous values with some lag (or several lags). The maximum lag in the model is referred to as *p*.

*I(d)*
- represents the order of integration. It is simply the number of differences needed to make the series stationary.

*MA(q)*
- represents the moving average. Without going into detail, it models the error of the time series; again, the assumption is that the current error depends on previous errors with some lag, which is referred to as *q*.

The moving average equation is expressed as:

$y_t = c + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q}$

where $\epsilon_t$ is white noise. We refer to this as an *MA(q)* model, a moving average model of order *q*. Combining all the components, the full ARIMA model can be expressed as:

$y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q} + \epsilon_t$

where $y'_t$ is the differenced series (it may have been differenced more than once). The predictors on the right-hand side include both the lagged values of $y'_t$ and the lagged errors. We call this an *ARIMA(p, d, q)* model, where *p* is the order of the autoregressive part, *d* is the degree of first differencing involved, and *q* is the order of the moving average part. The same stationarity and invertibility conditions used for autoregressive and moving average models also apply to the ARIMA model.
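To show how the three parts fit together, the sketch below simulates an ARIMA(1, 1, 1) series by hand: it first generates the differenced series (called `dy` here) from one autoregressive term and one moving average term, and then integrates it once. The parameter values and names are assumptions for illustration, and no estimation is performed.

```python
import numpy as np

rng = np.random.default_rng(3)
c, phi, theta = 0.0, 0.6, 0.4     # assumed parameters, chosen only for illustration
n = 500

eps = rng.normal(size=n)          # noise terms
dy = np.zeros(n)                  # the differenced (stationary) series
for t in range(1, n):
    # AR(1) term on the previous value + MA(1) term on the previous error + new error
    dy[t] = c + phi * dy[t - 1] + theta * eps[t - 1] + eps[t]

# "Integrated" with d = 1: cumulative sums undo one round of differencing
y = np.cumsum(dy)

# Differencing the simulated series recovers the stationary part
print(np.allclose(np.diff(y), dy[1:]))
```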

##### Implementation

```Python
from statsmodels.tsa.arima.model import ARIMA

# order=(p, d, q): one autoregressive lag, no differencing, no moving average terms
model = ARIMA(endog=y_train, order=(1, 0, 0))
```

The ARIMA family of models has several variants, and some of them are as follows:

*ARIMAX*
- ARIMA models with exogenous variables included. We will use this model in case study 1.

*SARIMA*
- the "S" here stands for seasonality, and this model aims to model the seasonality component embedded in the time series, along with other components.

*VARMA*
- is the extension of the model to the multivariate case, where there are many variables to be predicted simultaneously. We do this in case study 4.

### Deep Learning Approach to Time Series Modeling

Traditional time series models, such as ARIMA, are well understood and effective for many problems. However, these traditional methods also suffer from several limitations. They are linear functions, or simple transformations of linear functions; they require manually diagnosed parameters, such as time dependence; and they do not perform well with corrupted or missing data.

If we look at the advances in deep learning for time series forecasting, we see that *recurrent neural networks* (RNNs) have been gaining more and more attention. These methods can identify structure and patterns such as nonlinearity, can seamlessly model problems with multiple input variables, and are relatively robust to missing data. RNN models can retain state from one iteration to the next by using their own output as input for the next step. These deep learning models can be referred to as time series models, as they can make future predictions using past data points, similar to traditional time series models such as ARIMA. Therefore, there is a wide range of applications in finance where these deep learning models can be leveraged. Let’s take a look at deep learning models for time series forecasting.

#### RNNs

Recurrent neural networks (RNNs) are called "recurrent" because they perform the same task for each element of a sequence, with the output dependent on previous computations. RNN models have a memory, which captures information about what has already been computed. A recurrent neural network can be thought of as multiple copies of the same network, with each copy passing a message to the next.
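The idea of "the same network applied at every step, passing a message forward" can be sketched as a plain NumPy forward pass. This is an untrained toy under stated assumptions: the weights are random, the sizes are arbitrary, and the names `W_x`, `W_h`, and `rnn_forward` are hypothetical, chosen only to show how the hidden state carries information along the sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 1, 4

# Randomly initialised weights, shared across all time steps (untrained)
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

def rnn_forward(sequence):
    """Run the same cell over every element, carrying the hidden state forward."""
    h = np.zeros(hidden_size)
    states = []
    for x_t in sequence:
        # The new state depends on the current input and on the previous state
        h = np.tanh(W_x @ np.atleast_1d(x_t) + W_h @ h + b)
        states.append(h)
    return np.array(states)

states = rnn_forward([0.1, 0.2, 0.3])
print(states.shape)  # one hidden state per element of the sequence
```

Because each state feeds into the next, changing an early input changes every later state, which is the "memory" described above.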

#### Long Short-Term Memory (LSTM)

Long short-term memory (LSTM) is a special type of RNN explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is almost the default behavior of an LSTM model. These models are composed of a set of cells with features to memorize the sequence of data. The cells capture and store the data streams. In addition, they interconnect a module from the past with another from the present to transmit information from multiple instances of the past to the present. Due to the use of gates in each cell, the data in each of them can be eliminated, filtered or added to the following cells.

*Gates*, which are based on neural network layers, allow cells to optionally let data pass through or be discarded. Each layer outputs numbers between zero and one, describing how much of each segment of data should be let through in each cell. More precisely, a value of zero means “let nothing through,” and a value of one means “let everything through.” Three types of gates are involved in each LSTM, each intended to control the state of each cell:

*Forget Gate*
- outputs a number between zero and one, where one implies “keep this completely” and zero implies “ignore this completely.” This gate conditionally decides whether the past should be forgotten or preserved.

*Input Gate*
- chooses what new data should be stored in the cell.

*Output Gate*
- decides what will be output by each cell. The output value is based on the cell state along with the filtered and newly added data.
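The three gates can be sketched as a single LSTM step in NumPy, following the commonly used textbook formulation. This is an illustrative, untrained cell and not the internals of any particular library: the stacked weight layout, the random weights, and the names `lstm_step`, `W`, and `b` are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x_t] to the four stacked pre-activations."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0 * hidden : 1 * hidden])   # forget gate: how much old cell state to keep
    i = sigmoid(z[1 * hidden : 2 * hidden])   # input gate: how much new information to store
    o = sigmoid(z[2 * hidden : 3 * hidden])   # output gate: how much cell state to expose
    g = np.tanh(z[3 * hidden : 4 * hidden])   # candidate values that may be added
    c_t = f * c_prev + i * g                  # updated cell state
    h_t = o * np.tanh(c_t)                    # output of the cell
    return h_t, c_t, (f, i, o)

rng = np.random.default_rng(0)
input_size, hidden_size = 1, 3
W = rng.normal(scale=0.5, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)

# Feed a short sequence through the cell, carrying h and c forward
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x in [0.5, -0.1, 0.3]:
    h, c, gates = lstm_step(np.array([x]), h, c, W, b)
```

Every gate value lies strictly between zero and one because it comes from a sigmoid, which is what lets a gate act as a soft switch on the data flowing through the cell.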

Keras wraps the libraries and functions for efficient numerical computation and allows us to define and train LSTM neural network models with just a few lines of code. In the following code, the LSTM module from *keras.layers* is used to implement the LSTM network. The network is trained with the *X_train_LSTM* variable. The network has one hidden layer with fifty LSTM blocks or neurons and one output layer that predicts a single value.

##### Implementation

