# The overfitting problem

![](<src/09_Table_The Overfitting.png>)

## Load the data

In [1]:
import pandas as pd

df = pd.read_excel('data/Microsoft_LinkedIn_Processed.xlsx', parse_dates=['Date'], index_col=0)
df

Date,Open,High,Low,Close,Volume,change_tomorrow,change_tomorrow_direction
2016-12-08,56.325228,56.582507,55.902560,56.058762,21220800,-1.549143,DOWN
2016-12-09,56.214968,56.959234,56.169027,56.940857,27349400,-0.321692,DOWN
...,...,...,...,...,...,...,...
2023-03-14,256.750000,261.070007,255.860001,260.790009,33620300,-1.751806,DOWN
2023-03-15,259.980011,266.480011,259.209991,265.440002,46028000,-3.895731,DOWN


## Machine Learning Model

### Separate the data

1. Target: which variable do you want to predict?
2. Explanatory: which variables will you use to calculate the prediction?

In [2]:
target = df.change_tomorrow
explanatory = df[['Open','High','Low','Close','Volume']]

### Compute the model

The following Python code fits a decision tree regressor: it learns the split rules the model will use to predict tomorrow's change (`change_tomorrow`), from which we can tell whether the ticker is expected to go UP or DOWN.

In [3]:
from sklearn.tree import DecisionTreeRegressor

model_dt = DecisionTreeRegressor(max_depth=15)
model_dt.fit(explanatory, target)

### Calculate the predictions

In [4]:
y_pred = model_dt.predict(X=explanatory)
y_pred

array([-0.08195771, -0.33014797, -0.08195771, ..., -0.31510252,
       -0.31510252, -3.8957311 ])

In [5]:
df_predictions = df[['change_tomorrow']].copy()
df_predictions['prediction'] = y_pred
df_predictions

Date,change_tomorrow,prediction
2016-12-08,-1.549143,-0.081958
2016-12-09,-0.321692,-0.330148
...,...,...
2023-03-14,-1.751806,-0.315103
2023-03-15,-3.895731,-3.895731


### Evaluate the model: compare predictions with reality

In [6]:
model_dt.score(X=explanatory, y=target)

0.5431115880820776
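
For a regressor, `score` returns the coefficient of determination R², not an accuracy. The sketch below uses synthetic arrays (`X_demo` and `y_demo` are made up for illustration, not the Microsoft data) to show that `score` agrees with `sklearn.metrics.r2_score` and with the textbook formula.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] + rng.normal(scale=0.5, size=200)

model_demo = DecisionTreeRegressor(max_depth=3, random_state=0)
model_demo.fit(X_demo, y_demo)
y_hat = model_demo.predict(X_demo)

# Three ways to obtain the same number
r2_from_score = model_demo.score(X_demo, y_demo)
r2_from_metric = r2_score(y_demo, y_hat)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - y_hat) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_from_score, r2_from_metric, r2_manual)
```

An R² measured on the same rows used for fitting says little about how the model behaves on data it has not seen.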

## Train test split

### Split the dataset

- Imagine we can only train on the first 70% of the rows (up to 2021-04-28 in this dataset). How good would the model have been going forward?

In [7]:
y = df.change_tomorrow

In [8]:
X = df.drop(columns=['change_tomorrow', 'change_tomorrow_direction'])

In [9]:
n_days = len(df.index)

In [10]:
n_days_split = int(n_days*0.70)

In [11]:
X_train, y_train = X.iloc[:n_days_split], y.iloc[:n_days_split]
X_test, y_test = X.iloc[n_days_split:], y.iloc[n_days_split:]
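
The split above is positional rather than random, which keeps the time order intact. A small sketch on a made-up frame (`df_demo` and its dates are hypothetical) that checks the boundary dates of such a split:

```python
import numpy as np
import pandas as pd

# Synthetic daily data for illustration only (not the Microsoft data)
dates = pd.bdate_range('2020-01-01', periods=100)
df_demo = pd.DataFrame({'Close': np.linspace(100, 120, 100)}, index=dates)

# First 70% of the rows for training, the rest for testing
n_split = int(len(df_demo) * 0.70)
train_demo, test_demo = df_demo.iloc[:n_split], df_demo.iloc[n_split:]

# Every training date precedes every test date, so no future rows reach the training set
last_train_date = train_demo.index.max()
first_test_date = test_demo.index.min()
print(last_train_date, first_test_date)
```

A shuffled split would mix future rows into the training set, which tends to make the test error look better than it would be in live use.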

### Fit the model on train set

In [12]:
model_dt_split = DecisionTreeRegressor(max_depth=15)

In [13]:
model_dt_split.fit(X=X_train, y=y_train)

### Calculate predictions on test set

In [14]:
y_pred = model_dt_split.predict(X=X_test)

In [15]:
df_predictions = y_test.to_frame()
df_predictions['prediction'] = y_pred
df_predictions

Date,change_tomorrow,prediction
2021-04-29,0.130867,-3.208841
2021-04-30,0.127049,0.477505
...,...,...
2023-03-14,-1.751806,2.910909
2023-03-15,-3.895731,2.882601


### Evaluate model

#### On test set

In [16]:
from sklearn.metrics import mean_squared_error

y_pred_test = model_dt_split.predict(X=X_test)
mean_squared_error(y_true=y_test, y_pred=y_pred_test)

9.76884461927318

#### On train set

In [17]:
y_pred_train = model_dt_split.predict(X=X_train)
mean_squared_error(y_true=y_train, y_pred=y_pred_train)

0.7335232619210789
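
The gap between the two errors (about 0.73 on train versus 9.77 on test) is the signature of overfitting. The sketch below runs on synthetic noise, not the real data, to suggest how tree depth drives that gap; `fit_and_score` is a helper name introduced here for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration: the target is pure noise, so there is nothing to learn
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 5))
y_demo = rng.normal(size=500)

# Positional 70/30 split, mirroring the notebook
X_tr, y_tr = X_demo[:350], y_demo[:350]
X_te, y_te = X_demo[350:], y_demo[350:]

def fit_and_score(max_depth):
    """Return (train MSE, test MSE) for a tree of the given depth."""
    model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    model.fit(X_tr, y_tr)
    mse_train = mean_squared_error(y_tr, model.predict(X_tr))
    mse_test = mean_squared_error(y_te, model.predict(X_te))
    return mse_train, mse_test

errors = {depth: fit_and_score(depth) for depth in (2, 5, 15)}
for depth, (mse_train, mse_test) in errors.items():
    print(f'max_depth={depth:>2}  train MSE={mse_train:.3f}  test MSE={mse_test:.3f}')
```

A deeper tree can memorise the training rows, so its train error drops while its test error typically does not follow.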

## Backtesting

In [18]:
from backtesting import Backtest

### Create library for your strategies
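
The contents of the `strategies` module are not shown in this notebook. As a rough sketch only, assuming that `SimpleRegression` buys when the predicted change exceeds `limit_buy` and sells when it falls below `limit_sell`, the decision rule could look like the function below. `signal_from_prediction` is a hypothetical name, and the real class would have to subclass `backtesting.Strategy`.

```python
def signal_from_prediction(prediction, limit_buy=1, limit_sell=-5):
    """Map a predicted change to a trading action (hypothetical rule)."""
    if prediction > limit_buy:
        return 'BUY'
    if prediction < limit_sell:
        return 'SELL'
    return 'HOLD'

# Example predictions (made-up values)
signals = [signal_from_prediction(p) for p in (2.9, 0.4, -6.1)]
print(signals)
```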

### Import the library

In [19]:
import strategies

### Run the backtest on `test` data

In [23]:
bt = Backtest(X_test, strategies.SimpleRegression,
              cash=10000, commission=.002, exclusive_orders=True)

results = bt.run(model=model_dt_split, limit_buy=1, limit_sell=-5)

df_results_test = results.to_frame(name='Values').loc[:'Return [%]']\
    .rename({'Values':'Out of Sample (Test)'}, axis=1)

### Run the backtest on `train` data

In [27]:
bt = Backtest(X_train, strategies.SimpleRegression,
              cash=10000, commission=.002, exclusive_orders=True)

results = bt.run(model=model_dt_split, limit_buy=1, limit_sell=-5)

df_results_train = results.to_frame(name='Values').loc[:'Return [%]']\
    .rename({'Values':'In Sample (Train)'}, axis=1)

### Compare both backtests

In [28]:
df_results = pd.concat([df_results_test, df_results_train], axis=1)

In [29]:
df_results

,Out of Sample (Test),In Sample (Train)
Start,2021-04-29 00:00:00,2016-12-08 00:00:00
End,2023-03-15 00:00:00,2021-04-28 00:00:00
Duration,685 days 00:00:00,1602 days 00:00:00
Exposure Time [%],93.446089,73.798731
Equity Final [$],10346.080656,16728.867219
Equity Peak [$],13301.417623,17110.721542
Return [%],3.460807,67.288672


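## The overfitting problem in backtesting

The model looks far better on the data it was trained on than on unseen data: the mean squared error rises from about 0.73 (train) to 9.77 (test), and the backtest return falls from 67.29% (in sample) to 3.46% (out of sample). The in-sample backtest is therefore an overly optimistic estimate of how the strategy would have performed.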

## Which Machine Learning techniques solve the overfitting problem?

- Choose the best hyperparameters for the model
- Evaluate other Machine Learning models
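
As one possible way to choose hyperparameters, the sketch below runs a grid search with time-ordered cross-validation. The data is synthetic and the grid values are arbitrary assumptions, not recommendations for the Microsoft dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeRegressor

# Synthetic, time-ordered data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(size=300)

# TimeSeriesSplit keeps each validation fold after its training fold
search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid={'max_depth': [2, 3, 5, 10, 15], 'min_samples_leaf': [1, 10, 50]},
    scoring='neg_mean_squared_error',
    cv=TimeSeriesSplit(n_splits=5),
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # mean validation MSE of the best combination
```

Scoring on folds that come after the training rows penalises settings that only memorise the past, such as a very deep tree.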

https://algotrading101.com/learn/what-is-overfitting-in-trading/

Ideal world

- Overfitting problem
- Hyperparameter tuning
    - The returns improve
    - Although the model is not better...
- Is there anything else we could do?
    - Walk Forward Testing with hyperparameters
- The backtest still doesn't improve, what shall we do?
    - Other models: Neural Networks LSTM
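
Walk forward testing, mentioned in the list above, can be sketched with scikit-learn's `TimeSeriesSplit`. This is an expanding-window variant on synthetic data, with a fixed `max_depth=3` chosen arbitrarily; it is one possible setup, not the notebook's own implementation.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit
from sklearn.tree import DecisionTreeRegressor

# Synthetic, time-ordered data for illustration only
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(400, 5))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(size=400)

fold_errors = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X_demo):
    # Refit from scratch on everything seen so far, then score the next unseen block
    model = DecisionTreeRegressor(max_depth=3, random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    y_hat = model.predict(X_demo[test_idx])
    fold_errors.append(mean_squared_error(y_demo[test_idx], y_hat))

print(np.round(fold_errors, 3))
```

Each fold yields one out-of-sample error, so the spread across folds gives a sense of how stable the model is over time.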
    
    
Problems?

- The model does not take into account the "short-term" memory of the data...
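
One common way to give a tree-based model some short-term memory is to add lagged values as extra explanatory variables. This goes beyond what the notebook does; the sketch below uses a made-up price series, and the column names `Close_lag_k` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic price series for illustration only
dates = pd.bdate_range('2022-01-03', periods=10)
df_demo = pd.DataFrame({'Close': np.arange(100.0, 110.0)}, index=dates)

# Each lag column holds the Close value from k trading days earlier
for k in (1, 2, 3):
    df_demo[f'Close_lag_{k}'] = df_demo['Close'].shift(k)

# The first rows have no history, so drop them before fitting a model
df_lagged = df_demo.dropna()
print(df_lagged.head())
```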

In [34]:
df_pred_train = pd.DataFrame({
    'y_train': y_train,
    'y_pred_train': y_pred_train,
})

In [35]:
df_pred_test = pd.DataFrame({
    'y_test': y_test,
    'y_pred_test': y_pred_test,
})

In [42]:
df_pred = pd.concat([df_pred_train, df_pred_test]).melt(ignore_index=False)

In [43]:
df_pred

Date,variable,value
2016-12-08,y_train,-1.549143
2016-12-09,y_train,-0.321692
...,...,...
2023-03-14,y_pred_test,2.910909
2023-03-15,y_pred_test,2.882601


In [44]:
import plotly.express as px

In [45]:
px.line(df_pred, x=df_pred.index, y='value', color='variable')