# The overfitting problem

![](<src/09_Table_The Overfitting.png>)

## Load the data

In [2]:
import pandas as pd

df = pd.read_excel('data/Microsoft_LinkedIn_Processed.xlsx', parse_dates=['Date'], index_col=0)
df

| Date | Open | High | Low | Close | Volume | change_tomorrow | change_tomorrow_direction |
|---|---|---|---|---|---|---|---|
| 2016-12-08 | 56.325228 | 56.582507 | 55.902560 | 56.058762 | 21220800 | -1.549143 | DOWN |
| 2016-12-09 | 56.214968 | 56.959234 | 56.169027 | 56.940857 | 27349400 | -0.321692 | DOWN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2023-03-14 | 256.750000 | 261.070007 | 255.860001 | 260.790009 | 33620300 | -1.751806 | DOWN |
| 2023-03-15 | 259.980011 | 266.480011 | 259.209991 | 265.440002 | 46028000 | -3.895731 | DOWN |


## Machine Learning Model

### Separate the data

1. Target: which variable do you want to predict?
2. Explanatory: which variables will you use to calculate the prediction?

In [3]:
target = df.change_tomorrow_direction
explanatory = df[['Open','High','Low','Close','Volume']]

### Compute the model

The following Python code fits the model: it computes the decision rules (split thresholds) of a decision tree that we will use to predict whether the ticker goes UP or DOWN.

In [3]:
from sklearn.tree import DecisionTreeClassifier

model_dt = DecisionTreeClassifier(max_depth=15)
model_dt.fit(explanatory, target)
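
To see what fitting produces, scikit-learn's `export_text` prints the rules a tree has learned. This is a minimal sketch on a toy dataset, not the Microsoft data; the `X_demo` and `y_demo` arrays and the feature names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Toy data (hypothetical): two random features
X_demo = rng.normal(size=(200, 2))
# Toy rule: the label is UP whenever the first feature is positive
y_demo = np.where(X_demo[:, 0] > 0, 'UP', 'DOWN')

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(X_demo, y_demo)

# Text view of the learned thresholds and the class predicted in each leaf
rules = export_text(toy_tree, feature_names=['feature_a', 'feature_b'])
print(rules)
```

With `max_depth=15`, as used above, the printed rule set would be far longer, which hints at how closely a deep tree can follow its training rows.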

### Calculate the predictions

In [4]:
y_pred = model_dt.predict(X=explanatory)
y_pred

array(['UP', 'UP', 'UP', ..., 'UP', 'DOWN', 'UP'], dtype=object)

In [5]:
df_predictions = df[['change_tomorrow_direction']].copy()
df_predictions['prediction'] = y_pred
df_predictions

| Date | change_tomorrow_direction | prediction |
|---|---|---|
| 2016-12-08 | UP | UP |
| 2016-12-09 | UP | UP |
| ... | ... | ... |
| 2023-03-14 | UP | DOWN |
| 2023-03-15 | UP | UP |


### Evaluate the model: compare predictions with reality

In [6]:
model_dt.score(X=explanatory, y=target)

0.8305837563451777
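
For a classifier, `score` returns the mean accuracy: the fraction of rows where the prediction matches the actual label. A minimal sketch with made-up labels (the two arrays are illustrative and not taken from the dataset):

```python
import numpy as np

# Illustrative labels only; not taken from the Microsoft dataset
actual = np.array(['UP', 'DOWN', 'UP', 'UP', 'DOWN'])
predicted = np.array(['UP', 'UP', 'UP', 'DOWN', 'DOWN'])

# Accuracy = share of predictions that match the actual direction
accuracy = (actual == predicted).mean()
print(accuracy)
```

Because the model above is scored on the same rows it was fitted on, this accuracy is measured in sample and tends to be optimistic.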

## Train test split

### Split the dataset

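- Imagine we are at the end of 2020 and can only train the model on data up to 31st December 2020: how good would the model have been going forward? The code below uses a chronological 70/30 split instead of that exact date, so the training set ends on 2021-04-28.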

In [7]:
y = df.change_tomorrow_direction

In [8]:
X = df.drop(columns=['change_tomorrow', 'change_tomorrow_direction'])

In [18]:
n_days = len(df.index)

In [19]:
n_days_split = int(n_days*0.70)

In [20]:
X_train, y_train = X.iloc[:n_days_split], y.iloc[:n_days_split]
X_test, y_test = X.iloc[n_days_split:], y.iloc[n_days_split:]
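
The split above is positional (the first 70% of rows). To cut at a calendar date instead, such as the 31st December 2020 mentioned earlier, label-based slicing on the `DatetimeIndex` is one option. A minimal sketch on a synthetic frame; `prices` and `split_date` are hypothetical stand-ins, not names from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic daily data standing in for df (arbitrary values, for illustration)
dates = pd.date_range('2020-12-28', periods=8, freq='D')
prices = pd.DataFrame({'Close': np.arange(8.0)}, index=dates)

split_date = pd.Timestamp('2020-12-31')  # hypothetical cut-off date

# Label slicing on a sorted DatetimeIndex includes the cut-off day itself
train = prices.loc[:split_date]
# Everything strictly after the cut-off goes to the test set
test = prices.loc[prices.index > split_date]

print(train.index.max(), test.index.min())
```

Either way, the key property is that every training row precedes every test row, so the model never sees the future.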

### Fit the model on train set

In [21]:
model_dt_split = DecisionTreeClassifier(max_depth=15)

In [22]:
model_dt_split.fit(X=X_train, y=y_train)

### Calculate predictions on test set

In [23]:
y_pred = model_dt_split.predict(X=X_test)

In [24]:
df_predictions = y_test.to_frame()
df_predictions['prediction'] = y_pred
df_predictions

| Date | change_tomorrow_direction | prediction |
|---|---|---|
| 2021-04-29 | DOWN | UP |
| 2021-04-30 | DOWN | UP |
| ... | ... | ... |
| 2023-03-14 | UP | DOWN |
| 2023-03-15 | UP | UP |


### Evaluate model

#### On test set

In [26]:
model_dt_split.score(X_test, y_test)

0.4820295983086681

#### On train set

In [27]:
model_dt_split.score(X_train, y_train)

0.8386219401631912
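
The gap between the two scores (about 0.84 on the train set versus about 0.48 on the test set) is the signature of overfitting: a deep tree can memorise its training rows. A minimal sketch of that effect on purely random data, where there is no real pattern to learn; the `X_demo` and `y_demo` arrays and the depth values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Random features and random labels: there is nothing genuine to learn
X_demo = rng.normal(size=(1000, 5))
y_demo = rng.choice(['UP', 'DOWN'], size=1000)

# Same style of split as above: first 70% train, last 30% test
n_split = int(len(X_demo) * 0.70)
X_tr, y_tr = X_demo[:n_split], y_demo[:n_split]
X_te, y_te = X_demo[n_split:], y_demo[n_split:]

scores = {}
for depth in (2, 15, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    # Store (train accuracy, test accuracy) for each depth
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, scores[depth])
```

A high train accuracy on data with no signal shows that the in-sample score alone says little about how the model will behave on unseen days.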

## Backtesting

In [28]:
from backtesting import Backtest



### Create library for your strategies

### Import the library

In [29]:
import strategies

### Run the backtest on `test` data

In [35]:
# Out-of-sample backtest: the model trades on days it never saw during training
bt = Backtest(X_test, strategies.SimpleClassificationUD,
              cash=10000, commission=.002, exclusive_orders=True)

results = bt.run(model=model_dt_split)

results_test = results.to_frame(name='Values').loc[:'Return [%]']\
    .rename({'Values':'Out of Sample (Test)'}, axis=1)

### Run the backtest on `train` data

In [36]:
# In-sample backtest: the same fitted model trades on the days it was trained on
bt = Backtest(X_train, strategies.SimpleClassificationUD,
              cash=10000, commission=.002, exclusive_orders=True)

results = bt.run(model=model_dt_split)

results_train = results.to_frame(name='Values').loc[:'Return [%]']\
    .rename({'Values':'In Sample (Train)'}, axis=1)

### Compare both backtests

In [37]:
df_results = pd.concat([results_test, results_train], axis=1)

In [38]:
df_results

| | Out of Sample (Test) | In Sample (Train) |
|---|---|---|
| Start | 2021-04-29 00:00:00 | 2016-12-08 00:00:00 |
| End | 2023-03-15 00:00:00 | 2021-04-28 00:00:00 |
| Duration | 685 days 00:00:00 | 1602 days 00:00:00 |
| Exposure Time [%] | 99.577167 | 99.818676 |
| Equity Final [$] | 8058.540832 | 2558979.805481 |
| Equity Peak [$] | 14838.778573 | 2623091.178151 |
| Return [%] | -19.414592 | 25489.798055 |
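
The `Return [%]` row can be checked by hand from the final equity and the starting cash of 10000 passed to `Backtest`. A short worked example using the figures from the table above:

```python
cash = 10_000  # starting cash passed to Backtest

# Final equity values copied from the comparison table
equity_final = {
    'Out of Sample (Test)': 8058.540832,
    'In Sample (Train)': 2558979.805481,
}

# Return [%] = (final equity / starting cash - 1) * 100
returns_pct = {
    name: (equity / cash - 1) * 100 for name, equity in equity_final.items()
}

for name, value in returns_pct.items():
    print(f'{name}: {value:.6f} %')
```

The in-sample figure is several orders of magnitude larger than the out-of-sample one, which is negative: the strategy only looks profitable on the data the model has already seen.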


## The overfitting problem in backtesting

## Which Machine Learning techniques solve the overfitting problem?

- Choose the best hyperparameters for the model
- Evaluate other Machine Learning models
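
One way to approach the first bullet is a grid search whose validation folds respect time order. A minimal sketch using scikit-learn's `GridSearchCV` with `TimeSeriesSplit`, run on synthetic data; the grid values and the `X_demo` and `y_demo` arrays are assumptions, not the author's settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))              # stand-in for X_train
y_demo = rng.choice(['UP', 'DOWN'], size=500)   # stand-in for y_train

# Each fold trains on earlier rows and validates on the rows that follow
cv = TimeSeriesSplit(n_splits=5)

# Hypothetical grid: shallower trees and larger leaves constrain the model
param_grid = {'max_depth': [2, 5, 10, 15], 'min_samples_leaf': [1, 20, 50]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=param_grid,
    cv=cv,
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

On real data the search would be fitted on the training set only, keeping the test set untouched for the final evaluation.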

https://algotrading101.com/learn/what-is-overfitting-in-trading/

Ideal world

- Overfitting problem
- Hyperparameter tuning
    - The returns improve
    - Although the model is not better...
- Is there anything else we could do?
    - Walk Forward Testing with hyperparameters
- The backtest still doesn't improve, what shall we do?
    - Other models: Neural Networks LSTM
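
Walk forward testing, mentioned in the list above, refits the model as new data arrives and always scores it on the block of rows that comes next. A minimal sketch with an expanding training window on synthetic data; the window sizes, the `max_depth` value and the `X_demo` and `y_demo` arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(600, 5))
y_demo = rng.choice(['UP', 'DOWN'], size=600)

initial_train = 300   # rows in the first training window (assumed)
step = 100            # rows predicted before the model is refitted (assumed)

fold_scores = []
for start in range(initial_train, len(X_demo), step):
    end = min(start + step, len(X_demo))
    model = DecisionTreeClassifier(max_depth=5, random_state=0)
    # Refit on everything seen so far, then score on the next unseen block
    model.fit(X_demo[:start], y_demo[:start])
    fold_scores.append(model.score(X_demo[start:end], y_demo[start:end]))

print(fold_scores, np.mean(fold_scores))
```

Averaging the per-block scores gives a less optimistic estimate than a single in-sample score, because every prediction is made on rows the model had not seen when it was fitted.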
    
    
Problems?

- The model does not take into account the "short-term" memory of the data...
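
One simple way to expose recent history to a model that treats each row independently is to add lagged columns, so that every row also carries the values of the previous days. A minimal sketch with pandas `shift` on made-up prices; the column names and the number of lags are hypothetical, and this is not the LSTM approach mentioned above.

```python
import pandas as pd

# Tiny illustrative price series (made-up values)
prices = pd.DataFrame(
    {'Close': [10.0, 11.0, 10.5, 12.0, 12.5]},
    index=pd.date_range('2023-01-02', periods=5, freq='D'),
)

# Daily percentage change, then the same change 1 and 2 days earlier
prices['change'] = prices['Close'].pct_change() * 100
for lag in (1, 2):
    prices[f'change_lag_{lag}'] = prices['change'].shift(lag)

# Rows without a full history contain NaN and are dropped
features = prices.dropna()
print(features)
```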