# CHAPTER 5: MACHINE LEARNING MODELS

While **feature and target engineering** make up **80% of the work**, it’s still crucial to properly handle the remaining **20%**. The advantage is that this final step becomes relatively straightforward if you’ve done the previous steps well and understand how your models function. 

At this stage, we already know our **target** and the **features** available to explain it. The task now is to find the **best method to model this relationship**: this method is the model.

*PS: The goal of this chapter is to explain the **strengths and weaknesses** of several models (the same reasoning applies to others), not to evaluate them (that’s for the next chapter).*

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
plt.style.use('seaborn-v0_8')  # the 'seaborn' style was renamed 'seaborn-v0_8' in Matplotlib 3.6

In [2]:
# Import our dataset containing the features and the signals (already correctly shifted) 
df = pd.read_parquet("DATA/EURUSD_4H_dataset_signal_included.parquet")
df

time,open,high,low,close,tick_volume,high_time,low_time,hurst,0_to_20,20_to_40,...,rolling_volatility_yang_zhang,linear_slope_6M,linear_slope_3M,linear_slope_1M,open_close_var,candle_color,next_candle_color,future_market_regime,labeling,dummy
2014-11-14 00:00:00,1.24750,1.24789,1.24588,1.24665,14537.0,2014-11-14 00:12:00,2014-11-14 03:09:00,0.606340,12.446352,23.175966,...,,,,,-0.00085,0.0,0.0,,-11.600000,0
2014-11-14 04:00:00,1.24665,1.24669,1.24266,1.24307,17128.0,2014-11-14 04:00:00,2014-11-14 07:46:00,0.710822,12.552301,12.970711,...,,,,,-0.00358,0.0,1.0,,11.283333,1
2014-11-14 08:00:00,1.24306,1.24711,1.24262,1.24623,35033.0,2014-11-14 10:34:00,2014-11-14 08:04:00,0.583402,22.500000,15.000000,...,,,,,0.00317,1.0,0.0,,-3.600000,0
2014-11-14 12:00:00,1.24614,1.24686,1.23982,1.24140,41784.0,2014-11-14 12:02:00,2014-11-14 15:36:00,0.593497,5.000000,7.500000,...,,,,,-0.00474,0.0,1.0,,3.283333,1
2014-11-14 16:00:00,1.24140,1.25435,1.24054,1.25140,74087.0,2014-11-14 19:17:00,2014-11-14 16:05:00,0.682967,25.416667,11.250000,...,,,,,0.01000,1.0,1.0,,56.583333,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-11-12 20:00:00,1.06066,1.06263,1.06062,1.06227,6069.0,2024-11-12 23:05:00,2024-11-12 20:00:00,0.499964,6.276151,33.472803,...,0.002094,0.000022,-0.000122,-0.000109,0.00161,1.0,1.0,0.0,-17.133333,0
2024-11-13 00:00:00,1.06180,1.06288,1.06106,1.06277,4596.0,2024-11-13 03:46:00,2024-11-13 02:15:00,0.469370,16.595745,32.340426,...,0.002097,0.000022,-0.000123,-0.000113,0.00097,1.0,0.0,0.0,-13.133333,0
2024-11-13 04:00:00,1.06277,1.06290,1.06092,1.06127,3868.0,2024-11-13 04:00:00,2024-11-13 07:22:00,0.553327,7.916667,15.000000,...,0.002015,0.000021,-0.000124,-0.000117,-0.00150,0.0,1.0,0.0,-9.133333,0
2024-11-13 08:00:00,1.06127,1.06295,1.05931,1.06266,7587.0,2024-11-13 11:55:00,2024-11-13 10:09:00,0.541830,3.750000,33.333333,...,0.001949,0.000021,-0.000125,-0.000122,0.00139,1.0,1.0,0.0,-5.133333,0


In [3]:
list_X = ['hurst', 'market_regime', 'kama_diff', 'autocorr_20', 'autocorr_50', 'ret_log_10',
       'rolling_volatility_yang_zhang', 'linear_slope_6M', 'linear_slope_3M']
col_y = "dummy"

# Remove the rows containing NaN values
df_clean = df[list_X + [col_y]].dropna()


# Chronological split (no shuffling): first 3,000 rows for training, next 1,000 rows for testing
X_train = df_clean.iloc[0:3_000,:][list_X]
y_train = df_clean.iloc[0:3_000,:][col_y]

X_test = df_clean.iloc[3_000:4_000,:][list_X]
y_test = df_clean.iloc[3_000:4_000,:][col_y]

In [4]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test) # WE DO NOT FIT THE SCALER ON THE TEST DATA, ONLY ON THE TRAIN DATA

X_train_sc_df = pd.DataFrame(X_train_sc, columns=X_train.columns)
X_test_sc_df = pd.DataFrame(X_test_sc, columns=X_test.columns)
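
To make the "fit on train only" rule concrete, here is a minimal sketch on made-up numbers (the arrays `train` and `test` are hypothetical, not taken from our dataset). The scaler learns its mean and standard deviation from the training rows only, so the scaled test set is generally not centered on zero, and that is exactly what we want.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: the test period sits at a different level than the train period
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0], [5.0]])

scaler = StandardScaler().fit(train)    # statistics learned on the train set only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)    # reuses the train mean and standard deviation

print(scaler.mean_)          # mean of the train column
print(test_scaled.ravel())   # not centered on 0: the test data never influenced the scaler
```

Fitting the scaler on the full dataset instead would leak information about the test period into the training features.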

In [5]:
from sklearn.metrics import confusion_matrix


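def model_output_overview(model_class, X_train_sc=X_train_sc, X_test_sc=X_test_sc, y_train=y_train, y_test=y_test):
    # model_class is an unfitted estimator instance, e.g. LogisticRegression()
    model = model_class.fit(X_train_sc, y_train.values)
    y_pred = model.predict(X_test_sc)
    
    # Rows = true classes, columns = predicted classes
    conf_matrix = confusion_matrix(y_test, y_pred)
    
    # Precision = correct predictions of a class / all predictions of that class (column sum)
    precision_class_0 = 100 * conf_matrix[0][0] / (conf_matrix[0][0] + conf_matrix[1][0])
    precision_class_1 = 100 * conf_matrix[1][1] / (conf_matrix[1][1] + conf_matrix[0][1])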

    print(f"Precision Class 0: {precision_class_0:.2f} % \t Precision Class 1: {precision_class_1:.2f} %")

    print(f"NB Prediction Class 0: {(conf_matrix[0][0] + conf_matrix[1][0])} \t NB Prediction Class 1: {(conf_matrix[1][1] + conf_matrix[0][1])}")
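
As a sanity check on the two numbers this helper prints, the sketch below recomputes the per-class precision from a confusion matrix on a handful of made-up labels (`y_true` and `y_hat` are hypothetical) and compares it with scikit-learn's `precision_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

# Hypothetical labels and predictions, for illustration only
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 1, 1, 1, 0, 0, 1, 1])

cm = confusion_matrix(y_true, y_hat)   # rows = true class, columns = predicted class
nb_pred = cm.sum(axis=0)               # number of predictions per class (column sums)
precision = np.diag(cm) / nb_pred      # correct predictions / all predictions, per class

print(nb_pred)
print(precision)
print(precision_score(y_true, y_hat, pos_label=0), precision_score(y_true, y_hat, pos_label=1))
```

Note that if a model never predicts one of the classes, the corresponding column sum is zero and the precision of that class is undefined.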

<br>

### 5.1. LINEAR MODELS

In supervised machine learning, there are two types of problems: **regression and classification**. Here, since our target is a dummy variable (0 or 1), we will use classification models. However, the explanations and tips provided for each model also apply to its regression counterpart.

<br>

##### 5.1.1. Linear/Logistic Regression

**Linear and logistic regression** are linear models designed to capture linear relationships in the data. In trading, purely linear relationships are rare, and these models often struggle to handle the complexity of market dynamics. However, **"not enough" does not mean "useless"**! These models are extremely fast to train, making them valuable tools for quickly assessing the level of linear dependence between features and the target. They can provide valuable insights into your data's structure and serve as a baseline for more complex models.

In [6]:
# LOGISTIC REGRESSION
from sklearn.linear_model import LogisticRegression, LinearRegression
model_output_overview(LogisticRegression())

Precision Class 0: 33.93 % 	 Precision Class 1: 57.57 %
NB Prediction Class 0: 392 	 NB Prediction Class 1: 608
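
One way to use a logistic regression as a quick "linear dependence" check is to look at its coefficients on standardized features. The sketch below does this on synthetic data where, by construction, only the first feature drives the target (`X_demo` and `y_demo` are made up for the illustration, they are not our trading features).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))   # three toy features, already on a standard scale
# Assumption of the example: only feature 0 (plus noise) drives the target
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

logit = LogisticRegression().fit(X_demo, y_demo)
coefs = logit.coef_.ravel()
print(coefs)   # the coefficient of feature 0 should dominate the two others
```

On real features, coefficients that all stay close to zero suggest there is little linear signal to exploit, which is already useful information before moving to heavier models.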


<br>

##### 5.1.2. Linear SVM

Linear SVMs are **robust linear models** that find the **optimal hyperplane** for separating data, performing well even with **noisy data** and **small datasets**. However, their **training time can increase significantly on larger datasets** due to the cost of finding support vectors. When **linear separability** is suspected, they offer **strong performance** but require **standardized data** due to their geometrical basis.

In [7]:
# LINEAR SVC
from sklearn.svm import LinearSVC, LinearSVR
model_output_overview(LinearSVC())

Precision Class 0: 33.93 % 	 Precision Class 1: 57.57 %
NB Prediction Class 0: 392 	 NB Prediction Class 1: 608
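
Because the SVM relies on distances, a common pattern is to bundle the scaler and the model in a single pipeline so that the scaling is always learned on the training rows only. This is a minimal sketch on synthetic data with two features on very different scales (the data and the name `svm_pipe` are assumptions of the example).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
# Toy features on very different scales; only the first one matters for the target
X_demo = np.column_stack([rng.normal(size=300), 1_000 * rng.normal(size=300)])
y_demo = (X_demo[:, 0] > 0).astype(int)

svm_pipe = make_pipeline(StandardScaler(), LinearSVC())
svm_pipe.fit(X_demo[:200], y_demo[:200])           # the scaler is fitted on these rows only
acc = svm_pipe.score(X_demo[200:], y_demo[200:])   # accuracy on the held-out rows
print(acc)
```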


<br>

### 5.2. NON-LINEAR MODELS

The second family of models is the **non-linear models**, which capture **non-linear relationships** (quite obvious, I know). These are the **most used models in finance and trading** because **most of the information is non-linear**. In other words, there is a lot of **valuable information** that only these models can detect.

<br>

##### 5.2.1. Non-Linear SVM

Non-linear SVMs share the **same strengths and weaknesses as Linear SVMs**, with an **even greater sensitivity to large dataset issues**. The key difference is that they use a **non-linear kernel** to find the **optimal hyperplane** for separating data.


In [8]:
# NON LINEAR SVC
from sklearn.svm import SVC, SVR
model_output_overview(SVC(C=3))

Precision Class 0: 39.63 % 	 Precision Class 1: 61.89 %
NB Prediction Class 0: 651 	 NB Prediction Class 1: 349
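
To see what the kernel brings, the sketch below compares a linear and an RBF kernel on a textbook non-linear problem: two concentric circles generated with `make_circles`. It is a synthetic illustration, not a claim about how the two kernels behave on market data.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: no straight line can separate the classes
X_demo, y_demo = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)
X_fit, y_fit = X_demo[:300], y_demo[:300]
X_new, y_new = X_demo[300:], y_demo[300:]

linear_acc = SVC(kernel="linear").fit(X_fit, y_fit).score(X_new, y_new)
rbf_acc = SVC(kernel="rbf").fit(X_fit, y_fit).score(X_new, y_new)
print(linear_acc, rbf_acc)   # the RBF kernel should be far ahead on this dataset
```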


<br>

##### 5.2.2. Random Forest

Random Forests are **ensemble models** that build **multiple decision trees** and aggregate their predictions, making them **robust to overfitting** and effective at handling **non-linear relationships**. They work well with **categorical features**, including **dummy variables**, because they split data based on thresholds rather than relying on linear transformations (**and yes, dummy variables are indeed effective with Random Forests**). Additionally, they are **less sensitive to scaling** and can handle **missing data** to some extent, but they can become **computationally expensive on very large datasets**.

In [9]:
# STANDARD RANDOM FOREST
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
model_output_overview(RandomForestClassifier(n_estimators=1_000, max_depth=100, random_state=56))

Precision Class 0: 40.76 % 	 Precision Class 1: 68.31 %
NB Prediction Class 0: 817 	 NB Prediction Class 1: 183
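
The scaling point can be checked with a small experiment. The sketch below trains the same forest on a synthetic, XOR-like target, once on the raw features and once after multiplying one feature by 1,000 (all data and the helper `holdout_accuracy` are hypothetical). Since trees only compare a feature with a threshold, the hold-out accuracy should be essentially the same in both cases.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(600, 2))
y_demo = (X_demo[:, 0] * X_demo[:, 1] > 0).astype(int)   # XOR-like, non-linear target

X_rescaled = X_demo * np.array([1.0, 1_000.0])           # blow up the scale of the second feature

def holdout_accuracy(X, y):
    """Fit on the first 450 rows and return the accuracy on the remaining 150."""
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[:450], y[:450])
    return forest.score(X[450:], y[450:])

acc_raw = holdout_accuracy(X_demo, y_demo)
acc_rescaled = holdout_accuracy(X_rescaled, y_demo)
print(acc_raw, acc_rescaled)
```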


<br>

##### 5.2.3. Extra Trees

Extra Trees (Extremely Randomized Trees) are **ensemble models** similar to Random Forests but differ by introducing **additional randomness** during tree construction. They split nodes using **random thresholds**, making them **faster to train** and often **less prone to overfitting** in certain cases. Like Random Forests, they handle **non-linear relationships** well and work effectively with **dummy variables** since they use **threshold-based splits**. They are also **robust to scaling and noise** but may require **tuning to balance bias and variance** for optimal performance.

In [10]:
# EXTRA TREES
from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor
model_class = ExtraTreesClassifier(n_estimators=1_000, max_depth=None, min_samples_split=2, random_state=56)
model_output_overview(model_class)

Precision Class 0: 41.35 % 	 Precision Class 1: 69.16 %
NB Prediction Class 0: 786 	 NB Prediction Class 1: 214
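
The "additional randomness" shows up directly in scikit-learn's defaults. As a small sketch on toy data (the data are made up), we can fit both ensembles and inspect how their individual trees choose splits and whether they resample the rows.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] * X_demo[:, 1] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)
et = ExtraTreesClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Random Forest: best threshold per candidate feature, trees grown on bootstrap samples
print(rf.estimators_[0].splitter, rf.bootstrap)   # best True
# Extra Trees: random threshold per candidate feature, trees grown on the whole sample by default
print(et.estimators_[0].splitter, et.bootstrap)   # random False
```

Skipping the search for the best threshold is what makes Extra Trees cheaper to train.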


<br>

##### 5.2.4. Bagging

Bagging (Bootstrap Aggregating) combines **predictions from multiple models** trained on **bootstrapped subsets** of the data to **reduce variance** and **improve stability**. It works well with **high-variance models** like decision trees, enhancing **robustness without increasing bias**. However, it can be **computationally expensive** due to the need for multiple model training.

In [11]:
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
model_output_overview(BaggingClassifier(estimator=SVC(C=3),
                        n_estimators=10, random_state=56))

Precision Class 0: 40.00 % 	 Precision Class 1: 62.40 %
NB Prediction Class 0: 625 	 NB Prediction Class 1: 375
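
To show what happens under the hood, here is a hand-rolled sketch of the bagging idea with decision trees on synthetic data. It is a simplified illustration, not the exact internals of `BaggingClassifier`: each tree is trained on rows drawn with replacement, and the final prediction is the majority vote.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X_demo = rng.normal(size=(300, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] + 0.5 * rng.normal(size=300) > 0).astype(int)
X_fit, y_fit, X_new, y_new = X_demo[:200], y_demo[:200], X_demo[200:], y_demo[200:]

votes = []
for _ in range(25):   # odd number of trees, so the vote can never be tied
    idx = rng.integers(0, len(X_fit), size=len(X_fit))   # bootstrap: sample rows with replacement
    tree = DecisionTreeClassifier().fit(X_fit[idx], y_fit[idx])
    votes.append(tree.predict(X_new))

bagged_pred = (np.mean(votes, axis=0) > 0.5).astype(int)  # majority vote across the 25 trees
bagged_acc = (bagged_pred == y_new).mean()
print(bagged_acc)
```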


<br>

##### 5.2.5. Neural Networks

The **MLPClassifier** is a **neural network model** capable of capturing **complex, non-linear relationships**. It supports **multiple hidden layers** and is trained by **backpropagation**, making it **versatile** for various tasks. However, it requires **careful tuning** (e.g., hidden layers, activation functions) and **sufficient data**, as it is prone to **overfitting** and **sensitive to scaling**.

In [16]:
from sklearn.neural_network import MLPClassifier, MLPRegressor
model_class = MLPClassifier(solver='lbfgs',
                    hidden_layer_sizes=(100, 20, 10), random_state=56)
model_output_overview(model_class)

Precision Class 0: 40.13 % 	 Precision Class 1: 64.35 %
NB Prediction Class 0: 770 	 NB Prediction Class 1: 230
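
Two of the weaknesses above (scaling and overfitting) can be addressed directly in the model setup. The sketch below is one possible configuration on synthetic data, not a tuned recipe: the scaler lives in the same pipeline as the network, and `alpha` (L2 penalty) plus `early_stopping=True` (which holds out part of the training set for validation) limit overfitting. The layer sizes and the value of `alpha` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X_demo = rng.normal(size=(600, 2)) * np.array([1.0, 50.0])   # two toy features on different scales
y_demo = (X_demo[:, 0] * X_demo[:, 1] > 0).astype(int)       # non-linear target

mlp_pipe = make_pipeline(
    StandardScaler(),   # the network is sensitive to scaling
    MLPClassifier(hidden_layer_sizes=(32, 16), alpha=1e-3,
                  early_stopping=True, max_iter=500, random_state=0),
)
mlp_pipe.fit(X_demo[:450], y_demo[:450])
mlp_pred = mlp_pipe.predict(X_demo[450:])
mlp_acc = mlp_pipe.score(X_demo[450:], y_demo[450:])
print(mlp_acc)
```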


<br>

### 5.3. ENSEMBLE METHODS

In the previous section, we explored several ensemble methods (Random Forest, Bagging, Extra Trees) that combine many instances of the same base model. Here, we introduce a **voting method** that combines **different models** (both linear and non-linear) to create a single model that, ideally, delivers the **best performance**.

In [17]:
from sklearn.ensemble import VotingClassifier, VotingRegressor

model_class = VotingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('svc', SVC(C=3)),
                ('rfc', RandomForestClassifier(n_estimators=1_000, max_depth=100, random_state=56)),
               ('ext', ExtraTreesClassifier(n_estimators=1_000, max_depth=None, min_samples_split=2, random_state=56)),
               ('bagsvc',BaggingClassifier(estimator=SVC(C=3),
                        n_estimators=10, random_state=56)),
               ('dnn', MLPClassifier(solver='adam', alpha=3.16e-5,
                    hidden_layer_sizes=(100, 20, 10), random_state=56))],
    voting='hard')

model_output_overview(model_class)

Precision Class 0: 40.17 % 	 Precision Class 1: 63.67 %
NB Prediction Class 0: 722 	 NB Prediction Class 1: 278
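
With `voting='hard'`, each model casts one vote and the majority wins. The alternative, `voting='soft'`, averages the predicted probabilities instead. The two rules can disagree, as this small NumPy sketch shows on made-up probabilities from three hypothetical models.

```python
import numpy as np

# Hypothetical class-1 probabilities: one row per observation, one column per model
proba = np.array([
    [0.90, 0.40, 0.45],
    [0.20, 0.60, 0.55],
    [0.70, 0.80, 0.30],
    [0.10, 0.45, 0.40],
])

hard_vote = ((proba > 0.5).sum(axis=1) >= 2).astype(int)  # majority of the individual class predictions
soft_vote = (proba.mean(axis=1) > 0.5).astype(int)        # average the probabilities, then threshold

print(hard_vote)  # [0 1 1 0]
print(soft_vote)  # [1 0 1 0]
```

In scikit-learn, `voting='soft'` requires every estimator to expose `predict_proba`, which for `SVC` means setting `probability=True`.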


In [18]:
# Very quick overview of the profit (only when it can be computed, which is not always the case)
precision = 0.63
nb_trade = 199
(precision * 0.0048 - 0.0052 * (1-precision)) * nb_trade * 100 # in 8 months

21.889999999999997
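
The same back-of-the-envelope calculation can be wrapped in a small function. In this sketch, reading `0.0048` as the average gain of a winning trade and `0.0052` as the average loss of a losing trade is an assumption on my side, and the function name `expected_profit_pct` is hypothetical. The estimate ignores transaction costs and compounding.

```python
def expected_profit_pct(precision, avg_win, avg_loss, nb_trade):
    """Rough cumulative return in percent, assuming a fixed average win and loss per trade."""
    edge_per_trade = precision * avg_win - (1 - precision) * avg_loss
    return edge_per_trade * nb_trade * 100

# Same inputs as the cell above (avg_win / avg_loss interpretation is an assumption)
profit = expected_profit_pct(precision=0.63, avg_win=0.0048, avg_loss=0.0052, nb_trade=199)

# Precision at which the expected edge per trade is exactly zero: p * win = (1 - p) * loss
breakeven_precision = 0.0052 / (0.0048 + 0.0052)

print(round(profit, 2))               # 21.89
print(round(breakeven_precision, 2))  # 0.52
```

Under these assumptions, any precision below 52% would turn the expected profit negative, which shows how thin the margin is.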