<hr>
<br>
<br>
<br>
<h1><center>Predicting Financial Markets with Machine Learning      </center></h1>
<h1><center>-      </center></h1>
<h2><center>Non-Linear Models      </center></h2>
<br>
<br>
<hr>
<br>

<br>
<br>
<h2>Purpose</h2>
<br>
<hr>
This notebook develops an AI system for intraday trading of cryptocurrencies.
<br>
<br>

<br>
<br>
<h2>Imports</h2>
<br>
<hr>
<br>

In [2]:
pip install torch

Collecting torch
  Downloading torch-2.2.2-cp310-none-macosx_10_9_x86_64.whl.metadata (25 kB)
Downloading torch-2.2.2-cp310-none-macosx_10_9_x86_64.whl (150.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 150.8/150.8 MB 34.4 MB/s eta 0:00:00
Installing collected packages: torch
Successfully installed torch-2.2.2
Note: you may need to restart the kernel to use updated packages.


In [47]:
# Pandas and Python
import pandas as pd
pd.options.display.float_format = '{:.4f}'.format
import numpy as np
from tqdm import tqdm
import os
import time
from functools import partial
from multiprocessing import Pool

# Graphic Libraries
import plotly.io as pio
pio.templates.default = "simple_white"
pd.options.plotting.backend = "plotly"
import matplotlib.pyplot as plt
from IPython.display import display, HTML, clear_output


# AI and stats
import statsmodels.api as sm
import xgboost
from xgboost import XGBRegressor
import torch
import sklearn
import sklearn.model_selection  # TimeSeriesSplit is referenced in the training cell
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score




<br>
<br>
<h2>Notebook Parameters</h2>
<br>
<hr>
<br>

In [3]:
# Define data path
data_path = "data/"

# Risk free rate assumption
risk_free_rate = 0.05  # 5% per year
rfr_hourly = (1 + risk_free_rate)**(1 / (24*365)) - 1  # compounded hourly equivalent

# Suggested training set
start_date_train = "2023-01-24"
last_date_train = "2024-01-24"

# Suggested validation set
start_date_validate = "2024-01-25"
last_date_validate = "2024-07-24"

# Test set (Unavailable)
# start_date_test = "2024-07-25"
# last_date_test = "2025-01-24"

# Maximum number of features to use
max_nb_features = 20

# Set a level of transaction costs
tc = 0.0000
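
As a quick sanity check on the hourly conversion above, the sketch below compounds the hourly rate back over a full year and confirms that it recovers the annual assumption. It uses the same convention as `rfr_hourly` (24 × 365 hours per year); the variable names are illustrative and not part of the notebook.

```python
# Sanity check: compounding the hourly rate over one year should
# recover the annual risk-free rate (assumes 24 * 365 hours per year).
annual_rate = 0.05
hours_per_year = 24 * 365

hourly_rate = (1 + annual_rate) ** (1 / hours_per_year) - 1
recovered_annual = (1 + hourly_rate) ** hours_per_year - 1

print(f"hourly rate: {hourly_rate:.3e}")
print(f"recovered annual rate: {recovered_annual:.6f}")
```

The hourly rate is tiny compared with typical hourly crypto returns, so it has little effect on the excess returns used in the Sharpe ratio.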


<br>
<br>
<h2>Data Loading</h2>
<br>
<hr>
<br>

In [4]:
# Main data
data = pd.read_csv(
    f"{data_path}data_in_sample.csv",
    index_col=0,
    header=[0,1],
)

# Make sure that the index is in the right format
data.index = pd.to_datetime(data.index)
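
To make the `header=[0,1]` argument concrete, here is a minimal sketch that reads a small in-memory CSV with two header rows. The layout (field name on the first level, instrument on the second) and the instrument names `BTC` and `ETH` are assumptions made for illustration; the only thing the notebook confirms is that a top-level `"return"` column group exists.

```python
import io
import pandas as pd

# Hypothetical file content: level 0 is the field, level 1 the instrument
toy_csv = io.StringIO(
    ",return,return,close,close\n"
    ",BTC,ETH,BTC,ETH\n"
    "2023-01-24 00:00:00,0.010,-0.020,23000.0,1600.0\n"
    "2023-01-24 01:00:00,-0.005,0.015,22885.0,1624.0\n"
)

# Two header rows become a two-level column index
toy_data = pd.read_csv(toy_csv, index_col=0, header=[0, 1])
toy_data.index = pd.to_datetime(toy_data.index)

# Selecting the top level yields one column per instrument
toy_returns = toy_data["return"]
print(toy_returns)
```

Selecting `toy_data["return"]` drops the first column level, which is why the later cells can treat `data["return"]` as a plain timestamp-by-instrument table.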


In [5]:
# Load pre-processed features
# Note: the slice below is applied to all file names before the "feature"
# filter, so fewer than max_nb_features files may end up being loaded.
features = {}
for dirpath, dirnames, filenames in os.walk(data_path):
    for filename in filenames[-max_nb_features:]:

        if "feature" in filename:

            print(f"Loading {filename}")

            # Load feature from the directory currently being walked
            feature = pd.read_csv(
                os.path.join(dirpath, filename),
                index_col=0,
                header=[0],
            )

            # Make sure that the index is in the right format
            feature.index = pd.to_datetime(feature.index)

            # Store in the feature dict
            features[filename.replace(".csv", "")] = feature


Loading feature_133134399568.csv
Loading feature_102170293493.csv
Loading feature_210101168832.csv
Loading feature_229286341668.csv
Loading feature_498002615952.csv
Loading feature_756574266810.csv
Loading feature_903557802651.csv
Loading feature_217884298030.csv
Loading feature_124203207590.csv
Loading feature_780748440480.csv
Loading feature_799063726358.csv
Loading feature_803717807599.csv
Loading feature_374209229689.csv
Loading feature_440084693321.csv
Loading feature_106169570149.csv
Loading feature_844025771320.csv
Loading feature_373101088533.csv
Loading feature_015415068863.csv
Loading feature_278231810051.csv


In [6]:
list(features.keys())

['feature_133134399568',
 'feature_102170293493',
 'feature_210101168832',
 'feature_229286341668',
 'feature_498002615952',
 'feature_756574266810',
 'feature_903557802651',
 'feature_217884298030',
 'feature_124203207590',
 'feature_780748440480',
 'feature_799063726358',
 'feature_803717807599',
 'feature_374209229689',
 'feature_440084693321',
 'feature_106169570149',
 'feature_844025771320',
 'feature_373101088533',
 'feature_015415068863',
 'feature_278231810051']

<br>
<br>
<h2>Analytics</h2>
<br>
<hr>
Basic portfolio analytics to turn predictions of future instrument returns into positions and to evaluate the resulting performance.
<br>
<br>


In [45]:
def expected_returns_to_positions(expected_returns):
    """
    Normalize expected returns into an investable portfolio
    
    :param expected_returns: pd.DataFrame containing expectations
                             about the instruments' future price variations
    """

    # Positions will be proportional to ranked alpha
    positions = expected_returns.rank(axis=1)

    # Make the portfolio dollar neutral
    positions = positions.sub(positions.mean(axis=1), axis=0)

    # Re-scale the leverage (done after centering, so that the gross
    # exposure sums to 1 while the net exposure stays at 0)
    positions = positions.div(positions.abs().sum(axis=1), axis=0)
    
    return positions


def get_sharpe(pnl_portfolio, rfr_hourly):
    """
    Compute the sharpe ratio
    
    :param pnl_portfolio: pd.Series of returns of the portfolio considered
    :param rfr_hourly: float, the hourly risk free rate
    """

    # Compute excess returns
    excess_returns = pnl_portfolio - rfr_hourly
    
    # Compute sharpe ratio, annualized from hourly returns
    # (same convention as rfr_hourly: 24 * 365 hours per year)
    std_dev = excess_returns.std()
    if std_dev == 0:
        sharpe_ratio = np.nan  # Assign NaN instead of inf
    else:
        sharpe_ratio = excess_returns.mean() / std_dev * np.sqrt(24 * 365)

    # Output
    return round(sharpe_ratio, 2)


def pnl_analytics(positions, 
                  returns, 
                  rfr_hourly,
                  lag,
                  tc=0):
    """
    Compute the p&l analytics of the strategy
    
    :param positions: pd.DataFrame, some positions that have been reached
    :param returns: pd.DataFrame containing returns of instruments
    :param rfr_hourly: float, the hourly risk free rate
    :param lag: int, the number of hours to reach the positions
    :param tc: float, the transaction costs
    
    """

    # Compute gross p&l
    pnl = positions.shift(1+lag).mul(returns).sum(axis=1)
    
    # Compute transaction costs
    trades = positions.fillna(0).diff()
    costs = trades.abs().sum(axis=1) * tc
    
    # Net p&l: deduct costs from gross p&l
    pnl = pnl.sub(costs, fill_value=0)
    
    # Compute sharpe
    sharpe = get_sharpe(pnl, rfr_hourly)
    
    return {"sharpe": sharpe,
           "pnl": pnl}


def analyze_expected_returns(
    expected_returns,
    returns,
    rfr_hourly,
    title = "a Nice Try",
    lags = [0,1,2,3,6,12],
    tc = 0,
    output_sharpe=False,
    display_results=True,
):
    """
    Provide an economic analysis of some expected_returns
    
    :param expected_returns: pd.DataFrame containing expectations
                             about future instruments prices variations
    """
    
    # Take positions as a function of expected returns
    positions = expected_returns_to_positions(expected_returns)
    
    # Compute p&l and sharpe for different lags
    pnl_lags = {}
    for lag in lags:
        analytics_lag = pnl_analytics(
            positions=positions, 
            returns=returns, 
            rfr_hourly=rfr_hourly,
            lag=lag,
            tc=tc)
        lag_label = f"Lag {lag}, sharpe={analytics_lag['sharpe']}"
        pnl_lags[lag_label] = analytics_lag["pnl"]
        
    # Display returns
    pnl_lags = pd.concat(pnl_lags, axis=1).dropna()
    if display_results:
        fig = (1+pnl_lags).cumprod().plot(
            title=f"Cumulative returns of {title}",
        )
        fig.update_layout(yaxis_type="log")
        fig.show()

    if output_sharpe:
        for lag_label in pnl_lags.columns:
            if "Lag 0" in lag_label:
                return lag_label.split("sharpe=")[-1]
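
To illustrate the analytics above on numbers that can be checked by hand, here is a minimal sketch with three hypothetical instruments (`A`, `B`, `C`) and three hourly timestamps. It centres the cross-sectional ranks and then scales them to a gross exposure of 1, which is one possible ordering of the two steps, and it applies the same one-period shift as `pnl_analytics` with `lag=0`. All values are made up.

```python
import pandas as pd

index = pd.to_datetime(
    ["2024-01-25 00:00", "2024-01-25 01:00", "2024-01-25 02:00"]
)

# Made-up expected returns for three hypothetical instruments
toy_expected_returns = pd.DataFrame(
    {"A": [0.02, -0.01, 0.01], "B": [-0.01, 0.03, 0.02], "C": [0.00, 0.01, -0.03]},
    index=index,
)

# Made-up realized returns over the same hours
toy_returns = pd.DataFrame(
    {"A": [0.00, 0.01, 0.02], "B": [0.00, -0.02, 0.01], "C": [0.00, 0.00, 0.00]},
    index=index,
)

# Rank cross-sectionally, centre (dollar neutral), scale to gross exposure 1
ranks = toy_expected_returns.rank(axis=1)
centered = ranks.sub(ranks.mean(axis=1), axis=0)
toy_positions = centered.div(centered.abs().sum(axis=1), axis=0)

# Positions decided at hour t earn the return of hour t+1 (lag = 0)
lag = 0
toy_pnl = toy_positions.shift(1 + lag).mul(toy_returns).sum(axis=1)

print(toy_positions)
print(toy_pnl)
```

Each row of `toy_positions` sums to zero, and its absolute values sum to one. Because only the ranks matter, multiplying all expected returns by a positive constant leaves the positions unchanged.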
        
    


<br>
<br>
<h2>Features Standard Pre-Processing</h2>
<br>
<hr>

<br>


In [8]:
label = data["return"].loc[start_date_train:last_date_train
    ].shift(-1).stack()

In [9]:
features_normalized = {}

for feature_name in features.keys():

    print(f"Processing {feature_name}")
    
    # Extract the feature
    feature_normalized = features[feature_name]

    # Rank the feature to remove outliers
    feature_normalized = feature_normalized.rank(axis=1, pct=True) - 0.5

    # Stack the feature
    feature_normalized = feature_normalized.stack().sort_index()

    # Store this normalized version
    features_normalized[feature_name] = feature_normalized

# Convert normalized features dict to a single dataframe
features_normalized = pd.concat(features_normalized, axis=1)

# Replace NaNs by 0, roughly the cross-sectional average of the centered ranks
# (XGBoost can handle NaNs natively, so this step is a modelling choice)
features_normalized = features_normalized.fillna(0)
                                                        

Processing feature_133134399568
Processing feature_102170293493
Processing feature_210101168832
Processing feature_229286341668
Processing feature_498002615952
Processing feature_756574266810
Processing feature_903557802651
Processing feature_217884298030
Processing feature_124203207590
Processing feature_780748440480
Processing feature_799063726358
Processing feature_803717807599
Processing feature_374209229689
Processing feature_440084693321
Processing feature_106169570149
Processing feature_844025771320
Processing feature_373101088533
Processing feature_015415068863
Processing feature_278231810051
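
The effect of the rank transformation is easiest to see on a toy example. The sketch below applies the same `rank(axis=1, pct=True) - 0.5` step to a made-up feature with one extreme value and one missing value; the instrument names and numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up feature: one large outlier (1000.0) and one missing value
toy_feature = pd.DataFrame(
    {
        "A": [1.0, 10.0],
        "B": [2.0, np.nan],
        "C": [1000.0, -5.0],
        "D": [3.0, 0.0],
    },
    index=pd.to_datetime(["2023-01-24 00:00", "2023-01-24 01:00"]),
)

# Cross-sectional percentile rank, shifted to be roughly centred on 0
toy_ranked = toy_feature.rank(axis=1, pct=True) - 0.5

# Missing values are ignored by the ranking, then filled with 0
toy_filled = toy_ranked.fillna(0.0)

print(toy_filled)
```

The outlier is mapped to 0.5, the same value any top-ranked instrument would receive, so its magnitude no longer influences the model. Percentile ranks are computed among the non-missing instruments of each row only.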


<br>
<br>
<h2>XGBoost: Gradient Boosted Decision Trees</h2>
<br>
<hr>
Gradient Boosted Decision Trees are another way to introduce non-linearity in our model. They capture non-linear links between the label and each feature, as well as interactions among the features themselves. Overfitting is limited by several mechanisms that appear in the hyper-parameters below: shrinkage (a small learning rate), row and column subsampling, a minimum child weight, and L1/L2 regularization. Together, these can lead to better generalization.
<br>
<br>
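
To give some intuition for how boosting builds a non-linear fit, here is a minimal from-scratch sketch of gradient boosting with depth-1 trees (stumps) on a synthetic V-shaped target. This is plain gradient boosting on squared error, not XGBoost's regularized algorithm, and every name and number in it is illustrative.

```python
import numpy as np


def fit_stump(x, residual):
    """Find the single threshold on 1-D x that minimizes squared error."""
    best = None
    # Exclude the largest value so that the right side is never empty
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1], best[2], best[3]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


# Synthetic data: a V-shaped relationship that a straight line cannot fit
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = np.abs(x) + rng.normal(scale=0.05, size=200)

learning_rate = 0.1
n_rounds = 100
prediction = np.zeros_like(y)  # start from 0, similar to base_score=0

for _ in range(n_rounds):
    residual = y - prediction  # negative gradient of the squared error
    stump = fit_stump(x, residual)
    prediction += learning_rate * predict_stump(stump, x)

mse_boosted = np.mean((y - prediction) ** 2)

# Best straight-line fit, for comparison
slope, intercept = np.polyfit(x, y, 1)
mse_linear = np.mean((y - (slope * x + intercept)) ** 2)

print(f"in-sample MSE, boosted stumps: {mse_boosted:.4f}")
print(f"in-sample MSE, linear fit:     {mse_linear:.4f}")
```

Both errors are measured in-sample, so the comparison only shows that the boosted model can represent the shape; it says nothing about generalization, which is what the walk-forward evaluation later in the notebook is for.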


<br>
<h4>Defining hyper-parameters</h4>
<br>



In [20]:
# Define hyperparameters
hyperparameters = {
    "learning_rate": 0.001,
    "n_estimators": 500,
    "objective": "reg:squarederror",
    "tree_method": "hist",
    "base_score": 0,
    "max_depth": 7,
    "min_child_weight": 10,
    "subsample": 0.05,
    "colsample_bytree": 0.3,
    "min_split_loss": 0,
    "reg_lambda": 1,
    "reg_alpha": 0,
    "n_jobs": 1,
    "random_state": 0,
}
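
With `learning_rate` at 0.001, `n_estimators` at 500 and `base_score` at 0, the predictions stay heavily shrunk towards zero. A rough way to see this is the idealized case where every tree fits the current residual exactly: each round then removes a fraction `learning_rate` of what is left. That assumption is a simplification for illustration only, since real trees fit the residual imperfectly.

```python
# Idealized shrinkage arithmetic: if each tree fitted the residual exactly,
# the residual would be multiplied by (1 - learning_rate) at every round.
learning_rate = 0.001
n_estimators = 500

residual_fraction = (1 - learning_rate) ** n_estimators
captured_fraction = 1 - residual_fraction

print(f"fraction of the signal left unfitted: {residual_fraction:.3f}")
print(f"fraction of the signal captured:      {captured_fraction:.3f}")
```

Even in this best case, well under half of the signal's magnitude is captured. Since the positions in this notebook are built from cross-sectional ranks of the predictions, the overall scale of the predictions does not affect the portfolio.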

<br>
<h4>Training the models</h4>
<br>

In [52]:
# Measure time
t1 = time.time()

# Recompute the model every month, skip the first 2 months
rebalancing_dates = pd.date_range(
    start=start_date_train, 
    end=last_date_validate, 
    freq="ME"
)[2:]

def train_predict_period(
    last_date_train_fold,
    returns,
    hyperparameters,
):

    # Define training and validation dates
    
    # Train the model over the last 12 months (approximated as 360 days)
    start_date_train_fold = last_date_train_fold - pd.Timedelta(days = 30 * 12)
    
    # The model cannot be used before the first day following the training
    # (no look-ahead bias)
    start_date_validate_fold = last_date_train_fold + pd.Timedelta(days = 1)
    
    # The trained model will be used for 1 month (31 days, so the windows of
    # consecutive folds can overlap by a day)
    last_date_validate_fold = last_date_train_fold + pd.Timedelta(days = 31 * 1)

    # Log information
    print(f"Train a model from date {start_date_train_fold} to date {last_date_train_fold}")
    print(f"Predict from date {start_date_validate_fold} to date {last_date_validate_fold}")
    print("")


    # Create label
    label_fold = returns.loc[start_date_train_fold:last_date_train_fold
        ].shift(-1).stack()

    # Only keep dates of the train and validation sets for the features
    features_normalized_train_fold = features_normalized.reindex(label_fold.index)
    features_normalized_validate_fold = features_normalized.sort_index().loc[
        start_date_validate_fold:last_date_validate_fold]

    # Split the data along the time axis
    ts_splitter = sklearn.model_selection.TimeSeriesSplit(n_splits=5)
    
    # Create model
    model = XGBRegressor(**hyperparameters)
    
    # Fit model
    model = model.fit(
        y=label_fold,
        X=features_normalized_train_fold,
    )
    
    # Predict on the validation set
    predictions = model.predict(
        features_normalized_validate_fold)
    predictions = pd.Series(
        predictions, 
        index=features_normalized_validate_fold.index
    ).unstack()
    
    # Output results
    return pd.concat(
        {str(last_date_train_fold) : predictions},
        axis=1
    )

# Fix all but one function parameters to iterate on the last one
partial_train_predict_period = partial(
    train_predict_period,
    returns=data["return"],
    hyperparameters=hyperparameters,
)

# Train using a simple loop instead of Pool
predictions = []
for date in rebalancing_dates:
    pred = partial_train_predict_period(date)
    predictions.append(pred)
    
# Training finished, print time used for it
t2 = time.time()
print(f"Total Training time is {t2-t1} seconds")



Train a model from date 2022-04-05 00:00:00 to date 2023-03-31 00:00:00
Predict from date 2023-04-01 00:00:00 to date 2023-05-01 00:00:00

Train a model from date 2022-05-05 00:00:00 to date 2023-04-30 00:00:00
Predict from date 2023-05-01 00:00:00 to date 2023-05-31 00:00:00

Train a model from date 2022-06-05 00:00:00 to date 2023-05-31 00:00:00
Predict from date 2023-06-01 00:00:00 to date 2023-07-01 00:00:00

Train a model from date 2022-07-05 00:00:00 to date 2023-06-30 00:00:00
Predict from date 2023-07-01 00:00:00 to date 2023-07-31 00:00:00

Train a model from date 2022-08-05 00:00:00 to date 2023-07-31 00:00:00
Predict from date 2023-08-01 00:00:00 to date 2023-08-31 00:00:00

Train a model from date 2022-09-05 00:00:00 to date 2023-08-31 00:00:00
Predict from date 2023-09-01 00:00:00 to date 2023-10-01 00:00:00

Train a model from date 2022-10-05 00:00:00 to date 2023-09-30 00:00:00
Predict from date 2023-10-01 00:00:00 to date 2023-10-31 00:00:00

Train a model from date 202
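
The walk-forward schedule logged above can be reproduced without any data or model. The sketch below builds the fold dates from the same start and end dates, with one variation that is an assumption of this sketch rather than the notebook's code: each prediction window ends at the next month end instead of after a fixed 31 days, so consecutive windows do not overlap.

```python
import pandas as pd

# Month-end retraining dates, skipping the first two months
month_ends = pd.date_range(
    start="2023-01-24",
    end="2024-07-24",
    freq=pd.offsets.MonthEnd(),
)[2:]

folds = []
for train_end in month_ends:
    folds.append(
        {
            "train_start": train_end - pd.Timedelta(days=30 * 12),
            "train_end": train_end,
            # Predictions start the day after training ends
            "predict_start": train_end + pd.Timedelta(days=1),
            # Variation: stop at the next month end to avoid overlaps
            "predict_end": train_end + pd.offsets.MonthEnd(1),
        }
    )

schedule = pd.DataFrame(folds)
print(schedule)
```

Printing the schedule before launching a long training run is a cheap way to confirm that no prediction window starts before its training window ends.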

In [53]:


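# Combine the per-fold predictions (a list of DataFrames) into one DataFrame.
# Each fold's columns are (fold date, instrument): group by instrument and
# average the predictions where consecutive folds overlap.
predictions_df = pd.concat(predictions, axis=1).T.groupby(level=1).mean().T
predictions_df = predictions_df.sort_index()

# Analyse our predictions
analyze_expected_returns(
    expected_returns=predictions_df.loc[start_date_validate:last_date_validate],
    returns=data["return"].loc[start_date_validate:last_date_validate],
    rfr_hourly=rfr_hourly,
    title="Gradient Boosted Trees, Walk-Forward Cross-Validation, Validation Set",
    lags=[0,1,2,3,6,12],
    tc=tc,
)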