<hr>
<br>
<br>
<br>
<h1><center>Predicting Financial Markets with Machine Learning      </center></h1>
<h1><center>-      </center></h1>
<h2><center>A Simple Linear Model      </center></h2>
<br>
<br>
<hr>
<br>

<br>
<br>
<h2>Purpose</h2>
<br>
<hr>
A notebook for developing a machine learning system that trades cryptocurrencies intraday.
<br>
<br>

<br>
<br>
<h2>Imports</h2>
<br>
<hr>
<br>

In [None]:
# Pandas and Python
import pandas as pd
pd.options.display.float_format = '{:.4f}'.format
import numpy as np
from tqdm import tqdm
import os

# Graphic Libraries
import plotly.io as pio
pio.templates.default = "simple_white"
pd.options.plotting.backend = "plotly"
import matplotlib.pyplot as plt
from IPython.display import display, HTML, clear_output


# AI and stats
import statsmodels.api as sm
import xgboost
import torch




<br>
<br>
<h2>Notebook Parameters</h2>
<br>
<hr>
<br>

In [None]:
# Define data path
data_path = "/home/tbarrau/notebooks/HEC_Course/data_students/in_sample/"

# Risk free rate assumption
risk_free_rate = 0.05 # annual rate, expressed as a fraction (5% per year)
rfr_hourly = (1 + risk_free_rate)**(1 / (24*365)) - 1

# Suggested training set
start_date_train = "2023-01-24"
last_date_train = "2024-01-24"

# Suggested validation set
start_date_validate = "2024-01-25"
last_date_validate = "2024-07-24"

# Test set (Unavailable)
# start_date_test = "2024-07-25"
# last_date_test = "2025-01-24"

# Maximum number of features to use
max_nb_features = 20

# Set a level of transaction costs
tc = 0.0000
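
The hourly risk-free rate above is obtained by geometric (compound) conversion of the annual rate. As a quick sanity check, the sketch below redoes the conversion on its own and verifies that compounding the hourly rate over a full year recovers the annual figure. The variable names are illustrative and are not used elsewhere in the notebook.

```python
# Convert an annual rate to an hourly rate by geometric compounding,
# then check that compounding it back recovers the annual rate.
hours_per_year = 24 * 365  # crypto markets trade around the clock

annual_rate = 0.05  # same assumption as risk_free_rate above
hourly_rate = (1 + annual_rate) ** (1 / hours_per_year) - 1

# Compounding the hourly rate over one year should give back 5%
recovered_annual = (1 + hourly_rate) ** hours_per_year - 1

# The simple (arithmetic) split is slightly larger than the compound one
simple_hourly = annual_rate / hours_per_year

print(f"hourly (geometric): {hourly_rate:.3e}")
print(f"hourly (simple):    {simple_hourly:.3e}")
print(f"recovered annual:   {recovered_annual:.6f}")
```

At hourly frequency the two conversions are very close, so the choice matters little in practice, but the geometric one is consistent with compounding returns.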


<br>
<br>
<h2>Data Loading</h2>
<br>
<hr>
<br>

In [None]:
# Main data
data = pd.read_csv(
    f"{data_path}data_in_sample.csv",
    index_col=0,
    header=[0,1],
)

# Make sure that the index is in the right format
data.index = pd.to_datetime(data.index)

# Visualize data
data


In [None]:
# Check which fields are available
data.columns.get_level_values(0).drop_duplicates()


In [None]:
# Load pre-processed features
features = {}
for dirpath, dirnames, filenames in os.walk(data_path):

    # Keep only the feature files, then cap their number at max_nb_features
    # (sorted so that the selection is reproducible across machines)
    feature_files = sorted(f for f in filenames if "feature" in f)
    for filename in feature_files[-max_nb_features:]:

        print(f"Loading {filename}")

        # Load the feature from the directory in which it was found
        feature = pd.read_csv(
            os.path.join(dirpath, filename),
            index_col=0,
            header=[0],
        )

        # Make sure that the index is in the right format
        feature.index = pd.to_datetime(feature.index)

        # Store in the feature dict
        features[filename.replace(".csv", "")] = feature
           

<br>
<br>
<h2>Analytics</h2>
<br>
<hr>
Basic portfolio analytics to turn predictions of future instrument returns into an investable strategy and to evaluate it.
<br>
<br>


<h4>Naive expected returns definition</h4>
One of the recurring stylized facts about financial markets is that short-term returns tend to revert. Let's leverage this knowledge to build a first, very naive expectation of future returns: the opposite of the past hour's return.
<br> 

In [None]:
expected_returns = -data['return']
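
To make the reversal idea concrete, here is a small sketch on made-up numbers. The tickers `AAA`, `BBB`, `CCC` and their returns are hypothetical and only serve to show how negating past returns orders the instruments.

```python
import pandas as pd

# Hypothetical hourly returns for three made-up instruments
toy_returns = pd.DataFrame(
    {"AAA": [0.02, -0.01], "BBB": [-0.03, 0.00], "CCC": [0.01, 0.04]},
    index=pd.to_datetime(["2023-01-24 00:00", "2023-01-24 01:00"]),
)

# Reversal expectation: what went up the most is expected to go down the most
toy_expected = -toy_returns

# Ranking each row orders instruments from most bearish (1) to most bullish (3)
toy_ranks = toy_expected.rank(axis=1)
print(toy_ranks)
```

In the first hour `BBB` fell the most, so it gets the highest rank (the strongest expected rebound), while `AAA` rose the most and gets the lowest.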

<h4>Analytics definition</h4>
In financial machine learning, minimizing the prediction loss of a model is directly tied to an economic objective: a lower loss should translate into a larger financial gain. Hence, it is quite common to assess the goodness of fit directly in terms of economic indicators, the first of them being the Sharpe ratio.
<br> 
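
As a reminder of what the Sharpe ratio measures, the sketch below computes it on synthetic hourly P&L: the mean excess return divided by its standard deviation, annualized by the square root of the number of hours in a year. The P&L series is randomly generated and the names are illustrative; nothing here comes from the course data.

```python
import numpy as np

rng = np.random.default_rng(0)
hours_per_year = 24 * 365

# Synthetic hourly P&L: a small positive drift plus noise (made-up numbers)
hourly_pnl = rng.normal(loc=2e-5, scale=1e-3, size=hours_per_year)

# Hourly risk-free rate from an assumed 5% annual rate
rfr_hourly_example = (1 + 0.05) ** (1 / hours_per_year) - 1

# Sharpe ratio at hourly frequency (sample standard deviation, as in pandas)
excess = hourly_pnl - rfr_hourly_example
sharpe_hourly = excess.mean() / excess.std(ddof=1)

# Annualize: the mean scales with time, the volatility with its square root
sharpe_annual = sharpe_hourly * np.sqrt(hours_per_year)
print(f"Annualized Sharpe ratio: {sharpe_annual:.2f}")
```

The square-root scaling assumes that hourly returns are uncorrelated through time, which is only an approximation for real strategies.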

In [None]:
def expected_returns_to_positions(expected_returns):
    """
    Normalize expected returns to make them an investable portfolio
    
    :param expected_returns: pd.DataFrame containing expectations
                             about future instruments price variations
    :return: pd.DataFrame of dollar-neutral positions with unit gross leverage
    """

    # Positions will be proportional to ranked alpha
    positions = expected_returns.rank(axis=1)

    # Make the portfolio dollar neutral
    positions = positions.sub(positions.mean(axis=1), axis=0)

    # Re-scale the leverage (after demeaning, so that the absolute
    # weights of each row sum to 1)
    positions = positions.div(positions.abs().sum(axis=1), axis=0)
    
    return positions


def get_sharpe(pnl_portfolio, rfr_hourly):
    """
    Compute the sharpe ratio
    
    :param pnl_portfolio: pd.Series of returns of the portfolio considered
    :param rfr_hourly: float, the hourly risk free rate
    """

    # Compute excess returns
    excess_returns = pnl_portfolio - rfr_hourly
    
    # Compute sharpe ratio
    sharpe_ratio = (
        excess_returns.mean() / excess_returns.std() * np.sqrt(24 * 365)
    )
    
    # Output
    return round(sharpe_ratio, 2)


def pnl_analytics(positions, 
                  returns, 
                  rfr_hourly,
                  lag,
                  tc=0):
    """
    Compute the p&l analytics of the strategy
    
    :param positions: pd.DataFrame, some positions that have been reached
    :param returns: pd.DataFrame containing returns of instruments
    :param rfr_hourly: float, the hourly risk free rate
    :param lag: int, the number of hours to reach the positions
    :param tc: float, the transaction costs
    
    """

    # Compute gross p&l
    pnl = positions.shift(1+lag).mul(returns).sum(axis=1)
    
    # Compute transaction costs
    trades = positions.fillna(0).diff()
    costs = trades.abs().sum(axis=1) * tc
    
    # Net p&l: deduct costs from gross p&l
    pnl = pnl.sub(costs, fill_value=0)
    
    # Compute sharpe
    sharpe = get_sharpe(pnl, rfr_hourly)
    
    return {"sharpe": sharpe,
           "pnl": pnl}


def analyze_expected_returns(
    expected_returns,
    returns,
    rfr_hourly,
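    title = "a Nice Try",
    lags = (0,1,2,3,6,12),
    tc = 0
):
    """
    Provide an economic analysis of some expected_returns
    
    :param expected_returns: pd.DataFrame containing expectations
                             about future instruments price variations
    :param returns: pd.DataFrame containing returns of instruments
    :param rfr_hourly: float, the hourly risk free rate
    :param title: str, used in the title of the plot
    :param lags: iterable of int, the execution lags (in hours) to test
    :param tc: float, the transaction costs
    """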
    
    # Take positions as a function of expected returns
    positions = expected_returns_to_positions(expected_returns)
    
    # Compute p&l and sharpe for different lags
    pnl_lags = {}
    for lag in lags:
        analytics_lag = pnl_analytics(
            positions=positions, 
            returns=returns, 
            rfr_hourly=rfr_hourly,
            lag=lag,
            tc=tc)
        lag_label = f"Lag {lag}, sharpe={analytics_lag['sharpe']}"
        pnl_lags[lag_label] = analytics_lag["pnl"]
        
    # Display returns
    pnl_lags = pd.concat(pnl_lags, axis=1).dropna()
    fig = (1+pnl_lags).cumprod().plot(
        title=f"Cumulative returns of {title}",
    )
    fig.update_layout(yaxis_type="log")
    fig.show()
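
The timing convention in `pnl_analytics` is easy to get wrong, so here is a small sketch on made-up data. Positions decided at hour t are shifted by `1 + lag`, so with `lag = 0` they earn the return of hour t+1. The toy positions, returns and cost level below are hypothetical.

```python
import pandas as pd

idx = pd.to_datetime([
    "2023-01-24 00:00", "2023-01-24 01:00",
    "2023-01-24 02:00", "2023-01-24 03:00",
])

# Hypothetical dollar-neutral positions that flip every hour
toy_positions = pd.DataFrame(
    {"AAA": [0.5, -0.5, 0.5, -0.5], "BBB": [-0.5, 0.5, -0.5, 0.5]}, index=idx
)
# Hypothetical hourly returns
toy_returns = pd.DataFrame(
    {"AAA": [0.01, 0.02, -0.01, 0.03], "BBB": [0.00, -0.01, 0.02, 0.01]}, index=idx
)

lag = 0
tc_example = 0.001  # assumed cost per unit traded, for illustration only

# Positions decided at t are held during hour t + 1 + lag
held = toy_positions.shift(1 + lag)
gross_pnl = held.mul(toy_returns).sum(axis=1)

# Each flip trades 1.0 per instrument, i.e. 2.0 in total per hour
trades = toy_positions.fillna(0).diff()
costs = trades.abs().sum(axis=1) * tc_example

net_pnl = gross_pnl.sub(costs, fill_value=0)
print(pd.DataFrame({"gross": gross_pnl, "costs": costs, "net": net_pnl}))
```

Note that the first hour has no held position yet, so its P&L is 0 rather than missing. With high turnover, as in this toy case, even a small cost level removes a visible share of the gross P&L.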
    


Testing analytics on our naive expected returns
<br> 

In [None]:
# Analyze it
analyze_expected_returns(
    expected_returns=expected_returns,
    returns=data["return"],
    rfr_hourly=rfr_hourly,
    title = "a simple reversal model",
    lags = [0,1,2,3,6,12],
    tc = tc)

<br>
<br>
<h2>A First Model: OLS Predictions</h2>
<br>
<hr>
A multivariate linear model may be one of the simplest ways to produce a model; hence it is a good starting point and a good benchmark.
<br>
<br>


<h4>Label definition</h4>
What do we want to predict? Here, the label is the return of each instrument over the next hour.
<br> 

In [None]:
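# Next-hour return of each instrument, stacked into one long
# (timestamp, instrument) series. Missing labels are dropped explicitly,
# since some pandas versions keep NaNs when stacking.
label = data["return"].loc[start_date_train:last_date_train
    ].shift(-1).stack().dropna()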
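
To see what `shift(-1)` followed by `stack()` does to the label, here is a small sketch on made-up returns. The tickers and values are hypothetical. The point is that the label stored at hour t is the return realized at hour t+1, and that the last hour has no label.

```python
import pandas as pd

idx = pd.to_datetime(
    ["2023-01-24 00:00", "2023-01-24 01:00", "2023-01-24 02:00"]
)
# Hypothetical hourly returns for two made-up instruments
toy_returns = pd.DataFrame(
    {"AAA": [0.01, 0.02, 0.03], "BBB": [-0.01, -0.02, -0.03]}, index=idx
)

# shift(-1) moves the next hour's return onto the current row;
# stack() turns the table into a (timestamp, instrument) series
toy_label = toy_returns.shift(-1).stack().dropna()
print(toy_label)
```

The resulting series has four entries instead of six: the two rows for the last hour are dropped because their next-hour return is unknown.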

<h4>Features Pre-processing</h4>
How should the features be processed?
<br> 

In [None]:
features_normalized = {}

for feature_name in features.keys():

    print(f"Processing {feature_name}")
    
    # Extract the feature
    feature_normalized = features[feature_name]

    # Rank the feature to remove outliers
    feature_normalized = feature_normalized.rank(axis=1, pct=True) - 0.5

    # Stack the feature
    feature_normalized = feature_normalized.stack()

    # Store this normalized version
    features_normalized[feature_name] = feature_normalized

# Convert normalized features dict to a single dataframe
features_normalized = pd.concat(features_normalized, axis=1)

# Replace NaNs by 0, which is close to the cross-sectional average of the
# centered ranks, as OLS cannot handle NaNs
features_normalized = features_normalized.fillna(0)

# Reindex like the label for training; label rows that are missing from
# the features are filled with 0 as well
features_normalized_train = features_normalized.reindex(label.index).fillna(0)
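
The effect of the rank normalization is easiest to see on a toy feature. In the sketch below (made-up tickers and values), one instrument has an extreme value and another one is missing. The percentile rank caps the outlier at 0.5 and leaves the missing value as NaN until it is filled.

```python
import numpy as np
import pandas as pd

# Hypothetical raw feature with one outlier (100.0) and one missing value
toy_feature = pd.DataFrame(
    {
        "AAA": [1.0, 5.0],
        "BBB": [100.0, np.nan],
        "CCC": [2.0, 7.0],
        "DDD": [3.0, 6.0],
    },
    index=pd.to_datetime(["2023-01-24 00:00", "2023-01-24 01:00"]),
)

# Cross-sectional percentile rank, shifted to lie in (-0.5, 0.5]
toy_normalized = toy_feature.rank(axis=1, pct=True) - 0.5

# Missing values are then replaced by 0, a neutral value near the middle
toy_filled = toy_normalized.fillna(0)
print(toy_filled)
```

Whether the raw value is 100 or 1,000,000, the instrument with the largest value always ends up at 0.5, which is what makes the regression robust to outliers.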
                                                        

<h4>Model Creation</h4>
<br> 

In [None]:
# Create model
model = sm.OLS(
    endog = label,
    exog = sm.add_constant(features_normalized_train),
)

# Fit the model
model = model.fit()
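
For intuition about what the fit estimates, here is a minimal sketch of ordinary least squares with an intercept on synthetic data, written with plain NumPy rather than statsmodels. The coefficients, noise level and sample size are made up. It only illustrates the least-squares principle and is not a replacement for the model above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs = 5_000

# Two synthetic rank-like features in [-0.5, 0.5]
X = rng.uniform(-0.5, 0.5, size=(n_obs, 2))

# Add a constant column, as sm.add_constant does for the real features
X_const = np.column_stack([np.ones(n_obs), X])

# Made-up true coefficients: intercept, then one slope per feature
true_beta = np.array([0.001, 0.002, -0.001])
y = X_const @ true_beta + rng.normal(scale=0.001, size=n_obs)

# Least-squares estimate: minimizes the sum of squared residuals
beta_hat, *_ = np.linalg.lstsq(X_const, y, rcond=None)
residuals = y - X_const @ beta_hat

print("estimated coefficients:", np.round(beta_hat, 4))
```

At the optimum, the residuals are orthogonal to every regressor, including the constant, so they average to zero in-sample. On real return data the noise is much larger relative to the signal than in this toy setup, so estimated coefficients are far less precise.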


<h4>Model Predictions</h4>
<br> 
<h5>Train Set</h5>
<br> 

In [None]:
# Make predictions
predictions = model.predict(
    sm.add_constant(features_normalized_train)).unstack()

# Analyse our predictions
analyze_expected_returns(
    expected_returns=predictions,
    returns=data["return"].loc[start_date_train:last_date_train],
    rfr_hourly=rfr_hourly,
    title = "a simple OLS model, Training Set",
    lags = [0,1,2,3,6,12],
    tc = tc)

<br> 
<h5>Validation Set</h5>
<br> 

In [None]:
# Extract features on the validation set
features_normalized_validate = features_normalized.sort_index().loc[
    start_date_validate:last_date_validate]

In [None]:
# Make predictions
predictions = model.predict(
    sm.add_constant(features_normalized_validate)).unstack()

# Analyse our predictions
analyze_expected_returns(
    expected_returns=predictions,
    returns=data["return"].loc[start_date_validate:last_date_validate],
    rfr_hourly=rfr_hourly,
    title = "a simple OLS model, Validation Set",
    lags = [0,1,2,3,6,12],
    tc = tc)