# Long-Short Strategy, Part 5: Generating out-of-sample predictions

In this section, we'll start designing, implementing, and evaluating a trading strategy for US equities driven by daily return forecasts produced by gradient boosting models.

As in the previous examples, we'll lay out a framework and build a specific example that you can adapt to run your own experiments. There are numerous aspects that you can vary, from the asset class and investment universe to more granular aspects like the features, holding period, or trading rules. See, for example, the **Alpha Factor Library** in the [Appendix](../24_alpha_factor_library) for numerous additional features.

We'll keep the trading strategy simple and only use a single ML signal; a real-life application will likely use multiple signals from different sources, such as complementary ML models trained on different datasets or with different lookahead or lookback periods. It would also use sophisticated risk management, from simple stop-loss to value-at-risk analysis.

**Six notebooks** cover our workflow sequence:

1. [preparing_the_model_data](04_preparing_the_model_data.ipynb): we engineer a few simple features from the Quandl Wiki data.
2. [trading_signals_with_lightgbm_and_catboost](05_trading_signals_with_lightgbm_and_catboost.ipynb): we tune hyperparameters for LightGBM and CatBoost to select a model, using 2015/16 as our validation period. 
3. [evaluate_trading_signals](06_evaluate_trading_signals): we compare the cross-validation performance using various metrics to select the best model. 
4. [model_interpretation](07_model_interpretation.ipynb): we take a closer look at the drivers behind the best model's predictions.
5. `making_out_of_sample_predictions` (this notebook): we predict returns for our out-of-sample period 2019-2023.
6. [backtesting_with_zipline](09_backtesting_with_zipline.ipynb): we evaluate the historical performance of a long-short strategy based on our predictive signals using Zipline.

## Imports & Settings

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [10]:
%matplotlib inline

from time import time
import sys, os
from pathlib import Path

import pandas as pd
from scipy.stats import spearmanr

import lightgbm as lgb
from catboost import Pool, CatBoostRegressor

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sys.path.insert(1, os.path.join(sys.path[0], '..'))
from utils import MultipleTimeSeriesCV

sns.set_style('whitegrid')


np.random.seed(42)



DATA_STORE_ITEM = 'engineered_features_trimmed'
YEAR = 52

idx = pd.IndexSlice

In [11]:
scope_params = ['lookahead', 'train_length', 'test_length']
daily_ic_metrics = ['daily_ic_mean', 'daily_ic_mean_n', 'daily_ic_median', 'daily_ic_median_n']
lgb_train_params = ['learning_rate', 'num_leaves', 'feature_fraction', 'min_data_in_leaf']


## Generate LightGBM predictions

### Model Configuration

In [12]:
base_params = dict(boosting='gbdt',
                   objective='regression',
                   random_state = 42, 
                   verbose=-1)

categoricals = ['sector',]
#categoricals = []#'month','sector','year', 'month', ]

In [13]:
# forecast horizon in weekly periods (the original comment reads "two weeks")
lookahead = 1
store = Path('data/predictions.h5')  # not deleted beforehand because step 5 already did so

### Get Data

In [14]:
data = pd.read_hdf('data/assets.h5', DATA_STORE_ITEM).sort_index()  # modified

In [15]:
labels = sorted(data.filter(like='target').columns)
features = data.columns.difference(labels).tolist()
label = 'target_1w'
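
The split into labels and features relies on a naming convention: every column whose name contains `target` is treated as a label, and everything else as a feature. A minimal sketch on a toy frame shows what `filter(like=...)` and `Index.difference` return; the column names `ret_1w` and `target_2w` are made up for illustration.

```python
import pandas as pd

# Toy frame; 'ret_1w' and 'target_2w' are hypothetical column names.
toy = pd.DataFrame({'vix_diff': [0.1], 'target_1w': [0.5],
                    'ret_1w': [0.2], 'target_2w': [0.7]})

# Every column containing 'target' counts as a label ...
toy_labels = sorted(toy.filter(like='target').columns)
# ... and the remaining columns are the features (returned in sorted order).
toy_features = toy.columns.difference(toy_labels).tolist()

print(toy_labels, toy_features)
```

Any feature whose name happens to contain `target` would be silently excluded from the feature list, so the convention is worth keeping in mind when adding columns.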

In [18]:
# Find the columns with at least one NaN value from 2024 onward
nan_cols = data.loc[idx[:, '2024':], features + [label]].isna().any(axis=0)

print(nan_cols[nan_cols])


target_1w    True
dtype: bool


In [27]:
# forward-fill with the previous period's values so that the latest observation is not NaN
data = data.ffill()
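
Because the frame is indexed by `(ticker, date)` and sorted by ticker, a plain forward-fill walks straight across ticker boundaries: a missing value in the first row of one ticker is filled from the last row of the previous ticker. The sketch below, on toy data with hypothetical tickers `AAA` and `BBB`, contrasts this with a group-wise fill. It is an illustration of the difference, not a claim about which variant this notebook's data needs.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [['AAA', 'BBB'], pd.to_datetime(['2024-12-01', '2024-12-08'])],
    names=['ticker', 'date'])
toy = pd.DataFrame({'target_1w': [0.5, np.nan, np.nan, 0.2]}, index=index)

# Plain forward-fill: BBB's first row inherits AAA's last value.
global_fill = toy.ffill()

# Group-wise forward-fill: values are only carried forward within a ticker.
per_ticker_fill = toy.groupby(level='ticker').ffill()

print(global_fill)
print(per_ticker_fill)
```

With the group-wise variant, a leading gap for a ticker stays `NaN` and is removed by a subsequent `dropna()` instead of being filled with another asset's value.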

In [28]:
# data from 2010 onward
data = data.loc[idx[:, '2010':], features + [label]].dropna()
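
The selection above combines `pd.IndexSlice` with a partial date string to keep all tickers but only dates from 2010 onward, and then drops incomplete rows. A small sketch on toy data (tickers and values are hypothetical) shows the effect of each step:

```python
import pandas as pd

idx = pd.IndexSlice
index = pd.MultiIndex.from_product(
    [['AAA', 'BBB'],
     pd.to_datetime(['2009-12-27', '2010-01-03', '2010-01-10'])],
    names=['ticker', 'date'])
toy = pd.DataFrame({'feature': range(6),
                    'target_1w': [0.1, 0.2, None, 0.3, 0.4, 0.5]},
                   index=index)

# All tickers (first level), dates from 2010 onward (second level).
from_2010 = toy.loc[idx[:, '2010':], ['feature', 'target_1w']]

# Rows with a missing feature or label are removed.
subset = from_2010.dropna()

print(len(toy), len(from_2010), len(subset))
```

The slice only works reliably on a sorted index, which is why the data is passed through `sort_index()` right after loading.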

In [29]:
data.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 8591 entries, ('IYR', Timestamp('2010-01-03 00:00:00')) to ('XLY', Timestamp('2024-12-15 00:00:00'))
Data columns (total 55 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   10y_real_interest_rate_diff                   8591 non-null   float64
 1   1y_yield_diff                                 8591 non-null   float64
 2   CMA                                           8591 non-null   float64
 3   HML                                           8591 non-null   float64
 4   M2_money_supply_diff                          8591 non-null   float64
 5   Mkt-RF                                        8591 non-null   float64
 6   RMW                                           8591 non-null   float64
 7   SMB                                           8591 non-null   float64
 8   business_inventory_diff                       8591 non-null  

In [31]:
data.tail()

| ticker | date | 10y_real_interest_rate_diff | 1y_yield_diff | CMA | HML | M2_money_supply_diff | Mkt-RF | RMW | SMB | business_inventory_diff | coffee_diff | ... | streaming_media_consumption_diff | tot_bank_credit_diff | vix_diff | weekjobclaims | weekjobclaims_diff | wheat_diff | year | yield_curve | yield_curve_diff | target_1w |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| XLY | 2024-11-17 | 0.0 | 0.02 | -1.725603 | 1.879458 | 0.0 | 5.041495 | -4.336978 | 1.185825 | 0.0 | 0.0 | ... | 0.0 | 26.8118 | 1.2 | 215000.0 | -4000.0 | 0.0 | 2024 | -0.17 | 0.16 | 0.921709 |
| XLY | 2024-11-24 | 0.0 | 0.08 | -1.725603 | 1.879458 | 0.0 | 5.041495 | -4.336978 | 1.185825 | 0.0 | 0.0 | ... | 0.0 | -17.2742 | -0.9 | 215000.0 | 0.0 | 0.0 | 2024 | -0.22 | -0.05 | 0.80907 |
| XLY | 2024-12-01 | -0.145812 | -0.12 | -1.725603 | 1.879458 | 0.0 | 5.041495 | -4.336978 | 1.185825 | 0.0 | 0.0 | ... | 0.0 | 39.6138 | -1.73 | 225000.0 | 10000.0 | 0.0 | 2024 | -0.4 | -0.18 | 1.982427 |
| XLY | 2024-12-08 | 0.0 | -0.11 | -1.725603 | 1.879458 | 0.0 | 5.041495 | -4.336978 | 1.185825 | 0.0 | 0.0 | ... | 0.0 | 0.0 | -0.74 | 242000.0 | 17000.0 | 0.0 | 2024 | -0.27 | 0.13 | 0.308457 |
| XLY | 2024-12-15 | 0.0 | 0.02 | -1.725603 | 1.879458 | 0.0 | 5.041495 | -4.336978 | 1.185825 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.81 | 242000.0 | 0.0 | 0.0 | 2024 | -0.03 | 0.24 | 0.308457 |


In [32]:
for feature in categoricals:
    data[feature] = pd.factorize(data[feature], sort=True)[0]
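
The loop above replaces each categorical column with integer codes, which is one of the forms LightGBM accepts for categorical features. A short sketch of what `pd.factorize(..., sort=True)` returns, using made-up sector labels:

```python
import pandas as pd

# Hypothetical sector labels for illustration.
sectors = pd.Series(['tech', 'energy', 'tech', 'utilities'])

# sort=True assigns codes in sorted label order, so the mapping does not
# depend on the order in which labels first appear in the data.
codes, uniques = pd.factorize(sectors, sort=True)

print(list(codes))    # integer code per row
print(list(uniques))  # label that each code stands for
```

Keeping `uniques` around makes it possible to map the codes back to sector names later, for example when interpreting the model.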

In [33]:
lgb_data = lgb.Dataset(data=data[features],
                       label=data[label],
                       categorical_feature=categoricals,
                       free_raw_data=False)

### Generate predictions

In [34]:
# load the stored ICs
lgb_ic = pd.read_hdf('data/model_tuning.h5', 'lgb/ic')
lgb_daily_ic = pd.read_hdf('data/model_tuning.h5', 'lgb/daily_ic')
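
The information coefficient (IC) used throughout this notebook is the Spearman rank correlation between predictions and realized returns, computed per date across tickers. The sketch below reproduces that calculation on toy data with hypothetical tickers and values. It uses the fact that Spearman's correlation equals the Pearson correlation of the ranks, so it needs only pandas; it is an illustration, not the code that produced the stored results.

```python
import pandas as pd

toy = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-07'] * 3 + ['2024-01-14'] * 3),
    'ticker': ['AAA', 'BBB', 'CCC'] * 2,
    'y_test': [0.1, 0.2, 0.3, 0.3, 0.1, 0.2],
    'prediction': [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
}).set_index(['ticker', 'date'])


def rank_ic(group):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return group['y_test'].rank().corr(group['prediction'].rank())


# One IC value per date, computed across the tickers on that date.
ic_by_day = toy.groupby(level='date').apply(rank_ic)
print(ic_by_day)
print(ic_by_day.mean())
```

On the first date the predicted ordering matches the realized one exactly; on the second it does not, so the IC is negative. Averaging these daily values gives the kind of summary stored under keys such as `daily_ic_mean`.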

In [51]:
data

| ticker | date | 10y_real_interest_rate_diff | 1y_yield_diff | CMA | HML | M2_money_supply_diff | Mkt-RF | RMW | SMB | business_inventory_diff | coffee_diff | ... | streaming_media_consumption_diff | tot_bank_credit_diff | vix_diff | weekjobclaims | weekjobclaims_diff | wheat_diff | year | yield_curve | yield_curve_diff | target_1w |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IYR | 2010-01-03 | 0.449884 | 0.04 | 8.319540 | -11.294190 | -34.5 | 6.173885 | 14.445141 | 14.058448 | 2387.0 | 0.711801 | ... | 2.497 | 16.1658 | 2.21 | 456000.0 | -12000.0 | -3.333636 | 2010 | 3.79 | 0.02 | 0.032724 |
| IYR | 2010-01-10 | 0.000000 | -0.10 | 9.121056 | -8.040257 | 0.0 | 4.960759 | 16.718611 | 12.562260 | 0.0 | 0.000000 | ... | 0.000 | -23.2842 | -3.55 | 469000.0 | 13000.0 | 0.000000 | 2010 | 3.78 | -0.01 | -0.119889 |
| IYR | 2010-01-17 | 0.000000 | -0.04 | 9.121056 | -8.040257 | 0.0 | 4.960759 | 16.718611 | 12.562260 | 0.0 | 0.000000 | ... | 0.000 | -23.5856 | -0.22 | 507000.0 | 38000.0 | 0.000000 | 2010 | 3.64 | -0.14 | -0.795806 |
| IYR | 2010-01-24 | 0.000000 | -0.03 | 9.121056 | -8.040257 | 0.0 | 4.960759 | 16.718611 | 12.562260 | 0.0 | 0.000000 | ... | 0.000 | 0.4094 | 9.40 | 471000.0 | -36000.0 | 0.000000 | 2010 | 3.56 | -0.08 | -0.160773 |
| IYR | 2010-01-31 | 0.000000 | 0.00 | 9.121056 | -8.040257 | 0.0 | 4.960759 | 16.718611 | 12.562260 | 0.0 | 0.000000 | ... | 0.000 | -26.3653 | -2.69 | 496000.0 | 25000.0 | 0.000000 | 2010 | 3.55 | -0.01 | 0.061537 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| XLY | 2024-11-17 | 0.000000 | 0.02 | -1.725603 | 1.879458 | 0.0 | 5.041495 | -4.336978 | 1.185825 | 0.0 | 0.000000 | ... | 0.000 | 26.8118 | 1.20 | 215000.0 | -4000.0 | 0.000000 | 2024 | -0.17 | 0.16 | 0.921709 |
| XLY | 2024-11-24 | 0.000000 | 0.08 | -1.725603 | 1.879458 | 0.0 | 5.041495 | -4.336978 | 1.185825 | 0.0 | 0.000000 | ... | 0.000 | -17.2742 | -0.90 | 215000.0 | 0.0 | 0.000000 | 2024 | -0.22 | -0.05 | 0.809070 |
| XLY | 2024-12-01 | -0.145812 | -0.12 | -1.725603 | 1.879458 | 0.0 | 5.041495 | -4.336978 | 1.185825 | 0.0 | 0.000000 | ... | 0.000 | 39.6138 | -1.73 | 225000.0 | 10000.0 | 0.000000 | 2024 | -0.40 | -0.18 | 1.982427 |
| XLY | 2024-12-08 | 0.000000 | -0.11 | -1.725603 | 1.879458 | 0.0 | 5.041495 | -4.336978 | 1.185825 | 0.0 | 0.000000 | ... | 0.000 | 0.0000 | -0.74 | 242000.0 | 17000.0 | 0.000000 | 2024 | -0.27 | 0.13 | 0.308457 |


In [38]:
# return the best parameters found during tuning for a given lookahead
def get_lgb_params(data, t=5, best=0):
    param_cols = scope_params[1:] + lgb_train_params + ['boost_rounds']
    df = data[data.lookahead==t].sort_values('ic', ascending=False).iloc[best]
    return df.loc[param_cols]
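
To make the selection logic concrete, the sketch below runs the same function on a small, hypothetical tuning table. The numbers are invented; the real results come from `data/model_tuning.h5`. It filters to one lookahead, ranks by `ic`, and picks the row at position `best`.

```python
import pandas as pd

scope_params = ['lookahead', 'train_length', 'test_length']
lgb_train_params = ['learning_rate', 'num_leaves',
                    'feature_fraction', 'min_data_in_leaf']

# Hypothetical tuning results for illustration only.
toy_results = pd.DataFrame({
    'lookahead': [1, 1, 5],
    'train_length': [464, 216, 464],
    'test_length': [1, 1, 1],
    'learning_rate': [0.01, 0.1, 0.3],
    'num_leaves': [4, 8, 32],
    'feature_fraction': [0.95, 0.6, 0.3],
    'min_data_in_leaf': [500, 250, 1000],
    'boost_rounds': [10, 25, 50],
    'ic': [0.03, 0.05, 0.09],
})


def get_lgb_params(data, t=5, best=0):
    param_cols = scope_params[1:] + lgb_train_params + ['boost_rounds']
    df = data[data.lookahead == t].sort_values('ic', ascending=False).iloc[best]
    return df.loc[param_cols]


best = get_lgb_params(toy_results, t=1, best=0)       # highest IC for lookahead 1
runner_up = get_lgb_params(toy_results, t=1, best=1)  # second-highest IC
print(best)
```

The row with the highest overall IC belongs to a different lookahead and is ignored. Because the selected row mixes integer and float columns, pandas returns it as a float Series, which is why integer-valued parameters such as `num_leaves` need to be cast back to `int` before they are passed to LightGBM.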

In [39]:
# extend the out-of-sample period beyond the 1 year defined initially
years_OOS = 4.9

In [42]:
params = get_lgb_params(lgb_daily_ic, t=lookahead)

In [43]:
params

train_length        464.00
test_length           1.00
learning_rate         0.01
num_leaves            4.00
feature_fraction      0.95
min_data_in_leaf    500.00
boost_rounds         10.00
Name: 39, dtype: float64
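
The prediction loop that follows walks forward through the out-of-sample period using `MultipleTimeSeriesCV` from `utils`, whose implementation is not shown in this notebook. The sketch below is a hypothetical stand-in, `walk_forward_windows`, written only to illustrate the idea of rolling train and test windows; it is not that class, and the way it handles `lookahead` (dropping `lookahead - 1` periods between training and test data) is an assumption. It also shows how the number of splits follows from `YEAR`, `years_OOS`, and `test_length`.

```python
def walk_forward_windows(periods, n_splits, train_length, test_length, lookahead=1):
    """Yield (train_periods, test_periods), newest test window first.

    Hypothetical helper for illustration; not the MultipleTimeSeriesCV class.
    """
    periods = sorted(periods)
    gap = lookahead - 1  # assumed purge between the training and test windows
    for i in range(n_splits):
        test_end = len(periods) - i * test_length
        test_start = test_end - test_length
        train_end = test_start - gap
        train_start = train_end - train_length
        if train_start < 0:
            break  # not enough history left for a full training window
        yield periods[train_start:train_end], periods[test_start:test_end]


# Number of splits for 4.9 years of weekly data with one-week test windows.
n_splits = int(52 * 4.9 / 1)

# Ten toy weekly periods, three splits, four weeks of training, two of testing.
windows = list(walk_forward_windows(range(1, 11), n_splits=3,
                                    train_length=4, test_length=2))
for train, test in windows:
    print('train', list(train), 'test', list(test))
```

Each training window ends where its test window begins, and consecutive test windows do not overlap, so every out-of-sample period is predicted exactly once by a model trained only on earlier data.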

In [97]:
# loop over the 10 best parameter configurations and store the predictions of each
for position in range(10):
    params = get_lgb_params(lgb_daily_ic,
                            t=lookahead,
                            best=position)

    params = params.to_dict()  # convert the parameters to a dictionary

    for p in ['min_data_in_leaf', 'num_leaves']:
        params[p] = int(params[p])
    train_length = int(params.pop('train_length'))  # pop 'train_length' from the parameter dictionary and cast it to int
    test_length = int(params.pop('test_length'))
    num_boost_round = int(params.pop('boost_rounds'))
    params.update(base_params)

    print(f'\nPosition: {position:02}')

    # out-of-sample period of `years_OOS` years
    # walk forward with test periods of `test_length`: train the model, predict the next test period, then roll forward and retrain
    n_splits = int(YEAR * years_OOS / test_length)
    cv = MultipleTimeSeriesCV(n_splits=n_splits,
                              test_period_length=test_length,
                              lookahead=lookahead,
                              train_period_length=train_length)

    predictions = []
    start = time()
    for i, (train_idx, test_idx) in enumerate(cv.split(X=data), 1):
        print(i, end=' ', flush=True)
        
        # Create a training dataset for LightGBM
        lgb_train = lgb_data.subset(used_indices=train_idx.tolist(),
                                    params=params).construct()
        # Train the LightGBM model
        model = lgb.train(params=params,
                          train_set=lgb_train,
                          num_boost_round=num_boost_round,
                        )

        test_set = data.iloc[test_idx, :]
        y_test = test_set.loc[:, label].to_frame('y_test')
        # Make predictions on the test set
        y_pred = model.predict(test_set.loc[:, model.feature_name()])
        predictions.append(y_test.assign(prediction=y_pred))

    if position == 0:
        test_predictions = (pd.concat(predictions)
                            .rename(columns={'prediction': position}))
    else:
        test_predictions[position] = pd.concat(predictions).prediction

by_day = test_predictions.groupby(level='date')  # group the predictions by date
for position in range(10):
    # On the first iteration, compute the Spearman correlation between the predictions
    # and the true labels and store it in `ic_by_day`; later iterations add one column each
    if position == 0:
        ic_by_day = by_day.apply(lambda x: spearmanr(
            x.y_test, x[position])[0]).to_frame()
    else:
        ic_by_day[position] = by_day.apply(
            lambda x: spearmanr(x.y_test, x[position])[0])
print(ic_by_day.describe())
test_predictions.to_hdf(store, key=f'lgb/test/{lookahead:02}')


Position: 00
1 2 3 4 5 6 7 8 9 10 11 12 13 

14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 
Position: 01
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 