# Long-Short Strategy, Part 5: Generating out-of-sample predictions

In this section, we'll start designing, implementing, and evaluating a trading strategy for US equities driven by daily return forecasts produced by gradient boosting models.

As in the previous examples, we'll lay out a framework and build a specific example that you can adapt to run your own experiments. There are numerous aspects that you can vary, from the asset class and investment universe to more granular aspects like the features, holding period, or trading rules. See, for example, the **Alpha Factor Library** in the [Appendix](../24_alpha_factor_library) for numerous additional features.

We'll keep the trading strategy simple and only use a single ML signal; a real-life application will likely use multiple signals from different sources, such as complementary ML models trained on different datasets or with different lookahead or lookback periods. It would also use sophisticated risk management, from simple stop-loss to value-at-risk analysis.

**Six notebooks** cover our workflow sequence:

1. [preparing_the_model_data](04_preparing_the_model_data.ipynb): we engineer a few simple features from the Quandl Wiki data.
2. [trading_signals_with_lightgbm_and_catboost](05_trading_signals_with_lightgbm_and_catboost.ipynb): we tune hyperparameters for LightGBM and CatBoost to select a model, using 2015/16 as our validation period. 
3. [evaluate_trading_signals](06_evaluate_trading_signals.ipynb): we compare the cross-validation performance using various metrics to select the best model.
4. [model_interpretation](07_model_interpretation.ipynb): we take a closer look at the drivers behind the best model's predictions.
5. `making_out_of_sample_predictions` (this notebook): we predict returns for our out-of-sample period 2019-2023.
6. [backtesting_with_zipline](09_backtesting_with_zipline.ipynb): evaluate the historical performance of a long-short strategy based on our predictive signals using Zipline.

## Imports & Settings

In [67]:
import warnings
warnings.filterwarnings('ignore')

In [68]:
%matplotlib inline

from time import time
import sys, os
from pathlib import Path

import pandas as pd
from scipy.stats import spearmanr

import lightgbm as lgb
from catboost import Pool, CatBoostRegressor

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [69]:

np.random.seed(42)

In [70]:
sys.path.insert(1, os.path.join(sys.path[0], '..'))
from utils import MultipleTimeSeriesCV

In [71]:
sns.set_style('whitegrid')

In [72]:
#YEAR = 252  # trading days per year (daily data)
YEAR = 12  # periods per year for monthly data
datos_semanales = 1  # set to 1 if we want our data to be weekly
if datos_semanales == 1:
    YEAR = 52  # weeks per year
idx = pd.IndexSlice

In [73]:
scope_params = ['lookahead', 'train_length', 'test_length']
daily_ic_metrics = ['daily_ic_mean', 'daily_ic_mean_n', 'daily_ic_median', 'daily_ic_median_n']
lgb_train_params = ['learning_rate', 'num_leaves', 'feature_fraction', 'min_data_in_leaf']
rf_train_params = ['bagging_fraction', 'feature_fraction', 'min_data_in_leaf','max_depth']
#catboost_train_params = ['max_depth', 'min_child_samples']

## Generate Lightgbm predictions

### Model Configuration

In [74]:
base_params = dict(boosting='gbdt',
                   objective='regression',
                   random_state = 42, 
                   verbose=-1)

categoricals = ['sector',]
#categoricals = []#'month','sector','year', 'month', ]

In [75]:
# forecast horizon in periods; the label used below is target_1w
lookahead = 1
store = Path('data/predictions.h5')  # not deleted beforehand because step 5 already did that

### Get Data

In [76]:
#data = pd.read_hdf('data.h5', 'model_data').sort_index()
data = pd.read_hdf('data/assets.h5', 'engineered_features_trimmed').sort_index()  # modified

In [77]:
data.tail()

Unnamed: 0_level_0,Unnamed: 1_level_0,return_1w,sector,target_1w,return_2w,return_52w,Mkt-RF,SMB,HML,RMW,CMA,...,natural_gas_diff,business_inventory_diff,corporate_profits_diff,semiconductor_electronics_manufacturing_diff,consumer_price_index_diff,M2_money_supply_diff,10y_real_interest_rate_diff,new_homes_diff,streaming_media_consumption_diff,gold_diff
ticker,date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
XLY,2024-11-17,-0.325538,XLY,0.921709,1.973735,3.252724,5.041495,1.185825,1.879458,-4.336978,-1.725603,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
XLY,2024-11-24,0.921709,XLY,0.80907,0.407887,3.426819,5.041495,1.185825,1.879458,-4.336978,-1.725603,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
XLY,2024-12-01,0.80907,XLY,1.982427,1.224426,3.420246,5.041495,1.185825,1.879458,-4.336978,-1.725603,...,0.0,0.0,0.0,0.0,0.0,0.0,-0.145812,0.0,0.0,0.0
XLY,2024-12-08,1.982427,XLY,0.308457,1.962504,3.826297,5.041495,1.185825,1.879458,-4.336978,-1.725603,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
XLY,2024-12-15,0.308457,XLY,,1.573796,3.416047,5.041495,1.185825,1.879458,-4.336978,-1.725603,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [78]:
#labels = sorted(data.filter(like='_fwd').columns)
labels = sorted(data.filter(like='target').columns)
features = data.columns.difference(labels).tolist()
#label = f'r{lookahead:02}_fwd'
label = 'target_1w'  # modified

In [79]:
label

'target_1w'

In [80]:
data.info()


<class 'pandas.core.frame.DataFrame'>
MultiIndex: 13750 entries, ('IYR', Timestamp('2001-01-07 00:00:00')) to ('XLY', Timestamp('2024-12-15 00:00:00'))
Data columns (total 55 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   return_1w                                     13178 non-null  float64
 1   sector                                        13750 non-null  object 
 2   target_1w                                     13178 non-null  float64
 3   return_2w                                     13178 non-null  float64
 4   return_52w                                    13178 non-null  float64
 5   Mkt-RF                                        13750 non-null  float64
 6   SMB                                           13750 non-null  float64
 7   HML                                           13750 non-null  float64
 8   RMW                                           13750 non-null

In [81]:
# Find the columns with at least one NaN value among the rows from 2024 onward
nan_cols = data.loc[idx[:, '2024':], features + [label]].isna().any(axis=0)

print(nan_cols[nan_cols])


target_1w    True
dtype: bool


In [82]:
# fill with the previous period's values so that the last observation is not NaN
data = data.ffill()
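
One caveat worth checking: a plain forward fill on a frame indexed by `(ticker, date)` runs down the sorted rows, so a gap in the first row of one ticker is filled from the last row of the previous ticker. The following is a minimal sketch on made-up data (the tickers `AAA`/`BBB` and the values are illustrative assumptions, not taken from the notebook) that contrasts the plain fill with a per-ticker fill:

```python
import numpy as np
import pandas as pd

# Toy (ticker, date) frame; tickers and values are invented for illustration.
index = pd.MultiIndex.from_product(
    [['AAA', 'BBB'], pd.to_datetime(['2024-12-01', '2024-12-08'])],
    names=['ticker', 'date'])
toy = pd.DataFrame({'target_1w': [0.5, np.nan, np.nan, 0.2]}, index=index)

# Plain forward fill: BBB's first row inherits AAA's last value.
plain = toy.ffill()

# Per-ticker forward fill: values never cross ticker boundaries.
per_ticker = toy.groupby(level='ticker').ffill()

print(plain)
print(per_ticker)
```

Whether the per-ticker variant is preferable here depends on where the gaps sit; in this notebook the only column with missing values in 2024 is the label at the last date of each ticker.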

In [83]:
# keep data from 2010 onward
data = data.loc[idx[:, '2010':], features + [label]].dropna()

In [84]:
data.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 8591 entries, ('IYR', Timestamp('2010-01-03 00:00:00')) to ('XLY', Timestamp('2024-12-15 00:00:00'))
Data columns (total 55 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   10y_real_interest_rate_diff                   8591 non-null   float64
 1   1y_yield_diff                                 8591 non-null   float64
 2   CMA                                           8591 non-null   float64
 3   HML                                           8591 non-null   float64
 4   M2_money_supply_diff                          8591 non-null   float64
 5   Mkt-RF                                        8591 non-null   float64
 6   RMW                                           8591 non-null   float64
 7   SMB                                           8591 non-null   float64
 8   business_inventory_diff                       8591 non-null  

In [85]:
#data.loc[idx[:, '2018'],:]

In [86]:
data.tail()

Unnamed: 0_level_0,Unnamed: 1_level_0,10y_real_interest_rate_diff,1y_yield_diff,CMA,HML,M2_money_supply_diff,Mkt-RF,RMW,SMB,business_inventory_diff,coffee_diff,...,streaming_media_consumption_diff,tot_bank_credit_diff,vix_diff,weekjobclaims,weekjobclaims_diff,wheat_diff,year,yield_curve,yield_curve_diff,target_1w
ticker,date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
XLY,2024-11-17,0.0,0.02,-1.725603,1.879458,0.0,5.041495,-4.336978,1.185825,0.0,0.0,...,0.0,26.8118,1.2,215000.0,-4000.0,0.0,2024,-0.17,0.16,0.921709
XLY,2024-11-24,0.0,0.08,-1.725603,1.879458,0.0,5.041495,-4.336978,1.185825,0.0,0.0,...,0.0,-17.2742,-0.9,215000.0,0.0,0.0,2024,-0.22,-0.05,0.80907
XLY,2024-12-01,-0.145812,-0.12,-1.725603,1.879458,0.0,5.041495,-4.336978,1.185825,0.0,0.0,...,0.0,39.6138,-1.73,225000.0,10000.0,0.0,2024,-0.4,-0.18,1.982427
XLY,2024-12-08,0.0,-0.11,-1.725603,1.879458,0.0,5.041495,-4.336978,1.185825,0.0,0.0,...,0.0,0.0,-0.74,242000.0,17000.0,0.0,2024,-0.27,0.13,0.308457
XLY,2024-12-15,0.0,0.02,-1.725603,1.879458,0.0,5.041495,-4.336978,1.185825,0.0,0.0,...,0.0,0.0,0.81,242000.0,0.0,0.0,2024,-0.03,0.24,0.308457


In [87]:
data.columns


Index(['10y_real_interest_rate_diff', '1y_yield_diff', 'CMA', 'HML',
       'M2_money_supply_diff', 'Mkt-RF', 'RMW', 'SMB',
       'business_inventory_diff', 'coffee_diff', 'consumer_price_index_diff',
       'copper_diff', 'corn_diff', 'corp_oas', 'corp_oas_diff',
       'corporate_profits_diff', 'cotton_diff', 'empleo_diff',
       'energy_price_diff', 'gold_diff', 'inflacion_diff', 'leading_diff',
       'lumber_diff', 'momentum_2', 'month', 'natural_gas_diff', 'new_homes',
       'new_homes_diff', 'oil', 'oil_diff', 'recession', 'recession_diff',
       'retail_sales_percent', 'retail_sales_percent_diff', 'return_1w',
       'return_1w_t-3', 'return_1w_t-4', 'return_1w_t-5', 'return_1w_t-6',
       'return_2w', 'return_52w', 'sector',
       'semiconductor_electronics_manufacturing_diff', 'sentiment',
       'sentiment_diff', 'streaming_media_consumption_diff',
       'tot_bank_credit_diff', 'vix_diff', 'weekjobclaims',
       'weekjobclaims_diff', 'wheat_diff', 'year', 'yield_curv

In [88]:
for feature in categoricals:
    data[feature] = pd.factorize(data[feature], sort=True)[0]
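
As a small illustration of what this cell does, `pd.factorize` with `sort=True` maps each distinct label to an integer code that follows the sorted order of the labels. The sample values below are an assumption chosen for the sketch:

```python
import pandas as pd

# Illustrative labels; the real column holds one sector label per row.
sectors = pd.Series(['XLY', 'IYR', 'XLE', 'IYR'])

# codes: one integer per row; uniques: sorted labels, where code i maps to uniques[i]
codes, uniques = pd.factorize(sectors, sort=True)

print(codes)
print(list(uniques))
```

Because the codes depend on the set of labels present, the mapping is only stable as long as the same universe of tickers is factorized each time.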

In [89]:
lgb_data = lgb.Dataset(data=data[features],
                       label=data[label],
                       categorical_feature=categoricals,
                       free_raw_data=False)

### Generate predictions

In [90]:
# load the stored IC results
lgb_ic = pd.read_hdf('data/model_tuning.h5', 'lgb/ic')
lgb_daily_ic = pd.read_hdf('data/model_tuning.h5', 'lgb/daily_ic')

In [91]:
# helper that returns the best parameters found during tuning for a given lookahead
def get_lgb_params(data, t=5, best=0):
    param_cols = scope_params[1:] + lgb_train_params + ['boost_rounds']
    df = data[data.lookahead==t].sort_values('ic', ascending=False).iloc[best]
    return df.loc[param_cols]
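
To make the selection logic concrete, here is a minimal sketch on a made-up results table. The function name `pick_params` and all numbers are hypothetical; the pattern (filter by lookahead, sort by IC in descending order, take the row at position `best`) mirrors the helper above:

```python
import pandas as pd

# Invented tuning results: one row per parameter configuration.
results = pd.DataFrame({
    'lookahead': [1, 1, 1, 5],
    'train_length': [464, 464, 216, 464],
    'learning_rate': [0.01, 0.3, 0.1, 0.01],
    'boost_rounds': [10, 25, 50, 10],
    'ic': [0.09, 0.02, 0.05, 0.20],
})


def pick_params(results, t, best=0,
                param_cols=('train_length', 'learning_rate', 'boost_rounds')):
    """Return the parameters of the configuration ranked `best` by IC for lookahead `t`."""
    ranked = results[results.lookahead == t].sort_values('ic', ascending=False)
    return ranked.iloc[best].loc[list(param_cols)]


top = pick_params(results, t=1, best=0)
runner_up = pick_params(results, t=1, best=1)
print(top)
print(runner_up)
```

Note that the row with the highest IC overall belongs to `lookahead == 5` and is ignored when `t=1`.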

In [92]:
# extend the out-of-sample period beyond the 1 year defined initially
years_OOS=4.9
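
The prediction loop further down converts `years_OOS` into a number of walk-forward splits with `int(YEAR * years_OOS / test_length)`. A quick sketch of that arithmetic for weekly data follows; the helper name `count_splits` is hypothetical:

```python
YEAR = 52        # weekly data: periods per year
years_OOS = 4.9  # length of the out-of-sample period in years


def count_splits(periods_per_year, years, test_length):
    """Number of test windows of `test_length` periods that fit into `years`."""
    return int(periods_per_year * years / test_length)


weekly_windows = count_splits(YEAR, years_OOS, test_length=1)
twelve_week_windows = count_splits(YEAR, years_OOS, test_length=12)
print(weekly_windows, twelve_week_windows)  # 254 21
```

The truncation by `int` means a partial window at the start of the out-of-sample period is dropped rather than rounded up.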

In [93]:
data

Unnamed: 0_level_0,Unnamed: 1_level_0,10y_real_interest_rate_diff,1y_yield_diff,CMA,HML,M2_money_supply_diff,Mkt-RF,RMW,SMB,business_inventory_diff,coffee_diff,...,streaming_media_consumption_diff,tot_bank_credit_diff,vix_diff,weekjobclaims,weekjobclaims_diff,wheat_diff,year,yield_curve,yield_curve_diff,target_1w
ticker,date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
IYR,2010-01-03,0.449884,0.04,8.319540,-11.294190,-34.5,6.173885,14.445141,14.058448,2387.0,0.711801,...,2.497,16.1658,2.21,456000.0,-12000.0,-3.333636,2010,3.79,0.02,0.032724
IYR,2010-01-10,0.000000,-0.10,9.121056,-8.040257,0.0,4.960759,16.718611,12.562260,0.0,0.000000,...,0.000,-23.2842,-3.55,469000.0,13000.0,0.000000,2010,3.78,-0.01,-0.119889
IYR,2010-01-17,0.000000,-0.04,9.121056,-8.040257,0.0,4.960759,16.718611,12.562260,0.0,0.000000,...,0.000,-23.5856,-0.22,507000.0,38000.0,0.000000,2010,3.64,-0.14,-0.795806
IYR,2010-01-24,0.000000,-0.03,9.121056,-8.040257,0.0,4.960759,16.718611,12.562260,0.0,0.000000,...,0.000,0.4094,9.40,471000.0,-36000.0,0.000000,2010,3.56,-0.08,-0.160773
IYR,2010-01-31,0.000000,0.00,9.121056,-8.040257,0.0,4.960759,16.718611,12.562260,0.0,0.000000,...,0.000,-26.3653,-2.69,496000.0,25000.0,0.000000,2010,3.55,-0.01,0.061537
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
XLY,2024-11-17,0.000000,0.02,-1.725603,1.879458,0.0,5.041495,-4.336978,1.185825,0.0,0.000000,...,0.000,26.8118,1.20,215000.0,-4000.0,0.000000,2024,-0.17,0.16,0.921709
XLY,2024-11-24,0.000000,0.08,-1.725603,1.879458,0.0,5.041495,-4.336978,1.185825,0.0,0.000000,...,0.000,-17.2742,-0.90,215000.0,0.0,0.000000,2024,-0.22,-0.05,0.809070
XLY,2024-12-01,-0.145812,-0.12,-1.725603,1.879458,0.0,5.041495,-4.336978,1.185825,0.0,0.000000,...,0.000,39.6138,-1.73,225000.0,10000.0,0.000000,2024,-0.40,-0.18,1.982427
XLY,2024-12-08,0.000000,-0.11,-1.725603,1.879458,0.0,5.041495,-4.336978,1.185825,0.0,0.000000,...,0.000,0.0000,-0.74,242000.0,17000.0,0.000000,2024,-0.27,0.13,0.308457


In [94]:
lgb_daily_ic.sort_values('ic', ascending=False)

Unnamed: 0,lookahead,train_length,test_length,learning_rate,num_leaves,feature_fraction,min_data_in_leaf,boost_rounds,ic
39,1,464,1,0.01,4,0.95,500,10,0.091843
40,1,464,1,0.01,4,0.95,500,25,0.072017
248,1,464,1,0.01,128,0.95,500,25,0.050283
572,1,464,1,0.30,8,0.30,250,10,0.047762
559,1,464,1,0.30,4,0.95,1000,10,0.044768
...,...,...,...,...,...,...,...,...,...
338,1,464,1,0.10,4,0.95,1000,10,-0.011270
172,1,464,1,0.01,32,0.95,1000,75,-0.012066
263,1,464,1,0.01,128,0.95,1000,75,-0.012066
672,1,464,1,0.30,128,0.60,500,350,-0.013267


In [95]:
params = get_lgb_params(lgb_daily_ic,
                            t=lookahead,)
                            

In [96]:
params

train_length        464.00
test_length           1.00
learning_rate         0.01
num_leaves            4.00
feature_fraction      0.95
min_data_in_leaf    500.00
boost_rounds         10.00
Name: 39, dtype: float64
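
Because the selected row is a `float64` Series, integer-valued settings such as `num_leaves` come back as `4.0`. The sketch below, using an invented parameter row, shows one way to split such a row into cross-validation settings and a LightGBM-style parameter dict with integer casts; the extra `objective` and `verbose` entries stand in for `base_params`:

```python
import pandas as pd

# Invented parameter row; mixing ints and floats yields a float64 Series.
row = pd.Series({'train_length': 464, 'num_leaves': 4,
                 'learning_rate': 0.01, 'boost_rounds': 10})

params = row.to_dict()

# Settings that steer the split and the training length are removed from the dict.
train_length = int(params.pop('train_length'))
num_boost_round = int(params.pop('boost_rounds'))

# Integer-valued model parameters are cast back from float.
params['num_leaves'] = int(params['num_leaves'])

# Fixed settings are merged in last (placeholder for base_params).
params.update({'objective': 'regression', 'verbose': -1})
print(train_length, num_boost_round, params)
```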

In [112]:
data.columns

Index(['10y_real_interest_rate_diff', '1y_yield_diff', 'CMA', 'HML',
       'M2_money_supply_diff', 'Mkt-RF', 'RMW', 'SMB',
       'business_inventory_diff', 'coffee_diff', 'consumer_price_index_diff',
       'copper_diff', 'corn_diff', 'corp_oas', 'corp_oas_diff',
       'corporate_profits_diff', 'cotton_diff', 'empleo_diff',
       'energy_price_diff', 'gold_diff', 'inflacion_diff', 'leading_diff',
       'lumber_diff', 'momentum_2', 'month', 'natural_gas_diff', 'new_homes',
       'new_homes_diff', 'oil', 'oil_diff', 'recession', 'recession_diff',
       'retail_sales_percent', 'retail_sales_percent_diff', 'return_1w',
       'return_1w_t-3', 'return_1w_t-4', 'return_1w_t-5', 'return_1w_t-6',
       'return_2w', 'return_52w', 'sector',
       'semiconductor_electronics_manufacturing_diff', 'sentiment',
       'sentiment_diff', 'streaming_media_consumption_diff',
       'tot_bank_credit_diff', 'vix_diff', 'weekjobclaims',
       'weekjobclaims_diff', 'wheat_diff', 'year', 'yield_curv

In [97]:
# loop over the 10 best parameter configurations and store the predictions of each
for position in range(10):
    params = get_lgb_params(lgb_daily_ic,
                            t=lookahead,
                            best=position)

    params = params.to_dict()  # parameters as a dictionary

    for p in ['min_data_in_leaf', 'num_leaves']:
        params[p] = int(params[p])
    train_length = int(params.pop('train_length'))  # pop 'train_length' from the parameter dict and cast it to int
    test_length = int(params.pop('test_length'))
    num_boost_round = int(params.pop('boost_rounds'))
    params.update(base_params)

    print(f'\nPosition: {position:02}')

    # out-of-sample period spanning years_OOS years
    # walk forward: retrain the model, predict the next test_length periods, then shift the window
    n_splits = int(YEAR * years_OOS / test_length)
    cv = MultipleTimeSeriesCV(n_splits=n_splits,
                              test_period_length=test_length,
                              lookahead=lookahead,
                              train_period_length=train_length)

    predictions = []
    start = time()
    for i, (train_idx, test_idx) in enumerate(cv.split(X=data), 1):
        print(i, end=' ', flush=True)

        # build the LightGBM training set for this split
        lgb_train = lgb_data.subset(used_indices=train_idx.tolist(),
                                    params=params).construct()
        # train the LightGBM model
        model = lgb.train(params=params,
                          train_set=lgb_train,
                          num_boost_round=num_boost_round,
                          )

        test_set = data.iloc[test_idx, :]
        y_test = test_set.loc[:, label].to_frame('y_test')
        # predict on the test set
        y_pred = model.predict(test_set.loc[:, model.feature_name()])
        predictions.append(y_test.assign(prediction=y_pred))

    if position == 0:
        test_predictions = (pd.concat(predictions)
                            .rename(columns={'prediction': position}))
    else:
        test_predictions[position] = pd.concat(predictions).prediction

by_day = test_predictions.groupby(level='date')  # group the predictions by date
for position in range(10):
    # on the first iteration, compute the Spearman rank correlation between
    # predictions and true labels per date and store it in `ic_by_day`
    if position == 0:
        ic_by_day = by_day.apply(lambda x: spearmanr(
            x.y_test, x[position])[0]).to_frame()
    else:
        ic_by_day[position] = by_day.apply(
            lambda x: spearmanr(x.y_test, x[position])[0])
print(ic_by_day.describe())
test_predictions.to_hdf(store, f'lgb/test/{lookahead:02}')


Position: 00
1 2 3 4 5 6 7 8 9 10 11 12 13 

14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 
Position: 01
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 
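
The per-date information coefficient computed at the end of the loop can be illustrated on a tiny made-up panel. This sketch is an assumption-laden stand-in: it uses Pearson correlation of ranks in pandas, which equals the Spearman rank correlation that `spearmanr` returns, and the tickers and values are invented:

```python
import pandas as pd

# Three tickers on two dates; outcomes and predictions are invented.
index = pd.MultiIndex.from_product(
    [['AAA', 'BBB', 'CCC'], pd.to_datetime(['2024-12-01', '2024-12-08'])],
    names=['ticker', 'date'])
toy = pd.DataFrame({
    'y_test':     [0.1, 0.3, 0.2, 0.1, 0.3, 0.2],
    'prediction': [1.0, 1.0, 2.0, 2.0, 3.0, 3.0],
}, index=index)


def rank_ic(group):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return group['y_test'].rank().corr(group['prediction'].rank())


# One IC value per date, computed across the tickers available on that date.
ic_by_date = toy.groupby(level='date').apply(rank_ic)
print(ic_by_date)
print(ic_by_date.mean())
```

On the first date the predictions order the tickers exactly like the realized returns (IC of 1.0); on the second date the ordering is partly reversed (IC of -0.5).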

# loop over the 10 best parameter configurations and print each of them
for position in range(10):
    params = get_lgb_params(lgb_daily_ic,
                            t=lookahead,
                            best=position)
    print (params)

params

## Verify that the cross-validation works

In [98]:
train_period_length = 216
test_period_length = 12
# MultipleTimeSeriesCV always starts from the end, so the validation/test period runs from the last date we pass in
# back over as many periods as n_splits defines
n_splits = int(YEAR* years_OOS/test_period_length)
lookahead =1 

cv = MultipleTimeSeriesCV(n_splits=n_splits,
                          test_period_length=test_period_length,
                          lookahead=lookahead,
                          train_period_length=train_period_length)

n_splits

In [99]:
i = 0
for train_idx, test_idx in cv.split(X=data):
    train = data.iloc[train_idx]
    train_dates = train.index.get_level_values('date')
    test = data.iloc[test_idx]
    test_dates = test.index.get_level_values('date')
    df = pd.concat([train.reset_index(), test.reset_index()])
    n = len(df)
    assert n== len(df.drop_duplicates())
    print(train.groupby(level='ticker').size().value_counts().index[0],
          train_dates.min().date(), train_dates.max().date(),
          test.groupby(level='ticker').size().value_counts().index[0],
          test_dates.min().date(), test_dates.max().date())
    i += 1
    if i == 100:
        break

216 2020-08-09 2024-09-22 12 2024-09-29 2024-12-15
216 2020-05-17 2024-06-30 12 2024-07-07 2024-09-22
216 2020-02-23 2024-04-07 12 2024-04-14 2024-06-30
216 2019-12-01 2024-01-14 12 2024-01-21 2024-04-07
216 2019-09-08 2023-10-22 12 2023-10-29 2024-01-14
216 2019-06-16 2023-07-30 12 2023-08-06 2023-10-22
216 2019-03-24 2023-05-07 12 2023-05-14 2023-07-30
216 2018-12-30 2023-02-12 12 2023-02-19 2023-05-07
216 2018-10-07 2022-11-20 12 2022-11-27 2023-02-12
216 2018-07-15 2022-08-28 12 2022-09-04 2022-11-20
216 2018-04-22 2022-06-05 12 2022-06-12 2022-08-28


216 2018-01-28 2022-03-13 12 2022-03-20 2022-06-05
216 2017-11-05 2021-12-19 12 2021-12-26 2022-03-13
216 2017-08-13 2021-09-26 12 2021-10-03 2021-12-19
216 2017-05-21 2021-07-04 12 2021-07-11 2021-09-26
216 2017-02-26 2021-04-11 12 2021-04-18 2021-07-04
216 2016-12-04 2021-01-17 12 2021-01-24 2021-04-11
216 2016-09-11 2020-10-25 12 2020-11-01 2021-01-17
216 2016-06-19 2020-08-02 12 2020-08-09 2020-10-25
216 2016-03-27 2020-05-10 12 2020-05-17 2020-08-02
216 2016-01-03 2020-02-16 12 2020-02-23 2020-05-10
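
The output above shows test windows stepping backward from the most recent date, each preceded directly by its training window. The function below is a simplified, hypothetical stand-in for that behavior, not the `MultipleTimeSeriesCV` implementation from `utils`; it works on a plain list of dates and assumes a gap of `lookahead - 1` periods between training and test data:

```python
from datetime import date, timedelta


def walk_forward_windows(dates, n_splits, train_length, test_length, lookahead=1):
    """Yield (train_dates, test_dates) pairs, newest test window first."""
    ordered = sorted(dates)
    for i in range(n_splits):
        test_end = len(ordered) - i * test_length
        test_start = test_end - test_length
        # leave lookahead - 1 periods between training and test data
        train_end = test_start - (lookahead - 1)
        train_start = train_end - train_length
        if train_start < 0:
            break  # not enough history left for a full training window
        yield ordered[train_start:train_end], ordered[test_start:test_end]


# 30 weekly dates, invented for illustration
weeks = [date(2024, 1, 7) + timedelta(weeks=k) for k in range(30)]
windows = list(walk_forward_windows(weeks, n_splits=3,
                                    train_length=10, test_length=4))

for train, test in windows:
    print(len(train), train[0], train[-1], len(test), test[0], test[-1])
```

Each printed row has the same layout as the verification output: training size and date range, followed by test size and date range.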


n_splits

In [100]:
#stop

## Generate RF predictions

### Model Configuration

In [101]:
base_params = dict(boosting='rf',
                   objective='regression',
                   random_state = 42, 
                   bagging_freq=1, 
                   verbose=-1)

#categoricals = ['year', 'month', 'sector', 'weekday']
categoricals = ['month','sector']

In [102]:
lookahead 

1

In [103]:

store = Path('data/predictions.h5')

### Get Data

In [104]:
# same data as for LightGBM
data.tail()

Unnamed: 0_level_0,Unnamed: 1_level_0,10y_real_interest_rate_diff,1y_yield_diff,CMA,HML,M2_money_supply_diff,Mkt-RF,RMW,SMB,business_inventory_diff,coffee_diff,...,streaming_media_consumption_diff,tot_bank_credit_diff,vix_diff,weekjobclaims,weekjobclaims_diff,wheat_diff,year,yield_curve,yield_curve_diff,target_1w
ticker,date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
XLY,2024-11-17,0.0,0.02,-1.725603,1.879458,0.0,5.041495,-4.336978,1.185825,0.0,0.0,...,0.0,26.8118,1.2,215000.0,-4000.0,0.0,2024,-0.17,0.16,0.921709
XLY,2024-11-24,0.0,0.08,-1.725603,1.879458,0.0,5.041495,-4.336978,1.185825,0.0,0.0,...,0.0,-17.2742,-0.9,215000.0,0.0,0.0,2024,-0.22,-0.05,0.80907
XLY,2024-12-01,-0.145812,-0.12,-1.725603,1.879458,0.0,5.041495,-4.336978,1.185825,0.0,0.0,...,0.0,39.6138,-1.73,225000.0,10000.0,0.0,2024,-0.4,-0.18,1.982427
XLY,2024-12-08,0.0,-0.11,-1.725603,1.879458,0.0,5.041495,-4.336978,1.185825,0.0,0.0,...,0.0,0.0,-0.74,242000.0,17000.0,0.0,2024,-0.27,0.13,0.308457
XLY,2024-12-15,0.0,0.02,-1.725603,1.879458,0.0,5.041495,-4.336978,1.185825,0.0,0.0,...,0.0,0.0,0.81,242000.0,0.0,0.0,2024,-0.03,0.24,0.308457


In [105]:
data.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 8591 entries, ('IYR', Timestamp('2010-01-03 00:00:00')) to ('XLY', Timestamp('2024-12-15 00:00:00'))
Data columns (total 55 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   10y_real_interest_rate_diff                   8591 non-null   float64
 1   1y_yield_diff                                 8591 non-null   float64
 2   CMA                                           8591 non-null   float64
 3   HML                                           8591 non-null   float64
 4   M2_money_supply_diff                          8591 non-null   float64
 5   Mkt-RF                                        8591 non-null   float64
 6   RMW                                           8591 non-null   float64
 7   SMB                                           8591 non-null   float64
 8   business_inventory_diff                       8591 non-null  

In [106]:
data.loc[idx[:, '2024'],:]

Unnamed: 0_level_0,Unnamed: 1_level_0,10y_real_interest_rate_diff,1y_yield_diff,CMA,HML,M2_money_supply_diff,Mkt-RF,RMW,SMB,business_inventory_diff,coffee_diff,...,streaming_media_consumption_diff,tot_bank_credit_diff,vix_diff,weekjobclaims,weekjobclaims_diff,wheat_diff,year,yield_curve,yield_curve_diff,target_1w
ticker,date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
IYR,2024-01-07,-0.000028,0.05,8.257349,-5.392769,0.5,4.604515,-11.314591,1.115127,-378.0,-6.431346,...,1.284,-8.3062,0.90,198000.0,0.0,-3.554881,2024,-1.42,0.10,0.191179
IYR,2024-01-14,0.000000,-0.19,9.055477,-6.307256,0.0,4.308705,-9.782543,4.120673,0.0,0.000000,...,0.000,19.6718,-0.65,194000.0,-4000.0,0.000000,2024,-1.49,-0.07,-0.704204
IYR,2024-01-21,0.000000,0.19,9.055477,-6.307256,0.0,4.308705,-9.782543,4.120673,0.0,0.000000,...,0.000,5.8091,0.60,221000.0,27000.0,0.000000,2024,-1.30,0.19,-0.197311
IYR,2024-01-28,0.000000,-0.06,9.055477,-6.307256,0.0,4.308705,-9.782543,4.120673,0.0,0.000000,...,0.000,24.5661,-0.04,225000.0,4000.0,0.000000,2024,-1.29,0.01,-0.274666
IYR,2024-02-04,-0.064048,0.03,9.055477,-6.307256,36.0,4.308705,-9.782543,4.120673,6785.0,4.904203,...,3.249,0.8430,0.59,213000.0,-12000.0,-6.840463,2024,-1.40,-0.11,0.015860
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
XLY,2024-11-17,0.000000,0.02,-1.725603,1.879458,0.0,5.041495,-4.336978,1.185825,0.0,0.000000,...,0.000,26.8118,1.20,215000.0,-4000.0,0.000000,2024,-0.17,0.16,0.921709
XLY,2024-11-24,0.000000,0.08,-1.725603,1.879458,0.0,5.041495,-4.336978,1.185825,0.0,0.000000,...,0.000,-17.2742,-0.90,215000.0,0.0,0.000000,2024,-0.22,-0.05,0.809070
XLY,2024-12-01,-0.145812,-0.12,-1.725603,1.879458,0.0,5.041495,-4.336978,1.185825,0.0,0.000000,...,0.000,39.6138,-1.73,225000.0,10000.0,0.000000,2024,-0.40,-0.18,1.982427
XLY,2024-12-08,0.000000,-0.11,-1.725603,1.879458,0.0,5.041495,-4.336978,1.185825,0.0,0.000000,...,0.000,0.0000,-0.74,242000.0,17000.0,0.000000,2024,-0.27,0.13,0.308457


In [107]:
data[feature]

ticker  date      
IYR     2010-01-03     0
        2010-01-10     0
        2010-01-17     0
        2010-01-24     0
        2010-01-31     0
                      ..
XLY     2024-11-17    10
        2024-11-24    10
        2024-12-01    10
        2024-12-08    10
        2024-12-15    10
Name: sector, Length: 8591, dtype: int64

In [108]:
for feature in categoricals:
    data[feature] = pd.factorize(data[feature], sort=True)[0]

In [109]:
lgb_data = lgb.Dataset(data=data[features],
                       label=data[label],
                       categorical_feature=categoricals,
                       free_raw_data=False)

### Generate predictions

In [110]:
# load the stored IC results
rf_ic = pd.read_hdf('data/model_tuning.h5', 'rf/ic')
rf_daily_ic = pd.read_hdf('data/model_tuning.h5', 'rf/daily_ic')

KeyError: 'No object named rf/ic in the file'

In [41]:
rf_daily_ic

Unnamed: 0,lookahead,train_length,test_length,bagging_fraction,feature_fraction,min_data_in_leaf,max_depth,boost_rounds,ic
0,1,52,1,0.75,0.75,74,-1,50,-0.005022
1,1,52,1,0.75,0.75,74,-1,100,0.004813
2,1,52,1,0.75,0.75,74,10,50,-0.005022
3,1,52,1,0.75,0.75,74,10,100,0.004813
4,1,52,1,0.75,0.75,100,-1,50,-0.023801
...,...,...,...,...,...,...,...,...,...
139,1,216,12,0.95,0.75,200,5,100,-0.007420
140,1,216,12,0.95,0.95,74,-1,50,-0.004407
141,1,216,12,0.95,0.95,74,-1,100,0.006598
142,1,216,12,0.95,0.95,200,10,50,-0.012932


In [42]:
rf_daily_ic['test_length']=1

In [43]:
# helper that returns the best parameters found during tuning for a given lookahead
def get_rf_params(data, t=5, best=0):
    param_cols = scope_params[1:] + rf_train_params + ['boost_rounds']
    df = data[data.lookahead==t].sort_values('ic', ascending=False).iloc[best]
    return df.loc[param_cols]

In [44]:
# extend the out-of-sample period beyond the 1 year defined initially
#years_OOS=1

In [45]:
#for par las 10 mejores configuracones de paramentros de las cuales almacenaremos sus predicciones
for position in range(10):
    params = get_rf_params(rf_daily_ic,
                            t=lookahead,
                            best=position)

    params = params.to_dict()#parametros a diccionario

    for p in ['min_data_in_leaf','max_depth']:
        params[p] = int(params[p])
    train_length = int(params.pop('train_length')) # Extrae y elimina el parámetro 'train_length' del diccionario de parámetros y lo convierte a un entero
    test_length = int(params.pop('test_length'))
    num_boost_round = int(params.pop('boost_rounds'))
    params.update(base_params)

    print(f'\nPosition: {position:02}')

    # 1-year out-of-sample period
    #vamos a ir haciendo el walk forward con periodos de test de un mes, moveremos el modelo para volver a entrenar y predeciremos el siguiente mes
    n_splits = int(YEAR * years_OOS / test_length)
    cv = MultipleTimeSeriesCV(n_splits=n_splits,
                              test_period_length=test_length,
                              lookahead=lookahead,
                              train_period_length=train_length)

    predictions = []
    start = time()
    for i, (train_idx, test_idx) in enumerate(cv.split(X=data), 1):
        print(i, end=' ', flush=True)
        
        # Crea un conjunto de datos de entrenamiento para LightGBM
        lgb_train = lgb_data.subset(used_indices=train_idx.tolist(),
                                    params=params).construct()
         # Entrena el modelo LightGBM
        model = lgb.train(params=params,
                          train_set=lgb_train,
                          num_boost_round=num_boost_round,
                          verbose_eval=False)

        test_set = data.iloc[test_idx, :]
        y_test = test_set.loc[:, label].to_frame('y_test')
        # Realiza predicciones en el conjunto de datos de prueba
        y_pred = model.predict(test_set.loc[:, model.feature_name()])
        predictions.append(y_test.assign(prediction=y_pred))
        #if position == 0:
        #    break
    #if position == 0:
    #    break
    if position == 0:
        test_predictions = (pd.concat(predictions)
                            .rename(columns={'prediction': position}))
    else:
        test_predictions[position] = pd.concat(predictions).prediction

by_day = test_predictions.groupby(level='date')  # Group the predictions by date
for position in range(10):
    # On the first iteration, compute the Spearman rank correlation between the
    # predictions and the true labels and store it in `ic_by_day`
    if position == 0:
        ic_by_day = by_day.apply(lambda x: spearmanr(
            x.y_test, x[position])[0]).to_frame()
    else:
        ic_by_day[position] = by_day.apply(
            lambda x: spearmanr(x.y_test, x[position])[0])
print(ic_by_day.describe())
test_predictions.to_hdf(store, f'rf/test/{lookahead:02}')


Position: 00
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 
Position: 01
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1

In [46]:
 n_splits = int(YEAR * years_OOS / test_length)

In [47]:
test_length

1

In [48]:
ic_by_day

date,0,1,2,3,4,5,6,7,8,9
2019-10-13,0.369652,,-0.200000,0.182233,0.172727,0.433221,-0.191344,0.433221,0.109091,0.109091
2019-10-20,0.487443,,0.487443,0.481818,-0.081818,0.482917,0.324130,0.482917,0.251144,0.251144
2019-10-27,0.115822,-0.087039,0.506951,0.533031,-0.490909,-0.127273,0.045980,-0.127273,0.321114,0.321114
2019-11-03,,,,0.090909,0.609091,0.490909,-0.009091,0.490909,0.109091,0.109091
2019-11-10,0.836660,0.500000,0.500000,0.045455,0.311940,-0.172727,0.286039,-0.018182,0.434750,0.434750
...,...,...,...,...,...,...,...,...,...,...
2024-09-01,0.280484,,0.064613,-0.036364,-0.082005,-0.200000,-0.433211,-0.200000,-0.092391,-0.092391
2024-09-08,0.349448,0.004709,0.291947,0.349448,0.273975,0.291947,0.577350,0.291947,0.447637,0.447637
2024-09-15,0.731229,-0.692820,0.539360,0.619751,0.231521,0.539360,0.500000,0.539360,0.630960,0.630960
2024-09-22,0.014338,,-0.041872,0.105266,0.050344,0.045558,,0.045558,-0.248195,-0.248195
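
The per-date information coefficient shown above can also be computed for all prediction columns in a single pass. The following is a minimal sketch on toy data, not the notebook's own code: the helper name `daily_ic` and the toy frame are hypothetical, and it assumes a `(ticker, date)` MultiIndex with a `y_test` label column as in `test_predictions`. Note that `spearmanr` returns `nan` when one of its inputs is constant, which is one way `NaN` entries like those above can arise.

```python
import pandas as pd
from scipy.stats import spearmanr


def daily_ic(predictions: pd.DataFrame, label: str = 'y_test') -> pd.DataFrame:
    """Spearman rank correlation per date between the label and each prediction column.

    Hypothetical helper: assumes an index level named 'date'.
    """
    model_cols = [c for c in predictions.columns if c != label]

    def ic(group: pd.DataFrame) -> pd.Series:
        # One correlation per prediction column for this date
        return pd.Series({c: spearmanr(group[label], group[c])[0]
                          for c in model_cols})

    return predictions.groupby(level='date').apply(ic)


# Toy data: two weekly dates, four tickers, two prediction columns (0 and 1)
dates = pd.to_datetime(['2024-01-07', '2024-01-14'])
tickers = ['A', 'B', 'C', 'D']
idx = pd.MultiIndex.from_product([dates, tickers],
                                 names=['date', 'ticker']).swaplevel()
toy = pd.DataFrame({'y_test': [1, 2, 3, 4, 1, 2, 3, 4],
                    0: [1, 2, 3, 4, 2, 1, 4, 3],
                    1: [4, 3, 2, 1, 1, 3, 2, 4]},
                   index=idx, dtype=float)

ic_toy = daily_ic(toy)
print(ic_toy)
```

Calling `ic_toy.describe()` then summarizes the distribution of daily ICs per configuration, as `ic_by_day.describe()` does in the cell above.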


In [49]:
 f'rf/test/{lookahead:02}'

'rf/test/01'

In [50]:
test_set.loc['XLU', model.feature_name()]

date,1y_yield,1y_yield_diff,CMA,HML,Mkt-RF,RMW,SMB,corp_oas,corp_oas_diff,empleo_diff,...,sentiment_diff,us_asset_balance_diff,vix,vix_diff,vixoil,vixoil_diff,weekjobclaims,weekjobclaims_diff,yield_curve,yield_curve_diff
2019-10-13,1.67,0.09,0.206604,0.074321,-0.038489,-0.229842,-0.123149,1.22,-0.04,0.0,...,0.0,4355.0,15.58,-1.46,39.65,0.34,215000.0,6000.0,0.08,0.27


In [51]:
num_boost_round

50

In [52]:
params

{'bagging_fraction': 0.75,
 'feature_fraction': 0.75,
 'min_data_in_leaf': 200,
 'max_depth': 10,
 'boosting': 'rf',
 'objective': 'regression',
 'random_state': 42,
 'bagging_freq': 1,
 'verbose': -1}

In [53]:
 model.feature_name()

['1y_yield',
 '1y_yield_diff',
 'CMA',
 'HML',
 'Mkt-RF',
 'RMW',
 'SMB',
 'corp_oas',
 'corp_oas_diff',
 'empleo_diff',
 'eu_hy_oas',
 'eu_hy_oas_diff',
 'hy_oas',
 'hy_oas_diff',
 'inflacion',
 'inflacion_diff',
 'leading',
 'leading_diff',
 'momentum_12',
 'momentum_2',
 'momentum_3',
 'momentum_3_12',
 'momentum_52',
 'momentum_6',
 'month',
 'oil',
 'oil_diff',
 'real_gdp',
 'real_gdp_diff',
 'recession',
 'recession_diff',
 'retail_sales',
 'retail_sales_diff',
 'retail_sales_percent',
 'retail_sales_percent_diff',
 'return_12m',
 'return_1m',
 'return_1m_t-1',
 'return_1m_t-2',
 'return_1m_t-3',
 'return_1m_t-4',
 'return_1m_t-5',
 'return_1m_t-6',
 'return_2m',
 'return_3m',
 'return_52m',
 'return_6m',
 'sector',
 'sentiment',
 'sentiment_diff',
 'us_asset_balance_diff',
 'vix',
 'vix_diff',
 'vixoil',
 'vixoil_diff',
 'weekjobclaims',
 'weekjobclaims_diff',
 'yield_curve',
 'yield_curve_diff']

In [54]:
# Loop over the 10 best parameter configurations, whose predictions we will store
for position in range(10):
    params = get_rf_params(rf_daily_ic,
                           t=lookahead,
                           best=position)
    print(params)

train_length        216.00
test_length           1.00
bagging_fraction      0.75
feature_fraction      0.75
min_data_in_leaf    100.00
max_depth             5.00
boost_rounds         50.00
Name: 76, dtype: float64
train_length        216.00
test_length           1.00
bagging_fraction      0.95
feature_fraction      0.75
min_data_in_leaf    100.00
max_depth             5.00
boost_rounds         50.00
Name: 134, dtype: float64
train_length        216.00
test_length           1.00
bagging_fraction      0.75
feature_fraction      0.95
min_data_in_leaf    100.00
max_depth             5.00
boost_rounds         50.00
Name: 124, dtype: float64
train_length        216.00
test_length           1.00
bagging_fraction      0.75
feature_fraction      0.75
min_data_in_leaf    100.00
max_depth            -1.00
boost_rounds         50.00
Name: 110, dtype: float64
train_length        216.00
test_length           1.00
bagging_fraction      0.95
feature_fraction      0.75
min_data_in_leaf     74.00
max_de

In [55]:
params

train_length        216.00
test_length           1.00
bagging_fraction      0.75
feature_fraction      0.75
min_data_in_leaf    200.00
max_depth            10.00
boost_rounds         50.00
Name: 118, dtype: float64

## Verifying that the cross-validation works

In [56]:
train_period_length = 216
test_period_length = 4
# MultipleTimeSeriesCV always starts from the end, so the validation/test period runs backwards
# from the last date we pass it, covering the number of years implied by n_splits
n_splits = int(YEAR * years_OOS / test_period_length)
lookahead = 2

cv = MultipleTimeSeriesCV(n_splits=n_splits,
                          test_period_length=test_period_length,
                          lookahead=lookahead,
                          train_period_length=train_period_length)

In [57]:
n_splits

65

In [58]:
i = 0
for train_idx, test_idx in cv.split(X=data):
    train = data.iloc[train_idx]
    train_dates = train.index.get_level_values('date')
    test = data.iloc[test_idx]
    test_dates = test.index.get_level_values('date')
    df = pd.concat([train.reset_index(), test.reset_index()])  # DataFrame.append was removed in pandas 2.0
    n = len(df)
    assert n == len(df.drop_duplicates())  # train and test rows must not overlap
    print(train.groupby(level='ticker').size().value_counts().index[0],
          train_dates.min().date(), train_dates.max().date(),
          test.groupby(level='ticker').size().value_counts().index[0],
          test_dates.min().date(), test_dates.max().date())
    i += 1
    if i == 100:
        break

217 2020-07-05 2024-08-25 4 2024-09-08 2024-09-29
217 2020-06-07 2024-07-28 4 2024-08-11 2024-09-01
217 2020-05-10 2024-06-30 4 2024-07-14 2024-08-04
217 2020-04-12 2024-06-02 4 2024-06-16 2024-07-07
217 2020-03-15 2024-05-05 4 2024-05-19 2024-06-09
217 2020-02-16 2024-04-07 4 2024-04-21 2024-05-12
217 2020-01-19 2024-03-10 4 2024-03-24 2024-04-14
217 2019-12-22 2024-02-11 4 2024-02-25 2024-03-17
217 2019-11-24 2024-01-14 4 2024-01-28 2024-02-18
217 2019-10-27 2023-12-17 4 2023-12-31 2024-01-21
217 2019-09-29 2023-11-19 4 2023-12-03 2023-12-24
217 2019-09-01 2023-10-22 4 2023-11-05 2023-11-26
217 2019-08-04 2023-09-24 4 2023-10-08 2023-10-29
217 2019-07-07 2023-08-27 4 2023-09-10 2023-10-01
217 2019-06-09 2023-07-30 4 2023-08-13 2023-09-03
217 2019-05-12 2023-07-02 4 2023-07-16 2023-08-06
217 2019-04-14 2023-06-04 4 2023-06-18 2023-07-09
217 2019-03-17 2023-05-07 4 2023-05-21 2023-06-11
217 2019-02-17 2023-04-09 4 2023-04-23 2023-05-14
217 2019-01-20 2023-03-12 4 2023-03-26 2023-04-16
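
The windows printed above walk backwards from the last date, leave a gap of `lookahead - 1` periods between train and test, and contain `train_period_length + lookahead - 1` training dates. The sketch below reproduces that pattern on synthetic weekly dates. It is a simplified illustration under stated assumptions, not the actual `MultipleTimeSeriesCV` code: the function name `backward_windows` and the synthetic 300-week calendar are hypothetical.

```python
import pandas as pd


def backward_windows(dates, n_splits, train_length, test_length, lookahead):
    """Yield (train_dates, test_dates) pairs, walking backwards from the last date.

    Hypothetical helper mimicking the window pattern seen in the output above.
    """
    days = sorted(pd.DatetimeIndex(dates).unique(), reverse=True)  # newest first
    for i in range(n_splits):
        test_end = i * test_length
        test_start = test_end + test_length
        train_end = test_start + lookahead - 1          # skip lookahead - 1 periods
        train_start = train_end + train_length + lookahead - 1
        if train_start > len(days):
            break  # not enough history left for a full training window
        test = days[test_end:test_start][::-1]          # back to chronological order
        train = days[train_end:train_start][::-1]
        yield train, test


# Synthetic weekly calendar (Sundays) ending on the last date seen in the output
weekly = pd.date_range(end='2024-09-29', periods=300, freq='W-SUN')
windows = list(backward_windows(weekly, n_splits=65, train_length=216,
                                test_length=4, lookahead=2))

first_train, first_test = windows[0]
print(len(first_train), first_train[0].date(), first_train[-1].date(),
      len(first_test), first_test[0].date(), first_test[-1].date())
```

With these settings the first window matches the first row of the output above: 217 training dates from 2020-07-05 to 2024-08-25, followed by four test dates from 2024-09-08 to 2024-09-29.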


In [59]:
n_splits

65

In [60]:
years_OOS

5
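
The two `n_splits` values seen in this section (260 progress ticks per position with `test_length = 1`, and 65 with `test_period_length = 4`) follow from the same formula. The sketch below checks the arithmetic, assuming `YEAR = 52` weekly periods; that constant is defined outside this section, so its value here is inferred from the outputs rather than taken from the source, and the helper name is hypothetical.

```python
def n_walk_forward_splits(periods_per_year: int, years_oos: int, test_length: int) -> int:
    """Number of test windows needed to cover the out-of-sample period."""
    return int(periods_per_year * years_oos / test_length)


YEAR = 52       # assumption: weekly data, inferred from the outputs above
years_OOS = 5   # as printed in the last cell

splits_weekly = n_walk_forward_splits(YEAR, years_OOS, test_length=1)
splits_monthly = n_walk_forward_splits(YEAR, years_OOS, test_length=4)
print(splits_weekly, splits_monthly)
```

Shorter test windows mean more frequent retraining: one model per week in the prediction loop versus one per four weeks in the cross-validation check.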