# Long-Short Strategy, Part 5: Generating out-of-sample predictions

In this section, we'll start designing, implementing, and evaluating a trading strategy for US equities driven by daily return forecasts produced by gradient boosting models.

As in the previous examples, we'll lay out a framework and build a specific example that you can adapt to run your own experiments. There are numerous aspects that you can vary, from the asset class and investment universe to more granular aspects like the features, holding period, or trading rules. See, for example, the **Alpha Factor Library** in the [Appendix](../24_alpha_factor_library) for numerous additional features.

We'll keep the trading strategy simple and only use a single ML signal; a real-life application will likely use multiple signals from different sources, such as complementary ML models trained on different datasets or with different lookahead or lookback periods. It would also use sophisticated risk management, from simple stop-loss to value-at-risk analysis.

**Six notebooks** cover our workflow sequence:

1. [preparing_the_model_data](04_preparing_the_model_data.ipynb): we engineer a few simple features from the Quandl Wiki data.
2. [trading_signals_with_lightgbm_and_catboost](05_trading_signals_with_lightgbm_and_catboost.ipynb): we tune hyperparameters for LightGBM and CatBoost to select a model, using 2015/16 as our validation period. 
3. [evaluate_trading_signals](06_evaluate_trading_signals.ipynb): we compare the cross-validation performance using various metrics to select the best model.
4. [model_interpretation](07_model_interpretation.ipynb): we take a closer look at the drivers behind the best model's predictions.
5. `making_out_of_sample_predictions` (this notebook): we predict returns for our out-of-sample period 2019-2023.
6. [backtesting_with_zipline](09_backtesting_with_zipline.ipynb): evaluate the historical performance of a long-short strategy based on our predictive signals using Zipline.

## Imports & Settings

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [37]:
%matplotlib inline

from time import time
import sys
import os
from pathlib import Path

import pandas as pd
from scipy.stats import spearmanr

import lightgbm as lgb
import seaborn as sns
import numpy as np
from tqdm import tqdm

sys.path.insert(1, os.path.join(sys.path[0], '..'))
from utils import MultipleTimeSeriesCV

sns.set_style('whitegrid')


np.random.seed(42)



DATA_STORE = Path('data_normalized/assets.h5')
DATA_STORE_ITEM = 'engineered_features_trimmed'
RESULTS_PATH = Path('results_normalized_trimmed/us_stocks')

PREDICTIONS_STORE = Path('data_normalized/predictions_normalized_trimmed.h5')
YEAR = 52  # weekly periods per year

idx = pd.IndexSlice

In [38]:
scope_params = ['lookahead', 'train_length', 'test_length']
daily_ic_metrics = ['daily_ic_mean', 'daily_ic_mean_n', 'daily_ic_median', 'daily_ic_median_n']
lgb_train_params = ['learning_rate', 'num_leaves', 'feature_fraction', 'min_data_in_leaf']


## Generate LightGBM predictions

### Model Configuration

In [39]:
base_params = dict(boosting='gbdt',
                   objective='regression',
                   random_state=42,
                   verbose=-1)

categoricals = ['sector']
# categoricals = []  # 'month', 'sector', 'year'

In [40]:
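# forecast horizon in weekly periods (the original note said 'two weeks',
# but lookahead = 1 pairs with the 'target_1w' label used below)
lookahead = 1
# not deleted beforehand because step 5 has already done it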

### Get Data

In [41]:
data = pd.read_hdf(DATA_STORE, DATA_STORE_ITEM).sort_index()  # modified

In [42]:
labels = sorted(data.filter(like='target').columns)
features = data.columns.difference(labels).tolist()
label = 'target_1w'
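
To make the column split in the cell above concrete, here is a minimal sketch on a toy frame. The column names `feat_a`, `feat_b`, and `target_2w` are hypothetical and only illustrate how `filter(like='target')` and `Index.difference` separate labels from features.

```python
import pandas as pd

# Toy frame; every column name except 'target_1w' is hypothetical.
toy = pd.DataFrame(columns=['feat_b', 'target_2w', 'feat_a', 'target_1w'])

# Every column whose name contains 'target' is treated as a label.
toy_labels = sorted(toy.filter(like='target').columns)

# Index.difference returns the remaining column names in sorted order.
toy_features = toy.columns.difference(toy_labels).tolist()

print(toy_labels)
print(toy_features)
```

Because all `target` columns are excluded from `features`, no alternative label can leak into the model inputs, even though only one label is used for training.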

In [43]:
# Find the columns with at least one NaN value from 2024 onward
nan_cols = data.loc[idx[:, '2024':], features + [label]].isna().any(axis=0)

print(nan_cols[nan_cols])


target_1w    True
dtype: bool


In [44]:
# Forward-fill with the previous period's values so the most recent row is not NaN.
# Note: a plain ffill also crosses ticker boundaries and fills the target column.
data = data.ffill()


In [45]:
# keep data from 2010 onward
data = data.loc[idx[:, '2010':], features + [label]].dropna()
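
A plain forward fill on a frame indexed by (ticker, date) can carry the last value of one ticker into a leading gap of the next. The sketch below contrasts it with a per-ticker fill on made-up data; the tickers `AAA` and `BBB` are hypothetical, and whether the per-ticker variant is the better choice for this dataset is an assumption to verify.

```python
import numpy as np
import pandas as pd

# Two hypothetical tickers with two weekly observations each.
index = pd.MultiIndex.from_product(
    [['AAA', 'BBB'], pd.to_datetime(['2024-12-22', '2024-12-29'])],
    names=['ticker', 'date'])
toy = pd.DataFrame({'target_1w': [0.5, np.nan, np.nan, 1.5]}, index=index)

# Global fill: the first BBB row inherits the last AAA value.
global_fill = toy.ffill()

# Per-ticker fill: gaps are only filled from the same ticker's history.
per_ticker_fill = toy.groupby(level='ticker').ffill()

print(global_fill)
print(per_ticker_fill)
```

In the per-ticker result the first `BBB` row stays NaN, so a later `dropna()` would remove it instead of training on a value borrowed from another asset.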

In [46]:
data.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 8613 entries, ('IYR', Timestamp('2010-01-03 00:00:00')) to ('XLY', Timestamp('2024-12-29 00:00:00'))
Data columns (total 55 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   10y_real_interest_rate_diff                   8613 non-null   float64
 1   1y_yield_diff                                 8613 non-null   float64
 2   CMA                                           8613 non-null   float64
 3   HML                                           8613 non-null   float64
 4   M2_money_supply_diff                          8613 non-null   float64
 5   Mkt-RF                                        8613 non-null   float64
 6   RMW                                           8613 non-null   float64
 7   SMB                                           8613 non-null   float64
 8   business_inventory_diff                       8613 non-null  

In [47]:
data.tail()

ticker,date,10y_real_interest_rate_diff,1y_yield_diff,CMA,HML,M2_money_supply_diff,Mkt-RF,RMW,SMB,business_inventory_diff,coffee_diff,...,streaming_media_consumption_diff,tot_bank_credit_diff,vix_diff,weekjobclaims,weekjobclaims_diff,wheat_diff,year,yield_curve,yield_curve_diff,target_1w
XLY,2024-12-01,-0.145812,-0.12,-22.30128,38.24644,0.0,-28.444476,-9.326239,-33.007563,0.0,0.0,...,0.0,40.42,-1.73,225000.0,10000.0,0.0,2024,-0.4,-0.18,1.982427
XLY,2024-12-08,0.0,-0.11,-22.30128,38.24644,0.0,-28.444476,-9.326239,-33.007563,0.0,0.0,...,0.0,-7.2052,-0.74,242000.0,17000.0,0.0,2024,-0.27,0.13,0.47145
XLY,2024-12-15,0.0,0.05,-22.30128,38.24644,0.0,-28.444476,-9.326239,-33.007563,0.0,0.0,...,0.0,30.0419,1.04,220000.0,-22000.0,0.0,2024,0.06,0.33,-1.155353
XLY,2024-12-22,0.0,0.03,-22.30128,38.24644,0.0,-28.444476,-9.326239,-33.007563,0.0,0.0,...,0.0,0.0,4.55,219000.0,-1000.0,0.0,2024,0.18,0.12,0.83212
XLY,2024-12-29,0.0,-0.03,-22.30128,38.24644,0.0,-28.444476,-9.326239,-33.007563,0.0,0.0,...,0.0,0.0,-4.09,219000.0,0.0,0.0,2024,0.23,0.05,0.83212


In [48]:
for feature in categoricals:
    data[feature] = pd.factorize(data[feature], sort=True)[0]
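
As a small illustration of the encoding step above, the following sketch applies `pd.factorize` with `sort=True` to a handful of made-up sector labels (the label strings are hypothetical; the actual contents of the `sector` column are not shown in this notebook).

```python
import pandas as pd

# Hypothetical sector labels.
sectors = pd.Series(['Tech', 'Energy', 'Tech', 'Real Estate'])

# sort=True assigns integer codes in the sorted order of the unique values,
# so the mapping does not depend on the order in which labels first appear.
codes, uniques = pd.factorize(sectors, sort=True)

print(codes.tolist())
print(list(uniques))
```

LightGBM treats the resulting non-negative integer codes as category identifiers once the column is listed in `categorical_feature`.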

In [49]:
lgb_data = lgb.Dataset(data=data[features],
                       label=data[label],
                       categorical_feature=categoricals,
                       free_raw_data=False)

### Generate predictions

In [50]:
def get_lgb_daily_ic(results_path):
    """Collect the daily IC of every tuned LightGBM configuration and average it over dates."""
    int_cols = ["lookahead", "train_length", "test_length", "boost_rounds"]

    lgb_ic = []
    with pd.HDFStore(results_path / "tuning_lgb.h5") as store:
        keys = [k[1:] for k in store.keys()]  # drop the leading '/'
        for key in keys:
            if not key.startswith("daily_ic"):
                continue
            # keys start with daily_ic/<lookahead>/<train_length>/<test_length>
            _, t, train_length, test_length = key.split("/")[:4]
            df = (
                store[key]
                .drop(["boosting", "objective", "verbose"], axis=1)
                .assign(
                    lookahead=t, train_length=train_length, test_length=test_length
                )
            )
            lgb_ic.append(df)
        lgb_ic = pd.concat(lgb_ic).reset_index()

    # reshape from one column per boosting-round count to long format
    id_vars = ["date"] + scope_params + lgb_train_params
    lgb_ic = pd.melt(
        lgb_ic, id_vars=id_vars, value_name="ic", var_name="boost_rounds"
    ).dropna()
    lgb_ic[int_cols] = lgb_ic[int_cols].astype(int)

    # mean daily IC per configuration
    lgb_daily_ic = (
        lgb_ic.groupby(id_vars[1:] + ["boost_rounds"])
        .ic.mean()
        .to_frame("ic")
        .reset_index()
    )
    return lgb_daily_ic

lgb_daily_ic = get_lgb_daily_ic(RESULTS_PATH)
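
The reshaping inside `get_lgb_daily_ic` can be sketched on a toy frame. The assumption here, inferred from the `pd.melt` call, is that the stored tuning results hold one IC column per number of boosting rounds; the dates and IC values below are made up.

```python
import pandas as pd

# Hypothetical wide-format tuning result: one IC column per boosting-round count.
wide = pd.DataFrame({
    'date': pd.to_datetime(['2018-01-07', '2018-01-14']),
    'learning_rate': [0.3, 0.3],
    50: [0.02, -0.01],
    75: [0.04, 0.00],
})

# Long format: one row per (date, configuration, boost_rounds).
long = pd.melt(wide, id_vars=['date', 'learning_rate'],
               value_name='ic', var_name='boost_rounds')

# Average the daily IC per configuration, dropping the date dimension.
mean_ic = long.groupby(['learning_rate', 'boost_rounds']).ic.mean()
print(mean_ic)
```

Treating the number of boosting rounds as just another column of the long table lets it be ranked together with the other hyperparameters.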

In [51]:
lgb_daily_ic = lgb_daily_ic.loc[lgb_daily_ic.boost_rounds >= 50]

In [19]:
# load the stored ICs (alternative source; the file was missing in this run, so the values computed above are kept)
lgb_ic = pd.read_hdf('data/model_tuning.h5', 'lgb/ic')
lgb_daily_ic = pd.read_hdf('data/model_tuning.h5', 'lgb/daily_ic')

FileNotFoundError: File data/model_tuning.h5 does not exist

In [52]:
data

ticker,date,10y_real_interest_rate_diff,1y_yield_diff,CMA,HML,M2_money_supply_diff,Mkt-RF,RMW,SMB,business_inventory_diff,coffee_diff,...,streaming_media_consumption_diff,tot_bank_credit_diff,vix_diff,weekjobclaims,weekjobclaims_diff,wheat_diff,year,yield_curve,yield_curve_diff,target_1w
IYR,2010-01-03,0.449884,0.04,-219.417481,164.443144,-34.5,-110.922122,311.520175,75.099135,2387.0,0.711801,...,2.497,16.1658,2.21,456000.0,-12000.0,-3.333636,2010,3.79,0.02,0.032729
IYR,2010-01-10,0.000000,-0.10,-221.084566,159.558976,0.0,-111.246166,298.548370,77.617125,0.0,0.000000,...,0.000,-23.2842,-3.55,469000.0,13000.0,0.000000,2010,3.78,-0.01,-0.119910
IYR,2010-01-17,0.000000,-0.04,-221.084566,159.558976,0.0,-111.246166,298.548370,77.617125,0.0,0.000000,...,0.000,-23.5856,-0.22,507000.0,38000.0,0.000000,2010,3.64,-0.14,-0.795944
IYR,2010-01-24,0.000000,-0.03,-221.084566,159.558976,0.0,-111.246166,298.548370,77.617125,0.0,0.000000,...,0.000,0.4094,9.40,471000.0,-36000.0,0.000000,2010,3.56,-0.08,-0.160801
IYR,2010-01-31,0.000000,0.00,-221.084566,159.558976,0.0,-111.246166,298.548370,77.617125,0.0,0.000000,...,0.000,-26.3653,-2.69,496000.0,25000.0,0.000000,2010,3.55,-0.01,0.061548
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
XLY,2024-12-01,-0.145812,-0.12,-22.301280,38.246440,0.0,-28.444476,-9.326239,-33.007563,0.0,0.000000,...,0.000,40.4200,-1.73,225000.0,10000.0,0.000000,2024,-0.40,-0.18,1.982427
XLY,2024-12-08,0.000000,-0.11,-22.301280,38.246440,0.0,-28.444476,-9.326239,-33.007563,0.0,0.000000,...,0.000,-7.2052,-0.74,242000.0,17000.0,0.000000,2024,-0.27,0.13,0.471450
XLY,2024-12-15,0.000000,0.05,-22.301280,38.246440,0.0,-28.444476,-9.326239,-33.007563,0.0,0.000000,...,0.000,30.0419,1.04,220000.0,-22000.0,0.000000,2024,0.06,0.33,-1.155353
XLY,2024-12-22,0.000000,0.03,-22.301280,38.246440,0.0,-28.444476,-9.326239,-33.007563,0.0,0.000000,...,0.000,0.0000,4.55,219000.0,-1000.0,0.000000,2024,0.18,0.12,0.832120


In [53]:
# return the parameters of a top-ranked tuning configuration for a given lookahead (best=0 is the highest IC)
def get_lgb_params(data, t=5, best=0):
    param_cols = scope_params[1:] + lgb_train_params + ['boost_rounds']
    df = data[data.lookahead==t].sort_values('ic', ascending=False).iloc[best]
    return df.loc[param_cols]
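
The selection logic of `get_lgb_params` can be illustrated on a toy table. The helper name `pick` and all values below are hypothetical; the sketch only mirrors the filter, sort, and `iloc` steps of the function above.

```python
import pandas as pd

# Hypothetical tuning summary: one row per configuration with its mean IC.
toy_ic = pd.DataFrame({
    'lookahead': [1, 1, 1, 5],
    'num_leaves': [4, 8, 16, 4],
    'boost_rounds': [75, 100, 50, 75],
    'ic': [0.01, 0.03, 0.02, 0.09],
})

def pick(data, t, best=0):
    """Return the configuration ranked `best` by IC for lookahead `t`."""
    ranked = data[data.lookahead == t].sort_values('ic', ascending=False)
    return ranked.iloc[best][['num_leaves', 'boost_rounds']]

top = pick(toy_ic, t=1, best=0)
runner_up = pick(toy_ic, t=1, best=1)
print(top)
print(runner_up)
```

Selecting a single row of a frame that mixes integer and float columns yields a float Series, which is why integer-valued parameters such as `num_leaves` need to be cast back with `int()` before they are passed to LightGBM.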

In [54]:
# extend the out-of-sample window beyond the 1 year defined initially
years_OOS = 4.9
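
As a quick check of what this setting implies, the number of walk-forward splits follows from the formula used in the prediction loop, `int(YEAR * years_OOS / test_length)`. The sketch assumes `test_length = 1`, the value of the best configuration in this run.

```python
YEAR = 52          # weekly periods per year, as in the settings cell
years_OOS = 4.9    # out-of-sample span in years
test_length = 1    # weeks per test window (assumed; taken from the best configuration)

# Each split predicts one test window, so this is the number of refits per model.
n_splits = int(YEAR * years_OOS / test_length)
print(n_splits)  # 254
```

This matches the total of 254 iterations reported by the progress bars in the prediction loop.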

In [55]:
params = get_lgb_params(lgb_daily_ic, t=lookahead)

In [56]:
params

train_length        464.0
test_length           1.0
learning_rate         0.3
num_leaves            4.0
feature_fraction      0.3
min_data_in_leaf    250.0
boost_rounds         75.0
Name: 497, dtype: float64
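
The prediction loop in the following cell walks forward through time with `MultipleTimeSeriesCV` from the repository's `utils` module, whose source is not shown here. As a rough mental model only, the sketch below uses a hypothetical stand-in, `walk_forward_windows`, that assumes windows are counted back from the most recent date and that `lookahead - 1` periods separate the training and test windows. It is not the library's implementation.

```python
def walk_forward_windows(dates, n_splits, train_length, test_length, lookahead=1):
    """Yield (train_dates, test_dates) pairs, most recent test window first.

    Hypothetical simplification of a walk-forward splitter.
    """
    dates = sorted(dates)
    for i in range(n_splits):
        test_stop = len(dates) - i * test_length
        test_start = test_stop - test_length
        # Leave (lookahead - 1) periods between train and test so that
        # training labels do not overlap the test window.
        train_stop = test_start - (lookahead - 1)
        train_start = train_stop - train_length
        if train_start < 0:
            break  # not enough history left for another split
        yield dates[train_start:train_stop], dates[test_start:test_stop]

# Ten consecutive weeks stand in for the real weekly dates.
weeks = list(range(1, 11))
splits = list(walk_forward_windows(weeks, n_splits=3, train_length=4, test_length=1))
for train, test in splits:
    print(train, test)
```

Each split refits the model on a window that ends just before the period it predicts, so every prediction uses only information available at that time.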

In [57]:
# loop over the 10 best parameter configurations and store the predictions of each
for position in range(10):
    params = get_lgb_params(lgb_daily_ic,
                            t=lookahead,
                            best=position)

    params = params.to_dict()  # convert the parameters to a dictionary

    for p in ['min_data_in_leaf', 'num_leaves']:
        params[p] = int(params[p])
    # pop 'train_length' from the parameter dictionary and convert it to an integer
    train_length = int(params.pop('train_length'))
    test_length = int(params.pop('test_length'))
    num_boost_round = int(params.pop('boost_rounds'))
    params.update(base_params)

    print(f'\nPosition: {position:02}')

    # out-of-sample period of `years_OOS` years
    # walk forward with test periods of `test_length` weeks: retrain on the shifted window, then predict the next test period
    n_splits = int(YEAR * years_OOS / test_length)
    cv = MultipleTimeSeriesCV(n_splits=n_splits,
                              test_period_length=test_length,
                              lookahead=lookahead,
                              train_period_length=train_length)

    predictions = []
    start = time()
    for i, (train_idx, test_idx) in tqdm(enumerate(cv.split(X=data), 1), total=n_splits):
        # print(i, end=' ', flush=True)

        # build the LightGBM training set for this split
        lgb_train = lgb_data.subset(used_indices=train_idx.tolist(),
                                    params=params).construct()
        # train the LightGBM model
        model = lgb.train(params=params,
                          train_set=lgb_train,
                          num_boost_round=num_boost_round,
                        )

        test_set = data.iloc[test_idx, :]
        y_test = test_set.loc[:, label].to_frame('y_test')
        # predict on the test set
        y_pred = model.predict(test_set.loc[:, model.feature_name()])
        predictions.append(y_test.assign(prediction=y_pred))

    if position == 0:
        test_predictions = (pd.concat(predictions)
                            .rename(columns={'prediction': position}))
    else:
        test_predictions[position] = pd.concat(predictions).prediction

by_day = test_predictions.groupby(level='date')  # group the predictions by date
for position in range(10):
    # daily Spearman rank correlation between predictions and realized labels;
    # the first configuration creates `ic_by_day`, the others add a column to it
    if position == 0:
        ic_by_day = by_day.apply(lambda x: spearmanr(
            x.y_test, x[position])[0]).to_frame()
    else:
        ic_by_day[position] = by_day.apply(
            lambda x: spearmanr(x.y_test, x[position])[0])
print(ic_by_day.describe())
test_predictions.to_hdf(PREDICTIONS_STORE, key=f'lgb/test/{lookahead:02}')
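
The per-date information coefficient computed above can be sketched on toy data. Instead of `scipy.stats.spearmanr`, this example ranks both columns and takes their Pearson correlation, which gives the same Spearman coefficient; the tickers, dates, and values are made up.

```python
import pandas as pd

# Three hypothetical tickers observed on two dates.
index = pd.MultiIndex.from_product(
    [['AAA', 'BBB', 'CCC'], pd.to_datetime(['2024-01-07', '2024-01-14'])],
    names=['ticker', 'date'])
toy = pd.DataFrame({
    'y_test':     [0.1, 0.3, 0.2, 0.2, 0.3, 0.1],
    'prediction': [1.0, 1.0, 2.0, 2.0, 3.0, 3.0],
}, index=index)

def rank_ic(group):
    """Spearman correlation as the Pearson correlation of ranks."""
    return group['y_test'].rank().corr(group['prediction'].rank())

# One IC per date: the cross-sectional rank agreement across tickers.
ic_by_date = toy.groupby(level='date').apply(rank_ic)
print(ic_by_date)
```

On the first date the predictions order the tickers exactly like the realized returns (IC of 1), on the second exactly in reverse (IC of -1). If predictions or labels are constant on a given date, the correlation is undefined and comes out as NaN, which may explain why some columns of the summary report fewer than 254 observations.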


Position: 00


100%|██████████| 254/254 [01:12<00:00,  3.53it/s]



Position: 01


100%|██████████| 254/254 [01:00<00:00,  4.21it/s]



Position: 02


100%|██████████| 254/254 [00:51<00:00,  4.96it/s]



Position: 03


100%|██████████| 254/254 [00:50<00:00,  5.05it/s]



Position: 04


100%|██████████| 254/254 [03:35<00:00,  1.18it/s]



Position: 05


100%|██████████| 254/254 [00:45<00:00,  5.57it/s]



Position: 06


100%|██████████| 254/254 [01:04<00:00,  3.96it/s]



Position: 07


100%|██████████| 254/254 [03:14<00:00,  1.30it/s]



Position: 08


100%|██████████| 254/254 [02:04<00:00,  2.05it/s]



Position: 09


100%|██████████| 254/254 [00:37<00:00,  6.70it/s]


                0           1           2           3           4           5  \
count  254.000000  254.000000  254.000000  254.000000  254.000000  253.000000   
mean     0.013926   -0.002100   -0.013742   -0.013742   -0.015744    0.027437   
std      0.368705    0.380981    0.391431    0.391431    0.394315    0.374024   
min     -0.809091   -0.881818   -0.897497   -0.897497   -0.845455   -0.820094   
25%     -0.273153   -0.258398   -0.333139   -0.333139   -0.327273   -0.261158   
50%      0.009091    0.031849   -0.011379   -0.011379    0.009091    0.059774   
75%      0.294824    0.276766    0.280840    0.280840    0.240969    0.284476   
max      0.818182    0.909091    0.883829    0.883829    0.909091    0.884877   

                6           7           8           9  
count  254.000000  253.000000  254.000000  232.000000  
mean     0.011648   -0.013146   -0.016326   -0.003885  
std      0.358314    0.378735    0.364382    0.356787  
min     -0.800000   -0.924832   -0.800000   -0



In [35]:
test_predictions.to_hdf("data_normalized/predictions_normalized_todo_2.h5", key='lgb/test/01')
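
Because `test_predictions` mixes the string column `y_test` with the integer column names 0 to 9, PyTables cannot map the column index to a native type and falls back to pickling it when the frame is saved. One possible way around this, shown on a toy frame, is to convert all column names to strings before saving. Any downstream notebook that selects the predictions by integer name would then need the same change, which is an assumption about code not shown here.

```python
import pandas as pd

# Toy stand-in for test_predictions: one label column plus integer-named predictions.
toy = pd.DataFrame({'y_test': [0.1, -0.2], 0: [0.05, -0.1], 1: [0.0, 0.1]})
print(toy.columns.inferred_type)  # mixed-integer

# Convert every column label to a string so the index has a single type.
renamed = toy.rename(columns=str)
print(renamed.columns.tolist())
```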

