# Data transformation

#### 0. Imports

In [10]:
# Auto-reload edited modules without restarting the kernel
%load_ext autoreload
%autoreload 2

# Add parent directory to the sys path for module imports
import sys
sys.path.append("..")

# Data preprocessing
import numpy as np
import polars as pl
import pandas as pd
from src.support.data_transformation import TickerExtender, TechnicalIndicators, FileHandler

# System variables
import os

# Manage paths
from pathlib import Path



# 1. Introduction and purpose

This notebook transforms and cleans the data obtained in the extraction step, `notebooks/1_0_data_extraction.ipynb`.

Transformation goals are:
- Cleaning and formatting the extracted tables, for both OHLCV stock data and macroindicators.
- Generation of Technical Indicators from OHLCV data.

The transformations below are a demonstrative example of the feature generation process, which appends features in two ways:
1. Horizontally: each stock symbol gains features built from its technical indicators and from the macroindicators.
2. Vertically: the per-symbol tables are stacked so that the model can be fed as many examples (rows) as possible.

For a detailed overview of the transformations applied, refer to the module `src/support/data_transformation.py`.

# 2. Data transformation

## 2.1. OHLCV - Adding calculated features and TA-Lib Technical Indicators


For OHLCV data, the transformations performed are:
- Creation of calendar and calendar-derived cycle features of stocks
- Lags of growth rate features, rolling moving averages of original prices
- Rolling min-max to try and extract support/resistance information
- Volatility, min-max ratios, etc
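
Some of these features can be reproduced by hand from the columns shown in the output table of this section. The block below is a minimal pandas sketch, not the `TickerExtender` implementation: the function name `add_basic_features` is hypothetical, and the formulas are inferred from the first AVGO rows of that output (for example, `growth_adj_1d` matches today's close divided by the previous close).

```python
import pandas as pd


def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add a few price-derived features to an OHLC frame."""
    out = df.copy()
    # Growth ratio over n rows: close today / close n rows earlier
    for n in (1, 3):
        out[f"growth_adj_{n}d"] = out["close"] / out["close"].shift(n)
    # Simple moving average of the close over 10 rows (NaN until 10 rows exist)
    out["SMA10"] = out["close"].rolling(10).mean()
    # Daily range relative to the close
    out["high_minus_low_relative"] = (out["high"] - out["low"]) / out["close"]
    # Position of the close inside the daily range (0 = at the low, 1 = at the high)
    out["_close_relative_high_low"] = (out["close"] - out["low"]) / (out["high"] - out["low"])
    return out


# First four AVGO rows copied from the notebook output
sample = pd.DataFrame(
    {
        "close": [1.150074, 1.167844, 1.135147, 1.113823],
        "high": [1.201962, 1.1913, 1.18206, 1.13728],
        "low": [1.106004, 1.139412, 1.109558, 1.10174],
    }
)
features = add_basic_features(sample)
print(features.round(6))
```

The leading rows of every rolling or lagged column are empty because there is not enough history yet, which is also why the real output starts with a run of nulls.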

More advanced technical indicators are calculated through class methods built on the TA-Lib library, which generate features based on patterns, momentum, volume and cycles. They can be found under the ``TickerExtender`` class in ``src/support/data_transformation.py``.

Finally, Skforecast models compute some rolling and autoregressive features internally. These features are not managed by the custom transformations here; the algorithm builds its own table with them.


The two methods below fetch the extracted tickers, transform them in parallel with joblib, and then stack them vertically.

In [16]:
ticker_extender = TickerExtender()

print("Enriching ticker dataframes.")
base_dir = Path(os.getcwd())  # Build paths from the notebook's absolute working directory so the relative paths resolve correctly
ticker_df_list = ticker_extender.transform_daily_tickers_parallel(base_dir / "../data/extracted/OHLCV")
print("Merging into one collection.")
merged_df_with_tech_ind = ticker_extender.merge_tickers(ticker_df_list)

merged_df_with_tech_ind

Enriching ticker dataframes.


Merging into one collection.


datetime,close,high,low,open,volume,symbol,currency,industry,sector,country,region,year,month,weekday,quarter_n,month_dt,quarter,growth_adj_1d,growth_adj_3d,growth_adj_7d,growth_adj_14d,growth_adj_21d,growth_adj_30d,growth_adj_90d,growth_adj_365d,SMA10,SMA22,SMA66,30d_volatility,high_minus_low_relative,_close_relative_high_low,is_growing_moving_average,adx,adxr,apo,aroon_1,…,cdlhangingman,cdlharami,cdlharamicross,cdlhighwave,cdlhikkake,cdlhikkakemod,cdlhomingpigeon,cdlidentical3crows,cdlinneck,cdlinvertedhammer,cdlkicking,cdlkickingbylength,cdlladderbottom,cdllongleggeddoji,cdllongline,cdlmarubozu,cdlmatchinglow,cdlmathold,cdlmorningdojistar,cdlmorningstar,cdlonneck,cdlpiercing,cdlrickshawman,cdlrisefall3methods,cdlseparatinglines,cdlshootingstar,cdlshortline,cdlspinningtop,cdlstalledpattern,cdlsticksandwich,cdltakuru,cdltasukigap,cdlthrusting,cdltristar,cdlunique3river,cdlupsidegap2crows,cdlxsidegap3methods
datetime[ns],f64,f64,f64,f64,i64,str,str,str,str,str,str,i32,i8,i8,i8,datetime[ns],date,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,i8,f64,f64,f64,f64,…,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32
2009-08-06 00:00:00,1.150074,1.201962,1.106004,1.17282,241978000,"""AVGO""","""USD""","""semiconductors""","""technology""","""united_states""","""US_AMERICA""",2009,8,4,3,2009-08-01 00:00:00,2009-07-01,,,,,,,,,,,,,0.083436,0.45926,,,,,,…,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2009-08-07 00:00:00,1.167844,1.1913,1.139412,1.147941,24543000,"""AVGO""","""USD""","""semiconductors""","""technology""","""united_states""","""US_AMERICA""",2009,8,5,3,2009-08-01 00:00:00,2009-07-01,1.015451,,,,,,,,,,,,0.044431,0.547945,,,,,,…,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2009-08-10 00:00:00,1.135147,1.18206,1.109558,1.18206,24210000,"""AVGO""","""USD""","""semiconductors""","""technology""","""united_states""","""US_AMERICA""",2009,8,1,3,2009-08-01 00:00:00,2009-07-01,0.972002,,,,,,,,,,,,0.06387,0.352941,,,,,,…,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2009-08-11 00:00:00,1.113823,1.13728,1.10174,1.135858,23054000,"""AVGO""","""USD""","""semiconductors""","""technology""","""united_states""","""US_AMERICA""",2009,8,2,3,2009-08-01 00:00:00,2009-07-01,0.981215,0.96848,,,,,,,,,,,0.031908,0.340001,,,,,,…,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2009-08-12 00:00:00,1.137279,1.151495,1.113112,1.147941,14513000,"""AVGO""","""USD""","""semiconductors""","""technology""","""united_states""","""US_AMERICA""",2009,8,3,3,2009-08-01 00:00:00,2009-07-01,1.021059,0.973828,,,,,,,,,,,0.03375,0.62963,,,,,,…,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2025-02-03 00:00:00,57.560001,57.919998,57.02,57.279999,1346900,"""TTE""","""USD""","""oil-gas-integrated""","""energy""","""france""","""EU""",2025,2,1,1,2025-02-01 00:00:00,2025-01-01,0.991901,0.989343,0.976918,0.967558,1.015526,1.0729,0.898979,0.931185,58.425,57.584546,57.59997,21.528084,0.015636,0.600003,1,26.752816,29.95888,1.591756,0.0,…,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2025-02-04 00:00:00,59.189999,59.400002,57.759998,57.759998,1712700,"""TTE""","""USD""","""oil-gas-integrated""","""energy""","""france""","""EU""",2025,2,2,1,2025-02-01 00:00:00,2025-01-01,1.028318,1.007832,1.018235,1.001353,1.041894,1.086055,0.929008,0.953604,58.433,57.797727,57.531438,19.188987,0.027707,0.87195,1,26.281955,29.478139,1.38758,0.0,…,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2025-02-05 00:00:00,60.32,60.68,60.110001,60.360001,1860700,"""TTE""","""USD""","""oil-gas-integrated""","""energy""","""france""","""EU""",2025,2,3,1,2025-02-01 00:00:00,2025-01-01,1.019091,1.039462,1.036782,1.038031,1.04795,1.095134,0.956651,0.988114,58.654,58.035909,57.490023,18.464838,0.00945,0.36842,1,26.7234,30.073047,1.196382,85.714286,…,0,0,0,-100,0,0,0,0,0,0,0,0,0,100,0,0,0,0,0,0,0,0,100,0,0,0,0,-100,0,0,0,0,0,0,0,0,0
2025-02-06 00:00:00,60.919998,61.330002,60.57,61.290001,1555700,"""TTE""","""USD""","""oil-gas-integrated""","""energy""","""france""","""EU""",2025,2,4,1,2025-02-01 00:00:00,2025-01-01,1.009947,1.058374,1.037289,1.036936,1.056172,1.098053,0.973158,1.005755,58.871,58.283182,57.464564,18.607306,0.012475,0.460523,1,27.508211,30.81291,1.069077,78.571429,…,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


A variant of the above method, ``transform_daily_tickers_parallel_experiment``, generates a different set of features, as explained in ``4_model_evaluation.ipynb``.
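
The transform-in-parallel-then-stack pattern can be sketched on toy data as follows. This is an illustration under stated assumptions, not the code behind `transform_daily_tickers_parallel`: the standard library's `ThreadPoolExecutor` stands in for joblib, and the `enrich` function and the ticker names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def enrich(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical per-ticker transformation: add a 1-day growth ratio."""
    out = frame.copy()
    out["growth_adj_1d"] = out["close"] / out["close"].shift(1)
    return out


# Toy per-ticker frames standing in for the extracted OHLCV files
tickers = {
    "AAA": pd.DataFrame({"close": [10.0, 11.0, 12.1]}),
    "BBB": pd.DataFrame({"close": [50.0, 45.0]}),
}
frames = [df.assign(symbol=symbol) for symbol, df in tickers.items()]

# Transform every ticker independently; pool.map keeps the input order
with ThreadPoolExecutor(max_workers=2) as pool:
    enriched = list(pool.map(enrich, frames))

# Stack the per-ticker results vertically into one table
stacked = pd.concat(enriched, ignore_index=True)
print(stacked)
```

Computing lagged features before stacking matters: each ticker's first row stays empty instead of borrowing the last close of the previous ticker.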

# 3. Macroindicators

## 3.1 Indices

In [None]:
from functools import partial

indices = ["^GSPC","^DJI","^GDAXI","EPI"]

# Read the first index
prefixed_growth = partial(ticker_extender.compute_daily_index_features,prefix=f"{indices[0]}_")
merged_indices = ticker_extender.read_transform_save(prefixed_growth, str(base_dir / f"../data/extracted/macro/{indices[0]}.parquet"))

# Iterate through the others, joining them horizontally
for index in indices[1:]:
    prefixed_growth = partial(ticker_extender.compute_daily_index_features,prefix=f"{index}_")
    index_df = ticker_extender.read_transform_save(prefixed_growth, str(base_dir / f"../data/extracted/macro/{index}.parquet"))
    merged_indices = merged_indices.join(index_df,how="full",on="datetime")

    # Coalesce the duplicated datetime column created by the full join,
    # then sort and forward-fill the gaps left by non-overlapping dates
    merged_indices = (
        merged_indices
        .with_columns(pl.col("datetime").fill_null(pl.col("datetime_right")))
        .sort(by="datetime")
        .select(pl.exclude("datetime_right"))
        .fill_null(strategy="forward")
    )


merged_indices

datetime,^GSPC_growth_adj_1d,^GSPC_growth_adj_3d,^GSPC_growth_adj_7d,^GSPC_growth_adj_14d,^GSPC_growth_adj_21d,^GSPC_growth_adj_30d,^GSPC_growth_adj_90d,^GSPC_growth_adj_365d,^GSPC_SMA10,^GSPC_SMA22,^GSPC_SMA66,^GSPC_30d_volatility,^GSPC_high_minus_low_relative,^GSPC__close_relative_high_low,^GSPC_is_growing_moving_average,^DJI_growth_adj_1d,^DJI_growth_adj_3d,^DJI_growth_adj_7d,^DJI_growth_adj_14d,^DJI_growth_adj_21d,^DJI_growth_adj_30d,^DJI_growth_adj_90d,^DJI_growth_adj_365d,^DJI_SMA10,^DJI_SMA22,^DJI_SMA66,^DJI_30d_volatility,^DJI_high_minus_low_relative,^DJI__close_relative_high_low,^DJI_is_growing_moving_average,^GDAXI_growth_adj_1d,^GDAXI_growth_adj_3d,^GDAXI_growth_adj_7d,^GDAXI_growth_adj_14d,^GDAXI_growth_adj_21d,^GDAXI_growth_adj_30d,^GDAXI_growth_adj_90d,^GDAXI_growth_adj_365d,^GDAXI_SMA10,^GDAXI_SMA22,^GDAXI_SMA66,^GDAXI_30d_volatility,^GDAXI_high_minus_low_relative,^GDAXI__close_relative_high_low,^GDAXI_is_growing_moving_average,EPI_growth_adj_1d,EPI_growth_adj_3d,EPI_growth_adj_7d,EPI_growth_adj_14d,EPI_growth_adj_21d,EPI_growth_adj_30d,EPI_growth_adj_90d,EPI_growth_adj_365d,EPI_SMA10,EPI_SMA22,EPI_SMA66,EPI_30d_volatility,EPI_high_minus_low_relative,EPI__close_relative_high_low,EPI_is_growing_moving_average
datetime[ns],f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,i8,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,i8,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,i8,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,i8
1927-12-30 00:00:00,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1928-01-03 00:00:00,1.005663,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1928-01-04 00:00:00,0.997748,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1928-01-05 00:00:00,0.990406,0.993771,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1928-01-06 00:00:00,1.006268,0.994369,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2025-02-03 00:00:00,0.992391,0.992592,0.997054,0.999651,1.02875,1.014835,1.032102,1.237103,6058.112012,5975.750466,5963.452119,1438.055324,0.016381,0.719347,1,0.997244,0.993478,0.993477,1.021479,1.059217,1.043411,1.054792,1.164392,44529.800781,43520.595526,43553.636837,16027.32887,0.016106,0.758723,1,0.98602,0.990327,1.006863,1.020863,1.064342,1.070093,1.10094,1.267639,21434.050781,20872.414862,20057.102184,9374.187927,0.009804,0.835543,1,0.992204,1.003013,1.006747,0.982739,0.985874,0.959636,0.926614,1.016258,43.42,44.048182,45.864918,14.220103,0.010169,0.795455,0
2025-02-04 00:00:00,1.007225,0.994517,0.995085,0.998122,1.034553,1.026566,1.036809,1.230666,6056.975977,5982.852739,5966.699996,1412.339519,0.008548,0.910868,1,1.003019,0.992735,0.993438,1.012044,1.053406,1.047288,1.051158,1.156704,44582.823828,43612.041903,43586.492424,16002.923704,0.006042,0.845196,1,1.003615,0.989805,1.003505,1.022037,1.060892,1.080358,1.101071,1.262572,21480.420703,20945.124822,20087.012488,8949.114722,0.010222,0.941591,1,1.012018,1.005049,1.024807,0.998632,1.011784,0.967307,0.928626,1.019397,43.414,43.980909,45.813923,13.56106,0.005252,1.0,0
2025-02-05 00:00:00,1.003909,1.003468,1.003671,0.995911,1.037408,1.032875,1.039185,1.222415,6054.486963,5991.622292,5970.163027,1375.429012,0.009206,0.975271,1,1.00712,1.007377,1.003573,1.016227,1.055388,1.058525,1.062516,1.160884,44654.478906,43724.815163,43626.495916,15936.45589,0.011898,0.974531,1,1.003731,0.993276,0.997615,1.015604,1.04915,1.067755,1.108217,1.275616,21513.586719,21007.385742,20118.949751,8809.919869,0.009182,1.0,1,0.998858,1.002981,1.013908,1.004363,1.000686,0.954813,0.923853,1.010724,43.433,43.886818,45.759299,11.921184,0.004115,0.777794,0
2025-02-06 00:00:00,1.003644,1.014847,1.002042,0.994257,1.022464,1.023744,1.046425,1.230792,6050.972949,5998.035911,5974.252419,1397.643203,0.006115,0.987635,1,0.9972,1.007332,0.997003,1.004096,1.035308,1.047166,1.061841,1.165906,44672.734766,43816.4288,43665.982126,15889.730088,0.009416,0.480254,1,1.014662,1.022129,1.008065,1.022926,1.060373,1.076785,1.137354,1.291125,21562.675781,21078.378906,20159.026722,8975.602527,0.010476,0.918938,1,0.992684,1.003467,0.996557,0.987941,0.987491,0.947208,0.919807,1.004255,43.38,43.776818,45.701942,9.785512,0.002303,0.299981,0
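
The join logic used for the indices (full join on `datetime`, sort, forward fill) can be checked on two toy frames. The sketch below is a pandas analogue of the polars code, with made-up column names and values. In pandas an outer merge on a shared key already returns a single key column, so no `datetime_right` clean-up is needed.

```python
import pandas as pd

# Two toy index tables whose trading calendars only partly overlap
index_a = pd.DataFrame(
    {
        "datetime": pd.to_datetime(["2025-02-03", "2025-02-04", "2025-02-05"]),
        "A_growth_adj_1d": [0.99, 1.01, 1.00],
    }
)
index_b = pd.DataFrame(
    {
        "datetime": pd.to_datetime(["2025-02-03", "2025-02-05", "2025-02-06"]),
        "B_growth_adj_1d": [0.98, 1.02, 1.03],
    }
)

# Keep every date from both tables, order by date, then carry the last
# known value forward into the gaps
merged = (
    index_a.merge(index_b, on="datetime", how="outer")
    .sort_values("datetime")
    .ffill()
    .reset_index(drop=True)
)
print(merged)
```

Forward filling means that on a day when one market is closed, its features repeat the previous session's values rather than being left empty.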



## 3.2 GDP

In [6]:
file_handler = FileHandler()
gdppot = pd.read_csv("../data/extracted/macro/GDPPOT.csv",index_col=0)

gdppot['gdppot_us_yoy'] = gdppot["GDPPOT"].pct_change(4)
gdppot['gdppot_us_qoq'] = gdppot["GDPPOT"].pct_change(1)
gdppot.drop(columns="GDPPOT",inplace=True)

gdppot.index = pd.to_datetime(gdppot.index, utc=False, format="%Y-%m-%d")
gdppot

DATE,gdppot_us_yoy,gdppot_us_qoq
1955-01-01,,
1955-04-01,,0.006516
1955-07-01,,0.006366
1955-10-01,,0.006565
1956-01-01,0.026246,0.006545
...,...,...
2024-01-01,0.020357,0.005102
2024-04-01,0.020474,0.005143
2024-07-01,0.020675,0.005201
2024-10-01,0.020854,0.005247


## 3.3 CPI

In [8]:
cpilfesl = pd.read_csv("../data/extracted/macro/CPILFESL.csv",index_col=0)
cpilfesl.index = pd.to_datetime(cpilfesl.index, utc=False, format="%Y-%m-%d")

# The figure available in a given month refers to the previous month, so shift the index one month forward
cpilfesl.index = cpilfesl.index + pd.DateOffset(months=1)

cpilfesl['cpi_core_yoy_prev_month'] = cpilfesl["CPILFESL"].pct_change(12)
cpilfesl['cpi_core_mom_prev_month'] = cpilfesl["CPILFESL"].pct_change(1)

# cpilfesl.drop(columns="CPILFESL",inplace=True)
# cpilfesl.index = pd.to_datetime(cpilfesl.index, utc=False, format="%Y-%m-%d")
cpilfesl.tail(3)

DATE,CPILFESL,cpi_core_yoy_prev_month,cpi_core_mom_prev_month
2024-11-01,321.666,0.033,0.002803
2024-12-01,322.657,0.033002,0.003081
2025-01-01,323.383,0.032483,0.00225
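
The one-month index shift is what keeps the CPI features free of look-ahead: after the shift, the row dated in a given month carries the previous month's figure. A minimal sketch with made-up values (only the column names come from the cell above):

```python
import pandas as pd

# Toy monthly series; the values are invented for illustration
cpi = pd.DataFrame(
    {"CPILFESL": [100.0, 101.0, 102.0]},
    index=pd.to_datetime(["2024-10-01", "2024-11-01", "2024-12-01"]),
)

lagged = cpi.copy()
# Move every observation one month forward: December's figure is now dated January
lagged.index = lagged.index + pd.DateOffset(months=1)
# Month-over-month change, still describing the previous month
lagged["cpi_core_mom_prev_month"] = lagged["CPILFESL"].pct_change(1)
print(lagged)
```

A stock row from January 2025 that joins on its `month_dt` key therefore only sees the December 2024 reading.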


## 3.4 FEDFUNDS

In [8]:
fedfunds = pd.read_csv("../data/extracted/macro/FEDFUNDS.csv",index_col=0)
fedfunds.index = pd.to_datetime(fedfunds.index, utc=False, format="%Y-%m-%d")

# The figure available in a given month refers to the previous month, so shift the index one month forward
fedfunds.index = fedfunds.index + pd.DateOffset(months=1)
fedfunds.columns = ["FEDFUNDS_prev_month"]

fedfunds

DATE,FEDFUNDS_prev_month
1955-02-01,1.39
1955-03-01,1.29
1955-04-01,1.35
1955-05-01,1.43
1955-06-01,1.43
...,...
2024-09-01,5.33
2024-10-01,5.13
2024-11-01,4.83
2024-12-01,4.64


## 3.5 Credit Risk - Euro AAA Bonds (Eurozone)

In [9]:
euro_yield_df = pd.read_csv("../data/extracted/macro/eurostat_euro_yield.csv", index_col=0)

# Yields available on a given day are lagged by 2 days (hence the `_prev_2d` suffix)
euro_yield_df = ticker_extender.transform_euro_yield_df(euro_yield_df)


euro_yield_df




Unnamed: 0,eur_yld_Y1_prev_2d,eur_yld_Y10_prev_2d,eur_yld_Y5_prev_2d
2004-09-08 00:00:00+00:00,2.29884,4.20922,3.45722
2004-09-09 00:00:00+00:00,2.32889,4.20963,3.47952
2004-09-10 00:00:00+00:00,2.34667,4.22842,3.50789
2004-09-13 00:00:00+00:00,2.30899,4.16187,3.43063
2004-09-14 00:00:00+00:00,2.27157,4.12098,3.37473
...,...,...,...
2025-01-13 00:00:00+00:00,2.33536,2.61630,2.27237
2025-01-14 00:00:00+00:00,2.36144,2.65624,2.32642
2025-01-15 00:00:00+00:00,2.37913,2.68203,2.35564
2025-01-16 00:00:00+00:00,2.37854,2.71869,2.39536


## 3.6 Credit Risk - DGS (US Treasury constant-maturity yields)

In [10]:
dgs = file_handler.read_csv_file("../data/extracted/macro/DGS.csv")
dgs.columns = [col + "_prev_1d" for col in dgs.columns] # lagged 1d
dgs.index = dgs.index + pd.offsets.BusinessDay(1)
dgs

DATE,DGS1_prev_1d,DGS5_prev_1d,DGS10_prev_1d
1962-01-03 00:00:00+00:00,3.22,3.88,4.06
1962-01-04 00:00:00+00:00,3.24,3.87,4.03
1962-01-05 00:00:00+00:00,3.24,3.86,3.99
1962-01-08 00:00:00+00:00,3.26,3.89,4.02
1962-01-09 00:00:00+00:00,3.31,3.91,4.03
...,...,...,...
2025-01-13 00:00:00+00:00,4.25,4.59,4.77
2025-01-14 00:00:00+00:00,4.24,4.61,4.79
2025-01-15 00:00:00+00:00,4.22,4.59,4.78
2025-01-16 00:00:00+00:00,4.19,4.45,4.66
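
The one-day lag relies on `pd.offsets.BusinessDay`, which moves a Friday observation to the following Monday rather than to Saturday. A small sketch with invented yields:

```python
import pandas as pd

# Toy yields observed on a Thursday and a Friday (values are invented)
rates = pd.DataFrame(
    {"DGS10": [4.0, 4.1]},
    index=pd.to_datetime(["2025-01-16", "2025-01-17"]),
)

# Rename to make the lag explicit, then shift the dates by one business day
rates.columns = [f"{col}_prev_1d" for col in rates.columns]
rates.index = rates.index + pd.offsets.BusinessDay(1)
print(rates)
print(rates.index.day_name().tolist())
```

`BusinessDay` skips weekends only. It does not know about exchange holidays, so a shifted date can land on a day without stock data; in a left merge on the stock dates such a row is simply never matched. pandas also provides `CustomBusinessDay` for holiday-aware calendars.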


## 3.7 Volatility - VIX (US)  

In [11]:
vix = file_handler.read_csv_file("../data/extracted/macro/VIX.csv")
vix = vix[["Close"]]
vix.columns = ["VIX_close"]
vix

Date,VIX_close
1990-01-02 00:00:00+00:00,17.240000
1990-01-03 00:00:00+00:00,18.190001
1990-01-04 00:00:00+00:00,19.219999
1990-01-05 00:00:00+00:00,20.110001
1990-01-08 00:00:00+00:00,20.260000
...,...
2025-01-14 00:00:00+00:00,18.709999
2025-01-15 00:00:00+00:00,16.120001
2025-01-16 00:00:00+00:00,16.600000
2025-01-17 00:00:00+00:00,15.970000


## 3.8 Commodities - GOLD  

In [12]:
prefixed_growth_gold = partial(ticker_extender.calculate_growth_features, prefix="gold_")
gold = ticker_extender.read_transform_save(prefixed_growth_gold, "../data/extracted/macro/GOLD.csv")
gold

Date,gold_growth_adj_1d,gold_growth_adj_3d,gold_growth_adj_7d,gold_growth_adj_30d,gold_growth_adj_90d,gold_growth_adj_365d
2000-08-30 00:00:00+00:00,,,,,,
2000-08-31 00:00:00+00:00,1.016064,,,,,
2000-09-01 00:00:00+00:00,0.995329,,,,,
2000-09-05 00:00:00+00:00,0.995668,1.006937,,,,
2000-09-06 00:00:00+00:00,0.994199,0.985268,,,,
...,...,...,...,...,...,...
2025-01-10 00:00:00+00:00,1.009203,1.019498,1.024008,1.016361,1.024085,1.339714
2025-01-13 00:00:00+00:00,0.987078,1.003378,1.013304,0.991066,1.022371,1.327260
2025-01-14 00:00:00+00:00,1.001496,0.997653,1.007829,0.979406,1.027437,1.308203
2025-01-15 00:00:00+00:00,1.013072,1.001477,1.018015,1.009302,1.035068,1.338845


## 3.9 Commodities - WTI Crude Oil (Futures and Spot)  

In [13]:
prefixed_growth_crude = partial(ticker_extender.calculate_growth_features, prefix="crude_oil_")
crude_oil = ticker_extender.read_transform_save(prefixed_growth_crude, "../data/extracted/macro/WTI_oil_futures.csv")
crude_oil

Date,crude_oil_growth_adj_1d,crude_oil_growth_adj_3d,crude_oil_growth_adj_7d,crude_oil_growth_adj_30d,crude_oil_growth_adj_90d,crude_oil_growth_adj_365d
2000-08-23 00:00:00+00:00,,,,,,
2000-08-24 00:00:00+00:00,0.986895,,,,,
2000-08-25 00:00:00+00:00,1.013279,,,,,
2000-08-28 00:00:00+00:00,1.025585,1.025585,,,,
2000-08-29 00:00:00+00:00,0.995437,1.034461,,,,
...,...,...,...,...,...,...
2025-01-10 00:00:00+00:00,1.035850,1.031246,1.035289,1.119936,0.992611,1.072860
2025-01-13 00:00:00+00:00,1.029385,1.075014,1.071506,1.149147,1.071361,1.094418
2025-01-14 00:00:00+00:00,0.983253,1.048431,1.043771,1.102575,1.058165,1.066318
2025-01-15 00:00:00+00:00,1.032774,1.045318,1.091653,1.143102,1.055241,1.105525


## 3.10 Commodities - Brent Oil (Futures and Spot)  

In [14]:
prefixed_growth_brent = partial(ticker_extender.calculate_growth_features, prefix="brent_oil_")
brent_oil = ticker_extender.read_transform_save(prefixed_growth_brent, "../data/extracted/macro/Brent_oil_futures.csv")
brent_oil

Date,brent_oil_growth_adj_1d,brent_oil_growth_adj_3d,brent_oil_growth_adj_7d,brent_oil_growth_adj_30d,brent_oil_growth_adj_90d,brent_oil_growth_adj_365d
2007-07-30 00:00:00+00:00,,,,,,
2007-07-31 00:00:00+00:00,1.017296,,,,,
2007-08-01 00:00:00+00:00,0.977936,,,,,
2007-08-02 00:00:00+00:00,1.005441,1.000264,,,,
2007-08-03 00:00:00+00:00,0.986668,0.970149,,,,
...,...,...,...,...,...,...
2025-01-10 00:00:00+00:00,1.036922,1.035172,1.042478,1.105628,0.985543,1.038542
2025-01-13 00:00:00+00:00,1.015672,1.063682,1.061730,1.122178,1.049624,1.046506
2025-01-14 00:00:00+00:00,0.986545,1.039002,1.037248,1.087051,1.043614,1.020820
2025-01-15 00:00:00+00:00,1.026401,1.028460,1.077075,1.117423,1.033123,1.047771


## 3.11 Cryptocurrency - BTC-USD  

BTC also trades on days when the stock market is closed (weekends and holidays). Since stocks are the prediction goal, those BTC-only days are excluded from the BTC data.

In [15]:
prefixed_growth_btc = partial(ticker_extender.calculate_growth_features, prefix="BTC_")
BTC_usd = ticker_extender.read_transform_save(prefixed_growth_btc, "../data/extracted/macro/BTC_usd.csv")
BTC_usd

Date,BTC_growth_adj_1d,BTC_growth_adj_3d,BTC_growth_adj_7d,BTC_growth_adj_30d,BTC_growth_adj_90d,BTC_growth_adj_365d
2014-09-17 00:00:00+00:00,,,,,,
2014-09-18 00:00:00+00:00,0.928074,,,,,
2014-09-19 00:00:00+00:00,0.930157,,,,,
2014-09-20 00:00:00+00:00,1.035735,0.894104,,,,
2014-09-21 00:00:00+00:00,0.975341,0.939640,,,,
...,...,...,...,...,...,...
2025-01-16 00:00:00+00:00,0.992562,1.055444,1.054885,1.004609,1.124636,1.582188
2025-01-17 00:00:00+00:00,1.047166,1.082126,1.105554,1.090469,1.187665,1.718330
2025-01-18 00:00:00+00:00,0.999483,1.038840,1.104654,1.108780,1.152608,1.717416
2025-01-19 00:00:00+00:00,0.968216,1.013360,1.047191,1.062268,1.158614,1.645114


# 4. Merging all together

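This merge is performed only to show that the macroindicators need to be joined horizontally to the features created for every stock, since they become exogenous features for each of them. Each indicator is joined on the key that matches its frequency: `date` for daily series, `month_dt` for monthly series and `quarter` for quarterly series.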

In [16]:
stocks_indices_merged = pd.merge(merged_df_with_tech_ind,merged_indices, on="date", how="left")

In [17]:
stocks_indices_merged.tail()

Unnamed: 0,close,high,low,open,volume,symbol,country,region,industry,sector,...,^GDAXI_growth_adj_7d,^GDAXI_growth_adj_30d,^GDAXI_growth_adj_90d,^GDAXI_growth_adj_365d,EPI_growth_adj_1d,EPI_growth_adj_3d,EPI_growth_adj_7d,EPI_growth_adj_30d,EPI_growth_adj_90d,EPI_growth_adj_365d
226949,349.140015,350.910004,344.369995,345.130005,2711200.0,ACN,ireland,EU,information-technology-services,technology,...,0.995878,0.988902,1.052922,1.211201,0.986102,0.960071,0.944154,0.905175,0.901227,1.04918
226950,348.98999,352.519989,345.630005,351.209991,1825400.0,ACN,ireland,EU,information-technology-services,technology,...,0.996596,0.994443,1.061098,1.223251,1.009935,0.971118,0.97004,0.915505,0.893845,1.058069
226951,349.730011,355.200012,349.059998,352.350006,2624000.0,ACN,ireland,EU,information-technology-services,technology,...,1.012038,1.011241,1.079103,1.252134,1.005948,1.001823,0.975377,0.918265,0.895146,1.045433
226952,350.559998,353.25,347.0,349.109985,2025800.0,ACN,ireland,EU,information-technology-services,technology,...,1.016651,1.016047,1.072733,1.246753,0.998635,1.014556,0.975561,0.9253,0.897203,1.051487
226953,352.589996,357.0,351.910004,354.920013,4061300.0,ACN,ireland,EU,information-technology-services,technology,...,1.034064,1.024718,1.088101,1.262653,1.002733,1.007321,1.00319,0.925494,0.896003,1.069948


In [18]:
stocks_to_gdppot = pd.merge(stocks_indices_merged, gdppot, how="left", left_on="quarter", right_index=True)

In [19]:
stocks_to_cpilfesl = pd.merge(stocks_to_gdppot, cpilfesl, how="left", left_on="month_dt", right_index=True)
stocks_to_cpilfesl.tail(3)

Unnamed: 0,close,high,low,open,volume,symbol,country,region,industry,sector,...,EPI_growth_adj_3d,EPI_growth_adj_7d,EPI_growth_adj_30d,EPI_growth_adj_90d,EPI_growth_adj_365d,gdppot_us_yoy,gdppot_us_qoq,CPILFESL,cpi_core_yoy_prev_month,cpi_core_mom_prev_month
226951,349.730011,355.200012,349.059998,352.350006,2624000.0,ACN,ireland,EU,information-technology-services,technology,...,1.001823,0.975377,0.918265,0.895146,1.045433,0.021063,0.005308,323.383,0.032483,0.00225
226952,350.559998,353.25,347.0,349.109985,2025800.0,ACN,ireland,EU,information-technology-services,technology,...,1.014556,0.975561,0.9253,0.897203,1.051487,0.021063,0.005308,323.383,0.032483,0.00225
226953,352.589996,357.0,351.910004,354.920013,4061300.0,ACN,ireland,EU,information-technology-services,technology,...,1.007321,1.00319,0.925494,0.896003,1.069948,0.021063,0.005308,323.383,0.032483,0.00225
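
The monthly and quarterly joins work because every daily row carries a key set to the first day of its month or quarter. The sketch below shows one way to build such keys and use them; the derivation through `to_period` is an assumption about how `month_dt` and `quarter` could be produced, not a copy of the `TickerExtender` code, and the values are invented.

```python
import pandas as pd

daily = pd.DataFrame(
    {
        "date": pd.to_datetime(["2025-01-30", "2025-01-31", "2025-02-03"]),
        "close": [10.0, 10.5, 10.2],
    }
)
# First day of the month and of the quarter for every daily row
daily["month_dt"] = daily["date"].dt.to_period("M").dt.to_timestamp()
daily["quarter"] = daily["date"].dt.to_period("Q").dt.to_timestamp()

# Monthly indicator indexed by the first day of each month
monthly = pd.DataFrame(
    {"cpi_core_mom_prev_month": [0.003, 0.002]},
    index=pd.to_datetime(["2025-01-01", "2025-02-01"]),
)

# Every daily row receives the value of its month
merged = pd.merge(daily, monthly, how="left", left_on="month_dt", right_index=True)
print(merged)
```

This is why the monthly and quarterly columns repeat the same value on consecutive rows in the outputs of this section.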


In [20]:
stocks_to_fedfunds = pd.merge(stocks_to_cpilfesl, fedfunds, how="left", left_on="month_dt", right_index=True)
stocks_to_fedfunds.tail(3)

Unnamed: 0,close,high,low,open,volume,symbol,country,region,industry,sector,...,EPI_growth_adj_7d,EPI_growth_adj_30d,EPI_growth_adj_90d,EPI_growth_adj_365d,gdppot_us_yoy,gdppot_us_qoq,CPILFESL,cpi_core_yoy_prev_month,cpi_core_mom_prev_month,FEDFUNDS_prev_month
226951,349.730011,355.200012,349.059998,352.350006,2624000.0,ACN,ireland,EU,information-technology-services,technology,...,0.975377,0.918265,0.895146,1.045433,0.021063,0.005308,323.383,0.032483,0.00225,4.48
226952,350.559998,353.25,347.0,349.109985,2025800.0,ACN,ireland,EU,information-technology-services,technology,...,0.975561,0.9253,0.897203,1.051487,0.021063,0.005308,323.383,0.032483,0.00225,4.48
226953,352.589996,357.0,351.910004,354.920013,4061300.0,ACN,ireland,EU,information-technology-services,technology,...,1.00319,0.925494,0.896003,1.069948,0.021063,0.005308,323.383,0.032483,0.00225,4.48


In [21]:
stocks_to_euro_yld = pd.merge(stocks_to_fedfunds, euro_yield_df, how="left", left_on="date", right_index=True)
stocks_to_euro_yld.tail(3)

Unnamed: 0,close,high,low,open,volume,symbol,country,region,industry,sector,...,EPI_growth_adj_365d,gdppot_us_yoy,gdppot_us_qoq,CPILFESL,cpi_core_yoy_prev_month,cpi_core_mom_prev_month,FEDFUNDS_prev_month,eur_yld_Y1_prev_2d,eur_yld_Y10_prev_2d,eur_yld_Y5_prev_2d
226951,349.730011,355.200012,349.059998,352.350006,2624000.0,ACN,ireland,EU,information-technology-services,technology,...,1.045433,0.021063,0.005308,323.383,0.032483,0.00225,4.48,2.37913,2.68203,2.35564
226952,350.559998,353.25,347.0,349.109985,2025800.0,ACN,ireland,EU,information-technology-services,technology,...,1.051487,0.021063,0.005308,323.383,0.032483,0.00225,4.48,2.37854,2.71869,2.39536
226953,352.589996,357.0,351.910004,354.920013,4061300.0,ACN,ireland,EU,information-technology-services,technology,...,1.069948,0.021063,0.005308,323.383,0.032483,0.00225,4.48,2.33828,2.63028,2.30341


In [22]:
stocks_to_dgs = pd.merge(stocks_to_euro_yld, dgs, how="left", left_on="date", right_index=True)
stocks_to_dgs.tail(3)

Unnamed: 0,close,high,low,open,volume,symbol,country,region,industry,sector,...,CPILFESL,cpi_core_yoy_prev_month,cpi_core_mom_prev_month,FEDFUNDS_prev_month,eur_yld_Y1_prev_2d,eur_yld_Y10_prev_2d,eur_yld_Y5_prev_2d,DGS1_prev_1d,DGS5_prev_1d,DGS10_prev_1d
226951,349.730011,355.200012,349.059998,352.350006,2624000.0,ACN,ireland,EU,information-technology-services,technology,...,323.383,0.032483,0.00225,4.48,2.37913,2.68203,2.35564,4.22,4.59,4.78
226952,350.559998,353.25,347.0,349.109985,2025800.0,ACN,ireland,EU,information-technology-services,technology,...,323.383,0.032483,0.00225,4.48,2.37854,2.71869,2.39536,4.19,4.45,4.66
226953,352.589996,357.0,351.910004,354.920013,4061300.0,ACN,ireland,EU,information-technology-services,technology,...,323.383,0.032483,0.00225,4.48,2.33828,2.63028,2.30341,4.18,4.39,4.61


In [23]:
stocks_to_vix = pd.merge(stocks_to_dgs, vix, how="left", left_on="date", right_index=True)
stocks_to_vix.tail(3)

Unnamed: 0,close,high,low,open,volume,symbol,country,region,industry,sector,...,cpi_core_yoy_prev_month,cpi_core_mom_prev_month,FEDFUNDS_prev_month,eur_yld_Y1_prev_2d,eur_yld_Y10_prev_2d,eur_yld_Y5_prev_2d,DGS1_prev_1d,DGS5_prev_1d,DGS10_prev_1d,VIX_close
226951,349.730011,355.200012,349.059998,352.350006,2624000.0,ACN,ireland,EU,information-technology-services,technology,...,0.032483,0.00225,4.48,2.37913,2.68203,2.35564,4.22,4.59,4.78,16.120001
226952,350.559998,353.25,347.0,349.109985,2025800.0,ACN,ireland,EU,information-technology-services,technology,...,0.032483,0.00225,4.48,2.37854,2.71869,2.39536,4.19,4.45,4.66,16.6
226953,352.589996,357.0,351.910004,354.920013,4061300.0,ACN,ireland,EU,information-technology-services,technology,...,0.032483,0.00225,4.48,2.33828,2.63028,2.30341,4.18,4.39,4.61,15.97


In [24]:
stocks_to_gold = pd.merge(stocks_to_vix, gold, how="left", left_on="date", right_index=True)
stocks_to_gold.tail(3)

Unnamed: 0,close,high,low,open,volume,symbol,country,region,industry,sector,...,DGS1_prev_1d,DGS5_prev_1d,DGS10_prev_1d,VIX_close,gold_growth_adj_1d,gold_growth_adj_3d,gold_growth_adj_7d,gold_growth_adj_30d,gold_growth_adj_90d,gold_growth_adj_365d
226951,349.730011,355.200012,349.059998,352.350006,2624000.0,ACN,ireland,EU,information-technology-services,technology,...,4.22,4.59,4.78,16.120001,1.013072,1.001477,1.018015,1.009302,1.035068,1.338845
226952,350.559998,353.25,347.0,349.109985,2025800.0,ACN,ireland,EU,information-technology-services,technology,...,4.19,4.45,4.66,16.6,,,,,,
226953,352.589996,357.0,351.910004,354.920013,4061300.0,ACN,ireland,EU,information-technology-services,technology,...,4.18,4.39,4.61,15.97,,,,,,


In [25]:
stocks_to_crude = pd.merge(stocks_to_gold, crude_oil, how="left", left_on="date", right_index=True)
stocks_to_crude.tail(3)

Unnamed: 0,close,high,low,open,volume,symbol,country,region,industry,sector,...,gold_growth_adj_7d,gold_growth_adj_30d,gold_growth_adj_90d,gold_growth_adj_365d,crude_oil_growth_adj_1d,crude_oil_growth_adj_3d,crude_oil_growth_adj_7d,crude_oil_growth_adj_30d,crude_oil_growth_adj_90d,crude_oil_growth_adj_365d
226951,349.730011,355.200012,349.059998,352.350006,2624000.0,ACN,ireland,EU,information-technology-services,technology,...,1.018015,1.009302,1.035068,1.338845,1.032774,1.045318,1.091653,1.143102,1.055241,1.105525
226952,350.559998,353.25,347.0,349.109985,2025800.0,ACN,ireland,EU,information-technology-services,technology,...,,,,,,,,,,
226953,352.589996,357.0,351.910004,354.920013,4061300.0,ACN,ireland,EU,information-technology-services,technology,...,,,,,,,,,,


In [26]:
stocks_to_brent = pd.merge(stocks_to_crude, brent_oil, how="left", left_on="date", right_index=True)
stocks_to_brent.tail(3)

Unnamed: 0,close,high,low,open,volume,symbol,country,region,industry,sector,...,crude_oil_growth_adj_7d,crude_oil_growth_adj_30d,crude_oil_growth_adj_90d,crude_oil_growth_adj_365d,brent_oil_growth_adj_1d,brent_oil_growth_adj_3d,brent_oil_growth_adj_7d,brent_oil_growth_adj_30d,brent_oil_growth_adj_90d,brent_oil_growth_adj_365d
226951,349.730011,355.200012,349.059998,352.350006,2624000.0,ACN,ireland,EU,information-technology-services,technology,...,1.091653,1.143102,1.055241,1.105525,1.026401,1.02846,1.077075,1.117423,1.033123,1.047771
226952,350.559998,353.25,347.0,349.109985,2025800.0,ACN,ireland,EU,information-technology-services,technology,...,,,,,,,,,,
226953,352.589996,357.0,351.910004,354.920013,4061300.0,ACN,ireland,EU,information-technology-services,technology,...,,,,,,,,,,


In [27]:
stocks_to_btc = pd.merge(stocks_to_brent, BTC_usd, how="left", left_on="date", right_index=True)
stocks_to_btc.tail(3)

Unnamed: 0,close,high,low,open,volume,symbol,country,region,industry,sector,...,brent_oil_growth_adj_7d,brent_oil_growth_adj_30d,brent_oil_growth_adj_90d,brent_oil_growth_adj_365d,BTC_growth_adj_1d,BTC_growth_adj_3d,BTC_growth_adj_7d,BTC_growth_adj_30d,BTC_growth_adj_90d,BTC_growth_adj_365d
226951,349.730011,355.200012,349.059998,352.350006,2624000.0,ACN,ireland,EU,information-technology-services,technology,...,1.077075,1.117423,1.033123,1.047771,1.04113,1.06367,1.061277,1.018529,1.248903,1.642554
226952,350.559998,353.25,347.0,349.109985,2025800.0,ACN,ireland,EU,information-technology-services,technology,...,,,,,0.992562,1.055444,1.054885,1.004609,1.124636,1.582188
226953,352.589996,357.0,351.910004,354.920013,4061300.0,ACN,ireland,EU,information-technology-services,technology,...,,,,,1.047166,1.082126,1.105554,1.090469,1.187665,1.71833
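
The daily merges above all share the same shape (left join on `date` against the indicator's index), so they could also be written as a single fold over a list of frames. The following is a sketch of that alternative on toy data; the helper name `left_join_on_date` is hypothetical.

```python
from functools import reduce

import pandas as pd

# Toy stock rows and two toy daily indicators that end one day earlier
base = pd.DataFrame(
    {
        "date": pd.to_datetime(["2025-01-15", "2025-01-16"]),
        "close": [349.7, 350.6],
    }
)
last_available = pd.to_datetime(["2025-01-15"])
daily_macro = [
    pd.DataFrame({"gold_growth_adj_1d": [1.01]}, index=last_available),
    pd.DataFrame({"crude_oil_growth_adj_1d": [1.03]}, index=last_available),
]


def left_join_on_date(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: attach one date-indexed indicator to the stock rows."""
    return pd.merge(left, right, how="left", left_on="date", right_index=True)


# Apply the same join to every indicator in turn, starting from the stock rows
combined = reduce(left_join_on_date, daily_macro, base)
print(combined)
```

Stock dates that fall after an indicator's last observation stay empty, which matches the missing gold and oil values in the final rows of the outputs above.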


# 5. Dataset save

In [28]:
stocks_to_btc.to_parquet("../data/transformed/dataset.parquet")

# 6. Conclusion and comments

As said in the introduction, these transformations are an example. Different model configurations require different feature stackings, variations in transformations, etc. Additionally, the algorithms used in `4_model_evaluation.ipynb` perform internal transformations.

Although a myriad of transformations are performed here, the ideal would be to modify the classes provided by Skforecast, the forecasting library used here, so that transformations can be specified in a standardised format such as dictionaries or lists. Prototyping would then be faster, cleaner and more directly compatible with different model architectures: independent multiseries, dependent multivariate, individual/global, etc. (see `4_model_evaluation.ipynb` for more details).

The next notebook in the series is `3_data_load.ipynb`.