# Enhancing index performance through multi-factor rebalancing

## Introduction

Traditional stock indexes, like the Dow Jones Industrial Average (DJI), the S&P 500, the FTSE 100 act as benchmarks for the broader market's performance. These indexes are typically constructed and weighted based on a single factor, such as price or market capitalization. While this simplicity has its advantages, it can also limit the potential to capture more nuanced market dynamics.

In recent years, the emergence of factor investing has opened up new possibilities for constructing and managing portfolios. By focusing on multiple factors such as value, size, quality, yield, profitability, and leverage, investors can potentially achieve a better informed exposure to market risks and opportunities.

This article explores an advanced approach to index rebalancing that extends the single-factor methodologies. We will analyse a multi-factor strategy that incorporates various financial metrics, economic and cross-market indicators and dynamically adjusts the index composition based on which factors are predicted to perform best in the following month.

By comparing this multi-factor index to a conventional index, we aim to demonstrate how a more comprehensive approach to weighting and rebalancing can lead to improved returns and better risk management.

## Import required libraries

To start, we install and import the necessary packages. We use the [LSEG Data Libraries](https://developers.lseg.com/en/api-catalog/lseg-data-platform/lseg-data-library-for-python) to retrieve the index constituents, financial metrics to calculate factor-based returns and to get economic indicators and cross-market pricing data. The code is built using Python 3.9. The prerequisite packages are imported as shown below. 

In [1]:
import time
import warnings
import lseg.data as ld
import pandas as pd
import numpy as np
import xgboost as xgb
import plotly.graph_objects as go
import plotly.io as pio

from lseg.data import HeaderType, errors
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder, StandardScaler
from IndexConstutents import IndexConstituents
from datetime import timedelta
from collections import Counter

pd.options.mode.chained_assignment = None
pd.options.display.max_columns = None
pio.renderers.default="notebook_connected"
np.seterr(divide='ignore', invalid='ignore')
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
ld.open_session()

<lseg.data.session.Definition object at 0x15f207820 {name='workspace'}>

## Methodology for factor construction

For this analysis, the factor construction methodology is based on the [FTSE Global Factor Index Series Ground Rules](https://www.lseg.com/content/dam/ftse-russell/en_us/documents/ground-rules/ftse-global-factor-index-series-ground-rules.pdf), which provides a comprehensive framework for defining and calculating factors such as value, profitability, size, yield, leverage etc. These rules ensure consistency in how factors are measured, including the treatment of missing data and the handling of outliers, thereby offering a robust foundation for factor-based index rebalancing.

To implement this methodology, the Dow Jones Industrial Average (DJI) was selected due to its stable composition and the extensive historical financial data available for its constituents. This provides a consistent dataset for factor analysis going back to 2000. While the focus here is the DJI, this approach is versatile, and readers are encouraged to experiment with other indexes and factor construction methodologies better suited for their investment requirements.

## Data Ingestion: Retrieving Historical Index Constituents and Financial Data
Below, we define the index (DJI), the request periods and request historical constituents using the custom object **IndexConstituents**. More about the object and **get_historical_constituents** function can be found in [this article](https://developers.lseg.com/en/article-catalog/article/building-historical-index-constituents).

In [3]:
ic = IndexConstituents()
index_ric = '.DJI'
start_date = '2000-01-01'
end_date = '2024-07-31'

index_historical  = ic.get_historical_constituents(index_ric,  start = start_date, end=end_date)
index_historical

Unnamed: 0,Date,RIC
0,2000-01-01,MMM.N
1,2000-01-01,MO.N
2,2000-01-01,AXP.N
3,2000-01-01,T.N^K05
4,2000-01-01,T.N
...,...,...
385,2024-02-26,DOW.N
386,2024-02-26,HON.OQ
387,2024-02-26,AMGN.OQ
388,2024-02-26,CRM.N


After obtaining the historical constituents for the DJI, we group the dataset by date, aggregate the RICs into a list, and resample monthly across the entire period in the dataset.

In [4]:
index_historical.set_index('Date', inplace=True)
index_historical.index = pd.to_datetime(index_historical.index)

index_grouped = index_historical.groupby('Date')['RIC'].agg(list)
ftse_grouped_m = pd.DataFrame(index_grouped.resample('M', convention='start').ffill())
ftse_grouped_m

Unnamed: 0_level_0,RIC
Date,Unnamed: 1_level_1
2000-01-31,"[MMM.N, MO.N, AXP.N, T.N^K05, T.N, BA.N, CAT.N..."
2000-02-29,"[MMM.N, MO.N, AXP.N, T.N^K05, T.N, BA.N, CAT.N..."
2000-03-31,"[MMM.N, MO.N, AXP.N, T.N^K05, T.N, BA.N, CAT.N..."
2000-04-30,"[MMM.N, MO.N, AXP.N, T.N^K05, T.N, BA.N, CAT.N..."
2000-05-31,"[MMM.N, MO.N, AXP.N, T.N^K05, T.N, BA.N, CAT.N..."
...,...
2023-10-31,"[MMM.N, AXP.N, BA.N, CAT.N, KO.N, HD.N, IBM.N,..."
2023-11-30,"[MMM.N, AXP.N, BA.N, CAT.N, KO.N, HD.N, IBM.N,..."
2023-12-31,"[MMM.N, AXP.N, BA.N, CAT.N, KO.N, HD.N, IBM.N,..."
2024-01-31,"[MMM.N, AXP.N, BA.N, CAT.N, KO.N, HD.N, IBM.N,..."


Since the last change to the constituents happened on February 2024, we can extend the RIC list of the last month for the rest of the observation period (July 2024).

In [5]:
last_date = ftse_grouped_m.index[-1]
last_constituents = ftse_grouped_m.iloc[-1]['RIC']

extended_dates = pd.date_range(start='2024-03-31', end='2024-07-31', freq='M')
extended_rows = pd.DataFrame({'RIC': [last_constituents] * len(extended_dates)}, index=extended_dates)

ftse_grouped_m = pd.concat([ftse_grouped_m, extended_rows])
ftse_grouped_m

Unnamed: 0,RIC
2000-01-31,"[MMM.N, MO.N, AXP.N, T.N^K05, T.N, BA.N, CAT.N..."
2000-02-29,"[MMM.N, MO.N, AXP.N, T.N^K05, T.N, BA.N, CAT.N..."
2000-03-31,"[MMM.N, MO.N, AXP.N, T.N^K05, T.N, BA.N, CAT.N..."
2000-04-30,"[MMM.N, MO.N, AXP.N, T.N^K05, T.N, BA.N, CAT.N..."
2000-05-31,"[MMM.N, MO.N, AXP.N, T.N^K05, T.N, BA.N, CAT.N..."
...,...
2024-03-31,"[MMM.N, AXP.N, BA.N, CAT.N, KO.N, HD.N, IBM.N,..."
2024-04-30,"[MMM.N, AXP.N, BA.N, CAT.N, KO.N, HD.N, IBM.N,..."
2024-05-31,"[MMM.N, AXP.N, BA.N, CAT.N, KO.N, HD.N, IBM.N,..."
2024-06-30,"[MMM.N, AXP.N, BA.N, CAT.N, KO.N, HD.N, IBM.N,..."


Below, we define all the fields required for factor calculation. We have also included comments indicating which factors utilize each field. It should be noted that some fields are used across multiple factor calculations, for example, although "TR.CompanyMarketCap" is mentioned under size it is part of the formulas to be used for calculating Value factor.

In [6]:
fields = [
          "TR.TRBCEconomicSector",

          # Momentum
          "TR.TotalReturn1Mo",
          "TR.TotalReturn52Wk", 

          # Profitability
          "TR.F.AssetTurnover(Period=FY0)",
          "TR.F.AssetTurnover(Period=FY-1)",
          "TR.F.ReturnAvgTotAssetsPct(Period=FY0)",
          "TR.F.TotAssets(Period=FY0)",
          "TR.F.TotAssets(Period=FY-1)",
          "TR.F.WkgCap(Period=FY0)",
          "TR.F.WkgCap(Period=FY-1)",
          "TR.F.TotCurrAssets(Period=FY0)",
          "TR.F.TotCurrAssets(Period=FY-1)",
          "TR.F.InvstTot(Period=FY0)",
          "TR.F.InvstTot(Period=FY-1)",
          "TR.F.CustAdvTot(Period=FY0)",
          "TR.F.CustAdvTot(Period=FY-1)",
          "TR.F.TotLiab(Period=FY0)",
          "TR.F.TotLiab(Period=FY-1)",
          "TR.F.TotCurrLiab(Period=FY0)",
          "TR.F.TotCurrLiab(Period=FY-1)",
          "TR.F.PrefStockLiabPortLT(Period=FY0)",
          "TR.F.PrefStockLiabPortLT(Period=FY-1)",
          "TR.F.CashSTInvst(Period=FY0)",
          "TR.F.CashSTInvst(Period=FY-1)",
          "TR.F.InvstLT(Period=FY0)",
          "TR.F.InvstLT(Period=FY-1)",
          
          # Leverage
          "TR.F.NetCashFlowOp(Period=FY0)",
          "TR.F.NetCashFlowOp(Period=FY-1)",
          "TR.F.DebtTot(Period=FY0)",
          "TR.F.DebtTot(Period=FY-1)",
          "TR.F.DebtLTTot(Period=FY0)",
          "TR.F.DebtLTTot(Period=FY-1)",

          # Size
          "TR.CompanyMarketCap",

          #Value
          "TR.F.CF(Period=FY0)",
          "TR.NetIncome(Period=FY0)",
          "TR.Revenue(Period=FY0)",

          #Yield
          'TR.DividendYield'
          ]

After we have our fields defined, we proceed with requesting the financial data using get_data function from LD libraries. Since we will be sending multiple requests to the API, it is always advisable to wrap the call in try/except statements to catch any occasional communication errors and retry the call. Additionally, we utilize **header_type** parameter of the function and set it to be the NAME (default value being TITLE) of the field since we request same fields with different fiscal periods.

In [7]:
max_steps = 5

dataset = pd.DataFrame()
for row in ftse_grouped_m.itertuples():
    step = 0
    while step < max_steps:
        try:
            df = ld.get_data(row.RIC, fields=fields, parameters={'SDate': f'{row.Index.date()}'}, header_type=HeaderType.NAME)
            df['Date'] = row.Index.date()
            dataset = pd.concat([dataset, df])
            break
    
        except errors.LDError as e: 
            print("LDError message:", e.message)
            step +=1 
            time.sleep(5)
            continue
dataset

Unnamed: 0,Instrument,TR.TRBCECONOMICSECTOR,TR.TOTALRETURN1MO,TR.TOTALRETURN52WK,TR.F.ASSETTURNOVER(PERIOD=FY0),TR.F.ASSETTURNOVER(PERIOD=FY-1),TR.F.RETURNAVGTOTASSETSPCT(PERIOD=FY0),TR.F.TOTASSETS(PERIOD=FY0),TR.F.TOTASSETS(PERIOD=FY-1),TR.F.WKGCAP(PERIOD=FY0),TR.F.WKGCAP(PERIOD=FY-1),TR.F.TOTCURRASSETS(PERIOD=FY0),TR.F.TOTCURRASSETS(PERIOD=FY-1),TR.F.INVSTTOT(PERIOD=FY0),TR.F.INVSTTOT(PERIOD=FY-1),TR.F.CUSTADVTOT(PERIOD=FY0),TR.F.CUSTADVTOT(PERIOD=FY-1),TR.F.TOTLIAB(PERIOD=FY0),TR.F.TOTLIAB(PERIOD=FY-1),TR.F.TOTCURRLIAB(PERIOD=FY0),TR.F.TOTCURRLIAB(PERIOD=FY-1),TR.F.DEBTTOT(PERIOD=FY0),TR.F.DEBTTOT(PERIOD=FY-1),TR.F.DEBTLTTOT(PERIOD=FY0),TR.F.DEBTLTTOT(PERIOD=FY-1),TR.F.PREFSTOCKLIABPORTLT(PERIOD=FY0),TR.F.PREFSTOCKLIABPORTLT(PERIOD=FY-1),TR.F.CASHSTINVST(PERIOD=FY0),TR.F.CASHSTINVST(PERIOD=FY-1),TR.F.INVSTLT(PERIOD=FY0),TR.F.INVSTLT(PERIOD=FY-1),TR.F.NETCASHFLOWOP(PERIOD=FY0),TR.F.NETCASHFLOWOP(PERIOD=FY-1),TR.COMPANYMARKETCAP,TR.F.CF(PERIOD=FY0),TR.NETINCOME(PERIOD=FY0),TR.REVENUE(PERIOD=FY0),TR.DIVIDENDYIELD,Date
0,MMM.N,Consumer Non-Cyclicals,-4.342273,24.657480,1.122892,1.102114,13.176940,1.389600e+10,1.415300e+10,2.247000e+09,1.997000e+09,6.066000e+09,6.219000e+09,4.870000e+08,8.600000e+08,,,7.236000e+09,7.827000e+09,3.819000e+09,4.222000e+09,2.610000e+09,3.106000e+09,1.480000e+09,1.614000e+09,,,3.870000e+08,4.480000e+08,4.870000e+08,6.230000e+08,3.081000e+09,2.417000e+09,3.731793e+10,2.748000e+09,1.763000e+09,1.574800e+10,,2000-01-31
1,MO.N,Consumer Non-Cyclicals,-9.703504,-51.858338,1.018145,0.997920,12.862219,6.138100e+10,5.992000e+10,2.878000e+09,3.851000e+09,2.089500e+10,2.023000e+10,7.527000e+09,6.324000e+09,,,4.607600e+10,4.372300e+10,1.801700e+10,1.637900e+10,1.446800e+10,1.466200e+10,1.222600e+10,1.261500e+10,,,5.100000e+09,4.081000e+09,7.527000e+09,6.324000e+09,1.137500e+10,8.120000e+09,4.939315e+10,9.503000e+09,7.675000e+09,7.859600e+10,,2000-01-31
2,AXP.N,Financials,-0.713182,64.675631,0.154496,0.154955,1.797059,1.485170e+11,1.269330e+11,,,,,4.305200e+10,4.129900e+10,,,1.384220e+11,1.172350e+11,,,3.712200e+10,3.012400e+10,4.685000e+09,5.393000e+09,,,5.052300e+10,4.539100e+10,,,6.443000e+09,4.413000e+09,7.375415e+10,3.899000e+09,2.475000e+09,2.127800e+10,,2000-01-31
3,T.N^K05,Technology,3.940887,-13.080059,0.882308,0.885617,8.678354,5.955000e+10,6.109500e+10,-1.324000e+09,-5.400000e+08,1.411800e+10,1.677700e+10,4.434000e+09,4.173000e+09,,,3.391900e+10,3.741700e+10,1.544200e+10,1.731700e+10,6.727000e+09,1.194200e+10,5.556000e+09,7.857000e+09,,,3.160000e+09,6.250000e+08,4.434000e+09,3.866000e+09,1.662000e+10,7.952000e+09,1.686167e+11,9.864000e+09,3.428000e+09,5.497300e+10,,2000-01-31
4,T.N,Technology,-11.045604,-16.335122,0.626257,0.771957,8.310733,8.321500e+10,7.496600e+10,-7.383000e+09,-5.543000e+09,1.193000e+10,1.269700e+10,,,,,5.648900e+10,5.219200e+10,1.931300e+10,1.824000e+10,2.184900e+10,2.234800e+10,1.847500e+10,1.817000e+10,,,4.950000e+08,5.990000e+08,,,1.667400e+10,1.298100e+10,1.471952e+11,1.512600e+10,8.159000e+09,4.953100e+10,,2000-01-31
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25,DOW.N,Basic Materials,2.676720,4.784246,0.752669,0.920796,1.113266,5.796700e+10,6.060300e+10,7.657000e+09,9.146000e+09,1.761400e+10,2.047700e+10,2.740000e+09,2.793000e+09,,,3.885900e+10,3.935600e+10,9.957000e+09,1.133100e+10,1.508600e+10,1.542200e+10,1.490700e+10,1.469800e+10,,,2.987000e+09,3.886000e+09,2.740000e+09,2.793000e+09,5.164000e+09,7.486000e+09,3.817917e+10,3.271000e+09,5.890000e+08,4.462200e+10,5.140444,2024-07-31
26,AMGN.OQ,Healthcare,6.407425,48.809074,0.347435,0.416879,8.278540,9.715400e+10,6.512100e+10,1.194000e+10,6.499000e+09,3.033200e+10,2.218600e+10,0.000000e+00,1.676000e+09,,,9.092200e+10,6.146000e+10,1.839200e+10,1.568700e+10,6.461300e+10,3.894500e+10,6.317000e+10,3.735400e+10,,,1.094400e+10,9.305000e+09,,,8.471000e+09,9.721000e+09,1.783484e+11,1.078800e+10,6.717000e+09,2.819000e+10,2.707011,2024-07-31
27,CRM.N,Technology,0.820729,17.709931,0.350900,0.323120,4.163647,9.982300e+10,9.884900e+10,2.443000e+09,5.040000e+08,2.907400e+10,2.639500e+10,1.057000e+10,1.016400e+10,,,4.017700e+10,4.049000e+10,2.663100e+10,2.589100e+10,1.040000e+10,1.139200e+10,9.029000e+09,9.953000e+09,,,1.419400e+10,1.250800e+10,4.848000e+09,4.672000e+09,1.023400e+10,7.111000e+09,2.507772e+11,8.095000e+09,4.136000e+09,3.485700e+10,0.618238,2024-07-31
28,HON.OQ,Consumer Non-Cyclicals,-4.116325,8.497392,0.592278,0.559643,9.163166,6.152500e+10,6.227500e+10,4.963000e+09,5.044000e+09,2.350200e+10,2.498200e+10,1.109000e+09,1.428000e+09,,,4.509100e+10,4.495600e+10,1.853900e+10,1.993800e+10,2.044300e+10,1.957000e+10,1.656200e+10,1.512300e+10,,,8.095000e+09,1.011000e+10,9.390000e+08,9.450000e+08,5.340000e+09,5.274000e+09,1.333302e+11,6.848000e+09,5.658000e+09,3.666200e+10,2.109890,2024-07-31


## Building the factors
### Calculating factor constituents
Before calculating the actual factor values, we first calculate their components, for example profitability is comprised of Return on Asset (roa), Change in Asset Turnover (at_delta) and accruals(accruals). More about the factor components and the formulas can be found in [FTSE Global Factor Index Series Ground Rules](https://www.lseg.com/content/dam/ftse-russell/en_us/documents/ground-rules/ftse-global-factor-index-series-ground-rules.pdf).

In [8]:
dataset['at_delta'] = dataset['TR.F.ASSETTURNOVER(PERIOD=FY0)'] - dataset['TR.F.ASSETTURNOVER(PERIOD=FY-1)']
dataset['wk_delta'] = dataset['TR.F.WKGCAP(PERIOD=FY0)'] = dataset['TR.F.WKGCAP(PERIOD=FY-1)']
dataset['nco_delta'] = ((dataset['TR.F.TOTASSETS(PERIOD=FY0)'] - dataset['TR.F.TOTCURRASSETS(PERIOD=FY0)'] - dataset['TR.F.INVSTTOT(PERIOD=FY0)']) - (dataset['TR.F.TOTLIAB(PERIOD=FY0)'] - dataset['TR.F.TOTCURRLIAB(PERIOD=FY0)'] - dataset['TR.F.DEBTLTTOT(PERIOD=FY0)'])) - \
                  ((dataset['TR.F.TOTASSETS(PERIOD=FY-1)'] - dataset['TR.F.TOTCURRASSETS(PERIOD=FY-1)'] - dataset['TR.F.INVSTTOT(PERIOD=FY-1)']) - (dataset['TR.F.TOTLIAB(PERIOD=FY-1)'] - dataset['TR.F.TOTCURRLIAB(PERIOD=FY-1)'] - dataset['TR.F.DEBTLTTOT(PERIOD=FY-1)']))
dataset['fin_delta'] = (dataset['TR.F.INVSTTOT(PERIOD=FY0)'] - dataset['TR.F.DEBTTOT(PERIOD=FY0)']) - (dataset['TR.F.INVSTTOT(PERIOD=FY-1)'] - dataset['TR.F.DEBTTOT(PERIOD=FY-1)'])
dataset['average_ta'] = (dataset['TR.F.TOTASSETS(PERIOD=FY0)'] + dataset['TR.F.TOTASSETS(PERIOD=FY0)'])/2
dataset['accruals'] = -(dataset['wk_delta']+dataset['nco_delta'] + dataset['fin_delta'])/dataset['average_ta']
dataset['leverage'] = dataset['TR.F.NETCASHFLOWOP(PERIOD=FY0)']/dataset['TR.F.DEBTTOT(PERIOD=FY0)']

dataset['size'] = np.log(dataset['TR.COMPANYMARKETCAP'])

dataset['cf_yield'] = dataset['TR.F.CF(PERIOD=FY0)']/dataset['TR.COMPANYMARKETCAP']
dataset['earning_yield'] = dataset['TR.NETINCOME(PERIOD=FY0)']/dataset['TR.COMPANYMARKETCAP']
dataset['sales_to_price'] = dataset['TR.REVENUE(PERIOD=FY0)']/dataset['TR.COMPANYMARKETCAP']

dataset['TR.DIVIDENDYIELD'] = pd.to_numeric(dataset['TR.DIVIDENDYIELD'], errors='coerce')
dataset['yield'] = np.log(dataset['TR.DIVIDENDYIELD'])

dataset.rename(columns={'TR.TOTALRETURN1MO': 'total_return_1m', 
                        'TR.TOTALRETURN52WK': 'momentum', 
                        'TR.F.RETURNAVGTOTASSETSPCT(PERIOD=FY0)': 'roa',
                        'TR.TRBCECONOMICSECTOR': 'sector'}, inplace=True)
dataset

Unnamed: 0,Instrument,sector,total_return_1m,momentum,TR.F.ASSETTURNOVER(PERIOD=FY0),TR.F.ASSETTURNOVER(PERIOD=FY-1),roa,TR.F.TOTASSETS(PERIOD=FY0),TR.F.TOTASSETS(PERIOD=FY-1),TR.F.WKGCAP(PERIOD=FY0),TR.F.WKGCAP(PERIOD=FY-1),TR.F.TOTCURRASSETS(PERIOD=FY0),TR.F.TOTCURRASSETS(PERIOD=FY-1),TR.F.INVSTTOT(PERIOD=FY0),TR.F.INVSTTOT(PERIOD=FY-1),TR.F.CUSTADVTOT(PERIOD=FY0),TR.F.CUSTADVTOT(PERIOD=FY-1),TR.F.TOTLIAB(PERIOD=FY0),TR.F.TOTLIAB(PERIOD=FY-1),TR.F.TOTCURRLIAB(PERIOD=FY0),TR.F.TOTCURRLIAB(PERIOD=FY-1),TR.F.DEBTTOT(PERIOD=FY0),TR.F.DEBTTOT(PERIOD=FY-1),TR.F.DEBTLTTOT(PERIOD=FY0),TR.F.DEBTLTTOT(PERIOD=FY-1),TR.F.PREFSTOCKLIABPORTLT(PERIOD=FY0),TR.F.PREFSTOCKLIABPORTLT(PERIOD=FY-1),TR.F.CASHSTINVST(PERIOD=FY0),TR.F.CASHSTINVST(PERIOD=FY-1),TR.F.INVSTLT(PERIOD=FY0),TR.F.INVSTLT(PERIOD=FY-1),TR.F.NETCASHFLOWOP(PERIOD=FY0),TR.F.NETCASHFLOWOP(PERIOD=FY-1),TR.COMPANYMARKETCAP,TR.F.CF(PERIOD=FY0),TR.NETINCOME(PERIOD=FY0),TR.REVENUE(PERIOD=FY0),TR.DIVIDENDYIELD,Date,at_delta,wk_delta,nco_delta,fin_delta,average_ta,accruals,leverage,size,cf_yield,earning_yield,sales_to_price,yield
0,MMM.N,Consumer Non-Cyclicals,-4.342273,24.657480,1.122892,1.102114,13.176940,1.389600e+10,1.415300e+10,1.997000e+09,1.997000e+09,6.066000e+09,6.219000e+09,4.870000e+08,8.600000e+08,,,7.236000e+09,7.827000e+09,3.819000e+09,4.222000e+09,2.610000e+09,3.106000e+09,1.480000e+09,1.614000e+09,,,3.870000e+08,4.480000e+08,4.870000e+08,6.230000e+08,3.081000e+09,2.417000e+09,3.731793e+10,2.748000e+09,1.763000e+09,1.574800e+10,,2000-01-31,0.020778,1.997000e+09,3.230000e+08,1.230000e+08,1.389600e+10,-0.175806,1.180460,24.342740,0.073638,0.047243,0.421996,
1,MO.N,Consumer Non-Cyclicals,-9.703504,-51.858338,1.018145,0.997920,12.862219,6.138100e+10,5.992000e+10,3.851000e+09,3.851000e+09,2.089500e+10,2.023000e+10,7.527000e+09,6.324000e+09,,,4.607600e+10,4.372300e+10,1.801700e+10,1.637900e+10,1.446800e+10,1.466200e+10,1.222600e+10,1.261500e+10,,,5.100000e+09,4.081000e+09,7.527000e+09,6.324000e+09,1.137500e+10,8.120000e+09,4.939315e+10,9.503000e+09,7.675000e+09,7.859600e+10,,2000-01-31,0.020225,3.851000e+09,-1.511000e+09,1.397000e+09,6.138100e+10,-0.060882,0.786218,24.623078,0.192395,0.155386,1.591233,
2,AXP.N,Financials,-0.713182,64.675631,0.154496,0.154955,1.797059,1.485170e+11,1.269330e+11,,,,,4.305200e+10,4.129900e+10,,,1.384220e+11,1.172350e+11,,,3.712200e+10,3.012400e+10,4.685000e+09,5.393000e+09,,,5.052300e+10,4.539100e+10,,,6.443000e+09,4.413000e+09,7.375415e+10,3.899000e+09,2.475000e+09,2.127800e+10,,2000-01-31,-0.000459,,,-5.245000e+09,1.485170e+11,,0.173563,25.024003,0.052865,0.033557,0.288499,
3,T.N^K05,Technology,3.940887,-13.080059,0.882308,0.885617,8.678354,5.955000e+10,6.109500e+10,-5.400000e+08,-5.400000e+08,1.411800e+10,1.677700e+10,4.434000e+09,4.173000e+09,,,3.391900e+10,3.741700e+10,1.544200e+10,1.731700e+10,6.727000e+09,1.194200e+10,5.556000e+09,7.857000e+09,,,3.160000e+09,6.250000e+08,4.434000e+09,3.866000e+09,1.662000e+10,7.952000e+09,1.686167e+11,9.864000e+09,3.428000e+09,5.497300e+10,,2000-01-31,-0.003309,-5.400000e+08,1.750000e+08,5.476000e+09,5.955000e+10,-0.085827,2.470641,25.850894,0.058500,0.020330,0.326024,
4,T.N,Technology,-11.045604,-16.335122,0.626257,0.771957,8.310733,8.321500e+10,7.496600e+10,-5.543000e+09,-5.543000e+09,1.193000e+10,1.269700e+10,,,,,5.648900e+10,5.219200e+10,1.931300e+10,1.824000e+10,2.184900e+10,2.234800e+10,1.847500e+10,1.817000e+10,,,4.950000e+08,5.990000e+08,,,1.667400e+10,1.298100e+10,1.471952e+11,1.512600e+10,8.159000e+09,4.953100e+10,,2000-01-31,-0.145700,-5.543000e+09,,,8.321500e+10,,0.763147,25.715026,0.102761,0.055430,0.336499,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25,DOW.N,Basic Materials,2.676720,4.784246,0.752669,0.920796,1.113266,5.796700e+10,6.060300e+10,9.146000e+09,9.146000e+09,1.761400e+10,2.047700e+10,2.740000e+09,2.793000e+09,,,3.885900e+10,3.935600e+10,9.957000e+09,1.133100e+10,1.508600e+10,1.542200e+10,1.490700e+10,1.469800e+10,,,2.987000e+09,3.886000e+09,2.740000e+09,2.793000e+09,5.164000e+09,7.486000e+09,3.817917e+10,3.271000e+09,5.890000e+08,4.462200e+10,5.140444,2024-07-31,-0.168127,9.146000e+09,-3.880000e+08,2.830000e+08,5.796700e+10,-0.155968,0.342304,24.365556,0.085675,0.015427,1.168753,1.637140
26,AMGN.OQ,Healthcare,6.407425,48.809074,0.347435,0.416879,8.278540,9.715400e+10,6.512100e+10,6.499000e+09,6.499000e+09,3.033200e+10,2.218600e+10,0.000000e+00,1.676000e+09,,,9.092200e+10,6.146000e+10,1.839200e+10,1.568700e+10,6.461300e+10,3.894500e+10,6.317000e+10,3.735400e+10,,,1.094400e+10,9.305000e+09,,,8.471000e+09,9.721000e+09,1.783484e+11,1.078800e+10,6.717000e+09,2.819000e+10,2.707011,2024-07-31,-0.069444,6.499000e+09,2.462200e+10,-2.734400e+10,9.715400e+10,-0.038876,0.131104,25.907005,0.060488,0.037662,0.158061,0.995845
27,CRM.N,Technology,0.820729,17.709931,0.350900,0.323120,4.163647,9.982300e+10,9.884900e+10,5.040000e+08,5.040000e+08,2.907400e+10,2.639500e+10,1.057000e+10,1.016400e+10,,,4.017700e+10,4.049000e+10,2.663100e+10,2.589100e+10,1.040000e+10,1.139200e+10,9.029000e+09,9.953000e+09,,,1.419400e+10,1.250800e+10,4.848000e+09,4.672000e+09,1.023400e+10,7.111000e+09,2.507772e+11,8.095000e+09,4.136000e+09,3.485700e+10,0.618238,2024-07-31,0.027780,5.040000e+08,-1.982000e+09,1.398000e+09,9.982300e+10,0.000801,0.984038,26.247831,0.032280,0.016493,0.138996,-0.480882
28,HON.OQ,Consumer Non-Cyclicals,-4.116325,8.497392,0.592278,0.559643,9.163166,6.152500e+10,6.227500e+10,5.044000e+09,5.044000e+09,2.350200e+10,2.498200e+10,1.109000e+09,1.428000e+09,,,4.509100e+10,4.495600e+10,1.853900e+10,1.993800e+10,2.044300e+10,1.957000e+10,1.656200e+10,1.512300e+10,,,8.095000e+09,1.011000e+10,9.390000e+08,9.450000e+08,5.340000e+09,5.274000e+09,1.333302e+11,6.848000e+09,5.658000e+09,3.666200e+10,2.109890,2024-07-31,0.032634,5.044000e+09,9.540000e+08,-1.192000e+09,6.152500e+10,-0.078115,0.261214,25.616095,0.051361,0.042436,0.274971,0.746636


Next, we select only the factor components and apply data transformations, such as replacing inf with nan and making sure all number fields are numeric. This step is necessary for downstream feature normalization.

In [9]:
factor_components = ['Date', 'Instrument', 'sector', 'total_return_1m', 'momentum', 'roa', 'at_delta', 'accruals', 'leverage', 'size', 'cf_yield', 'earning_yield', 'sales_to_price', 'yield']
df_factor_components = dataset[factor_components]

df_factor_components.replace([np.inf, -np.inf], np.nan, inplace=True)
for col in factor_components[3:]:
    df_factor_components[col] = pd.to_numeric(df_factor_components[col], errors='coerce').astype('float64')
df_factor_components

Unnamed: 0,Date,Instrument,sector,total_return_1m,momentum,roa,at_delta,accruals,leverage,size,cf_yield,earning_yield,sales_to_price,yield
0,2000-01-31,MMM.N,Consumer Non-Cyclicals,-4.342273,24.657480,13.176940,0.020778,-0.175806,1.180460,24.342740,0.073638,0.047243,0.421996,
1,2000-01-31,MO.N,Consumer Non-Cyclicals,-9.703504,-51.858338,12.862219,0.020225,-0.060882,0.786218,24.623078,0.192395,0.155386,1.591233,
2,2000-01-31,AXP.N,Financials,-0.713182,64.675631,1.797059,-0.000459,,0.173563,25.024003,0.052865,0.033557,0.288499,
3,2000-01-31,T.N^K05,Technology,3.940887,-13.080059,8.678354,-0.003309,-0.085827,2.470641,25.850894,0.058500,0.020330,0.326024,
4,2000-01-31,T.N,Technology,-11.045604,-16.335122,8.310733,-0.145700,,0.763147,25.715026,0.102761,0.055430,0.336499,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25,2024-07-31,DOW.N,Basic Materials,2.676720,4.784246,1.113266,-0.168127,-0.155968,0.342304,24.365556,0.085675,0.015427,1.168753,1.637140
26,2024-07-31,AMGN.OQ,Healthcare,6.407425,48.809074,8.278540,-0.069444,-0.038876,0.131104,25.907005,0.060488,0.037662,0.158061,0.995845
27,2024-07-31,CRM.N,Technology,0.820729,17.709931,4.163647,0.027780,0.000801,0.984038,26.247831,0.032280,0.016493,0.138996,-0.480882
28,2024-07-31,HON.OQ,Consumer Non-Cyclicals,-4.116325,8.497392,9.163166,0.032634,-0.078115,0.261214,25.616095,0.051361,0.042436,0.274971,0.746636


After, we normalize the ratios using StandardScaler (which uses z-score), smooth outliers out by clipping values and re-normalize in the range [-3, 3]. Again, the methodology is coming from [FTSE Global Factor Index Series Ground Rules](https://www.lseg.com/content/dam/ftse-russell/en_us/documents/ground-rules/ftse-global-factor-index-series-ground-rules.pdf). For ease of use, we encapsulated normalization and outlier handling functions into separate functions:

In [10]:
def handle_outliers(series: np.ndarray, scaler:StandardScaler) -> np.ndarray:
    #define the condition
    condition = np.any((series > 3) | (series < -3))

    # Handles outliers by clipping values and re-normalizing until within the condition range
    while condition:
        series = np.clip(series, -3, 3)
        series = scaler.fit_transform(series)
        condition = np.any((series > 3) | (series < -3))
    return series

In [11]:
def normalize_factors(df: pd.DataFrame, factor_columns: list) -> pd.DataFrame:
    scaler = StandardScaler()
    df_norm = pd.DataFrame()

    # Iterate over each unique date to normalize factor columns
    for date in df['Date'].unique():
        date_df = df[df['Date'] == date]

        # Normalize each specified factor column and handle outliers
        for col in factor_columns:
            normalized_series = scaler.fit_transform(np.array(date_df[col]).reshape(-1, 1))
            normalized_series = handle_outliers(normalized_series, scaler)
            date_df[f"{col}_normalized"] = normalized_series
        
        # Combine normalized data back into a single DataFrame
        df_norm = pd.concat([df_norm, date_df])
    
    return df_norm

In [12]:
df_factor_components_norm = normalize_factors(df_factor_components, factor_components[4:])
df_factor_components_norm

Unnamed: 0,Date,Instrument,sector,total_return_1m,momentum,roa,at_delta,accruals,leverage,size,cf_yield,earning_yield,sales_to_price,yield,momentum_normalized,roa_normalized,at_delta_normalized,accruals_normalized,leverage_normalized,size_normalized,cf_yield_normalized,earning_yield_normalized,sales_to_price_normalized,yield_normalized
0,2000-01-31,MMM.N,Consumer Non-Cyclicals,-4.342273,24.657480,13.176940,0.020778,-0.175806,1.180460,24.342740,0.073638,0.047243,0.421996,,0.342508,0.464460,0.208430,-0.754916,0.302763,-0.791365,0.073323,-0.069223,-0.354106,
1,2000-01-31,MO.N,Consumer Non-Cyclicals,-9.703504,-51.858338,12.862219,0.020225,-0.060882,0.786218,24.623078,0.192395,0.155386,1.591233,,-2.624165,0.424160,0.204689,0.132914,-0.122834,-0.509183,2.621721,2.617490,1.926674,
2,2000-01-31,AXP.N,Financials,-0.713182,64.675631,1.797059,-0.000459,,0.173563,25.024003,0.052865,0.033557,0.288499,,1.894093,-0.992735,0.064834,,-0.784215,-0.105620,-0.372435,-0.409220,-0.614512,
3,2000-01-31,T.N^K05,Technology,3.940887,-13.080059,8.678354,-0.003309,-0.085827,2.470641,25.850894,0.058500,0.020330,0.326024,,-1.120652,-0.111585,0.045560,-0.059795,1.695557,0.726710,-0.251520,-0.737840,-0.541314,
4,2000-01-31,T.N,Technology,-11.045604,-16.335122,8.310733,-0.145700,,0.763147,25.715026,0.102761,0.055430,0.336499,,-1.246858,-0.158658,-0.917224,,-0.147740,0.589948,0.698289,0.134177,-0.520881,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25,2024-07-31,DOW.N,Basic Materials,2.676720,4.784246,1.113266,-0.168127,-0.155968,0.342304,24.365556,0.085675,0.015427,1.168753,1.637140,-0.610420,-0.809415,-2.496570,-0.486696,-0.507190,-1.879853,0.751702,-0.750372,2.636900,1.550514
26,2024-07-31,AMGN.OQ,Healthcare,6.407425,48.809074,8.278540,-0.069444,-0.038876,0.131104,25.907005,0.060488,0.037662,0.158061,0.995845,1.496194,0.079318,-1.005413,0.437197,-0.926344,-0.371732,0.113761,0.028746,-0.844534,0.536771
27,2024-07-31,CRM.N,Technology,0.820729,17.709931,4.163647,0.027780,0.000801,0.984038,26.247831,0.032280,0.016493,0.138996,-0.480882,0.008082,-0.431066,0.463704,0.750269,0.766414,-0.038275,-0.600724,-0.713038,-0.910207,-1.797604
28,2024-07-31,HON.OQ,Consumer Non-Cyclicals,-4.116325,8.497392,9.163166,0.032634,-0.078115,0.261214,25.616095,0.051361,0.042436,0.274971,0.746636,-0.432744,0.189041,0.537057,0.127595,-0.668123,-0.656352,-0.117416,0.196020,-0.441825,0.142827


While momentum, leverage, size and yield have single components, to calculate profitability, quality and value factors we should take into account several of their constituent factor sub-values. Moreover, stocks that are classified as financials, utilise ROA as the sole measure of quality as quality measures such as operating cash flow and accruals cannot meaningfully be calculated or may not be applicable to financial companies. Below, we calculate profitability, quality and value factors as the mean of the factor sub-components.

In [13]:
df_factor_components_norm['profitability_normalized'] = df_factor_components_norm[['roa_normalized', 'at_delta_normalized', 'accruals_normalized']].mean(axis=1, skipna=True)
df_factor_components_norm['quality_normalized'] = np.where(df_factor_components_norm['sector'] == 'Financials', df_factor_components_norm['roa'], 
                                                           df_factor_components_norm[['profitability_normalized', 'leverage_normalized']].mean(axis=1, skipna=True))
df_factor_components_norm['value_normalized'] = df_factor_components_norm[['cf_yield_normalized', 'earning_yield_normalized', 'sales_to_price_normalized']].mean(axis=1, skipna=True)
df_factor_components_norm

Unnamed: 0,Date,Instrument,sector,total_return_1m,momentum,roa,at_delta,accruals,leverage,size,cf_yield,earning_yield,sales_to_price,yield,momentum_normalized,roa_normalized,at_delta_normalized,accruals_normalized,leverage_normalized,size_normalized,cf_yield_normalized,earning_yield_normalized,sales_to_price_normalized,yield_normalized,profitability_normalized,quality_normalized,value_normalized
0,2000-01-31,MMM.N,Consumer Non-Cyclicals,-4.342273,24.657480,13.176940,0.020778,-0.175806,1.180460,24.342740,0.073638,0.047243,0.421996,,0.342508,0.464460,0.208430,-0.754916,0.302763,-0.791365,0.073323,-0.069223,-0.354106,,-0.027342,0.137711,-0.116669
1,2000-01-31,MO.N,Consumer Non-Cyclicals,-9.703504,-51.858338,12.862219,0.020225,-0.060882,0.786218,24.623078,0.192395,0.155386,1.591233,,-2.624165,0.424160,0.204689,0.132914,-0.122834,-0.509183,2.621721,2.617490,1.926674,,0.253921,0.065544,2.388629
2,2000-01-31,AXP.N,Financials,-0.713182,64.675631,1.797059,-0.000459,,0.173563,25.024003,0.052865,0.033557,0.288499,,1.894093,-0.992735,0.064834,,-0.784215,-0.105620,-0.372435,-0.409220,-0.614512,,-0.463951,1.797059,-0.465389
3,2000-01-31,T.N^K05,Technology,3.940887,-13.080059,8.678354,-0.003309,-0.085827,2.470641,25.850894,0.058500,0.020330,0.326024,,-1.120652,-0.111585,0.045560,-0.059795,1.695557,0.726710,-0.251520,-0.737840,-0.541314,,-0.041940,0.826809,-0.510225
4,2000-01-31,T.N,Technology,-11.045604,-16.335122,8.310733,-0.145700,,0.763147,25.715026,0.102761,0.055430,0.336499,,-1.246858,-0.158658,-0.917224,,-0.147740,0.589948,0.698289,0.134177,-0.520881,,-0.537941,-0.342841,0.103862
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25,2024-07-31,DOW.N,Basic Materials,2.676720,4.784246,1.113266,-0.168127,-0.155968,0.342304,24.365556,0.085675,0.015427,1.168753,1.637140,-0.610420,-0.809415,-2.496570,-0.486696,-0.507190,-1.879853,0.751702,-0.750372,2.636900,1.550514,-1.264227,-0.885708,0.879410
26,2024-07-31,AMGN.OQ,Healthcare,6.407425,48.809074,8.278540,-0.069444,-0.038876,0.131104,25.907005,0.060488,0.037662,0.158061,0.995845,1.496194,0.079318,-1.005413,0.437197,-0.926344,-0.371732,0.113761,0.028746,-0.844534,0.536771,-0.162966,-0.544655,-0.234009
27,2024-07-31,CRM.N,Technology,0.820729,17.709931,4.163647,0.027780,0.000801,0.984038,26.247831,0.032280,0.016493,0.138996,-0.480882,0.008082,-0.431066,0.463704,0.750269,0.766414,-0.038275,-0.600724,-0.713038,-0.910207,-1.797604,0.260969,0.513692,-0.741323
28,2024-07-31,HON.OQ,Consumer Non-Cyclicals,-4.116325,8.497392,9.163166,0.032634,-0.078115,0.261214,25.616095,0.051361,0.042436,0.274971,0.746636,-0.432744,0.189041,0.537057,0.127595,-0.668123,-0.656352,-0.117416,0.196020,-0.441825,0.142827,0.284564,-0.191779,-0.121074


Finally, we keep only the final factor columns from our dataset.

In [14]:
factor_cols = ['momentum_normalized','quality_normalized', 'profitability_normalized', 'leverage_normalized', 'size_normalized', 'value_normalized', 'yield_normalized']
df_factors = df_factor_components_norm[['Instrument', 'Date', 'total_return_1m'] + factor_cols]
df_factors

Unnamed: 0,Instrument,Date,total_return_1m,momentum_normalized,quality_normalized,profitability_normalized,leverage_normalized,size_normalized,value_normalized,yield_normalized
0,MMM.N,2000-01-31,-4.342273,0.342508,0.137711,-0.027342,0.302763,-0.791365,-0.116669,
1,MO.N,2000-01-31,-9.703504,-2.624165,0.065544,0.253921,-0.122834,-0.509183,2.388629,
2,AXP.N,2000-01-31,-0.713182,1.894093,1.797059,-0.463951,-0.784215,-0.105620,-0.465389,
3,T.N^K05,2000-01-31,3.940887,-1.120652,0.826809,-0.041940,1.695557,0.726710,-0.510225,
4,T.N,2000-01-31,-11.045604,-1.246858,-0.342841,-0.537941,-0.147740,0.589948,0.103862,
...,...,...,...,...,...,...,...,...,...,...
25,DOW.N,2024-07-31,2.676720,-0.610420,-0.885708,-1.264227,-0.507190,-1.879853,0.879410,1.550514
26,AMGN.OQ,2024-07-31,6.407425,1.496194,-0.544655,-0.162966,-0.926344,-0.371732,-0.234009,0.536771
27,CRM.N,2024-07-31,0.820729,0.008082,0.513692,0.260969,0.766414,-0.038275,-0.741323,-1.797604
28,HON.OQ,2024-07-31,-4.116325,-0.432744,-0.191779,0.284564,-0.668123,-0.656352,-0.121074,0.142827


One last step before we finalise the factor dataset is to handle the missing values. Following the FTSE Ground rules, for all factors with the exception of yield, stocks with missing factor data are allocated a neutral Z-score of 0 after the normalisation procedure. For Yield missing values are assigned a Z-score of -3.

In [15]:
for col in factor_cols:
    if col == 'yield_normalized':
        df_factors[col].fillna(-3, inplace=True)
    else:
        df_factors[col].fillna(0, inplace=True)
df_factors

Unnamed: 0,Instrument,Date,total_return_1m,momentum_normalized,quality_normalized,profitability_normalized,leverage_normalized,size_normalized,value_normalized,yield_normalized
0,MMM.N,2000-01-31,-4.342273,0.342508,0.137711,-0.027342,0.302763,-0.791365,-0.116669,-3.000000
1,MO.N,2000-01-31,-9.703504,-2.624165,0.065544,0.253921,-0.122834,-0.509183,2.388629,-3.000000
2,AXP.N,2000-01-31,-0.713182,1.894093,1.797059,-0.463951,-0.784215,-0.105620,-0.465389,-3.000000
3,T.N^K05,2000-01-31,3.940887,-1.120652,0.826809,-0.041940,1.695557,0.726710,-0.510225,-3.000000
4,T.N,2000-01-31,-11.045604,-1.246858,-0.342841,-0.537941,-0.147740,0.589948,0.103862,-3.000000
...,...,...,...,...,...,...,...,...,...,...
25,DOW.N,2024-07-31,2.676720,-0.610420,-0.885708,-1.264227,-0.507190,-1.879853,0.879410,1.550514
26,AMGN.OQ,2024-07-31,6.407425,1.496194,-0.544655,-0.162966,-0.926344,-0.371732,-0.234009,0.536771
27,CRM.N,2024-07-31,0.820729,0.008082,0.513692,0.260969,0.766414,-0.038275,-0.741323,-1.797604
28,HON.OQ,2024-07-31,-4.116325,-0.432744,-0.191779,0.284564,-0.668123,-0.656352,-0.121074,0.142827


### Factor-Based Weighting

After we acquire the normalized factor values, we proceed by defining a function to calculate weights for each stock across all factors for every month. This function, when systematically applied generates new columns that contain the calculated weights for each stock per month, corresponding to each factor.

In [16]:
def get_factor_based_weights(series: pd.Series) -> pd.Series:
    return np.exp(series) / np.sum(np.exp(series))

In [17]:
df_factors_and_weights = pd.DataFrame()

for date in df_factors['Date'].unique():
    date_df = df_factors[df_factors['Date'] == date]
    for factor in factor_cols:
        factor_based_weights = get_factor_based_weights(date_df[factor])
        date_df[f"weight_{factor.split('_')[0]}_based"] = factor_based_weights
    df_factors_and_weights = pd.concat([df_factors_and_weights, date_df])
    
df_factors_and_weights.set_index('Date', inplace=True)
df_factors_and_weights

Unnamed: 0_level_0,Instrument,total_return_1m,momentum_normalized,quality_normalized,profitability_normalized,leverage_normalized,size_normalized,value_normalized,yield_normalized,weight_momentum_based,weight_quality_based,weight_profitability_based,weight_leverage_based,weight_size_based,weight_value_based,weight_yield_based
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
2000-01-31,MMM.N,-4.342273,0.342508,0.137711,-0.027342,0.302763,-0.791365,-0.116669,-3.000000,0.028905,0.019469,0.023163,0.019922,0.009600,0.013439,0.033333
2000-01-31,MO.N,-9.703504,-2.624165,0.065544,0.253921,-0.122834,-0.509183,2.388629,-3.000000,0.001488,0.018114,0.030686,0.013016,0.012730,0.164589,0.033333
2000-01-31,AXP.N,-0.713182,1.894093,1.797059,-0.463951,-0.784215,-0.105620,-0.465389,-3.000000,0.136402,0.102329,0.014968,0.006718,0.019059,0.009482,0.033333
2000-01-31,T.N^K05,3.940887,-1.120652,0.826809,-0.041940,1.695557,0.726710,-0.510225,-3.000000,0.006692,0.038781,0.022827,0.080206,0.043810,0.009067,0.033333
2000-01-31,T.N,-11.045604,-1.246858,-0.342841,-0.537941,-0.147740,0.589948,0.103862,-3.000000,0.005898,0.012041,0.013901,0.012696,0.038210,0.016755,0.033333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-07-31,DOW.N,2.676720,-0.610420,-0.885708,-1.264227,-0.507190,-1.879853,0.879410,1.550514,0.011405,0.005052,0.008038,0.009866,0.002707,0.058473,0.109795
2024-07-31,AMGN.OQ,6.407425,1.496194,-0.544655,-0.162966,-0.926344,-0.371732,-0.234009,0.536771,0.093751,0.007105,0.024178,0.006488,0.012233,0.019205,0.039840
2024-07-31,CRM.N,0.820729,0.008082,0.513692,0.260969,0.766414,-0.038275,-0.741323,-1.797604,0.021169,0.020474,0.036943,0.035258,0.017075,0.011563,0.003859
2024-07-31,HON.OQ,-4.116325,-0.432744,-0.191779,0.284564,-0.668123,-0.656352,-0.121074,0.142827,0.013622,0.010112,0.037825,0.008399,0.009203,0.021501,0.026868


## Building the feature set
### Factor-based features

To create our factor-based feature set, we first compute weighted values for each factor by multiplying with its corresponding weight. Additionally, we calculate the weighted 1-month returns by multiplying the total 1-month return by the same factor weights. Finally, we aggregate these weighted values and returns by month, summing them up to produce a consolidated dataset. This aggregation results in a new DataFrame, where each column represents the summed weighted values and returns for each factor, grouped by date.

In [18]:
factors  = df_factors_and_weights.columns[2:9]
weights = df_factors_and_weights.columns[9:]

for factor, weight in zip(factors, weights):
    df_factors_and_weights[f'{factor}_weighted'] = df_factors_and_weights[factor] * df_factors_and_weights[weight]
    df_factors_and_weights[f'total_return_1m_{factor.split("_")[0]}'] = df_factors_and_weights['total_return_1m'] * df_factors_and_weights[weight]

new_columns_factors = [f'{factor}_weighted' for factor in factors]
new_columns_returns = [f'total_return_1m_{factor.split("_")[0]}' for factor in factors]

feature_df = df_factors_and_weights[new_columns_factors + new_columns_returns].groupby(df_factors_and_weights.index).sum()

feature_df

Unnamed: 0_level_0,momentum_normalized_weighted,quality_normalized_weighted,profitability_normalized_weighted,leverage_normalized_weighted,size_normalized_weighted,value_normalized_weighted,yield_normalized_weighted,total_return_1m_momentum,total_return_1m_quality,total_return_1m_profitability,total_return_1m_leverage,total_return_1m_size,total_return_1m_value,total_return_1m_yield
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
2000-01-31,0.945405,1.240297,0.819040,1.879267,0.841735,1.687445,-3.000000,-4.493394,3.659980,1.033077,9.302974,-3.740018,-0.811344,-3.951713
2000-02-29,1.497460,1.251082,0.822014,1.916245,0.728465,1.553215,-3.000000,7.450206,-8.212941,-10.100703,-1.277309,-3.226533,-8.806451,-5.929975
2000-03-31,1.482371,1.252198,0.828882,1.914239,0.910537,1.164446,-3.000000,10.255223,7.103758,4.684157,8.616552,11.526238,7.288108,8.125907
2000-04-30,1.561126,1.252198,0.828882,1.914239,0.868652,1.210611,-3.000000,-0.878634,0.038632,1.131502,0.245629,-4.260366,-0.194067,-2.067652
2000-05-31,1.474937,1.252198,0.828882,1.914239,0.865280,1.276771,-3.000000,-0.865824,2.761453,2.831635,0.363008,-0.142014,-4.067952,-0.706639
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-03-31,1.043473,1.908908,0.277694,1.577512,1.308560,0.635856,-3.000000,3.743506,3.168656,1.727549,2.261306,0.808660,3.554203,2.520541
2024-04-30,1.163670,1.908908,0.277694,1.577512,1.296577,0.604652,0.784668,-4.327978,-2.023432,-4.659949,-3.954715,-4.185906,-4.282108,-2.919605
2024-05-31,0.914941,1.908908,0.277694,1.577512,1.330237,0.589844,0.765439,3.996421,2.509475,2.698176,1.382141,5.223667,3.286129,2.660367
2024-06-30,0.833786,1.908769,0.295397,1.565721,1.339590,0.589984,-3.000000,3.030338,-0.036403,2.875846,2.402459,5.624519,0.746210,1.981235


We now have an initial set of features, including the factor-based total 1-month returns, which will ultimately serve as our target variables. Before proceeding, let's enrich our dataset by introducing additional features.

### Macro economic features

Macro economic indicators, such as GDP growth, unemployment rates, inflation etc, are important indicators that reflect the overall health of the economy. As different factors perform better under varying economic conditions, we are using a set of those indicators as initial features for our model. Below, we define some of those for the US which we believe is a good starting point. The full list of macroeconomic indicators can be found in ECONOMIND app in LSEG Workspace.

In [19]:
macro_rics = [
'USCPI=ECI',
'USUNR=ECI', 
'USGPCS=ECI', 
'USLEAD=ECI',
'USCPF=ECI', 
'USGDPF=ECI',
'USFOUT=ECI',
]

In [20]:
macro_ind_names = ld.get_data(macro_rics, fields=['DSPLY_NAME'])
macro_ind_names

Unnamed: 0,Instrument,DSPLY_NAME
0,USCPI=ECI,US CPI mm
1,USUNR=ECI,US Unemployment
2,USGPCS=ECI,US Consumpn SA
3,USLEAD=ECI,US Lead indic'rs
4,USCPF=ECI,"US Core CPI mm,"
5,USGDPF=ECI,US GDP Final
6,USFOUT=ECI,US Manuf Output


We also request historical data for the fields above covering our observation period. Additionally, we use a function to align dates with our feature dataframe (feature_df) for subsequent merging.

In [21]:
def allign_dates(ind_df: pd.DataFrame, df_to_allign_with:pd.DataFrame)->pd.DataFrame:
    ind_df.columns = ind_df.columns.get_level_values(0)
    ind_df['Dates'] = df_to_allign_with.index
    ind_df.set_index('Dates', inplace=True)
    return ind_df

In [22]:
start_date_macro = pd.to_datetime(start_date) - timedelta(days=30)
macro = ld.get_history(macro_rics, start=start_date_macro, end=end_date).ffill()[1:]
macro = allign_dates(macro, feature_df)
macro

VALUE,USCPI=ECI,USUNR=ECI,USGPCS=ECI,USLEAD=ECI,USCPF=ECI,USGDPF=ECI,USFOUT=ECI
Dates,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2000-01-31,0.3,4.0,0.0,0.8,0.3,7.3,0.0
2000-02-29,0.4,4.1,1.3,-0.6,0.1,7.3,0.2
2000-03-31,0.6,4.0,0.9,1.0,0.3,5.5,0.7
2000-04-30,-0.1,3.8,-0.1,0.1,0.2,5.5,0.6
2000-05-31,0.2,4.0,0.5,-0.9,0.2,5.5,0.0
...,...,...,...,...,...,...,...
2024-03-31,0.4,3.8,0.7,-0.3,0.4,1.4,0.2
2024-04-30,0.3,3.9,0.2,-0.6,0.3,1.4,-0.5
2024-05-31,0.0,4.0,0.4,-0.5,0.2,1.4,0.8
2024-06-30,-0.1,4.1,0.3,-0.2,0.1,1.4,0.0


### Currency feature

Currency fluctuations impact international trade and the earnings of companies with global exposure, therefore currency is another factor we have considered for the model. In this research we will be looking into the EURUSD pair.

In [23]:
curs = ['EUR=']
cur_prices = ld.get_history(curs, fields='BID', start=start_date, end='2024-08-31', interval='monthly')
cur_prices_change = cur_prices.pct_change().dropna()
cur_prices_change = allign_dates(cur_prices_change, feature_df)
cur_prices_change

EUR=,BID
Dates,Unnamed: 1_level_1
2000-01-31,-0.004849
2000-02-29,-0.008916
2000-03-31,-0.046025
2000-04-30,0.02818
2000-05-31,0.01557
...,...
2024-03-31,-0.01186
2024-04-30,0.016503
2024-05-31,-0.011807
2024-06-30,0.010455


### Market indices features

Another set of factors we believe can impact the factor performance are the market indicators, such as S&P 500(SPX) and the Volatility index(VIX).The SPX performance and VIX together capture market sentiment and volatility. In bull markets with low volatility, momentum and size factors may often outperform. During high volatility or bear markets, value, quality, and yield factors become more attractive as investment mediums.

In [24]:
ind = ['.VIX', '.SPX']
ind_prices = ld.get_history(ind, fields=["TR.PriceClose"], start='2000-01-01', end='2024-08-31', interval='monthly')
ind_prices_change = ind_prices.pct_change().dropna()
ind_prices_change = allign_dates(ind_prices_change, feature_df)
ind_prices_change

Price Close,.VIX,.SPX
Dates,Unnamed: 1_level_1,Unnamed: 2_level_1
2000-01-31,-0.063327,-0.020108
2000-02-29,0.031665,0.09672
2000-03-31,0.086686,-0.030796
2000-04-30,-0.097328,-0.021915
2000-05-31,-0.173784,0.023934
...,...,...
2024-03-31,0.202921,-0.041615
2024-04-30,-0.174441,0.048021
2024-05-31,-0.037152,0.03467
2024-06-30,0.315113,0.011321


### Bond features

Bond yields reflect interest rate expectations and risk sentiment. Rising yields might indicate stronger economic growth expectations, potentially favouring size or momentum factors. On the other hand, falling yields often signal risk aversion or economic slowdown, which might benefit more passive factors such as quality or value. We will be using 3M, 6M and 10Y bonds prices as a proxy for bond market.

In [25]:
bonds = ['US3MT=RR', 'US6MT=RR', 'US10YT=RR']
start_date_bonds = pd.to_datetime(start_date) - timedelta(days=30)
bonds_prices = ld.get_history(bonds, fields=["B_YLD_1"], start=start_date_bonds, end=end_date, interval='monthly')
bonds_prices_change = bonds_prices.pct_change().dropna()
bonds_prices_change = allign_dates(bonds_prices_change, feature_df)

bonds_prices_change

B_YLD_1,US3MT=RR,US6MT=RR,US10YT=RR
Dates,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2000-01-31,0.069632,0.089501,0.035276
2000-02-29,0.016275,0.014569,-0.037977
2000-03-31,0.017794,0.013841,-0.063504
2000-04-30,-0.01049,-0.008191,0.036021
2000-05-31,-0.031802,-0.033207,0.010775
...,...,...,...
2024-03-31,-0.00242,0.00301,-0.013641
2024-04-30,0.006903,0.010692,0.116834
2024-05-31,-0.001112,-0.003526,-0.036721
2024-06-30,-0.005194,-0.010058,-0.037456


### Commodity features

Commodity features, such as futures prices for oil, gas and gold are the final set of features have considered for the model as those prices directly affect certain sectors (e.g., energy, materials) and the broader economy.

In [26]:
comods = ['LCOc1', 'NGc1', 'GCc1']
comods_prices = ld.get_history(comods, fields=["TRDPRC_1"], start='2000-01-01', end='2024-08-31', interval='monthly')
comods_prices_change = comods_prices.pct_change().dropna()
comods_prices_change = allign_dates(comods_prices_change, feature_df)

comods_prices_change

TRDPRC_1,LCOc1,NGc1,GCc1
Dates,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2000-01-31,0.083205,0.037523,0.03211
2000-02-29,-0.118065,0.063291,-0.050256
2000-03-31,-0.034274,0.068027,-0.013679
2000-04-30,0.179958,0.383758,-0.009489
2000-05-31,0.079264,0.029919,0.046426
...,...,...,...
2024-03-31,0.004229,0.117009,0.027215
2024-04-30,-0.071014,0.318855,0.013595
2024-05-31,0.058434,0.007749,-0.000344
2024-06-30,-0.065741,-0.215302,0.052122


In [27]:
model_df = pd.concat([feature_df, macro, ind_prices_change, cur_prices_change, bonds_prices_change, comods_prices_change], axis = 1)
model_df

Unnamed: 0,momentum_normalized_weighted,quality_normalized_weighted,profitability_normalized_weighted,leverage_normalized_weighted,size_normalized_weighted,value_normalized_weighted,yield_normalized_weighted,total_return_1m_momentum,total_return_1m_quality,total_return_1m_profitability,total_return_1m_leverage,total_return_1m_size,total_return_1m_value,total_return_1m_yield,USCPI=ECI,USUNR=ECI,USGPCS=ECI,USLEAD=ECI,USCPF=ECI,USGDPF=ECI,USFOUT=ECI,.VIX,.SPX,BID,US3MT=RR,US6MT=RR,US10YT=RR,LCOc1,NGc1,GCc1
2000-01-31,0.945405,1.240297,0.819040,1.879267,0.841735,1.687445,-3.000000,-4.493394,3.659980,1.033077,9.302974,-3.740018,-0.811344,-3.951713,0.3,4.0,0.0,0.8,0.3,7.3,0.0,-0.063327,-0.020108,-0.004849,0.069632,0.089501,0.035276,0.083205,0.037523,0.03211
2000-02-29,1.497460,1.251082,0.822014,1.916245,0.728465,1.553215,-3.000000,7.450206,-8.212941,-10.100703,-1.277309,-3.226533,-8.806451,-5.929975,0.4,4.1,1.3,-0.6,0.1,7.3,0.2,0.031665,0.09672,-0.008916,0.016275,0.014569,-0.037977,-0.118065,0.063291,-0.050256
2000-03-31,1.482371,1.252198,0.828882,1.914239,0.910537,1.164446,-3.000000,10.255223,7.103758,4.684157,8.616552,11.526238,7.288108,8.125907,0.6,4.0,0.9,1.0,0.3,5.5,0.7,0.086686,-0.030796,-0.046025,0.017794,0.013841,-0.063504,-0.034274,0.068027,-0.013679
2000-04-30,1.561126,1.252198,0.828882,1.914239,0.868652,1.210611,-3.000000,-0.878634,0.038632,1.131502,0.245629,-4.260366,-0.194067,-2.067652,-0.1,3.8,-0.1,0.1,0.2,5.5,0.6,-0.097328,-0.021915,0.02818,-0.01049,-0.008191,0.036021,0.179958,0.383758,-0.009489
2000-05-31,1.474937,1.252198,0.828882,1.914239,0.865280,1.276771,-3.000000,-0.865824,2.761453,2.831635,0.363008,-0.142014,-4.067952,-0.706639,0.2,4.0,0.5,-0.9,0.2,5.5,0.0,-0.173784,0.023934,0.01557,-0.031802,-0.033207,0.010775,0.079264,0.029919,0.046426
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-03-31,1.043473,1.908908,0.277694,1.577512,1.308560,0.635856,-3.000000,3.743506,3.168656,1.727549,2.261306,0.808660,3.554203,2.520541,0.4,3.8,0.7,-0.3,0.4,1.4,0.2,0.202921,-0.041615,-0.01186,-0.00242,0.00301,-0.013641,0.004229,0.117009,0.027215
2024-04-30,1.163670,1.908908,0.277694,1.577512,1.296577,0.604652,0.784668,-4.327978,-2.023432,-4.659949,-3.954715,-4.185906,-4.282108,-2.919605,0.3,3.9,0.2,-0.6,0.3,1.4,-0.5,-0.174441,0.048021,0.016503,0.006903,0.010692,0.116834,-0.071014,0.318855,0.013595
2024-05-31,0.914941,1.908908,0.277694,1.577512,1.330237,0.589844,0.765439,3.996421,2.509475,2.698176,1.382141,5.223667,3.286129,2.660367,0.0,4.0,0.4,-0.5,0.2,1.4,0.8,-0.037152,0.03467,-0.011807,-0.001112,-0.003526,-0.036721,0.058434,0.007749,-0.000344
2024-06-30,0.833786,1.908769,0.295397,1.565721,1.339590,0.589984,-3.000000,3.030338,-0.036403,2.875846,2.402459,5.624519,0.746210,1.981235,-0.1,4.1,0.3,-0.2,0.1,1.4,0.0,0.315113,0.011321,0.010455,-0.005194,-0.010058,-0.037456,-0.065741,-0.215302,0.052122


### Creating the target variable

Our target variable, **best_factor_next_month** is created by taking the factor with the highest 1-month total return.

In [28]:
model_df['best_factor_next_month'] = model_df[[
                                            'total_return_1m_momentum',
                                            'total_return_1m_quality', 
                                            'total_return_1m_profitability',
                                            'total_return_1m_leverage',
                                            'total_return_1m_size',
                                            'total_return_1m_value',
                                            'total_return_1m_yield'
                                                   ]].shift(-1).idxmax(axis=1)
for row in model_df['best_factor_next_month'].unique():
    print(row, len(model_df[model_df['best_factor_next_month'] == row]))

total_return_1m_momentum 132
total_return_1m_size 34
total_return_1m_profitability 16
total_return_1m_value 20
total_return_1m_leverage 46
total_return_1m_yield 18
total_return_1m_quality 28
nan 0


Examining the distribution of label classes, we observe a some data imbalance, skewed towards the momentum factor. This imbalance may bias our model towards the momentum class, limiting its ability to generalize across other factors. To address this and to focus on the underrepresented yet critical factors, we have decided to remove the momentum factor from our analysis. This approach is intended to help the model more effectively learn patterns related to these other factors, potentially improving overall performance. While techniques like oversampling, undersampling etc could be used to mitigate class imbalance, in the context of this study, we opted for a simpler approach as our primary goal here is to demonstrate whether factor-based index rebalancing can enhance performance compared to the base index.

In [29]:
model_df['best_factor_next_month'] = model_df[[
                                            'total_return_1m_quality', 
                                            'total_return_1m_profitability',
                                            'total_return_1m_leverage',
                                            'total_return_1m_size',
                                            'total_return_1m_value',
                                            'total_return_1m_yield'
                                                   ]].shift(-1).idxmax(axis=1)

# since we drop the momentum class, we will also drop the momentum feature
model_df = model_df.drop(columns=['momentum_normalized_weighted'])
model_df.dropna(inplace=True, subset=['best_factor_next_month'])

for row in model_df['best_factor_next_month'].unique():
    print(row, len(model_df[model_df['best_factor_next_month'] == row]))

total_return_1m_leverage 75
total_return_1m_size 69
total_return_1m_profitability 37
total_return_1m_yield 35
total_return_1m_value 31
total_return_1m_quality 47


We use LabelEncoder to transform the remaining labels.

In [30]:
label_encoder = LabelEncoder()
model_df['label'] = label_encoder.fit_transform(model_df['best_factor_next_month'])
model_df

Unnamed: 0,quality_normalized_weighted,profitability_normalized_weighted,leverage_normalized_weighted,size_normalized_weighted,value_normalized_weighted,yield_normalized_weighted,total_return_1m_momentum,total_return_1m_quality,total_return_1m_profitability,total_return_1m_leverage,total_return_1m_size,total_return_1m_value,total_return_1m_yield,USCPI=ECI,USUNR=ECI,USGPCS=ECI,USLEAD=ECI,USCPF=ECI,USGDPF=ECI,USFOUT=ECI,.VIX,.SPX,BID,US3MT=RR,US6MT=RR,US10YT=RR,LCOc1,NGc1,GCc1,best_factor_next_month,label
2000-01-31,1.240297,0.819040,1.879267,0.841735,1.687445,-3.000000,-4.493394,3.659980,1.033077,9.302974,-3.740018,-0.811344,-3.951713,0.3,4.0,0.0,0.8,0.3,7.3,0.0,-0.063327,-0.020108,-0.004849,0.069632,0.089501,0.035276,0.083205,0.037523,0.03211,total_return_1m_leverage,0
2000-02-29,1.251082,0.822014,1.916245,0.728465,1.553215,-3.000000,7.450206,-8.212941,-10.100703,-1.277309,-3.226533,-8.806451,-5.929975,0.4,4.1,1.3,-0.6,0.1,7.3,0.2,0.031665,0.09672,-0.008916,0.016275,0.014569,-0.037977,-0.118065,0.063291,-0.050256,total_return_1m_size,3
2000-03-31,1.252198,0.828882,1.914239,0.910537,1.164446,-3.000000,10.255223,7.103758,4.684157,8.616552,11.526238,7.288108,8.125907,0.6,4.0,0.9,1.0,0.3,5.5,0.7,0.086686,-0.030796,-0.046025,0.017794,0.013841,-0.063504,-0.034274,0.068027,-0.013679,total_return_1m_profitability,1
2000-04-30,1.252198,0.828882,1.914239,0.868652,1.210611,-3.000000,-0.878634,0.038632,1.131502,0.245629,-4.260366,-0.194067,-2.067652,-0.1,3.8,-0.1,0.1,0.2,5.5,0.6,-0.097328,-0.021915,0.02818,-0.01049,-0.008191,0.036021,0.179958,0.383758,-0.009489,total_return_1m_profitability,1
2000-05-31,1.252198,0.828882,1.914239,0.865280,1.276771,-3.000000,-0.865824,2.761453,2.831635,0.363008,-0.142014,-4.067952,-0.706639,0.2,4.0,0.5,-0.9,0.2,5.5,0.0,-0.173784,0.023934,0.01557,-0.031802,-0.033207,0.010775,0.079264,0.029919,0.046426,total_return_1m_size,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-02-29,1.908908,0.277694,1.577512,1.303319,0.660580,0.803768,4.627249,4.848980,2.020470,-0.556119,1.731816,2.207771,-0.001908,0.4,3.9,0.6,0.0,0.4,3.4,1.4,-0.029104,0.031019,-0.000926,0.001491,0.027053,0.072383,0.0464,-0.051948,0.093485,total_return_1m_value,4
2024-03-31,1.908908,0.277694,1.577512,1.308560,0.635856,-3.000000,3.743506,3.168656,1.727549,2.261306,0.808660,3.554203,2.520541,0.4,3.8,0.7,-0.3,0.4,1.4,0.2,0.202921,-0.041615,-0.01186,-0.00242,0.00301,-0.013641,0.004229,0.117009,0.027215,total_return_1m_quality,2
2024-04-30,1.908908,0.277694,1.577512,1.296577,0.604652,0.784668,-4.327978,-2.023432,-4.659949,-3.954715,-4.185906,-4.282108,-2.919605,0.3,3.9,0.2,-0.6,0.3,1.4,-0.5,-0.174441,0.048021,0.016503,0.006903,0.010692,0.116834,-0.071014,0.318855,0.013595,total_return_1m_size,3
2024-05-31,1.908908,0.277694,1.577512,1.330237,0.589844,0.765439,3.996421,2.509475,2.698176,1.382141,5.223667,3.286129,2.660367,0.0,4.0,0.4,-0.5,0.2,1.4,0.8,-0.037152,0.03467,-0.011807,-0.001112,-0.003526,-0.036721,0.058434,0.007749,-0.000344,total_return_1m_size,3


After removing the momentum factor, the distribution of the remaining factors shows a more balanced representation, though some imbalance persists, particularly between leverage and value. To address the slight remaining imbalance and ensure robust performance across all factors, we will employ XGBoost with sample weights, allowing the model to account for the distribution differences in class representation.

To get our initial list of features we should exclude all total return columns which we used for creating the label.

In [31]:
features = [col for col in model_df.columns[:-2] if not col.startswith("total")]
print(len(features))

22


### Recursive Feature Elimination

As demonstrated above, we have a total of 32 features, and given our dataset of 290 observations, it is important to eliminate some features to reduce model complexity and minimize the risk of overfitting. To achieve this, we implement recursive feature elimination, a feature selection strategy using XGBoost. By applying this methodology, we iteratively rank the features based on their importance to the model's predictions and systematically remove the least significant ones. This process continues until the top 12 features remain. We retain 12 features as it allows around 20 observations (train set is around 230 observations) per feature which is a good rule of thumb for ensemble models. This approach allows us to maintain the most critical features that contribute to the model's predictive accuracy, reduce the complexity of the model and minimize the risk of overfitting.

Before implementing feature elimination, we split our dataset into training and test sets. Observations prior to January 2019 are assigned to the training set, while those from January 2019 onward are reserved for the test set. We opt for a temporal train/test split instead of a random split because maintaining the chronological order of observations is essential for testing the investment performance of our rebalancing approach.

In [32]:
scaler = StandardScaler()
split_idx = model_df[model_df.index < '2019-01-01'].shape[0]

y_encoded = (model_df['best_factor_next_month'])
X_train, X_test, y_train, y_test = model_df[:split_idx][features], model_df[split_idx:][features], model_df['label'][:split_idx], model_df['label'][split_idx:]
X_train_norm = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)

As mentioned above, we train the model with sample weights to address label imbalancing using the function below.

In [33]:
def calculate_sample_weights(y_train):
    class_counts = np.bincount(y_train)
    total_samples = len(y_train)
    class_weights = total_samples / (len(np.unique(y_train)) * class_counts)
    sample_weights = np.array([class_weights[class_index] for class_index in y_train])
    return sample_weights

Below, we implement the recursive feature selection as described above and show the 12 most important features along with the importance scores.

In [34]:
model = xgb.XGBClassifier(random_state = 42)
features = X_train_norm.columns.tolist()
n_features_to_select = 12

sample_weights = calculate_sample_weights(y_train)

while len(features) > n_features_to_select:
    model.fit(X_train_norm[features], y_train, sample_weight=sample_weights)
    importances = model.feature_importances_
    least_important = np.argmin(importances)
    features.pop(least_important)

important_features_df = pd.DataFrame({
    'Feature': features,
    'Importance': importances[:-1]
})

important_features_df = important_features_df.sort_values(by=['Importance'], ascending=False).reset_index(drop=True)
important_features_df

Unnamed: 0,Feature,Importance
0,quality_normalized_weighted,0.094042
1,US10YT=RR,0.087496
2,profitability_normalized_weighted,0.083115
3,USGDPF=ECI,0.081911
4,.VIX,0.079165
5,value_normalized_weighted,0.077392
6,USCPF=ECI,0.076581
7,size_normalized_weighted,0.072967
8,.SPX,0.069997
9,USUNR=ECI,0.068678


The most important features include a mix of economic, bond, and market, as well as factor-based indicators which is demonstrating a balanced selection of features which we will use for rolling training window and out of sample testing.

In [35]:
X_train, X_test = model_df[:split_idx][important_features_df['Feature']], model_df[split_idx:][important_features_df['Feature']]
X_train_norm = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
print(X_train_norm.shape)

(228, 12)


## Rolling Training window and performance results

### Out of sample testing

Below, we define a function to train the model which uses XgBoost with softprob objective (allowing to return probabilities along with the predictions), cross entropy loss and fits the model using sample weights. For simplicity, we didn't implement any hyperparameter tuning, which could potentially further enhance the model's predictive power.

In [36]:
def train_model(X_train_norm, y_train):
    sample_weights = calculate_sample_weights(y_train)
    model = xgb.XGBClassifier( 
            objective='multi:softprob', 
            num_class=len(np.unique(y_train)),
            use_label_encoder=False, 
            eval_metric='mlogloss', 
            random_state = 42)
    model.fit(X_train_norm, y_train, sample_weight=sample_weights)
    return model

The rolling training window process is implemented in two steps: first, we train an initial model using the training dataset consisting of observations up to January 2019. Then, we predict the factor for the following month, add this new observation to the training set, retrain the model, and predict the next month. This process is repeated for all test observations after January 2019. In addition to the actual predictions, we store the prediction probabilities, which will be used to build a probabilistic investment strategy.

In [37]:
# initial training
predictions = []
predict_proba = []
model = train_model(X_train_norm, y_train)

# rolling training and out of sample testing
for sample, y in zip(X_test.iterrows(), y_test):
    sample = scaler.transform(pd.DataFrame(sample[1]).T)
    predictions.append(model.predict(sample)[0])
    predict_proba.append(model.predict_proba(sample))
    X_train_norm = np.concatenate((X_train_norm, sample))
    y_train = np.concatenate((y_train, np.array([y])))
    model = train_model(X_train_norm, y_train)

In [38]:
label_mapping = dict(zip(model_df['label'], model_df['best_factor_next_month']))
target_names = [label_mapping[i] for i in sorted(model_df['label'].unique())]

report = classification_report(y_test, predictions, 
                               target_names=target_names
)

print(f"Actual Prediction:\n {dict(sorted(Counter(predictions).items()))} \n")
print(f"Confusion Matrix:\n {confusion_matrix(y_test, predictions)}\n")
print("Classification Report:\n", report)

Actual Prediction:
 {0: 18, 1: 8, 2: 12, 3: 23, 4: 2, 5: 3} 

Confusion Matrix:
 [[ 4  1  0  3  0  1]
 [ 1  0  3  1  0  0]
 [ 5  0  5  3  1  1]
 [ 8  5  2 12  0  1]
 [ 0  0  1  2  0  0]
 [ 0  2  1  2  1  0]]

Classification Report:
                                precision    recall  f1-score   support

     total_return_1m_leverage       0.22      0.44      0.30         9
total_return_1m_profitability       0.00      0.00      0.00         5
      total_return_1m_quality       0.42      0.33      0.37        15
         total_return_1m_size       0.52      0.43      0.47        28
        total_return_1m_value       0.00      0.00      0.00         3
        total_return_1m_yield       0.00      0.00      0.00         6

                     accuracy                           0.32        66
                    macro avg       0.19      0.20      0.19        66
                 weighted avg       0.35      0.32      0.32        66



Although the model's accuracy is 32% for the 6-class classification task, accuracy alone may not fully capture the model's effectiveness in context. Our main goal is to compare index performance with and without prediction-based factor rebalancing. Even if the model doesn't consistently identify the top-performing factor, choosing the next best factor can still enhance performance compared to the base factor (e.g., the DJIA, which is price-weighted). The confusion matrix highlights that the model performs best in identifying the leverage, quality and size factors, with 4 and 5 and 12 correct predictions accordingly, while struggling most with the yield, profitability and value classes, where misclassifications were more frequent.

Given our objective, a more suitable metric to consider might be the Top-N Accuracy, which focuses on how often the model's top predictions include the actual correct factor, rather than just its first choice. These metrics are more aligned with our goal of improving index performance through factor rebalancing, where correctly identifying one of the top factors, even if not the very best one, can be beneficial.

We use **top_n_accuracy** function to calculate the accuracy of correctly predicting one of the top 3 classes.

In [39]:
def top_n_accuracy(y_true, y_pred_proba, n=3):
    y_pred_proba = np.vstack(y_pred_proba)
    top_n_pred = np.argsort(y_pred_proba, axis=1)[:, -n:]
    matches = [y_true[i] in top_n_pred[i] for i in range(len(y_true))]
    
    return np.mean(matches)
top_3_acc = top_n_accuracy(y_test, predict_proba, n=3)
print("top-3 accuracy:", top_3_acc)

top-3 accuracy: 0.6212121212121212


The model achieved a 62% Top-3 accuracy, indicating that its predictions included the correct factor within the top 3 for 62% of the test observations. Nevertheless, the most effective way to assess the model's performance is by applying the rebalancing strategy based on these predictions and comparing the rebalanced index's performance with the actual index performance.

## Index rebalancing performance comparison

We will introduce and compare two rebalancing strategies against the main DJI index:

* Rebalancing based on the Top Prediction: In this strategy, the index will be rebalanced based on the top predicted factor. For example, if the model predicts size as the best factor, the index will be rebalanced using returns derived from size-based weights.
* Weighted rebalancing based on prediction probabilities: In this approach, the index will be rebalanced by calculating a weighted sum of returns, where each factor's return is weighted by its predicted probability.

### Adding predictions

To start with, we create a new dataframe representing the return columns and the best factor names for the test period. Then we add the predictions and shift both the predictions and actual values by 1 period which aligns the predictions with the returns for the month. This is to ensure we are using the prediction or best factor from the previous month to calculate the return for the current month.

In [40]:
returns_df = model_df[target_names + ['best_factor_next_month']][split_idx:]

returns_df['preds'] = label_encoder.inverse_transform(predictions)
returns_df['best_shifted'] = returns_df['best_factor_next_month'].shift(1)
returns_df['pred_shifted'] = returns_df['preds'].shift(1)
returns_df = returns_df.dropna()
returns_df

Unnamed: 0,total_return_1m_leverage,total_return_1m_profitability,total_return_1m_quality,total_return_1m_size,total_return_1m_value,total_return_1m_yield,best_factor_next_month,preds,best_shifted,pred_shifted
2019-02-28,9.635442,5.342961,6.747772,7.331603,4.547667,5.586687,total_return_1m_size,total_return_1m_size,total_return_1m_leverage,total_return_1m_size
2019-03-31,-0.409980,1.234758,1.140586,3.608492,-0.883971,1.120275,total_return_1m_size,total_return_1m_leverage,total_return_1m_size,total_return_1m_size
2019-04-30,1.346865,1.662254,4.866231,5.012717,0.851827,1.069905,total_return_1m_quality,total_return_1m_leverage,total_return_1m_size,total_return_1m_leverage
2019-05-31,-8.063898,-6.772353,-4.436866,-6.042288,-8.810223,-7.937026,total_return_1m_profitability,total_return_1m_size,total_return_1m_quality,total_return_1m_leverage
2019-06-30,6.045734,7.008643,5.765615,6.469608,5.968074,5.680147,total_return_1m_size,total_return_1m_size,total_return_1m_profitability,total_return_1m_size
...,...,...,...,...,...,...,...,...,...,...
2024-02-29,-0.556119,2.020470,4.848980,1.731816,2.207771,-0.001908,total_return_1m_value,total_return_1m_size,total_return_1m_quality,total_return_1m_quality
2024-03-31,2.261306,1.727549,3.168656,0.808660,3.554203,2.520541,total_return_1m_quality,total_return_1m_leverage,total_return_1m_value,total_return_1m_size
2024-04-30,-3.954715,-4.659949,-2.023432,-4.185906,-4.282108,-2.919605,total_return_1m_size,total_return_1m_quality,total_return_1m_quality,total_return_1m_leverage
2024-05-31,1.382141,2.698176,2.509475,5.223667,3.286129,2.660367,total_return_1m_size,total_return_1m_size,total_return_1m_size,total_return_1m_quality


### Calculating prediction based returns

Next, we calculate 3 types of returns:
* best_return - the return if we always predicted the correct class
* pred_return -  the return based on predicted factor
* pred_proba_return - weighted return based on prediction probabilities for each factor.

We use **add_names_to_probas** function to add the target names to the prediction probabilities and **calculate_weighted_return** for pred_proba_return calculaton.

In [41]:
def add_names_to_probas(target_names: list, predict_proba:list)->list:
    proba_list = []
    for probas in predict_proba:
        probas_named = {}
        for name, value in zip(target_names, probas[0]):
            probas_named[name] = value
        proba_list.append(probas_named)
    return proba_list

In [42]:
def calculate_weighted_return(row:pd.DataFrame, probas:dict)->float:
    return (probas['total_return_1m_leverage'] * row.total_return_1m_leverage +
            probas['total_return_1m_profitability'] * row.total_return_1m_profitability +
            probas['total_return_1m_quality'] * row.total_return_1m_quality +
            probas['total_return_1m_size'] * row.total_return_1m_size +
            probas['total_return_1m_value'] * row.total_return_1m_value +
            probas['total_return_1m_yield'] * row.total_return_1m_yield)

In [43]:
returns_df['best_return'] =  [ret[ret['best_shifted']] for _, ret in returns_df.iterrows()]
returns_df['pred_return'] =  [ret[ret['pred_shifted']] for _, ret in returns_df.iterrows()]

returns_df['pred_proba_return'] = [
    calculate_weighted_return(row, probas) for row, probas in zip(returns_df.itertuples(), add_names_to_probas(target_names, predict_proba))
]
returns_df

Unnamed: 0,total_return_1m_leverage,total_return_1m_profitability,total_return_1m_quality,total_return_1m_size,total_return_1m_value,total_return_1m_yield,best_factor_next_month,preds,best_shifted,pred_shifted,best_return,pred_return,pred_proba_return
2019-02-28,9.635442,5.342961,6.747772,7.331603,4.547667,5.586687,total_return_1m_size,total_return_1m_size,total_return_1m_leverage,total_return_1m_size,9.635442,7.331603,6.526403
2019-03-31,-0.409980,1.234758,1.140586,3.608492,-0.883971,1.120275,total_return_1m_size,total_return_1m_leverage,total_return_1m_size,total_return_1m_size,3.608492,3.608492,2.355524
2019-04-30,1.346865,1.662254,4.866231,5.012717,0.851827,1.069905,total_return_1m_quality,total_return_1m_leverage,total_return_1m_size,total_return_1m_leverage,5.012717,1.346865,2.169469
2019-05-31,-8.063898,-6.772353,-4.436866,-6.042288,-8.810223,-7.937026,total_return_1m_profitability,total_return_1m_size,total_return_1m_quality,total_return_1m_leverage,-4.436866,-8.063898,-7.795316
2019-06-30,6.045734,7.008643,5.765615,6.469608,5.968074,5.680147,total_return_1m_size,total_return_1m_size,total_return_1m_profitability,total_return_1m_size,7.008643,6.469608,6.318585
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-02-29,-0.556119,2.020470,4.848980,1.731816,2.207771,-0.001908,total_return_1m_value,total_return_1m_size,total_return_1m_quality,total_return_1m_quality,4.848980,4.848980,2.810316
2024-03-31,2.261306,1.727549,3.168656,0.808660,3.554203,2.520541,total_return_1m_quality,total_return_1m_leverage,total_return_1m_value,total_return_1m_size,3.554203,0.808660,1.412187
2024-04-30,-3.954715,-4.659949,-2.023432,-4.185906,-4.282108,-2.919605,total_return_1m_size,total_return_1m_quality,total_return_1m_quality,total_return_1m_leverage,-2.023432,-3.954715,-4.003671
2024-05-31,1.382141,2.698176,2.509475,5.223667,3.286129,2.660367,total_return_1m_size,total_return_1m_size,total_return_1m_size,total_return_1m_quality,5.223667,2.509475,2.821662


### Calculating base index returns

Below, we calculate returns for the base index and add to our returns dataframe.

In [44]:
dji = ld.get_history('.DJI', "TR.PriceClose", start='2019-01-31', end='2024-07-01', interval='monthly')
dji_change = dji.pct_change().dropna()

dji_change = allign_dates(dji_change, returns_df)
returns_df['dji'] = dji_change['Price Close']*100
returns_df

Unnamed: 0,total_return_1m_leverage,total_return_1m_profitability,total_return_1m_quality,total_return_1m_size,total_return_1m_value,total_return_1m_yield,best_factor_next_month,preds,best_shifted,pred_shifted,best_return,pred_return,pred_proba_return,dji
2019-02-28,9.635442,5.342961,6.747772,7.331603,4.547667,5.586687,total_return_1m_size,total_return_1m_size,total_return_1m_leverage,total_return_1m_size,9.635442,7.331603,6.526403,3.66534
2019-03-31,-0.409980,1.234758,1.140586,3.608492,-0.883971,1.120275,total_return_1m_size,total_return_1m_leverage,total_return_1m_size,total_return_1m_size,3.608492,3.608492,2.355524,0.048926
2019-04-30,1.346865,1.662254,4.866231,5.012717,0.851827,1.069905,total_return_1m_quality,total_return_1m_leverage,total_return_1m_size,total_return_1m_leverage,5.012717,1.346865,2.169469,2.561774
2019-05-31,-8.063898,-6.772353,-4.436866,-6.042288,-8.810223,-7.937026,total_return_1m_profitability,total_return_1m_size,total_return_1m_quality,total_return_1m_leverage,-4.436866,-8.063898,-7.795316,-6.685522
2019-06-30,6.045734,7.008643,5.765615,6.469608,5.968074,5.680147,total_return_1m_size,total_return_1m_size,total_return_1m_profitability,total_return_1m_size,7.008643,6.469608,6.318585,7.192931
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-02-29,-0.556119,2.020470,4.848980,1.731816,2.207771,-0.001908,total_return_1m_value,total_return_1m_size,total_return_1m_quality,total_return_1m_quality,4.848980,4.848980,2.810316,2.21778
2024-03-31,2.261306,1.727549,3.168656,0.808660,3.554203,2.520541,total_return_1m_quality,total_return_1m_leverage,total_return_1m_value,total_return_1m_size,3.554203,0.808660,1.412187,2.079652
2024-04-30,-3.954715,-4.659949,-2.023432,-4.185906,-4.282108,-2.919605,total_return_1m_size,total_return_1m_quality,total_return_1m_quality,total_return_1m_leverage,-2.023432,-3.954715,-4.003671,-5.002738
2024-05-31,1.382141,2.698176,2.509475,5.223667,3.286129,2.660367,total_return_1m_size,total_return_1m_size,total_return_1m_size,total_return_1m_quality,5.223667,2.509475,2.821662,2.301692


Additionally, we also calculate the cumulative returns.

In [45]:
cols_for_cumsum = target_names + ['best_return','pred_return','pred_proba_return', 'dji']

for col in cols_for_cumsum:
    returns_df[f'{col}_cumsum'] = returns_df[col].cumsum()
returns_df

Unnamed: 0,total_return_1m_leverage,total_return_1m_profitability,total_return_1m_quality,total_return_1m_size,total_return_1m_value,total_return_1m_yield,best_factor_next_month,preds,best_shifted,pred_shifted,best_return,pred_return,pred_proba_return,dji,total_return_1m_leverage_cumsum,total_return_1m_profitability_cumsum,total_return_1m_quality_cumsum,total_return_1m_size_cumsum,total_return_1m_value_cumsum,total_return_1m_yield_cumsum,best_return_cumsum,pred_return_cumsum,pred_proba_return_cumsum,dji_cumsum
2019-02-28,9.635442,5.342961,6.747772,7.331603,4.547667,5.586687,total_return_1m_size,total_return_1m_size,total_return_1m_leverage,total_return_1m_size,9.635442,7.331603,6.526403,3.66534,9.635442,5.342961,6.747772,7.331603,4.547667,5.586687,9.635442,7.331603,6.526403,3.66534
2019-03-31,-0.409980,1.234758,1.140586,3.608492,-0.883971,1.120275,total_return_1m_size,total_return_1m_leverage,total_return_1m_size,total_return_1m_size,3.608492,3.608492,2.355524,0.048926,9.225462,6.577719,7.888358,10.940096,3.663696,6.706961,13.243934,10.940096,8.881927,3.714266
2019-04-30,1.346865,1.662254,4.866231,5.012717,0.851827,1.069905,total_return_1m_quality,total_return_1m_leverage,total_return_1m_size,total_return_1m_leverage,5.012717,1.346865,2.169469,2.561774,10.572326,8.239973,12.754590,15.952812,4.515523,7.776867,18.256651,12.286960,11.051396,6.27604
2019-05-31,-8.063898,-6.772353,-4.436866,-6.042288,-8.810223,-7.937026,total_return_1m_profitability,total_return_1m_size,total_return_1m_quality,total_return_1m_leverage,-4.436866,-8.063898,-7.795316,-6.685522,2.508428,1.467620,8.317723,9.910524,-4.294700,-0.160159,13.819785,4.223062,3.256080,-0.409482
2019-06-30,6.045734,7.008643,5.765615,6.469608,5.968074,5.680147,total_return_1m_size,total_return_1m_size,total_return_1m_profitability,total_return_1m_size,7.008643,6.469608,6.318585,7.192931,8.554163,8.476263,14.083338,16.380132,1.673374,5.519988,20.828428,10.692670,9.574665,6.783449
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-02-29,-0.556119,2.020470,4.848980,1.731816,2.207771,-0.001908,total_return_1m_value,total_return_1m_size,total_return_1m_quality,total_return_1m_quality,4.848980,4.848980,2.810316,2.21778,34.490193,61.737512,69.327050,105.487750,3.778261,23.347552,213.979165,98.944342,72.939245,52.747886
2024-03-31,2.261306,1.727549,3.168656,0.808660,3.554203,2.520541,total_return_1m_quality,total_return_1m_leverage,total_return_1m_value,total_return_1m_size,3.554203,0.808660,1.412187,2.079652,36.751499,63.465062,72.495707,106.296410,7.332465,25.868093,217.533368,99.753002,74.351432,54.827537
2024-04-30,-3.954715,-4.659949,-2.023432,-4.185906,-4.282108,-2.919605,total_return_1m_size,total_return_1m_quality,total_return_1m_quality,total_return_1m_leverage,-2.023432,-3.954715,-4.003671,-5.002738,32.796785,58.805113,70.472275,102.110504,3.050356,22.948488,215.509936,95.798287,70.347761,49.8248
2024-05-31,1.382141,2.698176,2.509475,5.223667,3.286129,2.660367,total_return_1m_size,total_return_1m_size,total_return_1m_size,total_return_1m_quality,5.223667,2.509475,2.821662,2.301692,34.178926,61.503289,72.981749,107.334171,6.336485,25.608856,220.733604,98.307762,73.169422,52.126492


Finally, we calculate the sharp ratio for the strategies and present the performance comparison in a separate dataframe.

In [46]:
def calculate_sharpe_ratio(returns:pd.Series, risk_free_rate: float =0) -> float:
    excess_returns = returns - risk_free_rate
    mean_excess_return = excess_returns.mean() * 100
    std_excess_return = excess_returns.std() * 100
    sharpe_ratio = mean_excess_return / std_excess_return
    annualized_sharpe_ratio = sharpe_ratio * np.sqrt(252)
    return annualized_sharpe_ratio

In [47]:
sharp_ratio = {}
cum_return = {}
for col in cols_for_cumsum:
    cum_return[col] = returns_df[f"{col}_cumsum"][-1]
    sharp_ratio[col] = calculate_sharpe_ratio(returns_df[col])
    
pd.DataFrame({'cum_return': cum_return, 'sharp_ratio': sharp_ratio}).sort_values(by=['cum_return'], ascending=False)

Unnamed: 0,cum_return,sharp_ratio
best_return,226.358123,9.815381
total_return_1m_size,112.958691,5.195335
pred_return,103.932282,4.354861
pred_proba_return,76.96806,3.532225
total_return_1m_quality,72.945346,2.997232
total_return_1m_profitability,64.379135,3.11441
dji,53.244567,2.550035
total_return_1m_leverage,36.581385,1.679572
total_return_1m_yield,27.590091,1.213864
total_return_1m_value,7.082695,0.30885


The "best_return" represents the theoretical maximum potential with a cumulative return of 226.36% and a Sharpe ratio of 9.82. Model-driven strategies, particularly "pred_return" and "pred_proba_return," outperformed the traditional benchmark. The "pred_return" achieved a cumulative return of 103.93% and a Sharpe ratio of 4.35, while "pred_proba_return" followed closely with a 76.96% return and a Sharpe ratio of 3.53. Both surpassed the base index, which had a cumulative return of 53.24% and a Sharpe ratio of 2.55. Factor-only strategies showed mixed results. The "total_return_1m_size" factor led with a 112.96% return and a Sharpe ratio of 5.20, while other factors, like "total_return_1m_value," significantly lagged, with just a 7.08% return and a Sharpe ratio of 0.31.

We also plot the cumulative returns for model-driven strategies and the base index.

In [48]:
fig = go.Figure()

for series_name in ['dji_cumsum', 'pred_return_cumsum', 'pred_proba_return_cumsum']:
        fig.add_trace(go.Scatter(
                x=returns_df.index, 
                y=returns_df[series_name],
                mode='lines', name=series_name))

fig.show()

According to the plot, both model-driven strategies consistently outperformed the baseline throughout the observation period, with the prediction-based strategy outperforming the prediction-probability-based strategy demonstrating the added value of predictive modelling.

## Conclusion

Our article demonstrated the process of retrieving historical index constituents, calculating associated financial ratios, and constructing factors to explore an advanced approach to index rebalancing that extends beyond traditional single-factor methodologies. The primary goal was to determine whether a multi-factor strategy, with dynamic adjustments based on predictions of the best-performing factors, could enhance the returns of a base index.

The results suggest that a multi-factor, model-driven approach to index rebalancing can outperform traditional single-factor indexes. By incorporating various financial and economic metrics and dynamically adjusting the index composition based on predicted factor performance, the model-driven strategies consistently achieved higher returns and better risk-adjusted outcomes. This showcases the potential of advanced rebalancing techniques to improve investment results.

However, it's important to note that this research is exploratory in nature. Further analysis is needed to ensure the robustness of the findings, including additional tests and validations. Hyperparameter tuning and sensitivity analysis could further refine the model's predictions and potentially enhance performance. Therefore, while the results are promising, they should be considered as a foundation for more comprehensive studies and practical applications.