- By: Harkishan Singh Baniya
- Email: harkishansinghbaniya@gmail.com
- References: 1) Advances in Financial Machine Learning by Dr Marcos Lopez de Prado
            2) https://mlfinlab.readthedocs.io/en/latest/labeling/tb_meta_labeling.html

This notebook is part of the article series **Alternative Bars on Alpaca**. In the last two parts of the series, we learnt about *alternative bars*, i.e. `tick bar`, `volume bar` and `dollar bar`, and developed a trading strategy with volume bars using the Alpaca Trade API. <br>

In this notebook, we will try to enhance the trading strategy by reducing the number of false-positive signals produced by the strategy. This will be done using a technique called meta-labelling *(AFML, section 3.6, page 50)* by Dr Marcos Lopez de Prado. In brief, meta-labelling looks at the historical returns of a strategy or a model and labels only the profitable trades (returns above a minimum threshold) as 1 and the rest as 0. An ML model can then be trained on these binary labels to decide whether to take a trade position or to avoid it.<br>
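
As a minimal sketch of the labelling rule described above (the helper name `make_meta_labels` and the sample returns are hypothetical, not part of the strategy code), the per-trade returns can be turned into binary meta-labels like this:

```python
import pandas as pd

def make_meta_labels(strategy_returns: pd.Series, min_return: float = 0.0) -> pd.Series:
    """Label trades whose return exceeds min_return as 1 and all others as 0."""
    return (strategy_returns > min_return).astype(int)

# Hypothetical per-trade returns of a primary strategy
trade_returns = pd.Series([0.012, -0.004, 0.0, 0.007, -0.010])
meta_labels = make_meta_labels(trade_returns)
print(meta_labels.tolist())  # [1, 0, 0, 1, 0]
```

A trade with exactly zero return is labelled 0 here, which matches the `y[y <= 0] = 0` step used later in this notebook.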

The analysis is performed on historical volume bars built from SPY ETF trade data from *Jan 1st 2018* to *Dec 31st 2019*, using the dynamic sampling threshold described during the strategy development (refer to article [part-ii](https://alpaca.markets/learn/alternative-bars-02/)).


For generating the meta-labels, I will be using the [mlfinlab](https://mlfinlab.readthedocs.io/en/latest/index.html) Python package developed by [Hudson&Thames.org](https://hudsonthames.org/), and [pyfolio](https://www.quantopian.com/docs/user-guide/tools/pyfolio) by [Quantopian Inc.](https://www.quantopian.com/) for getting the performance metrics. Both packages can be installed with `pip install` or by running the cell below. The notebook also uses the [talib](https://mrjbq7.github.io/ta-lib/doc_index.html) technical analysis package to generate the Bollinger Bands. If it is not already installed, it can be installed with `pip install TA-Lib` (the Python wrapper may need the underlying TA-Lib C library to be installed first).

In [None]:
!pip install mlfinlab pyfolio 

In [1]:
#Imports
import warnings
warnings.filterwarnings('ignore')

import talib as ta
import numpy as np
import pandas as pd 
import pyfolio as pf
from tqdm import tqdm
import mlfinlab as ml

from sklearn.ensemble import RandomForestClassifier 
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

In [2]:
#defining the strategy parameter
#lookback period for the Bollinger Bands
lookback_period = 15
#TP and SL
tpsl = [10,5]
#bars file
file = '../sample_datasets/analysis/SPY_VBars.csv'
#reading the bars
bars = pd.read_csv(file, index_col=[0], parse_dates=True)
#creating the Bollinger Bands 
bars['UB'], _, bars['LB'] = ta.BBANDS(bars.close, timeperiod=lookback_period, nbdevup=2, nbdevdn=2, matype=0)
bars = bars.dropna()
bars = bars.tz_localize('UTC').tz_convert('US/Eastern')

In [3]:
bars.head()

datetime,open,high,low,close,vwap,cum_volume,cum_ticks,cum_dollar_value,cum_buy_ticks,cum_sell_ticks,cum_buy_volume,cum_sell_volume,cum_buy_dollar_value,cum_sell_dollar_value,UB,LB
2018-01-02 10:28:28.226337024-05:00,268.3,268.38,268.21,268.37,268.272298,877541.0,3979.0,235419900.0,527.0,503.0,128839.0,97547.0,34565270.0,26168580.0,268.448649,267.372684
2018-01-02 10:34:48.258940928-05:00,268.36,268.45,268.32,268.37,268.371774,877577.0,4116.0,235516900.0,472.0,461.0,134167.0,104867.0,36007250.0,28142870.0,268.526686,267.382647
2018-01-02 10:38:25.482391040-05:00,268.37,268.45,267.96,268.02,268.28452,877702.0,3439.0,235473900.0,454.0,484.0,111267.0,110587.0,29851480.0,29669070.0,268.525055,267.437612
2018-01-02 10:44:35.941316096-05:00,268.02,268.3,268.0135,268.23,268.177539,877886.0,4057.0,235429300.0,542.0,516.0,132313.0,104728.0,35482000.0,28085250.0,268.500557,267.57011
2018-01-02 10:51:50.436428032-05:00,268.23,268.3,268.2,268.25,268.243706,877793.0,3551.0,235462400.0,380.0,360.0,93431.0,118681.0,25062160.0,31835510.0,268.519926,267.616074


In [4]:
def get_sides(df):
    """
    A function to get the trade sides either long
    or short from up or down cross of the price
    from the Bollinger Bands according to the 
    strategy.
    """
    #up-cross
    c1U = df.close.shift(1) < df.UB.shift(1)  
    c2U = df.close > df.UB
    #down-cross
    c1D = df.close.shift(1) > df.LB.shift(1) 
    c2D = df.close < df.LB
    #signals
    sides = pd.Series(np.nan, index = df.index)
    #LONG
    sides.loc[(c1U) & (c2U)] = int(1)
    #SHORT
    sides.loc[(c1D) & (c2D)] = int(-1)
    return sides.dropna()

def get_hourly_volatility(close, lookback=10):
    """
    Get the hourly volatility of a price series with
    a given decay span.
    """
    timedelta = pd.Timedelta('1 hours')
    df0 = close.index.searchsorted(close.index - timedelta)
    df0 = df0[df0 > 0]
    df0 = (pd.Series(close.index[df0 - 1], index=close.index[close.shape[0] - df0.shape[0]:]))
    df0 = close.loc[df0.index] / close.loc[df0.array].array - 1  # hourly returns
    df0 = df0.ewm(span=lookback).std()
    return df0

def get_vertical_barrier(close, sides):
    """
    This function outputs the timestamps where the
    position is closed because the side flips and a
    counter position has to be taken while holding
    the opposite position.
    
    This timestamp is treated as the vertical barrier,
    i.e. the point where we close the position when
    neither the TP nor the SL has been hit.
    """
    #get the positions where side flips
    t1 = pd.Series(pd.NaT, index=close.index)
    #use positional access (.iloc) since sides is indexed by timestamps
    prev_side = sides.iloc[0]
    last_update = close.index[0]
    for i in range(1, len(sides)):
        if (sides.iloc[i] + prev_side) == 0:
            #switch position i.e. close the current position and take a counter position
            t1[last_update:sides.index[i]] = sides.index[i]
            last_update = sides.index[i]
        prev_side = sides.iloc[i]
    t1 = t1.fillna(close.index[-1])
    return t1

def get_returns(bars, tpsl):
    """
    A function to get the strategy returns from the 
    entry sides, the volatility and the exit conditions 
    of the strategy. The get_events and get_bins 
    functions from mlfinlab compute the returns from 
    these inputs.
    
    :param bars :(pd.DataFrame) bars dataframe.
    :param tpsl :(list) TP and SL for the strategy.
                    
    :return : (pd.DataFrame) a dataframe of the sides
             generated by the strategy and the returns
             for those.
    """
    #signals i.e. LONG/SHORT (1/-1)
    sides = get_sides(bars)
    #hourly volatility
    vol = get_hourly_volatility(bars.close)
    #vertical barrier 
    t1 = get_vertical_barrier(bars.close, sides)
    #get the 3B events 
    triple_barrier_events = ml.labeling.get_events(close=bars['close'],
                                                   t_events=sides.index,
                                                   pt_sl=tpsl,
                                                   target=vol,
                                                   min_ret=0.0,
                                                   num_threads=4,
                                                   vertical_barrier_times=t1,
                                                   side_prediction=sides)
    labels = ml.labeling.get_bins(triple_barrier_events, bars['close'])
    return labels[['ret', 'side']]

In [5]:
#get the returns and sides for the bar sets
ordinary_returns = get_returns(bars, tpsl)

2020-11-16 15:42:37.411472 100.0% apply_pt_sl_on_t1 done after 0.16 minutes. Remaining 0.0 minutes.


In [6]:
#creating a copy of the bar set
X = bars.copy()
#adding the sides to the dataframe 
X['side'] = ordinary_returns['side']
#adding log returns, volatility, momentum and RSI as features 
#more relevant features can be added here to improve the ML model
X['returns'] = np.log(bars['close']).diff()
X['volatility'] = bars['close'].rolling('H').std()
X['momentum_5'] = bars['close'].pct_change(5)
X['rsi_5'] = ta.RSI(bars['close'], 5)

#formatting the returns and the dataframe to remove NaN values
X['strat_returns'] = ordinary_returns.ret
X = X.dropna()
strat_ret = X['strat_returns']
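#drop the label source and the raw price columns (keyword form; a positional axis argument fails on recent pandas)
X = X.drop(columns=['strat_returns', 'open', 'high', 'low', 'close', 'cum_ticks', 'cum_dollar_value'])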
#converting the returns to binary labels
y = np.sign(strat_ret)
y[y <= 0] = 0

In [7]:
X.head()

datetime,vwap,cum_volume,cum_buy_ticks,cum_sell_ticks,cum_buy_volume,cum_sell_volume,cum_buy_dollar_value,cum_sell_dollar_value,UB,LB,side,returns,volatility,momentum_5,rsi_5
2018-01-02 15:23:02.304551936-05:00,268.394006,878054.0,451.0,446.0,140277.0,119020.0,37649920.0,31944000.0,268.39473,268.213937,1.0,3.7e-05,0.043906,0.000298,77.701663
2018-01-02 15:43:33.539259904-05:00,268.49884,877547.0,351.0,352.0,91038.0,80807.0,24443930.0,21696280.0,268.50527,268.20273,1.0,0.000186,0.078601,0.000447,69.628955
2018-01-02 15:52:45.711525120-05:00,268.675839,880038.0,433.0,415.0,107477.0,143749.0,28876570.0,38620050.0,268.679802,268.172865,1.0,0.000372,0.1338,0.001342,84.656739
2018-01-03 12:41:31.885147904-05:00,270.084926,961161.0,442.0,428.0,118977.0,141481.0,32135080.0,38213300.0,270.123896,269.740104,1.0,0.000426,0.101719,0.001093,75.002486
2018-01-03 15:40:10.644578048-05:00,270.499555,961156.0,354.0,342.0,132018.0,107326.0,35711920.0,29031430.0,270.522831,269.954369,1.0,0.000299,0.111557,0.00111,84.864758


In [8]:
#meta model
def train_model(X, y, split_date):
    """
    A function to train a Random Forest as meta model 
    on the given features (X) and labels (y) and return 
    the predictions for out-of-sample (OOS) validation.
    """
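    #split on the timestamp so that bars on split_date itself fall only in the validation set
    #(label slices such as X[:split_date] and X[split_date:] both include that day)
    split_ts = pd.Timestamp(split_date, tz=X.index.tz)
    is_train = X.index < split_ts
    X_train, y_train, X_val, y_val = X[is_train], y[is_train], X[~is_train], y[~is_train]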
    #defining a random forest model
    model = RandomForestClassifier(n_estimators=800, max_depth=7, criterion='entropy', random_state=1, n_jobs=-1)
    #fitting the model
    model.fit(X_train, y_train)
    # OOS prediction
    y_pred = model.predict(X_val)
    #displaying the model's performance metrics out-of-sample (OOS)
    print(f'(OOS) Accuracy : {accuracy_score(y_val, y_pred)}')
    print(f'(OOS) Precision : {precision_score(y_val, y_pred)}')
    print(f'(OOS) Recall : {recall_score(y_val, y_pred)}')
    print(f'(OOS) Confusion Matrix : \n {confusion_matrix(y_val, y_pred)}')
    return y_pred

In [9]:
test_from_date = '2019-10-01'
meta_signals = train_model(X, y, test_from_date)
#get the normal return for the test period
normal_rets = strat_ret[test_from_date:]
#get the returns with the signals from meta-model for the test period
rets_with_meta_model = strat_ret[test_from_date:] * meta_signals

(OOS) Accuracy : 0.5172413793103449
(OOS) Precision : 0.4431818181818182
(OOS) Recall : 0.8478260869565217
(OOS) Confusion Matrix : 
 [[21 49]
 [ 7 39]]
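
The accuracy, precision and recall printed above follow directly from the confusion matrix. The short check below reproduces them by hand, assuming scikit-learn's layout of rows as true classes and columns as predicted classes, i.e. `[[TN, FP], [FN, TP]]`:

```python
# Counts copied from the confusion matrix printed above
tn, fp = 21, 49   # true label 0: correctly skipped / wrongly traded
fn, tp = 7, 39    # true label 1: wrongly skipped / correctly traded

total = tn + fp + fn + tp
accuracy = (tp + tn) / total      # share of correct decisions
precision = tp / (tp + fp)        # share of taken trades that were profitable
recall = tp / (tp + fn)           # share of profitable trades that were taken

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
# 0.5172 0.4432 0.8478
```

Read this way, the meta-model lets 88 of the 116 signals through and filters out 28, of which 21 were losing trades.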


### Plain Strategy Performance

In [10]:
pf.show_perf_stats(normal_rets)

Start date,2019-10-01,2019-10-01
End date,2019-12-30,2019-12-30
Total months,5,5
,Backtest,
Annual return,11.1%,
Cumulative returns,5.0%,
Annual volatility,9.3%,
Sharpe ratio,1.18,
Calmar ratio,1.80,
Stability,0.38,
Max drawdown,-6.2%,
Omega ratio,1.23,
Sortino ratio,2.07,
Skew,0.77,


### Strategy with Meta-model Performance

In [11]:
pf.show_perf_stats(rets_with_meta_model)

Start date,2019-10-01,2019-10-01
End date,2019-12-30,2019-12-30
Total months,5,5
,Backtest,
Annual return,27.4%,
Cumulative returns,11.8%,
Annual volatility,8.0%,
Sharpe ratio,3.08,
Calmar ratio,6.74,
Stability,0.69,
Max drawdown,-4.1%,
Omega ratio,1.90,
Sortino ratio,6.78,
Skew,1.27,


## Conclusion

We can see that the meta-model above did not perform very well in terms of accuracy (about 52%) and precision (about 44%). There are many possible reasons for that, such as a lack of tuning and feature selection. These topics don't fall under the scope of this article series but will be discussed later. The goal was to introduce the concept of meta-modelling and meta-labelling as a filtering method to avoid some false positives. <br>
As for the performance, the meta-model *outperforms* the plain strategy on every statistic reported above over the test period: cumulative returns rise from 5.0% to 11.8% and the Sharpe ratio from 1.18 to 3.08, while annual volatility falls from 9.3% to 8.0% and max drawdown shrinks from -6.2% to -4.1%. A lower drawdown and a higher Sharpe ratio were expected, as the main motive was to filter out as many false positives as possible. Risk-averse investors can trade some of the model's recall for higher precision by applying a threshold to the predicted probability from the meta-model (e.g. 60%). This way the investor can reduce max drawdown and volatility further, but will likely sacrifice some return in the process. <br>
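
The probability threshold mentioned above can be sketched as follows. This is an illustration on synthetic data from `make_classification`, not on the SPY features, and the helper `signals_at` is a hypothetical name. The idea is to replace `model.predict` (roughly a 50% cut-off for a binary random forest) with an explicit cut-off on `predict_proba`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

# Synthetic stand-in for the feature matrix and the meta-labels
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, y_train = X_demo[:400], y_demo[:400]
X_val, y_val = X_demo[400:], y_demo[400:]

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X_train, y_train)

# Probability of class 1 ("take the trade") for each validation sample
proba = model.predict_proba(X_val)[:, 1]

def signals_at(threshold):
    """Take a trade only when the predicted probability reaches the threshold."""
    return (proba >= threshold).astype(int)

for threshold in (0.5, 0.6):
    signals = signals_at(threshold)
    print(f'threshold={threshold}: '
          f'trades={signals.sum()}, '
          f'precision={precision_score(y_val, signals, zero_division=0):.3f}, '
          f'recall={recall_score(y_val, signals):.3f}')
```

Raising the threshold can only reduce the number of trades taken, so recall never increases. Precision usually improves but is not guaranteed to, so the cut-off is best chosen on validation data.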

Improving the robustness of the model and of the testing process can involve, among others, the following steps:
- Use more relevant features for training.
- Perform feature selection and feature engineering. 
- Tune the hyper-parameters of the model with cross-validation.
- For testing, I would recommend an online learning setup in which the model is retrained and tested on a moving window. This keeps the model up to date with new information and avoids predicting far into the future, where the model's performance tends to decay.
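
The moving-window setup from the last point could look roughly like the sketch below. It is only an illustration on synthetic data: the function `walk_forward_predict`, the window sizes and the demo features are all assumptions, and the random forest settings are kept small for speed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def walk_forward_predict(X, y, train_size, test_size):
    """Refit on the latest `train_size` rows, predict the next
    `test_size` rows, then slide the window forward."""
    preds = pd.Series(np.nan, index=X.index)
    start = train_size
    while start < len(X):
        stop = min(start + test_size, len(X))
        model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=1)
        model.fit(X.iloc[start - train_size:start], y.iloc[start - train_size:start])
        preds.iloc[start:stop] = model.predict(X.iloc[start:stop])
        start = stop
    # the first `train_size` rows have no out-of-sample prediction
    return preds.dropna().astype(int)

# Synthetic features and labels standing in for X and y
rng = np.random.default_rng(0)
idx = pd.date_range('2019-01-01', periods=300, freq='D')
X_demo = pd.DataFrame(rng.normal(size=(300, 4)), index=idx, columns=list('abcd'))
y_demo = (X_demo['a'] + rng.normal(scale=0.5, size=300) > 0).astype(int)

oos_signals = walk_forward_predict(X_demo, y_demo, train_size=150, test_size=30)
print(len(oos_signals))  # 150
```

Each prediction is made by a model that has only seen the most recent `train_size` rows before it, so no model predicts more than `test_size` rows ahead of its training data.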