---
# Application of CUSUM filter and Meta-labeling for a Trend Following Strategy
---

This notebook evaluates the effectiveness of meta-labeling when applied to a simple trend following strategy. The moving average cross strategy determines the side {-1: Sell, 1: Buy}, while the meta-labeling technique (using a trained random forest) determines the size of the bet {0: Don't trade, 1: Trade}. The daily standard deviation is used to derive the labels in combination with the CUSUM filter.
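As a minimal sketch of how the two decisions combine (the lists below are made-up values, not output from this notebook), the side from the primary model can be multiplied by the size from the meta model to give a position per event:

```python
# Hypothetical illustration: combine the primary model's side with the
# meta model's size decision into a final position for each event.
sides = [1, -1, 1, -1]   # primary model: 1 = buy, -1 = sell
sizes = [1, 0, 0, 1]     # meta model: 1 = trade, 0 = don't trade

# A size of 0 cancels the bet regardless of the side.
positions = [side * size for side, size in zip(sides, sizes)]
print(positions)
```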

In [1]:
import mlfinlab as ml

import numpy as np
import pandas as pd 
import pyfolio as pf 

import timeit

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, classification_report, confusion_matrix, accuracy_score
from sklearn.utils import resample
from sklearn.utils import shuffle

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

import matplotlib.pyplot as plt
%matplotlib inline

ModuleNotFoundError: No module named 'mlfinlab'

---
## Import Data (Dollar Bars)

In [2]:
# Read in Dollar Bars
data = pd.read_csv('./data/financial_data/dollar_bars.csv')
data.index = pd.to_datetime(data['date_time'])
data = data.drop('date_time', axis=1)
# Slice the data starting with:
data = data['2011-09-01':]

NameError: name 'pd' is not defined

---
# Fit the Primary Model (Trend Following Strategy)

A simple moving average crossover strategy using a fast (20-bar) and a slow (50-bar) window.

In [4]:
# Create rolling windows
fast_window = 20
slow_window = 50

# Compute moving average lines
data['fast_mavg'] = data['close'].rolling(window=fast_window, min_periods=fast_window, center=False).mean()
data['slow_mavg'] = data['close'].rolling(window=slow_window, min_periods=slow_window, center=False).mean()
data.head()

# Compute sides
data['side'] = np.nan

# Signals
long_signals = data['fast_mavg'] >= data['slow_mavg']
# Short when the fast average is below the slow average
short_signals = data['fast_mavg'] < data['slow_mavg']

# Adding signals to df
data.loc[long_signals, 'side'] = 1
data.loc[short_signals, 'side'] = -1

# Remove look ahead bias (shifting the signal by 1 (lagging))
data['side'] = data['side'].shift(1)

NameError: name 'data' is not defined
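The crossover rule and the one-bar lag can be sketched in plain Python on a toy price list. This is only an illustration under assumed inputs: the helper names `moving_average` and `crossover_sides` and the tiny windows are hypothetical, not part of the notebook.

```python
def moving_average(values, window):
    """Trailing simple moving average; None until `window` values exist."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out


def crossover_sides(closes, fast_window, slow_window):
    """Side per bar: 1 if fast >= slow, -1 if fast < slow, lagged one bar."""
    fast = moving_average(closes, fast_window)
    slow = moving_average(closes, slow_window)
    raw = []
    for f, s in zip(fast, slow):
        if f is None or s is None:
            raw.append(None)          # not enough history yet
        else:
            raw.append(1 if f >= s else -1)
    # Lag by one bar so the side at bar t only uses information up to t-1.
    return [None] + raw[:-1]


closes = [10, 11, 12, 13, 12, 10, 8, 7]   # made-up prices
sides = crossover_sides(closes, fast_window=2, slow_window=3)
print(sides)
```

The leading `None` entries correspond to the NaN rows that the notebook later removes with `dropna`.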

In [5]:
# Make a copy of the raw data
raw_data = data.copy()

# Clean NaN values
data.dropna(axis=0, how='any', inplace=True)

# Count the values on each side {-1, 1}
data['side'].value_counts()

NameError: name 'data' is not defined

---
# Filter Events (CUSUM Filter)
The CUSUM filter selects the events to label, with a threshold derived from daily volatility. The models then predict what will happen after a CUSUM event is triggered.

In [6]:
# Compute daily volatility
daily_vol = ml.util.get_daily_vol(close=data['close'], lookback=50)

# Apply symmetric CUSUM filter (return timestamps)
cusum_events = ml.filters.cusum_filter(data['close'], threshold=daily_vol['2011-09-01'].mean()*0.5)

# Compute vertical barriers
vertical_barriers = ml.labeling.add_vertical_barrier(t_events=cusum_events, close=data['close'], num_days=1)

NameError: name 'ml' is not defined
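For readers without `mlfinlab` installed, the idea behind a symmetric CUSUM filter can be sketched in plain Python. This is a simplified illustration, not the library's implementation: it assumes the filter accumulates log returns, uses a fixed threshold, and returns positions rather than timestamps. The function name `symmetric_cusum_events` is hypothetical.

```python
import math


def symmetric_cusum_events(prices, threshold):
    """Return the positions at which cumulative log returns drift more than
    `threshold` up or down since the last reset (simplified sketch)."""
    events = []
    s_pos, s_neg = 0.0, 0.0
    for i in range(1, len(prices)):
        log_ret = math.log(prices[i] / prices[i - 1])
        # Upward and downward cumulative sums, floored/capped at zero.
        s_pos = max(0.0, s_pos + log_ret)
        s_neg = min(0.0, s_neg + log_ret)
        if s_neg < -threshold:
            s_neg = 0.0               # reset after a downward event
            events.append(i)
        elif s_pos > threshold:
            s_pos = 0.0               # reset after an upward event
            events.append(i)
    return events


prices = [100, 101, 102, 104, 103, 100, 97, 98]   # made-up prices
events = symmetric_cusum_events(prices, threshold=0.03)
print(events)
```

Because the sums reset after each event, a steady drift produces a sparse set of events instead of one per bar.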

In [7]:
# Profit-take and stop-loss
pt_sl = [1, 2]
min_retracement = 0.005

triple_barrier_events = ml.labeling.get_events(close=data['close'],
                                               t_events=cusum_events,
                                               pt_sl=pt_sl,
                                               target=daily_vol,
                                               min_ret=min_retracement,
                                               num_threads=3,
                                               vertical_barrier_times=vertical_barriers,
                                               side_prediction=data['side'])

NameError: name 'ml' is not defined

In [8]:
# Create an object storing the labels resulting from the triple barrier events
labels = ml.labeling.get_bins(triple_barrier_events, data['close'])
# Count labels on each side
labels.side.value_counts()

NameError: name 'ml' is not defined
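The labeling step can be illustrated with a dependency-free sketch. It assumes a simplified reading of the triple-barrier method with a known side: the return is multiplied by the side, the profit-take and stop-loss barriers are `pt_sl` multiples of the volatility target, and the label is 1 only when the side-adjusted return is positive. The function name `meta_label` and the price paths are hypothetical, and the library's exact behaviour may differ.

```python
def meta_label(path, side, target, pt_mult, sl_mult):
    """Label one event. `path` holds prices from the event bar up to and
    including the vertical barrier. Returns (label, side-adjusted return)."""
    entry = path[0]
    realized = 0.0
    for price in path[1:]:
        realized = side * (price / entry - 1.0)
        # Stop at the first horizontal barrier touched.
        if realized >= pt_mult * target or realized <= -sl_mult * target:
            break
    # If no barrier was touched, `realized` is the return at the vertical barrier.
    return (1 if realized > 0 else 0), realized


target = 0.02                      # made-up volatility target
pt_mult, sl_mult = 1, 2            # same multiples as pt_sl = [1, 2]

label_long, _ = meta_label([100, 101, 103], side=1, target=target,
                           pt_mult=pt_mult, sl_mult=sl_mult)
label_short, _ = meta_label([100, 101, 103], side=-1, target=target,
                            pt_mult=pt_mult, sl_mult=sl_mult)
label_stop, _ = meta_label([100, 95, 110], side=1, target=target,
                           pt_mult=pt_mult, sl_mult=sl_mult)
print(label_long, label_short, label_stop)
```

In the third path the stop-loss is hit at the first bar, so the later recovery to 110 does not change the label.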

---
## Results: Primary Model
Here we measure the predictive power of the primary model (the trend following strategy) on the triple-barrier labels, before the secondary model is applied to determine size (whether to take the bet or not).

In [9]:

primary_forecast = pd.DataFrame(labels['bin'])
primary_forecast['pred'] = 1
primary_forecast.columns = ['actual', 'pred']

# Performance Metrics
actual = primary_forecast['actual']
pred = primary_forecast['pred']
print(classification_report(y_true=actual, y_pred=pred))

print("Confusion Matrix")
print(confusion_matrix(actual, pred))

print('')
print("Accuracy")
print(accuracy_score(actual, pred))

NameError: name 'pd' is not defined

---
## Notes: Primary Model
- Imbalance in the number of 'trades' and 'no trades'
- There are many false positives (more than the number of true positives)
- Matrix: [[TN, FP], [FN, TP]]
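Because the primary model's prediction is set to 1 for every event, the first column of the matrix (TN and FN) is always zero. A small plain-Python sketch with made-up labels shows the layout and the resulting precision and recall; the helper `confusion_counts` is hypothetical and only mirrors the [[TN, FP], [FN, TP]] layout noted above.

```python
def confusion_counts(actual, pred):
    """Binary confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    tn = sum(1 for a, p in zip(actual, pred) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, pred) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, pred) if a == 1 and p == 0)
    tp = sum(1 for a, p in zip(actual, pred) if a == 1 and p == 1)
    return [[tn, fp], [fn, tp]]


actual = [1, 0, 0, 1, 0]          # made-up triple-barrier labels
pred = [1] * len(actual)          # the primary model always takes the trade

matrix = confusion_counts(actual, pred)
(tn, fp), (fn, tp) = matrix
precision = tp / (tp + fp)        # share of taken trades that were profitable
recall = tp / (tp + fn)           # always 1.0 when every event is traded
print(matrix, precision, recall)
```

With an always-trade primary model, recall is trivially 1 and precision equals the share of profitable events, which is the figure the meta model tries to improve.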

---
# Apply Meta-labeling (Meta Model)
Here we train a random forest to determine the size of the bet {0, 1}, since the previous (primary) model has already determined the side of the bet {-1, 1}.

### Features used:
- Volatility
- Serial Correlation
- Log returns at different lags (the same lags as the Serial Correlation)
- Sides from primary model (Moving average strategy)

In [10]:
raw_data.head()

NameError: name 'raw_data' is not defined

In [11]:
# Log Returns
raw_data['log_ret'] = np.log(raw_data['close']).diff()

# Momentum
raw_data['mom1'] = raw_data['close'].pct_change(periods=1)
raw_data['mom2'] = raw_data['close'].pct_change(periods=2)
raw_data['mom3'] = raw_data['close'].pct_change(periods=3)
raw_data['mom4'] = raw_data['close'].pct_change(periods=4)
raw_data['mom5'] = raw_data['close'].pct_change(periods=5)

# Volatility
raw_data['volatility_64'] = raw_data['log_ret'].rolling(window=64, min_periods=64, center=False).std()
raw_data['volatility_32'] = raw_data['log_ret'].rolling(window=32, min_periods=32, center=False).std()
raw_data['volatility_16'] = raw_data['log_ret'].rolling(window=16, min_periods=16, center=False).std()

# Serial Correlation (Takes about 4 minutes)
window_autocorr = 50

raw_data['autocorr_1'] = raw_data['log_ret'].rolling(window=window_autocorr, min_periods=window_autocorr, center=False).apply(lambda x: x.autocorr(lag=1), raw=False)
raw_data['autocorr_2'] = raw_data['log_ret'].rolling(window=window_autocorr, min_periods=window_autocorr, center=False).apply(lambda x: x.autocorr(lag=2), raw=False)
raw_data['autocorr_3'] = raw_data['log_ret'].rolling(window=window_autocorr, min_periods=window_autocorr, center=False).apply(lambda x: x.autocorr(lag=3), raw=False)
raw_data['autocorr_4'] = raw_data['log_ret'].rolling(window=window_autocorr, min_periods=window_autocorr, center=False).apply(lambda x: x.autocorr(lag=4), raw=False)
raw_data['autocorr_5'] = raw_data['log_ret'].rolling(window=window_autocorr, min_periods=window_autocorr, center=False).apply(lambda x: x.autocorr(lag=5), raw=False)

# Get the lagged log returns (t-1 to t-5)
raw_data['log_t1'] = raw_data['log_ret'].shift(1)
raw_data['log_t2'] = raw_data['log_ret'].shift(2)
raw_data['log_t3'] = raw_data['log_ret'].shift(3)
raw_data['log_t4'] = raw_data['log_ret'].shift(4)
raw_data['log_t5'] = raw_data['log_ret'].shift(5)

NameError: name 'np' is not defined
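The serial correlation features are lag-k autocorrelations of log returns over a rolling window. As a rough sketch of what one such value measures, the function below computes the Pearson correlation between a window and its lagged copy in plain Python. The name `autocorr` and the toy returns are assumptions for illustration, and the result is not guaranteed to match the pandas output to the last digit.

```python
import math


def autocorr(values, lag):
    """Pearson correlation between values[lag:] and values[:-lag] (lag >= 1)."""
    x = values[lag:]
    y = values[:-lag]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up returns that flip sign every bar: strongly mean-reverting.
alternating = [0.01, -0.01] * 5
r1 = autocorr(alternating, lag=1)   # close to -1: each return opposes the last
r2 = autocorr(alternating, lag=2)   # close to +1: pattern repeats every 2 bars
print(round(r1, 6), round(r2, 6))
```

A negative lag-1 value of this kind would suggest mean reversion, which tends to work against a trend following signal, so it is a plausible input for the meta model.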

In [None]:
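# Recalculate the sides (with features added)
raw_data['side'] = np.nan
long_signals = raw_data['fast_mavg'] >= raw_data['slow_mavg']
short_signals = raw_data['fast_mavg'] < raw_data['slow_mavg']
raw_data.loc[long_signals, 'side'] = 1
raw_data.loc[short_signals, 'side'] = -1
# Lag the signal by one bar to remove look-ahead bias (as in the primary model)
raw_data['side'] = raw_data['side'].shift(1)

# Get features at the labeled event timestamps and drop unwanted columns
X = raw_data.loc[labels.index, :]
X = X.drop(['open', 'high', 'low', 'close', 'cum_vol', 'cum_dollar', 'cum_ticks', 'fast_mavg', 'slow_mavg'], axis=1)

y = labels['bin']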