# Label financial data

When creating labels from financial data, it is not enough to compare the return over a fixed horizon with a fixed threshold. In a real trading environment a position is closed as soon as one of three limits is reached, so the labels should account for all three. The triple-barrier method defines these limits as a profit-taking barrier, a stop-loss barrier, and an expiration barrier.
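
To make the idea concrete, here is a minimal sketch of the labeling rule for a single event, written in plain Python rather than with mlfinlab. The function name `triple_barrier_label`, its parameters, and the toy prices are assumptions made for illustration; the library's own implementation may differ in detail.

```python
def triple_barrier_label(prices, start, upper_ret, lower_ret, max_holding):
    """Label one event that starts at index `start`.

    Returns (label, index): 1 if the profit-taking barrier is touched first,
    -1 if the stop-loss barrier is touched first, 0 if the position expires.
    """
    entry = prices[start]
    # The expiration barrier is capped at the last available bar.
    end = min(start + max_holding, len(prices) - 1)
    for i in range(start + 1, end + 1):
        ret = prices[i] / entry - 1.0
        if ret >= upper_ret:       # profit-taking barrier
            return 1, i
        if ret <= -lower_ret:      # stop-loss barrier
            return -1, i
    return 0, end                  # expiration barrier


# Toy price path (hypothetical values, not taken from the sample data).
prices = [100.0, 100.5, 101.2, 99.0, 98.0, 100.0]
print(triple_barrier_label(prices, 0, 0.01, 0.01, 5))
print(triple_barrier_label(prices, 2, 0.05, 0.02, 3))
print(triple_barrier_label(prices, 0, 0.05, 0.05, 3))
```

The three calls exercise the three possible outcomes: a profit-taking touch, a stop-loss touch, and an expiration without either horizontal barrier being reached.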

In [1]:
import mlfinlab as ml

import numpy as np
import pandas as pd



In [2]:
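# Load the dollar bars indexed by timestamp; later cells refer to this frame as `data`
data = pd.read_csv('sample_dollar_bars.csv', nrows=40000,
                   index_col='date_time', parse_dates=True)
data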

date_time,open,high,low,close,cum_vol,cum_dollar,cum_ticks
2011-07-31 23:31:58.810,1306.00,1308.75,1301.75,1305.75,53658,70035704.75,14115
2011-08-01 02:55:17.443,1305.75,1309.50,1304.00,1306.50,53552,70006277.00,15422
2011-08-01 07:25:56.319,1306.75,1309.75,1304.75,1305.00,53543,70000901.00,14727
2011-08-01 08:33:10.903,1305.00,1305.00,1299.00,1300.00,53830,70094217.75,14987
2011-08-01 10:51:41.842,1300.00,1307.75,1299.00,1307.75,53734,70033006.25,14499
...,...,...,...,...,...,...,...
2012-07-30 12:30:28.642,1379.25,1380.00,1377.50,1377.75,50843,70116589.50,17923
2012-07-30 13:29:21.258,1377.75,1380.00,1377.00,1379.25,50782,70014483.25,14040
2012-07-30 13:35:05.407,1379.25,1383.25,1379.00,1382.50,50675,70001889.25,12017
2012-07-30 13:43:43.711,1382.50,1383.25,1380.00,1381.00,50667,70002243.75,13904


## Sample the events

Sample the events with the CUSUM filter, using the mean of the estimated daily volatility as the threshold.

In [4]:
vol = ml.util.get_daily_vol(close=data['close'], lookback=50)
cusum_events = ml.filters.cusum_filter(data['close'], threshold=vol.mean())
cusum_events

DatetimeIndex(['2011-08-01 13:46:23.650000', '2011-08-01 14:03:22.782000',
               '2011-08-01 15:38:23.090000', '2011-08-01 19:25:42.891000',
               '2011-08-02 12:27:07.195000', '2011-08-02 16:48:53.474000',
               '2011-08-02 19:42:30.586000', '2011-08-03 14:23:36.205000',
               '2011-08-03 15:13:42.802000', '2011-08-03 17:43:21.280000',
               ...
               '2012-07-23 07:04:48.948000', '2012-07-23 13:12:31.459000',
               '2012-07-23 19:24:55.447000', '2012-07-24 14:48:20.985000',
               '2012-07-24 16:28:02.366000', '2012-07-25 11:39:30.331000',
               '2012-07-26 10:24:12.595000', '2012-07-26 12:41:28.312000',
               '2012-07-27 12:30:12.984000', '2012-07-27 16:33:45.042000'],
              dtype='datetime64[ns]', length=667, freq=None)
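
For intuition about what the filter above does, the following is a minimal sketch of a symmetric CUSUM filter in plain Python. It assumes the filter accumulates log returns and resets after each event; the function name `cusum_events` and the toy prices are hypothetical, and `ml.filters.cusum_filter` may differ in detail.

```python
import math


def cusum_events(prices, threshold):
    """Return the indices where cumulative log-return drift exceeds `threshold`."""
    events = []
    s_pos = 0.0  # running sum of upward drift, floored at 0
    s_neg = 0.0  # running sum of downward drift, capped at 0
    for i in range(1, len(prices)):
        r = math.log(prices[i] / prices[i - 1])
        s_pos = max(0.0, s_pos + r)
        s_neg = min(0.0, s_neg + r)
        if s_neg < -threshold:
            s_neg = 0.0            # reset after a downward event
            events.append(i)
        elif s_pos > threshold:
            s_pos = 0.0            # reset after an upward event
            events.append(i)
    return events


# Toy prices (hypothetical): a run-up followed by a sharp drop.
prices = [100.0, 100.2, 100.4, 101.5, 101.4, 100.0, 99.9, 100.0]
print(cusum_events(prices, threshold=0.01))
```

Because the sums reset after every event, a sustained move produces a handful of events rather than one per bar, which is why the filter yields far fewer timestamps than there are bars.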

## Compute vertical barrier

For each event, find the timestamp of the first bar at or after one day later. This timestamp is the expiration limit (vertical barrier).

In [5]:
# Compute vertical barrier
vertical_barriers = ml.labeling.add_vertical_barrier(cusum_events,
                                                     data['close'],
                                                     num_days=1)
vertical_barriers

2011-08-01 13:46:23.650   2011-08-02 13:50:40.053
2011-08-01 14:03:22.782   2011-08-02 14:04:29.869
2011-08-01 15:38:23.090   2011-08-02 15:49:00.114
2011-08-01 19:25:42.891   2011-08-02 19:26:07.927
2011-08-02 12:27:07.195   2011-08-03 13:07:43.154
                                    ...          
2012-07-25 11:39:30.331   2012-07-26 11:53:52.356
2012-07-26 10:24:12.595   2012-07-27 11:37:04.490
2012-07-26 12:41:28.312   2012-07-27 12:47:21.434
2012-07-27 12:30:12.984   2012-07-30 06:13:28.136
2012-07-27 16:33:45.042   2012-07-30 06:13:28.136
Name: date_time, Length: 667, dtype: datetime64[ns]
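
The lookup above can be sketched with a binary search over the sorted bar timestamps. This is an illustration in plain Python, not the library's code: the function name `vertical_barriers`, the choice to skip events whose barrier would fall past the end of the data, and the toy timestamps are all assumptions.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def vertical_barriers(bar_times, event_times, horizon):
    """Map each event to the first bar at or after `event + horizon`.

    `bar_times` must be sorted in ascending order.
    """
    barriers = {}
    for t in event_times:
        i = bisect_left(bar_times, t + horizon)
        if i < len(bar_times):     # skip events whose barrier lies past the data
            barriers[t] = bar_times[i]
    return barriers


# Toy bar timestamps (hypothetical, irregularly spaced like dollar bars).
bar_times = [
    datetime(2011, 8, 1, 9), datetime(2011, 8, 1, 15),
    datetime(2011, 8, 2, 10), datetime(2011, 8, 2, 16),
    datetime(2011, 8, 3, 9),
]
events = [bar_times[0], bar_times[1], bar_times[4]]
for start, end in vertical_barriers(bar_times, events, timedelta(days=1)).items():
    print(start, "->", end)
```

Since bars are not evenly spaced in time, the barrier usually lands slightly more than one day after the event, as in the output above.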

## Find the time of the first touch

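You can set the width of the horizontal barriers using the target and pt_sl parameters: target is the unit width (here, the volatility estimate), and pt_sl holds the multipliers applied to it for the profit-taking and stop-loss barriers. Events whose target is below min_ret are discarded. If you use meta-labeling, you can pass the output of the primary model, which indicates the side of the bet (long/short), to side_prediction.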

In [6]:
triple_barrier_events = ml.labeling.get_events(close=data['close'],
                                               t_events=cusum_events,
                                               pt_sl=[1, 1],
                                               target=vol,
                                               min_ret=0.01,
                                               num_threads=1,
                                               vertical_barrier_times=vertical_barriers,
                                               side_prediction=None)

The column _t1_ holds the timestamp at which the first of the three barriers is touched.

In [7]:
triple_barrier_events

,t1,trgt,pt,sl
2011-08-04 01:57:00.466,2011-08-04 10:27:24.326000128,0.011841,1,1
2011-08-04 09:53:01.844,2011-08-04 13:50:40.606000128,0.011918,1,1
2011-08-04 19:30:23.101,2011-08-04 20:01:48.966000128,0.010392,1,1
2011-08-04 19:59:41.879,2011-08-05 12:30:19.803000064,0.013613,1,1
2011-08-05 12:30:19.803,2011-08-05 13:51:55.448999936,0.013697,1,1
...,...,...,...,...
2012-06-10 22:00:00.149,2012-06-11 14:00:08.657999872,0.012076,1,1
2012-06-11 13:41:07.758,2012-06-11 19:10:45.499000064,0.010801,1,1
2012-06-29 12:37:47.020,2012-07-02 02:39:19.100999936,0.010435,1,1
2012-06-29 19:59:53.768,2012-07-02 02:39:19.100999936,0.012873,1,1
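
As a rough illustration of how target and pt_sl translate into barrier positions, the sketch below converts them into price levels for a long position. The helper `barrier_levels` and the entry price are hypothetical, and treating a multiplier of 0 as "barrier disabled" is an assumption about the intended behaviour.

```python
import math


def barrier_levels(entry_price, target, pt_sl):
    """Return (upper, lower) price levels for a long position.

    `pt_sl` is (profit-taking multiplier, stop-loss multiplier);
    a multiplier of 0 is assumed to disable that barrier (None).
    """
    pt_mult, sl_mult = pt_sl
    upper = entry_price * (1 + pt_mult * target) if pt_mult > 0 else None
    lower = entry_price * (1 - sl_mult * target) if sl_mult > 0 else None
    return upper, lower


# Hypothetical entry price with a 1.2% target.
print(barrier_levels(1300.0, target=0.012, pt_sl=(1, 1)))   # symmetric barriers
print(barrier_levels(1300.0, target=0.012, pt_sl=(2, 1)))   # wider profit-taking
print(barrier_levels(1300.0, target=0.012, pt_sl=(1, 0)))   # no stop-loss
```

Because the target is a volatility estimate, the barriers widen automatically in turbulent periods and narrow in calm ones.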


## Label the observations

The column _bin_ indicates which barrier was touched first: 1 for the profit-taking barrier, -1 for the stop-loss barrier, and 0 for the vertical barrier. The column _ret_ holds the return realized at that time.

In [8]:
labels = ml.labeling.get_bins(triple_barrier_events, data['close'])
labels

,ret,trgt,bin
2011-08-04 01:57:00.466,-0.012091,0.011841,-1
2011-08-04 09:53:01.844,-0.009812,0.011918,0
2011-08-04 19:30:23.101,-0.006425,0.010392,0
2011-08-04 19:59:41.879,0.016298,0.013613,1
2011-08-05 12:30:19.803,-0.015024,0.013697,-1
...,...,...,...
2012-06-10 22:00:00.149,-0.013089,0.012076,-1
2012-06-11 13:41:07.758,-0.012651,0.010801,-1
2012-06-29 12:37:47.020,0.005572,0.010435,0
2012-06-29 19:59:53.768,-0.001844,0.012873,0
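
The relationship between _ret_, _trgt_, and _bin_ in the table above can be checked with a small rule. This is a sketch that is consistent with the rows shown, assuming pt_sl=[1, 1]; the function `label_from_return` is hypothetical and is not how `ml.labeling.get_bins` is necessarily implemented.

```python
def label_from_return(ret, target, pt_sl=(1, 1)):
    """Return 1, -1, or 0 depending on which barrier the realized return reached."""
    pt_mult, sl_mult = pt_sl
    if pt_mult > 0 and ret >= pt_mult * target:
        return 1       # profit-taking barrier reached
    if sl_mult > 0 and ret <= -sl_mult * target:
        return -1      # stop-loss barrier reached
    return 0           # neither reached: the vertical barrier came first


# (ret, trgt) pairs copied from the first rows of the output above.
rows = [
    (-0.012091, 0.011841),
    (-0.009812, 0.011918),
    (-0.006425, 0.010392),
    (0.016298, 0.013613),
    (-0.015024, 0.013697),
]
print([label_from_return(ret, trgt) for ret, trgt in rows])
```

A return whose magnitude stays below the target means neither horizontal barrier was reached before expiration, which is why those rows are labeled 0.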


## Drop under-populated labels

Some ML classifiers perform better when very rare classes are removed, because the model can then focus on the common cases. For example, you can drop any label that accounts for less than 5% of the observations.

In [9]:
clean_labels = ml.labeling.drop_labels(labels, min_pct=0.05)
clean_labels

,ret,trgt,bin
2011-08-04 01:57:00.466,-0.012091,0.011841,-1
2011-08-04 09:53:01.844,-0.009812,0.011918,0
2011-08-04 19:30:23.101,-0.006425,0.010392,0
2011-08-04 19:59:41.879,0.016298,0.013613,1
2011-08-05 12:30:19.803,-0.015024,0.013697,-1
...,...,...,...
2012-06-10 22:00:00.149,-0.013089,0.012076,-1
2012-06-11 13:41:07.758,-0.012651,0.010801,-1
2012-06-29 12:37:47.020,0.005572,0.010435,0
2012-06-29 19:59:53.768,-0.001844,0.012873,0


In this case, every label accounts for more than 5% of the observations, so none is removed.

In [10]:
clean_labels['bin'].value_counts(normalize=True)

 1    0.400593
 0    0.326409
-1    0.272997
Name: bin, dtype: float64
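
The dropping rule can be sketched as a loop that removes the rarest class until every remaining class meets the minimum share. The version below is a plain-Python illustration: the function `drop_rare_labels` and the toy labels are hypothetical, and stopping once fewer than three classes remain is an assumption about the intended behaviour, not a description of `ml.labeling.drop_labels`.

```python
from collections import Counter


def drop_rare_labels(labels, min_pct=0.05):
    """Repeatedly drop the rarest class while its share is below `min_pct`.

    Stops once fewer than three classes remain, so that a classification
    problem with at least two classes is always left.
    """
    labels = list(labels)
    while labels:
        counts = Counter(labels)
        rarest, n = min(counts.items(), key=lambda kv: kv[1])
        if n / len(labels) >= min_pct or len(counts) < 3:
            break
        labels = [x for x in labels if x != rarest]
    return labels


# Toy labels (hypothetical): class 0 makes up only 2% of the sample.
toy = [1] * 60 + [-1] * 38 + [0] * 2
cleaned = drop_rare_labels(toy, min_pct=0.05)
print(Counter(cleaned))
```

With class shares like those shown above, where every class is well over 5%, the loop exits on the first pass and the labels are returned unchanged.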