1. Given a time series of E-mini S&P 500 futures, compute labels on one-minute
time bars using the fixed-horizon method, where τ is set at two
standard deviations of one-minute returns.

    1. Compute the overall distribution of the labels.
    2. Compute the distribution of labels across all days, for each hour of the 
    trading session.
    3. How different are the distributions in (b) relative to the distribution in (a)? Why?

In [4]:
import sqlite3
import pandas as pd

sp_db_path='../db/sp.db'

con=sqlite3.connect(sp_db_path)

query='''SELECT * FROM sp WHERE "index" BETWEEN '2010-01-01' AND '2020-12-31' '''

df=pd.read_sql(query, con)
print(df)
con.close()

                              index   Close  Volume
0               2010-01-03 17:00:00  1113.2       1
1               2010-01-03 17:00:00  1113.2       1
2               2010-01-03 17:00:00  1113.2       1
3               2010-01-03 17:00:00  1113.2       1
4               2010-01-03 17:00:00  1113.2       1
...                             ...     ...     ...
2950280  2019-12-31 15:00:02.086000  3232.0       0
2950281  2019-12-31 15:00:09.277000  3231.5       0
2950282  2019-12-31 15:00:14.203000  3230.5       0
2950283  2019-12-31 15:01:09.088000  3233.0       0
2950284  2019-12-31 15:03:41.206000  3232.0       0

[2950285 rows x 3 columns]


In [5]:
df['index']=pd.to_datetime(df['index'], format='mixed')
df['minute']=df['index'].dt.floor('min') # floor each tick timestamp to its minute

# one-minute bars: last traded price and total volume per minute
min_df=df.groupby('minute').agg({
    'Close':'last',
    'Volume':'sum'
})

In [6]:
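def get_fixed_horizon_labels(close:pd.Series,horizon:int, tau:float)->pd.Series:
    # forward return over `horizon` bars: close[t+horizon]/close[t] - 1
    shifted_close=close.shift(-horizon)
    horizon_returns=shifted_close/close - 1
    # +1 above tau, -1 below -tau, 0 otherwise; the last `horizon` rows have
    # NaN returns and therefore also fall into the 0 class
    labels=horizon_returns.apply(lambda x: (1 if x>tau else -1) if abs(x)>tau else 0)
    return labels


# tau = two standard deviations of one-minute returns; the horizon is 20 bars
tau=2*min_df['Close'].pct_change().dropna().std()
min_horizon_labels=get_fixed_horizon_labels(min_df['Close'], 20, tau)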

In [7]:
min_label= min_horizon_labels.value_counts(normalize=True)
print(min_label)

Close
 0    0.510622
 1    0.253676
-1    0.235702
Name: proportion, dtype: float64


In [8]:
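# first label observed in each calendar hour; hours without bars give NaN and are dropped by value_counts
hour_label=min_horizon_labels.resample('h').first().value_counts(normalize=True)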
print(hour_label)

Close
 0.0    0.474990
 1.0    0.280764
-1.0    0.244246
Name: proportion, dtype: float64


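The labels sampled at the hourly level contain somewhat more signals (non-zero labels).

Major events such as the market open and close, economic data releases, option expirations, and FOMC announcements are mostly scheduled on the hour. Because the hourly sample takes the first bar of each hour,
it is more likely to capture the data right after such an event, which yields somewhat more signals.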
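
Item 2 of the exercise can also be read as asking for one label distribution per hour of the trading session, pooled over all days. A minimal sketch of that reading follows; it uses a synthetic label series as a stand-in for `min_horizon_labels` (the random data and the names `labels` and `hourly_dist` are assumptions for illustration, not results from the E-mini data).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the minute-level label series
# (values in {-1, 0, 1} indexed by timestamp); not the real E-mini data.
rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", periods=3 * 24 * 60, freq="min")
labels = pd.Series(rng.choice([-1, 0, 1], size=len(idx)), index=idx)

# Share of each label within each hour of the day, pooled over all days.
# Rows are hours 0..23, columns are the labels, and each row sums to 1.
hourly_dist = (
    labels.groupby(labels.index.hour)
    .value_counts(normalize=True)
    .unstack(fill_value=0.0)
)
print(hourly_dist.head())
```

Comparing each row of `hourly_dist` with the overall distribution shows which hours of the session carry more non-zero labels.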

2. Repeat Exercise 1, where this time you label standardized returns 
(instead of raw returns), where the standardization is based on mean and variance
estimates from a lookback of one hour. Do you reach a different conclusion?

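The result is the opposite of the previous one: the minute-level labels (min_df) now contain more signals. This appears to be the effect of the normalization removing the intraday pattern (seasonality) in returns.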

In [9]:
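def get_fixed_normalized_horizon_labels(close:pd.Series, horizon:int, tau:float=2)->pd.Series:
    # forward return over `horizon` bars
    returns=(close.shift(-horizon)/close-1).dropna()

    # mean and std of returns per calendar hour, shifted by one hour so that
    # each bar is standardized with the statistics of the preceding hour
    hour_idx=returns.index.floor('h')
    hour_mean_returns=returns.resample('h').mean().shift(1)
    hour_std_returns=returns.resample('h').std().shift(1)

    hour_mean_aligned=hour_idx.map(hour_mean_returns)
    hour_std_aligned=hour_idx.map(hour_std_returns)

    # normalize returns
    returns_normalized = (returns - hour_mean_aligned) / hour_std_aligned

    # tau is now expressed in standard deviations; NaN z-scores (no data in
    # the preceding hour) fall into the 0 class
    labels=returns_normalized.apply(lambda x: (1 if x>tau else -1) if abs(x)>tau else 0)

    return labels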

normalized_min_horizon_labels=get_fixed_normalized_horizon_labels(min_df['Close'], 20, 2)

normalized_min_label=normalized_min_horizon_labels.value_counts(normalize=True)
normalized_hour_label=normalized_min_horizon_labels.resample('h').first().value_counts(normalize=True)
print(normalized_min_label)
print(normalized_hour_label)


 0    0.685731
-1    0.159442
 1    0.154826
Name: proportion, dtype: float64
 0.0    0.813218
-1.0    0.094507
 1.0    0.092275
Name: proportion, dtype: float64
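
The function above standardizes each return with the statistics of the preceding calendar hour. Another reading of "a lookback of one hour" is a trailing 60-minute window. The sketch below illustrates that variant under stated assumptions: synthetic prices, a one-bar horizon, and a hypothetical `min_periods=30`; the name `standardized_labels` is not part of the notebook.

```python
import numpy as np
import pandas as pd

def standardized_labels(close: pd.Series, tau: float = 2.0) -> pd.Series:
    """Label forward one-bar returns standardized by a trailing one-hour window."""
    ret = close.pct_change()   # realized return over (t-1, t], known at t
    fwd = ret.shift(-1)        # forward return over (t, t+1], the quantity labeled
    # Trailing 60-minute statistics use only returns observed up to time t.
    roll = ret.rolling("60min", min_periods=30)
    z = (fwd - roll.mean()) / roll.std()
    z = z.replace([np.inf, -np.inf], np.nan).dropna()
    # np.sign gives the direction; the mask zeroes out sub-threshold moves.
    return (np.sign(z) * (z.abs() > tau)).astype(int)

# Synthetic one-minute prices (assumption, for illustration only).
rng = np.random.default_rng(1)
idx = pd.date_range("2020-01-01 09:00", periods=600, freq="min")
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 1e-4, len(idx)))), index=idx)

labels = standardized_labels(close)
print(labels.value_counts(normalize=True))
```

Unlike the apply-based version, bars without a valid z-score are dropped here rather than counted as 0, so the share of 0 labels is not inflated by missing statistics.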


3. Repeat Exercise 1, where this time you apply the triple-barrier method on
volume bars. The maximum holding period is the average number of bars per
day, and the horizontal barriers are set at two standard deviations of bar
returns. How do results compare to the solutions from Exercises 1 and 2?

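The two distributions are nearly the same. This suggests that the triple-barrier method not only addresses seasonality but also keeps the labeling from ignoring signals (intermediate events) that occur along the path inside the labeling window. In addition, the labels 1 and -1 are much more frequent than 0, so the method captures market opportunities better.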
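
A toy path makes the point about intermediate events concrete. The sketch below compares a label based only on the end-of-window return with a label based on the first barrier touched; the path values, `tau`, and both helper names are hypothetical and chosen only for illustration.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def fixed_horizon_label(path_returns, tau):
    """Label by the return at the end of the window only."""
    r = path_returns[-1]
    return sign(r) if abs(r) > tau else 0

def first_touch_label(path_returns, tau):
    """Label by whichever horizontal barrier is touched first; 0 if neither."""
    for r in path_returns:
        if r >= tau:
            return 1
        if r <= -tau:
            return -1
    return 0

# Hypothetical cumulative returns along one holding window: the path
# crosses +1% midway and then drifts back toward zero.
path = [0.000, 0.004, 0.011, 0.006, 0.001]
tau = 0.01
print(fixed_horizon_label(path, tau), first_touch_label(path, tau))  # 0 1
```

The fixed-horizon rule sees only the final 0.1% return and assigns 0, while the first-touch rule registers the earlier crossing of the upper barrier and assigns 1.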


In [15]:
from typing import Tuple
from joblib import Parallel, delayed
from tqdm import tqdm
import numpy as np
from tqdm_joblib import tqdm_joblib
import statsmodels.api as sm
import multiprocessing

def split_index(index: pd.Index, chunk_size: int = 1000):
    for i in range(0, len(index), chunk_size):
        yield index[i:i + chunk_size]

min_df_daily_bar=min_df.groupby(min_df.index.date).count()
average_daily_bars=int(round(min_df_daily_bar.mean().iloc[0]))

# set vertical barrier: the average number of bars per day, applied as minutes on the one-minute bars
t1=min_df.index.searchsorted(min_df.index+pd.Timedelta(minutes=average_daily_bars))
t1=t1[t1<len(min_df)]
t1=pd.Series(data=min_df.index[t1], index=min_df.index[:len(t1)])
print(t1)


def get_triple_barrier_label(close: pd.Series, t1: pd.Series, barrier_width: Tuple[float, float], molecule: pd.Index = None):
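    if molecule is not None:
        # restrict to this chunk; note that price paths are then also cut off
        # at the end of the chunk, so later barrier touches are not seen
        close = close[molecule].copy()
        t1 = t1[molecule].copy()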

    upper_barrier_width, lower_barrier_width = barrier_width

    # declare the columns as datetime64[ns] up front (filled with NaT)
    ret = pd.DataFrame(index=t1.index)
    ret['t1'] = pd.to_datetime(t1)
    ret['sl'] = pd.NaT
    ret['pt'] = pd.NaT

    for h_s, h_e in t1.fillna(close.index[-1]).items():
        path_price = close.loc[h_s:h_e]
        if len(path_price) == 0:
            continue

        path_return = (path_price / path_price.iloc[0] - 1).dropna()

        # first touch time of the lower/upper barrier (stays NaT if never touched)
        sl_idx = path_return.index[path_return <= -lower_barrier_width]
        pt_idx = path_return.index[path_return >=  upper_barrier_width]
        ret.loc[h_s, 'sl'] = sl_idx.min() if len(sl_idx) else pd.NaT
        ret.loc[h_s, 'pt'] = pt_idx.min() if len(pt_idx) else pd.NaT

    # all three columns are datetime64[ns], so min(axis=1) is safe
    first_touch = ret[['t1', 'sl', 'pt']].min(axis=1)

    # assign labels
    ret['label'] = np.select(
        [first_touch.eq(ret['t1']), first_touch.eq(ret['sl']), first_touch.eq(ret['pt'])],
        [0, -1, 1],
        default=0
    )
    ret['t1'] = first_touch
    return ret


def get_triple_barrier_label_molecule(args):
    close, t1, (sl, pt), molecule=args
    return get_triple_barrier_label(close, t1, (sl, pt), molecule)

n_jobs = multiprocessing.cpu_count() - 1
molecules = list(split_index(t1.index, chunk_size=1000))
tau=2*min_df['Close'].pct_change().dropna().std()
with tqdm_joblib(tqdm(total=len(molecules))) as progress_bar:
    results = Parallel(n_jobs=n_jobs)(
        delayed(get_triple_barrier_label_molecule)((min_df['Close'].loc[t1.index], t1, (tau, tau), mol))
        for mol in molecules
    )

triple_barrier_label=pd.concat(results).sort_index()


minute
2010-01-03 17:00:00   2010-01-03 21:39:00
2010-01-03 17:01:00   2010-01-03 21:39:00
2010-01-03 17:02:00   2010-01-03 21:39:00
2010-01-03 17:03:00   2010-01-03 22:04:00
2010-01-03 17:04:00   2010-01-03 22:04:00
                              ...        
2019-12-30 15:03:00   2019-12-31 02:21:00
2019-12-30 15:07:00   2019-12-31 02:21:00
2019-12-31 02:21:00   2019-12-31 08:30:00
2019-12-31 08:30:00   2019-12-31 14:48:00
2019-12-31 08:54:00   2019-12-31 14:48:00
Name: minute, Length: 856970, dtype: datetime64[ns]





In [16]:
min_triple_barrier_labels=triple_barrier_label['label'].value_counts(normalize=True)
hour_triple_barrier_labels=triple_barrier_label.resample('h').first()['label'].value_counts(normalize=True)

print(min_triple_barrier_labels)
print(hour_triple_barrier_labels)

label
 1    0.478149
-1    0.458651
 0    0.063201
Name: proportion, dtype: float64
label
 1.0    0.445743
-1.0    0.418892
 0.0    0.135365
Name: proportion, dtype: float64
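
Exercise 3 asks for volume bars, whereas the cells above operate on the one-minute bars in `min_df`. The following is a minimal sketch of how volume bars could be built from tick data. It assumes a tick frame with the same column names as `df` (`index`, `Close`, `Volume`); the synthetic ticks, the `threshold` of 500, and the function name `volume_bars` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def volume_bars(ticks: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Aggregate ticks into bars that each hold roughly `threshold` volume."""
    # The bar id increases each time cumulative volume passes a multiple of
    # the threshold; a plain array avoids index alignment on duplicate stamps.
    bar_id = (ticks["Volume"].cumsum() // threshold).astype(int).to_numpy()
    bars = ticks.groupby(bar_id).agg(
        time=("index", "last"),      # timestamp of the last tick in the bar
        Close=("Close", "last"),     # last traded price in the bar
        Volume=("Volume", "sum"),    # total volume in the bar
    )
    return bars.set_index("time")

# Synthetic ticks (assumption, for illustration only).
rng = np.random.default_rng(2)
n = 1000
ticks = pd.DataFrame({
    "index": pd.date_range("2020-01-01 09:30", periods=n, freq="s"),
    "Close": 3000 + rng.normal(0, 0.25, n).cumsum(),
    "Volume": rng.integers(1, 10, n),
})

bars = volume_bars(ticks, threshold=500)
# Maximum holding period in bars: the average number of bars per day.
avg_bars_per_day = bars.groupby(bars.index.date).size().mean()
print(bars.head())
print(avg_bars_per_day)
```

With such bars, the vertical barrier would be placed `avg_bars_per_day` bars ahead by position rather than a fixed number of minutes ahead.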


4. Repeat Exercise 1, where this time you apply the trend-scanning method,
with look-forward periods of up to one day. How do results compare to the
solutions from Exercises 1, 2, and 3?

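Because labels are assigned from the sign of the t-value, the biggest difference is that the label 0 does not exist. As in Exercise 3, the result is less sensitive to seasonality, because the method measures the trend directly instead of relying on a fixed horizon.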
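
The quantity behind these labels is the t-value of the slope in a linear fit of price on time. The sketch below computes it in closed form with NumPy only, as an alternative to fitting a full OLS model; the function name `slope_t_value` and the synthetic series are assumptions for illustration.

```python
import numpy as np

def slope_t_value(y: np.ndarray) -> float:
    """t-statistic of the slope b in the OLS fit y = a + b * t, t = 0..n-1."""
    n = len(y)                     # requires n >= 3 and a non-perfect fit
    t = np.arange(n, dtype=float)
    t_c = t - t.mean()             # centered time index
    sxx = (t_c ** 2).sum()
    b = (t_c * (y - y.mean())).sum() / sxx
    resid = y - y.mean() - b * t_c
    sigma2 = (resid ** 2).sum() / (n - 2)   # residual variance, 2 fitted parameters
    return b / np.sqrt(sigma2 / sxx)        # slope divided by its standard error

# Synthetic trending series (assumption, for illustration only).
rng = np.random.default_rng(3)
up = 0.5 * np.arange(50) + rng.normal(0, 1, 50)
down = -0.5 * np.arange(50) + rng.normal(0, 1, 50)
print(np.sign(slope_t_value(up)), np.sign(slope_t_value(down)))  # 1.0 -1.0
```

The sign is never exactly 0 on noisy data, which is why the trend-scanning labels fall only into the two classes -1 and 1.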

In [29]:
from joblib import Parallel, delayed
from tqdm import tqdm
import numpy as np
from tqdm_joblib import tqdm_joblib
import statsmodels.api as sm
from concurrent.futures import ThreadPoolExecutor


def get_t_val_linear(close:np.ndarray)->float:
    # t-value of the slope coefficient in an OLS fit of close on a time index
    x=np.ones((len(close), 2))
    x[:, 1]=np.arange(len(close))
    ols=sm.OLS(close, x).fit()
    return ols.tvalues[1]

def _tval_for_span(cur_index, cur_end_iloc, close):
    span_close_end = close.index[cur_end_iloc]
    span_close = close.loc[cur_index:span_close_end]
    tval = get_t_val_linear(span_close.values)
    return span_close_end, tval

def get_bins_from_trend(module:pd.Index, close:pd.Series, max_look_forward:pd.Timedelta, n_threads:int=8):
    '''
    Get the sign of the t-value from a linear trend.
    '''
    ret=pd.DataFrame(index=module, columns=['t1', 't_val', 'bin'])
    for cur_index in module:
        t_vals=pd.Series()
        start_iloc=close.index.get_loc(cur_index)
        end_iloc=np.searchsorted(close.index, cur_index+max_look_forward, side='right')
        end_range = range(start_iloc + 5, min(len(close.index), end_iloc))
        if start_iloc+5>=min(len(close.index), end_iloc):
            continue

        # compute the t-value of each span with threads
        with ThreadPoolExecutor(max_workers=n_threads) as ex:
            results = list(ex.map(lambda i: _tval_for_span(cur_index, i, close), end_range))

        if not results:
            continue

        idx, vals = zip(*results)
        t_vals = pd.Series(vals, index=pd.Index(idx, name='span_end'), dtype='float64')
        t_vals = pd.to_numeric(t_vals, errors='coerce')
        finite = t_vals[np.isfinite(t_vals.to_numpy())]
        if finite.empty:
            continue
        max_abs_t_val_idx = finite.abs().idxmax()
        ret.loc[cur_index, ['t1', 't_val', 'bin']]=finite.index[-1], finite[max_abs_t_val_idx], np.sign(finite[max_abs_t_val_idx])
    ret['t1']=pd.to_datetime(ret['t1'])
    ret['bin']=pd.to_numeric(ret['bin'], downcast='signed')
    return ret.dropna(subset=['bin'])

def get_bins_from_trend_molecule(args):
    molecule, close, max_look_forward = args
    return get_bins_from_trend(molecule, close, max_look_forward)

import multiprocessing
    
n_jobs = multiprocessing.cpu_count() - 1
max_look_forward = pd.Timedelta(days=1)

# only the first 10,000 one-minute bars are labeled here
molecules = list(split_index(min_df.index[:10000], chunk_size=1000))
with tqdm_joblib(tqdm(total=len(molecules))) as progress_bar:
    results = Parallel(n_jobs=n_jobs)(
        delayed(get_bins_from_trend_molecule)((mol, min_df['Close'], max_look_forward))
        for mol in molecules
    )

# combine the results
trend_label_result = pd.concat(results).sort_index()



100%|██████████| 10/10 [06:09<00:00, 36.95s/it]


In [30]:
print(trend_label_result['bin'].value_counts(normalize=True))
print(trend_label_result.resample('h').first()['bin'].value_counts(normalize=True))

bin
-1.0    0.517217
 1.0    0.482783
Name: proportion, dtype: float64
bin
-1.0    0.516224
 1.0    0.483776
Name: proportion, dtype: float64


5. Using the labels generated in Exercise 3 (triple-barrier method):
    1. Fit a random forest classifier on those labels. Use as features estimates of
    mean return, volatility, skewness, kurtosis, and various differences in moving averages.
    2. Backtest those predictions using as a trading rule the same rule used to
    generate the labels.
    3. Apply meta-labeling on the backtest results.
    4. Refit the random forest on meta-labels, adding as a feature the label predicted in (a).
    5. Size (a) bets according to predictions in (d), and recompute the backtest.
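
Exercise 5 is not worked out here. As a starting point for the meta-labeling and bet-sizing steps, the sketch below shows one simple way to derive meta-labels from a primary model's predicted side and to map a predicted probability to a bet size. Both rules are assumptions for illustration: the meta-label is taken to be 1 when the primary prediction matches the realized triple-barrier label, and the size uses a normal-CDF mapping of the probability. The function names and the example inputs are hypothetical.

```python
import math

def meta_labels(primary_side, realized_label):
    """1 where the primary model took a side and that side was right, else 0."""
    return [
        int(side != 0 and side == label)
        for side, label in zip(primary_side, realized_label)
    ]

def bet_size(p: float) -> float:
    """Map P(meta-label = 1), with 0 < p < 1, to a size in (-1, 1)."""
    # Test statistic of p against 1/2, then 2 * Phi(z) - 1 = erf(z / sqrt(2)).
    z = (p - 0.5) / math.sqrt(p * (1.0 - p))
    return math.erf(z / math.sqrt(2.0))

# Hypothetical primary predictions and realized triple-barrier labels.
side = [1, -1, 1, 0, -1]
label = [1, 1, 1, -1, -1]
print(meta_labels(side, label))  # [1, 0, 1, 0, 1]

# The final position would be the primary side scaled by the size.
for p in (0.5, 0.6, 0.8):
    print(p, bet_size(p))
```

A probability of 0.5 gives a size of 0, and the size grows toward 1 as the secondary model becomes more confident that the primary bet is correct.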