Fork of https://www.kaggle.com/alexfir/recreating-target that uses min_periods=3750 in the rolling windows. When there is a gap in the data within the previous 3750 rows, βM(t) = 0.

In this notebook we try to recreate the Target calculation described in the [Tutorial to the G-Research Crypto Competition](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition/notebook#Building-your-prediction-model).


In [None]:
import os
import numpy as np
import pandas as pd

directory = '../input/g-research-crypto-forecasting'
file_path = os.path.join(directory, 'train.csv')
dtypes = {
    'timestamp': np.int64,
    'Asset_ID': np.int8,
    'Count': np.int32,
    'Open': np.float64,
    'High': np.float64,
    'Low': np.float64,
    'Close': np.float64,
    'Volume': np.float64,
    'VWAP': np.float64,
    'Target': np.float64,
}
data = pd.read_csv(file_path, dtype=dtypes)
data['Time'] = pd.to_datetime(data['timestamp'], unit='s')
data.drop('timestamp', axis=1, inplace=True)

file_path = os.path.join(directory, 'asset_details.csv')
details = pd.read_csv(file_path)

Then calculate the return using the formula below.

$$R^a(t) = \log (P^a(t+16)\ /\ P^a(t+1))$$

This is done for each asset separately. We do not know which price should be used. There are five different prices: Open, High, Low, Close, and VWAP. There may be a mix, such as the Open price for time + 1 min and the Close price for time + 16 min. We use the **Close** price in the calculation below.

In [None]:
price_column = 'Close'
ids = list(details.Asset_ID)
chunks = []
for asset_id in ids:
    asset = data[data.Asset_ID == asset_id].copy()
    asset.sort_values(by='Time', inplace=True)
    asset.set_index(keys='Time', inplace=True)
    # Look up the price 1 and 16 minutes ahead by timestamp,
    # so a missing minute yields NaN instead of a neighbouring row.
    asset['p1'] = asset[price_column].shift(freq='-1min')
    asset['p16'] = asset[price_column].shift(freq='-16min')
    asset['r'] = np.log(asset.p16 / asset.p1)
    asset.drop(['p1', 'p16'], axis=1, inplace=True)
    asset.reset_index(inplace=True)
    chunks.append(asset)

data = pd.concat(chunks)
data.sort_values(by='Time', inplace=True)
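
To see what the timestamp-based `shift` in the cell above does around a gap, here is a minimal sketch on made-up prices. The toy values and the 3-minute horizon (standing in for 16) are assumptions chosen to keep the example small.

```python
import numpy as np
import pandas as pd

# Toy minute-level closes with one missing minute (00:03 is absent).
times = pd.to_datetime([
    '2021-01-01 00:00', '2021-01-01 00:01', '2021-01-01 00:02',
    '2021-01-01 00:04', '2021-01-01 00:05',
])
toy = pd.DataFrame({'Close': [100.0, 101.0, 102.0, 104.0, 105.0]}, index=times)

# Shift by time, not by row count: each row looks up the price at
# exactly t + 1 min and t + 3 min (3 stands in for the notebook's 16).
toy['p1'] = toy['Close'].shift(freq='-1min')
toy['p3'] = toy['Close'].shift(freq='-3min')

# The return is NaN whenever either future price is missing.
toy['r'] = np.log(toy['p3'] / toy['p1'])
print(toy)
```

Only the 00:01 row gets a return here, because it is the only row for which both future timestamps exist. A row-based `shift(-1)` would instead silently pair 00:02 with the price at 00:04.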

Next, assign a weight to each row and calculate M(t). Note that M(t) is the same for all assets and depends only on time.

$$M(t) = \frac{\sum_a w^a R^a(t)}{\sum_a w^a}$$

We do not know whether $\sum_a w^a$ should be summed over all assets or only over the assets that have data at time $t$.


In [None]:
data['w'] = data['Asset_ID'].map(details.set_index(keys='Asset_ID')['Weight'])
weight_sum = details.Weight.sum()

data['weighted_asset_r'] = data.w * data.r
time_group = data.groupby('Time')

m = time_group['weighted_asset_r'].sum() / time_group['w'].sum()
#m = time_group['weighted_asset_r'].sum() / weight_sum

data.set_index(keys=['Time'], inplace=True)
data['m'] = m
data.reset_index(inplace=True)
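
The choice of denominator only matters when some asset is missing a return at time t. The sketch below uses made-up weights and returns (all values are assumptions for illustration) to compare dividing by the weights of every row present at that time, as the active line above does, with dividing by the weights of only those rows whose return is available.

```python
import numpy as np
import pandas as pd

# Hypothetical weights and returns for three assets at two timestamps.
# Asset 2 has a row at 00:01 but no return there.
toy = pd.DataFrame({
    'Time': pd.to_datetime(['2021-01-01 00:00'] * 3 + ['2021-01-01 00:01'] * 3),
    'Asset_ID': [0, 1, 2, 0, 1, 2],
    'w': [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    'r': [0.01, 0.02, 0.03, 0.01, 0.02, np.nan],
})
toy['wr'] = toy['w'] * toy['r']
grouped = toy.groupby('Time')

# Option A: weights of every row present at that time, even if r is NaN.
# The numerator skips NaN, the denominator does not.
m_rows_present = grouped['wr'].sum() / grouped['w'].sum()

# Option B: weights of only those rows whose return is available.
w_valid = toy['w'].where(toy['r'].notna())
m_valid_only = grouped['wr'].sum() / w_valid.groupby(toy['Time']).sum()

print(pd.DataFrame({'rows_present': m_rows_present, 'valid_only': m_valid_only}))
```

The two options agree at 00:00 and differ by a factor of two at 00:01. Which one matches the competition's definition is not known here.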

After that, beta is calculated. The brackets $\langle \cdot \rangle$ denote the rolling average over time (3750-minute windows).

$$\beta^a = \frac{\langle M \cdot R^a \rangle}{\langle M^2 \rangle}$$

In [None]:
data['m2'] = data.m ** 2
data['mr'] = data.r * data.m

chunks = []
for asset_id in ids:
    asset = data[data.Asset_ID == asset_id].copy()
    asset.sort_values(by='Time', inplace=True)
    asset.set_index(keys='Time', inplace=True)
    # min_periods=3750 requires a complete window: if any of the previous
    # 3750 minutes is missing or NaN, the rolling mean is NaN.
    # The original notebook used min_periods=1 here.
    asset['mr_rolling'] = asset['mr'].rolling(window='3750min', min_periods=3750).mean()
    asset['m2_rolling'] = asset['m2'].rolling(window='3750min', min_periods=3750).mean()
    asset.reset_index(inplace=True)
    chunks.append(asset)

data = pd.concat(chunks)
data.sort_values(by='Time', inplace=True)
data['beta'] = data['mr_rolling'] / data['m2_rolling']
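
The effect of `min_periods` on a time-based window is easiest to see at a small scale. This is a minimal sketch on a made-up series, with a 4-minute window standing in for the 3750-minute one; the values and the position of the gap are assumptions.

```python
import numpy as np
import pandas as pd

# Ten minute-level observations; the minute 00:05 is missing.
idx = pd.date_range('2021-01-01 00:00', periods=11, freq='min').delete(5)
s = pd.Series(np.arange(10, dtype=float), index=idx)

# Same window, two settings of min_periods.
loose = s.rolling(window='4min', min_periods=1).mean()   # averages whatever is there
strict = s.rolling(window='4min', min_periods=4).mean()  # needs a full window

print(pd.DataFrame({'value': s, 'loose': loose, 'strict': strict}))
```

With `min_periods=4` the result is NaN for the first three rows and for the three rows after the gap, until the window again contains four observations. NaN values inside the window are not counted as observations either, so a missing return has the same effect as a missing row.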

And finally the Target is calculated.

$$\text{Target}^a(t) = R^a(t) - \beta^a M(t)$$

In [None]:
data['Target_recreated'] = data['r'] - data['beta'] * data['m']
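
The note at the top of this notebook describes βM(t) as 0 when the window has a gap, while the line above leaves `Target_recreated` as NaN wherever `beta` is NaN. If one wanted to apply that rule explicitly, one possible reading (an assumption, not something the competition confirms) is to substitute 0 for a missing beta. A minimal sketch on made-up numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical returns, market returns, and betas; beta is NaN in rows 0 and 2.
toy = pd.DataFrame({
    'r': [0.010, 0.020, -0.005],
    'm': [0.004, 0.006, 0.002],
    'beta': [np.nan, 1.5, np.nan],
})

# Assumption: a missing beta means the beta * M(t) term is dropped,
# so the target falls back to the raw return r.
beta_filled = toy['beta'].fillna(0.0)
toy['Target_recreated'] = toy['r'] - beta_filled * toy['m']
print(toy)
```

Rows where `r` itself is NaN would still produce NaN under this rule.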

Now we compare the given and the recreated Target.

In [None]:
data['Target_diff'] = np.abs(data['Target'] - data['Target_recreated'])

print(f'Average absolute error {data.Target_diff.mean()}')
print(f'Max absolute error {data.Target_diff.max()}')

In [None]:
data['Target'].agg(['min', 'max'])

In [None]:
(data.Target_recreated < -0.509351).sum()

In [None]:
(data.Target_recreated > 0.96417).sum()

Fraction of rows with a missing beta (NaN), per asset

In [None]:
data['betaNA'] = data['beta'].isna().astype('int')
data.groupby('Asset_ID').betaNA.mean()