# Supporting Notebook 2: Predictable and unpredictable prices

## Modeling: What to Predict
Choosing how to model stock behavior is not straightforward.

Do we predict prices 1 minute ahead, 10 minutes ahead, or a day ahead?

Or should we predict all?
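
As a minimal sketch of what "predicting all" horizons could look like, the snippet below builds one percent-change target per horizon from a single price series. The prices and the horizons (in steps) are made up for illustration, and the column naming is an assumption modeled on the `[t - 1]` convention used later in this notebook.

```python
import pandas as pd

# Hypothetical price series at consecutive time steps.
price = pd.Series([100.0, 101.0, 99.0, 102.0, 103.0, 104.0])

# Steps ahead; stand-ins for e.g. 1 minute, 10 minutes, or a day.
horizons = [1, 2, 3]

# One target column per horizon: the percent change from t to t + h.
targets = pd.DataFrame({
    'PctChange[t + {}]'.format(h): price.shift(-h) / price - 1.0
    for h in horizons
})
```

The last `h` rows of each column are `NaN`, because the future price is not yet known there.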

What kind of price do we want to predict within an interval:

- EndPrice
- Mean(EndPrice)
- WeightedMean(EndPrice)
- MedianPrice
- or the complete distribution of prices
- if the complete distribution is to be predicted, do we predict its mean, variance, and skew? Do we discretize the distribution, or do we treat the available prices as samples from the distribution that we predict?
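
To make the last two options concrete, here is a small sketch that summarizes the prices within one interval either by a few moments or by a discretized histogram. The sample prices and the number of bins are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical end prices observed within a single interval.
prices = pd.Series([100.0, 100.5, 101.0, 100.8, 102.0, 101.5])

# Option 1: describe the distribution by a few moments.
moments = {
    'mean': prices.mean(),
    'variance': prices.var(),   # sample variance (ddof=1)
    'skew': prices.skew(),      # sample skewness
}

# Option 2: discretize the distribution into a fixed number of bins
# and turn the counts into empirical probabilities.
counts, bin_edges = np.histogram(prices, bins=4)
probabilities = counts / counts.sum()
```

With the moment-based option the model has three regression targets per interval; with the histogram option it has one probability per bin.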

Another decision is whether to predict:
- The percent change of the price
- The sign of the price movement
- The log of the actual price
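
The three target types can be computed from the same series, as in this short sketch. The prices are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical price series at consecutive time steps.
price = pd.Series([100.0, 101.0, 100.0, 102.0])

pct_change = price.pct_change()      # (P[t] - P[t-1]) / P[t-1]
direction = np.sign(price.diff())    # +1 for up, -1 for down, 0 for unchanged
log_price = np.log(price)            # log of the actual price
```

The first entry of `pct_change` and `direction` is `NaN`, since there is no previous price to compare with.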

From a practical perspective, even perfect predictions of future prices may not make any money. This depends on:
- fees: the fee to enter and exit a position should be lower than the expected change in the price
- slippage: on seeing a quoted price, how far above it (when buying) or below it (when selling) one actually executes
- volumes in the market: trading very large volumes disrupts the supply and demand mechanism, so one will not execute at the desired price
- how the broker executes trades
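
The effect of the first two items can be sketched with a toy calculation. The function `net_return`, the proportional fee model, and the symmetric slippage model are all simplifying assumptions made for this illustration; real fee schedules and execution behave differently.

```python
def net_return(entry_quote, exit_quote, fee_rate, slippage_rate):
    """Return on a long position after proportional fees and slippage.

    fee_rate is charged on both entry and exit; slippage_rate moves the
    executed price against the trader on both sides. Both are hypothetical
    simplifications.
    """
    buy_price = entry_quote * (1.0 + slippage_rate)   # execute above the quote
    sell_price = exit_quote * (1.0 - slippage_rate)   # execute below the quote
    cost = buy_price * (1.0 + fee_rate)
    proceeds = sell_price * (1.0 - fee_rate)
    return (proceeds - cost) / cost

# A correctly predicted move of +0.2% ...
gross = (100.2 - 100.0) / 100.0
# ... becomes a loss once 0.1% fees and 0.05% slippage per side are applied.
net = net_return(100.0, 100.2, fee_rate=0.001, slippage_rate=0.0005)
```

In this example the gross return is positive while the net return is negative, which is the situation the list above warns about.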

# Insights: what is predictable

In the current notebook we try to understand what is easier and what is harder to predict.
For example, we find that when we organize the data into intervals of, say, 10, 15, or 30 minutes,
the mean price (`Mean(EndPrice)`) is much easier to predict than the end price `EndPrice`.

We also find that when the target is the percent change `PctChange:X` of a feature `X`, the input features should be normalized in the same way. One way to do this is to divide them by `X[t - 1]`.

One way to avoid this difficulty would be the following:
- generate different linear combinations of prices (averages are also linear combinations, and so are absolute returns)
- compute the logs of all types of prices (or averages of prices)
- we may also attempt to predict linear combinations of prices by taking logs
- notice that this is essentially a non-linear model, because we apply logs to sums of raw input features
- when using logs, care must be taken to avoid numbers close to zero as well as negative numbers
- however, log returns are approximately equal to percent changes, so logs can be avoided
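
The last point can be checked numerically. Since `log(P[t] / P[t-1]) = log(1 + pct)`, the sketch below compares the two quantities for a few illustrative percent changes.

```python
import numpy as np

pct = np.array([0.0001, 0.001, 0.01, 0.1])   # percent changes, as fractions
log_ret = np.log1p(pct)                       # log(1 + pct), the log return
abs_error = np.abs(log_ret - pct)             # grows roughly like pct**2 / 2
```

The gap is negligible for the small moves typical of short intervals and becomes noticeable only for moves of several percent.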

We explore different ways to normalize the prices. One way would be to choose a more stable price as an "anchor", and replace all prices with linear functions of it, such as:
```
AnchorPrice = mean(Price[t - k, t - 1])
NormPrice:X[i] = (Price:X[i] - AnchorPrice)/AnchorPrice, for i = t, t - 1, ...
```
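
A possible pandas version of the pseudo-code above is sketched here. The helper name `anchor_normalize` is hypothetical, and this variant recomputes the anchor for every `t` and normalizes only the current price. Lagged prices used as features at time `t` would be divided by the same anchor.

```python
import pandas as pd

def anchor_normalize(price, k):
    """Normalize each price by the mean of the k preceding prices.

    AnchorPrice[t] = mean(Price[t - k], ..., Price[t - 1]); shift(1)
    keeps the current price out of its own anchor.
    """
    anchor = price.shift(1).rolling(window=k).mean()
    return (price - anchor) / anchor

# Hypothetical prices.
price = pd.Series([100.0, 102.0, 101.0, 103.0, 104.0])
norm_price = anchor_normalize(price, k=2)
```

The first `k` values are `NaN` because the anchor needs `k` past prices.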


Regarding the prediction of an `EndPrice`, we hypothesize that one would do better by:
- predicting two prices, one 10 minutes ahead and one 5 minutes ahead
- taking the difference between the two predictions


In the future we will also explore the possibility of predicting the entire distribution of prices, instead of just a single end price or averaged prices.


In [4]:
import pandas as pd
import numpy as np

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
mpl.rcParams['figure.figsize'] = (5, 3) # use bigger graphs

As usual, we first load the data we prepared in notebook 2.

In [6]:
input_file = '/data/cooked_v3.pkl'
df = pd.read_pickle(input_file)
df['CalcDateTime'] = df.index

Next we prepare a dataset consisting of a single stock and compute derived features
(percent change) for a number of price features.

In [289]:
price_features = ['MaxPrice', 'MinPrice', 'LastEndPrice', 'FirstStartPrice', 'MeanEndPrice', 'MedianEndPrice', 
                  'MeanStartEndPrice', 'MeanMaxMinPrice', 'MeanAvg4Price',
                  'Direction1', 'Direction2', 'StdEndPrice', 'VolumeWeightedEndPrice']

log_ret_features = ['MaxPrice', 'MinPrice', 'LastEndPrice', 'FirstStartPrice', 'MeanEndPrice', 'MedianEndPrice', 
                  'MeanStartEndPrice', 'MeanMaxMinPrice', 'MeanAvg4Price']

indicator_features = [
    'F1', 'F2', 'F3', 'F4', 'F5', 'F6', 'F7', 'F8',
    'G1', 'G2', 'G3', 'G4'
]

def pct_change_of(feature):
    return 'PctChange:' + feature
    
def log_return_f(feature):
    return 'LogReturn:' + feature

def adj_log_return_f(feature):
    return 'AdjLogReturn:' + feature

def rev_pct_change_of_at_t(feature, t):
    return 'RevPctChange[t - {}]:{}'.format(str(t), feature)
    
def shifted(feature):
    return feature + '[t - 1]'

def closer_to(pnt, a, b):
    return (np.absolute(pnt - a) - np.absolute(pnt - b))/pnt

def closer_to_with_normalization(pnt, a, b, norm):
    return (np.absolute(pnt - a) - np.absolute(pnt - b))/norm

def closer_to_or(pnt1, a, pnt2, b):
    return 2.0*(np.absolute(pnt1 - a) - np.absolute(pnt2 - b))/(pnt1 + pnt2)

def rev_pct_change(a, t):
    one_step_in_past = a.shift(1)
    t_steps_in_past = a.shift(1 + t)
    return (one_step_in_past - t_steps_in_past)/one_step_in_past
    
def log_return(a):
    return np.log(a) - np.log(a.shift(1))

def adj_log_return(a, norm):
    return np.log(a) - np.log(norm)

def norm_feature(feature_family, t, norm_feature):
    return "{}@WithNorm({})[t - {}]".format(feature_family, norm_feature, t)    
    
def weighted_mean(prices, volumes, interval):
    prices_times_volumes = prices * volumes
    num = prices_times_volumes.resample(interval).sum()
    denom = volumes.resample(interval).sum()
    return num/denom
    
# use this to experiment with custom weights    
def custom_linear_comb(prices):
    prices = prices.values[:,]
    prices = prices[~np.isnan(prices)]
    if prices.shape[0] == 0:
        return np.nan
    weights = np.zeros_like(prices)
    if prices.shape[0] < 3:
        return prices[-1]
    weights[-1] = 1.0
    weights[-2] = 0.5
    return np.sum(np.multiply(prices, weights))/np.sum(weights)    

def prepare_single_stock(mnemonic, interval):
    # TODO: add traded volume to averaging of prices, also traded volume weighted differently (e.g. exponential weighting)
    # TODO: one can weight max/min/start/end price differently depending on the move
    # TODO: add exponential weighting of the prices within a window (also there are different ways to center)
    single_stock = df[df.Mnemonic == mnemonic].copy()
    single_stock['StartEndPrice'] = 0.5*(single_stock['StartPrice'] + single_stock['EndPrice'])
    single_stock['MaxMinPrice'] = 0.5*(single_stock['MaxPrice'] + single_stock['MinPrice'])
    single_stock['Avg4Price'] = 0.25*(single_stock['MaxPrice'] + single_stock['MinPrice'] + 
                                      single_stock['StartPrice'] + single_stock['EndPrice'])
    
    # TODO: add smoothed traded volume
    single_stock['PctChange'] = single_stock['EndPrice'].pct_change()
    single_stock['SmoothedTradedVolume'] = single_stock['TradedVolume'].ewm(com=2.5).mean()
    
    single_stock['Direction'] = \
        2.0*(single_stock['EndPrice'] - single_stock['StartPrice'])/ \
        (single_stock['EndPrice'] + single_stock['StartPrice'])
        
    single_stock['F1'] = - closer_to(single_stock['EndPrice'], single_stock['MaxPrice'], single_stock['MinPrice'])
        

    resampled = pd.DataFrame({
        'MaxPrice': single_stock['MaxPrice'].resample(interval).max(),
        'MinPrice': single_stock['MinPrice'].resample(interval).min(),
        'MeanStartEndPrice': single_stock['StartEndPrice'].resample(interval).mean(),  
        'MeanMaxMinPrice': single_stock['MaxMinPrice'].resample(interval).mean(), 
        'MeanAvg4Price': single_stock['Avg4Price'].resample(interval).mean(),  
        'CustomLinearComb': single_stock['Avg4Price'].resample(interval).apply(custom_linear_comb),         
        'FirstStartPrice': single_stock['StartPrice'].resample(interval).first(),        
        'LastEndPrice': single_stock['EndPrice'].resample(interval).last(), 
        'MeanEndPrice': single_stock['EndPrice'].resample(interval).mean(), 
        'MedianEndPrice': single_stock['EndPrice'].resample(interval).median(),
        'VolumeWeightedEndPrice': weighted_mean(single_stock['EndPrice'], single_stock['SmoothedTradedVolume'], interval),
        'VolumeWeightedPctChange': weighted_mean(single_stock['PctChange'], single_stock['SmoothedTradedVolume'], interval),
        
        'StdEndPrice': single_stock['EndPrice'].resample(interval).std(),
        'HasTrade': single_stock['HasTrade'].resample(interval).max(),
        'G1': single_stock['Direction'].resample(interval).mean(),
        'G2': np.sign(single_stock['Direction']).resample(interval).mean(),
        'G3': single_stock['F1'].resample(interval).mean(),
        'G4': np.sign(single_stock['F1']).resample(interval).mean()        
    })
    resampled['AdjustedPctChange'] = (resampled['LastEndPrice'] - resampled['MeanEndPrice'])/resampled['MeanEndPrice']
    resampled['AdjustedPctChange[t - 1]'] = resampled['AdjustedPctChange'].shift(1)
    
    resampled['MeanPctChangeV2'] = (resampled['MeanEndPrice'] - resampled['LastEndPrice'].shift(1))/resampled['LastEndPrice'].shift(1)
    
    anchor = resampled['MeanEndPrice'].shift(1)
    resampled[adj_log_return_f('LastEndPrice')] = adj_log_return(resampled['LastEndPrice'], anchor)
    resampled[shifted(adj_log_return_f('LastEndPrice'))] = \
        adj_log_return(resampled['LastEndPrice'].shift(1), anchor)
    
    resampled['Direction1'] = \
        2.0*(resampled['LastEndPrice'] - resampled['FirstStartPrice'])/ \
        (resampled['LastEndPrice'] + resampled['FirstStartPrice'])
    
    resampled['Direction2'] = \
        2.0*(resampled['LastEndPrice'] - resampled['FirstStartPrice'])/ \
        resampled['MeanEndPrice'].shift(1)
        
    resampled[shifted('Direction1')] = resampled['Direction1'].shift(1)
    resampled[shifted('Direction2')] = resampled['Direction2'].shift(1)
    
    for f in ['MinPrice', 'MaxPrice', 'LastEndPrice', 'FirstStartPrice']:
        resampled[shifted(f)] = resampled[f].shift(1)
        
    resampled['F1'] = - closer_to(resampled['LastEndPrice'], resampled['MaxPrice'], resampled['MinPrice'])
    resampled['F2'] = - closer_to(resampled['MaxPrice'], resampled['LastEndPrice'], resampled['FirstStartPrice'])
    resampled['F3'] = closer_to(resampled['MinPrice'], resampled['LastEndPrice'], resampled['FirstStartPrice'])
    resampled['F4'] = - closer_to(resampled['LastEndPrice'], resampled['MaxPrice'], resampled[shifted('MaxPrice')])
    resampled['F5'] = closer_to(resampled['LastEndPrice'], resampled['MinPrice'], resampled[shifted('MinPrice')])
    
    resampled['F6'] = - closer_to_or(resampled['LastEndPrice'], resampled['MaxPrice'],
                                   resampled[shifted('LastEndPrice')], resampled[shifted('MaxPrice')])

    resampled['F7'] = closer_to_or(resampled['LastEndPrice'], resampled['MinPrice'],
                                   resampled[shifted('LastEndPrice')], resampled[shifted('MinPrice')])
    
    
    resampled['F8'] = np.where(resampled['Direction2'] >= 0, resampled['F4'] , resampled['F5'])
    
    
    for t in range(1, 5):    
        # note: normalization is fixed
        resampled[norm_feature('H1', t, 'MeanEndPrice')] = - closer_to_with_normalization(
                                             resampled['LastEndPrice'].shift(t), 
                                             resampled['MaxPrice'].shift(t), 
                                             resampled['MinPrice'].shift(t),
                                             resampled['MeanEndPrice'].shift(1))    
        
    
    for f in indicator_features:
        resampled[shifted(f)] = resampled[f].shift(1)
        
    for f in price_features:
        pct_change_f = pct_change_of(f)
        resampled[pct_change_f] = resampled[f].pct_change()
        resampled[shifted(pct_change_f)] = resampled[pct_change_f].shift(1) 
    
    for f in price_features:
        for t in range(1, 5):
            rev_pct_change_f_t = rev_pct_change_of_at_t(f, t)
            resampled[rev_pct_change_f_t] = rev_pct_change(resampled[f], t)    
        

    for f in log_ret_features:
        log_ret_f = log_return_f(f)
        resampled[log_ret_f] = log_return(resampled[f])
        resampled[shifted(log_ret_f)] = resampled[log_ret_f].shift(1) 
        
    resampled = resampled[resampled['HasTrade'] == 1.0]
    
    return resampled

def correlation_with_feature(single_stock, corr_feature):
    pct_change_of_f = corr_feature
    sh_dir1 = shifted('Direction1')
    sh_dir2 = shifted('Direction2')    
    d = {
        pct_change_of_f: single_stock[pct_change_of_f],
        sh_dir1: single_stock[sh_dir1],
        sh_dir2: single_stock[sh_dir2]        
    }
    
    for f in indicator_features:
        d[shifted(f)] = single_stock[shifted(f)]
        
    d['AdjustedPctChange[t - 1]'] = single_stock['AdjustedPctChange[t - 1]']    
    d[adj_log_return_f('LastEndPrice')] = single_stock[adj_log_return_f('LastEndPrice')]
    d[shifted(adj_log_return_f('LastEndPrice'))] = single_stock[shifted(adj_log_return_f('LastEndPrice'))]
    
    d['MeanPctChangeV2'] = single_stock['MeanPctChangeV2']
    
    for f in price_features:
        pct_change_shifted = shifted(pct_change_of(f))
        d[pct_change_shifted] = single_stock[pct_change_shifted]
        
    for t in range(1, 5):    
        # note: normalization is fixed
        f = norm_feature('H1', t, 'MeanEndPrice')
        d[f] = single_stock[f]
            
    for f in log_ret_features:
        log_ret_f = log_return_f(f)
        d[log_ret_f] = single_stock[log_ret_f]
            
    for f in price_features:
        for t in range(1, 5):
            rev_pct_change_f_t = rev_pct_change_of_at_t(f, t)
            d[rev_pct_change_f_t] = single_stock[rev_pct_change_f_t]
            
    d[shifted('VolumeWeightedPctChange')] = single_stock['VolumeWeightedPctChange'].shift(1)
    corr = pd.DataFrame(d).corr()
    row_id = np.argwhere(corr.index.values == pct_change_of_f)[0][0]
    return corr.iloc[[row_id]].drop(columns=[pct_change_of_f])

def corr_pct_change(single_stock, feature):
    pct_change_of_f = pct_change_of(feature)
    return correlation_with_feature(single_stock, pct_change_of_f)

def corr_log_return(single_stock, feature):
    return correlation_with_feature(single_stock, log_return_f(feature))

def find_most_correlated_features(single_stock, feature):
    corrs = correlation_with_feature(single_stock, feature).T
    corrs['AbsCorr'] = np.absolute(corrs[feature])
    sorted_corrs = corrs.sort_values('AbsCorr', ascending=False)
    # keep only the lagged features; build a list, since a lazy filter object is not a reliable indexer
    selected_names = [n for n in sorted_corrs.index.values if '[t - 1' in n or '[t - 2' in n]
    sorted_corrs = sorted_corrs.loc[selected_names]
    return sorted_corrs
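
To make the indicator helpers more concrete, here is a toy evaluation of `closer_to` as it is used for `F1`. The three prices are made up. The function body is copied from the cell above so that the snippet runs on its own.

```python
import numpy as np

def closer_to(pnt, a, b):
    # same definition as in the cell above
    return (np.absolute(pnt - a) - np.absolute(pnt - b)) / pnt

# Hypothetical interval: the end price sits nearer the maximum than the minimum.
end_price, max_price, min_price = 100.0, 101.0, 98.0
f1_near_max = -closer_to(end_price, max_price, min_price)

# Hypothetical interval: the end price sits nearer the minimum.
f1_near_min = -closer_to(99.0, 102.0, 98.5)
```

With the sign flip used for `F1`, the value is positive when the interval closes near its maximum and negative when it closes near its minimum, scaled by the end price.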

We choose a single stock, for example 'SIE' (Siemens)

In [290]:
single_stock = prepare_single_stock('SIE', '30Min')

We look at the correlations with all the other price features from the previous time period `(t - 1)`.
In other words, we investigate whether a single feature from the previous time period `(t - 1)`
is predictive of the change in the next time period `(t)`.

We first study `LastEndPrice`, the last end price in the interval `t` (here the interval is 30 minutes).
As can be seen, none of the lagged features shown has a strong correlation. The high value for `AdjLogReturn:LastEndPrice` does not count, because that feature is computed from the price at `t` itself.

In [291]:
corr_pct_change(single_stock, 'LastEndPrice')

Unnamed: 0,AdjLogReturn:LastEndPrice,AdjLogReturn:LastEndPrice[t - 1],AdjustedPctChange[t - 1],Direction1[t - 1],Direction2[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],...,RevPctChange[t - 4]:MaxPrice,RevPctChange[t - 4]:MeanAvg4Price,RevPctChange[t - 4]:MeanEndPrice,RevPctChange[t - 4]:MeanMaxMinPrice,RevPctChange[t - 4]:MeanStartEndPrice,RevPctChange[t - 4]:MedianEndPrice,RevPctChange[t - 4]:MinPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice,VolumeWeightedPctChange[t - 1]
PctChange:LastEndPrice,0.875334,-0.044889,-0.04486,-0.011198,-0.0297,-0.027385,-0.011297,-0.011028,0.02114,-0.027374,...,-0.038945,-0.021345,-0.021353,-0.021357,-0.021333,-0.015984,-0.011112,-0.021343,-0.02265,-0.014676


Next we look for correlations with the mean prices: 
- MeanAvg4Price: we averaged all 4 prices available within a minute and then averaged within an interval like 10Min
- MeanMaxMinPrice: took the average of Min and Max prices and then averaged within an interval
- MeanStartEndPrice: took the average of Start and End prices and then averaged within an interval

# Conclusion:

- The more averaging is done, the easier the price is to predict
- `EndPrice` is very difficult to predict, but the mean price is not
- The median of the end price is harder to predict than the mean, but a lot easier than the end price



In [292]:
corr_pct_change(single_stock, 'MeanAvg4Price')

Unnamed: 0,AdjLogReturn:LastEndPrice,AdjLogReturn:LastEndPrice[t - 1],AdjustedPctChange[t - 1],Direction1[t - 1],Direction2[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],...,RevPctChange[t - 4]:MaxPrice,RevPctChange[t - 4]:MeanAvg4Price,RevPctChange[t - 4]:MeanEndPrice,RevPctChange[t - 4]:MeanMaxMinPrice,RevPctChange[t - 4]:MeanStartEndPrice,RevPctChange[t - 4]:MedianEndPrice,RevPctChange[t - 4]:MinPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice,VolumeWeightedPctChange[t - 1]
PctChange:MeanAvg4Price,0.886189,0.649081,0.649087,0.499324,0.546771,0.592774,0.500389,0.497669,0.091224,0.101009,...,0.071413,0.069717,0.07359,0.069911,0.069521,0.068348,0.091924,-0.029574,0.088645,0.276965


In [232]:
find_most_correlated_features(single_stock, pct_change_of('MeanAvg4Price'))

Unnamed: 0,PctChange:MeanAvg4Price,AbsCorr
AdjustedPctChange[t - 1],0.649087,0.649087
F1[t - 1],0.592774,0.592774
H1@WithNorm(MeanEndPrice)[t - 1],0.592662,0.592662
RevPctChange[t - 1]:LastEndPrice,0.54692,0.54692
Direction2[t - 1],0.546771,0.546771
F2[t - 1],0.500389,0.500389
Direction1[t - 1],0.499324,0.499324
F3[t - 1],0.497669,0.497669
PctChange:LastEndPrice[t - 1],0.432989,0.432989
G1[t - 1],0.427626,0.427626


In [233]:
corr_pct_change(single_stock, 'MeanMaxMinPrice')

Unnamed: 0,AdjustedPctChange[t - 1],Direction1[t - 1],Direction2[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],F6[t - 1],F7[t - 1],...,RevPctChange[t - 4]:MaxPrice,RevPctChange[t - 4]:MeanAvg4Price,RevPctChange[t - 4]:MeanEndPrice,RevPctChange[t - 4]:MeanMaxMinPrice,RevPctChange[t - 4]:MeanStartEndPrice,RevPctChange[t - 4]:MedianEndPrice,RevPctChange[t - 4]:MinPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice,VolumeWeightedPctChange[t - 1]
PctChange:MeanMaxMinPrice,0.648601,0.498502,0.546014,0.592666,0.499563,0.496845,0.0917,0.101526,0.357278,0.331763,...,0.071013,0.069293,0.073177,0.069474,0.069111,0.067948,0.091528,-0.029617,0.088248,0.276102


In [234]:
corr_pct_change(single_stock, 'MeanStartEndPrice')

Unnamed: 0,AdjustedPctChange[t - 1],Direction1[t - 1],Direction2[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],F6[t - 1],F7[t - 1],...,RevPctChange[t - 4]:MaxPrice,RevPctChange[t - 4]:MeanAvg4Price,RevPctChange[t - 4]:MeanEndPrice,RevPctChange[t - 4]:MeanMaxMinPrice,RevPctChange[t - 4]:MeanStartEndPrice,RevPctChange[t - 4]:MedianEndPrice,RevPctChange[t - 4]:MinPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice,VolumeWeightedPctChange[t - 1]
PctChange:MeanStartEndPrice,0.649527,0.500111,0.54749,0.592839,0.501179,0.498458,0.090742,0.100484,0.357715,0.331181,...,0.071808,0.070136,0.073999,0.070345,0.069926,0.068744,0.092315,-0.029529,0.089035,0.277798


In [293]:
corr_pct_change(single_stock, 'MeanEndPrice')

Unnamed: 0,AdjLogReturn:LastEndPrice,AdjLogReturn:LastEndPrice[t - 1],AdjustedPctChange[t - 1],Direction1[t - 1],Direction2[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],...,RevPctChange[t - 4]:MaxPrice,RevPctChange[t - 4]:MeanAvg4Price,RevPctChange[t - 4]:MeanEndPrice,RevPctChange[t - 4]:MeanMaxMinPrice,RevPctChange[t - 4]:MeanStartEndPrice,RevPctChange[t - 4]:MedianEndPrice,RevPctChange[t - 4]:MinPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice,VolumeWeightedPctChange[t - 1]
PctChange:MeanEndPrice,0.893023,0.633271,0.633278,0.484106,0.528506,0.57733,0.485131,0.482517,0.091301,0.099043,...,0.066974,0.065182,0.068844,0.065384,0.06498,0.063751,0.087711,-0.029297,0.083367,0.264644


In [236]:
corr_pct_change(single_stock, 'MedianEndPrice')

Unnamed: 0,AdjustedPctChange[t - 1],Direction1[t - 1],Direction2[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],F6[t - 1],F7[t - 1],...,RevPctChange[t - 4]:MaxPrice,RevPctChange[t - 4]:MeanAvg4Price,RevPctChange[t - 4]:MeanEndPrice,RevPctChange[t - 4]:MeanMaxMinPrice,RevPctChange[t - 4]:MeanStartEndPrice,RevPctChange[t - 4]:MedianEndPrice,RevPctChange[t - 4]:MinPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice,VolumeWeightedPctChange[t - 1]
PctChange:MedianEndPrice,0.619535,0.433678,0.498296,0.53076,0.434843,0.431914,0.094451,0.102715,0.336348,0.30959,...,0.061692,0.054922,0.058357,0.055128,0.054715,0.045778,0.082627,-0.02849,0.073163,0.24804


In [237]:
corr_pct_change(single_stock, 'VolumeWeightedEndPrice')

Unnamed: 0,AdjustedPctChange[t - 1],Direction1[t - 1],Direction2[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],F6[t - 1],F7[t - 1],...,RevPctChange[t - 4]:MaxPrice,RevPctChange[t - 4]:MeanAvg4Price,RevPctChange[t - 4]:MeanEndPrice,RevPctChange[t - 4]:MeanMaxMinPrice,RevPctChange[t - 4]:MeanStartEndPrice,RevPctChange[t - 4]:MedianEndPrice,RevPctChange[t - 4]:MinPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice,VolumeWeightedPctChange[t - 1]
PctChange:VolumeWeightedEndPrice,0.57062,0.405897,0.466458,0.515441,0.407015,0.404191,0.094179,0.092344,0.320096,0.307567,...,0.04479,0.042424,0.045507,0.042602,0.042246,0.042875,0.057806,-0.021437,0.052565,0.21799


In [294]:
correlation_with_feature(single_stock, 'MeanPctChangeV2')

Unnamed: 0,AdjLogReturn:LastEndPrice,AdjLogReturn:LastEndPrice[t - 1],AdjustedPctChange[t - 1],Direction1[t - 1],Direction2[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],...,RevPctChange[t - 4]:MaxPrice,RevPctChange[t - 4]:MeanAvg4Price,RevPctChange[t - 4]:MeanEndPrice,RevPctChange[t - 4]:MeanMaxMinPrice,RevPctChange[t - 4]:MeanStartEndPrice,RevPctChange[t - 4]:MedianEndPrice,RevPctChange[t - 4]:MinPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice,VolumeWeightedPctChange[t - 1]
MeanPctChangeV2,0.766463,-0.05235,-0.052341,-0.024095,-0.038681,-0.036089,-0.024238,-0.02394,0.024079,-0.023621,...,-0.041307,-0.024616,-0.025007,-0.024613,-0.024618,-0.021375,-0.011481,-0.023722,-0.026919,-0.022443


In [296]:
find_most_correlated_features(single_stock, 'MeanPctChangeV2')

Unnamed: 0,MeanPctChangeV2,AbsCorr
G4[t - 1],-0.060184,0.060184
G2[t - 1],-0.056568,0.056568
G3[t - 1],-0.052721,0.052721
AdjLogReturn:LastEndPrice[t - 1],-0.05235,0.05235
AdjustedPctChange[t - 1],-0.052341,0.052341
G1[t - 1],-0.044197,0.044197
RevPctChange[t - 2]:StdEndPrice,-0.040138,0.040138
Direction2[t - 1],-0.038681,0.038681
PctChange:StdEndPrice[t - 1],0.038139,0.038139
RevPctChange[t - 1]:LastEndPrice,-0.036484,0.036484


Let us settle for the moment on the 'MeanEndPrice', and compute the correlations over multiple intervals

In [244]:
# some of the better features
x_features = [
    'AdjustedPctChange[t - 1]',
    'F1[t - 1]',
    'H1@WithNorm(MeanEndPrice)[t - 1]',
    'RevPctChange[t - 1]:LastEndPrice',
    'F2[t - 1]',
    'Direction2[t - 1]',
    'F3[t - 1]',
    'G1[t - 1]',
    'G3[t - 1]',
    'PctChange:LastEndPrice[t - 1]',
    'RevPctChange[t - 2]:LastEndPrice',
    'F6[t - 1]',
    'G2[t - 1]',
    'F7[t - 1]',
    'G4[t - 1]',
    'RevPctChange[t - 1]:VolumeWeightedEndPrice',
    'RevPctChange[t - 1]:MinPrice',
    'RevPctChange[t - 1]:MeanEndPrice',
    'RevPctChange[t - 1]:MeanMaxMinPrice',
    'RevPctChange[t - 1]:MeanAvg4Price',
    'RevPctChange[t - 1]:MeanStartEndPrice',
    'RevPctChange[t - 1]:MaxPrice']

In [245]:
intervals = ['1Min', '2Min', '3Min', '4Min', '5Min', '6Min', '7Min', '8Min', '9Min', '10Min', 
             '15Min', '30Min', '60Min', '120Min', '180Min', '240Min',
             '1D', '2D', '1W', '2W']
def corr_over_intervals(stock):
    results = []
    for interval in intervals:
        single_stock = prepare_single_stock(stock, interval)
        cr = corr_pct_change(single_stock, 'MeanEndPrice')
        cr.index = [cr.index[0] + "@" + interval]
        results.append(cr)
    return pd.concat(results)

In [246]:
corr_over_intervals('BMW')[x_features]

Unnamed: 0,AdjustedPctChange[t - 1],F1[t - 1],H1@WithNorm(MeanEndPrice)[t - 1],RevPctChange[t - 1]:LastEndPrice,F2[t - 1],Direction2[t - 1],F3[t - 1],G1[t - 1],G3[t - 1],PctChange:LastEndPrice[t - 1],...,G2[t - 1],F7[t - 1],G4[t - 1],RevPctChange[t - 1]:VolumeWeightedEndPrice,RevPctChange[t - 1]:MinPrice,RevPctChange[t - 1]:MeanEndPrice,RevPctChange[t - 1]:MeanMaxMinPrice,RevPctChange[t - 1]:MeanAvg4Price,RevPctChange[t - 1]:MeanStartEndPrice,RevPctChange[t - 1]:MaxPrice
PctChange:MeanEndPrice@1Min,,-0.039041,-0.039041,-0.058135,-0.043878,-0.044041,-0.043996,-0.043937,-0.039041,-0.048899,...,-0.025848,-0.026503,-0.023483,-0.058135,-0.040189,-0.058135,-0.042481,-0.042339,-0.040873,-0.036193
PctChange:MeanEndPrice@2Min,0.34111,0.289113,0.289109,0.223326,0.243899,0.268656,0.24362,0.219414,0.188602,0.164457,...,0.165963,0.207309,0.136051,0.113732,0.061592,0.105204,0.052072,0.049232,0.045965,0.075007
PctChange:MeanEndPrice@3Min,0.555537,0.500738,0.500195,0.369516,0.446452,0.379535,0.447419,0.332804,0.284271,0.363713,...,0.242359,0.273922,0.200823,0.195872,0.132835,0.154092,0.100225,0.097386,0.094141,0.12699
PctChange:MeanEndPrice@4Min,0.569054,0.528768,0.528799,0.432938,0.466612,0.436476,0.466332,0.379996,0.325611,0.380441,...,0.286384,0.30989,0.233416,0.217401,0.164844,0.1828,0.132186,0.129911,0.127312,0.143626
PctChange:MeanEndPrice@5Min,0.564955,0.532253,0.532383,0.429962,0.456543,0.440691,0.455466,0.386485,0.34575,0.361401,...,0.299391,0.326282,0.249827,0.208762,0.148942,0.196382,0.151863,0.151399,0.150698,0.159227
PctChange:MeanEndPrice@6Min,0.589749,0.555853,0.556019,0.467293,0.47739,0.47175,0.476017,0.420884,0.368116,0.383651,...,0.31355,0.34054,0.261761,0.22759,0.165068,0.205875,0.165517,0.164416,0.163131,0.182958
PctChange:MeanEndPrice@7Min,0.626719,0.593687,0.593914,0.511498,0.525365,0.516885,0.524448,0.430444,0.381381,0.421565,...,0.322283,0.372832,0.269345,0.252963,0.194866,0.214009,0.17413,0.173577,0.172869,0.195457
PctChange:MeanEndPrice@8Min,0.62917,0.592355,0.592385,0.516915,0.515672,0.521957,0.514255,0.46822,0.408759,0.416825,...,0.334731,0.342079,0.273265,0.252307,0.212106,0.212127,0.178905,0.177894,0.176768,0.181348
PctChange:MeanEndPrice@9Min,0.620197,0.582707,0.582823,0.496655,0.501964,0.500054,0.500331,0.435434,0.374597,0.40285,...,0.331195,0.335596,0.270271,0.237795,0.202288,0.206838,0.180865,0.180089,0.179218,0.184182
PctChange:MeanEndPrice@10Min,0.613953,0.573533,0.57356,0.484112,0.481306,0.487638,0.479811,0.429325,0.379193,0.391853,...,0.332825,0.325004,0.26594,0.222683,0.196991,0.200285,0.176907,0.17641,0.175839,0.169255


In [248]:
corr_over_intervals('SIE')[x_features]

Unnamed: 0,AdjustedPctChange[t - 1],F1[t - 1],H1@WithNorm(MeanEndPrice)[t - 1],RevPctChange[t - 1]:LastEndPrice,F2[t - 1],Direction2[t - 1],F3[t - 1],G1[t - 1],G3[t - 1],PctChange:LastEndPrice[t - 1],...,G2[t - 1],F7[t - 1],G4[t - 1],RevPctChange[t - 1]:VolumeWeightedEndPrice,RevPctChange[t - 1]:MinPrice,RevPctChange[t - 1]:MeanEndPrice,RevPctChange[t - 1]:MeanMaxMinPrice,RevPctChange[t - 1]:MeanAvg4Price,RevPctChange[t - 1]:MeanStartEndPrice,RevPctChange[t - 1]:MaxPrice
PctChange:MeanEndPrice@1Min,,-0.092423,-0.092423,-0.087856,-0.084036,-0.084351,-0.08401,-0.08402,-0.092423,-0.079444,...,-0.080969,-0.052195,-0.091907,-0.087856,-0.051012,-0.087856,-0.053637,-0.06072,-0.06634,-0.04566
PctChange:MeanEndPrice@2Min,0.298605,0.227853,0.227863,0.191619,0.200317,0.23249,0.200306,0.140958,0.108513,0.142136,...,0.106158,0.171958,0.065779,0.091896,0.058992,0.0814,0.045812,0.042929,0.039638,0.062606
PctChange:MeanEndPrice@3Min,0.570456,0.491523,0.491597,0.340438,0.470085,0.353936,0.47214,0.23272,0.188142,0.388743,...,0.187207,0.238747,0.13536,0.179187,0.120468,0.142871,0.098049,0.096265,0.094003,0.119281
PctChange:MeanEndPrice@4Min,0.594308,0.534169,0.534168,0.410578,0.498187,0.423191,0.49909,0.287541,0.231346,0.404286,...,0.232097,0.26862,0.168284,0.215712,0.159722,0.171856,0.133369,0.131304,0.128825,0.147028
PctChange:MeanEndPrice@5Min,0.576031,0.533269,0.533188,0.428651,0.494623,0.441942,0.494576,0.316516,0.26334,0.39809,...,0.262333,0.289055,0.196944,0.217533,0.157345,0.183931,0.145491,0.144391,0.142989,0.143154
PctChange:MeanEndPrice@6Min,0.598146,0.556198,0.556089,0.462269,0.503052,0.471614,0.50235,0.340911,0.285162,0.413045,...,0.280586,0.297252,0.216987,0.229235,0.182452,0.196093,0.163087,0.162078,0.160827,0.158972
PctChange:MeanEndPrice@7Min,0.578646,0.537384,0.537404,0.473308,0.481754,0.498598,0.48095,0.319848,0.26505,0.385788,...,0.276517,0.305627,0.198839,0.236555,0.201566,0.202792,0.174863,0.174314,0.173585,0.173407
PctChange:MeanEndPrice@8Min,0.619482,0.580045,0.57991,0.506192,0.5129,0.510449,0.511388,0.381618,0.317593,0.429069,...,0.312338,0.307584,0.237347,0.257562,0.216119,0.220359,0.194195,0.193187,0.192005,0.182721
PctChange:MeanEndPrice@9Min,0.622126,0.580829,0.580725,0.497267,0.526866,0.503129,0.525878,0.357355,0.298075,0.436417,...,0.305613,0.300351,0.237133,0.254477,0.221069,0.215525,0.193067,0.192377,0.191554,0.179974
PctChange:MeanEndPrice@10Min,0.615045,0.564001,0.563964,0.497268,0.489266,0.501323,0.487324,0.374626,0.317268,0.410471,...,0.308915,0.299049,0.239798,0.241582,0.212065,0.213481,0.192998,0.192655,0.192203,0.175417
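
One way to judge how much of such a correlation can come from the averaging step alone is to repeat the computation on synthetic data. The sketch below is not part of the original analysis: it simulates a geometric random walk, which has no predictable component by construction, and compares the lagged `AdjustedPctChange` feature with the percent change of the interval mean and of the interval's last price. The seed, step size, and interval length are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_interval = 30          # e.g. 30 one-minute bars per interval
n_intervals = 5000

# Geometric random walk: increments are independent of the past.
log_steps = rng.normal(0.0, 0.0005, size=n_per_interval * n_intervals)
prices = 100.0 * np.exp(np.cumsum(log_steps))
blocks = prices.reshape(n_intervals, n_per_interval)

mean_price = blocks.mean(axis=1)     # analogue of MeanEndPrice
last_price = blocks[:, -1]           # analogue of LastEndPrice

adjusted = (last_price - mean_price) / mean_price     # AdjustedPctChange
pct_mean = np.diff(mean_price) / mean_price[:-1]      # PctChange:MeanEndPrice
pct_last = np.diff(last_price) / last_price[:-1]      # PctChange:LastEndPrice

# Correlate the feature at interval t - 1 with the change at interval t.
corr_mean = np.corrcoef(adjusted[:-1], pct_mean)[0, 1]
corr_last = np.corrcoef(adjusted[:-1], pct_last)[0, 1]
```

On this synthetic series the correlation with the mean-price change comes out clearly positive, while the correlation with the last-price change stays near zero. The reason is that the change of the mean can be written as `(Mean[t] - Last[t - 1]) + (Last[t - 1] - Mean[t - 1])`, and the second term is exactly the lagged feature.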


In [15]:
# corr_over_intervals('SAP')

The strong correlations are due to the averaging.
To check this, we run an experiment that correlates the percent change of `MaxPrice`, which uses only `MaxPrice`,
with `Direction`, which uses only `EndPrice` and `StartPrice`.



In [16]:
single_stock = prepare_single_stock('BMW', '1D')
pd.DataFrame({
    'PctChange:MaxPrice': single_stock['PctChange:MaxPrice'],
    'Direction[t + 1]': single_stock['Direction'].shift(-1),   
    'Direction[t - 0]': single_stock['Direction'],        
    'Direction[t - 1]': single_stock['Direction[t - 1]'],
    'Direction[t - 2]': single_stock['Direction[t - 1]'].shift(1),
    'Direction[t - 3]': single_stock['Direction[t - 1]'].shift(2),
    # shift(4) of the [t - 1] column is lag 5; the original listed two 'Direction[t - 4]' keys,
    # so the shift(3) entry was silently overwritten and only this one was computed
    'Direction[t - 5]': single_stock['Direction[t - 1]'].shift(4)
}).corr()[['PctChange:MaxPrice']]

Unnamed: 0,PctChange:MaxPrice
Direction[t + 1],-0.012476
Direction[t - 0],0.462436
Direction[t - 1],0.375247
Direction[t - 2],0.065301
Direction[t - 3],-0.007288
Direction[t - 5],-0.042227
PctChange:MaxPrice,1.0


In [17]:
single_stock = prepare_single_stock('BMW', '1D')
pd.DataFrame({
    'PctChange:LastEndPrice': single_stock['PctChange:LastEndPrice'],
    'Direction[t + 1]': single_stock['Direction'].shift(-1),   
    'Direction[t - 0]': single_stock['Direction'],        
    'Direction[t - 1]': single_stock['Direction[t - 1]'],
    'Direction[t - 2]': single_stock['Direction[t - 1]'].shift(1),
    'Direction[t - 3]': single_stock['Direction[t - 1]'].shift(2),
    # shift(4) of the [t - 1] column is lag 5; the original listed two 'Direction[t - 4]' keys,
    # so the shift(3) entry was silently overwritten and only this one was computed
    'Direction[t - 5]': single_stock['Direction[t - 1]'].shift(4)
}).corr()[['PctChange:LastEndPrice']]

Unnamed: 0,PctChange:LastEndPrice
Direction[t + 1],0.027184
Direction[t - 0],0.732582
Direction[t - 1],0.040058
Direction[t - 2],0.119437
Direction[t - 3],-0.032281
Direction[t - 5],-0.037653
PctChange:LastEndPrice,1.0


The lack of correlation between period `(t)` and `(t - 2)` is related to the normalization we apply in the percent change. We verify this independently next.

In [18]:
single_stock = prepare_single_stock('BMW', '1D')
last_end_price = single_stock['PctChange:LastEndPrice']
pd.DataFrame({
    'PctChange:LastEndPrice': last_end_price,       
    'PctChange:LastEndPrice[t - 1]': last_end_price.shift(1),
    'PctChange:LastEndPrice[t - 2]': last_end_price.shift(2),
    'PctChange:LastEndPrice[t - 3]': last_end_price.shift(3),
    'PctChange:LastEndPrice[t - 4]': last_end_price.shift(4)    
}).corr()[['PctChange:LastEndPrice']]

Unnamed: 0,PctChange:LastEndPrice
PctChange:LastEndPrice,1.0
PctChange:LastEndPrice[t - 1],0.133811
PctChange:LastEndPrice[t - 2],0.067861
PctChange:LastEndPrice[t - 3],-0.039186
PctChange:LastEndPrice[t - 4],-0.015114


Once we normalize the features in the same way, we find that past periods also exhibit
correlations with the present.

In [19]:
single_stock = prepare_single_stock('BMW', '1D')
last_end_price = single_stock['PctChange:LastEndPrice']
pd.DataFrame({
    'PctChange:LastEndPrice': last_end_price,       
    rev_pct_change_of_at_t('LastEndPrice', 1): single_stock[rev_pct_change_of_at_t('LastEndPrice', 1)],
    rev_pct_change_of_at_t('LastEndPrice', 2): single_stock[rev_pct_change_of_at_t('LastEndPrice', 2)],
    rev_pct_change_of_at_t('LastEndPrice', 3): single_stock[rev_pct_change_of_at_t('LastEndPrice', 3)]
}).corr()[['PctChange:LastEndPrice']]

Unnamed: 0,PctChange:LastEndPrice
PctChange:LastEndPrice,1.0
RevPctChange[t - 1]:LastEndPrice,0.124178
RevPctChange[t - 2]:LastEndPrice,0.144278
RevPctChange[t - 3]:LastEndPrice,0.123414


In [20]:
single_stock = prepare_single_stock('BMW', '1D')
mean_end_price = single_stock['PctChange:MeanEndPrice']
pd.DataFrame({
    'PctChange:MeanEndPrice': mean_end_price,       
    rev_pct_change_of_at_t('MeanEndPrice', 1): single_stock[rev_pct_change_of_at_t('MeanEndPrice', 1)],
    rev_pct_change_of_at_t('MeanEndPrice', 2): single_stock[rev_pct_change_of_at_t('MeanEndPrice', 2)],
    rev_pct_change_of_at_t('MeanEndPrice', 3): single_stock[rev_pct_change_of_at_t('MeanEndPrice', 3)]
}).corr()[['PctChange:MeanEndPrice']]

Unnamed: 0,PctChange:MeanEndPrice
PctChange:MeanEndPrice,1.0
RevPctChange[t - 1]:MeanEndPrice,0.181585
RevPctChange[t - 2]:MeanEndPrice,0.144456
RevPctChange[t - 3]:MeanEndPrice,0.175877


If we experiment with the same feature under different normalizations, we observe the following:
- when the feature is normalized differently from the target, we do not see correlations beyond `(t - 1)`
- when the feature is normalized in the same way as the target, we may see correlations with earlier periods

In [21]:
single_stock = prepare_single_stock('BMW', '1D')
mean_end_price = single_stock['PctChange:MeanEndPrice']
pd.DataFrame({
    'PctChange:MeanEndPrice': mean_end_price,       
    norm_feature('H1', 1, 'MeanEndPrice'): single_stock[norm_feature('H1', 1, 'MeanEndPrice')],
    norm_feature('H1', 2, 'MeanEndPrice'): single_stock[norm_feature('H1', 2, 'MeanEndPrice')],
    norm_feature('H1', 3, 'MeanEndPrice'): single_stock[norm_feature('H1', 3, 'MeanEndPrice')],
    norm_feature('H1', 4, 'MeanEndPrice'): single_stock[norm_feature('H1', 4, 'MeanEndPrice')],
    'F1[t - 1]': single_stock['F1[t - 1]'],
    'F1[t - 2]': single_stock['F1[t - 1]'].shift(1),
    'F1[t - 3]': single_stock['F1[t - 1]'].shift(2),
    'F1[t - 4]': single_stock['F1[t - 1]'].shift(3)  
}).corr()[['PctChange:MeanEndPrice']]

Unnamed: 0,PctChange:MeanEndPrice
F1[t - 1],0.379122
F1[t - 2],0.065195
F1[t - 3],0.001757
F1[t - 4],0.022977
H1@WithNorm(MeanEndPrice)[t - 1],0.378796
H1@WithNorm(MeanEndPrice)[t - 2],0.051776
H1@WithNorm(MeanEndPrice)[t - 3],-0.013097
H1@WithNorm(MeanEndPrice)[t - 4],0.277385
PctChange:MeanEndPrice,1.0


## Log of return
In their [demo notebook](https://github.com/googledatalab/notebooks/blob/master/samples/TensorFlow/Machine%20Learning%20with%20Financial%20Data.ipynb) Google Cloud computes a feature that they call log return. This feature is:

```
log(Price:X[t]/Price:X[t - 1])
```
for some version of price X.
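
As a minimal sketch of this feature, the snippet below computes the log return with pandas on a made-up price series. The series name and the values are illustrative assumptions, not data from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical price series; the values are made up for illustration.
prices = pd.Series([100.0, 101.0, 99.5, 99.5, 102.0], name="MeanEndPrice")

# log(Price[t] / Price[t - 1]); the first entry has no predecessor and is NaN.
log_return = np.log(prices / prices.shift(1))

# Equivalent formulation: the difference of consecutive log prices.
log_return_alt = np.log(prices).diff()

print(log_return)
```

Both formulations give the same values up to floating point error, so which one to use is mostly a matter of taste.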

We explore this feature below. We find that when the `Mean` of price is used, this feature correlates well
with various predictors, but when the `Last` of price is used, it does not.

In [253]:
single_stock = prepare_single_stock('SIE', '30Min')
corr_log_return(single_stock, 'MeanEndPrice')

Unnamed: 0,AdjLogReturn:LastEndPrice,AdjLogReturn:LastEndPrice[t - 1],AdjustedPctChange[t - 1],Direction1[t - 1],Direction2[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],...,RevPctChange[t - 4]:MaxPrice,RevPctChange[t - 4]:MeanAvg4Price,RevPctChange[t - 4]:MeanEndPrice,RevPctChange[t - 4]:MeanMaxMinPrice,RevPctChange[t - 4]:MeanStartEndPrice,RevPctChange[t - 4]:MedianEndPrice,RevPctChange[t - 4]:MinPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice,VolumeWeightedPctChange[t - 1]
LogReturn:MeanEndPrice,0.89294,0.633228,0.633234,0.48406,0.528475,0.577222,0.485085,0.482471,0.090951,0.099265,...,0.066983,0.065278,0.06894,0.065479,0.065075,0.063838,0.087899,-0.029496,0.083471,0.44998


In [254]:
single_stock = prepare_single_stock('SIE', '30Min')
corr_log_return(single_stock, 'LastEndPrice')

Unnamed: 0,AdjLogReturn:LastEndPrice,AdjLogReturn:LastEndPrice[t - 1],AdjustedPctChange[t - 1],Direction1[t - 1],Direction2[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],...,RevPctChange[t - 4]:MaxPrice,RevPctChange[t - 4]:MeanAvg4Price,RevPctChange[t - 4]:MeanEndPrice,RevPctChange[t - 4]:MeanMaxMinPrice,RevPctChange[t - 4]:MeanStartEndPrice,RevPctChange[t - 4]:MedianEndPrice,RevPctChange[t - 4]:MinPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice,VolumeWeightedPctChange[t - 1]
LogReturn:LastEndPrice,0.87536,-0.04484,-0.044811,-0.011142,-0.029627,-0.027358,-0.01124,-0.010971,0.020748,-0.027203,...,-0.038851,-0.021236,-0.021244,-0.021247,-0.021224,-0.015875,-0.010987,-0.021412,-0.022538,-0.0137


When we examine the correlations of `LogReturn` with its own past values, we find no correlations beyond `(t - 1)`.
Google Cloud reports the same. As explained above, the reason is that different periods
are normalized differently.

In [24]:
single_stock = prepare_single_stock('SIE', '30Min')
mean_end_price = single_stock['LogReturn:MeanEndPrice']
pd.DataFrame({
    'LogReturn:MeanEndPrice': mean_end_price,     
    'LogReturn:MeanEndPrice[t - 1]': single_stock['LogReturn:MeanEndPrice[t - 1]'],
    'LogReturn:MeanEndPrice[t - 2]': single_stock['LogReturn:MeanEndPrice[t - 1]'].shift(1),
    'LogReturn:MeanEndPrice[t - 3]': single_stock['LogReturn:MeanEndPrice[t - 1]'].shift(2),
    'LogReturn:MeanEndPrice[t - 4]': single_stock['LogReturn:MeanEndPrice[t - 1]'].shift(3)  
}).corr()[['LogReturn:MeanEndPrice']]

Unnamed: 0,LogReturn:MeanEndPrice
LogReturn:MeanEndPrice,1.0
LogReturn:MeanEndPrice[t - 1],0.238046
LogReturn:MeanEndPrice[t - 2],-0.008179
LogReturn:MeanEndPrice[t - 3],-0.035098
LogReturn:MeanEndPrice[t - 4],-0.021521


## On the relationship between percent change and log return

Some people prefer to model percent change, while others prefer to model log return.
We find that the two are suspiciously strongly correlated. It turns out that there is an approximate
mathematical equality that relates them:

```
log(a/b) = log[ (b + (a - b))/b ] = log(1 + pct_change) ≈ pct_change,
where pct_change = (a - b)/b

log(1 + x) ≈ x is an approximate equality when x is close to zero
```
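
A quick numerical check of this approximation, as a sketch with arbitrarily chosen percent changes. The gap between the two quantities is roughly `x**2 / 2`, the next term of the Taylor series of `log(1 + x)`, so it is negligible for the small changes seen at short intervals and grows for larger moves.

```python
import numpy as np

# Illustrative percent changes, from small to large (values chosen arbitrarily).
pct_change = np.array([0.0001, 0.001, 0.01, 0.05, 0.10])

# log1p(x) computes log(1 + x) accurately for small x.
log_return = np.log1p(pct_change)

# For positive x the percent change is always slightly larger than the log return.
abs_error = pct_change - log_return

for x, lr, err in zip(pct_change, log_return, abs_error):
    print(f"pct_change={x:.4f}  log_return={lr:.6f}  error={err:.2e}")
```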

# The variance of predicted features

Below we study the variance of a number of functions:

```
func = (price - anchor)/anchor
```
for different choices of price and anchor.

```
end to end: (end_price[t] - end_price[t - 1])/end_price[t - 1]
end to mean: (end_price[t] - mean_price[t - 1])/mean_price[t - 1]
mean to end: (mean_price[t] - end_price[t - 1])/end_price[t - 1]
mean to mean: (mean_price[t] - mean_price[t - 1])/mean_price[t - 1]
mean to prev mean: (mean_price[t] - mean_price[t - 2])/mean_price[t - 2]
```

The idea is to estimate the mean price in minutes 15 to 20 with the following method:
- estimate the mean price in minutes 10 to 15, call it `M[1]`
- estimate the mean price in minutes 10 to 20, call it `M[2]`
- compute `M[2] - M[1]`, the second minus the first
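
The `indirect` column computed in the cell further down combines two standard deviations as `sqrt(s1*s1 + s2*s2)`. This is the standard deviation of a difference of two quantities only if they are uncorrelated, which is an assumption here rather than something this notebook verifies. A minimal sketch on synthetic data (the scales `0.002` and `0.003` are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Two independent synthetic "estimates"; the scales are arbitrary assumptions.
m1 = rng.normal(loc=0.0, scale=0.002, size=n)
m2 = rng.normal(loc=0.0, scale=0.003, size=n)

s1, s2 = m1.std(), m2.std()
predicted_std = np.sqrt(s1 * s1 + s2 * s2)  # combine in quadrature
observed_std = (m2 - m1).std()              # measured directly on the difference

print(predicted_std, observed_std)
```

If the two estimates are positively correlated, as estimates over overlapping windows may well be, the directly measured value would be smaller than the quadrature value.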

## Some observations

- As the anchor goes further back in time from the predicted price, the variance increases


In [309]:
intervals = ['1Min', '2Min', '3Min', '4Min', '5Min', '6Min', '7Min', '8Min', '9Min', '10Min', 
             '15Min', '30Min', '60Min', '120Min', '180Min', '240Min',
             '1D', '2D', '1W', '2W', '1M', '2M', '3M', '6M']
rows = []
for i in intervals:
    single_stock = prepare_single_stock('SIE', i)
    e = single_stock['LastEndPrice']
    n = single_stock['LastEndPrice'].shift(1)
    end_to_end = (e - n)/n
    
    m = single_stock['MeanAvg4Price']
    end_to_mean = (e - m.shift(1))/m.shift(1)  
    mean_to_mean = (m - m.shift(1))/m.shift(1)
    mean_to_prev_mean = (m - m.shift(2))/m.shift(2) 
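    # NOTE: this anchors on end_price[t], whereas the "mean to end" definition
    # above anchors on end_price[t - 1] (that would be (m - n)/n).
    mean_to_end = (m - e)/e
    # NOTE: `indirect2` is not defined in this cell; it was presumably left over
    # from an earlier notebook session, so the cell does not run on its own.
    s1, s2 = end_to_mean.std(), indirect2.std()
    # combine the two standard deviations in quadrature
    indirect_std = np.sqrt(s1*s1 + s2*s2)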
    rows.append((i, end_to_end.std(), indirect_std, end_to_mean.std(), mean_to_mean.std(), mean_to_prev_mean.std(), mean_to_end.std()))


In [310]:
pd.DataFrame(rows, columns = [
    "interval", "end_to_end", "indirect", "end to mean", "mean to mean", "mean to prev mean", "mean to end"
])

Unnamed: 0,interval,end_to_end,indirect,end to mean,mean to mean,mean to prev mean,mean to end
0,1Min,0.000632,0.033256,0.000652,0.000551,0.000809,0.000219
1,2Min,0.000837,0.033261,0.000884,0.000747,0.001104,0.000338
2,3Min,0.001023,0.033268,0.001112,0.000859,0.001306,0.0005
3,4Min,0.001156,0.033273,0.001253,0.00098,0.001491,0.000527
4,5Min,0.001305,0.033279,0.001396,0.0011,0.001666,0.000575
5,6Min,0.001404,0.033284,0.001506,0.001207,0.001824,0.000612
6,7Min,0.001519,0.03329,0.001631,0.001312,0.001974,0.000664
7,8Min,0.001604,0.033294,0.001725,0.001387,0.002099,0.000686
8,9Min,0.0017,0.0333,0.001823,0.00146,0.002216,0.000734
9,10Min,0.001789,0.033305,0.001918,0.001569,0.002359,0.000747


## Predicting 60 min ahead using 10, 15, 20 and 30 minute windows

In [367]:
def resample_single_stock(single_stock, interval):
    return pd.DataFrame({
        'MaxPrice': single_stock['MaxPrice'].resample(interval).max(),
        'MinPrice': single_stock['MinPrice'].resample(interval).min(),
        'LastEndPrice': single_stock['EndPrice'].resample(interval).last(),
        'FirstStartPrice': single_stock['StartPrice'].resample(interval).first(),         
        'MeanEndPrice': single_stock['EndPrice'].resample(interval).mean(),        
        'HasTrade': single_stock['HasTrade'].resample(interval).max(),
    })

def prepare_single_stock_multi_intervals(mnemonic, predicted_price, main_interval, intervals):
    single_stock = df[df.Mnemonic == mnemonic].copy()
        
    main = resample_single_stock(single_stock, main_interval)
    # use the same anchor (mean end price of the main interval) for the target and all features
    anchor = main['MeanEndPrice']
    future_mean_price = main[predicted_price].shift(-1)
    main['AdjustedPctChange[t + 1]'] = (future_mean_price - anchor)/anchor
    
    all_intervals = [main_interval] + intervals
    
    for interval in all_intervals:
        sub = resample_single_stock(single_stock, interval)
        # for each main interval, keep the values of the last sub-interval inside it
        resampled = sub.resample(main_interval).last() 

        main['Direction@' + interval] = \
            2.0*(resampled['LastEndPrice'] - resampled['FirstStartPrice'])/ \
            anchor

        main['H1@' + interval] = - closer_to_with_normalization(
                                                 resampled['LastEndPrice'], 
                                                 resampled['MaxPrice'], 
                                                 resampled['MinPrice'],
                                                 anchor)    
        
        main['EndToMean@' + interval] = (resampled['LastEndPrice'] - resampled['MeanEndPrice'])/anchor
        
    main = main[main['HasTrade'] == 1.0]
    main = main.drop(columns = [
        'MaxPrice',
        'MinPrice',
        'LastEndPrice',
        'FirstStartPrice',         
        'MeanEndPrice',     
        'HasTrade'       
    ])
    return main
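
The two-step resampling used above (first to a sub-interval, then `.last()` at the main interval) can be illustrated on synthetic data. This is only a sketch: the timestamps and prices are made up, and lowercase `min` frequency strings are used instead of the `Min` spelling from the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical one-minute end prices for two hours (synthetic values 0..119).
index = pd.date_range("2018-01-02 09:00", periods=120, freq="min")
end_price = pd.Series(np.arange(120, dtype=float), index=index)

# Step 1: aggregate to 30-minute bars.
sub = pd.DataFrame({
    "LastEndPrice": end_price.resample("30min").last(),
    "MeanEndPrice": end_price.resample("30min").mean(),
})

# Step 2: for every 60-minute bar, keep the last 30-minute bar inside it.
resampled = sub.resample("60min").last()

print(resampled)
```

Each 60-minute row then describes only the most recent 30 minutes of that hour, which is how features such as `EndToMean@30Min` end up aligned with the 60-minute target.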

In [368]:
main_interval = '60Min'
intervals = ['2Min', '5Min', '10Min', '15Min', '20Min', '30Min']

single_stock = prepare_single_stock_multi_intervals('SIE', 'MeanEndPrice', main_interval, intervals)

k = 'AdjustedPctChange[t + 1]'
single_stock.corr()[[k]].sort_values(k, ascending=False)

Unnamed: 0,AdjustedPctChange[t + 1]
AdjustedPctChange[t + 1],1.0
EndToMean@60Min,0.694064
Direction@30Min,0.635698
H1@60Min,0.61109
H1@30Min,0.586646
EndToMean@30Min,0.577795
Direction@20Min,0.54622
Direction@60Min,0.512021
Direction@15Min,0.488824
EndToMean@20Min,0.479197


In [371]:
main_interval = '60Min'
intervals = ['2Min', '5Min', '10Min', '15Min', '20Min', '30Min']

single_stock = prepare_single_stock_multi_intervals('SIE', 'LastEndPrice', main_interval, intervals)

k = 'AdjustedPctChange[t + 1]'
single_stock.corr()[[k]].sort_values(k, ascending=False)

Unnamed: 0,AdjustedPctChange[t + 1]
AdjustedPctChange[t + 1],1.0
EndToMean@60Min,0.506684
Direction@30Min,0.471918
H1@60Min,0.449346
H1@30Min,0.431266
EndToMean@30Min,0.42911
Direction@20Min,0.409199
Direction@60Min,0.384595
Direction@15Min,0.358005
EndToMean@20Min,0.35178
