# Supporting Notebook 2: Predictable and unpredictable prices

## Modeling: What to Predict
It is not straightforward to choose how to model stock behavior.

Do we predict prices 1 minute ahead, 10 minutes ahead, or daily?

Or should we predict all?

What kind of price do we want to predict within an interval:

- EndPrice
- Mean(EndPrice)
- WeightedMean(EndPrice)
- MedianPrice
- or the complete distribution of prices
- if the complete distribution is to be predicted, do we predict its mean, variance, and skew? Do we discretize the distribution, or do we treat the available prices as samples from the distribution that we predict?
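
As a rough illustration of the last two options, the sketch below summarizes the prices inside one interval either by their first three moments or by a discretized histogram. The sample prices and the number of bins are made up for the example and are not taken from the dataset.

```python
import numpy as np

# Hypothetical one-minute end prices observed inside a single interval.
prices = np.array([100.0, 100.2, 100.1, 100.4, 100.3, 100.9])

# Option 1: describe the distribution by its first three moments.
mean = prices.mean()
variance = prices.var(ddof=1)  # sample variance
centered = prices - mean
skew = np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5  # population skewness

# Option 2: discretize into bins and predict the probability of each bin.
counts, bin_edges = np.histogram(prices, bins=3)
bin_probs = counts / counts.sum()
```

With the moment-based option the model has three regression targets per interval; with the histogram option it has one probability per bin.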

Another decision is whether to predict:
- The percent change of the price
- The sign of the price movement
- The log of the actual price

From a practical perspective, even with perfect predictions of future prices, one may not be able to make any money. This depends on:
- fees: the fee to enter and exit a position should be lower than the expected change in the price
- execution price: on seeing a quoted price, how far above it (when buying) or below it (when selling) one actually executes
- volumes in the market: trading very large volumes disrupts the supply and demand mechanism, so one will not execute at the desired price
- how the broker executes trades
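
To make the fee and execution-price points concrete, here is a small first-order calculation. The function name `net_return` and all numbers are hypothetical, and compounding is ignored.

```python
def net_return(expected_move, fee_per_side, slippage_per_side):
    """Approximate fractional return of one round trip (enter + exit).

    All arguments are fractions of the traded price, e.g. 0.001 == 0.1%.
    Fees and slippage are paid twice: once to enter and once to exit.
    """
    return expected_move - 2.0 * (fee_per_side + slippage_per_side)


# A correctly predicted 0.15% move, with an assumed 0.05% fee
# and 0.03% slippage on each side of the trade.
profit = net_return(0.0015, 0.0005, 0.0003)
```

Under these assumed costs `profit` is negative, so even a perfect prediction of the move would lose money.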

# Insights: what is predictable

In the current notebook we try to get an understanding of what is easier and what is more difficult to predict.
For example, we find that when we organize the data into intervals of, say, 10, 15, or 30 minutes,
the mean price (`Mean(EndPrice)`) is much easier to predict than the end price `EndPrice`.

We also find that when predicting `PctChange:X` of a feature `X`, we should use input features that are normalized in the same way. One way to normalize the features is to divide them by `X[t - 1]`.
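
The difference between the two normalizations can be seen on a toy series. This is only a sketch with made-up numbers; `lagged_same_norm` mirrors what `rev_pct_change(a, 1)` computes later in this notebook.

```python
import pandas as pd

x = pd.Series([100.0, 110.0, 99.0, 108.9])

target = x.pct_change()               # (X[t] - X[t-1]) / X[t-1]
lagged_pct = x.pct_change().shift(1)  # (X[t-1] - X[t-2]) / X[t-2]

# The same past move, but divided by X[t-1], like the target.
lagged_same_norm = (x.shift(1) - x.shift(2)) / x.shift(1)
```

At the third observation, the past move from 100 to 110 is 10% when divided by `X[t - 2]`, but about 9.1% when divided by `X[t - 1]`, the denominator the target uses.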

One way to avoid this normalization difficulty would be the following:
- generate different linear combinations of prices (averages are also linear combinations, and so are absolute returns)
- compute the logs of all types of prices (or averages of prices)
- we may also attempt to predict linear combinations of prices by taking their logs
- notice that this is essentially a non-linear model, because we apply logs to sums of raw input features
- when using logs, care must be taken to avoid numbers close to zero as well as negative numbers
- however, for small changes log returns are approximately equal to the percent change, so logs can often be avoided
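
The last point follows from the expansion log(1 + r) = r - r²/2 + ..., so the gap between the percent change and the log return is roughly r²/2. A quick numerical check on a few illustrative values of r:

```python
import numpy as np

pct = np.array([0.0005, 0.001, 0.01, 0.05, 0.20])  # percent changes as fractions
log_ret = np.log1p(pct)                            # log(1 + r)
abs_error = pct - log_ret                          # roughly r**2 / 2 for small r
```

For moves of the size typically seen over a few minutes the error is negligible, while for a 20% move it is no longer small.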

We explore different ways to normalize the prices. One way would be to choose a more stable price as an "anchor" and replace all prices with linear functions of it, such as:
```
AnchorPrice = mean(Price[t - k, t - 1])
NormPrice:X[i] = (Price:X[i] - AnchorPrice)/AnchorPrice, for i = t, t - 1, ...
```
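
A possible pandas version of this anchor normalization is sketched below. The helper name `anchor_normalize` and the `lag` parameter are assumptions made for this example and are not used elsewhere in the notebook.

```python
import pandas as pd


def anchor_normalize(price, k, lag=0):
    """Express Price[t - lag] relative to the anchor mean(Price[t-k .. t-1])."""
    # shift(1) excludes the current price, so the anchor at t only uses the past.
    anchor = price.shift(1).rolling(k).mean()
    return (price.shift(lag) - anchor) / anchor


price = pd.Series([100.0, 102.0, 101.0, 103.0, 104.0])
norm_price = anchor_normalize(price, k=3)         # target: Price[t] vs. anchor
lagged = anchor_normalize(price, k=3, lag=1)      # feature: Price[t-1] vs. same anchor
```

Because the target and the lagged feature share the same anchor, both are normalized in the same way.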


Regarding the prediction of an `EndPrice`, we hypothesize that one would do better to:
- predict two prices, one 10 minutes ahead and one 5 minutes ahead
- take the difference between the two predictions


In the future we will also explore the possibility of predicting the entire distribution of prices, instead of just a single end price or an averaged price.


In [1]:
import pandas as pd
import numpy as np

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
mpl.rcParams['figure.figsize'] = (5, 3) # use bigger graphs

As usual, we first load the data we prepared in notebook 2.

In [2]:
input_file = '/data/processed/cooked_v3.pkl'
df = pd.read_pickle(input_file)
df['CalcDateTime'] = df.index

Next we prepare a dataset consisting of a single stock and compute derived features
(percent change) for a number of price features.

In [3]:
price_features = ['MaxPrice', 'MinPrice', 'LastEndPrice', 'FirstStartPrice', 'MeanEndPrice', 'MedianEndPrice', 
                  'MeanStartEndPrice', 'MeanMaxMinPrice', 'MeanAvg4Price',
                  'Direction1', 'Direction2', 'StdEndPrice', 'VolumeWeightedEndPrice']

log_ret_features = ['MaxPrice', 'MinPrice', 'LastEndPrice', 'FirstStartPrice', 'MeanEndPrice', 'MedianEndPrice', 
                  'MeanStartEndPrice', 'MeanMaxMinPrice', 'MeanAvg4Price']

indicator_features = [
    'F1', 'F2', 'F3', 'F4', 'F5', 'F6', 'F7', 'F8',
    'G1', 'G2', 'G3', 'G4'
]

def pct_change_of(feature):
    return 'PctChange:' + feature
    
def log_return_f(feature):
    return 'LogReturn:' + feature

def adj_log_return_f(feature):
    return 'AdjLogReturn:' + feature

def rev_pct_change_of_at_t(feature, t):
    return 'RevPctChange[t - {}]:{}'.format(str(t), feature)
    
def shifted(feature):
    return feature + '[t - 1]'

def closer_to(pnt, a, b):
    return (np.absolute(pnt - a) - np.absolute(pnt - b))/pnt

def closer_to_with_normalization(pnt, a, b, norm):
    return (np.absolute(pnt - a) - np.absolute(pnt - b))/norm

def closer_to_or(pnt1, a, pnt2, b):
    return 2.0*(np.absolute(pnt1 - a) - np.absolute(pnt2 - b))/(pnt1 + pnt2)

def rev_pct_change(a, t):
    one_step_in_past = a.shift(1)
    t_steps_in_past = a.shift(1 + t)
    return (one_step_in_past - t_steps_in_past)/one_step_in_past
    
def log_return(a):
    return np.log(a) - np.log(a.shift(1))

def adj_log_return(a, norm):
    return np.log(a) - np.log(norm)

def norm_feature(feature_family, t, norm_by):
    # norm_by is the name of the feature used as the normalizer
    return "{}@WithNorm({})[t - {}]".format(feature_family, norm_by, t)
    
def weighted_mean(prices, volumes, interval):
    prices_times_volumes = prices * volumes
    num = prices_times_volumes.resample(interval).sum()
    denom = volumes.resample(interval).sum()
    return num/denom
    
# use this to experiment with custom weights    
def custom_linear_comb(prices):
    prices = prices.values[:,]
    prices = prices[~np.isnan(prices)]
    if prices.shape[0] == 0:
        return np.nan
    weights = np.zeros_like(prices)
    if prices.shape[0] < 3:
        return prices[-1]
    weights[-1] = 1.0
    weights[-2] = 0.5
    return np.sum(np.multiply(prices, weights))/np.sum(weights)    

def prepare_single_stock(mnemonic, interval):
    # TODO: add traded volume to averaging of prices, also traded volume weighted differently (e.g. exponential weighting)
    # TODO: one can weight max/min/start/end price differently depending on the move
    # TODO: add exponential weighting of the prices within a window (also there are different ways to center)
    single_stock = df[df.Mnemonic == mnemonic].copy()
    single_stock['StartEndPrice'] = 0.5*(single_stock['StartPrice'] + single_stock['EndPrice'])
    single_stock['MaxMinPrice'] = 0.5*(single_stock['MaxPrice'] + single_stock['MinPrice'])
    single_stock['Avg4Price'] = 0.25*(single_stock['MaxPrice'] + single_stock['MinPrice'] + 
                                      single_stock['StartPrice'] + single_stock['EndPrice'])
    
    # TODO: add smoothed traded volume
    single_stock['PctChange'] = single_stock['EndPrice'].pct_change()
    single_stock['SmoothedTradedVolume'] = single_stock['TradedVolume'].ewm(com=2.5).mean()
    
    single_stock['Direction'] = \
        2.0*(single_stock['EndPrice'] - single_stock['StartPrice'])/ \
        (single_stock['EndPrice'] + single_stock['StartPrice'])
        
    single_stock['F1'] = - closer_to(single_stock['EndPrice'], single_stock['MaxPrice'], single_stock['MinPrice'])
        

    resampled = pd.DataFrame({
        'MaxPrice': single_stock['MaxPrice'].resample(interval).max(),
        'MinPrice': single_stock['MinPrice'].resample(interval).min(),
        'MeanStartEndPrice': single_stock['StartEndPrice'].resample(interval).mean(),  
        'MeanMaxMinPrice': single_stock['MaxMinPrice'].resample(interval).mean(), 
        'MeanAvg4Price': single_stock['Avg4Price'].resample(interval).mean(),  
        'CustomLinearComb': single_stock['Avg4Price'].resample(interval).apply(custom_linear_comb),         
        'FirstStartPrice': single_stock['StartPrice'].resample(interval).first(),        
        'LastEndPrice': single_stock['EndPrice'].resample(interval).last(), 
        'MeanEndPrice': single_stock['EndPrice'].resample(interval).mean(), 
        'MedianEndPrice': single_stock['EndPrice'].resample(interval).median(),
        'VolumeWeightedEndPrice': weighted_mean(single_stock['EndPrice'], single_stock['SmoothedTradedVolume'], interval),
        'VolumeWeightedPctChange': weighted_mean(single_stock['PctChange'], single_stock['SmoothedTradedVolume'], interval),
        
        'StdEndPrice': single_stock['EndPrice'].resample(interval).std(),
        'HasTrade': single_stock['HasTrade'].resample(interval).max(),
        'G1': single_stock['Direction'].resample(interval).mean(),
        'G2': np.sign(single_stock['Direction']).resample(interval).mean(),
        'G3': single_stock['F1'].resample(interval).mean(),
        'G4': np.sign(single_stock['F1']).resample(interval).mean()
    })
    resampled['AdjustedPctChange'] = (resampled['LastEndPrice'] - resampled['MeanEndPrice'])/resampled['MeanEndPrice']
    resampled['AdjustedPctChange[t - 1]'] = resampled['AdjustedPctChange'].shift(1)
    
    resampled['MeanPctChangeV2'] = (resampled['MeanEndPrice'] - resampled['LastEndPrice'].shift(1))/resampled['LastEndPrice'].shift(1)
    
    anchor = resampled['MeanEndPrice'].shift(1)
    resampled[adj_log_return_f('LastEndPrice')] = adj_log_return(resampled['LastEndPrice'], anchor)
    resampled[shifted(adj_log_return_f('LastEndPrice'))] = \
        adj_log_return(resampled['LastEndPrice'].shift(1), anchor)
    
    resampled['Direction1'] = \
        2.0*(resampled['LastEndPrice'] - resampled['FirstStartPrice'])/ \
        (resampled['LastEndPrice'] + resampled['FirstStartPrice'])
    
    resampled['Direction2'] = \
        2.0*(resampled['LastEndPrice'] - resampled['FirstStartPrice'])/ \
        resampled['MeanEndPrice'].shift(1)
        
    resampled[shifted('Direction1')] = resampled['Direction1'].shift(1)
    resampled[shifted('Direction2')] = resampled['Direction2'].shift(1)
    
    for f in ['MinPrice', 'MaxPrice', 'LastEndPrice', 'FirstStartPrice']:
        resampled[shifted(f)] = resampled[f].shift(1)
        
    resampled['F1'] = - closer_to(resampled['LastEndPrice'], resampled['MaxPrice'], resampled['MinPrice'])
    resampled['F2'] = - closer_to(resampled['MaxPrice'], resampled['LastEndPrice'], resampled['FirstStartPrice'])
    resampled['F3'] = closer_to(resampled['MinPrice'], resampled['LastEndPrice'], resampled['FirstStartPrice'])
    resampled['F4'] = - closer_to(resampled['LastEndPrice'], resampled['MaxPrice'], resampled[shifted('MaxPrice')])
    resampled['F5'] = closer_to(resampled['LastEndPrice'], resampled['MinPrice'], resampled[shifted('MinPrice')])
    
    resampled['F6'] = - closer_to_or(resampled['LastEndPrice'], resampled['MaxPrice'],
                                   resampled[shifted('LastEndPrice')], resampled[shifted('MaxPrice')])

    resampled['F7'] = closer_to_or(resampled['LastEndPrice'], resampled['MinPrice'],
                                   resampled[shifted('LastEndPrice')], resampled[shifted('MinPrice')])
    
    
    resampled['F8'] = np.where(resampled['Direction2'] >= 0, resampled['F4'] , resampled['F5'])
    
    
    for t in range(1, 5):    
        # note: normalization is fixed
        resampled[norm_feature('H1', t, 'MeanEndPrice')] = - closer_to_with_normalization(
                                             resampled['LastEndPrice'].shift(t), 
                                             resampled['MaxPrice'].shift(t), 
                                             resampled['MinPrice'].shift(t),
                                             resampled['MeanEndPrice'].shift(1))    
        
    
    for f in indicator_features:
        resampled[shifted(f)] = resampled[f].shift(1)
        
    for f in price_features:
        pct_change_f = pct_change_of(f)
        resampled[pct_change_f] = resampled[f].pct_change()
        resampled[shifted(pct_change_f)] = resampled[pct_change_f].shift(1) 
    
    for f in price_features:
        for t in range(1, 5):
            rev_pct_change_f_t = rev_pct_change_of_at_t(f, t)
            resampled[rev_pct_change_f_t] = rev_pct_change(resampled[f], t)    
        

    for f in log_ret_features:
        log_ret_f = log_return_f(f)
        resampled[log_ret_f] = log_return(resampled[f])
        resampled[shifted(log_ret_f)] = resampled[log_ret_f].shift(1) 
        
    resampled = resampled[resampled['HasTrade'] == 1.0]
    
    return resampled

def correlation_with_feature(single_stock, corr_feature):
    pct_change_of_f = corr_feature
    sh_dir1 = shifted('Direction1')
    sh_dir2 = shifted('Direction2')    
    d = {
        pct_change_of_f: single_stock[pct_change_of_f],
        sh_dir1: single_stock[sh_dir1],
        sh_dir2: single_stock[sh_dir2]        
    }
    
    for f in indicator_features:
        d[shifted(f)] = single_stock[shifted(f)]
        
    d['AdjustedPctChange[t - 1]'] = single_stock['AdjustedPctChange[t - 1]']    
    d[adj_log_return_f('LastEndPrice')] = single_stock[adj_log_return_f('LastEndPrice')]
    d[shifted(adj_log_return_f('LastEndPrice'))] = single_stock[shifted(adj_log_return_f('LastEndPrice'))]
    
    d['MeanPctChangeV2'] = single_stock['MeanPctChangeV2']
    
    for f in price_features:
        pct_change_shifted = shifted(pct_change_of(f))
        d[pct_change_shifted] = single_stock[pct_change_shifted]
        
    for t in range(1, 5):    
        # note: normalization is fixed
        f = norm_feature('H1', t, 'MeanEndPrice')
        d[f] = single_stock[f]
            
    for f in log_ret_features:
        log_ret_f = log_return_f(f)
        d[log_ret_f] = single_stock[log_ret_f]
            
    for f in price_features:
        for t in range(1, 5):
            rev_pct_change_f_t = rev_pct_change_of_at_t(f, t)
            d[rev_pct_change_f_t] = single_stock[rev_pct_change_f_t]
            
    d[shifted('VolumeWeightedPctChange')] = single_stock[('VolumeWeightedPctChange')].shift(1)
    corr = pd.DataFrame(d).corr()
    row_id = np.argwhere(corr.index.values == pct_change_of_f)[0][0]
    return corr.iloc[[row_id]].drop(columns=[pct_change_of_f])

def corr_pct_change(single_stock, feature):
    pct_change_of_f = pct_change_of(feature)
    return correlation_with_feature(single_stock, pct_change_of_f)

def corr_log_return(single_stock, feature):
    return correlation_with_feature(single_stock, log_return_f(feature))

def find_most_correlated_features(single_stock, feature):
    corrs = correlation_with_feature(single_stock, feature).T
    corrs['AbsCorr'] = np.absolute(corrs[feature])
    sorted_corrs = corrs.sort_values('AbsCorr', ascending=False)
    # keep only features lagged by one or two periods; build a list so it is safe for indexing
    selected_names = [n for n in sorted_corrs.index if '[t - 1' in n or '[t - 2' in n]
    sorted_corrs = sorted_corrs.loc[selected_names]
    return sorted_corrs

We choose a single stock, for example 'SIE' (Siemens)

In [4]:
single_stock = prepare_single_stock('SIE', '30Min')

We look at the correlations with all the other price features from the previous time period `(t - 1)`.
In other words, we investigate whether a single feature from the previous time period `(t - 1)`
is predictive of the change in the next time period `(t)`.

We first study `LastEndPrice`, the last end price in interval `t` (here an interval is 30 minutes).
As can be seen, there are no strong correlations.

In [5]:
corr_pct_change(single_stock, 'LastEndPrice')

Unnamed: 0,Direction1[t - 1],Direction2[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],F6[t - 1],F7[t - 1],F8[t - 1],...,RevPctChange[t - 4]:Direction2,RevPctChange[t - 1]:StdEndPrice,RevPctChange[t - 2]:StdEndPrice,RevPctChange[t - 3]:StdEndPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 1]:VolumeWeightedEndPrice,RevPctChange[t - 2]:VolumeWeightedEndPrice,RevPctChange[t - 3]:VolumeWeightedEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice,VolumeWeightedPctChange[t - 1]
PctChange:LastEndPrice,-0.014452,-0.031263,-0.028687,-0.014556,-0.014273,0.019518,-0.018305,-0.020372,-0.030215,-0.013344,...,-0.025498,0.004233,-0.036114,0.031601,-0.015092,0.011779,-0.00563,-0.032617,-0.018977,-0.004466


Next we look for correlations with the mean prices:
- MeanAvg4Price: we average all 4 prices available within a minute and then average within an interval such as 10Min
- MeanMaxMinPrice: we take the average of the Min and Max prices and then average within an interval
- MeanStartEndPrice: we take the average of the Start and End prices and then average within an interval

# Conclusion:

- The more averaging is done, the easier the price is to predict
- `EndPrice` is very difficult to predict, but the mean price is not
- The median of the end price is harder to predict than the mean, but a lot easier than the end price



In [6]:
corr_pct_change(single_stock, 'MeanAvg4Price')

Unnamed: 0,Direction1[t - 1],Direction2[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],F6[t - 1],F7[t - 1],F8[t - 1],...,RevPctChange[t - 4]:Direction2,RevPctChange[t - 1]:StdEndPrice,RevPctChange[t - 2]:StdEndPrice,RevPctChange[t - 3]:StdEndPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 1]:VolumeWeightedEndPrice,RevPctChange[t - 2]:VolumeWeightedEndPrice,RevPctChange[t - 3]:VolumeWeightedEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice,VolumeWeightedPctChange[t - 1]
PctChange:MeanAvg4Price,0.499097,0.545182,0.594027,0.500112,0.497522,0.100365,0.098164,0.3572,0.337546,0.396958,...,0.007962,7.5e-05,-0.036812,0.007804,-0.017486,0.273278,0.173408,0.121242,0.090679,0.279725


In [7]:
find_most_correlated_features(single_stock, pct_change_of('MeanAvg4Price'))

Unnamed: 0,PctChange:MeanAvg4Price,AbsCorr
AdjustedPctChange[t - 1],0.648615,0.648615
AdjLogReturn:LastEndPrice[t - 1],0.648611,0.648611
F1[t - 1],0.594027,0.594027
H1@WithNorm(MeanEndPrice)[t - 1],0.593925,0.593925
RevPctChange[t - 1]:LastEndPrice,0.545623,0.545623
Direction2[t - 1],0.545182,0.545182
F2[t - 1],0.500112,0.500112
Direction1[t - 1],0.499097,0.499097
F3[t - 1],0.497522,0.497522
G1[t - 1],0.426857,0.426857


In [8]:
corr_pct_change(single_stock, 'MeanMaxMinPrice')

Unnamed: 0,Direction1[t - 1],Direction2[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],F6[t - 1],F7[t - 1],F8[t - 1],...,RevPctChange[t - 4]:Direction2,RevPctChange[t - 1]:StdEndPrice,RevPctChange[t - 2]:StdEndPrice,RevPctChange[t - 3]:StdEndPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 1]:VolumeWeightedEndPrice,RevPctChange[t - 2]:VolumeWeightedEndPrice,RevPctChange[t - 3]:VolumeWeightedEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice,VolumeWeightedPctChange[t - 1]
PctChange:MeanMaxMinPrice,0.498263,0.544435,0.59389,0.499274,0.496686,0.100764,0.098704,0.356966,0.337818,0.397218,...,0.007842,0.000264,-0.036838,0.007853,-0.01757,0.272533,0.172921,0.120676,0.090292,0.278972


In [9]:
corr_pct_change(single_stock, 'MeanStartEndPrice')

Unnamed: 0,Direction1[t - 1],Direction2[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],F6[t - 1],F7[t - 1],F8[t - 1],...,RevPctChange[t - 4]:Direction2,RevPctChange[t - 1]:StdEndPrice,RevPctChange[t - 2]:StdEndPrice,RevPctChange[t - 3]:StdEndPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 1]:VolumeWeightedEndPrice,RevPctChange[t - 2]:VolumeWeightedEndPrice,RevPctChange[t - 3]:VolumeWeightedEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice,VolumeWeightedPctChange[t - 1]
PctChange:MeanStartEndPrice,0.499896,0.545891,0.59412,0.500914,0.498324,0.099959,0.097617,0.35741,0.337251,0.396669,...,0.008082,-0.000114,-0.036783,0.007755,-0.017401,0.274004,0.173883,0.121801,0.091061,0.280448


In [10]:
corr_pct_change(single_stock, 'MeanEndPrice')

Unnamed: 0,Direction1[t - 1],Direction2[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],F6[t - 1],F7[t - 1],F8[t - 1],...,RevPctChange[t - 4]:Direction2,RevPctChange[t - 1]:StdEndPrice,RevPctChange[t - 2]:StdEndPrice,RevPctChange[t - 3]:StdEndPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 1]:VolumeWeightedEndPrice,RevPctChange[t - 2]:VolumeWeightedEndPrice,RevPctChange[t - 3]:VolumeWeightedEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice,VolumeWeightedPctChange[t - 1]
PctChange:MeanEndPrice,0.483451,0.526626,0.578251,0.484426,0.481941,0.100102,0.096647,0.346744,0.328076,0.386489,...,0.00797,0.001267,-0.03703,0.009331,-0.017346,0.261489,0.165408,0.114751,0.08554,0.267774


In [11]:
corr_pct_change(single_stock, 'MedianEndPrice')

Unnamed: 0,Direction1[t - 1],Direction2[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],F6[t - 1],F7[t - 1],F8[t - 1],...,RevPctChange[t - 4]:Direction2,RevPctChange[t - 1]:StdEndPrice,RevPctChange[t - 2]:StdEndPrice,RevPctChange[t - 3]:StdEndPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 1]:VolumeWeightedEndPrice,RevPctChange[t - 2]:VolumeWeightedEndPrice,RevPctChange[t - 3]:VolumeWeightedEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice,VolumeWeightedPctChange[t - 1]
PctChange:MedianEndPrice,0.43313,0.495696,0.531789,0.434237,0.431455,0.103161,0.099751,0.333383,0.314047,0.375251,...,0.003766,-0.001568,-0.036419,0.017188,-0.017327,0.231907,0.146785,0.102032,0.075265,0.25105


In [12]:
corr_pct_change(single_stock, 'VolumeWeightedEndPrice')

Unnamed: 0,Direction1[t - 1],Direction2[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],F6[t - 1],F7[t - 1],F8[t - 1],...,RevPctChange[t - 4]:Direction2,RevPctChange[t - 1]:StdEndPrice,RevPctChange[t - 2]:StdEndPrice,RevPctChange[t - 3]:StdEndPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 1]:VolumeWeightedEndPrice,RevPctChange[t - 2]:VolumeWeightedEndPrice,RevPctChange[t - 3]:VolumeWeightedEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice,VolumeWeightedPctChange[t - 1]
PctChange:VolumeWeightedEndPrice,0.40708,0.465904,0.517969,0.408149,0.405454,0.102325,0.092331,0.321097,0.313636,0.355343,...,0.007447,0.005583,-0.036474,0.014437,-0.010864,0.201354,0.123107,0.076505,0.054892,0.223231


In [13]:
correlation_with_feature(single_stock, 'MeanPctChangeV2')

Unnamed: 0,Direction1[t - 1],Direction2[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],F6[t - 1],F7[t - 1],F8[t - 1],...,RevPctChange[t - 4]:Direction2,RevPctChange[t - 1]:StdEndPrice,RevPctChange[t - 2]:StdEndPrice,RevPctChange[t - 3]:StdEndPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 1]:VolumeWeightedEndPrice,RevPctChange[t - 2]:VolumeWeightedEndPrice,RevPctChange[t - 3]:VolumeWeightedEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice,VolumeWeightedPctChange[t - 1]
MeanPctChangeV2,-0.031991,-0.045596,-0.04144,-0.032153,-0.031811,0.023188,-0.014245,-0.022885,-0.022381,-0.019361,...,-0.009426,0.016292,-0.040286,0.028166,-0.019766,-0.009845,-0.026365,-0.041061,-0.028729,-0.024427


In [14]:
find_most_correlated_features(single_stock, 'MeanPctChangeV2')

Unnamed: 0,MeanPctChangeV2,AbsCorr
G4[t - 1],-0.06323,0.06323
G2[t - 1],-0.062256,0.062256
G3[t - 1],-0.059414,0.059414
AdjLogReturn:LastEndPrice[t - 1],-0.056929,0.056929
AdjustedPctChange[t - 1],-0.056923,0.056923
G1[t - 1],-0.053496,0.053496
Direction2[t - 1],-0.045596,0.045596
RevPctChange[t - 1]:LastEndPrice,-0.043364,0.043364
F1[t - 1],-0.04144,0.04144
H1@WithNorm(MeanEndPrice)[t - 1],-0.041382,0.041382


Let us settle for the moment on the 'MeanEndPrice', and compute the correlations over multiple intervals

In [15]:
# some of the better features
x_features = [
    'AdjustedPctChange[t - 1]',
    'F1[t - 1]',
    'H1@WithNorm(MeanEndPrice)[t - 1]',
    'RevPctChange[t - 1]:LastEndPrice',
    'F2[t - 1]',
    'Direction2[t - 1]',
    'F3[t - 1]',
    'G1[t - 1]',
    'G3[t - 1]',
    'PctChange:LastEndPrice[t - 1]',
    'RevPctChange[t - 2]:LastEndPrice',
    'F6[t - 1]',
    'G2[t - 1]',
    'F7[t - 1]',
    'G4[t - 1]',
    'RevPctChange[t - 1]:VolumeWeightedEndPrice',
    'RevPctChange[t - 1]:MinPrice',
    'RevPctChange[t - 1]:MeanEndPrice',
    'RevPctChange[t - 1]:MeanMaxMinPrice',
    'RevPctChange[t - 1]:MeanAvg4Price',
    'RevPctChange[t - 1]:MeanStartEndPrice',
    'RevPctChange[t - 1]:MaxPrice']

In [16]:
intervals = ['1Min', '2Min', '3Min', '4Min', '5Min', '6Min', '7Min', '8Min', '9Min', '10Min', 
             '15Min', '30Min', '60Min', '120Min', '180Min', '240Min',
             '1D', '2D', '1W', '2W']
def corr_over_intervals(stock):
    results = []
    for interval in intervals:
        single_stock = prepare_single_stock(stock, interval)
        cr = corr_pct_change(single_stock, 'MeanEndPrice')
        cr.index = [cr.index[0] + "@" + interval]
        results.append(cr)
    return pd.concat(results)

In [17]:
corr_over_intervals('BMW')[x_features]

Unnamed: 0,AdjustedPctChange[t - 1],F1[t - 1],H1@WithNorm(MeanEndPrice)[t - 1],RevPctChange[t - 1]:LastEndPrice,F2[t - 1],Direction2[t - 1],F3[t - 1],G1[t - 1],G3[t - 1],PctChange:LastEndPrice[t - 1],...,G2[t - 1],F7[t - 1],G4[t - 1],RevPctChange[t - 1]:VolumeWeightedEndPrice,RevPctChange[t - 1]:MinPrice,RevPctChange[t - 1]:MeanEndPrice,RevPctChange[t - 1]:MeanMaxMinPrice,RevPctChange[t - 1]:MeanAvg4Price,RevPctChange[t - 1]:MeanStartEndPrice,RevPctChange[t - 1]:MaxPrice
PctChange:MeanEndPrice@1Min,,-0.040016,-0.040016,-0.057669,-0.04406,-0.044253,-0.04418,-0.044115,-0.040016,-0.0422,...,-0.025962,-0.027606,-0.024194,-0.057669,-0.038706,-0.057669,-0.041413,-0.041558,-0.040399,-0.035641
PctChange:MeanEndPrice@2Min,0.342471,0.289739,0.289734,0.22545,0.244871,0.268131,0.244607,0.220126,0.188707,0.136605,...,0.167881,0.204659,0.136711,0.116428,0.065166,0.107801,0.054954,0.051866,0.048316,0.076487
PctChange:MeanEndPrice@3Min,0.551271,0.495723,0.495199,0.368611,0.441762,0.378296,0.44269,0.332429,0.284448,0.28267,...,0.243943,0.273013,0.202095,0.196265,0.134186,0.156402,0.102394,0.09978,0.096735,0.129668
PctChange:MeanEndPrice@4Min,0.566814,0.525967,0.525993,0.430283,0.462232,0.433443,0.461978,0.378099,0.324907,0.290082,...,0.286096,0.308451,0.233498,0.216985,0.164556,0.183966,0.133441,0.131541,0.129296,0.145654
PctChange:MeanEndPrice@5Min,0.565416,0.531774,0.531898,0.430781,0.455919,0.440793,0.454894,0.387331,0.34472,0.273967,...,0.300524,0.325587,0.250678,0.20939,0.150874,0.196675,0.152193,0.151793,0.151141,0.161078
PctChange:MeanEndPrice@6Min,0.590399,0.555696,0.555851,0.467702,0.477667,0.472083,0.476358,0.421592,0.366387,0.289028,...,0.313906,0.340947,0.26296,0.22711,0.166189,0.205975,0.165595,0.164571,0.163352,0.183919
PctChange:MeanEndPrice@7Min,0.625876,0.591994,0.592208,0.510106,0.523547,0.5152,0.522686,0.431648,0.38051,0.317882,...,0.323291,0.374064,0.270319,0.250972,0.193256,0.213155,0.173343,0.172789,0.172079,0.19612
PctChange:MeanEndPrice@8Min,0.630653,0.594318,0.59435,0.515764,0.515287,0.520891,0.513944,0.467379,0.406207,0.307911,...,0.334354,0.344175,0.273827,0.249791,0.210597,0.2109,0.177758,0.176773,0.175667,0.181611
PctChange:MeanEndPrice@9Min,0.618434,0.579562,0.579671,0.49609,0.501663,0.499425,0.50013,0.435211,0.370102,0.301639,...,0.329688,0.335445,0.268693,0.237636,0.2036,0.207381,0.181254,0.180439,0.179524,0.184723
PctChange:MeanEndPrice@10Min,0.614705,0.57466,0.574693,0.483943,0.481906,0.48757,0.480498,0.4287,0.3752,0.286715,...,0.332562,0.330276,0.265342,0.219296,0.192938,0.198068,0.174794,0.174223,0.173575,0.169566


In [18]:
corr_over_intervals('SIE')[x_features]

Unnamed: 0,AdjustedPctChange[t - 1],F1[t - 1],H1@WithNorm(MeanEndPrice)[t - 1],RevPctChange[t - 1]:LastEndPrice,F2[t - 1],Direction2[t - 1],F3[t - 1],G1[t - 1],G3[t - 1],PctChange:LastEndPrice[t - 1],...,G2[t - 1],F7[t - 1],G4[t - 1],RevPctChange[t - 1]:VolumeWeightedEndPrice,RevPctChange[t - 1]:MinPrice,RevPctChange[t - 1]:MeanEndPrice,RevPctChange[t - 1]:MeanMaxMinPrice,RevPctChange[t - 1]:MeanAvg4Price,RevPctChange[t - 1]:MeanStartEndPrice,RevPctChange[t - 1]:MaxPrice
PctChange:MeanEndPrice@1Min,,-0.090224,-0.090224,-0.085911,-0.082159,-0.082471,-0.082134,-0.082144,-0.090224,-0.063377,...,-0.078285,-0.051484,-0.088564,-0.085911,-0.049432,-0.085911,-0.052119,-0.05914,-0.064718,-0.04442
PctChange:MeanEndPrice@2Min,0.301682,0.230909,0.230919,0.193584,0.20288,0.233511,0.202869,0.144197,0.111798,0.112306,...,0.109725,0.173531,0.069823,0.09348,0.059237,0.083069,0.046689,0.043832,0.040555,0.06355
PctChange:MeanEndPrice@3Min,0.567136,0.489287,0.489356,0.342411,0.465839,0.355991,0.467812,0.239553,0.194866,0.280627,...,0.193716,0.243119,0.142012,0.179563,0.119608,0.144509,0.098368,0.096647,0.094445,0.121027
PctChange:MeanEndPrice@4Min,0.590153,0.530668,0.530669,0.409752,0.492846,0.421914,0.493727,0.292073,0.236732,0.281313,...,0.237012,0.272101,0.174166,0.214838,0.157742,0.172412,0.133051,0.13111,0.128759,0.148077
PctChange:MeanEndPrice@5Min,0.574304,0.531847,0.531773,0.430875,0.493123,0.443913,0.493072,0.322261,0.268725,0.271183,...,0.267913,0.292105,0.202738,0.219473,0.158246,0.18651,0.147531,0.146439,0.145044,0.147374
PctChange:MeanEndPrice@6Min,0.598595,0.557032,0.556929,0.463866,0.502388,0.47294,0.50171,0.347485,0.291706,0.277257,...,0.28708,0.300238,0.223415,0.230445,0.182982,0.198046,0.164598,0.163655,0.162469,0.162552
PctChange:MeanEndPrice@7Min,0.580801,0.540116,0.540132,0.475107,0.483449,0.499448,0.482672,0.327988,0.27241,0.272062,...,0.28417,0.308659,0.206439,0.237626,0.201766,0.204463,0.175912,0.175371,0.174649,0.176317
PctChange:MeanEndPrice@8Min,0.620121,0.581521,0.581395,0.508523,0.514442,0.512527,0.512973,0.388415,0.323499,0.280601,...,0.318103,0.313624,0.242819,0.257305,0.215078,0.220813,0.194119,0.193123,0.191955,0.186707
PctChange:MeanEndPrice@9Min,0.623222,0.58249,0.582389,0.498934,0.526482,0.504428,0.525521,0.36446,0.304915,0.290548,...,0.312963,0.304046,0.244559,0.25454,0.220449,0.216911,0.193835,0.193237,0.192508,0.18234
PctChange:MeanEndPrice@10Min,0.615522,0.566007,0.565975,0.499654,0.492738,0.504239,0.490857,0.381413,0.322332,0.264453,...,0.315689,0.305738,0.244638,0.241467,0.210238,0.213795,0.192759,0.192372,0.191879,0.177824


In [19]:
# corr_over_intervals('SAP')

The strong correlations are due to averaging.
As an experiment, we check the correlation between the percent change of `MaxPrice`, which uses only `MaxPrice`,
and `Direction2`, whose numerator uses only `LastEndPrice` and `FirstStartPrice`.
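
The effect of averaging can also be reproduced on synthetic data. The sketch below uses a simulated random walk, not the notebook's dataset, and all names and parameters are chosen for this example. It compares how well the previous interval's `(last - mean) / mean` correlates with the next change of the last price and of the mean price.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_intervals = 30, 20_000  # 30 simulated "minutes" per interval

# Pure random walk: by construction the next step cannot be predicted.
steps = rng.normal(0.0, 0.01, size=m * n_intervals)
prices = (100.0 + np.cumsum(steps)).reshape(n_intervals, m)

last = prices[:, -1]         # analogue of LastEndPrice
mean = prices.mean(axis=1)   # analogue of MeanEndPrice

pct_last = last[1:] / last[:-1] - 1.0  # change of the last price at t
pct_mean = mean[1:] / mean[:-1] - 1.0  # change of the mean price at t
adjusted = (last - mean) / mean        # analogue of AdjustedPctChange

# Correlate the change at t with the feature from interval t - 1.
corr_last = np.corrcoef(pct_last, adjusted[:-1])[0, 1]
corr_mean = np.corrcoef(pct_mean, adjusted[:-1])[0, 1]
```

On such a walk `corr_last` should be close to zero, while `corr_mean` should come out well above 0.5. The change of the mean contains the term `last[t - 1] - mean[t - 1]` by construction, so a high correlation here does not imply that the last price itself is predictable.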



In [20]:
single_stock = prepare_single_stock('BMW', '1D')
pd.DataFrame({
    'PctChange:MaxPrice': single_stock['PctChange:MaxPrice'],
    'Direction2[t + 1]': single_stock['Direction2'].shift(-1),   
    'Direction2[t - 0]': single_stock['Direction2'],        
    'Direction2[t - 1]': single_stock['Direction2[t - 1]'],
    'Direction2[t - 2]': single_stock['Direction2[t - 1]'].shift(1),
    'Direction2[t - 3]': single_stock['Direction2[t - 1]'].shift(2),
    # lag 5: shift(4) applied to the already-lagged [t - 1] column
    'Direction2[t - 5]': single_stock['Direction2[t - 1]'].shift(4)
}).corr()[['PctChange:MaxPrice']]

Unnamed: 0,PctChange:MaxPrice
PctChange:MaxPrice,1.0
Direction2[t + 1],-0.040965
Direction2[t - 0],0.457563
Direction2[t - 1],0.522053
Direction2[t - 2],0.103093
Direction2[t - 3],0.060559
Direction2[t - 5],-0.088948


In [21]:
single_stock = prepare_single_stock('BMW', '1D')
pd.DataFrame({
    'PctChange:LastEndPrice': single_stock['PctChange:LastEndPrice'],
    'Direction2[t + 1]': single_stock['Direction2'].shift(-1),   
    'Direction2[t - 0]': single_stock['Direction2'],        
    'Direction2[t - 1]': single_stock['Direction2[t - 1]'],
    'Direction2[t - 2]': single_stock['Direction2[t - 1]'].shift(1),
    'Direction2[t - 3]': single_stock['Direction2[t - 1]'].shift(2),
    # lag 5: shift(4) applied to the already-lagged [t - 1] column
    'Direction2[t - 5]': single_stock['Direction2[t - 1]'].shift(4)
}).corr()[['PctChange:LastEndPrice']]

Unnamed: 0,PctChange:LastEndPrice
PctChange:LastEndPrice,1.0
Direction2[t + 1],0.037161
Direction2[t - 0],0.764989
Direction2[t - 1],0.165309
Direction2[t - 2],0.171774
Direction2[t - 3],0.01543
Direction2[t - 5],-0.084145


The reason that there is little correlation between periods `(t)` and `(t - 2)` is related to the normalization used in the percent change. We can verify this independently next.

In [22]:
single_stock = prepare_single_stock('BMW', '1D')
last_end_price = single_stock['PctChange:LastEndPrice']
pd.DataFrame({
    'PctChange:LastEndPrice': last_end_price,       
    'PctChange:LastEndPrice[t - 1]': last_end_price.shift(1),
    'PctChange:LastEndPrice[t - 2]': last_end_price.shift(2),
    'PctChange:LastEndPrice[t - 3]': last_end_price.shift(3),
    'PctChange:LastEndPrice[t - 4]': last_end_price.shift(4)    
}).corr()[['PctChange:LastEndPrice']]

Unnamed: 0,PctChange:LastEndPrice
PctChange:LastEndPrice,1.0
PctChange:LastEndPrice[t - 1],0.069811
PctChange:LastEndPrice[t - 2],0.061999
PctChange:LastEndPrice[t - 3],0.03779
PctChange:LastEndPrice[t - 4],-0.093435


Once we normalize the features in the same way, we find that past periods also exhibit
correlations with the present.

In [23]:
single_stock = prepare_single_stock('BMW', '1D')
last_end_price = single_stock['PctChange:LastEndPrice']
pd.DataFrame({
    'PctChange:LastEndPrice': last_end_price,       
    rev_pct_change_of_at_t('LastEndPrice', 1): single_stock[rev_pct_change_of_at_t('LastEndPrice', 1)],
    rev_pct_change_of_at_t('LastEndPrice', 2): single_stock[rev_pct_change_of_at_t('LastEndPrice', 2)],
    rev_pct_change_of_at_t('LastEndPrice', 3): single_stock[rev_pct_change_of_at_t('LastEndPrice', 3)]
}).corr()[['PctChange:LastEndPrice']]

Unnamed: 0,PctChange:LastEndPrice
PctChange:LastEndPrice,1.0
RevPctChange[t - 1]:LastEndPrice,0.082298
RevPctChange[t - 2]:LastEndPrice,0.138807
RevPctChange[t - 3]:LastEndPrice,0.116341


In [24]:
single_stock = prepare_single_stock('BMW', '1D')
mean_end_price = single_stock['PctChange:MeanEndPrice']
pd.DataFrame({
    'PctChange:MeanEndPrice': mean_end_price,       
    rev_pct_change_of_at_t('MeanEndPrice', 1): single_stock[rev_pct_change_of_at_t('MeanEndPrice', 1)],
    rev_pct_change_of_at_t('MeanEndPrice', 2): single_stock[rev_pct_change_of_at_t('MeanEndPrice', 2)],
    rev_pct_change_of_at_t('MeanEndPrice', 3): single_stock[rev_pct_change_of_at_t('MeanEndPrice', 3)]
}).corr()[['PctChange:MeanEndPrice']]

Unnamed: 0,PctChange:MeanEndPrice
PctChange:MeanEndPrice,1.0
RevPctChange[t - 1]:MeanEndPrice,0.074884
RevPctChange[t - 2]:MeanEndPrice,0.098395
RevPctChange[t - 3]:MeanEndPrice,0.170181


When we experiment with the same feature under different normalizations, we find that:
- with a different normalization, we do not see correlations beyond `(t - 1)`
- with the same normalization, we may see correlations with the past
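
To make the two normalizations concrete, here is a minimal sketch on a toy price series. The definitions are assumptions: the notebook's `rev_pct_change_of_at_t` and `norm_feature` are not shown in this section, so `common_anchor_change` below is a hypothetical stand-in that divides every past change by the same anchor `X[t - 1]`.

```python
import pandas as pd

def lagged_pct_change(x, k):
    """Percent change of period (t - k), normalized by its own previous value x[t - k - 1]."""
    return x.pct_change().shift(k)

def common_anchor_change(x, k):
    """Change of period (t - k), normalized by the shared anchor x[t - 1] (assumed definition)."""
    return (x.shift(k) - x.shift(k + 1)) / x.shift(1)

# Toy prices, chosen only for illustration.
x = pd.Series([100.0, 102.0, 101.0, 105.0, 104.0, 108.0])

own = lagged_pct_change(x, 2)        # each row uses a different denominator
shared = common_anchor_change(x, 2)  # every lag of a row shares the denominator x[t - 1]
print(pd.DataFrame({"own_anchor": own, "shared_anchor": shared}))
```

With the shared anchor, all lagged features of a given row are expressed in the same units, which is the property the text argues for.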

In [25]:
single_stock = prepare_single_stock('BMW', '1D')
mean_end_price = single_stock['PctChange:MeanEndPrice']
pd.DataFrame({
    'PctChange:MeanEndPrice': mean_end_price,       
    norm_feature('H1', 1, 'MeanEndPrice'): single_stock[norm_feature('H1', 1, 'MeanEndPrice')],
    norm_feature('H1', 2, 'MeanEndPrice'): single_stock[norm_feature('H1', 2, 'MeanEndPrice')],
    norm_feature('H1', 3, 'MeanEndPrice'): single_stock[norm_feature('H1', 3, 'MeanEndPrice')],
    norm_feature('H1', 4, 'MeanEndPrice'): single_stock[norm_feature('H1', 4, 'MeanEndPrice')],
    'F1[t - 1]': single_stock['F1[t - 1]'],
    'F1[t - 2]': single_stock['F1[t - 1]'].shift(1),
    'F1[t - 3]': single_stock['F1[t - 1]'].shift(2),
    'F1[t - 4]': single_stock['F1[t - 1]'].shift(3)  
}).corr()[['PctChange:MeanEndPrice']]

Unnamed: 0,PctChange:MeanEndPrice
PctChange:MeanEndPrice,1.0
H1@WithNorm(MeanEndPrice)[t - 1],0.396102
H1@WithNorm(MeanEndPrice)[t - 2],0.024512
H1@WithNorm(MeanEndPrice)[t - 3],-0.083978
H1@WithNorm(MeanEndPrice)[t - 4],0.258168
F1[t - 1],0.396335
F1[t - 2],0.041163
F1[t - 3],-0.014884
F1[t - 4],0.059665


## Log of return
In their [demo notebook](https://github.com/googledatalab/notebooks/blob/master/samples/TensorFlow/Machine%20Learning%20with%20Financial%20Data.ipynb) Google Cloud computes a feature that they call log return. This feature is:

```
log(Price:X[t]/Price:X[t - 1])
```
for some version of price X.
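
As a minimal sketch, the feature can be computed with pandas as follows. The helper name `log_return` and the toy prices are assumptions for illustration; the notebook's own `LogReturn:*` columns are prepared elsewhere.

```python
import numpy as np
import pandas as pd

def log_return(price):
    """log(price[t] / price[t - 1]); the first element is NaN because it has no predecessor."""
    return np.log(price / price.shift(1))

# Toy prices, chosen only for illustration.
price = pd.Series([100.0, 101.0, 99.0, 102.0])
returns = log_return(price)
print(returns)

# Log returns add up across periods: their sum is the log return over the whole span.
print(returns.sum(), np.log(price.iloc[-1] / price.iloc[0]))
```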

We explore this feature below. When the `Mean` of the price is used, this feature correlates well
with various predictors; when the `Last` of the price is used, it correlates poorly.

In [26]:
single_stock = prepare_single_stock('SIE', '30Min')
corr_log_return(single_stock, 'MeanEndPrice')

Unnamed: 0,Direction1[t - 1],Direction2[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],F6[t - 1],F7[t - 1],F8[t - 1],...,RevPctChange[t - 4]:Direction2,RevPctChange[t - 1]:StdEndPrice,RevPctChange[t - 2]:StdEndPrice,RevPctChange[t - 3]:StdEndPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 1]:VolumeWeightedEndPrice,RevPctChange[t - 2]:VolumeWeightedEndPrice,RevPctChange[t - 3]:VolumeWeightedEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice,VolumeWeightedPctChange[t - 1]
LogReturn:MeanEndPrice,0.483406,0.526596,0.578146,0.484382,0.481897,0.099758,0.096859,0.346751,0.32796,0.386403,...,0.007901,0.001222,-0.03723,0.009178,-0.017552,0.261486,0.165445,0.114805,0.085632,0.447735


In [27]:
single_stock = prepare_single_stock('SIE', '30Min')
corr_log_return(single_stock, 'LastEndPrice')

Unnamed: 0,Direction1[t - 1],Direction2[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],F6[t - 1],F7[t - 1],F8[t - 1],...,RevPctChange[t - 4]:Direction2,RevPctChange[t - 1]:StdEndPrice,RevPctChange[t - 2]:StdEndPrice,RevPctChange[t - 3]:StdEndPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 1]:VolumeWeightedEndPrice,RevPctChange[t - 2]:VolumeWeightedEndPrice,RevPctChange[t - 3]:VolumeWeightedEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice,VolumeWeightedPctChange[t - 1]
LogReturn:LastEndPrice,-0.014406,-0.0312,-0.028672,-0.01451,-0.014227,0.019149,-0.018149,-0.020512,-0.030164,-0.013356,...,-0.025579,0.004343,-0.036127,0.031563,-0.015164,0.01188,-0.005545,-0.032484,-0.018885,-0.009704


When we examine correlations of LogReturn, we find that beyond `(t - 1)` there are no correlations.
Google Cloud reports the same. As explained above, the reason is that each period is normalized
differently.

In [28]:
single_stock = prepare_single_stock('SIE', '30Min')
mean_end_price = single_stock['LogReturn:MeanEndPrice']
pd.DataFrame({
    'LogReturn:MeanEndPrice': mean_end_price,     
    'LogReturn:MeanEndPrice[t - 1]': single_stock['LogReturn:MeanEndPrice[t - 1]'],
    'LogReturn:MeanEndPrice[t - 2]': single_stock['LogReturn:MeanEndPrice[t - 1]'].shift(1),
    'LogReturn:MeanEndPrice[t - 3]': single_stock['LogReturn:MeanEndPrice[t - 1]'].shift(2),
    'LogReturn:MeanEndPrice[t - 4]': single_stock['LogReturn:MeanEndPrice[t - 1]'].shift(3)  
}).corr()[['LogReturn:MeanEndPrice']]

Unnamed: 0,LogReturn:MeanEndPrice
LogReturn:MeanEndPrice,1.0
LogReturn:MeanEndPrice[t - 1],0.235328
LogReturn:MeanEndPrice[t - 2],-0.006197
LogReturn:MeanEndPrice[t - 3],-0.03237
LogReturn:MeanEndPrice[t - 4],-0.021452


## On the relationship between percent change and log return

Some people prefer to model percent change, while others prefer to model log return.
We find that the two are suspiciously well correlated. It turns out that an approximate
mathematical equality relates them:

```
log(a/b) = log[ (b + (a - b))/b ] = log(1 + pct_change) ≈ pct_change,
where pct_change = (a - b)/b

log(1 + x) ≈ x is an approximate equality when x is close to zero
```
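
A quick numerical check shows how tight the approximation is for typical intraday magnitudes. This is a sketch with hand-picked values of `pct_change`, not data from the notebook. For non-negative `x`, the gap `x - log(1 + x)` is at most `x**2 / 2`.

```python
import numpy as np

pct_change = np.array([0.0001, 0.001, 0.01, 0.05, 0.10])
log_ret = np.log1p(pct_change)       # log(1 + pct_change)
abs_error = pct_change - log_ret     # non-negative for pct_change > -1

for x, l, e in zip(pct_change, log_ret, abs_error):
    print(f"pct_change={x:.4f}  log_return={l:.6f}  error={e:.2e}")
```

For moves well below one percent the two quantities are nearly indistinguishable, which explains the strong correlation between them.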

# The variance of predicted features

Below we study the variance of a number of functions:

```
func = (price - anchor)/anchor
```
for different choices of price and anchor.

```
end to end: (end_price[t] - end_price[t - 1])/end_price[t - 1]
end to mean: (end_price[t] - mean_price[t - 1])/mean_price[t - 1]
mean to end: (mean_price[t] - end_price[t - 1])/end_price[t - 1]
mean to mean: (mean_price[t] - mean_price[t - 1])/mean_price[t - 1]
mean to prev mean: (mean_price[t] - mean_price[t - 2])/mean_price[t - 2]
```

The idea is to estimate the mean price in minutes 15 to 20 as follows:
- estimate the mean price in minutes 10 to 15, call it `M[1]`
- estimate the mean price in minutes 10 to 20, call it `M[2]`
- combine the two: for plain means over equally many observations, the mean over minutes 15 to 20 is `2*M[2] - M[1]`
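
The arithmetic behind combining the two estimates can be checked on toy numbers. The sketch below assumes that `M[1]` and `M[2]` are plain averages and that both five-minute halves contain equally many observations; under that assumption the mean over minutes 15 to 20 is `2*M[2] - M[1]` rather than a bare difference. The prices are synthetic.

```python
import numpy as np

# Hypothetical one-minute prices for minutes 10..19 (synthetic values).
prices = np.array([50.0, 50.2, 50.1, 50.4, 50.3,   # minutes 10 to 15
                   50.6, 50.5, 50.8, 50.7, 50.9])  # minutes 15 to 20

m1 = prices[:5].mean()       # M[1]: mean over minutes 10 to 15
m2 = prices.mean()           # M[2]: mean over minutes 10 to 20
recovered = 2.0 * m2 - m1    # implied mean over minutes 15 to 20
direct = prices[5:].mean()   # the same quantity computed directly

print(recovered, direct)
```

If the two halves hold different numbers of observations, the weights change accordingly and the factor 2 no longer applies.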

## Some observations

- As the anchor moves further back in time from the predicted price, the variance increases (compare `mean to mean` with `mean to prev mean` in the table below).
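
The same effect can be reproduced on a synthetic series. The sketch below assumes a price path built from small independent returns, which is a modeling assumption and not the notebook's data. For such a path the standard deviation of the anchored change grows roughly with the square root of the anchor distance.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic price path: cumulative product of small i.i.d. returns (an assumption, not market data).
price = pd.Series(100.0 * np.cumprod(1.0 + rng.normal(0.0, 0.001, size=20_000)))

stds = {}
for k in (1, 2, 4, 8):
    anchor = price.shift(k)                        # anchor k steps before the predicted price
    stds[k] = ((price - anchor) / anchor).std()
    print(f"anchor {k} steps back: std = {stds[k]:.6f}")
```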


In [29]:
intervals = ['1Min', '2Min', '3Min', '4Min', '5Min', '6Min', '7Min', '8Min', '9Min', '10Min', 
             '15Min', '30Min', '60Min', '120Min', '180Min', '240Min',
             '1D', '2D', '1W', '2W', '1M', '2M', '3M', '6M']
rows = []
for i in intervals:
    single_stock = prepare_single_stock('SIE', i)
    e = single_stock['LastEndPrice']
    n = single_stock['LastEndPrice'].shift(1)
    end_to_end = (e - n)/n
    
    m = single_stock['MeanAvg4Price']
    end_to_mean = (e - m.shift(1))/m.shift(1)  
    mean_to_mean = (m - m.shift(1))/m.shift(1)
    mean_to_prev_mean = (m - m.shift(2))/m.shift(2) 
    # NOTE: this is anchored at the same-period end price e[t], not at end_price[t - 1].
    mean_to_end = (m - e)/e
    rows.append((i, end_to_end.std(), end_to_mean.std(), mean_to_mean.std(), mean_to_prev_mean.std(), mean_to_end.std()))


In [30]:
pd.DataFrame(rows, columns = [
    "interval", "end_to_end", "end to mean", "mean to mean", "mean to prev mean", "mean to end"
])

Unnamed: 0,interval,end_to_end,end to mean,mean to mean,mean to prev mean,mean to end
0,1Min,0.000632,0.000653,0.000553,0.000813,0.000217
1,2Min,0.000842,0.000889,0.000754,0.001113,0.000335
2,3Min,0.001027,0.001115,0.000872,0.00132,0.000492
3,4Min,0.001163,0.001258,0.000994,0.001508,0.000522
4,5Min,0.001312,0.001403,0.001117,0.001685,0.00057
5,6Min,0.001412,0.001515,0.001225,0.001845,0.000608
6,7Min,0.001523,0.001634,0.001328,0.001993,0.000658
7,8Min,0.001619,0.001738,0.001406,0.002123,0.000683
8,9Min,0.001708,0.001831,0.001481,0.002241,0.000727
9,10Min,0.001803,0.001931,0.00159,0.002389,0.000743


## Predicting 60 min ahead using 10, 15, 20 and 30 minute windows

In [31]:
def resample_single_stock(single_stock, interval):
    """Aggregate the rows of one stock into bars of length `interval`."""
    return pd.DataFrame({
        'MaxPrice': single_stock['MaxPrice'].resample(interval).max(),
        'MinPrice': single_stock['MinPrice'].resample(interval).min(),
        'LastEndPrice': single_stock['EndPrice'].resample(interval).last(),
        'FirstStartPrice': single_stock['StartPrice'].resample(interval).first(),         
        'MeanEndPrice': single_stock['EndPrice'].resample(interval).mean(),        
        'HasTrade': single_stock['HasTrade'].resample(interval).max(),
    })

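def prepare_single_stock_multi_intervals(mnemonic, predicted_price, main_interval, intervals):
    """Build features from several sub-intervals, aligned to `main_interval`, for one stock.

    The target 'AdjustedPctChange[t + 1]' is the next period's `predicted_price`
    relative to the current period's mean end price (the shared anchor).
    """
    single_stock = df[df.Mnemonic == mnemonic].copy()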
        
    main = resample_single_stock(single_stock, main_interval)
    # we use the same anchor
    anchor = main['MeanEndPrice']
    future_mean_price = main[predicted_price].shift(-1)
    main['AdjustedPctChange[t + 1]'] = (future_mean_price - anchor)/anchor
    
    all_intervals = [main_interval] + intervals
    
    for interval in all_intervals:
        sub = resample_single_stock(single_stock, interval)
        resampled = sub.resample(main_interval).last() 

        main['Direction@' + interval] = \
            2.0*(resampled['LastEndPrice'] - resampled['FirstStartPrice'])/ \
            anchor

        main['H1@' + interval] = - closer_to_with_normalization(
                                                 resampled['LastEndPrice'], 
                                                 resampled['MaxPrice'], 
                                                 resampled['MinPrice'],
                                                 anchor)    
        
        main['EndToMean@' + interval] = (resampled['LastEndPrice'] - resampled['MeanEndPrice'])/anchor
        
    main = main[main['HasTrade'] == 1.0]
    main = main.drop(columns = [
        'MaxPrice',
        'MinPrice',
        'LastEndPrice',
        'FirstStartPrice',         
        'MeanEndPrice',     
        'HasTrade'       
    ])
    return main
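
The double resampling in the function above (`sub.resample(main_interval).last()`) keeps, for each main period, only the most recent sub-interval bar. A minimal sketch on synthetic one-minute data shows the alignment; the series and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic one-minute end prices for two hours (not the notebook's data).
index = pd.date_range("2020-01-22 09:00", periods=120, freq="min")
end_price = pd.Series(np.arange(120, dtype=float), index=index)

sub = end_price.resample("30min").mean()   # one row per 30-minute bucket
aligned = sub.resample("60min").last()     # keep the last 30-minute bucket of each hour

print(sub)
print(aligned)   # values 44.5 and 104.5: the means of minutes 30..59 and 90..119
```

Each hourly row therefore carries the feature of its final 30 minutes, so shorter intervals describe what happened closest to the end of the main period.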

In [32]:
main_interval = '60Min'
intervals = ['2Min', '5Min', '10Min', '15Min', '20Min', '30Min']

single_stock = prepare_single_stock_multi_intervals('SIE', 'MeanEndPrice', main_interval, intervals)

k = 'AdjustedPctChange[t + 1]'
single_stock.corr()[[k]].sort_values(k, ascending=False)

Unnamed: 0,AdjustedPctChange[t + 1]
AdjustedPctChange[t + 1],1.0
EndToMean@60Min,0.69429
Direction@30Min,0.633703
H1@60Min,0.61418
H1@30Min,0.588048
EndToMean@30Min,0.582287
Direction@20Min,0.551924
Direction@60Min,0.519582
Direction@15Min,0.493588
EndToMean@20Min,0.48665


In [33]:
main_interval = '60Min'
intervals = ['2Min', '5Min', '10Min', '15Min', '20Min', '30Min']

single_stock = prepare_single_stock_multi_intervals('SIE', 'LastEndPrice', main_interval, intervals)

k = 'AdjustedPctChange[t + 1]'
single_stock.corr()[[k]].sort_values(k, ascending=False)

Unnamed: 0,AdjustedPctChange[t + 1]
AdjustedPctChange[t + 1],1.0
EndToMean@60Min,0.504806
Direction@30Min,0.468883
H1@60Min,0.449401
EndToMean@30Min,0.430578
H1@30Min,0.429825
Direction@20Min,0.41092
Direction@60Min,0.388155
Direction@15Min,0.35981
EndToMean@20Min,0.355766


In [34]:
!echo "Last run on `date`"

Last run on Wed Jan 22 18:30:31 UTC 2020
