Choosing how to model stock behavior is not straightforward.
Do we predict prices 1 minute ahead, 10 minutes ahead, or daily?

Or should we predict all of them?

What kind of price do we want to predict within an interval:

- EndPrice
- Mean(EndPrice)
- WeightedMean(EndPrice)
- MedianPrice
- or the complete distribution of prices
- if the complete distribution is to be predicted, do we predict the mean, variance, and skew? Do we discretize the distribution, or do we treat the available prices as samples from the distribution that we predict?
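
To make the candidate targets concrete, here is a minimal sketch on made-up numbers (the four prices and volumes are illustrative and not taken from the dataset). It shows how the end price, the mean, the volume-weighted mean, and the median of one interval can differ.

```python
import statistics

# Hypothetical one-minute end prices and traded volumes inside a single interval.
end_prices = [10.0, 11.0, 12.0, 13.0]
volumes = [1, 1, 1, 5]

end_price = end_prices[-1]                    # EndPrice: last price in the interval
mean_price = statistics.mean(end_prices)      # Mean(EndPrice)
weighted_mean_price = (                       # WeightedMean(EndPrice), weighted by volume
    sum(p * v for p, v in zip(end_prices, volumes)) / sum(volumes)
)
median_price = statistics.median(end_prices)  # MedianPrice

print(end_price, mean_price, weighted_mean_price, median_price)
# prints: 13.0 11.5 12.25 11.5
```

Because most of the volume trades at the last price in this toy interval, the volume-weighted mean sits closer to the end price than the plain mean does.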

Another decision is whether to predict:
- The percent change of the price
- The sign of the price movement
- The log of the actual price
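
The three options can be computed side by side. The sketch below uses a short hypothetical price series (not data from this notebook) and standard pandas and NumPy calls.

```python
import numpy as np
import pandas as pd

# Hypothetical price series; any positive prices would do.
price = pd.Series([100.0, 101.0, 99.0, 99.0, 102.0])

pct_change = price.pct_change()   # (p[t] - p[t - 1]) / p[t - 1]
sign = np.sign(price.diff())      # -1, 0 or +1 for down, flat, up
log_price = np.log(price)         # log of the actual price

targets = pd.DataFrame({
    'PctChange': pct_change,
    'Sign': sign,
    'LogPrice': log_price,
})
print(targets)
```

The first row of `PctChange` and `Sign` is `NaN` because there is no earlier price to compare against.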

From a practical perspective, even with perfect predictions of future prices, one may not be able to make any money. This depends on:
- fees: the fee to enter and exit a position should be lower than the expected change in the price
- slippage: given a quoted price, how far above it (when buying) or below it (when selling) the order actually executes
- volumes in the market: trading very large volumes disrupts the supply and demand mechanism, so the order will not execute at the desired price
- how the broker executes trades
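
As a rough illustration of the first two points, the sketch below subtracts round-trip costs from a correctly predicted move. The function `net_return` and all numbers are hypothetical, and the model is deliberately simplified: every quantity is a fraction of the price, and the same fee and slippage apply on entry and exit.

```python
def net_return(expected_change, fee_per_side, slippage_per_side):
    """Fractional return left after paying fees and slippage on entry and exit."""
    round_trip_cost = 2.0 * (fee_per_side + slippage_per_side)
    return expected_change - round_trip_cost

# Hypothetical numbers: 0.05% fee and 0.03% slippage on each side of the trade.
print(net_return(0.0020, 0.0005, 0.0003))  # a 0.20% move leaves a small profit
print(net_return(0.0010, 0.0005, 0.0003))  # a 0.10% move is eaten by the costs
```

Under these assumptions a prediction is only worth trading on when the predicted move exceeds the round-trip cost of 0.16%.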

In this notebook we try to understand what is easier and what is harder to predict.
For example, we find that when we resample the data into intervals of, say, 10, 15, or 30 minutes,
the mean price (`Mean(EndPrice)`) is much easier to predict than the end price (`EndPrice`).

We also find that when the target is `PctChange:X` of a feature `X`, we should use features that are normalized in the same way, that is, divided by `X[t - 1]`.

One way to avoid this difficulty would be the following:
- generate different linear combinations of prices (averages are also linear combinations, and so are absolute returns)
- compute the logs of all types of prices (or averages of prices)
- we may also attempt to predict linear combinations of prices by taking logs
- notice that this is essentially a non-linear model, because we apply logs to sums of raw input features
- when using logs, care must be taken to avoid numbers close to zero as well as negative numbers
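
The last point can be handled by masking problematic values before taking the log. The helper `safe_log` and its threshold `eps` below are illustrative assumptions, not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd

def safe_log(values, eps=1e-8):
    """Log that yields NaN for values at or below eps (zero, negative, or tiny).

    `eps` is an illustrative threshold, not a value used elsewhere in this notebook.
    """
    values = pd.Series(values, dtype=float)
    return np.log(values.where(values > eps))

# Hypothetical resampled prices.
max_price = pd.Series([10.2, 10.5, 10.4])
min_price = pd.Series([10.0, 10.1, 10.3])

mid_price = 0.5 * (max_price + min_price)  # linear combination, positive here
price_range = max_price - min_price        # linear combination, can be zero or tiny

print(safe_log(mid_price))
print(safe_log(price_range))
```

Returning `NaN` keeps the bad rows visible, so they can be dropped or imputed explicitly instead of silently producing `-inf`.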

We also explore the possibility of predicting the entire distribution of future prices, instead of just a single end price or averaged prices.
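
As a sketch of what a distributional target could look like, the block below summarizes each interval either by a few moments or by a few quantiles. It runs on a synthetic random walk, and the seed, step size, and the 30-minute interval are arbitrary choices made only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic one-minute end prices (a random walk), used only for illustration.
rng = np.random.default_rng(0)
index = pd.date_range('2020-01-02 09:00', periods=120, freq='min')
end_price = pd.Series(100.0 + rng.normal(0.0, 0.05, size=120).cumsum(), index=index)

# Group the one-minute prices into 30-minute intervals.
grouped = end_price.groupby(end_price.index.floor('30min'))

# Option 1: describe each interval by its mean, variance, and skew.
moments = grouped.agg(['mean', 'var', 'skew'])

# Option 2: discretize each interval's distribution into a few quantiles.
quantiles = grouped.quantile([0.1, 0.5, 0.9]).unstack()

print(moments)
print(quantiles)
```

Either table could serve as a multi-column prediction target in place of a single price per interval.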


In [1]:
import pandas as pd
import numpy as np

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
mpl.rcParams['figure.figsize'] = (5, 3) # use bigger graphs

As usual we first load the data we prepared in notebook 2

In [2]:
input_file = '/data/cooked_v3.pkl'
df = pd.read_pickle(input_file)
df['CalcDateTime'] = df.index

Next we prepare a dataset consisting of a single stock and compute derived features
(percent change) for a number of price features.

In [3]:
price_features = ['MaxPrice', 'MinPrice', 'LastEndPrice', 'FirstStartPrice', 'MeanEndPrice', 'MedianEndPrice', 
                  'MeanStartEndPrice', 'MeanMaxMinPrice', 'MeanAvg4Price',
                  'Direction', 'StdEndPrice', 'VolumeWeightedEndPrice']

log_ret_features = ['MaxPrice', 'MinPrice', 'LastEndPrice', 'FirstStartPrice', 'MeanEndPrice', 'MedianEndPrice', 
                  'MeanStartEndPrice', 'MeanMaxMinPrice', 'MeanAvg4Price']

indicator_features = [
    'F1', 'F2', 'F3', 'F4', 'F5', 'F6', 'F7', 
    'G1', 'G2', 'G3', 'G4'
]

def pct_change_of(feature):
    return 'PctChange:' + feature

def log_return_f(feature):
    return 'LogReturn:' + feature

def rev_pct_change_of_at_t(feature, t):
    return 'RevPctChange[t - {}]:{}'.format(str(t), feature)

def shifted(feature):
    return feature + '[t - 1]'

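def closer_to(pnt, a, b):
    # Signed measure of whether pnt is closer to a (negative) or to b (positive),
    # relative to pnt itself (pnt is assumed to be a positive price).
    return (np.absolute(pnt - a) - np.absolute(pnt - b))/pnt

def closer_to_with_normalization(pnt, a, b, norm):
    # Same as closer_to, but divided by an explicitly supplied normalizer.
    return (np.absolute(pnt - a) - np.absolute(pnt - b))/norm

def closer_to_or(pnt1, a, pnt2, b):
    # Compares the distance |pnt1 - a| with |pnt2 - b|, relative to the mean of pnt1 and pnt2.
    return 2.0*(np.absolute(pnt1 - a) - np.absolute(pnt2 - b))/(pnt1 + pnt2)

def rev_pct_change(a, t):
    # Change between the previous period and t periods before it, divided by the
    # previous period's value, i.e. the same denominator that pct_change() uses.
    one_step_in_past = a.shift(1)
    t_steps_in_past = a.shift(1 + t)
    return (one_step_in_past - t_steps_in_past)/one_step_in_past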
    
def log_return(a):
    return np.log(a) - np.log(a.shift(1))

def norm_feature(feature_family, t, norm_feature):
    return "{}@WithNorm({})[t - {}]".format(feature_family, norm_feature, t)    
    
def weighted_mean(prices, volumes, interval):
    prices_times_volumes = prices * volumes
    num = prices_times_volumes.resample(interval).sum()
    denom = volumes.resample(interval).sum()
    return num/denom
    
def prepare_single_stock(mnemonic, interval):
    # TODO: add traded volume to averaging of prices, also traded volume weighted differently (e.g. exponential weighting)
    # TODO: one can weight max/min/start/end price differently depending on the move
    # TODO: add exponential weighting of the prices within a window (also there are different ways to center)
    single_stock = df[df.Mnemonic == mnemonic].copy()
    single_stock['StartEndPrice'] = 0.5*(single_stock['StartPrice'] + single_stock['EndPrice'])
    single_stock['MaxMinPrice'] = 0.5*(single_stock['MaxPrice'] + single_stock['MinPrice'])
    single_stock['Avg4Price'] = 0.25*(single_stock['MaxPrice'] + single_stock['MinPrice'] + 
                                      single_stock['StartPrice'] + single_stock['EndPrice'])
    
    
    
    single_stock['Direction'] = \
        2.0*(single_stock['EndPrice'] - single_stock['StartPrice'])/ \
        (single_stock['EndPrice'] + single_stock['StartPrice'])
        
    single_stock['F1'] = - closer_to(single_stock['EndPrice'], single_stock['MaxPrice'], single_stock['MinPrice'])
        
        
    resampled = pd.DataFrame({
        'MaxPrice': single_stock['MaxPrice'].resample(interval).max(),
        'MinPrice': single_stock['MinPrice'].resample(interval).min(),
        'MeanStartEndPrice': single_stock['StartEndPrice'].resample(interval).mean(),  
        'MeanMaxMinPrice': single_stock['MaxMinPrice'].resample(interval).mean(), 
        'MeanAvg4Price': single_stock['Avg4Price'].resample(interval).mean(),          
        'FirstStartPrice': single_stock['StartPrice'].resample(interval).first(),        
        'LastEndPrice': single_stock['EndPrice'].resample(interval).last(), 
        'MeanEndPrice': single_stock['EndPrice'].resample(interval).mean(), 
        'MedianEndPrice': single_stock['EndPrice'].resample(interval).median(),
        'VolumeWeightedEndPrice': weighted_mean(single_stock['EndPrice'], single_stock['TradedVolume'], interval),
        'StdEndPrice': single_stock['EndPrice'].resample(interval).std(),
        'HasTrade': single_stock['HasTrade'].resample(interval).max(),
        'G1': single_stock['Direction'].resample(interval).mean(),
        'G2': np.sign(single_stock['Direction']).resample(interval).mean(),
        'G3': single_stock['F1'].resample(interval).mean(),
        'G4': np.sign(single_stock['F1']).resample(interval).mean()        
    })
    resampled['Direction'] = \
        2.0*(resampled['LastEndPrice'] - resampled['FirstStartPrice'])/ \
        (resampled['LastEndPrice'] + resampled['FirstStartPrice'])
    
    resampled[shifted('Direction')] = resampled['Direction'].shift(1)
    
    for f in ['MinPrice', 'MaxPrice', 'LastEndPrice', 'FirstStartPrice']:
        resampled[shifted(f)] = resampled[f].shift(1)
        
    resampled['F1'] = - closer_to(resampled['LastEndPrice'], resampled['MaxPrice'], resampled['MinPrice'])
    resampled['F2'] = - closer_to(resampled['MaxPrice'], resampled['LastEndPrice'], resampled['FirstStartPrice'])
    resampled['F3'] = closer_to(resampled['MinPrice'], resampled['LastEndPrice'], resampled['FirstStartPrice'])
    resampled['F4'] = - closer_to(resampled['LastEndPrice'], resampled['MaxPrice'], resampled[shifted('MaxPrice')])
    resampled['F5'] = closer_to(resampled['LastEndPrice'], resampled['MinPrice'], resampled[shifted('MinPrice')])
    
    resampled['F6'] = - closer_to_or(resampled['LastEndPrice'], resampled['MaxPrice'],
                                   resampled[shifted('LastEndPrice')], resampled[shifted('MaxPrice')])

    resampled['F7'] = closer_to_or(resampled['LastEndPrice'], resampled['MinPrice'],
                                   resampled[shifted('LastEndPrice')], resampled[shifted('MinPrice')])
    
    
    for t in range(1, 5):    
        # note: normalization is fixed
        resampled[norm_feature('H1', t, 'MeanEndPrice')] = - closer_to_with_normalization(
                                             resampled['LastEndPrice'].shift(t), 
                                             resampled['MaxPrice'].shift(t), 
                                             resampled['MinPrice'].shift(t),
                                             resampled['MeanEndPrice'].shift(1))    
        
    
    for f in indicator_features:
        resampled[shifted(f)] = resampled[f].shift(1)
        
    for f in price_features:
        pct_change_f = pct_change_of(f)
        resampled[pct_change_f] = resampled[f].pct_change()
        resampled[shifted(pct_change_f)] = resampled[pct_change_f].shift(1) 
    
    for f in price_features:
        for t in range(1, 5):
            rev_pct_change_f_t = rev_pct_change_of_at_t(f, t)
            resampled[rev_pct_change_f_t] = rev_pct_change(resampled[f], t)    
        

    for f in log_ret_features:
        log_ret_f = log_return_f(f)
        resampled[log_ret_f] = log_return(resampled[f])
        resampled[shifted(log_ret_f)] = resampled[log_ret_f].shift(1) 
        
    resampled = resampled[resampled['HasTrade'] == 1.0]
    
    return resampled

def correlation_with_feature(single_stock, corr_feature):
    pct_change_of_f = corr_feature #TODO: pct_change_of(feature)
    sh_dir = shifted('Direction')
    d = {
        pct_change_of_f: single_stock[pct_change_of_f],
        sh_dir: single_stock[sh_dir]
    }
    
    for f in indicator_features:
        d[shifted(f)] = single_stock[shifted(f)]
        
    for f in price_features:
        pct_change_shifted = shifted(pct_change_of(f))
        d[pct_change_shifted] = single_stock[pct_change_shifted]
        
    for t in range(1, 5):    
        # note: normalization is fixed
        f = norm_feature('H1', t, 'MeanEndPrice')
        d[f] = single_stock[f]
            
    for f in log_ret_features:
        log_ret_f = log_return_f(f)
        d[log_ret_f] = single_stock[log_ret_f]
            
    for f in price_features:
        for t in range(1, 5):
            rev_pct_change_f_t = rev_pct_change_of_at_t(f, t)
            d[rev_pct_change_f_t] = single_stock[rev_pct_change_f_t]
            
            
    corr = pd.DataFrame(d).corr()
    # Keep only the row of the target feature and drop its self-correlation.
    return corr.loc[[pct_change_of_f]].drop(columns=[pct_change_of_f])

def corr_pct_change(single_stock, feature):
    pct_change_of_f = pct_change_of(feature)
    return correlation_with_feature(single_stock, pct_change_of_f)

def corr_log_return(single_stock, feature):
    return correlation_with_feature(single_stock, log_return_f(feature))

We choose a single stock, for example 'SIE' (Siemens)

In [4]:
single_stock = prepare_single_stock('SIE', '30Min')

We look at the correlations with all the other price features from the previous time period `(t - 1)`.
In other words, we investigate whether a single feature from the previous time period `(t - 1)`
is predictive of the change in the next time period `(t)`.

We first study `LastEndPrice`, the last end price in the interval `t`.
As can be seen, there are no strong correlations.

In [5]:
corr_pct_change(single_stock, 'LastEndPrice')

Unnamed: 0,Direction[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],F6[t - 1],F7[t - 1],G1[t - 1],G2[t - 1],...,RevPctChange[t - 4]:LastEndPrice,RevPctChange[t - 4]:MaxPrice,RevPctChange[t - 4]:MeanAvg4Price,RevPctChange[t - 4]:MeanEndPrice,RevPctChange[t - 4]:MeanMaxMinPrice,RevPctChange[t - 4]:MeanStartEndPrice,RevPctChange[t - 4]:MedianEndPrice,RevPctChange[t - 4]:MinPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice
PctChange:LastEndPrice,-0.011198,-0.027385,-0.011297,-0.011028,0.02114,-0.027374,-0.018079,-0.034673,-0.028488,-0.026029,...,-0.03019,-0.038945,-0.021345,-0.021353,-0.021357,-0.021333,-0.015984,-0.011112,-0.021343,-0.026869


Next we look for correlations with the mean prices:
- MeanAvg4Price: we average all 4 prices available within a minute and then average within an interval such as 10Min
- MeanMaxMinPrice: we take the average of the Min and Max prices and then average within an interval
- MeanStartEndPrice: we take the average of the Start and End prices and then average within an interval

# Conclusion:

- The more averaging was done, the easier to predict
- EndPrice is very difficult to predict, but the mean price is not 
- The median of the end price is harder to predict than the mean, but a lot easier than the end price



In [6]:
corr_pct_change(single_stock, 'MeanAvg4Price')

Unnamed: 0,Direction[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],F6[t - 1],F7[t - 1],G1[t - 1],G2[t - 1],...,RevPctChange[t - 4]:LastEndPrice,RevPctChange[t - 4]:MaxPrice,RevPctChange[t - 4]:MeanAvg4Price,RevPctChange[t - 4]:MeanEndPrice,RevPctChange[t - 4]:MeanMaxMinPrice,RevPctChange[t - 4]:MeanStartEndPrice,RevPctChange[t - 4]:MedianEndPrice,RevPctChange[t - 4]:MinPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice
PctChange:MeanAvg4Price,0.499324,0.592774,0.500389,0.497669,0.091224,0.101009,0.357508,0.331484,0.427626,0.338498,...,0.249119,0.071413,0.069717,0.07359,0.069911,0.069521,0.068348,0.091924,-0.029574,0.088075


In [7]:
corr_pct_change(single_stock, 'MeanMaxMinPrice')

Unnamed: 0,Direction[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],F6[t - 1],F7[t - 1],G1[t - 1],G2[t - 1],...,RevPctChange[t - 4]:LastEndPrice,RevPctChange[t - 4]:MaxPrice,RevPctChange[t - 4]:MeanAvg4Price,RevPctChange[t - 4]:MeanEndPrice,RevPctChange[t - 4]:MeanMaxMinPrice,RevPctChange[t - 4]:MeanStartEndPrice,RevPctChange[t - 4]:MedianEndPrice,RevPctChange[t - 4]:MinPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice
PctChange:MeanMaxMinPrice,0.498502,0.592666,0.499563,0.496845,0.0917,0.101526,0.357278,0.331763,0.427373,0.338062,...,0.248583,0.071013,0.069293,0.073177,0.069474,0.069111,0.067948,0.091528,-0.029617,0.087665


In [8]:
corr_pct_change(single_stock, 'MeanStartEndPrice')

Unnamed: 0,Direction[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],F6[t - 1],F7[t - 1],G1[t - 1],G2[t - 1],...,RevPctChange[t - 4]:LastEndPrice,RevPctChange[t - 4]:MaxPrice,RevPctChange[t - 4]:MeanAvg4Price,RevPctChange[t - 4]:MeanEndPrice,RevPctChange[t - 4]:MeanMaxMinPrice,RevPctChange[t - 4]:MeanStartEndPrice,RevPctChange[t - 4]:MedianEndPrice,RevPctChange[t - 4]:MinPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice
PctChange:MeanStartEndPrice,0.500111,0.592839,0.501179,0.498458,0.090742,0.100484,0.357715,0.331181,0.427849,0.33891,...,0.24964,0.071808,0.070136,0.073999,0.070345,0.069926,0.068744,0.092315,-0.029529,0.08848


In [9]:
corr_pct_change(single_stock, 'MeanEndPrice')

Unnamed: 0,Direction[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],F6[t - 1],F7[t - 1],G1[t - 1],G2[t - 1],...,RevPctChange[t - 4]:LastEndPrice,RevPctChange[t - 4]:MaxPrice,RevPctChange[t - 4]:MeanAvg4Price,RevPctChange[t - 4]:MeanEndPrice,RevPctChange[t - 4]:MeanMaxMinPrice,RevPctChange[t - 4]:MeanStartEndPrice,RevPctChange[t - 4]:MedianEndPrice,RevPctChange[t - 4]:MinPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice
PctChange:MeanEndPrice,0.484106,0.57733,0.485131,0.482517,0.091301,0.099043,0.347314,0.322265,0.405209,0.320019,...,0.240225,0.066974,0.065182,0.068844,0.065384,0.06498,0.063751,0.087711,-0.029297,0.082722


In [10]:
corr_pct_change(single_stock, 'MedianEndPrice')

Unnamed: 0,Direction[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],F6[t - 1],F7[t - 1],G1[t - 1],G2[t - 1],...,RevPctChange[t - 4]:LastEndPrice,RevPctChange[t - 4]:MaxPrice,RevPctChange[t - 4]:MeanAvg4Price,RevPctChange[t - 4]:MeanEndPrice,RevPctChange[t - 4]:MeanMaxMinPrice,RevPctChange[t - 4]:MeanStartEndPrice,RevPctChange[t - 4]:MedianEndPrice,RevPctChange[t - 4]:MinPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice
PctChange:MedianEndPrice,0.433678,0.53076,0.434843,0.431914,0.094451,0.102715,0.336348,0.30959,0.381251,0.299344,...,0.227286,0.061692,0.054922,0.058357,0.055128,0.054715,0.045778,0.082627,-0.02849,0.073112


In [11]:
corr_pct_change(single_stock, 'VolumeWeightedEndPrice')

Unnamed: 0,Direction[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],F6[t - 1],F7[t - 1],G1[t - 1],G2[t - 1],...,RevPctChange[t - 4]:LastEndPrice,RevPctChange[t - 4]:MaxPrice,RevPctChange[t - 4]:MeanAvg4Price,RevPctChange[t - 4]:MeanEndPrice,RevPctChange[t - 4]:MeanMaxMinPrice,RevPctChange[t - 4]:MeanStartEndPrice,RevPctChange[t - 4]:MedianEndPrice,RevPctChange[t - 4]:MinPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice
PctChange:VolumeWeightedEndPrice,0.404,0.517642,0.405091,0.40228,0.097143,0.096174,0.31892,0.304942,0.357195,0.293577,...,0.19244,0.04144,0.040911,0.043978,0.041081,0.04074,0.041603,0.055856,-0.022639,0.047989


For the moment, let us settle on `MeanEndPrice` and compute the correlations over multiple intervals.

In [12]:
intervals = ['1Min', '2Min', '3Min', '4Min', '5Min', '6Min', '7Min', '8Min', '9Min', '10Min', 
             '15Min', '30Min', '60Min', '120Min', '180Min', '240Min',
             '1D', '2D', '1W', '2W']
def corr_over_intervals(stock):
    results = []
    for interval in intervals:
        single_stock = prepare_single_stock(stock, interval)
        cr = corr_pct_change(single_stock, 'MeanEndPrice')
        cr.index = [cr.index[0] + "@" + interval]
        results.append(cr)
    return pd.concat(results)

In [13]:
corr_over_intervals('BMW')

Unnamed: 0,Direction[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],F6[t - 1],F7[t - 1],G1[t - 1],G2[t - 1],...,RevPctChange[t - 4]:LastEndPrice,RevPctChange[t - 4]:MaxPrice,RevPctChange[t - 4]:MeanAvg4Price,RevPctChange[t - 4]:MeanEndPrice,RevPctChange[t - 4]:MeanMaxMinPrice,RevPctChange[t - 4]:MeanStartEndPrice,RevPctChange[t - 4]:MedianEndPrice,RevPctChange[t - 4]:MinPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice
PctChange:MeanEndPrice@1Min,-0.043937,-0.039041,-0.043878,-0.043996,0.019617,-0.030026,-0.032589,-0.026503,-0.043937,-0.025848,...,-0.036003,-0.01885,-0.023348,-0.036003,-0.023252,-0.023293,-0.036003,-0.026166,,-0.032512
PctChange:MeanEndPrice@2Min,0.243801,0.289113,0.243899,0.24362,0.100845,0.056373,0.192183,0.207309,0.219414,0.165963,...,0.112833,0.034513,0.018251,0.046166,0.019459,0.017004,0.046166,0.02496,-0.003425,0.053319
PctChange:MeanEndPrice@3Min,0.446939,0.500738,0.446452,0.447419,0.129923,0.098694,0.259828,0.273922,0.332804,0.242359,...,0.15961,0.040017,0.02296,0.049421,0.02395,0.021948,0.036612,0.041267,-0.004517,0.067492
PctChange:MeanEndPrice@4Min,0.46652,0.528768,0.466612,0.466332,0.125949,0.123349,0.315666,0.30989,0.379996,0.286384,...,0.191336,0.055733,0.041659,0.064565,0.042958,0.040337,0.057334,0.06277,-0.00834,0.085934
PctChange:MeanEndPrice@5Min,0.456127,0.532253,0.456543,0.455466,0.120404,0.134864,0.303333,0.326282,0.386485,0.299391,...,0.20397,0.063215,0.054092,0.073949,0.054546,0.053621,0.067687,0.063778,-0.002089,0.084708
PctChange:MeanEndPrice@6Min,0.476878,0.555853,0.47739,0.476017,0.134123,0.134708,0.32047,0.34054,0.420884,0.31355,...,0.221797,0.085873,0.064623,0.081728,0.065151,0.06408,0.073026,0.072951,4e-06,0.093992
PctChange:MeanEndPrice@7Min,0.525034,0.593687,0.525365,0.524448,0.152601,0.154428,0.351836,0.372832,0.430444,0.322283,...,0.25463,0.091615,0.070236,0.08703,0.070592,0.069866,0.076179,0.086931,-0.001277,0.112648
PctChange:MeanEndPrice@8Min,0.515185,0.592355,0.515672,0.514255,0.148737,0.131932,0.370597,0.342079,0.46822,0.334731,...,0.259177,0.093301,0.075384,0.090757,0.07574,0.075018,0.08019,0.092538,-0.001212,0.111636
PctChange:MeanEndPrice@9Min,0.501365,0.582707,0.501964,0.500331,0.148811,0.111296,0.352506,0.335596,0.435434,0.331195,...,0.24795,0.090526,0.07223,0.084838,0.072534,0.071917,0.073479,0.087941,-0.001548,0.102093
PctChange:MeanEndPrice@10Min,0.480809,0.573533,0.481306,0.479811,0.136594,0.108212,0.349747,0.325004,0.429325,0.332825,...,0.235687,0.075253,0.067106,0.078336,0.067409,0.066796,0.071347,0.079706,-0.011731,0.092845


In [14]:
# corr_over_intervals('SIE')

In [15]:
# corr_over_intervals('SAP')

The strong correlations are due to averaging.
To check this, we run an experiment on the correlation between the percent change of `MaxPrice`, which uses only `MaxPrice`,
and `Direction`, which uses only `EndPrice` and `StartPrice`.
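
The effect of averaging can also be reproduced on synthetic data. The sketch below assumes a geometric random walk, which has no predictable structure by construction. The seed, step size, and interval length are arbitrary, and the variable names are ours.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
steps_per_interval = 30   # e.g. 30 one-minute prices per interval
n_intervals = 5000

# Geometric random walk: by construction the next step cannot be predicted.
log_steps = rng.normal(0.0, 1e-4, size=steps_per_interval * n_intervals)
prices = 100.0 * np.exp(np.cumsum(log_steps))
blocks = prices.reshape(n_intervals, steps_per_interval)

first = pd.Series(blocks[:, 0])         # first price of each interval
last = pd.Series(blocks[:, -1])         # last price of each interval
mean = pd.Series(blocks.mean(axis=1))   # mean price of each interval

# Direction of the previous interval, defined as in prepare_single_stock.
direction_prev = (2.0 * (last - first) / (last + first)).shift(1)

corr_mean = mean.pct_change().corr(direction_prev)
corr_last = last.pct_change().corr(direction_prev)
print('PctChange of interval mean vs Direction[t - 1]:', round(corr_mean, 3))
print('PctChange of last price    vs Direction[t - 1]:', round(corr_last, 3))
```

With these settings the first correlation is clearly positive (roughly 0.6), while the second stays near zero. The change between two consecutive interval means overlaps with the price moves inside the previous interval, so part of the correlation seen for averaged prices appears even when the underlying series is unpredictable.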



In [16]:
single_stock = prepare_single_stock('BMW', '1D')
pd.DataFrame({
    'PctChange:MaxPrice': single_stock['PctChange:MaxPrice'],
    'Direction[t + 1]': single_stock['Direction'].shift(-1),   
    'Direction[t - 0]': single_stock['Direction'],        
    'Direction[t - 1]': single_stock['Direction[t - 1]'],
    'Direction[t - 2]': single_stock['Direction[t - 1]'].shift(1),
    'Direction[t - 3]': single_stock['Direction[t - 1]'].shift(2),
    'Direction[t - 5]': single_stock['Direction[t - 1]'].shift(4)
}).corr()[['PctChange:MaxPrice']]

Unnamed: 0,PctChange:MaxPrice
Direction[t + 1],-0.012476
Direction[t - 0],0.462436
Direction[t - 1],0.375247
Direction[t - 2],0.065301
Direction[t - 3],-0.007288
Direction[t - 5],-0.042227
PctChange:MaxPrice,1.0


In [17]:
single_stock = prepare_single_stock('BMW', '1D')
pd.DataFrame({
    'PctChange:LastEndPrice': single_stock['PctChange:LastEndPrice'],
    'Direction[t + 1]': single_stock['Direction'].shift(-1),   
    'Direction[t - 0]': single_stock['Direction'],        
    'Direction[t - 1]': single_stock['Direction[t - 1]'],
    'Direction[t - 2]': single_stock['Direction[t - 1]'].shift(1),
    'Direction[t - 3]': single_stock['Direction[t - 1]'].shift(2),
    'Direction[t - 5]': single_stock['Direction[t - 1]'].shift(4)
}).corr()[['PctChange:LastEndPrice']]

Unnamed: 0,PctChange:LastEndPrice
Direction[t + 1],0.027184
Direction[t - 0],0.732582
Direction[t - 1],0.040058
Direction[t - 2],0.119437
Direction[t - 3],-0.032281
Direction[t - 5],-0.037653
PctChange:LastEndPrice,1.0


The lack of correlation between periods `(t)` and `(t - 2)` is related to the normalization used in the percent change. We verify this independently next.

In [18]:
single_stock = prepare_single_stock('BMW', '1D')
last_end_price = single_stock['PctChange:LastEndPrice']
pd.DataFrame({
    'PctChange:LastEndPrice': last_end_price,       
    'PctChange:LastEndPrice[t - 1]': last_end_price.shift(1),
    'PctChange:LastEndPrice[t - 2]': last_end_price.shift(2),
    'PctChange:LastEndPrice[t - 3]': last_end_price.shift(3),
    'PctChange:LastEndPrice[t - 4]': last_end_price.shift(4)    
}).corr()[['PctChange:LastEndPrice']]

Unnamed: 0,PctChange:LastEndPrice
PctChange:LastEndPrice,1.0
PctChange:LastEndPrice[t - 1],0.133811
PctChange:LastEndPrice[t - 2],0.067861
PctChange:LastEndPrice[t - 3],-0.039186
PctChange:LastEndPrice[t - 4],-0.015114


Once we normalize the features in the same way, we find that past periods also exhibit
correlations with the present.
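
Written out from the helper functions defined earlier (`pct_change`, `shift`, and `rev_pct_change`), the quantities being compared are:

```
PctChange:X[t]         = (X[t] - X[t - 1]) / X[t - 1]
PctChange:X[t - k]     = (X[t - k] - X[t - k - 1]) / X[t - k - 1]
RevPctChange[t - k]:X  = (X[t - 1] - X[t - 1 - k]) / X[t - 1]
```

The target and every `RevPctChange` feature share the denominator `X[t - 1]`, whereas a lagged `PctChange` divides by `X[t - k - 1]`, which changes from lag to lag. Note that `RevPctChange[t - k]` also spans `k` periods rather than a single one, so the normalization is not the only difference between the two families of features.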

In [19]:
single_stock = prepare_single_stock('BMW', '1D')
last_end_price = single_stock['PctChange:LastEndPrice']
pd.DataFrame({
    'PctChange:LastEndPrice': last_end_price,       
    rev_pct_change_of_at_t('LastEndPrice', 1): single_stock[rev_pct_change_of_at_t('LastEndPrice', 1)],
    rev_pct_change_of_at_t('LastEndPrice', 2): single_stock[rev_pct_change_of_at_t('LastEndPrice', 2)],
    rev_pct_change_of_at_t('LastEndPrice', 3): single_stock[rev_pct_change_of_at_t('LastEndPrice', 3)]
}).corr()[['PctChange:LastEndPrice']]

Unnamed: 0,PctChange:LastEndPrice
PctChange:LastEndPrice,1.0
RevPctChange[t - 1]:LastEndPrice,0.124178
RevPctChange[t - 2]:LastEndPrice,0.144278
RevPctChange[t - 3]:LastEndPrice,0.123414


In [20]:
single_stock = prepare_single_stock('BMW', '1D')
mean_end_price = single_stock['PctChange:MeanEndPrice']
pd.DataFrame({
    'PctChange:MeanEndPrice': mean_end_price,       
    rev_pct_change_of_at_t('MeanEndPrice', 1): single_stock[rev_pct_change_of_at_t('MeanEndPrice', 1)],
    rev_pct_change_of_at_t('MeanEndPrice', 2): single_stock[rev_pct_change_of_at_t('MeanEndPrice', 2)],
    rev_pct_change_of_at_t('MeanEndPrice', 3): single_stock[rev_pct_change_of_at_t('MeanEndPrice', 3)]
}).corr()[['PctChange:MeanEndPrice']]

Unnamed: 0,PctChange:MeanEndPrice
PctChange:MeanEndPrice,1.0
RevPctChange[t - 1]:MeanEndPrice,0.181585
RevPctChange[t - 2]:MeanEndPrice,0.144456
RevPctChange[t - 3]:MeanEndPrice,0.175877


If we experiment with the same feature normalized in different ways, we find that:
- we do not see correlations beyond `(t - 1)` when a different normalization is used
- we may see correlations with the past when the same normalization is used

In [21]:
single_stock = prepare_single_stock('BMW', '1D')
mean_end_price = single_stock['PctChange:MeanEndPrice']
pd.DataFrame({
    'PctChange:MeanEndPrice': mean_end_price,       
    norm_feature('H1', 1, 'MeanEndPrice'): single_stock[norm_feature('H1', 1, 'MeanEndPrice')],
    norm_feature('H1', 2, 'MeanEndPrice'): single_stock[norm_feature('H1', 2, 'MeanEndPrice')],
    norm_feature('H1', 3, 'MeanEndPrice'): single_stock[norm_feature('H1', 3, 'MeanEndPrice')],
    norm_feature('H1', 4, 'MeanEndPrice'): single_stock[norm_feature('H1', 4, 'MeanEndPrice')],
    'F1[t - 1]': single_stock['F1[t - 1]'],
    'F1[t - 2]': single_stock['F1[t - 1]'].shift(1),
    'F1[t - 3]': single_stock['F1[t - 1]'].shift(2),
    'F1[t - 4]': single_stock['F1[t - 1]'].shift(3)  
}).corr()[['PctChange:MeanEndPrice']]

Unnamed: 0,PctChange:MeanEndPrice
F1[t - 1],0.379122
F1[t - 2],0.065195
F1[t - 3],0.001757
F1[t - 4],0.022977
H1@WithNorm(MeanEndPrice)[t - 1],0.378796
H1@WithNorm(MeanEndPrice)[t - 2],0.051776
H1@WithNorm(MeanEndPrice)[t - 3],-0.013097
H1@WithNorm(MeanEndPrice)[t - 4],0.277385
PctChange:MeanEndPrice,1.0


## Log of return
In their [demo notebook](https://github.com/googledatalab/notebooks/blob/master/samples/TensorFlow/Machine%20Learning%20with%20Financial%20Data.ipynb) Google Cloud computes a feature that they call log return. This feature is:

```
log(Price:X[t]/Price:X[t - 1])
```
for some version of price X.

We explore this feature below. We find that when the `Mean` of the price is used, this feature correlates well
with various predictors, but when the `Last` of the price is used, it does not correlate very well.
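
For small price moves the log return is almost the same number as the percent change, since `log(X[t]/X[t - 1]) = log(1 + PctChange:X[t])`. The sketch below checks this on a short hypothetical series of mean prices (the values are made up).

```python
import numpy as np
import pandas as pd

# Hypothetical mean prices for consecutive intervals.
price = pd.Series([100.0, 100.2, 99.9, 100.4, 100.1])

pct_change = price.pct_change()
log_return = np.log(price) - np.log(price.shift(1))  # same form as log_return() above

# log(p[t] / p[t - 1]) == log(1 + pct_change); for small r, log(1 + r) is close to r.
comparison = pd.DataFrame({'PctChange': pct_change, 'LogReturn': log_return})
print(comparison)
print((pct_change - log_return).abs().max())
```

This is consistent with the correlation rows for `LogReturn:MeanEndPrice` being nearly identical to those for `PctChange:MeanEndPrice` at the same interval.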

In [22]:
single_stock = prepare_single_stock('SIE', '30Min')
corr_log_return(single_stock, 'MeanEndPrice')

Unnamed: 0,Direction[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],F6[t - 1],F7[t - 1],G1[t - 1],G2[t - 1],...,RevPctChange[t - 4]:LastEndPrice,RevPctChange[t - 4]:MaxPrice,RevPctChange[t - 4]:MeanAvg4Price,RevPctChange[t - 4]:MeanEndPrice,RevPctChange[t - 4]:MeanMaxMinPrice,RevPctChange[t - 4]:MeanStartEndPrice,RevPctChange[t - 4]:MedianEndPrice,RevPctChange[t - 4]:MinPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice
LogReturn:MeanEndPrice,0.48406,0.577222,0.485085,0.482471,0.090951,0.099265,0.347317,0.322155,0.40516,0.319939,...,0.240314,0.066983,0.065278,0.06894,0.065479,0.065075,0.063838,0.087899,-0.029496,0.082828


In [23]:
single_stock = prepare_single_stock('SIE', '30Min')
corr_log_return(single_stock, 'LastEndPrice')

Unnamed: 0,Direction[t - 1],F1[t - 1],F2[t - 1],F3[t - 1],F4[t - 1],F5[t - 1],F6[t - 1],F7[t - 1],G1[t - 1],G2[t - 1],...,RevPctChange[t - 4]:LastEndPrice,RevPctChange[t - 4]:MaxPrice,RevPctChange[t - 4]:MeanAvg4Price,RevPctChange[t - 4]:MeanEndPrice,RevPctChange[t - 4]:MeanMaxMinPrice,RevPctChange[t - 4]:MeanStartEndPrice,RevPctChange[t - 4]:MedianEndPrice,RevPctChange[t - 4]:MinPrice,RevPctChange[t - 4]:StdEndPrice,RevPctChange[t - 4]:VolumeWeightedEndPrice
LogReturn:LastEndPrice,-0.011142,-0.027358,-0.01124,-0.010971,0.020748,-0.027203,-0.01822,-0.03461,-0.028491,-0.026029,...,-0.030074,-0.038851,-0.021236,-0.021244,-0.021247,-0.021224,-0.015875,-0.010987,-0.021412,-0.02676


When we examine the autocorrelations of `LogReturn`, we find that beyond `(t - 1)` there are no correlations.
Google Cloud reports the same. As explained above, the reason is the different normalization
in different periods.

In [24]:
single_stock = prepare_single_stock('SIE', '30Min')
mean_end_price = single_stock['LogReturn:MeanEndPrice']
pd.DataFrame({
    'LogReturn:MeanEndPrice': mean_end_price,     
    'LogReturn:MeanEndPrice[t - 1]': single_stock['LogReturn:MeanEndPrice[t - 1]'],
    'LogReturn:MeanEndPrice[t - 2]': single_stock['LogReturn:MeanEndPrice[t - 1]'].shift(1),
    'LogReturn:MeanEndPrice[t - 3]': single_stock['LogReturn:MeanEndPrice[t - 1]'].shift(2),
    'LogReturn:MeanEndPrice[t - 4]': single_stock['LogReturn:MeanEndPrice[t - 1]'].shift(3)  
}).corr()[['LogReturn:MeanEndPrice']]

Unnamed: 0,LogReturn:MeanEndPrice
LogReturn:MeanEndPrice,1.0
LogReturn:MeanEndPrice[t - 1],0.238046
LogReturn:MeanEndPrice[t - 2],-0.008179
LogReturn:MeanEndPrice[t - 3],-0.035098
LogReturn:MeanEndPrice[t - 4],-0.021521
