# Jane Street time series exploration and statistics

This notebook explores the data to assess which time series feature engineering techniques might be applicable:  
- Feature visualisation against time on a 9-day subsample (with and without exponential moving average smoothing)
- MACD calculation and visualisation example on one feature (for background on MACD, see https://towardsdatascience.com/implementing-macd-in-python-cc9b2280126a)   
- Stationarity test of the features
- Spearman correlation of the features
- Pearson autocorrelation, to get an idea of the lag (step) between related values of each time series
- Visualisation of fractionally differentiated features with the mlfinlab package (an implementation of the FFD method: see https://www.kaggle.com/c/jane-street-market-prediction/discussion/198994 for more details)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

#import janestreet
#env = janestreet.make_env() # initialize the environment

#!pip install datatable # Internet is not activated in this competition
!pip install ../input/python-datatable/datatable-0.11.0-cp37-cp37m-manylinux2010_x86_64.whl
import datatable as dt

DF_FILE = '../input/jane-street-first-time-series-dataframe-save/dataframe.pickle'
LOAD_DF = True

REMOVE_OUTLIERS = False

import gc
import pickle

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

INPUT_DIR = '/kaggle/input/jane-street-market-prediction/'

pd.set_option('display.max_rows', 2000)

In [None]:
from statsmodels.tsa.stattools import adfuller

# Load data

In [None]:
%%time
if not LOAD_DF:
    # Thanks to this notebook for the fast loading: https://www.kaggle.com/carlmcbrideellis/jane-street-eda-of-day-0-and-feature-importance
    train_data_datatable = dt.fread('../input/jane-street-market-prediction/train.csv')
    df = train_data_datatable.to_pandas()

    # Thanks to this notebook for the memory usage reduction: https://www.kaggle.com/jorijnsmit/one-liner-to-halve-your-memory-usage
    float64_cols = df.select_dtypes(include='float64').columns
    mapper = {col_name: np.float32 for col_name in float64_cols}
    df = df.astype(mapper)
    
    del train_data_datatable
    
    df['resp_positive'] = ((df['resp'])>0)*1  # Target to predict
    
    # Temporal features
    FEATURES_FOR_MACD = ['feature_'+str(i) for i in range(1,130)]
    
    for feature in FEATURES_FOR_MACD:
        # MACD = short-term EMA (span 12) minus long-term EMA (span 26).
        # Parentheses make the subtraction part of the assignment.
        df.loc[:, feature + '_macd'] = (
            df[feature].ewm(span=12, adjust=False).mean()    # Short term exponential moving average
            - df[feature].ewm(span=26, adjust=False).mean()  # Long term exponential moving average
        ).astype('float32')

        df.loc[:, feature + '_macd_minus_signal'] = df[feature + '_macd'] - df[feature + '_macd'].ewm(span=9, adjust=False).mean().astype('float32')
        gc.collect()
else:
    with open(DF_FILE, 'rb') as f:
        df = pickle.load(f)   

# Remove most of the data : keep 9 days

In [None]:
df['date'].max()

In [None]:
df.drop(index=df[df['date'] <= 490].index, inplace=True)

In [None]:
df.shape

# Train/test split

No train/test split is performed in this notebook.

In [None]:
df_train = df

In [None]:
#df.shape

In [None]:
#train_size = int(df.shape[0] * 0.90)

In [None]:
#train_size

In [None]:
#df.iloc[0:train_size-1, :]

In [None]:
#df_train = df.iloc[0:train_size-1, :].copy(deep=True)
#y_train = df.iloc[0:train_size-1]['resp_positive'].copy(deep=True)

#df_test = df.iloc[train_size:df.shape[0]-1].copy(deep=True) 
#df_test_origin = df_test.copy(deep=True)
#y_test = df.iloc[train_size:df.shape[0]-1]['resp_positive'].copy(deep=True)

In [None]:
#del df
#gc.collect()

# Data cleaning

In [None]:
cols_with_missing_train = [col for col in df_train.columns if df_train[col].isnull().any()]

In [None]:
REMOVE_OUTLIERS = True

In [None]:
if REMOVE_OUTLIERS:
    # Despite its name, this flag controls missing-value handling.
    # Missing values are not filled with -999, since that would impair feature visualization;
    # rows containing any NaN are dropped instead.
    df_train.dropna(axis=0, inplace=True)

In [None]:
#if (REMOVE_OUTLIERS == True):
#    for col in cols_with_missing_train:
#        df_test[col].fillna(-999, inplace=True) 

In [None]:
gc.collect()

In [None]:
df_train.shape

# Feature definition

In [None]:
FEATURES_LIST = ['feature_'+str(i) for i in range(130)] + ['feature_'+str(i)+'_macd' for i in range(1, 130)] +  ['weight']
FEATURES_LIST_RESP = ['resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4']
FEATURES_LIST_ORIGIN = ['feature_'+str(i) for i in range(130)] +  ['weight']


# Features smooth plot

In [None]:
# Thanks to https://www.kaggle.com/mlconsult/feature-visualization.  I added some smoothing.
df_train[FEATURES_LIST_RESP + FEATURES_LIST_ORIGIN].ewm(span = 1000).mean().plot(kind='line', subplots=True, grid=True, title="Visualize ", sharex=True, sharey=False, legend=True,figsize=(15,200));

In [None]:
# Thanks to https://www.kaggle.com/mlconsult/feature-visualization
df_train[FEATURES_LIST_RESP + FEATURES_LIST_ORIGIN].plot(kind='line', subplots=True, grid=True, title="Visualize ", sharex=True, sharey=False, legend=True,figsize=(15,200));
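
To make the smoothing step less opaque, here is a minimal sketch of what `ewm(span=...).mean()` computes with pandas' default `adjust=True` weighting. The helper name `ewm_mean_adjusted` and the synthetic `toy` array are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def ewm_mean_adjusted(values, span):
    """Exponentially weighted mean using pandas' default adjust=True weighting (hypothetical helper)."""
    alpha = 2.0 / (span + 1.0)  # pandas converts a span to alpha this way
    values = np.asarray(values, dtype=float)
    out = np.empty_like(values)
    for t in range(len(values)):
        # Weights (1-alpha)^0, (1-alpha)^1, ... are applied to x_t, x_{t-1}, ...
        weights = (1.0 - alpha) ** np.arange(t + 1)
        out[t] = np.dot(weights, values[t::-1]) / weights.sum()
    return out

rng = np.random.default_rng(0)
toy = rng.normal(size=50)  # synthetic data, not competition data

manual = ewm_mean_adjusted(toy, span=10)
pandas_result = pd.Series(toy).ewm(span=10).mean().to_numpy()
print(np.allclose(manual, pandas_result))
```

A larger span gives a smaller alpha, so older rows keep more weight and the curve gets smoother; with `span = 1000`, each plotted point averages over roughly the last few thousand rows.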

## MACD Example with feature 1 : check when MACD goes above signal

In [None]:
df_train.shape

In [None]:
df_train_subsample = df_train.iloc[0:20]

exp1 = df_train_subsample['feature_1'].ewm(span=12, adjust=False).mean() # Short term exponential moving average
exp2 = df_train_subsample['feature_1'].ewm(span=26, adjust=False).mean() # Long term exponential moving average

macd = exp1 - exp2

exp3 = macd.ewm(span=9, adjust=False).mean() # Signal line
resp = df_train_subsample['resp'] * df_train_subsample['feature_1'].mean() / np.abs(df_train_subsample['resp'].mean()) # Resp augmented to fit mean of feature 1 for better visualisation



plt.figure(figsize=(20,10))

plt.plot(df_train_subsample.ts_id, df_train_subsample.feature_1, label='Feature 1')
plt.plot(df_train_subsample.ts_id, macd, label='Feature 1 MACD', color='orange')
plt.plot(df_train_subsample.ts_id, exp3, label='Signal Line', color='Magenta')
#plt.plot(df_train_subsample.ts_id, resp, label='Resp', color='red')
plt.legend(loc='upper left')
plt.show()
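
The heading above mentions checking when MACD goes above the signal line. One possible way to flag those rows is sketched below; the function name `macd_bullish_crossovers` and the synthetic falling-then-rising series are assumptions for illustration, while the spans (12, 26, 9) are the ones used in this notebook.

```python
import numpy as np
import pandas as pd

def macd_bullish_crossovers(series, fast=12, slow=26, signal=9):
    """Return MACD, signal line, and a boolean mask of rows where MACD crosses above the signal."""
    fast_ema = series.ewm(span=fast, adjust=False).mean()   # Short term exponential moving average
    slow_ema = series.ewm(span=slow, adjust=False).mean()   # Long term exponential moving average
    macd = fast_ema - slow_ema
    signal_line = macd.ewm(span=signal, adjust=False).mean()
    above = macd > signal_line
    # A crossover is a row where MACD is above the signal but was not on the previous row
    crossed_up = above & ~above.shift(1, fill_value=False)
    crossed_up.iloc[0] = False  # the first row has no previous row to compare with
    return macd, signal_line, crossed_up

# Synthetic series (not competition data): falls for 50 rows, then rises for 50 rows
toy = pd.Series(np.concatenate([np.linspace(1.0, 0.0, 50), np.linspace(0.0, 1.0, 50)]))
macd, signal_line, crossed_up = macd_bullish_crossovers(toy)
print(crossed_up[crossed_up].index.tolist())
```

On the notebook's data, the same function could be applied to `df_train['feature_1']`; whether the flagged rows carry any predictive value for `resp` is not established here.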

# Stationarity test

In [None]:
def Augmented_Dickey_Fuller_Test_func(series, column_name):
    """Run the ADF test, print the results, and return True if the series is stationary at the 5% level."""
    print(f'Results of Augmented Dickey-Fuller Test for column: {column_name}')
    dftest = adfuller(series, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', 'Number of Lags Used', 'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(dfoutput)

    # Null hypothesis of the ADF test: the series has a unit root (it is non-stationary)
    print("Conclusion:====>")
    if dftest[1] <= 0.05:
        print("Reject the null hypothesis")
        print("Data is stationary")
        return True
    else:
        print("Fail to reject the null hypothesis")
        print("Data is non-stationary")
        return False

In [None]:
[Augmented_Dickey_Fuller_Test_func(df_train[feat], feat) for feat in FEATURES_LIST_RESP]

In [None]:
feats_stationary_bool = [Augmented_Dickey_Fuller_Test_func(df_train[feat], feat) for feat in FEATURES_LIST_ORIGIN]

In [None]:
# Indices of the features for which the test did not conclude stationarity
[i for i, x in enumerate(feats_stationary_bool) if not x]

### Result: all features are stationary (the list of non-stationary feature indices is empty)

# Feature correlations

In [None]:
fig, axs = plt.subplots(figsize=(64, 64))
plt.title('Feature correlations (Spearman coefficient)')
sns.heatmap(df_train[FEATURES_LIST + FEATURES_LIST_RESP].corr(method='spearman'), cmap=sns.diverging_palette(2, 255, n=20), square=True, center=0)
plt.show()
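
For readers less familiar with the Spearman coefficient used in the heatmap, the sketch below illustrates that it is the Pearson correlation computed on ranks, which makes it sensitive to any monotonic relationship and not only to linear ones. The `toy` frame is synthetic and assumed for illustration.

```python
import numpy as np
import pandas as pd

x = np.linspace(0.1, 5.0, 200)
toy = pd.DataFrame({'x': x, 'exp_x': np.exp(x)})  # monotonic but strongly non-linear relation

pearson = toy.corr(method='pearson').loc['x', 'exp_x']
spearman = toy.corr(method='spearman').loc['x', 'exp_x']
# Spearman is equivalent to Pearson applied to the ranks of each column
spearman_via_ranks = toy.rank().corr(method='pearson').loc['x', 'exp_x']

print(pearson, spearman, spearman_via_ranks)
```

This is one reason to prefer Spearman for anonymised features whose scaling and distribution are unknown.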

# Feature autocorrelations

In [None]:
step = 500

In [None]:
y_autocorr = [df_train['resp'].autocorr(lag=x) for x in range(step)]

In [None]:
plt.scatter(range(1,step), y_autocorr[1:])

In [None]:
np.argmax(y_autocorr[1:])

# Compute the step index that has maximal Pearson autocorrelation

In [None]:
y_autocorr = [[df_train[feat].autocorr(lag=x) for x in range(step)] for feat in FEATURES_LIST_ORIGIN]

In [None]:
argmax_feat_corrs = [np.argmax(autocorr_feat[1:]) for autocorr_feat in y_autocorr]

In [None]:
argmax_feat_corrs

### For the vast majority of features, the maximal autocorrelation is found at index 0, which corresponds to a step (lag) of 1. This suggests the features have probably already been pre-engineered so that they have a step of 1.
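
Because the argmax is taken over `autocorr_feat[1:]`, the returned index is shifted by one relative to the lag. A small helper that returns the lag directly may avoid that confusion; the name `best_autocorr_lag` and the synthetic AR(1) series below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def best_autocorr_lag(series, max_lag):
    """Return the lag in 1..max_lag with the highest Pearson autocorrelation (hypothetical helper)."""
    autocorrs = [series.autocorr(lag=lag) for lag in range(1, max_lag + 1)]
    # np.argmax returns a 0-based position, so add 1 to recover the actual lag
    return int(np.argmax(autocorrs)) + 1

# Synthetic AR(1) series: each value depends mostly on the previous one
rng = np.random.default_rng(0)
noise = rng.normal(size=2000)
ar1 = np.zeros_like(noise)
for t in range(1, len(noise)):
    ar1[t] = 0.9 * ar1[t - 1] + noise[t]

toy_series = pd.Series(ar1)
best_lag = best_autocorr_lag(toy_series, max_lag=50)
print(best_lag)
```

A series with a longer natural period, for example one sampled every k rows and forward-filled in between, would instead be expected to peak at a lag greater than 1.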

## Plot of autocorrelation coefficients across steps:

In [None]:
feats_todisplay = range(130)

for n_feat in feats_todisplay:
    #print(f'Feature {n_feat}:')
    
    # y_autocorr[n_feat][k] is the autocorrelation at step k, so the x axis runs over steps 1..99
    plt.plot(range(1, 100), y_autocorr[n_feat][1:100])
    plt.title(f'Feature {n_feat} autocorrelation value across steps, starting with 1')
    plt.show()
    
    '''
    for n_step in range(len(y_autocorr[n_feat][1:])):
        corr_value_for_step = y_autocorr[n_feat][1:][n_step]

        print(f'Correlation coefficient for step {n_step+1} : {corr_value_for_step}')
        
        
        if (n_step > 10):
            break
    '''

# Compute the step index that has maximal Pearson autocorrelation for the resp columns

In [None]:
y_autocorr = [[df_train[feat].autocorr(lag=x) for x in range(step)] for feat in FEATURES_LIST_RESP]

In [None]:
argmax_feat_resp_corrs = [np.argmax(autocorr_feat[1:]) for autocorr_feat in y_autocorr]

In [None]:
argmax_feat_resp_corrs

## Plot of resp autocorrelation coefficients across steps:

In [None]:
feats_todisplay = range(5)

for n_feat in feats_todisplay:
    #print(f'Feature {n_feat}:')
    
    # The x axis runs over steps 1..499 so that it matches the plotted values
    plt.plot(range(1, 500), y_autocorr[n_feat][1:500])
    plt.title(f'{FEATURES_LIST_RESP[n_feat]} autocorrelation value across steps, starting with 1')
    plt.show()

# FFD method test (fractional differentiation)

In [None]:
!pip uninstall typing -y
!pip install mlfinlab

In [None]:
import mlfinlab as mlfin

In [None]:
help(mlfin.fracdiff.FractionalDifferentiation)

In [None]:
frac = mlfin.fracdiff.FractionalDifferentiation()

In [None]:
DIFF_AMT = 0.1  # Fractional differentiation order

for feat in FEATURES_LIST_ORIGIN:
    df_train[feat+'_frac'] = frac.frac_diff_ffd(df_train[[feat]], DIFF_AMT)
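
To give an idea of what fixed-width window fractional differentiation (FFD) does, here is a minimal NumPy sketch of the general technique. It is not mlfinlab's implementation: the function names, the `threshold` parameter and its default value are assumptions for illustration. The weights follow the recursion w_0 = 1, w_k = -w_{k-1} (d - k + 1) / k, and are truncated once their magnitude drops below the threshold.

```python
import numpy as np

def ffd_weights(d, threshold=1e-5):
    """Weights of the fractional difference operator (1 - B)^d, truncated below `threshold`."""
    if d < 0:
        raise ValueError('this sketch only supports d >= 0')
    weights = [1.0]
    k = 1
    while True:
        next_weight = -weights[-1] * (d - k + 1) / k
        if abs(next_weight) < threshold:
            break
        weights.append(next_weight)
        k += 1
    return np.array(weights)

def frac_diff_ffd_sketch(values, d, threshold=1e-5):
    """Apply the truncated weights over a fixed-width window; early rows without a full window are NaN."""
    values = np.asarray(values, dtype=float)
    weights = ffd_weights(d, threshold)
    width = len(weights)
    out = np.full(len(values), np.nan)
    for t in range(width - 1, len(values)):
        window = values[t - width + 1 : t + 1][::-1]  # x_t, x_{t-1}, ..., oldest last
        out[t] = np.dot(weights, window)
    return out

print(ffd_weights(0.1, threshold=1e-2))              # slowly decaying weights for a small d
print(frac_diff_ffd_sketch([1, 3, 6, 10], d=1.0))    # d = 1 reduces to the ordinary first difference
```

With a small order such as 0.1, the weights decay slowly, so the transformed series keeps most of the memory of the original one; with d = 1 only two weights remain and the result is the usual first difference.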

In [None]:
FEATURES_LIST_WITH_FRAC = []
for i in range(130):
    FEATURES_LIST_WITH_FRAC.append('feature_'+str(i))
    FEATURES_LIST_WITH_FRAC.append('feature_'+str(i)+'_frac')

In [None]:
# Thanks to https://www.kaggle.com/mlconsult/feature-visualization.  I added some smoothing.
df_train[FEATURES_LIST_WITH_FRAC].ewm(span = 1000).mean().plot(kind='line', subplots=True, grid=True, title="Visualize features with FFD smoothed", sharex=True, sharey=False, legend=True,figsize=(15,400));

In [None]:
feat1_diff = frac.frac_diff_ffd(df_train[['feature_1']], 0.1)

In [None]:
df_train['feature_1'].ewm(span = 1000).mean().plot(kind='line', title='Feature 1 smoothed', legend=True, figsize=(15, 5));

In [None]:
feat1_diff.ewm(span = 1000).mean().plot(kind='line', title='Feature 1 fractionally differentiated smoothed', legend=True, figsize=(15, 5));

# Conclusion
Since the data seems stationary, trends have probably already been extracted from the features.   
And since the step between feature values is 1, the time series are probably consistent with the time stamp indexes.   
Both observations suggest that some refined feature engineering has already been done.   

MACD and FFD techniques may therefore be of limited use in this competition, since comparable transformations may already be part of the existing feature engineering.  
However, it could still be worth including those features in a model to see what happens: in the visualisation example of feature 1, MACD minus signal seems to give some information about local trends.  