---
# Creating Financial Data Structures using Tick Data
---


This notebook walks through the creation of financial data structures using the mlfinlab package. The goal is to demonstrate how to create tick, volume, and dollar bars and to compare their statistical properties with those of time bars.

In [2]:
import mlfinlab as ml

import numpy as np
import pandas as pd
from scipy import stats as st  # st.jarque_bera is used further down

import seaborn as sns
from statsmodels.graphics.tsaplots import plot_acf

import matplotlib.pyplot as plt  # pyplot provides plt.title, plt.show, etc.

%matplotlib inline

---
## Read and Format Data

For this notebook we format the tick data into three columns: date_time, price, and volume. The financial data structures can be created from these three columns alone.

In [3]:
# Load the raw trades; the cells that follow refer to this frame as `data`
data = pd.read_csv('./data/financial_data/ES_Trades/ES_Trades.csv')
data.head()

In [4]:
# Format the Data
date_time = data['Date'] + ' ' + data['Time']
new_data = pd.concat([date_time, data['Price'], data['Volume']], axis=1)
new_data.columns = ['date', 'price', 'volume']

print(new_data.head())
print('\n')
print('Rows:', new_data.shape[0])

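String concatenation leaves the timestamp as text, so it can be worth checking that the combined column parses into real datetimes before building bars. The sketch below does this on a few made-up rows. The toy values and the format string `'%m/%d/%Y %H:%M:%S.%f'` are assumptions for illustration, not the verified layout of `ES_Trades.csv`.

```python
import pandas as pd

# Made-up rows that reuse the column names from the cell above
toy = pd.DataFrame({
    'Date': ['01/02/2020', '01/02/2020', '01/03/2020'],
    'Time': ['09:30:00.125', '09:30:01.500', '09:30:00.000'],
    'Price': [3250.25, 3250.50, 3251.00],
    'Volume': [3, 1, 7],
})

# Combine the two text columns, then parse them with an explicit (assumed) format
combined = toy['Date'] + ' ' + toy['Time']
parsed = pd.to_datetime(combined, format='%m/%d/%Y %H:%M:%S.%f')

toy_formatted = pd.concat([parsed, toy['Price'], toy['Volume']], axis=1)
toy_formatted.columns = ['date_time', 'price', 'volume']

# Ticks should arrive in chronological order; a False here would point to a parsing problem
print(toy_formatted)
print('Sorted by time:', toy_formatted['date_time'].is_monotonic_increasing)
```

If the real file uses a different date layout, only the `format` argument needs to change.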

---
## Save the File to CSV

In [5]:
new_data.to_csv('./data/raw_tick_data.csv', index=False)

---
## Create Financial Data Structures

Here we create the standard bar structures and write each one to its own CSV file rather than holding everything in memory.

In [7]:
print('Creating Dollar Bars')
dollar_bars = ml.data_structures.get_dollar_bars(
    'data/raw_tick_data.csv', threshold=50000000, batch_size=750000,
    verbose=True, to_csv=True, output_path='./data/financial_data/dollar_bars.csv')
print('Dollar Bars Created')

print('Creating Volume Bars')
volume_bars = ml.data_structures.get_volume_bars(
    'data/raw_tick_data.csv', threshold=25000, batch_size=750000,
    verbose=True, to_csv=True, output_path='./data/financial_data/volume_bars.csv')
print('Volume Bars Created')

print('Creating Tick Bars')
tick_bars = ml.data_structures.get_tick_bars(
    'data/raw_tick_data.csv', threshold=5500, batch_size=750000,
    verbose=True, to_csv=True, output_path='./data/financial_data/tick_bars.csv')
print('Tick Bars Created')


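To make the role of `threshold` concrete, here is a minimal pure-pandas sketch of dollar-bar sampling: a bar closes as soon as the cumulative traded value (price times volume) reaches the threshold. It is an illustration under simplifying assumptions, not mlfinlab's implementation. The function name `sketch_dollar_bars`, the `dollar_value` column, and the synthetic ticks are all hypothetical.

```python
import numpy as np
import pandas as pd


def sketch_dollar_bars(ticks: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Group ticks into bars holding at least `threshold` of traded value each.

    Illustrative only. Ticks left over after the last full bar are dropped.
    """
    bars = []
    cum_dollar = 0.0
    start = 0
    prices = ticks['price'].to_numpy()
    volumes = ticks['volume'].to_numpy()
    for i in range(len(ticks)):
        cum_dollar += prices[i] * volumes[i]
        if cum_dollar >= threshold:
            chunk = ticks.iloc[start:i + 1]
            bars.append({
                'date_time': chunk['date_time'].iloc[-1],  # timestamp of the closing tick
                'open': chunk['price'].iloc[0],
                'high': chunk['price'].max(),
                'low': chunk['price'].min(),
                'close': chunk['price'].iloc[-1],
                'volume': chunk['volume'].sum(),
                'dollar_value': cum_dollar,
            })
            cum_dollar = 0.0  # start accumulating the next bar
            start = i + 1
    return pd.DataFrame(bars)


# Synthetic ticks, used only to exercise the function
rng = np.random.default_rng(0)
n = 1_000
toy_ticks = pd.DataFrame({
    'date_time': pd.Timestamp('2020-01-01') + pd.to_timedelta(np.arange(n), unit='s'),
    'price': 100 + rng.normal(0, 0.05, n).cumsum(),
    'volume': rng.integers(1, 50, n),
})

toy_bars = sketch_dollar_bars(toy_ticks, threshold=50_000)
print(toy_bars.head())
print('Bars formed:', len(toy_bars))
```

Volume bars and tick bars follow the same pattern: accumulate `volume` or a count of ticks instead of `price * volume`.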

---
## Test Sampling

In [8]:
# Confirm dollar bar creation (each bar is roughly equal to the set threshold)
dollar_bars['value'] = dollar_bars['close'] * dollar_bars['volume']
dollar_bars.head()

In [9]:
# Confirm the sampling for the volume bars (compare to threshold)
volume_bars.head()

---
## Compare Statistical Properties

Here we compare the statistical properties of the bar structures to understand the pros and cons of each for time series analysis and machine learning applications.

In [10]:
# Reload each bar type from disk, using the first column (timestamps) as the index
time_bars = pd.read_csv('./data/financial_data/time_bars.csv', index_col=0, parse_dates=True)
volume_bars = pd.read_csv('./data/financial_data/volume_bars.csv', index_col=0, parse_dates=True)
dollar_bars = pd.read_csv('./data/financial_data/dollar_bars.csv', index_col=0, parse_dates=True)
tick_bars = pd.read_csv('./data/financial_data/tick_bars.csv', index_col=0, parse_dates=True)

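The cell above reads `time_bars.csv`, which none of the earlier cells write. One plausible way to produce such a file, sketched here as an assumption rather than as the method actually used, is to resample the ticks on a fixed clock and keep the last traded price in each interval. The 15-minute interval and the synthetic ticks are hypothetical; the `price` column name mirrors how `time_bars` is used later in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic ticks, one per second, indexed by timestamp
rng = np.random.default_rng(1)
n = 5_000
index = pd.Timestamp('2020-01-01 09:30') + pd.to_timedelta(np.arange(n), unit='s')
ticks = pd.DataFrame({
    'price': 100 + rng.normal(0, 0.02, n).cumsum(),
    'volume': rng.integers(1, 20, n),
}, index=index)

# Fixed-clock sampling: last price and total volume per 15-minute interval
resampled = ticks.resample('15min')
toy_time_bars = pd.DataFrame({
    'price': resampled['price'].last(),
    'volume': resampled['volume'].sum(),
}).dropna(subset=['price'])  # drop intervals with no trades

print(toy_time_bars)
```

Writing `toy_time_bars.to_csv(...)` with the index included would give a file that `pd.read_csv(..., index_col=0, parse_dates=True)` can read back in the same shape.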

---
## Measure Stability of Bar Structures

In [13]:
# Downsample to weekly time-frame
time_count = time_bars['price'].resample('W', label='right').count()
volume_count = volume_bars['close'].resample('W', label='right').count()
dollar_count = dollar_bars['close'].resample('W', label='right').count()
tick_count = tick_bars['close'].resample('W', label='right').count()

compare_df = pd.concat([time_count, volume_count, dollar_count, tick_count], axis=1)
compare_df.columns = ['time', 'volume', 'dollar', 'tick']

In [14]:
# Plot Bars over time
compare_df.loc[:, ['time', 'volume', 'dollar', 'tick']].plot(kind='bar', figsize=[25, 5], color=('darkred', 'darkblue', 'green', 'darkcyan'))
plt.title('Number of bars over time', loc='center', fontsize=25, fontweight='bold', fontname='Times New Roman')
plt.show()

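A bar chart shows stability visually; a single number per bar type can make the comparison easier to read. One option is the coefficient of variation (standard deviation divided by mean) of the weekly bar counts, where a lower value means a steadier count. The sketch below uses randomly generated counts with the same column names as `compare_df`. The helper name `bar_count_cv` is hypothetical and the numbers carry no empirical meaning.

```python
import numpy as np
import pandas as pd


def bar_count_cv(counts: pd.DataFrame) -> pd.Series:
    """Coefficient of variation of bar counts per column (lower = more stable)."""
    return counts.std() / counts.mean()


# Randomly generated weekly counts, for illustration only
rng = np.random.default_rng(2)
weeks = pd.date_range('2020-01-05', periods=12, freq='W')
toy_counts = pd.DataFrame({
    'time': rng.poisson(400, size=12),
    'volume': rng.poisson(300, size=12),
    'dollar': rng.poisson(250, size=12),
    'tick': rng.poisson(350, size=12),
}, index=weeks)

cv = bar_count_cv(toy_counts)
print(cv.sort_values())
```

On real data, the same call would be `bar_count_cv(compare_df)`.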

---
## Calculate the Log Returns (Partial return to normality)

In [16]:
time_rets = np.log(time_bars['price']).diff().dropna()
volume_rets = np.log(volume_bars['close']).diff().dropna()
dollar_rets = np.log(dollar_bars['close']).diff().dropna()
tick_rets = np.log(tick_bars['close']).diff().dropna()




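Log returns computed this way can also be checked numerically for serial correlation, which complements an autocorrelation plot. The sketch below builds a synthetic price path, takes log returns with the same `np.log(...).diff().dropna()` pattern, and computes the lag-1 autocorrelation with `Series.autocorr`, cross-checked against `np.corrcoef`. The price path is made up. A value read off a `plot_acf` chart may differ slightly because that function uses a different estimator.

```python
import numpy as np
import pandas as pd

# Synthetic close prices following a geometric random walk
rng = np.random.default_rng(3)
close = pd.Series(100 * np.exp(rng.normal(0, 0.001, 2_000).cumsum()))

# Log returns: difference of log prices
log_rets = np.log(close).diff().dropna()

# Lag-1 autocorrelation: Pearson correlation of the series with itself shifted by one step
lag1 = log_rets.autocorr(lag=1)

# The same quantity computed by hand
x = log_rets.to_numpy()
manual = np.corrcoef(x[:-1], x[1:])[0, 1]

print('Lag-1 autocorrelation:', lag1)
print('Manual check:', manual)
```

Log returns are additive, so their sum over the sample equals the log of the last price divided by the first.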

In [21]:
plot_acf(time_rets, lags=10, zero=False)
plt.title('Time Bar Autocorrelation')
plt.show()

In [None]:
plot_acf(volume_rets, lags=10, zero=False)
plt.title('Volume Bar Autocorrelation')
plt.show()

In [None]:
plot_acf(dollar_rets, lags=10, zero=False)
plt.title('Dollar Bar Autocorrelation')
plt.show()

In [None]:
plot_acf(tick_rets, lags=10, zero=False)
plt.title('Tick Bar Autocorrelation')
plt.show()

In [18]:
print('Test Statistics:')
print('Time:', '\t', int(st.jarque_bera(time_rets)[0]))
print('Volume:', '\t', int(st.jarque_bera(volume_rets)[0]))
print('Dollar:', '\t', int(st.jarque_bera(dollar_rets)[0]))
print('Tick:', '\t', int(st.jarque_bera(tick_rets)[0]))

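The Jarque-Bera statistic measures how far a sample's skewness $S$ and excess kurtosis $K - 3$ are from the values of a normal distribution (both zero): $JB = \frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right)$. Lower values are closer to normality. The sketch below computes it by hand and compares it with `scipy.stats.jarque_bera` on two synthetic samples. The helper name `manual_jarque_bera` and the samples are hypothetical.

```python
import numpy as np
from scipy import stats


def manual_jarque_bera(x) -> float:
    """Jarque-Bera statistic from biased sample skewness and excess kurtosis."""
    x = np.asarray(x, dtype=float)
    n = x.size
    skew = stats.skew(x)             # biased sample skewness (scipy default)
    excess_kurt = stats.kurtosis(x)  # biased sample kurtosis minus 3 (scipy default)
    return n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)


# Synthetic samples: one normal, one fat-tailed
rng = np.random.default_rng(4)
normal_sample = rng.normal(size=5_000)
fat_tailed_sample = rng.standard_t(df=3, size=5_000)

for name, sample in [('normal', normal_sample), ('student-t(3)', fat_tailed_sample)]:
    print(name, manual_jarque_bera(sample), stats.jarque_bera(sample)[0])
```

The fat-tailed sample produces a much larger statistic, which is the pattern to look for when comparing bar types: the bar type with the smallest statistic has returns closest to normal.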

---
## Standardize the Returns

In [20]:
# Standardize (bring to same scale)
time_standard = ((time_rets - time_rets.mean()) / time_rets.std())
volume_standard = ((volume_rets - volume_rets.mean()) / volume_rets.std())
dollar_standard = ((dollar_rets - dollar_rets.mean()) / dollar_rets.std())
tick_standard = ((tick_rets - tick_rets.mean()) / tick_rets.std())

# Plot Distributions
plt.figure(figsize=(16, 12))
sns.kdeplot(time_standard, label='Time Bars', color='darkblue')
sns.kdeplot(volume_standard, label='Volume Bars', color='darkred')
sns.kdeplot(dollar_standard, label='Dollar Bars', color='darkgreen')
sns.kdeplot(tick_standard, label='Tick Bars', color='cyan')
sns.kdeplot(np.random.normal(size=1000000), label='Normal', color='black', linestyle='--')

plt.xticks(range(-5, 6))
plt.legend(loc=8, ncol=5)
plt.title('Exhibit 1 - Partial recovery of Normality through a price sampling process \nsubordinated to a volume, tick, dollar clock',
          loc='center', fontsize=20, fontweight="bold", fontname="Times New Roman")
plt.xlim(-5, 5)
plt.show()

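Standardizing applies the same expression to each return series, so wrapping it in a small helper avoids repeating it four times and makes the result easy to verify. The sketch below is one way to do that. The helper name `standardize` and the synthetic returns are hypothetical.

```python
import numpy as np
import pandas as pd


def standardize(returns: pd.Series) -> pd.Series:
    """Rescale a return series to zero mean and unit (sample) standard deviation."""
    return (returns - returns.mean()) / returns.std()


# Synthetic return series on deliberately different scales
rng = np.random.default_rng(5)
toy_rets = {
    'time': pd.Series(rng.normal(0.0001, 0.020, 1_000)),
    'volume': pd.Series(rng.normal(0.0002, 0.010, 1_000)),
    'dollar': pd.Series(rng.normal(0.0000, 0.005, 1_000)),
    'tick': pd.Series(rng.normal(-0.0001, 0.015, 1_000)),
}

toy_standard = {name: standardize(rets) for name, rets in toy_rets.items()}

# After standardizing, every series should have mean ~0 and standard deviation ~1
for name, series in toy_standard.items():
    print(name, round(series.mean(), 6), round(series.std(), 6))
```

Once the series share a scale, differences between their density plots reflect shape (skew and tails) rather than volatility.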

---
## Conclusion

Standard bar structures (dollar, volume, and tick) can be created from raw tick data and have better statistical properties than traditional time bars. This work can be extended by comparing them against the imbalance bars described by Marcos Lopez de Prado in *Advances in Financial Machine Learning*.