# ITCH trade sign classification (explanation)

#### Juan Camilo Henao Londono - 27.12.2019
#### AG Guhr - Universität Duisburg-Essen

The trade sign classification results reported in the [paper](https://link.springer.com/content/pdf/10.1140/epjb/e2016-60818-y.pdf) differ from the ones I obtained with my implementation.
In this notebook I want to clarify why they differ. I think my code is right and that the mistake
lies in the implementation and interpretation used in the paper.

In [1]:
import gzip
import numpy as np
import pandas as pd

import itch_trade_sign_classification_test as itch_sign_clas

## Part 1. Original results

To compare the results in detail, I use the TotalView-ITCH data for the Apple Inc. stock on 2008.06.02.
The paper uses Eq. 1, 2 and 3 in Section 2.3 to obtain the trade signs. I will only use the data obtained from Eq. 1 and 2 to clarify my position. Eq. 1 works on the trade time scale and Eq. 2 works on the second time scale.
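As a rough illustration of the rule checked later in this notebook (the sign of the price change, and for an unchanged price the sign of the previous trade), here is a minimal sketch of Eq. 1. The function name `trade_signs_eq1`, the toy prices, and the choice of leaving the first trade at 0 are assumptions made for this example; this is not the code in `itch_trade_sign_classification_test`.

```python
import numpy as np


def trade_signs_eq1(prices):
    """Sketch of Eq. 1: sign of each price change on the trade time scale.

    A zero price change inherits the sign of the previous trade.
    Assumption: the first trade has no predecessor and is left at 0.
    """
    prices = np.asarray(prices, dtype=float)
    signs = np.zeros(len(prices), dtype=int)
    for i in range(1, len(prices)):
        diff = np.sign(prices[i] - prices[i - 1])
        # Unchanged price: reuse the sign assigned to the previous trade
        signs[i] = int(diff) if diff != 0 else signs[i - 1]
    return signs


# Hypothetical prices, only to show the behaviour
example_signs = trade_signs_eq1([10.0, 10.1, 10.1, 10.0, 10.0, 10.2])
print(example_signs)  # [ 0  1  1 -1 -1  1]
```

The loop is kept explicit because each sign may depend on the sign computed for the previous trade.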

In [2]:
# Ticker and date
ticker = 'AAPL'
year = '2008'
month = '06'
day = '02'

# Filename of the original results
filename_trade_ori = f'../data/itch_trade_classification_trade_{year}{month}{day}_{ticker}_ori.txt'
filename_second_ori = f'../data/itch_trade_classification_second_{year}{month}{day}_{ticker}_ori.txt'

# Filename of the corrected results
filename_trade_corr = f'../data/itch_trade_classification_trade_{year}{month}{day}_{ticker}_corr.txt'
filename_second_corr = f'../data/itch_trade_classification_second_{year}{month}{day}_{ticker}_corr.txt'

In [3]:
# Load and compute the data with my implementation
(time_signs, trade_signs,
 vol_signs, price_signs) = itch_sign_clas.itch_trade_classification_data(ticker, year, month, day)
id_trades = itch_sign_clas.itch_trade_classification_eq1_data(ticker, trade_signs, price_signs, year,
                                                              month, day)
emp_s, exp_eq2_s = itch_sign_clas.itch_trade_classification_eq2_data(ticker, time_signs,
                                                                              trade_signs, id_trades, year,
                                                                              month, day)
_, exp_eq3_s = itch_sign_clas.itch_trade_classification_eq3_data(ticker, time_signs, trade_signs,
                                                                    vol_signs, id_trades, year, month, day)

Processing data for the stock AAPL the 2008.06.02

Implementation of Eq. 1.
Processing data for the stock AAPL the 2008.06.02

Implementation of Eq. 2.
Processing data for the stock AAPL the 2008.06.02

Implementation of Eq. 3.
Processing data for the stock AAPL the 2008.06.02



In [4]:
# Create a data frame with my results
trades_no_0 = trade_signs != 0
d1 = {'Time': time_signs[trades_no_0], 'Emp': trade_signs[trades_no_0],
      'Exp': id_trades, 'Price': price_signs[trades_no_0]}
trades_juan_ori = pd.DataFrame(data=d1).astype(int)
trades_juan_ori['Price'] = trades_juan_ori['Price'] / 10000

In [5]:
d2 = {'Eq2': exp_eq2_s, 'Eq3': exp_eq3_s, 'Emp': emp_s}
second_juan_ori = pd.DataFrame(data=d2).astype(int)

In [6]:
# Load results from paper
trades_paper_ori = pd.read_csv(filename_trade_ori, sep='   ',
                               usecols=(1,3,4,5), header=None, engine='python')
trades_paper_ori.columns = ['Time', 'Teo', 'Emp', 'Price']

In [7]:
second_paper_ori = pd.read_csv(filename_second_ori, sep='   ',
                               usecols=(2,3,4), header=None, engine='python')
second_paper_ori.columns = ['Eq2', 'Eq3', 'Emp']

After loading the data, I compare every single result to find where the differences are.

In [9]:
print('Similarities')
print('------------')
print()

trade_time_comp = np.sum(trades_paper_ori['Time'] == trades_juan_ori['Time']) \
                         / len(trades_paper_ori['Time'])
print('The similarity of the time in the trade time scale is  {:.2f}%'.format(trade_time_comp * 100))

trade_emp_comp = np.sum(trades_paper_ori['Emp'] == trades_juan_ori['Emp']) / len(trades_paper_ori['Emp'])
second_emp_comp = np.sum(second_paper_ori['Emp'] == second_juan_ori['Emp']) / len(second_paper_ori['Emp'])
print('The similarity of the reference trade signs values for the trade time scale is  {:.2f}%'
      .format(trade_emp_comp * 100))
print('The similarity of the reference trade signs values for the second time scale is {:.2f}%'
      .format(second_emp_comp * 100))
print()

Similarities
------------

The similarity of the time in the trade time scale is  100.00%
The similarity of the reference trade signs values for the trade time scale is  90.62%
The similarity of the reference trade signs values for the second time scale is 93.05%



In [None]:
trans_exp_comp = np.sum(transactions_wang['Exp'] == transactions_juan['Exp']) / len(transactions_wang['Exp'])
perse_eq2_comp = np.sum(persecond_wang['Eq2'] == persecond_juan['Eq2']) / len(persecond_wang['Eq2'])
perse_eq3_comp = np.sum(persecond_wang['Eq3'] == persecond_juan['Eq3']) / len(persecond_wang['Eq3'])

print('The similarity of the experimental result for the transactions is {:.2f}%'.format(trans_exp_comp * 100))
print('The similarity of the Eq. 2 result for the persecond is           {:.2f}%'.format(perse_eq2_comp * 100))
print('The similarity of the Eq. 3 result for the persecond is           {:.2f}%'.format(perse_eq3_comp * 100)) 

In [None]:
eq_2_3_sim_juan = np.sum(persecond_juan['Eq2'] == persecond_juan['Eq3']) / len(persecond_juan['Eq2'])
print(eq_2_3_sim_juan)
eq_2_3_sim_wang = np.sum(persecond_wang['Eq2'] == persecond_wang['Eq3']) / len(persecond_wang['Eq2'])
print(eq_2_3_sim_wang)

## Step 4

Check every difference between the experimental results at the millisecond time scale and determine the error.

In [None]:
trans_exp_diff = np.where(transactions_wang['Exp'] != transactions_juan['Exp'])[0]

print('The total of different values between the experimental results are ', len(trans_exp_diff))
print('The first ten different values are located in the positions ', trans_exp_diff[:10])

In [None]:
print('The first value of Wang is {} and of Juan is {}'
      .format(transactions_wang['Exp'][0], transactions_juan['Exp'][0]))

In [None]:
wang = 0
juan = 0

for val in trans_exp_diff:
    
    print('The value in pos. {} of Wang is {} and of Juan is {}'
          .format(val, transactions_wang['Exp'][val], transactions_juan['Exp'][val]))
    print('To check which value is correct, we need to find the value in the position {} and {}'
          .format(val, val - 1))
    sign = np.sign(price_signs[val] - price_signs[val - 1])
    print('The price in pos. {} is {} and in position {} is {}. Then the trade sign must be {}'
              .format(val, price_signs[val], val - 1, price_signs[val - 1], sign))
    if (sign == 0):
        print(' '.join(('As the sign is zero, we use the trade sign value in pos. {}. For Wang that value'
              + ' is {} and for Juan that value is {}. The real value of that position is {}').split())
              .format(val - 1, transactions_wang['Exp'][val - 1], transactions_juan['Exp'][val - 1],
                      trade_signs[val - 1]))
        if (transactions_wang['Exp'][val - 1] == trade_signs[val - 1]
            and transactions_juan['Exp'][val - 1] == trade_signs[val - 1]):
            print('Both were right')
            wang += 1
            juan += 1
        elif (transactions_wang['Exp'][val - 1] == trade_signs[val - 1]):
            print('Wang was right')
            wang += 1
        else:
            print('Juan was right')
            juan += 1
    else:
        print('The real value of that position is {}'.format(trade_signs[val]))
        if (trade_signs[val] == transactions_wang['Exp'][val]):
            print('Wang was right')
            wang += 1
        else:
            print('Juan was right')
            juan += 1
    print()

print('Wang was right {} times and Juan was right {} times'.format(wang, juan))    
print()

The reason for this error is that I only use the values from the open market time. My first value is calculated from the difference between the first and the last value, `diff = price_signs[0] - price_signs[-1]`, so my first value is error prone.
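The effect can be reproduced with a small sketch. It assumes the price differences are taken with a circular shift (here `np.roll`), which makes element 0 equal to `prices[0] - prices[-1]`; the actual implementation is not shown in this notebook, and the toy prices are hypothetical.

```python
import numpy as np

# Toy prices (hypothetical values), in the order the trades occur
prices = np.array([100, 101, 101, 99, 102])

# Circular difference: element 0 is prices[0] - prices[-1]
diff_wrapped = prices - np.roll(prices, 1)

# Difference that leaves the first trade without a predecessor
diff_open = np.diff(prices, prepend=prices[0])

first_wrapped = int(np.sign(diff_wrapped[0]))
first_open = int(np.sign(diff_open[0]))

print(first_wrapped, first_open)  # -1 0
```

Only the first element differs between the two versions, so the wrap-around can affect the first trade sign and any following signs that inherit it through unchanged prices.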

## Step 5

Check every difference between the Eq. 2 results (second time scale) and determine the error.
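The check in this step relies on how the trade signs are aggregated per second: the signs of all trades inside a one-second interval are summed and the sign of that sum is kept. A minimal sketch of that aggregation follows. The function name `trade_signs_eq2`, its arguments, and the convention that a second without trades gets 0 are assumptions made for this example; times are taken in milliseconds, as in the comparison cells of this notebook.

```python
import numpy as np


def trade_signs_eq2(times_ms, signs, t_start_s, t_end_s):
    """Sketch of Eq. 2: sign of the summed trade signs in each second.

    Assumption: a second without trades gets the value 0.
    """
    seconds = np.asarray(times_ms) // 1000
    signs = np.asarray(signs)
    out = np.zeros(t_end_s - t_start_s, dtype=int)
    for k, sec in enumerate(range(t_start_s, t_end_s)):
        # Sum the signs of the trades that fall in [sec, sec + 1)
        out[k] = int(np.sign(np.sum(signs[seconds == sec])))
    return out


# Hypothetical trades: three in second 34800, none in 34801, one in 34802
example_eq2 = trade_signs_eq2([34800100, 34800500, 34800900, 34802300],
                              [1, 1, -1, -1], 34800, 34803)
print(example_eq2)  # [ 1  0 -1]
```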

In [None]:
trans_eq2_diff = np.where(persecond_wang['Eq2'] != persecond_juan['Eq2'])[0]

print('The total of different values between the experimental results are ', len(trans_eq2_diff))
print('The first ten different values are located in the positions ', trans_eq2_diff[:10])

In [None]:
wang = 0
juan = 0

for val in trans_eq2_diff:
    
    print('The value in pos. {} of Wang is {} and of Juan is {}'
          .format(val, persecond_wang['Eq2'][val], persecond_juan['Eq2'][val]))
    print('To check which value is correct, we need to find the value in the second {}'
          .format(34800 + val))
    
    condition = (transactions_wang['Time'] / 1000 >= 34800 + val) \
                * (transactions_wang['Time'] / 1000 < 34800 + val + 1)
    print('Juan: ', list(transactions_juan['Exp'][condition]))
    print('Wang: ', list(transactions_wang['Exp'][condition]))
    
    trades_sign_eq2_juan = np.sum(transactions_juan['Exp'][condition])
    trades_sign_eq2_wang = np.sum(transactions_wang['Exp'][condition])
    trades_sign_eq2_teo = np.sum(transactions_wang['Teo'][condition])
    
    print('The sum of Juan is {}, the sum of Wang is {}, the theoretical value is {}'
          .format(trades_sign_eq2_juan, trades_sign_eq2_wang, np.sign(trades_sign_eq2_teo)))                   

    if (np.sign(trades_sign_eq2_juan) == np.sign(trades_sign_eq2_teo)):
        print('Juan was right')
        juan += 1
    elif (np.sign(trades_sign_eq2_wang) == np.sign(trades_sign_eq2_teo)):
        print('Wang was right')
        wang += 1
    else:
        print('Neither was right')

    print()

print('Wang was right {} times and Juan was right {} times'.format(wang, juan))    
print()

## Step 6

S. Wang confused the output of the accuracy result for the second time scale. I changed her code and obtained the correct data. With that, these are the new results of the comparison:

In [None]:
# Load new data
transactions_wang_corr = pd.read_csv(filename_trade_corr, sep='   ',
                                     usecols=(1,3,4,5), header=None, engine='python')
transactions_wang_corr.columns = ['Time', 'Teo', 'Exp', 'Price']

persecond_wang_corr = pd.read_csv(filename_second_corr, sep='   ',
                                  usecols=(0,2,3,4), header=None, engine='python')
persecond_wang_corr.columns = ['Time', 'Eq2', 'Eq3', 'Teo']

In [None]:
# Comparison

trans_time_comp = np.sum(transactions_wang_corr['Time'] == transactions_juan['Time']) / len(transactions_wang_corr['Time'])
perse_time_comp = np.sum(persecond_wang_corr['Time'] == persecond_juan['Time']) / len(persecond_wang_corr['Time'])

print('The similarity of the time in the transaction time used is {:.2f}%'.format(trans_time_comp * 100))
print('The similarity of the time in the persecond time used is   {:.2f}%'.format(perse_time_comp * 100))

trans_teo_comp = np.sum(transactions_wang_corr['Teo'] == transactions_juan['Teo']) / len(transactions_wang_corr['Teo'])
perse_teo_comp = np.sum(persecond_wang_corr['Teo'] == persecond_juan['Teo']) / len(persecond_wang_corr['Teo'])

print('The similarity of the reference trade signs values for the transactions is {:.2f}%'.format(trans_teo_comp * 100))
print('The similarity of the reference trade signs values for the persecond is    {:.2f}%'.format(perse_teo_comp * 100))

trans_exp_comp = np.sum(transactions_wang_corr['Exp'] == transactions_juan['Exp']) / len(transactions_wang_corr['Exp'])
perse_eq2_comp = np.sum(persecond_wang_corr['Eq2'] == persecond_juan['Eq2']) / len(persecond_wang_corr['Eq2'])
perse_eq3_comp = np.sum(persecond_wang_corr['Eq3'] == persecond_juan['Eq3']) / len(persecond_wang_corr['Eq3'])

print('The similarity of the experimental result for the transactions is {:.2f}%'.format(trans_exp_comp * 100))
print('The similarity of the Eq. 2 result for the persecond is           {:.2f}%'.format(perse_eq2_comp * 100))
print('The similarity of the Eq. 3 result for the persecond is           {:.2f}%'.format(perse_eq3_comp * 100)) 