### Example Notebook for Pipelines

Includes data cleaning, preprocessing, feature engineering, and model training.

In [1]:
import sklearn
import pandas as pd
import numpy as np
from datetime import datetime,timedelta
import collections
from itertools import chain
import matplotlib.pyplot as plt
from pathlib import Path

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn import set_config
from sklearn.compose import make_column_selector, make_column_transformer
import warnings
warnings.filterwarnings('ignore')

In [3]:
pd.set_option('display.max_columns', None)

# 1. Data
### 1.1 Load/Import Data

In [None]:
TRADE_PATH = ""
QUOTE_PATH = ""

In [4]:
trades = pd.read_csv(Path(TRADE_PATH))
quotes = pd.read_csv(Path(QUOTE_PATH))

In [5]:
trades.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280948 entries, 0 to 280947
Data columns (total 18 columns):
 #   Column                                  Non-Null Count   Dtype  
---  ------                                  --------------   -----  
 0   Unnamed: 0.1                            280948 non-null  int64  
 1   Unnamed: 0                              280948 non-null  int64  
 2   Time                                    280948 non-null  object 
 3   Date                                    280948 non-null  object 
 4   Exchange                                280948 non-null  object 
 5   Symbol                                  280948 non-null  object 
 6   Trade_Volume                            280948 non-null  int64  
 7   Trade_Price                             280948 non-null  float64
 8   Sale_Condition                          280948 non-null  object 
 9   Source_of_Trade                         280948 non-null  object 
 10  Trade_Stop_Stock_Indicator              0 no

Note: All column information of trades and quotes data and valid entries for each column can be found at https://www.nyse.com/publicdocs/nyse/data/Daily_TAQ_Client_Spec_v3.0.pdf

In [6]:
trades.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Time,Date,Exchange,Symbol,Trade_Volume,Trade_Price,Sale_Condition,Source_of_Trade,Trade_Stop_Stock_Indicator,Trade_Correction_Indicator,Sequence_Number,Trade_Id,Trade_Reporting_Facility,Participant_Timestamp,Trade_Reporting_Facility_TRF_Timestamp,Trade_Through_Exempt_Indicator
0,0,0,2020-01-03 04:00:00.063086,2020-01-03,P,AAPL,1602,297.35,@ T,N,,0,1196,1,,40000062702080,,1
1,1,1,2020-01-03 04:00:00.267793,2020-01-03,P,AAPL,1,297.12,@ TI,N,,0,1200,2,,40000267410944,,0
2,2,2,2020-01-03 04:00:00.566842,2020-01-03,Q,AAPL,3,297.2,@FTI,N,,0,1205,1,,40000566817254,,1
3,3,3,2020-01-03 04:00:00.566988,2020-01-03,Q,AAPL,1,297.2,@FTI,N,,0,1206,2,,40000566971025,,1
4,4,4,2020-01-03 04:00:02.323113,2020-01-03,Q,AAPL,74,297.2,@FTI,N,,0,1210,3,,40002323090401,,1


In [7]:
quotes.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Time,Exchange,Symbol,Bid_Price,Bid_Size,Offer_Price,Offer_Size,Quote_Condition,Sequence_Number,National_BBO_Indicator,FINRA_BBO_Indicator,FINRA_ADF_MPID_Indicator,Quote_Cancel_Correction,Source_Of_Quote,Retail_Interest_Indicator,Short_Sale_Restriction_Indicator,LULD_BBO_Indicator,SIP_Generated_Message_Identifier,NBBO_LULD_Indicator,Participant_Timestamp,FINRA_ADF_Timestamp,FINRA_ADF_Market_Participant_Quote_Indicator,Security_Status_Indicator,Date,YearMonth
0,0,0,2020-01-03 04:00:00.011488,Q,AAPL,0.0,0.0,299.91,1.0,R,1198,2,,,,N,,0,,,,40000011469003,,,,2020-01-03,202001
1,1,1,2020-01-03 04:00:00.064223,P,AAPL,278.0,7.0,0.0,0.0,R,2218,2,,,,N,,0,,,,40000063838720,,,,2020-01-03,202001
2,2,2,2020-01-03 04:00:00.064233,P,AAPL,278.0,14.0,0.0,0.0,R,2219,2,,,,N,,0,,,,40000063840512,,,,2020-01-03,202001
3,3,3,2020-01-03 04:00:00.064235,P,AAPL,278.0,14.0,297.55,1.0,R,2220,4,,,,N,,0,,,,40000063845120,,,,2020-01-03,202001
4,4,4,2020-01-03 04:00:00.064235,P,AAPL,295.58,1.0,297.55,1.0,R,2221,4,,,,N,,0,,,,40000063845888,,,,2020-01-03,202001


### 1.2 Data Visualization & Preliminary Analysis

### 1.3 Data Cleaning

Before we move towards feature generation and building machine learning models, we have to clean the dataset. The necessary steps to clean the trades and quotes data include:
1. Drop unnecessary columns.
2. Remove invalid trades and quotes.
3. Reconstruct the event sequence.
4. Keep only the national best bid/offer (NBBO) or the last MQU.
5. Identify the last active quote and assign it to the corresponding trade.
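
Steps 2 and 5 can be sketched with plain pandas. This is a minimal illustration on made-up data, not the logic inside `CleanData` or `PreprocessData`. The validity rule (a positive bid that lies below the offer) and the choice to match only quotes strictly earlier than the trade are assumptions made for the example.

```python
import pandas as pd

# Toy data: column names follow the notebook, values are made up.
trades_demo = pd.DataFrame({
    "Participant_Timestamp": pd.to_datetime(
        ["2020-01-03 09:30:00.100", "2020-01-03 09:30:00.500"]),
    "Trade_Price": [297.40, 297.45],
})
quotes_demo = pd.DataFrame({
    "Participant_Timestamp": pd.to_datetime(
        ["2020-01-03 09:30:00.000",
         "2020-01-03 09:30:00.200",
         "2020-01-03 09:30:00.300"]),
    "Bid_Price": [297.38, 0.0, 297.41],
    "Offer_Price": [297.50, 297.49, 297.48],
})

# Assumed validity rule (not taken from the notebook):
# the bid must be positive and strictly below the offer.
is_valid = (quotes_demo["Bid_Price"] > 0) & (
    quotes_demo["Offer_Price"] > quotes_demo["Bid_Price"])
valid_quotes = quotes_demo[is_valid]

# For each trade, attach the most recent valid quote that is
# strictly earlier than the trade (assumption: exact ties excluded).
matched = pd.merge_asof(
    trades_demo.sort_values("Participant_Timestamp"),
    valid_quotes.sort_values("Participant_Timestamp"),
    on="Participant_Timestamp",
    direction="backward",
    allow_exact_matches=False,
)
print(matched)
```

`pd.merge_asof` requires both frames to be sorted on the key, which is why each side is sorted explicitly before the join.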

## SCIKIT-LEARN DESIGN

https://arxiv.org/pdf/1309.0238.pdf

Scikit-Learn’s API is remarkably well designed. These are the main design components of Scikit-Learn.

All objects share a consistent and simple interface:

### Estimators

Any object that can estimate some parameters based on a dataset is called an estimator (e.g., a SimpleImputer is an estimator). The estimation itself is performed by the fit() method, and it takes a dataset as a parameter, or two for supervised learning algorithms—the second dataset contains the labels. Any other parameter needed to guide the estimation process is considered a hyperparameter (such as a SimpleImputer’s strategy), and it must be set as an instance variable (generally via a constructor parameter).
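
As a small illustration of the estimator interface, the sketch below fits a `SimpleImputer` on a toy array. The data is made up; the point is only that the hyperparameter goes into the constructor and the learned values appear as attributes after `fit()`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy dataset with missing values (made up for illustration).
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, np.nan]])

# The strategy is a hyperparameter, set through the constructor.
imputer = SimpleImputer(strategy="median")

# fit() estimates the parameters (here, one median per column).
imputer.fit(X)

# Learned parameters are stored in attributes ending with an underscore.
print(imputer.statistics_)
```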

### Transformers

Some estimators (such as a SimpleImputer) can also transform a dataset; these are called transformers. Once again, the API is simple: the transformation is performed by the transform() method with the dataset to transform as a parameter. It returns the transformed dataset. This transformation generally relies on the learned parameters, as is the case for a SimpleImputer. All transformers also have a convenience method called fit_transform(), which is equivalent to calling fit() and then transform() (but sometimes fit_transform() is optimized and runs much faster).
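
A short sketch of the transformer interface, again on made-up data: the mean is learned once from the training array and then reused when transforming new data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [3.0], [np.nan]])
X_new = np.array([[np.nan], [10.0]])

imputer = SimpleImputer(strategy="mean")

# fit_transform() learns the column mean from X_train and fills its gaps.
X_train_filled = imputer.fit_transform(X_train)

# transform() reuses the mean learned from X_train; nothing is re-estimated.
X_new_filled = imputer.transform(X_new)

print(X_train_filled.ravel())
print(X_new_filled.ravel())
```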


### Predictors

Finally, some estimators, given a dataset, are capable of making predictions; they are called predictors. For example, a LinearRegression model is a predictor: given a country’s GDP per capita, it can predict life satisfaction. A predictor has a predict() method that takes a dataset of new instances and returns a dataset of corresponding predictions. It also has a score() method that measures the quality of the predictions, given a test set (and the corresponding labels, in the case of supervised learning algorithms).
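
The predictor interface can be sketched with a `LinearRegression` fitted on synthetic data that follows y = 2x + 1 exactly. The numbers are made up for illustration and are unrelated to the trades and quotes data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on the line y = 2x + 1.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression()
model.fit(X, y)

# predict() returns one prediction per new instance.
pred = model.predict(np.array([[5.0]]))

# score() returns the coefficient of determination (R^2) for regressors.
r2 = model.score(X, y)

print(pred, r2)
```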

### ...

Reference to the base classes for all estimators in scikit-learn can be found at: https://github.com/scikit-learn/scikit-learn/blob/9aaed4987/sklearn/base.py#L153
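
A custom pipeline step can be built on top of these base classes. The class below is a hypothetical example written for this notebook's data layout; it is not the implementation of `CleanData`, `PreprocessData`, or `FeatureGeneration`. It drops columns whose names start with a given prefix, such as the leftover `Unnamed: 0` index columns.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline


class DropColumnsByPrefix(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: drop columns whose name starts with `prefix`."""

    def __init__(self, prefix="Unnamed"):
        # Hyperparameters are stored unchanged as instance attributes.
        self.prefix = prefix

    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained.
        return self

    def transform(self, X):
        keep = [c for c in X.columns if not str(c).startswith(self.prefix)]
        return X[keep].copy()


# Toy frame with made-up values.
demo = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Symbol": ["AAPL", "AAPL"],
    "Trade_Price": [297.35, 297.12],
})

pipe = make_pipeline(DropColumnsByPrefix())
out = pipe.fit_transform(demo)
print(list(out.columns))
```

`TransformerMixin` supplies `fit_transform()`, and `BaseEstimator` supplies `get_params()` and `set_params()`, so the class only has to define `fit()` and `transform()`.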

In [10]:
from generators import *
from feature_generation import FeatureGeneration
from clean_data import CleanData
from preprocess_data import PreprocessData

In [11]:
clean_pipeline = make_pipeline(
    CleanData()
)

In [12]:
clean_trades = clean_pipeline.fit_transform(trades)
clean_trades.head()



Unnamed: 0,Unnamed: 0.1,Exchange,Symbol,Trade_Volume,Trade_Price,Sale_Condition,Source_of_Trade,Trade_Stop_Stock_Indicator,Trade_Correction_Indicator,Sequence_Number,Trade_Id,Trade_Reporting_Facility,Participant_Timestamp,Trade_Reporting_Facility_TRF_Timestamp,Trade_Through_Exempt_Indicator
2020-01-03 09:15:00.424794,9328,P,AAPL,90,297.49,@FTI,N,,0,27834,2698,,2020-01-03 09:15:00.424794,,1
2020-01-03 09:15:00.596884,9329,K,AAPL,52,297.4,@ TI,N,,0,27835,1702,,2020-01-03 09:15:00.596884,,0
2020-01-03 09:15:00.596884,9330,K,AAPL,2,297.4,@ TI,N,,0,27836,1703,,2020-01-03 09:15:00.596884,,0
2020-01-03 09:15:00.596884,9331,K,AAPL,10,297.4,@ TI,N,,0,27837,1704,,2020-01-03 09:15:00.596884,,0
2020-01-03 09:15:00.596884,9332,K,AAPL,2,297.4,@ TI,N,,0,27838,1705,,2020-01-03 09:15:00.596884,,0


In [13]:
clean_quotes = clean_pipeline.fit_transform(quotes)
clean_quotes.head()



Unnamed: 0,Unnamed: 0.1,Exchange,Symbol,Bid_Price,Bid_Size,Offer_Price,Offer_Size,Quote_Condition,Sequence_Number,National_BBO_Indicator,FINRA_BBO_Indicator,FINRA_ADF_MPID_Indicator,Quote_Cancel_Correction,Source_Of_Quote,Retail_Interest_Indicator,Short_Sale_Restriction_Indicator,LULD_BBO_Indicator,SIP_Generated_Message_Identifier,NBBO_LULD_Indicator,Participant_Timestamp,FINRA_ADF_Timestamp,FINRA_ADF_Market_Participant_Quote_Indicator,Security_Status_Indicator
2020-01-03 09:15:00.126824,18992,P,AAPL,297.31,3.0,297.5,2.0,R,255901,2,,,,N,,0,,,,2020-01-03 09:15:00.126824,,,
2020-01-03 09:15:00.596884,18993,K,AAPL,297.4,1.0,297.65,10.0,R,255944,0,,,,N,,0,,,,2020-01-03 09:15:00.596884,,,
2020-01-03 09:15:00.596884,18994,K,AAPL,297.4,1.0,297.65,10.0,R,255945,0,,,,N,,0,,,,2020-01-03 09:15:00.596884,,,
2020-01-03 09:15:00.596884,18995,K,AAPL,297.38,2.0,297.65,10.0,R,255946,0,,,,N,,0,,,,2020-01-03 09:15:00.596884,,,
2020-01-03 09:15:01.147613,18996,Q,AAPL,297.41,1.0,297.76,1.0,R,255970,2,,,,N,,0,,,,2020-01-03 09:15:01.147613,,,


### 1.4 Reconstructing Events

In [16]:
clean_trades['Is_Quote'] = False
clean_quotes['Is_Quote'] = True
trade_features = ['Symbol', 'Trade_Volume', 'Trade_Price', 'Trade_Id', 'Trade_Reporting_Facility', 'Participant_Timestamp', 'Is_Quote']
quote_features = ['Symbol', 'Bid_Price', 'Bid_Size', 'Offer_Price', 'Offer_Size', 'Participant_Timestamp', 'Is_Quote']

In [17]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
all_events = pd.concat([clean_trades[trade_features], clean_quotes[quote_features]], ignore_index=True)
all_events.index.name = "RID"
all_events = all_events.sort_values(by=['Participant_Timestamp', all_events.index.name])
all_events.head(10)

RID,Symbol,Trade_Volume,Trade_Price,Trade_Id,Trade_Reporting_Facility,Participant_Timestamp,Is_Quote,Bid_Price,Bid_Size,Offer_Price,Offer_Size
247201,AAPL,,,,,2020-01-03 09:15:00.126824,True,297.31,3.0,297.5,2.0
0,AAPL,90.0,297.49,2698.0,,2020-01-03 09:15:00.424794,False,,,,
1,AAPL,52.0,297.4,1702.0,,2020-01-03 09:15:00.596884,False,,,,
2,AAPL,2.0,297.4,1703.0,,2020-01-03 09:15:00.596884,False,,,,
3,AAPL,10.0,297.4,1704.0,,2020-01-03 09:15:00.596884,False,,,,
4,AAPL,2.0,297.4,1705.0,,2020-01-03 09:15:00.596884,False,,,,
5,AAPL,94.0,297.4,1706.0,,2020-01-03 09:15:00.596884,False,,,,
247202,AAPL,,,,,2020-01-03 09:15:00.596884,True,297.4,1.0,297.65,10.0
247203,AAPL,,,,,2020-01-03 09:15:00.596884,True,297.4,1.0,297.65,10.0
247204,AAPL,,,,,2020-01-03 09:15:00.596884,True,297.38,2.0,297.65,10.0


### 1.5 Preprocess Data

All preprocessing steps are implemented according to the paper: The Participant Timestamp: Get The Most Out Of TAQ Data https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3984827 

Trade direction is assigned using the tick test, which is described in the paper: Inferring Trade Direction from Intraday Data by Charles M. C. Lee, Mark J. Ready https://www.jstor.org/stable/2328845
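
The tick test can be sketched as follows. This is a minimal version under stated assumptions, not the code inside `PreprocessData`: a trade is signed +1 if its price is above the previous trade price, -1 if below, and a zero tick inherits the sign of the last non-zero price change. The helper name `tick_test` is hypothetical, and the first trade stays unsigned (NaN) because it has no predecessor.

```python
import numpy as np
import pandas as pd


def tick_test(prices: pd.Series) -> pd.Series:
    """Sign trades by comparing each price with the previous trade price."""
    change = prices.diff()
    sign = np.sign(change)
    # Zero ticks carry forward the sign of the last non-zero change.
    return sign.replace(0.0, np.nan).ffill()


# Prices made up for illustration.
prices = pd.Series([297.49, 297.40, 297.40, 297.45, 297.45])
signs = tick_test(prices)
print(signs.tolist())
```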

#### Labeling valid quotes using the AND (&) operator

|Is_Quote|valid_quotes|Desired|
|---|---|---|
|False|False|False|
|False|True|False|
|True|False|False|
|True|True|True|
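
The truth table above corresponds to an element-wise `&` between two boolean columns in pandas. The sketch below rebuilds the table on a toy frame; the column names follow the table.

```python
import pandas as pd

truth = pd.DataFrame({
    "Is_Quote": [False, False, True, True],
    "valid_quotes": [False, True, False, True],
})

# Element-wise AND: a row is kept only if it is a quote and the quote is valid.
truth["Desired"] = truth["Is_Quote"] & truth["valid_quotes"]
print(truth)
```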

In [18]:
preprocess_pipeline = make_pipeline(
    PreprocessData()
)

In [19]:
df_prepared = preprocess_pipeline.fit_transform(all_events)
df_prepared.head(20)

RID,Symbol,Trade_Volume,Trade_Price,Trade_Id,Trade_Reporting_Facility,Participant_Timestamp,Is_Quote,Bid_Price,Bid_Size,Offer_Price,Offer_Size,MOX,Valid_Quotes,Trade_Sign,Participant_Timestamp_f
247201,AAPL,,,,,2020-01-03 09:15:00.126824,True,297.31,3.0,297.5,2.0,0,True,,1578043000.0
0,AAPL,90.0,297.49,2698.0,,2020-01-03 09:15:00.424794,False,,,,,1,False,1.0,1578043000.0
1,AAPL,52.0,297.4,1702.0,,2020-01-03 09:15:00.596884,False,,,,,2,False,-1.0,1578043000.0
2,AAPL,2.0,297.4,1703.0,,2020-01-03 09:15:00.596884,False,,,,,2,False,-1.0,1578043000.0
3,AAPL,10.0,297.4,1704.0,,2020-01-03 09:15:00.596884,False,,,,,2,False,-1.0,1578043000.0
4,AAPL,2.0,297.4,1705.0,,2020-01-03 09:15:00.596884,False,,,,,2,False,-1.0,1578043000.0
5,AAPL,94.0,297.4,1706.0,,2020-01-03 09:15:00.596884,False,,,,,2,False,-1.0,1578043000.0
247202,AAPL,,,,,2020-01-03 09:15:00.596884,True,297.4,1.0,297.65,10.0,2,False,,1578043000.0
247203,AAPL,,,,,2020-01-03 09:15:00.596884,True,297.4,1.0,297.65,10.0,2,False,,1578043000.0
247204,AAPL,,,,,2020-01-03 09:15:00.596884,True,297.38,2.0,297.65,10.0,2,True,,1578043000.0


In [20]:
df_prepared[df_prepared['MOX'] == 11]

RID,Symbol,Trade_Volume,Trade_Price,Trade_Id,Trade_Reporting_Facility,Participant_Timestamp,Is_Quote,Bid_Price,Bid_Size,Offer_Price,Offer_Size,MOX,Valid_Quotes,Trade_Sign,Participant_Timestamp_f
247211,AAPL,,,,,2020-01-03 09:15:05.655102,True,297.41,4.0,297.65,10.0,11,True,,1578043000.0


In [21]:
# import the set_config function from sklearn
from sklearn import set_config

# set the display option for sklearn to 'diagram'
set_config(display='diagram')

# display the pipeline 
preprocess_pipeline

## 2. Feature Generating Pipeline

## 2.1 Generating Features

#### 2.1.1

#### 2.1.2 Return and Imbalance

#### 2.1.3

## 2.2 Pipeline

### 2.2.2 Feature Generation Pipeline

In [22]:
# df_test = df_prepared.copy()[:40000]
# df_test['Trade_Volume'] = df_test['Trade_Volume'].apply(lambda t: t if not np.isnan(t) else 0)
# df_test.reset_index(drop=True, inplace=True)
# df_test.head()

Unnamed: 0,Symbol,Trade_Volume,Trade_Price,Trade_Id,Trade_Reporting_Facility,Participant_Timestamp,Is_Quote,Bid_Price,Bid_Size,Offer_Price,Offer_Size,MOX,Valid_Quotes,Trade_Sign,Participant_Timestamp_f
0,AAPL,0.0,,,,2020-01-03 09:15:00.126824,True,297.31,3.0,297.5,2.0,0,True,,1578043000.0
1,AAPL,90.0,297.49,2698.0,,2020-01-03 09:15:00.424794,False,,,,,1,False,1.0,1578043000.0
2,AAPL,52.0,297.4,1702.0,,2020-01-03 09:15:00.596884,False,,,,,2,False,-1.0,1578043000.0
3,AAPL,2.0,297.4,1703.0,,2020-01-03 09:15:00.596884,False,,,,,2,False,-1.0,1578043000.0
4,AAPL,10.0,297.4,1704.0,,2020-01-03 09:15:00.596884,False,,,,,2,False,-1.0,1578043000.0


In [23]:
# RETURN_CALENDAR_SPAN = 5
# RETURN_TRANSACTION_SPAN = None
# RETURN_VOLUME_SPAN = None

# CALENDAR_DELTAS = [(0,.1),(.1,.2),(.2,.4),(.4,.8),(.8,1.6),(1.6,3.2),(3.2,6.4),(6.4,12.8),(12.8,25.6)]
# TRANSACTION_DELTAS = [(0,1),(1,2),(2,4),(4,8),(8,16),(16,32),(32,64),(64,128),(128,256)]
# VOLUME_DELTAS = [(0,100),(100,200),(200,400),(400,800),(800,1600),(1600,3200),(3200,6400),(6400,12800),(12800,25600)]

In [24]:
# df_test = parent_generator_ret_imb(df_test, CALENDAR_DELTAS[:1])

In [25]:
# params = {
#     'return_span': RETURN_CALENDAR_SPAN,
#     'clock_mode': 'calendar',
#     'deltas': CALENDAR_DELTAS
# }

In [None]:
# feature_pipeline = make_pipeline(
#     FeatureGeneration()
# )
# df = feature_pipeline.fit_transform(df_test, params)