In [1]:
import sklearn
import pandas as pd
import numpy as np
from datetime import datetime,timedelta
import collections
from itertools import chain
import matplotlib.pyplot as plt
from pathlib import Path

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn import set_config
from sklearn.compose import make_column_selector, make_column_transformer

In [3]:
import warnings
warnings.filterwarnings('ignore')

# 0. Machine Learning Basics

"Machine Learning is the study of computer algorithms that improve automatically through experience." (Tom Mitchell, Machine Learning, McGraw Hill, 1997)

#### Different types of machine learning:
- Supervised Learning
    - e.g. Regression, Classification (including Decision Trees)
- Unsupervised Learning
    - e.g. Clustering, Dimensionality Reduction
- Semi-Supervised Learning
- Reinforcement Learning

Interview Question: What's the difference between supervised and unsupervised learning?

#### A Typical Machine Learning Pipeline:
<img src="images/machine_learning_pipeline.png" />

Interview Question: What is a training/validation/test set?
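
One way to make the training/validation/test distinction concrete is the following minimal sketch. It assumes a chronological 60/20/20 split, which is a common choice for time-ordered data such as trades and quotes because it keeps later observations out of the training set. The toy arrays and the split ratios are illustrative assumptions, not part of the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data: 100 time-ordered samples with 3 features (hypothetical)
X = np.arange(300).reshape(100, 3)
y = np.arange(100)

# shuffle=False keeps chronological order: the first 60% is used to fit the model
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, shuffle=False)

# split the remaining 40% in half:
# the validation set is used to tune hyperparameters and compare models,
# the test set is touched once, for the final performance estimate
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, shuffle=False)

print(len(X_train), len(X_val), len(X_test))
```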

# 1. Data

### 1.1 Load/Import Data

In [4]:
trades = pd.read_csv(r'D:\Junior\Summer Project\reabouttaqproject\AAPL_trades.csv')
quotes = pd.read_csv(r'D:\Junior\Summer Project\reabouttaqproject\AAPL_quotes.csv')

In [5]:
trades.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 283504 entries, 0 to 283503
Data columns (total 18 columns):
 #   Column                                  Non-Null Count   Dtype  
---  ------                                  --------------   -----  
 0   Unnamed: 0.1                            283504 non-null  int64  
 1   Unnamed: 0                              283504 non-null  int64  
 2   Time                                    283504 non-null  object 
 3   Date                                    283504 non-null  object 
 4   Exchange                                283504 non-null  object 
 5   Symbol                                  283504 non-null  object 
 6   Trade_Volume                            283504 non-null  int64  
 7   Trade_Price                             283504 non-null  float64
 8   Sale_Condition                          283504 non-null  object 
 9   Source_of_Trade                         283504 non-null  object 
 10  Trade_Stop_Stock_Indicator              0 no

|Trades Data|Description|
|---|---|
|Unnamed: 0 | dummy index |
|Time| Time the trade was published by SIP|
|Date| Date the trade was published |
|Exchange| The ID of the exchange where the trade took place|
|Symbol| Stock Symbol|
|Trade_Volume | The number of shares traded |
|Trade_Price | The share price of this trade |
|Sale_Condition | The special condition associated with the trade|
|Source_of_Trade | CTA/UTP |
|Trade_Stop_Stock_Indicator | CTA |
|Trade_Correction_Indicator |  |
|Sequence_Number | Message sequence number |
|Trade_Id | Identifier for tracking trades; unique per participant, per symbol, within a trading session |
|Trade_Reporting_Facility | The ID of the Trade Reporting Facility |
|Participant_Timestamp | Time when the trade was reported|
|Trade_Reporting_Facility_TRF_Timestamp | If from an Exchange or if the FINRA ADF does not have a proprietary quotation feed, then will be set to blank. If the FINRA ADF or a FINRA TRF provides a proprietary feed of its quotation information, then it’s set to be the time of the quotation|
|Trade_Through_Exempt_Indicator | Denotes whether or not a trade is exempt from Trade Through rules |

|Quotes Data|Description|
|---|---|
|Unnamed: 0 |  |
|Time| Time the quote was published by SIP|
|Exchange|The exchange that issued the quote |
|Symbol| Stock Symbol|
|Bid_Price | The highest price any buyer is willing to pay for shares of this security |
|Bid_Size | The maximum number of shares the highest bidder is willing to buy |
|Offer_Price |The lowest price any seller is willing to take for shares of this security |
|Offer_Size | The maximum number of shares available at the offer price|
|Quote_Condition | Determines whether a quote qualifies for the Best Bid and Best Offer calculation |
|Sequence_Number | Message sequence number |
|National_BBO_Indicator | The effect this quote has on the NBBO |
|FINRA_BBO_Indicator | Indicates the effect this quote has on the FINRA BBO |
|FINRA_ADF_MPID_Indicator | Denotes the type of appendage to be included |
|Quote_Cancel_Correction | Indicates that this record is a cancel or a correction of a previous quote|
|Source_Of_Quote | CTA or UTP |
|Retail_Interest_Indicator | Indicates the presence of Retail Price Improvement (RPI) interest between the Bid and the Offer |
|Short_Sale_Restriction_Indicator | Short Sale Restriction status |
|LULD_BBO_Indicator |  |
|SIP_Generated_Message_Identifier | Originator of the message |
|NBBO_LULD_Indicator | LULD Limit Price Band effect on the NBB and NBO |
|Participant_Timestamp | Time the quote was published by the Participant to the SIP |
|FINRA_ADF_Timestamp | A FINRA ADF- or a FINRA TRF-provided timestamp |
|FINRA_ADF_Market_Participant_Quote_Indicator | UTP - FINRA ADF Market Participant Quote Indicator representing the Top of book quotations for each FINRA ADF participant |
|Security_Status_Indicator |  |
|Date |  |
|YearMonth|  |

Note: All column information of trades and quotes data and valid entries for each column can be found at https://www.nyse.com/publicdocs/nyse/data/Daily_TAQ_Client_Spec_v3.0.pdf

In [6]:
pd.set_option('display.max_columns', None)

In [7]:
trades.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Time,Date,Exchange,Symbol,Trade_Volume,Trade_Price,Sale_Condition,Source_of_Trade,Trade_Stop_Stock_Indicator,Trade_Correction_Indicator,Sequence_Number,Trade_Id,Trade_Reporting_Facility,Participant_Timestamp,Trade_Reporting_Facility_TRF_Timestamp,Trade_Through_Exempt_Indicator
0,0,0,2020-01-02 04:00:00.064010,2020-01-02,P,AAPL,3801,295.05,@ T,N,,0,1185,1,,40000063617792,,1
1,1,1,2020-01-02 04:00:02.828485,2020-01-02,P,AAPL,1,295.08,@FTI,N,,0,1195,2,,40002828108800,,1
2,2,2,2020-01-02 04:00:06.250392,2020-01-02,Q,AAPL,6,295.25,@ TI,N,,0,1197,1,,40006250366823,,0
3,3,3,2020-01-02 04:00:06.429757,2020-01-02,P,AAPL,1,295.08,@ TI,N,,0,1198,3,,40006429377792,,0
4,4,4,2020-01-02 04:00:28.894835,2020-01-02,P,AAPL,3,295.1,@ TI,N,,0,1205,4,,40028894459136,,0


In [8]:
quotes.head(10)

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Time,Exchange,Symbol,Bid_Price,Bid_Size,Offer_Price,Offer_Size,Quote_Condition,Sequence_Number,National_BBO_Indicator,FINRA_BBO_Indicator,FINRA_ADF_MPID_Indicator,Quote_Cancel_Correction,Source_Of_Quote,Retail_Interest_Indicator,Short_Sale_Restriction_Indicator,LULD_BBO_Indicator,SIP_Generated_Message_Identifier,NBBO_LULD_Indicator,Participant_Timestamp,FINRA_ADF_Timestamp,FINRA_ADF_Market_Participant_Quote_Indicator,Security_Status_Indicator,Date,YearMonth
0,0,0,2020-01-02 04:00:00.065165,P,AAPL,278.0,7.0,0.0,0.0,R,2228,2,,,,N,,0,,,,40000064785664,,,,2020-01-02,202001
1,1,1,2020-01-02 04:00:00.065167,P,AAPL,278.0,14.0,0.0,0.0,R,2229,2,,,,N,,0,,,,40000064787456,,,,2020-01-02,202001
2,2,2,2020-01-02 04:00:00.065170,P,AAPL,293.72,9.0,0.0,0.0,R,2230,2,,,,N,,0,,,,40000064790784,,,,2020-01-02,202001
3,3,3,2020-01-02 04:00:00.065681,P,AAPL,293.72,9.0,327.56,1.0,R,2231,4,,,,N,,0,,,,40000065302272,,,,2020-01-02,202001
4,4,4,2020-01-02 04:00:00.065738,P,AAPL,293.72,9.0,320.0,1.0,R,2232,4,,,,N,,0,,,,40000065358592,,,,2020-01-02,202001
5,5,5,2020-01-02 04:00:00.065738,P,AAPL,293.72,9.0,310.0,1.0,R,2233,4,,,,N,,0,,,,40000065360384,,,,2020-01-02,202001
6,6,6,2020-01-02 04:00:00.065744,P,AAPL,293.72,9.0,300.0,1.0,R,2235,4,,,,N,,0,,,,40000065366528,,,,2020-01-02,202001
7,7,7,2020-01-02 04:00:00.065813,P,AAPL,293.72,9.0,299.97,5.0,R,2237,4,,,,N,,0,,,,40000065433856,,,,2020-01-02,202001
8,8,8,2020-01-02 04:00:00.065816,P,AAPL,293.72,9.0,295.88,5.0,R,2238,4,,,,N,,0,,,,40000065437440,,,,2020-01-02,202001
9,9,9,2020-01-02 04:00:00.068515,P,AAPL,295.0,1.0,295.88,5.0,R,2241,4,,,,N,,0,,,,40000068136192,,,,2020-01-02,202001


In [9]:
all_events = pd.concat([trades, quotes], axis=0)
len(trades), len(quotes), len(trades) + len(quotes)

(283504, 1925187, 2208691)

In [10]:
all_events.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2208691 entries, 0 to 1925186
Data columns (total 37 columns):
 #   Column                                        Dtype  
---  ------                                        -----  
 0   Unnamed: 0.1                                  int64  
 1   Unnamed: 0                                    int64  
 2   Time                                          object 
 3   Date                                          object 
 4   Exchange                                      object 
 5   Symbol                                        object 
 6   Trade_Volume                                  float64
 7   Trade_Price                                   float64
 8   Sale_Condition                                object 
 9   Source_of_Trade                               object 
 10  Trade_Stop_Stock_Indicator                    float64
 11  Trade_Correction_Indicator                    float64
 12  Sequence_Number                               int64  
 1

In [11]:
all_events.head(10)

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Time,Date,Exchange,Symbol,Trade_Volume,Trade_Price,Sale_Condition,Source_of_Trade,Trade_Stop_Stock_Indicator,Trade_Correction_Indicator,Sequence_Number,Trade_Id,Trade_Reporting_Facility,Participant_Timestamp,Trade_Reporting_Facility_TRF_Timestamp,Trade_Through_Exempt_Indicator,Bid_Price,Bid_Size,Offer_Price,Offer_Size,Quote_Condition,National_BBO_Indicator,FINRA_BBO_Indicator,FINRA_ADF_MPID_Indicator,Quote_Cancel_Correction,Source_Of_Quote,Retail_Interest_Indicator,Short_Sale_Restriction_Indicator,LULD_BBO_Indicator,SIP_Generated_Message_Identifier,NBBO_LULD_Indicator,FINRA_ADF_Timestamp,FINRA_ADF_Market_Participant_Quote_Indicator,Security_Status_Indicator,YearMonth
0,0,0,2020-01-02 04:00:00.064010,2020-01-02,P,AAPL,3801.0,295.05,@ T,N,,0.0,1185,1.0,,40000063617792,,1.0,,,,,,,,,,,,,,,,,,,
1,1,1,2020-01-02 04:00:02.828485,2020-01-02,P,AAPL,1.0,295.08,@FTI,N,,0.0,1195,2.0,,40002828108800,,1.0,,,,,,,,,,,,,,,,,,,
2,2,2,2020-01-02 04:00:06.250392,2020-01-02,Q,AAPL,6.0,295.25,@ TI,N,,0.0,1197,1.0,,40006250366823,,0.0,,,,,,,,,,,,,,,,,,,
3,3,3,2020-01-02 04:00:06.429757,2020-01-02,P,AAPL,1.0,295.08,@ TI,N,,0.0,1198,3.0,,40006429377792,,0.0,,,,,,,,,,,,,,,,,,,
4,4,4,2020-01-02 04:00:28.894835,2020-01-02,P,AAPL,3.0,295.1,@ TI,N,,0.0,1205,4.0,,40028894459136,,0.0,,,,,,,,,,,,,,,,,,,
5,5,5,2020-01-02 04:00:30.021361,2020-01-02,P,AAPL,2.0,295.1,@ TI,N,,0.0,1206,5.0,,40030020981248,,0.0,,,,,,,,,,,,,,,,,,,
6,6,6,2020-01-02 04:00:31.900055,2020-01-02,P,AAPL,7.0,295.1,@ TI,N,,0.0,1208,6.0,,40031899679744,,0.0,,,,,,,,,,,,,,,,,,,
7,7,7,2020-01-02 04:00:33.047715,2020-01-02,P,AAPL,5.0,295.1,@ TI,N,,0.0,1209,7.0,,40033047341056,,0.0,,,,,,,,,,,,,,,,,,,
8,8,8,2020-01-02 04:00:33.118294,2020-01-02,P,AAPL,5.0,295.1,@ TI,N,,0.0,1210,8.0,,40033117919744,,0.0,,,,,,,,,,,,,,,,,,,
9,9,9,2020-01-02 04:00:33.118809,2020-01-02,P,AAPL,10.0,295.1,@ TI,N,,0.0,1211,9.0,,40033118435584,,0.0,,,,,,,,,,,,,,,,,,,


### 1.2 Data Cleaning and Preprocessing

## SCIKIT-LEARN DESIGN

https://arxiv.org/pdf/1309.0238.pdf

Scikit-Learn’s API is remarkably well designed. These are the main design components of Scikit-Learn.

All objects share a consistent and simple interface:

### Estimators

Any object that can estimate some parameters based on a dataset is called an estimator (e.g., a SimpleImputer is an estimator). The estimation itself is performed by the fit() method, and it takes a dataset as a parameter, or two for supervised learning algorithms—the second dataset contains the labels. Any other parameter needed to guide the estimation process is considered a hyperparameter (such as a SimpleImputer’s strategy), and it must be set as an instance variable (generally via a constructor parameter).
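
As a small illustration of the estimator interface, the sketch below fits a `SimpleImputer` on a toy array (the data are made up for this example). The hyperparameter `strategy` is passed to the constructor, `fit()` performs the estimation, and the learned values are stored in the `statistics_` attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# toy dataset with missing values (hypothetical)
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# the hyperparameter is set through the constructor
imputer = SimpleImputer(strategy="median")

# fit() estimates one median per column, ignoring NaN
imputer.fit(X)

# learned parameters end with an underscore by scikit-learn convention
print(imputer.statistics_)
```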

### Transformers

Some estimators (such as a SimpleImputer) can also transform a dataset; these are called transformers. Once again, the API is simple: the transformation is performed by the transform() method with the dataset to transform as a parameter. It returns the transformed dataset. This transformation generally relies on the learned parameters, as is the case for a SimpleImputer. All transformers also have a convenience method called fit_transform(), which is equivalent to calling fit() and then transform() (but sometimes fit_transform() is optimized and runs much faster).
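
The transformer interface can be sketched the same way. In this toy example (made-up data), `fit_transform()` learns the column mean from the training array and fills its gaps in one call, while `transform()` later reuses that learned mean on new data without refitting.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# toy training data and new data (hypothetical)
X_train = np.array([[1.0], [np.nan], [3.0]])
X_new = np.array([[np.nan], [5.0]])

imputer = SimpleImputer(strategy="mean")

# fit() followed by transform(), in a single call
X_train_filled = imputer.fit_transform(X_train)

# transform() alone: the mean learned from X_train is applied to X_new
X_new_filled = imputer.transform(X_new)

print(X_train_filled.ravel(), X_new_filled.ravel())
```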


### Predictors

Finally, some estimators, given a dataset, are capable of making predictions; they are called predictors. For example, a LinearRegression model is a predictor: once fitted, it predicts a numeric target from the input features. A predictor has a predict() method that takes a dataset of new instances and returns a dataset of corresponding predictions. It also has a score() method that measures the quality of the predictions, given a test set (and the corresponding labels, in the case of supervised learning algorithms).
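
A minimal sketch of the predictor interface, using a toy linear relationship that is made up for this example. For regressors, `score()` returns the coefficient of determination (R²).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# toy data following y = 2x + 1 exactly (hypothetical)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 2 * X.ravel() + 1

# fit() takes the features and the labels
model = LinearRegression().fit(X, y)

# predict() takes new instances and returns predictions
pred = model.predict(np.array([[5.0]]))

# score() evaluates predictions against known labels (R^2 for regressors)
r2 = model.score(X, y)

print(pred, r2)
```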

### ...

Reference to the base classes for all estimators in scikit-learn can be found at: https://github.com/scikit-learn/scikit-learn/blob/9aaed4987/sklearn/base.py#L153

In [12]:
from sklearn.base import BaseEstimator, TransformerMixin, OneToOneFeatureMixin
from sortedcollections import OrderedSet
import time

In [13]:
class PreprocessData(BaseEstimator, TransformerMixin):
    
    def __init__(self, drop_after_hours=True, drop_irregular_hours=True):
        self.drop_after_hours = drop_after_hours
        self.drop_irregular_hours = drop_irregular_hours
    
    def fit(self, X, y=None):
        # stateless transformer: nothing is learned from the data
        return self
    
    def generate_mox_identifier(self, df):
        """Generate MOX Identifier: one integer per distinct participant timestamp,
        numbered in order of first appearance.
        """
        # get participant timestamps
        participant_timestamps = df.index
        # convert timestamps to float (milliseconds since the epoch)
        fl_participant_timestamps = [float(ts.timestamp()*1000) for ts in participant_timestamps]
        # generate unique index for each timestamp
        time_mox_mapping = {ts: mox_idx for mox_idx, ts in enumerate(OrderedSet(fl_participant_timestamps))}
        # generate the mox_identifiers
        mox_identifiers = [time_mox_mapping[t] for t in fl_participant_timestamps]

        df['MOX_Identifiers'] = mox_identifiers

        return df
    
    def transform(self, X):
        # drop helper columns; drop() returns a new frame, so the caller's data is not modified
        X = X.drop(columns=['Unnamed: 0', 'Time'], errors='ignore')
        
        # parse date and participant timestamp
        # (zero-padded to 15 digits: HHMMSS plus 9 fractional-second digits)
        X['Date'] = pd.to_datetime(X['Date'])
        X['Participant_Timestamp'] = pd.to_datetime(
            X["Participant_Timestamp"].astype(str).str.zfill(15), format="%H%M%S%f"
        )
        
        # combine date and time of day, then use the result as the index
        X["Participant_Timestamp"] = X["Date"] + X["Participant_Timestamp"].apply(
            lambda x: timedelta(hours=x.hour, minutes=x.minute, seconds=x.second, microseconds=x.microsecond)
        )
        X.index = X["Participant_Timestamp"].values
        
        # remove columns that are entirely NA
        X = X.dropna(axis=1, how="all")
        
        # remove invalid trades and quotes with a row-wise mask; dropping by index label
        # would also remove valid rows that share a timestamp with an invalid one
        invalid = (
            (X['Trade_Price'] < 0)
            | (X['Trade_Volume'] < 0)
            | (X['Trade_Reporting_Facility'] == 'D')
            | (X['Bid_Price'] < 0)
            | (X['Offer_Price'] < X['Bid_Price'])
        )
        X = X[~invalid]
        
        # drop after hours if specified (regular US trading hours are 09:30-16:00)
        if self.drop_after_hours:
            time_of_day = X.index.strftime("%H:%M:%S")
            X = X[(time_of_day >= "09:30:00") & (time_of_day <= "16:00:00")]
            
        # remove first and last 15 minutes of regular trading hours
        if self.drop_irregular_hours:
            time_of_day = X.index.strftime("%H:%M:%S")
            X = X[(time_of_day >= "09:45:00") & (time_of_day <= "15:45:00")]
        
        # sort data according to index
        X = X.sort_index()
        
        # assign MOX Identifiers
        X = self.generate_mox_identifier(X)
        
        return X

Reference: All cleaning steps, including the MOX Identifier, are implemented based on the paper "The Participant Timestamp: Get The Most Out Of TAQ Data": https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3984827
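
In the class above, the MOX identifier is simply an integer code per distinct participant timestamp, numbered in order of first appearance. The sketch below shows a vectorized way to produce the same kind of codes with `pd.factorize`. The function name `mox_identifiers` and the toy timestamps are assumptions for illustration; this is an alternative sketch, not the implementation from the paper.

```python
import pandas as pd

def mox_identifiers(index: pd.DatetimeIndex):
    """Return one integer code per distinct timestamp, in order of first appearance."""
    # factorize returns (codes, uniques); events with equal timestamps share a code
    codes, _ = pd.factorize(index)
    return codes

# toy sorted timestamps with repeats (hypothetical)
idx = pd.to_datetime([
    "2020-01-02 09:45:00.001",
    "2020-01-02 09:45:00.001",
    "2020-01-02 09:45:00.002",
    "2020-01-02 09:45:00.005",
    "2020-01-02 09:45:00.005",
])

mox = mox_identifiers(idx)
print(list(mox))
```

On a frame with millions of rows this avoids the Python-level loops over individual timestamps.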

In [14]:
process_pipeline = make_pipeline(
    PreprocessData()
)

In [15]:
df_clean = process_pipeline.fit_transform(all_events)

Visualize the pipeline

In [12]:
# import the set_config function from sklearn
from sklearn import set_config

# render estimators as diagrams in the notebook
set_config(display='diagram')

# display the pipeline 'process_pipeline'
process_pipeline


In [28]:
df_clean.head(20)

Unnamed: 0,Unnamed: 0.1,Date,Exchange,Symbol,Trade_Volume,Trade_Price,Sale_Condition,Source_of_Trade,Trade_Correction_Indicator,Sequence_Number,Trade_Id,Trade_Reporting_Facility,Participant_Timestamp,Trade_Reporting_Facility_TRF_Timestamp,Trade_Through_Exempt_Indicator,Bid_Price,Bid_Size,Offer_Price,Offer_Size,Quote_Condition,National_BBO_Indicator,Source_Of_Quote,Retail_Interest_Indicator,Short_Sale_Restriction_Indicator,SIP_Generated_Message_Identifier,NBBO_LULD_Indicator,Security_Status_Indicator,YearMonth,MOX_Identifiers
2020-01-02 09:45:00.001258,168494,2020-01-02,N,AAPL,,,,,,2025557,,,2020-01-02 09:45:00.001258,,,297.11,1.0,297.15,1.0,R,2.0,N,,0.0,,A,,202001.0,0
2020-01-02 09:45:00.001451,168495,2020-01-02,N,AAPL,,,,,,2025564,,,2020-01-02 09:45:00.001451,,,297.11,1.0,297.2,1.0,R,0.0,N,,0.0,,A,,202001.0,1
2020-01-02 09:45:00.001459,168496,2020-01-02,N,AAPL,,,,,,2025565,,,2020-01-02 09:45:00.001459,,,297.11,2.0,297.2,1.0,R,2.0,N,,0.0,,A,,202001.0,2
2020-01-02 09:45:00.001518,168497,2020-01-02,N,AAPL,,,,,,2025566,,,2020-01-02 09:45:00.001518,,,297.11,2.0,297.28,1.0,R,0.0,N,,0.0,,A,,202001.0,3
2020-01-02 09:45:00.001538,168498,2020-01-02,N,AAPL,,,,,,2025567,,,2020-01-02 09:45:00.001538,,,297.11,3.0,297.28,1.0,R,2.0,N,,0.0,,A,,202001.0,4
2020-01-02 09:45:00.012368,168499,2020-01-02,Z,AAPL,,,,,,2026234,,,2020-01-02 09:45:00.012368,,,297.08,2.0,297.13,2.0,R,2.0,N,,0.0,,A,,202001.0,5
2020-01-02 09:45:00.012486,168500,2020-01-02,N,AAPL,,,,,,2026243,,,2020-01-02 09:45:00.012486,,,297.11,2.0,297.28,1.0,R,2.0,N,,0.0,,A,,202001.0,6
2020-01-02 09:45:00.038245,168501,2020-01-02,K,AAPL,,,,,,2026679,,,2020-01-02 09:45:00.038245,,,297.04,1.0,297.13,1.0,R,0.0,N,,0.0,,A,,202001.0,7
2020-01-02 09:45:00.079739,168502,2020-01-02,K,AAPL,,,,,,2027059,,,2020-01-02 09:45:00.079739,,,297.04,1.0,297.13,2.0,R,0.0,N,,0.0,,A,,202001.0,8
2020-01-02 09:45:00.117069,168503,2020-01-02,Q,AAPL,,,,,,2027420,,,2020-01-02 09:45:00.117069,,,297.11,1.0,297.13,2.0,R,2.0,N,,0.0,,A,,202001.0,9


In [17]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1812932 entries, 2020-01-02 09:45:00.001258 to 2020-01-02 15:45:00.776189
Data columns (total 29 columns):
 #   Column                                  Dtype         
---  ------                                  -----         
 0   Unnamed: 0.1                            int64         
 1   Date                                    datetime64[ns]
 2   Exchange                                object        
 3   Symbol                                  object        
 4   Trade_Volume                            float64       
 5   Trade_Price                             float64       
 6   Sale_Condition                          object        
 7   Source_of_Trade                         object        
 8   Trade_Correction_Indicator              float64       
 9   Sequence_Number                         int64         
 10  Trade_Id                                float64       
 11  Trade_Reporting_Facility                object        


## 2. Feature Generation

In [19]:
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector, make_column_transformer

In [20]:
import sys
sys.path.insert(1, '../feature_generation')

In [21]:
from generators import parent_generator

In [22]:
raw_trade_features, raw_quote_features = df_clean.columns.values[:14], df_clean.columns.values[14:]
raw_trade_features

array(['Date', 'Exchange', 'Symbol', 'Trade_Volume', 'Trade_Price',
       'Sale_Condition', 'Source_of_Trade', 'Trade_Correction_Indicator',
       'Sequence_Number', 'Trade_Id', 'Trade_Reporting_Facility',
       'Participant_Timestamp', 'Trade_Reporting_Facility_TRF_Timestamp',
       'Trade_Through_Exempt_Indicator'], dtype=object)

In [23]:
raw_quote_features

array(['Bid_Price', 'Bid_Size', 'Offer_Price', 'Offer_Size',
       'Quote_Condition', 'National_BBO_Indicator', 'Source_Of_Quote',
       'Retail_Interest_Indicator', 'Short_Sale_Restriction_Indicator',
       'SIP_Generated_Message_Identifier', 'NBBO_LULD_Indicator',
       'Security_Status_Indicator', 'YearMonth', 'MOX_Identifiers'],
      dtype=object)

### 2.1 Some Features to Consider

#### References: 
-  How and When are High-Frequency Stock Returns Predictable?
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4095405

#### Features for Trades Data:
-  Trade Side (Tick Test)
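
The tick test infers trade direction from prices alone: a trade above the previous price is classified as a buy (+1), a trade below it as a sell (-1), and a trade at the same price inherits the previous classification. The sketch below is a minimal version under that rule. The function name `tick_test` and the sample prices are assumptions, and `parent_generator` may compute `Trade_Side` differently.

```python
import numpy as np
import pandas as pd

def tick_test(prices: pd.Series) -> pd.Series:
    """Classify trades as +1 (buy) or -1 (sell) from consecutive price changes."""
    # +1 for an uptick, -1 for a downtick, 0 for no change, NaN for the first trade
    tick = np.sign(prices.diff())
    # zero ticks inherit the most recent non-zero tick
    return tick.replace(0, np.nan).ffill()

# toy trade prices in time order (hypothetical)
prices = pd.Series([295.05, 295.08, 295.08, 295.25, 295.10, 295.10])
side = tick_test(prices)
print(side.tolist())
```

The first trade has no predecessor, so its side stays undefined (NaN) in this sketch.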

#### Features for Quotes Data:
-  $\small\text{Effective Spread} = \text{Offer Price} - \text{Bid Price}$ (as defined here, this is the quoted bid-ask spread)
-  $\small\text{Mid Price} = \large\frac{\text{(Offer Price + Bid Price)}}{2}$
-  $ \text{Microprice} = \large\frac{\text{Offer Price} \times \text{Offer Size} + \text{Bid Price} \times \text{Bid Size}}{\text{Offer Size} + \text{Bid Size}}$
-  $ \text{Imbalance} = \large\frac{\text{Bid Size}}{\text{Offer Size}} $
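
The four quote formulas above translate directly into column arithmetic. The sketch below applies them exactly as written in this list to a toy quotes frame. The sample values are made up, and the result names mirror `quote_features_to_generate`; the actual `parent_generator` implementation may differ.

```python
import pandas as pd

# toy quotes (hypothetical values)
quotes = pd.DataFrame({
    "Bid_Price": [100.0, 200.0],
    "Bid_Size": [3.0, 1.0],
    "Offer_Price": [102.0, 201.0],
    "Offer_Size": [1.0, 1.0],
})

bid, ask = quotes["Bid_Price"], quotes["Offer_Price"]
bid_sz, ask_sz = quotes["Bid_Size"], quotes["Offer_Size"]

features = pd.DataFrame({
    # Offer Price - Bid Price
    "Effective_Spread": ask - bid,
    # (Offer Price + Bid Price) / 2
    "Midprice": (ask + bid) / 2,
    # size-weighted average of the two prices, as in the formula above
    "Microprice": (ask * ask_sz + bid * bid_sz) / (ask_sz + bid_sz),
    # Bid Size / Offer Size
    "Imbalance": bid_sz / ask_sz,
})
print(features)
```

With equal bid and offer sizes (second row), the microprice coincides with the midprice.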

In [24]:
trade_features_to_generate = ["Trade_Side"]
# quote_features_to_generate = []
quote_features_to_generate = ["Effective_Spread", "Midprice", "Microprice", "Imbalance"]

In [25]:
class GenerateTradeFeatures(BaseEstimator, TransformerMixin):
    
    def __init__(self, features):
        self.features = features
        return
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        trade_features = X.columns
        for f in self.features:
            if f not in trade_features:
                X, _ = parent_generator(X, f)            
        return X
        

In [26]:
class GenerateQuoteFeatures(BaseEstimator, TransformerMixin):
    
    def __init__(self, features):
        self.features = features
        return
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        quote_features = X.columns
        for f in self.features:
            if f not in quote_features:
                X, _ = parent_generator(X, f)
        return X


In [27]:
trade_pipeline = make_pipeline(GenerateTradeFeatures(trade_features_to_generate))
quote_pipeline = make_pipeline(GenerateQuoteFeatures(quote_features_to_generate))

In [28]:
generating_features = make_column_transformer(
    (trade_pipeline, raw_trade_features),
    (quote_pipeline, raw_quote_features)
)

In [29]:
df_copy = df_clean.copy()

In [30]:
data_prepared = generating_features.fit_transform(df_copy)

In [36]:

set_config(display='diagram')

# display the column transformer 'generating_features'
generating_features


In [31]:
column_names = np.concatenate((raw_trade_features, trade_features_to_generate, \
                             raw_quote_features, quote_features_to_generate), axis=0)

In [32]:
data_prepared_fr = pd.DataFrame(
    data_prepared,
    # column names: raw and generated trade features, then raw and generated quote features
    columns=column_names,
    # keep the timestamp index of the cleaned data
    index=df_copy.index)

# display the first 20 rows
data_prepared_fr.head(20)

# Under Development

In [121]:
df_clean['Participant_Timestamp_f'] = df_clean['Participant_Timestamp'].apply(lambda t : t.timestamp())

In [136]:
test_copy=df_clean[:500].copy()

In [137]:
test_copy = test_copy.sort_values(by=['Participant_Timestamp'])
test_copy

Unnamed: 0,Unnamed: 0.1,Date,Exchange,Symbol,Trade_Volume,Trade_Price,Sale_Condition,Source_of_Trade,Trade_Correction_Indicator,Sequence_Number,Trade_Id,Trade_Reporting_Facility,Participant_Timestamp,Trade_Reporting_Facility_TRF_Timestamp,Trade_Through_Exempt_Indicator,Bid_Price,Bid_Size,Offer_Price,Offer_Size,Quote_Condition,National_BBO_Indicator,Source_Of_Quote,Retail_Interest_Indicator,Short_Sale_Restriction_Indicator,SIP_Generated_Message_Identifier,NBBO_LULD_Indicator,Security_Status_Indicator,YearMonth,MOX_Identifiers,Participant_Timestamp_f
2020-01-02 09:45:00.001258,168494,2020-01-02,N,AAPL,,,,,,2025557,,,2020-01-02 09:45:00.001258,,,297.11,1.00,297.15,1.00,R,2.00,N,,0.00,,A,,202001.00,0,1577958300.00
2020-01-02 09:45:00.001451,168495,2020-01-02,N,AAPL,,,,,,2025564,,,2020-01-02 09:45:00.001451,,,297.11,1.00,297.20,1.00,R,0.00,N,,0.00,,A,,202001.00,1,1577958300.00
2020-01-02 09:45:00.001459,168496,2020-01-02,N,AAPL,,,,,,2025565,,,2020-01-02 09:45:00.001459,,,297.11,2.00,297.20,1.00,R,2.00,N,,0.00,,A,,202001.00,2,1577958300.00
2020-01-02 09:45:00.001518,168497,2020-01-02,N,AAPL,,,,,,2025566,,,2020-01-02 09:45:00.001518,,,297.11,2.00,297.28,1.00,R,0.00,N,,0.00,,A,,202001.00,3,1577958300.00
2020-01-02 09:45:00.001538,168498,2020-01-02,N,AAPL,,,,,,2025567,,,2020-01-02 09:45:00.001538,,,297.11,3.00,297.28,1.00,R,2.00,N,,0.00,,A,,202001.00,4,1577958300.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-01-02 09:45:02.147815,168888,2020-01-02,N,AAPL,,,,,,2032666,,,2020-01-02 09:45:02.147815,,,297.17,1.00,297.24,1.00,R,0.00,N,,0.00,,A,,202001.00,423,1577958302.15
2020-01-02 09:45:02.155000,36127,2020-01-02,D,AAPL,40.00,297.20,@ I,N,0.00,210897,8248.00,Q,2020-01-02 09:45:02.155000,94502156344209.00,0.00,,,,,,,,,,,,,,424,1577958302.15
2020-01-02 09:45:02.156000,36128,2020-01-02,D,AAPL,100.00,297.19,@,N,0.00,210898,4645.00,N,2020-01-02 09:45:02.156000,94502157052890.00,0.00,,,,,,,,,,,,,,425,1577958302.16
2020-01-02 09:45:02.179873,168889,2020-01-02,Q,AAPL,,,,,,2032679,,,2020-01-02 09:45:02.179873,,,297.15,1.00,297.21,1.00,R,0.00,N,,0.00,,A,,202001.00,426,1577958302.18


In [148]:
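# volume traded strictly before each event's timestamp
# (one full scan per row, so O(n^2): only suitable for small samples)
start = test_copy['Participant_Timestamp_f'].iloc[0]
test_copy['Cumulative_Trade_Volume'] = test_copy['Participant_Timestamp_f'].apply(lambda t:
                                                    sum(test_copy.fillna(0)[test_copy['Participant_Timestamp_f'].between(start, t, inclusive='left')]['Trade_Volume']))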

test_copy['Cum_Trades'] = (test_copy.fillna(0)['Trade_Price'] != 0).cumsum()
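
The row-by-row computation of `Cumulative_Trade_Volume` rescans the whole frame for every event, which does not scale to the full dataset. The sketch below shows one vectorized way to get the same quantity, the volume traded strictly before each timestamp, assuming the data are sorted by time. The function name `cumulative_volume_before` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def cumulative_volume_before(ts: pd.Series, volume: pd.Series) -> pd.Series:
    """Total volume of all events with a strictly earlier timestamp."""
    vol = volume.fillna(0)
    # total volume at each distinct timestamp (groups come back sorted by timestamp)
    per_ts = vol.groupby(ts.to_numpy()).sum()
    # running total up to, but excluding, each timestamp
    before = per_ts.cumsum() - per_ts
    # map the per-timestamp result back onto every event
    return ts.map(before)

# toy events: quotes have NaN volume, two trades share timestamp 1.0 (hypothetical)
ts = pd.Series([0.0, 1.0, 1.0, 2.0, 3.0])
volume = pd.Series([np.nan, 40.0, 100.0, np.nan, 10.0])

result = cumulative_volume_before(ts, volume)
print(result.tolist())
```

Events that share a timestamp receive the same value, matching the `inclusive='left'` behaviour of the `between` call above.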

In [152]:
test_copy

Unnamed: 0,Unnamed: 0.1,Date,Exchange,Symbol,Trade_Volume,Trade_Price,Sale_Condition,Source_of_Trade,Trade_Correction_Indicator,Sequence_Number,Trade_Id,Trade_Reporting_Facility,Participant_Timestamp,Trade_Reporting_Facility_TRF_Timestamp,Trade_Through_Exempt_Indicator,Bid_Price,Bid_Size,Offer_Price,Offer_Size,Quote_Condition,National_BBO_Indicator,Source_Of_Quote,Retail_Interest_Indicator,Short_Sale_Restriction_Indicator,SIP_Generated_Message_Identifier,NBBO_LULD_Indicator,Security_Status_Indicator,YearMonth,MOX_Identifiers,Participant_Timestamp_f,Cumulative_Trade_Volume,Cum_Trades
2020-01-02 09:45:00.001258,168494,2020-01-02,N,AAPL,,,,,,2025557,,,2020-01-02 09:45:00.001258,,,297.11,1.00,297.15,1.00,R,2.00,N,,0.00,,A,,202001.00,0,1577958300.00,0.00,0
2020-01-02 09:45:00.001451,168495,2020-01-02,N,AAPL,,,,,,2025564,,,2020-01-02 09:45:00.001451,,,297.11,1.00,297.20,1.00,R,0.00,N,,0.00,,A,,202001.00,1,1577958300.00,0.00,0
2020-01-02 09:45:00.001459,168496,2020-01-02,N,AAPL,,,,,,2025565,,,2020-01-02 09:45:00.001459,,,297.11,2.00,297.20,1.00,R,2.00,N,,0.00,,A,,202001.00,2,1577958300.00,0.00,0
2020-01-02 09:45:00.001518,168497,2020-01-02,N,AAPL,,,,,,2025566,,,2020-01-02 09:45:00.001518,,,297.11,2.00,297.28,1.00,R,0.00,N,,0.00,,A,,202001.00,3,1577958300.00,0.00,0
2020-01-02 09:45:00.001538,168498,2020-01-02,N,AAPL,,,,,,2025567,,,2020-01-02 09:45:00.001538,,,297.11,3.00,297.28,1.00,R,2.00,N,,0.00,,A,,202001.00,4,1577958300.00,0.00,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-01-02 09:45:02.147815,168888,2020-01-02,N,AAPL,,,,,,2032666,,,2020-01-02 09:45:02.147815,,,297.17,1.00,297.24,1.00,R,0.00,N,,0.00,,A,,202001.00,423,1577958302.15,7263.00,101
2020-01-02 09:45:02.155000,36127,2020-01-02,D,AAPL,40.00,297.20,@ I,N,0.00,210897,8248.00,Q,2020-01-02 09:45:02.155000,94502156344209.00,0.00,,,,,,,,,,,,,,424,1577958302.15,7263.00,102
2020-01-02 09:45:02.156000,36128,2020-01-02,D,AAPL,100.00,297.19,@,N,0.00,210898,4645.00,N,2020-01-02 09:45:02.156000,94502157052890.00,0.00,,,,,,,,,,,,,,425,1577958302.16,7303.00,103
2020-01-02 09:45:02.179873,168889,2020-01-02,Q,AAPL,,,,,,2032679,,,2020-01-02 09:45:02.179873,,,297.15,1.00,297.21,1.00,R,0.00,N,,0.00,,A,,202001.00,426,1577958302.18,7403.00,103


In [None]:
df_clean['Cumulative_Trade_Volume'] = df_clean['Participant_Timestamp_f'].apply(lambda t:
                                                    sum(df_clean.fillna(0)[df_clean['Participant_Timestamp_f'].between(df_clean['Participant_Timestamp_f'].iloc[0], t, inclusive='left')]['Trade_Volume']))


In [31]:
df_clean

Unnamed: 0,Unnamed: 0.1,Date,Exchange,Symbol,Trade_Volume,Trade_Price,Sale_Condition,Source_of_Trade,Trade_Correction_Indicator,Sequence_Number,Trade_Id,Trade_Reporting_Facility,Participant_Timestamp,Trade_Reporting_Facility_TRF_Timestamp,Trade_Through_Exempt_Indicator,Bid_Price,Bid_Size,Offer_Price,Offer_Size,Quote_Condition,National_BBO_Indicator,Source_Of_Quote,Retail_Interest_Indicator,Short_Sale_Restriction_Indicator,SIP_Generated_Message_Identifier,NBBO_LULD_Indicator,Security_Status_Indicator,YearMonth,MOX_Identifiers,Seconds,Participant_Timestamp_f
2020-01-02 09:45:00.001258,168494,2020-01-02,N,AAPL,,,,,,2025557,,,2020-01-02 09:45:00.001258,,,297.11,1.0,297.15,1.0,R,2.0,N,,0.0,,A,,202001.0,0,35100.001258,1.577958e+09
2020-01-02 09:45:00.001451,168495,2020-01-02,N,AAPL,,,,,,2025564,,,2020-01-02 09:45:00.001451,,,297.11,1.0,297.20,1.0,R,0.0,N,,0.0,,A,,202001.0,1,35100.001451,1.577958e+09
2020-01-02 09:45:00.001459,168496,2020-01-02,N,AAPL,,,,,,2025565,,,2020-01-02 09:45:00.001459,,,297.11,2.0,297.20,1.0,R,2.0,N,,0.0,,A,,202001.0,2,35100.001459,1.577958e+09
2020-01-02 09:45:00.001518,168497,2020-01-02,N,AAPL,,,,,,2025566,,,2020-01-02 09:45:00.001518,,,297.11,2.0,297.28,1.0,R,0.0,N,,0.0,,A,,202001.0,3,35100.001518,1.577958e+09
2020-01-02 09:45:00.001538,168498,2020-01-02,N,AAPL,,,,,,2025567,,,2020-01-02 09:45:00.001538,,,297.11,3.0,297.28,1.0,R,2.0,N,,0.0,,A,,202001.0,4,35100.001538,1.577958e+09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-01-02 15:45:00.373000,249169,2020-01-02,D,AAPL,2.0,299.625,@ I,N,0.0,2759497,40363.0,N,2020-01-02 15:45:00.373000,1.545004e+14,0.0,,,,,,,,,,,,,,1675326,56700.373000,1.577980e+09
2020-01-02 15:45:00.601576,1769323,2020-01-02,Q,AAPL,,,,,,28324287,,,2020-01-02 15:45:00.601576,,,299.61,2.0,299.63,2.0,R,0.0,N,,0.0,,A,,202001.0,1675327,56700.601576,1.577980e+09
2020-01-02 15:45:00.775774,1769324,2020-01-02,P,AAPL,,,,,,28324664,,,2020-01-02 15:45:00.775774,,,299.62,2.0,299.63,1.0,R,2.0,N,,0.0,,A,,202001.0,1675328,56700.775774,1.577980e+09
2020-01-02 15:45:00.776092,1769325,2020-01-02,Z,AAPL,,,,,,28324670,,,2020-01-02 15:45:00.776092,,,299.62,1.0,299.64,1.0,R,2.0,N,,0.0,,A,,202001.0,1675329,56700.776092,1.577980e+09


In [16]:
#to compare timestamps and add or subtract offsets more easily, convert each timestamp into
#the number of seconds elapsed since midnight of the trading day
raw_time=df_clean['Participant_Timestamp']
raw_time = pd.DataFrame(raw_time)
#extract the time-of-day part of Participant_Timestamp as an 'HH:MM:SS.ffffff' string
raw_time['Hours'] = raw_time['Participant_Timestamp'].dt.strftime('%H:%M:%S.%f')

In [17]:
#convert the 'Hours' strings into timedelta format
raw_time['Hours'] = pd.to_timedelta(raw_time['Hours'])

In [18]:
#total_seconds() turns each timedelta in 'Hours' into seconds since midnight
raw_time['Seconds']=raw_time['Hours'].dt.total_seconds()

In [19]:
Seconds=raw_time['Seconds'].to_frame()
#append the 'Seconds' column to df_clean
df_clean=pd.concat([df_clean, Seconds], axis=1)
df_clean

Unnamed: 0,Unnamed: 0.1,Date,Exchange,Symbol,Trade_Volume,Trade_Price,Sale_Condition,Source_of_Trade,Trade_Correction_Indicator,Sequence_Number,Trade_Id,Trade_Reporting_Facility,Participant_Timestamp,Trade_Reporting_Facility_TRF_Timestamp,Trade_Through_Exempt_Indicator,Bid_Price,Bid_Size,Offer_Price,Offer_Size,Quote_Condition,National_BBO_Indicator,Source_Of_Quote,Retail_Interest_Indicator,Short_Sale_Restriction_Indicator,SIP_Generated_Message_Identifier,NBBO_LULD_Indicator,Security_Status_Indicator,YearMonth,MOX_Identifiers,Seconds
2020-01-02 09:45:00.001258,168494,2020-01-02,N,AAPL,,,,,,2025557,,,2020-01-02 09:45:00.001258,,,297.11,1.0,297.15,1.0,R,2.0,N,,0.0,,A,,202001.0,0,35100.001258
2020-01-02 09:45:00.001451,168495,2020-01-02,N,AAPL,,,,,,2025564,,,2020-01-02 09:45:00.001451,,,297.11,1.0,297.20,1.0,R,0.0,N,,0.0,,A,,202001.0,1,35100.001451
2020-01-02 09:45:00.001459,168496,2020-01-02,N,AAPL,,,,,,2025565,,,2020-01-02 09:45:00.001459,,,297.11,2.0,297.20,1.0,R,2.0,N,,0.0,,A,,202001.0,2,35100.001459
2020-01-02 09:45:00.001518,168497,2020-01-02,N,AAPL,,,,,,2025566,,,2020-01-02 09:45:00.001518,,,297.11,2.0,297.28,1.0,R,0.0,N,,0.0,,A,,202001.0,3,35100.001518
2020-01-02 09:45:00.001538,168498,2020-01-02,N,AAPL,,,,,,2025567,,,2020-01-02 09:45:00.001538,,,297.11,3.0,297.28,1.0,R,2.0,N,,0.0,,A,,202001.0,4,35100.001538
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-01-02 15:45:00.373000,249169,2020-01-02,D,AAPL,2.0,299.625,@ I,N,0.0,2759497,40363.0,N,2020-01-02 15:45:00.373000,1.545004e+14,0.0,,,,,,,,,,,,,,1675326,56700.373000
2020-01-02 15:45:00.601576,1769323,2020-01-02,Q,AAPL,,,,,,28324287,,,2020-01-02 15:45:00.601576,,,299.61,2.0,299.63,2.0,R,0.0,N,,0.0,,A,,202001.0,1675327,56700.601576
2020-01-02 15:45:00.775774,1769324,2020-01-02,P,AAPL,,,,,,28324664,,,2020-01-02 15:45:00.775774,,,299.62,2.0,299.63,1.0,R,2.0,N,,0.0,,A,,202001.0,1675328,56700.775774
2020-01-02 15:45:00.776092,1769325,2020-01-02,Z,AAPL,,,,,,28324670,,,2020-01-02 15:45:00.776092,,,299.62,1.0,299.64,1.0,R,2.0,N,,0.0,,A,,202001.0,1675329,56700.776092
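
The cells above reach `Seconds` through a string round-trip (`strftime`, then `to_timedelta`). A shorter route, sketched here on two made-up timestamps, subtracts midnight of the same day directly. The names `stamps` and `seconds` are illustrative and not part of the notebook.

```python
import pandas as pd

# Two hypothetical timestamps in the style of Participant_Timestamp.
stamps = pd.Series(pd.to_datetime([
    "2020-01-02 09:45:00.001258",
    "2020-01-02 15:45:00.373000",
]))

# normalize() floors each timestamp to midnight; the difference is a timedelta
# that total_seconds() converts into seconds since the start of the day.
seconds = (stamps - stamps.dt.normalize()).dt.total_seconds()
print(seconds.tolist())
```

Applied to `df_clean['Participant_Timestamp']`, this should give the same values as the three-step conversion, provided the column already has a datetime dtype.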


In [29]:
df_clean[df_clean['Seconds'].between(40000-3,40000,inclusive='left')]
#check that times earlier than the first observation are flagged as outside the sample
v=pd.Series([35099,35098])
v.between(df_clean['Seconds'].iloc[0],df_clean['Seconds'].iloc[-1],inclusive='left')==[False,False]

0    True
1    True
dtype: bool
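
The check above depends on how `Series.between` treats its endpoints. As a small sketch on toy numbers (the bounds 35100 and 56701 are simply the sample limits used later in this notebook), the four `inclusive` modes behave as follows, and `.all()` reduces an elementwise comparison to a single boolean that can be used in an `if` statement.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

# One boolean mask per endpoint convention.
masks = {
    mode: s.between(1.0, 3.0, inclusive=mode).tolist()
    for mode in ("both", "left", "right", "neither")
}
for mode, mask in masks.items():
    print(mode, mask)

# Comparing a Series with a list returns another Series; use .all() or .any()
# to obtain one truth value.
outside = pd.Series([35099, 35098])
all_outside = (~outside.between(35100, 56701, inclusive="left")).all()
print(all_outside)
```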

In [254]:
df_clean['Trade_Price'].count()
df_clean[df_clean['Seconds']<=40000]['Trade_Price'].count()

filtered_data = df_clean[df_clean['Seconds'] <= 40000]
counts = filtered_data['Trade_Price'].groupby(filtered_data['Seconds']).count()
reversed_counts = counts[::-1].cumsum()[::-1]
timestamps_below_limit = reversed_counts[(reversed_counts>=14)& (reversed_counts <= 23)].index.tolist()   

In [255]:

df_clean[(df_clean['Seconds']>=timestamps_below_limit[0])&(df_clean['Seconds']<=timestamps_below_limit[-1])]['Trade_Price'].count()

11

In [257]:
df_clean[(df_clean['Seconds']>=timestamps_below_limit[0])&(df_clean['Seconds']<=timestamps_below_limit[-1])]


Unnamed: 0,Unnamed: 0.1,Date,Exchange,Symbol,Trade_Volume,Trade_Price,Sale_Condition,Source_of_Trade,Trade_Correction_Indicator,Sequence_Number,Trade_Id,Trade_Reporting_Facility,Participant_Timestamp,Trade_Reporting_Facility_TRF_Timestamp,Trade_Through_Exempt_Indicator,Bid_Price,Bid_Size,Offer_Price,Offer_Size,Quote_Condition,National_BBO_Indicator,Source_Of_Quote,Retail_Interest_Indicator,Short_Sale_Restriction_Indicator,SIP_Generated_Message_Identifier,NBBO_LULD_Indicator,Security_Status_Indicator,YearMonth,MOX_Identifiers,Seconds
2020-01-02 11:06:38.731000,110546,2020-01-02,D,AAPL,1000.0,298.0642,@,N,0.0,933909,27065.0,Q,2020-01-02 11:06:38.731000,110638700000000.0,0.0,,,,,,,,,,,,,,674070,39998.731
2020-01-02 11:06:38.873000,110547,2020-01-02,D,AAPL,10.0,298.0646,@ I,N,0.0,933912,27066.0,Q,2020-01-02 11:06:38.873000,110638900000000.0,0.0,,,,,,,,,,,,,,674071,39998.873
2020-01-02 11:06:38.947000,110548,2020-01-02,D,AAPL,300.0,298.06,@,N,0.0,933913,27067.0,Q,2020-01-02 11:06:38.947000,110638900000000.0,0.0,,,,,,,,,,,,,,674072,39998.947
2020-01-02 11:06:38.947000,110549,2020-01-02,D,AAPL,100.0,298.06,@,N,0.0,933914,27068.0,Q,2020-01-02 11:06:38.947000,110639000000000.0,0.0,,,,,,,,,,,,,,674072,39998.947
2020-01-02 11:06:38.947426,817833,2020-01-02,P,AAPL,,,,,,10273588,,,2020-01-02 11:06:38.947426,,,298.05,1.0,298.07,2.0,R,2.0,N,,0.0,,A,,202001.0,674073,39998.947426
2020-01-02 11:06:38.947647,817834,2020-01-02,Z,AAPL,,,,,,10273589,,,2020-01-02 11:06:38.947647,,,298.05,2.0,298.08,6.0,R,0.0,N,,0.0,,A,,202001.0,674074,39998.947647
2020-01-02 11:06:38.947755,817835,2020-01-02,Z,AAPL,,,,,,10273590,,,2020-01-02 11:06:38.947755,,,298.05,1.0,298.08,6.0,R,0.0,N,,0.0,,A,,202001.0,674075,39998.947755
2020-01-02 11:06:38.948201,817837,2020-01-02,C,AAPL,,,,,,10273592,,,2020-01-02 11:06:38.948201,,,283.18,1.0,309.95,1.0,R,0.0,N,,0.0,,A,,202001.0,674076,39998.948201
2020-01-02 11:06:38.948351,817836,2020-01-02,Q,AAPL,,,,,,10273591,,,2020-01-02 11:06:38.948351,,,298.05,8.0,298.11,2.0,R,2.0,N,,0.0,,A,,202001.0,674077,39998.948351
2020-01-02 11:06:38.996000,110550,2020-01-02,D,AAPL,1.0,298.0699,@ I,N,0.0,933916,16038.0,N,2020-01-02 11:06:38.996000,110639000000000.0,0.0,,,,,,,,,,,,,,674078,39998.996
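
The reverse cumulative count used above can be hard to picture on the full tape. The following sketch repeats the same steps on a hypothetical six-row frame, where a missing `Trade_Price` marks a quote row as in `df_clean`.

```python
import pandas as pd

# Hypothetical mini tape; None in Trade_Price stands for a quote row.
toy = pd.DataFrame({
    "Seconds":     [1.0, 2.0, 2.0, 3.0, 4.0, 5.0],
    "Trade_Price": [10.0, 10.1, 10.1, None, 10.2, 10.3],
})

# Trades per timestamp (count() ignores missing prices).
counts = toy["Trade_Price"].groupby(toy["Seconds"]).count()

# Reverse, accumulate, reverse again: number of trades from each timestamp
# up to the end of the frame, the timestamp itself included.
reversed_counts = counts.iloc[::-1].cumsum().iloc[::-1]

# Timestamps from which between 2 and 4 trades remain.
selected = reversed_counts[(reversed_counts >= 2) & (reversed_counts <= 4)].index.tolist()
print(reversed_counts.tolist(), selected)
```

The sketch uses `.iloc[::-1]` so that the reversal is positional regardless of the index type.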


In [318]:
counts

Seconds
35100.001258    0
35100.001451    0
35100.001459    0
35100.001518    0
35100.001538    0
               ..
39999.923804    0
39999.923933    0
39999.939604    0
39999.981914    0
39999.982019    0
Name: Trade_Price, Length: 674156, dtype: int64

In [None]:
T = datetime(2020, 1, 2, 11, 15, 8, 263000)

In [81]:
def backwards(data, T, delta1, delta2, M):
    #first draft of the look-back window, working on POSIX timestamps ('Participant_Timestamp_f')
    T=T.timestamp()
    stamps=data['Participant_Timestamp_f']
    if delta1 > delta2:
        return print('Invalid input value of delta1 and delta2')

    if M=='calendar':
        #T, T-delta1 and T-delta2 must all fall inside the observed time range
        time=pd.Series([T,T-delta1,T-delta2])
        if not time.between(stamps.iloc[0],stamps.iloc[-1],inclusive='both').all():
            return print('Invalid Time Input')
        #rows whose timestamp lies in (T-delta2, T-delta1]
        backward_window=data[stamps.between(T-delta2,T-delta1,inclusive='right')]

    elif M in ('transaction','volume'):
        if not stamps.iloc[0] <= T <= stamps.iloc[-1]:
            return print('Invalid Time Input')
        if M=='transaction' and not (float(delta1).is_integer() and float(delta2).is_integer()):
            return print('Please Input delta1 and delta2 as Integers')
        #running trade count or running traded volume, depending on the clock
        column='Cum_Trades' if M=='transaction' else 'Cumulative_Trade_Volumn'
        #value of the running total at the last row at or before T
        latest=data.loc[stamps <= T, column].iloc[-1]
        if delta2 > latest:
            return print('Invalid input value of delta1 and delta2')
        #rows whose running total lies in (latest-delta2, latest-delta1]
        backward_window=data[data[column].between(latest-delta2,latest-delta1,inclusive='right')]

    else:
        return print('Invalid input value of M')
    return backward_window
        

In [366]:
#redefine the look-back function on the 'Seconds' column (seconds since midnight)
def backwards(data, T, delta1, delta2, M):
    #the sample runs from 35100 s (09:45:00) to just under 56701 s (15:45:01)
    if M=='calendar':
        #clock mode 'calendar': return all rows whose timestamp lies in (T-delta2, T-delta1]
        if T-delta2 < 35100:
            return print('Invalid input value of T and delta2')
        elif T-delta1 > 56701:
            return print('Invalid input value of T')
        elif delta1 > delta2:
            return print('Invalid input value of delta1 and delta2')
        else:
            backward_window=data[(data['Seconds'] > T - delta2) & (data['Seconds'] <= T-delta1)]

    elif M=='transaction':
        #clock mode 'transaction': find every timestamp t such that the number of transactions in [t, T]
        #lies in [delta1, delta2), and return the rows between the first and the last such timestamp
        if delta1<0:
            return print('Invalid input value of delta1')
        elif delta2<0:
            return print('Invalid input value of delta2')
        elif delta1 > delta2:
            return print('Invalid input value of delta1 and delta2')
        #delta1 and delta2 are trade counts here, so only T itself is checked against the sample range
        elif T > 56701:
            return print('Invalid input value of T')
        elif T < 35100:
            return print('Invalid input value of T')
        else:
            #select the part of the dataframe at or before the time threshold T
            filtered_data = data[data['Seconds'] <= T]
            #count the transactions at each timestamp
            counts = filtered_data['Trade_Price'].groupby(filtered_data['Seconds']).count()
            #accumulate in reverse order: number of transactions from each timestamp up to T
            reversed_counts = counts[::-1].cumsum()[::-1]
            #timestamps at which the number of transactions lies in [delta1, delta2)
            timestamps_between_limit = reversed_counts[(reversed_counts>=delta1) & (reversed_counts<=delta2-1)].index.tolist()
            if not timestamps_between_limit:
                #no timestamp qualifies: return an empty window
                return data.iloc[0:0]
            #select the rows between the first and the last qualifying timestamp
            backward_window= data[(data['Seconds']>=timestamps_between_limit[0])&(data['Seconds']<=timestamps_between_limit[-1])]

    elif M=='volume':
        #clock mode 'volume': find every timestamp t such that the total traded volume in [t, T]
        #lies in [delta1, delta2), and return the rows between the first and the last such timestamp
        if delta1<0:
            return print('Invalid input value of delta1')
        elif delta1 > delta2:
            return print('Invalid input value of delta1 and delta2')
        elif T > 56701:
            return print('Invalid input value of T')
        elif T < 35100:
            return print('Invalid input value of T')
        else:
            filtered_data = data[data['Seconds'] <= T]
            #running trade volume within each timestamp
            #note: combined with the reversed sum below, trades that share a timestamp are counted more than once
            cumulative_sum = filtered_data.groupby('Seconds')['Trade_Volume'].cumsum()
            #accumulate in reverse order: volume from each row up to T
            reversed_cumulative_sum = cumulative_sum[::-1].cumsum()[::-1]
            timestamps_between_limit = reversed_cumulative_sum[(reversed_cumulative_sum >= delta1) & (reversed_cumulative_sum < delta2)].index.tolist()
            if not timestamps_between_limit:
                return data.iloc[0:0]
            #the list holds index timestamps, so use the Participant_Timestamp column to locate the first and the last one
            backward_window= data[(data['Participant_Timestamp']>=timestamps_between_limit[0])&(data['Participant_Timestamp']<=timestamps_between_limit[-1])]

    else:
        return print('Invalid input value of M')
    return backward_window
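
To make the calendar-clock interval concrete, here is a minimal standalone sketch on toy data. `calendar_window` is a hypothetical helper, not the notebook's `backwards`, and it raises an exception on bad input instead of printing a message, which is an assumption about what a caller would prefer.

```python
import pandas as pd

def calendar_window(data, T, delta1, delta2):
    """Rows whose Seconds lie in (T - delta2, T - delta1]; hypothetical helper."""
    if not 0 <= delta1 <= delta2:
        raise ValueError("need 0 <= delta1 <= delta2")
    lower, upper = T - delta2, T - delta1
    return data[(data["Seconds"] > lower) & (data["Seconds"] <= upper)]

toy = pd.DataFrame({"Seconds": [10.0, 11.0, 12.0, 13.0, 14.0]})

# Look back from T=14 over the interval (14-3, 14-1] = (11, 13].
window = calendar_window(toy, T=14.0, delta1=1.0, delta2=3.0)
print(window["Seconds"].tolist())
```

Raising an exception keeps the return type uniform: the caller either receives a DataFrame or has to handle the error, rather than receiving `None` from a `print` call.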

In [346]:
filtered_data = df_clean[df_clean['Seconds'] < 40000]
cumulative_sum = filtered_data.groupby('Seconds')['Trade_Volume'].cumsum()
reversed_cumulative_sum = cumulative_sum[::-1].cumsum()[::-1]
timestamps_between_limit = reversed_cumulative_sum[(reversed_cumulative_sum >= 100) & (reversed_cumulative_sum < 350)].index.tolist()

In [347]:
timestamps_between_limit

[Timestamp('2020-01-02 11:06:39.870155'),
 Timestamp('2020-01-02 11:06:39.870167'),
 Timestamp('2020-01-02 11:06:39.874613'),
 Timestamp('2020-01-02 11:06:39.874711'),
 Timestamp('2020-01-02 11:06:39.874905'),
 Timestamp('2020-01-02 11:06:39.875000')]

In [356]:
example=df_clean[(df_clean['Participant_Timestamp']>=timestamps_between_limit[0])&(df_clean['Participant_Timestamp']<=timestamps_between_limit[-1])]
example

Unnamed: 0,Unnamed: 0.1,Date,Exchange,Symbol,Trade_Volume,Trade_Price,Sale_Condition,Source_of_Trade,Trade_Correction_Indicator,Sequence_Number,Trade_Id,Trade_Reporting_Facility,Participant_Timestamp,Trade_Reporting_Facility_TRF_Timestamp,Trade_Through_Exempt_Indicator,Bid_Price,Bid_Size,Offer_Price,Offer_Size,Quote_Condition,National_BBO_Indicator,Source_Of_Quote,Retail_Interest_Indicator,Short_Sale_Restriction_Indicator,SIP_Generated_Message_Identifier,NBBO_LULD_Indicator,Security_Status_Indicator,YearMonth,MOX_Identifiers,Seconds
2020-01-02 11:06:39.870155,110563,2020-01-02,Q,AAPL,39.0,298.07,@ I,N,0.0,933989,26052.0,,2020-01-02 11:06:39.870155,,0.0,,,,,,,,,,,,,,674114,39999.870155
2020-01-02 11:06:39.870157,817860,2020-01-02,Q,AAPL,,,,,,10274530,,,2020-01-02 11:06:39.870157,,,298.06,5.0,298.11,1.0,R,0.0,N,,0.0,,A,,202001.0,674115,39999.870157
2020-01-02 11:06:39.870167,110564,2020-01-02,Q,AAPL,39.0,298.07,@ I,N,0.0,933990,26053.0,,2020-01-02 11:06:39.870167,,0.0,,,,,,,,,,,,,,674116,39999.870167
2020-01-02 11:06:39.870180,817861,2020-01-02,Q,AAPL,,,,,,10274531,,,2020-01-02 11:06:39.870180,,,298.06,6.0,298.11,1.0,R,2.0,N,,0.0,,A,,202001.0,674117,39999.87018
2020-01-02 11:06:39.870202,817863,2020-01-02,X,AAPL,,,,,,10274533,,,2020-01-02 11:06:39.870202,,,298.06,1.0,298.11,1.0,R,0.0,N,,0.0,,A,,202001.0,674118,39999.870202
2020-01-02 11:06:39.870203,817865,2020-01-02,B,AAPL,,,,,,10274535,,,2020-01-02 11:06:39.870203,,,298.06,1.0,309.95,1.0,R,0.0,N,,0.0,,A,,202001.0,674119,39999.870203
2020-01-02 11:06:39.870216,817866,2020-01-02,Q,AAPL,,,,,,10274536,,,2020-01-02 11:06:39.870216,,,298.06,6.0,298.11,2.0,R,0.0,N,,0.0,,A,,202001.0,674120,39999.870216
2020-01-02 11:06:39.870390,817871,2020-01-02,N,AAPL,,,,,,10274542,,,2020-01-02 11:06:39.870390,,,298.06,5.0,298.11,3.0,R,0.0,N,,0.0,,A,,202001.0,674121,39999.87039
2020-01-02 11:06:39.870392,817872,2020-01-02,P,AAPL,,,,,,10274543,,,2020-01-02 11:06:39.870392,,,298.06,2.0,298.09,2.0,R,0.0,N,,0.0,,A,,202001.0,674122,39999.870392
2020-01-02 11:06:39.870411,817873,2020-01-02,N,AAPL,,,,,,10274544,,,2020-01-02 11:06:39.870411,,,298.06,5.0,298.1,2.0,R,0.0,N,,0.0,,A,,202001.0,674123,39999.870411


In [359]:
example['Trade_Volume'].count()

6

In [367]:
backwards(df_clean, 40000, 100, 350, 'volume')

Unnamed: 0,Unnamed: 0.1,Date,Exchange,Symbol,Trade_Volume,Trade_Price,Sale_Condition,Source_of_Trade,Trade_Correction_Indicator,Sequence_Number,Trade_Id,Trade_Reporting_Facility,Participant_Timestamp,Trade_Reporting_Facility_TRF_Timestamp,Trade_Through_Exempt_Indicator,Bid_Price,Bid_Size,Offer_Price,Offer_Size,Quote_Condition,National_BBO_Indicator,Source_Of_Quote,Retail_Interest_Indicator,Short_Sale_Restriction_Indicator,SIP_Generated_Message_Identifier,NBBO_LULD_Indicator,Security_Status_Indicator,YearMonth,MOX_Identifiers,Seconds
2020-01-02 11:06:39.870155,110563,2020-01-02,Q,AAPL,39.0,298.07,@ I,N,0.0,933989,26052.0,,2020-01-02 11:06:39.870155,,0.0,,,,,,,,,,,,,,674114,39999.870155
2020-01-02 11:06:39.870157,817860,2020-01-02,Q,AAPL,,,,,,10274530,,,2020-01-02 11:06:39.870157,,,298.06,5.0,298.11,1.0,R,0.0,N,,0.0,,A,,202001.0,674115,39999.870157
2020-01-02 11:06:39.870167,110564,2020-01-02,Q,AAPL,39.0,298.07,@ I,N,0.0,933990,26053.0,,2020-01-02 11:06:39.870167,,0.0,,,,,,,,,,,,,,674116,39999.870167
2020-01-02 11:06:39.870180,817861,2020-01-02,Q,AAPL,,,,,,10274531,,,2020-01-02 11:06:39.870180,,,298.06,6.0,298.11,1.0,R,2.0,N,,0.0,,A,,202001.0,674117,39999.87018
2020-01-02 11:06:39.870202,817863,2020-01-02,X,AAPL,,,,,,10274533,,,2020-01-02 11:06:39.870202,,,298.06,1.0,298.11,1.0,R,0.0,N,,0.0,,A,,202001.0,674118,39999.870202
2020-01-02 11:06:39.870203,817865,2020-01-02,B,AAPL,,,,,,10274535,,,2020-01-02 11:06:39.870203,,,298.06,1.0,309.95,1.0,R,0.0,N,,0.0,,A,,202001.0,674119,39999.870203
2020-01-02 11:06:39.870216,817866,2020-01-02,Q,AAPL,,,,,,10274536,,,2020-01-02 11:06:39.870216,,,298.06,6.0,298.11,2.0,R,0.0,N,,0.0,,A,,202001.0,674120,39999.870216
2020-01-02 11:06:39.870390,817871,2020-01-02,N,AAPL,,,,,,10274542,,,2020-01-02 11:06:39.870390,,,298.06,5.0,298.11,3.0,R,0.0,N,,0.0,,A,,202001.0,674121,39999.87039
2020-01-02 11:06:39.870392,817872,2020-01-02,P,AAPL,,,,,,10274543,,,2020-01-02 11:06:39.870392,,,298.06,2.0,298.09,2.0,R,0.0,N,,0.0,,A,,202001.0,674122,39999.870392
2020-01-02 11:06:39.870411,817873,2020-01-02,N,AAPL,,,,,,10274544,,,2020-01-02 11:06:39.870411,,,298.06,5.0,298.1,2.0,R,0.0,N,,0.0,,A,,202001.0,674123,39999.870411
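
The volume clock can also be sketched without the per-timestamp `groupby`, by accumulating `Trade_Volume` row by row from the end of the frame. This is an alternative under stated assumptions, not the notebook's implementation: the toy values are invented, quote rows are treated as zero volume, and the interval is the half-open `[100, 350)` used in the call above.

```python
import pandas as pd

# Hypothetical mini tape; None in Trade_Volume stands for a quote row.
toy = pd.DataFrame({
    "Seconds":      [1.0, 2.0, 2.0, 3.0, 4.0, 5.0],
    "Trade_Volume": [50.0, 300.0, 100.0, None, 40.0, 60.0],
})

# Quote rows contribute no volume.
volume = toy["Trade_Volume"].fillna(0.0)

# Volume traded from each row through the last row, the row itself included.
volume_to_end = volume.iloc[::-1].cumsum().iloc[::-1]

# Rows from which at least 100 and fewer than 350 shares remain to be traded.
in_range = (volume_to_end >= 100) & (volume_to_end < 350)

# Restrict to trade rows, as the notebook's NaN handling does implicitly.
trade_rows = toy.index[in_range & toy["Trade_Volume"].notna()].tolist()
print(volume_to_end.tolist(), trade_rows)
```

Because each row is accumulated once, two trades sharing a timestamp (rows 1 and 2 here) are not double-counted.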


In [114]:
from sklearn.base import BaseEstimator, TransformerMixin

class Volume_and_Duration(BaseEstimator, TransformerMixin):

    def __init__(self, X, T, delta1, delta2, M):
        self.X = X
        self.T = T
        self.delta1 = delta1
        self.delta2 = delta2
        self.M = M

    def fit(self, X, y=None):
        return self

    def window(self):
        #rows of X inside the look-back interval
        return backwards(self.X, self.T, self.delta1, self.delta2, self.M)

    def Breadth(self):
        #number of trades in the window
        return self.window()['Trade_Price'].count()

    def Inmediacy(self):
        #number of distinct timestamps per trade in the window
        return self.window()['Participant_Timestamp_f'].nunique() / self.Breadth()

    def VolumeAll(self):
        #total traded volume in the window
        return self.window()['Trade_Volume'].sum()

    def VolumeAvg(self):
        #average volume per trade
        return self.VolumeAll() / self.Breadth()

    def VolumeMax(self):
        #largest single trade in the window
        return self.window()['Trade_Volume'].max()


In [385]:
M='volume'
Volume_and_Duration(df_clean, 40000, 100, 350, M).Breadth()

6
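
The same volume features can be computed in a single pass over a window, which avoids calling `backwards` once per feature. The function below is a sketch with a hypothetical name (`volume_features`) and invented input rows; it assumes, as in `df_clean`, that quote rows have a missing `Trade_Price`.

```python
import pandas as pd

def volume_features(window):
    """Summarise the trades in a look-back window; hypothetical helper."""
    trades = window.dropna(subset=["Trade_Price"])
    breadth = len(trades)
    volume_all = trades["Trade_Volume"].sum()
    return {
        "Breadth": breadth,
        "VolumeAll": volume_all,
        # Guard against an empty window instead of dividing by zero.
        "VolumeAvg": volume_all / breadth if breadth else float("nan"),
        "VolumeMax": trades["Trade_Volume"].max(),
    }

# Invented window: two trades interleaved with two quote rows.
toy_window = pd.DataFrame({
    "Trade_Price":  [298.07, None, 298.07, None],
    "Trade_Volume": [39.0, None, 39.0, None],
})
features = volume_features(toy_window)
print(features)
```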

# Testing Rolling Windows

In [19]:
from lazypredict.Supervised import LazyRegressor
from sklearn import datasets
from sklearn.utils import shuffle

In [96]:
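#keep NBBO quotes (indicator 4) plus rows without an NBBO indicator whose sale condition is regular or missing,
#then drop corrected trades; '&' binds more tightly than '|', so the grouping is written out explicitly
is_nbbo_quote = df_clean['National_BBO_Indicator']==4
no_nbbo_flag = df_clean['National_BBO_Indicator'].isna()
regular_sale = (df_clean['Sale_Condition']=='@   ') | df_clean['Sale_Condition'].isna()
uncorrected = (df_clean['Trade_Correction_Indicator']==0) | df_clean['Trade_Correction_Indicator'].isna()
df_clean2 = df_clean[(is_nbbo_quote | (no_nbbo_flag & regular_sale)) & uncorrected].copy()
df_clean2.index=pd.to_datetime(df_clean2.index)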

cleanTrades = df_clean2[df_clean2['Trade_Price'].notna()]
cleanTradePrices = cleanTrades['Trade_Price'].to_frame()

sTradePrices = cleanTradePrices.groupby(pd.Grouper(freq='5s')).agg({"Trade_Price": ["first", "last", "max", "min", "mean"]})
sTradePrices.columns = ['Open', 'Close', 'High', 'Low', 'Average']

cleanTradeVolume = cleanTrades['Trade_Volume'].to_frame()
sTradeVolume = cleanTradeVolume.groupby(pd.Grouper(freq='5s')).agg({'Trade_Volume':['sum']})
sTradeVolume.columns = ['Volume']
sTrade = pd.concat([sTradePrices,sTradeVolume], axis=1)

nbbo_quotes = df_clean2[df_clean2['National_BBO_Indicator']==4]

nbbo_quotes_midprice = generate_midprice(nbbo_quotes)

quotesRelevantDataOnly = nbbo_quotes_midprice.drop(['Unnamed: 0.1', 'Date', 'Exchange', 'Symbol', 'Trade_Volume',
       'Trade_Price', 'Sale_Condition', 'Source_of_Trade',
      'Trade_Correction_Indicator', 'Sequence_Number', 'Trade_Id',
      'Trade_Reporting_Facility', 'Participant_Timestamp',
       'Trade_Reporting_Facility_TRF_Timestamp',
       'Trade_Through_Exempt_Indicator',
       'Quote_Condition',
       'National_BBO_Indicator', 'Source_Of_Quote',
       'Retail_Interest_Indicator', 'Short_Sale_Restriction_Indicator',
       'SIP_Generated_Message_Identifier', 'NBBO_LULD_Indicator',
       'Security_Status_Indicator', 'YearMonth', 'MOX_Identifiers'], axis=1)
quotesRelevantDataOnly.dropna(inplace=True)

#quoteStuff = quotesRelevantDataOnly.groupby(pd.Grouper(freq='15s')).agg({"Mid_Price": "last", "OFI": "first"})
quoteStuff = quotesRelevantDataOnly.groupby(pd.Grouper(freq='5s')).agg({"Mid_Price": "last", 'Bid_Price': 'mean', 'Bid_Size': 'sum', 'Offer_Price': 'mean', 'Offer_Size':'sum'})

inputData = pd.concat([sTrade, quoteStuff], axis=1)

#target: the last mid price of the next 5-second interval
outputData = inputData['Mid_Price'].shift(-1).rename('Price').to_frame()

finalData = pd.concat([outputData, inputData], axis=1)
finalData.dropna(inplace=True)
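
The row filter at the top of the cell above mixes `|` and `&`. In Python `&` binds more tightly than `|`, so the placement of parentheses decides which rows survive. A two-row illustration with made-up boolean masks:

```python
import pandas as pd

a = pd.Series([True, False])
b = pd.Series([False, True])
c = pd.Series([False, False])

# Without parentheses this is parsed as a | (b & c).
implicit = a | b & c

# Grouping the "or" first gives a different mask.
or_first = (a | b) & c

print(implicit.tolist(), or_first.tolist())
```

Naming each mask before combining them, or parenthesising every sub-expression, makes the intended grouping visible to the reader.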

In [100]:
finalData

Unnamed: 0,Price,Open,Close,High,Low,Average,Volume,Mid_Price,Bid_Price,Bid_Size,Offer_Price,Offer_Size
2020-01-02 09:45:00,297.16,297.13,297.26,297.26,297.13,297.19,8044.00,297.22,297.17,92.00,297.20,68.00
2020-01-02 09:45:05,297.25,297.25,297.15,297.25,297.15,297.21,9866.00,297.16,297.18,26.00,297.20,26.00
2020-01-02 09:45:10,297.23,297.18,297.26,297.26,297.16,297.21,9140.00,297.25,297.20,56.00,297.23,45.00
2020-01-02 09:45:15,297.25,297.26,297.20,297.28,297.20,297.26,4720.00,297.23,297.26,56.00,297.28,128.00
2020-01-02 09:45:20,297.18,297.20,297.25,297.27,297.18,297.22,2757.00,297.25,297.21,19.00,297.23,21.00
...,...,...,...,...,...,...,...,...,...,...,...,...
2020-01-02 15:44:30,299.60,299.57,299.58,299.61,299.57,299.58,15507.00,299.58,299.58,16.00,299.59,12.00
2020-01-02 15:44:35,299.67,299.59,299.61,299.65,299.58,299.61,11818.00,299.60,299.60,67.00,299.62,82.00
2020-01-02 15:44:40,299.62,299.60,299.66,299.68,299.59,299.64,7749.00,299.67,299.63,48.00,299.65,39.00
2020-01-02 15:44:45,299.62,299.66,299.63,299.66,299.63,299.64,6511.00,299.62,299.64,133.00,299.66,69.00
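
The bar construction can be checked on a handful of ticks. The sketch below uses invented prices and volumes, builds 5-second OHLC bars with `pd.Grouper`, and forms a next-bar target with `shift(-1)`. Using the close as target is only a stand-in here; the notebook uses the mid price.

```python
import pandas as pd

# Five invented ticks spanning two 5-second intervals.
idx = pd.to_datetime([
    "2020-01-02 09:45:00.5", "2020-01-02 09:45:02.0", "2020-01-02 09:45:04.9",
    "2020-01-02 09:45:06.0", "2020-01-02 09:45:09.0",
])
ticks = pd.DataFrame(
    {"Trade_Price": [10.0, 10.4, 10.2, 10.1, 10.3],
     "Trade_Volume": [100, 50, 25, 10, 20]},
    index=idx,
)

# Named aggregation yields flat column names directly.
bars = ticks.groupby(pd.Grouper(freq="5s")).agg(
    Open=("Trade_Price", "first"),
    Close=("Trade_Price", "last"),
    High=("Trade_Price", "max"),
    Low=("Trade_Price", "min"),
    Volume=("Trade_Volume", "sum"),
)

# Each bar is labelled with the following bar's close; the last bar has no target.
bars["Target"] = bars["Close"].shift(-1)
print(bars)
```

The final bar always ends up with a missing target, which is why a `dropna` after the shift removes the last interval of the day.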


In [101]:
test_copy=finalData.iloc[0:65]

In [102]:
test_copy

Unnamed: 0,Price,Open,Close,High,Low,Average,Volume,Mid_Price,Bid_Price,Bid_Size,Offer_Price,Offer_Size
2020-01-02 09:45:00,297.16,297.13,297.26,297.26,297.13,297.19,8044.00,297.22,297.17,92.00,297.20,68.00
2020-01-02 09:45:05,297.25,297.25,297.15,297.25,297.15,297.21,9866.00,297.16,297.18,26.00,297.20,26.00
2020-01-02 09:45:10,297.23,297.18,297.26,297.26,297.16,297.21,9140.00,297.25,297.20,56.00,297.23,45.00
2020-01-02 09:45:15,297.25,297.26,297.20,297.28,297.20,297.26,4720.00,297.23,297.26,56.00,297.28,128.00
2020-01-02 09:45:20,297.18,297.20,297.25,297.27,297.18,297.22,2757.00,297.25,297.21,19.00,297.23,21.00
...,...,...,...,...,...,...,...,...,...,...,...,...
2020-01-02 09:50:00,296.90,296.80,296.82,296.85,296.70,296.77,14839.00,296.80,296.78,234.00,296.80,106.00
2020-01-02 09:50:05,296.93,296.82,296.90,297.00,296.82,296.92,7955.00,296.90,296.92,63.00,296.94,56.00
2020-01-02 09:50:10,296.98,296.88,296.93,296.98,296.88,296.93,8320.00,296.93,296.92,38.00,296.95,67.00
2020-01-02 09:50:15,296.99,296.96,296.95,296.96,296.93,296.95,3221.00,296.98,296.93,85.00,296.97,107.00


In [103]:
train_size = int(len(test_copy) * 0.8)
X=test_copy.drop(['Price'],axis=1)
y=test_copy['Price']
X_train=np.array(X[:train_size])
X_test=np.array(X[train_size:])
y_train=np.array(y[:train_size])
y_test=np.array(y[train_size:])

In [112]:
reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
models, predictions = reg.fit(X_train, X_test, y_train, y_test)


In [113]:
predictions

Unnamed: 0_level_0,Adjusted R-Squared,R-Squared,RMSE,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
LassoLarsIC,-15.29,-0.36,0.1,0.01
PoissonRegressor,-15.75,-0.4,0.1,0.01
LassoCV,-16.06,-0.42,0.1,0.1
LassoLarsCV,-16.13,-0.43,0.1,0.02
LinearRegression,-16.23,-0.44,0.1,0.01
TransformedTargetRegressor,-16.23,-0.44,0.1,0.01
ElasticNetCV,-16.28,-0.44,0.1,0.09
Ridge,-16.34,-0.45,0.1,0.01
HuberRegressor,-17.45,-0.54,0.11,0.03
OrthogonalMatchingPursuit,-17.63,-0.55,0.11,0.01
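
The very negative Adjusted R-Squared values are largely a consequence of the tiny test set. Assuming the column follows the textbook formula $1 - (1 - R^2)\frac{n-1}{n-p-1}$ (an assumption, not checked against the lazypredict source), a test set of 13 rows with 11 features inflates any shortfall in $R^2$ by a factor of 12:

```python
def adjusted_r2(r2, n_samples, n_features):
    """Textbook adjusted R-squared; requires n_samples > n_features + 1."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("adjusted R-squared is undefined when n_samples <= n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof

n_rows = 65                              # rows kept in test_copy
n_test = n_rows - int(n_rows * 0.8)      # 13 rows held out
n_features = 11                          # all columns except 'Price'

value = adjusted_r2(-0.36, n_test, n_features)
print(round(value, 2))
```

With the rounded $R^2$ of -0.36 this gives about -15.3, in line with the first row of the table. When the number of features reaches or exceeds `n_samples - 1`, the denominator is zero or negative and the statistic stops being meaningful.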


In [45]:
window_size = 5

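def create_dataset(data, window_size):
    #build (window, label) pairs: each sample holds window_size consecutive rows,
    #and its label is the 'Price' of the row that follows the window
    x, y= [], []
    for i in range(len(data) - window_size - 1):
        window = data.iloc[i:(i + window_size)]
        x.append(window)
        y.append(data.iloc[i + window_size][['Price']])
       
    return np.array(x),np.array(y)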
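
A NumPy-only sketch of the same windowing idea, on invented data, makes the resulting shapes easy to verify. `make_windows` is a hypothetical helper; unlike `create_dataset` it iterates over `len(values) - window_size` start positions, so it also emits the final valid window. Whether the extra `- 1` in the notebook is intentional is not clear from the source.

```python
import numpy as np

def make_windows(values, target, window_size):
    """Flatten rows i..i+window_size-1 into one sample labelled target[i+window_size]."""
    n = len(values) - window_size
    x = np.stack([values[i:i + window_size].ravel() for i in range(n)])
    y = np.asarray(target[window_size:window_size + n])
    return x, y

values = np.arange(20, dtype=float).reshape(10, 2)   # 10 rows, 2 features
target = np.arange(10, dtype=float) * 100            # one label per row

x, y = make_windows(values, target, window_size=3)
print(x.shape, y.shape)
```

Each sample has `window_size * n_features` columns, so the feature count grows quickly with the window length while the number of samples shrinks.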

In [118]:
train_size = int(len(test_copy) * 0.8)

train = test_copy[:train_size]
train_x, train_y= create_dataset(train, window_size)

test = test_copy[train_size-window_size:]
test_x, test_y= create_dataset(test, window_size)

train_x = np.reshape(train_x, (train_x.shape[0], -1))
test_x = np.reshape(test_x, (test_x.shape[0], -1))

In [120]:
np.shape(train_x)

(46, 5, 12)

In [117]:
print(train_x)

[[297.16       297.13       297.2599     ...  19.         297.22571429
   21.        ]
 [297.245      297.2488     297.15       ...  44.         297.18791667
   51.        ]
 [297.23       297.175      297.26       ...  66.         297.22709677
   64.        ]
 ...
 [297.245      297.31       297.3        ...  33.         297.22636364
   46.        ]
 [297.205      297.31       297.2399     ...  72.         297.18956522
   33.        ]
 [297.225      297.22       297.205      ... 104.         297.1528
   64.        ]]


In [72]:
reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
models, predictions = reg.fit(train_x, test_x, train_y, test_y)

print(models)

LassoLarsIC model failed to execute
You are using LassoLarsIC in the case where the number of samples is smaller than the number of features. In this setting, getting a good estimate for the variance of the noise is not possible. Provide an estimate of the noise variance in the constructor.


RANSACRegressor model failed to execute
`min_samples` may not be larger than number of samples: n_samples = 46.


                                                             Adjusted R-Squared  \
Model                                                                             
Lars                          5813568927784277299665230373624936045217841152.00   
KernelRidge                                                          2746257.92   
GaussianProcessRegressor                                             2741713.62   
LinearSVR                                                            1957993.03   
MLPRegressor                                                         1693283.24   
SGDRegressor                                                             291.64   
PassiveAggressiveRegressor                                               106.72   
QuantileRegressor                                                          4.73   
LassoLars                                                                  4.65   
DummyRegressor                                                             4.65   
Elas


