<h2> Rules </h2>

<li>Output should follow the format specified below.</li>


<h3>Background Summary</h3>

An order-driven market is a financial market where all buyers and sellers display the prices at which they wish to buy or sell a particular security, as well as the amounts of the security they wish to buy or sell. In these markets, participants may submit limit orders or market orders.

In a limit order, you specify how much of the asset you want to buy or sell, and the price you want. If there are matching orders on the book (e.g. someone who wants to sell at the same price, or lower, as the price at which you want to buy), your order will be filled immediately. If not, your order will stay on the book until matching orders arrive (which could be never). It is also possible for a limit order to be only partially filled, if the counterparty wants to trade a smaller amount than you did. In that case the rest of the order remains on the book.

In a market order, you only specify how much of the asset you want to trade. Your order is then filled immediately at the best price currently available on the market. For instance, if you place a market buy order, you will be matched with the current lowest-priced sell order on the book. If that order is not large enough to completely fill yours, the next-lowest sell order will be used to fill some more of yours, and so on. (You are encouraged to go through the first suggested reading for a pictorial understanding of the order book and price dynamics.)
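
The walk through the sell side described above can be sketched in a few lines. This is only an illustration: the function name `fill_market_buy`, the `(price, size)` tuple layout and the sample book are assumptions made for this example, not part of the competition code.

```python
from typing import List, Tuple

def fill_market_buy(asks: List[Tuple[float, int]], quantity: int):
    """Fill a market buy order against resting sell orders (hypothetical helper).

    asks: (price, size) pairs of resting sell orders.
    Returns the list of (price, traded size) fills and the unfilled quantity.
    """
    fills = []
    remaining = quantity
    for price, size in sorted(asks):  # lowest-priced sell order first
        if remaining == 0:
            break
        traded = min(size, remaining)
        fills.append((price, traded))
        remaining -= traded
    return fills, remaining

# Hypothetical sell side of the book
book_asks = [(5015.0, 300), (5011.0, 200), (5012.0, 100)]
fills, unfilled = fill_market_buy(book_asks, 350)
print(fills, unfilled)
```

A buy order for 350 shares exhausts the two cheapest levels and takes the rest from the third, so a large market order pays progressively worse prices.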

In this competition, we use tick data. Tick data refers to any market data that shows the price and volume of every print. Additionally, changes to the state of the order book occur in the form of trades and quotes. A quote event occurs whenever the best bid or ask price is updated. A trade event takes place when shares are bought or sold.

The aim of this competition is to determine the relationship between recent order book events and the future stock price over a 30-second time horizon. A few factors explored in the literature for predicting price movements are:  
<li>Order arrival rate</li>
<li>Bid-ask spread</li>
<li>Order book imbalance</li>
<li>Trade volume @ Bid price vs Trade volume @ Ask price</li>

Certain factors, such as the current order book imbalance, tend to have good predictive power over very short time horizons (under 10-20 seconds), whereas other factors might matter more for time horizons longer than a minute.
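
Two of the factors listed above, the bid-ask spread and the order book imbalance, can be computed directly from quote data. The sketch below assumes quote columns named `bid`, `ask`, `bsize` and `asize`, and uses one common definition of imbalance (bid size minus ask size, divided by their sum); other definitions exist in the literature.

```python
import pandas as pd

# A few quotes in the style of the sample rows shown later in this notebook
quotes = pd.DataFrame({
    "bid":   [5008.0, 5008.0, 5008.0],
    "ask":   [5011.0, 5015.0, 5016.0],
    "bsize": [4816, 3327, 5045],
    "asize": [569, 616, 45],
})

# Bid-ask spread in price units
quotes["spread"] = quotes["ask"] - quotes["bid"]

# Imbalance in [-1, 1]: positive when more size rests on the bid than on the ask
quotes["imbalance"] = (quotes["bsize"] - quotes["asize"]) / (quotes["bsize"] + quotes["asize"])
print(quotes)
```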

Equity markets are very fast, and it is important to understand that multiple high-frequency events can occur in the same millisecond. Analysing and understanding the data is critical before applying machine learning models.

This problem is based on a real-life problem we work on. Another important point: trade and quote events will rarely share the same timestamp; usually the quote event timestamp precedes the trade event timestamp. Refer to the examples in the <a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.merge_asof.html">pandas merge_asof documentation</a> for how you could join this data.
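
As a minimal sketch of such a join, `pd.merge_asof` can attach to each trade the most recent quote at or before the trade time. The timestamps, prices and sizes below are made up for illustration; both frames must be sorted by the join key.

```python
import pandas as pd

# Hypothetical quote events
quotes = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2018-01-02 08:00:28.913",
        "2018-01-02 08:00:28.939",
        "2018-01-02 08:00:29.052",
    ]),
    "bid": [5008.0, 5008.0, 5008.0],
    "ask": [5011.0, 5016.0, 5015.0],
})

# Hypothetical trade events, none sharing a timestamp with a quote
trades = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2018-01-02 08:00:28.920",
        "2018-01-02 08:00:29.100",
    ]),
    "price": [5011.0, 5015.0],
    "size": [100, 250],
})

# direction="backward" picks the last quote whose timestamp is <= the trade timestamp
joined = pd.merge_asof(trades, quotes, on="datetime", direction="backward")
print(joined)
```

Swapping the two frames gives the opposite view: each quote paired with the last trade seen before it.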


<h2> Suggested Reading </h2>

Basic introduction to the limit order book
<li><a href="https://journal.r-project.org/archive/2011/RJ-2011-010/RJ-2011-010.pdf"> Analyzing an Electronic Limit Order Book </a></li>
<li><a href="https://www.amazon.in/Algorithmic-Trading-DMA-Introduction-Strategies/dp/0956399207">Algorithmic Trading and DMA</a></li>
<li><a href="https://www.quantstart.com/articles/high-frequency-trading-ii-limit-order-book">HFT - Limit Order Book</a></li>

<br/>
Advanced Topics 
<li><a href="http://www.personal.psu.edu/qxc2/research/jfuturesmarkets-2008.pdf"> The Information Content of an Open Limit Order Book</a></li>
<li><a href="http://eprints.maths.ox.ac.uk/1895/1/Darryl%20Shen%20%28for%20archive%29.pdf"> Order Imbalance Based Strategy in High Frequency Trading</a></li>


<h2>Problem Statement</h2>

The aim of the problem is to develop a forecasting model to predict a stock's short-term price movement. Such prediction models are widely used in algorithmic trading. Algorithmic trading, sometimes referred to as high-frequency trading in specific circumstances, is the use of automated systems to identify true (money-making) signals among massive amounts of data that capture the underlying stock dynamics. These models can be leveraged to develop profitable trading strategies (akin to hedge funds) to help investors and traders achieve better returns. Contestants are expected and encouraged to think of empirical models and heuristics in order to better predict the price evolution of the hypothetical stock.

<br/><br/>

<h2>Submission Instructions</h2>
<br/>The algorithm/model should be developed without changing the order of the steps in the submission. The steps below are listed in the order of code execution.
<ol>
<li>Install all required libraries</li>
<li>Parameters such as file names for in-sample data & out-sample data</li>
<li>Algo-specific parameters</li>
<li>Functions to load data & evaluate performance</li>
<li>Train the model using in-sample data
    <ul>
        <li>Trained model should be a <a href="https://docs.python.org/3/library/pickle.html">pickle</a> or a function with values</li>
    </ul>
</li>
    
<li>Predict with the trained model using out-sample data and evaluate the model performance</li>
</ol>
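
The two accepted forms of a trained model mentioned in step 5 can be illustrated together. In this sketch the "model" is just a dictionary of fitted values that is pickled to disk and reloaded, then used by a plain function; the parameter names, the file name `model.pkl` and the function `predict_mid` are hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical "trained model": fitted parameters stored as plain values
model = {"intercept": 0.0, "imbalance_coef": 0.5}

# Save the trained model to disk
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as fh:
    pickle.dump(model, fh)

# Later (for example in the out-sample step), load it back
with open(path, "rb") as fh:
    restored = pickle.load(fh)

def predict_mid(mid, imbalance, params):
    """Predict the future mid-price as the current mid plus a linear adjustment."""
    return mid + params["intercept"] + params["imbalance_coef"] * imbalance

print(predict_mid(5009.5, 1.0, restored))
```

The same `pickle.dump` and `pickle.load` calls work for fitted scikit-learn estimators, provided the library version used to load matches the one used to save.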

<b>Important Instructions</b>
<ol>
    <li>Cells that begin with "#[DONOTCHANGE]" shouldn't be changed</li>
    <li>Submission will not be considered if the program fails to run</li>
    <li>Tick frequency shouldn't be modified for out-sample data</li>
    <li>Only first-prediction will be considered for a mid-price until it changes [More detail below] </li>
    <li>Model code should be commented</li>
    <li>Include a brief description of the model (preferably under one page)</li>
</ol>

<br/><br/>
<b> First-Prediction of every mid-price for evaluation </b>

predMid is the predicted mid-price 30 seconds ahead. NA in predMid implies that the model doesn't have a prediction. Only valid predictions will be considered for evaluation.

<br/>


date|sym|bsize|bid|ask|asize|mid|predMid|ValidPrediction
-----|-----|-----|-----|-----|-----|-----|-----|-----
2018.01.02D08:00:28.913000000|BATS.L|4816|5008|5011|569|5009.5|5020|Yes
2018.01.02D08:00:28.913000000|BATS.L|3327|5008|5011|569|5009.5|5020|<b>No</b>
2018.01.02D08:00:28.913000000|BATS.L|3327|5008|5015|616|5011.5|5018|Yes
2018.01.02D08:00:28.917000000|BATS.L|3363|5008|5015|616|5011.5|NA|-
2018.01.02D08:00:28.939000000|BATS.L|5045|5008|5015|616|5011.5|5018|<b>No</b>
2018.01.02D08:00:28.939000000|BATS.L|5045|5008|5016|45|5012|5005|Yes
2018.01.02D08:00:29.028000000|BATS.L|1718|5008|5016|45|5012|5005|<b>No</b>
2018.01.02D08:00:29.052000000|BATS.L|1718|5008|5015|90|5011.5|NA|-
2018.01.02D08:00:29.052000000|BATS.L|1718|5008|5015|256|5011.5|5020|Yes
2018.01.02D08:00:29.052000000|BATS.L|1718|5008|5015|278|5011.5|5020|<b>No</b>



<br/><br/><br/>
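
The first-prediction rule illustrated in the table above can be reproduced with a short pandas sketch. It assumes a new group starts whenever the mid-price differs from the previous row, and that the first non-NA `predMid` in each group is the valid one; the column name `midChangeGroup` is used here for the group id.

```python
import numpy as np
import pandas as pd

# mid and predMid columns copied from the example table
df = pd.DataFrame({
    "mid":     [5009.5, 5009.5, 5011.5, 5011.5, 5011.5, 5012.0, 5012.0, 5011.5, 5011.5, 5011.5],
    "predMid": [5020, 5020, 5018, np.nan, 5018, 5005, 5005, np.nan, 5020, 5020],
})

# Start a new group every time the mid-price changes from the previous row
df["midChangeGroup"] = df["mid"].diff().ne(0).cumsum()

# Count non-NA predictions seen so far within each group
has_pred = df["predMid"].notna()
seen = has_pred.astype(int).groupby(df["midChangeGroup"]).cumsum()

# "-" for no prediction, "Yes" for the first prediction in a group, "No" otherwise
df["ValidPrediction"] = np.where(~has_pred, "-", np.where(seen == 1, "Yes", "No"))
print(df)
```

Note that the same mid-price value (5011.5) appears in two separate groups, because the mid moved to 5012 in between; each group gets its own first valid prediction.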

<h2>Data </h2>

In-sample data
<ul>
<li>trade_in.csv</li>
<li>quote_in.csv</li>
</ul>
Out-sample data
<ul>
<li>trade_out.csv</li>
<li>quote_out.csv</li>
</ul>

Your model will be evaluated on a different data set.

<b> Data Fields </b>
    
Variable Name|Description|Type|Example
-----|-----|-----|-----
datetime|Datetime of the event|Datetime in format yyyy.mm.ddDHH:MM:SS.fff|2018.02.10D10:20:20.100
sym|Stock ticker (RIC)|String|BP.L
price|Last trade price|Double|3.45
size|Last trade size|Integer|10000
bid|Current Bid Price|Double|3.45
ask|Current Ask Price|Double|3.5
bsize|Current Bid Size|Integer|4000
asize|Current Ask Size|Integer|5000
mid|0.5 * (bid + ask)|Double|3.475
predMid|Mid-price predicted by the model|Double|3.65
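
As a small sketch of how the fields above fit together, the example row can be parsed and its mid-price derived as follows. The format string assumes the dotted date layout shown in the table; if the CSV files use a different separator, the format string has to be adjusted to match.

```python
import pandas as pd

# One quote built from the example values in the table
raw = pd.DataFrame({
    "datetime": ["2018.02.10D10:20:20.100"],
    "bid": [3.45],
    "ask": [3.5],
})

# "D" is a literal separator between the date and the time
raw["datetime"] = pd.to_datetime(raw["datetime"], format="%Y.%m.%dD%H:%M:%S.%f")

# Mid-price as defined in the table: 0.5 * (bid + ask)
raw["mid"] = 0.5 * (raw["bid"] + raw["ask"])
print(raw)
```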

In [1]:
#1st step - Install all required libraries
#IMPORTANT : Install necessary libraries if not already present.

!pip install pandas
!pip install pytz
!pip install matplotlib
!pip install numpy
!pip install scipy
!pip install -U scikit-learn




You are using pip version 18.0, however version 18.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.






Collecting scikit-learn
  Using cached https://files.pythonhosted.org/packages/63/90/46872c58db4a924b794921dc6790f426ffaaf19feca9b5023d396963f175/scikit_learn-0.20.2-cp35-cp35m-win_amd64.whl
Installing collected packages: scikit-learn
  Found existing installation: scikit-learn 0.19.0


Cannot uninstall 'scikit-learn'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.


In [4]:
#[DONOTCHANGE]
#2nd step - all parameters
class Parameters(object):
    pass

param = Parameters()
param.tickSize = 0.5 #tick size is 0.5 GBp i.e. 0.005 GBP

param.fileDirectory = './intraday/'
param.trade_InSampleFile = 'trade_in.csv'
param.quote_InSampleFile = 'quote_in.csv'

param.trade_OutSampleFile = 'trade_out.csv'
param.quote_OutSampleFile = 'quote_out.csv'

In [5]:
#3rd step - Model specific parameters
#param.imbalanceThreshold = 0.7
#param.timeDuration = 30 #30 seconds

In [6]:
#[DONOTCHANGE]
#4th step - Functions to load data & evaluate performance

#Initialise libraries and functions
from sklearn.metrics import mean_squared_error
from math import sqrt

import os
import numpy as np
import pandas as pd

#Disable certain warnings
pd.options.mode.chained_assignment = None

#Identify future mid prices - 30 seconds duration
def IdentifyFutureMidPrices(df, predictionDuration = 30):
    futDat = df[['datetime', 'mid']].rename(columns={'mid':'futMid'})
    futDat['datetime'] = futDat['datetime'] - pd.offsets.timedelta(seconds=int(predictionDuration))
    return pd.merge_asof(df, futDat, on='datetime', direction='backward')

def ReadCSV(file):
    print('Loading file - ' + file)
    df = pd.read_csv(file)
    df['datetime'] = pd.to_datetime(df['datetime'], format="%Y-%m-%dD%H:%M:%S.%f")
    return df

#Load data
def LoadData(path, tradeFile, quoteFile):
    tradeFile = os.path.join(path, tradeFile)
    quoteFile = os.path.join(path, quoteFile)

    trade_df = ReadCSV(tradeFile)
    quote_df = ReadCSV(quoteFile)
    
    quote_df['mid'] = 0.5*(quote_df['bid'].copy() + quote_df['ask'].copy())
    quote_df['midChangeGroup'] = quote_df['mid'].diff().ne(0).cumsum()
    quote_df = IdentifyFutureMidPrices(quote_df)
    return trade_df, quote_df

#Evaluation function
#df should contain columns - datetime, sym, bsize, bid, ask, asize, predMid (model predicted mid-price)
#Function to evaluate results
def RMS(df):
    df = df.groupby(['midChangeGroup']).first().reset_index()
    tmp = df.dropna(subset=['predMid', 'futMid'])
    rms = sqrt(mean_squared_error(tmp['futMid'], tmp['predMid']))
    predCount = len(tmp['predMid'])
    print('RMS = %.4f. #Predictions = %s' % (rms, predCount))

In [39]:
#5th Step

from sklearn.metrics import precision_recall_fscore_support as prf
from sklearn.ensemble import BaggingClassifier, AdaBoostRegressor, BaggingRegressor, GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import ExtraTreeClassifier, ExtraTreeRegressor
import pickle
from collections import Counter

def my_Feature_Engineering(quote_df, trade_df):
    print('Starting myPred_1')

    df = pd.merge_asof(quote_df, trade_df, on='datetime') #attach the most recent trade at or before each quote (default direction='backward')
    print('Shape before: ', df.shape)
    df = df.set_index('datetime',drop=False)
    df['midma120'] = df['mid'].rolling('120s').mean()
    df['midma90'] = df['mid'].rolling('90s').mean()
    df['midma60'] = df['mid'].rolling('60s').mean()
    df['midma30'] = df['mid'].rolling('30s').mean()
    df['middiff1'] = df['midma30'] - df['midma90']
    df['askma120'] = df['ask'].rolling('120s').mean()
    df['askma90'] = df['ask'].rolling('90s').mean()
    df['askma60'] = df['ask'].rolling('60s').mean()
    df['askma30'] = df['ask'].rolling('30s').mean()
    df['askdiff1'] = df['askma30'] - df['askma90']
    df['asksizema120'] = df['asize'].rolling('120s').mean()
    df['asksizema90'] = df['asize'].rolling('90s').mean()
    df['asksizema60'] = df['asize'].rolling('60s').mean()
    df['asksizema30'] = df['asize'].rolling('30s').mean()
    df['asksizediff1'] = df['asksizema30'] - df['asksizema90']
    df['bidma120'] = df['bid'].rolling('120s').mean()
    df['bidma90'] = df['bid'].rolling('90s').mean()
    df['bidma60'] = df['bid'].rolling('60s').mean()
    df['bidma30'] = df['bid'].rolling('30s').mean()
    df['biddiff1'] = df['bidma30'] - df['bidma90']
    df['bidsizema120'] = df['bsize'].rolling('120s').mean()
    df['bidsizema90'] = df['bsize'].rolling('90s').mean()
    df['bidsizema60'] = df['bsize'].rolling('60s').mean()
    df['bidsizema30'] = df['bsize'].rolling('30s').mean()
    df['bidsizediff1'] = df['bidsizema30'] - df['bidsizema90']
    #dftemp = pd.DataFrame()
    df['length'] = df['midChangeGroup'].map(df['midChangeGroup'].value_counts())
    df = df.dropna(subset=['price', 'size', 'futMid']).reset_index(drop=True)
    df.drop(columns = ['sym_y'], inplace = True)
    df['time'] = df['datetime'].dt.time
    df['dayofweek'] = df['datetime'].dt.dayofweek
    df['date'] = pd.to_datetime(df['datetime'].dt.date)
    df['diff'] = df['datetime']-df['date']
    df['seconds'] = df['diff'].dt.seconds
    df = df.groupby(['midChangeGroup']).last().reset_index()
    df.drop(columns = ['diff', 'time', 'date'], inplace = True)

    print('Shape after: ', df.shape)
    print('Finished myPred_1')

    print('Starting myPred_2')
    print('Shape before: ', df.shape)
    #we have lost datetime now. Do not drop that in myPred_1 if required later.
    #df['seconds'].min()=28808 for IN data
    df['seconds'] = df['seconds'] - 28808 + 1

    #Label the direction of the 30-second mid-price move: 1 if futMid is above mid, -1 if below, 0 if unchanged
    df['futMidLabels'] = np.sign(df['futMid'] - df['mid']).astype('int32')
    df.drop(columns = ['sym_x'], inplace = True)
    df = df.drop(df[(df['size']>250)].index)
    df['imbalance'] = (df['bsize']-df['asize'])/(df['bsize']+df['asize'])
    df['Shareimbalance'] = (df['bsize'])/(df['bsize']+df['asize'])

    print('Shape after: ', df.shape)
    print('Finished myPred_1')
    
    return df

def out_1(quote_df, trade_df, load=False, save_IN=False, save_OUT=False):

    if load == True :
        print('Loading...')
        if save_IN == True:
            df_out_2 = pd.read_csv('df_out_2_IN.csv')
        if save_OUT == True:
            df_out_2 = pd.read_csv('df_out_2_OUT.csv')
    else:
        df_out_2 = my_Feature_Engineering(quote_df, trade_df)
        if save_IN == True :
            df_out_2.to_csv('df_out_2_IN.csv')
        if save_OUT == True :
            df_out_2.to_csv('df_out_2_OUT.csv')
            
    return df_out_2

def features_labels_split_classification(features):
    
    features['futMidLabels'] = features['futMidLabels'].astype('int32')
    labels = np.array(features['futMidLabels'])
    features= features.drop(['futMidLabels','futMid', 'midChangeGroup', 'datetime'], axis = 1)
    feature_list = list(features.columns)
    features = np.array(features)

    print('Training Features Shape:', features.shape)
    print('Training Labels Shape:', labels.shape)
    
    return features, labels

def fit_Classification_Model(train_features, train_labels):
    rfc = BaggingClassifier(base_estimator = RandomForestClassifier(n_estimators = 250, random_state = 42, max_depth = 8,  n_jobs=3, verbose=2), verbose=2, n_jobs=2,  n_estimators=4, random_state=42)
    rfc.fit(train_features, train_labels)
    return rfc

def print_classification_inference(rfc, test_features, test_labels):
    
    predictions = rfc.predict(test_features)
    print('Groud Truth Spread:', Counter(test_labels))
    print('Prediction Spread:', Counter(predictions))
    results = {'predict':predictions, 'labels':test_labels}
    columns = ['predict', 'labels']
    df = pd.DataFrame(results, columns=columns)
 
    #df.to_csv('predict_label_Out.csv')
    
    tmp = df.copy()
    tmp['labels'] = tmp['labels'].astype('int32')
    tmp['predict'] = tmp['predict'].astype('int32')
    tmp['errsq'] = (tmp['labels']-tmp['predict'])**2 
    
    #Weight every misclassification equally: squared errors of 1 and 4 both become 4, so errsq/4 is 1 for a miss and 0 for a hit
    tmp['errsq'] = tmp['errsq'].map({1.0: 4.0, 4.0: 4.0, 0.0:0.0})

    misclassified_number = np.sum((tmp['errsq'].values)/4)
    predCount = len(tmp['labels'])
    total_classified = predCount
    misclassified_fraction = misclassified_number/total_classified

    print('misclassified_number:', misclassified_number)
    print('total_classified:', total_classified)
    print('misclassified_fraction:', misclassified_fraction)
    
    x = prf(tmp['labels'], tmp['predict'])

    print('precision: ', x[0])
    print('recall: ', x[1])
    print('f_beta: ', x[2])
    #print('support: ', x[3])
    print('labels:', [-1, 0, 1])
    
    return df

def features_labels_split_regression(df_predict, features):
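    #Index-aligned assignment: both frames need the same 0..n-1 index (true here because out_1 reloads the features from CSV)
    features['tickpredict'] = df_predict['predict']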
    labels = np.array(features['futMid']-features['mid'])
    features2 = features[['bsize','bid', 'ask', 'asize', 'mid','middiff1','askdiff1','biddiff1',
                          'asksizediff1','bidsizediff1','midma120', 'midma60','midma30','askma120',
                          'askma60','askma30','bidma120', 'bidma60','bidma30','bidsizema120', 'bidsizema60',
                          'bidsizema30','asksizema120', 'asksizema60','asksizema30','price', 'size', 'dayofweek',
                          'seconds', 'imbalance', 'Shareimbalance', 'tickpredict']]
    feature_list = list(features2.columns)
    features2 = np.array(features2)
    
    print('Training Features Shape:', features2.shape)
    print('Training Labels Shape:', labels.shape)
    
    return features2, labels

def fit_Regression_model(train_features, train_labels) :
    #mlp = BaggingRegressor(base_estimator = ExtraTreeRegressor(max_depth=3,random_state=42), n_jobs=2,  n_estimators=8, random_state=42) #1n MAXIMUM
    #mlp = BaggingRegressor(base_estimator = ExtraTreeRegressor(max_depth=3), n_jobs=2,  n_estimators=15)#2n MAXIMUM
    mlp = BaggingRegressor(base_estimator = ExtraTreeRegressor(max_depth=6,random_state=42), n_jobs=2,  n_estimators=10, random_state=42)
    mlp.fit(train_features, train_labels)
    
    return mlp

def combine_pred_with_data_deltachange(predict, features):
    df = pd.DataFrame({'predict': predict})
    #Copy the needed columns so the new columns are written to an independent frame
    dfcopy = features[['datetime', 'midChangeGroup', 'seconds', 'mid', 'futMid']].copy()
    dfcopy['predMid'] = dfcopy['mid'] + df['predict']
    #Absolute size of the predicted move (dropped again before returning)
    dfcopy['diff_Mid_futMidPred'] = (dfcopy['predMid']-dfcopy['mid']).abs()
    #dfcopy = dfcopy.drop(dfcopy[(dfcopy['diff_Mid_futMidPred']>5)].index)
    dfcopy['diff_futMid_futMidPred'] = (dfcopy['predMid'] - dfcopy['futMid'])**2
    #Drop predictions at Unstable times determined from Visualizing predictions for IN Data
    dfcopy = dfcopy.drop(dfcopy[(dfcopy['seconds']<10000)].index)
    dfcopy = dfcopy.drop(dfcopy[(dfcopy['seconds']>19500)].index)
    rms = dfcopy['diff_futMid_futMidPred'].mean()
    rms = rms**0.5
    print('rmse',rms)
    dfcopy.drop(columns = ['diff_futMid_futMidPred', 'diff_Mid_futMidPred'], inplace = True)
    
    return dfcopy


print('\n------- Logging Process Steps - START (can ignore)-------\n')

print('Training Process [1/8]')
dirContents = dir()
if not ('tradeIndf' in dirContents and 'quoteIndf' in dirContents):
    tradeIndf, quoteIndf = LoadData(param.fileDirectory, param.trade_InSampleFile, param.quote_InSampleFile)
print('Done....')
    
print('\nTraining Process [2/8]')
load = False #set load to True only if the features for IN data were already engineered and saved to df_out_2_IN.csv
if load==True :
    features_IN = out_1(quoteIndf, tradeIndf, load=True, save_IN=True, save_OUT=False)
    features_IN.drop(columns = ['Unnamed: 0'], inplace=True)
else:
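    #Compute and save the features, then reload them from CSV; the reload also resets the index to 0..n-1
    features_IN = out_1(quoteIndf, tradeIndf, load=False, save_IN=True, save_OUT=False)
    features_IN = out_1(quoteIndf, tradeIndf, load=True, save_IN=True, save_OUT=False)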
    features_IN.drop(columns = ['Unnamed: 0'], inplace=True)
    print('features were calculated')
print('features_IN Shape after Feature Engineering: ', features_IN.shape)
print('Done....')

print('\nTraining Process [3/8]')
train_features_IN, train_labels_IN = features_labels_split_classification(features_IN)
print('Done....')

print('\nTraining Process [4/8]')
rfc = fit_Classification_Model(train_features_IN, train_labels_IN)
pickle.dump(rfc, open('bagging_rfc.sav', 'wb'))
#print('Loading pickled model..')
#rfc = pickle.load(open('bagging_rfc.sav', 'rb'))
print('Done....')

print('\nTraining Process [5/8]')
df_predict_label_IN = print_classification_inference(rfc, train_features_IN, train_labels_IN)
print('Done....')


print('\nTraining Process [6/8]')
mlp_train_features_IN, mlp_train_labels_IN = features_labels_split_regression(df_predict_label_IN, features_IN)
print('Done....')

print('\nTraining Process [7/8]')
mlp = fit_Regression_model(mlp_train_features_IN, mlp_train_labels_IN)
pickle.dump(mlp, open('bagging_etr.sav', 'wb'))
#print('Loading pickled model..')
#mlp = pickle.load(open('bagging_etr.sav', 'rb'))
print('Done....')

print('\nTraining Process [8/8]')
predict_IN = mlp.predict(mlp_train_features_IN)
Final_IN = combine_pred_with_data_deltachange(predict_IN, features_IN)
Final_IN.head()
Final_IN.to_csv('Final_IN.csv')
print('Done....')

print('\n------- Logging Process Steps - END (can ignore)-------')



------- Logging Process Steps - START (can ignore)-------

Training Process [1/8]
Done....

Training Process [2/8]
Starting myPred_1
Shape before:  (1672797, 12)
Shape after:  (370696, 39)
Finished myPred_1
Starting myPred_2
Shape before:  (370696, 39)
Shape after:  (350333, 41)
Finished myPred_1
Loading...
features were calculated
features_IN Shape after Feature Engineering:  (350333, 41)
Done....



Training Process [3/8]
Training Features Shape: (350333, 37)
Training Labels Shape: (350333,)
Done....

Training Process [4/8]


[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:  9.4min finished


Done....

Training Process [5/8]


[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:   25.0s finished


Groud Truth Spread: Counter({-1: 165061, 1: 163271, 0: 22001})
Prediction Spread: Counter({-1: 178482, 1: 171851})
misclassified_number: 146831.0
total_classified: 350333
misclassified_fraction: 0.4191183816540263
precision:  [0.58053473 0.         0.58124189]
recall:  [0.62773762 0.         0.61178654]
f_beta:  [0.60321415 0.         0.5961232 ]
labels: [-1, 0, 1]
Done....





Training Process [6/8]
Training Features Shape: (350333, 32)
Training Labels Shape: (350333,)
Done....

Training Process [7/8]
Done....

Training Process [8/8]
rmse 1.5718962725588526
Done....

------- Logging Process Steps - END (can ignore)-------


In [48]:
#6th step
#Predict with the trained model using out_sample data

#Load the out-sample csv if not in memory
#Do not change tick frequency for outsample dataframe

dirContents = dir()
if not ('tradeOutdf' in dirContents and 'quoteOutdf' in dirContents):
    tradeOutdf, quoteOutdf = LoadData(param.fileDirectory, param.trade_OutSampleFile, param.quote_OutSampleFile)

print('------- Logging Process Steps - START (can ignore)-------\n')
print('This step will take some time!')
def OutSamplePrediction(quote_df, trade_df):
    load=False #load is False if the data is used for prediction and has not been feature engineered
    if load == True :
        features_OUT = out_1(quote_df, trade_df, load=True, save_IN=False, save_OUT=True)
        features_OUT.drop(columns = ['Unnamed: 0'], inplace=True)
    else :
        features_OUT = out_1(quote_df, trade_df, load=False, save_IN=False, save_OUT=True)
        features_OUT = out_1(quote_df, trade_df, load=True, save_IN=False, save_OUT=True)
        features_OUT.drop(columns = ['Unnamed: 0'], inplace=True)
        print('features were calculated')
    print('features_OUT Shape after feature engineering: ', features_OUT.shape)    

    train_features_OUT, train_labels_OUT = features_labels_split_classification(features_OUT)

    df_predict_label_OUT = print_classification_inference(rfc, train_features_OUT, train_labels_OUT)

    mlp_train_features_OUT, mlp_train_labels_OUT = features_labels_split_regression(df_predict_label_OUT, features_OUT)

    predict_OUT = mlp.predict(mlp_train_features_OUT)
    Final_OUT = combine_pred_with_data_deltachange(predict_OUT, features_OUT)
    Final_OUT.head()
    Final_OUT.to_csv('Final_OUT.csv')
    return Final_OUT

Final_OUT = OutSamplePrediction(quoteOutdf, tradeOutdf)
print('\n------- Logging Process Steps - END (can ignore)-------\n')
print('Out-sample prediction')
RMS(Final_OUT)

#IMPORTANT: To predict on new data, supply it in the same format as the data given initially (i.e. quote_out and trade_out csv files)


'''#Any required calculations
def OutSamplePrediction(quote_df, trade_df): 
    print('Out-sample prediction')    
    return InSamplePredictionModel(quote_df, trade_df)
    
    
#res = OutSamplePrediction(quoteOutdf, tradeOutdf)
RMS(res)'''

------- Logging Process Steps - START (can ignore)-------

This step will take some time!
Starting myPred_1
Shape before:  (207297, 12)
Shape after:  (53187, 39)
Finished myPred_1
Starting myPred_2
Shape before:  (53187, 39)
Shape after:  (50749, 41)
Finished myPred_1
Loading...
features were calculated
features_OUT Shape after feature engineering:  (50749, 41)
Training Features Shape: (50749, 37)
Training Labels Shape: (50749,)


[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    5.1s finished


Groud Truth Spread: Counter({-1: 25065, 1: 22654, 0: 3030})
Prediction Spread: Counter({-1: 25697, 1: 25052})
misclassified_number: 24917.0
total_classified: 50749
misclassified_fraction: 0.49098504404027665
precision:  [0.53068452 0.         0.48678748]
recall:  [0.54406543 0.         0.53831553]
f_beta:  [0.53729167 0.         0.51125645]
labels: [-1, 0, 1]
Training Features Shape: (50749, 32)
Training Labels Shape: (50749,)
rmse 1.4883918513233985

------- Logging Process Steps - END (can ignore)-------

Out-sample prediction
RMS = 1.4884. #Predictions = 7322


"#Any required calculations\ndef OutSamplePrediction(quote_df, trade_df): \n    print('Out-sample prediction')    \n    return InSamplePredictionModel(quote_df, trade_df)\n    \n    \n#res = OutSamplePrediction(quoteOutdf, tradeOutdf)\nRMS(res)"