# Assignment

Dear Candidate,

As part of the recruitment process, you're required to complete the assignment detailed below and upload it as described in the email.

<!-- The assignment tests your knowlege of python programming and machine learning skills. The test is designed to be completed within -- days -- -->

Kindly submit your original work.

Thank you, and good luck.

## Overview

### Problem Statement

The goal of this assignment is to train a trading model that tries to predict the direction of the stock market over the next 5 minutes, based on historical data.

### Your task
At a high level, you have three tasks.

<b>Part I - Prepare Data</b>: This section involves reading the data files, modifying the data, and preparing 4-5 indicators that will be given as input to the model. The 4 indicators (RSI, ADX, VOI, OIR) are detailed below. You're required to program these 4 indicators to feed into the model.

*Bonus* - You may add any other indicator/feature that you feel could help increase the accuracy of the model.

<b>Part II - EDA and model training</b>: In this section, you are required to design a Logistic Regression model. The model is detailed in the section below. You're required to implement that model and train it for a suitable number of epochs.

*Bonus* - Once you have the suggested model available, you can choose to modify it (in a separate section below) and experiment to see how the changes affect the accuracy of the model.

<b>Part III - Model Evaluation</b>: In this section, you will run your model on your test data to evaluate the bias/variance tradeoff.

### Data
Along with this notebook, we've provided the dataset in the zip file ---name here ---
The data is organized by day and contains order book (see Resources) data at a 1-second interval. The columns are:

date: date and time in YYYY-MM-DD hh:mm:ss format

Order Book Data: 

a0: Best ASK price (i.e. the lowest posted price at which someone is willing to sell an asset)

b0: Best BID price (i.e. the highest posted price at which someone is willing to buy an asset)

az0: Best ASK size (i.e. the number of lots being offered for selling at the lowest ask price)

bz0: Best BID size (i.e. the number of lots that people are trying to buy at the bid price)

 

Features:

atv: feature representing one fraction of trading volume (in number of lots)

btv: feature representing another fraction of trading volume (in number of lots)

(atv + btv = total number of trades in the day so far)

 

tbq: sum of all the BID (buy) sizes in the market

tsq: sum of all the ASK (sell) sizes in the market

 

All the above variables are for a derivative instrument 2.

### Resources
1. [Order Book](https://www.investopedia.com/terms/o/order-book.asp)
2. [Technical Indicators](https://www.investopedia.com/terms/i/indicator.asp) 
3. [RSI](https://www.investopedia.com/terms/r/rsi.asp) , [ADX](https://www.investopedia.com/articles/trading/07/adx-trend-indicator.asp), [Calculating ADX](https://traderhq.com/average-directional-index-indicator-guide/)

4. Volume Order Imbalance and Order Imbalance Ratio - refer to pages 5 and 17 of [this document](http://eprints.maths.ox.ac.uk/1895/1/Darryl%20Shen%20%28for%20archive%29.pdf)
5. [Tensorflow](https://www.tensorflow.org/)
6. [LSTM](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)



## Part I Prepare Data

The first part is designed to test your knowledge of data analysis, and the ability to program in an efficient and logical way. 

In this section, you'll be implementing two technical indicators (ADX, RSI) and two order book features (VOI, OIR), and finally rearranging the data into a format that can be used for training an ML model.
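As a rough illustration of the two order book features, here is a minimal sketch based on one reading of the definitions in the document linked under Resources. The function names and the tiny hand-made book are hypothetical, and the sketch compares raw prices between rows; since the sign of a log return matches the sign of the price change, the return columns could be used for the comparison instead. Please check the formulas against the reference before relying on them.

```python
import numpy as np
import pandas as pd


def volume_order_imbalance(bid, ask, bid_size, ask_size):
    """VOI = delta_bid_volume - delta_ask_volume, row by row."""
    d_bid, d_ask = bid.diff(), ask.diff()
    # Bid side: a falling bid contributes 0, an unchanged bid the size change,
    # a rising bid the full current size.
    delta_vb = np.where(d_bid < 0, 0.0, np.where(d_bid > 0, bid_size, bid_size.diff()))
    # Ask side mirrors it: a falling ask contributes the full size, a rising ask 0.
    delta_va = np.where(d_ask < 0, ask_size, np.where(d_ask > 0, 0.0, ask_size.diff()))
    return pd.Series(delta_vb - delta_va, index=bid.index)


def order_imbalance_ratio(bid_size, ask_size):
    """OIR = (bid size - ask size) / (bid size + ask size), in [-1, 1]."""
    return (bid_size - ask_size) / (bid_size + ask_size)


# Tiny hand-made book, for illustration only.
book = pd.DataFrame({
    "b0": [100.0, 100.0, 101.0, 100.0],
    "a0": [101.0, 101.0, 102.0, 102.0],
    "bz0": [10.0, 12.0, 5.0, 7.0],
    "az0": [8.0, 6.0, 9.0, 4.0],
})
voi = volume_order_imbalance(book["b0"], book["a0"], book["bz0"], book["az0"])
oir = order_imbalance_ratio(book["bz0"], book["az0"])
```

The first VOI value is undefined (NaN) because there is no previous row to compare against.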

The class ProcessData is provided which is initialized with daily data from the csv files.

<b>Input to the class</b> - 1 day order book data. Shape (22141,8)

<b>Expected output</b> - Input features (X) and output (Y) which can be fed into a trainable model. This is detailed further in the section.

**X**: Shape (Number of points in a day, look_back period, number of technical indicators)<br>
E.g. if there are 200 points identified for training in a day, with a look-back period of 5 minutes (300 seconds) and 4 features, then the output (X) from createDataset will have a shape of (200, 300, 4).
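To make the shape concrete, here is a minimal sketch of stacking look-back windows. It assumes `freq` means "take one training point every `freq` rows", which is only a guess at the intent of that parameter; the helper name `make_windows` is hypothetical.

```python
import numpy as np


def make_windows(features: np.ndarray, lag: int, freq: int) -> np.ndarray:
    """Stack look-back windows of `lag` rows, sampled every `freq` rows."""
    # Each window ends just before row `end`, so it only contains past rows.
    ends = np.arange(lag, features.shape[0] + 1, freq)
    return np.stack([features[end - lag:end] for end in ends])


# Synthetic feature matrix: 1000 one-second rows, 4 indicators.
features = np.arange(1000 * 4, dtype=float).reshape(1000, 4)
windows = make_windows(features, lag=300, freq=100)
```

With 1000 rows, a 300-row look-back and a step of 100 rows, this produces 8 windows, each of shape (300, 4). The input must contain at least `lag` rows, otherwise there is nothing to stack.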

**Y**: Shape (Number of points). This will hold the predictions (buy/sell/hold) based on the direction in which 
        the market will move in the next 5 minutes. Use an appropriate threshold to split the data into buy, sell and hold (e.g. if the market moves more than 0.1% => Buy, less than -0.1% => Sell, otherwise Hold)
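One possible way to turn the threshold rule into labels is sketched below. It assumes one row per second (so 300 rows is 5 minutes) and uses a hypothetical encoding of 1 for buy, -1 for sell and 0 for hold; which price series to use (bid, ask or a mid price) is left open.

```python
import pandas as pd


def label_direction(price: pd.Series, horizon: int = 300, threshold: float = 0.001) -> pd.Series:
    """Label each row 1 (buy), -1 (sell) or 0 (hold) from the return over the next `horizon` rows."""
    future_return = price.shift(-horizon) / price - 1.0
    labels = pd.Series(0, index=price.index, dtype="int64")
    labels[future_return > threshold] = 1
    labels[future_return < -threshold] = -1
    # The last `horizon` rows have no future price; drop them rather than call them "hold".
    return labels[future_return.notna()]


# Synthetic price path, for illustration only: flat, jump up, flat, drop.
price = pd.Series([100.0] * 300 + [100.2] * 600 + [99.9] * 300)
labels = label_direction(price)
```

In this toy path the first 300 rows see a +0.2% move and are labelled buy, the next 300 see no move and are labelled hold, and the following 300 see a drop of about 0.3% and are labelled sell.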
        
Your task is to implement the following methods:

1. computeOHLC - For technical analysis, you're required to convert the order book data to an OHLC format.
2. addRSIColumn - This method will compute the RSI for the data (self.data) and assign it to the column (self.data[col_name])
3. addADXColumn - This method will compute the ADX for the data (self.data) and assign it to the column (self.data[col_name])
4. addVOIColumn - This method will compute the VOI for the data (self.data) and assign it to the column (self.data[col_name])
5. addOIRColumn - This method will compute the OIR for the data (self.data) and assign it to the column (self.data[col_name])
6. createDataset - Finally, you're required to convert the data into a numpy array which can be used for training.
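The first two steps can be sketched roughly as follows. This is an illustration under assumptions, not a reference solution: the helper names are hypothetical, the prices are synthetic, and the RSI uses exponential smoothing with `alpha = 1 / period` as a common approximation of Wilder's smoothing (the seed value differs from the classic simple-average start).

```python
import numpy as np
import pandas as pd


def ohlc_from_series(prices: pd.Series, interval: str = "1min") -> pd.DataFrame:
    """Resample a price series with a DatetimeIndex into OHLC bars."""
    bars = prices.resample(interval).ohlc()
    # pandas names the columns open/high/low/close; capitalize them to Open/High/Low/Close.
    return bars.rename(columns=str.capitalize)


def rsi_from_close(close: pd.Series, period: int = 14) -> pd.Series:
    """RSI from exponentially smoothed gains and losses."""
    delta = close.diff()
    gain = delta.clip(lower=0.0)
    loss = (-delta).clip(lower=0.0)
    avg_gain = gain.ewm(alpha=1.0 / period, adjust=False, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1.0 / period, adjust=False, min_periods=period).mean()
    # Algebraically the same as 100 - 100 / (1 + avg_gain / avg_loss),
    # but it avoids dividing by a zero average loss.
    return 100.0 * avg_gain / (avg_gain + avg_loss)


# Synthetic one-second prices that rise steadily, for illustration only.
index = pd.date_range("2017-04-03 09:15:00", periods=1800, freq="s")
prices = pd.Series(100.0 + np.arange(1800) * 0.01, index=index)

bars = ohlc_from_series(prices)
rsi = rsi_from_close(bars["Close"])
# Shift by one bar so each second only sees bars that have already closed,
# then forward-fill back onto the one-second index.
rsi_per_second = rsi.shift(1).reindex(index, method="ffill")
```

Because the indicator is computed on resampled bars but stored next to one-second data, some alignment step like the last line is needed. The one-bar shift is a design choice to avoid look-ahead: a bar labelled 09:15 is only complete at 09:15:59.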



In [None]:
import numpy as np
import pandas as pd
pd.set_option('chained_assignment',None)

In [None]:
class ProcessData:
    """
    Documentation:
    Class to process raw order book data
    
    """
    def __init__(self, data, 
                 path='../dataset', window=1,
                 debug=False, convert_to_log=True, local_data=None):
        self.debug = debug
        self.data = data
        if(convert_to_log):
            self.convert_to_log_returns()
            
    def convert_to_log_returns(self):
        df = self.data
        df['aR0'] = np.log(df['a0']) - np.log(df['a0'].shift(1))
        df['bR0'] = np.log(df['b0']) - np.log(df['b0'].shift(1))
        # 'a0_cm'/'b0_cm' are not among the documented columns, so only use them if present.
        if {'a0_cm', 'b0_cm'}.issubset(df.columns):
            df['aR0_cm'] = np.log(df['a0_cm']) - np.log(df['a0_cm'].shift(1))
            df['bR0_cm'] = np.log(df['b0_cm']) - np.log(df['b0_cm'].shift(1))
    
    def computeOHLC(self, interval='1T',column='b0'):
        """
        This method is expected to take a price series as input and resample it to produce OHLC data,
        based on the input interval.
        Input:
        interval - pandas offset alias giving the resampling frequency (the default '1T' is one minute)
        column - name of the column which contains the time series
        
        Output:
        self.dataD is set, which should have 4 columns (Open, High, Low, Close) and roughly
        self.data.shape[0] / (seconds per interval) rows. Such features are required for technical indicators.
        """
        
        ### Start your code here ###

        ### End your code here ###
        
    
    def addRSIColumn(self,col_name='RSI', period=14, 
                    base='Close'):
        """
        Add RSI Column
        
        Parameters
        ----------
        col_name: string
            Name of the column to write the RSI into
        period: int, default 14
            The RSI period
        base: string from ['Open','Close','Low','High'], default 'Close'
            Base price 
        
        """
        
        self.computeOHLC()
        RSI = None
        ### Start your code here ###

        ### End your code here ###        
        self.data[col_name] = RSI
        
    def addADXColumn(self, col_name = 'ADX', period =14,
                           Multiplier = 3):
        """
        Add ADX indicator
        
        Parameters
        ----------
        col_name: string
            Name of the column to write the ADX into
        period: int, default 14
            The ADX number of periods
        Multiplier: int
            The ATR multiplier
        """
        self.computeOHLC()
        ADX = None
        ### Start your code here ###

        ### End your code here ###        
        self.data[col_name] = ADX
        
    def addVOIColumn(self, col_name = 'VOI',
                    Va='az0', Ra='aR0', Vb='bz0', Rb='bR0'):
        """
        Add Volume Order Imbalance column
        
        Parameters
        ----------
        col_name: string
            Name of the column to write the VOI into
        Va: string
            Ask volume column
        Ra: string
            Ask log-return column
        Vb: string
            Bid volume column
        Rb: string
            Bid log-return column
        """
        if(self.debug):
            print("Adding VOI Layer")
        VOI = None
        ### Start your code here ###

        ### End your code here ###        
        self.data[col_name] = VOI
        
        
    def addOIRColumn(self, col_name = 'OIR',
                    Va='az0', Vb='bz0'):
        """
        Add Order Imbalance Ratio column
        
        Parameters
        ----------
        col_name: string
            Name of the column to write the OIR into
        Va: string
            Ask volume column
        Vb: string
            Bid volume column
        """
        if(self.debug):
            print("Adding OIR Layer")
        OIR = None
        ### Start your code here ###

        ### End your code here ###        
        self.data[col_name] = OIR
    
    def createDataset(self,X_cols,lag,freq=100):
        '''
        
        Input features (X) and output (Y) which can be fed into a trainable model.
        
        X: Shape (Number of points in a day, look_back period, number of technical indicators)
        E.g. if there are 200 points identified for training in a day, with a look-back period of 5 minutes (300s) and 4 features, then the output X from createDataset will have a shape of (200, 300, 4)
        
        Y: Shape (Number of points). This will hold the predictions (buy/sell/hold) based on the direction in which 
        the market moves in the next 5 minutes. Use an appropriate threshold to split the data.
        '''
        X = np.empty((0,lag,len(X_cols)))
        Y = np.empty((0))
        ### Start your code here ###
        
        ### End your code here ###
        
        assert X.shape[0] == Y.shape[0]
        return X , Y
        


In [None]:
days = [20170403, 20170405, 20170406, 20170407,
        20170413, 20170417,
        20170418, 20170419, 20170420, 20170421]
training=1
days += [20170424, 20170425, 20170426, 20170427,
         20170428, 20170502,
         20170503, 20170504, 20170505, 20170508]

days += [20170509, 20170510, 20170511, 20170512,
         20170515, 20170517,
         20170518, 20170519, 20170522, 20170523]

### Run the cell below after implementing the required methods in the class above

In [None]:
X = {}
Y = {}
for day in days:
    print(day)
    data = pd.read_csv('./assignment_data/test_'+str(day)+'.csv',
                       index_col='date')
    
    sData = ProcessData(data, debug=False)
    sData.data.index = pd.to_datetime(sData.data.index)
    sData.addVOIColumn()
    sData.addOIRColumn()
    sData.addRSIColumn()
    sData.addADXColumn()
    X[day], Y[day] = sData.createDataset(X_cols = ['VOI','OIR','ADX','RSI'], lag=300)
    
    

In [None]:
X[day].shape, Y[day].shape

## Part II EDA and Model training

### Exploratory Data Analysis

In this part, you are required to visualize the processed data and test for properties such as stationarity and heteroskedasticity, which pertain to time series analysis. This will perhaps be the most important step in understanding the data set and coming up with useful features for the next phase. Also, feel free to design more features to be used for your model below.
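As a starting point for such checks, here is a minimal sketch using rolling statistics on synthetic data. It is an eyeball diagnostic rather than a formal test; the helper name `rolling_diagnostics` and the window of 300 rows are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def rolling_diagnostics(series: pd.Series, window: int = 300) -> pd.DataFrame:
    """Rolling mean and standard deviation, to eyeball drift and changing variance."""
    return pd.DataFrame({
        "rolling_mean": series.rolling(window).mean(),
        "rolling_std": series.rolling(window).std(),
    })


# Synthetic data, for illustration only: white noise and its cumulative sum (a random walk).
rng = np.random.default_rng(0)
noise = pd.Series(rng.normal(size=5000))
walk = noise.cumsum()

noise_diag = rolling_diagnostics(noise)
walk_diag = rolling_diagnostics(walk)

# Spread of the rolling mean: small for a stationary series, large when the level wanders.
noise_spread = noise_diag["rolling_mean"].max() - noise_diag["rolling_mean"].min()
walk_spread = walk_diag["rolling_mean"].max() - walk_diag["rolling_mean"].min()

# Lag-1 autocorrelation of squared changes hints at volatility clustering.
vol_clustering = walk.diff().pow(2).autocorr(lag=1)
```

Plotting `rolling_mean` and `rolling_std` for each feature column gives a quick visual check; a rolling standard deviation that changes markedly over the day suggests heteroskedasticity.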

Here, you may choose to do some cleanup such as, but not limited to, removing outliers, invalid points, etc.
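A minimal sketch of two such cleanup steps is shown below. What counts as "invalid" is an assumption here (non-positive prices or sizes, or an ask below the bid), and the quantile cut-offs are arbitrary; both helper names are hypothetical.

```python
import pandas as pd


def drop_invalid_quotes(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows with positive prices and sizes and an uncrossed book (ask >= bid)."""
    valid = (
        (df["a0"] > 0) & (df["b0"] > 0)
        & (df["az0"] > 0) & (df["bz0"] > 0)
        & (df["a0"] >= df["b0"])
    )
    return df[valid]


def winsorize(series: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Clip values outside the given quantiles instead of dropping the rows."""
    low, high = series.quantile([lower, upper])
    return series.clip(lower=low, upper=high)


# Hand-made quotes: only the first row passes every check.
quotes = pd.DataFrame({
    "a0": [101.0, 0.0, 99.0, 102.0],
    "b0": [100.0, 100.0, 100.0, 101.0],
    "az0": [5, 5, 5, 5],
    "bz0": [4, 4, 4, 0],
})
clean = drop_invalid_quotes(quotes)

# One extreme value among otherwise ordinary ones.
feature = pd.Series(list(range(100)) + [10_000])
clipped = winsorize(feature)
```

Clipping keeps the time index intact, which matters when the rows are later cut into fixed-length look-back windows; dropping rows would leave gaps.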




In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
X[day].shape

In [None]:
#sData.data[['VOI','OIR','ADX','RSI']].plot()

## Train model

After going back and forth with EDA and feature engineering, you're required to train your model here and evaluate how it performs on the training set. Adjustments will be necessary to arrive at the right parameters.
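For orientation, here is a framework-free sketch of a three-class logistic (softmax) regression trained with plain gradient descent on flattened windows. It is written in NumPy rather than TensorFlow so it runs on its own, it is not the `Model` class expected below, and the function names, learning rate and synthetic data are all assumptions. Labels are assumed to be integer class ids 0, 1, 2.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def train_softmax_regression(X, y, n_classes=3, lr=0.1, n_epochs=200):
    """X: (n, seq_length, num_features); y: integer class ids in [0, n_classes)."""
    flat = X.reshape(len(X), -1)
    W = np.zeros((flat.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    for _ in range(n_epochs):
        probs = softmax(flat @ W + b)
        grad = (probs - onehot) / len(flat)  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * flat.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b


def predict(X, W, b):
    return softmax(X.reshape(len(X), -1) @ W + b).argmax(axis=1)


# Synthetic, easily separable data: class k has feature k raised by 1.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 20)
X = rng.normal(scale=0.05, size=(60, 5, 3))
for i, k in enumerate(y):
    X[i, :, k] += 1.0

W, b = train_softmax_regression(X, y)
train_accuracy = float(np.mean(predict(X, W, b) == y))
```

On real windows of shape (300, 4) the flattened input has 1200 dimensions, so feature scaling and regularization are likely to matter far more than in this toy setup.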

In [None]:
class Model():
    def __init__(self,scope="irage/base",seq_length=300,num_features=4):
        ### Start your code here ###


        ### End your code here ###
        #assert(self.loss)    

In [None]:
import tensorflow as tf  # TensorFlow 1.x API; under 2.x use tf.compat.v1.reset_default_graph()

tf.reset_default_graph()
model0 = Model()

In [None]:
#np.random.randint(4,size=[32])

### Run model
Add code below to train your model 

In [None]:
from random import shuffle

In [None]:
train_days = days[:-10]
n_epochs = 10
for epoch in range(n_epochs):
    print("==Epoch==",epoch)
    shuffle(train_days)
    
    for day in train_days:
        ### Start your code here ###
        pass
        ### End your code here ###
        

        

## Part III Model Evaluation
Here, you'll be analysing how the model performs on validation data. 

Resources:<br>
[Model Accuracy](https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2016/05/history_training_dataset.png)<br>
[Bias Variance Trade-off](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff)

Based on the analysis, you may have to go back to the model and control for overfitting etc. This will be a continuous process until you have satisfactory results.
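A minimal sketch of the quantities typically compared at this stage is given below; the helper names and the example numbers are hypothetical. A large gap between training and validation accuracy usually points to overfitting (high variance), while low accuracy on both usually points to underfitting (high bias).

```python
import numpy as np


def accuracy(y_true, y_pred) -> float:
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def confusion_matrix(y_true, y_pred, n_classes: int = 3) -> np.ndarray:
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(matrix, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return matrix


# Made-up validation labels and predictions, for illustration only.
y_val_true = [0, 1, 2, 2]
y_val_pred = [0, 2, 2, 2]

val_acc = accuracy(y_val_true, y_val_pred)
cm = confusion_matrix(y_val_true, y_val_pred)

# Gap between a made-up training accuracy and the validation accuracy.
train_acc = 0.9
gap = train_acc - val_acc
```

With imbalanced buy/sell/hold labels, the confusion matrix is often more informative than accuracy alone, since a model that always predicts the majority class can still score well.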


In [None]:
test_days = days[-10:]

for day in test_days:
    ### Start your code here ###
    pass
    ### End your code here ###

## Bonus Model

After running the basic Neural Network model with the fully-connected layers, you may try replicating the model below and evaluate how the performance is affected.

![title](img/LSTM.png)



Model Architecture: Here is a snapshot of the necessary layers (in that order)<br>
1. LSTM for one sample. This will take self.feat as the input (e.g. ?,300,4) and generate an LSTM output which will feed into the next layers
2. LSTM for the entire batch. This will take the output from the first LSTM and generate an output across the entire batch.<br>
Hint - The state outputs will need to be preserved in case you decide to break a day's data into multiple mini-batches.
For both the above layers, you may refer to the following tensorflow methods:<br>
a. LSTM Cell - https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/LayerNormBasicLSTMCell<br>
b. RNN - https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn

3. Fully connected layer(s) - Add 1 or more fully connected layers to the output from 2 above. https://www.tensorflow.org/api_docs/python/tf/layers/dense

4. Loss function and optimizer

