

# Capstone Project: Systematic Trading Strategies

## Goal: 
Implement a process for developing trading strategies in Indian equity markets, using fundamental data of liquid listed companies, that maximises out-of-sample average returns while keeping risk under control.

## Who cares?

I do. I am planning to start my own investment firm. I have not focussed much on fundamental data before, nor have I explored advanced machine learning techniques to develop trading strategies. I specialize in analyzing short-term (seconds to minutes) movements in markets, not movements over months and years. If there is indeed value in using some subset of the Data Science that I learn during this course, I am happy to invest my own money in these trading strategies. Even otherwise, I believe making the investment process data-driven and automated has benefits for everyone interested in investing wisely and at minimum cost.

## Data

I have subscribed to two data sets from Quandl.

1. [https://www.quandl.com/data/DEB-Core-India-Fundamentals-Data](https://www.quandl.com/data/DEB-Core-India-Fundamentals-Data) Contains fundamental indicators derived from the published balance sheets, income statements and cash flow statements of roughly 4000 stocks listed on NSE or BSE. History goes back to 2005.
2. [https://www.quandl.com/data/TC1-Indian-Equities-Adjusted-End-of-Day-Prices](https://www.quandl.com/data/TC1-Indian-Equities-Adjusted-End-of-Day-Prices) Contains daily prices (open, low, high, close), volume and value traded for the top 500 NSE listed stocks by market capitalization. The prices are adjusted for corporate actions like stock splits, rights issues, dividends, buy backs etc.

## Approach

We wish to follow the approach of a white paper published by a Deutsche Bank research group. In that paper the authors review various machine learning algorithms and apply them in practice to Japanese equity markets data. The steps involved are as follows:

1. **Data Selection:** Decide the stock universe and the date ranges for the training set and test set. Factors involved:
    1. Data availability
    2. Stock liquidity
    3. Avoiding bias in universe selection, such as survivorship bias
2. **Problem Formulation:** Describe
    1. the context,
    2. the feature space,
    3. the target variable,
    4. the optimization problem,
    5. the model selection metric and
    6. the evaluation criteria
3. **Investment signal creation and classification:** Take a close look at the fundamental data set to see whether it needs enrichment, i.e. computing well known indicators to make a comprehensive list of possible investment signals, classified into factors like growth, value, quality, size, momentum etc.
4. **Visualize performance of individual investment signals:** If each investment signal is used only by itself as a ranker for a long stock portfolio, how does the strategy perform in-sample?
5. **Data pre-processing:** Fill missing data, normalize, uniformize, sector-neutralize, quantize, winsorize etc. to make variations in the output variable (returns) more sensitive to variations in the input variables, and to make combining variables more sensible.
6. **Create a benchmark** using a simple linear model (with some basic checks), against which other machine learning techniques can be compared
7. **Test the most promising models** from the paper and see whether they perform any better than the simple model
8. **Document the conclusions and insights**
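To make the pre-processing step more concrete, here is a minimal sketch of winsorizing and normalizing one cross-section of a single signal. The helper name `preprocess_signal`, the quantile cut-offs and the toy values are assumptions for illustration, not choices made in this study.

```python
import numpy as np
import pandas as pd

def preprocess_signal(values: pd.Series, lower: float = 0.05, upper: float = 0.95) -> pd.Series:
    """Fill, winsorize and z-score one cross-section of a signal (hypothetical helper)."""
    # Fill missing values with the cross-sectional median
    filled = values.fillna(values.median())
    # Winsorize: clip extreme values to the chosen quantiles
    lo, hi = filled.quantile(lower), filled.quantile(upper)
    clipped = filled.clip(lower=lo, upper=hi)
    # Normalize to zero mean and unit standard deviation
    std = clipped.std(ddof=0)
    if std == 0:
        return clipped * 0.0
    return (clipped - clipped.mean()) / std

# Toy cross-section with one missing value and one outlier (made-up numbers)
signal = pd.Series([8.0, 12.0, np.nan, 15.0, 400.0], index=['A', 'B', 'C', 'D', 'E'])
z = preprocess_signal(signal)
print(z)
```

Sector-neutralizing could follow the same pattern by applying the helper within each sector group instead of across the whole universe.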



## Data Selection 
Here we decide the date ranges for the training set and test set, and the stock universe for this study. Factors involved:

1. Data availability: 
    1. Fundamental Data-set is available 2005 onwards. 
    2. Adjusted Price data is available 2001 onwards. 
    3. Earnings release dates are available 2011 onwards. 
    4. There are more than 1000 stocks common in all the three data-sets
    5. **Decisions:**
        1. Let's *not* throw away 2005-10 for lack of earnings release dates. Instead, whenever an earnings release date is not available, assume the data becomes available 63 days after quarter end
        2. Training set: 2005 to 2011 
        3. Test set: 2012 - 2015 
        4. Reserved set (only for final reporting): 2016 - 2017 
2. Stock liquidity:
    1. Small cap stocks are known to be prone to manipulation (both of stock market and company fundamentals) 
    2. Trading in large amounts in smaller stocks will lead to heavy transaction costs. 
    3. **Decision:** Let us use the top 200 stocks by market cap at each rebalance date.
3. Avoid any bias in universe selection like survivorship:
    1. **Decision:** Instead of a fixed universe of stocks for the entire period, let us use the top 200 stocks by market cap, updated at every rebalance date.
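The 63-day fallback decided above can be sketched as follows. This is only an illustration: the column names `quarter_end`, `release_date` and `available_from`, the helper `availability_date` and the two toy rows are assumptions, not the actual layout of the data sets.

```python
import pandas as pd

FALLBACK_LAG_DAYS = 63  # delay from quarter end when no release date is known

def availability_date(quarter_end: pd.Series, release_date: pd.Series) -> pd.Series:
    """Date from which a quarter's fundamentals may be used (hypothetical helper)."""
    fallback = quarter_end + pd.Timedelta(days=FALLBACK_LAG_DAYS)
    # Use the actual release date where it exists, otherwise the fallback
    return release_date.fillna(fallback)

# Toy filings: the first has no release date, the second does
filings = pd.DataFrame({
    'ticker': ['AAA', 'AAA'],
    'quarter_end': pd.to_datetime(['2009-03-31', '2012-03-31']),
    'release_date': pd.to_datetime([None, '2012-05-10']),
})
filings['available_from'] = availability_date(filings.quarter_end, filings.release_date)
print(filings)
```

Joining features on `available_from` rather than on the quarter end date is one way to keep information that was not yet public out of the training set.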



In [72]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
#fundamental data or the feature set
dfn = pd.read_pickle('fundamental_data_dfn.pkl')

In [2]:
#We checked that MCAP is the same in standalone and consolidated in 24,442 rows and different in only 60.
#Since consolidated is frequently null, let's use standalone to define MCAP.
#dfn[(dfn.indicator == 'MCAP') & (~dfn.consolidated.isnull()) & ~(dfn.consolidated == dfn.standalone)].count()

In [146]:
def createFeatureSet(dfn):
    ''' Create a training and test set '''
    trainStart = '2005-03-31'; trainEnd = '2011-12-31'
    testStart = '2012-01-01'; testEnd = '2015-12-31'
#    testEnd = '2017-07-31';

    rebalDays = 30 #Days after which to rebalance
    stockUniverseSize = 200 #top stocks by market cap will be chosen as our universe

    #Get dates training and test
    trainDateIndex = pd.date_range(start=trainStart, end=trainEnd, freq=str(rebalDays)+'D')
    testDateIndex = pd.date_range(start=testStart, end=testEnd, freq=str(rebalDays)+'D')

    #Create a dataframe sorted by date and then by market cap of tickers
    df_mcap = dfn[dfn.indicator == 'MCAP'].sort_values(['date','standalone'],ascending = (True,False))
    df_mcap = df_mcap[['date','ticker','standalone']]
    df_mcap.columns = ['QuarterEndDate','ticker','MCAP']

    #remove certain "crappy stocks" whose MCAP has serious errors identified painfully
    crappy_stocks = pd.read_csv('crappy_stocks.csv')
    crappy_stocks = crappy_stocks.ticker.tolist()
    df_mcap = df_mcap[~df_mcap.ticker.isin(crappy_stocks)]
    
    def createDFtopN(dateIndex,df_mcap,stockUniverseSize):
        '''Creates a DataFrame with cartesian product of all dates in dateIndex 
        and top stockUniverseSize market cap tickers'''
        all_tickers = df_mcap.ticker.unique()
        #Create the cartesian (cross) product of all dates and all tickers
        multiIndex = pd.MultiIndex.from_product([dateIndex,all_tickers],names = ['date','ticker'])
        X = pd.DataFrame(0,index=multiIndex,columns = ['temp'])
        X.reset_index(inplace=True)
        
        #In the cross product of all dates and all tickers, as-of join the market caps
        #(the most recent quarter end on or before each date)
        X = pd.merge_asof(left=X,right=df_mcap,left_on='date',right_on='QuarterEndDate',by='ticker')
        #There will be many null MCAPs because many stocks listed later and couldn't 
        #be filled by as of join in the cartesian product. Delete the nulls
        X = X.loc[X.MCAP.notnull(),['date','ticker','MCAP']]
        #For each date sort the tickers by descending market cap
        X = X.sort_values(['date','MCAP'],ascending = (True,False))
        #Now retain top stockUniverseSize rows per date and delete the rest
        #(rows are already sorted by descending MCAP within each date)
        X = X.groupby('date').head(stockUniverseSize)
        #Clean-up: renumber the rows from 0
        X = X.reset_index(drop=True)

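        #Check for bad data. Note this is not percentage but fractional change!
        X['pct_change_mcap'] = X.groupby('ticker')['MCAP'].pct_change()
        #Keep only rows whose fractional change in market cap since the ticker's
        #previous row is below 10 in absolute value; checked visually, a larger
        #change is more likely an error than real. NaN fails the comparison, so
        #each ticker's first row (which has no previous value) is dropped as well
        X = X[X.pct_change_mcap.abs() < 10]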
        return X
    
    #Create rebalance dates x top stockUniverseSize train and test place holders
    X_train = createDFtopN(trainDateIndex,df_mcap,stockUniverseSize)
    X_test = createDFtopN(testDateIndex,df_mcap,stockUniverseSize)
    
    all_features = dfn.indicator.unique()
    
    return X_train, X_test
X_train, X_test = createFeatureSet(dfn)

In [147]:
X_train

| | date | ticker | MCAP | pct_change_mcap |
|---|---|---|---|---|
| 200 | 2005-04-30 | ONGC | 124783.48 | 0.0 |
| 201 | 2005-04-30 | NTPC | 70952.22 | 0.0 |
| 202 | 2005-04-30 | IOC | 51126.22 | 0.0 |
| 203 | 2005-04-30 | BHARTIARTL | 38133.03 | 0.0 |
| 204 | 2005-04-30 | SAIL | 26116.53 | 0.0 |
| 205 | 2005-04-30 | TATASTEEL | 22210.32 | 0.0 |
| 206 | 2005-04-30 | BHEL | 18704.31 | 0.0 |
| 207 | 2005-04-30 | GAIL | 17994.63 | 0.0 |
| 208 | 2005-04-30 | TATAMOTORS | 14956.64 | 0.0 |
| 209 | 2005-04-30 | NMDC | 14722.29 | 0.0 |


## Problem Formulation
### Context 
We are trying to create a robot to replace a portfolio manager. Typically, a long-short equity portfolio manager would identify stocks, using a combination of fundamental and technical analysis, that she believes would out-perform the benchmark (typically an index) and those that would under-perform it in the long term (typically 2-3 years).

She would go long the former and short the latter to create a delta-neutral long-short portfolio of stocks. Periodically, typically every month or on special events like earnings releases, she would re-assess her portfolio in light of fresh information and possibly "rebalance" it, i.e. change the weights of the stock holdings in her portfolio, slightly or at times dramatically.

Her objective is to achieve great long-term returns. The clients would choose to invest in those funds whose fund-managers have a track-record of consistently providing great returns on their investment after taking away all costs and fees. Clients are usually also concerned about risk of price fluctuation in the short-term, and not just long-term returns. 

### Formulation

Let's say we are sitting at time $t$. We build our stock universe using the latest available market-cap information and choose the top $N$ (say 200) stocks for which both fundamental and price data are available.

As of time $t$, we have information about company financials and stock price history for each of the $N$ stocks in the form of a feature vector. These features are derived from the most recent earnings data released by each company. We will add more features that are "technical", i.e. not related to the company's financial health but only to its stock price movements. Let's say there are $F$ features. All features are stored in a matrix $X_t$ of size $N \times F$. Let's say we have a model, yet to be optimized, which ranks the $N$ stocks from "best" to "worst" as a function of all their features. We go long (i.e. buy) the top 20% of stocks and short (i.e. borrow and sell) the bottom 20%, making an equally weighted long-short portfolio.
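The quintile construction described above can be sketched as below, assuming the model's output is a score per stock where a higher score means "better". The function name `quintile_weights` and the toy scores are hypothetical.

```python
import pandas as pd

def quintile_weights(scores: pd.Series, fraction: float = 0.2) -> pd.Series:
    """Equal-weighted long-short weights from model scores (hypothetical helper)."""
    k = int(round(len(scores) * fraction))  # number of stocks on each side
    weights = pd.Series(0.0, index=scores.index)
    if k == 0:
        return weights
    ranked = scores.sort_values(ascending=False)
    weights.loc[ranked.index[:k]] = 1.0 / k    # long the top 20%
    weights.loc[ranked.index[-k:]] = -1.0 / k  # short the bottom 20%
    return weights

# Toy universe of 10 stocks with scores 0..9
scores = pd.Series([float(i) for i in range(10)], index=[f'S{i}' for i in range(10)])
w = quintile_weights(scores)
print(w[w != 0])
```

The long weights sum to +1 and the short weights to -1, so the net exposure of the portfolio is zero.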

We move to time $t+1$ where, let's say, time is measured in months (a typical rebalance horizon for a long-term fund). Let the $i$th stock see a return of $y_{t,i}$ from time $t$ to $t+1$. At $t+1$, we repeat the process with a fresh ranking of the refreshed top $N$ market-cap stocks as a function of $X_{t+1}$, and then build another quintile portfolio based on it. We will *rebalance* our portfolio by buying and selling stocks such that at $t+1$ we hold the new top 20% as longs and the new bottom 20% as shorts.
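As a rough sketch of the two quantities in this paragraph, the snippet below computes the forward return $y_{t,i}$ from a table of prices on rebalance dates, and the trades needed to move from the old weights to the new ones. The helper names, tickers and prices are made up for illustration.

```python
import pandas as pd

def forward_returns(prices: pd.DataFrame) -> pd.DataFrame:
    """Row t holds the return from rebalance date t to t+1 (hypothetical helper)."""
    return prices.shift(-1) / prices - 1.0

def rebalance_trades(old_w: pd.Series, new_w: pd.Series) -> pd.Series:
    """Change in weight per ticker; tickers absent on one side count as weight 0."""
    tickers = old_w.index.union(new_w.index)
    old = old_w.reindex(tickers, fill_value=0.0)
    new = new_w.reindex(tickers, fill_value=0.0)
    return new - old

# Toy adjusted prices on three rebalance dates
prices = pd.DataFrame(
    {'AAA': [100.0, 110.0, 99.0], 'BBB': [50.0, 50.0, 55.0]},
    index=pd.to_datetime(['2012-01-31', '2012-02-29', '2012-03-31']),
)
y = forward_returns(prices)  # the last row is NaN: its future is not yet known

old_w = pd.Series({'AAA': 0.5, 'BBB': -0.5})
new_w = pd.Series({'BBB': 0.5, 'CCC': -0.5})
trades = rebalance_trades(old_w, new_w)
print(y)
print(trades)
```

The sum of absolute trades gives a simple turnover figure, which could be used to estimate transaction costs.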

How does the model rank the $N$ stocks? By estimating $y_t$ at time $t$, which is unknown at time $t$ but known at time $t+1$. With the benefit of hindsight, we can create a training set for our model that is aware of the values of $y_t$ at each time $t$. Various machine learning algorithms can then be applied to predict $y_t$. We can use the mean squared error in the prediction of $y_t$ as the objective function to be minimized.
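A minimal sketch of this training step, using ordinary least squares as the model since it minimizes mean squared error directly. The data here is synthetic, and pooling all (date, stock) rows into one design matrix is an assumption made for illustration rather than the method settled on in this study.

```python
import numpy as np

def fit_linear(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Least-squares fit with an intercept; returns [intercept, betas...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef: np.ndarray, X: np.ndarray) -> np.ndarray:
    return coef[0] + X @ coef[1:]

# Synthetic history: 500 pooled (date, stock) rows with F = 3 features
rng = np.random.default_rng(0)
X_hist = rng.normal(size=(500, 3))
true_beta = np.array([0.02, -0.01, 0.0])
y_hist = 0.001 + X_hist @ true_beta + rng.normal(scale=0.001, size=500)

coef = fit_linear(X_hist, y_hist)
y_hat = predict_linear(coef, X_hist)
mse = np.mean((y_hist - y_hat) ** 2)  # the objective being minimized
print(coef, mse)
```

At a rebalance date, the predictions for the current feature matrix would serve as the scores by which the stocks are ranked.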

To select the best model, we will not limit ourselves to the mean squared error in stock returns as a metric, but will instead use the out-of-sample performance of the quintile portfolio as our guiding light. Standard metrics of portfolio performance, such as Compound Annual Growth Rate (CAGR), Annualized Volatility (Vol), Sharpe Ratio, Information Ratio and Maximum Drawdown (MaxDD), will be computed along with a cumulative return chart to help us make a balanced decision about model selection.
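Some of these metrics can be sketched as below for a series of periodic portfolio returns. The function name `performance_summary` is hypothetical, the Sharpe ratio assumes a zero risk-free rate, and the Information Ratio is left out because it needs a benchmark return series.

```python
import numpy as np

def performance_summary(returns, periods_per_year: int = 12) -> dict:
    """CAGR, annualized volatility, Sharpe ratio and maximum drawdown (hypothetical helper)."""
    r = np.asarray(returns, dtype=float)
    wealth = np.cumprod(1.0 + r)  # growth of 1 unit invested at the start
    years = len(r) / periods_per_year
    cagr = wealth[-1] ** (1.0 / years) - 1.0
    vol = r.std(ddof=1) * np.sqrt(periods_per_year)
    # Assumption: risk-free rate of zero
    sharpe = r.mean() * periods_per_year / vol if vol > 0 else float('nan')
    # Running peak of wealth, counting the starting value of 1
    peak = np.maximum.accumulate(np.concatenate([[1.0], wealth]))[1:]
    max_dd = float(np.min(wealth / peak - 1.0))
    return {'CAGR': cagr, 'Vol': vol, 'Sharpe': sharpe, 'MaxDD': max_dd}

# Toy example: three yearly returns of +10%, -50% and +20%
stats = performance_summary([0.10, -0.50, 0.20], periods_per_year=1)
print(stats)
```

Plotting the `wealth` series against the rebalance dates would give the cumulative return chart mentioned above.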

