<a href="https://colab.research.google.com/github/daehkim/pair-trading/blob/master/tradingStrategy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS7641 Machine Learning
*Application of Machine Learning in Pairs Trading*

In [0]:
import pandas as pd
import numpy as np
import os
import datetime
import math
import sklearn
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

## Price History Table
This is the price history table used by the functions below. The first three stocks are shown as an example; the data will be replaced once actual pairs have been selected.

In [17]:
# Import training dataset
training_set = pd.read_csv("training_data.csv")

# Keep only the price columns
filter_col = [col for col in training_set if col.startswith('price_')]
training_set_price = training_set[filter_col]
training_set_price.head(3)

Unnamed: 0,price_20070103,price_20070104,price_20070105,price_20070108,price_20070109,price_20070110,price_20070111,price_20070112,price_20070116,price_20070117,price_20070118,price_20070119,price_20070122,price_20070123,price_20070124,price_20070125,price_20070126,price_20070129,price_20070130,price_20070131,price_20070201,price_20070202,price_20070205,price_20070206,price_20070207,price_20070208,price_20070209,price_20070212,price_20070213,price_20070214,price_20070215,price_20070216,price_20070220,price_20070221,price_20070222,price_20070223,price_20070226,price_20070227,price_20070228,price_20070301,...,price_20151104,price_20151105,price_20151106,price_20151109,price_20151110,price_20151111,price_20151112,price_20151113,price_20151116,price_20151117,price_20151118,price_20151119,price_20151120,price_20151123,price_20151124,price_20151125,price_20151127,price_20151130,price_20151201,price_20151202,price_20151203,price_20151204,price_20151207,price_20151208,price_20151209,price_20151210,price_20151211,price_20151214,price_20151215,price_20151216,price_20151217,price_20151218,price_20151221,price_20151222,price_20151223,price_20151224,price_20151228,price_20151229,price_20151230,price_20151231
0,49.06,50.34,49.63,49.5,50.62,49.68,49.2,48.9,46.83,47.05,46.32,47.49,45.11,44.85,44.57,45.9,45.55,45.8,46.01,46.63,45.5,46.07,45.41,44.48,45.71,46.9,45.71,46.795,46.91,46.27,46.49,46.66,48.05,47.89,47.24,46.25,45.75,44.0,45.32,44.14,...,80.0,79.35,86.34,87.47,86.88,86.72,83.0,86.51,85.5,83.54,84.92,84.6,85.46,87.43,89.66,88.16,89.03,91.18,89.35,90.68,89.89,93.29,88.53,86.75,86.84,83.6,83.92,80.51,82.61,82.74,80.8,79.01,76.4,77.93,78.98,80.55,78.97,77.15,77.22,77.15
1,41.02,41.89,39.66,40.42,40.54,41.54,41.41,42.23,41.71,41.28,42.74,42.76,42.53,41.85,42.2,41.94,41.53,41.64,41.67,41.28,41.57,40.98,41.2,41.35,41.48,41.32,40.93,41.13,41.46,41.41,41.25,41.77,41.46,40.59,41.44,40.41,40.01,37.47,39.69,38.84,...,123.95,123.48,114.87,113.83,113.73,114.99,113.93,113.73,114.71,113.99,114.98,115.18,115.65,116.14,116.46,117.47,118.06,116.68,117.15,115.71,115.02,116.5,116.37,117.04,115.58,114.86,114.31,114.7,115.35,115.69,114.37,112.69,112.92,114.53,115.9,118.21,117.64,119.37,119.25,116.67
2,24.12,24.3,23.89,23.7,23.13,20.9,20.84,21.58,20.79,20.58,19.97,20.21,20.05,20.61,21.22,18.18,17.62,17.35,16.86,16.8,16.86,17.29,17.22,17.22,17.68,17.71,17.57,17.61,17.39,16.85,17.01,17.0,17.45,17.23,17.21,17.2,17.31,16.65,16.4,16.56,...,35.87,36.92,37.17,36.58,36.65,37.17,36.79,35.77,36.46,36.11,36.41,36.94,37.22,37.07,37.1,37.34,37.96,37.2,37.78,37.63,37.4,37.51,36.96,36.49,35.48,35.5,34.94,34.66,35.32,35.57,35.26,34.99,35.22,35.08,35.95,36.05,35.67,35.92,35.46,34.92
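
The column names encode the trading date as `price_YYYYMMDD`. As a small sketch of how the wide table could be turned into a date-indexed layout, the block below rebuilds the first three columns shown above by hand and parses the dates from the column names. The names `wide` and `prices_by_date` are illustrative only and are not used elsewhere in this notebook.

```python
import pandas as pd

# First three price columns of rows 0-2, copied from the table above.
wide = pd.DataFrame({
    "price_20070103": [49.06, 41.02, 24.12],
    "price_20070104": [50.34, 41.89, 24.30],
    "price_20070105": [49.63, 39.66, 23.89],
})

# Strip the "price_" prefix and parse the remainder as a date.
dates = pd.to_datetime(
    wide.columns.str.replace("price_", "", regex=False), format="%Y%m%d"
)

# Transpose so that each row is a date and each column is a stock.
prices_by_date = wide.T
prices_by_date.index = dates
print(prices_by_date)
```

With this layout, one stock's history is a single column, for example `prices_by_date[0]`.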


## Create the spread function (relation between a pair of prices)
Here we create the spread function. The basic spread is defined as below:

$\text{Spread} = \log(a) - n \log(b)$

where 'a' and 'b' are the prices of stocks A and B, respectively, and 'n' is the hedge ratio. Our goal is to model the dynamics of the spread with supervised machine learning. The candidate methods are linear regression and the support vector machine (SVM).
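
To make the definition concrete, here is a minimal sketch that evaluates the spread on the first five prices of rows 0 and 1 of the price table. It rests on two assumptions that are not part of this notebook's method: the hedge ratio is estimated as the ordinary least squares slope of log(a) on log(b), and the z-score standardizes the spread by its own mean and standard deviation over the window. The helper name `log_spread` is hypothetical.

```python
import numpy as np

# First five prices of rows 0 and 1 in the price table above.
a = np.array([49.06, 50.34, 49.63, 49.50, 50.62])
b = np.array([41.02, 41.89, 39.66, 40.42, 40.54])

def log_spread(a, b, n):
    """Spread = log(a) - n * log(b), computed element-wise."""
    return np.log(a) - n * np.log(b)

# Assumption: estimate n as the OLS slope of log(a) regressed on log(b).
n, intercept = np.polyfit(np.log(b), np.log(a), deg=1)

spread = log_spread(a, b, n)

# Assumption: standardize against the window's own mean and standard deviation.
z_score = (spread - spread.mean()) / spread.std()
print(n, spread, z_score)
```

Five observations are far too few for a reliable estimate; the window is kept short only so the arithmetic is easy to follow.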

In [0]:
def create_spread_function(a, b, start_t, end_t, alg='log'):
    """
    * Input
        - a, b: Price histories of stocks A and B
        - start_t, end_t: Start/end positions of the analysis window, in the
            same unit as the data. For example, 0 is the first observation
            of a and b. The window is the half-open slice [start_t, end_t),
            i.e. a[start_t] .. a[end_t - 1] and b[start_t] .. b[end_t - 1].
        - alg: Type of algorithm: 'log' (log normalization),
            'lr' (linear regression) or 'svm' (not implemented yet)
    * Output
        - The spread function. Calling it with a pair of prices returns
            (spread, z_score).
    * Function
        - Apply supervised machine learning to find the dynamics of the
            spread
    """

    def log_spread_func(a, b):
        # Spread = log(a) - n*log(b), using the fitted weight as n
        spread = math.log(a) - w_avg * math.log(b)
        # Scale by the standard deviation of the fitted weights
        z_score = spread/w_std

        return (spread, z_score)

    def lr_spread_func(a, b):
        spread = a - w_avg * b
        z_score = spread/w_std

        return (spread, z_score)

    def svm_spread_func(a, b):
        # Placeholder: the SVM-based spread is not implemented yet
        raise NotImplementedError("The SVM spread function is not implemented yet")

    # Slice the analysis window by position; end_t is exclusive
    target_a = np.asarray(a, dtype=float)[start_t:end_t]
    target_b = np.asarray(b, dtype=float)[start_t:end_t]

    # Find the coefficient of the log normalization
    if alg == 'log':
        print("Log normalization")
        # Work in log prices
        target_a = np.log(target_a)
        target_b = np.log(target_b)

        # Calculate the weight as the average ratio log(b)/log(a)
        w_list = target_b/target_a
        w_avg = np.average(w_list)

        # Calculate the standard deviation for the z-score calculation
        w_std = np.std(w_list)

        return log_spread_func

    # Find the coefficient of the linear regression
    elif alg == 'lr':
        print("Linear regression")
        # Fit b as a linear function of a; scikit-learn expects a 2-D
        # feature array, so reshape a into a single column
        target_a = target_a.reshape(-1, 1)
        regr = linear_model.LinearRegression()
        regr.fit(target_a, target_b)

        # Calculate the weight as the average ratio of actual to predicted b
        predict_b = regr.predict(target_a)
        w_list = target_b/predict_b
        w_avg = np.average(w_list)

        # Calculate the standard deviation for the z-score calculation
        w_std = np.std(w_list)

        return lr_spread_func

    elif alg == 'svm':
        print("Support Vector Machine")
        return svm_spread_func

    # Unknown algorithm: fail loudly instead of silently returning None
    raise ValueError("Check the algorithm. Input was " + str(alg))

### How to use the spread function

Here we show how to use the spread function.
For now the results are poor, because stocks a and b were chosen arbitrarily and have no known relationship.

In [90]:
# Code verification
a = training_set_price.loc[0]
b = training_set_price.loc[1]

# Check the function based on the log normalization
spread_func = create_spread_function(a, b, 0, 1000, 'log')
# Use .iloc for positional access; the Series is indexed by column name
(spread, z_score) = spread_func(a.iloc[0], b.iloc[0])
print(spread, z_score)

# Check the function based on the linear regression
spread_func = create_spread_function(a, b, 0, 1000, 'lr')
(spread, z_score) = spread_func(a.iloc[0], b.iloc[0])
print(spread, z_score)

Log normalization
-0.1478672531468277 -1.2993837129839456
Linear regression
8.042285297435534 53.31521967302066


## Generate the z-score matrix
The 'gen_z_score_matrix()' function generates the z-score matrix.
Our strategy will use it to decide how many shares to buy or sell.
* Input: list of stock pairs, current stock prices
* Output: z-score matrix
* Function: calculate the z-score of every pair in the list and collect the results in a matrix
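
One possible shape for this step is sketched below. It is only an assumption about the eventual design: the function name `gen_z_score_matrix_sketch`, its arguments, and the choice of NaN for cells without a pair are all hypothetical. Each pair is expected to have its own spread function that returns `(spread, z_score)`, as the functions returned by `create_spread_function` do. To keep the block self-contained it uses a toy spread function whose weight and standard deviation are placeholder values, not fitted ones.

```python
import math
import numpy as np

def gen_z_score_matrix_sketch(pairs, prices, spread_funcs):
    """Hypothetical sketch: entry [i, j] holds the z-score of pair (i, j)."""
    n = len(prices)
    # NaN marks cells that do not correspond to a listed pair.
    z = np.full((n, n), np.nan)
    for (i, j) in pairs:
        _, z_score = spread_funcs[(i, j)](prices[i], prices[j])
        z[i, j] = z_score
    return z

def toy_spread_func(w, std):
    """Stand-in for a fitted spread function; w and std are placeholders."""
    def f(a, b):
        spread = math.log(a) - w * math.log(b)
        return spread, spread / std
    return f

# First-day prices of rows 0-2 in the price table.
prices = [49.06, 41.02, 24.12]
pairs = [(0, 1), (0, 2)]
spread_funcs = {
    (0, 1): toy_spread_func(1.0, 0.1),
    (0, 2): toy_spread_func(1.0, 0.5),
}

z_matrix = gen_z_score_matrix_sketch(pairs, prices, spread_funcs)
print(z_matrix)
```

A dense N x N matrix is simple to index but wastes space when only a few pairs are tracked; a dictionary keyed by pair would be a reasonable alternative.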

In [0]:
def gen_z_score_matrix():
    pass

In [0]:
# Code verification


## Generate the buy/sell matrix
Based on the z-score matrix, we need to decide which stocks to buy and which to sell. We buy when the z-score rises above an entry point (e.g. 2-sigma), which is passed as an argument. In the same way, we sell to take profit when the z-score falls below a given point (e.g. 0). We also need to handle pairs that behave unusually: if we buy at 2-sigma and the z-score then reaches 3-sigma, we sell to stop the loss.
* Input: entry point, stop-loss point, take-profit point, trading commission, z-score matrix
* Output: buy matrix, sell matrix (each element is the amount of stock to buy/sell)
* Function: decide our actions based on the z-score matrix
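
The rules above can be sketched as threshold tests on the z-score matrix. This is a minimal illustration under several assumptions: the function name `gen_action_matrix_sketch`, the extra `holding` input (which pairs are currently held), and the fixed trade size `unit` are hypothetical, and the trading commission is left out entirely.

```python
import numpy as np

def gen_action_matrix_sketch(z, holding, entry=2.0, stop_loss=3.0,
                             take_profit=0.0, unit=1.0):
    """Hypothetical sketch: return (buy, sell) matrices of trade amounts."""
    z = np.asarray(z, dtype=float)
    holding = np.asarray(holding, dtype=bool)
    valid = ~np.isnan(z)  # ignore cells without a pair

    # Buy: not yet held and the z-score is between entry and stop loss.
    buy_mask = valid & ~holding & (z >= entry) & (z < stop_loss)
    # Sell: held and either the take-profit or the stop-loss level is hit.
    sell_mask = valid & holding & ((z <= take_profit) | (z >= stop_loss))

    # Assumption: every trade has the same size `unit`.
    return buy_mask * unit, sell_mask * unit

# Made-up z-scores and holdings, chosen only to exercise each rule.
z = np.array([[np.nan, 2.5],
              [-0.1, 3.2]])
holding = np.array([[False, False],
                    [True, True]])

buy, sell = gen_action_matrix_sketch(z, holding)
print(buy)
print(sell)
```

Here pair (0, 1) is bought at 2.5 sigma, pair (1, 0) is sold to take profit at -0.1, and pair (1, 1) is sold at 3.2 sigma to stop the loss.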

In [0]:
def gen_action_matrix():
    pass

In [0]:
# Code verification