## Derivations of beta β
A linear regression takes the form of:  
               
$$ y = bx + ε $$ 

where x is the regressor, b is its coefficient, and ε is noise. The fitted value can be expressed as:  
               
$$ \hat{y} = \hat{b} x $$ 
               
As the Ordinary Least Squares (OLS) method states, the best estimate of beta minimizes the squared difference between the estimated value $ \hat{y} $ and the actual y. Thus we have:  
               
$$ \underset{\hat{b}}{\arg\min}\ (y - \hat{y})^2 $$ 

Let:  
        
$$ F = (y - \hat{y})^2 $$

$$ F = (y - \hat{b} x)^2 $$

Take the derivative of F with respect to $\hat{b}$ and set it equal to 0 to find the minimum:  
  
$$ \frac{dF}{d\hat{b}} = 2(y - \hat{b} x)(-x) = 0 $$  

$$ (y - \hat{b} x)(-x) = 0 $$  

$$ \hat{b} x x' = xy $$  

where x' denotes the transpose of x. The expression for the estimated beta is therefore:   

$$ \hat{b} = (xx')^{-1} xy $$
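
The closed-form estimate can be checked numerically. The following is a minimal sketch on synthetic data, not the notebook's price series: the names `X`, `true_b` and the noise level are assumptions made for illustration. Observations are stored as rows here, so the product written as xx' in the text becomes `X.T @ X` in code.

```python
import numpy as np

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
n = 500
true_b = np.array([2.0, -1.0])            # assumed "true" coefficients
X = rng.normal(size=(n, 2))               # n observations, 2 regressors
y = X @ true_b + 0.1 * rng.normal(size=n)

# Normal equations: (X'X) b_hat = X'y.
# np.linalg.solve avoids forming the inverse explicitly.
b_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Same estimate from NumPy's least-squares routine, as a cross-check
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b_hat)
print(b_lstsq)
```

Both routes should agree to floating-point precision, and both should land close to `true_b` because the noise is small relative to the sample size.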

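Now, for the standard deviation of beta, first calculate the variance of the estimate $\hat{b}$, substituting $y = bx + ε$:  

$$ Var(\hat{b}) = Var((xx')^{-1} xy) $$

$$ Var(\hat{b}) = Var((xx')^{-1} x (b x + ε)) $$

$$ Var(\hat{b}) = Var((xx')^{-1} x b x + (xx')^{-1} x ε) $$

For a given x, the first term is the constant b, which has no variance, so only the noise term remains:  

$$ Var(\hat{b}) = Var((xx')^{-1} x ε) $$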
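
The last expression can be carried through to a standard error. This is a sketch under assumptions that are not stated above: x is treated as fixed, and the noise is homoskedastic and uncorrelated across observations, so that $Var(ε) = σ^2 I$. Here n is the number of observations, k the number of regressors, and $\hat{ε} = y - \hat{b} x$ the residual.

$$ Var(\hat{b}) = (xx')^{-1} x \, Var(ε) \, x' (xx')^{-1} $$

$$ Var(\hat{b}) = σ^2 (xx')^{-1} x x' (xx')^{-1} = σ^2 (xx')^{-1} $$

$$ \hat{σ}^2 = \frac{\hat{ε} \hat{ε}'}{n - k} $$

$$ se(\hat{b}_j) = \sqrt{\hat{σ}^2 \left[ (xx')^{-1} \right]_{jj}} $$

The t statistic for each coefficient is then $\hat{b}_j / se(\hat{b}_j)$.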




In [1]:
import numpy as np
import pandas as pd


In [521]:
# Load the price data and drop the Date column
df = pd.read_csv('data.csv').drop('Date', axis=1)
df.head()

,BABA,FB
0,76.690002,102.220001
1,78.629997,102.730003
2,77.330002,102.970001
3,72.720001,97.919998
4,70.800003,97.330002


In [491]:
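class regression():
    """
    This class contains the statistical
    tools needed for this pair trading
    project: OLS coefficients, standard
    errors and t statistics.
    """
    def __init__(self, x, y, *args):
        """
        x and y are pandas DataFrames.
        The reports read only the first column of y.
        """
        # Work on copies so addconst() does not modify the caller's data
        self.x = x.copy()
        self.y = y.copy()
        # Keep references to the original inputs
        self.x_copy = x
        self.y_copy = y
        self.coef_dict = {}
        self.coef = np.array([])
        self.error = np.array([])
        self.std_dict = {}
        self.t_dict = {}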

    
    def addconst(self):
        self.x['constant'] = np.ones(len(self.x))

    def fit(self):
        # Coefficients from the normal equations: (X'X)^-1 X'y
        self.coef = np.linalg.inv(self.x.T.dot(self.x)).dot(self.x.T.dot(self.y))
        for i in range(len(self.x.columns)):
            self.coef_dict[self.x.columns[i]] = self.coef[i][0]
        
        # Residuals and their covariance (divided by n at this stage)
        self.error = pd.DataFrame(self.y.values - self.x.dot(self.coef))
        self.residual_cov = self.error.T.dot(self.error) / len(self.y)
        # Covariance of the coefficient estimates
        self.coef_cov = np.kron(np.linalg.inv(self.x.T.dot(self.x)),
                                self.residual_cov)
        # Standard errors, with the n / (n - k) degrees-of-freedom correction
        n, k = self.y.shape[0], self.x.shape[1]
        self.coef_stderror = np.sqrt(n / (n - k) * np.diag(self.coef_cov))
        
        for i in range(len(self.x.columns)):
            self.std_dict[self.x.columns[i]] = self.coef_stderror[i]
        
        # t statistics: each coefficient divided by its standard error
        self.tstats = self.coef.T / self.coef_stderror.reshape(self.coef.T.shape)
        for i in range(len(self.x.columns)):
            self.t_dict[self.x.columns[i]] = self.tstats[0][i]
        
        
    def report(self):
        print('='*20+'Beta Result Report'+'='*20+'\n')
        for each in self.coef_dict:
            print('[+]Beta {}: {}'.format(each, self.coef_dict[each]))
        print('\n'+'='*20+'T statistics Result Report'+'='*20+'\n')
        for each in self.t_dict:
            print('[+]Beta {}: {}'.format(each, self.t_dict[each]))

        print('\n')
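
The coefficient, standard-error and t-statistic steps in `fit` can also be written compactly for a single dependent variable. The sketch below is a hypothetical standalone helper (`ols_summary` is not part of the class above) and runs on synthetic data with an assumed slope of 3.0 and intercept of 0.5. It divides the residual sum of squares by n - k directly, which is equivalent to dividing by n and then applying the n / (n - k) correction.

```python
import numpy as np

def ols_summary(X, y):
    """Return (coef, std_err, t_stats) for y = X b + noise.

    X is an (n, k) array of regressors and y an (n,) array.
    Hypothetical helper for illustration only.
    """
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    coef = xtx_inv @ X.T @ y
    resid = y - X @ coef
    sigma2 = resid @ resid / (n - k)              # residual variance estimate
    std_err = np.sqrt(sigma2 * np.diag(xtx_inv))  # one standard error per coefficient
    return coef, std_err, coef / std_err

# Synthetic example (assumed true slope 3.0 and intercept 0.5)
rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(size=n)
X = np.column_stack([x1, np.ones(n)])             # regressor plus constant column
y = 3.0 * x1 + 0.5 + rng.normal(size=n)

coef, std_err, t_stats = ols_summary(X, y)
print(coef, std_err, t_stats)
```

With unit-variance noise and 1000 observations, the standard errors should be roughly 0.03, so both t statistics come out large.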

In [522]:
df.head()

,BABA,FB
0,76.690002,102.220001
1,78.629997,102.730003
2,77.330002,102.970001
3,72.720001,97.919998
4,70.800003,97.330002


In [523]:
# Both x and y are built from the full frame, so they hold the same BABA and FB columns
x = pd.DataFrame(df)
y = pd.DataFrame(df)

In [524]:
x.head()

,BABA,FB
0,76.690002,102.220001
1,78.629997,102.730003
2,77.330002,102.970001
3,72.720001,97.919998
4,70.800003,97.330002


In [525]:
y.head()

,BABA,FB
0,76.690002,102.220001
1,78.629997,102.730003
2,77.330002,102.970001
3,72.720001,97.919998
4,70.800003,97.330002


In [504]:
model = regression(x,y)
model.addconst()
model.fit()
model.report()

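====================Beta Result Report====================

[+]Beta BABA: 1.0000000000000109
[+]Beta FB: -2.7590228611282758e-14
[+]Beta constant: 5.987966785180904e-13

====================T statistics Result Report====================

[+]Beta BABA: 238002823132444.62
[+]Beta FB: -12.886779566522529
[+]Beta constant: 21472.18709591645
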



In [505]:
model.t_dict = {'FB': model.t_dict}
model.std_dict = {'FB':model.std_dict}
model.coef_dict = {'FB':model.coef_dict}

In [506]:
t1 = pd.DataFrame(model.t_dict)
t2 = pd.DataFrame(model.std_dict)
t3 = pd.DataFrame(model.coef_dict)


In [507]:
# t1 holds the t statistics, t2 the standard errors, t3 the coefficients
t = pd.concat({'Coefficient': t3,
               'std': t2,
               'T statistics': t1}, axis=1)
t

,Coefficient,std,T statistics
,FB,FB,FB
BABA,1.0,4.201631e-15,238002800000000.0
FB,-2.759023e-14,2.140972e-15,-12.88678
constant,5.987967e-13,2.7887080000000004e-17,21472.19


In [509]:
#testing

In [514]:
import statsmodels.api as sm
from sklearn import linear_model


In [526]:
x

,BABA,FB
0,76.690002,102.220001
1,78.629997,102.730003
2,77.330002,102.970001
3,72.720001,97.919998
4,70.800003,97.330002
...,...,...
1190,269.730011,249.529999
1191,271.089996,254.820007
1192,276.010010,256.820007
1193,276.929993,261.790009


In [527]:
y

,BABA,FB
0,76.690002,102.220001
1,78.629997,102.730003
2,77.330002,102.970001
3,72.720001,97.919998
4,70.800003,97.330002
...,...,...
1190,269.730011,249.529999
1191,271.089996,254.820007
1192,276.010010,256.820007
1193,276.929993,261.790009


In [528]:
regr = linear_model.LinearRegression()
regr.fit(x, y)

LinearRegression()

In [529]:
regr.coef_

array([[ 1.00000000e+00, -6.46424909e-17],
       [-2.58569964e-16,  1.00000000e+00]])
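
The near-identity `coef_` above is what least squares returns when the same columns serve as both regressors and targets, which is the case here because x and y display identical data. A small NumPy check on synthetic data (a random stand-in, not the notebook's prices) shows the same pattern: each column is explained perfectly by itself, with zero weight on the other column and on the constant.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(200, 2))                  # stand-in for two price columns
X = np.column_stack([data, np.ones(len(data))])   # add a constant column

# Least-squares fit of both columns on themselves plus a constant
coef, *_ = np.linalg.lstsq(X, data, rcond=None)

# Identity block for the two columns, zeros for the constant
expected = np.vstack([np.eye(2), np.zeros((1, 2))])
print(np.allclose(coef, expected))
```

To obtain an informative fit, the regressor and the dependent variable would need to be different columns.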