# Basic OLS

This notebook estimates a linear regression and tests various hypotheses.

## Loading Packages

In [1]:
using Dates, DelimitedFiles, Statistics, LinearAlgebra, Distributions

include("jlFiles/printmat.jl")
include("jlFiles/NWFn.jl")

printblue(txt) = printstyled(string(txt,"\n"),color=:blue,bold=true)

printblue (generic function with 1 method)

## Loading Data

In [2]:
x = readdlm("Data/FFmFactorsPs.csv",',',skipstart=1)

                #yearmonth, market, small minus big, high minus low
(ym,Rme,RSMB,RHML) = (x[:,1],x[:,2]/100,x[:,3]/100,x[:,4]/100) 
x = nothing                   

printlnPs("Sample size:",size(Rme))

Sample size:       388


## OLS Estimates and Their Distribution

Consider the linear regression

$
y_{t}=\beta^{\prime}x_{t}+u_{t},
$

where $y_{t}$ is a scalar and $x_{t}$ is $k\times1$. The OLS estimate is

$
\hat{\beta} = S_{xx}^{-1}S_{xy}, \: \text{ where } \: 
S_{xx}      = \sum\nolimits_{t=1}^{T}x_{t}x_{t}^{\prime}
\: \text{ and } \:
S_{xy}      = \sum\nolimits_{t=1}^{T}x_{t}y_{t}.
$

When $x_t$ and $u_t$ are independent (Gauss-Markov assumptions...), then the distribution of the estimates is (typically)

$
(\hat{\beta}-\beta_{0})\overset{d}{\rightarrow}N(0,S_{xx}^{-1}\sigma^2),
$

where $\sigma^2$ is the variance of the residual.

To calculate the estimates and the covariance matrix, we define the matrices $X_{T\times k}$ and $Y_{T\times1}$
by letting $x_{t}^{\prime}$ and $y_{t}$ be the $t^{th}$ rows

$
X_{T\times k}=\left[
\begin{array}[c]{l}
x_{1}^{\prime}\\
\vdots\\
x_{T}^{\prime}
\end{array}
\right] \ \text{ and } \ Y_{T\times1}=\left[
\begin{array}[c]{l}
y_{1}\\
\vdots\\
y_{T}
\end{array}
\right].
$

The estimates can then be calculated as 
```
b = X\Y
```

## A Function for OLS

In [3]:
"""
    OlsGMFn(Y,X)

LS of Y on X; for one dependent variable, Gauss-Markov assumptions

# Usage
(b,u,Yhat,V,R2a) = OlsGMFn(Y,X)

# Input
- `Y::Array`:     Tx1, the dependent variable
- `X::Array`:     Txk matrix of regressors (including deterministic ones)

# Output
- `b::Array`:     kx1, regression coefficients
- `u::Array`:     Tx1, residuals Y - Yhat
- `Yhat::Array`:  Tx1, fitted values X*b
- `V::Array`:     kxk matrix, covariance matrix of b
- `R2a::Number`:  scalar, R2 value

"""
function OlsGMFn(Y,X)
    T    = size(Y,1)
    b    = X\Y
    Yhat = X*b
    u    = Y - Yhat
    σ²   = var(u)
    V    = inv(X'X)*σ²
    R2a  = 1 - σ²/var(Y)
    return b,u,Yhat,V,R2a
end

OlsGMFn

In [4]:
Y = Rme
T = size(Y,1)
X = [ones(T) RSMB RHML]

(b,_,_,V,R2a) = OlsGMFn(Y,X)

printblue("OLS Results")
println("     b        StdErr")
printmat([b sqrt.(diag(V))])

OLS Results
     b        StdErr
     0.007     0.002
     0.217     0.073
    -0.429     0.074
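
A coefficient table like the one above is usually read through t-statistics and confidence bands. The following is a minimal sketch on simulated data (not the Fama-French factors), using the same Gauss-Markov covariance formula as `OlsGMFn`. The names `tstat` and `ci`, the seed and the data-generating values are assumptions made for illustration, and 1.96 is the approximate 97.5% quantile of the standard normal distribution.

```julia
using LinearAlgebra, Statistics, Random

Random.seed!(123)                       #simulated data, for illustration only
T = 500
X = [ones(T) randn(T) randn(T)]         #constant and two regressors
β = [0.5, 1.0, 0.0]                     #assumed "true" coefficients
Y = X*β + 0.3*randn(T)

b      = X\Y
u      = Y - X*b
V      = inv(X'X)*var(u)                #same formula as in OlsGMFn
StdErr = sqrt.(diag(V))

tstat = b./StdErr                       #t-stat for H0: coefficient = 0
ci    = hcat(b - 1.96*StdErr, b + 1.96*StdErr)   #approximate 95% bands

println(round.([b StdErr tstat],digits=3))
println(round.(ci,digits=3))
```

A t-statistic larger than 1.96 in absolute value corresponds to a rejection at (approximately) the 5% level in a two-sided test.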



## Testing a Hypothesis

Since the estimator $\hat{\beta}_{_{k\times1}}$ satisfies

$
\hat{\beta}-\beta_{0} \sim N(0,V_{k\times k})  ,
$

we can easily apply various tests. To test a joint linear hypothesis of the
form

$
\gamma_{q\times1}=R\beta-a=0,
$

where $R$ is a $q\times k$ matrix and $a$ is a $q\times1$ vector, use the test

$
(R\hat{\beta}-a)^{\prime}\Lambda ^{-1}(R\hat{\beta}
-a)\overset{d}{\rightarrow}\chi_{q}^{2} \: \text{, where } \: \Lambda=RVR^{\prime}.
$

In [5]:
R = [0 1 0;               #testing if b[2]=0 and b[3]=0
     0 0 1]
a = [0;0]
Λ = R*V*R'                #covariance matrix of R*b-a
test_stat = (R*b-a)'inv(Λ)*(R*b-a)

printblue("Testing Rb=a:")
println("test-statistic and 10% critical value of chi-square(2)")
printmat([test_stat quantile(Chisq(2),0.9)])

Testing Rb=a:
test-statistic and 10% critical value of chi-square(2)
    60.010     4.605
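
The comparison with a critical value can be complemented by a p-value. The sketch below covers only the special case $q=2$, where the survival function of a $\chi_{2}^{2}$ variable is $\exp(-x/2)$, so both the p-value and the critical value have closed forms. The helper names `chisq2_pvalue` and `chisq2_critval` are hypothetical. For a general $q$, `ccdf(Chisq(q),test_stat)` from the Distributions package should give the p-value.

```julia
#closed forms for chi-square(2) only (assumption: q = 2)
chisq2_pvalue(w)  = exp(-w/2)           #P(chi-square(2) > w), hypothetical helper
chisq2_critval(α) = -2*log(α)           #inverse of the survival function

test_stat = 60.010                      #the value reported above
println("p-value:            ", chisq2_pvalue(test_stat))
println("10% critical value: ", round(chisq2_critval(0.10),digits=3))
```

The critical value agrees with the 4.605 from `quantile(Chisq(2),0.9)`, and the p-value is practically zero, so the hypothesis is clearly rejected.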



## Distribution of OLS Estimates

When the Gauss-Markov assumptions do not hold, the distribution of the estimates is (typically)

$
(\hat{\beta}-\beta_{0})\overset{d}{\rightarrow}N(0,V)
\: \text{ where } \: V=S_{xx}^{-1} S S_{xx}^{-1}
$

and where $S$ is the covariance matrix of $\sum_{t=1}^{T}u_{t}x_{t}$.

*When* the Gauss-Markov assumptions do hold, then $S$ can be simplified as $S=S_{xx}\sigma^2$, where $\sigma^2$ is the variance of $u_t$. Clearly, this means that $V$ can be written $V=S_{xx}^{-1}\sigma^2$.

In [6]:
b   = X\Y
u   = Y - X*b                         #residuals
Sxx = X'X

V_iid = inv(Sxx)*var(u)               #traditional covariance matrix, Gauss-Markov

S_W = (X.*u)'*(X.*u)                  #White's covariance matrix 
V_W = inv(Sxx)'S_W*inv(Sxx)           #Cov(b), White

S_NW  = NWFn(X.*u,1)*T                 #Newey-West covariance matrix
V_NW  = inv(Sxx)'S_NW*inv(Sxx)         #Cov(b), Newey-West

println("\n     b       std_iid    std_W    std_NW")
printmat([b sqrt.(diag(V_iid)) sqrt.(diag(V_W))  sqrt.(diag(V_NW))])


     b       std_iid    std_W    std_NW
     0.007     0.002     0.002     0.002
     0.217     0.073     0.113     0.124
    -0.429     0.074     0.097     0.108
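
To make the role of $S$ concrete without relying on `NWFn.jl`, here is a self-contained sketch on simulated heteroskedastic data. The function `nw_sum_cov` is a hypothetical helper, not the `NWFn` used above: it estimates the covariance matrix of $\sum_{t=1}^{T}u_{t}x_{t}$ directly with Bartlett weights and $m$ lags, assuming that the columns of `g` have zero means (which holds for `X.*u` when `X` includes a constant, since then $X^{\prime}u=0$). The author's `NWFn` may differ in details such as scaling or demeaning.

```julia
using LinearAlgebra, Statistics, Random

#hypothetical helper: Newey-West estimate of Cov(sum of g_t), m lags
function nw_sum_cov(g,m=0)
    T = size(g,1)
    S = g'g                              #lag 0, gives White's estimator when m=0
    for s = 1:m
        Cs = g[s+1:T,:]'*g[1:T-s,:]      #sum of g_t*g_{t-s}'
        S  = S + (1 - s/(m+1))*(Cs + Cs')   #Bartlett weights
    end
    return S
end

Random.seed!(42)                         #simulated data, for illustration only
T = 400
x = randn(T)
X = [ones(T) x]
u = randn(T).*(0.5 .+ abs.(x))           #heteroskedastic residuals
Y = X*[1.0,2.0] + u

b    = X\Y
uhat = Y - X*b
Sxx  = X'X
g    = X.*uhat                           #Txk, row t is x_t'*uhat_t

V_iid = inv(Sxx)*var(uhat)               #Gauss-Markov
V_W   = inv(Sxx)*nw_sum_cov(g,0)*inv(Sxx)   #White
V_NW  = inv(Sxx)*nw_sum_cov(g,1)*inv(Sxx)   #Newey-West, 1 lag

println(round.([b sqrt.(diag(V_iid)) sqrt.(diag(V_W)) sqrt.(diag(V_NW))],digits=3))
```

Since the residual variance increases with $|x_t|$ in this simulation, the White standard error of the slope should come out larger than the Gauss-Markov one.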



In [7]:
"""
    OlsFn(Y,X,m=0)

LS of Y on X; for one dependent variable, using Newey-West covariance matrix

# Usage
(b,u,Yhat,V,R2a) = OlsFn(Y,X,m)

# Input
- `Y::Array`:     Tx1, the dependent variable
- `X::Array`:     Txk matrix of regressors (including deterministic ones)
- `m::Int`:       scalar, bandwidth in Newey-West  

# Output
- `b::Array`:     kx1, regression coefficients
- `u::Array`:     Tx1, residuals Y - Yhat
- `Yhat::Array`:  Tx1, fitted values X*b
- `V::Array`:     kxk matrix, covariance matrix of b
- `R2a::Number`:  scalar, R2 value

"""
function OlsFn(Y,X,m=0)
    T    = size(Y,1)
    b    = X\Y
    Yhat = X*b
    u    = Y - Yhat
    S0   = NWFn(X.*u,m)*T          #Newey-West covariance matrix
    Sxx  = X'X
    V    = inv(Sxx)'S0*inv(Sxx)
    R2a  = 1 - var(u)/var(Y)
    return b,u,Yhat,V,R2a
end

OlsFn

In [8]:
(b,_,_,V,R2a) = OlsFn(Y,X,1)

printblue("OLS Results")
println("     b        StdErr (NW)")
printmat([b sqrt.(diag(V))])

OLS Results
     b        StdErr (NW)
     0.007     0.002
     0.217     0.124
    -0.429     0.108



## Different Ways to Calculate OLS Estimates (extra)

Consider the linear regression

$
y_{t}=\beta^{\prime}x_{t}+u_{t},
$

where $y_{t}$ is a scalar and $x_{t}$ is $k\times1$. The OLS estimate is

$
\hat{\beta} = S_{xx}^{-1}S_{xy}, \: \text{ where } \: 
S_{xx}      = \sum\nolimits_{t=1}^{T}x_{t}x_{t}^{\prime}
\: \text{ and } \:
S_{xy}      = \sum\nolimits_{t=1}^{T}x_{t}y_{t}.
$


Instead of these sums (loops over $t$), matrix multiplication can be used to
speed up the calculations. Create matrices $X_{T\times k}$ and $Y_{T\times1}$
by letting $x_{t}^{\prime}$ and $y_{t}$ be the $t^{th}$ rows

$
X_{T\times k}=\left[
\begin{array}[c]{l}
x_{1}^{\prime}\\
\vdots\\
x_{T}^{\prime}
\end{array}
\right] \ \text{ and } \ Y_{T\times1}=\left[
\begin{array}[c]{l}
y_{1}\\
\vdots\\
y_{T}
\end{array}
\right].
$

We can then calculate the same matrices as

$
S_{xx}       =X^{\prime}X \ \text{ and } \: S_{xy}=X^{\prime}Y \: \text{, so } \: 
\hat{\beta}  =(X^{\prime}X)^{-1}X^{\prime}Y.
$

However, instead of inverting $S_{xx}$, we typically get much better numerical
precision by solving the system of $T$ equations

$
X_{T\times k}b_{k\times1}=Y_{T\times1}
$

for the vector $b$ that minimizes the sum of squared errors. This
is easily done by using the command
```
b = X\Y
```
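
The precision argument can be illustrated with the condition number: forming $X^{\prime}X$ roughly squares the condition number of $X$, which is what makes the explicit inverse more fragile. The sketch below uses simulated data with two nearly collinear regressors. The seed, sample size and noise levels are assumptions chosen only for illustration.

```julia
using LinearAlgebra, Random

Random.seed!(1)                          #simulated data, for illustration only
T = 200
z = randn(T)
X = [ones(T) z (z + 1e-4*randn(T))]      #two nearly collinear regressors
Y = X*[1.0,2.0,3.0] + 0.1*randn(T)

println("cond(X):   ", cond(X))
println("cond(X'X): ", cond(X'X))        #roughly cond(X)^2

b_inv = inv(X'X)*X'Y                     #normal equations with explicit inverse
b_ls  = X\Y                              #least squares via a factorization of X
println("max abs difference: ", maximum(abs.(b_inv - b_ls)))
```

With well-conditioned regressors (as in the factor data above) the two approaches agree to many digits. The difference tends to matter only when the regressors are close to collinear or badly scaled.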

In [9]:
printblue("Three different ways to calculate OLS estimates:")

K    = size(X,2)
S_xx = zeros(K,K)
S_xy = zeros(K,1)
for t = 1:T
    #local x_t, y_t            #only needed in REPL/scripts
    #global S_xx, S_xy         
    x_t = X[t,:]               #a vector
    y_t = Y[t:t,:]             
    S_xx = S_xx + x_t*x_t'     #KxK
    S_xy = S_xy + x_t*y_t      #Kx1
end
b1 = inv(S_xx)*S_xy          #OLS coeffs, version 1

b2 = inv(X'X)*X'Y            #OLS coeffs, version 2

b3 = X\Y                     #OLS coeffs, version 3

println("\n      b1       b2        b3")
printmat([b1 b2 b3])

Three different ways to calculate OLS estimates:

      b1       b2        b3
     0.007     0.007     0.007
     0.217     0.217     0.217
    -0.429    -0.429    -0.429

