# Basic OLS

This notebook estimates a linear regression and reports standard errors (assuming iid residuals).

For a package, consider [GLM.jl](https://github.com/JuliaStats/GLM.jl) (not used here).

## Load Packages and Extra Functions

In [1]:
using Printf, DelimitedFiles, Statistics, LinearAlgebra, Distributions

include("jlFiles/printmat.jl")

printyellow (generic function with 1 method)

## Loading Data

In [2]:
x = readdlm("Data/FFmFactorsPs.csv",',',skipstart=1)

                #yearmonth, market, small minus big, high minus low
(ym,Rme,RSMB,RHML) = (x[:,1],x[:,2]/100,x[:,3]/100,x[:,4]/100) 
x = nothing

printlnPs("Sample size:",size(Rme))

Sample size:    (388,)


## OLS Estimates and Their Distribution

Consider the linear regression

$
y_{t}=\beta^{\prime}x_{t}+u_{t},
$

where $y_{t}$ is a scalar and $x_{t}$ is $k\times1$. The OLS estimate is

$
\hat{\beta} = S_{xx}^{-1}S_{xy}, \: \text{ where } \: 
S_{xx}      = \sum\nolimits_{t=1}^{T}x_{t}x_{t}^{\prime}
\: \text{ and } \:
S_{xy}      = \sum\nolimits_{t=1}^{T}x_{t}y_{t}.
$

Under the Gauss-Markov assumptions (in particular, $u_t$ is iid and independent of $x_t$), the distribution of the estimates is (typically)

$
(\hat{\beta}-\beta_{0})\overset{d}{\rightarrow}N(0,S_{xx}^{-1}\sigma^2),
$

where $\beta_{0}$ is the true coefficient vector and $\sigma^2$ is the variance of the residual.

To calculate the estimates and the covariance matrix, we define the matrices $X_{T\times k}$ and $Y_{T\times1}$
by letting $x_{t}^{\prime}$ and $y_{t}$ be the $t^{th}$ rows

$
X_{T\times k}=
\begin{bmatrix}
x_{1}^{\prime}\\
\vdots\\
x_{T}^{\prime}
\end{bmatrix}
\ \text{ and } \ Y_{T\times1}=
\begin{bmatrix}
y_{1}\\
\vdots\\
y_{T}
\end{bmatrix}.
$
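
With $X$ and $Y$ defined this way, the sums in the OLS formula can be written as matrix products. This is a standard identity, spelled out here only to connect the summation notation above with the matrix code:

$
S_{xx} = \sum\nolimits_{t=1}^{T}x_{t}x_{t}^{\prime} = X^{\prime}X
\: \text{ and } \:
S_{xy} = \sum\nolimits_{t=1}^{T}x_{t}y_{t} = X^{\prime}Y,
\: \text{ so } \:
\hat{\beta} = (X^{\prime}X)^{-1}X^{\prime}Y.
$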

The estimates can then be calculated as 
```
b = X\Y
```

## A Function for OLS

In [3]:
"""
    OlsGMFn(Y,X)

LS of Y on X; for one dependent variable, Gauss-Markov assumptions

# Usage
(b,u,Yhat,V,R2) = OlsGMFn(Y,X)

# Input
- `Y::Vector`:    T-vector, the dependent variable
- `X::Matrix`:    Txk matrix of regressors (including deterministic ones)

# Output
- `b::Vector`:    k-vector, regression coefficients
- `u::Vector`:    T-vector, residuals Y - Yhat
- `Yhat::Vector`: T-vector, fitted values X*b
- `V::Matrix`:    kxk matrix, covariance matrix of b
- `R2::Number`:   scalar, R2 value

"""
function OlsGMFn(Y,X)

    T    = size(Y,1)

    b    = X\Y
    Yhat = X*b
    u    = Y - Yhat

    σ2   = var(u)
    V    = inv(X'X)*σ2
    R2   = 1 - σ2/var(Y)

    return b, u, Yhat, V, R2

end

OlsGMFn

In [4]:
Y = Rme                    #to get standard OLS notation
T = size(Y,1)
X = [ones(T) RSMB RHML]

(b,_,_,V,R2) = OlsGMFn(Y,X)
Stdb = sqrt.(diag(V))        #standard error

printblue("OLS Results:\n")
xNames = ["c","SMB","HML"]
printmat([b Stdb],colNames=["b","std"],rowNames=xNames)

printlnPs("R2: ",R2)

[34m[1mOLS Results:[22m[39m

            b       std
c       0.007     0.002
SMB     0.217     0.073
HML    -0.429     0.074

      R2:      0.134
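
A common next step is to form t-statistics for the hypothesis that a coefficient is zero, by dividing each estimate by its standard error. The sketch below does this with the rounded values copied from the printed table above (an assumption, made so that the snippet runs without the data file), so the results are only approximate. The names `b_rounded`, `Stdb_rounded` and `names_x` are introduced here for illustration.

```julia
# Rounded values copied from the printed output above (not recomputed from the data)
b_rounded    = [0.007, 0.217, -0.429]
Stdb_rounded = [0.002, 0.073, 0.074]
names_x      = ["c", "SMB", "HML"]

# t-statistic for the null hypothesis that the coefficient is zero
tstat = b_rounded ./ Stdb_rounded

for i in eachindex(tstat)
    println(rpad(names_x[i], 5), round(tstat[i], digits=2))
end
```

Under the asymptotic normal distribution above, an absolute t-statistic larger than about 1.96 suggests significance at the 5% level. With the unrounded `b` and `Stdb` from the cell above, the same calculation is `b ./ Stdb`.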


# Missing Values

The next cells use some simple functions to remove observations ($t$) where $y_{t}$ and/or some of the $x_t$ variables are NaN/missing.

We illustrate the usage by a very simple example.

In [5]:
include("jlFiles/excise.jl")

FindNNPs

In [6]:
y = [1,2,3.0]
x = [1 11;
     1 NaN;
     1 13.5]

(y1,x1) = excise(y,x)
println("before")
printmat(y,x,colNames=["y","x1","x2"])

println("after")
printmat(y1,x1,colNames=["y","x1","x2"])

println("OLS using only observations without any NaN/missing")
b = x1\y1
printmat(b)

before
         y        x1        x2
     1.000     1.000    11.000
     2.000     1.000       NaN
     3.000     1.000    13.500

after
         y        x1        x2
     1.000     1.000    11.000
     3.000     1.000    13.500

OLS using only observations without any NaN/missing
    -7.800
     0.800
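
The code in `jlFiles/excise.jl` is not shown in this notebook. As a rough sketch of the idea, the hypothetical helper `drop_nan_rows` below (an illustration, not the notebook's `excise`) keeps only the rows where neither `y` nor any column of `x` is NaN/missing, and then reruns the regression from the example above.

```julia
using LinearAlgebra

# Hypothetical helper (not the notebook's excise): keep rows without any NaN/missing
function drop_nan_rows(y, x)
    isbad(v) = ismissing(v) || (v isa Number && isnan(v))
    keep = [!isbad(y[t]) && !any(isbad, x[t, :]) for t in 1:size(x, 1)]
    return y[keep], x[keep, :]
end

y = [1, 2, 3.0]
x = [1 11;
     1 NaN;
     1 13.5]

(y1, x1) = drop_nan_rows(y, x)     # row 2 is dropped
b = x1\y1                          # OLS on the remaining observations
println(round.(b, digits=3))
```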



# Different Ways to Calculate OLS Estimates (extra)

The next cell calculates the OLS estimates in three different ways: (1) a loop to create $S_{xx}$ and $S_{xy}$ followed by $S_{xx}^{-1}S_{xy}$; (2) $(X'X)^{-1}X'Y$; and (3) `X\Y`. They should give the same result in well-behaved data sets, but (3) is probably the most numerically stable version.

In [7]:
printblue("Three different ways to calculate OLS estimates:")

K    = size(X,2)
S_xx = zeros(K,K)
S_xy = zeros(K,1)
for t = 1:T
    #local x_t, y_t            #local/global is needed in script
    #global S_xx, S_xy
    x_t = X[t,:]               #a vector
    y_t = Y[t:t,:]
    S_xx = S_xx + x_t*x_t'     #KxK
    S_xy = S_xy + x_t*y_t      #Kx1
end
b1 = inv(S_xx)*S_xy          #OLS coeffs, version 1

b2 = inv(X'X)*X'Y            #OLS coeffs, version 2

b3 = X\Y                     #OLS coeffs, version 3

printmat([b1 b2 b3],colNames=["b1","b2","b3"],rowNames=xNames)

[34m[1mThree different ways to calculate OLS estimates:[22m[39m
           b1        b2        b3
c       0.007     0.007     0.007
SMB     0.217     0.217     0.217
HML    -0.429    -0.429    -0.429



# Basic OLS, A Numerical Example (extra)

The cells below show the details of the calculations for a linear regression.

## The "Data"

In [8]:
(y₁,y₂,y₃) = (-1.5,-0.6,2.1)        #dependent variable in t=1,2,3 

x₁ = [1;-1]                        #regressors in t=1
x₂ = [1;0]                         #regressors in t=2
x₃ = [1;1]                         #regressors in t=3
println()




## The calculations in t=1

In [9]:
printblue("x₁ (regressors in t=1):")
printmat(x₁)
printblue("x₁':")
printmat(x₁')

printblue("x₁*x₁':")
printmat(x₁*x₁')
printblue("x₁y₁:")
printmat(x₁*y₁)

[34m[1mx₁ (regressors in t=1):[22m[39m
     1    
    -1    

[34m[1mx₁':[22m[39m
     1        -1    

[34m[1mx₁*x₁':[22m[39m
     1        -1    
    -1         1    

[34m[1mx₁y₁:[22m[39m
    -1.500
     1.500



## All Periods, finding the OLS Estimates

In [10]:
printmagenta("This illustrates Sxx and Sxy (for all t):\n")

Sxx = x₁*x₁' + x₂*x₂' + x₃*x₃'
Sxy = x₁*y₁ + x₂*y₂ + x₃*y₃

println("Sxx and Sxy")
printmat(Sxx)
printmat(Sxy)

[35m[1mThis illustrates Sxx and Sxy (for all t):[22m[39m

Sxx and Sxy
     3         0    
     0         2    

     0.000
     3.600



In [11]:
printblue("Sxx^(-1):")               #a matrix inverse
printmat(Sxx^(-1))

printblue("checking Sxx * Sxx^(-1):")
printmat(Sxx * Sxx^(-1))

[34m[1mSxx^(-1):[22m[39m
     0.333     0.000
     0.000     0.500

[34m[1mchecking Sxx * Sxx^(-1):[22m[39m
     1.000     0.000
     0.000     1.000



In [12]:
b1 = Sxx^(-1)*Sxy
printblue("The estimated coeffs are:\n")
xNames = ["Regressor 1","Regressor 2"]
printmat(b1,rowNames=xNames)

[34m[1mThe estimated coeffs are:[22m[39m

Regressor 1     0.000
Regressor 2     1.800



## Matrix Version

This cell redoes the previous calculations, but with the (more convenient) matrix version.

In [13]:
X = [x₁';x₂';x₃']       #put xₜ' in row t of X
Y = [y₁;y₂;y₃]          #put yₜ in row t of Y

printblue("X: (xₜ' in row t of X):\n")
printmat(X)
printblue("Y (yₜ in row t of Y)\n")
printmat(Y)

b2 = (X'X)^(-1)*X'Y
printblue("The estimated coeffs are:\n")
printmat(b2,rowNames=xNames)

[34m[1mX: (xₜ' in row t of X):[22m[39m

     1        -1    
     1         0    
     1         1    

[34m[1mY (yₜ in row t of Y)[22m[39m

    -1.500
    -0.600
     2.100

[34m[1mThe estimated coeffs are:[22m[39m

Regressor 1     0.000
Regressor 2     1.800
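
The same small data set can also be used to trace the remaining steps of `OlsGMFn` by hand. The sketch below follows those steps (residuals, residual variance, covariance matrix, standard errors and R2); it redefines `X` and `Y` so that it runs on its own. Note that `var` divides by $T-1$, as in the function above.

```julia
using LinearAlgebra, Statistics

X = [1 -1; 1 0; 1 1]          # same regressors as in the example above
Y = [-1.5, -0.6, 2.1]         # same dependent variable

b    = X\Y                    # OLS coefficients
u    = Y - X*b                # residuals
σ2   = var(u)                 # residual variance (var divides by T-1)
V    = inv(X'X)*σ2            # covariance matrix of b
Stdb = sqrt.(diag(V))         # standard errors
R2   = 1 - σ2/var(Y)

println("residuals: ", round.(u, digits=3))
println("σ2:        ", round(σ2, digits=3))
println("std(b):    ", round.(Stdb, digits=3))
println("R2:        ", round(R2, digits=3))
```

By hand, the residuals are $(0.3,-0.6,0.3)$, so $\sigma^2 = 0.54/2 = 0.27$ and the standard errors are $\sqrt{0.27/3}=0.3$ and $\sqrt{0.27/2}\approx 0.367$.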



## How to Code

Use `X\Y` to get a more robust (numerical) calculation. (The choice between the different methods typically only matters when the regressors are strongly correlated and/or have very different magnitudes.)

In [14]:
b3 = X\Y
printblue("The estimated coeffs are:\n")
printmat(b3,rowNames=xNames)

[34m[1mThe estimated coeffs are:[22m[39m

Regressor 1    -0.000
Regressor 2     1.800
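
To see why the method can matter, the sketch below uses made-up data (an assumption, not the notebook's data set) in which two regressors are almost identical. Forming `X'X` squares the 2-norm condition number of `X`, so the explicit-inverse approach tends to lose more precision than `X\Y`, which for a non-square `X` relies on a QR factorization.

```julia
using LinearAlgebra

# Illustrative data (an assumption): the third regressor is almost equal to the second
T = 100
z = collect(range(0, stop=1, length=T))
X = [ones(T) z (z .+ 1e-4 .* sin.(1:T))]
b_true = [1.0, 2.0, 3.0]
Y = X*b_true                      # no noise, so the exact answer is b_true

b_inv = inv(X'X)*X'Y              # normal equations with an explicit inverse
b_qr  = X\Y                       # least squares via X\Y

println("cond(X):   ", cond(X))
println("cond(X'X): ", cond(X'X))
println("max error, inv(X'X)*X'Y: ", maximum(abs.(b_inv - b_true)))
println("max error, X\\Y:          ", maximum(abs.(b_qr - b_true)))
```

The exact error magnitudes depend on the machine and the data, so treat the printed numbers as indicative only.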

