# Basic OLS

This notebook estimates a linear regression and reports standard errors (assuming iid residuals).

For a package, consider [GLM.jl](https://github.com/JuliaStats/GLM.jl) (not used here).

## Load Packages and Extra Functions

In [1]:
using Printf, DelimitedFiles, Statistics, LinearAlgebra, Distributions

include("jlFiles/printmat.jl")

printyellow (generic function with 1 method)

## Loading Data

In [2]:
x = readdlm("Data/FFmFactorsPs.csv",',',skipstart=1)

                #yearmonth, market, small minus big, high minus low
(ym,Rme,RSMB,RHML) = (x[:,1],x[:,2]/100,x[:,3]/100,x[:,4]/100) 
x = nothing

printlnPs("Sample size:",size(Rme))

Sample size:    (388,)


## OLS Estimates and Their Distribution

Consider the linear regression

$
y_{t}=\beta^{\prime}x_{t}+u_{t},
$

where $y_{t}$ is a scalar and $x_{t}$ is $k\times1$. The OLS estimate is

$
\hat{\beta} = S_{xx}^{-1}S_{xy}, \: \text{ where } \: 
S_{xx}      = \sum\nolimits_{t=1}^{T}x_{t}x_{t}^{\prime}
\: \text{ and } \:
S_{xy}      = \sum\nolimits_{t=1}^{T}x_{t}y_{t}.
$

When $x_t$ and $u_t$ are independent (Gauss-Markov assumptions), the distribution of the estimates is (typically)

$
(\hat{\beta}-\beta_{0})\overset{d}{\rightarrow}N(0,S_{xx}^{-1}\sigma^2),
$

where $\sigma^2$ is the variance of the residual.

To calculate the estimates and the covariance matrix, we define the matrices $X_{T\times k}$ and $Y_{T\times1}$
by letting $x_{t}^{\prime}$ and $y_{t}$ be the $t^{th}$ rows

$
X_{T\times k}=
\begin{bmatrix}
x_{1}^{\prime}\\
\vdots\\
x_{T}^{\prime}
\end{bmatrix}
\ \text{ and } \ Y_{T\times1}=
\begin{bmatrix}
y_{1}\\
\vdots\\
y_{T}
\end{bmatrix}.
$

The estimates can then be calculated as 
```
b = X\Y
```
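
As a minimal sketch of the formulas above, the block below uses a small made-up data set in which $y_t = 2 + 3x_t$ holds exactly, so both $S_{xx}^{-1}S_{xy}$ and `X\Y` should recover the coefficients 2 and 3. The numbers and the names `xdemo`, `Xdemo` and `Ydemo` are assumptions for illustration; they are not part of the notebook's data.

```julia
using LinearAlgebra

xdemo = [1.0, 2.0, 4.0, 7.0]                 # made-up regressor values
Xdemo = [ones(4) xdemo]                      # T×k matrix: constant and one regressor
Ydemo = 2 .+ 3 .* xdemo                      # exact linear relation, no residual

b_normal = inv(Xdemo'*Xdemo)*(Xdemo'*Ydemo)  # S_xx^{-1} S_xy
b_slash  = Xdemo\Ydemo                       # least squares via backslash

println(b_normal)
println(b_slash)
```

Both vectors should agree up to rounding error.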

## A Function for OLS

In [3]:
"""
    OlsGMFn(Y,X)

LS of Y on X; for one dependent variable, Gauss-Markov assumptions

# Usage
(b,u,Yhat,V,R²) = OlsGMFn(Y,X)

# Input
- `Y::Vector`:    T-vector, the dependent variable
- `X::Matrix`:    Txk matrix of regressors (including deterministic ones)

# Output
- `b::Vector`:    k-vector, regression coefficients
- `u::Vector`:    T-vector, residuals Y - Yhat
- `Yhat::Vector`: T-vector, fitted values X*b
- `V::Matrix`:    kxk matrix, covariance matrix of b
- `R²::Number`:   scalar, R² value

"""
function OlsGMFn(Y,X)

    T    = size(Y,1)

    b    = X\Y
    Yhat = X*b
    u    = Y - Yhat

    σ²   = var(u)             #\sigma[TAB]\^2[TAB]
    V    = inv(X'X)*σ²
    R²   = 1 - σ²/var(Y)

    return b, u, Yhat, V, R²

end

OlsGMFn
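
`OlsGMFn` estimates $\sigma^2$ with `var(u)`, which divides the sum of squared deviations by $T-1$. A common alternative divides by $T-k$ to account for the $k$ estimated coefficients. The following is a minimal sketch of that variant on made-up data; the variable names and numbers are assumptions for illustration and are not used elsewhere in the notebook.

```julia
using Statistics, LinearAlgebra

Xd = [ones(5) [1.0, 2.0, 3.0, 4.0, 5.0]]   # made-up regressors (constant and a trend)
Yd = [1.1, 1.9, 3.2, 3.8, 5.1]             # made-up dependent variable
(Td,kd) = size(Xd)

ud = Yd - Xd*(Xd\Yd)                       # OLS residuals

s2_var = var(ud)                           # divides by T-1, as in OlsGMFn
s2_dof = sum(abs2,ud)/(Td-kd)              # divides by T-k

println("σ² with T-1: ",s2_var)
println("σ² with T-k: ",s2_dof)
```

When the regression includes a constant, the residuals have zero mean, so the two estimates differ only by the factor $(T-1)/(T-k)$, which is close to 1 in samples as long as the one used here.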

In [4]:
Y = Rme                    #to get standard OLS notation
T = size(Y,1)
X = [ones(T) RSMB RHML]

(b,_,_,V,R²) = OlsGMFn(Y,X)
Stdb = sqrt.(diag(V))        #standard error

printblue("OLS Results:\n")
xNames = ["c","SMB","HML"]
printmat([b Stdb],colNames=["b","std"],rowNames=xNames)

printlnPs("R²: ",R²)

[34m[1mOLS Results:[22m[39m

            b       std
c       0.007     0.002
SMB     0.217     0.073
HML    -0.429     0.074

      R²:      0.134
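
Given `b` and `V`, t-statistics and approximate confidence intervals follow from the normal distribution of the estimates stated earlier. The sketch below is illustrative only: `TstatCiFn` is a hypothetical helper (not part of `jlFiles`), the inputs are made-up numbers, and 1.96 is the usual two-sided 95% critical value of the standard normal distribution.

```julia
using LinearAlgebra

# Hypothetical helper: t-stats for H0: coefficient = 0, and approximate 95% intervals
function TstatCiFn(b,V;crit=1.96)
    Stdb  = sqrt.(diag(V))           #standard errors
    tstat = b./Stdb                  #t-statistics
    lo    = b - crit*Stdb            #lower bound of the interval
    hi    = b + crit*Stdb            #upper bound of the interval
    return tstat, lo, hi
end

b_demo = [1.0, -2.0]                 #made-up coefficients
V_demo = [0.25 0.0; 0.0 1.0]         #made-up covariance matrix
(tstat,lo,hi) = TstatCiFn(b_demo,V_demo)

println(tstat)
println([lo hi])
```

With the estimates from the cell above, the call would be `TstatCiFn(b,V)`.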


# Missing Values

The next cells use some simple functions to remove observations ($t$) where $y_t$ and/or some of the $x_t$ variables are NaN/missing. We illustrate the usage by a very simple example.

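An alternative approach is to fill *both* $y_t$ and $x_t$ with zeros for every observation $t$ where any of them contains NaN/missing, and then run the regression. Such rows add nothing to $S_{xx}$ or $S_{xy}$, so the point estimates are the same as when the observations are dropped.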

In [5]:
include("jlFiles/excise.jl")

FindNNPs

In [6]:
y = [1,2,3.0]
x = [1 11;
     1 NaN;
     1 13.5]

(y1,x1) = excise(y,x)
println("before")
printmat(y,x,colNames=["y","x1","x2"])

println("after")
printmat(y1,x1,colNames=["y","x1","x2"])

println("OLS using only observations without any NaN/missing")
b = x1\y1
printmat(b)

before
         y        x1        x2
     1.000     1.000    11.000
     2.000     1.000       NaN
     3.000     1.000    13.500

after
         y        x1        x2
     1.000     1.000    11.000
     3.000     1.000    13.500

OLS using only observations without any NaN/missing
    -7.800
     0.800
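
The zero-filling alternative mentioned at the start of this section can be checked on the same small example. This is a minimal sketch that handles NaN only (not `missing`) and does not rely on `excise.jl`; the names `y_demo`, `x_demo` and `bad` are assumptions for illustration.

```julia
y_demo = [1, 2, 3.0]
x_demo = [1 11;
          1 NaN;
          1 13.5]

# true for rows where y or any regressor is NaN
bad = isnan.(y_demo) .| vec(any(isnan, x_demo, dims=2))

b_drop = x_demo[.!bad,:]\y_demo[.!bad]     #drop the incomplete rows

(y0,x0)    = (copy(y_demo),copy(x_demo))   #fill the incomplete rows with zeros
y0[bad]   .= 0.0
x0[bad,:] .= 0.0
b_zero     = x0\y0

println(b_drop)
println(b_zero)
```

Both approaches should give the same coefficients as the output above, up to rounding error.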



# Different Ways to Calculate OLS Estimates (extra)

The next cell calculates the OLS estimates in three different ways: (1) a loop to create $S_{xx}$ and $S_{xy}$ followed by $S_{xx}^{-1}S_{xy}$; (2) $(X'X)^{-1}X'Y$; and (3) `X\Y`. They should give the same result in well-behaved data sets, but (3) is typically the most numerically stable: for a non-square `X`, Julia's `\` solves the least squares problem with a QR factorization instead of forming and inverting $X'X$.

In [7]:
printblue("Three different ways to calculate OLS estimates:")

k   = size(X,2)
Sxx = zeros(k,k)
Sxy = zeros(k,1)
for t = 1:T
    #local x_t, y_t            #local/global is needed in script
    #global Sxx, Sxy
    x_t = X[t,:]               #a vector
    y_t = Y[t]
    Sxx = Sxx + x_t*x_t'     #kxk, same as Sxx += x_t*x_t'
    Sxy = Sxy + x_t*y_t      #kx1, same as Sxy += x_t*y_t
end
b1 = inv(Sxx)*Sxy            #OLS coeffs, version 1

b2 = inv(X'X)*X'Y            #OLS coeffs, version 2

b3 = X\Y                     #OLS coeffs, version 3

printmat([b1 b2 b3],colNames=["b1","b2","b3"],rowNames=xNames)

[34m[1mThree different ways to calculate OLS estimates:[22m[39m
           b1        b2        b3
c       0.007     0.007     0.007
SMB     0.217     0.217     0.217
HML    -0.429    -0.429    -0.429
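
One reason to prefer `X\Y` is that forming $X'X$ squares the condition number (in the 2-norm), which amplifies rounding errors when regressors are nearly collinear. The sketch below illustrates this with made-up, nearly collinear regressors; the names `z` and `Xc` and all numbers are assumptions for illustration.

```julia
using LinearAlgebra

z  = collect(range(0,1,length=50))           #made-up regressor
Xc = [ones(50) z (z .+ 1e-4*sin.(10*z))]      #third column nearly equals the second

println("cond(X):   ",cond(Xc))
println("cond(X'X): ",cond(Xc'*Xc))          #roughly cond(X)^2
```

In the factor data used above the regressors are far from collinear, which is why all three methods agree to the printed precision.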

