# Basic OLS

This notebook estimates a linear regression and reports standard errors (assuming iid residuals).

For a package, consider [GLM.jl](https://github.com/JuliaStats/GLM.jl) or [LinearRegression.jl](https://github.com/st--/LinearRegression.jl) (not used here).

## Load Packages and Extra Functions

The key functions are from the `FinEcmt_OLS` module found in the `jlFiles` subfolder.

The `DelimitedFiles` package is used for importing the (csv) data file and the `LinearAlgebra` package for some matrix operations (e.g. `diag()`, which extracts the diagonal of a matrix).

In [1]:
MyModulePath = joinpath(pwd(),"jlFiles")     #add /jlFiles to module path
!in(MyModulePath,LOAD_PATH) && push!(LOAD_PATH,MyModulePath);

In [2]:
using FinEcmt_OLS, Statistics, DelimitedFiles, LinearAlgebra

## Loading Data

In [3]:
x = readdlm("Data/FFmFactorsPs.csv",',',skipstart=1)

                #yearmonth, market, small minus big, high minus low
(ym,Rme,RSMB,RHML) = (x[:,1],x[:,2]/100,x[:,3]/100,x[:,4]/100)
x = nothing

printlnPs("Sample size:",size(Rme))

Sample size:    (388,)


## OLS Estimates and Their Distribution

Consider the linear regression

$
y_{t}=\beta^{\prime}x_{t}+u_{t}.
$

When $x_t$ and $u_t$ are independent and $u_t$ is iid (Gauss-Markov assumptions), then the distribution of the estimates is (typically)

$
\hat{\beta} \sim  N(\beta_{0},S_{xx}^{-1}\sigma^2),
$

where $\beta_{0}$ is the true value of $\beta$, $\sigma^2$ is the variance of the residual, and $S_{xx} = \sum\nolimits_{t=1}^{T}x_{t}x_{t}^{\prime}$.

In matrix form, these expressions are 
$
Y = Xb + u, \: \text{ and } \: S_{xx} = X'X,
$
where $X$ is defined below.
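
As a rough sketch of where this covariance matrix comes from, treat the regressors as fixed (a simplifying assumption made only for this illustration) and let $\beta_0$ denote the true coefficient vector. Substituting $Y = X\beta_0 + u$ into the OLS formula gives

$
\hat{\beta} = (X'X)^{-1}X'Y = \beta_0 + (X'X)^{-1}X'u,
$

so with $\text{E}(uu') = \sigma^2 I$ (iid residuals)

$
\text{Var}(\hat{\beta}) = (X'X)^{-1}X'\,\text{E}(uu')\,X(X'X)^{-1} = \sigma^2 (X'X)^{-1} = S_{xx}^{-1}\sigma^2.
$

The standard error of coefficient $i$ is then the square root of the $i$-th diagonal element of this matrix.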


### Matrix Form

To calculate the estimates it is often convenient to work with matrices. Define $X_{T\times k}$ by 
letting $x_{t}^{\prime}$ be the $t^{th}$ row:

$$
X_{T\times k}=
\begin{bmatrix}
x_{1}^{\prime}\\
\vdots\\
x_{T}^{\prime}
\end{bmatrix}
$$

In contrast, $Y$ is just a vector with $T$ elements (or possibly a $T \times 1$ matrix).

This is implemented in the `OlsGM()` function from the `FinEcmt_OLS` module. The source file is in the `jlFiles` subfolder. The next cells print the documentation and source code of the function. In particular, notice the `b = X\Y` and `V = inv(X'X)*σ²`.

In [4]:
display("text/markdown",@doc OlsGM)

```
OlsGM(Y,X)
```

LS of Y on X; for one dependent variable, Gauss-Markov assumptions

### Input

  * `Y::Vector`:    T-vector, the dependent variable
  * `X::Matrix`:    Txk matrix of regressors (including deterministic ones)

### Output

  * `b::Vector`:    k-vector, regression coefficients
  * `u::Vector`:    T-vector, residuals Y - yhat
  * `Yhat::Vector`: T-vector, fitted values X*b
  * `V::Matrix`:    kxk matrix, covariance matrix of b
  * `R²::Number`:   scalar, R² value


In [5]:
using CodeTracking
println(@code_string OlsGM([1],[1]))    #print the source code

function OlsGM(Y,X)

    T    = size(Y,1)

    b    = X\Y
    Yhat = X*b
    u    = Y - Yhat

    σ²   = var(u)
    V    = inv(X'X)*σ²
    R²   = 1 - σ²/var(Y)

    return b, u, Yhat, V, R²

end
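
To try the same steps without the data file, here is a minimal sketch on simulated data. The function name `ols_gm_sketch`, the seed, and the "true" coefficients are assumptions made up for this illustration; they are not part of `FinEcmt_OLS`.

```julia
using LinearAlgebra, Statistics, Random

# Hypothetical re-implementation of the steps above, for illustration only
function ols_gm_sketch(Y, X)
    b  = X\Y                   # least-squares coefficients
    u  = Y - X*b               # residuals
    σ² = var(u)                # residual variance (var() divides by T-1)
    V  = inv(X'X)*σ²           # covariance matrix of b under iid residuals
    return b, sqrt.(diag(V))   # coefficients and their standard errors
end

Random.seed!(123)                     # simulated data, not the factor returns
T = 500
X = [ones(T) randn(T) randn(T)]       # Txk matrix, constant in first column
β = [0.5, 1.0, -2.0]                  # assumed "true" coefficients
Y = X*β + 0.1*randn(T)

(b, Stdb) = ols_gm_sketch(Y, X)
println("estimates:       ", round.(b, digits=3))
println("standard errors: ", round.(Stdb, digits=3))
```

With a small residual standard deviation and 500 observations, the estimates should land close to the assumed coefficients, and the standard errors should be small relative to them.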


## OLS Regression

In [6]:
Y = Rme                    #to get standard OLS notation
T = size(Y,1)
X = [ones(T) RSMB RHML]

(b,_,_,V,R²) = OlsGM(Y,X)
Stdb = sqrt.(diag(V))        #standard errors

printblue("OLS Results:\n")
xNames = ["c","SMB","HML"]
printmat(b,Stdb,colNames=["b","std"],rowNames=xNames)

printlnPs("R²: ",R²)

[34m[1mOLS Results:[22m[39m

            b       std
c       0.007     0.002
SMB     0.217     0.073
HML    -0.429     0.074

      R²:      0.134


# Missing Values (extra)

The next cells use a simple function (`excise()`) to remove observations ($t$) where $y_t$ and/or some of the $x_t$ variables are NaN/missing. We illustrate the usage by a very simple example.

An alternative approach is to fill *both* $y_t$ and $x_t$ with zeros (if any of them contains NaN/missing) by using the `OLSyxReplaceNaN` function and then do the regression. This is illustrated in the subsequent cell.

In [7]:
(y0,x0) = (copy(Y),copy(X))    #so we can change some values
x0[2,2] = NaN                  #set a value to NaN

(y1,x1) = excise(y0,x0)
println("obs 1-3 before")
printmat(y0[1:3],x0[1:3,:];colNames=vcat("y",xNames))

println("after")
printmat(y1[1:3],x1[1:3,:];colNames=vcat("y",xNames))

printblue("OLS using only observations without any NaN/missing:")
b = x1\y1
printmat(b)

obs 1-3 before
         y         c       SMB       HML
     0.042     1.000     0.037     0.023
    -0.034     1.000       NaN     0.012
     0.058     1.000     0.032    -0.007

after
         y         c       SMB       HML
     0.042     1.000     0.037     0.023
     0.058     1.000     0.032    -0.007
     0.001     1.000     0.022     0.011

[34m[1mOLS using only observations without any NaN/missing:[22m[39m
     0.007
     0.218
    -0.428



In [8]:
(vv,y2,x2) = OLSyxReplaceNaN(y0,x0)    #the data where one value is NaN

println("after")
printmat(y2[1:3],x2[1:3,:];colNames=vcat("y",xNames))

printblue("OLS from setting observations with any NaN/missing to 0:")
b = x2\y2
printmat(b)

after
         y         c       SMB       HML
     0.042     1.000     0.037     0.023
     0.000     0.000     0.000     0.000
     0.058     1.000     0.032    -0.007

OLS from setting observations with any NaN/missing to 0:
     0.007
     0.218
    -0.428
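
The two approaches can be compared in a small self-contained sketch. The helpers `keep_rows`, `drop_nan_rows` and `zero_nan_rows` are hypothetical stand-ins written for this illustration (they are not the `FinEcmt_OLS` implementations), the data are made up, and only `NaN` (not `missing`) is handled.

```julia
using LinearAlgebra

# Hypothetical helpers for illustration; not the FinEcmt_OLS code
keep_rows(y, x) = vec(.!any(isnan, [y x], dims=2))   # true for rows without any NaN

function drop_nan_rows(y, x)          # similar in spirit to excise()
    vv = keep_rows(y, x)
    return y[vv], x[vv, :]
end

function zero_nan_rows(y, x)          # similar in spirit to OLSyxReplaceNaN()
    vv = keep_rows(y, x)
    (yb, xb) = (copy(y), copy(x))
    yb[.!vv]    .= 0                  # zero out y_t for the bad rows
    xb[.!vv, :] .= 0                  # zero out all of x_t, including the constant
    return vv, yb, xb
end

x = [ones(6) [0.1, NaN, 0.3, -0.2, 0.5, 0.0]]    # made-up regressors
y = [1.0, 2.0, 1.5, 0.4, NaN, 1.1]               # made-up dependent variable

(y1, x1)     = drop_nan_rows(y, x)
(vv, y2, x2) = zero_nan_rows(y, x)

b_drop = x1\y1
b_zero = x2\y2
println("coefficients agree: ", b_drop ≈ b_zero)
```

The coefficients should agree up to rounding, because a row where both $y_t$ and $x_t$ are zero adds nothing to $X'X$ or $X'Y$. The zero-filling approach keeps the original sample length, which can be convenient when results have to be aligned with dates.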



# Different Ways to Calculate OLS Estimates (extra)

Recall that OLS can be calculated as

$
\hat{\beta} = S_{xx}^{-1}S_{xy}, \: \text{ where } \: 
S_{xx}      = \sum\nolimits_{t=1}^{T}x_{t}x_{t}^{\prime}
\: \text{ and } \:
S_{xy}      = \sum\nolimits_{t=1}^{T}x_{t}y_{t}.
$


The next cell calculates the OLS estimates in three different ways: (1) a loop to create $S_{xx}$ and $S_{xy}$ followed by $S_{xx}^{-1}S_{xy}$; (2) $(X'X)^{-1}X'Y$; and (3) `X\Y`. They should give the same result in well-behaved data sets, but (3) is typically the most numerically stable, since `\` solves the least-squares problem with a QR factorization instead of forming and inverting $X'X$.

In [9]:
printblue("Three different ways to calculate OLS estimates:")

k    = size(X,2)
Sxx = zeros(k,k)
Sxy = zeros(k,1)
for t = 1:T
    #local x_t, y_t            #local/global is needed in script
    #global Sxx, Sxy
    x_t = X[t,:]               #a vector
    y_t = Y[t]
    Sxx = Sxx + x_t*x_t'     #kxk, same as Sxx += x_t*x_t'
    Sxy = Sxy + x_t*y_t      #kx1, same as Sxy += x_t*y_t
end
b1 = inv(Sxx)*Sxy            #OLS coeffs, version 1

b2 = inv(X'X)*X'Y            #OLS coeffs, version 2

b3 = X\Y                     #OLS coeffs, version 3

printmat(b1,b2,b3,colNames=["b1","b2","b3"],rowNames=xNames)

[34m[1mThree different ways to calculate OLS estimates:[22m[39m
           b1        b2        b3
c       0.007     0.007     0.007
SMB     0.217     0.217     0.217
HML    -0.429    -0.429    -0.429
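
To illustrate the stability point, the sketch below spells out a QR-based solution by hand and compares condition numbers on simulated data. The seed, sample size and coefficients are assumptions chosen for this example. The condition number of $X'X$ is the square of that of $X$, which is one reason the normal equations lose accuracy faster when regressors are nearly collinear.

```julia
using LinearAlgebra, Random

Random.seed!(42)                      # simulated data for illustration
T = 200
X = [ones(T) randn(T) randn(T)]
Y = X*[0.1, 0.2, -0.4] + 0.05*randn(T)

F    = qr(X)                          # factorization X = Q*R
Q    = Matrix(F.Q)                    # thin Q, Txk
b_qr = F.R \ (Q'*Y)                   # solve R*b = Q'Y
b_ne = inv(X'X)*X'Y                   # normal equations
b_bs = X\Y                            # backslash

println("max |b_qr - b_bs|: ", maximum(abs.(b_qr - b_bs)))
println("max |b_ne - b_bs|: ", maximum(abs.(b_ne - b_bs)))
println("cond(X):           ", cond(X))
println("cond(X'X):         ", cond(X'X))
```

On well-conditioned data like this, all three solutions should coincide up to rounding; the differences only become material when `cond(X)` is large.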

