# Case 1: Implementing LS solver

by Ken Bastiaensen, Milan Van den Heuvel, Gonzalo Villa

*Advanced Econometrics 2016-2017. Case 1, least squares (LS) implementation.*

The system of linear equations is given in matrix notation
$$ y = X \beta + \mu$$

with dependent variables $y$ and residuals $\mu \in \mathbb{R}^{N\times1}$, independent variables $X \in \mathbb{R}^{N\times K}$ and coefficients $\beta \in \mathbb{R}^{K\times 1}$.

We use the estimator for the coefficients $\hat\beta = (X'X)^{-1}Xy$ as given in course. In practice more advanced methods are used for stability, see the bottom part of this document.

With $\hat\mu = y - X \hat\beta$, the standard error $\sigma$ is estimated by 
$$\hat\sigma^2 = \frac{\hat\mu'\hat\mu}{N - K}$$
and variance-covariance matrix of $\hat\beta$ given by
$$V[\hat\beta] = \hat\sigma^2 (X'X)^{-1}$$

The t distribution for this null hypothesis that $\beta_0=0$ is then
$$ t = \frac{\hat\beta - 0}{s.e.} $$ with the standard errors $s.e.$ on the diagonal of $\hat\sigma$.

In [2]:
using Distributions: TDist, cdf, ccdf

In [3]:
type ols_results
  coefs 
  yhat
  res
  vcv
  tstat
  pval
end

In [21]:
Int(floor(3.1))

3

In [22]:
# keyword arguments are placed after semicolon
function gls(y, X; corr=:none, lags=nothing)
    
    # more stable: β̂ = X \ y, see notes at bottom
    β̂ = inv(X' * X) * (X' * y)
    ŷ = X * β̂
    μ̂ = y - ŷ
    
    T, K = size(X)
    σ̂² = dot(μ̂, μ̂) / (T - K)

    #use correction for variance covariance
    if corr == :none
        vcv = σ̂² * inv(X' * X)
        
    elseif corr == :white
        # or do newey_west with lags=0
        vcv = inv(X' * X) * X' * diagm(μ̂.^2) * X * inv(X' * X)
        
    elseif corr == :newey_west
        if lags == nothing 
            lags = Int(floor(T^(1/4)))
        end
        vcv = newey_west(X, μ̂, lags)
    else
        error("")
    end

    # T statistics for H₀: β₀ = 0
    tstat = β̂ ./ sqrt(diag(vcv))
    
    # absolute value and times two for double sided test
    pval  = 2 * ccdf(TDist(T-K), abs(tstat)) 

    return ols_results(β̂, ŷ, μ̂, vcv, tstat, pval)
end



gls (generic function with 1 method)

In [5]:
function newey_west(X, μ̂, lags::Integer)
    if lags == 0 #White estimator
        return inv(X'*X) * X' * diagm(μ̂.^2) * X * inv(X'*X)
    end
    
    T, k = size(X)
    vcv = zeros(k, k)
    for lag in 0:lags
        w = 1 - lag / (lags + 1)
        for t in (lag + 1):T
            # Calculates the off-diagonal terms
            update = w * μ̂[t] * μ̂[t-lag] * (X[t-lag,:]*X[t,:]' + X[t,:]*X[t-lag,:]')
            vcv = vcv + update
        end
    end  
    vcv = inv(X'*X) * vcv * inv(X'*X) 
end

newey_west (generic function with 1 method)

In [6]:
#test if it runs
gls(randn(20), randn(20,3))

ols_results([0.332724,0.00614142,-0.0920556],[-0.0667136,-0.232537,0.384747,0.744804,0.254334,-0.0802231,-0.645334,0.0597585,0.79786,0.132128,0.013184,0.122888,-0.277104,0.559242,0.318354,0.11863,0.415519,0.0543735,-0.308223,0.159704],[-0.113509,2.20071,-0.209796,0.0192079,0.310677,1.40704,-0.799849,-0.548425,0.9567,0.25381,-2.35936,-0.219342,0.540743,-1.6388,1.7748,0.121553,-0.529137,0.416274,0.0554698,0.522746],[0.0532598 0.00916819 -0.00937236; 0.00916819 0.0439666 -0.0108212; -0.00937236 -0.0108212 0.0854043],[1.44173,0.0292892,-0.315],[0.167547,0.976975,0.756596])

In [7]:
#simulation test
K = 3   # number of parameters
N = 100 # number of observations

# Create actual parameters and observations
β = randn(K) #[1, 10, 100]
X = randn(N, K)
X[:,1] = ones(N) # intercept
σ = 5
μ = σ * randn(N) # ~ N(0, σ)
y = X * β + μ;

In [11]:
res = gls(y, X)

ols_results([-0.828255,0.0523316,-0.357631],[-0.983187,-0.725403,-1.25167,-0.666347,-0.591869,-0.827688,-0.858945,-0.949696,-1.2801,-1.2284  …  -1.36285,-0.896799,-0.580468,-0.885217,-0.888864,-0.137331,-0.87076,-0.530072,-0.332972,-1.11631],[6.87641,-8.27622,0.0312563,-0.0760314,-7.14556,-0.325214,1.9361,8.70725,-5.06309,0.996656  …  1.85042,11.2515,3.03991,2.07551,-5.25347,-4.22124,0.255503,0.554342,0.421814,-0.559656],[0.222278 0.00841732 0.0103649; 0.00841732 0.25577 0.000744493; 0.0103649 0.000744493 0.236897],[-1.75677,0.103476,-0.734776],[0.082112,0.917799,0.464248])

In [23]:
gls(y, X; corr=:newey_west)

ols_results([-0.828255,0.0523316,-0.357631],[-0.983187,-0.725403,-1.25167,-0.666347,-0.591869,-0.827688,-0.858945,-0.949696,-1.2801,-1.2284  …  -1.36285,-0.896799,-0.580468,-0.885217,-0.888864,-0.137331,-0.87076,-0.530072,-0.332972,-1.11631],[6.87641,-8.27622,0.0312563,-0.0760314,-7.14556,-0.325214,1.9361,8.70725,-5.06309,0.996656  …  1.85042,11.2515,3.03991,2.07551,-5.25347,-4.22124,0.255503,0.554342,0.421814,-0.559656],[0.412017 -0.0127841 0.0795977; -0.0127841 0.350845 -0.033321; 0.0795977 -0.033321 0.405198],[-1.29035,0.08835,-0.561826],[0.199998,0.929781,0.575531])

In [24]:
# We test with the illustration of the slides 
y = [1673,1688,1666,1735,1749,1756,1815,1867,1948,2048,2128,2165,2257,2316,2324]
X2=[1839,1844,1831,1881,1883,1910,1969,2016,2126,2239,2336,2404,2487,2535,2595]
X = hcat(ones(length(X2)), X2, collect(1:15))
gls(y, X)

ols_results([300.286,0.741981,8.04356],[1672.83,1684.59,1682.98,1728.13,1737.65,1765.73,1817.55,1860.47,1950.13,2042.02,2122.03,2180.53,2250.16,2293.82,2346.38],[0.167431,3.41396,-16.9838,6.87355,11.346,-9.73102,-2.55145,6.53189,-2.12957,5.98303,5.96733,-15.5309,6.8411,22.1825,-22.38],[6133.65 -3.70794 220.206; -3.70794 0.00225946 -0.137052; 220.206 -0.137052 8.90154],[3.83421,15.6096,2.69597],[0.00237732,2.46242e-9,0.0194537])

# Estimate Cigarette Demand

In [None]:
X = readcsv()

# Background on OLS
## Problem
Ordinary Least Squares (OLS) is a solution for a system of linear equations by minimizing the sum of squared differences between the observed values ($y$) and the corresponding modelled values ($\hat y = X \hat\beta$).

A system of equations is given by
$$ y_i = \beta_1 + \beta_2 X_{i2}  + \beta_3 X_{i3}  + ... + \beta_K X_{iK} + \mu_i,  \quad i=1,...,N$$

or in matrix notation
$$ y = X \beta + \mu$$

with $y, \mu \in \mathbb{R}^{N\times1}$, $X \in \mathbb{R}^{N\times K}$ and $\beta \in \mathbb{R}^{K\times 1}$.

The OLS solves
$$\operatorname{\,min} \, \big\|y - X \hat\beta \big\|^2$$
that with some algebra leads to the [normal equations](https://en.wikipedia.org/wiki/Normal_equations) 
$$ X' X\beta = X'y$$ 

## Solution

### Normal equations

In case of a [rank][1] complete matrix $X$ the normal equations can be solved by inverting the ([Gram](https://en.wikipedia.org/wiki/Gramian_matrix)) matrix $X'X$ such that $\hat\beta = (X'X)^{-1}Xy$.

As X'X is a symmetric, positive definite matrix it's computationally more efficient to use the cholesky decomposition of $X'X = LL'$ with $L$ a lower triangular matrix. This gives $LL'\hat\beta = X'y$. First solve $Lz = X'y$ for $z$ by [forward substitution](https://en.wikipedia.org/wiki/Forward_substitution) and then $L'\hat\beta = z$ for $\beta$ by backward substitution.

However, solving the normal equations is numerically unstable, i.e. it is very sensitive to small pertubations in $X$. The [condition number](https://en.wikipedia.org/wiki/Condition_number) $\kappa$ of the system is worsened: $\kappa(X'X) = [\kappa(X)]^2$. We show the impact further below.

[1]: https://en.wikipedia.org/wiki/Rank_(linear_algebra)

We first simulate some parameters and data using the same notation as in the slides.

In [1]:
K = 5   # number of parameters
N = 100 # number of observations

# Create actual parameters and observations
β = randn(K)
X = randn(N, K)
X[:,1] = ones(N) # intercept
σ = 0.1
μ = randn(N) * σ # ~ N(0,1)
y = X * β + μ;

In [4]:
#define function
OLS_normal(y, X) = inv(X'X) * X'y

# test and print result side by side for quick sanity check
β̂_normal = OLS_normal(y, X)
hcat(β̂_normal, β)

5×2 Array{Float64,2}:
  2.19354    2.1908  
 -0.763111  -0.758047
  1.02621    1.03849 
  1.29059    1.27825 
  1.59663    1.59965 

In [5]:
function OLS_chol(y, X)
    X′X_chol = cholfact(X'X)
    X′X_chol\(X'*y)
end

# test and print result side by side for quick sanity check
β̂_chol = OLS_chol(y, X)
hcat(β̂_chol, β)

5×2 Array{Float64,2}:
  2.19354    2.1908  
 -0.763111  -0.758047
  1.02621    1.03849 
  1.29059    1.27825 
  1.59663    1.59965 

### QR Factorization
We decompose $X$ to its orthogonal [QR decomposition](https://en.wikipedia.org/wiki/QR_decomposition):
$$X = Q\begin{bmatrix}
R \\
0
\end{bmatrix} $$
with $Q\in\mathbb{R}^{N\times N} $ an [orthogonal matrix](https://en.wikipedia.org/wiki/Orthogonal_matrix) and $R\in\mathbb{R}^{k\times k}$ an upper triangular matrix (with positive diagonal elements).
The solution is then given by
$$R \hat\beta =\left(Q' \mathbf y \right)_K$$

This is the default method in Julia for solving OLS (for non-square matrix $X$) with `β̂ = X\y`

In [6]:
function OLS_qr(y, X)
    X_qr = qrfact(X)
    X_qr \ y
end


OLS_qr (generic function with 1 method)

In [7]:
β̂_qr = OLS_qr(y, X)
hcat(β̂_qr, β)

5×2 Array{Float64,2}:
  2.19354    2.1908  
 -0.763111  -0.758047
  1.02621    1.03849 
  1.29059    1.27825 
  1.59663    1.59965 

### SVD
Another orthogonal decomposition uses the [singular value decomposition](https://en.wikipedia.org/wiki/Singular_value_decomposition) (SVD) of $X$:
$$X = U \Sigma V'$$
with $U \in\mathbb{R}^{N\times N}$ and $V \in\mathbb{R}^{K\times K}$ both an orthogonal matrix, and $\Sigma\in\mathbb{R}^{N\times K}$ a diagonal matrix. SVD can transform any matrix into a diagonal matrix with the right choice of orthogonal coordinate systems for its domain and range. This even works for rank deficient matrices and is numerically stable! In such an ill-conditioned system, SVD will return the solution $\hat\beta$ that has the smallest norm.




In [8]:
function OLS_svd(y, X)
    X_svd = svdfact(X)
    X_svd \ y
end

OLS_svd (generic function with 1 method)

In [9]:
β̂_svd = OLS_svd(y, X)
hcat(β̂_svd, β)

5×2 Array{Float64,2}:
  2.19354    2.1908  
 -0.763111  -0.758047
  1.02621    1.03849 
  1.29059    1.27825 
  1.59663    1.59965 

## Numerical Stability
We now test numerical stability (example 3.5 in [1]) with 
$$X^s = \begin{bmatrix}
1 & 1\\
1 & 1+\epsilon
\end{bmatrix}, 
y^s = \begin{bmatrix}
2 \\
2 
\end{bmatrix}$$
This gives the solution $\hat\beta=\begin{bmatrix}2 & 0\end{bmatrix}'$ .

This system is ill-conditioned because a slight perturbation in $$y^s_e = \begin{bmatrix}
2 \\
2 + \epsilon
\end{bmatrix}$$
gives the compeltely different solution $\hat\beta=\begin{bmatrix}1 & 1\end{bmatrix}'$

We show that methods using the normal equations do not find the second solution.

[1]: Numerically Efficient Methods for Solving Least Squares Problems, Do Q Lee.

In [10]:
ϵ = 1e-7
Xˢ = [1 1;1 1+ϵ]
yˢ = [2; 2]
OLS_normal(yˢ, Xˢ)

2-element Array{Float64,1}:
 2.0
 0.0

In [11]:
yˢₑ = [2; 2+ϵ]
@show OLS_normal(yˢₑ, Xˢ)
@show OLS_chol(  yˢₑ, Xˢ);

OLS_normal(yˢₑ,Xˢ) = [0.75,1.125]
OLS_chol(yˢₑ,Xˢ) = [0.727273,1.27273]


In [12]:
@show OLS_qr( yˢₑ, Xˢ)
@show OLS_svd(yˢₑ, Xˢ);

OLS_qr(yˢₑ,Xˢ) = [1.0,1.0]
OLS_svd(yˢₑ,Xˢ) = [1.0,1.0]


### Advanced methods
Several advanced methods exist such as [bayesian OLS](https://en.wikipedia.org/wiki/Bayesian_linear_regression) (with possible priors), sparse OLS solvers, and iterative solvers. Several other methods for OLS exist , 

In [1]:
using IterativeSolvers

In [5]:
OLS_lsmr(y, X) = lsmr(X, y)

OLS_lsmr (generic function with 1 method)

In [6]:
β̂_lsmr = OLS_lsmr(y, X)
hcat(β̂_lsmr, β)

5×2 Array{Float64,2}:
 -0.673437  -0.671959
  0.998477   0.993547
  0.566924   0.567868
 -0.930122  -0.928808
  0.367578   0.34622 

(also exists [James-Stein estimator](https://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator))