# OLS, Testing with Non-iid Residuals

This notebook tests for autocorrelation and heteroskedasticity and then tests hypotheses using standard errors from White's and Newey-West's methods.

You may also consider the [HypothesisTests.jl](https://github.com/JuliaStats/HypothesisTests.jl) package (not used here).

## Load Packages and Extra Functions

In [1]:
MyModulePath = joinpath(pwd(),"src")
!in(MyModulePath,LOAD_PATH) && push!(LOAD_PATH,MyModulePath)
using FinEcmt_OLS

In [2]:
#=
include(joinpath(pwd(),"src","FinEcmt_OLS.jl"))
using .FinEcmt_OLS
=#

In [3]:
using DelimitedFiles, Statistics, LinearAlgebra, Distributions

## Loading Data

In [4]:
x = readdlm("Data/FFmFactorsPs.csv",',',skipstart=1)

                #yearmonth, market, small minus big, high minus low
(ym,Rme,RSMB,RHML) = (x[:,1],x[:,2]/100,x[:,3]/100,x[:,4]/100)
x = nothing

printlnPs("Sample size:",size(Rme))

Sample size:    (388,)


## OLS under the Gauss-Markov Assumptions

(assuming iid residuals)

In [5]:
Y = Rme
T = size(Y,1)
X = [ones(T) RSMB RHML]

(b,u,_,V,R²) = OlsGM(Y,X)
std_iid = sqrt.(diag(V))

printblue("OLS Results (assuming iid residuals):\n")
xNames = ["c","SMB","HML"]
printmat(b,std_iid;colNames=["b","std_iid"],rowNames=xNames)

[34m[1mOLS Results (assuming iid residuals):[22m[39m

            b   std_iid
c       0.007     0.002
SMB     0.217     0.073
HML    -0.429     0.074
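
The textbook calculation behind estimates and iid standard errors of this kind can be sketched on simulated data. This is only an illustration, not the source of `OlsGM()`: the function name `OlsSketch`, the simulated data and the choice to divide the sum of squared residuals by $T$ (rather than $T-k$) are assumptions made here.

```julia
using LinearAlgebra, Statistics, Random

function OlsSketch(Y,X)               #hypothetical name, not from FinEcmt_OLS
    T  = size(X,1)
    b  = X\Y                          #OLS estimates
    u  = Y - X*b                      #residuals
    σ² = sum(abs2,u)/T                #assumption: no degrees-of-freedom adjustment
    V  = inv(X'X)*σ²                  #Cov(b) when the residuals are iid
    return b, u, V
end

rng  = MersenneTwister(42)            #simulated data, for illustration only
Tsim = 1_000
Xsim = [ones(Tsim) randn(rng,Tsim,2)]
Ysim = Xsim*[0.5,1.0,-2.0] + randn(rng,Tsim)

(b_sim,u_sim,V_sim) = OlsSketch(Ysim,Xsim)
std_sim = sqrt.(diag(V_sim))          #standard errors of the 3 coefficients
```

With iid residuals as simulated here, `b_sim` should be close to the coefficients used to generate the data, and `Xsim'u_sim` should be zero up to rounding errors.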



# Distribution of OLS Estimates without the Gauss-Markov Assumptions


The distribution of the OLS estimates is (typically)

$(\hat{\beta}-\beta_{0})\overset{d}{\rightarrow}N(0,V)
\: \text{ where } \: V=S_{xx}^{-1} S S_{xx}^{-1}.$

Here, $S_{xx} = \sum\nolimits_{t=1}^{T}x_{t}x_{t}^{\prime}$
and $S$ is the covariance matrix of $\sum_{t=1}^{T}u_{t}x_{t}$.


*When* the Gauss-Markov assumptions hold, $S$ simplifies to $S_{xx}\sigma^2$, where $\sigma^2$ is the variance of $u_t$, so $V=S_{xx}^{-1}\sigma^2$.

In contrast, with heteroskedasticity and/or autocorrelation, $S$ must be estimated differently.

# Heteroskedasticity

## Regression Diagnostics: Heteroskedasticity

The `OlsWhitesTest()` function performs White's test for heteroskedasticity. The regression must have an intercept for this test to be useful.

In [6]:
@doc2 OlsWhitesTest

```
OlsWhitesTest(u,x)
```

Test of heteroskedasticity. Notice that the regression must contain  an intercept for the test to be useful.

### Input

  * `u::Vector`:   T-vector, residuals
  * `x::Matrix`:   Txk, regressors

### Output

  * `RegrStat::Number`: test statistic
  * `pval::Number`:     p-value


In [7]:
#println(@code_string OlsWhitesTest([1],[1]))    #print the source code

In [8]:
(WhiteStat,pval) = OlsWhitesTest(u,X)

printblue("White's test (H₀: heteroskedasticity is not correlated with regressors):\n")
printmat([WhiteStat,pval],rowNames=["stat","p-val"])

[34m[1mWhite's test (H₀: heteroskedasticity is not correlated with regressors):[22m[39m

stat     77.278
p-val     0.000
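
For intuition, the textbook version of White's test regresses the squared residuals on the unique squares and cross products of the regressors and uses $T R^2$ as the test statistic, which is asymptotically $\chi^2$ distributed with as many degrees of freedom as there are non-constant regressors in that auxiliary regression. The sketch below follows that recipe on simulated data. It is not the source of `OlsWhitesTest()`, which may differ in details. The function name `WhiteTestSketch` and the simulated series are made up for this illustration.

```julia
using LinearAlgebra, Statistics, Random

function WhiteTestSketch(u,x)                  #hypothetical name, not OlsWhitesTest
    (T,k) = size(x)
    #unique products x_i*x_j (i<=j); includes a constant if x has one
    w  = hcat([x[:,i].*x[:,j] for i in 1:k for j in i:k]...)
    y  = u.^2
    e  = y - w*(w\y)                           #residuals of the auxiliary regression
    R² = 1 - var(e)/var(y)
    df = rank(w) - 1                           #regressors besides the constant
    return T*R², df
end

rng   = MersenneTwister(7)                     #simulated data, for illustration only
Tsim  = 2_000
xsim  = [ones(Tsim) randn(rng,Tsim)]
u_het = randn(rng,Tsim).*abs.(xsim[:,2])       #variance depends on the regressor
u_hom = randn(rng,Tsim)                        #constant variance

(stat_het,df_het) = WhiteTestSketch(u_het,xsim)
(stat_hom,df_hom) = WhiteTestSketch(u_hom,xsim)
```

Here `u_het` and `u_hom` play the role of the residuals. A p-value could be computed as `ccdf(Chisq(df_het),stat_het)` with the Distributions package, which this notebook already loads.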



## White's Covariance Matrix

If $u_{t}x_{t}$ is not autocorrelated, then $S$ simplifies to $\sum_{t=1}^{T} x_tx_t^{\prime}\sigma_t^2$. White's method replaces $\sigma_t^2$ by $\hat{u}_{t}^{2}$. This estimate is robust to heteroskedasticity (in particular, time variation in $\sigma_t^2$ that is related to $x_t$).

### A Remark on the Code
$S_{xx}$ can be calculated as `Sxx = X'X` and $S$ as `S = (X.*u)'*(X.*u)`.

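These calculations can also be done in a loop, after initializing `Sxx` and `S` as matrices of zeros:
```
k = size(X,2)
(Sxx,S) = (zeros(k,k),zeros(k,k))       #kxk matrices to accumulate the sums in
for t = 1:T
   Sxx = Sxx + X[t,:]*X[t,:]'           #adding x_t*x_t'
   S   = S   + X[t,:]*X[t,:]'*u[t]^2    #adding x_t*x_t'*u_t^2
end
```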
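
A quick check on simulated data suggests that the vectorized and the loop-based calculations agree. The helper `SxxSLoop` and all names ending in `sim` are made up for this illustration. The loop is wrapped in a function so that it also runs outside a notebook.

```julia
using LinearAlgebra, Random

rng  = MersenneTwister(1)                 #simulated data, for illustration only
Tsim = 200
Xsim = [ones(Tsim) randn(rng,Tsim,2)]
usim = randn(rng,Tsim)

Sxx_vec = Xsim'Xsim                       #vectorized versions
S_vec   = (Xsim.*usim)'*(Xsim.*usim)

function SxxSLoop(X,u)                    #hypothetical helper, loop versions
    (T,k) = size(X)
    Sxx   = zeros(k,k)
    S     = zeros(k,k)
    for t = 1:T
        x_t  = X[t,:]                     #k-vector with observation t
        Sxx += x_t*x_t'
        S   += x_t*x_t'*u[t]^2
    end
    return Sxx, S
end

(Sxx_loop,S_loop) = SxxSLoop(Xsim,usim)
println(Sxx_loop ≈ Sxx_vec, " ", S_loop ≈ S_vec)
```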

In [9]:
Sxx = X'X

S     = (X.*u)'*(X.*u)                #S according to White's method
V     = inv(Sxx)'S*inv(Sxx)           #Cov(b), White
std_W = sqrt.(diag(V))

printblue("Coefficients and standard errors (from different methods):\n")
xx = [b std_iid std_W]
printmat(xx;colNames=["b","std_iid","std_White"],rowNames=xNames,width=12)

[34m[1mCoefficients and standard errors (from different methods):[22m[39m

              b     std_iid   std_White
c         0.007       0.002       0.002
SMB       0.217       0.073       0.113
HML      -0.429       0.074       0.097



# Autocorrelation

## Regression Diagnostics: Autocorrelation of the Residuals

The `OlsAutoCorr()` function estimates autocorrelations and calculates the Durbin-Watson (DW) and Box-Pierce statistics for the input (often the residuals).

In [10]:
@doc2 OlsAutoCorr

```
OlsAutoCorr(u,L=1)
```

Test the autocorrelation of OLS residuals

### Input

  * `u::Vector`:   T-vector, residuals
  * `L::Int`:      scalar, number of lags in autocorrelation and Box-Pierce test

### Output

  * `AutoCorr::Matrix`:   Lx3, autocorrelation, t-stat and p-value
  * `BoxPierce::Matrix`:  1x2, Box-Pierce statistic and p-value
  * `DW::Number`:         DW statistic

### Requires

  * StatsBase, Distributions


In [11]:
#println(@code_string OlsAutoCorr([1],5))    #print the source code

In [12]:
L = 3     #number of autocorrs to test

(ρStats,BoxPierce,DW) = OlsAutoCorr(u,L)

printmagenta("Testing autocorrelation of residuals\n")

printblue("Autocorrelations (lag 1 to $L):\n")
printmat(ρStats,colNames=["autocorr","t-stat","p-val"],rowNames=1:L,cell00="lag")

printblue("\nBoxPierce ($L lags): ")
printmat(BoxPierce',rowNames=["stat","p-val"])

printblue("DW statistic:")
printlnPs(DW)

[35m[1mTesting autocorrelation of residuals[22m[39m

[34m[1mAutocorrelations (lag 1 to 3):[22m[39m

lag  autocorr    t-stat     p-val
1       0.074     1.467     0.142
2      -0.037    -0.733     0.464
3       0.019     0.377     0.706


[34m[1mBoxPierce (3 lags): [22m[39m
stat      2.831
p-val     0.418

[34m[1mDW statistic:[22m[39m
     1.849
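
Statistics of this kind can be sketched with standard textbook formulas: the t-stat of the $s$th autocorrelation as $\sqrt{T}\hat{\rho}_s$, the Box-Pierce statistic as $T\sum_{s=1}^{L}\hat{\rho}_s^2$ and the DW statistic as $\sum_{t=2}^{T}(u_t-u_{t-1})^2/\sum_{t=1}^{T}u_t^2$. The code below applies them to a simulated AR(1) series. It is not the source of `OlsAutoCorr()`, which may differ in details. The function name `AutoCorrSketch` and the simulated data are assumptions made here, and p-values are left out.

```julia
using Statistics, Random

function AutoCorrSketch(u,L=1)              #hypothetical name, not OlsAutoCorr
    T  = length(u)
    ud = u .- mean(u)                       #demeaned series
    γ0 = sum(abs2,ud)/T                     #variance
    ρ  = [sum(ud[s+1:T].*ud[1:T-s])/(T*γ0) for s in 1:L]   #autocorrelations
    tstat     = sqrt(T)*ρ                   #approximately N(0,1) under no autocorrelation
    BoxPierce = T*sum(abs2,ρ)               #approximately χ²(L) under no autocorrelation
    DW        = sum(abs2,diff(u))/sum(abs2,u)
    return ρ, tstat, BoxPierce, DW
end

rng  = MersenneTwister(3)                   #simulated AR(1), for illustration only
Tsim = 5_000
ε    = randn(rng,Tsim)
usim = zeros(Tsim)
for t in 2:Tsim
    usim[t] = 0.5*usim[t-1] + ε[t]
end

(ρ_sim,t_sim,BP_sim,DW_sim) = AutoCorrSketch(usim,3)
```

For a series with (close to) zero mean, the DW statistic is approximately $2(1-\hat{\rho}_1)$, so a value near 2 indicates little first-order autocorrelation.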


## Autocorrelation of `X.*u`

What matters most for the uncertainty about a slope coefficient is not the autocorrelation of the residual itself, but of the residual times the regressor. This is tested below.
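
A short derivation (added here for illustration, assuming that $g_t=u_{t}x_{t}$ is covariance stationary) indicates why. Since $S$ is the covariance matrix of $\sum_{t=1}^{T}g_{t}$, we have

$S = \sum_{s=-(T-1)}^{T-1} (T-|s|)\,\mathrm{Cov}(g_{t},g_{t-s}),$

so every non-zero autocovariance of $u_{t}x_{t}$ changes $S$. The autocovariances of $u_t$ itself enter only through the element of $x_t$ that is a constant.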

In [13]:
k = size(X,2)
for i in 1:k         #iterate over different regressors
    #local ρStats
    ρStats, = OlsAutoCorr(X[:,i].*u,L)
    printblue("Autocorrelations of $(xNames[i])*u  (lag 1 to $L):")
    printmat(ρStats,colNames=["autocorr","t-stat","p-val"],rowNames=1:L,cell00="lag")
end

[34m[1mAutocorrelations of c*u  (lag 1 to 3):[22m[39m
lag  autocorr    t-stat     p-val
1       0.074     1.467     0.142
2      -0.037    -0.733     0.464
3       0.019     0.377     0.706

[34m[1mAutocorrelations of SMB*u  (lag 1 to 3):[22m[39m
lag  autocorr    t-stat     p-val
1       0.219     4.312     0.000
2      -0.014    -0.268     0.789
3       0.044     0.857     0.391

[34m[1mAutocorrelations of HML*u  (lag 1 to 3):[22m[39m
lag  autocorr    t-stat     p-val
1       0.278     5.472     0.000
2       0.131     2.582     0.010
3       0.225     4.438     0.000



## Newey-West's Covariance Matrix

Let $g_t=u_{t}x_{t}$ be a $k$-vector of data.

To calculate the Newey-West covariance matrix, we first need

$\Lambda_{s} = \sum_{t=s+1}^{T} (g_{t}-\bar{g})(g_{t-s}-\bar{g})^{\prime},$

which is proportional to the $s$th autocovariance matrix.

Then we form a linear
combination (with tent-shaped weights) of those autocovariance matrices (from
lag $-m$ to $m$) as in

$S = \mathrm{Cov}(\sum_t g_t)  = 
\Lambda_{0} + \sum_{s=1}^{m}( 1-\frac{s}{m+1})  
(\Lambda_{s}+\Lambda_{s}^{\prime}).$

With $m=0$ this is the same as White's method (OLS residuals satisfy $\sum_{t=1}^{T}u_{t}x_{t}=0$, so $\bar{g}=0$).

If we divide $S$ by $T$, then we get an estimate of $\mathrm{Cov}(\sqrt{T} \bar{g})$, and if we instead divide by $T^2$ then we get an estimate of $\mathrm{Cov}(\bar{g})$.

The `CovNW()` function implements this.

In [14]:
@doc2 CovNW

```
CovNW(g0,m=0,DivideByT=0)
```

Calculates covariance matrix of sample sum (DivideByT=0), √T*(sample average) (DivideByT=1) or sample average (DivideByT=2).

### Input

  * `g0::Matrix`:      Txq matrix of data
  * `m::Int`:          number of lags to use
  * `DivideByT::Int`:  divide the result by T^DivideByT

### Output

  * `S::Matrix`: qxq covariance matrix

### Remark

  * `DivideByT=0`: Var(g₁+g₂+...), variance of sample sum
  * `DivideByT=1`: Var(g₁+g₂+...)/T = Var(√T gbar), where gbar is the sample average. This is  the same as Var(gᵢ) if data is iid
  * `DivideByT=2`: Var(g₁+g₂+...)/T^2 = Var(gbar)


In [15]:
using CodeTracking
println(@code_string CovNW([1],2))    #print the source code

function CovNW(g0,m=0,DivideByT=0)

    T = size(g0,1)                    #g0 is Txq
    m = min(m,T-1)                    #number of lags

    g = g0 .- mean(g0,dims=1)         #normalizing to zero means

    S = g'g                           #(qxT)*(Txq)
    for s = 1:m
        Λ_s = g[s+1:T,:]'g[1:T-s,:]   #same as Sum[g_t*g_{t-s}',t=s+1,T]
        S   = S  +  (1 - s/(m+1))*(Λ_s + Λ_s')
    end

    (DivideByT > 0) && (S = S/T^DivideByT)

    return S

end
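
The claim that zero lags reproduces White's method can be checked on simulated data. The sketch below mirrors the logic of the printed source in a separately named function (`NWSketch`, a name made up here, without the `DivideByT` option) and applies it to `X.*u` from an OLS regression on simulated, heteroskedastic data. All names ending in `sim` are assumptions for this illustration.

```julia
using LinearAlgebra, Statistics, Random

function NWSketch(g0,m=0)                   #hypothetical name; same steps as the printed source
    T = size(g0,1)
    m = min(m,T-1)                          #number of lags
    g = g0 .- mean(g0,dims=1)               #demeaned data
    S = g'g
    for s = 1:m
        Λ_s = g[s+1:T,:]'g[1:T-s,:]
        S   = S + (1 - s/(m+1))*(Λ_s + Λ_s')
    end
    return S
end

rng  = MersenneTwister(11)                  #simulated data, for illustration only
Tsim = 500
Xsim = [ones(Tsim) randn(rng,Tsim,2)]
Ysim = Xsim*[0.1,0.5,-0.5] + randn(rng,Tsim).*(1 .+ abs.(Xsim[:,2]))
usim = Ysim - Xsim*(Xsim\Ysim)              #OLS residuals, so Xsim'usim ≈ 0
gsim = Xsim.*usim

S_White = gsim'gsim                         #S according to White's method
S_NW0   = NWSketch(gsim,0)                  #0 lags
S_NW2   = NWSketch(gsim,2)                  #2 lags
weights = [1 - s/(2+1) for s in 1:2]        #tent-shaped weights for m=2: 2/3 and 1/3
```

Since OLS residuals are orthogonal to the regressors, the columns of `gsim` have zero means, so demeaning changes nothing and `S_NW0` coincides with `S_White` up to rounding errors.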


In [16]:
S      = CovNW(X.*u,2)         #S according to Newey-West, 2 lags
V      = inv(Sxx)'S*inv(Sxx)     #Cov(b), Newey-West
std_NW = sqrt.(diag(V))

S       = CovNW(X.*u,0)        #S according to Newey-West, 0 lags
V       = inv(Sxx)'S*inv(Sxx)
std_NW0 = sqrt.(diag(V))

printblue("Coefficients and standard errors (from different methods):\n")
xx = [b std_iid std_W std_NW std_NW0]
printmat(xx,colNames=["b","std_iid","std_White","std_NW","std_NW 0 lags"],rowNames=xNames,width=16)

printred("Remark: NW with 0 lags should be the same as White's method")

[34m[1mCoefficients and standard errors (from different methods):[22m[39m

                  b         std_iid       std_White          std_NW   std_NW 0 lags
c             0.007           0.002           0.002           0.002           0.002
SMB           0.217           0.073           0.113           0.129           0.113
HML          -0.429           0.074           0.097           0.118           0.097

[31m[1mRemark: NW with 0 lags should be the same as White's method[22m[39m


# A Convenience Function for Printing the Tests (extra)

In [17]:
DiagnosticsNoniidTable(X,u,3,xNames)

[34m[1mWhite's test (H₀: heteroskedasticity is not correlated with regressors)[22m[39m
stat     77.278
p-val     0.000

[34m[1mTesting autocorrelation of residuals (lag 1 to 3)[22m[39m
lag  autocorr    t-stat     p-val
1       0.074     1.467     0.142
2      -0.037    -0.733     0.464
3       0.019     0.377     0.706

[34m[1mBoxPierce (3 lags) [22m[39m
stat      2.831
p-val     0.418

[34m[1mDW statistic[22m[39m
     1.849          

[34m[1mAutocorrelations of c*u  (lag 1 to 3)[22m[39m
lag  autocorr    t-stat     p-val
1       0.074     1.467     0.142
2      -0.037    -0.733     0.464
3       0.019     0.377     0.706

[34m[1mAutocorrelations of SMB*u  (lag 1 to 3)[22m[39m
lag  autocorr    t-stat     p-val
1       0.219     4.312     0.000
2      -0.014    -0.268     0.789
3       0.044     0.857     0.391

[34m[1mAutocorrelations of HML*u  (lag 1 to 3)[22m[39m
lag  autocorr    t-stat     p-val
1       0.278     5.472     0.000
2       0.131     2.582    