# Case 2: Properties of OLS and simulation methods

by Milan Van den Heuvel, Ken Bastiaensen, Gonzalo Villa
*Advanced Econometrics 2016-2017.

Strict set of GM assumptions:
* X is deterministic, x is thus fixed over repeated samples
* errors $\mu$ are normally distributed with assumed homoscedastic errors

#### Question: *Give the small sample and asymptotic properties of the OLS estimator for $\beta$ and for the estimator of the standard errors.*

Small sample properties:
* OLS is the best unbiased estimator
* The estimator is normally distributed (stems from the fact that $\hat{\beta}$ is linear function of the disturbance vector $\mu$)
* The covariance matrix $\sigma^2(X'X)^{-1}$ can be estimated with an unbiased estimator of $\sigma^2$ given by:

$$\hat{\sigma}^2 = \frac{\hat{\mu}'\hat{\mu}}{N-K} = \frac{y'My}{N-K}$$


Asymptotic properties:
* same under the GM conditions
* $\bar{x}_N$ assymptotically approaches $N(\mu,\frac{\sigma^2}{N})$

# Part 1: Properties of Monte Carlo simulations

In [1]:
using Distributions: Normal, TDist, ccdf, fit
using Plots
gc()

## General normality test (Jarque-Berra)

In [2]:
#testing for normality
function JB_test(X)
    E_X = mean(X,1)[1]
    σ = std(X,1)[1]
    n = length(X)
    S = sum((X - E_X).^3/σ^3)/n
    K = sum((X - E_X).^4/σ^4)/n
    return n*(S^2/6 + (K-3)^2/24)
end

JB_test (generic function with 1 method)

In [3]:
# include functions from file
include("functions_lib.jl"); 

In [4]:
# Note that we can use unicode for identifiers by using latex and tab completion (e.g. \beta+<TAB>)
β₀ = 10
β₁ = 1
β  = [β₀, β₁]
σ² = 1
T  = 25  # sample size
runs = 10_00 # underscore for readability, doesn't affect number

1000

## Deterministic X

We create a function to run MC simulations:
1. specify a population = N(5,2) and draw a sample once to have a deterministic sample.
1. simulate y by simulating errors with variance $\sigma^2$ (= 1 here).
1. run ols and store results.
1. The 'true' standard errors is the standard deviation over all estimated $\hat\beta$ (True SE = $\sum_{run=0}^{runs} se(\hat\beta_{run}$)).
1. return true value and mean of estimated for $\beta$ as well as for its standard error.


In [5]:
# simple implementation
function mc_simple(β, σ², T, runs)
    K = length(β)
    
    # simulate X once, deterministically
    X  = hcat(ones(T), rand(Normal(5, 2), T, K-1)) #concatenation of column of ones for the constant terms and the randomly drawn x's for the beta terms 
    
    # variables with mc results
    β_mc    = zeros(runs, K)
    β_var_mc= zeros(runs, K)

    # pre-allocate memory to speed up value-allocation process
    Xβ = X * β
    μ_dist = Normal(0, √σ²) 
    
    for run = 1:runs
        y = Xβ + rand(μ_dist, T)
        result = ols(y, X)
        
        β_mc[run, :] = result.coefs
        β_var_mc[run, :] = diag(result.vcv)
    end
    
    return β_mc, β, mean(β_mc,1), sqrt(mean(β_var_mc,1)), std(β_mc,1)
end

mc_simple (generic function with 1 method)

In [6]:
mc_simple(β, σ², T, runs)

(
[9.61881 1.07845; 10.1862 1.00875; … ; 10.8417 0.862877; 9.57835 1.08263],

[10,1],
[9.97834 1.00522],

[0.56242 0.11271],

[0.560389 0.110103])

In [7]:
T = 25
jb = zeros(1000)
for i = 1:1000
    β_mcs, True_β, Est_β, True_σ, Est_σ = mc_simple(β, σ², T, runs)
    jb[i] = JB_test(β_mcs[:,1])
    i+=1
end
histogram(jb)

#### Interlude: Julia speedups
You can profile the code to identify possible speedups. We see that most of the time is spent in solving OLS. Because X is deterministic and we only need to factorize it once. Changing this part of the code almost doubles the speed. See the bottom of this notebook.

### Comparison to true standard errors
We see that the mean of the estimated standard errors are close to the 'true' standard errors, even when running only 100 simulations for 25 samples:

In [8]:
for T = [25, 50, 100, 500]
    β_mcs, True_β, Est_β, True_σ, Est_σ = mc_simple(β, σ², T, runs)
    println("For sample size: ", T, " True_β: ", True_β," Est_β: ", Est_β, " True_σ: ", True_σ, " Est_σ: ", Est_σ)
end

For sample size: 25 True_β: [10,1] Est_β: [9.95891 1.00778] True_σ: [0.74098 0.135049] Est_σ: [0.750185 0.136415]
For sample size: 50 True_β: [10,1] Est_β: [10.0004 1.00004] True_σ: [0.424114 0.0761988] Est_σ: [0.421667 0.0763837]
For sample size: 100 True_β: [10,1] Est_β: [9.99162 1.00082] True_σ: [0.282303 0.0531131] Est_σ: [0.286169 0.0534766]
For sample size: 500 True_β: [10,1] Est_β: [10.0046 0.998993] True_σ: [0.115654 0.0213827] Est_σ: [0.116825 0.0215092]


### t-test
We now perform a t-test for several null hypothesis for $\beta_1 = 1; 0.9; 0.8$ and this for several sample sizes, we also report the p-values. Because we know that X is normally distributed we do not need to use the simulated t-stats since we know that the t-values will be student-t distributed and can immediately do a standard t-test.

In [9]:
runs = 10_000
for T = [25, 50, 100, 500, 10000]
    println("## T = ",T," ##")
    for β₁_hyp = [1, 0.9, 0.7, 0.5]
        β_mc, _, β_mean, _, β_se = mc_simple(β, σ², T, runs)
        K = size(β_mean)[1] #amount of estimated parameters = amount of d.o.f. lost
        ttest = (β_mean[2] - β₁_hyp) / β_se[2]
        pval  = 2 * ccdf(TDist(T-K), abs(ttest)) # what is the change that if you reject a correct null
        println("β₁=", β₁_hyp, "; T-test: ", ttest)
        println("β₁=", β₁_hyp, "; P-val: ", pval)
    end
end

## T = 25 ##
β₁=1.0; T-test: -0.017576452359100398
β₁=1.0; P-val: 0.9861220475251091
β₁=0.9; T-test: 0.7423645710834789
β₁=0.9; P-val: 0.4650706976235941
β₁=0.7; T-test: 2.9089420116615585
β₁=0.7; P-val: 0.007696620812111777
β₁=0.5; T-test: 4.219458329024743
β₁=0.5; P-val: 0.0003022060621204789
## T = 50 ##
β₁=1.0; T-test: 0.004250407171241445
β₁=1.0; P-val: 0.9966259335743748
β₁=0.9; T-test: 1.4519020429488192
β₁=0.9; P-val: 0.15290209612917022
β₁=0.7; T-test: 4.359381228936486
β₁=0.7; P-val: 6.67380890812638e-5
β₁=0.5; T-test: 7.24347315864582
β₁=0.5; P-val: 2.785889744062523e-9
## T = 100 ##
β₁=1.0; T-test: 0.010438932632286346
β₁=1.0; P-val: 0.9916920953296358
β₁=0.9; T-test: 1.946718605556937
β₁=0.9; P-val: 0.05440263304951062
β₁=0.7; T-test: 5.7380802723242414
β₁=0.7; P-val: 1.0470850679868123e-7
β₁=0.5; T-test: 9.531425484317243
β₁=0.5; P-val: 1.1516977211697495e-15
## T = 500 ##
β₁=1.0; T-test: -0.00024933203613682184
β₁=1.0; P-val: 0.9998011614633155
β₁=0.9; T-test: 4.39270705

# I (milan) am not sure if we need to take the 2,5% and 97,5% values from the t-values for every estimated parameter. Since we know X is normally distributed, we know the t-test is student-t distributed so I don't think this is necessary and the previous is enough

In [10]:
runs = 10_000
t-values = zeros(runs)
for T = [25, 50, 100, 500, 10000]
    println("## T = ",T," ##")
    for β₁_hyp = [1, 0.9, 0.7, 0.5]
        β_mc, _, β_mean, _, β_se = mc_simple(β, σ², T, runs)
        t-values = (β_mc[:,2].-β₁_hyp)/ β_se[2]
    end
end

LoadError: error in method definition: function Base.- must be explicitly imported to be extended

## Conclusion

From these results we clearly see that under the strict assumptions of Gauss-Markov the OLS estimator for $\beta$ ($\hat{\beta}$) and the estimator for the standard deviation ($\hat{\sigma}$) of this estimator are unbiased and, even for small samples, very close to the true values. The estimated standard deviations can also be seen to decline with increasing sample size, thus the distribution of the estimated parameters grows more peaked around the real values. Since the error terms are pulled from a normal distribution and $\hat{\beta}$ is a weighted sum of these, where the weigting is deterministic, it is itself also normally distributed. This is proven by the Jarque-Berra test.

## MC with stochastic variables

X now changes over each run, this causes some properties to change.

We know X and $\mu$ are still independent.

OLS estimator is still:
* unbiased 
* efficient 

but in small samples no longer necessarily normally distributed and the standard covariance matrix should be interpreted as being conditional on X. In large samples, the estimator will still be normally distributed.

In [11]:
function mc_stoch(β, σ², T, runs)
    K = length(β)
    
    # variables with mc results
    β_mc    = zeros(runs, length(β))
    β_var_mc= zeros(runs, length(β))

    # pre-allocate
    μ_dist = Normal(0, √σ²)
    X_dist = Normal(5, 2)
    
    X = ones(T, K)
    for run = 1:runs
        # simulate inside the loop
        X[:, 2:end] = rand(X_dist, T, K-1)
        y = X*β + rand(μ_dist, T)
        result = ols(y, X)
        
        β_mc[run, :] = result.coefs
        β_var_mc[run,:] = diag(result.vcv)
    end
    
    return β_mc, β, mean(β_mc,1), sqrt(mean(β_var_mc,1)), std(β_mc,1)
end

mc_stoch (generic function with 1 method)

In [12]:
mc_stoch(β, σ², 10, 5)

(
[7.24927 1.39322; 10.067 0.872992; … ; 10.9337 0.76667; 9.93937 1.00783],

[10,1],
[9.39928 1.06201],

[0.853356 0.137324],

[1.41997 0.2639])

In [13]:
β_mcs2, True_β, Est_β, True_σ, Est_σ = mc_stoch(β, σ², 25, 5000)
histogram(β_mcs2[:,1])

In [14]:
jb = zeros(1000)
for i = 1:1000
    β_mcs, True_β, Est_β, True_σ, Est_σ = mc_stoch(β, σ², 25, 1000)
    jb[i] = JB_test(β_mcs[:,1])
    i+=1
end
histogram(jb)

In [15]:
histogram(jb, nbins =100)

In [16]:
runs = 10_000
for T = [25, 50, 100, 500]
    println("## T = ",T," ##")
    for β₁_hyp = [1, 0.9, 0.7, 0.5]
        _, β_mean, _, β_se = mc_stoch(β, σ², T, runs)
        K = size(β_mean)[1] #amount of estimated parameters = amount of d.o.f. lost
        ttest = (β_mean[2] - β₁_hyp) / β_se[2]
        pval  = 2 * ccdf(TDist(T-K), abs(ttest)) # what is the change that you reject a correct null
        println("β₁=", β₁_hyp, "; T-test: ", ttest)
        println("β₁=", β₁_hyp, "; P-val: ", pval)
    end
end

## T = 25 ##
β₁=1.0; T-test: 0.0
β₁=1.0; P-val: 1.0
β₁=0.9; T-test: 0.9370817916891403
β₁=0.9; P-val: 0.35845289752253584
β₁=0.7; T-test: 2.803705616836144
β₁=0.7; P-val: 0.010083805280657106
β₁=0.5; T-test: 4.692650262354372
β₁=0.5; P-val: 0.0001001342959117299
## T = 50 ##
β₁=1.0; T-test: 0.0
β₁=1.0; P-val: 1.0
β₁=0.9; T-test: 1.3715439009734423
β₁=0.9; P-val: 0.17658308070756756
β₁=0.7; T-test: 4.127594015327864
β₁=0.7; P-val: 0.000145231790756149
β₁=0.5; T-test: 6.84446596555132
β₁=0.5; P-val: 1.2734540651698976e-8
## T = 100 ##
β₁=1.0; T-test: 0.0
β₁=1.0; P-val: 1.0
β₁=0.9; T-test: 1.9685561828414593
β₁=0.9; P-val: 0.051829053356772126
β₁=0.7; T-test: 5.914610272827658
β₁=0.7; P-val: 4.8740740449270176e-8
β₁=0.5; T-test: 9.847158528390104
β₁=0.5; P-val: 2.5979871961813796e-16
## T = 500 ##
β₁=1.0; T-test: 0.0
β₁=1.0; P-val: 1.0
β₁=0.9; T-test: 4.461637732724556
β₁=0.9; P-val: 1.006453335725981e-5
β₁=0.7; T-test: 13.371980689378116
β₁=0.7; P-val: 4.6095771335049883e-35
β₁=0.5; T-te

## Conclusion

The Jarque-Berra test shows that the power of it is lower because it is clear that there is a lot more probability mass beyond the critical value of 5,99 than in the deterministic case. The estimators are still unbiased and consistent.

# Lagged Dependent Variable

Introducing lagged dependent variables makes it so that the assumption "X and $\mu$ are independent" has to be relaxed to $E[\mu_t|x_t] = 0$ or thus that the errors are contemporaneously independent with any explanatory variables.

The OLS estimator becomes:
* Biased: $E[\hat{\beta}|X] = \beta + (X'X)^{-1}X'E[\mu|X]$ => $E[\hat{\beta}] = E_X(E[\hat{\beta}|X]) \neq \beta$
* Consistent and asymptotically normally distributed: $plim\hat{\beta} = \beta + plim \frac{X'X}{T}^{-1} plim\frac{X'\mu}{T}$ = 0 because $plim\frac{X'\mu}{T} = E(x_t\mu_t) = 0$
* $\hat{\sigma}^2 = \frac{\hat{\mu}'\hat{\mu}}{T-k}$ is still a consistent estimator for $\sigma^2$

In [17]:
# AR1 MC simulation
function mc_ar1(β, σ², T, runs)
    K = length(β)
    β₀, β₁ = β
    σ = √σ² # = sqrt(σ²)
    
    # variables with mc averages
    β_mc     = zeros(runs, K)
    β_var_mc = zeros(runs, K)

    # pre-allocate
    y = zeros(T)
    X = ones(T, K) # fill second column with y_{t-1}
    y₀_dist = Normal(β₀/(1-β₁), sqrt(σ²/(1-β₁^2)))
    
    for run = 1:runs
        # simulate y
        y₀ = rand(y₀_dist) 
        y[1] = β₀ + β₁*y₀ + σ*randn() 
        for t = 2:T
            y[t] = β₀ + β₁*y[t-1] + σ*randn() 
        end
        # copy into X
        X[1,2] = y₀
        X[2:end, 2] = y[1:end-1]
        
        # ols
        result = ols(y, X)
        β_mc[run,:]    = result.coefs
        β_var_mc[run,:]= diag(result.vcv)
    end
    
    return β, mean(β_mc,1), sqrt(mean(β_var_mc,1)), std(β_mc,1)
end

mc_ar1 (generic function with 1 method)

In [18]:
β₀, β₁ = 10, 0.1
σ² = 1
T = 1000
runs = 10_000
@time mc_ar1([β₀, β₁], σ², T, runs)

  1.563452 seconds (1.16 M allocations: 1.150 GB, 46.51% gc time)


([10.0,0.1],
[10.0168 0.0985177],

[0.351274 0.031485],

[0.349283 0.0313248])

Let's plot the bias

In [19]:
using Plots
gr();

In [20]:
Ts  = vcat(collect(10:10:90), collect(100:25:500))
β₁s = [0, 0.5, 0.9]
β̂ = [mc_ar1([β₀, β₁], σ², T, runs)[2][1] for T in Ts, β₁ in β₁s]

26×3 Array{Float64,2}:
 10.9945  14.6086  47.2055
 10.4985  12.5004  30.1839
 10.3476  11.5828  23.5716
 10.2455  11.2308  20.2302
 10.1867  11.0285  18.1248
 10.1688  10.828   16.606 
 10.1619  10.7504  15.7192
 10.1195  10.6046  14.98  
 10.123   10.5369  14.3785
 10.1093  10.5263  14.0045
 10.0782  10.4065  13.1364
 10.0778  10.3369  12.5719
 10.0687  10.2906  12.2245
 10.0572  10.2597  11.9424
 10.0302  10.2105  11.7263
 10.0393  10.215   11.5136
 10.028   10.1691  11.4035
 10.0397  10.1582  11.3002
 10.0334  10.1516  11.1739
 10.0376  10.1366  11.0818
 10.0296  10.1255  11.0184
 10.024   10.1282  10.9126
 10.0128  10.1106  10.8527
 10.0217  10.1188  10.8371
 10.0167  10.1119  10.8053
 10.0238  10.0985  10.7641

In [21]:
plot(Ts, β̂, label=string.(β₁s'))

Given a certain sample size and estimated AR(1) coefficient, you can use the matrix for $\hat\beta$ (or the graph) to estimate the bias for $\beta_1$ (note that reported values are relative to 10).

## Appendix: Julia performance profiling

In [22]:
# around 4s on my laptop
@time mc_simple(β, σ², T, 100_000);

 16.041294 seconds (11.20 M allocations: 12.241 GB, 43.62% gc time)


In [23]:
using ProfileView #run `Pkg.add("ProfileView")` if not yet installed

LoadError: ArgumentError: Module ProfileView not found in current path.
Run `Pkg.add("ProfileView")` to install the ProfileView package.

In [24]:
Profile.clear()
@profile mc_simple(β, σ², T, 1_000)
ProfileView.view() #interactive graph with mouse over,  scroll and drag

LoadError: UndefVarError: ProfileView not defined

In [25]:
# implementation that factorizes X only once
function mc_fact(β, σ², T, runs)
    
    # simulate X once, deterministically
    X = hcat(ones(T), rand(Normal(5, 2), T))
    
    # variables with mc results    
    β_mc    = zeros(runs, length(β))
    β_var_mc= zeros(runs) #only keep σ̂²T = dot(μ̂, μ̂) = σ̂²*(T-K) per run

    # pre-allocate
    Xβ     = X * β
    μ_dist = Normal(0, √σ²)
    x_fact = factorize(X)
    XtXinvd= diag(inv(X'*X))
    
    for run = 1:runs
        y = Xβ  + rand(μ_dist, T)
        β̂ = x_fact \ y #factorization already done now
        μ̂ = y - X * β̂
        σ̂²T = dot(μ̂, μ̂) #put factor /(T-K) outside of loop
        
        β_mc[run, :]    = β̂
        β_var_mc[run,:] = σ̂²T
    end
    se_true = std(β_mc, 1)
    se_mc   = sqrt(mean(β_var_mc) / (T - length(β)) * XtXinvd)
    return β, mean(β_mc, 1), se_true, se_mc
end

mc_fact (generic function with 1 method)

In [26]:
# runs in about 2s on my laptop
mc_fact(β, σ², 25, 1) # first run includes JIT compilation
@time mc_fact(β, σ², T, 100_000) 

 10.075074 seconds (6.80 M allocations: 10.332 GB, 53.23% gc time)


([10,1],
[10.0002 0.999964],

[0.0891672 0.0165258],

[0.0890406,0.0164875])