# Instrumental Variables

This notebook defines a function for 2SLS and illustrates it by redoing an example from Ch 10.3.3 in "Principles of Econometrics", 3rd edition (Hill, Griffiths and Lim).

## Load Packages and Extra Functions

In [1]:
using Printf, DelimitedFiles, LinearAlgebra, Statistics

include("jlFiles/printmat.jl")
include("jlFiles/Ols.jl")
include("jlFiles/CovNWFn.jl")
include("jlFiles/UtilityFunctions.jl");

# A Function for IV & 2SLS

See the lecture notes for a derivation and more detailed explanation.
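
As a compact summary of what the function below computes, here are the steps in matrix notation for the iid case. The notation is chosen here for illustration and need not match the lecture notes: $y$ is $T\times 1$, $X$ is $T\times k$ and $Z$ is $T\times L$.

$$
\hat{\delta} = (Z'Z)^{-1}Z'X, \qquad \hat{X} = Z\hat{\delta}, \qquad \hat{b} = (\hat{X}'\hat{X})^{-1}\hat{X}'y
$$

$$
\hat{u} = y - X\hat{b}, \qquad \widehat{\mathrm{Cov}}(\hat{b}) = \hat{\sigma}^2\left[X'Z(Z'Z)^{-1}Z'X\right]^{-1}, \qquad \hat{\sigma}^2 = \widehat{\mathrm{Var}}(\hat{u})
$$

Notice that the residuals are based on $X$, not on $\hat{X}$.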

In [2]:
"""
    TwoSLSFn(y,x,z,NWQ=false,m=0)

# Input
- `y::VecOrMat`:      Tx1 or T-vector of the dependent variable
- `x::VecOrMat`:      Txk matrix (or vector) of regressors
- `z::VecOrMat`:      TxL matrix (or vector) of instruments
- `NWQ::Bool`:        if true, then Newey-West's covariance matrix is used, otherwise Gauss-Markov
- `m::Int`:           scalar, bandwidth in Newey-West; 0 means White's method

# Output
- `b::Vector`:             k-vector, regression coefficients
- `fnOutput::NamedTuple`:  with
  - res                Tx1 or Txn matrix, residuals y - yhat
  - yhat               Tx1 or Txn matrix, fitted values
  - Covb               matrix, covariance matrix of vec(b) = [beq1;beq2;...]
  - R2                 1xn, R2
  - R2_stage1          k-vector, R2 of each x[:,i] in first stage regression on z
  - δ_stage1           Lxk matrix, coeffs from 1st stage x = z'δ
  - Stdδ_stage1        Lxk matrix, std of δ

# Requires
- Statistics, LinearAlgebra
- CovNWFn


"""
function TwoSLSFn(y,x,z,NWQ=false,m=0)

    (Ty,n) = (size(y,1),size(y,2))
    (k,L)  = (size(x,2),size(z,2))

    δ         = z\x             #stage 1 estimates, Lxk, one column per regression
    xhat      = z*δ             #TxL * Lxk -> Txk
    resx      = x - xhat        #Txk
    R2_stage1 = [cor(x[:,i],xhat[:,i])^2  for i=1:k]

    Szz_1 = inv(z'z)             #stage 1 standard errors
    Stdδ  = similar(δ)           #Lxk standard errors of δ
    for i = 1:k                  #loop over columns in x
        if NWQ                   #NW standard errors
            S      = CovNWFn(resx[:,i].*z,m)
            Covδ_i = Szz_1*S*Szz_1
        else                     #standard errors assuming iid
            Covδ_i = Szz_1*var(resx[:,i])
        end
        Stdδ[:,i] = sqrt.(diag(Covδ_i))
    end

    b    = xhat\y            #stage 2 estimates
    yhat = x*b               #notice: from y=x'b+u, not 2nd stage regression
    res  = y - yhat

    R2   = cor(y,yhat)^2
    Sxz  = x'z              #stage 2 standard errors 
    if NWQ     #Cov(b) using Newey-West 
        S    = CovNWFn(res.*z,m)
        B    = inv(Sxz*Szz_1*Sxz')*Sxz*Szz_1
        Covb = B*S*B'
    else       #Cov(b) assuming iid residuals, independent of z
        Covb = var(res)*inv(Sxz*Szz_1*Sxz')
    end

    fnOutput = (;res,yhat,Covb,R2,R2_stage1,δ_stage1=δ,Stdδ_stage1=Stdδ)

    return b, fnOutput

end

TwoSLSFn
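
A small simulation may help to show why the two stages matter. The sketch below does not call `TwoSLSFn`. Instead, it redoes the two stages with `\` on made-up data, where all numbers, the seed and the variable names are illustrative assumptions. The regressor `x1` is correlated with the residual `u`, so OLS should overstate the slope, while IV should be close to the true value of 2.

```julia
using LinearAlgebra, Random, Statistics

Random.seed!(123)                 #illustrative seed
T  = 10_000
z1 = randn(T)                     #instrument: affects x1, unrelated to u
v  = randn(T)                     #common shock that makes x1 endogenous
x1 = 0.5*z1 + v + randn(T)
u  = v + randn(T)                 #residual, correlated with x1 via v
y  = 1 .+ 2*x1 + u                #true slope is 2

c = ones(T)
x = [c x1]                        #regressors
z = [c z1]                        #instruments, the constant instruments itself

b_ols = x\y                       #inconsistent since Cov(x1,u) != 0
xhat  = z*(z\x)                   #stage 1 fitted values
b_iv  = xhat\y                    #stage 2
b_iv2 = (z'x)\(z'y)               #exactly identified case: simple IV formula

println("OLS slope: ", round(b_ols[2],digits=3))
println("IV slope:  ", round(b_iv[2],digits=3))
```

With as many instruments as regressors (L = k), the 2SLS estimate coincides with the simple IV formula, which is why `b_iv` and `b_iv2` agree up to rounding.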

# Loading the Data

The next cells replicate an old example from Hill et al (2008). See the lecture notes for more details.

### A remark on the code
The data set contains many different variables. To import them with their correct names, we create a named tuple of them by using the function `PutDataInNT()` which was included above. (This is convenient, but not important for the focus of this notebook. An alternative is to use the `DataFrames.jl` package.)
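
The source of `PutDataInNT()` is in `jlFiles/UtilityFunctions.jl` and is not shown in this notebook. As a rough idea of how a matrix and its header could be turned into a named tuple, consider the sketch below. `MatrixToNT` is a hypothetical helper written for this illustration, not the actual `PutDataInNT()`, and the data is made up.

```julia
#hypothetical helper, not the PutDataInNT() from UtilityFunctions.jl
function MatrixToNT(x,header)
    vnames = Tuple(Symbol.(vec(header)))          #column names as symbols
    cols   = Tuple(x[:,i] for i in 1:size(x,2))   #one vector per column
    return NamedTuple{vnames}(cols)
end

x      = [1.0 2.0; 3.0 4.0]        #made-up data: 2 observations, 2 variables
header = ["a" "b"]                 #1x2 matrix, like the header from readdlm
X      = MatrixToNT(x,header)

println(keys(X))                   #the variable names
println(X.b)                       #second column of x
```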

In [3]:
(x,header) = readdlm("Data/mrozEd.txt",header=true)
X          = PutDataInNT(x,header)                         #NamedTuple with X.wage, X.exper, etc

c = ones(size(x,1))                                       #constant, used in the regressions

println("The variables in X (use as, for instance, X.wage): ")
printmat(keys(X))

The variables in X (use as, for instance, X.wage): 
(:taxableinc, :federaltax, :hsiblings, :hfathereduc, :hmothereduc, :siblings, :lfp, :hours, :kidsl6, :kids618, :age, :educ, :wage, :wage76, :hhours, :hage, :heduc, :hwage, :faminc, :mtr, :mothereduc, :fathereduc, :unemployment, :bigcity, :exper)



## OLS

We regress the log wage on a constant, education, experience and experience^2. Only data points where wage > 0 are used.

In [4]:
vv     = X.wage .> 0     #find data points where X.wage > 0
                         #OLS on wage>0
(b_OLS,_,_,Covb,) = OlsGMFn(log.(X.wage[vv]),[c X.educ X.exper X.exper.^2][vv,:])
Stdb_ols = sqrt.(diag(Covb))

colNames = ["coef","std"]
rowNames = ["c","educ","exper","exper^2"]
printblue("OLS estimates:\n")
printmat(b_OLS,Stdb_ols;colNames,rowNames)

[34m[1mOLS estimates:[22m[39m

             coef       std
c          -0.522     0.198
educ        0.107     0.014
exper       0.042     0.013
exper^2    -0.001     0.000
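
`OlsGMFn` comes from the included `jlFiles/Ols.jl` and is not shown here. As a rough sketch of what an OLS estimate with Gauss-Markov (iid) standard errors involves, assuming the same variance convention as in `TwoSLSFn` above: `OlsSketch` is a hypothetical name, this is not the actual `OlsGMFn` code, and the data is made up.

```julia
using LinearAlgebra, Statistics

#hypothetical sketch, not the OlsGMFn() from Ols.jl
function OlsSketch(y,x)
    b    = x\y                     #OLS coefficients
    res  = y - x*b                 #residuals
    Covb = inv(x'x)*var(res)       #Cov(b) assuming iid residuals
    return b, res, Covb
end

x1 = [1.0,2.0,3.0,4.0]             #made-up regressor
y  = [2.1,3.9,6.2,7.8]             #made-up dependent variable
x  = [ones(4) x1]                  #add a constant

(b,res,Covb) = OlsSketch(y,x)
Stdb = sqrt.(diag(Covb))           #standard errors of b
println("coefs: ", round.(b,digits=3))
println("std:   ", round.(Stdb,digits=3))
```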



## IV (2SLS)

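We use the mother's education as an instrument for the person's education. The constant, experience and experience^2 are included among the instruments, so they instrument themselves.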

In [5]:
(b_iv,fO2) = TwoSLSFn(log.(X.wage[vv]),[c X.educ X.exper X.exper.^2][vv,:],
                      [c X.exper X.exper.^2 X.mothereduc][vv,:])

zNames = ["c","exper","exper^2","mothereduc"]

printblue("first-stage estimates: coeffs (each regression in its own column)")
printmat(fO2.δ_stage1;colNames=rowNames,rowNames=zNames)

printblue("first-stage estimates: std errors")
printmat(fO2.Stdδ_stage1;colNames=rowNames,rowNames=zNames)

printblue("first-stage estimates: R²")
printmat(fO2.R2_stage1';colNames=rowNames)

[34m[1mfirst-stage estimates: coeffs (each regression in its own column)[22m[39m
                   c      educ     exper   exper^2
c              1.000     9.775    -0.000     0.000
exper          0.000     0.049     1.000    -0.000
exper^2       -0.000    -0.001     0.000     1.000
mothereduc     0.000     0.268    -0.000    -0.000

[34m[1mfirst-stage estimates: std errors[22m[39m
                   c      educ     exper   exper^2
c              0.000     0.422     0.000     0.000
exper          0.000     0.042     0.000     0.000
exper^2        0.000     0.001     0.000     0.000
mothereduc     0.000     0.031     0.000     0.000

[34m[1mfirst-stage estimates: R²[22m[39m
         c      educ     exper   exper^2
       NaN     0.153     1.000     1.000



In [6]:
Stdb_iv = sqrt.(diag(fO2.Covb))
printblue("IV estimates")
printmat(b_iv,Stdb_iv;colNames,rowNames)

printred("The results should be very close to Hill et al, 10.3.3,
but with small differences due to how df adjustments are made to variances")

[34m[1mIV estimates[22m[39m
             coef       std
c           0.198     0.471
educ        0.049     0.037
exper       0.045     0.014
exper^2    -0.001     0.000

[31m[1mThe results should be very close to Hill et al, 10.3.3,[22m[39m
[31m[1mbut with small differences due to how df adjustments are made to variances[22m[39m
