# Panel Regressions

This notebook illustrates several panel data models and estimation methods (pooled OLS, fixed effects, the "between" estimator, first differences, etc.).

## Load Packages and Extra Functions

In [1]:
using Printf, DelimitedFiles, Statistics, LinearAlgebra

include("jlFiles/printmat.jl")
include("jlFiles/CovNWFn.jl")
include("jlFiles/Ols.jl")
include("jlFiles/lagFn.jl")
include("jlFiles/excise.jl");

## Loading Data

### A Remark on the Code

The data set contains many variables (one per column). It is convenient to store them as a named tuple where the names are taken directly from the header in the file. This allows us to later use, for instance, `X.lwage` to refer to the data on log wages. Since this is fully automatic, it also reduces the risk of referring to the wrong data.

In [2]:
"""
    PutDataInNT(x,header)

Creates a NamedTuple with, for instance, `X.a`, `X.b` and `X.c` where `x` is a matrix and `header = ["a" "b" "c"]`.

"""
function PutDataInNT(x,header)
    namesB = tuple(Symbol.(header)...)                            #a tuple (:a,:b,:c)
    X      = NamedTuple{namesB}([x[:,i] for i=1:size(x,2)])       #NamedTuple with X.a, X.b and X.c
    return X
end

PutDataInNT
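
As a small illustration of the same construction, the sketch below builds a named tuple from a made-up 3x2 matrix and a made-up header (both are assumptions for this example, not part of the data set). It uses explicit splatting into `tuple()`, which should behave like the function above.

```julia
x      = [1.0 10.0; 2.0 20.0; 3.0 30.0]     #toy data: 3 observations of 2 variables
header = ["a" "b"]                          #1x2 matrix of names, shaped like a header from readdlm

namesB = tuple(Symbol.(header)...)                                  #(:a,:b)
X      = NamedTuple{namesB}(tuple([x[:,i] for i=1:size(x,2)]...))   #X.a and X.b are the columns of x

println(keys(X))
println(X.b)                                #second column of x
```

Each field (`X.a`, `X.b`) is a copy of the corresponding column, so later changes to `x` do not affect `X`.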

In [3]:
(x,header) = readdlm("Data/nls_panelEd.txt",header=true)    #classical data set from Hill et al (2008)

X = PutDataInNT(x,header)                         #NamedTuple with X.id, X.lwage, etc
println(keys(X))

NT = size(x,1)
c  = ones(NT)

T = 5                 #number of time periods
N = round(Int,NT/T)   #number of individuals

id = X.id

println("\nT=$T and N=$N")

(:id, :year, :lwage, :hours, :age, :educ, :collgrad, :msp, :nev_mar, :not_smsa, :c_city, :south, :black, :union, :exper, :exper2, :tenure, :tenure2)

T=5 and N=716


## Creating Variables for the Regressions

The next cell creates a matrix $yx$ which has the dependent variable as the first column and the regressors as the remaining columns.

We then print the first few observations of (some of) the data. Notice the structure: the first 5 observations are for individual (`id`) 1 (periods 1-5), the next 5 are for individual 2, and so on.

The subsequent cell makes a "within transformation" by creating

$
yx^*_{it} = yx_{it} - \bar{yx}_{i},
$

where $\bar{yx}_{i}$ is a row vector with the time averages of each column of $yx$ for individual $i$.

In [4]:
xNames = ["exper/100","exper^2/100","tenure/100","tenure^2/100","south","union"]
yx     = [X.lwage c X.exper/100 X.exper2/100 X.tenure/100 X.tenure2/100 X.south X.union]
K      = size(yx,2) - 1

printblue("The first few lines of (some of) the data:\n")
printmat(Any[id[1:11] yx[1:11,1:3]],colNames=["id","lnwage","c","exper/100"],rowNames=string.(1:11),cell00="obs")

The first few lines of (some of) the data:

obs        id    lnwage         c exper/100
1       1.000     1.808     1.000     0.077
2       1.000     1.863     1.000     0.086
3       1.000     1.789     1.000     0.102
4       1.000     1.847     1.000     0.122
5       1.000     1.856     1.000     0.136
6       2.000     1.281     1.000     0.076
7       2.000     1.516     1.000     0.084
8       2.000     1.930     1.000     0.104
9       2.000     1.919     1.000     0.120
10      2.000     2.201     1.000     0.132
11      3.000     1.815     1.000     0.114



In [5]:
id_uniq = unique(id)               #which id values are in data set
N       = length(id_uniq)          #number of cross-sectional units

yxStar = fill(NaN,size(yx))          #individual de-meaning
yxbar  = fill(NaN,N,1+K) 
for i = 1:N                          #loop over individuals
    #local vv_i                      #local/global is needed in script
    vv_i          = id .== id_uniq[i]                #locate rows in yx which refer to individual i
    yxbar[i,:]    = mean(yx[vv_i,:],dims=1)          #averages for individual i
    yxStar[vv_i,:] = yx[vv_i,:] .- yxbar[i:i,:]      #i:i to keep it a row vector
end
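
A quick way to check a within transformation is to verify that the demeaned data sum to zero for each individual. The sketch below repeats the loop above on a made-up panel with 2 individuals and 3 periods (the numbers are assumptions chosen for easy arithmetic).

```julia
using Statistics

id = [1,1,1,2,2,2]                                   #toy panel: N=2, T=3
yx = [1.0 2.0; 2.0 4.0; 3.0 9.0; 10.0 1.0; 20.0 1.0; 30.0 4.0]

id_uniq = unique(id)
N       = length(id_uniq)
yxbar   = fill(NaN,N,size(yx,2))
yxStar  = fill(NaN,size(yx))
for i = 1:N
    vv_i           = id .== id_uniq[i]               #rows for individual i
    yxbar[i,:]     = mean(yx[vv_i,:],dims=1)         #time averages for individual i
    yxStar[vv_i,:] = yx[vv_i,:] .- yxbar[i:i,:]      #deviations from those averages
end

println("individual means:\n",yxbar)
println("sum of demeaned data for individual 1: ",sum(yxStar[id.==1,:],dims=1))
```

Since the constant is the same in every period, its demeaned value is zero, which is why the constant is dropped in the FE regression.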

## Pooled OLS, FE, and Between Estimations

'Pooled OLS' is just OLS on the original data `yx`, 'fixed effects' (FE) is OLS on the individually demeaned data `yxStar`, and the 'between estimator' is a cross-sectional OLS on the individual means `yxbar`.
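
To see why demeaning handles the fixed effects, it may help to compare it with a regression that includes one dummy variable per individual (often called LSDV). The slope coefficients should coincide. The sketch below uses a made-up panel with one regressor (all numbers are assumptions) and plain least squares via `\`, not the notebook's `OlsNWFn`.

```julia
using Statistics, LinearAlgebra

#toy balanced panel (made-up numbers): N=3 individuals, T=3 periods
id = [1,1,1,2,2,2,3,3,3]
x  = [1.0,2.0,4.0,2.0,3.0,7.0,0.0,5.0,6.0]
y  = [1.5,2.0,3.9,5.0,5.8,8.1,-1.0,1.2,2.1]

#(a) within estimator: demean y and x by individual, then OLS without a constant
demean(z) = z .- [mean(z[id .== i]) for i in id]
xStar = demean(x)
yStar = demean(y)
b_FE  = dot(xStar,yStar)/dot(xStar,xStar)

#(b) LSDV: regress y on x and one dummy per individual
D      = Float64.(id .== unique(id)')      #NT x N matrix of individual dummies
b_LSDV = [x D]\y                           #first element is the slope on x

println("slope from demeaned data: ",b_FE)
println("slope from LSDV:          ",b_LSDV[1])
```

With many individuals, the demeaning approach avoids building and inverting a matrix with $N$ extra columns.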

### A Remark on the Code

- `OlsNWFn(y,x,m=0)` is a function for doing OLS. Setting `m>=1` gives Newey-West standard errors. With `m=0` (the default) we instead get White's standard errors.
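
For readers without the included files, the following is a minimal sketch of OLS with White's heteroskedasticity-consistent covariance matrix. The function `OlsWhiteSketch` is a hypothetical stand-in written for this illustration; it is not the notebook's `OlsNWFn`, and details such as degrees-of-freedom corrections may differ. The data are made up.

```julia
using LinearAlgebra

function OlsWhiteSketch(y,x)            #hypothetical helper, not the notebook's OlsNWFn
    b      = x\y                        #OLS point estimates
    res    = y - x*b                    #residuals
    SxxInv = inv(x'*x)
    S      = x'*(x.*res.^2)             #sum of x_t*x_t'*res_t^2
    Covb   = SxxInv*S*SxxInv            #White's covariance matrix of b
    return b, res, Covb
end

x = [ones(5) [1.0,2.0,3.0,4.0,5.0]]     #toy regressors: constant and a trend
y = [1.1,1.9,3.2,3.8,5.1]               #toy dependent variable

(b,res,Covb) = OlsWhiteSketch(y,x)
tstat        = b./sqrt.(diag(Covb))

println("coefficients: ",b)
println("t-stats:      ",tstat)
```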

In [6]:
(b,res,yhat,Covb,R2,) = OlsNWFn(yx[:,1],yx[:,2:end])            #pooled OLS
xutLS = hcat(b,b./sqrt.(diag(Covb)))
xutLS = xutLS[2:end,:]                                          #drop the intercept from printing


(b,res,yhat,Covb,R2,) = OlsNWFn(yxStar[:,1],yxStar[:,3:end])    #fixed effect, skip the constant
xutFE = hcat(b,b./sqrt.(diag(Covb))*sqrt(NT-N-2)/sqrt(NT-2))    #t-stats, scaled down for the N means used in the demeaning
s2_e  = sum(res.^2)/(NT-N-(K-1))              #needed for GLS (see below)


(b,res,yhat,Covb,R2,) = OlsNWFn(yxbar[:,1],yxbar[:,2:end])      #between estimator
xutB = hcat(b,b./sqrt.(diag(Covb)))
xutB = xutB[2:end,:]                                            #skip the intercept
s2_u = max(0,sum(res.^2)/(N-K) - s2_e/T)      #needed for GLS


printblue("Point estimates:")
printmat([xutLS[:,1] xutFE[:,1] xutB[:,1]],colNames=["Pooled","FE","Between"],rowNames=xNames)

printblue("t-stats:")
printmat([xutLS[:,2] xutFE[:,2] xutB[:,2]],colNames=["Pooled","FE","Between"],rowNames=xNames)

Point estimates:
                Pooled        FE   Between
exper/100        7.837     4.108    10.641
exper^2/100     -0.201    -0.041    -0.317
tenure/100       1.206     1.391     1.247
tenure^2/100    -0.024    -0.090    -0.016
south           -0.196    -0.016    -0.201
union            0.110     0.064     0.121

t-stats:
                Pooled        FE   Between
exper/100        8.954     5.917     4.573
exper^2/100     -5.264    -1.466    -3.054
tenure/100       2.346     3.975     0.883
tenure^2/100    -0.828    -4.136    -0.198
south          -13.247    -0.367    -6.519
union            6.928     4.181     3.102



## First-Difference Model

To estimate the first-difference model, we first need to calculate the differences between consecutive time periods for the *same individual*.

### A Remark on the Code

- In the cell below, we call the function `lagFn`, which lags the data once (by default). For the first time period, the result is NaN (as there are no earlier values).
- After the loop, we locate and delete all rows that contain NaNs. This leaves only $T-1$ data points for each individual.
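
The two steps above can be sketched on a made-up panel as follows. The helpers `lagSketch` and `dropNaNRows` are hypothetical stand-ins for `lagFn` and `excise`, written only for this illustration; the actual functions in `jlFiles` may work differently.

```julia
#hypothetical stand-ins for the notebook's lagFn and excise, for illustration only
lagSketch(x)   = vcat(fill(NaN,1,size(x,2)),x[1:end-1,:])    #lag once, NaN in the first row
dropNaNRows(x) = x[.!vec(any(isnan,x,dims=2)),:]             #keep rows without any NaN

id = [1,1,1,2,2,2]                                           #toy panel: N=2, T=3
yx = [1.0 5.0; 3.0 6.0; 6.0 8.0; 10.0 1.0; 11.0 4.0; 15.0 5.0]

yxΔ = fill(NaN,size(yx))
for i in unique(id)                                          #difference within each individual
    vv_i        = id .== i
    yxΔ[vv_i,:] = yx[vv_i,:] - lagSketch(yx[vv_i,:])         #yx[t] - yx[t-1]
end
yxΔ = dropNaNRows(yxΔ)                                       #drops the first period of each individual

println("size of yxΔ: ",size(yxΔ))                           #N*(T-1) rows
```

Differencing within each individual (rather than on the stacked data) ensures that the first observation of individual 2 is never differenced against the last observation of individual 1.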

In [7]:
yxStarΔ = fill(NaN,size(yx))
for i = 1:N                          #individual first-differencing, loop over individuals
    #local vv_i                      #only in script
    vv_i            = id .== id_uniq[i]   #rows in yx which refer to individual i
    yxStarΔ[vv_i,:] = yx[vv_i,:] - lagFn(yx[vv_i,:])   #yx[t] -yx[t-1]
end

yxStarΔ = excise(yxStarΔ)          #cut out rows with NaNs
yxStarΔ[:,2] .= 1                  #put a (non-zero) constant back

println("size of yxStarΔ: ",size(yxStarΔ))

size of yxStarΔ: (2864, 8)


In [8]:
(b,res,yhat,Covb,R2,) = OlsNWFn(yxStarΔ[:,1],yxStarΔ[:,2:end])
xutΔ = hcat(b,b./sqrt.(diag(Covb)))
xutΔ = xutΔ[2:end,:]

printblue("1-st difference estimation:")
printmat(xutΔ,colNames=["Coef","t-stat"],rowNames=xNames)

1-st difference estimation:
                  Coef    t-stat
exper/100        3.548     2.277
exper^2/100     -0.045    -0.933
tenure/100       1.293     2.527
tenure^2/100    -0.083    -2.329
south           -0.024    -0.395
union            0.044     3.115



## GLS of Random Effects Model (extra)

GLS is similar to the FE estimator discussed above, except that it is based on the 'quasi-demeaning' $y_{it} - \vartheta\bar{y}_{i}$, and similarly for $x_{it}$.
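
The weight $\vartheta$ computed in the next cell can be written as

$
\vartheta = 1 - \frac{\sigma_e}{\sqrt{T\sigma_u^2 + \sigma_e^2}},
$

where $\sigma_e^2$ is the variance of the idiosyncratic residual and $\sigma_u^2$ is the variance of the individual effect. In the code, these are estimated from the FE and between regressions above as

$
\hat{\sigma}_e^2 = \frac{\sum_{i,t}\hat{e}_{it}^2}{NT-N-(K-1)}
\quad \text{and} \quad
\hat{\sigma}_u^2 = \max\left(0,\frac{\sum_{i}\hat{u}_{i}^2}{N-K} - \frac{\hat{\sigma}_e^2}{T}\right),
$

where $\hat{e}_{it}$ are the FE residuals and $\hat{u}_{i}$ the residuals from the between regression (this notation is introduced here for illustration). Two limiting cases help with the interpretation: if $\sigma_u^2 = 0$, then $\vartheta = 0$ and GLS coincides with pooled OLS; as $T\sigma_u^2$ becomes large relative to $\sigma_e^2$, $\vartheta$ approaches 1 and GLS approaches the FE estimator.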

In [9]:
ϑ = 1 - sqrt(s2_e)/sqrt(T*s2_u+s2_e)                       #GLS
printlnPs("ϑ in GLS: ",ϑ,"\n")

yxStar_ϑ = fill(NaN,size(yx))
for i = 1:N
    #local vv_i              #local/global is needed in script
    vv_i             = id .== id_uniq[i]
    yxStar_ϑ[vv_i,:] = yx[vv_i,:] .- ϑ*yxbar[i:i,:]
end

(b,res,yhat,Covb,R2,) = OlsNWFn(yxStar_ϑ[:,1],yxStar_ϑ[:,2:end])
xutGLS = hcat(b,b./sqrt.(diag(Covb)))
xutGLS = xutGLS[2:end,:]

printblue("GLS:")
printmat(xutGLS,colNames=["Coef","t-stat"],rowNames=xNames)

ϑ in GLS:      0.774

GLS:
                  Coef    t-stat
exper/100        4.570     7.111
exper^2/100     -0.063    -2.387
tenure/100       1.380     4.032
tenure^2/100    -0.074    -3.575
south           -0.132    -5.255
union            0.075     5.611

