# Panel Regressions

This notebook illustrates several panel data models and estimation methods (pooled OLS, fixed effects, the "between" estimator, first differences, etc.).

## Load Packages and Extra Functions

In [1]:
using Printf, DelimitedFiles, Statistics, LinearAlgebra

include("jlFiles/printmat.jl")
include("jlFiles/CovNWFn.jl")
include("jlFiles/Ols.jl")
include("jlFiles/lagFn.jl")
include("jlFiles/excise.jl")

FindNNPs

## Loading Data

In [2]:
x = readdlm("Data/nls_panelEd.txt",skipstart=1)    #classical data set from Hill et al (2008)

NT = size(x,1)
c  = ones(NT)

T = 5                 #number of time periods
N = round(Int,NT/T)   #number of individuals

(id,year,lnwage)   = (x[:,1],x[:,2],x[:,3])
(exper,exper2)     = (x[:,15],x[:,16])
(tenure,tenure2)   = (x[:,17],x[:,18])
(south,tradeunion) = (x[:,12],x[:,14])

id = round.(Int,id)    #id should be an integer (1,2,3,...)

println("T=$T and N=$N")

T=5 and N=716


## Creating Variables for the Regressions

The next cell creates a matrix $yx$ which has the dependent variable as the first column and the regressors as the remaining columns.

The first observations of the data set are tabulated. Notice the structure: the first 5 observations are for individual (`id`) 1 (periods 1-5), the next 5 for individual 2, and so on.

The subsequent cell makes a "within transformation" by creating 

$
yx^*_{it} = yx_{it} - \bar{yx}_{i}, 
$

where $\bar{yx}_{i}$ is a row vector with the averages (over time) of each column of $yx$ for individual $i$.
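
To see why this transformation is useful, consider a sketch under the assumption (not spelled out in this notebook) that the model is $y_{it} = \alpha_i + x_{it}'\beta + \varepsilon_{it}$, where $\alpha_i$ is an individual-specific intercept. Averaging over time for individual $i$ and subtracting gives

$
\bar{y}_{i} = \alpha_i + \bar{x}_{i}'\beta + \bar{\varepsilon}_{i}
\quad\Rightarrow\quad
y_{it} - \bar{y}_{i} = (x_{it} - \bar{x}_{i})'\beta + (\varepsilon_{it} - \bar{\varepsilon}_{i}),
$

so $\alpha_i$ (and any regressor that is constant over time, such as the intercept) drops out of the transformed regression.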

In [3]:
xNames = ["exper/100","exper^2/100","tenure/100","tenure^2/100","south","union"]
yx = [lnwage c exper/100 exper2/100 tenure/100 tenure2/100 south tradeunion]
K  = size(yx,2) - 1

printblue("The first few lines of the data set:\n")
rowNames = [string("obs",i) for i=1:11]
printmat(Any[id[1:11] yx[1:11,1:3]],colNames=["id","lnwage","c","exper/100"],rowNames=rowNames)

[34m[1mThe first few lines of the data set:[22m[39m

             id    lnwage         c exper/100
obs1           1     1.808     1.000     0.077
obs2           1     1.863     1.000     0.086
obs3           1     1.789     1.000     0.102
obs4           1     1.847     1.000     0.122
obs5           1     1.856     1.000     0.136
obs6           2     1.281     1.000     0.076
obs7           2     1.516     1.000     0.084
obs8           2     1.930     1.000     0.104
obs9           2     1.919     1.000     0.120
obs10          2     2.201     1.000     0.132
obs11          3     1.815     1.000     0.114



In [4]:
id_uniq = unique(id)               #which id values are in data set
N       = length(id_uniq)

yxStar = fill(NaN,size(yx))          #individual de-meaning
yxbar  = fill(NaN,(N,1+K)) 
for i = 1:N                          #loop over individuals
    #local vv_i                      #local/global is needed in script
    vv_i          = id .== id_uniq[i]                #locate rows in yx which refer to individual i
    yxbar[i,:]    = mean(yx[vv_i,:],dims=1)          #averages for individual i
    yxStar[vv_i,:] = yx[vv_i,:] .- yxbar[i:i,:]      #i:i to keep it a row vector
end

## Pooled OLS, FE, and Between Estimations

'Pooled OLS' is just OLS on the original data `yx`, 'fixed effects' (FE) is OLS on the individually demeaned data `yxStar`, and the 'between estimator' is a cross-sectional OLS on the individual means `yxbar`. The FE regression drops the constant, since it is zero after the demeaning.
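
The difference between the three estimators can be illustrated on simulated data. The following is only a minimal sketch: the data generating process, the sample sizes and all names (`α`, `groupmean`, `b_pooled`, etc.) are assumptions made for the illustration, and it uses plain least squares instead of the `OlsNWFn()` function from this notebook. The regressor is constructed to be correlated with the individual effect, which is the case where pooled OLS and the between estimator are biased while FE is not.

```julia
using Random, Statistics, LinearAlgebra

Random.seed!(123)
(N,T) = (200,5)                          #hypothetical panel dimensions
id    = repeat(1:N,inner=T)              #T rows per individual, stacked as in the data above
α     = randn(N)                         #individual effects
x     = α[id] + randn(N*T)               #regressor correlated with the individual effect
y     = α[id] + 0.5*x + 0.1*randn(N*T)   #true slope is 0.5

groupmean(z) = [mean(z[id .== i]) for i = 1:N]    #N individual averages
(xbar,ybar)  = (groupmean(x),groupmean(y))
(xs,ys)      = (x - xbar[id],y - ybar[id])        #within transformation

b_pooled  = ([ones(N*T) x]\y)[2]         #OLS on the original data
b_fe      = dot(xs,ys)/dot(xs,xs)        #OLS on demeaned data, no constant
b_between = ([ones(N) xbar]\ybar)[2]     #cross-sectional OLS on the individual means

println("pooled: ",round(b_pooled,digits=3),
        "  FE: ",round(b_fe,digits=3),
        "  between: ",round(b_between,digits=3))
```

In this setup, only the FE estimate should be close to the true slope, since the demeaning removes the individual effect that the other two estimators partly attribute to the regressor.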

In [5]:
(b,res,yhat,Covb,R2,) = OlsNWFn(yx[:,1],yx[:,2:end])            #pooled OLS
xutLS = hcat(b,b./sqrt.(diag(Covb)))
xutLS = xutLS[2:end,:]


(b,res,yhat,Covb,R2,) = OlsNWFn(yxStar[:,1],yxStar[:,3:end])    #fixed effect
xutFE = hcat(b,b./sqrt.(diag(Covb))*sqrt(NT-N-2)/sqrt(NT-2))
s2_e  = sum(res.^2)/(NT-N-(K-1))              #needed for GLS (see below)


(b,res,yhat,Covb,R2,) = OlsNWFn(yxbar[:,1],yxbar[:,2:end])      #between estimator
xutB = hcat(b,b./sqrt.(diag(Covb)))
xutB = xutB[2:end,:]
s2_u = max(0,sum(res.^2)/(N-K) - s2_e/T)      #needed for GLS


printblue("Point estimates:")
printmat([xutLS[:,1] xutFE[:,1] xutB[:,1]],colNames=["Pooled","FE","Between"],rowNames=xNames)

printblue("t-stats:")
printmat([xutLS[:,2] xutFE[:,2] xutB[:,2]],colNames=["Pooled","FE","Between"],rowNames=xNames)

[34m[1mPoint estimates:[22m[39m
                Pooled        FE   Between
exper/100        7.837     4.108    10.641
exper^2/100     -0.201    -0.041    -0.317
tenure/100       1.206     1.391     1.247
tenure^2/100    -0.024    -0.090    -0.016
south           -0.196    -0.016    -0.201
union            0.110     0.064     0.121

[34m[1mt-stats:[22m[39m
                Pooled        FE   Between
exper/100        8.954     5.917     4.573
exper^2/100     -5.264    -1.466    -3.054
tenure/100       2.346     3.975     0.883
tenure^2/100    -0.828    -4.136    -0.198
south          -13.247    -0.367    -6.519
union            6.928     4.181     3.102



## First-Difference Model

To estimate the first-difference model, we first need to calculate the differences between consecutive time periods for the *same individual*.

In the cell below, we call the function `lagFn`, which lags the data once (by default). For the first time period, the result is NaN (as there are no earlier values). After the loop, we locate and delete all rows that include NaNs. This means that we have only $T-1$ data points for each individual.
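
The mechanics can be sketched for a single individual. The helpers `lagOnce` and `dropNaN` below are hypothetical stand-ins written for this illustration; the notebook's own `lagFn()` and `excise()` (whose source is not shown here) may differ in detail.

```julia
#hypothetical stand-ins for lagFn() and excise()
lagOnce(z) = [fill(NaN,1,size(z,2)); z[1:end-1,:]]     #shift rows down by one, NaN in the first row
dropNaN(z) = z[.!vec(any(isnan,z,dims=2)),:]           #keep only rows without any NaN

z  = [1.0 10.0;                  #one individual, T = 3 periods, two variables
      2.0 12.0;
      4.0 15.0]
Δz = dropNaN(z - lagOnce(z))     #z[t] - z[t-1], the first period is lost

display(Δz)
```

With $T=3$ periods, the individual contributes $T-1=2$ rows of differences.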

In [6]:
yxStarΔ = fill(NaN,size(yx))
for i = 1:N                          #individual first-differencing, loop over individuals
    #local vv_i                      #only in script
    vv_i            = id .== id_uniq[i]   #rows in yx which refer to individual i
    yxStarΔ[vv_i,:] = yx[vv_i,:] - lagFn(yx[vv_i,:])   #yx[t] -yx[t-1]
end

yxStarΔ = excise(yxStarΔ)          #cut out rows with NaNs
yxStarΔ[:,2] .= 1                    #constant  

println("size of yxStarΔ: ",size(yxStarΔ))

size of yxStarΔ: (2864, 8)


In [7]:
(b,res,yhat,Covb,R2,) = OlsNWFn(yxStarΔ[:,1],yxStarΔ[:,2:end])
xutΔ = hcat(b,b./sqrt.(diag(Covb)))
xutΔ = xutΔ[2:end,:]

printblue("1-st difference estimation:")
printmat(xutΔ,colNames=["Coef","t-stat"],rowNames=xNames)

1-st difference estimation:
                  Coef    t-stat
exper/100        3.548     2.277
exper^2/100     -0.045    -0.933
tenure/100       1.293     2.527
tenure^2/100    -0.083    -2.329
south           -0.024    -0.395
union            0.044     3.115



## GLS of Random Effects Model (extra)

GLS is similar to the FE estimator discussed above, except that it is based on a 'quasi-demeaning' $y_{it} - \vartheta\bar{y}_{i}$, and similarly for $x_{it}$.
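
The code below computes the weight as

$
\vartheta = 1 - \frac{\sigma_e}{\sqrt{T\sigma_u^2 + \sigma_e^2}},
$

where $\sigma_e^2$ (`s2_e`) is the residual variance from the FE regression and $\sigma_u^2$ (`s2_u`) is the variance of the individual effect, backed out from the between regression. The two limiting cases are informative: $\vartheta=0$ (when $\sigma_u^2=0$) gives pooled OLS, while $\vartheta \rightarrow 1$ (when $T\sigma_u^2$ is large relative to $\sigma_e^2$) approaches the FE estimator.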

In [8]:
ϑ = 1 - sqrt(s2_e)/sqrt(T*s2_u+s2_e)                       #GLS
printlnPs("ϑ in GLS: ",ϑ,"\n")

yxStar_ϑ = fill(NaN,size(yx))
for i = 1:N
    #local vv_i              #local/global is needed in script
    vv_i             = id .== id_uniq[i]
    yxStar_ϑ[vv_i,:] = yx[vv_i,:] .- ϑ*yxbar[i:i,:]
end

(b,res,yhat,Covb,R2,) = OlsNWFn(yxStar_ϑ[:,1],yxStar_ϑ[:,2:end])
xutGLS = hcat(b,b./sqrt.(diag(Covb)))
xutGLS = xutGLS[2:end,:]

printblue("GLS:")
printmat(xutGLS,colNames=["Coef","t-stat"],rowNames=xNames)

ϑ in GLS:      0.774

GLS:
                  Coef    t-stat
exper/100        4.570     7.111
exper^2/100     -0.063    -2.387
tenure/100       1.380     4.032
tenure^2/100    -0.074    -3.575
south           -0.132    -5.255
union            0.075     5.611



# A More Ambitious Function for Panel Regressions (extra)

In [9]:
include("jlFiles/FixedEffects.jl")
include("jlFiles/PanelOls.jl")

replaceNaNinYX!

We first reshuffle the dependent variable into a $T\times N$ matrix `Y` and the regressors into a $T \times K \times N$ array `X`. This helps in handling autocorrelation and cross-sectional clustering.
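
As a small illustration of the reshuffling (on made-up numbers, with hypothetical names `idToy`, `yToy` and `YToy`): when the panel is balanced and the rows are sorted by individual and then by period, filling column $i$ with the observations of individual $i$ gives the same result as a plain `reshape`.

```julia
(Ttoy,Ntoy) = (2,3)                     #toy dimensions, not those of the data set
idToy = repeat(1:Ntoy,inner=Ttoy)       #[1,1,2,2,3,3]
yToy  = collect(1.0:6.0)                #stacked observations

YToy = fill(NaN,Ttoy,Ntoy)
for i = 1:Ntoy
    YToy[:,i] = yToy[idToy .== i]       #column i holds the T observations of individual i
end

display(YToy)
println(YToy == reshape(yToy,Ttoy,Ntoy))
```

The loop is more general than `reshape`, since it does not rely on the rows being sorted by individual.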

In [10]:
Y = fill(NaN,T,N)               #reshuffling the data
X = fill(NaN,T,K,N)

for i = 1:N
    vv_i = id .== id_uniq[i]   #rows in yx which refer to individual i
    Y[:,i] = yx[vv_i,1]
    X[:,:,i] = yx[vv_i,2:end]
end

display(Y)

5×716 Matrix{Float64}:
 1.80829  1.28093  1.81482  2.31254  …  1.53039  1.52823  1.46094  1.60944
 1.86342  1.51585  1.91991  2.34858     1.59881  2.4065   1.49669  1.45944
 1.78937  1.93017  1.95838  2.37349     1.60405  2.55886  1.55984  1.42712
 1.84653  1.91903  2.00707  2.3689      1.26794  2.64418  1.6536   1.49437
 1.85645  2.20097  2.08985  2.35053     1.55823  2.58664  1.61586  1.34142

We next call a function `FixedEffects()` to remove individual (and/or time) fixed effects.

In [11]:
(Ystar,Xstar) = FixedEffects(Y,X,:id)         #:id for individual fixed effects. :t for time fixed effects
Xstar[:,1,:] .= 1                             #put back the intercept
println()




Finally, we call the `PanelOls()` function. The output is a named tuple. Use `keys(fO)` to see the entries.

In [12]:
fO = PanelOls(Ystar,Xstar)

θ       = fO.theta
stderrW = sqrt.(diag(fO.CovW))      #White's std (named to avoid shadowing Base.stderr)
tstat   = θ./stderrW

printblue("results from PanelOls()")
printmat(θ,tstat,colNames=["coef","t-stat"],rowNames=["c";xNames])

printred("Compare with the FE estimates above. The t-stats might differ because of lack of small-sample adjustment.")

[34m[1mresults from PanelOls()[22m[39m
                  coef    t-stat
c               -0.000    -0.000
exper/100        4.108     6.616
exper^2/100     -0.041    -1.640
tenure/100       1.391     4.445
tenure^2/100    -0.090    -4.624
south           -0.016    -0.411
union            0.064     4.675

[31m[1mCompare with the FE estimates above. The t-stats might differ because of lack of small-sample adjustment.[22m[39m


## Clustered Standard Errors

We now redo the estimation but provide information on clustering for the standard errors. For simplicity, the clusters are defined by the value of the `south` dummy in period $t=1$.
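
To illustrate the idea behind clustering, the following sketch computes a textbook cluster-robust "sandwich" covariance matrix for OLS on simulated data. It is not necessarily what `PanelOls()` does internally (its source is not shown here), and the function `clusterS`, the data and the cluster structure are all assumptions made for the illustration. The key step is that the moment conditions $x_i u_i$ are summed within each cluster before forming outer products, which allows for arbitrary correlation inside a cluster.

```julia
using LinearAlgebra, Random

function clusterS(x,u,g)                #Σ over clusters of (Σ x_i*u_i)(Σ x_i*u_i)'
    S = zeros(size(x,2),size(x,2))
    for j in unique(g)
        vv = g .== j                    #observations in cluster j
        h  = x[vv,:]'*u[vv]             #sum of x_i*u_i within cluster j
        S += h*h'
    end
    return S
end

Random.seed!(1)
(G,m) = (40,10)                         #hypothetical: 40 clusters with 10 observations each
n = G*m
g = repeat(1:G,inner=m)                 #cluster labels
x = [ones(n) randn(n)]
y = x*[1.0,2.0] + randn(G)[g] + randn(n)     #common shock within each cluster

b    = x\y
u    = y - x*b
Sxx  = x'*x
CovW = inv(Sxx)*clusterS(x,u,1:n)*inv(Sxx)   #White: each observation is its own cluster
CovC = inv(Sxx)*clusterS(x,u,g)*inv(Sxx)     #clustered

println("std of intercept, White and clustered: ",sqrt(CovW[1,1]),"  ",sqrt(CovC[1,1]))
```

Treating each observation as its own cluster reproduces White's covariance matrix, so the two estimators differ only in how much within-cluster correlation they allow for.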

In [13]:
clust = convert.(Int,X[1,6,:])       #define clusters based on South/North in t=1

fO = PanelOls(Ystar,Xstar,0,clust)   #0 autocorrelations, but clustering

θ      = fO.theta
stderrW = sqrt.(diag(fO.CovW))       #White's std
stderrC = sqrt.(diag(fO.CovC))       #clustered std
tstatW  = θ./stderrW
tstatC  = θ./stderrC

printblue("results from PanelOls()")
printmat(θ,tstatW,tstatC,colNames=["coef","t-stat White","t-stat Clust"],rowNames=["c";xNames],width=15)

[34m[1mresults from PanelOls()[22m[39m
                       coef   t-stat White   t-stat Clust
c                    -0.000         -0.000         -0.000
exper/100             4.108          6.616          5.586
exper^2/100          -0.041         -1.640         -2.027
tenure/100            1.391          4.445          3.829
tenure^2/100         -0.090         -4.624         -4.880
south                -0.016         -0.411         -0.467
union                 0.064          4.675          3.631



## Individual and Time Fixed Effects

We redo the panel regression, but first reconstruct `(Ystar,Xstar)` to remove both individual and time fixed effects.
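
For a balanced panel, the standard way of removing both types of fixed effects is the two-way within transformation $y_{it} - \bar{y}_{i} - \bar{y}_{t} + \bar{y}$. The sketch below applies it to a small made-up $T\times N$ matrix; whether `FixedEffects()` with `:idt` does exactly this is an assumption, since its source is not shown here. The names `Ytoy` and `Ytoy2` are hypothetical.

```julia
using Statistics

Ytoy = [1.0 2.0 6.0;                    #T = 2 periods (rows), N = 3 individuals (columns)
        3.0 5.0 7.0]

Ytoy2 = Ytoy .- mean(Ytoy,dims=1) .- mean(Ytoy,dims=2) .+ mean(Ytoy)   #two-way demeaning

println("averages over time, for each individual: ",mean(Ytoy2,dims=1))
println("averages over individuals, for each period: ",mean(Ytoy2,dims=2))
```

After the transformation, the averages are (numerically) zero both across periods for each individual and across individuals for each period.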

In [14]:
(Ystar,Xstar) = FixedEffects(Y,X,:idt)
Xstar[:,1,:] .= 1

fO = PanelOls(Ystar,Xstar)

θ       = fO.theta
stderrW = sqrt.(diag(fO.CovW))      #White's std (named to avoid shadowing Base.stderr)
tstat   = θ./stderrW

printblue("results from PanelOls()")
printmat(θ,tstat,colNames=["coef","t-stat"],rowNames=["c";xNames])

[34m[1mresults from PanelOls()[22m[39m
                  coef    t-stat
c                0.000     0.000
exper/100        6.713     4.654
exper^2/100     -0.045    -1.762
tenure/100       1.347     4.279
tenure^2/100    -0.090    -4.641
south           -0.014    -0.358
union            0.065     4.801

