# Panel Regressions (extra)

This notebook uses functions (from files included below) to redo some of the panel regressions from the first notebook on panels. The notebook is essentially an example of how to use the functions. (In contrast, the first notebook tries to explain the background to the various estimation approaches.)

The functions can account for autocorrelation and cross-sectional clustering when computing the standard errors.

## Load Packages and Extra Functions

In [1]:
using Printf, DelimitedFiles, Statistics, LinearAlgebra

include("jlFiles/printmat.jl")
include("jlFiles/FixedEffects.jl")
include("jlFiles/PanelOls.jl")
include("jlFiles/UtilityFunctions.jl");

## Loading Data

In [2]:
(data,header) = readdlm("Data/nls_panelEd.txt",header=true)    #classical data set from Hill et al (2008)

X = PutDataInNT(data,header)                      #NamedTuple with X.id, X.lwage, etc
println(keys(X))

id = convert.(Int,X.id)                           #id of individuals: 1,2,...,N
T = length(unique(X.year))                        #number of time periods
N = length(unique(id))                            #number of individuals
NT = N*T
c  = ones(NT)

println("\nT=$T and N=$N")

(:id, :year, :lwage, :hours, :age, :educ, :collgrad, :msp, :nev_mar, :not_smsa, :c_city, :south, :black, :union, :exper, :exper2, :tenure, :tenure2)

T=5 and N=716


## Creating Variables for the Regressions

The next cell creates a matrix $yx$ which has the dependent variable as the first column and the regressors as the remaining columns.

We then print the first few observations of (some of) the data. Notice the structure: the first 5 observations are for individual (`id`) 1 (period 1-5), the next 5 for individual 2.

In [3]:
y = X.lwage
xNames = ["exper/100","exper^2/100","tenure/100","tenure^2/100","south","union"]
x = [c X.exper/100 X.exper2/100 X.tenure/100 X.tenure2/100 X.south X.union]
K = size(x,2)                  #number of regressors

printblue("The first few lines of (some of) the data:\n")
printmat(id[1:11],y[1:11],x[1:11,1:2];colNames=["id","lnwage","c","exper/100"],rowNames=string.(1:11),cell00="obs")

[34m[1mThe first few lines of (some of) the data:[22m[39m

obs        id    lnwage         c exper/100
1           1     1.808     1.000     0.077
2           1     1.863     1.000     0.086
3           1     1.789     1.000     0.102
4           1     1.847     1.000     0.122
5           1     1.856     1.000     0.136
6           2     1.281     1.000     0.076
7           2     1.516     1.000     0.084
8           2     1.930     1.000     0.104
9           2     1.919     1.000     0.120
10          2     2.201     1.000     0.132
11          3     1.815     1.000     0.114



## Reshuffling the Data

The data must be rearranged to fit the convention of the `PanelOls()` function.

We use the `PanelyxReshuffle(y,x,id)` function (included above) to reshuffle the dependent variable into a $T\times N$ matrix `yb` and the regressors into a $T \times K \times N$ array `xb`. This layout allows `PanelOls()` to handle autocorrelation and cross-sectional clustering.

In [4]:
(yb,xb) = PanelyxReshuffle(y,x,id)

println("The yb matrix is now TxN, while xb is TxKxN")
display(yb)

The yb matrix is now TxN, while xb is TxKxN


5×716 Matrix{Float64}:
 1.80829  1.28093  1.81482  2.31254  …  1.53039  1.52823  1.46094  1.60944
 1.86342  1.51585  1.91991  2.34858     1.59881  2.4065   1.49669  1.45944
 1.78937  1.93017  1.95838  2.37349     1.60405  2.55886  1.55984  1.42712
 1.84653  1.91903  2.00707  2.3689      1.26794  2.64418  1.6536   1.49437
 1.85645  2.20097  2.08985  2.35053     1.55823  2.58664  1.61586  1.34142
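As a rough illustration of what the reshuffling does, the following sketch handles the special case of a balanced panel where the rows of each individual are already in time order (as in the printout of the first observations). The name `ReshuffleSketch` is made up for this illustration. It is not the code of `PanelyxReshuffle()`, which may work differently.

```julia
# Hypothetical helper, only for illustration (not from jlFiles/).
# Assumes a balanced panel where the rows of each individual are in time order.
function ReshuffleSketch(y,x,id)
    ids = unique(id)
    N   = length(ids)
    T   = div(length(y),N)             #same number of periods for everyone
    K   = size(x,2)
    Y   = fill(NaN,T,N)
    X   = fill(NaN,T,K,N)
    for (i,idᵢ) in enumerate(ids)
        rows     = findall(==(idᵢ),id) #rows belonging to individual i
        Y[:,i]   = y[rows]             #column i holds the T observations of individual i
        X[:,:,i] = x[rows,:]           #TxK block of regressors for individual i
    end
    return Y, X
end
```

With this layout, `Y[t,i]` and `X[t,:,i]` refer to the same observation, so looping over periods or over individuals only requires changing one index.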

We next call a function `FixedEffects()` to remove individual (and/or time) fixed effects.

In [5]:
(yˣ,xˣ) = FixedEffects(yb,xb,:id)         #:id for individual fixed effects. :t for time fixed effects
xˣ[:,1,:] .= 1                             #put back a non-zero intercept
println()




Finally, we call the `PanelOls()` function. The output is a named tuple. Use `keys(fO)` to see the names of the entries.

In [6]:
fO = PanelOls(yˣ,xˣ)

θ      = fO.theta
StdErr = sqrt.(diag(fO.CovW))
tstat  = θ./StdErr

printblue("results from PanelOls():\n")
printmat(θ,tstat,colNames=["coef","t-stat"],rowNames=["c";xNames])

printred("Compare with the FE estimates in the other notebook.")

[34m[1mresults from PanelOls():[22m[39m

                  coef    t-stat
c               -0.000    -0.000
exper/100        4.108     6.616
exper^2/100     -0.041    -1.640
tenure/100       1.391     4.445
tenure^2/100    -0.090    -4.624
south           -0.016    -0.411
union            0.064     4.675

[31m[1mCompare with the FE estimates in the other notebook.[22m[39m


## Clustered Standard Errors

We now redo the estimation, but provide information on clustering for the standard errors. For simplicity, the clusters are defined by the value of the `south` dummy in period $t=1$.

In [7]:
clust = convert.(Int,xb[1,6,:])       #define clusters based on South/North in t=1

fO = PanelOls(yˣ,xˣ,0,clust)   #0 autocorrelations, but clustering

θ       = fO.theta
StdErrW = sqrt.(diag(fO.CovW))       #White's std
StdErrC = sqrt.(diag(fO.CovC))       #clustered std
tstatW  = θ./StdErrW
tstatC  = θ./StdErrC

printblue("FE regression with clustered standard errors:\n")
printmat(θ,tstatW,tstatC;colNames=["coef","t-stat White","t-stat Clust"],
                         rowNames=["c";xNames],width=15)

[34m[1mFE regression with clustered standard errors:[22m[39m

                       coef   t-stat White   t-stat Clust
c                    -0.000         -0.000         -0.000
exper/100             4.108          6.616          5.586
exper^2/100          -0.041         -1.640         -2.027
tenure/100            1.391          4.445          3.829
tenure^2/100         -0.090         -4.624         -4.880
south                -0.016         -0.411         -0.467
union                 0.064          4.675          3.631
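The difference between the two sets of t-stats comes from how the moment conditions $h_{it} = x_{it}u_{it}$ are combined. The sketch below assumes that cross-sectional clustering means that individuals in the same cluster may be correlated within a period, but not across periods. This is an assumption about the approach, not a copy of the `PanelOls()` code, and `ClusterSSketch` is a hypothetical name.

```julia
# Hypothetical helper, only for illustration (not the code of PanelOls()).
# h is TxKxN with h[t,:,i] = x[t,:,i]*u[t,i], clust is an N-vector of cluster labels.
function ClusterSSketch(h,clust)
    (T,K,N) = size(h)
    S = zeros(K,K)
    for t in 1:T, g in unique(clust)
        hg = vec(sum(h[t,:,clust .== g],dims=2))   #sum over the members of cluster g in period t
        S += hg*hg'                                #outer product of the cluster sum
    end
    return S
end
```

If every individual is its own cluster, this collapses to the sum of $h_{it}h_{it}'$ used for White's covariance matrix. The matrix `S` would then take the place of the middle part of the sandwich formula.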



## Individual and Time Fixed Effects

We redo the panel regression, but first reconstruct `(yˣ,xˣ)` to account for both individual and time fixed effects: see the `:idt` argument in the function call.

In [8]:
(yˣ,xˣ) = FixedEffects(yb,xb,:idt)
xˣ[:,1,:] .= 1

fO = PanelOls(yˣ,xˣ)

θ      = fO.theta
StdErr = sqrt.(diag(fO.CovW))
tstat  = θ./StdErr

printblue("Panel regression with individual and time fixed effects:\n")
printmat(θ,tstat,colNames=["coef","t-stat"],rowNames=["c";xNames])

Panel regression with individual and time fixed effects:

                  coef    t-stat
c                0.000     0.000
exper/100        6.713     4.654
exper^2/100     -0.045    -1.762
tenure/100       1.347     4.279
tenure^2/100    -0.090    -4.641
south           -0.014    -0.358
union            0.065     4.801
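For a balanced panel without missing values, removing both individual and time fixed effects can be done by subtracting the individual means and the period means, and then adding back the grand mean: $y_{it} - \bar{y}_{i} - \bar{y}_{t} + \bar{y}$. The sketch below shows this under those assumptions. `DemeanIdtSketch` is a hypothetical name and `FixedEffects(yb,xb,:idt)` may be implemented differently.

```julia
using Statistics

# Hypothetical helper, only for illustration (not from jlFiles/).
# Y is TxN and X is TxKxN, balanced panel without NaNs/missings.
function DemeanIdtSketch(Y,X)
    Yd = Y .- mean(Y,dims=1) .- mean(Y,dims=2) .+ mean(Y)             #individual, time, grand mean
    Xd = X .- mean(X,dims=1) .- mean(X,dims=3) .+ mean(X,dims=(1,3))  #same, regressor by regressor
    return Yd, Xd
end
```

A variable that is the sum of an individual effect and a time effect is wiped out completely by this transformation.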



# Unbalanced Panels (extra)

`PanelOls()` is coded so that an unbalanced panel (NaNs/missings in `(y,x)`) can be handled by zeroing out all of `(y[t,i],x[t,:,i])` whenever that observation contains a NaN/missing value. Also, `FixedEffects()` calculates means based only on those `(y[t,i],x[t,:,i])` observations that have no NaNs/missings, that is, on the observations that are effectively used in `PanelOls()`.

The `PanelyxReplaceNaN()` function does the zeroing out. The next cell illustrates how it works. There is also a `PanelyxReplaceNaN!()` which *overwrites* its input arguments to save memory.

In [9]:
y0   = [NaN 11;2 12;3 13]                     #y has a NaN for t=1, i = 1
x0   = hcat(ones(3,1,2),randn(3,2,2))
(vvM,y,x) = PanelyxReplaceNaN(y0,x0)

printblue("after 'zeroing out' observations with missings/NaNs, here y[1,1]=NaN:\n")
println("y[:,1:2], that is, for i = [1,2]")
printmat(y)
println("x[:,:,1], that is, for i = 1")
printmat(x[:,:,1])
println("x[:,:,2], that is, for i = 2")
printmat(x[:,:,2])

printred("Notice that (y[1,1],x[1,:,1]) are filled with zeros")

after 'zeroing out' observations with missings/NaNs, here y[1,1]=NaN:

y[:,1:2], that is, for i = [1,2]
     0.000    11.000
     2.000    12.000
     3.000    13.000

x[:,:,1], that is, for i = 1
     0.000     0.000     0.000
     1.000    -2.172     1.452
     1.000     0.706    -0.174

x[:,:,2], that is, for i = 2
     1.000    -1.336     2.189
     1.000    -0.284     1.252
     1.000     0.847    -0.538

[31m[1mNotice that (y[1,1],x[1,:,1]) are filled with zeros[22m[39m


## FE Estimation with some Missing Data

We introduce some missing values/NaNs in the data set, just in order to illustrate the workflow.

In [10]:
(yc,xc) = (copy(yb),copy(xb))
(yc[1],xc[end]) = (NaN,NaN)               #introduce some missings/NaNs

(ycˣ,xcˣ) = FixedEffects(yc,xc,:id)       #demeans, using only those obs that have no missings/NaNs
xcˣ[:,1,:] .= 1                           #putting the constant back

vvM = PanelyxReplaceNaN!(ycˣ,xcˣ)         #zero out rows with any NaN/missing, overwrites (ycˣ,xcˣ)
fO = PanelOls(ycˣ,xcˣ)                    #FE estimation

θ      = fO.theta
StdErr = sqrt.(diag(fO.CovW))
tstat  = θ./StdErr

printblue("Results after zeroing out observations with some missing values/NaNs:\n")
printmat(θ,tstat,colNames=["coef","t-stat"],rowNames=["c";xNames])

[34m[1mResults after zeroing out observations with some missing values/NaNs:[22m[39m

                  coef    t-stat
c                0.000     0.000
exper/100        4.080     6.563
exper^2/100     -0.039    -1.579
tenure/100       1.402     4.478
tenure^2/100    -0.090    -4.667
south           -0.016    -0.413
union            0.064     4.673

