# Panel Regressions

This notebook uses functions for panel regressions.
The functions can handle autocorrelation and cross-sectional clustering.

## Load Packages and Extra Functions

The key functions used in this notebook are from the (local) `FinEcmt_OLS` module.

In [1]:
MyModulePath = joinpath(pwd(),"jlFiles")
!in(MyModulePath,LOAD_PATH) && push!(LOAD_PATH,MyModulePath);

In [2]:
using FinEcmt_OLS, DelimitedFiles, Statistics, LinearAlgebra

# Loading Data and Reshuffling Data

In [3]:
(data,header) = readdlm("Data/nls_panelEd.txt",header=true)    #classical data set from Hill et al (2008)

Z = PutDataInNT(data,header)                      #NamedTuple with Z.id, Z.lwage, etc
println(keys(Z))

id = convert.(Int,Z.id)                           #id of individuals: 1,2,...,N
T = length(unique(Z.year))                        #number of time periods
N = length(unique(id))                            #number of individuals
NT = N*T
c  = ones(NT)

println("\nT=$T and N=$N")

(:id, :year, :lwage, :hours, :age, :educ, :collgrad, :msp, :nev_mar, :not_smsa, :c_city, :south, :black, :union, :exper, :exper2, :tenure, :tenure2)

T=5 and N=716


## Creating Variables for the Regressions

The next cell creates a $NT$-vector $y$ and a $NT \times K$ matrix $x$.

We then print the first few observations of (some of) the data. Notice the structure: the first 5 observations are for individual (`id`) 1 (period 1-5), the next 5 for individual 2.

In [4]:
y = Z.lwage
xNames = ["exper/100","exper^2/100","tenure/100","tenure^2/100","south","union"]
x = [c Z.exper/100 Z.exper2/100 Z.tenure/100 Z.tenure2/100 Z.south Z.union]
K = size(x,2)                  #number of regressors

printblue("The first few lines of (some of) the data:\n")
printmat(id[1:11],y[1:11],x[1:11,1:2];colNames=["id","lnwage","c","exper/100"],rowNames=string.(1:11),cell00="obs")

[34m[1mThe first few lines of (some of) the data:[22m[39m

obs        id    lnwage         c exper/100
1       1         1.808     1.000     0.077
2       1         1.863     1.000     0.086
3       1         1.789     1.000     0.102
4       1         1.847     1.000     0.122
5       1         1.856     1.000     0.136
6       2         1.281     1.000     0.076
7       2         1.516     1.000     0.084
8       2         1.930     1.000     0.104
9       2         1.919     1.000     0.120
10      2         2.201     1.000     0.132
11      3         1.815     1.000     0.114



# Pooled OLS with `OlsNW()`

The next cell does pooled OLS estimation and reports White's standard errors. It is straightforward to use the `IndividualDemean()` or the `FirstDiff()` functions from the local `FinEcmt_OLS` module to also do fixed effects and first difference estimations.

However, the OLS code is somewhat cumbersome to extend to clustered standard errors and to autocorrelation, since all data is stacked, so observations 35 and 36 (say) may belong to different cross-sectional units. For that reason, we later introduce the `PanelOls()` function, which can handle these complications.

In [5]:
(b_LS,res,yhat,Covb,R2,) = OlsNW(y,x)            #pooled OLS
tstat_LS = b_LS./sqrt.(diag(Covb))

printblue("results from Ols():\n")
printmat(b_LS,tstat_LS,colNames=["coef","t-stat"],rowNames=["c";xNames])


[34m[1mresults from Ols():[22m[39m

                  coef    t-stat
c                1.285    28.513
exper/100        7.837     8.954
exper^2/100     -0.201    -5.264
tenure/100       1.206     2.346
tenure^2/100    -0.024    -0.828
south           -0.196   -13.247
union            0.110     6.928
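
As an illustration of what "White's standard errors" means here, the following is a minimal sketch of pooled OLS with a heteroskedasticity-consistent covariance matrix. The function `OlsWhiteSketch()` and the small data set are made up for this example and are not part of `FinEcmt_OLS`; the sketch applies no small-sample adjustment, and whether `OlsNW()` applies one is not checked here.

```julia
using LinearAlgebra

# Hypothetical helper (not from FinEcmt_OLS): OLS with White's covariance matrix
function OlsWhiteSketch(y,x)
    b    = x\y                       #OLS coefficients
    u    = y - x*b                   #residuals
    Sxx  = x'x
    S    = x'*(x.*u.^2)              #sum of x_i*x_i'*u_i^2 over all observations
    Covb = inv(Sxx)*S*inv(Sxx)       #sandwich form
    return b, Covb
end

# made-up data, for illustration only
xSim = [ones(6) [1.0,2.0,3.0,4.0,5.0,6.0]]
ySim = [1.1,1.9,3.2,3.8,5.1,6.2]

(bSim,CovbSim) = OlsWhiteSketch(ySim,xSim)
tstatSim       = bSim./sqrt.(diag(CovbSim))
```

Since each observation contributes its own term `x_i*x_i'*u_i^2`, the ordering of the stacked data does not matter for this estimator. That is no longer true once we allow for correlation across observations.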



# Reshuffling the Data

We reshuffle the data to fit the convention in the `PanelOls()` function, which we use below.

We use the `PanelyxReshuffle(y,x,id)` function from the local `FinEcmt_OLS` module to reshuffle the dependent variable into a $T\times N$ matrix `Y` and the regressors into a $T \times K \times N$ array `X`. This allows the `PanelOls()` function to handle autocorrelation and cross-sectional clustering.

In [6]:
#@doc2 PanelyxReshuffle

In [7]:
(Y,X) = PanelyxReshuffle(y,x,id)

println("The Y matrix is now TxN, while X is TxKxN")
display(Y)

The Y matrix is now TxN, while X is TxKxN


5×716 Matrix{Float64}:
 1.80829  1.28093  1.81482  2.31254  …  1.53039  1.52823  1.46094  1.60944
 1.86342  1.51585  1.91991  2.34858     1.59881  2.4065   1.49669  1.45944
 1.78937  1.93017  1.95838  2.37349     1.60405  2.55886  1.55984  1.42712
 1.84653  1.91903  2.00707  2.3689      1.26794  2.64418  1.6536   1.49437
 1.85645  2.20097  2.08985  2.35053     1.55823  2.58664  1.61586  1.34142
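
The reshuffling can be sketched with `reshape()` and `permutedims()`. This is only an illustration, not the `PanelyxReshuffle()` code: the hypothetical `ReshuffleSketch()` assumes a balanced panel that is sorted by individual and then by period (as in the data printed earlier), and the numbers are made up.

```julia
# Hypothetical sketch (not the FinEcmt_OLS code): balanced panel, sorted by id and then period
function ReshuffleSketch(y,x,id)
    N = length(unique(id))
    T = div(length(y),N)
    K = size(x,2)
    Y = reshape(y,T,N)                            #column i holds the T periods of individual i
    X = permutedims(reshape(x,T,N,K),[1,3,2])     #NTxK -> TxNxK -> TxKxN
    return Y, X
end

# made-up data: N=3 individuals, T=2 periods, K=2 regressors
idSim = [1,1,2,2,3,3]
ySim  = [1.0,2.0,3.0,4.0,5.0,6.0]
xSim  = [ones(6) 10*ySim]

(YSim,XSim) = ReshuffleSketch(ySim,xSim,idSim)
display(YSim)
```

With this layout, `YSim[t,i]` and `XSim[t,:,i]` refer to the same period and individual, so it is easy to tell which observations share a period or a cross-sectional unit.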

# Pooled Estimation with `PanelOls()`

We use the `PanelOls()` function from the (local) `FinEcmt_OLS` module. The output is a named tuple. Use `keys(fO)` to see the names of the entries. The function reports White's covariance matrix, which gives the same results as using a traditional OLS approach, but it can also report a covariance matrix that allows for clustering (see further below in the notebook).

In [8]:
@doc2 PanelOls

#using CodeTracking
#println(@code_string PanelOls([1],[1]))

```
PanelOls(y,x,m=0,clust=[],vvM=[])
```

Pooled OLS estimation.

### Input

  * `y::Matrix`:          TxN matrix with the dependent variable, `y[t,i]` is for period `t`, individual `i`
  * `x::3D Array`:        TxKxN matrix with K regressors
  * `m::Int`:             (optional), scalar, number of lags in covariance estimation
  * `clust::Vector{Int}`: (optional), N vector with cluster number for each individual, [`ones(N)`]
  * `vvM::Matrix`:        (optional), TxN with true/false where false indicates NaN/missings in observation `(t,i)`

### Output

  * `fnOutput::NamedTuple`:   named tuple with the following elements

    1. `theta`         (K*L)x1 vector, LS estimates of regression coefficients on kron(z,x)
    2. `CovDK`         (K*L)x(K*L) matrix, Driscoll-Kraay covariance matrix
    3. `CovC`          covariance matrix, cluster
    4. `CovW`          covariance matrix, White's
    5. `R2`            scalar, (pseudo-) R2
    6. `yhat`          TxN matrix with fitted values
    7. `Nb`            T-vector, number of obs in each period

### Notice

  * for TxNxK -> TxKxN, do `x = permutedims(z,[1,3,2])`
  * for an unbalanced panel, set row t of `(y[t,i],x[t,:,i])` to zeros if there is a NaN/missing value in that row (see vvM)
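
The `permutedims()` call mentioned in the notice can be tried on a small array. The array `zSim` below is made up for the illustration.

```julia
zSim = reshape(collect(1.0:24.0),2,3,4)      #TxNxK array with T=2, N=3, K=4
xSim = permutedims(zSim,[1,3,2])             #TxKxN array

println(size(zSim)," -> ",size(xSim))
```

The element `zSim[t,i,k]` ends up in `xSim[t,k,i]`, so only the order of the dimensions changes.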


## Pooled Estimation with the `PanelOls()` Function

In [9]:
fO = PanelOls(Y,X)

θ_pooled      = fO.theta
StdErr        = sqrt.(diag(fO.CovW))
tstat_pooled  = θ_pooled./StdErr

printblue("results from pooled OLS using PanelOls(), White's standard errors:\n")
printmat(θ_pooled,tstat_pooled,colNames=["coef","t-stat"],rowNames=["c";xNames])

printred("compare with the result from `OlsNW()`: they should be the same")

[34m[1mresults from pooled OLS using PanelOls(), White's standard errors:[22m[39m

                  coef    t-stat
c                1.285    28.513
exper/100        7.837     8.954
exper^2/100     -0.201    -5.264
tenure/100       1.206     2.346
tenure^2/100    -0.024    -0.828
south           -0.196   -13.247
union            0.110     6.928

[31m[1mcompare with the result from `OlsNW()`: they should be the same[22m[39m


## First Difference Estimation

We first create first differences (`ΔY,ΔX`) and then use `PanelOls()`.

In [10]:
ΔY = Y[2:end,:]   - Y[1:end-1,:]         #first differences
ΔX = X[2:end,:,:] - X[1:end-1,:,:]
ΔX[:,1,:] .= 1                           #put back a non-zero intercept

fO = PanelOls(ΔY,ΔX)

θ_Δ     = fO.theta
StdErr  = sqrt.(diag(fO.CovW))
tstat_Δ = θ_Δ./StdErr

printblue("results from first difference model using PanelOls(), White's standard errors:\n")
printmat(θ_Δ,tstat_Δ,colNames=["coef","t-stat"],rowNames=["c";xNames])

[34m[1mresults from first difference model using PanelOls(), White's standard errors:[22m[39m

                  coef    t-stat
c                0.010     0.633
exper/100        3.548     2.277
exper^2/100     -0.045    -0.933
tenure/100       1.293     2.527
tenure^2/100    -0.083    -2.329
south           -0.024    -0.395
union            0.044     3.115
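
To see why first differencing is useful, consider a model with an individual effect $\alpha_i$ (the notation is an assumption made for this illustration, it is not taken from the module):

$$
y_{it} = \alpha_i + x_{it}'\beta + u_{it}
\quad\Rightarrow\quad
\Delta y_{it} = y_{it} - y_{i,t-1} = \Delta x_{it}'\beta + \Delta u_{it}.
$$

The individual effect drops out since it does not change over time. The same happens to the constant in $x_{it}$, which is why the code above sets the first column of `ΔX` to 1 again. An intercept in the differenced regression corresponds to a common linear time trend in the levels.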



## Fixed Effects Estimation

The `FixedEffects()` function from the local `FinEcmt_OLS` module removes individual (and/or time) fixed effects. Then we apply the `PanelOls()` function.

In [11]:
(yˣ,xˣ,) = FixedEffects(Y,X,:id)   #:id for individual fixed effects. :t for time fixed effects
xˣ[:,1,:] .= 1;                    #put back a non-zero intercept

fO = PanelOls(yˣ,xˣ)

θ_FE     = fO.theta
StdErr   = sqrt.(diag(fO.CovW))
tstat_FE = θ_FE./StdErr

printblue("results from fixed effects model using PanelOls(), White's standard errors:\n")
printmat(θ_FE,tstat_FE,colNames=["coef","t-stat"],rowNames=["c";xNames])

[34m[1mresults from fixed effects model using PanelOls(), White's standard errors:[22m[39m

                  coef    t-stat
c               -0.000    -0.000
exper/100        4.108     6.616
exper^2/100     -0.041    -1.640
tenure/100       1.391     4.445
tenure^2/100    -0.090    -4.624
south           -0.016    -0.411
union            0.064     4.675
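
Removing individual fixed effects amounts to subtracting each individual's time average from the data (the within transformation). The following minimal sketch shows this for a balanced panel without missing values. The function `DemeanIdSketch()` and the data are made up for this example; `FixedEffects()` may differ in details (for instance, in how it handles NaNs).

```julia
using Statistics

# Hypothetical sketch of the within transformation (balanced panel, no NaNs)
function DemeanIdSketch(Y,X)
    Yd = Y .- mean(Y,dims=1)        #TxN,   subtract the time average of each individual
    Xd = X .- mean(X,dims=1)        #TxKxN, same for each regressor
    return Yd, Xd
end

# made-up data: T=2, N=2, K=2 (constant and one regressor)
YSim = [1.0 10.0;
        3.0 20.0]
XSim = ones(2,2,2)
XSim[:,2,1] = [0.0,2.0]
XSim[:,2,2] = [1.0,5.0]

(YdSim,XdSim) = DemeanIdSketch(YSim,XSim)
display(YdSim)
```

The demeaned constant is zero for all observations, which is why the notebook puts back a non-zero intercept before calling `PanelOls()`. Since the demeaned data has zero means, the estimated intercept is (numerically) zero.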



# Clustered Standard Errors

We now redo the FE estimation but provide information on clustering for the standard errors. For simplicity, the clusters are defined as the value of the `South` dummy in $t=1$. Clearly, this could be done for the pooled OLS or the first difference models as well.

In [12]:
clust = convert.(Int,X[1,6,:])      #define clusters based on South/North in t=1

fO = PanelOls(yˣ,xˣ,0,clust)         #0 autocorrelations, but clustering

θ       = fO.theta
StdErrW = sqrt.(diag(fO.CovW))       #White's std
StdErrC = sqrt.(diag(fO.CovC))       #clustered std
tstatW  = θ./StdErrW
tstatC  = θ./StdErrC

printblue("FE estimation with clustered standard errors:\n")
printmat(θ,tstatW,tstatC;colNames=["coef","t-stat White","t-stat Clust"],
                         rowNames=["c";xNames],width=15)

[34m[1mFE estimation with clustered standard errors:[22m[39m

                       coef   t-stat White   t-stat Clust
c                    -0.000         -0.000         -0.000
exper/100             4.108          6.616          5.586
exper^2/100          -0.041         -1.640         -2.027
tenure/100            1.391          4.445          3.829
tenure^2/100         -0.090         -4.624         -4.880
south                -0.016         -0.411         -0.467
union                 0.064          4.675          3.631
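
The idea behind cross-sectional clustering can be sketched as follows. This is an illustration under stated assumptions, not the `PanelOls()` code: the hypothetical `ClusterCovSketch()` allows the residuals to be correlated across individuals that belong to the same cluster in the same period, but not across periods (no autocorrelation), and it assumes a balanced panel. The data is made up.

```julia
using LinearAlgebra

# Hypothetical sketch (not the PanelOls code): pooled OLS on (TxN, TxKxN) data
# with a covariance matrix that is clustered cross-sectionally within each period
function ClusterCovSketch(Y,X,clust)
    (T,K,N) = size(X)
    Sxx = zeros(K,K)
    Sxy = zeros(K)
    for t in 1:T, i in 1:N
        Sxx += X[t,:,i]*X[t,:,i]'
        Sxy += X[t,:,i]*Y[t,i]
    end
    θ = Sxx\Sxy                                 #pooled OLS estimates
    S = zeros(K,K)
    for t in 1:T, g in unique(clust)
        h = zeros(K)                            #sum of x*u in cluster g, period t
        for i in findall(==(g),clust)
            h += X[t,:,i]*(Y[t,i] - dot(X[t,:,i],θ))
        end
        S += h*h'
    end
    Cov = inv(Sxx)*S*inv(Sxx)                   #sandwich form
    return θ, Cov
end

# made-up data: T=3, N=4, K=2
XSim = ones(3,2,4)
XSim[:,2,:] = [1.0 2.0 0.5 1.5;
               2.0 1.0 1.5 0.5;
               3.0 0.0 2.5 1.0]
YSim = [1.0 2.1 0.4 1.7;
        2.2 0.9 1.4 0.3;
        2.9 0.2 2.7 1.2]

(θC,CovC) = ClusterCovSketch(YSim,XSim,[1,1,2,2])     #two clusters
(θW,CovW) = ClusterCovSketch(YSim,XSim,[1,2,3,4])     #each individual is its own cluster
```

When each individual is its own cluster, every sum `h` contains a single observation, so the sketch collapses to White's covariance matrix. The point estimates are the same in both cases: clustering only affects the covariance matrix.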



## Individual and Time Fixed Effects (extra)

Redo the FE panel regression, but first we reconstruct `(yˣ,xˣ)` to handle both individual and time fixed effects: see the `:idt` in the function call.

In [13]:
(yˣ,xˣ,) = FixedEffects(Y,X,:idt)
xˣ[:,1,:] .= 1              #put back the intercept

fO = PanelOls(yˣ,xˣ)

θ      = fO.theta
StdErr = sqrt.(diag(fO.CovW))
tstat  = θ./StdErr

printblue("Panel regression with individual and time fixed effects:\n")
printmat(θ,tstat,colNames=["coef","t-stat"],rowNames=["c";xNames])

[34m[1mPanel regression with individual and time fixed effects:[22m[39m

                  coef    t-stat
c                0.000     0.000
exper/100        6.713     4.654
exper^2/100     -0.045    -1.762
tenure/100       1.347     4.279
tenure^2/100    -0.090    -4.641
south           -0.014    -0.358
union            0.065     4.801
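
For a balanced panel, removing both individual and time fixed effects can be done by subtracting the individual averages and the period averages, and then adding back the overall average. The sketch below illustrates this. The function `DemeanIdTSketch()` and the data are made up, and it is an assumption (not checked here) that `FixedEffects(Y,X,:idt)` gives the same result in the balanced case.

```julia
using Statistics

# Hypothetical sketch of two-way demeaning (balanced panel, no NaNs)
function DemeanIdTSketch(Y,X)
    Yd = Y .- mean(Y,dims=1) .- mean(Y,dims=2) .+ mean(Y)               #TxN
    Xd = X .- mean(X,dims=1) .- mean(X,dims=3) .+ mean(X,dims=(1,3))    #TxKxN
    return Yd, Xd
end

# made-up data: T=2, N=3, K=2
YSim = [1.0 4.0 2.0;
        3.0 5.0 9.0]
XSim = ones(2,2,3)
XSim[:,2,:] = [0.5 1.0 2.0;
               1.5 3.0 2.5]

(YdSim,XdSim) = DemeanIdTSketch(YSim,XSim)
display(YdSim)
```

After this transformation, the averages across periods (for each individual) and across individuals (for each period) are both zero.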



## Unbalanced Panels (extra)

The `PanelOls()` function is coded in such a way that an unbalanced panel (NaNs/missings in `(y,x)`) can be handled by zeroing out all of (`y[t,i],x[t,:,i]`) if there is a NaN/missing value for observation `(t,i)`.

Importantly, `FixedEffects()` will calculate means based on only those observations that have no NaNs/missings, that is, based on those observations that are effectively used in `PanelOls()`.

This is illustrated in the cell below by first setting some of the data to `NaN` and then applying the functions. (The results are similar, but clearly not identical, to those above.)

In [14]:
(Yc,Xc) = (copy(Y),copy(X))               #create new arrays (we change them below)
(Yc[1],Xc[end]) = (NaN,NaN)               #introduce some missings/NaNs

(ycˣ,xcˣ,) = FixedEffects(Yc,Xc,:id)      #demeans, using only those obs that have no missings/NaNs
xcˣ[:,1,:] .= 1                           #putting the constant back

vvM = PanelyxReplaceNaN!(ycˣ,xcˣ)         #zero out rows with any NaN/missing, overwrites (ycˣ,xcˣ)
fO = PanelOls(ycˣ,xcˣ)                    #FE estimation

θ      = fO.theta
StdErr = sqrt.(diag(fO.CovW))
tstat  = θ./StdErr

printblue("Results after zeroing out observations with some missing values/NaNs:\n")
printmat(θ,tstat,colNames=["coef","t-stat"],rowNames=["c";xNames])

[34m[1mResults after zeroing out observations with some missing values/NaNs:[22m[39m

                  coef    t-stat
c                0.000     0.000
exper/100        4.080     6.563
exper^2/100     -0.039    -1.579
tenure/100       1.402     4.478
tenure^2/100    -0.090    -4.667
south           -0.016    -0.413
union            0.064     4.673
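
The zeroing out can be sketched as follows. This is an illustration, not the `PanelyxReplaceNaN!()` code: the hypothetical `ZeroOutNaNSketch()` returns new arrays (instead of overwriting its inputs) together with a $T\times N$ matrix of true/false values, and the data is made up.

```julia
# Hypothetical sketch (not the FinEcmt_OLS code): zero out (Y[t,i],X[t,:,i]) if any of them is NaN
function ZeroOutNaNSketch(Y,X)
    (T,K,N) = size(X)
    vv = [!(isnan(Y[t,i]) || any(isnan,X[t,:,i])) for t in 1:T, i in 1:N]   #true: no NaN in obs (t,i)
    (Y0,X0) = (copy(Y),copy(X))
    for t in 1:T, i in 1:N
        if !vv[t,i]
            Y0[t,i]    = 0.0
            X0[t,:,i] .= 0.0
        end
    end
    return Y0, X0, vv
end

# made-up data: T=2, N=2, K=2
YSim = [1.0 NaN;
        2.0 3.0]
XSim = ones(2,2,2)
XSim[2,2,1] = NaN

(Y0Sim,X0Sim,vvSim) = ZeroOutNaNSketch(YSim,XSim)
display(vvSim)
```

A zeroed-out observation adds nothing to the sums of `x*x'` and `x*y`, so it does not affect the point estimates. It is as if the observation had been dropped.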

