# Panel Regressions

This notebook illustrates several panel data models and estimation methods: pooled OLS, fixed effects, the "between" estimator, first differences, and GLS for the random effects model.

## Load Packages and Extra Functions

In [1]:
using Printf, DelimitedFiles, Statistics, LinearAlgebra

include("jlFiles/printmat.jl")
include("jlFiles/CovNWFn.jl")
include("jlFiles/Ols.jl")
include("jlFiles/UtilityFunctions.jl");

## Loading Data

### A Remark on the Code

The data set contains many variables (one per column). It is convenient to store them as a named tuple where the names are taken directly from the header in the file. This allows us to later use, for instance, `X.lwage` to refer to the data on log wages. As an alternative, consider the `DataFrames.jl` package. 

The second cell below shows some of the data used in the regressions. Notice the structure: the first 5 observations are for individual (`id`) 1 (periods 1-5), the next 5 for individual 2, and so on. The code below does *not* rely on this structure: it handles equally well the case where the first `N` observations are for period 1 for each of the ids.

In [2]:
(data,header) = readdlm("Data/nls_panelEd.txt",header=true)    #classical data set from Hill et al (2008)

X = PutDataInNT(data,header)                      #NamedTuple with X.id, X.lwage, etc
println(keys(X))

id = convert.(Int,X.id)                           #id of individuals: 1,2,...,N
T = length(unique(X.year))                        #number of time periods
N = length(unique(id))                            #number of individuals
NT = N*T
c  = ones(NT)

println("\nT=$T and N=$N")

(:id, :year, :lwage, :hours, :age, :educ, :collgrad, :msp, :nev_mar, :not_smsa, :c_city, :south, :black, :union, :exper, :exper2, :tenure, :tenure2)

T=5 and N=716
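
As an illustration of the named-tuple storage described in the remark above, here is a minimal sketch of how a data matrix and a header row can be combined. The helper `matrix_to_nt` is a hypothetical name and is not the notebook's `PutDataInNT`, whose implementation is not shown here; the toy values in `data_toy` and `header_toy` are made up.

```julia
# Hypothetical helper (not the notebook's PutDataInNT): one field per column
function matrix_to_nt(data, header)
    nms  = Tuple(Symbol.(vec(header)))                #column names as symbols
    cols = ntuple(j -> data[:, j], size(data, 2))     #one vector per column
    return NamedTuple{nms}(cols)
end

header_toy = ["id" "year" "lwage"]       #1x3 matrix, like the header from readdlm
data_toy   = [1.0 82.0 1.808;
              1.0 83.0 1.863;
              2.0 82.0 1.281]

Z = matrix_to_nt(data_toy, header_toy)
println(keys(Z))
println(Z.lwage)                         #the column is now available by name
```

Each field is a plain vector, so `Z.lwage` can be used directly in later calculations.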


In [3]:
y = X.lwage
xNames = ["exper/100","exper^2/100","tenure/100","tenure^2/100","south","union"]
x = [c X.exper/100 X.exper2/100 X.tenure/100 X.tenure2/100 X.south X.union]
K = size(x,2)                  #number of regressors

printblue("The first few lines of (some of) the data:\n")
printmat(id[1:11],y[1:11],x[1:11,1:2];colNames=["id","lnwage","c","exper/100"],rowNames=string.(1:11),cell00="obs")

The first few lines of (some of) the data:

obs        id    lnwage         c exper/100
1       1         1.808     1.000     0.077
2       1         1.863     1.000     0.086
3       1         1.789     1.000     0.102
4       1         1.847     1.000     0.122
5       1         1.856     1.000     0.136
6       2         1.281     1.000     0.076
7       2         1.516     1.000     0.084
8       2         1.930     1.000     0.104
9       2         1.919     1.000     0.120
10      2         2.201     1.000     0.132
11      3         1.815     1.000     0.114



## Pooled OLS, FE, and Between Estimations

For the FE and between methods, we create individually demeaned data $(y^*,x^*)$ and individual averages $(\bar{y},\bar{x})$ by

$
x^*_{it} = x_{it} - \bar{x}_{i}, 
$

where $\bar{x}_{i}$ is a row vector with the averages of each column of $x$ for individual $i$. The transformation of $y$ to $y^*$ is similar.


'Pooled OLS' is just OLS on the original data `(y,x)`, 'fixed effects' (FE) is on the individually demeaned data `(yˣ,xˣ)` and the 'between estimator' is a cross-sectional OLS on the individual means `(ȳ,x̄)`.
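
Written out, the three regressions are as below. The coefficient vector $\beta$ and the residual symbols are notation introduced here for illustration, and the individual mean $\bar{x}_{i}$ is taken over the $T$ periods.

```latex
\begin{aligned}
\text{Pooled OLS:}    \quad & y_{it} = x_{it}\beta + \varepsilon_{it}, \\
\text{Fixed effects:} \quad & y^*_{it} = x^*_{it}\beta + \varepsilon^*_{it},
                              \quad x^*_{it} = x_{it} - \bar{x}_{i}, \\
\text{Between:}       \quad & \bar{y}_{i} = \bar{x}_{i}\beta + \bar{\varepsilon}_{i},
                              \quad \bar{x}_{i} = \frac{1}{T}\sum_{t=1}^{T} x_{it}.
\end{aligned}
```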



### A Remark on the Code

- `OlsNWFn(y,x,m=0)` is a function for doing OLS. Setting `m>=1` gives Newey-West standard errors. With `m=0` (the default) we instead get White's standard errors.
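
To make the role of the standard errors concrete, here is a minimal sketch of OLS with White's heteroskedasticity-consistent covariance matrix, which is what the remark above says `m=0` delivers. The function name `ols_white` is hypothetical and the sketch is not the notebook's `OlsNWFn`: it leaves out the Newey-West terms and any small-sample adjustment, and the toy data are made up.

```julia
using LinearAlgebra

# Hypothetical helper (not the notebook's OlsNWFn): OLS with White's covariance matrix
function ols_white(y, x)
    b    = x \ y                          #OLS coefficients
    res  = y - x*b                        #residuals
    Sxx  = x'x
    S    = x' * (x .* res.^2)             #sum over t of x_t'x_t*res_t^2
    Covb = inv(Sxx) * S * inv(Sxx)        #sandwich formula
    return b, res, Covb
end

x_toy = [ones(6) [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]]    #toy regressors: constant and a trend
y_toy = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]              #toy dependent variable

(b, res, Covb) = ols_white(y_toy, x_toy)
tstat = b ./ sqrt.(diag(Covb))            #same construction as the t-stats below
println("coefficients: ", b)
println("t-stats:      ", tstat)
```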

In [4]:
"""
    IndividualDemean(y,x,id,ϑ=1)

Demean y and x by individuals (cross-sectional units), assuming `y` is an NT-vector
and `x` is NTxK, with rows ordered as in the NT-vector `id`. The code handles data
sorted either by id or by time period.

ϑ=1 is used for the FE estimator, but the GLS estimator uses 0<ϑ<1 (see below).

"""
function IndividualDemean(y,x,id,ϑ=1)

    id_uniq = unique(id)               #which id values are in data set
    N       = length(id_uniq)
    K       = size(x,2)

    (yˣ,xˣ) = (fill(NaN,size(y)),fill(NaN,size(x)))
    (ȳ,x̄)   = (fill(NaN,N),fill(NaN,N,K))
        
    for i = 1:N                              #loop over individuals
        vv_i       = id .== id_uniq[i]           #locate rows which refer to individual i
        ȳ[i]       = mean(y[vv_i])               #averages for individual i
        x̄[i,:]     = mean(x[vv_i,:],dims=1)
        yˣ[vv_i]   = y[vv_i]   .- ϑ*ȳ[i]    
        xˣ[vv_i,:] = x[vv_i,:] .- ϑ*x̄[i:i,:]       #i:i to keep it a row vector
    end

    return yˣ,xˣ,ȳ,x̄

end

IndividualDemean
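
The following toy example illustrates why the logical mask `id .== id_uniq[i]` makes the demeaning independent of how the rows are sorted. The function `demean_by_id` is a simplified, hypothetical stand-in for `IndividualDemean` (vector input only, ϑ fixed at 1), and the numbers are made up.

```julia
using Statistics

# Hypothetical, simplified stand-in for IndividualDemean (vector input, ϑ=1)
function demean_by_id(y, id)
    yˣ = fill(NaN, size(y))
    for i in unique(id)
        vv_i     = id .== i                    #rows that belong to individual i
        yˣ[vv_i] = y[vv_i] .- mean(y[vv_i])    #subtract the mean of individual i
    end
    return yˣ
end

id_A = [1, 1, 2, 2]                  #rows sorted by individual
y_A  = [1.0, 3.0, 10.0, 20.0]

p    = [1, 3, 2, 4]                  #reorder the rows so they are sorted by period
id_B = id_A[p]
y_B  = y_A[p]

println(demean_by_id(y_A, id_A))
println(demean_by_id(y_B, id_B))     #same values, in the reordered rows
```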

In [5]:
(yˣ,xˣ,ȳ,x̄) = IndividualDemean(y,x,id);    #individual demeaning

xˣ[:,1] .= 1;               #put the intercept back

In [6]:
(b,res,yhat,Covb,R2,) = OlsNWFn(y,x)            #pooled OLS
xutLS = hcat(b,b./sqrt.(diag(Covb)))

(b,res,yhat,Covb,R2,) = OlsNWFn(yˣ,xˣ)          #fixed effect
xutFE = hcat(b,b./sqrt.(diag(Covb)))
s2_e  = sum(res.^2)/(NT-N-(K-1))                #needed for GLS (see below)

(b,res,yhat,Covb,R2,) = OlsNWFn(ȳ,x̄)         #between estimator
xutB = hcat(b,b./sqrt.(diag(Covb)))
s2_u = max(0,sum(res.^2)/(N-K) - s2_e/T)      #needed for GLS
printblue("Point estimates:")
printmat(xutLS[:,1],xutFE[:,1],xutB[:,1];colNames=["Pooled","FE","Between"],rowNames=["c";xNames])

printblue("t-stats:")
printmat(xutLS[:,2],xutFE[:,2],xutB[:,2];colNames=["Pooled","FE","Between"],rowNames=["c";xNames])

Point estimates:
                Pooled        FE   Between
c                1.285    -0.000     1.122
exper/100        7.837     4.108    10.641
exper^2/100     -0.201    -0.041    -0.317
tenure/100       1.206     1.391     1.247
tenure^2/100    -0.024    -0.090    -0.016
south           -0.196    -0.016    -0.201
union            0.110     0.064     0.121

t-stats:
                Pooled        FE   Between
c               28.513    -0.000     9.801
exper/100        8.954     6.616     4.573
exper^2/100     -5.264    -1.640    -3.054
tenure/100       2.346     4.445     0.883
tenure^2/100    -0.828    -4.624    -0.198
south          -13.247    -0.411    -6.519
union            6.928     4.675     3.102



## First-Difference Model

To estimate the first-difference model, we first need to calculate the differences between consecutive time periods for the *same individual*. This is done in a function defined in the next cell.

### A Remark on the Code

- `FirstDiff()` calls the function `lagFn`, which lags the data once (by default). For the first time period, the result is a NaN (as there are no earlier values).
- After the loop in `FirstDiff()` we locate and delete all rows that contain NaNs. This means that we will have only $T-1$ data points for each individual.
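
The two steps in the list above can be sketched on toy data as follows. The helper `lag1` is a hypothetical stand-in for `lagFn`, and the NaN filter stands in for `excise`; neither implementation is shown in this notebook, so both are assumptions. The numbers are made up.

```julia
# Hypothetical stand-in for lagFn: lag a vector once, padding with NaN
lag1(z) = [NaN; z[1:end-1]]

id_toy = [1, 1, 1, 2, 2, 2]                    #2 individuals, 3 periods each
y_toy  = [1.0, 3.0, 6.0, 10.0, 10.5, 12.0]

Δy_toy = fill(NaN, length(y_toy))
for i in unique(id_toy)
    vv_i         = id_toy .== i                       #rows that belong to individual i
    Δy_toy[vv_i] = y_toy[vv_i] - lag1(y_toy[vv_i])    #y[t] - y[t-1], NaN in the first period
end

Δy_toy = Δy_toy[.!isnan.(Δy_toy)]              #drop the NaN rows (stand-in for excise)
println(Δy_toy)
```

Differencing within each individual avoids subtracting the last observation of one individual from the first observation of the next.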

In [7]:
"""
    FirstDiff(y,x,id)

Calculate first differences (for each individual) of y and x. Data is assumed to be 
organised as in `IndividualDemean()`. It is important, however, that obs (i,t+1) is 
below that of (i,t).

"""
function FirstDiff(y,x,id)

    id_uniq = unique(id)               #which id values are in data set
    N       = length(id_uniq)
    K       = size(x,2)
  
    (Δy,Δx) = (fill(NaN,size(y)),fill(NaN,size(x)))

    for i = 1:N                          #individual first-differencing, loop over individuals
        vv_i       = id .== id_uniq[i]   #locate rows which refer to individual i
        Δy[vv_i,:] = y[vv_i]   - lagFn(y[vv_i,:])   #y[t] -y[t-1]
        Δx[vv_i,:] = x[vv_i,:] - lagFn(x[vv_i,:])   #x[t,:] -x[t-1,:]        
    end

    (Δy,Δx) = (excise(Δy),excise(Δx))          #cut out rows with NaNs

    return Δy,Δx
    
end    

FirstDiff

In [8]:
(Δy,Δx)  = FirstDiff(y,x,id)
Δx[:,1] .= 1;                  #put a (non-zero) constant back

In [9]:
(b,res,yhat,Covb,R2,) = OlsNWFn(Δy,Δx)
xutΔ = hcat(b,b./sqrt.(diag(Covb)))
xutΔ = xutΔ[2:end,:]

printblue("1-st difference estimation:")
printmat(xutΔ,colNames=["Coef","t-stat"],rowNames=xNames)   #2nd column is b./std, a t-stat

1-st difference estimation:
                  Coef    t-stat
exper/100        3.548     2.277
exper^2/100     -0.045    -0.933
tenure/100       1.293     2.527
tenure^2/100    -0.083    -2.329
south           -0.024    -0.395
union            0.044     3.115



## GLS of Random Effects Model (extra)

GLS is similar to the FE estimator discussed above, except that it is based on the 'quasi-demeaning' $y_{it} - \vartheta\bar{y}_{i}$, and similarly for $x_{it}$.
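
The weight $\vartheta$ computed in the next cell can be written out as below. The symbols $\sigma_e^2$ and $\sigma_u^2$ are notation introduced here; they correspond to the code's `s2_e` and `s2_u`, which were estimated above from the FE and between regressions.

```latex
\vartheta = 1 - \frac{\sigma_e}{\sqrt{T\sigma_u^2 + \sigma_e^2}}, \qquad
y_{it} - \vartheta\bar{y}_{i} \ \text{ and } \ x_{it} - \vartheta\bar{x}_{i}.
```

With $\sigma_u^2 = 0$ the weight is $\vartheta = 0$ and GLS coincides with pooled OLS; as $T\sigma_u^2$ grows relative to $\sigma_e^2$, $\vartheta$ approaches 1 and GLS approaches the FE estimator.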

In [10]:
ϑ = 1 - sqrt(s2_e)/sqrt(T*s2_u+s2_e)                       #GLS
printlnPs("ϑ in GLS: ",ϑ,"\n")
(yˣ,xˣ,) = IndividualDemean(y,x,id,ϑ);

ϑ in GLS:      0.774          



In [11]:
(b,res,yhat,Covb,R2,) = OlsNWFn(yˣ,xˣ)
xutGLS = hcat(b,b./sqrt.(diag(Covb)))
xutGLS = xutGLS[2:end,:]

printblue("GLS:")
printmat(xutGLS,colNames=["Coef","t-stat"],rowNames=xNames)   #2nd column is b./std, a t-stat

GLS:
                  Coef    t-stat
exper/100        4.570     7.111
exper^2/100     -0.063    -2.387
tenure/100       1.380     4.032
tenure^2/100    -0.074    -3.575
south           -0.132    -5.255
union            0.075     5.611

