<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Examples-from-Mehmetoglu-&amp;-Jakobsen-(2016)" data-toc-modified-id="Examples-from-Mehmetoglu-&amp;-Jakobsen-(2016)-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Examples from Mehmetoglu &amp; Jakobsen (2016)</a></span><ul class="toc-item"><li><span><a href="#Defining-a-data-set-as-panel-data" data-toc-modified-id="Defining-a-data-set-as-panel-data-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Defining a data set as panel data</a></span></li></ul></li><li><span><a href="#Estimating-panel-data-models" data-toc-modified-id="Estimating-panel-data-models-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Estimating panel data models</a></span><ul class="toc-item"><li><span><a href="#Pooled-OLS" data-toc-modified-id="Pooled-OLS-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Pooled OLS</a></span></li><li><span><a href="#Between-Effects" data-toc-modified-id="Between-Effects-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Between Effects</a></span></li></ul></li><li><span><a href="#Fixed-Effects" data-toc-modified-id="Fixed-Effects-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Fixed Effects</a></span><ul class="toc-item"><li><span><a href="#Interpreting-the-results-of-Fixed-Effects-model" data-toc-modified-id="Interpreting-the-results-of-Fixed-Effects-model-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Interpreting the results of Fixed Effects model</a></span></li><li><span><a href="#Postestimation" data-toc-modified-id="Postestimation-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Postestimation</a></span></li></ul></li><li><span><a href="#Random-Effects" data-toc-modified-id="Random-Effects-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Random Effects</a></span></li><li><span><a href="#Selecting-between-models" data-toc-modified-id="Selecting-between-models-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Selecting between models</a></span><ul 
class="toc-item"><li><span><a href="#Pooled-OLS-vs-Panel-Model?" data-toc-modified-id="Pooled-OLS-vs-Panel-Model?-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Pooled OLS vs Panel Model?</a></span></li><li><span><a href="#Fixed-vs-Random-Effects?" data-toc-modified-id="Fixed-vs-Random-Effects?-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Fixed vs Random Effects?</a></span></li></ul></li><li><span><a href="#Tips-and-tricks" data-toc-modified-id="Tips-and-tricks-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Tips and tricks</a></span><ul class="toc-item"><li><span><a href="#Decomposing-variables-into-between-and-within-effects" data-toc-modified-id="Decomposing-variables-into-between-and-within-effects-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Decomposing variables into between and within effects</a></span></li><li><span><a href="#Time-dummies" data-toc-modified-id="Time-dummies-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Time dummies</a></span></li></ul></li></ul></div>

# Panel Data Analysis

## Examples from Mehmetoglu & Jakobsen (2016)

Let's get some panel data, using the British Household Panel Survey (1991–2005) teaching data set.[^1]

[*Add brief description of the BHPS and how it is now Understanding Society*]

[^1]: The British Household Panel Survey is made available by the Institute for Social and Economic Research, University of Essex. See British Household Panel Survey: Waves 1–11, 1991–2002: Teaching Dataset (Work, Family and Health) [computer file], 2nd edition, http://dx.doi.org/10.5255/UKDA-SN-4901-2. The data has been prepared for use by Morten Blekesaune, University of Agder.

In [None]:
use "../data/BritishHouseholdPanel.dta", clear
desc, f

`xtreg` estimates a random effects model by default.

### Defining a data set as panel data

First, we need to tell Stata we are dealing with panel data, as this allows us to access some time-series operators that are useful:

In [10]:
xtset id year

       panel variable:  id (unbalanced)
        time variable:  year, 1991 to 2005, but with gaps
                delta:  1 unit


The `xtset` command takes two arguments: a variable representing the unique identifier of the panel members (*id*) and a variable capturing the unique identifier for the time period (*year*). This combination of variables must uniquely identify every observation (row) in the data. We can check whether this is the case using the `isid` command: if no error message is returned, then those variables uniquely identify every observation:

In [None]:
isid id year

We can confirm that *id* and *age* do not uniquely identify observations:

In [None]:
isid id age
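Because `isid` stops with an error when the check fails, it can interrupt a notebook that is run from top to bottom. A minimal sketch of two gentler checks, assuming the BHPS data from the cells above are still in memory:

```stata
* capture noisily shows the error message but lets later cells keep running.
capture noisily isid id age
display "isid return code: " _rc

* duplicates report tabulates how many rows share each id-year combination.
duplicates report id year
```

A return code of 0 means the variables uniquely identify the rows; any other value means they do not.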

Then we can use `xtdescribe` to learn more about the patterns of observations in our panel:

In [11]:
xtdescribe


      id:  10014578, 10014608, ..., 1.596e+08                n =       7602
    year:  1991, 1992, ..., 2005                             T =         15
           Delta(year) = 1 unit
           Span(year)  = 15 periods
           (id*year uniquely identifies each observation)

Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                         1       1       4         8        13      15      15

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+-----------------
     1190     15.65   15.65 |  111111111111111
      169      2.22   17.88 |  1111...........
      167      2.20   20.07 |  .............11
      146      1.92   21.99 |  ..........11111
      136      1.79   23.78 |  .......11111111
      134      1.76   25.55 |  1..............
      123      1.62   27.16 |  ...........1111
      122      1.60   28.77 |  11.............
      115      1.51   30.28 |  ............111
     5300     69.72  100.00 | (other patterns)
 ----

Let's unpack these results:
* There are 7602 panel members (*n*) and 15 time periods (*T*).
* The time period variable (*year*) changes by 1 unit (*Delta(year)*).
* 50% of panel members are observed at least 8 times (*Distribution of T_i*).
* 1190 panel members are observed in every time period, 169 are observed only in the first 4 periods etc (see frequency table).
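As a small illustration of the time-series operators that `xtset` makes available, the sketch below creates a lagged and a first-differenced version of *mental*. It assumes the data are still `xtset` by *id* and *year*; the new variable names are hypothetical.

```stata
* L. is the lag operator and D. is the first-difference operator.
generate mental_lag  = L.mental
generate mental_diff = D.mental

* Both are missing where the previous year is not observed for that person.
list id year mental mental_lag mental_diff in 1/10
```

Because the panel has gaps, the operators work on the value of *year* rather than on row order, so a missing wave produces a missing lag instead of silently borrowing an earlier observation.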

In [None]:
tab year

In [7]:
* bysort sorts on id first, so this runs whatever the current sort order
bysort id: gen numobs = _N

In [9]:
xttab numobs


                  Overall             Between            Within
   numobs |    Freq.  Percent      Freq.  Percent        Percent
----------+-----------------------------------------------------
        1 |     554      0.88       554      7.29         100.00
        2 |    1264      2.00       632      8.31         100.00
        3 |    1497      2.37       499      6.56         100.00
        4 |    2084      3.29       521      6.85         100.00
        5 |    2240      3.54       448      5.89         100.00
        6 |    2586      4.09       431      5.67         100.00
        7 |    2905      4.59       415      5.46         100.00
        8 |    3528      5.57       441      5.80         100.00
        9 |    3150      4.98       350      4.60         100.00
       10 |    3810      6.02       381      5.01         100.00
       11 |    4532      7.16       412      5.42         100.00
       12 |    4776      7.55       398      5.24         100.00
       13 |    6643     1

## Estimating panel data models

Using our BHPS data set, let's specify a statistical model for predicting mental distress as a function of three factors: age, sex and marital status, the last of which enters as four dummy variables (eq. 1.1):

\begin{equation} \text{Y}_{it} = \beta_0 + \beta_1X_{1it} + \beta_2X_{2i} + \beta_3X_{3it} + \beta_4X_{4it} + \beta_5X_{5it} + \beta_6X_{6it} + \text{e}_{it} \tag{1.1} \end{equation}

Where:

$\text{Y}$ is our outcome variable for individual *i* at time *t*;

$\beta_0$ is the constant term, which is our prediction for the outcome when the values of all other variables in the model are set to 0.

$\text{X}_{1it}$ captures the age of individual *i* at time *t*, and $\beta_1$ is the effect of this variable on the outcome.

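We note that $\text{X}_{2i}$ (*woman*) is time-invariant, that is, its value is constant over time within an individual, and thus does not have the subscript *t*.

$\text{X}_{3it}$ to $\text{X}_{6it}$ are dummy variables for marital status (*couple*, *separated*, *divorced* and *never_married*), with married as the reference category.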

### Pooled OLS

The starting point for any statistical modelling of panel data is to estimate a *Pooled OLS* model. All of the observations for each unit in the panel are pooled together and they are analysed using the standard statistical model for cross-sectional data (Gayle & Lambert, 2018). Remember that Pooled OLS can produce consistent estimates of the regression coefficients [*AND STANDARD ERRORS?*] "if model is correctly specified and the explanatory (X) variables are uncorrelated with the error term. However, in panel data the error term will in most cases be correlated over time for a given unit (Cameron and Trivedi, 2010)." [Can I think of an example, even hypothetical? Yes, respondent diaries capturing data on non-consecutive time periods.]

Fundamental problem of pooling observations (Gayle & Lambert, 2018, p. 58):
> The model does not recognise that there are multiple contributions of data from the same individuals, and therefore, it estimates results as if there are many individuals who shared the same characteristics. This impacts upon the estimate of measures such as variances and standard errors.

In [None]:
regress mental age woman couple separated divorced never_married

The results of the test show we can reject the null hypothesis that there is no autocorrelation present. Therefore, the 

### Between Effects

In [None]:
xtreg mental age woman couple separated divorced never_married, be

The Between Effects estimator is the same as collapsing the data set to unit-level mean values of the independent and dependent variables, and running a regular linear regression on the collapsed data. Note how the results are exactly the same as those from the `xtreg, be` command, and how the R-squared calculations match (*between = 0.0364*).

In [None]:
preserve
    quietly regress mental age woman couple separated divorced never_married
    collapse (mean) mental age woman couple separated divorced never_married if e(sample), by(id)
    regress mental age woman couple separated divorced never_married
restore

An advantage of Between Effects:
By collapsing the data set to mean values, we now have one row per unit, which means we've sidestepped the issue of non-independence of observations (Gayle & Lambert, 2018).
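As a sketch of what the collapsed regression estimates, averaging eq. 1.1 over the periods in which each person is observed gives the equation below. The bar notation (a person-specific mean) is introduced here for illustration and is not taken from the source texts:

\begin{equation} \bar{\text{Y}}_{i} = \beta_0 + \beta_1\bar{X}_{1i} + \beta_2X_{2i} + \beta_3\bar{X}_{3i} + \beta_4\bar{X}_{4i} + \beta_5\bar{X}_{5i} + \beta_6\bar{X}_{6i} + \bar{\text{e}}_{i} \end{equation}

The subscript *t* has disappeared from every term, which is why only differences between people, not changes within them, inform the Between Effects coefficients.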

There are two major disadvantages to selecting a *Between Effects* model for longitudinal data:
1. You cannot include independent variables that are constant across units in a given time period, but do vary over these periods.
2. The model throws away a lot of information, resulting in an inability to address certain questions.

On the former, let's try and include a variable capturing changing macroeconomic conditions across the time period.

In [None]:
/* Even years are coded as good economic years (1), odd years as bad (0) */

gen macecon = cond(mod(year, 2)==0, 1, 0)
label define macecon_lab 1 "Good" 0 "Bad"
label values macecon macecon_lab
tab year macecon 

In [None]:
xtreg mental macecon age woman couple separated divorced never_married, be

In [None]:
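preserve
    * Fit the pooled model quietly so that e(sample) marks the estimation sample
    quietly regress mental macecon age woman couple separated divorced never_married
    collapse (mean) mental macecon age woman couple separated divorced never_married if e(sample), by(id)
    * Inspect how little the person-level mean of macecon varies
    sum macecon, detail
    regress mental macecon age woman couple separated divorced never_married
restore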

Panel regression models take into account the multiple contributions

## Fixed Effects

Conceptualising FE:
> ...consider it a standard cross-sectional regression model with the addition of a dummy variable being included for every respondent in the dataset except for one (i.e., *n - 1* dummy variables are added to the model).

This results in each unit having their own intercept, to which we add the effects of the explanatory variables.

Mehmetoglu & Jakobsen (2016: 239):
> One of the main worries when it comes to panel data is whether or not the error term is correlated with one or more of our X-variables. Suppose, for example, we were investigating wage (Y) as a function of education (X1) and experience (X2), but we had a suspicion that an unmeasured variable called ability (C) was influencing our model.

\begin{equation} \text{Y}_{it} = \beta_0 + \beta_1X_{1it} + \beta_2X_{2it} + (\text{c}_{i} + \text{e}_{it}) \tag{1.2} \end{equation}

Where:

$\text{Y}$ is our outcome variable for individual *i* at time *t*;

$\beta_0$ is the constant term, which is our prediction for the outcome when the values of all other variables in the model are set to 0.

$\text{X}_{1it}$ captures the education of individual *i* at time *t*, and $\beta_1$ is the effect of this variable on the outcome; $\text{X}_{2it}$ and $\beta_2$ do the same for experience.

$\text{c}_{i} + {e}_{it}$ is the combined error term (unexplained variance), which has two components:
* $\text{c}_{i}$ is the unit-level variance accounted for by the omitted variable *C*.
* $\text{e}_{it}$ is the observation-level variance that is unaccounted for (unsystematic variation in the outcome).

The problem with omitting variable *C* is that one or both of X1 and X2 will 'soak up' the unexplained variation produced by C's absence from the model, and hence we obtain biased estimates of the coefficients for X1 and/or X2 (depending on which of these is correlated with *C*).

The solution FE provides is that if C is constant within an individual (e.g., ability varies between but not within individuals), the FE estimator estimates unbiased betas for the observed variables that are correlated with C.

By including unit-specific dummies, the regression formula now looks like this:

\begin{equation} \text{Y}_{it} = \beta_1X_{1it} + \beta_2X_{2it} + \alpha_{i} + \text{e}_{it} \tag{1.3} \end{equation}

Where:

$\alpha_{i}$ captures the combined effect of the unit-specific unobserved effect ($\text{c}_{i}$) and the constant ($\beta_0$). In essence the unit-specific effect shifts the overall intercept up or down the Y axis by the value of $\text{c}_{i}$.

What this translates to is every unit having their own constant term i.e., intercept. $\alpha$ varies across but not within units. Basically, every unit in the panel has a unique starting point before adding the effect of the independent variables.

Why is the unobserved effect incorporated into the constant? Because it doesn't vary within an individual: remember we are only modelling within-unit variation and the dummy variable does not vary within a unit. It therefore does not make sense to interpret the unit-specific term like an explanatory variable (e.g., what would a one-unit increase in being a particular unit mean?).
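To see why the unit-specific term drops out, here is a sketch of the "within" transformation that FE is usually described as applying. Bars denote unit means over time; this is a standard textbook step rather than something quoted from the sources above. Averaging the dummy-variable equation over time for unit *i* gives:

\begin{equation} \bar{\text{Y}}_{i} = \beta_1\bar{X}_{1i} + \beta_2\bar{X}_{2i} + \alpha_{i} + \bar{\text{e}}_{i} \end{equation}

Subtracting this from the equation for each observation leaves:

\begin{equation} \text{Y}_{it} - \bar{\text{Y}}_{i} = \beta_1(X_{1it} - \bar{X}_{1i}) + \beta_2(X_{2it} - \bar{X}_{2i}) + (\text{e}_{it} - \bar{\text{e}}_{i}) \end{equation}

Because $\alpha_{i}$ is constant within a unit it cancels, so only deviations from each unit's own mean are left to identify $\beta_1$ and $\beta_2$.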

[*Show a LSDV approach and compare to `xtreg, fe`*]

In [1]:
use "../data/Happiness.dta", clear
desc, f




Contains data from ../data/Happiness.dta
  obs:            16                          
 vars:             7                          16 Feb 2015 09:31
 size:           448                          
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
id              float   %9.0g      id         
income          float   %9.0g                 
happiness       float   %9.0g                 
income1000      float   %9.0g                 
income1000DM    float   %9.0g                 
happinessDM     float   %9.0g                 
car_ownership   float   %9.0g                 
--------------------------------------------------------------------------------
Sorted by: 


Let's say we're interested in predicting a person's happiness based on their income:

In [None]:
regress happiness income1000

There appears to be a negative association between happiness and income: for every one-unit increase in income (measured in 000s), we predict happiness to decline by *.0946145*. However, such a simple model ignores other important explanatory factors of happiness, such as mental health, which is also correlated with income. There is therefore an *unobserved time-invariant variable* that is biasing the coefficient for income.

In this fictional example Nicole, who has the highest income, also has mental health issues; because income and mental health are negatively correlated (more income, poorer mental health), the coefficient for `income1000` soaks up some of the variation in happiness that is explained by mental health.

In [None]:
generate Dbob=id==1 if !missing(id)
generate Dsarah=id==2 if !missing(id)
generate Dpeter=id==3 if !missing(id)
generate Dnicole=id==4 if !missing(id)

In [None]:
regress happiness Dsarah Dpeter Dnicole income1000

We can sum these coefficients to produce a predicted happiness score for each person:

In [None]:
gen Bob = -4.107029 + (0.0824934 * income1000)
gen Sarah = -4.107029 + 5.111141 + (0.0824934 * income1000)
gen Peter = -4.107029 + 4.226857 + (0.0824934 * income1000)
gen Nicole = -4.107029 + 9.410411 + (0.0824934 * income1000)
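Typing coefficients by hand is error-prone, so a hedged alternative is to pull them from the stored results with `_b[]`. The sketch below assumes the dummy variables created above exist; the `_b`-suffixed variable names are hypothetical, chosen so they do not clash with the variables already generated.

```stata
* Re-run the dummy-variable regression so its coefficients are the active estimates.
quietly regress happiness Dsarah Dpeter Dnicole income1000

* _b[name] returns the stored coefficient on that term.
generate Bob_b    = _b[_cons] + _b[income1000] * income1000
generate Sarah_b  = _b[_cons] + _b[Dsarah]  + _b[income1000] * income1000
generate Peter_b  = _b[_cons] + _b[Dpeter]  + _b[income1000] * income1000
generate Nicole_b = _b[_cons] + _b[Dnicole] + _b[income1000] * income1000

* Factor-variable notation builds the same dummies on the fly (id 1 is the base).
regress happiness i.id income1000
```

The `i.id` version should reproduce the coefficients of the hand-built dummy regression, since Bob (id 1) is the omitted category in both.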

Now we can graph these regression lines as follows:

In [None]:
*twoway (line Bob income1000) (line Sarah income1000) (line Peter income1000) (line Nicole income1000)

In [4]:
xtset id

       panel variable:  id (balanced)


In [5]:
xtreg happiness income1000, fe


Fixed-effects (within) regression               Number of obs     =         16
Group variable: id                              Number of groups  =          4

R-sq:                                           Obs per group:
     within  = 0.3665                                         min =          4
     between = 0.7123                                         avg =        4.0
     overall = 0.2912                                         max =          4

                                                F(1,11)           =       6.36
corr(u_i, Xb)  = -0.7916                        Prob > F          =     0.0283

------------------------------------------------------------------------------
   happiness |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  income1000 |   .0824934   .0327004     2.52   0.028     .0105203    .1544665
       _cons |   .5800729    1.91721     0.30   0.768    -3.6396
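One way to see what the within estimator is doing is to demean the variables by hand and run an ordinary regression on the deviations. This is a sketch under the assumption that the Happiness data are in memory; all the new variable names are hypothetical.

```stata
* Person-specific means of the outcome and the predictor.
bysort id: egen happiness_mean  = mean(happiness)
bysort id: egen income1000_mean = mean(income1000)

* Deviations from each person's own mean (the within variation).
generate happiness_dev  = happiness  - happiness_mean
generate income1000_dev = income1000 - income1000_mean

* The slope should match the xtreg, fe coefficient on income1000.
regress happiness_dev income1000_dev, noconstant
```

The standard error from this regression will differ from the `xtreg, fe` one, because `regress` does not know that degrees of freedom were used up in estimating the person means.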

In [6]:
predict fixed, u
sum fixed, detail




                            u[id]
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -4.687102      -4.687102
 5%    -4.687102      -4.687102
10%    -4.687102      -4.687102       Obs                  16
25%    -2.573674      -4.687102       Sum of Wgt.          16

50%    -.0181035                      Mean          -4.47e-08
                        Largest       Std. Dev.      3.451385
75%     2.573674       4.723309
90%     4.723309       4.723309       Variance       11.91206
95%     4.723309       4.723309       Skewness       .0159669
99%     4.723309       4.723309       Kurtosis       1.965356


In [8]:
xtreg happiness income1000, re
predict random, u
sum random, detail



Random-effects GLS regression                   Number of obs     =         16
Group variable: id                              Number of groups  =          4

R-sq:                                           Obs per group:
     within  = 0.3665                                         min =          4
     between = 0.7123                                         avg =        4.0
     overall = 0.2912                                         max =          4

                                                Wald chi2(1)      =       1.16
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.2824

------------------------------------------------------------------------------
   happiness |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  income1000 |   .0404467   .0376284     1.07   0.282    -.0333037     .114197
       _cons |   3.024037   2.472703     1.22   0.221    -1.822

### Interpreting the results of Fixed Effects model

* `_cons` is the intercept and represents the average value of the fixed effects.
* $\beta_1$ is the predicted change in the outcome for a one-unit increase in $\text{X}_{1}$, net of the influence of unobserved time-invariant factors.
* `rho` (also represented as $\rho$) is the proportion of the unexplained variance in the outcome that is due to differences between individuals (the fixed effects), rather than changes over time. It is also known as the *intraclass correlation*, which is a measure of how similar observations are within a unit. If `rho` > .5 then most of this variance is due to differences between units; if `rho` < .5 then most of it is due to variation within units (the idiosyncratic error).
* `sigma_u` (or $\sigma_u$) is the standard deviation of the unit-level effects $u_i$.
* `sigma_e` (or $\sigma_e$) is the standard deviation of the idiosyncratic residuals $e_{it}$.
* `corr(u_i, Xb)` is the correlation between the unit-level effects and the linear prediction from the independent variables in the model.
* `R-sq: within` is the proportion of within-unit variance in the outcome explained by the independent variables (excluding the unit-specific term).
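To make the link between `rho`, `sigma_u` and `sigma_e` concrete, the sketch below recomputes `rho` from the results that `xtreg, fe` stores. It assumes the Happiness data are still in memory and `xtset` by *id*.

```stata
* Refit quietly so the stored results belong to the Fixed Effects model.
quietly xtreg happiness income1000, fe

display "sigma_u       = " e(sigma_u)
display "sigma_e       = " e(sigma_e)
display "rho (stored)  = " e(rho)

* rho is the share of the combined error variance that sits at the unit level.
display "rho (by hand) = " e(sigma_u)^2 / (e(sigma_u)^2 + e(sigma_e)^2)
```

The two `rho` values should agree, which shows that `rho` describes how the unexplained variance splits between units and observations.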

Let's go back to BHPS data and our model for predicting mental distress.

In [1]:
use "../data/BritishHouseholdPanel.dta", clear
desc, f

xtset id year




Contains data from ../data/BritishHouseholdPanel.dta
  obs:        63,285                          
 vars:            11                          1 Oct 2014 10:04
 size:     2,594,685                          
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
id              double  %12.0g                Person identifier
age             int     %8.0g      oage       age at date of interview
mastat          byte    %8.0g      omastat    marital status
year            int     %8.0g      odoiy4     Year of interview
mental          float   %9.0g                 Mental Distress
woman           float   %9.0g                 
married         float   %9.0g                 
couple          float   %9.0g                 
separated       float   %9.0g                 
divorced     

In [2]:
regress mental age woman couple separated divorced never_married


      Source |       SS           df       MS      Number of obs   =    62,549
-------------+----------------------------------   F(6, 62542)     =    208.54
       Model |  1067.66694         6  177.944491   Prob > F        =    0.0000
    Residual |  53366.0332    62,542  .853283125   R-squared       =    0.0196
-------------+----------------------------------   Adj R-squared   =    0.0195
       Total |  54433.7002    62,548  .870270835   Root MSE        =    .92373

-------------------------------------------------------------------------------
       mental |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
          age |   .0024708   .0003504     7.05   0.000      .001784    .0031576
        woman |   .2023694   .0074189    27.28   0.000     .1878285    .2169104
       couple |   .0075139   .0105725     0.71   0.477    -.0132083    .0282361
    separated |   .5583078   .0312398    17.

In [3]:
xtreg mental age woman couple separated divorced never_married, fe

note: woman omitted because of collinearity

Fixed-effects (within) regression               Number of obs     =     62,549
Group variable: id                              Number of groups  =      7,584

R-sq:                                           Obs per group:
     within  = 0.0051                                         min =          1
     between = 0.0014                                         avg =        8.2
     overall = 0.0032                                         max =         15

                                                F(5,54960)        =      56.33
corr(u_i, Xb)  = -0.0459                        Prob > F          =     0.0000

-------------------------------------------------------------------------------
       mental |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
          age |   .0045017   .0008475     5.31   0.000     .0028406    .0061628
        woman | 

Firstly, notice how the coefficient for *woman* has been excluded from the model: this is because this measure does not vary within our units and thus cannot contribute to explaining variation in mental distress for a given person.

Secondly, there are substantive differences between the results of the Pooled OLS and Fixed Effects models. Mehmetoglu and Jakobsen (2016, p. 248):
> ...mental distress is higher for those who are separated, but actually a bit lower for divorcees than for those who are married. This is calculated by looking at the within variation (changes in persons going through different categories) and tells us that the end of a marriage (usually in the form of separation) is stressful, but that people get over the distress (later, when divorced).
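To compare the two sets of results without scrolling between outputs, one option is to store each model and print them side by side. This is a sketch; the stored names `pooled_ols` and `fixed_eff` are hypothetical.

```stata
* Store the Pooled OLS results.
quietly regress mental age woman couple separated divorced never_married
estimates store pooled_ols

* Store the Fixed Effects results (fitted last, so they stay the active estimates).
quietly xtreg mental age woman couple separated divorced never_married, fe
estimates store fixed_eff

* Coefficients and standard errors in adjacent columns.
estimates table pooled_ols fixed_eff, b(%9.4f) se(%9.4f) stats(N)
```

Fitting the Fixed Effects model last means any postestimation command that follows still refers to it.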

### Postestimation

We can use some of Stata's postestimation commands to learn more about the results of our Fixed Effects model. For example, what if we wanted to know

In [4]:
predict pr_mental, xb
predict fixed, u


(2 missing values generated)

(736 missing values generated)


In [5]:
l id year fixed in 1/100


     +-----------------------------+
     |       id   year       fixed |
     |-----------------------------|
  1. | 10014578   1991   -.2738157 |
  2. | 10014578   1992   -.2738157 |
  3. | 10014578   1993   -.2738157 |
  4. | 10014578   1996   -.2738157 |
  5. | 10014578   1998   -.2738157 |
     |-----------------------------|
  6. | 10014578   1999   -.2738157 |
  7. | 10014578   2000   -.2738157 |
  8. | 10014608   1991    -1.23806 |
  9. | 10014608   1992    -1.23806 |
 10. | 10014608   1993    -1.23806 |
     |-----------------------------|
 11. | 10014608   1996    -1.23806 |
 12. | 10014608   1998    -1.23806 |
 13. | 10016813   1992    .5171139 |
 14. | 10016813   1994    .5171139 |
 15. | 10016813   1999    .5171139 |
     |-----------------------------|
 16. | 10016848   1992    .3650586 |
 17. | 10016848   1994    .3650586 |
 18. | 10016848   1999    .3650586 |
 19. | 10016848   2000    .3650586 |
 20. | 10016848   2001    .3650586 |
     |-----------------------------|
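Since the unit effect is repeated on every row for a person, summarising it directly over-weights people who are observed more often. A hedged sketch of a per-person summary follows, together with two further `predict` options; it assumes the `xtreg, fe` results are still the active estimates and that *fixed* was created as above. The new variable names are hypothetical.

```stata
* Idiosyncratic residual e_it.
predict resid_e, e

* Linear prediction plus the unit effect.
predict pr_xbu, xbu

* Flag one row per person among those with an estimated unit effect.
egen one_per_id = tag(id) if !missing(fixed)

* Distribution of the unit effects, counting each person once.
summarize fixed if one_per_id, detail
```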


## Random Effects

Random Effects can be used when there is no or little association between the error term and explanatory variables, that is $\text{cov}(x_{it}, c_{i}) = 0$.

> We can perform a Hausman test, as described on p. 240, which tests if both the fixed and random effects are consistent estimators. If this holds, the random effects model is more efficient, and its standard errors should be smaller than those of the fixed effects model.

Its standard errors should be smaller because it takes account of more variation in the outcome: it uses the observed variation between individuals as well as the variation within them.

RE is a matrix-weighted average of the FE and BE estimators. Units that are only observed at one data point are included, but they only contribute through the between effects estimator.

Using RE allows us to speak of our explanatory variables in two ways:
* Average change in Y for a one-unit change in X between units.
* Average change in Y for a one-unit change in X within units.

\begin{equation} y_{it} = \beta_0^{RE} + \beta_1^{RE}x_{1it} + \beta_2^{RE}x_{2i} + v_{i} + e_{it} \end{equation}

Note the inclusion of a unit-specific error term (instead of unit-specific intercept in FE).
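A sketch of how RE sits between Pooled OLS and FE: the estimator can be written as OLS on partially demeaned data, where only a fraction $\theta_i$ of each unit mean is subtracted. The symbols $\theta_i$, $T_i$ (the number of observations for unit *i*), $\sigma_v$ and $\sigma_e$ are introduced here for illustration.

\begin{equation} y_{it} - \theta_i\bar{y}_{i} = \beta_0^{RE}(1 - \theta_i) + \beta_1^{RE}(x_{1it} - \theta_i\bar{x}_{1i}) + \beta_2^{RE}(1 - \theta_i)x_{2i} + \left[(1 - \theta_i)v_{i} + (e_{it} - \theta_i\bar{e}_{i})\right] \end{equation}

\begin{equation} \theta_i = 1 - \sqrt{\frac{\sigma_e^2}{T_i\sigma_v^2 + \sigma_e^2}} \end{equation}

When $\theta_i = 0$ nothing is subtracted and the model reduces to Pooled OLS; as $\theta_i$ approaches 1 the full unit mean is removed and the estimates approach those of FE. The time-invariant $x_{2i}$ survives the transformation whenever $\theta_i < 1$, which is why RE can estimate its coefficient.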

Advantages:
* Can include units who appear only once in the data set (contribute to BE part of model).
* Can include observed time-invariant variables i.e., important factors that do not vary within units. These are disregarded by FE model for obvious reasons.
* Allow you to speak about the effects of variables within and across units.
* More efficient estimates of coefficients i.e., less uncertainty around the coefficient for a variable.

Technical aspect of RE: The unit-specific effect is considered to be drawn from a probability distribution (Cameron & Trivedi, 2005).

Gayle and Lambert (2018, p. 63).
> This means that the random effects model does not estimate a parameter for each individual respondent, but it does include a parameter that summarizes the overall distribution of individual respondent's differences (e.g., a variance estimate for this distribution). 

Because RE assumes the residuals and explanatory variables are uncorrelated, the coefficients of the latter are considered unbiased. The aim is therefore to soak up as much of the remaining variation as possible, resulting in more efficient estimates. This contrasts with FE, where the aim is to estimate the coefficients more accurately (net of the unit-specific effect). [*Can I show examples of this?*]

Random Effects is also known as a random intercepts or variance components model, and it is commonly employed as part of multilevel modelling approaches for panel and non-panel data alike.

## Selecting between models

### Pooled OLS vs Panel Model?

Can we ignore the panel component of the data? More formally, is it correct to assume that the error terms are independent across observations? We can perform an *autocorrelation* test to check whether this assumption is met:

In [None]:
*net sj 3-2 st0039
*net install st0039

xtserial mental age woman couple separated divorced never_married

The results of the Wooldridge test strongly suggest the error terms are correlated across observations. In practice this means that the values of these variables typically vary less *within* than across units. An obvious example is the `age` variable: observe how the standard deviation within a unit is smaller than between units. This is unsurprising when you consider this is an unbalanced panel where most individuals only appear a handful of times in the sample, and usually in consecutive years.

In [4]:
xtsum age


Variable         |      Mean   Std. Dev.       Min        Max |    Observations
-----------------+--------------------------------------------+----------------
age      overall |  41.28089   11.30308         20         64 |     N =   63285
         between |             11.93296         20         64 |     n =    7602
         within  |             3.609779   31.90589   56.58089 | T-bar = 8.32478


[*Explain `xtsum` results using a more intuitive example*]

* Overall results are the familiar summary statistics, computed across all observations.
* Between results collapse the data set down to one row per unit, hence the slightly different figures from the overall results. Min and Max now refer to unit averages.
* Within results are based on the difference between the observed value for a unit in a given period and that unit's mean value across all periods, with the global mean added back on (which is why the within Min and Max look counter-intuitive).
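The between and within rows can be approximated by hand, which may make the `xtsum` output easier to read. This is a sketch assuming the BHPS data are in memory; the new variable names are hypothetical.

```stata
* Each person's mean age across the waves in which they appear.
bysort id: egen age_unit_mean = mean(age)

* Within component: deviation from the person's own mean, plus the global mean.
quietly summarize age
generate age_within = age - age_unit_mean + r(mean)
summarize age_within

* Between component: the person means, counted once per person.
egen first_row = tag(id)
summarize age_unit_mean if first_row
```

The Min and Max of `age_within` should line up with the within row of the `xtsum` table, and the summary of `age_unit_mean` with the between row.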

In [5]:
l id age in 1/50


     +----------------+
     |       id   age |
     |----------------|
  1. | 10014578    54 |
  2. | 10014578    55 |
  3. | 10014578    56 |
  4. | 10014578    59 |
  5. | 10014578    61 |
     |----------------|
  6. | 10014578    62 |
  7. | 10014578    63 |
  8. | 10014608    57 |
  9. | 10014608    58 |
 10. | 10014608    59 |
     |----------------|
 11. | 10014608    62 |
 12. | 10014608    64 |
 13. | 10016813    37 |
 14. | 10016813    39 |
 15. | 10016813    44 |
     |----------------|
 16. | 10016848    33 |
 17. | 10016848    35 |
 18. | 10016848    40 |
 19. | 10016848    41 |
 20. | 10016848    42 |
     |----------------|
 21. | 10016848    43 |
 22. | 10016848    44 |
 23. | 10016848    45 |
 24. | 10016848    46 |
 25. | 10017933    49 |
     |----------------|
 26. | 10017933    49 |
 27. | 10017933    51 |
 28. | 10017933    52 |
 29. | 10017933    54 |
 30. | 10017933    55 |
     |----------------|
 31. | 10017933    56 |
 32. | 10017933    58 |
 33. | 10017933

The presence of serial (auto) correlation suggests we cannot ignore the panel component of the data. However, that does not mean we need to estimate a panel model. We could instead use the `cluster()` option of `regress` to relax the assumption that the error terms are independent. Let's remind ourselves of the results from the Pooled OLS model:

In [None]:
regress mental age woman couple separated divorced never_married

Now let's control for the presence of serial correlation in the data. Notice how the standard errors of the coefficients are larger than those in the model without the `cluster(id)` option:

In [None]:
regress mental age woman couple separated divorced never_married, cluster(id)

We no longer have underestimated standard errors, resulting in more accurate t tests of the coefficients (though in the above example no coefficient shifted from statistically significant to insignificant). However, we may still want to control for unit-specific differences in the outcome: is some of the variation in the outcome explained by unobserved heterogeneity?

In [None]:
xtreg mental age woman couple separated divorced never_married, re
xttest0

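`xttest0` performs the Breusch and Pagan Lagrange multiplier test, whose null hypothesis is that the variance of the unit-specific effects is zero. Rejection of the null hypothesis suggests that there is a panel effect on the outcome, and that a Random Effects model is preferred over Pooled OLS.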

### Fixed vs Random Effects?

For most repeated contacts data sets, it would be erroneous to ignore the panel component of the data, even after controlling for autocorrelation of the error terms. We then have a choice between Fixed Effects and Random Effects. The *Hausman* test checks whether the coefficients from the two models differ systematically. The null hypothesis of this test is that the coefficients from the Random Effects model are consistent; put another way, it checks whether the estimates from the Random and Fixed Effects models are sufficiently similar. Failure to reject the null hypothesis provides evidence in favour of the Random Effects model; otherwise the Fixed Effects model is more appropriate.
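
For reference, the statistic behind this comparison is conventionally written as below. The notation is an assumption introduced here rather than taken from the notebook: $\hat{\beta}_{FE}$ and $\hat{\beta}_{RE}$ are the vectors of coefficients common to both models, $\widehat{V}(\cdot)$ denotes their estimated covariance matrices, and $k$ is the number of coefficients compared.

```latex
H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})'
    \left[ \widehat{V}(\hat{\beta}_{FE}) - \widehat{V}(\hat{\beta}_{RE}) \right]^{-1}
    (\hat{\beta}_{FE} - \hat{\beta}_{RE}),
\qquad H \overset{a}{\sim} \chi^2_k \ \text{under } H_0
```

A large value of $H$ relative to the $\chi^2_k$ distribution indicates that the two sets of estimates are too far apart for the Random Effects estimates to be treated as consistent.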

In [None]:
quietly xtreg mental age woman couple separated divorced never_married, fe
estimates store fixed

quietly xtreg mental age woman couple separated divorced never_married, re
estimates store random

hausman fixed random

In our example, it appears that the coefficients from the Random Effects model are inconsistent and thus the Fixed Effects model should be preferred. Often you'll find that the Hausman test favours the FE model but this isn't definitive proof that the FE model is more appropriate.


Pieces of information relevant to your decision:
* Is your dependent variable influenced by changes within units or between units? If the former is the main analytical consideration then the FE model will be most appropriate; if the latter then the BE model might suffice. If both, then the RE model is most appropriate.
* Are you concerned with omitted variable bias (OVB)? That is, are you worried that the coefficients of your explanatory variables are biased due to the exclusion of unobserved time-invariant covariates from the model? If yes, then the FE model is preferred; if not, then RE. Look for the `corr(u_i, Xb)` statistic in the FE model results to check whether the unit-specific effects are correlated with the explanatory variables (an indicator that OVB is an issue).
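
The `corr(u_i, Xb)` figure mentioned in the last point can also be retrieved after estimation instead of being read off the output header. A minimal sketch, assuming the Fixed Effects model is the most recent estimation and relying on the `e(corr)` result that `xtreg, fe` stores:

```stata
* Re-estimate the Fixed Effects model without printing the full table
quietly xtreg mental age woman couple separated divorced never_married, fe

* e(corr) holds the correlation between the unit effects u_i and the fitted values Xb
display "corr(u_i, Xb) = " %6.4f e(corr)
```

There is no formal cut-off for this statistic; a value far from zero is simply a warning sign that the Random Effects assumption of uncorrelated unit effects may not hold.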

## Tips and tricks

### Decomposing variables into between and within effects

Show example based on Gould (n.d.).
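
Until that example is written up, the general idea can be sketched as follows. This is not Gould's own code: it is a minimal illustration using `age`, and the names `age_between` and `age_within` are hypothetical variables created for the purpose.

```stata
* Between component: each person's mean age across all of their observations
bysort id: egen age_between = mean(age)

* Within component: deviation of each observation from that person's own mean
generate age_within = age - age_between

* Enter both components separately so that each receives its own coefficient
xtreg mental age_between age_within woman couple separated divorced never_married, re
```

In a specification like this, the coefficient on `age_within` reflects change over time within individuals, while the coefficient on `age_between` reflects differences between individuals with different average ages.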

### Time dummies

"Control for time effects whenever unexpected variation or special events my affect the outcome variable" (Torres-Reyna, 2007). We can control for time effects by including dummy variables for *n - 1* time periods in the data:

In [None]:
xtreg mental age woman couple separated divorced never_married i.year, fe

We can test whether the coefficients on the time dummies are jointly equal to 0: if so, we can conclude that the effect of time on the outcome is globally statistically insignificant. In our example we reject the null hypothesis that all of the coefficients for time are equal to 0, and thus we should control for the effect of time in our model.

In [None]:
testparm i.year