<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Pooling-observations" data-toc-modified-id="Pooling-observations-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Pooling observations</a></span></li></ul></div>

In [2]:
import pandas as pd
import ipystata

pd.options.display.float_format = '{:,.0f}'.format  # show floats with thousands separators and no decimals

## Pooling observations

The potential problem with pooling observations is that doing so assumes they are independent of each other. For example, here are some fictional longitudinal data: we have five individuals, and each person was surveyed four times (2015-2018):

In [3]:
panel_data = pd.read_csv("../data/lda-simple-example-2020-08-28.csv", index_col=False)
panel_data.index += 1
panel_data

Unnamed: 0,pid,year,sex,age,income
1,10001,2015,male,22,20000
2,10001,2016,male,23,20000
3,10001,2017,male,24,22000
4,10001,2018,male,25,24000
5,10002,2015,female,45,29000
6,10002,2016,female,46,29000
7,10002,2017,female,47,29000
8,10002,2018,female,48,29500
9,10003,2015,female,31,41500
10,10003,2016,female,32,42400


From a statistical modelling perspective, treating panel data - multiple observations on the same individuals over time - as a set of pooled observations is akin to having the following data set to begin with:

In [4]:
pooled = pd.read_csv("../data/lda-simple-example-pooled-2020-08-28.csv", index_col=False)
pooled.index += 1
pooled

Unnamed: 0,pid,year,sex,age,income
1,10001,2015,male,22,20000
2,10002,2016,male,23,20000
3,10003,2017,male,24,22000
4,10004,2018,male,25,24000
5,10005,2015,female,45,29000
6,10006,2016,female,46,29000
7,10007,2017,female,47,29000
8,10008,2018,female,48,29500
9,10009,2015,female,31,41500
10,10010,2016,female,32,42400


That is, we're ignoring the panel component and acting as if we had one observation on each of 20 individuals, as opposed to four observations on each of five individuals. Ignoring the panel component can lead us to underestimate the uncertainty in our estimates, i.e., to report standard errors that are too small.
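A quick way to see what pooling throws away is to count distinct individuals. The sketch below retypes only the `pid` values from the ten rows displayed in each table above (not the full 20-row files), so the counts are illustrative; `panel_pid` and `pooled_pid` are hypothetical names.

```python
import pandas as pd

# pid values retyped from the ten rows displayed above (not the full files)
panel_pid = pd.Series([10001] * 4 + [10002] * 4 + [10003] * 2)
pooled_pid = pd.Series(range(10001, 10011))

# Rows per distinct person: a value above 1 signals repeated measures
panel_ratio = len(panel_pid) / panel_pid.nunique()
pooled_ratio = len(pooled_pid) / pooled_pid.nunique()
print(panel_ratio, pooled_ratio)
```

In the pooled version every row looks like a new person, which is exactly the independence assumption the text describes.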

For example, let's explore the correlation between income and sex.

In [8]:
%%stata -d panel_data

gen fem = (sex=="female")
table fem, c(mean income sd income)
regress income fem


--------------------------------------
      fem | mean(income)    sd(income)
----------+---------------------------
        0 |        17625      6545.173
        1 |  29266.66667       11863.8
--------------------------------------

      Source |       SS           df       MS      Number of obs   =        20
-------------+----------------------------------   F(1, 18)        =      6.34
       Model |   650536333         1   650536333   Prob > F        =    0.0215
    Residual |  1.8481e+09        18   102673426   R-squared       =    0.2604
-------------+----------------------------------   Adj R-squared   =    0.2193
       Total |  2.4987e+09        19   131508316   Root MSE        =     10133

------------------------------------------------------------------------------
      income |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         fem |   11641.67   4624.965     2.52   0.02

Now let's adjust our estimates for the fact that observations are nested (clustered) within individuals and see what influence this has (if any):

In [7]:
%%stata

regress income fem, cluster(pid)


Linear regression                               Number of obs     =         20
                                                F(1, 4)           =       2.05
                                                Prob > F          =     0.2253
                                                R-squared         =     0.2604
                                                Root MSE          =      10133

                                    (Std. Err. adjusted for 5 clusters in pid)
------------------------------------------------------------------------------
             |               Robust
      income |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         fem |   11641.67    8127.32     1.43   0.225    -10923.39    34206.72
       _cons |      17625   3147.402     5.60   0.005      8886.41    26363.59
------------------------------------------------------------------------------



We see that the coefficient for *fem* is unchanged (11,641.67) but that its standard error is much larger (8,127 rather than 4,625), and thus the coefficient is no longer statistically significant at the 5% level (p = 0.225 rather than 0.0215).
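To make the adjustment less of a black box, here is a minimal numpy sketch of a cluster-robust ("sandwich") variance for this regression. It rests on several assumptions: it uses the per-person mean incomes from the collapsed table shown at the end of this notebook, it assumes four observations per person, and it applies the finite-sample factor G/(G-1) × (N-1)/(N-k), which we understand to be the adjustment Stata uses with `cluster()`. Because *fem* is constant within a person, each cluster's contribution depends only on that person's summed residuals, so the row-level data are not needed here. All variable names are hypothetical.

```python
import numpy as np

# Per-person mean incomes, retyped from the collapsed table in this notebook
fem = np.array([0, 0, 1, 1, 1], dtype=float)         # pids 10001, 10004, 10002, 10003, 10005
mean_inc = np.array([21500, 13750, 29125, 43175, 15500], dtype=float)
n_per_person = 4                                      # assumption: balanced panel, 4 waves each

G = len(fem)                 # number of clusters (persons)
N = G * n_per_person         # number of observations
k = 2                        # estimated parameters: constant and fem

# One design row per person; every observation of that person shares it
X_g = np.column_stack([np.ones(G), fem])
XtX = n_per_person * X_g.T @ X_g
Xty = n_per_person * X_g.T @ mean_inc
beta = np.linalg.solve(XtX, Xty)                      # OLS: [constant, fem]

# Cluster score X_g'u_g = design row times the person's summed residuals
resid_sum = n_per_person * (mean_inc - X_g @ beta)
scores = X_g * resid_sum[:, None]

bread = np.linalg.inv(XtX)
meat = scores.T @ scores
correction = (G / (G - 1)) * ((N - 1) / (N - k))      # assumed Stata-style adjustment
V = correction * bread @ meat @ bread
se = np.sqrt(np.diag(V))                              # [se(constant), se(fem)]

print(beta, se)
```

Under these assumptions the sketch should give standard errors of about 3,147 for the constant and 8,127 for *fem*, in line with the clustered Stata output above. The key design point is that residuals are summed within each person before being squared, so persistent person-level differences are not averaged away as they are in the default formula.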

To understand why this is the case, let's consider variation from two perspectives:
* within individuals
* between individuals
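As a rough pandas counterpart to the Stata cell that follows, the sketch below attaches each person's own mean and standard deviation of income to every row. It uses only the ten rows displayed earlier, so person 10003 is incomplete (two of four waves) and that person's figures will not match the full-data results; the values are also left unrounded rather than rounded up with `ceil`.

```python
import pandas as pd

# The ten rows displayed earlier; person 10003 has only 2 of 4 waves here
df = pd.DataFrame({
    "pid": [10001] * 4 + [10002] * 4 + [10003] * 2,
    "income": [20000, 20000, 22000, 24000,
               29000, 29000, 29000, 29500,
               41500, 42400],
})

by_person = df.groupby("pid")["income"]
df["ind_mean_inc"] = by_person.transform("mean")  # within-person mean, repeated on each row
df["ind_sd_inc"] = by_person.transform("std")     # within-person SD (sample SD, ddof=1)

print(df.drop_duplicates("pid")[["pid", "ind_mean_inc", "ind_sd_inc"]])
```

`transform` keeps the original row count, which mirrors what `bys pid: egen` does in Stata, whereas `agg` would return one row per person.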

In [44]:
%%stata -o var_df

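* For each statistic (mean, sd), compute the per-person and whole-sample values of income
foreach stat in mean sd {
    bys pid: egen ind_`stat'_inc = `stat'(income)
    egen ov_`stat'_inc = `stat'(income)
    * Round up to whole numbers for display
    replace ind_`stat'_inc = ceil(ind_`stat'_inc)
    replace ov_`stat'_inc = ceil(ov_`stat'_inc)
}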

(0 real changes made)
(0 real changes made)
(12 real changes made)
(20 real changes made)



In [49]:
var_df

Unnamed: 0,pid,year,sex,age,income,fem,ind_mean_inc,ov_mean_inc,ind_sd_inc,ov_sd_inc
0,10001,2015,male,22,20000,0,21500,24610,1915,11468
1,10001,2016,male,23,20000,0,21500,24610,1915,11468
2,10001,2017,male,24,22000,0,21500,24610,1915,11468
3,10001,2018,male,25,24000,0,21500,24610,1915,11468
4,10002,2015,female,45,29000,1,29125,24610,250,11468
5,10002,2016,female,46,29000,1,29125,24610,250,11468
6,10002,2017,female,47,29000,1,29125,24610,250,11468
7,10002,2018,female,48,29500,1,29125,24610,250,11468
8,10003,2015,female,31,41500,1,43175,24610,1542,11468
9,10003,2016,female,32,42400,1,43175,24610,1542,11468


In [54]:
%%stata -o collapsed_df

collapse (mean) mean_inc=income (sd) sd_inc=income, by(fem pid)

In [55]:
collapsed_df

Unnamed: 0,pid,fem,mean_inc,sd_inc
0,10001,0,21500,1915
1,10004,0,13750,7500
2,10002,1,29125,250
3,10003,1,43175,1541
4,10005,1,15500,1732
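One way to read the collapsed table is to compare the spread of the person means (between-individual variation) with the typical spread inside a person (within-individual variation). The sketch below retypes the values displayed above; the `sd_inc` figures are the rounded ones shown, and averaging the person-level standard deviations is just one simple summary we have chosen for illustration, not a formal variance decomposition.

```python
import pandas as pd

# Values retyped from the collapsed table above (sd_inc as displayed, i.e. rounded)
collapsed = pd.DataFrame({
    "pid": [10001, 10004, 10002, 10003, 10005],
    "fem": [0, 0, 1, 1, 1],
    "mean_inc": [21500, 13750, 29125, 43175, 15500],
    "sd_inc": [1915, 7500, 250, 1541, 1732],
})

between_sd = collapsed["mean_inc"].std()        # spread of the five person means
typical_within_sd = collapsed["sd_inc"].mean()  # average of the person-level SDs

print(round(between_sd), round(typical_within_sd))
```

The between-person spread comes out at roughly 12,000 against roughly 2,600 within persons. Most of the variation in income is therefore between people, so four observations on the same person carry far less independent information than observations on four different people, which is consistent with the larger clustered standard error.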
