In [22]:
%pip install linearmodels



In [23]:
from __future__ import division
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

import linearmodels as lm
from linearmodels import PanelOLS
from linearmodels import RandomEffects
from linearmodels import FirstDifferenceOLS

In [24]:
g = pd.read_csv("panel-for-R.csv")
g.head()

Unnamed: 0,idnum,panelwave,ballot,form,formwt,oversamp,sample,panstat_2,panstat_3,mar1,...,wtpan12,wtpan123,wtpannr12,wtpannr123,xmarsex,xmovie,xnorcsiz,year,yearval,zodiac
0,9,1,3,2,1,1,9,1,1,5.0,...,0.414689,0.487828,0.435503,0.470575,1.0,2.0,1.0,2006.0,,9.0
1,9,2,3,2,1,1,9,1,1,5.0,...,0.414689,0.487828,0.435503,0.470575,1.0,1.0,1.0,2008.0,,9.0
2,9,3,3,2,1,1,9,1,1,1.0,...,0.414689,0.487828,0.435503,0.470575,1.0,1.0,1.0,2010.0,,9.0
3,10,1,1,1,1,1,9,1,1,5.0,...,0.829377,0.858741,0.766632,0.828371,1.0,,1.0,2006.0,,3.0
4,10,2,1,1,1,1,9,1,1,5.0,...,0.829377,0.858741,0.766632,0.828371,1.0,,1.0,2008.0,,3.0


### 1. Run a naive ("pooled") OLS regression on the panel data. Tell me how you expect your Xs to affect your Y and why. Interpret your results.

My X is the respondent's age when their first child was born (agekdbrn).

My Y is the respondent's income (realrinc).

I expect a positive relationship: respondents who had their first child later tend to have higher incomes.

In [27]:
lm_ols = smf.ols(formula='realrinc ~ agekdbrn', data=g).fit()
print(lm_ols.summary())

                            OLS Regression Results                            
Dep. Variable:               realrinc   R-squared:                       0.032
Model:                            OLS   Adj. R-squared:                  0.031
Method:                 Least Squares   F-statistic:                     65.34
Date:                Thu, 07 Dec 2023   Prob (F-statistic):           1.08e-15
Time:                        23:25:15   Log-Likelihood:                -24065.
No. Observations:                1982   AIC:                         4.813e+04
Df Residuals:                    1980   BIC:                         4.814e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -8935.2949   4415.926     -2.023      0.0

In the pooled OLS regression, each additional year of age at first birth is associated with $1,437 more in income on average. This supports my expectation that people who have children later earn more. The coefficient is highly statistically significant (p < 0.001), and the model explains about 3 percent of the variance in income.

The skew is as high as 8, which indicates a strongly right-skewed distribution, so let us run a log-log model.
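As a minimal sketch of why logging helps with skew, the snippet below compares the skewness of a simulated right-skewed variable before and after a log transform. The data are synthetic (a lognormal draw standing in for income), not the panel data, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a right-skewed income variable (NOT the panel data).
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=5000))

# Sample skewness before and after taking logs.
raw_skew = income.skew()
log_skew = np.log(income).skew()

print(f"skew of raw values:    {raw_skew:.2f}")
print(f"skew of logged values: {log_skew:.2f}")
```

The logged series should have a skew close to zero, while the raw series is strongly right-skewed. The same check could be run on `g['realrinc']` and `g['lnrealrinc']` once those columns exist.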

In [25]:
g['lnrealrinc'] = np.log(g['realrinc'])
g['lnagekdbrn'] = np.log(g['agekdbrn'])

In [26]:
g[["lnrealrinc", 'lnagekdbrn']].describe()

Unnamed: 0,lnrealrinc,lnagekdbrn
count,2847.0,3560.0
mean,9.592979,3.149982
std,1.16226,0.221506
min,5.556828,2.564949
25%,9.131094,2.995732
50%,9.824241,3.135494
75%,10.290534,3.295837
max,13.081842,4.025352


In [28]:
lm_ols2 = smf.ols(formula='lnrealrinc ~ lnagekdbrn', data=g).fit()
print(lm_ols2.summary())

                            OLS Regression Results                            
Dep. Variable:             lnrealrinc   R-squared:                       0.072
Model:                            OLS   Adj. R-squared:                  0.072
Method:                 Least Squares   F-statistic:                     154.4
Date:                Thu, 07 Dec 2023   Prob (F-statistic):           3.41e-34
Time:                        23:25:16   Log-Likelihood:                -3018.2
No. Observations:                1982   AIC:                             6040.
Df Residuals:                    1980   BIC:                             6052.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.2330      0.351     14.899      0.0

In the log-log model, a 1% increase in age at first birth is associated with a 1.378% increase in income on average. This is also highly statistically significant. The R-squared shows that the model now explains about 7% of the variance in log income.
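The "1% in X goes with beta% in Y" reading of a log-log slope is an approximation that works well for small changes. The sketch below shows the exact implied change, using the slope quoted above; `pct_change_in_y` is a hypothetical helper written for this illustration, not part of statsmodels.

```python
def pct_change_in_y(beta, pct_change_in_x):
    """Exact percent change in y implied by a log-log slope `beta`
    for a given percent change in x (hypothetical helper)."""
    return ((1 + pct_change_in_x / 100) ** beta - 1) * 100


beta = 1.378  # log-log slope quoted in the text above

small = pct_change_in_y(beta, 1)   # effect of a 1% increase in x
large = pct_change_in_y(beta, 10)  # effect of a 10% increase in x

print(f"1% increase in x:  {small:.3f}% change in y")
print(f"10% increase in x: {large:.3f}% change in y")
```

For a 1% change the exact figure is very close to the coefficient itself. For larger changes it drifts away from the simple "multiply by beta" rule.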

### 2. Run a first differences regression on the same model in Question 1. Interpret your results. Do you draw a different conclusion than in Question 1? Explain.

In [29]:
g = g.set_index(['idnum', 'panelwave'])
lm_fd = FirstDifferenceOLS.from_formula('realrinc ~ agekdbrn', g)
res_fd = lm_fd.fit(cov_type='clustered', cluster_entity=True)
print(res_fd)

                     FirstDifferenceOLS Estimation Summary                      
Dep. Variable:               realrinc   R-squared:                        0.0015
Estimator:         FirstDifferenceOLS   R-squared (Between):              0.3070
No. Observations:                 845   R-squared (Within):               0.0015
Date:                Thu, Dec 07 2023   R-squared (Overall):              0.2426
Time:                        23:25:16   Log-likelihood                -1.033e+04
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      1.2798
Entities:                        1007   P-value                           0.2583
Avg Obs:                       1.9682   Distribution:                   F(1,844)
Min Obs:                       1.0000                                           
Max Obs:                       3.0000   F-statistic (robust):             0.4570
                            

Inputs contain missing values. Dropping rows with missing observations.
  super().__init__(dependent, exog, weights=weights, check_rank=check_rank)


In the first-differences regression, a one-year increase in reported age at first birth is associated with $797 more in income on average. The result is in the same direction as the pooled OLS result, but the magnitude is smaller. It is not statistically significant, as the p-value is near 0.5, and the R-squared shows that the model explains only 0.15% of the variance in changes in income. So, unlike in Question 1, the first-differences model gives no reliable evidence that age at first birth is related to income.
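To make the contrast between pooled OLS and first differences concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the toy panel, the true slope of 2, and the person-specific effect `alpha` that is correlated with `x`. The first-differences slope is computed by hand (differencing within person, then a regression without an intercept) rather than with `FirstDifferenceOLS`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_people, n_waves = 500, 3
true_slope = 2.0  # assumed for the simulation

# Each person has a fixed unobserved trait `alpha` that raises both x and y.
idnum = np.repeat(np.arange(n_people), n_waves)
alpha = np.repeat(rng.normal(size=n_people), n_waves)
x = alpha + rng.normal(size=n_people * n_waves)
y = true_slope * x + 5 * alpha + rng.normal(size=n_people * n_waves)
toy = pd.DataFrame({"idnum": idnum, "x": x, "y": y})

# Pooled OLS slope: ignores the panel structure, so it absorbs alpha.
pooled_slope = np.polyfit(toy["x"], toy["y"], 1)[0]

# First differences: wave-to-wave changes within each person remove alpha.
d = toy.groupby("idnum")[["x", "y"]].diff().dropna()
fd_slope = (d["x"] * d["y"]).sum() / (d["x"] ** 2).sum()  # no intercept

print(f"pooled OLS slope:        {pooled_slope:.2f}")
print(f"first-differences slope: {fd_slope:.2f}")
```

In this simulation the pooled slope is pushed well above the true value by the omitted person-level trait, while the first-differences slope stays close to it. Whether a similar mechanism explains the gap between the two estimates in the panel data is not something the simulation can establish.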

Let us see a first differences model with logged income and logged age when the first child was born.

In [30]:
lm_fd2 = FirstDifferenceOLS.from_formula('lnrealrinc ~ lnagekdbrn', g)
res_fd2 = lm_fd2.fit(cov_type='clustered', cluster_entity=True)
print(res_fd2)

Inputs contain missing values. Dropping rows with missing observations.
  super().__init__(dependent, exog, weights=weights, check_rank=check_rank)


                     FirstDifferenceOLS Estimation Summary                      
Dep. Variable:             lnrealrinc   R-squared:                        0.0003
Estimator:         FirstDifferenceOLS   R-squared (Between):             -0.1414
No. Observations:                 845   R-squared (Within):              -0.0003
Date:                Thu, Dec 07 2023   R-squared (Overall):             -0.1398
Time:                        23:25:16   Log-likelihood                   -1181.1
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      0.2924
Entities:                        1007   P-value                           0.5888
Avg Obs:                       1.9682   Distribution:                   F(1,844)
Min Obs:                       1.0000                                           
Max Obs:                       3.0000   F-statistic (robust):             0.2531
                            

Now the result is in the opposite direction from the log-log OLS model: a 1% increase in age at first birth is associated with a 0.207% decrease in income. The relationship between changes in log age at first birth and changes in log income is even further from statistical significance, with a p-value above 0.6. The R-squared shows that these changes explain only 0.03% of the variance in changes in logged income.
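One possible reason for the weak first-differences results is that age at first birth should, in principle, stay fixed once a respondent has a child, so the estimate may rest on only the few respondents whose reported value changes between waves. That is an assumption about the data, and it can be checked. The sketch below shows the check on a small made-up panel; the values are invented and only the column names match the data set.

```python
import numpy as np
import pandas as pd

# Made-up panel: three respondents, three waves each (illustrative values).
toy = pd.DataFrame({
    "idnum": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "panelwave": [1, 2, 3] * 3,
    "agekdbrn": [25, 25, 25, np.nan, 30, 30, 22, 22, 23],
})

# Count distinct non-missing values of agekdbrn per respondent.
n_values = toy.groupby("idnum")["agekdbrn"].nunique()

# Respondents with more than one distinct value contribute nonzero differences.
changers = n_values[n_values > 1].index.tolist()

print(f"respondents whose agekdbrn changes: {changers}")
print(f"share of respondents: {len(changers) / n_values.size:.2f}")
```

On the real data, where `idnum` has been moved into the index, the grouping step would be `g.groupby(level="idnum")["agekdbrn"].nunique()`.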