In [1]:
import pandas as pd
from statsmodels.regression.linear_model import OLS
import numpy as np
from linearmodels.iv import IV2SLS

# 14.32 Problem Set 5

## Problem 4

We load the 2018 Current Population Survey Merged Outgoing Rotation Groups data, distributed by the National Bureau of Economic Research.

In [2]:
full_df = pd.read_stata('morg18.dta')

We add a constant column to run OLS with a constant later on.

In [3]:
full_df['_cons'] = 1.

### Part A

In this problem, we are interested in determining whether getting married has an impact on one's weekly earnings. To answer this question, we will regress the logarithm of weekly earnings on a binary indicator of married vs. unmarried. There are a few possible confounding variables in this case:

- The age of an individual is likely strongly correlated with both marital status and weekly earnings.
- A person's sex is likely strongly correlated with weekly earnings and might be weakly correlated with marital status.
- A person's race is likely strongly correlated with weekly earnings and might be weakly correlated with marital status.

By controlling for these factors, we might reduce some omitted variables bias.

On the other hand, instead of controlling for these factors, we could use an instrumental variable to remove the bias. In this case, whether or not an individual has a child could act as an instrumental variable:

- Relevance: The presence of a child is likely strongly correlated with an individual's marital status, as there is some moral incentive to get married before (or as the result of) getting pregnant.
- Exclusion: The presence of a child is unlikely to be directly correlated with an individual's weekly earnings. Nonetheless, there might be an indirect correlation with earnings outside of marital status (e.g. a correlation through age).
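
Written out, the two conditions refer to a simple earnings equation. The symbols $\beta_0$, $\beta_1$, and $u_i$ are notation introduced here for illustration and do not come from the problem set:

$$ \log(\text{earnwke}_i) = \beta_0 + \beta_1 \, \text{married}_i + u_i $$
$$ \text{Relevance: } \operatorname{Cov}(\text{ownchild}_i > 0, \ \text{married}_i) \neq 0 $$
$$ \text{Exclusion: } \operatorname{Cov}(\text{ownchild}_i > 0, \ u_i) = 0 $$

If both hold, taking the covariance of each side of the earnings equation with the child indicator gives $\beta_1 = \operatorname{Cov}(\log \text{earnwke}_i, \ \text{ownchild}_i > 0) / \operatorname{Cov}(\text{married}_i, \ \text{ownchild}_i > 0)$, which is the quantity two-stage least squares targets when there is a single instrument.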

As a result, we can try both of these regressions. To do so, we can extract the important columns:

In [4]:
df = full_df[['_cons', 'earnwke', 'age', 'marital',
              'race', 'sex', 'ownchild']]

We are only going to consider people who have a positive weekly income. Therefore, we drop every row with a missing value in any of the selected columns (including missing incomes) and keep only people with income not equal to zero.

In [5]:
df = df.dropna()
df = df[df['earnwke'] != 0.].copy()  # copy so later column assignments act on an independent frame

We don't want to regress on the weekly wage directly; instead, we use the logarithm of the weekly wage.

In [6]:
df['lgearnwke'] = np.log(df['earnwke'])
df = df.drop(columns=['earnwke'])

The data set breaks marital status into several categories. We are only interested in whether someone is married or not.

In [7]:
df['married'] = df['marital'].replace({
    1: 1.,  # Married Civilian Spouse Present
    2: 1.,  # Married AF Spouse Present
    3: 1.,  # Married Spouse Absent
    4: 0.,  # Widowed
    5: 0.,  # Divorced
    6: 0.,  # Separated
    7: 0.,  # Never Married
})
df = df.drop(columns=['marital'])
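
One thing to keep in mind with `replace` is that any code missing from the mapping passes through unchanged. A minimal alternative sketch, assuming `marital` holds the numeric codes 1 to 7 listed above (the toy frame and the name `MARRIED_CODES` are hypothetical), builds the indicator from a membership test instead:

```python
import pandas as pd

# Hypothetical toy data using the same numeric marital codes as above.
toy = pd.DataFrame({'marital': [1, 2, 3, 4, 5, 6, 7]})

# Codes 1-3 are the "married" categories; everything else maps to 0.
MARRIED_CODES = [1, 2, 3]
toy['married'] = toy['marital'].isin(MARRIED_CODES).astype(float)

print(toy)
```

With this form, an unexpected code becomes 0 rather than leaking into the `married` column, so it may be worth checking the distinct values of `marital` first.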

There are many different categories of race represented in this data set. To make things super simple, we are only going to indicate whether someone is white or non-white.

In [8]:
df['white'] = (df['race'] == 1).astype(np.float64)
df = df.drop(columns=['race'])

The data set represents male as 1 and female as 2. To simplify this, we transform the data such that male is 1 and female is 0.

In [9]:
df['male'] = (df['sex'] == 1).astype(np.float64)
df = df.drop(columns=['sex'])

Lastly, I don't really care about the number of children, only whether or not a child is present. Therefore, I simplify this variable to an indicator.

In [10]:
df['children'] = (df['ownchild'] > 0).astype(np.float64)
df = df.drop(columns=['ownchild'])

Now our dataframe consists of the following columns.

In [11]:
list(df.columns)

['_cons', 'age', 'lgearnwke', 'married', 'white', 'male', 'children']

### Part B

#### Part I

Our data has already been imported into `df` as in Part A. Here's a preview:

In [12]:
print(df[:10])

    _cons  age  lgearnwke  married  white  male  children
2     1.0   52   6.805723      0.0    0.0   0.0       1.0
3     1.0   19   5.991465      0.0    0.0   0.0       0.0
4     1.0   56   7.130899      0.0    0.0   0.0       0.0
5     1.0   22   4.248495      0.0    0.0   0.0       0.0
6     1.0   48   6.522093      0.0    1.0   1.0       0.0
17    1.0   59   6.684612      0.0    0.0   1.0       0.0
18    1.0   27   5.953243      0.0    0.0   1.0       0.0
19    1.0   30   6.514713      0.0    0.0   0.0       0.0
20    1.0   49   5.786897      0.0    0.0   0.0       0.0
22    1.0   26   6.628041      1.0    1.0   0.0       0.0


#### Part II

We can quickly look at the summary statistics of our data set:

In [13]:
print(df.describe())

          _cons            age      lgearnwke        married          white  \
count  159559.0  159559.000000  159559.000000  159559.000000  159559.000000   
mean        1.0      42.310800       6.604410       0.542401       0.807889   
std         0.0      14.414349       0.833248       0.498200       0.393961   
min         1.0      16.000000      -4.605170       0.000000       0.000000   
25%         1.0      30.000000       6.173786       0.000000       1.000000   
50%         1.0      42.000000       6.659294       1.000000       1.000000   
75%         1.0      54.000000       7.154615       1.000000       1.000000   
max         1.0      85.000000       7.967145       1.000000       1.000000   

                male       children  
count  159559.000000  159559.000000  
mean        0.509172       0.323185  
std         0.499917       0.467694  
min         0.000000       0.000000  
25%         0.000000       0.000000  
50%         1.000000       0.000000  
75%         1.000000  

#### Part III

We can run an ordinary least squares regression of `lgearnwke` on `married` with heteroskedasticity-robust standard errors, without controlling for anything and without using instrumental variables.

In [14]:
simple_model = OLS(df['lgearnwke'], df[['_cons', 'married']])
simple_results = simple_model.fit(cov_type='HC1', use_t=True)
print(simple_results.summary())

                            OLS Regression Results                            
Dep. Variable:              lgearnwke   R-squared:                       0.051
Model:                            OLS   Adj. R-squared:                  0.051
Method:                 Least Squares   F-statistic:                     8439.
Date:                Wed, 01 May 2019   Prob (F-statistic):               0.00
Time:                        12:01:56   Log-Likelihood:            -1.9314e+05
No. Observations:              159559   AIC:                         3.863e+05
Df Residuals:                  159557   BIC:                         3.863e+05
Df Model:                           1                                         
Covariance Type:                  HC1                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
_cons          6.4000      0.003   2054.561      0.0
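
For reference, the `HC1` covariance requested above is the White "sandwich" estimator scaled by $n / (n - k)$. The following is a hand-rolled sketch on synthetic data, not the CPS sample; the function name `ols_hc1` and all parameter values are made up for illustration:

```python
import numpy as np

def ols_hc1(y, X):
    """OLS coefficients with HC1 standard errors (illustrative sketch)."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # "Meat" of the sandwich: X' diag(e_i^2) X
    meat = (X * resid[:, None] ** 2).T @ X
    cov_hc0 = XtX_inv @ meat @ XtX_inv
    cov_hc1 = cov_hc0 * n / (n - k)  # degrees-of-freedom correction
    return beta, np.sqrt(np.diag(cov_hc1))

# Synthetic data with noise variance that depends on the binary regressor.
rng = np.random.default_rng(0)
n = 2_000
d = rng.integers(0, 2, size=n).astype(float)
y = 6.4 + 0.38 * d + rng.normal(scale=0.5 + 0.5 * d, size=n)
X = np.column_stack([np.ones(n), d])

beta, se = ols_hc1(y, X)
print(beta, se)
```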

Using the results from this regression, we can explicitly test our hypothesis:

$$ H_0 : married = 0 $$
$$ H_1 : married \neq 0 $$

In [15]:
married_t_test = simple_results.t_test('married = 0')
print(married_t_test.summary())

                             Test for Constraints                             
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
c0             0.3769      0.004     91.866      0.000       0.369       0.385


From this $t$-test, we can see the $p$-value is zero to three decimal places or, equivalently, that the confidence interval does not include zero. Thus, it is safe to reject the null that the coefficient on married is zero. From the coefficient, an individual who is married makes approximately 37\% more on average than an individual who is not married (treating log points as percentages).
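
Because the dependent variable is a logarithm, the coefficient is measured in log points, and reading it as a percentage is an approximation that loosens as the coefficient grows. A small sketch of the exact conversion, $100 \cdot (e^{\beta} - 1)$, using the coefficient printed above (the helper name `log_points_to_percent` is hypothetical):

```python
import numpy as np

def log_points_to_percent(coef):
    """Exact percentage difference implied by a coefficient in a log-linear model."""
    return 100.0 * np.expm1(coef)

married_coef = 0.3769  # coefficient reported in the t-test output above
exact_pct = log_points_to_percent(married_coef)
approx_pct = 100.0 * married_coef  # the usual log-point approximation

print(f"approximate: {approx_pct:.1f}%, exact: {exact_pct:.1f}%")
```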

#### Part IV

Now we can control for some other confounding factors. 

First we consider a person's age. In order for age to contribute omitted variable bias, it needs to be correlated with both weekly wage and marital status. First we check if age and weekly wage are correlated.

In [16]:
print(df[['lgearnwke', 'age']].cov())

           lgearnwke         age
lgearnwke   0.694302    2.323699
age         2.323699  207.773450


The covariance is positive, so age and log weekly wage are positively correlated. Now we check if age is correlated with marital status.

In [17]:
print(df[['married', 'age']].cov())

          married         age
married  0.248204    2.305386
age      2.305386  207.773450
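
Covariances depend on the units of each variable, so they do not show how strong a relationship is. A short sketch that rescales the two covariance matrices printed above into correlations (the helper `corr_from_cov` is a hypothetical name; the numbers are copied from the outputs):

```python
import numpy as np

def corr_from_cov(cov):
    """Convert a 2x2 covariance matrix into the Pearson correlation."""
    cov = np.asarray(cov, dtype=float)
    return cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

# Values copied from the printed covariance matrices above.
cov_earn_age = [[0.694302, 2.323699], [2.323699, 207.773450]]
cov_married_age = [[0.248204, 2.305386], [2.305386, 207.773450]]

r_earn_age = corr_from_cov(cov_earn_age)
r_married_age = corr_from_cov(cov_married_age)
print(round(r_earn_age, 3), round(r_married_age, 3))
```

On the full data, `df[['lgearnwke', 'age']].corr()` should report the same quantity directly.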


Since age is positively correlated with both weekly wage and marital status, we expect that the marriage coefficient is overstated. Let's run the controlled regression.

In [18]:
simple_model = OLS(df['lgearnwke'], df[['_cons', 'married', 'age']])
simple_results = simple_model.fit(cov_type='HC1', use_t=True)
print(simple_results.summary())

                            OLS Regression Results                            
Dep. Variable:              lgearnwke   R-squared:                       0.067
Model:                            OLS   Adj. R-squared:                  0.067
Method:                 Least Squares   F-statistic:                     4981.
Date:                Wed, 01 May 2019   Prob (F-statistic):               0.00
Time:                        12:01:56   Log-Likelihood:            -1.9175e+05
No. Observations:              159559   AIC:                         3.835e+05
Df Residuals:                  159556   BIC:                         3.835e+05
Df Model:                           2                                         
Covariance Type:                  HC1                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
_cons          6.1090      0.007    867.475      0.0

We were correct. The married coefficient was overstated by about 7\%.
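
The direction of this change is what the omitted variable bias identity predicts: the coefficient from the short regression equals the coefficient from the long regression plus the coefficient on the omitted control times the slope from regressing that control on the included regressor. A minimal sketch on synthetic data (the variables `x`, `z`, `y` and every parameter value are made up for illustration, not taken from the CPS sample):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
z = rng.normal(size=n)                              # stand-in for the omitted control
x = (z + rng.normal(size=n) > 0).astype(float)      # binary regressor correlated with z
y = 0.2 * x + 0.5 * z + rng.normal(size=n)          # outcome depends on both

def ols(y, *cols):
    """OLS coefficients with an intercept, via least squares."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_short = ols(y, x)[1]               # slope on x, omitting z
_, b_long, g_long = ols(y, x, z)     # slopes on x and z together
delta = ols(z, x)[1]                 # slope from regressing z on x

# In-sample identity: b_short = b_long + g_long * delta
print(b_short, b_long + g_long * delta)
```

When the control raises the outcome and is positively related to the regressor, the product `g_long * delta` is positive, so the short regression overstates the coefficient.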

Now let's consider a person's sex. First we check if sex and weekly wage are correlated.

In [19]:
print(df[['lgearnwke', 'male']].cov())

           lgearnwke      male
lgearnwke   0.694302  0.076026
male        0.076026  0.249917


It is slightly correlated. Now we check if sex is correlated with marital status.

In [20]:
print(df[['married', 'male']].cov())

          married      male
married  0.248204  0.013799
male     0.013799  0.249917


Since both have positive correlation, we expect that the marriage coefficient is slightly overstated. Let's run the controlled regression.

In [21]:
simple_model = OLS(df['lgearnwke'], df[['_cons', 'married', 'male']])
simple_results = simple_model.fit(cov_type='HC1', use_t=True)
print(simple_results.summary())

                            OLS Regression Results                            
Dep. Variable:              lgearnwke   R-squared:                       0.080
Model:                            OLS   Adj. R-squared:                  0.080
Method:                 Least Squares   F-statistic:                     7099.
Date:                Wed, 01 May 2019   Prob (F-statistic):               0.00
Time:                        12:01:57   Log-Likelihood:            -1.9066e+05
No. Observations:              159559   AIC:                         3.813e+05
Df Residuals:                  159556   BIC:                         3.814e+05
Df Model:                           2                                         
Covariance Type:                  HC1                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
_cons          6.2638      0.004   1709.057      0.0

We were correct. The married coefficient was overstated by about 1\%.

Now let's consider a person's race. First we check if race and weekly wage are correlated.

In [22]:
print(df[['lgearnwke', 'white']].cov())

           lgearnwke     white
lgearnwke   0.694302  0.014954
white       0.014954  0.155205


It is very slightly correlated. Now we check if race is correlated with marital status.

In [23]:
print(df[['married', 'white']].cov())

          married     white
married  0.248204  0.016284
white    0.016284  0.155205


Since both have slightly positive correlation, we expect that the marriage coefficient is slightly overstated. Let's run the controlled regression.

In [24]:
simple_model = OLS(df['lgearnwke'], df[['_cons', 'married', 'white']])
simple_results = simple_model.fit(cov_type='HC1', use_t=True)
print(simple_results.summary())

                            OLS Regression Results                            
Dep. Variable:              lgearnwke   R-squared:                       0.052
Model:                            OLS   Adj. R-squared:                  0.052
Method:                 Least Squares   F-statistic:                     4305.
Date:                Wed, 01 May 2019   Prob (F-statistic):               0.00
Time:                        12:01:57   Log-Likelihood:            -1.9308e+05
No. Observations:              159559   AIC:                         3.862e+05
Df Residuals:                  159556   BIC:                         3.862e+05
Df Model:                           2                                         
Covariance Type:                  HC1                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
_cons          6.3558      0.005   1301.841      0.0

We were correct. The married coefficient is very slightly overstated by about 0.3\%.

Let's combine all of these results into a single controlled regression.

In [25]:
simple_model = OLS(df['lgearnwke'], df[['_cons', 'married', 'age', 
                                        'male', 'white']])
simple_results = simple_model.fit(cov_type='HC1', use_t=True)
print(simple_results.summary())

                            OLS Regression Results                            
Dep. Variable:              lgearnwke   R-squared:                       0.098
Model:                            OLS   Adj. R-squared:                  0.098
Method:                 Least Squares   F-statistic:                     4031.
Date:                Wed, 01 May 2019   Prob (F-statistic):               0.00
Time:                        12:01:57   Log-Likelihood:            -1.8910e+05
No. Observations:              159559   AIC:                         3.782e+05
Df Residuals:                  159554   BIC:                         3.783e+05
Df Model:                           4                                         
Covariance Type:                  HC1                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
_cons          5.9292      0.008    728.509      0.0

Putting all the controls together substantially reduces the coefficient of married. We can explicitly test our hypothesis:

$$ H_0 : married = 0 $$
$$ H_1 : married \neq 0 $$

In [26]:
married_t_test = simple_results.t_test('married = 0')
print(married_t_test.summary())

                             Test for Constraints                             
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
c0             0.2833      0.004     68.324      0.000       0.275       0.291


From this $t$-test, we can see the $p$-value is zero to three decimal places or, equivalently, that the confidence interval does not include zero. Thus, it is safe to reject the null that the coefficient on married is zero. From the coefficient, an individual who is married makes approximately 28\% more on average than an individual who is not married (treating log points as percentages).

Instead of controlling for omitted variables, we can try running two-stage least squares using the instrumental variable for the presence of children.

First we test for relevance by regressing married on children.

In [27]:
simple_model = OLS(df['married'], df[['_cons', 'children']])
simple_results = simple_model.fit(cov_type='HC1', use_t=True)
print(simple_results.summary())

                            OLS Regression Results                            
Dep. Variable:                married   R-squared:                       0.112
Model:                            OLS   Adj. R-squared:                  0.112
Method:                 Least Squares   F-statistic:                 2.293e+04
Date:                Wed, 01 May 2019   Prob (F-statistic):               0.00
Time:                        12:01:57   Log-Likelihood:            -1.0574e+05
No. Observations:              159559   AIC:                         2.115e+05
Df Residuals:                  159557   BIC:                         2.115e+05
Df Model:                           1                                         
Covariance Type:                  HC1                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
_cons          0.4271      0.002    283.742      0.0

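Clearly the instrument is relevant: the first-stage $F$-statistic of about 22,930 is far above the conventional rule-of-thumb threshold of 10. I could not think of an additional instrument, so the exclusion restriction cannot be checked with an overidentification test, and I am just going to take it on faith.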

Now we run two-stage least squares.

In [28]:
iv2sls_model = IV2SLS(df['lgearnwke'], df['_cons'], df['married'], 
                      df['children'])
iv2sls_results = iv2sls_model.fit(cov_type='robust')
print(iv2sls_results.summary)

                          IV-2SLS Estimation Summary                          
Dep. Variable:              lgearnwke   R-squared:                      0.0316
Estimator:                    IV-2SLS   Adj. R-squared:                 0.0316
No. Observations:              159559   F-statistic:                    2688.7
Date:                Wed, May 01 2019   P-value (F-stat)                0.0000
Time:                        12:01:58   Distribution:                  chi2(1)
Cov. Estimator:                robust                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
_cons          6.2742     0.0069     905.53     0.0000      6.2606      6.2878
married        0.6088     0.0117     51.852     0.00

Interestingly, using children as an instrument increases the coefficient of married on weekly earnings (from 0.3769 in the simple regression to 0.6088), and it remains statistically significant. Since this contradicts the reasonable prior that marriage has little direct effect on earnings, I suspect that the presence of children does not satisfy the exclusion restriction, for the reasons outlined in Part A. If I could come up with another instrument, I would try testing this condition.
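
It can help to see what the estimator does mechanically. With one binary instrument and no controls, two-stage least squares reduces to the Wald ratio: the difference in mean outcome between the two instrument groups divided by the difference in mean treatment. A minimal sketch on synthetic data (all variable names and parameter values are hypothetical, not the CPS sample):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
z = rng.integers(0, 2, size=n).astype(float)   # binary instrument
u = rng.normal(size=n)                         # unobserved confounder
# Endogenous binary regressor: depends on the instrument and the confounder.
x = ((0.8 * z + u + rng.normal(size=n)) > 0.5).astype(float)
y = 6.0 + 0.1 * x + 0.5 * u + rng.normal(size=n)

# Wald ratio: reduced form divided by first stage.
reduced_form = y[z == 1].mean() - y[z == 0].mean()
first_stage = x[z == 1].mean() - x[z == 0].mean()
wald = reduced_form / first_stage

# Manual 2SLS: fit x on [1, z], then regress y on [1, fitted x].
Z = np.column_stack([np.ones(n), z])
x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
X_hat = np.column_stack([np.ones(n), x_hat])
beta_2sls = np.linalg.lstsq(X_hat, y, rcond=None)[0][1]

print(wald, beta_2sls)
```

Any direct path from the instrument to the outcome enters the numerator and is then scaled up by dividing by the first stage, which is one way a violated exclusion restriction can push the estimate above the OLS coefficient.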

(Note: After the regression yielded unexpected results, I did a little research and found the paper "Marriage and Earnings" by Cornwell and Rupert. They identified the same coefficient as the simple model when not controlling for other factors (37\%). They also conjectured that years of marriage, among other factors, is a likely confounding variable. They were able to reduce the coefficient to zero by controlling for various factors and using instruments: number of siblings and years of education of the father.)