In [1]:
import pandas as pd
from statsmodels.regression.linear_model import OLS
import numpy as np
from linearmodels.iv import IV2SLS

# 14.32 Problem Set 5

## Problem 4

Loading the data from the Current Population Survey Merged Outgoing Rotation Groups for 2018 from the National Bureau of Economic Research.

In [2]:
full_df = pd.read_stata('morg18.dta')

We add a constant column to run OLS with a constant later on.

In [3]:
full_df['_cons'] = 1.

### Part A

In this problem, we are interested in determining whether being married has an impact on one's weekly earnings. To answer this question, we will regress the logarithm of weekly earnings on a binary indicator of married vs. unmarried. There are a few possible confounding variables in this case:

- The age of an individual is likely strongly correlated with both marital status and weekly earnings.
- The education level of an individual is likely strongly correlated with both marital status and weekly earnings.
- Whether an individual has children or not is likely strongly correlated with both marital status and weekly earnings.
- A person's sex is likely strongly correlated with weekly earnings and might be weakly correlated with their marital status.
- A person's race is likely strongly correlated with weekly earnings and might be weakly correlated with their marital status.

By controlling for these factors, we might reduce some omitted variables bias. To do so, we can extract the important columns:

In [4]:
df = full_df[['_cons', 'earnwke', 'marital', 'age', 
              'ihigrdc', 'ownchild', 'race', 'sex']]

We are only going to consider people who have a positive weekly income. Therefore, we drop every row with a `NaN` in any of the selected columns (this removes missing incomes and also unknown education levels in `ihigrdc`) and then keep only people whose income is not equal to zero.

In [5]:
df = df.dropna()
df = df[df['earnwke'] != 0.]

We don't want to use the weekly wage directly as the dependent variable; instead, we use the logarithm of the weekly wage.

In [6]:
df['lgearnwke'] = np.log(df['earnwke'])
df = df.drop(columns=['earnwke'])

The data set breaks marital status into several categories. We are only interested in whether someone is married or not.

In [7]:
df['married'] = df['marital'].replace({
    1: 1.,  # Married Civilian Spouse Present
    2: 1.,  # Married AF Spouse Present
    3: 1.,  # Married Spouse Absent
    4: 0.,  # Widowed
    5: 0.,  # Divorced
    6: 0.,  # Separated
    7: 0.,  # Never Married
})
df = df.drop(columns=['marital'])
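As a side note, `Series.replace` with a dictionary leaves any code that is not in the dictionary unchanged, so an unexpected marital code would pass through as-is. A minimal alternative sketch, using a toy frame rather than the CPS data, builds the same indicator with `isin`. The code `9` below is made up purely to illustrate an unmapped value, and `toy` and `married_codes` are hypothetical names.

```python
import pandas as pd

# Toy data: codes 1-7 follow the comments in the cell above;
# 9 is a made-up code that the mapping does not cover.
toy = pd.DataFrame({'marital': [1, 2, 3, 4, 5, 6, 7, 9]})

# Codes treated as "married" (spouse present, AF spouse present, spouse absent).
married_codes = [1, 2, 3]

# isin() returns True/False for every row, so anything outside
# married_codes (including unexpected codes) becomes 0.0.
toy['married'] = toy['marital'].isin(married_codes).astype(float)
print(toy)
```

Whether an unexpected code should become 0 or be dropped is a modeling choice; this sketch only shows the mechanics.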

I don't really care about the number of children, only whether or not a child is present. Therefore, I simplify this variable to an indicator.

In [8]:
df['children'] = (df['ownchild'] > 0).astype(np.float64)
df = df.drop(columns=['ownchild'])

There are many different categories of race represented in this data set. To make things super simple, we are only going to indicate whether someone is white (`race == 1`).

In [9]:
df['white'] = (df['race'] == 1).astype(np.float64)
df = df.drop(columns=['race'])

The data set represents male as 1 and female as 2. To simplify this, we transform the data such that male is 1 and female is 0.

In [10]:
df['male'] = (df['sex'] == 1).astype(np.float64)
df = df.drop(columns=['sex'])

Now our dataframe consists of the following columns.

In [11]:
list(df.columns)

['_cons',
 'age',
 'ihigrdc',
 'lgearnwke',
 'married',
 'children',
 'white',
 'male']

### Part B

#### Part I

Our data has already been imported into `df` as in Part A. Here's a preview:

In [12]:
print(df[:10])

    _cons  age  ihigrdc  lgearnwke  married  children  white  male
2     1.0   52     12.0   6.805723      0.0       1.0    0.0   0.0
3     1.0   19     12.0   5.991465      0.0       0.0    0.0   0.0
5     1.0   22     12.0   4.248495      0.0       0.0    0.0   0.0
6     1.0   48     12.0   6.522093      0.0       0.0    1.0   1.0
17    1.0   59     12.0   6.684612      0.0       0.0    0.0   1.0
18    1.0   27     12.0   5.953243      0.0       0.0    0.0   1.0
19    1.0   30     12.0   6.514713      0.0       0.0    0.0   0.0
20    1.0   49     12.0   5.786897      0.0       0.0    0.0   0.0
28    1.0   48     15.0   7.025005      1.0       0.0    0.0   1.0
31    1.0   51     14.0   7.489110      1.0       0.0    0.0   1.0


#### Part II

We can quickly look at the summary statistics of our data set:

In [13]:
print(df.describe())

          _cons            age        ihigrdc      lgearnwke        married  \
count  105108.0  105108.000000  105108.000000  105108.000000  105108.000000   
mean        1.0      41.815143      12.749990       6.421394       0.500590   
std         0.0      15.025220       2.309817       0.803723       0.500002   
min         1.0      16.000000       0.000000      -4.605170       0.000000   
25%         1.0      29.000000      12.000000       6.004677       0.000000   
50%         1.0      41.000000      12.000000       6.461468       1.000000   
75%         1.0      54.000000      14.000000       6.915723       1.000000   
max         1.0      85.000000      18.000000       7.967145       1.000000   

            children          white           male  
count  105108.000000  105108.000000  105108.000000  
mean        0.303459       0.804439       0.530873  
std         0.459754       0.396634       0.499048  
min         0.000000       0.000000       0.000000  
25%         0.000000   

#### Part III

We can run an ordinary least squares regression of `lgearnwke` on `married` with heteroskedasticity-robust (HC1) standard errors, without controlling for anything and without using instrumental variables.

In [14]:
simple_model = OLS(df['lgearnwke'], df[['_cons', 'married']])
simple_results = simple_model.fit(cov_type='HC1', use_t=True)
print(simple_results.summary())

                            OLS Regression Results                            
Dep. Variable:              lgearnwke   R-squared:                       0.061
Model:                            OLS   Adj. R-squared:                  0.061
Method:                 Least Squares   F-statistic:                     6779.
Date:                Wed, 01 May 2019   Prob (F-statistic):               0.00
Time:                        15:54:39   Log-Likelihood:            -1.2289e+05
No. Observations:              105108   AIC:                         2.458e+05
Df Residuals:                  105106   BIC:                         2.458e+05
Df Model:                           1                                         
Covariance Type:                  HC1                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
_cons          6.2233      0.004   1761.417      0.0
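To make the `cov_type='HC1'` option used above more concrete, here is a minimal sketch of the HC1 sandwich estimator computed by hand with NumPy. It uses synthetic data, not the CPS sample. The variable names, the sample size, and the data-generating process are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic binary regressor and an outcome whose noise variance depends on it,
# so the errors are heteroskedastic by construction.
d = rng.integers(0, 2, size=n).astype(float)
y = 6.0 + 0.4 * d + rng.normal(scale=0.5 + 0.5 * d, size=n)

X = np.column_stack([np.ones(n), d])   # constant + regressor
k = X.shape[1]

# OLS coefficients and residuals.
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# HC0 sandwich: (X'X)^-1 [X' diag(e^2) X] (X'X)^-1
meat = X.T @ (X * resid[:, None] ** 2)
cov_hc0 = XtX_inv @ meat @ XtX_inv

# HC1 rescales HC0 by n / (n - k) as a degrees-of-freedom correction.
cov_hc1 = cov_hc0 * n / (n - k)
se_hc1 = np.sqrt(np.diag(cov_hc1))

# Classical (homoskedastic) standard errors for comparison.
sigma2 = resid @ resid / (n - k)
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

print("coefficients:   ", beta)
print("HC1 std errors: ", se_hc1)
print("classical s.e.: ", se_classical)
```

With a sample as large as the one in this problem, the `n / (n - k)` correction is negligible, so HC1 and HC0 would be nearly identical.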

Using the results from this regression, we can explicitly test our hypothesis:

$$ H_0 : \beta_{married} = 0 $$
$$ H_1 : \beta_{married} \neq 0 $$

In [15]:
simple_married_t_test = simple_results.t_test('married = 0')
print(simple_married_t_test.summary())

                             Test for Constraints                             
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
c0             0.3957      0.005     82.333      0.000       0.386       0.405


From this $t$-test, we can see the $p$-value is zero to three decimal places or, equivalently, that the 95\% confidence interval $[0.386, 0.405]$ does not include zero. Thus, it is safe to reject the null that marriage has no effect on weekly wages. The coefficient is measured in log points: an individual who is married makes approximately 0.40 log points more on average than an individual who is not married, which corresponds to $e^{0.3957} - 1 \approx 0.49$, or roughly 49\% more.
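Because the dependent variable is in logs, the coefficient on a dummy variable is a difference in log points rather than a percentage. A small sketch of the conversion, where the helper name `log_points_to_percent` is hypothetical and the input is the coefficient reported by the $t$-test above:

```python
import math


def log_points_to_percent(coef):
    """Convert a dummy coefficient from a log-outcome regression
    into the implied percentage difference: 100 * (exp(coef) - 1)."""
    return 100.0 * (math.exp(coef) - 1.0)


naive_coef = 0.3957                      # coefficient on married, no controls
naive_pct = log_points_to_percent(naive_coef)

print(f"log points: {naive_coef:.4f}")
print(f"percent difference: {naive_pct:.1f}%")
```

For small coefficients the log-point value and the percentage are close, but the gap grows with the size of the coefficient, which is why it matters here.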

#### Part IV

Now we can control for some other confounding factors. 

- We would expect $\mathrm{cov}(lgearnwke, age)$ and $\mathrm{cov}(married, age)$ to both be strongly positive. This should cause overstatement of the coefficient on married.
- We would expect $\mathrm{cov}(lgearnwke, ihigrdc)$ to be strongly positive and $\mathrm{cov}(married, ihigrdc)$ to be slightly positive. This should cause overstatement of the coefficient on married.
- We would expect $\mathrm{cov}(lgearnwke, children)$ to be slightly positive and $\mathrm{cov}(married, children)$ to be strongly positive. This should cause overstatement of the coefficient on married.
- We would expect $\mathrm{cov}(lgearnwke, male)$ to be strongly positive and $\mathrm{cov}(married, male)$ to be roughly zero. This may or may not have a noticeable effect.
- We would expect $\mathrm{cov}(lgearnwke, white)$ to be strongly positive and $\mathrm{cov}(married, white)$ to be roughly zero. This may or may not have a noticeable effect.
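The sign reasoning in the list above follows the usual omitted-variable bias logic: with a single omitted control, the coefficient from the short regression equals the coefficient from the long regression plus the control's coefficient times the slope from regressing the control on `married`. A minimal simulation sketch with synthetic data illustrates this. The variable `age` here is a hypothetical standardized confounder, and all the numbers in the data-generating process are assumptions rather than estimates from the CPS.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical confounder that raises both the chance of being married
# and the outcome.
age = rng.normal(size=n)
married = (0.8 * age + rng.normal(size=n) > 0).astype(float)
y = 0.2 * married + 0.5 * age + rng.normal(size=n)  # true effect of married: 0.2


def ols(y, *cols):
    """OLS coefficients with a constant prepended to the regressors."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]


short_coef = ols(y, married)[1]          # omits age
long_fit = ols(y, married, age)          # controls for age
long_coef, gamma = long_fit[1], long_fit[2]
delta = ols(age, married)[1]             # cov(married, age) / var(married)

print("short regression coefficient:", short_coef)
print("long regression coefficient: ", long_coef)
print("long + gamma * delta:        ", long_coef + gamma * delta)
```

In this sketch both `gamma` and `delta` are positive, so the short regression overstates the coefficient on `married`, which is the pattern predicted for age, education, and children above. With several correlated controls the bias terms interact, so the single-control formula is only a guide to the sign.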

Running the controlled regression:

In [16]:
controlled_model = OLS(df['lgearnwke'], df[['_cons', 'married', 'age', 
                                            'ihigrdc', 'children', 'male', 
                                            'white']])
controlled_results = controlled_model.fit(cov_type='HC1', use_t=True)
print(controlled_results.summary())

                            OLS Regression Results                            
Dep. Variable:              lgearnwke   R-squared:                       0.209
Model:                            OLS   Adj. R-squared:                  0.209
Method:                 Least Squares   F-statistic:                     4170.
Date:                Wed, 01 May 2019   Prob (F-statistic):               0.00
Time:                        15:54:39   Log-Likelihood:            -1.1383e+05
No. Observations:              105108   AIC:                         2.277e+05
Df Residuals:                  105101   BIC:                         2.277e+05
Df Model:                           6                                         
Covariance Type:                  HC1                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
_cons          4.4266      0.016    270.281      0.0

Putting all the controls together reduces the coefficient on married by more than half. We can explicitly test our hypothesis:

$$ H_0 : \beta_{married} = 0 $$
$$ H_1 : \beta_{married} \neq 0 $$

In [17]:
controlled_married_t_test = controlled_results.t_test('married = 0')
print(controlled_married_t_test.summary())

                             Test for Constraints                             
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
c0             0.1652      0.005     33.159      0.000       0.155       0.175


From this $t$-test, we can see the $p$-value is still zero to three decimal places or, equivalently, that the 95\% confidence interval $[0.155, 0.175]$ still does not include zero. Thus, it is safe to reject the null that marriage has no effect on weekly wages. From the coefficient, an individual who is married makes approximately 0.17 log points more on average than an individual who is not married, holding the controls fixed; this corresponds to $e^{0.1652} - 1 \approx 0.18$, or roughly 18\% more. However, this is substantially lower than the naive estimate of 0.3957, and the two confidence intervals do not overlap, which suggests the effect of marriage on income is not nearly as strong as it first appears.

Furthermore, there are likely more omitted variables present in this model. Some quick Googling suggests the length of someone's marriage could be a strong confounder. Unfortunately, I couldn't test this using the CPS data, though it is partly captured in the model through age.