* Python code replication of: https://www.kaggle.com/janniskueck/pm1-notebook-inference
* Created by: Anzony Quispe & Alexander Quispe

This notebook contains an example for teaching.

# An inferential problem: The Gender Wage Gap

In the previous lab, we analyzed data from the March Supplement of the U.S. Current Population Survey (2015) and asked how to use job-relevant characteristics, such as education and experience, to best predict wages. Now, we focus on the following inference question:

What is the difference in predicted wages between men and women with the same job-relevant characteristics?

Thus, we analyze whether men and women are paid differently (*gender wage gap*). The gender wage gap may partly reflect *discrimination* against women in the labor market and may partly reflect a *selection effect*, namely that women are relatively more likely to take on occupations that pay somewhat less (for example, school teaching).

To investigate the gender wage gap, we consider the following log-linear regression model

\begin{align}
\log(Y) &= \beta'X + \epsilon\\
&= \beta_1 D  + \beta_2' W + \epsilon,
\end{align}

where $D$ is the indicator of being female ($1$ if female and $0$ otherwise) and the
$W$'s are controls explaining variation in wages. Because wages enter in logarithms, $\beta_1$ measures the relative (approximately percentage) difference in pay between women and men.
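
As a small sketch of that reading: a coefficient $\beta_1$ on a dummy in a log-wage regression corresponds to an exact relative change of $e^{\beta_1}-1$, which is close to $\beta_1$ itself when the coefficient is small. The helper name `log_points_to_percent` and the input values are illustrative assumptions, not estimates from the data.

```python
import math

def log_points_to_percent(beta):
    """Exact percent change implied by a dummy coefficient in a log-outcome model."""
    return 100.0 * (math.exp(beta) - 1.0)

# Illustrative coefficients (assumed values, chosen only for the example)
for beta in (-0.04, -0.07):
    approx = 100.0 * beta                 # "log points" read directly as percent
    exact = log_points_to_percent(beta)   # exact relative change
    print(f"beta = {beta:+.2f}: approx {approx:.2f}%, exact {exact:.2f}%")
```

For coefficients of this size the two readings differ only in the second decimal, which is why small log differences are usually quoted directly as percentages.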

## Data analysis

We consider the same subsample of the U.S. Current Population Survey (2015) as in the previous lab. Let us load the data set.

In [1]:
import pandas as pd
import numpy as np
import pyreadr as rr # package to read data stored in R format
import math

In [2]:
#!pip install pyreadr==0.4.2

In [39]:
#rdata_read = pyreadr.read_r("../../data/wage2015_subsample_inference.Rdata")

data  = pd.read_csv(r'../../../data/wage2015_subsample_inference.csv')

# Extracting the data frame from rdata_read
#data = rdata_read[ 'data' ]
data['occ']=pd.Categorical(data.occ)
data['occ2']=pd.Categorical(data.occ2)
data['ind']=pd.Categorical(data.ind)
data['ind2']=pd.Categorical(data.ind2)


data.shape

(5150, 21)

In [3]:
rdata_read = rr.read_r(r"../../../data/wage2015_subsample_inference.Rdata")


# Extracting the data frame from rdata_read
data = rdata_read[ 'data' ]


data.shape

(5150, 20)

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5150 entries, 10 to 32643
Data columns (total 20 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   wage    5150 non-null   float64 
 1   lwage   5150 non-null   float64 
 2   sex     5150 non-null   float64 
 3   shs     5150 non-null   float64 
 4   hsg     5150 non-null   float64 
 5   scl     5150 non-null   float64 
 6   clg     5150 non-null   float64 
 7   ad      5150 non-null   float64 
 8   mw      5150 non-null   float64 
 9   so      5150 non-null   float64 
 10  we      5150 non-null   float64 
 11  ne      5150 non-null   float64 
 12  exp1    5150 non-null   float64 
 13  exp2    5150 non-null   float64 
 14  exp3    5150 non-null   float64 
 15  exp4    5150 non-null   float64 
 16  occ     5150 non-null   category
 17  occ2    5150 non-null   category
 18  ind     5150 non-null   category
 19  ind2    5150 non-null   category
dtypes: category(4), float64(16)
memory usage: 736.3+ KB


***Variable description***

- occ: occupational classification
- ind: industry classification
- lwage: log hourly wage
- sex: gender (1 = female, 0 = male)
- shs: some high school
- hsg: high school graduate
- scl: some college
- clg: college graduate
- ad: advanced degree
- ne: Northeast
- mw: Midwest
- so: South
- we: West
- exp1: experience

In [6]:
data[['occ','occ2','ind','ind2']].describe()

Unnamed: 0,occ,occ2,ind,ind2
count,5150,5150,5150,5150
unique,351,22,230,21
top,4700,17,770,18
freq,174,670,297,664


To start our (causal) analysis, we compare the sample means given gender:

In [8]:
# variables to summarize and their display labels
cols = ["lwage","sex","shs","hsg","scl","clg","ad","ne","mw","so","we","exp1"]
labels = ["Log Wage","Sex","Some High School","High School Graduate","Some College","College Graduate","Advanced Degree", "Northeast","Midwest","South","West","Experience"]

Z = data[ cols ]

data_female = data[ data[ 'sex' ] == 1 ]
Z_female = data_female[ cols ]

data_male = data[ data[ 'sex' ] == 0 ]
Z_male = data_male[ cols ]

# sample means for the full sample, men, and women
table_pandas = pd.DataFrame( {
    'All': Z.mean().values,
    'Men': Z_male.mean().values,
    'Women': Z_female.mean().values
}, index = labels )
table_html = table_pandas.to_html() # html format

table_pandas

Unnamed: 0,All,Men,Women
Log Wage,2.970787,2.98783,2.949485
Sex,0.444466,0.0,1.0
Some High School,0.023301,0.031807,0.012669
High School Graduate,0.243883,0.294303,0.180865
Some College,0.278058,0.273331,0.283967
College Graduate,0.31767,0.293953,0.347313
Advanced Degree,0.137087,0.106606,0.175186
Northeast,0.227767,0.22195,0.235037
Midwest,0.259612,0.259,0.260376
South,0.296505,0.298148,0.294452


In [10]:
print( table_html )

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>All</th>
      <th>Men</th>
      <th>Women</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Log Wage</th>
      <td>2.970787</td>
      <td>2.987830</td>
      <td>2.949485</td>
    </tr>
    <tr>
      <th>Sex</th>
      <td>0.444466</td>
      <td>0.000000</td>
      <td>1.000000</td>
    </tr>
    <tr>
      <th>Some High School</th>
      <td>0.023301</td>
      <td>0.031807</td>
      <td>0.012669</td>
    </tr>
    <tr>
      <th>High School Graduate</th>
      <td>0.243883</td>
      <td>0.294303</td>
      <td>0.180865</td>
    </tr>
    <tr>
      <th>Some College</th>
      <td>0.278058</td>
      <td>0.273331</td>
      <td>0.283967</td>
    </tr>
    <tr>
      <th>College Graduate</th>
      <td>0.317670</td>
      <td>0.293953</td>
      <td>0.347313</td>
    </tr>
    <tr>
      <th>Advanced Degree</th>
      <td>0.137087</td>
      <td>0.106606</td>
      

In particular, the table above shows that the difference in average log wage (`lwage`) between men and women is about $0.038$:

In [11]:
data_female['lwage'].mean()- data_male['lwage'].mean()

-0.03834473367441493

Thus, the unconditional gender wage gap is about $3.8$\% for the group of never married workers (women get paid less on average in our sample). We also observe that never married working women are relatively more educated than working men and have less work experience.

This unconditional (predictive) effect of gender equals the coefficient $\beta$ in the univariate OLS regression of $Y$ on $D$:

\begin{align}
\log(Y) &=\beta D + \epsilon.
\end{align}
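
To see why the two numbers must agree, here is a minimal sketch on simulated data (the sample size, coefficients, and noise level are assumptions made up for the illustration). With an intercept in the regression, the OLS slope on a 0/1 regressor is exactly the difference between the two group means.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Simulated binary regressor and log-wage-like outcome (assumed values)
d = rng.integers(0, 2, size=n).astype(float)
y = 3.0 - 0.04 * d + rng.normal(scale=0.5, size=n)

# OLS of y on a constant and d
X = np.column_stack([np.ones(n), d])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Raw difference in group means
diff_in_means = y[d == 1].mean() - y[d == 0].mean()

print(coef[1], diff_in_means)  # the two values coincide up to rounding error
```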

We verify this by running an OLS regression in Python with *statsmodels*.

In [12]:
pip install statsmodels

Note: you may need to restart the kernel to use updated packages.


In [13]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [14]:
nocontrol_model = smf.ols( formula = 'lwage ~ sex', data = data )
nocontrol_fit = nocontrol_model.fit()  # fit once and reuse the results

# point estimate and classical (homoskedastic) standard error
nocontrol_est = nocontrol_fit.params['sex']
nocontrol_se2 = nocontrol_fit.bse['sex']

# heteroskedasticity-robust (HC0) standard error, looked up by coefficient name
nocontrol_se = nocontrol_fit.HC0_se['sex']

# print unconditional effect of gender and the corresponding standard errors
print( f'The estimated gender coefficient is {nocontrol_est} and the corresponding standard error is {nocontrol_se2}' )
print( f'The estimated gender coefficient is {nocontrol_est} and the corresponding robust standard error is {nocontrol_se}','\n' )

The estimated gender coefficient is -0.03834473367441481 and the corresponding standard error is 0.015987825519430385
The estimated gender coefficient is -0.03834473367441481 and the corresponding robust standard error is 0.01590193507909572 



Note that the robust standard error is based on the HC0 (White) covariance estimator from *statsmodels*, the Python counterpart of the *R* package *sandwich*, and is therefore robust to heteroskedasticity.
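
For intuition, the HC0 estimator can be written out by hand as $(X'X)^{-1} X' \mathrm{diag}(\hat\epsilon_i^2) X (X'X)^{-1}$. The sketch below does this with NumPy on simulated heteroskedastic data; the data-generating values are assumptions for illustration only, and the snippet is not the notebook's own computation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Simulated data whose noise variance depends on d (assumed values)
d = rng.integers(0, 2, size=n).astype(float)
y = 3.0 - 0.04 * d + rng.normal(size=n) * (0.3 + 0.4 * d)

X = np.column_stack([np.ones(n), d])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical covariance: sigma^2 (X'X)^{-1}
sigma2 = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC0 "sandwich": bread = (X'X)^{-1}, meat = X' diag(resid^2) X
meat = X.T @ (X * resid[:, None] ** 2)
cov_hc0 = XtX_inv @ meat @ XtX_inv
se_hc0 = np.sqrt(np.diag(cov_hc0))

print("classical:", se_classical)
print("HC0:      ", se_hc0)
```

In the wage data the two standard errors are nearly identical, which suggests that heteroskedasticity matters little for this particular coefficient.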


Next, we run an OLS regression of $Y$ on $(D,W)$ to control for the effect of covariates summarized in $W$:

\begin{align}
\log(Y) &=\beta_1 D  + \beta_2' W + \epsilon.
\end{align}

Here, we are considering the flexible model from the previous lab. Hence, $W$ controls for experience, education, region, and occupation and industry indicators plus transformations and two-way interactions.
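
The size of this flexible specification can be worked out by counting design-matrix columns. The sketch below assumes the formula `lwage ~ sex + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)`, baseline (treatment) coding of the categoricals, and the 22 and 21 unique levels of `occ2` and `ind2` reported by `describe()` above. The dictionary names are hypothetical.

```python
# Columns contributed by each variable: 1 for numeric/dummy variables,
# (levels - 1) for a categorical under baseline coding (an assumption).
exp_terms = {"exp1": 1, "exp2": 1, "exp3": 1, "exp4": 1}
other_terms = {
    "shs": 1, "hsg": 1, "scl": 1, "clg": 1,
    "occ2": 22 - 1,   # 22 unique occupation groups
    "ind2": 21 - 1,   # 21 unique industry groups
    "mw": 1, "so": 1, "we": 1,
}

# (a + b) * (c + d) expands to all main effects plus all a:c-type interactions
main_effects = sum(exp_terms.values()) + sum(other_terms.values())
interactions = sum(a * b for a in exp_terms.values() for b in other_terms.values())

n_columns = 1 + 1 + main_effects + interactions  # intercept + sex + the rest
print(n_columns)  # 246
```

With 246 columns including the intercept, the model has 245 slope coefficients, which matches the `Df Model: 245` line in the regression summary.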

Let us run the OLS regression with controls.

## OLS regression with controls

In [81]:
flex = 'lwage ~ sex + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)'
control_model = smf.ols( formula = flex, data = data ).fit().summary2()
control_model

0,1,2,3
Model:,OLS,Adj. R-squared:,0.319
Dependent Variable:,lwage,AIC:,7095.8594
Date:,2022-04-02 09:36,BIC:,8706.3604
No. Observations:,5150,Log-Likelihood:,-3301.9
Df Model:,245,F-statistic:,10.83
Df Residuals:,4904,Prob (F-statistic):,2.69e-305
R-squared:,0.351,Scale:,0.22166

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,3.2797,0.2842,11.5402,0.0000,2.7225,3.8368
occ2[T.10],0.0210,0.1565,0.1339,0.8935,-0.2859,0.3278
occ2[T.11],-0.6424,0.3091,-2.0784,0.0377,-1.2484,-0.0365
occ2[T.12],-0.0675,0.2520,-0.2677,0.7889,-0.5616,0.4267
occ2[T.13],-0.2330,0.2315,-1.0062,0.3144,-0.6869,0.2209
occ2[T.14],0.2562,0.3227,0.7940,0.4272,-0.3764,0.8888
occ2[T.15],-0.1939,0.2595,-0.7470,0.4551,-0.7026,0.3149
occ2[T.16],-0.0551,0.1471,-0.3748,0.7078,-0.3434,0.2332
occ2[T.17],-0.4156,0.1361,-3.0534,0.0023,-0.6825,-0.1488

0,1,2,3
Omnibus:,395.012,Durbin-Watson:,1.898
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1529.25
Skew:,0.303,Prob(JB):,0.0
Kurtosis:,5.6,Condition No.:,68698.0


In [87]:
flex = 'lwage ~ sex + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)'

# The smf formula API builds the design matrix the same way an R formula does
control_model = smf.ols( formula = flex, data = data )
control_fit = control_model.fit()  # fit once and reuse the results
control_est = control_fit.params['sex']

print(control_fit.summary2().tables[1])

# robust (HC0) standard error of the coefficient on sex, looked up by name
# rather than by a hard-coded position in the coefficient vector
control_se = control_fit.HC0_se['sex']

print( f"Coefficient for OLS with controls {control_est} and the corresponding robust standard error is {control_se}" )

# confidence interval (based on the classical, non-robust standard errors)
control_fit.conf_int( alpha=0.05 ).loc[['sex']]


               Coef.  Std.Err.          t         P>|t|    [0.025    0.975]
Intercept   3.279677  0.284196  11.540202  2.037819e-30  2.722526  3.836828
occ2[T.10]  0.020954  0.156498   0.133896  8.934903e-01 -0.285852  0.327761
occ2[T.11] -0.642418  0.309090  -2.078417  3.772286e-02 -1.248372 -0.036463
occ2[T.12] -0.067477  0.252049  -0.267716  7.889294e-01 -0.561605  0.426651
occ2[T.13] -0.232978  0.231538  -1.006220  3.143593e-01 -0.686896  0.220940
...              ...       ...        ...           ...       ...       ...
exp4:scl    0.021076  0.024529   0.859230  3.902557e-01 -0.027012  0.069164
exp4:clg    0.007869  0.022753   0.345868  7.294565e-01 -0.036736  0.052475
exp4:mw     0.006244  0.015870   0.393446  6.940073e-01 -0.024868  0.037356
exp4:so     0.000314  0.013628   0.023075  9.815913e-01 -0.026402  0.027031
exp4:we     0.001768  0.015960   0.110804  9.117763e-01 -0.029521  0.033058

[246 rows x 6 columns]
Coefficient for OLS with controls -0.06955320329685015 and the c

Unnamed: 0,0,1
sex,-0.099387,-0.039719


The estimated regression coefficient $\beta_1\approx-0.0696$ measures how our linear prediction of the log wage changes when the gender variable $D$ switches from 0 to 1, holding the controls $W$ fixed.
We can call this the *predictive effect* (PE), as it measures the impact of a variable on the prediction we make. Overall, the unconditional wage gap of about $4$\% for women widens to about $7$\% after controlling for worker characteristics.


Next, we use the Frisch-Waugh-Lovell theorem from the lecture to partial out the linear effect of the controls via OLS.

## Partialling-out using OLS

In [65]:
# model for Y
flex_y = 'lwage ~  (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)'
# model for D
flex_d = 'sex ~ (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)' 

# partialling-out the linear effect of W from Y
t_Y = smf.ols( formula = flex_y , data = data ).fit().resid

# partialling-out the linear effect of W from D
t_D = smf.ols( formula = flex_d , data = data ).fit().resid

# collect both residual series in one data frame
data_res = pd.DataFrame( np.vstack(( t_Y.values , t_D.values )).T , columns = [ 't_Y', 't_D' ] )

# regression of Y on D after partialling-out the effect of W
partial_fit =  smf.ols( formula = 't_Y ~ t_D' , data = data_res ).fit()
partial_est = partial_fit.params['t_D']

# robust (HC0) standard error, looked up by coefficient name
partial_se = partial_fit.HC0_se['t_D']

print( f"Coefficient for D via partialling-out {partial_est} and the corresponding robust standard error is {partial_se}" )

# confidence interval (based on the classical, non-robust standard errors)
partial_fit.conf_int( alpha=0.05 ).loc[['t_D']]


Coefficient for D via partialling-out -0.0695532032968462 and the corresponding robust standard error is 0.015000474421753372


Unnamed: 0,0,1
t_D,-0.098671,-0.040435


In [73]:
# preview of the residualized outcome (t_Y) and residualized gender indicator (t_D)
data_res

Unnamed: 0,t_Y,t_D
0,-0.567731,0.105742
1,0.404602,-0.569249
2,-0.481643,-0.045516
3,-1.074180,0.400720
4,0.097208,0.312384
...,...,...
5145,-0.262079,-0.409648
5146,0.582018,0.547954
5147,-0.001855,0.035117
5148,0.387129,-0.309997


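Again, the estimated coefficient measures the linear predictive effect (PE) of $D$ on $Y$ after taking out the linear effect of $W$ on both of these variables. This coefficient equals the estimated coefficient from the OLS regression with controls. The reported confidence intervals differ slightly because the residual regression does not account for the degrees of freedom used up when partialling out $W$.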
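
The equality of the two coefficients is the Frisch-Waugh-Lovell theorem, and it holds for any data set, not just this one. A minimal NumPy sketch on simulated data (dimensions and coefficients are assumptions for the illustration; `ols` is a hypothetical helper):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 400, 5

# Simulated controls (with a constant), binary treatment, and outcome
W = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
d = (W[:, 1] + rng.normal(size=n) > 0).astype(float)
y = -0.07 * d + W @ rng.normal(size=p + 1) + rng.normal(size=n)

def ols(X, target):
    """Least-squares coefficients of target on the columns of X."""
    return np.linalg.lstsq(X, target, rcond=None)[0]

# (1) Full regression of y on (d, W): take the coefficient on d
beta_full = ols(np.column_stack([d, W]), y)[0]

# (2) Partial W out of y and d, then regress residual on residual
res_y = y - W @ ols(W, y)
res_d = d - W @ ols(W, d)
beta_fwl = (res_d @ res_y) / (res_d @ res_d)

print(beta_full, beta_fwl)  # identical up to floating-point error
```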

We know that the partialling-out approach works well when the dimension of $W$ is low
in relation to the sample size $n$. When the dimension of $W$ is relatively high, we need to use variable selection
or penalization for regularization purposes. 
