## Group3 - Replication_1
#### Members
1. Andrea Ulloa (20172597)
2. Ana Angulo (20171627)
3. Angela Coapaza (20171636) 



## Question 1:

### An Inferential Problem: The College-Educated Wage Gap
Using data from the March Supplement of the U.S. Current Population Survey (2015), this lab focuses on the wages of college-educated workers and addresses the following inference question:

What is the difference in predicted wages between workers with some college education (scl) and college graduates (clg)?

To investigate the College-Educated Wage Gap, we consider the following log-linear regression model:

\begin{align}
\log(Y) &= \beta'X + \epsilon\\
&= \beta_1 SCL  + \beta_2 CLG + \beta_3'W  + \epsilon,
\end{align}

where SCL is an indicator for workers with some college education (1 if yes, 0 otherwise), CLG is an indicator for college graduates (1 if yes, 0 otherwise), and the $W$'s are controls explaining variation in wages. Because wages are log-transformed, the coefficients measure relative (approximately percentage) differences in pay between workers with some college education and college graduates.
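To make the model concrete, here is a minimal sketch on simulated data (not the CPS sample). It assumes every worker is either SCL or CLG, a single control (experience), no separate intercept, and made-up coefficient values. The quantity of interest is the gap $\beta_2 - \beta_1$.

```python
import numpy as np

# Simulated data: all names and parameter values below are illustrative assumptions.
rng = np.random.default_rng(0)
n = 5000
clg = rng.integers(0, 2, size=n).astype(float)  # 1 = college graduate
scl = 1.0 - clg                                  # 1 = some college (complement of clg)
exp1 = rng.uniform(0, 40, size=n)                # a single control W

beta_scl, beta_clg, beta_exp = 2.80, 3.05, 0.01  # assumed "true" coefficients
lwage = beta_scl * scl + beta_clg * clg + beta_exp * exp1 + rng.normal(0, 0.5, size=n)

# Least-squares fit of log(Y) on SCL, CLG and W.
# No intercept is added because SCL + CLG already sums to one.
X = np.column_stack([scl, clg, exp1])
coef, *_ = np.linalg.lstsq(X, lwage, rcond=None)

gap = coef[1] - coef[0]  # estimated log-wage gap, CLG minus SCL
print(f"estimated gap: {gap:.3f} (assumed true gap: {beta_clg - beta_scl:.3f})")
```

With an intercept in the regression, one of the two indicators would have to be dropped, and the coefficient on the remaining one would be the gap directly.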


### Data Analysis

We consider the same subsample of the U.S. Current Population Survey (2015). Let us load the data set.

***Variable description***

- occ: occupational classification
- ind: industry classification
- lwage: log hourly wage
- sex: gender (1 = female, 0 = male)
- shs: some high school
- hsg: high school graduate
- scl: some college
- clg: college graduate
- ad: advanced degree
- ne: Northeast
- mw: Midwest
- so: South
- we: West
- exp1: experience

In [35]:
# Import relevant packages
import pandas as pd
import numpy as np
import pyreadr as rr
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [13]:
# Extracting the data
total_data  = pd.read_csv(r'../../data/wage2015_subsample_inference.csv')
total_data['occ']=pd.Categorical(total_data.occ)
total_data['occ2']=pd.Categorical(total_data.occ2)
total_data['ind']=pd.Categorical(total_data.ind)
total_data['ind2']=pd.Categorical(total_data.ind2)

#Determining the dimension of our data set.
total_data.shape
# There are 5150 obs and 21 features

(5150, 21)

We focus on the subset of college-educated workers (the scl and clg variables). We therefore keep only the observations for workers who have some college education or who have graduated from college.

In [12]:
print(total_data['scl'].value_counts()) # there are 1432
print(total_data['clg'].value_counts()) # there are 1636
# In our new base we should have 3068 observations

0.0    3718
1.0    1432
Name: scl, dtype: int64
0.0    3514
1.0    1636
Name: clg, dtype: int64


In [21]:
data = total_data[(total_data['scl'] == 1) | (total_data['clg'] == 1)]
data
# The new DataFrame is correct: it has 3068 observations

Unnamed: 0,rownames,wage,lwage,sex,shs,hsg,scl,clg,ad,mw,...,we,ne,exp1,exp2,exp3,exp4,occ,occ2,ind,ind2
0,10,9.615385,2.263364,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,7.0,0.49,0.343,0.2401,3600.0,11,8370.0,18
1,12,48.076923,3.872802,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,31.0,9.61,29.791,92.3521,3050.0,10,5070.0,9
4,19,28.846154,3.361977,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,22.0,4.84,10.648,23.4256,2015.0,6,9470.0,22
5,30,11.730769,2.462215,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,1.0,0.01,0.001,0.0001,1650.0,5,7460.0,14
9,71,19.230769,2.956512,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,4.0,0.16,0.064,0.0256,3255.0,10,8190.0,18
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5140,32596,45.546559,3.818735,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,5.0,0.25,0.125,0.0625,3255.0,10,8190.0,18
5143,32606,24.038462,3.179655,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,17.0,2.89,4.913,8.3521,2550.0,8,9480.0,22
5144,32619,13.846154,2.628007,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,10.0,1.00,1.000,1.0000,800.0,2,770.0,4
5145,32620,14.769231,2.692546,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,9.0,0.81,0.729,0.6561,4700.0,16,4970.0,9


Let us describe the main variables that we will use later to estimate the (causal) effect of sex on wages:

wage, log-wage, sex, some college, college graduate, advanced degree, experience

In [29]:
data.describe()
#"lwage","sex","scl","clg","ne","mw","so","we","exp1" - to regression
#"wage, lwage","sex","scl","clg","ad","ne","mw","so","we","exp1" - to describe

Unnamed: 0,rownames,wage,lwage,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4
count,3068.0,3068.0,3068.0,3068.0,3068.0,3068.0,3068.0,3068.0,3068.0,3068.0,3068.0,3068.0,3068.0,3068.0,3068.0,3068.0,3068.0
mean,15716.577249,23.657384,3.000022,0.470991,0.0,0.0,0.466754,0.533246,0.0,0.265971,0.285854,0.221643,0.226532,12.700945,2.676344,7.133814,21.345586
std,9752.832944,19.3677,0.54451,0.499239,0.0,0.0,0.498975,0.498975,0.0,0.441921,0.451894,0.41542,0.418655,10.312857,3.766616,13.285832,47.858967
min,10.0,3.021978,1.105912,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.01,0.001,0.0001
25%,7262.0,14.17004,2.65113,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.5,0.2025,0.091125,0.041006
50%,15089.5,19.230769,2.956512,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,9.0,0.81,0.729,0.6561
75%,24595.5,27.990239,3.331855,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,19.0,3.61,6.859,13.0321
max,32624.0,490.196078,6.194805,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,42.5,18.0625,76.765625,326.253906


We now start our (causal) analysis.

First, we compare the sample means by gender:

In [28]:
Z = data[ ["lwage","sex","scl","clg","ne","mw","so","we","exp1" ] ]

data_female = data[data[ 'sex' ] == 1 ]
Z_female = data_female[ ["lwage","sex","scl","clg","ne","mw","so","we","exp1" ] ]

data_male = data[ data[ 'sex' ] == 0 ]
Z_male = data_male[ ["lwage","sex","scl","clg","ne","mw","so","we","exp1" ] ]


table = np.zeros( (9, 3) ) # 9 rows, 3 columns
table[:, 0] = Z.mean().values # mean of each variable
table[:, 1] = Z_male.mean().values
table[:, 2] = Z_female.mean().values
table_pandas = pd.DataFrame( table, columns = [ 'All', 'Men', 'Women'])
table_pandas.index = ["Log Wage","Sex","Some College","College Graduate", "Northeast","Midwest","South","West","Experience"]

table_pandas

Unnamed: 0,All,Men,Women
Log Wage,3.000022,3.038412,2.956904
Sex,0.470991,0.0,1.0
Some College,0.466754,0.481824,0.449827
College Graduate,0.533246,0.518176,0.550173
Northeast,0.226532,0.219347,0.234602
Midwest,0.265971,0.261245,0.27128
South,0.285854,0.290819,0.280277
West,0.221643,0.228589,0.213841
Experience,12.700945,12.433148,13.00173


In [34]:
data_female["lwage"].mean() - data_male["lwage"].mean() # On average, women earn about 8.15 log points (roughly 8%) less than men

-0.08150855508736621

In this sample of workers who have attended or finished college, women earn less on average than men. It is also interesting to note that average experience in this sample is about 12.7 years, and that women have slightly more experience on average than men (13.0 versus 12.4 years) but still earn less.
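The difference in mean log wages is measured in log points, which only approximates a percentage difference. As a small sketch, the value printed by the notebook can be converted to an exact relative difference with $e^{\Delta} - 1$:

```python
import math

# Difference in mean log wages (women minus men), copied from the output above.
log_gap = -0.08150855508736621

# exp(log_gap) - 1 is the relative difference implied by the gap in mean log wages
# (a ratio of geometric means, not of arithmetic means).
pct_gap = math.expm1(log_gap)
print(f"log-point gap: {log_gap:.4f}, implied relative gap: {pct_gap:.4f}")
```

The implied relative gap is about 7.8%, slightly smaller in magnitude than the 8.15 log points.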

This unconditional (predictive) effect of gender equals the coefficient $\beta$ in the univariate OLS regression of $Y$ on $D$, where $D$ is the female indicator (sex) and the regression includes an intercept:

\begin{align}
\log(Y) &=\beta D + \epsilon.
\end{align}

We therefore run the unconditional OLS regression:

In [None]:
nocontrol_model = smf.ols( formula = 'lwage ~ sex', data = data ) # the coefficient on sex should equal the difference in means found above
nocontrol_fit = nocontrol_model.fit() # fit once and reuse the results

nocontrol_est = nocontrol_fit.params['sex']
nocontrol_se2 = nocontrol_fit.bse['sex'] # classical standard error of the sex coefficient

# Robust standard error
HCV_coefs = nocontrol_fit.cov_HC0 # heteroskedasticity-robust (HC0) variance-covariance matrix
nocontrol_se = np.power( HCV_coefs.diagonal() , 0.5)[1] # square root of the diagonal (the variances); index 1 is the sex coefficient

# print unconditional effect of gender and the corresponding standard error

print( f'The estimated gender coefficient is {nocontrol_est} and the corresponding standard error is {nocontrol_se2}' )
print( f'The estimated gender coefficient is {nocontrol_est} and the corresponding robust standard error is {nocontrol_se}','\n' )
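The claim that the OLS coefficient on a binary regressor equals the difference in group means can be checked numerically. The sketch below uses simulated data and plain NumPy rather than the CPS sample and statsmodels. The HC0 formula $(X'X)^{-1} X' \mathrm{diag}(\hat{e}^2) X (X'X)^{-1}$ is written out by hand for illustration, and all variable names and parameter values are assumptions.

```python
import numpy as np

# Simulated log wages with a binary regressor d (illustrative values only).
rng = np.random.default_rng(1)
n = 1000
d = rng.integers(0, 2, size=n).astype(float)
y = 3.0 - 0.08 * d + rng.normal(0, 0.5, size=n)

# OLS of y on an intercept and d.
X = np.column_stack([np.ones(n), d])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Difference in sample means between the two groups.
mean_diff = y[d == 1].mean() - y[d == 0].mean()

# HC0 sandwich covariance: bread @ meat @ bread.
resid = y - X @ coef
bread = np.linalg.inv(X.T @ X)
meat = X.T @ (X * resid[:, None] ** 2)
se_hc0 = np.sqrt(np.diag(bread @ meat @ bread))[1]

print(f"OLS slope: {coef[1]:.6f}, difference in means: {mean_diff:.6f}, HC0 s.e.: {se_hc0:.6f}")
```

The slope and the difference in means agree up to floating-point error, which is the identity the regression above relies on.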

## Proof of the Frisch-Waugh-Lovell Theorem

For the proof of the theorem, we use the following:

1. The partialling-out operation

    Consider the following equation:
    $V = \beta W + e$
        
    $\tilde{V} = V - \alpha_{VW} W$
    
    $\alpha_{VW}$ is $\hat{\beta}$, the estimated coefficient from the regression of $V$ on $W$.
    
    We are creating a "residual" $\tilde{V}$ by subtracting from $V$ the part
    that is linearly predicted by $W$.
    

2. Partialling-out is linear: if a vector is the sum of two vectors, then its residual is the sum of their residuals (all residualized on the same regressors).

 $Y = V + U \longrightarrow \tilde{Y} = \tilde{V} + \tilde{U}$
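Both ingredients can be illustrated numerically. The following sketch uses simulated vectors and a hypothetical helper `partial_out`. It residualizes a vector on a set of regressors by least squares, then checks that the residual of a sum equals the sum of the residuals.

```python
import numpy as np

def partial_out(v, W):
    """Residual of v after least-squares projection on the columns of W."""
    alpha, *_ = np.linalg.lstsq(W, v, rcond=None)
    return v - W @ alpha

# Simulated regressors and two arbitrary vectors (illustrative only).
rng = np.random.default_rng(2)
n = 200
W = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
V = rng.normal(size=n)
U = rng.normal(size=n)
Y = V + U

# Linearity: residualizing the sum equals summing the residuals.
lhs = partial_out(Y, W)
rhs = partial_out(V, W) + partial_out(U, W)
print("max abs difference:", np.max(np.abs(lhs - rhs)))

# Residuals are orthogonal to the regressors they were partialled out on.
print("max abs W'V_tilde:", np.max(np.abs(W.T @ partial_out(V, W))))
```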
    
    

Now consider the following regression:

\begin{align} 
Y= T \beta_1 + X \beta_2 + e \tag{1}
\end{align}

where,

$T$: treatment variable

$\beta_1$: parameter that captures the causal effect

$X$: other regressors

$e$: error

Since we are interested only in the value of $\beta_1$, we apply the partialling-out operation with respect to $X$ to both sides of the regression equation:

\begin{align} 
\tilde{Y}= \tilde{T}\beta_1 +  \tilde{X}\beta_2 + \tilde{e} \tag{2}
\end{align}

- Because the regression equation is linear, the linearity property defined at the beginning lets us residualize it term by term.

- Each term in equation [2] is the residual from regressing that variable on $X$:

 $\tilde{Y} = Y - \alpha_{YX} X$
 
 $\tilde{T} = T - \alpha_{TX} X$

 $\tilde{X} = X - \alpha_{XX} X$

 $\tilde{e} = e - \alpha_{eX} X$
 

- Some of these residuals vanish or simplify:

 $\alpha_{XX} = I$, so $\tilde{X} = 0$.
 
 By assumption, $E(e | X) = 0$, so $\alpha_{eX} = 0$ and $\tilde{e} = e$.
 

- So equation [2] reduces to equation [3]:

\begin{align} 
\tilde{Y}= \tilde{T}\beta_1 + {e} \tag{3}
\end{align}

Finally, we arrive at what the Frisch-Waugh-Lovell theorem states: equation [1], which contains many regressors in $X$, can be reduced to a simple regression on residuals (obtained by partialling out the linear effect of $X$ from both $Y$ and $T$) that contains only the parameter we are interested in estimating.

The estimate of $\beta_1$ from equation [1] is equal to the estimate of $\beta_1$ from regression [3].
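The result can be verified numerically. The sketch below uses simulated data, and the helper `residualize` and all coefficient values are illustrative assumptions. It compares the coefficient on $T$ from the full regression with the slope from regressing $\tilde{Y}$ on $\tilde{T}$.

```python
import numpy as np

def residualize(v, X):
    """Residual of v after least-squares projection on the columns of X."""
    alpha, *_ = np.linalg.lstsq(X, v, rcond=None)
    return v - X @ alpha

# Simulated data: T is correlated with the controls X (illustrative values only).
rng = np.random.default_rng(3)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
T = X @ np.array([0.2, 0.5, -0.3, 0.1]) + rng.normal(size=n)
Y = 1.5 * T + X @ np.array([1.0, -0.7, 0.4, 0.2]) + rng.normal(size=n)

# Full regression of Y on T and X: the first coefficient is beta_1.
beta_full, *_ = np.linalg.lstsq(np.column_stack([T, X]), Y, rcond=None)
beta1_full = beta_full[0]

# Residual-on-residual regression after partialling X out of Y and T.
Y_tilde = residualize(Y, X)
T_tilde = residualize(T, X)
beta1_fwl = (T_tilde @ Y_tilde) / (T_tilde @ T_tilde)

print(f"full regression: {beta1_full:.8f}, partialled-out regression: {beta1_fwl:.8f}")
```

The two estimates coincide up to floating-point error, regardless of the assumed coefficient values.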