# Application: Heterogeneous Effect of Gender on Wage Using Double Lasso

We use US census data from the year 2012 to jointly analyse the effect of gender on wage and the interaction effects of other variables with gender. The dependent variable is the logarithm of the wage; the target variable is *female*, both on its own and in combination with other variables. All other variables describe socio-economic characteristics, e.g. marital status, education, and experience. For a detailed description of the variables, we refer to the help page of the data set.



This analysis allows a closer look at how discrimination by gender is related to other socio-economic variables.



First, we have to install the following packages:

In [1]:
install.packages(c("xtable","hdm"))


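We consider the high-dimensional linear regression model:

  $$
  Y =  \alpha_1 D + \alpha_2 DZ +  \beta W + \varepsilon,
  $$

where $Y$ is the log wage, $D$ is the *female* indicator, $Z$ collects the characteristics that are interacted with $D$, and $W$ collects the remaining controls.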

We can load the data as follows:

In [9]:
library(hdm)
data(cps2012)
str(cps2012)

'data.frame':	29217 obs. of  23 variables:
 $ year        : num  2012 2012 2012 2012 2012 ...
 $ lnw         : num  1.91 1.37 2.54 1.8 3.35 ...
 $ female      : num  1 1 0 1 0 0 0 0 0 1 ...
 $ widowed     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ divorced    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ separated   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ nevermarried: num  0 0 0 0 0 0 1 0 0 0 ...
 $ hsd08       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hsd911      : num  0 1 0 0 0 0 0 0 0 0 ...
 $ hsg         : num  0 0 1 1 0 1 1 0 0 0 ...
 $ cg          : num  0 0 0 0 1 0 0 0 1 0 ...
 $ ad          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ mw          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ so          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ we          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ exp1        : num  22 30 19 14 15 23 33 23.5 15 15.5 ...
 $ exp2        : num  4.84 9 3.61 1.96 2.25 ...
 $ exp3        : num  10.65 27 6.86 2.74 3.38 ...
 $ exp4        : num  23.43 81 13.03 3.84 5.06 ...
 $ weight      : num  569 626 264 257 257 ...
 $ 

You can use the following model matrix to estimate the target parameters $\alpha_1$ and $\alpha_2$:

In [10]:
# create the model matrix for the covariates
X <- model.matrix(~ -1 + female +
    female:(widowed + divorced + separated + nevermarried +
      hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3) +
    (widowed + divorced + separated + nevermarried +
      hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3)^2,
  data = cps2012)

X <- X[, which(apply(X, 2, var) != 0)] # exclude all constant variables
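
To see how the formula operators expand into columns, the following sketch builds a model matrix from a small made-up data frame. The data frame `toy` and the variables `married` and `educ` are hypothetical and not part of `cps2012`; only the formula pattern mirrors the cell above.

```r
# Toy data frame; all values and the names `married` and `educ` are invented
toy <- data.frame(
  female  = c(1, 0, 1, 0, 1, 0),
  married = c(1, 1, 0, 0, 1, 0),
  educ    = c(12, 16, 12, 18, 14, 16)
)

# Same pattern as above: main effect of female, female interacted with each
# covariate, and the covariates together with their pairwise interactions
X_toy <- model.matrix(~ -1 + female + female:(married + educ) + (married + educ)^2,
                      data = toy)

# Inspect which columns the formula generated
print(colnames(X_toy))
print(dim(X_toy))
```

The `:` operator creates only the product terms, while `(...)^2` creates the main effects plus all pairwise products of the variables inside the parentheses.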

**Exercise 1:** Have a look at the proposed model matrix above and compare it with the definition of the linear regression model. Which variables are included in $Z$ and $W$?

**Exercise 2:** Demean the model matrix $X$, i.e., calculate the mean of each variable and subtract it from each observation. Why could this be important? Hint: It could be helpful to use the function *apply()*.
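
As a minimal sketch of column-wise demeaning with *apply()*, the example below uses a small made-up matrix `X_small` in place of the real model matrix; the object names are assumptions for illustration only.

```r
# Small made-up matrix standing in for the model matrix X
X_small <- cbind(a = c(1, 2, 3, 6), b = c(0, 1, 0, 1))

# Subtract each column's mean from every entry of that column
X_centered <- apply(X_small, 2, function(col) col - mean(col))

# Column means are now zero up to floating-point error
print(colMeans(X_centered))

# scale() with scale = FALSE performs the same centering
same <- isTRUE(all.equal(X_centered,
                         scale(X_small, center = TRUE, scale = FALSE),
                         check.attributes = FALSE))
print(same)
```

Either approach could be applied to $X$ directly; *apply()* keeps the column names, which matters for selecting columns by name later.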

**Exercise 3:** Specify the relevant indices of the columns of $X$ that we are interested in and save them as the variable "index.gender".
Hint: These are the indices of the covariates corresponding to $\alpha_1$ and $\alpha_2$. The function *grep()* can be helpful.
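
The way *grep()* returns positions of matching names can be sketched on a made-up vector of column names. The vector `cols` below is hypothetical and only mimics the naming scheme that *model.matrix()* uses for interaction terms.

```r
# Made-up column names mimicking the naming scheme of model.matrix()
cols <- c("female", "married", "educ",
          "female:married", "female:educ", "married:educ")

# grep() returns the positions of all names that match the pattern
idx <- grep("female", cols)

print(idx)        # positions of the matching names
print(cols[idx])  # the matching names themselves
```

Applied to `colnames(X)`, the same call would yield the positions of all columns whose name contains the pattern.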

**Exercise 4:** Use the function *rlassoEffects* from the *hdm* package to estimate the parameters of interest. Have a look at the estimated coefficients. Do we have any significant heterogeneity? Hint: You just need to specify the *index* input parameter according to Exercise 3.

**Exercise 5:** Due to multiple testing issues, have a look at the joint confidence intervals (e.g., $90\%$ confidence intervals). Do we still have significant heterogeneity? Hint: Set "joint = TRUE" in the *confint()* function.
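
The workflow of Exercises 4 and 5 can be sketched on simulated data. Everything about the data below (the sample size, the variable names `d`, `z1`, `z2`, `w1` to `w10`, and the coefficients) is an assumption made for illustration and is not `cps2012`. The block only runs the estimation if the *hdm* package is installed.

```r
set.seed(1)
n <- 500

# Simulated stand-in data (not cps2012): one binary treatment, two covariates
# that may interact with it, and ten further controls
d <- rbinom(n, 1, 0.5)
z <- matrix(rnorm(n * 2), n, 2, dimnames = list(NULL, c("z1", "z2")))
w <- matrix(rnorm(n * 10), n, 10, dimnames = list(NULL, paste0("w", 1:10)))
y <- -0.2 * d + 0.3 * d * z[, 1] + z[, 1] + w[, 1] + rnorm(n)

# Design matrix: treatment, its interactions, and controls; columns are demeaned
X_sim <- cbind(d = d, "d:z1" = d * z[, 1], "d:z2" = d * z[, 2], z, w)
X_sim <- apply(X_sim, 2, function(col) col - mean(col))

# Positions of the target columns (names starting with "d")
index_d <- grep("^d", colnames(X_sim))

fit <- NULL
if (requireNamespace("hdm", quietly = TRUE)) {
  # Double lasso for each target coefficient, using all other columns as controls
  fit <- hdm::rlassoEffects(x = X_sim, y = y, index = index_d)
  print(summary(fit))

  # Pointwise versus joint 90% confidence intervals
  print(confint(fit, level = 0.90))
  print(confint(fit, level = 0.90, joint = TRUE))
}
```

For the wage data, the same calls would take the demeaned model matrix `X`, the outcome `cps2012$lnw`, and `index.gender` from Exercise 3. Comparing the two `confint()` outputs shows how much the intervals widen once the multiple-testing adjustment is applied.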