# Application: Heterogeneous Effect of Sex on Wage Using Double Lasso

We use US census data from 2015 to jointly analyse the effect of sex on wage and the interaction effects of other variables with sex. The dependent variable is the logarithm of the wage; the target variable is *female*, stored in the `sex` column (in combination with other variables). All other variables denote other socio-economic characteristics, e.g. marital status, education, and experience. For a detailed description of the variables, we refer to the help page.



This analysis allows a closer look at how discrimination according to sex is related to other socio-economic variables.

In [None]:
# Import relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from scipy.stats import norm
from sklearn.linear_model import LassoCV, Lasso, LinearRegression
import patsy
import warnings
import statsmodels.api as sm
warnings.simplefilter('ignore')
np.random.seed(1234)

In [None]:
file = "https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/wage2015_subsample_inference.csv"
data = pd.read_csv(file)

In [None]:
data.describe()

Define outcome and regressors

In [None]:
y = np.log(data['wage']).values
Z = data.drop(['wage', 'lwage'], axis=1)
Z.columns

## Feature Engineering

Construct all our control variables

In [None]:
# Ultra-flexible controls: all pairwise interactions (around 1k variables)
Zcontrols = patsy.dmatrix('0 + (shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we+exp1+exp2+exp3+exp4)**2',
                           Z, return_type='dataframe')

Zcontrols = Zcontrols - Zcontrols.mean(axis=0)

Construct all the variables that we will use to model heterogeneity of effect in a linear manner

In [None]:
Zhet = patsy.dmatrix('0 + (shs+hsg+scl+clg+mw+so+we+exp1+exp2+exp3+exp4)',
                      Z, return_type='dataframe')
Zhet = Zhet - Zhet.mean(axis=0)

Construct all interaction variables between sex and heterogeneity variables

In [None]:
Zhet['sex'] = Z['sex']
Zinteractions = patsy.dmatrix('0 + sex + sex * (shs+hsg+scl+clg+mw+so+we+exp1+exp2+exp3+exp4)',
                               Zhet, return_type='dataframe')
interaction_cols = [c for c in Zinteractions.columns if c.startswith('sex')]

Put all the variables together

In [None]:
X = pd.concat([Zinteractions, Zcontrols], axis=1)
X.shape

## Double Lasso for All Interactive Effects

We use "plug-in" tuning with a theoretically valid choice of penalty $\lambda = 2 \cdot c \hat{\sigma} \Phi^{-1}(1-\alpha/(2p))/\sqrt{n}$, where $c>1$, $1-\alpha$ is a confidence level, $\Phi^{-1}$ denotes the standard normal quantile function, $p$ is the number of regressors, and $n$ is the sample size. Under homoskedasticity, this choice ensures that the Lasso predictor is well behaved, delivering good predictive performance under approximate sparsity. In practice, this formula works well even in the absence of homoskedasticity, especially when the random variables $\epsilon$ and $X$ in the regression equation have tails that decay quickly.

In practice, many people choose to use cross-validation, which is perfectly fine for predictive tasks. However, when conducting inference, to make our analysis valid we will require cross-fitting in addition to cross-validation. As we have not yet discussed cross-fitting, we rely on this theoretically-driven penalty in order to allow for accurate inference in the upcoming notebooks.

Here, we use a convenient, conservative bound $\hat{\sigma} = \sqrt{\mathrm{Var}(Y)}$. The iterative estimation of $\hat{\sigma}$ is provided by RLasso (see later in the notebook).

Note: In the book, we multiply by $\sqrt{n}$ instead of dividing by it. This is because the Lasso objective there minimizes the sum of squared errors, whereas sklearn's Lasso objective averages the squared errors over the sample.
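The penalty formula above can be sketched as a small helper. This is a minimal illustration only: the function name `plugin_penalty` and the example values of `n` and `p` are hypothetical, and the standard library's `NormalDist` stands in for `scipy.stats.norm` so that the snippet runs without extra dependencies.

```python
from math import sqrt
from statistics import NormalDist


def plugin_penalty(sigma_hat, n, p, a=0.05, const=1.1):
    """Plug-in penalty 2 * c * sigma_hat * Phi^{-1}(1 - a / (2p)) / sqrt(n)."""
    quantile = NormalDist().inv_cdf(1 - a / (2 * p))
    return 2 * const * sigma_hat * quantile / sqrt(n)


# Hypothetical sample sizes: quadrupling n halves the penalty
lam_small_n = plugin_penalty(sigma_hat=1.0, n=1_000, p=500)
lam_large_n = plugin_penalty(sigma_hat=1.0, n=4_000, p=500)
print(lam_small_n, lam_large_n)
```

The penalty grows only slowly with the number of regressors `p` (through the normal quantile) and shrinks at rate $1/\sqrt{n}$.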

In [None]:
a = 0.05
const = 1.1

For each target predictive effect, we estimate it via the partialling-out procedure and compute the quantities needed for the covariance calculation: the residualized outcome, the residualized target variable, and the final-stage residual epsilon.
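For reference, the quantities computed in the cells below can be written out as follows. The notation is our own shorthand and not taken from the book: $\tilde{Y}_j$ and $\tilde{D}_j$ denote the Lasso residuals of the outcome and of the $j$-th target variable on the remaining regressors, and $\mathbb{E}_n$ denotes the sample average.

$$
\hat{\alpha}_j = \frac{\mathbb{E}_n[\tilde{D}_j \tilde{Y}_j]}{\mathbb{E}_n[\tilde{D}_j^2]}, \qquad
\hat{\epsilon}_j = \tilde{Y}_j - \hat{\alpha}_j \tilde{D}_j,
$$

$$
\hat{V}_{jk} = \frac{\mathbb{E}_n[\tilde{D}_j \hat{\epsilon}_j \hat{\epsilon}_k \tilde{D}_k]}{\mathbb{E}_n[\tilde{D}_j^2]\,\mathbb{E}_n[\tilde{D}_k^2]}, \qquad
\widehat{\mathrm{se}}_j = \sqrt{\hat{V}_{jj}/n}.
$$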

In [None]:
alpha = {}
res_y, res_D, epsilon = {}, {}, {}
for c in interaction_cols:
    print(f"Double Lasso for target variable {c}")
    D = X[c].values
    W = X.drop([c], axis=1)

    # Plug-in penalty for the outcome equation, using hatsigma = std(y)
    hatsigma = np.std(y)
    lmbda_theory = 2*const*hatsigma*norm.ppf(1-a/(2*X.shape[1]))/np.sqrt(X.shape[0])
    lasso_model = lambda: make_pipeline(StandardScaler(), Lasso(alpha=lmbda_theory))
    res_y[c] = y - lasso_model().fit(W, y).predict(W)

    # Plug-in penalty for the target equation, using hatsigma = std(D)
    hatsigma = np.std(D)
    lmbda_theory = 2*const*hatsigma*norm.ppf(1-a/(2*X.shape[1]))/np.sqrt(X.shape[0])
    lasso_model = lambda: make_pipeline(StandardScaler(), Lasso(alpha=lmbda_theory))
    res_D[c] = D - lasso_model().fit(W, D).predict(W)

    # Last Stage
    final = LinearRegression(fit_intercept=False).fit(res_D[c].reshape(-1, 1), res_y[c])
    epsilon[c] = res_y[c] - final.predict(res_D[c].reshape(-1, 1))
    alpha[c] = [final.coef_[0]]

In [None]:
# Calculate the covariance matrix of the estimated parameters
V = np.zeros((len(interaction_cols), len(interaction_cols)))
for it, c in enumerate(interaction_cols):
    Jc = np.mean(res_D[c]**2)
    for itp, cp in enumerate(interaction_cols):
        Jcp = np.mean(res_D[cp]**2)
        Sigma = np.mean(res_D[c] * epsilon[c] * epsilon[cp] * res_D[cp])
        V[it, itp] = Sigma / (Jc * Jcp)

# Calculate standard errors for each parameter
n = X.shape[0]
for it, c in enumerate(interaction_cols):
    alpha[c] += [np.sqrt(V[it, it] / n)]

# put all in a dataframe
df = pd.DataFrame.from_dict(alpha, orient='index', columns=['point', 'stderr'])

# Calculate pointwise p-values and 95% confidence intervals
summary = pd.DataFrame()
summary['Estimate'] = df['point']
summary['Std. Error'] = df['stderr']
summary['p-value'] = norm.sf(np.abs(df['point'] / df['stderr']), loc=0, scale=1) * 2
summary['ci_lower'] = df['point'] - 1.96 * df['stderr']
summary['ci_upper'] = df['point'] + 1.96 * df['stderr']
summary

In [None]:
# Joint Intervals
Drootinv = np.diagflat(1/np.sqrt(np.diag(V)))
scaledCov = Drootinv @ V @ Drootinv
np.random.seed(123)
U = np.random.multivariate_normal(np.zeros(scaledCov.shape[0]), scaledCov, size=10000)
z = np.max(np.abs(U), axis=1)
c = np.percentile(z, 95)
summary = pd.DataFrame()
summary['Estimate'] = df['point']
summary['CI lower'] = df['point'] - c * df['stderr']
summary['CI upper'] = df['point'] + c * df['stderr']
summary
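The joint intervals above replace the pointwise critical value 1.96 with a simulated quantile of the maximum absolute value of correlated standard normal draws. A minimal sketch of that step on its own, assuming a made-up 3x3 correlation matrix `corr` in place of `scaledCov` (all names and values here are hypothetical):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(123)

# Hypothetical correlation matrix standing in for scaledCov
corr = np.array([[1.0, 0.5, 0.2],
                 [0.5, 1.0, 0.3],
                 [0.2, 0.3, 1.0]])

# Simulate the maximum absolute value across the three coordinates
draws = rng.multivariate_normal(np.zeros(3), corr, size=10_000)
max_abs = np.max(np.abs(draws), axis=1)

joint_crit = np.percentile(max_abs, 95)                    # simultaneous 95% critical value
pointwise_crit = NormalDist().inv_cdf(0.975)               # usual pointwise critical value
bonferroni_crit = NormalDist().inv_cdf(1 - 0.05 / (2 * 3)) # Bonferroni benchmark
print(pointwise_crit, joint_crit, bonferroni_crit)
```

The simulated critical value should exceed the pointwise one, which is why the joint intervals are wider, while staying close to or below the Bonferroni benchmark because it accounts for the correlation between estimates.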

Alternatively, we can estimate the Lasso with the theoretically motivated penalty level using the hdmpy package, whose `rlasso` estimates $\hat{\sigma}$ iteratively. To install it, run:
```
!pip install multiprocess
!git clone https://github.com/maxhuppertz/hdmpy.git
```

You can run the cells below and then repeat the whole analysis above using the newly defined `lasso_model` variable.

In [None]:
import sys
sys.path.insert(1, "./hdmpy")

In [None]:
# We wrap the package so that it has the familiar sklearn API
import hdmpy
from sklearn.base import BaseEstimator, clone

class RLasso(BaseEstimator):

    def __init__(self, *, post=True):
        self.post = post

    def fit(self, X, y):
        self.rlasso_ = hdmpy.rlasso(X, y, post=self.post)
        return self

    def predict(self, X):
        return np.array(X) @ np.array(self.rlasso_.est['beta']).flatten() + np.array(self.rlasso_.est['intercept'])

lasso_model = lambda: RLasso(post=False)

In [None]:
alpha = {}
res_y, res_D, epsilon = {}, {}, {}
for c in interaction_cols:
    print(f"Double Lasso for target variable {c}")
    D = X[c].values
    W = X.drop([c], axis=1)
    res_y[c] = y - lasso_model().fit(W, y).predict(W)
    res_D[c] = D - lasso_model().fit(W, D).predict(W)
    final = LinearRegression(fit_intercept=False).fit(res_D[c].reshape(-1, 1), res_y[c])
    epsilon[c] = res_y[c] - final.predict(res_D[c].reshape(-1, 1))
    alpha[c] = [final.coef_[0]]

# Calculate the covariance matrix of the estimated parameters
V = np.zeros((len(interaction_cols), len(interaction_cols)))
for it, c in enumerate(interaction_cols):
    Jc = np.mean(res_D[c]**2)
    for itp, cp in enumerate(interaction_cols):
        Jcp = np.mean(res_D[cp]**2)
        Sigma = np.mean(res_D[c] * epsilon[c] * epsilon[cp] * res_D[cp])
        V[it, itp] = Sigma / (Jc * Jcp)

# Calculate standard errors for each parameter
n = X.shape[0]
for it, c in enumerate(interaction_cols):
    alpha[c] += [np.sqrt(V[it, it] / n)]

# put all in a dataframe
df = pd.DataFrame.from_dict(alpha, orient='index', columns=['point', 'stderr'])

# Calculate pointwise p-values and 95% confidence intervals
summary = pd.DataFrame()
summary['Estimate'] = df['point']
summary['Std. Error'] = df['stderr']
summary['p-value'] = norm.sf(np.abs(df['point'] / df['stderr']), loc=0, scale=1) * 2
summary['ci_lower'] = df['point'] - 1.96 * df['stderr']
summary['ci_upper'] = df['point'] + 1.96 * df['stderr']
summary

### Joint Confidence Intervals

In [None]:
Drootinv = np.diagflat(1/np.sqrt(np.diag(V)))
scaledCov = Drootinv @ V @ Drootinv
np.random.seed(123)
U = np.random.multivariate_normal(np.zeros(scaledCov.shape[0]), scaledCov, size=10000)
z = np.max(np.abs(U), axis=1)
c = np.percentile(z, 95)
summary = pd.DataFrame()
summary['Estimate'] = df['point']
summary['CI lower'] = df['point'] - c * df['stderr']
summary['CI upper'] = df['point'] + c * df['stderr']
summary

Instead of using $\sqrt{\mathrm{Var}(Y)}$ and $\sqrt{\mathrm{Var}(D)}$ as conservative bounds, we can get close to rlasso by estimating $\hat{\sigma}$ from the residuals of a preliminary cross-validated Lasso fit and then plugging that estimate into the penalty. Although CV itself is not valid for inference, we can use it to estimate $\hat{\sigma}$ because LassoCV is consistent.
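To see why the standard deviation of the raw variable is a conservative bound, consider the following sketch on synthetic data. It is an illustration only: ordinary least squares via `np.linalg.lstsq` is used as a simple stand-in for the cross-validated Lasso, and all names and values (`X_demo`, `beta_demo`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 5

# Synthetic regression with noise standard deviation equal to 1
X_demo = rng.normal(size=(n, p))
beta_demo = np.array([1.0, -0.5, 0.0, 0.0, 2.0])
y_demo = X_demo @ beta_demo + rng.normal(size=n)

# Conservative bound: ignores the variation explained by the regressors
sigma_conservative = np.std(y_demo)

# Refined estimate: standard deviation of residuals from a preliminary fit
design = np.column_stack([np.ones(n), X_demo])
coef, *_ = np.linalg.lstsq(design, y_demo, rcond=None)
sigma_refined = np.std(y_demo - design @ coef)

print(sigma_conservative, sigma_refined)
```

Because the refined estimate is smaller, the resulting plug-in penalty is smaller too, so the Lasso shrinks less aggressively.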


In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=123)
tmp = lambda: make_pipeline(StandardScaler(), LassoCV(cv=cv))

In [None]:
alpha = {}
res_y, res_D, res_y_tmp, res_D_tmp, epsilon = {}, {}, {}, {}, {}
for c in interaction_cols:
    print(f"Double Lasso for target variable {c}")
    D = X[c].values
    W = X.drop([c], axis=1)
    res_y_tmp[c] = y - tmp().fit(W, y).predict(W)
    res_D_tmp[c] = D - tmp().fit(W, D).predict(W)

    # Plug-in penalty for the outcome equation, with hatsigma from the CV residuals
    hatsigma = np.std(res_y_tmp[c])
    lmbda_theory = 2*const*hatsigma*norm.ppf(1-a/(2*X.shape[1]))/np.sqrt(X.shape[0])
    lasso_model = lambda: make_pipeline(StandardScaler(), Lasso(alpha=lmbda_theory))
    res_y[c] = y - lasso_model().fit(W, y).predict(W)

    # Plug-in penalty for the target equation, with hatsigma from the CV residuals
    hatsigma = np.std(res_D_tmp[c])
    lmbda_theory = 2*const*hatsigma*norm.ppf(1-a/(2*X.shape[1]))/np.sqrt(X.shape[0])
    lasso_model = lambda: make_pipeline(StandardScaler(), Lasso(alpha=lmbda_theory))
    res_D[c] = D - lasso_model().fit(W, D).predict(W)

    # final stage
    final = LinearRegression(fit_intercept=False).fit(res_D[c].reshape(-1, 1), res_y[c])
    epsilon[c] = res_y[c] - final.predict(res_D[c].reshape(-1, 1))
    alpha[c] = [final.coef_[0]]

In [None]:
# Calculate the covariance matrix of the estimated parameters
V = np.zeros((len(interaction_cols), len(interaction_cols)))
for it, c in enumerate(interaction_cols):
    Jc = np.mean(res_D[c]**2)
    for itp, cp in enumerate(interaction_cols):
        Jcp = np.mean(res_D[cp]**2)
        Sigma = np.mean(res_D[c] * epsilon[c] * epsilon[cp] * res_D[cp])
        V[it, itp] = Sigma / (Jc * Jcp)

# Calculate standard errors for each parameter
n = X.shape[0]
for it, c in enumerate(interaction_cols):
    alpha[c] += [np.sqrt(V[it, it] / n)]

# put all in a dataframe
df = pd.DataFrame.from_dict(alpha, orient='index', columns=['point', 'stderr'])

# Calculate pointwise p-values and 95% confidence intervals
summary = pd.DataFrame()
summary['Estimate'] = df['point']
summary['Std. Error'] = df['stderr']
summary['p-value'] = norm.sf(np.abs(df['point'] / df['stderr']), loc=0, scale=1) * 2
summary['ci_lower'] = df['point'] - 1.96 * df['stderr']
summary['ci_upper'] = df['point'] + 1.96 * df['stderr']
summary