* Python code replication of: https://www.kaggle.com/victorchernozhukov/analyzing-rct-reemployment-experiment
* Created by: Alexander Quispe and Anzony Quispe 

# Analyzing RCT data with Precision Adjustment

## Data

In this lab, we analyze the Pennsylvania re-employment bonus experiment, which was previously studied in "Sequential testing of duration data: the case of the Pennsylvania ‘reemployment bonus’ experiment" (Bilias, 2000), among others. These experiments were conducted in the 1980s by the U.S. Department of Labor to test the incentive effects of alternative compensation schemes for unemployment insurance (UI). In these experiments, UI claimants were randomly assigned either to a control group or to one of six treatment groups. Here we focus on treatment group 4, but feel free to explore the other treatment groups. In the control group, the current rules of the UI system applied. Individuals in the treatment groups were offered a cash bonus if they found a job within a pre-specified period of time (the qualification period), provided that the job was retained for a specified duration. The treatments differed in the level of the bonus, the length of the qualification period, and whether the bonus declined over time within the qualification period; see http://qed.econ.queensu.ca/jae/2000-v15.6/bilias/readme.b.txt for further details on the data.
  

In [138]:
import pandas as pd

In [139]:
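## loading the data
# A raw string keeps '\s' from being read as an (invalid) Python escape sequence
Penn = pd.read_csv("../data/penn_jae.dat" , sep=r'\s', engine='python')
n = Penn.shape[0]
p_1 = Penn.shape[1]
# Keep the control group (tg == 0) and treatment group 4; copy so that later column assignments act on this frame
Penn = Penn[ (Penn['tg'] == 4) | (Penn['tg'] == 0) ].copy()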

In [140]:
# These columns were not dropped: Unnamed: 13, recall
Penn.columns
Penn.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5099 entries, 0 to 13911
Data columns (total 24 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   abdt         5099 non-null   int64  
 1   tg           5099 non-null   int64  
 2   inuidur1     5099 non-null   int64  
 3   inuidur2     5099 non-null   int64  
 4   female       5099 non-null   int64  
 5   black        5099 non-null   int64  
 6   hispanic     5099 non-null   int64  
 7   othrace      5099 non-null   int64  
 8   dep          5099 non-null   int64  
 9   q1           5099 non-null   int64  
 10  q2           5099 non-null   int64  
 11  q3           5099 non-null   int64  
 12  q4           5099 non-null   int64  
 13  Unnamed: 13  5099 non-null   int64  
 14  q5           5099 non-null   int64  
 15  q6           5099 non-null   int64  
 16  recall       5099 non-null   int64  
 17  agelt35      5099 non-null   int64  
 18  agegt54      5099 non-null   int64  
 19  durab

In [142]:
# Treatment indicator for group 4 (the dependent variable of the balance regression)
Penn['T4'] = (Penn['tg'] == 4).astype(int)

# Create category variable
Penn['dep'] = Penn['dep'].astype( 'category' )
Penn.head()

Unnamed: 0,abdt,tg,inuidur1,inuidur2,female,black,hispanic,othrace,dep,q1,...,q6,recall,agelt35,agegt54,durable,nondurable,lusd,husd,muld,T4
0,10824,0,18,18,0,0,0,0,2,0,...,0,0,0,0,0,0,1,0,,0
3,10824,0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,,0
4,10747,0,27,27,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,,0
11,10607,4,9,9,0,0,0,0,0,0,...,0,1,0,0,0,0,0,1,,1
12,10831,0,27,27,0,0,0,0,1,0,...,0,0,1,1,0,1,0,0,,0


### Model 
To evaluate the impact of the treatments on unemployment duration, we consider the linear regression model:

$$
Y =  D \beta_1 + W'\beta_2 + \varepsilon, \quad E \varepsilon (D,W')' = 0,
$$

where $Y$ is the log of the duration of unemployment, $D$ is a treatment indicator, and $W$ is a set of controls including age group dummies, gender, race, number of dependents, quarter of the experiment, location within the state, existence of recall expectations, and type of occupation. Here $\beta_1$ is the ATE, if the RCT assumptions hold rigorously.


We also consider the interactive regression model:

$$
Y =  D \alpha_1 + D W' \alpha_2 + W'\beta_2 + \varepsilon, \quad E \varepsilon (D,W', DW')' = 0,
$$
where $W$'s are demeaned (apart from the intercept), so that $\alpha_1$ is the ATE, if the RCT assumptions hold rigorously.

Under RCT, the projection coefficient $\beta_1$ has
the interpretation of the causal effect of the treatment on
the average outcome. We thus refer to $\beta_1$ as the average
treatment effect (ATE). Note that the covariates here are
independent of the treatment $D$, so we can identify $\beta_1$ by
a simple linear regression of $Y$ on $D$, without adding covariates.
However, we do add covariates in an effort to improve the
precision of our estimates of the average treatment effect.
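
The precision argument above can be illustrated with a small simulation. The following is only a sketch on synthetic data (the effect size, the covariate coefficients, and the helper name `ols_hc1` are assumptions made for illustration, not part of the Pennsylvania data or of the original notebook): the treatment is randomized, so both the two-sample regression and the covariate-adjusted regression target the same effect, but the adjusted one should have a smaller standard error.

```python
import numpy as np

def ols_hc1(X, y):
    """OLS coefficients with HC1 (Eicker-Huber-White) standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # "Meat" of the sandwich: sum_i e_i^2 x_i x_i'
    meat = (X * resid[:, None] ** 2).T @ X
    cov = XtX_inv @ meat @ XtX_inv * n / (n - k)
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)
n = 5000
true_ate = -0.08                    # hypothetical effect chosen for this simulation
W = rng.normal(size=(n, 3))         # pre-treatment covariates
D = rng.integers(0, 2, size=n)      # randomized treatment, independent of W
Y = 2.0 + true_ate * D + W @ np.array([0.8, -0.5, 0.3]) + rng.normal(size=n)

const = np.ones((n, 1))
b_cl, se_cl = ols_hc1(np.column_stack([const, D]), Y)       # no adjustment
b_cra, se_cra = ols_hc1(np.column_stack([const, D, W]), Y)  # with covariates

print(f"CL : estimate={b_cl[1]:.3f}, se={se_cl[1]:.3f}")
print(f"CRA: estimate={b_cra[1]:.3f}, se={se_cra[1]:.3f}")
```

In this synthetic design the covariates explain a large share of the outcome variance, so the gain in precision is much more visible than in the real data analyzed below.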

### Analysis

We consider 

*  classical 2-sample approach, no adjustment (CL)
*  classical linear regression adjustment (CRA)
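*  interactive regression adjustment (IRA)

and carry out robust inference. The original R notebook uses the *estimatr* package; in this Python replication we use *statsmodels* with the HC1 robust covariance.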

# Carry out covariate balance check

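In the original R notebook this is done using the "lm_robust" command, which, unlike base R's "lm", automatically reports Eicker-Huber-White standard errors instead of the classical non-robust formula based on the homoskedasticity assumption. In this Python replication we obtain heteroskedasticity-robust standard errors with statsmodels' `get_robustcov_results(cov_type = "HC1")`.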
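
To make the distinction between classical and Eicker-Huber-White standard errors concrete, here is a minimal NumPy sketch that computes both by hand on synthetic heteroskedastic data. The data-generating process and the function name `ols_standard_errors` are assumptions for illustration; the notebook itself relies on statsmodels for these calculations.

```python
import numpy as np

def ols_standard_errors(X, y):
    """Return OLS coefficients, classical SEs, and HC1 robust SEs."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta

    # Classical formula: sigma^2 (X'X)^{-1}, valid under homoskedasticity
    sigma2 = resid @ resid / (n - k)
    se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

    # HC1 sandwich: (X'X)^{-1} [sum_i e_i^2 x_i x_i'] (X'X)^{-1} * n/(n-k)
    meat = (X * resid[:, None] ** 2).T @ X
    cov_hc1 = XtX_inv @ meat @ XtX_inv * n / (n - k)
    se_hc1 = np.sqrt(np.diag(cov_hc1))
    return beta, se_classical, se_hc1

rng = np.random.default_rng(1)
n = 4000
x = rng.uniform(0, 2, size=n)
# Synthetic outcome whose noise standard deviation grows with x
y = 1.0 + 0.5 * x + rng.normal(size=n) * (0.2 + x ** 2)
X = np.column_stack([np.ones(n), x])

beta, se_classical, se_hc1 = ols_standard_errors(X, y)
print("slope:", round(beta[1], 3))
print("classical se:", round(se_classical[1], 4), " HC1 se:", round(se_hc1[1], 4))
```

In this design the noise variance is largest where the regressor is far from its mean, so the classical formula understates the uncertainty of the slope and the HC1 standard error comes out larger.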

In [143]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy

## Regress treatment on all covariates

In [144]:
model = "T4~(female+black+othrace+C(dep)+q2+q3+q4+q5+q6+agelt35+agegt54+durable+lusd+husd)**2"
model_results = smf.ols(model , data=Penn).fit().get_robustcov_results(cov_type = "HC1")

print(model_results.summary())
print( "Number of regressors in the basic model:",len(model_results.params), '\n')

                            OLS Regression Results                            
Dep. Variable:                     T4   R-squared:                       0.028
Model:                            OLS   Adj. R-squared:                  0.007
Method:                 Least Squares   F-statistic:                     26.59
Date:                Thu, 18 Mar 2021   Prob (F-statistic):               0.00
Time:                        19:53:39   Log-Likelihood:                -3360.7
No. Observations:                5099   AIC:                             6941.
Df Residuals:                    4989   BIC:                             7660.
Df Model:                         109                                         
Covariance Type:                  HC1                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept               0.3890    



## Regress treatment on specific covariates

In [172]:
# 2. We need to take care of collinear variables
# To match the R notebook, drop the variables that R's lm also dropped
y, X = patsy.dmatrices(model, Penn, return_type='dataframe')

In [173]:
len( list( X.columns.values ) )

120

In [174]:
# Variables deleted by lm package in R
no_columns = ['lusd:husd','agelt35:agegt54', 'q6:lusd','q6:husd', 'q5:q6','q4:q5','q4:q6', 'q3:q4', 'q3:q5',  'q3:q6', 'q2:q3', 'q2:q4','q2:q5','q2:q6', 'black:othrace' , 'black:q6' , 'othrace:q6']
len(no_columns)

17
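
The list above was taken from the R output. As a rough illustration of how such columns could be detected directly in Python, the sketch below greedily keeps a column only if it increases the rank of the design matrix. This is an assumption-laden stand-in: the helper `find_collinear_columns` is hypothetical, the data are synthetic, and the procedure is not guaranteed to drop exactly the same columns as R's pivoted QR.

```python
import numpy as np

def find_collinear_columns(X, names, tol=1e-8):
    """Greedily keep columns that add rank; return the names of the others."""
    kept, dropped = [], []
    for j, name in enumerate(names):
        candidate = X[:, kept + [j]]
        if np.linalg.matrix_rank(candidate, tol=tol) > len(kept):
            kept.append(j)          # column adds new information
        else:
            dropped.append(name)    # column is a linear combination of kept ones
    return dropped

rng = np.random.default_rng(2)
n = 200
quarter = rng.integers(1, 4, size=n)           # each unit falls in exactly one quarter
q2 = (quarter == 2).astype(float)
q3 = (quarter == 3).astype(float)
female = rng.integers(0, 2, size=n).astype(float)

# q2 and q3 are mutually exclusive dummies, so their product is identically zero
X_demo = np.column_stack([np.ones(n), q2, q3, female, q2 * q3, q2 * female])
names = ["Intercept", "q2", "q3", "female", "q2:q3", "q2:female"]
print(find_collinear_columns(X_demo, names))
```

Here the only redundant column is `q2:q3`, which mirrors why interactions between mutually exclusive quarter dummies appear in the list of dropped variables.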

In [175]:
# New covariates matrix
X_new = X.drop(no_columns, axis = 1 )
X_new

Unnamed: 0,Intercept,C(dep)[T.1],C(dep)[T.2],female,female:C(dep)[T.1],female:C(dep)[T.2],black,black:C(dep)[T.1],black:C(dep)[T.2],othrace,...,q6:agegt54,q6:durable,agelt35:durable,agelt35:lusd,agelt35:husd,agegt54:durable,agegt54:lusd,agegt54:husd,durable:lusd,durable:husd
0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13904,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13905,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13906,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13910,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [155]:
# Results 
#sm.OLS( y, X_new ).fit().get_robustcov_results(cov_type = "HC1").summary2().tables[1].round(5)
model_results_2 = sm.OLS( y, X_new ).fit().get_robustcov_results(cov_type = "HC1")
print(model_results_2.summary())
print( "Number of regressors in the basic model:",len(model_results_2.params), '\n')

#with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
#    print(A)

                            OLS Regression Results                            
Dep. Variable:                     T4   R-squared:                       0.026
Model:                            OLS   Adj. R-squared:                  0.006
Method:                 Least Squares   F-statistic:                     28.68
Date:                Thu, 18 Mar 2021   Prob (F-statistic):               0.00
Time:                        19:56:38   Log-Likelihood:                -3365.8
No. Observations:                5099   AIC:                             6936.
Df Residuals:                    4997   BIC:                             7602.
Df Model:                         101                                         
Covariance Type:                  HC1                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept               0.3939    



We see that, even though this is a randomized experiment, the balance conditions fail.
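
A balance regression of this kind can also be cross-checked without relying on asymptotic F-statistics. The sketch below uses a permutation test on the $R^2$ of the regression of treatment on covariates. This is a different technique from the robust OLS summary used in this notebook, and everything in it (function names, synthetic data, number of permutations) is an assumption for illustration.

```python
import numpy as np

def r_squared(D, W):
    """R^2 from an OLS regression of D on W with an intercept."""
    X = np.column_stack([np.ones(len(D)), W])
    beta, *_ = np.linalg.lstsq(X, D, rcond=None)
    resid = D - X @ beta
    return 1.0 - resid @ resid / np.sum((D - D.mean()) ** 2)

def permutation_balance_pvalue(D, W, n_perm=500, seed=0):
    """Compare the observed R^2 with R^2 values under reshuffled treatment."""
    rng = np.random.default_rng(seed)
    observed = r_squared(D, W)
    permuted = np.array([r_squared(rng.permutation(D), W) for _ in range(n_perm)])
    p_value = (1 + np.sum(permuted >= observed)) / (1 + n_perm)
    return observed, p_value

rng = np.random.default_rng(3)
n = 1000
W = rng.normal(size=(n, 5))
# Treatment assigned by a fair coin flip: covariates should not predict it
D_balanced = rng.integers(0, 2, size=n).astype(float)
# Treatment probability depends on the first covariate: balance fails
D_imbalanced = (rng.uniform(size=n) < 1 / (1 + np.exp(-W[:, 0]))).astype(float)

r2_balanced, p_balanced = permutation_balance_pvalue(D_balanced, W)
r2_imbalanced, p_imbalanced = permutation_balance_pvalue(D_imbalanced, W)
print(f"balanced:   R2={r2_balanced:.4f}, p={p_balanced:.3f}")
print(f"imbalanced: R2={r2_imbalanced:.4f}, p={p_imbalanced:.3f}")
```

A small p-value indicates that the covariates predict treatment better than they would under a random reshuffling of the assignment, which is the pattern a failed balance check corresponds to.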

# Model Specification

In [158]:
import numpy as np
Penn.head()

Unnamed: 0,abdt,tg,inuidur1,inuidur2,female,black,hispanic,othrace,dep,q1,...,q6,recall,agelt35,agegt54,durable,nondurable,lusd,husd,muld,T4
0,10824,0,18,18,0,0,0,0,2,0,...,0,0,0,0,0,0,1,0,,0
3,10824,0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,,0
4,10747,0,27,27,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,,0
11,10607,4,9,9,0,0,0,0,0,0,...,0,1,0,0,0,0,0,1,,1
12,10831,0,27,27,0,0,0,0,1,0,...,0,0,1,1,0,1,0,0,,0


In [159]:
# model specifications
# take log of inuidur1
Penn["log_inuidur1"] = np.log( Penn["inuidur1"] ) 

# no adjustment (2-sample approach)
formula_cl = 'log_inuidur1 ~ T4'

# adding controls
formula_cra = 'log_inuidur1 ~ T4 + (female+black+othrace+dep+q2+q3+q4+q5+q6+agelt35+agegt54+durable+lusd+husd)**2'
# Omitted dummies: q1, nondurable, muld

ols_cl = smf.ols( formula = formula_cl, data = Penn ).fit().get_robustcov_results(cov_type = "HC1")
ols_cra = smf.ols( formula = formula_cra, data = Penn ).fit().get_robustcov_results(cov_type = "HC1")

# Results 
print(ols_cl.summary())
print(ols_cra.summary())

                            OLS Regression Results                            
Dep. Variable:           log_inuidur1   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     5.680
Date:                Thu, 18 Mar 2021   Prob (F-statistic):             0.0172
Time:                        19:59:32   Log-Likelihood:                -8223.8
No. Observations:                5099   AIC:                         1.645e+04
Df Residuals:                    5097   BIC:                         1.646e+04
Df Model:                           1                                         
Covariance Type:                  HC1                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.0568      0.021     98.156      0.0



In [56]:
# ols_cl_model = ols_cl.fit().get_robustcov_results(cov_type = "HC1").summary2().tables[1].round(3)
# ols_cra_model = ols_cra.fit().get_robustcov_results(cov_type = "HC1").summary2().tables[1].round(3)

# print(ols_cl_model)
# print(ols_cra_model)

The interactive specification corresponds to the approach introduced in Lin (2013).
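
The role of demeaning in the interactive model can be seen in a small synthetic example. The sketch below is only an illustration under assumed values (a single covariate with mean 1 and a treatment effect that varies linearly with it): with raw covariates the coefficient on $D$ is the effect at $W = 0$, while with demeaned covariates it is close to the ATE.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20000
W = rng.normal(loc=1.0, size=n)               # covariate with nonzero mean
D = rng.integers(0, 2, size=n).astype(float)  # randomized treatment

# Hypothetical heterogeneous effect: tau(W) = -0.1 + 0.5 * W
Y = 2.0 + 0.7 * W + D * (-0.1 + 0.5 * W) + rng.normal(size=n)
true_ate = -0.1 + 0.5 * 1.0                   # tau evaluated at E[W] = 1

def coef_on_D(W_used):
    """Coefficient on D in the regression of Y on 1, D, W, and D*W."""
    X = np.column_stack([np.ones(n), D, W_used, D * W_used])
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta[1]

alpha_raw = coef_on_D(W)                      # effect at W = 0, not the ATE
alpha_demeaned = coef_on_D(W - W.mean())      # close to the ATE

print(f"true ATE: {true_ate:.2f}")
print(f"coefficient on D, raw W:      {alpha_raw:.3f}")
print(f"coefficient on D, demeaned W: {alpha_demeaned:.3f}")
```

This is why the code that follows demeans every column of the covariate matrix before interacting it with the treatment indicator.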

In [160]:
#interactive regression model;
# No intercept
formula3 = "T4~(female+black+othrace+C(dep)+q2+q3+q4+q5+q6+agelt35+agegt54+durable+lusd+husd) ** 2"

# Create dependent variable and drop the intercept
y, X = patsy.dmatrices(formula3, Penn, return_type='dataframe')
X = X.drop( 'Intercept', axis = 1)

# demean variables 
def demean(X):
    output = X - np.mean(X)
    return output

X = X.apply( demean , axis = 0 )

In [161]:
# create Y variable 
log_inuidur1 = np.log( Penn["inuidur1"] )

In [162]:
# Rename X columns 
columns = X.columns.to_list()
new_columns = []
for column in columns:
    new_string = column.replace(".", "_")
    new_string = new_string.replace("C(dep)", "C_dep")
    new_string = new_string.replace("[", "_")
    new_string = new_string.replace("]", "")
    new_columns.append(new_string)
X.columns = new_columns

In [163]:
# Function to create name of the model 
def listToString(s):
    # join the column names with " + " to build the right-hand side of a formula
    return " + ".join(s)

In [164]:
covars = listToString(X.columns.to_list())
len(X.columns.to_list())

119

In [165]:
# Add the treatment indicator T4 to the demeaned covariate matrix
X['T4'] = y
X.shape

(5099, 120)

In [166]:
formula4 = f"T4 ~ T4*({covars})"

In [167]:
y, X_T4 = patsy.dmatrices(formula4, X, return_type='dataframe')

In [168]:
# Outcome: log of unemployment duration (its index matches the rows of X_T4)
log_inuidur1 = np.log(Penn[ 'inuidur1' ])
ols_ira = sm.OLS( log_inuidur1, X_T4 ).fit().get_robustcov_results(cov_type = "HC1")
# Results 
print(ols_ira.summary())

                            OLS Regression Results                            
Dep. Variable:               inuidur1   R-squared:                       0.093
Model:                            OLS   Adj. R-squared:                  0.053
Method:                 Least Squares   F-statistic:                     12.27
Date:                Thu, 18 Mar 2021   Prob (F-statistic):          3.96e-313
Time:                        20:10:11   Log-Likelihood:                -7977.3
No. Observations:                5099   AIC:                         1.639e+04
Df Residuals:                    4881   BIC:                         1.782e+04
Df Model:                         217                                         
Covariance Type:                  HC1                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept                2.0462 



In [26]:
# ols_ira_est = ols_ira.fit().get_robustcov_results(cov_type = "HC1").summary2().tables[1].round(4)
# print( ols_ira_est )

## Let's try the same regression after dropping the collinear variables

In [176]:
X_new = X_new.drop( 'Intercept', axis = 1)

# demean variables 
def demean(X):
    output = X - np.mean(X)
    return output

X_new = X_new.apply( demean , axis = 0 )

In [177]:
# Rename X columns 
columns = X_new.columns.to_list()
new_columns = []
for column in columns:
    new_string = column.replace(".", "_")
    new_string = new_string.replace("C(dep)", "C_dep")
    new_string = new_string.replace("[", "_")
    new_string = new_string.replace("]", "")
    new_columns.append(new_string)
X_new.columns = new_columns

# Function to create name of the model 
def listToString(s):
    # join the column names with " + " to build the right-hand side of a formula
    return " + ".join(s)

In [178]:
covars = listToString(X_new.columns.to_list())
len(X_new.columns.to_list())

102

In [179]:
X_new['T4'] = y
X_new.shape
X_new

Unnamed: 0,C_dep_T_1,C_dep_T_2,female,female:C_dep_T_1,female:C_dep_T_2,black,black:C_dep_T_1,black:C_dep_T_2,othrace,othrace:C_dep_T_1,...,q6:durable,agelt35:durable,agelt35:lusd,agelt35:husd,agegt54:durable,agegt54:lusd,agegt54:husd,durable:lusd,durable:husd,T4
0,-0.112179,0.836242,-0.404001,-0.05099,-0.051579,-0.121985,-0.01314,-0.015689,-0.007256,-0.000196,...,-0.011179,-0.015297,-0.026476,-0.042165,0.0,-0.028045,-0.053736,-0.032359,-0.039223,0.0
3,-0.112179,-0.163758,-0.404001,-0.05099,-0.051579,-0.121985,-0.01314,-0.015689,-0.007256,-0.000196,...,-0.011179,-0.015297,-0.026476,-0.042165,0.0,-0.028045,-0.053736,-0.032359,-0.039223,0.0
4,-0.112179,-0.163758,-0.404001,-0.05099,-0.051579,-0.121985,-0.01314,-0.015689,-0.007256,-0.000196,...,-0.011179,-0.015297,-0.026476,-0.042165,0.0,-0.028045,-0.053736,-0.032359,-0.039223,0.0
11,-0.112179,-0.163758,-0.404001,-0.05099,-0.051579,-0.121985,-0.01314,-0.015689,-0.007256,-0.000196,...,-0.011179,-0.015297,-0.026476,-0.042165,0.0,-0.028045,-0.053736,-0.032359,-0.039223,1.0
12,0.887821,-0.163758,-0.404001,-0.05099,-0.051579,-0.121985,-0.01314,-0.015689,-0.007256,-0.000196,...,-0.011179,-0.015297,-0.026476,-0.042165,0.0,-0.028045,-0.053736,-0.032359,-0.039223,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13904,-0.112179,-0.163758,-0.404001,-0.05099,-0.051579,-0.121985,-0.01314,-0.015689,-0.007256,-0.000196,...,-0.011179,-0.015297,-0.026476,-0.042165,0.0,-0.028045,-0.053736,-0.032359,-0.039223,1.0
13905,-0.112179,0.836242,-0.404001,-0.05099,-0.051579,-0.121985,-0.01314,-0.015689,-0.007256,-0.000196,...,-0.011179,-0.015297,-0.026476,-0.042165,0.0,-0.028045,-0.053736,-0.032359,-0.039223,1.0
13906,-0.112179,0.836242,-0.404001,-0.05099,-0.051579,-0.121985,-0.01314,-0.015689,-0.007256,-0.000196,...,-0.011179,-0.015297,-0.026476,-0.042165,0.0,-0.028045,-0.053736,-0.032359,-0.039223,0.0
13910,-0.112179,-0.163758,-0.404001,-0.05099,-0.051579,-0.121985,-0.01314,-0.015689,-0.007256,-0.000196,...,-0.011179,-0.015297,-0.026476,-0.042165,0.0,-0.028045,-0.053736,-0.032359,-0.039223,1.0


In [180]:
formula4 = f"T4 ~ T4*({covars})"
y, X_new_T4 = patsy.dmatrices(formula4, X_new, return_type='dataframe')

# Estimate the interactive model on the reduced covariate set
ols_ira = sm.OLS( log_inuidur1, X_new_T4 ).fit().get_robustcov_results(cov_type = "HC1")
# Results 
print(ols_ira.summary())

                            OLS Regression Results                            
Dep. Variable:               inuidur1   R-squared:                       0.090
Model:                            OLS   Adj. R-squared:                  0.053
Method:                 Least Squares   F-statistic:                     13.58
Date:                Thu, 18 Mar 2021   Prob (F-statistic):               0.00
Time:                        20:19:37   Log-Likelihood:                -7985.3
No. Observations:                5099   AIC:                         1.637e+04
Df Residuals:                    4897   BIC:                         1.770e+04
Df Model:                         201                                         
Covariance Type:                  HC1                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept                2.0880 



# Next we try out partialling out with lasso
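
Partialling out means residualizing both the outcome and the treatment on the controls and then regressing one residual on the other. As a minimal sketch of the idea, the example below does this with plain OLS on synthetic data, where the Frisch-Waugh-Lovell theorem guarantees that the result matches the coefficient from the full regression. The lasso version used in this section replaces the OLS residualization step with a penalized fit, so the equality is then only approximate. The helper `residualize` and the simulated data are assumptions for illustration.

```python
import numpy as np

def residualize(target, W):
    """Residuals from an OLS regression of target on W with an intercept."""
    X = np.column_stack([np.ones(len(target)), W])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ beta

rng = np.random.default_rng(5)
n = 2000
W = rng.normal(size=(n, 4))
D = rng.integers(0, 2, size=n).astype(float)
Y = 2.0 - 0.08 * D + W @ np.array([0.5, -0.3, 0.2, 0.1]) + rng.normal(size=n)

# Step 1: partial the controls out of the outcome and the treatment
rY = residualize(Y, W)
rD = residualize(D, W)

# Step 2: regress residual on residual (no intercept needed, both have mean zero)
beta_partial = (rD @ rY) / (rD @ rD)

# Reference: coefficient on D from the full regression of Y on 1, D, W
X_full = np.column_stack([np.ones(n), D, W])
beta_full = np.linalg.lstsq(X_full, Y, rcond=None)[0][1]

print(f"partialled-out: {beta_partial:.6f}, full regression: {beta_full:.6f}")
```
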

In [181]:
T4 = X_T4[ 'T4' ]
X = X_T4.drop( 'T4', axis = 1)
X = X.drop( 'Intercept', axis = 1)
X

Unnamed: 0,C_dep_T_1,C_dep_T_2,female,female:C_dep_T_1,female:C_dep_T_2,black,black:C_dep_T_1,black:C_dep_T_2,othrace,othrace:C_dep_T_1,...,T4:agelt35:agegt54,T4:agelt35:durable,T4:agelt35:lusd,T4:agelt35:husd,T4:agegt54:durable,T4:agegt54:lusd,T4:agegt54:husd,T4:durable:lusd,T4:durable:husd,T4:lusd:husd
0,-0.112179,0.836242,-0.404001,0.04532,-0.337843,-0.121985,0.013684,-0.102009,-0.007256,0.000814,...,0.000000,0.000000,-0.000000,0.000000,0.000000,-0.000000,0.000000,-0.000000,0.000000,-0.000000
3,-0.112179,-0.163758,-0.404001,0.04532,0.066158,-0.121985,0.013684,0.019976,-0.007256,0.000814,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,-0.112179,-0.163758,-0.404001,0.04532,0.066158,-0.121985,0.013684,0.019976,-0.007256,0.000814,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
11,-0.112179,-0.163758,-0.404001,0.04532,0.066158,-0.121985,0.013684,0.019976,-0.007256,0.000814,...,0.016204,0.011954,0.023930,-0.060822,0.016175,0.032378,-0.082296,0.023887,-0.060713,-0.121536
12,0.887821,-0.163758,-0.404001,-0.35868,0.066158,-0.121985,-0.108301,0.019976,-0.007256,-0.006442,...,0.000000,-0.000000,-0.000000,-0.000000,-0.000000,-0.000000,-0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13904,-0.112179,-0.163758,-0.404001,0.04532,0.066158,-0.121985,0.013684,0.019976,-0.007256,0.000814,...,0.016204,0.011954,0.023930,-0.060822,0.016175,0.032378,-0.082296,0.023887,-0.060713,-0.121536
13905,-0.112179,0.836242,-0.404001,0.04532,-0.337843,-0.121985,0.013684,-0.102009,-0.007256,0.000814,...,0.016204,0.011954,0.023930,-0.060822,0.016175,0.032378,-0.082296,0.023887,-0.060713,-0.121536
13906,-0.112179,0.836242,-0.404001,0.04532,-0.337843,-0.121985,0.013684,-0.102009,-0.007256,0.000814,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
13910,-0.112179,-0.163758,-0.404001,0.04532,0.066158,-0.121985,0.013684,0.019976,-0.007256,0.000814,...,-0.131865,-0.097283,-0.194741,-0.395594,0.016175,0.032378,0.065773,0.023887,0.048524,0.097134


In [182]:
# Import relevant packages for lasso 
from sklearn.linear_model import LassoCV
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

In [183]:
X = X.to_numpy()
T4 = T4.to_numpy()
log_inuidur1_2 = log_inuidur1.to_numpy()

In [184]:
# Lasso penalty level
alpha = 0.1

reg = linear_model.Lasso(alpha = alpha)

# LASSO regression for flexible model
rY = log_inuidur1_2 - reg.fit(X, log_inuidur1_2).predict( X )
rT4 = T4 - reg.fit(X, T4).predict( X )

rT4 = sm.add_constant(rT4)

In [185]:
model = sm.OLS(rY, rT4)
rlasso_ira = model.fit()
print(rlasso_ira.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     5.686
Date:                Thu, 18 Mar 2021   Prob (F-statistic):             0.0171
Time:                        20:20:04   Log-Likelihood:                -8223.8
No. Observations:                5099   AIC:                         1.645e+04
Df Residuals:                    5097   BIC:                         1.646e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2.372e-16      0.017    1.4e-14      1.0

### Results

In [186]:
table2 = np.zeros((2, 4))
table2[0,0] = ols_cl.summary2().tables[1]['Coef.']['T4']
table2[0,1] = ols_cra.summary2().tables[1]['Coef.']['T4']
table2[0,2] = ols_ira.summary2().tables[1]['Coef.']['T4']
table2[0,3] = rlasso_ira.summary2().tables[1]['Coef.']['x1']

table2[1,0] = ols_cl.summary2().tables[1]['Std.Err.']['T4']
table2[1,1] = ols_cra.summary2().tables[1]['Std.Err.']['T4']
table2[1,2] = ols_ira.summary2().tables[1]['Std.Err.']['T4']
table2[1,3] = rlasso_ira.summary2().tables[1]['Std.Err.']['x1']

table2 = pd.DataFrame(table2, columns = ["$CL$", "$CRA$", "$IRA$", "$IRA Lasso$"], \
                      index = ["estimate","standard error"])
print(table2.to_latex())

\begin{tabular}{lrrrr}
\toprule
{} &      \$CL\$ &     \$CRA\$ &     \$IRA\$ &  \$IRA Lasso\$ \\
\midrule
estimate       & -0.085455 & -0.077127 & -0.046288 &    -0.085455 \\
standard error &  0.035856 &  0.035230 &  0.042757 &     0.035839 \\
\bottomrule
\end{tabular}



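Treatment group 4 experiences an average decrease of roughly $8\%$ in the length of the unemployment spell: the CL and CRA estimates in the table are about $-0.085$ and $-0.077$ log points, respectively.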
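
To help read the table, the sketch below turns each estimate and standard error into an approximate 95% confidence interval and converts the log-point estimate into a percentage change via $100(e^{b} - 1)$. The numbers are copied from the table above; the normal critical value 1.96 is an assumption of this illustration.

```python
import numpy as np

# (estimate, standard error) pairs copied from the results table
estimates = {
    "CL": (-0.085455, 0.035856),
    "CRA": (-0.077127, 0.035230),
    "IRA": (-0.046288, 0.042757),
    "IRA Lasso": (-0.085455, 0.035839),
}

z = 1.96  # normal approximation for a 95% interval
summary = {}
for name, (b, se) in estimates.items():
    lower, upper = b - z * se, b + z * se
    pct_change = 100 * (np.exp(b) - 1)   # log points to percent
    summary[name] = (lower, upper, pct_change)
    print(f"{name:10s} CI=[{lower:.3f}, {upper:.3f}]  effect={pct_change:.1f}%")
```

An interval that excludes zero corresponds to an effect that is statistically distinguishable from zero at the 5% level under this approximation.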


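Observe that the classical regression adjustment (CRA) delivers an estimate that is slightly more efficient (lower standard error) than the simple two-sample mean estimator, while in this replication the IRA standard error is somewhat larger; overall, the methods have standard errors of similar magnitude. From the IRA results we also see that there is not any statistically detectable heterogeneity. We also see that the regression estimators give estimates that are somewhat smaller in magnitude; these differences perhaps occur due to minor imbalance in the treatment allocation, which the regression estimators try to correct. Note that the IRA Lasso estimate coincides with the CL estimate to the digits shown, which suggests that with the penalty `alpha = 0.1` the lasso kept few or no covariates.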


