## Introduction

In labor economics, an important question is what determines workers' wages. This is a causal question,
but we can begin to investigate it from a predictive perspective.

In the following wage example, $Y$ is the hourly wage of a worker and $X$ is a vector of worker's characteristics, e.g., education, experience, gender. Two main questions here are:


* How to use job-relevant characteristics, such as education and experience, to best predict wages?

* What is the difference in predicted wages between men and women with the same job-relevant characteristics?

In this lab, we focus on the prediction question first.

## Data


The data set we consider is from the March Supplement of the U.S. Current Population Survey, year 2015. We select white non-Hispanic individuals, aged 25 to 64 years, and working more than 35 hours per week during at least 50 weeks of the year. We exclude self-employed workers; individuals living in group quarters; individuals in the military, agricultural or private household sectors; individuals with inconsistent reports on earnings and employment status; individuals with allocated or missing information in any of the variables used in the analysis; and individuals with hourly wage below $3$.

The variable of interest $Y$ is the hourly wage rate, constructed as the ratio of annual earnings to the total number of hours worked; total hours are in turn the product of the number of weeks worked and the usual number of hours worked per week. In our analysis, we also focus on single (never married) workers who have some high school (shs) or are high school graduates (hsg). The final sample size is $n = 1376$.

## Explaining the Idea of Sample Splitting

<p style='text-align: justify;'> When we fit a regression, the coefficients are chosen to minimize the in-sample mean squared error, so they perform well within the sample by construction. To learn the predictive power of the estimated coefficients, we need to measure their out-of-sample performance. Sample splitting does this by dividing the sample at random into two groups, training and test; the share assigned to each group is chosen by the researcher. First, the training sample is used to estimate the coefficients of the model, which gives the prediction rule. Then, the test sample is used to evaluate the quality of the prediction rule: we compute predicted values of the outcome variable in the test sample using the coefficients obtained from the training sample. Finally, we calculate the out-of-sample mean squared error ($MSE_{test}$) and $R^{2}_{test}$ in the test sample. </p>

For example, suppose we have the following wage model: $lwage \sim sex + exp1 + shs + hsg + scl + clg + mw + so + we + occ2 + ind2$, and the sample size is $n$. First, we divide the data randomly into two groups, for example $\frac{4}{5}n$ observations as the training sample, $M_{train}$, and $\frac{1}{5}n$ as the test sample, $M_{test}$. Next, we fit the model on $M_{train}$ and obtain the estimated coefficients $\hat\beta$. With these coefficients, we predict $lwage$ in $M_{test}$ and calculate the out-of-sample mean squared error $MSE_{test}$ and $R^{2}_{test}$. Finally, we check whether $MSE_{test}$ is close to $MSE_{sample}$ (obtained from fitting the model on the complete sample) or whether the discrepancy is large.
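The steps above can be sketched in a few lines of R. This is only an illustration on simulated data: the data frame `sim`, the variables `x1`, `x2`, `y` and all coefficients are made up for the example and are not part of the CPS sample.

```r
# Simulated data (hypothetical, not the CPS sample)
set.seed(123)
n_sim <- 500
sim <- data.frame(x1 = rnorm(n_sim), x2 = rnorm(n_sim))
sim$y <- 1 + 0.5 * sim$x1 - 0.3 * sim$x2 + rnorm(n_sim)

# Random 4/5 - 1/5 split into training and test samples
train_idx <- sample(seq_len(n_sim), floor(n_sim * 4 / 5))
sim_train <- sim[train_idx, ]
sim_test  <- sim[-train_idx, ]

# Estimate the prediction rule on the training sample only
fit_train <- lm(y ~ x1 + x2, data = sim_train)

# Evaluate the rule on the test sample
pred_test <- predict(fit_train, newdata = sim_test)
mse_test  <- mean((sim_test$y - pred_test)^2)
r2_test   <- 1 - mse_test / var(sim_test$y)

# In-sample MSE from the complete sample, for comparison
mse_sample <- mean(residuals(lm(y ~ x1 + x2, data = sim))^2)

cat("MSE_test:", mse_test, " R2_test:", r2_test, " MSE_sample:", mse_sample, "\n")
```

With a small, correctly specified model like this one, the two MSE values would typically be close; a large gap is a sign of overfitting.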

## Data analysis

We start by loading the data set.

In [1]:
library(dplyr)
library(hdm)

"package 'dplyr' was built under R version 3.6.3"
"package 'hdm' was built under R version 3.6.3"

In [2]:
load("../data/wage2015_subsample_inference.Rdata")
dim(data)

In [3]:
# keep workers with some high school (shs == 1) or a high school diploma (hsg == 1)
data <- dplyr::filter(data, shs == 1 | hsg == 1)
dim(data)

Let's have a look at the structure of the data.

We are constructing the output variable $Y$ and the matrix $Z$ which includes the characteristics of workers that are given in the data.

In [4]:
Y <- log(data$wage)
n <- length(Y)
Z <- data[-which(colnames(data) %in% c("wage","lwage"))]
p <- dim(Z)[2]

cat("Number of observations:", n, '\n')
cat("Number of raw regressors:", p)

Number of observations: 1376 
Number of raw regressors: 18

For the outcome variable *lwage* (log wage) and a subset of the raw regressors, we calculate the empirical mean to get familiar with the data.

In [5]:
library(xtable)
options(xtable.floating = FALSE)
options(xtable.timestamp = "")

In [6]:
library(xtable)

# select columns by name so that their order matches the row labels below
vars <- c("lwage","sex","shs","hsg","scl","clg","ad","mw","so","we","ne","exp1")
Z_subset <- data[, vars]
table <- matrix(colMeans(Z_subset), ncol = 1)
rownames(table) <- c("Log Wage","Sex","Some High School","High School Graduate","Some College","College Graduate", "Advanced Degree","Midwest","South","West","Northeast","Experience")
colnames(table) <- c("Sample mean")
tab <- xtable(table, digits = 2)
tab

| Variable | Sample mean |
|---|---|
| Log Wage | 2.7185624 |
| Sex | 0.3219477 |
| Some High School | 0.0872093 |
| High School Graduate | 0.9127907 |
| Some College | 0.0 |
| College Graduate | 0.0 |
| Advanced Degree | 0.0 |
| Midwest | 0.2863372 |
| South | 0.2914244 |
| West | 0.1984012 |


E.g., the share of female workers in our sample is ~32% ($sex=1$ if female).

Alternatively, we can print the table as LaTeX.

## Prediction Question

Now, we will construct a prediction rule for the (log) hourly wage $Y$, which depends linearly on job-relevant characteristics $X$:

\begin{equation}\label{decompose}
Y = \beta'X+ \epsilon.
\end{equation}

Our goals are

* Predict wages  using various characteristics of workers.

* Assess the predictive performance using the (adjusted) sample MSE, the (adjusted) sample $R^2$ and the out-of-sample MSE and $R^2$.


We employ two different specifications for prediction:


1. Basic Model:   $X$ consists of a set of raw regressors (e.g. gender, experience, education indicators,  occupation and industry indicators, regional indicators).


2. Flexible Model:  $X$ consists of the raw regressors from the basic model plus transformations (e.g., ${exp}^2$ and ${exp}^3$) and two-way interactions of the polynomial in experience with other regressors. An example of a regressor created through a two-way interaction is *experience* times the indicator of being a *high school graduate*.

Using the **Flexible Model** enables us to approximate the real relationship by a more complex regression model and therefore to reduce the bias. The **Flexible Model** increases the range of potential shapes of the estimated regression function. In general, flexible models often deliver good prediction accuracy but are harder to interpret.

However, this model incorporates a larger number of regressors and therefore requires a larger number of observations to have sufficient degrees of freedom.
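To see how quickly two-way interactions inflate the number of regressors, here is a small sketch on made-up data. The data frame `toy` and its variables `educ` and `region` are hypothetical; the point is only to compare the column counts of the two design matrices. In an R formula, `^2` (equivalently `**2`) expands to all main effects and pairwise interactions.

```r
# Hypothetical toy data: one numeric regressor and two factors
set.seed(1)
toy <- data.frame(
  exp1   = runif(200, 0, 40),
  educ   = factor(sample(c("shs", "hsg"), 200, replace = TRUE)),
  region = factor(sample(c("ne", "mw", "so", "we"), 200, replace = TRUE))
)
toy$exp2 <- toy$exp1^2 / 100

# Design matrix with main effects only
basic_X <- model.matrix(~ exp1 + educ + region, data = toy)

# Design matrix with main effects and all two-way interactions
flex_X <- model.matrix(~ (exp1 + exp2 + educ + region)^2, data = toy)

# intercept + exp1 + 1 education dummy + 3 region dummies = 6 columns
ncol(basic_X)
# intercept + 6 main-effect columns + 12 interaction columns = 19 columns
ncol(flex_X)
```

Adding a single polynomial term and allowing pairwise interactions roughly triples the column count here; with many occupation and industry dummies the growth is much steeper.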
 

Now, let us fit both models to our data by running ordinary least squares (OLS):

In [7]:
# 1. basic model
basic <- lwage~ (sex+exp1 + hsg+ mw + so + we +occ2+ind2)
regbasic <- lm(basic, data=data)
regbasic # estimated coefficients
cat( "Number of regressors in the basic model:",length(regbasic$coef), '\n') # number of regressors in the Basic Model



Call:
lm(formula = basic, data = data)

Coefficients:
(Intercept)          sex         exp1          hsg           mw           so  
  2.7518725   -0.0733094    0.0075742    0.0811342   -0.0431882   -0.1091620  
         we        occ22        occ23        occ24        occ25        occ26  
  0.0129620   -0.1961261   -0.0086113    0.0005078    0.2615289   -0.3510072  
      occ27        occ28        occ29       occ210       occ211       occ212  
 -0.1900342   -0.6616521   -0.3013316   -0.0576220   -0.4176903   -0.4663571  
     occ213       occ214       occ215       occ216       occ217       occ218  
 -0.4219896   -0.5527766   -0.4747648   -0.2381724   -0.3529422   -0.3976108  
     occ219       occ220       occ221       occ222        ind23        ind24  
 -0.1181885   -0.1053967   -0.1737437   -0.3479965    0.1742747    0.0504201  
      ind25        ind26        ind27        ind28        ind29       ind210  
  0.0585330    0.0348081    0.2379530    0.0922050    0.0928608    0.2399036

Number of regressors in the basic model: 48 


Note that the basic model has $48$ coefficients (47 regressors plus the intercept).

In [8]:
# 2. flexible model
flex <- lwage ~ (exp1+exp2+exp3+exp4+hsg+occ2+ind2+mw+so+we)**2
regflex <- lm(flex, data=data)
regflex # estimated coefficients
cat( "Number of regressors in the flexible model:",length(regflex$coef)) # number of regressors in the Flexible Model


Call:
lm(formula = flex, data = data)

Coefficients:
  (Intercept)           exp1           exp2           exp3           exp4  
    1.544e+01     -3.684e+00      4.183e+01     -2.251e+01      7.480e+00  
          hsg          occ22          occ23          occ24          occ25  
    5.723e-01     -2.441e+00     -4.590e+01      3.814e+00     -5.081e+01  
        occ26          occ27          occ28          occ29         occ210  
    1.803e+01     -9.997e-01     -1.024e+01     -2.064e+01     -3.857e+00  
       occ211         occ212         occ213         occ214         occ215  
    7.938e-01      2.424e+00     -1.720e+00     -6.556e-01     -1.048e+01  
       occ216         occ217         occ218         occ219         occ220  
    6.067e-01     -4.943e+00     -3.445e-01     -6.796e+00     -3.114e+00  
       occ221         occ222          ind23          ind24          ind25  
   -1.176e+00     -1.987e+00     -1.653e+01     -5.930e+00     -1.193e+01  
        ind26          ind27      

Number of regressors in the flexible model: 826

Note that the flexible model has $826$ coefficients (825 regressors plus the intercept) for only 1376 observations.

We also fit a lasso regression to the flexible specification.

In [9]:
#install.packages('hdm')
#lambdaCalculation(penalty = list(homoscedastic = FALSE, X.dependent.lambda =
#FALSE, lambda.start = NULL, c = 1.1, gamma = 0.1), y = data$wage, x =data)

In [10]:
#library(hdm)
flex <- lwage ~ (exp1+exp2+exp3+exp4+hsg+occ2+ind2+mw+so+we)**2
lassoreg<- rlasso(flex, data=data,post=FALSE)

sumlasso<- summary(lassoreg)



Call:
rlasso.formula(formula = flex, data = data, post = FALSE)

Post-Lasso Estimation:  FALSE 

Total number of variables: 825
Number of selected variables: 24 

Residuals: 
     Min       1Q   Median       3Q      Max 
-1.37420 -0.29937 -0.01777  0.28156  3.54523 

              Estimate
(Intercept)      2.680
exp1             0.003
exp2             0.000
exp3             0.000
exp4             0.000
hsg              0.000
occ22            0.000
occ23            0.000
occ24            0.000
occ25            0.000
occ26            0.000
occ27            0.000
occ28           -0.085
occ29            0.000
occ210           0.000
occ211           0.000
occ212           0.000
occ213          -0.173
occ214          -0.142
occ215          -0.153
occ216           0.000
occ217          -0.030
occ218           0.000
occ219           0.000
occ220           0.000
occ221           0.000
occ222           0.000
ind23            0.000
ind24            0.000
ind25            0.000
ind26            0

The lasso applied to the flexible model selects 24 of the 825 candidate regressors.

Now, we can evaluate the performance of the three models based on the (adjusted) $R^2_{sample}$ and the (adjusted) $MSE_{sample}$. The aim is to find an $R^2$ close to 1 and a low $MSE$.
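For reference, the in-sample measures for the OLS fits can be written as follows. The notation is introduced here for illustration: $\hat\epsilon_i$ denotes the residual of observation $i$, $\bar Y$ the sample mean of $Y$, and $p$ the number of estimated coefficients including the intercept. The adjusted $R^2$ is the usual formula reported by `summary()` for a linear model with an intercept.

\begin{equation}
MSE_{sample} = \frac{1}{n}\sum_{i=1}^{n}\hat\epsilon_i^{2}, \qquad
MSE_{adjusted} = \frac{n}{n-p}\, MSE_{sample}
\end{equation}

\begin{equation}
R^2_{sample} = 1 - \frac{\sum_{i=1}^{n}\hat\epsilon_i^{2}}{\sum_{i=1}^{n}(Y_i-\bar Y)^{2}}, \qquad
R^2_{adjusted} = 1 - \frac{n-1}{n-p}\,\bigl(1 - R^2_{sample}\bigr)
\end{equation}

As a check, with $n = 1376$, $p = 48$ and $R^2_{sample} = 0.1802$ the second formula gives $1 - \frac{1375}{1328}(1 - 0.1802) \approx 0.1512$.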

In [11]:
# Assess the predictive performance

sumbasic <- summary(regbasic)
sumflex <- summary(regflex)

#  R-squared-basic model
R2.1 <- sumbasic$r.squared
cat("R-squared for the basic model: ", R2.1, "\n")
R2.adj1 <- sumbasic$adj.r.squared
cat("adjusted R-squared for the basic model: ", R2.adj1, "\n")

#  R-squared-flex model
R2.2 <- sumflex$r.squared
cat("R-squared for the flexible model: ", R2.2, "\n")
R2.adj2 <- sumflex$adj.r.squared
cat("adjusted R-squared for the flexible model: ", R2.adj2, "\n")

#  R-squared-flex lasso model
R2.L <- sumlasso$r.squared
cat("R-squared for the lasso with flexible model: ", R2.L, "\n")
R2.adjL <- sumlasso$adj.r.squared
cat("adjusted R-squared for the lasso with flexible model: ", R2.adjL, "\n")

R-squared for the basic model:  0.1802381 
adjusted R-squared for the basic model:  0.1512255 
R-squared for the flexible model:  0.507044 
adjusted R-squared for the flexible model:  0.2315028 
R-squared for the lasso with flexible model:  0.1157304 
adjusted R-squared for the lasso with flexible model:  0.1000216 


In [12]:
# calculating the MSE-basic model
MSE1 <- mean(sumbasic$res^2)
cat("MSE for the basic model: ", MSE1, "\n")

p1 <- sumbasic$df[1] # number of regressors
MSE.adj1 <- (n/(n-p1))*MSE1
cat("adjusted MSE for the basic model: ", MSE.adj1, "\n")

# calculating the MSE-flex model
MSE2 <-mean(sumflex$res^2)
cat("MSE for the flexible model: ", MSE2, "\n")
p2 <- sumflex$df[1]
MSE.adj2 <- (n/(n-p2))*MSE2
cat("adjusted MSE for the flexible model: ", MSE.adj2, "\n")

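# calculating the MSE-flex lasso model
MSEL <-mean(sumlasso$res^2)
cat("MSE for the lasso flexible model: ", MSEL, "\n")
# NOTE: this counts all 826 coefficients, including those shrunk to zero,
# not the 24 selected regressors, so the adjustment below is very conservative
pL <- length(sumlasso$coef)
MSE.adjL <- (n/(n-pL))*MSEL
cat("adjusted MSE for the lasso flexible model: ", MSE.adjL, "\n")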

MSE for the basic model:  0.2082191 
adjusted MSE for the basic model:  0.2157451 
MSE for the flexible model:  0.1252106 
adjusted MSE for the flexible model:  0.1953398 
MSE for the lasso flexible model:  0.224604 
adjusted MSE for the lasso flexible model:  0.5619184 


In [13]:
library(xtable)
table <- matrix(0, 3, 5)
table[1,1:5]   <- c(p1,R2.1,MSE1,R2.adj1,MSE.adj1)
table[2,1:5]   <- c(p2,R2.2,MSE2,R2.adj2,MSE.adj2)
table[3,1:5]   <- c(pL,R2.L,MSEL,R2.adjL,MSE.adjL)
colnames(table)<- c("p","$R^2_{sample}$","$MSE_{sample}$","$R^2_{adjusted}$", "$MSE_{adjusted}$")
rownames(table)<- c("basic reg","flexible reg", "lasso flex")
tab<- xtable(table, digits =c(0,0,2,2,2,2))
print(tab,type="latex") # type="latex" for printing table in LaTeX
tab

% latex table generated in R 3.6.1 by xtable 1.8-4 package
% 
\begin{tabular}{rrrrrr}
  \hline
 & p & \$R\verb|^|2\_\{sample\}\$ & \$MSE\_\{sample\}\$ & \$R\verb|^|2\_\{adjusted\}\$ & \$MSE\_\{adjusted\}\$ \\ 
  \hline
basic reg & 48 & 0.18 & 0.21 & 0.15 & 0.22 \\ 
  flexible reg & 494 & 0.51 & 0.13 & 0.23 & 0.20 \\ 
  lasso flex & 826 & 0.12 & 0.22 & 0.10 & 0.56 \\ 
   \hline
\end{tabular}


| Model | p | $R^2_{sample}$ | $MSE_{sample}$ | $R^2_{adjusted}$ | $MSE_{adjusted}$ |
|---|---|---|---|---|---|
| basic reg | 48 | 0.1802381 | 0.2082191 | 0.1512255 | 0.2157451 |
| flexible reg | 494 | 0.507044 | 0.1252106 | 0.2315028 | 0.1953398 |
| lasso flex | 826 | 0.1157304 | 0.224604 | 0.1000216 | 0.5619184 |


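Judged by the in-sample measures above, the flexible model performs better than the other models: it has the highest $R^2_{sample}$ and $R^2_{adjusted}$ and the lowest $MSE_{sample}$ and $MSE_{adjusted}$. Two caveats apply. First, the $p$ reported for the lasso (826) counts every candidate coefficient, including those shrunk to zero, rather than the 24 selected regressors, so its $MSE_{adjusted}$ of 0.56 is inflated. Second, in-sample measures reward overfitting when $p$ is not small relative to $n$, and the flexible model estimates hundreds of coefficients from only 1376 observations.

One procedure to circumvent this issue is to use **data splitting**, which is described and applied in the following.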

## Data Splitting

Measure the prediction quality of the two models via data splitting:

- Randomly split the data into one training sample and one testing sample. Here we just use a simple method (stratified splitting is a more sophisticated version of splitting that we could consider).
- Use the training sample for estimating the parameters of the Basic Model and the Flexible Model.
- Use the testing sample for evaluation. Predict the $\mathtt{lwage}$ of every observation in the testing sample based on the parameters estimated in the training sample.
- Calculate the Mean Squared Prediction Error $MSE_{test}$ based on the testing sample for both prediction models. 
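In formulas, the two out-of-sample measures can be written as below. The symbols $n_{test}$ (the number of observations in the testing sample $M_{test}$) and $\widehat{Var}_{test}(Y)$ (the sample variance of $Y$ in the testing sample) are notation added here for illustration; $\hat\beta$ is estimated on the training sample only.

\begin{equation}
MSE_{test} = \frac{1}{n_{test}} \sum_{i \in M_{test}} \bigl(Y_i - \hat\beta' X_i\bigr)^{2}, \qquad
R^2_{test} = 1 - \frac{MSE_{test}}{\widehat{Var}_{test}(Y)}
\end{equation}

The code in this lab computes $\widehat{Var}_{test}(Y)$ with R's `var()`, which divides by $n_{test}-1$; with a few hundred test observations the difference from dividing by $n_{test}$ is negligible.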

In [14]:
# splitting the data

set.seed(1) # to make the results replicable (generating random numbers)
# draw floor((4/5)*n) indices from 1 to n without replacement
random_2 <- sample(1:n, floor(n*4/5))

train <- data[random_2,] # training sample
test <- data[-random_2,] # testing sample
dim(train)
dim(test)
# 1100 obs for train and 276 for test

In [15]:
# basic model
# estimating the parameters in the training sample
regbasic <- lm(basic, data=train)
regbasic


Call:
lm(formula = basic, data = train)

Coefficients:
(Intercept)          sex         exp1          hsg           mw           so  
   2.790397    -0.113776     0.008171     0.045630    -0.007887    -0.092965  
         we        occ22        occ23        occ24        occ25        occ26  
   0.027930    -0.198916    -0.014664     0.202306     0.242367     0.006158  
      occ27        occ28        occ29       occ210       occ211       occ212  
  -0.180164    -0.796769    -0.241348    -0.015354    -0.399688    -0.506190  
     occ213       occ214       occ215       occ216       occ217       occ218  
  -0.446041    -0.575386    -0.450461    -0.209308    -0.338949    -0.251586  
     occ219       occ220       occ221       occ222        ind23        ind24  
  -0.137541    -0.124913    -0.158960    -0.350055     0.191507     0.036101  
      ind25        ind26        ind27        ind28        ind29       ind210  
   0.034002    -0.004881     0.214846     0.042589     0.076761     0.25978

In [16]:
# calculating the out-of-sample MSE
trainregbasic <- predict(regbasic, newdata=test)
trainregbasic

In [17]:
y.test <- log(test$wage)
n.test <- length(y.test)
MSE.test1 <- sum((y.test-trainregbasic)^2)/n.test
R2.test1<- 1- MSE.test1/var(y.test)
# degrees-of-freedom style adjustments; 51 is the correction used in this lab
R2.test1_adj<-1- MSE.test1/var(y.test)*(n.test/(n.test-51))
# NOTE: this scales the *sum* of squared errors, not the mean, so the printed
# value is n.test times larger than an adjusted MSE
MSE.test1_adj<-sum((y.test-trainregbasic)^2)*(n.test/(n.test-51))
cat("Test MSE for the basic model: ", MSE.test1, " ")
cat("Test MSE_adj for the basic model: ", MSE.test1_adj, " ")
cat("Test R2 for the basic model: ", R2.test1, " ")

cat("Test R2_adj for the basic model: ", R2.test1_adj)

Test MSE for the basic model:  0.2015247  Test MSE_adj for the basic model:  68.22819  Test R2 for the basic model:  0.06784807  Test R2_adj for the basic model:  -0.1434397

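In the basic model, the $MSE_{test}$ (0.2015) is quite close to the $MSE_{sample}$ (0.2082). The reported $MSEadj_{test}$ (68.23), however, is not comparable to the $MSEadj_{sample}$ (0.2157), because it is computed from the sum rather than the mean of the squared prediction errors. Dividing it by the test sample size (276) gives roughly 0.25, which is still above the unadjusted value because the adjustment penalizes the number of regressors.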

In [18]:
# flexible model
# estimating the parameters

regflex <- lm(flex, data=train)

# calculating the out-of-sample MSE
trainregflex<- predict(regflex, newdata=test)
y.test <- log(test$wage)
MSE.test2 <- sum((y.test-trainregflex)^2)/length(y.test)
R2.test2<- 1- MSE.test2/var(y.test)

cat("Test MSE for the flexible model: ", MSE.test2, " ")

cat("Test R2 for the flexible model: ", R2.test2)

"prediction from a rank-deficient fit may be misleading"

Test MSE for the flexible model:  22576.15  Test R2 for the flexible model:  -104424.9

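The test MSE of the flexible model is enormous. R warns that the prediction comes from a rank-deficient fit: the flexible model has 826 coefficients while the training sample has only 1100 observations, so many coefficients are poorly determined and predictions for new observations can be far off. Moreover, the $t$ and $F$ values in this model are high because the number of regressors is close to the number of observations.

In the flexible model, the discrepancy between the $MSE_{test}$ (22576.15) and the $MSE_{sample}$ (0.1252) is very large, a clear sign of overfitting.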

It is worth noting that the $MSE_{test}$ varies across data splits. Hence, it is a good idea to average the out-of-sample MSE over several data splits to get more reliable results.
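A minimal sketch of averaging over repeated splits is shown below. It uses simulated data and a helper function `one_split_mse()` that is defined here purely for illustration; neither is part of the original lab.

```r
# Simulated data (hypothetical, not the CPS sample)
set.seed(2024)
n_sim <- 300
sim_rep <- data.frame(x = rnorm(n_sim))
sim_rep$y <- 2 + 0.7 * sim_rep$x + rnorm(n_sim)

# Hypothetical helper: test MSE for one random 4/5 - 1/5 split
one_split_mse <- function(df) {
  idx  <- sample(seq_len(nrow(df)), floor(nrow(df) * 4 / 5))
  fit  <- lm(y ~ x, data = df[idx, ])
  pred <- predict(fit, newdata = df[-idx, ])
  mean((df$y[-idx] - pred)^2)
}

# Repeat the split 50 times and summarize the test MSE
mse_splits <- replicate(50, one_split_mse(sim_rep))
cat("Mean test MSE over 50 splits:", mean(mse_splits), "\n")
cat("Std. dev. across splits:     ", sd(mse_splits), "\n")
```

The standard deviation across splits gives a sense of how much a single split can mislead; the number of repetitions (50 here) is an arbitrary choice.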

Next, let us use lasso regression in the flexible model instead of OLS regression. Lasso (*least absolute shrinkage and selection operator*) is a penalized regression method that can be used to reduce the complexity of a regression model when the number of regressors $p$ is relatively large in relation to $n$.

Note that the out-of-sample $MSE$ on the test sample can be computed for any other black-box prediction method as well. Thus, let us finally compare the performance of lasso regression in the flexible model to OLS regression.

Based on the out-of-sample $MSE$ reported below, the lasso on the flexible model performs about as well as the basic model (0.2032 versus 0.2015) and far better than the flexible model estimated by OLS (22576.15).
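For intuition, the lasso estimator in its generic textbook form solves the penalized least squares problem below, where $\lambda \ge 0$ is the penalty level and $b_j$ denotes the $j$-th component of the candidate coefficient vector $b$ (notation added here). This is only a sketch of the general idea: `rlasso` chooses the penalty level and regressor-specific penalty loadings from the data, so its exact objective and scaling may differ from this simplified version.

\begin{equation}
\hat\beta \in \arg\min_{b} \; \frac{1}{n}\sum_{i=1}^{n} \bigl(Y_i - b'X_i\bigr)^{2} + \lambda \sum_{j=1}^{p} |b_j|
\end{equation}

The absolute-value penalty sets many coefficients exactly to zero, which is why the lasso output reports a number of selected variables.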

In [19]:
# flexible model using lasso

# estimating the parameters
#library(hdm)
reglasso <- rlasso(flex, data=train, post=FALSE)

# calculating the out-of-sample MSE
trainreglasso<- predict(reglasso, newdata=test)
MSE_flex.lasso <- sum((y.test-trainreglasso)^2)/length(y.test)
R2_flex.lasso<- 1- MSE_flex.lasso/var(y.test)


cat("Test MSE for the lasso on flexible model: ", MSE_flex.lasso, " ")

cat("Test R2 for the lasso flexible model: ", R2_flex.lasso)

Test MSE for the lasso on flexible model:  0.2032003  Test R2 for the lasso flexible model:  0.06009756

In [20]:
summary(reglasso)


Call:
rlasso.formula(formula = flex, data = train, post = FALSE)

Post-Lasso Estimation:  FALSE 

Total number of variables: 825
Number of selected variables: 17 

Residuals: 
     Min       1Q   Median       3Q      Max 
-1.37748 -0.29748 -0.02832  0.27519  3.51903 

              Estimate
(Intercept)      2.681
exp1             0.004
exp2             0.000
exp3             0.000
exp4             0.000
hsg              0.000
occ22            0.000
occ23            0.000
occ24            0.000
occ25            0.000
occ26            0.000
occ27            0.000
occ28           -0.259
occ29            0.000
occ210           0.000
occ211           0.000
occ212           0.000
occ213          -0.191
occ214          -0.141
occ215          -0.162
occ216           0.000
occ217           0.000
occ218           0.000
occ219           0.000
occ220           0.000
occ221           0.000
occ222           0.000
ind23            0.000
ind24            0.000
ind25            0.000
ind26            

In [21]:
# basic model using lasso

# estimating the parameters
#library(hdm)
reglasso_basic <- rlasso(basic, data=train, post=FALSE)

# calculating the out-of-sample MSE
trainreglasso_basic<- predict(reglasso_basic, newdata=test)
MSE_basic.lasso <- sum((y.test-trainreglasso_basic)^2)/length(y.test)
R2_basic.lasso<- 1- MSE_basic.lasso/var(y.test)


cat("Test MSE for the lasso on basic model: ", MSE_basic.lasso, " ")

cat("Test R2 for the lasso basic model: ", R2_basic.lasso)

Test MSE for the lasso on basic model:  0.2051146  Test R2 for the lasso basic model:  0.05124275

In [22]:
summary(reglasso_basic)


Call:
rlasso.formula(formula = basic, data = train, post = FALSE)

Post-Lasso Estimation:  FALSE 

Total number of variables: 47
Number of selected variables: 14 

Residuals: 
     Min       1Q   Median       3Q      Max 
-1.34220 -0.29244 -0.02897  0.28715  3.47008 

            Estimate
(Intercept)    2.696
sex           -0.072
exp1           0.007
hsg            0.000
mw             0.000
so            -0.043
we             0.000
occ22          0.000
occ23          0.000
occ24          0.000
occ25          0.000
occ26          0.000
occ27          0.000
occ28         -0.315
occ29          0.000
occ210         0.000
occ211        -0.071
occ212        -0.036
occ213        -0.214
occ214        -0.220
occ215        -0.176
occ216         0.000
occ217        -0.037
occ218         0.000
occ219         0.000
occ220         0.000
occ221         0.000
occ222        -0.015
ind23          0.000
ind24          0.000
ind25          0.000
ind26          0.000
ind27          0.000
ind28          0

Finally, let us summarize the results:

In [23]:
table2 <- matrix(0, 4,2)
table2[1,1]   <- MSE.test1
table2[2,1]   <- MSE.test2
table2[3,1]   <- MSE_flex.lasso
table2[1,2]   <- R2.test1
table2[2,2]   <- R2.test2
table2[3,2]   <- R2_flex.lasso
table2[4,1]<-MSE_basic.lasso
table2[4,2]<-R2_basic.lasso
rownames(table2)<- c("basic reg","flexible reg","lasso flex regression","lasso basic regression")
colnames(table2)<- c("$MSE_{test}$", "$R^2_{test}$")
tab2 <- xtable(table2, digits =4)
tab2

| Model | $MSE_{test}$ | $R^2_{test}$ |
|---|---|---|
| basic reg | 0.2015247 | 0.06784807 |
| flexible reg | 22576.15 | -104424.9 |
| lasso flex regression | 0.2032003 | 0.06009756 |
| lasso basic regression | 0.2051146 | 0.05124275 |


The basic model estimated by OLS attains the lowest $MSE_{test}$ (0.2015) and the highest $R^2_{test}$ (0.068), with the two lasso fits close behind. However, none of these values is satisfactory: an $R^2_{test}$ below 0.07 indicates that there is not enough information in our data to predict a worker's wage well. We should therefore increase the number of observations and restrict the number of regressors.

# Lasso

## Partialling Out using Lasso

## Case 1

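We use Lasso when the dimension of $W$ is high: the penalty regularizes the model by selecting only some of the variables. Here $D$ is the indicator $sex$, $Y$ is the log wage and $W$ collects the controls. We first partial out the linear effect of $W$ from both $Y$ and $D$ with Lasso, and then regress the residuals of $Y$ on the residuals of $D$.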

In this case, we will use the following matrix $W$:

Matrix $W$ = 'exp1 + hsg + mw + so + we + occ2+ ind2'

In [24]:
library(sandwich)

In [25]:
flex.y <- lwage ~ exp1+hsg+mw+so+we+occ2+ind2 #model for y

flex.d <- sex ~ exp1+hsg+mw+so+we+occ2+ind2 #model for D

#partialling-out the linear effect of W from Y

t.Y <- rlasso(flex.y, data=data)$res

#partialling-out the linear effect of W from D

t.D <- rlasso(flex.d, data=data)$res

#regression of Y on D after partialling out the effect of W
partial.lasso.fit <- lm(t.Y~t.D)
partial.lasso.est_1 <- summary(partial.lasso.fit)$coef[2,1]

cat('Coefficient for D via partialling-out', partial.lasso.est_1)

#standard error

HCV.coefs <- vcovHC(partial.lasso.fit, type='HC')
partial.se_1 <- sqrt(diag(HCV.coefs))[2]

Coefficient for D via partialling-out -0.09065628

With Lasso, the coefficient for $D$ is $-0.09065628$: conditional on $W$, the log wage of women is about $0.09$ lower than that of men in this sample.
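The logic of partialling out is easiest to see with plain OLS, where it reproduces the coefficient from the full regression exactly (the Frisch-Waugh-Lovell theorem). The sketch below swaps Lasso for `lm` and uses simulated data; the variables `w1`, `w2`, `d_sim`, `y_sim` and all coefficients are hypothetical and chosen only for illustration.

```r
# Simulated data (hypothetical, not the CPS sample)
set.seed(42)
n_sim <- 400
w1 <- rnorm(n_sim)
w2 <- rnorm(n_sim)
d_sim <- as.numeric(0.5 * w1 + rnorm(n_sim) > 0)           # binary "treatment"
y_sim <- -0.1 * d_sim + 0.3 * w1 - 0.2 * w2 + rnorm(n_sim)  # outcome

# Coefficient on d_sim from the full regression
full_coef <- coef(lm(y_sim ~ d_sim + w1 + w2))["d_sim"]

# Partialling out: residualize outcome and treatment on the controls ...
res_y <- residuals(lm(y_sim ~ w1 + w2))
res_d <- residuals(lm(d_sim ~ w1 + w2))

# ... then regress residuals on residuals
partial_coef <- coef(lm(res_y ~ res_d))["res_d"]

cat("Full regression:", full_coef, "\n")
cat("Partialling out:", partial_coef, "\n")
```

With OLS the two numbers coincide up to floating-point error. With Lasso in the first step they generally do not, because the penalty shrinks and selects the controls; that is the price paid for handling a high-dimensional $W$.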

## Case 2

In this case, we will use the following matrix $W$:

Matrix $W$ =  (exp1+exp2+exp3+exp4+hsg+occ2+ind2+mw+so+we)**2

In [26]:
flex.y <- lwage ~ (exp1+exp2+exp3+exp4+hsg+mw+so+we+occ2+ind2)**2 #model for y

flex.d <- sex ~ (exp1+exp2+exp3+exp4+hsg+mw+so+we+occ2+ind2)**2 #model for D

#partialling-out the linear effect of W from Y

t.Y <- rlasso(flex.y, data=data)$res

#partialling-out the linear effect of W from D

t.D <- rlasso(flex.d, data=data)$res

#regression of Y on D after partialling out the effect of W
partial.lasso.fit <- lm(t.Y~t.D)
partial.lasso.est_2 <- summary(partial.lasso.fit)$coef[2,1]

cat('Coefficient for D via partialling-out', partial.lasso.est_2)


#standard error

HCV.coefs <- vcovHC(partial.lasso.fit, type='HC')
partial.se_2 <- sqrt(diag(HCV.coefs))[2]

Coefficient for D via partialling-out -0.08458483

In [27]:
table <- matrix(0,2,2)
table[1,1]<- partial.lasso.est_1
table[1,2]<-partial.se_1
table[2,1]<-partial.lasso.est_2
table[2,2]<-partial.se_2
colnames(table)<- c("Estimate","Std. Error")
rownames(table)<- c("partial reg via lasso(1)","partial reg via lasso(2)")
tab<- xtable(table, digits=c(3, 3, 4))
tab

| Method | Estimate | Std. Error |
|---|---|---|
| partial reg via lasso(1) | -0.09065628 | 0.03223961 |
| partial reg via lasso(2) | -0.08458483 | 0.03248088 |
