# Data analysis

In [1]:
load("../data/wage2015_subsample_inference.Rdata")
dim(data)

In [2]:
str(data)

'data.frame':	5150 obs. of  20 variables:
 $ wage : num  9.62 48.08 11.06 13.94 28.85 ...
 $ lwage: num  2.26 3.87 2.4 2.63 3.36 ...
 $ sex  : num  1 0 0 1 1 1 1 0 1 1 ...
 $ shs  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hsg  : num  0 0 1 0 0 0 1 1 1 0 ...
 $ scl  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ clg  : num  1 1 0 0 1 1 0 0 0 1 ...
 $ ad   : num  0 0 0 1 0 0 0 0 0 0 ...
 $ mw   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ so   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ we   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ne   : num  1 1 1 1 1 1 1 1 1 1 ...
 $ exp1 : num  7 31 18 25 22 1 42 37 31 4 ...
 $ exp2 : num  0.49 9.61 3.24 6.25 4.84 ...
 $ exp3 : num  0.343 29.791 5.832 15.625 10.648 ...
 $ exp4 : num  0.24 92.35 10.5 39.06 23.43 ...
 $ occ  : Factor w/ 369 levels "10","20","40",..: 159 136 269 23 99 86 226 232 184 146 ...
 $ occ2 : Factor w/ 22 levels "1","2","3","4",..: 11 10 19 1 6 5 17 17 13 10 ...
 $ ind  : Factor w/ 236 levels "370","380","390",..: 204 117 12 165 231 176 171 135 210 201 ...
 $ ind2 : Factor w/ 

We construct the outcome variable $Y$ and the matrix $Z$, which contains the worker characteristics available in the data.

In [4]:
# selecting the subsample of interest: some high school (shs) or high school graduate (hsg)
library(dplyr)
data <-filter(data, shs==1 | hsg==1)
Y <- log(data$wage)
n <- length(Y)
Z <- data[-which(colnames(data) %in% c("wage","lwage"))]
p <- dim(Z)[2]
cat("Number of observations:", n, '\n')
cat( "Number of raw regressors:", p)

Number of observations: 1376 
Number of raw regressors: 18

In [5]:
install.packages("xtable")
library(xtable)
options(xtable.floating = FALSE)
options(xtable.timestamp = "")


In [6]:

# keeping variables of interest to do the table
data_z <- data[which(colnames(data) %in% c("lwage","sex","shs","hsg","mw","so","we","ne","exp1"))]




In [7]:
library(xtable)
table <- matrix(0, 9, 1)
table[1:9,1]   <- as.numeric(lapply(data_z,mean))
rownames(table) <- c("Log Wage","Sex","Some High School","High School Graduate","Midwest","South","West","Northeast","Experience" )
colnames(table) <- c("Sample mean")
tab<- xtable(table, digits = 2)
tab

| | Sample mean |
|---|---|
| Log Wage | 2.7185624 |
| Sex | 0.3219477 |
| Some High School | 0.0872093 |
| High School Graduate | 0.9127907 |
| Midwest | 0.2863372 |
| South | 0.2914244 |
| West | 0.1984012 |
| Northeast | 0.2238372 |
| Experience | 17.1900436 |


For example, the share of female workers in our sample is about 32% ($sex=1$ if female).


# Prediction Question

We construct a prediction rule for the log hourly wage $Y$ of workers who did not go to college (high school graduates or those with some high school education), which depends on job-relevant characteristics $X$:
$$\label{decompose}
Y = \beta'X+ \epsilon.
$$

We aim to predict wages using the exogenous characteristics of workers. For this purpose, we use two different specifications for prediction:

Basic Model: $X$ consists of a set of raw regressors: Sex, Experience, Education, Occupation, Region, etc.

Flexible Model: $X$ consists of all raw regressors from the basic model plus their transformations and two-way interactions. We define these by expressing our set of regressors as:
$$\label{regressor}
X=(X_1 + X_2 + X_3+...)^2
$$
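In R's formula syntax, which the flexible specification relies on, `(a + b + c)^2` expands to all main effects plus all two-way interactions; it does not create squared terms. A minimal sketch on a made-up data frame (the names `toy`, `x1`, `x2`, `x3` are illustrative and not part of the wage data) shows the resulting columns:

```r
# Toy data frame; variable names are hypothetical
set.seed(1)
toy <- data.frame(x1 = rnorm(5), x2 = rnorm(5), x3 = rnorm(5))

# In a formula, (x1 + x2 + x3)^2 means main effects + pairwise interactions
X <- model.matrix(~ (x1 + x2 + x3)^2, data = toy)
colnames(X)
# "(Intercept)" "x1" "x2" "x3" "x1:x2" "x1:x3" "x2:x3"
```

This is presumably why the experience polynomials `exp2`, `exp3` and `exp4` enter the flexible formula as separate variables: the `^2` operator alone would not generate them.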

In the following sections, we will estimate both models using the OLS method and the Lasso method.

## OLS - BASIC MODEL

In [8]:
basic <- lwage~ sex + exp1 + shs + hsg + scl+clg+ mw + so + we +occ2+ind2
regbasic <- lm(basic, data=data)
regbasic # estimated coefficients
cat( "Number of regressors in the basic model:",length(regbasic$coef), '\n') # number of regressors in the Basic Model


Call:
lm(formula = basic, data = data)

Coefficients:
(Intercept)          sex         exp1          shs          hsg          scl  
  2.8330066   -0.0733094    0.0075742   -0.0811342           NA           NA  
        clg           mw           so           we        occ22        occ23  
         NA   -0.0431882   -0.1091620    0.0129620   -0.1961261   -0.0086113  
      occ24        occ25        occ26        occ27        occ28        occ29  
  0.0005078    0.2615289   -0.3510072   -0.1900342   -0.6616521   -0.3013316  
     occ210       occ211       occ212       occ213       occ214       occ215  
 -0.0576220   -0.4176903   -0.4663571   -0.4219896   -0.5527766   -0.4747648  
     occ216       occ217       occ218       occ219       occ220       occ221  
 -0.2381724   -0.3529422   -0.3976108   -0.1181885   -0.1053967   -0.1737437  
     occ222        ind23        ind24        ind25        ind26        ind27  
 -0.3479965    0.1742747    0.0504201    0.0585330    0.0348081    0.2379530

Number of regressors in the basic model: 51 


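Note that the basic model consists of 51 regressors, counting the intercept. Three of them (`hsg`, `scl`, `clg`) are reported as `NA` because they are collinear with the other regressors in this subsample (for instance, `hsg` equals `1 - shs` here).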
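The `NA` entries in the coefficient table come from regressors that `lm` cannot separate from the others. A small simulated sketch (all names are hypothetical) shows how such columns are reported and how to count the coefficients that were actually estimated:

```r
# Simulated data; all names are hypothetical
set.seed(1)
n  <- 50
d1 <- rbinom(n, 1, 0.3)      # a dummy, e.g. "some high school"
d2 <- 1 - d1                 # complementary dummy: collinear with intercept + d1
z  <- rep(0, n)              # a dummy that is never switched on in this sample
y  <- 1 + 0.5 * d1 + rnorm(n)

fit <- lm(y ~ d1 + d2 + z)
coef(fit)                    # d2 and z are reported as NA
length(coef(fit))            # 4: the count includes the NA entries
sum(!is.na(coef(fit)))       # 2: coefficients actually estimated
fit$rank                     # 2: the same number, taken from the fit itself
```

`length(regbasic$coef)` therefore counts aliased coefficients too, whereas the rank of the fit counts only the estimable ones.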

## OLS - FLEX MODEL

In [9]:
flex <- lwage ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)^2
regflex <- lm(flex, data=data)
regflex # estimated coefficients
cat( "Number of regressors in the flexible model:",length(regflex$coef)) # number of regressors in the Flexible Model



Call:
lm(formula = flex, data = data)

Coefficients:
  (Intercept)           exp1           exp2           exp3           exp4  
    1.601e+01     -3.411e+00      3.874e+01     -2.133e+01      7.337e+00  
          shs            hsg            scl            clg          occ22  
   -5.723e-01             NA             NA             NA     -2.441e+00  
        occ23          occ24          occ25          occ26          occ27  
   -4.590e+01      7.898e+00     -5.081e+01      1.803e+01     -9.997e-01  
        occ28          occ29         occ210         occ211         occ212  
   -1.024e+01     -2.064e+01     -3.857e+00      2.756e-01      1.765e+00  
       occ213         occ214         occ215         occ216         occ217  
   -2.087e+00     -8.440e-01     -1.059e+01     -1.232e-01     -4.549e+00  
       occ218         occ219         occ220         occ221         occ222  
   -3.445e-01     -6.630e+00     -2.868e+00     -1.239e+00     -1.932e+00  
        ind23          ind24      

Number of regressors in the flexible model: 979

Note that the flexible model consists of 979 regressors, counting the intercept.

# Try Lasso next 

## Lasso Basic Model

In [10]:
install.packages("hdm")
install.packages("sandwich")
library(hdm)
basic <- lwage~ sex + exp1 + shs + hsg + scl+clg+ mw + so + we +occ2+ind2
lassoregbasic<- rlasso(basic, data=data)

sumlassobasic<- summary(lassoregbasic)



Call:
rlasso.formula(formula = basic, data = data)

Post-Lasso Estimation:  TRUE 

Total number of variables: 50
Number of selected variables: 6 

Residuals: 
     Min       1Q   Median       3Q      Max 
-1.37681 -0.29491 -0.01412  0.27657  3.47488 

            Estimate
(Intercept)    2.667
sex           -0.098
exp1           0.008
shs            0.000
hsg            0.000
scl            0.000
clg            0.000
mw             0.000
so             0.000
we             0.000
occ22          0.000
occ23          0.000
occ24          0.000
occ25          0.000
occ26          0.000
occ27          0.000
occ28          0.000
occ29          0.000
occ210         0.000
occ211         0.000
occ212         0.000
occ213        -0.240
occ214        -0.312
occ215        -0.302
occ216         0.000
occ217         0.000
occ218         0.000
occ219         0.000
occ220         0.000
occ221         0.000
occ222         0.000
ind23          0.000
ind24          0.000
ind25          0.000
ind26       
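The summary reports `Post-Lasso Estimation: TRUE`, meaning that the coefficients shown come from an OLS refit on the variables the lasso selected. The sketch below illustrates this two-step idea on simulated data with a hand-written coordinate-descent lasso and a fixed penalty. It is only an illustration: the function `lasso_cd`, the data and the value of `lambda` are assumptions made here, and `rlasso` chooses its penalty in a data-driven way rather than taking it as given.

```r
set.seed(1)
n <- 200; p <- 20
X <- scale(matrix(rnorm(n * p), n, p))        # standardized regressors
y <- 2 * X[, 1] - 1.5 * X[, 2] + rnorm(n)     # only the first two matter
y <- y - mean(y)                              # centered, so no intercept is needed

soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)   # soft-thresholding

# Coordinate descent for (1/(2n)) * sum((y - X b)^2) + lambda * sum(abs(b))
lasso_cd <- function(X, y, lambda, iters = 100) {
  n <- nrow(X)
  b <- rep(0, ncol(X))
  r <- y                                      # current residual y - X b
  for (it in seq_len(iters)) {
    for (j in seq_along(b)) {
      r <- r + X[, j] * b[j]                  # take x_j's contribution out of the fit
      b[j] <- soft(sum(X[, j] * r) / n, lambda) / (sum(X[, j]^2) / n)
      r <- r - X[, j] * b[j]                  # put it back with the updated b_j
    }
  }
  b
}

b_lasso  <- lasso_cd(X, y, lambda = 0.3)
selected <- which(b_lasso != 0)               # variables kept by the lasso

# Post-lasso: plain OLS on the selected columns only
b_post <- coef(lm(y ~ X[, selected] - 1))
```

The lasso coefficients in `b_lasso` are shrunk toward zero by the penalty; the refit in `b_post` removes that shrinkage for the selected variables.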

## Lasso Flex Model

In [11]:
library(hdm)
flex <- lwage ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)^2
lassoreg<- rlasso(flex, data=data )

sumlasso<- summary(lassoreg)


Call:
rlasso.formula(formula = flex, data = data)

Post-Lasso Estimation:  TRUE 

Total number of variables: 978
Number of selected variables: 13 

Residuals: 
      Min        1Q    Median        3Q       Max 
-1.584944 -0.306009 -0.009298  0.273618  3.557535 

              Estimate
(Intercept)      2.630
exp1             0.005
exp2             0.000
exp3             0.000
exp4             0.000
shs              0.000
hsg              0.000
scl              0.000
clg              0.000
occ22            0.000
occ23            0.000
occ24            0.000
occ25            0.000
occ26            0.000
occ27            0.000
occ28            0.000
occ29            0.000
occ210           0.000
occ211           0.000
occ212           0.000
occ213          -0.225
occ214          -0.271
occ215           0.000
occ216           0.000
occ217           0.000
occ218           0.000
occ219           0.000
occ220           0.000
occ221           0.000
occ222           0.000
ind23            0.000


Now, we can evaluate the performance of both models based on the (adjusted) $R^2_{sample}$ and the (adjusted) $MSE_{sample}$:
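For reference, these measures can be written as follows, where $n$ is the sample size, $p$ the number of estimated coefficients (including the intercept) and $\hat\epsilon_i$ the residuals. The $MSE$ adjustment is read off the code in the next cells (`(n/(n-p))*MSE`); the adjusted $R^2$ is stated in the form that `summary.lm` uses for a model with an intercept:

$$
MSE_{sample} = \frac{1}{n}\sum_{i=1}^{n}\hat\epsilon_i^2, \qquad
MSE_{adjusted} = \frac{n}{n-p}\,MSE_{sample},
$$

$$
R^2_{sample} = 1 - \frac{\sum_{i=1}^{n}\hat\epsilon_i^2}{\sum_{i=1}^{n}(Y_i-\bar Y)^2}, \qquad
R^2_{adjusted} = 1 - \frac{n-1}{n-p}\,\bigl(1-R^2_{sample}\bigr).
$$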

In [12]:
# Assess the predictive performance

sumbasic <- summary(regbasic)
sumflex <- summary(regflex)

#  R-squared 
R2.1 <- sumbasic$r.squared
cat("R-squared for the basic model: ", R2.1, "\n")
R2.adj1 <- sumbasic$adj.r.squared
cat("adjusted R-squared for the basic model: ", R2.adj1, "\n")

R2.2 <- sumflex$r.squared
cat("R-squared for the flexible model: ", R2.2, "\n")
R2.adj2 <- sumflex$adj.r.squared
cat("adjusted R-squared for the flexible model: ", R2.adj2, "\n")

R2.LB <- sumlassobasic$r.squared
cat("R-squared for the lasso with basic model: ", R2.LB, "\n")
R2.adjLB <- sumlassobasic$adj.r.squared
cat("adjusted R-squared for the lasso with basic model: ", R2.adjLB, "\n")

R2.L <- sumlasso$r.squared
cat("R-squared for the lasso with flexible model: ", R2.L, "\n")
R2.adjL <- sumlasso$adj.r.squared
cat("adjusted R-squared for the lasso with flexible model: ", R2.adjL, "\n")

R-squared for the basic model:  0.1802381 
adjusted R-squared for the basic model:  0.1512255 
R-squared for the flexible model:  0.507044 
adjusted R-squared for the flexible model:  0.2315028 
R-squared for the lasso with basic model:  0.0970881 
adjusted R-squared for the lasso with basic model:  0.09313085 
R-squared for the lasso with flexible model:  0.09236528 
adjusted R-squared for the lasso with flexible model:  0.0837021 


In [13]:
# calculating the MSE
MSE1 <- mean(sumbasic$res^2)
cat("MSE for the basic model: ", MSE1, "\n")
p1 <- sumbasic$df[1] # number of regressors
MSE.adj1 <- (n/(n-p1))*MSE1
cat("adjusted MSE for the basic model: ", MSE.adj1, "\n")

MSE2 <-mean(sumflex$res^2)
cat("MSE for the flexible model: ", MSE2, "\n")
p2 <- sumflex$df[1]
MSE.adj2 <- (n/(n-p2))*MSE2
cat("adjusted MSE for the flexible model: ", MSE.adj2, "\n")

MSELB <-mean(sumlassobasic$res^2)
cat("MSE for the lasso basic model: ", MSELB, "\n")
pLB <- length(sumlassobasic$coef) # number of coefficients in the lasso basic model
MSE.adjLB <- (n/(n-pLB))*MSELB
cat("adjusted MSE for the lasso basic model: ", MSE.adjLB, "\n")

MSEL <-mean(sumlasso$res^2)
cat("MSE for the lasso flexible model: ", MSEL, "\n")
pL <- length(sumlasso$coef)
MSE.adjL <- (n/(n-pL))*MSEL
cat("adjusted MSE for the lasso flexible model: ", MSE.adjL )

MSE for the basic model:  0.2082191 
adjusted MSE for the basic model:  0.2157451 
MSE for the flexible model:  0.1252106 
adjusted MSE for the flexible model:  0.1953398 
MSE for the lasso basic model:  0.2293392 
adjusted MSE for the lasso basic model:  0.238167 
MSE for the lasso flexible model:  0.2305387 
adjusted MSE for the lasso flexible model:  0.7990461

In [14]:
library(xtable)
table <- matrix(0, 4, 5)
table[1,1:5]   <- c(p1,R2.1,MSE1,R2.adj1,MSE.adj1)
table[2,1:5]   <- c(p2,R2.2,MSE2,R2.adj2,MSE.adj2)
table[3,1:5]   <- c(pLB,R2.LB,MSELB,R2.adjLB,MSE.adjLB)
table[4,1:5]   <- c(pL,R2.L,MSEL,R2.adjL,MSE.adjL)
colnames(table)<- c("p","$R^2_{sample}$","$MSE_{sample}$","$R^2_{adjusted}$", "$MSE_{adjusted}$")
rownames(table)<- c("basic reg","flexible reg", "basic lasso", "flex lasso")
tab<- xtable(table, digits =c(0,0,2,2,2,2))
tab

| | p | $R^2_{sample}$ | $MSE_{sample}$ | $R^2_{adjusted}$ | $MSE_{adjusted}$ |
|---|---|---|---|---|---|
| basic reg | 48 | 0.18023815 | 0.2082191 | 0.1512255 | 0.2157451 |
| flexible reg | 494 | 0.507044 | 0.1252106 | 0.2315028 | 0.1953398 |
| basic lasso | 51 | 0.0970881 | 0.2293392 | 0.09313085 | 0.238167 |
| flex lasso | 979 | 0.09236528 | 0.2305387 | 0.0837021 | 0.7990461 |
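As a quick sanity check, the adjusted figures for the basic OLS model can be reproduced from the numbers printed above. The inputs below are copied from that output; the adjusted $R^2$ formula is the standard one for a model with an intercept, stated here as an assumption about how the reported value was obtained:

```r
n   <- 1376        # observations in the subsample
p   <- 48          # estimated coefficients in the basic OLS model
mse <- 0.2082191   # in-sample MSE reported above
r2  <- 0.18023815  # in-sample R-squared reported above

mse_adj <- n / (n - p) * mse
r2_adj  <- 1 - (n - 1) / (n - p) * (1 - r2)

round(mse_adj, 4)  # 0.2157
round(r2_adj, 4)   # 0.1512
```

Both values agree with the basic regression row of the table.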


# DATA SPLITTING

In [15]:
#splitting the data
set.seed(1) # to make the results replicable (generating random numbers)
random <- sample( 1:n, floor(n*4/5))
# draw (4/5)*n random numbers from 1 to n without replacing them
length(random)

In [16]:
train <- data[random,] # training sample
test <- data[-random,] # testing sample
dim(train)
dim(test)
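With $n = 1376$, the split above puts `floor(n*4/5) = 1100` rows in the training sample and the remaining 276 in the test sample. A small stand-alone check of the index logic, which needs no wage data (`test_idx` is a name introduced here for illustration):

```r
n <- 1376                                  # sample size reported earlier
set.seed(1)
random <- sample(1:n, floor(n * 4 / 5))    # training indices, drawn without replacement

length(random)                             # 1100 training rows
n - length(random)                         # 276 test rows
anyDuplicated(random)                      # 0: no index is drawn twice

# Negative indexing keeps exactly the rows that are not in `random`
test_idx <- (1:n)[-random]
length(intersect(random, test_idx))        # 0: the two samples are disjoint
```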

### Basic Regression

In [17]:

# estimating the parameters in the training sample
regbasic <- lm(basic, data=train)
regbasic


Call:
lm(formula = basic, data = train)

Coefficients:
(Intercept)          sex         exp1          shs          hsg          scl  
   2.836027    -0.113776     0.008171    -0.045630           NA           NA  
        clg           mw           so           we        occ22        occ23  
         NA    -0.007887    -0.092965     0.027930    -0.198916    -0.014664  
      occ24        occ25        occ26        occ27        occ28        occ29  
   0.202306     0.242367     0.006158    -0.180164    -0.796769    -0.241348  
     occ210       occ211       occ212       occ213       occ214       occ215  
  -0.015354    -0.399688    -0.506190    -0.446041    -0.575386    -0.450461  
     occ216       occ217       occ218       occ219       occ220       occ221  
  -0.209308    -0.338949    -0.251586    -0.137541    -0.124913    -0.158960  
     occ222        ind23        ind24        ind25        ind26        ind27  
  -0.350055     0.191507     0.036101     0.034002    -0.004881     0.21484

In [18]:
# calculating the out-of-sample MSE
trainregbasic <- predict(regbasic, newdata=test)
trainregbasic

"prediction from a rank-deficient fit may be misleading"

In [19]:
y.test <- log(test$wage)
MSE.test1 <- sum((y.test-trainregbasic)^2)/length(y.test)
R2.test1<- 1- MSE.test1/var(y.test)

cat("Test MSE for the basic model: ", MSE.test1, "\n")

cat("Test R2 for the basic model: ", R2.test1)

Test MSE for the basic model:  0.2015247 
Test R2 for the basic model:  0.06784807

In the basic model, the $MSE_{test}$ (0.202) is quite close to the $MSE_{sample}$ (0.208).

### Flexible Regression

In [20]:
# estimating the parameters
#options(warn=-1)
regflex <- lm(flex,data=train)

# calculating the out-of-sample MSE
trainregflex<- predict(regflex, newdata=test )
y.test <- log(test$wage)
MSE.test2 <- sum((y.test-trainregflex)^2)/length(y.test)
R2.test2<- 1- MSE.test2/var(y.test)

cat("Test MSE for the flexible model: ", MSE.test2, "\n")

cat("Test R2 for the flexible model: ", R2.test2)

"prediction from a rank-deficient fit may be misleading"

Test MSE for the flexible model:  22576.15 
Test R2 for the flexible model:  -104424.9
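A test MSE this large is consistent with severe overfitting: the flexible specification has hundreds of coefficients relative to roughly 1,100 training rows, and the fit is rank-deficient. The simulated sketch below (made-up data, hypothetical names) reproduces the qualitative pattern on a smaller scale, with one relevant regressor and many irrelevant ones:

```r
set.seed(1)
n <- 100; p <- 80
make_data <- function(n, p) {
  X <- matrix(rnorm(n * p), n, p)
  data.frame(y = X[, 1] + rnorm(n), X)     # only the first column matters
}
train <- make_data(n, p)
test  <- make_data(n, p)

fit <- lm(y ~ ., data = train)             # 81 coefficients estimated from 100 rows

mse_train <- mean(residuals(fit)^2)
mse_test  <- mean((test$y - predict(fit, newdata = test))^2)
c(train = mse_train, test = mse_test)      # the test MSE is several times larger
```

The in-sample MSE looks excellent precisely because the model has fitted the noise, which is why the in-sample measures alone cannot rank the specifications.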

## Data splitting 


### Lasso - Basic Regression

In [21]:
# Basic model using lasso

# estimating the parameters
library(hdm)
reglassobasic <- rlasso(basic, data=train, post=FALSE)

# calculating the out-of-sample MSE
trainreglassobasic<- predict(reglassobasic, newdata=test)
MSE.lasso1 <- sum((y.test-trainreglassobasic)^2)/length(y.test)
R2.lasso1<- 1- MSE.lasso1/var(y.test)


cat("Test MSE for the lasso on basic model: ", MSE.lasso1, "\n")

cat("Test R2 for the lasso basic model: ", R2.lasso1, "\n")

Test MSE for the lasso on basic model:  0.2051704 
Test R2 for the lasso basic model:  0.05098462 



### Lasso - Flexible Regression

In [22]:
# flexible model using lasso

# estimating the parameters
library(hdm)
reglassoflex <- rlasso(flex, data=train, post=FALSE)

# calculating the out-of-sample MSE
trainreglassoflex<- predict(reglassoflex, newdata=test)
MSE.lasso2 <- sum((y.test-trainreglassoflex)^2)/length(y.test)
R2.lasso2<- 1- MSE.lasso2/var(y.test)


cat("Test MSE for the lasso on flexible model: ", MSE.lasso2, "\n")

cat("Test R2 for the lasso flexible model: ", R2.lasso2, "\n")

Test MSE for the lasso on flexible model:  0.2023553 
Test R2 for the lasso flexible model:  0.06400605 


In [23]:
table2 <- matrix(0,4,2)
table2[1,1]   <- MSE.test1
table2[2,1]   <- MSE.test2
table2[3,1]   <- MSE.lasso1
table2[4,1]   <- MSE.lasso2
table2[1,2]   <- R2.test1
table2[2,2]   <- R2.test2
table2[3,2]   <- R2.lasso1
table2[4,2]   <- R2.lasso2
rownames(table2)<- c("basic reg","flexible reg"," Basic lasso regression", " Flex lasso regression")
colnames(table2)<- c("$MSE_{test}$", "$R^2_{test}$")
tab2 <- xtable(table2, digits =3)
tab2

| | $MSE_{test}$ | $R^2_{test}$ |
|---|---|---|
| basic reg | 0.2015247 | 0.06784807 |
| flexible reg | 22576.15 | -104424.9 |
| Basic lasso regression | 0.2051704 | 0.05098462 |
| Flex lasso regression | 0.2023553 | 0.06400605 |


# Partialling-Out using lasso

In [24]:
library(sandwich)

# Model 1 via Lasso
basic.y <- lwage ~  exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2 # model for Y
basic.d <- sex ~ exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2 # model for D

# partialling-out the linear effect of W from Y
t.Y <- rlasso(basic.y, data=data)$res
# partialling-out the linear effect of W from D
t.D <- rlasso(basic.d, data=data)$res

# regression of Y on D after partialling-out the effect of W
partialbasic.lasso.fit <- lm(t.Y~t.D)
partialbasic.lasso.est <- summary(partialbasic.lasso.fit)$coef[2,1]

cat("Coefficient for D via partialling-out using lasso", partialbasic.lasso.est)

# standard error
HCV.coefs <- vcovHC(partialbasic.lasso.fit, type = 'HC')
partialbasic.lasso.se <- sqrt(diag(HCV.coefs))[2]


Coefficient for D via partialling-out using lasso -0.09065628
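The partialling-out steps mirror the Frisch-Waugh-Lovell theorem for OLS: regressing the residualized outcome on the residualized treatment gives the same coefficient as the full regression. The cell above follows the same recipe but takes the residuals from lasso fits instead. A minimal check of the OLS version on simulated data (all names and numbers are illustrative):

```r
set.seed(1)
n <- 500
W <- matrix(rnorm(n * 3), n, 3)                 # controls
D <- as.numeric(W[, 1] + rnorm(n) > 0)          # binary regressor correlated with W
Y <- as.numeric(-0.1 * D + W %*% c(0.5, -0.2, 0.1) + rnorm(n))

full <- coef(lm(Y ~ D + W))["D"]                # coefficient on D in the full regression

t.Y <- residuals(lm(Y ~ W))                     # partial the linear effect of W out of Y
t.D <- residuals(lm(D ~ W))                     # partial the linear effect of W out of D
part <- coef(lm(t.Y ~ t.D))["t.D"]              # regression of residuals on residuals

all.equal(unname(full), unname(part))           # TRUE
```

With lasso residuals the equality is no longer exact, but the interpretation of the coefficient on `t.D` is the same.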

In [25]:
# Partialling-Out using lasso

#Model 2 via Lasso


# models
extraflex.y <- lwage ~  (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)^2 # model for Y
extraflex.d <- sex ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)^2 # model for D

# partialling-out the linear effect of W from Y
t.Y <- rlasso(extraflex.y, data=data)$res
# partialling-out the linear effect of W from D
t.D <- rlasso(extraflex.d, data=data)$res

# regression of Y on D after partialling-out the effect of W
partialflex.lasso.fit <- lm(t.Y~t.D)
partialflex.lasso.est <- summary(partialflex.lasso.fit)$coef[2,1]

cat("Coefficient for D via partialling-out using lasso", partialflex.lasso.est)

# standard error
HCV.coefs <- vcovHC(partialflex.lasso.fit, type = 'HC')
partialflex.lasso.se <- sqrt(diag(HCV.coefs))[2]


Coefficient for D via partialling-out using lasso -0.08141371
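The standard errors in the two cells above come from `vcovHC(..., type = 'HC')`, which in the `sandwich` package is White's heteroskedasticity-consistent estimator (also known as HC0). For a regression with an intercept and one regressor it is easy to compute by hand, as the following sketch does on simulated data (names are illustrative, and `sandwich` is deliberately not used so that the block runs in base R):

```r
set.seed(1)
n <- 300
x <- rnorm(n)
y <- 1 - 0.1 * x + rnorm(n) * (1 + abs(x))   # heteroskedastic errors

fit <- lm(y ~ x)
X <- model.matrix(fit)
e <- residuals(fit)

# Sandwich formula: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}
bread <- solve(crossprod(X))
meat  <- crossprod(X * e)                    # row i of X scaled by e_i, so this is X' diag(e^2) X
V_hc0 <- bread %*% meat %*% bread

se_hc0 <- sqrt(diag(V_hc0))[2]               # robust standard error of the slope
se_ols <- summary(fit)$coef[2, 2]            # classical standard error, for comparison
c(robust = se_hc0, classical = se_ols)
```

The robust and classical standard errors differ here because the error variance depends on `x`.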

In [26]:

table<- matrix(0, 2, 2)
table[1,1]<- partialbasic.lasso.est
table[1,2]<- partialbasic.lasso.se    
table[2,1]<-  partialflex.lasso.est
table[2,2]<- partialflex.lasso.se 
colnames(table)<- c("Estimate","Std. Error")
rownames(table)<- c("Basic model via Lasso","Flex model via Lasso")	
tab<- xtable(table, digits=c(3,3,4))
tab

| | Estimate | Std. Error |
|---|---|---|
| Basic model via Lasso | -0.09065628 | 0.03223961 |
| Flex model via Lasso | -0.08141371 | 0.03239641 |



Both estimates are very similar (about $-0.09$ and $-0.08$, each with a standard error of roughly $0.03$) and lead to the same conclusion: conditional on the controls, women's log wages are lower than men's.