REPLICATE THE PM1_NOTEBOOK1_PREDICTION_NEWDATA IN R WITH JN BUT RESTRICTED DATA

# DATA

The data set we consider is from the March Supplement of the U.S. Current Population Survey, year 2015. We select white non-Hispanic individuals, aged 25 to 64 years, who work more than 35 hours per week during at least 50 weeks of the year. We exclude self-employed workers; individuals living in group quarters; individuals in the military, agricultural or private household sectors; individuals with inconsistent reports on earnings and employment status; individuals with allocated or missing information in any of the variables used in the analysis; and individuals with an hourly wage below 3.

The variable of interest 𝑌 is the hourly wage rate constructed as the ratio of the annual earnings to the total number of hours worked, which is constructed in turn as the product of number of weeks worked and the usual number of hours worked per week. In our analysis, we also focus on single (never married) workers. The final sample is of size 𝑛=5150.
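As a small worked example of the wage construction described above, the sketch below computes the hourly wage for one hypothetical worker. The variable names and numbers are illustrative assumptions and are not taken from the CPS file.

```r
# Hypothetical worker: values are illustrative, not taken from the CPS file
annual_earnings <- 52000   # annual earnings
weeks_worked    <- 52      # number of weeks worked in the year
hours_per_week  <- 40      # usual number of hours worked per week

# Total hours worked = weeks worked x usual hours per week
total_hours <- weeks_worked * hours_per_week

# Hourly wage = annual earnings / total hours worked
hourly_wage <- annual_earnings / total_hours

# Sample rule from the text: drop observations with hourly wage below 3
keep <- hourly_wage >= 3

cat("Total hours:", total_hours, "Hourly wage:", hourly_wage, "Kept:", keep, "\n")
```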

In [None]:
load("C:/Users/marci/Documentos/GitHub/ECO224/Labs/data/wage2015_subsample_inference.Rdata")
dim(data)

In [None]:
str(data)

The data contain 5150 observations on 20 variables. Following the instructions, we only consider a subsample of the data: people who did not go to college. We therefore select on the variables shs (some high school) and hsg (high school graduate).

In [None]:
data <- subset(data, shs==1 | hsg==1) # keep workers with some high school or a high school degree
dim(data)


The output above shows how many observations remain after the restriction, and hence how many were lost. Next, we construct the outcome Y and the matrix of regressors Z.
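Because the subsetting step overwrites `data`, the number of dropped rows can only be reported if the row count is stored beforehand. A minimal sketch on a made-up data frame (the object names `toy`, `toy_sub`, `n_before`, `n_after` and `n_lost` are hypothetical):

```r
# Toy data frame standing in for the CPS extract (values are made up)
toy <- data.frame(
  shs = c(1, 0, 0, 0, 0),
  hsg = c(0, 1, 1, 0, 0),
  clg = c(0, 0, 0, 1, 1)
)

n_before <- nrow(toy)                         # rows before the restriction
toy_sub  <- subset(toy, shs == 1 | hsg == 1)  # same rule as in the notebook
n_after  <- nrow(toy_sub)                     # rows kept
n_lost   <- n_before - n_after                # rows dropped by the restriction

cat("Kept:", n_after, "Dropped:", n_lost, "\n")
```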

In [None]:
Y <- log(data$wage)
n <- length(Y)
Z <- data[-which(colnames(data) %in% c("wage","lwage"))]
p <- dim(Z)[2]

cat("Number of observations:", n, '\n')
cat("Number of raw regressors:", p)


In [None]:
library(xtable)
options(xtable.floating = FALSE)
options(xtable.timestamp = "")

In [None]:
library(xtable)
Z_subset <- data[which(colnames(data) %in% c("lwage", "sex", "shs","hsg","scl","clg","ad","mw","so","we","ne","exp1"))]
table <- matrix(0,12,1)
table[1:12,1] <- as.numeric(lapply(Z_subset,mean))
rownames(table) <-c("Log Wage", "Sex", "Some High School","High School Graduate","Some College","College Graduate","Advanced Degree","Midwest","South","West","Northeast","Experience")
colnames(table) <-c("Sample mean")
tab<- xtable(table, digits =2)
tab

E.g., the share of female workers in our sample is ~32% ($sex=1$ if female).

Alternatively, we can also print the table as LaTeX.

In [None]:
print(tab, type="latex")

# PREDICTION QUESTION

Now, we will construct a prediction rule for the (log) hourly wage $Y$, which depends linearly on job-relevant characteristics $X$:

$$Y = \beta'X + \epsilon$$
                                
Our goals are

+ Predict wages using various characteristics of workers.

+ Assess the predictive performance using the (adjusted) sample MSE, the (adjusted) sample $R^2$, and the out-of-sample MSE and $R^2$.
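The in-sample measures listed above can be written out explicitly. The sketch below uses simulated data (an assumption; all `_sim` names are hypothetical) and applies the same adjustment as the notebook, $MSE_{adjusted} = \frac{n}{n-p} MSE_{sample}$, then checks the hand-computed values against what `summary(lm)` reports.

```r
set.seed(123)

# Simulated data standing in for the wage data
n_sim <- 200
x1 <- rnorm(n_sim)
x2 <- rnorm(n_sim)
y_sim <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n_sim)

fit_sim <- lm(y_sim ~ x1 + x2)
p_sim   <- length(coef(fit_sim))   # estimated coefficients, including intercept
res_sim <- resid(fit_sim)

# Sample MSE and its degrees-of-freedom adjusted version
mse_sim     <- mean(res_sim^2)
mse_adj_sim <- n_sim / (n_sim - p_sim) * mse_sim

# Sample R^2 and adjusted R^2
r2_sim     <- 1 - sum(res_sim^2) / sum((y_sim - mean(y_sim))^2)
r2_adj_sim <- 1 - (1 - r2_sim) * (n_sim - 1) / (n_sim - p_sim)

# Compare with the values stored in the lm summary
s_sim <- summary(fit_sim)
cat("R2 matches summary:         ", isTRUE(all.equal(r2_sim, s_sim$r.squared)), "\n")
cat("adj. R2 matches summary:    ", isTRUE(all.equal(r2_adj_sim, s_sim$adj.r.squared)), "\n")
cat("adj. MSE equals sigma^2:    ", isTRUE(all.equal(mse_adj_sim, s_sim$sigma^2)), "\n")
```

The adjusted MSE coincides with the squared residual standard error of `lm`, since both divide the residual sum of squares by $n-p$.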

We employ two different specifications for prediction:

+ Basic Model: $X$ consists of a set of raw regressors (e.g. gender, experience, education indicators, occupation and industry indicators, regional indicators).

+ Flexible Model: $X$ consists of all raw regressors from the basic model (without sex) plus transformations of experience (e.g., $exp2$, $exp3$ and $exp4$) and all two-way interactions of these regressors.
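The flexible formula in the code wraps the regressors in `(...)**2`. In an R formula, `**` is read as `^`, so this expands to all main effects plus all pairwise interactions. The sketch below shows the expansion on a tiny made-up data frame (`toy_flex` and its values are hypothetical; only three regressors are used to keep the output short).

```r
# Tiny made-up data frame with three of the notebook's regressor names
toy_flex <- data.frame(
  lwage = c(2.1, 2.5, 3.0, 2.8, 3.2, 2.2),
  exp1  = c(1, 5, 10, 7, 12, 3),
  hsg   = c(0, 1, 1, 0, 1, 0),
  mw    = c(1, 0, 0, 1, 1, 0)
)

# Main effects plus all two-way interactions
f_pow  <- lwage ~ (exp1 + hsg + mw)^2
f_star <- lwage ~ (exp1 + hsg + mw)**2   # same formula written with **

X_toy <- model.matrix(f_pow, data = toy_flex)
print(colnames(X_toy))

# Both spellings produce the same set of terms
labels_pow  <- attr(terms(f_pow), "term.labels")
labels_star <- attr(terms(f_star), "term.labels")
print(identical(labels_pow, labels_star))
```

With 3 regressors the design matrix has 1 intercept, 3 main effects and 3 interactions. With many regressors and factor variables such as `occ2` and `ind2`, the number of columns grows quickly, which is why the flexible model is so much larger than the basic one.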

In [None]:
#i. Basic model
basic <-lwage ~ (sex + exp1 + shs + hsg + scl + clg + mw + so + we + occ2 + ind2)
regbasic <- lm(basic, data=data)
regbasic #estimated coefficients
cat("Number of regressors in the basic model:", length(regbasic$coef),'\n') #number of regressors in the basic model

#The basic model has 51 regressors

In [None]:
#ii.Flexible model
flex <- lwage ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2
regflex <- lm(flex, data=data)
regflex #estimated coefficients
cat( "Number of regressors in the flexible model:", length(regflex$coef)) # number of regressors in the flexible model
#The flexible model has 979 regressors

In [None]:
#Lasso model
library(hdm)
flex <- lwage ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2
lassoreg <- rlasso(flex, data=data)
sumlasso <- summary(lassoreg)

In [None]:
#Evaluating the sample R2 and MSE, and their adjusted versions
#Assess the predictive performance

sumbasic <- summary(regbasic)
sumflex <- summary(regflex)

#R-squared
R2.1 <- sumbasic$r.squared
cat("R.squared for the basic model:", R2.1, "\n")
R2.adj1 <- sumbasic$adj.r.squared
cat("adjusted R-squared for the basic model:", R2.adj1, "\n")

R2.2 <- sumflex$r.squared
cat("R-squared for the flexible model: ", R2.2, "\n")
R2.adj2 <- sumflex$adj.r.squared
cat("adjusted R-squared for the flexible model: ", R2.adj2, "\n")

R2.L <- sumlasso$r.squared
cat("R-squared for the lasso with flexible model: ", R2.L, "\n")
R2.adjL <- sumlasso$adj.r.squared
cat("adjusted R-squared for the lasso with flexible model: ", R2.adjL, "\n")

In [None]:
#calculating the MSE
MSE1 <- mean(sumbasic$res^2)
cat("MSE for the basic model:", MSE1, "\n")
p1 <- sumbasic$df[1] #number of regressors
MSE.adj1 <- (n/(n-p1))*MSE1
cat("adjusted MSE for the basic model:", MSE.adj1, "\n")

MSE2 <- mean(sumflex$res^2)
cat("MSE for the flexible model:", MSE2, "\n")
p2 <- sumflex$df[1]
MSE.adj2 <- (n/(n-p2))*MSE2
cat("adjusted MSE for the flexible model:", MSE.adj2, "\n")

MSEL <- mean(sumlasso$res^2)
cat("MSE for the lasso flexible model:", MSEL, "\n")
pL <- length(sumlasso$coef)
MSE.adjL <- (n/(n-pL))*MSEL
cat("adjusted MSE for the lasso flexible model:", MSE.adjL, "\n")

In [None]:
library(xtable)
table <- matrix(0,3,5)
table[1,1:5] <-c(p1,R2.1,MSE1,R2.adj1,MSE.adj1)
table[2,1:5] <-c(p2,R2.2,MSE2,R2.adj2,MSE.adj2)
table[3,1:5] <-c(pL,R2.L,MSEL,R2.adjL,MSE.adjL)
colnames(table)<- c("p","$R^2_{sample}$","$MSE_{sample}$","$R^2_{adjusted}$","$MSE_{Adjusted}$")
rownames(table)<- c("basic reg","flexible reg","lasso flex")
tab<- xtable(table, digits =c(0,0,2,2,2,2))
print(tab,type="latex") #type="latex" for printing table in latex
tab

Results: The model with the highest adjusted $R^2$ and the lowest adjusted MSE in the table above performs better in sample.

# DATA SPLITTING

Measure the prediction quality of the two models via data splitting:

+ Randomly split the data into one training sample and one testing sample. Here we just use a simple method (stratified splitting is a more sophisticated version of splitting that we could consider).

+ Use the training sample for estimating the parameters of the Basic Model and the Flexible Model.

+ Use the testing sample for evaluation. Predict the wage of every observation in the testing sample based on the estimated parameters in the training sample.

+ Calculate the Mean Squared Prediction Error $MSE_{test}$ based on the testing sample for both prediction models.

In [None]:
#splitting the data
set.seed(1) #to make the results replicable (generating random numbers)
random <- sample(1:n, floor(n*4/5))
length(random)


In [None]:
#use the floor((4/5)*n) row indices drawn above (without replacement) as the training sample
train <- data[random,] #training sample
test <-data[-random,] #testing sample
dim(test)

In [None]:
#basic model
#estimating the parameters in the training sample
regbasic <- lm(basic, data=train)
#calculating the out-of-sample-MSE
trainregbasic <- predict(regbasic, newdata=test)

y.test <- log(test$wage)
MSE.test1 <- sum((y.test-trainregbasic)^2)/length(y.test)
R2.test1 <- 1-MSE.test1/var(y.test)

cat("Test MSE for the basic model:", MSE.test1, "\n")
cat("Test R2 for the basic model:", R2.test1, "\n")


In [None]:
#flexible model
#estimating the parameters
#options (warn=-1)
regflex <- lm(flex, data=train)

#calculating the out-of-sample MSE
trainregflex <- predict(regflex, newdata=test)

y.test <- log(test$wage)
MSE.test2 <- sum((y.test-trainregflex)^2)/length(y.test)
R2.test2 <- 1-MSE.test2/var(y.test)

cat("Test MSE for the flexible model:", MSE.test2, "\n")
cat("Test R2 for the flexible model:", R2.test2, "\n")

In the flexible model, the discrepancy between $MSE_{test}$ and $MSE_{sample}$ is not large.

Note that $MSE_{test}$ varies across data splits. Hence, it is a good idea to average the out-of-sample MSE over several data splits to get more stable results.
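Averaging over splits could look like the sketch below. It uses simulated data rather than the wage data, and the names `sim`, `n_splits` and `mse_test` are hypothetical. Each split uses the same 4/5 training share as the notebook.

```r
set.seed(2024)

# Simulated data standing in for the wage data
n_sim <- 300
sim <- data.frame(x1 = rnorm(n_sim), x2 = rnorm(n_sim))
sim$y <- 1 + 0.5 * sim$x1 - 0.3 * sim$x2 + rnorm(n_sim)

n_splits <- 20                    # number of random train/test splits
mse_test <- numeric(n_splits)     # one test MSE per split

for (s in seq_len(n_splits)) {
  # Draw 4/5 of the row indices without replacement for training
  idx  <- sample(seq_len(n_sim), floor(n_sim * 4 / 5))
  fit  <- lm(y ~ x1 + x2, data = sim[idx, ])
  pred <- predict(fit, newdata = sim[-idx, ])
  mse_test[s] <- mean((sim$y[-idx] - pred)^2)
}

cat("Average test MSE over splits:", mean(mse_test), "\n")
cat("Std. dev. across splits:     ", sd(mse_test), "\n")
```

The standard deviation across splits gives a rough sense of how much a single split can move the reported test MSE.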

Nevertheless, we observe that, based on the out-of-sample MSE, the basic model using OLS regression performs about as well as (or slightly better than) the flexible model.

Next, let us use lasso regression in the flexible model instead of OLS regression. Lasso (least absolute shrinkage and selection operator) is a penalized regression method that can be used to reduce the complexity of a regression model when the number of regressors $p$ is large relative to $n$.
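To illustrate what "shrinkage and selection" means, consider the textbook special case of orthonormal regressors, where the lasso estimate is the soft-thresholded OLS coefficient. The sketch below applies that rule to hypothetical coefficients. It is only an illustration of the idea: it is not how `rlasso` computes its estimates, and the penalty `lambda` here is an arbitrary assumed value.

```r
# Soft-thresholding: shrink each coefficient toward zero by lambda,
# and set it exactly to zero if its absolute value is below lambda
soft_threshold <- function(b, lambda) {
  sign(b) * pmax(abs(b) - lambda, 0)
}

b_ols  <- c(2.0, -0.8, 0.3, -0.1)   # hypothetical OLS coefficients
lambda <- 0.5                       # assumed penalty level

b_lasso <- soft_threshold(b_ols, lambda)

print(b_lasso)
cat("Selected regressors:", sum(b_lasso != 0), "of", length(b_ols), "\n")
```

Large coefficients are shrunk but kept, while small ones are dropped entirely, which is how the penalty reduces model complexity.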

Note that the out-of-sample MSE on the test sample can be computed for any other black-box prediction method as well. Thus, let us finally compare the performance of lasso regression in the flexible model to OLS regression.

In [None]:
# flexible model using lasso

# estimating the parameters
library(hdm)
reglasso <- rlasso(flex, data=train, post=FALSE)

# calculating the out-of-sample MSE
trainreglasso<- predict(reglasso, newdata=test)
MSE.lasso <- sum((y.test-trainreglasso)^2)/length(y.test)
R2.lasso<- 1- MSE.lasso/var(y.test)


cat("Test MSE for the lasso on flexible model: ", MSE.lasso, "\n")

cat("Test R2 for the lasso flexible model: ", R2.lasso)

In [None]:
table2 <- matrix(0, 3,2)
table2[1,1]   <- MSE.test1
table2[2,1]   <- MSE.test2
table2[3,1]   <- MSE.lasso
table2[1,2]   <- R2.test1
table2[2,2]   <- R2.test2
table2[3,2]   <- R2.lasso

rownames(table2)<- c("basic reg","flexible reg","lasso regression")
colnames(table2)<- c("$MSE_{test}$", "$R^2_{test}$")
tab2 <- xtable(table2, digits =3)
tab2

Results: The basic model performs better out of sample.

In [None]:
print(tab2, type="latex") #type="latex" for printing table in latex

# TWO CASES OF PARTIALLING-OUT USING LASSO

CASE 1

In [None]:
library(hdm)
# For the basic model
basic.y <- lwage ~ (exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2) # Model for Y
basic.d <- sex ~ (exp1 + shs + hsg+ scl + clg + mw + so + we + occ2+ ind2) # Model for D

# Residuals
basic_rY <- rlasso(basic.y, data = data)$res
basic_rD <- rlasso(basic.d, data = data)$res

# regression of Y on D after partialling-out the effect of W
basic_partial.fit <- lm(basic_rY ~ basic_rD)
basic_partial.est <- summary(basic_partial.fit)$coef[2,1]

sum_basiclasso <- summary(basic_partial.fit)

cat("Coefficient for D via partialling-out",basic_partial.est)

Results: The gender gap estimated by partialling-out with lasso in the basic model differs from the gender gap estimated by OLS.

CASE 2

In [None]:
library(hdm)
# For the flexible model
flex.y <- lwage ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2 # Model for Y
flex.d <- sex ~ (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+occ2+ind2+mw+so+we)**2 # Model for D

# Residuals
flex_rY <- rlasso(flex.y, data = data)$res
flex_rD <- rlasso(flex.d, data = data)$res

# regression of Y on D after partialling-out the effect of W
flex_partial.fit <- lm(flex_rY ~ flex_rD)
flex_partial.est <- summary(flex_partial.fit)$coef[2,1]

sum_flexlasso <- summary(flex_partial.fit)

cat("Coefficient for D via partialling-out",flex_partial.est)

Results: The gender gap estimated by partialling-out with lasso in the flexible model differs from the gender gap estimated by OLS.
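For reference, when OLS is used in both partialling-out steps, the residual-on-residual regression reproduces the coefficient on D from the full regression exactly (the Frisch-Waugh-Lovell result). The sketch below checks this on simulated data; the variables `w_sim`, `d_sim` and `y_sim` and the coefficient values are assumptions for illustration. A gap between the lasso-based and OLS-based estimates therefore likely reflects the lasso first stage rather than the partialling-out step itself.

```r
set.seed(42)

# Simulated data: w is a control, d a binary "treatment", y the outcome
n_sim <- 500
w_sim <- rnorm(n_sim)
d_sim <- as.numeric(0.5 * w_sim + rnorm(n_sim) > 0)
y_sim <- 1 - 0.2 * d_sim + 0.7 * w_sim + rnorm(n_sim)

# Full OLS regression of y on d and w
full_fit <- lm(y_sim ~ d_sim + w_sim)

# Partialling-out with OLS: residualize y and d on w, then regress residuals
r_y <- resid(lm(y_sim ~ w_sim))
r_d <- resid(lm(d_sim ~ w_sim))
partial_fit <- lm(r_y ~ r_d)

b_full    <- unname(coef(full_fit)["d_sim"])
b_partial <- unname(coef(partial_fit)["r_d"])

cat("Coefficient on d, full regression:", b_full, "\n")
cat("Coefficient on d, partialling-out:", b_partial, "\n")
```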