Please first install the required packages:

In [None]:
install.packages(c("xtable"))

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



## Introduction

An important question in labour economics is what determines the wage of workers. This is a causal question, but we can begin to investigate it from a predictive perspective.

In the following wage example, $Y$ is the hourly wage of a worker and $X$ is a vector of the worker's characteristics, e.g., education, experience, and gender. The two main questions here are:

* How can we use job-relevant characteristics, such as education and experience, to best predict wages?

* What is the difference in predicted wages between men and women with the same job-relevant characteristics?

In this lab, we focus on the prediction question first.

## Data


The data set we consider is from the 2015 March Supplement of the U.S. Current Population Survey. We select white non-Hispanic individuals, aged 25 to 64 years, and working more than 35 hours per week for at least 50 weeks of the year. We exclude self-employed workers; individuals living in group quarters; individuals in the military, agricultural or private household sectors; individuals with inconsistent reports on earnings and employment status; individuals with allocated or missing information in any of the variables used in the analysis; and individuals with an hourly wage below \$3.

The variable of interest $Y$ (or $\log(Y)$) is the hourly wage rate constructed as the ratio of the annual earnings to the total number of hours worked, which is constructed in turn as the product of number of weeks worked and the usual number of hours worked per week. In our analysis, we also focus on single (never married) workers.
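As a small illustration of this construction, the sketch below computes an hourly wage and its logarithm for one worker. The variable names (`annual_earnings`, `weeks_worked`, `usual_hours_per_week`) and the numbers are hypothetical and chosen for illustration only; they are not CPS field names.

```r
# Hypothetical inputs for one worker (illustrative values, not CPS fields)
annual_earnings <- 20000        # annual earnings in dollars
weeks_worked <- 52              # number of weeks worked in the year
usual_hours_per_week <- 40      # usual number of hours worked per week

# Total hours worked = weeks worked * usual hours per week
total_hours <- weeks_worked * usual_hours_per_week

# Hourly wage = annual earnings / total hours worked
wage <- annual_earnings / total_hours
lwage <- log(wage)              # log hourly wage

print(c(wage = wage, lwage = lwage))
```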

## Data analysis

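**Exercise 1:** Load the data set. To do this, upload the file "wage2015_subsample_inference.Rdata", which you can find on our [webpage](https://maramattes.github.io/CML-HHU/), and run the following code. The path passed to `load()` must match the name of the uploaded file exactly, including the capitalization of the extension, because file names are case-sensitive on Linux systems.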

In [None]:
load("/content/wage2015_subsample_inference.rdata")
head(data)

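| | wage | lwage | sex | shs | hsg | scl | clg | ad | mw | so | we | ne | exp1 | exp2 | exp3 | exp4 | occ | occ2 | ind | ind2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` |
| 10 | 9.615385 | 2.263364 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 7 | 0.49 | 0.343 | 0.2401 | 3600 | 11 | 8370 | 18 |
| 12 | 48.076923 | 3.872802 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 31 | 9.61 | 29.791 | 92.3521 | 3050 | 10 | 5070 | 9 |
| 15 | 11.057692 | 2.403126 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 18 | 3.24 | 5.832 | 10.4976 | 6260 | 19 | 770 | 4 |
| 18 | 13.942308 | 2.634928 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 25 | 6.25 | 15.625 | 39.0625 | 420 | 1 | 6990 | 12 |
| 19 | 28.846154 | 3.361977 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 22 | 4.84 | 10.648 | 23.4256 | 2015 | 6 | 9470 | 22 |
| 30 | 11.730769 | 2.462215 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0.01 | 0.001 | 0.0001 | 1650 | 5 | 7460 | 14 |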


**Exercise 2:** Let's have a look at the structure of the data. How many observations do we have? What is the mean hourly wage? What are the names of the covariates? Also provide the sample mean of the covariates. What is the share of female workers in our sample?
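One possible way to approach these questions is sketched below. So that it runs without the data file, it uses a toy data frame (the hypothetical name `toy`) holding a few columns of the six rows printed by `head(data)`; on the full data set, `toy` would be replaced by `data`. The sketch assumes that `sex == 1` codes female workers, which should be checked against the data documentation.

```r
# Toy data: a few columns of the six rows printed by head(data)
toy <- data.frame(
  wage  = c(9.615385, 48.076923, 11.057692, 13.942308, 28.846154, 11.730769),
  lwage = c(2.263364, 3.872802, 2.403126, 2.634928, 3.361977, 2.462215),
  sex   = c(1, 0, 0, 1, 1, 1),
  clg   = c(1, 1, 0, 0, 1, 1),
  exp1  = c(7, 31, 18, 25, 22, 1)
)

str(toy)                            # structure: variable names and types
n_obs <- nrow(toy)                  # number of observations
mean_wage <- mean(toy$wage)         # mean hourly wage

# Covariates: every column except the two wage columns
covariates <- setdiff(names(toy), c("wage", "lwage"))

# colMeans() needs numeric columns; factor columns (e.g. occ2 in the full
# data) would have to be dropped or expanded into dummies first
cov_means <- colMeans(toy[, covariates])

# Assumption: sex == 1 codes female workers
share_female <- mean(toy$sex == 1)

print(list(n = n_obs, mean_wage = mean_wage,
           covariate_means = cov_means, share_female = share_female))
```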

**Exercise 3:** Construct the output variable $\log(Y)$ and the covariate matrix $Z$ which includes the characteristics of workers that are given in the data.
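A minimal sketch of this construction, again on a small toy data frame (the hypothetical name `toy`, with values copied from the first three rows printed by `head(data)`), could look as follows. On the full data set, `toy` would be replaced by `data`.

```r
# Toy data: three rows and four columns copied from the head(data) output
toy <- data.frame(
  wage  = c(9.615385, 48.076923, 11.057692),
  lwage = c(2.263364, 3.872802, 2.403126),
  sex   = c(1, 0, 0),
  exp1  = c(7, 31, 18)
)

# Outcome: logarithm of the hourly wage
Y <- log(toy$wage)

# Covariates: all columns except the wage and the log wage
Z <- toy[, !(names(toy) %in% c("wage", "lwage")), drop = FALSE]

print(Y)
print(Z)
```

The computed `Y` agrees with the `lwage` column up to rounding, which suggests that `lwage` already stores $\log(Y)$.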

## Prediction Question

Now, we will construct a prediction rule for the logarithm of hourly wage $\log(Y)$, which depends linearly on job-relevant characteristics $X$ (log-linear model):

\begin{equation}
\log(Y) = \beta'X+ \epsilon.
\end{equation}

Our goals are

* Predict wages using various characteristics of workers.

* Assess the predictive performance of a given model using the (adjusted) sample MSE, the (adjusted) sample $R^2$ and the out-of-sample MSE and $R^2$.
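For reference, one common set of definitions of these measures is written out below; the course material may use slightly different degrees-of-freedom conventions. Here $n$ is the number of observations in the estimation sample, $p$ the number of estimated coefficients (including the intercept), $\hat{\beta}$ the OLS estimate, and $m$ the number of observations in a separate testing sample. The adjusted $R^2$ is given in the form that R's `summary()` reports for a linear model with an intercept.

\begin{equation}
\begin{aligned}
MSE_{sample} &= \frac{1}{n}\sum_{i=1}^{n}\left(\log(Y_i) - \hat{\beta}'X_i\right)^2, \\
R^2_{sample} &= 1 - \frac{MSE_{sample}}{\frac{1}{n}\sum_{i=1}^{n}\left(\log(Y_i) - \overline{\log(Y)}\right)^2}, \\
MSE_{adjusted} &= \frac{n}{n-p}\,MSE_{sample}, \\
R^2_{adjusted} &= 1 - \frac{n-1}{n-p}\left(1 - R^2_{sample}\right), \\
MSE_{test} &= \frac{1}{m}\sum_{k \in \text{test}}\left(\log(Y_k) - \hat{\beta}'X_k\right)^2 .
\end{aligned}
\end{equation}

In $MSE_{test}$, $\hat{\beta}$ is estimated on the training sample only.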


We employ two different specifications for prediction:


1. Basic Model: $X$ consists of a set of raw regressors (e.g., gender, experience, education indicators, occupation and industry indicators, and regional indicators).


2. Flexible Model: $X$ consists of all raw regressors from the basic model plus a dictionary of transformations (e.g., $\mathtt{exp}^2$ and $\mathtt{exp}^3$) and additional two-way interactions of a polynomial in experience with other regressors. An example of a regressor created through a two-way interaction is *experience* times the indicator of having a *college degree*.

Using the **Flexible Model** lets us approximate the true relationship with a richer regression function and thereby reduce the bias, because it widens the range of shapes that the estimated regression function can take. Flexible models often deliver higher prediction accuracy but are harder to interpret.

In R, the models are given by:

In [None]:
basic <- lwage ~ (sex + exp1 + shs + hsg + scl + clg + mw + so + we + occ2 + ind2) # basic model
flex <- lwage ~ sex + (exp1 + exp2 + exp3 + exp4) * (shs + hsg + scl + clg + mw + so + we + occ2 + ind2) # flexible model
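To see what the `*` operator does in the flexible formula, the sketch below expands both formulas into their terms with `terms()`. This works symbolically and does not need the data. The object names `basic_terms` and `flex_terms` are introduced here for illustration.

```r
# The two formulas from the cell above, repeated so that this sketch runs on its own
basic <- lwage ~ (sex + exp1 + shs + hsg + scl + clg + mw + so + we + occ2 + ind2)
flex <- lwage ~ sex + (exp1 + exp2 + exp3 + exp4) *
  (shs + hsg + scl + clg + mw + so + we + occ2 + ind2)

# Expand each formula into its individual terms
basic_terms <- attr(terms(basic), "term.labels")
flex_terms <- attr(terms(flex), "term.labels")

length(basic_terms)   # 11 main effects
length(flex_terms)    # 50 terms: 14 main effects + 4 * 9 = 36 interactions

# a * b expands to a + b + a:b, so every experience term is interacted
# with every regressor in the second bracket, e.g. exp1:clg
print(flex_terms[grepl(":", flex_terms)])
```

Note that these are formula terms, not columns of the design matrix: a factor such as `occ2` or `ind2` contributes one dummy per level (minus the reference level), so the number of fitted coefficients is larger than the number of terms.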

**Exercise 4:** Fit both models to our data by running ordinary least squares (OLS). *Hint: `lm()`*. How many covariates do we have in the basic model and in the flexible model, respectively?
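The mechanics can be sketched on simulated data, so that the example runs without the data file. Everything about the simulation (the data frame `sim`, the coefficient values, and the reduced formulas `basic_sim` and `flex_sim`) is an assumption made for illustration; on the real data one would call `lm(basic, data = data)` and `lm(flex, data = data)` instead.

```r
set.seed(1)
n <- 500

# Simulated stand-in for the wage data (illustrative values only)
sim <- data.frame(
  sex  = rbinom(n, 1, 0.5),
  clg  = rbinom(n, 1, 0.4),
  exp1 = runif(n, 0, 40)
)
sim$exp2 <- sim$exp1^2 / 100   # same scaling as exp2 in the printed data
sim$lwage <- 2.5 - 0.1 * sim$sex + 0.4 * sim$clg +
  0.03 * sim$exp1 - 0.05 * sim$exp2 + rnorm(n, sd = 0.5)

# Reduced versions of the basic and flexible specifications
basic_sim <- lwage ~ sex + exp1 + clg
flex_sim <- lwage ~ sex + (exp1 + exp2) * clg

# Fit both models by OLS
fit_basic <- lm(basic_sim, data = sim)
fit_flex <- lm(flex_sim, data = sim)

# Number of estimated coefficients, including the intercept
p_basic <- length(coef(fit_basic))   # 4
p_flex <- length(coef(fit_flex))     # 7
print(c(basic = p_basic, flexible = p_flex))
```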

#### Evaluating the predictive performance of the basic and flexible models



**Exercise 5:** Now, evaluate the performance of both models based on the $R^2_{sample}$ and the $MSE_{sample}$. Which model performs better?
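A possible way to collect the in-sample measures is sketched below on simulated data. The data frame `sim`, the helper `in_sample_metrics()` and the reduced formulas are hypothetical; the adjusted MSE uses the rescaling $n/(n-p)$, which is one common convention and an assumption here.

```r
set.seed(1)
n <- 500

# Simulated stand-in for the wage data (illustrative values only)
sim <- data.frame(
  sex  = rbinom(n, 1, 0.5),
  clg  = rbinom(n, 1, 0.4),
  exp1 = runif(n, 0, 40)
)
sim$exp2 <- sim$exp1^2 / 100
sim$lwage <- 2.5 - 0.1 * sim$sex + 0.4 * sim$clg +
  0.03 * sim$exp1 - 0.05 * sim$exp2 + rnorm(n, sd = 0.5)

# Hypothetical helper: in-sample fit measures of a fitted lm object
in_sample_metrics <- function(fit) {
  s <- summary(fit)
  n_fit <- length(residuals(fit))     # number of observations used
  p <- length(coef(fit))              # number of coefficients incl. intercept
  mse <- mean(residuals(fit)^2)       # in-sample MSE
  c(p = p,
    R2 = s$r.squared,
    R2_adj = s$adj.r.squared,         # as reported by summary.lm
    MSE = mse,
    MSE_adj = n_fit / (n_fit - p) * mse)
}

metrics <- rbind(
  basic = in_sample_metrics(lm(lwage ~ sex + exp1 + clg, data = sim)),
  flexible = in_sample_metrics(lm(lwage ~ sex + (exp1 + exp2) * clg, data = sim))
)
print(round(metrics, 4))
```

Because the basic specification is nested in the flexible one, the unadjusted $R^2_{sample}$ cannot decrease and the unadjusted $MSE_{sample}$ cannot increase when moving to the flexible model. The adjusted versions are therefore the more informative in-sample comparison.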

## Data Splitting

Next, we would like to apply **data splitting** as a more general procedure to deal with potential overfitting when $p/n$ is not small, where $p$ is the number of regressors and $n$ is the number of observations.

**Exercise 6:** Measure the prediction quality of the two models via data splitting:

- Randomly split the data into one training sample and one testing sample. Use simple random splitting here (stratified splitting is a more sophisticated alternative that we might consider). *Hint: `sample()`*
- Use the training sample to estimate the parameters of the Basic Model and the Flexible Model.
- Use the testing sample for evaluation. Predict the (log) $\mathtt{wage}$  of every observation in the testing sample based on the estimated parameters in the training sample.
- Calculate the Mean Squared Prediction Error $MSE_{test}$ based on the testing sample for both prediction models.
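The steps above can be sketched as follows on simulated data. The data frame `sim`, the helper `test_mse()`, the reduced formulas and the 80/20 split are all assumptions made for illustration; the exercise does not prescribe a split ratio.

```r
set.seed(1)
n <- 500

# Simulated stand-in for the wage data (illustrative values only)
sim <- data.frame(
  sex  = rbinom(n, 1, 0.5),
  clg  = rbinom(n, 1, 0.4),
  exp1 = runif(n, 0, 40)
)
sim$exp2 <- sim$exp1^2 / 100
sim$lwage <- 2.5 - 0.1 * sim$sex + 0.4 * sim$clg +
  0.03 * sim$exp1 - 0.05 * sim$exp2 + rnorm(n, sd = 0.5)

# Step 1: simple random split (assumption: 80% training, 20% testing)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- sim[train_idx, ]
test <- sim[-train_idx, ]

# Steps 2-4: fit on the training sample, predict on the testing sample,
# and average the squared prediction errors
test_mse <- function(formula) {
  fit <- lm(formula, data = train)
  pred <- predict(fit, newdata = test)
  mean((test$lwage - pred)^2)
}

mse_test_basic <- test_mse(lwage ~ sex + exp1 + clg)
mse_test_flex <- test_mse(lwage ~ sex + (exp1 + exp2) * clg)
print(c(basic = mse_test_basic, flexible = mse_test_flex))
```

With factor regressors such as `occ2` or `ind2`, a level that appears only in the testing sample makes `predict()` fail, so it is worth checking that every level is represented in the training sample.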