

This notebook contains an example for teaching.


# A Simple Case Study using Wage Data from 2015 - continued

So far, we have considered several machine learning methods, e.g., Lasso and Random Forests, to build a predictive model. In this lab, we extend our toolbox by predicting wages with a neural network.

## Data preparation

Again, we consider data from the U.S. March Supplement of the Current Population Survey (CPS) in 2015.

In [None]:
load("../input/wage2015-inference/wage2015_subsample_inference.Rdata")
Z <- subset(data,select=-c(lwage,wage)) # regressors

First, we split the data into a training sample and a test sample and normalize it.

In [None]:
set.seed(1234)
training <- sample(nrow(data), nrow(data)*(3/4), replace=FALSE)

data_train <- data[training,1:16]
data_test <- data[-training,1:16]

# data_train <- data[training,]
# data_test <- data[-training,]
# X_basic <-  "sex + exp1 + exp2+ shs + hsg+ scl + clg + mw + so + we + occ2+ ind2"
# formula_basic <- as.formula(paste("lwage", "~", X_basic))
# model_X_basic_train <- model.matrix(formula_basic,data_train)[,-1]
# model_X_basic_test <- model.matrix(formula_basic,data_test)[,-1]
# data_train <- as.data.frame(cbind(data_train$lwage,model_X_basic_train))
# data_test <- as.data.frame(cbind(data_test$lwage,model_X_basic_test))
# colnames(data_train)[1]<-'lwage'
# colnames(data_test)[1]<-'lwage'
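
The split logic can be checked on a small index set. The following is a minimal sketch with hypothetical names (`n_toy`, `train_idx`, `test_idx`) that are not part of the lab; it assumes the same 3/4 share and makes the rounding explicit with `floor()`, which matches how `sample()` truncates a non-integer size.

```r
set.seed(1234)

n_toy <- 10                                   # hypothetical number of rows
toy <- data.frame(id = seq_len(n_toy))

# Draw 3/4 of the row indices without replacement (floor(7.5) = 7 rows)
train_idx <- sample(n_toy, floor(n_toy * 3 / 4), replace = FALSE)

# Negative indexing keeps exactly the rows that were not drawn
toy_train <- toy[train_idx, , drop = FALSE]
toy_test  <- toy[-train_idx, , drop = FALSE]
test_idx  <- setdiff(seq_len(n_toy), train_idx)

print(nrow(toy_train))                        # 7
print(nrow(toy_test))                         # 3
print(length(intersect(train_idx, test_idx))) # 0, the two samples are disjoint
```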


In [None]:
# normalize the data using the training-sample means and standard deviations
# (names chosen so that base::mean is not masked)
train_mean <- apply(data_train, 2, mean)
train_sd <- apply(data_train, 2, sd)
data_train <- scale(data_train, center = train_mean, scale = train_sd)
data_test <- scale(data_test, center = train_mean, scale = train_sd)
data_train <- as.data.frame(data_train)
data_test <- as.data.frame(data_test)
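
Note that the test sample is scaled with the training-sample statistics, so no information from the test data enters the preprocessing. A minimal sketch on toy data illustrates the effect; the data frames and the names `toy_mean` and `toy_sd` are hypothetical and only serve the illustration.

```r
# Toy data (hypothetical values, not the CPS sample)
toy_train <- data.frame(x = c(1, 2, 3, 4), y = c(10, 20, 30, 40))
toy_test  <- data.frame(x = c(5, 6), y = c(50, 60))

# Column means and standard deviations come from the training rows only
toy_mean <- apply(toy_train, 2, mean)
toy_sd   <- apply(toy_train, 2, sd)

toy_train_s <- as.data.frame(scale(toy_train, center = toy_mean, scale = toy_sd))
toy_test_s  <- as.data.frame(scale(toy_test,  center = toy_mean, scale = toy_sd))

# Training columns now have mean 0 and standard deviation 1 (up to rounding)
print(round(colMeans(toy_train_s), 10))
print(apply(toy_train_s, 2, sd))

# Test columns generally do not, because they reuse the training statistics
print(colMeans(toy_test_s))
```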

Then, we construct the inputs for our network.

In [None]:
X_basic <-  "sex + exp1 + shs + hsg+ scl + clg + mw + so + we"
formula_basic <- as.formula(paste("lwage", "~", X_basic))
model_X_basic_train <- model.matrix(formula_basic,data_train)
model_X_basic_test <- model.matrix(formula_basic,data_test)

Y_train <- data_train$lwage
Y_test <- data_test$lwage
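
Since the formula does not drop the intercept, `model.matrix` returns a leading column of ones in addition to the regressors, so the network receives one more input than there are variables in `X_basic`. A small sketch with a hypothetical toy data frame shows this behaviour.

```r
# Toy data (hypothetical values) with two regressors
toy <- data.frame(lwage = c(2.5, 3.0, 2.8),
                  sex   = c(0, 1, 1),
                  exp1  = c(5, 10, 7))

X_toy <- model.matrix(lwage ~ sex + exp1, toy)

print(colnames(X_toy))  # "(Intercept)" "sex" "exp1"
print(dim(X_toy))       # 3 rows, 3 columns: intercept + 2 regressors
```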

## Neural Networks

First, we need to determine the structure of our network. We are using the R package *keras* to build a simple sequential neural network with three dense layers.

In [None]:
library(keras)

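build_model <- function() {
  # two hidden ReLU layers followed by a linear output unit
  model <- keras_model_sequential() %>%
    layer_dense(units = 20, activation = "relu",
                input_shape = dim(model_X_basic_train)[2]) %>%
    layer_dense(units = 10, activation = "relu") %>%
    layer_dense(units = 1)

  # compile() returns the model, so build_model() returns the compiled network
  model %>% compile(
    # note: recent keras releases name this argument learning_rate instead of lr
    optimizer = optimizer_adam(lr = 0.005),
    loss = "mse",
    metrics = c("mae")
  )
}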

Let us have a look at the structure of our network in detail.

In [None]:
model <- build_model()
summary(model)

It is worth noting that the network has $441$ trainable parameters in total.
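
This count can be verified by hand: a dense layer with $n_{in}$ inputs and $n_{out}$ units has $n_{in} \cdot n_{out}$ weights plus $n_{out}$ biases. The sketch below assumes that `model_X_basic_train` has 10 columns (the intercept column plus the nine regressors in `X_basic`); the helper `dense_params` is a hypothetical name introduced only for this check.

```r
# Parameters of a dense layer: one weight per input-unit pair plus one bias per unit
dense_params <- function(n_in, n_out) n_in * n_out + n_out

n_inputs    <- 10            # assumed: intercept column + 9 regressors
layer_sizes <- c(20, 10, 1)  # units in the three dense layers

# Number of inputs feeding each layer: the data, then the previous layer's units
n_in <- c(n_inputs, head(layer_sizes, -1))

params <- dense_params(n_in, layer_sizes)
print(params)       # 220, 210, 11
print(sum(params))  # 441
```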

Now, let us train the network. Training takes some computation time, so we use a GPU to speed it up. The exact speed-up depends on a number of factors, including the model architecture, the batch size, and the complexity of the input pipeline.

In [None]:
# training the network 
num_epochs <- 1000
model %>% fit(model_X_basic_train, Y_train,
                    epochs = num_epochs, batch_size = 100, verbose = 0)

After training the neural network, we can evaluate the performance of our model on the test sample.

In [None]:
# evaluating the performance
model %>% evaluate(model_X_basic_test, Y_test, verbose = 0)

In [None]:
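# Calculating the performance measures
pred.nn <- model %>% predict(model_X_basic_test)
# intercept-only regression of the squared errors: estimate (MSE) and its standard error
MSE.nn <- summary(lm((Y_test - pred.nn)^2 ~ 1))$coef[1:2]
# out-of-sample R^2 relative to the variance of the test outcome
R2.nn <- 1 - MSE.nn[1] / var(Y_test)
# printing R^2
cat("R^2 of the neural network:", R2.nn)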
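
The intercept-only regression used above is a compact way to obtain both the test MSE and its standard error. A minimal sketch with hypothetical toy outcomes and predictions shows that the two coefficients coincide with the sample mean of the squared errors and its standard error.

```r
# Toy outcomes and predictions (hypothetical values)
y_toy    <- c(1.0, 2.0, 3.0, 4.0, 5.0)
pred_toy <- c(1.1, 1.9, 3.2, 3.7, 5.1)

sq_err <- (y_toy - pred_toy)^2

# Intercept-only regression: first entry is the estimate, second its standard error
fit <- summary(lm(sq_err ~ 1))$coef[1:2]
mse <- fit[1]
se  <- fit[2]

# The same quantities computed directly
mse_direct <- mean(sq_err)
se_direct  <- sd(sq_err) / sqrt(length(sq_err))

# Out-of-sample R^2, computed the same way as in the cell above
r2 <- 1 - mse / var(y_toy)

print(c(mse = mse, mse_direct = mse_direct))
print(c(se = se, se_direct = se_direct))
print(r2)
```

Note that `var()` divides by $n-1$ while the MSE divides by $n$, so this $R^2$ differs slightly from a version that uses the same denominator in both terms.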