# THE Winter School

# **Machine Learning for Prediction**

**Author:**
[Anthony Strittmatter](http://www.anthonystrittmatter.com)


We estimate the hedonic prices of used cars. For this purpose, we web-scrape data from the online auction platform *MyLemons*. We restrict the sample to BMW 320 series, Opel Astra, Mercedes C-class, VW Golf, and VW Passat. We select used cars with a mileage between 10,000 and 200,000 km and an age between 1 and 20 years. The data is stored in the file *used_cars.csv*.

We observe the following variables:


|Variable name| Description|
|:----|:----|
|**Outcome variable** ||
|*first_price*| First asking price in 1,000 CHF |
|**Covariates**| |
|*bmw_320, opel_astra, mercedes_c, vw_golf, vw_passat*| Dummies for the car make and model|
|*mileage*| Mileage of the used car (in 1,000 km)|
|*age_car_years*| Age of the used car (in years)|
|*mileage2, mileage3, mileage4, age_car_years2, age_car_years3, age_car_years4*| Squared, cubic, and quartic terms of *mileage* and *age_car_years* |
|*diesel*| Dummy for diesel engines |
|*private_seller*| Dummy for private seller (as opposed to professional used car sellers) |
|*other_car_owner*| Number of previous car owners |
|*guarantee*| Dummy indicating that the seller offers a guarantee for the used car|
|*maintenance_cert*| Dummy indicating that the seller has a complete maintenance certificate for the used car|
|*pm_green*| Dummy indicating that the used car has low particulate matter emissions|
|*co2_em*| CO2 emissions (in g/km)|
|*page_title* | Text in the title of the used car offer |
|*dur_next_ins_0*| Dummy indicating that the duration until the next general inspection is less than one year |
|*dur_next_ins_1_2*| Dummy indicating that the duration until the next general inspection is between 1 and 2 years |
|*new_inspection*| Dummy indicating that the used car has a new general inspection |
|*euro_1, euro_2, euro_3, euro_4, euro_5, euro_6*| Dummies for EURO emission norms |

## Load Packages

In [None]:
########################  Load Packages  ########################

# List of required packages
pkgs <- c('tidyverse','glmnet','corrplot','plotmo')

# Install packages that are not yet available, then load them
for(pkg in pkgs){
    if(!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
    library(pkg, character.only = TRUE)
}

print('All packages successfully installed and loaded.')

## Load Data Frame

We load the data frame and label the covariates. We select a subsample of 300 used cars in order to decrease the computation time while you are testing your code. We can use the entire sample of 104,721 used cars after we are finished with programming.

In [None]:
########################  Load Data Frame  ########################

# Load data frame
data_raw <- read.csv("Data/mylemon.csv",header=TRUE, sep=",")

# Selection of Subsample size, max. 104,721 observations
# Select smaller subsample to decrease computation time
set.seed(1001) # set starting value for random number generator
n_obs <- 300
df <- data_raw %>%
  dplyr::sample_n(n_obs)

print('Data successfully loaded.')

## Take Training and Test Sample 

We want to compare the relative predictive power of different estimation procedures based on the out-of-sample MSE and $R^2$. For this purpose, we create a hold-out sample. Additionally, we generate 100 random variables which are unrelated to the used-car prices. These variables add noise to the estimation. Ideally, the Lasso approach should not select those variables.

In [None]:
########################  Take Hold-Out-Sample  ########################
set.seed(100219) # set starting value for random number generator

# Partition the sample
df_part <- modelr::resample_partition(df, c(obs = 0.8, hold_out = 0.2))
df_train <- as.data.frame(df_part$obs) # Training data
df_test <- as.data.frame(df_part$hold_out) # Test data

# Outcomes (column 2 of the data frame)
first_price_obs <- as.matrix(df_train[,2])
first_price_hold_out <- as.matrix(df_test[,2])

# Covariates/Features (column 3 onwards)
covariates_obs <- as.matrix(df_train[,c(3:ncol(df_train))])
covariates_hold_out <- as.matrix(df_test[,c(3:ncol(df_test))])

print('The data is now ready for your first analysis!')
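
The text above mentions 100 noise variables that are unrelated to the price. The cells shown here do not create them, so they are presumably already part of the data file. As a minimal sketch of how such variables could be generated, assuming standard normal draws, a made-up toy data frame `toy`, and the hypothetical column names `noise_1`, ..., `noise_100`:

```r
# Toy data frame standing in for the used-car data (made-up values)
set.seed(1)
toy <- data.frame(first_price = c(12.5, 8.0, 21.3, 15.1, 9.9),
                  mileage     = c(80, 150, 30, 60, 120))

# Draw 100 standard normal columns that are independent of the outcome
n_noise <- 100
noise <- matrix(rnorm(nrow(toy) * n_noise), nrow = nrow(toy), ncol = n_noise)
colnames(noise) <- paste0("noise_", seq_len(n_noise))

# Append the noise columns to the original variables
toy_noise <- cbind(toy, as.data.frame(noise))
dim(toy_noise)
```

Because the draws do not depend on `first_price`, any coefficient that an estimator assigns to a `noise_` column reflects overfitting rather than signal.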

## Correlation Matrix

In [None]:
########################  Correlation Matrix  ########################

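# Correlations among the training covariates (column 3 onwards of df_train)
corr <- cor(as.matrix(df_train[, 3:ncol(df_train)]))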
corrplot(corr, type = "upper", tl.col = "black")


# OLS

We estimate the used-car prices using an OLS model which includes all (relevant and irrelevant) covariates.

**Replace the parameters marked with question marks.**

In [None]:
########################  OLS Model  ######################## 

# Setup the formula of the linear regression model
#sumx <- paste(covariates, collapse = " + ")  
#linear <- paste("first_price_obs",paste(sumx, sep=" + "), sep=" ~ ")
#linear <- as.formula(linear)

# Setup the data for linear regression
#data <- as.data.frame(covariates_obs)

# Estimate OLS model
ols <- lm(first_price_obs ~., as.data.frame(covariates_obs))
summary(ols)
# Some variables might be dropped because of perfect collinearity (121 covariates - 240 observations)


Extrapolate fitted values to the hold-out sample.

**Replace the parameters marked with question marks.**

In [None]:
# In-sample fitted values
fit1_in <- predict.lm(ols)

# Out-of-sample fitted values
fit1_out <- predict.lm(ols, newdata = data.frame(covariates_hold_out))

# In-sample performance measures
mse1_in <- round(mean((first_price_obs - fit1_in)^2),digits=3)
rsquared_in <- round(1-mean((first_price_obs - fit1_in)^2)/mean((first_price_obs - mean(first_price_obs))^2),digits=3)
print(paste0("In-Sample MSE OLS: ", mse1_in))
print(paste0("In-Sample R-squared OLS: ", rsquared_in))

# Out-of-sample performance measures
mse1_out <- round(mean((first_price_hold_out - fit1_out)^2),digits=3)
rsquared_out <- round(1-mean((first_price_hold_out - fit1_out)^2)/mean((first_price_hold_out - mean(first_price_hold_out))^2),digits=3)
print(paste0("Out-of-Sample MSE OLS: ", mse1_out))
print(paste0("Out-of-Sample R-squared OLS: ", rsquared_out))


Evaluate the in- and out-of-sample performance using MSE and $R^2$.
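
The same two measures are computed repeatedly in this notebook, so it can help to see them in isolation. The following is a minimal sketch, not part of the original exercise: `perf_measures` is a hypothetical helper and the numbers are made up.

```r
# Hypothetical helper: MSE and R-squared of predictions y_hat for outcomes y
perf_measures <- function(y, y_hat) {
  y <- as.numeric(y)
  y_hat <- as.numeric(y_hat)
  mse <- mean((y - y_hat)^2)
  # R-squared relative to predicting the mean of y in the evaluated sample
  r2 <- 1 - mse / mean((y - mean(y))^2)
  c(mse = mse, rsquared = r2)
}

# Toy example with made-up numbers
y_toy     <- c(10, 12, 14, 16)
y_hat_toy <- c(11, 12, 13, 17)
round(perf_measures(y_toy, y_hat_toy), 3)
```

For the toy numbers this gives an MSE of 0.75 and an $R^2$ of 0.85. Note that the out-of-sample $R^2$ defined this way can become negative when the predictions are worse than the hold-out mean.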

# LASSO

## Standard Lasso

The LASSO minimises the objective function
\begin{equation*}
\min_{\beta} \left\{ \sum_{i=1}^{N} \left( Y_i-  \beta_0 -\sum_{j=1}^{p}X_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}.
\end{equation*}
First we have to find the optimal tuning parameter $\lambda$ via cross-validation (CV).
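
To make the objective function above concrete, it can be evaluated directly for given coefficients. This is a minimal sketch in base R with made-up data; `lasso_objective` is a hypothetical helper and is not how `glmnet` computes its solution.

```r
# Hypothetical helper: value of the LASSO objective as written above
lasso_objective <- function(beta0, beta, X, y, lambda) {
  resid <- y - beta0 - as.numeric(X %*% beta)
  sum(resid^2) + lambda * sum(abs(beta))   # the intercept is not penalised
}

# Toy data (made-up numbers)
X_toy <- cbind(x1 = c(1, 2, 3, 4), x2 = c(0, 1, 0, 1))
y_toy <- c(2, 4, 6, 8)

# With beta = (2, 0) the fit is perfect, so only the penalty remains
lasso_objective(beta0 = 0, beta = c(2, 0), X = X_toy, y = y_toy, lambda = 0.5)

# Setting all slopes to zero removes the penalty but raises the residual sum of squares
lasso_objective(beta0 = mean(y_toy), beta = c(0, 0), X = X_toy, y = y_toy, lambda = 0.5)
```

The first call returns 1 and the second returns 20. According to its documentation, `glmnet` scales the squared-error term by $1/(2N)$ for Gaussian models, so its $\lambda$ values are not on the same scale as in the formula above.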

**Replace the parameters marked with question marks.**

In [None]:
########################  CV-LASSO  ######################## 
p = 1 # 1 for LASSO, 0 for Ridge

set.seed(10101)
lasso.linear <- cv.glmnet(covariates_obs, first_price_obs, alpha=p, 
                          nlambda = 100, type.measure = 'mse')
# nlambda specifies the number of different lambda values on the grid (log-scale)
# type.measure specifies that the optimality criterion is the MSE in the CV samples

# Plot MSE in CV-Samples for different values of lambda
plot(lasso.linear)

# Optimal Lambda
print(paste0("Lambda minimising CV-MSE: ", round(lasso.linear$lambda.min,digits=8)))
# 1 standard error rule reduces the number of included covariates
print(paste0("Lambda 1 standard error rule: ", round(lasso.linear$lambda.1se,digits=8)))

# Number of Non-Zero Coefficients
print(paste0("Number of selected covariates (lambda.min): ",lasso.linear$glmnet.fit$df[lasso.linear$glmnet.fit$lambda==lasso.linear$lambda.min]))
print(paste0("Number of selected covariates (lambda.1se): ",lasso.linear$glmnet.fit$df[lasso.linear$glmnet.fit$lambda==lasso.linear$lambda.1se]))
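
The one-standard-error rule used above can be spelled out in a few lines. This is a minimal sketch with made-up CV results; `one_se_rule` is a hypothetical helper, and the arguments mirror the `lambda`, `cvm`, and `cvsd` components of a `cv.glmnet` object.

```r
# Hypothetical helper: choose lambda by the one-standard-error rule
# lambda: grid of penalty values, cvm: mean CV error, cvsd: its standard error
one_se_rule <- function(lambda, cvm, cvsd) {
  i_min <- which.min(cvm)
  threshold <- cvm[i_min] + cvsd[i_min]
  # largest lambda whose CV error is still within one SE of the minimum
  c(lambda.min = lambda[i_min], lambda.1se = max(lambda[cvm <= threshold]))
}

# Toy CV results (made-up numbers)
lambda_toy <- c(1.0, 0.5, 0.25, 0.125)
cvm_toy    <- c(30, 22, 20, 21)
cvsd_toy   <- c(3, 2.5, 2.5, 2.5)
one_se_rule(lambda_toy, cvm_toy, cvsd_toy)
```

In the toy example the CV error is minimised at 0.25, while 0.5 is the largest penalty whose error (22) stays below the threshold 20 + 2.5. Applied to the fitted object, `one_se_rule(lasso.linear$lambda, lasso.linear$cvm, lasso.linear$cvsd)` should reproduce `lambda.min` and `lambda.1se`, up to the handling of ties.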


## Plot Lasso Structure

In [None]:
########################  Visualisation of LASSO  ######################## 

glmcoef<-coef(lasso.linear,lasso.linear$lambda.1se)
coef.increase<-dimnames(glmcoef[glmcoef[,1]>0,0])[[1]]
coef.decrease<-dimnames(glmcoef[glmcoef[,1]<0,0])[[1]]

lambda_min =  lasso.linear$glmnet.fit$lambda[26]/lasso.linear$glmnet.fit$lambda[1]
set.seed(10101)
mod <- glmnet(covariates_obs, first_price_obs, lambda.min.ratio = lambda_min, alpha=p)
maxcoef<-coef(mod,s=lambda_min)
coef<-dimnames(maxcoef[maxcoef[,1]!=0,0])[[1]]
allnames<-dimnames(maxcoef[maxcoef[,1]!=0,0])[[1]][order(maxcoef[maxcoef[,1]!=0,ncol(maxcoef)],decreasing=TRUE)]
allnames<-setdiff(allnames,allnames[grep("Intercept",allnames)])

#assign colors
cols<-rep("gray",length(allnames))
cols[allnames %in% coef.increase]<-"red"   
cols[allnames %in% coef.decrease]<- "green"

plot_glmnet(mod,label=TRUE,s=lasso.linear$lambda.1se,col= cols)


## Plot Lasso Coefficients

**Replace the parameters marked with question marks.**

In [None]:
########################  Plot LASSO Coefficients  ########################

print('LASSO coefficients')

glmcoef<-coef(lasso.linear, lasso.linear$lambda.1se)
print(glmcoef)
# the LASSO coefficients are biased because of the penalty term
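
The shrinkage bias mentioned in the comment above can be seen in a textbook special case. Assuming orthonormal covariates ($X'X = I$) and no intercept, the minimiser of the objective stated earlier is the OLS coefficient soft-thresholded at $\lambda/2$. The sketch below uses made-up OLS coefficients and is only an illustration, not what `glmnet` computes on the used-car data.

```r
# Soft-thresholding operator: shrinks b towards zero by t and sets small values to zero
soft_threshold <- function(b, t) sign(b) * pmax(abs(b) - t, 0)

# Hypothetical OLS coefficients in an orthonormal design (made-up numbers)
b_ols  <- c(3, -1.5, 0.4, -0.2)
lambda <- 1

# LASSO solution for the objective above when X'X = I: threshold at lambda / 2
b_lasso <- soft_threshold(b_ols, lambda / 2)
b_lasso
```

The two large coefficients are pulled towards zero by 0.5 (to 2.5 and -1), and the two small ones are set exactly to zero. Selected coefficients are therefore systematically smaller in absolute value than their OLS counterparts.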


## In-Sample Performance Measures

**Replace the parameters marked with question marks.**

In [None]:
######################## In-Sample Performance of LASSO  ######################## 

# Estimate LASSO model 
# Use Lambda that minimises CV-MSE
# alpha = p keeps the penalty type consistent with the CV step (1 for LASSO, 0 for Ridge)
set.seed(10101)
lasso.fit.min <- glmnet(covariates_obs, first_price_obs, lambda = lasso.linear$lambda.min, alpha = p)
yhat.lasso.min <- predict(lasso.fit.min, covariates_obs)

# Use 1 standard error rule
set.seed(10101)
lasso.fit.1se <- glmnet(covariates_obs, first_price_obs, lambda = lasso.linear$lambda.1se, alpha = p)
yhat.lasso.1se <- predict(lasso.fit.1se, covariates_obs)

# In-sample performance measures
print(paste0("In-Sample MSE OLS: ", mse1_in))
print(paste0("In-Sample R-squared OLS: ", rsquared_in))

mse2_in <- round(mean((first_price_obs - yhat.lasso.min)^2),digits=3)
rsquared2_in <- round(1-mean((first_price_obs - yhat.lasso.min)^2)/mean((first_price_obs - mean(first_price_obs))^2),digits=3)
print(paste0("In-Sample MSE Lasso (lambda.min): ", mse2_in))
print(paste0("In-Sample R-squared Lasso (lambda.min): ", rsquared2_in))

mse3_in <- round(mean((first_price_obs - yhat.lasso.1se)^2),digits=3)
rsquared3_in <- round(1-mean((first_price_obs - yhat.lasso.1se)^2)/mean((first_price_obs - mean(first_price_obs))^2),digits=3)
print(paste0("In-Sample MSE Lasso (lambda.1se): ", mse3_in))
print(paste0("In-Sample R-squared Lasso (lambda.1se): ", rsquared3_in))


## Out-of-Sample Performance Measures

**Replace the parameters marked with question marks.**

In [None]:
######################## Out-of-Sample Performance of LASSO  ######################## 

# Extrapolate Lasso fitted values to hold-out-sample
yhat.lasso.min <- predict(lasso.fit.min, covariates_hold_out)
yhat.lasso.1se <- predict(lasso.fit.1se, covariates_hold_out)

# Out-of-sample performance measures
print(paste0("Out-of-Sample MSE OLS: ", mse1_out))
print(paste0("Out-of-Sample R-squared OLS: ", rsquared_out))

mse2_out <- round(mean((first_price_hold_out - yhat.lasso.min)^2),digits=3)
rsquared2_out <- round(1-mean((first_price_hold_out - yhat.lasso.min)^2)/mean((first_price_hold_out - mean(first_price_hold_out))^2),digits=3)
print(paste0("Out-of-Sample MSE Lasso (lambda.min): ", mse2_out))
print(paste0("Out-of-Sample R-squared Lasso (lambda.min): ", rsquared2_out))

mse3_out <- round(mean((first_price_hold_out - yhat.lasso.1se)^2),digits=3)
rsquared3_out <- round(1-mean((first_price_hold_out - yhat.lasso.1se)^2)/mean((first_price_hold_out - mean(first_price_hold_out))^2),digits=3)
print(paste0("Out-of-Sample MSE Lasso (lambda.1se): ", mse3_out))
print(paste0("Out-of-Sample R-squared Lasso (lambda.1se): ", rsquared3_out))


We could improve the performance of the LASSO prediction by adding more covariates (e.g., interactions). We can check the performance of the Ridge estimator by setting *p = 0*.

# Extra Exercises:

1. Estimate the Post-Lasso coefficients. Do they differ from the Lasso coefficients? Do the performances of the Lasso and Post-Lasso estimators differ?

2. Predict the used car prices using a Ridge instead of a Lasso model. Which estimator shows the better performance?

3. How do the results change when you increase the sample size to 104,721 observations?

4. Replace the outcome variable 'first_price' with the 'overprice' dummy. Fit a linear and logit Lasso model. How do the models differ from each other?