---
title: "STAT/MATH 495: Problem Set 10"
author: "Andrew and Albert Y. Kim"
date: "2017-11-28"
output:
  html_document:
    df_print: kable
    toc: true
    toc_depth: 3
    toc_float:
      collapsed: false
      smooth_scroll: false
---
```r
knitr::opts_chunk$set(echo = TRUE, fig.width = 8, fig.height = 4.5, message = FALSE, warning = FALSE)
set.seed(76)
```

# Collaboration

Please indicate who you collaborated with on this assignment:

# Setup

```r
library(tidyverse)
library(broom)
library(glmnet)
library(MLmetrics)

train <- read_csv("data/train.csv")
test <- read_csv("data/test.csv")
sample_submission <- read_csv("data/sample_submission.csv")

# Only use 150 observations to train the model!
set.seed(76)
train <- train %>% 
  mutate(log_price_doc = log(price_doc)) %>% 
  sample_n(150)

# Need a "dummy" outcome variable to make the model.matrix() code below work
test <- test %>% 
  mutate(log_price_doc = 1) 

# Model formula
model_formula <- as.formula("log_price_doc ~ full_sq + area_m + raion_popul + green_zone_part + indust_part + children_preschool + preschool_education_centers_raion + children_school + school_education_centers_raion + school_education_centers_top_20_raion + healthcare_centers_raion + university_top_20_raion + sport_objects_raion + additional_education_raion + culture_objects_top_25 + culture_objects_top_25_raion + shopping_centers_raion + office_raion + thermal_power_plant_raion + incineration_raion + oil_chemistry_raion + radiation_raion + railroad_terminal_raion + big_market_raion + nuclear_reactor_raion + detention_facility_raion + full_all + male_f + female_f + young_all + young_male + young_female + work_all + work_male + work_female + ekder_all + ekder_male + ekder_female + ID_metro + metro_min_avto + metro_km_avto + kindergarten_km + school_km + park_km + green_zone_km + industrial_km + water_treatment_km + cemetery_km + incineration_km + railroad_station_avto_km + railroad_station_avto_min + ID_railroad_station_avto + public_transport_station_km + public_transport_station_min_walk + water_km + water_1line + mkad_km + ttk_km + sadovoe_km + bulvar_ring_km + kremlin_km + big_road1_km + ID_big_road1 + big_road1_1line + big_road2_km + ID_big_road2 + railroad_km + railroad_1line + zd_vokzaly_avto_km + ID_railroad_terminal + bus_terminal_avto_km + ID_bus_terminal + oil_chemistry_km + nuclear_reactor_km + radiation_km + power_transmission_line_km + thermal_power_plant_km + ts_km + big_market_km + market_shop_km + fitness_km + swim_pool_km + ice_rink_km + stadium_km + basketball_km + hospice_morgue_km + detention_facility_km + public_healthcare_km + university_km + workplaces_km + shopping_centers_km + office_km + additional_education_km + preschool_km + big_church_km + church_synagogue_km + mosque_km + theater_km + museum_km + exhibition_km + catering_km + green_part_500 + prom_part_500 + office_count_500 + office_sqm_500 + trc_count_500 + trc_sqm_500") 

# Define predictor matrices (dropping the intercept column)
predictor_matrix_train <- model.matrix(model_formula, data = train)[, -1]
predictor_matrix_test <- model.matrix(model_formula, data = test)[, -1]
```

# Do work and create submission files

The first thing to note is that the Kaggle score is NOT the mean squared error, but rather the Root Mean Squared Logarithmic Error (go to the leaderboard and hover your mouse over the "?" next to the Score column). RMSLE measures "error" on a multiplicative rather than an additive scale (sensible since house prices vary over orders of magnitude) and adds 1 inside the logarithms to handle values of 0:

$$ \mbox{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(\widehat{y}_i + 1) - \log(y_i + 1) \right)^2} $$

Instead of programming this ourselves, let's use the built-in `MLmetrics::RMSLE()` function:

```r
library(MLmetrics)
?RMSLE
```
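As a sanity check that `RMSLE()` implements the formula above, we can compute it by hand on toy values (the numbers below are made up purely for illustration):

```r
# Hand-rolled RMSLE on toy values, following the formula above
y_true <- c(100, 200, 300)
y_pred <- c(110, 190, 330)
sqrt(mean((log(y_pred + 1) - log(y_true + 1))^2))

# The built-in function returns the same value
MLmetrics::RMSLE(y_pred, y_true)
```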

## LASSO Fits

We fit the LASSO models first. Note that:

- `glmnet()` can work either:
    1. by letting `glmnet()` choose the $\lambda$ values for you, or
    2. over a range of $\lambda$ values that you specify, input below as `lambda_inputs`.
- When choosing the optimal $\lambda$ to use for our predictions, we can choose either:
    1. the $\lambda^*$ that yields the lowest cross-validated error, or
    2. the $\lambda^*_{1SE}$ given by the one-standard-error rule.
- Before submitting, we undo the log-transformation via `exp()` to get predicted prices $\widehat{y}$, since the models were fit to the outcome variable $\log(y)$.
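Both candidate $\lambda$'s from the second bullet are stored on the object returned by `cv.glmnet()`; a minimal sketch, reusing the `predictor_matrix_train` and `train` objects defined in Setup:

```r
# Cross-validate over a default-chosen lambda grid, then read off both
# candidate optimal values from the returned object
LASSO_CV <- cv.glmnet(x = predictor_matrix_train, y = train$log_price_doc, alpha = 1)
LASSO_CV$lambda.min  # lambda*: lowest cross-validated error
LASSO_CV$lambda.1se  # lambda*_1SE: largest lambda within 1 SE of that minimum
```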

### 1. Solution without pre-specified `lambda_inputs` using $\lambda^*_{1SE}$

Fit/train model and get optimal $\lambda$:

```r
LASSO_fit <- glmnet(x = predictor_matrix_train, y = train$log_price_doc, alpha = 1)
LASSO_CV <- cv.glmnet(x = predictor_matrix_train, y = train$log_price_doc, alpha = 1)
lambda_star_1SE <- LASSO_CV$lambda.1se
```

Score model on training data by evaluating RMSLE directly:

```r
y_hat <- predict(LASSO_fit, newx = predictor_matrix_train, s = lambda_star_1SE) %>%
  as.vector() %>% 
  exp()
MLmetrics::RMSLE(y_hat, train$price_doc)
```

Score model on test data by submitting predictions to Kaggle and getting RMSLE from leaderboard:

```r
y_hat <- predict(LASSO_fit, newx = predictor_matrix_test, s = lambda_star_1SE) %>%
  as.vector() %>% 
  exp()
sample_submission$price_doc <- y_hat
write_csv(x = sample_submission, path = "data/LASSO_1_submission.csv")
```

### 2. Solution without pre-specified `lambda_inputs` using $\lambda^*$

Fit/train model and get optimal $\lambda$.

```r
LASSO_fit <- glmnet(x = predictor_matrix_train, y = train$log_price_doc, alpha = 1)
LASSO_CV <- cv.glmnet(x = predictor_matrix_train, y = train$log_price_doc, alpha = 1)
lambda_star_min <- LASSO_CV$lambda.min
```

Score model on training data by evaluating RMSLE directly:

```r
y_hat <- predict(LASSO_fit, newx = predictor_matrix_train, s = lambda_star_min) %>%
  as.vector() %>% 
  exp()
MLmetrics::RMSLE(y_hat, train$price_doc)
```

Score model on test data by submitting predictions to Kaggle and getting RMSLE from leaderboard:

```r
y_hat <- predict(LASSO_fit, newx = predictor_matrix_test, s = lambda_star_min) %>%
  as.vector() %>% 
  exp()
sample_submission$price_doc <- y_hat
write_csv(x = sample_submission, path = "data/LASSO_2_submission.csv")
```

### 3. Solution with pre-specified `lambda_inputs` using $\lambda^*_{1SE}$

Fit/train model and get optimal $\lambda$. Note we now specify our own $\lambda$ values in lambda_inputs.

```r
lambda_inputs <- 10^seq(-2, 10, length = 100)

LASSO_fit <- glmnet(x = predictor_matrix_train, y = train$log_price_doc, alpha = 1, lambda = lambda_inputs)
LASSO_CV <- cv.glmnet(x = predictor_matrix_train, y = train$log_price_doc, alpha = 1, lambda = lambda_inputs)
lambda_star_1SE <- LASSO_CV$lambda.1se
```

Score model on training data by evaluating RMSLE directly:

```r
y_hat <- predict(LASSO_fit, newx = predictor_matrix_train, s = lambda_star_1SE) %>%
  as.vector() %>% 
  exp()
MLmetrics::RMSLE(y_hat, train$price_doc)
```

Score model on test data by submitting predictions to Kaggle and getting RMSLE from leaderboard:

```r
y_hat <- predict(LASSO_fit, newx = predictor_matrix_test, s = lambda_star_1SE) %>%
  as.vector() %>% 
  exp()
sample_submission$price_doc <- y_hat
write_csv(x = sample_submission, path = "data/LASSO_3_submission.csv")
```

## Linear Model Fits

We fit the unregularized linear models next. Note that we can do this via either:

  1. `glmnet()`, by setting the shrinkage penalty to $\lambda = 0$ (i.e. `lambda = 0` when fitting and `s = 0` in `predict()`), since using no penalty in theory yields the same fit as `lm()`, or
  2. `lm()` directly.

In practice, however, there is some slight numerical variation between the two resulting fits, since `glmnet()` solves for the coefficients with an iterative coordinate-descent algorithm rather than in closed form.
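To see this variation concretely, here is a toy comparison on hypothetical simulated data (not the housing data): `lm()` and `glmnet()` with `lambda = 0` give nearly, but not exactly, identical coefficients.

```r
library(glmnet)
set.seed(76)

# Simulate a small toy dataset with known coefficients 1, 2, -1
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(50, sd = 0.5)

# Closed-form least-squares fit
coef(lm(y ~ x1 + x2, data = toy))

# glmnet fit with zero penalty: close, but not bit-for-bit identical
coef(glmnet(x = as.matrix(toy[, c("x1", "x2")]), y = toy$y, alpha = 1, lambda = 0))
```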

### 1. Solution using `glmnet()`

Fit/train the model:

```r
lm_fit <- glmnet(x = predictor_matrix_train, y = train$log_price_doc, alpha = 1, lambda = 0)
```

Score model on training data by evaluating RMSLE directly:

```r
y_hat <- predict(lm_fit, newx = predictor_matrix_train, s = 0) %>%
  as.vector() %>% 
  exp()
MLmetrics::RMSLE(y_hat, train$price_doc)
```

Score model on test data by submitting predictions to Kaggle and getting RMSLE from leaderboard:

```r
y_hat <- predict(lm_fit, newx = predictor_matrix_test, s = 0) %>%
  as.vector() %>% 
  exp()
sample_submission$price_doc <- y_hat
write_csv(x = sample_submission, path = "data/lm_1_submission.csv")
```

### 2. Solution using `lm()`

Fit/train model:

```r
lm_fit <- lm(model_formula, data = train)
```

Score model on training data by evaluating RMSLE directly:

```r
y_hat <- predict(lm_fit, newdata = train) %>%
  as.vector() %>% 
  exp()
MLmetrics::RMSLE(y_hat, train$price_doc)
```

Score model on test data by submitting predictions to Kaggle and getting RMSLE from leaderboard:

```r
y_hat <- predict(lm_fit, newdata = test) %>%
  as.vector() %>% 
  exp()
sample_submission$price_doc <- y_hat
write_csv(x = sample_submission, path = "data/lm_2_submission.csv")
```

# Scoreboard

Here are the Kaggle scores on the test data for the 5 models, where the "scoring mechanism" for the Russian Housing Kaggle competition was Root Mean Squared Logarithmic Error:

*(Image: screenshot of the Kaggle leaderboard scores for the five submissions.)*

Let's compare the:

  1. scores on the training data, which we computed above, and
  2. scores on the test data, which Kaggle returned in the image above.

| Method | Training data score | Test data score (from Kaggle) |
|---|---|---|
| LASSO w/o specified $\lambda$'s, using $\lambda^*_{1SE}$ | 0.57385 | 0.49069 |
| LASSO w/o specified $\lambda$'s, using $\lambda^*$ | 0.45451 | 0.41910 |
| LASSO w/ specified $\lambda$'s, using $\lambda^*_{1SE}$ | 0.54164 | 0.45162 |
| Linear model using `glmnet()` | 0.27666 | 1.00998 |
| Linear model using `lm()` | 0.25347 | 2.63413 |

While there is a little variation in the resulting RMSLEs, the main takeaways are that:

  1. For the 3 LASSO-regularized models, the training and test scores are of similar magnitude. In other words, these models are not overfit.
  2. For the 2 unregularized linear models, which use $p = 107$ predictors, the test RMSLEs are orders of magnitude higher than the training RMSLEs, indicating that they are overfit to the training data (which consist of only 150 observations) and hence do poorly on the test data.

Moral: regularization is particularly important to consider when $n \ll p$, i.e. when you have very few observations relative to the number of predictors.

# Extra: Variable/model selection

Using the `get_LASSO_coefficients()` function from the quickstart guide and the optimal $\lambda^*_{1SE}$ we computed earlier, we find that the cross-validated optimal LASSO model is:

```r
get_LASSO_coefficients <- function(LASSO_fit){
  coeff_values <- LASSO_fit %>% 
    broom::tidy() %>% 
    as_tibble() %>% 
    select(-c(step, dev.ratio)) %>% 
    tidyr::complete(lambda, nesting(term), fill = list(estimate = 0)) %>% 
    arrange(desc(lambda)) %>% 
    select(term, estimate, lambda)
  return(coeff_values)
}

lambda_inputs <- 10^seq(-2, 10, length = 100)
LASSO_fit <- glmnet(x = predictor_matrix_train, y = train$log_price_doc, alpha = 1, lambda = lambda_inputs)
LASSO_CV <- cv.glmnet(x = predictor_matrix_train, y = train$log_price_doc, alpha = 1, lambda = lambda_inputs)
lambda_star_1SE <- LASSO_CV$lambda.1se

get_LASSO_coefficients(LASSO_fit) %>% 
  filter(lambda == lambda_star_1SE) %>% 
  filter(estimate != 0) %>% 
  knitr::kable(digits = 5)
```

i.e. only 6 of the 107 predictor variables (plus the intercept) have non-zero coefficients. So the optimal cross-validated LASSO-regularized model is

$$
\begin{align}
\widehat{\log(y)} &= 15.53633 - 0.02116 \times \mbox{big\_church\_km} + 0.00582 \times \mbox{full\_sq} \\
&\quad - 0.00316 \times \mbox{power\_transmission\_line\_km} - 0.00341 \times \mbox{stadium\_km} \\
&\quad - 0.00357 \times \mbox{ts\_km} - 0.01183 \times \mbox{ttk\_km}
\end{align}
$$

Looking at the `data_dictionary.txt` file on Kaggle, we observe that, all else being equal:

  1. `big_church_km`: as the distance to a big church increases, the predicted price decreases.
  2. `full_sq`: total area in square meters has a positive relationship with price.
  3. `power_transmission_line_km`: as the distance to a power transmission line increases, the predicted price decreases.
  4. `stadium_km`: as the distance to a stadium increases, the predicted price decreases.
  5. `ts_km`: as the distance to a power station increases, the predicted price decreases.
  6. `ttk_km`: as the distance to the Third Transport Ring increases, the predicted price decreases. Note, however, that this distance alone says nothing about whether the property lies closer to or farther from the downtown core.