# 2017-09-Amherst-STAT495/PS02

---
title: "STAT/MATH 495: Problem Set 02"
author: "Albert Y. Kim"
date: "2017-09-19"
output:
  html_document:
    toc: true
    toc_depth: 2
    toc_float:
      collapsed: false
      smooth_scroll: false
---
```{r}
knitr::opts_chunk$set(echo = TRUE, fig.width=8, fig.height=4.5)

# Load packages
library(tidyverse)
library(broom)
library(gridExtra)
library(knitr)

# Note the relative file path, and not absolute file path:
# http://www.coffeecup.com/help/articles/absolute-vs-relative-pathslinks/
train <- read_csv("data/train.csv")
test <- read_csv("data/test.csv")
sample_submission <- read_csv("data/sample_submission.csv")
```

# Exploratory Data Analysis

## Outcome variable

Always do an EDA of the outcome variable. Since 1 USD $\approx$ 58.11 Russian roubles and the median price is `r median(train$price_doc) %>% prettyNum(big.mark=",")`, a good guess at the units is Russian roubles.

```{r}
ggplot(train, aes(x=price_doc)) +
  geom_histogram() +
  labs(x="Price (Roubles)", title="Distribution of Price")
```


Things are very compressed on the left-hand side, so let's look at a log transformation. Now things are more evenly spread out!

```{r}
ggplot(train, aes(x=price_doc)) +
  geom_histogram() +
  labs(x="Price (Roubles, log10-scale)", title="Distribution of Price (log10-scale)") +
  scale_x_log10()
```
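The effect of the log transformation can also be seen numerically. A minimal sketch on simulated right-skewed data (the lognormal data-generating process here is an assumption for illustration, not the housing data itself):

```{r}
# Simulate right-skewed "prices":
set.seed(495)
prices <- rlnorm(n = 1000, meanlog = 15, sdlog = 1)

# On the raw scale, the long right tail drags the mean far above the median:
summary(prices)

# After a log10 transformation the distribution is much more symmetric,
# so the mean and median nearly coincide:
summary(log10(prices))
```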


So let's do the following:

1. Model fitting phase
    - Apply a $\log()$ transformation to the outcome variable $y$; call the result $y'$. (I usually only use $\log_{10}()$ for visuals.)
    - Fit the model to the training data.
2. Predicting phase
    - Take the above model and, based on the test data $x$, get fitted/predicted values $\widehat{y'}$.
    - Undo the $\log()$ transformation, i.e. $\widehat{y} = \exp\left(\widehat{y'}\right)$.
```{r}
train <- train %>%
  mutate(log_price_doc = log(price_doc))
```
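To make the transform-then-invert workflow above concrete, here is a minimal self-contained sketch on simulated data (the data-generating function is an assumption for illustration only):

```{r}
set.seed(495)
x <- runif(500, min = 20, max = 200)             # "area" in square meters
y <- exp(10 + 0.01 * x + rnorm(500, sd = 0.3))   # "price", multiplicative noise

# 1. Model fitting phase: fit to the log-transformed outcome y'
y_prime <- log(y)
fit <- smooth.spline(x = x, y = y_prime, df = 5)

# 2. Predicting phase: predict y'-hat on new x, then undo the log with exp()
x_new <- c(50, 100, 150)
y_prime_hat <- predict(fit, x = x_new)$y
y_hat <- exp(y_prime_hat)
```

Note the predictions come out on the original (price) scale and are guaranteed positive, since `exp()` of any real number is positive.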


## Visual model

A lot of you used `x = full_sq`, which according to `data_dictionary.txt` is the "total area in square meters, including loggias, balconies and other non-residential areas".

```{r}
ggplot(train, aes(x=full_sq, y=log_price_doc)) +
  geom_point() +
  labs(x = "Total area (square meters)", y="log(Price)", title="Price vs area")
```


A couple of things to improve on:

- The outlier, which is `r max(train$full_sq)` square meters. For perspective, the inbounds area of an American football field is $110 \times 48.76 = 5363.6$ square meters! So let's:
    - Either remove this point, which IMO is reasonable since it's only one point out of `r nrow(train)`
    - At the very least plot the data without this point
- There is overplotting on the left portion of the plot, so let's either:
    - Thin the points out using the `alpha` argument for `geom_point()`
    - Try a `geom_density2d()`

```{r}
# Drop outlier
train <- train %>%
  filter(full_sq < 2000)

# Create baseplot of common elements:
baseplot <- ggplot(train, aes(x=full_sq, y=log_price_doc)) +
  labs(x = "Total area (square meters)", y="log(Price)", title="Price vs area")

# Add layers specific to each plot:
plot1 <- baseplot +
  geom_point(alpha=0.2)
plot2 <- baseplot +
  geom_point() +
  geom_density2d(col="orange", size=1)

# Plot both at same time
gridExtra::grid.arrange(plot1, plot2, nrow=1)
```

We note some interesting behavior:

- Some houses have 0 square meters in area! Whatttt?
- There seems to be some rounding in the sale price

## Extra

While not absolutely necessary, let's do a comparison of the predictor variable in the training and test sets. The distributions do not seem wildly different.

```{r}
bind_rows(
  train %>% select(full_sq) %>% mutate(set = "Train"),
  test %>% select(full_sq) %>% mutate(set = "Test")
) %>%
  ggplot(aes(x=set, y=full_sq)) +
  geom_boxplot() +
  coord_flip() +
  labs(x="Data set", y="Total area", title="Comparison of predictor in train/test data")
```

# Create Submission File

## Model fitting

Let's fit the model using the logged outcome variable, but also, for comparison, the untransformed outcome!
```{r}
df <- 5
model_log_price_doc <- smooth.spline(x=train$full_sq, y=train$log_price_doc, df=df)
model_price_doc <- smooth.spline(x=train$full_sq, y=train$price_doc, df=df)
```

```{r}
plot1 <- model_log_price_doc %>%
  augment() %>%
  as_tibble() %>%
  ggplot(aes(x=x)) +
  geom_point(aes(y=y), alpha=0.2) +
  geom_line(aes(y=.fitted), col="blue") +
  labs(x="Total area", y="log(Price)", title = paste("Spline model on log(Price) with df =", df))
plot2 <- model_price_doc %>%
  augment() %>%
  as_tibble() %>%
  ggplot(aes(x=x)) +
  geom_point(aes(y=y), alpha=0.2) +
  geom_line(aes(y=.fitted), col="blue") +
  labs(x="Total area", y="Price", title = paste("Spline model on Price with df =", df))
grid.arrange(plot1, plot2, nrow=1)
```

## Predictions

```{r}
predictions_log <- predict(model_log_price_doc, x=test$full_sq) %>%
  as_tibble() %>%
  # Note here we need to un-log our predictions:
  mutate(price_doc = exp(y)) %>%
  mutate(id = sample_submission$id) %>%
  select(id, price_doc)
write_csv(predictions_log, "data/submission_log.csv")

predictions_no_log <- predict(model_price_doc, x=test$full_sq) %>%
  as_tibble() %>%
  # No such requirement here:
  mutate(price_doc = y) %>%
  mutate(id = sample_submission$id) %>%
  select(id, price_doc)
write_csv(predictions_no_log, "data/submission_no_log.csv")

# Sanity checks:
predictions_log %>%
  head() %>%
  knitr::kable()
predictions_no_log %>%
  head() %>%
  knitr::kable()
```

## Results

Final Kaggle Public Scores (RMSLE):

- Modeling with $\log$(Price): 0.42340
- Modeling with Price: 0.40859
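For reference, RMSLE (Root Mean Squared Logarithmic Error) on predictions $\widehat{y}_i$ and true values $y_i$ is $\sqrt{\frac{1}{n}\sum_{i=1}^n \left(\log(\widehat{y}_i + 1) - \log(y_i + 1)\right)^2}$. A minimal sketch (the function name `rmsle` is ours, not from any package):

```{r}
# Root Mean Squared Logarithmic Error, the Kaggle leaderboard metric:
rmsle <- function(predicted, actual) {
  sqrt(mean((log(predicted + 1) - log(actual + 1))^2))
}

# Perfect predictions score 0; because errors are measured on the log scale,
# RMSLE penalizes *relative*, not absolute, error:
rmsle(c(100, 200, 300), c(100, 200, 300))  # 0
```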

Wow! Not what I expected! Keep in mind, we chose `df = 5` for both in an ad hoc fashion, so take this comparison with a grain of salt!
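One way to make the choice of `df` less ad hoc: `smooth.spline()` can pick the smoothing level itself via cross-validation when `df` is omitted. A sketch on simulated data (the data-generating function is an assumption for illustration):

```{r}
set.seed(495)
x <- runif(500, min = 20, max = 200)
y <- 10 + 0.01 * x + rnorm(500, sd = 0.3)

# cv = TRUE uses ordinary leave-one-out cross-validation;
# the default cv = FALSE uses generalized cross-validation (GCV):
fit_cv <- smooth.spline(x = x, y = y, cv = TRUE)

# The effective degrees of freedom chosen by cross-validation:
fit_cv$df
```

On the housing data, the `df` chosen this way (separately for the logged and unlogged outcomes) could then be compared against our ad hoc `df = 5` before reading too much into the leaderboard difference.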