Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
197 lines (151 sloc) 5.8 KB
title author date output
STAT/MATH 495: Problem Set 02
Albert Y. Kim
toc toc_float toc_depth collapsed smooth_scroll
knitr::opts_chunk$set(echo = TRUE, fig.width=8, fig.height=4.5)

# Load packages

# Note the relative file path, and not absolute file path:
train <- read_csv("data/train.csv")
test <- read_csv("data/test.csv")
sample_submission <- read_csv("data/sample_submission.csv")

Exploratory Data Analysis

Outcome variable

Always do an EDA of outcome varible. Since 1 USD $\approx$ 58.11 Russian Rubles and median price is r median(train$price_doc) %>% prettyNum(big.mark=","), a good guess at units is Russian Roubles.

ggplot(train, aes(x=price_doc)) +
  geom_histogram() +
  labs(x="Price (Roubles)", title="Distribution of Price")

Things are very compressed on the left-hand side, so let's look at a log-transformation. Things are more even spread out!

ggplot(train, aes(x=price_doc)) +
  geom_histogram() +
  labs(x="Price (Roubles, log10-scale)", title="Distribution of Price (log10-scale)") +

So let's do the following:

  1. Model fitting phase
    • $\log()$ transformation the outcome variable $y$. Call these $y'$. I usually only use $\log10()$ for visuals.
    • Fit the model to training data
  2. Predicting phase
    • Take above model and based on test data $x$ and get fitted/predicted $\widehat{y'}$
    • Undo $\log()$ transformation. i.e. $\widehat{y} = \exp\left(\widehat{y'}\right)$
train <- train %>% 
  mutate(log_price_doc = log(price_doc))

Visual model

A lot of you used x=full_sq, which according to data_dictionary.txt is "total area in square meters, including loggias, balconies and other non-residential areas".

ggplot(train, aes(x=full_sq, y=log_price_doc)) +
  geom_point() +
  labs(x = "Total area (square meters)", y="log(Price)", title="Price vs area")

A couple of things to improve on:

  • The outlier which is r max(train$full_sq) square meters. For perspective, the inbounds area of an American football field is $110 \times 48.76 = 5363.6$ square meters! So let's:
    • Either remove this point, which IMO is reasonable since its only one point of r nrow(train)
    • At the very least plot the data without this point
  • There is overplotting on the left portion of the plot, so let's either
    • Thin the points out using the alpha argument for geom_point
    • Try a geom_density2d
# Drop outlier
train <- train %>%
  filter(full_sq < 2000)

# Create baseplot of common elements:
baseplot <- ggplot(train, aes(x=full_sq, y=log_price_doc)) +
  labs(x = "Total area (square meters)", y="log(Price)", title="Price vs area")

# Add layers specific to each plot:
plot1 <- baseplot +
plot2 <- baseplot +
  geom_point() +
  geom_density2d(col="orange", size=1)

# Plot both at same time
gridExtra::grid.arrange(plot1, plot2, nrow=1)

We note some interesting behavior:

  • Some houses have 0 square meters in area! Whatttt?
  • There seems to be some rounding in the Sale Price


While not absolutely necessary, let's do a comparison of the predictor variable in the training and test set. The distributions do not seem wildly different.

  train %>% select(full_sq) %>% mutate(set = "Train"),
  test %>% select(full_sq) %>% mutate(set = "Test")
) %>% 
  ggplot(aes(x=set, y=full_sq)) +
  geom_boxplot() +
  coord_flip() +
  labs(x="Data set", y="Total area", title="Comparison of predictor in train/test data")

Create Submission File

Model fitting

Let's fit the model using the logged-outcome variable, but also for comparison the untransformed outcome!

df <- 5
model_log_price_doc <- smooth.spline(x=train$full_sq, y=train$log_price_doc, df=df)
model_price_doc <- smooth.spline(x=train$full_sq, y=train$price_doc, df=df)
plot1 <- model_log_price_doc %>% 
  augment() %>% 
  as_tibble() %>% 
  ggplot(aes(x=x)) +
  geom_point(aes(y=y), alpha=0.2) +
  geom_line(aes(y=.fitted), col="blue") +
  labs(x="Total area", y="log(Price)", title = paste("Spline model on log(Price) with df =", df))
plot2 <- model_price_doc %>% 
  augment() %>% 
  as_tibble() %>% 
  ggplot(aes(x=x)) +
  geom_point(aes(y=y), alpha=0.2) +
  geom_line(aes(y=.fitted), col="blue") +
  labs(x="Total area", y="Price", title = paste("Spline model on Price with df =", df))
grid.arrange(plot1, plot2, nrow=1)


predictions_log <- predict(model_log_price_doc, x=test$full_sq) %>% 
  as_tibble() %>% 
  # Note here we need to un-log our predictions:
  mutate(price_doc = exp(y)) %>% 
  mutate(id = sample_submission$id) %>% 
  select(id, price_doc)
write_csv(predictions_log, "data/submission_log.csv")

predictions_no_log <- predict(model_price_doc, x=test$full_sq) %>% 
  as_tibble() %>% 
  # No such requirement here:
  mutate(price_doc = y) %>% 
  mutate(id = sample_submission$id) %>% 
  select(id, price_doc)
write_csv(predictions_no_log, "data/submission_no_log.csv")

# Sanity checks:
predictions_log %>% 
  head() %>% 

predictions_no_log %>% 
  head() %>% 


Final Kaggle Public Scores (RMSLE):

  • Modeling with $\log$(Price): 0.42340
  • Modeling with Price: 0.40859

Wow! Not what I expected! Keep in mind, we chose df=5 for both in an ad hoc fashion, so take this comparison with a grain of salt!