---
title: "STAT/MATH 495: Problem Set 03"
author: "Albert Y. Kim"
date: "2017-09-26"
output:
  html_document:
    toc: true
    toc_depth: 2
    toc_float:
      collapsed: false
      smooth_scroll: false
    df_print: kable
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.width=8, fig.height=4.5)
# Set random number generator seed value
set.seed(76)
# Load packages
library(tidyverse)
library(broom)
library(knitr)
library(gridExtra)
# Load data
data1 <- read_csv("data/data1.csv")
data2 <- read_csv("data/data2.csv")
```
```{r, echo=FALSE}
# Note: same n and f() used for both data1 and data2! Only thing that differed
# was sigma.
n <- 3000
f <- function(x){
  y <- ifelse(-10 < x & x < 5*pi, abs(x-2.5),
         ifelse(5*pi < x & x < 25, 0,
           ifelse(25 < x & x < 32, -0.01*(exp(x-25)-1),
             ifelse(32 < x & x < 60, -x+30+0.02*(exp(7)-1), 10*sin(x/2))
           )
         )
       )
  y <- y - 5
  return(y)
}
```
# Question
For both the `data1` and `data2` tibbles (a tibble is a data frame with some
[metadata](https://blog.rstudio.com/2016/03/24/tibble-1-0-0#tibbles-vs-data-frames) attached):

* Find the splines model with the best out-of-sample predictive ability.
* Create a visualization arguing why you chose this particular model.
* Create a visualization of this model plotted over the given $(x_i, y_i)$ points for $i=1,\ldots,n=3000$.
* Give your estimate $\widehat{\sigma}$ of $\sigma$, where the noise component $\epsilon_i$ is distributed with mean 0 and standard deviation $\sigma$; see the model statement after this list.
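In other words, the assumed signal-plus-noise model is

$$
y_i = f(x_i) + \epsilon_i \quad \mbox{where} \quad \mathbb{E}\left[\epsilon_i\right] = 0 \mbox{ and } \mbox{SD}\left(\epsilon_i\right) = \sigma.
$$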
## Note on `pull()`
`pull()` is the `dplyr` way to return a vector from a data frame; if the data
frame has only one row, then `pull()` returns a scalar. The following two
commands do the same thing; I use whichever is more convenient at the moment.
```{r, eval=FALSE}
mtcars %>% pull(mpg)
mtcars$mpg
```
# Data 1
## Simpler approach
The simplest solution is to run $k=2$ fold crossvalidation as done in
ChalkTalk 1.7:
```{r, echo=TRUE, warning=FALSE, message=FALSE}
# Consider a range of degrees of freedom. I chose this through trial and error.
df_vector <- 2:60
# Save results here
results <- data_frame(
  df = df_vector,
  RMSE = 0
)

for(i in 1:length(df_vector)){
  # Note: these two datasets are disjoint
  data1_A <- data1 %>%
    sample_frac(0.5)
  data1_B <- data1 %>%
    anti_join(data1_A, by="ID")

  # Fit model on A, get predictions for B, compute RMSE on B
  model <- smooth.spline(x=data1_A$x, y=data1_A$y, df=df_vector[i])
  predictions <- predict(model, data1_B$x) %>%
    as_tibble() %>%
    pull(y)
  RMSE_B <- data1_B %>%
    mutate(yhat = predictions) %>%
    summarise(RMSE = sqrt(mean((y-yhat)^2))) %>%
    pull(RMSE)

  # Fit model on B, get predictions for A, compute RMSE on A
  model <- smooth.spline(x=data1_B$x, y=data1_B$y, df=df_vector[i])
  predictions <- predict(model, data1_A$x) %>%
    as_tibble() %>%
    pull(y)
  RMSE_A <- data1_A %>%
    mutate(yhat = predictions) %>%
    summarise(RMSE = sqrt(mean((y-yhat)^2))) %>%
    pull(RMSE)

  # Take the mean of the two validation RMSEs
  results$RMSE[i] <- (RMSE_A + RMSE_B)/2
}
```
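For each candidate degrees of freedom, the loop above estimates the out-of-sample root mean squared error. On a validation set $V$ this is

$$
\widehat{RMSE} = \sqrt{\frac{1}{|V|}\sum_{i \in V} \left(y_i - \widehat{y}_i\right)^2},
$$

which is exactly what the `summarise()` call computes.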
What is $df^*$, the optimal degrees of freedom that yields the lowest estimate
$\widehat{RMSE}$ of the true out-of-sample predictive $RMSE$?
```{r, echo=FALSE, warning=FALSE, message=FALSE}
optimal <- results %>%
  arrange(RMSE) %>%
  slice(1)
df_star <- optimal$df
RMSE_star <- optimal$RMSE
optimal %>%
  kable(digits=3)

ggplot(results, aes(x=df, y=RMSE)) +
  geom_point() +
  labs(x="degrees of freedom", y="RMSE", title="Crossvalidation estimates of RMSE for spline model") +
  geom_vline(xintercept = df_star, col="red")
```
## 5-fold crossvalidation approach
Let's now generalize this idea to 5-fold crossvalidation. Carefully note the
folding scheme; recall the code exercise at the end of [Lecture 2.4](https://rudeboybert.github.io/STAT495/#24_feedback_on_ps02).
```{r, echo=TRUE, warning=FALSE, message=FALSE}
# Set up the folding scheme. Note how we first shuffle the rows and then assign
# roughly equal numbers of observations to each fold. This scheme ensures the
# folds are disjoint.
data1 <- data1 %>%
  sample_frac(1) %>%
  mutate(fold = rep(1:5, length=n()))
data1 %>%
  count(fold)

# Consider a range of degrees of freedom. This was chosen by trial and error.
df_vector <- 2:60
# Save results here
results <- data_frame(
  df = df_vector,
  RMSE = 0
)

for(i in 1:length(df_vector)){
  RMSE_per_fold <- rep(0, 5)
  for(j in 1:5){
    # Set up training & validation (AKA "pretend" test) sets based on folds:
    # train on the four folds not equal to j, validate on fold j
    train_set <- data1 %>%
      filter(fold != j)
    validation_set <- data1 %>%
      filter(fold == j)
    # Fit model
    model <- smooth.spline(x=train_set$x, y=train_set$y, df=df_vector[i])
    # Get predictions
    predictions <- predict(model, validation_set$x) %>%
      as_tibble() %>%
      pull(y)
    # Compute RMSE for this fold
    RMSE_per_fold[j] <- validation_set %>%
      mutate(yhat = predictions) %>%
      summarise(RMSE = sqrt(mean((y-yhat)^2))) %>%
      pull(RMSE)
  }
  # Save the average of the five per-fold RMSEs
  results$RMSE[i] <- mean(RMSE_per_fold)
}
```
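Note that the estimate reported for each degrees of freedom value is the average of the five per-fold estimates:

$$
\widehat{RMSE} = \frac{1}{5}\sum_{j=1}^{5} \widehat{RMSE}_j,
$$

where $\widehat{RMSE}_j$ is the RMSE computed on held-out fold $j$.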
What is $df^*$, the optimal degrees of freedom that yields the lowest estimate
$\widehat{RMSE}$ of the true out-of-sample predictive $RMSE$?
```{r, echo=FALSE, warning=FALSE, message=FALSE}
optimal <- results %>%
  arrange(RMSE) %>%
  slice(1)
optimal %>%
  kable(digits=3)
df_star <- optimal$df
RMSE_star <- optimal$RMSE

ggplot(results, aes(x=df, y=RMSE)) +
  geom_point() +
  labs(x="degrees of freedom", y="RMSE", title="Crossvalidation estimates of RMSE for spline model") +
  geom_vline(xintercept = df_star, col="red")
```
As a sanity check, let's fit a model using the optimal $df^*$ on *all* the given data and plot it. It doesn't look unreasonable!
```{r, echo=TRUE, warning=FALSE, message=FALSE}
fitted_model_1 <- smooth.spline(x=data1$x, y=data1$y, df=df_star) %>%
  augment()
```
```{r, echo=FALSE, warning=FALSE, message=FALSE}
fitted_plot_1 <- ggplot(fitted_model_1, aes(x=x)) +
  geom_point(aes(y=y), size=0.5, alpha=0.2) +
  geom_line(aes(y=.fitted), col="blue", size=1)
title <- paste("Fitted splines model on all data1 with df =", df_star)
fitted_plot_1 +
  labs(title=title)
```
## Truth vs Fitted
Here are the two unknowns revealed!
1. $f(x)$ is the curve in red below
1. $\epsilon$, the error, was distributed Normal$(0, \sigma = 15)$. The black
dashed lines indicate a 95% "error band" about $f(x)$, i.e. $f(x) \pm 1.96 \times
\sigma$. Note that this error band is exact only because the error was set to be
Normal, which is not necessarily always the case!
```{r, echo=FALSE, warning=FALSE, message=FALSE}
sigma <- 15
set.seed(76)
data1 <- data_frame(
  x = runif(n, min=-10, max=90),
  f_x = f(x),
  epsilon = rnorm(n, 0, sd = sigma),
  y = f_x + epsilon,
  ID = 1:n
) %>%
  mutate(
    upper = f_x + 1.96*sigma,
    lower = f_x - 1.96*sigma
  )

true_plot_1 <- ggplot(data1, aes(x=x, y=y)) +
  stat_function(fun=f, col="red", size=1) +
  geom_point(size=0.5, alpha=0.2) +
  geom_line(aes(y=lower), linetype="dashed", size=1) +
  geom_line(aes(y=upper), linetype="dashed", size=1)
true_plot_1 +
  labs(title="True f(x) and true error band")
```
Let's now compare the above plot with $\widehat{f}(x)$, the fitted curve in blue below.
```{r, echo=FALSE, warning=FALSE, message=FALSE}
grid.arrange(
  true_plot_1 + labs(title="Truth"),
  fitted_plot_1 + labs(title="Fitted/Estimated"),
  nrow=1)
```
# Data 2
## 5-fold crossvalidation approach
```{r, echo=FALSE, warning=FALSE, message=FALSE}
# Set up the folding scheme. Note how we first shuffle the rows and then assign
# roughly equal numbers of observations to each fold. This scheme ensures the
# folds are disjoint.
data2 <- data2 %>%
  sample_frac(1) %>%
  mutate(fold = rep(1:5, length=n()))

# Consider a range of degrees of freedom. This was chosen by trial and error.
df_vector <- 2:60
# Save results here
results <- data_frame(
  df = df_vector,
  RMSE = 0
)

for(i in 1:length(df_vector)){
  RMSE_per_fold <- rep(0, 5)
  for(j in 1:5){
    # Set up training & validation (AKA "pretend" test) sets based on folds:
    # train on the four folds not equal to j, validate on fold j
    train_set <- data2 %>%
      filter(fold != j)
    validation_set <- data2 %>%
      filter(fold == j)
    # Fit model
    model <- smooth.spline(x=train_set$x, y=train_set$y, df=df_vector[i])
    # Get predictions
    predictions <- predict(model, validation_set$x) %>%
      as_tibble() %>%
      pull(y)
    # Compute RMSE for this fold
    RMSE_per_fold[j] <- validation_set %>%
      mutate(yhat = predictions) %>%
      summarise(RMSE = sqrt(mean((y-yhat)^2))) %>%
      pull(RMSE)
  }
  # Save the average of the five per-fold RMSEs
  results$RMSE[i] <- mean(RMSE_per_fold)
}
```
What is $df^*$, the optimal degrees of freedom that yields the lowest estimate
$\widehat{RMSE}$ of the true out-of-sample predictive $RMSE$?
```{r, echo=FALSE, warning=FALSE, message=FALSE}
optimal <- results %>%
  arrange(RMSE) %>%
  slice(1)
optimal %>%
  kable(digits=3)
df_star <- optimal$df
RMSE_star <- optimal$RMSE

ggplot(results, aes(x=df, y=RMSE)) +
  geom_point() +
  labs(x="degrees of freedom", y="RMSE", title="Crossvalidation estimates of RMSE for spline model") +
  geom_vline(xintercept = df_star, col="red")
```
As a sanity check, let's fit a model using the optimal $df^*$ on *all* the given data and plot it. It doesn't look unreasonable!
```{r, echo=TRUE, warning=FALSE, message=FALSE}
fitted_model_2 <- smooth.spline(x=data2$x, y=data2$y, df=df_star) %>%
  augment()
```
```{r, echo=FALSE, warning=FALSE, message=FALSE}
fitted_plot_2 <- ggplot(fitted_model_2, aes(x=x)) +
  geom_point(aes(y=y), size=0.5, alpha=0.2) +
  geom_line(aes(y=.fitted), col="blue", size=1)
title <- paste("Fitted splines model on all data2 with df =", df_star)
fitted_plot_2 +
  labs(title=title)
```
## Truth vs Fitted
Here are the two unknowns revealed!
1. $f(x)$ is the curve in red below. **The same as for `data1`.**
1. $\epsilon$, the error, was distributed Normal$(0, \sigma = 25)$. The black
dashed lines indicate a 95% "error band" about $f(x)$, i.e. $f(x) \pm 1.96 \times
\sigma$. Note that this error band is exact only because the error was set to be
Normal, which is not necessarily always the case!
```{r, echo=FALSE, warning=FALSE, message=FALSE}
sigma <- 25
set.seed(76)
data2 <- data_frame(
  x = runif(n, min=-10, max=90),
  f_x = f(x),
  epsilon = rnorm(n, 0, sd = sigma),
  y = f_x + epsilon,
  ID = 1:n
) %>%
  mutate(
    upper = f_x + 1.96*sigma,
    lower = f_x - 1.96*sigma
  )

true_plot_2 <- ggplot(data2, aes(x=x, y=y)) +
  stat_function(fun=f, col="red", size=1) +
  geom_point(size=0.5, alpha=0.2) +
  geom_line(aes(y=lower), linetype="dashed", size=1) +
  geom_line(aes(y=upper), linetype="dashed", size=1)
true_plot_2 +
  labs(title="True f(x) and true error band")
```
Let's now compare the above plot with $\widehat{f}(x)$, the fitted curve in blue below.
```{r, echo=FALSE, warning=FALSE, message=FALSE}
grid.arrange(
  true_plot_2 + labs(title="Truth"),
  fitted_plot_2 + labs(title="Fitted/Estimated"),
  nrow=1)
```
# Comparison
Let's compare everything at once:

* Recall:
    + The true curve for both `data1` and `data2` was the same, but there was
    more noise for `data2`: $\epsilon$ had standard deviation $\sigma$ of 15 vs 25.
    + This larger $\sigma$ led to RMSEs of 15.548 vs 25.375.
* Moral: When there is more noise (in this case, when $\epsilon$ has a bigger
standard deviation) for the same signal (the red curve), models will have a
harder time "getting it right."
```{r, echo=FALSE, warning=FALSE, message=FALSE, fig.width=8, fig.height=8}
grid.arrange(
  true_plot_1 +
    labs(title="Truth data1") +
    coord_cartesian(ylim=c(-90, 90)),
  fitted_plot_1 +
    labs(title="Fitted/Estimated data1") +
    coord_cartesian(ylim=c(-90, 90)),
  true_plot_2 +
    labs(title="Truth data2") +
    coord_cartesian(ylim=c(-90, 90)),
  fitted_plot_2 +
    labs(title="Fitted/Estimated data2") +
    coord_cartesian(ylim=c(-90, 90)),
  nrow=2)
```
# Additionally...
## Estimate $\widehat{\sigma}$ of $\sigma$
You may think that the appropriate estimate $\widehat{\sigma}$ of $\sigma$ is the RMSE. Very close, but not quite! We'll see when we cover the *bias-variance tradeoff* that the following holds:
$$
\mbox{MSE} = \left(\mbox{Bias}\left[\widehat{f}(x)\right]\right)^2
+ \mbox{Var}\left[\widehat{f}(x)\right] + \sigma^2
$$
where $\sigma^2$ is the variance of the error term $\epsilon$. Stay tuned!
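In the meantime, here is a minimal sketch of one common residual-based estimator; this is an assumption on my part, not something we have covered. For a smoothing spline fit that uses $df^*$ effective degrees of freedom, divide the residual sum of squares by $n - df^*$ rather than $n$, analogous to dividing by $n - p$ in regression. The sketch assumes the `.resid` column returned by `broom::augment()` and that `df_star` holds the `data1` optimum.

```{r, eval=FALSE}
# A hedged sketch, not the official solution: estimate sigma from the
# residuals of the spline fit on data1, adjusting the denominator for
# the degrees of freedom absorbed by the fit.
fitted_model_1 %>%
  summarise(sigma_hat = sqrt(sum(.resid^2) / (n() - df_star))) %>%
  pull(sigma_hat)
```

Given the reported RMSE of 15.548 for `data1`, this adjusted estimate should land close to the true $\sigma = 15$.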
## How to pick optimal degrees of freedom $df^*$
The optimal degrees of freedom $df^*$ here is 19. However, since in the case of
splines "simpler" means fewer degrees of freedom, for a semi-arbitrarily chosen
"tolerance" of 0.05 RMSE units the simplest model that yields similar
performance is $df=11$. One way to implement this rule is sketched after the
plot below.
```{r, echo=FALSE, fig.height=6}
ggplot(results, aes(x=df, y=RMSE)) +
  geom_point() +
  labs(x="degrees of freedom", y="RMSE", title="How to choose degrees of freedom") +
  # geom_vline(xintercept = df_star, col="red") +
  geom_hline(yintercept=c(RMSE_star-0.05, RMSE_star, RMSE_star+0.05), linetype="dashed", alpha=0.4) +
  geom_errorbar(aes(x=df_star, ymin=RMSE_star-0.05, ymax=RMSE_star+0.05), size=0.75, width=2) +
  annotate("point", x=11, y=25.42312, col="red", size=3)
```
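As a minimal sketch of this rule (assuming the `results` and `RMSE_star` objects from the `data2` crossvalidation above), keep every model within the tolerance of the minimum RMSE and pick the smallest $df$ among them:

```{r, eval=FALSE}
# Keep every df whose estimated RMSE is within 0.05 RMSE units of the
# minimum, then pick the simplest (smallest df) among them.
tolerance <- 0.05
results %>%
  filter(RMSE <= RMSE_star + tolerance) %>%
  arrange(df) %>%
  slice(1)
```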
## How to pick number of folds $k$ for crossvalidation
We're not quite equipped to answer this question yet; see ISLR Section 5.1.4 (page 183).