---
title: "STAT/MATH 495: Problem Set 06"
author: "Albert Y. Kim and Andrew Kim"
date: "2017-10-17"
output:
  html_document:
    toc: true
    toc_float:
      collapsed:
      smooth_scroll:
    toc_depth: 2
    df_print: kable
---
knitr::opts_chunk$set(
  echo = TRUE, fig.width=8, fig.height=4.5, message=FALSE, warning = FALSE
  )
set.seed(76)

# Load packages
library(tidyverse)
library(broom)
library(knitr)
library(gridExtra)

Collaboration

Please indicate who you collaborated with on this assignment:

Setup

Define the truth, which again we know for the purposes of this assignment but which in practice we won't know:

  • the true function $f(x)$, i.e. the signal
  • the true epsilon, i.e. the noise, which in this case is Normal$(0, \mbox{sd} = \sigma)$. Hence the standard deviation $\sigma$ determines the amount of noise.
f <- function(x) {
  x^2
}
sigma <- 0.3

This is the target point we'll be trying to predict: $(0.95, f(0.95)) = (0.95, 0.95^2) = (0.95, 0.9025)$. Thus, the test set is just $x = 0.95$:

x0 <- 0.95
test_set <- data_frame(x=x0)

This function generates a random sample of size $n$; think of it as a "get new data" function. The sample is random in terms of both:

  • (New) the predictor x (uniform on [0,1])
  • the amount of noise $\epsilon$
generate_sample <- function(f, n, sigma) {
  sample <- data_frame(
    x = runif(n = n, min = 0, max = 1),
    f_x = f(x),
    epsilon = rnorm(n = n, mean = 0, sd = sigma),
    y = f_x + epsilon
  )
  # Recall: We don't observe f(x) and epsilon, just (x, y)
  sample <- sample %>% 
    select(x, y)
  
  return(sample)
}
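For a quick look at what this function returns (a tiny $n$ here purely for illustration; the actual simulation below uses $n = 500$):

# Note that only the observed columns (x, y) come back
generate_sample(f, n = 5, sigma = sigma)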

Define

  • The number $n$ of observations $(x_i, y_i)$ in each sample. In the handout, $n=100$ to keep the plots uncrowded; here we boost it to $n=500$.
  • The number of samples of size $n$ to consider
n <- 500
n_samples <- 10000

Computation

Get fitted/predicted values $\widehat{f}(0.95)$

First, let's

  1. Generate a new training set of observations $(x_i, y_i)$ for $i=1, \ldots, n = 500$
  2. Based on the above training set, fit a) a spline model $\widehat{y} = \widehat{f}_{2}(x)$ using degrees of freedom $df=2$ and b) a spline model $\widehat{y} = \widehat{f}_{99}(x)$ using degrees of freedom $df=99$
  3. Use these models to predict the value of $f(0.95) = 0.95^2$ by computing a) $\widehat{y} = \widehat{f}_{2}(0.95)$ and b) $\widehat{y} = \widehat{f}_{99}(0.95)$

Then repeat the above 10000 times:

# Store predicted values here
y_hat_df_2 <- rep(0, n_samples)
y_hat_df_99 <- rep(0, n_samples)

for(i in 1:n_samples) {
  # 1. Sample a new instance of training data (x, y)
  train_set <- generate_sample(f, n, sigma)

  # 2.a) & 3.a) Fit the df=2 model & predict on the test set
  df_2_model <- smooth.spline(x=train_set$x, y=train_set$y, df=2) 
  y_hat_df_2[i] <- predict(df_2_model, x=test_set$x)$y

  # 2.b) & 3.b) Fit the df=99 model & predict on the test set
  df_99_model <- smooth.spline(x=train_set$x, y=train_set$y, df=99) 
  y_hat_df_99[i] <- predict(df_99_model, x=test_set$x)$y
}

Let's visualize

  • A histogram of the 10000 values of $\widehat{f}_{2}(0.95)$
  • A histogram of the 10000 values of $\widehat{f}_{99}(0.95)$
  • A red line indicating the true $f(0.95) = 0.95^2 = 0.9025$.
y_hat_data_frame <- data_frame(
  df = c(rep("df = 2", n_samples), rep("df = 99", n_samples)),
  y_hat = c(y_hat_df_2, y_hat_df_99)
)
ggplot(y_hat_data_frame, aes(x=y_hat)) +
  geom_histogram() +
  facet_wrap(~df, nrow=1) +
  geom_vline(xintercept=f(x0), col="red", size=1) +
  labs(x="f_hat(0.95)", title="Figure 1: 10000 fitted/predicted values y_hat")

We observe, just as in the handout from Lec 2.7, that for

  • $df=2$ we have lower variance (AKA lower standard error) and higher bias
  • $df=99$ we have higher variance and lower bias (almost none in fact)
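A quick numeric check of this claim, using the simulated predictions directly:

# Spread of the predictions across the 10000 training sets
sd(y_hat_df_2)
sd(y_hat_df_99)
# Bias: average prediction minus the true f(0.95) = 0.9025
mean(y_hat_df_2) - f(x0)
mean(y_hat_df_99) - f(x0)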

Creating $y$'s

We now take our 10000 instances of $\widehat{f}_{2}(0.95)$ and $\widehat{f}_{99}(0.95)$ and evaluate the MSE. This necessitates 10000 values of $y$. What is $y$? One source of confusion that everyone encounters when doing this exercise (even I did) is:

  • $\widehat{y} = \widehat{f}(x)$
  • $y \neq f(0.95)$. Rather, $y = f(x) + \epsilon$

Note that $y$'s incorporate the unsystematic error term $\epsilon$. In most real-life cases, we won't know the mechanism that generates these terms. In this exercise however, we do:

$$ \epsilon \sim \mbox{Normal}\left(0, \sigma = 0.3 \right) $$

We can't use the generate_sample() function above as this generates observations $(x,y)$ for many different $x$'s. However, we only want $y$'s for $x=0.95$. So let's manually construct them!

# First the error component...
epsilon <- rnorm(n = n_samples, mean = 0, sd = sigma)
# Then the signal component...
x <- rep(0.95, times = n_samples)
f_x <- f(x)
# Now put them together...
y <- f_x + epsilon
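As a quick sanity check, the constructed $y$'s should be centered at $f(0.95) = 0.9025$ with standard deviation roughly $\sigma = 0.3$:

# Should be close to 0.9025 and 0.3 respectively
mean(y)
sd(y)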

Let's put all these vectors together into a single data frame!

results <- data_frame(
  x = x,
  f_x = f_x,
  eps = epsilon,
  y = y
)
plot1 <- ggplot(results, aes(x=eps)) +
  geom_histogram() +
  labs(x="epsilon", title="Figure 2: Error component epsilon") +
  coord_cartesian(xlim=c(-1, 2))
plot2 <- ggplot(results, aes(x=y)) +
  geom_histogram() +
  labs(x="epsilon", title="Figure 2: Observed y") +
  geom_vline(xintercept=f(x0), col="red", size=1) +
  coord_cartesian(xlim=c(-1, 2))
grid.arrange(plot1, plot2, nrow=1)

Evaluate MSE and breakdown
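For reference, the breakdown we're verifying is the bias-variance decomposition covered in Lec 2.7: at the test point $x_0 = 0.95$,

$$\mbox{MSE}\left[\widehat{f}(x_0)\right] = \mbox{E}\left[\left(y - \widehat{f}(x_0)\right)^2\right] = \left(\mbox{E}\left[\widehat{f}(x_0)\right] - f(x_0)\right)^2 + \mbox{Var}\left[\widehat{f}(x_0)\right] + \sigma^2$$

i.e. bias-squared plus variance plus the irreducible error $\sigma^2$.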

Let's now tack on our fitted values:

results <- results %>% 
  mutate(
    y_hat_df_2 = y_hat_df_2,
    y_hat_df_99 = y_hat_df_99
  )

Let's show the bias-squared/variance/irreducible error breakdown for $df=2$:

results %>%
  mutate(y_hat = y_hat_df_2) %>% 
  summarise(
    MSE = mean((y-y_hat)^2),
    bias_squared = (mean(f_x - y_hat))^2,
    var = var(y_hat)
  ) %>%
  mutate(
    irreducible = sigma^2,
    sum = bias_squared + var + irreducible
    ) %>% 
  kable(digits = 4)

Let's show the bias-squared/variance/irreducible error breakdown for $df=99$:

results %>%
  mutate(y_hat = y_hat_df_99) %>% 
  summarise(
    MSE = mean((y-y_hat)^2),
    bias_squared = (mean(f_x - y_hat))^2,
    var = var(y_hat)
  ) %>%
  mutate(
    irreducible = sigma^2,
    sum = bias_squared + var + irreducible
  ) %>% 
  kable(digits = 4)

Analysis

Questions:

  1. Based on the topics covered in Lec 2.7, name one possible "sanity check" for your results. Name another if you can.
  2. In two sentences or less, give a rough sketch of what the procedure would be to get the breakdown of $$\mbox{MSE}\left[\widehat{f}(x)\right]$$ for all $x$ in this example, and not just for $$\mbox{MSE}\left[\widehat{f}(x_0)\right] = \mbox{MSE}\left[\widehat{f}(0.95)\right]$$.
  3. Which of the two models would you choose for predicting the point of interest and why?

Answers:

  1. Two possible sanity checks:
    1. That the MSE column is equal to the sum of the variance, bias-squared, and irreducible error columns, which was the point of the whole exercise.
    2. Comparing your results to the handout from Lec 2.7 (points, blue lines, red dots): that for $df=2$ you have higher bias and lower variance, and for $df=99$ you have lower bias (almost 0 in fact) and higher variance.
  2. Repeat this whole process for different $x$ values. Either:
    1. Randomly sample $x$'s, repeat the above procedure, and then take the average. OR
    2. Repeat the above procedure for a grid of $x$ values (0, 0.1, 0.2, ..., 0.9, and 1 say), and then take the weighted average, where the weights are based on how frequently each $x$ occurs, i.e. on the probability density function $f(x)$ of $x$ (note this $f$ is different than the "model" $f$). In our example, since $X \sim \mbox{Uniform}(0,1)$, $f(x) = \frac{1}{b-a}\mathbb{1}_{\{x \in (0,1)\}} = \mathbb{1}_{\{x \in (0,1)\}}$, so all the weights are equal. A minimal sketch of this grid approach appears after this list.
  3. IMO since the MSE's are rather similar (up to sampling error), I'd go with the simpler model: $df=2$. But again, this depends on what your criterion is. If your criterion is unbiasedness, then go with $df=99$. But in machine learning, it might be worth tolerating a little non-zero bias in your model $\widehat{f}$ if it means a big reduction in the variance of $\widehat{f}$ from training set to training set.
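As referenced in answer 2 above, here is a minimal sketch of the grid-of-$x$ approach. It is not required by the problem set; the grid, the choice of $df=2$, and the small number of simulations per grid point are ad hoc choices to keep it quick.

x_grid <- seq(from = 0, to = 1, by = 0.1)
n_sim <- 100  # far fewer than the 10000 used above, purely to keep this sketch fast

MSE_at_x <- rep(0, length(x_grid))
for(j in 1:length(x_grid)) {
  y_hat <- rep(0, n_sim)
  for(i in 1:n_sim) {
    # New training set, fit a df=2 spline, predict at this grid point
    train_set <- generate_sample(f, n, sigma)
    spline_model <- smooth.spline(x=train_set$x, y=train_set$y, df=2)
    y_hat[i] <- predict(spline_model, x=x_grid[j])$y
  }
  # Observed y's at this grid point, then the MSE of the predictions
  y_at_x <- f(x_grid[j]) + rnorm(n_sim, mean = 0, sd = sigma)
  MSE_at_x[j] <- mean((y_at_x - y_hat)^2)
}
# Since X ~ Uniform(0,1), the density weights are all equal, so the
# overall MSE is just the unweighted average over the grid
mean(MSE_at_x)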

Extra: (This takes a long time to knit!)

Let's repeat the above, but not just for $df=2$ and $df=99$, but for a more refined grid of values: 2, 7, 12, 17, ..., 82, 87, 92, and 97. We then plot the results.

# Note: this takes a while!
df_vector <- seq(from=2, to=99, by=5)
MSE <- rep(0, length(df_vector))
bias_sq <- rep(0, length(df_vector))
variance <- rep(0, length(df_vector))
irreducible <- rep(sigma^2, length(df_vector))
sum_of_components <- rep(0, length(df_vector))

# Create observed y's that incorporate both signal and noise
y <- f(0.95) + rnorm(n_samples, mean=0, sd=sigma)

for(j in 1:length(df_vector)){
  y_hat <- rep(0, n_samples)
  for(i in 1:n_samples) {
    # 1. Sample a new instance of training data (x, y)
    train_set <- generate_sample(f, n, sigma)
    
    # 2. Fit a spline model with df = df_vector[j] & predict on the test set
    spline_model <- smooth.spline(x=train_set$x, y=train_set$y, df=df_vector[j]) 
    y_hat[i] <- predict(spline_model, x=test_set$x)$y
  }
  
  MSE[j] <- mean((y-y_hat)^2)
  bias_sq[j] <- (mean(f(0.95) - y_hat))^2  
  variance[j] <- var(y_hat)
  sum_of_components[j] <- bias_sq[j] + variance[j] + irreducible[j]
}
data_frame(
  df = df_vector, 
  MSE = MSE,
  bias_sq = bias_sq,
  variance = variance,
  irreducible = irreducible,
  sum = sum_of_components
) %>% 
  # convert to tidy format
  gather(type, value, -df) %>% 
  mutate(type = factor(type, levels=c("bias_sq", "variance", "irreducible", "sum", "MSE"))) %>% 
  ggplot(aes(x=df, y=value, col=type)) +
  geom_line() +
  labs(x="degrees of freedom", y="value", title="MSE bias-variance breakdown")

Note:

  • The blue curve "sum" is the sum of the values of the red (bias-squared), yellow (variance), and green (irreducible error $\sigma^2$ i.e. constant) curves.
  • The purple curve "MSE" tracks the blue curve rather closely.
  • However, I'm not sure why they seem to be off by a roughly constant amount. It could be sampling variation: the same draw of 10000 observed $y$'s is reused for every value of df, so whatever deviation mean(epsilon^2) happens to have from $\sigma^2$ shifts every MSE value by roughly the same constant. A quick check follows below.
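If the shared draw of $\epsilon$'s is really the culprit, the gap between the MSE and sum curves should be close to mean(epsilon^2) - sigma^2:

# Deviation of the realized noise's mean square from sigma^2...
mean((y - f(0.95))^2) - sigma^2
# ...compared to the observed gaps between the MSE and sum curves
summary(MSE - sum_of_components)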