---
title: "STAT/MATH 495: Problem Set 06"
author: "Albert Y. Kim and Andrew Kim"
date: "2017-10-17"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: false
    toc_depth: 2
    df_print: kable
---
```{r}
knitr::opts_chunk$set(
  echo = TRUE, fig.width=8, fig.height=4.5, message=FALSE, warning = FALSE
)
set.seed(76)

# Load packages
library(tidyverse)
library(broom)
library(knitr)
library(gridExtra)
```

# Collaboration {-}

Please indicate who you collaborated with on this assignment:

# Setup

Define the truth, which again we know for the purposes of this assignment, but which in practice we won't know:

* the true function $f(x)$, i.e. the signal
* the true epsilon, i.e. the noise, which in this case is Normal$(0, \mbox{sd} = \sigma)$. Hence the standard deviation $\sigma$ determines the amount of noise.

```{r}
f <- function(x) {
  x^2
}
sigma <- 0.3
```

This is the target point we'll be trying to predict: $(0.95, f(0.95)) = (0.95, 0.95^2) = (0.95, 0.9025)$. Thus, the test set is just $x=0.95$.

```{r}
x0 <- 0.95
test_set <- data_frame(x=x0)
```

This function generates a random sample of size $n$; think of it as a "get new data" function. Random in terms of both:

* (New) the predictor $x$ (uniform on $[0,1]$)
* the amount of noise $\epsilon$

```{r}
generate_sample <- function(f, n, sigma) {
  sample <- data_frame(
    x = runif(n = n, min = 0, max = 1),
    f_x = f(x),
    epsilon = rnorm(n = n, mean = 0, sd = sigma),
    y = f_x + epsilon
  )
  # Recall: We don't observe f(x) and epsilon, just (x, y)
  sample <- sample %>%
    select(x, y)
  return(sample)
}
```

Define:

* the number $n$ of observations $(x_i, y_i)$ in each sample. In the handout, $n=100$ to keep plots uncrowded; here we boost it to $n=500$
* the number of samples of size $n$ to consider

```{r}
n <- 500
n_samples <- 10000
```

# Computation

## Get fitted/predicted values $\widehat{f}(0.95)$

First, let's

1. Generate a new training set of observations $(x_i, y_i)$ for $i=1, \ldots, n=500$
2. Based on the above training set, fit
    a) a spline model $\widehat{y} = \widehat{f}_{2}(x)$ using degrees of freedom $df=2$
    b) a spline model $\widehat{y} = \widehat{f}_{99}(x)$ using degrees of freedom $df=99$
3. Use these models to predict the value of $f(0.95) = 0.95^2$ by computing
    a) $\widehat{y} = \widehat{f}_{2}(0.95)$
    b) $\widehat{y} = \widehat{f}_{99}(0.95)$

Repeat the above 10000 times:

```{r}
# Store predicted values here
y_hat_df_2 <- rep(0, n_samples)
y_hat_df_99 <- rep(0, n_samples)

for(i in 1:n_samples) {
  # 1. Sample a new instance of training data (x, y)
  train_set <- generate_sample(f, n, sigma)

  # 2.a) & 3.a) Fit df=2 model & predict on test set
  df_2_model <- smooth.spline(x=train_set$x, y=train_set$y, df=2)
  y_hat_df_2[i] <- predict(df_2_model, x=test_set$x)$y

  # 2.b) & 3.b) Fit df=99 model & predict on test set
  df_99_model <- smooth.spline(x=train_set$x, y=train_set$y, df=99)
  y_hat_df_99[i] <- predict(df_99_model, x=test_set$x)$y
}
```

Let's visualize

* a histogram of the `r n_samples` values of $\widehat{f}_{2}(0.95)$
* a histogram of the `r n_samples` values of $\widehat{f}_{99}(0.95)$
* a red line indicating the true $f(0.95) = 0.95^2 = 0.9025$

```{r}
y_hat_data_frame <- data_frame(
  df = c(rep("df = 2", n_samples), rep("df = 99", n_samples)),
  y_hat = c(y_hat_df_2, y_hat_df_99)
)
ggplot(y_hat_data_frame, aes(x=y_hat)) +
  geom_histogram() +
  facet_wrap(~df, nrow=1) +
  geom_vline(xintercept=f(x0), col="red", size=1) +
  labs(x="f_hat(0.95)", title="Figure 1: 10000 fitted/predicted values y_hat")
```

We observe, just as in the handout from Lec 2.7, that for

* $df=2$ we have lower variance (AKA lower standard error) and higher bias
* $df=99$ we have higher variance and lower bias (almost none in fact)
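As a quick numerical companion to Figure 1 (an added check, not part of the original handout), we can quantify these two bullets directly from the stored vectors of fitted values:

```{r}
# Added sketch: empirical bias and standard error of the 10000 fitted values
# for each model, relative to the true f(0.95)
data_frame(
  df = c("df = 2", "df = 99"),
  bias = c(mean(y_hat_df_2) - f(x0), mean(y_hat_df_99) - f(x0)),
  se = c(sd(y_hat_df_2), sd(y_hat_df_99))
) %>%
  kable(digits = 4)
```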
## Creating $y$'s

We now take our 10000 instances of $\widehat{f}_{2}(0.95)$ and $\widehat{f}_{99}(0.95)$ and evaluate the MSE. This necessitates 10000 values of $y$. What is $y$?

One source of confusion everyone encounters when doing this exercise (even I did) was:

* $\widehat{y} = \widehat{f}(x)$
* $y \neq f(0.95)$. Rather, $y = f(x) + \epsilon$

Note that the $y$'s incorporate the unsystematic error term $\epsilon$. In most real-life cases, we won't know the mechanism that generates these terms. In this exercise however, we do:

$$\epsilon \sim \mbox{Normal}\left(0, \sigma = 0.3 \right)$$

We can't use the `generate_sample()` function above, as it generates observations $(x,y)$ for many different $x$'s. However, we only want $y$'s for $x=0.95$. So let's manually construct them!

```{r}
# First the error component...
epsilon <- rnorm(n = n_samples, mean = 0, sd = sigma)
# Then the signal component...
x <- rep(0.95, times = n_samples)
f_x <- f(x)
# Now put them together...
y <- f_x + epsilon
```

Let's put all these vectors together into a single data frame:

```{r}
results <- data_frame(
  x = rep(x0, n_samples),
  f_x = f(x),
  eps = rnorm(n = n_samples, mean = 0, sd = sigma),
  y = f_x + eps
)
```

```{r}
plot1 <- ggplot(results, aes(x=eps)) +
  geom_histogram() +
  labs(x="epsilon", title="Figure 2: Error component epsilon") +
  coord_cartesian(xlim=c(-1, 2))
plot2 <- ggplot(results, aes(x=y)) +
  geom_histogram() +
  labs(x="y", title="Figure 2: Observed y") +
  geom_vline(xintercept=f(x0), col="red", size=1) +
  coord_cartesian(xlim=c(-1, 2))
grid.arrange(plot1, plot2, nrow=1)
```

## Evaluate MSE and breakdown

Let's now tack on our fitted values:

```{r}
results <- results %>%
  mutate(
    y_hat_df_2 = y_hat_df_2,
    y_hat_df_99 = y_hat_df_99
  )
```

Let's show the bias-squared/variance/irreducible error breakdown for $df=2$:

```{r}
results %>%
  mutate(y_hat = y_hat_df_2) %>%
  summarise(
    MSE = mean((y-y_hat)^2),
    bias_squared = mean((f_x-y_hat))^2,
    var = var(y_hat)
  ) %>%
  mutate(
    irreducible = sigma^2,
    sum = bias_squared + var + irreducible
  ) %>%
  kable(digits = 4)
```

Let's show the bias-squared/variance/irreducible error breakdown for $df=99$:

```{r}
results %>%
  mutate(y_hat = y_hat_df_99) %>%
  summarise(
    MSE = mean((y-y_hat)^2),
    bias_squared = mean((f_x-y_hat))^2,
    var = var(y_hat)
  ) %>%
  mutate(
    irreducible = sigma^2,
    sum = bias_squared + var + irreducible
  ) %>%
  kable(digits = 4)
```
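For reference (stated here for clarity, following Lec 2.7), the identity that the `sum` column is checking against the `MSE` column is the pointwise bias-variance decomposition at $x_0 = 0.95$:

$$
\mbox{MSE}\left[\widehat{f}(x_0)\right]
= \mathbb{E}\left[\left(y - \widehat{f}(x_0)\right)^2\right]
= \left(f(x_0) - \mathbb{E}\left[\widehat{f}(x_0)\right]\right)^2
+ \mbox{Var}\left[\widehat{f}(x_0)\right]
+ \sigma^2
$$

where the three terms on the right correspond to the `bias_squared`, `var`, and `irreducible` columns respectively.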
# Analysis

Questions:

1. Based on the topics covered in Lec 2.7, name one possible "sanity check" for your results. Name another if you can.
2. In two sentences or less, give a rough sketch of what the procedure would be to get the breakdown of $\mbox{MSE}\left[\widehat{f}(x)\right]$ for all $x$ in this example, and not just for $\mbox{MSE}\left[\widehat{f}(x_0)\right] = \mbox{MSE}\left[\widehat{f}(0.95)\right]$.
3. Which of the two models would you choose for predicting the point of interest, and why?

Answers:

1. Two possible sanity checks:
    1. That the MSE column is equal to the sum of the variance, bias-squared, and irreducible error columns, which was the point of the whole exercise.
    2. Comparing your results to the handout from Lec 2.7 (points, blue lines, red dots): that for $df=2$ you have higher bias and lower variance, and for $df=99$ you have lower bias (almost 0 in fact) and higher variance.
2. Repeat this whole process for different $x$ values. Either:
    1. Randomly sample $x$'s, repeat the above procedure, and then take the average. OR
    2. Repeat the above procedure for a grid of $x$ values (0, 0.1, 0.2, ..., 0.9, and 1, say), and then take the weighted average, where the weights are based on how frequently each of the $x$'s occurs, i.e. by the probability density function $f_X(x)$ of $x$ (note that $f_X$ here is different from the "model" $f$). In our example, since $X \sim \mbox{Uniform}(0,1)$, $f_X(x) = \frac{1}{b-a}\mathbb{1}_{\{x \in (0,1)\}} = \mathbb{1}_{\{x \in (0,1)\}}$, so all grid points get equal weight. A rough sketch of this second option appears after these answers.
3. In my opinion, since the MSEs are rather similar (up to sampling error), I'd go with the simpler model: $df=2$. But again, this depends on what your criterion is. If your criterion is unbiasedness, then go with $df=99$. But in machine learning, it might be worth tolerating a little non-zero bias in your model $\widehat{f}$ if it means a big reduction in the variance of $\widehat{f}$ from training set to training set.
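Here is that rough sketch of answer 2's second option (an added illustration, not part of the original assignment; the choice of $df=2$ and the grid are arbitrary, and the chunk is not evaluated since it is slow):

```{r, eval=FALSE}
# Added sketch: repeat the MSE computation over a grid of x values. Since
# X ~ Uniform(0,1), every grid point gets equal weight, so the weighted
# average over the grid is a plain average.
x_grid <- seq(from = 0, to = 1, by = 0.1)
MSE_by_x <- rep(0, length(x_grid))

for(k in 1:length(x_grid)) {
  y_hat <- rep(0, n_samples)
  for(i in 1:n_samples) {
    train_set <- generate_sample(f, n, sigma)
    spline_model <- smooth.spline(x = train_set$x, y = train_set$y, df = 2)
    y_hat[i] <- predict(spline_model, x = x_grid[k])$y
  }
  # Observed y's at this particular x
  y_k <- f(x_grid[k]) + rnorm(n_samples, mean = 0, sd = sigma)
  MSE_by_x[k] <- mean((y_k - y_hat)^2)
}

# Overall MSE: an equal-weight average over the grid
mean(MSE_by_x)
```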
# Extra: (This takes a long time to knit!)

Let's repeat the above, but not just for $df=2$ and $df=99$: rather, for a more refined grid of values: 2, 7, 12, 17, ..., 82, 87, 92, and 97. We then plot the results.

```{r}
# Note: this takes a while!
df_vector <- seq(from=2, to=99, by=5)
MSE <- rep(0, length(df_vector))
bias_sq <- rep(0, length(df_vector))
variance <- rep(0, length(df_vector))
irreducible <- rep(sigma^2, length(df_vector))
sum_of_components <- rep(0, length(df_vector))

# Create observed y's that incorporate both signal and noise
y <- f(0.95) + rnorm(n_samples, mean=0, sd=sigma)

for(j in 1:length(df_vector)){
  y_hat <- rep(0, n_samples)
  for(i in 1:n_samples) {
    # 1. Sample a new instance of training data (x, y)
    train_set <- generate_sample(f, n, sigma)
    # 2. Fit spline model with df = df_vector[j] & predict on test set
    spline_model <- smooth.spline(x=train_set$x, y=train_set$y, df=df_vector[j])
    y_hat[i] <- predict(spline_model, x=test_set$x)$y
  }
  MSE[j] <- mean((y-y_hat)^2)
  bias_sq[j] <- (mean(f(0.95) - y_hat))^2
  variance[j] <- var(y_hat)
  sum_of_components[j] <- bias_sq[j] + variance[j] + irreducible[j]
}
```

```{r}
data_frame(
  df = df_vector,
  MSE = MSE,
  bias_sq = bias_sq,
  variance = variance,
  irreducible = irreducible,
  sum = sum_of_components
) %>%
  # Convert to tidy format
  gather(type, value, -df) %>%
  mutate(type = factor(type, levels=c("bias_sq", "variance", "irreducible", "sum", "MSE"))) %>%
  ggplot(aes(x=df, y=value, col=type)) +
  geom_line() +
  labs(x="degrees of freedom", y="value", title="MSE bias-variance breakdown")
```

Note:

* The blue curve "sum" is the sum of the values of the red (bias-squared), yellow (variance), and green (irreducible error $\sigma^2$, i.e. constant) curves.
* The purple curve "MSE" tracks the blue curve rather closely.
* However, I'm at a loss as to why they seem to be off by a roughly constant amount! Could it be sampling variation? I'm not sure!
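One guess (an added check, not part of the original write-up, and a hypothesis rather than a definitive diagnosis): since the same simulated vector `y` is reused for every value of `df`, the realized noise level `mean((y - f(0.95))^2)` differs from the theoretical $\sigma^2 = 0.09$ by a single sampling deviation, and that one deviation enters every `MSE[j]`. This would shift the whole "MSE" curve relative to the "sum" curve (which uses the theoretical $\sigma^2$) by roughly a constant. Its size can be checked directly:

```{r}
# Added check (a guess): the common offset should be roughly the gap between
# the realized and the theoretical noise variance, since the same y vector
# enters every MSE[j]
mean((y - f(0.95))^2) - sigma^2
```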