---
title: "STAT/MATH 495: Problem Set 04"
author: "Albert Y. Kim"
date: "2017-10-03"
output:
  html_document:
    toc: true
    toc_float: true
    toc_depth: 2
    collapsed: false
    smooth_scroll: false
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.width=8, fig.height=4.5, message=FALSE, warning = FALSE)
set.seed(76)
```
# Collaboration
Please indicate who you collaborated with on this assignment:
# Load packages, data, model formulas
```{r, echo=FALSE, warning=FALSE}
library(tidyverse)
library(broom)
library(knitr)
credit <- read_csv("http://www-bcf.usc.edu/~gareth/ISL/Credit.csv") %>%
  select(-X1) %>%
  mutate(ID = 1:n()) %>%
  select(ID, Balance, Income, Limit, Rating, Age, Cards, Education)
```
You will train the following 7 models on `credit_train`...
```{r}
model1_formula <- as.formula("Balance ~ 1")
model2_formula <- as.formula("Balance ~ Income")
model3_formula <- as.formula("Balance ~ Income + Limit")
model4_formula <- as.formula("Balance ~ Income + Limit + Rating")
model5_formula <- as.formula("Balance ~ Income + Limit + Rating + Age")
model6_formula <- as.formula("Balance ~ Income + Limit + Rating + Age + Cards")
model7_formula <- as.formula("Balance ~ Income + Limit + Rating + Age + Cards + Education")
```
... where `credit_train` is defined below, along with `credit_test`.
```{r}
set.seed(79)
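# Training set: a random sample of 20 of the 400 rows of credit.
# Test set: the remaining 380 rows, i.e. those whose ID does not appear in credit_train.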
credit_train <- credit %>%
  sample_n(20)
credit_test <- credit %>%
  anti_join(credit_train, by="ID")
```
# Simple case: Model 2
Recall that we will

* Always fit our model to `credit_train`
* Make predictions $\widehat{balance}$ and compare them to the true $balance$ on
    + `credit_test`, as is typical, i.e. fit and evaluate the model on different data
    + `credit_train`, in violation of the idea of fitting and evaluating the model on different data
## Fit model
We fit model 2 (using `Income` as the only predictor) and get the estimates $\widehat{\beta}_{income}$ of $\beta_{income}$ and $\widehat{\beta}_0$ of the intercept $\beta_0$:
```{r}
model2 <- lm(model2_formula, data=credit_train)
model2 %>%
  tidy() %>%
  kable(digits=2)
```
So, for example, the first observation in `credit_train` (which has `ID == 347`) has

+ Income = 21.551 (in tens of thousands of dollars, presumably)
+ An observed balance of 907
Thus
$$
\begin{align}
\widehat{y} &= \widehat{balance} = \widehat{\beta}_{0} + \widehat{\beta}_{income} \times income = 314.041 + 3.865 \times 21.551 = 397.35\\
y - \widehat{y} &= balance - \widehat{balance} = 907 - 397.35 = 509.65
\end{align}
$$
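A quick sanity check of the arithmetic above: we can plug the fitted coefficients of `model2` back in ourselves. This is just a sketch; `coefs`, `obs`, and `y_hat` are throwaway names, not part of the assignment.

```{r}
# Redo the hand calculation for ID == 347 using model2's fitted coefficients
coefs <- coef(model2)
obs <- credit_train %>%
  filter(ID == 347)
y_hat <- unname(coefs["(Intercept)"] + coefs["Income"] * obs$Income)
c(fitted = y_hat, residual = obs$Balance - y_hat)
```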
## Evaluate model via RMSE on training data
Let's automate the above task for all observations in `credit_train`. Note the
first row matches our calculation above.
```{r}
model2 %>%
  augment() %>%
  select(Balance, Income, .fitted, .resid) %>%
  head(3) %>%
  kable(digits=3)
```
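Recall that the root mean squared error over the $n$ observations being evaluated is

$$
RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \widehat{y}_i\right)^2} = \sqrt{MSE}
$$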
and thus the RMSE is
```{r}
model2 %>%
  augment() %>%
  summarise(MSE = mean((Balance - .fitted)^2)) %>%
  mutate(RMSE = sqrt(MSE)) %>%
  kable(digits=3)
```
Let's sanity check our RMSE via a graphic. It's hard to say definitively whether the fit is valid, but there
isn't any overwhelming evidence to suggest it's not!
```{r, echo=FALSE}
model2 %>%
  augment() %>%
  ggplot(aes(x=Income)) +
  geom_point(aes(y=Balance)) +
  geom_line(aes(y=.fitted), col="blue") +
  labs(title="Credit Training Data + Fitted Model 2")
```
## Evaluate model via RMSE on test data
To get predicted/fitted values for `credit_test` (a different data set from the one used to fit the model) and evaluate performance via RMSE, as illustrated in the Quickstart guide, we use `augment(newdata = credit_test)`:
```{r}
model2 %>%
  augment(newdata = credit_test) %>%
  summarise(MSE = mean((Balance - .fitted)^2)) %>%
  mutate(RMSE = sqrt(MSE)) %>%
  kable(digits=2)
```
# RMSE vs number of coefficients
```{r, echo=TRUE, warning=FALSE, message=FALSE}
# Placeholder vectors of length 7. For now, I've filled them with arbitrary
# values; you will fill these in
RMSE_train <- runif(n=7)
RMSE_test <- runif(n=7)
# Do your work here:
model1 <- lm(model1_formula, data=credit_train)
RMSE_train[1] <- model1 %>%
  broom::augment() %>%
  summarise(RMSE = sqrt(mean((Balance - .fitted)^2))) %>%
  pull(RMSE)
RMSE_test[1] <- model1 %>%
  broom::augment(newdata=credit_test) %>%
  summarise(RMSE = sqrt(mean((Balance - .fitted)^2))) %>%
  pull(RMSE)
```
We repeat the above for all 6 remaining models and plot the RMSE over the number
of coefficients used in each model.
```{r, echo=FALSE, warning=FALSE, message=FALSE}
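# Models 2 through 7 below: same recipe as model 1 above, i.e. fit the model to
# credit_train, then compute the RMSE on the training data and on credit_test.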
model2 <- lm(model2_formula, data=credit_train)
RMSE_train[2] <- model2 %>%
  broom::augment() %>%
  summarise(RMSE = sqrt(mean((Balance - .fitted)^2))) %>%
  pull(RMSE)
RMSE_test[2] <- model2 %>%
  broom::augment(newdata=credit_test) %>%
  summarise(RMSE = sqrt(mean((Balance - .fitted)^2))) %>%
  pull(RMSE)
model3 <- lm(model3_formula, data=credit_train)
RMSE_train[3] <- model3 %>%
  broom::augment() %>%
  summarise(RMSE = sqrt(mean((Balance - .fitted)^2))) %>%
  pull(RMSE)
RMSE_test[3] <- model3 %>%
  broom::augment(newdata=credit_test) %>%
  summarise(RMSE = sqrt(mean((Balance - .fitted)^2))) %>%
  pull(RMSE)
model4 <- lm(model4_formula, data=credit_train)
RMSE_train[4] <- model4 %>%
  broom::augment() %>%
  summarise(RMSE = sqrt(mean((Balance - .fitted)^2))) %>%
  pull(RMSE)
RMSE_test[4] <- model4 %>%
  broom::augment(newdata=credit_test) %>%
  summarise(RMSE = sqrt(mean((Balance - .fitted)^2))) %>%
  pull(RMSE)
model5 <- lm(model5_formula, data=credit_train)
RMSE_train[5] <- model5 %>%
  broom::augment() %>%
  summarise(RMSE = sqrt(mean((Balance - .fitted)^2))) %>%
  pull(RMSE)
RMSE_test[5] <- model5 %>%
  broom::augment(newdata=credit_test) %>%
  summarise(RMSE = sqrt(mean((Balance - .fitted)^2))) %>%
  pull(RMSE)
model6 <- lm(model6_formula, data=credit_train)
RMSE_train[6] <- model6 %>%
  broom::augment() %>%
  summarise(RMSE = sqrt(mean((Balance - .fitted)^2))) %>%
  pull(RMSE)
RMSE_test[6] <- model6 %>%
  broom::augment(newdata=credit_test) %>%
  summarise(RMSE = sqrt(mean((Balance - .fitted)^2))) %>%
  pull(RMSE)
model7 <- lm(model7_formula, data=credit_train)
RMSE_train[7] <- model7 %>%
  broom::augment() %>%
  summarise(RMSE = sqrt(mean((Balance - .fitted)^2))) %>%
  pull(RMSE)
RMSE_test[7] <- model7 %>%
  broom::augment(newdata=credit_test) %>%
  summarise(RMSE = sqrt(mean((Balance - .fitted)^2))) %>%
  pull(RMSE)
# Save results in a data frame. Note this data frame is in wide format.
results <- data_frame(
  num_coefficients = 1:7,
  RMSE_train,
  RMSE_test
)
# Some cleaning of results
results <- results %>%
  # More intuitive names:
  rename(
    `Training data` = RMSE_train,
    `Test data` = RMSE_test
  ) %>%
  # Convert results data frame to "tidy" data format i.e. long format, so that we
  # can ggplot it
  tidyr::gather(type, RMSE, -num_coefficients)
plot1 <- ggplot(results, aes(x=num_coefficients, y=RMSE, col=type)) +
  geom_line() +
  labs(x="# of coefficients", y="RMSE",
       col="Data used to evaluate \nperformance of fitted model",
       title = "RMSE over # of coefficients for training data with n=20")
plot1
```
# Interpret the graph
Compare and contrast the two curves and hypothesize as to the root cause of any differences.
What's happening is:

* As we start increasing the number of coefficients (i.e. the number of
  predictors), the error on both the training set and the test set goes down. In
  other words, our model is capturing more and more of the "signal". With only 2
  coefficients (intercept and Income), you could say the model is still
  "underfitting" the data.
* Past a certain point, however:
    + For the training set (blue), the model gets "overfit" to this particular
      dataset: it becomes overly specific to (overly optimized for) these 20
      points, so it is no longer fitting just the "signal" but also the noise.
    + This overfitting is apparent in the increase in error for the test data
      (red). Because the model is overfit to the training data, it does not do
      well on new, out-of-sample data (the quick check after this list pins down
      where the test-set curve bottoms out).
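As a quick check (a minimal sketch, assuming the long-format `results` data frame built in the chunk above is still in the environment), we can ask which number of coefficients gives the smallest test-set RMSE:

```{r}
# Which number of coefficients minimizes RMSE on the test data?
results %>%
  filter(type == "Test data") %>%
  filter(RMSE == min(RMSE))
```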
# Bonus
Repeat the whole process, but let `credit_train` be a random sample of size 380
from `credit` instead of 20. Now compare and contrast this graph with the
one above and hypothesize as to the root cause of any differences.
```{r, echo=FALSE}
set.seed(79)
credit_train <- credit %>%
  sample_n(380)
credit_test <- credit %>%
  anti_join(credit_train, by="ID")
# model_formulas is a list, not an atomic vector: c() returns a list here because
# atomic vectors can only hold basic types like numerics, characters, and logicals.
model_formulas <- c(model1_formula, model2_formula, model3_formula,
                    model4_formula, model5_formula, model6_formula,
                    model7_formula)
# To access one particular model formula, use [[ ]] since model_formulas is
# a list. Ex: model_formulas[[5]]
RMSE_train <- runif(n=7)
RMSE_test <- runif(n=7)
for(i in 1:length(model_formulas)){
  model <- lm(model_formulas[[i]], data=credit_train)
  RMSE_train[i] <- model %>%
    broom::augment() %>%
    summarise(RMSE = sqrt(mean((Balance - .fitted)^2))) %>%
    pull(RMSE)
  RMSE_test[i] <- model %>%
    broom::augment(newdata=credit_test) %>%
    summarise(RMSE = sqrt(mean((Balance - .fitted)^2))) %>%
    pull(RMSE)
}
# Save results in a data frame. Note this data frame is in wide format.
results <- data_frame(
  num_coefficients = 1:7,
  RMSE_train,
  RMSE_test
)
# Some cleaning of results
results <- results %>%
  # More intuitive names:
  rename(
    `Training data` = RMSE_train,
    `Test data` = RMSE_test
  ) %>%
  # Convert results data frame to "tidy" data format i.e. long format, so that
  # we can ggplot it
  tidyr::gather(type, RMSE, -num_coefficients)
plot2 <- ggplot(results, aes(x=num_coefficients, y=RMSE, col=type)) +
  geom_line() +
  labs(x="# of coefficients", y="RMSE",
       col="Data used to evaluate \nperformance of fitted model",
       title = "RMSE over # of coefficients for training data with n=380")
plot2
```
Now the difference between the training error and the test error isn't as pronounced! This is because, all else being equal, overfitting is a bigger issue when you use fewer points to train your model. In other words, when

* `credit_train` had n=20: more issues with overfitting
* `credit_train` had n=380: fewer issues with overfitting

Why? You'll see when we eventually cover the *bias-variance tradeoff*!
# Caveat
Please note this is somewhat of an artificial exercise; in particular, how did I determine the order in which to add/include variables when going from model 1 to model 7? We'll address this when we cover the LASSO method: see the "Shrinking $\beta$ Coefficients via LASSO" section of this [Shiny app](https://beta.rstudioconnect.com/connect/#/apps/2692/) for a preview.