# Module 6: Autograded Assignment

### Outline:
**Here are the objectives of this assignment:**

1. Apply model selection techniques to various data sets.
2. Learn how to calculate and interpret different model selection criteria.
3. Prove to yourself that you have learned how to apply, interpret and optimize statistical models.
4. Apply variance inflation factors to analyze multicollinearity issues.

**Here are some general tips:**

1. Read the questions carefully to understand what is being asked.
2. When you feel that your work is complete, hit the ```Validate``` button to see your results on the *visible* unit tests. If you have questions about unit testing, please refer to the "Module 0: Introduction" notebook provided as an optional resource for this course. This assignment also contains hidden unit tests that check your code. You will not receive any feedback on failed hidden unit tests until the assignment is submitted. **Do not misinterpret the feedback from visible unit tests as all possible tests for a given question--write your code carefully!**
3. Before submitting, we recommend restarting the kernel and running all the cells in the order in which they appear to make sure that there are no additional bugs in your code.
4. There are 70 total points in this assignment.

In [None]:
# This cell loads the required packages
library(testthat)
library(tidyverse)
library(ggplot2)
library(leaps)
library(MASS)
library(regclass)
library(faraway)

## Problem 1: Model Selection Criteria

In this problem, we will compare a full and a partial model using AIC, BIC, and adjusted $R^2$, and then check the comparison with a partial F-test in R.

Recall the Amazon book data. The dataset contains $n = 325$ books and includes measurements of:

- `aprice`: The price listed on Amazon (dollars)


- `lprice`: The book's list price (dollars)


- `weight`: The book's weight (ounces)


- `pages`: The number of pages in the book


- `height`: The book's height (inches)


- `width`: The book's width (inches)


- `thick`: The thickness of the book (inches)


- `cover`: Whether the book is a hardcover or a paperback.


- And other variables...

Before we do any model selection, we'll repeat the data cleaning methods from the previous lesson on this dataset. For all tests in this lesson, let $\alpha = 0.05$.

In [None]:
amazon = read.csv("amazon.txt", sep="\t")
df = data.frame(aprice = amazon$Amazon.Price, lprice = as.numeric(amazon$List.Price),  
                pages = amazon$NumPages, width = amazon$Width, weight = amazon$Weight..oz,  
                height = amazon$Height, thick = amazon$Thick, cover = amazon$Hard..Paper)

df$lprice[which(is.na(df$lprice))] = mean(df$lprice, na.rm = TRUE)
df$weight[which(is.na(df$weight))] = mean(df$weight, na.rm = TRUE)
df$pages[which(is.na(df$pages))] = mean(df$pages, na.rm = TRUE)
df$height[which(is.na(df$height))] = mean(df$height, na.rm = TRUE)
df$width[which(is.na(df$width))] = mean(df$width, na.rm = TRUE)
df$thick[which(is.na(df$thick))] = mean(df$thick, na.rm = TRUE)
df = df[-205,]
summary(df)

### 1. (a) The Model (15 points)

We want to determine which predictors impact the price listed on Amazon (`aprice`). Begin by fitting the full model.

Fit a model named `lmod.full` to the data with `aprice` as the response and all other columns as predictors. Then calculate the AIC, BIC and adjusted $R^2$ for this model. Store these values in `AIC.full`, `BIC.full` and `adj.R2.full` respectively. 
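
As a reminder of the mechanics, the three criteria can be read off any fitted `lm` object. The sketch below is illustrative only: it uses R's built-in `mtcars` data rather than the Amazon data, and the names `fit`, `aic_val`, `bic_val`, and `adj_r2_val` are made up for this example.

```r
# Illustrative only: built-in mtcars data, not the assignment data.
# Fit a linear model with mpg as the response and all other columns as predictors.
fit <- lm(mpg ~ ., data = mtcars)

aic_val    <- AIC(fit)                    # Akaike information criterion
bic_val    <- BIC(fit)                    # Bayesian information criterion
adj_r2_val <- summary(fit)$adj.r.squared  # adjusted R^2 from the model summary

print(c(AIC = aic_val, BIC = bic_val, adj.R2 = adj_r2_val))
```

The same three calls apply to any model fitted with `lm()`.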

In [None]:
AIC.full = NA
BIC.full = NA
adj.R2.full = NA

# your code here


In [None]:
# Test Cell
# Check that the correct number of predictors were used in the model.
if(test_that("Check number of model parameters.", expect_equal(length(lmod.full$coefficients), 8))){
    print("Correct number of parameters in the model.")
}else{
    print("Make sure you're not using the Port column!")
}
# This cell has hidden test cases that will run after submission.

In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

### 1. (b) A Partial Model (15 points)

Fit a partial model named `lmod.part` to the data, with `aprice` as the response and `lprice` and `pages` as predictors. Calculate the AIC, BIC and adjusted $R^2$ for this partial model. Store their values in `AIC.part`, `BIC.part` and `adj.R2.part` respectively.

In [None]:
AIC.part = NA
BIC.part = NA
adj.R2.part = NA

# your code here


In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

### 1. (c) Model Selection (9 points)

Which model is better, `lmod.full` or `lmod.part`, according to AIC, BIC, and $R^2_a$? Note that the answer may or may not be different across the different criteria. Save your selections as `selected.model.AIC`, `selected.model.BIC`, and `selected.model.adj.R2`.
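
Keep the direction of each criterion in mind: smaller AIC and BIC are preferred, while larger adjusted $R^2$ is preferred. A minimal sketch of such a comparison follows, again on the built-in `mtcars` data and with hypothetical model names (`full`, `part`) that are not the assignment's objects.

```r
# Illustrative only: two nested models on the built-in mtcars data.
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
part <- lm(mpg ~ wt + hp, data = mtcars)

# Collect the three criteria side by side.
criteria <- data.frame(
  model  = c("full", "part"),
  AIC    = c(AIC(full), AIC(part)),
  BIC    = c(BIC(full), BIC(part)),
  adj.R2 = c(summary(full)$adj.r.squared, summary(part)$adj.r.squared),
  stringsAsFactors = FALSE
)
print(criteria)

# Lower is better for AIC and BIC; higher is better for adjusted R^2.
best_by_aic <- criteria$model[which.min(criteria$AIC)]
best_by_bic <- criteria$model[which.min(criteria$BIC)]
best_by_adj <- criteria$model[which.max(criteria$adj.R2)]
```

Because BIC penalizes extra parameters more heavily than AIC once $\log(n) > 2$, the criteria need not agree.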

In [None]:
selected.model.AIC = NA
selected.model.BIC = NA
selected.model.adj.R2 = NA
# your code here


In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

### 1. (d) Model Validation (6 points)

Recall that a simpler model may perform statistically worse than a larger model. Test whether there is a statistically significant difference between `lmod.part` and `lmod.full`. Based on the result of this test, which model should you use? Save your answer as `validated.model`.
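
One standard way to compare two nested linear models is a partial F-test via `anova()`. The sketch below assumes that is the intended test and uses the built-in `mtcars` data; the names `reduced`, `full`, `ftest`, and `reject_reduced` are hypothetical.

```r
# Illustrative only: partial F-test for nested models on the built-in mtcars data.
reduced <- lm(mpg ~ wt + hp, data = mtcars)
full    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# anova() with the reduced model first tests H0: the extra coefficients are all zero.
ftest <- anova(reduced, full)
print(ftest)

alpha   <- 0.05
p_value <- ftest[["Pr(>F)"]][2]   # p-value is in the second row of the table

# TRUE means the reduced model is rejected in favor of the full model at level alpha.
reject_reduced <- p_value < alpha
```

If the p-value is at least $\alpha$, the test gives no evidence that the extra predictors are needed, and the simpler model is retained.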

In [None]:
validated.model = NA

# your code here


In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

## Problem 2

`divorce` is a data frame with 77 observations on the following 7 variables.

1. `year`: the year from 1920-1996

2. `divorce`: divorce per 1000 women aged 15 or more 

3. `unemployed`: unemployment rate

4. `femlab`: percent female participation in labor force aged 16+

5. `marriage`: marriages per 1000 unmarried women aged 16+ 

6. `birth`: births per 1000 women aged 15-44 

7. `military`: military personnel per 1000 population

Here's the data:

In [None]:
# Load in the data
divorce = read.csv("divusa.txt", sep="\t")
summary(divorce)
head(divorce)

### 2. (a) (10 points)

Using the `divorce` data, with `divorce` as the response and all other variables as predictors, select the "best" regression model, where "best" is defined using AIC. Save your final model as `lm_divorce`.
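
One common way to search for a low-AIC model is backward stepwise selection with `step()` from base R's `stats` package. Whether the assignment expects a stepwise or an exhaustive search is not stated here, so treat the sketch below as one possible approach, shown on the built-in `mtcars` data with hypothetical names `full` and `selected`.

```r
# Illustrative only: backward stepwise selection by AIC on the built-in mtcars data.
full <- lm(mpg ~ ., data = mtcars)

# step() drops one term at a time while doing so lowers the AIC; trace = 0 silences the log.
selected <- step(full, direction = "backward", trace = 0)

print(formula(selected))                          # predictors that survived
print(c(full = AIC(full), selected = AIC(selected)))
```

Stepwise search is greedy, so it is not guaranteed to find the model with the smallest AIC among all subsets.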

In [None]:
lm_divorce = NA

# your code here



In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

### 2. (b) (10 points)

Using your model from part (a), compute the variance inflation factors (VIFs) for each $\widehat\beta_j$, $j = 1, \ldots, p$. Store them in the variable `v`. Also, compute the condition number for the design matrix and store it in `k`. If using the `kappa()` function, you might need to specify `exact = TRUE`. Is there evidence that collinearity causes some predictors not to be significant?
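
For reference, the VIF of predictor $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ is the $R^2$ from regressing predictor $j$ on the remaining predictors. The sketch below computes this by hand on the built-in `mtcars` data so that the definition is visible; packaged helpers such as `faraway::vif()` or `regclass::VIF()` can be used instead. The names `fit`, `X`, `vif_manual`, and `cond_num` are hypothetical.

```r
# Illustrative only: manual VIFs and a condition number on the built-in mtcars data.
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Predictor columns of the design matrix, with the intercept column dropped.
X <- model.matrix(fit)[, -1, drop = FALSE]

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others.
vif_manual <- sapply(colnames(X), function(j) {
  others <- X[, colnames(X) != j, drop = FALSE]
  r2 <- summary(lm(X[, j] ~ others))$r.squared
  1 / (1 - r2)
})
print(vif_manual)

# Exact 2-norm condition number of the full design matrix (intercept included).
cond_num <- kappa(model.matrix(fit), exact = TRUE)
print(cond_num)
```

Common rules of thumb flag VIFs above roughly 5 or 10 as a sign of problematic collinearity. Note that the condition number depends on whether the intercept column is included and on how the predictors are scaled.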

In [None]:
# your code here


In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

### 2. (c) (5 points) 

Remove the predictor with the highest VIF. Is multicollinearity still present in the model? If yes, store `TRUE` in `prob.2.c`, and `FALSE` otherwise.
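
A minimal sketch of this drop-and-recheck step, on the built-in `mtcars` data with hypothetical names (`fit`, `worst`, `fit_reduced`), might look like the following. It relies on the fact that, for numeric predictors, the VIFs equal the diagonal of the inverse of the predictors' correlation matrix.

```r
# Illustrative only: drop the highest-VIF predictor and recompute VIFs (mtcars data).
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
X <- model.matrix(fit)[, -1, drop = FALSE]

# VIFs as the diagonal of the inverse correlation matrix of the predictors.
vifs <- diag(solve(cor(X)))
names(vifs) <- colnames(X)

# Name of the predictor with the largest VIF.
worst <- names(which.max(vifs))

# Refit without that predictor; update() keeps the response and the other terms.
fit_reduced <- update(fit, as.formula(paste(". ~ . -", worst)))

# Recompute VIFs for the reduced model.
X_red <- model.matrix(fit_reduced)[, -1, drop = FALSE]
vifs_reduced <- diag(solve(cor(X_red)))
names(vifs_reduced) <- colnames(X_red)
print(vifs_reduced)
```

Comparing `vifs_reduced` with the original `vifs` shows whether removing one predictor was enough to bring the remaining values down.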

In [None]:
prob.2.c = NA

# your code here


In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.