# Module 3 - Autograded Assignment

### Outline:
**Here are the objectives of this assignment:**

1. Utilize F-tests to distinguish between statistically different models.
2. Calculate Confidence Intervals for feature parameters to understand their variability.
3. Reinforce an understanding of Confidence Intervals by comparing many different CIs from the same underlying population.
4. Improve general familiarity with R, including utilizing data frames and ggplot.

**Here are some general tips:**

1. Read the questions carefully to understand what is being asked.
2. When you feel that your work is complete, hit the `Validate` button to see your results on the *visible* unit tests. If you have questions about unit testing, please refer to the "Module 0: Introduction" notebook provided as an optional resource for this course. This assignment also has hidden unit tests that check your code. You will not receive any feedback on failed hidden unit tests until the assignment is submitted. **Do not mistake the feedback from the visible unit tests for all possible tests on a given question; write your code carefully!**
3. Before submitting, we recommend restarting the kernel and running all the cells in the order that they appear to make sure that there are no additional bugs in your code.
4. There are 50 points total in this assignment.

In [None]:
# This cell loads the necessary libraries for this assignment
library(testthat)
library(tidyverse)
library(RCurl)  # provides getURL(), which allows reading data from GitHub
library(ggplot2)

# Problem 1: Comparing Models 

In this exercise, we will fit multiple different models to the same data and determine which of those models we should ultimately use.

The data we will be using is the Auto MPG Data Set from the UCI Machine Learning Repository. It contains technical specifications and performance ratings of many different cars. We will focus on the features that impact the overall `mpg` of each car.

In the cell below, code is provided for you to load in the data and rename the columns to be more specific.

In [None]:
mpg.data = read_table("auto-mpg.data")
names(mpg.data) = c("mpg", "cylinders", "displacement", "horsepower", "weight", 
                    "accel", "model_year", "origin", "car_name")
mpg.data$horsepower = as.numeric(mpg.data$horsepower)
mpg.data = na.omit(mpg.data)

summary(mpg.data)
str(mpg.data)
head(mpg.data)

#### 1. (a) Three Different Models (5 points)

We will fit three different models to this data:

1. `mod.1`: Fits `mpg` as the response with `weight` as the predictor.
2. `mod.2`: Fits `mpg` as the response with `weight` and `accel` as predictors.
3. `mod.3`: Fits `mpg` as the response with `weight`, `accel` and `horsepower` as predictors.

Fit these models in the cell below.
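
As a minimal sketch of the `lm()` formula syntax, the block below fits three nested models to R's built-in `mtcars` data rather than to the assignment data. The names `fit.a`, `fit.b` and `fit.c`, and the choice of predictors, are illustrative assumptions only.

```r
# Illustration on the built-in mtcars data (not the assignment data).
# Each model adds one predictor to the previous one, so the models are nested.
fit.a = lm(mpg ~ wt, data = mtcars)
fit.b = lm(mpg ~ wt + qsec, data = mtcars)
fit.c = lm(mpg ~ wt + qsec + hp, data = mtcars)

# Number of estimated coefficients (intercept included) in each model
n.coef = sapply(list(fit.a, fit.b, fit.c), function(m) length(coef(m)))
print(n.coef)
```

Predictors on the right-hand side of the formula are separated by `+`, and the `data` argument tells `lm()` where to look up the column names.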

In [None]:
mod.1 = NA
mod.2 = NA
mod.3 = NA
# your code here


In [None]:
# Test Cell
# Make sure that each model is a linear model
if(test_that("Testing model types", 
             {(expect_is(mod.1, "lm"))
              (expect_is(mod.2, "lm"))
              (expect_is(mod.3, "lm"))})){
    print("All models are linear models.")
}else{
    print("At least one of the models isn't a linear model!")
    print("Make sure you're using the lm() function.")
}
# This cell has hidden test cases that will run after submission.

#### 1. (b) Partial F-Tests (10 points)

Compare the 3 models using pairwise F-tests to determine which of the three we should use moving forward. It may be helpful to write out the null and alternative hypotheses for these tests.

Copy your selected model into the `final.model` variable.
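
One common way to run a partial F-test in R is to pass a reduced model and a full model to `anova()`. The sketch below uses the built-in `mtcars` data and the hypothetical names `reduced` and `full`; it also recomputes the F statistic by hand from the residual sums of squares, so the connection to the formula is visible. It is an illustration, not the solution to this problem.

```r
# Illustration on the built-in mtcars data (not the assignment data).
reduced = lm(mpg ~ wt, data = mtcars)
full = lm(mpg ~ wt + qsec, data = mtcars)

# H0: the coefficient on qsec is 0 (the reduced model is adequate)
# H1: the coefficient on qsec is not 0
f.test = anova(reduced, full)
print(f.test)

# The test statistic and p-value sit in the second row of the table
f.stat = f.test[["F"]][2]
p.value = f.test[["Pr(>F)"]][2]

# The same F statistic computed from the residual sums of squares
rss.reduced = sum(residuals(reduced)^2)
rss.full = sum(residuals(full)^2)
df.diff = df.residual(reduced) - df.residual(full)
f.manual = ((rss.reduced - rss.full) / df.diff) / (rss.full / df.residual(full))
```

A small p-value is evidence against the reduced model; a large one suggests the extra predictor does not improve the fit enough to justify keeping it.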

In [None]:
final.model = NA
# your code here


In [None]:
# Test Cell
if(test_that("Check final.model class", {expect_is(final.model, "lm")})){
    print("You've selected a model! Make sure you're confident in your answer.")
}else{
    print("final.model is not a linear model.")
    print("To copy the selected model use `final.model = mod.#`")
}
# This cell has hidden test cases that will run after submission.

#### 1. (c) Coefficient Confidence Intervals (10 points)

Using your selected best model, calculate a $95\%$ confidence interval for the `weight` parameter. Save the lower and upper values into `weight.CI.lower` and `weight.CI.upper` respectively.
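
As a hedged sketch of how such an interval can be obtained, the block below calls `confint()` on a model fit to the built-in `mtcars` data and then reproduces the same bounds from the estimate, its standard error and a t quantile. The model and the names `ci.lower` and `ci.upper` are assumptions made for illustration.

```r
# Illustration on the built-in mtcars data (not the assignment data).
fit = lm(mpg ~ wt + qsec, data = mtcars)

# confint() returns a matrix with one row per parameter: lower, upper
ci = confint(fit, parm = "wt", level = 0.95)
ci.lower = ci[1, 1]
ci.upper = ci[1, 2]

# The same interval by hand: estimate +/- t quantile * standard error
est = coef(fit)["wt"]
se = summary(fit)$coefficients["wt", "Std. Error"]
t.crit = qt(0.975, df = df.residual(fit))
manual = c(est - t.crit * se, est + t.crit * se)
```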

In [None]:
weight.CI.lower = NA
weight.CI.upper = NA

# your code here


In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

#### 1. (d) Model Comparison (5 points)

So far, we've used the F-test as a way to choose a "best" model among the three proposed. Now let's compare the models according to their mean squared errors (MSE). Compute the MSE for each of the three models and save their values into their respective `MSE.#` variables.

Which of these models has the best MSE? Do these conclusions agree with the model you selected in part **1.b**? Think about why or why not.
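
Textbooks differ on the denominator used for MSE: some divide the residual sum of squares by the number of observations $n$, others by the residual degrees of freedom. The sketch below computes both on the built-in `mtcars` data, with hypothetical variable names, and makes no claim about which convention the hidden tests expect.

```r
# Illustration on the built-in mtcars data (not the assignment data).
fit = lm(mpg ~ wt + qsec, data = mtcars)

rss = sum(residuals(fit)^2)   # residual sum of squares
n = nrow(mtcars)

mse.n = rss / n                   # divides by the number of observations
mse.df = rss / df.residual(fit)   # divides by n minus the number of coefficients

c(mse.n = mse.n, mse.df = mse.df)
```

Note that the residual sum of squares on the fitting data cannot increase when a predictor is added to a nested model, which is worth keeping in mind when comparing MSE values across models.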

In [None]:
MSE.1 = NA
MSE.2 = NA
MSE.3 = NA

# your code here


In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

# Problem 2: Large Datasets and Significance

For this exercise, we will see if we can create a "good" regression model for a city's temperature using other weather data. The data consists of hourly weather records for Szeged, Hungary, from 2006 to 2016. The data was provided by [Darksky.net](https://darksky.net/forecast/46.2543,20.1484/us12/en) and can be found on Kaggle [here](https://www.kaggle.com/budincsevity/szeged-weather). The data has not been modified in any way.

The data is loaded in the cell below.

In [None]:
# Load in the data
weather.data = read.csv("weatherHistory.csv")
weather.data = na.omit(weather.data)
head(weather.data)


#### 2. (a) Talking about the weather. (5 points)

Before we jump into modeling, let's think about weather. Is temperature correlated with wind speed, visibility or pressure? Certainly somewhat, but probably not to a great extent. Let's find out exactly (at least for these data).

Determine the correlation between `Temperature..C.` and the three predictors: `Wind.Speed..km.h.`, `Visibility..km.` and `Pressure..millibars.`. Store these values in `cor.speed`, `cor.vis` and `cor.pres` respectively.

Also, if our data is hourly records over 10 years, then we're going to have a lot of records. How many rows does our dataset have? Store this value in `data.n`.
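
For reference, a minimal sketch of `cor()` and `nrow()` on the built-in `mtcars` data is given below. The column choices and variable names are illustrative assumptions and do not correspond to the weather data.

```r
# Illustration on the built-in mtcars data (not the assignment data).
# Pearson correlation between two columns
cor.wt = cor(mtcars$mpg, mtcars$wt)

# Correlation of one column with several others in a single call;
# the result is a 1-row matrix with one column per predictor
cors = cor(mtcars$mpg, mtcars[, c("wt", "qsec", "hp")])

# Number of rows (observations) in the data frame
n.rows = nrow(mtcars)
```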

In [None]:
cor.speed = NA
cor.vis = NA
cor.pres = NA
data.n = NA

# your code here


In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

#### 2. (b) Data Size Matters (5 points)

Yep, that's a lot of data. But isn't more data better? Well, let's find out. We can create two different models, one with a little data and one with a lot of data, and determine if the one fit to more data is the better model.

Fit two models to the data, with `Temperature..C.` as the response and `Wind.Speed..km.h.`, `Visibility..km.` and `Pressure..millibars.` as predictors. The first model, `weather.lmod.small`, should be fit to the first $30$ rows of the data. The second model, `weather.lmod.all`, should be fit to all the data.

Look at the p-values of the model coefficients. What can you infer?
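
To illustrate how sample size alone can change a p-value, the sketch below simulates data with a deliberately tiny true slope (0.05, an arbitrary assumption) and fits the same model to the first 30 rows and to all rows. The data frame `sim` and every name in the block are hypothetical; the weather data is not used.

```r
# Simulated illustration (not the assignment data).
set.seed(1)
n = 100000
sim = data.frame(x = rnorm(n))
sim$y = 0.05 * sim$x + rnorm(n)   # weak true effect plus noise

fit.small = lm(y ~ x, data = sim[1:30, ])   # first 30 rows only
fit.all = lm(y ~ x, data = sim)             # every row

# Coefficient p-values live in the "Pr(>|t|)" column of the summary table
p.small = summary(fit.small)$coefficients["x", "Pr(>|t|)"]
p.all = summary(fit.all)$coefficients["x", "Pr(>|t|)"]
r2.all = summary(fit.all)$r.squared

c(p.small = p.small, p.all = p.all, r2.all = r2.all)
```

With this many rows the standard error of the slope becomes very small, so even a weak effect can produce a tiny p-value while $R^2$ stays close to zero.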

In [None]:
weather.lmod.small = NA
weather.lmod.all = NA

# your code here


In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

#### 2. (c) Interpreting Our Models (10 points)

Answer the following questions, storing each answer in the variable whose number matches the question.

1. TRUE/FALSE. The coefficient for `Pressure..millibars.` for the model fit to all the data is statistically significant.
2. TRUE/FALSE. The coefficient for `Pressure..millibars.` for the model fit to a small amount of data is statistically significant.
3. What is the $R^2$ for the model fit to all of the data?
4. What is the $R^2$ for the model fit to a small amount of the data?
5. Which model explained more variability in its respective dataset? Copy the correct model into this answer variable. Think about why this is the case!
6. TRUE/FALSE. Models fit to large amounts of data run the risk of having statistically significant coefficients, even if the predictor isn't practically significant to the response.
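
The quantities asked about above can all be read off a model summary. The sketch below shows one way to do so on the built-in `mtcars` data; the model, the names and the 0.05 significance level are assumptions chosen for illustration.

```r
# Illustration on the built-in mtcars data (not the assignment data).
fit = lm(mpg ~ wt + qsec, data = mtcars)
fit.summary = summary(fit)

# Proportion of variability in the response explained by the model
r.squared = fit.summary$r.squared

# p-value for one coefficient, compared against a 0.05 level
p.wt = fit.summary$coefficients["wt", "Pr(>|t|)"]
wt.significant = p.wt < 0.05   # TRUE or FALSE
```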

In [None]:
prob.3.c.1 = NA

prob.3.c.2 = NA

prob.3.c.3 = NA

prob.3.c.4 = NA

# Save the selected model into this variable.
prob.3.c.5 = NA

prob.3.c.6 = NA

# your code here


In [None]:
# TEST CELL
if (!test_that("Checking class of answer", expect_is(prob.3.c.5, "lm"))){
    print("Make sure prob.3.c.5 is your selected linear model. It should be of class 'lm'.")
}
# This cell has hidden test cases that will run after submission.

In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.