# Module 2 - Autograded Assignment

### Outline:
**Here are the objectives of this assignment:**

1. Learn how to construct linear models in R, with both single and multiple predictors.
2. Practice how to identify the intercepts and coefficients from these models, and know what they mean.
3. Understand how to construct hat matrices and what information can be gathered from them.
4. Touch on future concepts like Residuals and MSE.

**Here are some general tips:**

1. Read the questions carefully to understand what is being asked.
2. When you feel that your work is completed, feel free to hit the ```Validate``` button to see your results on the *visible* unit tests. If you have questions about unit testing, please refer to the "Module 0: Introduction" notebook provided as an optional resource for this course. In this assignment, there are hidden unit tests that check your code. You will not receive any feedback for failed hidden unit tests until the assignment is submitted. **Do not misinterpret the feedback from visible unit tests as all possible tests for a given question--write your code carefully!**
3. Before submitting, we recommend restarting the kernel and running all the cells in the order in which they appear, to make sure that there are no additional bugs in your code.
4. There are 50 points total in this assignment.

In [None]:
# This cell loads the necessary libraries for this assignment
library(testthat)
library(tidyverse)
library(ggplot2) #a package for nice plots!
library(dplyr)

## Problem 1: Introduction to Simple Linear Regression (SLR) Models

For this exercise, we will look at a dataset from *Time* Magazine about college rankings. In this dataset, each row (statistical unit) is a college. There are $n = 706$ rows. After some simplifying, the variables included in the dataset are:

- `school`: the name of the school

- `earn`: yearly earnings

- `sat`: average SAT score

- `act`: average ACT score

- `price`: the cost of attendance for four years

In [None]:
college = read.csv("graduate-earnings.txt", sep="\t")

# Keep only the columns we need, rename them to lowercase, and print a summary
college = college %>%
    select(school = School, earn = Earn, sat = SAT, act = ACT, price = Price)
summary(college)

#### 1. (a) Create the SLR Model. (5 points)

Let's start simple, and model the relationship between `earn` (the response) and `sat` (the predictor). Save this model into the `slr_earn` variable.
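As a minimal sketch of the `lm()` formula interface, the example below fits a one-predictor model to R's built-in `cars` dataset rather than the college data. The name `toy_fit` is hypothetical and used only for illustration.

```r
# Toy example on the built-in `cars` data (not the assignment data).
# The formula reads "response ~ predictor".
toy_fit = lm(dist ~ speed, data = cars)

# The result is an object of class "lm"
class(toy_fit)
```

The same pattern applies to any data frame: name the response on the left of `~`, the predictor on the right, and pass the data frame through `data`.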

In [None]:
slr_earn = NA

# your code here


summary(slr_earn)

In [None]:
# Test Cell
if(test_that("Does the function return a model?", {expect_is(slr_earn, "lm")})){
    print("Does the function return a model? ... Correct")
    print("Just make sure your predictor and response variables are correct!")
}else{
    print("Test Failed. Tip: Try using the lm() function!")
}


#### 1. (b) Model Interpretation (5 points)

Insert the model's slope and intercept into the `slope` and `intercept` variables, respectively. Do not hard code the answers, instead access the lm object directly.
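One way to read coefficients off a fitted model without hard coding is `coef()`, which returns a named numeric vector. The sketch below again uses the built-in `cars` data; `toy_fit`, `toy_intercept`, and `toy_slope` are hypothetical names.

```r
# Toy example on the built-in `cars` data (not the assignment data)
toy_fit = lm(dist ~ speed, data = cars)

# coef() returns a named vector: "(Intercept)" plus one entry per predictor
toy_coefs = coef(toy_fit)
names(toy_coefs)

# Index by name; [[ ]] drops the name and returns a plain number
toy_intercept = toy_coefs[["(Intercept)"]]
toy_slope = toy_coefs[["speed"]]
```

Indexing by name rather than by position keeps the code readable and avoids mixing up the intercept and the slope.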

In [None]:
slope = NA
intercept = NA

# your code here


In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

It can be helpful to visualize our model against the data, to see if it is accurately modeling the data. This code is provided for you.

In [None]:
ggplot(college, aes(x = sat, y = earn)) + 
    geom_point( alpha = 0.5) + 
    geom_smooth(method = "lm", se = FALSE, col = "#CFB87C") + 
    xlab("SAT Score") + ylab("Yearly Earnings")+
    theme_bw()

#### 1. (c) Residuals

A useful plot for model analysis is the *Residuals vs Fitted Values* plot. We will learn how to use this plot to detect things like unequal variances, non-linearity and outliers later in the course. For now, let's just see what this plot looks like. Create a scatterplot with the Residuals on the y-axis and the Fitted Values on the x-axis. 

Tip: Use the `resid()` and `fitted()` functions. 
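To illustrate the two functions in the tip, here is a small sketch on the built-in `cars` data using base R graphics. The names `toy_fit`, `toy_fitted`, and `toy_resid` are hypothetical, and base `plot()` is just one option; `ggplot2` would work equally well.

```r
# Toy example on the built-in `cars` data (not the assignment data)
toy_fit = lm(dist ~ speed, data = cars)

toy_fitted = fitted(toy_fit)  # one fitted value per observation
toy_resid = resid(toy_fit)    # observed response minus fitted value

# Fitted values on the x-axis, residuals on the y-axis
plot(toy_fitted, toy_resid,
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residuals vs Fitted Values (toy data)")
abline(h = 0, lty = 2)        # dashed reference line at zero
```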

In [None]:
# your code here


#### 1. (d) Sums of Residuals (5 points)

Now calculate the sum of the residuals. Store your answer in the `sum_of_residuals` variable. As a lead up to future lessons, think about why this value is what it is.

In [None]:
sum_of_residuals = NA

# your code here


In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

#### 1. (e) Prediction (5 points)

At the (sample) mean value of `sat`, compute the predicted value of `earn`. Store your answer in `yhat`.
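A common way to get a prediction at a chosen predictor value is `predict()` with a one-row `newdata` data frame whose column name matches the predictor in the formula. The sketch below uses the built-in `cars` data; `toy_fit`, `new_point`, and `toy_pred` are hypothetical names.

```r
# Toy example on the built-in `cars` data (not the assignment data)
toy_fit = lm(dist ~ speed, data = cars)

# newdata must contain a column named exactly like the predictor
new_point = data.frame(speed = mean(cars$speed))

# Predicted response at the sample mean of the predictor
toy_pred = predict(toy_fit, newdata = new_point)
toy_pred
```

Computing the intercept plus the slope times the predictor value by hand should give the same number, which makes for a quick sanity check.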

In [None]:
yhat = NA

# your code here


In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

## Problem 2: SLR Hat Matrix (10 points)

The "hat matrix" is how we map from the response, $y$, to the fitted value $\widehat{y}$. Compute the hat matrix $H$ for the `slr_earn` model from scratch (e.g., using functions like `model.matrix()` to obtain the design matrix $X$, `solve()` to compute an inverse, `%*%` for matrix multiplication, and `t()` for transpose). Store $H$ in the variable `hat_matrix`.

Then compute the sum of the diagonals of $H$. Store this value in `sum_of_diagonals`. Do you understand why this value is what it is?
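For reference, the standard least-squares form of the hat matrix is $H = X(X^\top X)^{-1}X^\top$. The sketch below applies that form to the built-in `cars` data (50 rows) using the functions named above; `toy_fit`, `X_toy`, and `H_toy` are hypothetical names, and this is an illustration on toy data rather than the assignment's model.

```r
# Toy example on the built-in `cars` data (not the assignment data)
toy_fit = lm(dist ~ speed, data = cars)

# Design matrix: a column of ones (intercept) plus the predictor
X_toy = model.matrix(toy_fit)

# H = X (X'X)^{-1} X'
H_toy = X_toy %*% solve(t(X_toy) %*% X_toy) %*% t(X_toy)

# H is n x n, where n is the number of observations
dim(H_toy)

# Multiplying H by the response reproduces the fitted values
all.equal(as.vector(H_toy %*% cars$dist), as.vector(fitted(toy_fit)))
```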


In [None]:
hat_matrix = NA
sum_of_diagonals = NA

# your code here


In [None]:
# Test Cell
# The hat matrix should be 706x706. Let's check that.
if(test_that("Check matrix dimensions", expect_equal(dim(hat_matrix), c(706,706) ))){
    print("Correct Dimensions!")
}else{
    print("Incorrect dimensions. Make sure your hat matrix equation matches the equation in the videos.")
}
# This cell has hidden test cases that will run after submission.

Note: Above I had you compute a matrix inverse. In practice, [rarely is it a good idea to compute the inverse of a matrix](https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/) (it's expensive!). There are fancy ways around inverse computation.
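As a rough sketch of two such alternatives, the example below estimates regression coefficients on the built-in `cars` data without ever forming an explicit inverse: once by solving the normal equations as a linear system, and once with a QR decomposition. The names `X_toy`, `y_toy`, `beta_solve`, and `beta_qr` are hypothetical.

```r
# Toy example on the built-in `cars` data (not the assignment data)
X_toy = model.matrix(dist ~ speed, data = cars)
y_toy = cars$dist

# Option 1: solve (X'X) b = X'y as a linear system.
# solve(A, b) never forms the inverse of A explicitly.
beta_solve = solve(crossprod(X_toy), crossprod(X_toy, y_toy))

# Option 2: least squares through a QR decomposition of X
beta_qr = qr.coef(qr(X_toy), y_toy)

# Both should agree with lm() up to floating-point error
cbind(solve = as.vector(beta_solve),
      qr = as.vector(beta_qr),
      lm = as.vector(coef(lm(dist ~ speed, data = cars))))
```

Here `crossprod(X_toy)` computes $X^\top X$ and `crossprod(X_toy, y_toy)` computes $X^\top y$.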

## Problem 3: Introduction to Multiple Linear Regression (MLR) Models

In this problem, we will expand our knowledge of linear regression models from only having one predictor to having multiple predictors.

Let's use the Plant Diversity of Northeastern North American Islands dataset from the University of Florida. This data contains the "richness" of native and non-native plant species on 22 different islands.

#### 3. (a) Read in the Data

For practice, try reading in the data yourself. The data file is stored in the same local directory and is named `plant_diverse_island.csv`. You may need to experiment with separators and headers for the data to load correctly.
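The sketch below shows how the `sep` and `header` arguments of `read.csv()` change the result, using a small temporary file written on the spot. The file contents, the semicolon separator, and the column names are all made up for illustration; the actual layout of `plant_diverse_island.csv` is not stated here and needs to be checked by inspecting the file.

```r
# Write a tiny semicolon-separated file with no header row (hypothetical contents)
tmp = tempfile(fileext = ".csv")
writeLines(c("1;2.5;a", "2;3.5;b"), tmp)

# sep sets the field separator; header = FALSE treats the first line as data
toy_df = read.csv(tmp, sep = ";", header = FALSE,
                  col.names = c("id", "value", "label"))

# Check that each field landed in its own column with a sensible type
str(toy_df)
```

If every row ends up in a single column, the separator is probably wrong; if the first data row shows up as column names, `header` is probably wrong.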

In [None]:
# Read in the data
plant = NA
path = "plant_diverse_island.csv"

# your code here


head(plant)

#### 3. (b) Create a MLR Model (10 points)

Using this dataset, construct a linear model named `mlr_plant` with `tot.rich` as the response and `area`, `dist.island` and `human.dens` as predictors.
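With several predictors, the formula lists them on the right-hand side separated by `+`. As a minimal sketch, the example below fits a three-predictor model to the built-in `mtcars` data (not the plant data); `toy_mlr` is a hypothetical name.

```r
# Toy example on the built-in `mtcars` data (not the assignment data).
# Response on the left of ~, predictors joined by + on the right.
toy_mlr = lm(mpg ~ wt + hp + disp, data = mtcars)

# One intercept plus one coefficient per predictor
coef(toy_mlr)
```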

In [None]:
mlr_plant = NA

# your code here


summary(mlr_plant)

In [None]:
# Test Cell
if(test_that("Test model type", {expect_is(mlr_plant, "lm")})){
    print("Is a linear model? ... Correct")
    print("Make sure you are modeling the correct predictors!")
}else{
    print("Incorrect type. Tip: Try the lm() function!")
}
# This cell has hidden test cases that will run after submission.

#### 3. (c) Mean Squared Error (10 points)

The Mean Squared Error (MSE) measures how close the model's fitted values are to the observed values.

Calculate the MSE for the `mlr_plant` model. Store the answer in the variable `MSE_plant`.
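Two conventions for the MSE are in common use, and which one this course expects is not stated in this notebook, so check the lecture definition. The sketch below computes both on the built-in `mtcars` data; `toy_mlr`, `rss`, `mse_n`, and `mse_df` are hypothetical names.

```r
# Toy example on the built-in `mtcars` data (not the assignment data)
toy_mlr = lm(mpg ~ wt + hp, data = mtcars)

# Residual sum of squares
rss = sum(resid(toy_mlr)^2)

# Convention 1: divide by the number of observations n
mse_n = rss / nrow(mtcars)

# Convention 2: divide by the residual degrees of freedom,
# n minus the number of estimated coefficients (intercept included)
mse_df = rss / df.residual(toy_mlr)

c(mse_n = mse_n, mse_df = mse_df)
```

The second convention is the square of the residual standard error reported by `summary()`.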

In [None]:
MSE_plant = NA

# your code here


In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.