# C3M2: Autograded Assignment

### Outline:
**Here are the objectives of this assignment:**

1. Understand when to apply different kinds of regression models.
2. Fit a GLM to count data and go through model diagnostics and interpretation.
3. Compare the effectiveness of GLMs to Linear Regression models.

**Here are some general tips:**

1. Read the questions carefully to understand what is being asked.
2. When you feel that your work is completed, feel free to hit the ```Validate``` button to see your results on the *visible* unit tests. If you have questions about unit testing, please refer to the "Module 0: Introduction" notebook provided as an optional resource for this course. In this assignment, there are hidden unit tests that check your code. You will not receive any feedback for failed hidden unit tests until the assignment is submitted. **Do not misinterpret the feedback from visible unit tests as all possible tests for a given question--write your code carefully!**
3. Before submitting, we recommend restarting the kernel and running all the cells in the order in which they appear to make sure that there are no additional bugs in your code.

In [None]:
# Load required packages
library(tidyverse)
library(testthat)
library(ggplot2)

# Problem 1: Counts, Rates and Measurements. (15 points)

As we've seen, there are many kinds of models for the many kinds of data out there, and fitting a good model starts with understanding the data. For the following questions, determine which kind of model should be used for the specified data and question.

For each question, input the string answer of the specified model in the respective answer variable. Choose your answers from the models `"linear"`, `"binomial"` and `"poisson"`; answers are case sensitive. Note: Some features may be suitable for different kinds of models. Pick the model that would work the best.

1. You are trying to predict the number of home runs hit by baseball players during their next season. Your predictors are the player's age, the number of years spent in professional baseball, and the number of home runs they hit in the previous $5$ years.
2. You are trying to determine whether people in cities buy more cereal than people in suburbs or in rural areas. Your response is the number of cereal boxes sold, rounded to the nearest $1000$. Your predictors are the type of area, the population, the number of grocery stores, and the average cost.
3. You want to predict ratings for hotels based on user reviews. The rating is on a scale of $1$ to $5$ stars. The predictors are different statistics extracted from each review, such as word count and the number of times the review used the word "bathroom."
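
As a reminder of how the three candidate families are specified in R, here is a minimal sketch on simulated data. The data frame `sim` and all of its columns are hypothetical and unrelated to the scenarios above; the only point is how the model call and the `family` argument change with the type of response.

```r
# Simulated data, purely illustrative (not tied to the questions above)
set.seed(1)
n = 200
sim = data.frame(x = runif(n, 0, 2))
sim$y.cont  = 1 + 2 * sim$x + rnorm(n)                        # continuous measurement
sim$y.bin   = rbinom(n, size = 1, prob = plogis(-1 + sim$x))  # 0/1 outcome
sim$y.count = rpois(n, lambda = exp(0.5 + 0.8 * sim$x))       # non-negative count

fit.linear   = lm(y.cont ~ x, data = sim)
fit.binomial = glm(y.bin ~ x, family = binomial, data = sim)   # logit link by default
fit.poisson  = glm(y.count ~ x, family = poisson, data = sim)  # log link by default

# Each fit returns an intercept and a slope, each on its own link scale
sapply(list(linear = fit.linear, binomial = fit.binomial, poisson = fit.poisson), coef)
```

Note that the coefficients in the three columns are not directly comparable: they live on the identity, logit and log scales respectively.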

In [None]:
# Remember, your answers should be strings
prob.1.1 = NA

prob.1.2 = NA

prob.1.3 = NA

# your code here


In [None]:
# Test Cell
if(!test_that("Checking answer types", {expect_is(prob.1.1, "character")
                                        expect_is(prob.1.2, "character")
                                        expect_is(prob.1.3, "character")})){
    print("Make sure your answers are strings!")
}

# Problem 2: MLRs vs. GLMs

For each of 30 Galapagos islands, we have a count of the number of plant species found on the island and the number that are endemic to that island. We also have five geographic variables for each island.

1. Species: the number of plant species found on the island
2. Endemics: the number of endemic species
3. Area: the area of the island (km$^2$)
4. Elevation: the highest elevation of the island (m)
5. Nearest: the distance from the nearest island (km)
6. Scruz: the distance from Santa Cruz island (km)
7. Adjacent: the area of the adjacent island (km$^2$)

In [None]:
# Load the data
data.gala = read.csv("gala.csv")

colnames(data.gala)[1] = "Location"
data.gala$Location = as.character(data.gala$Location)
head(data.gala)

### 2. (a) Trying a Linear Model (15 points)

Fit a linear model called `lmod.gala` with `Species` as the response and all other variables, except `Location` and `Endemics`, as predictors. Run some diagnostics and think about why this model may not be the best fit. For each assumption variable, answer `TRUE` if the assumption is being met by the model, and `FALSE` if the assumption is not being met by the model.
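
If you need a refresher on running diagnostics, the following is a minimal sketch on simulated data, not on the Galapagos data. The names `x`, `y` and `fit.demo` are made up for illustration, and the noise is deliberately constructed to grow with `x` so that one of the plots shows a clear violation.

```r
# Simulated heteroskedastic data, purely illustrative
set.seed(2)
n = 100
x = runif(n, 1, 10)
y = 3 + 2 * x + rnorm(n, sd = x)   # noise standard deviation grows with x
fit.demo = lm(y ~ x)

# Four standard diagnostic plots, in order: residuals vs fitted (linearity),
# normal Q-Q (normality), scale-location (constant variance), residuals vs leverage
par(mfrow = c(2, 2))
plot(fit.demo)
par(mfrow = c(1, 1))

# A numeric complement to the Q-Q plot
shapiro.test(residuals(fit.demo))
```

A funnel shape in the residuals vs fitted plot, or an upward trend in the scale-location plot, is the usual visual sign of nonconstant variance.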

In [None]:
lmod.gala = NA
# Code the following as TRUE or FALSE
lmod.gala.linear = NA
lmod.gala.homoskedasticity = NA
lmod.gala.normality = NA

# your code here


In [None]:
# Test Cell
if(!test_that("Checking if model coefficients are correct", {expect_equal(7.068221, as.numeric(lmod.gala$coef[1]), tol=1e-4)
                                                             expect_equal(-0.023938, as.numeric(lmod.gala$coef[2]), tol=1e-4)
                                                             expect_equal(0.319456, as.numeric(lmod.gala$coef[3]), tol=1e-4)})){
    print("At least one of the coefficients was wrong. Make sure your model is correct before doing diagnostics.")
}
# This cell has hidden test cases that will run after submission.

### 2. (b) Linear Transformations (8 points)

Recall that one strategy we used to address models that had nonconstant variance was to transform the response variable. Apply the square root transform to the response and fit it to the same predictors. Store this model as `lmod.gala.sqrt`. Look at the diagnostic plots and consider whether this model meets its assumptions better than the last. Similar to the previous problem, for each assumption, answer `TRUE` if the model meets the assumption and `FALSE` if not. Note that if a plot looks ambiguous, you can interpret it as "no evidence of a violation" and answer `TRUE`.

One thing to keep in mind is that transformations make the model harder to interpret. Think about how a $1$ unit increase in `Nearest` for your transformed model would affect `Species`. Put your answers into `sqrt.gala.linearity`, `sqrt.gala.homoskedasticity` and `sqrt.gala.normality`.
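
To see why interpretation gets harder, note that if $\sqrt{y} = \beta_0 + \beta_1 x$, then $y = (\beta_0 + \beta_1 x)^2$, so the change in $y$ for a $1$ unit increase in $x$ depends on where $x$ starts. Here is a minimal sketch of that effect on simulated data; `fit.sqrt.demo` and the helper `pred.y` are hypothetical names introduced only for this illustration.

```r
# Simulated count-like data, purely illustrative
set.seed(3)
n = 100
x = runif(n, 0, 10)
y = rpois(n, lambda = (2 + 0.5 * x)^2)
fit.sqrt.demo = lm(sqrt(y) ~ x)

# Back-transform a prediction from the square-root scale to the original scale
pred.y = function(x0) unname(predict(fit.sqrt.demo, newdata = data.frame(x = x0))^2)

# Predicted change in y for a 1-unit increase in x, at two different baselines
change.at.1 = pred.y(2) - pred.y(1)
change.at.8 = pred.y(9) - pred.y(8)
c(change.at.1 = change.at.1, change.at.8 = change.at.8)
```

The two changes differ even though the slope on the square-root scale is a single number, which is exactly what makes the transformed model awkward to describe in terms of the original response.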

In [None]:
lmod.gala.sqrt = NA
sqrt.gala.linearity = NA
sqrt.gala.homoskedasticity = NA
sqrt.gala.normality = NA

# your code here


In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

### 2. (c) GLMs to the Rescue (6 points)

There are still some problems with the model. Because our response variable is a count, maybe linear models aren't the best choice anyway. Fit a GLM of the appropriate family to the (untransformed) data, using the same predictors. Store this model as `glm.gala`. Plot the diagnostic plots and think about what assumptions should be met.

How do we interpret this model? In particular, fill in the blank: "A 1-unit increase in `Elevation` is associated with a multiplicative increase of $\text{_____}$ in `Species`, on average." Store this value as `glm.interp`.
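
As a hint on the mechanics of multiplicative interpretation, here is a minimal sketch using a log-link count model on simulated data. The names `fit.pois.demo`, `beta.x` and `rate.ratio` are hypothetical and the numbers have nothing to do with the Galapagos data.

```r
# Simulated counts with a log-linear mean, purely illustrative
set.seed(4)
n = 200
x = runif(n, 0, 3)
y = rpois(n, lambda = exp(1 + 0.4 * x))
fit.pois.demo = glm(y ~ x, family = poisson)

# Coefficients are on the log scale; exponentiating gives a multiplicative effect
beta.x = coef(fit.pois.demo)["x"]
rate.ratio = exp(beta.x)

# Check: the ratio of predicted means one unit apart equals exp(beta)
mu = predict(fit.pois.demo, newdata = data.frame(x = c(1, 2)), type = "response")
c(rate.ratio = unname(rate.ratio), predicted.ratio = unname(mu[2] / mu[1]))
```

Unlike the square-root model, the multiplicative effect here is the same at every baseline value of `x`.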

In [None]:
glm.gala = NA
glm.interp = NA

# your code here


In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

### 2. (d) GLM Goodness of Fit (6 points)

Our linear models didn't do a great job of fitting the data. How do we know if our GLM fits the data any better? We don't have an easy scale of reference, like the $R^2$ value, for GLMs. What we can do is compare our model to other models, such as the null model, and see if ours performs significantly better.

Calculate the deviance of your model and store it as `glm.deviance`. Then check the goodness of fit of your model using Pearson's $\chi^2$ statistic. Store this value as `glm.chisq.stat`. Calculate the p-value for this statistic and store it as `glm.chisq.pval`. What does this tell you about your model?
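
The general recipe can be sketched as follows on simulated data. This is only an illustration under the assumption of a Poisson GLM; `fit.gof.demo`, `dev.demo`, `chisq.demo` and `pval.demo` are hypothetical names, and you should store your own results in the variables named above.

```r
# Simulated counts, purely illustrative
set.seed(5)
n = 200
x = runif(n, 0, 3)
y = rpois(n, lambda = exp(1 + 0.4 * x))
fit.gof.demo = glm(y ~ x, family = poisson)

# Residual deviance of the fitted model
dev.demo = deviance(fit.gof.demo)

# Pearson chi-squared statistic: sum of squared Pearson residuals
chisq.demo = sum(residuals(fit.gof.demo, type = "pearson")^2)

# Upper-tail p-value against a chi-squared distribution with the residual
# degrees of freedom; a small p-value suggests lack of fit
pval.demo = pchisq(chisq.demo, df = df.residual(fit.gof.demo), lower.tail = FALSE)
c(deviance = dev.demo, chisq = chisq.demo, pval = pval.demo)
```

Because these data were simulated from the same family that was fit, a large p-value would be the typical outcome here; real data may behave very differently.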

In [None]:
glm.deviance = NA
glm.chisq.stat = NA
glm.chisq.pval = NA

# your code here


In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.

In [None]:
# Test Cell
# This cell has hidden test cases that will run after submission.