# 5B: Neither Trump nor Biden

In [None]:
# Load the CourseKata library
suppressPackageStartupMessages({
    library(coursekata)
})

# Updated USStates data with 2020 census and election data
USStates <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSEc6kO1zrL_3Jlc_cA7cMgk6E2xcIjuUbTL50y-0ENwWby36EFj1MpWZLVKud8YMTtqb1zsef_a8Ss/pub?gid=2100107333&single=true&output=csv")

## 1.0: Coming up with our ideas

We've updated the data frame `USStates` to include a few new variables:

- `TrumpVote20` Percentage of votes for Donald Trump in 2020 Presidential election
- `BidenVote20` Percentage of votes for Joe Biden in 2020 Presidential election
- `OtherVote20` Percentage of votes for neither Biden nor Trump in 2020 Presidential election
- `TotalVote20` Total votes cast in 2020 Presidential election

(Note: This tabulation of votes is from November 9, 2020.)


**1.1:** Today we will focus on the question: "Some people chose to vote for neither Trump nor
Biden. Why do some states have a disproportionate number of **Other** voters?" Consider some ideas 
here.

In [None]:
head(select(USStates, State, OtherVote20))

**1.2:** Consider the other variables available in `USStates` (note: many of 
these variables were collected circa 2010 unless otherwise noted). What might explain variation in `OtherVote20`?

- `State` Name of state
- `HouseholdIncome` Mean household income (in dollars)
- `IQ` Mean IQ score of residents
- `Region` Area of the country: MW=Midwest, NE=Northeast, S=South, or W=West
- `Population` Number of residents (in millions) in 2020
- `PopPercent` Percent of residents of the US in 2020
- `PopChange10_20` Change in population (in percent) from 2010 to 2020 census
- `EighthGradeMath` Average score on standardized test administered to 8th graders 
- `HighSchool` Percentage of high school graduates
- `GSP` Gross State Product (dollars per capita)
- `FiveVegetables` Percentage of residents who eat at least five servings of fruits/vegetables per day
- `Smokers` Percentage of residents who smoke
- `PhysicalActivity` Percentage of residents who have competed in a physical activity in past month
- `Obese` Percentage of residents classified as obese
- `College` Percentage of residents with college degrees
- `NonWhite` Percentage of residents who are not white
- `HeavyDrinkers` Percentage of residents who drink heavily
- `TotalDosesed` Total COVID-19 vaccine doses delivered as of Nov 2021
- `AtLeast1Dose_per100` Percentage of population that has had at least their first dose of the vaccine as of Nov 2021
- `FullyVacc` Number of people fully vaccinated as of Nov 2021 
- `FullyVacc_per100` Percentage of population fully vaccinated as of Nov 2021 

**1.3:** As a class let's come up with three ideas and write them as word equations.

0. Whole class example: **OtherVote20 = FiveVeg + Other Stuff**
1. 
2.
3.


## 2.0: Explore Variation

**2.1:** Pick one of the ideas (word equations) to work on as a group and write it in 
the space below. 

**2.2:** Take a look at the data and make some visualizations to explore your word equation.
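
As a starting point, here is a minimal sketch using the whole-class example (`FiveVegetables`). It assumes the setup cell at the top has been run, so `USStates` is already loaded; swap in your own explanatory variable.

```r
library(coursekata)

# Assumption: USStates was loaded by the setup cell at the top of the notebook.

# Distribution of the outcome variable on its own
gf_histogram(~ OtherVote20, data = USStates)

# Outcome plotted against the explanatory variable from the class example
gf_point(OtherVote20 ~ FiveVegetables, data = USStates)
```

A scatterplot suits a quantitative explanatory variable; for a categorical one such as `Region`, something like `gf_boxplot()` or `gf_jitter()` may be a better fit.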

**2.3:** Is it possible to have gotten this pattern of data by chance? 
Write a word equation that represents this possibility.

## 3.0: Model Variation

**3.1:** According to this data, what is your best estimated model of the DGP?
Write it in GLM notation below. (Feel free to add more parameters if needed.)

$Y_i = \beta_0 + \beta_1X_i + \epsilon_i$

**3.2:** Add your model to your visualization.

**3.3:** Interpret your coefficients. What do the numbers in your model mean?

**3.4:** How much error has your model reduced relative to the empty model? 
Does that provide evidence for or against your hypothesis/prediction?
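
One way to work through 3.1 to 3.4 is sketched below for the whole-class example. The name `FiveVeg_model` is just an illustrative choice, and the sketch assumes `USStates` is already loaded from the setup cell.

```r
library(coursekata)

# Assumption: USStates was loaded by the setup cell at the top of the notebook.
# FiveVeg_model is a hypothetical name chosen for this example.
FiveVeg_model <- lm(OtherVote20 ~ FiveVegetables, data = USStates)

# Printing the model shows the estimates b0 (intercept) and b1 (slope)
FiveVeg_model

# Add the best-fitting line to the scatterplot
gf_point(OtherVote20 ~ FiveVegetables, data = USStates) %>%
  gf_lm(color = "navyblue")

# ANOVA table: the PRE column shows the proportion of error reduced
# relative to the empty model
supernova(FiveVeg_model)
```

Here $b_0$ is the predicted `OtherVote20` for a state where `FiveVegetables` is 0, and $b_1$ is the predicted change in `OtherVote20` for each one-point increase in `FiveVegetables`.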

## 4.0: Simulating a Random DGP 

**4.1:** What would the best fitting models typically look like if the DGP was random? What would the $b_1$s usually look like? How much could they vary? Let's check it out.

Modify the code blocks below to examine your particular explanatory variable.

In [None]:
gf_point(shuffle(OtherVote20) ~ FiveVegetables, data = USStates, color = "coral", size = 2) %>%
gf_lm(color = "navyblue") 

In [None]:
# Shuffling the outcome alone is enough to break any relationship with the explanatory variable
do(10) * b1(shuffle(OtherVote20) ~ FiveVegetables, data = USStates)

We know how to judge a sample $b_1$ against $b_1$s generated from the empty model ($\beta_1=0$). Generating Fs from the empty model of the DGP will lead us to the same conclusions.  

**4.2:** Let's generate Fs (say, 1000 of them) from a random DGP. How do those Fs generally vary? Try creating a visualization of your distribution of Fs.

(Bonus: What is this distribution of Fs called?)

In [None]:
SDoF <- do(3) * fVal(shuffle(OtherVote20) ~ FiveVegetables, data = USStates)

head(SDoF)

**4.3:** Where would your sample F exist on the distribution of Fs? Try adding it to your visualization.
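
A possible sketch for placing the sample F on the shuffled distribution follows. It again uses the whole-class example and assumes `USStates` is loaded; `sample_F` is a name introduced here for illustration.

```r
library(coursekata)

# Assumption: USStates was loaded by the setup cell at the top of the notebook.

# F for the actual (unshuffled) data; sample_F is a hypothetical name
sample_F <- fVal(OtherVote20 ~ FiveVegetables, data = USStates)

# 1000 Fs from a DGP with no relationship between the variables
SDoF <- do(1000) * fVal(shuffle(OtherVote20) ~ FiveVegetables, data = USStates)

# Histogram of the shuffled Fs, with the sample F marked by a vertical line
gf_histogram(~ fVal, data = SDoF) %>%
  gf_vline(xintercept = sample_F, color = "coral")
```

Because `shuffle()` is random, the histogram will look slightly different on each run.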

**4.4:** Consider the values in the ANOVA table for the main model you have been 
working with so far. Which value do you think corresponds most closely to 
this statement:

> The probability of getting an F as large as the sample F, **if** there was no relationship between the variables in the DGP.

**4.5:** Use `tally()` to see if *"the proportion of Fs as large as the sample F, if
there was no relationship between the variables in the DGP"* from your sampling distribution really is similar to the number in the ANOVA table.
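
The comparison might look something like the sketch below, which rebuilds the sampling distribution so it can run on its own. It assumes `USStates` is loaded and reuses the illustrative names `sample_F` and `SDoF`.

```r
library(coursekata)

# Assumption: USStates was loaded by the setup cell at the top of the notebook.
sample_F <- fVal(OtherVote20 ~ FiveVegetables, data = USStates)
SDoF <- do(1000) * fVal(shuffle(OtherVote20) ~ FiveVegetables, data = USStates)

# Proportion of shuffled Fs that are at least as large as the sample F
tally(~ fVal >= sample_F, data = SDoF, format = "proportion")

# ANOVA table for the same model, for comparison
supernova(lm(OtherVote20 ~ FiveVegetables, data = USStates))
```

With 1000 shuffles the proportion is only an approximation, so expect it to be close to the value in the table rather than identical, and to change a little from run to run.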

**4.6:** So what do you think? Evaluate your model against the empty model. (Were any of the other models from the class not that different from the empty model?) 

**4.7:** Do your conclusions about states hold up for citizens? Can you make a better prediction about *which* voters within the state will vote for an "other" candidate with your model? Why or why not?

## 5.0: Extending these Ideas to *PRE*

*F* and *PRE* are very closely related. They both try to show how much 
the complex model explains the outcome variable compared to the empty model. We should end up with the same conclusions whether we use F or PRE.

**5.1:** To corroborate our intuitions, try creating a sampling distribution of *PRE* from shuffled data using `PRE()` and `shuffle()`, and plot it as a histogram.

Then use `tally()` to get the p-value from the simulated sampling distribution of PREs. 
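
A minimal sketch of the PRE version, mirroring the F approach above. It assumes `USStates` is loaded from the setup cell; `sample_PRE` and `SDoPRE` are names made up for this example.

```r
library(coursekata)

# Assumption: USStates was loaded by the setup cell at the top of the notebook.

# PRE for the actual (unshuffled) data; sample_PRE is a hypothetical name
sample_PRE <- PRE(OtherVote20 ~ FiveVegetables, data = USStates)

# 1000 PREs from shuffled data (no relationship in the DGP)
SDoPRE <- do(1000) * PRE(shuffle(OtherVote20) ~ FiveVegetables, data = USStates)

# Histogram of shuffled PREs with the sample PRE marked
gf_histogram(~ PRE, data = SDoPRE) %>%
  gf_vline(xintercept = sample_PRE, color = "coral")

# Simulated p-value: proportion of shuffled PREs at least as large as the sample PRE
tally(~ PRE >= sample_PRE, data = SDoPRE, format = "proportion")
```

Unlike F, PRE is bounded between 0 and 1, so the x-axis of this histogram will look different even though the reasoning is the same.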

**5.2:** What is similar and different about the approach using F versus PRE?