# 6B: The Effect of Being at Home

In [None]:
# Load the CourseKata library
suppressPackageStartupMessages({
    library(coursekata)
})

We’re going to look at a dataframe called `MiamiHeat`. These are the game log data for the Miami Heat basketball team during the 2010-11 season.

Here are the variables in this dataframe:

- `MDY` Date the game was played as a date object
- `Date` Date the game was played as a character string
- `Location` Away or Home
- `Opp` Opponent team
- `Win` Game result: L or W
- `FG` Field goals made
- `FGA` Field goals attempted
- `FG3` Three-point field goals made
- `FG3A` Three-point field goals attempted
- `FT` Free throws made
- `FTA` Free throws attempted
- `Rebounds` Total rebounds
- `OffReb` Offensive rebounds
- `Assists` Number of assists
- `Steals` Number of steals
- `Blocks` Number of shots blocked
- `Turnovers` Number of turnovers
- `Fouls` Number of fouls
- `Points` Number of points scored
- `OppFG` Opponent's field goals made
- `OppFGA` Opponent's field goals attempted
- `OppFG3` Opponent's three-point field goals made
- `OppFG3A` Opponent's three-point field goals attempted
- `OppFT` Opponent's free throws made
- `OppFTA` Opponent's free throws attempted
- `OppOffReb` Opponent's offensive rebounds
- `OppRebounds` Opponent's total rebounds
- `OppAssists` Opponent's assists
- `OppSteals` Opponent's steals
- `OppBlocks` Opponent's shots blocked
- `OppTurnovers` Opponent's turnovers
- `OppFouls` Opponent's fouls
- `OppPoints` Opponent's points scored

## 1.0 - Explore the Home Game Advantage

1.1 - Let's explore the idea that playing at home would help
the Miami Heat score more points. Write this idea as a word equation.

1.2 - Explore the variation with a visualization. Does it seem like they score more
when playing at home?
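
One possible way to look at this, sketched with functions the `coursekata` library loads (`gf_boxplot()`, `gf_jitter()`, and `favstats()`). The choice of a boxplot with jittered points is just an assumption; a faceted histogram would work too.

```r
library(coursekata)

# Boxplot of Points for each Location, with the individual games jittered on top
gf_boxplot(Points ~ Location, data = MiamiHeat) %>%
  gf_jitter(width = 0.1, alpha = 0.5)

# Summary statistics for each group, to compare against the plot
points_by_location <- favstats(Points ~ Location, data = MiamiHeat)
points_by_location
```

Comparing the two group means in the table with the two boxes in the plot is a quick check on what your eyes are telling you.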

## 2.0 - Modeling Variation

2.1 - Now that we know a lot more about statistics, we can actually create a model of 
POINTS = LOCATION + OTHER STUFF. Write the best fitting model in 
GLM notation here. 

$$Y_i = ... + e_i$$

2.2 - Interpret the best fitting estimates by connecting the numbers
to the visualization below.

In [None]:
# Histograms of Points faceted by Location, with the Location model overlaid
gf_histogram(~ Points, data = MiamiHeat, alpha = 1) %>%
  gf_facet_grid(Location ~ .) %>%
  gf_model(Points ~ Location, data = MiamiHeat)
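
To get the numbers to connect to the plot, one option is to fit the model with `lm()` and pull out the estimates with `b0()` and `b1()`. This is a minimal sketch; the names `location_model`, `sample_b0`, and `sample_b1` are made up for illustration.

```r
library(coursekata)

# Fit the Location model and print the best fitting estimates
location_model <- lm(Points ~ Location, data = MiamiHeat)
location_model

# The same estimates, pulled out one at a time
sample_b0 <- b0(Points ~ Location, data = MiamiHeat)
sample_b1 <- b1(Points ~ Location, data = MiamiHeat)
sample_b0
sample_b1
```

Try to find each number in the faceted histogram: one of them is a group mean, and the other is the distance between the two group means.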

2.3 - When we look at a visualization (any visualization! not just this one!) 
and think, *"Yeah, this one looks like some of the variation is explained,"*
which parameter estimate best represents the “shift”? $b_0$ or $b_1$?


2.4 - Estimate that parameter in the **shuffled** histograms below. 
When you have randomly shuffled data into Home and Away groups, 
what are the $b_1$s typically like? 


<img src="https://i.postimg.cc/pWBFyh9g/10-B-Shuffled-Histograms.png" title="grid of faceted histograms" />

2.5 - If we put all these $b_1$s in their own distribution, what would it be called?

- a sample of `Points`
- a sampling distribution of $b_1$s
- a DGP where there is no home game advantage

2.6 - Why should we be worried about randomness when looking at the sample $b_1$?


## 3.0 - A DGP where there is no home game advantage

3.1 - If there was no effect of `Location` in the real DGP, which model of the DGP 
should we prefer: the `Location` model or the empty model?

3.2 - If there was no effect of `Location`, what would be the value of $\beta_1$ in the DGP?

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$

3.3 - Which of these functions (`shuffle()` or `resample()` as used
below) will generate 1000 samples from a DGP with no home game advantage for `Points`?

```
do(1000) * b1(shuffle(Points) ~ Location, data = MiamiHeat)

do(1000) * b1(Points ~ Location, data = resample(MiamiHeat))
```

3.4 - If there was no home game advantage, would such a DGP *always* generate $b_1$s that are 0? Could such a DGP generate samples where the average `Points` scored at home were *higher* than the away average? What about *lower*? Why or why not?

3.5 - If there was no home game advantage for `Points`, could we ever get a sample $b_1$ as extreme as the one in our sample ($b_1 = 2.12$)? Would it be one of the "unlikely" $b_1$s from this DGP?

Make a prediction. Then create a sampling distribution of $b_1$ from a DGP where $\beta_1 = 0$.

In [None]:
# Create a sampling distribution of b1 here.
# Save it as sdob1_no_effect


# Visualize the sampling distribution in a histogram. 
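
If you get stuck, here is one possible sketch. It assumes that shuffling `Points` is an acceptable way to mimic a DGP where $\beta_1 = 0$, because shuffling breaks any connection between `Points` and `Location`.

```r
library(coursekata)

# 1000 shuffled b1s: each one comes from randomly reassigning Points to games
sdob1_no_effect <- do(1000) * b1(shuffle(Points) ~ Location, data = MiamiHeat)

# Look at the first few rows, then the whole distribution
head(sdob1_no_effect)
gf_histogram(~ b1, data = sdob1_no_effect)
```

Because the shuffles are random, your histogram will look a little different every time you run this.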

3.6 - The code below will color the middle .95 of samples in `dodgerblue`. Will the sample $b_1$ be one of the "unlikely" samples?

Add some code to depict the sample $b_1$ (2.12) in `green4`. 

In [None]:
# Histogram of the b1s: the middle .95 is filled dodgerblue, the tails coral
gf_histogram(~ b1, data = sdob1_no_effect, binwidth = .5, fill = ~middle(b1, .95)) %>%
  gf_refine(scale_fill_manual(values = c("coral", "dodgerblue")))
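
One way to add the sample $b_1$ is as a point sitting on the x-axis. The sketch below rebuilds a shuffled sampling distribution so that it runs on its own; the point size and the name `sample_b1` are arbitrary choices.

```r
library(coursekata)

# Sampling distribution from shuffled data, plus the b1 from the actual sample
sdob1_no_effect <- do(1000) * b1(shuffle(Points) ~ Location, data = MiamiHeat)
sample_b1 <- b1(Points ~ Location, data = MiamiHeat)

# Same histogram as above, with the sample b1 marked as a green4 point at y = 0
gf_histogram(~ b1, data = sdob1_no_effect, binwidth = .5, fill = ~middle(b1, .95)) %>%
  gf_point(x = sample_b1, y = 0, color = "green4", size = 4) %>%
  gf_refine(scale_fill_manual(values = c("coral", "dodgerblue")))
```

Check whether the point lands over a `dodgerblue` bar or a `coral` one.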

3.7 - Check out a different way of depicting the distribution triad (DGP, sampling distribution, and sample) here. Where should this sampling distribution go?

*slides provided by your instructor (found in complete version)*

## 4.0 - A DGP where the home game advantage is similar to our sample 

4.1 - If the effect of `Location` in the real DGP was just like our sample, which model of the DGP should we prefer: the `Location` model or the empty model?

4.2 - If the effect of `Location` was just like our sample, what would be the value of $\beta_1$ 
in the DGP?

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$

4.3 - Which of these functions (`shuffle()` or `resample()` as used
below) will generate 1000 samples from a DGP that is basically just like our sample?

```
do(1000) * b1(shuffle(Points) ~ Location, data = MiamiHeat)

do(1000) * b1(Points ~ Location, data = resample(MiamiHeat))
```

4.4 - If the DGP was just like our sample, could it generate $b_1$s that show no effect of `Location`? Could such a DGP generate samples where the average `Points` in home games were *lower*  than the away average? Why or why not?

4.5 - If the home game advantage for `Points` in the DGP was just like our sample, could we ever get a sample $b_1$ as extreme as the one in our sample ($b_1 = 2.12$)? Would it be one of the "unlikely" $b_1$s from this DGP?

Make a prediction. Then create a sampling distribution of $b_1$ from a DGP that is just like our sample ($\beta_1 = 2.12$).

In [None]:
# Create a sampling distribution of b1 here.
# Save it as sdob1_same_effect


# Visualize the sampling distribution in a histogram. 
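
If you get stuck, here is one possible sketch. It assumes that resampling whole games (rows) with replacement is an acceptable stand-in for a DGP that looks just like our sample.

```r
library(coursekata)

# 1000 resampled b1s: each one comes from 82 games drawn with replacement
sdob1_same_effect <- do(1000) * b1(Points ~ Location, data = resample(MiamiHeat))

# Where is this distribution centered?
mean(sdob1_same_effect$b1)
gf_histogram(~ b1, data = sdob1_same_effect)
```

Compare the center of this histogram with the center of the one you made in section 3.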

4.6 - The code below will color the middle .95 of samples in a shade of turquoise. 
Will the sample $b_1$ be one of those "unlikely" samples? 

Add some code to depict the sample $b_1$ (2.12) in `green4`. Should it be added as a curve,
a line, something else?

In [None]:
# this code makes a histogram
gf_histogram(~ b1, data = sdob1_same_effect, binwidth = .5, fill = ~middle(b1, .95))

4.7 - Check out a different way of depicting the distribution triad (DGP, sampling distribution, and sample) here. Where should this "same effect" sampling distribution go?

*slides provided by your instructor (found in complete version)*

## 5.0 - Confidence Intervals

5.1 - From these explorations, could the real $\beta_1$ be 0? Could it be 2.12? Could it be 
some other number? 

5.2 - Use `confint()` to find the 95% confidence interval of $\beta_1$. 
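
A minimal sketch of how this might look. The names `location_model` and `location_ci` are made up for illustration, and 95% is already the default level for `confint()`; it is written out here only to be explicit.

```r
library(coursekata)

# Fit the Location model, then ask for 95% confidence intervals
location_model <- lm(Points ~ Location, data = MiamiHeat)
location_ci <- confint(location_model, level = 0.95)
location_ci

# The second row is the interval for the Location coefficient (beta_1)
location_ci[2, ]
```

The first row of the output is the interval for $\beta_0$, so make sure you are reading the row for `Location`.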

5.3 - Use the sampling distributions in the PowerPoint slides to show what the 95% confidence
interval around $\beta_1$ means:
What does the lowest number mean? What does the highest number mean? What units are these values in?

5.4 - Even though these sampling distributions were made in different ways (e.g., shuffling 
versus resampling), they seem to show us the same confidence interval. Why?

5.5 - Based on all that we have done here, if you had to say what was the real effect of
playing at home, what would you say?

## 6.0 - Home Game Advantage... in something else?

6.1 - Is it possible there is a really big advantage to playing a home game?

6.2 - Which outcome of a basketball game seems like it would be most affected by playing at home:
points, free throws, or fouls?

6.3 - Which of these is most affected by the home game advantage? 

6.4 - Are you 95% confident that there is at least *some* effect of playing at home 
for that outcome?
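
One way to check all three outcomes at once is sketched below. It assumes that "free throws" means `FT` (free throws made) rather than `FTA`, and the names `outcomes` and `location_cis` are made up for illustration.

```r
library(coursekata)

# The three outcomes to compare (FT is an assumption; FTA is another option)
outcomes <- c("Points", "FT", "Fouls")

# Fit outcome ~ Location for each one and keep the 95% CI for the Location effect
location_cis <- t(sapply(outcomes, function(outcome) {
  model <- lm(reformulate("Location", response = outcome), data = MiamiHeat)
  confint(model)[2, ]
}))
location_cis
```

Each row is in the units of its own outcome (points, free throws, or fouls), so keep that in mind when you compare rows. An interval that does not include 0 is what being "95% confident of at least some effect" would look like.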