# Foundations of inference - simulating randomness

A study in 1994 examined 491 dogs that had developed cancer and 945 dogs as a control group to determine whether there is an increased risk of cancer in dogs that are exposed to the herbicide 2,4-Dichlorophenoxyacetic acid (2,4-D).

In this lab you'll use the data from this study to investigate how simulating from a sample can be used to construct p-values.

# Getting started

## Load packages

For this lab we will need the following packages.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
```

## Creating a reproducible lab report

You will be using Jupyter notebook to create reproducible lab reports. Download the lab report template and load the template into Jupyter notebook. These templates can be used for each of the labs.

## The data

The data we are working with are in the cancer_in_dogs.csv file. Download the file and load it into **python** as a data frame named `cancer_in_dogs`. The two variables used are `order`, which records whether the dog was exposed to 2,4-D, and `response`, which records whether the dog developed cancer.
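One way to load the file is sketched below. It assumes the CSV has been saved in the same folder as your notebook; the helper name `load_lab_data` and the default path are illustrative, not part of the lab template.

```python
import pandas as pd

def load_lab_data(path="cancer_in_dogs.csv"):
    """Read a lab CSV file into a data frame.

    The default path is an assumption: it expects the file to sit
    next to the notebook. Pass a different path if yours is elsewhere.
    """
    return pd.read_csv(path)

# Uncomment once the file has been downloaded:
# cancer_in_dogs = load_lab_data()
# cancer_in_dogs.head()
```

If **python** reports that the file cannot be found, check where the file was saved and pass that location to the function.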

You can compute some observed statistics in this data set. For example, of those dogs exposed to (2,4-D), what proportion developed cancer?

```python
p_24D = cancer_in_dogs.groupby('order')['response'].value_counts(normalize=True)[('2,4-D','cancer')]
print("The proportion of dogs exposed to (2,4-D) who developed cancer is", p_24D)
```

Recall from lab 2 that `.groupby().value_counts(normalize=True)` returns the proportion of each `response` value within each `order` group. By adding `[('2,4-D','cancer')]` at the end of the line, you are saying that you are only interested in the proportion for the combination where the order is `'2,4-D'` and the response is `'cancer'`.
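To see what that lookup does, here is a small sketch on a made-up data frame of five dogs. The name `toy` and its values are assumptions for illustration only; they are not the study data.

```python
import pandas as pd

# Made-up data for illustration only; not the study data.
toy = pd.DataFrame({
    'order': ['2,4-D', '2,4-D', '2,4-D', 'no 2,4-D', 'no 2,4-D'],
    'response': ['cancer', 'no cancer', 'cancer', 'no cancer', 'cancer'],
})

# Proportion of each response within each order group.
# The result is indexed by (order, response) pairs.
toy_props = toy.groupby('order')['response'].value_counts(normalize=True)
print(toy_props)

# Pick out a single (order, response) combination with a tuple.
toy_p_24D = toy_props[('2,4-D', 'cancer')]
print(toy_p_24D)
```

Two of the three made-up dogs in the `'2,4-D'` group have `'cancer'`, so the selected proportion is two thirds.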

<div class="alert alert-block alert-info">
<b>Exercise 1:</b> Determine the observed difference in the proportion of dogs that developed cancer between those exposed to 2,4-D and those not exposed.</div>

Is this difference in proportions simply random variability from the sample? How likely would it be to see a difference this large by random chance? We can answer these questions using simulation.

# Simulating randomness

## Creating the simulation

A deck of cards can be used to simulate random assignment of response variables. You would need a deck of 1,436 cards. Of those, 495 would be marked to indicate 2,4-D. The other 941 would be marked to indicate no 2,4-D. You would then need to somehow shuffle the 1,436 cards and deal them into two piles. One pile would have 491 cards to represent the dogs who develop cancer. The other 945 cards represent the dogs that do not develop cancer. Then you need to count the number of 2,4-D in each pile, compute proportions, and subtract. That is a lot of work to do by hand! Let's instead create a virtual deck and have **python** do the hard work for us.

```python
sim_cancer_in_dogs = pd.DataFrame().assign(order=cancer_in_dogs['order'], response=cancer_in_dogs['response'].sample(frac=1, ignore_index=True))
```

This might seem like a lot of code, but you are familiar with some of it. You are creating a new data frame using the `pd.DataFrame()` function and `assign`ing two columns to it. The first column is named `order` and holds the same data as `cancer_in_dogs['order']`. The second column is named `response` and holds the same values as `cancer_in_dogs['response']`, but in a mixed-up order. The mixing is accomplished using the `.sample()` function. Typically, `.sample()` is used to randomly sample without replacement from a data set. By setting the parameter `frac=1`, however, we sample 100% of the data randomly without replacement, effectively shuffling the deck of cards. The parameter `ignore_index=True` renumbers the shuffled rows starting from 0; without it, **python** would match the shuffled values back to their original rows and the shuffle would be undone. Each time this command is repeated the deck gets reshuffled.
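The sketch below illustrates the shuffle on a made-up column of five cards. The names `cards`, `shuffled`, and `undone` are assumptions for illustration, and `random_state=1` is set only so the sketch gives the same shuffle each time it is run.

```python
import pandas as pd

# Made-up column for illustration only; not the study data.
cards = pd.Series(['cancer', 'cancer', 'no cancer', 'no cancer', 'no cancer'])

# Shuffle all of the cards and renumber the rows 0, 1, 2, ...
shuffled = cards.sample(frac=1, ignore_index=True, random_state=1)

# The same cards are still there; only their order may differ.
print(shuffled.value_counts())

# Without ignore_index, each shuffled value keeps its original row label,
# so assign() lines the values back up by label and the shuffle is undone.
undone = pd.DataFrame().assign(
    before=cards,
    after=cards.sample(frac=1, random_state=1),
)
print((undone['before'] == undone['after']).all())
```

Shuffling never changes how many cards of each kind there are. It only changes which row each card lands in.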

<div class="alert alert-block alert-info">
<b>Exercise 2:</b> Create a "deck of cards" to represent the data in the cancer_in_dogs dataset. Randomize the data into the two sets of 2,4-D and no 2,4-D. Then compute the difference in proportions for your new simulated dataset. Do this a total of three times, reporting results each time.</div>

<div class="alert alert-block alert-info">
<b>Exercise 3:</b> Describe what shuffling and dealing the deck of 1,436 cards accomplishes. What assumptions are made in this process? Describe how the code equivalently accomplishes the task of shuffling and dealing.</div>

Ideally, you want to simulate this randomization a large number of times. Repeatedly running the same bit of code is tedious. Similar to lab 3, you can use a `for` loop to repeat the code as many times as you would like. Let's simulate 100 shuffles.

```python
simulated_results = []
for x in range(100):
    sim_cancer_in_dogs = pd.DataFrame().assign(order=cancer_in_dogs['order'], response=cancer_in_dogs['response'].sample(frac=1, ignore_index=True))
    p_24D = sim_cancer_in_dogs.groupby('order')['response'].value_counts(normalize=True)[('2,4-D','cancer')]
    p_no_24D = sim_cancer_in_dogs.groupby('order')['response'].value_counts(normalize=True)[('no 2,4-D','cancer')]
    simulated_results.append(p_24D - p_no_24D)
```

There is more going on in this code than you might expect. Each iteration of the loop shuffles the deck and computes the difference in proportions. You need a way of storing each of those differences, though. The variable created in the first line, `simulated_results`, is an empty list. In each iteration of the loop, the difference in proportions is `append`ed to the list. Thus, at the end of the `for` loop, the `simulated_results` list will have 100 entries. Each entry is the difference in proportions from a single shuffle. You can visualize this using a histogram.

```python
sns.histplot(data=simulated_results, bins=20)
```

Note that the `for` loop is using a different syntax. Unlike in lab 3 where you had a particular list that you wanted to cycle through, here you only want the loop to run a certain number of times. The command `for x in range(100):` does exactly that. The command will loop the indented code 100 times.
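If you want the same shuffles every time you rerun your notebook, one option is to wrap the loop in a function that takes a seed. The sketch below is one possible way to do that, not the lab's required approach. The function name `simulate_differences`, the `seed` parameter, and the `toy` data are all assumptions for illustration. It also computes each proportion as the mean of a True/False column instead of using the `value_counts` lookup.

```python
import numpy as np
import pandas as pd

def simulate_differences(data, n_shuffles, seed=None):
    """Shuffle `response` n_shuffles times and return the list of
    differences in cancer proportions (2,4-D minus no 2,4-D)."""
    rng = np.random.default_rng(seed)  # seeded random number generator
    results = []
    for _ in range(n_shuffles):
        shuffled = data['response'].sample(
            frac=1, ignore_index=True, random_state=rng
        )
        sim = pd.DataFrame().assign(order=data['order'], response=shuffled)
        # The mean of True/False values is the proportion that are True.
        props = sim['response'].eq('cancer').groupby(sim['order']).mean()
        results.append(props['2,4-D'] - props['no 2,4-D'])
    return results

# Made-up data for illustration only; not the study data.
toy = pd.DataFrame({
    'order': ['2,4-D'] * 4 + ['no 2,4-D'] * 6,
    'response': ['cancer'] * 3 + ['no cancer'] + ['cancer'] + ['no cancer'] * 5,
})

toy_results = simulate_differences(toy, n_shuffles=20, seed=42)
print(len(toy_results))
```

Calling the function again with the same `seed` gives the same list of differences; leaving `seed` out gives a fresh set of shuffles each time.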

<div class="alert alert-block alert-info">
<b>Exercise 4:</b> Repeat the simulation 1,000 times and create a histogram of the results.</div>

<div class="alert alert-block alert-info">
<b>Exercise 5:</b> Describe the histogram above. Why is the histogram shaped the way that it is? What do the data in the histogram represent? Why is the center of the histogram where it is? What assumption is being made regarding 2,4-D and cancer in dogs?</div>

# Observed statistic vs. null statistics

Now that you have a simulation of what could happen with random assignment, you can get a sense of how likely the observed outcome is. One way you could accomplish this is by looking at the histogram you created above, approximate where the observed value is on the graph, and guess at how much of the histogram is more extreme than that value. Alternatively, you could have **python** do it for you. The following code can be used to create a histogram with all of the important pieces, though you will need to input the observed value you got above.

```python
#Type in the observed value you found in exercise 1 in the line below.
obs_value = 

ax = sns.histplot(data=[i for i in simulated_results if i <= obs_value], binrange = (min(simulated_results), obs_value))
sns.histplot(data=[i for i in simulated_results if i > obs_value], binrange = (obs_value, max(simulated_results)), color = "red")
ax.axvline(x = obs_value, ymin = 0, ymax = 1, color = "black", linestyle = "dashed")
```

This code is likely more than you need, but it makes a pretty picture. Let's break down pieces of it.
- The first line begins with `#`. This denotes a comment in the code. A comment is something **python** ignores when it runs the code. Think of it as a note for someone reading the code. In this case, it is a note to you to input your observed value in the following line.
- The `histplot` function is being used differently. Notably, the graphic (histogram) is given the variable name `ax`, and multiple components are added to it. The first `histplot` line creates the blue part of the histogram, at or to the left of the observed value. The second `histplot` line creates the red part of the histogram, to the right of the observed value. The last line draws the dashed vertical line on the histogram at the observed value.

<div class="alert alert-block alert-info">
<b>Exercise 6:</b> Estimate the percent of simulations more extreme than the observed value, i.e. what percent of the histogram is red?</div>

Now let's find out the exact percent of simulations more extreme than the observed value. The percent of simulations more extreme is $$\frac{\text{number of simulations with a difference in proportion greater than the observed}}{\text{number of simulations}}$$ You can use the `len` (length) function to count these numbers.

```python
len([i for i in simulated_results if i > obs_value]) / len(simulated_results)
```

- The denominator uses the `len` function on `simulated_results` to determine the number of simulations. You could also type in 1000 in the denominator, as that is how many simulations you did. The value in using `len(simulated_results)` is that if you change the number of simulations in your code, you only need to change it in one place.
- The numerator is using the same `len` function, but on a new list that is being created. The list is cycling through the elements of `simulated_results` and only keeping those that are greater than `obs_value`.
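As a quick check of the counting logic, here is a sketch on a made-up list of five simulated differences. The names `toy_results` and `toy_obs` and their values are assumptions for illustration only. It also shows an equivalent calculation using `numpy`, which is not one of the packages loaded at the start of this lab.

```python
import numpy as np

# Made-up simulated differences and observed value, for illustration only.
toy_results = [-0.03, 0.01, 0.05, 0.08, -0.02]
toy_obs = 0.04

# Counting approach, as in the lab.
by_len = len([i for i in toy_results if i > toy_obs]) / len(toy_results)

# Equivalent: the mean of True/False values is the fraction that are True.
by_mean = np.mean(np.array(toy_results) > toy_obs)

print(by_len, by_mean)
```

Two of the five made-up values (0.05 and 0.08) are greater than 0.04, so both approaches give 0.4.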

<div class="alert alert-block alert-info">
<b>Exercise 7:</b> What is the actual percent of the simulated data more extreme than the observed value from the study? Interpret the results. What does this percent mean in the context of the 2,4-D study?</div>

<div class="alert alert-block alert-info">
<b>Exercise 8:</b> The percent is a $p$-value, which is a probability. What is this a probability of? What assumptions are made related to the $p$-value?</div>

---

# Additional Practice

<div class="alert alert-block alert-info">
<b>Exercise 9:</b> Repeat all of the above for the gender_discrimination data set, which comes from a study in the 1970s about whether gender influences hiring recommendations.</div>