## Resampling/Permutation Tests

In this notebook, we'll see how to use a permutation test to assess whether two populations have the same distribution.

For more information about permutation testing, see these notes: http://faculty.washington.edu/yenchic/18W_425/Lec3_permutation.pdf

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

**Example 1:** In this example, we'll look at the number of minutes spent sleeping reported to the American Time Use Survey.

Specifically, we'll be comparing the amount of time spent sleeping reported by males to the amount reported by females.

**Null Hypothesis:** The distribution of the time spent sleeping by females is the same as the distribution for males.

**Alternative Hypothesis:** The distribution of the time spent sleeping by females has a higher mean than the distribution for males.

To assess this, we'll use the difference in mean minutes between the two groups as our test statistic.

In [None]:
sleeping = pd.read_csv('../data/atus_sleeping.csv')
sleeping.head(2)

In [None]:
group_means = sleeping.groupby('sex')['minutes_spent_sleeping'].mean()
group_means

In [None]:
observed_difference = group_means['Female'] - group_means['Male']
observed_difference

To perform a permutation test, we need to shuffle the labels, then look at the new difference in means between the groups.

In [None]:
num_group1 = len(sleeping.loc[sleeping['sex'] == 'Female'])     # How many observations were female?
values = sleeping['minutes_spent_sleeping'].tolist()            # Extract the values column as a list
values[:5]

In [None]:
# Then, use numpy's shuffle function to permute the values (the list is shuffled in place)
np.random.shuffle(values)
values[:5]

In [None]:
# Finally, look at the permuted differences
# We can allocate the beginning of the shuffled values to females and the remainder to males.
np.mean(values[:num_group1]) - np.mean(values[num_group1:])
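The cells above shuffle the values and keep the group sizes fixed, which is equivalent to shuffling the labels. As a minimal sketch of the label-shuffling view, the example below uses a small made-up DataFrame (not the ATUS data); the names `toy`, `shuffled`, and `toy_permuted_difference` are hypothetical and chosen so they do not overwrite the notebook's variables.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only (not the ATUS data)
toy = pd.DataFrame({
    'sex': ['Female', 'Female', 'Female', 'Male', 'Male'],
    'minutes': [500, 480, 520, 450, 470],
})

rng = np.random.default_rng(seed=0)  # Seeded so the shuffle is reproducible

# Shuffle the labels while leaving the values in place
shuffled = toy.assign(sex=rng.permutation(toy['sex'].to_numpy()))

# Group sizes are unchanged; only the assignment of values to groups changes
shuffled_means = shuffled.groupby('sex')['minutes'].mean()
toy_permuted_difference = shuffled_means['Female'] - shuffled_means['Male']
toy_permuted_difference
```

Because only the labels move, each group keeps its original size and the overall set of values is unchanged.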

Let's automate this process using a for loop.

In [None]:
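df = sleeping                        # DataFrame holding the data
column = 'minutes_spent_sleeping'    # Numeric column to compare
groups = 'sex'                       # Column containing the group labels
group1 = 'Female'                    # Group whose mean comes first in the difference

permutation_differences = []
values = df[column].tolist()
num_group1 = len(df[df[groups] == group1])    # Size of group1; the other group gets the rest

for _ in range(10000):
    np.random.shuffle(values)    # Shuffles the list in place
    # Treat the first num_group1 shuffled values as group1 and the remainder as the other group
    permutation_differences.append(np.mean(values[:num_group1]) - np.mean(values[num_group1:]))

permutation_differences = np.array(permutation_differences)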
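The loop above can also be wrapped in a function so that it can be reused on other data. The following is a minimal sketch, not part of the original notebook: the function name `simulate_permutation_differences`, its parameters, and the toy data are all hypothetical. It assumes exactly two groups and uses NumPy's `default_rng` with an optional seed for reproducibility, which is a different random number interface from the `np.random.shuffle` call used above.

```python
import numpy as np
import pandas as pd

def simulate_permutation_differences(df, column, groups, group1,
                                     num_permutations=10000, seed=None):
    """Return an array of permuted differences in means (group1 minus the rest)."""
    rng = np.random.default_rng(seed)          # Seeded generator for reproducibility
    values = df[column].to_numpy()
    num_group1 = int((df[groups] == group1).sum())

    differences = np.empty(num_permutations)
    for i in range(num_permutations):
        permuted = rng.permutation(values)     # Returns a shuffled copy
        differences[i] = permuted[:num_group1].mean() - permuted[num_group1:].mean()
    return differences

# Toy data, made up for illustration only (not the ATUS data)
toy_df = pd.DataFrame({
    'sex': ['Female', 'Female', 'Female', 'Male', 'Male'],
    'minutes': [500, 480, 520, 450, 470],
})
toy_differences = simulate_permutation_differences(
    toy_df, 'minutes', 'sex', 'Female', num_permutations=1000, seed=0
)
```

Passing the same `seed` twice gives the same array of differences, which makes results easier to check.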

Now, we can compare the distribution of permutation differences to the observed difference.

In [None]:
plt.hist(permutation_differences, bins = 40, edgecolor = 'black')
ymin, ymax = plt.ylim()
plt.vlines(x = observed_difference,
           ymin = ymin,
           ymax = ymax,
           linestyle = '--',
           color = 'red');

Finally, let's find the proportion of permutations in which the difference in means was at least as extreme as the observed difference.

In [None]:
(permutation_differences >= observed_difference).mean()
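This proportion is an estimate of the one-sided p-value, and it can come out as exactly 0 when none of the sampled permutations is as extreme as the observed difference. One commonly used adjustment counts the observed arrangement as one of the permutations, giving (count + 1) / (n + 1). A minimal sketch, assuming this adjustment is wanted; the function name `one_sided_p_value`, the `add_one` flag, and the example numbers are hypothetical and not from the ATUS data.

```python
import numpy as np

def one_sided_p_value(differences, observed, add_one=True):
    """Proportion of permuted differences at least as large as the observed one."""
    differences = np.asarray(differences)
    count = int(np.sum(differences >= observed))
    n = len(differences)
    if add_one:
        # Count the observed arrangement itself as one permutation
        return (count + 1) / (n + 1)
    return count / n

# Made-up numbers for illustration only
example_differences = np.array([-2.0, -1.0, 0.0, 1.0, 3.0])
raw_p = one_sided_p_value(example_differences, 1.0, add_one=False)   # 2 of 5 are >= 1.0
adjusted_p = one_sided_p_value(example_differences, 1.0)             # (2 + 1) / (5 + 1)
```

With 10000 permutations the two versions differ very little, so the choice matters mainly when the count of extreme permutations is zero or very small.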

**Question:** What is our conclusion?



**Example 2:** In this example, we'll look at the number of minutes spent grooming reported to the American Time Use Survey. 

**Null Hypothesis:** The distribution of the time spent grooming by females is the same as the distribution for males.

**Alternative Hypothesis:** The distribution of the time spent grooming by females has a higher mean than the distribution for males.

In [None]:
grooming = pd.read_csv('../data/atus_grooming.csv')

In [None]:
grooming.head()

**Your Turn**

First, calculate the observed difference between the average number of minutes spent grooming by females and the average number of minutes spent grooming by males (female mean minus male mean, as in Example 1). Save the result to the `observed_difference` variable.

In [None]:
# Your Code Here

Now, copy the permutation loop from Example 1 and modify it to generate 10000 permutation differences. Save them to a numpy array named `permutation_differences`.

In [None]:
# Your Code Here

Next, run the cell below to look at the distribution of permutation differences compared to your observed difference.

In [None]:
plt.hist(permutation_differences, bins = 40, edgecolor = 'black')
ymin, ymax = plt.ylim()
plt.vlines(x = observed_difference,
           ymin = ymin,
           ymax = ymax,
           linestyle = '--',
           color = 'red');

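Finally, find the proportion of permutation differences that were at least as large as the observed difference. Because the alternative hypothesis is one-sided, "as extreme or more extreme" here means greater than or equal to.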

In [None]:
# Your Code Here

**Question:** What is our conclusion?