## Resampling/Permutation Tests

With **resampling**, you draw repeated samples from observed data with the goal of assessing random variability in a statistic. Similar to the bootstrap, you are not going to try to analytically determine the distribution of the test statistic, but instead build it out of the observed sample.

**The Big Idea:** We are trying to determine whether two samples came from the same underlying distribution. If they did, then the group labels are irrelevant, and shuffling the labels still gives a sample from that same distribution.

You start with the null hypothesis that the two samples came from the same distribution. You then build the distribution of some test statistic (e.g., the difference in means) by randomly permuting the group labels a large number of times and recalculating the test statistic each time.

The $p$-value is then the proportion of test statistics calculated from permutations that were _at least as extreme_ as the observed test statistic.
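
For a very small sample, every possible relabeling can be enumerated, which makes this definition concrete. The following is a minimal sketch using only the standard library; the toy values in `group_a` and `group_b` are made up for illustration and are not from the datasets used later.

```python
from itertools import combinations
from statistics import mean

# Hypothetical toy samples, small enough to enumerate every relabeling.
group_a = [1, 2, 3]
group_b = [4, 5, 6]

pooled = group_a + group_b
observed = mean(group_b) - mean(group_a)

# Every way of choosing which pooled positions carry the "group_b" label.
diffs = []
for b_idx in combinations(range(len(pooled)), len(group_b)):
    b_vals = [pooled[i] for i in b_idx]
    a_vals = [pooled[i] for i in range(len(pooled)) if i not in b_idx]
    diffs.append(mean(b_vals) - mean(a_vals))

# Two-sided p-value: share of relabelings at least as extreme as observed.
# The small tolerance guards against floating-point ties.
extreme = [d for d in diffs if abs(d) >= abs(observed) - 1e-12]
p_value = len(extreme) / len(diffs)
print(len(diffs), p_value)
```

With realistic sample sizes the number of relabelings is far too large to enumerate, which is why the rest of this section draws a large number of random permutations instead.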

This is a non-parametric method, since it makes no assumption about the form of the distribution that generated the data (i.e., it doesn't matter whether it was a normal distribution).

See: http://faculty.washington.edu/yenchic/18W_425/Lec3_permutation.pdf

Let's look at the example with the amount of time spent sleeping. First, capture the observed difference in means.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
sleeping = pd.read_csv('../data/atus_sleeping.csv')
sleeping.head()

Before analyzing the data, let's state the null and alternative hypotheses.

**Null Hypothesis:**

$H_0$: The distribution of minutes spent sleeping for males is the same as for females.

**Alternative Hypothesis:**

$H_1$: The distribution of minutes spent sleeping for males is different from the distribution for females.

Again, you can use the 0.05 significance level.

Let's find the observed difference in means. We can start by using `groupby`.

In [None]:
means = sleeping.groupby('sex')['minutes_spent_sleeping'].mean()
means

Then calculate the difference in means.

In [None]:
observed_value = means['Female'] - means['Male']
observed_value

To generate our permutations, we can use the `resample` function from scikit-learn.

In [None]:
from sklearn.utils import resample

Before scaling up, let's see how it looks to do one permutation.

In [None]:
# First make a copy of the data
sleeping_permutation = sleeping.copy() 


# Then shuffle the sex column. Note that the replace argument must be False
sleeping_permutation['sex'] = resample(sleeping_permutation['sex'], replace = False).tolist()

From there, we can repeat the same steps as above.

In [None]:
means = sleeping_permutation.groupby('sex')['minutes_spent_sleeping'].mean()
means['Female'] - means['Male']
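
To see why sampling without replacement matters for the shuffle, it can help to compare group sizes before and after. The sketch below uses NumPy's `Generator.permutation` and `Generator.choice` in place of `resample`, on a made-up label array; the `counts` helper is hypothetical and exists only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability

# Hypothetical label column with unequal group sizes.
labels = np.array(["Female"] * 6 + ["Male"] * 4)

# Shuffling without replacement only reorders the labels.
shuffled = rng.permutation(labels)

# Drawing with replacement (a bootstrap draw) can change the group sizes.
bootstrapped = rng.choice(labels, size=len(labels), replace=True)

def counts(arr):
    """Return a {label: count} dict for a 1-D array of labels."""
    values, n = np.unique(arr, return_counts=True)
    return dict(zip(values.tolist(), n.tolist()))

print("original:    ", counts(labels))
print("shuffled:    ", counts(shuffled))
print("bootstrapped:", counts(bootstrapped))
```

The shuffled labels always keep the original group sizes, so each permutation is a genuine relabeling of the same observations. A draw with replacement gives no such guarantee.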

Now that we have seen it for one permutation, let's use a for loop to scale it up.

In [None]:
num_permutations = 10000

permutation_df = sleeping.copy()

permutation_values = []

for _ in range(num_permutations):
    permutation_df['sex'] = resample(permutation_df['sex'].tolist(), replace = False)
    
    means = permutation_df.groupby('sex')['minutes_spent_sleeping'].mean()
    permutation_values.append(means['Female'] - means['Male'])

We can create a histogram to compare our observed value to the permutation differences.

In [None]:
plt.hist(permutation_values, edgecolor = 'black', bins = 25)
ymin, ymax = plt.ylim()
plt.vlines(x = observed_value, ymin = ymin, ymax = ymax, color = 'red', linestyle = '--')
plt.ylim(ymin, ymax);

To get the p-value, we need to find the proportion of permutation differences that are at least as extreme as the observed difference. Because the alternative hypothesis is two-sided, we compare absolute values.

In [None]:
(np.abs(permutation_values) >= np.abs(observed_value)).mean()
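
The choice of comparison determines which tail, or tails, the p-value counts. The sketch below contrasts a one-sided and a two-sided calculation on a small made-up array (the numbers are hypothetical, chosen so the counts are easy to check by hand). It also shows a commonly used add-one adjustment, which counts the observed labeling as one of the permutations so that the estimate is never exactly zero.

```python
import numpy as np

# Hypothetical permutation statistics and observed value, for illustration only.
permutation_values = np.array([-3, -2, -1, 0, 1, 2, 3, 4])
observed_value = 3

# One-sided (upper tail): permutations at least as large as the observed value.
p_upper = (permutation_values >= observed_value).mean()

# Two-sided: permutations at least as far from zero as the observed value.
p_two_sided = (np.abs(permutation_values) >= abs(observed_value)).mean()

# Add-one adjustment applied to the two-sided count.
n_extreme = (np.abs(permutation_values) >= abs(observed_value)).sum()
p_adjusted = (n_extreme + 1) / (len(permutation_values) + 1)

print(p_upper, p_two_sided, p_adjusted)
```

A one-sided comparison answers "is the difference larger?", while the two-sided version matches an alternative hypothesis that only says the distributions differ.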

**Question:** What is our conclusion?

Now, repeat this for the grooming dataset.

In [None]:
grooming = pd.read_csv('../data/atus_grooming.csv')
grooming.head()

In [None]:
means = grooming.groupby('sex')['minutes_spent_grooming'].mean()
observed_value = means['Female'] - means['Male']
observed_value

In [None]:
num_permutations = 10000

permutation_df = grooming.copy()

permutation_values = []

for _ in range(num_permutations):
    permutation_df['sex'] = resample(permutation_df['sex'].tolist(), replace = False)
    
    means = permutation_df.groupby('sex')['minutes_spent_grooming'].mean()
    permutation_values.append(means['Female'] - means['Male'])

In [None]:
plt.hist(permutation_values, edgecolor = 'black', bins = 25)
ymin, ymax = plt.ylim()
plt.vlines(x = observed_value, ymin = ymin, ymax = ymax, color = 'red', linestyle = '--')
plt.ylim(ymin, ymax);

In [None]:
# Two-sided p-value: proportion of permutation differences at least as extreme in magnitude
(np.abs(permutation_values) >= np.abs(observed_value)).mean()

**Question:** What is our conclusion?
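
Since the grooming analysis repeats the sleeping steps almost line for line, the procedure could be wrapped in a function. The following is one possible sketch, not the notebook's own code: the name `permutation_test_diff_means` and its parameters are hypothetical, it uses NumPy's `Generator.permutation` rather than `resample`, and it runs on synthetic stand-in data rather than the ATUS files.

```python
import numpy as np

def permutation_test_diff_means(values, labels, group_1, group_2,
                                num_permutations=10000, seed=None):
    """Two-sided permutation test for mean(group_1) - mean(group_2).

    Returns (observed difference, permutation differences, p-value).
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)

    def diff(lab):
        # Difference in group means under a given labeling.
        return values[lab == group_1].mean() - values[lab == group_2].mean()

    observed = diff(labels)
    permuted = np.array([diff(rng.permutation(labels))
                         for _ in range(num_permutations)])
    p_value = (np.abs(permuted) >= np.abs(observed)).mean()
    return observed, permuted, p_value

# Synthetic stand-in data: the two group means differ by 2.
rng = np.random.default_rng(1)
values = np.concatenate([rng.normal(10, 1, 40), rng.normal(8, 1, 40)])
labels = np.array(["Female"] * 40 + ["Male"] * 40)

observed, permuted, p_value = permutation_test_diff_means(
    values, labels, "Female", "Male", num_permutations=2000, seed=2)
print(observed, p_value)
```

With a DataFrame, the same function could be called with a value column and a label column converted to arrays, for example via `.to_numpy()`.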

## Permutation Testing of Correlation

Let's see how to conduct a hypothesis test about correlation. We'll step through the example from the slides. Recall that the null and alternative hypotheses were:

$$H_0: \text{There is no relationship between temperature and NOx concentration}$$

$$H_1: \text{There is a relationship between temperature and NOx concentration.}$$

Read in the data.

In [None]:
air_quality = pd.read_csv('../data/air_quality.csv')

In [None]:
air_quality.head()

The scatterplot of the two relevant variables: `Temperature` and `NOx`

In [None]:
air_quality.plot(kind = 'scatter', x = 'Temperature', y = 'NOx');

The observed correlation:

In [None]:
air_quality[['Temperature','NOx']].corr()

In [None]:
observed_value = air_quality[['Temperature','NOx']].corr().iloc[0,1]
observed_value

In [None]:
num_permutations = 10000

permutation_df = air_quality.copy()

permutation_values = []

for _ in range(num_permutations):
    permutation_df['NOx'] = resample(permutation_df['NOx'].tolist(), replace = False)
    
    permutation_values.append(permutation_df[['Temperature','NOx']].corr().iloc[0,1])

In [None]:
plt.hist(permutation_values, edgecolor = 'black', bins = 25)
ymin, ymax = plt.ylim()
plt.vlines(x = observed_value, ymin = ymin, ymax = ymax, color = 'red', linestyle = '--')
plt.ylim(ymin, ymax);

In [None]:
# Two-sided p-value; using absolute values works whether the observed correlation is negative or positive
(np.abs(permutation_values) >= np.abs(observed_value)).mean()
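
The whole correlation test can also be sketched with NumPy alone. The example below is an illustration under assumptions: it uses synthetic data with a built-in linear relationship rather than the air quality file, and `np.corrcoef` in place of the DataFrame `corr` method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: y depends linearly on x, plus noise.
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)

observed_r = np.corrcoef(x, y)[0, 1]

# Shuffling y breaks any pairing with x, which is what the null hypothesis asserts.
num_permutations = 2000
permuted_r = np.array([np.corrcoef(x, rng.permutation(y))[0, 1]
                       for _ in range(num_permutations)])

# Two-sided p-value; abs() makes it independent of the sign of observed_r.
p_value = (np.abs(permuted_r) >= np.abs(observed_r)).mean()
print(observed_r, p_value)
```

Only one of the two columns needs to be shuffled, since shuffling either one is enough to break the pairing between them.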

**Question:** What is our conclusion?