# Hypothesis Testing with Two Variables

**Hypothesis Testing Steps:**

1. State null and alternative hypotheses and significance level.
2. Assume that the null hypothesis is true, and choose a statistic to calculate based on your observed values.
3. Determine or estimate how your chosen statistic is distributed under the null hypothesis.
4. Find the $p$-value: how often would you see a sample statistic as extreme or more extreme than the one you observed?
5. If $p$-value is smaller than the significance level, reject the null hypothesis. Otherwise, do not reject the null hypothesis.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm

## Hypothesis Testing for Difference in Means

Oftentimes, you will be testing to see if there is a difference between two populations. For example, you might want to compare the amount of time spent sleeping by the male population vs the female population from the American Time Use Survey. In this case, you won't specify a direction for the difference. That is, you won't test that males get more sleep than females or vice versa. You'll only test that there is some difference. This means that you are doing a two-tailed test.

Before bringing in the data, let's state the null and alternative hypotheses. Remember that the null hypothesis says that there is no difference between the population means. Let $\mu_M$ represent the average time spent sleeping by males and $\mu_F$ represent the average time spent sleeping by females.

**Null Hypothesis:**

$H_0: \mu_M = \mu_F$

**Alternative Hypothesis:**

$H_1: \mu_M \neq \mu_F$

For this and all tests in this notebook, we'll use a 0.05 significance level.

Now, you can bring in the data, which is a sample of 25 men and 25 women along with the amount of time they reported sleeping.

In [None]:
sleeping = pd.read_csv('../data/atus_sleeping.csv')
sleeping.head()

First, look at some summary statistics.

In [None]:
sleeping.groupby('sex')['minutes_spent_sleeping'].describe()

In [None]:
import seaborn as sns

In [None]:
sns.boxplot(data = sleeping, x = 'sex', y = 'minutes_spent_sleeping');

There does appear to be a difference between males and females in terms of the amount of time spent sleeping. However, there is quite a bit of variability in the dataset, so you need to check to see how likely the difference that you observe is due simply to the randomness inherent in sampling.

Now, we need to compute our test statistic and compare the observed test statistic to the overall distribution of that test statistic.

### Method 1: Welch's Test 
**Fact:** If both populations are approximately normally distributed, then

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s}$$

approximately follows a $t$ distribution. Here

$$s = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

where $\bar{x}_1$ and $\bar{x}_2$ are the sample means, $s_1$ and $s_2$ are the sample standard deviations, and $n_1$ and $n_2$ are the sample sizes.

This $t$ distribution has degrees of freedom equal to

$$df = \left(\frac{s_1^2}{n_1}+ \frac{s_2^2}{n_2}\right)^2 / \left(\frac{s_1^4}{n_1^2(n_1-1)} + \frac{s_2^4}{n_2^2(n_2-1)} \right)$$

See http://www.dcscience.net/welch-1947.pdf if you really want to see details, but I don't recommend it.

Luckily for us, this has been implemented in `scipy.stats` as `ttest_ind`.

In [None]:
from scipy.stats import ttest_ind

To use this function, you need to pass in the observed values for each group and specify that you want `equal_var = False`.

In [None]:
t_results = ttest_ind(
    sleeping.loc[sleeping.sex == 'Male', 'minutes_spent_sleeping'],
    sleeping.loc[sleeping.sex == 'Female', 'minutes_spent_sleeping'],
    equal_var = False,
    alternative="two-sided"
)
t_results

In [None]:
from nssstats.plots import hypot_plot_mean_2sample

In [None]:
hypot_plot_mean_2sample(sleeping.loc[sleeping.sex == 'Male', 'minutes_spent_sleeping'],
                       sleeping.loc[sleeping.sex == 'Female', 'minutes_spent_sleeping'], type='both')

This says that if the null hypothesis is true and there is no difference in average sleeping times, you can expect to see a difference as large as what you observed more than 14% of the time. This is not particularly compelling evidence that the null is not true. You should not reject the null. There is not enough evidence to conclude that there is a difference in average sleeping times between males and females.
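To connect the formulas for $t$ and $df$ to what `ttest_ind` reports, here is a minimal sketch that computes Welch's statistic by hand. The data are synthetic stand-ins generated with NumPy (not the ATUS sample), and the names `group_1`, `group_2`, `t_manual`, `df_manual`, and `p_manual` are made up for this illustration.

```python
import numpy as np
from scipy.stats import t, ttest_ind

# Synthetic stand-in data (NOT the ATUS sample): two groups of 25 values
rng = np.random.default_rng(0)
group_1 = rng.normal(loc=520, scale=90, size=25)
group_2 = rng.normal(loc=490, scale=120, size=25)

n_1, n_2 = len(group_1), len(group_2)
# Sample variances (ddof=1 divides by n - 1)
var_1, var_2 = group_1.var(ddof=1), group_2.var(ddof=1)

# Standard error s and the t statistic
s = np.sqrt(var_1 / n_1 + var_2 / n_2)
t_manual = (group_1.mean() - group_2.mean()) / s

# Degrees of freedom from the formula above
df_manual = (var_1 / n_1 + var_2 / n_2) ** 2 / (
    var_1**2 / (n_1**2 * (n_1 - 1)) + var_2**2 / (n_2**2 * (n_2 - 1))
)

# Two-sided p-value: probability in both tails beyond |t|
p_manual = 2 * t.sf(np.abs(t_manual), df=df_manual)

scipy_result = ttest_ind(group_1, group_2, equal_var=False)
print(t_manual, df_manual, p_manual)
print(scipy_result.statistic, scipy_result.pvalue)
```

The two printed lines should agree on the statistic and the $p$-value up to floating-point error.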

### Method 2: Permutation Testing

Rather than having to rely on a complicated derivation for the distribution of our test statistic, we can instead estimate it through simulation.

In this case, the test statistic that we'll use is the difference between means for our two samples.

In [None]:
# Groups sort alphabetically (Female, Male), so diff().iloc[1] is mean(Male) - mean(Female)
observed_test_statistic = sleeping.groupby('sex')['minutes_spent_sleeping'].mean().diff().iloc[1]
observed_test_statistic

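Next, we randomly shuffle the labels and recalculate the difference. If the null hypothesis is true, the labels carry no information about sleep time, so the shuffled differences show what we would expect to see from chance alone.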

In [None]:
sleeping_copy = sleeping.copy()

sex = sleeping_copy['sex'].tolist()
np.random.shuffle(sex)
sleeping_copy['sex'] = sex
sleeping_copy.head()

In [None]:
permutation_difference = sleeping_copy.groupby('sex')['minutes_spent_sleeping'].mean().diff().iloc[1]
permutation_difference

Now, we just need to repeat this a large number of times and record the observed differences.

In [None]:
permutation_differences = []

num_permutations = 5000
for _ in tqdm(range(num_permutations)):
    sex = sleeping_copy['sex'].tolist()
    np.random.shuffle(sex)
    sleeping_copy['sex'] = sex
    permutation_difference = sleeping_copy.groupby('sex')['minutes_spent_sleeping'].mean().diff().iloc[1]
    permutation_differences.append(permutation_difference)

Let's compare the observed difference to the distribution of permutation differences.

In [None]:
plt.hist(
    permutation_differences,
    edgecolor="black"
)

ymin, ymax = plt.ylim()
plt.vlines(
    x=observed_test_statistic,
    ymin=ymin,
    ymax=ymax,
    color="red",
    linestyle="--"
)
plt.ylim(ymin, ymax);

How often did we get a permutation difference at least as large (in absolute value) as the observed difference?

In [None]:
(np.abs(np.array(permutation_differences)) >= np.abs(observed_test_statistic)).mean()

With this large a $p$-value, we cannot reject the null hypothesis.
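The loop above gives a different answer on every run because the shuffles are not seeded. As a rough sketch of one alternative, the same test can be written on plain arrays with a seeded NumPy `Generator`, which makes the result reproducible. The data here are synthetic (not the ATUS sample), and `mean_difference`, `is_group_a`, and `perm_diffs` are names invented for this example.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the result is reproducible

# Synthetic stand-in data (NOT the ATUS sample)
values = np.concatenate([
    rng.normal(loc=520, scale=100, size=25),
    rng.normal(loc=490, scale=100, size=25),
])
is_group_a = np.arange(50) < 25  # first 25 values belong to group A


def mean_difference(values, mask):
    """Mean of the masked group minus the mean of the remaining values."""
    return values[mask].mean() - values[~mask].mean()


observed = mean_difference(values, is_group_a)

# Shuffle the group labels, not the values, and recompute each time
num_permutations = 5000
perm_diffs = np.array([
    mean_difference(values, rng.permutation(is_group_a))
    for _ in range(num_permutations)
])

# Two-tailed: compare absolute values
p_value = (np.abs(perm_diffs) >= np.abs(observed)).mean()
print(observed, p_value)
```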

## Hypothesis Testing for Independence of Categorical Variables

What if we have two categorical variables and want to test whether they are related? That is, are the variables independent or dependent?

Let's look at the squirrel census data again. A lot of people have never seen a black squirrel, so we might hypothesize that black squirrels are more skittish and do not approach humans as frequently.

Let's formally state this as a null and alternative hypothesis.

**Null Hypothesis:**  
$H_0:$ The squirrel's primary fur color and likelihood of approaching are **independent**.

**Alternative Hypothesis:**   
$H_1:$ The squirrel's primary fur color and likelihood of approaching are **dependent**.

Now, let's bring in the data.

In [None]:
squirrels = pd.read_csv('../data/squirrels.csv')
squirrels = squirrels.dropna(subset=['Approaches', 'Primary Fur Color'])

Run a cross-tabulation to see how often black squirrels approach compared to other colors.

In [None]:
ct = pd.crosstab(squirrels['Approaches'], squirrels['Primary Fur Color'])
ct

What would it mean if the primary fur color and likelihood of approaching were independent? It would mean that
$$P(\text{Primary Fur Color }= x\text{ and Approaches }= y) = P(\text{Primary Fur Color }= x)\cdot P(\text{Approaches }= y).$$

We can use `value_counts` to see the estimated probabilities for each variable separately.

In [None]:
squirrels['Approaches'].value_counts(normalize=True).sort_index()

In [None]:
squirrels['Primary Fur Color'].value_counts(normalize=True).sort_index()

And to get the estimated probabilities if they were independent, we can use the _outer product_.

In [None]:
np.outer(
    squirrels['Approaches'].value_counts(normalize=True).sort_index(),
    squirrels['Primary Fur Color'].value_counts(normalize=True).sort_index()
)

In [None]:
probs_ind = pd.DataFrame(
    np.outer(
        squirrels['Approaches'].value_counts(normalize=True).sort_index(),
        squirrels['Primary Fur Color'].value_counts(normalize=True).sort_index()
    ),
    index = squirrels['Approaches'].value_counts(normalize=True).sort_index().index,
    columns = squirrels['Primary Fur Color'].value_counts(normalize=True).sort_index().index
)
probs_ind

Let's compare it to the observed proportions.

In [None]:
pd.crosstab(squirrels['Approaches'], squirrels['Primary Fur Color'], normalize=True)

While there isn't a perfect match, is it close enough that the difference could be attributed just to chance?

To determine this, we need a test statistic and a distribution to compare it against. The typical test statistic used in this situation compares the observed counts in each cell of our contingency table to the expected counts if the variables were independent.

In [None]:
expected_counts = probs_ind * ct.sum().sum()
expected_counts
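Multiplying the independent probabilities by the total count is the same as the familiar "row total times column total divided by $n$" rule. A small sketch on a made-up table (hypothetical counts, not the squirrel data) shows the two routes agree:

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 table of counts (NOT the squirrel data)
table = pd.DataFrame(
    [[30, 50, 20], [10, 70, 20]],
    index=["no", "yes"],
    columns=["a", "b", "c"],
)
n = table.to_numpy().sum()

row_totals = table.sum(axis=1)
col_totals = table.sum(axis=0)

# expected[i, j] = row_total[i] * col_total[j] / n
expected_from_totals = np.outer(row_totals, col_totals) / n

# The same thing via estimated probabilities, scaled back up to counts
expected_from_probs = np.outer(row_totals / n, col_totals / n) * n

print(expected_from_totals)
# Both rows come out as [20, 60, 20] because the two row totals are equal
```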

The test statistic is calculated as 

$$\chi^2 = \sum_{i,j} \frac{(observed_{i,j} - expected_{i,j})^2}{expected_{i,j}}$$

In [None]:
test_stat = ((ct - expected_counts)**2 / expected_counts).sum().sum()
test_stat

How unusual is this test statistic? To determine this, we need to compare it against the overall distribution of test statistics.

### Method 1: Use the $\chi^2$ Distribution

**Fact:** When the null hypothesis is true, the test statistic follows a $\chi^2$ distribution with degrees of freedom equal to $(r-1)\cdot(c-1)$, where $r$ and $c$ are the number of rows and columns of the contingency table, respectively.

Let's plot this distribution compared to the test statistic.

In [None]:
from scipy.stats import chi2

In [None]:
x = np.linspace(start=0, stop=25, num=1000)
plt.plot(
    x,
    chi2.pdf(x, df=2),
    color="black"
)
plt.plot(
    x,
    np.zeros_like(x),
    color="black"
)
ymin,ymax = plt.ylim()
plt.vlines(
    x=test_stat,
    ymin=ymin,
    ymax=ymax,
    color="red",
    linestyle="--"
)
plt.ylim(ymin, ymax);

To find the $p$-value, we need to know how often we get a value as extreme or more extreme than the one we observed. To do this, we can use the `sf` function. The abbreviation sf stands for "survival function" which gives the probability of a value at least as large as $x$.

In [None]:
p = chi2.sf(x=test_stat, df=2)
p
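For reference, `sf` is the upper-tail counterpart of `cdf`, so `sf(x)` equals `1 - cdf(x)` mathematically. A brief sketch with illustrative values (not taken from the squirrel data) shows the relationship, and also why `sf` is the safer choice far out in the tail:

```python
import numpy as np
from scipy.stats import chi2

# Illustrative value and degrees of freedom (not from the squirrel data)
value, df = 5.0, 2

upper_tail = chi2.sf(value, df=df)
print(upper_tail, 1 - chi2.cdf(value, df=df))

# With 2 degrees of freedom the upper tail has the closed form exp(-x / 2)
print(np.exp(-value / 2))

# Far in the tail, 1 - cdf loses all precision while sf does not
far_sf = chi2.sf(100, df=2)
far_one_minus_cdf = 1 - chi2.cdf(100, df=2)
print(far_sf, far_one_minus_cdf)
```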

Rather than doing all of these calculations ourselves, we can also rely on the `chi2_contingency` function from `scipy.stats`.

In [None]:
from scipy.stats import chi2_contingency

In [None]:
chi2_contingency(ct)

This says that if these variables were independent, we would see such an extreme test statistic with probability 0.0000112. This is well below the significance level, meaning that we can reject the null hypothesis and conclude that primary fur color and likelihood of approaching are dependent.
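One caveat when comparing a hand-computed statistic to `chi2_contingency`: by default the function applies Yates' continuity correction when the table has one degree of freedom (a 2x2 table). Assuming the fur color table has more than two columns, as the `df=2` used above suggests, the correction does not come into play here. The sketch below uses a made-up 2x2 table to show the difference.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of counts (NOT the squirrel data)
observed = np.array([[20, 30], [30, 20]])

# Hand-computed statistic, as in the formula above
n = observed.sum()
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
stat_manual = ((observed - expected) ** 2 / expected).sum()

# Default behavior: Yates' correction is applied because dof == 1
stat_corrected, p_corrected, dof, _ = chi2_contingency(observed)

# Turning the correction off reproduces the hand computation
stat_uncorrected, p_uncorrected, _, _ = chi2_contingency(observed, correction=False)

print(stat_manual, stat_uncorrected, stat_corrected)
```

Here every expected count is 25, so the uncorrected statistic is 4.0 and the corrected one is 3.24.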

### Method 2: Permutation Test

We can also permute one of our variables to estimate the distribution of test statistics.

In [None]:
permutation_stats = []
squirrels_copy = squirrels.copy()

num_permutations = 1000
for _ in tqdm(range(num_permutations)):
    approaches = squirrels_copy['Approaches'].tolist()
    np.random.shuffle(approaches)
    squirrels_copy['Approaches'] = approaches
    permutation_ct = pd.crosstab(squirrels_copy['Approaches'], squirrels_copy['Primary Fur Color'])
    permutation_test_stat = ((permutation_ct - expected_counts)**2 / expected_counts).sum().sum()
    permutation_stats.append(permutation_test_stat)

In [None]:
plt.hist(
    permutation_stats,
    edgecolor="black"
)

ymin, ymax = plt.ylim()
plt.vlines(
    x=test_stat,
    ymin=ymin,
    ymax=ymax,
    color="red",
    linestyle="--"
)
plt.ylim(ymin, ymax);

In [None]:
(np.array(permutation_stats) >= test_stat).mean()
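If none of the permuted statistics reaches the observed one (plausible here, given how small the $\chi^2$ $p$-value was), this proportion comes out as exactly 0. That only tells us the $p$-value is below roughly 1 in `num_permutations`. A common convention is to count the observed arrangement as one of the permutations, which keeps the estimate above zero. A minimal sketch, with `permutation_p_value` as a hypothetical helper name and made-up input values:

```python
import numpy as np


def permutation_p_value(permutation_stats, observed_stat):
    """Share of statistics at least as large as the observed one,
    counting the observed arrangement itself as one permutation."""
    permutation_stats = np.asarray(permutation_stats)
    num_extreme = (permutation_stats >= observed_stat).sum()
    return (num_extreme + 1) / (len(permutation_stats) + 1)


# Hypothetical example: none of 1000 permuted statistics reaches the observed one
fake_stats = np.linspace(0, 10, 1000)
adjusted_p = permutation_p_value(fake_stats, observed_stat=25.0)
print(adjusted_p)  # 1 / 1001
```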

## Hypothesis Testing for Correlation

Finally, what if we want to perform a test about the relationship between two numeric variables? Specifically, what if we want to test whether the correlation between two variables is nonzero?

In [None]:
nba = pd.read_csv('../data/nba_players.csv')
nba = nba.dropna()
nba.head()

In [None]:
nba[['height_inches', 'pts_per_game']].corr()

Let's formally state this as a null and alternative hypothesis.

**Null Hypothesis:**  
$H_0:$ There is zero correlation between height and points per game.

**Alternative Hypothesis:**   
$H_1:$ There is a non-zero correlation between height and points per game.

Now, we need to calculate our test statistic and compare it against a reference distribution.

### Method 1: Use the $t$-Distribution

We can use $$t = \frac{r\cdot\sqrt{n-2}}{\sqrt{1-r^2}}$$

as our test statistic, where $r$ is the observed correlation and $n$ is the sample size. If the null is true, this will follow a $t$ distribution with $n-2$ degrees of freedom. (Reference: https://online.stat.psu.edu/stat500/lesson/9/9.4/9.4.1)

In [None]:
r = nba[['height_inches', 'pts_per_game']].corr().iloc[0,1]
n = nba.shape[0]
print(r, n)

In [None]:
observed_t = r * np.sqrt(n-2) / np.sqrt(1 - r**2)
observed_t

In [None]:
from scipy.stats import t

In [None]:
x = np.linspace(start=-3, stop=3, num=1000)
plt.plot(
    x,
    t.pdf(x, df=n-2),
    color="black"
)
plt.plot(
    x,
    np.zeros_like(x),
    color="black"
)
ymin,ymax = plt.ylim()
plt.vlines(
    x=observed_t,
    ymin=ymin,
    ymax=ymax,
    color="red",
    linestyle="--"
)
plt.ylim(ymin, ymax);

In [None]:
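# Two-sided p-value: area in both tails beyond |observed_t|
# (2 * cdf would exceed 1 whenever observed_t is positive)
2 * t.sf(np.abs(observed_t), df=n-2)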

We can also let `scipy.stats` do the hard work for us.

In [None]:
from scipy.stats import pearsonr

In [None]:
pearsonr(
    x=nba['height_inches'],
    y=nba['pts_per_game']
)

As with the sleep data, the $p$-value is greater than the significance level, so we cannot reject the null hypothesis.
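As a sanity check on the $t$ formula for correlation, the sketch below computes the statistic and a two-sided $p$-value by hand and compares them with `pearsonr`. It uses synthetic data generated with NumPy (not the NBA data); `x_vals`, `y_vals`, `t_stat`, and `p_manual` are names chosen for this example.

```python
import numpy as np
from scipy.stats import pearsonr, t

# Synthetic stand-in data (NOT the NBA data)
rng = np.random.default_rng(1)
x_vals = rng.normal(size=40)
y_vals = 0.3 * x_vals + rng.normal(size=40)

r = np.corrcoef(x_vals, y_vals)[0, 1]
n = len(x_vals)

# t statistic with n - 2 degrees of freedom
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

# Two-sided p-value: both tails beyond |t|
p_manual = 2 * t.sf(np.abs(t_stat), df=n - 2)

result = pearsonr(x_vals, y_vals)
print(r, p_manual)
print(result[0], result[1])
```

Using `np.abs` inside `t.sf` means the same line works whether the observed correlation is positive or negative.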

### Method 2: Permutation

In [None]:
permutation_stats = []
nba_copy = nba.copy()

num_permutations = 5000
for _ in tqdm(range(num_permutations)):
    pts = nba_copy['pts_per_game'].tolist()
    np.random.shuffle(pts)
    nba_copy['pts_per_game'] = pts
    permutation_r = nba_copy[['height_inches', 'pts_per_game']].corr().iloc[0,1]
    permutation_stats.append(permutation_r)

In [None]:
plt.hist(
    permutation_stats,
    edgecolor="black"
)

ymin, ymax = plt.ylim()
plt.vlines(
    x=r,
    ymin=ymin,
    ymax=ymax,
    color="red",
    linestyle="--"
)
plt.ylim(ymin, ymax);

In [None]:
(np.abs(np.array(permutation_stats)) >= np.abs(r)).mean()

**Bonus (if time allows)**: We seem to be doing the same thing over and over to calculate our permutation stats. Perhaps we can write some reusable code so that we don't have to keep copying and pasting.
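One possible shape for that reusable code is sketched below. It is only a starting point: the function name `permutation_distribution`, its parameters, and the toy DataFrame are all invented for this illustration. The idea is to pass in the column to shuffle and a function that computes the statistic, so the same loop serves all three tests.

```python
import numpy as np
import pandas as pd


def permutation_distribution(df, shuffle_column, statistic,
                             num_permutations=1000, seed=None):
    """Shuffle one column repeatedly and collect the statistic each time.

    `statistic` is any function that takes a DataFrame and returns a number.
    The input DataFrame is left unchanged.
    """
    rng = np.random.default_rng(seed)
    shuffled = df.copy()
    original_values = df[shuffle_column].to_numpy()
    stats = []
    for _ in range(num_permutations):
        shuffled[shuffle_column] = rng.permutation(original_values)
        stats.append(statistic(shuffled))
    return np.array(stats)


# Usage on hypothetical data: a two-tailed test for correlation
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(size=30), "y": rng.normal(size=30)})
toy_before = toy.copy()


def correlation(data):
    return data["x"].corr(data["y"])


observed = correlation(toy)
perm = permutation_distribution(toy, "y", correlation,
                                num_permutations=500, seed=0)
p_value = (np.abs(perm) >= np.abs(observed)).mean()
print(observed, p_value)
```

For the difference in means or the $\chi^2$ statistic, only the `statistic` function and the comparison in the last step would need to change.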