# Hypothesis Testing

In [2]:
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
from math import gamma
%matplotlib inline

### Comprehension Check

Which test would you run for each of these scenarios?

1. The average salary per month of an English Premier League player is €240,000. You would like to test whether players who don't have a dominant foot make more than the rest of the league. There are only 25 players who are considered ambidextrous. 

<details>
<summary>
    Check
</summary>
    - Predictor variable is categorical (dominant foot or not) <br/>
    - Outcome variable is quantitative (monthly salary) <br/>
    - Two groups compared <br/>
    - Therefore: $t$-test
        - ONE sample (comparing small ambidextrous sample with population)
        - ONE tail (MORE $ or not)
</details>

2. You would like to test whether there is a difference in arrest rates across neighborhoods with different racial majorities. You have point statistics of mean arrest rates associated with neighborhoods of majority White, Black, Hispanic, and Asian populations.

<details>
    <summary>
        Check
    </summary>
    ANOVA
</details>

3. You are interested in testing whether the superstition that black cats are bad luck affects adoption rate. You would like to test whether black-fur shelter cats get adopted at a different rate from cats of other fur colors.

<details>
    <summary>
        Check
    </summary>
    $t$-test
        - TWO samples (black cats and non-black cats)
        - TWO tails (greater or lesser rate)
</details>

4. You are interested in whether car-accident rates in cities where marijuana is legal differ from the general rate of car accidents. Assume you know the standard deviation of car accident rates across all U.S. cities.

<details>
    <summary>
        Check
    </summary>
    $z$-test
</details>

<br>Our $z$-score tells us how many standard deviations away from the mean our point is.
<br>We assume that the population we sample from is normally distributed, and we are familiar with the empirical rule: <br>68:95:99.7

![](img/Empirical_Rule_2.png)
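The empirical rule can be checked numerically. The following is a minimal sketch using `stats.norm.cdf` on a standard normal; the dictionary name `within` is only an illustrative choice.

```python
from scipy import stats

# Probability mass of a standard normal within k standard deviations of the mean
within = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} sd: {p:.4f}")
```

The three values should land close to 68%, 95%, and 99.7%.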

### Z-score

Recall the following example: Assume height for women is normally distributed with a mean of 65 inches and a standard deviation of 4 inches. What is the $z$-score of a woman who is 75 inches tall?

* Regular z_score of population data is: <br>


> For a single point in relation to a distribution of points:

> $z = \dfrac{{x} - \mu}{\sigma}$



#### In Python Code
z = (x - mu)/(std)  # x is a single observation, not a sample mean
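Plugging the height example into this formula gives a quick worked case. This is a small sketch; the variable names are illustrative.

```python
x = 75      # height of the woman in question (inches)
mu = 65     # population mean height (inches)
sigma = 4   # population standard deviation (inches)

# Number of standard deviations between the observation and the mean
z = (x - mu) / sigma
print(z)  # 2.5
```

A height of 75 inches sits 2.5 standard deviations above the mean.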

* z_score for sampling data is:

When we are working with a sampling distribution, the z score is equal to <br><br>  
> $z = \dfrac{{\bar{x}} - \mu_{0}}{\dfrac{\sigma}{\sqrt{n}}}$

#### In Python Code
z = (x_bar - mu)/(std/np.sqrt(n))

### Variable review: 

$\bar{x}$ equals the sample mean.
<br>$\mu_{0}$ is the mean associated with the null hypothesis.
<br>$\sigma$ is the population standard deviation.
<br>$n$ is the sample size; dividing by $\sqrt{n}$ reflects that we are dealing with a sample of the population, not the entire population.

The denominator, $\frac{\sigma}{\sqrt{n}}$, is the standard error.

The standard error is the standard deviation of the sampling distribution of the sample mean. We will go into that further below.
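A short simulation can make the standard error concrete. The sketch below assumes the height parameters from the example above (mean 65, standard deviation 4) and an arbitrary sample size of 25; the seed and the number of repetitions are illustrative choices, not part of the original lesson.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
mu, sigma, n = 65, 4, 25         # assumed population parameters and sample size

# Draw 20,000 samples of size n and record each sample's mean
sample_means = rng.normal(mu, sigma, size=(20_000, n)).mean(axis=1)

empirical_se = sample_means.std(ddof=1)   # spread of the simulated sample means
theoretical_se = sigma / np.sqrt(n)       # sigma / sqrt(n)

print(empirical_se, theoretical_se)
```

The spread of the simulated sample means should be close to $\frac{\sigma}{\sqrt{n}}$.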

Once we have a z-stat, we can use a [z-table](http://www.z-table.com/) to find the associated p-value.

In [36]:
# cdf gives the percentile: the probability of getting the given z score OR lower
print("Percentile = ", stats.norm.cdf(z))

# the survival function (1 - cdf) gives the probability of getting the given z score OR higher
print("Probability = ", stats.norm.sf(z))

Percentile =  0.9906132944651613
Probability =  0.009386705534838714
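To show how `cdf` and `sf` relate to one-tailed and two-tailed p-values, here is a minimal sketch. It assumes the $z$-score of 2.5 from the height example rather than the `z` used in the cell above.

```python
from scipy import stats

z = 2.5  # z-score from the height example (75 inches, mean 65, sd 4)

p_lower = stats.norm.cdf(z)                # P(Z <= z), the percentile
p_upper = stats.norm.sf(z)                 # P(Z >= z), right-tailed p-value
p_two_tailed = 2 * stats.norm.sf(abs(z))   # probability in both tails

print(p_lower, p_upper, p_two_tailed)
```

The first two values always sum to 1, and the two-tailed p-value is twice the one-tailed value.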


## $t$-Tests (1 tail test)

Let's go back to our Gabonese elephants, but let's reduce the sample size to 20, and assume we don't know the standard deviation of the population, but know the sample standard deviation to be ~355 lbs.

Here is the new scenario: suppose we are told that African elephants have weights distributed normally around a mean of 9000 lbs. Pachyderm Adventures has recently measured the weights of 20 African elephants in Gabon and has calculated their average weight at 8637 lbs. They claim that these statistics on the Gabonese elephants are significant. Let's find out!

Because the sample size is small and we do not know the population standard deviation, we will use a one-sample $t$-test.

Degrees of freedom = n - 1

#### By Hand

In [19]:
# here is the array of our weights

gab = np.random.normal(8637, 355, 20) # Sample test data where: (center, std, n-points)

* This random selection of values won't actually have our desired statistics!

In [10]:
print(gab.mean())
print(gab.std())

8721.457493855585
368.6453210813045


* Let's set this up so that we actually have what we want!

In [11]:
gab = gab + (8637 - gab.mean())
print(gab.mean())

8637.0


In [12]:
gab = (gab - gab.mean()) * (355 / gab.std()) + gab.mean()  # rescale about the mean so it stays at 8637
print(gab.std())

354.99999999999994


In [13]:
# Let's continue to assume our alpha is 0.05
x_bar = 8637
mu = 9000         # Population mean under the null hypothesis
sample_std = 355  # Sample standard deviation of "gab"
n = 20

t_stat = (x_bar - mu)/(sample_std/np.sqrt(n))
t_stat

-4.57291648356295

#### Using Python Code

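* Now, let's use the t-table to find our critical t-value. With 19 degrees of freedom (n - 1) and alpha = 0.05 in the lower tail:

t-critical = -1.729

Our t-stat of -4.57 falls below the critical value, so we reject the null hypothesis.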
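The table lookup can also be done in code. This is a minimal sketch using `stats.t.ppf` with the same numbers as the by-hand cell; `t_crit` and `df` are illustrative names.

```python
import numpy as np
from scipy import stats

alpha = 0.05
n = 20
df = n - 1   # degrees of freedom for a one-sample t-test

t_crit = stats.t.ppf(alpha, df)                # lower-tail critical value
t_stat = (8637 - 9000) / (355 / np.sqrt(n))    # same numbers as the by-hand cell

# True means the statistic falls in the lower-tail rejection region
print(t_crit, t_stat, t_stat < t_crit)
```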

* t = (sample mean - population mean)/(sample std/np.sqrt(n)) - from formula: <br><br>
$\Large t = \dfrac{{\bar{x}} - \mu_{0}}{\dfrac{s}{\sqrt{n}}}$  <br><br>

$\bar{x}$ equals the sample mean.
<br>$\mu_{0}$ is the mean associated with the null hypothesis.
<br>$s$ is the sample standard deviation, which stands in for the unknown population standard deviation $\sigma$.
<br>$n$ is the sample size; dividing by $\sqrt{n}$ reflects that we are dealing with a sample of the population, not the entire population.


In [28]:
# Using Python to get the t-statistic & p-value (the reported p-value is two-sided):
stats.ttest_1samp(gab, 9000)

Ttest_1sampResult(statistic=-4.804800263145146, pvalue=0.00012312972477890294)
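Since `ttest_1samp` reports a two-sided p-value by default, a one-tailed p-value has to be derived separately. The sketch below does this from the by-hand summary statistics (not from the `gab` array, so its numbers will differ from the output above).

```python
import numpy as np
from scipy import stats

x_bar, mu, sample_std, n = 8637, 9000, 355, 20
df = n - 1

t_stat = (x_bar - mu) / (sample_std / np.sqrt(n))

p_one_tailed = stats.t.cdf(t_stat, df)           # P(T <= t_stat), lower tail
p_two_tailed = 2 * stats.t.sf(abs(t_stat), df)   # what a two-sided test reports

print(p_one_tailed, p_two_tailed)
```

When the statistic falls in the hypothesized direction, the one-tailed p-value is half the two-sided one.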

## Compare and contrast $z$-tests and $t$-tests. 
In both cases, it is assumed that the samples come from a normally distributed population. 

$t$-distributions have more probability in the tails. As the sample size increases, the tails thin out and the $t$-distribution more closely resembles the $z$, or standard normal, distribution. By sample size $n = 1000$ they are virtually indistinguishable from each other. 

As the degrees of freedom go up, the $t$-distribution gets closer to the normal curve.
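One way to see this convergence is to compare critical values. The following sketch prints the two-tailed 5% critical value of the $t$-distribution for several degrees of freedom next to the standard normal value; the particular degrees of freedom are arbitrary choices.

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)   # two-tailed 5% critical value of the standard normal

# Two-tailed 5% critical values of the t-distribution for growing degrees of freedom
t_crits = {df: stats.t.ppf(0.975, df) for df in (5, 19, 30, 100, 1000)}

for df, t_crit in t_crits.items():
    print(f"df={df:>4}: t_crit={t_crit:.4f}  (z_crit={z_crit:.4f})")
```

The $t$ critical values shrink toward the normal value as the degrees of freedom grow.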

After calculating our $t$-stat, we compare it against our $t$-critical value, determined by our predetermined alpha and the degrees of freedom.

Degrees of freedom = n - 1

### Two sample $t$-tests (2 tailed test)

#### By Hand

In [26]:
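# By hand (pooled-variance t statistic)
# Note: `ken` is defined in the "In Python Code" cell further down

x_1 = np.mean(gab)              # sample means
x_2 = np.mean(ken)
s_1_2 = np.var(gab, ddof = 1)   # sample variances (ddof=1)
s_2_2 = np.var(ken, ddof = 1)
n_1 = len(gab)                  # sample sizes
n_2 = len(ken)
s_p_2 = ((n_1 - 1)*s_1_2 + (n_2 - 1)*s_2_2)/(n_1 + n_2 - 2)  # pooled variance

t = (x_1 - x_2)/np.sqrt(s_p_2*(1/n_1 + 1/n_2))
t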

-1.2868709981121276
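For reference, the cell above appears to implement the standard pooled-variance formulas. Written out in the notation of the code (with $s_1^2$ and $s_2^2$ as the sample variances `s_1_2` and `s_2_2`, and $s_p^2$ as `s_p_2`):

$$s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

$$t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

The degrees of freedom for this pooled test are $n_1 + n_2 - 2$.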

In [25]:
# By hand 
s_p_2 = ((n_1 - 1)*s_1_2 + (n_2 - 1 )* s_2_2)/(n_1 + n_2 -2)
s_p_2

89361.47957445966

In [27]:
# By hand
print(s_1_2, s_2_2 )

107676.12756997196 71046.83157894739


In [37]:
# stats.norm only approximates the t-distribution here; stats.t.cdf(t, n_1 + n_2 - 2) would give the exact value
# cdf gives the percentile: the probability of getting the given t score OR lower
print("Percentile = ", stats.norm.cdf(t))

# the survival function (1 - cdf) gives the probability of getting the given t score OR higher
print("Probability = ", stats.norm.sf(t))

Percentile =  0.099069627689014
Probability =  0.900930372310986
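The cell above uses the normal distribution as a stand-in for the $t$-distribution. A sketch of the same calculation with `stats.t`, assuming the pooled degrees of freedom $n_1 + n_2 - 2 = 38$ and the t statistic printed earlier, might look like this:

```python
from scipy import stats

t = -1.2868709981121276   # t statistic from the by-hand cell
df = 20 + 20 - 2          # n_1 + n_2 - 2 for the pooled test

p_lower = stats.t.cdf(t, df)                 # P(T <= t)
p_two_tailed = 2 * stats.t.sf(abs(t), df)    # two-tailed p-value

print(p_lower, p_two_tailed)
```

Because the $t$-distribution has heavier tails, the lower-tail probability comes out slightly larger than the normal approximation.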


#### In Python Code

In [20]:
# Sample data done in Python
ken = [8762, 8880, 8743, 8901,
        8252, 8966, 8369, 9001,
         8857, 8147, 8927, 9005,
         9083, 8477, 8760, 8915,
         8927, 8829, 8579, 9002]


print(np.std(ken))
print(np.std(gab))

259.79701691897856
319.8317076080378


In [21]:
# In Python to get the t-statistic & p-value (equal_var=False runs Welch's t-test, which does not assume equal variances)
stats.ttest_ind(gab, ken, equal_var=False)

Ttest_indResult(statistic=-1.2868709981121276, pvalue=0.20624809368290312)
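With equal sample sizes, the Welch statistic coincides with the pooled one, but the degrees of freedom differ. The sketch below applies the Welch-Satterthwaite approximation to the sample variances and t statistic printed earlier; the hard-coded numbers are copied from those outputs, and `df_welch` is an illustrative name.

```python
from scipy import stats

s_1_2, s_2_2 = 107676.12756997196, 71046.83157894739   # sample variances printed earlier
n_1 = n_2 = 20
t = -1.2868709981121276                                 # t statistic printed earlier

v_1, v_2 = s_1_2 / n_1, s_2_2 / n_2
# Welch-Satterthwaite approximation to the degrees of freedom
df_welch = (v_1 + v_2) ** 2 / (v_1 ** 2 / (n_1 - 1) + v_2 ** 2 / (n_2 - 1))

p_welch = 2 * stats.t.sf(abs(t), df_welch)   # two-tailed p-value

print(df_welch, p_welch)
```

The resulting p-value should land close to the one reported by `stats.ttest_ind` above.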

In [None]:
# conducting a T Test
stats.ttest_ind(experiment['Likes_Given_Exp'], control['Likes_Given_Con']) # equal_var default is True

In [35]:
# Calculate our t-critical values for a 2 tailed test at alpha = 0.05 (.025 & .975)
print(stats.t.ppf(0.025, 24)) # 24 degrees of freedom, i.e. n - 1 for one sample of n = 25
print(stats.t.ppf(0.975, 24))

-2.063898561628021
2.0638985616280205
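For the two-sample comparison worked by hand earlier, the pooled test has $n_1 + n_2 - 2 = 38$ degrees of freedom rather than 24. A minimal sketch of the decision rule under that assumption, reusing the t statistic printed earlier:

```python
from scipy import stats

alpha = 0.05
t = -1.2868709981121276   # pooled two-sample t statistic from the by-hand cell
df = 20 + 20 - 2          # n_1 + n_2 - 2

t_crit = stats.t.ppf(1 - alpha / 2, df)   # upper critical value; the lower one is -t_crit
reject_null = abs(t) > t_crit             # two-tailed decision

print(t_crit, reject_null)
```

Here the statistic stays inside the critical values, so the null hypothesis of equal means is not rejected at alpha = 0.05.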
