# INFO 98: Data Science Skills, Spring 2019
## Lecture 06: Hypothesis Testing 

---

## Table of Contents
* [Setup](#setup)
* [Demo](#demo)
* [Custom Dice Test](#customization)

<a id='setup'></a>
# Setup
____

In [None]:
import numpy as np
from datascience import *

import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
from matplotlib import patches
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets

<a id='demo'></a>
# Demo
___

## Simulating Flipping of a Coin

We are now going to simulate 20 flips of a fair coin 10,000 times!

**Null Hypothesis:** The coin is fair 

We therefore simulate under this assumption: every flip has an equal chance of landing heads or tails.

A simple but effective measure of fairness is the absolute value of the difference between the number of heads and the number of tails. Large values are evidence of a biased coin, and small values are consistent with a fair coin. This will be our test statistic.

We will use the sample_proportions function to simulate our coin tosses. The details of the function are given on the next slide.
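To make the idea concrete, here is a minimal plain-Python sketch of what a function like sample_proportions does conceptually: draw a sample of a given size from a categorical distribution and report the proportion that landed in each category. The name sample_proportions_sketch and the seeded random generator are assumptions made for illustration; this is not the datascience library's implementation.

```python
import random

def sample_proportions_sketch(sample_size, probabilities, rng):
    """Draw sample_size outcomes and return the proportion in each category."""
    categories = list(range(len(probabilities)))
    draws = rng.choices(categories, weights=probabilities, k=sample_size)
    return [draws.count(c) / sample_size for c in categories]

rng = random.Random(0)  # seeded only so the sketch is reproducible
proportions = sample_proportions_sketch(20, [0.5, 0.5], rng)
print(proportions)  # two proportions (heads, tails) that add up to 1
```

Multiplying each proportion by the sample size recovers the counts of heads and tails.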

The following demonstrates one simulation of 20 flips of a coin:

In [None]:
model_proportions = make_array(.5, .5)  # Fair coin: P(heads) = P(tails) = 0.5
repetitions = 20

one_simulation = sample_proportions(repetitions, model_proportions)  # One simulation of 20 flips of a coin

# We multiply by repetitions because sample_proportions returns proportions, not counts
num_heads = one_simulation.item(0) * repetitions  # the first item is the proportion of heads
num_tails = one_simulation.item(1) * repetitions  # the second item is the proportion of tails

print(one_simulation)
print(num_heads)
print(num_tails)


We then define a function that gives us our **test statistic** given a certain number of heads and tails.

In [None]:
def statistic(heads,tails):
    return abs(heads - tails)

statistic(num_heads,num_tails)
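Because the number of tails is always 20 minus the number of heads, the statistic equals |2 * heads - 20| and can only take even values. The short check below enumerates every possible outcome of 20 flips; it is a sketch that redefines statistic so it runs on its own, and the variable names flips and possible_values are introduced only for illustration.

```python
def statistic(heads, tails):
    return abs(heads - tails)

flips = 20
# Every possible number of heads, from 0 to 20, with tails = flips - heads
possible_values = sorted({statistic(h, flips - h) for h in range(flips + 1)})
print(possible_values)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
```

This is why the histograms in this notebook use bins of width 2.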


We now put both these elements together with a loop to simulate 20 flips of a coin 10,000 times. We calculate the test statistic for each of these 10,000 sets of 20 flips and store the results in an array.

The function below simulates coin_flips flips of a coin whose bias is given by fairness_of_coin, a two-element array: the first element is the probability of heads and the second is the probability of tails.

For example, a coin biased towards heads could have fairness_of_coin set to the array (0.7, 0.3).
Similarly, a coin biased towards tails could have fairness_of_coin set to (0.3, 0.7).

In [None]:
def simulation_and_statistic(coin_flips, fairness_of_coin):
    # Don't worry if you don't understand why the code is doing what it is doing

    # Simulate the flips, then convert the proportions back into counts
    one_simulation = sample_proportions(coin_flips, fairness_of_coin)
    num_heads = one_simulation.item(0) * coin_flips
    num_tails = one_simulation.item(1) * coin_flips
    simulated_statistic = statistic(num_heads, num_tails)
    return simulated_statistic


We want to repeat this simulation 10,000 times, so we use a for loop as shown below. For every simulation we store the test statistic, i.e. the absolute difference between the number of heads and the number of tails, in an array. We do this for a fair coin first, so our two-element array is (0.5, 0.5).

In [None]:
repetitions = 20
num_simulations = 10000
fair_coin = make_array(.5,.5)


# Don't worry if you don't understand why the code is doing what it is doing
fair_coin_simulated_statistics = make_array()  # Start empty; the loop appends one statistic per simulation

for i in np.arange(num_simulations):
    fair_coin_simulated_statistics = np.append(fair_coin_simulated_statistics,
                                               simulation_and_statistic(repetitions,
                                                                        fair_coin))
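Each call to np.append returns a new, longer array, so the loop above copies the results on every iteration. A common alternative is to collect the statistics in a Python list and convert once at the end if an array is needed. The sketch below shows that pattern using only the standard library; simulate_statistic is a hypothetical stand-in for simulation_and_statistic, and the seed is an arbitrary choice.

```python
import random

def simulate_statistic(coin_flips, p_heads, rng):
    """Flip a coin coin_flips times and return |heads - tails|."""
    heads = sum(rng.random() < p_heads for _ in range(coin_flips))
    tails = coin_flips - heads
    return abs(heads - tails)

rng = random.Random(42)  # seeded only so the sketch is reproducible
num_simulations = 10000

# Build the whole collection in one pass instead of appending to an array
simulated_statistics = [simulate_statistic(20, 0.5, rng) for _ in range(num_simulations)]
print(len(simulated_statistics))  # 10000
```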
    

We now have an array of 10,000 values. Each value is the test statistic (the absolute difference between the number of heads and the number of tails) for one simulation of flipping a coin 20 times.

In [None]:
fair_coin_simulated_statistics 

We now create a histogram of all the values in fair_coin_simulated_statistics 

In [None]:
fair = Table().with_column('Absolute Difference between # of heads and tails', fair_coin_simulated_statistics )

fair.hist(bins=make_array(0,2,4,6,8,10,12,14,16,18,20))

To check that our simulation behaves the way we expect, we repeat the same process with a different coin: one biased towards heads.

In [None]:
repetitions = 20
num_simulations = 10000
unfair_coin = make_array(0.8,0.2)


# Don't worry if you don't understand why the code is doing what it is doing
unfair_coin_simulated_statistics = make_array()  # Start empty; the loop appends one statistic per simulation

for i in np.arange(num_simulations):
    unfair_coin_simulated_statistics = np.append(unfair_coin_simulated_statistics, simulation_and_statistic(repetitions, unfair_coin))

Make a histogram again! What do you notice?

In [None]:
unfair = Table().with_column('Absolute Difference between # of heads and tails', unfair_coin_simulated_statistics )

unfair.hist(bins=make_array(0,2,4,6,8,10,12,14,16,18,20))

## Computing P-Value

We now want to see how unlikely our observed event of getting 15 heads and 5 tails (or an event that is even more lopsided) would be if our coin were fair. Our observed test statistic is 10, since the absolute difference between 15 and 5 is 10.

In [None]:
observed_test_statistic = statistic(15,5)
observed_test_statistic

In [None]:
fair = Table().with_column('Absolute Difference between # of heads and tails', fair_coin_simulated_statistics )

fair.hist(bins=make_array(0,2,4,6,8,10,12,14,16,18,20))
plt.scatter(observed_test_statistic, 0, color='red', s=60);

To see how likely it is to get a test statistic of at least 10, we essentially need to calculate the area of the first histogram at or to the right of 10. Why do we choose the first histogram and not the second one? We want to work under the assumption that the coin was fair in the first place, so we compare our observed test statistic to the distribution we get when we flip a fair coin. **(Remember: always simulate under the null!)**

In [None]:
proportion_greater_or_equal = sum(fair_coin_simulated_statistics >= observed_test_statistic) / len(fair_coin_simulated_statistics)
proportion_greater_or_equal
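The calculation works because comparing a collection with a number gives one True/False per value, and summing those counts the True entries. Here is the same arithmetic on a tiny, made-up list of statistics (the values are for illustration only and are not simulation results):

```python
simulated = [0, 2, 10, 4, 12, 6, 2, 10]  # made-up values, for illustration only
observed = 10

# One True/False per simulated value; True counts as 1 when summed
at_least_as_extreme = [s >= observed for s in simulated]
p_value = sum(at_least_as_extreme) / len(simulated)
print(p_value)  # 0.375, because 3 of the 8 values are >= 10
```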

**Converting this to a percentage:**

In [None]:
proportion_greater_or_equal_percentage = proportion_greater_or_equal*100
proportion_greater_or_equal_percentage


**This is our P-value. It is the probability, assuming that the coin is fair, of getting an absolute difference of 10 or greater purely by chance. Intuitively this is low, but whether we reject the null hypothesis depends on our significance level. We chose a significance level of 5%; our simulated P-value is below it, so we reject the null hypothesis. Our conclusion is that, based on our observations, the coin is not fair: something other than chance is causing the results we have seen.**
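As a sanity check on the simulation, the same tail probability can be computed exactly for a fair coin by counting outcomes with the binomial coefficient. This is a sketch that uses only the standard library (math.comb needs Python 3.8 or later); the function name exact_p_value is introduced here for illustration.

```python
from math import comb

def exact_p_value(flips, observed_statistic):
    """P(|heads - tails| >= observed_statistic) for a fair coin."""
    favourable = sum(
        comb(flips, heads)  # number of flip sequences with this many heads
        for heads in range(flips + 1)
        if abs(heads - (flips - heads)) >= observed_statistic
    )
    return favourable / 2 ** flips  # all 2**flips sequences are equally likely

exact = exact_p_value(20, 10)
print(round(exact * 100, 2))  # 4.14 (percent)
```

The simulated percentage should land close to this value, with small differences from run to run because the simulation is random.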


Let us now do the exact same calculation with the second histogram, which holds the results for the coin biased towards heads.

In [None]:
unfair.hist(bins=make_array(0,2,4,6,8,10,12,14,16,18,20))
plt.scatter(observed_test_statistic, 0, color='red', s=60);

In [None]:
proportion_greater_or_equal_unfair = sum(unfair_coin_simulated_statistics >= observed_test_statistic) / len(unfair_coin_simulated_statistics)

proportion_greater_or_equal_unfair_percentage = proportion_greater_or_equal_unfair * 100  # To percentage
proportion_greater_or_equal_unfair_percentage


As is evident, this is a very high P-value, much higher than our 5% significance level. We clearly fail to reject the null hypothesis here, which for this second test is that the coin lands heads 80% of the time. This makes intuitive sense: for a coin that lands heads 80% of the time, getting 15 heads and 5 tails is not out of the ordinary, and the P-value reflects that.

**Congratulations!! You just finished your first two hypothesis tests.** We can do a very similar hypothesis test with a die, with some changes to account for the 6 possible outcomes instead of 2.
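One possible way to adapt the test to a die is sketched below. With six outcomes, "heads minus tails" no longer applies, so this sketch assumes a different test statistic: the total variation distance between the observed face proportions and the fair proportions of 1/6 each. The choice of statistic, the sample size of 60 rolls, the observed counts, and all function names are assumptions made for illustration only.

```python
import random

def tvd(proportions_a, proportions_b):
    """Total variation distance between two distributions."""
    return sum(abs(a - b) for a, b in zip(proportions_a, proportions_b)) / 2

def roll_proportions(rolls, rng):
    """Roll a fair six-sided die and return the proportion of each face."""
    counts = [0] * 6
    for _ in range(rolls):
        counts[rng.randrange(6)] += 1
    return [c / rolls for c in counts]

fair_die = [1 / 6] * 6
rng = random.Random(0)  # seeded only so the sketch is reproducible
rolls = 60              # hypothetical sample size

# Simulate under the null hypothesis: the die is fair
simulated = [tvd(roll_proportions(rolls, rng), fair_die) for _ in range(2000)]

observed_counts = [20, 8, 8, 8, 8, 8]  # made-up observation, illustration only
observed = tvd([c / rolls for c in observed_counts], fair_die)

# Proportion of simulated statistics at least as extreme as the observed one
p_value = sum(s >= observed for s in simulated) / len(simulated)
print(observed, p_value)
```

As with the coin, the P-value is the proportion of statistics simulated under the null that are at least as large as the observed one.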

<a id='customization'></a>
# Custom Dice Test
____

In [None]:
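# Splitting/aggregating the data to get the number of gun deaths in each year.
# Note: ca_split_by_year is assumed to be defined elsewhere (one year value per
# recorded death); it is not created anywhere in this notebook.
deaths_agg_by_year = {}
for i in ca_split_by_year:
    if i in deaths_agg_by_year:
        deaths_agg_by_year[i] += 1
    else:
        deaths_agg_by_year[i] = 1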

In [None]:
deaths_agg_by_year

In [None]:
plt.bar(list(deaths_agg_by_year.keys()), list(deaths_agg_by_year.values()))
plt.ylabel('Number of gun deaths in that year')
plt.xlabel('Year')
plt.title('Number of gun deaths by year in California')