# Welcome to Hypothesis Testing

Hypothesis testing is the process of weighing one hypothesis against another and using statistics to decide whether the data support rejecting one in favor of the other. It belongs to the branch of statistics known as inferential statistics. 

In this lesson we will introduce some broad concepts related to hypothesis testing, and in future lessons we will dive into a few specific hypothesis tests. 

![Types of Stats](Stats-types.jpg)


![descriptive-and-inferential-statistics](descriptive-and-inferential-statistics.jpeg)



We will use some built-in datasets from the pydataset library to review and explore some concepts. 

1. How distributions help us make inferences

2. Understanding the sample vs. population

3. Asking interesting and relevant questions of your data

4. Ways to answer questions

5. How Hypothesis testing helps us make inferences

6. Key terms in hypothesis testing

**Tangential Lesson & Review** 

Before we get to the statistics lesson, we will practice the following in a little data wrangling: 

1. Identifying when a row is NOT an observation. 

2. Transforming datasets using python to where a row IS an observation. 

3. Writing a loop without getting lost in errors. 

4. Experimenting...be the scientist in data scientist and experiment with other ways to accomplish the same goal. 

So, let's get to it...

__________________________________________________

## Data Wrangling

First, we need data. In this lesson, instead of generating random data, we will use data from the pydataset library which contains 756 datasets available for use. 

As you will soon get used to, every data science project starts with project planning, followed by acquiring and preparing data, a.k.a. wrangling the data. One non-negotiable requirement for data that is ready to explore is that each row represents one observation. Let's take an example. 

In [1]:
# Using open datasets from pydataset
from pydataset import data

import pandas as pd
import numpy as np

from scipy import stats

import matplotlib.pyplot as plt

In [2]:
# take a look at available datasets 
data()

Unnamed: 0,dataset_id,title
0,AirPassengers,Monthly Airline Passenger Numbers 1949-1960
1,BJsales,Sales Data with Leading Indicator
2,BOD,Biochemical Oxygen Demand
3,Formaldehyde,Determination of Formaldehyde
4,HairEyeColor,Hair and Eye Color of Statistics Students
...,...,...
752,VerbAgg,Verbal Aggression item responses
753,cake,Breakage Angle of Chocolate Cakes
754,cbpp,Contagious bovine pleuropneumonia
755,grouseticks,Data on red grouse ticks from Elston et al. 2001


We are going to use the **HairEyeColor** dataset. 

In [3]:
# Extract just the row that contains the HairEyeColor dataset 
# using iloc
data().iloc[4]

dataset_id                                 HairEyeColor
title         Hair and Eye Color of Statistics Students
Name: 4, dtype: object

> **iloc vs. loc:** the `i` in `iloc` refers to integer position (the index location). When talking about rows, `iloc[0]` is always the first row. `loc` refers to the index label (the index name). Sometimes the label is the same as the position, as in the dataframe of dataset names above. Sometimes they are different, as we will see later. 
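
A small illustration of that difference, as a sketch only: `toy` is a made-up dataframe (not part of the lesson's data) whose index labels start at 1, the same way the pydataset tables do.

```python
import pandas as pd

# hypothetical toy dataframe whose index labels start at 1, not 0
toy = pd.DataFrame({'color': ['Black', 'Brown', 'Red']}, index=[1, 2, 3])

# iloc selects by position: position 0 is always the first row
first_by_position = toy.iloc[0]['color']

# loc selects by label: label 1 happens to be the first row here
first_by_label = toy.loc[1]['color']

# position 1 is the second row, so iloc[1] and loc[1] return different rows
second_by_position = toy.iloc[1]['color']

print(first_by_position, first_by_label, second_by_position)
```

With a default index (0, 1, 2, ...), `iloc` and `loc` would agree on every row, which is why the difference is easy to miss.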

OK, we are told this dataset contains information about the hair and eye color of a group of statistics students. 
Let's take a look. 

In [4]:
# store that data in a dataframe named df
df = data('HairEyeColor')

# look at the first 5 rows
df.head()

Unnamed: 0,Hair,Eye,Sex,Freq
1,Black,Brown,Male,32
2,Brown,Brown,Male,53
3,Red,Brown,Male,10
4,Blond,Brown,Male,3
5,Black,Blue,Male,11


In [5]:
# look at the numbers of rows and columns
df.shape

(32, 4)

**Question 1:** 

What do you notice about the structure of this dataset? What does each row represent? 


*your answer:*



_____________________________________________

 
 
 
 
 
 
 
 
 
But is each row "REALLY" a unique combination of Hair, Eye, & Sex, or are there any duplicates? (Always verify your assumptions!)
If we group by Hair, Eye, and Sex and count the number of groups, that count equals the number of rows in the original dataframe only when no combination appears more than once. 

In [6]:
# ngroups is the number of distinct (Hair, Eye, Sex) combinations
print("number of unique combinations: ", 
      df.groupby(['Hair', 'Eye', 'Sex']).ngroups)

print("rows in original dataframe: ", len(df))

number of unique combinations:  32
rows in original dataframe:  32
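
Another way to check the same assumption is `DataFrame.duplicated`, which flags rows that repeat an earlier combination. A minimal sketch on a made-up table (`toy` and its values are hypothetical, chosen so that one combination repeats):

```python
import pandas as pd

# hypothetical aggregated table in which the third row repeats the first
toy = pd.DataFrame({
    'Hair': ['Black', 'Brown', 'Black'],
    'Eye': ['Brown', 'Brown', 'Brown'],
    'Sex': ['Male', 'Male', 'Male'],
    'Freq': [32, 53, 4],
})

# True for every row whose (Hair, Eye, Sex) already appeared above it
is_repeat = toy.duplicated(subset=['Hair', 'Eye', 'Sex'])
n_repeats = int(is_repeat.sum())

print(n_repeats)
```

A result of 0 on the real dataframe would confirm that every row is a distinct combination.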


____________________________________________


Our data has been aggregated! UGH! I *strongly dislike* starting with aggregated data, as a data scientist :=| It limits what I can find out. 

I want my data in the form where one row represents one observation. 


**Question 2:**

In this scenario, based on the description of the dataset, a single observation should be what?  



*your answer:*



_____________________________________________

 
 
 
 
 
 
 
**Question 3:**
 
So, if each row should be an observation, and each observation should be a student, how many rows should we have in our dataframe? 

*your answer:*



________________________________

In [None]:
# the code to determine how many students were surveyed/how many 
# observations we should have

df.Freq.sum()

So, we need 32 rows to represent the students who are male with black hair and brown eyes; 53 rows of "Brown, Brown, Male"; 10 rows of "Red, Brown, Male"; and so on, as you can see below...

In [8]:
df.head()

Unnamed: 0,Hair,Eye,Sex,Freq
1,Black,Brown,Male,32
2,Brown,Brown,Male,53
3,Red,Brown,Male,10
4,Blond,Brown,Male,3
5,Black,Blue,Male,11


There are multiple ways to get there. We will first take the long, roundabout way of writing a loop, to practice the technique of writing a loop without getting lost in errors. We will then demonstrate the slick and simple way of doing it using beautiful pandas methods! 

When writing a loop, I recommend the following steps:

1. Make it work for 1. 

2. Make it work for 2. 

3. Make it work for all. 

First, let's create a single column that concatenates Hair, Eye & Sex so we only have to repeat one dimension or column in this case. 

In [None]:
# concatenating using '+' with strings

df['traits'] = df.Hair + ", " + df.Eye + ", " + df.Sex
df.head(2)

**Make it work for 1:**

We will use `np.repeat(what_to_repeat, how_many_times)` to repeat the first combination, and we will return it as a list. 

In [None]:
# what_to_repeat
df.traits.iloc[0]

In [11]:
#how_many_times
df.Freq.iloc[0]

32

In [None]:
# use those in np.repeat and return a list
row1_observations = list(np.repeat(df.traits.iloc[0], df.Freq.iloc[0]))
row1_observations
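
To see what `np.repeat` does in isolation, here is a small sketch with made-up values. It shows the scalar form used above, plus the array form in which each element gets its own repeat count.

```python
import numpy as np

# a single value repeated a fixed number of times
one_value = np.repeat('a', 3).tolist()

# each element repeated its own number of times
many_values = np.repeat(['a', 'b'], [2, 3]).tolist()

print(one_value)
print(many_values)
```

The array form hints that the whole transformation could be done without a loop, but writing the loop is still good practice.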


**Make it work for 2:**

For the second row, we will need to add the task of appending the list to the first list. We will start with an empty list. Then add the first set of observations. Then the second. 

In [13]:
# start with an empty list to hold all observations
observations = []

In [14]:
# observations = observations + row1_observations 
# is equivalent to :
observations += row1_observations

observations[0:5]

['Black, Brown, Male',
 'Black, Brown, Male',
 'Black, Brown, Male',
 'Black, Brown, Male',
 'Black, Brown, Male']

In [15]:
# how many items are in that list:
len(observations)

32

Use `np.repeat()` to create the 2nd set. The only difference is the index number used in `iloc[]`

In [16]:
row2_observations = list(np.repeat(df.traits.iloc[1], df.Freq.iloc[1]))

print(row2_observations[0:5])
print(len(row2_observations))

['Brown, Brown, Male', 'Brown, Brown, Male', 'Brown, Brown, Male', 'Brown, Brown, Male', 'Brown, Brown, Male']
53


In [17]:
# add the second row's observations to the list
# observations = observations + row2_observations
observations += row2_observations

print(len(observations))

85


In [18]:
print(observations[0:3])
print(observations[-3:])

['Black, Brown, Male', 'Black, Brown, Male', 'Black, Brown, Male']
['Brown, Brown, Male', 'Brown, Brown, Male', 'Brown, Brown, Male']


**Make it work for all:**

In [19]:
# create empty list
obs = []

# loop through each index in original dataframe from 0 to the 
# length of the dataframe
# repeat "Freq" number of times. 

for i in range(len(df)):
    row_obs = list(np.repeat(df.traits.iloc[i], df.Freq.iloc[i]))
    obs += row_obs

Now, what we have is a list where each item should be a row in our dataframe, and each item represents a student, an observation. So let's convert the list to a dataframe. 

In [20]:
print("Number of students: ", len(obs))
print("\nFirst 5 observations: ", obs[0:5])
print("\nLast 5 observations: ", obs[-5:])

Number of students:  592

First 5 observations:  ['Black, Brown, Male', 'Black, Brown, Male', 'Black, Brown, Male', 'Black, Brown, Male', 'Black, Brown, Male']

Last 5 observations:  ['Blond, Green, Female', 'Blond, Green, Female', 'Blond, Green, Female', 'Blond, Green, Female', 'Blond, Green, Female']


In [21]:
obs_df = pd.DataFrame({'traits': obs})
obs_df.head()

Unnamed: 0,traits
0,"Black, Brown, Male"
1,"Black, Brown, Male"
2,"Black, Brown, Male"
3,"Black, Brown, Male"
4,"Black, Brown, Male"


Now we will split these back into the variables we originally had. 

In [22]:
# split the traits at the comma into 3 different columns 
# (n is the maximum number of splits; n=-1 means ALL, so I could 
# put 2 here since I know 3 columns need 2 splits)
# expand = True means return to me a dataframe

obs_df = obs_df['traits'].str.split(", ", n=-1, expand=True)
obs_df.head(3)

Unnamed: 0,0,1,2
0,Black,Brown,Male
1,Black,Brown,Male
2,Black,Brown,Male
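
The effect of the `n` argument is easiest to see side by side. A minimal sketch on a single made-up string (the series `s` is hypothetical):

```python
import pandas as pd

s = pd.Series(['Black, Brown, Male'])

# n=-1 (the default) splits at every separator: 3 pieces
all_splits = s.str.split(', ', n=-1, expand=True)

# n=1 splits only at the first separator: 2 pieces
one_split = s.str.split(', ', n=1, expand=True)

print(all_splits.shape, one_split.shape)
print(one_split.iloc[0, 1])
```

With `n=1`, everything after the first separator stays together in the second column.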


In [23]:
#rename columns 
obs_df.columns = ['hair', 'eye', 'sex']

print(obs_df.shape)
obs_df.head()

(592, 3)


Unnamed: 0,hair,eye,sex
0,Black,Brown,Male
1,Black,Brown,Male
2,Black,Brown,Male
3,Black,Brown,Male
4,Black,Brown,Male


And just like that, we have a dataframe with 592 rows, 3 columns, where each row represents an observation or a student. You can think of the row index as the student id. 


Now that we've done it the long way, another way to go about this is the following:
1. Use the pandas `repeat` method on the index, `df.index`. This method takes an argument for the number of times to repeat each item, which we will provide as the frequency for each row, `df.Freq`. This returns each row name (starting with 1) repeated `Freq` times, as seen below. 

In [24]:
df.index.repeat(df.Freq)

Int64Index([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
            ...
            31, 31, 32, 32, 32, 32, 32, 32, 32, 32],
           dtype='int64', length=592)

2. Use `df.loc` to select the rows by the row name. 

In [25]:
df.loc[df.index.repeat(df.Freq)]

Unnamed: 0,Hair,Eye,Sex,Freq,traits
1,Black,Brown,Male,32,"Black, Brown, Male"
1,Black,Brown,Male,32,"Black, Brown, Male"
1,Black,Brown,Male,32,"Black, Brown, Male"
1,Black,Brown,Male,32,"Black, Brown, Male"
1,Black,Brown,Male,32,"Black, Brown, Male"
...,...,...,...,...,...
32,Blond,Green,Female,8,"Blond, Green, Female"
32,Blond,Green,Female,8,"Blond, Green, Female"
32,Blond,Green,Female,8,"Blond, Green, Female"
32,Blond,Green,Female,8,"Blond, Green, Female"


3. Finally, reset the index so that each row has a unique value from 0 to 591, and select only the 'Hair', 'Eye', and 'Sex' columns. 

In [None]:
df = df.loc[df.index.repeat(df.Freq)].reset_index(drop=True)
df = df[['Hair', 'Eye', 'Sex']]

print(df.shape)
df.head()
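
One way to gain confidence in this transformation is a round trip: expand the table, re-aggregate it, and confirm the counts match the original frequencies. The sketch below uses only the first three rows of the table shown earlier, stored in a hypothetical dataframe named `agg`.

```python
import pandas as pd

# first three rows of the aggregated table, with row names starting at 1
agg = pd.DataFrame({
    'Hair': ['Black', 'Brown', 'Red'],
    'Eye': ['Brown', 'Brown', 'Brown'],
    'Sex': ['Male', 'Male', 'Male'],
    'Freq': [32, 53, 10],
}, index=[1, 2, 3])

# expand: one row per student
expanded = agg.loc[agg.index.repeat(agg.Freq)].reset_index(drop=True)
expanded = expanded[['Hair', 'Eye', 'Sex']]

# re-aggregate: count rows per combination
counts = expanded.groupby(['Hair', 'Eye', 'Sex']).size()

print(len(expanded), agg.Freq.sum())
print(counts.loc[('Brown', 'Brown', 'Male')])
```

If the row count equals the sum of `Freq` and every group size matches its original frequency, no students were lost or duplicated.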

To recap, the goal was to stretch this data out so that each row represents a single student. 

To do so, we needed to ... 

1. repeat each row by the frequency. 

2. remove the frequency column. 

We should end up with the same number of rows as the total number of students represented, because...

EACH ROW IS AN OBSERVATION AND EACH OBSERVATION IS A STUDENT. 


**Question:**

So, how can I compute how many rows we will end up with? Or how many students were surveyed?

If you need to take another peek at the data, go for it! `df.head()` is a quick and easy way. 

**Answer:**

In [None]:
# write code to compute how many students are represented in this dataset. 
# df no longer has a Freq column at this point, so reload the original table

data('HairEyeColor').Freq.sum()

Cool, so we are looking to end up with that many rows.

___________________________________________
_______________________________________

To get there, we will want to repeat each combination of `Hair`, `Eye`, & `Sex` the number of times represented in the `Freq` field. We can get there in the following way:  

1. Create a single column that concatenates Hair, Eye & Sex

2. For the first unique combination of Hair, Eye & Sex, i.e. the first row which is Black, Brown, and Male, create a list that repeats that combination 32 times, which is the `Freq` value. 

3. We want to do that for each row, so once we made it work for one, we will do row two and make sure we can concatenate them, THEN we will put it in a loop to work for all. 

__________________________________________________

## Sample vs. Population


According to heffingtons (https://heffingtons.com/interesting-facts-about-eye-color/), the estimated proportion of blue eyes in the U.S. population is 27%. Their full breakdown:

- Brown Eyes: 45%

- Blue Eyes: 27%

- Hazel Eyes: 18% (Note: Hazel eyes consist of shades of brown and green.)

- Green Eyes: 9%

- Other: 1%

In [None]:
sum(obs_df.eye=='Blue')/len(obs_df)

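**Question:**

In our sample, 215 of 592 students (about 36%) have blue eyes, which is well above the estimated population proportion of 27%. What are some possible reasons why? 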

- not large enough sample

- sample probably from 1 location

- self-selected to be in that stats class

- demographic mis-representation

- stats class in norway? 

- age of students? 

- self reported!



This dataset claims to be a sample of statistics students. 

**Questions:**

So is the population all statistics students? 

Across the world? 

All ages? 

Is it a representative sample of the population of all statistics students across the world of all ages? 

Or is it a representative sample of the population of all people across the world of all ages? 

Or is it a representative sample of the population of all people in the US of all ages? 

Or is it a representative sample of the population of all statistics students in the US of all ages? 

Or ...


_________________________________________________________


Hypothesis testing can help us answer these questions as well as many others. 


In [None]:
stats.binom(592, .27).pmf(215)
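
The `pmf` above is the probability of seeing exactly 215 blue-eyed students. A question closer to what a hypothesis test asks is how likely 215 or more would be if the true proportion were 27%. The sketch below assumes the binomial model from the cell above; the variable names are illustrative only.

```python
from scipy import stats

n_students = 592      # sample size from the lesson
p_population = 0.27   # estimated U.S. proportion of blue eyes
n_blue = 215          # blue-eyed count used in the cell above

blue_eyes = stats.binom(n_students, p_population)

# probability of exactly 215 blue-eyed students
p_exact = blue_eyes.pmf(n_blue)

# probability of 215 or more: sf(k) is P(X > k), so use k = 214
p_at_least = blue_eyes.sf(n_blue - 1)

print(p_exact, p_at_least)
```

Both probabilities are tiny, which suggests that chance alone is an unlikely explanation for the gap between 27% and our sample.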

In [None]:
plt.figure(figsize=(16,6))
x = np.arange(1, 300)
y = stats.binom(592, .27).pmf(x)
plt.bar(x, y, width=1, color='green')

plt.bar(215, stats.binom(592, .27).pmf(215), width=3, color='orange')

plt.title('binomial distribution for n=592 p=.27')
plt.xlabel('x')
plt.ylabel('pmf(x)')

plt.annotate('$P(X=215) = {:.10f}$'.format(stats.binom(592, .27).pmf(215)),
             (215, stats.binom(592, .27).pmf(215)), xytext=(215, .010), 
             arrowprops={'arrowstyle': '->'})

## Ways to answer questions that we will practice throughout your time here:

1. data visualization 

2. hypothesis testing

3. machine learning

4. Saying I don't know, but I will try to find out. 


Ways that we will not cover:

1. pulling it out of your arse

2. using statistics to give you the answer you want. 

3. selecting biased samples to prove your point. 



In [None]:

plt.figure(figsize=(16,10))
plt.suptitle('There are more blue eyes in our sample than in the population.\n Is this by chance? Or does our sample come from a different population?\n', 
            size=18)

plt.subplot(1, 2, 1)
x = stats.binom.rvs(size=592, p=.27, n=1)
sample = stats.binom.rvs(size=592, p=.36, n=1)
plt.hist(x, width=1, bins=range(3), edgecolor='black', color='teal', alpha=1, align='left')
x_ticks = [0, 1]
x_labels = ['Other Colored Eyes', 'Blue Eyes']
plt.title('Population: US Residents')
plt.axhline(y=.27*592, color="lightgray")
y_ticks = [0,160, 215, 377, 432]
# 215 of 592 is about 36%, and 377 of 592 is about 64%
y_labels = ['0%', '27%', '36%', '64%', '73%']
plt.yticks(y_ticks, y_labels)
plt.xticks(x_ticks, x_labels)

plt.subplot(1, 2, 2)
plt.hist(sample, width=1, bins=range(3), edgecolor='black', color='orange', alpha=.5, align='left')
plt.title('Sample: 592 Statistics Students')
y_ticks = [0,160, 215, 377, 432]
y_labels = ['0%', '27%', '36%', '64%', '73%']
plt.yticks(y_ticks, y_labels)
plt.xticks(x_ticks, x_labels)
plt.axhline(y=.27*592, color="darkgray")
plt.show()

Cases for hypothesis testing

1. Do those who churn spend more than those who do not churn? 
2. Are customers with Fiber more likely to churn than those without?
3. Are sr. citizens more likely to churn?
4. Are students in stats class more likely to have blue eyes than students not taking stats? 
5. Are customers without autopayment more likely to churn? 
6. Do customers who churn have lower tenure? 
7. Is there a linear relationship between tenure and total charges? 
8. Is there a linear relationship between tenure and average monthly charges? 

The variables
1. churn (boolean), average_monthly_charges (continuous/numeric)
2. churn (boolean), has_fiber (boolean)
3. churn (boolean), is_senior (boolean)
4. eye_color (categorical), in_stats (boolean)
5. churn (boolean), has_autopayment (boolean)
6. churn (boolean), tenure (numeric)
7. tenure (numeric), total_charges (numeric)
8. tenure (numeric), average_monthly_charges (numeric)

The data types
1. boolean x numeric
2. boolean x boolean
3. boolean x boolean
4. categorical x boolean
5. boolean x boolean
6. boolean x numeric
7. numeric x numeric
8. numeric x numeric

The types of tests

1. comparison of means (t-test) across the 2 groups (churned customers vs. not churned customers)
2. comparison of proportions/relationships (chi-square)
3. comparison of proportions/relationships (chi-square)
4. comparison of proportions/relationships (chi-square)
5. comparison of proportions/relationships (chi-square)
6. comparison of means (t-test) across the 2 groups (churned customers vs. not churned customers)
7. linear correlation between two continuous values, i.e. do they rise and fall together? Correlation alone does not show that one affects the other. (Pearson's correlation)
8. linear correlation between two continuous values, i.e. do they rise and fall together? (Pearson's correlation)
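
As a rough sketch of how the three kinds of tests listed above look in `scipy.stats`, the block below runs each one on synthetic data. All of the arrays are made up for illustration; none of them come from a real churn dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# synthetic, made-up data for 200 customers
churn = rng.random(200) < 0.3                     # boolean
has_fiber = rng.random(200) < 0.5                 # boolean
tenure = rng.integers(1, 73, size=200)            # numeric (months)
monthly_charges = rng.normal(65, 20, size=200)    # numeric
total_charges = tenure * monthly_charges          # numeric

# boolean x numeric: compare the means of the two groups (t-test)
t_stat, t_p = stats.ttest_ind(monthly_charges[churn], monthly_charges[~churn])

# boolean x boolean: test independence from a 2x2 contingency table (chi-square)
table = np.array([
    [np.sum(churn & has_fiber), np.sum(churn & ~has_fiber)],
    [np.sum(~churn & has_fiber), np.sum(~churn & ~has_fiber)],
])
chi2, chi2_p, dof, expected = stats.chi2_contingency(table)

# numeric x numeric: linear correlation (Pearson's r)
r, r_p = stats.pearsonr(tenure, total_charges)

print(t_p, chi2_p, r, r_p)
```

Each call returns a test statistic and a p-value; the choice of test follows from the data types of the two variables.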

**Null Hypothesis:** $H_{0}$

the "default", no difference, no change, no effect. 

**Alternative Hypothesis:** $H_{a}$

Generally what your hypothesis is, that there is a difference, an effect, etc. 


1. Do those who churn spend a different amount than those who do not churn?

> $H_{0}$: avg spend for those who churn == avg spend for those who don't. 

> $H_{a}$: avg spend for those who churn != avg spend for those who don't. 

Test: t-test. Reject $H_{0}$ if the p-value < .05.

1a. Do those who churn spend more than those who do not churn?


$H_{0}$: avg spend for those who churn <= avg spend for those who don't. 

$H_{a}$: avg spend for those who churn > avg spend for those who don't. 

2. Are customers with Fiber more or less likely to churn than those without?

$H_{0}$: Customers with Fiber are as likely to churn as those without. 

$H_{a}$: Customers with Fiber are more or less likely to churn than those without. 

2a. Are customers with Fiber more likely to churn than those without?

$H_{0}$: Customers with Fiber are less likely or equally likely to churn compared with those without. 

$H_{a}$: Customers with Fiber are more likely to churn than those without. 

3. Are sr. citizens more likely to churn?

$H_{0}$: Sr. are less or equally likely to churn than non-sr.

$H_{a}$: Sr. are more likely to churn than non-sr.

4. Are students in stats class more likely to have blue eyes than students not taking stats?

5. Are customers without autopayment more likely to churn?

6. Do customers who churn have lower tenure?

$H_{0}$: customers who churn have greater or equal tenure than those who don't. 

$H_{a}$: customers who churn have lower tenure. 

7. Is there a linear relationship between tenure and total charges?

$H_{0}$: There is no linear relationship b/w tenure and total charges. 

$H_{a}$: There is a linear relationship b/w tenure and total charges. 

8. Is there a linear relationship between tenure and average monthly charges?

The general steps of a hypothesis test:

1. identify the question
2. state the hypotheses
3. validate assumptions
4. set our significance level, $\alpha$, then run our test
5. get results: test statistic & p-value
6. evaluate results and draw conclusions
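
The six steps can be sketched end to end for question 1a (do those who churn spend more?). This is an illustration under stated assumptions, not a definitive workflow: the spend values are synthetic, and Levene's test is just one possible way to check the equal-variance assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 1. question: do those who churn spend more than those who do not?
# synthetic, made-up monthly spend for the two groups
spend_churned = rng.normal(75, 20, size=150)
spend_stayed = rng.normal(65, 20, size=300)

# 2. hypotheses
#    H_0: mean spend (churned) <= mean spend (stayed)
#    H_a: mean spend (churned) >  mean spend (stayed)

# 3. validate assumptions: Levene's test checks for equal variances
_, levene_p = stats.levene(spend_churned, spend_stayed)
equal_var = bool(levene_p > 0.05)

# 4. set alpha, then run the test (ttest_ind is two-tailed by default)
alpha = 0.05
t_stat, p_two_tailed = stats.ttest_ind(
    spend_churned, spend_stayed, equal_var=equal_var
)

# 5. results: convert the two-tailed p-value to a one-tailed p-value
if t_stat > 0:
    p_one_tailed = p_two_tailed / 2
else:
    p_one_tailed = 1 - p_two_tailed / 2

# 6. evaluate: reject H_0 only if the one-tailed p-value is below alpha
reject_null = p_one_tailed < alpha
print(t_stat, p_one_tailed, reject_null)
```

Halving the two-tailed p-value is valid only when the test statistic points in the direction of $H_{a}$, which is why the sign of `t_stat` is checked first.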


### Key terms

**p-value:** the probability of observing a result at least as extreme as ours due to chance alone, assuming the null hypothesis is true. If it's less than our alpha, we reject the null hypothesis and conclude there IS a difference, or a relationship. 

**False Negative Rate:** P(FN) = P(Type II Error)
False Negative: Failing to reject null hypothesis when it is false. 
i.e. there is a difference but test told you otherwise. 

**False Positive Rate:** P(FP) = P(Type I Error)
False Positive: Said there was a difference where there wasn't. 
Alpha is your false positive rate. 
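
The claim that alpha is the false positive rate can be checked with a small simulation. In this sketch (synthetic data; the sample sizes and trial count are arbitrary choices), both samples always come from the same distribution, so the null hypothesis is true and every rejection is a false positive.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_trials = 2000

false_positives = 0
for _ in range(n_trials):
    # both samples come from the same distribution, so H_0 is true
    a = rng.normal(0, 1, size=30)
    b = rng.normal(0, 1, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1

false_positive_rate = false_positives / n_trials
print(false_positive_rate)
```

The observed rate should land close to alpha, and choosing a smaller alpha would lower it at the cost of more false negatives.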


**Answers**

1. What do you notice about the structure of this dataset? What does each row represent? *A unique combination of hair, eye and sex with the number of students in the class who have that combination.*

2. In this scenario, based on the description of the dataset, a single observation should be what?  *A single statistics student!*

3. So, if each row should be an observation, and each observation should be a student, how many rows should we have in our dataframe? *592, the sum of the `Freq` column.*
