# Hypothesis Testing

For each of the following questions, formulate a null and alternative hypothesis (be as specific as you can be), then give an example of what a true positive, true negative, type I and type II errors would look like. Note that some of the questions are intentionally phrased in a vague way. It is your job to reword these as more precise questions that could be tested.

 - Has the network latency gone up since we switched internet service providers?
 - Is the website redesign any good?
 - Is our television ad driving more sales?

### - Has the network latency gone up since we switched internet service providers (ISPs)?

#### Null Hypothesis: 
The network latency has NOT changed since we switched internet service providers.
    
#### Alternative Hypothesis: 
The network latency has changed since we switched internet service providers.
    

#### True Positive: 
Rejecting the null hypothesis when there is actually a real difference in latency since we switched ISPs (we said there was a relationship and there actually is).


#### True Negative: 
Failing to reject the null hypothesis when there is no real difference in latency since we switched ISPs (we said we found no evidence of a relationship and there really is none).

    
#### Type I Error (False Positive):
Rejecting the null hypothesis when there is actually no real difference in latency between ISPs.


#### Type II Error (False Negative):
Failing to reject the null hypothesis when there is actually a real difference in latency between ISPs.
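
The original question asks whether latency has "gone up", which is directional, so one way to make it more precise is a one-sided two-sample test. The sketch below is only an illustration: the latency samples are synthetic (drawn with NumPy), the names `latency_before` and `latency_after` are made up, and Welch's t-test is one reasonable choice among several.

```python
import numpy as np
from scipy import stats

# Synthetic latency samples in milliseconds (made up for illustration only)
rng = np.random.default_rng(42)
latency_before = rng.normal(loc=50, scale=5, size=200)  # old ISP
latency_after = rng.normal(loc=53, scale=5, size=200)   # new ISP

alpha = 0.05

# H0: mean latency after the switch <= mean latency before the switch
# Ha: mean latency after the switch  > mean latency before the switch
t_stat, p_value = stats.ttest_ind(
    latency_after, latency_before, equal_var=False, alternative='greater'
)

reject_null = p_value < alpha
print(f't = {t_stat:.4f}, p = {p_value:.4g}')
print('reject the null hypothesis' if reject_null else 'fail to reject the null hypothesis')
```

With real measurements, the two arrays would be replaced by latency readings collected before and after the switch.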



### - Is the website redesign any good?

#### Null Hypothesis: 
The website redesign has NOT changed our click-through rate.
    
#### Alternative Hypothesis: 
The website redesign has changed our click-through rate.
    
    
#### True Positive: 
Data analysis reveals a significant change in the click-through rate after redesigning our website, so we reject the null hypothesis, and the redesign really did change the click-through rate.


#### True Negative:
Data analysis does not show a significant change in the click-through rate after redesigning our website, so we fail to reject the null hypothesis, and the redesign really did not change the click-through rate.
    
    
#### Type I Error (False Positive): 
Data analysis indicates a significant change in the click-through rate after redesigning our website, leading to the rejection of the null hypothesis. 

However, in reality, the redesign did not change the click-through rate, and the observed difference is coincidental or due to other factors.

#### Type II Error (False Negative):
Data analysis does not show a significant change in the click-through rate after redesigning our website, so we fail to reject the null hypothesis. 

However, in reality, there is a change in the click-through rate, which is genuinely caused by the web redesign.
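
One possible way to test the click-through hypothesis is to arrange clicks and non-clicks for each design in a 2x2 table and run a chi-square test on it. The visit counts below are hypothetical and exist only to show the shape of the analysis; `observed_clicks` and the other variable names are made up for this sketch.

```python
import numpy as np
from scipy import stats

# Hypothetical visit counts (not real data): rows are designs,
# columns are [clicked, did not click]
observed_clicks = np.array([
    [120, 2280],  # old design: 120 clicks out of 2400 visits
    [165, 2235],  # new design: 165 clicks out of 2400 visits
])

alpha = 0.05
chi2_ctr, p_ctr, dof_ctr, expected_ctr = stats.chi2_contingency(observed_clicks)

# Click-through rate per design = clicks / total visits for that design
ctr_old, ctr_new = observed_clicks[:, 0] / observed_clicks.sum(axis=1)
print(f'old CTR = {ctr_old:.3f}, new CTR = {ctr_new:.3f}')
print(f'chi^2 = {chi2_ctr:.4f}, p = {p_ctr:.4f}')
```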


### - Is our television ad driving more sales?

#### Null Hypothesis: 
Our television ad is making NO difference in our sales.
    
#### Alternative Hypothesis: 
Our television ad is making a difference in our sales.
    
#### True Positive: 
The number of people buying our product has significantly changed after airing the television ad, so we reject the null hypothesis, and the ad really is making a difference in our sales.


#### True Negative: 
The number of people buying our product does not significantly change after airing the television ad, so we fail to reject the null hypothesis, and the ad really is making no difference in our sales.

    
#### Type I Error (False Positive):
The number of people buying our product appears to have significantly increased after airing the television ad, but it is just due to random chance. In reality, the null hypothesis is true.



#### Type II Error (False Negative):
The number of people purchasing our product does not appear to have significantly increased after airing the television ad, but in reality, it has. This means we failed to reject the null hypothesis, even though the alternative hypothesis is true.
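
The link between alpha and Type I errors can be illustrated with a small simulation: when the null hypothesis is true, roughly a fraction alpha of experiments should still reject it by chance. This is a sketch under simplifying assumptions (normally distributed data, a known standard deviation, and a two-sample z-test); none of the numbers come from real sales data.

```python
import random
from statistics import NormalDist, mean

random.seed(0)
alpha = 0.05
n_experiments = 2000   # number of simulated studies
n_per_group = 50       # observations per group in each study
sigma = 1.0            # known standard deviation (an assumption of the z-test)

def two_sided_p_value(sample_a, sample_b):
    """Two-sample z-test p-value, assuming a known common sigma."""
    standard_error = sigma * (2 / n_per_group) ** 0.5
    z = (mean(sample_a) - mean(sample_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

false_positives = 0
for _ in range(n_experiments):
    # Both groups come from the same distribution, so the null is true
    group_a = [random.gauss(0, sigma) for _ in range(n_per_group)]
    group_b = [random.gauss(0, sigma) for _ in range(n_per_group)]
    if two_sided_p_value(group_a, group_b) < alpha:
        false_positives += 1

type_i_rate = false_positives / n_experiments
print(f'observed Type I error rate: {type_i_rate:.3f}')
```

The printed rate should land near alpha; lowering alpha reduces Type I errors at the cost of more Type II errors for a fixed sample size.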





# Comparison of Groups

## The Chi-Square Test of Independence

### 1. Use the following contingency table to help answer the question of whether using a Macbook and being a Codeup student are independent of each other.

In [82]:
#standard data imports
import pandas as pd
import numpy as np

#pulling sample dataset
from pydataset import data

#new library!! for stats!!! 
from scipy import stats

In [83]:
textbook_data = {
    'Codeup Student': [49, 1],
    'Not Codeup Student':[20,30],}

In [84]:
df = pd.DataFrame(textbook_data, index=['Uses a Macbook', "Doesn't Use a Macbook"])

In [85]:
df

|  | Codeup Student | Not Codeup Student |
|---|---|---|
| Uses a Macbook | 49 | 20 |
| Doesn't Use a Macbook | 1 | 30 |


In [31]:
# First: Form hypotheses and set the significance level (alpha)

- $H_0$: There is NO relationship between using a Macbook and being a Codeup Student
- $H_a$: There IS a relationship between using a Macbook and being a Codeup Student

In [32]:
# Set Alpha
alpha = .05

In [86]:
# we already have a crosstab so use chi2_contingency

stats.chi2_contingency(df)

Chi2ContingencyResult(statistic=36.65264142122487, pvalue=1.4116760526193828e-09, dof=1, expected_freq=array([[34.5, 34.5],
       [15.5, 15.5]]))
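
To see where these numbers come from, the expected counts and the statistic can be reproduced by hand. Each expected count is (row total x column total) / grand total. The sketch below assumes SciPy's default behaviour of applying Yates' continuity correction to 2x2 tables, which for this table amounts to shrinking each absolute difference by 0.5; the variable names are made up for the illustration.

```python
import numpy as np

observed_counts = np.array([[49, 20],
                            [1, 30]])

row_totals = observed_counts.sum(axis=1)   # totals per row
col_totals = observed_counts.sum(axis=0)   # totals per column
grand_total = observed_counts.sum()

# Expected count for each cell if the two variables were independent
expected_by_hand = np.outer(row_totals, col_totals) / grand_total

# Yates' continuity correction: shrink each |observed - expected| by 0.5
# before squaring (valid here because every difference exceeds 0.5)
corrected_diff = np.abs(observed_counts - expected_by_hand) - 0.5
chi2_by_hand = (corrected_diff ** 2 / expected_by_hand).sum()

print(expected_by_hand)
print(f'chi^2 = {chi2_by_hand:.4f}')
```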

In [87]:
# create variables of the output
chi2, p, dof, expected = stats.chi2_contingency(df)

In [88]:
#output values so we can easily read them
print('Observed')
print(df.values)
print('\nExpected')
print(expected.astype(int))
print('\n----')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

Observed
[[49 20]
 [ 1 30]]

Expected
[[34 34]
 [15 15]]

----
chi^2 = 36.6526
p     = 0.0000
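
The `:.4f` format rounds a very small p-value to `0.0000`, which can read as if p were exactly zero. A small sketch of an alternative is scientific notation; the value below is copied from the `chi2_contingency` result shown earlier, and `p_from_output` is just an illustrative name.

```python
# p-value copied from the chi2_contingency result shown earlier
p_from_output = 1.4116760526193828e-09

fixed_point = f'{p_from_output:.4f}'   # rounds to four decimal places
scientific = f'{p_from_output:.2e}'    # keeps the order of magnitude visible

print(f'p (fixed)      = {fixed_point}')
print(f'p (scientific) = {scientific}')
```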


In [89]:
# Conclusion based on output variables
#compare our p-value and alpha
if p < alpha:
    print('reject the null hypothesis')
else:
    print('fail to reject the null hypothesis')

reject the null hypothesis


In [None]:
# we reject the null hypothesis
# there is evidence of a relationship between being a Codeup student and using a Macbook

### 2. Choose another 2 categorical variables from the mpg dataset.

 - State your null and alternative hypotheses.
 - State your alpha.
 - Perform a Chi^2 test of independence.
 - State your conclusion

In [90]:
#load mpg dataset
mpg = data('mpg')
mpg.head()

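|  | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
| 2 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
| 3 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
| 4 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
| 5 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |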


In [92]:
# Check out your column names and perform any cleanup you may want on them.

mpg.rename(columns={'class': 'class_type', 'drv': 'drive_wheel'}, inplace=True)
mpg.columns

Index(['manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drive_wheel',
       'cty', 'hwy', 'fl', 'class_type'],
      dtype='object')

In [None]:
# State Hypothesis

- $H_0$: There is NO association between drive_wheel and class_type
- $H_a$: There IS an association between drive_wheel and class_type

In [None]:
# State your alpha.
alpha = 0.05

In [93]:
# Perform a Chi^2 test of independence.


#make 'contingency' table using pandas crosstab
#this is our observed values
observed = pd.crosstab(mpg.drive_wheel, mpg.class_type)
observed


| drive_wheel \ class_type | 2seater | compact | midsize | minivan | pickup | subcompact | suv |
|---|---|---|---|---|---|---|---|
| 4 | 0 | 12 | 3 | 0 | 33 | 4 | 51 |
| f | 0 | 35 | 38 | 11 | 0 | 22 | 0 |
| r | 5 | 0 | 0 | 0 | 0 | 9 | 11 |


In [94]:
#use python function to calculate values
#it does all the work for us
stats.chi2_contingency(observed)

Chi2ContingencyResult(statistic=221.6011438535253, pvalue=1.1048811174475079e-40, dof=12, expected_freq=array([[ 2.2008547 , 20.68803419, 18.04700855,  4.84188034, 14.52564103,
        15.40598291, 27.29059829],
       [ 2.26495726, 21.29059829, 18.57264957,  4.98290598, 14.94871795,
        15.85470085, 28.08547009],
       [ 0.53418803,  5.02136752,  4.38034188,  1.17521368,  3.52564103,
         3.73931624,  6.62393162]]))

In [95]:
#chi2_contingency returns 4 values - chi2, p-value, degrees of freedom, 
# expected values
chi2, p, dof, expected = stats.chi2_contingency(observed)

In [96]:
#output values
print('Observed')
print(observed.values)
print('\nExpected')
print(expected.astype(int))
print('\n----')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

Observed
[[ 0 12  3  0 33  4 51]
 [ 0 35 38 11  0 22  0]
 [ 5  0  0  0  0  9 11]]

Expected
[[ 2 20 18  4 14 15 27]
 [ 2 21 18  4 14 15 28]
 [ 0  5  4  1  3  3  6]]

----
chi^2 = 221.6011
p     = 0.0000
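
A common rule of thumb for the chi-square test is that most expected counts should be at least 5. The sketch below checks that for this table, using the observed counts copied from the crosstab above; the threshold of 5 is a guideline rather than a hard rule, and the variable names are made up for the illustration.

```python
import numpy as np

# Observed drive_wheel x class_type counts, copied from the crosstab above
observed_mpg = np.array([
    [0, 12, 3, 0, 33, 4, 51],
    [0, 35, 38, 11, 0, 22, 0],
    [5, 0, 0, 0, 0, 9, 11],
])

# Expected counts under independence: row total * column total / grand total
row_totals = observed_mpg.sum(axis=1)
col_totals = observed_mpg.sum(axis=0)
expected_mpg = np.outer(row_totals, col_totals) / observed_mpg.sum()

small_cells = int((expected_mpg < 5).sum())
share_small = small_cells / expected_mpg.size
print(f'{small_cells} of {expected_mpg.size} expected counts are below 5 ({share_small:.0%})')
```

When many expected counts are small, the p-value is an approximation that should be read with some caution; combining sparse categories is one option to consider.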


In [97]:
# State your conclusion
#compare our p-value and alpha
if p < alpha:
    print('reject the null hypothesis')
else:
    print('fail to reject the null hypothesis')


reject the null hypothesis


In [98]:
# we reject the null hypothesis
# there is an association between drive_wheel and class_type

### 3. Use the data from the employees database to answer these questions:

 - Is an employee's gender independent of whether an employee works in sales or marketing? (only look at current employees)
 - Is an employee's gender independent of whether or not they are or have been a manager?    

In [104]:
import env

In [105]:
db = 'employees'

url = env.get_db_url(db)
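
The `env` module is a local file that is not shown in this notebook. As a rough guess at what such a helper might look like, the sketch below builds a SQLAlchemy-style connection string. Everything in it is an assumption: the MySQL/`pymysql` driver, the parameter names, and the placeholder credentials.

```python
# Hypothetical contents of env.py; every value below is a placeholder
user = 'your_username'
password = 'your_password'
host = 'your.database.host'

def get_db_url(db_name, user=user, password=password, host=host):
    """Build a SQLAlchemy-style connection URL for a MySQL database."""
    return f'mysql+pymysql://{user}:{password}@{host}/{db_name}'

example_url = get_db_url('employees')
print(example_url)
```

Keeping credentials in a separate file like this (and out of version control) avoids hard-coding them in the notebook.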

In [101]:
query = '''
SELECT gender, dept_name
FROM employees
JOIN dept_emp USING (emp_no)
JOIN departments USING (dept_no)
WHERE to_date > NOW()
  AND dept_name IN ('Sales', 'Marketing');
'''

In [102]:
emp_df =pd.read_sql(query, url)
emp_df.sample(5)

|  | gender | dept_name |
|---|---|---|
| 3657 | M | Marketing |
| 24562 | F | Sales |
| 29122 | F | Sales |
| 49837 | M | Sales |
| 5984 | M | Marketing |


###  Is an employee's gender independent of whether an employee works in sales or marketing? (only look at current employees)

In [None]:
# State Hypothesis

- $H_0$: An employee's gender is NOT associated with whether they work in sales or marketing
- $H_a$: An employee's gender is associated with whether they work in sales or marketing

In [66]:
# State your alpha.
alpha = 0.05

In [74]:
# Perform a Chi^2 test of independence.

#make 'contingency' table using pandas crosstab
#this is our observed values
observed_emp = pd.crosstab(emp_df.gender, emp_df.dept_name)
observed_emp


| gender \ dept_name | Marketing | Sales |
|---|---|---|
| F | 5864 | 14999 |
| M | 8978 | 22702 |


In [75]:
#use python function to calculate values
#it does all the work for us
stats.chi2_contingency(observed_emp)

Chi2ContingencyResult(statistic=0.3240332004060638, pvalue=0.5691938610810126, dof=1, expected_freq=array([[ 5893.2426013, 14969.7573987],
       [ 8948.7573987, 22731.2426013]]))

In [76]:
#chi2_contingency returns 4 values - chi2, p-value, degrees of freedom, 
# expected values
chi2, p, dof, expected = stats.chi2_contingency(observed_emp)

In [77]:
#output values
print('Observed')
print(observed_emp.values)
print('\nExpected')
print(expected.astype(int))
print('\n----')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

Observed
[[ 5864 14999]
 [ 8978 22702]]

Expected
[[ 5893 14969]
 [ 8948 22731]]

----
chi^2 = 0.3240
p     = 0.5692


In [78]:
# State your conclusion
#compare our p-value and alpha
if p < alpha:
    print('reject the null hypothesis')
else:
    print('fail to reject the null hypothesis')


fail to reject the null hypothesis


In [None]:
# we fail to reject the null hypothesis
# we do not have evidence that an employee's gender is associated with whether they work in sales or marketing
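
A quick way to see why the test finds so little evidence is to compare the department split within each gender. The sketch below rebuilds the observed counts by hand (copied from the crosstab above) and converts each row to proportions; `counts` and `row_shares` are illustrative names.

```python
import pandas as pd

# Counts copied from the observed crosstab above
counts = pd.DataFrame(
    {'Marketing': [5864, 8978], 'Sales': [14999, 22702]},
    index=pd.Index(['F', 'M'], name='gender'),
)

# Share of each gender's Sales/Marketing employees in each department
row_shares = counts.div(counts.sum(axis=1), axis=0)
print(row_shares.round(4))
```

If the two rows of proportions are close to each other, the observed table is close to what independence would predict.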

### Is an employee's gender independent of whether or not they are or have been a manager?

In [None]:
# State Hypothesis

- $H_0$: There is NO association between an employee's gender and whether they are or have been a manager.
- $H_a$: There IS an association between an employee's gender and whether they are or have been a manager.

In [103]:
# set alpha
alpha = 0.05

In [106]:
# pull the data we need from database

query = '''
SELECT e.emp_no, e.gender, dm.to_date
FROM dept_manager AS dm
RIGHT JOIN employees AS e USING (emp_no);
'''


In [107]:
manager_df =pd.read_sql(query, url)
manager_df.sample(5)

|  | emp_no | gender | to_date |
|---|---|---|---|
| 140117 | 240093 | M |  |
| 276733 | 476709 | M |  |
| 34206 | 44207 | F |  |
| 280723 | 480699 | F |  |
| 273515 | 473491 | M |  |


In [114]:
# add column for employees who are or have been managers
manager_df['ever_manager'] = manager_df.to_date.notnull()
manager_df.sample(5)

|  | emp_no | gender | to_date | ever_manager |
|---|---|---|---|---|
| 209617 | 409593 | F |  | False |
| 195708 | 295684 | M |  | False |
| 192220 | 292196 | F |  | False |
| 140553 | 240529 | M |  | False |
| 232395 | 432371 | F |  | False |


In [120]:
#make 'contingency' table using pandas crosstab
#this is our observed values
observed_manager = pd.crosstab(manager_df.ever_manager, manager_df.gender)
observed_manager

| ever_manager \ gender | F | M |
|---|---|---|
| False | 120038 | 179962 |
| True | 13 | 11 |


In [121]:
#use python function to calculate values
#it does all the work for us
stats.chi2_contingency(observed_manager)

Chi2ContingencyResult(statistic=1.4566857643547197, pvalue=0.22745818732810363, dof=1, expected_freq=array([[1.20041397e+05, 1.79958603e+05],
       [9.60331174e+00, 1.43966883e+01]]))
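
Because this is a 2x2 table with very few managers, SciPy's default continuity correction has a visible effect on the statistic. The sketch below compares the result with and without the correction, using counts copied from the `observed_manager` crosstab above; `correction=False` is an existing option of `chi2_contingency`, and the other names are made up for the illustration.

```python
import numpy as np
from scipy import stats

# Counts copied from the observed_manager crosstab above
manager_counts = np.array([
    [120038, 179962],  # never a manager: F, M
    [13, 11],          # is or has been a manager: F, M
])

# Default behaviour: Yates' continuity correction is applied (dof == 1)
chi2_corrected, p_corrected, _, _ = stats.chi2_contingency(manager_counts)
# Same test without the continuity correction
chi2_plain, p_plain, _, _ = stats.chi2_contingency(manager_counts, correction=False)

print(f'with correction:    chi^2 = {chi2_corrected:.4f}, p = {p_corrected:.4f}')
print(f'without correction: chi^2 = {chi2_plain:.4f}, p = {p_plain:.4f}')
```

Checking both versions is a simple way to see whether the conclusion depends on the correction.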

In [122]:
#chi2_contingency returns 4 values - chi2, p-value, degrees of freedom, 
# expected values
chi2, p, dof, expected = stats.chi2_contingency(observed_manager)

In [123]:
#output values
print('Observed')
print(observed_manager.values)
print('\nExpected')
print(expected.astype(int))
print('\n----')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

Observed
[[120038 179962]
 [    13     11]]

Expected
[[120041 179958]
 [     9     14]]

----
chi^2 = 1.4567
p     = 0.2275


In [124]:
# State your conclusion
#compare our p-value and alpha
if p < alpha:
    print('reject the null hypothesis')
else:
    print('fail to reject the null hypothesis')


fail to reject the null hypothesis


In [None]:
# we fail to reject the null hypothesis
# we do not have evidence of an association between an employee's gender and whether they are or have been a manager.