## Do your work for this exercise in a jupyter notebook named hypothesis_testing.ipynb.

### For each of the following questions, formulate a null and alternative hypothesis (be as specific as you can be), then give an example of what a true positive, true negative, type I and type II errors would look like. Note that some of the questions are intentionally phrased in a vague way. It is your job to reword these as more precise questions that could be tested.

## - Has the network latency gone up since we switched internet service providers?


### Hypotheses

### $H_0$: The mean network latency has not increased since we switched internet service providers.
### $H_a$: The mean network latency is higher since we switched internet service providers.

### Possible Results:

### True Positive: We conclude that latency has gone up, and it really has gone up.
### True Negative: We conclude that latency has not gone up, and it really has not.
### False Positive (Type I Error): We conclude that latency has gone up, but it actually has not; for example, our sample happened to catch an unusually congested period.
### False Negative (Type II Error): We conclude that latency has not gone up, but it actually has; for example, our sample was too small or too noisy to detect the increase.

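One way the latency question could be made testable is a one-tailed, two-sample t-test on latency measurements taken before and after the switch. The sketch below is only an illustration: the latency values are simulated (the means, spread, and sample sizes are assumptions, not real measurements), and Welch's t-test is one reasonable choice among several.

```python
import numpy as np
from scipy import stats

# Simulated latencies in milliseconds (assumed values, for illustration only)
rng = np.random.default_rng(42)
latency_before = rng.normal(loc=50, scale=5, size=200)
latency_after = rng.normal(loc=55, scale=5, size=200)

alpha = 0.05

# H0: mean latency after the switch <= mean latency before the switch
# Ha: mean latency after the switch >  mean latency before the switch
# Welch's t-test (equal_var=False) does not assume equal variances.
t_stat, p_value = stats.ttest_ind(
    latency_after, latency_before, equal_var=False, alternative='greater'
)

reject_null = p_value < alpha
print(f't = {t_stat:.3f}, p = {p_value:.3g}, reject H0: {reject_null}')
```

With real data, `latency_before` and `latency_after` would be replaced by the measured samples; a Type I error corresponds to `reject_null` being true when the true means do not differ.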
## - Is the website redesign any good?

### Hypotheses

### $H_0$: The average number of daily visits has not changed since the redesign.
### $H_a$: The average number of daily visits is higher since the redesign.

### Possible Results:
### True Positive: We conclude that daily visits are higher than with the previous website, and they really are higher.
### True Negative: We conclude that daily visits have not changed, and they really have not.
### False Positive (Type I Error): We conclude that daily visits are higher, but they actually are not; for example, a short-lived spike from an internet ad campaign landed in our sample.
### False Negative (Type II Error): We conclude that daily visits have not changed, but they actually are higher; for example, our sample covered too few days to detect the increase.

## - Is our television ad driving more sales?

### Hypotheses

### $H_0$: Our average weekly sales have not changed since the television ad began airing.
### $H_a$: Our average weekly sales are higher since the television ad began airing.

### Possible Results:
### True Positive: We conclude that sales have gone up, and they really have gone up.
### True Negative: We conclude that sales have not changed, and they really have not.
### False Positive (Type I Error): We conclude that sales have gone up, but they actually have not; for example, our sample happened to include an unusually strong week.
### False Negative (Type II Error): We conclude that sales have not changed, but they actually have gone up; for example, week-to-week noise hid the increase.

## 1. Use the following contingency table to help answer the question of whether using a Macbook and being a Codeup student are independent of each other.
![Screenshot 2023-10-10 at 4.25.04 PM.png](attachment:dfc439d9-ea20-439b-bb7e-292a35943193.png)

In [86]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

In [87]:
index = ['Uses a Macbook', "Doesn't Use A Macbook"]
columns = ['Codeup Student', 'Not Codeup Student']

observed = pd.DataFrame([[49, 20], [1, 30]], index=index, columns=columns)
observed

Unnamed: 0,Codeup Student,Not Codeup Student
Uses a Macbook,49,20
Doesn't Use A Macbook,1,30


In [88]:
chi2_stat, p, degf, expected = stats.chi2_contingency(observed)
chi2_stat, p, expected

(36.65264142122487,
 1.4116760526193828e-09,
 array([[34.5, 34.5],
        [15.5, 15.5]]))

$H_0$: Whether you use a Macbook and whether you are a Codeup student are independent of each other.

$H_a$: Whether you use a Macbook and whether you are a Codeup student are dependent on each other.

$⍺$ = 0.05

In [89]:
alpha = 0.05

p < alpha

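True

Because p is below ⍺, we reject $H_0$: in this sample, using a Macbook and being a Codeup student are not independent.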

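To see where the statistic reported by `stats.chi2_contingency` comes from, the calculation can be reproduced by hand. This is a sketch assuming SciPy's default behavior for 2x2 tables, which applies Yates' continuity correction (each |observed - expected| is reduced by 0.5 before squaring); the variable names are illustrative.

```python
import numpy as np
from scipy import stats

observed = np.array([[49, 20],
                     [1, 30]])

# Expected counts under independence: (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
n = observed.sum()
expected = row_totals @ col_totals / n

# Chi-square statistic with Yates' continuity correction (used for 2x2 tables)
chi2_manual = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
degf = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value: upper tail of the chi-square distribution
p_manual = stats.chi2.sf(chi2_manual, degf)

print(expected)
print(chi2_manual, p_manual)
```

The printed statistic and p-value should agree with the `chi2_contingency` output shown above.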
## 2. Choose another 2 categorical variables from the mpg dataset.
### - State your null and alternative hypotheses.

In [90]:
from pydataset import data

mpg = data('mpg')
mpg

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
...,...,...,...,...,...,...,...,...,...,...,...
230,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize
231,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize
232,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize
233,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize


In [91]:
mpg['avg_mpg'] = (mpg.cty + mpg.hwy) / 2
mpg

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,avg_mpg
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,23.5
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,25.0
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,25.5
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,25.5
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,21.0
...,...,...,...,...,...,...,...,...,...,...,...,...
230,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize,23.5
231,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize,25.0
232,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize,21.0
233,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize,22.0


In [92]:
# Drop the 4-character suffix such as '(l5)' or '(m6)', leaving 'auto' or 'manual'
mpg.trans = mpg.trans.str[:-4]
mpg.trans

1        auto
2      manual
3      manual
4        auto
5        auto
        ...  
230      auto
231    manual
232      auto
233    manual
234      auto
Name: trans, Length: 234, dtype: object

In [93]:
manu_vs_trans = pd.crosstab(mpg.manufacturer, mpg.trans)
manu_vs_trans

trans,auto,manual
manufacturer,Unnamed: 1_level_1,Unnamed: 2_level_1
audi,11,7
chevrolet,16,3
dodge,30,7
ford,17,8
honda,4,5
hyundai,7,7
jeep,8,0
land rover,4,0
lincoln,3,0
mercury,4,0


In [94]:
null = 'There is no association between Manufacturer and Transmission Type.'

### Hypotheses

#### $H_0$: Manufacturer and Transmission Type are independent.

#### $H_a$: Manufacturer and Transmission Type are dependent.

#### $⍺$ = 0.05

### - State your alpha.

In [95]:
alpha = 0.05

### - Perform a $\chi^2$ test of independence.

In [97]:
chi2, p, degf, expected = stats.chi2_contingency(manu_vs_trans)
chi2, p, degf, expected

(29.293684393117655,
 0.00953444310358795,
 14,
 array([[12.07692308,  5.92307692],
        [12.74786325,  6.25213675],
        [24.82478632, 12.17521368],
        [16.77350427,  8.22649573],
        [ 6.03846154,  2.96153846],
        [ 9.39316239,  4.60683761],
        [ 5.36752137,  2.63247863],
        [ 2.68376068,  1.31623932],
        [ 2.01282051,  0.98717949],
        [ 2.68376068,  1.31623932],
        [ 8.72222222,  4.27777778],
        [ 3.35470085,  1.64529915],
        [ 9.39316239,  4.60683761],
        [22.81196581, 11.18803419],
        [18.11538462,  8.88461538]]))

### - State your conclusion

In [98]:
print(f'''Observed:
{manu_vs_trans.values}

Expected:
{expected.astype(int)}
________________

ꭓ² = {chi2:.4f}
p  = {p}''')

Observed:
[[11  7]
 [16  3]
 [30  7]
 [17  8]
 [ 4  5]
 [ 7  7]
 [ 8  0]
 [ 4  0]
 [ 3  0]
 [ 4  0]
 [ 8  5]
 [ 5  0]
 [ 7  7]
 [20 14]
 [13 14]]

Expected:
[[12  5]
 [12  6]
 [24 12]
 [16  8]
 [ 6  2]
 [ 9  4]
 [ 5  2]
 [ 2  1]
 [ 2  0]
 [ 2  1]
 [ 8  4]
 [ 3  1]
 [ 9  4]
 [22 11]
 [18  8]]
________________

ꭓ² = 29.2937
p  = 0.00953444310358795


In [99]:
#Conclude:

if p < alpha:
    print(f'We reject H₀:{null}')
else:
    print(f'We FAIL to reject H₀:{null}')

We reject H₀:There is no association between Manufacturer and Transmission Type.

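Before leaning on this result, it may be worth checking the expected counts, since the chi-square approximation becomes less reliable when many of them are small. A common rule of thumb (an assumption here, not a requirement of the exercise) is that no more than about 20% of expected counts should fall below 5 and none below 1. The sketch below rebuilds the observed table from the values printed above and counts the small cells.

```python
import numpy as np
from scipy import stats

# Observed counts (auto, manual) per manufacturer, copied from the output above
observed = np.array([
    [11, 7], [16, 3], [30, 7], [17, 8], [4, 5],
    [7, 7], [8, 0], [4, 0], [3, 0], [4, 0],
    [8, 5], [5, 0], [7, 7], [20, 14], [13, 14],
])

chi2, p, degf, expected = stats.chi2_contingency(observed)

# How many expected counts fall below the usual threshold of 5?
small_cells = int((expected < 5).sum())
share_small = small_cells / expected.size
min_expected = expected.min()

print(f'{small_cells} of {expected.size} expected counts are below 5 '
      f'({share_small:.0%}); the smallest is {min_expected:.2f}')
```

If the share of small cells is high, one option is to group rare manufacturers together before testing, and to treat the rejection above with some caution.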

## 3. Use the data from the employees database to answer these questions:
### - Is an employee's gender independent of whether an employee works in sales or marketing? (only look at current employees)

In [100]:
import pandas as pd
import env

url = f'mysql+pymysql://{env.user}:{env.password}@{env.host}/employees' 
query = '''
SELECT e.gender, d.dept_name
FROM employees e
JOIN dept_emp de USING (emp_no)
JOIN departments d USING (dept_no)
WHERE de.to_date > NOW()
	AND d.dept_name IN ('Marketing', 'Sales')
'''

gender_and_dept = pd.read_sql(query, url)
gender_and_dept

Unnamed: 0,gender,dept_name
0,F,Marketing
1,M,Marketing
2,F,Marketing
3,F,Marketing
4,F,Marketing
...,...,...
52538,F,Sales
52539,M,Sales
52540,M,Sales
52541,F,Sales


In [101]:
ctab = pd.crosstab(gender_and_dept.gender, gender_and_dept.dept_name)
ctab

dept_name,Marketing,Sales
gender,Unnamed: 1_level_1,Unnamed: 2_level_1
F,5864,14999
M,8978,22702


### Hypotheses

#### $H_0$: Among current Sales and Marketing employees, gender and department are independent.

#### $H_a$: Among current Sales and Marketing employees, gender and department are dependent.

#### ⍺ = 0.01

In [102]:
null = "There is no association between gender and whether in Sales and Marketing."

In [103]:
chi2_stat, p, degf, expected = stats.chi2_contingency(ctab)
chi2_stat, p, degf, expected

(0.3240332004060638,
 0.5691938610810126,
 1,
 array([[ 5893.2426013, 14969.7573987],
        [ 8948.7573987, 22731.2426013]]))

In [104]:
alpha = 0.01

p < alpha

False

In [105]:
p

0.5691938610810126

In [106]:
if p < alpha:
    print(f'We reject H₀:{null}')
else:
    print(f'We FAIL to reject H₀:{null}')

We FAIL to reject H₀:There is no association between gender and whether in Sales and Marketing.

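A quick way to make sense of failing to reject $H_0$ here is to compare the gender split within each department. The sketch below rebuilds the crosstab from the counts shown above (the DataFrame construction is illustrative; in the notebook, `ctab` already holds these values) and converts each column to proportions.

```python
import pandas as pd

# Counts copied from the crosstab output above
ctab = pd.DataFrame(
    {'Marketing': [5864, 8978], 'Sales': [14999, 22702]},
    index=pd.Index(['F', 'M'], name='gender'),
)

# Share of each gender within each department (each column sums to 1)
share_by_dept = ctab / ctab.sum(axis=0)

print(share_by_dept.round(4))
```

The share of women is close to 40% in both departments, which is consistent with the large p-value.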

### - Is an employee's gender independent of whether or not they are or have been a manager?

In [107]:
query = '''
SELECT e.gender, dm.dept_no
FROM employees e
LEFT JOIN dept_manager dm USING (emp_no)
'''

gender_and_was_manager = pd.read_sql(query, url)
gender_and_was_manager

Unnamed: 0,gender,dept_no
0,M,
1,F,
2,M,
3,M,
4,M,
...,...,...
300019,F,
300020,M,
300021,M,
300022,M,


In [108]:
def was_manager(value):
    # dept_no is null when the LEFT JOIN finds no row in dept_manager,
    # so a missing value means the employee has never been a manager.
    if pd.isna(value):
        return 'Has Not Been Manager'
    else:
        return 'Has Been Manager'

gender_and_was_manager['was_manager'] = gender_and_was_manager.dept_no.apply(was_manager)
gender_and_was_manager

Unnamed: 0,gender,dept_no,was_manager
0,M,,Has Not Been Manager
1,F,,Has Not Been Manager
2,M,,Has Not Been Manager
3,M,,Has Not Been Manager
4,M,,Has Not Been Manager
...,...,...,...
300019,F,,Has Not Been Manager
300020,M,,Has Not Been Manager
300021,M,,Has Not Been Manager
300022,M,,Has Not Been Manager


In [109]:
ctab = pd.crosstab(gender_and_was_manager.gender, gender_and_was_manager.was_manager)
ctab

was_manager,Has Been Manager,Has Not Been Manager
gender,Unnamed: 1_level_1,Unnamed: 2_level_1
F,13,120038
M,11,179962


### Hypotheses

#### $H_0$: Gender and whether an employee has been a manager are independent.

#### $H_a$: Gender and whether an employee has been a manager are dependent.

#### ⍺ = 0.01

In [110]:
null = "There is no association between gender and whether has been manager."

In [111]:
chi2_stat, p, degf, expected = stats.chi2_contingency(ctab)
chi2_stat, p, degf, expected

(1.4566857643547197,
 0.22745818732810363,
 1,
 array([[9.60331174e+00, 1.20041397e+05],
        [1.43966883e+01, 1.79958603e+05]]))

In [112]:
alpha = 0.01

p < alpha

False

In [113]:
p

0.22745818732810363

In [114]:
if p < alpha:
    print(f'We reject H₀:{null}')
else:
    print(f'We FAIL to reject H₀:{null}')

We FAIL to reject H₀:There is no association between gender and whether has been manager.

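Labeling rows from a LEFT JOIN is easy to get backwards: a null `dept_no` means there was no matching row in `dept_manager`, so that employee has never been a manager. The sketch below shows a vectorized way to build the label with `isna()` and `np.where`; the small `sample` frame is hypothetical stand-in data, not rows from the employees database.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the LEFT JOIN result (None = no dept_manager row)
sample = pd.DataFrame({
    'gender': ['M', 'F', 'F', 'M'],
    'dept_no': [None, 'd001', None, 'd004'],
})

# isna() catches both None and NaN, unlike comparing with `== None`
sample['was_manager'] = np.where(
    sample.dept_no.isna(), 'Has Not Been Manager', 'Has Been Manager'
)

print(pd.crosstab(sample.gender, sample.was_manager))
```

Checking the resulting crosstab against a known fact (only a handful of employees have ever been managers) is a cheap sanity check on the label direction.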