#### Which statistical test should you use?

1. How are `altitude` and `temperature` related?
2. Is the average quiz score for Hopper students higher than average quiz score for Germain students?
3. Is there a relationship between `churn` and `payment-type`?
4. Is the average `petal-length` different for the three species of Iris?
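
One way to approach these questions is to look at the types of the variables involved. The sketch below pairs each question with a commonly used SciPy test and runs it on synthetic data. Every array and parameter in it is invented for illustration and is not one of the course datasets.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # synthetic data only; every value below is an assumption

# 1. Two continuous variables -> correlation test (Pearson's r)
altitude = rng.uniform(0, 3000, size=50)
temperature = 25 - 0.0065 * altitude + rng.normal(0, 1, size=50)
r, p_corr = stats.pearsonr(altitude, temperature)

# 2. Is one group's mean higher than another's? -> one-tailed two-sample t-test
hopper = rng.normal(80, 10, size=40)
germain = rng.normal(75, 10, size=40)
t, p_two_sided = stats.ttest_ind(hopper, germain)
# Convert the two-sided p-value to a one-sided one for Ha: hopper mean > germain mean
p_one_sided = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2

# 3. Two categorical variables -> chi-square test of independence
churn_by_payment = np.array([[30, 20],
                             [10, 40]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(churn_by_payment)

# 4. Mean of a continuous variable across three or more groups -> one-way ANOVA
setosa = rng.normal(1.5, 0.2, size=30)
versicolor = rng.normal(4.3, 0.5, size=30)
virginica = rng.normal(5.5, 0.5, size=30)
f, p_anova = stats.f_oneway(setosa, versicolor, virginica)

print(f'pearson r = {r:.2f}, p = {p_corr:.4f}')
print(f't = {t:.2f}, one-sided p = {p_one_sided:.4f}')
print(f'chi2 = {chi2:.2f}, p = {p_chi2:.4f}')
print(f'F = {f:.2f}, p = {p_anova:.4f}')
```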

A pharmaceutical company is conducting clinical research to determine whether `drug A` is effective in preventing severe disease due to COVID compared to the current standard treatment. What would be the null and alternative hypotheses?

- Due to flaws in the clinical trial, the researchers concluded that `drug A` is effective when in reality it is not. What kind of error did they make?

- What kind of error would the researchers make if the study concluded that `drug A` is not effective when in reality it is?
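
The two scenarios above differ only in which way the decision disagrees with the truth. A minimal sketch of that logic follows; `classify_outcome` is a hypothetical helper written for this illustration, and the hypotheses in the comments are one common way to frame the trial, not the only one.

```python
def classify_outcome(reject_null: bool, null_is_true: bool) -> str:
    """Name the outcome of a test given the decision and the (normally unknown) truth."""
    if reject_null and null_is_true:
        return 'Type I error (false positive)'
    if not reject_null and not null_is_true:
        return 'Type II error (false negative)'
    return 'correct decision'


# Assumed framing:
# H0: drug A is no more effective than the current standard treatment
# Ha: drug A is more effective than the current standard treatment

# Concluded "effective" (rejected H0) although H0 was true
print(classify_outcome(reject_null=True, null_is_true=True))

# Concluded "not effective" (failed to reject H0) although H0 was false
print(classify_outcome(reject_null=False, null_is_true=False))
```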

1. Use the following contingency table to help answer the question of whether using a MacBook and being a Codeup student are independent of each other.

In [2]:
# imports
import numpy as np
import pandas as pd
import scipy.stats as stats
from pydataset import data
import env

In [3]:
# Set the significance level
confidence_level = 0.95
alpha = 1 - confidence_level

In [4]:
# create a dataframe:

contingency = pd.DataFrame({'codeup_student': [49,1], 'not_codeup': [20,30]}, index=['uses_macbook', 'not_macbook'])

In [5]:
contingency

|  | codeup_student | not_codeup |
|---|---|---|
| uses_macbook | 49 | 20 |
| not_macbook | 1 | 30 |


In [None]:
# H0: Macbook Usage is independent of being a Codeup Student
# Ha: Macbook Usage is not independent of being a Codeup Student

In [None]:
# make our computation

In [6]:
chi2, p, degf, expected = stats.chi2_contingency(contingency)

print('Observed:\n')
print(contingency.values)
print('------------------------\nExpected: \n')
print(expected)
print('------------------------\n')
print(f'chi2 = {chi2:.2f}')
print(f'p value: {p:.4f}')
if p < alpha:
    print('We can reject the null hypothesis')

Observed:

[[49 20]
 [ 1 30]]
------------------------
Expected: 

[[34.5 34.5]
 [15.5 15.5]]
------------------------

chi2 = 36.65
p value: 0.0000
We can reject the null hypothesis
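
To see where the expected counts and the statistic come from, the calculation can be reproduced by hand. The sketch below assumes the standard formulas: each expected count is row total × column total / grand total, and, because a 2×2 table has one degree of freedom, `chi2_contingency` applies Yates' continuity correction by default.

```python
import numpy as np

observed = np.array([[49, 20],
                     [1, 30]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
n = observed.sum()

# Expected count under independence: row total * column total / grand total
expected = row_totals * col_totals / n

# Yates' continuity correction: shrink each |observed - expected| by 0.5
# (valid here because every difference is larger than 0.5)
corrected = np.abs(observed - expected) - 0.5
chi2_manual = (corrected ** 2 / expected).sum()

print(expected)
print(f'chi2 = {chi2_manual:.2f}')  # matches the 36.65 reported above
```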


### 2. Choose another 2 categorical variables from the mpg dataset and perform a chi2 contingency table test with them. Be sure to state your null and alternative hypotheses.

In [None]:
# load up our mpg data

In [7]:
df = data('mpg')

In [8]:
df.head()

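|  | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
| 2 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
| 3 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
| 4 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
| 5 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |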


In [9]:
df['transmission_type'] = np.where(df.trans.str.contains('auto'), 'Auto', 'Manual')

In [10]:
df.head()

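|  | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class | transmission_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact | Auto |
| 2 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact | Manual |
| 3 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact | Manual |
| 4 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact | Auto |
| 5 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact | Auto |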


In [None]:
# pick our two categoricals

In [11]:
df.transmission_type.value_counts()

Auto      157
Manual     77
Name: transmission_type, dtype: int64

In [12]:
df['drv'].value_counts()

f    106
4    103
r     25
Name: drv, dtype: int64

In [13]:
a = df.transmission_type
b = df.drv

## What are our null and alternative hypotheses?

### H0: Transmission type is independent of drive type

### Ha: Transmission type is not independent of drive type

In [14]:
# create a contingency table

observed = pd.crosstab(a,b)
observed

| transmission_type (rows) / drv (columns) | 4 | f | r |
|---|---|---|---|
| Auto | 75 | 65 | 17 |
| Manual | 28 | 41 | 8 |


In [15]:
chi2, p, degf, expected = stats.chi2_contingency(observed)

print('Observed:\n')
print(observed.values)
print('------------------------\nExpected: \n')
print(expected.astype(int))
print('------------------------\n')
print(f'chi2 = {chi2:.2f}')
print(f'p value: {p:.4f}')
if p < alpha:
    print('We can reject the null hypothesis')
else:
    print('We fail to reject the null hypothesis')

Observed:

[[75 65 17]
 [28 41  8]]
------------------------
Expected: 

[[69 71 16]
 [33 34  8]]
------------------------

chi2 = 3.14
p value: 0.2084
We fail to reject the null hypothesis
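
The same result can be checked by hand. For a table larger than 2×2, `chi2_contingency` applies no continuity correction, and with exactly two degrees of freedom the chi-square p-value has the closed form exp(-chi2 / 2). The sketch below relies on those two facts. It also prints the expected counts with `np.round` rather than `.astype(int)`, since integer casting truncates (the first expected row above sums to 156 rather than 157).

```python
import math

import numpy as np

observed = np.array([[75, 65, 17],
                     [28, 41, 8]])

# Expected count under independence: row total * column total / grand total
expected = (observed.sum(axis=1, keepdims=True)
            * observed.sum(axis=0, keepdims=True)
            / observed.sum())

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Pearson chi-square statistic, no continuity correction
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Closed-form survival function, valid only when dof == 2
p_manual = math.exp(-chi2_manual / 2)

print(np.round(expected, 1))
print(f'dof = {dof}')
print(f'chi2 = {chi2_manual:.2f}')  # matches the 3.14 reported above
print(f'p value: {p_manual:.4f}')   # matches the 0.2084 reported above
```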


### 3. Use the data from the employees database to answer these questions:

#### Is an employee's gender independent of whether an employee works in sales or marketing? (Only look at current employees.)


In [None]:
#  gender: department (marketing or sales)
# current employees only
# 
# tables needed:
# employees
# dept_emp
# departments

In [16]:
# set up sql connection
url = f'mysql+pymysql://{env.user}:{env.password}@{env.host}/employees'

In [17]:
# make our query for the parameters that we want:

query = '''SELECT e.gender, d.dept_name
FROM employees AS e
JOIN dept_emp AS de ON de.emp_no = e.emp_no
    AND de.to_date > CURDATE()
JOIN departments AS d ON de.dept_no = d.dept_no'''

In [18]:
# load up our data
gender_dept = pd.read_sql(query, url)
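
The condition `to_date > CURDATE()` in the join is what restricts the result to current employees: open-ended assignments in the employees sample database carry a far-future `to_date` (assumed here to be `9999-01-01`). The sketch below reproduces the same join on a tiny in-memory SQLite database so that it runs without credentials. The rows are invented, and SQLite's `date('now')` stands in for MySQL's `CURDATE()`.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
CREATE TABLE employees (emp_no INTEGER, gender TEXT);
CREATE TABLE departments (dept_no TEXT, dept_name TEXT);
CREATE TABLE dept_emp (emp_no INTEGER, dept_no TEXT, to_date TEXT);

INSERT INTO employees VALUES (1, 'M'), (2, 'F'), (3, 'F');
INSERT INTO departments VALUES ('d001', 'Marketing'), ('d007', 'Sales');
INSERT INTO dept_emp VALUES
    (1, 'd001', '9999-01-01'),  -- current assignment
    (2, 'd007', '9999-01-01'),  -- current assignment
    (3, 'd007', '2001-05-31');  -- ended assignment, should be excluded
''')

toy_query = '''SELECT e.gender, d.dept_name
FROM employees AS e
JOIN dept_emp AS de ON de.emp_no = e.emp_no
    AND de.to_date > date('now')
JOIN departments AS d ON de.dept_no = d.dept_no
ORDER BY e.emp_no'''

# ISO-formatted date strings compare correctly as text in SQLite
rows = conn.execute(toy_query).fetchall()
print(rows)
conn.close()
```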

In [19]:
gender_dept.head()

|  | gender | dept_name |
|---|---|---|
| 0 | M | Customer Service |
| 1 | F | Customer Service |
| 2 | M | Customer Service |
| 3 | F | Customer Service |
| 4 | F | Customer Service |


In [20]:
gender_dept.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240124 entries, 0 to 240123
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   gender     240124 non-null  object
 1   dept_name  240124 non-null  object
dtypes: object(2)
memory usage: 3.7+ MB


In [21]:
# restrict that to just sales and marketing

gender_dept = gender_dept[gender_dept.dept_name.isin(['Sales', 'Marketing'])]

In [22]:
# make our crosstab for observed values

observed = pd.crosstab(gender_dept.gender, gender_dept.dept_name)

In [23]:
observed

| gender (rows) / dept_name (columns) | Marketing | Sales |
|---|---|---|
| F | 5864 | 14999 |
| M | 8978 | 22702 |


In [None]:
# Set up null and alternate hypothesis:


# H0: an employee's gender is independent of whether they work in marketing or sales
# Ha: an employee's gender is not independent of whether they work in marketing or sales

In [24]:
chi2, p, degf, expected = stats.chi2_contingency(observed)

print('Observed:\n')
print(observed.values)
print('------------------------\nExpected: \n')
print(expected.astype(int))
print('------------------------\n')
print(f'chi2 = {chi2:.2f}')
print(f'p value: {p:.5f}')
if p < alpha:
    print('We can reject the null hypothesis')
else:
    print('We fail to reject the null hypothesis')

Observed:

[[ 5864 14999]
 [ 8978 22702]]
------------------------
Expected: 

[[ 5893 14969]
 [ 8948 22731]]
------------------------

chi2 = 0.32
p value: 0.56919
We fail to reject the null hypothesis
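
The reporting block is repeated almost verbatim for every test in this notebook, so it could be wrapped in a function. The sketch below is one possible refactor, not part of the original solution; `chi2_report` is a hypothetical name, and it rounds the expected counts instead of truncating them.

```python
import numpy as np
from scipy import stats


def chi2_report(observed, alpha=0.05):
    """Run a chi-square test of independence, print a summary, return the result."""
    observed = np.asarray(observed)
    chi2, p, degf, expected = stats.chi2_contingency(observed)

    print('Observed:\n')
    print(observed)
    print('------------------------\nExpected:\n')
    print(np.round(expected, 1))
    print('------------------------\n')
    print(f'chi2 = {chi2:.2f}')
    print(f'p value: {p:.4f}')

    reject = bool(p < alpha)
    if reject:
        print('We can reject the null hypothesis')
    else:
        print('We fail to reject the null hypothesis')
    return chi2, p, reject


# Gender vs. department counts from the crosstab above
chi2, p, reject = chi2_report([[5864, 14999],
                               [8978, 22702]])
```

Returning the statistic, the p-value, and the decision keeps the function usable in later cells, for example to collect several test results into one summary table.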


#### Is an employee's gender independent of whether or not they are or have been a manager?

In [None]:
# set up a new query for this data. we want:
# 
# employee gender, manager status
# 
# tables needed: employees, dept_manager

In [25]:
query = '''SELECT e.emp_no, e.gender, dm.dept_no
FROM employees as e
LEFT JOIN dept_manager AS dm ON e.emp_no = dm.emp_no'''

In [26]:
# load up our dataframe from our query

In [27]:
gender_manager = pd.read_sql(query, url)

In [28]:
gender_manager

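|  | emp_no | gender | dept_no |
|---|---|---|---|
| 0 | 10001 | M |  |
| 1 | 10002 | F |  |
| 2 | 10003 | M |  |
| 3 | 10004 | M |  |
| 4 | 10005 | M |  |
| ... | ... | ... | ... |
| 300019 | 499995 | F |  |
| 300020 | 499996 | M |  |
| 300021 | 499997 | M |  |
| 300022 | 499998 | M |  |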


In [30]:
# rename that column and fill the na's

gender_manager = gender_manager.rename(columns={'dept_no': 'manager'}).fillna(0)

In [31]:
gender_manager.head()

|  | emp_no | gender | manager |
|---|---|---|---|
| 0 | 10001 | M | 0 |
| 1 | 10002 | F | 0 |
| 2 | 10003 | M | 0 |
| 3 | 10004 | M | 0 |
| 4 | 10005 | M | 0 |


In [None]:
# run an apply function to make manager status a binary

In [32]:
gender_manager['manager'] = gender_manager['manager'].apply(lambda x: x if x == 0 else 1)

In [None]:
# another way (using np.where)

# gender_manager['manager'] = np.where(gender_manager.manager == 0, 0, 1)

In [33]:
gender_manager.manager.value_counts()

0    300000
1        24
Name: manager, dtype: int64
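
The fill-then-map steps can also be collapsed into a single expression, because after the `LEFT JOIN` a missing `dept_no` already means "never a manager". The sketch below shows two equivalent one-liners on a toy frame; the three rows are invented stand-ins for the query result.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the LEFT JOIN result: dept_no is missing for non-managers
toy = pd.DataFrame({
    'emp_no': [10001, 10002, 110022],
    'gender': ['M', 'F', 'M'],
    'dept_no': [None, None, 'd001'],
})

# Option 1: a non-null dept_no means the employee has been a manager
is_manager = toy['dept_no'].notna().astype(int)

# Option 2: the same logic with np.where
via_where = np.where(toy['dept_no'].isna(), 0, 1)

print(is_manager.tolist())
print(via_where.tolist())
```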

In [34]:
observed = pd.crosstab(gender_manager['gender'], gender_manager['manager'])

In [35]:
observed

| gender (rows) / manager (columns) | 0 | 1 |
|---|---|---|
| F | 120038 | 13 |
| M | 179962 | 11 |


In [None]:
# H0: Employee gender is independent of history in management
# Ha: Employee gender is not independent of history in management

In [36]:
chi2, p, degf, expected = stats.chi2_contingency(observed)

print('Observed:\n')
print(observed.values)
print('------------------------\nExpected: \n')
print(expected.astype(int))
print('------------------------\n')
print(f'chi2 = {chi2:.2f}')
print(f'p value: {p:.4f}')
if p < alpha:
    print('We can reject the null hypothesis')
else:
    print('We fail to reject the null hypothesis')

Observed:

[[120038     13]
 [179962     11]]
------------------------
Expected: 

[[120041      9]
 [179958     14]]
------------------------

chi2 = 1.46
p value: 0.2275
We fail to reject the null hypothesis
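
With only 24 managers, the smallest expected cell count is about 9.6 (120051 × 24 / 300024). That is above the common rule-of-thumb minimum of 5, so the chi-square approximation is usually considered acceptable here. As an optional cross-check that is not part of the original exercise, Fisher's exact test avoids the approximation for 2×2 tables. A minimal sketch using `scipy.stats.fisher_exact`:

```python
from scipy import stats

# Gender vs. manager counts from the crosstab above
observed = [[120038, 13],
            [179962, 11]]

# Two-sided Fisher's exact test; the first return value is the sample odds ratio
odds_ratio, p_fisher = stats.fisher_exact(observed)

print(f'odds ratio = {odds_ratio:.3f}')
print(f'p value: {p_fisher:.4f}')
```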
