## Overview

For each of the following questions, formulate a null and alternative hypothesis (be as specific as you can be), then give an example of what a true positive, true negative, type I and type II errors would look like. Note that some of the questions are intentionally phrased in a vague way. It is your job to reword these as more precise questions that could be tested.

- Has the network latency gone up since we switched internet service providers?

Null hypothesis: Average network latency is the same with the previous service provider and the new service provider.

Alternative hypothesis: Average network latency differs between the two service providers.

    - True positive: We conclude that latency changed, and it really did change after the switch.
    - True negative: We conclude that latency did not change, and it really neither increased nor decreased.
    - Type I error: We conclude that latency changed, but it actually hadn't changed.
    - Type II error: Latency actually changed, but we failed to detect it.
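
As a rough sketch of how these hypotheses could be tested, the cell below runs a two-sample t-test on simulated latency measurements. The numbers are made up for illustration (they are not real measurements), and Welch's t-test is just one reasonable choice of test, not something the exercise prescribes.

```python
import numpy as np
from scipy import stats

# Hypothetical latency samples in milliseconds (simulated, not real data)
rng = np.random.default_rng(42)
old_isp = rng.normal(loc=50, scale=8, size=200)
new_isp = rng.normal(loc=53, scale=8, size=200)

alpha = 0.05

# Welch's two-sample t-test: does not assume equal variances
t_stat, p_value = stats.ttest_ind(new_isp, old_isp, equal_var=False)

print(f't = {t_stat:.4f}')
print(f'p = {p_value:.4f}')

if p_value < alpha:
    print('We reject the null hypothesis')
else:
    print('We fail to reject the null hypothesis')
```

Rejecting the null here when the two providers truly have the same latency would be a type I error; failing to reject it when they truly differ would be a type II error.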

- Is the website redesign any good?

Null hypothesis: Average daily visitors to the website are the same before and after the redesign.

Alternative hypothesis: Average daily visitors to the website differ before and after the redesign.

    - True positive: We conclude that daily visitors changed, and the redesign really did change them.
    - True negative: We conclude that daily visitors did not change, and they really stayed approximately the same.
    - Type I error: We conclude that the redesign changed daily visitors, but the increase we saw was actually caused by an external event.
    - Type II error: The redesign really did change daily visitors, but we failed to detect the change.

- Is our television ad driving more sales?

Null hypothesis: Average sales are the same before and after the television ad aired.

Alternative hypothesis: Average sales differ before and after the television ad aired.

    - True positive: We conclude that sales changed, and the ad really did cause sales to increase or decrease.
    - True negative: We conclude that sales did not change, and the ad really had no effect on sales.
    - Type I error: We conclude that sales changed, but in reality the ad had no effect.
    - Type II error: The ad really had an effect, but we failed to detect any change in sales.

----

## Comparison of Groups

In [2]:
# standard data imports
import pandas as pd
import numpy as np
import env

#pulling sample dataset
from pydataset import data

#new library for stats
from scipy import stats

1. Use the following contingency table to help answer the question of whether using a Macbook and being a Codeup student are independent of each other.
    
    
|  | Codeup Student | Not Codeup Student |
| --- | --- | --- |
| Uses a Macbook | 49 | 20 |
| Doesn't Use A Macbook | 1 | 30 |

$H_0$: Using a Macbook and being a Codeup student are unrelated to each other (independent)

$H_a$: Using a Macbook and being a Codeup student *are* related to each other (dependent)

In [6]:
# Create the table
observed = pd.DataFrame(
{
    'CodeupStudent':[49,1],
    'NotCodeupStudent':[20,30]
},index=['Uses a Macbook',"Doesn't Use A Macbook"]
)
observed

|  | CodeupStudent | NotCodeupStudent |
| --- | --- | --- |
| Uses a Macbook | 49 | 20 |
| Doesn't Use A Macbook | 1 | 30 |


In [8]:
# Set our alpha
alpha = 0.05

In [15]:
chi2,p,dof,expected = stats.chi2_contingency(observed)

In [17]:
#output values
print('Observed')
print(observed.values)
print('\nExpected')
# round instead of casting to int, which would truncate 34.5 to 34
print(expected.round(1))
print('\n----')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

Observed
[[49 20]
 [ 1 30]]

Expected
[[34.5 34.5]
 [15.5 15.5]]

----
chi^2 = 36.6526
p     = 0.0000


In [19]:
if p < alpha:
    print('We reject the null hypothesis')
else:
    print('We fail to reject the null hypothesis')

We reject the null hypothesis
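
To see where the reported statistic comes from, here is a small sketch that rebuilds the expected counts and the $\chi^2$ value by hand and compares them with `stats.chi2_contingency`. It assumes SciPy's default behavior of applying Yates' continuity correction to 2x2 tables; the variable names (`chi2_manual`, `chi2_scipy`) are only for this illustration.

```python
import numpy as np
from scipy import stats

observed = np.array([[49, 20],
                     [1, 30]])

# Expected counts under independence: (row total * column total) / grand total
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
expected = np.outer(row_totals, col_totals) / observed.sum()

# Yates' continuity correction: shrink each |O - E| by 0.5 before squaring
chi2_manual = (((np.abs(observed - expected) - 0.5) ** 2) / expected).sum()

# SciPy applies the same correction by default when there is 1 degree of freedom
chi2_scipy, p, dof, expected_scipy = stats.chi2_contingency(observed)

print(expected)
print(f'manual chi^2 = {chi2_manual:.4f}')
print(f'scipy  chi^2 = {chi2_scipy:.4f}')
```

Both values should agree with the `chi^2 = 36.6526` printed above.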


2. Choose another 2 categorical variables from the `mpg` dataset.

In [22]:
# Import the dataset
mpg = data('mpg')
mpg.dtypes

manufacturer     object
model            object
displ           float64
year              int64
cyl               int64
trans            object
drv              object
cty               int64
hwy               int64
fl               object
class            object
dtype: object

In [24]:
mpg.sample(10)

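|  | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 50 | dodge | dakota pickup 4wd | 3.7 | 2008 | 6 | auto(l4) | 4 | 14 | 18 | r | pickup |
| 225 | volkswagen | new beetle | 2.0 | 1999 | 4 | auto(l4) | f | 19 | 26 | r | subcompact |
| 28 | chevrolet | corvette | 7.0 | 2008 | 8 | manual(m6) | r | 15 | 24 | p | 2seater |
| 216 | volkswagen | jetta | 2.0 | 2008 | 4 | auto(s6) | f | 22 | 29 | p | compact |
| 205 | toyota | toyota tacoma 4wd | 3.4 | 1999 | 6 | auto(l4) | 4 | 15 | 19 | r | pickup |
| 17 | audi | a6 quattro | 3.1 | 2008 | 6 | auto(s6) | 4 | 17 | 25 | p | midsize |
| 16 | audi | a6 quattro | 2.8 | 1999 | 6 | auto(l5) | 4 | 15 | 24 | p | midsize |
| 164 | subaru | forester awd | 2.5 | 2008 | 4 | auto(l4) | 4 | 20 | 26 | r | suv |
| 25 | chevrolet | corvette | 5.7 | 1999 | 8 | auto(l4) | r | 15 | 23 | p | 2seater |
| 200 | toyota | land cruiser wagon 4wd | 5.7 | 2008 | 8 | auto(s6) | 4 | 13 | 18 | r | suv |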


In [26]:
# Cycling through variables to explore value counts
# mpg['class'].describe()

- State your null and alternative hypotheses.

$H_0$: The number of cylinders in a vehicle is unrelated to the class of vehicle.

$H_a$: The number of cylinders in a vehicle *is* related to the class of vehicle.

- State your alpha.

In [30]:
alpha = 0.05

- Perform a $\chi^2$ test of independence.

In [33]:
observed = pd.crosstab(mpg.cyl,mpg['class'])
observed

| cyl (rows) / class (columns) | 2seater | compact | midsize | minivan | pickup | subcompact | suv |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 0 | 32 | 16 | 1 | 3 | 21 | 8 |
| 5 | 0 | 2 | 0 | 0 | 0 | 2 | 0 |
| 6 | 0 | 13 | 23 | 10 | 10 | 7 | 16 |
| 8 | 5 | 0 | 2 | 0 | 20 | 5 | 38 |


In [35]:
chi2,p,dof,expected = stats.chi2_contingency(observed)

- State your conclusion

In [38]:
if p < alpha:
    print('We reject the null hypothesis')
else:
    print('We fail to reject the null hypothesis')

We reject the null hypothesis


We conclude that there *is* a relationship between a vehicle's class and its number of cylinders.
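
Rejecting the null says a relationship exists, but not how strong it is or how well the test's assumptions hold. As a hedged sketch (not part of the original exercise), the cell below re-enters the crosstab from above by hand, computes Cramér's V as one common effect-size measure, and counts cells whose expected frequency is below 5, a frequently quoted rule of thumb for when the $\chi^2$ approximation gets shaky.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Crosstab of cyl vs class, copied from the output above
observed = pd.DataFrame(
    [[0, 32, 16, 1, 3, 21, 8],
     [0, 2, 0, 0, 0, 2, 0],
     [0, 13, 23, 10, 10, 7, 16],
     [5, 0, 2, 0, 20, 5, 38]],
    index=pd.Index([4, 5, 6, 8], name='cyl'),
    columns=pd.Index(
        ['2seater', 'compact', 'midsize', 'minivan',
         'pickup', 'subcompact', 'suv'], name='class'),
)

chi2, p, dof, expected = stats.chi2_contingency(observed)

# Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1))), ranges from 0 to 1
n = observed.to_numpy().sum()
min_dim = min(observed.shape) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))

# Rule of thumb: the approximation is less reliable when expected counts are small
small_cells = int((expected < 5).sum())

print(f"Cramér's V = {cramers_v:.4f}")
print(f'cells with expected count < 5: {small_cells} of {expected.size}')
```

If many cells have small expected counts, merging sparse categories (for example, dropping or combining the 5-cylinder row) is one option to consider before trusting the p-value.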

3. Use the data from the employees database to answer these questions:

In [42]:
# Pull the tables from the database
url = env.get_db_url('employees')

query = '''
SELECT * FROM employees
JOIN dept_emp
    USING(emp_no)
JOIN departments
    USING(dept_no)
'''

employees = pd.read_sql(query,url)

In [43]:
employees.head()

|  | dept_no | emp_no | birth_date | first_name | last_name | gender | hire_date | from_date | to_date | dept_name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | d009 | 10011 | 1953-11-07 | Mary | Sluis | F | 1990-01-22 | 1990-01-22 | 1996-11-09 | Customer Service |
| 1 | d009 | 10038 | 1960-07-20 | Huan | Lortz | M | 1989-09-20 | 1989-09-20 | 9999-01-01 | Customer Service |
| 2 | d009 | 10049 | 1961-04-24 | Basil | Tramer | F | 1992-05-04 | 1992-05-04 | 9999-01-01 | Customer Service |
| 3 | d009 | 10060 | 1961-10-15 | Breannda | Billingsley | M | 1987-11-02 | 1992-11-11 | 9999-01-01 | Customer Service |
| 4 | d009 | 10088 | 1954-02-25 | Jungsoon | Syrzycki | F | 1988-09-02 | 1992-03-21 | 9999-01-01 | Customer Service |


- Is an employee's gender independent of whether an employee works in sales or marketing? (only look at current employees)

$H_0$: An employee's gender is independent of whether they work in sales or marketing

$H_a$: An employee's gender is *not* independent of whether they work in sales or marketing

In [47]:
# Saving as new dataframe since I might need the old one still
dept_bool = (employees.dept_name == 'Sales') | (employees.dept_name == 'Marketing')
current_bool = employees.to_date.astype(str) == '9999-01-01'
curr_emp = employees[current_bool & dept_bool]
curr_emp.sample(5)

|  | dept_no | emp_no | birth_date | first_name | last_name | gender | hire_date | from_date | to_date | dept_name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 146844 | d001 | 46727 | 1953-11-22 | LiMin | Dratva | M | 1988-10-10 | 2001-03-02 | 9999-01-01 | Marketing |
| 294849 | d007 | 99091 | 1960-08-28 | Jongsuk | Luga | F | 1992-12-20 | 1996-05-25 | 9999-01-01 | Sales |
| 323048 | d007 | 450702 | 1962-10-25 | Giap | Schneeberger | M | 1988-03-23 | 1998-12-12 | 9999-01-01 | Sales |
| 321335 | d007 | 440786 | 1957-04-17 | Marek | Naumovich | M | 1991-04-30 | 1991-04-30 | 9999-01-01 | Sales |
| 298246 | d007 | 208794 | 1964-05-22 | Garnik | Nergos | F | 1985-12-02 | 1992-11-13 | 9999-01-01 | Sales |


In [49]:
# Save the crosstab into observed variable
observed = pd.crosstab(curr_emp.gender,curr_emp.dept_name).T
observed

| dept_name (rows) / gender (columns) | F | M |
| --- | --- | --- |
| Marketing | 5864 | 8978 |
| Sales | 14999 | 22702 |


In [51]:
chi2,p,dof,expected = stats.chi2_contingency(observed)

In [70]:
if p < alpha:
    print('We reject the null hypothesis')
else:
    print('We fail to reject the null hypothesis')

We fail to reject the null hypothesis


We do not have evidence of a relationship between gender and whether a current employee works in Sales or Marketing.
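
A quick way to see why the test does not reject is to compare the gender split inside each department. The sketch below re-enters the observed counts from the crosstab above and converts each row to proportions; `row_share` is just an illustrative name.

```python
import pandas as pd

# Observed counts copied from the crosstab above
observed = pd.DataFrame(
    {'F': [5864, 14999], 'M': [8978, 22702]},
    index=pd.Index(['Marketing', 'Sales'], name='dept_name'),
)

# Share of each gender within each department (each row sums to 1)
row_share = observed.div(observed.sum(axis=1), axis=0)
print(row_share.round(4))
```

The share of women is within a fraction of a percentage point in both departments, which is consistent with the counts we would expect under independence.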

- Is an employee's gender independent of whether or not they are or have been a manager?

$H_0$: An employee's gender is independent of whether they are or have been a manager.

$H_a$: An employee's gender is *not* independent of whether they are or have been a manager.

In [56]:
# Read new query

query = '''
SELECT *
FROM dept_manager
'''

manager = pd.read_sql(query,url)
manager.sample(10)

|  | emp_no | dept_no | from_date | to_date |
| --- | --- | --- | --- | --- |
| 7 | 110344 | d004 | 1988-09-09 | 1992-08-02 |
| 22 | 111877 | d009 | 1992-09-08 | 1996-01-03 |
| 14 | 110800 | d006 | 1991-09-12 | 1994-06-28 |
| 1 | 110039 | d001 | 1991-10-01 | 9999-01-01 |
| 4 | 110183 | d003 | 1985-01-01 | 1992-03-21 |
| 5 | 110228 | d003 | 1992-03-21 | 9999-01-01 |
| 2 | 110085 | d002 | 1985-01-01 | 1989-12-17 |
| 0 | 110022 | d001 | 1985-01-01 | 1991-10-01 |
| 8 | 110386 | d004 | 1992-08-02 | 1996-08-30 |
| 21 | 111784 | d009 | 1988-10-17 | 1992-09-08 |


In [57]:
# Join employees with manager
emp_manager = employees.merge(manager,how='left',on='emp_no')
emp_manager.sample(10)

|  | dept_no_x | emp_no | birth_date | first_name | last_name | gender | hire_date | from_date_x | to_date_x | dept_name | dept_no_y | from_date_y | to_date_y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 228822 | d004 | 462182 | 1954-02-07 | Yannis | Birdsall | F | 1991-03-20 | 1998-02-15 | 9999-01-01 | Production |  |  |  |
| 229993 | d004 | 466851 | 1953-09-21 | Owen | Leijenhorst | M | 1987-10-01 | 1987-10-01 | 1997-06-28 | Production |  |  |  |
| 257228 | d006 | 484203 | 1961-08-23 | Jiong | Boudaillier | F | 1988-02-19 | 1996-08-30 | 1998-12-07 | Quality Management |  |  |  |
| 121158 | d002 | 405677 | 1953-06-22 | Hugo | Delgrange | F | 1992-10-30 | 1992-10-30 | 9999-01-01 | Finance |  |  |  |
| 210603 | d004 | 287789 | 1963-06-30 | Stella | Hardjono | M | 1990-10-21 | 1996-08-03 | 1996-08-19 | Production |  |  |  |
| 288486 | d007 | 61801 | 1961-07-13 | Gor | Rissland | M | 1994-12-10 | 1994-12-10 | 9999-01-01 | Sales |  |  |  |
| 237614 | d004 | 497962 | 1956-07-07 | Uwe | Pell | M | 1985-03-05 | 1988-08-24 | 9999-01-01 | Production |  |  |  |
| 305446 | d007 | 250073 | 1964-08-27 | Jiong | Bahk | M | 1988-03-25 | 1998-04-13 | 2002-02-15 | Sales |  |  |  |
| 148159 | d001 | 66551 | 1954-03-27 | Eirik | Rosiles | F | 1993-06-02 | 1993-06-02 | 9999-01-01 | Marketing |  |  |  |
| 139999 | d003 | 424709 | 1956-08-01 | Moni | Heijenga | M | 1986-05-13 | 1992-01-17 | 2001-01-12 | Human Resources |  |  |  |


In [60]:
# Do some cleanup
emp_manager = emp_manager.rename(columns={
    'from_date_x':'dept_from_date',
    'from_date_y':'mgr_from_date',
    'to_date_x':'dept_to_date',
    'to_date_y':'mgr_to_date',
})

emp_manager = emp_manager.drop(columns=['dept_no_y','is_mgr'],errors='ignore')
emp_manager.sample(10)

|  | dept_no_x | emp_no | birth_date | first_name | last_name | gender | hire_date | dept_from_date | dept_to_date | dept_name | mgr_from_date | mgr_to_date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 39006 | d005 | 64069 | 1963-11-01 | Nectarios | Borovoy | M | 1993-01-02 | 1993-01-02 | 9999-01-01 | Development |  |  |
| 16416 | d009 | 409860 | 1964-10-16 | Aris | Desikan | F | 1987-08-20 | 2001-07-08 | 9999-01-01 | Customer Service |  |  |
| 329010 | d007 | 484882 | 1960-03-06 | Pradeep | Coney | F | 1986-08-06 | 1987-12-11 | 9999-01-01 | Sales |  |  |
| 68013 | d005 | 255111 | 1958-09-19 | Billie | Honiden | M | 1996-10-10 | 1996-10-10 | 9999-01-01 | Development |  |  |
| 178033 | d004 | 64310 | 1953-05-18 | Khatoun | Kossowski | M | 1985-02-17 | 1992-08-13 | 9999-01-01 | Production |  |  |
| 38835 | d005 | 63473 | 1958-04-19 | Shim | Gyorkos | F | 1986-07-12 | 2001-03-06 | 9999-01-01 | Development |  |  |
| 195278 | d004 | 224950 | 1961-05-12 | Marl | Kohling | F | 1994-08-30 | 1996-03-31 | 2001-09-28 | Production |  |  |
| 267926 | d008 | 238294 | 1963-10-06 | Tokuyasu | Peral | F | 1990-03-21 | 1996-01-26 | 2002-05-01 | Research |  |  |
| 179652 | d004 | 71009 | 1952-07-12 | Keung | Zaccaria | M | 1987-04-25 | 1987-04-25 | 1988-04-27 | Production |  |  |
| 198960 | d004 | 240483 | 1954-02-08 | Geoff | Zultner | M | 1996-02-20 | 1999-03-08 | 9999-01-01 | Production |  |  |


In [62]:
# True when the left join found a matching row in dept_manager
emp_manager['has_been_mgr'] = emp_manager.mgr_to_date.notna()
emp_manager.has_been_mgr.value_counts()

has_been_mgr
False    331579
True         24
Name: count, dtype: int64

In [64]:
observed = pd.crosstab(emp_manager.gender,emp_manager.has_been_mgr)
observed

| gender (rows) / has_been_mgr (columns) | False | True |
| --- | --- | --- |
| F | 132740 | 13 |
| M | 198839 | 11 |


In [66]:
chi2,p,dof,expected = stats.chi2_contingency(observed)

In [68]:
if p < alpha:
    print('We reject the null hypothesis')
else:
    print('We fail to reject the null hypothesis')

We fail to reject the null hypothesis


We do not have evidence of a relationship between an employee's gender and whether they are or have been a manager.
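
One caveat worth checking: because `employees` was built by joining `dept_emp`, an employee who has more than one row in `dept_emp` appears more than once, so the crosstab above counts department assignments rather than people. The sketch below uses a tiny made-up frame (`toy`, with hypothetical values) to show one way of collapsing to one row per employee before building the contingency table.

```python
import pandas as pd

# Hypothetical miniature of emp_manager: employee 2 appears twice
# because they have two department assignments
toy = pd.DataFrame({
    'emp_no': [1, 2, 2, 3, 4],
    'gender': ['F', 'M', 'M', 'F', 'M'],
    'has_been_mgr': [False, True, True, False, False],
})

# Collapse to one row per employee: keep the gender and flag anyone
# who was a manager on at least one of their rows
per_employee = toy.groupby('emp_no').agg(
    gender=('gender', 'first'),
    has_been_mgr=('has_been_mgr', 'any'),
)

observed_rows = pd.crosstab(toy.gender, toy.has_been_mgr)
observed_people = pd.crosstab(per_employee.gender, per_employee.has_been_mgr)

print(observed_rows)
print(observed_people)
```

On the real data the same `groupby` could be applied to `emp_manager` before calling `pd.crosstab`; whether it changes the conclusion would need to be checked by rerunning the test.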

-----

## Correlations

1. Answer with the type of stats test you would use (assume normal distribution):

- Is there a relationship between the length of your arm and the length of your foot?

- Does smoking affect whether or not someone has lung cancer?

- Is gender independent of a person’s blood type?

- Does whether or not a person has a cat or dog affect whether they live in an apartment?

- Does the length of time of the lecture correlate with a student's grade?
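
As a rough guide to the questions above: two continuous variables usually call for a correlation test, while two categorical variables call for a $\chi^2$ test of independence like the ones in the previous section. The sketch below shows the correlation case with `stats.pearsonr` on simulated arm and foot lengths; the data and the variable names (`arm_cm`, `foot_cm`) are invented for illustration.

```python
import numpy as np
from scipy import stats

# Simulated measurements in centimeters (hypothetical, not real data)
rng = np.random.default_rng(0)
arm_cm = rng.normal(loc=65, scale=4, size=100)
foot_cm = 0.3 * arm_cm + rng.normal(loc=5, scale=1, size=100)

alpha = 0.05

# Pearson's r measures the strength of a linear relationship
# between two continuous variables
r, p = stats.pearsonr(arm_cm, foot_cm)

print(f'r = {r:.4f}')
print(f'p = {p:.4f}')

if p < alpha:
    print('We reject the null hypothesis')
else:
    print('We fail to reject the null hypothesis')
```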

2. Use the `telco_churn` data.

- Does tenure correlate with monthly charges?

- Total charges?

- What happens if you control for phone and internet service?
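
One way to read "control for" is to run the same correlation separately inside each combination of phone and internet service. The sketch below does that on simulated data; the column names (`tenure`, `monthly_charges`, `phone_service`, `internet_service`) and category labels are assumptions for illustration and may not match the actual `telco_churn` schema.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulated stand-in for the telco data (hypothetical columns and values)
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    'tenure': rng.integers(1, 73, size=n),
    'phone_service': rng.choice(['Yes', 'No'], size=n),
    'internet_service': rng.choice(['DSL', 'Fiber optic', 'None'], size=n),
})
df['monthly_charges'] = 20 + 0.3 * df.tenure + rng.normal(0, 10, size=n)

# Run Pearson's r within each phone/internet subgroup
rows = []
for (phone, internet), group in df.groupby(['phone_service', 'internet_service']):
    r, p = stats.pearsonr(group.tenure, group.monthly_charges)
    rows.append({
        'phone_service': phone,
        'internet_service': internet,
        'n': len(group),
        'r': r,
        'p': p,
    })

results = pd.DataFrame(rows)
print(results)
```

Comparing the per-group `r` values with the overall correlation shows whether the relationship holds within each service type or is mostly driven by differences between the groups.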

3. Use the `employees` database.

- Is there a relationship between how long an employee has been with the company and their salary?

- Is there a relationship between how long an employee has been with the company and the number of titles they have had?

4. Use the `sleepstudy` data.

- Is there a relationship between days and reaction time?