# Do your work for this exercise in a Jupyter notebook named `hypothesis_testing.ipynb`.

## For each of the following questions, formulate a null and alternative hypothesis (be as specific as you can be), then give an example of what a true positive, true negative, type I and type II errors would look like. Note that some of the questions are intentionally phrased in a vague way. It is your job to reword these as more precise questions that could be tested.

### Has the network latency gone up since we switched internet service providers?

$H_0$: Average network latency has not increased since switching internet service providers. <br>
$H_a$: Average network latency has increased since switching internet service providers.
<ul>
    <li><b>True Positive:</b> We reject $H_0$ and conclude that latency has increased, and latency really has increased since the switch.</li>
    <li><b>True Negative:</b> We fail to reject $H_0$, and latency really has not increased since the switch.</li>
    <li><b>Type I:</b> We conclude that latency has increased, but it actually has not; for example, we happened to measure latency on an unusually slow day.</li>
    <li><b>Type II:</b> We conclude that latency has not increased, but it actually has; for example, we happened to measure latency on an unusually fast day.</li>
</ul>
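One way to make the reworded latency question testable is a one-sided permutation test on latency samples taken before and after the switch. The sketch below is only an illustration: the latency values are simulated placeholders (not real measurements), and the permutation test is a stand-in chosen because it needs only the standard library, not necessarily the test this exercise intends.

```python
import random

random.seed(0)

# HYPOTHETICAL latency samples in milliseconds (simulated, not real measurements)
before = [random.gauss(50, 5) for _ in range(40)]
after = [random.gauss(58, 5) for _ in range(40)]


def mean(values):
    return sum(values) / len(values)


# Test statistic: how much higher the average latency is after the switch
observed_diff = mean(after) - mean(before)

# Under H0 the "before"/"after" labels are interchangeable, so shuffle them
pooled = before + after
n_after = len(after)
n_permutations = 5000
at_least_as_extreme = 0
for _ in range(n_permutations):
    random.shuffle(pooled)
    diff = mean(pooled[:n_after]) - mean(pooled[n_after:])
    if diff >= observed_diff:
        at_least_as_extreme += 1

# One-sided p-value, with +1 so the estimate is never exactly zero
p_value = (at_least_as_extreme + 1) / (n_permutations + 1)

alpha = 0.05
if p_value < alpha:
    print("Reject H0: latency appears to have increased.")
else:
    print("Fail to reject H0: no evidence that latency has increased.")
```

Rejecting $H_0$ here when the true means are equal would be a Type I error; failing to reject when the mean really did rise would be a Type II error.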

### Is the website redesign any good?

$H_0$: Users report no improvement in usability since the website redesign. <br>
$H_a$: Users report an improvement in usability since the website redesign.
<ul>
    <li><b>True Positive:</b> We conclude that usability has improved, and it really has improved since the redesign.</li>
    <li><b>True Negative:</b> We conclude that usability has not improved, and it really has not improved since the redesign.</li>
    <li><b>Type I:</b> We conclude that usability has improved, but there has actually been no change in usability.</li>
    <li><b>Type II:</b> We conclude that usability has not improved (for example, because calls to tech support rose after the redesign), but usability actually has improved.</li>
</ul>

### Is our television ad driving more sales?

$H_0$: Sales have not increased since the new ad aired. <br>
$H_a$: Sales have increased since the new ad aired.
<ul>
    <li><b>True Positive:</b> We conclude that sales have increased, and sales really have increased since the ad aired.</li>
    <li><b>True Negative:</b> We conclude that sales have not increased, and sales really have not increased since the ad aired.</li>
    <li><b>Type I:</b> We conclude that sales have increased because they went up for one day, but the overall weekly sales trend has not increased.</li>
    <li><b>Type II:</b> We conclude that sales have not increased because they went down for one day, but the overall weekly sales trend has increased.</li>
</ul>

In [1]:
import pandas as pd
from scipy import stats
from pydataset import data

# Chi$^2$ Exercises

### 1. Use the following contingency table to help answer the question of whether using a macbook and being a codeup student are independent of each other.<br><br>

| Computer Status        	| Is a Codeup Student 	| Is not a Codeup Student 	|
|------------------------	|---------------------	|-------------------------	|
| Uses a macbook         	| 49                  	| 20                      	|
| Does not use a macbook 	| 1                   	| 30                      	|

In [2]:
index = ['Uses a Macbook', "Doesn't Use A Macbook"]
columns = ['Is a Codeup Student', 'Not Codeup Student']

observed = pd.DataFrame([[49, 20], [1, 30]], index=index, columns=columns)
observed

|                       | Is a Codeup Student | Not Codeup Student |
|-----------------------|---------------------|--------------------|
| Uses a Macbook        | 49                  | 20                 |
| Doesn't Use A Macbook | 1                   | 30                 |


#### Null Hypothesis and Alternative Hypothesis
$H_0$: Macbook use and being a Codeup student are independent. <br>

$H_a$: Macbook use and being a Codeup student are dependent.

In [3]:
# Set the significance level
alpha = .05
# Run the chi-squared test: returns the test statistic, p-value, degrees of freedom, and expected counts
chi2, p, degf, expected = stats.chi2_contingency(observed)

In [4]:
null_hypothesis = "Macbook use and being a Codeup student are independent."

if p < alpha:
    print("We reject the hypothesis that", null_hypothesis)
else:
    print("We fail to reject the null hypothesis. The null hypothesis is that", null_hypothesis)

print("The p value is", p,".", "The alpha value is", alpha, ".")

We reject the hypothesis that Macbook use and being a Codeup student are independent.
The p value is 1.4116760526193828e-09 . The alpha value is 0.05 .
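To see where this p-value comes from, the calculation can be sketched by hand. The block below assumes that `stats.chi2_contingency` applied its default Yates continuity correction, which it does for 2x2 tables unless `correction=False` is passed. It uses only the standard library and should land close to the p-value reported above; the variable names are illustrative.

```python
import math

# Observed counts from the contingency table above
observed = [[49, 20], [1, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2_stat = 0.0
for obs_row, exp_row in zip(observed, expected):
    for obs, exp in zip(obs_row, exp_row):
        # Yates' correction: shrink each |observed - expected| by 0.5, never below 0
        diff = max(abs(obs - exp) - 0.5, 0.0)
        chi2_stat += diff ** 2 / exp

# With 1 degree of freedom the chi-squared survival function has a closed form
p_value = math.erfc(math.sqrt(chi2_stat / 2))

print("expected counts:", expected)
print("chi2 statistic:", chi2_stat)
print("p-value:", p_value)
```

The closed form with `math.erfc` only holds for one degree of freedom, so this sketch does not generalize to larger tables.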


### 2. Choose another 2 categorical variables from the mpg dataset and perform a chi$^2$ contingency table test with them. Be sure to state your null and alternative hypotheses.

In [5]:
mpg = data('mpg')
mpg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 234 entries, 1 to 234
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 21.9+ KB


In [6]:
mpg.nunique()

manufacturer    15
model           38
displ           35
year             2
cyl              4
trans           10
drv              3
cty             21
hwy             27
fl               5
class            7
dtype: int64

In [7]:
observed = pd.crosstab(mpg.cyl, mpg.year)
observed

| cyl / year | 1999 | 2008 |
|------------|------|------|
| 4          | 45   | 36   |
| 5          | 0    | 4    |
| 6          | 45   | 34   |
| 8          | 27   | 43   |


#### Null Hypothesis and Alternative Hypothesis
$H_0$: Cylinder and Year are independent. <br>
$H_a$: Cylinder and Year are dependent.

In [8]:
# Compute the chi2 statistic, p-value, degrees of freedom, and expected counts
chi2, p, degf, expected = stats.chi2_contingency(observed)

In [9]:
null_hypothesis = "Cylinder and year are independent."

if p < alpha:
    print("We reject the hypothesis that", null_hypothesis)
else:
    print("We fail to reject the null hypothesis. The null hypothesis is that", null_hypothesis)

print("The p value is", p,".", "The alpha value is", alpha, ".")

We reject the hypothesis that Cylinder and year are independent.
The p value is 0.01702768537665195 . The alpha value is 0.05 .
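One caveat with this result: a common rule of thumb says the chi-squared approximation is reliable only when every expected count is at least 5, and the 5-cylinder row is very sparse. The sketch below recomputes the expected counts by hand from the crosstab above and flags the cells that fall under that threshold. The threshold is a convention, not a hard rule, and the variable names are illustrative.

```python
# Observed counts copied from the cyl-by-year crosstab above
years = [1999, 2008]
observed = {
    4: [45, 36],
    5: [0, 4],
    6: [45, 34],
    8: [27, 43],
}

row_totals = {cyl: sum(counts) for cyl, counts in observed.items()}
col_totals = [sum(counts[j] for counts in observed.values()) for j in range(len(years))]
grand_total = sum(row_totals.values())

# Expected count under independence: row total * column total / grand total
expected = {
    cyl: [row_totals[cyl] * col_total / grand_total for col_total in col_totals]
    for cyl in observed
}

# Flag cells whose expected count falls below the rule-of-thumb threshold
min_expected = 5
small_cells = [
    (cyl, year)
    for cyl, row in expected.items()
    for year, value in zip(years, row)
    if value < min_expected
]

print("expected counts:", expected)
print("cells with expected count below 5:", small_cells)
```

If the flagged cells are a concern, one option is to drop or merge the 5-cylinder category and rerun the test.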


### 3. Use the data from the employees database to answer these questions:

In [10]:
# Connect to the employees database
# Define a function that builds a SQL connection URL from personal credentials
from env import host, user, password

def get_db_url(database, user=user, host=host, password=password): 
    url = f'mysql+pymysql://{user}:{password}@{host}/{database}'
    return url

url = get_db_url('employees')

In [16]:
employees_sql_query = '''
                        SELECT  e.emp_no, e.gender, d.dept_name
                        FROM employees AS e
                        JOIN dept_emp AS de ON de.emp_no = e.emp_no
                        JOIN departments AS d ON d.dept_no = de.dept_no
                        WHERE de.to_date > now()              
                        '''
employees = pd.read_sql(employees_sql_query, get_db_url('employees'))
employees.head()

|   | emp_no | gender | dept_name        |
|---|--------|--------|------------------|
| 0 | 10038  | M      | Customer Service |
| 1 | 10049  | F      | Customer Service |
| 2 | 10060  | M      | Customer Service |
| 3 | 10088  | F      | Customer Service |
| 4 | 10112  | F      | Customer Service |


#### Is an employee's gender independent of whether an employee works in sales or marketing? (only look at current employees)

In [25]:
cross_tab_gender = pd.crosstab(employees.gender, employees.dept_name)
cross_tab_gender

| gender / dept_name | Customer Service | Development | Finance | Human Resources | Marketing | Production | Quality Management | Research | Sales |
|--------------------|------------------|-------------|---------|-----------------|-----------|------------|--------------------|----------|-------|
| F                  | 7007             | 24533       | 5014    | 5147            | 5864      | 21393      | 5872               | 6181     | 14999 |
| M                  | 10562            | 36853       | 7423    | 7751            | 8978      | 31911      | 8674               | 9260     | 22702 |


In [26]:
#Selecting Sales employees
employee_sales = employees.dept_name == "Sales"
sales = employees[employee_sales]

In [27]:
#Selecting marketing employees
employee_marketing = employees.dept_name == "Marketing"
marketing = employees[employee_marketing]

In [28]:
# Concat the dataframes together
emp_ms = pd.concat([sales, marketing])

In [29]:
# Create the observed values from the combined DataFrame
observed = pd.crosstab(emp_ms.gender, emp_ms.dept_name)
observed

| gender / dept_name | Marketing | Sales |
|--------------------|-----------|-------|
| F                  | 5864      | 14999 |
| M                  | 8978      | 22702 |


#### Null Hypothesis and Alternative Hypothesis
$H_0$: Gender and department are independent. <br>
$H_a$: Gender and department are dependent.

In [32]:
# Compute the chi2 statistic, p-value, degrees of freedom, and expected counts
chi2, p, degf, expected = stats.chi2_contingency(observed)

In [36]:
null_hypothesis = "Gender and department are independent."

if p < alpha:
    print("We reject the hypothesis that", null_hypothesis)
else:
    print("We fail to reject the null hypothesis. The null hypothesis is that", null_hypothesis)

print("The p value is", p,".", "The alpha value is", alpha, ".")

We fail to reject the null hypothesis. The null hypothesis is that Gender and department are independent.
The p value is 0.5691938610810126 . The alpha value is 0.05 .
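A quick way to see why the test fails to reject is to compare the share of women in each department directly. The sketch below uses only the counts from the Marketing/Sales crosstab above; the variable names are illustrative.

```python
# Observed counts copied from the Marketing/Sales crosstab above
observed_counts = {
    "Marketing": {"F": 5864, "M": 8978},
    "Sales": {"F": 14999, "M": 22702},
}

share_female = {}
for dept, counts in observed_counts.items():
    total = counts["F"] + counts["M"]
    share_female[dept] = counts["F"] / total
    print(f"{dept}: {share_female[dept]:.2%} female out of {total} employees")

# Absolute gap between the two departments' shares
gap = abs(share_female["Marketing"] - share_female["Sales"])
print(f"Difference in share: {gap:.2%}")
```

The two shares differ by well under one percentage point, which is consistent with the large p-value.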


#### Is an employee's gender independent of whether or not they are or have been a manager?

In [45]:
employees_manager_query = '''
SELECT gender, COUNT(gender)
FROM dept_manager
JOIN employees
ON dept_manager.emp_no = employees.emp_no
GROUP BY gender;

'''
manager = pd.read_sql(employees_manager_query, url)

In [70]:
manager
# The query above returns the number of managers of each gender.
# In the next step, I drop the gender column so that only the counts are passed to the chi2 test.

|   | gender | COUNT(gender) |
|---|--------|---------------|
| 0 | M      | 11            |
| 1 | F      | 13            |


In [63]:
manager_query = '''
SELECT COUNT(gender)
FROM dept_manager
JOIN employees
ON dept_manager.emp_no = employees.emp_no
GROUP BY gender;

'''

In [64]:
manager_observed= pd.read_sql(manager_query, url)

In [66]:
manager_observed

|   | COUNT(gender) |
|---|---------------|
| 0 | 11            |
| 1 | 13            |


In [67]:
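# Note: a single column of counts has 0 degrees of freedom, so this call cannot detect dependence;
# the table also needs the counts of employees who have never been managers.
chi2, p, degf, expected = stats.chi2_contingency(manager_observed)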

#### Null Hypothesis and Alternative Hypothesis
$H_0$: Gender and being a manager are independent. <br>
$H_a$: Gender and being a manager are dependent.
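Testing these hypotheses needs a 2x2 table of gender against manager status, not just the manager counts. The sketch below shows one possible shape for that table. The manager counts come from the query output above, but `gender_totals` is an ASSUMPTION: it uses the row sums of the current-employee department crosstab as a stand-in, whereas the real analysis should count every employee in the `employees` table by gender. The resulting p-value is therefore illustrative only.

```python
from scipy import stats

# Manager counts by gender, taken from the query output above
managers = {"M": 11, "F": 13}

# ASSUMPTION: stand-in gender totals (row sums of the current-employee crosstab);
# replace with counts of all employees by gender from the database
gender_totals = {"M": 144114, "F": 96010}

# A single column of counts leaves nothing to compare against: 0 degrees of freedom
_, p_one_col, dof_one_col, _ = stats.chi2_contingency([[managers["M"]], [managers["F"]]])
print("one-column table -> dof:", dof_one_col, "p:", p_one_col)

# 2x2 table: rows are gender, columns are (manager, never a manager)
observed = [[managers[g], gender_totals[g] - managers[g]] for g in ("M", "F")]
chi2, p, degf, expected = stats.chi2_contingency(observed)
print("2x2 table -> dof:", degf, "p:", p)
```

With the full 2x2 table in place, the same `if p < alpha` comparison used in the earlier exercises applies.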