# Do your work for this exercise in a jupyter notebook named hypothesis_testing.ipynb.

###  For each of the following questions, formulate a null and alternative hypothesis (be as specific as you can be), then give an example of what a true positive, true negative, type I and type II errors would look like. Note that some of the questions are intentionally phrased in a vague way. It is your job to reword these as more precise questions that could be tested.

### Has the network latency gone up since we switched internet service providers?

$H0$: Mean network latency has not increased since switching internet service providers. <br>
$Ha$: Mean network latency has increased since switching internet service providers.
<ul> 
    <li> <b>True Positive:</b> We conclude that latency has increased, and it really has increased since the switch.
    <li><b>True Negative: </b>We conclude that latency has not increased, and it really has not increased since the switch.
<li><b>Type I: </b>We conclude that latency has increased when it actually has not, for example because we happened to measure on an unusually slow day.
<li><b>Type II:</b> We conclude that latency has not increased when it actually has, for example because we happened to measure on an unusually fast day.
</ul>

### Is the website redesign any good?

$H0$: Users report no change in usability since redesigning the website. <br>
$Ha$: Users report an increase in usability since redesigning the website.
<ul> 
    <li> <b>True Positive:</b> We conclude that usability has increased, and it really has increased since the redesign.
    <li><b>True Negative: </b>We conclude that usability has not changed, and it really has not changed since the redesign.
<li><b>Type I: </b>We conclude that usability has increased when it actually has not, for example survey scores rose but calls to tech support also rose after the redesign.
<li><b>Type II:</b> We conclude that usability has not changed when it actually has increased, for example because too few users were surveyed to detect the improvement.
</ul>

### Is our television ad driving more sales?

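$H0$: Mean sales have not increased since the new ad aired. <br>
$Ha$: Mean sales have increased since the new ad aired.
<ul> 
    <li> <b>True Positive:</b> We conclude that sales have increased, and they really have increased since the ad aired.
    <li><b>True Negative: </b> We conclude that sales have not increased, and they really have not increased since the ad aired.
<li><b>Type I: </b>We conclude that sales have increased when they actually have not, for example sales went up for one day but the overall weekly trend is flat or decreasing.
<li><b>Type II:</b> We conclude that sales have not increased when they actually have, for example sales went down for one day but the overall weekly trend is increasing.
</ul>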
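The four outcomes above all come from comparing the test's decision with the (normally unknown) truth. A minimal sketch of that mapping, where `classify_outcome` and its argument names are hypothetical helpers introduced only for illustration:

```python
def classify_outcome(reject_null: bool, effect_is_real: bool) -> str:
    """Label a test decision against the true state of the world."""
    if reject_null and effect_is_real:
        return "true positive"
    if reject_null and not effect_is_real:
        return "type I error"   # false positive
    if not reject_null and effect_is_real:
        return "type II error"  # false negative
    return "true negative"


# Print the full 2x2 grid of decision vs. truth
for reject_null in (True, False):
    for effect_is_real in (True, False):
        label = classify_outcome(reject_null, effect_is_real)
        print(f"reject H0={reject_null}, effect real={effect_is_real} -> {label}")
```

In practice only `reject_null` is observed; the truth is what the test is trying to infer.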

In [78]:
from math import sqrt
from scipy import stats

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pydataset import data

# T-test Exercises

#### Ace Realty wants to determine whether the average time it takes to sell homes is different for its two offices. <br> A sample of 40 sales from office #1 revealed a mean of 90 days and a standard deviation of 15 days. <br> A sample of 50 sales from office #2 revealed a mean of 100 days and a standard deviation of 20 days. <br>Use a .05 level of significance.

In [79]:
# x1 = office #1
# x2 = office #2

# Setting up the 2 offices means
xmean1 = 90
xmean2 = 100

#Setting up the 2 offices sales
sales1 = 40
sales2 = 50

#Setting up the 2 offices standard deviations
sdev1 = 15
sdev2 = 20
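# Degrees of freedom for a pooled-variance (equal variances assumed) two-sample t-test
degf = sales1 + sales2 - 2

# Pooled standard deviation
s_p = sqrt(((sales1 - 1) * sdev1**2 + (sales2 - 1) * sdev2**2) / degf)
s_p

# Unpooled standard error, kept for reference (not used in the pooled test below)
standard_error = se = sqrt(sdev1**2 / sales1 + sdev2**2 / sales2)

# t-statistic using the pooled standard error
t = (xmean1 - xmean2) / (s_p * sqrt(1/sales1 + 1/sales2))
t

# Two-tailed p-value: use |t| so the survival function returns the tail area
p = stats.t(degf).sf(abs(t)) * 2

print(f't = {t:.5f}')
print(f'p = {p:.5f}')

t = -2.62523
p = 0.01021

Since p < .05, we reject the null hypothesis that the two offices take the same average time to sell a home.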
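As a cross-check on the hand computation, SciPy can run the same pooled-variance test directly from summary statistics. This is a sketch using `scipy.stats.ttest_ind_from_stats` with the sample figures from the exercise; `equal_var=True` mirrors the pooled approach used above and is an assumption about the two offices, not something the data confirms.

```python
from scipy import stats

# Summary statistics from the exercise: office #1 vs. office #2
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=90, std1=15, nobs1=40,
    mean2=100, std2=20, nobs2=50,
    equal_var=True,  # pooled variance, matching the manual calculation
)

print(f"t = {t_stat:.5f}")
print(f"p = {p_value:.5f}")  # two-tailed p-value
```

Passing `equal_var=False` instead would give Welch's test, which does not assume equal variances.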


#### Load the mpg dataset and use it to answer the following questions:

In [80]:
mpg = data('mpg')
mpg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 234 entries, 1 to 234
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 21.9+ KB


##### Is there a difference in fuel-efficiency in cars from 2008 vs 1999?

In [81]:
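# Average of city and highway mileage, used as the fuel-efficiency measure.
# Assign directly on mpg so the filtered frames below are guaranteed to carry the column.
mpg['average_mileage'] = (mpg.cty + mpg.hwy) / 2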

In [83]:
year = mpg.year == 2008
two_thousand_eight = mpg[year]

In [88]:
year = mpg.year == 1999
nineteen_ninety_nine = mpg[year]

#### Null Hypothesis and Alternative Hypothesis
$H_0$: The mean fuel efficiency in 1999 is equal to the mean fuel efficiency in 2008. <br>

$H_a$: The mean fuel efficiency in 1999 is not equal to the mean fuel efficiency in 2008.

In [85]:
alpha = .05

In [89]:
# Find # of observations
print(two_thousand_eight.average_mileage.shape)
print(nineteen_ninety_nine.average_mileage.shape)

(117,)
(117,)


In [90]:
# Variance (2 Sample T-Test)
print(two_thousand_eight.average_mileage.var())
print(nineteen_ninety_nine.average_mileage.var())

24.097480106100797
27.122605363984682


In [100]:
# compute t and p values
t, p = stats.ttest_ind(two_thousand_eight.average_mileage, nineteen_ninety_nine.average_mileage)
t, p

(-0.21960177245940962, 0.8263744040323578)

In [92]:
null_hypothesis = "The mean fuel efficiency in 1999 is equal to the mean fuel efficiency in 2008."

# Two-tailed test: only the p-value matters, not the sign of t
if p < alpha:
    print("We reject the null hypothesis that", null_hypothesis)
else:
    print("We fail to reject the null hypothesis. The null hypothesis is that", null_hypothesis)

We fail to reject the null hypothesis. The null hypothesis is that The mean fuel efficiency in 1999 is equal to the mean fuel efficiency in 2008.


##### Are compact cars more fuel-efficient than the average car?

In [94]:
# Split into compact cars and all other cars ("average" here means every non-compact class)

class_type = mpg['class'] == 'compact'
compact = mpg[class_type]

class_type = mpg['class'] != 'compact'
average = mpg[class_type]

#### Null Hypothesis and Alternative Hypothesis
$H_0$: The mean fuel efficiency in compact cars is equal to the mean fuel efficiency of average cars. <br>

$H_a$: The mean fuel efficiency in compact cars is greater than the mean fuel efficiency in average cars.

In [95]:
# finding # of observations
print(compact.average_mileage.shape)
print(average.average_mileage.shape)

(47,)
(187,)


In [96]:
# Variance (2 Sample T-Test)
print(compact.average_mileage.var())
print(average.average_mileage.var())

12.442876965772433
23.652794548904602
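The two variances printed above differ by roughly a factor of two and the groups are of unequal size, so the equal-variance assumption behind the default `ttest_ind` is debatable. A sketch of how Welch's test (`equal_var=False`) could be compared with the pooled test; the two samples here are randomly generated stand-ins, not the mpg data, and their means and spreads are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

# Made-up samples with unequal sizes and unequal spreads (NOT the mpg data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=24, scale=3.5, size=47)
group_b = rng.normal(loc=19, scale=4.9, size=187)

# Default: pooled variance (assumes equal population variances)
pooled = stats.ttest_ind(group_a, group_b)

# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"pooled: t = {pooled.statistic:.3f}, p = {pooled.pvalue:.3g}")
print(f"welch:  t = {welch.statistic:.3f}, p = {welch.pvalue:.3g}")
```

When the two results agree, the choice matters little; when they disagree, Welch's version is usually the safer one to report.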


In [98]:
#compute t and p values
t, p = stats.ttest_ind(compact.average_mileage, average.average_mileage)
t, p

(6.731177612837954, 1.3059121585018135e-10)

In [99]:
null_hypothesis = "The mean fuel efficiency in compact cars is equal to the mean fuel efficiency of average cars."

# One-tailed test (compact > average): halve the two-tailed p-value and require t > 0
if p / 2 < alpha and t > 0:
    print("We reject the null hypothesis that", null_hypothesis)
else:
    print("We fail to reject the null hypothesis. The null hypothesis is that", null_hypothesis)

We reject the null hypothesis that The mean fuel efficiency in compact cars is equal to the mean fuel efficiency of average cars.


##### Do manual cars get better gas mileage than automatic cars?

In [101]:
# dataframe for manual transmissions
manual = mpg.trans.str.replace('(m5)', '').str.replace('(m6)', '').str.replace('(', '').str.replace(')', '')
manual = manual == 'manual'
man_cars = mpg[manual]

In [103]:
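# dataframe for automatic transmissions
# NOTE: only the suffixes listed below are stripped, so any other automatic label is left unmatched and excluded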
auto = mpg.trans.str.replace('(av)', '').str.replace('(l5)', '').str.replace('(i4)', '').str.replace('(s6)', '').str.replace('(', '').str.replace(')', '')
auto = auto == 'auto'
auto_cars = mpg[auto]
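Chained `str.replace` calls only strip the suffixes that are named explicitly, so any `trans` label not listed (for example `auto(l4)`, if present) ends up in neither group. A more robust sketch matches on the prefix instead; the small `trans` Series below is a hypothetical sample, not the real column.

```python
import pandas as pd

# Hypothetical sample of transmission labels in the same "type(code)" format
trans = pd.Series(["auto(l4)", "manual(m5)", "auto(av)", "manual(m6)", "auto(s6)"])

# Match on the prefix so every variant is captured, whatever the code in parentheses
is_manual = trans.str.startswith("manual")
is_auto = trans.str.startswith("auto")

print("manual:", is_manual.sum())
print("auto:  ", is_auto.sum())
```

Applied to the real data, the same masks could be used as `mpg[mpg.trans.str.startswith('auto')]`; the group sizes and test results would then need to be recomputed.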

In [113]:
#finding average mileage between the two
man_mpg = man_cars.average_mileage
auto_mpg = auto_cars.average_mileage

#### Null Hypothesis and Alternative Hypothesis
$H_0$: The mean fuel efficiency in manual cars is equal to the mean fuel efficiency of automatic cars. <br>

$H_a$: The mean fuel efficiency in manual cars is greater than the mean fuel efficiency in automatic cars.

In [114]:
# finding # of observations

print(man_mpg.shape)
print(auto_mpg.shape)

(77,)
(60,)


In [115]:
# Variance (2 Sample T-Test)
print(man_mpg.var())
print(auto_mpg.var())

26.635167464114826
22.347175141242943


In [116]:
#compute t and p values
t, p = stats.ttest_ind(man_mpg,auto_mpg)
t, p

(3.55231318025847, 0.000526538569024219)

In [118]:
null_hypothesis = "the mean fuel efficiency in manual cars is equal to the mean fuel efficiency of automatic cars."

# One-tailed test (manual > automatic): halve the two-tailed p-value and require t > 0
if p / 2 < alpha and t > 0:
    print("We reject the null hypothesis that", null_hypothesis)
else:
    print("We fail to reject the null hypothesis. The null hypothesis is that", null_hypothesis)

We reject the null hypothesis that the mean fuel efficiency in manual cars is equal to the mean fuel efficiency of automatic cars.


# Correlation Exercises

### Use the telco_churn data. 

In [130]:
telco = pd.read_csv("telco.csv")
telco.head()

|  | customer_id | gender | senior_citizen | partner | dependents | tenure | phone_service | multiple_lines | internet_service | online_security | ... | contract_type | paperless_billing | payment_method | monthly_charges | total_charges | churn | phone_type | internet_type | phone_or_internet_or_both | is_churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No | 0 | 1 | internet only | False |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | One year | No | Mailed check | 56.95 | 1889.5 | No | 1 | 1 | both phone and internet | False |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes | 1 | 1 | both phone and internet | True |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No | 0 | 1 | internet only | False |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes | 1 | 2 | both phone and internet | True |


#### Does tenure correlate with monthly charges?

#### Null Hypothesis and Alternative Hypothesis
$H_0$: There is no linear correlation between tenure and monthly charges. <br>

$H_a$: There is a linear correlation between tenure and monthly charges.

In [133]:
n = telco.shape[0]     # number of observations
degf = n - 2        # degrees of freedom for a Pearson correlation test
conf_level = .95    # desired confidence level
α = 1 - conf_level

In [138]:
x = telco['monthly_charges']
y = telco['tenure']

In [139]:
corr, p = stats.pearsonr(x, y)
corr, p

(0.24789985628615002, 4.0940449915016345e-99)

In [136]:
p < α

True

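The p value is less than alpha, therefore I reject the null hypothesis of no linear correlation: tenure and monthly charges are correlated, although the relationship is weak (r ≈ 0.25).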

#### Total charges? 

In [143]:
x = telco.total_charges
y = telco.tenure

In [144]:
corr, p = stats.pearsonr(x, y)
corr, p

ValueError: array must not contain infs or NaNs
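The error above indicates that at least one of the two columns contains missing values, which `stats.pearsonr` does not accept. A minimal sketch of one way to handle this, dropping incomplete rows before computing the correlation; the `toy` DataFrame and its values are made up for illustration and are not the telco data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up data with one missing total_charges value (NOT the telco data)
toy = pd.DataFrame({
    "tenure": [1, 12, 24, 36, 0, 60],
    "total_charges": [30.0, 600.0, 1500.0, 2100.0, np.nan, 4200.0],
})

# Drop rows where either variable is missing, so both series stay aligned
clean = toy.dropna(subset=["tenure", "total_charges"])

corr, p = stats.pearsonr(clean.total_charges, clean.tenure)
print(f"rows kept: {len(clean)} of {len(toy)}")
print(f"r = {corr:.3f}, p = {p:.3g}")
```

Dropping rows is only one option; it is worth checking how many rows are removed and why they are missing before relying on the result.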

#### What happens if you control for phone and internet service?
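One common reading of "controlling for" a categorical variable is to compute the correlation separately within each category. The sketch below does this with `groupby` on randomly generated data; the column names follow the telco preview above, but the values (including the `"phone only"` label and the simulated relationship) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Randomly generated stand-in data (NOT the telco data)
rng = np.random.default_rng(1)
n = 300
toy = pd.DataFrame({
    "phone_or_internet_or_both": rng.choice(
        ["phone only", "internet only", "both phone and internet"], size=n
    ),
    "tenure": rng.integers(1, 73, size=n),
})
toy["monthly_charges"] = 20 + 0.3 * toy.tenure + rng.normal(0, 10, size=n)

# Pearson correlation of tenure vs. monthly charges within each service group
results = {}
for service, group in toy.groupby("phone_or_internet_or_both"):
    corr, p = stats.pearsonr(group.tenure, group.monthly_charges)
    results[service] = (corr, p)
    print(f"{service}: r = {corr:.3f}, p = {p:.3g}, n = {len(group)}")
```

If the within-group correlations differ noticeably from the overall one, the service type is likely influencing the relationship.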

### Use the employees database.

##### Is there a relationship between how long an employee has been with the company and their salary? <br>

##### Is there a relationship between how long an employee has been with the company and the number of titles they have had?<br>

### Use the sleepstudy data. Is there a relationship between days and reaction time?

# Chi$^2$ Exercises

### 1. Use the following contingency table to help answer the question of whether using a macbook and being a codeup student are independent of each other.<br><br>

| Computer Status        	| Is a Codeup Student 	| Is not a Codeup Student 	|
|------------------------	|---------------------	|-------------------------	|
| Uses a macbook         	| 49                  	| 20                      	|
| Does not use a macbook 	| 1                   	| 30                      	|

In [2]:
index = ['Uses a Macbook', "Doesn't Use A Macbook"]
columns = ['Is a Codeup Student', 'Not Codeup Student']

observed = pd.DataFrame([[49, 20], [1, 30]], index=index, columns=columns)
observed

|  | Is a Codeup Student | Not Codeup Student |
|---|---|---|
| Uses a Macbook | 49 | 20 |
| Doesn't Use A Macbook | 1 | 30 |


#### Null Hypothesis and Alternative Hypothesis
$H_0$: Macbook use and being a Codeup student are independent. <br>

$H_a$: Macbook use and being a Codeup student are dependent

In [3]:
# set the alpha value
alpha = .05
# make the chi2, p value, degree of freedom and expected value
chi2, p, degf, expected = stats.chi2_contingency(observed)

In [4]:
null_hypothesis = "Macbook use and being a Codeup student are independent."

if p < alpha:
    print("We reject the hypothesis that", null_hypothesis)
else:
    print("We fail to reject the null hypothesis. The null hypothesis is that", null_hypothesis)

print("The p value is", p,".", "The alpha value is", alpha, ".")

We reject the hypothesis that Macbook use and being a Codeup student are independent.
The p value is 1.4116760526193828e-09 . The alpha value is 0.05 .
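Besides the p-value, `chi2_contingency` also returns the counts that would be expected if the two variables were independent, which is useful for seeing where the observed table departs from independence. A short sketch using the same contingency table; `expected_df` is a name introduced here for illustration.

```python
import pandas as pd
from scipy import stats

observed = pd.DataFrame(
    [[49, 20], [1, 30]],
    index=["Uses a Macbook", "Doesn't Use A Macbook"],
    columns=["Is a Codeup Student", "Not Codeup Student"],
)

chi2, p, degf, expected = stats.chi2_contingency(observed)

# Wrap the expected counts in a DataFrame so they line up with the observed table
expected_df = pd.DataFrame(expected, index=observed.index, columns=observed.columns)

print(expected_df)
print(f"chi2 = {chi2:.4f}, degf = {degf}, smallest expected count = {expected.min():.1f}")
```

For a 2x2 table SciPy applies Yates' continuity correction by default; a common rule of thumb is that every expected count should be at least 5 for the chi-square approximation to be reliable.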


### 2. Choose another 2 categorical variables from the mpg dataset and perform a chi$^2$ contingency table test with them. Be sure to state your null and alternative hypotheses.

In [5]:
mpg = data('mpg')
mpg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 234 entries, 1 to 234
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 21.9+ KB


In [6]:
mpg.nunique()

manufacturer    15
model           38
displ           35
year             2
cyl              4
trans           10
drv              3
cty             21
hwy             27
fl               5
class            7
dtype: int64

In [7]:
observed = pd.crosstab(mpg.cyl, mpg.year)
observed

| cyl | 1999 | 2008 |
|---|---|---|
| 4 | 45 | 36 |
| 5 | 0 | 4 |
| 6 | 45 | 34 |
| 8 | 27 | 43 |

#### Null Hypothesis and Alternative Hypothesis
$H_0$: Cylinder and Year are independent. <br>
$H_a$: Cylinder and Year are dependent.

In [8]:
# setting chi2, p, degf, and expected values
chi2, p, degf, expected = stats.chi2_contingency(observed)

In [9]:
null_hypothesis = "Cylinder and year are independent."

if p < alpha:
    print("We reject the hypothesis that", null_hypothesis)
else:
    print("We fail to reject the null hypothesis. The null hypothesis is that", null_hypothesis)

print("The p value is", p,".", "The alpha value is", alpha, ".")

We reject the hypothesis that Cylinder and year are independent.
The p value is 0.01702768537665195 . The alpha value is 0.05 .


### 3. Use the data from the employees database to answer these questions:

In [10]:
# Connect to employees database
#defines function to create a sql url using personal credentials
from env import host, user, password

def get_db_url(database, user=user, host=host, password=password): 
    url = f'mysql+pymysql://{user}:{password}@{host}/{database}'
    return url

url = get_db_url('employees')

In [16]:
employees_sql_query = '''
                        SELECT  e.emp_no, e.gender, d.dept_name
                        FROM employees AS e
                        JOIN dept_emp AS de ON de.emp_no = e.emp_no
                        JOIN departments AS d ON d.dept_no = de.dept_no
                        WHERE de.to_date > now()              
                        '''
employees = pd.read_sql(employees_sql_query, get_db_url('employees'))
employees.head()

|  | emp_no | gender | dept_name |
|---|---|---|---|
| 0 | 10038 | M | Customer Service |
| 1 | 10049 | F | Customer Service |
| 2 | 10060 | M | Customer Service |
| 3 | 10088 | F | Customer Service |
| 4 | 10112 | F | Customer Service |


#### Is an employee's gender independent of whether an employee works in sales or marketing? (only look at current employees)

In [25]:
cross_tab_gender = pd.crosstab(employees.gender,employees.dept_name )
cross_tab_gender

| gender | Customer Service | Development | Finance | Human Resources | Marketing | Production | Quality Management | Research | Sales |
|---|---|---|---|---|---|---|---|---|---|
| F | 7007 | 24533 | 5014 | 5147 | 5864 | 21393 | 5872 | 6181 | 14999 |
| M | 10562 | 36853 | 7423 | 7751 | 8978 | 31911 | 8674 | 9260 | 22702 |


In [26]:
#Selecting Sales employees
employee_sales = employees.dept_name == "Sales"
sales = employees[employee_sales]

In [27]:
#Selecting marketing employees
employee_marketing = employees.dept_name == "Marketing"
marketing = employees[employee_marketing]

In [28]:
# Concat the dataframes together
emp_ms = pd.concat([sales, marketing])

In [29]:
#creating observed values from new concat df
observed = pd.crosstab(emp_ms.gender,emp_ms.dept_name )
observed

| gender | Marketing | Sales |
|---|---|---|
| F | 5864 | 14999 |
| M | 8978 | 22702 |


#### Null Hypothesis and Alternative Hypothesis
$H_0$: Gender and department are independent. <br>
$H_a$: Gender and department are dependent.

In [32]:
# setting chi2, p, degf, and expected values
chi2, p, degf, expected = stats.chi2_contingency(observed)

In [36]:
null_hypothesis = "Gender and department are independent."

if p < alpha:
    print("We reject the hypothesis that", null_hypothesis)
else:
    print("We fail to reject the null hypothesis. The null hypothesis is that", null_hypothesis)

print("The p value is", p,".", "The alpha value is", alpha, ".")

We fail to reject the null hypothesis. The null hypothesis is that Gender and department are independent.
The p value is 0.5691938610810126 . The alpha value is 0.05 .


#### Is an employee's gender independent of whether or not they are or have been a manager?

In [45]:
employees_manager_query = '''
SELECT gender, COUNT(gender)
FROM dept_manager
JOIN employees
ON dept_manager.emp_no = employees.emp_no
GROUP BY gender;

'''
manager = pd.read_sql(employees_manager_query, url)

In [70]:
manager
##The query above shows the count of each gender. 
#In the next step, I am going to omit the gender column to run our chi2 test on.

|  | gender | COUNT(gender) |
|---|---|---|
| 0 | M | 11 |
| 1 | F | 13 |


In [63]:
manager_query = '''
SELECT COUNT(gender)
FROM dept_manager
JOIN employees
ON dept_manager.emp_no = employees.emp_no
GROUP BY gender;

'''

In [64]:
manager_observed= pd.read_sql(manager_query, url)

In [66]:
manager_observed

|  | COUNT(gender) |
|---|---|
| 0 | 11 |
| 1 | 13 |


In [67]:
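# NOTE: manager_observed has a single column (manager counts by gender only), so this test has
# 0 degrees of freedom and p is always 1.0. A test of independence needs a 2x2 table that also
# counts employees who have never been a manager.
chi2, p, degf, expected = stats.chi2_contingency(manager_observed)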

#### Null Hypothesis and Alternative Hypothesis
$H_0$: Gender and being a manager are independent. <br>
$H_a$: Gender and being a manager are dependent.

In [72]:
null_hypothesis = "Gender and being a manager are independent."

if p < alpha:
    print("We reject the hypothesis that", null_hypothesis)
else:
    print("We fail to reject the null hypothesis. The null hypothesis is that", null_hypothesis)

print("The p value is", p,".", "The alpha value is", alpha, ".")

We fail to reject the null hypothesis. The null hypothesis is that Gender and being a manager are independent.
The p value is 1.0 . The alpha value is 0.05 .
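A chi-square test of independence needs both categories of each variable, so the question above calls for a 2x2 table of gender against manager status (manager vs. never a manager). A sketch of how such a table could be built with `pd.crosstab`; the `toy` DataFrame, its `is_manager` column, and all counts are hypothetical and do not come from the employees database.

```python
import pandas as pd
from scipy import stats

# Hypothetical stand-in for a query returning one row per employee (NOT real counts)
toy = pd.DataFrame({
    "gender": ["M"] * 600 + ["F"] * 400,
    "is_manager": [True] * 60 + [False] * 540 + [True] * 45 + [False] * 355,
})

# 2x2 contingency table: gender vs. whether the employee is or has been a manager
manager_table = pd.crosstab(toy.gender, toy.is_manager)
chi2, p, degf, expected = stats.chi2_contingency(manager_table)

print(manager_table)
print(f"p = {p:.4f}, degf = {degf}")
```

With the real database, the `is_manager` flag could come from a left join of `employees` onto `dept_manager`, marking employees with no matching manager row as `False`.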


