# Exercises: Hypothesis Testing - COMPARISON OF MEANS (T-TEST)

<a href = "https://ds.codeup.com/stats/compare-means/#exercises">![image.png](attachment:e10857fb-23b9-4838-988f-151d3787df91.png)</a>


### Tests
_______________________________________________________________________

|Goal|$H_{0}$|Data Needed|Parametric Test|Assumptions*|Non-parametric Test|  
|---|---|---|---|---|---|  
|Compare observed mean to theoretical one|$\mu_{obs} = \mu_{th}$|array-like of observed values & float of theoretical|One sample t-test: scipy.stats.ttest_1samp|Normally Distributed\*\*|One sample Wilcoxon signed rank test|   
|Compare two observed means (independent samples)|$\mu_{a} = \mu_{b}$|2 array-like samples|Independent t-test (or 2-sample): scipy.stats.ttest_ind|Independent, Normally Distributed\*\*, Equal Variances\*\*\*|Mann-Whitney's test|   
|Compare several observed means (independent samples)|$\mu_{a} = \mu_{b} = \mu_{n}$|n array-like samples|ANOVA: scipy.stats.f_oneway|Independent, Normally Distributed\*\*, Equal Variances|Kruskal-Wallis test|   

\*If assumptions can't be met, the equivalent non-parametric test can be used.   
\*\*The normal-distribution assumption can be met by having a large enough sample (due to the Central Limit Theorem), or the data can be transformed using a Gaussian scaler.   
\*\*\*The `equal_var` argument of `stats.ttest_ind()` can be set to `False` to accommodate unequal variances (Welch's t-test). 
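
As a rough illustration of the first footnote, the sketch below runs each non-parametric counterpart from the table on made-up, right-skewed data. The arrays, the seed, and the reference value `mu_th` are assumptions for demonstration only. The one-sample Wilcoxon test is applied to the differences from the reference value, so it checks the center of those differences rather than the mean itself.

```python
import numpy as np
from scipy import stats

# Made-up, right-skewed samples (assumed data, not from the exercises)
rng = np.random.default_rng(seed=123)
a = rng.exponential(scale=10, size=25)
b = rng.exponential(scale=14, size=25)
c = rng.exponential(scale=18, size=25)
mu_th = 10  # hypothetical reference value

# One sample: Wilcoxon signed-rank test on the differences from mu_th
w_stat, p_wilcoxon = stats.wilcoxon(a - mu_th)

# Two independent samples: Mann-Whitney U test
u_stat, p_mwu = stats.mannwhitneyu(a, b, alternative='two-sided')

# Several independent samples: Kruskal-Wallis H test
h_stat, p_kruskal = stats.kruskal(a, b, c)

print(p_wilcoxon, p_mwu, p_kruskal)
```
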
<hr style="border:2px solid gray">


In [None]:
import numpy as np
import seaborn as sns
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt

from math import sqrt
from pydataset import data
import statistics

<hr style="border:2px solid gray">

<hr style="border:2px solid black">
<hr style="border:2px solid black">

#### #1. Answer with the type of test you would use (assume normal distribution):

a. Is there a difference in grades of students on the second floor compared to grades of all students?

b. Are adults who drink milk taller than adults who don't drink milk?

c. Is the price of gas higher in Texas or in New Mexico?

d. Are there differences in stress levels between students who take data science vs students who take web development vs students who take cloud academy?

<hr style="border:2px solid black">
<hr style="border:2px solid black">

<div class="alert alert-block alert-success">

In order to answer questions 2 & 3, we will break down statistical testing.
<br>
<br>
<b>Step-by-Step</b>
1. Plot distribution
2. Set Hypothesis
3. Set Alpha
4. Verify 3 Assumptions
5. Compute Test Statistics
6. Decide
    
</div> 

#### #2. Ace Realty wants to determine whether the average time it takes to sell homes is different for its two offices

<div class="alert alert-block alert-info">

<b>Let's break it down:</b> 
    <br>

</div>

<hr style="border:1px solid black">

##### Step 1: Plot Distribution

In [None]:
#stats.norm(mean, std).rvs(# samples)


In [None]:
#let's get the average time to sell homes of office 1


In [None]:
#let's get the average time to sell homes of office 2
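
One way to approach Step 1 is to simulate each office from summary statistics, following the `stats.norm(mean, std).rvs(n)` pattern noted in the first cell of this step. The means, standard deviations, and sample sizes below are assumed values chosen for illustration, since this notebook does not state them.

```python
import numpy as np
from scipy import stats

# Assumed summary statistics (hypothetical, for illustration only)
mean_1, std_1, n_1 = 90, 15, 40    # office 1: days to sell
mean_2, std_2, n_2 = 100, 20, 50   # office 2: days to sell

# Freeze a normal distribution per office, then draw simulated sales
office_1 = stats.norm(mean_1, std_1).rvs(n_1, random_state=1)
office_2 = stats.norm(mean_2, std_2).rvs(n_2, random_state=2)

# Sample means of the simulated data (close to, not equal to, the inputs)
print(office_1.mean(), office_2.mean())
```

Each array can then be passed to `plt.hist` or `sns.histplot` to view its shape.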


##### Step 2: Set Hypothesis

$H_0$: 


$H_a$: 

##### Step 3: Set Alpha

In [None]:
#we can use our typical alpha for this example
α = 

##### Step 4: Verify Assumptions

<div class="alert alert-block alert-info">
<b>We need to ask ourselves:</b> 

1. Are the samples independent? 
    - 
2. Is there normality?      
    - 
    - 
3. Is there equal variance?  
    - 
</div>    

#2. Is there normality?

In [None]:
# find the sample size: it must be more than 30 to meet the assumption


#3 Is there equal variance?

In [None]:
#this shows the variances are not the same, so we must set equal_var to False


In [None]:
#we can also do a levene test
stat, p_val = 

In [None]:
if p_val < 0.05:
    print('We can reject H0 ==> unequal variances')
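
A minimal sketch of the Levene check, using simulated samples whose spreads differ on purpose. The data, seed, and variable names are assumptions; the point is only that the Levene p-value can drive the `equal_var` argument passed to `stats.ttest_ind`.

```python
import numpy as np
from scipy import stats

# Simulated samples with deliberately different spreads (assumed data)
rng = np.random.default_rng(seed=7)
sample_a = rng.normal(loc=90, scale=5, size=200)
sample_b = rng.normal(loc=90, scale=25, size=200)

alpha = 0.05

# H0 of Levene's test: the groups have equal variances
stat, p_val = stats.levene(sample_a, sample_b)

# Rejecting H0 means we should not assume equal variances
equal_var = bool(p_val >= alpha)
print(f'Levene p-value: {p_val:.4g}, equal_var={equal_var}')

# Pass the result along to the two-sample t-test
t, p = stats.ttest_ind(sample_a, sample_b, equal_var=equal_var)
```
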

##### Step 5: Compute Test Statistics

In [None]:
# 2 sample. 2 tailed
t, p = 
t, p, α

In [None]:
p < α
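
When only summary statistics are available, `stats.ttest_ind_from_stats` can stand in for `stats.ttest_ind`. The sketch below is one possible approach, not necessarily the intended solution, and every number in it is an assumed value for illustration.

```python
from math import sqrt
from scipy import stats

# Assumed summary statistics (hypothetical, for illustration only)
mean_1, std_1, n_1 = 90, 15, 40
mean_2, std_2, n_2 = 100, 20, 50
alpha = 0.05

# Two-sample, two-tailed Welch t-test computed from summary statistics
t, p = stats.ttest_ind_from_stats(
    mean_1, std_1, n_1,
    mean_2, std_2, n_2,
    equal_var=False,
)

# The same Welch t statistic by hand
t_manual = (mean_1 - mean_2) / sqrt(std_1**2 / n_1 + std_2**2 / n_2)

print(t, t_manual, p, p < alpha)
```
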

##### Step 6: Decide

In [None]:
if p < α:
    print('Our p-value is less than alpha and we can reject the null hypothesis, indicating some difference in the sales time between the offices')

<hr style="border:2px solid black">
<hr style="border:2px solid black">

#### #3. Load the mpg dataset and use it to answer the following questions:
- Is there a difference in fuel-efficiency in cars from 2008 vs 1999?
- Are compact cars more fuel-efficient than the average car?
- Do manual cars get better gas mileage than automatic cars?

In [None]:
#import the data


In [None]:
#take a peek at the data


<div class="alert alert-block alert-info">
<b>Think it through:</b> 
<br>
To answer this set of questions, we need to create a new column of average mileage
    <br>
- engineer an average mileage column in order to make the fuel efficiency comparisons
    <br>
- capture transmissions that are automatic or manual for that specific comparison
</div>    
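
A minimal sketch of the transmission step, assuming the `trans` column holds strings such as `auto(l5)` and `manual(m6)`. The tiny DataFrame and the new column name `trans_type` are hypothetical stand-ins for the real data.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the mpg data (values are made up for illustration)
df = pd.DataFrame({
    'trans': ['auto(l5)', 'manual(m5)', 'manual(m6)', 'auto(av)'],
    'cty': [18, 21, 20, 21],
    'hwy': [29, 29, 31, 30],
})

# Label each row by the prefix of its transmission string
df['trans_type'] = np.where(df['trans'].str.startswith('auto'), 'auto', 'manual')

print(df[['trans', 'trans_type']])
```
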

Calculate average fuel economy assuming 50% highway and 50% city driving
- Should I use arithmetic mean or harmonic mean for average mpg?
    - Arithmetic Mean: fe_am = (cty + hwy)/2
    - Harmonic Mean: fe_hm = 2/(1/cty + 1/hwy)
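
To see how the two averages differ, the sketch below uses a single made-up car (18 mpg city, 29 mpg highway) and only the standard library. If the 50/50 split is read as equal distance in the city and on the highway, total miles divided by total gallons works out to the harmonic mean; that reading of the split is an assumption.

```python
from statistics import harmonic_mean, mean

cty, hwy = 18, 29  # made-up ratings for one car

fe_am = mean([cty, hwy])           # (cty + hwy) / 2
fe_hm = harmonic_mean([cty, hwy])  # 2 / (1/cty + 1/hwy)

# Drive 100 miles in the city and 100 miles on the highway
gallons = 100 / cty + 100 / hwy
mpg_equal_distance = 200 / gallons

print(fe_am, round(fe_hm, 2), round(mpg_equal_distance, 2))
```

For positive values the harmonic mean is never larger than the arithmetic mean, so this choice slightly lowers each car's combined figure.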

In [None]:
#find the mean of cty and hwy combined and create a new column


In [None]:
#look at our new data


<hr style="border:0.5px solid black">

### A) Is there a difference in fuel-efficiency in cars from 2008 vs 1999?

<div class="alert alert-block alert-info">
<b>Let's break it down:</b> 
<br>
- We are looking to compare values across cars manufactured in 2008 versus cars manufactured in 1999  
    <br>
</div>    

<b>We are comparing</b>: average mileage (numeric/continuous) vs two different years (distinct/categorical)
<br>
<br>
Therefore, we will use:
<br>
<br>
2 sample t-test: comparing two years
    <br>
2 tailed: we want to know whether there is a difference (not whether one is less or more)
    <br>
Perform a ttest_ind on these two subsets of our data
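
The subsetting and test described above might look like the sketch below. The DataFrame is a small made-up stand-in, and the column name `avg_mileage` is an assumption about what the engineered column is called.

```python
import pandas as pd
from scipy import stats

# Made-up stand-in for the mpg data (not real measurements)
df = pd.DataFrame({
    'year': [1999, 1999, 1999, 1999, 2008, 2008, 2008, 2008],
    'avg_mileage': [23.5, 25.0, 19.5, 21.0, 24.5, 22.0, 20.5, 18.0],
})

# One Series of average mileage per model year
mileage_1999 = df[df.year == 1999].avg_mileage
mileage_2008 = df[df.year == 2008].avg_mileage

# Two-sample, two-tailed t-test (scipy's default alternative is two-sided)
t, p = stats.ttest_ind(mileage_2008, mileage_1999)

print(t, p)
```
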

In [None]:
#create a new pandas Series for each year we are comparing


<b>Step 1: Plot Distribution</b>

In [None]:
#Let's look at 1999's distribution


In [None]:
#Let's look at 2008's distribution


<b> Step 2: Set Hypothesis</b>
<br>
- $H_0$: 

- $H_a$: 

<b>Step 3: Set Alpha</b>

In [None]:
α = 

<b>Step 4: Verify Assumptions</b>

<div class="alert alert-block alert-info">
<b>We need to ask ourselves:</b> 

1. Are the samples independent? 
    - 
    - 
2. Is there normality?      
    - 
    - 
3. Is there equal variance?  
    - 
</div>    

#2. Normal distribution or Sample size larger than 30?

In [None]:
# find the sample size: it must be more than 30 to meet the assumption


#3. Is there equal variance?

In [None]:
stat, pval = 
if pval < α:
    print('we can reject the null hypothesis and posit that the variances are unequal')

<b>Step 5: Compute Test Statistics</b>

In [None]:
t, p = 
t,p, α

In [None]:
p < α

<b>Step 6: Decide</b>

In [None]:
print(f'''
Because p ({p:.3f}) > alpha (.05), we fail to reject the null\
 hypothesis that there is no difference in fuel-efficiency between cars\
 from 2008 and 1999.
''')

<hr style="border:0.5px solid black">

### B) Are compact cars more fuel-efficient than the average car?

<div class="alert alert-block alert-info">
<b>We are comparing</b>: average mileage of compact cars (numeric/continuous) vs the average mileage of all cars (a single reference value)
<br>
<br>
Therefore, we will use:
<br>
<br>
one sample- only looking at compact cars average
<br>
one tailed- MORE fuel efficient (as opposed to: is there a difference in fuel)
    </div> 
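
For a one-sample, one-tailed test, a sketch with simulated data is shown below. The numbers are made up, and the `alternative` keyword assumes SciPy 1.6 or newer; on older versions the two-tailed p-value can be halved when `t > 0`, which the last line compares.

```python
import numpy as np
from scipy import stats

# Simulated stand-ins (assumed values, not the mpg data)
rng = np.random.default_rng(seed=42)
compact_mileage = rng.normal(loc=22, scale=3, size=47)
overall_mean = 20  # hypothetical population mean

# Two-tailed by default
t, p_two_tailed = stats.ttest_1samp(compact_mileage, overall_mean)

# One-tailed: is the sample mean greater than overall_mean?
t_greater, p_greater = stats.ttest_1samp(
    compact_mileage, overall_mean, alternative='greater'
)

# With t > 0, the one-tailed p-value is half the two-tailed one
print(t, p_greater, p_two_tailed / 2)
```
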

In [None]:
#Let's create a dataset for only compact car mileage


#Let's create a dataset for overall mileage


<b>Step 1: Plot Distribution</b>

In [None]:
# look at the distribution. N >30


<b>Step 2: Set Hypothesis</b>

- $H_0$: the average fuel efficiency of compact cars is less than or equal to the overall average fuel efficiency
<br>
- $H_a$: the average fuel efficiency of compact cars is greater than the overall average fuel efficiency

<b>Step 3: Set Alpha</b>

In [None]:
α = 

<b>Step 4: Verify Assumptions</b>

<div class="alert alert-block alert-info">
<b>We need to ask ourselves:</b> 

1. Are the samples independent? 
    - 
2. Is there normality?      
    - 
    - 
3. Is there equal variance?  
    - 
</div> 

#2. Is there normality?

In [None]:
#must be more than 30


<b>Step 5: Compute Test Statistics</b>

<b>Step 6: Decide</b>

In [None]:
if (t > 0) and ((p/2) < α):
    print('we can reject the null hypothesis')

In [None]:
print(f'''
Because p/2 ({p/2:.12f}) < alpha (.05), we reject the null hypothesis that compact cars are no more fuel-efficient than the overall average.
''')

<hr style="border:0.5px solid black">

### C) Do manual cars get better gas mileage than automatic cars?

In [None]:
# we will look at average fuel efficiency for auto cars, and manual cars


<b>Step 1: Plot distribution</b>

In [None]:
# look at the distribution. N >30


In [None]:
# look at the distribution. N >30


<b>Step 2: Set Hypothesis</b>

- $H_0$: The average mileage of manual cars <= the average mileage of automatic cars
<br>
- $H_a$: The average mileage of manual cars > the average mileage in automatic cars

<b>Step 3: Set Alpha</b>

In [None]:
α = 

<b>Step 4: Verify 3 Assumptions</b>

<div class="alert alert-block alert-info">
<b>We need to ask ourselves:</b> 

1. Are the samples independent? 
    - 
2. Is there normality?      
    - 
    - 
3. Is there equal variance?  
    - 
</div> 

#2. Is there normality?

In [None]:
#must be more than 30


#3. Is there equal variance?

In [None]:
#check variance


<b>Step 5: Compute Test Statistics</b>

<b>Step 6: Decide</b>

In [None]:
if (t > 0) and ((p/2) < α):
    print('We can reject our null hypothesis')
else:
    print('we cannot reject our null hypothesis')
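
The `(t > 0) and ((p/2) < α)` check works because a two-tailed p-value splits evenly between both tails. A small helper (hypothetical, not part of SciPy) makes the conversion explicit for a "greater than" alternative, including the case where `t` points the wrong way.

```python
def one_tailed_p_greater(t, p_two_tailed):
    """Convert a two-tailed t-test p-value to a one-tailed p-value
    for the alternative 'group mean is greater'."""
    if t > 0:
        # The observed effect is in the hypothesized direction
        return p_two_tailed / 2
    # The observed effect is in the opposite direction
    return 1 - p_two_tailed / 2


alpha = 0.05

# Illustrative inputs (made up): same p-value, opposite signs of t
print(one_tailed_p_greater(2.1, 0.04) < alpha)   # True: 0.02 < 0.05
print(one_tailed_p_greater(-2.1, 0.04) < alpha)  # False: 0.98 > 0.05
```
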

In [None]:
print(f'''
Because p/2 ({p/2:.6f}) < alpha (.05), we reject the null hypothesis that manual cars do not get better gas mileage than automatic cars.
''')

In [None]:
## just look at the means to visually confirm your decision
