In [18]:
from math import sqrt
from scipy import stats
from pydataset import data

### 1. Are the average grades in web development vs data science classes different?

$H_0$: The average grades in both cohorts are not different

$H_a$: The average grades in both cohorts are different

- TP: DS grades on average are lower because the course is harder
- FP(T1): We conclude the average grades differ, but only because we compared DS month 4 to WD month 1 (an undercoverage sampling error)
- TN: The average grades across both cohorts are the same
- FN(T2): We conclude the average grades are the same, but the data was volunteered, so only people with good grades answered
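
The FP(T1) and FN(T2) labels above correspond to Type I and Type II errors. As a rough illustration, the sketch below simulates many pairs of cohorts whose true mean grades are identical and counts how often a two-sample t-test rejects $H_0$ anyway. The grade distribution (mean 80, standard deviation 10), the cohort size, and the seed are made-up assumptions, not data from these classes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only
alpha = 0.05
n_trials = 2000
cohort_size = 30  # hypothetical cohort size

false_positives = 0
for _ in range(n_trials):
    # Both cohorts are drawn from the SAME distribution, so H0 is true
    web_dev = rng.normal(loc=80, scale=10, size=cohort_size)
    data_science = rng.normal(loc=80, scale=10, size=cohort_size)
    _, p = stats.ttest_ind(web_dev, data_science)
    if p < alpha:
        false_positives += 1  # rejected a true H0: a Type I error

type_1_rate = false_positives / n_trials
print(f'Observed Type I error rate: {type_1_rate:.3f}')
```

Under these assumptions the observed rate should land close to the chosen significance level of .05, which is what the significance level is meant to control.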

### 2. Is there a relationship between how early a student shows up to class and their grade?


$H_0$: There is no relationship between how early a student shows up to class and their grade

$H_a$: Students who show up earlier tend to have lower grades

- TP: Students with lower grades show up earlier
- FP(T1): Students with lower grades appear to show up earlier, but a single student with an extreme outlier grade is driving that result
- TN: Grades do not change based on how early a student shows up
- FN(T2): Grades appear not to change, but random sampling happened to miss several students who show up early with poor grades

### 3. Are web dev or data science students more likely to be coffee drinkers?

$H_0$: There is no relationship between cohort and likelihood of drinking coffee

$H_a$: DS students are more likely to drink coffee

- TP: DS students are more likely to drink coffee
- FP(T1): DS students are more likely to drink coffee, but the answers are skewed because the question was asked on the day of a DS test
- TN: Cohort does not affect likelihood of drinking coffee
- FN(T2): Cohort does not affect likelihood of drinking coffee, but 3/4 of the surveys were not answered

### 4. Has the network latency gone up since we switched internet service providers?

$H_0$: Network latency has not changed since switching service providers

$H_a$: Network latency has gone up since switching service providers

- TP: Network latency has gone up with the new provider
- FP(T1): Network latency has gone up, but it is because more people are using the service
- TN: Network latency has not changed with the new provider
- FN(T2): Network latency has not changed, but we also switched to fiber
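
Because $H_a$ here is directional ("has gone up"), a one-tailed test is one reasonable choice. The sketch below compares one-tailed and two-tailed p values on made-up latency samples. The numbers are purely illustrative, and the `alternative` parameter of `stats.ttest_ind` assumes SciPy 1.6 or later.

```python
from scipy import stats

# Hypothetical latency samples in milliseconds (illustrative values only)
old_isp = [42, 45, 39, 47, 44, 41, 43, 46]
new_isp = [48, 51, 45, 50, 47, 52, 49, 46]

# One-tailed test: H_a is that mean latency under the new ISP is greater
t_one, p_one = stats.ttest_ind(new_isp, old_isp, alternative='greater')

# Two-tailed test on the same data, for comparison
t_two, p_two = stats.ttest_ind(new_isp, old_isp)

print(f't = {t_one:.4f}')
print(f'one-tailed p = {p_one:.4f}, two-tailed p = {p_two:.4f}')
```

When the t statistic points in the direction of $H_a$ (positive here), the one-tailed p value is half the two-tailed p value.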

### 5. Is the website redesign any good?

$H_0$: The period of time people stay on our website has not changed

$H_a$: The period of time people stay on our website has gone up

- TP: People stay on our website longer after the redesign
- FP(T1): People stay on our website longer after the redesign, but it has only been one week and we do not have enough data
- TN: The period of time people stay on our website has not changed
- FN(T2): The period of time people stay on our website has not changed, but it's only been one week and we do not have enough data

### 6. Is our TV ad driving more sales?

$H_0$: Our sales have not changed after our TV ad

$H_a$: Sales after our TV ad have gone up

- TP: Our sales have gone up after our TV ad
- FP(T1): Our sales have gone up after our TV ad, but we sell toilet paper during coronavirus
- TN: Our sales have not changed after our TV ad
- FN(T2): Our sales have not changed after our TV ad, but the ad was only aired during the day when many people might be working

### T-test
Ace Realty wants to determine whether the average time it takes to sell homes is different for its two offices. 
- A sample of 40 sales from office #1 revealed a mean of 90 days and a standard deviation of 15 days. 
- A sample of 50 sales from office #2 revealed a mean of 100 days and a standard deviation of 20 days. 

Use a .05 level of significance.
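
For reference, a pooled two-sample t-test uses the statistic below. It assumes the two offices share a common variance, which the problem statement does not confirm, so treat that as an assumption of this approach.

$$
s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}, \qquad
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad
df = n_1 + n_2 - 2
$$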

In [3]:
xbar1 = 90
xbar2 = 100

n1 = 40
n2 = 50

s1 = 15
s2 = 20

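# Degrees of freedom for a pooled two-sample t-test
degf = n1 + n2 - 2

# Pooled standard deviation (assumes both offices share a common variance)
s_p = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / degf)

# Unpooled (Welch) standard error, kept for comparison; not used below
standard_error = se = sqrt(s1**2 / n1 + s2**2 / n2)

# t statistic: difference in means divided by the pooled standard error
t = (xbar1 - xbar2) / (s_p * sqrt(1/n1 + 1/n2))
t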

-2.6252287036468456

In [8]:
p = stats.t(degf).sf(abs(t)) * 2
p

0.01020985244923939

In [17]:
print(f'Since p = {p:.2f} is less than the targeted significance of ' +
      '.05, we can reject our null hypothesis that the average ' +
      'time it takes to sell homes is the same for both offices')

Since p = 0.01 is less than the targeted significance of .05, we can reject our null hypothesis that the average time it takes to sell homes is the same for both offices
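
As a cross-check on the manual calculation, SciPy can run the same pooled test directly from summary statistics with `stats.ttest_ind_from_stats`. This is a minimal sketch using only the numbers given in the problem statement. `equal_var=True` matches the pooled-variance approach used above.

```python
from scipy import stats

# Summary statistics from the problem statement
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=90, std1=15, nobs1=40,   # office #1
    mean2=100, std2=20, nobs2=50,  # office #2
    equal_var=True,                # pooled variance, matching the manual calculation
)
print(f't = {t_stat:.4f}, p = {p_value:.4f}')
```

This should agree with the hand-computed t of about -2.6252 and p of about 0.0102.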


In [88]:
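mpg = data('mpg')
# Average of city and highway mileage as a single fuel-efficiency measure
mpg['ave_mpg'] = (mpg.cty + mpg.hwy) / 2
# Keep only 'auto' or 'manual', dropping the detail in parentheses, e.g. 'auto(l5)'
mpg['trans'] = mpg.trans.str.split('(').str[0]
mpg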

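| | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class | ave_mpg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | audi | a4 | 1.8 | 1999 | 4 | auto | f | 18 | 29 | p | compact | 23.5 |
| 2 | audi | a4 | 1.8 | 1999 | 4 | manual | f | 21 | 29 | p | compact | 25.0 |
| 3 | audi | a4 | 2.0 | 2008 | 4 | manual | f | 20 | 31 | p | compact | 25.5 |
| 4 | audi | a4 | 2.0 | 2008 | 4 | auto | f | 21 | 30 | p | compact | 25.5 |
| 5 | audi | a4 | 2.8 | 1999 | 6 | auto | f | 16 | 26 | p | compact | 21.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 230 | volkswagen | passat | 2.0 | 2008 | 4 | auto | f | 19 | 28 | p | midsize | 23.5 |
| 231 | volkswagen | passat | 2.0 | 2008 | 4 | manual | f | 21 | 29 | p | midsize | 25.0 |
| 232 | volkswagen | passat | 2.8 | 1999 | 6 | auto | f | 16 | 26 | p | midsize | 21.0 |
| 233 | volkswagen | passat | 2.8 | 1999 | 6 | manual | f | 18 | 26 | p | midsize | 22.0 |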


In [30]:
x1 = mpg[mpg.year == 1999].ave_mpg
x2 = mpg[mpg.year == 2008].ave_mpg
t_stat, p = stats.ttest_ind(x1, x2)
print(f't_stat = {t_stat:.4f} \np = {p:.4f}')
print(f'Since our p value of {p:.4f} is greater than .05, ' +
     'we fail to reject our null hypothesis that there is ' +
     'no difference in fuel efficiency between cars made in ' +
     '1999 vs cars made in 2008')

t_stat = 0.2196 
p = 0.8264
Since our p value of 0.8264 is greater than .05, we fail to reject our null hypothesis that there is no difference in fuel efficiency between cars made in 1999 vs cars made in 2008
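
By default `stats.ttest_ind` assumes the two groups have equal variances. One way to check that assumption is Levene's test, falling back to Welch's t-test (`equal_var=False`) when it looks doubtful. The sketch below uses synthetic samples with deliberately different spreads, not the mpg data, and the .05 cutoff for Levene's test is an assumed convention rather than a rule.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Synthetic stand-ins for two samples (NOT the mpg data)
x1 = rng.normal(loc=20, scale=4, size=100)
x2 = rng.normal(loc=20, scale=8, size=120)

# Levene's test: H0 is that both groups have equal variance
_, levene_p = stats.levene(x1, x2)

# Use Welch's t-test when the equal-variance assumption looks doubtful
equal_var = bool(levene_p >= 0.05)
t_stat, p = stats.ttest_ind(x1, x2, equal_var=equal_var)

print(f'levene p = {levene_p:.4f}, equal_var = {equal_var}')
print(f't_stat = {t_stat:.4f}, p = {p:.4f}')
```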


In [43]:
x = mpg[mpg['class'] == 'compact'].ave_mpg
mu = mpg.ave_mpg.mean()
t_stat, p = stats.ttest_1samp(x, mu)

print(f't_stat = {t_stat:.4f} \np = {p:.11f}')
print(f'Since our p value of {p:.11f} is less than .05, ' +
     'we reject our null hypothesis that there is ' +
     'no difference in fuel efficiency between compact cars ' +
     'and the overall average')

t_stat = 7.8969 
p = 0.00000000042
Since our p value of 0.00000000042 is less than .05, we reject our null hypothesis that there is no difference in fuel efficiency between compact cars and the overall average
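
To make the one-sample test less of a black box, the sketch below computes the t statistic by hand as (sample mean - mu) / (s / sqrt(n)) and compares it with `stats.ttest_1samp`. The sample values and `mu` are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import mean, stdev
from scipy import stats

# Hypothetical sample of average mpg values (illustrative only)
sample = [25.5, 23.5, 26.0, 24.0, 27.5, 22.5, 25.0, 26.5]
mu = 20.0  # hypothetical population mean to compare against

# t = (xbar - mu) / (s / sqrt(n)), with s the sample standard deviation
n = len(sample)
t_manual = (mean(sample) - mu) / (stdev(sample) / sqrt(n))
p_manual = stats.t(n - 1).sf(abs(t_manual)) * 2  # two-tailed p value

t_scipy, p_scipy = stats.ttest_1samp(sample, mu)
print(f'manual: t = {t_manual:.4f}, p = {p_manual:.6f}')
print(f'scipy:  t = {t_scipy:.4f}, p = {p_scipy:.6f}')
```

The two approaches should agree up to floating-point error.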


In [91]:
x1 = mpg[mpg.trans == 'manual'].ave_mpg
x2 = mpg[mpg.trans == 'auto'].ave_mpg
t_stat, p = stats.ttest_ind(x1, x2)
print(f't_stat = {t_stat:.4f} \np = {p:.7f}')
print(f'Since our p value of {p:.7f} is less than .05, ' +
     'we reject our null hypothesis that there is ' +
     'no difference in fuel efficiency between manual cars ' +
     'and automatic cars')

t_stat = 4.5934 
p = 0.0000072
Since our p value of 0.0000072 is less than .05, we reject our null hypothesis that there is no difference in fuel efficiency between manual cars and automatic cars
