In [1]:
from math import sqrt
from scipy import stats

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


df = pd.read_csv('Exam_scores.csv')
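# Assign the result back; chained inplace fillna is unreliable in recent pandas versions
df['study_strategy'] = df['study_strategy'].fillna('None')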

For each of the following questions, formulate a null and alternative hypothesis (be as specific as you can be), then give an example of what a true positive, true negative, type I and type II errors would look like. Note that some of the questions are intentionally phrased in a vague way. It is your job to reword these as more precise questions that could be tested.

#### 1. Has the network latency gone up since we switched internet service providers?

- **alpha**: $\alpha$: 1 - confidence level (95% confidence level -> $\alpha = .05$)
- **null hypothesis**: $H_0$: the "status quo"
- **alternative hypothesis**: $H_a$: the opposite; alternative

- Have round trip time and time to first byte increased since switching ISPs?

- $H_0$: RTT and TTFB before and after changing ISPs are no different.
- $H_a$: RTT and TTFB increased after changing ISPs.

or

- $H_0$: Reported FCC latency is the same for both ISPs.
- $H_a$: Reported FCC latency is higher for the new ISP.

- True Positive
    - small p-value (e.g. .001), below alpha
    - reject $H_0$ (there is a difference)
    - FCC latency of ISP 1 reported at : 100 ms
    - FCC latency of ISP 2 reported at : 1_000 ms
- False Positive
    - small p-value
    - reject $H_0$ (there is no difference but we conclude there is one)
    - FCC data for ISP 2 was taken in a very rural area
- True Negative
    - higher p-value
    - fail to reject $H_0$ (there is no difference, conclusion based on data supports this)
    - FCC latency of ISP 1 reported at : 100 ms
    - FCC latency of ISP 2 reported at : 130 ms
- False Negative
    - higher p-value
    - fail to reject $H_0$ (there is a difference, conclusion based on data does not support this)
    - FCC latency of ISP 1 reported at : 100 ms
    - FCC latency of ISP 2 reported at : 130 ms
    - user does not understand millisecond units?
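
The four outcomes listed above follow one decision rule: compare the p-value with alpha, then compare that decision with the truth. Here is a minimal sketch of the rule. The function name `classify_outcome` and the `effect_is_real` flag are hypothetical, and in practice the truth is unknown, so this only labels thought experiments like the ones in this notebook.

```python
def classify_outcome(p_value, alpha, effect_is_real):
    """Label a hypothesis-test decision, given the (normally unknown) truth."""
    reject_null = p_value < alpha  # the decision we make from the data
    if reject_null and effect_is_real:
        return "true positive"
    if reject_null and not effect_is_real:
        return "false positive (type I error)"
    if not reject_null and not effect_is_real:
        return "true negative"
    return "false negative (type II error)"


# Illustrative values only: a small p-value when latency really did increase
print(classify_outcome(p_value=0.001, alpha=0.05, effect_is_real=True))
```

The same helper applies unchanged to the website and television ad questions, since only the hypotheses differ.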

#### 2. Is the website redesign any good?


- **alpha**: $\alpha$: 1 - confidence level (95% confidence level -> $\alpha = .05$)
- **null hypothesis**: $H_0$: the "status quo"
- **alternative hypothesis**: $H_a$: the opposite; alternative

- $H_0$: the number of surveys that mention the website positively is the same before and after the redesign.
- $H_a$: the number of surveys that mention the website positively is higher after the redesign.

- True Positive
    - small p-value (e.g. .001), below alpha
    - reject $H_0$ (there is a difference, we conclude there is one)
    - Previous # of mentions : 10 
    - Current number of mentions : 35 
- False Positive
    - small p-value
    - reject $H_0$ (there is no difference but we conclude there is one)
    - people feel the same, but the survey is presented in a better location and more people respond
- True Negative
    - higher p-value
    - fail to reject $H_0$ (there is no difference, conclusion based on data supports this)
    - Number of positive mentions before is 15
    - Number of positive mentions after is 14
- False Negative
    - higher p-value
    - fail to reject $H_0$ (there is a difference, conclusion based on data does not support this)
    - Number of positive mentions before is 15
    - Number of positive mentions after is 23
    - NLP model does not accurately read positive mentions
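
The false positive scenario above (same sentiment, more respondents) suggests comparing the rate of positive mentions rather than the raw count. Here is a rough sketch using a one-sided two-proportion z-test, which is a swapped-in technique and not something the exercise prescribes. The function name and the survey totals (100 and 350) are hypothetical; only the counts 10 and 35 come from the example above.

```python
from math import erfc, sqrt


def two_proportion_z_test(pos_before, n_before, pos_after, n_after):
    """One-sided z-test: is the positive-mention rate higher after the redesign?"""
    rate_before = pos_before / n_before
    rate_after = pos_after / n_after
    # Pooled rate under H0 (no change in the positive-mention rate)
    pooled = (pos_before + pos_after) / (n_before + n_after)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    z = (rate_after - rate_before) / std_err
    p_value = 0.5 * erfc(z / sqrt(2))  # upper-tail probability P(Z >= z)
    return z, p_value


# Hypothetical totals: 10 of 100 surveys before, 35 of 350 after (same 10% rate)
z, p_value = two_proportion_z_test(10, 100, 35, 350)
print(f"z = {z:.3f}, p = {p_value:.3f}")
```

With these assumed totals the count more than triples but the rate is unchanged, so the test gives no evidence of an improvement.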

#### 3. Is our television ad driving more sales?

- **alpha**: $\alpha$: 1 - confidence level (95% confidence level -> $\alpha = .05$)
- **null hypothesis**: $H_0$: the "status quo"
- **alternative hypothesis**: $H_a$: the opposite; alternative

- $H_0$: gross sales in the 3 months after the ad began to air are the same as in the 3 months before.
- $H_a$: gross sales in the 3 months after the ad began to air are higher than in the 3 months before.

- True Positive
    - small p-value (e.g. .001), below alpha
    - reject $H_0$ (there is a difference, we conclude there is one)
    - sales three months before ad are: 1_000
    - sales three months after ad are: 3_000
- False Positive
    - small p-value
    - reject $H_0$ (we conclude there is a difference, but there is no actual difference)
    - sales increase at the same time we offer a huge discount on the product
- True Negative
    - higher p-value
    - fail to reject $H_0$ (we conclude there is no difference, there is no actual difference)
    - sales three months before ad are: 2_000
    - sales three months after ad are: 2_101
- False Negative
    - higher p-value
    - fail to reject $H_0$ (we conclude there is no difference, there is an actual difference)
    - sales three months before ad are: 1_000
    - sales three months after ad are: 1_500
    - sales increased despite product being removed from minor store brands

In [2]:
df.head()

Unnamed: 0,exam_score,hours_studied,study_strategy,handedness,coffee_consumed,hours_slept
0,100.591011,9.126291,flashcards,left,0,11
1,95.637086,9.677438,flashcards,left,1,10
2,53.200296,4.550207,,right,5,6
3,63.934268,6.487848,flashcards,right,4,7
4,51.18637,6.720959,flashcards,right,5,6


- Ace Realty wants to determine whether the average time it takes to sell homes is different for its two offices. A sample of 40 sales from office #1 revealed a mean of 90 days and a standard deviation of 15 days. A sample of 50 sales from office #2 revealed a mean of 100 days and a standard deviation of 20 days. Use a .05 level of significance.

- $H_0$: there is no difference in the length of time to sell a home between offices.
- $H_a$: there is a difference in the length of time to sell a home between offices.
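
For reference, the next cell computes a pooled-variance two-sample t statistic. Written out with the numbers from the problem statement, and assuming the two offices share a common variance (an assumption the pooled formula requires, not one the problem states), the calculation is roughly:

$$
s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{39 \cdot 15^2 + 49 \cdot 20^2}{88}} \approx 17.96
$$

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{90 - 100}{17.96 \sqrt{\frac{1}{40} + \frac{1}{50}}} \approx -2.63
$$

The degrees of freedom are $n_1 + n_2 - 2 = 88$.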

In [8]:
xbar1 = 90
xbar2 = 100

n1 = 40
n2 = 50

s1 = 15
s2 = 20

degf = (n1 + n2) - 2 # total sample size minus the number of groups

s_p = sqrt(
    ((n1 - 1) * s1**2 + (n2 - 1) * s2**2)
    /
    (n1 + n2 - 2)
)

t = (xbar1 - xbar2) / (s_p * sqrt(1 / n1 + 1 / n2))
alpha = 0.05
t

-2.6252287036468456

In [9]:
p = stats.t(degf).sf(abs(t)) * 2  # two-tailed: double the upper tail beyond |t|
round(p, 6)

0.01021

In [11]:
print(f'''
Because p ({p:.6f}) < alpha ({alpha}), we reject the null hypothesis
that there is no difference in the length of time it takes to sell a house between offices

in plain english: we think that these two groups are significantly different
''')


Because p (0.010210) < alpha (0.05), we reject the null hypothesis
that there is no difference in the length of time it takes to sell a house between offices

in plain english: we think that these two groups are significantly different
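
As a cross-check on the manual calculation, `scipy.stats.ttest_ind_from_stats` runs the same pooled two-sample t-test directly from summary statistics. This is a sketch for verification, not part of the original exercise; the variable names `t_stat` and `p_value` are illustrative.

```python
from scipy import stats

# Summary statistics from the Ace Realty problem statement
result = stats.ttest_ind_from_stats(
    mean1=90, std1=15, nobs1=40,
    mean2=100, std2=20, nobs2=50,
    equal_var=True,  # pooled variance, matching the manual calculation
)
t_stat, p_value = result.statistic, result.pvalue

alpha = 0.05
print(f"t = {t_stat:.4f}, p = {p_value:.4f}, reject H0: {p_value < alpha}")
```

The test is two-sided by default, which matches a question about whether the offices are "different". Since the sample standard deviations differ (15 vs. 20 days), passing `equal_var=False` to run Welch's t-test would be a reasonable sensitivity check.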

