Materials in this notebook are adapted from [Statistics for Business and Economics 10e Chapter 10](https://www.amazon.com/Statistics-Business-Economics-Education-Printed/dp/1305585313/ref=sr_1_5?keywords=statistics+for+business+and+economics+pagano&qid=1581359479&sr=8-5).

# Inference About the Difference Between Two Population Means: $\sigma_1$ and $\sigma_2$ are Unknown.

The College Board provided comparisons of Scholastic Aptitude Test (SAT) scores based on the highest level of education attained by the test taker’s parents. A research hypothesis was that students whose parents had attained a higher level of education would on average score higher on the SAT. During 2003, the overall mean SAT verbal score was 507 (The World Almanac 2004). SAT verbal scores for two independent samples of students follow. The first sample shows the SAT verbal test scores for students whose parents are college graduates with a bachelor’s degree. The second sample shows the SAT verbal test scores for students whose parents are high school graduates but do not have a college degree.

__Your Turn__

1. What is the first population?

2. What is the second population?

3. Do you know the first population's mean? ($\mu_{1}$)

4. Do you know the second population's mean? ($\mu_{2}$)

5. Do you know the first population standard deviation? ($\sigma_{1}$)

6. Do you know the second population standard deviation? ($\sigma_{2}$)

In [5]:
## your answers are here

In [24]:
sample_college = [485, 534, 650, 554, 550,
                  575, 498, 448, 470, 591,
                  480, 540, 520,515, 578]

In [25]:
sample_high_school = [440, 585, 481, 486, 528, 
                     524, 492, 478, 430, 480, 
                     390, 535]

1. Do you know the first sample's mean? ($\bar{x}_{1}$)

2. Do you know the second sample's mean? ($\bar{x}_{2}$)

3. Do you know the first sample's standard deviation? ($s_{1}$)

4. Do you know the second sample's standard deviation? ($s_{2}$)

In [26]:
import numpy as np

In [27]:
## your answer here

x1_bar = np.mean(sample_college)
x2_bar = np.mean(sample_high_school)
# ddof=1 gives the sample standard deviation (divides by n - 1);
# np.std defaults to ddof=0, which is the population formula
s1 = np.std(sample_college, ddof=1)
s2 = np.std(sample_high_school, ddof=1)

print(x1_bar)
print(x2_bar)
print('standard_college', round(s1, 2))
print('standard_high_school', round(s2, 2))

532.5333333333333
487.4166666666667
standard_college 53.21
standard_high_school 52.24


__Your Turn__

1. Write the alternative hypothesis. Should you use a two-tailed or a one-tailed test?

2. Write the null hypothesis. Make sure that it is the logical complement of the alternative hypothesis.

3. Set a significance level. ($\alpha$)

In [None]:
## your answer here

# Null hypothesis: students whose parents are college graduates do not score higher, on average,
# than students whose parents have only a high school diploma.

# Alternative hypothesis: students whose parents are college graduates score higher, on average,
# than students whose parents have only a high school diploma.


Mathematically we can express this as:

\begin{equation}
\begin{aligned}
    H_{a}&:  \mu_{1} - \mu_{2} > D_{0}, \quad D_{0} = 0 \\
    H_{0}&:  \mu_{1} - \mu_{2} \leq D_{0} \\
    &\text{Significance level: } \alpha = 0.05
\end{aligned}
\end{equation}

<img src= "img/welch_test.png" width = 450>
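
To make the formula concrete, here is a minimal sketch that computes Welch's $t$ statistic, the Welch-Satterthwaite degrees of freedom, and the upper-tail p-value by hand. It assumes the figure shows the usual unequal-variance formulas; the variable names (`v1`, `v2`, `se`, `dof`) are illustrative choices, and the sample lists are repeated so the block runs on its own.

```python
import numpy as np
from scipy import stats

sample_college = [485, 534, 650, 554, 550,
                  575, 498, 448, 470, 591,
                  480, 540, 520, 515, 578]
sample_high_school = [440, 585, 481, 486, 528,
                      524, 492, 478, 430, 480,
                      390, 535]

n1, n2 = len(sample_college), len(sample_high_school)
x1_bar, x2_bar = np.mean(sample_college), np.mean(sample_high_school)

# Sample variances (ddof=1 divides by n - 1)
v1 = np.var(sample_college, ddof=1)
v2 = np.var(sample_high_school, ddof=1)

# Standard error of the difference in sample means
se = np.sqrt(v1 / n1 + v2 / n2)

# Test statistic with hypothesized difference D0 = 0
D0 = 0
t_stat = (x1_bar - x2_bar - D0) / se

# Welch-Satterthwaite approximation to the degrees of freedom
dof = (v1 / n1 + v2 / n2) ** 2 / (
    (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
)

# Upper-tail p-value, matching H_a: mu_1 - mu_2 > 0
p_value = stats.t.sf(t_stat, dof)

print(t_stat, dof, p_value)
```

The three printed values should agree with the library results in the sections that follow, up to floating-point rounding.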

## Using scipy.stats

In [22]:
import scipy.stats as stats

In [23]:
stats.ttest_ind(sample_college, 
                sample_high_school, 
                equal_var=False)

Ttest_indResult(statistic=2.211586919212515, pvalue=0.03680769761049924)
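
The p-value returned above is two-sided, while the alternative hypothesis here is one-sided. A minimal sketch of one way to convert it: when the statistic points in the direction of $H_a$ (here, $t > 0$), the one-sided p-value is half the two-sided one. The names `p_two_sided` and `p_one_sided` are illustrative, and the sample lists are repeated so the block runs on its own.

```python
import scipy.stats as stats

sample_college = [485, 534, 650, 554, 550,
                  575, 498, 448, 470, 591,
                  480, 540, 520, 515, 578]
sample_high_school = [440, 585, 481, 486, 528,
                      524, 492, 478, 430, 480,
                      390, 535]

result = stats.ttest_ind(sample_college,
                         sample_high_school,
                         equal_var=False)
t_stat, p_two_sided = result.statistic, result.pvalue

# For H_a: mu_1 - mu_2 > 0, halve the two-sided p-value when t > 0;
# otherwise the evidence points the wrong way and the p-value is large.
if t_stat > 0:
    p_one_sided = p_two_sided / 2
else:
    p_one_sided = 1 - p_two_sided / 2

print(p_one_sided)
```

In SciPy 1.6 and later, passing `alternative='greater'` to `stats.ttest_ind` should give the same one-sided p-value directly.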

## Using statsmodels

In [26]:
from statsmodels.stats.weightstats import ttest_ind
import numpy as np

In [24]:
np.array(sample_college).shape

(15,)

In [33]:
ttest_ind(np.array(sample_college),
          np.array(sample_high_school),
          usevar='unequal',
          alternative='larger')

(2.211586919212515, 0.01840384880524962, 23.902154340470105)
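
The tuple holds the test statistic, the one-sided p-value, and the degrees of freedom. As a sketch of the final decision step, the block below compares the p-value with $\alpha = 0.05$ and, equivalently, the statistic with the upper-tail critical value. The three numbers are copied from the output above; `t_crit` and `reject_h0` are illustrative names.

```python
from scipy import stats

alpha = 0.05

# Values copied from the statsmodels output above
t_stat, p_value, dof = 2.211586919212515, 0.01840384880524962, 23.902154340470105

# Upper-tail critical value for a one-tailed test at level alpha
t_crit = stats.t.ppf(1 - alpha, dof)

# The p-value rule and the critical-value rule lead to the same decision
reject_h0 = p_value < alpha

if reject_h0:
    print("Reject H0: the data support a higher mean score for the first population.")
else:
    print("Fail to reject H0.")
```

Since the p-value is below 0.05, the null hypothesis is rejected at the 5% level for these samples.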