# EGCI 305: Chapter 7 (Hypothesis on 2 Samples)

Outline
> 1. [Packages](#ch7_packages)

> 2. [Independent Z-test function (statsmodels)](#ch7_ind_ztest_func)
>    - [Example: math classes (1)](#ch7_ex_math_z)

> 3. [Independent T-test function (statsmodels)](#ch7_ind_ttest_func)
>    - [Example: math classes (2)](#ch7_ex_math_t)
>    - [Example: gas mileage](#ch7_ex_gas)

> 4. [Paired T-test function (scipy)](#ch7_paired_ttest_func)
>    - [Example: neck-shoulder disorder](#ch7_ex_neck)
 
> 5. [F distribution](#ch7_f)
>    - [Example: F and inverse](#ch7_ex_f)

<a name="ch7_packages"></a>

## Packages
> - **numpy** -- array manipulation
> - **matplotlib** -- visualization (backend)
> - **seaborn** -- high-level visualization
> - **math** -- basic calculations such as sqrt (if not using sympy)
> - **scipy.stats** -- statistical distributions and tests
> - **statsmodels.stats.weightstats** -- hypothesis testing (z-test, independent t-test)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print("Numpy version =", np.version.version)
print("Seaborn version =", sns.__version__)

import math
import scipy
print("Scipy version =", scipy.__version__)

from scipy import stats
from scipy.stats import norm            # Normal distribution
from scipy.stats import t               # T distribution
from scipy.stats import f               # F distribution

from statsmodels.stats.weightstats import ztest           # Z-test
from statsmodels.stats.weightstats import ttest_ind       # independent T-test
from scipy.stats import levene                            # Levene test
from scipy.stats import ttest_1samp                       # one-sample T-test (used on paired differences)

<a name="ch7_ind_ztest_func"></a>

## Independent Z-test Function (statsmodels)
- **[Manual: statsmodels.stats.weightstats.ztest](https://www.statsmodels.org/dev/generated/statsmodels.stats.weightstats.ztest.html)**
- Can use this function to do the whole testing procedure **if raw data are available**
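
As a rough illustration of what the function computes, the two-sample z statistic can be sketched by hand from raw data. The helper name `two_sample_z` and the toy arrays are hypothetical. The sketch uses the unpooled standard error $\sqrt{s_1^2/n_1 + s_2^2/n_2}$, whereas `ztest` pools the two variances by default; the two standard errors coincide when the sample sizes are equal.

```python
import numpy as np
from scipy.stats import norm

def two_sample_z(x1, x2, value=0.0):
    """Left-tailed two-sample z-test (hypothetical helper, unpooled standard error)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    # Standard error of the difference between the two sample means
    se = np.sqrt(x1.var(ddof=1) / x1.size + x2.var(ddof=1) / x2.size)
    z = (x1.mean() - x2.mean() - value) / se
    # Left-tailed p-value P(Z <= z), matching H1: mu1 - mu2 < value
    pvalue = norm.cdf(z)
    return z, pvalue

# Toy data, for illustration only
z, p = two_sample_z([1, 2, 3], [4, 5, 6])
print("Calculated Z = %.2f" % z)
print("P-value      = %.4f" % p)
```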

<a name="ch7_ex_math_z"></a>

### Example : math classes (1)
- Hypothesis
    >- H<sub>0</sub> : $\mu$<sub>1</sub> - $\mu$<sub>2</sub> = 0
    >- H<sub>1</sub> : $\mu$<sub>1</sub> - $\mu$<sub>2</sub> < 0
    >- Group 1 = online class
    >- Group 2 = face-to-face class
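
The left-tailed alternative above leads to a simple decision rule: reject H<sub>0</sub> when the calculated Z falls below the critical value $-z_{\alpha}$. A minimal sketch, where the helper name `reject_left_tailed` and the default alpha = 0.05 are assumptions for illustration:

```python
from scipy.stats import norm

def reject_left_tailed(zcal, alpha=0.05):
    """Return True if H0 is rejected in a left-tailed z-test (hypothetical helper)."""
    zthreshold = norm.ppf(alpha)          # same value as -norm.ppf(1 - alpha)
    return bool(zcal < zthreshold)

print("Reject H0 at Z = -3.23 ?", reject_left_tailed(-3.23))
print("Reject H0 at Z = -1.00 ?", reject_left_tailed(-1.00))
```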

In [None]:
A = np.array( [67.6, 41.2, 85.3, 55.9, 82.4, 91.2, 73.5, 94.1, 64.7, 64.7,
               70.6, 38.2, 61.8, 88.2, 70.6, 58.8, 91.2, 73.5, 82.4, 35.5,
               94.1, 88.2, 64.7, 55.9, 88.2, 97.1, 85.3, 61.8, 79.4, 79.4] )

B = np.array( [77.9, 95.3, 81.2, 74.1, 98.8, 88.2, 85.9, 92.9, 87.1, 88.2, 
               69.4, 57.6, 69.4, 67.1, 97.6, 85.9, 88.2, 91.8, 78.8, 71.8, 
               98.8, 61.2, 92.9, 90.6, 97.6, 100,  95.3, 83.5, 92.9, 89.4] )

print("A (online)       >> sample size = %d, mean = %.2f, var = %.2f" % 
      (A.size, A.mean(), A.var(ddof=1))
     )

print("B (face-to-face) >> sample size = %d, mean = %.2f, var = %.2f" % 
      (B.size, B.mean(), B.var(ddof=1)), "\n"
     )

fig = plt.figure( figsize = (7,5) ) 
ax1 = plt.subplot(221)
sns.kdeplot(A).set_title("A (online)")

ax2 = plt.subplot(222)
sns.kdeplot(B).set_title("B (face-to-face)")

ax3 = plt.subplot(223)
stats.probplot(A, dist = 'norm', plot = ax3)
ax3.set_title("A (online)")

ax4 = plt.subplot(224)
stats.probplot(B, dist = 'norm', plot = ax4)
ax4.set_title("B (face-to-face)")

fig.tight_layout()
plt.show() 

In [None]:
### From manual calculation

zcal = -3.23                        # Z calculated by hand
zthreshold = -norm.ppf(1-0.05)      # left-tailed critical value at alpha = 0.05
pvalue = norm.cdf(zcal)             # left-tailed p-value
print("Z threshold = %.2f" % zthreshold)
print("P-value     = %.4f" % pvalue)

In [None]:
### From ztest function

result = ztest(A, B, value = 0, alternative = 'smaller')
print("Calculated Z = %.2f" % result[0])
print("P-value      = %.4f" % result[1])

<a name="ch7_ind_ttest_func"></a>

## Independent T-test Function (statsmodels)
- **[Manual: statsmodels.stats.weightstats.ttest_ind](https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.ttest_ind.html)**
- Can use this function to do the whole testing procedure **if raw data are available**
- Run Levene's test for equality of variances first to choose `usevar` ('pooled' or 'unequal') --> **[Manual: scipy.stats.levene](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.levene.html)**
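
One common way to use the Levene result is to let its p-value decide which variance assumption to pass to `ttest_ind`. The sketch below assumes a significance level of 0.05. The helper name `choose_usevar` and the toy samples are hypothetical.

```python
import numpy as np
from scipy.stats import levene

def choose_usevar(a, b, alpha=0.05):
    """Pick the `usevar` argument for ttest_ind from Levene's test (hypothetical helper)."""
    result = levene(a, b)
    # Small p-value -> evidence that the variances differ -> do not pool
    return "unequal" if result.pvalue < alpha else "pooled"

# Toy samples, for illustration only
same_spread  = np.arange(1.0, 11.0)
shifted      = same_spread + 100.0      # same spread, different mean
tight_spread = np.array([10.0, 10.1, 9.9, 10.05, 9.95, 10.02, 9.98, 10.03, 9.97, 10.0])
wide_spread  = np.array([0, 20, -15, 35, 5, -25, 40, 10, -10, 30], dtype=float)

print("Same spread      -->", choose_usevar(same_spread, shifted))
print("Different spread -->", choose_usevar(tight_spread, wide_spread))
```

Levene's test compares spreads, not means, so shifting a sample by a constant does not change the outcome.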

<a name="ch7_ex_math_t"></a>

### Example : math classes (2)
- What if we run a t-test instead of a z-test?
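
When the variances are not assumed equal, the t-test becomes Welch's t-test. Its degrees of freedom come from the Welch-Satterthwaite approximation and are generally not an integer. A minimal sketch of that calculation, cross-checked against `scipy.stats.ttest_ind` with `equal_var=False`; the helper name `welch_t` and the toy data are hypothetical:

```python
import numpy as np
from scipy import stats

def welch_t(x1, x2):
    """Welch's t statistic and Welch-Satterthwaite df (hypothetical helper)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    v1 = x1.var(ddof=1) / x1.size       # estimated variance of sample mean 1
    v2 = x2.var(ddof=1) / x2.size       # estimated variance of sample mean 2
    tcal = (x1.mean() - x2.mean()) / np.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (x1.size - 1) + v2 ** 2 / (x2.size - 1))
    return tcal, df

# Toy data, for illustration only
x1 = [1, 2, 3]
x2 = [4, 5, 6, 9]
tcal, df = welch_t(x1, x2)
pvalue = stats.t.cdf(tcal, df)          # left-tailed p-value

check = stats.ttest_ind(x1, x2, equal_var=False)
print("Calculated t = %.2f (scipy = %.2f)" % (tcal, check.statistic))
print("df           = %.2f" % df)
print("P-value      = %.4f" % pvalue)
```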

In [None]:
result_vartest = levene(A, B)
print("Levene statistic = %.4f" % result_vartest.statistic)
print("P-value          = %.4f" % result_vartest.pvalue)

In [None]:
result = ttest_ind(A, B, value = 0, alternative = 'smaller', usevar = 'unequal')
print("Calculated t = %.2f" % result[0])
print("P-value      = %.4f" % result[1])
print("df           = %.2f" % result[2])

<a name="ch7_ex_gas"></a>

### Example : gas mileage
- Hypothesis
    >- H<sub>0</sub> : $\mu$<sub>1</sub> - $\mu$<sub>2</sub> = 0
    >- H<sub>1</sub> : $\mu$<sub>1</sub> - $\mu$<sub>2</sub> > 0
    >- Group 1 = premium gasoline
    >- Group 2 = regular gasoline
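
When the two variances can be treated as equal, the pooled form of the t statistic is used, with n<sub>1</sub> + n<sub>2</sub> - 2 degrees of freedom. A minimal sketch of a right-tailed pooled t-test computed by hand. The helper name `pooled_t` and the toy data are hypothetical, and the numbers are not the gas mileage data.

```python
import numpy as np
from scipy.stats import t

def pooled_t(x1, x2):
    """Right-tailed pooled two-sample t-test (hypothetical helper)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, n2 = x1.size, x2.size
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / df
    tcal = (x1.mean() - x2.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
    pvalue = t.sf(tcal, df)             # right-tailed: P(T >= tcal)
    return tcal, pvalue, df

# Toy data, for illustration only
tcal, pvalue, df = pooled_t([4, 5, 6], [1, 2, 3])
print("Calculated t = %.3f" % tcal)
print("P-value      = %.4f" % pvalue)
print("df           = %d" % df)
```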

In [None]:
A = np.array( [35.4, 31.7, 34.5, 35.4, 31.6, 35.3, 32.4, 36.6, 34.8, 36.0] )
B = np.array( [29.7, 34.8, 29.6, 34.6, 32.1, 34.8, 35.4, 32.6, 34.0, 32.2] )

print("A (premium) >> sample size = %d, mean = %.2f, var = %.2f" % 
      (A.size, A.mean(), A.var(ddof=1))
     )

print("B (regular) >> sample size = %d, mean = %.2f, var = %.2f" % 
      (B.size, B.mean(), B.var(ddof=1)), "\n"
     )

fig = plt.figure( figsize = (7,5) ) 
ax1 = plt.subplot(221)
sns.kdeplot(A).set_title("A (premium)")

ax2 = plt.subplot(222)
sns.kdeplot(B).set_title("B (regular)")

ax3 = plt.subplot(223)
stats.probplot(A, dist = 'norm', plot = ax3)
ax3.set_title("A (premium)")

ax4 = plt.subplot(224)
stats.probplot(B, dist = 'norm', plot = ax4)
ax4.set_title("B (regular)")

fig.tight_layout()
plt.show() 

In [None]:
result_vartest = levene(A, B)
print("Levene statistic = %.4f" % result_vartest.statistic)
print("P-value          = %.4f" % result_vartest.pvalue)

In [None]:
result = ttest_ind(A, B, value = 0, alternative = 'larger', usevar = 'pooled')
print("Calculated t = %.3f" % result[0])
print("P-value      = %.4f" % result[1])
print("df           = %.2f" % result[2])

<a name="ch7_paired_ttest_func"></a>

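## Paired T-test Function (scipy)
- **[Manual: scipy.stats.ttest_1samp](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html)**
- A paired t-test is a one-sample t-test on the differences D, so `ttest_1samp` can be applied to D directly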

<a name="ch7_ex_neck"></a>

### Example : neck-shoulder disorder
- Hypothesis
    >- H<sub>0</sub> : $\mu$<sub>D</sub> = 0
    >- H<sub>1</sub> : $\mu$<sub>D</sub> $\ne$ 0
    >- D = time_before - time_after
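
Because the hypotheses are stated on the differences D, the paired t-test reduces to a one-sample t-test on D. The sketch below checks this on toy data (hypothetical values, not the study data) by comparing a manual calculation with `scipy.stats.ttest_1samp` and `scipy.stats.ttest_rel`. The variable names are chosen for illustration only.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

# Toy paired measurements, for illustration only
x_before = np.array([10, 12, 9, 11, 13], dtype=float)
x_after  = np.array([ 8, 11, 9, 10, 10], dtype=float)
diff = x_before - x_after

# Manual one-sample t statistic on the differences
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(diff.size))

res_1samp = ttest_1samp(diff, 0)          # one-sample test on D
res_rel   = ttest_rel(x_before, x_after)  # paired test on the raw pairs

print("Manual T      = %.4f" % t_manual)
print("ttest_1samp T = %.4f" % res_1samp.statistic)
print("ttest_rel T   = %.4f" % res_rel.statistic)
```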

In [None]:
before = np.array( [81, 87, 86, 82, 90, 86, 96, 73, 
                    74, 75, 72, 80, 66, 72, 56, 82] )

after  = np.array( [78, 91, 78, 78, 84, 67, 92, 70, 
                    58, 62, 70, 58, 66, 60, 65, 73] )

D = before - after
print("D >> sample size = %d, mean = %.2f, std = %.3f \n" % 
      (D.size, D.mean(), D.std(ddof=1))
     )

fig = plt.figure( figsize = (3,2) )
stats.probplot(D, dist = 'norm', plot = plt)
plt.show() 

In [None]:
result = ttest_1samp(D, 0, alternative = 'two-sided')
print("Calculated T = %.2f" % result.statistic)
print("Df           = %.0f" % result.df)
print("P-value      = %.3f" % result.pvalue)

<a name="ch7_f"></a>

## F Distribution
- **[Manual: scipy.stats.f](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f.html)**
    > - For F<sub>(dfn)(dfd)</sub>
    > - Default loc = 0
    > - Default scale = 1
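
Since `ppf` takes the left-tail (cumulative) area, an upper-tail critical value F<sub>α,(dfn),(dfd)</sub> can be obtained either as `f.ppf(1 - alpha, dfn, dfd)` or, equivalently, as `f.isf(alpha, dfn, dfd)`. A minimal sketch, with α = 0.05 and the degrees of freedom chosen for illustration:

```python
from scipy.stats import f

alpha, dfn, dfd = 0.05, 6, 10

crit_ppf = f.ppf(1 - alpha, dfn, dfd)    # inverse CDF at left-tail area 1 - alpha
crit_isf = f.isf(alpha, dfn, dfd)        # inverse survival function at right-tail area alpha
right_area = f.sf(crit_ppf, dfn, dfd)    # right-tail area beyond the critical value

print("Critical value via ppf = %.4f" % crit_ppf)
print("Critical value via isf = %.4f" % crit_isf)
print("Right-tail area        = %.4f" % right_area)
```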

<a name="ch7_ex_f"></a>

### Example : F and inverse
> - Q1 : F<sub>0.05,(6),(10)</sub>
> - Q2 : F<sub>0.95,(6),(10)</sub>
> - Q3 : F<sub>0.05,(10),(6)</sub>
> - Q4 : F<sub>0.95,(10),(6)</sub>

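**Recall that the first subscript is the right-hand-side (upper-tail) area, whereas `f.ppf` takes the left-hand-side area, hence the `1 - ...` in the code**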
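
The "inverse" in this example refers to the reciprocal property of the F distribution, F<sub>1-α,(dfn),(dfd)</sub> = 1 / F<sub>α,(dfd),(dfn)</sub>. This is why the reciprocal of Q1 matches Q4 and the reciprocal of Q3 matches Q2. A small numerical check using the same degrees of freedom as the questions above; the helper name `f_upper` is hypothetical:

```python
from scipy.stats import f

def f_upper(alpha, dfn, dfd):
    """Critical value whose right-tail area is alpha (hypothetical helper)."""
    return f.ppf(1 - alpha, dfn, dfd)

lhs = f_upper(0.95, 10, 6)          # F with right-tail area 0.95, df = (10, 6)
rhs = 1 / f_upper(0.05, 6, 10)      # reciprocal with alpha and df swapped

print("F (0.95, 10, 6)     = %.4f" % lhs)
print("1 / F (0.05, 6, 10) = %.4f" % rhs)
```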

In [None]:
Q1 = f.ppf(1-0.05, 6, 10)
Q2 = f.ppf(1-0.95, 6, 10)
Q3 = f.ppf(1-0.05, 10, 6)
Q4 = f.ppf(1-0.95, 10, 6)

print("F (0.05, 6, 10) = %.2f --> inverse = %.2f" % (Q1, 1/Q1) )
print("F (0.95, 6, 10) = %.2f --> inverse = %.2f" % (Q2, 1/Q2) )
print("F (0.05, 10, 6) = %.2f --> inverse = %.2f" % (Q3, 1/Q3) )
print("F (0.95, 10, 6) = %.2f --> inverse = %.2f" % (Q4, 1/Q4) )