In [1]:
import numpy as np
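import scipy as sc
import scipy.stats  # make sure sc.stats is loaded (a bare `import scipy` may not load it in older SciPy)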
import pandas as pd

Source: [Stat Trek: AP Statistics | Hypothesis Tests](https://stattrek.com/hypothesis-test/difference-in-means?tutorial=AP)

### Hypothesis Test: Difference Between Means

Let there be two populations we would like to compare, with means $\mu_1$ and $\mu_2$. Then we can state the null and alternative hypotheses in the following manner:<br><br>

$\displaystyle
H_0: \mu_1 - \mu_2 = d \\[1em]
H_a: \mu_1 - \mu_2 \neq d \qquad \text{or} \qquad \mu_1 - \mu_2 > d \qquad \text{or} \qquad \mu_1 - \mu_2 < d
$ 

To conduct a **two-sample t-test** to determine whether the difference between the sample means is significantly different from the hypothesized difference between the population means, we need to calculate the following statistics.
1. *Standard Error*<br><br>
$\displaystyle SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
<br><br>where $s_1$ and $s_2$ are the sample standard deviations and $n_1$ and $n_2$ are the sample sizes.

2. *Degrees of freedom*<br><br>
$\displaystyle DF = \left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2 / \left( \frac{s_1^4}{n_1^2(n_1-1)} + \frac{s_2^4}{n_2^2(n_2-1)} \right)$<br><br>
rounded to the closest whole number. The above formula gives an accurate value, but a simpler way of approximating DF is to take the smaller of $n_1 - 1$ and $n_2 - 1$.

3. *Test Statistic*:<br><br>
$\displaystyle t = \frac{(\bar x_1 - \bar x_2) - d}{SE}$
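
The three quantities above can be wrapped into a small helper. This is a minimal sketch, not part of the Stat Trek tutorial: the function name `welch_from_stats` is hypothetical, it works from summary statistics, and it keeps the degrees of freedom unrounded. For a hypothesized difference of 0, SciPy's `scipy.stats.ttest_ind_from_stats` with `equal_var=False` runs the same unequal-variance test and serves as a cross-check. The numbers are the summary statistics of Problem 1 below.

```python
import numpy as np
from scipy import stats

def welch_from_stats(mean1, std1, n1, mean2, std2, n2, d=0.0):
    """Return (SE, DF, t) for a two-sample t-test with unequal variances."""
    v1, v2 = std1**2 / n1, std2**2 / n2      # s_i^2 / n_i for each sample
    se = np.sqrt(v1 + v2)                    # standard error
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))  # unrounded DF
    t = (mean1 - mean2 - d) / se             # test statistic
    return se, df, t

# Summary statistics from Problem 1 (class A vs. class B)
se, df, t = welch_from_stats(78, 10, 30, 85, 15, 25)
p_two_sided = 2 * stats.t.sf(abs(t), df)     # two-tailed p-value

# Cross-check against SciPy (only valid for a hypothesized difference of 0)
ref = stats.ttest_ind_from_stats(78, 10, 30, 85, 15, 25, equal_var=False)
print(f"SE={se:.4f}, DF={df:.2f}, t={t:.4f}, p={p_two_sided:.4f}")
print(f"SciPy: t={ref.statistic:.4f}, p={ref.pvalue:.4f}")
```

Because the degrees of freedom are not rounded here, the p-value can differ in the later decimals from a calculation that rounds DF to a whole number.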

#### Examples

**Problem1**: Test scores of two classes were compared, with the following data:
- Class A : 30 students | average test score 78 | standard deviation 10.
- Class B : 25 students | average test score 85 | standard deviation 15.

Test the hypothesis that both classes have the same mean score, using a 0.10 level of significance.

**Solution**:
1. $H_0: \mu_1 - \mu_2 = 0 \\
H_a: \mu_1 - \mu_2 \neq 0$

In [2]:
n1, mean1, std1 = 30, 78, 10
n2, mean2, std2 = 25, 85, 15
alpha = 0.1
d = 0

# standard error
se = (std1**2/n1 + std2**2/n2)**0.5
# degrees of freedom
deg_fm = np.round((std1**2/n1 + std2**2/n2)**2 / ((std1**2/n1)**2/(n1-1) + (std2**2/n2)**2/(n2-1)))
# test statistic
test_stats = (mean1 - mean2 - d)/se
# critical values for two tailed test
critical_vals = [sc.stats.t.ppf(alpha/2, deg_fm), sc.stats.t.ppf(1-alpha/2, deg_fm)]
# two-tailed p-value: 2 * P(T > |test statistic|)
pval = 2*(1-sc.stats.t.cdf(np.abs(test_stats), deg_fm))

print(
    f'Standard error: {se:.4f}',
    f'\nDegrees of freedom: {deg_fm}',
    f'\nTest Statistic: {test_stats:.4f}',
    f'\nCritical Values: ({critical_vals[0]:.4f},{critical_vals[1]:.4f})',
    f'\nP-value: {pval:.4f}',
    f'\nSignificance level: {alpha}',
)

Standard error: 3.5119
Degrees of freedom: 40.0
Test Statistic: -1.9932
Critical Values: (-1.6839,1.6839)
P-value: 0.0531
Significance level: 0.1


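The test statistic falls in the rejection region ($t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$; here $-1.9932 < -1.6839$), and the p-value (0.0531) is below the significance level (0.10). The null hypothesis is therefore rejected, and we conclude that the mean scores of class A and class B are different.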

**Problem2**: A battery-making company has developed a new battery. The engineer claims the new battery lasts 7 minutes longer than the old battery.<br>
To test the claim, 100 new batteries are tested against 100 old batteries. Consider the following results:
- Battery new: $\bar x_1$ = 200 minutes | $s_1$ = 40 minutes.
- Battery old: $\bar x_2$ = 190 minutes | $s_2$ = 20 minutes.

Test the engineer's claim that the new batteries run at least 7 minutes longer than the old. Use a 0.05 level of significance.

**Solution**: <br>
$H_0 : \mu_1 - \mu_2 \leq 7 \\
H_a : \mu_1 - \mu_2 > 7$

In [3]:
n1, mean1, std1 = 100, 200, 40
n2, mean2, std2 = 100, 190, 20
alpha = 0.05
d = 7

# standard error
se = (std1**2/n1 + std2**2/n2)**0.5
# degrees of freedom
deg_fm = np.round((std1**2/n1 + std2**2/n2)**2 / ((std1**2/n1)**2/(n1-1) + (std2**2/n2)**2/(n2-1)))
# test statistic
test_stats = (mean1 - mean2 - d)/se
# critical value for right-tailed test
critical_vals = [sc.stats.t.ppf(1-alpha, deg_fm)]
# right-tailed p-value: P(T > test statistic)
pval = 1-sc.stats.t.cdf(test_stats, deg_fm)
print(
    f'Standard error: {se:.4f}',
    f'\nDegrees of freedom: {deg_fm}',
    f'\nTest Statistic: {test_stats:.4f}',
    f'\nCritical Value: {critical_vals[0]:.4f}',
    f'\nP-value: {pval:.4f}',
    f'\nSignificance level: {alpha}',
)

Standard error: 4.4721
Degrees of freedom: 146.0
Test Statistic: 0.6708
Critical Value: 1.6554
P-value: 0.2517
Significance level: 0.05


Since the test statistic does not fall in the rejection region (0.6708 < 1.6554), the null hypothesis cannot be rejected: the data do not support the engineer's claim.

### Hypothesis Test: Difference Between Paired Means

To conduct a **matched-pairs t-test** to determine whether the mean difference between paired observations is significantly different from the hypothesized difference, we follow these steps:
1. **Standard deviation**:<br><br>
$s_d = \sqrt{\frac{\sum (d_i-\bar d)^2}{n-1}}$
<br><br>where $d_i$ is the difference for pair $i$, $\bar d$ is the sample mean of the differences, and $n$ is the number of paired values.

2. **Standard Error**:<br><br>
$SE = s_d \sqrt{\frac{1}{n} \left[\frac{N-n}{N-1} \right]}$
<br><br>where $s_d$ is the standard deviation of the sample differences, *N* is the number of matched pairs in the population and *n* is the number of matched pairs in the sample. When the population size is much larger (at least 20 times larger) than the sample size, the standard error can be approximated by: <br><br>
$SE = \frac{s_d}{\sqrt{n}}$

3. **Degrees of Freedom**: $DF = n-1$.

4. **Test Statistic**: <br><br>
$t = \frac{(\bar x_1 - \bar x_2) - D}{SE}= \frac{\bar d - D}{SE}$
<br><br>where $\bar x_1$ is the mean of sample 1, $\bar x_2$ is the mean of sample 2, $\bar d$ is the mean difference between paired values in the sample, $D$ is the hypothesized difference between population means, and $SE$ is the standard error.
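
The steps above can be sketched as a short function. This is only an illustration under stated assumptions: the helper name `paired_t` and its `N` parameter (population size in pairs, `None` when the population is effectively infinite) are hypothetical, and the data are made up rather than taken from the tutorial. Without the finite population correction and with $D = 0$, the result should agree with `scipy.stats.ttest_rel`.

```python
import numpy as np
from scipy import stats

def paired_t(x1, x2, D=0.0, N=None):
    """Return (SE, DF, t) for a matched-pairs t-test."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    n = diff.size
    s_d = diff.std(ddof=1)                          # sample std of the differences
    fpc = 1.0 if N is None else (N - n) / (N - 1)   # finite population correction
    se = s_d * np.sqrt(fpc / n)                     # standard error
    return se, n - 1, (diff.mean() - D) / se

# Made-up illustrative data, not from the tutorial
before = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0]
after = [13.0, 17.0, 11.0, 15.0, 15.0, 16.0]

se, df, t = paired_t(after, before)
p_two_sided = 2 * stats.t.sf(abs(t), df)            # two-tailed p-value

ref = stats.ttest_rel(after, before)                # SciPy's paired t-test (D = 0)
print(f"SE={se:.4f}, DF={df}, t={t:.4f}, p={p_two_sided:.4f}")
print(f"SciPy: t={ref.statistic:.4f}, p={ref.pvalue:.4f}")

# With a small population (here N = 60 pairs) the correction shrinks SE
se_fpc, _, _ = paired_t(after, before, N=60)
```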

#### Examples

**Problem1**: Forty-four sixth graders were randomly selected from a school district. Then, they were divided into 22 matched pairs, each pair having equal IQs. One member of each pair was randomly selected to receive special training. Then, all of the students were given an IQ test. Test results are summarized below.

In [4]:
df = pd.DataFrame({
    'Training':   [95,89,76,92,91,53,67,88,75,85,90,85,87,85,85,68,81,84,71,46,75,80],
    'No training':[90,85,73,90, 90, 53,68,90,78,89,95,83,83,83,82,65,79,83,60,47,77,83]
})
df['Diff, d'] = df['Training'] - df['No training']
df['(d-dmean)^2'] = (df['Diff, d'] - df['Diff, d'].mean())**2
df

Unnamed: 0,Training,No training,"Diff, d",(d-dmean)^2
0,95,90,5,16.0
1,89,85,4,9.0
2,76,73,3,4.0
3,92,90,2,1.0
4,91,90,1,0.0
5,53,53,0,1.0
6,67,68,-1,4.0
7,88,90,-2,9.0
8,75,78,-3,16.0
9,85,89,-4,25.0


Do these results provide evidence that the special training helped or hurt student performance? Use an 0.05 level of significance. Assume that the mean differences are approximately normally distributed.

**Solution**:<br>
$H_0: \mu_d = 0 \\
H_a: \mu_d \neq 0$

In [5]:
n = len(df)                                         # sample size
d = 0                                               # hypothesized mean difference
alpha = 0.05                                        # significance level

s = (df['(d-dmean)^2'].sum()/(n-1))**0.5            # standard deviation of the differences
se = s/n**0.5                                       # standard error
test_stats = (df['Diff, d'].mean() - d)/se          # test statistic

critical_vals = [sc.stats.t.ppf(alpha/2, n-1), 
                 sc.stats.t.ppf(1-alpha/2, n-1)]    # critical values for two tailed test
pval = 2*(1-sc.stats.t.cdf(np.abs(test_stats), n-1))  # two-tailed p-value

print(
    f'Standard error: {se:.4f}',
    f'\nDegrees of freedom: {n-1}',
    f'\nTest Statistic: {test_stats:.4f}',
    f'\nCritical Values: ({critical_vals[0]:.4f},{critical_vals[1]:.4f})',
    f'\nP-value: {pval:.4f}',
    f'\nSignificance level: {alpha:.4f}',
)

Standard error: 0.7645
Degrees of freedom: 21
Test Statistic: 1.3081
Critical Values: (-2.0796,2.0796)
P-value: 0.2050
Significance level: 0.0500


Since the test statistic does not fall in the rejection region ($t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$; here $-2.0796 < 1.3081 < 2.0796$), we cannot reject the null hypothesis.

### Chi-Square Goodness of Fit Test

The **chi-square goodness of fit test** is applied when you have one categorical variable from a single population. It is used to determine whether sample data are consistent with a hypothesized distribution. Steps involved:
1. State the Hypotheses: <br>
$H_0:$ The data are consistent with a specified distribution. <br>
$H_a:$ The data are not consistent with a specified distribution.

2. State significance level.

3. **Degrees of freedom**: The degrees of freedom (DF) is equal to the number of levels (k) of the categorical variable minus 1. <br>$DF = k - 1$ 

4. **Expected frequency counts**: The expected frequency counts at each level of the categorical variable are equal to the sample size times the hypothesized proportion from the null hypothesis<br>
$E_i = np_i$

5. **Test statistic**: The test statistic is a chi-square random variable ($\chi^2$) defined by the following equation. <br><br>
$\chi^2 = \sum \left[\frac{(O_i - E_i)^2}{E_i}\right]$
<br><br>where $O_i$ is the observed frequency count for the $i$th level of the categorical variable, and $E_i$ is the expected frequency count for the $i$th level of the categorical variable.

6. **P-value**: The P-value is the probability of observing a sample statistic at least as extreme as the test statistic. For the chi-square test this is the right-tail probability $P(\chi^2 \geq \text{test statistic})$.
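
As a cross-check on these steps, SciPy provides `scipy.stats.chisquare`, which takes observed and expected counts and returns the statistic together with the right-tail p-value, using $k-1$ degrees of freedom by default. The sketch below uses made-up counts (a die rolled 120 times under a fair-die hypothesis), not data from the tutorial, and compares the manual calculation with SciPy's result.

```python
import numpy as np
from scipy import stats

observed = np.array([25, 17, 15, 23, 24, 16])   # made-up counts, n = 120
p_null = np.full(6, 1 / 6)                      # hypothesized proportions
expected = observed.sum() * p_null              # E_i = n * p_i

chi2_manual = np.sum((observed - expected) ** 2 / expected)
df = observed.size - 1                          # DF = k - 1
p_manual = stats.chi2.sf(chi2_manual, df)       # right-tail probability

ref = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"manual: chi2={chi2_manual:.4f}, DF={df}, p={p_manual:.4f}")
print(f"scipy : chi2={ref.statistic:.4f}, p={ref.pvalue:.4f}")
```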

#### Example

**Problem**: A lottery company makes a claim about how its prizes are distributed:
- Claimed: 60% of prizes are low prizes | 30% are medium prizes | 10% are high prizes.
- Observed in a random sample of 100 prizes: the low prize is won 45 times | the medium prize 50 times | the high prize 5 times.

Is the company's claim consistent with the sample? Use a 0.05 level of significance.

**Solution**:<br>
Hypotheses:<br>
- *Null hypothesis*: The proportion of low, mid and high prize is 60%, 30% and 10%.
- *Alternative hypothesis*: At least one of the proportions in the null hypothesis is false.

In [6]:
n = 100                                              # sample size
alpha = 0.05                                         # significance level
k = 3                                                # possible categories
deg_fm = k - 1                                       # degrees of freedom
p = np.array([0.6, 0.3, 0.1])                        # proportions for each category
E = n * p                                            # expected frequency counts
O = np.array([45, 50, 5])                            # observed frequency count
test_stats = np.sum((O-E)**2/E)                      # test statistic
critical_vals = [sc.stats.chi2.ppf(1-alpha, deg_fm)] # critical values
pval = 1-sc.stats.chi2.cdf(test_stats, deg_fm)       # p-value

print(
    f'Degrees of freedom: {deg_fm}',
    f'\nTest Statistic: {test_stats:.4f}',
    f'\nCritical Value: {critical_vals[0]:.4f}',
    f'\nP-value: {pval:.6f}',
    f'\nSignificance level: {alpha}',
)

Degrees of freedom: 2
Test Statistic: 19.5833
Critical Value: 5.9915
P-value: 0.000056
Significance level: 0.05


Since the test statistic falls in the rejection region (19.58 > 5.99), the null hypothesis is rejected and we infer that the company's claim is not consistent with the sample.

### Chi-Square Test of Homogeneity

- This test is applied to **single categorical variable from two or more different populations** 
- It is used to determine whether frequency counts are distributed identically across different populations.

Steps:
1. State the Hypotheses: Suppose that data were sampled from r populations, and assume that the categorical variable had c levels. At any specified level of the categorical variable, the null hypothesis states that each population has the same proportion of observations

2. Choose significance level. 

3. **Degrees of freedom**: $DF = (r - 1)\times(c - 1) $ <br>where r is the number of populations, and c is the number of levels for the categorical variable.

4. **Expected frequency counts**: The expected frequency counts are computed separately for each population at each level of the categorical variable.<br><br>
$E_{r,c} = \frac{n_r n_c}{n}$
<br><br>where $E_{r,c}$ is the expected frequency count for population *r* at level *c* of the categorical variable, $n_r$ is the total number of observations from population r, $n_c$ is the total number of observations at treatment level *c* and *n* is the total sample size.

5. **Test Statistic** :<br><br>
$\chi^2 = \sum \left[\frac{(O_{r,c}-E_{r,c})^2}{E_{r,c}}\right]$
<br><br>where $O_{r,c}$ is the observed frequency count in population *r* for level *c* of the categorical variable.
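
These steps can be cross-checked with `scipy.stats.chi2_contingency`, which takes an $r \times c$ table of observed counts and returns the statistic, p-value, degrees of freedom and expected counts. The sketch below uses made-up counts for two populations and three levels. `correction=False` switches off Yates' continuity correction, which SciPy would otherwise apply only when DF = 1.

```python
import numpy as np
from scipy import stats

# Made-up observed counts: rows = populations, columns = levels
observed = np.array([[30, 50, 20],
                     [45, 35, 20]])

chi2, pval, dof, expected = stats.chi2_contingency(observed, correction=False)

# The same expected counts from the formula E[r, c] = n_r * n_c / n
n_r = observed.sum(axis=1)                      # totals per population
n_c = observed.sum(axis=0)                      # totals per level
expected_manual = np.outer(n_r, n_c) / observed.sum()
chi2_manual = np.sum((observed - expected_manual) ** 2 / expected_manual)

print(f"chi2={chi2:.4f}, DF={dof}, p={pval:.4f}")
print("expected counts match:", np.allclose(expected, expected_manual))
```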

#### Example

**Problem**: A survey was conducted on movie choices. Three movies were offered: *Interstellar*, *Titanic*, and *Pride and Prejudice*. The sample consisted of 100 males and 200 females. Do the preferences of males differ significantly from those of females? Use a 0.05 level of significance.

In [7]:
df = pd.DataFrame({
    'Gender':['Male', 'Female', 'Total'],
    'Interstellar':[50, 50, 100],
    'Titanic':[30, 80, 110],
    'Pride and Prejudice': [20, 70, 90]
})
df.index = df.Gender
df = df.drop(['Gender'], axis=1)
df['Total'] = df['Interstellar'] + df['Titanic'] + df['Pride and Prejudice']
df

Gender,Interstellar,Titanic,Pride and Prejudice,Total
Male,50,30,20,100
Female,50,80,70,200
Total,100,110,90,300


**Solution** <br>
*Null Hypothesis*: For every movie considered, the proportion of males who prefer it equals the proportion of females who prefer it. <br>
*Alternative Hypothesis*: For at least one movie, the proportions of males and females who prefer it are not the same.

In [8]:
r = 2                                                   # male and female
c = 3                                                   # no of movies
nr = np.array([100, 200])                               # total number of observations from population
nc = np.array([100, 110, 90])                           # total number of observations at treatment level c
O = np.array([[50,30,20], [50, 80, 70]])                # observed frequency count
deg_fm = (r-1)*(c-1)                                    # degrees of freedom
E = np.outer(nr, nc) / nr.sum()                         # expected frequency count
test_stats = np.sum((O-E)**2/E)                         # test statistic

alpha = 0.05                                            # significance level
critical_vals = [sc.stats.chi2.ppf(1-alpha, deg_fm)]    # critical values
pval = 1 - sc.stats.chi2.cdf(test_stats, deg_fm)        # p-value

print(
    f'\nDegrees of freedom: {deg_fm}',
    f'\nTest Statistic: {test_stats:.4f}',
    f'\nCritical Value: {critical_vals[0]:.4f}',
    f'\nP-value: {pval:.6f}',
    f'\nSignificance level: {alpha}'
)


Degrees of freedom: 2
Test Statistic: 19.3182
Critical Value: 5.9915
P-value: 0.000064
Significance level: 0.05


Since the test statistic lies in the rejection region (19.32 > 5.99), we reject the null hypothesis and conclude that, for at least one movie, the proportion who prefer it differs between males and females.

### Chi-Square Test of Independence

**Chi-square test for independence** is applied when you have two categorical variables from a single population. It is used to determine whether there is a significant association between the two variables.

Steps:
1. *State the hypotheses*: <br><br>
$H_0$: Variable A and Variable B are independent.<br>
$H_a$: Variable A and Variable B are not independent.

2. Set a *Significance level*.

3. Let *r* be the number of levels for one categorical variable and <br>
*c* be the number of levels for the other categorical variable.

4. *Degrees of freedom*: $DF = (r-1)\times(c-1)$

5. *Expected frequencies*:  Compute $r\times c$ expected frequencies, according to the following formula. <br><br>
$\displaystyle E_{r,c} = \frac{(n_r n_c)}{n}$
<br><br>where $n_r$ is the total number of sample observations at level *r* of Variable A, $n_c$ is the total number of sample observations at level c of Variable B, and *n* is the total sample size.

6. *Test Statistic*: <br><br>
$\displaystyle \chi^2 = \sum \left[ \frac{(O_{r,c} - E_{r,c})^2}{E_{r,c}}\right]$
<br><br>where $O_{r,c}$ is the observed frequency count.
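
When the data arrive as one record per respondent rather than as a finished table, the contingency table has to be built first. The following sketch assumes made-up survey records with hypothetical column names `group` and `answer`. It uses `pandas.crosstab` to count the combinations and then applies the formulas above directly.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up raw survey records: one row per respondent
raw = pd.DataFrame({
    "group":  ["A"] * 40 + ["B"] * 60,
    "answer": ["yes"] * 25 + ["no"] * 15 + ["yes"] * 20 + ["no"] * 40,
})

table = pd.crosstab(raw["group"], raw["answer"])        # r x c observed counts
O = table.to_numpy()
E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()    # E[r, c] = n_r * n_c / n

r, c = O.shape
dof = (r - 1) * (c - 1)                                 # degrees of freedom
chi2 = np.sum((O - E) ** 2 / E)                         # test statistic
pval = stats.chi2.sf(chi2, dof)                         # right-tail p-value

print(table)
print(f"chi2={chi2:.4f}, DF={dof}, p={pval:.4f}")
```

`crosstab` orders the levels alphabetically, so the columns here come out as `no`, `yes`.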

#### Example

A public opinion poll surveyed a simple random sample of 1000 voters. Respondents were classified by gender (male or female) and by voting preference (Republican, Democrat, or Independent). Results are shown in the contingency table below.

In [9]:
df = pd.DataFrame({
    'Republican':[200,250,450],
    'Democratic':[150,300,450],
    'Independent':[50,50,100],
    'Row Total':[400,600,1000]
})
df.index = ['Male','Female','Col Total']
df

Unnamed: 0,Republican,Democratic,Independent,Row Total
Male,200,150,50,400
Female,250,300,50,600
Col Total,450,450,100,1000


Is there a gender gap? Do the men's voting preferences differ significantly from the women's preferences? Use a 0.05 level of significance.

*Solution*:<br>
$H_0$:  Gender and voting preferences are independent.<br>
$H_a$: Gender and voting preferences are not independent.


In [10]:
r = 2                                                   # number of levels of gender
c = 3                                                   # number of levels of voting preference
deg_fm = (r-1)*(c-1)                                    # degrees of freedom
nr = np.array([400, 600])                               # row totals: male and female respondents
nc = np.array([450, 450, 100])                          # column totals: votes per party, males and females combined
E = np.outer(nr,nc) / nr.sum()                          # expected frequency
O = np.array([[200,150,50],[250,300,50]])               # observed frequency
test_stats = np.sum((O-E)**2/E)                         # test statistic

alpha = 0.05                                            # significance level
critical_vals = [sc.stats.chi2.ppf(1-alpha, deg_fm)]    # critical value
pval = 1-sc.stats.chi2.cdf(test_stats, deg_fm)          # p-value

print(
    f'\nDegrees of freedom: {deg_fm}',
    f'\nTest Statistic: {test_stats:.4f}',
    f'\nCritical Value: {critical_vals[0]:.4f}',
    f'\nP-value: {pval:.6f}',
    f'\nSignificance level: {alpha}'
)


Degrees of freedom: 2
Test Statistic: 16.2037
Critical Value: 5.9915
P-value: 0.000303
Significance level: 0.05


Since the test statistic falls in the rejection region (16.2 > 5.99), the null hypothesis is rejected and we conclude that there is a relationship between gender and voting preference.

### Hypothesis Test for Regression Slope

In this section we learn how to conduct a hypothesis test to determine whether there is a significant linear relationship between an independent variable X and a dependent variable Y.

The test focuses on the slope of the regression line

$Y = B_0 + B_1X$

where $B_0$ is a constant, $B_1$ is the slope (also called the regression coefficient), $X$ is the value of the independent variable, and $Y$ is the value of the dependent variable.

If we find that the slope of the regression line is significantly different from zero, we will conclude that there is a significant relationship between the independent and dependent variables.

Steps:
1. $H_0: B_1 = 0 \\ H_a: B_1 \neq 0$

2. Choose Significance Level.

3. *Standard Error*: <br><br>
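$\displaystyle SE = \frac{\sqrt{\frac{1}{n-2} \sum_{i=1}^n (y_i-\hat y_i)^2}}{\sqrt{\sum_{i=1}^n (x_i - \bar x)^2}}$
<br><br>where $y_i$ is the observed value of the dependent variable for observation $i$, $\hat y_i$ is its estimated (fitted) value, $x_i$ is the observed value of the independent variable, $\bar x$ is the mean of the independent variable, and $n$ is the number of observations.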

4. *Degrees of freedom*: $DF = n-2$

5. *Test statistic*: <br><br>
 $t = \frac{b_1}{SE}$
 <br><br> where $b_1$ is the slope of the sample regression line.

6. Test statistic is compared to critical value obtained from *t-distribution*.
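
When raw data are available instead of regression output, the slope and its standard error can be computed directly. The sketch below uses made-up data, not data from the tutorial, and the standard residual-based formula $SE = \sqrt{\sum (y_i-\hat y_i)^2/(n-2)} \,/\, \sqrt{\sum (x_i-\bar x)^2}$. It then compares the result with `scipy.stats.linregress`, which reports the slope, its standard error and a two-sided p-value.

```python
import numpy as np
from scipy import stats

# Made-up illustrative data, not from the tutorial
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.7])
n = x.size

b1, b0 = np.polyfit(x, y, 1)                  # least-squares slope and intercept
resid = y - (b0 + b1 * x)                     # y_i - y_hat_i
se = np.sqrt(np.sum(resid**2) / (n - 2)) / np.sqrt(np.sum((x - x.mean())**2))
t = b1 / se                                   # test statistic, DF = n - 2
pval = 2 * stats.t.sf(abs(t), n - 2)          # two-tailed p-value

ref = stats.linregress(x, y)
print(f"manual: b1={b1:.4f}, SE={se:.4f}, t={t:.2f}, p={pval:.3g}")
print(f"scipy : b1={ref.slope:.4f}, SE={ref.stderr:.4f}, p={ref.pvalue:.3g}")
```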

#### Example

**Problem**: The local utility company surveys 101 randomly selected customers. For each survey participant, the company collects the following: annual electric bill (in dollars) and home size (in square feet). Output from a regression analysis appears below.

In [11]:
df = pd.DataFrame({
    'Predictor':['constant', 'home size'],
    'Coef':[15, 0.55],
    'SE Coef': [3, 0.24],
    'T':[5.0, 2.29],
    'P':[0, 0.01]
})
df

Unnamed: 0,Predictor,Coef,SE Coef,T,P
0,constant,15.0,3.0,5.0,0.0
1,home size,0.55,0.24,2.29,0.01


Is there a significant linear relationship between annual bill and home size? Use a 0.05 level of significance.

**Solution**: <br>
$H_0$: The slope of the regression line is equal to zero.<br>
$H_a$: The slope of the regression line is not equal to zero. 

Note: We consider a two-tailed test in this example.

In [12]:
b1 = 0.55                                               # slope
se = 0.24                                               # standard error, given
n = 101                                                 # number of samples
deg_fm = n-2                                            # degrees of freedom
test_stats = b1/se                                      # test statistic
alpha = 0.05                                            # significance level
critical_vals = [sc.stats.t.ppf(alpha/2, deg_fm), 
                 sc.stats.t.ppf(1-alpha/2, deg_fm)]     # critical values
pval = 2*(1-sc.stats.t.cdf(np.abs(test_stats), deg_fm)) # p-value

print(
    f'\nDegrees of freedom: {deg_fm}',
    f'\nTest Statistic: {test_stats:.4f}',
    f'\nCritical Values: ({critical_vals[0]:.4f},{critical_vals[1]:.4f})',
    f'\nP-value: {pval:.6f}',
    f'\nSignificance level: {alpha}',
)


Degrees of freedom: 99
Test Statistic: 2.2917
Critical Values: (-1.9842,1.9842)
P-value: 0.024043
Significance level: 0.05


Since the test statistic falls in the rejection region (2.29 > 1.98), we reject the null hypothesis and conclude that the slope of the regression line is not zero, that is, there is a significant linear relationship between annual bill and home size.