# Hypothesis Testing

## One Sample T Test

It is a statistical procedure used to compare the mean of a sample to a known (or hypothesized) population mean. `stats.ttest_1samp()`

$$t = \dfrac{\bar x - \mu}{\dfrac{s}{\sqrt n}} $$

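It is used when the population standard deviation is unknown, most commonly with small samples (rule of thumb: a sample size **less than or equal to 30**). Here $\bar x$ is the sample mean, $\mu$ the population mean, $s$ the sample standard deviation and $n$ the sample size.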

### Degree of Freedom

It is the number of values in the final calculation of a statistic that are free to vary. For a one sample t test it is calculated as $n - 1$.


<hr>

**EXAMPLE:**
Given the resting systolic blood pressure of 20 first-year resident female doctors, compare their mean to the general population mean of 120 mmHg.

<u>Solution</u>

**Null Hypothesis:** There is no significant difference between the blood pressures of the resident female doctors and the general population.

**Alternate:** There is a statistically significant difference between the blood pressure of the resident female doctors and the general population.

In [1]:
from scipy import stats

In [2]:
female_doctor_bps = [128, 127, 118, 115, 144, 142, 133, 140, 132, 131, 
                     111, 132, 149, 122, 139, 119, 136, 129, 126, 128]

# one sample t-test
stats.ttest_1samp(female_doctor_bps,120)

Ttest_1sampResult(statistic=4.512403659336718, pvalue=0.00023838063630967753)

The pvalue is less than 0.05. Hence, reject the null hypothesis. *There is a statistically significant difference between the resting systolic blood pressure of the resident female doctors and the general population.*
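
As a cross-check, the t statistic reported above can be reproduced directly from the formula. This is a minimal sketch that uses only Python's standard library; the names `pop_mean`, `std_err`, `t_stat` and `dof` are illustrative and not part of the original notebook.

```python
import math
import statistics

female_doctor_bps = [128, 127, 118, 115, 144, 142, 133, 140, 132, 131,
                     111, 132, 149, 122, 139, 119, 136, 129, 126, 128]
pop_mean = 120  # general population mean (mmHg) from the example

n = len(female_doctor_bps)
sample_mean = statistics.mean(female_doctor_bps)
sample_std = statistics.stdev(female_doctor_bps)  # sample std, divides by n - 1

# Standard error of the mean, then the t statistic from the formula
std_err = sample_std / math.sqrt(n)
t_stat = (sample_mean - pop_mean) / std_err
dof = n - 1

print(f"t = {t_stat:.4f}, degrees of freedom = {dof}")
```

The printed statistic should agree with the `statistic` field returned by `stats.ttest_1samp()`.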

<hr>

## Two Sample T Test

It is the statistical procedure used to compare the means of two independent samples. `stats.ttest_ind()`

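$$t = \dfrac{\bar x_1 - \bar x_2}{\sqrt{s_p^2(\frac{1}{n_1} + \frac{1}{n_2})}}$$

where $s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$ is the pooled variance of the two samples.

**DOF:** $n_1 + n_2 - 2$.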
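
A minimal sketch of the pooled two sample t statistic, i.e. the difference of the sample means divided by the pooled standard error. The helper name `pooled_t_statistic` and the two short lists are illustrative assumptions, not part of the original notebook.

```python
import math
import statistics

def pooled_t_statistic(sample_1, sample_2):
    """Return (t, degrees of freedom) for an equal-variance two sample t test."""
    n1, n2 = len(sample_1), len(sample_2)
    mean_1, mean_2 = statistics.mean(sample_1), statistics.mean(sample_2)
    var_1, var_2 = statistics.variance(sample_1), statistics.variance(sample_2)

    dof = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * var_1 + (n2 - 1) * var_2) / dof
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean_1 - mean_2) / std_err, dof

# Illustrative readings only: five values per group
group_a = [128, 127, 118, 115, 144]
group_b = [118, 115, 112, 120, 124]

t_stat, dof = pooled_t_statistic(group_a, group_b)
print(f"t = {t_stat:.4f}, degrees of freedom = {dof}")
```

With its default settings, `stats.ttest_ind()` also assumes equal variances, so it should report the same statistic for the same inputs.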

<hr>

**EXAMPLE:**
Compare the blood pressure of male consultant doctors with that of the junior resident female doctors.

<u>Solution</u>

**Null Hypothesis:** There is no significant difference between the blood pressure of male consultant doctors and junior resident female doctors.

**Alternate:** There is a statistically significant difference between the blood pressure of the male consultant doctors and junior resident female doctors.

In [3]:
import pandas as pd

# read data
bps = pd.read_csv('../data/bp.csv')

In [4]:
bps.head()

Unnamed: 0,female_bps,male_bps
0,128,118
1,127,115
2,118,112
3,115,120
4,144,124


In [5]:
# two sample t test
stats.ttest_ind(bps.iloc[:,0],bps.iloc[:,1])

Ttest_indResult(statistic=3.5143256412718564, pvalue=0.0011571376404026158)

The pvalue is less than 0.05. Hence, reject the null hypothesis. *There is a statistically significant difference between the blood pressure of the male consultant doctors and the junior resident female doctors.*

## Paired Sample T Test

It is a statistical procedure for comparing the means of two related samples, typically measurements taken on the same subjects before and after a treatment. `stats.ttest_rel()`

$$ \large t = \dfrac{\bar d}{\frac{s}{\sqrt{n}}}$$

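Here $\bar d$ is the mean of the paired differences, $s$ is the standard deviation of those differences and $n$ is the number of pairs.

Degree of Freedom: $n - 1$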
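
A paired t test is equivalent to a one sample t test on the per-subject differences. The sketch below illustrates this with the standard library; the `before` and `after` values are hypothetical and are not taken from `sleep_duration.csv`.

```python
import math
import statistics

# Hypothetical sleep durations in hours for five patients
before = [6.0, 5.5, 7.0, 6.5, 5.0]
after = [7.0, 6.0, 7.5, 7.5, 5.5]

# One difference per patient (before minus after)
differences = [b - a for b, a in zip(before, after)]
n = len(differences)

mean_diff = statistics.mean(differences)
std_diff = statistics.stdev(differences)  # sample std of the differences

t_stat = mean_diff / (std_diff / math.sqrt(n))
dof = n - 1

print(f"t = {t_stat:.4f}, degrees of freedom = {dof}")
```

The sign of the statistic depends only on the order of subtraction; a negative value here means the `after` measurements are larger on average.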

<hr>

**EXAMPLE:**
Measure and compare the amount of sleep of patients before and after taking a soporific drug to help them sleep.

<u>Solution</u>

**Null Hypothesis:** The drug has no effect on the sleep duration of the patients. 

**Alternate:** The drug has an effect on the sleep duration of the patients.

In [6]:
sleep_duration = pd.read_csv('../data/sleep_duration.csv')
control, treatment = sleep_duration.iloc[:,0],sleep_duration.iloc[:,1]

In [7]:
# paired sample t test
stats.ttest_rel(control,treatment)

Ttest_relResult(statistic=-3.9698390753392734, pvalue=0.003255434487402806)

**N.B:** The pvalue is less than 0.05. Therefore, we reject the null hypothesis. There is a statistically significant difference in sleep duration before and after taking the soporific drug.

## One Sample Z Test

A z-test is a statistical test used to determine whether a population mean differs from a hypothesized value, or whether two population means differ, when the population variances are known.

**T-Test Vs Z-Test**

In a t test, the population standard deviation is unknown and the sample is usually small (rule of thumb: a sample size less than or equal to 30).

In a z test, the population standard deviation is known and the sample is usually large (rule of thumb: a sample size greater than 30).

**The One Sample z-test** is used to test whether the mean of a population is greater than, less than, or not equal to a specific value.

$$ z = \dfrac{\bar x - \mu}{\dfrac{\sigma}{\sqrt n}} $$

After finding the z-score, look up the corresponding cumulative probability in the z-table. A **z-table** is a mathematical table that gives the proportion of values below a given z-score in a standard normal distribution.

**N.B:**
* If the z-score is positive, then the one-tailed pvalue = 1 - (table value for the z-score).
* If negative, then the one-tailed pvalue = table value for the z-score.
* For a two-tailed test, double the one-tailed pvalue.
* If pvalue < significance level, then reject the null hypothesis.
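
The table lookup can also be done in code, since a z-table entry is the standard normal cumulative probability at that z-score. The following is a small sketch using `statistics.NormalDist` from the standard library; the helper name `z_to_p_value` is an illustrative assumption.

```python
from statistics import NormalDist

def z_to_p_value(z_score, two_tailed=True):
    """Convert a z-score to a p-value using the standard normal CDF."""
    standard_normal = NormalDist(mu=0.0, sigma=1.0)
    # Area in the tail beyond |z|: 1 - table value for a positive z-score
    one_tail = 1.0 - standard_normal.cdf(abs(z_score))
    return 2.0 * one_tail if two_tailed else one_tail

p_two_tailed = z_to_p_value(1.96)
p_one_tailed = z_to_p_value(-1.96, two_tailed=False)

print(f"two-tailed p = {p_two_tailed:.4f}, one-tailed p = {p_one_tailed:.4f}")
```

A z-score of 1.96 gives a two-tailed pvalue of about 0.05, which is why 1.96 is the usual cut-off at the 5% significance level.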

In [8]:
data = pd.read_csv('../data/train.csv')
data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


**Problem statement:** 
Test whether the mean of house prices is 180000 or not.

**Null Hypothesis:** The mean house price is 180000.

**Alternate:** The mean house price is not 180000.

In [9]:
from statsmodels.stats import weightstats as stests

ztest, pval = stests.ztest(x1=data["SalePrice"], x2=None, value=180000)
print(f"P value is: {float(pval)} \nZ value is: {float(ztest)}")

if pval < 0.05:
    print("We reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

P value is: 0.6577127747949538 
Z value is: 0.44307321990459303
Fail to reject the null hypothesis


<hr>

## Two Sample Z Test

The Two-sample z-test is used to compare the means of two samples to see whether it is plausible that they come from the same population.

$$z = \dfrac{(\bar x_1 - \bar x_2)-(\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} $$
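
The formula can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions: the sample variances stand in for $\sigma_1^2$ and $\sigma_2^2$ (reasonable only for large samples), and the helper name `two_sample_z` and the generated data are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance

def two_sample_z(sample_1, sample_2, hypothesized_diff=0.0):
    """Return (z, two-tailed p-value) for the difference of two means.

    Assumption: sample variances are used in place of the population variances.
    """
    n1, n2 = len(sample_1), len(sample_2)
    std_err = math.sqrt(variance(sample_1) / n1 + variance(sample_2) / n2)
    z_score = (mean(sample_1) - mean(sample_2) - hypothesized_diff) / std_err
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z_score)))
    return z_score, p_value

# Hypothetical, deterministic data: 40 values per group
group_a = [100 + (i % 10) for i in range(40)]
group_b = [102 + (i % 8) for i in range(40)]

z_score, p_value = two_sample_z(group_a, group_b)
print(f"z = {z_score:.4f}, p = {p_value:.4f}")
```

Library implementations may estimate the standard error differently (for example by pooling the two variances), so their results need not match this sketch to every digit.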

**Problem statement:** 
Check whether the mean first floor area and the mean second floor area of the houses (both in square feet) are different.

**Null Hypothesis:** The mean first floor area and the mean second floor area of the houses are equal.

**Alternate:** The mean first floor area and the mean second floor area of the houses are not equal.

In [10]:
zstats, pval = stests.ztest(data['1stFlrSF'], data['2ndFlrSF'], value=0)
print(f"\nP value: {pval} \nZ score: {zstats}")

if pval < 0.05:
    print("We reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")


P value: 0.0 
Z score: 53.4475561911819
We reject the null hypothesis


<hr>

## One Way ANOVA Test

A two sample T or Z test can validate a hypothesis involving only **two groups** at a time, but when there are **three or more groups**, ANOVA (Analysis of Variance) is handy. `stats.f_oneway()`

* ANOVA determines whether the means of three or more groups are different.
* It uses the F-test to statistically test the equality of means.
* It can be used with three or more samples, whether the sample sizes are below or above 30.

<u>Assumptions</u>
* ANOVA assumes independence of observations
* Homogeneity of variances
* Normally distributed observations within each group

<u>Degree of Freedom</u>

**For between groups:**
df = no. of groups - 1

**For within groups:**
df = total no. of observation - no. of groups

**Problem Statement:**

Examine whether the mean weights of dried plants under the control and 2 different treatment conditions are significantly different.

**Null Hypothesis:** There is no difference between the means of the weights of dried plants under the control and 2 different treatment conditions.

**Alternate Hypothesis:** There is a difference between the means of the weights of dried plants under the control and 2 different treatment conditions.

In [11]:
from scipy import stats

crtl = [4.16, 5.58, 5.18, 6.11, 4.5, 4.61, 5.17, 4.53, 5.33, 5.14]
trt1 = [4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.83, 4.80, 4.32, 4.09]
trt2 = [6.31, 5.12, 5.54, 5.5, 5.37, 5.29, 4.92, 6.15, 5.8, 5.26]

stats.f_oneway(crtl,trt1,trt2)

F_onewayResult(statistic=3.6421479763680336, pvalue=0.039776369202704415)

The p_value is less than 0.05. We reject the null hypothesis. Hence, there is a significant difference between the means of the weights of dried plants under the control and 2 different treatment conditions.
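
The F statistic and both degrees of freedom can be reproduced by hand from the between-group and within-group sums of squares. This is a minimal standard-library sketch on the same plant weights; the dictionary layout and the names `ss_between` and `ss_within` are illustrative choices.

```python
from statistics import mean

groups = {
    "ctrl": [4.16, 5.58, 5.18, 6.11, 4.5, 4.61, 5.17, 4.53, 5.33, 5.14],
    "trt1": [4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.83, 4.80, 4.32, 4.09],
    "trt2": [6.31, 5.12, 5.54, 5.5, 5.37, 5.29, 4.92, 6.15, 5.8, 5.26],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = mean(all_values)

# Between-group sum of squares: spread of group means around the grand mean
ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
# Within-group sum of squares: spread of observations around their group mean
ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)

df_between = len(groups) - 1               # no. of groups - 1
df_within = len(all_values) - len(groups)  # total observations - no. of groups

f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f"F = {f_stat:.4f}, df = ({df_between}, {df_within})")
```

The result should match the `statistic` returned by `stats.f_oneway()` for the same three lists.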

<hr>

## Two Way ANOVA Test

In one way ANOVA there is only one independent variable, but in two way ANOVA there are two independent variables.

**N.B:** All the tests considered so far are used for **numerical variables**, but there also exists a `chi-squared` test used for **categorical variables.**

<hr>

## Chi-Squared Test

The `chi-squared` test, also written $\chi^2$, is used to test for a relationship between categorical variables.

There are two types of chi-squared test:
1. Goodness of fit test
2. Chi-squared test of independence

### Goodness of Fit Test

It is used to test whether the sample data correctly represents the population data or not.

$$ \chi^2 = \sum\dfrac{(O - E)^2}{E}$$

where $O$ is the observed count and $E$ is the expected count in each category.

**Problem Statement:**
Generate fake demographic data for Nigeria and for any one state in Nigeria (say Oyo). Use the chi-squared goodness of fit test to check whether the state's distribution differs from the national one.

In [12]:
from scipy import stats
import pandas as pd

In [15]:
national = pd.DataFrame(["white"]*100000 + ['hispanic']*60000 +
                        ['black']*50000 + ['others']*35000)

state = pd.DataFrame(["white"]*600 + ['hispanic']*300 + 
                        ['black']*250 + ['others']*150)

national_table = pd.crosstab(index=national[0],columns="count")

state_table = pd.crosstab(index=state[0],columns="count")

In [17]:
observed = state_table

# get population ratios
national_ratio = national_table/len(national)

# get expected count
expected = national_ratio * len(state)

# chi squared
chi_squared_stat = ((observed-expected)**2/expected).sum()
print(chi_squared_stat)

col_0
count    17.884615
dtype: float64
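
The statistic alone does not settle the test; it still has to be compared with a critical value or converted to a pvalue. A possible way to do this is with `scipy.stats.chisquare` and `scipy.stats.chi2`, sketched below on the same counts (typed in by hand in alphabetical category order, which is an assumption about how the crosstab sorts them).

```python
from scipy import stats

# Counts in alphabetical category order: black, hispanic, others, white
observed_counts = [250, 300, 150, 600]
national_counts = [50000, 60000, 35000, 100000]

state_total = sum(observed_counts)
national_total = sum(national_counts)
# Expected state counts if the state followed the national proportions
expected_counts = [c / national_total * state_total for c in national_counts]

result = stats.chisquare(f_obs=observed_counts, f_exp=expected_counts)
# Critical value at the 5% level with (number of categories - 1) degrees of freedom
critical_value = stats.chi2.ppf(0.95, df=len(observed_counts) - 1)

print(f"chi-squared = {result.statistic:.4f}, p-value = {result.pvalue:.6f}")
print(f"critical value at 5% = {critical_value:.4f}")
```

Because the statistic exceeds the critical value (equivalently, the pvalue is below 0.05), the null hypothesis that the state sample follows the national distribution would be rejected at the 5% level.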


### Test of Independence
