# Hypothesis tests Part 2 Intro

In this notebook I continue with the main **hypothesis tests**, aiming to clear up the difficulties that come up when trying to understand this statistical concept. This is the second part of my series *"The Hypothesis Testing Bible"*. If you want to know more about the dataset or the more basic concepts, please check the first part on my GitHub repository, linked below.

https://github.com/Seniorveiga/Python_Projects/tree/main/Hypothesis%20Testing%20Bible

In this notebook we will walk through more complex tests. They are:

## Index

- Proportion tests
    - Tests for single and two proportions
    - Proportions $\mathit{z}$-test
    - $\chi^{2}$ tests

- Non-parametric tests
    - Wilcoxon tests
    - Wilcoxon-Mann-Whitney tests
    - Kruskal-Wallis tests

## Datasets reminder

To learn more about hypothesis tests, we will be working with two datasets, one called *late_shipments* and the other called *republican_votes*. For now we will use only the first one.

The *late_shipments* dataset contains supply chain data on the delivery of medical supplies. Each row represents one delivery of a part. 
- The **"late"** column denotes whether or not the part was delivered late. A value of "Yes" means that the part was delivered late, and a value of "No" means the part was delivered on time.

Also, for comparing data over time we will use *repub_votes_potus_08_12*, which compares the votes for Republicans and Democrats between those years.

- Since the counties are the same in both years, these samples are paired (they are used in the paired t-test chapter). The columns containing the samples are *"dem_percent_12"* and *"dem_percent_16"*.

## Packages and dataset import

Again, we import the packages we need for our operations and the hypothesis tests.

In [25]:
import pandas as pd
import pyarrow.feather as feather
import numpy as np 
from scipy.stats import norm, t
import matplotlib.pyplot as plt
import pingouin
import seaborn as sns
from statsmodels.stats.proportion import proportions_ztest


In [26]:
late_shipments = feather.read_feather("late_shipments.feather")
late_shipments.head()

Unnamed: 0,id,country,managed_by,fulfill_via,vendor_inco_term,shipment_mode,late_delivery,late,product_group,sub_classification,...,line_item_quantity,line_item_value,pack_price,unit_price,manufacturing_site,first_line_designation,weight_kilograms,freight_cost_usd,freight_cost_groups,line_item_insurance_usd
0,36203.0,Nigeria,PMO - US,Direct Drop,EXW,Air,1.0,Yes,HRDT,HIV test,...,2996.0,266644.0,89.0,0.89,"Alere Medical Co., Ltd.",Yes,1426.0,33279.83,expensive,373.83
1,30998.0,Botswana,PMO - US,Direct Drop,EXW,Air,0.0,No,HRDT,HIV test,...,25.0,800.0,32.0,1.6,"Trinity Biotech, Plc",Yes,10.0,559.89,reasonable,1.72
2,69871.0,Vietnam,PMO - US,Direct Drop,EXW,Air,0.0,No,ARV,Adult,...,22925.0,110040.0,4.8,0.08,Hetero Unit III Hyderabad IN,Yes,3723.0,19056.13,expensive,181.57
3,17648.0,South Africa,PMO - US,Direct Drop,DDP,Ocean,0.0,No,ARV,Adult,...,152535.0,361507.95,2.37,0.04,"Aurobindo Unit III, India",Yes,7698.0,11372.23,expensive,779.41
4,5647.0,Uganda,PMO - US,Direct Drop,EXW,Air,0.0,No,HRDT,HIV test - Ancillary,...,850.0,8.5,0.01,0.0,Inverness Japan,Yes,56.0,360.0,reasonable,0.01


# Proportions tests

## Test for single proportions
### Reminder of z-Score use

Imagine now that we want to test a hypothesis about the proportion of a certain population. We already did this in the first part of our Hypothesis Testing Bible! The steps we took were the following:

1. We used the bootstrap distribution to calculate the standard error (the standard deviation of that distribution) with the NumPy package.
2. We calculated the standardized test statistic, the $\mathit{z}$-Score.
3. With the $\mathit{z}$-Score, we calculated the $\mathit{p}$-Value.
4. We decided which hypothesis made more sense.
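As a rough sketch of these four steps, the cell below runs them on a made-up sample rather than on *late_shipments*. The array `late`, the split of 61 late deliveries out of 1000, the number of bootstrap replicates and the random seed are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility

# Hypothetical sample: 1000 deliveries, 61 of them late (1 = late, 0 = on time)
late = np.array([1] * 61 + [0] * 939)
p_0 = 0.06            # hypothesized proportion of late deliveries
p_hat = late.mean()   # sample proportion

# Step 1: bootstrap distribution of the sample proportion and its standard deviation
n_replicates = 2000
resamples = rng.choice(late, size=(n_replicates, late.size), replace=True)
boot_props = resamples.mean(axis=1)
std_error = boot_props.std(ddof=1)

# Step 2: standardized test statistic
z_score = (p_hat - p_0) / std_error

# Step 3: right-tailed p-value (alternative: proportion is greater than p_0)
p_value = norm.sf(z_score)

# Step 4: decide at a significance level of 0.05
alpha = 0.05
reject_null = p_value <= alpha
print(z_score, p_value, reject_null)
```

Step 1 is the expensive part: every replicate resamples the whole dataset, which is what the formula-based approach avoids.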

As building the bootstrap distribution can be computationally expensive, there are other options that are less demanding for the computer.

The formula we used was:
$$\mathit{z}-Score = \frac{\widehat{p} - \mathit{p}}{SE(\widehat{p})}$$

If we assume that $H_{0}$ is true, then $\mathit{p}$ = $\mathit{p_{0}}$ so:
$$\mathit{z}-Score = \frac{\widehat{p} - \mathit{p_{0}}}{SE(\widehat{p})}$$

So, knowing the form of $\mathit{SE}$, which we can take from the previous notebook, we only need sample information, $\widehat{p}$ and $\mathit{n}$, plus the parameter $\mathit{p_{0}}$ that the hypothesis is about. Under $H_{0}$ the standard error of a proportion is:

$$SE(\widehat{p}) = \sqrt{\frac{\mathit{p_{0}}(1 - \mathit{p_{0}})}{\mathit{n}}}$$

Remember that $\widehat{p}$ is the sample proportion and $\mathit{n}$ is the sample size.

### Why do we use z instead of t?

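Remember that $\mathit{t}$ is calculated, for example in the German and Belgian beers case, as:

$$\mathit{t} = \frac{(\bar{x}_{German} - \bar{x}_{Belgian})}{\sqrt{\frac{\mathit{s}^{2}_{German}}{n_{German}} + \frac{\mathit{s}^{2}_{Belgian}}{n_{Belgian}}}}$$

This means that the sample means $\bar{x}$ in the numerator estimate the population means $\mu$, and $\mathit{s}$, the sample standard deviation, estimates the population standard deviation... **But $\mathit{s}$ is itself calculated from $\bar{x}$! Using the same sample for both estimates increases the uncertainty of the test statistic.**

The $\mathit{t}$-distribution accounts for this extra uncertainty with fatter tails than the normal distribution, so using $\mathit{z}$ where $\mathit{t}$ is needed would make us wrongly reject the null hypothesis more often. For proportions, under $H_{0}$ the standard error depends only on $\mathit{p_{0}}$ and $\mathit{n}$, so $\widehat{p}$ appears only in the numerator and the $\mathit{z}$-distribution is enough.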
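To make the "fatter tails" point concrete, here is a small sketch comparing right-tail probabilities of the normal distribution and of $\mathit{t}$-distributions. The cutoff of 2.0 and the degrees of freedom are arbitrary values chosen for illustration.

```python
from scipy.stats import norm, t

cutoff = 2.0  # arbitrary test-statistic value chosen for illustration

# Right-tail probability under the standard normal distribution
normal_tail = norm.sf(cutoff)

# Right-tail probability under t-distributions with few, some and many degrees of freedom
t_tails = {df: t.sf(cutoff, df=df) for df in (5, 30, 1000)}

print(f"normal:      {normal_tail:.4f}")
for df, tail in t_tails.items():
    print(f"t (df={df}): {tail:.4f}")
```

The tail area shrinks toward the normal value as the degrees of freedom grow, so the same test statistic gives a larger $\mathit{p}$-Value under $\mathit{t}$ than under $\mathit{z}$, most noticeably for small samples.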

We are going to run the same test, this time computing the $\mathit{p}$-Value directly from the formula, with a significance level of $\alpha$ = 0.05.

- Our hypothesis is that 6% of shipments are late, and we calculate the sample proportion as *"p-hat"*.

In [27]:
# Hypothesize that the proportion of late shipments is 6%
p_0 = 0.06
p_hat = (late_shipments['late'] == "Yes").mean()
n = len(late_shipments)

And then we calculate the numerator and denominator to obtain the $\mathit{z}$-Score and with it, the $\mathit{p}$-Value.

In [28]:
# z-Score
numerator = p_hat - p_0
denominator = np.sqrt(p_0 * (1 - p_0)/ n)
z_score = numerator / denominator

# REMINDER:
# - Our null hypothesis is that the proportion of late shipments is 6%
# - Our alternative hypothesis is that the proportion of late shipments is more than 6%
# so this is a right-tailed test
p_value = 1 - norm.cdf(z_score)
p_value

0.44703503936503364

The $\mathit{p}$-Value (about 0.447) is greater than $\alpha$ = 0.05, so we fail to reject $H_{0}$, and we have found an easier path for the one-sample proportion test.
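The calculation above can be wrapped in a small reusable function. This is only a sketch: `one_sample_proportion_ztest` is a hypothetical helper (not part of any library), and the inputs in the usage lines are illustrative values rather than values read from the dataset.

```python
import numpy as np
from scipy.stats import norm


def one_sample_proportion_ztest(p_hat, p_0, n, alternative="larger"):
    """Hypothetical helper: z-score and p-value for a one-sample proportion test."""
    # Standard error under the null hypothesis (it uses p_0, not p_hat)
    std_error = np.sqrt(p_0 * (1 - p_0) / n)
    z_score = (p_hat - p_0) / std_error

    if alternative == "larger":        # H_A: proportion > p_0 (right-tailed)
        p_value = norm.sf(z_score)
    elif alternative == "smaller":     # H_A: proportion < p_0 (left-tailed)
        p_value = norm.cdf(z_score)
    elif alternative == "two-sided":   # H_A: proportion != p_0
        p_value = 2 * norm.sf(abs(z_score))
    else:
        raise ValueError("alternative must be 'larger', 'smaller' or 'two-sided'")
    return z_score, p_value


# Illustrative inputs: sample proportion 0.061, hypothesized 0.06, 1000 observations
z_score, p_value = one_sample_proportion_ztest(p_hat=0.061, p_0=0.06, n=1000)
print(z_score, p_value)
```

Here `norm.sf(z)` is the same quantity as `1 - norm.cdf(z)`, written as a single call.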

------------------------

## Two-sample proportions tests
### A nightmare at first sight

When we run a two-sample proportion test, we compare two proportions, which has a much more complex mathematical form.

**We need twice as many arguments as in the one-sample proportion test**, as we now have two different populations with different proportions.

**Notice that in the previous example we were testing a claim about one proportion, while in this case we are comparing proportions!**

This is similar to the case where we compared means. For example, we can say that:

- $H_{0}$: The proportion of smokers is the same for those under 30 as for those at least 30
- $H_{A}$: The proportion of smokers is different for those under 30 than for those at least 30

Now, our $\mathit{z}$-Score equation would be:

$$\mathit{z}-Score = \frac{(\widehat{p}_{\geq 30} - \widehat{p}_{< 30}) - 0}{SE(\widehat{p}_{\geq 30} - \widehat{p}_{<30})}$$

And the Standard Error now is again a combination of both, which is:

$$SE(\widehat{p}_{\geq 30} - \widehat{p}_{<30}) = \sqrt{\frac{\widehat{p}(1- \widehat{p})}{n_{\geq 30}} + \frac{\widehat{p}(1- \widehat{p})}{n_{< 30}}}$$

But, look at this! Now the value of $\widehat{p}$ is **a weighted average of both groups** (the pooled proportion). Things are getting messy:

$$\widehat{p} = \frac{n_{\geq 30} \times \widehat{p}_{\geq{30}} + n_{<30} \times \widehat{p}_{<{30}}}{n_{\geq 30} + n_{< 30}}$$


### How do we solve this nightmare?

Obviously this is meant to be a simple guide, and that is not simple. So we draw these conclusions:

1. You only need 4 variables: $n_{\geq 30}$ , $\widehat{p}_{\geq{30}}$ , $n_{<30}$ and $\widehat{p}_{<{30}}$ to solve the hypothesis test.
2. You only need **pandas** to calculate them!
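As a sketch of how the four quantities combine, the function below computes the pooled proportion, the standard error and the $\mathit{z}$-Score. The helper `two_sample_proportion_z` and the survey numbers (20% smokers among 400 people aged at least 30, 25% among 600 people under 30) are made up for illustration.

```python
import numpy as np
from scipy.stats import norm


def two_sample_proportion_z(p_hat_a, n_a, p_hat_b, n_b):
    """Hypothetical helper: z-score for H0 'both groups share the same proportion'."""
    # Pooled proportion: average of both sample proportions, weighted by group size
    p_pooled = (n_a * p_hat_a + n_b * p_hat_b) / (n_a + n_b)
    # Standard error of the difference between the two sample proportions
    std_error = np.sqrt(p_pooled * (1 - p_pooled) / n_a + p_pooled * (1 - p_pooled) / n_b)
    return (p_hat_a - p_hat_b) / std_error


# Made-up survey: group a = at least 30 years old, group b = under 30
z_score = two_sample_proportion_z(p_hat_a=0.20, n_a=400, p_hat_b=0.25, n_b=600)

# H_A says the proportions are different, so the test is two-sided
p_value = 2 * norm.sf(abs(z_score))
print(z_score, p_value)
```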

### Normal solving method

In our example, we are going to do the following two-sample proportion test:

- $H_{0}$: Shipments whose freight cost is expensive have the same proportion of late deliveries as shipments whose freight cost is reasonable.

$$late_{expensive} - late_{reasonable} = 0$$
- $H_{A}$: Shipments whose freight cost is expensive have a bigger proportion of late deliveries than shipments whose freight cost is reasonable.
$$late_{expensive} - late_{reasonable} > 0$$


In [29]:
p_hats = late_shipments.groupby("freight_cost_groups")["late"].value_counts(normalize = True)
ns = late_shipments.groupby("freight_cost_groups")["late"].count()
p_hats, ns

(freight_cost_groups  late
 expensive            No      0.920904
                      Yes     0.079096
 reasonable           No      0.964835
                      Yes     0.035165
 Name: proportion, dtype: float64,
 freight_cost_groups
 expensive     531
 reasonable    455
 Name: late, dtype: int64)

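From these four values we now calculate the pooled proportion, the standard error, the $\mathit{z}$-Score and the $\mathit{p}$-Value: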

In [30]:
# Proportion of late ("Yes") shipments in each freight cost group
p_hat_expensive = p_hats[("expensive", "Yes")]
p_hat_reasonable = p_hats[("reasonable", "Yes")]

# Calculate the pooled estimate of the population proportion (weighted by group size)
p_hat = (p_hat_expensive * ns["expensive"] + p_hat_reasonable * ns["reasonable"]) \
                    / (ns["expensive"] + ns["reasonable"])

# Standard error of the difference between the two proportions
p_hat_times_not_p_hat = p_hat * (1 - p_hat)
p_hat_times_not_p_hat_over_ns = p_hat_times_not_p_hat / ns["expensive"] + p_hat_times_not_p_hat / ns["reasonable"]
std_error = np.sqrt(p_hat_times_not_p_hat_over_ns)

# Calculate the z-score
z_score = (p_hat_expensive - p_hat_reasonable) / std_error

# p-value from the z-score RIGHT TAILED
p_value = 1 - norm.cdf(z_score)
p_value

0.0017353400023595311

### Proportions z-test: the fast way

There is another way to do the exact same operation without the arithmetic. You do not have to do much calculation; just knowing what's going on behind the scenes is enough to understand and obtain coherent results:

In [40]:
late_by_freight_cost_group = late_shipments.groupby("freight_cost_groups")["late"].value_counts()

# Were they late? We pick the "Yes" count for each freight cost group
success_counts = np.array([late_by_freight_cost_group[("expensive","Yes")], late_by_freight_cost_group[("reasonable","Yes")]])

# Number of shipments in each group
n = np.array([late_by_freight_cost_group["expensive"].values.sum(), late_by_freight_cost_group["reasonable"].values.sum()])

# z-test
z_score, p_value = proportions_ztest(count = success_counts, nobs = n, alternative = "larger")
z_score, p_value

(2.922648567784529, 0.001735340002359578)
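As a hedged cross-check, the sketch below builds the `count` and `nobs` arrays with `pd.crosstab` and confirms that `proportions_ztest` agrees with the pooled formula computed by hand. The `shipments` DataFrame is made up for illustration (100 shipments per group) and is not the *late_shipments* dataset.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

# Made-up data: 8 of 100 expensive and 3 of 100 reasonable shipments are late
shipments = pd.DataFrame({
    "freight_cost_groups": ["expensive"] * 100 + ["reasonable"] * 100,
    "late": ["Yes"] * 8 + ["No"] * 92 + ["Yes"] * 3 + ["No"] * 97,
})

# Contingency table: one row per freight cost group, one column per "late" value
table = pd.crosstab(shipments["freight_cost_groups"], shipments["late"])
groups = ["expensive", "reasonable"]  # fixes the order of the two samples

success_counts = table.loc[groups, "Yes"].to_numpy()  # late shipments per group
n = table.loc[groups].sum(axis=1).to_numpy()          # total shipments per group

# Right-tailed test: is the late proportion larger in the first group?
z_score, p_value = proportions_ztest(count=success_counts, nobs=n, alternative="larger")

# Manual check with the pooled-proportion formula
p_pooled = success_counts.sum() / n.sum()
std_error = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n[0] + 1 / n[1]))
z_manual = (success_counts[0] / n[0] - success_counts[1] / n[1]) / std_error
print(z_score, z_manual, p_value)
```

The order of `groups` matters for a one-sided test: `alternative="larger"` asks whether the first proportion is larger than the second.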

That's how we simplify that equation with **statsmodels**!
In case you do not have the package, you can find it at the link below:

https://www.statsmodels.org/stable/index.html

-------------------