**Table of contents**<a id='toc0_'></a>    
- [Import statements](#toc1_1_)    
- [Loading the datasets](#toc1_2_)    
- [Proportion tests](#toc2_)    
  - [One sample proportion test](#toc2_1_)    
  - [Two sample proportion test](#toc2_2_)    


### <a id='toc1_1_'></a>[Import statements](#toc0_)

In [1]:
import warnings

warnings.filterwarnings("ignore")

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
from scipy.stats import norm

In [4]:
from numpy.random import default_rng

rng = default_rng(seed=328)

### <a id='toc1_2_'></a>[Loading the datasets](#toc0_)

- The *"late_shipments"* dataset contains supply chain data on the delivery of medical supplies. Each row represents one delivery of a part. The "late" column denotes whether or not the part was delivered late. A value of "Yes" means that the part was delivered late, and a value of "No" means the part was delivered on time.

In [5]:
late_shipments = pd.read_feather("./datasets/late_shipments.feather")

In [6]:
late_shipments.head()

Unnamed: 0,id,country,managed_by,fulfill_via,vendor_inco_term,shipment_mode,late_delivery,late,product_group,sub_classification,...,line_item_quantity,line_item_value,pack_price,unit_price,manufacturing_site,first_line_designation,weight_kilograms,freight_cost_usd,freight_cost_groups,line_item_insurance_usd
0,36203.0,Nigeria,PMO - US,Direct Drop,EXW,Air,1.0,Yes,HRDT,HIV test,...,2996.0,266644.0,89.0,0.89,"Alere Medical Co., Ltd.",Yes,1426.0,33279.83,expensive,373.83
1,30998.0,Botswana,PMO - US,Direct Drop,EXW,Air,0.0,No,HRDT,HIV test,...,25.0,800.0,32.0,1.6,"Trinity Biotech, Plc",Yes,10.0,559.89,reasonable,1.72
2,69871.0,Vietnam,PMO - US,Direct Drop,EXW,Air,0.0,No,ARV,Adult,...,22925.0,110040.0,4.8,0.08,Hetero Unit III Hyderabad IN,Yes,3723.0,19056.13,expensive,181.57
3,17648.0,South Africa,PMO - US,Direct Drop,DDP,Ocean,0.0,No,ARV,Adult,...,152535.0,361507.95,2.37,0.04,"Aurobindo Unit III, India",Yes,7698.0,11372.23,expensive,779.41
4,5647.0,Uganda,PMO - US,Direct Drop,EXW,Air,0.0,No,HRDT,HIV test - Ancillary,...,850.0,8.5,0.01,0.0,Inverness Japan,Yes,56.0,360.0,reasonable,0.01


In [7]:
late_shipments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 27 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   id                        1000 non-null   float64
 1   country                   1000 non-null   object 
 2   managed_by                1000 non-null   object 
 3   fulfill_via               1000 non-null   object 
 4   vendor_inco_term          1000 non-null   object 
 5   shipment_mode             1000 non-null   object 
 6   late_delivery             1000 non-null   float64
 7   late                      1000 non-null   object 
 8   product_group             1000 non-null   object 
 9   sub_classification        1000 non-null   object 
 10  vendor                    1000 non-null   object 
 11  item_description          1000 non-null   object 
 12  molecule_test_type        1000 non-null   object 
 13  brand                     1000 non-null   object 
 14  dosage   

## <a id='toc2_'></a>[Proportion tests](#toc0_)

The z-test is typically used when the sampling distribution of the sample statistic is (approximately) normal.

The **z-test** can be used to test the null hypothesis that a population parameter is equal to a certain value, *whenever the corresponding sample statistic has an approximately normal sampling distribution*. This includes parameters such as the *population mean and the population proportion* (tests on the population variance typically rely on the chi-square distribution instead).


For finding the *z-score* we use the following formula:

$$ z = \frac{\text{Sample statistic} - \text{Null Hypothesis value}}{\text{Standard error of the sample statistic}} $$ 

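For a sample mean, the standard error is $\sigma / \sqrt{n}$, where $\sigma$ is the population standard deviation and $n$ is the sample size, which gives:

$$ z = \frac{\text{Sample statistic} - \text{Null Hypothesis value}}{\sigma / \sqrt{n}} $$

In practice, however, the population standard deviation $\sigma$ is rarely known.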

One alternative is to use the sample standard deviation as an estimate of the population standard deviation. In this scenario we use the **t-test** instead of the z-test, because estimating the standard deviation adds uncertainty. The t-test applies only to means (usually of a numerical response variable).

Another alternative is to create a bootstrap distribution from the sample and use its standard deviation to estimate the standard error. This way we can still use the z-test. The drawback is that bootstrapping is computationally expensive and not always feasible.
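The bootstrap approach described above can be sketched as follows. This is a minimal illustration on a synthetic boolean sample (the `is_late` array and the number of resamples are assumptions, not taken from the dataset). It compares the bootstrap estimate of the standard error with the closed-form standard error of a proportion computed from the sample proportion.

```python
import numpy as np

rng = np.random.default_rng(seed=328)

# Synthetic sample (assumption): 1000 shipments, each late with probability 0.06
is_late = rng.random(1000) < 0.06

# Resample with replacement many times and record the proportion each time
n_resamples = 5000
boot_props = np.array(
    [
        rng.choice(is_late, size=is_late.size, replace=True).mean()
        for _ in range(n_resamples)
    ]
)

# The standard deviation of the bootstrap distribution estimates the standard error
se_bootstrap = boot_props.std(ddof=1)

# Closed-form standard error of a proportion, using the sample proportion
p_hat = is_late.mean()
se_formula = np.sqrt(p_hat * (1 - p_hat) / is_late.size)

print(se_bootstrap, se_formula)
```

The two values should be close, but the bootstrap needs thousands of resamples to get there, while the formula is a single expression.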

For proportions, these problems are avoided by **proportion tests**. A proportion test uses a closed-form formula for the standard error (SE) of the sample proportion instead of a bootstrap distribution. Because the SE in the denominator does not depend on a sample standard deviation, and the sample statistic itself appears only in the numerator, we can still use the *z-score* to measure how extreme our sample is.

Proportions are defined only for categorical variables.

Proportion tests are statistical tests about one or more population proportions. There are two main types of proportion tests:

- One-sample proportion test: This test is used to compare the proportion of a sample to a known population proportion, or to test the null hypothesis that the population proportion is equal to a certain value.
- Two-sample proportion test: This test is used to compare the proportions of two independent samples.

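Both of these tests rely on the normal approximation to the sampling distribution of the sample proportion, which is reasonable when the samples are sufficiently large.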

### <a id='toc2_1_'></a>[One sample proportion test](#toc0_)

The one-sample proportion test is used to compare the proportion of a sample to a known population proportion, or to test the null hypothesis that the population proportion is equal to a certain value.

For a one-sample proportion test the *z-score* is calculated using the following formula:

$$ z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$$

Where,
- $\hat{p}$ is the sample proportion (sample statistic)
- $p_0$ is the hypothesized population proportion (null hypothesis value)
- $n$ is the sample size

> Let's use the late shipments dataset and the proportion of late shipments to illustrate how one-sample proportion test is done.

*The null hypothesis is that the proportion of late shipments is six percent, i.e., $H_0: P=0.06$*

*The alternative hypothesis is that the proportion of late shipments is greater than six percent, i.e., $H_A: P>0.06$*

In [8]:
# Select a suitable significance level
alpha = 0.05

# Hypothesize that the proportion of late shipments is 6%
p_0 = 0.06

# Calculate the sample proportion of late shipments
p_hat = (late_shipments["late"] == "Yes").mean()

# Calculate the sample size
n = len(late_shipments)

# Calculate the numerator and denominator of the test statistic
numerator = p_hat - p_0
denominator = np.sqrt(p_0 * (1 - p_0) / n)

# Calculate the test statistic
z_score = numerator / denominator

# Calculate the p-value from the z-score (right tailed test)
p_value = 1 - norm.cdf(z_score)

# Print the p-value
print(p_value)

0.44703503936503364


In [9]:
p_value < alpha

False

Since the p-value (about 0.447) is greater than alpha (0.05), we fail to reject the null hypothesis, i.e., we do not have enough statistical evidence to say that the proportion of late shipments is more than 6%.
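The p-value above was computed for a right-tailed alternative. As a rough sketch, the same z-score can be converted to a p-value for other alternatives as well. The helper name `p_value_from_z` is hypothetical, and the option names are borrowed from the `alternative` argument of `proportions_ztest` used later in this notebook.

```python
from scipy.stats import norm


def p_value_from_z(z_score, alternative="larger"):
    """Convert a z-score to a p-value for the chosen alternative hypothesis."""
    if alternative == "larger":
        # Right-tailed test: area to the right of z (same as 1 - norm.cdf(z))
        return norm.sf(z_score)
    if alternative == "smaller":
        # Left-tailed test: area to the left of z
        return norm.cdf(z_score)
    if alternative == "two-sided":
        # Two-tailed test: area in both tails beyond |z|
        return 2 * norm.sf(abs(z_score))
    raise ValueError("alternative must be 'larger', 'smaller' or 'two-sided'")


# Illustrative z-score (assumption, not computed from the dataset)
example_z = 1.5
for alt in ("larger", "smaller", "two-sided"):
    print(alt, p_value_from_z(example_z, alt))
```

`norm.sf` is the survival function, which avoids the small loss of precision that `1 - norm.cdf(z)` can have for large z-scores.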

### <a id='toc2_2_'></a>[Two sample proportion test](#toc0_)

The two sample proportion test is used for comparing the proportions of two independent samples or to compare the proportions of two groups across a categorical variable.

For a two sample proportion test, 

$$ z = \frac{(\hat{p}_1 - \hat{p}_2)}{SE(\hat{p}_1 - \hat{p}_2)}$$

$$ SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n_1} + \frac{\hat{p}(1-\hat{p})}{n_2}}$$

And the pooled proportion, $\hat{p}$ is calculated as:

$$ \hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}$$
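The three formulas above can be combined into a small helper. The following is a minimal sketch for a right-tailed test. The function name `two_sample_prop_ztest` and the example counts are made up for illustration.

```python
import numpy as np
from scipy.stats import norm


def two_sample_prop_ztest(successes, n_obs):
    """Right-tailed two-sample z-test for proportions with a pooled standard error."""
    successes = np.asarray(successes, dtype=float)
    n_obs = np.asarray(n_obs, dtype=float)

    # Sample proportion of successes in each group
    p_hats = successes / n_obs

    # Pooled proportion: total successes over total observations
    p_pooled = successes.sum() / n_obs.sum()

    # Standard error of the difference, using the pooled proportion in both terms
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n_obs).sum())

    # z-score for the difference, and the right-tailed p-value
    z = (p_hats[0] - p_hats[1]) / se
    return z, norm.sf(z)


# Illustrative counts (assumption): 30 successes out of 200 vs. 20 out of 250
z, p = two_sample_prop_ztest([30, 20], [200, 250])
print(z, p)
```

Note that the total number of successes, `successes.sum()`, equals $n_1\hat{p}_1 + n_2\hat{p}_2$, so the pooled proportion in the code matches the formula.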

> Let's return to our late shipments dataset and see if the amount paid for freight affects whether or not a shipment is late. Recall that in the late_shipments dataset, whether or not the shipment was late is stored in the "late" column. Freight costs are stored in the "freight_cost_groups" column, and the categories are "expensive" and "reasonable".

Since the "freight_cost_groups" column is a categorical variable, we can use a two sample proportion test to compare the proportion of late shipments across each group. Here a success is defined as a shipment being late, and the two groups are "expensive" and "reasonable".

The Null Hypothesis is, $$H_0: P_{\text{late}\mid\text{expensive}} - P_{\text{late}\mid\text{reasonable}} = 0$$

And the Alternative Hypothesis is, $$H_A: P_{\text{late}\mid\text{expensive}} - P_{\text{late}\mid\text{reasonable}} > 0$$

The `statsmodels.stats.proportion` module has a function called `proportions_ztest` that can be used to perform a two sample proportion test without going through the hassle of calculating all these parameters to get the *z-score*. 

The function is called with a list/array of the number of successes in each group/sample and a list/array of the number of observations in each group/sample. The function returns the *z-score* and the *p-value*.

This function can be used to perform a one sample proportion test as well.
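A minimal sketch of the one-sample usage follows. The counts (61 late shipments out of 1000) are assumed for illustration rather than read from the dataset. One detail worth knowing: by default `proportions_ztest` computes the variance from the sample proportion, whereas the one-sample formula above uses the hypothesized proportion $p_0$. Passing `prop_var=p_0` makes the function use $p_0$ instead.

```python
from statsmodels.stats.proportion import proportions_ztest

# Assumed counts for illustration: 61 late shipments out of 1000
n_late, n_total = 61, 1000
p_0 = 0.06

# Default behaviour: variance based on the sample proportion
z_default, p_default = proportions_ztest(
    count=n_late, nobs=n_total, value=p_0, alternative="larger"
)

# Variance based on the hypothesized proportion, matching the formula above
z_null, p_null = proportions_ztest(
    count=n_late, nobs=n_total, value=p_0, alternative="larger", prop_var=p_0
)

print(z_default, p_default)
print(z_null, p_null)
```

The two variants give slightly different z-scores. The gap is small when the sample proportion is close to $p_0$.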

In [10]:
from statsmodels.stats.proportion import proportions_ztest

In [11]:
# Choose an appropriate significance level
alpha = 0.05

In [12]:
# Count the late column values for each freight_cost_group
late_by_freight_cost_group = late_shipments.groupby("freight_cost_groups")[
    "late"
].value_counts()

late_by_freight_cost_group

freight_cost_groups  late
expensive            No      489
                     Yes      42
reasonable           No      439
                     Yes      16
Name: count, dtype: int64

In [13]:
# Create an array of the late = "Yes" counts for each freight_cost_group
success_counts = np.array(
    [
        late_by_freight_cost_group[("expensive", "Yes")],
        late_by_freight_cost_group[("reasonable", "Yes")],
    ]
)

# Create an array of the total number of rows in each freight_cost_group
n_obs = np.array(
    [
        late_by_freight_cost_group["expensive"].sum(),
        late_by_freight_cost_group["reasonable"].sum(),
    ]
)

In [14]:
# Run a z-test on the two proportions, passing the per-group sample sizes
z_score, p_value = proportions_ztest(success_counts, n_obs, alternative="larger")

In [15]:
# Print the results
print(z_score, p_value)

2.922648567784529 0.001735340002359578


In [16]:
p_value < alpha

True

Since the calculated p-value (about 0.0017) is less than alpha (0.05), we reject the null hypothesis and conclude that the proportion of late shipments is higher for the expensive freight cost group than for the reasonable freight cost group.