**Table of contents**<a id='toc0_'></a>    
- [Import statements](#toc1_1_)    
- [Loading the datasets](#toc1_2_)    
- [Non-parametric tests](#toc2_)    
  - [Wilcoxon signed-rank test](#toc2_1_)    
  - [Wilcoxon-Mann-Whitney test](#toc2_2_)    
  - [Kruskal-Wallis test](#toc2_3_)    


### <a id='toc1_1_'></a>[Import statements](#toc0_)

In [1]:
import warnings

warnings.filterwarnings("ignore")

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### <a id='toc1_2_'></a>[Loading the datasets](#toc0_)

- The *"late_shipments"* dataset contains supply chain data on the delivery of medical supplies. Each row represents one delivery of a part. The "late" column denotes whether or not the part was delivered late. A value of "Yes" means that the part was delivered late, and a value of "No" means the part was delivered on time.

In [3]:
late_shipments = pd.read_feather("./datasets/late_shipments.feather")

In [4]:
late_shipments.head()

Unnamed: 0,id,country,managed_by,fulfill_via,vendor_inco_term,shipment_mode,late_delivery,late,product_group,sub_classification,...,line_item_quantity,line_item_value,pack_price,unit_price,manufacturing_site,first_line_designation,weight_kilograms,freight_cost_usd,freight_cost_groups,line_item_insurance_usd
0,36203.0,Nigeria,PMO - US,Direct Drop,EXW,Air,1.0,Yes,HRDT,HIV test,...,2996.0,266644.0,89.0,0.89,"Alere Medical Co., Ltd.",Yes,1426.0,33279.83,expensive,373.83
1,30998.0,Botswana,PMO - US,Direct Drop,EXW,Air,0.0,No,HRDT,HIV test,...,25.0,800.0,32.0,1.6,"Trinity Biotech, Plc",Yes,10.0,559.89,reasonable,1.72
2,69871.0,Vietnam,PMO - US,Direct Drop,EXW,Air,0.0,No,ARV,Adult,...,22925.0,110040.0,4.8,0.08,Hetero Unit III Hyderabad IN,Yes,3723.0,19056.13,expensive,181.57
3,17648.0,South Africa,PMO - US,Direct Drop,DDP,Ocean,0.0,No,ARV,Adult,...,152535.0,361507.95,2.37,0.04,"Aurobindo Unit III, India",Yes,7698.0,11372.23,expensive,779.41
4,5647.0,Uganda,PMO - US,Direct Drop,EXW,Air,0.0,No,HRDT,HIV test - Ancillary,...,850.0,8.5,0.01,0.0,Inverness Japan,Yes,56.0,360.0,reasonable,0.01


In [5]:
late_shipments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 27 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   id                        1000 non-null   float64
 1   country                   1000 non-null   object 
 2   managed_by                1000 non-null   object 
 3   fulfill_via               1000 non-null   object 
 4   vendor_inco_term          1000 non-null   object 
 5   shipment_mode             1000 non-null   object 
 6   late_delivery             1000 non-null   float64
 7   late                      1000 non-null   object 
 8   product_group             1000 non-null   object 
 9   sub_classification        1000 non-null   object 
 10  vendor                    1000 non-null   object 
 11  item_description          1000 non-null   object 
 12  molecule_test_type        1000 non-null   object 
 13  brand                     1000 non-null   object 
 14  dosage   

- The *"dem_votes_potus_12_16"* dataset contains the percentage of votes for the Democratic candidate in the 2012 and 2016 presidential elections for a sample of counties in the United States. The "dem_percent_12" column contains the percentage of votes for the Democratic candidate in the 2012 election, and the "dem_percent_16" column contains the percentage of votes for the Democratic candidate in the 2016 election.

In [6]:
sample_dem_data = pd.read_feather("./datasets/dem_votes_potus_12_16.feather")

In [7]:
sample_dem_data.head()

Unnamed: 0,state,county,dem_percent_12,dem_percent_16
0,Alabama,Bullock,76.3059,74.946921
1,Alabama,Chilton,19.453671,15.847352
2,Alabama,Clay,26.673672,18.674517
3,Alabama,Cullman,14.661752,10.028252
4,Alabama,Escambia,36.915731,31.020546


In [8]:
sample_dem_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   state           500 non-null    object 
 1   county          500 non-null    object 
 2   dem_percent_12  500 non-null    float64
 3   dem_percent_16  500 non-null    float64
dtypes: float64(2), object(2)
memory usage: 15.8+ KB


## <a id='toc2_'></a>[Non-parametric tests](#toc0_)

The tests that we've seen so far are known as parametric tests. Tests like the z-test, t-test, and ANOVA are all based on the assumption that the population is normally distributed. Parametric tests also require sample sizes that are "big enough" that the Central Limit Theorem applies. 

To check if the assumptions for hypothesis testing hold, we can perform a sanity check. We calculate a bootstrap distribution and visualize it with a histogram. If we don't see a bell-shaped normal curve, then one of the assumptions hasn't been met. In that case, we should revisit the data collection process and see if any of the three assumptions of randomness, independence, and sample size does not hold.
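
As a rough illustration of this sanity check, the sketch below builds a bootstrap distribution of the sample mean using only the standard library. The `sample` values are hypothetical (a synthetic right-skewed sample, not one of the notebook's datasets), and `bootstrap_means` is a helper name introduced here for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical right-skewed sample (synthetic, not from the notebook's datasets)
sample = [random.expovariate(1 / 50) for _ in range(40)]


def bootstrap_means(data, n_resamples=2000):
    """Resample `data` with replacement and record the mean of each resample."""
    return [
        statistics.fmean(random.choices(data, k=len(data)))
        for _ in range(n_resamples)
    ]


boot_distn = bootstrap_means(sample)

# The bootstrap distribution should be centered near the sample mean
print(statistics.fmean(sample), statistics.fmean(boot_distn))

# To eyeball the shape, a histogram could be drawn with matplotlib, e.g.
# plt.hist(boot_distn, bins=30); plt.show()
```

If the histogram of `boot_distn` looks clearly skewed rather than bell-shaped, that suggests the sample may be too small for the normal approximation to be trusted.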

In situations where we aren't sure about these assumptions, or we are certain that the assumptions aren't met, we can use non-parametric tests. They do not make the normal distribution assumption and do not require the sample size conditions.

There are various non-parametric tests, such as the Wilcoxon signed-rank test, the Mann-Whitney U test, the Kruskal-Wallis test, and Spearman's rank correlation test, which act as alternatives to their parametric counterparts.

### <a id='toc2_1_'></a>[Wilcoxon signed-rank test](#toc0_)

The Wilcoxon signed-rank test works well when the assumptions of a paired t-test aren't met.

The Wilcoxon signed-rank test is a non-parametric statistical test used to compare a continuous variable before and after an intervention, or between two paired groups. Being non-parametric, it does not assume that the paired differences are normally distributed; it works on the ranks of the absolute differences instead.

The Wilcoxon signed-rank test is similar to the paired t-test, but it is more robust to violations of the normality assumption and to outliers. It can also be more powerful than the paired t-test when the differences are clearly non-normal, for example heavy-tailed.

> We'll explore the difference between the percentage of county-level votes for the Democratic candidate in 2012 and 2016 to identify if the difference is significant. We will use a reduced version of the original dataset, randomly sampled down to only 10 rows, so that the sample-size condition for the paired t-test (number of pairs >= 30) is not met. First we will use a normal paired t-test, and then we'll use the Wilcoxon signed-rank test to check if the difference is significant.

In [9]:
alpha = 0.05

In [10]:
import pingouin

In [11]:
reduced_sample_dem_data = sample_dem_data.sample(n=10, random_state=327)

- Normal paired t-test

In [12]:
# Conduct a paired t-test on dem_percent_12 and dem_percent_16
paired_test_results = pingouin.ttest(
    x=reduced_sample_dem_data["dem_percent_12"],
    y=reduced_sample_dem_data["dem_percent_16"],
    alternative="two-sided",
    paired=True,
)

# Print paired t-test results
paired_test_results

Unnamed: 0,T,dof,alternative,p-val,CI95%,cohen-d,BF10,power
T-test,5.640869,9,two-sided,0.000317,"[4.6, 10.77]",0.583642,107.095,0.378369


- Wilcoxon signed-rank test

We can use the `scipy.stats.wilcoxon()` function to perform the Wilcoxon signed-rank test. We can also use the `pingouin.wilcoxon()` function from the Pingouin package. Both of these functions take the two paired groups as input and return the test statistic and the p-value.

In [13]:
# Conduct a Wilcoxon test on dem_percent_12 and dem_percent_16
wilcoxon_test_results = pingouin.wilcoxon(
    x=reduced_sample_dem_data["dem_percent_12"],
    y=reduced_sample_dem_data["dem_percent_16"],
    alternative="two-sided",
)

# Print Wilcoxon test results
wilcoxon_test_results

Unnamed: 0,W-val,alternative,p-val,RBC,CLES
Wilcoxon,0.0,two-sided,0.001953,1.0,0.73
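
For comparison, here is a minimal sketch of the same kind of test with `scipy.stats.wilcoxon()`, assuming SciPy is installed. The `before` and `after` values are made-up numbers for illustration, not the election data.

```python
from scipy import stats

# Hypothetical paired measurements (illustrative values only)
before = [20.1, 35.4, 48.2, 15.9, 60.3, 27.8, 41.0, 33.3]
after = [18.0, 36.0, 44.1, 12.2, 55.0, 29.5, 38.2, 30.9]

# Two-sided Wilcoxon signed-rank test on the paired differences
result = stats.wilcoxon(before, after, alternative="two-sided")

# For a two-sided test the statistic is the smaller of the two signed-rank sums
print(result.statistic, result.pvalue)
```

Unlike the Pingouin function, which returns a DataFrame with extra effect-size columns, the SciPy result only carries the statistic and the p-value.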


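We can see from these results that the p-value in the Wilcoxon test (0.001953) is greater than the one found in the paired t-test (0.000317). Both are still below the significance level of 0.05, so both tests reject the null hypothesis.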
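
To see where a W-val of 0 and a p-value of 0.001953 can come from, the sketch below computes the signed-rank sums by hand and enumerates every sign pattern to get an exact two-sided p-value. The data are hypothetical (ten pairs whose differences are all positive), and `signed_rank_sums` and `exact_two_sided_p` are helper names introduced here; the sketch assumes no zero or tied differences.

```python
from itertools import product


def signed_rank_sums(x, y):
    """Return (W_plus, W_minus) for paired samples without zero or tied differences."""
    diffs = [a - b for a, b in zip(x, y)]
    # Rank the absolute differences from smallest (rank 1) to largest (rank n)
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


def exact_two_sided_p(w, n):
    """Share of the 2**n equally likely sign patterns at least as extreme as w."""
    total = n * (n + 1) // 2
    count = 0
    for signs in product([0, 1], repeat=n):
        w_plus = sum(rank for rank, s in zip(range(1, n + 1), signs) if s)
        if min(w_plus, total - w_plus) <= w:
            count += 1
    return count / 2**n


# Hypothetical paired data: every "before" value exceeds its "after" value
before = [50, 48, 61, 40, 45, 55, 43, 59, 52, 47]
after = [49, 46, 58, 36, 40, 49, 36, 51, 43, 37]

w_plus, w_minus = signed_rank_sums(before, after)
w = min(w_plus, w_minus)
p_value = exact_two_sided_p(w, len(before))
print(w_plus, w_minus, w, p_value)
```

With ten pairs that all move in the same direction, only 2 of the 1024 sign patterns are as extreme, giving 2/1024 = 0.001953125. This is consistent with the values reported above, and it is also the smallest two-sided p-value an exact test can return with 10 pairs.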

### <a id='toc2_2_'></a>[Wilcoxon-Mann-Whitney test](#toc0_)

It is also known as the Mann-Whitney U test. It is the non-parametric alternative to the two-sample independent t-test.

> While trying to determine why some shipments are late, you may wonder if the weight of the shipments that were on time is different than the weight of the shipments that were late.

In [14]:
reduced_late_shipments = late_shipments.groupby("late").sample(
    frac=0.4, random_state=215
)[["weight_kilograms", "late"]]

In [15]:
reduced_late_shipments.head()

Unnamed: 0,weight_kilograms,late
669,2307.0,No
491,228.0,No
275,12.0,No
697,829.0,No
756,6.0,No


In [16]:
# Check that the conditions for the two-sample independent t-test aren't fulfilled
print(reduced_late_shipments.late.value_counts())
print((reduced_late_shipments.late.value_counts() >= 30).all())

late
No     376
Yes     24
Name: count, dtype: int64
False


**`Note:`** For performing the independent t-test and the Mann-Whitney U test we need the data in wide format.

<u>**Wide and long format data**</u>

To understand wide and long format data we first need to understand two terms associated with the data in a dataframe.
- **fact:** A fact is a value that is measured and reported on.
- **dimension:** A dimension is a value that describes the conditions of the fact.

For example, in a sales scenario, typical facts would be the number of sales of an item and the cost. The dimensions might include the store where the item was sold, the date, and the customer.

Based on the idea of fact and dimension, the way data is stored can be categorized as:
- **wide form:** if a single row has multiple facts, and
- **long or tidy form:** if a single row of data has only one fact (possibly along with other variables describing the dimensions).

The "reduced_late_shipments" dataset is in long format. To convert it to wide format we can use either the `df.pivot_table()` or the `df.pivot()` method. The *pivot_table* method applies an aggregate function while the *pivot* method does not. Here we will use *pivot* since we don't want to aggregate anything; we need the values as they are.
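
The difference between the two methods can be sketched on a toy table. The `long_df` data below is hypothetical (a small sales table in the spirit of the example above) and contains one duplicated store/item pair on purpose.

```python
import pandas as pd

# Hypothetical long-format sales data; ("B", "ink") appears twice
long_df = pd.DataFrame(
    {
        "store": ["A", "A", "B", "B", "B"],
        "item": ["pen", "ink", "pen", "ink", "ink"],
        "sales": [10, 4, 7, 3, 2],
    }
)

# pivot_table aggregates duplicated (store, item) pairs, here by summing them
wide_sum = long_df.pivot_table(
    index="store", columns="item", values="sales", aggfunc="sum"
)

# pivot does not aggregate, so duplicated pairs make the reshape ambiguous
try:
    long_df.pivot(index="store", columns="item", values="sales")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# Without duplicates, pivot simply moves the values into columns as they are
wide_as_is = long_df.drop_duplicates(["store", "item"]).pivot(
    index="store", columns="item", values="sales"
)

print(wide_sum)
print(pivot_failed)
print(wide_as_is)
```

When no `index` is passed to `pivot`, the existing row index is used, so every original row stays on its own row and no duplicates can arise.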

In [17]:
reduced_late_shipments_wide = pd.pivot(
    columns="late", values="weight_kilograms", data=reduced_late_shipments
)

In [18]:
reduced_late_shipments_wide.head()

late,No,Yes
2,3723.0,
5,5057.0,
10,1290.0,
11,402.0,
13,1727.0,


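Here the "reduced_late_shipments_wide" dataframe has one column per value of "late" ("No" and "Yes"), each holding the shipment weights for that group. Since every shipment belongs to exactly one group, the other column is NaN in that row.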

In [19]:
alpha = 0.01

- Normal two-sample independent t-test

In [20]:
pingouin.ttest(
    reduced_late_shipments_wide["Yes"],
    reduced_late_shipments_wide["No"],
    alternative="two-sided",
)

Unnamed: 0,T,dof,alternative,p-val,CI95%,cohen-d,BF10,power
T-test,2.198018,25.796663,two-sided,0.037124,"[77.02, 2312.56]",0.478512,1.849,0.620726


- Mann-Whitney U test

We can use the `pingouin.mwu()` function to perform the Mann-Whitney U test. It takes the two independent groups as input and returns the test statistic and the p-value.

In [21]:
# Run a two-sided Wilcoxon-Mann-Whitney test on weight_kilograms vs. late
wmw_test = pingouin.mwu(
    x=reduced_late_shipments_wide["Yes"],
    y=reduced_late_shipments_wide["No"],
    alternative="two-sided",
)

# Print the test results
wmw_test

Unnamed: 0,U-val,alternative,p-val,RBC,CLES
MWU,6094.5,two-sided,0.003966,-0.350731,0.675366


From the results we can see that the p-value in the Mann-Whitney U test (0.003966) is less than the one found in the independent t-test (0.037124).

The decision to reject the null hypothesis or not changes completely based on the test we use. At a significance level of 0.01, we fail to reject the null hypothesis with the independent t-test, and we reject it with the Mann-Whitney U test.
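
As a rough sketch of what the U statistic measures, the code below counts, over all cross-group pairs, how often a value from the first group exceeds a value from the second (ties count as one half). The `late` and `on_time` weights are hypothetical, and `mann_whitney_u` is a helper name introduced here. The last lines check that the CLES and RBC columns in the output above are consistent with U divided by the number of pairs, using the group sizes of 24 and 376 printed earlier.

```python
def mann_whitney_u(x, y):
    """U statistic for x: pairs (x_i, y_j) with x_i > y_j, ties counted as 0.5."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical shipment weights in kilograms (illustrative values only)
late = [1200, 3400, 560, 2100]
on_time = [300, 560, 90, 1500, 45]

u_late = mann_whitney_u(late, on_time)
# Share of cross-group pairs in which the late shipment is heavier
cles = u_late / (len(late) * len(on_time))
print(u_late, cles)

# Arithmetic check against the notebook output (U-val 6094.5, n = 24 and 376)
notebook_cles = 6094.5 / (24 * 376)
notebook_rbc = 1 - 2 * notebook_cles
print(round(notebook_cles, 6), round(notebook_rbc, 6))
```

Read this way, the reported CLES of about 0.675 says that a randomly chosen late shipment is heavier than a randomly chosen on-time shipment in roughly two out of three pairs.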

### <a id='toc2_3_'></a>[Kruskal-Wallis test](#toc0_)

The Kruskal-Wallis test is a non-parametric alternative to the ANOVA test. It is used to compare the distribution of a continuous variable between two or more independent groups. 

> Here, we'll return to the late shipments data and see whether the variation in the price of each package (pack_price) between the three shipment modes (shipment_mode), "Air", "Air Charter", and "Ocean", is statistically significant. We will use a normal ANOVA, and then the Kruskal-Wallis test, to check if the results are different.

In [22]:
# Count the shipment_mode values
counts = late_shipments["shipment_mode"].value_counts()

# Print the result
print(counts)

# Inspect whether the counts are big enough
print((counts >= 30).all())

shipment_mode
Air            906
Ocean           88
Air Charter      6
Name: count, dtype: int64
False


The sample-size condition for ANOVA is not met, since not every group has at least 30 observations ("Air Charter" has only 6).

In [23]:
alpha = 0.1

- ANOVA

In [24]:
pingouin.anova(data=late_shipments, dv="pack_price", between="shipment_mode")

Unnamed: 0,Source,ddof1,ddof2,F,p-unc,np2
0,shipment_mode,2,997,21.8646,5.089479e-10,0.042018


In [25]:
pingouin.pairwise_tests(
    data=late_shipments, dv="pack_price", between="shipment_mode", padjust="bonf"
)

Unnamed: 0,Contrast,A,B,Paired,Parametric,T,dof,alternative,p-unc,p-corr,p-adjust,BF10,hedges
0,shipment_mode,Air,Air Charter,False,True,21.179625,600.685682,two-sided,8.748346000000001e-75,2.624504e-74,bonf,5.808999999999999e+76,0.726592
1,shipment_mode,Air,Ocean,False,True,19.33576,986.979785,two-sided,6.934555e-71,2.080367e-70,bonf,1.129e+67,0.711119
2,shipment_mode,Air Charter,Ocean,False,True,-3.170654,35.615026,two-sided,0.003123012,0.009369037,bonf,15.277,-0.423775


- Kruskal-Wallis test

We can use the `pingouin.kruskal()` function to perform the Kruskal-Wallis test. It takes similar input to the ANOVA test and returns the test statistic and the p-value.

In [26]:
# Run a Kruskal-Wallis test on pack_price vs. shipment_mode
kw_test = pingouin.kruskal(
    data=late_shipments, dv="pack_price", between="shipment_mode"
)

# Print the results
print(kw_test)

                Source  ddof1          H         p-unc
Kruskal  shipment_mode      2  94.570935  2.911939e-21
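
The H statistic can be sketched by hand for a tiny example. The group values below are hypothetical and chosen without ties (the sketch omits the tie correction that library implementations apply), and `kruskal_h` is a helper name introduced here. With three groups, H is compared against a chi-square distribution with 2 degrees of freedom, whose upper-tail probability is simply exp(-H / 2); the last lines apply that to the H value printed above.

```python
import math


def kruskal_h(groups):
    """Kruskal-Wallis H for a list of groups, assuming no tied values."""
    pooled = sorted(value for group in groups for value in group)
    rank_of = {value: rank for rank, value in enumerate(pooled, start=1)}
    n_total = len(pooled)
    # Sum over groups of (rank sum squared) / (group size)
    rank_term = sum(
        sum(rank_of[v] for v in group) ** 2 / len(group) for group in groups
    )
    return 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)


# Hypothetical pack prices for three shipment modes (illustrative values only)
air = [9, 12, 15, 20]
air_charter = [1, 3]
ocean = [2, 4, 5, 7]

h = kruskal_h([air, air_charter, ocean])
# Chi-square upper tail with 2 degrees of freedom (3 groups minus 1)
p_value = math.exp(-h / 2)
print(h, p_value)

# Same tail formula applied to the H reported in the notebook output
notebook_p = math.exp(-94.570935 / 2)
print(notebook_p)
```

The chi-square reference distribution is only an approximation, and it is rough for groups as small as the ones in this toy example.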


In [27]:
pingouin.pairwise_tests(
    data=late_shipments,
    dv="pack_price",
    between="shipment_mode",
    padjust="bonf",
    parametric=False,
)

Unnamed: 0,Contrast,A,B,Paired,Parametric,U-val,alternative,p-unc,p-corr,p-adjust,hedges
0,shipment_mode,Air,Air Charter,False,False,4661.5,two-sided,0.002496603,0.007489809,bonf,0.726592
1,shipment_mode,Air,Ocean,False,False,63795.0,two-sided,1.2382049999999999e-20,3.7146149999999996e-20,bonf,0.711119
2,shipment_mode,Air Charter,Ocean,False,False,242.0,two-sided,0.7392153,1.0,bonf,-0.423775
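
The `p-corr` column is consistent with a plain Bonferroni correction: each uncorrected p-value is multiplied by the number of comparisons and capped at 1. A minimal sketch, using the `p-unc` values from the table above and the significance level of 0.1 set earlier (`bonferroni` is a helper name introduced here):

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Uncorrected p-values copied from the non-parametric pairwise output
p_unc = [0.002496603, 1.2382049999999999e-20, 0.7392153]
p_corr = bonferroni(p_unc)

alpha = 0.1
significant = [p < alpha for p in p_corr]
print(p_corr)
print(significant)
```

Only the first two contrasts (Air vs. Air Charter and Air vs. Ocean) remain significant after the correction.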


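From the results we can see that the p-value in the Kruskal-Wallis test (2.9e-21) is far smaller than the one found in the ANOVA test (5.1e-10), although both are well below the significance level of 0.1. The pairwise comparisons differ more: with the Bonferroni correction, the parametric tests find all three pairs of shipment modes significantly different, while the non-parametric tests find no significant difference between "Air Charter" and "Ocean" (corrected p-value of 1.0).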