In this notebook I continue walking through the main **hypothesis tests**, with the aim of clearing up the difficulties that come up when learning this statistical concept. This is the second part of my series *"The Hypothesis-Testing Bible"*. If you want to know more about the dataset or the more elementary concepts, please check the first part in my GitHub repository, linked below.

https://github.com/Seniorveiga/Python_Projects/tree/main/Hypothesis%20Testing%20Bible

In this notebook we will walk through more complex tests:

## Index

- Non-parametric tests
    - Wilcoxon tests
    - Wilcoxon-Mann-Whitney tests
    - Kruskal-Wallis tests

--------------------

## Packages and dataset import

As before, we import the packages that we need for our operations and hypothesis tests.
In addition to *"late_shipments"*, we load a second dataset, *"dem_votes_potus_12_16"*, which we will use for the paired tests further down.

In [2]:
import pandas as pd
import pyarrow.feather as feather
import numpy as np 
from scipy.stats import norm, t, chi2, chisquare
import matplotlib.pyplot as plt
import pingouin
import seaborn as sns
from statsmodels.stats.proportion import proportions_ztest


In [3]:
late_shipments = feather.read_feather("late_shipments.feather")
late_shipments.head()

Unnamed: 0,id,country,managed_by,fulfill_via,vendor_inco_term,shipment_mode,late_delivery,late,product_group,sub_classification,...,line_item_quantity,line_item_value,pack_price,unit_price,manufacturing_site,first_line_designation,weight_kilograms,freight_cost_usd,freight_cost_groups,line_item_insurance_usd
0,36203.0,Nigeria,PMO - US,Direct Drop,EXW,Air,1.0,Yes,HRDT,HIV test,...,2996.0,266644.0,89.0,0.89,"Alere Medical Co., Ltd.",Yes,1426.0,33279.83,expensive,373.83
1,30998.0,Botswana,PMO - US,Direct Drop,EXW,Air,0.0,No,HRDT,HIV test,...,25.0,800.0,32.0,1.6,"Trinity Biotech, Plc",Yes,10.0,559.89,reasonable,1.72
2,69871.0,Vietnam,PMO - US,Direct Drop,EXW,Air,0.0,No,ARV,Adult,...,22925.0,110040.0,4.8,0.08,Hetero Unit III Hyderabad IN,Yes,3723.0,19056.13,expensive,181.57
3,17648.0,South Africa,PMO - US,Direct Drop,DDP,Ocean,0.0,No,ARV,Adult,...,152535.0,361507.95,2.37,0.04,"Aurobindo Unit III, India",Yes,7698.0,11372.23,expensive,779.41
4,5647.0,Uganda,PMO - US,Direct Drop,EXW,Air,0.0,No,HRDT,HIV test - Ancillary,...,850.0,8.5,0.01,0.0,Inverness Japan,Yes,56.0,360.0,reasonable,0.01


In [9]:
sample_dem_data = feather.read_feather("dem_votes_potus_12_16.feather")
sample_dem_data.head()

Unnamed: 0,state,county,dem_percent_12,dem_percent_16
0,Alabama,Bullock,76.3059,74.946921
1,Alabama,Chilton,19.453671,15.847352
2,Alabama,Clay,26.673672,18.674517
3,Alabama,Cullman,14.661752,10.028252
4,Alabama,Escambia,36.915731,31.020546


## Non-parametric test

The usual parametric tests rely on certain assumptions about our population and our sample.

### Assumptions
#### Randomness
All hypothesis tests assume that the data are a random sample from the population. If this assumption does not hold, we cannot say that the sample is representative of the population. It has to be checked against the method used to collect the data.

To verify it, we would need to ask the source of the data how they were collected.

#### Independence of observations
We assume that each row is independent of the others. If we ignore dependencies between observations, our tests will produce more false positives and false negatives. Again, this should be checked against the collection method.

#### Large sample size
If the sample is big enough, we can apply the central limit theorem: the sampling distribution of the sample mean is approximately normal, even when the data themselves are not.
If the sample is too small, the theorem does not apply, so the tests we perform cannot support solid conclusions and are likely to produce more false positives and false negatives.
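
To make the effect of sample size concrete, here is a minimal simulation sketch. It does not use the notebook's datasets: the exponential population, the helper names `sample_means` and `skewness`, and the sample sizes 5 and 50 are all assumptions chosen for illustration. It compares how skewed the distribution of sample means is for small and larger samples.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical, strongly right-skewed population (not from the notebook's data)
population = rng.exponential(scale=1.0, size=100_000)

def sample_means(n, reps=5_000):
    """Draw `reps` samples of size `n` and return the mean of each one."""
    samples = rng.choice(population, size=(reps, n), replace=True)
    return samples.mean(axis=1)

def skewness(x):
    """Sample skewness: 0 for a symmetric distribution such as the normal."""
    x = np.asarray(x)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

means_5 = sample_means(5)
means_50 = sample_means(50)

# The means of the larger samples should be noticeably less skewed
print("skewness, n=5 :", skewness(means_5))
print("skewness, n=50:", skewness(means_50))
```

The closer the skewness gets to zero, the more reasonable it is to treat the sample mean as normally distributed, which is what the parametric tests rely on.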

#### Solutions

- As a general rule, we need **at least 30 observations** in our sample; with two samples, we would need at least 30 in each one.
- In case of proportion tests:
    - If we do it with one sample, there should be **at least 10 failures and 10 successes**.
    - If we do it with two samples, there should be **at least 10 failures and 10 successes in each sample**.
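
These rules of thumb are easy to check in code. The sketch below is only an illustration: the helper `enough_observations` and the toy `mode` column are hypothetical and do not come from the notebook's datasets.

```python
import pandas as pd

def enough_observations(counts: pd.Series, minimum: int = 30) -> pd.Series:
    """Return True for every group whose count reaches the minimum."""
    return counts >= minimum

# Hypothetical toy data: 50 shipments split across three modes
toy = pd.DataFrame({"mode": ["Air"] * 40 + ["Ocean"] * 8 + ["Charter"] * 2})
group_sizes = toy["mode"].value_counts()

# Rule for means: at least 30 observations per group
ok = enough_observations(group_sizes, minimum=30)
print(ok)
```

For proportion tests the same helper can be called with `minimum=10` on the counts of successes and failures.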

![](https://unicornlifescience.com/wp-content/uploads/2022/03/biological-samples.jpg)

Let's see an example in our dataset, *"late_shipments"*:

In [6]:
counts = late_shipments.groupby("vendor_inco_term")["freight_cost_groups"].value_counts()
counts

vendor_inco_term  freight_cost_groups
CIP               reasonable              34
                  expensive               16
DDP               expensive               55
                  reasonable              45
DDU               reasonable               1
EXW               expensive              423
                  reasonable             302
FCA               reasonable              73
                  expensive               37
Name: count, dtype: int64

As we see, DDU has only one observation, far too few to meet the rules above, so we cannot rely on any result obtained for DDU.

Now let's look at the *"shipment_mode"* column:

In [8]:
counts = late_shipments["shipment_mode"].value_counts()
# Inspect whether the counts are big enough
counts

shipment_mode
Air            906
Ocean           88
Air Charter      6
Name: count, dtype: int64

We see that **we cannot rely on the results of an ANOVA test for the comparisons with Air Charter**, because that group has only 6 observations.

-------------------------

### What's a non-parametric test?

We have already seen other types of tests, such as the $\mathit{z}$-test for proportions, the $\mathit{t}$-tests and the ANOVA, but what exactly is a non-parametric test?

A **non-parametric test** is a test that we use when the sample does not meet the assumptions described above, for example when it is too small for the central limit theorem to apply. The tests in this notebook work with the ranks of the values rather than with the values themselves.

For example, we can draw a small random sample of rows from *"dem_votes_potus_12_16"* to see what the data look like, and then perform a paired $\mathit{t}$-test on the two vote columns:

In [25]:
# Draw a random sample of 10 rows
small_sample = sample_dem_data.sample(n=10)
small_sample

Unnamed: 0,state,county,dem_percent_12,dem_percent_16
416,Texas,Gonzales,29.337956,24.802652
192,Massachusetts,Essex,57.403571,58.522961
19,Arkansas,Dallas,43.352789,42.042584
287,New Jersey,Sussex,38.426096,32.663303
338,Ohio,Harrison,41.310741,23.845176
160,Kentucky,Boyle,36.135133,33.068129
274,Nebraska,Perkins,17.073171,11.065292
185,Maine,Lincoln,54.511731,47.625913
339,Ohio,Hocking,48.373664,29.404892
126,Iowa,Chickasaw,54.811845,35.213675



- $H_{0}$: The percentage of Democratic votes was the same in 2012 and 2016.
- $H_{A}$: The percentage of Democratic votes was higher in 2012 than in 2016.

Significance level: 0.01

In [27]:
# Paired t-test on the full dataset (500 counties, hence dof = 499)
paired_test_results = pingouin.ttest(x=sample_dem_data['dem_percent_12'], 
                                     y=sample_dem_data['dem_percent_16'],
                                     paired=True,
                                     alternative="greater")

paired_test_results

Unnamed: 0,T,dof,alternative,p-val,CI95%,cohen-d,BF10,power
T-test,30.298384,499,greater,1.800317e-115,"[6.46, inf]",0.454202,4.491e+111,1.0


But is this reliable?

-------------

### Non-parametric non-liar

Non-parametric tests give us reliable results when the assumptions of the parametric tests are in doubt.
The first one we will look at is the **Wilcoxon signed-rank test**, which works as follows:

1. Take the differences between the two columns that we want to compare.
2. Take the absolute value of those differences.
3. Rank the absolute differences with the *scipy.stats* function *rankdata*.

The last step is to compute the test statistic W from two sums:
- $\mathit{T}^{-}$: the sum of the ranks of the negative differences.
- $\mathit{T}^{+}$: the sum of the ranks of the positive differences.

In the classic two-sided version of the test, W is the smaller of the two sums.
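
The steps above can be sketched by hand and compared against SciPy. The arrays `before` and `after` are hypothetical paired values made up for this illustration, and taking W as the smaller of the two sums corresponds to the two-sided test.

```python
import numpy as np
from scipy.stats import rankdata, wilcoxon

# Hypothetical paired measurements (one pair per row, not real data)
before = np.array([76.3, 19.5, 26.7, 14.7, 36.9, 41.3, 54.5, 48.4])
after = np.array([74.9, 15.8, 18.7, 10.0, 31.0, 23.8, 47.6, 49.4])

# Steps 1-3: differences, absolute values, ranks of the absolute values
diff = before - after
abs_diff = np.abs(diff)
ranks = rankdata(abs_diff)

# Sum the ranks separately for negative and positive differences
t_minus = ranks[diff < 0].sum()
t_plus = ranks[diff > 0].sum()

# Two-sided test statistic: the smaller of the two sums
w = min(t_minus, t_plus)

# Cross-check with SciPy's implementation
scipy_w = wilcoxon(before, after, alternative="two-sided").statistic
print(t_minus, t_plus, w, scipy_w)
```

According to the SciPy documentation, for one-sided alternatives the reported statistic is $\mathit{T}^{+}$ instead of the smaller sum, so a large W is not unusual when `alternative="greater"` is used.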

We can do all of this directly with **pingouin.wilcoxon**.

Let's see it!

In [28]:
wilcoxon_test_results = pingouin.wilcoxon(x=sample_dem_data["dem_percent_12"],
                                          y=sample_dem_data["dem_percent_16"],
                                          alternative="greater")

# Print Wilcoxon test results
print(wilcoxon_test_results)

             W-val alternative         p-val       RBC      CLES
Wilcoxon  122849.0     greater  8.901980e-78  0.961661  0.644816


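It was reliable! Both tests reject $H_{0}$ at the 0.01 significance level, so the $\mathit{t}$-test result holds up. Keep in mind that both tests were run on all 500 counties, and with a sample this large the assumptions of the $\mathit{t}$-test are not a concern. When the sample is small or the data are clearly not normal, the $\mathit{t}$-test becomes less trustworthy, and the Wilcoxon test is the one we should use.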

------------------------------

### Wilcoxon-Mann-Whitney

Next we look at another non-parametric test, this time for unpaired rather than paired numeric data.
The Wilcoxon-Mann-Whitney test also works on ranks, like the Wilcoxon test, but it is designed for unpaired data.

To run it, we need to put the data in wide format, which we can do with the *.pivot()* method.

The result has one column per group: each row holds a value in the column of its own group and NaN in the other. Remember that the previous test **was not** a Wilcoxon-Mann-Whitney test because the data were paired: each row holds the 2012 and 2016 results of the same county. If the two columns came from different, unrelated groups, the unpaired test would be the right one.
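
To see what the wide format looks like before applying it to the real data, here is a small sketch on hypothetical toy data. The column names `group` and `value` are made up, and SciPy's `mannwhitneyu` is used here instead of pingouin, so the missing values are dropped explicitly before the call.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

# Hypothetical long-format data: one row per observation
long_df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b", "b"],
    "value": [1.2, 3.4, 2.2, 5.1, 4.8, 6.0, 3.9],
})

# Wide format: one column per group, NaN where the row belongs to the other group
wide_df = long_df.pivot(columns="group", values="value")
print(wide_df)

# Unpaired, two-sided Wilcoxon-Mann-Whitney test on the non-missing values
result = mannwhitneyu(wide_df["a"].dropna(),
                      wide_df["b"].dropna(),
                      alternative="two-sided")
print(result.statistic, result.pvalue)
```

The two groups do not need to have the same number of observations, which is one practical difference from the paired Wilcoxon test.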

### Kruskal-Wallis

The Kruskal-Wallis test is the non-parametric counterpart of the ANOVA: it compares more than two groups, using ranks rather than the raw values.


In [29]:
# Select the numeric variable and the grouping column (unpaired data)
weight_vs_late = late_shipments[["weight_kilograms", "late"]]

# Convert weight_vs_late into wide format
weight_vs_late_wide = weight_vs_late.pivot(columns="late",
                                           values="weight_kilograms")

# Two-sided Wilcoxon-Mann-Whitney test
wmw_test = pingouin.mwu(x=weight_vs_late_wide["No"],
                        y=weight_vs_late_wide["Yes"])
wmw_test

Unnamed: 0,U-val,alternative,p-val,RBC,CLES
MWU,19134.0,two-sided,1.4e-05,0.331902,0.334049


The small p-value gives us evidence that shipment weight differs between late and on-time shipments. The Wilcoxon-Mann-Whitney test is useful when you cannot satisfy the assumptions of a parametric test comparing two means, like the $\mathit{t}$-test.

For the Kruskal-Wallis:

In [None]:
# Kruskal-Wallis test on weight_kilograms vs. shipment_mode
kw_test = pingouin.kruskal(data = late_shipments, dv = "weight_kilograms", between = "shipment_mode")
kw_test

The Kruskal-Wallis test returned a very small p-value, so there is evidence that at least one of the three shipment modes has a different weight distribution from the others. The Kruskal-Wallis test is comparable to an ANOVA, which tests for a difference in means across multiple groups.
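
For readers who want to try the test without the dataset, here is a minimal sketch using SciPy's `kruskal` on hypothetical weights. The three lists are made up for illustration and are not taken from *"late_shipments"*.

```python
from scipy.stats import kruskal

# Hypothetical shipment weights in kilograms for three groups
air = [10, 56, 120, 340, 15, 80]
ocean = [3723, 7698, 1426, 5200, 2900]
charter = [9000, 12000, 8000, 15000]

# Kruskal-Wallis H-test: are the groups drawn from the same distribution?
stat, p_value = kruskal(air, ocean, charter)
print(stat, p_value)

# Compare against the chosen significance level
alpha = 0.01
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

Because the test only uses ranks, the very large weights in the toy groups do not dominate the result the way they could in an ANOVA.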

-----