# Nonparametric testing

In [None]:
import numpy as np
import seaborn as sns
import scipy.stats as st
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import pandas as pd
import statsmodels.api as sm
import statistics
import os 
from scipy.stats import norm
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from numpy.random import randn
import plotly.express as px

## Comparing parametric and non-parametric testing

To compare parametric and non-parametric testing, let's go back to the example of counting white blood cells.

In [None]:
dat = pd.read_csv('https://raw.githubusercontent.com/BiAPoL/Bio-image_Analysis_with_Python/main/biostatistics/data/leukocyte_counts.csv')

print(dat)

In [None]:
sns.swarmplot(data=dat)

To compare healthy with COVID19, we use a t-test, which compares the means.

In [None]:
st.ttest_ind(dat['healthy'],dat['COVID19'])

However, the assumption of equal standard deviations was violated for this test, so perhaps we should have used a non-parametric test?

In [None]:
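# Caution: st.wilcoxon is the paired signed-rank test; for independent groups consider st.mannwhitneyu
st.wilcoxon(dat['healthy'],dat['COVID19'])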
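When two independent groups differ in spread, two commonly used alternatives are Welch's t-test, which drops the equal-variance assumption, and the Mann-Whitney U test, which is the rank-sum test for independent samples. The sketch below runs both on simulated data. The names `group_a` and `group_b` and their parameters are made up for illustration and are not the leukocyte counts.

```python
import numpy as np
import scipy.stats as st

# Simulated stand-ins for two independent groups with unequal spread
# (hypothetical values, not the leukocyte data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=7.0, scale=1.0, size=100)
group_b = rng.normal(loc=8.0, scale=3.0, size=100)

# Welch's t-test: compares the means without assuming equal variances
welch = st.ttest_ind(group_a, group_b, equal_var=False)

# Mann-Whitney U (Wilcoxon rank-sum): rank-based test for independent samples
mwu = st.mannwhitneyu(group_a, group_b, alternative="two-sided")

print("Welch t-test p-value:  ", welch.pvalue)
print("Mann-Whitney U p-value:", mwu.pvalue)
```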

How about the difference healthy-CLL?

In [None]:
st.ttest_ind(dat['healthy'],dat['CLL'])

In [None]:
st.wilcoxon(dat['healthy'],dat['CLL'])

Our p-value is one order of magnitude larger. Why is that?

Rank-based tests such as Wilcoxon compare ranks rather than the measured values, so some information is lost, which frequently leads to a loss of power. What is actually compared? Let's look at the ranks.

In [None]:
df4 = st.rankdata(dat[['healthy','CLL']]).reshape(100,2)

sns.swarmplot(data=df4)
plt.ylabel("ranks")

df2 = pd.DataFrame(df4)

print(df2)
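To make "comparing ranks" concrete, the sketch below derives a rank-sum statistic by hand from pooled ranks and checks it against `st.mannwhitneyu`. It assumes the rank-sum (Mann-Whitney U) test for independent samples, and the arrays `x` and `y` are simulated placeholders rather than the notebook's data.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(1)
x = rng.normal(7.0, 1.0, size=30)  # hypothetical group 1
y = rng.normal(8.0, 1.0, size=40)  # hypothetical group 2

# Rank all values together, then sum the ranks that belong to x
ranks = st.rankdata(np.concatenate([x, y]))
rank_sum_x = ranks[:len(x)].sum()

# U for x: its rank sum minus the smallest rank sum it could possibly have
u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
u_y = len(x) * len(y) - u_x

res = st.mannwhitneyu(x, y, alternative="two-sided")
print("U from ranks:", u_x, u_y)
print("U from scipy:", res.statistic)
```

Depending on the SciPy version, the reported statistic corresponds to either `u_x` or `u_y`. The p-value is the same in both cases.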

## ANOVA

Now we have three samples, so a t-test is not appropriate. If we state the null hypothesis (H0) that there is no difference between the samples, we should apply a one-way ANOVA.

In [None]:
sns.swarmplot(data=dat)

In [None]:
st.f_oneway(dat['healthy'],dat['COVID19'],dat['CLL'])

Now we know that, in this case, we can reject H0 that there is no difference between the means.
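As a rough illustration of what `st.f_oneway` computes, the sketch below builds the F statistic from the between-group and within-group sums of squares and compares it with SciPy's result. The three groups are simulated with made-up means and are only placeholders.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(2)
# Three hypothetical groups of 50 values each
groups = [rng.normal(mu, 1.0, size=50) for mu in (7.0, 7.5, 9.0)]

k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# Variation of the group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the values around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = st.f.sf(f_manual, k - 1, n_total - k)

res = st.f_oneway(*groups)
print("manual:", f_manual, p_manual)
print("scipy: ", res.statistic, res.pvalue)
```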

How does this look for a non-parametric situation?

## Comparing Kruskal-Wallis with ANOVA

In [None]:
df4 = st.rankdata(dat).reshape(100,3)

sns.swarmplot(data=df4)
plt.ylabel("ranks")

df2 = pd.DataFrame(df4)

print(df2)

In [None]:
st.kruskal(dat['healthy'],dat['COVID19'],dat['CLL'])
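The Kruskal-Wallis statistic can be reproduced from the pooled ranks in a similar way. The sketch below assumes continuous data without ties, so no tie correction is needed, and uses simulated placeholder groups.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(4)
# Three hypothetical groups of 50 values each
groups = [rng.normal(mu, 1.0, size=50) for mu in (7.0, 7.5, 9.0)]

sizes = [len(g) for g in groups]
n_total = sum(sizes)

# Rank all values together, then split the ranks back into their groups
pooled_ranks = st.rankdata(np.concatenate(groups))
rank_groups = np.split(pooled_ranks, np.cumsum(sizes)[:-1])

# H statistic without tie correction
rank_term = sum(r.sum() ** 2 / len(r) for r in rank_groups)
h_manual = 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)
p_manual = st.chi2.sf(h_manual, df=len(groups) - 1)

res = st.kruskal(*groups)
print("manual:", h_manual, p_manual)
print("scipy: ", res.statistic, res.pvalue)
```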

As before, we are losing a bit of power. The ranking plot makes it especially clear that we are looking at all comparisons at the same time, but what we really want to know is which comparison makes the difference. For that, we need to include multiple testing correction.

## Multiple testing correction

For ANOVA there is Tukey. For non-parametric tests, there is Dunn's test. 

In [None]:
# For Tukey the dataframe needs to be melted
melted = pd.melt(dat)
print(melted)

# perform multiple pairwise comparison (Tukey HSD)
m_comp = pairwise_tukeyhsd(endog=melted['value'], groups=melted['variable'], alpha=0.05)
print(m_comp)

In these tests, too, the individual comparisons are not independent of each other, and the procedures account for this, which makes them well suited to a moderate number of comparisons.  
For large numbers of comparisons, Bonferroni and Benjamini-Hochberg are more common: the former corrects by the total number of comparisons, while the latter controls the false discovery rate (FDR).
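A minimal sketch of how these two corrections adjust a list of p-values, written with NumPy only. The helper names `bonferroni` and `benjamini_hochberg` and the example p-values are made up for illustration. In practice, `statsmodels.stats.multitest.multipletests` offers both methods.

```python
import numpy as np

def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    p = np.asarray(pvals, dtype=float)
    return np.minimum(p * p.size, 1.0)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (controls the FDR)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest p-value
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

raw = [0.001, 0.01, 0.03, 0.04, 0.2]  # hypothetical raw p-values
p_bonf = bonferroni(raw)
p_bh = benjamini_hochberg(raw)
print("Bonferroni:        ", p_bonf)
print("Benjamini-Hochberg:", p_bh)
```

Bonferroni is the stricter of the two: every adjusted p-value is at least as large as the corresponding Benjamini-Hochberg value.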

## Correlation statistics 

Because I am lacking the creativity at the moment to come up with good, relatable data, let's simulate some random data:

In [None]:
s1 = 20 * randn(1000) + 100
s2 = s1+ (10 * randn(1000) + 50)
s3 = s2+ (10 * randn(1000) + 50)
s4 = s3+ (10 * randn(1000) + 50)

# Overlay the four histograms and mark each sample mean in the matching color
for i, s in enumerate([s1, s2, s3, s4]):
    plt.hist(s, alpha=0.5, color=f"C{i}")
    plt.axvline(np.mean(s), color=f"C{i}")

plt.xlabel("value")
plt.ylabel("count")  # plt.hist shows counts unless density=True is passed


This plot shows us that the samples are reasonably normally distributed.  
These data are matched (paired): each value in `s2` is derived from the corresponding value in `s1`. That information is lost in this visualisation!

In [None]:
df = pd.DataFrame({'s1':s1, 's2':s2})
sns.scatterplot(x=df['s1'],y=df['s2'])

And with a regression line and confidence interval:

In [None]:
sns.regplot(x=df['s1'],y=df['s2'])

Because we can assume normal distribution for each of the dimensions separately, Pearson correlation is an appropriate correlation statistic. 

In [None]:
st.pearsonr(x=df['s1'],y=df['s2'])

Although we do not have to use non-parametric statistics here, let's have a look at how Spearman performs.

What do the data look like if we consider ranks?

In [None]:
# Spearman ranks each variable separately, so rank column by column
df4 = df[['s1','s2']].rank()
sns.scatterplot(x=df4['s1'],y=df4['s2'])

Here you can already see that the ranks are not randomly scattered.

In [None]:
st.spearmanr(a=df['s1'],b=df['s2'])
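The link between the two statistics can be checked directly: Spearman's coefficient is the Pearson correlation of the ranks, with each variable ranked on its own. A short sketch on freshly simulated data (the arrays `a` and `b` mimic `s1` and `s2` but are separate draws):

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(3)
a = 20 * rng.standard_normal(1000) + 100
b = a + 10 * rng.standard_normal(1000) + 50

# Spearman on the raw values
rho, _ = st.spearmanr(a, b)

# Pearson on the ranks, each variable ranked separately
r_on_ranks, _ = st.pearsonr(st.rankdata(a), st.rankdata(b))

print("Spearman rho:    ", rho)
print("Pearson on ranks:", r_on_ranks)
```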

What to do with multiple dimensions?

In [None]:
df = pd.DataFrame({'s1':s1, 's2':s2,'s3':s3, 's4':s4})
px.scatter_matrix(df)

Exercise: How do these samples relate to each other? Can you use the correlation as a measure to group them and visualise their relationship in a heatmap for example?  
How does the correlation relate to whether the distributions are different? How are they connected?