**Author:** C Michell

This notebook evaluates whether the sampling protocol had an effect on the PIC concentration. The data are not normally distributed, so we use nonparametric tests: a Kruskal-Wallis test and an Aligned Rank Transform (ART).
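
The non-normality claim can be checked with a Shapiro-Wilk test. The sketch below is an illustration only: it uses a synthetic, right-skewed sample (`pic_example`, a hypothetical stand-in) rather than the real PIC column, and it assumes `scipy.stats.shapiro` is an acceptable normality check for this sample size.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed stand-in for the PIC column (NOT the real data)
rng = np.random.default_rng(0)
pic_example = rng.lognormal(mean=0.0, sigma=1.0, size=64)

# Shapiro-Wilk: the null hypothesis is that the sample is normally distributed
w_stat, p_value = stats.shapiro(pic_example)
print(f"W = {w_stat:.3f}, p = {p_value:.3g}")
```

A small p-value here is evidence against normality. On the real data, the same call would be applied to `df['PIC mmol/m3'].dropna()`.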

In [1]:
import pandas as pd
from scipy.stats import f_oneway
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt
import numpy as np

In [10]:
df = pd.read_csv('data/03-PIC-blank-corrected.csv')
df = df.assign(label= df.Code.str[:-1])

# Kruskal-Wallis tests

Here we're testing whether the PIC distribution differs significantly between any of the four protocols. Kruskal-Wallis compares the groups by rank, so it does not test the means directly.

In [11]:
labels = ['AX','AY','BX','BY']
group_dict = {}
for ll in labels:
    subdf = df[df.label.str.contains(ll)]
    gr_subdf = subdf.copy()
    gr_subdf = gr_subdf[['Code','Filter','Rinse','PIC mmol/m3']]
    group_dict[ll] = gr_subdf

In [12]:
stats.kruskal(group_dict['AX']['PIC mmol/m3'],
        group_dict['AY']['PIC mmol/m3'],
        group_dict['BX']['PIC mmol/m3'],
        group_dict['BY']['PIC mmol/m3'], nan_policy = 'omit')

KruskalResult(statistic=np.float64(0.7936986697513078), pvalue=np.float64(0.8509736273647703))

The p-value is non-significant (p ≈ 0.85), so there is no evidence of a difference between the four protocols.
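
To make clear what the statistic measures, here is a minimal sketch that computes the Kruskal-Wallis H by hand from pooled ranks and compares it with `stats.kruskal`. The group values are hypothetical and contain no ties, so no tie correction is needed. They are not the real PIC measurements.

```python
import numpy as np
from scipy import stats

# Hypothetical values with no ties (NOT the real PIC data)
groups = {
    "AX": [1.2, 3.4, 2.2, 5.1],
    "AY": [2.8, 1.9, 4.4, 3.0],
    "BX": [2.5, 3.9, 1.1, 4.7],
    "BY": [3.3, 2.0, 4.1, 1.6],
}

# Rank all observations together, ignoring group membership
pooled = np.concatenate([np.asarray(v, dtype=float) for v in groups.values()])
ranks = stats.rankdata(pooled)
n_total = len(pooled)

# H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1), with R_i the rank sum of group i
h_manual = 0.0
start = 0
for values in groups.values():
    n_i = len(values)
    rank_sum = ranks[start:start + n_i].sum()
    h_manual += rank_sum**2 / n_i
    start += n_i
h_manual = 12.0 / (n_total * (n_total + 1)) * h_manual - 3.0 * (n_total + 1)

# scipy's implementation should agree when there are no ties
h_scipy, p_scipy = stats.kruskal(*groups.values())
print(h_manual, h_scipy, p_scipy)
```

Because H depends only on the rank sums per group, the test is insensitive to the skew or outliers that would distort a one-way ANOVA on the raw values.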

# Aligned Rank Transform (ART)

The Kruskal-Wallis test treats the four protocols as a single factor and doesn't tell you about interactions between factors, so we need a different approach if we want to look at filter pore size and rinse type separately, i.e.

- does filter type influence PIC
- does rinse type influence PIC
- is there an interaction effect between the filter and rinse type (e.g. the effect of filter type on PIC depends on rinse type)

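For this we can use the Aligned Rank Transform (ART). Note that the analysis cells in this notebook rank the raw response directly, without ART's per-effect alignment step, so strictly they perform a rank-transform ANOVA, which is a simplification of the full ART procedure.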

In [13]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import rankdata

Remove the NaN row and make a copy of the dataframe

In [14]:
subdf = df[~(df['Code'] == 'SBY3')].copy()
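
Filtering on the hard-coded code `'SBY3'` works for this file. A more general alternative, sketched here on a hypothetical three-row table with made-up values, is to drop any row whose PIC value is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the table (values are made up, NOT the real data)
example = pd.DataFrame({
    "Code": ["SBY1", "SBY2", "SBY3"],
    "PIC mmol/m3": [0.12, 0.15, np.nan],
})

# Drop rows with a missing PIC value and take an independent copy
cleaned = example.dropna(subset=["PIC mmol/m3"]).copy()
print(cleaned)
```

This keeps working if a different sample is missing in a later version of the CSV.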

In [15]:
subdf.rename(columns={'PIC mmol/m3' : 'PIC_mol', 'PIC ug/l' : 'PIC_g'},inplace=True)

In [16]:
# Rank-transform the response variable
subdf["Ranked_Score"] = rankdata(subdf["PIC_mol"])

# Fit an ANOVA model on the ranked data
model = smf.ols("Ranked_Score ~ C(Filter) * C(Rinse)", data=subdf).fit()
anova_table = sm.stats.anova_lm(model, typ=2)  # Type-II ANOVA

anova_table

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(Filter) | 75.629926 | 1.0 | 0.210426 | 0.648091 |
| C(Rinse) | 178.992073 | 1.0 | 0.498011 | 0.483107 |
| C(Filter):C(Rinse) | 33.591282 | 1.0 | 0.093461 | 0.76088 |
| Residual | 21564.851128 | 60.0 | | |


All p-values are greater than 0.05, so there is no evidence that filter type, rinse type, or their interaction influences PIC.