# Tutorial: Permutation Testing using `acore`

In this notebook we will demonstrate how to use acore's permutation testing functions on metagenomics data collected by [Ju and colleagues (2018)](https://doi.org/10.1038/s41396-018-0277-8).

The samples in this demo were collected from wastewater treatment plant influent (MGYS00005056) and effluent (MGYS00005058).

For this demo we use the GO term abundance tables generated by the MGnify pipeline. The values in these tables are the absolute abundances of selected GO terms for each sample, which we first convert to relative abundances and then transform with the centred log-ratio (CLR).


## Data preparation details

### Downloading
The analysed samples were downloaded via the [MGnify API](https://www.ebi.ac.uk/metagenomics/api/docs/). The influent (INF) and effluent (EFF) datasets contain paired samples, so we also downloaded the sample metadata (also available via the MGnify API) to assign the correct pairing.

### Preprocessing of abundances
- To account for technical variation due to sequencing technology limitations, we first divide the abundance values by the total reads for the sample, which gives relative abundances.
- Relative abundances are compositional data (CoDa), so we map them to unconstrained vectors with the centred log-ratio (CLR) transformation, `acore.microbiome.internal_functions.calc_clr()`, so that the assumptions of the frequentist statistics applied afterwards are not violated.
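
The two preprocessing steps above can be sketched in plain NumPy. This is a minimal illustration, not the implementation behind `calc_clr()`: the function names, the toy counts and the pseudocount used to avoid `log(0)` are all assumptions made for this example.

```python
import numpy as np


def to_relative_abundance(counts):
    """Divide each sample (row) by its total so that every row sums to 1."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)


def clr_transform(rel_abundance, pseudocount=1e-9):
    """Centred log-ratio: log of each part minus the mean log of its sample."""
    # The pseudocount is an assumption of this sketch; it only guards against log(0).
    log_values = np.log(np.asarray(rel_abundance, dtype=float) + pseudocount)
    return log_values - log_values.mean(axis=1, keepdims=True)


# Toy counts: 2 samples (rows) x 3 GO terms (columns); the values are made up.
toy_counts = np.array([[10.0, 30.0, 60.0], [5.0, 5.0, 90.0]])
toy_rel = to_relative_abundance(toy_counts)
toy_clr = clr_transform(toy_rel)

print(toy_rel.sum(axis=1))  # each row of relative abundances sums to 1
print(toy_clr.sum(axis=1))  # each row of CLR values sums to (numerically) 0
```

Because CLR values are centred within each sample, they can be negative and are no longer constrained to sum to 1.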

### Preprocessing of the metadata 
- The sample metadata needed for this demo (sampling location) were available in each sample's "sample-desc" field.
- The sample-desc of every INF and EFF sample was parsed and used to pair the samples.
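
Once a shared key has been parsed from the metadata, the pairing itself can be expressed as a one-to-one join. The sketch below uses made-up IDs and location codes, and the layout of the two metadata tables is hypothetical; it does not reproduce how sample-desc was actually parsed.

```python
import pandas as pd

# Hypothetical metadata tables; IDs and location codes are invented for illustration.
inf_meta = pd.DataFrame(
    {
        "inf_id": ["INF_A", "INF_B", "INF_C"],
        "sampling_location": ["X1", "X2", "X3"],
    }
)
eff_meta = pd.DataFrame(
    {
        "eff_id": ["EFF_A", "EFF_B", "EFF_C"],
        "sampling_location": ["X3", "X1", "X2"],
    }
)

# Inner join on the shared key. validate="one_to_one" makes pandas raise a
# MergeError if a key appears more than once on either side.
paired = eff_meta.merge(
    inf_meta, on="sampling_location", how="inner", validate="one_to_one"
)
print(paired)
```

Validating the join as one-to-one is a cheap guard against silently duplicating samples when a key is not unique.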

### Subset of data for demo
- For this demo we only look at [GO term GO:0017001](https://www.ebi.ac.uk/QuickGO/term/GO:0017001) (antibiotic catabolic process).
- We expect antibiotic catabolic processes to be higher in INF than in EFF.

### Saving the demo dataset
This example subset of the data was saved to a CSV file, `./example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv`. The data dictionary is below:

| column            | description                                                                                                       | dtype |
|-------------------|-------------------------------------------------------------------------------------------------------------------|-------|
| eff_id            | The run ID for the MGnify analysis of the effluent sample.                                                        | str   |
| inf_id            | The run ID for the MGnify analysis of the influent sample.                                                        | str   |
| sampling_location | [The ISO 3166-1 alpha-2 code](http://iso.org/obp/ui/#iso:pub:PUB500001:en) for the country where the sample was from. | str   |
| sampling_read     | Replicates?                                                                                                       | str   |
| eff_abundance     | The abundance of the GO term for a given effluent sample after preprocessing (relative abundance, then CLR).      | float |
| inf_abundance     | The abundance of the GO term for a given influent sample after preprocessing (relative abundance, then CLR).      | float |

-----

We will now proceed with reading in the prepared dataset. 

In [1]:
import pandas as pd 

# path relative to the acore repository root; adjust it if you run the notebook from elsewhere
df_data = pd.read_csv('./example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv')
# sanity check 
df_data.head()

| Unnamed: 0 | eff_id     | inf_id     | sampling_location | sampling_read            | eff_abundance | inf_abundance |
|------------|------------|------------|-------------------|--------------------------|---------------|---------------|
| 0          | ERR2985255 | ERR2814663 | TG                | READ2 Taxonomy ID:256318 | 3.257283      | 4.226819      |
| 1          | ERR2985256 | ERR2814664 | MN                | READ2 Taxonomy ID:256318 | 2.572841      | 3.847191      |
| 2          | ERR2985257 | ERR2814651 | AH                | READ1 Taxonomy ID:256318 | 4.298777      | 4.086841      |
| 3          | ERR2985258 | ERR2814667 | TE                | READ1 Taxonomy ID:256318 | 2.758982      | 3.436752      |
| 4          | ERR2985259 | ERR2814660 | FD                | READ1 Taxonomy ID:256318 | 3.364675      | 3.486673      |


## The permutation test

Since these are paired samples, we will use the paired-sample permutation test, `acore.permutation_test.paired_permutation()`.

A permutation test compares the observed value of a chosen metric (e.g., t-statistic, mean difference) with the values of that metric obtained when the dataset is randomly shuffled many times.

If we run 100 permutations of our data (in practice we should run many more) and only 1 of them shows a larger effect size than the one actually observed, then the estimated chance of the observed effect size having occurred at random is 1/100 (a p-value of 0.01).
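
To make the idea concrete, here is a minimal NumPy sketch of a paired permutation test on the mean difference. It is an illustration only, not the code behind `paired_permutation()`: the function name, the sign-flipping scheme, the two-sided comparison and the toy values are assumptions of this example. For paired data, shuffling the labels within a pair is equivalent to flipping the sign of that pair's difference.

```python
import numpy as np


def sign_flip_permutation_test(x, y, n_permutations=10_000, rng=None):
    """Two-sided paired permutation test on the mean difference (illustrative)."""
    rng = np.random.default_rng(rng)
    diffs = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    observed = diffs.mean()
    # Under the null hypothesis the two labels within a pair are exchangeable,
    # so each paired difference is equally likely to be positive or negative.
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, diffs.size))
    permuted_means = (signs * diffs).mean(axis=1)
    # Count permutations at least as extreme as the observed value (ties included).
    n_extreme = int(np.sum(np.abs(permuted_means) >= abs(observed)))
    return observed, n_extreme / n_permutations


# Made-up paired values, not taken from the demo dataset.
before = np.array([4.2, 3.8, 4.1, 3.4, 3.5, 4.0, 3.9, 4.4])
after = np.array([3.3, 2.6, 4.3, 2.8, 3.4, 3.1, 3.2, 3.6])
observed, p_value = sign_flip_permutation_test(before, after, n_permutations=2000, rng=0)
print(f"observed mean difference = {observed:.3f}, p-value = {p_value:.4f}")
```

The p-value is simply the fraction of permutations whose metric is at least as extreme as the observed one, which is why its resolution depends on the number of permutations.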

In [2]:
from acore.permutation_test import paired_permutation

# optional: seed a random number generator for reproducibility
import numpy as np
rng = np.random.default_rng(12345)

In [3]:
# try several metrics to demonstrate the available options
for metric in ['t-statistic', 'mean', np.mean]:
    result = paired_permutation(
        df_data['inf_abundance'].to_numpy(),
        df_data['eff_abundance'].to_numpy(), 
        metric=metric, 
        n_permutations=10000, 
        rng=rng
    )
    # print the full result dictionary
    print(result)

{'metric': <function ttest_rel at 0x11014d8a0>, 'observed': TtestResult(statistic=np.float64(6.7389860601792275), pvalue=np.float64(7.122287781830209e-07), df=np.int64(23)), 'p_value': np.float64(0.0)}
{'metric': <function mean at 0x1090f7a70>, 'observed': np.float64(0.5350826397500547), 'p_value': np.float64(0.0)}
{'metric': <function mean at 0x1090f7a70>, 'observed': np.float64(0.5350826397500547), 'p_value': np.float64(0.0)}


## Result

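For both the t-statistic (t = 6.739) and the mean difference (0.535), the permutation test reported a p-value of 0.0, meaning that none of the 10,000 permutations was counted as more extreme than the observed data. With 10,000 permutations the smallest non-zero p-value that can be resolved is 1/10,000, so the probability of the observed metrics occurring at random is estimated to be < 0.0001. This is in line with the expectation that antibiotic catabolic processes are higher in INF than in EFF.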
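
A p-value of exactly 0.0 is an artefact of the finite number of permutations. The short sketch below contrasts the raw fraction with a common add-one convention that counts the observed data as one of the permutations. The helper function is hypothetical and is not part of `acore`.

```python
def permutation_p_value(n_extreme, n_permutations, add_one=True):
    """Turn a count of extreme permutations into a p-value (illustrative helper)."""
    if add_one:
        # Treat the observed data as one extra permutation, so p is never 0.
        return (n_extreme + 1) / (n_permutations + 1)
    return n_extreme / n_permutations


raw_p = permutation_p_value(0, 10_000, add_one=False)  # the raw fraction, 0 / 10000
adjusted_p = permutation_p_value(0, 10_000)  # 1 / 10001
print(raw_p, adjusted_p)
```

Either way, the conclusion for this dataset is unchanged: the estimated p-value is on the order of 1 in 10,000 or smaller.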