# Introduction

We verified the utility of our transfer scheme (features, model configuration, hyper-parameters, and the knowledge transfer process) by validating its generalization performance under two application scenarios: basic quantification of source contributions in a general scenario, and quantification of human-associated source contributions in a context-dependent scenario. To assess this performance systematically, we introduced two datasets: a combined dataset consisting of 125,827 samples collected from 124 biomes, and a human dataset consisting of 53,553 samples collected from 27 human-associated biomes. Through random cross-validation, we found that the EXPERT model was able to quantify source contributions for communities and to qualitatively identify their biome sources, confirming the utility of our transfer scheme in such context-dependent applications.
We also characterized the impact of knowledge transfer and fine-tuning on such context-dependent applications in three facets: efficiency, accuracy, and the variance of accuracy. The results showed that knowledge transfer with fine-tuning improves both the robustness and the accuracy of the model, enabling more robust quantification of source contributions and more accurate identification of the biome sources of query samples than independent optimization. Notably, fine-tuning comes at a cost: three times as much time is spent on the optimization. This is largely due to the low learning rate (1 × 10⁻⁵) used during fine-tuning. Given the gains in robustness and accuracy, however, we use fine-tuning as the default setting in the following sections.
We further assessed the performance of a recent source tracking method, FEAST, on the same datasets. We considered only the bottom layer of the human ontology as potential sources, because FEAST cannot handle associations between sources. We also had to trade off some of its accuracy to complete the experiment in an acceptable time (30 days): for each biome, we used xxx randomly selected samples as potential sources for FEAST. The results showed that FEAST was much slower than EXPERT and also less accurate, as it was largely bottlenecked when faced with a large number of samples. Considering that SourceTracker is more than ten times slower than FEAST with similar accuracy [REF], we deem EXPERT both faster and more accurate than these two unsupervised learning methods.

# Reproducibility statement

- EXPERT supports completely reproducible optimization & inference.
- Processed data are provided for reproducing the results; the original data can be found under `dataFiles/`.
- Rerunning the entire notebook with the configuration below should yield **completely consistent** results (compared to those reported in our paper).
- Session information
    - EXPERT (version 0.3)
    - Python (version 3.8.2)
    - TensorFlow (version 2.3.1)
    - Pandas (version 1.1.3)
    - NumPy (version 1.18.5)
    - ETE3 (version 3.1.2)
    - NCBI taxonomy database (released [2020-09-01](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/))
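
Because exact reproduction depends on matching the versions listed above, it may help to compare them against the installed packages before rerunning the notebook. The following is a minimal sketch, not part of EXPERT: `EXPECTED` and `check_versions` are hypothetical names, and it assumes the packages are installed under their usual PyPI distribution names. EXPERT itself is left out because its distribution name is not stated here.

```python
from importlib import metadata

# Versions copied from the session information above.
# Keys are assumed to be the PyPI distribution names.
EXPECTED = {
    "tensorflow": "2.3.1",
    "pandas": "1.1.3",
    "numpy": "1.18.5",
    "ete3": "3.1.2",
}


def check_versions(expected, lookup=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in check_versions(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

An empty result means every listed package matches; any printed line points at a version that could explain a deviation from the reported results.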

In [39]:
%%bash
for i in {1,2,3,4,5}; do
    time expert train -i experiments/exp_$i/SourceCM_General.h5 \
        -l experiments/exp_$i/SourceLabels_General.h5 \
        -t ontology.pkl \
        -o experiments/exp_$i/General;
    time expert train -i experiments/exp_$i/SourceCM.h5 \
        -l experiments/exp_$i/SourceLabels.h5 \
        -t ontology_human.pkl \
        -o experiments/exp_$i/Independent;
    time expert transfer -i experiments/exp_$i/SourceCM.h5 \
        -l experiments/exp_$i/SourceLabels.h5 \
        -t ontology_human.pkl \
        -m experiments/exp_$i/General \
        -o experiments/exp_$i/Transfer_GM0 \
        --update-statistics;
    time expert transfer -i experiments/exp_$i/SourceCM.h5 \
        -l experiments/exp_$i/SourceLabels.h5 \
        -t ontology_human.pkl \
        -m experiments/exp_$i/General \
        -o experiments/exp_$i/Transfer_GM \
        --finetune --update-statistics;
done

## Process
The following sections reproduce the results reported in our paper. For detailed configuration and interpretation of the results, please read the paper first.

### Optimization
- `--finetune`: enable fine-tuning for further optimization.
- `--update-statistics`: update statistics for Z-score standardization.
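
To illustrate why the standardization statistics might need updating when a model moves from the general data to the human data, here is a minimal sketch of Z-score standardization. It is not EXPERT's implementation: `fit_statistics`, `standardize`, the `eps` term, and the toy numbers are all assumptions made for illustration.

```python
from statistics import fmean, pstdev


def fit_statistics(samples):
    """Compute the per-feature mean and population standard deviation."""
    columns = list(zip(*samples))  # rows -> feature columns
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return means, stds


def standardize(samples, means, stds, eps=1e-8):
    """Apply z = (x - mean) / (std + eps) feature by feature."""
    return [
        [(x - m) / (s + eps) for x, m, s in zip(row, means, stds)]
        for row in samples
    ]


# Toy abundance rows (hypothetical values): two samples, two features each.
general_rows = [[1.0, 10.0], [3.0, 30.0]]
human_rows = [[10.0, 0.0], [14.0, 4.0]]

stale = fit_statistics(general_rows)  # statistics from the general data
fresh = fit_statistics(human_rows)    # statistics refreshed on the new data

print(standardize(human_rows, *stale))  # far from zero mean, unit variance
print(standardize(human_rows, *fresh))  # centered and scaled as intended
```

With stale statistics the new data is no longer centered or scaled, which is the situation that refreshing the statistics is meant to avoid.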

### Quantifying source contributions

- `--measure-unknown`: measure the contribution from unknown source(s).

In [40]:
%%bash
for i in {1,2,3,4,5}; do
    time expert search -i experiments/exp_$i/QueryCM.h5 -m experiments/exp_$i/Independent -o experiments/exp_$i/Search_Independent;
    time expert search -i experiments/exp_$i/QueryCM.h5 -m experiments/exp_$i/Transfer_GM0 -o experiments/exp_$i/Search_Transfer_GM0;
    time expert search -i experiments/exp_$i/QueryCM.h5 -m experiments/exp_$i/Transfer_GM -o experiments/exp_$i/Search_Transfer_GM;
done


real	0m26.313s
user	0m50.413s
sys	0m8.968s

real	0m17.299s
user	0m47.363s
sys	0m10.732s

real	0m16.502s
user	0m48.272s
sys	0m10.065s

real	0m16.469s
user	0m50.284s
sys	0m9.664s

real	0m15.465s
user	0m51.150s
sys	0m9.668s

real	0m16.468s
user	0m50.797s
sys	0m15.540s

real	0m18.080s
user	0m55.132s
sys	0m10.548s

real	0m15.915s
user	0m55.028s
sys	0m10.266s

real	0m17.380s
user	0m55.007s
sys	0m10.519s

real	0m16.982s
user	0m56.205s
sys	0m10.412s

real	0m16.608s
user	0m56.043s
sys	0m10.467s

real	0m16.558s
user	0m54.941s
sys	0m10.462s

real	0m18.236s
user	0m55.908s
sys	0m11.028s

real	0m14.694s
user	0m54.275s
sys	0m10.144s

real	0m16.220s
user	0m54.083s
sys	0m11.340s
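
The raw `time` output above is easier to compare once the wall-clock values are grouped by model. The sketch below assumes the timings appear in the order the commands run, so they cycle through `Independent`, `Transfer_GM0`, and `Transfer_GM` within each experiment; `real_seconds` and `mean_by_model` are hypothetical helper names.

```python
import re
from statistics import fmean

# Matches lines such as "real    0m26.313s" printed by the bash `time` keyword.
REAL_RE = re.compile(r"^real\s+(\d+)m([\d.]+)s$", re.MULTILINE)


def real_seconds(output):
    """Extract every `real` timing from the captured output, in seconds."""
    return [int(m) * 60 + float(s) for m, s in REAL_RE.findall(output)]


def mean_by_model(seconds, models):
    """Average timings, assuming they cycle through `models` in order."""
    if len(seconds) % len(models) != 0:
        raise ValueError("number of timings is not a multiple of the models")
    step = len(models)
    return {model: fmean(seconds[i::step]) for i, model in enumerate(models)}


if __name__ == "__main__":
    # First experiment only, copied from the output above.
    captured = (
        "real\t0m26.313s\nuser\t0m50.413s\nsys\t0m8.968s\n\n"
        "real\t0m17.299s\nuser\t0m47.363s\nsys\t0m10.732s\n\n"
        "real\t0m16.502s\nuser\t0m48.272s\nsys\t0m10.065s\n"
    )
    models = ["Independent", "Transfer_GM0", "Transfer_GM"]
    print(mean_by_model(real_seconds(captured), models))
```

Passing the full captured output of all five experiments gives one mean search time per model.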


### Evaluating performances
- `-S`: Set threshold for evaluation

In [44]:
%%bash
for i in {1,2,3,4,5}; do
    expert evaluate -i experiments/exp_$i/Search_Independent -l experiments/exp_$i/QueryLabels.h5 -o experiments/exp_$i/Eval_Independent -S 0 -p 10;
    expert evaluate -i experiments/exp_$i/Search_Transfer_GM0 -l experiments/exp_$i/QueryLabels.h5 -o experiments/exp_$i/Eval_Transfer_GM0 -S 0 -p 10;
    expert evaluate -i experiments/exp_$i/Search_Transfer_GM -l experiments/exp_$i/QueryLabels.h5 -o experiments/exp_$i/Eval_Transfer_GM -S 0 -p 10;
done

Reordering labels and prediction result
Reordering labels and prediction result for samples
Running evaluation...
Saving evaluation results...
Evaluating biome source: root:Host-associated
      TN  FP    FN    TP  Acc   Sn  ...  FPR   Rc   Pr   F1  ROC-AUC  F-max
t                                   ...                                    
0.00   0   0     0  5253  1.0  1.0  ...  0.0  1.0  1.0  1.0      0.0    1.0
0.01   0   0     0  5253  1.0  1.0  ...  0.0  1.0  1.0  1.0      0.0    1.0
0.02   0   0     0  5253  1.0  1.0  ...  0.0  1.0  1.0  1.0      0.0    1.0
0.03   0   0     0  5253  1.0  1.0  ...  0.0  1.0  1.0  1.0      0.0    1.0
0.04   0   0     0  5253  1.0  1.0  ...  0.0  1.0  1.0  1.0      0.0    1.0
...   ..  ..   ...   ...  ...  ...  ...  ...  ...  ...  ...      ...    ...
0.97   0   0     0  5253  1.0  1.0  ...  0.0  1.0  1.0  1.0      0.0    1.0
0.98   0   0     0  5253  1.0  1.0  ...  0.0  1.0  1.0  1.0      0.0    1.0
0.99   0   0     0  5253  1.0  1.0  ...  0.0  1.0  
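
For readers interpreting the table above, the derived columns follow from the four confusion counts at each threshold `t`. The sketch below uses the standard definitions and is not EXPERT's own code: `confusion_metrics` is a hypothetical helper, and treating a zero denominator as 0.0 is an assumption, chosen because it agrees with the `FPR` of 0.0 shown when `TN` and `FP` are both zero.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero (assumption)."""
    return num / den if den else 0.0


def confusion_metrics(tn, fp, fn, tp):
    """Derive the standard metrics from one row of confusion counts."""
    acc = safe_div(tp + tn, tn + fp + fn + tp)  # accuracy
    sn = safe_div(tp, tp + fn)                  # sensitivity, same as recall
    fpr = safe_div(fp, fp + tn)                 # false positive rate
    pr = safe_div(tp, tp + fp)                  # precision
    f1 = safe_div(2 * pr * sn, pr + sn)         # harmonic mean of Pr and Rc
    return {"Acc": acc, "Sn": sn, "FPR": fpr, "Rc": sn, "Pr": pr, "F1": f1}


# Counts from the `root:Host-associated` rows above.
print(confusion_metrics(tn=0, fp=0, fn=0, tp=5253))
# → {'Acc': 1.0, 'Sn': 1.0, 'FPR': 0.0, 'Rc': 1.0, 'Pr': 1.0, 'F1': 1.0}
```

Every query sample belongs under this root biome, so the counts and metrics are identical at every threshold; more specific biomes are where the thresholds start to matter.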

## Support
For help reproducing the results, please email huichong.me@gmail.com.