# Introduction

To assess the capability of EXPERT in source tracking newly analyzed samples or, more challengingly, samples from newly discovered biomes, we collected 34,814 samples that were analyzed by MGnify in 2020. Among them, 3,429 samples belong to 5 newly added biomes. We used the hierarchical biome classification of these samples to construct a new biome ontology, referred to as biome ontology (2020) (Fig. 3a). Comparing the identification accuracy of the two models (the transfer model and the independent model), we found that the transfer model is more robust, and that EXPERT characterizes biome composition with high accuracy under both (average AUROC of 0.989 for the transfer model and 0.985 for the independent model; average F-max of 0.973 for the transfer model and 0.965 for the independent model; Fig. 3b). In terms of time usage, training the transfer model takes relatively longer, but its average search time is shorter (Fig. 3c).

Two new biomes (`root:Host-associated:Fish` and `root:Host-associated:Birds:Digestive System:Ceca`) were chosen to determine the identification accuracy of EXPERT at specific layers of the new biome ontology. On the third layer, the contribution of the correct biome (Fish or Birds) is high, and the same holds on the fourth and fifth layers. EXPERT can therefore accurately classify samples from newly added biomes (Fig. 3d).
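To make the notion of ontology layers concrete, the following minimal sketch builds a nested dictionary from colon-separated biome paths and lists the biome found at each layer. It is an illustration only, not EXPERT's implementation: the helper names `build_ontology` and `layers` are hypothetical, and counting `root` as layer 1 is an assumption chosen so that Fish and Birds fall on the third layer, as described above.

```python
# Minimal sketch: a biome ontology as a nested dict built from colon-separated paths.
# Hypothetical helpers for illustration; this is not how EXPERT stores its ontology.

def build_ontology(paths):
    """Return a nested dict where each key is a biome and each value holds its children."""
    tree = {}
    for path in paths:
        node = tree
        for name in (part.strip() for part in path.split(":")):
            node = node.setdefault(name, {})
    return tree


def layers(path):
    """Map each layer number (root counted as layer 1, an assumption) to its biome prefix."""
    parts = [part.strip() for part in path.split(":")]
    return {depth: ":".join(parts[:depth]) for depth in range(1, len(parts) + 1)}


# The two new biomes named in the text.
biomes = [
    "root:Host-associated:Fish",
    "root:Host-associated:Birds:Digestive System:Ceca",
]

ontology = build_ontology(biomes)
ceca_layers = layers(biomes[1])

print(ontology)
for depth, biome in ceca_layers.items():
    print(depth, biome)
```

Under this layer numbering, the Fish path ends on the third layer, while the Ceca path continues through the fourth and fifth layers.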


# Reproducibility statement

- EXPERT supports completely reproducible optimization & inference.
- Processed data are provided for reproducing the results; the original data can be found under `dataFiles/`.
- Rerunning the entire notebook with the configuration below should yield results that are **completely consistent** with those reported in our paper.
- Session information
    - EXPERT (version 0.3)
    - Python (version 3.8.2)
    - TensorFlow (version 2.3.1)
    - Pandas (version 1.1.3)
    - NumPy (version 1.18.5)
    - ETE3 (version 3.1.2)
    - NCBI taxonomy database (released [2020-09-01](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/))

## Process
The following sections reproduce the results reported in our paper. For detailed configuration and interpretation of the results, please read the paper first.

### Optimization
- `--finetune`: enable fine-tuning for further optimization.
- `--update-statistics`: update statistics for Z-score standardization.
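For readers unfamiliar with Z-score standardization, here is a minimal sketch of the general technique: per-feature means and standard deviations are estimated from one dataset and then used to rescale samples. It only illustrates the concept; how EXPERT stores or updates these statistics internally is not shown here, and the function names and toy abundance values are hypothetical.

```python
from statistics import mean, pstdev

# Minimal sketch of Z-score standardization with stored statistics.
# Hypothetical helpers and toy data; not EXPERT's internal code.

def fit_statistics(rows):
    """Estimate the per-feature mean and standard deviation (one value per column)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(rows, means, stds):
    """Rescale every value to (x - mean) / std using the given statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy abundance matrix: rows are samples, columns are features.
source = [
    [0.10, 0.90, 0.0],
    [0.30, 0.70, 0.0],
    [0.20, 0.80, 0.0],
]

means, stds = fit_statistics(source)
z_scores = standardize(source, means, stds)

for row in z_scores:
    print([round(value, 4) for value in row])
```

Recomputing `means` and `stds` on a new dataset before standardizing it is the conceptual counterpart of updating the statistics.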

In [1]:
%%bash
for i in {0,1,2,3,4};do
    time expert train -i experiments/exp_$i/SourceCM.h5 -t ontology.pkl -l experiments/exp_$i/SourceLabels.h5 \
        -o experiments/exp_$i/Independent;
    time expert transfer -i experiments/exp_$i/SourceCM.h5 -t ontology.pkl -l experiments/exp_$i/SourceLabels.h5\
        -o experiments/exp_$i/Transfer_GM --finetune;
done

2021-01-07 21:38:54.910253: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-01-07 21:38:54.919880: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2499960000 Hz
2021-01-07 21:38:54.921624: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x562f5f045d50 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-01-07 21:38:54.92

### Quantifying source contributions

- `--measure-unknown`: measure the contribution from unknown source(s).

In [2]:
%%bash
for i in {0,1,2,3,4};do
    time expert search -i experiments/exp_$i/QueryCM.h5 -m experiments/exp_$i/Independent -o experiments/exp_$i/Search_Independent;
    time expert search -i experiments/exp_$i/QueryCM.h5 -m experiments/exp_$i/Transfer_GM -o experiments/exp_$i/Search_Transfer_GM;
done


real	0m21.718s
user	0m38.110s
sys	0m8.265s

real	0m15.586s
user	0m35.269s
sys	0m8.714s

real	0m15.396s
user	0m36.101s
sys	0m8.910s

real	0m14.380s
user	0m35.923s
sys	0m8.493s

real	0m17.099s
user	0m36.158s
sys	0m9.478s

real	0m16.510s
user	0m34.404s
sys	0m9.078s

real	0m14.953s
user	0m33.850s
sys	0m9.078s

real	0m13.318s
user	0m34.704s
sys	0m10.454s

real	0m15.174s
user	0m35.410s
sys	0m9.440s

real	0m12.910s
user	0m35.241s
sys	0m9.243s
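The timings above can be summarized per model with a short script. The sketch below parses the `real` lines of the `time` output and averages them, assuming the entries alternate between the Independent and Transfer_GM models in the order the commands appear in the loop. The values are copied from the output above; the helper name `to_seconds` is hypothetical.

```python
import re

# Wall-clock ("real") times copied from the output above, in printed order.
TIMING_OUTPUT = """
real 0m21.718s
real 0m15.586s
real 0m15.396s
real 0m14.380s
real 0m17.099s
real 0m16.510s
real 0m14.953s
real 0m13.318s
real 0m15.174s
real 0m12.910s
"""


def to_seconds(minutes, seconds):
    """Convert the minutes and seconds fields of a `time` line into seconds."""
    return int(minutes) * 60 + float(seconds)


matches = re.findall(r"real\s+(\d+)m([\d.]+)s", TIMING_OUTPUT)
real_times = [to_seconds(m, s) for m, s in matches]

# Assumption: each loop iteration searches with Independent first, then Transfer_GM.
by_model = {
    "Independent": real_times[0::2],
    "Transfer_GM": real_times[1::2],
}
averages = {model: sum(times) / len(times) for model, times in by_model.items()}

for model, avg in averages.items():
    print(f"{model}: {avg:.3f} s on average over {len(by_model[model])} runs")
```

Only the wall-clock time is averaged here; the `user` and `sys` lines could be handled the same way by changing the leading keyword in the regular expression.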


### Evaluating performances
- `-S`: Set threshold for evaluation
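As a reading aid for the evaluation tables in this notebook, the sketch below recomputes precision, recall and F1 from the confusion counts at each threshold and takes the maximum F1 as F-max. It uses the standard definitions of these metrics and only the first five rows printed for `root:Environmental:Terrestrial`, so the resulting maximum covers those rows alone and is not the F-max over all thresholds. The function name is hypothetical, and returning 0.0 for an undefined F1 is a choice made here (the tool's own output prints NaN in that case).

```python
# Minimal sketch: per-threshold precision, recall, F1, and F-max as the best F1.
# Standard metric definitions; not taken from EXPERT's source code.

def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1); undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# (threshold, TN, FP, FN, TP) copied from the first rows of the evaluation output.
rows = [
    (0.00, 0, 3293, 0, 126),
    (0.01, 2880, 412, 0, 126),
    (0.02, 2882, 410, 0, 126),
    (0.03, 2972, 320, 0, 126),
    (0.04, 3146, 146, 0, 125),
]

scores = {}
for threshold, tn, fp, fn, tp in rows:
    scores[threshold] = precision_recall_f1(tp, fp, fn)

# F-max over the listed thresholds: the threshold whose F1 is largest.
best_threshold = max(scores, key=lambda t: scores[t][2])
f_max = scores[best_threshold][2]

for threshold, (precision, recall, f1) in scores.items():
    print(f"t={threshold:.2f}  Pr={precision:.4f}  Rc={recall:.4f}  F1={f1:.4f}")
print(f"F-max over these rows: {f_max:.4f} at t={best_threshold:.2f}")
```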

In [3]:
%%bash
for i in {0,1,2,3,4};do
    expert evaluate -i experiments/exp_$i/Search_Independent -l experiments/exp_$i/QueryLabels.h5 -o experiments/exp_$i/Eval_Independent -S 0 -p 10; 
    expert evaluate -i experiments/exp_$i/Search_Transfer_GM -l experiments/exp_$i/QueryLabels.h5 -o experiments/exp_$i/Eval_Transfer_GM -S 0 -p 10;
done

Reordering labels and prediction result
Reordering labels and prediction result for samples
Running evaluation...
Saving evaluation results...
Evaluating biome source: root:Environmental:Terrestrial
        TN    FP   FN   TP     Acc  ...   Rc      Pr      F1  ROC-AUC   F-max
t                                   ...                                      
0.00     0  3293    0  126  0.0369  ...  1.0  0.0369  0.0711   0.9999  0.9839
0.01  2880   412    0  126  0.8795  ...  1.0  0.2342  0.3795   0.9999  0.9839
0.02  2882   410    0  126  0.8800  ...  1.0  0.2351  0.3807   0.9999  0.9839
0.03  2972   320    0  126  0.9064  ...  1.0  0.2825  0.4406   0.9999  0.9839
0.04  3146   146    0  125  0.9573  ...  1.0  0.4613  0.6313   0.9999  0.9839
...    ...   ...  ...  ...     ...  ...  ...     ...     ...      ...     ...
0.97  3293     0  126    0  0.9631  ...  0.0  0.0000     NaN   0.9999  0.9839
0.98  3293     0  126    0  0.9631  ...  0.0  0.0000     NaN   0.9999  0.9839
0.99  3293     0  126


## Support
For help reproducing the results, please email huichong.me@gmail.com.