# Introduction

To assess how well EXPERT tracks the sources of newly analyzed samples and, more challengingly, samples from newly discovered biomes, we collected 34,814 samples that were analyzed by MGnify in 2020. Of these, 3,429 samples belong to 5 newly added biomes. We used the hierarchical biome classification of these samples to construct a new biome ontology, referred to as biome ontology (2020) (Fig. 3a). We then compared the identification accuracy of two models, an independent model and a transfer model. The transfer model was more robust than the independent model, and both characterized biome composition with high accuracy (average AUROC of 0.989 for the transfer model and 0.985 for the independent model; average F-max of 0.973 for the transfer model and 0.965 for the independent model; Fig. 3b). Although the transfer model took relatively longer to train, its average search time was shorter (Fig. 3c).

To determine the identification accuracy of EXPERT at specific layers of the new biome ontology, we chose two new biomes (root: Host-associated: Fish and root: Host-associated: Birds: Digestive System: Ceca). On the third layer, the contribution of the correct biome (Fish and Birds) was high, and the same held for the fourth and fifth layers. EXPERT can therefore accurately classify samples from newly added biomes (Fig. 3d).
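The two metrics quoted above can be illustrated on toy data. The following is a minimal sketch, not EXPERT's own evaluation code: it assumes AUROC is the probability that a positive sample outscores a negative one (ties count as one half), and that F-max is the maximum F1 score over a grid of thresholds. The function names, the threshold grid, and the example labels and scores are all hypothetical.

```python
from typing import Optional, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based AUROC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f_max(labels: Sequence[int], scores: Sequence[float],
          thresholds: Optional[Sequence[float]] = None) -> float:
    """Maximum F1 over thresholds (assumed definition of F-max)."""
    if thresholds is None:
        thresholds = [i / 100 for i in range(101)]  # hypothetical grid 0.00..1.00
    best = 0.0
    for t in thresholds:
        predicted = [s >= t for s in scores]
        tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p)
        fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p)
        fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and not p)
        if tp == 0:
            continue  # F1 is zero or undefined without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy example: does each sample belong to one biome (1) or not (0)?
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]  # hypothetical predicted contributions
print("AUROC:", auroc(toy_labels, toy_scores))
print("F-max:", f_max(toy_labels, toy_scores))
```

For this toy input the AUROC is 0.75 and the F-max is about 0.8. The values reported in the paper are averages over biomes, which this sketch does not reproduce.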


# Reproducibility statement

- EXPERT supports completely reproducible optimization & inference.
- Processed data are provided for reproducing the results; the original data can be found under `dataFiles/`.
- Rerunning the entire notebook with the configuration below should yield **completely consistent** results (compared to those reported in our paper).
- Session information
    - EXPERT (version 0.3)
    - Python (version 3.8.2)
    - TensorFlow (version 2.3.1)
    - Pandas (version 1.1.3)
    - NumPy (version 1.18.5)
    - ETE3 (version 3.1.2)
    - NCBI taxonomy database (released [2020-09-01](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/))

## Process
The following sections reproduce the results reported in our paper. For detailed configuration and interpretation of the results, please read the paper first.

### Optimization
- `--finetune`: enable fine-tuning for further optimization.
- `--update-statistics`: update statistics for Z-score standardization.
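
The `--update-statistics` option refers to Z-score standardization. As a rough illustration of the general idea only (how EXPERT stores and updates these statistics internally is not shown in this notebook), the sketch below standardizes one feature with statistics estimated on source data, then re-estimates them on new data. All function names and values are hypothetical.

```python
from statistics import mean, pstdev


def fit_statistics(values):
    """Return the mean and population standard deviation of one feature."""
    return mean(values), pstdev(values)


def standardize(values, mu, sigma, eps=1e-8):
    """Apply z = (x - mu) / sigma; eps guards against a zero deviation."""
    return [(x - mu) / (sigma + eps) for x in values]


source = [0.10, 0.20, 0.30, 0.40]  # hypothetical abundances in source samples
target = [0.50, 0.60, 0.70, 0.80]  # hypothetical abundances in new samples

# Reusing the source statistics leaves the new samples shifted away from zero.
mu_src, sigma_src = fit_statistics(source)
reused = standardize(target, mu_src, sigma_src)

# Updating the statistics re-centers the new samples around zero.
mu_new, sigma_new = fit_statistics(target)
updated = standardize(target, mu_new, sigma_new)

print("reused statistics: ", [round(z, 2) for z in reused])
print("updated statistics:", [round(z, 2) for z in updated])
```

Whether updating is appropriate depends on how far the new samples differ from the source samples; the paper discusses the actual configuration.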

In [14]:
%%bash
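# Train both models on each of the five experiments (exp_0 to exp_4).
for i in {0..4}; do
    # Independent model, written to experiments/exp_$i/Independent.
    time expert train -i experiments/exp_$i/SourceCM.h5 -t ontology.pkl -l experiments/exp_$i/SourceLabels.h5 \
        -o experiments/exp_$i/Independent
    # Transfer model with fine-tuning enabled, written to experiments/exp_$i/Transfer_GM.
    time expert transfer -i experiments/exp_$i/SourceCM.h5 -t ontology.pkl -l experiments/exp_$i/SourceLabels.h5 \
        -o experiments/exp_$i/Transfer_GM --finetune
done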


### Quantifying source contributions

- `--measure-unknown`: measure the contribution from unknown source(s).

In [15]:
%%bash
# Quantify source contributions for the query samples with both models.
for i in {0..4}; do
    time expert search -i experiments/exp_$i/QueryCM.h5 -m experiments/exp_$i/Independent \
        -o experiments/exp_$i/Search_Independent
    time expert search -i experiments/exp_$i/QueryCM.h5 -m experiments/exp_$i/Transfer_GM \
        -o experiments/exp_$i/Search_Transfer_GM
done




real	0m17.152s
user	0m36.286s
sys	0m8.686s

real	0m12.698s
user	0m35.878s
sys	0m8.884s

real	0m12.516s
user	0m35.541s
sys	0m8.321s

real	0m13.140s
user	0m35.809s
sys	0m11.519s

real	0m12.624s
user	0m35.923s
sys	0m8.020s

real	0m14.093s
user	0m35.844s
sys	0m7.984s

real	0m13.003s
user	0m35.955s
sys	0m8.436s

real	0m12.810s
user	0m35.601s
sys	0m9.788s

real	0m12.877s
user	0m35.347s
sys	0m8.383s

real	0m12.644s
user	0m35.502s
sys	0m9.125s
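
The timings above can be summarized per model. The following is a small sketch that parses the `real` lines and averages them. It assumes the timings were printed in loop order, alternating between the Independent and Transfer_GM searches; the notebook output itself does not label them, so treat the grouping as an assumption. The helper name is hypothetical.

```python
import re
from statistics import mean

# `real` lines copied from the output above, in the order they were printed.
TIME_OUTPUT = """
real 0m17.152s
real 0m12.698s
real 0m12.516s
real 0m13.140s
real 0m12.624s
real 0m14.093s
real 0m13.003s
real 0m12.810s
real 0m12.877s
real 0m12.644s
"""


def parse_real_seconds(text):
    """Convert every 'real XmY.YYYs' line into seconds."""
    pattern = re.compile(r"^real\s+(\d+)m([\d.]+)s$", re.MULTILINE)
    return [int(minutes) * 60 + float(secs) for minutes, secs in pattern.findall(text)]


seconds = parse_real_seconds(TIME_OUTPUT)

# Assumption: within each loop pass, Independent runs first and Transfer_GM second.
independent = seconds[0::2]
transfer = seconds[1::2]

print(f"Independent mean search time: {mean(independent):.3f} s")
print(f"Transfer_GM mean search time: {mean(transfer):.3f} s")
```

Under that ordering assumption, the means are about 13.63 s for the independent model and 13.08 s for the transfer model in this run.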


### Evaluating performances
- `-S`: set the threshold for evaluation.

In [18]:
%%bash
# Evaluate the search results of both models against the query labels.
for i in {0..4}; do
    expert evaluate -i experiments/exp_$i/Search_Independent -l experiments/exp_$i/QueryLabels.h5 \
        -o experiments/exp_$i/Eval_Independent -S 0 -p 10
    expert evaluate -i experiments/exp_$i/Search_Transfer_GM -l experiments/exp_$i/QueryLabels.h5 \
        -o experiments/exp_$i/Eval_Transfer_GM -S 0 -p 10
done

Reordering labels and prediction result
Reordering labels and prediction result for samples
Running evaluation...
['root:Engineered', 'root:Environmental', 'root:Host-associated']
['root:Engineered:Food_production', 'root:Environmental:Aquatic', 'root:Environmental:Terrestrial', 'root:Host-associated:Birds', 'root:Host-associated:Fish', 'root:Host-associated:Human', 'root:Host-associated:Mammals', 'root:Host-associated:Plants']
['root:Engineered:Food_production:Dairy_products', 'root:Environmental:Aquatic:Thermal_springs', 'root:Environmental:Terrestrial:Deep_subsurface', 'root:Environmental:Terrestrial:Soil', 'root:Host-associated:Birds:Digestive_system', 'root:Host-associated:Human:Digestive_system', 'root:Host-associated:Human:Reproductive_system', 'root:Host-associated:Human:Respiratory_system', 'root:Host-associated:Human:Skin', 'root:Host-associated:Mammals:Digestive_system', 'root:Host-associated:Mammals:Gastrointestinal_tract', 'root:Host-associated:Plants:Root']
['root:Host-as
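
Each list printed above holds the biomes of one ontology layer. As a sketch of how such a grouping can be derived from the colon-separated biome paths, the function below counts `root` as layer 1, which appears consistent with the introduction placing Fish and Birds on the third layer. The function name is hypothetical, and this is not necessarily how `expert evaluate` groups its labels.

```python
from collections import defaultdict


def group_by_layer(biomes):
    """Group colon-separated biome paths by depth, counting 'root' as layer 1."""
    layers = defaultdict(list)
    for biome in biomes:
        layer = biome.count(":") + 1  # e.g. 'root:Host-associated:Fish' -> 3
        layers[layer].append(biome)
    return {layer: sorted(names) for layer, names in sorted(layers.items())}


# A few biome paths taken from the evaluation output above.
example = [
    "root:Host-associated:Fish",
    "root:Host-associated",
    "root:Host-associated:Birds",
    "root:Host-associated:Birds:Digestive_system",
]

for layer, names in group_by_layer(example).items():
    print(layer, names)
```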


## Support
For help reproducing the results, please email: huichong.me@gmail.com.