<center>
<h1>Analyzing Bee Taxonomy:</h1>
<h2>Integrating GBIF, NCBI and iNaturalist Data for Apidae Insights</h2>
</center>

![title](https://live.staticflickr.com/4059/4632384645_a2230b26d5_b.jpg)

This Python notebook integrates taxonomic data from three major biological databases: GBIF (Global Biodiversity Information Facility), NCBI (National Center for Biotechnology Information), and iNaturalist. The goal is to improve the accuracy and comprehensiveness of ecological and biological research. GBIF primarily focuses on biodiversity data, including species distribution and ecological information, whereas NCBI provides a broader range of data, including genomic and taxonomic details. iNaturalist, on the other hand, is one of the most important citizen science projects for collecting biological data.

Combining these sources enables researchers to cross-validate species identifications and improve the richness of ecological datasets with genetic information. A key biological task performed in this notebook is the construction of a taxonomic tree, which helps in visualizing and understanding the evolutionary relationships and classification hierarchy among different species within a chosen taxon (in this case, Apidae - a family of bees).

## 1. Importing libraries

In [29]:
import pandas as pd
import warnings

pd.set_option('display.max_colwidth', None)
warnings.simplefilter(action='ignore', category=FutureWarning)

In [30]:
import taxonmatch as txm

## 2. Downloading and processing samples

The initial steps involve downloading the most recent taxonomic data from GBIF and NCBI to ensure the analysis is based on the latest available information. 

In [31]:
gbif_dataset = txm.download_gbif_taxonomy()

GBIF backbone taxonomy data already downloaded.
Processing samples...
Done.


In [32]:
ncbi_dataset = txm.download_ncbi_taxonomy()

NCBI taxonomy data already downloaded.
Processing samples...
Done.


## 3.a Training the classifier model

If required, the notebook outlines steps to train a machine learning classifier to distinguish between correct and incorrect taxonomic matches. This involves generating positive and negative examples, preparing the training dataset, and comparing different models.

In [33]:
positive_matches = txm.generate_positive_set(gbif_dataset, ncbi_dataset, 500)

Generating positive set: 100.0%


In [34]:
negative_matches = txm.generate_negative_set(gbif_dataset, ncbi_dataset, 500)

Generating negative set: 100.0%


In [35]:
full_training_set = txm.prepare_data(positive_matches, negative_matches)

In [36]:
X_train, X_test, y_train, y_test = txm.generate_training_test(full_training_set)

In [37]:
txm.compare_models(X_train, X_test, y_train, y_test)

| | model | accuracy | mae | precision | recall | f1 | roc | run_time | tp | fp | tn | fn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RandomForestClassifier | 0.963333 | 0.036667 | 0.954248 | 0.973333 | 0.963696 | 0.963333 | 0.0 | 143 | 7 | 146 | 4 |
| 1 | XGBClassifier | 0.96 | 0.04 | 0.948052 | 0.973333 | 0.960526 | 0.96 | 0.0 | 142 | 8 | 146 | 4 |
| 2 | GradientBoostingClassifier | 0.95 | 0.05 | 0.94702 | 0.953333 | 0.950166 | 0.95 | 0.0 | 142 | 8 | 143 | 7 |
| 3 | DecisionTreeClassifier | 0.936667 | 0.063333 | 0.939597 | 0.933333 | 0.936455 | 0.936667 | 0.0 | 141 | 9 | 140 | 10 |
| 4 | KNeighborsClassifier | 0.936667 | 0.063333 | 0.933775 | 0.94 | 0.936877 | 0.936667 | 0.0 | 140 | 10 | 141 | 9 |
| 5 | AdaBoostClassifier | 0.906667 | 0.093333 | 0.886076 | 0.933333 | 0.909091 | 0.906667 | 0.0 | 132 | 18 | 140 | 10 |
| 6 | MLPClassifier | 0.89 | 0.11 | 0.96063 | 0.813333 | 0.880866 | 0.89 | 0.0 | 145 | 5 | 122 | 28 |
| 7 | Perceptron | 0.83 | 0.17 | 0.756477 | 0.973333 | 0.851312 | 0.83 | 0.0 | 103 | 47 | 146 | 4 |
| 8 | SVC | 0.82 | 0.18 | 0.775862 | 0.9 | 0.833333 | 0.82 | 0.0 | 111 | 39 | 135 | 15 |
| 9 | DummyClassifier | 0.533333 | 0.466667 | 0.534722 | 0.513333 | 0.52381 | 0.533333 | 0.0 | 83 | 67 | 77 | 73 |
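
The metric columns in this comparison follow from the confusion counts in the usual way. The sketch below is illustrative and not part of `taxonmatch`; the helper name `classification_metrics` is hypothetical. One observation from the arithmetic: for the RandomForestClassifier row, the reported accuracy, precision, recall, and F1 are reproduced when 146 is read as true positives and 143 as true negatives, which suggests the `tp` and `tn` labels may be swapped in this output. That is an inference from the numbers, not something the notebook states.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive standard binary-classification metrics from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # share of predicted matches that are correct
    recall = tp / (tp + fn)      # share of true matches that were found
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts from the RandomForestClassifier row, with 146 assumed to be the
# true positives and 143 the true negatives (see the note above).
metrics = classification_metrics(tp=146, fp=7, tn=143, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

The four counts sum to 300, so the held-out test set appears to contain 300 labeled pairs.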


In [38]:
model = txm.XGBClassifier(learning_rate=0.1, n_estimators=500, max_depth=9, n_jobs=-1, colsample_bytree=1, subsample=0.8)

In [39]:
model.fit(X_train, y_train, verbose=False)

## 3.b Load a pre-trained model

Alternatively, the notebook can load a pre-trained model, which simplifies routine analyses and makes the training steps in section 3.a unnecessary.

In [40]:
model = txm.load_xgb_model()

## 4. Match NCBI with GBIF dataset 

This section compares and aligns the taxonomic data from the NCBI and GBIF datasets. It restricts the analysis to the taxon "Apidae", a single family of bees. Using a pre-trained machine learning model, the notebook matches records from both datasets and categorizes them as exact matches, unmatched records, or potentially mislabeled records caused by typographical errors.

In [41]:
gbif_apidae, ncbi_apidae = txm.select_taxonomic_clade("Apidae", gbif_dataset, ncbi_dataset)

In [42]:
matched_df, unmatched_df, possible_typos_df = txm.match_dataset(gbif_apidae, ncbi_apidae, model, tree_generation = True)
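
To make the three output categories concrete, here is a minimal sketch that sorts names into exact matches, possible typos, and unmatched names using plain string similarity. It is an assumption-laden stand-in, not how `txm.match_dataset` works (the notebook uses a trained classifier); the function `categorize_names`, the 0.9 threshold, and the misspelled name "Bombus wurfleni" are all made up for illustration.

```python
from difflib import SequenceMatcher


def categorize_names(gbif_names, ncbi_names, typo_threshold=0.9):
    """Split GBIF names into exact matches, possible typos, and unmatched."""
    gbif = {name.lower() for name in gbif_names}
    ncbi = {name.lower() for name in ncbi_names}

    exact = sorted(gbif & ncbi)
    remaining_ncbi = sorted(ncbi - gbif)
    possible_typos, unmatched = [], []

    for name in sorted(gbif - ncbi):
        # Score every remaining NCBI name and keep the most similar one.
        scored = [
            (SequenceMatcher(None, name, candidate).ratio(), candidate)
            for candidate in remaining_ncbi
        ]
        best_score, best_name = max(scored, default=(0.0, None))
        if best_score >= typo_threshold:
            possible_typos.append((name, best_name))
        else:
            unmatched.append(name)

    return exact, possible_typos, unmatched


exact, possible_typos, unmatched = categorize_names(
    ["Bombus wurflenii", "Bombus alpinus", "Afromelecta bicuspis"],
    ["Bombus wurfleni", "Bombus alpinus", "Apis mellifera"],  # first name is a made-up typo
)
print(exact)
print(possible_typos)
print(unmatched)
```

A learned model can weigh more evidence than a single similarity score, for example the surrounding lineage, which is presumably why the notebook trains or loads a classifier instead of relying on a fixed threshold.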

## 5. Generate the taxonomical tree 

In this section, the notebook constructs a taxonomic tree from the matched and unmatched data between the GBIF and NCBI datasets, focusing on the Apidae family. This visual representation helps to illustrate the evolutionary relationships and classification hierarchy among the species. The tree is then converted into a dataframe for further analysis and saved in textual format for documentation and review purposes.

In [43]:
apidae_tree = txm.generate_taxonomic_tree(matched_df, unmatched_df)
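
As a rough illustration of what building and printing such a tree involves, the sketch below assembles a nested dictionary from lineage paths and renders it with box-drawing characters. The functions `build_tree` and `render_tree` are hypothetical helpers, not `taxonmatch` functions, and the internal tree representation used by the library may differ.

```python
def build_tree(lineages):
    """Build a nested dict where each key is a taxon and each value its children."""
    root = {}
    for lineage in lineages:
        node = root
        for taxon in lineage:
            node = node.setdefault(taxon, {})
    return root


def render_tree(tree, prefix=""):
    """Return the tree as a list of text lines with branch connectors."""
    lines = []
    names = sorted(tree)
    for position, name in enumerate(names):
        is_last = position == len(names) - 1
        lines.append(prefix + ("└── " if is_last else "├── ") + name)
        child_prefix = prefix + ("    " if is_last else "│   ")
        lines.extend(render_tree(tree[name], child_prefix))
    return lines


# A few lineages taken from the Apidae output in this notebook.
lineages = [
    ["apidae", "afromelecta", "afromelecta bicuspis"],
    ["apidae", "afromelecta", "afromelecta fulvohirta"],
    ["apidae", "anthidulum", "anthidulum rozeni"],
]
tree = build_tree(lineages)
tree_lines = render_tree(tree)
print("\n".join(tree_lines))
```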

In [44]:
df_apidae = txm.convert_tree_to_dataframe(apidae_tree, gbif_apidae, ncbi_apidae, "apidae_taxonomic_tree_df.txt")

In [45]:
txm.save_tree(apidae_tree, "apidae_tree.txt")

The tree is saved as TXT in the file: apidae_tree.txt.


In [47]:
txm.print_tree(apidae_tree)


└── apidae (NCBI ID: 7458, GBIF ID: 4334)
    ├── aethammobates (GBIF ID: 1345289)
    │   └── aethammobates prionogaster (GBIF ID: 1345290)
    ├── aethemelikertes (GBIF ID: 11218031)
    │   └── aethemelikertes emunctorii (GBIF ID: 11141004)
    ├── afromelecta (GBIF ID: 1339758)
    │   ├── afromelecta bicuspis (GBIF ID: 1339760)
    │   ├── afromelecta fulvohirta (GBIF ID: 1339761)
    │   └── afromelecta lieftincki (GBIF ID: 1339759)
    ├── amelikertotes (GBIF ID: 11153287)
    ├── anthidulum (GBIF ID: 4669755)
    │   └── anthidulum rozeni (GBIF ID: 8550915)
    ├── anthophorites (GBIF ID: 4671634)
    │   ├── anthophorites longaeva (GBIF ID: 8507095)
    │   ├── anthophorites mellona (GBIF ID: 8441636)
    │   ├── anthophorites thoracica (GBIF ID: 8644461)
    │   ├── anthophorites titania (GBIF ID: 8635555)
    │   ├── anthophorites tonsa (GBIF ID: 8554860)
    │   └── anthophorites veterana (GBIF ID: 8563192)
    ├── apinae (NCBI ID: 70987)
    │   ├── ancylini (NCBI ID: 481

## 5.1 Reroot the taxonomical tree

The previously constructed taxonomic tree can also be pruned (rerooted) to focus on a specific lineage within Apidae. Selecting a target taxon, in this case the genus Bombus, keeps that node and every branch below it while discarding unrelated taxa. This allows a more detailed analysis of a single clade and gives a clearer view of the evolutionary relationships and taxonomic consistency within the selected group.

In [22]:
bombus_tree = txm.reroot_tree(apidae_tree, root_name="bombus")
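
Conceptually, rerooting amounts to finding the target node and returning it with its descendants. The sketch below shows that idea on a nested dictionary; the `reroot` function is a hypothetical illustration and not the implementation behind `txm.reroot_tree`, and the toy tree contains only a handful of taxa.

```python
def reroot(tree, root_name):
    """Return the subtree rooted at root_name, or None if it is not present."""
    for name, children in tree.items():
        if name == root_name:
            return {name: children}
        found = reroot(children, root_name)  # search deeper in this branch
        if found is not None:
            return found
    return None


# Toy hierarchy: Apidae > Apinae > Bombini > Bombus, plus one unrelated genus.
toy_tree = {
    "apidae": {
        "afromelecta": {"afromelecta bicuspis": {}},
        "apinae": {
            "bombini": {
                "bombus": {
                    "alpinobombus": {"bombus alpinus": {}, "bombus balteatus": {}},
                },
            },
        },
    },
}

bombus_subtree = reroot(toy_tree, "bombus")
print(bombus_subtree)
```

Everything outside the selected clade, here `afromelecta`, is simply absent from the returned subtree.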

In [48]:
txm.print_tree(bombus_tree)

bombus (NCBI ID: 28641, GBIF ID: 1340278)
├── alpigenobombus (NCBI ID: 144729)
│   ├── bombus angustus (NCBI ID: 2870605, GBIF ID: 5734074)
│   ├── bombus breviceps (NCBI ID: 395515, GBIF ID: 1340300)
│   ├── bombus grahami (NCBI ID: 421271, GBIF ID: 1340374)
│   ├── bombus kashmirensis (NCBI ID: 395536, GBIF ID: 1340381)
│   ├── bombus nobilis (NCBI ID: 309969, GBIF ID: 1340535)
│   └── bombus wurflenii (NCBI ID: 85670, GBIF ID: 1340358)
│       ├── bombus wurflenii flavicans (GBIF ID: 12196622)
│       ├── bombus wurflenii mastrucatus (GBIF ID: 9163608)
│       ├── bombus wurflenii pyrenaicus (GBIF ID: 8872565)
│       └── bombus wurflenii wurflenii (GBIF ID: 9069794)
├── alpinobombus (NCBI ID: 144707)
│   ├── bombus alpinus (NCBI ID: 309942, GBIF ID: 1340325)
│   ├── bombus balteatus (NCBI ID: 85657, GBIF ID: 1340403)
│   ├── bombus hyperboreus (NCBI ID: 85662, GBIF ID: 1340361)
│   ├── bombus kirbiellus (NCBI ID: 1772339, GBIF ID: 10409744)
│   ├── bombus kluanensis (NCBI ID: 25184

## 6. Add iNaturalist Information

In this final section, the previously curated dataset, which integrates taxonomic information from both GBIF and NCBI, will be further enriched by incorporating data from iNaturalist. This additional dataset will provide valuable community-driven observations, complementing the existing taxonomy with real-world records contributed by citizen scientists and researchers. The ultimate result of this process will be the construction of a comprehensive taxonomic tree that includes unique identifiers from all three datasets—GBIF, NCBI, and iNaturalist—ensuring a more robust and harmonized representation of Apidae taxonomy.

In [57]:
inat_dataset = txm.download_inat_taxonomy()

iNaturalist taxonomy data already downloaded.
Processing samples...
Done.


In [50]:
inat_apidae = txm.select_inat_clade(inat_dataset, "Apidae")

In [51]:
inat_tree = txm.add_inat_taxonomy(apidae_tree, inat_apidae)
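
The enrichment step can be pictured as attaching a third identifier to each taxon by name, and adding taxa that only the new source knows about. The following sketch works on a flat name-to-IDs mapping, which is a simplifying assumption: `txm.add_inat_taxonomy` operates on the tree itself and also has to decide where new taxa belong. The helper `add_source_ids` is hypothetical; the IDs are copied from the printed trees in this notebook.

```python
def add_source_ids(nodes, source_ids, source_label):
    """Return a copy of nodes with IDs from another source merged in by name."""
    enriched = {name: dict(ids) for name, ids in nodes.items()}
    for name, taxon_id in source_ids.items():
        # Taxa missing from the existing mapping are added with only the new ID.
        enriched.setdefault(name, {})[source_label] = taxon_id
    return enriched


nodes = {
    "apidae": {"NCBI ID": 7458, "GBIF ID": 4334},
    "aethammobates": {"GBIF ID": 1345289},
    "aethemelikertes": {"GBIF ID": 11218031},
}
inat_ids = {
    "apidae": 47221,
    "aethammobates": 574511,
    "acanthomelecta": 578272,  # present only in iNaturalist in the output below
}

enriched = add_source_ids(nodes, inat_ids, "iNaturalist ID")
for name, ids in sorted(enriched.items()):
    print(name, ids)
```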

In [53]:
txm.print_tree(inat_tree)


└── apidae (NCBI ID: 7458, GBIF ID: 4334, iNaturalist ID: 47221)
    ├── aethammobates (GBIF ID: 1345289, iNaturalist ID: 574511)
    │   └── aethammobates prionogaster (GBIF ID: 1345290)
    ├── aethemelikertes (GBIF ID: 11218031)
    │   └── aethemelikertes emunctorii (GBIF ID: 11141004)
    ├── afromelecta (GBIF ID: 1339758, iNaturalist ID: 574510)
    │   ├── afromelecta bicuspis (GBIF ID: 1339760)
    │   ├── afromelecta fulvohirta (GBIF ID: 1339761, iNaturalist ID: 648015)
    │   ├── afromelecta lieftincki (GBIF ID: 1339759)
    │   └── acanthomelecta (iNaturalist ID: 578272)
    ├── amelikertotes (GBIF ID: 11153287)
    ├── anthidulum (GBIF ID: 4669755)
    │   └── anthidulum rozeni (GBIF ID: 8550915)
    ├── anthophorites (GBIF ID: 4671634)
    │   ├── anthophorites longaeva (GBIF ID: 8507095)
    │   ├── anthophorites mellona (GBIF ID: 8441636)
    │   ├── anthophorites thoracica (GBIF ID: 8644461)
    │   ├── anthophorites titania (GBIF ID: 8635555)
    │   ├── anthophorite

In [54]:
df_apidae_with_inaturalist = txm.convert_tree_to_dataframe(apidae_tree, gbif_apidae, ncbi_apidae, "apidae_taxonomic_tree_df.txt", inat_dataset=inat_apidae)

In [56]:
df_apidae_with_inaturalist.sample(20)

| | id | ncbi_taxon_id | gbif_taxon_id | inat_taxon_id | ncbi_canonical_name | gbif_canonical_name | inat_canonical_name | gbif_synonyms_ids | gbif_synonyms_names | ncbi_synonyms_names |
|---|---|---|---|---|---|---|---|---|---|---|
| 8673 | 8674 | 710044.0 | | | Ceratina (Hirashima) sp. Malagasy 1 | | | | | |
| 3991 | 3992 | | 1340070.0 | | | Melipona carrikeri | | 9002161 | Melipona marginata carrikeri | |
| 6610 | 6611 | | 5040067.0 | 572809.0 | | Ceratina cladura | Ceratina cladura | | | |
| 3649 | 3650 | | 1340227.0 | 1358491.0 | | Tetralonioidella tricolor | Tetralonioidella tricolor | 11015605 | Protomelissa tricolor | |
| 5080 | 5081 | | 1344487.0 | 1591582.0 | | Doeringiella asignata | Doeringiella asignata | | | |
| 648 | 649 | 60899.0 | 1345383.0 | 574497.0 | Dactylurina | Dactylurina | Dactylurina | | | |
| 1755 | 1756 | | 1340471.0 | 1600463.0 | | Bombus richardsiellus | Bombus richardsiellus | 10919296 | Pyrobombus richardsiellus | |
| 4866 | 4867 | | 5039541.0 | 572314.0 | | Ammobates latitarsis | Ammobates latitarsis | | | |
| 3973 | 3974 | | 1342750.0 | | | Meliplebeia roubiki | | 8029272 | Meliponula roubiki | |
| 4125 | 4126 | | 1345578.0 | 1403758.0 | | Paratrigona melanaspis | Paratrigona melanaspis | | | |
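
One quick way to summarize a merged table like this is to count how many rows carry an identifier from each source and which rows are covered by all three. The sketch below does this with the standard library on four rows copied from the sample above, keeping only the ID columns; on the full dataframe the same idea could be expressed with pandas, which is left out here to keep the example dependency-free.

```python
import csv
import io

# Four rows from the sample above, reduced to the identifier columns.
sample_csv = """id,ncbi_taxon_id,gbif_taxon_id,inat_taxon_id
8674,710044.0,,
3992,,1340070.0,
6611,,5040067.0,572809.0
649,60899.0,1345383.0,574497.0
"""

id_columns = ["ncbi_taxon_id", "gbif_taxon_id", "inat_taxon_id"]
rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Count non-empty identifiers per source.
coverage = {
    column: sum(1 for row in rows if row[column]) for column in id_columns
}

# Rows that have an identifier in every source.
in_all_three = [row["id"] for row in rows if all(row[c] for c in id_columns)]

print(coverage)
print(in_all_three)
```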
