# Analyzing Bee Taxonomy: Integrating GBIF and NCBI Data for Apidae Insights

![title](https://live.staticflickr.com/4059/4632384645_a2230b26d5_b.jpg)

This Python notebook integrates taxonomic data from two major biological databases, GBIF (Global Biodiversity Information Facility) and NCBI (National Center for Biotechnology Information), to improve the accuracy and completeness of ecological and biological research. GBIF focuses on biodiversity data, including species distribution and ecological information, whereas NCBI provides a broader range of data, including genomic and taxonomic details.

Combining these sources lets researchers cross-validate species identifications and enrich ecological datasets with genetic information. The main task performed in this notebook is the construction of a taxonomic tree, which helps to visualize and understand the evolutionary relationships and classification hierarchy among the species of a chosen taxon (in this case Apidae, a family of bees).

## 1. Importing libraries

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None)


In [2]:
import taxonmatch as txm

## 2. Downloading and processing samples

The first step downloads the most recent taxonomic data from GBIF and NCBI, so that the analysis is based on the latest available information.

In [3]:
gbif_dataset = txm.download_gbif_taxonomy()

GBIF backbone taxonomy data already downloaded.
Processing samples...
Done.


In [4]:
ncbi_dataset = txm.download_ncbi_taxonomy()

NCBI taxonomy data already downloaded.
Processing samples...
Done.


## 3.a Training the classifier model

If required, the notebook outlines steps to train a machine learning classifier to distinguish between correct and incorrect taxonomic matches. This involves generating positive and negative examples, preparing the training dataset, and comparing different models.
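
To make the idea of positive and negative examples concrete, here is a minimal sketch that uses only the standard library. It is not how `taxonmatch` builds its training set: the helpers `make_pairs` and `split_rows` are hypothetical, the names are toy values borrowed from the Apidae tree printed later in this notebook, and a real training set would carry similarity features rather than raw name pairs.

```python
import random


def make_pairs(gbif_names, ncbi_names, n, seed=0):
    """Return (positives, negatives) as lists of (gbif_name, ncbi_name, label)."""
    rng = random.Random(seed)
    shared = sorted(set(gbif_names) & set(ncbi_names))
    if not shared:
        raise ValueError("no shared names to sample positive examples from")
    # Positive examples: the same name appears in both sources (label 1).
    positives = [(name, name, 1) for name in rng.sample(shared, min(n, len(shared)))]
    # Negative examples: pairs of different names (label 0).
    # Enumerating all pairs is quadratic, which is only acceptable for toy lists.
    candidates = [(g, c, 0) for g in gbif_names for c in ncbi_names if g != c]
    negatives = rng.sample(candidates, min(len(positives), len(candidates)))
    return positives, negatives


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and split them into train and test lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


gbif_names = ["macrogalea antanosy", "macrogalea candida",
              "macrogalea maizina", "nasutapis straussorum"]
ncbi_names = ["macrogalea antanosy", "macrogalea candida",
              "macrogalea maizina", "nasutapis sp. malawi"]

positives, negatives = make_pairs(gbif_names, ncbi_names, n=3)
train, test = split_rows(positives + negatives, test_fraction=0.25)
```

The guard on `shared` makes the failure mode explicit: positive examples cannot be sampled when the two inputs have no names in common.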

In [19]:
positive_matches = txm.generate_positive_set(gbif_dataset, ncbi_dataset, 500)

ValueError: a must be greater than 0 unless no samples are taken

In [None]:
negative_matches = txm.generate_negative_set(gbif_dataset, ncbi_dataset, 500)

In [None]:
full_training_set = txm.prepare_data(positive_matches, negative_matches)

In [None]:
X_train, X_test, y_train, y_test = txm.generate_training_test(full_training_set)

In [None]:
txm.compare_models(X_train, X_test, y_train, y_test)

In [None]:
model = txm.XGBClassifier(learning_rate=0.1, n_estimators=500, max_depth=9, n_jobs=-1, colsample_bytree=1, subsample=0.8)

In [None]:
model.fit(X_train, y_train, verbose=False)

In [None]:
# Optional: save the trained model to disk (requires `import pickle`)
#with open('./files/model/xgb_model.pkl', 'wb') as file:
#    pickle.dump(model, file)

## 3.b Load a pre-trained model

Alternatively, a pre-trained model can be loaded, which simplifies routine analyses.

In [5]:
from taxonmatch.loader import load_xgb_model
model = load_xgb_model()

## 4. Match the NCBI and GBIF datasets

This section compares and aligns the taxonomic data from the NCBI and GBIF datasets. It targets the taxon "Apidae" to narrow the analysis to a single family of bees. Using a pre-trained machine learning model, the notebook matches records from both datasets and categorizes them as exact matches, unmatched, or potentially mislabeled due to typographical errors.

In [6]:
gbif_apidae, ncbi_apidae = txm.select_taxonomic_clade("Apidae", gbif_dataset, ncbi_dataset)

In [9]:
matched_df, unmatched_df, possible_typos_df = txm.match_dataset(gbif_apidae, ncbi_apidae, model, tree_generation = True)
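
The three output categories can be illustrated with a small standard-library sketch. It swaps the trained classifier for a plain string-similarity threshold from `difflib`, so it shows the shape of the result and not the method used by `txm.match_dataset`. The function `classify_names`, the `typo_threshold` value, and the misspelled name `macrogalea zansibarica` are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def classify_names(gbif_names, ncbi_names, typo_threshold=0.9):
    """Split GBIF names into exact matches, possible typos, and unmatched."""
    ncbi_set = set(ncbi_names)
    matched, possible_typos, unmatched = [], [], []
    for name in gbif_names:
        if name in ncbi_set:
            matched.append(name)
            continue
        # Closest NCBI name by similarity ratio (None if the list is empty).
        best = max(ncbi_names, key=lambda other: similarity(name, other), default=None)
        if best is not None and similarity(name, best) >= typo_threshold:
            possible_typos.append((name, best))
        else:
            unmatched.append(name)
    return matched, possible_typos, unmatched


gbif_names = ["macrogalea candida", "macrogalea zanzibarica", "nasutapis straussorum"]
ncbi_names = ["macrogalea candida", "macrogalea zansibarica", "macrogalea maizina"]

matched, possible_typos, unmatched = classify_names(gbif_names, ncbi_names)
```

A fixed threshold is the simplest possible rule. A trained model can weigh several signals at once, which is presumably why the notebook uses one.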

## 5. Generate the taxonomic tree

In the last section, the notebook constructs a taxonomic tree from the matched and unmatched data between the GBIF and NCBI datasets, focusing on the Apidae family. This visual representation helps to illustrate the evolutionary relationships and classification hierarchy among the species. The tree is then converted into a dataframe for further analysis and saved in textual format for documentation and review purposes.

In [10]:
tree = txm.generate_taxonomic_tree(matched_df, unmatched_df)

In [11]:
df_from_tree = txm.convert_tree_to_dataframe(tree, gbif_apidae, ncbi_apidae, "taxonomic_tree_df.txt")

In [12]:
txm.print_tree(tree, root_name="Apidae")

apidae (NCBI ID: 7458, GBIF ID: 4334)
├── xylocopinae (NCBI ID: 78170, GBIF ID: None)
│   ├── allodapini (NCBI ID: 78174, GBIF ID: None)
│   │   ├── nasutapis (NCBI ID: 347686, GBIF ID: 1340237)
│   │   │   ├── unclassified nasutapis (NCBI ID: 2623401, GBIF ID: None)
│   │   │   │   └── nasutapis sp. malawi (NCBI ID: 347687, GBIF ID: None)
│   │   │   └── nasutapis straussorum (NCBI ID: None, GBIF ID: 1340238)
│   │   ├── macrogalea (NCBI ID: 175312, GBIF ID: 1339653)
│   │   │   ├── macrogalea antanosy (NCBI ID: 340383, GBIF ID: 1339656)
│   │   │   ├── macrogalea scaevolae (NCBI ID: 418716, GBIF ID: 1339663)
│   │   │   ├── macrogalea zanzibarica (NCBI ID: 175331, GBIF ID: 1339657)
│   │   │   ├── macrogalea magenge (NCBI ID: 234715, GBIF ID: 1339654)
│   │   │   ├── macrogalea maizina (NCBI ID: 418715, GBIF ID: 1339661)
│   │   │   ├── macrogalea berentyensis (NCBI ID: 411252, GBIF ID: 1339662)
│   │   │   ├── macrogalea candida (NCBI ID: 175313, GBIF ID: 1339660)
│   │   │   ├── ma

In [13]:
txm.save_tree(tree, "taxon_tree.txt")

The tree is saved in the file: taxon_tree.txt.
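
For readers curious how a tree like the one printed above can be assembled from lineage strings, here is a minimal sketch using nested dictionaries. It is an illustration under assumptions, not the implementation of `txm.generate_taxonomic_tree`: `build_tree` and `render_children` are hypothetical helpers, the lineages are a hand-picked subset of the taxa shown in the printed tree, and the NCBI and GBIF identifiers are omitted.

```python
def build_tree(lineages):
    """Nest semicolon-separated lineages into a dict of dicts."""
    root = {}
    for lineage in lineages:
        node = root
        for taxon in lineage.split(";"):
            # Reuse the existing child if present, otherwise create it.
            node = node.setdefault(taxon.strip().lower(), {})
    return root


def render_children(children, prefix=""):
    """Return the lines of a box-drawing rendering of the given children."""
    lines = []
    items = list(children.items())
    for index, (name, grandchildren) in enumerate(items):
        is_last = index == len(items) - 1
        lines.append(prefix + ("└── " if is_last else "├── ") + name)
        lines.extend(
            render_children(grandchildren, prefix + ("    " if is_last else "│   "))
        )
    return lines


lineages = [
    "Apidae;Xylocopinae;Allodapini;Nasutapis;Nasutapis straussorum",
    "Apidae;Xylocopinae;Allodapini;Macrogalea;Macrogalea antanosy",
    "Apidae;Xylocopinae;Allodapini;Macrogalea;Macrogalea candida",
]

tree_dict = build_tree(lineages)
root_name, children = next(iter(tree_dict.items()))
lines = [root_name] + render_children(children)
print("\n".join(lines))
```

Dictionaries keep insertion order, so taxa appear in the order in which they are first seen in the input.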


In [26]:
ncbi_dataset[1].sample(20)

Unnamed: 0,ncbi_id,ncbi_lineage_names,ncbi_lineage_ids,ncbi_canonicalName,ncbi_rank,ncbi_lineage_ranks,ncbi_target_string
1168287,1460902,Viruses;Riboviria;Orthornavirae;Negarnaviricota;Polyploviricotina;Insthoviricetes;Articulavirales;Orthomyxoviridae;Alphainfluenzavirus;Alphainfluenzavirus influenzae;Influenza A virus;H1N1 subtype;Influenza A virus (A/Santa Catarina/9358/2009(H1N1)),10239;2559587;2732396;2497569;2497571;2497577;2499411;11308;197911;2955291;11320;114727;1460902,Influenza A virus (A/Santa Catarina/9358/2009(H1N1)),no rank,superkingdom;clade;kingdom;phylum;subphylum;class;order;family;genus;species;no rank;serotype;no rank,viruses;negarnaviricota;insthoviricetes;articulavirales;orthomyxoviridae;alphainfluenzavirus;alphainfluenzavirus influenzae
1274098,1581661,cellular organisms;Bacteria;Pseudomonadota;Betaproteobacteria;Nitrosomonadales;Methylophilaceae;unclassified Methylophilaceae;Methylophilaceae bacterium MMS-VI-33,131567;2;1224;28216;32003;32011;119067;1581661,Methylophilaceae bacterium MMS-VI-33,species,no rank;superkingdom;phylum;class;order;family;no rank;species,bacteria;pseudomonadota;betaproteobacteria;nitrosomonadales;methylophilaceae;methylophilaceae bacterium mms-vi-33
874778,1128549,cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Protostomia;Ecdysozoa;Panarthropoda;Arthropoda;Mandibulata;Pancrustacea;Hexapoda;Insecta;Dicondylia;Pterygota;Neoptera;Endopterygota;Amphiesmenoptera;Lepidoptera;Glossata;Neolepidoptera;Heteroneura;Ditrysia;Obtectomera;Pyraloidea;Crambidae;Spilomelinae;unclassified Spilomelinae;Spilomelinae gen. spiloBioLep01;unclassified Spilomelinae gen. spiloBioLep01;Spilomelinae gen. spiloBioLep01 sp. BioLep605,131567;2759;33154;33208;6072;33213;33317;1206794;88770;6656;197563;197562;6960;50557;85512;7496;33340;33392;85604;7088;41191;41196;41197;37567;104431;37573;268499;581380;1104282;1101700;2637124;1128549,Spilomelinae gen. spiloBioLep01 sp. BioLep605,species,no rank;superkingdom;clade;kingdom;clade;clade;clade;clade;clade;phylum;clade;clade;subphylum;class;clade;subclass;infraclass;cohort;superorder;order;suborder;infraorder;parvorder;clade;clade;superfamily;family;subfamily;no rank;genus;no rank;species,eukaryota;arthropoda;insecta;lepidoptera;crambidae;spilomelinae gen. spilobiolep01;spilomelinae gen. spilobiolep01 sp. biolep605
2034804,2461422,Viruses;Riboviria;Orthornavirae;Kitrinoviricota;Flasuviricetes;Amarillovirales;Flaviviridae;Hepacivirus;unclassified Hepacivirus;Goat hepacivirus,10239;2559587;2732396;2732406;2732462;2732545;11050;11102;1249508;2461422,Goat hepacivirus,species,superkingdom;clade;kingdom;phylum;class;order;family;genus;no rank;species,viruses;kitrinoviricota;flasuviricetes;amarillovirales;flaviviridae;hepacivirus;goat hepacivirus
973380,1242756,cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Protostomia;Spiralia;Lophotrochozoa;Platyhelminthes;Rhabditophora;Seriata;Tricladida;Continenticola;Planarioidea;Dendrocoelidae;Sorocelis,131567;2759;33154;33208;6072;33213;33317;2697495;1206795;6157;147100;166126;6159;1292243;1292248;27893;1242756,Sorocelis,genus,no rank;superkingdom;clade;kingdom;clade;clade;clade;clade;clade;phylum;class;clade;order;suborder;superfamily;family;genus,eukaryota;platyhelminthes;rhabditophora;tricladida;dendrocoelidae;sorocelis
995418,1266528,cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Protostomia;Ecdysozoa;Panarthropoda;Arthropoda;Mandibulata;Pancrustacea;Hexapoda;Insecta;Dicondylia;Pterygota;Neoptera;Endopterygota;Diptera;Brachycera;Muscomorpha;Eremoneura;Cyclorrhapha;Schizophora;Calyptratae;Oestroidea;Sarcophagidae;Sarcophaginae;Arachnidomyia;Arachnidomyia clathrata,131567;2759;33154;33208;6072;33213;33317;1206794;88770;6656;197563;197562;6960;50557;85512;7496;33340;33392;7147;7203;43733;480118;480117;43738;43742;43755;7381;43916;1266527;1266528,Arachnidomyia clathrata,species,no rank;superkingdom;clade;kingdom;clade;clade;clade;clade;clade;phylum;clade;clade;subphylum;class;clade;subclass;infraclass;cohort;order;suborder;infraorder;clade;clade;no rank;no rank;superfamily;family;subfamily;genus;species,eukaryota;arthropoda;insecta;diptera;sarcophagidae;arachnidomyia;arachnidomyia clathrata
462119,545261,cellular organisms;Bacteria;Pseudomonadota;Alphaproteobacteria;Hyphomonadales;Hyphomonadaceae;Henriciella;Henriciella aquimarina,131567;2;1224;28211;2800060;69657;453849;545261,Henriciella aquimarina,species,no rank;superkingdom;phylum;class;order;family;genus;species,bacteria;pseudomonadota;alphaproteobacteria;hyphomonadales;hyphomonadaceae;henriciella;henriciella aquimarina
1133351,1421722,cellular organisms;Eukaryota;Viridiplantae;Streptophyta;Streptophytina;Embryophyta;Tracheophyta;Euphyllophyta;Spermatophyta;Magnoliopsida;Mesangiospermae;eudicotyledons;Gunneridae;Pentapetalae;asterids;lamiids;Boraginales;Boraginaceae;Cynoglossoideae;Craniospermeae;Craniospermum;Craniospermum subvillosum,131567;2759;33090;35493;131221;3193;58023;78536;58024;3398;1437183;71240;91827;1437201;71274;91888;1538097;21571;1874400;1874403;1421721;1421722,Craniospermum subvillosum,species,no rank;superkingdom;kingdom;phylum;subphylum;clade;clade;clade;clade;class;clade;clade;clade;clade;clade;clade;order;family;subfamily;tribe;genus;species,eukaryota;streptophyta;magnoliopsida;boraginales;boraginaceae;craniospermum;craniospermum subvillosum
1342412,1658616,cellular organisms;Bacteria;PVC group;Lentisphaerota;Lentisphaeria;Lentisphaerales;Lentisphaeraceae;Lentisphaera;Lentisphaera profundi,131567;2;1783257;256845;1313211;278081;566277;256846;1658616,Lentisphaera profundi,species,no rank;superkingdom;clade;phylum;class;order;family;genus;species,bacteria;lentisphaerota;lentisphaeria;lentisphaerales;lentisphaeraceae;lentisphaera;lentisphaera profundi
2391464,2868523,cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Protostomia;Ecdysozoa;Panarthropoda;Arthropoda;Mandibulata;Pancrustacea;Hexapoda;Insecta;Dicondylia;Pterygota;Neoptera;Endopterygota;Hymenoptera;Apocrita;Proctotrupomorpha;Chalcidoidea;Eurytomidae;Eurytominae;Axima;Axima zabriskiei,131567;2759;33154;33208;6072;33213;33317;1206794;88770;6656;197563;197562;6960;50557;85512;7496;33340;33392;7399;7400;1955251;7422;75200;246456;2868522;2868523,Axima zabriskiei,species,no rank;superkingdom;clade;kingdom;clade;clade;clade;clade;clade;phylum;clade;clade;subphylum;class;clade;subclass;infraclass;cohort;order;suborder;infraorder;superfamily;family;subfamily;genus;species,eukaryota;arthropoda;insecta;hymenoptera;eurytomidae;axima;axima zabriskiei
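
In the rows sampled above, `ncbi_target_string` is consistent with keeping only the lineage entries whose rank is one of seven main ranks and lowercasing them. The sketch below reproduces that pattern for one sampled row. Whether `taxonmatch` actually computes the column this way is an assumption, and `MAIN_RANKS` and `target_string` are hypothetical names.

```python
# Ranks that appear to be kept in ncbi_target_string (inferred from the sample rows).
MAIN_RANKS = {"superkingdom", "phylum", "class", "order", "family", "genus", "species"}


def target_string(lineage_names, lineage_ranks):
    """Keep the names whose rank is a main rank, lowercased and joined by ';'."""
    names = lineage_names.split(";")
    ranks = lineage_ranks.split(";")
    if len(names) != len(ranks):
        raise ValueError("lineage names and ranks must have the same length")
    return ";".join(
        name.lower() for name, rank in zip(names, ranks) if rank in MAIN_RANKS
    )


# Values copied from the Henriciella aquimarina row of the sample output.
names = ("cellular organisms;Bacteria;Pseudomonadota;Alphaproteobacteria;"
         "Hyphomonadales;Hyphomonadaceae;Henriciella;Henriciella aquimarina")
ranks = "no rank;superkingdom;phylum;class;order;family;genus;species"

result = target_string(names, ranks)
print(result)
```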


In [None]:
# Known issues / to do:
# - Fix the tree root
# - Give the dataset a different name
# - Works on Python 3.10; check against the latest version