# Analyzing Arthropoda Taxonomy: Integrating GBIF, NCBI and iNaturalist Data

![title](https://wallpapercave.com/wp/wp1870417.jpg)

This Python notebook integrates taxonomic data from three major biological databases: GBIF (Global Biodiversity Information Facility), NCBI (National Center for Biotechnology Information) and iNaturalist. The aim is to improve the accuracy and comprehensiveness of ecological and biological research. GBIF focuses on biodiversity data, including species distribution and ecological information, whereas NCBI provides a broader range of data, including genomic and taxonomic details.

Combining these sources lets researchers cross-validate species identifications and enrich ecological datasets with genetic information. A key task performed in this notebook is the construction of a taxonomic tree, which helps to visualize and understand the evolutionary relationships and classification hierarchy among the species within a chosen taxon (in this case, the phylum Arthropoda).

## 1. Importing libraries

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

In [2]:
import taxonmatch as txm

## 2. Downloading and processing samples

The initial steps involve downloading the most recent taxonomic data from GBIF and NCBI to ensure the analysis is based on the latest available information. 

In [3]:
gbif_dataset = txm.download_gbif_taxonomy()

GBIF backbone taxonomy data already downloaded.
Processing samples...
Done.


In [4]:
ncbi_dataset = txm.download_ncbi_taxonomy()

NCBI taxonomy data already downloaded.
Processing samples...
Done.


## 2.1 Checking inconsistencies in nomenclature

Matching based on canonical names between the GBIF and NCBI datasets is unreliable due to significant taxonomic inconsistencies. In particular, the same canonical name may be assigned to multiple kingdoms, highlighting classification discrepancies. Even when the taxonomic status is accepted, the taxonomic structures in GBIF and NCBI can differ substantially. This necessitates filtering and evaluating differences before considering a match valid, preventing false correspondences between incongruent taxonomies.

In [5]:
df_inconsistencies = txm.get_inconsistencies(gbif_dataset, ncbi_dataset)

In [6]:
df_inconsistencies.sample(5)

| | canonicalName | gbif_id | ncbi_id | gbif_rank | ncbi_rank | gbif_taxonomy | ncbi_taxonomy |
|---|---|---|---|---|---|---|---|
| 372005 | Clavulina rugosa | 7690415 | 149346 | species | species | foraminifera;globothalamea;textulariida;valvulinidae;clavulina;clavulina rugosa | basidiomycota;agaricomycetes;cantharellales;hydnaceae;clavulina;clavulina rugosa |
| 469678 | Annularia spinulosa | 11152599 | 1981406 | species | species | tracheophyta;polypodiopsida;equisetales;calamitaceae;annularia;annularia spinulosa | mollusca;gastropoda;littorinimorpha;annulariidae;annularia;annularia spinulosa |
| 585549 | Helicopsis persica | 7438932 | 1766789 | species | species | ascomycota;dothideomycetes;tubeufiales;tubeufiaceae;helicopsis;helicopsis persica | mollusca;gastropoda;stylommatophora;geomitridae;helicopsis;helicopsis persica |
| 566695 | Salix alba | 7882712 | 75704 | species | species | chordata;ascidiacea;aplousobranchia;polycitoridae;salix;salix alba | streptophyta;magnoliopsida;malpighiales;salicaceae;salix;salix alba |
| 515403 | Trichospira verticillata | 3087526 | 2067439 | species | species | ciliophora;kinetofragminophora;trichostomatida;trichospiridae;trichospira;trichospira verticillata | streptophyta;magnoliopsida;asterales;asteraceae;trichospira;trichospira verticillata |
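
One way to screen such cases is to compare the higher ranks of the two lineage strings before accepting a name match. The sketch below is plain Python and is not part of taxonmatch: `higher_ranks` and `lineages_conflict` are hypothetical helpers, the "no shared rank above genus" rule is an assumption chosen for illustration, and the two rows are copied from the sample above.

```python
def higher_ranks(lineage):
    """Return the ranks above genus from a ';'-separated lineage string."""
    # The sampled lineages end with genus and species, so drop the last two.
    return lineage.split(";")[:-2]


def lineages_conflict(gbif_lineage, ncbi_lineage):
    """True when the two sources share no rank above genus (illustrative rule)."""
    return not set(higher_ranks(gbif_lineage)) & set(higher_ranks(ncbi_lineage))


# Rows copied from the sample output: (canonical name, GBIF lineage, NCBI lineage).
rows = [
    ("Clavulina rugosa",
     "foraminifera;globothalamea;textulariida;valvulinidae;clavulina;clavulina rugosa",
     "basidiomycota;agaricomycetes;cantharellales;hydnaceae;clavulina;clavulina rugosa"),
    ("Salix alba",
     "chordata;ascidiacea;aplousobranchia;polycitoridae;salix;salix alba",
     "streptophyta;magnoliopsida;malpighiales;salicaceae;salix;salix alba"),
]

conflicts = [name for name, gbif, ncbi in rows if lineages_conflict(gbif, ncbi)]
print(conflicts)
```

Both names are flagged because their GBIF and NCBI lineages diverge from the phylum down, even though the canonical names are identical.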


## 3.a Training the classifier model

If required, the notebook outlines steps to train a machine learning classifier to distinguish between correct and incorrect taxonomic matches. This involves generating positive and negative examples, preparing the training dataset, and comparing different models.
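
The general shape of that preparation step can be sketched with the standard library alone. This is not how taxonmatch builds its sets: the single string-similarity feature, the helper names `build_training_rows` and `split_rows`, and the 80/20 split are all assumptions made for illustration. The first lineage in the positive pair is assumed to resemble GBIF's plant classification; the negative pair is the *Salix alba* row reported by the inconsistency check.

```python
import random
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1] between two lineage strings."""
    return SequenceMatcher(None, a, b).ratio()


def build_training_rows(positive_pairs, negative_pairs):
    """Label correct pairs 1 and incorrect pairs 0, with one toy feature."""
    rows = []
    for label, pairs in ((1, positive_pairs), (0, negative_pairs)):
        for gbif_lineage, ncbi_lineage in pairs:
            rows.append({"similarity": similarity(gbif_lineage, ncbi_lineage),
                         "label": label})
    return rows


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle reproducibly and split into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


ncbi_salix = "streptophyta;magnoliopsida;malpighiales;salicaceae;salix;salix alba"
positive_pairs = [
    ("tracheophyta;magnoliopsida;malpighiales;salicaceae;salix;salix alba", ncbi_salix),
]
negative_pairs = [
    ("chordata;ascidiacea;aplousobranchia;polycitoridae;salix;salix alba", ncbi_salix),
]

rows = build_training_rows(positive_pairs, negative_pairs)
train, test = split_rows(rows)
print(rows)
```

A real training set would use many more pairs and richer features; the point here is only the labelling and splitting pattern.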

In [7]:
positive_matches = txm.generate_positive_set(gbif_dataset, ncbi_dataset, 5000)

Generating positive set: 100.0%


In [8]:
negative_matches = txm.generate_negative_set(gbif_dataset, ncbi_dataset, 5000)

Generating negative set: 100.0%


In [9]:
full_training_set = txm.prepare_data(positive_matches, negative_matches)

In [10]:
#full_training_set.to_csv("training_set.txt", index = False)

In [11]:
X_train, X_test, y_train, y_test = txm.generate_training_test(full_training_set)

In [12]:
txm.compare_models(X_train, X_test, y_train, y_test)

| | model | accuracy | mae | precision | recall | f1 | roc | run_time | tp | fp | tn | fn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RandomForestClassifier | 0.975758 | 0.024242 | 0.981543 | 0.992005 | 0.986746 | 0.902043 | 0.01 | 121 | 28 | 1489 | 12 |
| 1 | GradientBoostingClassifier | 0.971515 | 0.028485 | 0.97892 | 0.990007 | 0.984432 | 0.887621 | 0.01 | 117 | 32 | 1486 | 15 |
| 2 | XGBClassifier | 0.970909 | 0.029091 | 0.979538 | 0.988674 | 0.984085 | 0.89031 | 0.0 | 118 | 31 | 1484 | 17 |
| 3 | KNeighborsClassifier | 0.969697 | 0.030303 | 0.978878 | 0.988008 | 0.983422 | 0.886621 | 0.0 | 117 | 32 | 1483 | 18 |
| 4 | DecisionTreeClassifier | 0.966061 | 0.033939 | 0.980705 | 0.982012 | 0.981358 | 0.893691 | 0.0 | 120 | 29 | 1474 | 27 |
| 5 | AdaBoostClassifier | 0.964848 | 0.035152 | 0.97561 | 0.986009 | 0.980782 | 0.868844 | 0.0 | 112 | 37 | 1480 | 21 |
| 6 | SVC | 0.96 | 0.04 | 0.959641 | 0.998001 | 0.978445 | 0.787591 | 0.0 | 86 | 63 | 1498 | 3 |
| 7 | MLPClassifier | 0.955758 | 0.044242 | 0.95828 | 0.99467 | 0.976136 | 0.779214 | 0.01 | 84 | 65 | 1493 | 8 |
| 8 | DummyClassifier | 0.821212 | 0.178788 | 0.909647 | 0.892072 | 0.900774 | 0.499727 | 0.0 | 16 | 133 | 1339 | 162 |
| 9 | Perceptron | 0.238788 | 0.761212 | 0.991968 | 0.164557 | 0.282286 | 0.575567 | 0.0 | 147 | 2 | 247 | 1254 |
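
The summary metrics can be cross-checked against the confusion counts in each row. The sketch below recomputes them for the RandomForestClassifier row using the standard definitions. One assumption is flagged in the code: the reported precision, recall and F1 are reproduced only when the `tn` column is read as the correctly predicted positive-class samples and `tp` as the correctly predicted negatives, so the two column labels may follow a different ordering than the metrics do.

```python
# Counts copied from the RandomForestClassifier row of the table above.
reported = {"tp": 121, "fp": 28, "tn": 1489, "fn": 12}

# Assumption: the metrics treat the `tn` column as hits on the positive class
# and the `tp` column as hits on the negative class.
hits_positive = reported["tn"]
hits_negative = reported["tp"]
false_positive = reported["fp"]
false_negative = reported["fn"]

total = hits_positive + hits_negative + false_positive + false_negative
accuracy = (hits_positive + hits_negative) / total
precision = hits_positive / (hits_positive + false_positive)
recall = hits_positive / (hits_positive + false_negative)
f1 = 2 * precision * recall / (precision + recall)
mae = 1 - accuracy  # for 0/1 labels, mean absolute error equals the error rate

print(round(accuracy, 6), round(precision, 6), round(recall, 6),
      round(f1, 6), round(mae, 6))
```

The same reading also reproduces the precision and recall of the SVC and Perceptron rows, which suggests the pattern holds across the table.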


In [13]:
model = txm.XGBClassifier(learning_rate=0.1, n_estimators=500, max_depth=9, n_jobs=-1, colsample_bytree=1, subsample=0.8)

In [14]:
model.fit(X_train, y_train, verbose=False)

In [18]:
#txm.save_model(model, "xgb_model")

## 3.b Load a pre-trained model

Alternatively, the notebook can load a pre-trained model, which simplifies routine analyses.

In [19]:
model = txm.load_xgb_model()

## 4. Match NCBI with GBIF dataset 

This section compares and aligns the taxonomic data from the NCBI and GBIF datasets. It targets the taxon "Apidae" to narrow the analysis down to a single family of bees. Using the classifier from the previous step, the notebook matches records from both datasets and categorizes them as exact matches, unmatched, or potentially mislabeled due to typographical errors.

In [15]:
gbif_arthropoda, ncbi_arthropoda = txm.select_taxonomic_clade("apidae", gbif_dataset, ncbi_dataset)  # alternative clade: "formicidae"

In [20]:
matched_df, unmatched_df, possible_typos_df = txm.match_dataset(gbif_arthropoda, ncbi_arthropoda, model, tree_generation = True)
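
The three output categories can be illustrated with a simple string-similarity stand-in. This is not taxonmatch's matching logic, which relies on the trained classifier: the `categorize` helper and the 0.9 cutoff are assumptions, and "bombus terestris" is a deliberately misspelled, made-up entry.

```python
from difflib import get_close_matches


def categorize(gbif_names, ncbi_names, cutoff=0.9):
    """Split NCBI names into exact matches, possible typos, and unmatched."""
    gbif_set = set(gbif_names)
    matched, possible_typos, unmatched = [], [], []
    for name in ncbi_names:
        if name in gbif_set:
            matched.append((name, name))
            continue
        # Closest GBIF name by difflib's similarity ratio, if above the cutoff.
        close = get_close_matches(name, gbif_names, n=1, cutoff=cutoff)
        if close:
            possible_typos.append((name, close[0]))
        else:
            unmatched.append(name)
    return matched, possible_typos, unmatched


gbif_names = ["apis mellifera", "bombus terrestris", "xylocopa violacea"]
# "bombus terestris" is misspelled on purpose to trigger the typo branch.
ncbi_names = ["apis mellifera", "bombus terestris", "melipona beecheii"]

matched, possible_typos, unmatched = categorize(gbif_names, ncbi_names)
print(matched)
print(possible_typos)
print(unmatched)
```

A name-only comparison like this cannot tell homonyms in different kingdoms apart, which is why the notebook also takes the full taxonomy into account.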

## 5. Generate the taxonomic tree 

In this section, the notebook constructs a taxonomic tree from the matched and unmatched data between the GBIF and NCBI datasets, focusing on the Apidae family. This visual representation helps to illustrate the evolutionary relationships and classification hierarchy among the species. The tree is then converted into a dataframe for further analysis and saved in text format for documentation and review.
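
The idea of turning lineages into a printable tree can be sketched in a few lines. This is a minimal illustration, not the data structure taxonmatch uses internally: `build_tree` and `render` are hypothetical helpers, and the taxon names are taken from the notebook's Cicadetta output.

```python
def build_tree(lineages):
    """Build a nested dict from ';'-separated lineage strings."""
    root = {}
    for lineage in lineages:
        node = root
        for taxon in lineage.split(";"):
            # Reuse the child if it exists, otherwise create it.
            node = node.setdefault(taxon, {})
    return root


def render(tree, prefix=""):
    """Return the tree as a list of lines drawn with box characters."""
    lines = []
    items = list(tree.items())
    for i, (name, children) in enumerate(items):
        last = i == len(items) - 1
        lines.append(prefix + ("└── " if last else "├── ") + name)
        lines.extend(render(children, prefix + ("    " if last else "│   ")))
    return lines


lineages = [
    "cicadetta;cicadetta macedonica",
    "cicadetta;cicadetta anapaistica;cicadetta anapaistica lucana",
    "cicadetta;cicadetta anapaistica;cicadetta anapaistica anapaistica",
]
tree_lines = render(build_tree(lineages))
print("\n".join(tree_lines))
```

Shared prefixes collapse into a single branch, so the two subspecies end up under the same `cicadetta anapaistica` node.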

In [20]:
tree = txm.generate_taxonomic_tree(matched_df, unmatched_df)



In [21]:
#txm.print_tree(tree)

In [22]:
cicadetta_tree = txm.reroot_tree(tree, root_name="cicadetta")

In [23]:
txm.print_tree(cicadetta_tree)

cicadetta (NCBI ID: 139461, GBIF ID: 4407744)
├── cicadetta macedonica (NCBI ID: 1740319, GBIF ID: 7591128)
├── cicadetta abscondita (NCBI ID: 2593298, GBIF ID: 7844511)
├── cicadetta cantilatrix (NCBI ID: 1740312, GBIF ID: 7903126)
├── cicadetta cerdaniensis (NCBI ID: 1740313, GBIF ID: 7491426)
├── cicadetta hannekeae (NCBI ID: 1740317, GBIF ID: 7938246)
├── cicadetta olympica (NCBI ID: 1740320, GBIF ID: 8176504)
├── cicadetta sibillae (NCBI ID: 1740321, GBIF ID: 8466776)
├── cicadetta fangoana (NCBI ID: 1740316, GBIF ID: 4482637)
├── cicadetta anapaistica (NCBI ID: 1740310, GBIF ID: 8414980)
│   ├── cicadetta anapaistica lucana (NCBI ID: 1889248, GBIF ID: 11198679)
│   └── cicadetta anapaistica anapaistica (GBIF ID: 11192347)
├── cicadetta brevipennis (NCBI ID: 1740311, GBIF ID: 7614613)
│   ├── cicadetta brevipennis brevipennis (GBIF ID: 9428064)
│   ├── cicadetta brevipennis litoralis (GBIF ID: 10243165)
│   └── cicadetta brevipennis hippolaidica (GBIF ID: 11135606)
├── cicadetta k

In [24]:
df_from_tree = txm.convert_tree_to_dataframe(tree, gbif_dataset[1], ncbi_dataset[1], "df_arthropoda.txt", index=False)
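
Conceptually, converting a tree to a table means emitting one row per node together with its path from the root. The sketch below shows that flattening with the standard library; the tuple layout, the `flatten` helper and the column names are assumptions and do not describe the columns that `convert_tree_to_dataframe` writes. The IDs are copied from the Cicadetta output above.

```python
import csv
import io

# Each node: (name, ncbi_id, gbif_id, children); None marks a missing ID.
toy_tree = ("cicadetta", 139461, 4407744, [
    ("cicadetta macedonica", 1740319, 7591128, []),
    ("cicadetta anapaistica", 1740310, 8414980, [
        ("cicadetta anapaistica anapaistica", None, 11192347, []),
    ]),
])


def flatten(node, path=()):
    """Yield one row per node, depth first, with its full path from the root."""
    name, ncbi_id, gbif_id, children = node
    path = path + (name,)
    yield {"name": name, "ncbi_id": ncbi_id, "gbif_id": gbif_id,
           "path": ";".join(path)}
    for child in children:
        yield from flatten(child, path)


rows = list(flatten(toy_tree))

# Write tab-separated text to an in-memory buffer; None becomes an empty cell.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "ncbi_id", "gbif_id", "path"],
                        delimiter="\t")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```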

In [25]:
txm.save_tree(tree, "./tree_arthropoda.txt", output_format='txt')

The tree is saved as TXT in the file: ./tree_arthropoda.txt.


## 6. Add iNaturalist information

In [29]:
inat_dataset = txm.download_inat_taxonomy()

iNaturalist taxonomy data already downloaded.
Processing samples...
Done.


In [30]:
inat_arthropoda = txm.select_inat_clade(inat_dataset, "Arthropoda")

In [31]:
inat_tree = txm.add_inat_taxonomy(tree, inat_arthropoda)
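
Adding a third source amounts to annotating existing nodes with one more identifier. The sketch below matches nodes by name, which is an assumption about the general idea and not a description of how `add_inat_taxonomy` works. The `add_source_ids` helper and the dict layout are hypothetical, and the iNaturalist values are placeholders, not real taxon IDs.

```python
def add_source_ids(node, ids_by_name, key):
    """Attach an ID from `ids_by_name` to every node whose name is found there."""
    if node["name"] in ids_by_name:
        node[key] = ids_by_name[node["name"]]
    for child in node.get("children", []):
        add_source_ids(child, ids_by_name, key)
    return node


# NCBI and GBIF IDs are copied from the Cicadetta output above.
toy_tree = {
    "name": "cicadetta", "ncbi_id": 139461, "gbif_id": 4407744,
    "children": [
        {"name": "cicadetta macedonica", "ncbi_id": 1740319, "gbif_id": 7591128},
        {"name": "cicadetta olympica", "ncbi_id": 1740320, "gbif_id": 8176504},
    ],
}
# Placeholder values for illustration only, not real iNaturalist taxon IDs.
placeholder_inat_ids = {"cicadetta": 1, "cicadetta macedonica": 2}

annotated = add_source_ids(toy_tree, placeholder_inat_ids, key="inat_id")
print(annotated)
```

Nodes without a counterpart in the third source, such as `cicadetta olympica` here, simply keep their existing identifiers.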

In [32]:
df_with_inat = txm.convert_tree_to_dataframe(inat_tree, gbif_dataset[1], ncbi_dataset[1], "MoultDB_backbone_v5.1.txt", inat_dataset=inat_arthropoda, index=True)

In [33]:
#txm.print_tree(inat_tree)

In [34]:
txm.save_tree(inat_tree, "final_tree_arthropoda.txt", output_format='txt')

The tree is saved as TXT in the file: final_tree_arthropoda.txt.
