# Discovering the Closest Living Relatives with Sequenced Genomes of Extinct Species in The Paleobiology Database

![title](https://staticgeopop.akamaized.net/wp-content/uploads/sites/32/2023/04/THUMB-VIDEO-DODO-CLEAN.jpg?)

This notebook focuses on identifying the closest living relatives of the extinct Dodo (Raphus cucullatus) using available genomic data from NCBI and fossil records from The Paleobiology Database. By integrating taxonomic and genetic information, we aim to establish phylogenetic relationships between the Dodo and its extant relatives, tracing evolutionary continuities and divergence points.

Using TaxonMatch's integration functions, we process taxonomic data from both NCBI and The Paleobiology Database to identify the species most closely related to the Dodo. This approach clarifies its evolutionary placement and provides insight into the genetic traits preserved in its closest living relatives.

## 1. Importing libraries

In [1]:
import taxonmatch as txm

In [2]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

## 2. Downloading the Paleobiology Database and NCBI datasets

In [3]:
txm.find_dataset_ids_by_name("Paleobiology Database")

Title: Paleobiology Database, ID: bb5b30b4-827e-4d5e-a86a-825d65cb6583
Title: The Paleobiology Database, ID: c33ce2f2-c3cc-43a5-a380-fe4526d63650
Title: GBIF Backbone Taxonomy, ID: d7dddbf4-2cf0-4f39-9b2a-bb099caae36c
Title: Catalogue of Life, ID: 7ddf754f-d193-4cc9-b351-99906754a03b
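
The listing above contains two near-identical titles, and the download cell that follows uses the ID of the second one. The sketch below is a minimal illustration, assuming the listing is available as plain text in exactly the printed format; the helper `parse_listing` is hypothetical and not part of TaxonMatch. It shows how the ID could be looked up by exact title instead of being copied by hand:

```python
# Text copied from the output above; the format "Title: <title>, ID: <id>" is assumed.
listing = """\
Title: Paleobiology Database, ID: bb5b30b4-827e-4d5e-a86a-825d65cb6583
Title: The Paleobiology Database, ID: c33ce2f2-c3cc-43a5-a380-fe4526d63650
Title: GBIF Backbone Taxonomy, ID: d7dddbf4-2cf0-4f39-9b2a-bb099caae36c
Title: Catalogue of Life, ID: 7ddf754f-d193-4cc9-b351-99906754a03b
"""


def parse_listing(text):
    """Hypothetical helper: map each dataset title to its ID."""
    datasets = {}
    for line in text.splitlines():
        if not line.startswith("Title: "):
            continue
        # Split on the last ", ID: " so titles containing commas stay intact.
        title, _, dataset_id = line[len("Title: "):].rpartition(", ID: ")
        datasets[title] = dataset_id
    return datasets


datasets = parse_listing(listing)
print(datasets["The Paleobiology Database"])
```

Looking up by exact title avoids confusing "Paleobiology Database" with "The Paleobiology Database", which have different IDs.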


In [4]:
paleobiology_dataset = txm.download_gbif_taxonomy(source="c33ce2f2-c3cc-43a5-a380-fe4526d63650")

GBIF backbone taxonomy data already downloaded.
Processing samples...
Done.


In [5]:
ncbi_dataset = txm.download_ncbi_taxonomy()

NCBI taxonomy data already downloaded.
Processing samples...
Done.


## 3. Finding the closest living relatives of a single extinct species

This section reconstructs the Dodo’s phylogenetic relationships by integrating taxonomic data from The Paleobiology Database, GBIF, and NCBI. The process begins by extracting "Raphus cucullatus" data from The Paleobiology Database and retrieving Columbidae taxonomy from GBIF and NCBI. An XGBoost model is then used to reconcile taxonomic entries, identifying matches and inconsistencies across datasets. The matched taxonomy is used to generate a phylogenetic tree, which is finally visualized to highlight the Dodo’s closest living relatives and its evolutionary placement.

In [6]:
txm.find_species_information("Raphus cucullatus", paleobiology_dataset)

| Unnamed: 0 | taxonID | datasetID | parentNameUsageID | acceptedNameUsageID | canonicalName | taxonRank | taxonomicStatus | kingdom | phylum | class | order | family | genus | gbif_taxonomy | gbif_taxonomy_ids |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 106343 | 2496198 | 7ddf754f-d193-4cc9-b351-99906754a03b | 2496196 | | Raphus cucullatus | species | accepted | Animalia | Chordata | Aves | Columbiformes | Columbidae | Raphus | chordata;aves;columbiformes;columbidae;raphus;raphus cucullatus | 44;212;1446;5233;2496196;2496198 |
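
The `gbif_taxonomy` and `gbif_taxonomy_ids` fields in the record above are parallel, semicolon-separated lists. The following sketch pairs each lineage name with its ID using the values shown in the output; the helper `pair_lineage` is hypothetical and only illustrates how the two fields line up:

```python
# Values copied from the Raphus cucullatus record above.
gbif_taxonomy = "chordata;aves;columbiformes;columbidae;raphus;raphus cucullatus"
gbif_taxonomy_ids = "44;212;1446;5233;2496196;2496198"


def pair_lineage(names, ids):
    """Hypothetical helper: zip lineage names with their numeric IDs."""
    name_list = names.split(";")
    id_list = [int(value) for value in ids.split(";")]
    if len(name_list) != len(id_list):
        raise ValueError("lineage names and IDs have different lengths")
    return list(zip(name_list, id_list))


lineage = pair_lineage(gbif_taxonomy, gbif_taxonomy_ids)
for name, taxon_id in lineage:
    print(f"{taxon_id:>8}  {name}")
```

The last pair corresponds to the record's own `taxonID`, and the one before it to its `parentNameUsageID`.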


In [7]:
paleo_columbidae, ncbi_columbidae = txm.select_taxonomic_clade("Columbidae", paleobiology_dataset, ncbi_dataset)

In [8]:
model = txm.load_xgb_model()

In [9]:
matched_df, unmatched_df, typos = txm.match_dataset(paleo_columbidae, ncbi_columbidae, model, tree_generation=True)
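
The matching step returns three results: matched entries, unmatched entries, and likely typos. To make that three-way split concrete, here is a deliberately simplified stand-in that compares names with `difflib` from the standard library. It is not TaxonMatch's XGBoost model and makes no claim about the features that model uses; the function `classify_name`, the 0.9 threshold, and the misspelled name are all assumptions for illustration:

```python
from difflib import SequenceMatcher


def classify_name(query, candidates, threshold=0.9):
    """Illustrative only: label a name as an exact match, a near match, or no match."""
    q = query.strip().lower()
    normalized = {c.strip().lower(): c for c in candidates}
    if q in normalized:
        return "exact", normalized[q]
    best, best_score = None, 0.0
    for key, original in normalized.items():
        score = SequenceMatcher(None, q, key).ratio()
        if score > best_score:
            best, best_score = original, score
    if best_score >= threshold:
        return "near", best  # candidate typo or spelling variant
    return "none", None


# A few names taken from the tree printed in this notebook.
ncbi_names = ["Nesoenas mayeri", "Geotrygon montana", "Gallicolumba keayi"]

print(classify_name("Geotrygon montana", ncbi_names))
print(classify_name("Nesoenas mayerii", ncbi_names))  # hypothetical misspelling
print(classify_name("Gallicolumba leonpascoi", ncbi_names))
```

A trained classifier can weigh the full lineage as well as the name, which a single string-similarity threshold cannot do.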

In [10]:
tree = txm.generate_taxonomic_tree(matched_df, unmatched_df)

In [11]:
txm.print_tree(tree)


└── columbidae (NCBI ID: 8930, GBIF ID: 5233)
    ├── geotrygon (NCBI ID: 115649, GBIF ID: 2496103)
    │   ├── geotrygon chrysia (NCBI ID: 1471290, GBIF ID: 2496120)
    │   └── geotrygon montana (NCBI ID: 115651, GBIF ID: 2496121)
    ├── gallicolumba (NCBI ID: 187119, GBIF ID: 2495214)
    │   ├── gallicolumba keayi (NCBI ID: 977961)
    │   └── gallicolumba leonpascoi (GBIF ID: 11221070)
    ├── alectroenas (NCBI ID: 187103, GBIF ID: 2495383)
    ├── pampusana (NCBI ID: 2953413)
    │   ├── pampusana beccarii (NCBI ID: 2953425, GBIF ID: 10674903)
    │   ├── pampusana jobiensis (NCBI ID: 2953430, GBIF ID: 10747886)
    │   ├── pampusana canifrons (NCBI ID: 2953426, GBIF ID: 10773621)
    │   └── pampusana xanthonura (NCBI ID: 2953437, GBIF ID: 10783350)
    ├── nesoenas (NCBI ID: 187125, GBIF ID: 7758956)
    │   ├── nesoenas mayeri (NCBI ID: 187126, GBIF ID: 5788471)
    │   ├── nesoenas picturatus (NCBI ID: 2953424, GBIF ID: 5788470)
    │   └── nesoenas rodericanus (GBIF ID: 77
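
In the printed tree, each node carries an NCBI ID, a GBIF ID, or both, which indicates whether the taxon was found in one source or in both. The sketch below parses a few complete lines copied from the output above and groups the nodes accordingly. The regular expressions and the helper `parse_node` are assumptions based only on the printed format, not on TaxonMatch internals:

```python
import re

# Excerpt of the printed tree (complete lines only).
tree_excerpt = """\
└── columbidae (NCBI ID: 8930, GBIF ID: 5233)
    ├── gallicolumba (NCBI ID: 187119, GBIF ID: 2495214)
    │   ├── gallicolumba keayi (NCBI ID: 977961)
    │   └── gallicolumba leonpascoi (GBIF ID: 11221070)
    ├── pampusana (NCBI ID: 2953413)
"""

NODE = re.compile(r"(?P<name>[a-z]+(?: [a-z]+)*) \((?P<ids>[^)]*)\)")
ID = re.compile(r"(NCBI|GBIF) ID: (\d+)")


def parse_node(line):
    """Hypothetical helper: return (name, ncbi_id, gbif_id), or None if the line does not parse."""
    match = NODE.search(line)
    if match is None:
        return None
    ids = {source: int(value) for source, value in ID.findall(match.group("ids"))}
    return match.group("name"), ids.get("NCBI"), ids.get("GBIF")


nodes = [n for n in map(parse_node, tree_excerpt.splitlines()) if n is not None]
both = [name for name, ncbi, gbif in nodes if ncbi is not None and gbif is not None]
ncbi_only = [name for name, ncbi, gbif in nodes if ncbi is not None and gbif is None]
gbif_only = [name for name, ncbi, gbif in nodes if ncbi is None and gbif is not None]

print("both sources:", both)
print("NCBI only:   ", ncbi_only)
print("GBIF only:   ", gbif_only)
```

Lines without a closing parenthesis, such as a truncated line, do not match the pattern and are skipped.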

In [12]:
df_columbidae = txm.convert_tree_to_dataframe(tree, paleo_columbidae, ncbi_columbidae, "columbidae_taxonomic_tree_df.txt")

In [13]:
ncbi_samples = df_columbidae[df_columbidae.ncbi_taxon_id.notna()]
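
The filter above keeps only the rows that have an NCBI taxon ID. A toy example shows the effect, assuming a DataFrame with an `ncbi_taxon_id` column as in the cell above; the `name` and `gbif_taxon_id` column names are hypothetical, and the three rows reuse IDs from the printed tree:

```python
import pandas as pd

# Toy frame: only `ncbi_taxon_id` is a column name taken from the notebook.
toy = pd.DataFrame(
    {
        "name": ["nesoenas mayeri", "gallicolumba keayi", "gallicolumba leonpascoi"],
        "ncbi_taxon_id": [187126, 977961, None],
        "gbif_taxon_id": [5788471, None, 11221070],
    }
)

# Same pattern as the notebook cell: a boolean mask built with notna().
ncbi_rows = toy[toy.ncbi_taxon_id.notna()]
print(ncbi_rows["name"].tolist())
```

Because missing IDs are stored as NaN, pandas holds such ID columns as floats; casting to a nullable integer type such as `Int64` is one option if integer IDs are needed downstream.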