<center>
<h1>Discovering the Closest Living Relatives with Sequenced Genomes of Extinct Species in the Paleobiology Database</h1>
</center>

![title](https://www.science.org/do/10.1126/science.aad1693/full/sn-pentecopterus-1644949894120.jpg)

This notebook focuses on identifying the closest living relatives of the extinct Ristoria pliocaenica, a Pliocene leucosiid crab, by leveraging available taxonomic and genetic data from NCBI and fossil occurrences from the Paleobiology Database. By integrating paleontological records and molecular phylogenies, we aim to clarify the evolutionary position of Ristoria within the family Leucosiidae, tracing lineage continuity and divergence from extant species.

Using the taxonomic reconciliation functions in TaxonMatch, we align fossil taxa with extant clades in NCBI and GBIF and identify candidate species that share the closest phylogenetic affinity with Ristoria pliocaenica. This comparative approach provides a framework for understanding morphological stasis and evolutionary innovation within the Leucosiidae lineage, offering insights into the persistence of carapace traits and ecological niches from the Pliocene to the present.

## 1. Importing libraries

In [1]:
import taxonmatch as txm

In [2]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

## 2. Downloading the Paleobiology and NCBI datasets

Although the Paleobiology Database is the primary source for fossil occurrences, its taxonomic backbone in GBIF lacks many fossil taxa in structured form. As a result, extinct species like Ristoria pliocaenica may not be represented as taxonomic entries within the two Paleobiology datasets currently available in GBIF.

In [3]:
txm.find_dataset_ids_by_name("Paleobiology Database")

Title: Paleobiology Database, ID: bb5b30b4-827e-4d5e-a86a-825d65cb6583
Title: The Paleobiology Database, ID: c33ce2f2-c3cc-43a5-a380-fe4526d63650
Title: GBIF Backbone Taxonomy, ID: d7dddbf4-2cf0-4f39-9b2a-bb099caae36c
Title: Catalogue of Life, ID: 7ddf754f-d193-4cc9-b351-99906754a03b


In [4]:
dataset_id = txm.get_dataset_from_species("Ristoria pliocaenica")
print(dataset_id)

File not found: /Users/mleone1/Desktop/UNIL/GitHub/TaxonMatch/notebooks/fossils/GBIF_output/Taxon.tsv


Instead, we rely on the Catalogue of Life (CoL) to provide a standardized taxonomic placement for this extinct species, since Ristoria is included as a valid genus in the CoL taxonomy backbone integrated into GBIF (source: 7ddf754f-d193-4cc9-b351-99906754a03b).
This approach ensures consistent lineage tracing and phylogenetic mapping across datasets, even when fossil taxa are missing from occurrence-based datasets such as the Paleobiology Database.

In [5]:
catalogue_of_life_dataset = txm.download_gbif_taxonomy(source="7ddf754f-d193-4cc9-b351-99906754a03b")

Downloading GBIF Taxonomic Data: 926MB [00:28, 34.2MB/s] 


GBIF backbone taxonomy has been downloaded successfully.
Processing samples...
Done.


In [6]:
ncbi_dataset = txm.download_ncbi_taxonomy()

Downloading NCBI Taxonomic Data: 67.0MB [00:03, 19.0MB/s]


NCBI taxonomy has been downloaded successfully.
Processing samples...
Done.


## 3. Finding the closest living relatives of a single extinct species

The process begins by extracting Ristoria pliocaenica and its parent lineage from both the Catalogue of Life and NCBI taxonomies. Using the select_closest_common_clade function, the two lineages are compared to identify the nearest shared ancestral clade. Discrepancies in rank depth and naming conventions are detected and resolved, enabling taxonomic reconciliation between the two sources.

In [7]:
catalogue_of_life_parents, ncbi_parents = txm.select_closest_common_clade("Ristoria pliocaenica", catalogue_of_life_dataset, ncbi_dataset)

Last common node: leucosiidae
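
The idea of a "last common node" can be illustrated with a minimal sketch. This is not TaxonMatch's implementation: `last_common_node` is a hypothetical helper, and the two lineages below are simplified, illustrative lists rather than the full records from either source. The sketch walks the fossil's lineage from tip to root and stops at the first name that the second taxonomy also knows.

```python
def last_common_node(fossil_lineage, reference_names):
    """Return the deepest name in `fossil_lineage` that also occurs in `reference_names`.

    fossil_lineage: names ordered from root to tip.
    reference_names: iterable of names known to the second taxonomy.
    Comparison is case-insensitive; the result is returned in lowercase.
    """
    known = {name.lower() for name in reference_names}
    # Walk from the tip towards the root and stop at the first shared name.
    for name in reversed(fossil_lineage):
        if name.lower() in known:
            return name.lower()
    return None


# Simplified, illustrative lineages (assumptions, not the full source records).
col_lineage = ["Animalia", "Arthropoda", "Malacostraca", "Decapoda",
               "Leucosiidae", "Ristoria", "Ristoria pliocaenica"]
ncbi_names = ["Metazoa", "Arthropoda", "Malacostraca", "Decapoda",
              "Brachyura", "Leucosiidae", "Ebalia", "Ebalia nux"]

print("Last common node:", last_common_node(col_lineage, ncbi_names))
```

Because neither the species nor the genus appears in the second list, the search falls back to the family, which mirrors the output reported above.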


In [8]:
model = txm.load_xgb_model()

In [9]:
matched_df, unmatched_df, typos = txm.match_dataset(catalogue_of_life_parents, ncbi_parents, model, tree_generation = True)
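
To give a feel for what separating exact matches from probable typos involves, here is a deliberately simple stand-in based on character similarity from Python's standard library. TaxonMatch uses a trained XGBoost model for this step; the helper names and the 0.9 threshold below are assumptions made only for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify_pair(a, b, typo_threshold=0.9):
    """Label a pair of names as 'exact', 'possible typo', or 'different'.

    typo_threshold is an arbitrary illustrative cut-off, not a TaxonMatch setting.
    """
    if a.lower() == b.lower():
        return "exact"
    if similarity(a, b) >= typo_threshold:
        return "possible typo"
    return "different"


# Names taken from the Leucosiidae records discussed in this notebook.
print(classify_pair("Ebalia nux", "ebalia nux"))
print(classify_pair("Ebalia brayerii", "Ebalia bryerii"))
print(classify_pair("Ebalia nux", "Nucia speciosa"))
```

A single similarity score is easy to reason about but ignores rank, authorship, and lineage context, which is one reason a learned model can be preferable for real reconciliation.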

The resulting matched taxonomy is then used to position Ristoria pliocaenica within the broader evolutionary framework of decapod crustaceans. While genetic data is unavailable for this extinct species, the integration of curated taxonomic hierarchies allows for a meaningful approximation of its phylogenetic placement and its relation to extant leucosiid genera.

In [10]:
tree = txm.generate_taxonomic_tree(matched_df, unmatched_df)

In [11]:
txm.print_tree(tree)


└── leucosiidae (NCBI ID: 6800, GBIF ID: 3928)
    ├── ryphila (NCBI ID: 1816520, GBIF ID: 4644382)
    │   ├── ryphila cancellus (NCBI ID: 1816521, GBIF ID: 5969169)
    │   ├── ryphila bertrandi (GBIF ID: 8682323)
    │   └── ryphila verrucosa (GBIF ID: 5969167)
    ├── ebalia (NCBI ID: 580079, GBIF ID: 2221790)
    │   ├── ebalia cranchii (NCBI ID: 1582881, GBIF ID: 4382641)
    │   ├── ebalia edwardsii (NCBI ID: 2951280, GBIF ID: 4382634)
    │   ├── ebalia nux (NCBI ID: 1131617, GBIF ID: 2221797)
    │   ├── ebalia tuberculosa (NCBI ID: 580080, GBIF ID: 2221799)
    │   ├── ebalia granulosa (NCBI ID: 2268640, GBIF ID: 4382651)
    │   ├── ebalia tuberosa (NCBI ID: 1732101, GBIF ID: 2221793)
    │   ├── ebalia tumefacta (NCBI ID: 1732102)
    │   ├── ebalia cariosa (NCBI ID: 1676467)
    │   ├── unclassified ebalia (NCBI ID: 2644593)
    │   │   └── ebalia sp. bold:aay0490 (NCBI ID: 1818044)
    │   ├── ebalia dimorphoides (GBIF ID: 5969108)
    │   ├── ebalia longispinosa (GBIF I
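
Each node label in the printed tree shows which source knows the taxon: both IDs for a matched node, a single ID for a taxon found in only one taxonomy. The following sketch reads those labels back into structured records. It assumes only the label format visible in the output above; `parse_label` is a hypothetical helper, not a TaxonMatch function.

```python
import re

# Matches labels such as "ebalia nux (NCBI ID: 1131617, GBIF ID: 2221797)".
LABEL_RE = re.compile(
    r"^(?P<name>.+?) \("
    r"(?:NCBI ID: (?P<ncbi>\d+))?"
    r"(?:, )?"
    r"(?:GBIF ID: (?P<gbif>\d+))?"
    r"\)$"
)


def parse_label(line):
    """Parse one printed tree line into name, IDs, and match status."""
    # Drop indentation and the box-drawing characters used for branches.
    label = line.strip().lstrip("│├└─ ")
    match = LABEL_RE.match(label)
    if match is None or (match.group("ncbi") is None and match.group("gbif") is None):
        raise ValueError(f"Unrecognised tree label: {line!r}")
    ncbi = int(match.group("ncbi")) if match.group("ncbi") else None
    gbif = int(match.group("gbif")) if match.group("gbif") else None
    if ncbi is not None and gbif is not None:
        status = "matched in both"
    elif ncbi is not None:
        status = "NCBI only"
    else:
        status = "GBIF only"
    return {"name": match.group("name"), "ncbi_id": ncbi, "gbif_id": gbif, "status": status}


for line in [
    "    ├── ebalia (NCBI ID: 580079, GBIF ID: 2221790)",
    "    │   ├── ebalia tumefacta (NCBI ID: 1732102)",
    "    │   └── ryphila verrucosa (GBIF ID: 5969167)",
]:
    print(parse_label(line))
```

Taxa flagged "matched in both" or "NCBI only" are the ones that can carry sequence data, so they are the natural candidates when looking for living relatives.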

In [12]:
df_leucosiidae = txm.convert_tree_to_dataframe(tree, catalogue_of_life_parents, ncbi_parents, "leucosiidae_taxonomic_tree_df.txt")

In [13]:
ncbi_binomials = df_leucosiidae.loc[
    df_leucosiidae["ncbi_taxon_id"].notna() &
    (df_leucosiidae["ncbi_canonical_name"].fillna("").str.split().str.len() == 2)
]

In [14]:
ncbi_binomials

| | id | ncbi_taxon_id | gbif_taxon_id | ncbi_canonical_name | gbif_canonical_name | gbif_synonyms_ids | gbif_synonyms_names | ncbi_synonyms_names |
|---|---|---|---|---|---|---|---|---|
| 40 | 41 | 2858094 | | Unclassified leucosiidae | | | | |
| 93 | 94 | 1816521 | 5969169 | Ryphila cancellus | Ryphila cancellus | 5969021; 5969020; 5968911 | Cancer cancellus; Philyra cancella; Pseudophilyra burmensis | Cancer cancellus |
| 96 | 97 | 1582881 | 4382641 | Ebalia cranchii | Ebalia cranchii | 4382642 | Ebalia chiragra | |
| 97 | 98 | 2951280 | 4382634 | Ebalia edwardsii | Ebalia edwardsii | 11946291; 5969087; 5969088; 4382636 | Ebalia brayerii; Ebalia algirica; Ebalia ambigua; Ebalia bryerii | Ebalia ambigua; Ebalia algirica; Ebalia bryerii |
| 98 | 99 | 1131617 | 2221797 | Ebalia nux | Ebalia nux | 10931836 | Ebalia nux | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 525 | 526 | 1565490 | 4382673 | Coleusia biannulata | Coleusia biannulata | 5968930; 5731458 | Leucosia longifrons neocaledonia; Leucosia biannulata | |
| 531 | 532 | 1349633 | 2221767 | Nucia speciosa | Nucia speciosa | 10993447; 5969177 | Ebalia spec; Ebalia pfefferi | |
| 532 | 533 | 2790769 | | Unclassified nucia | | | | |
| 691 | 692 | 3069640 | | Pyrhila sp. | | | | |
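
The filter in cell 13 keeps any name made of exactly two tokens, so placeholder entries such as "Unclassified leucosiidae" and "Pyrhila sp." pass through alongside true species names. A stricter check could look like the sketch below. `is_binomial` is a hypothetical helper, and the capitalisation rule it assumes (capitalised genus, lowercase epithet) is based on the canonical names shown in the output above.

```python
import re

# Capitalised genus followed by a lowercase epithet (hyphens allowed), nothing else.
BINOMIAL_RE = re.compile(r"^[A-Z][a-z]+ [a-z][a-z-]+$")


def is_binomial(name):
    """Return True only for names that look like a genus + species binomial."""
    if not isinstance(name, str):
        return False  # covers missing values such as NaN or None
    if name.split(" ")[0].lower() == "unclassified":
        return False  # placeholder nodes, e.g. "Unclassified nucia"
    return bool(BINOMIAL_RE.match(name))


for candidate in ["Ebalia nux", "Unclassified leucosiidae", "Pyrhila sp.", "Ebalia"]:
    print(candidate, "->", is_binomial(candidate))
```

Applied to the table, `df_leucosiidae["ncbi_canonical_name"].map(is_binomial)` yields a boolean mask that can replace the token-count condition.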
