<center>
<h1>Discovering the Closest Living Relatives with Sequenced Genomes for Extinct Species in the Paleobiology Database</h1>
</center>

![title](https://www.science.org/do/10.1126/science.aad1693/full/sn-pentecopterus-1644949894120.jpg)

This notebook focuses on identifying the closest living relatives of the extinct Ristoria pliocaenica, a Pliocene leucosiid crab, by leveraging available taxonomic and genetic data from NCBI and fossil occurrences from the Paleobiology Database. By integrating paleontological records and molecular phylogenies, we aim to clarify the evolutionary position of Ristoria within the family Leucosiidae, tracing lineage continuity and divergence from extant species.

Using the taxonomic reconciliation functions in TaxonMatch, we align fossil taxa with extant clades in NCBI and GBIF and identify the candidate species that share the closest phylogenetic affinity with Ristoria pliocaenica. This comparative approach provides a framework for studying morphological stasis and evolutionary innovation within the Leucosiidae lineage, including the persistence of carapace traits and ecological niches from the Pliocene to the present.

## 1. Importing libraries

In [None]:
import taxonmatch as txm

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

## 2. Downloading the Catalogue of Life and NCBI datasets

Although the Paleobiology Database is the primary source for fossil occurrences, the two Paleobiology datasets currently available in GBIF lack many fossil taxa in structured taxonomic form. As a result, extinct species such as Ristoria pliocaenica may not appear as taxonomic entries in either dataset.

In [None]:
txm.find_dataset_ids_by_name("Paleobiology Database")

In [None]:
dataset_id = txm.get_dataset_from_species("Ristoria pliocaenica")
print(dataset_id)

Instead, we rely on the Catalogue of Life (CoL) to provide a standardized taxonomic placement for this extinct species, since Ristoria is included as a valid genus in the CoL taxonomy backbone integrated into GBIF (source: `7ddf754f-d193-4cc9-b351-99906754a03b`).
This approach keeps lineage tracing and phylogenetic mapping consistent across datasets, even when fossil taxa are missing from occurrence-based datasets such as the Paleobiology Database.

In [None]:
catalogue_of_life_dataset = txm.download_gbif_taxonomy(source="7ddf754f-d193-4cc9-b351-99906754a03b")

In [None]:
ncbi_dataset = txm.download_ncbi_taxonomy()

## 3. Finding the closest living relatives of a single extinct species

The process begins by extracting Ristoria pliocaenica and its parent lineage from both the Catalogue of Life and NCBI taxonomies. The `select_closest_common_clade` function compares the two lineages to identify the nearest shared ancestral clade. Discrepancies in rank depth and naming conventions are then detected and resolved, which reconciles the taxonomy of the two sources.

In [None]:
catalogue_of_life_parents, ncbi_parents = txm.select_closest_common_clade("Ristoria pliocaenica", catalogue_of_life_dataset, ncbi_dataset)
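
To make the idea of a "nearest shared ancestral clade" concrete, here is a minimal sketch in plain Python. It is not the TaxonMatch implementation: the helper `closest_common_clade` and both lineages are hypothetical and hand-written for illustration, and the real function works on the downloaded datasets rather than on simple lists.

```python
def closest_common_clade(lineage_a, lineage_b):
    """Return the most specific name found in both root-to-tip lineages, or None."""
    names_in_b = {name.lower() for name in lineage_b}
    # Walk lineage_a from the tip towards the root: the first hit is the deepest shared clade.
    for name in reversed(lineage_a):
        if name.lower() in names_in_b:
            return name
    return None


# Illustrative lineages only (assumed for this example, not downloaded from CoL or NCBI).
col_lineage = [
    "Animalia", "Arthropoda", "Malacostraca", "Decapoda",
    "Leucosiidae", "Ristoria", "Ristoria pliocaenica",
]
ncbi_lineage = [
    "Metazoa", "Arthropoda", "Malacostraca", "Decapoda",
    "Brachyura", "Leucosiidae", "Leucosia", "Leucosia anatum",
]

shared = closest_common_clade(col_lineage, ncbi_lineage)
print(shared)  # Leucosiidae
```

The sketch compares names by membership rather than by position, so an extra rank in one source (here `Brachyura`) does not prevent the shared family from being found.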

In [None]:
model = txm.load_xgb_model()

In [None]:
matched_df, unmatched_df, typos = txm.match_dataset(catalogue_of_life_parents, ncbi_parents, model, tree_generation = True)
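
The internals of `match_dataset` are not shown in this notebook, and its matching relies on the model loaded above. As a rough intuition for how a near-identical spelling in one source can be paired with a name in the other, the sketch below swaps in plain string similarity from the standard library's `difflib`. The helper `best_match`, the 0.9 threshold, and the name list are assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def best_match(name, candidates, threshold=0.9):
    """Return (candidate, score) for the most similar candidate.

    The candidate is None when no score reaches the threshold.
    """
    if not candidates:
        return (None, 0.0)
    scored = [
        (SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
        for c in candidates
    ]
    score, candidate = max(scored)
    return (candidate if score >= threshold else None, score)


# Illustrative names only (assumed, not taken from the downloaded NCBI dataset).
ncbi_names = ["Leucosiidae", "Leucosia", "Philyra", "Ebalia"]

typo_hit = best_match("Leucosidae", ncbi_names)   # misspelled family name
no_hit = best_match("Ristoria", ncbi_names)       # fossil genus absent from the list

print(typo_hit[0])  # Leucosiidae
print(no_hit[0])    # None
```

A fixed similarity threshold is a much cruder criterion than a trained classifier, so this should be read as an illustration of the problem rather than of the method.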

The resulting matched taxonomy is then used to position Ristoria pliocaenica within the broader evolutionary framework of decapod crustaceans. Although no genetic data are available for this extinct species, the curated taxonomic hierarchies allow a meaningful approximation of its phylogenetic placement and of its relationship to extant leucosiid genera.

In [None]:
tree = txm.generate_taxonomic_tree(matched_df, unmatched_df)

In [None]:
txm.print_tree(tree)

In [None]:
df_leucosiidae = txm.convert_tree_to_dataframe(tree, catalogue_of_life_parents, ncbi_parents, "leucosiidae_taxonomic_tree_df.txt")

In [None]:
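# Keep only rows that have an NCBI taxon ID and whose NCBI canonical name
# has exactly two words, i.e. species-level binomials rather than higher ranks.
ncbi_binomials = df_leucosiidae.loc[
    df_leucosiidae["ncbi_taxon_id"].notna() &
    (df_leucosiidae["ncbi_canonical_name"].fillna("").str.split().str.len() == 2)
]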

In [None]:
ncbi_binomials
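
The filter above treats any canonical name with exactly two whitespace-separated tokens as a binomial. The plain-Python sketch below mirrors that rule on a few hand-written names and contrasts it with a stricter, hypothetical pattern. The helpers `is_two_token` and `is_strict_binomial` and the sample names are assumptions for illustration; whether placeholder names such as "Leucosia sp." actually occur in `ncbi_canonical_name` depends on the data.

```python
import re


def is_two_token(name):
    """Mirror the notebook filter: exactly two whitespace-separated tokens."""
    return len((name or "").split()) == 2


# Hypothetical stricter rule: capitalised genus followed by a lowercase epithet,
# which rejects open-nomenclature placeholders such as "sp.".
STRICT_BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z-]+$")


def is_strict_binomial(name):
    return bool(STRICT_BINOMIAL.match(name or ""))


# Illustrative names only (assumed, not taken from df_leucosiidae).
names = ["Leucosia anatum", "Leucosiidae", "Leucosia sp.", "Philyra pisum", None]

two_token = [n for n in names if is_two_token(n)]
strict = [n for n in names if is_strict_binomial(n)]

print(two_token)  # ['Leucosia anatum', 'Leucosia sp.', 'Philyra pisum']
print(strict)     # ['Leucosia anatum', 'Philyra pisum']
```

If the resulting table contains such placeholder names, a stricter check along these lines could be applied before treating the rows as candidate living relatives.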