## Discovering the Closest Living Relatives with Sequenced Genomes for Extinct Species in PaleoDB

![title](https://imagedelivery.net/wKQ19LTSBT0ARz08tkssqQ/www.courthousenews.com/2020/11/Kylinxia_zhangi.jpg/w=1880)

This notebook identifies the closest living relatives of extinct species cataloged in the Paleobiology Database (PaleoDB), using genomic data from the Arthropoda Assembly Assessment Catalogue (A3Cat). A3Cat provides an overview of the genomic data available for Arthropoda, which supports analyses of taxonomic coverage and genome assembly quality.

By linking extinct species to extant species with sequenced genomes, we aim to trace genetic continuities and divergence points and so improve our understanding of evolutionary trends and ecological adaptations. We use TaxonMatch to process the taxonomic data from both datasets and to identify, for each extinct species, its closest living relatives.

## 1. Importing libraries

In [1]:
import taxonmatch as txm

In [2]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

## 2. Downloading the Paleobiology Database and A3Cat datasets

In [3]:
paleodb_dataset = txm.download_gbif_taxonomy(source="c33ce2f2-c3cc-43a5-a380-fe4526d63650")

GBIF backbone taxonomy data already downloaded.
Processing samples...
Done.


In [4]:
a3cat = txm.download_ncbi_taxonomy(source="a3cat")

NCBI taxonomy data already downloaded.
Processing samples...
Done.
a3cat v.2024-08-01 downloaded


## 3. Finding the closest living relatives of a single extinct species

In this section, we identify the closest living relatives of individual extinct species. Each query is the full taxonomic lineage of an extinct species, such as Formica seuberti, written as a semicolon-separated string. The `txm.find_top_n_similar` function compares the query against the A3Cat taxonomy and returns the requested number of closest extant species. In the outputs below, results are sorted by Distance, and a lower value indicates a closer match.
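
The query strings in this section follow one pattern: the ranks from phylum to species, joined by semicolons. As a minimal sketch, assuming that six-rank order (inferred from the example queries, not from the TaxonMatch documentation), a small helper can assemble such a string from a dictionary. The names `RANKS` and `build_query` are hypothetical and introduced here only for illustration.

```python
# Rank order is an assumption inferred from the example queries in this notebook.
RANKS = ("phylum", "class", "order", "family", "genus", "species")


def build_query(taxon: dict) -> str:
    """Join the ranks of `taxon` into a semicolon-separated lineage string."""
    missing = [rank for rank in RANKS if not taxon.get(rank)]
    if missing:
        raise ValueError(f"missing ranks: {', '.join(missing)}")
    return ";".join(taxon[rank].strip() for rank in RANKS)


formica_seuberti = {
    "phylum": "Arthropoda",
    "class": "Insecta",
    "order": "Hymenoptera",
    "family": "Formicidae",
    "genus": "Formica",
    "species": "Formica seuberti",
}

query_string = build_query(formica_seuberti)
print(query_string)
```

The resulting string has the same form as the queries passed as the first argument to `txm.find_top_n_similar` in the cells that follow.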

In [5]:
query = "Arthropoda;Insecta;Hymenoptera;Formicidae;Formica;Formica seuberti"

In [6]:
txm.find_top_n_similar(query, a3cat, n_neighbors=4)

| | Query | ncbi_id | Matched Target | Distance |
|---|---|---|---|---|
| 0 | Arthropoda;Insecta;Hymenoptera;Formicidae;Formica;Formica seuberti | 72781 | Formica exsecta | 0.56 |
| 1 | Arthropoda;Insecta;Hymenoptera;Formicidae;Formica;Formica seuberti | 208979 | Formica selysi | 0.56 |
| 2 | Arthropoda;Insecta;Hymenoptera;Formicidae;Formica;Formica seuberti | 2796348 | Formica aquilonia x Formica polyctena | 0.59 |
| 3 | Arthropoda;Insecta;Hymenoptera;Formicidae;Formica;Formica seuberti | 1830378 | Formica aserva | 0.64 |


In [7]:
query_2 = "Arthropoda;Insecta;Lepidoptera;Zygaenidae;Zygaenites;Zygaenites controversus"

In [8]:
txm.find_top_n_similar(query_2, a3cat, n_neighbors=3)

| | Query | ncbi_id | Matched Target | Distance |
|---|---|---|---|---|
| 0 | Arthropoda;Insecta;Lepidoptera;Zygaenidae;Zygaenites;Zygaenites controversus | 287375 | Zygaena filipendulae | 0.7 |
| 1 | Arthropoda;Insecta;Lepidoptera;Zygaenidae;Zygaenites;Zygaenites controversus | 124411 | Limenitis arthemis | 0.87 |
| 2 | Arthropoda;Insecta;Lepidoptera;Zygaenidae;Zygaenites;Zygaenites controversus | 327109 | Uloborus diversus | 0.9 |


In [9]:
query_3 = "Arthropoda;Malacostraca;Decapoda;Portunidae;Portunus;Portunus yaucoensis"

In [10]:
txm.find_top_n_similar(query_3, a3cat, n_neighbors=3)

| | Query | ncbi_id | Matched Target | Distance |
|---|---|---|---|---|
| 0 | Arthropoda;Malacostraca;Decapoda;Portunidae;Portunus;Portunus yaucoensis | 80836 | Portunus pelagicus | 0.53 |
| 1 | Arthropoda;Malacostraca;Decapoda;Portunidae;Portunus;Portunus yaucoensis | 210409 | Portunus trituberculatus | 0.59 |
| 2 | Arthropoda;Malacostraca;Decapoda;Portunidae;Portunus;Portunus yaucoensis | 7098 | Malacosoma neustria | 0.86 |
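
The three outputs above show that the top-n list can include distant taxa once close candidates run out. The third match for Portunus yaucoensis, for example, is Malacosoma neustria, a moth. One simple way to screen results, sketched below, is to keep only matches whose genus equals the genus in the query lineage. The helper `same_genus_matches` is hypothetical, and the rows are copied by hand from the outputs above rather than read from the returned DataFrame.

```python
def same_genus_matches(query, matches):
    """Keep (name, distance) pairs whose first word equals the query's genus.

    Assumes the genus is the second-to-last element of the lineage string,
    as in the example queries of this notebook.
    """
    parts = query.split(";")
    if len(parts) < 2:
        return []  # no genus available in a single-token query
    genus = parts[-2]
    return [(name, dist) for name, dist in matches if name.split()[0] == genus]


portunus_query = "Arthropoda;Malacostraca;Decapoda;Portunidae;Portunus;Portunus yaucoensis"
portunus_matches = [
    ("Portunus pelagicus", 0.53),
    ("Portunus trituberculatus", 0.59),
    ("Malacosoma neustria", 0.86),
]

zygaenites_query = "Arthropoda;Insecta;Lepidoptera;Zygaenidae;Zygaenites;Zygaenites controversus"
zygaenites_matches = [
    ("Zygaena filipendulae", 0.7),
    ("Limenitis arthemis", 0.87),
    ("Uloborus diversus", 0.9),
]

kept_portunus = same_genus_matches(portunus_query, portunus_matches)
kept_zygaenites = same_genus_matches(zygaenites_query, zygaenites_matches)
print(kept_portunus)
print(kept_zygaenites)
```

A strict genus filter is deliberately conservative: it keeps the two Portunus species but returns nothing for Zygaenites controversus, whose genus has no sequenced living member in these results. In that case a family-level check or a distance cutoff may be a better fit.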


## 4. Finding the closest living relatives of all extinct species

This final section searches for the closest living relative of every extinct species in the dataset. Applying `txm.find_closest_sample` to the full PaleoDB dataset maps each extinct taxon to its nearest match in A3Cat, together with the distance of that match.

In [11]:
txm.find_closest_sample(a3cat, paleodb_dataset[0])

| | Query | Matched_id | Matched Target | Distance |
|---|---|---|---|---|
| 0 | bacca | 226178 | Baccha elongata | 0.68 |
| 1 | celtitis | 56958 | Ceratitis rosa | 0.66 |
| 2 | aroides | 95590 | Troides helena | 0.62 |
| 3 | halleia | 2795559 | Helleia helle | 0.67 |
| 4 | amyelon | 680683 | Amyelois transitella | 0.57 |
| ... | ... | ... | ... | ... |
| 1433 | echinodermata;crinoidea;dendrocrinidae;dendrocrinus;dendrocrinus tener | 77173 | Dendroctonus valens | 0.67 |
| 1434 | echinodermata;crinoidea;dendrocrinidae;dendrocrinus;dendrocrinus minutus | 77173 | Dendroctonus valens | 0.67 |
| 1435 | echinodermata;crinoidea;dendrocrinidae;dendrocrinus;dendrocrinus longidactylus | 77173 | Dendroctonus valens | 0.68 |
| 1436 | echinodermata;crinoidea;dendrocrinidae;dendrocrinus;dendrocrinus leptos | 77173 | Dendroctonus valens | 0.67 |
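
The output above mixes two kinds of rows that probably deserve a second look. Some queries are single tokens with no lineage (for example `bacca`). Others belong to a phylum outside Arthropoda, such as the echinoderm crinoids matched to the beetle Dendroctonus valens. Because A3Cat covers only Arthropoda, a match for a non-arthropod query cannot be a close relative. The sketch below separates rows with an arthropod lineage from rows that need manual review. The helper `is_arthropod_lineage` is hypothetical, the first two rows are copied from the output above, and the third row is an illustrative example built from the Formica query in section 3, not a row shown in this output.

```python
def is_arthropod_lineage(query: str) -> bool:
    """True when the query is a multi-rank lineage whose first rank is Arthropoda."""
    parts = query.lower().split(";")
    return len(parts) > 1 and parts[0] == "arthropoda"


rows = [
    # (query, matched target, distance)
    ("bacca", "Baccha elongata", 0.68),
    ("echinodermata;crinoidea;dendrocrinidae;dendrocrinus;dendrocrinus tener",
     "Dendroctonus valens", 0.67),
    # Hypothetical row for illustration only (not taken from the output above).
    ("arthropoda;insecta;hymenoptera;formicidae;formica;formica seuberti",
     "Formica exsecta", 0.56),
]

arthropod_rows = [row for row in rows if is_arthropod_lineage(row[0])]
needs_review = [row for row in rows if not is_arthropod_lineage(row[0])]

print(len(arthropod_rows), "row(s) with an arthropod lineage")
print(len(needs_review), "row(s) flagged for review")
```

The same predicate could be applied to the Query column of the returned table before interpreting any match as a living relative.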
