## Discovering the Closest Living Relatives with Sequenced Genomes for Extinct Species in PaleoDB

![title](https://imagedelivery.net/wKQ19LTSBT0ARz08tkssqQ/www.courthousenews.com/2020/11/Kylinxia_zhangi.jpg/w=1880)

This notebook identifies the closest living relatives of extinct species cataloged in PaleoDB, using genomic data from the Arthropoda Assembly Assessment Catalogue (A3Cat). A3Cat provides an overview of the genome assemblies available for Arthropoda, which supports analyses of taxonomic coverage and genome assembly quality.

By linking extinct species with their extant counterparts, we aim to trace genetic continuities and divergence points, improving our understanding of evolutionary trends and ecological adaptations. To do so, we process the taxonomic data from both datasets and use TaxonMatch's `find_similar` function to retrieve the closest living relatives of a given extinct taxon.

## 1. Importing libraries

In [1]:
import taxonmatch as txm

In [2]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

## 2. Download the GBIF and NCBI datasets

In [3]:
gbif_dataset = txm.download_gbif_taxonomy()

GBIF backbone taxonomy data already downloaded.
Processing samples...
Done.


In [4]:
ncbi_dataset = txm.download_ncbi_taxonomy()

NCBI taxonomy data already downloaded.
Processing samples...
Done.


## 3. Import the A3Cat dataset

In [5]:
# Import the A3Cat dataset (tab-separated file)
a3cat = pd.read_csv("./a3cat_v2.tsv", sep="\t")

In [6]:
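# Keep only the NCBI taxa that have a genome assembly listed in A3Cat.
# TaxId is cast to str so that it can be compared with ncbi_id.
a3cat_filtered = ncbi_dataset[0][ncbi_dataset[0]["ncbi_id"].isin(a3cat["TaxId"].astype(str))]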
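
The `astype(str)` cast in the filter above matters whenever the two ID columns have different types. The sketch below illustrates this on toy tables. It assumes that `ncbi_id` holds strings and `TaxId` holds integers, which the cast suggests but which is not confirmed here; the frames `ncbi_toy` and `a3cat_toy` are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Toy stand-ins for the real tables (assumption: ncbi_id is str, TaxId is int).
ncbi_toy = pd.DataFrame({
    "ncbi_id": ["7460", "7227", "9606"],
    "name": ["Apis mellifera", "Drosophila melanogaster", "Homo sapiens"],
})
a3cat_toy = pd.DataFrame({"TaxId": [7460, 7227]})

# Comparing strings against integers matches nothing and raises no error.
no_cast = ncbi_toy[ncbi_toy["ncbi_id"].isin(a3cat_toy["TaxId"])]

# Casting TaxId to str first restores the intended matches.
with_cast = ncbi_toy[ncbi_toy["ncbi_id"].isin(a3cat_toy["TaxId"].astype(str))]

print(len(no_cast), len(with_cast))
```

Checking the number of rows kept by the filter against the number of unique `TaxId` values in A3Cat is a quick way to spot a silent type mismatch of this kind.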

## 4. Import and process the PaleoDB dataset

In [7]:
# Import the PBDB dataset (as of 30/08/2023); skip the 17 lines before the column header
pbdb = pd.read_csv("./pbdb_data.tsv", sep="\t", skiprows=17, low_memory=False)
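
The hard-coded `skiprows=17` depends on how many metadata lines a particular download contains. One possible alternative is to locate the header line instead. This is a minimal sketch, assuming the header is the first line whose first tab-separated field is `orig_no`. The helper `find_header_row` and the miniature file contents are hypothetical and only illustrate the idea.

```python
import io

def find_header_row(lines, first_column="orig_no"):
    """Return the 0-based index of the first line whose first
    tab-separated field equals `first_column`."""
    for index, line in enumerate(lines):
        first_field = line.rstrip("\n").split("\t")[0].strip('"')
        if first_field == first_column:
            return index
    raise ValueError(f"no header row starting with {first_column!r}")

# Hypothetical miniature of a download: metadata lines, then the table.
sample = io.StringIO(
    "Metadata line A\n"
    "Metadata line B\n"
    "Records:\n"
    "orig_no\ttaxon_no\ttaxon_name\n"
    "1\t1\tArthropoda\n"
)

header_row = find_header_row(sample)
print(header_row)  # 3
```

With a real file, the returned index could be passed as `skiprows` to `pd.read_csv`, so the notebook keeps working if the metadata block changes length.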

In [8]:
# Keep only the columns needed for the taxonomic comparison
filtered_pbdb = pbdb[['orig_no', 'taxon_no', 'taxon_rank',
       'taxon_name', 'common_name', 'parent_no', 'parent_name', 'immpar_no',
       'immpar_name', 'reference_no', 'is_extant', 'n_occs', 'phylum',
       'phylum_no', 'class', 'class_no', 'order', 'order_no', 'family',
       'family_no', 'genus']]

In [9]:
pbdb_arthropoda = filtered_pbdb[filtered_pbdb.phylum == "Arthropoda"]

In [10]:
pbdb_arthropoda_ = pbdb_arthropoda.copy()
# Create a new column with the full taxonomic lineage, from phylum down to the taxon name
pbdb_arthropoda_["taxonomy"] = (
    pbdb_arthropoda_["phylum"] + ";" + pbdb_arthropoda_["class"] + ";" + pbdb_arthropoda_["order"]
    + ";" + pbdb_arthropoda_["family"] + ";" + pbdb_arthropoda_["taxon_name"]
)
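
One caveat of building the lineage with `+` is that a single missing rank turns the whole string into `NaN`. The sketch below shows this on toy rows and one possible alternative that joins only the ranks present. The second row and its missing family are artificial, and this is not necessarily how the original analysis handled incomplete records.

```python
import pandas as pd

# Toy rows; the second one has an artificially missing family.
toy = pd.DataFrame({
    "phylum": ["Arthropoda", "Arthropoda"],
    "class": ["Insecta", "Arachnida"],
    "order": ["Hymenoptera", "Araneae"],
    "family": ["Formicidae", None],
    "taxon_name": ["Formica seuberti", "Unplaced taxon"],
})
ranks = ["phylum", "class", "order", "family", "taxon_name"]

# "+" concatenation: one missing rank makes the whole lineage NaN.
plus_concat = (
    toy["phylum"] + ";" + toy["class"] + ";" + toy["order"]
    + ";" + toy["family"] + ";" + toy["taxon_name"]
)

# Alternative: join only the ranks that are present in each row.
joined = toy[ranks].apply(lambda row: ";".join(row.dropna()), axis=1)

print(plus_concat.isna().tolist())
print(joined.tolist())
```

Dropping missing ranks changes the number of fields in the lineage, so whether this is preferable to discarding incomplete rows depends on how the downstream matching treats lineages of different depth.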

## 5. Finding the closest living relatives of extinct species

In [11]:
query = "Arthropoda;Insecta;Hymenoptera;Formicidae;Formica;Formica seuberti"

In [12]:
txm.find_similar(a3cat_filtered, query, 3)

| | Matched Target | Distance |
|---|---|---|
| 0 | arthropoda;insecta;hymenoptera;formicidae;formica;formica exsecta | 0.55 |
| 1 | arthropoda;insecta;hymenoptera;formicidae;formica;formica selysi | 0.55 |
| 2 | arthropoda;insecta;hymenoptera;formicidae;formica;formica aquilonia x formica polyctena | 0.58 |


In [13]:
query = "Arthropoda;Insecta;Lepidoptera;Zygaenidae;Zygaenites;Zygaenites controversus"

In [14]:
txm.find_similar(a3cat_filtered, query, 3)

| | Matched Target | Distance |
|---|---|---|
| 0 | arthropoda;insecta;lepidoptera;zygaenidae;zygaena;zygaena filipendulae | 0.71 |
| 1 | arthropoda;insecta;lepidoptera;nymphalidae;limenitis;limenitis arthemis | 0.87 |
| 2 | arthropoda;arachnida;araneae;uloboridae;uloborus;uloborus diversus | 0.9 |


In [15]:
query = 'Arthropoda;Malacostraca;Decapoda;Portunidae;Portunus;Portunus yaucoensis'

In [16]:
query = list(pbdb_arthropoda_.sample(1).taxonomy)[0]

In [17]:
query

'Arthropoda;Arachnida;Araneae;Theridiidae;Cretotheridion'

In [18]:
txm.find_similar(a3cat_filtered, query, 3)

| | Matched Target | Distance |
|---|---|---|
| 0 | arthropoda;insecta;diptera;athericidae;atherix;atherix ibis | 0.86 |
| 1 | arthropoda;insecta;hymenoptera;apidae;habropoda;habropoda laboriosa | 0.86 |
| 2 | arthropoda;insecta;lepidoptera;noctuidae;apamea;apamea epomidion | 0.87 |
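
In this last result, none of the three matches belongs to Arachnida, the class of the query, so a small distance alone does not guarantee a taxonomically close relative. A simple sanity check is to count how many leading ranks a match shares with the query. The helper `shared_rank_depth` below is a hypothetical sketch in plain Python. It is not part of TaxonMatch and not how `find_similar` computes its distance.

```python
def shared_rank_depth(query, target):
    """Count the leading ranks that two ';'-separated lineages share,
    ignoring case."""
    depth = 0
    for q_rank, t_rank in zip(query.lower().split(";"), target.lower().split(";")):
        if q_rank != t_rank:
            break
        depth += 1
    return depth

query = "Arthropoda;Arachnida;Araneae;Theridiidae;Cretotheridion"
matches = [
    "arthropoda;insecta;diptera;athericidae;atherix;atherix ibis",
    "arthropoda;insecta;hymenoptera;apidae;habropoda;habropoda laboriosa",
    "arthropoda;insecta;lepidoptera;noctuidae;apamea;apamea epomidion",
]

# Each match shares only the phylum with the query, so every depth is 1.
depths = [shared_rank_depth(query, target) for target in matches]
print(depths)  # [1, 1, 1]
```

For comparison, the best match for the *Zygaenites controversus* query shares four ranks with it, down to the family Zygaenidae. Matches that share only the phylum are probably best read as "no close sequenced relative available" rather than as genuine closest relatives.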
