<center>
<h1>Matching genomes of endangered species in A3cat with conservation data from IUCN Red List</h1>
</center>

![title](https://static.vecteezy.com/system/resources/previews/026/746/427/non_2x/illustration-image-nature-and-sustainability-eco-friendly-living-and-conservation-concept-art-of-earth-and-animal-life-in-different-environments-generative-ai-illustration-free-photo.jpg)

This notebook analyzes species conservation statuses by integrating IUCN Red List data with information on genome availability from A3CAT (the Arthropoda Assembly Assessment Catalogue). The goal is to assess the overlap between species with genomic assemblies and those under threat, and to visualize how genome sequencing efforts are distributed across IUCN categories.

### **Datasets Used**
- **IUCN Red List Data**: Provides extinction risk categories for thousands of species based on population trends, habitat loss, and other ecological indicators.
- **A3CAT**: The Arthropoda Assembly Assessment Catalogue, a curated list of arthropod species for which genome assemblies are available, covering multiple clades within the phylum Arthropoda.

## 1. Import Libraries

In [None]:
import taxonmatch as txm

In [None]:
import logging
# Suppress warnings from the requests library
logging.getLogger("requests").setLevel(logging.ERROR)

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

## 2. Download the GBIF and A3cat datasets

We download two core datasets:

- The GBIF taxonomic backbone, used to normalize species names and resolve hierarchical taxonomic structure.
- The A3cat genomic dataset, which lists arthropod taxa with public genome assemblies, including metadata on genome quality and completeness.

These resources allow downstream matching with IUCN data to identify conservation-relevant species.

In [None]:
gbif_dataset = txm.download_gbif_taxonomy()

In [None]:
a3cat_dataset = txm.download_ncbi_taxonomy(source = "a3cat")

## 3. Filtering samples for the phylum Arthropoda

In [None]:
gbif_arthropda, a3cat_arthropoda = txm.select_taxonomic_clade("arthropoda", gbif_dataset, a3cat_dataset)

## 4. Load the XGBoost model

In [None]:
model = txm.load_xgb_model()

## 5. Matching samples

This section aligns arthropod species between A3cat and the GBIF backbone, so that IUCN Red List statuses can be attached to the matched records in the next step. Taxonomic reconciliation is performed with TaxonMatch, which maps synonymous or ambiguous species names to a single entry. This matching step is crucial for identifying species for which both genomic and conservation data exist.

In [None]:
matched_df, unmatched_df, possible_typos_df = txm.match_dataset(gbif_arthropda, a3cat_arthropoda, model, tree_generation = False)
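
To give an intuition for why the matching step returns three tables (matched, unmatched, possible typos), here is a minimal sketch based on plain string similarity from the standard library. It is not TaxonMatch's method, which relies on the XGBoost model loaded above; the helper names `name_similarity` and `classify_pair`, the 0.9 threshold, and the misspelled example name are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] between two names."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def classify_pair(a: str, b: str, typo_threshold: float = 0.9) -> str:
    """Label a pair of names (hypothetical helper, threshold is an assumption)."""
    score = name_similarity(a, b)
    if score == 1.0:
        return "match"
    if score >= typo_threshold:
        return "possible typo"
    return "unmatched"


# Identical up to letter case, a one-letter misspelling, and unrelated names.
labels = [
    classify_pair("Apis mellifera", "apis mellifera"),
    classify_pair("Apis mellifera", "Apis melifera"),
    classify_pair("Apis mellifera", "Bombus terrestris"),
]
```

A pure string comparison like this cannot resolve true synonyms (different names for the same taxon), which is one reason a trained model and the taxonomic hierarchy are used in the actual pipeline.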

## 6. Add conservation status to the results

Once species from A3cat are matched to IUCN entries, this step appends their official conservation status (e.g., Least Concern, Vulnerable, Endangered, Critically Endangered). These categories reflect IUCN’s systematic assessment of extinction risk, based on population trends, habitat threats, and range size.

In [None]:
df_with_iucn_status = txm.add_iucn_status_column(matched_df)

In [None]:
endangered = df_with_iucn_status[df_with_iucn_status.iucnRedListCategory.isin(['ENDANGERED', 'CRITICALLY_ENDANGERED', 'VULNERABLE'])]
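
The filter above keeps the three categories that the IUCN groups together as "threatened": Vulnerable, Endangered and Critically Endangered. As a minimal sketch of what this selection does, the following runs the same `isin` filter on a small toy table. The toy rows (IDs and names) are placeholders, not real records, and the `LEAST_CONCERN` label is an assumed spelling.

```python
import pandas as pd

# Toy stand-in for df_with_iucn_status; IDs and names are made-up placeholders.
toy_status = pd.DataFrame({
    "ncbi_id": [1, 2, 3, 4],
    "canonicalName": ["Species a", "Species b", "Species c", "Species d"],
    "iucnRedListCategory": ["LEAST_CONCERN", "VULNERABLE", "ENDANGERED", None],
})

# The three category labels used in the notebook's filter.
THREATENED = ["ENDANGERED", "CRITICALLY_ENDANGERED", "VULNERABLE"]

# isin() is False for missing values, so species without a status are dropped.
toy_endangered = toy_status[toy_status["iucnRedListCategory"].isin(THREATENED)]

# Quick overview of how many species fall into each category (NaN included).
toy_counts = toy_status["iucnRedListCategory"].value_counts(dropna=False)
```

Checking `value_counts(dropna=False)` on the real table before filtering is a cheap way to see how many matched species have no IUCN assessment at all.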

## 7. Filtering and Ordering Conservation Categories

To visualize the data meaningfully, this section orders the IUCN categories along a gradient of increasing extinction risk. This facilitates more intuitive and informative plots, e.g., from Least Concern → Near Threatened → Vulnerable → Endangered → Critically Endangered → Extinct. Filtering may also remove categories with too few representatives.
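
One way to impose such a risk gradient in pandas is an ordered categorical. The sketch below is an illustration on toy data, not the implementation behind `txm.plot_conservation_statuses`. Category labels other than the three used in the filter above are assumed spellings, and `RISK_ORDER` is a hypothetical name.

```python
import pandas as pd

# Assumed labels, ordered from lowest to highest extinction risk.
RISK_ORDER = [
    "LEAST_CONCERN",
    "NEAR_THREATENED",
    "VULNERABLE",
    "ENDANGERED",
    "CRITICALLY_ENDANGERED",
    "EXTINCT",
]

# Toy data; DATA_DEFICIENT is deliberately not part of RISK_ORDER.
toy = pd.DataFrame({
    "iucnRedListCategory": [
        "ENDANGERED", "LEAST_CONCERN", "VULNERABLE",
        "LEAST_CONCERN", "DATA_DEFICIENT",
    ]
})

# Labels missing from RISK_ORDER become NaN in the ordered categorical.
toy["risk"] = pd.Categorical(
    toy["iucnRedListCategory"], categories=RISK_ORDER, ordered=True
)

# Count per category, sort by risk order, and drop empty categories.
risk_counts = toy["risk"].value_counts().sort_index()
risk_counts = risk_counts[risk_counts > 0]
```

With the counts in this order, a bar plot reads left to right from lowest to highest risk, and the `> 0` filter can be raised to hide categories with too few representatives.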

In [None]:
txm.plot_conservation_statuses(df_with_iucn_status)

## 8. Extracting Genomically Sequenced Endangered Arthropods

In this last section, we extract arthropod species that are both listed as threatened in the IUCN Red List and have genome assemblies available in GenBank, as reported by A3cat. This is done by matching NCBI taxonomy IDs and retrieving the associated GenBank accession numbers.

The resulting table includes species name, NCBI ID, GenBank assembly ID, and IUCN conservation category. These species represent high-priority cases where genomic data is already available and can directly support conservation efforts.
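
The filter, join and de-duplication described here can be sketched on toy data as follows. Column names follow the notebook's tables. All IDs, names and accessions are made-up placeholders, and the `validate` argument is an optional safeguard added for illustration.

```python
import pandas as pd

# Toy stand-in for the A3cat assembly table; one species has two assemblies.
toy_a3cat = pd.DataFrame({
    "ncbi_id": [101, 101, 202, 303],
    "ncbi_canonicalName": ["Species one", "Species one", "Species two", "Species three"],
    "Genbank Accession": ["ACC_0001", "ACC_0002", "ACC_0003", "ACC_0004"],
})

# Toy stand-in for the threatened subset.
toy_endangered = pd.DataFrame({
    "ncbi_id": [101, 303],
    "taxonID": [9001, 9003],
    "iucnRedListCategory": ["VULNERABLE", "CRITICALLY_ENDANGERED"],
})

# Keep only assemblies whose NCBI ID appears among the threatened species.
toy_filtered = toy_a3cat[toy_a3cat["ncbi_id"].isin(toy_endangered["ncbi_id"])]

# Attach the conservation status; validate raises if the right-hand
# table contains the same ncbi_id more than once.
toy_merged = toy_filtered.merge(
    toy_endangered, on="ncbi_id", how="left", validate="many_to_one"
)

# Keep one row (the first assembly) per species.
toy_final = toy_merged.drop_duplicates(subset="ncbi_id")
```

Note that `drop_duplicates(subset="ncbi_id")` keeps only the first assembly listed for each species, so additional accessions for the same species do not appear in the final table.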

In [None]:
# Keep A3cat rows (first element of a3cat_dataset) whose NCBI ID is in the threatened set
filtered = a3cat_dataset[0][a3cat_dataset[0].ncbi_id.isin(endangered.ncbi_id)][["ncbi_id", "ncbi_canonicalName", "Genbank Accession"]]

In [None]:
merged = pd.merge(filtered, endangered[["ncbi_id", "taxonID", "iucnRedListCategory"]], on="ncbi_id", how="left")

In [None]:
# Keep one row per species, then sort alphabetically by category label
final = merged.drop_duplicates(subset="ncbi_id").sort_values(by="iucnRedListCategory", ascending=True)

In [None]:
final