<table style="width: 100%;">
    <tr>
        <td><a href="https://ieb-chile.cl/en/" target="_blank"><img src="https://raw.githubusercontent.com/IEB-BIODATA/pydwca-examples/main/images/logo/IEB.png" style="height: 100px;"></a></td>
        <td></td>
        <td><img src="https://raw.githubusercontent.com/IEB-BIODATA/pydwca-examples/main/images/logo/Biodata.png" style="height: 100px;"></td>
    </tr>
</table>

# Catalogue Of Life

A usage example of the `PyDwCA` library, reading the DwC-A of Catalogue of Life.

Reading the [Catalogue of Life](https://www.catalogueoflife.org/data/download):

In [1]:
import os
import pandas as pd
os.makedirs("data", exist_ok=True)

In [2]:
"!wget https://download.catalogueoflife.org/col/annual/2023_dwca.zip -O data/col-dwca.zip"

In [3]:
from dwca import DarwinCoreArchive

In [4]:
col = DarwinCoreArchive.from_file("data/col-dwca.zip")

In [5]:
print(col)

In [6]:
taxon = col.core
taxon.pandas

## Exploring this `DataFrame`

In [7]:
taxon.pandas.groupby("taxonRank").size()

In [8]:
taxon.pandas.groupby(["taxonRank", "taxonomicStatus"]).size().unstack(fill_value=0)

Because we use `"acceptedNameUsageID"` to find synonyms, we fill this column for all the `"accepted"` taxa with their own `"taxonID"`:

In [9]:
accepted_mask = taxon.pandas["taxonomicStatus"].str.contains("accepted")
taxon.pandas.loc[
    taxon.pandas[accepted_mask].index, "acceptedNameUsageID"
] = taxon.pandas[accepted_mask]["taxonID"]
taxon.pandas
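To make the effect of this step easier to see, here is a minimal sketch on a toy table. The rows and status values are made up for illustration and are not taken from the Catalogue of Life; only the column names match the cell above.

```python
import pandas as pd

# Toy data (hypothetical): one accepted taxon, one synonym of it,
# and one provisionally accepted taxon.
toy = pd.DataFrame({
    "taxonID": ["1", "2", "3"],
    "taxonomicStatus": ["accepted", "synonym", "provisionally accepted"],
    "acceptedNameUsageID": ["", "1", ""],
})

# Every row whose status mentions "accepted" points to itself.
is_accepted = toy["taxonomicStatus"].str.contains("accepted", na=False)
toy.loc[is_accepted, "acceptedNameUsageID"] = toy.loc[is_accepted, "taxonID"]

print(toy)
```

Because `str.contains` does substring matching, a status such as `"provisionally accepted"` is also selected by the mask.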

We also need the `"parentNameUsageID"` of synonyms, which is necessarily the same as that of the taxon referenced by their `"acceptedNameUsageID"`. We complete this field as well:

In [10]:
missing_parent = taxon.pandas[(taxon.pandas["parentNameUsageID"] == "") | pd.isna(taxon.pandas["parentNameUsageID"])]
parent2fill = missing_parent[["taxonID", "acceptedNameUsageID"]].merge(
    taxon.pandas[["taxonID", "parentNameUsageID"]],
    how="left",
    left_on="acceptedNameUsageID",
    right_on="taxonID",
)
parent2fill.index = missing_parent.index
taxon.pandas.loc[missing_parent.index, "parentNameUsageID"] = parent2fill.loc[parent2fill.index, "parentNameUsageID"]
taxon.pandas
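The same lookup can be illustrated on a toy table. This sketch uses `Series.map` instead of the `merge` in the cell above, which is a swapped-in technique and not the notebook's own code; the IDs are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): a genus "g1", an accepted species "s1" under it,
# and a synonym "s2" of "s1" whose parent is missing.
toy = pd.DataFrame({
    "taxonID": ["g1", "s1", "s2"],
    "acceptedNameUsageID": ["g1", "s1", "s1"],
    "parentNameUsageID": ["", "g1", None],
})

# Lookup table: taxonID -> parentNameUsageID (taxonID must be unique).
parent_of = toy.set_index("taxonID")["parentNameUsageID"]

# Rows with an empty or missing parent inherit the parent of their accepted taxon.
missing = toy["parentNameUsageID"].isna() | (toy["parentNameUsageID"] == "")
toy.loc[missing, "parentNameUsageID"] = toy.loc[missing, "acceptedNameUsageID"].map(parent_of)

print(toy)
```

The root taxon `"g1"` keeps an empty parent, since its accepted taxon is itself and it has no parent to inherit.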

Checking one extension:

In [11]:
col.extensions[0].pandas

## Filter Chilean species

We are going to filter the species reported in the COL DwC-A, keeping those found in Chile. To do that, we use a list of Chilean species obtained in [this notebook](<GBIF Chile Species.ipynb>).

In [12]:
chilean_species = pd.read_csv("data/chilean_species.tsv", sep="\t", header=0)
chilean_species

In [13]:
chilean_species.groupby("taxonRank").size()

In [14]:
chilean_species.groupby(["taxonRank", "taxonomicStatus"]).size().unstack(fill_value=0)

We take advantage of the [`filter_by_species`](https://pydwca.readthedocs.io/en/latest/dwca.classes.html#dwca.classes.taxon.Taxon.filter_by_species) method of the `Taxon` class of the PyDwCA library. This keeps only the species in the list, their synonyms, and their parent taxa.

In [15]:
taxon.filter_by_species(chilean_species["scientificName"])
taxon.pandas

And we check one extension to see if the filter propagates:

In [16]:
col.extensions[0].pandas

### Adding needed fields

We need more fields in this dataset, so we are going to add them using the [`add_field`]() method.

In [17]:
from tqdm import tqdm

In [18]:
taxon.fields

In [19]:
from dwca.terms import Kingdom, Phylum, DWCClass, Order, Family, Genus

taxon.add_field(Kingdom(index=22))
taxon.add_field(Phylum(index=23))
taxon.add_field(DWCClass(index=24))
taxon.add_field(Order(index=25))
taxon.add_field(Family(index=26))
taxon.add_field(Genus(index=27))

In [20]:
taxon.fields

In [21]:
taxon_df = taxon.pandas
taxon_df

Next, we take advantage of the pandas `DataFrame` to implement an algorithm that populates those fields. The idea is:
1. Find the taxa whose `"taxonRank"` corresponds to the field, e.g. `"kingdom"` for `Kingdom`.
2. Every child taxon of such a taxon gets its `"scientificName"` as the value of that field. Child taxa are the taxa whose `"parentNameUsageID"` is the `"taxonID"` of the taxon found.
3. The children of those children also get that `"scientificName"` as the value of the field.
4. Repeat until there are no more children.

In [34]:
def fill_taxa(df: pd.DataFrame, rank_name: str, to_print: bool = True) -> None:
    # Names of every taxon at the requested rank, e.g. all kingdoms.
    ranks = df[df["taxonRank"] == rank_name]["scientificName"]
    if to_print:
        print(ranks)
    else:
        # A single progress bar that advances once per taxon of this rank.
        bar_progress = tqdm(desc=f"Working with {rank_name}", total=len(ranks))
    for rank in ranks:
        total = len(df)
        if to_print:
            # One progress bar per taxon, counting the rows of the DataFrame.
            bar_progress = tqdm(desc=f"Working with {rank_name}: {rank}", total=total)
        else:
            bar_progress.set_description(f"Working with {rank_name}: {rank}", refresh=False)
        postfix = dict()
        # Start from the taxon itself, then descend one level per iteration.
        children = df[
            (df["taxonRank"] == rank_name) &
            (df["scientificName"] == rank)
        ]
        while len(children) > 0:
            df.loc[children.index, rank_name] = rank
            total -= len(children)
            if to_print:
                bar_progress.update(n=len(children))
            taxa = children["taxonID"]
            postfix["Working with ranks"] = ", ".join([str(r) for r in pd.unique(children["taxonRank"])])
            bar_progress.set_postfix(ordered_dict=postfix, refresh=to_print)
            # Next level: direct children plus all of their synonyms
            # (relies on the `taxon` object defined above).
            children = df[df["parentNameUsageID"].isin(taxa)]
            synonyms = taxon.all_synonyms(children["taxonID"])
            children = df[
                (df["parentNameUsageID"].isin(taxa)) |
                (df["taxonID"].isin(synonyms))
            ]
        if to_print:
            bar_progress.update(n=total)
            bar_progress.close()
        else:
            bar_progress.update(n=1)
    if not to_print:
        bar_progress.close()
    return
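The level-by-level descent used by the function above may be easier to follow without the progress bars and the synonym lookup. The following is a simplified sketch on a hypothetical four-row table; `propagate` is an illustrative helper, not part of PyDwCA, and it deliberately ignores synonyms.

```python
import pandas as pd

# Toy hierarchy (hypothetical): Plantae > Tracheophyta > Magnoliopsida,
# plus a second, childless kingdom.
toy = pd.DataFrame({
    "taxonID": ["k", "p", "c", "x"],
    "parentNameUsageID": ["", "k", "p", ""],
    "taxonRank": ["kingdom", "phylum", "class", "kingdom"],
    "scientificName": ["Plantae", "Tracheophyta", "Magnoliopsida", "Animalia"],
})
toy["kingdom"] = ""


def propagate(df: pd.DataFrame, rank_name: str) -> None:
    """Write each rank_name taxon's scientificName to itself and its descendants."""
    roots = df[df["taxonRank"] == rank_name]
    for root_id, name in zip(roots["taxonID"], roots["scientificName"]):
        frontier = {root_id}  # taxa of the current level
        seen = set()          # guards against cycles in the parent links
        while frontier:
            df.loc[df["taxonID"].isin(frontier), rank_name] = name
            seen |= frontier
            children = df.loc[df["parentNameUsageID"].isin(frontier), "taxonID"]
            frontier = set(children) - seen


propagate(toy, "kingdom")
print(toy[["scientificName", "kingdom"]])
```

Each pass of the `while` loop handles one level of the hierarchy, so the number of iterations is bounded by the depth of the tree rather than by the number of rows.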

In [23]:
fill_taxa(taxon_df, "kingdom")
taxon_df

In [24]:
fill_taxa(taxon_df, "phylum")
taxon_df

In [35]:
fill_taxa(taxon_df, "class", to_print=False)
taxon_df

In [36]:
fill_taxa(taxon_df, "order", to_print=False)
taxon_df

In [37]:
fill_taxa(taxon_df, "family", to_print=False)
taxon_df

In [38]:
fill_taxa(taxon_df, "genus", to_print=False)
taxon_df

In [39]:
taxon.pandas = taxon_df
print(col)
print(col.extensions)

We save this modified archive as [`chilean-col.zip`](data/chilean-col.zip)

In [40]:
col.to_file("data/chilean-col.zip")

In [41]:
del col

## Extracting species

For our pipeline, we need to extract the tracheophytes. First, we check whether this taxon appears directly in the dataset. We start by reading the Darwin Core Archive generated in the previous step:

In [4]:
chilean_col = DarwinCoreArchive.from_file("data/chilean-col.zip")

In [5]:
taxon = chilean_col.core
taxon.pandas

In [6]:
taxon.pandas[taxon.pandas["scientificName"].str.lower() == "tracheophyta"]

In [8]:
taxon.pandas.groupby("phylum").size()

In [9]:
taxon.pandas[taxon.pandas["phylum"] == "Tracheophyta"]

In [10]:
taxon.pandas[taxon.pandas["phylum"] == "Tracheophyta"].groupby("class").size()

Because Tracheophyta is a phylum, we can extract all phyla except this one:

In [45]:
taxon.fields

In [48]:
found_phyla = pd.unique(taxon.pandas["phylum"])
found_phyla

In [49]:
found_phyla = list(found_phyla)
found_phyla.remove("")
found_phyla.remove("Tracheophyta")
found_phyla
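Note that `list.remove` raises a `ValueError` when the value is not in the list, which could happen if missing phyla are stored as `NaN` rather than `""`. A more defensive way to build the same list might look like the sketch below; the `phyla` series is made-up data standing in for `taxon.pandas["phylum"]`.

```python
import pandas as pd

# Stand-in for taxon.pandas["phylum"] (hypothetical values).
phyla = pd.Series(["Tracheophyta", "", "Chordata", None, "Chordata", "Arthropoda"])

# Drop missing values, deduplicate, then exclude the empty string and Tracheophyta.
other_phyla = [p for p in phyla.dropna().unique() if p not in ("", "Tracheophyta")]

print(other_phyla)
```

This works whether or not empty strings or missing values are present, so the cell does not fail on a differently encoded archive.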

In [50]:
taxon.filter_by_phylum(found_phyla)
taxon.pandas

In [54]:
print(chilean_col)
for extension in chilean_col.extensions:
    print(extension)

In [52]:
chilean_col.to_file("data/chilean_col_var.zip")

<table>
    <tr>
        <td colspan="3" style="text-align: center;"><p>BIODATA - <a href="https://ieb-chile.cl/en/" target="_blank">Institute of Ecology and Biodiversity</a> © 2024</p></td>
    </tr>
</table>