# Analyzing Bee Taxonomy: Integrating GBIF and NCBI Data for Apidae Insights

![title](https://live.staticflickr.com/4059/4632384645_a2230b26d5_b.jpg)

This Python notebook is designed for the purpose of integrating taxonomic data from two major biological databases, GBIF (Global Biodiversity Information Facility) and NCBI (National Center for Biotechnology Information), to enhance the accuracy and comprehensiveness of ecological and biological research. GBIF primarily focuses on biodiversity data including species distribution and ecological information, whereas NCBI provides a broader range of data including genomic and taxonomic details. 

Combining these sources enables researchers to cross-validate species identifications and improve the richness of ecological datasets with genetic information. A key biological task performed in this notebook is the construction of a taxonomic tree, which helps in visualizing and understanding the evolutionary relationships and classification hierarchy among different species within a chosen taxon (in this case, Apidae - a family of bees).

## 1. Importing libraries and downloading data

The initial steps involve downloading the most recent taxonomic data from GBIF and NCBI to ensure the analysis is based on the latest available information. 

In [2]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

In [20]:
import taxonmatch as txm

In [21]:
txm.download_gbif_taxonomy()

Downloading GBIF Taxonomic Data: 926MB [01:54, 8.48MB/s] 


GBIF backbone taxonomy has been downloaded successfully.


In [22]:
txm.download_ncbi_taxonomy(taxonkitpath="./taxonkit") #Specify taxonkit path

Downloading NCBI Taxonomic Data: 389kB [00:06, 59.5kB/s] 


NCBI taxonomic data has been downloaded successfully.


## 2. Loading and processing samples

In [None]:
gbif_dataset = txm.load_gbif_samples("./GBIF_output/Taxon.tsv")

In [23]:
ncbi_dataset = txm.load_ncbi_samples("./NCBI_output/ncbi_data.tsv")

In [68]:
ncbi_dataset[1].query("ncbi_id == 3148892")

Unnamed: 0,ncbi_id,ncbi_lineage_names,ncbi_lineage_ids,ncbi_canonicalName,ncbi_rank,ncbi_lineage_ranks,ncbi_target_string
2472448,3148892,cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Sauropsida;Sauria;Lepidosauria;Squamata;Bifurcata;Unidentata;Episquamata;Toxicofera;Serpentes;Henophidia;Pythonidae;Nyctophilopython,131567;2759;33154;33208;6072;33213;33511;7711;89593;7742;7776;117570;117571;8287;1338369;32523;32524;8457;32561;8504;8509;1329961;1329950;1329912;1329911;8570;34979;34984;3148892,Nyctophilopython,genus,no rank;superkingdom;clade;kingdom;clade;clade;clade;phylum;subphylum;clade;clade;clade;clade;superclass;clade;clade;clade;clade;clade;class;order;clade;clade;clade;clade;infraorder;superfamily;family;genus,Eukaryota;Chordata;Lepidosauria;Squamata;Pythonidae;Nyctophilopython;


In [None]:
2	6	[root, cellular organisms, prokaryotes, purple photosynthetic bacteria and relatives, Purple bacteria, alpha subdivision, Rhizobiales, Xanthobacteraceae, Azorhizobium]	[1, 131567, 2, 1224, 28211, 356, 335928, 6]	[no rank, no rank, superkingdom, phylum, class, order, family, genus]	prokaryotes;purple photosynthetic bacteria and relatives;purple bacteria, alpha subdivision;rhizobiales;xanthobacteraceae;azorhizobium

## 3.a Training the classifier model

If required, the notebook outlines steps to train a machine learning classifier to distinguish between correct and incorrect taxonomic matches. This involves generating positive and negative examples, preparing the training dataset, and comparing different models.

In [None]:
positive_matches = txm.generate_positive_set(gbif_dataset, ncbi_dataset, 5000)

In [None]:
negative_matches = txm.generate_negative_set(gbif_dataset, ncbi_dataset, 5000)

In [None]:
full_training_set = txm.prepare_data(positive_matches, negative_matches)

In [None]:
X_train, X_test, y_train, y_test = txm.generate_training_test(full_training_set)

In [None]:
txm.compare_models(X_train, X_test, y_train, y_test)

In [None]:
model = txm.XGBClassifier(learning_rate=0.1,n_estimators=500, max_depth=9, n_jobs=-1, colsample_bytree = 1, subsample = 0.8)

In [None]:
model.fit(X_train, y_train, verbose=False)

In [None]:
#with open('./files/model/xgb_model.pkl', 'wb') as file:
#    pickle.dump(model, file)

## 3.b Load a pre-trained model

Alternatively, it provides the option to load a pre-trained model, simplifying the process for routine analyses.

In [None]:
from taxonmatch.loader import load_xgb_model
model = load_xgb_model()

## 4. Match NCBI with GBIF dataset 

In this section, the focus is on comparing and aligning the taxonomic data from NCBI and GBIF datasets. It specifically targets the taxon "Apidae" to narrow down the analysis to a specific family of bees. Using a pre-trained machine learning model, the notebook matches records from both datasets, categorizing them as exact matches, unmatched, or potentially mislabeled due to typographical errors

In [None]:
gbif_apidae, ncbi_apidae = txm.select_taxonomic_clade("Apidae", gbif_dataset, ncbi_dataset)

In [None]:
matched_df, unmatched_df, possible_typos_df = txm.match_dataset(gbif_apidae, ncbi_apidae, model, tree_generation = True)

## 5. Generate the taxonomical tree 

In the last section, the notebook constructs a taxonomic tree from the matched and unmatched data between the GBIF and NCBI datasets, focusing on the Apidae family. This visual representation helps to illustrate the evolutionary relationships and classification hierarchy among the species. The tree is then converted into a dataframe for further analysis and saved in textual format for documentation and review purposes.

In [None]:
tree = txm.generate_taxonomic_tree(matched_df, unmatched_df)

In [None]:
df_from_tree = txm.convert_tree_to_dataframe(tree, ncbi_apidae, gbif_apidae, "taxonomic_tree_df.txt")

In [None]:
txm.print_tree(tree, root_name="Apidae")

In [None]:
txm.save_tree(tree, "taxon_tree.txt")

In [1]:
import os
import csv
import gzip
import zipfile
import subprocess
import pandas as pd
from tqdm import tqdm
import urllib.request

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd


In [2]:
def download_ncbi_taxonomy(output_folder=None):

    if output_folder is None:
        output_folder = os.getcwd()
    
    # Create a new folder for NCBI output
    NCBI_output_folder = os.path.join(output_folder, "NCBI_output")
    os.makedirs(NCBI_output_folder, exist_ok=True)
    
    # Download the NCBI taxondump
    url = "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip"
    tf = os.path.join(NCBI_output_folder, "taxdmp.zip")
    with tqdm(unit='B', unit_scale=True, unit_divisor=1024, miniters=1, desc="Downloading NCBI Taxonomic Data") as pbar:
        def report_hook(blocknum, blocksize, totalsize):
            pbar.update(blocknum * blocksize / totalsize * 100)
        urllib.request.urlretrieve(url, tf, reporthook=report_hook)
    
    # Decompress the downloaded zip file
    with zipfile.ZipFile(tf, "r") as zip_ref:
        zip_ref.extractall(NCBI_output_folder)

    # Lettura dei file .dmp
    nodes_path = os.path.join(NCBI_output_folder, 'nodes.dmp')
    names_path = os.path.join(NCBI_output_folder, 'names.dmp')

    # Leggere nodes.dmp usando una stringa raw per il separatore
    nodes_df = pd.read_csv(
        nodes_path, 
        sep=r'\t\|\t',  # Separatore di campo usando una stringa raw
        header=None,  # Nessuna riga di intestazione nel file
        usecols=range(13),  # Leggi le prime 13 colonne basate sull'estratto
        names=[
            'ncbi_id', 'parent_tax_id', 'rank', 'embl_code', 'division_id', 
            'inherited_div_flag', 'genetic_code_id', 'inherited_GC_flag', 
            'mitochondrial_genetic_code_id', 'inherited_MGC_flag', 
            'GenBank_hidden_flag', 'hidden_subtree_root_flag', 'comments'
        ],
        dtype=str,  # Imposta tutti i dati come stringhe per evitare problemi di conversione
        engine='python'
    ).replace(r'\t\|$', '', regex=True)  # Rimuovi l'ultimo separatore di campo usando una stringa raw   

    # Leggere names.dmp
    names_df = pd.read_csv(
        names_path, 
        sep=r'\t\|\t',  # Uso di stringa raw per il separatore
        header=None,  # Nessuna riga di intestazione nel file
        usecols=range(4),  # Ci sono quattro colonne basate sull'estratto
        names=['ncbi_id', 'name_txt', 'unique_name', 'name_class'],
        dtype=str,  # Imposta tutti i dati come stringhe per evitare problemi di conversione
        engine='python'
    ).replace(r'\t\|$', '', regex=True)  # Rimuovi l'ultimo separatore di campo da ogni riga

    # Filtrare 'names_df' per includere solo i nomi scientifici
    scientific_names_df = names_df[names_df['name_class'] == 'scientific name']
    
    # Creare il mappaggio da ncbi_id a name_txt, limitato ai nomi scientifici
    name_map = pd.Series(scientific_names_df['name_txt'].values, index=scientific_names_df['ncbi_id']).to_dict()
    
    # Mappatura da ncbi_id a parent_tax_id e rank
    parent_map = pd.Series(nodes_df['parent_tax_id'].values, index=nodes_df['ncbi_id']).to_dict()
    rank_map = pd.Series(nodes_df['rank'].values, index=nodes_df['ncbi_id']).to_dict()
    
    # Aggiungere le colonne per il nome scientifico e il rango diretti
    nodes_df['ncbi_canonicalName'] = nodes_df['ncbi_id'].apply(lambda x: name_map.get(x, ''))
    nodes_df['ncbi_rank'] = nodes_df['ncbi_id'].apply(lambda x: rank_map.get(x, ''))
    
    # Funzione per ottenere l'intera linea di discendenza, escludendo la radice
    def get_lineage(ncbi_id, map_dict):
        lineage = []
        while ncbi_id in map_dict and ncbi_id != '1':  # Escludi la radice
            lineage.append(ncbi_id)
            ncbi_id = map_dict[ncbi_id]
        return lineage[::-1]
    
    # Costruzione delle linee di discendenza
    nodes_df['ncbi_lineage_ids'] = nodes_df['ncbi_id'].apply(lambda x: get_lineage(x, parent_map))
    nodes_df['ncbi_lineage_names'] = nodes_df['ncbi_lineage_ids'].apply(lambda ids: [name_map.get(id, '') for id in ids])
    nodes_df['ncbi_lineage_ranks'] = nodes_df['ncbi_lineage_ids'].apply(lambda ids: [rank_map.get(id, '') for id in ids])
    
    # Converti le liste in stringhe
    nodes_df['ncbi_lineage_names'] = nodes_df['ncbi_lineage_names'].apply(lambda names: ';'.join(names))
    nodes_df['ncbi_lineage_ids'] = nodes_df['ncbi_lineage_ids'].apply(lambda ids: ';'.join(ids))
    nodes_df['ncbi_lineage_ranks'] = nodes_df['ncbi_lineage_ranks'].apply(lambda ranks: ';'.join(ranks))
    
    # Definire i ranghi target
    target_ranks = {'superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'}
    
    # Costruire la stringa target basata sui ranghi
    def build_target_string(names, ranks):
        names = names.split(';')
        ranks = ranks.split(';')
        return ';'.join(name for name, rank in zip(names, ranks) if rank in target_ranks).lower()
    
    nodes_df['ncbi_target_string'] = nodes_df.apply(
        lambda row: build_target_string(row['ncbi_lineage_names'], row['ncbi_lineage_ranks']), axis=1)
    
    # Visualizzazione dei risultati
    ncbi_full = nodes_df[['ncbi_id', 'ncbi_lineage_names', 'ncbi_lineage_ids', 'ncbi_canonicalName', 'ncbi_rank', 'ncbi_lineage_ranks', 'ncbi_target_string']]
    ncbi_subset = ncbi_full.copy()
    ncbi_subset['ncbi_target_string'] = ncbi_subset.apply(prepare_ncbi_strings, axis=1)
    ncbi_subset["ncbi_target_string"] = ncbi_subset["ncbi_target_string"].apply(remove_extra_separators).str.strip(';')
    ncbi_subset = ncbi_subset.drop_duplicates(subset="ncbi_target_string")
    
    return ncbi_subset, ncbi_full

In [None]:
ncbi_dataset = download_ncbi_taxonomy()

Downloading NCBI Taxonomic Data: 389kB [00:07, 51.2kB/s] 
Exception ignored in: <bound method IPythonKernel._clean_thread_parent_frames of <ipykernel.ipkernel.IPythonKernel object at 0x107091df0>>
Traceback (most recent call last):
  File "/Users/mleone1/miniconda3/envs/test/lib/python3.12/site-packages/ipykernel/ipkernel.py", line 770, in _clean_thread_parent_frames
    def _clean_thread_parent_frames(

KeyboardInterrupt: 


In [91]:
ncbi_dataset

Unnamed: 0,ncbi_id,ncbi_lineage_names,ncbi_lineage_ids,ncbi_canonicalName,ncbi_rank,ncbi_lineage_ranks,ncbi_target_string
0,1,,,root,no rank,,
1,2,cellular organisms;Bacteria,131567;2,Bacteria,superkingdom,no rank;superkingdom,bacteria
2,6,cellular organisms;Bacteria;Pseudomonadota;Alphaproteobacteria;Hyphomicrobiales;Xanthobacteraceae;Azorhizobium,131567;2;1224;28211;356;335928;6,Azorhizobium,genus,no rank;superkingdom;phylum;class;order;family;genus,bacteria;pseudomonadota;alphaproteobacteria;hyphomicrobiales;xanthobacteraceae;azorhizobium
3,7,cellular organisms;Bacteria;Pseudomonadota;Alphaproteobacteria;Hyphomicrobiales;Xanthobacteraceae;Azorhizobium;Azorhizobium caulinodans,131567;2;1224;28211;356;335928;6;7,Azorhizobium caulinodans,species,no rank;superkingdom;phylum;class;order;family;genus;species,bacteria;pseudomonadota;alphaproteobacteria;hyphomicrobiales;xanthobacteraceae;azorhizobium;azorhizobium caulinodans
4,9,cellular organisms;Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Buchnera;Buchnera aphidicola,131567;2;1224;1236;91347;1903409;32199;9,Buchnera aphidicola,species,no rank;superkingdom;phylum;class;order;family;genus;species,bacteria;pseudomonadota;gammaproteobacteria;enterobacterales;erwiniaceae;buchnera;buchnera aphidicola
...,...,...,...,...,...,...,...
2583642,3149255,cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Corticiales;Corticiaceae;Lyomyces;Lyomyces punctatomarginatus,131567;2759;33154;4751;451864;5204;5302;155619;355688;452338;5304;1234780;3149255,Lyomyces punctatomarginatus,species,no rank;superkingdom;clade;kingdom;subkingdom;phylum;subphylum;class;no rank;order;family;genus;species,eukaryota;basidiomycota;agaricomycetes;corticiales;corticiaceae;lyomyces;lyomyces punctatomarginatus
2583643,3149256,cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Pucciniomycotina;Pucciniomycetes;Pucciniales;Pucciniaceae;Puccinia;Puccinia jaagii,131567;2759;33154;4751;451864;5204;29000;162484;5258;5262;5296;3149256,Puccinia jaagii,species,no rank;superkingdom;clade;kingdom;subkingdom;phylum;subphylum;class;order;family;genus;species,eukaryota;basidiomycota;pucciniomycetes;pucciniales;pucciniaceae;puccinia;puccinia jaagii
2583644,3149307,cellular organisms;Bacteria;Pseudomonadota;Alphaproteobacteria;Hyphomicrobiales;Salaquimonadaceae,131567;2;1224;28211;356;3149307,Salaquimonadaceae,family,no rank;superkingdom;phylum;class;order;family,bacteria;pseudomonadota;alphaproteobacteria;hyphomicrobiales;salaquimonadaceae
2583645,3149308,cellular organisms;Bacteria;Pseudomonadota;Alphaproteobacteria;Hyphomicrobiales;Rhodoligotrophaceae,131567;2;1224;28211;356;3149308,Rhodoligotrophaceae,family,no rank;superkingdom;phylum;class;order;family,bacteria;pseudomonadota;alphaproteobacteria;hyphomicrobiales;rhodoligotrophaceae


In [3]:
def prepare_ncbi_strings(row):
    """
    Prepare NCBI taxonomy strings.

    Args:
    row (pd.Series): A row from the NCBI DataFrame.

    Returns:
    str: A cleaned taxonomy string.
    """
    parts = row['ncbi_target_string'].split(';')
    
    if row['ncbi_rank'] in ['species', 'subspecies', 'strain']:
        new_string = ';'.join(parts[1:-1]) + ';' + row['ncbi_canonicalName']
    else:
        new_string = ';'.join(parts[1:-1])
    
    return new_string.lower()

def remove_extra_separators(s):
    """
    Remove extra semicolons from a string.

    Args:
    s (str): The string to process.

    Returns:
    str: The string with extra semicolons removed.
    """
    return re.sub(r';+', ';', s)





In [4]:
import os
import re
import time
import zipfile
import threading
import pandas as pd
import urllib.request
from tqdm import tqdm


def download_ncbi_taxonomy(output_folder=None):

    """
    Download, extract, and process NCBI taxonomy data.

    This function sets up the necessary directories, downloads the taxonomy dump from NCBI,
    extracts the contents, and processes the data to create structured pandas DataFrames 
    with lineage information.

    Args:
    output_folder (str, optional): The directory to save NCBI output. Defaults to the current working directory.

    Returns:
    tuple: A tuple containing two pandas DataFrames:
        - ncbi_subset: DataFrame with unique target strings.
        - ncbi_full: Full DataFrame with all taxonomy information.
    """
    
    # Set the default output folder to the current working directory if none is provided
    if output_folder is None:
        output_folder = os.getcwd()

    # Create a new folder for storing the NCBI taxonomic data
    NCBI_output_folder = os.path.join(output_folder, "NCBI_output")
    os.makedirs(NCBI_output_folder, exist_ok=True)

    # Define the URL for the NCBI taxonomic database dump
    url = "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip"
    tf = os.path.join(NCBI_output_folder, "taxdmp.zip")

    # Check if the taxonomy data file already exists to avoid re-downloading
    if os.path.exists(tf):
        print("NCBI taxonomy data already downloaded.")
    
    else:
        # Download the zip file with a progress bar
        with tqdm(unit='B', unit_scale=True, unit_divisor=1024, miniters=1, desc="Downloading NCBI Taxonomic Data") as pbar:
            def report_hook(blocknum, blocksize, totalsize):
                pbar.update(blocknum * blocksize - pbar.n)
            urllib.request.urlretrieve(url, tf, reporthook=report_hook)
    
        # Decompress the downloaded zip file
        with zipfile.ZipFile(tf, "r") as zip_ref:
            zip_ref.extractall(NCBI_output_folder)
    
        # Check if the file was downloaded successfully
        if os.path.exists(tf):
            print("NCBI taxonomy has been downloaded successfully.")

    def animate_dots():
    """
    Animate a sequence of dots on the console to indicate processing activity.
        
    Args:
    None. The function assumes access to a globally defined threading.Event() named 'done_event'.

    Returns:
    None. This function returns nothing and is intended solely for side effects (console output).
    """
    dots = ["   ", ".  ", ".. ", "..."]  # List of dot states for animation.
    idx = 0  # Initialize index to cycle through dot states.
    print("Processing samples", end="")  # Initial print statement for processing message.
    while not done_event.is_set():  # Loop until the event is set signaling processing is complete.
        print(f"\rProcessing samples{dots[idx % len(dots)]}", end="", flush=True)  # Overwrite the previous line with new dot state.
        time.sleep(0.5)  # Pause for half a second before updating dot state.
        idx += 1  # Increment index to cycle to the next dot state.


    # Create and start the dot animation thread
    done_event = threading.Event()
    thread = threading.Thread(target=animate_dots)
    thread.start()

    try:
        # Read the nodes dump file
        nodes_path = os.path.join(NCBI_output_folder, 'nodes.dmp')
        names_path = os.path.join(NCBI_output_folder, 'names.dmp')

        # Read the nodes file using a specific field separator and specifying no header
        nodes_df = pd.read_csv(
            nodes_path, 
            sep=r'\t\|\t',
            header=None,
            usecols=range(13),
            names=[
                'ncbi_id', 'parent_tax_id', 'rank', 'embl_code', 'division_id', 
                'inherited_div_flag', 'genetic_code_id', 'inherited_GC_flag', 
                'mitochondrial_genetic_code_id', 'inherited_MGC_flag', 
                'GenBank_hidden_flag', 'hidden_subtree_root_flag', 'comments'
            ],
            dtype=str,
            engine='python'
        ).replace(r'\t\|$', '', regex=True)

        # Read the names dump file
        names_df = pd.read_csv(
            names_path, 
            sep=r'\t\|\t',
            header=None,
            usecols=range(4),
            names=['ncbi_id', 'name_txt', 'unique_name', 'name_class'],
            dtype=str,
            engine='python'
        ).replace(r'\t\|$', '', regex=True)

        # Filter to include only scientific names
        scientific_names_df = names_df[names_df['name_class'] == 'scientific name']
        
        # Map NCBI IDs to parent tax IDs and ranks
        name_map = pd.Series(scientific_names_df['name_txt'].values, index=scientific_names_df['ncbi_id']).to_dict()
        
        # Add columns for canonical name and rank using the maps
        parent_map = pd.Series(nodes_df['parent_tax_id'].values, index=nodes_df['ncbi_id']).to_dict()
        rank_map = pd.Series(nodes_df['rank'].values, index=nodes_df['ncbi_id']).to_dict()
        
        # Aggiungere le colonne per il nome scientifico e il rango diretti
        nodes_df['ncbi_canonicalName'] = nodes_df['ncbi_id'].apply(lambda x: name_map.get(x, ''))
        nodes_df['ncbi_rank'] = nodes_df['ncbi_id'].apply(lambda x: rank_map.get(x, ''))
        
        # Function to compute the entire lineage, excluding the root
        def get_lineage(ncbi_id, map_dict):
            lineage = []
            while ncbi_id in map_dict and ncbi_id != '1': 
                lineage.append(ncbi_id)
                ncbi_id = map_dict[ncbi_id]
            return lineage[::-1]
        
        # Build lineages
        nodes_df['ncbi_lineage_ids'] = nodes_df['ncbi_id'].apply(lambda x: get_lineage(x, parent_map))
        nodes_df['ncbi_lineage_names'] = nodes_df['ncbi_lineage_ids'].apply(lambda ids: [name_map.get(id, '') for id in ids])
        nodes_df['ncbi_lineage_ranks'] = nodes_df['ncbi_lineage_ids'].apply(lambda ids: [rank_map.get(id, '') for id in ids])
        
        # Convert list of lineage information to semicolon-separated strings
        nodes_df['ncbi_lineage_names'] = nodes_df['ncbi_lineage_names'].apply(lambda names: ';'.join(names))
        nodes_df['ncbi_lineage_ids'] = nodes_df['ncbi_lineage_ids'].apply(lambda ids: ';'.join(ids))
        nodes_df['ncbi_lineage_ranks'] = nodes_df['ncbi_lineage_ranks'].apply(lambda ranks: ';'.join(ranks))
        
        # Define target ranks and build target strings based on ranks
        target_ranks = {'superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'}
        
        def build_target_string(names, ranks):
            names = names.split(';')
            ranks = ranks.split(';')
            return ';'.join(name for name, rank in zip(names, ranks) if rank in target_ranks).lower()
        
        nodes_df['ncbi_target_string'] = nodes_df.apply(
            lambda row: build_target_string(row['ncbi_lineage_names'], row['ncbi_lineage_ranks']), axis=1)
        
        # Prepare final subsets for return
        ncbi_full = nodes_df[['ncbi_id', 'ncbi_lineage_names', 'ncbi_lineage_ids', 'ncbi_canonicalName', 'ncbi_rank', 'ncbi_lineage_ranks', 'ncbi_target_string']]
        ncbi_subset = ncbi_full.copy()
        ncbi_subset['ncbi_target_string'] = ncbi_subset.apply(prepare_ncbi_strings, axis=1)
        ncbi_subset["ncbi_target_string"] = ncbi_subset["ncbi_target_string"].apply(remove_extra_separators).str.strip(';')
        ncbi_subset = ncbi_subset.drop_duplicates(subset="ncbi_target_string")
    
    finally:
        done_event.set()
        thread.join()
        print("\rProcessing samples...")
        print("Done.")

    return ncbi_subset, ncbi_full

In [None]:
ncbi_dataset = download_ncbi_taxonomy()

Exception in thread Thread-5 (animate_dots):
Traceback (most recent call last):
  File "/Users/mleone1/miniconda3/envs/test/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/Users/mleone1/miniconda3/envs/test/lib/python3.12/site-packages/ipykernel/ipkernel.py", line 761, in run_closure
    _threading_Thread_run(self)
  File "/Users/mleone1/miniconda3/envs/test/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/var/folders/hh/_8qm31cx4yd3t43862l203c80000gp/T/ipykernel_25081/3478035857.py", line 45, in animate_dots
NameError: name 'done_event' is not defined


NCBI taxonomy data already downloaded.
Processing samples