# COI Data Preparation: Refinement and Development

**Objective:** Refine the data preparation pipeline for the COI gene, the primary barcode for animal identification, using the BOLD database.

**Methodology:**
1.  Work from a small, manageable sample of the full BOLD database.
2.  Develop a new taxonomy parser for BOLD FASTA headers, which are pipe-separated (`|`) with a comma-separated taxonomy string nested in one of the fields.
3.  Apply the full data cleaning and feature engineering workflow.
4.  Save the final, uniquely named artifacts for the COI pipeline.

In [1]:
import pandas as pd
import numpy as np
from Bio import SeqIO
from tqdm.auto import tqdm
from pathlib import Path
import sys

# Set up project path
project_root = Path.cwd().parent

# --- Verification Step ---
print(f"Project Root: {project_root}")

Project Root: C:\Users\jampa\Music\atlas




In [2]:
# --- Define Core Directories ---
RAW_DATA_DIR = project_root / "data" / "raw"
PROCESSED_DATA_DIR = project_root / "data" / "processed"
PROCESSED_DATA_DIR.mkdir(parents=True, exist_ok=True)

# --- Define COI Specific File Paths ---

# --- FIX: Use the exact filename of the downloaded BOLD database ---
# Path to the full, original BOLD database file
FULL_BOLD_PATH = RAW_DATA_DIR / "BOLD_Public.29-Aug-2025.fasta"

# Path to the small sample file we will create for development
SAMPLE_BOLD_PATH = RAW_DATA_DIR / "BOLD_sample_10k.fasta"

# --- Verification Step ---
if FULL_BOLD_PATH.exists():
    print("Source BOLD database found.")
    print(f"  - Location: {FULL_BOLD_PATH}")
else:
    print("ERROR: The source BOLD database was not found at the expected location.")
    print(f"  - Expected: {FULL_BOLD_PATH}")
    print("Please ensure the file is downloaded and correctly named in the 'data/raw' directory.")

Source BOLD database found.
  - Location: C:\Users\jampa\Music\atlas\data\raw\BOLD_Public.29-Aug-2025.fasta


In [3]:
# =============================================================================
# STEP 1: CREATE A DEVELOPMENT SAMPLE FROM BOLD DATA
# =============================================================================
#
# OBJECTIVE:
#   To create a smaller development sample from the first 10,000 sequences
#   of the full BOLD database (the file's leading records, not a random draw).
#
# RATIONALE:
#   Working with a smaller sample allows for rapid, interactive development
#   and debugging of the subsequent parsing and cleaning steps without the
#   long processing times required for the full dataset. This script is
#   designed to be run only once; if the sample file is found, this
#   step will be skipped.
#
# =============================================================================

# --- Configuration ---
SAMPLE_SIZE = 10000

# --- Main Logic ---
# This check prevents us from re-running this process unnecessarily.
if not SAMPLE_BOLD_PATH.exists():
    print(f"Creating a sample of {SAMPLE_SIZE} sequences from the BOLD database...")
    
    sample_records = []
    # Open the full BOLD database file for reading
    with open(FULL_BOLD_PATH, "r") as handle_in:
        # SeqIO.parse already returns a lazy iterator over the FASTA records
        records_iterator = SeqIO.parse(handle_in, "fasta")
        
        # Use tqdm to show a progress bar while iterating
        for i, record in tqdm(enumerate(records_iterator), total=SAMPLE_SIZE, desc="Sampling BOLD DB"):
            # Stop once we have collected the desired number of samples
            if i >= SAMPLE_SIZE:
                break
            sample_records.append(record)
            
    # Write the collected records to our new sample file
    with open(SAMPLE_BOLD_PATH, "w") as handle_out:
        SeqIO.write(sample_records, handle_out, "fasta")
        
    print(f"\nSuccessfully created sample file with {len(sample_records)} sequences.")
    print(f"   Location: {SAMPLE_BOLD_PATH}")
else:
    print("BOLD sample file already exists. No action needed.")
    print(f"   Location: {SAMPLE_BOLD_PATH}")

BOLD sample file already exists. No action needed.
   Location: C:\Users\jampa\Music\atlas\data\raw\BOLD_sample_10k.fasta
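
The cell above keeps the first `SAMPLE_SIZE` records of the file, so the sample reflects whatever happens to come first in the BOLD export. If a more representative development sample is wanted, one option is reservoir sampling, which draws a uniform random sample in a single pass without knowing the total record count. The sketch below is a minimal illustration, not part of the original pipeline; the function name `reservoir_sample` and its `seed` parameter are hypothetical.

```python
import random
from itertools import islice

def reservoir_sample(records, k, seed=42):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    iterator = iter(records)
    # Fill the reservoir with the first k items
    reservoir = list(islice(iterator, k))
    # Item number n (counting from 1) enters the reservoir with probability k / n
    for n, record in enumerate(iterator, start=k + 1):
        j = rng.randrange(n)
        if j < k:
            reservoir[j] = record
    return reservoir

# Toy usage on integers; in the notebook the iterable would be
# SeqIO.parse(handle_in, "fasta") and k would be SAMPLE_SIZE.
sample = reservoir_sample(range(100_000), k=5)
print(sample)
```

The trade-off is that this reads the entire source file once, whereas taking the leading records stops after `SAMPLE_SIZE` entries.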


In [4]:
# =============================================================================
# STEP 2 (REVISED): PARSE TAXONOMY FROM BOLD SAMPLE DATA
# =============================================================================
#
# OBJECTIVE:
#   To use a new, more robust parser to correctly handle the complex and
#   inconsistent BOLD database FASTA header format.
#
# RATIONALE:
#   Initial analysis revealed the taxonomy is often a comma-separated string
#   within a larger pipe-separated header. This new v2 parser is designed
#   to specifically find and process this nested format, ensuring accurate
#   extraction of taxonomic ranks.
#
# =============================================================================

# --- A list to hold our structured data ---
parsed_data = []

# --- Define the new, more robust BOLD-specific parsing function ---
def parse_bold_taxonomy_v2(description):
    """
    Parses the complex, pipe-and-comma-separated BOLD FASTA header.
    """
    # Initialize a dictionary with all ranks set to None
    parsed_ranks = {
        'kingdom': None, 'phylum': None, 'class': None, 'order': None,
        'family': None, 'genus': None, 'species': None
    }
    
    try:
        # First, split the entire description by the pipe character
        parts = description.split('|')
        
        # Now, find the part that actually contains the taxonomy.
        # We'll assume it's the part with commas and "Animalia".
        taxonomy_str = ""
        for part in parts:
            if ',' in part and 'Animalia' in part:
                taxonomy_str = part
                break
        
        # If we found a valid taxonomy string, process it
        if taxonomy_str:
            ranks = taxonomy_str.split(',')
            
            # Assign ranks based on their position in the comma-separated list
            if len(ranks) > 0: parsed_ranks['kingdom'] = ranks[0]
            if len(ranks) > 1: parsed_ranks['phylum'] = ranks[1]
            if len(ranks) > 2: parsed_ranks['class'] = ranks[2]
            if len(ranks) > 3: parsed_ranks['order'] = ranks[3]
            if len(ranks) > 4: parsed_ranks['family'] = ranks[4]
            if len(ranks) > 5: parsed_ranks['genus'] = ranks[5]
            if len(ranks) > 6: parsed_ranks['species'] = ranks[6]

    except Exception:
        # Pass silently if any unexpected error occurs
        pass
        
    return parsed_ranks

# --- Apply the new parser to our sample data ---
print("Applying revised BOLD taxonomy parser (v2) to sample data...")

# Loop through the records in our sample file
with open(SAMPLE_BOLD_PATH, "r") as handle:
    for record in tqdm(SeqIO.parse(handle, "fasta"), total=SAMPLE_SIZE, desc="Parsing BOLD headers"):
        # The entire description line contains the taxonomy
        taxonomy_dict = parse_bold_taxonomy_v2(record.description)
        
        # Store the essential sequence information
        taxonomy_dict['id'] = record.id
        taxonomy_dict['sequence'] = str(record.seq)
        
        parsed_data.append(taxonomy_dict)

# --- Create and Verify the DataFrame ---
# Convert the list of dictionaries into a pandas DataFrame
df = pd.DataFrame(parsed_data)

print(f"\nParsing complete. Created a DataFrame with {len(df)} rows.")

# Display the first 5 rows for a preliminary check
print("\nPreview of the structured data (first 5 rows):")
display(df.head())

# Display 5 random rows to check for consistency across the dataset
print("\nPreview of the structured data (5 random rows):")
display(df.sample(5))

Applying revised BOLD taxonomy parser (v2) to sample data...


Parsing BOLD headers: 100%|██████████████████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 65802.29it/s]


Parsing complete. Created a DataFrame with 10000 rows.

Preview of the structured data (first 5 rows):





|   | kingdom | phylum | class | order | family | genus | species | id | sequence |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Animalia | Arthropoda | Insecta | Lepidoptera | Geometridae | Oenochrominae | Arhodia | AANIC003-10\|COI-5P\|Australia\|Animalia,Arthropo... | AACATTATATTTTATTTTTGGTATTTGAGCTGGTATAATTGGAACT... |
| 1 | Animalia | Arthropoda | Insecta | Lepidoptera | Geometridae | Oenochrominae | Arhodia | AANIC101-10\|COI-5P\|Australia\|Animalia,Arthropo... | AACATTATATTTTATTTTTGGTATTTGAGCTGGTATAATTGGAACT... |
| 2 | Animalia | Arthropoda | Insecta | Lepidoptera | Geometridae | Oenochrominae | Arhodia | AANIC102-10\|COI-5P\|Australia\|Animalia,Arthropo... | AACATTATATTTTATTTTTGGTATTTGAGCTGGTATAATTGGAACT... |
| 3 | Animalia | Arthropoda | Insecta | Lepidoptera | Geometridae | Oenochrominae | Arhodia | AANIC104-10\|COI-5P\|Australia\|Animalia,Arthropo... | AACATTATATTTTATTTTTGGTATTTGAGCAGGTATAATTGGAACT... |
| 4 | Animalia | Arthropoda | Insecta | Lepidoptera | Geometridae | Oenochrominae | Arhodia | AANIC105-10\|COI-5P\|Australia\|Animalia,Arthropo... | AACATTATATTTTATTTTTGGTATTTGAGCAGGTATAATTGGAACT... |



Preview of the structured data (5 random rows):


|   | kingdom | phylum | class | order | family | genus | species | id | sequence |
|---|---|---|---|---|---|---|---|---|---|
| 4819 | Animalia | Arthropoda | Insecta | Hemiptera | Aphididae | Aphidinae | Brachycaudus | AACTA2592-20\|COI-5P\|Australia\|Animalia,Arthrop... | TTTATATTTTTTATTTGGTATTTGATCAGGTATAATTGGATCATCA... |
| 5172 | Animalia | Arthropoda | Insecta | Lepidoptera | Gelechiidae | Gelechiinae | Scrobipalpa | AACTA3277-20\|COI-5P\|Australia\|Animalia,Arthrop... | TTATATTTTATTTTTGGTATTTGAGCAGGAATAGTTGGTACATCTT... |
| 23 | Animalia | Arthropoda | Insecta | Lepidoptera | Geometridae | Oenochrominae | Arhodia | AANIC135-10\|COI-5P\|Australia\|Animalia,Arthropo... | AACATTATATTTTATCTTTGGTATTTGAGCTGGTATAATTGGAACT... |
| 3034 | Animalia | Arthropoda | Insecta | Hymenoptera | Eulophidae | Tetrastichinae | Aprostocetus | AACTA1176-20\|COI-5P\|Australia\|Animalia,Arthrop... | AAATTTATAATTCAATTGTCACTACTCATGCATTTATTATAATTTT... |
| 1128 | Animalia | Arthropoda | Insecta | Hymenoptera | Braconidae | Microgastrinae | Pholetesor | AAHYM134-16\|COI-5P\|Canada\|Animalia,Arthropoda,... | TATAATATATTTTATATTTGGATTGTGAGCAGGAATAGTTGGATTT... |
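
In both previews the `genus` column holds names ending in "-inae" (Oenochrominae, Aphidinae, Microgastrinae), which are subfamily names, while the `species` column holds genus names (Arhodia, Brachycaudus, Pholetesor). This suggests the taxonomy string carries a subfamily field between family and genus, so a seven-slot positional mapping shifts the last two ranks. The sketch below shows one way to account for that. It assumes an eight-field order (kingdom through species, with subfamily in sixth position) and assumes the first three pipe-separated fields are a process ID, marker, and country, as the previewed `id` values suggest. The function name `parse_bold_header` and the keys `process_id`, `marker`, `country`, and `subfamily` are hypothetical; the layout should be confirmed against the full headers before relying on it.

```python
RANKS = ("kingdom", "phylum", "class", "order",
         "family", "subfamily", "genus", "species")

def parse_bold_header(description):
    """Split a BOLD header into ID fields and taxonomic ranks (assumed layout)."""
    parsed = {"process_id": None, "marker": None, "country": None}
    parsed.update({rank: None for rank in RANKS})

    parts = description.split("|")
    # Assumption: the first three pipe-separated fields are ID, marker, country
    for key, value in zip(("process_id", "marker", "country"), parts):
        parsed[key] = value.strip() or None

    # As in parse_bold_taxonomy_v2, the taxonomy is the field with commas and "Animalia"
    taxonomy = next((p for p in parts if "," in p and "Animalia" in p), "")
    # zip stops at the shorter input, so missing trailing ranks stay None
    for rank, name in zip(RANKS, taxonomy.split(",")):
        parsed[rank] = name.strip() or None
    return parsed

# Header rebuilt from the first preview row; its species field is not visible
# in the truncated output, so it is left out here.
header = ("AANIC003-10|COI-5P|Australia|"
          "Animalia,Arthropoda,Insecta,Lepidoptera,Geometridae,Oenochrominae,Arhodia")
result = parse_bold_header(header)
print(result["family"], result["subfamily"], result["genus"], result["species"])
```

With this mapping, the row's genus becomes Arhodia rather than Oenochrominae. Unlike `parse_bold_taxonomy_v2`, the sketch also strips whitespace and stores empty fields as `None`, so `dropna` would catch them.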


In [5]:
# =============================================================================
# STEPS 3-7: DATA CLEANING, FEATURE ENGINEERING, AND SAVING
# =============================================================================
#
# OBJECTIVE:
#   To apply the full, standardized data preparation workflow to the parsed
#   BOLD (COI) data and save the final, model-ready artifacts to disk.
#
# WORKFLOW:
#   1. Clean Data: Remove rows with missing target labels ('genus') and
#      filter out rare genera with fewer than 3 members to ensure data quality.
#   2. Engineer Features: Calculate k-mer counts for each sequence. A k-mer
#      size of 8 is chosen, as a larger k-mer can be more effective for
#      distinguishing protein-coding genes like COI.
#   3. Vectorize: Convert the k-mer counts and text labels into numerical
#      matrices suitable for machine learning.
#   4. Split Data: Partition the dataset into training (80%) and testing (20%) sets.
#   5. Save Artifacts: Save all processed data and encoders to disk with
#      unique 'coi' filenames.
#
# =============================================================================

# --- Imports for this combined cell ---
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from scipy.sparse import save_npz
import pickle

# --- Configuration for this phase ---
TARGET_RANK = 'genus'
MIN_CLASS_MEMBERS = 3
KMER_SIZE = 8 # Using a larger k-mer for the protein-coding COI gene
TEST_SPLIT_SIZE = 0.2
RANDOM_STATE = 42
MODELS_DIR = project_root / "models"
MODELS_DIR.mkdir(parents=True, exist_ok=True)


# --- Step 3: Clean and Filter the DataFrame ---
print("--- Step 3: Cleaning and filtering data ---")
initial_rows = len(df)
df_cleaned = df.dropna(subset=[TARGET_RANK]).copy()
print(f"Removed {initial_rows - len(df_cleaned)} rows with missing '{TARGET_RANK}' labels.")

class_counts = df_cleaned[TARGET_RANK].value_counts()
classes_to_keep = class_counts[class_counts >= MIN_CLASS_MEMBERS].index
df_filtered = df_cleaned[df_cleaned[TARGET_RANK].isin(classes_to_keep)].copy()
print(f"Removed {len(df_cleaned) - len(df_filtered)} rows for rare genera (less than {MIN_CLASS_MEMBERS} members).")
print(f"Final DataFrame has {len(df_filtered)} sequences.")
print("-" * 45)


# --- Step 4: Feature Engineering (K-mer Counting) ---
print(f"--- Step 4: Engineering {KMER_SIZE}-mer features ---")
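def get_kmer_counts(sequence, k):
    """Count overlapping k-mers in a sequence, skipping any k-mer that contains 'N'."""
    # Normalize case once so 'acgt' and 'ACGT' map to the same feature
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i+k]
        if "N" not in kmer:
            counts[kmer] += 1
    return dict(counts)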

df_filtered['kmer_counts'] = list(tqdm((get_kmer_counts(seq, KMER_SIZE) for seq in df_filtered['sequence']), total=len(df_filtered), desc="Calculating k-mers"))
print("-" * 45)


# --- Step 5: Vectorize Features and Labels ---
print("--- Step 5: Vectorizing data ---")
vectorizer = DictVectorizer(sparse=True)
X = vectorizer.fit_transform(df_filtered['kmer_counts'])
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df_filtered[TARGET_RANK])
print(f"Feature matrix shape: {X.shape}")
print(f"Label vector shape:   {y.shape}")
print("-" * 45)


# --- Step 6: Split Data ---
print("--- Step 6: Splitting data into training/testing sets ---")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SPLIT_SIZE, random_state=RANDOM_STATE, stratify=y)
print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape:  {X_test.shape}")
print("-" * 45)


# --- Step 7: Save All Processed Artifacts ---
print("--- Step 7: Saving all COI artifacts to disk ---")
save_npz(PROCESSED_DATA_DIR / "X_train_coi.npz", X_train)
save_npz(PROCESSED_DATA_DIR / "X_test_coi.npz", X_test)
np.save(PROCESSED_DATA_DIR / "y_train_coi.npy", y_train)
np.save(PROCESSED_DATA_DIR / "y_test_coi.npy", y_test)
with open(MODELS_DIR / "coi_genus_vectorizer.pkl", 'wb') as f:
    pickle.dump(vectorizer, f)
with open(MODELS_DIR / "coi_genus_label_encoder.pkl", 'wb') as f:
    pickle.dump(label_encoder, f)
print("All artifacts saved successfully.")
print("\n--- COI DATA PREPARATION COMPLETE ---")

--- Step 3: Cleaning and filtering data ---
Removed 0 rows with missing 'genus' labels.
Removed 104 rows for rare genera (less than 3 members).
Final DataFrame has 9896 sequences.
---------------------------------------------
--- Step 4: Engineering 8-mer features ---


Calculating k-mers: 100%|███████████████████████████████████████████████████████████████████████████| 9896/9896 [00:03<00:00, 2667.22it/s]


---------------------------------------------
--- Step 5: Vectorizing data ---
Feature matrix shape: (9896, 41040)
Label vector shape:   (9896,)
---------------------------------------------
--- Step 6: Splitting data into training/testing sets ---
Training set shape: (7916, 41040)
Testing set shape:  (1980, 41040)
---------------------------------------------
--- Step 7: Saving all COI artifacts to disk ---
All artifacts saved successfully.

--- COI DATA PREPARATION COMPLETE ---
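
To make the Step 4 feature engineering concrete, the sketch below counts k-mers on a toy sequence with `k = 4`, uppercasing the sequence first so that letter case cannot split one k-mer into two features. The function name `kmer_counts` and the toy sequence are illustrative only. The last line prints the size of the full 8-mer space over A, C, G, T; the 41,040 feature columns reported above are the distinct 8-mers actually observed in the 9,896 retained sequences.

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count overlapping k-mers, skipping any window that contains 'N'."""
    sequence = sequence.upper()
    counts = Counter()
    # Slide a window of width k across the sequence, one base at a time
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if "N" not in kmer:
            counts[kmer] += 1
    return dict(counts)

toy = "ACGTACGTNACGT"
print(kmer_counts(toy, 4))
# {'ACGT': 3, 'CGTA': 1, 'GTAC': 1, 'TACG': 1}

# Number of possible 8-mers over the four unambiguous bases
print(4 ** 8)
# 65536
```

The four windows that overlap the `N` are dropped, which is why only six of the ten windows in the toy sequence are counted.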
