# COI Data Preparation: Refinement and Development

**Objective:** Refine the data preparation pipeline for the cytochrome c oxidase subunit I (COI) gene, the standard DNA barcode for animal identification, using sequences from the Barcode of Life Data System (BOLD).

**Methodology:**
1.  Work from a small development sample (the first 10,000 records) of the full BOLD database.
2.  Develop a new taxonomy parser specifically designed for the pipe-separated (`|`) format of BOLD FASTA headers.
3.  Apply the full data cleaning and feature engineering workflow.
4.  Save the final, uniquely named artifacts for the COI pipeline.
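
As a rough illustration of step 2, here is a minimal sketch of a pipe-separated header parser. The function name `parse_pipe_header`, the field names in `ASSUMED_FIELDS`, and the example header are all hypothetical. The real field order must be checked against the headers in the downloaded BOLD file before relying on any of this.

```python
from typing import Dict, Sequence

# ASSUMPTION: hypothetical field order, not confirmed against real BOLD headers.
ASSUMED_FIELDS = ("record_id", "marker", "country", "taxonomy")


def parse_pipe_header(header: str, fields: Sequence[str] = ASSUMED_FIELDS) -> Dict[str, str]:
    """Split a pipe-separated FASTA header into named fields."""
    # Drop the leading '>' (if present) and split on the pipe character.
    parts = [part.strip() for part in header.lstrip(">").split("|")]

    # Pad short headers so every expected field is present (as an empty string).
    parts += [""] * (len(fields) - len(parts))

    parsed = dict(zip(fields, parts))

    # Keep unexpected trailing fields instead of silently dropping them.
    if len(parts) > len(fields):
        parsed["extra"] = "|".join(parts[len(fields):])
    return parsed


# Made-up header, for illustration only.
example = parse_pipe_header(">ABC123-01|COI-5P|Canada|Animalia,Arthropoda")
print(example["taxonomy"])
```

Returning a plain dictionary keeps the parser easy to feed into `pd.DataFrame` later, and the `extra` key makes headers with more fields than expected visible instead of hiding them.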

In [1]:
import pandas as pd
import numpy as np
from Bio import SeqIO
from tqdm.auto import tqdm
from pathlib import Path
import sys

# Set up project path
project_root = Path.cwd().parent

# --- Verification Step ---
print(f"Project Root: {project_root}")

Project Root: C:\Users\jampa\Music\atlas-v3
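
`Path.cwd().parent` only gives the right root when the notebook is launched from a folder directly below it. A minimal sketch of a more defensive lookup follows. The helper name `find_project_root` and the choice of a `data` directory as the marker are assumptions made for illustration, not part of the original pipeline.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    # Check the starting directory first, then each parent in turn.
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory containing '{marker}' found above {start}")


# Possible usage in the notebook (hypothetical):
# project_root = find_project_root(Path.cwd())
```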


In [3]:
# --- Define Core Directories ---
RAW_DATA_DIR = project_root / "data" / "raw"
PROCESSED_DATA_DIR = project_root / "data" / "processed"
PROCESSED_DATA_DIR.mkdir(parents=True, exist_ok=True)

# --- Define COI Specific File Paths ---

# Path to the full, original BOLD database file.
# The filename must match the downloaded file exactly.
FULL_BOLD_PATH = RAW_DATA_DIR / "BOLD_Public.29-Aug-2025.fasta"

# Path to the small sample file we will create for development
SAMPLE_BOLD_PATH = RAW_DATA_DIR / "BOLD_sample_10k.fasta"

# --- Verification Step ---
if FULL_BOLD_PATH.exists():
    print("Source BOLD database found.")
    print(f"  - Location: {FULL_BOLD_PATH}")
else:
    print("ERROR: The source BOLD database was not found at the expected location.")
    print(f"  - Expected: {FULL_BOLD_PATH}")
    print("Please ensure the file is downloaded and correctly named in the 'data/raw' directory.")

Source BOLD database found.
  - Location: C:\Users\jampa\Music\atlas-v3\data\raw\BOLD_Public.29-Aug-2025.fasta


In [4]:
# =============================================================================
# STEP 1: CREATE A DEVELOPMENT SAMPLE FROM BOLD DATA
# =============================================================================
#
# OBJECTIVE:
#   Create a 10,000-sequence development sample from the full BOLD database
#   by taking the first 10,000 records in file order (not a random draw).
#
# RATIONALE:
#   A small sample allows rapid, interactive development and debugging of
#   the subsequent parsing and cleaning steps without the long processing
#   times of the full dataset. This cell only needs to run once: if the
#   sample file already exists, the step is skipped.
#
# =============================================================================

from itertools import islice

# --- Configuration ---
SAMPLE_SIZE = 10000

# --- Main Logic ---
# Skip the work if the sample file already exists.
if not SAMPLE_BOLD_PATH.exists():
    print(f"Creating a sample of {SAMPLE_SIZE} sequences from the BOLD database...")

    # Open the full BOLD database file for reading
    with open(FULL_BOLD_PATH, "r") as handle_in:
        # SeqIO.parse is already a lazy iterator; islice stops it after
        # SAMPLE_SIZE records, so only the head of the file is read.
        first_records = islice(SeqIO.parse(handle_in, "fasta"), SAMPLE_SIZE)

        # Use tqdm to show a progress bar while collecting the records
        sample_records = list(tqdm(first_records, total=SAMPLE_SIZE, desc="Sampling BOLD DB"))

    # Write the collected records to our new sample file
    with open(SAMPLE_BOLD_PATH, "w") as handle_out:
        SeqIO.write(sample_records, handle_out, "fasta")
        
    print(f"\nSuccessfully created sample file with {len(sample_records)} sequences.")
    print(f"   Location: {SAMPLE_BOLD_PATH}")
else:
    print("BOLD sample file already exists. No action needed.")
    print(f"   Location: {SAMPLE_BOLD_PATH}")

Creating a sample of 10000 sequences from the BOLD database...



Successfully created sample file with 10000 sequences.
   Location: C:\Users\jampa\Music\atlas-v3\data\raw\BOLD_sample_10k.fasta
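
The sampling cell above keeps the first `SAMPLE_SIZE` records in file order, so the sample reflects whatever happens to sit at the top of the file. If a more representative sample is wanted, one option is reservoir sampling (Algorithm R), which draws a uniform random sample in a single pass without loading the whole file into memory. The sketch below is a generic illustration, not the method used in this notebook. The function name `reservoir_sample` and the `seed` parameter are assumptions.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items drawn uniformly at random from a one-pass iterable."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Possible usage with the notebook's variables (reads the entire file once):
# with open(FULL_BOLD_PATH, "r") as handle_in:
#     sample_records = reservoir_sample(SeqIO.parse(handle_in, "fasta"), SAMPLE_SIZE, seed=42)
```

The trade-off is runtime: unlike taking the head of the file, this has to iterate over every record in the full database, which may be slow for a file of this size.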
