# COI Data Preparation: Refinement and Development

**Objective:** Refine the data preparation pipeline for the COI (cytochrome c oxidase subunit I) gene, the standard DNA barcode marker for animal identification, using sequences from the BOLD database.

**Methodology:**
1.  Work from a small, manageable sample of the full BOLD database.
2.  Develop a new taxonomy parser specifically designed for the pipe-separated (`|`) format of BOLD FASTA headers.
3.  Apply the full data cleaning and feature engineering workflow.
4.  Save the final, uniquely named artifacts for the COI pipeline.
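
As a rough illustration of step 2, a pipe-separated header can be split into named fields as sketched below. The field names and their order (`record_id`, `marker`, `country`, `taxonomy`) are assumptions made for this example only, as is the sample header; the real layout should be checked against the first few lines of the downloaded file before relying on it.

```python
from typing import Dict, Sequence

# ASSUMPTION: this field order is hypothetical, not confirmed from the BOLD file.
ASSUMED_FIELDS = ("record_id", "marker", "country", "taxonomy")


def parse_pipe_header(header: str, fields: Sequence[str] = ASSUMED_FIELDS) -> Dict[str, str]:
    """Split a pipe-separated FASTA header into a dict keyed by `fields`."""
    # Drop the leading '>' if present, then split on the pipe separator.
    parts = [p.strip() for p in header.lstrip(">").split("|")]
    # Pad short headers with empty strings so every field name gets a value.
    parts += [""] * (len(fields) - len(parts))
    # zip() ignores any extra trailing parts beyond the named fields.
    return dict(zip(fields, parts))


# Made-up header, for illustration only.
example = parse_pipe_header(">ABC123-20|COI-5P|Canada|Animalia,Chordata")
print(example["taxonomy"].split(","))
```

Passing the field names as a parameter keeps the parser usable if the actual header layout turns out to differ from the assumed one.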

In [1]:
import pandas as pd
import numpy as np
from Bio import SeqIO
from tqdm.auto import tqdm
from pathlib import Path
import sys

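# Set up project path.
# Assumes the notebook is run from a first-level subdirectory of the project,
# so the parent of the current working directory is the project root.
project_root = Path.cwd().parent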

# --- Verification Step ---
print(f"Project Root: {project_root}")

Project Root: C:\Users\jampa\Music\atlas-v3


In [3]:
# --- Define Core Directories ---
RAW_DATA_DIR = project_root / "data" / "raw"
PROCESSED_DATA_DIR = project_root / "data" / "processed"
PROCESSED_DATA_DIR.mkdir(parents=True, exist_ok=True)

# --- Define COI Specific File Paths ---

# Path to the full, original BOLD database file.
# The filename must match the downloaded file exactly.
FULL_BOLD_PATH = RAW_DATA_DIR / "BOLD_Public.29-Aug-2025.fasta"

# Path to the small sample file we will create for development
SAMPLE_BOLD_PATH = RAW_DATA_DIR / "BOLD_sample_10k.fasta"

# --- Verification Step ---
if FULL_BOLD_PATH.exists():
    print("Source BOLD database found.")
    print(f"  - Location: {FULL_BOLD_PATH}")
else:
    print("ERROR: The source BOLD database was not found at the expected location.")
    print(f"  - Expected: {FULL_BOLD_PATH}")
    print("Please ensure the file is downloaded and correctly named in the 'data/raw' directory.")

Source BOLD database found.
  - Location: C:\Users\jampa\Music\atlas-v3\data\raw\BOLD_Public.29-Aug-2025.fasta
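
With the source file located, the development sample at `SAMPLE_BOLD_PATH` still has to be created. The following is a minimal sketch of one way to do it, assuming the sample is simply the first 10,000 records of the full file (suggested by the `10k` in the filename, but not confirmed here). The helper names `take_first_records` and `write_fasta_sample` are hypothetical, and the UTF-8 encoding with replacement of undecodable bytes is also an assumption.

```python
from pathlib import Path
from typing import Iterable, Iterator


def take_first_records(lines: Iterable[str], n_records: int) -> Iterator[str]:
    """Yield the lines that belong to the first `n_records` FASTA records."""
    seen = 0
    for line in lines:
        if line.startswith(">"):
            seen += 1
            if seen > n_records:
                return  # stop reading as soon as the quota is exceeded
        if seen:  # ignore any text before the first header
            yield line


def write_fasta_sample(src: Path, dst: Path, n_records: int = 10_000) -> None:
    """Stream the head of `src` into `dst` without loading the whole file."""
    # ASSUMPTION: the file is text decodable as UTF-8; bad bytes are replaced.
    with open(src, "r", encoding="utf-8", errors="replace") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        fout.writelines(take_first_records(fin, n_records))


# Intended use with the paths defined in the cell above:
# write_fasta_sample(FULL_BOLD_PATH, SAMPLE_BOLD_PATH)
```

Taking the head of the file is fast and memory-friendly, but it is not a random sample: if the source file is ordered by taxon or by submission batch, the sample will be biased in the same way, and a random draw may be preferable for anything beyond parser development.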
