# 18S Data Preparation: Refinement and Development

**Objective:** Refine the data preparation pipeline for the 18S rRNA gene (Eukaryotes) using the SILVA database.

**Methodology:**
1.  Filter the full SILVA database to create a dedicated `SILVA_eukaryotes_only.fasta` file. This is a one-time, computationally intensive step.
2.  Create a small, manageable sample from this Eukaryote-only file for rapid development.
3.  Develop and test a robust taxonomy parser specifically designed for the complexities of eukaryotic lineages.
4.  Apply the full data cleaning and feature engineering workflow, and save the final artifacts.

In [1]:
import pandas as pd
import numpy as np
from Bio import SeqIO
from tqdm.auto import tqdm
from pathlib import Path
import sys

# Set up project path
project_root = Path.cwd().parent

# --- Verification Step ---
print(f"Project Root: {project_root}")

Project Root: C:\Users\jampa\Music\atlas-v3


In [2]:
# --- Define Core Directories ---
RAW_DATA_DIR = project_root / "data" / "raw"
PROCESSED_DATA_DIR = project_root / "data" / "processed"

# Create the processed data directory if it doesn't exist
PROCESSED_DATA_DIR.mkdir(parents=True, exist_ok=True)

# --- Define 18S Specific File Paths ---

# Path to the full, original SILVA database file (our main source)
FULL_SILVA_PATH = RAW_DATA_DIR / "SILVA_138.1_SSURef_NR99_tax_silva.fasta"

# Path to the intermediate file we will create, containing ONLY eukaryotes
EUKARYOTE_ONLY_PATH = RAW_DATA_DIR / "SILVA_eukaryotes_only.fasta"

# Path to the small sample file we will create from the eukaryote-only file
SAMPLE_EUKARYOTE_PATH = RAW_DATA_DIR / "SILVA_eukaryotes_sample_10k.fasta"

# --- Verification Step ---
if FULL_SILVA_PATH.exists():
    print("Source SILVA database found.")
    print(f"  - Location: {FULL_SILVA_PATH}")
else:
    print("ERROR: The source SILVA database was not found at the expected location.")
    print(f"  - Expected: {FULL_SILVA_PATH}")
    print("Please download the file and place it in the 'data/raw' directory before proceeding.")

Source SILVA database found.
  - Location: C:\Users\jampa\Music\atlas-v3\data\raw\SILVA_138.1_SSURef_NR99_tax_silva.fasta


### Step 1: Filter Full SILVA Database for Eukaryotes

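This is a one-time, computationally intensive step. We read the entire source SILVA database and write a new FASTA file containing only the sequences whose header mentions the domain "Eukaryota".

The cell is designed to run only once: if `SILVA_eukaryotes_only.fasta` already exists, the step is skipped. The check looks only at whether the file exists, so a partial file left behind by an interrupted run would also cause the step to be skipped; delete such a file before re-running.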

In [3]:
# This check prevents us from re-running this long process unnecessarily.
if not EUKARYOTE_ONLY_PATH.exists():
    print(f"Eukaryote-only file not found. Starting filtering process...")
    print("This will take a significant amount of time. Please be patient.")
    
    eukaryote_count = 0
    
    # Open both the input and output files
    with open(FULL_SILVA_PATH, "r") as handle_in, open(EUKARYOTE_ONLY_PATH, "w") as handle_out:
        # Use tqdm to monitor the progress of this long-running task
        for record in tqdm(SeqIO.parse(handle_in, "fasta"), desc="Filtering full SILVA DB"):
            # Substring match anywhere in the header line; this assumes "Eukaryota"
            # appears only in the taxonomy path of true eukaryote records
            if "Eukaryota" in record.description:
                # Write the record to our new file
                SeqIO.write(record, handle_out, "fasta")
                eukaryote_count += 1
                
    print(f"\n✅ Filtering complete.")
    print(f"   Found and wrote {eukaryote_count:,} Eukaryote sequences.")
    print(f"   New file created at: {EUKARYOTE_ONLY_PATH}")

else:
    print(f"✅ Eukaryote-only file already exists. Skipping filtering step.")
    print(f"   Location: {EUKARYOTE_ONLY_PATH}")

Eukaryote-only file not found. Starting filtering process...
This will take a significant amount of time. Please be patient.


✅ Filtering complete.
   Found and wrote 58,545 Eukaryote sequences.
   New file created at: C:\Users\jampa\Music\atlas-v3\data\raw\SILVA_eukaryotes_only.fasta
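
The filter above matches the word "Eukaryota" anywhere in the header and writes straight to the final path. A stricter variant is sketched below, under two assumptions that are not confirmed by this notebook: that SILVA headers follow the layout `>ACCESSION.START.STOP Domain;Rank2;...;Organism`, and that comparing only the first taxonomy rank is the desired behaviour. The function names `header_domain` and `filter_fasta_by_domain` are hypothetical. The sketch uses only the standard library, copies matching lines verbatim rather than re-serialising them through Biopython, and writes to a temporary file that is renamed on success, so an interrupted run does not leave a partial output file behind.

```python
import os
import tempfile
from pathlib import Path


def header_domain(header: str) -> str:
    """Return the first taxonomy rank from a SILVA-style FASTA header line.

    Assumed layout (not verified here): ">ACCESSION.START.STOP Domain;Rank2;..."
    """
    parts = header.lstrip(">").strip().split(maxsplit=1)
    if len(parts) < 2:
        return ""  # header carries no taxonomy string
    return parts[1].split(";", 1)[0].strip()


def filter_fasta_by_domain(src: Path, dst: Path, domain: str = "Eukaryota") -> int:
    """Stream src, keep records whose first rank equals domain, return the count."""
    kept = 0
    keep_current = False
    # The temporary file lives next to dst so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dst.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as out, open(src, "r") as handle:
            for line in handle:
                if line.startswith(">"):
                    # A new record starts: decide once whether to keep it.
                    keep_current = header_domain(line) == domain
                    if keep_current:
                        kept += 1
                if keep_current:
                    out.write(line)
        # Publish the finished file in a single step.
        os.replace(tmp_name, dst)
    except BaseException:
        # Clean up the partial temporary file, then re-raise the original error.
        Path(tmp_name).unlink(missing_ok=True)
        raise
    return kept
```

In this notebook it could be called as `filter_fasta_by_domain(FULL_SILVA_PATH, EUKARYOTE_ONLY_PATH)`. Whether it yields the same 58,545 records as the substring match has not been checked here.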


### Step 2: Create a Development Sample from Eukaryote Data

Now that we have a file containing only Eukaryote sequences, we take its first 10,000 records, in file order, as a development sample. This is not a random draw, so it may over-represent whichever lineages appear early in the file. It is meant to keep the remaining steps of this notebook fast and interactive, not to be statistically representative.

In [4]:
from itertools import islice

SAMPLE_SIZE = 10000

# We only create the sample file if it doesn't already exist.
if not SAMPLE_EUKARYOTE_PATH.exists():
    print(f"Creating a sample of {SAMPLE_SIZE} Eukaryote sequences...")
    
    with open(EUKARYOTE_ONLY_PATH, "r") as handle_in:
        # SeqIO.parse is already lazy, so islice takes the first SAMPLE_SIZE
        # records (in file order) without reading the rest of the file.
        records_iterator = islice(SeqIO.parse(handle_in, "fasta"), SAMPLE_SIZE)
        
        # Use tqdm to show progress while collecting the records
        sample_records = list(tqdm(records_iterator, total=SAMPLE_SIZE))
            
    # Write the collected sample records to the new file
    with open(SAMPLE_EUKARYOTE_PATH, "w") as handle_out:
        SeqIO.write(sample_records, handle_out, "fasta")
        
    print(f"\n✅ Successfully created sample file with {len(sample_records)} sequences.")
    print(f"   Location: {SAMPLE_EUKARYOTE_PATH}")
else:
    print("✅ Eukaryote sample file already exists. No action needed.")
    print(f"   Location: {SAMPLE_EUKARYOTE_PATH}")

Creating a sample of 10000 Eukaryote sequences...


✅ Successfully created sample file with 10000 sequences.
   Location: C:\Users\jampa\Music\atlas-v3\data\raw\SILVA_eukaryotes_sample_10k.fasta
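
The cell above keeps the first 10,000 records. If a uniformly random sample is preferred, one option is reservoir sampling (Algorithm R), which makes a single pass over the iterator and holds only `k` items in memory. The sketch below is an alternative technique, not what this notebook runs; the function name `reservoir_sample` and the default seed are hypothetical choices for illustration.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 42) -> List[T]:
    """Return k items drawn uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

With Biopython this could be used as `sample_records = reservoir_sample(SeqIO.parse(handle_in, "fasta"), SAMPLE_SIZE)` inside the existing `with open(...)` block. The trade-off is that the whole Eukaryote-only file is read once, rather than stopping after the first 10,000 records.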
