# 18S Data Preparation: Refinement and Development

**Objective:** Refine the data preparation pipeline for the 18S rRNA gene (Eukaryotes) using the SILVA database.

**Methodology:**
1.  Filter the full SILVA database to create a dedicated `SILVA_eukaryotes_only.fasta` file. This is a one-time, computationally intensive step.
2.  Create a small, manageable sample from this Eukaryote-only file for rapid development.
3.  Develop and test a robust taxonomy parser specifically designed for the complexities of eukaryotic lineages.
4.  Apply the full data cleaning and feature engineering workflow, and save the final artifacts.

In [1]:
import pandas as pd
import numpy as np
from Bio import SeqIO
from tqdm.auto import tqdm
from pathlib import Path
import sys

# Set up project path (assumes the notebook is run from a directory one level below the project root)
project_root = Path.cwd().parent

# --- Verification Step ---
print(f"Project Root: {project_root}")

Project Root: C:\Users\jampa\Music\atlas-v3


In [2]:
# --- Define Core Directories ---
RAW_DATA_DIR = project_root / "data" / "raw"
PROCESSED_DATA_DIR = project_root / "data" / "processed"

# Create the processed data directory if it doesn't exist
PROCESSED_DATA_DIR.mkdir(parents=True, exist_ok=True)

# --- Define 18S Specific File Paths ---

# Path to the full, original SILVA database file (our main source)
FULL_SILVA_PATH = RAW_DATA_DIR / "SILVA_138.1_SSURef_NR99_tax_silva.fasta"

# Path to the intermediate file we will create, containing ONLY eukaryotes
EUKARYOTE_ONLY_PATH = RAW_DATA_DIR / "SILVA_eukaryotes_only.fasta"

# Path to the small sample file we will create from the eukaryote-only file
SAMPLE_EUKARYOTE_PATH = RAW_DATA_DIR / "SILVA_eukaryotes_sample_10k.fasta"

# --- Verification Step ---
if FULL_SILVA_PATH.exists():
    print("Source SILVA database found.")
    print(f"  - Location: {FULL_SILVA_PATH}")
else:
    print("ERROR: The source SILVA database was not found at the expected location.")
    print(f"  - Expected: {FULL_SILVA_PATH}")
    print("Please download the file and place it in the 'data/raw' directory before proceeding.")

Source SILVA database found.
  - Location: C:\Users\jampa\Music\atlas-v3\data\raw\SILVA_138.1_SSURef_NR99_tax_silva.fasta


### Step 1: Filter Full SILVA Database for Eukaryotes

We read the entire source SILVA database and write a new FASTA file containing only the sequences whose header mentions "Eukaryota", the top-level domain in the SILVA taxonomy (stored in the `kingdom` field later in this notebook).

Filtering the full database is slow, so the cell below runs it only once: if `SILVA_eukaryotes_only.fasta` already exists, the step is skipped.
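
The filter in the next cell keeps any record whose header contains the substring "Eukaryota" anywhere. If a stricter test is wanted, one option is to look only at the first rank of the lineage. The following is a minimal sketch, assuming each header has the form `<accession> <rank>;<rank>;...` (the same layout the parser cell later in this notebook relies on). The function name `is_eukaryote_header` and the example headers are hypothetical.

```python
def is_eukaryote_header(description):
    """Return True only if the first rank of the lineage is exactly 'Eukaryota'."""
    # Split "<accession> <lineage>" at the first space
    _accession, _sep, lineage = description.partition(" ")
    if not lineage:
        return False  # header carries no taxonomy at all
    # The first ';'-separated entry is the top-level rank
    first_rank = lineage.split(";", 1)[0].strip()
    return first_rank == "Eukaryota"


# Made-up headers for illustration only
headers = [
    "AY955002.1.1727 Eukaryota;Archaeplastida;Micromonas pusilla",
    "X00000.1.1500 Bacteria;Example;Eukaryota-like label",
]
for header in headers:
    print(is_eukaryote_header(header), header)
```

Inside the filtering loop this would replace the substring test with `is_eukaryote_header(record.description)`. Whether it changes the number of retained sequences depends on the database contents and has not been checked here.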

In [3]:
# This check prevents us from re-running this long process unnecessarily.
if not EUKARYOTE_ONLY_PATH.exists():
    print("Eukaryote-only file not found. Starting filtering process...")
    print("This will take a significant amount of time. Please be patient.")
    
    eukaryote_count = 0
    
    # Open both the input and output files
    with open(FULL_SILVA_PATH, "r") as handle_in, open(EUKARYOTE_ONLY_PATH, "w") as handle_out:
        # Use tqdm to monitor the progress of this long-running task
        for record in tqdm(SeqIO.parse(handle_in, "fasta"), desc="Filtering full SILVA DB"):
            # Substring match on the whole header line: this keeps any record whose
            # description contains "Eukaryota" anywhere, not only as the first rank
            if "Eukaryota" in record.description:
                # Write the record to our new file
                SeqIO.write(record, handle_out, "fasta")
                eukaryote_count += 1
                
    print("\n✅ Filtering complete.")
    print(f"   Found and wrote {eukaryote_count:,} Eukaryote sequences.")
    print(f"   New file created at: {EUKARYOTE_ONLY_PATH}")

else:
    print("✅ Eukaryote-only file already exists. Skipping filtering step.")
    print(f"   Location: {EUKARYOTE_ONLY_PATH}")

Eukaryote-only file not found. Starting filtering process...
This will take a significant amount of time. Please be patient.




✅ Filtering complete.
   Found and wrote 58,545 Eukaryote sequences.
   New file created at: C:\Users\jampa\Music\atlas-v3\data\raw\SILVA_eukaryotes_only.fasta


### Step 2: Create a Development Sample from Eukaryote Data

Now that we have a file containing only Eukaryote sequences, we take its first 10,000 sequences as a development sample. This keeps all subsequent steps in this notebook fast and interactive. Because the records are taken in file order rather than at random, the sample may not reflect the taxonomic composition of the full file.
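
The sampling cell below reads records in file order and stops once it has collected `SAMPLE_SIZE` of them. If a sample drawn uniformly from the whole file is preferred, reservoir sampling (Algorithm R) does this in one pass without loading every record into memory. This is a sketch of that alternative, not the method used in this notebook; `reservoir_sample` and its `seed` parameter are hypothetical names.

```python
import random


def reservoir_sample(items, k, seed=42):
    """Return k items drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)  # local generator so results are reproducible
    sample = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1)
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Toy usage on integers standing in for FASTA records
picked = reservoir_sample(range(1000), 10)
print(len(picked))
```

With Biopython the iterable would be `SeqIO.parse(handle_in, "fasta")`. Switching to a random sample would change every downstream count reported in this notebook.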

In [4]:
from itertools import islice

SAMPLE_SIZE = 10000

# We only create the sample file if it doesn't already exist.
if not SAMPLE_EUKARYOTE_PATH.exists():
    print(f"Creating a sample of {SAMPLE_SIZE} Eukaryote sequences...")
    
    with open(EUKARYOTE_ONLY_PATH, "r") as handle_in:
        # Take the first SAMPLE_SIZE records in file order (not a random sample);
        # islice stops reading the file as soon as the sample is full
        first_records = islice(SeqIO.parse(handle_in, "fasta"), SAMPLE_SIZE)
        
        # Use tqdm to show progress while collecting the records
        sample_records = list(tqdm(first_records, total=SAMPLE_SIZE))
            
    # Write the collected sample records to the new file
    with open(SAMPLE_EUKARYOTE_PATH, "w") as handle_out:
        SeqIO.write(sample_records, handle_out, "fasta")
        
    print(f"\n✅ Successfully created sample file with {len(sample_records)} sequences.")
    print(f"   Location: {SAMPLE_EUKARYOTE_PATH}")
else:
    print("✅ Eukaryote sample file already exists. No action needed.")
    print(f"   Location: {SAMPLE_EUKARYOTE_PATH}")

Creating a sample of 10000 Eukaryote sequences...




✅ Successfully created sample file with 10000 sequences.
   Location: C:\Users\jampa\Music\atlas-v3\data\raw\SILVA_eukaryotes_sample_10k.fasta


### Step 3: Develop a Robust Parser for Eukaryotic Taxonomy

This is the most critical step for the 18S pipeline. Eukaryotic taxonomy strings in SILVA vary in depth and contain many intermediate clades (for example `Opisthokonta` or `Sar`) that do not map onto the standard ranks and must be filtered out.

We will:
1.  Define a set of known non-standard ranks to discard.
2.  Create a parsing function that assigns the remaining names to the major ranks (Phylum, Class, Order, Family, Genus, Species) by position: the last entry is treated as the species if it contains a space and as the genus otherwise, and the entries before it are assigned from most to least specific.
3.  Apply this function to our 10,000-sequence sample and inspect the resulting DataFrame for correctness.
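
To make the first two stages concrete, the sketch below filters a lineage and then splits off the genus and species with the "a space means species" heuristic. It is a simplified illustration under stated assumptions: `DISCARD_DEMO` is a deliberately shortened, hypothetical discard set, the helper names are invented, and the lineage string is an illustrative example rather than a record copied from the database.

```python
# Hypothetical, shortened discard set (lowercase) for illustration only
DISCARD_DEMO = {"archaeplastida", "chloroplastida"}


def split_lineage(taxonomy_str, discard=DISCARD_DEMO):
    """Split a ';'-separated lineage and drop empty or discardable entries."""
    parts = [p.strip() for p in taxonomy_str.split(";")]
    return [p for p in parts if p and p.lower() not in discard]


def split_tail(ranks):
    """Return (genus, species, middle_ranks) using the 'space means species' rule."""
    if len(ranks) < 2:
        return None, None, []
    last = ranks[-1]
    if " " in last:
        # Last entry looks like a species; the entry before it is taken as the genus
        genus = ranks[-2] if len(ranks) > 2 else None
        return genus, last, ranks[1:-2]
    # Otherwise the last entry is taken as the genus
    return last, None, ranks[1:-1]


lineage = ("Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;"
           "Mamiellophyceae;Mamiellales;Micromonas;Micromonas pusilla")
ranks = split_lineage(lineage)
genus, species, middle = split_tail(ranks)
print(genus, "|", species, "|", middle)
```

The `middle` list is what the full parser then maps onto family, order, class, and phylum from right to left. Since that mapping is purely positional, a lineage with a missing or extra level shifts every label by one rank.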

In [5]:
# A list to hold our structured data
parsed_data = []

# Define a set of non-standard, intermediate, or otherwise unhelpful ranks to discard.
# This list is based on common patterns found in the SILVA eukaryote taxonomy.
DISCARD_RANKS = {
    'cellular organisms', 'Opisthokonta', 'Holozoa', 'Metazoa (Animalia)', 
    'Eumetazoa', 'Bilateria', 'Protostomia', 'Deuterostomia', 'Sar', 
    'Stramenopila', 'Alveolata', 'Rhizaria', 'Archaeplastida', 'Glaucophyta',
    'Chloroplastida', 'Rhodophyceae', 'Streptophyta', 'Embryophyta', 
    'Tracheophyta', 'Phragmoplastophyta', 'Excavata', 'Discoba', 'Metamonada'
}

def parse_eukaryote_taxonomy(taxonomy_str):
    """
    Heuristic parser for SILVA Eukaryote taxonomy strings.

    Drops the names listed in DISCARD_RANKS, then assigns the remaining names
    to ranks purely by position, so the resulting labels are approximate.
    """
    # Start with a dictionary of empty ranks
    parsed_ranks = {
        'kingdom': None, 'phylum': None, 'class': None, 'order': None,
        'family': None, 'genus': None, 'species': None
    }
    
    # Lowercase the discard set for case-insensitive matching
    discard_lower = {r.lower() for r in DISCARD_RANKS}
    
    # Split the string by ';' and drop empty or discardable entries
    ranks = [
        rank.strip() for rank in taxonomy_str.split(';')
        if rank.strip() and rank.strip().lower() not in discard_lower
    ]
    
    if not ranks:
        return parsed_ranks
        
    # The first remaining item is treated as the kingdom
    parsed_ranks['kingdom'] = ranks[0]
    
    # If there are more ranks, the last one is either the genus or species
    if len(ranks) > 1:
        last_item = ranks[-1]
        # A simple heuristic: if it contains a space, it's likely a species
        if ' ' in last_item:
            parsed_ranks['species'] = last_item
            if len(ranks) > 2: # Genus is the second to last item
                parsed_ranks['genus'] = ranks[-2]
            remaining_ranks = ranks[1:-2]
        else: # The last item is the genus
            parsed_ranks['genus'] = last_item
            remaining_ranks = ranks[1:-1]
            
        # Assign the remaining ranks from right to left (most specific to least specific)
        if len(remaining_ranks) > 0: parsed_ranks['family'] = remaining_ranks[-1]
        if len(remaining_ranks) > 1: parsed_ranks['order'] = remaining_ranks[-2]
        if len(remaining_ranks) > 2: parsed_ranks['class'] = remaining_ranks[-3]
        if len(remaining_ranks) > 3:
            parsed_ranks['phylum'] = remaining_ranks[-4]
        elif remaining_ranks:
            # Fallback for short lineages: reuse the first name after the kingdom.
            # This repeats a name already assigned to class, order, or family.
            # Lineages with no intermediate ranks leave phylum as None.
            parsed_ranks['phylum'] = remaining_ranks[0]

    return parsed_ranks

# --- Apply the parser to our sample data ---

# Loop through our sample file records
with open(SAMPLE_EUKARYOTE_PATH, "r") as handle:
    for record in tqdm(SeqIO.parse(handle, "fasta"), total=SAMPLE_SIZE, desc="Parsing sample taxonomy"):
        # The taxonomy string is the part of the description after the first space;
        # partition() yields an empty taxonomy instead of raising if there is no space
        accession, _, taxonomy_str = record.description.partition(' ')
        
        # Parse the taxonomy string using our new, robust function
        taxonomy_dict = parse_eukaryote_taxonomy(taxonomy_str)
        
        # Also store the sequence and its ID
        taxonomy_dict['id'] = record.id
        taxonomy_dict['sequence'] = str(record.seq)
        
        parsed_data.append(taxonomy_dict)

# Convert the list of dictionaries into a pandas DataFrame
df = pd.DataFrame(parsed_data)

# --- Verification Step ---
print(f"\n✅ Parsing complete. Created a DataFrame with {len(df)} rows.")
print("Here's a preview of the structured data:")
display(df.head()) # Use display() for better formatting in Jupyter

print("\nLet's also check a few random samples to see how the parser handled them:")
display(df.sample(5))



✅ Parsing complete. Created a DataFrame with 10000 rows.
Here's a preview of the structured data:


| | kingdom | phylum | class | order | family | genus | species | id | sequence |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Eukaryota | Chlorophyta | Chlorophyta | Chlorophyceae | Sphaeropleales | Monoraphidium | Monoraphidium sp. Itas 9/21 14-6w | AY846379.1.1791 | AACCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUCAAAGAUUA... |
| 1 | Eukaryota | Charophyta | Spermatophyta | Magnoliophyta | Oxalidales | Connarus | Connarus championii | AY929368.1.1768 | AAGAUUAAGCCAUGCAUGUGUAAGUAUGAACUAAUUCAGACUGUGA... |
| 2 | Eukaryota | Chlorophyta | Chlorophyta | Mamiellophyceae | Mamiellales | Micromonas | Micromonas pusilla | AY955002.1.1727 | UGUCUAAGUAUAAGCGUUAUACUGUGAAACUGCGAAUGGCUCAUUA... |
| 3 | Eukaryota | Chlorophyta | Chlorophyta | Chlorophyceae | Chlamydomonadales | Chlamydomonas | Chlamydomonas orbicularis | LF644976.16.1783 | AGCGUUGCAUGCCUGCAGGUCGACUCUAGAGGGGAUCCAGAUCUCC... |
| 4 | Eukaryota | Amorphea | Amoebozoa | Centrohelida | Heterophryidae | Chlamydaster | Chlamydaster sterni | KY857824.1.1808 | AACCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUCAAAGAUUA... |



Let's also check a few random samples to see how the parser handled them:


| | kingdom | phylum | class | order | family | genus | species | id | sequence |
|---|---|---|---|---|---|---|---|---|---|
| 5303 | Eukaryota | Amorphea | Amorphea | Amoebozoa | Dictyostelia | Dictyostelium | Dictyostelium dimigraformum | AM168038.1.1867 | AACCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUCAAAGAUUA... |
| 8829 | Eukaryota | Apicomplexa | Apicomplexa | Conoidasida | Gregarinasina | Eugregarinorida | uncultured eukaryote | AB275104.1.1590 | AUUAAGUAAGUUUUAAUUUAUUUGGUUAUAAAAAAUUAACGGAUAA... |
| 6450 | Eukaryota | Euglenozoa | Kinetoplastea | Metakinetoplastina | Trypanosomatida | Trypanosoma | Trypanosoma wauwau | KR653211.1.2208 | UGAUCUGGUUGAUUCUGCCAGUAGUCAUAUGCUUGUUUCAAGGACU... |
| 2007 | Eukaryota | Dinoflagellata | Dinophyceae | Gymnodiniphycidae | Kareniaceae | Karlodinium | Karenia mikimotoi | JF791035.1.1777 | ACCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUCAAAGAUUAA... |
| 2272 | Eukaryota | Dinoflagellata | Dinophyceae | Gymnodiniphycidae | Gymnodinium clade | Gymnodinium | Lepidodinium viride | JF791033.1.1777 | ACCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUCAAAGAUUAA... |
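
Some rows in the previews above look suspicious. In row 0 the same name fills both `phylum` and `class`, and in row 2007 the species name does not start with the assigned genus. One way to see how common such cases are is to flag them programmatically. The sketch below works on plain dictionaries shaped like the entries of `parsed_data`; `flag_suspect_record` and its two rules are illustrative assumptions, not an exhaustive validity check.

```python
RANK_ORDER = ["kingdom", "phylum", "class", "order", "family", "genus"]


def flag_suspect_record(rec):
    """Return a list of human-readable warnings for one parsed taxonomy record."""
    flags = []
    seen = {}  # name -> first rank where it appeared
    for rank in RANK_ORDER:
        name = rec.get(rank)
        if name is None:
            continue
        if name in seen:
            flags.append(f"{seen[name]} and {rank} are both '{name}'")
        else:
            seen[name] = rank
    genus, species = rec.get("genus"), rec.get("species")
    # A binomial species name normally begins with its genus
    if genus and species and not species.startswith(genus + " "):
        flags.append("species does not start with genus")
    return flags


# Three rows copied from the previews above (sequence and id omitted)
rows = [
    {"kingdom": "Eukaryota", "phylum": "Chlorophyta", "class": "Chlorophyta",
     "order": "Chlorophyceae", "family": "Sphaeropleales",
     "genus": "Monoraphidium", "species": "Monoraphidium sp. Itas 9/21 14-6w"},
    {"kingdom": "Eukaryota", "phylum": "Dinoflagellata", "class": "Dinophyceae",
     "order": "Gymnodiniphycidae", "family": "Kareniaceae",
     "genus": "Karlodinium", "species": "Karenia mikimotoi"},
    {"kingdom": "Eukaryota", "phylum": "Euglenozoa", "class": "Kinetoplastea",
     "order": "Metakinetoplastina", "family": "Trypanosomatida",
     "genus": "Trypanosoma", "species": "Trypanosoma wauwau"},
]
for row in rows:
    print(row["genus"], "->", flag_suspect_record(row))
```

Applied to the whole sample, for example with `sum(bool(flag_suspect_record(r)) for r in parsed_data)`, this gives a rough count of rows worth inspecting before the genus labels are used for training.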


In [6]:
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from scipy.sparse import save_npz
import pickle

# --- Configuration for this phase ---
TARGET_RANK = 'genus'
MIN_CLASS_MEMBERS = 3
KMER_SIZE = 6 # k-mer length for the sequence features (4**6 = 4,096 possible 6-mers over A, C, G, U)
TEST_SPLIT_SIZE = 0.2
RANDOM_STATE = 42
MODELS_DIR = project_root / "models"
MODELS_DIR.mkdir(parents=True, exist_ok=True)


# --- Step 4: Clean and Filter the DataFrame ---
print("--- Step 4: Cleaning and filtering data ---")
initial_rows = len(df)
df_cleaned = df.dropna(subset=[TARGET_RANK]).copy()
print(f"Removed {initial_rows - len(df_cleaned)} rows with missing '{TARGET_RANK}' labels.")

class_counts = df_cleaned[TARGET_RANK].value_counts()
classes_to_keep = class_counts[class_counts >= MIN_CLASS_MEMBERS].index
df_filtered = df_cleaned[df_cleaned[TARGET_RANK].isin(classes_to_keep)].copy()
print(f"Removed {len(df_cleaned) - len(df_filtered)} rows for rare genera (less than {MIN_CLASS_MEMBERS} members).")
print(f"Final DataFrame has {len(df_filtered)} sequences.")
print("-" * 45)


# --- Step 5: Feature Engineering (K-mer Counting) ---
print(f"--- Step 5: Engineering {KMER_SIZE}-mer features ---")
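def get_kmer_counts(sequence, k):
    """Count overlapping k-mers in a sequence, skipping any k-mer that contains 'N'."""
    # Only 'N' is filtered; other ambiguity codes (e.g. R, Y) are counted as-is,
    # so the number of distinct k-mers can exceed 4**k.
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i+k]
        if "N" not in kmer.upper():
            counts[kmer] += 1
    return dict(counts)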

kmer_iter = (get_kmer_counts(seq, KMER_SIZE) for seq in df_filtered['sequence'])
df_filtered['kmer_counts'] = list(tqdm(kmer_iter, total=len(df_filtered), desc="Calculating k-mers"))
print("-" * 45)


# --- Step 6: Vectorize Features and Labels ---
print("--- Step 6: Vectorizing data ---")
vectorizer = DictVectorizer(sparse=True)
X = vectorizer.fit_transform(df_filtered['kmer_counts'])
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df_filtered[TARGET_RANK])
print(f"Feature matrix shape: {X.shape}")
print(f"Label vector shape:   {y.shape}")
print("-" * 45)


# --- Step 7: Split Data ---
print("--- Step 7: Splitting data into training/testing sets ---")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SPLIT_SIZE, random_state=RANDOM_STATE, stratify=y)
print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape:  {X_test.shape}")
print("-" * 45)


# --- Step 8: Save All Processed Artifacts ---
print("--- Step 8: Saving all 18S artifacts to disk ---")
save_npz(PROCESSED_DATA_DIR / "X_train_18s.npz", X_train)
save_npz(PROCESSED_DATA_DIR / "X_test_18s.npz", X_test)
np.save(PROCESSED_DATA_DIR / "y_train_18s.npy", y_train)
np.save(PROCESSED_DATA_DIR / "y_test_18s.npy", y_test)
with open(MODELS_DIR / "18s_genus_vectorizer.pkl", 'wb') as f:
    pickle.dump(vectorizer, f)
with open(MODELS_DIR / "18s_genus_label_encoder.pkl", 'wb') as f:
    pickle.dump(label_encoder, f)
print("✅ All artifacts saved successfully.")
print("\n--- 18S DATA PREPARATION COMPLETE ---")

--- Step 4: Cleaning and filtering data ---
Removed 2 rows with missing 'genus' labels.
Removed 1964 rows for rare genera (less than 3 members).
Final DataFrame has 8034 sequences.
---------------------------------------------
--- Step 5: Engineering 6-mer features ---



---------------------------------------------
--- Step 6: Vectorizing data ---
Feature matrix shape: (8034, 14058)
Label vector shape:   (8034,)
---------------------------------------------
--- Step 7: Splitting data into training/testing sets ---
Training set shape: (6427, 14058)
Testing set shape:  (1607, 14058)
---------------------------------------------
--- Step 8: Saving all 18S artifacts to disk ---
✅ All artifacts saved successfully.

--- 18S DATA PREPARATION COMPLETE ---
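
The feature matrix above has 14,058 columns, more than the 4^6 = 4,096 distinct 6-mers that can be formed from A, C, G, and U. Some counted k-mers must therefore contain other characters, most likely IUPAC ambiguity codes, since `get_kmer_counts` only skips k-mers containing `N`. If a fixed vocabulary is preferred, a stricter counter might look like the sketch below. `count_unambiguous_kmers` and `RNA_ALPHABET` are hypothetical names, not part of the pipeline above.

```python
from collections import Counter

# SILVA sequences in this notebook use the RNA alphabet (U instead of T)
RNA_ALPHABET = frozenset("ACGU")


def count_unambiguous_kmers(sequence, k):
    """Count overlapping k-mers made up only of A, C, G, U (case-insensitive)."""
    seq = sequence.upper()  # normalise case once, so keys are always uppercase
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip any k-mer that contains a character outside the RNA alphabet
        if RNA_ALPHABET.issuperset(kmer):
            counts[kmer] += 1
    return dict(counts)


# 'R' is an ambiguity code, so every 3-mer overlapping it is dropped
print(count_unambiguous_kmers("ACGURACGU", 3))  # {'ACG': 2, 'CGU': 2}
```

Adopting this counter would cap the vocabulary at 4,096 features for `KMER_SIZE = 6` and would change the matrix shapes reported above, so the saved artifacts would need to be regenerated.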
