# 16S Data Preparation: Refinement and Development

**Objective:** Refine the data preparation pipeline for 16S rRNA gene sequences (Bacteria and Archaea).

**Methodology:**
1. Start with a small, manageable sample of the full SILVA database.
2. Interactively develop and test each step of the pipeline (Filtering, Parsing, K-mer Counting, Vectorizing).
3. Ensure the logic is robust before converting it to a final `.py` script.
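
The later stages named in step 2 (K-mer Counting, Vectorizing) can be sketched in a few lines before they are developed properly. The block below is only an illustration. `K = 3`, the function names `count_kmers` and `vectorize`, the mapping of `U` to `T`, and the relative-frequency normalization are all assumptions made for this sketch, not decisions taken in this notebook.

```python
from collections import Counter
from itertools import product

K = 3  # assumed k-mer length, chosen only to keep the example small
ALPHABET = "ACGT"

# Fixed vocabulary of all 4**K k-mers in lexicographic order, so every
# sequence maps to a vector with the same layout.
VOCABULARY = ["".join(p) for p in product(ALPHABET, repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(VOCABULARY)}


def count_kmers(sequence, k=K):
    """Count overlapping k-mers, skipping windows that contain non-ACGT symbols."""
    # Assumption: RNA letters (U) are mapped to DNA letters (T) before counting.
    seq = sequence.upper().replace("U", "T")
    counts = Counter()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if kmer in KMER_INDEX:
            counts[kmer] += 1
    return counts


def vectorize(counts):
    """Turn k-mer counts into a fixed-length list of relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(VOCABULARY)
    return [counts.get(kmer, 0) / total for kmer in VOCABULARY]


example = "ACGUACGN"  # made-up sequence with one ambiguous base
counts = count_kmers(example)
vector = vectorize(counts)
print(counts)
print(len(vector))
```

Using a fixed vocabulary rather than only the k-mers seen in the data keeps the vector length at `4**K` for every sequence, which is what a downstream model would likely need. Windows containing ambiguous bases such as `N` are dropped here; other policies are possible.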

In [1]:
import pandas as pd
import numpy as np
from Bio import SeqIO
from tqdm.auto import tqdm
from pathlib import Path
import sys

# Add the project root to the Python path to allow for module imports if needed later
project_root = Path.cwd().parent
sys.path.append(str(project_root))

print(f"Project Root: {project_root}")

Project Root: C:\Users\jampa\Music\atlas-v3


In [2]:
RAW_DATA_DIR = project_root / "data" / "raw"
PROCESSED_DATA_DIR = project_root / "data" / "processed"

# Create the processed data directory if it doesn't exist yet
PROCESSED_DATA_DIR.mkdir(parents=True, exist_ok=True)

# Path to the full, original SILVA database file
FULL_SILVA_PATH = RAW_DATA_DIR / "SILVA_138.1_SSURef_NR99_tax_silva.fasta"

# Path to the small sample file we will create for development
SAMPLE_SILVA_PATH = RAW_DATA_DIR / "SILVA_sample_10k.fasta"

### Step 1: Create a Small Sample for Development

We will read the first 10,000 sequences from the full SILVA database and save them to a new file. This allows us to develop the rest of the pipeline quickly without waiting for the full dataset to process.

In [4]:
from itertools import islice

SAMPLE_SIZE = 10000

# Only create the sample if it doesn't already exist,
# so re-running the notebook skips this step.
if not SAMPLE_SILVA_PATH.exists():
    # Fail early with a clear message if the source database is missing
    if not FULL_SILVA_PATH.exists():
        raise FileNotFoundError(f"Full SILVA file not found: {FULL_SILVA_PATH}")

    print(f"Full SILVA file found. Creating a sample of {SAMPLE_SIZE} sequences...")
    print("This may take a moment...")

    with open(FULL_SILVA_PATH, "r") as handle_in:
        # SeqIO.parse is already a lazy iterator; islice stops it after
        # SAMPLE_SIZE records so the rest of the file is never read
        first_records = islice(SeqIO.parse(handle_in, "fasta"), SAMPLE_SIZE)

        # Use a progress bar to see what's happening
        sample_records = list(tqdm(first_records, total=SAMPLE_SIZE))

    # Write the collected sample records to the new file
    with open(SAMPLE_SILVA_PATH, "w") as handle_out:
        SeqIO.write(sample_records, handle_out, "fasta")
        
    print(f"✅ Successfully created sample file with {len(sample_records)} sequences.")
    print(f"   Location: {SAMPLE_SILVA_PATH}")
else:
    print(f"✅ Sample file already exists. No action needed.")
    print(f"   Location: {SAMPLE_SILVA_PATH}")

Full SILVA file found. Creating a sample of 10000 sequences...
This may take a moment...


✅ Successfully created sample file with 10000 sequences.
   Location: C:\Users\jampa\Music\atlas-v3\data\raw\SILVA_sample_10k.fasta
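
Taking the first 10,000 records is fast, but the head of the file need not reflect the taxonomic mix of the whole database. If a more representative development sample is ever wanted, single-pass reservoir sampling is one alternative. This is not what the cell above does. The sketch below is dependency-free so it can be run without the SILVA file: `iter_fasta` and `reservoir_sample` are hypothetical helpers, and the toy FASTA text is made up.

```python
import io
import random


def iter_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def reservoir_sample(records, size, seed=0):
    """Pick `size` records uniformly at random in one pass (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for index, record in enumerate(records):
        if index < size:
            # Fill the reservoir with the first `size` records
            reservoir.append(record)
        else:
            # Replace a random slot with probability size / (index + 1)
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < size:
                reservoir[slot] = record
    return reservoir


# Tiny made-up FASTA stream standing in for the full SILVA file.
toy_fasta = io.StringIO("".join(f">seq{i} toy\nACGU\nGGCC\n" for i in range(100)))
sample = reservoir_sample(iter_fasta(toy_fasta), size=10, seed=42)
print(len(sample))
```

The trade-off is that reservoir sampling has to stream through the entire input once, whereas reading the first records stops early. For quick iteration on pipeline logic the first-N approach is likely good enough.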


### Step 2: Filter the Sample for Prokaryotes (Bacteria & Archaea)

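Now we read our `SILVA_sample_10k.fasta` file and build an in-memory list of the records whose description line contains "bacteria" or "archaea". The match is a case-insensitive substring test on the whole header, so it keeps any record that mentions either word anywhere in its taxonomy string, not only records whose top-level domain is Bacteria or Archaea.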

In [5]:
prokaryote_records = []
total_records = 0  # count what is actually read instead of assuming SAMPLE_SIZE

print(f"Reading from sample file: {SAMPLE_SILVA_PATH}")

with open(SAMPLE_SILVA_PATH, "r") as handle:
    # tqdm shows progress; total only sizes the bar
    for record in tqdm(SeqIO.parse(handle, "fasta"), total=SAMPLE_SIZE):
        total_records += 1
        # The description line contains the full taxonomy string
        description = record.description.lower()  # lowercase for a case-insensitive match

        # Substring test on the whole header line
        if "bacteria" in description or "archaea" in description:
            prokaryote_records.append(record)

# Print a summary to verify the result
print("\n--- Filtering Summary ---")
print(f"Total sequences in sample: {total_records}")
print(f"Found {len(prokaryote_records)} prokaryote sequences.")
print("✅ Filtering complete.")

Reading from sample file: C:\Users\jampa\Music\atlas-v3\data\raw\SILVA_sample_10k.fasta



--- Filtering Summary ---
Total sequences in sample: 10000
Found 6883 prokaryote sequences.
✅ Filtering complete.
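
The substring test used above accepts a record if "bacteria" or "archaea" appears anywhere in the header. A stricter variant, sketched here, looks only at the first taxonomy rank. It assumes the usual SILVA header layout `<accession> <Domain>;<rank>;...`, which should be checked against the actual file. `domain_of` and `is_prokaryote` are hypothetical helpers and the three headers are made up for illustration. Whether this changes the count reported above can only be known by running it on the sample.

```python
PROKARYOTE_DOMAINS = {"bacteria", "archaea"}


def domain_of(description):
    """Return the lowercased first taxonomy rank of a SILVA-style description.

    Assumes the layout '<accession> <Domain>;<rank>;...'.
    Returns '' when no taxonomy string follows the accession.
    """
    parts = description.split(None, 1)  # split the accession from the taxonomy
    if len(parts) < 2:
        return ""
    return parts[1].split(";", 1)[0].strip().lower()


def is_prokaryote(description):
    """True only when the top-level domain is Bacteria or Archaea."""
    return domain_of(description) in PROKARYOTE_DOMAINS


# Made-up headers for illustration only.
headers = [
    "AB000001.1.1500 Bacteria;Proteobacteria;Gammaproteobacteria;Escherichia coli",
    "AB000002.1.1450 Archaea;Euryarchaeota;Methanobacteria;Methanobacterium sp.",
    "AB000003.1.1800 Eukaryota;Amorphea;Amoebozoa;host of endosymbiotic bacteria",
]
for header in headers:
    lowered = header.lower()
    substring_hit = "bacteria" in lowered or "archaea" in lowered
    print(is_prokaryote(header), substring_hit)
```

In this toy input the third header passes the substring test but is rejected by the domain test, because its first rank is Eukaryota. With Biopython, `record.description` holds the full header line including the accession, so it could be passed to `is_prokaryote` directly.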
