# 16S Data Preparation: Refinement and Development

**Objective:** Refine the data preparation pipeline for the 16S rRNA gene (Bacteria & Archaea). 

**Methodology:**
1. Start with a small, manageable sample of the full SILVA database.
2. Interactively develop and test each step of the pipeline (Filtering, Parsing, K-mer Counting, Vectorizing).
3. Ensure the logic is robust before converting it to a final `.py` script.
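
The last two steps named above, K-mer Counting and Vectorizing, can be sketched roughly as follows. This is only an illustration under assumptions: the function name `kmer_vector`, the choice `k=3`, and the `ACGU` alphabet are hypothetical and not taken from the pipeline itself. It uses only the standard library so it runs anywhere.

```python
from collections import Counter
from itertools import product

def kmer_vector(sequence, k=3, alphabet="ACGU"):
    """Return normalized k-mer frequencies in a fixed k-mer order (hypothetical helper)."""
    # A fixed vocabulary gives every sequence a vector of the same length (len(alphabet)**k).
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    # Treat DNA and RNA input alike by mapping T to U.
    sequence = sequence.upper().replace("T", "U")
    # Slide a window of width k over the sequence and count each window.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Windows containing other symbols (e.g. N) are not in the vocabulary and are ignored.
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]

vec = kmer_vector("AACUGAAGAGUUUGAUCAUGGCUCAG", k=3)
print(len(vec))  # 4**3 = 64 entries
```

Normalizing by the total count keeps vectors comparable across sequences of different lengths; whether raw counts or frequencies suit the final model better is a design decision to settle during development.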

In [1]:
import pandas as pd
import numpy as np
from Bio import SeqIO
from tqdm.auto import tqdm
from pathlib import Path
import sys

# Add the project root to the Python path to allow for module imports if needed later
project_root = Path.cwd().parent
sys.path.append(str(project_root))

print(f"Project Root: {project_root}")

Project Root: C:\Users\jampa\Music\atlas-v3


In [2]:
RAW_DATA_DIR = project_root / "data" / "raw"
PROCESSED_DATA_DIR = project_root / "data" / "processed"

# Create the processed data directory if it doesn't exist yet
PROCESSED_DATA_DIR.mkdir(parents=True, exist_ok=True)

# Path to the full, original SILVA database file
FULL_SILVA_PATH = RAW_DATA_DIR / "SILVA_138.1_SSURef_NR99_tax_silva.fasta"

# Path to the small sample file we will create for development
SAMPLE_SILVA_PATH = RAW_DATA_DIR / "SILVA_sample_10k.fasta"

### Step 1: Create a Small Sample for Development

We will read the first 10,000 sequences from the full SILVA database and save them to a new file. This allows us to develop the rest of the pipeline quickly without waiting for the full dataset to process.

In [4]:
SAMPLE_SIZE = 10000

# We only create the file if it doesn't already exist.
# This saves time if we have to re-run the notebook.
if not SAMPLE_SILVA_PATH.exists():
    print(f"Full SILVA file found. Creating a sample of {SAMPLE_SIZE} sequences...")
    print(f"This may take a moment...")
    
    # Stream records one at a time; tqdm below shows the progress
    with open(FULL_SILVA_PATH, "r") as handle_in:
        # SeqIO.parse is already a lazy iterator, so no extra wrapper is needed
        records_iterator = SeqIO.parse(handle_in, "fasta")
        
        sample_records = []
        for i, record in tqdm(enumerate(records_iterator), total=SAMPLE_SIZE):
            if i >= SAMPLE_SIZE:
                break
            sample_records.append(record)
            
    # Write the collected sample records to the new file
    with open(SAMPLE_SILVA_PATH, "w") as handle_out:
        SeqIO.write(sample_records, handle_out, "fasta")
        
    print(f"✅ Successfully created sample file with {len(sample_records)} sequences.")
    print(f"   Location: {SAMPLE_SILVA_PATH}")
else:
    print(f"✅ Sample file already exists. No action needed.")
    print(f"   Location: {SAMPLE_SILVA_PATH}")

Full SILVA file found. Creating a sample of 10000 sequences...
This may take a moment...


✅ Successfully created sample file with 10000 sequences.
   Location: C:\Users\jampa\Music\atlas-v3\data\raw\SILVA_sample_10k.fasta
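
Taking the first N records is the fastest way to get a development sample, but the head of a file is not necessarily representative of the whole database. The sketch below shows two options that work on any iterator: `take_first`, which mirrors the approach used above, and `reservoir_sample`, a single-pass uniform random sample. Both function names are hypothetical, and the stand-in data replaces the real `SeqIO.parse(handle, "fasta")` iterator so the block runs without the SILVA file.

```python
import random
from itertools import islice

def take_first(records, n):
    """Return the first n items of any iterator without reading the rest."""
    return list(islice(records, n))

def reservoir_sample(records, n, seed=0):
    """Return n items chosen uniformly at random in one pass (reservoir sampling)."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for i, record in enumerate(records):
        if i < n:
            # Fill the reservoir with the first n items.
            sample.append(record)
        else:
            # Replace a random slot with probability n / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < n:
                sample[j] = record
    return sample

# Stand-in data; in this notebook the iterator would be SeqIO.parse(handle, "fasta").
fake_records = (f"seq_{i}" for i in range(100_000))
print(take_first(fake_records, 3))  # ['seq_0', 'seq_1', 'seq_2']
```

The trade-off is speed: `take_first` stops after N records, while `reservoir_sample` has to read the whole file once.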


### Step 2: Filter the Sample for Prokaryotes (Bacteria & Archaea)

Now we will read our `SILVA_sample_10k.fasta` file and build an in-memory list of the records whose description line contains "bacteria" or "archaea" (matched case-insensitively).

In [6]:
prokaryote_records = []

# We use tqdm again to see the progress
print(f"Reading from sample file: {SAMPLE_SILVA_PATH}")

with open(SAMPLE_SILVA_PATH, "r") as handle:
    for record in tqdm(SeqIO.parse(handle, "fasta"), total=SAMPLE_SIZE):
        # The description line contains the full taxonomy string
        description = record.description.lower() # Use .lower() for a case-insensitive match
        
        if "bacteria" in description or "archaea" in description:
            prokaryote_records.append(record)

# Print a summary to verify the result
print("\n--- Filtering Summary ---")
print(f"Total sequences in sample: {SAMPLE_SIZE}")
print(f"Found {len(prokaryote_records)} prokaryote sequences.")
print(f"✅ Filtering complete.")

Reading from sample file: C:\Users\jampa\Music\atlas-v3\data\raw\SILVA_sample_10k.fasta



--- Filtering Summary ---
Total sequences in sample: 10000
Found 6883 prokaryote sequences.
✅ Filtering complete.
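
A substring match on the whole description accepts any header that mentions "bacteria" or "archaea" anywhere, including in a species name. A stricter variant looks only at the first taxonomy field. The sketch below assumes headers shaped like `<accession> <rank>;<rank>;...`; the helper names `domain_of` and `is_prokaryote` and both example headers are made up for illustration.

```python
PROKARYOTE_DOMAINS = {"bacteria", "archaea"}

def domain_of(description):
    """Return the first taxonomy field of a '<accession> <rank>;<rank>;...' header."""
    # partition() never raises, even if the header contains no space.
    _accession, _, taxonomy = description.partition(" ")
    return taxonomy.split(";", 1)[0].strip()

def is_prokaryote(description):
    """True only when the top-level field itself is Bacteria or Archaea."""
    return domain_of(description).lower() in PROKARYOTE_DOMAINS

# Made-up headers for illustration only.
headers = [
    "ACC0001.1.1500 Bacteria;Firmicutes;Bacilli",
    "ACC0002.1.1800 Eukaryota;Example clade;organism feeding on bacteria",
]
for header in headers:
    # Compare the strict check with the plain substring test.
    print(is_prokaryote(header), "bacteria" in header.lower())
```

The second header passes the substring test but is rejected by the strict check, which is the case this variant is meant to catch.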


### Step 3: Parse Taxonomy from Filtered Records

Each FASTA header consists of the accession, a space, and the full taxonomic lineage with ranks separated by semicolons (e.g., `Bacteria;Proteobacteria;Gammaproteobacteria...`).

We will now:
1. Define a function to parse this string into distinct taxonomic ranks.
2. Apply this function to our list of 6,883 prokaryote records.
3. Store the structured data in a pandas DataFrame for easy analysis.
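
One point to watch when ranks are assigned by position: removing a placeholder such as `uncultured` before indexing moves every later field up one slot. The sketch below contrasts the two orders of operation on a made-up lineage. The names `PLACEHOLDERS`, `RANK_NAMES`, `assign_after_filtering`, and `assign_in_place` are hypothetical and used only for this comparison.

```python
PLACEHOLDERS = {"uncultured", "unidentified", "metagenome"}
RANK_NAMES = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def assign_after_filtering(lineage):
    """Drop placeholders first, then assign by position (later ranks shift left)."""
    fields = [f.strip() for f in lineage.split(";") if f.strip()]
    kept = [f for f in fields if f.lower() not in PLACEHOLDERS]
    return dict(zip(RANK_NAMES, kept))

def assign_in_place(lineage):
    """Assign by position first, then blank placeholders (positions are preserved)."""
    fields = [f.strip() for f in lineage.split(";") if f.strip()]
    return {
        name: (None if value.lower() in PLACEHOLDERS else value)
        for name, value in zip(RANK_NAMES, fields)
    }

# Made-up lineage with a placeholder in the genus slot.
lineage = "Bacteria;Firmicutes;Clostridia;Oscillospirales;Ruminococcaceae;uncultured;Example species"
print(assign_after_filtering(lineage)["genus"])  # Example species
print(assign_in_place(lineage)["genus"])         # None
print(assign_in_place(lineage)["species"])       # Example species
```

Filtering first puts the species name in the genus column, while blanking in place leaves the genus empty and the species where it belongs.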

In [7]:
# A list to hold our structured data
parsed_data = []

# Placeholder terms that carry no taxonomic information and should be ignored
DISCARD_RANKS = {'uncultured', 'unidentified', 'metagenome'}

def parse_silva_taxonomy(taxonomy_str):
    """
    Parses a SILVA taxonomy string (e.g., "Bacteria;Firmicutes;...")
    into a dictionary of ranks.
    """
    # Start with a dictionary of empty ranks
    parsed_ranks = {
        'kingdom': None, 'phylum': None, 'class': None, 
        'order': None, 'family': None, 'genus': None, 'species': None
    }
    
    # Split the string by ';' and drop empty fields (e.g. from a trailing ';')
    ranks = [rank.strip() for rank in taxonomy_str.split(';') if rank.strip()]
    
    if not ranks:
        return parsed_ranks # Return empty if nothing is left
        
    # Assign ranks based on their position.
    # Placeholder terms are blanked in place rather than removed beforehand,
    # so a placeholder in the middle of the lineage cannot shift the ranks
    # that follow it into the wrong column.
    # zip() stops at the shorter sequence, so missing trailing ranks stay None.
    # Species often contains two words, but we'll just take the whole string for now
    for rank_name, value in zip(parsed_ranks, ranks):
        if value.lower() not in DISCARD_RANKS:
            parsed_ranks[rank_name] = value
        
    return parsed_ranks

# Loop through our filtered list of records
for record in tqdm(prokaryote_records):
    # The taxonomy string is the part of the description after the first space.
    # partition() returns an empty taxonomy instead of raising if there is no space.
    accession, _, taxonomy_str = record.description.partition(' ')
    
    # Parse the taxonomy string using our function
    taxonomy_dict = parse_silva_taxonomy(taxonomy_str)
    
    # Also store the sequence and its ID
    taxonomy_dict['id'] = record.id
    taxonomy_dict['sequence'] = str(record.seq)
    
    parsed_data.append(taxonomy_dict)

# Convert the list of dictionaries into a pandas DataFrame
df = pd.DataFrame(parsed_data)

# --- Verification Step ---
print(f"✅ Parsing complete. Created a DataFrame with {len(df)} rows.")
print("Here's a preview of the structured data:")
df.head() # Display the first 5 rows of our new table

✅ Parsing complete. Created a DataFrame with 6883 rows.
Here's a preview of the structured data:


| | kingdom | phylum | class | order | family | genus | species | id | sequence |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Pseudomonadaceae | Pseudomonas | Pseudomonas amygdali pv. morsprunorum | AB001445.1.1538 | AACUGAAGAGUUUGAUCAUGGCUCAGAUUGAACGCUGGCGGCAGGC... |
| 1 | Bacteria | Proteobacteria | Gammaproteobacteria | Enterobacterales | Pectobacteriaceae | Dickeya | Dickeya phage phiDP10.3 | KM209255.204.1909 | AGAGUUUGAUCAUGGCUCAGAUUGAACGCUGGCGGCAGGCCUAACA... |
| 2 | Bacteria | Actinobacteriota | Actinobacteria | Actinomycetales | Actinomycetaceae | F0332 | | HL281554.1.1313 | GACGAACGCUGGCGGCGUGCUUAACACAUGCAAGUCGAACGAGUGG... |
| 3 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | Streptococcus equi | AB002515.1.1332 | GCCUAAUACAUGCAAGUUGACGACAGAUGAUACGUAGCUUGCUACA... |
| 4 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | Streptococcus porcinus | AB002523.1.1496 | UCCUGGCUCAGGACGAACGCUGGCGGCGUGCCUAAUACAUGCAAGU... |
