# 18S Data Preparation: Refinement and Development

**Objective:** Refine the data preparation pipeline for the 18S rRNA gene (Eukaryotes) using the SILVA database.

**Methodology:**
1.  Filter the full SILVA database to create a dedicated eukaryote-only file, `SILVA_eukaryotes_only.fasta`. This is a one-time, computationally intensive step.
2.  Draw a small sample (`SILVA_eukaryotes_sample_10k.fasta`) from the eukaryote-only file for rapid development.
3.  Develop and test a robust taxonomy parser for eukaryotic lineages, whose SILVA lineage strings vary in depth and do not map cleanly onto a fixed set of ranks.
4.  Apply the full data cleaning and feature engineering workflow, and save the final artifacts.
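
Step 1 can be sketched as a single streaming pass over the FASTA file. The sketch below is illustrative only: it uses the standard library rather than `Bio.SeqIO`, it assumes SILVA-style headers of the form `>ACCESSION.start.stop Domain;Rank2;...`, and the function names `is_eukaryote_header` and `filter_eukaryotes` are hypothetical, not part of this project.

```python
from pathlib import Path


def is_eukaryote_header(header: str) -> bool:
    """Return True if a SILVA-style FASTA header belongs to Eukaryota.

    Assumption: the header is '>ACCESSION Domain;Rank2;...', so the domain
    is the first semicolon-separated field after the first whitespace.
    """
    parts = header.lstrip(">").split(None, 1)
    if len(parts) < 2:
        return False  # no taxonomy string after the accession
    domain = parts[1].split(";", 1)[0].strip()
    return domain == "Eukaryota"


def filter_eukaryotes(src: Path, dst: Path) -> int:
    """Copy only eukaryote records from src to dst; return how many were kept.

    The file is read line by line, so memory use stays flat regardless of
    the size of the source database.
    """
    kept = 0
    keep_record = False
    with open(src, "r", encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if line.startswith(">"):
                # A new record starts: decide whether to keep it.
                keep_record = is_eukaryote_header(line)
                if keep_record:
                    kept += 1
            if keep_record:
                fout.write(line)
    return kept
```

In this notebook it would be called as `filter_eukaryotes(FULL_SILVA_PATH, EUKARYOTE_ONLY_PATH)`, ideally guarded by a check that the output file does not already exist so the expensive pass runs only once.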

In [1]:
import pandas as pd
import numpy as np
from Bio import SeqIO
from tqdm.auto import tqdm
from pathlib import Path
import sys

# Set up project path.
# Assumes the notebook's working directory is one level below the project root.
project_root = Path.cwd().parent

# --- Verification Step ---
print(f"Project Root: {project_root}")

Project Root: C:\Users\jampa\Music\atlas-v3
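
`Path.cwd().parent` only resolves correctly when the notebook is launched from a directory directly under the project root. A slightly more defensive alternative, shown here as a sketch, walks upward until it finds a marker directory. The helper name `find_project_root` and the choice of `data` as the marker are assumptions made for illustration.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found.

    Raises FileNotFoundError if no ancestor contains the marker directory.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(
        f"No parent of {start} contains a '{marker}' directory"
    )
```

With this helper, `project_root = find_project_root(Path.cwd())` would give the same result whether the notebook is started from the project root or from a nested subdirectory.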


In [2]:
# --- Define Core Directories ---
RAW_DATA_DIR = project_root / "data" / "raw"
PROCESSED_DATA_DIR = project_root / "data" / "processed"

# Create the processed data directory if it doesn't exist
PROCESSED_DATA_DIR.mkdir(parents=True, exist_ok=True)

# --- Define 18S Specific File Paths ---

# Path to the full, original SILVA database file (our main source)
FULL_SILVA_PATH = RAW_DATA_DIR / "SILVA_138.1_SSURef_NR99_tax_silva.fasta"

# Path to the intermediate file we will create, containing ONLY eukaryotes
EUKARYOTE_ONLY_PATH = RAW_DATA_DIR / "SILVA_eukaryotes_only.fasta"

# Path to the small sample file we will create from the eukaryote-only file
SAMPLE_EUKARYOTE_PATH = RAW_DATA_DIR / "SILVA_eukaryotes_sample_10k.fasta"

# --- Verification Step ---
if FULL_SILVA_PATH.exists():
    print("Source SILVA database found.")
    print(f"  - Location: {FULL_SILVA_PATH}")
else:
    print("ERROR: The source SILVA database was not found at the expected location.")
    print(f"  - Expected: {FULL_SILVA_PATH}")
    print("Please download the file and place it in the 'data/raw' directory before proceeding.")

Source SILVA database found.
  - Location: C:\Users\jampa\Music\atlas-v3\data\raw\SILVA_138.1_SSURef_NR99_tax_silva.fasta
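
The sample file at `SAMPLE_EUKARYOTE_PATH` could be drawn from the eukaryote-only file in one streaming pass using reservoir sampling, which picks `k` records uniformly at random without loading the whole file. This is a sketch, not necessarily the method used later in this notebook: the sample size of 10,000 is inferred from the filename, the fixed seed is an assumption added for reproducibility, and the helper names are hypothetical.

```python
import random
from pathlib import Path
from typing import Iterator, List


def iter_fasta_records(path: Path) -> Iterator[str]:
    """Yield each FASTA record (header line plus sequence lines) as one string."""
    record: List[str] = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith(">"):
                if record:
                    yield "".join(record)
                record = [line]
            elif record:
                # Sequence line belonging to the current record.
                record.append(line)
    if record:
        yield "".join(record)


def reservoir_sample(path: Path, k: int, seed: int = 42) -> List[str]:
    """Return up to k records chosen uniformly at random (Algorithm R)."""
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, record in enumerate(iter_fasta_records(path)):
        if i < k:
            reservoir.append(record)
        else:
            # randint is inclusive on both ends, so j is uniform on [0, i].
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir
```

A call such as `reservoir_sample(EUKARYOTE_ONLY_PATH, k=10_000)` returns the records as strings, which can then be written to `SAMPLE_EUKARYOTE_PATH`. If the source file has fewer than `k` records, all of them are returned.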
