<a href="https://colab.research.google.com/github/RobBurnap/Bioinformatics-MICR4203-MICR5203/blob/main/notebooks/taxonomic_codes_ipynb_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Building a table of taxonomic codes for a diverse set of sequences**


BIOINFO4/5203 — Colab Exercise Template

Use this template for every weekly exercise. It standardizes setup, data paths, and the final summary so grading in Canvas is quick.

Workflow

1. Click the "Open in Colab" link in Canvas (it points to this notebook on GitHub).
2. Run the Setup cells (they install dependencies and mount Google Drive).
3. Run the Exercise cells (edit as instructed for each lecture).
4. Verify that the Results Summary prints the values requested by Canvas.
5. File → Print → Save as PDF, then upload the .ipynb and the PDF to Canvas.

Instructor note (delete in student copy if desired):

- Place datasets for this lecture at: Drive → BIOINFO4-5203-F25 → Data → Lxx_topic
- Update the constants in Config below: COURSE_DIR, LECTURE_CODE (e.g., L05), and TOPIC.
- For heavy jobs (trees, assemblies), provide the PETE output files in the same Data folder so students can analyze them here if the queue is busy.



**Auto‑setup + course folder (uses your Teaching path)**

## A. Mount Google Drive and import the libraries needed by the later cells

In [1]:

# Install FIRST, then import
%pip install -q biopython       # Install the Biopython package quietly (-q suppresses most output) so we can work with biological sequence files

from google.colab import drive  # Import the module that lets Colab interact with Google Drive
drive.mount('/content/drive')   # Mount your Google Drive so it appears in Colab's file system under /content/drive

import os, pandas as pd          # Import 'os' for file/directory operations, and pandas for working with data tables
from Bio import SeqIO            # Import SeqIO from Biopython for reading/writing biological sequence files (FASTA, GenBank, etc.)
import matplotlib.pyplot as plt  # Import Matplotlib's plotting library to create figures and graphs

print("✅ Dependencies installed & Drive mounted.")


Mounted at /content/drive
✅ Dependencies installed & Drive mounted.



## B. Course folders: define where input data is loaded from and where output is saved

Edit only `LECTURE_CODE` and `TOPIC` if needed. All inputs live in `Data/<LECTURE_CODE>_<TOPIC>` and all outputs in `Outputs/<LECTURE_CODE>_<TOPIC>`.

**2) Define file structure**

In [5]:

# --- Course folder config (customize LECTURE_CODE/TOPIC only) ---
COURSE_DIR   = "/content/drive/MyDrive/Teaching/BIOINFO4-5203-F25"
LECTURE_CODE = "Taxonomy"            # change per week (e.g., L02, L03, ...)
TOPIC        = "Template"    # short slug for the exercise

# Derived paths (do not change)
DATA_DIR   = f"{COURSE_DIR}/Data/{LECTURE_CODE}_{TOPIC}"
OUTPUT_DIR = f"{COURSE_DIR}/Outputs/{LECTURE_CODE}_{TOPIC}"

# Create folder structure if missing
for p in [f"{COURSE_DIR}/Data", f"{COURSE_DIR}/Outputs", f"{COURSE_DIR}/Notebooks", DATA_DIR, OUTPUT_DIR]:
    os.makedirs(p, exist_ok=True)

print("📁 COURSE_DIR :", COURSE_DIR)
print("📁 DATA_DIR   :", DATA_DIR)
print("📁 OUTPUT_DIR :", OUTPUT_DIR)


📁 COURSE_DIR : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25
📁 DATA_DIR   : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Data/Taxonomy_Template
📁 OUTPUT_DIR : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/Taxonomy_Template
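
The folder convention can be checked without Google Drive. The following is a minimal sketch, assuming a temporary directory stands in for `COURSE_DIR`; `derive_paths` is a hypothetical helper introduced only for illustration, and the `L05`/`Alignment` values are example inputs rather than a real lecture.

```python
import os
import tempfile

def derive_paths(course_dir, lecture_code, topic):
    """Return (data_dir, output_dir) following the <LECTURE_CODE>_<TOPIC> convention."""
    slug = f"{lecture_code}_{topic}"
    data_dir = os.path.join(course_dir, "Data", slug)
    output_dir = os.path.join(course_dir, "Outputs", slug)
    return data_dir, output_dir

# Assumption: a throwaway temp folder replaces the Drive course folder.
course_dir = tempfile.mkdtemp()
data_dir, output_dir = derive_paths(course_dir, "L05", "Alignment")

for p in (data_dir, output_dir):
    os.makedirs(p, exist_ok=True)   # safe to re-run; existing folders are left alone

# Show the paths relative to the course folder.
print(os.path.relpath(data_dir, course_dir))
print(os.path.relpath(output_dir, course_dir))
```

Because `exist_ok=True` is used, re-running the cell after changing only `LECTURE_CODE` or `TOPIC` creates the new pair of folders and leaves earlier ones untouched.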


## Take the file "taxonomic_common_names" and find the taxonomic IDs

**3) Version 3 Aug 25**

In [7]:
# --- Resolve names -> NCBI TaxIDs, add Domain + Phylum, and write mapping + helper files ---
from Bio import Entrez
from pathlib import Path
import pandas as pd, io, re, os, sys, time

# REQUIRED: set your email; an API key is optional but helpful for rate limits
Entrez.email = "you@university.edu"
# Entrez.api_key = "YOUR_NCBI_API_KEY"   # <- optional

# Expect these from your course scaffold cell
assert 'DATA_DIR' in globals() and 'OUTPUT_DIR' in globals(), "Run your course folder setup cell first."
DATA_DIR = Path(DATA_DIR); OUT = Path(OUTPUT_DIR)
OUT.mkdir(parents=True, exist_ok=True)

# ---- Find your names file (one per line; scientific or common names are fine) ----
CANDIDATES = [
    DATA_DIR / "taxonomic_common_names",
    DATA_DIR / "taxonomic_common_names.txt",
    DATA_DIR / "taxonomic_common_names.csv",
    DATA_DIR / "taxonomic_common_names.tsv",
]
NAMES_PATH = next((p for p in CANDIDATES if p.exists()), None)
if not NAMES_PATH:
    raise FileNotFoundError(
        f"Could not find 'taxonomic_common_names' (txt/csv/tsv) in {DATA_DIR}.\n"
        "Create it with one name per line (scientific or common)."
    )
print("📄 Names file:", NAMES_PATH)

# ---- Load names (plain text or first column of CSV/TSV). Lines starting with # are ignored. ----
def load_names(path: Path):
    text = path.read_text(errors="ignore")
    head = "\n".join(text.splitlines()[:5])
    is_table = ("," in head) or ("\t" in head)
    names = []
    if is_table:
        sep = "\t" if "\t" in head else ","
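        # dtype=str keeps numeric TaxIDs as text (no "9606.0"); dropna() skips empty
        # cells so they do not become the literal string "nan".
        df = pd.read_csv(io.StringIO(text), sep=sep, comment="#", header=None, dtype=str)
        col0 = df.columns[0]
        names = [x.strip() for x in df[col0].dropna().tolist() if x.strip()]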
    else:
        for line in text.splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                names.append(line)
    return names

raw_names = load_names(NAMES_PATH)
print(f"📝 Loaded {len(raw_names)} name(s)")

# ---- Helpers to fetch taxonomy and extract lineage ranks ----
def fetch_taxnode_by_id(tid: str):
    h2 = Entrez.efetch(db="taxonomy", id=tid, retmode="xml")
    rec = Entrez.read(h2); h2.close()
    return rec[0] if rec else {}

def lineage_map(node):
    """Return dict rank->scientificName from LineageEx (e.g., {'superkingdom':'Bacteria','phylum':'Proteobacteria',...})."""
    lm = {}
    for x in node.get("LineageEx", []):
        rk = x.get("Rank"); nm = x.get("ScientificName")
        if rk and nm:
            lm[rk] = nm
    return lm

def tax_lookup(name: str):
    """
    Resolve 'name' (scientific, common, or numeric TaxID) →
      (taxid, canonical_name, rank, domain, phylum, match_type, status)
    match_type: SCIN | COMMON | LOOSE | TAXID
    status: 'ok' or reason
    """
    name = name.strip()

    # direct TaxID?
    if re.fullmatch(r"\d+", name):
        try:
            node = fetch_taxnode_by_id(name)
            lm = lineage_map(node)
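            # NCBI lineage records may label the top rank "domain" (newer) or "superkingdom" (older); accept either.
            dom = lm.get("domain") or lm.get("superkingdom", "NA")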
            phylum = lm.get("phylum", "NA")
            return int(name), node.get("ScientificName",""), node.get("Rank",""), dom, phylum, "TAXID", "ok"
        except Exception as e:
            return "", "", "", "NA", "NA", "TAXID", f"efetch_failed:{e}"

    queries = [
        (f"\"{name}\"[SCIN]",        "SCIN"),     # exact scientific name
        (f"\"{name}\"[Common Name]", "COMMON"),   # exact common name
        (name,                       "LOOSE"),    # loose text search
    ]
    for q, tag in queries:
        try:
            h = Entrez.esearch(db="taxonomy", term=q, retmode="xml"); r = Entrez.read(h); h.close()
            if int(r["Count"]) == 0:
                continue
            tid = r["IdList"][0]
            node = fetch_taxnode_by_id(tid)
            lm = lineage_map(node)
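            # NCBI lineage records may label the top rank "domain" (newer) or "superkingdom" (older); accept either.
            dom = lm.get("domain") or lm.get("superkingdom", "NA")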
            phylum = lm.get("phylum", "NA")
            return int(tid), node.get("ScientificName",""), node.get("Rank",""), dom, phylum, tag, "ok"
        except Exception as e:
            sys.stderr.write(f"[warn] lookup '{name}' via {tag}: {e}\n")
            time.sleep(0.15)
            continue
    return "", "", "", "NA", "NA", "NA", "no_match"

# ---- Resolve all entries ----
rows = []
seen = set()
for nm in raw_names:
    key = nm.strip().lower()
    if not key or key in seen:
        continue
    seen.add(key)
    tid, canon, rank, dom, phylum, tag, status = tax_lookup(nm)
    rows.append({
        "input_name": nm,
        "taxid": tid,
        "organism_name": canon,      # explicit column as requested (canonical scientific name)
        "canonical_name": canon,     # keep for continuity with earlier files
        "rank": rank,
        "domain": dom,
        "phylum": phylum,
        "match_type": tag,
        "status": status
    })
    time.sleep(0.25)  # be polite to NCBI; increase if you hit rate limits

df = pd.DataFrame(rows)
df = df.sort_values(
    ["status","domain","phylum","organism_name","input_name"],
    na_position="last"
).reset_index(drop=True)

# ---- Write outputs ----
map_csv       = OUT / "taxon_map.csv"                # full mapping table (with organism_name + phylum)
ids_txt       = OUT / "taxids_list.txt"              # one TaxID per line
entrez_txt    = OUT / "entrez_query.txt"             # txidXXX[ORGN] OR ...
gui_taxids    = OUT / "blast_gui_taxids.csv"         # comma-separated numeric IDs for GUI (enter one-by-one via "Add organism")
gui_names_txt = OUT / "blast_gui_names.txt"          # comma-separated canonical names (GUI-friendly labels)

df.to_csv(map_csv, index=False)

valid_taxids = [str(t) for t in df["taxid"].tolist() if str(t).isdigit()]
ids_txt.write_text("\n".join(valid_taxids))
entrez_txt.write_text(" OR ".join([f"txid{t}[ORGN]" for t in valid_taxids]))

# For the web GUI: names and IDs as comma-separated lines (note: GUI prefers adding names one at a time)
canon_names = [n for n in df["organism_name"].tolist() if n]
gui_taxids.write_text(", ".join(valid_taxids))
gui_names_txt.write_text(", ".join(canon_names))

print("🗺  Map CSV :", map_csv)
print("🧬 TaxIDs  :", ids_txt)
print("🔎 ENTREQ  :", entrez_txt)
print("🧩 GUI IDs :", gui_taxids)
print("🧩 GUI names:", gui_names_txt)

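# ---- Summary (resolved/unresolved are counted after case-insensitive de-duplication, so they can sum to less than the total read) ----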
print("\n=== Summary ===")
print("Total names read :", len(raw_names))
print("Resolved (ok)    :", int((df["status"]=="ok").sum()))
print("Unresolved       :", int((df["status"]!="ok").sum()))
if (df["status"]!="ok").any():
    print(df[df["status"]!="ok"][["input_name","status"]].head(12))

📄 Names file: /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Data/Taxonomy_Template/taxonomic_common_names.txt
📝 Loaded 127 name(s)
🗺  Map CSV : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/Taxonomy_Template/taxon_map.csv
🧬 TaxIDs  : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/Taxonomy_Template/taxids_list.txt
🔎 ENTREQ  : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/Taxonomy_Template/entrez_query.txt
🧩 GUI IDs : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/Taxonomy_Template/blast_gui_taxids.csv
🧩 GUI names: /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/Taxonomy_Template/blast_gui_names.txt

=== Summary ===
Total names read : 127
Resolved (ok)    : 123
Unresolved       : 1
                                  input_name    status
0  Candidatus Syntrophoarchaeum butanivorans  no_match
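
The `entrez_query.txt` file is built by joining every numeric TaxID into a `txidN[ORGN] OR ...` string, skipping rows that did not resolve. The sketch below reproduces that step offline on a toy mapping table. It is an illustration under stated assumptions, not the cell's exact code: `build_entrez_filter` is a hypothetical helper, the table is a hand-written stand-in for `taxon_map.csv`, and the last row is an invented unresolved entry.

```python
import csv
import io

# Toy stand-in for taxon_map.csv. 9606 and 562 are the NCBI TaxIDs for
# Homo sapiens and Escherichia coli; the last row mimics a 'no_match' result.
toy_map = """input_name,taxid,organism_name,status
human,9606,Homo sapiens,ok
E. coli,562,Escherichia coli,ok
not a real organism,,,no_match
"""

def build_entrez_filter(csv_text):
    """Join every numeric TaxID into 'txidN[ORGN] OR txidM[ORGN] ...'."""
    rows = csv.DictReader(io.StringIO(csv_text))
    # Unresolved rows have an empty taxid, which fails isdigit() and is skipped.
    taxids = [row["taxid"] for row in rows if row["taxid"].isdigit()]
    return " OR ".join(f"txid{t}[ORGN]" for t in taxids)

entrez_filter = build_entrez_filter(toy_map)
print(entrez_filter)  # txid9606[ORGN] OR txid562[ORGN]
```

Filtering on `isdigit()` means a name that failed to resolve never reaches the query string, so it is worth checking the unresolved list in the summary before relying on the filter.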


# **FASTA Header Manipulation**

In [12]:
# --- Rewrite FASTA headers to ">Phylum Scientific_name" using TaxID lookups ---
from Bio import Entrez, SeqIO
from pathlib import Path
import re, sys, time, csv

# ====== REQUIRED: set your email (and optionally API key) ======
Entrez.email = "you@university.edu"     # <-- put a real email here
# Entrez.api_key = "YOUR_NCBI_API_KEY"  # optional

# ====== I/O paths ======
# If your course scaffold defined these, we’ll use them; else fall back to local files.
IN_FASTA  = None
OUT_FASTA = None
if 'DATA_DIR' in globals() and 'OUTPUT_DIR' in globals():
    DATA_DIR  = Path(DATA_DIR)
    OUTPUT_DIR = Path(OUTPUT_DIR); OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    IN_FASTA  = IN_FASTA  or (OUTPUT_DIR / "per_taxid_top_hits.fasta")
    OUT_FASTA = OUT_FASTA or (OUTPUT_DIR / "per_taxid_top_hits.phylum_name.fasta")
else:
    # Fallback: edit these two lines if not running in the course scaffold
    IN_FASTA  = Path("per_taxid_top_hits.fasta")
    OUT_FASTA = Path("per_taxid_top_hits.phylum_name.fasta")

CACHE_CSV = OUT_FASTA.with_suffix(".taxid_map.csv")   # saves a small map for reference

# ====== utils ======
_tax_cache = {}  # tid -> (phylum, scientific_name)

def fetch_taxnode_by_id(tid: str):
    """Return taxonomy record (dict) for a TaxID."""
    h = Entrez.efetch(db="taxonomy", id=str(tid), retmode="xml")
    rec = Entrez.read(h); h.close()
    return rec[0] if rec else {}

def lineage_rank(node, rank_name):
    """Get lineage scientific name at given rank (e.g., 'phylum', 'superkingdom')."""
    for x in node.get("LineageEx", []):
        if x.get("Rank") == rank_name:
            return x.get("ScientificName")
    return None

def phylum_and_name_for_taxid(tid: str):
    """Resolve TaxID -> (phylum, canonical scientific name). Cached."""
    if tid in _tax_cache:
        return _tax_cache[tid]
    try:
        node = fetch_taxnode_by_id(tid)
        phylum = lineage_rank(node, "phylum") or "NA"
        sci    = node.get("ScientificName", "Unknown")
        _tax_cache[tid] = (phylum, sci)
        return phylum, sci
    except Exception as e:
        sys.stderr.write(f"[warn] taxid {tid} lookup failed: {e}\n")
        _tax_cache[tid] = ("NA", "Unknown")
        return "NA", "Unknown"

def extract_taxid_from_header(header: str):
    """Find taxid:#### in a FASTA header; return digits or None."""
    m = re.search(r"taxid:(\d+)", header)
    return m.group(1) if m else None

# ====== main ======
if not IN_FASTA.exists():
    print(f"❌ Error: Input FASTA not found at {IN_FASTA}")
    print("Please ensure the file exists or is generated by a previous step.")
    raise FileNotFoundError(f"Input FASTA not found: {IN_FASTA}")

records_in  = list(SeqIO.parse(str(IN_FASTA), "fasta"))
if not records_in:
    raise ValueError(f"No records found in {IN_FASTA}")

# Resolve all unique TaxIDs first (polite rate limiting)
taxids = []
for r in records_in:
    tid = extract_taxid_from_header(r.description)
    if tid:
        taxids.append(tid)
uniq_taxids = sorted(set(taxids), key=int)

print(f"🔎 Found {len(uniq_taxids)} unique TaxIDs in {IN_FASTA.name}")
for i, tid in enumerate(uniq_taxids, 1):
    ph, nm = phylum_and_name_for_taxid(tid)
    if i % 5 == 0:    # gentle pacing
        time.sleep(0.2)

# Rewrite headers and write output FASTA
written = 0
with open(OUT_FASTA, "w") as out_fa:
    for r in records_in:
        tid = extract_taxid_from_header(r.description)
        if not tid:
            ph, nm = "NA", "Unknown"
        else:
            ph, nm = phylum_and_name_for_taxid(tid)
        # EXACT requested format:
        r.id = f"{ph} {nm}"
        r.name = ""
        r.description = ""   # keep header clean as just ">Phylum Scientific_name"
        SeqIO.write(r, out_fa, "fasta")
        written += 1

# Save a small mapping file for audit/reference
with open(CACHE_CSV, "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["taxid","phylum","scientific_name"])
    for tid in uniq_taxids:
        ph, nm = _tax_cache.get(tid, ("NA","Unknown"))
        w.writerow([tid, ph, nm])

print(f"✅ Wrote {written} sequences to: {OUT_FASTA}")
print(f"🗺  Saved TaxID→(phylum,name) map: {CACHE_CSV}")

# ---- Optional note on duplicate headers ----
if len(uniq_taxids) < len(records_in):
    # Many seqs may collapse to identical headers if same species; that’s okay for MSA tools that only need sequences.
    # If you need unique headers, uncomment the two lines below to append an incrementing number.
    # for k, rec in enumerate(SeqIO.parse(str(OUT_FASTA), "fasta"), 1): ...
    pass

🔎 Found 47 unique TaxIDs in per_taxid_top_hits.fasta
✅ Wrote 77 sequences to: /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/Taxonomy_Template/per_taxid_top_hits.phylum_name.fasta
🗺  Saved TaxID→(phylum,name) map: /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/Taxonomy_Template/per_taxid_top_hits.phylum_name.taxid_map.csv
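
In the run above, 77 sequences map to 47 TaxIDs, so several records end up with identical `>Phylum Scientific_name` headers. The following is a minimal sketch of one way to make headers unique, not the notebook's own method: `extract_taxid` mirrors the `taxid:(\d+)` pattern used in the cell, while `uniquify` is a hypothetical helper that appends `_1`, `_2`, ... only to headers that repeat. The example header and accession are invented for illustration.

```python
import re
from collections import Counter

def extract_taxid(header):
    """Return the digits after 'taxid:' in a FASTA header, or None if absent."""
    m = re.search(r"taxid:(\d+)", header)
    return m.group(1) if m else None

def uniquify(headers):
    """Append a running number to every header that occurs more than once."""
    totals = Counter(headers)   # how often each header appears overall
    seen = Counter()            # how many copies have been emitted so far
    out = []
    for h in headers:
        if totals[h] > 1:
            seen[h] += 1
            out.append(f"{h}_{seen[h]}")
        else:
            out.append(h)       # unique headers are left unchanged
    return out

# Hypothetical header in the style the cell expects.
taxid = extract_taxid("WP_000000001.1 toy protein taxid:562")

headers = [
    "Pseudomonadota Escherichia coli",
    "Pseudomonadota Escherichia coli",
    "Chordata Homo sapiens",
]
unique_headers = uniquify(headers)
print(taxid)
print(unique_headers)
```

Many alignment and tree programs treat only the text before the first space as the sequence ID, so a variation worth considering is replacing spaces with underscores before writing the FASTA.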


**Merge FASTA files and rewrite headers**

In [13]:
# --- Merge 1+ FASTAs, resolve TaxID for each seq (from header or by accession),
# --- then rewrite headers to ">Phylum Scientific_name" and write *all* records.
from Bio import Entrez, SeqIO
from pathlib import Path
import re, sys, time, csv, hashlib

# ====== REQUIRED ======
Entrez.email = "you@university.edu"      # put a real email here
# Entrez.api_key = "YOUR_NCBI_API_KEY"   # optional but recommended

# ====== Inputs ======
# If your course scaffold exists, we auto-pick common files. You can add more.
IN_FASTAS = []
if 'DATA_DIR' in globals() and 'OUTPUT_DIR' in globals():
    DATA_DIR   = Path(DATA_DIR)
    OUTPUT_DIR = Path(OUTPUT_DIR); OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    # Add any that exist:
    for cand in [
        OUTPUT_DIR / "per_taxid_top_hits.fasta",
        DATA_DIR   / "blast_hits.fasta",          # from earlier XML→FASTA step
        DATA_DIR   / "query_proteins.fasta",      # in case you want to include the query too
    ]:
        if cand.exists(): IN_FASTAS.append(cand)
    OUT_FASTA = OUTPUT_DIR / "all_hits.phylum_name.fasta"
else:
    # Fallback: edit these paths as needed
    IN_FASTAS = [Path("per_taxid_top_hits.fasta"), Path("blast_hits.fasta")]
    OUT_FASTA = Path("all_hits.phylum_name.fasta")

if not IN_FASTAS:
    raise FileNotFoundError("No input FASTA files found. Add paths to IN_FASTAS.")

MAP_CSV = OUT_FASTA.with_suffix(".taxid_map.csv")

# ====== helpers ======
_tax_cache = {}   # tid -> (phylum, sci)
_acc2tid   = {}   # accession -> taxid

DNA = set("ACGTNUWSMKRYBDHV-")

def extract_taxid_from_header(header: str):
    m = re.search(r"taxid:(\d+)", header)
    return m.group(1) if m else None

def first_accession_token(header: str):
    """
    Try to pull an accession token from the header (e.g., 'WP_012345678.1', 'XOB97566.1').
    We use first whitespace-delimited token that looks like an accession.
    """
    parts = header.split()
    tok = parts[0] if parts else ""   # an empty header has no token to test
    # (very loose) accession shape: letters/underscore + digits + optional .version
    if re.match(r"^[A-Za-z]{1,6}[_]?\d+(?:\.\d+)?$", tok):
        return tok
    return None

def fetch_taxnode_by_id(tid: str):
    h = Entrez.efetch(db="taxonomy", id=str(tid), retmode="xml")
    rec = Entrez.read(h); h.close()
    return rec[0] if rec else {}

def lineage_rank(node, rank_name):
    for x in node.get("LineageEx", []):
        if x.get("Rank") == rank_name:
            return x.get("ScientificName")
    return None

def phylum_and_name_for_taxid(tid: str):
    if tid in _tax_cache:
        return _tax_cache[tid]
    try:
        node = fetch_taxnode_by_id(tid)
        ph = lineage_rank(node, "phylum") or "NA"
        nm = node.get("ScientificName", "Unknown")
        _tax_cache[tid] = (ph, nm)
        return ph, nm
    except Exception as e:
        sys.stderr.write(f"[warn] taxid {tid} lookup failed: {e}\n")
        _tax_cache[tid] = ("NA", "Unknown")
        return "NA", "Unknown"

def taxid_from_accession(acc: str):
    """Resolve protein accession → TaxID via esummary (cached)."""
    if not acc: return None
    if acc in _acc2tid: return _acc2tid[acc]
    try:
        h = Entrez.esummary(db="protein", id=acc, retmode="xml")
        s = Entrez.read(h); h.close()
        # esummary returns a list of DocSums; TaxId often in 'TaxId' or 'TaxID'
        if s and "DocSum" in s[0]:
            for item in s[0]["DocSum"]["Item"]:
                if item.attributes.get("Name") in ("TaxId","TaxID"):
                    tid = str(item.value)
                    _acc2tid[acc] = tid
                    return tid
        # Alternate path: esummary REST sometimes packs fields differently; try generic keys
        if s and isinstance(s[0], dict):
            tid = s[0].get("TaxId") or s[0].get("TaxID")
            if tid:
                tid = str(tid)
                _acc2tid[acc] = tid
                return tid
    except Exception as e:
        sys.stderr.write(f"[warn] esummary failed for {acc}: {e}\n")
    _acc2tid[acc] = None
    return None

def seq_hash(seq):
    return hashlib.sha256(seq.encode("ascii", "ignore")).hexdigest()

# ====== collect records from all inputs; deduplicate ======
all_records = []
for p in IN_FASTAS:
    if not p.exists(): continue
    all_records.extend(list(SeqIO.parse(str(p), "fasta")))

if not all_records:
    raise ValueError("No sequences found in input FASTAs.")

# Deduplicate primarily by accession if present, else by sequence hash
seen_keys = set()
kept = []
for r in all_records:
    header = r.description or r.id
    acc = first_accession_token(header)
    key = ("acc", acc) if acc else ("seq", seq_hash(str(r.seq)))
    if key in seen_keys:
        continue
    seen_keys.add(key)
    kept.append(r)

print(f"📦 Loaded {len(all_records)} records from {len(IN_FASTAS)} file(s); kept {len(kept)} after dedup.")

# ====== resolve TaxID for each kept record ======
uniq_taxids = set()
for r in kept:
    header = r.description or r.id
    tid = extract_taxid_from_header(header)
    if not tid:
        acc = first_accession_token(header)
        tid = taxid_from_accession(acc)
        # brief pause after each accession lookup
        time.sleep(0.1)
    r.annotations["taxid"] = tid
    if tid: uniq_taxids.add(tid)

print(f"🔎 Resolved TaxID for {sum(1 for r in kept if r.annotations['taxid'])} / {len(kept)} records.")

# Pre-fetch phylum/name for all TaxIDs we have
for i, tid in enumerate(sorted(uniq_taxids, key=int)):
    phylum_and_name_for_taxid(tid)
    if (i+1) % 5 == 0:
        time.sleep(0.2)

# ====== write output with requested headers ======
written = 0
with open(OUT_FASTA, "w") as out_fa, open(MAP_CSV, "w", newline="") as mapf:
    w = csv.writer(mapf)
    w.writerow(["accession","taxid","phylum","scientific_name"])
    for r in kept:
        header = r.description or r.id
        acc = first_accession_token(header) or ""
        tid = r.annotations.get("taxid")
        if tid:
            ph, nm = phylum_and_name_for_taxid(tid)
        else:
            ph, nm = "NA", "Unknown"
        # EXACT requested format:
        r.id = f"{ph} {nm}"
        r.name = ""
        r.description = ""
        # If you need unique headers, uncomment the next line:
        # r.id = f"{ph} {nm} | acc:{acc or 'NA'}"
        SeqIO.write(r, out_fa, "fasta")
        w.writerow([acc, tid or "", ph, nm])
        written += 1

print(f"✅ Wrote {written} sequences to: {OUT_FASTA}")
print(f"🗺  Wrote TaxID→(phylum,name) map: {MAP_CSV}")

📦 Loaded 77 records from 1 file(s); kept 76 after dedup.
🔎 Resolved TaxID for 76 / 76 records.
✅ Wrote 76 sequences to: /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/Taxonomy_Template/all_hits.phylum_name.fasta
🗺  Wrote TaxID→(phylum,name) map: /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/Taxonomy_Template/all_hits.phylum_name.taxid_map.csv
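
The de-duplication rule used above (accession first, sequence hash as a fallback) can be tried on toy data without Biopython. This is a minimal sketch under stated assumptions: `dedup_key` is a hypothetical helper that combines the cell's accession pattern and SHA-256 hash into one function, and the headers, accessions, and sequences are invented.

```python
import hashlib
import re

# Same loose accession shape as in the cell: letters, optional underscore,
# digits, optional ".version".
ACCESSION_RE = re.compile(r"^[A-Za-z]{1,6}_?\d+(?:\.\d+)?$")

def dedup_key(header, seq):
    """Key a record by accession when the first token looks like one, else by sequence hash."""
    parts = header.split()
    tok = parts[0] if parts else ""
    if ACCESSION_RE.match(tok):
        return ("acc", tok)
    return ("seq", hashlib.sha256(seq.encode("ascii", "ignore")).hexdigest())

records = [
    ("WP_000000001.1 toy protein", "MKT"),
    ("WP_000000001.1 toy protein again", "MKTA"),  # same accession -> dropped
    ("sample one", "MAAA"),
    ("sample two", "MAAA"),                         # no accession, same sequence -> dropped
    ("sample three", "MCCC"),
]

seen_keys = set()
kept = []
for header, seq in records:
    key = dedup_key(header, seq)
    if key in seen_keys:
        continue            # an earlier record already claimed this key
    seen_keys.add(key)
    kept.append((header, seq))

print(len(records), "records in,", len(kept), "kept")  # 5 records in, 3 kept
```

Note the asymmetry in this rule: two records with different accessions but identical sequences are both kept, because the sequence hash is only consulted when no accession is found.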
