<a href="https://colab.research.google.com/github/RobBurnap/Bioinformatics-MICR4203-MICR5203/blob/main/notebooks/taxonomic_codes_ipynb_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Building a table of NCBI taxonomy IDs for a diverse set of sequences**


BIOINFO4/5203 — Colab Exercise Template

Use this template for every weekly exercise. It standardizes setup, data paths, and the final summary so grading in Canvas is quick.

**Workflow**

1. Click the "Open in Colab" link in Canvas (points to this notebook in GitHub).
2. Run the Setup cells (they install packages and mount Google Drive).
3. Run the Exercise cells (edit as instructed for each lecture).
4. Verify that the Results Summary prints the values requested by Canvas.
5. File → Print → Save as PDF, then upload the .ipynb and the PDF to Canvas.

Instructor note (delete in student copy if desired):

- Place datasets for this lecture at: Drive → BIOINFO4-5203-F25 → Data → Lxx_topic
- Update the constants in Config below: COURSE_DIR, LECTURE_CODE (e.g., L05), and TOPIC.
- For heavy jobs (trees, assemblies), provide the PETE output files in the same Data folder so students can analyze them here if the queue is busy.



**Auto‑setup + course folder (uses your Teaching path)**

## A. Mount Google Drive and import the libraries needed by later cells

In [1]:

# Install FIRST, then import
%pip install -q biopython       # Install the Biopython package quietly (-q suppresses most output) so we can work with biological sequence files

from google.colab import drive  # Import the module that lets Colab interact with Google Drive
drive.mount('/content/drive')   # Mount your Google Drive so it appears in Colab's file system under /content/drive

import os, pandas as pd          # Import 'os' for file/directory operations, and pandas for working with data tables
from Bio import SeqIO            # Import SeqIO from Biopython for reading/writing biological sequence files (FASTA, GenBank, etc.)
import matplotlib.pyplot as plt  # Import Matplotlib's plotting library to create figures and graphs

print("✅ Dependencies installed & Drive mounted.")


[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.3/3.3 MB[0m [31m25.6 MB/s[0m eta [36m0:00:00[0m
[?25hMounted at /content/drive
✅ Dependencies installed & Drive mounted.



## B. Course folders: define where input data are read from and where outputs are saved

Edit only `LECTURE_CODE` and `TOPIC` if needed. All inputs live in `Data/<LECTURE_CODE>_<TOPIC>` and all outputs in `Outputs/<LECTURE_CODE>_<TOPIC>`.

**2) Define file structure**

In [2]:

# --- Course folder config (customize LECTURE_CODE/TOPIC only) ---
COURSE_DIR   = "/content/drive/MyDrive/Teaching/BIOINFO4-5203-F25"
LECTURE_CODE = "Taxonomy"            # change per week (e.g., L02, L03, ...)
TOPIC        = "Template"    # short slug for the exercise

# Derived paths (do not change)
DATA_DIR   = f"{COURSE_DIR}/Data/{LECTURE_CODE}_{TOPIC}"
OUTPUT_DIR = f"{COURSE_DIR}/Outputs/{LECTURE_CODE}_{TOPIC}"

# Create folder structure if missing
for p in [f"{COURSE_DIR}/Data", f"{COURSE_DIR}/Outputs", f"{COURSE_DIR}/Notebooks", DATA_DIR, OUTPUT_DIR]:
    os.makedirs(p, exist_ok=True)

print("📁 COURSE_DIR :", COURSE_DIR)
print("📁 DATA_DIR   :", DATA_DIR)
print("📁 OUTPUT_DIR :", OUTPUT_DIR)


📁 COURSE_DIR : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25
📁 DATA_DIR   : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Data/Taxonomy_Template
📁 OUTPUT_DIR : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/Taxonomy_Template
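
A quick way to confirm that the inputs landed where `DATA_DIR` points is to list the files in that folder. The sketch below assumes a hypothetical helper, `list_inputs`, and runs against a temporary folder so it works without Drive; in the notebook you would pass `DATA_DIR` instead.

```python
from pathlib import Path
import tempfile

def list_inputs(data_dir):
    """Return the sorted names of files found directly in data_dir."""
    return sorted(p.name for p in Path(data_dir).iterdir() if p.is_file())

# Demo on a throwaway folder (assumption: stands in for DATA_DIR)
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "taxonomic_common_names.txt").write_text("Homo sapiens\n")
    found = list_inputs(tmp)
    print(found)
```

An empty list would suggest the dataset has not been copied into the lecture's Data folder yet.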


## Read the names file "taxonomic_common_names" and look up NCBI taxonomy IDs

**3) Version 3, Aug 25**
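
The lookup cell expects a names file with one scientific or common name per line. Lines starting with `#` are skipped, and a name that repeats (ignoring case) is looked up only once. A minimal offline sketch of that parsing follows; the helper `parse_names` and the sample text are illustrative assumptions, not part of the notebook.

```python
def parse_names(text):
    """Return unique, non-comment names in first-seen order (case-insensitive)."""
    names, seen = [], set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key = line.lower()
        if key not in seen:
            seen.add(key)
            names.append(line)
    return names

# Hypothetical file contents, typed by hand for illustration
sample = """# example panel
Homo sapiens
Escherichia coli

homo sapiens
zebrafish
"""
names = parse_names(sample)
print(names)
```

Here the second, lower-case "homo sapiens" is dropped as a duplicate, so three names remain.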

In [4]:
# --- Build panel from your full names file & write CSV + TXT outputs (with eukaryote subgroup + common name) ---
from Bio import Entrez
from pathlib import Path
import pandas as pd, io, re, os, sys, time

# REQUIRED: replace with your own email address (NCBI asks for one with E-utilities requests)
Entrez.email = "you@university.edu"
# Entrez.api_key = "YOUR_NCBI_API_KEY"  # optional

assert 'DATA_DIR' in globals() and 'OUTPUT_DIR' in globals(), "Run your course folder setup cell first."
DATA_DIR = Path(DATA_DIR); OUT = Path(OUTPUT_DIR)
OUT.mkdir(parents=True, exist_ok=True)

# Locate the master names file
CANDIDATES = [
    DATA_DIR / "taxonomic_common_names",
    DATA_DIR / "taxonomic_common_names.txt",
    DATA_DIR / "taxonomic_common_names.csv",
    DATA_DIR / "taxonomic_common_names.tsv",
]
NAMES_PATH = next((p for p in CANDIDATES if p.exists()), None)
if not NAMES_PATH:
    raise FileNotFoundError(
        f"Could not find 'taxonomic_common_names' (no extension, .txt, .csv, or .tsv) in {DATA_DIR}.\n"
        "Create it with one name per line (scientific or common)."
    )
print("📄 Names file:", NAMES_PATH)

def load_names(path: Path):
    text = path.read_text(errors="ignore")
    head = "\n".join(text.splitlines()[:5])
    is_table = ("," in head) or ("\t" in head)
    names = []
    if is_table:
        sep = "\t" if "\t" in head else ","
        df = pd.read_csv(io.StringIO(text), sep=sep, comment="#", header=None)
        names = [str(x).strip() for x in df.iloc[:,0].tolist() if str(x).strip()]
    else:
        for line in text.splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                names.append(line)
    return names

raw_names = load_names(NAMES_PATH)
print(f"📝 Loaded {len(raw_names)} name(s)")

def fetch_taxnode_by_id(tid: str):
    h2 = Entrez.efetch(db="taxonomy", id=tid, retmode="xml")
    rec = Entrez.read(h2); h2.close()
    return rec[0] if rec else {}

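def extract_common_name(node: dict) -> str:
    # Entrez may return a common name as a string or as a list; normalise to one string
    def first(v) -> str:
        if isinstance(v, (list, tuple)):
            return str(v[0]) if v else ""
        return str(v) if v else ""

    other = node.get("OtherNames") or {}
    # Try the common-name fields on the node itself, then under OtherNames
    for src in (node, other):
        for k in ("GenbankCommonName", "CommonName"):
            cn = first(src.get(k))
            if cn:
                return cn
    # Fall back to the first 'Name' entry classed as a (genbank) common name
    for nm in other.get("Name") or []:
        if "common name" in str(nm.get("ClassCDE", "")).lower():
            return str(nm.get("DispName", ""))
    return ""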

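def infer_superkingdom(node: dict) -> str:
    # The top rank has been labelled "superkingdom"; newer NCBI taxonomy releases may label it "domain"
    return next((x["ScientificName"] for x in node.get("LineageEx", [])
                 if x.get("Rank") in ("superkingdom", "domain")), "NA")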

def infer_euk_group(node: dict) -> str:
    """
    Coarse eukaryote grouping for teaching:
      Mammal, Bird/Reptile/Fish, Plant, Fungi, Protist
    Non-eukaryotes -> ''
    """
    lineage = [x["ScientificName"] for x in node.get("LineageEx", [])]
    lineage_set = set(s.lower() for s in lineage)

    if "eukaryota" not in lineage_set:
        return ""

    # Animals (Metazoa)
    if "metazoa" in lineage_set:
        # Vertebrates and subclades → mammals, birds, reptiles, fish lumped
        verte_keys = {"vertebrata", "gnathostomata", "tetrapoda", "mammalia", "aves", "reptilia", "actinopterygii"}
        if any(k in lineage_set for k in verte_keys):
            # Make mammals explicit when possible
            if "mammalia" in lineage_set:
                return "Mammal"
            return "Bird/Reptile/Fish"
        return "Protist"  # teaching simplification: non-vertebrate animals (rare in this list) share the "Protist" bucket, although they are Metazoa

    # Plants
    if "viridiplantae" in lineage_set or "streptophyta" in lineage_set or "embryophyta" in lineage_set:
        return "Plant"

    # Fungi
    if "fungi" in lineage_set:
        return "Fungi"

    # Large protist lineages
    protist_keys = {
        "amoebozoa","sar","stramenopiles","alveolata","rhizaria","discoba","excavata",
        "haptophyta","cryptophyta","choanoflagellida","euglenozoa","apicomplexa","ciliophora"
    }
    if any(k in lineage_set for k in protist_keys):
        return "Protist"

    return "Protist"  # default for eukaryotes we didn’t categorize above

def tax_lookup(name: str):
    """
    Resolve 'name' → (taxid, canonical, rank, superkingdom, euk_group, common_name, match_type, status)
    """
    name = name.strip()

    # Direct TaxID
    if re.fullmatch(r"\d+", name):
        try:
            node = fetch_taxnode_by_id(name)
            return (int(name),
                    node.get("ScientificName",""),
                    node.get("Rank",""),
                    infer_superkingdom(node),
                    infer_euk_group(node),
                    extract_common_name(node),
                    "TAXID", "ok")
        except Exception as e:
            return "", "", "", "NA", "", "", "TAXID", f"efetch_failed:{e}"

    queries = [
        (f"\"{name}\"[SCIN]", "SCIN"),
        (f"\"{name}\"[Common Name]", "COMMON"),
        (name, "LOOSE"),
    ]
    for q, tag in queries:
        try:
            h = Entrez.esearch(db="taxonomy", term=q, retmode="xml"); r = Entrez.read(h); h.close()
            if int(r["Count"]) == 0:
                continue
            tid = r["IdList"][0]
            node = fetch_taxnode_by_id(tid)
            return (int(tid),
                    node.get("ScientificName",""),
                    node.get("Rank",""),
                    infer_superkingdom(node),
                    infer_euk_group(node),
                    extract_common_name(node),
                    tag, "ok")
        except Exception as e:
            sys.stderr.write(f"[warn] lookup '{name}' via {tag}: {e}\n")
            time.sleep(0.15)
            continue
    return "", "", "", "NA", "", "", "NA", "no_match"

rows = []
seen = set()
for nm in raw_names:
    key = nm.strip().lower()
    if not key or key in seen:
        continue
    seen.add(key)
    tid, canon, rank, dom, egrp, common, tag, status = tax_lookup(nm)
    rows.append({
        "input_name": nm,
        "taxid": tid,
        "canonical_name": canon,
        "common_name": common,
        "rank": rank,
        "superkingdom": dom,     # Bacteria / Archaea / Eukaryota
        "euk_group": egrp,       # Mammal / Bird/Reptile/Fish / Plant / Fungi / Protist / ''
        "match_type": tag,
        "status": status
    })
    time.sleep(0.25)

df = pd.DataFrame(rows).sort_values(
    ["status","superkingdom","euk_group","canonical_name","input_name"],
    na_position="last"
).reset_index(drop=True)

# Write outputs
map_csv   = OUT / "taxon_map.csv"
ids_txt   = OUT / "taxids_list.txt"
entrez_txt= OUT / "entrez_query.txt"

df.to_csv(map_csv, index=False)
valid_taxids = [str(t) for t in df["taxid"].tolist() if str(t).isdigit()]
ids_txt.write_text("\n".join(valid_taxids))
entrez_txt.write_text(" OR ".join([f"txid{t}[ORGN]" for t in valid_taxids]))

print("🗺  Map CSV :", map_csv)
print("🧬 TaxIDs  :", ids_txt)
print("🔎 ENTREQ  :", entrez_txt)

print("\n=== Summary ===")
print("Total names read :", len(raw_names))
print("Resolved (ok)    :", (df["status"]=="ok").sum())
print("Unresolved       :", (df["status"]!="ok").sum())
if (df["status"]!="ok").any():
    display(df[df["status"]!="ok"][["input_name","status"]].head(12))

📄 Names file: /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Data/Taxonomy_Template/taxonomic_common_names.txt
📝 Loaded 127 name(s)
🗺  Map CSV : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/Taxonomy_Template/taxon_map.csv
🧬 TaxIDs  : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/Taxonomy_Template/taxids_list.txt
🔎 ENTREQ  : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/Taxonomy_Template/entrez_query.txt

=== Summary ===
Total names read : 127
Resolved (ok)    : 123
Unresolved       : 1


| | input_name | status |
|---|---|---|
| 0 | Candidatus Syntrophoarchaeum butanivorans | no_match |
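
The coarse grouping done by `infer_euk_group` depends only on the names in a taxon's lineage, so it can be checked without calling NCBI. The sketch below is a simplified stand-in (the function `coarse_group` is hypothetical), applied to abbreviated lineages typed by hand rather than fetched.

```python
def coarse_group(lineage):
    """Map a list of lineage names to the notebook's coarse eukaryote buckets."""
    names = {s.lower() for s in lineage}
    if "eukaryota" not in names:
        return ""                    # Bacteria / Archaea get no subgroup
    if "mammalia" in names:
        return "Mammal"
    if "vertebrata" in names:
        return "Bird/Reptile/Fish"
    if "viridiplantae" in names:
        return "Plant"
    if "fungi" in names:
        return "Fungi"
    return "Protist"                 # catch-all, as in the notebook

# Abbreviated lineages, hand-typed for illustration (assumption: not NCBI output)
examples = {
    "Homo sapiens": ["Eukaryota", "Metazoa", "Chordata", "Vertebrata", "Mammalia"],
    "Danio rerio": ["Eukaryota", "Metazoa", "Chordata", "Vertebrata", "Actinopterygii"],
    "Arabidopsis thaliana": ["Eukaryota", "Viridiplantae", "Streptophyta"],
    "Saccharomyces cerevisiae": ["Eukaryota", "Fungi", "Ascomycota"],
    "Escherichia coli": ["Bacteria", "Pseudomonadota"],
}
groups = {name: coarse_group(lin) for name, lin in examples.items()}
for name, grp in groups.items():
    print(f"{name:28s} -> {grp or '(none)'}")
```

Because "Protist" is the catch-all, an invertebrate animal would also land in that bucket, which is worth keeping in mind when reading the eukaryote counts.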


**HTML generator**

In [4]:
# --- Rebuild HTML table with filter + download links + subgroup bullets ---
from pathlib import Path
import pandas as pd

assert 'OUTPUT_DIR' in globals(), "Run your course folder setup cell first."
OUT = Path(OUTPUT_DIR)

csv_path    = OUT / "taxon_map.csv"
ids_path    = OUT / "taxids_list.txt"
entrez_path = OUT / "entrez_query.txt"
html_path   = OUT / "taxon_map.html"

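# Read every column as text so TaxIDs stay "9606" (not 9606.0) and blank cells stay ""
df = pd.read_csv(csv_path, dtype=str, keep_default_na=False)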

# Normalize for display
if "scientific_name" not in df.columns and "canonical_name" in df.columns:
    df = df.rename(columns={"canonical_name":"scientific_name"})
if "superkingdom" not in df.columns:
    df["superkingdom"] = ""

# Counts
total = len(df)
counts_super = df["superkingdom"].value_counts().to_dict()
counts_egrp  = (df[df["superkingdom"]=="Eukaryota"]["euk_group"]
                .fillna("").replace("", "Other/NA").value_counts().to_dict())

taxid_list   = ids_path.read_text() if ids_path.exists() else ""
entrez_query = entrez_path.read_text() if entrez_path.exists() else ""

# Pretty table
disp_cols = ["taxid","scientific_name","common_name","superkingdom","euk_group","rank","match_type","status","input_name"]
table_html = df[disp_cols].to_html(index=False, escape=True, classes="tax-table")

# Bullet summaries (static descriptive text; not computed from the table)
bullets_html = """
<h3 style="margin:14px 0 6px;">Coverage overview</h3>
<ul>
  <li><strong>Bacteria (~40 taxa):</strong> <em>Escherichia coli, Bacillus subtilis, Pseudomonas aeruginosa, Staphylococcus aureus, Streptomyces coelicolor, Mycobacterium tuberculosis, Deinococcus radiodurans, Synechocystis sp. PCC 6803, Vibrio cholerae, Thermus aquaticus</em>, plus many more spread across major bacterial phyla.</li>
  <li><strong>Archaea (~20 taxa):</strong> <em>Methanocaldococcus jannaschii, Haloferax volcanii, Halobacterium salinarum, Sulfolobus solfataricus, Thermoplasma acidophilum, Archaeoglobus fulgidus, Nitrosopumilus maritimus, Methanobrevibacter smithii</em>, etc.</li>
  <li><strong>Eukaryotes (~40 taxa):</strong>
      <ul>
        <li><em>Mammals:</em> Homo sapiens, Mus musculus, Rattus norvegicus, Bos taurus, Pan troglodytes, Gorilla gorilla, Pongo abelii, Tursiops truncatus.</li>
        <li><em>Birds/Reptiles/Fish:</em> Gallus gallus, Danio rerio, and reptiles such as Anolis carolinensis.</li>
        <li><em>Plants:</em> Arabidopsis thaliana, Oryza sativa, Zea mays, Physcomitrella patens.</li>
        <li><em>Fungi:</em> Saccharomyces cerevisiae, Neurospora crassa, Schizosaccharomyces pombe.</li>
        <li><em>Protists:</em> Plasmodium falciparum, Trypanosoma brucei, Tetrahymena thermophila, Dictyostelium discoideum.</li>
      </ul>
  </li>
</ul>
"""

html = f"""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Representative Taxa Panel (NCBI TaxIDs)</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<style>
  :root {{ --bg:#0b1020; --card:#121a2b; --ink:#e5ecff; --muted:#9bb0d6; --accent:#7aa2ff; --border:#20304f; }}
  body {{ font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Inter, Arial, sans-serif; margin:0; padding:24px; background:var(--bg); color:var(--ink); }}
  .wrap {{ max-width:1100px; margin:0 auto; background:var(--card); border:1px solid var(--border);
          border-radius:14px; padding:22px 22px 28px; box-shadow:0 10px 30px rgba(0,0,0,.35), inset 0 1px 0 rgba(255,255,255,.04); }}
  .muted {{ color:var(--muted); font-size:13px; }}
  .counts span {{ background:rgba(122,162,255,.12); border:1px solid var(--border); padding:6px 10px; border-radius:10px; margin-right:8px; }}
  .panel pre {{ background:#0a1327; border:1px solid var(--border); padding:10px 12px; border-radius:10px; overflow:auto; }}
  .panel code {{ color:var(--accent); }}
  .tax-table {{ width:100%; border-collapse:collapse; font-size:14px; margin-top:10px; }}
  .tax-table th, .tax-table td {{ border:1px solid var(--border); padding:8px 10px; }}
  .tax-table th {{ position:sticky; top:0; background:#101a32; z-index:2; }}
  .tax-table tr:nth-child(odd) td {{ background:#0e1730; }}
  .tax-table tr:nth-child(even) td {{ background:#0c162c; }}
  .pill {{ display:inline-block; padding:2px 8px; border-radius:999px; border:1px solid var(--border); font-size:12px; color:var(--muted);}}
  .filter input {{ width:100%; padding:10px 12px; border-radius:10px; border:1px solid var(--border); background:#0a1327; color:var(--ink); }}
  ul {{ margin: 6px 0 8px 18px; }}
</style>
</head>
<body>
  <div class="wrap">
    <h1>Representative Taxa Panel <span class="pill">NCBI TaxIDs</span></h1>
    <div class="muted">Built directly from your names file; shows superkingdom and eukaryote branch for teaching context.</div>

    <div class="counts" style="margin:10px 0;">
      <span>Total: <strong>{total}</strong></span>
      <span>Bacteria: <strong>{counts_super.get('Bacteria', 0)}</strong></span>
      <span>Archaea: <strong>{counts_super.get('Archaea', 0)}</strong></span>
      <span>Eukaryota: <strong>{counts_super.get('Eukaryota', 0)}</strong></span>
      <span>Euk·Mammal: <strong>{counts_egrp.get('Mammal', 0)}</strong></span>
      <span>Euk·Bird/Reptile/Fish: <strong>{counts_egrp.get('Bird/Reptile/Fish', 0)}</strong></span>
      <span>Euk·Plant: <strong>{counts_egrp.get('Plant', 0)}</strong></span>
      <span>Euk·Fungi: <strong>{counts_egrp.get('Fungi', 0)}</strong></span>
      <span>Euk·Protist: <strong>{counts_egrp.get('Protist', 0)}</strong></span>
    </div>

    <div class="panel">
      <div><strong>Downloads:</strong>
        <a href="taxon_map.csv" download>taxon_map.csv</a> ·
        <a href="taxids_list.txt" download>taxids_list.txt</a> ·
        <a href="entrez_query.txt" download>entrez_query.txt</a>
      </div>

      <div class="filter" style="margin:12px 0 6px;">
        <input id="flt" type="search" placeholder="Type to filter (matches any column)…" oninput="filterRows()">
      </div>

      <div>
        <strong>Entrez BLAST filter:</strong>
        <pre><code>{entrez_query}</code></pre>
      </div>

      {bullets_html}
    </div>

    {table_html}
  </div>

<script>
function filterRows(){{
  const q = document.getElementById('flt').value.toLowerCase();
  const rows = document.querySelectorAll('table.tax-table tbody tr');
  rows.forEach(tr => {{
    tr.style.display = tr.innerText.toLowerCase().includes(q) ? '' : 'none';
  }});
}}
</script>
</body>
</html>
"""

html_path.write_text(html, encoding="utf-8")
print("🌐 HTML:", html_path)

🌐 HTML: /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/Taxonomy_Template/taxon_map.html
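
The filter box in the generated page hides every row whose text does not contain the typed string, ignoring case. The same any-column match can be sketched in pandas for use inside the notebook. The helper `filter_rows` and the three-row table are illustrative assumptions, not notebook output.

```python
import pandas as pd

def filter_rows(df, query):
    """Keep rows where any column contains `query` (case-insensitive substring)."""
    text = df.astype(str).agg(" ".join, axis=1).str.lower()  # one string per row
    return df[text.str.contains(query.lower(), regex=False)]

# Small hand-made table for illustration
demo = pd.DataFrame({
    "taxid": ["9606", "562", "4932"],
    "scientific_name": ["Homo sapiens", "Escherichia coli", "Saccharomyces cerevisiae"],
    "superkingdom": ["Eukaryota", "Bacteria", "Eukaryota"],
})
euks = filter_rows(demo, "eukaryota")
coli = filter_rows(demo, "COLI")
print(euks)
print(coli)
```

With the real data, `filter_rows(df, "archaea")` would be one way to preview a single domain before opening the HTML file.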


In [4]:
# Resolve taxon names -> NCBI TaxIDs and build BLAST Entrez query filters
# pip install biopython
from Bio import Entrez
import csv, re, time
from pathlib import Path

Entrez.email = "you@university.edu"   # <-- set this!

names_block = """
Oceanithermus
Oceanithermus profundus
Deinococcota bacterium DY0809b
Allomeiothermus silvanus
Meiothermus sp.
Deinococcus sp.
Meiothermus ruber
Meiothermus taiwanensis
Calidithermus roseus
Calidithermus terrae
Marinithermus hydrothermalis DSM 14884
Marinithermus hydrothermalis
Thermus thalpophilus
Thermus brockianus
Thermus arciformis
Thermus tenuipuniceus
Thermus altitudinis
Thermus albus
Thermus islandicus
Thermus aquaticus
Thermus sediminis
Pleurocapsa sp. SU_196_0
Deinococcus geothermalis
Deinococcus planocerae
Thermus scotoductus
Hymenolepis microstoma
Rodentolepis nana
Mesocestoides corti
Taenia asiatica
Taenia crassiceps
Echinococcus granulosus
Echinococcus multilocularis
Taenia solium
Hydatigera taeniaeformis
Phormidium sp. CCY1219
Nostoc ellipsosporum NOK
Vollenhovia emeryi
Drosophila ananassae
Nostocales cyanobacterium ELA608
Oppiella nova]
Rhizophagus clarus
Lytechinus pictus
Diadema antillarum
Ornithodoros turicata
Dermacentor silvarum
Schistocerca gregaria
Schistocerca cancellata
Mus musculus
Loxodonta africana
Rousettus aegyptiacus
Myotis myotis
Ovis aries
Sus scrofa
Eschrichtius robustus
Eschrichtius robustus
Acropora muricata
Acropora millepora
Octopus vulgaris
Oscillatoriaceae cyanobacterium
Halobacteriales archaeon
Cyanobacteriota bacterium
Leptolyngbya sp. 7M
Thermoplasmata archaeon
Nitrosotalea sp.
Candidatus Micrarchaeia archaeon
Synechococcus sp. RC10A2
Kamptonema cortianum
Nitrososphaerota archaeon
Candidatus Woesearchaeota archaeon
Thermosynechococcus sp.
Thermosynechococcaceae cyanobacterium
Hydrococcus sp. Prado102
Scytonema sp. NUACC21
Cyanobacteria bacterium P01_D01_bin.50
Mojavia pulchra JT2-VF2
Aulosira sp. DedQUE10
Tolypothrix tenuis
Desmonostoc muscorum
Komarekiella delphini-convector
Leptolyngbya sp. NIES-2104
Leptolyngbya sp. NIES-3755
Hyella patelloides
Planktothrix sp.
Oculatellaceae cyanobacterium bins.114
Waterburya sp.
Allocoleopsis franciscana
Coleofasciculaceae cyanobacterium
Candidatus Cyanaurora vandensis
Anthocerotibacter panamensis
Gloeobacter kilaueensis
Gloeobacterales cyanobacterium ES-bin-313
Methanolobus sp.
Methanosarcinaceae archaeon
Methanosalsum zhilinae
Methanobacteriota archaeon
Methanobacteriota archaeon
Candidatus Sysuiplasma superficiale
Methanomassiliicoccales archaeon
Thermoplasmatales archaeon I-plasma
Ferroplasma sp.
Thermoplasma sp.
Thermoplasma volcanium GSS1
Thermoplasma volcanium
Oxyplasma meridianum
Thermoplasmatales archaeon Gpl
Cuniculiplasma divulgatum
archaeon BMS3Bbin15
Methanobacteriota archaeon
Candidatus Hydrothermarchaeaceae archaeon
uncultured archaeon
Candidatus Argoarchaeum ethanivorans
Candidatus Methanoperedens sp.
Candidatus Syntrophoarchaeum butanivorans
Candidatus Syntrophoarchaeum sp. WYZ-LMO15
""".strip().splitlines()

# ---------- cleaning helpers ----------
bad_tokens = [
    r"\bDSM\s*\d+\b", r"\bNOK\b", r"\bGSS1\b", r"\bbin\.?\s*\d+\b",
    r"\bbins?\.?\s*\d+\b", r"\bELA\d+\w*\b", r"\bSU_\d+_\d+\b", r"\bPrado\d+\b",
    r"\bP\d+_\w+\b", r"\bJT\d+-\w+\b", r"\bWYZ-\w+\b", r"\bES-bin-\d+\b", r"]$"
]
generic_words = ("bacterium", "archaeon", "cyanobacterium", "cyanobacteria", "family", "order")

def clean_name(s: str) -> str:
    s = s.strip()
    for pat in bad_tokens:
        s = re.sub(pat, "", s, flags=re.I)
    s = re.sub(r"\s+", " ", s).strip(" .")
    return s

def genus_of_sp(name: str) -> str | None:
    m = re.search(r"^([A-Z][a-zA-Z_-]+)\s+sp\.", name)
    return m.group(1) if m else None

def primary_trim(name: str) -> str:
    # collapse "X cyanobacterium ..." -> "X cyanobacterium"
    m = re.search(r"^(.*?\b(?:%s))\b" % "|".join(generic_words), name, flags=re.I)
    return m.group(1) if m else name

def need_expansion(original: str, rank: str) -> bool:
    # Recommend :exp for sp./generic/above-species ranks
    return (
        " sp." in original
        or any(w in original.lower() for w in generic_words)
        or rank not in ("species", "subspecies", "no rank")  # genus/family/etc.
    )

# ---------- NCBI lookup ----------
def tax_lookup(name: str):
    """Return dict with taxid, canonical, rank, note."""
    queries = [
        f"\"{name}\"[SCIN]",    # exact scientific name
        name,                   # general search
    ]
    # If 'sp.' present, add genus fallback
    g = genus_of_sp(name)
    if g:
        queries.append(f"\"{g}\"[SCIN]")
        queries.append(g)

    # If generic like 'family/order bacterium/archaeon', trim trailing strain codes
    trimmed = primary_trim(name)
    if trimmed != name:
        queries.insert(1, f"\"{trimmed}\"[SCIN]")
        queries.insert(2, trimmed)

    seen = set()
    for q in queries:
        if q in seen:
            continue
        seen.add(q)
        h = Entrez.esearch(db="taxonomy", term=q, retmode="xml")
        r = Entrez.read(h); h.close()
        if int(r["Count"]) == 0:
            continue
        tid = r["IdList"][0]
        h2 = Entrez.efetch(db="taxonomy", id=tid, retmode="xml")
        rec = Entrez.read(h2); h2.close()
        node = rec[0]
        return {
            "taxid": int(tid),
            "canonical": node.get("ScientificName", ""),
            "rank": node.get("Rank", ""),
            "note": f"matched via {q}"
        }
        # (Optionally: inspect node['LineageEx'] for more logic)
    return {"taxid": "", "canonical": "", "rank": "", "note": "no match"}

# ---------- main ----------
rows = []
seen_inputs = set()
for raw in names_block:
    if not raw.strip():
        continue
    original = raw.strip()
    if original in seen_inputs:
        continue
    seen_inputs.add(original)

    cleaned = clean_name(original)
    info = tax_lookup(cleaned)
    sugg = "Organism:exp" if need_expansion(original, info.get("rank","")) else "Organism:noexp"

    rows.append({
        "input_name": original,
        "cleaned_name": cleaned,
        "taxid": info["taxid"],
        "canonical_name": info["canonical"],
        "rank": info["rank"],
        "suggested_field": sugg,
        "note": info["note"]
    })
    time.sleep(0.34)  # be courteous to NCBI

# Write CSVs
with open("taxon_map.csv","w",newline="") as f:
    w = csv.DictWriter(f, fieldnames=[
        "input_name","cleaned_name","taxid","canonical_name","rank","suggested_field","note"
    ])
    w.writeheader(); w.writerows(rows)

# Unresolved
unresolved = [r for r in rows if not r["taxid"]]
if unresolved:
    with open("unresolved.csv","w",newline="") as f:
        w = csv.DictWriter(f, fieldnames=[
            "input_name","cleaned_name","note"
        ])
        w.writeheader()
        for r in unresolved:
            w.writerow({k:r[k] for k in ("input_name","cleaned_name","note")})

# Unique TaxIDs
taxids = sorted({str(r["taxid"]) for r in rows if r["taxid"]})
Path("taxids.txt").write_text("\n".join(taxids))

# Build two Entrez query strings (limit length as needed for URLs)
tokens_noexp = [f"txid{t}[Organism:noexp]" for t in taxids]
tokens_exp   = []
for r in rows:
    if r["taxid"]:
        field = "Organism:exp" if r["suggested_field"]=="Organism:exp" else "Organism:noexp"
        tokens_exp.append(f"txid{r['taxid']}[{field}]")

# Deduplicate while preserving order
def dedup(seq):
    s, out = set(), []
    for x in seq:
        if x not in s:
            s.add(x); out.append(x)
    return out

eq_noexp = " OR ".join(dedup(tokens_noexp))
eq_exp   = " OR ".join(dedup(tokens_exp))

Path("entrez_query_noexp.txt").write_text(eq_noexp)
Path("entrez_query_exp.txt").write_text(eq_exp)

print(f"Resolved {len(taxids)} unique TaxIDs out of {len(rows)} names.")
print("Wrote: taxon_map.csv, taxids.txt, entrez_query_noexp.txt, entrez_query_exp.txt")
if unresolved:
    print(f"{len(unresolved)} unresolved -> unresolved.csv")

Resolved 107 unique TaxIDs out of 111 names.
Wrote: taxon_map.csv, taxids.txt, entrez_query_noexp.txt, entrez_query_exp.txt
2 unresolved -> unresolved.csv
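
The cell above notes that the Entrez query strings may need to be limited in length. One way to do that is to split the token list into several shorter `OR` queries. The sketch below is a minimal version of that idea; the helper `chunk_query` and the `max_chars` limit are assumptions chosen for illustration, not values prescribed by NCBI.

```python
def chunk_query(tokens, max_chars=2000, joiner=" OR "):
    """Group tokens into joined query strings, each at most max_chars long.

    A single token longer than max_chars still gets its own chunk.
    """
    chunks, current, length = [], [], 0
    for tok in tokens:
        extra = len(tok) + (len(joiner) if current else 0)
        if current and length + extra > max_chars:
            chunks.append(joiner.join(current))   # close the current chunk
            current, length = [], 0
            extra = len(tok)
        current.append(tok)
        length += extra
    if current:
        chunks.append(joiner.join(current))
    return chunks

# Three well-known TaxIDs (human, E. coli, mouse) and a deliberately small limit
tokens = [f"txid{t}[Organism:exp]" for t in (9606, 562, 10090)]
chunks = chunk_query(tokens, max_chars=50)
for c in chunks:
    print(len(c), c)
```

Each chunk can then be pasted or submitted separately, and the results combined afterwards.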
