<a href="https://colab.research.google.com/github/RobBurnap/Bioinformatics-MICR4203-MICR5203/blob/main/notebooks/taxonomic_codes_ipynb_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Building a table of taxonomic codes (NCBI TaxIDs) for a diverse set of sequences**


BIOINFO4/5203 — Colab Exercise Template

Use this template for every weekly exercise. It standardizes setup, data paths, and the final summary so grading in Canvas is quick.

Workflow

1. Click the "Open in Colab" link in Canvas (points to this notebook in GitHub).
2. Run Setup cells (installs and mounts Google Drive).
3. Run the Exercise cells (edit as instructed for each lecture).
4. Verify the Results Summary prints the values requested by Canvas.
5. File → Print → Save as PDF and upload .ipynb + PDF to Canvas.

Instructor note (delete in student copy if desired):

- Place datasets for this lecture at: Drive → BIOINFO4-5203-F25 → Data → Lxx_topic
- Update the constants in Config below: COURSE_DIR, LECTURE_CODE (e.g., L05), and TOPIC.
- For heavy jobs (trees, assemblies), provide the PETE output files in the same Data folder so students can analyze them here if the queue is busy.



**Auto‑setup + course folder (uses your Teaching path)**

## A. Mount Google Drive and import the libraries needed by later cells

In [1]:

# Install FIRST, then import
%pip install -q biopython       # Install the Biopython package quietly (-q suppresses most output) so we can work with biological sequence files

from google.colab import drive  # Import the module that lets Colab interact with Google Drive
drive.mount('/content/drive')   # Mount your Google Drive so it appears in Colab's file system under /content/drive

import os, pandas as pd          # Import 'os' for file/directory operations, and pandas for working with data tables
from Bio import SeqIO            # Import SeqIO from Biopython for reading/writing biological sequence files (FASTA, GenBank, etc.)
import matplotlib.pyplot as plt  # Import Matplotlib's plotting library to create figures and graphs

print("✅ Dependencies installed & Drive mounted.")


Mounted at /content/drive
✅ Dependencies installed & Drive mounted.



## B. Course folders: define where input data is loaded from and where outputs are saved

Edit only `LECTURE_CODE` and `TOPIC` if needed. All inputs live in `Data/{LECTURE_CODE}_{TOPIC}` and all outputs in `Outputs/{LECTURE_CODE}_{TOPIC}`.

**2) Define file structure**

In [2]:

# --- Course folder config (customize LECTURE_CODE/TOPIC only) ---
COURSE_DIR   = "/content/drive/MyDrive/Teaching/BIOINFO4-5203-F25"
LECTURE_CODE = "Taxonomy"            # change per week (e.g., L02, L03, ...)
TOPIC        = "Template"    # short slug for the exercise

# Derived paths (do not change)
DATA_DIR   = f"{COURSE_DIR}/Data/{LECTURE_CODE}_{TOPIC}"
OUTPUT_DIR = f"{COURSE_DIR}/Outputs/{LECTURE_CODE}_{TOPIC}"

# Create folder structure if missing
for p in [f"{COURSE_DIR}/Data", f"{COURSE_DIR}/Outputs", f"{COURSE_DIR}/Notebooks", DATA_DIR, OUTPUT_DIR]:
    os.makedirs(p, exist_ok=True)

print("📁 COURSE_DIR :", COURSE_DIR)
print("📁 DATA_DIR   :", DATA_DIR)
print("📁 OUTPUT_DIR :", OUTPUT_DIR)


📁 COURSE_DIR : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25
📁 DATA_DIR   : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Data/Taxonomy_Template
📁 OUTPUT_DIR : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/Taxonomy_Template
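
The derived paths follow one pattern: a `{LECTURE_CODE}_{TOPIC}` slug placed under both `Data` and `Outputs`. The sketch below shows the same derivation with `pathlib`. It is only an illustration: the helper name `lecture_paths` and the example values `"L05"` and `"Alignment"` are assumptions, and it runs in a temporary directory so nothing on Drive is touched.

```python
from pathlib import Path
import tempfile

def lecture_paths(course_dir, lecture_code, topic):
    """Hypothetical helper: return (data_dir, output_dir) for one exercise."""
    slug = f"{lecture_code}_{topic}"
    root = Path(course_dir)
    return root / "Data" / slug, root / "Outputs" / slug

# Demo in a temporary directory instead of /content/drive
with tempfile.TemporaryDirectory() as tmp:
    data_dir, output_dir = lecture_paths(tmp, "L05", "Alignment")
    for p in (data_dir, output_dir):
        # Same effect as os.makedirs(p, exist_ok=True)
        p.mkdir(parents=True, exist_ok=True)
    created = data_dir.is_dir() and output_dir.is_dir()

print(data_dir.name, output_dir.parent.name, created)
```

Changing only the two short strings moves both folders together, which is why the cell above asks you to leave the derived paths alone.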


## Read the names file "taxonomic_common_names" and find taxonomic IDs

**3) Version 3 Aug 25**
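
Before running the lookup it may help to see what a plain-text names file looks like. The sketch below writes a small sample to a temporary folder and reads it back using the same rule as the cell that follows: blank lines and lines starting with `#` are skipped. The sample contents and the function name `read_plain_names` are illustrative assumptions, not part of the course data.

```python
from pathlib import Path
import tempfile

# Illustrative sample: scientific names, a common name, and a bare TaxID
SAMPLE = """# one name per line; lines starting with '#' are ignored
Homo sapiens
Human

Escherichia coli
9606
"""

def read_plain_names(path):
    """Return non-empty, non-comment lines from a plain-text names file."""
    names = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.append(line)
    return names

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "taxonomic_common_names.txt"
    sample_path.write_text(SAMPLE)
    names = read_plain_names(sample_path)

print(names)
```

Note that `load_names` in the next cell switches to table parsing whenever a comma or tab appears in the first five lines, so a plain one-per-line file should keep commas out of those lines.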

In [3]:
# --- Build panel from your full names file & write CSV + TXT outputs ---
from Bio import Entrez
from pathlib import Path
import pandas as pd, io, re, os, sys, time

# REQUIRED: set your email; optionally set your NCBI API key for higher rate limits
Entrez.email = "you@university.edu"
# Entrez.api_key = "YOUR_NCBI_API_KEY"  # optional

# Expect these from your course scaffold cell
assert 'DATA_DIR' in globals() and 'OUTPUT_DIR' in globals(), "Run your course folder setup cell first."
DATA_DIR = Path(DATA_DIR); OUT = Path(OUTPUT_DIR)
OUT.mkdir(parents=True, exist_ok=True)

# Locate your master names file (exact name you’ve been using)
CANDIDATES = [
    DATA_DIR / "taxonomic_common_names",
    DATA_DIR / "taxonomic_common_names.txt",
    DATA_DIR / "taxonomic_common_names.csv",
    DATA_DIR / "taxonomic_common_names.tsv",
]
NAMES_PATH = next((p for p in CANDIDATES if p.exists()), None)
if not NAMES_PATH:
    raise FileNotFoundError(
        f"Could not find 'taxonomic_common_names' (no extension, .txt, .csv, or .tsv) in {DATA_DIR}.\n"
        "Create it with one name per line (scientific or common)."
    )
print("📄 Names file:", NAMES_PATH)

def load_names(path: Path):
    """Return the names in `path`: column 0 of a CSV/TSV, or one name per line."""
    text = path.read_text(errors="ignore")
    head = "\n".join(text.splitlines()[:5])
    # A comma or tab in the first five lines means the file is parsed as a table
    is_table = ("," in head) or ("\t" in head)
    names = []
    if is_table:
        sep = "\t" if "\t" in head else ","
        df = pd.read_csv(io.StringIO(text), sep=sep, comment="#", header=None)
        # dropna() keeps empty cells from turning into the literal string "nan"
        names = [str(x).strip() for x in df[df.columns[0]].dropna().tolist() if str(x).strip()]
    else:
        for line in text.splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                names.append(line)
    return names

raw_names = load_names(NAMES_PATH)
print(f"📝 Loaded {len(raw_names)} name(s)")

def fetch_taxnode_by_id(tid: str):
    """Fetch one taxonomy record by TaxID; returns {} if NCBI sends back nothing."""
    h2 = Entrez.efetch(db="taxonomy", id=tid, retmode="xml")
    rec = Entrez.read(h2)
    h2.close()
    return rec[0] if rec else {}

def tax_lookup(name: str):
    """
    Try to resolve a line like 'Homo sapiens' or 'Human' → (taxid, canonical, rank, domain, match_type, status)
    - match_type: SCIN | COMMON | LOOSE | TAXID
    - domain from LineageEx: Bacteria / Archaea / Eukaryota (default 'NA' if unknown)
    """
    name = name.strip()
    # If it's already an integer TaxID, accept directly
    if re.fullmatch(r"\d+", name):
        try:
            node = fetch_taxnode_by_id(name)
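            # Accept both labels for the top rank (assumption: older NCBI records
            # say "superkingdom", newer ones say "domain")
            dom = next((x["ScientificName"] for x in node.get("LineageEx", [])
                        if x.get("Rank") in ("superkingdom", "domain")), "NA")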
            return int(name), node.get("ScientificName",""), node.get("Rank",""), dom, "TAXID", "ok"
        except Exception as e:
            return "", "", "", "NA", "TAXID", f"efetch_failed:{e}"

    queries = [
        (f"\"{name}\"[SCIN]",   "SCIN"),
        (f"\"{name}\"[Common Name]", "COMMON"),
        (name,                 "LOOSE"),
    ]
    for q, tag in queries:
        try:
            h = Entrez.esearch(db="taxonomy", term=q, retmode="xml")
            r = Entrez.read(h)
            h.close()
            if int(r["Count"]) == 0:
                continue
            tid = r["IdList"][0]
            node = fetch_taxnode_by_id(tid)
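            # Accept both labels for the top rank (assumption: older NCBI records
            # say "superkingdom", newer ones say "domain")
            dom = next((x["ScientificName"] for x in node.get("LineageEx", [])
                        if x.get("Rank") in ("superkingdom", "domain")), "NA")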
            return int(tid), node.get("ScientificName",""), node.get("Rank",""), dom, tag, "ok"
        except Exception as e:
            sys.stderr.write(f"[warn] lookup '{name}' via {tag}: {e}\n")
            time.sleep(0.15)
            continue
    return "", "", "", "NA", "NA", "no_match"

rows = []
seen = set()
for nm in raw_names:
    key = nm.strip().lower()
    if not key or key in seen:
        continue
    seen.add(key)
    tid, canon, rank, dom, tag, status = tax_lookup(nm)
    rows.append({
        "input_name": nm,
        "taxid": tid,
        "canonical_name": canon,
        "rank": rank,
        "domain": dom,
        "match_type": tag,
        "status": status
    })
    time.sleep(0.25)  # be polite to NCBI; increase if you hit rate limits

df = pd.DataFrame(rows)
df = df.sort_values(["status","domain","canonical_name","input_name"], na_position="last").reset_index(drop=True)

# Write outputs
map_csv = OUT / "taxon_map.csv"
ids_txt = OUT / "taxids_list.txt"
entrez_txt = OUT / "entrez_query.txt"

df.to_csv(map_csv, index=False)
valid_taxids = [str(t) for t in df["taxid"].tolist() if str(t).isdigit()]
ids_txt.write_text("\n".join(valid_taxids))
entrez_txt.write_text(" OR ".join([f"txid{t}[ORGN]" for t in valid_taxids]))

print("🗺  Map CSV :", map_csv)
print("🧬 TaxIDs  :", ids_txt)
print("🔎 ENTREQ  :", entrez_txt)

# Report summary
print("\n=== Summary ===")
print("Total names read :", len(raw_names))
print("Resolved (ok)    :", (df["status"]=="ok").sum())
print("Unresolved       :", (df["status"]!="ok").sum())
if (df["status"]!="ok").any():
    print(df[df["status"]!="ok"][["input_name","status"]].head(12))

📄 Names file: /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Data/Taxonomy_Template/taxonomic_common_names.txt
📝 Loaded 127 name(s)
🗺  Map CSV : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/Taxonomy_Template/taxon_map.csv
🧬 TaxIDs  : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/Taxonomy_Template/taxids_list.txt
🔎 ENTREQ  : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/Taxonomy_Template/entrez_query.txt

=== Summary ===
Total names read : 127
Resolved (ok)    : 123
Unresolved       : 1
                                  input_name    status
0  Candidatus Syntrophoarchaeum butanivorans  no_match
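
The summary counts do not add up to the number of names read because the loop drops case-insensitive duplicates before any lookup: "Total names read" counts raw entries, while the resolved and unresolved counts come from the deduplicated table. The sketch below reproduces that step, and the way the Entrez filter string is assembled, without contacting NCBI. The function names are hypothetical, and the TaxIDs used are 9606 (Homo sapiens) and 562 (Escherichia coli).

```python
def dedupe_names(names):
    """Keep the first occurrence of each name, comparing case-insensitively."""
    seen, unique = set(), []
    for nm in names:
        key = nm.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(nm)
    return unique

def build_entrez_query(taxids):
    """Join TaxIDs into an organism filter; entries that are not all digits are skipped."""
    valid = [str(t) for t in taxids if str(t).isdigit()]
    return " OR ".join(f"txid{t}[ORGN]" for t in valid)

names = ["Homo sapiens", "homo sapiens", "Escherichia coli", "HOMO SAPIENS "]
unique = dedupe_names(names)

# The empty string stands in for a name that could not be resolved
query = build_entrez_query([9606, "", 562])

print(len(names), len(unique))
print(query)
```

Unresolved names carry an empty TaxID, so they stay in `taxon_map.csv` for review but never reach `taxids_list.txt` or `entrez_query.txt`.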


**HTML generator**

In [4]:
# --- Rebuild HTML table with filter + download links (reads from OUT) ---
from pathlib import Path
import pandas as pd

assert 'OUTPUT_DIR' in globals(), "Run your course folder setup cell first."
OUT = Path(OUTPUT_DIR)

csv_path   = OUT / "taxon_map.csv"
ids_path   = OUT / "taxids_list.txt"
entrez_path= OUT / "entrez_query.txt"
html_path  = OUT / "taxon_map.html"

df = pd.read_csv(csv_path)
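# Sort by whichever name columns this CSV actually has (taxon_map.csv uses canonical_name)
sort_cols = [c for c in ["domain", "scientific_name", "canonical_name", "common_name"] if c in df.columns]
df = df.sort_values(sort_cols)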

taxid_list  = ids_path.read_text() if ids_path.exists() else ""
entrez_query= entrez_path.read_text() if entrez_path.exists() else ""

# For display, show canonical_name under the header "scientific_name"
display_df = df.copy()
if "canonical_name" in display_df.columns and "scientific_name" not in display_df.columns:
    display_df = display_df.rename(columns={"canonical_name":"scientific_name"})

counts = display_df["domain"].value_counts().to_dict()
total  = len(display_df)

table_html = display_df.to_html(index=False, escape=True, classes="tax-table")

html = f"""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Representative Taxa Panel (NCBI TaxIDs)</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<style>
  :root {{ --bg:#0b1020; --card:#121a2b; --ink:#e5ecff; --muted:#9bb0d6; --accent:#7aa2ff; --border:#20304f; }}
  body {{ font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Inter, Arial, sans-serif; margin:0; padding:24px; background:var(--bg); color:var(--ink); }}
  .wrap {{ max-width:1100px; margin:0 auto; background:var(--card); border:1px solid var(--border);
          border-radius:14px; padding:22px 22px 28px; box-shadow:0 10px 30px rgba(0,0,0,.35), inset 0 1px 0 rgba(255,255,255,.04); }}
  .muted {{ color:var(--muted); font-size:13px; }}
  .counts span {{ background:rgba(122,162,255,.12); border:1px solid var(--border); padding:6px 10px; border-radius:10px; margin-right:8px; }}
  .panel pre {{ background:#0a1327; border:1px solid var(--border); padding:10px 12px; border-radius:10px; overflow:auto; }}
  .panel code {{ color:var(--accent); }}
  .tax-table {{ width:100%; border-collapse:collapse; font-size:14px; margin-top:10px; }}
  .tax-table th, .tax-table td {{ border:1px solid var(--border); padding:8px 10px; }}
  .tax-table th {{ position:sticky; top:0; background:#101a32; z-index:2; }}
  .tax-table tr:nth-child(odd) td {{ background:#0e1730; }}
  .tax-table tr:nth-child(even) td {{ background:#0c162c; }}
  .pill {{ display:inline-block; padding:2px 8px; border-radius:999px; border:1px solid var(--border); font-size:12px; color:var(--muted);}}
  .filter input {{ width:100%; padding:10px 12px; border-radius:10px; border:1px solid var(--border); background:#0a1327; color:var(--ink); }}
</style>
</head>
<body>
  <div class="wrap">
    <h1>Representative Taxa Panel <span class="pill">NCBI TaxIDs</span></h1>
    <div class="muted">Built directly from your names file; counts and table reflect the full list.</div>
    <div class="counts">
      <span>Total: <strong>{total}</strong></span>
      <span>Bacteria: <strong>{counts.get('Bacteria', 0)}</strong></span>
      <span>Archaea: <strong>{counts.get('Archaea', 0)}</strong></span>
      <span>Eukaryota: <strong>{counts.get('Eukaryota', 0)}</strong></span>
    </div>

    <div class="panel">
      <div><strong>Downloads:</strong>
        <a href="taxon_map.csv" download>taxon_map.csv</a> ·
        <a href="taxids_list.txt" download>taxids_list.txt</a> ·
        <a href="entrez_query.txt" download>entrez_query.txt</a>
      </div>

      <div class="filter" style="margin:12px 0 6px;">
        <input id="flt" type="search" placeholder="Type to filter (matches any column)…" oninput="filterRows()">
      </div>

      <div>
        <strong>Entrez BLAST filter:</strong>
        <pre><code>{entrez_query}</code></pre>
      </div>
    </div>

    {table_html}
  </div>

<script>
function filterRows(){{
  const q = document.getElementById('flt').value.toLowerCase();
  const rows = document.querySelectorAll('table.tax-table tbody tr');
  rows.forEach(tr => {{
    tr.style.display = tr.innerText.toLowerCase().includes(q) ? '' : 'none';
  }});
}}
</script>
</body>
</html>
"""

html_path.write_text(html, encoding="utf-8")
print("🌐 HTML:", html_path)

🌐 HTML: /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/Taxonomy_Template/taxon_map.html
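
The page template above is a Python f-string, which is why every CSS and JavaScript brace is doubled (`{{` and `}}`) while interpolated values such as `{total}` use single braces. The sketch below shows that rule on a much smaller page. The function `render_page` is a hypothetical stand-in, and the use of `html.escape` on interpolated text is an optional precaution added for illustration, not something the cell above does.

```python
import html

def render_page(title, query):
    """Hypothetical mini-template: literal braces are doubled, values use single braces."""
    safe_title = html.escape(title)   # turns <, >, & into HTML entities
    safe_query = html.escape(query)
    return f"""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>{safe_title}</title>
<style>
  pre {{ overflow:auto; }}
</style>
</head>
<body>
  <pre><code>{safe_query}</code></pre>
</body>
</html>
"""

page = render_page("Taxa <demo>", "txid9606[ORGN] OR txid562[ORGN]")
print(page)
```

In the rendered page the doubled braces collapse to single ones, so the browser sees ordinary CSS. Forgetting to double a brace in the template typically raises an error when the f-string is evaluated, so mistakes show up when the cell runs rather than in the browser.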
