<a href="https://colab.research.google.com/github/RobBurnap/Bioinformatics-MICR4203-MICR5203/blob/main/notebooks/taxonomic_codes_ipynb_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Building a table of taxonomic codes for a diverse set of sequences**


BIOINFO4/5203 — Colab Exercise Template

Use this template for every weekly exercise. It standardizes setup, data paths, and the final summary so grading in Canvas is quick.

Workflow

1. Click the "Open in Colab" link in Canvas (it points to this notebook on GitHub).
2. Run the Setup cells (these install dependencies and mount Google Drive).
3. Run the Exercise cells (edit as instructed for each lecture).
4. Verify that the Results Summary prints the values requested by Canvas.
5. File → Print → Save as PDF, then upload the .ipynb and the PDF to Canvas.

Instructor note (delete in student copy if desired):

- Place datasets for this lecture at: Drive → BIOINFO4-5203-F25 → Data → Lxx_topic
- Update the constants in Config below: COURSE_DIR, LECTURE_CODE (e.g., L05), and TOPIC.
- For heavy jobs (trees, assemblies), provide the PETE output files in the same Data folder so students can analyze them here if the queue is busy.



**Auto‑setup + course folder (uses your Teaching path)**

## A. Mount Google Drive and import the libraries needed by the later cells

In [10]:

# Install FIRST, then import
%pip install -q biopython       # Install the Biopython package quietly (-q suppresses most output) so we can work with biological sequence files

from google.colab import drive  # Import the module that lets Colab interact with Google Drive
drive.mount('/content/drive')   # Mount your Google Drive so it appears in Colab's file system under /content/drive

import os, pandas as pd          # Import 'os' for file/directory operations, and pandas for working with data tables
from Bio import SeqIO            # Import SeqIO from Biopython for reading/writing biological sequence files (FASTA, GenBank, etc.)
import matplotlib.pyplot as plt  # Import Matplotlib's plotting library to create figures and graphs

print("✅ Dependencies installed & Drive mounted.")


Mounted at /content/drive
✅ Dependencies installed & Drive mounted.



## B. Course folders: Define the course folders for places to load data to be processed and output to be saved

Edit only `LECTURE_CODE` and `TOPIC` if needed. Inputs are read from `Data/{LECTURE_CODE}_{TOPIC}` and outputs are written to `Outputs/{LECTURE_CODE}_{TOPIC}`, both under `COURSE_DIR`.

**2) Define file structure**

In [11]:

# --- Course folder config (customize LECTURE_CODE/TOPIC only) ---
COURSE_DIR   = "/content/drive/MyDrive/Teaching/BIOINFO4-5203-F25"
LECTURE_CODE = "Taxonomy"            # change per week (e.g., L02, L03, ...)
TOPIC        = "Template"    # short slug for the exercise

# Derived paths (do not change)
DATA_DIR   = f"{COURSE_DIR}/Data/{LECTURE_CODE}_{TOPIC}"
OUTPUT_DIR = f"{COURSE_DIR}/Outputs/{LECTURE_CODE}_{TOPIC}"

# Create folder structure if missing
for p in [f"{COURSE_DIR}/Data", f"{COURSE_DIR}/Outputs", f"{COURSE_DIR}/Notebooks", DATA_DIR, OUTPUT_DIR]:
    os.makedirs(p, exist_ok=True)

print("📁 COURSE_DIR :", COURSE_DIR)
print("📁 DATA_DIR   :", DATA_DIR)
print("📁 OUTPUT_DIR :", OUTPUT_DIR)


📁 COURSE_DIR : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25
📁 DATA_DIR   : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Data/Taxonomy_Template
📁 OUTPUT_DIR : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/Taxonomy_Template


## C. Read the file "taxonomic_common_names" and find taxonomic IDs

**3) Resolve names → TaxIDs, then write the lookup table, merged TaxID list, and Entrez query to Outputs/**
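
The cell below reads a file named `taxonomic_common_names` (with or without an extension such as `.txt` or `.csv`) from `DATA_DIR`. In its plain-text form the file holds one scientific or common name per line, and blank lines and lines starting with `#` are skipped. As a rough illustration of that format, the sketch below writes a small example file to a temporary folder and parses it the same way. The file contents and the helper name `read_plain_names` are made up for this example and are not part of the course data.

```python
import tempfile
from pathlib import Path

# Hypothetical example contents: one name per line, '#' marks a comment line.
EXAMPLE_TEXT = """# organisms for this exercise
Escherichia coli
human

Saccharomyces cerevisiae
"""

def read_plain_names(path):
    """Return the non-empty, non-comment lines of a plain-text names file."""
    names = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        names.append(line)
    return names

# Write the example to a throwaway folder so nothing in Drive is touched
with tempfile.TemporaryDirectory() as tmp:
    example_path = Path(tmp) / "taxonomic_common_names.txt"
    example_path.write_text(EXAMPLE_TEXT)
    example_names = read_plain_names(example_path)

print(example_names)
```

Avoid commas or tabs in a plain-text list: the cell treats a file whose first five lines contain either character as a table and reads only its first column.
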

In [12]:
# --- Read 'taxonomic_common_names' from DATA_DIR, resolve to TaxIDs (incl. common names), merge & write outputs ---
from Bio import Entrez
from pathlib import Path
import pandas as pd
import csv, io, re, time, sys, os

# REQUIRED
Entrez.email = "you@university.edu"   # <-- set your email here

# Expect DATA_DIR and OUTPUT_DIR from your course scaffold
assert 'DATA_DIR' in globals() and 'OUTPUT_DIR' in globals(), "Run your course folder setup cell first."
DATA_DIR = Path(DATA_DIR)
OUT      = Path(OUTPUT_DIR)
OUT.mkdir(parents=True, exist_ok=True)

# ---- Locate the names file (exact name requested) ----
# Accept either no extension, .txt, .csv, or .tsv
CANDIDATES = [
    DATA_DIR / "taxonomic_common_names",
    DATA_DIR / "taxonomic_common_names.txt",
    DATA_DIR / "taxonomic_common_names.csv",
    DATA_DIR / "taxonomic_common_names.tsv",
]
NAMES_PATH = next((p for p in CANDIDATES if p.exists()), None)
if not NAMES_PATH:
    raise FileNotFoundError(
        f"Could not find 'taxonomic_common_names' in {DATA_DIR}. "
        "Create it as a plain text/CSV/TSV file (one name per line; first column if table)."
    )
print(f"📄 Names file: {NAMES_PATH}")

# ---- Load names (plain text OR first column of CSV/TSV). Lines starting with # are ignored. ----
def load_names(path: Path):
    text = path.read_text(errors="ignore")
    # Heuristic: treat as table if we see a comma or tab in first few lines
    head = "\n".join(text.splitlines()[:5])
    is_table = ("," in head) or ("\t" in head)
    names = []
    if is_table:
        # Try TSV first if tabs present, else CSV
        sep = "\t" if "\t" in head else ","
        df = pd.read_csv(io.StringIO(text), sep=sep, comment="#", header=None)
        # take the first column; drop empty cells so missing values don't become the string "nan"
        first_col = df.iloc[:, 0].dropna()
        names = [str(x).strip() for x in first_col if str(x).strip()]
    else:
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            names.append(line)
    return names

raw_names = load_names(NAMES_PATH)
if not raw_names:
    raise ValueError(f"No names found in {NAMES_PATH} (file empty or only comments).")
print(f"📝 Loaded {len(raw_names)} name(s)")

# ---- Resolve name -> TaxID (try scientific first, then common name, then loose) ----
def tax_lookup(name: str):
    """
    Resolve 'name' (scientific or common) -> (taxid, canonical, rank, match_type, status)
    match_type: SCIN / COMMON / LOOSE
    status: 'ok' or reason
    """
    name = name.strip()
    # If user accidentally included a TaxID, accept it directly
    if re.fullmatch(r"\d+", name):
        try:
            tid = int(name)
            h2 = Entrez.efetch(db="taxonomy", id=str(tid), retmode="xml"); rec = Entrez.read(h2); h2.close()
            node = rec[0]
            return tid, node.get("ScientificName",""), node.get("Rank",""), "TAXID", "ok"
        except Exception as e:
            return "", "", "", "TAXID", f"efetch_failed:{e}"

    queries = [
        (f"\"{name}\"[SCIN]", "SCIN"),               # strict scientific name
        (f"\"{name}\"[Common Name]", "COMMON"),     # strict common name
        (name, "LOOSE"),                            # loose text search (all-name fields)
    ]
    tried = set()
    for q, tag in queries:
        if q in tried:
            continue
        tried.add(q)
        try:
            h = Entrez.esearch(db="taxonomy", term=q, retmode="xml"); r = Entrez.read(h); h.close()
            if int(r["Count"]) == 0:
                continue
            tid = r["IdList"][0]
            h2 = Entrez.efetch(db="taxonomy", id=tid, retmode="xml"); rec = Entrez.read(h2); h2.close()
            node = rec[0]
            canon = node.get("ScientificName","")
            rank  = node.get("Rank","")
            return int(tid), canon, rank, tag, "ok"
        except Exception as e:
            sys.stderr.write(f"[warn] lookup '{name}' via {tag}: {e}\n")
            continue
    return "", "", "", "NA", "no_match"

rows = []
seen = set()
for nm in raw_names:
    nm = nm.strip()
    if not nm or nm.lower() in seen:
        continue
    seen.add(nm.lower())
    tid, canon, rank, tag, status = tax_lookup(nm)
    rows.append({
        "input_name": nm,
        "taxid": tid,
        "canonical_name": canon,
        "rank": rank,
        "match_type": tag,
        "status": status
    })
    time.sleep(0.25)  # be polite to NCBI

df_new = pd.DataFrame(rows).sort_values(["status","canonical_name","input_name"], na_position="last")
map_path = OUT / "augmented_taxon_map.csv"
df_new.to_csv(map_path, index=False)
print(f"🗺  Wrote map: {map_path}")

# ---- Merge with existing TaxID lists if present ----
existing_taxids = set()
for p in [OUT/"taxids.txt", OUT/"taxids-augmented.txt", DATA_DIR/"taxids.txt"]:
    if p.exists():
        existing_taxids |= {t.strip() for t in p.read_text().splitlines() if t.strip().isdigit()}

new_taxids = {str(t) for t in df_new["taxid"].tolist() if str(t).isdigit()}
all_taxids = sorted(existing_taxids | new_taxids, key=lambda x: int(x))

# ---- Write merged outputs ----
(OUT / "taxids-augmented.txt").write_text("\n".join(all_taxids))
eq_exp = " OR ".join([f"txid{t}[ORGN]" for t in all_taxids])  # [ORGN] is robust in BLAST
(OUT / "entrez_query_exp_aug.txt").write_text(eq_exp)

# ---- Report unresolved names (if any) ----
unresolved = df_new[df_new["status"] != "ok"]
if not unresolved.empty:
    bad_list = ", ".join(unresolved["input_name"].tolist()[:12])
    print(f"⚠️ Unresolved names: {len(unresolved)} (showing a few): {bad_list}")

print(f"✅ Resolved {len(new_taxids)} new TaxIDs. Total in merged list: {len(all_taxids)}")
print("🧬 TaxIDs :", OUT / 'taxids-augmented.txt')
print("🔎 ENTREQ :", OUT / 'entrez_query_exp_aug.txt')

📄 Names file: /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Data/Taxonomy_Template/taxonomic_common_names.txt
📝 Loaded 127 name(s)
🗺  Wrote map: /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/Taxonomy_Template/augmented_taxon_map.csv
⚠️ Unresolved names: 1 (showing a few): Candidatus Syntrophoarchaeum butanivorans
✅ Resolved 123 new TaxIDs. Total in merged list: 123
🧬 TaxIDs : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/Taxonomy_Template/taxids-augmented.txt
🔎 ENTREQ : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/Taxonomy_Template/entrez_query_exp_aug.txt
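
Because the lookup falls back from a strict scientific-name search to a common-name search and finally to a loose text search that keeps only the first hit, it is worth reviewing the map before using the TaxIDs. A minimal sketch of one way to do that is below. It assumes the column names written to `augmented_taxon_map.csv` by the cell above; the helper name `review_map` and the three toy rows are illustrative only and are not course results.

```python
import pandas as pd

def review_map(df):
    """Count rows per match_type and return the rows that need a manual check."""
    counts = df["match_type"].value_counts().to_dict()
    # Flag rows that failed, or that matched only through the loose text search
    needs_check = df[(df["status"] != "ok") | (df["match_type"] == "LOOSE")]
    return counts, needs_check

# Toy rows for illustration. In the notebook, load the real table instead:
#   df_map = pd.read_csv(OUT / "augmented_taxon_map.csv", keep_default_na=False)
# (keep_default_na=False stops pandas from reading the "NA" match_type as missing)
df_map = pd.DataFrame([
    {"input_name": "Escherichia coli", "taxid": 562, "canonical_name": "Escherichia coli",
     "rank": "species", "match_type": "SCIN", "status": "ok"},
    {"input_name": "human", "taxid": 9606, "canonical_name": "Homo sapiens",
     "rank": "species", "match_type": "COMMON", "status": "ok"},
    {"input_name": "not a real organism", "taxid": "", "canonical_name": "",
     "rank": "", "match_type": "NA", "status": "no_match"},
])

counts, needs_check = review_map(df_map)
print(counts)
print(needs_check[["input_name", "match_type", "status"]])
```

Any row returned in `needs_check` is a candidate for correcting the spelling in the names file or entering the TaxID directly, which the lookup cell accepts as an all-digit entry.
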
