<a href="https://colab.research.google.com/github/RobBurnap/Bioinformatics-MICR4203-MICR5203/blob/main/notebooks/L02_%E2%80%93_Setup_%2B_BLASTn_Mini%E2%80%91API.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# BIOINFO4/5203 — Week 2 Exercise (Foundations)

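**Goals for today**
- Mount Google Drive and create your course folders
- Load (or upload) a DNA FASTA file to use as the query
- Run BLASTn against NCBI `nt` and download the top hits
- Save the BLAST XML, a hits FASTA, a CSV summary, a plot, and a summary text file into your `Outputs/` folder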

> **Deliverables to Canvas:** the executed notebook (`.ipynb`) and a PDF export with outputs visible.


## Cell 1 — Setup Google Drive + Week Folders

In [None]:
# --- L02 SETUP: mount Google Drive and make week folders ---
%pip -q install biopython

from google.colab import drive
drive.mount('/content/drive')

import os

# Your course root in Drive (change only if you chose a different root in L01)
COURSE_DIR   = "/content/drive/MyDrive/Teaching/BIOINFO4-5203-F25"

# Week label for L02 (used to create subfolders)
LECTURE_CODE = "L02_databases_formats"
TOPIC        = "blastn_seed"

# Per-week data/output folders
DATA_DIR    = f"{COURSE_DIR}/Data/{LECTURE_CODE}_{TOPIC}"
OUTPUT_DIR  = f"{COURSE_DIR}/Outputs/{LECTURE_CODE}_{TOPIC}"
for p in [COURSE_DIR, f"{COURSE_DIR}/Data", f"{COURSE_DIR}/Outputs", DATA_DIR, OUTPUT_DIR]:
    os.makedirs(p, exist_ok=True)

print("✅ Drive mounted.")
print("📁 COURSE_DIR :", COURSE_DIR)
print("📁 DATA_DIR   :", DATA_DIR)
print("📁 OUTPUT_DIR :", OUTPUT_DIR)
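
The folder naming scheme used in this cell can be tried out without mounting Drive. The sketch below is only an illustration: `make_week_dirs` is a hypothetical helper (not part of the course code), and a temporary directory stands in for the Drive course root.

```python
import tempfile
from pathlib import Path

def make_week_dirs(course_dir, lecture_code, topic):
    """Create Data/ and Outputs/ subfolders for one week and return both paths.

    Hypothetical helper that mirrors the naming scheme of Cell 1.
    """
    week = f"{lecture_code}_{topic}"
    data_dir = Path(course_dir) / "Data" / week
    output_dir = Path(course_dir) / "Outputs" / week
    for p in (data_dir, output_dir):
        # parents=True also creates the course root and Data/ or Outputs/
        p.mkdir(parents=True, exist_ok=True)
    return data_dir, output_dir

# Try it in a throwaway directory instead of Google Drive
with tempfile.TemporaryDirectory() as tmp:
    data_dir, output_dir = make_week_dirs(tmp, "L02_databases_formats", "blastn_seed")
    print(data_dir.relative_to(tmp))
    print(output_dir.relative_to(tmp))
```

Because `exist_ok=True` is set, re-running the cell is harmless when the folders already exist.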





## Cell 2 — Put your unknown DNA FASTA into DATA_DIR
This cell checks that a FASTA file is present in `DATA_DIR` and, if not, lets you upload one directly into the right folder.

In [None]:
# --- Data check (and optional upload) ---
import os
from google.colab import files
from shutil import move

# Look for a DNA FASTA in DATA_DIR
cands = [f for f in os.listdir(DATA_DIR) if f.lower().endswith((".fa",".fasta",".fna"))]

if not cands:
    print("⚠️ No FASTA found in Data/. Use the picker to upload one now.")
    uploaded = files.upload()  # Choose a local .fa/.fasta/.fna
    for name in uploaded.keys():
        move(name, f"{DATA_DIR}/{name}")
    cands = [f for f in os.listdir(DATA_DIR) if f.lower().endswith((".fa",".fasta",".fna"))]

assert cands, "No FASTA in Data/. Please upload a .fa/.fasta/.fna and re-run this cell."

# Use the newest FASTA
cands.sort(key=lambda f: os.path.getmtime(f"{DATA_DIR}/{f}"), reverse=True)
FASTA_PATH = f"{DATA_DIR}/{cands[0]}"
print("📄 Using FASTA:", FASTA_PATH)
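
Before sending the query to BLASTn, it can help to confirm that the file really holds nucleotides, since a protein FASTA would be a poor `blastn` query. The following is a minimal sketch in plain Python. `read_fasta` and `dna_fraction` are hypothetical helper names, the demo sequences are made up, and the 0.9 threshold is an arbitrary assumption rather than a course requirement.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def dna_fraction(seq):
    """Fraction of characters that are A, C, G, T, or N (case-insensitive)."""
    if not seq:
        return 0.0
    return sum(ch in "ACGTN" for ch in seq.upper()) / len(seq)

# Made-up demo input; in the notebook you would pass open(FASTA_PATH) instead
demo = io.StringIO(">toy_dna\nACGTACGTNN\nacgt\n>toy_protein\nMKVLAAGIW\n")
for name, seq in read_fasta(demo):
    frac = dna_fraction(seq)
    label = "looks like DNA" if frac >= 0.9 else "probably not DNA"
    print(f"{name}: len={len(seq)}, ACGTN fraction={frac:.2f} -> {label}")
```

Real sequences may contain other IUPAC ambiguity codes (R, Y, and so on), so a fraction slightly below 1.0 does not by itself mean the file is wrong.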



## Cell 3 — BLASTn Mini‑API (download top similar DNA sequences)
- Runs BLASTn at NCBI (database `nt`, for L02 simplicity)
- Saves: raw BLAST XML, a FASTA of top hits, and a CSV summary

In [None]:
# --- BLASTn mini-API: find similar DNA and download top hits ---
from Bio import Entrez, SeqIO
from Bio.Blast import NCBIWWW, NCBIXML
import os, csv, time

# REQUIRED by NCBI: set your email (students: put YOUR email)
Entrez.email = "your_email@university.edu"  # <-- change me
API_KEY = None  # optional: paste NCBI API key to raise rate limits

# Read first record as query
rec = next(SeqIO.parse(FASTA_PATH, "fasta"))
query_seq = str(rec.seq)
print(f"🔎 Query record: {rec.id} (len={len(rec)} nt)")

# Run BLASTn vs nt (fast & familiar for L02)
print("⏳ Running BLASTn vs nt …")
blast_handle = NCBIWWW.qblast(
    program="blastn",
    database="nt",
    sequence=query_seq,
    expect=1e-5,
    megablast=True,
    filter="L",                # low-complexity filter
    entrez_query=None          # no organism filter in L02
)
blast_xml = blast_handle.read()
blast_handle.close()

# Save XML (reproducibility)
xml_path = f"{OUTPUT_DIR}/blastn_results.xml"
with open(xml_path, "w") as f:
    f.write(blast_xml)
print("💾 Saved BLAST XML:", xml_path)

# Parse top hits (one query was submitted, so the XML holds one BLAST record)
from io import StringIO
record = NCBIXML.read(StringIO(blast_xml))
assert record.alignments, "BLAST returned no hits. Try a different sequence."

top_n = 10
hits = []
for aln in record.alignments[:top_n]:
    hsp = aln.hsps[0]
    hits.append({
        "title": aln.title,
        "accession": aln.accession,
        "length": aln.length,
        "identity": hsp.identities,
        "align_len": hsp.align_length,
        "pct_identity": round(100.0 * hsp.identities / hsp.align_length, 2)
    })

# Fetch FASTA for each hit
def efetch_fasta(acc):
    params = dict(db="nuccore", id=acc, rettype="fasta", retmode="text")
    if API_KEY: params["api_key"] = API_KEY
    with Entrez.efetch(**params) as h:
        return h.read()

fasta_out = f"{OUTPUT_DIR}/blastn_top_hits.fasta"
csv_out   = f"{OUTPUT_DIR}/blastn_top_hits_summary.csv"

with open(fasta_out, "w") as fa, open(csv_out, "w", newline="") as cs:
    writer = csv.DictWriter(cs, fieldnames=list(hits[0].keys()))
    writer.writeheader()
    for i, h in enumerate(hits, 1):
        try:
            fa.write(efetch_fasta(h["accession"]))
            writer.writerow(h)
            print(f"  ✅ {i:02d}/{len(hits)} {h['accession']}  {h['pct_identity']}% id")
        except Exception as e:
            print(f"  ⚠️  Failed {h['accession']}: {e}")
        time.sleep(0.34 if not API_KEY else 0.1)  # be polite to NCBI

print(f"\n💾 FASTA of hits   : {fasta_out}")
print(f"💾 Hits summary CSV: {csv_out}")
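
Because the raw XML is saved, the hit table can be rebuilt later without re-submitting the search to NCBI. The sketch below uses only the standard library and assumes the classic BLAST XML layout (`Hit`, `Hit_accession`, `Hit_len`, `Hsp_identity`, `Hsp_align-len`). The embedded XML is a made-up two-hit fragment, not real BLAST output, and `hits_from_blast_xml` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up fragment in the classic BLAST XML layout (not real results)
TOY_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_def>toy hit one</Hit_def>
          <Hit_accession>TOY0001</Hit_accession>
          <Hit_len>1500</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_identity>190</Hsp_identity>
              <Hsp_align-len>200</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_def>toy hit two</Hit_def>
          <Hit_accession>TOY0002</Hit_accession>
          <Hit_len>1200</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_identity>180</Hsp_identity>
              <Hsp_align-len>240</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""

def hits_from_blast_xml(xml_text, top_n=10):
    """Return accession, length, and % identity of the first HSP of each hit."""
    root = ET.fromstring(xml_text)
    rows = []
    for hit in list(root.iter("Hit"))[:top_n]:
        hsp = hit.find("Hit_hsps/Hsp")  # first HSP, like aln.hsps[0] in Cell 3
        if hsp is None:
            continue
        identities = int(hsp.findtext("Hsp_identity"))
        align_len = int(hsp.findtext("Hsp_align-len"))
        rows.append({
            "accession": hit.findtext("Hit_accession"),
            "length": int(hit.findtext("Hit_len")),
            # Same formula as Cell 3: identical positions / alignment length
            "pct_identity": round(100.0 * identities / align_len, 2),
        })
    return rows

for row in hits_from_blast_xml(TOY_XML):
    print(row)
```

In the notebook, the same function could be applied to the saved file, for example by reading `xml_path` as text and passing its contents in place of `TOY_XML`.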




## Cell 4 — (Optional) Quick visualization: identity distribution


In [None]:
# --- Quick plot of % identity for the top hits ---
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv(csv_out)
plt.figure()
df["pct_identity"].plot(kind="bar")
plt.title("% identity of BLASTn top hits")
plt.xlabel("Hit index")
plt.ylabel("% identity")
plt.tight_layout()

png_path = f"{OUTPUT_DIR}/blastn_identity_plot.png"
plt.savefig(png_path, dpi=150)
print("💾 Saved plot:", png_path)
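
A short numeric summary can complement the bar chart. This is a minimal sketch using only the standard library. The three rows are made-up stand-ins for `blastn_top_hits_summary.csv`, and `identity_stats` is a hypothetical helper.

```python
import csv
import io
import statistics

def identity_stats(csv_handle, column="pct_identity"):
    """Return count, min, max, and mean of one numeric CSV column."""
    values = [float(row[column]) for row in csv.DictReader(csv_handle)]
    if not values:
        raise ValueError("CSV contains no data rows")
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
    }

# Made-up rows; in the notebook you would pass open(csv_out, newline="") instead
toy_csv = io.StringIO(
    "accession,pct_identity\n"
    "TOY0001,99.5\n"
    "TOY0002,97.0\n"
    "TOY0003,88.5\n"
)
print(identity_stats(toy_csv))
```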



## Cell 5 — Canvas summary (simple key/value printout)

In [None]:
# --- Summary for Canvas auto-grading / quick check ---
summary = {
    "LECTURE": LECTURE_CODE,
    "TOPIC": TOPIC,
    "QUERY_FASTA": os.path.basename(FASTA_PATH),
    "HITS_FASTA": os.path.basename(fasta_out),
    "HITS_CSV": os.path.basename(csv_out)
}
print("=== L02 SUMMARY ===")
for k,v in summary.items():
    print(f"{k}={v}")

with open(f"{OUTPUT_DIR}/summary.txt","w") as f:
    for k,v in summary.items():
        f.write(f"{k}={v}\n")
print("💾 Wrote", f"{OUTPUT_DIR}/summary.txt")
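
The `KEY=value` format written by this cell is easy to read back, for example when double-checking a submission. The following is a minimal sketch that assumes one pair per line. `parse_summary` is a hypothetical helper, and the demo text is made up to match the keys used in this cell.

```python
def parse_summary(text):
    """Parse KEY=value lines into a dict, skipping blank or malformed lines."""
    result = {}
    for line in text.splitlines():
        # partition splits on the first "=", so values may contain "=" too
        key, sep, value = line.partition("=")
        if sep and key.strip():
            result[key.strip()] = value.strip()
    return result

# Made-up demo text; in the notebook you would read summary.txt from OUTPUT_DIR
demo = (
    "LECTURE=L02_databases_formats\n"
    "TOPIC=blastn_seed\n"
    "HITS_CSV=blastn_top_hits_summary.csv\n"
)
info = parse_summary(demo)
print(info["TOPIC"])
```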




## E. Results summary (copy into Canvas if requested)

In [None]:

# --- Add the query record count to the Cell 5 summary ---
from Bio import SeqIO

# Count every record in the query FASTA (Cell 3 used only the first one)
records = list(SeqIO.parse(FASTA_PATH, "fasta"))
summary["N_records"] = len(records)

# Rewrite summary.txt so it holds the Cell 5 keys plus N_records
summary_path = f"{OUTPUT_DIR}/summary.txt"
with open(summary_path, "w") as f:
    for k, v in summary.items():
        f.write(f"{k}={v}\n")
print("📝 Saved summary ->", summary_path)

print("=== SUMMARY ===")
for k, v in summary.items():
    print(f"{k}={v}")





## F. Export & submit
- **File → Print → Save as PDF**, then upload the PDF and `.ipynb` to Canvas.  
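- Ensure your `Outputs/` folder for this week contains: `blastn_results.xml`, `blastn_top_hits.fasta`, `blastn_top_hits_summary.csv`, `summary.txt`, and (if you ran the optional Cell 4) `blastn_identity_plot.png`.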
