<a href="https://colab.research.google.com/github/RobBurnap/Bioinformatics-MICR4203-MICR5203/blob/main/notebooks/L01_foundations_week1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# BIOINFO4/5203 — Week 1 Exercise (Foundations)

**Goals for today**
- Mount Google Drive and create your course folders
- Load a small FASTA file
- Compute simple sequence statistics
- Save a plot and a summary text into your `Outputs/` folder

> **Deliverables to Canvas:** the executed notebook (`.ipynb`) and a PDF export with outputs visible.


## A. Mount Google Drive and import the libraries used in later cells

In [None]:

# Install FIRST, then import
# Install the Biopython package quietly (-q suppresses most output) so we can work with biological sequence files
%pip install -q biopython

from google.colab import drive  # Import the module that lets Colab interact with Google Drive
drive.mount('/content/drive')   # Mount your Google Drive so it appears in Colab's file system under /content/drive

import os                        # Import 'os' for file/directory operations
import pandas as pd              # Import pandas for working with data tables
from Bio import SeqIO            # Import SeqIO from Biopython for reading/writing biological sequence files (FASTA, GenBank, etc.)
import matplotlib.pyplot as plt  # Import Matplotlib's plotting library to create figures and graphs

print("✅ Dependencies installed & Drive mounted.")



## B. Course folders: where input data is loaded from and where outputs are saved

Edit only `LECTURE_CODE` and `TOPIC` if needed. All inputs will live in `Data/<LECTURE_CODE>_<TOPIC>` and outputs in `Outputs/<LECTURE_CODE>_<TOPIC>` (for this week, `Data/L01_foundations` and `Outputs/L01_foundations`).


In [None]:

# --- Course folder config (customize LECTURE_CODE/TOPIC only) ---
COURSE_DIR   = "/content/drive/MyDrive/Teaching/BIOINFO4-5203-F25"
LECTURE_CODE = "L01"            # change per week (e.g., L02, L03, ...)
TOPIC        = "foundations"    # short slug for the exercise

# Derived paths (do not change)
DATA_DIR   = f"{COURSE_DIR}/Data/{LECTURE_CODE}_{TOPIC}"
OUTPUT_DIR = f"{COURSE_DIR}/Outputs/{LECTURE_CODE}_{TOPIC}"

# Create folder structure if missing
for p in [f"{COURSE_DIR}/Data", f"{COURSE_DIR}/Outputs", f"{COURSE_DIR}/Notebooks", DATA_DIR, OUTPUT_DIR]:
    os.makedirs(p, exist_ok=True)

print("📁 COURSE_DIR :", COURSE_DIR)
print("📁 DATA_DIR   :", DATA_DIR)
print("📁 OUTPUT_DIR :", OUTPUT_DIR)


# **Playground:**

## C. Actual Bioinformatics Lesson
Analyze the FASTA DNA sequence placed in your L01 `Data` folder using Python code.

This step depends on a pre-placed FASTA file (e.g., your assigned sequence).

1. Put a small FASTA file into your data folder shown above (or use the demo created).  
2. Run the next cells to load, summarize, and plot your sequences.  
3. Confirm that outputs are written into your `Outputs/` folder for this week.
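
If you do not yet have an assigned sequence, a small demo file can be written with a few lines of Python. The sketch below is only an illustration: the helper name `write_demo_fasta`, the file name `demo.fasta`, and the two short sequences are assumptions chosen for this example, not course data. It writes into a temporary folder so it runs anywhere; in the notebook you would pass `DATA_DIR` instead.

```python
import os
import tempfile

# Hypothetical demo records: arbitrary IDs and short DNA sequences, for illustration only.
DEMO_RECORDS = {
    "demo_seq1": "ATGGCCATTGTAATGGGCCGCTGA",
    "demo_seq2": "ATGAAATTTGGGCCCTAA",
}

def write_demo_fasta(data_dir, filename="demo.fasta"):
    """Write DEMO_RECORDS to data_dir/filename in FASTA format and return the path."""
    os.makedirs(data_dir, exist_ok=True)
    path = os.path.join(data_dir, filename)
    with open(path, "w") as handle:
        for seq_id, seq in DEMO_RECORDS.items():
            # A FASTA record is a header line starting with '>' followed by the sequence
            handle.write(f">{seq_id}\n{seq}\n")
    return path

# Stand-alone demonstration in a temporary folder;
# in the notebook you would call write_demo_fasta(DATA_DIR) instead.
demo_dir = tempfile.mkdtemp()
demo_path = write_demo_fasta(demo_dir)
with open(demo_path) as handle:
    demo_text = handle.read()
print(demo_text)
```

Because the file name ends in `.fasta`, the search loop in the next cell should pick it up if it is the only FASTA file in the folder.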



In [None]:

# Find a FASTA file in DATA_DIR (sorted so the same file is picked on every run)
fasta_path = None
for fname in sorted(os.listdir(DATA_DIR)):
    if fname.lower().endswith((".fa", ".fasta", ".faa")):
        fasta_path = os.path.join(DATA_DIR, fname)
        break
assert fasta_path, "No FASTA file found in DATA_DIR. Add a FASTA file there and re-run this cell."

# Parse sequences
records = list(SeqIO.parse(fasta_path, "fasta"))
assert len(records) > 0, "No records found in FASTA."
ids = [r.id for r in records]
lengths = [len(r.seq) for r in records]

# Count the four DNA bases (A, C, G, T); other symbols (e.g., N or amino-acid letters) are not counted
def base_counts(seq):
    s = str(seq).upper()
    return {
        "A": s.count("A"),
        "C": s.count("C"),
        "G": s.count("G"),
        "T": s.count("T")
    }

comps = [base_counts(r.seq) for r in records]

# Build a summary table (pandas was already imported as pd in section A)
df = pd.DataFrame({
    "id": ids,
    "length": lengths,
    "A": [c["A"] for c in comps],
    "C": [c["C"] for c in comps],
    "G": [c["G"] for c in comps],
    "T": [c["T"] for c in comps],
})

print("🔎 Parsed:", fasta_path)
display(df)
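
The per-base counts in the table can be combined into other simple statistics. As one example, the sketch below computes a GC fraction from a counts dictionary shaped like the one `base_counts` returns. The function name `gc_fraction_from_counts` and the example counts are assumptions made for illustration. Dividing by the sum of the four counts rather than by the sequence length is also a choice of this sketch, made so that uncounted symbols such as `N` do not lower the result.

```python
def gc_fraction_from_counts(counts):
    """Return (G + C) / (A + C + G + T), or 0.0 if no A/C/G/T bases were counted."""
    total = counts["A"] + counts["C"] + counts["G"] + counts["T"]
    if total == 0:
        return 0.0  # avoid dividing by zero for empty or non-DNA input
    return (counts["G"] + counts["C"]) / total

# Made-up counts in the same shape as the dictionaries returned by base_counts()
example_counts = {"A": 3, "C": 2, "G": 3, "T": 2}
example_gc = gc_fraction_from_counts(example_counts)
print(f"GC fraction: {example_gc:.2f}")  # (3 + 2) / 10 = 0.50
```

In the notebook, the same function could be applied to each entry of `comps` to add a column to `df`.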



## C (continued). Translate the FASTA DNA sequences in your L01 `Data` folder using Python code

1. As above, the code needs a FASTA file in your data folder shown above.  
2. Run the next cell to translate each DNA sequence to protein.  
3. Confirm that `translated_proteins.fasta` is written into your `Outputs/` folder for this week.



In [None]:
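# Translate each DNA sequence to protein
print("Translating sequences to protein...\n")
translations = []
for record in records:
    # record.seq is already a Seq object; drop any trailing partial codon
    # so the translated length is a multiple of three
    usable_len = len(record.seq) - len(record.seq) % 3
    protein_seq = record.seq[:usable_len].translate(to_stop=True)  # stops at the first stop codon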
    translations.append((record.id, protein_seq))
    print(f">{record.id}")
    print(protein_seq)

# Save the translations to the Outputs folder in FASTA format
translated_path = f"{OUTPUT_DIR}/translated_proteins.fasta"
with open(translated_path, "w") as f:
    for rec_id, prot in translations:
        f.write(f">{rec_id}\n{prot}\n")
print(f"\n✅ Translations saved to: {translated_path}")
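
To see what translating "to the first stop codon" means, the following sketch reimplements the idea in plain Python. It is not Biopython's implementation: the function name `translate_to_stop` is hypothetical, the table is the standard genetic code written in the compact TCAG ordering, and mapping unrecognized codons to `X` is an assumption of this sketch.

```python
# Standard genetic code in compact form: codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate_to_stop(dna):
    """Translate DNA in reading frame 1, stopping before the first stop codon."""
    dna = dna.upper()
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored
    for start in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE.get(dna[start:start + 3], "X")  # 'X' = unrecognized codon
        if amino_acid == "*":  # '*' marks a stop codon (TAA, TAG, TGA)
            break
        protein.append(amino_acid)
    return "".join(protein)

# ATG -> M, GCC -> A, TAA -> stop, so the trailing GGG is never translated
example_protein = translate_to_stop("ATGGCCTAAGGG")
print(example_protein)  # MA
```

Note that translation always starts at the first base of the sequence, so a sequence that does not begin in frame will give a different protein.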

## D. Save outputs (CSV + PNG)

In [None]:

# Save CSV
csv_path = f"{OUTPUT_DIR}/seq_summary.csv"
df.to_csv(csv_path, index=False)
print("💾 Saved CSV ->", csv_path)

# Plot lengths
plt.figure()
df.set_index("id")["length"].plot(kind="bar")
plt.title("Sequence lengths")
plt.xlabel("Sequence ID"); plt.ylabel("Length")
png_path = f"{OUTPUT_DIR}/lengths_barplot.png"
plt.savefig(png_path, bbox_inches="tight")
plt.close()
print("🖼️ Saved PNG ->", png_path)

# Verify directory contents
print("📦 Output dir listing:", os.listdir(OUTPUT_DIR))


Now write the results summary by running this code:

## E. Results summary (copy into Canvas if requested)

In [None]:

summary_path = f"{OUTPUT_DIR}/summary.txt"
with open(summary_path, "w") as f:
    f.write(f"LECTURE={LECTURE_CODE}\n")
    f.write(f"TOPIC={TOPIC}\n")
    f.write(f"N_records={len(records)}\n")
    f.write(f"FASTA={os.path.basename(fasta_path)}\n")
print("📝 Saved summary ->", summary_path)

print("=== SUMMARY ===")
print("LECTURE=", LECTURE_CODE)
print("TOPIC=", TOPIC)
print("N_records=", len(records))
print("FASTA=", os.path.basename(fasta_path))



## F. Export & submit
- **File → Print → Save as PDF**, then upload the PDF and `.ipynb` to Canvas.  
- Ensure your `Outputs/` folder contains: `seq_summary.csv`, `lengths_barplot.png`, `translated_proteins.fasta`, and `summary.txt`.
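
The folder check can also be done in code rather than by eye. The sketch below is one possible way to do it: the helper name `missing_outputs` is hypothetical, and the expected file names are the ones written by the cells in this notebook. It runs against a temporary folder here so it works anywhere; in the notebook you would pass `OUTPUT_DIR`.

```python
import os
import tempfile

# File names written by the earlier cells of this notebook
EXPECTED_OUTPUTS = [
    "seq_summary.csv",
    "lengths_barplot.png",
    "translated_proteins.fasta",
    "summary.txt",
]

def missing_outputs(output_dir, expected=EXPECTED_OUTPUTS):
    """Return the expected file names that are not present in output_dir."""
    present = set(os.listdir(output_dir))
    return [name for name in expected if name not in present]

# Stand-alone demonstration: a temporary folder containing only summary.txt;
# in the notebook you would call missing_outputs(OUTPUT_DIR) instead.
check_dir = tempfile.mkdtemp()
open(os.path.join(check_dir, "summary.txt"), "w").close()
still_missing = missing_outputs(check_dir)
print("Missing:", still_missing)
```

An empty list means every expected file is in place and the folder is ready for submission.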


In [None]:
# Practice loop: print the numbers 1 through 7 (range stops before 8)
for day in range(1, 8):
    print(day)

In [None]:
# Practice loop: iterate over a list and print each day name
days_of_week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
for day in days_of_week:
    print(day)