<a href="https://colab.research.google.com/github/RobBurnap/Bioinformatics-MICR4203-MICR5203/blob/main/notebooks/Pairwise-Alignment_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


BIOINFO4/5203 — Colab Exercise Template

Use this template for every weekly exercise. It standardizes setup, data paths, and the final summary so grading in Canvas is quick.

Workflow

1. Click the "Open in Colab" link in Canvas (points to this notebook in GitHub).
2. Run Setup cells (installs and mounts Google Drive).
3. Run the Exercise cells (edit as instructed for each lecture).
4. Verify the Results Summary prints the values requested by Canvas.
5. File → Print → Save as PDF and upload .ipynb + PDF to Canvas.

Instructor note (delete in student copy if desired):

- Place datasets for this lecture at: Drive → BIOINFO4-5203-F25 → Data → Lxx_topic
- Update the constants in Config below: COURSE_DIR, LECTURE_CODE (e.g., L05), and TOPIC.
- For heavy jobs (trees, assemblies), provide the PETE output files in the same Data folder so students can analyze them here if the queue is busy.



**Auto‑setup + course folder (uses your Teaching path)**

## A. Mount Google Drive and Import the Libraries Used by Later Cells

In [1]:

# Install FIRST, then import
%pip install -q biopython       # Install the Biopython package quietly (-q suppresses most output) so we can work with biological sequence files

from google.colab import drive  # Import the module that lets Colab interact with Google Drive
drive.mount('/content/drive')   # Mount your Google Drive so it appears in Colab's file system under /content/drive

import os, pandas as pd          # Import 'os' for file/directory operations, and pandas for working with data tables
from Bio import SeqIO            # Import SeqIO from Biopython for reading/writing biological sequence files (FASTA, GenBank, etc.)
import matplotlib.pyplot as plt  # Import Matplotlib's plotting library to create figures and graphs

print("✅ Dependencies installed & Drive mounted.")


Mounted at /content/drive
✅ Dependencies installed & Drive mounted.



## B. Course folders: define where input data is read from and where outputs are saved

Edit only `LECTURE_CODE` and `TOPIC` if needed. All inputs live in `Data/<LECTURE_CODE>_<TOPIC>` and all outputs in `Outputs/<LECTURE_CODE>_<TOPIC>` (for example, `Data/L00_Template`).

**2) Set the course folder paths**

In [None]:

# --- Course folder config (customize LECTURE_CODE/TOPIC only) ---
COURSE_DIR   = "/content/drive/MyDrive/Teaching/BIOINFO4-5203-F25"
LECTURE_CODE = "L00"            # change per week (e.g., L02, L03, ...)
TOPIC        = "Template"    # short slug for the exercise

# Derived paths (do not change)
DATA_DIR   = f"{COURSE_DIR}/Data/{LECTURE_CODE}_{TOPIC}"
OUTPUT_DIR = f"{COURSE_DIR}/Outputs/{LECTURE_CODE}_{TOPIC}"

# Create folder structure if missing
for p in [f"{COURSE_DIR}/Data", f"{COURSE_DIR}/Outputs", f"{COURSE_DIR}/Notebooks", DATA_DIR, OUTPUT_DIR]:
    os.makedirs(p, exist_ok=True)

print("📁 COURSE_DIR :", COURSE_DIR)
print("📁 DATA_DIR   :", DATA_DIR)
print("📁 OUTPUT_DIR :", OUTPUT_DIR)


📁 COURSE_DIR : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25
📁 DATA_DIR   : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Data/L00_Template
📁 OUTPUT_DIR : /content/drive/MyDrive/Teaching/BIOINFO4-5203-F25/Outputs/L00_Template
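
To make the folder naming rule concrete, the sketch below rebuilds the two derived paths for a different week. The helper `lecture_paths` and the values `L05` and `PairwiseAlignment` are illustrative assumptions, not part of the template, and nothing is created on disk.

```python
from pathlib import PurePosixPath

def lecture_paths(course_dir, lecture_code, topic):
    """Return (data_dir, output_dir) following the <code>_<topic> naming rule (hypothetical helper)."""
    slug = f"{lecture_code}_{topic}"
    root = PurePosixPath(course_dir)
    return str(root / "Data" / slug), str(root / "Outputs" / slug)

# Hypothetical week: only the lecture code and topic change, the course root stays fixed
data_dir, output_dir = lecture_paths(
    "/content/drive/MyDrive/Teaching/BIOINFO4-5203-F25", "L05", "PairwiseAlignment"
)
print(data_dir)
print(output_dir)
```

Changing only the two short values keeps every week's inputs and outputs in parallel, predictable folders.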


## Alignment Scoring and Exploration (Python Lab)

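**3) Required Imports**

In [None]:
# Lecture 1.5: Alignment Scoring and Exploration (Python Lab)

# Required Imports
from Bio import pairwise2
# Bio.SubsMat.MatrixInfo is no longer shipped in recent Biopython releases;
# Bio.Align.substitution_matrices provides the same matrices.
from Bio.Align import substitution_matrices
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


**4) Load Substitution Matrices**

In [None]:
# Load Substitution Matrices
def load_matrix_as_dict(name):
    """Load a named matrix and return {(residue_a, residue_b): score} for every pair."""
    m = substitution_matrices.load(name)
    return {(a, b): int(m[a, b]) for a in m.alphabet for b in m.alphabet}

# Plain dictionaries keep the .get() lookups and pairwise2 calls below working unchanged
blosum62 = load_matrix_as_dict("BLOSUM62")
pam250 = load_matrix_as_dict("PAM250")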

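
Substitution scores are symmetric, and a matrix held as a dictionary may store only one ordering of each residue pair, which is why the scoring function later in this notebook also tries the reversed pair. The toy sketch below shows that lookup pattern in isolation. The three scores are BLOSUM62 values, while `toy_matrix` and `pair_score` are hypothetical names used only for illustration.

```python
# Toy excerpt: only ("W", "A") is stored, not ("A", "W")
toy_matrix = {("A", "A"): 4, ("W", "W"): 11, ("W", "A"): -3}

def pair_score(matrix, a, b, default=-4):
    """Look up a residue pair in either order; return `default` if neither order is present."""
    key = (a.upper(), b.upper())
    if key in matrix:
        return matrix[key]
    return matrix.get((key[1], key[0]), default)

print(pair_score(toy_matrix, "A", "W"))  # found through the reversed key
print(pair_score(toy_matrix, "a", "a"))  # lookup is case-insensitive
print(pair_score(toy_matrix, "A", "Z"))  # pair missing from the toy matrix, falls back to default
```

The fallback value is a design choice: an unknown residue is treated as a moderately bad mismatch rather than raising an error.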


**5) Step 1: Define Your Sequences (Edit Here)**

In [None]:
# Step 1: Define Your Sequences (Edit Here)
# ---------------------------------------------
# Enter two pre-aligned sequences below. Both strings must have the same length,
# with "-" marking each gap position.
# You can copy and paste sequences from a file or type them directly.

seq1 = "THIS-LINE"
seq2 = "ISALIGNED"



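
Because a manual alignment is scored column by column, the two strings have to line up exactly. The sketch below is one possible pre-check, assuming a hypothetical helper `check_aligned_pair` that is not part of the notebook. It reports a length mismatch and any column where both sequences hold a gap.

```python
def check_aligned_pair(a, b, gap="-"):
    """Return a list of problems found in a pre-aligned pair (an empty list means it looks fine)."""
    problems = []
    if len(a) != len(b):
        problems.append(f"length mismatch: {len(a)} vs {len(b)}")
    for column, (x, y) in enumerate(zip(a, b), start=1):
        if x == gap and y == gap:
            problems.append(f"column {column} is a gap in both sequences")
    return problems

print(check_aligned_pair("THIS-LINE", "ISALIGNED"))  # the notebook's default pair
print(check_aligned_pair("THIS-LINE", "ISALIGNE"))   # one residue short
print(check_aligned_pair("AC-T", "A--T"))            # shared gap column
```

Running a check like this before scoring avoids silently ignoring trailing residues when one string is longer than the other.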


In [None]:
# Step 2: Choose Scoring Parameters
# ------------------------------------
# Choose substitution matrix and gap penalties

substitution_matrix = blosum62  # options: blosum62 or pam250
gap_open_penalty = -10          # penalty for introducing a gap
gap_extend_penalty = -1         # penalty for extending a gap
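
With these two parameters a gap is charged on an affine scale: the first column of a gap run costs the open penalty and every further consecutive column costs the extend penalty. The short sketch below works through that arithmetic for the default values. `affine_gap_cost` is a hypothetical helper for illustration, not a function the notebook defines.

```python
def affine_gap_cost(length, gap_open=-10, gap_extend=-1):
    """Cost of one gap run: gap_open for the first column, gap_extend for each later column."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend

for k in (1, 2, 5):
    print(f"gap of length {k}: {affine_gap_cost(k)}")

# One long gap is much cheaper than the same number of isolated gaps
print("one gap of 5 :", affine_gap_cost(5))
print("five gaps of 1:", 5 * affine_gap_cost(1))
```

This is why affine penalties favor a few long gaps over many scattered ones, which is usually the more plausible pattern for insertions and deletions.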



In [None]:
# Step 3: Scoring Function for Aligned Sequences
# -------------------------------------------------
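def score_alignment(seq1, seq2, subst_matrix, gap_open=-10, gap_extend=-1):
    """Score a pre-aligned pair column by column with affine gap penalties."""
    if len(seq1) != len(seq2):
        # zip() would silently drop the extra columns of the longer string
        raise ValueError("Aligned sequences must have the same length (use '-' for gaps).")
    score = 0
    in_gap = False
    for a, b in zip(seq1, seq2):
        if a == '-' or b == '-':
            # First column of a gap run pays gap_open; each later column pays gap_extend
            if not in_gap:
                score += gap_open
                in_gap = True
            else:
                score += gap_extend
        else:
            in_gap = False
            pair = (a.upper(), b.upper())
            # Try both orderings of the pair; fall back to -4 for residues missing from the matrix
            score += subst_matrix.get(pair, subst_matrix.get((pair[1], pair[0]), -4))
    return score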

# 🔍 Step 4: Score Your Alignment
# -------------------------------
print("Manual alignment score:", score_alignment(seq1, seq2, substitution_matrix, gap_open_penalty, gap_extend_penalty))

# ⚙️ Step 5: Biopython's Built-In Global Alignment (Reference)
# -------------------------------------------------------------
# This shows how software would align the sequences from scratch.

alignments = pairwise2.align.globalds(seq1.replace("-", ""), seq2.replace("-", ""), substitution_matrix, gap_open_penalty, gap_extend_penalty)
print("\nBiopython alignment (no gaps input manually):")
print(pairwise2.format_alignment(*alignments[0]))

# 📊 Step 6: Heatmap of Pairwise Scores (Optional)
# -------------------------------------------------
def plot_score_heatmap(sequences, subst_matrix):
    # Square table of global alignment scores, labelled by the sequences themselves
    n = len(sequences)
    score_matrix = pd.DataFrame(index=sequences, columns=sequences, dtype=float)
    for i in range(n):
        for j in range(n):
            # score_only=True returns just the best score instead of building alignments.
            # Gap penalties are read from the Step 2 variables.
            score_matrix.iloc[i, j] = pairwise2.align.globalds(
                sequences[i], sequences[j], subst_matrix,
                gap_open_penalty, gap_extend_penalty, score_only=True)
    sns.heatmap(score_matrix, annot=True, cmap="coolwarm")
    plt.title("Pairwise Alignment Scores")
    plt.show()

# 🧪 Example Heatmap (Optional for Students)
seqlist = ["MENSDS", "MENGES", "MENNDS", "MENSD"]
plot_score_heatmap(seqlist, substitution_matrix)

# 📚 Student Exercises:
# 1. Try scoring different pairs of sequences with gap variations.
# 2. Modify gap_open_penalty and gap_extend_penalty and see how the alignment changes.
# 3. Compare scoring using BLOSUM62 vs. PAM250.
# 4. Create your own alignment and evaluate its score.

# 💬 Reflection:
# - Why do similar alignments have different scores in PAM vs. BLOSUM?
# - What does a high or low score tell us biologically?
# - When would you choose a more or less stringent substitution matrix?
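
Step 5 hands the alignment problem to Biopython. As a rough picture of what happens underneath, the sketch below computes a global alignment score with the Needleman-Wunsch dynamic programming recurrence. It is deliberately simplified and is not Biopython's implementation: it uses fixed match and mismatch scores instead of a substitution matrix, a single linear gap penalty instead of separate open and extend penalties, and it returns only the score. The function name and default values are assumptions chosen for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty (simplified sketch)."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap        # a[i-1] aligned to a gap
            gap_in_a = curr[j - 1] + gap    # b[j-1] aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]

print(global_alignment_score("MENSDS", "MENSDS"))  # identical sequences
print(global_alignment_score("MENSDS", "MENSD"))   # one residue missing
print(global_alignment_score("MENSDS", "MENGES"))  # two substitutions
```

Each cell only needs the row above and the cell to its left, so keeping two rows in memory is enough when only the score is required. Recovering the alignment itself would also need a traceback through the full table.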