# Pattern Hunters: Introduction to Biological Sequences
## Understanding DNA and Protein Data for Phylogenetic Analysis

**For BSc Zoology Students**

---

### Learning Objectives
By the end of this notebook, you will:
1. Understand how biological sequences are represented digitally
2. Learn to retrieve real sequences from public databases
3. Compare sequences to identify similarities and differences
4. Recognize why sequences must be aligned before they can be compared reliably

### Why This Matters
Understanding evolution isn't about memorizing trees; it's about analyzing actual biological data. Most phylogenetic trees in modern textbooks were built from real DNA or protein sequences from living organisms.

---

## Part 1: Setting Up Our Tools

We'll use Biopython - a free toolkit used by researchers worldwide.

In [None]:
# Install required packages
!pip install biopython -q

# Importing the package confirms that the installation actually worked
import Bio
print(f"✓ Biopython {Bio.__version__} is ready to use!")
print("You now have the same tools professional bioinformaticians use.")

In [None]:
# Import the tools we need
from Bio import Entrez, SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import pandas as pd

# NCBI asks for an email address with every Entrez request so it can contact you if there is a problem
Entrez.email = "your.email@example.com"  # Replace with your email

print("✓ Libraries imported successfully!")

## Part 2: What Does DNA Look Like in a Computer?

DNA is made of four nucleotides: **A**denine, **T**hymine, **G**uanine, **C**ytosine.
In computers, we represent DNA as simple text strings.

In [None]:
# Let's create a simple DNA sequence
dna_sequence = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

print("Our DNA sequence:")
print(dna_sequence)
print(f"\nLength: {len(dna_sequence)} nucleotides")
print(f"\nNucleotide composition:")
print(f"  A: {dna_sequence.count('A')}")
print(f"  T: {dna_sequence.count('T')}")
print(f"  G: {dna_sequence.count('G')}")
print(f"  C: {dna_sequence.count('C')}")

In [None]:
# DNA can be transcribed to RNA
rna_sequence = dna_sequence.transcribe()
print("DNA:", dna_sequence)
print("RNA:", rna_sequence)
print("\nNotice: T → U (Thymine becomes Uracil)")

In [None]:
# And translated to protein
protein_sequence = dna_sequence.translate()
print("DNA:    ", dna_sequence)
print("Protein:", protein_sequence)
print("\nNotice: Every 3 nucleotides (codon) codes for 1 amino acid")
print(f"{len(dna_sequence)} nucleotides → {len(protein_sequence)} codons translated (* marks a stop codon)")
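
The translation step can also be written out by hand to see what happens codon by codon. The sketch below is a minimal pure-Python version that assumes the standard genetic code; `translate_dna` and `CODON_TABLE` are illustrative names, not part of Biopython. Biopython's own `translate()` accepts `to_stop=True` to stop at the first stop codon and a `table` argument for other genetic codes (for example `table=2` for vertebrate mitochondrial genes).

```python
# Build the standard genetic code from a compact 64-letter string.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with T
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_dna(dna, to_stop=False):
    """Translate a DNA string codon by codon; '*' marks a stop codon."""
    protein = []
    # Step through complete codons only; 1-2 leftover bases are ignored.
    for pos in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3].upper()]
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


example = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate_dna(example))                # reads through every codon
print(translate_dna(example, to_stop=True))  # stops at the first stop codon
```

The example sequence contains a stop codon (TGA) in the middle, so the two calls give different results: a real protein would end at the first stop.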

## Part 3: Retrieving Real Sequences from NCBI

NCBI (National Center for Biotechnology Information) maintains GenBank - a huge database of all publicly available DNA sequences. Let's get real primate sequences!

In [None]:
def fetch_sequence(accession_id, sequence_name):
    """
    Fetch a sequence from NCBI GenBank
    
    Parameters:
    accession_id: The GenBank accession number
    sequence_name: A friendly name for the sequence
    """
    try:
        # The with-block closes the network handle even if parsing fails
        with Entrez.efetch(db="nucleotide", id=accession_id, rettype="fasta", retmode="text") as handle:
            record = SeqIO.read(handle, "fasta")
        record.id = sequence_name
        record.description = f"{sequence_name} - {accession_id}"
        return record
    except Exception as e:
        print(f"Error fetching {sequence_name}: {e}")
        return None

print("✓ Function ready to fetch sequences from GenBank")
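
The function above asks NCBI for FASTA text and lets `SeqIO.read` parse it. To show what that format looks like, here is a minimal sketch of a FASTA parser in plain Python. `parse_fasta` is a hypothetical helper written for illustration, and the two records are made up; in practice `SeqIO` handles this and many edge cases for you.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header line closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are joined, so line wrapping does not matter
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up example records (not real GenBank entries)
example_text = """>demo_1 hypothetical example sequence
ATGGCCATTG
TAATGGGCCG
>demo_2 another hypothetical sequence
ATGGCC
"""
for header, sequence in parse_fasta(example_text):
    print(header.split()[0], len(sequence), sequence)
```

Each record starts with a `>` header line, and the sequence may be wrapped over any number of lines after it.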

### Let's Get Real Primate Cytochrome B Sequences

Cytochrome b is a mitochondrial gene commonly used in phylogenetic studies because:
- It's present in all animals
- It evolves at a useful rate (not too fast, not too slow)
- It's relatively short and easy to sequence

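We'll get sequences from primates to study human evolution. Note that the `NC_` accession numbers used below are RefSeq records for complete mitochondrial genomes (roughly 16,500 nucleotides each); the cytochrome b gene is one region within each of them.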

In [None]:
# Define our species and their RefSeq accession numbers
# These are real records from NCBI! Each one is a complete mitochondrial
# genome, which includes the cytochrome b gene.
primate_sequences = {
    "Human": "NC_012920.1",              # Homo sapiens
    "Chimpanzee": "NC_001643.1",         # Pan troglodytes
    "Gorilla": "NC_011120.1",            # Gorilla gorilla
    "Orangutan": "NC_002083.1",          # Pongo pygmaeus
    "Gibbon": "NC_002082.1",             # Hylobates lar
    "Rhesus_Monkey": "NC_005943.1",      # Macaca mulatta
}

print("We will fetch mitochondrial sequences (which include cytochrome b) from:")
for species, accession in primate_sequences.items():
    print(f"  • {species}: {accession}")

In [None]:
# Fetch the sequences
print("Fetching sequences from NCBI GenBank...\n")

sequences = []
for species, accession in primate_sequences.items():
    print(f"Downloading {species}...", end=" ")
    record = fetch_sequence(accession, species)
    if record:
        sequences.append(record)
        print("✓")

print(f"\n✓ Successfully fetched {len(sequences)} sequences!")
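
When a downloaded record covers more than the gene of interest (a complete mitochondrial genome, for example), the gene has to be cut out using its coordinates. GenBank feature tables give coordinates that are 1-based and inclusive, while Python slices are 0-based and exclude the end. The sketch below shows the conversion on a toy string. `extract_region` is a hypothetical helper, and the coordinates are made up; real coordinates would come from the record's feature table (for example by fetching with `rettype="gb"`).

```python
def extract_region(genome, start, end):
    """Return genome[start..end] using 1-based, inclusive coordinates."""
    if not 1 <= start <= end <= len(genome):
        raise ValueError("coordinates fall outside the sequence")
    # Python slices are 0-based and exclude the end index,
    # so subtract 1 from the start and leave the end unchanged.
    return genome[start - 1:end]


# Toy "genome" with a made-up gene in the middle, for illustration only
toy_genome = "CCCCC" + "ATGGCCATTTAG" + "GGGGG"
gene = extract_region(toy_genome, start=6, end=17)
print(gene, len(gene))
```

The length of the extracted region is always `end - start + 1`, which is a quick check that the conversion was done correctly.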

## Part 4: Examining Our Sequences

Let's look at what we downloaded.

In [None]:
# Create a summary table
summary_data = []
for seq in sequences:
    summary_data.append({
        'Species': seq.id,
        'Length': len(seq.seq),
        'A count': seq.seq.count('A'),
        'T count': seq.seq.count('T'),
        'G count': seq.seq.count('G'),
        'C count': seq.seq.count('C'),
        'GC%': round(((seq.seq.count('G') + seq.seq.count('C')) / len(seq.seq)) * 100, 1)
    })

df = pd.DataFrame(summary_data)
print("Sequence Summary:")
print(df.to_string(index=False))
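
Database sequences sometimes contain symbols other than A, T, G and C, such as `N` for an unknown base. Counting only the four letters silently ignores them, while dividing by the full length includes them in the GC% denominator. The sketch below is one possible way to make that visible, using only the standard library; `base_composition` is an illustrative name and the input string is made up.

```python
from collections import Counter


def base_composition(sequence):
    """Count every symbol and compute GC% over unambiguous bases only."""
    counts = Counter(sequence.upper())
    unambiguous = sum(counts[base] for base in "ACGT")
    gc = counts["G"] + counts["C"]
    gc_percent = round(100 * gc / unambiguous, 1) if unambiguous else 0.0
    # Anything that is not A, C, G or T (for example N) is reported separately
    other = sum(n for base, n in counts.items() if base not in "ACGT")
    return counts, gc_percent, other


counts, gc_percent, other = base_composition("ATGCNNGGCCAT")
print(dict(counts))
print(f"GC%: {gc_percent}, other symbols: {other}")
```

If `other` is zero for every sequence, the simpler table above and this version give the same GC%.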

In [None]:
# Let's look at the first 100 nucleotides of each sequence
print("First 100 nucleotides of each sequence:\n")
for seq in sequences:
    print(f"{seq.id:15} {str(seq.seq[:100])}")

## Part 5: Comparing Sequences - Finding Similarities

**Question**: How similar are human and chimpanzee sequences?

Let's write a simple function to compare two sequences position by position.

In [None]:
def compare_sequences(seq1, seq2, name1, name2):
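    """
    Compare two sequences position by position and report percent identity.

    No alignment is performed: position i of seq1 is compared with position i
    of seq2, and only the first min(len(seq1), len(seq2)) positions are used.
    """
    # Limit the comparison to the length of the shorter sequence
    min_length = min(len(seq1), len(seq2))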
    
    matches = 0
    for i in range(min_length):
        if seq1[i] == seq2[i]:
            matches += 1
    
    similarity = (matches / min_length) * 100
    differences = min_length - matches
    
    print(f"Comparison: {name1} vs {name2}")
    print(f"  Compared length: {min_length} nucleotides")
    print(f"  Matching positions: {matches}")
    print(f"  Different positions: {differences}")
    print(f"  Similarity: {similarity:.2f}%")
    print()
    
    return similarity

print("✓ Comparison function ready")

In [None]:
# Compare human with each other species
human_seq = [s for s in sequences if s.id == "Human"][0]

print("How similar is human to other primates?\n")

for seq in sequences:
    if seq.id != "Human":
        similarity = compare_sequences(human_seq.seq, seq.seq, "Human", seq.id)

## Part 6: Visualizing Differences

Let's create a visual comparison of a small region.

In [None]:
def visualize_alignment(sequences, start=0, length=60):
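    """
    Print the sequences stacked by raw position and mark conserved columns.
    The sequences are not aligned here; each column is simply the same
    position number in every sequence.
    """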
    print(f"Positions {start+1} to {start+length}:\n")
    
    # Print position markers
    print(" " * 15, end="")
    for i in range(start, start + length, 10):
        print(f"{i+1:<10}", end="")
    print("\n")
    
    # Print sequences
    for seq in sequences:
        print(f"{seq.id:15}", end="")
        segment = str(seq.seq[start:start+length])
        print(segment)
    
    # Print conservation line
    print(f"{'Conservation':15}", end="")
    for i in range(start, start + length):
        nucleotides = set([str(seq.seq[i]) for seq in sequences if len(seq.seq) > i])
        if len(nucleotides) == 1:
            print("*", end="")  # Conserved position
        else:
            print(".", end="")  # Variable position
    print("\n")
    print("* = conserved (same in all species)")
    print(". = variable (differs among species)")

# Visualize the first 60 nucleotides
visualize_alignment(sequences, start=0, length=60)

## Part 7: Creating a Pairwise Similarity Matrix

To understand evolutionary relationships, we need to compare ALL species with ALL other species.

In [None]:
# Create a similarity matrix
import numpy as np

species_names = [seq.id for seq in sequences]
n = len(sequences)
similarity_matrix = np.zeros((n, n))

for i in range(n):
    for j in range(n):
        if i == j:
            similarity_matrix[i][j] = 100.0  # 100% similar to itself
        else:
            seq1 = sequences[i].seq
            seq2 = sequences[j].seq
            min_length = min(len(seq1), len(seq2))
            matches = sum(1 for k in range(min_length) if seq1[k] == seq2[k])
            similarity_matrix[i][j] = (matches / min_length) * 100

# Create a DataFrame for nice display
similarity_df = pd.DataFrame(similarity_matrix, 
                            index=species_names, 
                            columns=species_names)

print("Pairwise Sequence Similarity Matrix (%):\n")
print(similarity_df.round(2))
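
Because the similarity of A to B equals the similarity of B to A, only half of the matrix needs to be computed. The sketch below shows this with `itertools.combinations` on three made-up sequences; `percent_identity` and `pairwise_identity` are illustrative helpers, and the result is a plain dictionary of dictionaries rather than a NumPy array.

```python
from itertools import combinations


def percent_identity(seq1, seq2):
    """Percent of matching positions over the shorter sequence (no alignment)."""
    compared = min(len(seq1), len(seq2))
    if compared == 0:
        return 0.0
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return 100 * matches / compared


def pairwise_identity(named_sequences):
    """Build a symmetric matrix, computing each pair only once."""
    names = list(named_sequences)
    matrix = {name: {name: 100.0} for name in names}
    for first, second in combinations(names, 2):
        value = percent_identity(named_sequences[first], named_sequences[second])
        matrix[first][second] = value
        matrix[second][first] = value  # mirror across the diagonal
    return matrix


# Made-up toy sequences of equal length
toy = {"sp_A": "ATGGCCATTG", "sp_B": "ATGGCCATTA", "sp_C": "ATGACCGTTA"}
result = pairwise_identity(toy)
for name in toy:
    print(name, [result[name][other] for other in toy])
```

For six species this saves little time, but the number of pairs grows quickly as more species are added.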

In [None]:
# Visualize as a heatmap
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 8))
sns.heatmap(similarity_df, annot=True, fmt='.1f', cmap='YlGnBu', 
            cbar_kws={'label': 'Similarity (%)'})
plt.title('Primate Mitochondrial Sequence Similarity (Unaligned)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("• Darker colors = higher similarity")
print("• Diagonal is always 100% (species compared to itself)")
print("• Which species is most similar to humans?")
print("• Which is most different?")

## Part 8: Questions to Think About

**Before moving to the next notebook, consider:**

1. **Sequence Length**: Did all our sequences have exactly the same length? Why might they differ?

2. **Similarity Patterns**: 
   - Which species was most similar to humans?
   - Does this match what you know about primate evolution?
   - Why is human-chimpanzee similarity not 100%?

3. **Conservation**: 
   - Some positions were conserved (*), others variable (.)
   - Why might some positions be more conserved than others?
   - What does conservation tell us about natural selection?

4. **The Alignment Problem**:
   - We compared sequences position-by-position
   - But what if one species has an insertion or deletion?
   - How would that affect our similarity calculation?

This last question leads us to the next notebook: **Multiple Sequence Alignment**
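
The effect of a single deletion can be tried out on a small example. The sketch below removes one base from a sequence and compares the result with the original position by position, then repeats the comparison with a gap character (`-`) marking the missing base. The sequences and the `percent_identity` helper are made up for illustration; this is not an alignment algorithm, since the gap is placed by hand.

```python
def percent_identity(seq1, seq2):
    """Position-by-position identity over the shorter sequence; '-' never matches."""
    compared = min(len(seq1), len(seq2))
    matches = sum(a == b and a != "-" for a, b in zip(seq1, seq2))
    return 100 * matches / compared


original = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
# Hypothetical relative that lost one base (index 3) but is otherwise identical
deleted = original[:3] + original[4:]
# The same sequence with a gap character restoring the column positions
gapped = original[:3] + "-" + original[4:]

print(f"Unaligned:  {percent_identity(original, deleted):.1f}%")
print(f"With a gap: {percent_identity(original, gapped):.1f}%")
```

Without the gap, every base after the deletion is compared with the wrong partner and the similarity drops sharply, even though the two sequences differ by a single event. Finding where such gaps belong is the job of an alignment program.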

---

## Summary

**What You've Learned:**
1. ✓ DNA sequences are represented as strings of letters (A, T, G, C)
2. ✓ Real sequences can be downloaded from NCBI GenBank
3. ✓ We can compare sequences to measure similarity
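4. ✓ Humans and chimpanzees are each other's closest living relatives; the often-quoted ~98-99% identity comes from aligned nuclear DNA, so our unaligned mitochondrial comparison can give quite different numbers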
5. ✓ Even closely related species have differences

**Skills Acquired:**
- Using Biopython to handle biological sequences
- Fetching data from public databases
- Calculating sequence similarity
- Creating similarity matrices

**Next Steps:**
Move to Notebook 2: **Multiple Sequence Alignment** to learn how to properly align sequences before building phylogenetic trees.

---

### Save Your Sequences for the Next Notebook

In [None]:
# Save sequences to a FASTA file
SeqIO.write(sequences, "primate_cytb.fasta", "fasta")
print("✓ Sequences saved to 'primate_cytb.fasta'")
print("You can use this file in the next notebook!")