# Lab 4.2: Phylogenetic Tree Builder
## Unit 4: Speciation & Human Evolution

### 🎯 Learning Objectives
- Perform sequence alignment and calculate genetic distances
- Construct phylogenetic trees using UPGMA
- Interpret tree topology and branch lengths
- Apply a molecular clock to estimate divergence times
- Analyze primate phylogeny (Dryopithecus → Homo sapiens)

### 📖 Connection to Course
Covers **Phylogenetic Methods** from Unit 4: Building and interpreting evolutionary trees

### 🌳 The Big Question
**How can we reconstruct evolutionary history?** Let's build trees from DNA!

In [None]:
# === GOOGLE COLAB SETUP ===
try:
    from google.colab import output
    output.enable_custom_widget_manager()
    print("✓ Widgets enabled")
except ImportError:
    print("✓ Running outside Colab")

import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from ipywidgets import Button, FloatSlider, interact
from IPython.display import display, HTML
from datetime import datetime
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

print("✓ Libraries loaded!")

## Part 1: Phylogenetic Methods

### What is Phylogenetics?

Phylogenetics is the study of evolutionary relationships among species, inferred from molecular and morphological data.

### Key Concepts

**Phylogenetic Tree:**
- Diagram showing evolutionary relationships
- Branch points (nodes) = common ancestors
- Branch lengths = amount of change
- Tips = current species
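
To make these terms concrete, here is a minimal sketch of one way a small tree could be stored in Python and written out in Newick format, a common text format for trees. The taxon names (`Sp1`, `Sp2`, `Sp3`), the branch lengths, and the helper names `to_newick` and `tips` are all made up for illustration.

```python
def to_newick(node):
    """Return a Newick string for a tip (str) or an internal node.

    An internal node is a tuple of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return node  # a tip is just its label
    parts = [f"{to_newick(child)}:{length}" for child, length in node]
    return "(" + ",".join(parts) + ")"


def tips(node):
    """List the tip labels below a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [name for child, _ in node for name in tips(child)]


# Hypothetical toy tree: Sp1 and Sp2 share a node that joins Sp3 at the root
toy_tree = (
    ((("Sp1", 0.1), ("Sp2", 0.1)), 0.2),  # internal node with its branch length
    ("Sp3", 0.3),                          # tip with its branch length
)

newick = to_newick(toy_tree) + ";"
print(newick)          # ((Sp1:0.1,Sp2:0.1):0.2,Sp3:0.3);
print(tips(toy_tree))  # ['Sp1', 'Sp2', 'Sp3']
```

Each pair of parentheses is a node (a common ancestor), each number after a colon is a branch length, and each name is a tip.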

**Homology vs Analogy:**
- **Homology**: Similarity due to common ancestry (use for trees!)
- **Analogy**: Similarity due to convergence (misleading!)

### Methods

**1. Distance-Based (UPGMA)**
- Calculate pairwise distances
- Cluster most similar sequences
- Fast, simple
- Assumes molecular clock

**2. Maximum Parsimony**
- Find tree with fewest mutations
- Occam's Razor approach
- No clock assumption

**3. Maximum Likelihood**
- Statistical approach
- Most probable tree given data
- Computationally intensive
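
This lab only implements UPGMA, but the parsimony idea of "fewest mutations" can be illustrated with a short sketch. The code below uses Fitch's algorithm to count the minimum number of changes one alignment column needs on a given tree. The taxon names `W` to `Z`, the column, and both candidate trees are hypothetical.

```python
def fitch(node, states):
    """Return (possible_states, change_count) for one alignment column.

    node   : a tip name (str) or a 2-tuple (left, right) of subtrees
    states : dict mapping tip name -> observed base at this site
    """
    if isinstance(node, str):
        return {states[node]}, 0
    left_set, left_cost = fitch(node[0], states)
    right_set, right_cost = fitch(node[1], states)
    shared = left_set & right_set
    if shared:
        # Children can agree on a base: no extra change needed at this node
        return shared, left_cost + right_cost
    # Children disagree: keep both options and count one change
    return left_set | right_set, left_cost + right_cost + 1


site = {"W": "A", "X": "A", "Y": "G", "Z": "G"}  # hypothetical alignment column
tree_1 = (("W", "X"), ("Y", "Z"))
tree_2 = (("W", "Y"), ("X", "Z"))

score_1 = fitch(tree_1, site)[1]
score_2 = fitch(tree_2, site)[1]
print(score_1, score_2)  # 1 2
```

For this column, `tree_1` needs one change and `tree_2` needs two, so parsimony prefers `tree_1`. A full analysis would sum these scores over every column and search across many trees.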

### Genetic Distance Models

**p-distance (simplest):**
## d = differences / length

**Jukes-Cantor (corrects for multiple hits; p = p-distance):**
## d = -3/4 ln(1 - 4p/3)

**Kimura 2-parameter (transitions vs transversions; P = proportion of sites with a transition, Q = proportion with a transversion):**
## d = -1/2 ln[(1-2P-Q)√(1-2Q)]
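
The three models can be compared side by side on one pair of sequences. The sketch below assumes two equal-length, already aligned sequences. The function name `distance_models` and both sequences are hypothetical; the pair was built to differ by two transitions and one transversion.

```python
import math

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T)
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def distance_models(seq1, seq2):
    """Return p-distance, Jukes-Cantor and Kimura 2-parameter distances."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    n = len(seq1)
    ts = sum(1 for a, b in zip(seq1, seq2) if (a, b) in TRANSITIONS)
    tv = sum(1 for a, b in zip(seq1, seq2)
             if a != b and (a, b) not in TRANSITIONS)
    p = (ts + tv) / n          # proportion of differing sites
    P, Q = ts / n, tv / n      # transition and transversion proportions
    jc = -0.75 * math.log(1 - 4 * p / 3)
    k2p = -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))
    return {"p": p, "JC": jc, "K2P": k2p}


# Hypothetical 20 bp pair: A->G and C->T (transitions), G->T (transversion)
seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "GCGTATGTACTTACGTACGT"

d = distance_models(seq_a, seq_b)
for model, value in d.items():
    print(f"{model:>4}: {value:.4f}")
```

Both corrected distances come out larger than the raw p-distance, because they estimate substitutions that were hidden by later changes at the same site.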

### Molecular Clock

**Assumption:** Substitutions accumulate at a roughly constant rate over time

**Formula:**
## T = d / (2r)

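Where:
- T = time since divergence (in million years when r is given per million years)
- d = genetic distance between the two sequences (substitutions per site)
- r = substitution rate per site, per lineage
- The factor of 2 appears because both lineages accumulate substitutions after the split

**Typical rates (order of magnitude, per lineage):**
- Mammalian nuclear DNA: ~0.1% per million years
- mtDNA: ~10× faster (~1% per million years)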
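
As a quick worked example of the formula, the sketch below computes a divergence time and then runs the calculation backwards. The function name `divergence_time` and the input values (d = 0.07, r = 0.01 per Myr) are illustrative assumptions, not measurements.

```python
def divergence_time(d, r):
    """Molecular clock: T = d / (2r).

    d : genetic distance between two sequences (substitutions per site)
    r : substitution rate per site per lineage, per million years
    Returns T in million years.
    """
    if r <= 0:
        raise ValueError("rate must be positive")
    return d / (2 * r)


d_example = 0.07   # assumed distance (7% of sites differ after correction)
r_example = 0.01   # assumed rate (1% per Myr per lineage)

t = divergence_time(d_example, r_example)
print(f"T = {t:.2f} million years")

# Running the clock forward again: expected distance = 2 * r * T
d_back = 2 * r_example * t
print(f"Expected distance after {t:.2f} Myr: {d_back:.3f}")
```

Halving the rate doubles the estimated time, so the result is only as good as the calibration of r.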

### UPGMA Algorithm

1. Start with distance matrix
2. Find closest pair → join
3. Calculate new distances (average)
4. Repeat until one cluster
5. Draw tree from cluster history
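
The steps above can be sketched in plain Python. This is a minimal teaching version, not the implementation used later in the lab (which relies on SciPy). The function name `upgma`, the taxa `X`, `Y`, `Z`, and their distances are hypothetical.

```python
from itertools import combinations


def upgma(names, dist):
    """Minimal UPGMA sketch.

    names : list of tip labels
    dist  : dict mapping frozenset({a, b}) -> distance
    Returns a list of merges as (cluster_a, cluster_b, node_height).
    """
    clusters = {name: 1 for name in names}  # cluster label -> number of tips
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: d[frozenset(pair)])
        height = d[frozenset((a, b))] / 2  # the new node sits at half the distance
        new = f"({a},{b})"
        na, nb = clusters[a], clusters[b]
        # Step 3: distance to the new cluster is a size-weighted average
        for c in clusters:
            if c not in (a, b):
                d[frozenset((new, c))] = (
                    na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
                ) / (na + nb)
        del clusters[a], clusters[b]
        clusters[new] = na + nb
        merges.append((a, b, height))
    return merges


# Hypothetical three-taxon distance matrix
toy_dist = {
    frozenset(("X", "Y")): 0.2,
    frozenset(("X", "Z")): 0.6,
    frozenset(("Y", "Z")): 0.8,
}

merges = upgma(["X", "Y", "Z"], toy_dist)
for a, b, height in merges:
    print(f"join {a} + {b} at height {height:.2f}")
```

Here `X` and `Y` join first at height 0.10, and `Z` joins them at height 0.35, which is half of the averaged distance (0.6 + 0.8) / 2 = 0.7.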

## Part 2: Primate DNA Database

In [None]:
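# Simplified primate cytochrome b sequences (short partial fragments, for demonstration)
# In reality these would be full ~1140 bp sequences
# Note: the Human and Chimp fragments below are identical, so their distance is 0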
primate_sequences = {
    'Human': 'ATGACCAACATCCGAAAATCACACCCACTATTAAAAATTATTAACAACTCATTCATCGACCTCCCCACCCCATCCAAC',
    'Chimp': 'ATGACCAACATCCGAAAATCACACCCACTATTAAAAATTATTAACAACTCATTCATCGACCTCCCCACCCCATCCAAC',
    'Gorilla': 'ATGACCAACATCCGAAAATCACACCCACTATTAAAAATTATTAACAACTCGTTCATCGACCTCCCCGCCCCATCTAAC',
    'Orangutan': 'ATGACCAACATCCGAAAATCACATCCACTATTAAAAATCATTAACAACTCATTTATCGACCTCCCCACCCCATCCAAC',
    'Gibbon': 'ATGACCAACATTCGAAAGTCACACCCACTATCAAAAATTATTAACAACTCGTTCATTGACTTACCCACCCCGTCTAAC',
    'Baboon': 'ATGACTAACATCCGGAAATCACACCCCCTATTAAAAATTATTAATAACTCATTCATTGACCTGCCCACCCCATCCAAC',
    'Macaque': 'ATGACTAACATCCGAAAATCACACCCCCTATTAAAAATTATTAATAACTCATTCATTGACCTGCCCACCCCATCCAAC',
    'Marmoset': 'ATGACTAATATCCGCAAATCACACCCCCTATCAAAAATTATTAACAACTCATTTATCGACTTACCCACACCATCCAAC',
    'Lemur': 'ATGACCAATATCCGAAAATCACACCCTCTATCCAGAATTATTAACAACTCCTTTATCGATCTCCCAACCCCGTCTAAC'
}

# Additional info
primate_info = {
    'Human': {'group': 'Great Ape', 'divergence_mya': 0},
    'Chimp': {'group': 'Great Ape', 'divergence_mya': 6},
    'Gorilla': {'group': 'Great Ape', 'divergence_mya': 8},
    'Orangutan': {'group': 'Great Ape', 'divergence_mya': 14},
    'Gibbon': {'group': 'Lesser Ape', 'divergence_mya': 18},
    'Baboon': {'group': 'Old World Monkey', 'divergence_mya': 25},
    'Macaque': {'group': 'Old World Monkey', 'divergence_mya': 25},
    'Marmoset': {'group': 'New World Monkey', 'divergence_mya': 40},
    'Lemur': {'group': 'Prosimian', 'divergence_mya': 65}
}

print("PRIMATE PHYLOGENY DATABASE")
print("="*70)
print(f"{'Species':<15}{'Group':<20}{'Divergence (Mya)'}")
print("="*70)
for sp, info in primate_info.items():
    print(f"{sp:<15}{info['group']:<20}{info['divergence_mya']}")
print("\n✓ Database ready!")

## Part 3: Genetic Distance Calculator

In [None]:
def calculate_distances(sequences):
    """
    Calculate pairwise Jukes-Cantor distances for pre-aligned sequences.
    Returns (DataFrame labeled by species, raw NumPy distance matrix).
    """
    species = list(sequences.keys())
    n = len(species)
    
    # Initialize distance matrix
    distances = np.zeros((n, n))
    
    # Calculate pairwise distances
    for i in range(n):
        for j in range(i+1, n):
            seq1 = sequences[species[i]]
            seq2 = sequences[species[j]]
            
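            # Count mismatches over the aligned positions (zip stops at the shorter sequence)
            aligned_len = min(len(seq1), len(seq2))
            diffs = sum(1 for a, b in zip(seq1, seq2) if a != b)
            
            # p-distance: proportion of sites that differ
            p_dist = diffs / aligned_len
            
            # Jukes-Cantor correction (undefined for p >= 0.75, where the log argument is <= 0)
            if p_dist < 0.75:
                jc_dist = -0.75 * np.log(1 - 4*p_dist/3)
            else:
                jc_dist = p_dist  # Crude fallback for saturated pairs; underestimates the true distance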
            
            distances[i,j] = jc_dist
            distances[j,i] = jc_dist
    
    # Create DataFrame
    df = pd.DataFrame(distances, index=species, columns=species)
    
    return df, distances

# Calculate for all primates
dist_df, dist_matrix = calculate_distances(primate_sequences)

print("\nPAIRWISE GENETIC DISTANCES (Jukes-Cantor)")
print("="*70)
print(dist_df.round(4))
print("\n✓ Distance matrix ready for tree building!")

## Part 4: UPGMA Tree Builder

In [None]:
def build_upgma_tree(sequences, show_distances=True):
    """
    Build phylogenetic tree using UPGMA
    """
    species = list(sequences.keys())
    dist_df, dist_matrix = calculate_distances(sequences)
    
    # Perform hierarchical clustering (UPGMA)
    linkage_matrix = linkage(dist_matrix[np.triu_indices(len(species), k=1)], 
                            method='average')
    
    # Create dendrogram
    fig, ax = plt.subplots(figsize=(12, 8))
    
    dendrogram(linkage_matrix, labels=species, ax=ax,
              orientation='right',
              leaf_font_size=12)
    
    ax.set_xlabel('Genetic Distance (Jukes-Cantor)', fontsize=12)
    ax.set_title('Primate Phylogenetic Tree (UPGMA)', fontsize=14, fontweight='bold')
    ax.grid(True, alpha=0.3, axis='x')
    
    plt.tight_layout()
    plt.show()
    
    if show_distances:
        # Show distance matrix as heatmap
        fig2 = go.Figure(data=go.Heatmap(
            z=dist_matrix,
            x=species,
            y=species,
            colorscale='Viridis',
            text=np.round(dist_matrix, 3),
            texttemplate='%{text}',
            textfont={"size": 10}
        ))
        
        fig2.update_layout(
            title='Distance Matrix Heatmap',
            xaxis_title='Species',
            yaxis_title='Species',
            height=600
        )
        
        fig2.show()
    
    # Analysis
    print("\n" + "="*70)
    print("PHYLOGENETIC TREE ANALYSIS")
    print("="*70)
    
    # Find closest pairs
    min_dist = float('inf')
    closest_pair = None
    for i in range(len(species)):
        for j in range(i+1, len(species)):
            if dist_matrix[i,j] < min_dist:
                min_dist = dist_matrix[i,j]
                closest_pair = (species[i], species[j])
    
    print(f"\nMOST CLOSELY RELATED: {closest_pair[0]} and {closest_pair[1]}")
    print(f"  Distance: {min_dist:.4f}")
    print(f"  → These species share the most recent common ancestor")
    
    # Molecular clock estimate
    rate = 0.01  # 1% per million years
    divergence_time = min_dist / (2 * rate)
    print(f"\nMOLECULAR CLOCK ESTIMATE:")
    print(f"  Divergence time: {divergence_time:.1f} million years ago")
    print(f"  (assuming rate = {rate*100}% per Myr)")
    
    print(f"\nKEY CLADES:")
    print(f"  Great Apes: Human, Chimp, Gorilla, Orangutan")
    print(f"  Lesser Apes: Gibbon")
    print(f"  Old World Monkeys: Baboon, Macaque")
    print(f"  New World Monkeys: Marmoset")
    print(f"  Prosimians: Lemur (most distant)")
    print("="*70)

# Build tree
display(HTML("<h3>🌳 Build Phylogenetic Tree</h3>"))
build_upgma_tree(primate_sequences)

## Part 5: Molecular Clock Calculator

In [None]:
def molecular_clock(genetic_distance, rate_per_myr):
    """
    Estimate divergence time using molecular clock
    """
    # T = d / (2r)
    # Factor of 2 because both lineages accumulate mutations
    
    divergence_time = genetic_distance / (2 * rate_per_myr)
    
    # Visualization
    fig = go.Figure()
    
    # Timeline
    times = np.linspace(0, divergence_time, 100)
    distances = 2 * rate_per_myr * times
    
    fig.add_trace(go.Scatter(
        x=times, y=distances,
        mode='lines',
        line=dict(color='#3498DB', width=3),
        name='Expected distance'
    ))
    
    # Mark observed
    fig.add_trace(go.Scatter(
        x=[divergence_time], y=[genetic_distance],
        mode='markers',
        marker=dict(size=15, color='#E74C3C'),
        name='Observed'
    ))
    
    fig.update_layout(
        title='<b>Molecular Clock</b>',
        xaxis_title='Time Since Divergence (Million Years)',
        yaxis_title='Genetic Distance',
        height=400
    )
    
    # Print results
    print("\n" + "="*70)
    print("MOLECULAR CLOCK CALCULATION")
    print("="*70)
    print(f"\nGENETIC DISTANCE: {genetic_distance:.4f}")
    print(f"SUBSTITUTION RATE: {rate_per_myr:.4f} per Myr ({rate_per_myr*100:.2f}%)")
    print(f"\nFORMULA: T = d / (2r)")
    print(f"         T = {genetic_distance:.4f} / (2 × {rate_per_myr:.4f})")
    print(f"         T = {genetic_distance:.4f} / {2*rate_per_myr:.4f}")
    print(f"         T = {divergence_time:.2f} million years")
    print(f"\nDIVERGENCE TIME: {divergence_time:.2f} Mya")
    print(f"\nKEY INSIGHT:")
    print(f"  Both lineages evolve independently")
    print(f"  Total distance = 2 × rate × time")
    print(f"  Factor of 2 is crucial!")
    print("="*70)
    
    fig.show()

# Interactive
dist_slider = FloatSlider(value=0.07, min=0.01, max=0.50, step=0.01,
                         description='Distance (d):')
rate_slider = FloatSlider(value=0.01, min=0.005, max=0.03, step=0.001,
                         description='Rate (r/Myr):')

display(HTML("<h3>⏰ Molecular Clock Calculator</h3>"))
interact(molecular_clock, genetic_distance=dist_slider, rate_per_myr=rate_slider);

## Part 6: Challenge Problems

### Challenge 1: Human-Chimp Divergence 🦍

**Given:**
- Human-Chimp genetic distance: 1.2%
- Substitution rate: 0.1% per million years (per lineage)

**Questions:**
1. Calculate divergence time
2. Why is the molecular clock useful?
3. What can affect clock accuracy?

<details>
<summary>Solution</summary>

**1. Divergence Time:**

Formula: T = d / (2r)

Given:
- d = 0.012 (1.2%)
- r = 0.001 per Myr (0.1%)

T = 0.012 / (2 × 0.001)
T = 0.012 / 0.002
**T = 6 million years ago**

This matches fossil evidence!

**2. Why Useful?**

**Dating without fossils:**
- Soft-bodied organisms (no fossils)
- Recent divergences (no time for fossilization)
- Fill gaps in fossil record

**Independent calibration:**
- Can check fossil dates
- Resolves conflicts
- More precise estimates

**3. Factors Affecting Accuracy:**

**Violations of clock:**
- Rate varies across lineages
- Generation time effects (mice vs elephants)
- Population size effects

**Selection:**
- Purifying selection slows clock
- Positive selection speeds it up
- Use neutral sites!

**Saturation:**
- Multiple hits at same site
- Observed differences then underestimate the true distance
- Use correction models (Jukes-Cantor)

**Calibration:**
- Need fossil anchor points
- Rate may not be constant
- Different genes have different rates
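
The saturation effect can be seen numerically. The sketch below, which uses illustrative p values and a hypothetical helper named `jukes_cantor`, shows how the gap between the observed p-distance and the Jukes-Cantor estimate widens as p approaches 0.75.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance for an observed p-distance (valid for 0 <= p < 0.75)."""
    if not 0 <= p < 0.75:
        raise ValueError("Jukes-Cantor is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


p_values = (0.01, 0.10, 0.30, 0.50, 0.70)  # illustrative observed distances
corrections = []
for p in p_values:
    d = jukes_cantor(p)
    corrections.append(d - p)
    print(f"p = {p:.2f}  JC d = {d:.3f}  hidden changes = {d - p:.3f}")
```

For small p the correction is negligible, but near saturation most of the estimated distance comes from substitutions that can no longer be observed directly, so the estimate becomes unreliable.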
</details>

### Challenge 2: Build a Tree 🌳

**Given distance matrix:**
```
     A    B    C    D
A    0   0.1  0.3  0.4
B   0.1   0   0.3  0.4
C   0.3  0.3   0   0.2
D   0.4  0.4  0.2   0
```

**Questions:**
1. Which species are most closely related?
2. Sketch the UPGMA tree
3. Calculate branch lengths

<details>
<summary>Solution</summary>

**1. Most Closely Related:**

**A and B** (distance = 0.1, smallest!)

**2. UPGMA Algorithm:**

**Step 1:** Join A and B
- Create cluster (AB)
- Branch length: 0.1/2 = 0.05 each

**Calculate new distances:**
- (AB) to C: (0.3 + 0.3)/2 = 0.3
- (AB) to D: (0.4 + 0.4)/2 = 0.4

New matrix:
```
       (AB)   C     D
(AB)    0    0.3   0.4
C      0.3    0    0.2
D      0.4   0.2    0
```

**Step 2:** Join C and D (distance = 0.2, smallest)
- Create cluster (CD)
- Branch length: 0.2/2 = 0.1 each

**Calculate:**
- (AB) to (CD): average of all four pairwise distances = (0.3 + 0.4 + 0.3 + 0.4)/4 = 0.35

**Step 3:** Join (AB) and (CD)
- Final node
- Branch from (AB): 0.35/2 - 0.05 = 0.125
- Branch from (CD): 0.35/2 - 0.1 = 0.075

**3. Tree:**
```
                    ┌─── A
               ┌────┤
               │    └─── B
          ─────┤
               │    ┌─── C
               └────┤
                    └─── D
```

**Root-to-tip distances (sum of branch lengths along each path):**
- A: 0.05 + 0.125 = 0.175
- B: 0.05 + 0.125 = 0.175
- C: 0.1 + 0.075 = 0.175
- D: 0.1 + 0.075 = 0.175

**(All equal because UPGMA assumes a constant rate, which always produces an ultrametric tree!)**
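
The hand calculation can be checked with SciPy, the same library used in Part 4. This sketch assumes SciPy and NumPy are installed; the variable names are arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

labels = ["A", "B", "C", "D"]
matrix = np.array([
    [0.0, 0.1, 0.3, 0.4],
    [0.1, 0.0, 0.3, 0.4],
    [0.3, 0.3, 0.0, 0.2],
    [0.4, 0.4, 0.2, 0.0],
])

# linkage expects the condensed (upper-triangle) form of the distance matrix
Z = linkage(squareform(matrix), method="average")  # 'average' is UPGMA

merge_distances = Z[:, 2]            # distance at which each pair of clusters joins
node_heights = merge_distances / 2   # each node sits at half that distance

for step, (dist, height) in enumerate(zip(merge_distances, node_heights), start=1):
    print(f"Step {step}: join at distance {dist:.3f}, node height {height:.3f}")
```

The three joins occur at distances 0.1, 0.2 and 0.35, giving node heights of 0.05, 0.1 and 0.175, which match the steps above.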
</details>

### Challenge 3: Primate Phylogeny Interpretation 🐵

**Looking at the primate tree:**

**Questions:**
1. Which group is most distant from humans?
2. Are Old World monkeys more related to humans than New World monkeys?
3. What does the tree tell us about human origins?

<details>
<summary>Solution</summary>

**1. Most Distant:**

**LEMUR (Prosimian)** is most distant!

**Phylogenetic Order (human perspective):**
1. Human (reference)
2. Chimp (closest, ~6 Mya)
3. Gorilla (~8 Mya)
4. Orangutan (~14 Mya)
5. Gibbon (~18 Mya)
6. Old World Monkeys (~25 Mya)
7. New World Monkeys (~40 Mya)
8. **Prosimians (~65 Mya) ← Most distant!**

**2. Old World vs New World Monkeys:**

**YES!** Old World monkeys (Baboon, Macaque) are MORE related to humans.

**Why?**

**Phylogenetic groupings:**
- Humans + Apes + Old World Monkeys = **Catarrhini**
- New World Monkeys = **Platyrrhini**
- Split ~40 Mya (ancestors of New World monkeys reached South America, probably by rafting from Africa)

**Shared traits (Catarrhini):**
- Narrow nose
- Downward-facing nostrils
- Non-prehensile tail

**New World unique:**
- Wide nose
- Side-facing nostrils
- Often prehensile tail

**3. Human Origins Insights:**

**From phylogeny:**

**Closest relatives: African apes**
- Chimp (~99% identical in aligned DNA sequence!)
- Gorilla
- → Humans evolved in AFRICA

**Not Asian apes:**
- Orangutan more distant
- Gibbon even more distant
- → Asian origin hypothesis rejected

**Timing:**
- Human-Chimp split: ~6 Mya
- → Look for fossils 6-8 Mya in Africa
- → Found! Sahelanthropus, Orrorin, Ardipithecus

**Molecular + Fossil evidence converge:**
- Africa origin confirmed
- Timeframe confirmed
- Chimp closest living relative confirmed

**Modern humans:**
- All humans very similar genetically
- African populations most diverse
- → Out of Africa model supported
</details>

In [None]:
def export_results():
    from pathlib import Path  # local import keeps this cell self-contained

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    # Save to /content on Colab; otherwise use the current working directory
    out_dir = Path("/content") if Path("/content").is_dir() else Path.cwd()
    
    # Export distance matrix
    dist_df, _ = calculate_distances(primate_sequences)
    csv_file = out_dir / f"lab_4_2_distances_{timestamp}.csv"
    dist_df.to_csv(csv_file)
    print(f"✓ Saved: {csv_file}")
    
    # Export primate info
    info_data = []
    for sp, info in primate_info.items():
        info_data.append({
            'Species': sp,
            'Group': info['group'],
            'Divergence_Mya': info['divergence_mya']
        })
    df = pd.DataFrame(info_data)
    info_file = out_dir / f"lab_4_2_primate_info_{timestamp}.csv"
    df.to_csv(info_file, index=False)
    print(f"✓ Saved: {info_file}")
    print(f"\nExported distance matrix and primate info")

btn = Button(description='📥 Export', button_style='success', icon='download')
btn.on_click(lambda b: export_results())
display(HTML("<h3>📤 Export</h3>"))
display(btn)

## Summary

### Key Concepts

✅ **Phylogenetic Trees** - Evolutionary relationships  
✅ **Genetic Distance** - Sequence divergence  
✅ **UPGMA** - Distance-based tree building  
✅ **Molecular Clock** - Dating divergences  
✅ **Primate Phylogeny** - Human evolutionary context  

### Key Equations

**Genetic Distance (Jukes-Cantor):**
## d = -3/4 ln(1 - 4p/3)

**Molecular Clock:**
## T = d / (2r)

### Tree Building Methods

**UPGMA (used in this lab):**
- Distance-based, hierarchical clustering
- Fast and simple
- Assumes molecular clock

**Maximum Parsimony:**
- Fewest mutations
- No clock assumption

**Maximum Likelihood:**
- Statistical best fit
- Statistically rigorous, but computationally intensive

### Primate Phylogeny

**Major Groups (human perspective):**
1. Great Apes: Human, Chimp, Gorilla, Orangutan
2. Lesser Apes: Gibbon
3. Old World Monkeys: Baboon, Macaque  
4. New World Monkeys: Marmoset
5. Prosimians: Lemur (most distant)

**Key Findings:**
- Humans most related to African apes
- Human-Chimp divergence: ~6 Mya
- Supports African origin hypothesis

### Molecular Clock Applications

**Dating divergences:**
- Human-Chimp: 6 Mya
- Human-Gorilla: 8 Mya
- Human-Orangutan: 14 Mya

**Typical rates:**
- Nuclear DNA: ~0.1% per million years
- mtDNA: ~1% per million years

### Real-World Applications

**Evolution:**
- Reconstruct phylogeny
- Date divergences
- Test evolutionary hypotheses

**Conservation:**
- Identify species
- Assess genetic diversity
- Guide breeding programs

**Medicine:**
- Track disease origins
- Understand drug resistance
- Vaccine development

**Forensics:**
- Species identification
- Population assignment

### The Big Picture

**DNA = Historical Document**

Sequences preserve evolutionary history:
- More similar = more recent common ancestor
- Distance ∝ time since divergence
- Tree shows all relationships at once

**Molecular + Morphological + Fossil:**
These three lines of evidence broadly agree on primate phylogeny!

### Next Lab

**Lab 4.3: Human Evolution Explorer** - Our own evolutionary story!

**Congratulations!** You can now build evolutionary trees! 🎉