# Pattern Hunters: Tree Interpretation & Evolutionary Analysis
## Reading Phylogenetic Trees and Understanding Evolution

**For BSc Zoology Students**

---

### Learning Objectives
By the end of this notebook, you will:
1. Read and interpret phylogenetic trees correctly
2. Understand common misconceptions about trees
3. Identify monophyletic, paraphyletic, and polyphyletic groups
4. Calculate divergence times using molecular clocks
5. Map character evolution onto trees
6. Connect molecular data to evolutionary history

### Why This Matters

Building a tree is just the first step. The real biology comes from:
- Understanding what trees tell us about evolution
- Avoiding common misinterpretations
- Connecting molecular patterns to real evolutionary history

---

## Part 1: Setup

In [None]:
# Install packages
!pip install biopython matplotlib seaborn scipy -q

print("✓ Packages installed!")

In [None]:
from Bio import Phylo, AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from io import StringIO

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("Set2")

print("✓ Libraries imported!")

In [None]:
# Create a sample primate phylogenetic tree
# This tree represents the evolutionary relationships among primates

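# Define the tree in Newick format
# Tip names must match the names used later in this notebook
# (trait table, monophyly tests, distance lookups), hence "Chimpanzee"
newick_str = "((((Human:0.012,Chimpanzee:0.012):0.004,Gorilla:0.016):0.008,Orangutan:0.024):0.015,Macaque:0.039);"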

# Parse the tree
tree = Phylo.read(StringIO(newick_str), "newick")

print("✓ Phylogenetic tree created successfully")
print(f"   Tree has {tree.count_terminals()} terminal nodes (species)")

# Display tree
fig, ax = plt.subplots(figsize=(12, 8))
Phylo.draw(tree, axes=ax, do_show=False)
ax.set_title('Primate Phylogenetic Tree', fontsize=14, fontweight='bold')
ax.set_xlabel('Evolutionary Distance (substitutions per site)', fontsize=11)
plt.tight_layout()
plt.show()

print("\n📊 This tree shows:")
print("   • Humans and chimps are most closely related")
print("   • Gorillas diverged next")
print("   • Orangutans are more distantly related")
print("   • Macaques (Old World monkeys) are the outgroup")

## Part 2: Reading Trees Correctly - Common Misconceptions

### Misconception #1: "Taxa at the top are more advanced"
**WRONG!** Order doesn't matter. Trees can be rotated at any node.

### Misconception #2: "Taxa closer on the page are more related"
**WRONG!** Relationships are determined by common ancestors, not proximity.

### Misconception #3: "Reading left to right shows evolution"
**WRONG!** All living species are equally modern. Time runs from the root to the tips, not across the tips.

Let's explore these:

In [None]:
# Demonstrate tree rotation
print("Tree Rotation Example")
print("="*60)
print("\nThe same tree can be drawn in multiple ways by rotating around nodes.")
print("All versions show THE SAME evolutionary relationships!\n")

# Create simple example trees to show rotation
simple_tree_str = "(((Human,Chimp),Gorilla),Orangutan);"
simple_tree = Phylo.read(StringIO(simple_tree_str), "newick")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Original
Phylo.draw(simple_tree, axes=ax1, do_show=False)
ax1.set_title('Version 1: Human at top', fontweight='bold')

# Rotate (swap branches) on a copy of the tree.
# Bio.Phylo has no rotate() method, but each node keeps its children in
# the plain list `clade.clades`, which can be reordered in place.
import copy
rotated_tree = copy.deepcopy(simple_tree)
for node in rotated_tree.get_nonterminals():
    node.clades.reverse()
Phylo.draw(rotated_tree, axes=ax2, do_show=False)
ax2.set_title('Version 2: Rotated at every node (relationships UNCHANGED)', fontweight='bold')

plt.tight_layout()
plt.show()

print("Key Point: Physical position on the page is arbitrary!")
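
The claim that rotation leaves relationships untouched can also be checked programmatically. The sketch below is an illustration under stated assumptions, not part of the original exercise: the helper `clade_sets` is a hypothetical name, and rotation is done by reversing each node's `clades` list. It compares the set of tip groups defined by every node before and after rotation.

```python
from io import StringIO
from Bio import Phylo

def clade_sets(tree):
    """Return the tip-name group defined by every node in the tree."""
    return {
        frozenset(tip.name for tip in node.get_terminals())
        for node in tree.find_clades()
    }

topology = "(((Human,Chimp),Gorilla),Orangutan);"
original = Phylo.read(StringIO(topology), "newick")
rotated = Phylo.read(StringIO(topology), "newick")

# Rotate every internal node by reversing the order of its children
for node in rotated.get_nonterminals():
    node.clades.reverse()

tip_order_original = [tip.name for tip in original.get_terminals()]
tip_order_rotated = [tip.name for tip in rotated.get_terminals()]
same_relationships = clade_sets(original) == clade_sets(rotated)

print("Original tip order:", tip_order_original)
print("Rotated tip order: ", tip_order_rotated)
print("Same clades in both trees:", same_relationships)
```

The tip order on the page changes, but every node still groups exactly the same species, which is what "same relationships" means.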

In [None]:
# How to read relationships correctly
print("\nHow to Correctly Determine Relationships:")
print("="*60)
print("\n1. Find the MOST RECENT COMMON ANCESTOR (MRCA) of each pair")
print("2. Compare how far back (toward the root) each MRCA lies")
print("3. More recent common ancestor = more closely related\n")

print("Example questions:")
print("\nQ1: Are humans more closely related to chimps or gorillas?")
print("A1: Look for MRCA of Human+Chimp vs MRCA of Human+Gorilla")
print("    Human+Chimp share a MORE RECENT ancestor")
print("    Therefore: Humans are MORE closely related to chimps\n")

print("Q2: Which is more related to humans: gorilla or orangutan?")
print("A2: Trace back from Human to the MRCA shared with each:")
print("    Human → node 1 → node 2 = MRCA with Gorilla")
print("    Human → node 1 → node 2 → node 3 = MRCA with Orangutan")
print("    The Human+Gorilla MRCA is more recent")
print("    Therefore: Humans are MORE closely related to gorillas")
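
The MRCA rule can be made concrete in code. The following is a minimal sketch, assuming the same primate tree as in Part 1 with the tip named `Chimpanzee`; `mrca_depth` and `demo_tree` are hypothetical names introduced for illustration. It measures how far from the root each MRCA lies, so a larger value means a more recent common ancestor.

```python
from io import StringIO
from Bio import Phylo

newick = ("((((Human:0.012,Chimpanzee:0.012):0.004,Gorilla:0.016):0.008,"
          "Orangutan:0.024):0.015,Macaque:0.039);")
demo_tree = Phylo.read(StringIO(newick), "newick")

tips = {tip.name: tip for tip in demo_tree.get_terminals()}
node_depths = demo_tree.depths()  # distance of every node from the root

def mrca_depth(name_a, name_b):
    """Distance from the root to the MRCA of two tips (larger = more recent)."""
    mrca = demo_tree.common_ancestor([tips[name_a], tips[name_b]])
    return node_depths[mrca]

mrca_with_human = {
    other: mrca_depth("Human", other)
    for other in ("Chimpanzee", "Gorilla", "Orangutan", "Macaque")
}
closest_relative = max(mrca_with_human, key=mrca_with_human.get)

for other, depth in mrca_with_human.items():
    print(f"MRCA(Human, {other}) lies {depth:.3f} from the root")
print("Closest relative of Human:", closest_relative)
```

Ranking by MRCA depth avoids relying on where tips happen to sit on the page.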

## Part 3: Understanding Tree Components

### Key Components:
1. **Terminals (Tips/Leaves)**: The species we sampled
2. **Branches**: Represent lineages through time
3. **Nodes**: Represent common ancestors
4. **Root**: The most recent common ancestor of all taxa in the tree (its oldest node)
5. **Branch lengths**: Amount of evolutionary change

In [None]:
# Analyze tree structure
def analyze_tree_structure(tree):
    """
    Comprehensive analysis of tree structure
    """
    print("Tree Structure Analysis")
    print("="*60)
    
    # Terminals
    terminals = tree.get_terminals()
    print(f"\n1. TERMINALS (Living species):")
    print(f"   Count: {len(terminals)}")
    print(f"   Names: {', '.join([t.name for t in terminals])}")
    
    # Internal nodes
    non_terminals = tree.get_nonterminals()
    print(f"\n2. INTERNAL NODES (Ancestors):")
    print(f"   Count: {len(non_terminals)}")
    print(f"   These represent extinct common ancestors")
    
    # Total branch length
    total_length = tree.total_branch_length()
    print(f"\n3. TOTAL BRANCH LENGTH:")
    print(f"   {total_length:.4f}")
    print(f"   Sum of all evolutionary changes")
    
    # Tree depth
    depths = tree.depths()
    max_depth = max(depths.values())
    print(f"\n4. TREE DEPTH:")
    print(f"   {max_depth:.4f}")
    print(f"   Root to furthest tip")
    
    # Is it bifurcating?
    is_bifur = tree.is_bifurcating()
    print(f"\n5. BIFURCATING (binary):")
    print(f"   {is_bifur}")
    if is_bifur:
        print(f"   Each ancestor splits into exactly 2 lineages")
    else:
        print(f"   Some nodes have 3+ descendants (polytomy)")

analyze_tree_structure(tree)

## Part 4: Monophyly, Paraphyly, and Polyphyly

Understanding how we classify groups:

### Monophyletic (Clade)
- An ancestor and ALL its descendants
- **Example**: All great apes (including humans)
- These are "natural" evolutionary groups

### Paraphyletic
- An ancestor and SOME (not all) descendants
- **Example**: "Apes excluding humans" (artificial)
- Not considered natural groups

### Polyphyletic
- Group based on convergent traits, not common ancestry
- **Example**: "Flying vertebrates" (bats + birds)
- Definitely not natural groups

In [None]:
# Identify clades
def identify_clades(tree):
    """
    Find all monophyletic groups in the tree
    """
    print("Monophyletic Groups (Clades) in Our Tree")
    print("="*60)
    
    clades = []
    for clade in tree.find_clades():
        if not clade.is_terminal():
            terminals = clade.get_terminals()
            if len(terminals) >= 2:
                clade_members = [t.name for t in terminals]
                clades.append(clade_members)
    
    # Sort by size
    clades.sort(key=len, reverse=True)
    
    for i, members in enumerate(clades, 1):
        print(f"\nClade {i}: ({len(members)} species)")
        print(f"  Members: {', '.join(members)}")
        
        # Try to name the clade. Exact membership is required so that the
        # root clade (which also contains Macaque) is not labelled "Great Apes".
        member_set = set(members)
        if member_set == {'Human', 'Chimpanzee', 'Gorilla', 'Orangutan'}:
            print("  Common name: Great Apes (Hominidae)")
        elif member_set == {'Human', 'Chimpanzee', 'Gorilla'}:
            print("  Common name: African Great Apes")
        elif member_set == {'Human', 'Chimpanzee'}:
            print("  Common name: Homo-Pan clade")

identify_clades(tree)

In [None]:
# Test if a group is monophyletic
def is_monophyletic_group(tree, species_list):
    """
    Check if a set of species forms a monophyletic group
    """
    # Get all terminals for these species
    target_terminals = []
    for terminal in tree.get_terminals():
        if terminal.name in species_list:
            target_terminals.append(terminal)
    
    if len(target_terminals) != len(species_list):
        return False, "Not all species found in tree"
    
    # Find MRCA
    mrca = tree.common_ancestor(target_terminals)
    
    # Get all descendants of MRCA
    mrca_terminals = mrca.get_terminals()
    mrca_names = [t.name for t in mrca_terminals]
    
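    # Check if it's exactly our group
    if set(mrca_names) == set(species_list):
        return True, "Monophyletic (includes ancestor and ALL descendants)"
    else:
        # The MRCA has descendants outside the group, so the group is not a
        # clade. Telling paraphyly from polyphyly needs information about the
        # ancestor's traits, which this simple check does not use.
        extra = sorted(set(mrca_names) - set(species_list))
        return False, f"Not monophyletic (MRCA also includes: {', '.join(extra)})"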

# Test some groups
print("Testing Groups for Monophyly")
print("="*60)

test_groups = [
    (["Human", "Chimpanzee"], "Homo-Pan"),
    (["Human", "Chimpanzee", "Gorilla"], "African Apes"),
    (["Chimpanzee", "Gorilla"], "Apes excluding humans"),
]

for species_list, group_name in test_groups:
    is_mono, reason = is_monophyletic_group(tree, species_list)
    print(f"\n{group_name}:")
    print(f"  Species: {', '.join(species_list)}")
    print(f"  Status: {reason}")
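
Bio.Phylo also provides a built-in `is_monophyletic` method on trees, which returns the matching clade or `False`. The sketch below shows one way to use it, assuming the Part 1 tree with the tip named `Chimpanzee`; the wrapper `forms_clade` is a hypothetical helper, not part of Biopython.

```python
from io import StringIO
from Bio import Phylo

newick = ("((((Human:0.012,Chimpanzee:0.012):0.004,Gorilla:0.016):0.008,"
          "Orangutan:0.024):0.015,Macaque:0.039);")
demo_tree = Phylo.read(StringIO(newick), "newick")
tips = {tip.name: tip for tip in demo_tree.get_terminals()}

def forms_clade(names):
    """True if the named tips are exactly the descendants of one node."""
    targets = [tips[name] for name in names]
    # is_monophyletic returns the clade itself on success, False otherwise
    return demo_tree.is_monophyletic(targets) is not False

groups = {
    "Homo-Pan": ["Human", "Chimpanzee"],
    "African apes": ["Human", "Chimpanzee", "Gorilla"],
    "Apes excluding humans": ["Chimpanzee", "Gorilla"],
}
results = {label: forms_clade(names) for label, names in groups.items()}

for label, is_clade in results.items():
    print(f"{label}: {'monophyletic' if is_clade else 'not monophyletic'}")
```

Passing the terminal objects (rather than name strings) matters here, because the method compares sets of clade objects.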

## Part 5: Molecular Clocks and Divergence Times

If mutations accumulate at a relatively constant rate, we can estimate when species diverged.

### Molecular Clock Equation:
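**Time = Distance / (2 × Rate)**

Where:
- Distance = pairwise evolutionary distance between the two species (substitutions per site)
- Rate = substitution rate per lineage (substitutions per site per million years)
- Factor of 2 accounts for changes accumulating independently along both lineages since the split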

### Calibration:
We need fossil dates or known divergences to calibrate the clock.

In [None]:
# Calculate pairwise distances
def get_pairwise_distance(tree, species1, species2):
    """
    Get evolutionary distance between two species
    """
    term1 = None
    term2 = None
    
    for terminal in tree.get_terminals():
        if species1.lower() in terminal.name.lower():
            term1 = terminal
        if species2.lower() in terminal.name.lower():
            term2 = terminal
    
    if term1 and term2:
        return tree.distance(term1, term2)
    return None

# Calculate key distances
print("Pairwise Evolutionary Distances")
print("="*60)

key_pairs = [
    ("Human", "Chimpanzee"),
    ("Human", "Gorilla"),
    ("Human", "Orangutan"),
]

distances = {}
for sp1, sp2 in key_pairs:
    dist = get_pairwise_distance(tree, sp1, sp2)
    if dist:
        distances[f"{sp1}-{sp2}"] = dist
        print(f"\n{sp1} ↔ {sp2}:")
        print(f"  Distance: {dist:.6f}")

In [None]:
# Estimate divergence times using known calibration
print("\nEstimating Divergence Times")
print("="*60)
print("\nUsing calibration: Human-Chimpanzee divergence ≈ 6-7 million years ago (MYA)")
print("(from fossil evidence)\n")

# Get human-chimp distance
human_chimp_dist = get_pairwise_distance(tree, "Human", "Chimpanzee")

if human_chimp_dist:
    # Calibrate: known divergence time
    known_divergence_mya = 6.5  # Average of 6-7 MYA
    
    # Calculate rate: substitutions per million years
    rate = human_chimp_dist / (2 * known_divergence_mya)
    
    print(f"Calibration:")
    print(f"  Human-Chimp distance: {human_chimp_dist:.6f}")
    print(f"  Known divergence time: {known_divergence_mya} MYA")
    print(f"  Calculated rate: {rate:.8f} substitutions/site/MY")
    
    # Estimate other divergence times
    print(f"\nEstimated Divergence Times:")
    print("-" * 60)
    
    for sp1, sp2 in key_pairs:
        dist = get_pairwise_distance(tree, sp1, sp2)
        if dist:
            estimated_time = dist / (2 * rate)
            print(f"\n{sp1} ↔ {sp2}:")
            print(f"  Distance: {dist:.6f}")
            print(f"  Estimated divergence: {estimated_time:.2f} MYA")
            
            # Compare with known estimates
            if "Gorilla" in sp2:
                print(f"  (Literature estimate: ~8-10 MYA)")
            elif "Orangutan" in sp2:
                print(f"  (Literature estimate: ~12-16 MYA)")
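
Because the calibration itself is a range (6-7 MYA), it can help to see how much the estimates move across that range. The following is a small sketch in plain Python, assuming the pairwise distances implied by the Part 1 tree (0.024, 0.032 and 0.048 substitutions per site); the function and variable names are illustrative only.

```python
def calibrate_rate(distance, known_time_mya):
    """Per-lineage rate (substitutions/site/MY) from a dated split."""
    return distance / (2 * known_time_mya)

def divergence_time(distance, rate_per_my):
    """Time = Distance / (2 x Rate), in million years."""
    return distance / (2 * rate_per_my)

# Pairwise distances read off the example tree (sum of branches between tips)
HUMAN_CHIMP = 0.024
other_pairs = {"Human-Gorilla": 0.032, "Human-Orangutan": 0.048}

estimates = {}
for calibration_mya in (6.0, 6.5, 7.0):
    cal_rate = calibrate_rate(HUMAN_CHIMP, calibration_mya)
    estimates[calibration_mya] = {
        pair: divergence_time(dist, cal_rate)
        for pair, dist in other_pairs.items()
    }

for calibration_mya, by_pair in estimates.items():
    summary = ", ".join(f"{pair}: {t:.2f} MYA" for pair, t in by_pair.items())
    print(f"Calibration {calibration_mya} MYA -> {summary}")
```

Under a strict clock every estimate scales linearly with the calibration date, so uncertainty in the fossil date carries straight through to all other nodes.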

In [None]:
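# Visualize divergence times on tree
import copy

print("\nVisualization: Tree with Divergence Times")

# Create a time-scaled copy: dividing each branch length (substitutions/site)
# by the calibrated rate (substitutions/site/MY) converts it to million years.
# If `rate` was not computed above, the raw branch lengths are shown instead.
time_tree = copy.deepcopy(tree)
has_rate = 'rate' in globals()
if has_rate:
    for clade in time_tree.find_clades():
        if clade.branch_length is not None:
            clade.branch_length /= rate

fig, ax = plt.subplots(figsize=(14, 8))
Phylo.draw(time_tree, axes=ax, do_show=False)

ax.set_xlabel('Time since root (million years)' if has_rate
              else 'Distance from root (substitutions per site)', fontsize=12)
ax.set_title('Time-Calibrated Phylogenetic Tree\n(Based on Human-Chimp calibration)',
             fontsize=14, fontweight='bold')

# Mark the present: the root is drawn at x = 0, the living tips at maximum depth
present = max(time_tree.depths().values())
ax.axvline(x=present, color='red', linestyle='--', alpha=0.5, label='Present')
ax.legend()

plt.tight_layout()
plt.show()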

## Part 6: Character Evolution Mapping

We can map traits onto trees to understand how characters evolved.

### Examples:
- Morphological traits (bipedalism, brain size)
- Behavioral traits (tool use, language)
- Ecological traits (habitat, diet)
- Molecular traits (chromosome number, genome size)

In [None]:
# Define some traits for our primates
primate_traits = {
    'Human': {
        'brain_size_cc': 1350,
        'bipedal': True,
        'tool_use': 'complex',
        'habitat': 'terrestrial',
        'chromosome_pairs': 23
    },
    'Chimpanzee': {
        'brain_size_cc': 400,
        'bipedal': False,
        'tool_use': 'simple',
        'habitat': 'arboreal/terrestrial',
        'chromosome_pairs': 24
    },
    'Gorilla': {
        'brain_size_cc': 500,
        'bipedal': False,
        'tool_use': 'minimal',
        'habitat': 'terrestrial',
        'chromosome_pairs': 24
    },
    'Orangutan': {
        'brain_size_cc': 400,
        'bipedal': False,
        'tool_use': 'simple',
        'habitat': 'arboreal',
        'chromosome_pairs': 24
    },
}

# Display as table
traits_df = pd.DataFrame(primate_traits).T
print("Primate Trait Comparison")
print("="*80)
print(traits_df.to_string())
print("\nNote: These are approximate values for illustration")

In [None]:
# Visualize trait evolution
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Brain size
species = list(primate_traits.keys())
brain_sizes = [primate_traits[sp]['brain_size_cc'] for sp in species]

axes[0,0].barh(species, brain_sizes, color='steelblue')
axes[0,0].set_xlabel('Brain Size (cc)', fontsize=10)
axes[0,0].set_title('Brain Size Evolution', fontweight='bold')
axes[0,0].grid(axis='x', alpha=0.3)

# Locomotion
locomotion = [1 if primate_traits[sp]['bipedal'] else 0 for sp in species]
colors = ['green' if x else 'gray' for x in locomotion]
axes[0,1].barh(species, locomotion, color=colors)
axes[0,1].set_xlabel('Bipedal (1) vs Quadrupedal (0)', fontsize=10)
axes[0,1].set_title('Locomotion', fontweight='bold')
axes[0,1].set_xlim(-0.1, 1.1)

# Tool use (encoded)
tool_encoding = {'complex': 3, 'simple': 2, 'minimal': 1}
tool_scores = [tool_encoding.get(primate_traits[sp]['tool_use'], 0) for sp in species]
axes[1,0].barh(species, tool_scores, color='coral')
axes[1,0].set_xlabel('Tool Use Complexity (1-3)', fontsize=10)
axes[1,0].set_title('Tool Use Evolution', fontweight='bold')
axes[1,0].grid(axis='x', alpha=0.3)

# Chromosome pairs
chromosomes = [primate_traits[sp]['chromosome_pairs'] for sp in species]
axes[1,1].barh(species, chromosomes, color='purple')
axes[1,1].set_xlabel('Chromosome Pairs', fontsize=10)
axes[1,1].set_title('Chromosome Number', fontweight='bold')
axes[1,1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

print("\nEvolutionary Insights:")
print("  • Humans show dramatic brain enlargement")
print("  • Bipedalism is unique to human lineage")
print("  • Tool use differs among lineages (complex in humans, minimal in gorillas)")
print("  • Human chromosome fusion (24 → 23 pairs) is derived")

## Part 7: Inferring Ancestral States

Given traits in living species, can we infer what ancestors were like?

### Parsimony Approach:
Minimize the number of evolutionary changes needed.

In [None]:
# Simple ancestral state reconstruction
print("Ancestral State Reconstruction (Parsimony)")
print("="*60)

print("\nTrait: Bipedalism")
print("-" * 40)
print("Living species:")
for sp in species:
    bipedal = "Yes" if primate_traits[sp]['bipedal'] else "No"
    print(f"  {sp:15} {bipedal}")

print("\nParsimony inference:")
print("  • Human-Chimp ancestor: NOT bipedal (requires 1 change)")
print("  • African ape ancestor: NOT bipedal")
print("  • Great ape ancestor: NOT bipedal")
print("\n  → Bipedalism evolved ONCE in human lineage after")
print("     divergence from chimpanzees (~6-7 MYA)")

print("\n" + "="*60)
print("\nTrait: Complex Tool Use")
print("-" * 40)
print("Living species:")
for sp in species:
    tools = primate_traits[sp]['tool_use']
    print(f"  {sp:15} {tools}")

print("\nParsimony inference:")
print("  • Great ape ancestor: simple tool use (shared by chimpanzees and orangutans)")
print("  • Human-Chimp ancestor: simple tool use")
print("  • Complex tool use evolved in human lineage")
print("  • Minimal tool use is a change in the gorilla lineage")
print("\n  → Two changes from a 'simple' ancestor explain the observed states")
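
The parsimony reasoning above can be sketched with the bottom-up pass of Fitch's algorithm for unordered characters. This is a simplified illustration, not the notebook's own method: the tree is written as nested tuples rather than a Bio.Phylo object, and `fitch` is a hypothetical helper that returns the candidate states at a node and the minimum number of changes below it.

```python
def fitch(node, tip_states):
    """Bottom-up Fitch pass on a nested-tuple tree.

    A tip is a species name (str); an internal node is a pair of subtrees.
    Returns (set of candidate states at this node, minimum changes below it).
    """
    if isinstance(node, str):
        return {tip_states[node]}, 0
    left_states, left_changes = fitch(node[0], tip_states)
    right_states, right_changes = fitch(node[1], tip_states)
    shared = left_states & right_states
    if shared:
        # Children agree on at least one state: no extra change needed here
        return shared, left_changes + right_changes
    # Children disagree: keep all candidates and count one change
    return left_states | right_states, left_changes + right_changes + 1

great_apes = ((("Human", "Chimpanzee"), "Gorilla"), "Orangutan")
bipedal = {"Human": True, "Chimpanzee": False,
           "Gorilla": False, "Orangutan": False}
tool_use = {"Human": "complex", "Chimpanzee": "simple",
            "Gorilla": "minimal", "Orangutan": "simple"}

root_bipedal, bipedal_changes = fitch(great_apes, bipedal)
root_tools, tool_changes = fitch(great_apes, tool_use)

print(f"Bipedalism: root state {root_bipedal}, minimum changes {bipedal_changes}")
print(f"Tool use:   root state {root_tools}, minimum changes {tool_changes}")
```

A full reconstruction would add a second, top-down pass to assign a single state to each internal node; only the root states and change counts are computed here.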

## Part 8: Connecting Molecules to Morphology

Our molecular tree should match evolutionary patterns from:
- Fossils
- Anatomy
- Biogeography
- Embryology

In [None]:
# Compare molecular tree with other evidence
print("Concordance Between Molecular and Other Evidence")
print("="*80)

evidence_table = {
    'Relationship': [
        'Human-Chimpanzee sister taxa',
        'African apes monophyly',
        'Great apes monophyly',
        'Human-chimp 6-7 MYA divergence'
    ],
    'Molecular': ['✓', '✓', '✓', '✓'],
    'Fossils': ['✓', '✓', '✓', '✓'],
    'Anatomy': ['✓', '✓', '✓', 'N/A'],
    'Chromosomes': ['✓', '✓', '✓', 'N/A'],
    'Behavior': ['✓', '✓', '~', 'N/A']
}

evidence_df = pd.DataFrame(evidence_table)
print(evidence_df.to_string(index=False))

print("\nKey: ✓ = supports, ~ = weak support, N/A = not applicable")
print("\nConclusion: Multiple lines of evidence support our molecular tree!")
print("This is how we gain confidence in phylogenetic hypotheses.")

## Part 9: Practical Applications

Phylogenetic trees aren't just academic - they have real applications:

In [None]:
print("Practical Applications of Phylogenetic Trees")
print("="*80)

applications = {
    'Conservation Biology': [
        'Identify evolutionarily distinct species',
        'Prioritize conservation efforts',
        'Understand population structure',
        'Example: Preserving unique lemur lineages'
    ],
    'Medicine & Health': [
        'Track disease outbreaks (COVID-19)',
        'Understand drug resistance evolution',
        'Find animal models for research',
        'Example: Using chimps/gorillas to study human diseases'
    ],
    'Agriculture': [
        'Crop improvement and breeding',
        'Track pest invasions',
        'Understand domestication',
        'Example: Rice variety phylogenetics'
    ],
    'Forensics': [
        'Wildlife crime investigation',
        'Species identification',
        'Source population tracking',
        'Example: Illegal ivory trade tracking'
    ],
    'Basic Science': [
        'Understand evolutionary processes',
        'Test hypotheses about adaptation',
        'Reconstruct Earth\'s biodiversity history',
        'Example: Understanding human evolution'
    ]
}

for category, examples in applications.items():
    print(f"\n{category}:")
    print("-" * 40)
    for example in examples:
        print(f"  • {example}")

## Part 10: Summary Exercise - Interpret a Tree

Let's practice everything we've learned:

In [None]:
print("Tree Interpretation Exercise")
print("="*80)
print("\nUsing our primate phylogenetic tree, answer these questions:")
print("\n1. Which two species are most closely related?")
print("   Answer: Human and Chimpanzee (share most recent common ancestor)")

print("\n2. Is 'African apes' a monophyletic group?")
print("   Answer: Yes (includes Human, Chimp, Gorilla and their ancestor)")

print("\n3. When did humans and gorillas diverge?")
estimated_time = get_pairwise_distance(tree, "Human", "Gorilla") / (2 * rate) if 'rate' in locals() else "N/A"
print(f"   Answer: ~{estimated_time:.1f} MYA (based on molecular clock)" if estimated_time != "N/A" else "   Answer: ~8-10 MYA (from literature)")

print("\n4. Which trait is a synapomorphy (shared derived) for great apes?")
print("   Answer: Larger brain size, no tail, shoulder structure")

print("\n5. Which trait is an autapomorphy (unique derived) for humans?")
print("   Answer: Bipedalism, greatly enlarged brain, complex language")

print("\n6. Does this tree support a molecular clock?")
print("   Answer: Check if tip-to-root distances are equal (UPGMA) or")
print("           vary (NJ). NJ trees typically show rate variation.")

print("\n7. What evidence supports this tree topology?")
print("   Answer: DNA sequences, fossils, anatomy, chromosomes, behavior")
print("           - Multiple independent lines of evidence agree!")
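
Question 6 can be checked directly: under a strict molecular clock, every tip is the same distance from the root (the tree is ultrametric). The sketch below assumes the Part 1 tree with the tip named `Chimpanzee`; the tolerance value is an arbitrary choice to absorb floating-point rounding.

```python
from io import StringIO
from Bio import Phylo

newick = ("((((Human:0.012,Chimpanzee:0.012):0.004,Gorilla:0.016):0.008,"
          "Orangutan:0.024):0.015,Macaque:0.039);")
demo_tree = Phylo.read(StringIO(newick), "newick")

# Root-to-tip distance for every living species
node_depths = demo_tree.depths()
root_to_tip = {tip.name: node_depths[tip] for tip in demo_tree.get_terminals()}

# Equal root-to-tip distances are what a strict clock predicts
spread = max(root_to_tip.values()) - min(root_to_tip.values())
is_clocklike = spread < 1e-9  # tolerance for floating-point rounding

for name, dist in root_to_tip.items():
    print(f"{name:12} {dist:.4f}")
print("Consistent with a strict clock:", is_clocklike)
```

Trees built from real sequences with NJ will usually show some spread, and the size of that spread is a quick first look at rate variation among lineages.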

## Final Summary: From Sequences to Evolutionary Understanding

### The Complete Workflow:

**Notebook 1: Sequences**
- Downloaded real DNA sequences from GenBank
- Compared sequences position by position
- Identified patterns of similarity

**Notebook 2: Alignment**
- Aligned sequences to account for indels
- Identified conserved vs. variable regions
- Found phylogenetically informative sites

**Notebook 3: Tree Building**
- Calculated evolutionary distances
- Built trees using UPGMA and NJ
- Assessed confidence with bootstrapping

**Notebook 4: Interpretation** (this notebook)
- Read trees correctly (avoiding misconceptions)
- Identified monophyletic groups
- Estimated divergence times
- Mapped character evolution
- Connected molecules to morphology

---

### Key Takeaways:

1. **Trees show relationships, not progress**
   - All living species are equally evolved
   - Position on tree doesn't indicate "advancement"

2. **Multiple types of evidence converge**
   - Molecules, fossils, anatomy all agree
   - This strengthens our confidence

3. **Evolution is a branching process**
   - Lineages split (speciation)
   - Each lineage evolves independently
   - Trees represent this history

4. **Molecular data is powerful**
   - Can study any organism with DNA
   - Quantitative and reproducible
   - Complements traditional methods

5. **Phylogenetics has real applications**
   - Medicine, conservation, agriculture
   - Tracks diseases, identifies species
   - Informs policy and practice

---

### Skills You've Mastered:

✓ Retrieving biological sequences from databases
✓ Performing multiple sequence alignment
✓ Building phylogenetic trees
✓ Assessing tree reliability (bootstrapping)
✓ Interpreting evolutionary relationships
✓ Estimating divergence times
✓ Mapping character evolution
✓ Connecting molecular and morphological data

**You can now analyze evolutionary relationships scientifically!**

---

### Next Steps:

1. **Try different genes**: COI, 16S rRNA, RAG1
2. **Explore different groups**: Your favorite animals, plants, microbes
3. **Advanced methods**: Maximum Likelihood, Bayesian inference
4. **Larger datasets**: Genome-wide phylogenies
5. **Real research**: Apply these skills to your own questions!

---

## Congratulations! 🎉

You've completed the Pattern Hunters phylogenetics series. You now understand how scientists reconstruct evolutionary history using molecular data - the same methods used in cutting-edge research worldwide.

**Remember**: Every phylogenetic tree in textbooks started just like this - with sequences, alignment, and careful analysis. You have the tools to do the same!

---