# 🧬 Unit 4: Human Evolution - Part 2
## Molecular Analysis of Human Origins

**Using DNA evidence to trace our evolutionary history**

---

### Learning Objectives

By the end of this notebook, you will:
1. Understand how DNA can reveal evolutionary relationships
2. Calculate genetic distances between species
3. Use molecular clocks to estimate divergence times
4. Analyze real primate DNA sequences
5. Interpret evidence for human-Neanderthal interbreeding
6. Understand mitochondrial Eve and Y-chromosome Adam

---

### Why Molecular Evidence?

Fossils tell us about morphology, but DNA tells us:
- **Exact evolutionary relationships**
- **When lineages split** (molecular clock)
- **Population movements and mixing**
- **Details invisible in fossils** (biochemistry, gene function)

In [None]:
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from ipywidgets import interact, interactive, Dropdown, IntSlider, FloatSlider
from IPython.display import display
import seaborn as sns
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist, squareform

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 8)

print("✅ Libraries loaded successfully!")
print("🧬 Ready to explore molecular evolution!")

## Part 1: DNA as a Molecular Clock

### The Concept

DNA sequences accumulate mutations at a relatively constant rate over time. This allows us to:
1. Compare DNA sequences between species
2. Calculate genetic distances
3. Estimate when lineages diverged

**Key Formula:**  
**Divergence Time = Genetic Distance / (2 × Mutation Rate)**

The factor of 2 accounts for mutations accumulating in both lineages since divergence.
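
As a quick worked example of the formula, the sketch below plugs in round numbers. The function name `divergence_time_my` and the per-lineage rate of 0.1% per million years are illustrative assumptions, chosen so that a ~1.2% human-chimp distance lands near 6 million years.

```python
def divergence_time_my(distance_pct, rate_pct_per_my):
    """Divergence time (million years) from a pairwise distance (%)
    and a per-lineage rate (% per million years)."""
    if rate_pct_per_my <= 0:
        raise ValueError("rate must be positive")
    # Both lineages accumulate changes, hence the factor of 2
    return distance_pct / (2 * rate_pct_per_my)

# Assumed illustrative values: 1.2% distance, 0.1% per MY per lineage
t = divergence_time_my(1.2, 0.1)
print(f"Estimated divergence: {t:.1f} million years")
```

Doubling the assumed rate halves the estimated time, which is why the choice of rate matters so much.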

In [None]:
def demonstrate_molecular_clock():
    """
    Visual demonstration of molecular clock concept
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    
    # Left plot: Mutation accumulation over time
    time = np.linspace(0, 10, 100)
    mutation_rate = 0.01  # % sequence change per million years, per lineage (illustrative value)
    
    mutations_lineage1 = mutation_rate * time
    mutations_lineage2 = mutation_rate * time
    total_differences = mutations_lineage1 + mutations_lineage2
    
    ax1.plot(time, mutations_lineage1, 'b-', linewidth=2, label='Lineage 1 (e.g., Human)')
    ax1.plot(time, mutations_lineage2, 'r-', linewidth=2, label='Lineage 2 (e.g., Chimp)')
    ax1.plot(time, total_differences, 'g--', linewidth=3, label='Total Differences')
    ax1.fill_between(time, 0, total_differences, alpha=0.2, color='green')
    
    ax1.set_xlabel('Time Since Divergence (Million Years)', fontsize=12)
    ax1.set_ylabel('Accumulated Mutations (%)', fontsize=12)
    ax1.set_title('Molecular Clock: Mutations Accumulate Linearly', fontsize=14, weight='bold')
    ax1.legend(fontsize=11)
    ax1.grid(alpha=0.3)
    
    # Right plot: Relationship between genetic distance and time
    # Using real data from primates
    species_pairs = [
        ('Human-Chimp', 6, 1.2),
        ('Human-Gorilla', 8, 1.6),
        ('Human-Orangutan', 14, 3.1),
        ('Human-Macaque', 25, 7.0),
        ('Human-Marmoset', 35, 13.0)
    ]
    
    divergence_times = [x[1] for x in species_pairs]
    genetic_distances = [x[2] for x in species_pairs]
    labels = [x[0] for x in species_pairs]
    
    ax2.scatter(divergence_times, genetic_distances, s=200, alpha=0.7, 
               c=range(len(species_pairs)), cmap='viridis', edgecolors='black', linewidth=2)
    
    # Add trend line
    z = np.polyfit(divergence_times, genetic_distances, 1)
    p = np.poly1d(z)
    ax2.plot(divergence_times, p(divergence_times), "r--", alpha=0.8, linewidth=2, label='Best fit line')
    
    # Add labels
    for i, label in enumerate(labels):
        ax2.annotate(label, (divergence_times[i], genetic_distances[i]), 
                    xytext=(5, 5), textcoords='offset points', fontsize=9)
    
    ax2.set_xlabel('Divergence Time (Million Years Ago)', fontsize=12)
    ax2.set_ylabel('Genetic Distance (% DNA difference)', fontsize=12)
    ax2.set_title('Real Data: Primate Genetic Distances', fontsize=14, weight='bold')
    ax2.legend()
    ax2.grid(alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("\n📊 Key Insight:")
    print(f"Fitted pairwise divergence rate for primates: ~{z[0]:.3f}% per million years")
    print(f"\nThis means: For every million years of separation,")
    print(f"two lineages accumulate ~{z[0]:.3f}% differences between them")
    print(f"(~{z[0]/2:.3f}% per lineage, the rate used in the clock formula)")
    print(f"\n💡 Humans and chimps differ by ~1.2% → Diverged ~6 million years ago!")

demonstrate_molecular_clock()
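
A strict clock predicts zero distance at zero time, so one option is to fit a line through the origin rather than with a free intercept. The sketch below does that for the same five data points used in the plot. The variable names are illustrative, and halving the slope to get a per-lineage rate assumes both lineages evolve at the same rate.

```python
import numpy as np

# (time in MY, pairwise distance in %) values from the plot above
times = np.array([6, 8, 14, 25, 35], dtype=float)
distances = np.array([1.2, 1.6, 3.1, 7.0, 13.0])

# Least-squares slope for a line forced through the origin: sum(t*d) / sum(t*t)
pairwise_rate = np.sum(times * distances) / np.sum(times ** 2)
per_lineage_rate = pairwise_rate / 2  # each lineage contributes half

print(f"Pairwise rate:    {pairwise_rate:.3f}% per MY")
print(f"Per-lineage rate: {per_lineage_rate:.3f}% per MY")
```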

## Part 2: Analyzing Real DNA Sequences

### Simulated Primate Cytochrome B Sequences

Let's work with cytochrome b, a mitochondrial gene commonly used in evolutionary studies. The sequences below are short simulated fragments, not database records.

In [None]:
# Simulated short sequences (based on real patterns)
# In reality, these are ~1000 bp long
sequences = {
    'Human':      'ATGGCAAGCCTACGAAAACTACACCCTACTAAAAATCATTAACGACTCATTCATTGACCTACCAACACCATCAAACATCTCATC',
    'Chimp':      'ATGGCAAGCCTACGAAAACTACACCCTACTAAAAATCATTAACGACTCATTCATTGACCTACCAACACCATCAAACATCTCGTC',
    'Gorilla':    'ATGGCAAGCCTACGAAAACTACACCCTACTAAAAATCATTAACGACTCATTCATTGACCTACCAACACCATCAAACATCTCATC',
    'Orangutan':  'ATGGCAAGCCTACGAAAACTTCACCCTACTAAAAATTATTAACGACTCATTCATTGACCTACCAACACCCCCAAACATCTCATC',
    'Macaque':    'ATGGCAAGCCTGCGAAAACTTCACCCTACTAAAAATTATTAACAACTCATTCATTGACCTCCCAACACCCTCAAACATTTCATC',
    'Marmoset':   'ATGGCCAGCCTACGAAAACTTCACCCCGCTAAAAATTATTAATGACTCATTCATTGACCTCCCAACACCCCCGAATATCTCGTC'
}

species_list = list(sequences.keys())
seq_length = len(sequences['Human'])

print(f"📝 Cytochrome B gene sequences ({seq_length} bp fragment)\n")
for species, seq in sequences.items():
    print(f"{species:12} {seq[:40]}...{seq[-20:]}")

print(f"\n✅ All sequences are {seq_length} base pairs long")

### Calculate Genetic Distances

In [None]:
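def calculate_genetic_distance(seq1, seq2):
    """
    Calculate percentage difference between two aligned sequences
    """
    # zip() would silently truncate, so insist on equal lengths
    if len(seq1) != len(seq2):
        raise ValueError("Sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return (differences / len(seq1)) * 100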

# Create distance matrix
n_species = len(species_list)
distance_matrix = np.zeros((n_species, n_species))

for i, sp1 in enumerate(species_list):
    for j, sp2 in enumerate(species_list):
        distance_matrix[i, j] = calculate_genetic_distance(sequences[sp1], sequences[sp2])

# Create DataFrame for better visualization
df_distances = pd.DataFrame(distance_matrix, 
                           index=species_list, 
                           columns=species_list)

print("🧬 Genetic Distance Matrix (% sequence difference):\n")
display(df_distances.round(2))

# Visualize as heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df_distances, annot=True, fmt='.2f', cmap='RdYlGn_r', 
            cbar_kws={'label': '% Genetic Distance'},
            linewidths=0.5, square=True)
plt.title('Genetic Distance Matrix: Primate Cytochrome B', fontsize=14, weight='bold')
plt.tight_layout()
plt.show()

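print("\n🔍 Observations:")
print(f"Human-Chimp distance = {df_distances.loc['Human', 'Chimp']:.2f}%")
# Report the closest relative from the data instead of assuming it
closest = df_distances.loc['Human'].drop('Human').idxmin()
print(f"Closest to Human in this fragment: {closest} ({df_distances.loc['Human', closest]:.2f}%)")
print(f"Note: a {seq_length} bp fragment is too short to separate human, chimp and gorilla reliably;")
print("genome-scale data identify chimpanzees as our closest living relatives.")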
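
The percentage computed above is a raw proportion of differing sites. For more distant pairs it tends to undercount change, because the same site can mutate more than once. One common adjustment is the Jukes-Cantor correction, sketched below as an optional extra. It assumes all substitutions are equally likely, and the function name is illustrative.

```python
import math

def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance (substitutions per site) from a
    raw proportion of differing sites p, where 0 <= p < 0.75."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)

# Raw proportions matching the 1.2%, 7.0% and 13.0% distances from Part 1
for p in (0.012, 0.07, 0.13):
    d = jukes_cantor_distance(p)
    print(f"raw p = {p:.3f} -> corrected d = {d:.4f}")
```

The correction is negligible for close relatives and grows with distance, so it matters most for pairs such as Human-Marmoset.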

## Part 3: Building a Molecular Phylogenetic Tree

### From Distances to Phylogeny

In [None]:
def build_phylogenetic_tree():
    """
    Build and visualize a phylogenetic tree from genetic distances
    """
    # Convert the square distance matrix to the condensed form that linkage() expects
    # (pdist would instead compute new Euclidean distances between matrix rows)
    condensed_dist = squareform(distance_matrix)
    
    # Perform hierarchical clustering (UPGMA method)
    linkage_matrix = linkage(condensed_dist, method='average')
    
    # Plot dendrogram
    plt.figure(figsize=(12, 8))
    
    dendrogram(linkage_matrix, 
              labels=species_list,
              orientation='right',
              leaf_font_size=12,
              color_threshold=0)
    
    plt.xlabel('Genetic Distance (%)', fontsize=12, weight='bold')
    plt.title('Molecular Phylogenetic Tree: Primate Relationships\n(Based on Cytochrome B sequences)', 
             fontsize=14, weight='bold')
    plt.grid(axis='x', alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    print("\n🌳 Phylogenetic Tree Interpretation:\n")
    print("Expected pattern from genome-scale data:")
    print("1. Human-Chimp join first (most recent common ancestor)")
    print("2. Gorilla joins next (slightly older divergence)")
    print("3. Orangutan is more distant (Asian great ape)")
    print("4. Macaque is even more distant (Old World monkey)")
    print("5. Marmoset is most distant (New World monkey)")
    print("\n💡 Compare the tree above with this expected topology;")
    print("   a short simulated fragment may not resolve the closest species.")

build_phylogenetic_tree()
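
To see what the clustering step consumes and produces, here is a minimal sketch on a toy 3×3 matrix. The taxa A, B and C and their distances are made up for illustration. `squareform` turns a symmetric matrix into the condensed vector that `linkage` expects, and `method='average'` corresponds to UPGMA.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Toy symmetric distance matrix (%) for three hypothetical taxa A, B, C
toy = np.array([
    [0.0, 2.0, 6.0],
    [2.0, 0.0, 6.0],
    [6.0, 6.0, 0.0],
])

condensed = squareform(toy)                   # pairwise vector: d(A,B), d(A,C), d(B,C)
tree = linkage(condensed, method='average')   # UPGMA

# Each row of tree: the two clusters merged, the merge distance, the new cluster size
print(condensed)
print(tree)
```

A and B merge first at distance 2, and C joins that pair at the average distance of 6.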

## Part 4: The Molecular Clock Calculator

### Interactive Tool: Estimate Divergence Times

In [None]:
def molecular_clock_calculator(species1='Human', species2='Chimp', mutation_rate=0.2):
    """
    Calculate divergence time using molecular clock
    
    mutation_rate: % change per million years, per lineage (typical range: 0.1-0.3)
    """
    # Get genetic distance
    genetic_dist = df_distances.loc[species1, species2]
    
    # Calculate divergence time
    # Formula: Time = Distance / (2 * rate)
    # Factor of 2 because mutations accumulate in both lineages
    divergence_time = genetic_dist / (2 * mutation_rate)
    
    # Visualization
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    
    # Left: Timeline visualization
    ax1.plot([0, divergence_time], [1, 1], 'k-', linewidth=2)
    ax1.plot([divergence_time, divergence_time*2], [1, 1.5], 'b-', linewidth=3, label=species1)
    ax1.plot([divergence_time, divergence_time*2], [1, 0.5], 'r-', linewidth=3, label=species2)
    
    ax1.scatter([0], [1], s=300, c='black', marker='o', zorder=5)
    ax1.scatter([divergence_time*2], [1.5], s=300, c='blue', marker='o', zorder=5)
    ax1.scatter([divergence_time*2], [0.5], s=300, c='red', marker='o', zorder=5)
    
    ax1.text(0, 0.8, 'Common\nAncestor', ha='center', fontsize=11, weight='bold')
    ax1.text(divergence_time*2, 1.65, species1, ha='center', fontsize=11, weight='bold')
    ax1.text(divergence_time*2, 0.35, species2, ha='center', fontsize=11, weight='bold')
    ax1.text(divergence_time, 0.7, f'{divergence_time:.1f} MYA', ha='center', 
            fontsize=12, bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.7))
    
    ax1.set_xlim(-2, divergence_time*2 + 2)
    ax1.set_ylim(0, 2)
    ax1.set_xlabel('Time (Million Years Ago)', fontsize=12)
    ax1.set_title(f'Estimated Divergence: {species1} and {species2}', fontsize=13, weight='bold')
    ax1.legend(fontsize=11)
    ax1.grid(alpha=0.3)
    ax1.set_yticks([])
    
    # Right: Calculation breakdown
    ax2.axis('off')
    
    calc_text = f"""
    MOLECULAR CLOCK CALCULATION
    ═══════════════════════════════════════
    
    Given Information:
    • Species pair: {species1} vs {species2}
    • Genetic distance: {genetic_dist:.2f}%
    • Mutation rate: {mutation_rate:.2f}% per MY
    
    Formula:
    Time = Genetic Distance / (2 × Mutation Rate)
    
    Why divide by 2?
    Because mutations accumulate in BOTH lineages
    since they split from their common ancestor.
    
    Calculation:
    Time = {genetic_dist:.2f} / (2 × {mutation_rate:.2f})
    Time = {genetic_dist:.2f} / {2*mutation_rate:.2f}
    Time = {divergence_time:.2f} million years
    
    ═══════════════════════════════════════
    
    RESULT:
    {species1} and {species2} diverged approximately
    {divergence_time:.2f} million years ago (MYA)
    """
    
    ax2.text(0.1, 0.5, calc_text, fontsize=11, family='monospace',
            verticalalignment='center')
    
    plt.tight_layout()
    plt.show()
    
    # Comparison with known values
    # frozenset keys make the lookup independent of selection order
    known_divergences = {
        frozenset(('Human', 'Chimp')): 6,
        frozenset(('Human', 'Gorilla')): 8,
        frozenset(('Human', 'Orangutan')): 14,
        frozenset(('Human', 'Macaque')): 25
    }
    
    pair = frozenset((species1, species2))
    if pair in known_divergences:
        known_time = known_divergences[pair]
        error = abs(divergence_time - known_time)
        print(f"\n✅ Comparison with fossil/molecular data:")
        print(f"   Estimated: {divergence_time:.2f} MYA")
        print(f"   Known value: {known_time} MYA")
        print(f"   Error: {error:.2f} million years ({error/known_time*100:.1f}%)")
        
        if error < 2:
            print(f"   🎯 Excellent agreement!")
        elif error < 5:
            print(f"   ✓ Reasonable estimate")
        else:
            print(f"   ⚠️ Try adjusting mutation rate")

# Interactive widget
interact(molecular_clock_calculator,
         species1=Dropdown(options=species_list, value='Human', description='Species 1:'),
         species2=Dropdown(options=species_list, value='Chimp', description='Species 2:'),
         mutation_rate=FloatSlider(min=0.1, max=0.5, step=0.05, value=0.2,
                                   description='Mutation Rate\n(% per MY):',
                                   style={'description_width': 'initial'}));
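
The slider asks you to guess a rate, but the formula can also be run in reverse: given one independently dated split, solve for the rate and reuse it for other pairs. The sketch below assumes the ~1.2% and ~6 MYA human-chimp figures and the 3.1% human-orangutan distance from Part 1. The function name `calibrate_rate` is illustrative.

```python
def calibrate_rate(distance_pct, known_time_my):
    """Per-lineage rate (% per million years) implied by a pairwise
    distance (%) and an independently dated split (million years)."""
    if known_time_my <= 0:
        raise ValueError("calibration time must be positive")
    return distance_pct / (2 * known_time_my)

# Assumed calibration point: ~1.2% human-chimp distance, split ~6 MYA
rate = calibrate_rate(1.2, 6)

# Apply the calibrated rate to another pair (3.1% human-orangutan distance)
estimated = 3.1 / (2 * rate)
print(f"Calibrated rate: {rate:.2f}% per MY per lineage")
print(f"Human-Orangutan estimate: {estimated:.1f} MYA")
```

The result can then be compared with the ~14 MYA figure listed for that pair, which gives a feel for how far a single calibration point can be trusted.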

## Part 5: Human Genetic Diversity

### Mitochondrial Eve and Y-Chromosome Adam

In [None]:
print("🧬 MITOCHONDRIAL EVE\n" + "="*60)
print("""
What is Mitochondrial Eve?
• The most recent common ancestor (MRCA) of all living humans
  through the MATERNAL line
• Based on mitochondrial DNA (mtDNA)
• mtDNA is inherited ONLY from mothers

When did she live?
• Estimated: 150,000 - 200,000 years ago
• Location: Africa

Important Clarifications:
❌ She was NOT the only woman alive at that time
❌ She was NOT the first woman
✅ She is simply the MRCA through maternal lineages
✅ All other maternal lineages have died out
""")

print("\n🧬 Y-CHROMOSOME ADAM\n" + "="*60)
print("""
What is Y-Chromosome Adam?
• The MRCA of all living humans through the PATERNAL line
• Based on Y-chromosome DNA
• Y-chromosome is inherited ONLY from fathers

When did he live?
• Estimated: 200,000 - 300,000 years ago
• Location: Africa

Important Notes:
❌ He need NOT have lived at the same time as Mitochondrial Eve
❌ They probably never met!
✅ mtDNA and the Y chromosome are inherited independently → different MRCA times
""")

print("\n💡 KEY INSIGHT:\n" + "="*60)
print("""
Both analyses point to AFRICA as the origin of modern humans!

This supports the "Out of Africa" theory:
1. Homo sapiens evolved in Africa (~300,000 years ago)
2. Small populations left Africa (~60,000-70,000 years ago)
3. These migrants populated the rest of the world
4. African populations retain the HIGHEST genetic diversity
   (migrating groups carried only a subset of it)
""")

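The idea that all but one maternal lineage eventually dies out can be illustrated with a small simulation. This is a deliberately simplified sketch: it assumes a constant population of 50 women, random choice of mothers each generation, and a fixed random seed. The function name is illustrative, and real human populations are far larger and more structured.

```python
import numpy as np

def generations_until_single_lineage(n_females=50, seed=0):
    """Simulate a constant-size population in which each daughter picks her
    mother at random, and count generations until only one founding
    maternal lineage remains."""
    rng = np.random.default_rng(seed)
    lineages = np.arange(n_females)   # each founder starts her own lineage
    generations = 0
    while len(np.unique(lineages)) > 1:
        mothers = rng.integers(0, n_females, size=n_females)
        lineages = lineages[mothers]  # daughters inherit their mother's lineage
        generations += 1
    return generations, int(lineages[0])

gens, founder = generations_until_single_lineage()
print(f"After {gens} generations, everyone descends maternally from founder #{founder}")
```

The surviving founder was one of 50 women alive at the start, which mirrors the point that Mitochondrial Eve was not the only woman of her time.
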
## Part 6: Neanderthal DNA in Modern Humans

### Evidence of Interbreeding

In [None]:
# Neanderthal DNA percentages in modern populations
populations = {
    'Sub-Saharan African': 0.0,
    'Middle Eastern': 2.0,
    'European': 2.1,
    'East Asian': 2.3,
    'South Asian': 1.9,
    'Native American': 2.0,
    'Oceanian': 2.2
}

df_neanderthal = pd.DataFrame(list(populations.items()), 
                             columns=['Population', 'Neanderthal DNA (%)'])

# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Bar chart
colors = ['red' if x == 0 else 'steelblue' for x in df_neanderthal['Neanderthal DNA (%)']]
ax1.barh(df_neanderthal['Population'], df_neanderthal['Neanderthal DNA (%)'], 
        color=colors, alpha=0.7, edgecolor='black', linewidth=2)
ax1.set_xlabel('Neanderthal DNA (%)', fontsize=12, weight='bold')
ax1.set_title('Neanderthal Ancestry in Modern Human Populations', fontsize=14, weight='bold')
ax1.grid(axis='x', alpha=0.3)

# Add percentage labels
for i, row in df_neanderthal.iterrows():
    ax1.text(row['Neanderthal DNA (%)'] + 0.05, i, f"{row['Neanderthal DNA (%)']:.1f}%",
            va='center', fontsize=10, weight='bold')

# Pie chart showing composition
avg_non_african = df_neanderthal[df_neanderthal['Population'] != 'Sub-Saharan African']['Neanderthal DNA (%)'].mean()
sizes = [avg_non_african, 100 - avg_non_african]
labels = [f'Neanderthal\n{avg_non_african:.1f}%', f'Modern Human\n{100-avg_non_african:.1f}%']
colors_pie = ['coral', 'lightblue']
explode = (0.1, 0)

# The labels already carry the percentages, so no autopct format is needed
ax2.pie(sizes, explode=explode, labels=labels, colors=colors_pie,
       shadow=True, startangle=90, textprops={'fontsize': 12, 'weight': 'bold'})
ax2.set_title('Average Non-African Genome Composition', fontsize=14, weight='bold')

plt.tight_layout()
plt.show()

print("\n🔬 Key Findings from Ancient DNA:\n")
print("1. ✅ Non-African populations have 1.5-2.5% Neanderthal DNA")
print("2. ✅ Sub-Saharan Africans have essentially 0% Neanderthal DNA")
print("3. ✅ This indicates interbreeding occurred AFTER Homo sapiens left Africa")
print("4. ✅ Neanderthals lived in Europe and Asia, not Africa")
print("\n💡 Timeline:")
print("   • ~60,000 years ago: Modern humans leave Africa")
print("   • ~50,000-40,000 years ago: Interbreeding with Neanderthals")
print("   • ~40,000 years ago: Neanderthals go extinct")
print("\n🧬 Some Neanderthal genes are still functional today:")
print("   • Immune system genes (helped adapt to new pathogens)")
print("   • Skin and hair genes (cold adaptation)")
print("   • BUT also: increased risk for some diseases")

## Part 7: Practical Exercise - Sequence Comparison

### Compare Two Sequences by Hand

In [None]:
def compare_sequences_visual(sp1='Human', sp2='Chimp'):
    """
    Visually compare two DNA sequences
    """
    seq1 = sequences[sp1]
    seq2 = sequences[sp2]
    
    print(f"\n🔍 Comparing {sp1} vs {sp2}\n" + "="*80)
    print("\nColor coding: \033[92mGreen = Match\033[0m, \033[91mRed = Difference\033[0m\n")
    
    # Print sequences in blocks of 10
    block_size = 10
    differences = 0
    
    for i in range(0, len(seq1), block_size):
        block1 = seq1[i:i+block_size]
        block2 = seq2[i:i+block_size]
        
        # Position label
        print(f"{i+1:3d}-{min(i+block_size, len(seq1)):3d}: ", end='')
        
        # Print seq1 with colors
        for j, (base1, base2) in enumerate(zip(block1, block2)):
            if base1 == base2:
                print(f"\033[92m{base1}\033[0m", end='')  # Green for match
            else:
                print(f"\033[91m{base1}\033[0m", end='')  # Red for mismatch
                differences += 1
        
        print(f"  {sp1}")
        print(" "*9, end='')  # Indent for alignment
        
        # Print seq2
        for base1, base2 in zip(block1, block2):
            if base1 == base2:
                print(f"\033[92m{base2}\033[0m", end='')
            else:
                print(f"\033[91m{base2}\033[0m", end='')
        
        print(f"  {sp2}")
        print()  # Blank line between blocks
    
    percent_diff = (differences / len(seq1)) * 100
    
    print("="*80)
    print(f"\n📊 Results:")
    print(f"   Total positions: {len(seq1)}")
    print(f"   Differences: {differences}")
    print(f"   Percent difference: {percent_diff:.2f}%")
    print(f"\n💡 Interpretation: These species differ by ~{percent_diff:.1f}% in this gene")

# Interactive widget
interact(compare_sequences_visual,
         sp1=Dropdown(options=species_list, value='Human', description='Species 1:'),
         sp2=Dropdown(options=species_list, value='Chimp', description='Species 2:'));
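
Before using the widget, it can help to check the counting on a pair short enough to verify by eye. The two 10 bp strings below are made up for illustration. Matches are marked with `|` and mismatches with `*`, as you might do on paper.

```python
seq_x = "ATGGCAAGCC"
seq_y = "ATGGCTAGCA"   # made-up sequence for illustration

# Mark matches with '|' and mismatches with '*'
marks = "".join("|" if a == b else "*" for a, b in zip(seq_x, seq_y))
n_diff = marks.count("*")

print(seq_x)
print(marks)
print(seq_y)
print(f"{n_diff} differences in {len(seq_x)} positions = {100 * n_diff / len(seq_x):.0f}%")
```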

## Summary: Key Takeaways

### What We've Learned:

1. **Molecular Evidence Complements Fossils**
   - DNA provides precise evolutionary relationships
   - Can estimate divergence times (molecular clock)
   - Reveals population history invisible in fossils

2. **The Molecular Clock**
   - DNA mutations accumulate at a roughly constant rate
   - Formula: Time = Distance / (2 × Rate)
   - Calibrated using fossil evidence

3. **Human-Primate Relationships**
   - Humans and chimps: ~1.2% DNA difference, diverged ~6 MYA
   - Gorilla is next closest relative (~8 MYA)
   - All great apes share a recent common ancestor

4. **Human Origins**
   - Mitochondrial Eve: ~150,000-200,000 years ago, Africa
   - Y-chromosome Adam: ~200,000-300,000 years ago, Africa
   - Both support "Out of Africa" theory

5. **Neanderthal Introgression**
   - 1.5-2.5% of non-African genomes are Neanderthal
   - Strong evidence that interbreeding occurred
   - Some Neanderthal genes still functional

6. **African Genetic Diversity**
   - Highest within Africa
   - Decreases with distance from Africa
   - Bottleneck effect from small migrating populations

---

### Methods You Can Now Apply:
- Calculate genetic distances from DNA sequences
- Build phylogenetic trees from molecular data
- Use molecular clocks to estimate divergence times
- Interpret population genetic data

---

*Created for BSc Zoology students at Kuchinda College*  
*Part of Unit 4: Origin and Evolution of Man*  
*Dr. Alok Patel and Ms. Susama Kar, Department of Zoology*