# Introduction to Epigenetic Data Analysis

Welcome to this hands-on exploration of epigenomic data analysis. In this notebook, we'll work through simulated datasets modeled on real epigenomic patterns to understand how DNA methylation and histone modifications regulate gene expression across cell types, and how these patterns change during development and disease.

## What we'll cover

This notebook walks through several key concepts in epigenomic data analysis:

- **Understanding epigenomic data structure** - What do methylation and ChIP-seq datasets contain?
- **DNA methylation patterns** - How methylation varies across cell types and genomic regions
- **Histone modification landscapes** - Mapping the histone code around genes
- **Chromatin state classification** - Identifying active, repressed, and poised genomic regions
- **Development and disease** - How epigenetic patterns change during differentiation and in cancer

The simulated data are modeled on patterns reported by three major epigenomic resources:
- **NIH Roadmap Epigenomics Program** - 111 reference human epigenomes
- **ENCODE Project** - Encyclopedia of DNA Elements
- **Cancer epigenome studies** - Disease-associated epigenetic changes

## A note on the data

The analyses here use simulated data based on real epigenomic patterns from published studies. The patterns and insights we'll discover reflect actual biological findings from the epigenomics field.

Let's dive into the dynamic world of epigenetics.

## Setup and Data Loading

First, let's import the necessary libraries and set up our epigenomic analysis tools.

In [None]:
import sys
import os
sys.path.append('../scripts')

# Import our custom epigenomic analysis tools
from epigenetic_utils import (
    EpigenomeDataProcessor, 
    MethylationAnalyzer, 
    HistoneModificationAnalyzer,
    EpigenomeVisualizer
)
from config import (
    SAMPLE_METADATA, 
    HISTONE_MARKS, 
    DEMO_GENES,
    generate_sample_methylation_data,
    generate_sample_histone_data,
    get_gene_info
)

# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython import display

# Configure display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 50)
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Epigenomic analysis libraries loaded successfully.")
print(f"Available sample metadata: {len(SAMPLE_METADATA)} epigenomes")
print(f"Histone marks analyzed: {len(HISTONE_MARKS)} modifications")
print(f"Demo genes available: {sum(len(cat) for cat in DEMO_GENES.values())} genes")
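
The `generate_sample_methylation_data` helper imported above lives in `config`, and its implementation is not part of this notebook. As a rough illustration only, here is a minimal sketch of how a simulated methylation table could be built. It assumes one row per region, the `chr`, `start`, `end`, and `region_id` columns that later cells skip over, and one column of methylation fractions per sample. The function name `simulate_methylation`, the sample names, and the Beta and noise parameters are all hypothetical.

```python
import numpy as np
import pandas as pd


def simulate_methylation(sample_names, n_regions=500, seed=0):
    """Hypothetical stand-in for a methylation simulator (not the notebook's helper)."""
    rng = np.random.default_rng(seed)
    # Draw a shared baseline per region from a U-shaped Beta distribution,
    # so most regions are either mostly unmethylated or mostly methylated.
    baseline = rng.beta(0.5, 0.5, size=n_regions)
    starts = np.arange(n_regions) * 1000
    table = pd.DataFrame({
        "chr": "chr1",
        "start": starts,
        "end": starts + 500,
        "region_id": [f"region_{i + 1}" for i in range(n_regions)],
    })
    for name in sample_names:
        # Add per-sample noise around the baseline and clip to the valid [0, 1] range.
        noise = rng.normal(0.0, 0.05, size=n_regions)
        table[name] = np.clip(baseline + noise, 0.0, 1.0)
    return table


toy_methylation = simulate_methylation(["sample_A", "sample_B", "sample_C"], n_regions=200)
print(toy_methylation.head())
```

Methylation fractions are bounded between 0 and 1, which is why a Beta distribution is a convenient choice for a toy generator like this one.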

In [None]:
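# Classify methylation levels and visualize distribution
# Note: methylation_data, methylation_analyzer, and visualizer are assumed to
# have been created in an earlier cell.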
sample_col = [col for col in methylation_data.columns 
              if col not in ['chr', 'start', 'end', 'region_id']][0]

classifications = methylation_analyzer.classify_methylation_levels(methylation_data[sample_col])
class_counts = pd.Series(classifications).value_counts()

print(f"Methylation classification for sample {sample_col}:")
print("=" * 45)
for category, count in class_counts.items():
    percentage = (count / len(classifications)) * 100
    print(f"  {category}: {count} regions ({percentage:.1f}%)")

# Create methylation distribution visualization
try:
    fig = visualizer.plot_methylation_distribution(methylation_data)
    plt.show()
except Exception as e:
    print(f"Could not render methylation distribution plot: {e}")
    print("If a plotting library is missing, install with: pip install matplotlib seaborn")

print("\nKey observations (typical patterns; verify against the counts above):")
print("- Most genomic regions show intermediate methylation")
print("- Unmethylated regions often correspond to active promoters")
print("- Highly methylated regions include gene bodies and repetitive elements")
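
The internals of `classify_methylation_levels` are not shown in this notebook. One plausible way to bin methylation fractions into categories is with `pandas.cut`, sketched below. The cutoffs (0.2 and 0.8), the category labels, and the function name `classify_levels` are assumptions chosen for illustration and may differ from what `MethylationAnalyzer` actually uses.

```python
import pandas as pd


def classify_levels(beta_values, low=0.2, high=0.8):
    """Bin methylation fractions into three classes (cutoffs are assumptions)."""
    labels = ["unmethylated", "intermediate", "highly_methylated"]
    # include_lowest=True makes the first bin closed on the left, so 0.0 is kept.
    return pd.cut(beta_values, bins=[0.0, low, high, 1.0],
                  labels=labels, include_lowest=True)


toy_levels = pd.Series([0.02, 0.15, 0.50, 0.79, 0.95, 1.00])
toy_classes = classify_levels(toy_levels)
print(toy_classes.value_counts())
```

Because the category counts depend directly on the cutoffs, it is worth rerunning the classification with different `low` and `high` values before interpreting the percentages.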

## Differential Methylation Analysis

One of the most powerful applications of methylation analysis is identifying regions whose methylation differs between cell types or conditions. Let's compare embryonic stem cells (ESCs) with differentiated cells to find differentially methylated regions (DMRs).

In [None]:
# Perform differential methylation analysis
print("Differential Methylation Analysis: ESCs vs Differentiated Cells")
print("=" * 65)

# Identify ESC and differentiated samples
esc_samples = [sample for sample, info in SAMPLE_METADATA.items() 
               if info['category'] == 'ESC' and sample in methylation_data.columns]
diff_samples = [sample for sample, info in SAMPLE_METADATA.items() 
                if info['category'] in ['ESC_Derived', 'Other'] and sample in methylation_data.columns]

print(f"ESC samples: {esc_samples[:5]}...")
print(f"Differentiated samples: {diff_samples[:5]}...")

if len(esc_samples) >= 2 and len(diff_samples) >= 2:
    dmrs = processor.identify_differential_regions(
        methylation_data, esc_samples, diff_samples,
        min_diff=0.2, p_threshold=0.05
    )
    
    if len(dmrs) > 0:
        print(f"\nFound {len(dmrs)} differentially methylated regions (DMRs)")
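        print("\nTop 10 DMRs by significance:")
        # Sort explicitly so the rows shown really are the most significant ones
        top_dmrs = dmrs.sort_values('p_value')
        display.display(top_dmrs[['region_id', 'group1_mean', 'group2_mean', 
                                  'methylation_diff', 'p_value']].head(10))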
        
        # Analyze DMR patterns
        # (labels below assume methylation_diff = ESC mean - differentiated mean)
        hypermethylated = dmrs[dmrs['methylation_diff'] > 0]
        hypomethylated = dmrs[dmrs['methylation_diff'] < 0]
        
        print(f"\nDMR patterns:")
        print(f"  Hypermethylated in ESCs: {len(hypermethylated)} regions")
        print(f"  Hypomethylated in ESCs: {len(hypomethylated)} regions")
        
        # Show most extreme differences
        most_hyper = dmrs.loc[dmrs['methylation_diff'].idxmax()]
        most_hypo = dmrs.loc[dmrs['methylation_diff'].idxmin()]
        
        print(f"\nMost hypermethylated in ESCs: {most_hyper['region_id']} (Δ={most_hyper['methylation_diff']:.3f})")
        print(f"Most hypomethylated in ESCs: {most_hypo['region_id']} (Δ={most_hypo['methylation_diff']:.3f})")
    else:
        print("No significant DMRs found with current thresholds.")
        print("This could indicate similar methylation patterns or need for parameter adjustment.")
else:
    print("Insufficient samples for differential analysis.")
    print("Need at least 2 samples per group for statistical comparison.")
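
How `identify_differential_regions` works internally is not shown here. A common approach, sketched below under stated assumptions, is to run a two-sample test per region and keep regions that pass both an effect-size cutoff and a p-value cutoff. This sketch uses Welch's t-test from SciPy and defines the difference as group 1 minus group 2. The function name `find_dmrs` and the toy values are hypothetical, and the output column names simply mirror those displayed in the cell above.

```python
import pandas as pd
from scipy import stats


def find_dmrs(table, group1, group2, min_diff=0.2, p_threshold=0.05):
    """Per-region comparison of two sample groups (illustrative sketch only)."""
    g1 = table[group1].to_numpy()
    g2 = table[group2].to_numpy()
    # Welch's t-test, computed row by row (one test per region).
    _, p_values = stats.ttest_ind(g1, g2, axis=1, equal_var=False)
    result = pd.DataFrame({
        "region_id": table["region_id"].to_numpy(),
        "group1_mean": g1.mean(axis=1),
        "group2_mean": g2.mean(axis=1),
    })
    # Assumed sign convention: positive means higher methylation in group 1.
    result["methylation_diff"] = result["group1_mean"] - result["group2_mean"]
    result["p_value"] = p_values
    keep = (result["methylation_diff"].abs() >= min_diff) & (result["p_value"] < p_threshold)
    return result[keep].sort_values("p_value").reset_index(drop=True)


toy_table = pd.DataFrame({
    "region_id": ["r1", "r2", "r3"],
    "esc_1": [0.10, 0.50, 0.90], "esc_2": [0.12, 0.52, 0.88], "esc_3": [0.08, 0.48, 0.92],
    "diff_1": [0.80, 0.51, 0.30], "diff_2": [0.85, 0.49, 0.35], "diff_3": [0.78, 0.50, 0.28],
})
toy_dmrs = find_dmrs(toy_table, ["esc_1", "esc_2", "esc_3"], ["diff_1", "diff_2", "diff_3"])
print(toy_dmrs)
```

Note that this sketch applies no multiple-testing correction. With thousands of regions, a raw threshold of 0.05 would be expected to let through many false positives, so a correction such as Benjamini-Hochberg is usually applied in practice.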

## Histone Modification Landscapes

Histone modifications create a complex regulatory code that marks different functional genomic elements. Let's analyze histone modification patterns around gene transcription start sites (TSSs) and classify chromatin states.

In [None]:
# Generate and analyze histone modification data
histone_analyzer = HistoneModificationAnalyzer()

print("Generating sample histone modification data...")
histone_data = generate_sample_histone_data(n_regions=100, window_size=100)

print(f"\nHistone modification dataset:")
print(f"Shape: {histone_data.shape}")
print(f"Genes analyzed: {histone_data['gene'].nunique()}")
print(f"Positions per gene: {len(histone_data) // histone_data['gene'].nunique()}")
print(f"Histone marks: {[col for col in histone_data.columns if col.startswith('H3')]}")

display.display(histone_data.head())

# Calculate average signal levels
histone_means = histone_data[['H3K4me3', 'H3K4me1', 'H3K27ac', 'H3K27me3', 'H3K9me3']].mean()

print("\nAverage histone modification signal levels:")
for mark, level in histone_means.items():
    mark_info = HISTONE_MARKS.get(mark, {})
    function = mark_info.get('function', 'Unknown function')
    print(f"  {mark}: {level:.2f} ({function})")

In [None]:
# Analyze histone modification profile for a specific gene
gene_of_interest = 'Gene_1'
print(f"Analyzing histone modification profile for {gene_of_interest}:")

gene_profile = histone_data[histone_data['gene'] == gene_of_interest]

if len(gene_profile) > 0:
    print(f"\nProfile data points: {len(gene_profile)}")
    print(f"Position range: {gene_profile['position'].min()} to {gene_profile['position'].max()} bp from TSS")
    
    # Show peak signals for each mark
    print("\nPeak signal levels:")
    for mark in ['H3K4me3', 'H3K4me1', 'H3K27ac', 'H3K27me3', 'H3K9me3']:
        peak_signal = gene_profile[mark].max()
        peak_position = gene_profile.loc[gene_profile[mark].idxmax(), 'position']
        print(f"  {mark}: {peak_signal:.2f} (peak at {peak_position} bp from TSS)")
    
    # Visualize profile
    try:
        fig = visualizer.plot_histone_modification_profile(
            gene_profile.set_index('position'),
            gene_name=gene_of_interest
        )
        plt.show()
    except Exception as e:
        print(f"Could not render histone modification profile: {e}")
        
    print("\nPattern interpretation:")
    tss_rows = gene_profile[gene_profile['position'] == 0]
    if len(tss_rows) > 0:
        tss_h3k4me3 = tss_rows['H3K4me3'].iloc[0]
    else:
        # No data point exactly at the TSS; fall back to the profile-wide mean
        tss_h3k4me3 = gene_profile['H3K4me3'].mean()
    if tss_h3k4me3 > 2.0:
        print("- High H3K4me3 at TSS suggests active promoter")
    if gene_profile['H3K27ac'].mean() > 1.0:
        print("- H3K27ac presence indicates active chromatin")
    if gene_profile['H3K27me3'].mean() > 1.0:
        print("- H3K27me3 signal suggests some repressive activity")
else:
    print(f"No data found for {gene_of_interest}")
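
A single-gene profile can be noisy, so a common complementary view is the aggregate profile: the average signal at each position relative to the TSS, taken across many genes. The sketch below builds a small toy table with the same `gene` and `position` columns used above and averages one mark across genes. The gene names, positions, and bell-shaped signal are invented for illustration.

```python
import numpy as np
import pandas as pd

positions = np.arange(-1000, 1001, 500)  # bp relative to the TSS
records = []
for gene, scale in [("Gene_A", 1.0), ("Gene_B", 2.0), ("Gene_C", 3.0)]:
    # Toy H3K4me3 signal: a bell-shaped bump centred on the TSS.
    signal = scale * np.exp(-(positions / 500.0) ** 2)
    for pos, value in zip(positions, signal):
        records.append({"gene": gene, "position": int(pos), "H3K4me3": float(value)})
toy_histone = pd.DataFrame(records)

# Average across genes at each position to get the aggregate profile.
meta_profile = toy_histone.groupby("position")["H3K4me3"].mean()
peak_position = meta_profile.idxmax()

print(meta_profile.round(3))
print(f"Aggregate H3K4me3 peak at {peak_position} bp from TSS")
```

The same `groupby("position")` pattern could be applied to `histone_data`, assuming every gene is sampled at the same set of positions.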

## Chromatin State Classification

Different combinations of histone modifications define distinct chromatin states. Let's classify genomic regions based on their histone modification signatures.

In [None]:
# Classify chromatin states
print("Chromatin State Classification Analysis:")
print("=" * 40)

# Summarize histone modifications by gene (average across positions)
gene_summary = histone_data.groupby('gene')[['H3K4me3', 'H3K4me1', 'H3K27ac', 'H3K27me3', 'H3K9me3']].mean()

print(f"Analyzing {len(gene_summary)} genes for chromatin state classification...")

# Classify chromatin states
classified_data = histone_analyzer.classify_chromatin_states(gene_summary)

# Count and display chromatin states
state_counts = classified_data['chromatin_state'].value_counts()

print("\nChromatin state distribution:")
for state, count in state_counts.items():
    percentage = (count / len(classified_data)) * 100
    print(f"  {state}: {count} genes ({percentage:.1f}%)")

# Show examples of each state
print("\nExample genes for each chromatin state:")
for state in state_counts.index:
    examples = classified_data[classified_data['chromatin_state'] == state].index[:3].tolist()
    print(f"  {state}: {', '.join(examples)}")

# Analyze histone mark patterns by state
print("\nAverage histone mark levels by chromatin state:")
state_profiles = classified_data.groupby('chromatin_state')[['H3K4me3', 'H3K4me1', 'H3K27ac', 'H3K27me3', 'H3K9me3']].mean()
display.display(state_profiles.round(2))

In [None]:
# Visualize chromatin state distribution
try:
    fig = visualizer.plot_chromatin_state_pie(classified_data['chromatin_state'])
    plt.show()
except Exception as e:
    print(f"Could not render chromatin state chart: {e}")

print("\nBiological interpretation of chromatin states:")
print("\n• Active Promoter: High H3K4me3, active gene transcription")
print("• Active Enhancer: H3K4me1 + H3K27ac, regulatory element activity")
print("• Poised Enhancer: H3K4me1 only, ready for activation")
print("• Polycomb Repressed: H3K27me3, developmental gene silencing")
print("• Heterochromatin: H3K9me3, constitutive gene silencing")
print("• Low Signal: Minimal histone modifications, inactive regions")

print("\nThese states create a functional map of the genome,")
print("showing which regions are active, poised, or silenced.")
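
The rules inside `classify_chromatin_states` are not shown in this notebook. The state definitions listed above can be expressed as a simple rule-based classifier, sketched below. The single signal threshold of 1.0, the order in which the rules are checked, and the function name `call_state` are assumptions made for illustration. Published chromatin state maps typically use probabilistic models rather than fixed cutoffs.

```python
import pandas as pd


def call_state(row, threshold=1.0):
    """Assign one chromatin state per row using fixed cutoffs (illustrative only)."""
    if row["H3K4me3"] > threshold:
        return "Active Promoter"
    if row["H3K4me1"] > threshold and row["H3K27ac"] > threshold:
        return "Active Enhancer"
    if row["H3K4me1"] > threshold:
        return "Poised Enhancer"
    if row["H3K27me3"] > threshold:
        return "Polycomb Repressed"
    if row["H3K9me3"] > threshold:
        return "Heterochromatin"
    return "Low Signal"


toy_marks = pd.DataFrame(
    {
        "H3K4me3": [3.2, 0.3, 0.2, 0.1, 0.1, 0.2],
        "H3K4me1": [0.5, 2.1, 1.8, 0.2, 0.1, 0.3],
        "H3K27ac": [2.5, 2.4, 0.3, 0.1, 0.1, 0.2],
        "H3K27me3": [0.1, 0.2, 0.4, 2.8, 0.3, 0.2],
        "H3K9me3": [0.1, 0.1, 0.2, 0.3, 2.6, 0.1],
    },
    index=[f"toy_gene_{i}" for i in range(1, 7)],
)
toy_marks["chromatin_state"] = toy_marks.apply(call_state, axis=1)
print(toy_marks["chromatin_state"])
```

Rule order matters in a scheme like this: a region with both high H3K4me3 and high H3K27me3 would be labeled an active promoter here, although such bivalent regions are often treated as a separate poised state.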

## Cancer Epigenetics: Tumor Suppressor Silencing

Cancer involves not just genetic mutations, but also epigenetic changes that silence tumor suppressor genes. Let's examine how key cancer genes show altered methylation patterns in disease.

In [None]:
# Analyze cancer-related genes
print("Cancer Epigenetics: Tumor Suppressor Gene Silencing")
print("=" * 55)

np.random.seed(42)  # For reproducible results

cancer_analysis = []

for gene_name, gene_info in DEMO_GENES['cancer'].items():
    print(f"\n{gene_info['name']}:")
    print(f"  Location: {gene_info['chr']}:{gene_info['start']:,}-{gene_info['end']:,}")
    print(f"  Function: {gene_info['function']}")
    
    # Simulate normal vs cancer methylation patterns
    # Tumor suppressors often become hypermethylated in cancer
    normal_methylation = np.random.beta(1, 5)  # Low methylation (active)
    cancer_methylation = np.random.beta(4, 1)  # High methylation (silenced)
    
    # Corresponding histone mark changes
    normal_h3k4me3 = np.random.uniform(2.5, 4.0)  # Active promoter
    cancer_h3k4me3 = np.random.uniform(0.2, 1.0)  # Silenced promoter
    
    normal_h3k27me3 = np.random.uniform(0.0, 0.5)  # Low repression
    cancer_h3k27me3 = np.random.uniform(1.5, 3.0)  # High repression
    
    fold_change_meth = cancer_methylation / normal_methylation if normal_methylation > 0 else float('inf')
    fold_change_h3k4me3 = cancer_h3k4me3 / normal_h3k4me3
    
    print(f"  Methylation patterns:")
    print(f"    Normal: {normal_methylation:.3f} (active gene)")
    print(f"    Cancer: {cancer_methylation:.3f} (silenced gene)")
    print(f"    Fold change: {fold_change_meth:.1f}x increase")
    
    print(f"  H3K4me3 (active promoter mark):")
    print(f"    Normal: {normal_h3k4me3:.2f}")
    print(f"    Cancer: {cancer_h3k4me3:.2f}")
    print(f"    Fold change: {fold_change_h3k4me3:.2f}x (loss of active mark)")
    
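    cancer_analysis.append({
        'gene': gene_name,
        'normal_meth': normal_methylation,
        'cancer_meth': cancer_methylation,
        'meth_fc': fold_change_meth,
        'normal_h3k4me3': normal_h3k4me3,
        'cancer_h3k4me3': cancer_h3k4me3,
        'h3k4me3_fc': fold_change_h3k4me3,
        # Keep the simulated repressive-mark values so they can be inspected later
        'normal_h3k27me3': normal_h3k27me3,
        'cancer_h3k27me3': cancer_h3k27me3
    })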

# Summary of cancer epigenetic changes
cancer_df = pd.DataFrame(cancer_analysis)
print(f"\nSummary of epigenetic changes in cancer:")
print(f"Average methylation fold change (cancer/normal): {cancer_df['meth_fc'].mean():.1f}x")
print(f"Average H3K4me3 fold change (cancer/normal): {cancer_df['h3k4me3_fc'].mean():.2f}x (values below 1 indicate loss)")

print("\nClinical implications:")
print("- Promoter hypermethylation can silence tumor suppressor genes")
print("- Loss of active histone marks is consistent with gene inactivation")
print("- These changes are potentially reversible with epigenetic drugs")
print("- DNA methyltransferase inhibitors can reactivate silenced genes")
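
To see why the simulation above produces low "normal" and high "cancer" methylation, it helps to recall that a Beta(a, b) distribution has mean a / (a + b). The short sketch below computes the expected values for the two distributions used in the cell. The helper name `beta_mean` is introduced here purely for illustration.

```python
def beta_mean(a, b):
    """Expected value of a Beta(a, b) distribution."""
    return a / (a + b)


normal_expected = beta_mean(1, 5)  # Beta(1, 5): simulated normal tissue
cancer_expected = beta_mean(4, 1)  # Beta(4, 1): simulated tumor tissue

print(f"Expected normal methylation: {normal_expected:.3f}")
print(f"Expected cancer methylation: {cancer_expected:.3f}")
print(f"Ratio of expected values: {cancer_expected / normal_expected:.1f}x")
```

The average of per-gene fold changes reported by the cell can differ substantially from this ratio of expected values. Beta(1, 5) places non-negligible probability near zero, so a single gene with a very small simulated normal value can dominate the mean fold change.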

## Integrated Epigenomic Analysis

The power of epigenomics comes from integrating multiple types of data. Let's combine DNA methylation and histone modification data to get a comprehensive view of chromatin regulation.

In [None]:
# Perform integrated epigenomic analysis
print("Integrated Epigenomic Analysis:")
print("=" * 35)

# Generate coordinated methylation and histone data
np.random.seed(42)
n_regions = 200

integrated_regions = []

for i in range(n_regions):
    region_id = f"Region_{i+1}"
    
    # Assign functional category
    category = np.random.choice([
        'active_promoter', 'active_enhancer', 'poised_enhancer',
        'polycomb_repressed', 'heterochromatin', 'quiescent'
    ], p=[0.15, 0.20, 0.15, 0.15, 0.10, 0.25])
    
    # Generate biologically consistent patterns
    if category == 'active_promoter':
        methylation = np.random.beta(1, 6)  # Very low
        h3k4me3 = np.random.uniform(3, 5)
        h3k27ac = np.random.uniform(2, 4)
        h3k27me3 = np.random.uniform(0, 0.5)
    elif category == 'active_enhancer':
        methylation = np.random.beta(1, 4)  # Low
        h3k4me3 = np.random.uniform(0, 1)
        h3k27ac = np.random.uniform(2, 4)
        h3k27me3 = np.random.uniform(0, 0.5)
    elif category == 'poised_enhancer':
        methylation = np.random.beta(2, 3)  # Medium
        h3k4me3 = np.random.uniform(0, 1)
        h3k27ac = np.random.uniform(0, 1)
        h3k27me3 = np.random.uniform(0.5, 1.5)
    elif category == 'polycomb_repressed':
        methylation = np.random.beta(2, 2)  # Variable
        h3k4me3 = np.random.uniform(0, 0.5)
        h3k27ac = np.random.uniform(0, 0.5)
        h3k27me3 = np.random.uniform(2, 4)
    elif category == 'heterochromatin':
        methylation = np.random.beta(4, 1)  # High
        h3k4me3 = np.random.uniform(0, 0.3)
        h3k27ac = np.random.uniform(0, 0.3)
        h3k27me3 = np.random.uniform(0, 1)
    else:  # quiescent
        methylation = np.random.beta(2, 2)  # Variable
        h3k4me3 = np.random.uniform(0, 1)
        h3k27ac = np.random.uniform(0, 1)
        h3k27me3 = np.random.uniform(0, 1)
    
    integrated_regions.append({
        'region_id': region_id,
        'category': category,
        'methylation': methylation,
        'H3K4me3': h3k4me3,
        'H3K27ac': h3k27ac,
        'H3K27me3': h3k27me3
    })

integrated_df = pd.DataFrame(integrated_regions)

print(f"Integrated dataset: {len(integrated_df)} genomic regions")
print(f"Categories: {integrated_df['category'].value_counts().to_dict()}")

# Calculate correlations between epigenetic marks
correlation_matrix = integrated_df[['methylation', 'H3K4me3', 'H3K27ac', 'H3K27me3']].corr()

print("\nCorrelation matrix between epigenetic marks:")
display.display(correlation_matrix.round(3))

# Analyze patterns by category
print("\nMean epigenetic levels by functional category:")
category_means = integrated_df.groupby('category')[['methylation', 'H3K4me3', 'H3K27ac', 'H3K27me3']].mean()
display.display(category_means.round(3))
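
By default, `DataFrame.corr()` reports Pearson correlation coefficients. As a reminder of what that number measures, here is a minimal NumPy sketch of the calculation on a handful of made-up values. The function name `pearson_r` and the toy numbers are illustrative assumptions, not values from the dataset above.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float(numerator / denominator)


# Toy regions where methylation rises as the active mark falls
toy_methylation_levels = [0.05, 0.10, 0.40, 0.75, 0.90]
toy_h3k4me3_signal = [4.2, 3.8, 1.5, 0.6, 0.2]

r = pearson_r(toy_methylation_levels, toy_h3k4me3_signal)
print(f"r = {r:.3f}")
```

Pearson correlation captures linear association only. For marks with skewed signal distributions, a rank-based measure such as `integrated_df.corr(method='spearman')` may be a useful cross-check.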

In [None]:
# Create integrated visualizations
print("Creating integrated epigenomic visualizations...")

try:
    # PCA analysis of integrated data
    fig1 = visualizer.plot_pca_analysis(
        integrated_df.set_index('region_id')[['methylation', 'H3K4me3', 'H3K27ac', 'H3K27me3']].T,
        metadata=integrated_df.set_index('region_id')[['category']],
        color_by='category'
    )
    plt.title('PCA of Integrated Epigenomic Data')
    plt.show()
    
    # Correlation heatmap
    plt.figure(figsize=(8, 6))
    sns.heatmap(correlation_matrix, annot=True, cmap='RdBu_r', center=0, 
                square=True, cbar_kws={'label': 'Correlation'})
    plt.title('Epigenetic Mark Correlations')
    plt.tight_layout()
    plt.show()
    
except Exception as e:
    print(f"Could not render integrated visualizations: {e}")

print("\nKey insights from integrated analysis:")
print("\n1. NEGATIVE CORRELATION between methylation and active marks:")
meth_h3k4me3_corr = correlation_matrix.loc['methylation', 'H3K4me3']
print(f"   Methylation vs H3K4me3: r = {meth_h3k4me3_corr:.3f}")
print("   High methylation = low H3K4me3 (gene silencing)")

print("\n2. POSITIVE CORRELATION between active marks:")
h3k4me3_h3k27ac_corr = correlation_matrix.loc['H3K4me3', 'H3K27ac']
print(f"   H3K4me3 vs H3K27ac: r = {h3k4me3_h3k27ac_corr:.3f}")
print("   Active promoters often have both marks")

print("\n3. FUNCTIONAL SPECIALIZATION by category:")
print("   Active promoters: Low methylation + High H3K4me3")
print("   Active enhancers: Low methylation + High H3K27ac")
print("   Heterochromatin: High methylation + Low active marks")
print("   Polycomb repressed: High H3K27me3 + Variable methylation")
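
The PCA step above is delegated to `visualizer.plot_pca_analysis`, whose implementation is not shown. For readers who want to see the underlying computation, here is a minimal sketch of PCA via singular value decomposition in NumPy. It assumes rows are regions and columns are epigenetic marks. The function name `pca_scores` and the random toy matrix are hypothetical.

```python
import numpy as np


def pca_scores(matrix, n_components=2):
    """Project rows onto the leading principal components (columns are features)."""
    data = np.asarray(matrix, dtype=float)
    # Standardize each feature so marks on different scales contribute equally.
    standardized = (data - data.mean(axis=0)) / data.std(axis=0)
    # The right singular vectors are the principal axes; U * S gives the scores.
    u, singular_values, _ = np.linalg.svd(standardized, full_matrices=False)
    scores = u[:, :n_components] * singular_values[:n_components]
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return scores, explained[:n_components]


rng = np.random.default_rng(0)
toy_matrix = rng.normal(size=(50, 4))
toy_matrix[:, 1] += 3 * toy_matrix[:, 0]  # make two features strongly correlated

scores, explained = pca_scores(toy_matrix)
print(f"Scores shape: {scores.shape}")
print(f"Variance explained by PC1 and PC2: {explained.round(3)}")
```

Standardizing first matters for data like this, because methylation is bounded between 0 and 1 while the histone signals range over several units and would otherwise dominate the components.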

## Summary and Biological Insights

In this notebook, we've explored the fascinating world of epigenetic regulation through computational analysis of DNA methylation and histone modification patterns.

### What we learned

**Epigenetic mechanisms**: DNA methylation and histone modifications work together to regulate gene expression without changing the underlying DNA sequence.

**Cell-type specificity**: Different cell types have distinct epigenetic signatures that reflect their functional state and developmental history.

**Chromatin states**: Combinations of histone modifications define functional genomic elements - active promoters, enhancers, and repressed regions.

**Development and differentiation**: Epigenetic changes drive cell fate decisions by activating lineage-specific genes and silencing pluripotency factors.

**Disease relevance**: Cancer involves epigenetic silencing of tumor suppressor genes through DNA hypermethylation and repressive histone marks.

### Practical implications

These analyses demonstrate why epigenetics is crucial for precision medicine:

- **Biomarkers**: Epigenetic patterns can help predict disease risk and treatment response
- **Therapeutics**: Epigenetic drugs can reverse some disease-associated silencing
- **Development**: Understanding normal epigenetic programs guides regenerative medicine
- **Personalized medicine**: Epigenetic profiles vary between individuals and populations

### Key biological principles

**The histone code**: Different histone modifications mark distinct functional states:
- H3K4me3: Active promoters
- H3K27ac: Active enhancers and active promoters
- H3K27me3: Polycomb-repressed genes
- H3K9me3: Heterochromatin

**DNA methylation**: CpG methylation at promoters is associated with stable gene silencing:
- Promoter methylation silences tumor suppressors in cancer
- Gene body methylation, by contrast, is often associated with active transcription
- Methylation patterns are maintained through cell divisions

**Chromatin states**: Integrate multiple marks to classify genomic regions:
- Active vs. repressed chromatin
- Promoters vs. enhancers
- Poised vs. active regulatory elements

### Further exploration

To continue learning about epigenomic data analysis:

1. **Explore real datasets** from the NIH Roadmap Epigenomics Program
2. **Try the demo script** for additional analysis examples
3. **Examine specific genes** of interest in your research
4. **Integrate with transcriptomic data** to understand gene regulation
5. **Apply machine learning** to predict gene expression from epigenetic features

The field of epigenomics continues to reveal new layers of gene regulation. The analytical approaches we've covered provide a foundation for understanding how epigenetic modifications control cellular identity, development, and disease.

**Next steps**: Try running the `demo.py` script to see these analyses applied to larger datasets, and explore the `epigenetic_utils.py` module for additional analysis functions.