# PyPopART Basic Workflow

This notebook demonstrates the basic workflow for constructing and visualizing haplotype networks using PyPopART.

## Steps Covered
1. Loading sequence data
2. Calculating genetic distances
3. Identifying unique haplotypes
4. Constructing a haplotype network
5. Analyzing network statistics
6. Visualizing the network

In [None]:
# Import required modules
from pypopart.io import load_alignment
from pypopart.core.distance import DistanceCalculator
from pypopart.core.condensation import condense_alignment
from pypopart.algorithms import MJNAlgorithm
from pypopart.stats import NetworkStatistics
from pypopart.visualization import StaticVisualizer, InteractiveVisualizer

## 1. Load Sequence Alignment

PyPopART supports multiple sequence file formats including FASTA, NEXUS, PHYLIP, and GenBank.

In [None]:
# Load alignment from file
alignment = load_alignment('../data/examples/sample.fasta')

print(f"Loaded {len(alignment)} sequences")
print(f"Alignment length: {alignment.length} bp")

In [None]:
# Display alignment statistics
stats = alignment.calculate_stats()

print("\nAlignment Statistics:")
print(f"  Number of sequences: {stats.num_sequences}")
print(f"  Alignment length: {stats.length} bp")
print(f"  Variable sites: {stats.variable_sites}")
print(f"  Parsimony informative sites: {stats.parsimony_informative_sites}")
print(f"  GC content: {stats.gc_content:.2f}%")
print(f"  Gap percentage: {stats.gap_percentage:.2f}%")
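
To make two of the statistics above concrete: a site is variable when the sequences show more than one state there, and parsimony informative when at least two states each occur in at least two sequences. The sketch below counts both on a toy alignment. The function name `classify_sites` and the choice to ignore `-` and `N` are assumptions made for illustration, not PyPopART's implementation.

```python
from collections import Counter


def classify_sites(sequences):
    """Count variable and parsimony-informative columns in an alignment.

    Gaps ('-') and 'N' are ignored when counting states
    (an assumption of this sketch).
    """
    variable = 0
    informative = 0
    for column in zip(*sequences):
        counts = Counter(base for base in column if base not in "-N")
        if len(counts) > 1:
            variable += 1
            # Informative: at least two states, each seen in two or more sequences
            if sum(1 for c in counts.values() if c >= 2) >= 2:
                informative += 1
    return variable, informative


toy_alignment = ["ACGTA", "ACGTC", "ATGTC", "ATGAA"]
n_variable, n_informative = classify_sites(toy_alignment)
print(f"Variable sites: {n_variable}")
print(f"Parsimony informative sites: {n_informative}")
```

In the toy alignment, the fourth column varies but is not informative, because the state `A` appears in only one sequence.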

## 2. Calculate Genetic Distances

PyPopART provides several distance metrics:
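- **Hamming**: Raw count of differing sites, with no correction
- **Jukes-Cantor (JC)**: Corrects for multiple substitutions at the same site, assuming all substitutions are equally likely
- **Kimura 2-parameter (K2P)**: Allows different rates for transitions and transversions
- **Tamura-Nei**: Most complex model; accounts for unequal base frequencies and separate rates for the two transition types (purine and pyrimidine) and for transversions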
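
As a rough illustration of how the first three metrics differ, the sketch below computes them for one pair of sequences using the standard textbook formulas. The helper names (`count_differences`, `hamming`, `jukes_cantor`, `kimura_2p`) are hypothetical and independent of `DistanceCalculator`. The sketch skips sites with gaps or ambiguity codes and does not handle saturated pairs, where the logarithm is undefined.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(seq1, seq2):
    """Return (transitions, transversions, compared_sites) for two aligned sequences."""
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        # A transition stays within purines or within pyrimidines
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions, compared


def hamming(seq1, seq2):
    """Raw number of differing sites."""
    ts, tv, _ = count_differences(seq1, seq2)
    return ts + tv


def jukes_cantor(seq1, seq2):
    """JC69 distance: d = -3/4 * ln(1 - 4p/3), with p the proportion of differing sites."""
    ts, tv, n = count_differences(seq1, seq2)
    p = (ts + tv) / n
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(seq1, seq2):
    """K2P distance: d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    ts, tv, n = count_differences(seq1, seq2)
    P, Q = ts / n, tv / n  # proportions of transitions and transversions
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


seq_a = "ACGTACGTAC"
seq_b = "GCGTACGTCC"  # one transition (A->G) and one transversion (A->C)
print(f"Hamming: {hamming(seq_a, seq_b)}")
print(f"JC:      {jukes_cantor(seq_a, seq_b):.4f}")
print(f"K2P:     {kimura_2p(seq_a, seq_b):.4f}")
```

Both corrected distances come out slightly larger than the raw proportion of differences (0.2 here), because they estimate substitutions hidden by repeated changes at the same site.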

In [None]:
# Calculate distance matrix using Kimura 2-parameter
calculator = DistanceCalculator(method='k2p')
dist_matrix = calculator.calculate_matrix(alignment)

print(f"Distance matrix shape: {dist_matrix.shape}")
print("\nFirst 5x5 distances:")
print(dist_matrix[:5, :5])

## 3. Identify Unique Haplotypes

Collapse identical sequences into unique haplotypes and record how many sequences share each one.
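
The idea can be sketched in a few lines of plain Python: group sample IDs by their sequence string, so each key is a haplotype and the length of its list is the frequency. The function `collapse_identical` and the toy data are hypothetical. The sketch uses exact string matching after upper-casing, so gaps and ambiguity codes count as differences, which may not match how `condense_alignment` treats them.

```python
from collections import defaultdict


def collapse_identical(sequences):
    """Group sample IDs by sequence; returns {sequence: [sample_ids]} in first-seen order."""
    groups = defaultdict(list)
    for sample_id, seq in sequences.items():
        groups[seq.upper()].append(sample_id)
    return dict(groups)


toy_samples = {"s1": "ACGT", "s2": "ACGT", "s3": "ACGA", "s4": "acgt"}
toy_haplotypes = collapse_identical(toy_samples)

for i, (seq, members) in enumerate(toy_haplotypes.items(), start=1):
    print(f"H{i}: {seq} (frequency {len(members)}: {', '.join(members)})")
```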

In [None]:
# Identify unique haplotypes
haplotypes, freq_map = condense_alignment(alignment)

print(f"Found {len(haplotypes)} unique haplotypes from {len(alignment)} sequences")
print("\nHaplotype frequencies:")
for hap in haplotypes:
    print(f"  {hap.id}: {hap.frequency} sequences")

## 4. Construct Haplotype Network

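We'll use the Median-Joining Network algorithm, which connects the observed haplotypes and, where needed, adds inferred unsampled haplotypes (median vectors) that can be interpreted as missing ancestors.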
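
To show what a median vector is, the sketch below computes one for a triplet of aligned sequences as the per-site majority state. This covers only that single building block: the full algorithm also builds minimum spanning networks and applies the `epsilon` tolerance, and none of this is taken from `MJNAlgorithm`. The function names are hypothetical, and returning `None` when a site shows three different states is a simplification of this sketch.

```python
from collections import Counter


def median_vector(seq_a, seq_b, seq_c):
    """Per-site majority consensus of three aligned sequences.

    Returns None if any site has three different states, because the
    median is then not unique (a simplification made by this sketch).
    """
    median = []
    for column in zip(seq_a, seq_b, seq_c):
        state, count = Counter(column).most_common(1)[0]
        if count < 2:
            return None
        median.append(state)
    return "".join(median)


def site_differences(x, y):
    """Number of sites at which two aligned sequences differ."""
    return sum(a != b for a, b in zip(x, y))


triplet = ("AACC", "ACAC", "CAAC")
median = median_vector(*triplet)
print(f"Median vector: {median}")
for seq in triplet:
    print(f"  {seq} -> median: {site_differences(seq, median)} step(s)")
```

In this toy triplet every pair of observed sequences differs at two sites, while the median sits one step from each of them, so routing all three through the inferred node gives a shorter total network.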

In [None]:
# Construct Median-Joining Network
mjn = MJNAlgorithm(epsilon=0)  # epsilon=0 is the strictest tolerance and typically gives the sparsest network
network = mjn.construct_network(haplotypes, dist_matrix)

print("Network constructed:")
print(f"  Nodes: {network.number_of_nodes()}")
print(f"  Edges: {network.number_of_edges()}")

# Count median vectors
n_medians = len([n for n in network.nodes() if network.nodes[n].get('is_median', False)])
print(f"  Median vectors (inferred ancestors): {n_medians}")

## 5. Analyze Network Statistics

Calculate various network metrics to understand the structure.

In [None]:
# Calculate network statistics
net_stats = NetworkStatistics(network)
summary = net_stats.summary()

print("Network Statistics:")
for key, value in summary.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")

## 6. Visualize the Network

### Static Visualization (Matplotlib)

In [None]:
# Create static visualization
viz = StaticVisualizer(network)
viz.plot(
    layout_algorithm='spring',
    figsize=(10, 8),
    show_labels=True,
    title='Haplotype Network (Median-Joining)',
    output_file='network_static.png'
)
print("Static plot saved to network_static.png")

### Interactive Visualization (Plotly)

In [None]:
# Create interactive visualization
viz_interactive = InteractiveVisualizer(network)
fig = viz_interactive.plot(
    layout_algorithm='spring',
    width=800,
    height=600,
    show_labels=True,
    title='Interactive Haplotype Network'
)

# Display in notebook
fig.show()

# Optionally, also save a standalone HTML file
fig.write_html('network_interactive.html')
print("Interactive plot saved to network_interactive.html")

## Summary

This notebook demonstrated the complete workflow for:
1. ✅ Loading sequence alignments
2. ✅ Calculating genetic distances
3. ✅ Identifying unique haplotypes
4. ✅ Constructing a Median-Joining Network
5. ✅ Analyzing network statistics
6. ✅ Creating static and interactive visualizations

## Next Steps

- Try different network algorithms (MST, MSN, TCS)
- Experiment with different distance metrics
- Add population metadata and color nodes by population
- Perform population genetics analyses (Tajima's D, FST, etc.)
- Compare networks from different algorithms

See other notebooks in this series for more advanced topics!