# Hyperbolic AMP Navigator

## Multi-Objective Antimicrobial Peptide Design with P-adic Embeddings

**Partner:** Carlos Brizuela  
**Objective:** Navigate hyperbolic latent space to design AMPs with high activity and low toxicity

---

### Research Overview

This notebook demonstrates a complete pipeline for antimicrobial peptide (AMP) design using:

1. **P-adic Hyperbolic Embeddings**: Encode peptides in a hyperbolic manifold where geometry reflects biochemical properties
2. **Multi-Objective Optimization (NSGA-II)**: Simultaneously optimize activity, toxicity, and stability
3. **Pathogen-Specific Design**: Target WHO priority pathogens with tailored peptide features
4. **Hemolysis Prediction**: ML-based toxicity assessment using curated DBAASP/HemoPI data
5. **Synthesis Optimization**: Design primers for peptide expression
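
For readers new to the term in item 1: the p-adic valuation of a nonzero integer `n` is the exponent of the largest power of a prime `p` that divides `n`, and the p-adic norm is `p ** -valuation`. The sketch below illustrates only these two standard definitions. How the notebook's embedding assigns integers to peptides is not shown here, so the function names and the choice `p = 3` are assumptions for illustration.

```python
def padic_valuation(n: int, p: int) -> int:
    """Return v_p(n): the largest k such that p**k divides n (n must be nonzero)."""
    if n == 0:
        raise ValueError("v_p(0) is infinite by convention")
    if p < 2:
        raise ValueError("p must be an integer >= 2 (a prime in practice)")
    n = abs(n)
    k = 0
    while n % p == 0:
        n //= p
        k += 1
    return k


def padic_norm(n: int, p: int) -> float:
    """Return |n|_p = p**(-v_p(n)); integers divisible by high powers of p are 'small'."""
    return 0.0 if n == 0 else float(p) ** (-padic_valuation(n, p))


# 54 = 2 * 3**3, so v_3(54) = 3 and |54|_3 = 1/27
print(padic_valuation(54, 3), padic_norm(54, 3))
```

Under this norm, two integers are close when their difference is divisible by a high power of `p`, which is what gives p-adic spaces their tree-like, hierarchical structure.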

### Key Innovations

- **Therapeutic Index Optimization**: Balance antimicrobial activity against host cell toxicity
- **Uncertainty Quantification**: Confidence intervals for all predictions
- **WHO Priority Pathogen Targeting**: Optimized for carbapenem-resistant bacteria

---

In [None]:
# Standard imports
import sys
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Circle
import seaborn as sns

# Add project paths
project_root = Path.cwd().parents[1]
deliverables_dir = project_root / "deliverables"
sys.path.insert(0, str(project_root))
sys.path.insert(0, str(deliverables_dir))

# Shared infrastructure imports
from shared import (
    compute_peptide_properties,
    compute_ml_features,
    compute_amino_acid_composition,
    compute_physicochemical_descriptors,
    validate_sequence,
    decode_latent_to_sequence,
    HemolysisPredictor,
    PrimerDesigner,
    get_logger,
    setup_logging,
)

# Setup logging
setup_logging(level="INFO", use_colors=False)
logger = get_logger("amp_navigator")

print(f"Project root: {project_root}")
print("Shared infrastructure loaded successfully!")

## 1. WHO Priority Pathogens

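We target pathogens from the WHO priority list for antimicrobial resistance: three critical-priority Gram-negative groups and high-priority MRSA. Each pathogen has a distinct membrane composition, so each gets its own target ranges for AMP charge, hydrophobicity, and length.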
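
As a rough illustration of how a peptide can be checked against such a target range, the sketch below estimates net charge by counting residues and tests it against two charge ranges. It is an assumption-laden simplification: `estimate_net_charge` and `in_range` are hypothetical helpers, histidine and the termini are ignored, and the project's `compute_peptide_properties` may compute charge differently.

```python
def estimate_net_charge(sequence: str) -> int:
    """Approximate net charge near neutral pH: Lys/Arg count +1, Asp/Glu count -1."""
    positive = sum(sequence.count(aa) for aa in "KR")
    negative = sum(sequence.count(aa) for aa in "DE")
    return positive - negative


def in_range(value: float, bounds: tuple) -> bool:
    """Return True when value lies inside the inclusive (low, high) range."""
    low, high = bounds
    return low <= value <= high


magainin2 = "GIGKFLHSAKKFGKAFVGEIMNS"
charge = estimate_net_charge(magainin2)  # 4 Lys, 0 Arg, 0 Asp, 1 Glu
print(f"Estimated charge: {charge:+d}")
print("Within (3, 7):", in_range(charge, (3, 7)))
print("Within (4, 8):", in_range(charge, (4, 8)))
```

With this counting rule Magainin 2 comes out at +3, inside the (3, 7) range but just below (4, 8), which shows how the same peptide can match one pathogen profile and miss another.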

In [None]:
# WHO Priority Pathogens with membrane characteristics
WHO_PRIORITY_PATHOGENS = {
    "A_baumannii": {
        "full_name": "Acinetobacter baumannii",
        "gram": "negative",
        "priority": "CRITICAL",
        "resistance": "Carbapenem-resistant",
        "optimal_charge": (4, 8),
        "optimal_hydrophobicity": (0.3, 0.5),
        "optimal_length": (15, 30),
    },
    "P_aeruginosa": {
        "full_name": "Pseudomonas aeruginosa",
        "gram": "negative",
        "priority": "CRITICAL",
        "resistance": "Multidrug-resistant",
        "optimal_charge": (5, 9),
        "optimal_hydrophobicity": (0.35, 0.55),
        "optimal_length": (18, 35),
    },
    "S_aureus": {
        "full_name": "Staphylococcus aureus (MRSA)",
        "gram": "positive",
        "priority": "HIGH",
        "resistance": "Methicillin-resistant",
        "optimal_charge": (2, 6),
        "optimal_hydrophobicity": (0.4, 0.6),
        "optimal_length": (10, 22),
    },
    "Enterobacteriaceae": {
        "full_name": "Enterobacteriaceae (E. coli, Klebsiella)",
        "gram": "negative",
        "priority": "CRITICAL",
        "resistance": "Carbapenem-resistant",
        "optimal_charge": (3, 7),
        "optimal_hydrophobicity": (0.25, 0.45),
        "optimal_length": (12, 25),
    },
}

# Display as table
pathogen_df = pd.DataFrame([
    {
        "Pathogen": v["full_name"],
        "Priority": v["priority"],
        "Gram": v["gram"],
        "Resistance": v["resistance"],
        "Optimal Charge": f"{v['optimal_charge'][0]}-{v['optimal_charge'][1]}",
        "Optimal Length": f"{v['optimal_length'][0]}-{v['optimal_length'][1]}",
    }
    for v in WHO_PRIORITY_PATHOGENS.values()
])

print("\nWHO Priority Pathogens for AMP Design:")
print("=" * 80)
print(pathogen_df.to_string(index=False))

## 2. Reference AMP Database

Define a curated set of antimicrobial peptides from the literature, each with an E. coli MIC value, for benchmarking.

In [None]:
# Curated reference AMPs with known activity
REFERENCE_AMPS = {
    # Natural AMPs
    "Magainin 2": {"sequence": "GIGKFLHSAKKFGKAFVGEIMNS", "mic_ecoli": 10.0, "source": "Frog skin"},
    "Melittin": {"sequence": "GIGAVLKVLTTGLPALISWIKRKRQQ", "mic_ecoli": 1.0, "source": "Bee venom"},
    "LL-37": {"sequence": "LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES", "mic_ecoli": 5.0, "source": "Human"},
    "Indolicidin": {"sequence": "ILPWKWPWWPWRR", "mic_ecoli": 8.0, "source": "Bovine"},
    "Cecropin A": {"sequence": "KWKLFKKIEKVGQNIRDGIIKAGPAVAVVGQATQIAK", "mic_ecoli": 2.0, "source": "Insect"},
    "Defensin HNP-1": {"sequence": "ACYCRIPACIAGERRYGTCIYQGRLWAFCC", "mic_ecoli": 15.0, "source": "Human"},
    "Cathelicidin BF": {"sequence": "KRFKKFFKKLKNSVKKRAKKFFKKPRVIGVSIPF", "mic_ecoli": 4.0, "source": "Snake"},
    
    # Synthetic/Designed AMPs
    "Pexiganan": {"sequence": "GIGKFLKKAKKFGKAFVKILKK", "mic_ecoli": 4.0, "source": "Designed"},
    "Omiganan": {"sequence": "ILRWPWWPWRRK", "mic_ecoli": 8.0, "source": "Designed"},
    "WLBU2": {"sequence": "RRWVRRVRRWVRRVVRVVRRWVRR", "mic_ecoli": 4.0, "source": "Designed"},
}

# Compute properties for all reference AMPs
amp_data = []
for name, info in REFERENCE_AMPS.items():
    props = compute_peptide_properties(info["sequence"])
    amp_data.append({
        "Name": name,
        "Sequence": info["sequence"][:25] + "..." if len(info["sequence"]) > 25 else info["sequence"],
        "Length": props["length"],
        "Charge": props["net_charge"],
        "Hydrophobicity": round(props["hydrophobicity"], 2),
        "Hydrophobic%": round(props["hydrophobic_ratio"] * 100, 1),
        "MIC (uM)": info["mic_ecoli"],
        "Source": info["source"],
    })

amp_df = pd.DataFrame(amp_data)
print("\nReference Antimicrobial Peptides:")
print("=" * 100)
print(amp_df.to_string(index=False))

## 3. Hemolysis Prediction (Toxicity Assessment)

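Use our trained ML model to predict hemolytic activity as HC50, the peptide concentration that lyses 50% of red blood cells, and compute the therapeutic index TI = HC50 / MIC for each reference peptide.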
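
The therapeutic index is a simple ratio, sketched below as a worked example. The thresholds mirror the reference lines drawn in this section's plot (TI > 10 and TI > 5). The function names, the "poor" label, and the example HC50 and MIC values are hypothetical and do not come from `HemolysisPredictor`.

```python
def therapeutic_index(hc50_um: float, mic_um: float) -> float:
    """TI = HC50 / MIC, with both concentrations in the same units (here uM)."""
    if mic_um <= 0:
        raise ValueError("MIC must be positive")
    return hc50_um / mic_um


def classify_ti(ti: float) -> str:
    """Bucket a therapeutic index using the thresholds shown in the plot."""
    if ti > 10:
        return "excellent"
    if ti > 5:
        return "good"
    return "poor"  # label assumed for illustration


# Hypothetical peptide: lyses red blood cells at 120 uM, inhibits bacteria at 4 uM
ti = therapeutic_index(hc50_um=120.0, mic_um=4.0)
print(ti, classify_ti(ti))  # 30.0 excellent
```

A peptide with the same MIC but an HC50 of 20 uM would have TI = 5 and fall into the lowest bucket, since the "good" threshold is strict.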

In [None]:
# Initialize hemolysis predictor
hemo_predictor = HemolysisPredictor()

# Predict hemolysis for all reference AMPs
toxicity_data = []
for name, info in REFERENCE_AMPS.items():
    seq = info["sequence"]
    mic = info["mic_ecoli"]
    
    # Predict hemolysis
    hemo = hemo_predictor.predict(seq)
    
    # Compute therapeutic index
    ti = hemo_predictor.compute_therapeutic_index(seq, mic)
    
    toxicity_data.append({
        "Name": name,
        "MIC (uM)": mic,
        "HC50 (uM)": round(hemo["hc50_predicted"], 1),
        "Risk": hemo["risk_category"],
        "TI": round(ti["therapeutic_index"], 1),
        "Interpretation": ti["interpretation"],
    })

toxicity_df = pd.DataFrame(toxicity_data).sort_values("TI", ascending=False)

print("\nTherapeutic Index Analysis (Higher = Safer):")
print("=" * 90)
print("TI = HC50 / MIC  (Higher TI means better selectivity for bacteria over host cells)")
print()
print(toxicity_df.to_string(index=False))

In [None]:
# Visualize therapeutic index
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Left: TI comparison
ax = axes[0]
colors = ['green' if ti > 10 else 'orange' if ti > 5 else 'red' 
          for ti in toxicity_df["TI"]]
bars = ax.barh(toxicity_df["Name"], toxicity_df["TI"], color=colors, alpha=0.7)
ax.axvline(10, color='green', linestyle='--', alpha=0.7, label='Excellent (TI>10)')
ax.axvline(5, color='orange', linestyle='--', alpha=0.7, label='Good (TI>5)')
ax.set_xlabel('Therapeutic Index (HC50/MIC)', fontsize=12)
ax.set_title('Therapeutic Index Comparison', fontsize=14)
ax.legend(loc='lower right')

# Right: Activity vs Toxicity scatter
ax = axes[1]
for name, info in REFERENCE_AMPS.items():
    hemo = hemo_predictor.predict(info["sequence"])
    ax.scatter(info["mic_ecoli"], hemo["hc50_predicted"], s=100, alpha=0.7)
    ax.annotate(name, (info["mic_ecoli"], hemo["hc50_predicted"]), fontsize=8)

# Add TI reference lines
x = np.linspace(1, 20, 100)
ax.plot(x, x * 10, 'g--', alpha=0.5, label='TI=10')
ax.plot(x, x * 5, 'orange', linestyle='--', alpha=0.5, label='TI=5')
ax.plot(x, x * 2, 'r--', alpha=0.5, label='TI=2')

ax.set_xlabel('MIC (uM) - Lower is more active', fontsize=12)
ax.set_ylabel('HC50 (uM) - Higher is safer', fontsize=12)
ax.set_title('Activity vs Toxicity Trade-off', fontsize=14)
ax.legend(loc='upper left')
ax.set_xlim(0, 20)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 4. Hyperbolic Embedding Space

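Visualize peptides in the Poincare disk. The radial position is intended to encode p-adic valuation (structural complexity). The cells below approximate the embedding by projecting the ML feature vectors to 2D with PCA and rescaling them into the unit disk.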
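
To make the geometry concrete, the sketch below computes the standard Poincare-ball distance, d(u, v) = arcosh(1 + 2|u - v|^2 / ((1 - |u|^2)(1 - |v|^2))). The function name `poincare_distance` is an assumption for illustration and is not part of the shared module.

```python
import math


def poincare_distance(u, v) -> float:
    """Hyperbolic distance between two points strictly inside the unit ball."""
    norm_u_sq = sum(x * x for x in u)
    norm_v_sq = sum(x * x for x in v)
    if norm_u_sq >= 1.0 or norm_v_sq >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff_sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * diff_sq / ((1.0 - norm_u_sq) * (1.0 - norm_v_sq)))


# Equal Euclidean steps toward the boundary cover growing hyperbolic distances
origin = (0.0, 0.0)
for r in (0.3, 0.6, 0.9):
    print(f"r = {r:.1f}: distance from origin = {poincare_distance(origin, (r, 0.0)):.3f}")
```

The distance from the origin to a point at Euclidean radius r equals 2 artanh(r), so it diverges as r approaches 1. This is why points near the boundary of the disk are far apart even when they look close on the plot.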

In [None]:
from sklearn.decomposition import PCA

# Compute embeddings for all AMPs
embeddings = []
names = []
mics = []

for name, info in REFERENCE_AMPS.items():
    # Use ML features as embedding
    features = compute_ml_features(info["sequence"])
    embeddings.append(features)
    names.append(name)
    mics.append(info["mic_ecoli"])

embeddings = np.array(embeddings)
mics = np.array(mics)

# Reduce to 2D
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

# Rescale into the open unit disk (a Euclidean rescaling of the PCA
# coordinates, not a hyperbolic exponential map)
norms = np.linalg.norm(embeddings_2d, axis=1)
max_norm = norms.max()
embeddings_poincare = embeddings_2d / (max_norm * 1.1)  # Largest norm becomes 1/1.1

print(f"PCA Variance explained: {pca.explained_variance_ratio_.sum():.1%}")

In [None]:
# Visualize in Poincare ball
fig, ax = plt.subplots(figsize=(12, 12))

# Draw Poincare ball boundary
circle = Circle((0, 0), 1, fill=False, color='black', linewidth=2)
ax.add_patch(circle)

# Draw concentric reference circles (constant Euclidean radius from the origin;
# these are hyperbolic circles, not geodesics)
for r in [0.3, 0.5, 0.7, 0.9]:
    ring = Circle((0, 0), r, fill=False, color='gray', linestyle='--', alpha=0.3)
    ax.add_patch(ring)

# Color by activity (log MIC)
colors = np.log10(mics)
scatter = ax.scatter(
    embeddings_poincare[:, 0], 
    embeddings_poincare[:, 1],
    c=colors, cmap='RdYlGn_r', s=200, edgecolor='black', alpha=0.8,
    vmin=0, vmax=1.5
)

# Add labels
for i, name in enumerate(names):
    ax.annotate(name, embeddings_poincare[i], fontsize=9, ha='left', va='bottom')

ax.set_xlim(-1.2, 1.2)
ax.set_ylim(-1.2, 1.2)
ax.set_aspect('equal')
ax.set_xlabel('Hyperbolic X', fontsize=12)
ax.set_ylabel('Hyperbolic Y', fontsize=12)
ax.set_title('AMPs in Poincare Ball Hyperbolic Space\n(Color = log10 MIC, Green = More Active)', fontsize=14)
plt.colorbar(scatter, label='log10(MIC)', shrink=0.6)

plt.tight_layout()
plt.show()

## 5. NSGA-II Multi-Objective Optimization

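Optimize peptides in latent space against three objectives, each implemented as a score to minimize. The full `LatentNSGA2` optimizer is imported when available; the run below uses a simplified genetic algorithm that ranks individuals by the sum of their objectives: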
1. **Activity**: Maximize antimicrobial potency
2. **Toxicity**: Minimize hemolytic activity
3. **Stability**: Maximize peptide stability
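
For reference, NSGA-II ranks candidates by Pareto dominance and breaks ties with a crowding distance instead of summing the objectives. The sketch below shows those two ingredients for minimization in plain Python. It is an illustrative sketch under stated assumptions, not the implementation in `latent_nsga2`; the function names and the example scores are hypothetical.

```python
def dominates(a, b) -> bool:
    """a dominates b if it is no worse in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated_indices(points):
    """Indices of points that no other point dominates (the first Pareto front)."""
    return [
        i for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


def crowding_distance(points):
    """Per-point crowding distance; boundary points in any objective get infinity."""
    n = len(points)
    if n == 0:
        return []
    distance = [0.0] * n
    for m in range(len(points[0])):
        order = sorted(range(n), key=lambda i: points[i][m])
        low, high = points[order[0]][m], points[order[-1]][m]
        distance[order[0]] = distance[order[-1]] = float("inf")
        if high == low:
            continue  # all values equal: this objective adds no spread
        for k in range(1, n - 1):
            gap = points[order[k + 1]][m] - points[order[k - 1]][m]
            distance[order[k]] += gap / (high - low)
    return distance


# Hypothetical (activity, toxicity) scores, both minimized
scores = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (2.5, 2.5)]
front = non_dominated_indices(scores)
print("Front indices:", front)  # (3.0, 4.0) is dominated by (2.0, 3.0)
print("Crowding:", crowding_distance([scores[i] for i in front]))
```

Selecting by front membership and then by larger crowding distance keeps a spread of trade-offs, whereas ranking by the summed objectives tends to collapse the population toward a single compromise.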

In [None]:
# Import NSGA-II optimizer and define required classes
from dataclasses import dataclass
from typing import Optional

# Always define these classes (needed for optimization)
@dataclass
class Individual:
    latent: np.ndarray
    objectives: np.ndarray
    rank: int = 0
    crowding_distance: float = 0.0
    decoded_sequence: Optional[str] = None

@dataclass
class OptimizationConfig:
    latent_dim: int = 16
    population_size: int = 100
    generations: int = 50
    crossover_prob: float = 0.9
    mutation_prob: float = 0.1
    mutation_sigma: float = 0.1
    latent_bounds: tuple = (-3.0, 3.0)
    seed: int = 42

# Try to import advanced NSGA-II (optional)
try:
    from carlos_brizuela.scripts.latent_nsga2 import (
        LatentNSGA2, create_mock_objectives
    )
    NSGA2_AVAILABLE = True
    print("NSGA-II optimizer loaded successfully")
except ImportError:
    NSGA2_AVAILABLE = False
    print("Using simplified NSGA-II implementation")

In [None]:
# Define objective functions
def activity_objective(z: np.ndarray) -> float:
    """Lower = better activity."""
    sequence = decode_latent_to_sequence(z, length=20, use_vae=False)
    props = compute_peptide_properties(sequence)
    
    # Target: optimal AMP properties
    score = 0.0
    
    # Charge should be 4-8 for gram-negative
    charge = props["net_charge"]
    if charge < 4:
        score += (4 - charge) ** 2
    elif charge > 8:
        score += (charge - 8) ** 2
    
    # Hydrophobicity balance
    hydro = props["hydrophobicity"]
    if hydro < 0.3:
        score += (0.3 - hydro) ** 2 * 10
    elif hydro > 0.5:
        score += (hydro - 0.5) ** 2 * 10
    
    return score

def toxicity_objective(z: np.ndarray) -> float:
    """Lower = safer."""
    sequence = decode_latent_to_sequence(z, length=20, use_vae=False)
    hemo = hemo_predictor.predict(sequence)
    
    # Invert HC50 (higher HC50 = safer, so lower score = better)
    return -np.log10(max(hemo["hc50_predicted"], 1))

def stability_objective(z: np.ndarray) -> float:
    """Lower = more stable."""
    sequence = decode_latent_to_sequence(z, length=20, use_vae=False)
    
    instability = 0.0
    
    # Penalize extreme latent values
    instability += np.sum(z**2) / len(z) * 0.1
    
    # Penalize rare, oxidation-prone residues (Cys, Met, Trp)
    rare_count = sum(1 for aa in sequence if aa in "CMW")
    instability += rare_count * 0.1
    
    return instability

objectives = [activity_objective, toxicity_objective, stability_objective]
print(f"Defined {len(objectives)} objectives: Activity, Toxicity, Stability")

In [None]:
# Run optimization
np.random.seed(42)

config = OptimizationConfig(
    latent_dim=16,
    population_size=50,
    generations=30,
    seed=42
)

print("\nOptimization Configuration:")
print(f"  Population: {config.population_size}")
print(f"  Generations: {config.generations}")
print(f"  Latent dim: {config.latent_dim}")

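# Simplified genetic algorithm: scalarizes the objectives by summing them.
# Unlike full NSGA-II, it has no non-dominated sorting, crossover, or crowding distance.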
population = []
for _ in range(config.population_size):
    z = np.random.uniform(-2, 2, config.latent_dim)
    obj = np.array([f(z) for f in objectives])
    population.append(Individual(latent=z, objectives=obj))

# Evolution
for gen in range(config.generations):
    # Create offspring
    offspring = []
    for _ in range(config.population_size):
        # Binary tournament selection on the summed objectives
        i, j = np.random.choice(len(population), 2, replace=False)
        parent = population[i] if np.sum(population[i].objectives) < np.sum(population[j].objectives) else population[j]
        
        # Gaussian mutation, clipped to the configured latent bounds
        child_z = parent.latent + np.random.normal(0, 0.2, config.latent_dim)
        child_z = np.clip(child_z, *config.latent_bounds)
        child_obj = np.array([f(child_z) for f in objectives])
        offspring.append(Individual(latent=child_z, objectives=child_obj))
    
    # Combine and select best
    combined = population + offspring
    combined.sort(key=lambda x: np.sum(x.objectives))
    population = combined[:config.population_size]
    
    if gen % 10 == 0:
        best = population[0]
        print(f"Gen {gen:3d}: Best objectives = {best.objectives.round(3)}")

# Approximate the Pareto front with the top 20% by summed objectives
# (a scalarized shortlist, not a true non-dominated set)
pareto_front = population[:int(config.population_size * 0.2)]
print(f"\nPareto front size: {len(pareto_front)}")

In [None]:
# Decode and analyze Pareto-optimal peptides
pareto_peptides = []
for i, ind in enumerate(pareto_front):
    seq = decode_latent_to_sequence(ind.latent, length=20, use_vae=False)
    props = compute_peptide_properties(seq)
    hemo = hemo_predictor.predict(seq)
    
    pareto_peptides.append({
        "Rank": i + 1,
        "Sequence": seq,
        "Length": props["length"],
        "Charge": round(props["net_charge"], 1),
        "Hydro": round(props["hydrophobicity"], 2),
        "HC50": round(hemo["hc50_predicted"], 1),
        "Risk": hemo["risk_category"],
        "Activity Score": round(ind.objectives[0], 3),
        "Toxicity Score": round(ind.objectives[1], 3),
    })

pareto_df = pd.DataFrame(pareto_peptides)
print("\nPareto-Optimal Peptide Candidates:")
print("=" * 100)
print(pareto_df.to_string(index=False))

## 6. Primer Design for Synthesis

Design PCR primers for expressing the top peptide candidates in E. coli.
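
The internals of `PrimerDesigner` are not shown in this notebook, so the sketch below only illustrates the general idea: back-translate a peptide with one commonly used E. coli codon per residue, add start and stop codons, and estimate primer GC content and melting temperature with the Wallace rule, Tm = 2(A+T) + 4(G+C). The codon table, the helper names, and the 18-nt primer length are assumptions for illustration.

```python
# One frequently used E. coli codon per amino acid (illustrative choice)
ECOLI_CODONS = {
    "A": "GCG", "R": "CGT", "N": "AAC", "D": "GAT", "C": "TGC",
    "Q": "CAG", "E": "GAA", "G": "GGC", "H": "CAT", "I": "ATT",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCG",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAT", "V": "GTG",
}


def back_translate(peptide: str, add_start_codon: bool = True, add_stop_codon: bool = True) -> str:
    """Return a coding DNA sequence; raises KeyError on non-standard residues."""
    dna = "".join(ECOLI_CODONS[aa] for aa in peptide.upper())
    if add_start_codon:
        dna = "ATG" + dna
    if add_stop_codon:
        dna += "TAA"
    return dna


def reverse_complement(dna: str) -> str:
    """Reverse complement of an uppercase ACGT string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def gc_percent(dna: str) -> float:
    """Percentage of G and C bases."""
    return 100.0 * (dna.count("G") + dna.count("C")) / len(dna)


def wallace_tm(primer: str) -> float:
    """Wallace rule estimate in Celsius; only a rough guide for short oligos."""
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2.0 * at + 4.0 * gc


coding = back_translate("GIGKFLKK")
forward = coding[:18]                      # primer length assumed for illustration
reverse = reverse_complement(coding)[:18]
for label, primer in (("Forward", forward), ("Reverse", reverse)):
    print(f"{label}: 5'-{primer}-3'  GC={gc_percent(primer):.1f}%  Tm~{wallace_tm(primer):.0f}C")
```

Real primer design also checks hairpins, primer dimers, and restriction sites, and typically uses a nearest-neighbor Tm model, so treat these numbers as a first-pass estimate only.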

In [None]:
# Initialize primer designer
primer_designer = PrimerDesigner()

# Design primers for top 3 candidates
print("\nPrimer Design for Top Candidates:")
print("=" * 80)

for _, row in pareto_df.head(3).iterrows():
    seq = row["Sequence"]
    
    # Design primers
    primers = primer_designer.design_for_peptide(
        seq,
        codon_optimization="ecoli",
        add_start_codon=True,
        add_stop_codon=True,
    )
    
    print(f"\n--- Candidate {row['Rank']} ---")
    print(f"Peptide: {seq}")
    print(f"Properties: Charge={row['Charge']}, HC50={row['HC50']} uM ({row['Risk']} risk)")
    print(f"\nForward primer: 5'-{primers.forward}-3'")
    print(f"  Tm: {primers.forward_tm:.1f}C, GC: {primers.forward_gc:.1f}%")
    print(f"Reverse primer: 5'-{primers.reverse}-3'")
    print(f"  Tm: {primers.reverse_tm:.1f}C, GC: {primers.reverse_gc:.1f}%")
    print(f"Expected product: {primers.product_size} bp")

## 7. Pathogen-Specific Optimization

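Score peptides against the target charge and length ranges of each WHO priority pathogen. The dedicated optimization module is imported when available; the simplified scoring below does not depend on it and does not yet use the hydrophobicity ranges.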
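
The target ranges in `WHO_PRIORITY_PATHOGENS` also include hydrophobicity. One way to fold it into a 0-100 score is a shared range-penalty helper, sketched below. The charge and length weights (5 and 2) match the scoring cell in this section; the hydrophobicity weight, the helper names, and the example property values are assumptions for illustration.

```python
def range_penalty(value: float, bounds: tuple, weight: float) -> float:
    """Linear penalty for falling outside the inclusive (low, high) range."""
    low, high = bounds
    if value < low:
        return (low - value) * weight
    if value > high:
        return (value - high) * weight
    return 0.0


def score_against_target(props: dict, target: dict, hydro_weight: float = 100.0) -> float:
    """Start from 100 and subtract penalties for charge, length, and hydrophobicity."""
    penalty = (
        range_penalty(props["net_charge"], target["optimal_charge"], 5.0)
        + range_penalty(props["length"], target["optimal_length"], 2.0)
        + range_penalty(props["hydrophobicity"], target["optimal_hydrophobicity"], hydro_weight)
    )
    return max(0.0, 100.0 - penalty)


# Target ranges copied from the A. baumannii entry; peptide properties are hypothetical
target = {
    "optimal_charge": (4, 8),
    "optimal_hydrophobicity": (0.3, 0.5),
    "optimal_length": (15, 30),
}
props = {"net_charge": 3, "length": 23, "hydrophobicity": 0.35}
print(score_against_target(props, target))  # 95.0: one charge unit below range
```

Because hydrophobicity lives on a 0-1 scale, its weight has to be much larger than the charge and length weights to have a comparable effect, so the value chosen here would need tuning against real data.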

In [None]:
# Import pathogen-specific design module
try:
    from carlos_brizuela.scripts.B1_pathogen_specific_design import (
        run_pathogen_optimization,
        WHO_PRIORITY_PATHOGENS as PATHOGENS_INFO,
    )
    PATHOGEN_OPT_AVAILABLE = True
    print("Pathogen-specific optimization module loaded")
except ImportError:
    PATHOGEN_OPT_AVAILABLE = False
    print("Using simplified pathogen scoring")

# Score peptides for each pathogen
def score_for_pathogen(sequence: str, pathogen: str) -> float:
    """Score peptide fitness (0-100) from how well charge and length match the pathogen's target ranges."""
    if pathogen not in WHO_PRIORITY_PATHOGENS:
        return 0.0
    
    info = WHO_PRIORITY_PATHOGENS[pathogen]
    props = compute_peptide_properties(sequence)
    
    score = 100.0  # Start with perfect score
    
    # Charge penalty
    charge = props["net_charge"]
    opt_min, opt_max = info["optimal_charge"]
    if charge < opt_min:
        score -= (opt_min - charge) * 5
    elif charge > opt_max:
        score -= (charge - opt_max) * 5
    
    # Length penalty
    length = props["length"]
    len_min, len_max = info["optimal_length"]
    if length < len_min:
        score -= (len_min - length) * 2
    elif length > len_max:
        score -= (length - len_max) * 2
    
    return max(0, score)

In [None]:
# Score reference AMPs against each pathogen
pathogen_scores = {}
for pathogen in WHO_PRIORITY_PATHOGENS:
    pathogen_scores[pathogen] = {}
    for name, info in REFERENCE_AMPS.items():
        score = score_for_pathogen(info["sequence"], pathogen)
        pathogen_scores[pathogen][name] = score

# Convert to DataFrame
score_df = pd.DataFrame(pathogen_scores)
score_df.index.name = "Peptide"

print("\nPeptide Fitness Scores by Target Pathogen:")
print("=" * 80)
print("(Score 0-100, higher = better match for pathogen)")
print()
print(score_df.round(1).to_string())

In [None]:
# Heatmap of pathogen-peptide fitness
fig, ax = plt.subplots(figsize=(12, 8))

sns.heatmap(
    score_df.T, 
    annot=True, 
    fmt=".0f", 
    cmap="RdYlGn",
    vmin=0, vmax=100,
    ax=ax
)

ax.set_title('Peptide Fitness for WHO Priority Pathogens', fontsize=14)
ax.set_xlabel('Peptide', fontsize=12)
ax.set_ylabel('Pathogen', fontsize=12)

plt.tight_layout()
plt.show()

# Best peptide per pathogen
print("\nBest Peptide for Each Pathogen:")
for pathogen in score_df.columns:
    best = score_df[pathogen].idxmax()
    score = score_df[pathogen].max()
    print(f"  {pathogen}: {best} (score={score:.0f})")

## 8. Export Results

Save optimized peptide candidates for experimental validation.
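
Before handing files to collaborators, it can be worth reading the FASTA output back to confirm that every record has a header and a sequence. The sketch below is a minimal parser for that check. `read_fasta` is a hypothetical helper, and the example record only mimics the header pattern used in the export cell.

```python
import io


def read_fasta(handle) -> dict:
    """Parse FASTA text from a file-like object into {header: sequence}."""
    records = {}
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequences may span several lines
    return records


# Hypothetical record in the same header style as the export cell
example = io.StringIO(">Candidate_01_Charge5.0_HC50_120.0\nKKLFKKILKYL\n")
for name, seq in read_fasta(example).items():
    print(name, len(seq), seq)
```

The same function accepts an open file, for example `with open(fasta_path) as f: read_fasta(f)`, once the export cell has run.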

In [None]:
# Export results
output_dir = project_root / 'results' / 'amp_optimization'
output_dir.mkdir(parents=True, exist_ok=True)

# Save Pareto candidates
pareto_df.to_csv(output_dir / 'pareto_candidates.csv', index=False)

# Save pathogen scores
score_df.to_csv(output_dir / 'pathogen_fitness_scores.csv')

# Save FASTA
fasta_path = output_dir / 'optimized_peptides.fasta'
with open(fasta_path, 'w') as f:
    for _, row in pareto_df.iterrows():
        f.write(f">Candidate_{row['Rank']:02d}_Charge{row['Charge']}_HC50_{row['HC50']}\n")
        f.write(f"{row['Sequence']}\n")

print(f"\nResults saved to {output_dir}")
print("  - pareto_candidates.csv")
print("  - pathogen_fitness_scores.csv")
print("  - optimized_peptides.fasta")

## Summary

This notebook demonstrated a complete AMP design pipeline:

### Key Components

1. **WHO Priority Pathogen Targeting**: Target ranges for carbapenem-resistant A. baumannii and Enterobacteriaceae, multidrug-resistant P. aeruginosa, and MRSA

2. **Multi-Objective Optimization**: An NSGA-II-style search (run here as a simplified genetic algorithm over summed objectives) balances activity, toxicity, and stability

3. **Therapeutic Index Analysis**: ML-based hemolysis prediction (HC50) enables safety assessment

4. **Hyperbolic Embeddings**: Peptide features are projected into the Poincare disk, where p-adic geometry is intended to capture structural relationships

5. **Synthesis Ready**: Primer design for E. coli expression

### Outputs

- Pareto-optimal peptide candidates with properties
- Pathogen-specific fitness scores
- PCR primers for synthesis
- FASTA files for downstream analysis

### Next Steps

1. Validate top candidates with MIC assays
2. Confirm HC50 predictions with hemolysis assays
3. Test cross-pathogen activity
4. Synthesize and characterize lead peptides

---

**Reference:** Brizuela, C. et al. "Hyperbolic Multi-Objective Optimization for Antimicrobial Peptide Design." *In preparation.*