# Bioinformatics Toolkit Overview

## P-adic Geometry for Computational Biology

**Project:** Ternary VAE Bioinformatics Platform  
**Organization:** AI Whisperers  
**License:** PolyForm Noncommercial 1.0.0

---

### Executive Summary

This toolkit provides **p-adic (hyperbolic) geometric methods** for three key bioinformatics applications:

| Application | Partner | Notebook |
|-------------|---------|----------|
| Antimicrobial Peptide Design | Carlos Brizuela (CICESE) | `brizuela_amp_navigator.ipynb` |
| Protein Rotamer Scoring | Luca Colbes | `colbes_scoring_function.ipynb` |
| Arbovirus Surveillance | Alejandra Rojas (IICS-UNA) | `rojas_serotype_forecast.ipynb` |

### Mathematical Foundation

All tools use the **Poincaré ball model** of hyperbolic space:

- **Radial position** encodes hierarchical relationships (p-adic valuation)
- **Angular position** captures similarity/divergence
- **Geodesic distances** grow rapidly toward the boundary, which suits tree-like biological structure better than Euclidean distances
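
To make the radial/angular split concrete, here is a minimal sketch that decomposes a 2D point of the unit disk into its Euclidean radius, hyperbolic radius, and angle. The helper name `poincare_polar` is hypothetical and not part of the shared toolkit; it only illustrates the geometry described above.

```python
from __future__ import annotations

import numpy as np


def poincare_polar(point: np.ndarray) -> tuple[float, float, float]:
    """Split a 2D point of the open unit disk into radial and angular parts.

    Returns (euclidean_radius, hyperbolic_radius, angle_in_radians).
    """
    x, y = float(point[0]), float(point[1])
    r = float(np.hypot(x, y))
    if r >= 1.0:
        raise ValueError("point must lie strictly inside the unit disk")
    # Hyperbolic distance from the origin: d(0, x) = 2 * artanh(||x||)
    return r, float(2.0 * np.arctanh(r)), float(np.arctan2(y, x))


# Two points on the same ray: equal angle, different depth in the hierarchy
inner = poincare_polar(np.array([0.5, 0.0]))
outer = poincare_polar(np.array([0.9, 0.0]))
print(f"Euclidean gap:  {outer[0] - inner[0]:.3f}")
print(f"Hyperbolic gap: {outer[1] - inner[1]:.3f}")
```

The Euclidean gap between the two points is 0.4, while the hyperbolic gap is about 1.85: the same Euclidean step covers more hyperbolic distance the closer it lies to the boundary.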

In [None]:
# Environment setup
from __future__ import annotations

import sys
import warnings
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Circle
import seaborn as sns

warnings.filterwarnings('ignore')

# Project paths (assumes the notebook runs two directory levels below the project root)
project_root = Path.cwd().parents[1]
deliverables_path = project_root / "deliverables"
sys.path.insert(0, str(deliverables_path))
sys.path.insert(0, str(project_root))

print(f"Project: {project_root.name}")
print(f"Python: {sys.version.split()[0]}")

In [None]:
# Load shared toolkit
from shared import (
    # Core utilities
    compute_peptide_properties,
    compute_ml_features,
    validate_sequence,
    
    # Prediction modules
    HemolysisPredictor,
    PrimerDesigner,
)

print("\nShared Toolkit Components:")
print("=" * 50)
print("  Core Utilities:")
print("    - compute_peptide_properties: Biophysical analysis")
print("    - compute_ml_features: ML feature generation (25-dim)")
print("    - validate_sequence: Input validation")
print("\n  Prediction Modules:")
print("    - HemolysisPredictor: HC50 and therapeutic index")
print("    - PrimerDesigner: Codon-optimized primers")

---

## 1. Shared Infrastructure Demo

Demonstrate the common utilities available to all partner notebooks.

In [None]:
# Example: Analyze a peptide sequence
test_peptide = "GIGKFLHSAKKFGKAFVGEIMNS"  # Magainin 2

# Validate
is_valid, message = validate_sequence(test_peptide)
print(f"Sequence: {test_peptide}")
print(f"Valid: {is_valid}")
if not is_valid:
    print(f"Error: {message}")

# Compute properties
props = compute_peptide_properties(test_peptide)
print(f"\nBiophysical Properties:")
print(f"  Length: {props['length']} aa")
print(f"  Net charge: {props['net_charge']:+.1f}")
print(f"  Hydrophobicity: {props['hydrophobicity']:.3f}")
print(f"  Hydrophobic ratio: {props['hydrophobic_ratio']:.1%}")
print(f"  Cationic ratio: {props['cationic_ratio']:.1%}")

In [None]:
# ML Features
features = compute_ml_features(test_peptide)
print(f"ML Feature Vector: {len(features)} dimensions")
print(f"  First 5: {features[:5].round(3)}")
print(f"  Sum (sanity): {features.sum():.3f}")

In [None]:
# Hemolysis prediction
predictor = HemolysisPredictor()
result = predictor.predict(test_peptide)

print(f"\nHemolysis Prediction:")
print(f"  Predicted HC50: {result['hc50_predicted']:.1f} uM")
print(f"  Hemolytic probability: {result['hemolytic_probability']:.1%}")
print(f"  Risk category: {result['risk_category']}")

# Therapeutic index
ti_result = predictor.compute_therapeutic_index(test_peptide, mic_value=10.0)
print(f"\nTherapeutic Index (MIC=10 uM):")
print(f"  TI = HC50/MIC = {ti_result['therapeutic_index']:.1f}")
print(f"  Interpretation: {ti_result['interpretation']}")

In [None]:
# Primer design
designer = PrimerDesigner()
primers = designer.design_for_peptide(
    test_peptide[:15],  # First 15 AA
    codon_optimization='ecoli',
    add_start_codon=True,
    add_stop_codon=True
)

print(f"\nPrimer Design (E. coli optimized):")
print(f"  Forward: 5'-{primers.forward}-3'")
print(f"    Tm: {primers.forward_tm:.1f}C, GC: {primers.forward_gc:.1f}%")
print(f"  Reverse: 5'-{primers.reverse}-3'")
print(f"    Tm: {primers.reverse_tm:.1f}C, GC: {primers.reverse_gc:.1f}%")
print(f"  Product size: {primers.product_size} bp")

---

## 2. Partner Research Overview

### 2.1 Brizuela: AMP Navigator

**Goal:** Design antimicrobial peptides optimized for WHO priority pathogens

**Key Features:**
- WHO priority pathogen targeting
- NSGA-II multi-objective optimization
- Therapeutic index calculation
- Hyperbolic embedding visualization
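
The two quantitative ideas in this list can be sketched briefly: the therapeutic index (TI = HC50/MIC, as used by the shared toolkit) and the Pareto-dominance test that NSGA-II's non-dominated sorting is built on. This is not NSGA-II itself and not the AMP Navigator's code; the function names and candidate values below are hypothetical placeholders.

```python
from __future__ import annotations

import numpy as np


def therapeutic_index(hc50_um: float, mic_um: float) -> float:
    """TI = HC50 / MIC; higher means a wider safety margin."""
    if mic_um <= 0:
        raise ValueError("MIC must be positive")
    return hc50_um / mic_um


def pareto_front(objectives: np.ndarray) -> np.ndarray:
    """Return a boolean mask of non-dominated rows (all objectives minimized)."""
    n = objectives.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        # A row dominates row i if it is no worse in every objective
        # and strictly better in at least one.
        no_worse = np.all(objectives <= objectives[i], axis=1)
        better = np.any(objectives < objectives[i], axis=1)
        if np.any(no_worse & better):
            mask[i] = False
    return mask


# Hypothetical candidates: columns are (MIC in uM, -HC50 in uM).
# HC50 is negated so that both columns are minimized.
candidates = np.array([
    [2.0, -200.0],
    [4.0, -150.0],  # dominated by the first row: higher MIC and lower HC50
    [1.0, -50.0],
    [8.0, -400.0],
])
front = pareto_front(candidates)
print(front)
print(therapeutic_index(200.0, 2.0))
```

A full NSGA-II run would repeat this dominance test to rank the whole population into successive fronts and add crowding-distance selection; the sketch only shows the first front.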

In [None]:
# AMP Navigator demo - Analyze reference peptides
reference_amps = {
    'Magainin 2': 'GIGKFLHSAKKFGKAFVGEIMNS',
    'Melittin': 'GIGAVLKVLTTGLPALISWIKRKRQQ',
    'LL-37': 'LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES',
    'Indolicidin': 'ILPWKWPWWPWRR',
}

amp_analysis = []
for name, seq in reference_amps.items():
    props = compute_peptide_properties(seq)
    hemo = predictor.predict(seq)
    
    amp_analysis.append({
        'Peptide': name,
        'Length': props['length'],
        'Charge': props['net_charge'],
        'Hydrophobicity': props['hydrophobicity'],
        'HC50 (uM)': hemo['hc50_predicted'],
        'Risk': hemo['risk_category'],
    })

amp_df = pd.DataFrame(amp_analysis)
print("Reference AMP Analysis:")
print(amp_df.to_string(index=False))

In [None]:
# Visualize AMP properties
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Charge vs Hydrophobicity
ax = axes[0]
risk_colors = {'Low': 'green', 'Moderate': 'orange', 'High': 'red'}
for _, row in amp_df.iterrows():
    # Fall back to gray if the predictor returns a category not listed above
    point_color = risk_colors.get(row['Risk'], 'gray')
    ax.scatter(row['Charge'], row['Hydrophobicity'],
               c=point_color, s=200, edgecolor='black')
    ax.annotate(row['Peptide'], (row['Charge'], row['Hydrophobicity']),
                xytext=(5, 5), textcoords='offset points', fontsize=9)
ax.set_xlabel('Net Charge', fontsize=11)
ax.set_ylabel('Hydrophobicity', fontsize=11)
ax.set_title('Charge vs Hydrophobicity', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3)

# HC50 comparison
ax = axes[1]
bars = ax.barh(amp_df['Peptide'], amp_df['HC50 (uM)'],
               color=[risk_colors.get(r, 'gray') for r in amp_df['Risk']], edgecolor='black')
ax.axvline(100, color='green', linestyle='--', alpha=0.5, label='>100 uM (safe)')
ax.set_xlabel('Predicted HC50 (uM)', fontsize=11)
ax.set_title('Hemolytic Activity', fontsize=12, fontweight='bold')
ax.legend(loc='lower right')

# Length distribution
ax = axes[2]
ax.bar(amp_df['Peptide'], amp_df['Length'], color='steelblue', edgecolor='black')
ax.axhline(25, color='orange', linestyle='--', alpha=0.5, label='Typical AMP length')
ax.set_ylabel('Length (aa)', fontsize=11)
ax.set_title('Peptide Lengths', fontsize=12, fontweight='bold')
ax.legend()
plt.xticks(rotation=45, ha='right')

plt.tight_layout()
plt.show()

### 2.2 Colbes: Rotamer Scoring

**Goal:** Detect "Rosetta-blind" residues using p-adic geometric scoring

**Key Features:**
- Dunbrack rotamer library integration
- Hyperbolic distance metrics
- Discordance scoring for instability detection
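
As a rough illustration of a hyperbolic distance metric on side-chain angles, the sketch below maps a (chi1, chi2) pair into the unit disk and measures the Poincaré distance between two conformations. The mapping (`chi_to_disk`, with a scale chosen so the corner (180, 180) lands at radius 0.9) and the example angles are assumptions made for this sketch; they are not taken from the Colbes notebook, and the sketch does not implement its discordance score.

```python
from __future__ import annotations

import numpy as np


def chi_to_disk(chi1: float, chi2: float, max_radius: float = 0.9) -> np.ndarray:
    """Map (chi1, chi2) in degrees, each in [-180, 180], into the open unit disk."""
    # Dividing by sqrt(2) keeps the corners of the angle square inside the disk
    scale = max_radius / (180.0 * np.sqrt(2.0))
    return np.array([chi1, chi2], dtype=float) * scale


def poincare_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Geodesic distance between two points of the Poincare ball."""
    sq_diff = float(np.sum((u - v) ** 2))
    denom = (1.0 - float(np.sum(u ** 2))) * (1.0 - float(np.sum(v ** 2)))
    return float(np.arccosh(1.0 + 2.0 * sq_diff / denom))


# Hypothetical observed conformation vs. a hypothetical reference rotamer
observed = chi_to_disk(-65.0, 170.0)
reference = chi_to_disk(-60.0, 180.0)
d = poincare_distance(observed, reference)
print(f"Hyperbolic distance to reference: {d:.3f}")
```

Note that this linear mapping ignores the periodicity of dihedral angles (-180 and +180 degrees are the same conformation), so it is only suitable as a visual aid.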

In [None]:
# Rotamer scoring demo - Dunbrack library centroids
DUNBRACK_CENTROIDS = {
    'gauche-': (-60, -60),
    'gauche+': (-60, 60),
    'trans': (180, 60),
    'anti': (180, 180),
}

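def compute_hyperbolic_distance(chi_angles: np.ndarray, max_angle: float = 180.0) -> float:
    """Map chi angles into the Poincare ball and return the distance from the origin."""
    chi = np.asarray(chi_angles, dtype=float)[:2]
    # Scale by 0.9 / sqrt(n) so that even the corner (max_angle, max_angle)
    # lands at radius 0.9, strictly inside the unit disk
    coords = chi / max_angle * 0.9 / np.sqrt(len(chi))

    # Guard against r -> 1, where arctanh diverges
    r = min(float(np.linalg.norm(coords)), 0.99)

    # Distance from the origin in the Poincare ball: d(0, x) = 2 * artanh(||x||)
    return float(2 * np.arctanh(r))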

# Analyze rotamer positions
print("Dunbrack Rotamer Centroids:")
print("=" * 50)
for name, (chi1, chi2) in DUNBRACK_CENTROIDS.items():
    h_dist = compute_hyperbolic_distance(np.array([chi1, chi2]))
    print(f"  {name:12s}: chi1={chi1:+4d}, chi2={chi2:+4d}, h_dist={h_dist:.3f}")

In [None]:
# Visualize rotamer space in Poincare ball
fig, ax = plt.subplots(figsize=(10, 10))

# Draw Poincare ball
circle = Circle((0, 0), 1, fill=False, color='black', linewidth=2)
ax.add_patch(circle)
for r in [0.3, 0.5, 0.7, 0.9]:
    c = Circle((0, 0), r, fill=False, color='gray', linewidth=0.5, linestyle='--', alpha=0.5)
    ax.add_patch(c)

# Plot rotamer centroids
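rotamer_colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']
scale = 0.9 / (180 * np.sqrt(2))  # keeps the corner (180, 180) at radius 0.9, inside the disk
for (name, (chi1, chi2)), color in zip(DUNBRACK_CENTROIDS.items(), rotamer_colors):
    x = chi1 * scale
    y = chi2 * scale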
    ax.scatter(x, y, c=color, s=300, edgecolor='black', linewidth=2, label=name, zorder=5)
    ax.annotate(name, (x, y), xytext=(10, 10), textcoords='offset points',
                fontsize=11, fontweight='bold')

# Add origin
ax.scatter(0, 0, c='black', s=100, marker='+', linewidth=2, zorder=10)

ax.set_xlim(-1.2, 1.2)
ax.set_ylim(-1.2, 1.2)
ax.set_aspect('equal')
ax.set_title('Dunbrack Rotamer Centroids in Poincare Ball\n(chi1, chi2) mapped to hyperbolic space',
             fontsize=13, fontweight='bold')
ax.legend(loc='upper right')
ax.axis('off')

plt.tight_layout()
plt.show()

### 2.3 Rojas: Serotype Forecaster

**Goal:** Track arbovirus evolution and forecast serotype dominance

**Key Features:**
- Hyperbolic trajectory tracking
- Momentum-based forecasting
- Multi-component risk assessment
- RT-PCR primer design
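
One simple reading of momentum-based forecasting is to smooth recent displacements in the disk and extrapolate one step ahead. The sketch below does this with an exponentially weighted average; the function name `forecast_next`, the smoothing factor `beta`, and the radius cap are all assumptions for illustration, and the steps are taken in Euclidean disk coordinates rather than along hyperbolic geodesics. It is not the forecaster from the Rojas notebook.

```python
from __future__ import annotations

import numpy as np


def forecast_next(positions: np.ndarray, beta: float = 0.7,
                  max_radius: float = 0.95) -> np.ndarray:
    """Extrapolate the next disk position from a (T, 2) history, T >= 2."""
    positions = np.asarray(positions, dtype=float)
    if positions.ndim != 2 or positions.shape[0] < 2:
        raise ValueError("need at least two positions of equal dimension")

    # Year-over-year displacements
    steps = np.diff(positions, axis=0)

    # Exponentially weighted average of displacements (the "momentum")
    momentum = np.zeros(positions.shape[1])
    for step in steps:
        momentum = beta * momentum + (1.0 - beta) * step

    nxt = positions[-1] + momentum

    # Keep the forecast inside the ball by capping its radius
    radius = float(np.linalg.norm(nxt))
    if radius > max_radius:
        nxt = nxt * (max_radius / radius)
    return nxt


# Hypothetical three-year history moving outward along the x-axis
history = np.array([[0.3, 0.0], [0.4, 0.0], [0.5, 0.0]])
print(forecast_next(history))
```

With `beta=0.0` the momentum reduces to the most recent displacement, so the forecast is a plain linear extrapolation; larger values of `beta` weight older displacements more heavily.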

In [None]:
# Serotype forecaster demo - Generate synthetic trajectories (random walk, not surveillance data)
np.random.seed(42)

serotypes = ['DENV-1', 'DENV-2', 'DENV-3', 'DENV-4']
colors = {'DENV-1': '#1f77b4', 'DENV-2': '#ff7f0e', 
          'DENV-3': '#2ca02c', 'DENV-4': '#d62728'}

# Simulate 5 years of evolution
years = range(2020, 2025)
trajectories = {}

for i, sero in enumerate(serotypes):
    angle = i * np.pi / 2
    positions = []
    r = 0.3
    
    for year in years:
        r = min(0.95, r + np.random.rand() * 0.1)
        angle += np.random.randn() * 0.1
        x = r * np.cos(angle)
        y = r * np.sin(angle)
        positions.append({'year': year, 'x': x, 'y': y, 'radius': r})
    
    trajectories[sero] = pd.DataFrame(positions)

print("Simulated Serotype Trajectories (2020-2024):")
for sero, df in trajectories.items():
    start_r = df['radius'].iloc[0]
    end_r = df['radius'].iloc[-1]
    print(f"  {sero}: radius {start_r:.3f} -> {end_r:.3f} (delta: {end_r-start_r:+.3f})")

In [None]:
# Visualize serotype trajectories
fig, ax = plt.subplots(figsize=(10, 10))

# Draw Poincare ball
circle = Circle((0, 0), 1, fill=False, color='black', linewidth=2)
ax.add_patch(circle)
for r in [0.3, 0.5, 0.7, 0.9]:
    c = Circle((0, 0), r, fill=False, color='gray', linewidth=0.5, linestyle='--', alpha=0.5)
    ax.add_patch(c)

# Plot trajectories
for sero, df in trajectories.items():
    color = colors[sero]
    ax.plot(df['x'], df['y'], '-', color=color, alpha=0.5, linewidth=2)
    ax.scatter(df['x'], df['y'], c=color, s=100, alpha=0.7, edgecolor='white')
    ax.scatter(df['x'].iloc[-1], df['y'].iloc[-1], c=color, s=300, 
               marker='*', edgecolor='black', linewidth=1, label=sero, zorder=5)

ax.scatter(0, 0, c='black', s=100, marker='+', linewidth=2, zorder=10)
ax.annotate('Origin\n(Ancestral)', (0.02, -0.1), fontsize=9, alpha=0.7)

ax.set_xlim(-1.2, 1.2)
ax.set_ylim(-1.2, 1.2)
ax.set_aspect('equal')
ax.set_title('Simulated Dengue Serotype Evolution (2020-2024)\nStars = Current Position',
             fontsize=13, fontweight='bold')
ax.legend(loc='upper left', fontsize=11)
ax.axis('off')

plt.tight_layout()
plt.show()

---

## 3. Cross-Application Integration

Demonstrate how the shared toolkit enables cross-domain analysis.

In [None]:
# Cross-application: Design primers for an AMP candidate
print("Cross-Application Example: AMP Design to Synthesis")
print("=" * 60)

# 1. Select AMP candidate
candidate = "KWKLFKKIEKVGQNIRDGIIKAGPAVAVVGQATQIAK"  # Cecropin A
print(f"\n1. Candidate: Cecropin A")
print(f"   Sequence: {candidate}")

# 2. Analyze properties
props = compute_peptide_properties(candidate)
print(f"\n2. Biophysical Analysis:")
print(f"   Length: {props['length']} aa")
print(f"   Charge: {props['net_charge']:+.0f}")
print(f"   Hydrophobicity: {props['hydrophobicity']:.3f}")

# 3. Safety assessment
hemo = predictor.predict(candidate)
print(f"\n3. Safety Assessment:")
print(f"   HC50: {hemo['hc50_predicted']:.1f} uM")
print(f"   Risk: {hemo['risk_category']}")

# 4. Therapeutic index
ti = predictor.compute_therapeutic_index(candidate, mic_value=2.0)
print(f"\n4. Therapeutic Index (MIC=2 uM):")
print(f"   TI = {ti['therapeutic_index']:.1f}")
print(f"   {ti['interpretation']}")

# 5. Design primers
primers = designer.design_for_peptide(
    candidate[:20],  # First 20 AA
    codon_optimization='ecoli',
    add_start_codon=True,
    add_stop_codon=False
)
print(f"\n5. Synthesis Primers (E. coli):")
print(f"   Forward: {primers.forward[:30]}... (Tm={primers.forward_tm:.1f}C)")
print(f"   Reverse: {primers.reverse[:30]}... (Tm={primers.reverse_tm:.1f}C)")

---

## 4. Architecture Overview

### Directory Structure

```
deliverables/
├── shared/                      # Common utilities
│   ├── __init__.py             # Public API exports
│   ├── hemolysis_predictor.py  # HC50 prediction
│   ├── peptide_analysis.py     # Biophysical properties
│   ├── primer_designer.py      # Codon optimization
│   └── codon_encoder.py        # P-adic embeddings
│
├── carlos_brizuela/             # AMP Navigator
│   └── scripts/
│       └── B1_pathogen_specific_design.py
│
├── scripts/                     # CLI tools
│   └── biotools.py             # Command-line interface
│
└── tutorials/                   # Learning resources
    ├── 01_getting_started.ipynb
    └── 02_activity_prediction.ipynb
```

### Key Classes

| Class | Module | Description |
|-------|--------|-------------|
| `HemolysisPredictor` | `shared.hemolysis_predictor` | ML-based HC50 prediction |
| `PrimerDesigner` | `shared.primer_designer` | Codon-optimized primer design |
| `CodonEncoder` | `shared.codon_encoder` | P-adic sequence embeddings |

### Mathematical Foundation

**Poincare Ball Model:**
- Manifold: $\mathcal{B}^n = \{x \in \mathbb{R}^n : \|x\| < 1\}$
- Metric: $ds^2 = \frac{4\,\|dx\|^2}{(1-\|x\|^2)^2}$
- Distance: $d(u,v) = \operatorname{arccosh}\left(1 + 2\frac{\|u-v\|^2}{(1-\|u\|^2)(1-\|v\|^2)}\right)$
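
As a consistency check on the distance formula, setting $u = 0$ and writing $r = \|v\|$ recovers the radial distance that the rotamer demo computes with `2 * np.arctanh(r)`:

$$
d(0, v) = \operatorname{arccosh}\left(1 + \frac{2r^2}{1 - r^2}\right) = \operatorname{arccosh}\left(\frac{1 + r^2}{1 - r^2}\right)
$$

Substituting $r = \tanh t$ and using $\cosh^2 t - \sinh^2 t = 1$ and $\cosh^2 t + \sinh^2 t = \cosh 2t$:

$$
\frac{1 + \tanh^2 t}{1 - \tanh^2 t} = \cosh 2t
\quad\Longrightarrow\quad
d(0, v) = 2t = 2\operatorname{artanh}(r)
$$

For example, a point at Euclidean radius $r = 0.9$ lies at hyperbolic distance $2\operatorname{artanh}(0.9) = \ln 19 \approx 2.94$ from the origin.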

In [None]:
# Show available toolkit components
print("\nToolkit Component Summary:")
print("=" * 60)

components = [
    ('compute_peptide_properties', 'Biophysical analysis', 'Core'),
    ('compute_ml_features', 'ML feature generation', 'Core'),
    ('validate_sequence', 'Input validation', 'Core'),
    ('HemolysisPredictor', 'HC50 prediction', 'Prediction'),
    ('PrimerDesigner', 'Primer design', 'Synthesis'),
]

df = pd.DataFrame(components, columns=['Component', 'Description', 'Category'])
print(df.to_string(index=False))

---

## 5. Getting Started

### Installation

```bash
# Clone repository
git clone https://github.com/Ai-Whisperers/ternary-vaes-bioinformatics.git
cd ternary-vaes-bioinformatics

# Install dependencies
pip install -r requirements.txt

# Run tests
pytest deliverables/shared/tests/
```

### Quick Start

```python
from shared import (
    compute_peptide_properties,
    HemolysisPredictor,
    PrimerDesigner
)

# Analyze peptide
props = compute_peptide_properties("KLWKKWKKWLK")

# Predict hemolysis
predictor = HemolysisPredictor()
result = predictor.predict("KLWKKWKKWLK")

# Design primers
designer = PrimerDesigner()
primers = designer.design_for_peptide("KLWKKWKKWLK")
```

### Partner Notebooks

Explore the specialized notebooks:

1. **[AMP Navigator](brizuela_amp_navigator.ipynb)** - Antimicrobial peptide design
2. **[Rotamer Scoring](colbes_scoring_function.ipynb)** - Protein side-chain analysis
3. **[Serotype Forecaster](rojas_serotype_forecast.ipynb)** - Arbovirus surveillance

---

## Summary

This bioinformatics toolkit provides:

| Feature | Description |
|---------|-------------|
| **Shared Infrastructure** | Common utilities for sequence analysis |
| **P-adic Geometry** | Hyperbolic embeddings for biological data |
| **Partner Tools** | Specialized applications for AMP design, rotamer scoring, and arbovirus surveillance |
| **Extensible Design** | Modular architecture for new applications |

### Key Innovations

1. **Hyperbolic Embeddings**: Tree-like biological relationships naturally represented
2. **Unified Toolkit**: Common API across diverse applications
3. **Safety-First**: Hemolysis prediction integrated into design workflows
4. **Production-Ready**: Primer design for immediate laboratory use

### Contact

- **Organization:** AI Whisperers
- **Repository:** https://github.com/Ai-Whisperers/ternary-vaes-bioinformatics
- **License:** PolyForm Noncommercial 1.0.0