# Example: End-To-End *De Novo* Protein Design Pipeline

## Overview

This notebook demonstrates an end-to-end protein design workflow using three deep learning networks from the Institute for Protein Design:

| Step | Model | Purpose |
|------|-------|---------|
| 1. **Backbone Generation** | RFD3 | Generate novel protein backbones via diffusion |
| 2. **Sequence Design** | MPNN | Design amino acid sequences for the generated backbone |
| 3. **Structure Validation** | RF3 | Predict the structure from designed sequence to validate designability |

All three models are built on [AtomWorks](https://github.com/RosettaCommons/atomworks) for both inference and training, and they exchange structures as Biotite `AtomArray` objects.

This notebook assumes the base checkpoints are already downloaded (`foundry install rfd3 ligandmpnn rf3`); alternatively, you can specify checkpoint paths directly. To register your Foundry virtual environment as a Jupyter kernel, run: `python -m ipykernel install --user --name=foundry --display-name "foundry"`.

### Pipeline Flow

```
RFD3 (backbone) → MPNN (sequence) → RF3 (validation) → RMSD comparison
```

---

## Section 0: Installation

Install the Foundry package (includes RFD3, MPNN, and RF3):

```bash
pip install 'rc-foundry[all]'
```

Download the model weights (~6 GB total; this takes a couple of minutes):

```bash
foundry install rfd3 ligandmpnn rf3
```
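
Before running the notebook, it can help to confirm that the packages are importable from the active kernel. The check below is a minimal sketch using only the standard library. The module names are taken from the import statements used later in this notebook, and `check_modules` is a hypothetical helper, not part of Foundry.

```python
from importlib.util import find_spec


def check_modules(names):
    """Return {module_name: bool} indicating whether each top-level module can be found."""
    return {name: find_spec(name) is not None for name in names}


# Top-level modules imported in the cells of this notebook
status = check_modules(["atomworks", "rfd3", "mpnn", "rf3"])
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

If any module is reported as missing, the kernel is probably not the environment where `rc-foundry` was installed.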

---

In [1]:
# Shared utilities for visualization (from AtomWorks)
from atomworks.io.utils.visualize import view

Environment variable CCD_MIRROR_PATH not set. Will not be able to use function requiring this variable. To set it you may:
  (1) add the line 'export VAR_NAME=path/to/variable' to your .bashrc or .zshrc file
  (2) set it in your current shell with 'export VAR_NAME=path/to/variable'
  (3) write it to a .env file in the root of the atomworks.io repository
Environment variable PDB_MIRROR_PATH not set. Will not be able to use function requiring this variable. To set it you may:
  (1) add the line 'export VAR_NAME=path/to/variable' to your .bashrc or .zshrc file
  (2) set it in your current shell with 'export VAR_NAME=path/to/variable'
  (3) write it to a .env file in the root of the atomworks.io repository


## Section 1: Backbone Generation with RFD3

RFdiffusion3 (RFD3) generates *de novo* all-atom proteins that meet specific conditioning requirements.

**Parameters Used** *(many more are available for more complex protein design tasks)*:
- `length`: Target protein length in residues
- `diffusion_batch_size`: Number of structures to generate per batch
- `n_batches`: Number of batches to run

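**Outputs:** A dictionary mapping each example ID to a list of `RFD3Output` objects, each of which carries the generated structure as an `atom_array`.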
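
As a rough sketch of how these settings combine: a run yields `diffusion_batch_size × n_batches` structures in total, grouped into lists inside the returned dictionary. The snippet below uses plain strings as stand-ins for `RFD3Output` objects. The key naming (`_0`, `_1`, ...) is an assumption extrapolated from the single `_0` key seen in this notebook, and `flatten_outputs` is a hypothetical helper, not part of RFD3.

```python
def flatten_outputs(outputs):
    """Flatten {key: [structure, ...]} into a list of (key, index, structure) tuples."""
    return [
        (key, i, structure)
        for key, structures in outputs.items()
        for i, structure in enumerate(structures)
    ]


diffusion_batch_size = 2
n_batches = 3

# Stand-in for the engine's return value: strings replace RFD3Output objects,
# and the "_<batch>" key format is assumed for illustration only.
mock_outputs = {
    f"_{b}": [f"structure_{b}_{i}" for i in range(diffusion_batch_size)]
    for b in range(n_batches)
}

flat = flatten_outputs(mock_outputs)
print(f"Total structures: {len(flat)}")
```

Flattening like this makes it straightforward to pass every generated backbone to the sequence design step rather than only the first one.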

In [10]:
from lightning.fabric import seed_everything
from rfd3.engine import RFD3InferenceConfig, RFD3InferenceEngine

# Uncomment to fix the random seed for reproducible outputs
# seed_everything(0)

# Configure RFD3 inference
config = RFD3InferenceConfig(
    specification={
        'length': 50,  # Generate 50-residue proteins
        'extra': {},  # We are not using any extra specifications here.
    },
    diffusion_batch_size=2,  # Generate 2 structures per batch
)


# Set high precision for matrix multiplications
import torch
torch.set_float32_matmul_precision('high')


# Initialize engine and run generation
model = RFD3InferenceEngine(**config)
outputs = model.run(
    inputs=None,      # None for unconditional generation
    out_dir=None,     # None to return in memory (no file output)
    n_batches=1,      # Generate 1 batch
)

Using bfloat16 Automatic Mixed Precision (AMP)
15:01:38 INFO rfd3.engine: [rank: 0] Finished inference batch in 9.58 seconds.


In [11]:
# View generated example IDs (each key maps to a list of generated structures)
outputs.keys()

dict_keys(['_0'])

In [None]:
# Inspect RFD3 outputs and extract the generated backbone
for idx, data in outputs.items():
    print(f"Batch {idx}: {len(data)} structure(s)")
    print(f"  Output type: {type(data[0]).__name__}")
    print(f"  AtomArray: {data[0].atom_array}")

# Extract the first generated backbone for downstream use
first_key = next(iter(outputs.keys()))
atom_array = outputs[first_key][0].atom_array

# Visualize the generated backbone
view(atom_array)


Batch _0: 2 structure(s)
  Output type: RFD3Output
  AtomArray:     A       1  MET N      N       -13.118   -7.468    0.055
    A       1  MET CA     C       -11.901   -7.143    0.782
    A       1  MET C      C       -10.660   -7.577   -0.019
    A       1  MET O      O       -10.596   -8.709   -0.487
    A       1  MET CB     C       -11.884   -7.784    2.157
    A       1  MET CG     C       -12.743   -7.049    3.175
    A       1  MET SD     S       -12.469   -7.670    4.845
    A       1  MET CE     C       -13.261   -6.416    5.795
    A       2  PHE N      N        -9.722   -6.721   -0.176
    A       2  PHE CA     C        -8.478   -6.954   -0.832
    A       2  PHE C      C        -7.365   -7.168    0.182
    A       2  PHE O      O        -7.295   -6.424    1.166
    A       2  PHE CB     C        -8.168   -5.830   -1.814
    A       2  PHE CG     C        -9.167   -5.700   -2.923
    A       2  PHE CD1    C        -9.026   -6.455   -4.081
    A       2  PHE CD2    C       -1

<py3Dmol.view at 0x7445a3257080>

---

## Section 2: Sequence Design with MPNN

ProteinMPNN and LigandMPNN are message passing neural networks (MPNNs) that design amino acid sequences intended to fold into a target backbone structure.

**Model Options:**
- `protein_mpnn`: Original ProteinMPNN for protein-only design
- `ligand_mpnn`: Extended model supporting ligand-aware design

**Key Parameters:**
- `batch_size`: Number of sequences to generate per structure
- `remove_waters`: Whether to exclude water molecules from context
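
When designing sequences for several backbones at once, one configuration dictionary per structure is needed. The sketch below builds such a list using only the two parameters described above. It assumes that the per-input configurations and the structures are paired one-to-one by position, which matches the single-structure call in this notebook but has not been checked against the MPNN engine for multiple inputs. `build_input_configs` is a hypothetical helper.

```python
def build_input_configs(n_structures, batch_size=10, remove_waters=True):
    """Create one independent per-input config dict for each backbone."""
    return [
        {"batch_size": batch_size, "remove_waters": remove_waters}
        for _ in range(n_structures)
    ]


n_structures = 2  # e.g. both backbones from one RFD3 batch
configs = build_input_configs(n_structures, batch_size=10)

# Each backbone receives `batch_size` designed sequences
total_sequences = sum(cfg["batch_size"] for cfg in configs)
print(f"{len(configs)} configs, {total_sequences} sequences in total")
```

Building separate dictionaries (rather than repeating one object) keeps later per-structure tweaks from leaking into the other entries.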

In [13]:
from mpnn.inference_engines.mpnn import MPNNInferenceEngine

# Configure MPNN inference engine
# See mpnn.utils.inference.MPNN_GLOBAL_INFERENCE_DEFAULTS for all options
engine_config = {
    "model_type": "ligand_mpnn",  # or "protein_mpnn" for vanilla ProteinMPNN
    "is_legacy_weights": True,    # Required for now for ligand_mpnn and protein_mpnn
    "out_directory": None,        # Return results in memory
    "write_structures": False,
    "write_fasta": False,
}

# Configure per-input inference options
# See mpnn.utils.inference.MPNN_PER_INPUT_INFERENCE_DEFAULTS for all options
input_configs = [
    {
        "batch_size": 10,         # Generate 10 sequences per structure
        "remove_waters": True,
    }
]

# Run sequence design on the RFD3-generated backbone
model = MPNNInferenceEngine(**engine_config)
mpnn_outputs = model.run(input_dicts=input_configs, atom_arrays=[atom_array])

In [14]:
from biotite.structure import get_residue_starts
from biotite.sequence import ProteinSequence

# Extract and display the designed sequences
print(f"Generated {len(mpnn_outputs)} designed sequences:\n")

for i, item in enumerate(mpnn_outputs):
    res_starts = get_residue_starts(item.atom_array)
    # Convert 3-letter codes to 1-letter using Biotite
    seq_1letter = ''.join(
        ProteinSequence.convert_letter_3to1(res_name)
        for res_name in item.atom_array.res_name[res_starts]
    )
    print(f"Sequence {i+1}: {seq_1letter}")

Generated 10 designed sequences:

Sequence 1: PYRYLHTRTKTVITLPEEPTRESMIKGLQETLKLSREEAEKAIADLVRVE
Sequence 2: VVRLLHRRSQTVIDLPEEPTRESMLAGLQQTLGLSPEEAEAALADLVRVE
Sequence 3: VYKLLNTKTNTVIELPEEPTRESMIRGLQETLGLSEEEAEEAISDLVLVK
Sequence 4: MYKYLHTKTNTVITLEEEPTEESLIKGLQETLKLSEKEAKEAIKDLVLIE
Sequence 5: MYRYLHTKSKVVLELEEEPTEESMIKALQEKLKLSKEEAKKAVKDLVRVE
Sequence 6: MVRLLHTDTDTVIDLPEEPTRELMVKGLQEVLGLSREEAERAIARLVRVE
Sequence 7: MYRYLHTKTNTIIELPEEPTEELMIKGLMETLGLSEEEAKKAIKDLVRVE
Sequence 8: VYRYLHTRTQTTIDLPEEPTEASLIAGLQRALGLSEEEARREVAHLVRVE
Sequence 9: VYRYLHRRTQTVIELPEEPTRESMIRGLQEVLGLSEAEAERAIADLVRVE
Sequence 10: VYRYLHRRSQVVLELPEEPTEESLIRALQETLGLSEEEAREAIADLVLVE
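
To get a feel for how diverse the ten designs are, one option is to compute pairwise sequence identity and a per-position consensus. The following is a standard-library sketch over the sequences printed above; `identity` is a hypothetical helper, and simple position-wise identity is only meaningful here because all designs share the same backbone and length.

```python
from collections import Counter
from itertools import combinations

# The ten designed sequences printed by the cell above
designs = [
    "PYRYLHTRTKTVITLPEEPTRESMIKGLQETLKLSREEAEKAIADLVRVE",
    "VVRLLHRRSQTVIDLPEEPTRESMLAGLQQTLGLSPEEAEAALADLVRVE",
    "VYKLLNTKTNTVIELPEEPTRESMIRGLQETLGLSEEEAEEAISDLVLVK",
    "MYKYLHTKTNTVITLEEEPTEESLIKGLQETLKLSEKEAKEAIKDLVLIE",
    "MYRYLHTKSKVVLELEEEPTEESMIKALQEKLKLSKEEAKKAVKDLVRVE",
    "MVRLLHTDTDTVIDLPEEPTRELMVKGLQEVLGLSREEAERAIARLVRVE",
    "MYRYLHTKTNTIIELPEEPTEELMIKGLMETLGLSEEEAKKAIKDLVRVE",
    "VYRYLHTRTQTTIDLPEEPTEASLIAGLQRALGLSEEEARREVAHLVRVE",
    "VYRYLHRRTQTVIELPEEPTRESMIRGLQEVLGLSEAEAERAIADLVRVE",
    "VYRYLHRRSQVVLELPEEPTEESLIRALQETLGLSEEEAREAIADLVLVE",
]


def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Identity for every unordered pair of designs
pairwise = [identity(a, b) for a, b in combinations(designs, 2)]
mean_identity = sum(pairwise) / len(pairwise)

# Most common residue at each position across all designs
consensus = "".join(Counter(column).most_common(1)[0][0] for column in zip(*designs))

print(f"Mean pairwise identity: {mean_identity:.2f}")
print(f"Consensus: {consensus}")
```

Positions where every design agrees (for example the `EEPT` stretch near residue 17) are ones the model treats as strongly constrained by the backbone.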


---

## Section 3: Structure Prediction with RF3

RF3 (RoseTTAFold 3) predicts protein structures from sequences. By re-folding the MPNN-designed sequence, we can validate whether the design is likely to adopt the intended backbone structure.

**Outputs:** `RF3Output` objects containing:
- `atom_array`: Predicted structure as Biotite AtomArray
- `summary_confidences`: Overall confidence metrics (pLDDT, PAE, pTM, etc.)
- `confidences`: Per-atom/residue confidence scores

**Confidence Metrics:**
| Metric | Description |
|--------|-------------|
| pLDDT | Per-residue confidence (0-1, higher is better) |
| PAE | Predicted Aligned Error (lower is better) |
| pTM | Predicted TM-score |
| ranking_score | Overall model quality score |
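
Because RF3 returns several models per input, these metrics can be used to choose among them. The sketch below picks a model from mock summary dictionaries; the key names match those printed later in this notebook, but the numeric values for the second and third models are invented for illustration, and treating a higher `ranking_score` as better is an assumption. `pick_best` is a hypothetical helper.

```python
# Mock summary_confidences for three models (values are illustrative only)
mock_summaries = [
    {"ranking_score": 0.109, "overall_plddt": 0.778, "has_clash": False},
    {"ranking_score": 0.250, "overall_plddt": 0.801, "has_clash": True},
    {"ranking_score": 0.180, "overall_plddt": 0.790, "has_clash": False},
]


def pick_best(summaries):
    """Return the index of the highest-ranked model, preferring clash-free models."""
    candidates = [(i, s) for i, s in enumerate(summaries) if not s["has_clash"]]
    if not candidates:
        # Fall back to all models if every one of them has a clash
        candidates = list(enumerate(summaries))
    return max(candidates, key=lambda pair: pair[1]["ranking_score"])[0]


best_index = pick_best(mock_summaries)
print(f"Selected model index: {best_index}")
```

Filtering on `has_clash` before ranking is one possible policy; depending on the task, ranking on pLDDT or PAE instead may be more appropriate.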

In [15]:
from rf3.inference_engines.rf3 import RF3InferenceEngine
from rf3.utils.inference import InferenceInput


# Initialize RF3 inference engine
inference_engine = RF3InferenceEngine(ckpt_path='rf3', verbose=False)

# Create the RF3 input from a structure and re-fold its sequence.
# Note: as written, this passes `atom_array`, the RFD3 output from Section 1;
# to validate the first MPNN design instead, pass `mpnn_outputs[0].atom_array`.
input_structure = InferenceInput.from_atom_array(atom_array, example_id="example_protein")
rf3_outputs = inference_engine.run(inputs=input_structure)

# Outputs: dict mapping example_id -> list[RF3Output] (multiple models per input)
print(f"Output keys: {rf3_outputs.keys()}")
print(f"Number of models for 'example_protein': {len(rf3_outputs['example_protein'])}")

15:08:17 INFO rf3.inference_engines.rf3: [rank: 0] Loading checkpoint from /home/hdwang/.foundry/checkpoints/rf3_foundry_01_24_latest_remapped.ckpt...
Using bfloat16 Automatic Mixed Precision (AMP)
15:08:47 INFO rf3.inference_engines.rf3: [rank: 0] Found 1 structures to predict!
15:08:47 INFO rf3.inference_engines.rf3: [rank: 0] Predicting structure 1/1: example_protein


Output keys: dict_keys(['example_protein'])
Number of models for 'example_protein': 5


In [16]:
# Extract the top-ranked prediction
rf3_output = rf3_outputs["example_protein"][0]

# Inspect RF3Output structure
print("RF3Output contains:")
print(f"  - atom_array: {len(rf3_output.atom_array)} atoms")
print(f"  - summary_confidences: {list(rf3_output.summary_confidences.keys())}")
print(f"  - confidences: {list(rf3_output.confidences.keys()) if rf3_output.confidences else None}")

# Visualize the predicted structure
view(rf3_output.atom_array)

RF3Output contains:
  - atom_array: 376 atoms
  - summary_confidences: ['chain_ptm', 'chain_pair_pae_min', 'chain_pair_pde_min', 'chain_pair_pae', 'chain_pair_pde', 'overall_plddt', 'overall_pde', 'overall_pae', 'ptm', 'iptm', 'has_clash', 'ranking_score']
  - confidences: ['atom_chain_ids', 'atom_plddts', 'pae', 'token_chain_ids', 'token_res_ids']


<py3Dmol.view at 0x74462bfde9c0>

In [17]:
# Summary confidences: overall model quality metrics
summary = rf3_output.summary_confidences

print("=== Summary Confidences ===")
print(f"  Overall pLDDT:    {summary['overall_plddt']:.3f}")
print(f"  Overall PAE:      {summary['overall_pae']:.2f} A")
print(f"  Overall PDE:      {summary['overall_pde']:.3f}")
print(f"  pTM:              {summary['ptm']:.3f}")
print(f"  ipTM:             {summary.get('iptm', 'N/A (single chain)')}")
print(f"  Ranking score:    {summary['ranking_score']:.3f}")
print(f"  Has clash:        {summary['has_clash']}")

=== Summary Confidences ===
  Overall pLDDT:    0.778
  Overall PAE:      8.58 A
  Overall PDE:      2.918
  pTM:              0.543
  ipTM:             0.0
  Ranking score:    0.109
  Has clash:        False


In [18]:
# Detailed per-atom/residue confidences
conf = rf3_output.confidences

print("=== Per-Atom/Residue Confidences ===")
print(f"  atom_plddts:      {len(conf['atom_plddts'])} values (one per atom)")
print(f"  atom_chain_ids:   {len(conf['atom_chain_ids'])} values")
print(f"  token_chain_ids:  {len(conf['token_chain_ids'])} values (one per residue)")
print(f"  token_res_ids:    {len(conf['token_res_ids'])} values")
print(f"  PAE matrix:       {len(conf['pae'])}x{len(conf['pae'][0])}")

# Preview first 10 atom pLDDT scores
import numpy as np
print(f"\nFirst 10 atom pLDDTs: {np.round(conf['atom_plddts'][:10], 2).tolist()}")

=== Per-Atom/Residue Confidences ===
  atom_plddts:      376 values (one per atom)
  atom_chain_ids:   376 values
  token_chain_ids:  50 values (one per residue)
  token_res_ids:    50 values
  PAE matrix:       50x50

First 10 atom pLDDTs: [0.69, 0.71, 0.72, 0.69, 0.69, 0.66, 0.67, 0.62, 0.7, 0.71]
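
Per-atom scores like these can be used to flag low-confidence regions. The sketch below works on the ten values previewed above and uses a cutoff of 0.7 on the 0-1 scale; that cutoff is an assumption chosen for illustration, not a threshold defined by RF3.

```python
from statistics import mean

# First ten atom pLDDT values from the output above
atom_plddts = [0.69, 0.71, 0.72, 0.69, 0.69, 0.66, 0.67, 0.62, 0.7, 0.71]

PLDDT_CUTOFF = 0.7  # assumed cutoff for "confident" atoms (0-1 scale)

# Indices of atoms whose confidence falls strictly below the cutoff
low_indices = [i for i, score in enumerate(atom_plddts) if score < PLDDT_CUTOFF]
mean_plddt = mean(atom_plddts)

print(f"Mean pLDDT: {mean_plddt:.3f}")
print(f"Low-confidence atoms: {low_indices}")
```

The same filter applied to the full `atom_plddts` list, together with `atom_chain_ids`, would show which chains contain the least confident atoms.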


---

## Section 4: Validation and Export

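The final step compares the RF3-predicted structure against the original RFD3-generated backbone. After optimal superposition, a low backbone RMSD indicates that the sequence is likely to fold into the intended structure (high designability); the interpretation code in this section labels values below 1 Å as excellent and values below 2 Å as good.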
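
To make the comparison concrete, here is a NumPy-only sketch of superposition followed by RMSD, using the Kabsch algorithm. It is an illustration of the idea, not Biotite's implementation, and `kabsch_rmsd` is a hypothetical helper. It assumes both coordinate sets contain the same atoms in the same order, which is also what the backbone filtering in this section relies on.

```python
import numpy as np


def kabsch_rmsd(reference, mobile):
    """RMSD (same units as input) after optimally superimposing `mobile` onto `reference`.

    Both inputs are (N, 3) coordinate arrays with matching atom order.
    """
    reference = np.asarray(reference, dtype=float)
    mobile = np.asarray(mobile, dtype=float)
    if reference.shape != mobile.shape:
        raise ValueError("coordinate arrays must have the same shape")

    # Remove translation by centering both sets on their centroids
    ref_centered = reference - reference.mean(axis=0)
    mob_centered = mobile - mobile.mean(axis=0)

    # Optimal rotation from the SVD of the 3x3 covariance matrix
    covariance = mob_centered.T @ ref_centered
    u, _, vt = np.linalg.svd(covariance)

    # Flip the last axis if needed so the result is a proper rotation (no reflection)
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T

    fitted = mob_centered @ rotation.T
    return float(np.sqrt(((fitted - ref_centered) ** 2).sum(axis=1).mean()))


# Sanity check: a rotated and translated copy should give an RMSD of ~0
rng = np.random.default_rng(0)
coords = rng.normal(size=(20, 3))
theta = np.pi / 3
rot_z = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
moved = coords @ rot_z.T + np.array([1.0, -2.0, 0.5])
rmsd_rigid = kabsch_rmsd(coords, moved)

# Adding noise to the copy should give a clearly non-zero RMSD
rmsd_noisy = kabsch_rmsd(coords, moved + rng.normal(scale=0.5, size=moved.shape))
print(f"Rigid copy RMSD: {rmsd_rigid:.2e}, noisy copy RMSD: {rmsd_noisy:.2f}")
```

Superposition matters because the generated and re-folded structures sit in arbitrary coordinate frames; without it, the RMSD would mostly measure the difference in placement rather than in shape.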

In [19]:
from biotite.structure import rmsd, superimpose
from atomworks.constants import PROTEIN_BACKBONE_ATOM_NAMES
import numpy as np

# Get structures for comparison
aa_generated = atom_array              # Original RFD3 backbone (from Section 1)
aa_refolded = rf3_output.atom_array    # RF3-predicted structure

# Filter to backbone atoms (N, CA, C, O)
bb_generated = aa_generated[np.isin(aa_generated.atom_name, PROTEIN_BACKBONE_ATOM_NAMES)]
bb_refolded = aa_refolded[np.isin(aa_refolded.atom_name, PROTEIN_BACKBONE_ATOM_NAMES)]

# Superimpose structures and calculate RMSD
bb_refolded_fitted, _ = superimpose(bb_generated, bb_refolded)
rmsd_value = rmsd(bb_generated, bb_refolded_fitted)

print(f"Backbone RMSD: {rmsd_value:.2f} A")
print(f"\nInterpretation: {'Excellent' if rmsd_value < 1.0 else 'Good' if rmsd_value < 2.0 else 'Moderate'} designability")

Backbone RMSD: 6.53 A

Interpretation: Moderate designability


In [20]:
from atomworks.io.utils.io_utils import to_cif_file

# Export structures to CIF format for visualization in PyMOL/ChimeraX
to_cif_file(aa_generated, "generated.cif")
to_cif_file(aa_refolded, "refolded.cif")

print("Exported structures:")
print("  - generated.cif: Original RFD3 backbone")
print("  - refolded.cif:  RF3-predicted structure")

Exported structures:
  - generated.cif: Original RFD3 backbone
  - refolded.cif:  RF3-predicted structure


### Superimposed Result

The image below shows the generated backbone (RFD3) superimposed with the re-folded structure (RF3). Close alignment indicates successful design.

![Superimposed Protein](../docs/_static/superimposed_80_residue_protein.png)