# De Novo Protein Design Workflow (June 2025)

Workflow for creating de novo protein binders using NVIDIA Inference Microservices (NIMs) deployed on the NVIDIA Brev cloud GPU platform.

Here, we take an alternative approach that avoids running the AlphaFold2 structure prediction (step 1) on the cloud GPU. Instead, we first pre-compute the protein structure (.PDB) from our input protein sequence on [AlphaFold2 colab](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb#scrollTo=R_AH6JSXaeb2).

The resulting .PDB file is then used as input to **RFDiffusion** to generate protein backbones, which are passed to **ProteinMPNN** to design amino acid sequences for those backbones. Finally, the designed peptide is validated with [PRODIGY](https://rascar.science.uu.nl/prodigy/) (Gibbs free energy of binding) and Rosetta [FlexPepDock](https://r2.graylab.jhu.edu/auth/login?next=%2Fapps%2Fsubmit%2Fflexpepdock).

## Getting Started 

### Software Pre-requisites
* Python 3.11+

### Hardware requirements
* **RFdiffusion** runs on 1 x GPU, ≥12 GiB GPU memory, 15 GB free SSD space
* **ProteinMPNN** runs on 1 x GPU, ≥3 GiB GPU memory, 10 GB free SSD space
* Total: 2 x GPU, 47 GiB GPU memory, 1.3 TB SSD space, 60 GiB RAM, 24 CPUs

### Ensure that you have these files:
* `protein-binder-design_v3.ipynb` uploaded to a [public GitHub repo](https://github.com/Keonapang/generative-protein-binder-design/blob/main/src/protein-binder-design.ipynb)
* `docker-compose.yaml` (3MB) from the original [BioNeMo repo](https://github.com/NVIDIA-BioNeMo-blueprints/generative-protein-binder-design/blob/main/deploy/docker-compose.yaml)
* `cycle1_alphafold2_output.pdb` (~80KB) pre-computed on [AlphaFold2 colab](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb#scrollTo=R_AH6JSXaeb2)
* `cycle2_alphafold2_output.pdb` (~80KB) pre-computed on [AlphaFold2 colab](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb#scrollTo=R_AH6JSXaeb2)

## 1. AlphaFold2

Pre-compute the protein structure (.PDB) from our input protein sequence on [AlphaFold2 colab](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb#scrollTo=R_AH6JSXaeb2). In our previous study design, we identified 8 potential binding sites on the ApoB-100 protein (the sequences for sites 1D and 2C are not listed here):

* Sequence "1A": `LKTSQCTLKEVYGFNPEGKALLKKTKNSEEFAAAMSRYEL` (A91-130)
* Sequence "1B": `EEAKQVLFLDTVYGNCSTHFTVKTRKGNVATEISTERDLG` (A170-209)
* Sequence "1C": `VAEAICKEQHLFLPFSYKNKYGMVAQVTQTLKLEDTPKIN` (A255-294)
* Sequence "2A": `CSTHILQWLKRVHANPLLIDVVTYLVALIPEPSAQQLREI` (A390-429)
* Sequence "2B": `GTQELLDIANYLMEQIQDDCTGDEDYTYLILRVIGNMGQT` (A459-498)
* Sequence "2D": `EQVKNFVASHIANILNSEELDIQDLKKLVKEALKESQLPT` (A587-626)
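
Before submitting the sequences, it can help to confirm that each one matches its annotated residue range. The following is a minimal sketch, not part of the original notebook: the names `TARGET_SITES`, `expected_length`, and `check_site` are hypothetical, and only two of the listed sites are included for brevity.

```python
import re

# Two of the target sites listed above: (sequence, residue range on chain A).
TARGET_SITES = {
    "1A": ("LKTSQCTLKEVYGFNPEGKALLKKTKNSEEFAAAMSRYEL", "A91-130"),
    "2D": ("EQVKNFVASHIANILNSEELDIQDLKKLVKEALKESQLPT", "A587-626"),
}

# One-letter codes of the 20 standard amino acids.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def expected_length(residue_range: str) -> int:
    """Length implied by a range such as 'A91-130' (inclusive on both ends)."""
    match = re.fullmatch(r"([A-Za-z])(\d+)-(\d+)", residue_range)
    if match is None:
        raise ValueError(f"Unrecognised residue range: {residue_range!r}")
    start, end = int(match.group(2)), int(match.group(3))
    return end - start + 1


def check_site(sequence: str, residue_range: str) -> bool:
    """True if the sequence uses standard residues and matches the range length."""
    only_standard = set(sequence) <= VALID_RESIDUES
    return only_standard and len(sequence) == expected_length(residue_range)


for name, (sequence, residue_range) in TARGET_SITES.items():
    print(name, len(sequence), check_site(sequence, residue_range))
```

A failed check usually points to a copy-paste error in either the sequence or the range annotation.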

**Inputs**:
- `sequence`: each of the 8 amino acid sequences above

**Outputs**:
- Predicted structures in PDB format, for example `cycle1A_alphafold2_output.pdb`
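
AlphaFold2 writes its per-residue confidence (pLDDT) into the B-factor column of the output PDB, so a quick sanity check of a predicted structure is to count residues and average that column. The sketch below assumes standard fixed-width PDB `ATOM` records; `residue_plddt` is a hypothetical helper, and the three records are taken from the example PDB shown later in this document.

```python
def residue_plddt(pdb_text: str) -> dict[tuple[str, int], float]:
    """Map (chain ID, residue number) to the B-factor (pLDDT) of that residue."""
    scores: dict[tuple[str, int], float] = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        chain_id = line[21]             # column 22
        res_seq = int(line[22:26])      # columns 23-26
        b_factor = float(line[60:66])   # columns 61-66
        # All atoms of a residue share one pLDDT, so keep the first value seen.
        scores.setdefault((chain_id, res_seq), b_factor)
    return scores


EXAMPLE_PDB = "\n".join([
    "ATOM      1  N   LEU A   1       3.770  10.500 -13.625  1.00 64.62           N",
    "ATOM      2  CA  LEU A   1       3.465   9.531 -12.578  1.00 64.62           C",
    "ATOM      9  N   LYS A   2       5.578   8.828 -11.562  1.00 70.88           N",
])

scores = residue_plddt(EXAMPLE_PDB)
mean_plddt = sum(scores.values()) / len(scores)
print(f"{len(scores)} residues, mean pLDDT {mean_plddt:.2f}")
```

For a real prediction, pass `Path("cycle1A_alphafold2_output.pdb").read_text()` instead of `EXAMPLE_PDB`.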

**Runtime**: 
- ~2 hrs on Colab
- 12 minutes for a ~550 AA `target_sequence` on an H100 GPU


## Deploy NIM on NVIDIA Brev cloud GPU

1. Go to [brev.nvidia](https://brev.nvidia.com/) and create a new **'Launchable'** with the following settings:
    - **Compute**: A100 (80GiB GPU memory) 4 GPUs x 48 CPUs | 480GiB | 128GiB RAM ($7.92/hr)
    - **Container**: use VM-mode (+jupyter notebook)
    - **File**: Link to [jupyter notebook](https://github.com/Keonapang/generative-protein-binder-design/blob/main/src/protein-binder-design_v3.ipynb)


2. Click **"Launch"** and **"Go to Instance Page"**. Wait ~15 minutes for the cloud server to initiate.

3. When the server is ready, **upload** (i.e. drag and drop) 2 files from this repo:
    - `./deploy/docker-compose.yaml` sets up the Docker images, networks, and complex dependencies required by each NIM. 
    - `./docs/cycle1_alphafold2_output.pdb` from AlphaFold2 on Colab

4. Open a terminal on your local computer (Command Prompt on Windows, Terminal on macOS) and install the Brev CLI:

```bash
    # Install the CLI (this step needs permission from ICT)
    brew install brevdev/homebrew-brev/brev && brev login --token <****>
```
5. On your NVIDIA Brev 'Instance' page, under the **"Access"** tab, copy the command shown and run it in your terminal to connect to the VM instance:

```bash
    brev shell <instance-name> # find instance-name on brev.nvidia.com
```
6. Ensure that you've [generated](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/containers/bionemo-framework) an **NGC Personal API Key**, then run:

```bash
    export NGC_CLI_API_KEY=<enter-key> # enter personal API key
    docker login nvcr.io --username='$oauthtoken' --password="${NGC_CLI_API_KEY}"

    sudo apt-get install -y docker-compose
    sudo apt install python3.11

    ## Create the nim cache directory to download any model data to your local/server disk 
    mkdir -p ~/.cache/nim
    chmod -R 777 ~/.cache/nim    ## Make it writable by the NIM
    export HOST_NIM_CACHE=~/.cache/nim

    # Run Docker Compose from the directory containing docker-compose.yaml
    docker compose up
```


Running **Docker Compose** takes 20-25 minutes in total. When the containers start, they first pull the models for each NIM. The terminal output will look like:

```bash
    [+] Running 4/4
    ✔ Network protein-binder-design_default                 Created          0.1s 
    ✔ Container protein-binder-design-alphafold-multimer-1  Started          6.3s 
    ✔ Container protein-binder-design-rfdiffusion-1         Started          6.2s 
    ✔ Container protein-binder-design-proteinmpnn-1         Started          6.2s 
```

7. **Finally, check the status** of the running NIMs with the commands:

```bash
    curl localhost:8081/v1/health/ready # alphafold2
    curl localhost:8082/v1/health/ready # RFdiffusion
    curl localhost:8083/v1/health/ready # Protein MPNN
    curl localhost:8084/v1/health/ready # alphafold2-multimer

    # check what dockers are running
    docker container ls
    docker container logs <CONTAINER-ID> # get CONTAINER-ID from command above
    df -h       # check storage space
```

### Open Python on the cloud GPU

```python
import json
import os
import requests
from enum import StrEnum, Enum  # StrEnum requires Python 3.11+
from typing import Tuple, Dict, Any, List
from pathlib import Path
```

The NIMs must be running before the cells below will work, and the models take some time to download (**10-20 minutes**).

```python
NVIDIA_API_KEY = os.getenv("NVIDIA_API_KEY") or input("Paste Run Key: ")  # see above for API key instructions
```

```python
HEADERS = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {NVIDIA_API_KEY}",
    "poll-seconds": "900"
}
NIM_HOST_URL_BASE = "http://localhost"

# Ports and endpoints for the three NIMs used in this workflow
class NIM_PORTS(Enum):
    RFDIFFUSION_PORT = 8082
    PROTEINMPNN_PORT = 8083
    AF2_MULTIMER_PORT = 8084

class NIM_ENDPOINTS(StrEnum):
    RFDIFFUSION =  "biology/ipd/rfdiffusion/generate"
    PROTEINMPNN =  "biology/ipd/proteinmpnn/predict"
    AF2_MULTIMER = "protein-structure/alphafold2/multimer/predict-structure-from-sequences"

def query_nim(
            payload: Dict[str, Any],
            nim_endpoint: str,
            headers: Dict[str, str] = HEADERS,
            base_url: str = NIM_HOST_URL_BASE,
            nim_port: int = 8080,
            echo: bool = False) -> Tuple[int, Dict]:
    """POST a JSON payload to a NIM endpoint; return (status code, parsed JSON)."""
    function_url = f"{base_url}:{nim_port}/{nim_endpoint}"
    if echo:
        print("*"*80)
        print(f"\tURL: {function_url}")
        print(f"\tPayload: {payload}")
        print("*"*80)
    response = requests.post(function_url,
                            json=payload,
                            headers=headers)
    if response.status_code == 200:
        return response.status_code, response.json()
    raise Exception(f"Error: response returned code [{response.status_code}], with text: {response.text}")

def check_nim_readiness(nim_port: int,
                        base_url: str = NIM_HOST_URL_BASE,
                        endpoint: str = "v1/health/ready") -> bool:
    """
    Return True if the NIM on the given port reports status "ready".
    """
    try:
        response = requests.get(f"{base_url}:{nim_port}/{endpoint}", timeout=10)
        return response.json().get("status") == "ready"
    except Exception as e:
        # Connection errors and non-JSON responses both mean "not ready yet"
        print(e)
        return False

def get_reduced_pdb(pdb_id: str, rcsb_path: str | None = None) -> str:
    """Return only the ATOM records of a PDB file, downloading it first if needed."""
    pdb = Path(pdb_id)
    if not pdb.exists() and rcsb_path is not None:
        pdb.write_text(requests.get(rcsb_path).text)
    lines = [line for line in pdb.read_text().split("\n") if line.startswith("ATOM")]
    return "\n".join(lines)

class ExampleRequestParams:
    """Container for the parameters of one RFDiffusion + ProteinMPNN run."""
    def __init__(self,
                target_sequence: str,
                contigs: str,
                hotspot_res: List[str],
                input_pdb_chains: List[str],
                ca_only: bool,
                use_soluble_model: bool,
                sampling_temp: List[float],
                diffusion_steps: int = 15,
                num_seq_per_target: int = 20):
        self.target_sequence = target_sequence
        self.contigs = contigs
        self.hotspot_res = hotspot_res
        self.input_pdb_chains = input_pdb_chains
        self.ca_only = ca_only
        self.use_soluble_model = use_soluble_model
        self.sampling_temp = sampling_temp
        self.diffusion_steps = diffusion_steps
        self.num_seq_per_target = num_seq_per_target
```

Test whether each NIM is running using the `check_nim_readiness` function:

```python
status = check_nim_readiness(NIM_PORTS.RFDIFFUSION_PORT.value)
print(f"RFDiffusion ready: {status}")
status = check_nim_readiness(NIM_PORTS.PROTEINMPNN_PORT.value)
print(f"ProteinMPNN ready: {status}")
```
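
Because the models can take 10-20 minutes to download, a single readiness check often returns `False` at first. One way to handle this is to poll until the NIM reports ready or a timeout expires. The helper below is a hypothetical sketch (`wait_until_ready` is not part of the original notebook); it takes any zero-argument callable so it can wrap `check_nim_readiness`.

```python
import time
from typing import Callable


def wait_until_ready(is_ready: Callable[[], bool],
                     timeout_s: float = 1800.0,
                     interval_s: float = 30.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Call is_ready() repeatedly until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Intended use in the notebook (requires the functions defined above):
# wait_until_ready(lambda: check_nim_readiness(NIM_PORTS.RFDIFFUSION_PORT.value))
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting; the defaults are what you would use in practice.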

## 2. Run RFDiffusion + ProteinMPNN

**RFDiffusion** applies generative diffusion techniques to create novel protein structures. It excels in designing complex protein architectures, including binders and symmetric assemblies, by sculpting atomic clouds into functional proteins. This step is also available on Colab ([diffusion.ipynb](https://colab.research.google.com/github/sokrypton/ColabDesign/blob/v1.1.1/rf/examples/diffusion.ipynb#scrollTo=TuRUfQJZ4vkM)).

**Inputs**
- `input_pdb` is the protein target in PDB format
- `contigs` specifies the regions to work on. See the [RFDiffusion repo](https://github.com/RosettaCommons/RFdiffusion?tab=readme-ov-file#running-the-diffusion-script) under 'Binder Design' for more details. For example, `A20-60/0 50-100` generates a binder to chain A residues 20-60, where the binder is 50-100 residues long.
- `diffusion_steps` is the number of diffusion steps (15-30 recommended)

**Output**:
- `output_pdb` is the generated structure in PDB format; `protein` is the input PDB

**Runtime**: ~15 seconds (for ~550 AA target_sequence on the H100 GPU)
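
To make the RFDiffusion inputs concrete, here is a minimal sketch of assembling a request payload. The field names `input_pdb`, `contigs`, `hotspot_res`, and `diffusion_steps` come from this document; the helper names `make_binder_contigs` and `build_rfdiffusion_payload` are hypothetical, and the hotspot residues in the example are illustrative only.

```python
from typing import Any, Dict, List, Tuple


def make_binder_contigs(chain: str,
                        target_range: Tuple[int, int],
                        binder_length: Tuple[int, int]) -> str:
    """Build a binder-design contig string such as 'A20-60/0 50-100'."""
    t_start, t_end = target_range
    b_min, b_max = binder_length
    if t_start > t_end or b_min > b_max:
        raise ValueError("Ranges must be given as (low, high)")
    return f"{chain}{t_start}-{t_end}/0 {b_min}-{b_max}"


def build_rfdiffusion_payload(input_pdb: str,
                              contigs: str,
                              hotspot_res: List[str],
                              diffusion_steps: int = 15) -> Dict[str, Any]:
    """Collect the RFDiffusion inputs described above into one JSON-ready dict."""
    return {
        "input_pdb": input_pdb,
        "contigs": contigs,
        "hotspot_res": hotspot_res,
        "diffusion_steps": diffusion_steps,
    }


contigs = make_binder_contigs("A", (20, 60), (50, 100))
payload = build_rfdiffusion_payload(
    input_pdb="ATOM ...",            # placeholder; use get_reduced_pdb(...) in the notebook
    contigs=contigs,
    hotspot_res=["A30", "A33"],      # illustrative hotspot residues, not from the study
)
print(payload["contigs"])

# In the notebook, the payload would then be sent with:
# rc, response = query_nim(payload=payload,
#                          nim_endpoint=NIM_ENDPOINTS.RFDIFFUSION.value,
#                          nim_port=NIM_PORTS.RFDIFFUSION_PORT.value)
```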

**ProteinMPNN** (Protein Message Passing Neural Network) is a graph neural network that predicts amino acid sequences for given protein backbones, leveraging evolutionary, functional, and structural information to generate sequences that are likely to fold into the desired 3D structures.

**Inputs**: 
- `input_pdb`: input protein backbone for which amino acid sequences are predicted
- `ca_only`: defaults to false; when true, uses the CA-only model, which works from the alpha-carbon (CA) coordinates alone
- `use_soluble_model`: chooses between the model trained on soluble proteins and the standard model; relevant for membrane protein studies
- `num_seq_per_target`: defaults to 1; number of sequences to generate that should fold into the given target structure
- `sampling_temp`: ranges from 0 to 1 and controls the diversity of design outcomes by adjusting the probability values for the 20 amino acids at each sequence position
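
The effect of the sampling temperature can be illustrated with a temperature-scaled softmax: dividing the scores by a small temperature concentrates probability on the top-scoring amino acid, while a temperature near 1 spreads it out. This is a generic sketch of the idea with made-up scores for three amino acids, not ProteinMPNN's actual code.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate amino acids
for t in (0.1, 0.5, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures therefore give more conservative, repeatable designs; higher temperatures give more diverse sequences.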
 
**Outputs**:
- `ProteinMPNN.fa` (string): fasta file containing the designed sequence(s) for the given structure.
- `scores`: (array) log-probabilities of the designed sequences, which indicate the likelihood of each sequence given the input structure
- `probs`: (array) predicted probabilities for each amino acid at each position

**Runtime**: < 30 seconds for 20 short sequences (on the H100 GPU)
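
The ProteinMPNN step takes the backbone produced by RFDiffusion as its input. The sketch below shows one way the request could be assembled from the parameters listed above; `build_proteinmpnn_payload` is a hypothetical helper, `input_pdb_chains` is taken from the `ExampleRequestParams` class, and the RFDiffusion response is mocked with a placeholder.

```python
from typing import Any, Dict, List


def build_proteinmpnn_payload(input_pdb: str,
                              input_pdb_chains: List[str],
                              sampling_temp: List[float],
                              ca_only: bool = False,
                              use_soluble_model: bool = False,
                              num_seq_per_target: int = 1) -> Dict[str, Any]:
    """Collect the ProteinMPNN inputs into one JSON-ready dict."""
    for temp in sampling_temp:
        if not 0.0 <= temp <= 1.0:
            raise ValueError(f"sampling_temp must be within 0-1, got {temp}")
    if num_seq_per_target < 1:
        raise ValueError("num_seq_per_target must be at least 1")
    return {
        "input_pdb": input_pdb,
        "input_pdb_chains": input_pdb_chains,
        "ca_only": ca_only,
        "use_soluble_model": use_soluble_model,
        "num_seq_per_target": num_seq_per_target,
        "sampling_temp": sampling_temp,
    }


# Mocked RFDiffusion response; the real one comes back from query_nim(...)
rfdiffusion_response = {"output_pdb": "ATOM ..."}

mpnn_payload = build_proteinmpnn_payload(
    input_pdb=rfdiffusion_response["output_pdb"],
    input_pdb_chains=["A"],      # illustrative chain selection
    sampling_temp=[0.5],
    num_seq_per_target=20,
)
print(sorted(mpnn_payload))
```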

```bash
## Study design:
# - 8 input peptide targets; each 10-20 amino acid residues long
# - peptides are spaced at least 30 AA apart
# - between the cycle 1 and cycle 2 locations, there is a ~1000 amino acid gap

# EXECUTION
num_seq=1     # number of sequences to generate per target
diffusion=50  # number of diffusion steps
temp=0.5      # sampling temperature (range: 0-1); controls the diversity of the design outcomes

for cycle in "1A" "1B" "1C" "1D" "2A" "2B" "2C" "2D"; do
    python3.11 "/home/ubuntu/3_protein_binder_design.py" --cycle "$cycle" --num_seq "$num_seq" --diffusion "$diffusion" --temp "$temp"
done
```
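
The script `3_protein_binder_design.py` is not shown here. As an assumption about its command-line interface, the flags used in the loop could be parsed as in the sketch below, which also derives an output filename following the pattern of the example output file. The function names and defaults are hypothetical.

```python
import argparse

CYCLES = ["1A", "1B", "1C", "1D", "2A", "2B", "2C", "2D"]


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the four flags passed by the shell loop."""
    parser = argparse.ArgumentParser(description="RFDiffusion + ProteinMPNN for one target site")
    parser.add_argument("--cycle", required=True, choices=CYCLES, help="target site label")
    parser.add_argument("--num_seq", type=int, default=1, help="sequences per target")
    parser.add_argument("--diffusion", type=int, default=50, help="diffusion steps")
    parser.add_argument("--temp", type=float, default=0.5, help="sampling temperature (0-1)")
    return parser.parse_args(argv)


def output_fasta_name(args: argparse.Namespace) -> str:
    """Filename pattern inferred from the example output file name."""
    return (f"3_cycle{args.cycle}_{args.num_seq}seqs_"
            f"{args.diffusion}diff_{args.temp}temp_proteinmpnn.fasta")


args = parse_args(["--cycle", "1A", "--num_seq", "1", "--diffusion", "50", "--temp", "0.5"])
print(output_fasta_name(args))
```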

**Example of a ProteinMPNN output**

`3_cycle1A_1seqs_50diff_0.5temp_proteinmpnn.fasta`

**>T=0.5, sample=1, score=2.0571, global_score=2.0571, seq_recovery=0.0000**
**TQEQLAQNKKEERVKLEKQMS**

* `T=0.5` - sampling temperature used (0.5)
* `sample` - index of the sampled sequence (1, 2, ...)
* `score` - negative log probability of the sampled amino acids, averaged over the designed residues
* `global_score` - negative log probability of the sampled/fixed amino acids, averaged over all residues in all chains
* `TQEQLAQNKKEERVKLEKQMS` - the designed peptide sequence
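
To rank designs programmatically, the header fields can be parsed into a dictionary. The sketch below is a hypothetical helper pair, not part of the original notebook; it keeps non-numeric header values as strings, which is an assumption to tolerate headers that carry extra metadata.

```python
from typing import Dict, List, Tuple, Union

Field = Union[float, str]


def parse_mpnn_header(header: str) -> Dict[str, Field]:
    """Parse '>T=0.5, sample=1, score=...' into a dict; numbers become floats."""
    fields: Dict[str, Field] = {}
    for item in header.lstrip(">").split(","):
        key, sep, value = item.strip().partition("=")
        if not sep:
            continue  # skip items without 'key=value' form
        try:
            fields[key] = float(value)
        except ValueError:
            fields[key] = value
    return fields


def parse_mpnn_fasta(text: str) -> List[Tuple[Dict[str, Field], str]]:
    """Return (header fields, sequence) pairs from a ProteinMPNN FASTA string."""
    records = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = parse_mpnn_header(line)
        elif header is not None:
            records.append((header, line))
            header = None
    return records


EXAMPLE = """>T=0.5, sample=1, score=2.0571, global_score=2.0571, seq_recovery=0.0000
TQEQLAQNKKEERVKLEKQMS
"""

records = parse_mpnn_fasta(EXAMPLE)
# Lower score means a more likely sequence, so sort ascending.
best_fields, best_sequence = sorted(records, key=lambda r: r[0]["score"])[0]
print(best_sequence, best_fields["score"])
```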

## 3. Model Validation

### Option 1: PRODIGY Gibbs Free Energy
Provide the target complex (protein + peptide) to the [PRODIGY web server](https://rascar.science.uu.nl/prodigy/) in multi-PDB format, i.e. one PDB file that contains both chains. Example of a multi-PDB format:

```
MODEL     1
ATOM      1  N   LEU A   1       3.770  10.500 -13.625  1.00 64.62           N
ATOM      2  CA  LEU A   1       3.465   9.531 -12.578  1.00 64.62           C
ATOM      3  C   LEU A   1       4.594   8.516 -12.430  1.00 64.62           C
ATOM      4  CB  LEU A   1       2.152   8.805 -12.883  1.00 64.62           C
ATOM      5  O   LEU A   1       4.969   7.848 -13.391  1.00 64.62           O
ATOM      6  CG  LEU A   1       1.371   8.281 -11.672  1.00 64.62           C
ATOM      7  CD1 LEU A   1       0.621   9.422 -10.992  1.00 64.62           C
ATOM      8  CD2 LEU A   1       0.407   7.176 -12.102  1.00 64.62           C
ATOM      9  N   LYS A   2       5.578   8.828 -11.562  1.00 70.88           N
TER      10      LYS A   2
ATOM      2  CA   GLY B   1      18.770  -4.344 -4.925  1.00 64.62
ATOM      2  CA   PRO B   1      11.324  -4.344 -4.925  1.00 30.20
```
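
If the target and the designed peptide are in separate PDB files, they need to be combined into one two-chain file first. The following is a minimal sketch under the assumption that both inputs use standard fixed-width `ATOM` records; `relabel_chain` and `merge_complex` are hypothetical helpers, and atom serial numbers are left as they are rather than renumbered.

```python
from typing import List


def relabel_chain(pdb_text: str, chain_id: str) -> List[str]:
    """Keep only ATOM records and set their chain ID (column 22) to chain_id."""
    if len(chain_id) != 1:
        raise ValueError("chain_id must be a single character")
    relabelled = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            padded = line.ljust(22)
            relabelled.append(padded[:21] + chain_id + padded[22:])
    return relabelled


def merge_complex(target_pdb: str, binder_pdb: str) -> str:
    """Write the target as chain A and the binder as chain B in one PDB string."""
    lines = (relabel_chain(target_pdb, "A") + ["TER"]
             + relabel_chain(binder_pdb, "B") + ["TER", "END"])
    return "\n".join(lines) + "\n"


target = "\n".join([
    "ATOM      1  N   LEU A   1       3.770  10.500 -13.625  1.00 64.62           N",
    "ATOM      2  CA  LEU A   1       3.465   9.531 -12.578  1.00 64.62           C",
])
# Stand-in for a binder file whose chain is also labelled A before merging.
binder = "ATOM      9  N   LYS A   2       5.578   8.828 -11.562  1.00 70.88           N"

print(merge_complex(target, binder))
```

Check the merged file in a structure viewer before submitting it, since this sketch does not verify that the two chains are actually in contact.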

 
### Option 2: FlexPepDock

Please log in to GitHub to use the [FlexPepDock ROSIE web server](https://r2.graylab.jhu.edu/auth/login?next=%2Fapps%2Fsubmit%2Fflexpepdock).