## Protein Generation and Structure Prediction with BioNeMo

This example notebook shows how to generate new protein sequences and predict folded protein structures using the ProtGPT2 and OpenFold pre-trained models via the BioNeMo service API. These models were trained and deployed using NVIDIA's BioNeMo framework for Large Language Models. For more details, please visit NVIDIA BioNeMo Service at https://www.nvidia.com/en-us/gpu-cloud/bionemo/

This notebook walks through protein generation and visualization in the following sections:

 - **BioNeMo Service Configuration**
   - Install dependencies and define the BioNeMo service endpoint and API key required for access
 - **Generating Protein Sequences**
   - Generate protein sequences with the ProtGPT2 model
 - **Predicting 3D Protein Structure**
   - Predict the 3D structure of the generated proteins using OpenFold

### BioNeMo Service Configuration
To get started, please configure and provide your NGC access token by visiting https://ngc.nvidia.com/setup/api-key

In [None]:
API_KEY = "YOUR KEY HERE"
API_HOST = "https://api.stg.bionemo.ngc.nvidia.com/v1"

Let's start by installing and importing library dependencies. We'll use _requests_ to interact with the BioNeMo service, and _py3Dmol_ for visualization.

In [None]:
!pip install py3dmol

In [None]:
import io
import re
import time
import json
import datetime
from typing import Iterable, Dict

import numpy
import requests
import py3Dmol

Next, let's validate our connection to the BioNeMo service. You should get a "200" HTTP response in the following cell, indicating a successful connection. Note the _Authorization_ field in the request header, which contains your NGC access token credentials. We'll re-use this header for all future interactions with the service.
<div class="alert alert-block alert-info">
    <b>Tip:</b>  If you do not receive a "200 OK" HTTP response when testing connectivity to BioNeMo service in the cell below, verify
    
 - the API_HOST defined above is the correct address of the BioNeMo service, and
 - the API_KEY is authorized to access BioNeMo service at the API_HOST

Refer to this <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Status">list of HTTP responses status codes</a> to help determine the cause of a non-200 response.
</div>

In [None]:
#Checking to see if the access is configured
response = requests.get(
    f"{API_HOST}/models",
    headers={"Authorization": f"Bearer {API_KEY}"})
print("Query BioNeMo Service:", response)

#Add key to headers for remainder of notebook
headers = {
    'Authorization': f'Bearer {API_KEY}'
}
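
As a small aid for interpreting a non-200 response, the following sketch maps a few common HTTP status codes to likely causes. The helper name `explain_status` and the hint wording are illustrative assumptions, not part of the BioNeMo API; only the generic HTTP meaning of each code is relied on.

```python
def explain_status(status_code: int) -> str:
    """Return a short troubleshooting hint for an HTTP status code (hypothetical helper)."""
    hints = {
        200: "OK: the service is reachable and the key was accepted.",
        401: "Unauthorized: check that API_KEY is set and valid.",
        403: "Forbidden: the key may not be authorized for this API_HOST.",
        404: "Not Found: check the API_HOST address and the request path.",
    }
    if status_code in hints:
        return hints[status_code]
    if 500 <= status_code < 600:
        return "Server error: the service may be temporarily unavailable."
    return "Unexpected status: consult the HTTP status code reference."
```

In the notebook, this could be called as `print(explain_status(response.status_code))` right after the connectivity check.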

Some of the BioNeMo services are _synchronous_, meaning the service API call will block until a result is returned. Two examples of synchronous services are MegaMolBART molecule embedding and looking up PDBs using the BioNeMo UniProt service. Other services are _asynchronous_, such as the protein folding and docking services. Functions calling _asynchronous_ services are non-blocking and immediately return a handle called a _correlation_id_ that can be used to query the results. This allows us to batch multiple requests together and query them for completion at a later time.

In the following cell we introduce a helper function _query_async_result_ that blocks on a _correlation_id_ task until the computation is complete and a result is returned. For these non-blocking asynchronous calls, it is important to block on the correlation id to ensure we've received the response data before continuing.

In [None]:
def query_async_result(request, print_result=False):
    """Block until the task behind `request` finishes; return its response, or None on error."""
    if isinstance(request, str):
        # Request is a correlation id string
        correlation_id = request
    elif isinstance(request, requests.models.Response):
        # Request is the HTTP response returned when the task was submitted
        submission_response = json.loads(request.content)
        correlation_id = submission_response['correlation_id']
    else:
        raise TypeError("request must be a correlation id string or a requests Response")

    i = 0
    while True:
        response = requests.get(
            f"{API_HOST}/task/{correlation_id}",
            headers=headers,
        )

        status_result = json.loads(response.content)
        status = status_result['control_info']['status']
        if status == 'DONE':
            if print_result:
                print(status_result['response'])
            return status_result['response']
        elif status == 'ERROR':
            print("ERROR, Cancelling Result Retrieval")
            return None
        else:
            # Still queued or running: show progress and poll again after a pause
            print(f"{status}{'.' * i}", end="\r")
            time.sleep(10)  # waiting for the prediction from BioNeMo Server
        i = i + 1
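
Because asynchronous requests can be batched, it can also be useful to wait on several correlation ids at once and to give up after a deadline. The following is a minimal sketch of that idea, not part of the BioNeMo API: the name `wait_for_all`, the `fetch_status` callback, and the `interval` and `timeout` parameters are all hypothetical. It assumes only the task JSON layout used above (`control_info.status` and `response`).

```python
import time
from typing import Callable, Dict, Iterable, Optional


def wait_for_all(
    correlation_ids: Iterable[str],
    fetch_status: Callable[[str], dict],
    interval: float = 10.0,
    timeout: float = 600.0,
) -> Dict[str, Optional[str]]:
    """Poll every task until each is DONE or ERROR, or raise after `timeout` seconds.

    `fetch_status` is a caller-supplied (hypothetical) function that returns the
    parsed task JSON for one correlation id.
    """
    pending = set(correlation_ids)
    results: Dict[str, Optional[str]] = {}
    deadline = time.monotonic() + timeout

    while pending:
        for cid in list(pending):
            task = fetch_status(cid)
            status = task["control_info"]["status"]
            if status == "DONE":
                results[cid] = task["response"]
                pending.discard(cid)
            elif status == "ERROR":
                results[cid] = None  # mark failed tasks without stopping the others
                pending.discard(cid)
        if not pending:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Tasks still pending: {sorted(pending)}")
        time.sleep(interval)

    return results
```

In this notebook, `fetch_status` could wrap the same `requests.get(f"{API_HOST}/task/{correlation_id}", headers=headers)` call used by _query_async_result_ and return the parsed JSON.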

## Protein structure prediction via API request to BioNeMo Service

In the following, we will demonstrate an example of protein structure prediction utilizing sequences we generate from an AI model. In this case we'll be using ProtGPT2, a language model that generates protein sequences that are similar to their natural counterparts. But before we generate our own sequences, let's take a look at the BioNeMo UniProt lookup service. This can be leveraged to retrieve the sequence of a protein of interest, using the UniProt ID as input. Here we will be looking at [Thioredoxin](https://www.uniprot.org/uniprotkb/P10599/entry). This small protein plays a vital role in regulating cellular redox (reduction–oxidation) homeostasis by reducing disulfide bonds in proteins. It is found in most living organisms, from bacteria to humans. It acts as a reducing agent by donating electrons to other proteins to help maintain their proper shape and function. It also helps to remove reactive oxygen species (ROS) from cells, which can damage DNA, proteins, and lipids if not removed.

Let's begin by retrieving the sequence for Thioredoxin, utilizing the BioNeMo UniProt lookup service.
<div class="alert alert-block alert-info">
    <b>Tip:</b>  We use the specific sequence for Thioredoxin in this example workflow.  The UniProt ID lookup feature built in to BioNeMo Service allows a user to work with virtually any protein sequence.  Feel free to use this workflow as a starting point for your own experimentation!
</div>

In [None]:
uniprot_id="P10599"
response = requests.get(f'{API_HOST}/uniprot/{uniprot_id}', headers=headers)
sequence = json.loads(response.content)
print(sequence)

Generating new protein sequences is a key component of protein engineering, which allows scientists to create proteins with specific functions and properties that may not exist in nature. These engineered proteins can be used in a variety of applications, including drug development, biocatalysis, and biomaterials, among others.

It's important to choose your parameters for the protein generation process carefully.

- max_length: Sets the maximum number of generated tokens. Note that common tokens in ProtGPT2 are k-mers of length 3, so 1 token = 3 amino acids. In other words, max_length=400 could translate to an effective protein sequence of up to 1200 amino acids.
- top_k: Sets the number of highest-probability vocabulary tokens to keep for top-k filtering.
- repetition_penalty: Sets the penalty for [repeating tokens](https://arxiv.org/pdf/1909.05858.pdf), where a value of 1 corresponds to no penalty.
- num_return_sequences: Sets the effective number of whole protein sequences to return.
- percent_to_keep: Sets the percent of whole protein sequences to keep for each protein generation iteration cycle, based on their cumulative perplexity.
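
The token-to-residue arithmetic for `max_length` can be sketched as below. The helper name `max_residues` is hypothetical, and the default of 3 amino acids per token simply follows the rule of thumb stated above; actual token lengths may vary, so treat the result as an upper-bound estimate.

```python
def max_residues(max_length: int, residues_per_token: int = 3) -> int:
    """Estimate the longest sequence (in amino acids) a token budget allows."""
    return max_length * residues_per_token


print(max_residues(400))  # 1200
print(max_residues(100))  # 300
```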

As an illustrative example, suppose that 50 sequences of tokens are generated per iteration. From these generated sequences of tokens (which could contain protein sequence fragments or other invalid sequences), whole protein sequences are reconstructed. Let's suppose that there are 43 valid, whole protein sequences from the generative iteration. Of these 43 whole protein sequences, only the top `percent_to_keep`, according to perplexity, are kept. For instance, with percent_to_keep=10, only ~4 of the 43 sequences (i.e., ~10% of 43) will be kept. After these 4 sequences are added to the return object, the generation process will resume until `num_return_sequences` is reached. Therefore, the lower the value of `percent_to_keep`, the longer the overall generation process will take.
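
The filtering step described above can be sketched in a few lines. This is an illustration under stated assumptions, not the service's actual implementation: the function names are hypothetical, "top" is taken to mean lowest perplexity, and the number kept is assumed to round down (43 × 10% ≈ 4).

```python
import math
from typing import List, Tuple


def keep_lowest_perplexity(
    scored: List[Tuple[str, float]], percent_to_keep: float
) -> List[str]:
    """Keep the `percent_to_keep` percent of sequences with the lowest perplexity."""
    n_keep = int(len(scored) * percent_to_keep / 100)  # assumed to round down
    ranked = sorted(scored, key=lambda pair: pair[1])
    return [sequence for sequence, _ in ranked[:n_keep]]


def estimated_iterations(
    num_return_sequences: int, valid_per_iteration: int, percent_to_keep: float
) -> int:
    """Rough number of generation cycles needed to collect enough sequences."""
    kept = int(valid_per_iteration * percent_to_keep / 100)
    if kept == 0:
        raise ValueError("percent_to_keep is too low to keep any sequence")
    return math.ceil(num_return_sequences / kept)


# 43 valid sequences with made-up perplexities, keeping 10 percent
scored = [(f"SEQ{i}", float(i)) for i in range(43)]
print(len(keep_lowest_perplexity(scored, 10)))  # 4
print(estimated_iterations(5, 43, 10))          # 2
```

Under these assumptions, requesting 5 sequences with percent_to_keep=10 takes about two generation cycles, while percent_to_keep=50 would need only one.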

Now let's define a sane set of parameters for our example problem.

In [None]:
# Here we define the generation parameters.
parameters = {
    "max_length": 100,
    "top_k": 950,
    "repetition_penalty": 1.2,
    "num_return_sequences": 5,
    "percent_to_keep": 10
}

Now we can submit our service request to BioNeMo. Even though this service is asynchronous, we'll block while waiting for our request to be processed. With the chosen input parameters, we expect to obtain 5 new sequences.

In [None]:
# Submit request ticket
submission_request = requests.post(
    f"{API_HOST}/protein-sequence/protgpt2/generate",
    headers=headers,
    json=parameters
)

# Wait for request to be processed
result = query_async_result(submission_request)

Let's parse the output as JSON and print the first generated sequence as a quick sanity check.

In [None]:
results_json=json.loads(result)
print(results_json['generated_sequences'][0])
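
A slightly stronger sanity check is to confirm that a generated sequence contains only the 20 canonical amino acid letters before sending it to a folding service. The sketch below assumes each generated sequence is a plain one-letter-code string; the helper name `invalid_residues` is hypothetical.

```python
CANONICAL_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def invalid_residues(sequence: str) -> set:
    """Return the set of characters that are not canonical amino acid letters."""
    return set(sequence.upper()) - CANONICAL_AMINO_ACIDS
```

An empty set means the sequence uses only canonical residues, for example `invalid_residues(results_json['generated_sequences'][0])`.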

With our sequences in hand, we're ready to predict the corresponding protein structures using the BioNeMo folding services. We'll use OpenFold to fold both the original protein and the first generated sequence. Taking advantage of the asynchronous nature of the folding services, note that we batch and submit both folding requests before waiting for any service response.

In [None]:
# Below is the sequence of Human Thioredoxin as sourced from UniProt

#Original Sequence
original_request = requests.post(
    f"{API_HOST}/protein-structure/openfold/predict",
    headers={"Authorization": f"Bearer {API_KEY}"},
    files={
        "sequence": (None, sequence),    
    }
)

#Generated Sequence        
generated_sequence=results_json['generated_sequences'][0]
generated_request = requests.post(
    f"{API_HOST}/protein-structure/openfold/predict",
    headers={"Authorization": f"Bearer {API_KEY}"},
    files={
        "sequence": (None, generated_sequence),  
    }
)

original_result=query_async_result(original_request)
pdb_filename = "BioNeMo_OpenFold_original.pdb"
with open(pdb_filename, 'w') as pdb_file:
    for line in json.loads(original_result)['pdbs']:
        pdb_file.write(line)
        
generated_result=query_async_result(generated_request)
pdb_filename = "BioNeMo_OpenFold_generated.pdb"
with open(pdb_filename, 'w') as pdb_file:
    for line in json.loads(generated_result)['pdbs']:
        pdb_file.write(line)

### Understanding the response

The BioNeMo Server will respond with a `correlation_id`, indicating a unique identifier for the request. Once the request is submitted, it is queued for processing. As soon as a processing slot is available, the structure prediction process is started. You can keep an eye on the submission request by querying the Server with the `correlation_id`.

In the following, we will wait for the status to be completed in order to download the structure prediction and save it into a pdb file that we then can visualize.

More information about the API can be found here: https://developer.nvidia.com/docs/bionemo-service/working-with-the-api.html

### Visualizing the predicted structures and prediction confidence

Finally, we visualize the predictions in 3D using [*py3Dmol*](https://pypi.org/project/py3Dmol/).

We take advantage of the predicted lDDT (pLDDT), a proxy for model confidence, and visualize it with a color scheme similar to AlphaFold2's. In this color scheme, we bin the confidence scores as:
 - <span style="color:blue">&#11035;</span> (dark blue, very high) - 90-100%
 - <span style="color:lightblue">&#11035;</span> (light blue, confident) - 70-90%
 - <span style="color:yellow">&#11035;</span> (yellow, low) - 50-70%
 - <span style="color:orange">&#11035;</span> (orange, very low) - <50%

Regions in dark blue, with very high confidence, are expected to be modeled with high accuracy and can be used in applications that require high accuracy such as identifying binding sites.  Light blue regions with 70-90% confidence are expected to model general structure well.  Regions with low 50-70% confidence should be treated with caution, and below 50% confidence should not be used.
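
To get a quick numeric summary alongside the colors, the confidence bands above can be counted per residue. The sketch below is illustrative: it assumes a standard fixed-column PDB file in which the pLDDT score (0-100) is stored in the B-factor field of each ATOM record, counts one alpha-carbon (CA) atom per residue, and treats each lower bound as inclusive. The function names are hypothetical.

```python
from collections import Counter


def confidence_band(plddt: float) -> str:
    """Map a pLDDT score (0-100) to one of the four confidence bands."""
    if plddt >= 90:
        return "very high"
    if plddt >= 70:
        return "confident"
    if plddt >= 50:
        return "low"
    return "very low"


def band_counts(pdb_text: str) -> Counter:
    """Count residues per band, reading pLDDT from the B-factor column of CA atoms."""
    counts = Counter()
    for line in pdb_text.splitlines():
        # PDB columns 13-16 hold the atom name, columns 61-66 the B-factor
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            counts[confidence_band(float(line[60:66]))] += 1
    return counts
```

Calling `band_counts(open("BioNeMo_OpenFold_original.pdb").read())` on a saved prediction would give the number of residues in each band, which helps judge how much of a structure falls in the usable confidence ranges.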

In the visualizations below, we show the two cases from the examples above:
 - The predicted structure of the original Thioredoxin sequence from the UniProt sequence look-up service, and
 - The predicted structure of the novel protein sequence generated with ProtGPT2
 
Note that these are independent structures.  The following visualizations are not intended for comparison, but rather to showcase the ability to predict 3D structure using OpenFold.
 

In [None]:
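# Define color palette for pLDDT confidence bins (range() upper bounds are exclusive)
color_palette = {
    range(90, 101): 'blue',      # 101 so that a score of exactly 100 is included
    range(70, 90): 'lightblue',
    range(50, 70): 'yellow',
    range(0, 50): 'orange',
}

def get_color(plddt):
    # Return the color for an integer pLDDT score, or None if it lies outside 0-100
    for key in color_palette:
        if plddt in key:
            return color_palette[key]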

First the original Thioredoxin structure:

In [None]:
filename="BioNeMo_OpenFold_original.pdb"
# Loading the predicted structure saved in PDB file
with open(filename) as ifile:
    system = "".join([x for x in ifile])

#configuring the structure display
view = py3Dmol.view(width=800, height=800)
view.addModelsAsFrames(system)

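# Iterate over atoms and color each by its pLDDT value (stored in the B-factor column)
for line in system.split("\n"):
    split = line.split()
    if len(split) == 0 or split[0] != "ATOM":
        continue

    # Use the atom serial number from the file rather than the line index,
    # so any header lines before the first ATOM record do not shift the mapping
    serial = int(split[1])
    color = get_color(int(float(split[10])))
    
    view.setStyle({'model': -1, 'serial': serial}, {"cartoon": {'color': color}})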

view.zoomTo()
view.show()

Next the predicted structure of the generated sequence.

In [None]:
filename="BioNeMo_OpenFold_generated.pdb"
# Loading the predicted structure saved in PDB file
with open(filename) as ifile:
    system = "".join([x for x in ifile])

#configuring the structure display
view = py3Dmol.view(width=800, height=800)
view.addModelsAsFrames(system)

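# Iterate over atoms and color each by its pLDDT value (stored in the B-factor column)
for line in system.split("\n"):
    split = line.split()
    if len(split) == 0 or split[0] != "ATOM":
        continue

    # Use the atom serial number from the file rather than the line index,
    # so any header lines before the first ATOM record do not shift the mapping
    serial = int(split[1])
    color = get_color(int(float(split[10])))
    
    view.setStyle({'model': -1, 'serial': serial}, {"cartoon": {'color': color}})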

view.zoomTo()
view.show()

## Conclusion
In this notebook, we've walked through an example workflow generating protein sequences and visualizing their 3D structure, covering:
 - How to configure the BioNeMo Service API
 - Generating protein sequences with the ProtGPT2 model
 - Predicting 3D protein structure of the generated sequences using OpenFold
 
While this notebook demonstrates some of the rich capabilities of the BioNeMo service, this could just be the beginning of a production end-to-end drug discovery pipeline.