<div class="jupyter-biolm-header">
    <img style="float: left; padding-right: 10px; height: 60px" src="https://d31e6ufxekikrt.cloudfront.net/static/ui/images/logo.png">
    <p>
    <br>
    <br>
    <br>
    </p>
</div>

# **GPT2 for Directed or Undirected Antibody Generation**

GPT models are fantastic for Q&A in English, but for protein engineering they currently need finetuning to generate decent molecules. For this example, we finetuned a ProtGPT2 model (pretrained on UniRef) on hundreds of thousands of natural SARS-CoV-2 antibodies from the OAS database. Below, we prompt our GPT2 model to generate a new molecule, not with a question but either with open-ended generation or by including a seed such as `EVQL` to make it more likely that a human heavy-chain antibody is generated.

---

## **Set Your API Token**

In order to use the BioLM API, you need to have a token. You can get one from
the [User API Tokens](https://biolm.ai/ui/accounts/user-api-tokens/) page.

Paste the API token you generated in the cell below, as the value
of the variable `BIOLMAI_TOKEN`.

In [None]:
BIOLMAI_TOKEN = " "  # !!! YOUR API TOKEN HERE !!!

## API Call with Python Requests

We need to make sure we have the Python `requests` module loaded first.

In [None]:
try:
    # Install packages to make API requests in JLite
    import micropip
    await micropip.install('requests')
    await micropip.install('pyodide-http')
    # Patch requests for in-browser support
    import pyodide_http
    pyodide_http.patch_all()
except ModuleNotFoundError:
    pass  # Won't be using micropip outside of JLite

import requests
from IPython.display import JSON  # Helpful UI for JSON display


import pandas as pd
import os, sys
import json
import datetime
import urllib3

In [None]:

def generate_gpt2_cv2_hchain(seed_seq=None):
    """POST a request to generate new antibodies with the GPT2 model fine-tuned on SARS-CoV-2 immune responses."""
    url = "https://biolm.ai/api/v1/models/gpt2_sarscovd_heavy/generate/"

    # An empty seed requests open-ended (undirected) generation
    if not seed_seq:
        seed_seq = ''

    payload = json.dumps({
      "instances": [
        {
          "data": {
            "text": seed_seq
          }
        }
      ]
    })

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Token {BIOLMAI_TOKEN.strip()}",
    }

    response = requests.post(url, headers=headers, data=payload, timeout=480)
    response.raise_for_status()  # Fail early on HTTP errors, e.g. a missing or invalid token

    resp_json = response.json()

    return resp_json['predictions']['generated']

In [None]:
resp = generate_gpt2_cv2_hchain('EVQ')

resp

In [None]:
df = pd.DataFrame(['EVQ' for _ in range(10000)], columns=['seed_seq'])

In [None]:
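def apply_gen_abs(seed_seq):
    """Generate sequences from a seed and keep only short, low-perplexity ones."""
    g = generate_gpt2_cv2_hchain(seed_seq)
    _d = pd.DataFrame.from_dict(g)
    # Keep sequences the model scores as plausible (low perplexity)
    _d = _d.query('perplexity <= 125.0').reset_index(drop=True)
    # Drop sequences longer than 256 characters
    _d = _d.loc[_d.text.str.len() <= 256, :].reset_index(drop=True)
    return _d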

In [None]:
generated_seq_dfs = df.seed_seq.iloc[:2500].apply(apply_gen_abs)  # use parallel_apply from pandarallel for parallel requests

In [None]:
generated_seqs = pd.concat(list(generated_seq_dfs), axis=0)
generated_seqs['len'] = generated_seqs.text.str.len()

In [None]:
generated_seqs.sort_values('perplexity', ascending=True).head(10)

In [None]:
generated_seqs.sample(10)

In [None]:
generated_seqs.to_csv('generated_sars_cov2_ab_seqs.csv', index=False)

In [None]:
generated_seqs.shape

The `perplexity` measure is correlated with similarity to known molecules: the lower the value, the more closely a sequence resembles what the model learned from natural antibodies, and the more likely it is to fold into something real. There are ~9.5k sequences with `perplexity <= 125.0`, which can now be ranked and selected further using _other_ models.
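
To make the perplexity threshold more concrete, here is a minimal sketch of the standard definition of perplexity for an autoregressive language model: the exponential of the negative mean log-probability per token. This notebook does not document how the API computes its `perplexity` field, so treat this as an illustration of the usual formula and not as the service's implementation. The function name `perplexity` and the example log-probabilities are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Standard perplexity: exp of the negative mean natural-log probability per token.

    `token_log_probs` is a hypothetical list of log-probabilities that a
    language model assigned to each token of one generated sequence.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_log_prob = sum(token_log_probs) / len(token_log_probs)
    return math.exp(-mean_log_prob)


# A model that gives every token probability 0.5 is, on average,
# "choosing between 2 options" at each step.
confident = perplexity([math.log(0.5)] * 8)

# A model that spreads probability evenly over 20 options per token
# (e.g. guessing amino acids uniformly) is far less certain.
uncertain = perplexity([math.log(1 / 20)] * 8)

print(round(confident, 3), round(uncertain, 3))
```

Lower values mean the model found each token more predictable. Note that ProtGPT2 tokenizes sequences into multi-residue tokens, so a perplexity value is per token, not per amino acid.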

## Rank with ESM-1v & Other Evaluations

To pull out likely functional sequences, we could also score these with ESM-1v, or any ESM flavor, since those models were trained on functional sequences only. See [In silico Deep Mutational Scan](./3.1_ESM-1v_Deep_Mutational_Scan_Protein.ipynb) for more info.


We could also check how close the low-perplexity generated sequences are to those in the test set, for example by aligning them or calculating Levenshtein distances to test-set antibodies. Numbering the antibodies would let us assess their CDR loops compared to known SARS-CoV-2 antibodies. Many other evaluations are possible, and the choice is up to you.
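
As a rough illustration of the Levenshtein comparison, the sketch below implements the edit distance in pure Python and finds the closest reference for a generated sequence. The helper names `levenshtein` and `nearest_reference`, and the short example fragments, are hypothetical and not taken from the generated data or the test set. For thousands of full-length sequences a compiled library would be much faster.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def nearest_reference(query, references):
    """Return (distance, reference) for the reference closest to `query`."""
    return min((levenshtein(query, ref), ref) for ref in references)


# Hypothetical short fragments, for illustration only
references = ["EVQLVESGGGLVQ", "QVQLVQSGAEVKK"]
distance, closest = nearest_reference("EVQLVQSGGGLVQ", references)
print(distance, closest)
```

A distance of zero to a test-set antibody means the model reproduced a known sequence exactly, while moderate distances suggest novel sequences that still resemble the known repertoire.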

### See more use-cases and APIs on your [BioLM Console Catalog](https://biolm.ai/console/catalog/).
<br>

#### BioLM hosts deep learning models and runs inference at scale. You do the science.
<br>

<table class="jupyter-biolm-header-table" style="width: 100%; border-collapse: collapse; background-color: white; float: left;">
    <tr>
        <td style="text-align: left; vertical-align: middle; background-color: white;">
            <img src="https://d31e6ufxekikrt.cloudfront.net/static/ui/images/console-overview/enzyme_engineering_icon.png"  style="height: 40px; float: left; padding-right: 10px;"> Enzyme Engineering
        </td>
        <td style="text-align: left; vertical-align: middle; background-color: white;">
            <img src="https://d31e6ufxekikrt.cloudfront.net/static/ui/images/console-overview/antibody_engineering_icon.png"  style="height: 40px; float: left; padding-right: 10px;"> Antibody Engineering
        </td>
        <td style="text-align: left; vertical-align: middle; background-color: white;">
            <img src="https://d31e6ufxekikrt.cloudfront.net/static/ui/images/console-overview/biosecurity_icon.png"  style="height: 40px; float: left; padding-right: 10px;"> Biosecurity
        </td>
    </tr>
    <tr>
        <td style="text-align: left; vertical-align: middle; background-color: white;">
            <img src="https://d31e6ufxekikrt.cloudfront.net/static/ui/images/console-overview/single_cell_genomics_icon.png"  style="height: 40px; float: left; padding-right: 10px;"> Single-Cell Genomics
        </td>
        <td style="text-align: left; vertical-align: middle; background-color: white;">
            <img src="https://d31e6ufxekikrt.cloudfront.net/static/ui/images/console-overview/dna_seq_modeling_icon.png"  style="height: 40px; float: left; padding-right: 10px;"> DNA Sequence Modelling
        </td>
        <td style="text-align: left; vertical-align: middle; background-color: white;">
            <img src="https://d31e6ufxekikrt.cloudfront.net/static/ui/images/console-overview/finetuning_icon.png"  style="height: 40px; float: left; padding-right: 10px;"> Finetuning
        </td>
    </tr>
</table>

### [**Contact us**](https://biolm.ai/ui/contact-us/) to learn more.
