# Protein LM Embeddings and Logits as Features

One of the most practical ways to use a large protein language model is feature generation. The internal numeric representations the neural network uses to make predictions can be exported and used for downstream machine learning tasks. Vectors from NLP-style models often encode richer information than simple one-hot encodings. Feature engineering for biology is usually heavily task-specific, but the same embeddings can be used for a variety of classification, regression, and other tasks.
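
For contrast, a one-hot encoding can be sketched in a few lines. This is only an illustration and not part of the BioLM API: the `AMINO_ACIDS` ordering and the `one_hot` helper are assumptions made for this example. Each residue becomes a sparse 20-long indicator row, so the encoding records which amino acid is present but nothing about how residues relate to one another.

```python
# Hypothetical helper for illustration; the alphabet ordering is arbitrary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def one_hot(seq):
    """Return one 20-long indicator row per residue in seq."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for residue in seq:
        row = [0] * len(AMINO_ACIDS)
        row[index[residue]] = 1  # raises KeyError on non-standard residues
        rows.append(row)
    return rows


encoded = one_hot("SKGEELFTGV")
print(len(encoded), len(encoded[0]))  # one row per residue, 20 columns each
```

Unlike the embeddings discussed here, the size of this encoding grows with sequence length.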

On the backend, the process involves passing input sequences into the pre-trained model, which tokenizes each protein and represents it through its neural network layers. Multiple representations of a protein (numeric vectors and/or matrices, such as attention maps) are created. Here we will demo ESM2 via a GPU-backed REST API to quickly transform a sequence into embeddings without installing packages, setting up a GPU, or downloading the model.

In [2]:
import requests, os, json, datetime

import matplotlib.pyplot as plt
import numpy as np  # needed later to inspect the logits array
import pandas as pd
import seaborn as sns
import xgboost as xgb

from glob import glob
from sklearn import model_selection, preprocessing

sns.set_style('white')

In [15]:
# Load sequences and fluorescence data
df = pd.concat([
    pd.read_csv(f) for f in
    glob(os.path.join('.', 'data', '*.csv'))
])

df.drop_duplicates('seq', inplace=True)

df = df.sample(8)

df

Unnamed: 0,seq,label
5005,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,3.571812
669,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,3.698216
18340,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,3.431162
3325,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,3.354492
13588,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,3.524261
22536,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,1.30103
4694,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGRLTLKFI...,1.666271
10346,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,3.337506


Here we have a DataFrame containing sequences and their measured fluorescence. We can use the ESM2 embeddings as features for a quick regression that predicts fluorescence values, but first we need to generate the embeddings.

Let's write a function that takes a sequence and requests its embeddings via REST API. The [ESM2 transform endpoint documentation](https://api.biolm.ai/#ae27a66e-6e2e-4b1f-a540-f426e141d16f) provides examples of the structure of a response.

In [16]:
# First let's write a function to get our API token using username/password

def get_api_token():
    """Get a BioLM API token to use with future API requests.
    
    Copied from https://api.biolm.ai/#d7f87dfd-321f-45ae-99b6-eb203519ddeb.
    """
    url = "https://biolm.ai/api/auth/token/"

    payload = json.dumps({
      "username": os.environ.get("BIOLM_USER"),
      "password": os.environ.get("BIOLM_PASS")
    })
    headers = {
      'Content-Type': 'application/json'
    }

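    response = requests.post(url, headers=headers, data=payload)
    response.raise_for_status()  # Fail early on bad credentials or a server error

    return response.json()  # Dict containing the 'access' and 'refresh' tokens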

In [17]:
api_tok = get_api_token()

os.environ['BIOLM_ACCESS'] = api_tok['access']
os.environ['BIOLM_REFRESH'] = api_tok['refresh']

Great, now we have our `access` and `refresh` tokens.
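
Requests in this notebook send both tokens back as cookies, and every request needs the same headers. One option is to assemble them in a single place. The sketch below is a convenience under stated assumptions: `build_auth_headers` and its `env` parameter are hypothetical names introduced here, and the cookie format simply mirrors the one used by the request functions in this notebook.

```python
import os


def build_auth_headers(env=os.environ):
    """Assemble request headers from the stored access/refresh tokens.

    Hypothetical helper: the name and the `env` parameter are assumptions
    made for this sketch.
    """
    access = env.get("BIOLM_ACCESS")
    refresh = env.get("BIOLM_REFRESH")
    if not access or not refresh:
        raise RuntimeError("BioLM access or refresh token not set")
    return {
        "Cookie": "access={};refresh={}".format(access, refresh),
        "Content-Type": "application/json",
    }


# Usage with placeholder values instead of real tokens
headers = build_auth_headers({"BIOLM_ACCESS": "aaa", "BIOLM_REFRESH": "rrr"})
print(headers["Cookie"])
```

Passing a plain dict for `env` makes the helper easy to check without touching real credentials.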

In [18]:
# Now the function to POST sequences to the GPU-backed API for ESM2 embeddings

def get_embeddings(seq):
    """Make a POST request to get the ESM2 sequence embeddings for a protein.
    
    Returns a list with one entry per POSTed sequence; each entry is a dict
    mapping a layer number to that layer's mean representation.
    """
    url = "https://biolm.ai/api/v1/models/esm2_t33_650M_UR50D/transform/"
    
    # Normally we would POST multiple sequences at once for greater efficiency,
    # but for simplicity's sake we will do one at a time right now
    payload = json.dumps({
      "instances": [
        {
          "data": {
            "text": seq
          }
        }
      ]
    })
    
    # The tokens were stored in the environment by the earlier cell
    access = os.environ.get('BIOLM_ACCESS')
    refresh = os.environ.get('BIOLM_REFRESH')
    if not access or not refresh:
        raise RuntimeError("BioLM access or refresh token not set")
    
    headers = {
      'Cookie': 'access={};refresh={}'.format(access, refresh),
      'Content-Type': 'application/json'
    }

    response = requests.post(url, headers=headers, data=payload)
    response.raise_for_status()  # Surface HTTP errors instead of failing on a missing key

    resp_json = response.json()
    transformations = resp_json['predictions']  # List, containing dicts for each sequence POSTed
    # Each dict has keys ['name', 'mean_representations', 'contacts', 'logits', 'attentions']
    
    return [posted_sequence['mean_representations'] for posted_sequence in transformations]
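
The comment in the function notes that several sequences would normally be POSTed at once. A minimal sketch of building such a request body follows. It assumes the endpoint accepts more than one entry in `instances`, each shaped like the single-sequence case; `build_batch_payload` is a hypothetical name introduced for this example.

```python
import json


def build_batch_payload(seqs):
    """Serialize several sequences into one request body.

    Hypothetical helper; assumes each entry in "instances" has the same
    shape as the single-sequence payload.
    """
    return json.dumps({
        "instances": [{"data": {"text": seq}} for seq in seqs]
    })


payload = build_batch_payload(["SKGEELFTGV", "SKGEELFTGA"])
print(len(json.loads(payload)["instances"]))
```

Because the response holds one entry per POSTed sequence, results from a batched request would be matched back to inputs by position.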

In [19]:
test_protein = df.sample(1).seq.iloc[0]

print("Sequence length: {}\n{}".format(len(test_protein), test_protein))

Sequence length: 237
SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTLSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGINFKEDGNTLGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK


We can POST that sequence:

In [20]:
r = get_embeddings(test_protein)

r

[{'33': [-0.021676046773791313,
   -0.005061550065875053,
   -0.040631186217069626,
   -0.019578689709305763,
   -0.10794085264205933,
   -0.025330042466521263,
   -0.046856146305799484,
   0.01562693901360035,
   0.0714205950498581,
   -0.09475641697645187,
   0.027504898607730865,
   -0.021554259583353996,
   0.04936441034078598,
   0.1443367451429367,
   -0.011476458981633186,
   -0.002493778709322214,
   -0.024328220635652542,
   -0.012090039439499378,
   0.0015981955220922828,
   0.021388372406363487,
   -0.017306819558143616,
   0.04624186083674431,
   -0.034375932067632675,
   0.10438095033168793,
   -0.01787651889026165,
   0.004153604153543711,
   0.010341745801270008,
   0.01049477607011795,
   0.010774469934403896,
   -0.18331146240234375,
   0.019509457051753998,
   0.01878241077065468,
   0.02207043394446373,
   -0.04143424332141876,
   -0.010061749257147312,
   -0.006232182029634714,
   -0.05051258206367493,
   0.03627075254917145,
   -0.030475271865725517,
   -0.05467699

We get back a list of dicts, one per POSTed sequence. Each dictionary contains the mean representations from one or more ESM2 layers. In this case, we get the embeddings from the final hidden layer, `33`.
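
To make the nesting concrete, here is a small sketch that walks a parsed response and pulls out one layer for every sequence. The `extract_layer` helper is a hypothetical name, and the mock response uses made-up names and 3-value vectors purely for illustration; real vectors are far longer.

```python
def extract_layer(resp_json, layer="33"):
    """Pull one layer's mean representation for every POSTed sequence.

    Hypothetical helper; `resp_json` is assumed to look like the parsed
    response described above.
    """
    return [
        prediction["mean_representations"][layer]
        for prediction in resp_json["predictions"]
    ]


# Toy response with made-up values (not real API output)
mock_response = {
    "predictions": [
        {"name": "seq_a", "mean_representations": {"33": [0.1, -0.2, 0.3]}},
        {"name": "seq_b", "mean_representations": {"33": [0.0, 0.5, -0.1]}},
    ]
}

vectors = extract_layer(mock_response)
print(len(vectors), len(vectors[0]))
```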

Let's load this representation and look at its shape.

In [9]:
first_posted_sequence = r[0]

embed_df = pd.DataFrame([first_posted_sequence['33']])

embed_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1270,1271,1272,1273,1274,1275,1276,1277,1278,1279
0,-0.005434,-0.010844,-0.039889,-0.007301,-0.097819,-0.004575,-0.051021,0.002068,0.063417,-0.082937,...,0.144033,0.061956,0.00366,0.091268,-0.020887,0.108992,0.079288,0.013648,-0.014246,-0.018035


We can see that while the original sequence is `237` residues long, the network represents it with a vector of length `1280`. Because these are mean representations, any sequence we submit comes back as a vector of this same size, regardless of its length. This makes downstream ML, especially with other NNs, nice and easy since we don't have to worry about padding.
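
The fixed size can be illustrated with a short sketch. It assumes the mean representations are an average over residue positions, as the `mean_representations` key suggests; whether special tokens are included in that average is not specified here. The per-residue arrays below are random stand-ins, not real ESM2 output.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 1280  # matches the vector length shown above

# Stand-ins for per-residue embeddings of two proteins of different lengths
# (random values, purely illustrative)
short_protein = rng.normal(size=(50, EMBED_DIM))
long_protein = rng.normal(size=(237, EMBED_DIM))


def mean_pool(per_residue):
    """Average over the residue axis, leaving one value per embedding dimension."""
    return per_residue.mean(axis=0)


print(mean_pool(short_protein).shape, mean_pool(long_protein).shape)
```

Both results have the same shape, which is why the vectors can be stacked directly into a feature matrix.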

Let's make a function we can use with `apply()` to run this over every sequence in the DataFrame.

In [21]:
def get_extract_embeddings(seq):
    """Get embeddings from ESM2 via API and extract the final layer of embeddings.
    
    Normally we would do this in one function, but for demonstration purposes
    we build a second function for the apply() here.
    """
    # Only one sequence is POSTed, so take the first result and its layer-33 vector
    return get_embeddings(seq)[0]['33']

Let's test this function:

In [11]:
%%time

embeddings = df.iloc[:8, :].seq.apply(get_extract_embeddings).to_list()

CPU times: user 107 ms, sys: 22.4 ms, total: 130 ms
Wall time: 3.37 s


In [12]:
pd.DataFrame(embeddings)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1270,1271,1272,1273,1274,1275,1276,1277,1278,1279
0,-0.002958,0.004187,-0.045949,0.005272,-0.09929,0.00229,-0.037235,0.025133,0.073217,-0.090025,...,0.131799,0.063513,0.020858,0.100373,-0.018374,0.128063,0.079383,-0.006516,-0.016403,-0.021598
1,-0.008707,-0.004934,-0.044122,-0.009039,-0.099484,-0.028994,-0.045413,0.010935,0.075985,-0.089158,...,0.171022,0.060234,-0.015955,0.087368,-0.010338,0.086776,0.096674,0.03678,-0.017618,-0.010025


Now, instead of getting the embeddings, we could have written a function to retrieve the logits. In fact, they were returned by the same API endpoint - we simply need to use a different key.

In [None]:
def get_logits(seq):
    """Make a POST request to get the ESM2 sequence logits for a protein."""
    url = "https://biolm.ai/api/v1/models/esm2_t33_650M_UR50D/transform/"
    
    # Normally we would POST multiple sequences at once for greater efficiency,
    # but for simplicity's sake we will do one at a time right now
    payload = json.dumps({
      "instances": [
        {
          "data": {
            "text": seq
          }
        }
      ]
    })
    
    # The tokens were stored in the environment by the earlier cell
    access = os.environ.get('BIOLM_ACCESS')
    refresh = os.environ.get('BIOLM_REFRESH')
    if not access or not refresh:
        raise RuntimeError("BioLM access or refresh token not set")
    
    headers = {
      'Cookie': 'access={};refresh={}'.format(access, refresh),
      'Content-Type': 'application/json'
    }

    response = requests.post(url, headers=headers, data=payload)
    response.raise_for_status()  # Surface HTTP errors instead of failing on a missing key

    resp_json = response.json()
    transformations = resp_json['predictions']  # List, containing dicts for each sequence POSTed
    # Each dict has keys ['name', 'mean_representations', 'contacts', 'logits', 'attentions']
    
    return [posted_sequence['logits'] for posted_sequence in transformations]

In [None]:
logits = get_logits(test_protein)

logits = np.array(logits)

print(logits.shape)
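
Logits are unnormalized scores. Assuming the array holds one score per vocabulary token at each sequence position (an assumption; the printed shape is the thing to check), a softmax over the last axis turns them into per-position probabilities. The sketch below uses random toy values rather than real API output, and the `softmax` helper is written here for illustration.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Convert logits to probabilities along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)  # for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum(axis=axis, keepdims=True)


# Toy logits: 4 positions x 5 tokens (random values; real shapes will differ)
rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(4, 5))

probs = softmax(toy_logits)
print(probs.shape)
print(np.allclose(probs.sum(axis=-1), 1.0))
```

Unlike the mean representations, features built from per-position logits grow with sequence length unless they are pooled or the sequences share a common length.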