# Protein LM Embeddings and Logits as Features

One of the most useful things you can do with a large language model is generate features. The internal numeric representations the neural network uses to make predictions can be extracted and used for downstream machine learning tasks. These vectors often encode richer information than simple one-hot encodings. Feature engineering for biology is usually heavily task-specific, but these embeddings can be reused across a variety of classification, regression, and other tasks.
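
For comparison, a one-hot encoding only records which residue sits at each position. A minimal sketch, assuming the 20 standard amino acids as the alphabet and a hypothetical helper name `one_hot_encode`:

```python
import numpy as np

# The 20 standard amino acids; this alphabet is an assumption for illustration.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(seq):
    """Return a (len(seq), 20) matrix with a single 1 per row."""
    encoded = np.zeros((len(seq), len(AMINO_ACIDS)), dtype=np.float32)
    for position, residue in enumerate(seq):
        encoded[position, AA_INDEX[residue]] = 1.0
    return encoded


example = one_hot_encode("SKGEELFTG")
print(example.shape)  # (9, 20)
```

The matrix grows with sequence length, and because every pair of distinct rows is orthogonal it carries no information about which residues are similar to one another.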

On the backend, input sequences are passed to the pre-trained model, which tokenizes them and builds representations of the protein through its neural network layers. Several representations of a protein are created: numeric vectors and/or matrices, such as attention maps. Here we will demo ESM2 via a GPU-backed REST API to quickly transform a sequence into embeddings without installing packages, setting up a GPU, or downloading the model.

In [1]:
import requests, os, json, datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb

from glob import glob
from sklearn import model_selection, preprocessing

sns.set_style('white')

In [2]:
# Load sequences and fluorescence data
df = pd.concat([
    pd.read_csv(f) for f in
    glob(os.path.join('..', 'data', '*.csv'))
])

df.drop_duplicates('seq', inplace=True)

df = df.sample(8)

df

Unnamed: 0,seq,label
4622,SKGEELFTGAVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,3.689527
24179,SKGEELFTGVVPILVELDGDVNGHKLSVSGEGEGDATYGKLTLKFI...,1.300411
13049,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,3.583091
6642,SKGEELFTGVVPILVELDGNVNGHKFSVSGEGEGDATYGKLTLKFI...,1.640382
8701,SKGEELFTGVVPILVEPDGDVNGHKFSVSGEGEGDATYGKLALKFI...,1.300962
6745,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,3.444743
26861,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,3.432607
12353,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,1.301031


Here we have a DataFrame containing sequences and their measured fluorescence. We can use the ESM2 embeddings as features for a quick regression to predict fluorescence values, but first we need to generate the embeddings.

Let's write a function that takes a sequence and requests its embeddings via REST API. The [ESM2 transform endpoint documentation](https://api.biolm.ai/#ae27a66e-6e2e-4b1f-a540-f426e141d16f) provides examples of the structure of a response.

In [3]:
# First let's write a function to get our API token using username/password

def get_api_token():
    """Get a BioLM API token to use with future API requests.
    
    Copied from https://api.biolm.ai/#d7f87dfd-321f-45ae-99b6-eb203519ddeb.
    """
    url = "https://biolm.ai/api/auth/token/"

    payload = json.dumps({
      "username": os.environ.get("BIOLM_USER"),
      "password": os.environ.get("BIOLM_PASS")
    })
    headers = {
      'Content-Type': 'application/json'
    }

    response = requests.request("POST", url, headers=headers, data=payload)

    return response.json()

In [4]:
api_tok = get_api_token()

os.environ['BIOLM_ACCESS'] = api_tok['access']
os.environ['BIOLM_REFRESH'] = api_tok['refresh']

Great, now we have our `access` and `refresh` tokens.

In [5]:
# Now the function to POST sequences to the GPU-backed API for ESM2 embeddings

def get_embeddings(seq):
    """Make a POST request to get the ESM2 sequence embeddings for a protein.
    
    Also measure the wall-clock time of the request (computed but not returned).
    """
    start = datetime.datetime.now()

    url = "https://biolm.ai/api/v1/models/esm2_t33_650M_UR50D/transform/"
    
    # Normally we would POST multiple sequences at once for greater efficiency,
    # but for simplicity's sake we do one at a time here
    payload = json.dumps({
      "instances": [
        {
          "data": {
            "text": seq
          }
        }
      ]
    })
    
    access = os.environ.get('BIOLM_ACCESS')
    refresh = os.environ.get('BIOLM_REFRESH')
    # Explicit check: assert statements are skipped when Python runs with -O
    if not access or not refresh:
        raise RuntimeError("BioLM access or refresh token not set")
    
    headers = {
      'Cookie': 'access={};refresh={}'.format(access, refresh),
      'Content-Type': 'application/json'
    }

    response = requests.request("POST", url, headers=headers, data=payload)
    
    end = datetime.datetime.now()
    clocktime = end - start

    resp_json = response.json()
    transformations = resp_json['predictions']  # List, containing dicts for each sequence POSTed
    # List of dict_keys(['name', 'mean_representations', 'contacts', 'logits', 'attentions'])
    
    return [posted_sequence['mean_representations'] for posted_sequence in transformations]
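
The comment in `get_embeddings` notes that several sequences would normally be sent in one request. A minimal sketch of building such a payload, assuming the endpoint accepts more than one entry in `instances` with the same per-item structure used above (the helper name `build_payload` is hypothetical):

```python
import json


def build_payload(seqs):
    """Serialize a list of sequences into one JSON request body.

    Assumes each item uses the same {"data": {"text": ...}} structure
    as the single-sequence payload.
    """
    return json.dumps({
        "instances": [{"data": {"text": seq}} for seq in seqs]
    })


payload = build_payload(["SKGEELFTG", "SKGEELFTA"])
print(len(json.loads(payload)["instances"]))  # 2
```

Because `get_embeddings` already returns one entry per item in `predictions`, the rest of the function should not need to change. Any limit on how many sequences fit in one request is not covered here and would need to be checked against the API documentation.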

In [6]:
test_protein = df.sample(1).seq.iloc[0]

print("Sequence length: {}\n{}".format(len(test_protein), test_protein))

Sequence length: 237
SKGEELFTGAVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTLSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK


We can POST that sequence:

In [7]:
r = get_embeddings(test_protein)

r

[{'33': [-0.014208842068910599,
   -0.00405506556853652,
   -0.03875024989247322,
   -0.01236201822757721,
   -0.08938455581665039,
   -0.01983371004462242,
   -0.048509057611227036,
   0.013832416385412216,
   0.06103594973683357,
   -0.0881727933883667,
   0.026761744171380997,
   -0.005428537260740995,
   0.047421421855688095,
   0.14044907689094543,
   -0.005264210514724255,
   0.002692201640456915,
   -0.019592441618442535,
   -0.0122526865452528,
   -0.003256772179156542,
   0.018242528662085533,
   -0.004818408749997616,
   0.03771073743700981,
   -0.020654568448662758,
   0.10477283596992493,
   -0.018947657197713852,
   0.00479768356308341,
   0.011153598316013813,
   0.014873011037707329,
   0.019013535231351852,
   -0.16434217989444733,
   0.024637620896100998,
   -0.0017954118084162474,
   0.025675874203443527,
   -0.0402197539806366,
   -0.00865594670176506,
   -0.008322625420987606,
   -0.035510916262865067,
   0.03397607430815697,
   -0.042429130524396896,
   -0.06002685

We get back a list of dicts, one per sequence POSTed. Each dictionary contains the mean representations from one or more ESM2 layers. In this case, we get back the embeddings from the final hidden layer, `33`.

Let's load this representation and look at its shape.

In [8]:
%%time

first_posted_sequence = r[0]

embed_df = pd.DataFrame([first_posted_sequence['33']])

embed_df

CPU times: user 27.2 ms, sys: 2.12 ms, total: 29.3 ms
Wall time: 27.8 ms


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1270,1271,1272,1273,1274,1275,1276,1277,1278,1279
0,-0.014209,-0.004055,-0.03875,-0.012362,-0.089385,-0.019834,-0.048509,0.013832,0.061036,-0.088173,...,0.137147,0.050349,-0.001072,0.080422,-0.007383,0.105817,0.086003,0.023727,-0.019055,-0.027628


We can see that while the original sequence is `237` residues long, the NN represents it with a vector of length `1280`. So any time we request an embedding for a sequence, we get back a representation of the same size as that of any other sequence, regardless of sequence length. This makes downstream ML, especially with other NNs, nice and easy since we don't have to worry about padding.
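
The fixed size is consistent with averaging per-residue vectors over the sequence, which is what the key name `mean_representations` suggests. A small sketch with random stand-in data (not real ESM2 output), assuming one 1280-dimensional vector per residue:

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 1280  # width of the vectors returned above


def mean_pool(per_residue):
    """Average a (sequence_length, EMBED_DIM) matrix over its residues."""
    return per_residue.mean(axis=0)


# Random stand-ins for per-residue representations of two proteins
short_protein = rng.normal(size=(120, EMBED_DIM))
long_protein = rng.normal(size=(237, EMBED_DIM))

print(mean_pool(short_protein).shape)  # (1280,)
print(mean_pool(long_protein).shape)   # (1280,)
```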

Let's write a function we can pass to `apply()` to run this conveniently on every sequence in the DataFrame.

In [9]:
def get_extract_embeddings(seq):
    """Get embeddings from ESM2 via API and extract the final layer of embeddings.
    
    Normally would do this in one function, but for demonstration purposes,
    will build another function for the apply() here."""
    return [s['33'] for s in get_embeddings(seq)][0]

Let's test this function:

In [10]:
%%time

embeddings = df.iloc[:8, :].seq.apply(get_extract_embeddings).to_list()

CPU times: user 435 ms, sys: 96 ms, total: 531 ms
Wall time: 20.5 s
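
Most of the 20.5 s wall time is spent waiting on the network rather than on the CPU (531 ms), so the requests could overlap. A sketch using a thread pool, where `fake_get_embeddings` is a hypothetical stand-in that sleeps instead of calling the API; in the notebook one would pass `get_extract_embeddings` instead, assuming the service permits concurrent requests:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_get_embeddings(seq):
    """Stand-in for a network call: wait briefly, return a dummy vector."""
    time.sleep(0.1)
    return [float(len(seq))]


def embed_all(seqs, fetch, max_workers=4):
    """Call fetch on every sequence concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, seqs))


seqs = ["SKGEELFTG", "SKGEEL", "SKG", "SKGEELFTGAVP"]
results = embed_all(seqs, fake_get_embeddings)
print(results)  # [[9.0], [6.0], [3.0], [12.0]]
```

`pool.map` returns results in the order of the inputs, so the output lines up with the DataFrame rows even when requests finish out of order.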


In [11]:
pd.DataFrame(embeddings)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1270,1271,1272,1273,1274,1275,1276,1277,1278,1279
0,-0.014209,-0.004055,-0.03875,-0.012362,-0.089385,-0.019834,-0.048509,0.013832,0.061036,-0.088173,...,0.137147,0.050349,-0.001072,0.080422,-0.007383,0.105817,0.086003,0.023727,-0.019055,-0.027628
1,-0.002707,0.005955,-0.037064,-0.003509,-0.075809,-0.00088,-0.048482,0.007677,0.050569,-0.083957,...,0.12608,0.04101,0.007241,0.081618,-0.009309,0.110702,0.068603,0.004002,-0.001715,-0.023831
2,-0.021107,-0.008251,-0.054411,-0.005222,-0.105886,-0.014171,-0.030704,0.015839,0.063732,-0.086206,...,0.140083,0.055863,0.019336,0.089056,-0.009605,0.132365,0.089566,0.012116,-0.022976,-0.022603
3,0.000965,-0.007878,-0.027652,0.006491,-0.090655,-0.004497,-0.034889,0.010753,0.054625,-0.082407,...,0.123567,0.051401,0.0088,0.079291,-0.003876,0.118328,0.069559,0.025117,-0.021044,-0.011057
4,0.001983,-0.00272,-0.042784,0.000732,-0.074743,-0.012826,-0.038881,0.018679,0.067692,-0.075997,...,0.127118,0.047605,0.011307,0.07735,-0.007545,0.112219,0.071307,0.006672,-0.001638,-0.016977
5,-0.005035,-0.010702,-0.029447,0.002523,-0.098376,-0.012795,-0.037628,0.021798,0.06938,-0.079252,...,0.123336,0.060984,0.004788,0.076512,-0.004122,0.132009,0.075912,0.025371,-0.030921,-0.016529
6,-0.001926,-0.011603,-0.049866,-0.005355,-0.087965,-0.000569,-0.048681,0.010982,0.061695,-0.07822,...,0.134983,0.054642,0.017741,0.093205,-0.019374,0.125468,0.069507,0.007305,0.010007,-0.02115
7,-0.010767,-0.016367,-0.058742,-0.004007,-0.089008,-0.003591,-0.036587,0.01603,0.067068,-0.08583,...,0.139209,0.069656,0.018424,0.106277,-0.023594,0.117928,0.082645,0.007598,-0.015153,-0.02992
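
With embeddings collected into a matrix like this, the regression mentioned earlier is a matter of fitting a model on the rows. A minimal sketch using closed-form ridge regression in NumPy as a simple stand-in (the notebook imports `xgboost`, which is not used here), on synthetic features and labels rather than the fluorescence data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 "embeddings" of width 32 and a noisy linear label
X = rng.normal(size=(200, 32))
true_weights = rng.normal(size=32)
y = X @ true_weights + rng.normal(scale=0.1, size=200)

# Hold out the last 50 rows for evaluation
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]


def fit_ridge(features, targets, alpha=1.0):
    """Solve (X^T X + alpha * I) w = X^T y for the weight vector w."""
    n_features = features.shape[1]
    gram = features.T @ features + alpha * np.eye(n_features)
    return np.linalg.solve(gram, features.T @ targets)


weights = fit_ridge(X_train, y_train)
predictions = X_test @ weights

# Coefficient of determination on the held-out rows
ss_res = np.sum((y_test - predictions) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print(round(r_squared, 3))
```

With 1280 real features and only a handful of sequences, the regularization strength and the train/test split would need far more care than in this toy setup.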


Instead of the embeddings, we could have retrieved the logits. In fact, they are returned by the same API endpoint; we simply need to use a different key.

In [12]:
def get_logits(seq):
    """Make a POST request to get the ESM2 sequence logits for a protein."""
    url = "https://biolm.ai/api/v1/models/esm2_t33_650M_UR50D/transform/"
    
    # Normally we would POST multiple sequences at once for greater efficiency,
    # but for simplicity's sake we do one at a time here
    payload = json.dumps({
      "instances": [
        {
          "data": {
            "text": seq
          }
        }
      ]
    })
    
    access = os.environ.get('BIOLM_ACCESS')
    refresh = os.environ.get('BIOLM_REFRESH')
    # Explicit check: assert statements are skipped when Python runs with -O
    if not access or not refresh:
        raise RuntimeError("BioLM access or refresh token not set")
    
    headers = {
      'Cookie': 'access={};refresh={}'.format(access, refresh),
      'Content-Type': 'application/json'
    }

    response = requests.request("POST", url, headers=headers, data=payload)

    resp_json = response.json()
    transformations = resp_json['predictions']  # List, containing dicts for each sequence POSTed
    # List of dict_keys(['name', 'mean_representations', 'contacts', 'logits', 'attentions'])
    
    return [posted_sequence['logits'] for posted_sequence in transformations]
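
Since `get_embeddings` and `get_logits` differ only in the key they read from each prediction, the shared step can be factored out. A sketch with a hypothetical helper `extract_field`, run on a hand-written dictionary shaped like the response described in the comments above (the values are made up):

```python
def extract_field(resp_json, key):
    """Return the chosen field from every prediction in a response dict."""
    return [prediction[key] for prediction in resp_json["predictions"]]


# Made-up response with the same top-level structure as the API response
mock_response = {
    "predictions": [
        {
            "name": "0",
            "mean_representations": {"33": [0.1, 0.2]},
            "logits": [-5.3, -5.7],
        }
    ]
}

print(extract_field(mock_response, "logits"))  # [[-5.3, -5.7]]
print(extract_field(mock_response, "mean_representations"))  # [{'33': [0.1, 0.2]}]
```

A single request function could then return the parsed JSON once, and each caller would pick the field it needs without POSTing the sequence a second time.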

In [13]:
%%time

logits = get_logits(test_protein)

logits = np.array(logits)

print(logits.shape)

(1, 237)


In [15]:
pd.DataFrame(logits)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,227,228,229,230,231,232,233,234,235,236
0,-5.301396,-5.718361,-5.884704,-5.702101,-5.930498,-5.855449,-5.755562,-5.787621,-5.91846,-5.492683,...,-5.602943,-6.067883,-5.656158,-6.017365,-5.683462,-5.850456,-5.824902,-5.688544,-5.453043,-5.436296


One could also use the sum or mean as a representation of the sequence:

In [16]:
logits.sum()

-1392.3157186508179

In [18]:
logits.mean()

-5.87474986772497
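
The two summaries differ only by the sequence length: the sum above equals the mean times 237. A short check with made-up values, which also shows why the mean may be the safer choice when sequences differ in length:

```python
import numpy as np

# Made-up per-position values for a short and a long sequence
short_logits = np.full(100, -5.9)
long_logits = np.full(237, -5.9)

# The sum scales with length, the mean does not
print(np.isclose(short_logits.sum(), short_logits.mean() * 100))  # True
print(short_logits.sum() > long_logits.sum())  # True
print(np.isclose(short_logits.mean(), long_logits.mean()))  # True
```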