# Protein LM Embeddings and Logits as Features

A powerful way to use a large language model is feature generation. The internal numeric representations that the neural network uses to make predictions can be extracted and used for downstream machine learning tasks. These vectors often encode richer information than simple one-hot encodings. Feature engineering for biology is usually heavily task-specific, but the same embeddings can serve a variety of classification, regression, and other tasks.
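
For contrast, here is a minimal sketch of the one-hot baseline mentioned above. The alphabet ordering and the helper name `one_hot_encode` are assumptions made for illustration. Note that the output grows with sequence length and carries no information beyond residue identity.

```python
import numpy as np

# The 20 standard amino acids in a fixed order (the order itself is an
# arbitrary choice for this sketch; any consistent ordering works).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(seq):
    """Return a (len(seq), 20) matrix with a single 1 per residue."""
    encoded = np.zeros((len(seq), len(AMINO_ACIDS)), dtype=np.float32)
    for position, residue in enumerate(seq):
        encoded[position, AA_INDEX[residue]] = 1.0
    return encoded

# Prefixes of the sequences in this notebook's dataset
short = one_hot_encode("SKGEELFTGV")
longer = one_hot_encode("SKGEELFTGVVPILVELDGD")
print(short.shape, longer.shape)  # (10, 20) (20, 20)
```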

Behind the scenes, each input sequence is tokenized and passed through the layers of the pre-trained model. This yields multiple representations of a protein: numeric vectors and/or matrices, such as attention maps. Here we will demo ESM2 via a GPU-backed REST API to quickly transform sequences into embeddings without installing packages, setting up a GPU, or downloading the model.

In [1]:
# Initialize pandarallel first so we can multiprocess the DataFrame
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True, nb_workers=12)

INFO: Pandarallel will run on 12 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


In [2]:
import requests, os, json, datetime

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import xgboost as xgb

from glob import glob
from sklearn import model_selection, preprocessing

sns.set_style('white')

In [3]:
# Load sequences and fluorescence data
df = pd.concat([
    pd.read_csv(f) for f in
    glob(os.path.join('.', 'data', '*.csv'))
])

df.drop_duplicates('seq', inplace=True)

df.sample(8)

Unnamed: 0,seq,label
2189,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,3.625562
11957,SKGEELFTGVVPILVELDGDVNGHEFSVSGEGEGDATYGKLTLKFI...,3.582321
10010,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,3.643653
22617,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTRKFI...,1.301031
2225,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,3.218823
21432,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGDGDATYGKLTLKFI...,1.302339
18219,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,3.575229
13416,SKGEELFTGVVPILVELDGDVNGRKFSVSGEGEGDATYGKLTLKFI...,1.301031


Here we have a DataFrame containing sequences and their measured fluorescence. We can use the ESM2 embeddings as features to perform a quick regression to predict fluorescence values.

Let's write a function that takes a sequence and requests its embeddings via REST API. The [ESM2 transform endpoint documentation](https://api.biolm.ai/#ae27a66e-6e2e-4b1f-a540-f426e141d16f) provides examples of the structure of a response.

In [4]:
# First let's write a function to get our API token using username/password

def get_api_token():
    """Get a BioLM API token to use with future API requests.
    
    Copied from https://api.biolm.ai/#d7f87dfd-321f-45ae-99b6-eb203519ddeb.
    """
    url = "https://biolm.ai/api/auth/token/"

    payload = json.dumps({
      "username": os.environ.get("BIOLM_USER"),
      "password": os.environ.get("BIOLM_PASS")
    })
    headers = {
      'Content-Type': 'application/json'
    }

    response = requests.request("POST", url, headers=headers, data=payload)

    return response.json()

In [5]:
api_tok = get_api_token()

os.environ['BIOLM_ACCESS'] = api_tok['access']
os.environ['BIOLM_REFRESH'] = api_tok['refresh']

Great, now we have our `access` and `refresh` tokens.

In [6]:
# Now the function to POST sequences to the GPU-backed API for ESM2 embeddings

def get_embeddings(seq):
    """Make a POST request to get the ESM2 sequence embeddings for a protein.
    
    The wall-clock time of the request is measured but not returned.
    """
    start = datetime.datetime.now()

    url = "https://biolm.ai/api/v1/models/esm2_t33_650M_UR50D/transform/"
    
    # Normally we would POST multiple sequences at once for greater efficiency,
    # but for simplicity's sake we do one at a time here
    payload = json.dumps({
      "instances": [
        {
          "data": {
            "text": seq
          }
        }
      ]
    })
    
    # Fail early with a clear message if either token is missing
    access = os.environ.get('BIOLM_ACCESS')
    refresh = os.environ.get('BIOLM_REFRESH')
    if not access or not refresh:
        raise RuntimeError("BioLM access or refresh token not set")
    
    headers = {
      'Cookie': 'access={};refresh={}'.format(access, refresh),
      'Content-Type': 'application/json'
    }

    response = requests.request("POST", url, headers=headers, data=payload)
    
    end = datetime.datetime.now()
    clocktime = end - start

    resp_json = response.json()
    transformations = resp_json['predictions']  # List, containing dicts for each sequence POSTed
    # List of dict_keys(['name', 'mean_representations', 'contacts', 'logits', 'attentions'])
    
    return [posted_sequence['mean_representations'] for posted_sequence in transformations]
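
The comment in the function above notes that several sequences would normally be sent per request. A minimal sketch of how such a batched payload might be built follows. The helpers `build_batch_payload` and `chunked` are hypothetical, the sequences are toy strings, and any limit on how many instances the endpoint accepts per request is not covered here and should be checked against the API documentation.

```python
import json

def build_batch_payload(seqs):
    """Serialize several sequences into one request body.

    Mirrors the single-sequence payload used above, with one
    "instances" entry per sequence. Hypothetical helper.
    """
    return json.dumps({
        "instances": [{"data": {"text": seq}} for seq in seqs]
    })

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Toy sequences for illustration only
seqs = ["SKGEELFTGV", "SKGEELFTGA", "SKGEELFTGC"]
for batch in chunked(seqs, 2):
    print(build_batch_payload(batch))
```

Each payload could then be POSTed in place of the single-sequence `payload`, and the returned list would hold one dict per sequence in the batch.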

In [7]:
test_protein = df.sample(1).seq.iloc[0]

print("Sequence length: {}\n{}".format(len(test_protein), test_protein))

Sequence length: 237
SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKYICTTGKLPVPWPTLVTTLSYGVQCFSRYPDLMKQHDSFKSAMPEGYVQERTIFFKDDGNYKTRVEVEFEGDTLVDRIELMGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIENGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQTALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK


We can POST that sequence:

In [8]:
r = get_embeddings(test_protein)

r

[{'33': [-0.005434439517557621,
   -0.010844455100595951,
   -0.039888832718133926,
   -0.007300889119505882,
   -0.09781938046216965,
   -0.0045746369287371635,
   -0.051020555198192596,
   0.002067751716822386,
   0.06341726332902908,
   -0.0829365998506546,
   0.016918059438467026,
   0.002198295434936881,
   0.06036660820245743,
   0.12003503739833832,
   -0.0007652859203517437,
   0.020819995552301407,
   -0.01137372013181448,
   -0.012205617502331734,
   -0.009111090563237667,
   -0.014638413675129414,
   0.007059546187520027,
   0.026472367346286774,
   -0.019654259085655212,
   0.1188235878944397,
   -0.012040707282721996,
   0.008144490420818329,
   -0.0002720437478274107,
   0.028684960678219795,
   0.015224963426589966,
   -0.1702953726053238,
   0.04189925640821457,
   -0.01684240996837616,
   0.022443082183599472,
   -0.05151586979627609,
   -0.00023268169024959207,
   -0.014562771655619144,
   -0.03699558973312378,
   0.009846617467701435,
   -0.021281374618411064,
   -0.

We get back a list of dicts, one per posted sequence. Each dictionary contains the mean representations from one or more ESM2 layers, keyed by layer number. In this case, we get the embeddings from the final hidden layer, `33`.
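
To make that structure concrete, here is a small sketch that stacks one layer from such a list into a 2D feature matrix. The helper `stack_layer` is hypothetical, and the mock values are made up and far shorter than real embeddings.

```python
import numpy as np

def stack_layer(mean_representations, layer="33"):
    """Stack one layer's vectors from a list of {layer: vector} dicts."""
    return np.array([rep[layer] for rep in mean_representations])

# Toy stand-in for the list returned by get_embeddings(), shortened to
# 4 values per sequence (made-up numbers).
mock = [
    {"33": [0.1, -0.2, 0.3, 0.0]},
    {"33": [0.0, 0.5, -0.1, 0.2]},
]
features = stack_layer(mock)
print(features.shape)  # (2, 4)
```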

Let's load this representation and look at its shape.

In [9]:
first_posted_sequence = r[0]

embed_df = pd.DataFrame([first_posted_sequence['33']])

embed_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1270,1271,1272,1273,1274,1275,1276,1277,1278,1279
0,-0.005434,-0.010844,-0.039889,-0.007301,-0.097819,-0.004575,-0.051021,0.002068,0.063417,-0.082937,...,0.144033,0.061956,0.00366,0.091268,-0.020887,0.108992,0.079288,0.013648,-0.014246,-0.018035


We can see that while the original sequence has `237` residues, the model represents it with a vector of length `1280`. Whenever we request an embedding, we get back a vector of the same size regardless of the sequence's length. This makes downstream ML, especially with other NNs, nice and easy, since we don't have to worry about padding.
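
The fixed size follows from averaging. Assuming, as the name `mean_representations` suggests, that the per-residue embeddings are averaged over sequence positions, the length dimension disappears. The sketch below uses random arrays as stand-ins for real per-residue embeddings.

```python
import numpy as np

EMBED_DIM = 1280  # width of the vectors returned above

rng = np.random.default_rng(0)

# Random stand-ins for per-residue embeddings of two proteins with
# different lengths (real values would come from the model).
per_residue_a = rng.normal(size=(237, EMBED_DIM))
per_residue_b = rng.normal(size=(120, EMBED_DIM))

# Averaging over the residue axis removes the length dimension.
mean_a = per_residue_a.mean(axis=0)
mean_b = per_residue_b.mean(axis=0)

print(mean_a.shape, mean_b.shape)  # (1280,) (1280,)
```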

We can now use `pandarallel` to make multiple requests in parallel. One of the advantages of a GPU-backed REST API is that you can achieve parallelization by using local CPU workers to make multiple REST requests at once. So let's get the embeddings for *all* the sequences we want to train and test on.
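
As an aside, the same fan-out pattern can be sketched with only the standard library. This is an alternative to `pandarallel`, not what this notebook uses, and `fake_get_embeddings` is a stand-in that sleeps instead of calling the API.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_get_embeddings(seq):
    """Stand-in for a network call: sleeps briefly, returns a dummy vector."""
    time.sleep(0.05)  # pretend to wait on the server
    return [float(len(seq))]

seqs = ["SKGEELFTGV", "SKGEEL", "SKG", "SKGEELFT"]

# Threads spend most of their time waiting on I/O, so several requests
# can be in flight at once. map() returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_get_embeddings, seqs))

print(results)  # [[10.0], [6.0], [3.0], [8.0]]
```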

In [10]:
def get_extract_embeddings(seq):
    """Get embeddings from ESM2 via API and extract the final layer of embeddings.
    
    Normally would do this in one function, but for demonstration purposes,
    will build another function for the apply() here."""
    return [s['33'] for s in get_embeddings(seq)][0]

Quickly test this function:

In [11]:
%%time

embeddings = df.iloc[:2, :].seq.apply(get_extract_embeddings).to_list()

CPU times: user 107 ms, sys: 22.4 ms, total: 130 ms
Wall time: 3.37 s


In [12]:
pd.DataFrame(embeddings)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1270,1271,1272,1273,1274,1275,1276,1277,1278,1279
0,-0.002958,0.004187,-0.045949,0.005272,-0.09929,0.00229,-0.037235,0.025133,0.073217,-0.090025,...,0.131799,0.063513,0.020858,0.100373,-0.018374,0.128063,0.079383,-0.006516,-0.016403,-0.021598
1,-0.008707,-0.004934,-0.044122,-0.009039,-0.099484,-0.028994,-0.045413,0.010935,0.075985,-0.089158,...,0.171022,0.060234,-0.015955,0.087368,-0.010338,0.086776,0.096674,0.03678,-0.017618,-0.010025


Now that it works, run it on a larger set of `500` sequences. When BioLM receives these multiple requests in parallel, it will begin scaling up and loading the model on multiple GPUs for parallelization.

In [13]:
%%time

embeddings_1 = df.iloc[0:500, :].seq.parallel_apply(get_extract_embeddings).to_list()


CPU times: user 855 ms, sys: 364 ms, total: 1.22 s
Wall time: 6min 54s


Multiple servers should be hot now, so let's see the time to get the next `500` embeddings. It should be significantly faster.

In [None]:
%%time

embeddings_2 = df.iloc[500:1000, :].seq.parallel_apply(get_extract_embeddings).to_list()

With that demonstration done, let's do the remainder.

In [None]:
%%time

embeddings_3 = df.iloc[1000:, :].seq.parallel_apply(get_extract_embeddings).to_list()

In [None]:
embeddings = pd.concat([
    pd.DataFrame(embeddings_1),
    pd.DataFrame(embeddings_2),
    pd.DataFrame(embeddings_3),
], axis=0)

In [None]:
embeddings.sample(3)

## Modeling with XGBoost

Tree-based methods do not require scaling or other transformations of the features, so let's use one to quickly build a regression model from these ESM2 embeddings.

In [None]:
train_x, test_x, train_y, test_y = model_selection.train_test_split(
    embeddings,
    df.label,
    test_size=0.2,
    random_state=54
)

print("X Train size: {}\nX Test size: {}".format(train_x.shape, test_x.shape))
print("Y Train size: {}\nY Test size: {}".format(train_y.shape, test_y.shape))

In [None]:
# Set up cross-validation modeling objective
data_dmatrix = xgb.DMatrix(data=train_x, label=train_y)

params = {
    # "reg:linear" is the deprecated name for squared-error regression
    "objective": "reg:squarederror",
    'colsample_bytree': 0.40,
    'learning_rate': 0.1,
    'max_depth': 100,
    'alpha': 10,
}

# Run CV
cv_results = xgb.cv(
    dtrain=data_dmatrix,
    params=params,
    nfold=3,
    num_boost_round=100,
    early_stopping_rounds=10,
    metrics="rmse",
    as_pandas=True,
    seed=123
)

We can see where the performance started...

In [None]:
cv_results.head()

With early stopping, the returned history ends at the best iteration, so the best results are at the tail end.

In [None]:
cv_results.tail(10)

In [None]:
# Mean RMSE across the validation folds at the final boosting round
print((cv_results["test-rmse-mean"]).tail(1))

In [None]:
# Standard deviation of the RMSE across the validation folds
print((cv_results["test-rmse-std"]).tail(1))

We can get context for these values by looking at the `Y` values that were used in this cross-validation experiment.

In [None]:
plt.figure(figsize=(6, 5))

train_y.hist()
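
Alongside the histogram, a simple numeric reference point is the RMSE of a model that always predicts the mean label, which equals the population standard deviation of the labels. The sketch below uses a hypothetical helper, `baseline_rmse`, and applies it only to the eight labels printed by `df.sample(8)` earlier. The real comparison would use `train_y`.

```python
import numpy as np

def baseline_rmse(y):
    """RMSE of always predicting the mean of y (its population std)."""
    y = np.asarray(y, dtype=float)
    return float(np.sqrt(np.mean((y - y.mean()) ** 2)))

# The eight labels from the df.sample(8) output shown earlier
labels = [3.625562, 3.582321, 3.643653, 1.301031,
          3.218823, 1.302339, 3.575229, 1.301031]
print(round(baseline_rmse(labels), 3))
```

A cross-validated RMSE well below this baseline would suggest that the embeddings carry signal about fluorescence.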

In [None]:
# hyperopt
# graphviz

Now we can train a model with the cross-validated parameters, using our full training dataset.

In [None]:
xg_reg = xgb.train(params=params, dtrain=data_dmatrix, num_boost_round=80)
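
Rather than hard-coding the number of boosting rounds, one option is to derive it from the cross-validation history. The sketch below is a hypothetical helper, `best_num_rounds`, run on a made-up RMSE curve. It accepts any sequence of per-round mean RMSE values.

```python
def best_num_rounds(test_rmse_means):
    """Return the 1-based boosting round with the lowest mean RMSE."""
    values = list(test_rmse_means)
    best_index = min(range(len(values)), key=values.__getitem__)
    return best_index + 1

# Made-up RMSE curve for illustration: improves, then starts to overfit.
toy_curve = [1.20, 0.95, 0.80, 0.74, 0.76, 0.79]
print(best_num_rounds(toy_curve))  # 4
```

In this notebook it could be called as `best_num_rounds(cv_results["test-rmse-mean"])` and the result passed as `num_boost_round`.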

Let's compare the predicted values against our test set to see how well the model did using only the sequence-embedding features.

In [None]:
y_pred = xg_reg.predict(xgb.DMatrix(data=test_x, label=test_y))

In [None]:
sns.regplot(x=test_y, y=y_pred)
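
The scatter plot can be complemented with numeric scores. Below is a minimal sketch of RMSE and the coefficient of determination, written with NumPy only. The helper names and toy values are made up for illustration. In this notebook the functions would be called on `test_y` and `y_pred`.

```python
import numpy as np

def rmse(y_true, y_hat):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))

def r_squared(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up observed and predicted values
toy_true = [3.6, 3.5, 1.3, 3.2, 1.3]
toy_pred = [3.4, 3.6, 1.6, 3.0, 1.5]
print(rmse(toy_true, toy_pred), r_squared(toy_true, toy_pred))
```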

Lastly, we can attempt to inspect the model and learn a bit more about it, its fit, and our data.

In [None]:
# Set the figure size before plotting so that it takes effect
plt.rcParams['figure.figsize'] = [10, 8]
xgb.plot_importance(xg_reg, max_num_features=20, grid=False)
plt.show()

In [None]:
# Set the figure size before plotting; plot_tree requires graphviz
plt.rcParams['figure.figsize'] = [50, 10]
xgb.plot_tree(xg_reg, num_trees=0)
plt.show()