If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers as well as some other libraries. Uncomment the following cell and run it.

In [1]:
import os
os.getcwd()

'/home/evlasova/hack'

If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to create an authentication token on the Hugging Face website (sign up here if you haven't already!), then execute the following cell and enter your token:

In [2]:
# import os
# os.environ["HUGGING_FACE_HUB_TOKEN"] = "<your-token>"  # never commit a real token

In [3]:
from huggingface_hub import login

login()  # prompts for your access token; avoid hard-coding it in a shared notebook

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/evlasova/.cache/huggingface/token
Login successful


We also quickly upload some telemetry - this tells us which examples and software versions are getting used so we know where to prioritize our maintenance efforts. We don't collect (or care about) any personally identifiable information, but if you'd prefer not to be counted, feel free to skip this step or delete this cell entirely.

# Fine-Tuning Protein Language Models

In this notebook, we're going to do some transfer learning to fine-tune some large, pre-trained protein language models on tasks of interest. If that sentence feels a bit intimidating to you, don't panic - there's [a blog post](https://huggingface.co/blog/deep-learning-with-proteins) that explains the concepts here in much more detail.

The specific model we're going to use is ESM-2, which is the state-of-the-art protein language model at the time of writing (November 2022). The citation for this model is [Lin et al, 2022](https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1).

There are several ESM-2 checkpoints with differing model sizes. Larger models will generally have better accuracy, but they require more GPU memory and will take much longer to train. The available ESM-2 checkpoints (at time of writing) are:

| Checkpoint name | Num layers | Num parameters |
|------------------------------|----|----------|
| `esm2_t48_15B_UR50D`         | 48 | 15B     |
| `esm2_t36_3B_UR50D`          | 36 | 3B      |
| `esm2_t33_650M_UR50D`        | 33 | 650M    |
| `esm2_t30_150M_UR50D`        | 30 | 150M    |
| `esm2_t12_35M_UR50D`         | 12 | 35M     |
| `esm2_t6_8M_UR50D`           | 6  | 8M      |

Note that the larger checkpoints may be very difficult to train without a large cloud GPU like an A100 or H100, and the largest 15B parameter checkpoint will probably be impossible to train on **any** single GPU! Also, note that memory usage for attention during training will scale as `O(batch_size * num_layers * seq_len^2)`, so larger models on long sequences will use quite a lot of memory! We will use the `esm2_t12_35M_UR50D` checkpoint for this notebook, which should train on any Colab instance or modern GPU.
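To get a feel for that scaling rule, here is a small sketch that compares configurations using the same `batch_size * num_layers * seq_len^2` expression. The function name and the sequence lengths are made up for illustration, and the result is a unitless proxy, not a memory estimate in bytes.

```python
def relative_attention_cost(batch_size: int, num_layers: int, seq_len: int) -> int:
    """Unitless proxy for attention memory: batch_size * num_layers * seq_len^2."""
    return batch_size * num_layers * seq_len ** 2


# Layer counts come from the checkpoint table above; seq_len values are hypothetical.
small = relative_attention_cost(batch_size=8, num_layers=12, seq_len=512)
large = relative_attention_cost(batch_size=8, num_layers=33, seq_len=512)
longer = relative_attention_cost(batch_size=8, num_layers=12, seq_len=1024)

print(large / small)   # 2.75: going from 12 to 33 layers
print(longer / small)  # 4.0: doubling the sequence length
```

Doubling the sequence length costs more than nearly tripling the number of layers, which is why long proteins are usually the first thing to run out of memory.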

In [4]:
model_checkpoint = "facebook/esm2_t12_35M_UR50D"

***
# Token classification

Another common language model task is **token classification**. In this task, instead of classifying the whole sequence into a single category, we categorize each token (amino acid, in this case!) into one or more categories. This kind of model could be useful for:

- Predicting secondary structure
- Predicting buried vs. exposed residues
- Predicting residues that will receive post-translational modifications
- Predicting residues involved in binding pockets or active sites
- Probably several other things, it's been a while since I was a postdoc

## Data preparation

In this section, we're going to gather some training data from UniProt. As in the sequence classification example, we aim to create two lists: `sequences` and `labels`. Unlike in that example, however, the `labels` are more than just single integers. Instead, the label for each sample will be **one integer per token in the input**. This should make sense - when we do token classification, different tokens in the input may have different categories!
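As a quick illustration of that layout, here is a toy example. The sequences and label values are made up and do not come from the datasets used below.

```python
# Hypothetical toy data: two short sequences with made-up per-residue labels.
sequences = ["MKTAYIAK", "GAVL"]
labels = [
    [0, 0, 1, 1, 1, 2, 2, 0],  # one integer per residue of "MKTAYIAK"
    [2, 2, 0, 0],              # one integer per residue of "GAVL"
]

# Every residue needs exactly one label, so the lengths must match pairwise.
for seq, lab in zip(sequences, labels):
    assert len(seq) == len(lab), "each residue needs exactly one label"
```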

To demonstrate token classification, we're going to go back to UniProt and get some data on protein secondary structures. As above, this will probably be the main section you want to change when adapting this code to your own problems.

Let's load the data, define a helper function, and build our lists of sequences and labels:

In [5]:
import pandas as pd
import numpy as np

In [6]:
train = pd.read_csv('data/HPrank/train.csv')
train[330:332]

Unnamed: 0,hydrophobic_patch,hp_class,sequence,hp_rank
330,[ 0. 0. 0. 0. 0. 0. 0. 0. 0. ...,[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...,GATAGAVSARAAEQQRLQRIVDAVARQEPRISWAAGLRDDGTTTLL...,[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
331,[0. 0. 0. ... 0. 0. 0.],[0 0 0 ... 0 0 0],GVVQSVNVSQAGYSSNDFKTATVTASDKLSDTSYQILQGTTVIATG...,[0 0 0 ... 0 0 0]


In [7]:
def has_no_literals(inputStr):
    return not any((not char.isdigit() and char not in [' ', ']', '[', '\n']) for char in inputStr)

has_no_literals('''[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 1 0 1 1 0 1 0 1 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 0]''')

has_no_literals('''[0 0 0 ... 0 0 0]''')

False
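The check above returns `False` for the second string because it contains `...`: that row's labels were abbreviated when the array was written out as text, so the full label list cannot be recovered. A minimal sketch of the parsing step shows why such rows have to be filtered first. `parse_label_string` is a hypothetical helper, not part of the notebook.

```python
def parse_label_string(text: str) -> list[int]:
    """Turn a bracketed, whitespace-separated string of digits into a list of ints."""
    # Strip the surrounding brackets, then split on any whitespace
    # (single spaces, or the newlines that appear when a long array is wrapped).
    return [int(tok) for tok in text.strip()[1:-1].split()]


full = "[0 0 1 0\n 1 1 0]"
truncated = "[0 0 0 ... 0 0 0]"

print(parse_label_string(full))  # [0, 0, 1, 0, 1, 1, 0]

try:
    parse_label_string(truncated)
except ValueError:
    print("truncated row cannot be parsed; it has to be filtered out first")
```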

In [8]:
def prepare_dataset(prefix, classification_type):
    # Characters allowed in a complete (non-abbreviated) label string.
    allowed = set('0123456789 []\n')

    def process_one_split(split_path):
        df = pd.read_csv(f'{split_path}.csv')
        # Drop rows whose label string contains anything else, e.g. the '...'
        # left behind when a long array was abbreviated.
        is_complete = df[classification_type].apply(lambda s: set(s) <= allowed)
        df = df[is_complete]
        sequences = list(df.sequence)
        # split() without arguments handles both spaces and wrapped newlines.
        # Plain lists are returned because the rows differ in length.
        labels = [[int(tok) for tok in s.strip()[1:-1].split()]
                  for s in df[classification_type]]
        return sequences, labels

    train_sequences, train_labels = process_one_split(prefix + 'train')
    test_sequences, test_labels = process_one_split(prefix + 'test')
    validate_sequences, validate_labels = process_one_split(prefix + 'validate')
    return train_sequences, test_sequences, validate_sequences, train_labels, test_labels, validate_labels

In [9]:
task_types = ['ss3', 'asabu', 'HPrank', 'Epitope_anti']
classification_types = ['ss3', 'buried', 'hp_class', 'interface']

train_sequences, test_sequences, validate_sequences, train_labels, test_labels, validate_labels = [[0 for __ in range(4)] for _ in range(6)]

for i in range(4):
    train_sequences[i], test_sequences[i], validate_sequences[i], train_labels[i], test_labels[i], validate_labels[i] = prepare_dataset('data/' + task_types[i] + '/', classification_types[i])


## Creating our dataset

Nice! Now we'll split and tokenize the data, and then create datasets - I'll go through this quite quickly here, since it's identical to how we did it in the sequence classification example above.

In [10]:
from transformers import AutoTokenizer

tokenizers = [AutoTokenizer.from_pretrained(model_checkpoint) for _ in range(4)]

train_tokenized, test_tokenized, val_tokenized = [[0 for __ in range(4)] for _ in range(3)]

for i in range(4):
    train_tokenized[i] = tokenizers[i](train_sequences[i])
    test_tokenized[i] = tokenizers[i](test_sequences[i])
    val_tokenized[i] = tokenizers[i](validate_sequences[i])

In [11]:
from datasets import Dataset

train_datasets, test_datasets, val_datasets = [[0 for __ in range(4)] for _ in range(3)]

for i in range(4):
    train_datasets[i] = Dataset.from_dict(train_tokenized[i])
    test_datasets[i] = Dataset.from_dict(test_tokenized[i])
    val_datasets[i] = Dataset.from_dict(val_tokenized[i])
    
    train_datasets[i] = train_datasets[i].add_column("labels", train_labels[i])
    test_datasets[i] = test_datasets[i].add_column("labels", test_labels[i])
    val_datasets[i] = val_datasets[i].add_column("labels", validate_labels[i])
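One thing worth checking at this point is whether the label lists line up with the token lists. The labels hold one entry per residue, but a tokenizer may add special tokens around each sequence (we believe the ESM tokenizer adds one at each end, but please verify by comparing `len(train_tokenized[i]["input_ids"][0])` with `len(train_sequences[i][0])`). If so, the labels would be offset by one position relative to the tokens. The sketch below shows one possible way to realign them. The helper and its defaults are assumptions for illustration, not something the cells above do.

```python
IGNORE_INDEX = -100  # label value skipped by the loss and by compute_metrics


def align_labels_with_special_tokens(residue_labels, n_prefix=1, n_suffix=1):
    """Pad per-residue labels so they line up with the tokenizer output.

    n_prefix / n_suffix are the numbers of special tokens added before and
    after the residues (assumed to be 1 and 1 here; check your tokenizer).
    """
    return [IGNORE_INDEX] * n_prefix + list(residue_labels) + [IGNORE_INDEX] * n_suffix


residue_labels = [0, 1, 1, 2]
aligned = align_labels_with_special_tokens(residue_labels)
print(aligned)  # [-100, 0, 1, 1, 2, -100]
```

Applying something like this to each label list before calling `add_column` would keep every residue label on its own token.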

## Model loading

The key difference here with the above example is that we use `AutoModelForTokenClassification` instead of `AutoModelForSequenceClassification`. We will also need a `data_collator` this time, as we're in the slightly more complex case where both inputs and labels must be padded in each batch.

In [12]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

num_labels = 3
models = [0, 0, 0, 0]

for i in range(4):
    models[i] = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

Some weights of EsmForTokenClassification were not initialized from the model checkpoint at facebook/esm2_t12_35M_UR50D and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

In [13]:
#models

In [14]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizers[0])
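Conceptually, the collator has to bring every example in a batch to the same length, padding the inputs with the pad token and the labels with `-100`. The following is a plain-Python sketch of that idea, not the library's implementation. The function name, the token ids, and `pad_token_id=1` are made up for the example.

```python
def pad_batch(batch_input_ids, batch_labels, pad_token_id, label_pad_id=-100):
    """Right-pad inputs and labels to the longest input in the batch."""
    max_len = max(len(ids) for ids in batch_input_ids)
    padded_ids, padded_labels, attention_mask = [], [], []
    for ids, labs in zip(batch_input_ids, batch_labels):
        padded_ids.append(ids + [pad_token_id] * (max_len - len(ids)))
        padded_labels.append(labs + [label_pad_id] * (max_len - len(labs)))
        # 1 marks real tokens, 0 marks padding.
        attention_mask.append([1] * len(ids) + [0] * (max_len - len(ids)))
    return padded_ids, padded_labels, attention_mask


# Made-up token ids; pad_token_id=1 is an arbitrary choice for this example.
ids, labs, mask = pad_batch([[5, 6, 7], [8, 9]], [[0, 1, 2], [1, 0]], pad_token_id=1)
print(ids)   # [[5, 6, 7], [8, 9, 1]]
print(labs)  # [[0, 1, 2], [1, 0, -100]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```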

Now we set up our `TrainingArguments` as before.

In [15]:
model_name = model_checkpoint.split("/")[-1]
batch_size = 8

args = TrainingArguments(
    f"{model_name}-finetuned-secondary-structure",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.001,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=False,
)
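With `load_best_model_at_end=True` and `metric_for_best_model="accuracy"`, the checkpoint with the highest evaluation accuracy is, as far as we understand the `Trainer` behaviour, the one reloaded once training finishes. The selection itself amounts to an arg-max over epochs, sketched here with the accuracies from the first results table reported after training in this notebook.

```python
# Accuracy per epoch, copied from the first results table in this notebook.
eval_accuracy = {1: 0.820073, 2: 0.825226, 3: 0.824271}

# Pick the epoch whose evaluation accuracy is highest.
best_epoch = max(eval_accuracy, key=eval_accuracy.get)
print(best_epoch)  # 2
```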

Our `compute_metrics` function is a bit more complex than in the sequence classification task, as we need to ignore padding tokens (those where the label is `-100`).

In [16]:
from evaluate import load

metric = load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    labels = labels.reshape((-1,))
    predictions = np.argmax(predictions, axis=2)
    predictions = predictions.reshape((-1,))
    predictions = predictions[labels!=-100]
    labels = labels[labels!=-100]
    return metric.compute(predictions=predictions, references=labels)
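To see what the masking does, here is a dependency-free sketch of the same computation on a flattened toy batch. The values are made up and `masked_accuracy` is a hypothetical stand-in for the metric call above.

```python
def masked_accuracy(predicted_ids, label_ids, ignore_index=-100):
    """Accuracy over the positions whose label is not ignore_index."""
    kept = [(p, l) for p, l in zip(predicted_ids, label_ids) if l != ignore_index]
    correct = sum(p == l for p, l in kept)
    return correct / len(kept)


# Flattened toy batch: the last two positions are padding.
predicted_ids = [0, 1, 2, 2, 0, 0]
label_ids = [0, 1, 1, 2, -100, -100]

print(masked_accuracy(predicted_ids, label_ids))  # 0.75
```

Without the mask, the two padded positions would count as errors and the score would drop to 3 correct out of 6.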

And now we're ready to train our model!

In [18]:
trainers = [0,0,0,0]

for i in range(4):
    trainers[i] = Trainer(
        models[i],
        args,
        train_dataset=train_datasets[i],
        eval_dataset=test_datasets[i],
        tokenizer=tokenizers[i],
        compute_metrics=compute_metrics,
        data_collator=data_collator,
    )

    trainers[i].train()



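| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.4461        | 0.433826        | 0.820073 |
| 2     | 0.3711        | 0.426254        | 0.825226 |
| 3     | 0.3023        | 0.443227        | 0.824271 |

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.3853        | 0.379402        | 0.831175 |
| 2     | 0.3388        | 0.375611        | 0.834506 |
| 3     | 0.2883        | 0.39622         | 0.831828 |

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 0.509581        | 0.760567 |
| 2     | 0.532300      | 0.503572        | 0.761069 |
| 3     | 0.532300      | 0.493933        | 0.765113 |

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 0.340275        | 0.906041 |
| 2     | No log        | 0.329828        | 0.906041 |
| 3     | No log        | 0.315165        | 0.906041 |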


In [19]:
results = [0,0,0,0]

for i in range(4):
    results[i] = trainers[i].predict(val_datasets[i])
    print(results[i].metrics)

{'test_loss': 0.42773425579071045, 'test_accuracy': 0.824121256450892, 'test_runtime': 8.1066, 'test_samples_per_second': 135.939, 'test_steps_per_second': 17.023}


{'test_loss': 0.37525466084480286, 'test_accuracy': 0.8334674306148158, 'test_runtime': 8.159, 'test_samples_per_second': 135.065, 'test_steps_per_second': 16.914}


{'test_loss': 0.49315956234931946, 'test_accuracy': 0.7659964051134741, 'test_runtime': 6.001, 'test_samples_per_second': 106.15, 'test_steps_per_second': 13.331}


{'test_loss': 0.37525635957717896, 'test_accuracy': 0.8960250450621383, 'test_runtime': 0.4129, 'test_samples_per_second': 108.99, 'test_steps_per_second': 14.532}


In [20]:
for i in range(4):
    tokenizers[i].save_pretrained('models/esm_' + task_types[i])

In [21]:
for i in range(4):
    models[i].save_pretrained('models/esm_' + task_types[i])

This definitely seems harder than the first task, but we still attain a very respectable accuracy. Remember that to keep this demo lightweight, we used one of the smallest ESM models, focused on human proteins only and didn't put a lot of work into making sure we only included completely-annotated proteins in our training set. With a bigger model and a cleaner, broader training set, accuracy on this task could definitely go a lot higher!