If you're opening this Notebook on colab, you will probably need to install 🤗 Transformers as well as some other libraries. Uncomment the following cell and run it.

If you're opening this notebook locally, make sure your environment has an install from the last version of those libraries.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up here if you haven't already!) then execute the following cell and input your username and password:

In [1]:
from huggingface_hub import login

login('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')

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/evlasova/.cache/huggingface/token
Login successful


Then you need to install Git-LFS. Uncomment the following instructions:

We also quickly upload some telemetry - this tells us which examples and software versions are getting used so we know where to prioritize our maintenance efforts. We don't collect (or care about) any personally identifiable information, but if you'd prefer not to be counted, feel free to skip this step or delete this cell entirely.

# Fine-Tuning Protein Language Models

In this notebook, we're going to do some transfer learning to fine-tune some large, pre-trained protein language models on tasks of interest. If that sentence feels a bit intimidating to you, don't panic - there's [a blog post](https://huggingface.co/blog/deep-learning-with-proteins) that explains the concepts here in much more detail.

The specific model we're going to use is ESM-2, which is the state-of-the-art protein language model at the time of writing (November 2022). The citation for this model is [Lin et al, 2022](https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1).

There are several ESM-2 checkpoints with differing model sizes. Larger models will generally have better accuracy, but they require more GPU memory and will take much longer to train. The available ESM-2 checkpoints (at time of writing) are:

| Checkpoint name | Num layers | Num parameters |
|------------------------------|----|----------|
| `esm2_t48_15B_UR50D`         | 48 | 15B     |
| `esm2_t36_3B_UR50D`          | 36 | 3B      |
| `esm2_t33_650M_UR50D`        | 33 | 650M    |
| `esm2_t30_150M_UR50D`        | 30 | 150M    |
| `esm2_t12_35M_UR50D`         | 12 | 35M     |
| `esm2_t6_8M_UR50D`           | 6  | 8M      |

Note that the larger checkpoints may be very difficult to train without a large cloud GPU like an A100 or H100, and the largest 15B parameter checkpoint will probably be impossible to train on **any** single GPU! Also, note that memory usage for attention during training will scale as `O(batch_size * num_layers * seq_len^2)`, so larger models on long sequences will use quite a lot of memory! We will use the `esm2_t12_35M_UR50D` checkpoint for this notebook, which should train on any Colab instance or modern GPU.

In [2]:
# model_checkpoint = "facebook/esm2_t12_35M_UR50D"
model_checkpoint = "Rostlab/prot_t5_xl_bfd"

***
# Token classification

Another common language model task is **token classification**. In this task, instead of classifying the whole sequence into a single category, we categorize each token (amino acid, in this case!) into one or more categories. This kind of model could be useful for:

- Predicting secondary structure
- Predicting buried vs. exposed residues
- Predicting residues that will receive post-translational modifications
- Predicting residues involved in binding pockets or active sites
- Probably several other things, it's been a while since I was a postdoc

## Data preparation

In this section, we're going to gather some training data from UniProt. As in the sequence classification example, we aim to create two lists: `sequences` and `labels`. Unlike in that example, however, the `labels` are more than just single integers. Instead, the label for each sample will be **one integer per token in the input**. This should make sense - when we do token classification, different tokens in the input may have different categories!

To demonstrate token classification, we're going to go back to UniProt and get some data on protein secondary structures. As above, this will probably the main section you want to change when adapting this code to your own problems.

Now we've defined a helper function, let's build our lists of sequences and labels:

In [3]:
import pandas as pd
import numpy as np

In [4]:
def prepare_dataset():
    def process_one_type(type_of_dataset):
        train = pd.read_csv(f'{type_of_dataset}.csv')
        train_sequences = list(train.sequence)
        train_labels = train.ss3.str.replace('\\n', '')
        train_labels = np.array([np.array(list(map(int, x[1:-1].split(' ')))) for x in train_labels])
        return train_sequences, train_labels

    train_sequences, train_labels = process_one_type('train')
    test_sequences, test_labels = process_one_type('test')
    validate_sequences, validate_labels = process_one_type('validate')
    return train_sequences, test_sequences, validate_sequences, train_labels, test_labels, validate_labels

In [5]:
train_sequences, test_sequences, validate_sequences, train_labels, test_labels, validate_labels = prepare_dataset()

  train_labels = train.ss3.str.replace('\\n', '')
  train_labels = np.array([np.array(list(map(int, x[1:-1].split(' ')))) for x in train_labels])
  train_labels = train.ss3.str.replace('\\n', '')
  train_labels = np.array([np.array(list(map(int, x[1:-1].split(' ')))) for x in train_labels])
  train_labels = train.ss3.str.replace('\\n', '')
  train_labels = np.array([np.array(list(map(int, x[1:-1].split(' ')))) for x in train_labels])


## Creating our dataset

Nice! Now we'll split and tokenize the data, and then create datasets - I'll go through this quite quickly here, since it's identical to how we did it in the sequence classification example above.

In [6]:
! pip install sentencepiece




In [7]:
from transformers import AutoTokenizer, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_bfd', do_lower_case=False)

train_tokenized = tokenizer(train_sequences)
test_tokenized = tokenizer(test_sequences)
val_tokenized = tokenizer(validate_sequences)

You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565


In [8]:
from datasets import Dataset

train_dataset = Dataset.from_dict(train_tokenized)
test_dataset = Dataset.from_dict(test_tokenized)
val_dataset = Dataset.from_dict(val_tokenized)

train_dataset = train_dataset.add_column("labels", train_labels)
test_dataset = test_dataset.add_column("labels", test_labels)
val_dataset = val_dataset.add_column("labels", validate_labels)

## Model loading

The key difference here with the above example is that we use `AutoModelForTokenClassification` instead of `AutoModelForSequenceClassification`. We will also need a `data_collator` this time, as we're in the slightly more complex case where both inputs and labels must be padded in each batch.

In [9]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer, AutoModelForSeq2SeqLM, T5Model

num_labels = 3
model = T5Model.from_pretrained("Rostlab/prot_t5_xl_bfd")

In [10]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)

Now we set up our `TrainingArguments` as before.

In [11]:
model_name = model_checkpoint.split("/")[-1]
batch_size = 8

args = TrainingArguments(
    f"{model_name}-finetuned-secondary-structure",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.001,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=False,
)

Our `compute_metrics` function is a bit more complex than in the sequence classification task, as we need to ignore padding tokens (those where the label is `-100`).

In [12]:
from evaluate import load
import numpy as np

metric = load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    labels = labels.reshape((-1,))
    predictions = np.argmax(predictions, axis=2)
    predictions = predictions.reshape((-1,))
    predictions = predictions[labels!=-100]
    labels = labels[labels!=-100]
    return metric.compute(predictions=predictions, references=labels)

And now we're ready to train our model!

In [13]:
trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)

trainer.train()



ValueError: You have to specify either decoder_input_ids or decoder_inputs_embeds

In [None]:
results = trainer.predict(val_dataset)

In [None]:
results.to_csv('prediction_results.csv')

This definitely seems harder than the first task, but we still attain a very respectable accuracy. Remember that to keep this demo lightweight, we used one of the smallest ESM models, focused on human proteins only and didn't put a lot of work into making sure we only included completely-annotated proteins in our training set. With a bigger model and a cleaner, broader training set, accuracy on this task could definitely go a lot higher!