### Installing dependencies on a Vast.ai or RunPod machine with the latest PyTorch Docker image

In [None]:
! pip install transformers[torch] evaluate datasets requests pandas scikit-learn peft bitsandbytes matplotlib sentencepiece

In [None]:
# !apt install git-lfs
# pip install wandb
# wandb login

# Fine-Tuning Protein Language Models

Inspired by [a blog post](https://huggingface.co/blog/deep-learning-with-proteins), this notebook uses transfer learning to fine-tune large, pre-trained protein language models on a classification task: predicting CATH architecture classes.

The models we're going to use here are ESM-2 and ProtBert (`prot_bert`), both of which are strong, widely used protein language models.

There are several ESM-2 checkpoints with differing model sizes. Larger models will generally have better accuracy, but they require more GPU memory and will take much longer to train. The available ESM-2 checkpoints are:

| Checkpoint name | Num layers | Num parameters |
|------------------------------|----|----------|
| `esm2_t48_15B_UR50D`         | 48 | 15B     |
| `esm2_t36_3B_UR50D`          | 36 | 3B      |
| `esm2_t33_650M_UR50D`        | 33 | 650M    |
| `esm2_t30_150M_UR50D`        | 30 | 150M    |
| `esm2_t12_35M_UR50D`         | 12 | 35M     |
| `esm2_t6_8M_UR50D`           | 6  | 8M      |

We will use the `esm2_t12_35M_UR50D` and `esm2_t33_650M_UR50D` checkpoints for this task.
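To make the memory trade-off between checkpoints concrete, here is a rough back-of-the-envelope sketch. It assumes full fine-tuning with fp32 weights, fp32 gradients and two AdamW moment buffers (16 bytes per parameter), uses the nominal parameter counts from the table, and ignores activations entirely, so the real requirement will be higher.

```python
# Rough GPU memory needed just for parameters and optimizer state.
# Assumption: fp32 weights (4 B) + fp32 gradients (4 B) + two AdamW
# moment buffers (8 B) = 16 bytes per parameter. Activations are excluded.
CHECKPOINT_PARAMS = {
    "esm2_t6_8M_UR50D": 8e6,
    "esm2_t12_35M_UR50D": 35e6,
    "esm2_t30_150M_UR50D": 150e6,
    "esm2_t33_650M_UR50D": 650e6,
    "esm2_t36_3B_UR50D": 3e9,
    "esm2_t48_15B_UR50D": 15e9,
}

BYTES_PER_PARAM_TRAINING = 16


def training_state_gb(num_params: float) -> float:
    """Estimated size of weights, gradients and AdamW state in GB."""
    return num_params * BYTES_PER_PARAM_TRAINING / 1e9


for name, n_params in CHECKPOINT_PARAMS.items():
    print(f"{name:24s} ~{training_state_gb(n_params):8.2f} GB")
```

Under these assumptions the 650M checkpoint already needs on the order of 10 GB before any activations are stored, which is one reason to consider LoRA or freezing the encoder on smaller GPUs.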

In [None]:
model_checkpoint = "facebook/esm2_t33_650M_UR50D"

# Sequence classification

Since our dataset contains the protein sequences, we can learn the CATH labels in a supervised way, with the sequences as inputs. More specifically, we will fine-tune the protein language models on (sequence, CATH label) pairs to learn the classification task.

## Data preparation

Our goal is to create a pair of lists: `sequences` and `labels`. `sequences` will be a list of protein sequences, which will just be strings like "MNKL...", where each letter represents a single amino acid in the complete protein. `labels` will be a list of the category for each sequence. The categories will just be integers from 0 to 9. 
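Before training, it can help to sanity-check the two lists. The sketch below uses hypothetical toy data and a hypothetical helper, `check_labels`, to verify that the lists are aligned and that the label ids are contiguous integers starting at 0. The model loading step later computes the number of classes as `max(labels) + 1`, which relies on this.

```python
from collections import Counter

# Hypothetical toy data standing in for the `sequences` and `label` columns.
toy_sequences = ["MNKLAV", "GSHMTT", "MKVLAA", "AAGTPL"]
toy_labels = [0, 2, 1, 2]


def check_labels(sequences, labels):
    """Validate the paired lists and return the number of classes."""
    if len(sequences) != len(labels):
        raise ValueError("sequences and labels must have the same length")
    if any(not isinstance(y, int) or y < 0 for y in labels):
        raise ValueError("labels must be non-negative integers")
    num_classes = max(labels) + 1
    # Every id in 0..num_classes-1 should appear at least once.
    missing = sorted(set(range(num_classes)) - set(labels))
    if missing:
        raise ValueError(f"label ids are not contiguous, missing: {missing}")
    return num_classes


num_classes = check_labels(toy_sequences, toy_labels)
class_counts = Counter(toy_labels)
print(num_classes, dict(class_counts))
```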

In [None]:
import os

# Change to the desired directory
# os.chdir("/root")
# Verify the change
print(os.listdir("./"))

In [None]:
import pandas as pd

train_df = pd.read_csv("./data/train.csv")
val_df = pd.read_csv("./data/val.csv")

In [None]:
train_sequences = train_df["sequences"].tolist()
test_sequences = val_df["sequences"].tolist()
train_labels = train_df["label"].tolist()
test_labels = val_df["label"].tolist()

In [None]:
pd.Series(train_labels).value_counts(sort=True, ascending=False).plot(
    kind="bar", backend="matplotlib", figsize=(10, 5)
)

In [None]:
pd.Series(test_labels).value_counts(sort=True, ascending=False).plot(
    kind="bar", backend="matplotlib", figsize=(10, 5)
)

## Tokenizing the data

All inputs to neural nets must be numerical. The process of converting strings into numerical indices suitable for a neural net is called **tokenization**. For natural language this can be quite complex, as usually the network's vocabulary will not contain every possible word, which means the tokenizer must handle splitting rarer words into pieces, as well as all the complexities of capitalization and unicode characters and so on.

With proteins, however, things are very easy. In protein language models, each amino acid is converted to a single token. Every model in `transformers` comes with an associated `tokenizer` that handles tokenization for it, and protein language models are no different.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

When the tokenizer is called on a sequence, it returns `input_ids`, which is the tokenized sequence, and an `attention_mask`. The attention mask handles sequences of variable length: when a batch is padded, the shorter sequences are filled with blank "padding" tokens, and the attention mask is padded with 0s to indicate that the model should ignore those tokens.
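The per-residue tokenization and the role of the attention mask can be illustrated with a toy tokenizer written in plain Python. This is only a sketch: the vocabulary, the token ids and the start/end markers are assumptions made for illustration, not the real ESM-2 vocabulary.

```python
# Toy vocabulary: NOT the real ESM-2 vocabulary, ids are illustrative only.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD_ID, CLS_ID, EOS_ID, UNK_ID = 0, 1, 2, 3
TOKEN_TO_ID = {aa: i + 4 for i, aa in enumerate(AMINO_ACIDS)}


def toy_tokenize(sequences):
    """Map each residue to one id, then pad the batch to a common length."""
    encoded = [
        [CLS_ID] + [TOKEN_TO_ID.get(aa, UNK_ID) for aa in seq] + [EOS_ID]
        for seq in sequences
    ]
    max_len = max(len(ids) for ids in encoded)
    # Shorter sequences are filled with PAD_ID on the right ...
    input_ids = [ids + [PAD_ID] * (max_len - len(ids)) for ids in encoded]
    # ... and the mask marks real tokens with 1 and padding with 0.
    attention_mask = [
        [1] * len(ids) + [0] * (max_len - len(ids)) for ids in encoded
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = toy_tokenize(["MNKL", "AC"])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The second sequence is two residues shorter, so its last two positions hold the padding id and a mask value of 0.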

In [None]:
train_tokenized = tokenizer(train_sequences)
test_tokenized = tokenizer(test_sequences)

## Dataset creation

Now we want to turn this data into a dataset that PyTorch can load samples from. We can use the Hugging Face `Dataset` class for this.

In [None]:
from datasets import Dataset

train_dataset = Dataset.from_dict(train_tokenized)
test_dataset = Dataset.from_dict(test_tokenized)

train_dataset

In [None]:
train_dataset = train_dataset.add_column("labels", train_labels)
test_dataset = test_dataset.add_column("labels", test_labels)
train_dataset

Looks good! We're ready for training.

## Model loading

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import torch

num_labels = max(train_labels + test_labels) + 1  # Add 1 since 0 can be a label
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels=num_labels,
)

######## for lora experiments   ############

# from peft import get_peft_model, LoraConfig, TaskType

# def find_target_modules(model):
#     target_modules = []
#     for name, module in model.named_modules():
#         if isinstance(module, torch.nn.Linear):
#             target_modules.append(name)
#     return target_modules


# # Get the target modules
# target_modules = find_target_modules(model)
# peft_config = LoraConfig(
#     task_type=TaskType.SEQ_CLS,
#     inference_mode=False,
#     r=32,
#     lora_alpha=64,
#     lora_dropout=0.1,
#     target_modules=target_modules,
# )


# model = get_peft_model(model, peft_config)
# model.print_trainable_parameters()


# def count_trainable_parameters(model):
# model_parameters = filter(lambda p: p.requires_grad, model.parameters())
# params = sum([np.prod(p.size()) for p in model_parameters])
# return params
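To get a feel for how much LoRA reduces the number of trainable parameters, the arithmetic for a single linear layer can be sketched as follows. LoRA adds two low-rank matrices, `A` (`r x d_in`) and `B` (`d_out x r`), next to the frozen weight. The hidden size of 1280 below is an illustrative assumption, and `r=32` mirrors the commented-out configuration above.

```python
def dense_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameters of a standard linear layer."""
    return d_in * d_out + (d_out if bias else 0)


def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer (A: r x d_in, B: d_out x r)."""
    return r * (d_in + d_out)


hidden = 1280  # illustrative hidden size, an assumption of this sketch
rank = 32      # matches r=32 in the commented-out LoraConfig

full = dense_params(hidden, hidden)
extra = lora_extra_params(hidden, hidden, rank)
print(f"dense: {full:,}  lora: {extra:,}  ratio: {extra / full:.1%}")
```

For a square layer of this size, the adapter trains roughly 5% as many parameters as the dense layer it wraps.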

In [None]:
def count_trainable_parameters(model):
    # Sum the element counts of all parameters that receive gradients.
    # `numel()` avoids depending on numpy, which is only imported later.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


count_trainable_parameters(model)

In [None]:
# for experimenting with freezing the core of the model (only the encoder)
# for name, param in model.named_parameters():
#     if not name.startswith("classifier"):
#         param.requires_grad = False
#         #print(name)

The warnings printed when the model is loaded tell us that the model is discarding some weights that it used for language modelling (the `lm_head`) and adding some weights for sequence classification (the `classifier`). This is exactly what we expect when we want to fine-tune a language model on a sequence classification task!

Next, we initialize our `TrainingArguments`. These control the various training hyperparameters, and will be passed to our `Trainer`.

In [None]:
%env WANDB_WATCH=all
%env WANDB_SILENT=true
%env WANDB_LOG_MODEL=end
%env WANDB_PROJECT=medium biosciences




version = 1  #experiment version
batch_size = 32
train_epochs = 100
num_workers = 8
lr = 1e-5 


In [None]:
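from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

# Stop once the monitored metric has not improved for 5 consecutive evaluations
early_stopping = EarlyStoppingCallback(early_stopping_patience=5)

model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_name}-finetuned",
    # Pass exactly one evaluation-strategy argument. Recent transformers
    # releases use `eval_strategy`; older ones only accept `evaluation_strategy`.
    eval_strategy="epoch",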
    save_strategy="epoch",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=train_epochs,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    push_to_hub=False,
    fp16=True,
    fp16_full_eval=True,
    # bf16_full_eval=True,
    # bf16=True,
    save_total_limit=1,
    gradient_checkpointing=True,
    optim="adamw_torch",
    report_to="wandb",
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,
    logging_strategy="epoch",
    run_name=f"{model_checkpoint.split('/')[-1]}-v-{version}",
    dataloader_num_workers=num_workers,
)
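The combination of `lr_scheduler_type="cosine"` and `warmup_ratio=0.01` can be sketched in plain Python to see what learning rate the optimizer receives at each step. This is a simplified re-implementation for illustration, assuming a linear warmup followed by a half-cosine decay to zero; the total step count of 1000 is hypothetical.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float, warmup_ratio: float) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical; depends on dataset size, batch size, epochs
for step in (0, 5, 10, 505, 1000):
    print(step, f"{cosine_lr(step, total_steps, 1e-5, 0.01):.2e}")
```

With these values the rate peaks at `1e-5` after 10 steps, is half of that midway through the decay, and reaches zero at the final step.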

Next, we define the metrics we will use to evaluate our models and write a `compute_metrics` function. We can load the metrics from the `evaluate` library. I chose the weighted average for F1, precision and recall because this is a multi-class classification problem and the class distribution in the validation set differs slightly from class to class.

In [None]:
import evaluate
import numpy as np


def compute_metrics(eval_preds):
    metric = evaluate.combine(["f1", "precision", "recall"])
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    metrics = metric.compute(
        predictions=predictions, references=labels, average="weighted"
    )
    return metrics
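To make the `average="weighted"` setting concrete, here is a small pure-Python sketch of a support-weighted F1 score: the F1 of each class is computed separately and then averaged with weights proportional to how many true examples the class has. The labels and predictions are hypothetical, and the function is an illustration rather than the code the `evaluate` library runs.

```python
from collections import Counter


def weighted_f1(references, predictions):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(references)
    total = len(references)
    score = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for y, p in zip(references, predictions) if y == cls and p == cls)
        n_pred = sum(1 for p in predictions if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Classes with more true examples contribute more to the final score
        score += f1 * n_true / total
    return score


refs = [0, 0, 0, 1, 1, 2]   # hypothetical true labels
preds = [0, 0, 1, 1, 1, 2]  # hypothetical predictions
print(round(weighted_f1(refs, preds), 4))
```

Here classes 0 and 1 each reach an F1 of 0.8 and class 2 reaches 1.0, so the weighted result is (3 × 0.8 + 2 × 0.8 + 1 × 1.0) / 6 = 5/6.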

In [None]:
from transformers import DataCollatorWithPadding

# The tokenized dataset holds variable-length `input_ids` and `attention_mask`
# lists (plus `labels`), so each batch is padded to its own longest sequence.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[early_stopping],
    data_collator=data_collator,
)
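Padding each batch only to its own longest sequence, rather than to the longest sequence in the whole dataset, can save a lot of computation when sequence lengths vary widely. The sketch below counts the tokens processed under both strategies; the sequence lengths and the helper names are hypothetical.

```python
def padded_tokens_global(lengths):
    """Tokens processed if every sequence is padded to the dataset maximum."""
    return max(lengths) * len(lengths)


def padded_tokens_per_batch(lengths, batch_size):
    """Tokens processed if each batch is padded to its own maximum."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        chunk = lengths[start:start + batch_size]
        total += max(chunk) * len(chunk)
    return total


lengths = [50, 60, 55, 400, 70, 65, 80, 75]  # hypothetical sequence lengths

print("global padding:   ", padded_tokens_global(lengths))
print("per-batch padding:", padded_tokens_per_batch(lengths, batch_size=4))
```

A single long protein forces every row to 400 tokens under global padding, while per-batch padding confines that cost to the one batch that contains it.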

In [None]:
trainer.train()
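With `train_epochs = 100` and an early-stopping patience of 5, training usually ends well before the last epoch. The following simplified sketch shows the patience logic on a hypothetical list of per-epoch F1 scores. It is an illustration of the idea, not the callback's actual implementation, and it treats a tie with the best score as "no improvement".

```python
def epoch_of_early_stop(metric_history, patience, greater_is_better=True):
    """Return the 1-based epoch at which training would stop, or None."""
    best = None
    bad_epochs = 0
    for epoch, value in enumerate(metric_history, start=1):
        if best is None:
            improved = True
        elif greater_is_better:
            improved = value > best
        else:
            improved = value < best
        if improved:
            best = value
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None


# Hypothetical weighted F1 after each epoch
f1_per_epoch = [0.60, 0.68, 0.71, 0.70, 0.71, 0.69, 0.70, 0.70, 0.72]
print(epoch_of_early_stop(f1_per_epoch, patience=5))
```

In this example the best score is reached at epoch 3 and five non-improving epochs follow, so training stops at epoch 8 and never sees the later improvement. That is the trade-off a small patience value makes.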

In [None]:
import wandb

wandb.finish()

## Model Evaluation

In [None]:
trainer.evaluate(test_dataset)

In [None]:
version = 1
batch_size = 64

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = max(train_labels + test_labels) + 1

model_name = model_checkpoint.split("/")[-1]

# Evaluation-only arguments: no evaluation schedule is needed here because
# `trainer.evaluate()` is called explicitly.
args = TrainingArguments(
    f"{model_name}-finetuned",
    per_device_eval_batch_size=batch_size,
    push_to_hub=False,
    fp16=True,
    fp16_full_eval=True,
    report_to="none",
)

import evaluate
import numpy as np


def compute_metrics(eval_preds):
    metric = evaluate.combine(["f1", "precision", "recall"])
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    metrics = metric.compute(
        predictions=predictions, references=labels, average="weighted"
    )
    return metrics


import wandb

# Use the API to fetch the artifact from wandb
# api = wandb.Api()
# artifact = api.artifact(
#     f"shahdsaf/medium biosciences/model-{model_checkpoint.split('/')[-1]}-v-{version}:latest"
# )

# # Download the artifact to a local directory
# model_dir = artifact.download()

model_dir = ""


# Load your Hugging Face model from that folder
#  using the same model class
model = AutoModelForSequenceClassification.from_pretrained(
    model_dir, num_labels=num_labels
)

trainer = Trainer(
    model,
    args,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)