<a href="https://colab.research.google.com/github/Gyuheon-Song/Bioinformatics/blob/main/2024_Bioinformatics_transformer_Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://i.ibb.co/9bmXrF8/9.png" width="350" alt="5" border="0">

# **2024 Bioinformatics Deep Learning Practice**
## **MicroBTrans : Transformer-based microbiome profile representation learning for disease prognosis**




### **Instructor info**
---
*   **Professor** : **Insuk Lee**, Network Biology Lab, Department of Biotechnology, Yonsei University
*   **Contact** : insuklee@yonsei.ac.kr
*   **Teaching Assistant** : **Hanjune Kim**, Network Biology Lab, Department of Biotechnology, Yonsei University
*   **Contact** : kaka0308@yonsei.ac.kr
*   **Lab location** : Network Biology Lab, S324, Science engineering hall, Yonsei University, Seoul

### **Learning Objective**
---
**In this tutorial, you will use ESM-2, a pre-trained transformer model, to predict binding sites of the cFp1 endonuclease (a component of the CRISPR system) in given DNA sequences.**
---

### **Reference** ###
https://www.science.org/doi/10.1126/science.ade2574

# **Preparation for practice**

## **Pre-configuration**

In [117]:
!pip install torch
!pip install transformers[torch]
!pip install datasets
!pip install evaluate



In [118]:
import os

### Configure project and dataset directories ###
projectDir = "/content/2024_Bioinformatics_transformer_practice"
datasetDir = "/".join([projectDir, "Datasets"])
os.makedirs(datasetDir, exist_ok=True)

### Change current working directory to Project directory ###
os.chdir(projectDir)
print(os.getcwd())

/content/2024_Bioinformatics_transformer_practice


## **Dataset Load and Description**

In [119]:
import gdown
import subprocess
import pandas as pd
import pickle
import matplotlib.pyplot as plt
import seaborn as sns

### Download the train and test tables from Google Drive ###
# gdown fetches a file by its Google Drive ID and writes it to the path given after -O.
cmd1 = "gdown 1B2zPt78NaBoZ_ZRDIejJbv0LAU01lXWg -O %s" % "/".join([datasetDir, "train_cfp1.tsv"])
cmd2 = "gdown 11WMLlNd9FHZDR-dZyQ5hjLrSvrMeUJS7 -O %s" % "/".join([datasetDir, "test_cfp1.tsv"])
# Each command is a single shell string, so it is run with shell=True.
subprocess.run([cmd1], shell=True, capture_output=False)
subprocess.run([cmd2], shell=True, capture_output=False)

CompletedProcess(args=['gdown 11WMLlNd9FHZDR-dZyQ5hjLrSvrMeUJS7 -O /content/2024_Bioinformatics_transformer_practice/Datasets/test_cfp1.tsv'], returncode=0)
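
The download cell passes each command to the shell as one string. A minimal sketch of an alternative, assuming the same `gdown <file_id> -O <path>` command-line form used above: build the command as an argument list so that no shell quoting is involved. The helper names `build_gdown_args` and `download` are hypothetical, and the directory path is only an example.

```python
import os
import subprocess


def build_gdown_args(file_id, out_path):
    """Return the gdown command as an argument list (hypothetical helper)."""
    return ["gdown", file_id, "-O", out_path]


def download(file_id, out_path):
    """Run gdown without a shell and raise an error if the download fails."""
    subprocess.run(build_gdown_args(file_id, out_path), check=True)


# Example path only; the notebook builds its paths from datasetDir.
example_dir = "/content/2024_Bioinformatics_transformer_practice/Datasets"
train_args = build_gdown_args(
    "1B2zPt78NaBoZ_ZRDIejJbv0LAU01lXWg",
    os.path.join(example_dir, "train_cfp1.tsv"),
)
print(train_args)
```

Calling `download(...)` with the same two arguments would perform the actual transfer; with `check=True`, a non-zero exit code raises `subprocess.CalledProcessError` instead of passing silently.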

In [120]:
from typing import Optional, Union, Tuple

import evaluate
import numpy as np
from datasets import load_dataset
from torch.nn import CrossEntropyLoss
from transformers import (
    AutoConfig,
    AutoTokenizer,
    EsmModel,
    EsmForSequenceClassification,
    EsmPreTrainedModel,
    TrainingArguments,
    Trainer,
)
import torch
from torch import nn
from transformers.modeling_outputs import SequenceClassifierOutput

In [121]:
train_data_file = "/".join([datasetDir, "train_cfp1.tsv"])
valid_data_file = "/".join([datasetDir, "test_cfp1.tsv"])

In [122]:
### Load dataset ###
data_files = {"train": train_data_file, "valid": valid_data_file}

# The "csv" builder parses the delimited text files into the train and valid splits.
raw_datasets = load_dataset(
    "csv",
    data_files=data_files,
)
print(raw_datasets)



DatasetDict({
    train: Dataset({
        features: ['index', 'sequence', 'labels'],
        num_rows: 3000
    })
    valid: Dataset({
        features: ['index', 'sequence', 'labels'],
        num_rows: 258
    })
})
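
The printed `DatasetDict` lists three columns: `index`, `sequence`, and `labels`. Before training, it can help to check how balanced the labels are. The following is a minimal sketch using only the standard library, assuming the files are comma-delimited (as the parsed columns suggest). The rows in `toy_table` are made-up placeholders, not values from the cfp1 files, and `count_labels` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Made-up rows with the same column names as the dataset (not real data).
toy_table = """index,sequence,labels
0,ACGTACGT,1
1,TTGACCAA,0
2,GGCATTCA,1
"""


def count_labels(handle):
    """Count how many rows carry each value in the 'labels' column."""
    reader = csv.DictReader(handle)
    return Counter(row["labels"] for row in reader)


label_counts = count_labels(io.StringIO(toy_table))
print(label_counts)
```

For the real data, the same function could be given `open(train_data_file, newline="")` instead of the in-memory table.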


In [123]:
### Load the tokenizer of the pre-trained ESM-2 checkpoint ###
pretrained_model_name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name,
)

In [124]:
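def preprocess_function(examples):
    # Tokenize a batch of sequences; padding="longest" pads every sequence
    # to the length of the longest one in the same batch.
    result = tokenizer(
        examples["sequence"],
        padding="longest",
        truncation=True
    )
    return result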

train_dataset = raw_datasets["train"].map(preprocess_function, batched=True)
valid_dataset = raw_datasets["valid"].map(preprocess_function, batched=True)

print(train_dataset)
print(valid_dataset)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Dataset({
    features: ['index', 'sequence', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 3000
})
Dataset({
    features: ['index', 'sequence', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 258
})
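
Tokenization adds two columns, `input_ids` and `attention_mask`. With `padding="longest"`, shorter sequences in a batch are filled with a padding ID up to the length of the longest one, and the attention mask marks real tokens with 1 and padding with 0. The sketch below illustrates the idea with a toy character-level vocabulary. The IDs, `PAD_ID`, and `pad_to_longest` are illustrative assumptions; the actual ESM-2 tokenizer uses its own vocabulary and also adds special tokens.

```python
# Toy character-level vocabulary; these IDs are illustrative, not ESM-2's.
PAD_ID = 0
TOY_VOCAB = {"A": 1, "C": 2, "G": 3, "T": 4}


def pad_to_longest(sequences):
    """Encode sequences and pad them to the longest one in the batch."""
    encoded = [[TOY_VOCAB[ch] for ch in seq] for seq in sequences]
    max_len = max(len(ids) for ids in encoded)
    input_ids = [ids + [PAD_ID] * (max_len - len(ids)) for ids in encoded]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (max_len - len(ids)) for ids in encoded
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_to_longest(["ACGT", "GA"])
print(batch["input_ids"])       # [[1, 2, 3, 4], [3, 1, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```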


In [125]:
label_list = raw_datasets["train"].unique("labels")
label_list.sort()
num_labels = len(label_list)

config = AutoConfig.from_pretrained(
    pretrained_model_name,
    num_labels=num_labels,
)
model = EsmForSequenceClassification.from_pretrained(
    pretrained_model_name,
    config=config
)

# Map each label value to a class index and back.
model.config.label2id = {label: idx for idx, label in enumerate(label_list)}
model.config.id2label = {idx: label for label, idx in model.config.label2id.items()}

Some weights of EsmForSequenceClassification were not initialized from the model checkpoint at facebook/esm2_t6_8M_UR50D and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
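
The two dictionaries stored in the model config are inverses of each other: `label2id` turns a label value into a class index, and `id2label` turns a predicted index back into a label. A minimal sketch of that round trip, using hypothetical label names (the notebook derives its own `label_list` from the dataset):

```python
# Hypothetical label values, used only for illustration.
toy_label_list = ["non-binding", "binding"]
toy_label_list.sort()

label2id = {label: idx for idx, label in enumerate(toy_label_list)}
id2label = {idx: label for label, idx in label2id.items()}

# Converting a label to an index and back should return the original label.
round_trip = [id2label[label2id[label]] for label in toy_label_list]
print(label2id)    # {'binding': 0, 'non-binding': 1}
print(round_trip)  # ['binding', 'non-binding']
```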


In [126]:
metric = evaluate.load("accuracy")
def compute_metrics(p):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.argmax(preds, axis=1)

    result = metric.compute(predictions=preds, references=p.label_ids)
    return result
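
`compute_metrics` takes the class with the highest logit for each example and compares it with the reference label. The sketch below reproduces that calculation in plain Python on made-up logits, to show what the reported accuracy means. The helper names and the numbers are illustrative assumptions, not part of the `evaluate` library.

```python
def argmax(row):
    """Return the index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits, label_ids):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, label_ids))
    return correct / len(label_ids)


# Made-up logits for four examples and two classes.
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.9]]
toy_labels = [0, 1, 0, 0]
toy_accuracy = accuracy(toy_logits, toy_labels)
print(toy_accuracy)  # 0.75
```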

In [127]:
## Set trainer

num_epochs = 3

training_args = TrainingArguments(
    learning_rate = 5e-5,
    output_dir='./results',  # output directory
    num_train_epochs=num_epochs,     # total number of training epochs
    per_device_train_batch_size=1,   # batch size per device during training
    do_train=True,                   # perform training
    save_strategy="no"               # checkpoint save strategy
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=None,
)


In [128]:
## Train the model
train_result = trainer.train()
metrics = train_result.metrics

print(metrics)

trainer.save_model()  # Saves the tokenizer too for easy upload

trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()

| Step | Training Loss |
|------|---------------|
| 500  | 0.8264 |
| 1000 | 0.9315 |
| 1500 | 0.6288 |
| 2000 | 0.5373 |
| 2500 | 0.5892 |
| 3000 | 0.653  |
| 3500 | 0.5444 |
| 4000 | 0.4572 |
| 4500 | 0.4608 |
| 5000 | 0.4777 |


{'train_runtime': 253.4151, 'train_samples_per_second': 35.515, 'train_steps_per_second': 35.515, 'total_flos': 14583660552000.0, 'train_loss': 0.5079356553819444, 'epoch': 3.0}
***** train metrics *****
  epoch                    =        3.0
  total_flos               =    13582GF
  train_loss               =     0.5079
  train_runtime            = 0:04:13.41
  train_samples_per_second =     35.515
  train_steps_per_second   =     35.515
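
The throughput figures follow from the training configuration. As a rough check, assuming a single device and no gradient accumulation, the number of optimizer steps is the number of training rows divided by the batch size, times the number of epochs. With a batch size of 1, steps per second and samples per second coincide, which is why both report the same value.

```python
import math

num_examples = 3000       # rows in the train split
batch_size = 1            # per_device_train_batch_size
num_epochs = 3
train_runtime = 253.4151  # seconds, from the metrics above

# Assumes a single device and no gradient accumulation.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
steps_per_second = total_steps / train_runtime

print(total_steps)                 # 9000
print(round(steps_per_second, 3))  # 35.515
```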


In [129]:
print("*** Evaluation ***")
metrics = trainer.evaluate(eval_dataset=valid_dataset)
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)

*** Evaluation ***


***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.9109
  eval_loss               =      0.431
  eval_runtime            = 0:00:00.33
  eval_samples_per_second =     778.22
  eval_steps_per_second   =      99.54
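
The evaluation numbers can be read back in terms of counts. A small sketch, assuming a single device and the default `per_device_eval_batch_size` of 8 (the notebook does not set it): the rounded accuracy on 258 validation rows is consistent with 235 correct predictions, and the validation set is processed in 33 batches.

```python
import math

num_valid = 258         # rows in the valid split
eval_accuracy = 0.9109  # reported above, rounded to four decimals
eval_batch_size = 8     # TrainingArguments default, assuming one device

# Nearest whole number of correct predictions consistent with the rounded accuracy.
num_correct = round(eval_accuracy * num_valid)
num_eval_steps = math.ceil(num_valid / eval_batch_size)

print(num_correct)                        # 235
print(round(num_correct / num_valid, 4))  # 0.9109
print(num_eval_steps)                     # 33
```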
