<a href="https://colab.research.google.com/github/simecek/2022-09-12-deep-learning/blob/main/03_Genomic_sequence_classification_with_pretrained_dLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -qq transformers datasets


## Dataset

We will use one of [the genomic benchmarks](https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks) that has already been preprocessed into a Hugging Face (HF) dataset and uploaded to the HF Hub.

In [None]:
from datasets import load_dataset

dds = load_dataset("simecek/human_nontata_promoters")
dds




DatasetDict({
    test: Dataset({
        features: ['labels', 'seq'],
        num_rows: 9034
    })
    train: Dataset({
        features: ['labels', 'seq'],
        num_rows: 27097
    })
})

In [None]:
ds = dds['train']
ds[0]

{'labels': 0,
 'seq': 'ACAGATTCAGGATGTCCTGTCGGGGCATGGACCCTGGAAAGCTGCGGACACCAGGAGGGCAGGCAAGAGAGTCTCATCTCTTGCTCCCTAGGAGCTATGAGTTGAGGGCGCCGTCTGAGCAGGAGGGACGGACGGGTGCCCAGGGTTTGAGGAAAGAGGGGTGTGGGAAGGACGCATGCTAGAACTTCAGAGCAGTTCAGCAGGTGCAGAATGGGAGTTATCATGGGGACTGTGGGAGAAGGGGCGGTGGG'}

## Tokenization

Neural networks need numerical input. A tokenizer splits a sequence of letters into a sequence of tokens, and each token is then mapped to an integer ID from a fixed vocabulary.

In [None]:
from transformers import AutoTokenizer

# DNABERT tokenizer that operates on k-mers (k=6)
tokenizer = AutoTokenizer.from_pretrained("armheb/DNA_bert_6")

tokenizer("ACCTAG GTACGG")['input_ids']

[2, 664, 3380, 3]

In [None]:
def kmers_stride1(s, k=6):
    return [s[i:i + k] for i in range(0, len(s)-k+1)]

def tok_func(x): return tokenizer(" ".join(kmers_stride1(x["seq"])))

# example
tok_func({'seq': 'ATGGAAAGAGGCACCATTCT'})['input_ids']

[2,
 501,
 1989,
 3848,
 3089,
 56,
 212,
 835,
 3325,
 999,
 3983,
 3629,
 2214,
 650,
 2587,
 2142,
 3]
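As a sanity check on the output above, here is a minimal sketch (plain Python, no tokenizer download needed) of how many tokens the stride-1 k-mer split should produce. A sequence of length n yields n - k + 1 overlapping k-mers; the extra two IDs at the start and end of the list (2 and 3) are presumably the tokenizer's special start and end tokens, which is an assumption based on the printed output rather than on the tokenizer's documentation.

```python
def kmers_stride1(s, k=6):
    """Return all overlapping k-mers of s, moving one base at a time."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]

seq = "ATGGAAAGAGGCACCATTCT"  # the 20-base example used above
kmers = kmers_stride1(seq)

# 20 bases and k=6 give 20 - 6 + 1 = 15 k-mers
print(len(seq), len(kmers))
print(kmers[0], kmers[1], kmers[-1])

# Assumption: the tokenizer adds one special token on each side,
# so the expected number of input IDs is len(kmers) + 2.
expected_ids = len(kmers) + 2
print(expected_ids)
```

The 17 IDs printed by `tok_func` match this count: 15 k-mers plus two special tokens.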

Now let us tokenize our dataset!

In [None]:
tok_ds = dds.map(tok_func, batched=False, remove_columns=['seq'])
tok_ds


DatasetDict({
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9034
    })
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 27097
    })
})

## Pre-trained model (on Human genome)

In [None]:
from transformers import AutoModelForSequenceClassification

model_cls = AutoModelForSequenceClassification.from_pretrained("simecek/DNADebertaK6b", num_labels=2)


Some weights of the model checkpoint at simecek/DNADebertaK6b were not used when initializing DebertaForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaForSequenceClassification were not initialized from the model checkpoint at simecek/DNA

## Training

In [None]:
from transformers import TrainingArguments, Trainer

EPOCHS = 1
BATCH_SIZE = 8
LEARNING_RATE = 8e-5

args = TrainingArguments('outputs', learning_rate=LEARNING_RATE, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
            evaluation_strategy="epoch", per_device_train_batch_size=BATCH_SIZE, per_device_eval_batch_size=BATCH_SIZE*2,
            num_train_epochs=EPOCHS, weight_decay=0.01, save_steps=100000, report_to='none')

PyTorch: setting up devices


In [None]:
from datasets import load_metric
import numpy as np

accuracy_metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy_metric.compute(predictions=predictions, references=labels)

trainer = Trainer(model_cls, args, train_dataset=tok_ds['train'], eval_dataset=tok_ds['test'],
                  tokenizer=tokenizer, compute_metrics=compute_metrics)

Using cuda_amp half precision backend
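To make `compute_metrics` concrete, the following toy sketch reproduces its logic without any library: take the index of the largest logit in each row as the predicted class, then report the fraction of predictions that match the labels. The helper name `argmax_accuracy` and the toy logits are hypothetical and only serve as illustration; the notebook itself uses `np.argmax` and the `accuracy` metric.

```python
def argmax_accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    # Index of the largest value in each row (same idea as np.argmax(..., axis=1))
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Four made-up examples with two classes each
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 0.2], [1.5, 1.4]]
toy_labels = [0, 1, 0, 1]

result = argmax_accuracy(toy_logits, toy_labels)
print(result)
```

Here the predicted classes are 0, 1, 1, 0, so two of the four labels are matched.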


In [None]:
trainer.train()

***** Running training *****
  Num examples = 27097
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 3388


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6617,0.569398,0.735776


***** Running Evaluation *****
  Num examples = 9034
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=3388, training_loss=0.5854237093694658, metrics={'train_runtime': 189.6541, 'train_samples_per_second': 142.876, 'train_steps_per_second': 17.864, 'total_flos': 1738463299873440.0, 'train_loss': 0.5854237093694658, 'epoch': 1.0})
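The step count in the log above, and the learning-rate schedule requested in `TrainingArguments` (`warmup_ratio=0.1`, `lr_scheduler_type='cosine'`), can be reproduced by hand. The sketch below assumes a single device, no gradient accumulation, a linear warmup from zero, and a half-cosine decay to zero afterwards; the function name `cosine_lr_with_warmup` is hypothetical and the code is an approximation of the schedule, not the library's implementation.

```python
import math

def cosine_lr_with_warmup(step, total_steps, base_lr, warmup_ratio):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

NUM_EXAMPLES = 27097   # size of the training split
BATCH_SIZE = 8
EPOCHS = 1
LEARNING_RATE = 8e-5

# The last, partially filled batch still counts as one step
total_steps = math.ceil(NUM_EXAMPLES / BATCH_SIZE) * EPOCHS
warmup_steps = math.ceil(total_steps * 0.1)
print(total_steps, warmup_steps)

for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, cosine_lr_with_warmup(step, total_steps, LEARNING_RATE, 0.1))
```

Under these assumptions the learning rate starts at zero, peaks at `LEARNING_RATE` when warmup ends, and returns to zero at the final step.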

## Uploading model to HF Hub

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
# model_cls.push_to_hub("my_new_model_for_promoter_classification")
# the same `push_to_hub` method works for datasets