<a href="https://colab.research.google.com/github/ML-Bioinfo-CEITEC/ECCB2022/blob/main/notebooks/03_Transformers_and_transfer_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basic transformer

## Setup & Data exploration

### Install libraries

In [112]:
!pip install transformers datasets tokenizers --quiet

### Load the dataset

In [113]:
from datasets import load_dataset

DATASET_NAME = "simecek/human_nontata_promoters"

# take a small portion of the dataset to keep the runtime short:
# the first and the last 500 samples, because this specific dataset is ordered (positive, negative samples)
dataset_train = load_dataset(DATASET_NAME, split='train[:500]+train[-500:]')
dataset_train



Dataset({
    features: ['labels', 'seq'],
    num_rows: 1000
})
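
The split expression above behaves much like Python slicing on an ordered list. The following is a minimal sketch on a toy label list (the helper `head_and_tail` and the toy data are hypothetical, not part of the dataset or the `datasets` API). It illustrates why taking only the head of a label-ordered dataset would yield a single class, while head plus tail yields both.

```python
from collections import Counter

def head_and_tail(items, n):
    """Mimic split='train[:n]+train[-n:]' on a plain Python list."""
    return items[:n] + items[-n:]

# toy stand-in for a dataset whose samples are ordered by label
toy_labels = [0] * 10 + [1] * 10

head_only = toy_labels[:6]
balanced = head_and_tail(toy_labels, 3)

print(Counter(head_only))  # only one class is present
print(Counter(balanced))   # both classes, three samples each
```
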

In [114]:
dataset_test = load_dataset(DATASET_NAME, split='test[:1000]+test[-1000:]')
dataset_test



Dataset({
    features: ['labels', 'seq'],
    num_rows: 2000
})

In [115]:
# one training sample sequence and its label
dataset_train[0]

{'labels': 0,
 'seq': 'ACAGATTCAGGATGTCCTGTCGGGGCATGGACCCTGGAAAGCTGCGGACACCAGGAGGGCAGGCAAGAGAGTCTCATCTCTTGCTCCCTAGGAGCTATGAGTTGAGGGCGCCGTCTGAGCAGGAGGGACGGACGGGTGCCCAGGGTTTGAGGAAAGAGGGGTGTGGGAAGGACGCATGCTAGAACTTCAGAGCAGTTCAGCAGGTGCAGAATGGGAGTTATCATGGGGACTGTGGGAGAAGGGGCGGTGGG'}

## Data preprocessing

### Tokenization

You can find useful models and their respective tokenizers right on the Hugging Face repository.

You can use the premade tokenizers even for your own model if they fit your task/purpose.

If you don't find an existing tokenizer for your use case, the Hugging Face documentation contains simple tutorials on how to create your own.

In [116]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("armheb/DNA_bert_6")

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--armheb--DNA_bert_6/snapshots/a79a8fd96ad172f964a4dbef3f4d7545a5034baa/config.json
Model config BertConfig {
  "_name_or_path": "armheb/DNA_bert_6",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_ids": 0,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_rnn_layer": 1,
  "output_past": true,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "rnn": "lstm",
  "rnn_dropout": 0.0,
  "rnn_hidden": 768,
  "split": 10,
  "transformers_version": "4.22.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 4101
}

loading file vocab.txt from cache at /ro

The tokenizer expects a string of tokens separated by spaces.

Therefore, we slide a window along the sequence and extract overlapping k-mers with k=6 (6 characters long).

In [117]:
def kmers(s, k=6):
  return [s[i:i + k] for i in range(0, len(s)-k+1)]

Our k-mers function turns a sequence into a list of k-mers. 

In [118]:
example = 'ATGGAAAGAGGCACCATTCT'
print(example)

example = kmers(example)
print(example)

ATGGAAAGAGGCACCATTCT
['ATGGAA', 'TGGAAA', 'GGAAAG', 'GAAAGA', 'AAAGAG', 'AAGAGG', 'AGAGGC', 'GAGGCA', 'AGGCAC', 'GGCACC', 'GCACCA', 'CACCAT', 'ACCATT', 'CCATTC', 'CATTCT']
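
Because the window moves one character at a time, a sequence of length n yields n - k + 1 k-mers, and a sequence shorter than k yields none. A small sketch to check this (the helper `expected_kmer_count` is a hypothetical name introduced only for illustration):

```python
def kmers(s, k=6):
    """Return all overlapping k-mers of s (sliding window, step 1)."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def expected_kmer_count(seq_len, k=6):
    """Number of windows of width k that fit into a sequence of length seq_len."""
    return max(seq_len - k + 1, 0)

example = "ATGGAAAGAGGCACCATTCT"
print(len(example), len(kmers(example)))  # 20 15
print(kmers("ACGT"))                      # [] because the sequence is shorter than k
```
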


We are not done yet. 

Now we have to concatenate the k-mers into a single string, separated by spaces.

In [119]:
example = 'ATGGAAAGAGGCACCATTCT'
print(example)

example_kmers = " ".join(kmers(example))
print(example_kmers)

ATGGAAAGAGGCACCATTCT
ATGGAA TGGAAA GGAAAG GAAAGA AAAGAG AAGAGG AGAGGC GAGGCA AGGCAC GGCACC GCACCA CACCAT ACCATT CCATTC CATTCT


Our DNA sequence is now transformed into k-mers (k=6).

That is how we transform every DNA sequence sample in our dataset before we input it into the tokenizer.

In [120]:
def tokenization(x): 
  return tokenizer(" ".join(kmers(x["seq"])))

In [121]:
example = {'seq': 'ATGGAAAGAGGCACCATTCT'}
print(example, '\n')

print('- middle step that happens inside as we saw above:')
print(example_kmers, '\n')

example = tokenization(example)
print(example)

{'seq': 'ATGGAAAGAGGCACCATTCT'} 

- middle step that happens inside as we saw above:
ATGGAA TGGAAA GGAAAG GAAAGA AAAGAG AAGAGG AGAGGC GAGGCA AGGCAC GGCACC GCACCA CACCAT ACCATT CCATTC CATTCT 

{'input_ids': [2, 501, 1989, 3848, 3089, 56, 212, 835, 3325, 999, 3983, 3629, 2214, 650, 2587, 2142, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [122]:
tokenizer.decode(example['input_ids'])

'[CLS] ATGGAA TGGAAA GGAAAG GAAAGA AAAGAG AAGAGG AGAGGC GAGGCA AGGCAC GGCACC GCACCA CACCAT ACCATT CCATTC CATTCT [SEP]'
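
To make the numbers less mysterious: the IDs printed above are consistent with a vocabulary of five special tokens followed by all 4^6 = 4096 possible 6-mers, enumerated with the letter order A, T, C, G (5 + 4096 = 4101, the `vocab_size` in the config). The sketch below rebuilds such a vocabulary by hand. The letter order and the positions of `[UNK]` and `[MASK]` are assumptions inferred from the output, not taken from the tokenizer's documentation, and `build_vocab` / `encode` are hypothetical helpers, not the tokenizer's API.

```python
from itertools import product

# assumed order of special tokens; [PAD]=0, [CLS]=2 and [SEP]=3 match the outputs above
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def build_vocab(k=6, alphabet="ATCG"):
    """Special tokens first, then every k-mer over the alphabet in product order."""
    tokens = SPECIAL_TOKENS + ["".join(p) for p in product(alphabet, repeat=k)]
    return {token: idx for idx, token in enumerate(tokens)}

def encode(kmer_string, vocab):
    """Look up each space-separated k-mer and wrap the result in [CLS] ... [SEP]."""
    ids = [vocab["[CLS]"]]
    ids += [vocab.get(token, vocab["[UNK]"]) for token in kmer_string.split()]
    ids.append(vocab["[SEP]"])
    return ids

vocab = build_vocab()
print(len(vocab))                               # 4101
print(encode("ATGGAA TGGAAA GGAAAG", vocab))    # [2, 501, 1989, 3848, 3]
```
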

We saw how to tokenize one sample.

An easy way to tokenize your whole dataset is to use the map() function.

In [123]:
# map() takes a function and applies it to every sample in the dataset 

dataset_train_tokenized = dataset_train.map(tokenization, batched=False)
dataset_test_tokenized = dataset_test.map(tokenization, batched=False)
dataset_train_tokenized



Dataset({
    features: ['labels', 'seq', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 1000
})
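
Conceptually, with `batched=False` the function is called once per sample, and the columns it returns are added to that sample, which is why `input_ids`, `token_type_ids` and `attention_mask` appear next to `labels` and `seq` above. A plain-Python sketch of that idea (`toy_map` and `add_kmer_string` are hypothetical stand-ins for illustration; this is not how the `datasets` library is implemented):

```python
def kmers(s, k=6):
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def toy_map(rows, fn):
    """Call fn on every row and merge the returned columns into a copy of the row."""
    return [{**row, **fn(row)} for row in rows]

def add_kmer_string(row):
    # stand-in for tokenization(): returns new columns for one sample
    return {"kmer_string": " ".join(kmers(row["seq"]))}

toy_rows = [{"labels": 0, "seq": "ATGGAAAG"}, {"labels": 1, "seq": "CCATTCT"}]
mapped = toy_map(toy_rows, add_kmer_string)
print(mapped[0])
# {'labels': 0, 'seq': 'ATGGAAAG', 'kmer_string': 'ATGGAA TGGAAA GGAAAG'}
```
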

In [124]:
# this is what later goes into our network as one sample
print(dataset_train_tokenized[0]['input_ids'])

[2, 566, 2250, 795, 3165, 360, 1428, 1601, 2294, 972, 3874, 3195, 479, 1902, 3500, 1698, 2683, 2528, 1908, 3524, 1796, 3075, 4093, 4070, 3980, 3620, 2177, 503, 1999, 3887, 3246, 684, 2724, 2689, 2549, 1989, 3848, 3091, 62, 236, 931, 3712, 2548, 1985, 3831, 3021, 3879, 3215, 557, 2216, 660, 2625, 2296, 980, 3908, 3331, 1021, 4072, 3988, 3651, 2301, 997, 3976, 3601, 2104, 209, 824, 3282, 827, 3294, 875, 3485, 1638, 2443, 1566, 2155, 414, 1642, 2460, 1635, 2430, 1515, 1951, 3695, 2478, 1705, 2712, 2644, 2369, 1272, 979, 3902, 3305, 918, 3660, 2337, 1144, 466, 1850, 3292, 865, 3448, 1492, 1860, 3331, 1024, 4083, 4031, 3824, 2994, 3771, 2782, 2924, 3489, 1656, 2515, 1853, 3304, 916, 3649, 2296, 980, 3908, 3329, 1015, 4048, 3892, 3265, 759, 3024, 3892, 3268, 770, 3068, 4067, 3967, 3567, 1965, 3752, 2708, 2628, 2306, 1018, 4058, 3932, 3425, 1400, 1492, 1857, 3317, 965, 3848, 3089, 56, 212, 836, 3332, 1026, 4092, 4066, 3964, 3556, 1924, 3585, 2037, 4040, 3860, 3137, 247, 976, 3891, 3261, 742, 

### Model

We use a model that was pre-trained on the human genome.

You can find models like this on the Hugging Face web repository.


In [125]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("simecek/DNADebertaK6b", num_labels=2)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--simecek--DNADebertaK6b/snapshots/d045450497234887e1e211066effec712575374e/config.json
Model config DebertaConfig {
  "_name_or_path": "simecek/DNADebertaK6b",
  "architectures": [
    "DebertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-07,
  "max_position_embeddings": 512,
  "max_relative_positions": -1,
  "model_type": "deberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 0,
  "pooler_dropout": 0,
  "pooler_hidden_act": "gelu",
  "pooler_hidden_size": 768,
  "pos_att_type": null,
  "position_biased_input": true,
  "relative_attention": false,
  "torch_dtype": "float32",
  "transformers_version": "4.22.1",
  "type_vocab_size": 0,
  "vocab_size": 4101
}

loading weights file pytorch_model.bin from cach

In [126]:
# # EXERCISE 1 --help
# from transformers import DebertaConfig, DebertaForSequenceClassification

# # First we need the model's configuration - the default settings are just fine for our goal
# model_config = DebertaConfig(vocab_size=len(tokenizer.vocab), max_position_embeddings=512, num_hidden_layers=6, num_labels=2)
# model = DebertaForSequenceClassification(config = model_config)

In [127]:
from transformers import TrainingArguments, Trainer

BATCH_SIZE = 32
LEARNING_RATE = 1e-5
EPOCHS = 3

training_arguments = TrainingArguments(
        'outputs', 
        learning_rate=LEARNING_RATE, 
        fp16=True,
        evaluation_strategy="epoch", 
        per_device_train_batch_size=BATCH_SIZE, 
        per_device_eval_batch_size=BATCH_SIZE,
        num_train_epochs=EPOCHS, 
    )

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [128]:
from datasets import load_metric
import numpy as np

# load the metric once instead of on every evaluation call
metric = load_metric("accuracy")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    # the predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
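
What `compute_metrics` does can be checked by hand on a few toy values: each row of logits holds one score per class, the predicted class is the index of the largest score, and accuracy is the fraction of predictions that match the labels. A dependency-free sketch (the helper names and the toy numbers are made up for illustration):

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_from_logits(logits, labels):
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# four samples, two classes; made-up scores
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 0.5], [1.2, 1.1]]
toy_labels = [0, 1, 0, 0]

print(accuracy_from_logits(toy_logits, toy_labels))  # {'accuracy': 0.75}
```
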

In [129]:
trainer = Trainer(
    model=model,
    args=training_arguments,
    train_dataset=dataset_train_tokenized,
    eval_dataset=dataset_test_tokenized,
    compute_metrics=compute_metrics,
)
trainer

Using cuda_amp half precision backend


<transformers.trainer.Trainer at 0x7f3e600bde50>

In [130]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DebertaForSequenceClassification.forward` and have been ignored: seq. If seq are not expected by `DebertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1000
  Num Epochs = 3
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 96
  query_layer = query_layer / torch.tensor(scale, dtype=query_layer.dtype)


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.614922,0.7795
2,No log,0.546625,0.7915
3,No log,0.53164,0.7945


The following columns in the evaluation set don't have a corresponding argument in `DebertaForSequenceClassification.forward` and have been ignored: seq. If seq are not expected by `DebertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 32
  query_layer = query_layer / torch.tensor(scale, dtype=query_layer.dtype)

TrainOutput(global_step=96, training_loss=0.5887002944946289, metrics={'train_runtime': 70.094, 'train_samples_per_second': 42.8, 'train_steps_per_second': 1.37, 'total_flos': 192471118560000.0, 'train_loss': 0.5887002944946289, 'epoch': 3.0})
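
The numbers in the log can be reproduced with a little arithmetic: 1000 samples in batches of 32 give 32 steps per epoch (the last batch is smaller), so 3 epochs take 96 optimization steps. The sketch below assumes a single device and no gradient accumulation, as reported in the log; the function name is hypothetical.

```python
import math

def total_optimization_steps(num_examples, batch_size, epochs):
    """Steps per epoch (last partial batch included) times the number of epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

steps = total_optimization_steps(num_examples=1000, batch_size=32, epochs=3)
runtime_s = 70.094  # 'train_runtime' from the output above

print(steps)                                # 96
print(round(3 * 1000 / runtime_s, 1))       # 42.8 samples per second
print(round(steps / runtime_s, 2))          # 1.37 steps per second
```
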

# Exercise
1.   Replace our current model with a model that is **NOT** pre-trained and compare the results
 - you will find a hint in the code above
2.   Our DNA-pre-trained model is not the only one you can find on the Hugging Face repository. Try replacing it with a different pre-trained model
 - for example, this one: [https://huggingface.co/armheb/DNA_bert_6](https://huggingface.co/armheb/DNA_bert_6)




