# GENA-LM Sequence classification example

## Install requirements

In [2]:
! pip install torch --index-url https://download.pytorch.org/whl/cu118

Looking in indexes: https://download.pytorch.org/whl/cu118


In [3]:
! pip install transformers[torch] datasets

Collecting transformers[torch]
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting datasets
  Downloading datasets-2.13.1-py3-none-any.whl (486 kB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers[torch])
  Downloading huggingface_hub-0.16.2-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers[torch])
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from tra

In [4]:
import torch
torch.cuda.is_available()

True

## Get pre-trained GENA-LM model
The pre-trained weights cover only the encoder; the classification head will be randomly initialized and has to be trained.

Table with available models:
https://drive.google.com/uc?export=view&id=1R2LF4POMcbMgla0J31ttrVHzqT624dlh

### Pre-trained GENA-LM for Masked Language Modeling

In [5]:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t')
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t', trust_remote_code=True)

Downloading (…)okenizer_config.json:   0%|          | 0.00/46.0 [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.48M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/694 [00:00<?, ?B/s]

Downloading (…)ain/modeling_bert.py:   0%|          | 0.00/97.1k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/AIRI-Institute/gena-lm-bert-base-t2t:
- modeling_bert.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


Downloading pytorch_model.bin:   0%|          | 0.00/541M [00:00<?, ?B/s]

### Pre-trained GENA-LM to fine-tune on a sequence classification task

#### With HuggingFace

In [6]:
gena_module_name = model.__class__.__module__
print(gena_module_name)

transformers_modules.AIRI-Institute.gena-lm-bert-base-t2t.21343b983208dd7bd3430f5a0d812ab6131faa7d.modeling_bert


In [7]:
import importlib
# available class names:
# - BertModel, BertForPreTraining, BertForMaskedLM, BertForNextSentencePrediction,
# - BertForSequenceClassification, BertForMultipleChoice, BertForTokenClassification,
# - BertForQuestionAnswering
# check https://huggingface.co/docs/transformers/model_doc/bert
cls = getattr(importlib.import_module(gena_module_name), 'BertForSequenceClassification')
cls

transformers_modules.AIRI-Institute.gena-lm-bert-base-t2t.21343b983208dd7bd3430f5a0d812ab6131faa7d.modeling_bert.BertForSequenceClassification
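
The lookup in the cell above is ordinary dynamic import: `importlib.import_module` returns the module object for a dotted path, and `getattr` retrieves a class from it by name. A minimal sketch of the same pattern on a standard-library module (`resolve_class` is a hypothetical helper, and `collections` / `OrderedDict` are stand-ins chosen for illustration, not part of GENA-LM):

```python
import importlib

def resolve_class(module_name, class_name):
    """Return the class `class_name` defined in the module `module_name`."""
    module = importlib.import_module(module_name)
    return getattr(module, class_name)

# Stand-in example: look up OrderedDict in the standard-library collections module.
ordered_dict_cls = resolve_class("collections", "OrderedDict")
print(ordered_dict_cls.__name__)
```

With GENA-LM, `module_name` is the `gena_module_name` printed above, which includes the commit hash of the downloaded code.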

In [8]:
model = cls.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t', num_labels=2)
model.classifier

Some weights of the model checkpoint at AIRI-Institute/gena-lm-bert-base-t2t were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at AIRI-Ins

Linear(in_features=768, out_features=2, bias=True)

#### Cloning the GENA-LM repo




In [9]:
! git clone https://github.com/AIRI-Institute/GENA_LM.git
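# `! cd` runs in a subshell and does not change the notebook's working directory; use the `%cd` magic instead
%cd GENA_LM/src/gena_lm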

Cloning into 'GENA_LM'...
remote: Enumerating objects: 58, done.
remote: Counting objects: 100% (58/58), done.
remote: Compressing objects: 100% (44/44), done.
remote: Total 58 (delta 14), reused 42 (delta 6), pack-reused 0
Unpacking objects: 100% (58/58), 21.52 MiB | 8.07 MiB/s, done.


Alternatively, download only `modeling_bert.py` from https://github.com/AIRI-Institute/GENA_LM/tree/main/src/gena_lm

In [10]:
! wget https://raw.githubusercontent.com/AIRI-Institute/GENA_LM/main/src/gena_lm/modeling_bert.py

--2023-07-06 07:35:15--  https://raw.githubusercontent.com/AIRI-Institute/GENA_LM/main/src/gena_lm/modeling_bert.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 94428 (92K) [text/plain]
Saving to: ‘modeling_bert.py’


2023-07-06 07:35:16 (6.61 MB/s) - ‘modeling_bert.py’ saved [94428/94428]



In [40]:
from modeling_bert import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t', num_labels=2)
model = model.cuda()
model.classifier

Some weights of the model checkpoint at AIRI-Institute/gena-lm-bert-base-t2t were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at AIRI-Ins

Linear(in_features=768, out_features=2, bias=True)

## Get sequence classification dataset

In [9]:
from datasets import load_dataset
# load ~12k samples from the promoter prediction dataset and hold out 10% as a test split
dataset = load_dataset("yurakuratov/example_promoters_2k")['train'].train_test_split(test_size=0.1)

Downloading and preparing dataset csv/yurakuratov--example_promoters_2k to /root/.cache/huggingface/datasets/yurakuratov___csv/yurakuratov--example_promoters_2k-9946f83e0515e5af/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/23.7M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/yurakuratov___csv/yurakuratov--example_promoters_2k-9946f83e0515e5af/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [10]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sequence', 'promoter_presence'],
        num_rows: 10656
    })
    test: Dataset({
        features: ['sequence', 'promoter_presence'],
        num_rows: 1184
    })
})

In [11]:
dataset['train'][0]

{'sequence': 'TAGATTCTCACCCCTTGTTGTGTATTGTAATTATCTAGGGAACTTAAAAGGTACTGAGGCATGGGCCCCAATGGAATCATTATCCCTGGGGGGTGGGGTGGGCCCTAGACCTGGTATTTTCAAAAACTTTCAAAAATTATTCTAACACATACGCAGATTAAACTGGTTGCAGACATTCAGGAAGCAAAAATATGACAAAATTACCCTAAAAGTAAAATAAAATAAGATTTTTGGCTCAAAGTGCAGTAATCCTGTTGTCTGGCCACCTTATTTGTGTCCTATTCATCAAATAAGGGGATTTGTTTTGTTTTGTTTTTTGAGAACCGCAGCAAGAAAAATTGGTCATGCCCAGGCAAATCTTCTAGGTGAGTTCTAAAGATAAGTCAAGTGGCCATAAAACACTTCTACAGCTAATATTTGTTGAGAGCATGTTCTGAGCCATGTGCTATGCAGAATACATTTACTATACATTGCCTCACTTAATCCTCTCAACAATTCTGAGGCATTATTCTTCTTCTGGATTTACAAAAGAAGAAAGAGAGGCACAAAGCAATTGCTTAACTCGCTCAGCACCTGACCACTAGTTAGAAGTGAAGCTGGGAGATTTGAATGCAAGCAGGTAGACTGCAGAATCTGCCTGTTAATCATCATGTTATATCTCCTTTACACTGTGACAGTTAAGATAAAATTACAAAAATTTGCGAGAGCAGTAGTTCTCAATCTTGCCGGCATAGCAGAATTGTTGGGAGAGCTTTGAAAAAAAATCCTTTGCTCATGTCATATCATGAAACAATTTAATCCATATCTGATATGGTTTGGCTGTGTGCCCATCGAAATCTCATCTTGAATTGTAGCTGCCATTATTCCCACGTGTTGTGGGAGGGACCCAGTGGGAGACAATTGAATCATGGGGGCAGTTTCCCTCATACTGTTCTGGTGGTAGTGAATAAGTCTCACGAGATCTGATGGTTTTATAAGGGGAAACC

In [12]:
print('# base pairs: ', len(dataset['train'][0]['sequence']))

# base pairs:  2000


In [13]:
print('tokens: ', ' '.join(tokenizer.tokenize(dataset['train'][0]['sequence'])))

tokens:  TAGATTC TCACCCC TTGTTG TGTATTG TAATTATC TAGGGAAC TTAAAAGG TAC TGAGGC ATGGGCCCC AATGGAATC ATTATCCC TGGGGGG TGGGG TGGGCCC TAG ACCTGG TATTTTC AAAAAC TTTC AAAAATT ATTCTAAC ACATAC GC AGATT AAACTGG TTGC AGACATTC AGGAAGC AAAAATATG ACAAAA TTACCC TAAAAG TAAAATAAAA TAAGATT TTTGGC TCAAAGTGC AGTAA TCCTGTTG TCTGGCC ACCTTATT TGTGTCC TATTC ATCAAATAA GGGG ATTTG TTTTGTTTTG TTTTTTG AGAACC GCAGC AAGAAAA ATTGG TCATG CCCAGGC AAATC TTCTAGG TGAG TTCTAA AGATAAG TCAAG TGGCC ATAAAAC ACTTCTAC AGCTAA TATTTGTTG AGAGC ATGTTCTG AGCC ATGTGC TATGC AGAATAC ATTTAC TATAC ATTGCC TCACTTAA TCCTCTC AAC AATTCTG AGGC ATTATTC TTCTTCTGG ATTTAC AAAAG AAGAA AGAGAGGC ACAAAGC AATTGC TTAAC TCGC TCAGCACC TGACCAC TAG TTAGAAG TGAAGC TGGGAGATT TGAATGC AAGC AGGTAG ACTGC AGAA TCTGCC TGTTAA TCATC ATG TTATATC TCCTTTAC ACTGTG ACAGTTAAG ATAAAA TTAC AAAAATT TGCG AGAGC AGTAG TTCTCAATC TTGCC GGC ATAGC AGAATTG TTGGG AGAGC TTTG AAAAAAAA TCCTTTGC TCATG TCATATC ATGAAAC AATTTAA TCCATATC TGATATGG TTTGGCTGTG TGCCC ATCG AAATCTCATCTTG AATTG TAGCT

In [14]:
print('# tokens: ', len(tokenizer.tokenize(dataset['train'][0]['sequence'])))

# tokens:  300
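
These two counts give a rough compression rate for the BPE tokenizer on this sample, which helps when choosing a `max_length` for truncation. The arithmetic below is a back-of-the-envelope estimate: it assumes the ratio seen on this one sequence holds on average and that the tokenizer adds two special tokens (`[CLS]` and `[SEP]`) to each input. `approx_coverage_bp` is a hypothetical helper.

```python
num_base_pairs = 2000   # length of the sample sequence printed above
num_tokens = 300        # BPE tokens produced for the same sequence

bp_per_token = num_base_pairs / num_tokens

def approx_coverage_bp(max_length, special_tokens=2):
    """Estimate how many base pairs fit into `max_length` tokens."""
    return (max_length - special_tokens) * bp_per_token

print(f"{bp_per_token:.2f} bp per token")
print(f"max_length=128 covers roughly {approx_coverage_bp(128):.0f} bp")
print(f"max_length=512 covers roughly {approx_coverage_bp(512):.0f} bp")
```

Under these assumptions, a short `max_length` sees well under half of each 2000 bp sequence, which is worth keeping in mind when interpreting accuracy.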


### Dataset preprocessing
This follows the HuggingFace text classification guide: https://huggingface.co/docs/transformers/tasks/sequence_classification

In [15]:
def preprocess_labels(example):
  example['label'] = example['promoter_presence']
  return example

dataset = dataset.map(preprocess_labels)

Map:   0%|          | 0/10656 [00:00<?, ? examples/s]

Map:   0%|          | 0/1184 [00:00<?, ? examples/s]

In [16]:
def preprocess_function(examples):
  # Truncate on the right only; for some tasks, symmetric truncation (from both ends) is more reasonable.
  # max_length is set to 128 to make experiments faster.
  return tokenizer(examples["sequence"], truncation=True, max_length=128)
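
Symmetric truncation, mentioned in the comment above, keeps the middle of a sequence and drops an equal number of tokens from each end. A minimal sketch on a plain list of tokens; `center_crop` is a hypothetical helper, not part of `transformers`, and it would have to be applied to the token list before special tokens are added:

```python
def center_crop(tokens, max_tokens):
    """Keep the middle `max_tokens` items, dropping the excess evenly from both ends."""
    excess = len(tokens) - max_tokens
    if excess <= 0:
        return list(tokens)   # already short enough
    left = excess // 2        # when the excess is odd, the extra token is dropped on the right
    return list(tokens[left:left + max_tokens])

print(center_crop(list("ABCDEFGHIJ"), 4))  # ['D', 'E', 'F', 'G']
```

Whether the center is the right region to keep depends on where the signal of interest sits in each sequence, so this is a design choice rather than a default.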

In [17]:
tokenized_dataset = dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/10656 [00:00<?, ? examples/s]

Map:   0%|          | 0/1184 [00:00<?, ? examples/s]

Now create a batch of examples using `DataCollatorWithPadding`. It is more efficient to pad sequences dynamically to the longest length in each batch during collation than to pad the whole dataset to the maximum length.

In [18]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
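
To see why dynamic padding helps, the sketch below pads a toy batch to its own longest sequence and counts the padded positions, compared with padding to a fixed maximum. It only illustrates the idea and is not the library's implementation; `pad_batch`, `count_padding`, and the pad id `0` are assumptions made for the example.

```python
def pad_batch(batch, pad_id=0, pad_to=None):
    """Right-pad every sequence in `batch` to `pad_to` (default: longest in the batch)."""
    target = pad_to if pad_to is not None else max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in batch]
    return input_ids, attention_mask

def count_padding(mask):
    """Number of padded (masked-out) positions in an attention mask."""
    return sum(row.count(0) for row in mask)

batch = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]

_, dynamic_mask = pad_batch(batch)             # pad to the longest sequence in the batch
_, fixed_mask = pad_batch(batch, pad_to=128)   # pad to a fixed maximum length

print("padded positions, dynamic:", count_padding(dynamic_mask))
print("padded positions, fixed:  ", count_padding(fixed_mask))
```

Every padded position still costs compute in the forward pass, so fewer of them means faster batches.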

In [19]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['sequence', 'promoter_presence', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 10656
    })
    test: Dataset({
        features: ['sequence', 'promoter_presence', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1184
    })
})

## Training

In [20]:
from transformers import TrainingArguments, Trainer
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {'accuracy': (predictions==labels).sum() / len(labels)}

# change training hyperparameters to achieve better quality
training_args = TrainingArguments(
    output_dir="test_run",
    learning_rate=1e-4,
    lr_scheduler_type="constant_with_warmup",
    warmup_ratio=0.1,
    optim='adamw_torch',
    weight_decay=0.0,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    load_best_model_at_end=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


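| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.6893 | 0.62084 | 0.674831 |
| 2 | 0.5864 | 0.519286 | 0.745777 |
| 3 | 0.4804 | 0.535669 | 0.753378 |
| 4 | 0.4071 | 0.616851 | 0.744088 |
| 5 | 0.3599 | 0.664174 | 0.710304 |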
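
The log shows validation loss bottoming out at epoch 2 while accuracy peaks at epoch 3; after that, training loss keeps falling as validation loss rises, which suggests overfitting. Since `load_best_model_at_end=True` is set without `metric_for_best_model`, the `Trainer` should fall back to validation loss, so the epoch 2 checkpoint is the one expected to be restored. The snippet below simply re-derives both choices from the logged numbers:

```python
# (epoch, training loss, validation loss, accuracy) copied from the log above
history = [
    (1, 0.6893, 0.620840, 0.674831),
    (2, 0.5864, 0.519286, 0.745777),
    (3, 0.4804, 0.535669, 0.753378),
    (4, 0.4071, 0.616851, 0.744088),
    (5, 0.3599, 0.664174, 0.710304),
]

best_by_loss = min(history, key=lambda row: row[2])       # lowest validation loss
best_by_accuracy = max(history, key=lambda row: row[3])   # highest accuracy

print("lowest validation loss: epoch", best_by_loss[0])
print("highest accuracy:       epoch", best_by_accuracy[0])
```

To select checkpoints by accuracy instead, `TrainingArguments` accepts `metric_for_best_model="accuracy"`.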




TrainOutput(global_step=1665, training_loss=0.5046220704957888, metrics={'train_runtime': 1201.8009, 'train_samples_per_second': 44.333, 'train_steps_per_second': 1.385, 'total_flos': 3504639257395200.0, 'train_loss': 0.5046220704957888, 'epoch': 5.0})
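
With `lr_scheduler_type="constant_with_warmup"` and `warmup_ratio=0.1`, the learning rate ramps up linearly over roughly the first 10% of the 1665 optimizer steps and then stays at `1e-4`. The function below is a sketch of that schedule written from its description, not code taken from `transformers`; rounding the warmup length up is an assumption, and `lr_at` is a hypothetical helper.

```python
import math

total_steps = 1665      # global_step reported by TrainOutput
base_lr = 1e-4          # learning_rate from TrainingArguments
warmup_ratio = 0.1
warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumed rounding

def lr_at(step):
    """Learning rate at optimizer step `step` (linear warmup, then constant)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr

for step in (0, 50, warmup_steps, total_steps - 1):
    print(step, lr_at(step))
```

Unlike a linear or cosine schedule, the rate never decays here, which may contribute to the rising validation loss in later epochs.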

## Get predictions from the model on a single example

In [21]:
sample = dataset['test'][0]
x, y = sample['sequence'], sample['label']

In [22]:
x_feat = tokenizer(x, return_tensors='pt')
x_feat.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [23]:
# move sample to gpu and feed to model
for k in x_feat:
  x_feat[k] = x_feat[k].cuda()

model = model.eval()
with torch.no_grad():
  out = model(**x_feat)
out



SequenceClassifierOutput(loss=None, logits=tensor([[-1.4048, -0.0459]], device='cuda:0'), hidden_states=None, attentions=None)

In [24]:
# get class probabilities
prob = torch.softmax(out['logits'], dim=-1)
prob

tensor([[0.2044, 0.7956]], device='cuda:0')
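
The probabilities are the softmax of the two logits. Recomputing them by hand from the rounded logits printed above should agree with the tensor to about four decimal places. This is plain Python for illustration, using the usual max-subtraction for numerical stability; the `softmax` function here is a hand-written stand-in for `torch.softmax`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)                          # subtract the max to avoid overflow
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([-1.4048, -0.0459])              # logits copied from the model output above
print([round(p, 4) for p in probs])
```

The index of the larger probability is the predicted class, which is what `torch.argmax` returns in the next cell.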

In [27]:
# get label
print(f'prediction: {torch.argmax(prob)}, label: {y}')

prediction: 1, label: 1
