<a href="https://colab.research.google.com/github/ML-Bioinfo-CEITEC/pknots_experiments/blob/main/fine-tuning/Model_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model Evaluation

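This notebook evaluates a pretrained sequence classifier for knotted proteins (`simecek/knotted_proteins_demo_model`) on a new dataset and reports accuracy, F1, precision, and recall.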

## 1) Installation & Model loading

In [1]:
!pip install -qq transformers datasets

In [3]:
import torch

# Free cached GPU memory and report the GPU in use (requires a CUDA device)
torch.cuda.empty_cache()
torch.cuda.get_device_name(0)

'Tesla T4'

In [19]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

HF_MODEL = "simecek/knotted_proteins_demo_model"

tokenizer = AutoTokenizer.from_pretrained(HF_MODEL, max_length=1024, truncation=True)
model = AutoModelForSequenceClassification.from_pretrained(HF_MODEL, num_labels=2)


In [5]:
def tokenize_function(s):
    # Insert spaces between residues so that each amino acid becomes its own token
    seq_split = " ".join(s['seq'])
    return tokenizer(seq_split)

tokenize_function({'seq': "B"})

{'input_ids': [2, 27, 3], 'token_type_ids': [0, 0, 0], 'attention_mask': [1, 1, 1]}
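The output above wraps the single residue token in ids 2 and 3, presumably the tokenizer's start and end special tokens, so a 1024-token limit would leave room for 1022 residues. The following is a minimal sketch of the spacing step with an explicit length cap, written in plain Python. The names `space_residues` and `max_residues` are hypothetical and not part of the notebook, and the 1022 figure is an assumption derived from the two special tokens.

```python
def space_residues(seq: str, max_residues: int = 1022) -> str:
    """Return the sequence as space-separated residues, capped at max_residues.

    Assumption: the tokenizer adds two special tokens, so 1022 residues
    fit within a 1024-token input.
    """
    return " ".join(seq[:max_residues])


print(space_residues("MAKSSK"))                  # M A K S S K
print(len(space_residues("A" * 2000).split()))   # 1022
```

Capping the string before tokenization is only one option; passing truncation arguments when calling the tokenizer would be another.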

## 2) Dataset - PF04034 family

We demonstrate the evaluation on the [PF04034](https://www.ebi.ac.uk/interpro/entry/pfam/PF04034/) family. The data were provided by Jakub Kostrzębski as two CSV files: one with knotted (positive) sequences and one with unknotted (negative) sequences.

In [6]:
INPUT_POSITIVE = "PF04034_knotted - PF04034_knotted.csv"
INPUT_NEGATIVE = "PF04034_unknotted_upsampled - unknotted.csv"

In [13]:
import pandas as pd

df_positive = pd.read_csv(INPUT_POSITIVE)
df_negative = pd.read_csv(INPUT_NEGATIVE)
print(len(df_positive), len(df_negative))
# if the dataset is large, you might just want to take a random 10_000 sample

df = pd.concat([df_positive, df_negative], ignore_index=True)
df

1390 1390


| | seq | label |
|---|---|---|
| 0 | MAKSSKGKGSNKQRGAHSSKPSSGHTSTHRHVESKHGANKSGKNFP... | 1 |
| 1 | MSARGKKNKNKSKLLEKRPRHVLCKEQRYSQEMDRGSDKDDEAESA... | 1 |
| 2 | MGRKKQGALVRPGNKGKDKRHRMKGKTLEAFTEDMHTAFEASLHVE... | 1 |
| 3 | MWELGHCDPRRCTGRKLARLGLVRCLRLGHRFGGLVLSPMGSQYVS... | 1 |
| 4 | MGKGKNKDKEGQKGHSNKTSNGHKNRGNHHGKGRQETKYSSVVRQV... | 1 |
| ... | ... | ... |
| 2775 | MGRKKSVRGSGPEGSRSGGRRPQGCPRASRSLDSFSEEVEAALLAS... | 0 |
| 2776 | MFYDKLSDGTYEEFGNETVLISHGRLLPFLVATNPINYGKPCQLSC... | 0 |
| 2777 | QLVAQSGVAVIDCSWARLDETPFGKMRGSHLRLLPYLVAANPVNYG... | 0 |
| 2778 | MDSGDRDRAEELLATFTWGETFAELNEEPLSRYADCADSEAVVAVQ... | 0 |
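For larger inputs, the random sampling mentioned in the code comment can be done per class so that the result stays balanced. The following is a minimal sketch, assuming each frame has `seq` and `label` columns as shown above. The helper `sample_balanced`, the `n_per_class` parameter, and the seed are illustrative choices, not part of the original notebook.

```python
import pandas as pd


def sample_balanced(df_pos, df_neg, n_per_class=5_000, seed=42):
    """Draw up to n_per_class rows from each frame and concatenate them."""
    parts = [
        d.sample(n=min(n_per_class, len(d)), random_state=seed)
        for d in (df_pos, df_neg)
    ]
    return pd.concat(parts, ignore_index=True)


# Toy frames standing in for the real CSV files
toy_pos = pd.DataFrame({"seq": ["MAK", "MSA", "MGR"], "label": 1})
toy_neg = pd.DataFrame({"seq": ["MFY", "QLV"], "label": 0})

sampled = sample_balanced(toy_pos, toy_neg, n_per_class=2)
print(sampled["label"].value_counts().to_dict())
```

Fixing `random_state` keeps the subsample reproducible across runs, which matters when comparing metrics between models.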


In [15]:
from datasets import Dataset

ds = Dataset.from_pandas(df).shuffle()
tds = ds.map(tokenize_function, remove_columns='seq', num_proc=4)
tds.set_format("pt")

tds

      

Dataset({
    features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 2780
})

## 3) Evaluation

In [16]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
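To make the definitions behind these four numbers explicit, the sketch below recomputes them from raw counts in plain Python. It is meant to mirror what `precision_recall_fscore_support(..., average='binary')` reports for the positive class (label 1). The helper `binary_metrics` is hypothetical, and the toy labels are made up for illustration.

```python
def binary_metrics(labels, preds):
    """Accuracy, precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}


toy_labels = [1, 1, 1, 0, 0, 0]
toy_preds = [1, 0, 0, 0, 0, 1]
toy_result = binary_metrics(toy_labels, toy_preds)
print(toy_result)
```

In the toy example one of three positives is found (recall 1/3) and one of two positive predictions is correct (precision 1/2), so F1 comes out at 0.4.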

In [20]:
from transformers import Trainer, TrainingArguments

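# fp16=True enables mixed precision on the GPU; with batch size 1, no padding is needed
training_args = TrainingArguments('outputs', fp16=True, per_device_eval_batch_size=1, report_to='none')

trainer = Trainer(
    model,
    training_args,
    train_dataset=tds,  # not used by evaluate(); only eval_dataset matters here
    eval_dataset=tds,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)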

PyTorch: setting up devices
Using cuda_amp half precision backend


In [21]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 2780
  Batch size = 1
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'eval_loss': 1.700256586074829,
 'eval_accuracy': 0.6920863309352518,
 'eval_f1': 0.5663627152988855,
 'eval_precision': 0.9571917808219178,
 'eval_recall': 0.40215827338129495,
 'eval_runtime': 67.8274,
 'eval_samples_per_second': 40.986,
 'eval_steps_per_second': 40.986}
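Because the evaluation set holds 1390 knotted and 1390 unknotted sequences, the reported recall and precision are enough to recover the confusion matrix. The sketch below back-solves the counts and checks them against the reported accuracy and F1. It assumes that label 1 means knotted and that the two counts printed when loading the CSV files are the class sizes.

```python
n_pos, n_neg = 1390, 1390          # class sizes printed when loading the CSVs
recall = 0.40215827338129495       # eval_recall
precision = 0.9571917808219178     # eval_precision

tp = round(recall * n_pos)         # knotted sequences predicted as knotted
fn = n_pos - tp                    # knotted sequences that were missed
fp = round(tp / precision) - tp    # unknotted sequences predicted as knotted
tn = n_neg - fp                    # unknotted sequences predicted as unknotted

accuracy = (tp + tn) / (n_pos + n_neg)
f1 = 2 * tp / (2 * tp + fp + fn)

print(tp, fn, fp, tn)                     # 559 831 25 1365
print(round(accuracy, 4), round(f1, 4))   # 0.6921 0.5664
```

Read this way, the model rarely mislabels an unknotted sequence (25 false positives) but misses about 60% of the knotted ones (831 of 1390), which is why precision is high while recall and F1 are low.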