# Testing Hugging Face
This notebook is only meant to build familiarity with the Hugging Face API and the process of using it, before writing any package or script code that depends on it.

## Tests

1. Loading a pretrained protein model and making predictions
2. Loading a pretrained protein model repurposing the head for sequence generation
3. Finetuning a sequence generation model on pairs of sequences
4. Pretraining a brand new tokenizer and model on an unlabeled list of sequences

## 0. What is the vocabulary of the pretrained tokenizer?

In [4]:
from transformers import T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_uniref50', do_lower_case=False)

In [17]:
input_ids = tokenizer("AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR")['input_ids']

In [19]:
input_ids

[2, 1]

The unspaced sequence is encoded as a single `<unk>` (id 2) followed by `</s>` (id 1). The vocabulary only contains single residues, so the input has to be space-separated.

In [22]:
input_ids = tokenizer(" ".join("AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR"))['input_ids']

In [23]:
print(len("AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR"), len(input_ids))

110 111
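
With space-separated input, the 110 residues produce 111 ids: one token per residue plus the trailing `</s>`. A minimal sketch of that bookkeeping, in plain Python so it runs without downloading the model. The helper names `space_residues` and `expected_token_count` are made up for this notebook and are not part of `transformers`.

```python
def space_residues(sequence: str) -> str:
    """Insert a single space between residues, e.g. 'MSK' -> 'M S K'."""
    return " ".join(sequence)


def expected_token_count(sequence: str) -> int:
    """One token per residue plus the trailing </s> token (assumed from the output above)."""
    return len(sequence) + 1


spaced = space_residues("MSKTZP")
print(spaced)                          # M S K T Z P
print(expected_token_count("MSKTZP"))  # 7
```

Comparing `expected_token_count(seq)` with `len(input_ids)` is a cheap sanity check that the tokenizer saw the residues individually and not as one unknown token.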


In [21]:
tokenizer.get_vocab()

{'<pad>': 0,
 '</s>': 1,
 '<unk>': 2,
 '▁A': 3,
 '▁L': 4,
 '▁G': 5,
 '▁V': 6,
 '▁S': 7,
 '▁R': 8,
 '▁E': 9,
 '▁D': 10,
 '▁T': 11,
 '▁I': 12,
 '▁P': 13,
 '▁K': 14,
 '▁F': 15,
 '▁Q': 16,
 '▁N': 17,
 '▁Y': 18,
 '▁M': 19,
 '▁H': 20,
 '▁W': 21,
 '▁C': 22,
 '▁X': 23,
 '▁B': 24,
 '▁O': 25,
 '▁U': 26,
 '▁Z': 27,
 '<extra_id_99>': 28,
 '<extra_id_98>': 29,
 '<extra_id_97>': 30,
 '<extra_id_96>': 31,
 '<extra_id_95>': 32,
 '<extra_id_94>': 33,
 '<extra_id_93>': 34,
 '<extra_id_92>': 35,
 '<extra_id_91>': 36,
 '<extra_id_90>': 37,
 '<extra_id_89>': 38,
 '<extra_id_88>': 39,
 '<extra_id_87>': 40,
 '<extra_id_86>': 41,
 '<extra_id_85>': 42,
 '<extra_id_84>': 43,
 '<extra_id_83>': 44,
 '<extra_id_82>': 45,
 '<extra_id_81>': 46,
 '<extra_id_80>': 47,
 '<extra_id_79>': 48,
 '<extra_id_78>': 49,
 '<extra_id_77>': 50,
 '<extra_id_76>': 51,
 '<extra_id_75>': 52,
 '<extra_id_74>': 53,
 '<extra_id_73>': 54,
 '<extra_id_72>': 55,
 '<extra_id_71>': 56,
 '<extra_id_70>': 57,
 '<extra_id_69>': 58,
 '<extra_id_

## 1. Loading a pretrained model and making predictions
Rostlab pretrained a T5 model on UniRef50 for context prediction. In principle, it should also be usable for sequence generation.

https://huggingface.co/Rostlab/prot_t5_xl_uniref50

https://www.biorxiv.org/content/10.1101/2020.07.12.199554v2

The implementation below reproduces the example from the model page.

In [2]:
from transformers import T5Tokenizer, T5Model
import re
import torch

In [3]:
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_uniref50', do_lower_case=False)
model = T5Model.from_pretrained("Rostlab/prot_t5_xl_uniref50")

Some weights of the model checkpoint at Rostlab/prot_t5_xl_uniref50 were not used when initializing T5Model: ['lm_head.weight']
- This IS expected if you are initializing T5Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing T5Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [3]:
sequences_Example = ["A E T C Z A O","S K T Z P"]
sequences_Example = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences_Example]

In [5]:
ids = tokenizer.batch_encode_plus(sequences_Example, add_special_tokens=True, padding=True)

input_ids = torch.tensor(ids['input_ids'])
attention_mask = torch.tensor(ids['attention_mask'])

In [15]:
with torch.no_grad():
    embedding = model(input_ids=input_ids,attention_mask=attention_mask,decoder_input_ids=input_ids)

# For feature extraction, the model page recommends using the encoder embedding.
# Index 2 is the encoder's last hidden state; index 0 is the decoder's.
encoder_embedding = embedding[2].cpu().numpy()
decoder_embedding = embedding[0].cpu().numpy()

In [18]:
encoder_embedding.shape

(2, 8, 1024)

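The tokenizer from the Rostlab model does no token aggregation: each residue maps to its own token. The second dimension of 8 is the 7 residues of the longer sequence plus the `</s>` token.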
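
Since every residue gets its own position, per-residue embeddings can be recovered by dropping the padding and the trailing `</s>` position. A minimal sketch of the length computation, assuming right-side padding and one `</s>` per sequence. The helper `residue_counts` is hypothetical, and the mask below is written out by hand to match the two example sequences instead of being read from the tokenizer.

```python
from typing import List


def residue_counts(attention_mask: List[List[int]]) -> List[int]:
    """Number of residue positions per sequence: attended tokens minus the </s> token."""
    return [sum(row) - 1 for row in attention_mask]


# Hand-written mask for "A E T C X A X" (7 residues) and "S K T X P" (5 residues),
# padded to the common length of 8.
mask = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 0, 0],
]
print(residue_counts(mask))  # [7, 5]
```

With these counts, `encoder_embedding[i][:n]` would keep only the residue rows of sequence `i`, where `n` is the i-th entry of the result.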

## 2. Loading a pretrained protein model repurposing the head for sequence generation

In [3]:
from transformers import T5ForConditionalGeneration

In [36]:
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_uniref50', do_lower_case=False)
model = T5ForConditionalGeneration.from_pretrained("Rostlab/prot_t5_xl_uniref50")

In [53]:
input_seq = ["M S K T Z P"]

In [59]:
ids = tokenizer(input_seq, add_special_tokens=True, padding=True)

In [62]:
input_ids = torch.tensor(ids['input_ids'])

In [65]:
output = model.generate(input_ids)



In [67]:
print(tokenizer.decode(output[0], skip_special_tokens=True))

M S K T


We recovered most of the sequence (the trailing `Z P` is missing). Let's try a more interesting one.
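
To put a number on how much of the input comes back, the decoded string can be compared with the input residue by residue. A minimal sketch in plain Python; `recovered_prefix` is a made-up helper, and it only measures the matching prefix, not a full alignment.

```python
def recovered_prefix(original: str, decoded: str) -> int:
    """Count how many leading residues of `original` are reproduced in `decoded`."""
    count = 0
    for expected, generated in zip(original.split(), decoded.split()):
        if expected != generated:
            break
        count += 1
    return count


original = "M S K T Z P"
decoded = "M S K T"
n = recovered_prefix(original, decoded)
print(f"{n} of {len(original.split())} residues recovered")  # 4 of 6 residues recovered
```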

In [4]:
raw_seq = ["AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR", "RPRTAFSSEQLARLKREFNENRYLTERRRQQLSSELGLNEAQIKIWFQNKRAKI"]
# Separate residues with spaces and map rare amino acids (U, Z, O, B) to X
input_seq = [re.sub(r"[UZOB]", "X", " ".join(s)) for s in raw_seq]


tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_uniref50', do_lower_case=False)
model = T5ForConditionalGeneration.from_pretrained("Rostlab/prot_t5_xl_uniref50")
ids = tokenizer(input_seq, add_special_tokens=True, padding=True)
input_ids = torch.tensor(ids['input_ids'])
output = model.generate(input_ids, max_length=100)

In [6]:
print([tokenizer.decode(o, skip_special_tokens=True) for o in output])

['A Q V I N T F D G V A D Y L Q T Y H K L P D N Y I T K S E A Q A L G W V A S K G N L A D V A P G K S I G G D I F S N R E G K L P G K S G R T W R E A D I N Y T S G F R N S D R I L Y S S D W L I Y K T', 'R P R T A F S S E Q L A R L K R E F N E N R Y L T E R R R Q Q L S S E L G L N E A Q I K I W F Q N K R A K I']
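
The second sequence (54 residues) comes back in full, while the first stops after 99 of its 110 residues. That is consistent with the `max_length=100` cap if the decoder start token counts toward the limit. A small sketch of the arithmetic under that assumption. Both helper names are hypothetical, and the start-token accounting is inferred from the output above, not taken from the library documentation.

```python
from typing import List


def max_residues(max_length: int) -> int:
    """Residues that fit when one position is used by the decoder start token."""
    return max_length - 1


def required_max_length(sequences: List[str]) -> int:
    """Smallest max_length that fits start token + all residues + </s> for the longest input."""
    return max(len(s) for s in sequences) + 2


print(max_residues(100))                        # 99
print(required_max_length(["A" * 110, "R" * 54]))  # 112
```

Under the same assumption, passing `max_length=required_max_length(raw_seq)` to `model.generate` would leave enough room for the longer sequence.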
