# nanoBERT Example

Here we present nanoBERT, a nanobody-specific transformer. Its primary application is position infilling: predicting which amino acids are plausible at a given position according to the nanobody-specific distribution.
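
To make position infilling concrete, the sketch below shows how a single residue can be replaced by the `<mask>` token before a sequence is handed to the model. The helper name `mask_position` is hypothetical and introduced only for illustration; the fragment is a short example string, not a full nanobody sequence.

```python
def mask_position(seq: str, idx: int, mask_token: str = "<mask>") -> str:
    """Return seq with the residue at idx replaced by the mask token."""
    if not 0 <= idx < len(seq):
        raise IndexError(f"position {idx} is outside the sequence")
    return seq[:idx] + mask_token + seq[idx + 1:]


fragment = "WGQGTQVTVSS"
masked = mask_position(fragment, 2)
print(masked)  # WG<mask>GTQVTVSS
```

The model then returns a ranked list of residues for the masked slot, which is what the cells further down iterate over.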

In [1]:
# Install the Hugging Face transformers library
! pip install --upgrade transformers



In [3]:
from transformers import pipeline, RobertaTokenizer, AutoModel

In [4]:
# Initialise the tokenizer
tokenizer = RobertaTokenizer.from_pretrained("NaturalAntibody/nanoBERT", return_tensors="pt")


In [5]:
# Initialise model
unmasker = pipeline('fill-mask', model="NaturalAntibody/nanoBERT", tokenizer=tokenizer, top_k=20 )

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
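
The warning above means the pipeline was placed on the CPU even though a GPU was present. A minimal sketch of choosing a device follows, assuming PyTorch is the backend. `pick_device` is a hypothetical helper; its return value is meant to be passed as the `device` argument of `pipeline`, where 0 selects the first GPU and -1 selects the CPU.

```python
def pick_device() -> int:
    """Return 0 (first CUDA GPU) when one is available, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        # Without PyTorch installed, fall back to the CPU index
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print(f"pipeline device index: {device}")

# Intended use (left commented out because it downloads the model):
# unmasker = pipeline('fill-mask', model="NaturalAntibody/nanoBERT",
#                     tokenizer=tokenizer, top_k=20, device=device)
```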


In [18]:
import os
import pandas as pd

current_directory = os.getcwd()
print(f"Current Directory: {current_directory}")

df_s = pd.read_csv("/content/sdab_data.csv", header=0, sep=";")
df_s.head()

Current Directory: /content


Unnamed: 0,id,name,seq,tm,doi,source,fr1,cdr1,fr2,cdr2,fr3,cdr3,fr4,target1,target2,target3,target4
0,>sdab1,NRL-N-C2,EVQLQASGGGLVRPGGSLRLSCAASGFTFSSYAMMWVRQAPGKGLE...,675,https://pubs.acs.org/doi/10.1021/acs.analchem....,Llama,EVQLQASGGGLVRPGGSLRLSC,AASGFTFSSYAMM,WVRQAPGKGLEWV,SAINGGGGST,SYADSVKGRFTISRDNAKNTLYLQMNSLKPEDTAVYYC,AKYQAAVHQEKEDY,WGQGTQVTVSS,SARS-CoV-2 nucleocapsid (N),,,
1,>sdab2,NRL-N-C2-hop,EVQLQASGGGLVRPGGSLRLSCAASGFTFSSYAMMWVRQAPGKGLE...,65,https://pubs.acs.org/doi/10.1021/acs.analchem....,Llama,EVQLQASGGGLVRPGGSLRLSC,AASGFTFSSYAMM,WVRQAPGKGLEWV,SAINGGGGST,SYADSVKGRFTISRDNAKNTLYLQMNSLKPEDTAVYYC,AKYQAAVHQEKEDY,WGQGTQVTVSS,SARS-CoV-2 nucleocapsid (N),,,
2,>sdab3,NRL-N-E2,EVQLQASGGGLVQAGGSLRLSCAASGRTDSTQHMAWFRQAPGKERE...,62,https://pubs.acs.org/doi/10.1021/acs.analchem....,Llama,EVQLQASGGGLVQAGGSLRLSC,AASGRTDSTQHMA,WFRQAPGKEREFV,TAIQWRGGGT,SYTDSVKGRFTISRDNAKNTVYLEMNSLKPEDTAVYYC,ATNTRWTYFSPTVPDRYDY,WGQGTQVTVSS,SARS-CoV-2 nucleocapsid (N),,,
3,>sdab4,NRL-N-E2-hop,EVQLQASGGGLVQAGGSLRLSCAASGRTDSTQHMAWFRQAPGKERE...,62,https://pubs.acs.org/doi/10.1021/acs.analchem....,Llama,EVQLQASGGGLVQAGGSLRLSC,AASGRTDSTQHMA,WFRQAPGKEREFV,TAIQWRGGGT,SYTDSVKGRFTISRDNAKNTVYLEMNSLKPEDTAVYYC,ATNTRWTYFSPTVPDRYDY,WGQGTQVTVSS,SARS-CoV-2 nucleocapsid (N),,,
4,>sdab5,NRL-N-E10,DVQLQASGGGLVQAGGSLRLSCAASARTFYTMGWFRQVLGKDREFV...,70,https://pubs.acs.org/doi/10.1021/acs.analchem....,Llama,DVQLQASGGGLVQAGGSLRLSC,AASARTFYTMG,WFRQVLGKDREFV,GAIRWGVYATT,RYADSVKGRFSISRDDATNTVALQMNSLKPEDTAVYYC,AARAGPLGFELSATSSAEYDY,WGQGTQVTVSS,SARS-CoV-2 nucleocapsid (N),,,


In [33]:
# Predict the residue probability at each CDR3 position.
# The position to predict is marked with '<mask>'.

columns = ["id", "tm", "cdr3", "wt_seq", "position", "wt", "mut", "mut_seq", "prob_mut"]
rows = []  # one record per suggested mutation; avoids slow row-by-row DataFrame growth

for i in range(df_s.shape[0]):
  raw = df_s.loc[i, "seq"]
  cdr3 = df_s.loc[i, "cdr3"]
  position = raw.find(cdr3)  # index of the first CDR3 residue in the full sequence
  print(cdr3)
  for j in range(position, position + len(cdr3)):
    print(j, raw[j])
    seq = raw[:j] + "<mask>" + raw[j + 1:]
    residueProbability = unmasker(seq)
    # Keep non-wild-type residues with probability above 0.01
    for scores in residueProbability:
      if scores['score'] > 0.01 and scores['token_str'] != raw[j]:
        print(f"Amino Acid : {scores['token_str']}, probability = {scores['score']}")
        mutseq = raw[:j] + scores['token_str'] + raw[j + 1:]
        # Values are listed in the same order as `columns`
        rows.append([df_s.loc[i, "id"],
                     df_s.loc[i, "tm"],
                     cdr3,
                     raw,                  # wt_seq: unmodified sequence
                     j,                    # position: 0-based index in wt_seq
                     raw[j],               # wt: wild-type residue
                     scores['token_str'],  # mut: suggested residue
                     mutseq,
                     scores['score']])

mutations = pd.DataFrame(rows, columns=columns)



Streaming output truncated to the last 5000 lines.
Amino Acid : D, probability = 0.049461059272289276
Amino Acid : S, probability = 0.03146084398031235
Amino Acid : P, probability = 0.029772788286209106
Amino Acid : R, probability = 0.023041419684886932
Amino Acid : E, probability = 0.02027197740972042
Amino Acid : I, probability = 0.01959805004298687
Amino Acid : W, probability = 0.01290155854076147
Amino Acid : N, probability = 0.010198970325291157
102 P
Amino Acid : G, probability = 0.41073477268218994
Amino Acid : S, probability = 0.15273743867874146
Amino Acid : A, probability = 0.0587761364877224
Amino Acid : T, probability = 0.05691343918442726
Amino Acid : Y, probability = 0.05450468510389328
Amino Acid : D, probability = 0.04911365732550621
Amino Acid : R, probability = 0.031071772798895836
Amino Acid : V, probability = 0.028152791783213615
Amino Acid : L, probability = 0.022674161940813065
Amino Acid : E, probability = 0.020423617213964462
Amino Acid : N, probab
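
The filtering step in the loop above can be isolated into a small function and checked without loading the model. The sketch below assumes that, for a single masked position, the fill-mask pipeline returns a list of dictionaries carrying `token_str` and `score` keys, as the loop already relies on. `candidate_mutations` is a hypothetical helper, and the mock predictions are made-up values for illustration only, not model output.

```python
def candidate_mutations(predictions, wild_type, min_prob=0.01):
    """Keep predicted residues that differ from the wild type and exceed min_prob."""
    return [
        (p["token_str"], p["score"])
        for p in predictions
        if p["score"] > min_prob and p["token_str"] != wild_type
    ]


# Made-up predictions for one masked position whose wild-type residue is P
mock_predictions = [
    {"token_str": "G", "score": 0.41},
    {"token_str": "S", "score": 0.15},
    {"token_str": "P", "score": 0.05},   # wild type, dropped
    {"token_str": "C", "score": 0.004},  # below the threshold, dropped
]

kept = candidate_mutations(mock_predictions, wild_type="P")
print(kept)  # [('G', 0.41), ('S', 0.15)]
```

Separating the filter from the model call makes the 0.01 threshold easy to vary later without rerunning inference.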

In [35]:
mutations.shape

(129281, 9)
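
The loop locates CDR3 with `str.find`, which returns -1 when the substring is absent and the first hit when the motif occurs more than once. An alternative sketch derives the CDR3 start from the lengths of the preceding regions instead. It assumes the region columns (`fr1` through `fr4`) concatenate to the full `seq`, as they appear to in the rows displayed earlier; `cdr3_start` is a hypothetical helper, and the dictionary below copies the region values of the first displayed row.

```python
REGIONS_BEFORE_CDR3 = ["fr1", "cdr1", "fr2", "cdr2", "fr3"]


def cdr3_start(row) -> int:
    """0-based index of the first CDR3 residue, from the preceding region lengths."""
    return sum(len(row[name]) for name in REGIONS_BEFORE_CDR3)


# Region values of the first row shown in the table (works for a pandas row too)
row = {
    "fr1": "EVQLQASGGGLVRPGGSLRLSC",
    "cdr1": "AASGFTFSSYAMM",
    "fr2": "WVRQAPGKGLEWV",
    "cdr2": "SAINGGGGST",
    "fr3": "SYADSVKGRFTISRDNAKNTLYLQMNSLKPEDTAVYYC",
    "cdr3": "AKYQAAVHQEKEDY",
    "fr4": "WGQGTQVTVSS",
}

# Assumption: the full sequence is the concatenation of all seven regions
full = "".join(row[name] for name in REGIONS_BEFORE_CDR3 + ["cdr3", "fr4"])
start = cdr3_start(row)
print(start, full[start:start + len(row["cdr3"])])
```

If the assumption does not hold for some entries, comparing this offset with the result of `find` is a cheap way to flag them.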

In [39]:
mutations.to_csv("/content/sdab_data_crd3_mutations.csv",index=False, sep="\t")

In [45]:
mutations.to_csv("/content/sdab_data_crd3_mutations_01.csv",index=False, sep="\t")
