<a href="https://colab.research.google.com/github/AngelRuizMoreno/Scripts_Notebooks/blob/master/protnlm/protnlm_use_model_for_inference_uniprot_2022_04.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

```
# Copyright 2022 Google Inc.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

#     http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```

This colab supports UniProt release 2022_04, in which Google predicted
protein names for 88% of all Uncharacterized proteins (over 1 in 5 proteins in UniProt).

This colab allows you to run a model that's very similar to the one used in the UniProt release. **Put in the amino acid sequence below**, and select "Runtime > Run all" from the menu bar above to **get name predictions for your protein**!

This colab takes a few minutes to run initially; after that, you get protein name predictions in a few seconds!

# 1. Import code

In [1]:
#@markdown Please execute this cell by pressing the _Play_ button
#@markdown on the left to import the dependencies. It can take a few minutes.
!python3 -m pip install -q -U tensorflow==2.15
!python3 -m pip install -q -U tensorflow-text==2.15
import tensorflow as tf
import tensorflow_text
import numpy as np
import re

import IPython.display
from absl import logging

logging.set_verbosity(logging.ERROR)  # Turn down tensorflow warnings

def print_markdown(string):
  IPython.display.display(IPython.display.Markdown(string))

# 2. Load the model

In [2]:
#@markdown Please execute this cell by pressing the _Play_ button.

def query(seq):
  return f"[protein_name_in_english] <extra_id_0> [sequence] {seq}"

# Dots are escaped so that they match literal periods in an EC number.
EC_NUMBER_REGEX = r'(\d+)\.([\d\-n]+)\.([\d\-n]+)\.([\d\-n]+)'

def run_inference(seq):
  labeling = infer(tf.constant([query(seq)]))
  names = labeling['output_0'][0].numpy().tolist()
  scores = labeling['output_1'][0].numpy().tolist()
  beam_size = len(names)
  names = [names[beam_size-1-i].decode().replace('<extra_id_0> ', '') for i in range(beam_size)]
  for i, name in enumerate(names):
    if re.match(EC_NUMBER_REGEX, name):
      names[i] = 'EC:' + name
  scores = [np.exp(scores[beam_size-1-i]) for i in range(beam_size)]
  return names, scores
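
The post-processing in `run_inference` can be tried without loading the model. The following is a minimal sketch on plain Python lists, assuming (as the `np.exp` call suggests) that the raw scores are log-probabilities and that the model returns its beams in ascending order of score. The names `postprocess`, `EC_PATTERN`, `raw_names`, and `raw_scores` are hypothetical, and the raw values are invented for illustration only.

```python
import math
import re

# An EC-number pattern with the dots escaped so they match literal periods.
EC_PATTERN = re.compile(r'(\d+)\.([\d\-n]+)\.([\d\-n]+)\.([\d\-n]+)')

def postprocess(raw_names, raw_scores):
    """Mimic the list handling in run_inference, without TensorFlow."""
    # Reverse the beam order and strip the sentinel token from each name.
    names = [n.decode().replace('<extra_id_0> ', '') for n in reversed(raw_names)]
    # Prefix names that start with an EC number, e.g. "3.6.5.3".
    names = ['EC:' + n if EC_PATTERN.match(n) else n for n in names]
    # Exponentiate the (assumed) log-probabilities, in the same reversed order.
    scores = [math.exp(s) for s in reversed(raw_scores)]
    return names, scores

# Hypothetical raw outputs, invented for illustration only.
raw_names = [b'<extra_id_0> 3.6.5.3', b'<extra_id_0> Elongation factor G']
raw_scores = [-2.0, -0.5]

names, scores = postprocess(raw_names, raw_scores)
for name, score in zip(names, scores):
    print(f"{name}: {score:.3f}")
```

Reversing both lists keeps each name paired with its own score, which is why `run_inference` indexes both with `beam_size-1-i`.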

In [3]:
#@markdown Please execute this cell by pressing the _Play_ button
#@markdown on the left to load the model. It can take a few minutes.

! mkdir -p protnlm

! wget -nc https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2022_04/savedmodel__20221011__030822_1128_bs1.bm10.eos_cpu/saved_model.pb -P protnlm -q
! mkdir -p protnlm/variables
! wget -nc https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2022_04/savedmodel__20221011__030822_1128_bs1.bm10.eos_cpu/variables/variables.index -P protnlm/variables/ -q
! wget -nc https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2022_04/savedmodel__20221011__030822_1128_bs1.bm10.eos_cpu/variables/variables.data-00000-of-00001 -P protnlm/variables/ -q

imported = tf.saved_model.load(export_dir="protnlm")
infer = imported.signatures["serving_default"]
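
Because the `wget` commands run with `-q`, a failed download produces no message, and the problem only surfaces when `tf.saved_model.load` fails. A small check such as the one below could be run before loading. It is a sketch that assumes the three files listed in the cell above are all that the export directory needs; the helper name `missing_savedmodel_files` is hypothetical.

```python
from pathlib import Path

# Files that the download cell places under the export directory.
EXPECTED_FILES = (
    "saved_model.pb",
    "variables/variables.index",
    "variables/variables.data-00000-of-00001",
)

def missing_savedmodel_files(export_dir="protnlm"):
    """Return the expected files that are absent from export_dir."""
    root = Path(export_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]

missing = missing_savedmodel_files("protnlm")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All expected SavedModel files are present.")
```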

In [14]:
#@title 3. Put your sequence here (an example sequence is pre-loaded)

#@markdown Press the _Play_ button to get a prediction.
#@markdown The first time can take a few minutes.
#@markdown
#@markdown Subsequent predictions take a few seconds.
sequence = "MEAEKQRAKQLVVGILAHVDSGKTTLSEAMLYRSGTIRKLGRVDHKDAFLDTDALEKARG ITIFSKQALLTAGETAITLLDTPGHVDFSTETERTLQVLDYAVLVISGTDGVQSHTETLW RLLRRYHIPTFVFINKMDLPGPGKEKLLEQLNHRLGEGFVDFGADEDTRNEALAVCDERL MEAVLERGTLTPEELIPAIARRHVFPCWFGAALKLEGVDALLAGLDTYTRPAPALDAFGA KVFKLSQDEQGTRLTWLRVTGGTLKVKDQLTGESDGGPWAEKANQLRLYSGVKYTLAEEV GPGQVCAVTGLTQAHPGEGLGAERDSDLPVLEPVLSYQVLLPEGADIHAALGKLHRLEEE EPQLHVVWNETLGEIHVQLMGEVQLEVLKSLLAERYGLEVEFGPGGILYKETITEAMEGV GHYEPLRHYAEVHLKLEPLPAGSGMQFAADCREEVLDKNWQRLVMTHLEEKQHLGVLIGA PLTDVKITLIAGRAHLKHTEGGDFRQATYRAVRQGLMMANQIGKTQLLEPWYTFRLEVPA ENLGRAMNDIQRMEGSFDPPETSADGQTATLTGKAPAATMRSYPMEVVSYTRGRGRVSLT LEGYRPCHNAREVIEAVGYEPEHDLDNPADSVFCAHGAGFVVPWEQVRSHMHVDSGWGKS KPAETDAVAASARQAGRQRRAAAYRATLEEDAELLKIFEQTYGPIKRDPLAAFRPVQKKE RPDFAAEQWTLAPEYLLVDGYNIIFAWDELNALSKESLDAARKKLADILCNYQGFKKCVV ILVFDAYRVPGSPGSIEQYHNIHIVYTKEAETADMFIEHVTHEIGKDRRVRVATSDGMEQ IIILGHGALRVSARMFHEEVKEVEKEIKRYLQGEV" #@param {type:"string"}
sequence = sequence.replace("\n","").replace(' ', '')

names, scores = run_inference(sequence)

for i, (name, score) in enumerate(zip(names, scores), start=1):
  print_markdown(f"Prediction number {i}: **{name}** with a score of **{score:.03f}**")

Prediction number 1: **Translation elongation factor G** with a score of **0.458**

Prediction number 2: **GTP-binding protein** with a score of **0.239**

Prediction number 3: **Elongation factor G** with a score of **0.057**

Prediction number 4: **Small GTP-binding protein domain-containing protein** with a score of **0.042**

Prediction number 5: **Small GTP-binding protein domain protein** with a score of **0.025**

Prediction number 6: **Tr-type G domain-containing protein** with a score of **0.023**

Prediction number 7: **Small GTP-binding protein** with a score of **0.021**

Prediction number 8: **Small GTP-binding protein domain** with a score of **0.020**

Prediction number 9: **Tetracycline resistance protein tetM from transposon Tn916** with a score of **0.011**

Prediction number 10: **TetM/TetW/TetO/TetS family tetracycline resistance ribosomal protection protein** with a score of **0.011**
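
The prediction cell strips only newlines and spaces from the input, so a sequence pasted in FASTA format (with a `>` header line) would need a little more cleaning first. The sketch below shows one way to do that. The helper names `clean_sequence` and `unexpected_residues` are hypothetical, and the example input is invented. Whether the model accepts letters outside the 20 standard amino acids is not stated here, so the second helper only reports such letters rather than rejecting them.

```python
# The 20 standard amino acid one-letter codes.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

def clean_sequence(text):
    """Drop FASTA header lines and all whitespace; return upper-case residues."""
    lines = [ln for ln in text.splitlines() if not ln.startswith(">")]
    return "".join("".join(lines).split()).upper()

def unexpected_residues(seq):
    """Return the sorted letters in seq that are not standard amino acids."""
    return sorted(set(seq) - VALID_RESIDUES)

# Invented FASTA-style input; the residues are the start of the example sequence.
example = ">hypothetical_header example\nMEAEKQRAKQ LVVGILAHVD\nSGKTTLSEAM"
seq = clean_sequence(example)
print(seq)
print("Unexpected letters:", unexpected_residues(seq))
```

The cleaned string can then be passed to `run_inference` in place of `sequence`.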