<a href="https://colab.research.google.com/github/Basspoom/Basspoom/blob/main/RbcL_Ib_mining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **RbcL Ib mining**
The model in this notebook predicts a kinetic parameter of plant Rubisco, the carboxylation turnover rate (kcat), from the Rubisco large subunit (RbcL) protein sequence.

Details of the model can be found in the published paper:


Basic instructions for use:
1. Predictions may run faster with a 'TPU' runtime, set via 'Runtime --> Change runtime type'.
2. Run each cell of the notebook in order using the 'play' button, as instructed below:

In [None]:
#@title Import dependencies. Will take approx 10 minutes to load.
!pip install tensorflow==2.8.0 tensorflow_probability==0.16.0 # need this version so that I can load my saved gaussian process model
#!pip install gpflow
from IPython.display import clear_output
clear_output()
!pip install glob2
clear_output()
import torch
!git clone https://github.com/Iqbalwasim01/esm.git
!pip install esm/
import esm
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
clear_output()
# Standard-library and third-party imports used by the cells below
import os
import glob
import time
import timeit
from pathlib import Path
import numpy
import pandas
import tensorflow as tf
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual

In [None]:
#@title Load pretrained models. Will take approx 2 minutes
# Gaussian process regression pretrained model

#attempts = 0 # Retry loop: importing the gpflow package fails the first two times due to package compatibility conflicts.
#while attempts < 3:
    #try:
        #import gpflow
        #break
    #except:
        #attempts += 1
        #time.sleep(1)  # wait for 1 second before retrying

#from gpflow import kernels, models
#from gpflow.utilities import print_summary, set_trainable

!git clone https://github.com/Iqbalwasim01/RbcL-1b-mining.git
!unzip /content/RbcL-1b-mining/SavedModel.zip
%cd "/content/SavedModel"
loaded_model=tf.saved_model.load("/content/SavedModel")
clear_output()

In [None]:
#@title
# Calculate embeddings


def ESM(sequence):
  """Predict the Gaussian process mean and variance for one protein sequence."""

  data=[("protein",sequence)]

  model.eval()  # disables dropout for deterministic results

  batch_labels, batch_strs, batch_tokens = batch_converter(data)
  # Extract per-residue representations from layer 33 (on CPU)
  with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33], return_contacts=True)
  token_representations = results["representations"][33]

  # Generate the per-sequence representation via averaging
  # NOTE: token 0 is always a beginning-of-sequence token, so the first residue is token 1.
  sequence_representation = token_representations[0, 1 : len(sequence) + 1].mean(0)

  # Reshape to a single row of 1280 features, the ESM-1b embedding size
  samples=sequence_representation.numpy().reshape(1,1280)

  #inputs=tf.convert_to_tensor(samples,dtype=default_type())
  mean, var=loaded_model.predict_f_compiled(samples)
  return mean,var
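
The averaging step in the cell above can be illustrated without loading ESM-1b. The following is a minimal sketch in plain Python, assuming the token layout described in the comment (index 0 is the beginning-of-sequence token and residues occupy indices 1 to `len(seq)`). The helper name `mean_pool` and the toy 2-dimensional vectors are hypothetical and not part of the notebook.

```python
from statistics import fmean

def mean_pool(token_vectors, seq_len):
    """Average per-residue vectors, skipping the special tokens.

    token_vectors: list of equal-length float lists, one per token.
    Index 0 is assumed to be the beginning-of-sequence token, and the
    residues are assumed to occupy indices 1..seq_len.
    """
    residues = token_vectors[1 : seq_len + 1]
    if len(residues) != seq_len:
        raise ValueError("fewer token vectors than residues")
    # zip(*residues) groups the values of each embedding dimension together.
    return [fmean(dim) for dim in zip(*residues)]

# Toy input: BOS token, three residues, EOS token (illustrative values only).
tokens = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 7.0]]
pooled = mean_pool(tokens, seq_len=3)
print(pooled)
```

Only the three residue vectors contribute to the average, so the special tokens at either end do not shift the sequence-level representation.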

Paste your sequence into the box that appears below. The predicted mean and variance will appear after a few seconds.

The first run always takes a few minutes to load.

NOTE: Make sure each protein sequence contains no spaces or line breaks, otherwise an error will occur.
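
One way to avoid the whitespace problem described in the note is to clean each sequence before passing it to the model. The sketch below is a hypothetical helper, not part of the notebook. It removes spaces and line breaks and, as an extra assumption, accepts only the 20 standard amino acid letters. The ESM alphabet also accepts some additional symbols (such as X and B), so relax `VALID_RESIDUES` if your sequences use them.

```python
# Assumption: restrict input to the 20 standard amino acid one-letter codes.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

def clean_sequence(raw):
    """Strip all whitespace, upper-case the sequence, and validate its letters."""
    # str.split() with no arguments splits on any run of whitespace,
    # including spaces, tabs, and line breaks.
    cleaned = "".join(raw.split()).upper()
    if not cleaned:
        raise ValueError("sequence is empty after removing whitespace")
    unexpected = sorted(set(cleaned) - VALID_RESIDUES)
    if unexpected:
        raise ValueError(f"unexpected characters in sequence: {unexpected}")
    return cleaned

# Placeholder fragment for illustration only, not a real RbcL sequence.
print(clean_sequence("mspq tet\nka"))
```

The cleaned string could then be passed to the prediction function, for example `ESM(clean_sequence(text))`.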

In [None]:
#@title

#from gpflow.config import default_float
#gpflow.config.set_default_float(numpy.float64)

interact(ESM,sequence="AABBBCCCC")


Alternatively, upload a .csv file with exactly these column names: "Species.name" and "seq".

NOTE: Make sure each protein sequence contains no spaces or line breaks, otherwise an error will occur.
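
Checking the file layout before uploading can save a failed run. The sketch below uses only the Python standard library and assumes nothing beyond the two column names stated above. The helper `read_sequences`, the species labels, and the short sequences are hypothetical placeholders.

```python
import csv
import io

REQUIRED_COLUMNS = ("Species.name", "seq")

def read_sequences(csv_text):
    """Return (species, sequence) pairs, or raise if a required column is missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return [(row["Species.name"], row["seq"]) for row in reader]

# A minimal file in the expected layout (placeholder values, not real data).
example = "Species.name,seq\nSpecies A,MSPQTETKA\nSpecies B,MSPQTETKG\n"
rows = read_sequences(example)
print(rows)
```

A file that passes this check has the header the prediction cell reads from, with one sequence per row.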

In [None]:
#@title
uploader=widgets.FileUpload(
    accept='.csv',  # Accepted file extension e.g. '.txt', '.pdf', 'image/*', 'image/*,.pdf'
    multiple=False  # True to accept multiple files upload else False
)
display(uploader)


In [None]:
#@title Run cell to predict activity from uploaded file
import gc
import io
import pandas as pd
from tqdm import tqdm
from google.colab import files

# Fail with a clear message if the upload cell above has not received a file yet
if not uploader.value:
  raise ValueError("No file uploaded. Run the upload cell above and choose a .csv file first.")

input_file = list(uploader.value.values())[0]
content = input_file['content']
content = io.StringIO(content.decode('ISO-8859-1'))
df = pd.read_csv(content)


Species_name=df["Species.name"].to_numpy()
Sequence=df["seq"].to_numpy()

mean_=[]
var_=[]

gc.disable() # reduces loop time by disabling the garbage collector
try:
  for i in tqdm(Sequence):
    mean,var=ESM(sequence=i)
    mean=mean.numpy()[0]; var=var.numpy()[0]
    mean_.append(mean); var_.append(var)
finally:
  gc.enable() # always restore garbage collection, even if a prediction fails

bigdata=pd.DataFrame({'mean':mean_,'var':var_},columns=['mean','var'])
bigdata['Species.new'] = Species_name
bigdata.to_csv('bigdata.csv')
files.download('bigdata.csv')
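
The 'mean' and 'var' columns in the downloaded file can be turned into an approximate interval for each prediction. The sketch below assumes the predictive distribution is Gaussian, so roughly 95% of the probability mass lies within 1.96 standard deviations of the mean. If the saved model follows the GPflow `predict_f` convention, the variance describes the latent function and does not include observation noise, so the interval may be narrower than the spread of measured values. The helper name and the numbers are illustrative, not model output.

```python
import math

def interval_95(mean, var):
    """Approximate 95% interval for a Gaussian with the given mean and variance."""
    if var < 0:
        raise ValueError("variance must be non-negative")
    half_width = 1.96 * math.sqrt(var)
    return mean - half_width, mean + half_width

# Illustrative values only.
low, high = interval_95(3.0, 0.25)
print(f"{low:.2f} to {high:.2f}")
```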
