# **Google Colab Page for feature generation and Model validation**

## Note: In this Colab example we used novel TRP channels that are not included in our original dataset to validate our model.

### We used the following unseen TRP Channel proteins to validate our model

#### 1.  Q7Z4N2 ========================> Transient receptor potential cation channel subfamily M member 1 (TRPM1)
#### 2.  Q9R283 ========================> Short transient receptor potential channel 2 (TRPC2)
#### 3.  O94759 ========================> Transient receptor potential cation channel subfamily M member 2 (TRPM2)
#### 4.  Q2TV84 ========================> Transient receptor potential cation channel subfamily M member 1 (TRPM1)
#### 5.  Q8R4D5 ========================> Transient receptor potential cation channel subfamily M member 8 (TRPM8)

# Colab Page for Submitting Novel Proteins to Test the Validity of Our Method

Please submit protein sequences in FASTA format, as shown in the input section of this Colab page. With this page, researchers can generate BERT representations, download the Distograms of the submitted protein sequences, and transform those Distograms into features. The Distogram features and the BERT features are then concatenated to obtain a hybrid feature set. Finally, these features are used to test the validity of our proposed method.

# **1. Import Important Modules**

In [6]:
__author__ = "Software Authors Name"
__copyright__ = "Copyright (C) 2004 Author Name"
__license__ = "Public Domain"
__version__ = "1.0"

In [None]:
#@title 1. Necessary modules
import requests
import shutil
import glob
import cv2
import os
from skimage import io
import pandas as pd
import time
import csv
import json
import numpy as np
import math
from sklearn import svm
import pickle
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
import sys

# **2. Verify GPU availability**

In [None]:
#@title 2. Check for GPU availability
# Memory footprint support libraries/code
# If the utilization is greater than 0%, try killing the running processes with (!kill -9 -1).
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize
import psutil
import humanize
import os
import GPUtil as GPU
GPUs = GPU.getGPUs()
# XXX: Colab provides only one GPU, and it isn't guaranteed
gpu = GPUs[0]
def printm():
 process = psutil.Process(os.getpid())
 print("Gen RAM Free: " + humanize.naturalsize( psutil.virtual_memory().available ), " | Proc size: " + humanize.naturalsize( process.memory_info().rss))
 print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))
printm() 

Collecting gputil
  Downloading GPUtil-1.4.0.tar.gz (5.5 kB)
Building wheels for collected packages: gputil
  Building wheel for gputil (setup.py) ... done
  Created wheel for gputil: filename=GPUtil-1.4.0-py3-none-any.whl size=7411 sha256=d71388c9ecfeb2c165b65698bb0370df85957bc18f408d3447c3e08b0a863978
  Stored in directory: /root/.cache/pip/wheels/6e/f8/83/534c52482d6da64622ddbf72cd93c35d2ef2881b78fd08ff0c
Successfully built gputil
Installing collected packages: gputil
Successfully installed gputil-1.4.0
Gen RAM Free: 12.5 GB  | Proc size: 185.5 MB
GPU RAM Free: 11441MB | Used: 0MB | Util   0% | Total 11441MB


# **3. Get the BERT Package from the Google-research GitHub repository**

In [None]:
#@title 3. Download the BERT Package using Google-research link
!test -d bert_repo || git clone https://github.com/google-research/bert bert_repo
if 'bert_repo' not in sys.path:
  sys.path += ['bert_repo']

Cloning into 'bert_repo'...
remote: Enumerating objects: 340, done.
remote: Total 340 (delta 0), reused 0 (delta 0), pack-reused 340
Receiving objects: 100% (340/340), 328.28 KiB | 4.00 MiB/s, done.
Resolving deltas: 100% (182/182), done.


# **4. Get the particular BERT Model (In our case: BERT LARGE CASED MODEL)**

In [None]:
#@title 4. Download BERT Large Cased Model 
!wget https://storage.googleapis.com/bert_models/2018_10_18/cased_L-24_H-1024_A-16.zip

--2021-12-19 13:35:47--  https://storage.googleapis.com/bert_models/2018_10_18/cased_L-24_H-1024_A-16.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.197.128, 64.233.191.128, 173.194.74.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.197.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1242178883 (1.2G) [application/zip]
Saving to: ‘cased_L-24_H-1024_A-16.zip’


2021-12-19 13:35:57 (116 MB/s) - ‘cased_L-24_H-1024_A-16.zip’ saved [1242178883/1242178883]



# **5. Extract the BERT Model**

In [None]:
#@title 5. Extract BERT Large Cased Model 
# Extract all files
import zipfile

folder = 'bert_model'
with zipfile.ZipFile("cased_L-24_H-1024_A-16.zip","r") as zip_ref:
    zip_ref.extractall(folder)

# **6. Include BERT specific Modules**

In [None]:
#@title 6. Import required modules for BERT
!pip install tensorflow-gpu==1.15.2
import modeling
import optimization
import run_classifier
import run_classifier_with_tfhub
import tokenization
import tensorflow as tf
# import tfhub 
import tensorflow_hub as hub
import zipfile
import os

Collecting tensorflow-gpu==1.15.2
  Downloading tensorflow_gpu-1.15.2-cp37-cp37m-manylinux2010_x86_64.whl (410.9 MB)
     |████████████████████████████████| 410.9 MB 33 kB/s
Collecting tensorboard<1.16.0,>=1.15.0
  Downloading tensorboard-1.15.0-py3-none-any.whl (3.8 MB)
     |████████████████████████████████| 3.8 MB 36.4 MB/s
Collecting tensorflow-estimator==1.15.1
  Downloading tensorflow_estimator-1.15.1-py2.py3-none-any.whl (503 kB)
     |████████████████████████████████| 503 kB 43.2 MB/s
Collecting keras-applications>=1.0.8
  Downloading Keras_Applications-1.0.8-py3-none-any.whl (50 kB)
     |████████████████████████████████| 50 kB 6.6 MB/s
Collecting gast==0.2.2
  Downloading gast-0.2.2.tar.gz (10 kB)
Building wheels for collected packages: gast
  Building wheel for gast (setup.py) ... done
  Created wheel for gast: filename=gast-0.2.2-py3-none-any.whl size=7554 sha256=599f805ff85991e3eb937ac460ca4b887c02759b45d2587fc45500b2b17f0d2b
  Stored in d

# **7. Get embeddings for input data files**


In [None]:
#@title 7. Generate representations from BERT

data_bert_new = pd.DataFrame()
# Extract features for n-gram embeddings
def extractFeatureEmbedingJSONL(input_jsonl_file_path):
  # Temporary store variable
  temp_store_feature = [];
  embedding = []
  # Read JSONL files and append embedding vectors
  with open(input_jsonl_file_path) as f:
      for line in f:
        embedding.append(json.loads(line))
  # Print total rows data in test and train
  #print("Max embedings: "+str(len(embedding)))
    
  # Extract feature for each proteins here
  for row_index, get_prot_embedding in enumerate(embedding):
    # Temp variables
    store_token_amino_acid = [];
    store_token_embedding = [];
  
    # Get features
    features = embedding[row_index]["features"]

    # Extract amino acid tokens and vectors (token embedding)
    for index, feature in enumerate(features):
      token_amino_acid = feature["token"]
      # Each entry of feature["layers"] holds one requested layer;
      # positions 0..3 follow the order given in --layers='-1,-2,-3,-4'
      token_embedding_layer0 = feature["layers"][0]["values"] # layer -1 (last layer)
      token_embedding_layer1 = feature["layers"][1]["values"] # layer -2
      token_embedding_layer2 = feature["layers"][2]["values"] # layer -3
      token_embedding_layer3 = feature["layers"][3]["values"] # layer -4

      # Collect the four layer vectors in one list
      token_embedding = [token_embedding_layer0, token_embedding_layer1, token_embedding_layer2, token_embedding_layer3];
      # Element-wise sum of the last four layers
      token_embedding = sum(map(np.array, token_embedding));

      # Store
      store_token_amino_acid.append(token_amino_acid);
      store_token_embedding.append(token_embedding);

    # Convert to dataframe (one row per token, similar to a PSSM)
    data_bert = pd.DataFrame(store_token_embedding)


    # Add a new 'residue' column holding the amino acid of each row
    data_bert['residue'] = store_token_amino_acid

    # Remove first and last rows containing special tokens
    data_bert = data_bert.drop(data_bert.index[len(data_bert)-1])
    data_bert = data_bert.drop(data_bert.index[0])
    # Note: this returns after the first JSONL record, which is sufficient
    # here because each call processes a single subsequence.
    return data_bert;
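
To make the layer-summing step concrete, the following is a minimal sketch on a toy record. The record mimics the `features` / `token` / `layers` / `values` layout read by the function above; the two-dimensional vectors and the helper name `sum_layers` are illustrative assumptions, not real BERT output.

```python
import json

# Toy JSONL line shaped like the records read above (all values are made up).
line = json.dumps({
    "features": [
        {"token": "[CLS]", "layers": [{"index": -1, "values": [0.0, 0.0]}] * 4},
        {"token": "M", "layers": [
            {"index": -1, "values": [1.0, 2.0]},
            {"index": -2, "values": [0.5, 0.5]},
            {"index": -3, "values": [0.25, 1.0]},
            {"index": -4, "values": [0.25, 0.5]},
        ]},
        {"token": "[SEP]", "layers": [{"index": -1, "values": [0.0, 0.0]}] * 4},
    ]
})

def sum_layers(feature):
    """Element-wise sum of the vectors of all layers stored for one token."""
    vectors = [layer["values"] for layer in feature["layers"]]
    return [sum(components) for components in zip(*vectors)]

record = json.loads(line)
# Skip the first and last entries, which hold the special tokens.
residues = record["features"][1:-1]
summed = {feature["token"]: sum_layers(feature) for feature in residues}
print(summed)
```

Each residue ends up with a single vector of the same width as one layer, so summing four layers does not increase the feature dimension.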

# **8. Generate Embedding Matrix using BERT**

In [None]:
#@title 8. Generate Embedding Matrix. This step outputs a 20 x 768 matrix for BERT Base or a 20 x 1024 matrix for BERT Large by summing the feature vectors of identical amino acids.
def GenerateBERTEmbeddingMatrix(df_protein_bert_embeddings):
  # List the 20 standard amino acids in a fixed order so that missing ones can be detected
  default_AA = ['A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y']


  # Sum/mean multiple row values of various columns grouped by 'residue'
  df_protein_bert_embeddings = df_protein_bert_embeddings.groupby('residue', as_index=False).sum()
  print(df_protein_bert_embeddings)

  # Transpose first row as header
  df_protein_bert_embeddings = df_protein_bert_embeddings.set_index('residue').T

  # Get recent column names 
  get_column_names = df_protein_bert_embeddings.columns.tolist()
  
  # Find the amino acids that do not occur in this protein
  get_column_mis = list(set(default_AA).difference(get_column_names))

  # Check list 
  if len(get_column_mis) > 0:
    for get_aa in get_column_mis:
      # Create column with value 0
      df_protein_bert_embeddings[get_aa] = 0;

  # Select only default amino acids
  bert_values_order = df_protein_bert_embeddings[default_AA]
  print(bert_values_order)

  # Transpose again so that rows are amino acids and columns are embedding dimensions
  bert_values_order = bert_values_order.transpose()
  print(bert_values_order)

  # ##############################################################################
  # As with a PSSM, we use the whole 20 rows × 1024 columns matrix,
  # which gives 20 × 1024 = 20480 features for each protein
  # ##############################################################################
  # Flatten the DataFrame values into a one-dimensional array
  bert_feature_used = bert_values_order.values.flatten();
  return bert_feature_used
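
The per-residue summation and flattening can be illustrated without pandas. This is a plain-Python sketch under stated assumptions: the helper name `residue_sum_matrix` and the two-dimensional toy vectors are hypothetical, while the amino acid order matches `default_AA` above.

```python
# The 20 standard amino acids, in the same order as default_AA above.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def residue_sum_matrix(tokens, vectors):
    """Sum vectors per residue and return a flat list of 20 * dim values."""
    dim = len(vectors[0])
    sums = {aa: [0.0] * dim for aa in AMINO_ACIDS}  # absent residues stay zero
    for token, vector in zip(tokens, vectors):
        if token in sums:  # ignore anything outside the 20 standard residues
            sums[token] = [a + b for a, b in zip(sums[token], vector)]
    # Row-major flattening: all values for 'A', then 'C', and so on.
    return [value for aa in AMINO_ACIDS for value in sums[aa]]

tokens = ["M", "K", "M"]
vectors = [[1.0, 2.0], [0.5, 0.5], [3.0, 4.0]]
flat = residue_sum_matrix(tokens, vectors)
print(len(flat))  # 20 residues * 2 dimensions = 40
```

With 1024-dimensional BERT Large vectors the same procedure yields 20 × 1024 = 20480 values per protein, regardless of sequence length.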

# **9. Provide input and other parameters for BERT**

In [None]:
#@title 9. Pass the necessary parameters and input to the BERT feature extractor module to generate representations.

# Note: Auto detect for GPU when set use_tpu=False (training will fall on CPU or GPU)
# From the jsonl file you have last 4 layers outputs or -1,-2,-3,-4
# Get embeddings for input data classifiers from Google Colab terminal command
def extractEmbeddingBertFeatures(df_data, bert_model_path): 
    start_time = time.time()
    get_path = bert_model_path;
    print("Bert Path: {0}".format(get_path));

    # Save temp dataframe and run bert embedding extractor
    df_data.to_csv('input.txt', index=False, header=False, quoting=csv.QUOTE_NONE)
    os.system(f"python3 /content/bert_repo/extract_features.py \
               --input_file=input.txt \
               --output_file=output.jsonl \
               --vocab_file='{bert_model_path}/vocab.txt' \
               --bert_config_file='{bert_model_path}/bert_config.json' \
               --init_checkpoint='{bert_model_path}/bert_model.ckpt' \
               --layers='-1,-2,-3,-4' \
               --max_seq_length=512 \
               --do_lower_case=False \
               --batch_size=8 \
               --use_tpu=False")

    #bert_output = pd.read_json("output.jsonl", lines=True)
    #bert_output.head()
    
    # Call function and extract/generate all BERT features from the embedding file
    result_features = extractFeatureEmbedingJSONL('output.jsonl');

    # Remove temp files
    os.system("rm input.txt")
    os.system("rm output.jsonl")
    
    #Convert to dataframe
    df_results = pd.DataFrame(result_features)
    
    # Timing
    print("[It takes {0} seconds to extract embedding features]".format((time.time() - start_time)))

    return result_features  
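
As a possible alternative to building one long shell string, the same invocation can be expressed as an argument list, which avoids shell quoting issues. The sketch below only constructs and prints the command; the helper name `build_extract_command` is hypothetical, and the flags are the ones used in the cell above.

```python
def build_extract_command(bert_model_path, input_file="input.txt",
                          output_file="output.jsonl"):
    """Return the extract_features.py invocation as an argument list."""
    return [
        "python3", "/content/bert_repo/extract_features.py",
        f"--input_file={input_file}",
        f"--output_file={output_file}",
        f"--vocab_file={bert_model_path}/vocab.txt",
        f"--bert_config_file={bert_model_path}/bert_config.json",
        f"--init_checkpoint={bert_model_path}/bert_model.ckpt",
        "--layers=-1,-2,-3,-4",
        "--max_seq_length=512",
        "--do_lower_case=False",
        "--batch_size=8",
        "--use_tpu=False",
    ]

command = build_extract_command("/content/bert_model/cased_L-24_H-1024_A-16")
print(" ".join(command))
# In the notebook, subprocess.run(command, check=True) would execute it and
# raise CalledProcessError if the script exits with a non-zero status.
```

Unlike `os.system`, running the list with `check=True` would surface a failed extraction immediately instead of failing later when `output.jsonl` is read.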

# **10. Provide each subsequence to obtain embeddings from BERT**

In [None]:
#@title 10. Decompose longer protein sequences into subsequences with a maximum length of 510, and sequentially pass each subsequence to BERT to obtain representations.

def generate_portionwise_embeddings(df_fasta_format):
  df_final_results = pd.DataFrame()
  bert_store_feature = [];
  bert_prot_id = [];
  for index, row in df_fasta_format.iterrows():
    df_selected = df_fasta_format.iloc[index:index+1 , : ]; # for each row
    str_sequence = df_selected['SEQUENCE'].tolist()[0];
  # Split to max 510 amino acids (with 2 additional special tokens)
    lst_part_seq = cut_string(str_sequence, 510);
    # print(lst_part_seq)
    get_id = df_selected['ID'].tolist()[0];
  # #  Create dataframe for each proteins ID
    df_prot = pd.DataFrame({"SEQUENCE": lst_part_seq, 'ID': get_id})
  # # #   CREATE N-GRAMS DATA
    df_prot['1-grams'] = df_prot.apply(lambda x: ngrams(x['SEQUENCE'], 1), axis=1)
    df_bert_res_new = pd.DataFrame()
    # print(df_prot['1-grams']);
    for subsequence in df_prot['1-grams'].values.tolist():
      df_unigram = pd.DataFrame([subsequence])  # one-row frame; a list avoids relying on set handling
      df_unigram = df_unigram.rename(columns = {0: "unigram"})
      # print(df_unigram['unigram'][0]);
      BERT_PRETRAINED_DIR = '/content/bert_model/cased_L-24_H-1024_A-16' 
      print('>>  BERT pretrained directory: '+BERT_PRETRAINED_DIR)
      print("SUBSEQUENCE OF PROTEINS:");print(subsequence);
      df_bert_res_return = extractEmbeddingBertFeatures(df_unigram, BERT_PRETRAINED_DIR);
      print(df_bert_res_return)

      df_bert_res_new = df_bert_res_new.append(df_bert_res_return)
      
    print("ALL BERT EMBEDDINGS BEFORE CALCULATIONS:");
    # print(df_bert_res_new);  
    
    # Simple method to calculate BERT features for classifiers
    bert_feature_flattened = GenerateBERTEmbeddingMatrix(df_bert_res_new)
    print(bert_feature_flattened)
    
    # Store for all proteins in list with their ID
    bert_store_feature.append(bert_feature_flattened);
    bert_prot_id.append(get_id);

  df_results = pd.DataFrame(bert_store_feature)
  df_results ['id'] = bert_prot_id;

  return df_results

# **11. Place the protein sequences in FASTA format into DataFrame**

In [None]:
#@title 11. This step converts protein sequences in FASTA format into a DataFrame

def read_fasta_input(fastaSequenceInput):
    # Variables
    store_accesion_id = [];
    store_sequence_prot = [];
    store_seq_Length = [];
    
    data = fastaSequenceInput.replace('\n\n', '\n');
    getProtSeq = data.split(">")
    str_list = list(filter(None, getProtSeq)) # fastest
    
    for data_lst in str_list:
        try:
            each_prot = data_lst.split("\n")
            clear_prot = list(filter(None, each_prot)) # fastest
            # The first line of each record is the ID (header line)
            accesion_id = clear_prot[0];
            # Get sequence of protein by joining list from index
            get_sequence = "".join(clear_prot[1:len(clear_prot)]);
            get_sequence = get_sequence.replace('  ', ' ').replace(' ', '').replace('\t', '').replace('\n', '').replace('<br>', '');
            # Get sequence length
            get_seq_len = len(get_sequence);
            # Store
            store_accesion_id.append(accesion_id);
            store_sequence_prot.append(get_sequence);
            store_seq_Length.append(get_seq_len); 
        except Exception:
            print("Problem found, skipping protein: {0}".format(data_lst));
    all_data = {'ID' : store_accesion_id, 
                'SEQUENCE': store_sequence_prot,
                'length':store_seq_Length
               }
    return all_data; 
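
The parsing logic can be checked on a small input. The following is a simplified sketch, not the function above: the name `parse_fasta` and the identifiers `P1` and `P2` are made up, and the identifier is stripped immediately rather than in a later step.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (identifier, sequence) pairs."""
    records = []
    for block in text.split(">"):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue  # skip the empty chunk before the first '>'
        identifier = lines[0]
        sequence = "".join(lines[1:]).replace(" ", "")
        records.append((identifier, sequence))
    return records

# Trailing spaces mimic the line-continuation style of the input cell.
toy = ">P1 \nMKDS \nNRCC\n>P2\nMLMS"
records = parse_fasta(toy)
print(records)
```

Sequence lines are joined in order, so a record that spans several lines is recovered as one continuous string.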

# **12. Convert protein sequences into uni-gram**

In [None]:
#@title 12. This step transforms a protein sequence into uni-gram format, the input format accepted by BERT.
# CREATE N-GRAMS DATA

'''
 A function to split the data into n-gram feature
'''
def ngrams(input, n):
  # Truncate to 510 residues (BERT maximum input of 512 minus 2 special tokens)
  if len(input) < 510:
    input = input[0:len(input)];
  else:
    input = input[0:510];
  
  # Create a list and dataframe
  output = []

  # Loop over every start position of an n-gram (+1 so the last n-gram is included)
  for i in range(0, (len(input)+1)-n): # stops at the last full n-gram
      # Cut for each n-gram
      g = input[i:i+n];

      # Store in list
      output.append(g);

  # Convert list to string
  joinstr = ' '.join(output);

  return joinstr;
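
For n = 1 the function reduces to separating residues with spaces, so that each amino acid becomes its own token. A compact sketch of that special case follows; the helper name `to_unigrams` is hypothetical.

```python
def to_unigrams(sequence, max_len=510):
    """Space-separate residues, truncating to max_len residues first."""
    return " ".join(sequence[:max_len])

example = to_unigrams("MKDSNR")
print(example)
```

The 510-residue limit leaves room for the two special tokens within the 512-token maximum passed to the extractor.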

# **13. Partition the longer protein sequences into subsequences**

In [None]:
#@title 13. This step splits a protein sequence into subsequences of at most 510 residues.
# Cut string to list
def cut_string(input_str, x):
    # Cut
    lst_res = [input_str[y-x:y] for y in range(x, len(input_str)+x, x)]
    return lst_res;
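
As a worked example of the partitioning, a sequence of 1603 residues (the length reported for Q7Z4N2 in this notebook) is split into three full chunks and one shorter remainder. The sketch uses an equivalent formulation of the slicing; the name `chunk_sequence` and the placeholder sequence are assumptions for illustration.

```python
def chunk_sequence(sequence, size):
    """Split a string into consecutive pieces of at most `size` characters."""
    return [sequence[start:start + size] for start in range(0, len(sequence), size)]

sequence = "A" * 1603  # placeholder with the same length as Q7Z4N2
chunks = chunk_sequence(sequence, 510)
print([len(chunk) for chunk in chunks])  # [510, 510, 510, 73]
```

The chunks do not overlap, so concatenating them restores the original sequence.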

# **14. Provide the input in FASTA format**
## Kindly follow the same format while providing input.
### Here we used novel and unseen TRP channels that are not included in our original dataset.

In [None]:
#@title 14. Input protein sequences in FASTA format.

fasta_inputs = ">Q7Z4N2 \
\nMKDSNRCCCGQFTNQHIPPLPSATPSKNEEESKQVETQPEKWSVAKHTQSYPTDSYGVLE \
\nFQGGGYSNKAMYIRVSYDTKPDSLLHLMVKDWQLELPKLLISVHGGLQNFEMQPKLKQVF \
\nGKGLIKAAMTTGAWIFTGGVSTGVISHVGDALKDHSSKSRGRVCAIGIAPWGIVENKEDL \
\nVGKDVTRVYQTMSNPLSKLSVLNNSHTHFILADNGTLGKYGAEVKLRRLLEKHISLQKIN \
\nTRLGQGVPLVGLVVEGGPNVVSIVLEYLQEEPPIPVVICDGSGRASDILSFAHKYCEEGG \
\nIINESLREQLLVTIQKTFNYNKAQSHQLFAIIMECMKKKELVTVFRMGSEGQQDIEMAIL \
\nTALLKGTNVSAPDQLSLALAWNRVDIARSQIFVFGPHWPPLGSLAPPTDSKATEKEKKPP \
\nMATTKGGRGKGKGKKKGKVKEEVEEETDPRKIELLNWVNALEQAMLDALVLDRVDFVKLL \
\nIENGVNMQHFLTIPRLEELYNTRLGPPNTLHLLVRDVKKSNLPPDYHISLIDIGLVLEYL \
\nMGGAYRCNYTRKNFRTLYNNLFGPKRPKALKLLGMEDDEPPAKGKKKKKKKKEEEIDIDV \
\nDDPAVSRFQYPFHELMVWAVLMKRQKMAVFLWQRGEESMAKALVACKLYKAMAHESSESD \
\nLVDDISQDLDNNSKDFGQLALELLDQSYKHDEQIAMKLLTYELKNWSNSTCLKLAVAAKH \
\nRDFIAHTCSQMLLTDMWMGRLRMRKNPGLKVIMGILLPPTILFLEFRTYDDFSYQTSKEN \
\nEDGKEKEEENTDANADAGSRKGDEENEHKKQRSIPIGTKICEFYNAPIVKFWFYTISYLG \
\nYLLLFNYVILVRMDGWPSLQEWIVISYIVSLALEKIREILMSEPGKLSQKIKVWLQEYWN \
\nITDLVAISTFMIGAILRLQNQPYMGYGRVIYCVDIIFWYIRVLDIFGVNKYLGPYVMMIG \
\nKMMIDMLYFVVIMLVVLMSFGVARQAILHPEEKPSWKLARNIFYMPYWMIYGEVFADQID \
\nLYAMEINPPCGENLYDEEGKRLPPCIPGAWLTPALMACYLLVANILLVNLLIAVFNNTFF \
\nEVKSISNQVWKFQRYQLIMTFHDRPVLPPPMIILSHIYIIIMRLSGRCRKKREGDQEERD \
\nRGLKLFLSDEELKRLHEFEEQCVQEHFREKEDEQQSSSDERIRVTSERVENMSMRLEEIN \
\nERETFMKTSLQTVDLRLAQLEELSNRMVNALENLAGIDRSDLIQARSRASSECEATYLLR \
\nQSSINSADGYSLYRYHFNGEELLFEDTSLSTSPGTGVRKKTCSFRIKEEKDVKTHLVPEC \
\nQNSLHLSLGTSTSATPDGSHLAVDDLKNAEESKLGPDIGISKEDDERQTDSKKEETISPS \
\nLNKTDVIHGQDKSDVQNTQLTVETTNIEGTISYPLEETKITRYFPDETINACKTMKSRSF \
\nVYSRGRKLVGGVNQDVEYSSITDQQLTTEWQCQVQKITRSHSTDIPYIVSEAAVQAEHKE \
\nQFADMQDEHHVAEAIPRIPRLSLTITDRNGMENLLSVKPDQTLGFPSLRSKSLHGHPRNV \
\nKSIQGKLDRSGHASSVSSLVIVSGMTAEEKKVKKEKASTETEC \
>Q9R283 \
\nMLMSLTDSKEGKNRSGVRMFKDDDFLIPASGESWDRLRLTCSQPFTRHQSFGLAFLRVRS \
\nSLDSLSDPVKDPSSPGSSGLNQNSSDKLESDPSPWLTNPSIRRTFFPDPQTSTKEISALK \
\nGMLKQLQPGPLGRAARMVLSAAHKAPPASVVSPNNSHGEPDSSHPERAEPRAEEPNRKNN \
\nASRGKRRKVQEQRRPLSSSSSQPNRRATGRTKQRQQRPQAKSDGSGVQATGQCPICTGSF \
\nSIEALPRHAATCGESSPPQPASPTSLSSSESVLRCLHVALTPVPLIPKPNWTEIVNKKLK \
\nFPPTLLRAIQEGQLGLVQQLLESGSDPSGAGPGGPLRNVEESEDRSWREALNLAIRLGHE \
\nVITDVLLANVKFDFRQIHEALLVAVDTNQPAVVRRLLARLEREKGRKVDTKSFSLAFFDS \
\nSIDGSRFAPGVTPLTLACQKDLYEIAQLLMDQGHTIARPHPVSCACLECSNARRYDLLKF \
\nSLSRINTYRGIASRAHLSLASEDAMLAAFQLSRELRRLARKEPEFKPQYIALESLCQDYG \
\nFELLGMCRNQSEVTAVLNDLGEDSETEPEAEGLGQAFEEGIPNLARLRLAVNYNQKQFVA \
\nHPICQQVLSSIWCGNLAGWRGSTTIWKLFVAFLIFLTMPFLCIGYWLAPKSRLGRLLKIP \
\nVLKFLLHSASYLWFLIFLLGESLVMETQLSTFKGRSQSVWETSLHMIWVTGFLWFECKEV \
\nWIEGLRSYLLDWWNFLDVVILSLYLASFALRLLLAGLAYMHCRDASDSSTCRYFTTAERS \
\nEWRTEDPQFLAEVLFAVTSMLSFTRLAYILPAHESLGTLQISIGKMIDDMIRFMFILMII \
\nLTAFLCGLNNIYVPYQETEKLGNFNETFQFLFWTMFGMEEHSVVDMPQFLVPEFVGRAMY \
\nGIFTIVMVIVLLNMLIAMITNSFQKIEDDADVEWKFARSKLYLSYFREGLTLPVPFNILP \
\nSPKAAFYLLRRIFRFICCGSSCCKAKKSDYPPIPTFTNPGARAGPGEGEHVSYRLRVIKA \
\nLVQRYIETARREFEETRRKDLGNRLTELTKTVSRLQSEVASVQKTVAAGGALRPPDGASI \
\nLSRYITRVRNSFQNLGPPAPDTPAELTMPGIVETEVSLEDSLDATGEAGTPASGESSSSS \
\nSAHVLVHREQEAEGAGDLPLGEDLETKGES \
>O94759 \
\nMEPSALRKAGSEQEEGFEGLPRRVTDLGMVSNLRRSNSSLFKSWRLQCPFGNNDKQESLS \
\nSWIPENIKKKECVYFVESSKLSDAGKVVCQCGYTHEQHLEEATKPHTFQGTQWDPKKHVQ \
\nEMPTDAFGDIVFTGLSQKVKKYVRVSQDTPSSVIYHLMTQHWGLDVPNLLISVTGGAKNF \
\nNMKPRLKSIFRRGLVKVAQTTGAWIITGGSHTGVMKQVGEAVRDFSLSSSYKEGELITIG \
\nVATWGTVHRREGLIHPTGSFPAEYILDEDGQGNLTCLDSNHSHFILVDDGTHGQYGVEIP \
\nLRTRLEKFISEQTKERGGVAIKIPIVCVVLEGGPGTLHTIDNATTNGTPCVVVEGSGRVA \
\nDVIAQVANLPVSDITISLIQQKLSVFFQEMFETFTESRIVEWTKKIQDIVRRRQLLTVFR \
\nEGKDGQQDVDVAILQALLKASRSQDHFGHENWDHQLKLAVAWNRVDIARSEIFMDEWQWK \
\nPSDLHPTMTAALISNKPEFVKLFLENGVQLKEFVTWDTLLYLYENLDPSCLFHSKLQKVL \
\nVEDPERPACAPAAPRLQMHHVAQVLRELLGDFTQPLYPRPRHNDRLRLLLPVPHVKLNVQ \
\nGVSLRSLYKRSSGHVTFTMDPIRDLLIWAIVQNRRELAGIIWAQSQDCIAAALACSKILK \
\nELSKEEEDTDSSEEMLALAEEYEHRAIGVFTECYRKDEERAQKLLTRVSEAWGKTTCLQL \
\nALEAKDMKFVSHGGIQAFLTKVWWGQLSVDNGLWRVTLCMLAFPLLLTGLISFREKRLQD \
\nVGTPAARARAFFTAPVVVFHLNILSYFAFLCLFAYVLMVDFQPVPSWCECAIYLWLFSLV \
\nCEEMRQLFYDPDECGLMKKAALYFSDFWNKLDVGAILLFVAGLTCRLIPATLYPGRVILS \
\nLDFILFCLRLMHIFTISKTLGPKIIIVKRMMKDVFFFLFLLAVWVVSFGVAKQAILIHNE \
\nRRVDWLFRGAVYHSYLTIFGQIPGYIDGVNFNPEHCSPNGTDPYKPKCPESDATQQRPAF \
\nPEWLTVLLLCLYLLFTNILLLNLLIAMFNYTFQQVQEHTDQIWKFQRHDLIEEYHGRPAA \
\nPPPFILLSHLQLFIKRVVLKTPAKRHKQLKNKLEKNEEAALLSWEIYLKENYLQNRQFQQ \
\nKQRPEQKIEDISNKVDAMVDLLDLDPLKRSGSMEQRLASLEEQVAQTAQALHWIVRTLRA \
\nSGFSSEADVPTLASQKAAEEPDAEPGGRKKTEEPGDSYHVNARHLLYPNCPVTRFPVPNE \
\nKVPWETEFLIYDPPFYTAERKDAAAMDPMGDTLEPLSTIQYNVVDGLRDRRSFHGPYTVQ \
\nAGLPLNPMGRTGLRGRGSLSCFGPNHTLYPMVTRWRRNEDGAICRKSIKKMLEVLVVKLP \
\nLSEHWALPGGSREPGEMLPRKLKRILRQEHWPSFENLLKCGMEVYKGYMDDPRNTDNAWI \
\nETVAVSVHFQDQNDVELNRLNSNLHACDSGASIRWQVVDRRIPLYANHKTLLQKAAAEFG \
\nAHY \
>Q2TV84 \
\nMGSMRKMSSSFKRGSIKSSTSGSQKGQKAWIEKTFCKRECIFVIPSTKDPNRCCCGQLTN \
\nQHIPPLPSGAPSTTGEDTKQADTQSGKWSVSKHTQSYPTDSYGILEFQGGGYSNKAMYIR \
\nVSYDTKPDSLLHLMVKDWQLELPKLLISVHGGLQSFEMQPKLKQVFGKGLIKAAMTTGAW \
\nIFTGGVSTGVVSHVGDALKDHSSKSRGRLCAIGIAPWGMVENKEDLIGKDVTRVYQTMSN \
\nPLSKLSVLNNSHTHFILADNGTLGKYGAEVKLRRQLEKHISLQKINTRLGQGVPVVGLVV \
\nEGGPNVVSIVLEYLKEDPPVPVVVCDGSGRASDILSFAHKYCDEGGVINESLRDQLLVTI \
\nQKTFNYSKSQSYQLFAIIMECMKKKELVTVFRMGSEGQQDVEMAILTALLKGTNASAPDQ \
\nLSLALAWNRVDIARSQIFVFGPHWPPLGSLAPPVDTKATEKEKKPPTATTKGRGKGKGKK \
\nKGKVKEEVEEETDPRKLELLNWVNALEQAMLDALVLDRVDFVKLLIENGVNMQHFLTIPR \
\nLEELYNTRLGPPNTLHLLVRDVKKSNLPPDYHISLIDIGLVLEYLMGGAYRCNYTRKSFR \
\nTLYNNLFGPKRPKALKLLGMEDDEPPAKGKKKKKKKKEEEIDIDVDDPAVSRFQYPFHEL \
\nMVWAVLMKRQKMAVFLWQRGEECMAKALVACKLYKAMAHESSESELVDDISQDLDNNSKD \
\nFGQLAVELLDQSYKHDEQVAMKLLTYELKNWSNSTCLKLAVAAKHRDFIAHTCSQMLLTD \
\nMWMGRLRMRKNPGLKVIMGILIPPTILFLEFRTYDDFSYQTSKENEDGKEKEEENVDANA \
\nDAGSRKGDEENEHKKQRSIPIGTKICEFYNAPIVKFWFYTISYLGYLLLFNYVILVRMDG \
\nWPSPQEWIVISYIVSLALEKIREILMSEPGKLSQKIKVWLQEYWNITDLVAISMFMVGAI \
\nLRLQSQPYMGYGRVIYCVDIILWYIRVLDIFGVNKYLGPYVMMIGKMMIDMLYFVVIMLV \
\nVLMSFGVARQAILHPEEKPSWKLARNIFYMPYWMIYGEVFADQIDLYAMEINPPCGENLY \
\nDEEGKRLPPCIPGAWLTPALMACYLLVANILLVNLLIAVFNNTFFEVKSISNQVWKFQRY \
\nQLIMTFHDRPVLPPPMIILSHIYIIIMRLSGRCRKKREGDQEERDRGLKLFLSDEELKKL \
\nHEFEEQCVQEHFREKEDEQQSSSDERIRVTSERVENMSMRLEEINERENFMKTSLQTVDL \
\nRLSQLEELSGRMVSALENLAGIDRSDLIQARSRASSECEATYLLRQSSINSADGYSLYRY \
\nHFNGEELLFEEPALSTSPGTAFRKKTYSFRVKDEDAKSHLDQPSNLHHTPGPSPPATPGR \
\nSRLALEGPLSTELRPGSDPGISAGEFDPRADFKSTEAAPSLNAAGVTGTQLTVESTDSHP \
\nLRESKLVRYYPGDPNTYKTMKSRSFVYTEGRKLVRGLSNWSAEYSSIMDQAWNATEWRCQ \
\nVQRITRSRSTDIPYIVSEAASQDELEDEHRGSLLDPQISRSALTVSDRPEKENLLSVKPH \
\nQTLGFPCLRSRSLHGRPRSAEPAPSKLDRAGHASSTSNLAVMSVVPEGQNTQQEKRSAET \
\nEC \
>Q8R4D5 \
\nMSFEGARLSMRSRRNGTMGSTRTLYSSVSRSTDVSYSDSDLVNFIQANFKKRECVFFTRD \
\nSKAMENICKCGYAQSQHIEGTQINQNEKWNYKKHTKEFPTDAFGDIQFETLGKKGKYLRL \
\nSCDTDSETLYELLTQHWHLKTPNLVISVTGGAKNFALKPRMRKIFSRLIYIAQSKGAWIL \
\nTGGTHYGLMKYIGEVVRDNTISRNSEENIVAIGIAAWGMVSNRDTLIRSCDDEGHFSAQY \
\nIMDDFTRDPLYILDNNHTHLLLVDNGCHGHPTVEAKLRNQLEKYISERTSQDSNYGGKIP \
\nIVCFAQGGGRETLKAINTSVKSKIPCVVVEGSGQIADVIASLVEVEDVLTSSMVKEKLVR \
\nFLPRTVSRLPEEEIESWIKWLKEILESSHLLTVIKMEEAGDEIVSNAISYALYKAFSTNE \
\nQDKDNWNGQLKLLLEWNQLDLASDEIFTNDRRWESADLQEVMFTALIKDRPKFVRLFLEN \
\nGLNLQKFLTNEVLTELFSTHFSTLVYRNLQIAKNSYNDALLTFVWKLVANFRRSFWKEDR \
\nSSREDLDVELHDASLTTRHPLQALFIWAILQNKKELSKVIWEQTKGCTLAALGASKLLKT \
\nLAKVKNDINAAGESEELANEYETRAVELFTECYSNDEDLAEQLLVYSCEAWGGSNCLELA \
\nVEATDQHFIAQPGVQNFLSKQWYGEISRDTKNWKIILCLFIIPLVGCGLVSFRKKPIDKH \
\nKKLLWYYVAFFTSPFVVFSWNVVFYIAFLLLFAYVLLMDFHSVPHTPELILYALVFVLFC \
\nDEVRQWYMNGVNYFTDLWNVMDTLGLFYFIAGIVFRLHSSNKSSLYSGRVIFCLDYIIFT \
\nLRLIHIFTVSRNLGPKIIMLQRMLIDVFFFLFLFAVWMVAFGVARQGILRQNEQRWRWIF \
\nRSVIYEPYLAMFGQVPSDVDSTTYDFSHCTFSGNESKPLCVELDEHNLPRFPEWITIPLV \
\nCIYMLSTNILLVNLLVAMFGYTVGIVQENNDQVWKFQRYFLVQEYCNRLNIPFPFVVFAY \
\nFYMVVKKCFKCCCKEKNMESNACCFRNEDNETLAWEGVMKENYLVKINTKANDNSEEMRH \
\nRFRQLDSKLNDLKSLLKEIANNIK"


# **15. Place the sequences into DataFrame**

In [None]:
#@title 15. Store the protein sequences into DataFrame.
# Store in dataframe
get_array_fasta = read_fasta_input(fasta_inputs)
df_fasta_format = pd.DataFrame(get_array_fasta) 
prot_id_test_name  = df_fasta_format['ID'].tolist()

In [None]:
df_fasta_format

       ID                                           SEQUENCE  length
0  Q7Z4N2  MKDSNRCCCGQFTNQHIPPLPSATPSKNEEESKQVETQPEKWSVAK...    1603
1  Q9R283  MLMSLTDSKEGKNRSGVRMFKDDDFLIPASGESWDRLRLTCSQPFT...    1170
2  O94759  MEPSALRKAGSEQEEGFEGLPRRVTDLGMVSNLRRSNSSLFKSWRL...    1503
3  Q2TV84  MGSMRKMSSSFKRGSIKSSTSGSQKGQKAWIEKTFCKRECIFVIPS...    1622
4  Q8R4D5  MSFEGARLSMRSRRNGTMGSTRTLYSSVSRSTDVSYSDSDLVNFIQ...    1104


# **16. Call the Function to generate BERT representations**

In [None]:
#@title 16. Start generating word embeddings for each protein sequence using BERT.

bert_large_cased_features_trp_channels =  generate_portionwise_embeddings(df_fasta_format)

>>  BERT pretrained directory: /content/bert_model/cased_L-24_H-1024_A-16
SUBSEQUENCE OF PROTEINS:
M K D S N R C C C G Q F T N Q H I P P L P S A T P S K N E E E S K Q V E T Q P E K W S V A K H T Q S Y P T D S Y G V L E F Q G G G Y S N K A M Y I R V S Y D T K P D S L L H L M V K D W Q L E L P K L L I S V H G G L Q N F E M Q P K L K Q V F G K G L I K A A M T T G A W I F T G G V S T G V I S H V G D A L K D H S S K S R G R V C A I G I A P W G I V E N K E D L V G K D V T R V Y Q T M S N P L S K L S V L N N S H T H F I L A D N G T L G K Y G A E V K L R R L L E K H I S L Q K I N T R L G Q G V P L V G L V V E G G P N V V S I V L E Y L Q E E P P I P V V I C D G S G R A S D I L S F A H K Y C E E G G I I N E S L R E Q L L V T I Q K T F N Y N K A Q S H Q L F A I I M E C M K K K E L V T V F R M G S E G Q Q D I E M A I L T A L L K G T N V S A P D Q L S L A L A W N R V D I A R S Q I F V F G P H W P P L G S L A P P T D S K A T E K E K K P P M A T T K G G R G K G K G K K K G K V K E E V E E E T D P R K

# **17. Arrange the output BERT embeddings**

In [None]:
#@title 17. Organize the dataframe of BERT embeddings.

bert_identifier = bert_large_cased_features_trp_channels.pop('id')
# Strip the surrounding whitespace left over from the FASTA header lines
id_bert = [identifier.strip() for identifier in bert_identifier]
bert_large_cased_features_trp_channels.insert(0, 'identifier', id_bert)
print(bert_large_cased_features_trp_channels)

  identifier           0           1  ...      20477      20478      20479
0     Q7Z4N2  159.036138 -386.799378  ... -15.875570  16.598205 -78.255499
1     Q9R283  186.401666 -431.527951  ...   7.202540  17.881064 -26.087860
2     O94759  193.744620 -490.307770  ... -17.110799  25.564462 -43.457403
3     Q2TV84  186.439118 -432.939179  ... -38.669777  18.675699 -68.119855
4     Q8R4D5  126.609915 -301.131471  ...  -6.117267  24.483573 -51.237505

[5 rows x 20481 columns]


# **18. Fetch the Distograms using protein identifiers**

In [None]:
#@title 18. This step downloads the Distograms for the provided protein sequences using sequence identifier.

def download_Distogram(ID):
  # Check availability
  with requests.get(f"https://alphafold.ebi.ac.uk/api/prediction/{ID}", stream=True) as r:
    if r.status_code == 404:
      print(f"Protein {ID} not found!")
      return 0
  with requests.get(f"https://alphafold.ebi.ac.uk/files/AF-{ID}-F1-predicted_aligned_error_v1.png", stream=True) as r:
    print(f"Downloading.. {ID}")
    with open(f"{ID}.png", 'wb') as f:
        shutil.copyfileobj(r.raw, f)
    return 1
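
The image request above is written to disk without inspecting its status code. One hedged way to guard that step is sketched below; `save_image_if_ok` and `FakeResponse` are hypothetical names, and the stub only imitates the `status_code` and `raw` attributes of a streamed `requests` response so that the example runs offline.

```python
import io
import os
import shutil
import tempfile

def save_image_if_ok(response, path):
    """Write response.raw to path only when the HTTP status is 200."""
    if response.status_code != 200:
        return False
    with open(path, "wb") as handle:
        shutil.copyfileobj(response.raw, handle)
    return True

class FakeResponse:
    """Offline stand-in exposing the two attributes used above."""
    def __init__(self, status_code, payload=b""):
        self.status_code = status_code
        self.raw = io.BytesIO(payload)

target = os.path.join(tempfile.mkdtemp(), "toy.png")
saved = save_image_if_ok(FakeResponse(200, b"\x89PNG"), target)
skipped = save_image_if_ok(FakeResponse(404), target + ".missing")
print(saved, skipped)
```

With such a guard, a failed download leaves no file behind, so a later step does not try to read an error page as an image.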

# **19. Generate representations from Distograms**

In [None]:
#@title 19. This step transforms Distograms into features and saves them in a DataFrame.

lst_accession_id = []
transformed_features, final_transformed_features = [], []

for identifier in prot_id_test_name:
  identifier = identifier.strip()
  lst_accession_id.append(identifier)
  download_Distogram(identifier)

!mkdir trp_channel
!mv *.png trp_channel/

# Read each image by identifier so that the feature rows stay aligned with
# lst_accession_id (glob.glob returns files in arbitrary order)
for identifier in lst_accession_id:
  image = cv2.imread(f"trp_channel/{identifier}.png", 0)  # 0 = read as grayscale
  final_transformed_features.append(list(image.flatten()))

distogram_features_trp_channels = pd.DataFrame(final_transformed_features)
distogram_features_trp_channels['identifier'] = pd.DataFrame(lst_accession_id)
id_distogram = distogram_features_trp_channels.pop('identifier')
distogram_features_trp_channels.insert(0, 'identifier', id_distogram) 
print(distogram_features_trp_channels)

Downloading.. Q7Z4N2
Downloading.. Q9R283
Downloading.. O94759
Downloading.. Q2TV84
Downloading.. Q8R4D5
  identifier   0   1    2    3  ...  102395  102396  102397  102398  102399
0     Q7Z4N2  54  79  140  188  ...     195     176     134      81      55
1     Q9R283  53  73  121  166  ...      71      74      63      58      50
2     O94759  52  68   85   91  ...     162     136      97      67      51
3     Q2TV84  61  97  112  145  ...     201     179     132      80      55
4     Q8R4D5  50  65  101  146  ...      79      72      65      57      50

[5 rows x 102401 columns]
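
Reading an image in grayscale gives a two-dimensional array of pixel intensities, and flattening it row by row turns it into one feature vector. The sketch below shows the idea on a 2 × 3 toy image using plain lists; the pixel values and the name `flatten_row_major` are illustrative assumptions.

```python
def flatten_row_major(image):
    """Concatenate the rows of a 2-D pixel grid into one flat list."""
    return [pixel for row in image for pixel in row]

toy_image = [
    [54, 79, 140],
    [188, 195, 176],
]
features = flatten_row_major(toy_image)
print(features)
```

The 102400 feature columns in the output above would be consistent with, for example, 320 × 320 pixel images, although the image size is not stated here. All images must share the same dimensions for the rows to line up in one DataFrame.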


# **20. Obtain a hybrid feature set by combining Distogram and BERT representations**

In [None]:
#@title 20. This step concatenates Distogram features and BERT features to obtain a composite feature set.

# Index the distogram features by identifier
get_feature1 = distogram_features_trp_channels.copy()
get_feature1.set_index(['identifier'], inplace=True)

# Index the BERT features by identifier
get_feature2 = bert_large_cased_features_trp_channels.copy()
get_feature2.set_index(['identifier'], inplace=True)

final_features = pd.merge(get_feature1, get_feature2, left_index=True, right_index=True)
final_features.reset_index(inplace=True)
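
The merge on the identifier index behaves like an inner join: only proteins present in both feature sets are kept, with the Distogram features first and the BERT features after them. A plain-Python sketch of that behaviour follows; the identifiers, the values, and the name `merge_on_identifier` are made up for illustration.

```python
def merge_on_identifier(left, right):
    """Inner-join two {identifier: feature list} mappings, left features first."""
    return {key: left[key] + right[key] for key in left if key in right}

distogram = {"P1": [54, 79], "P2": [53, 73]}
bert = {"P1": [159.0, -386.8, 16.6], "P3": [1.0, 2.0, 3.0]}
hybrid = merge_on_identifier(distogram, bert)
print(hybrid)
```

For the feature sets printed in this notebook, the hybrid vector of each protein has 102400 + 20480 = 122880 values.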

# **21. Load the model from GitHub repository**

In [None]:
#@title 21. Load the model from GitHub

!rm -rf bert_repo
import sys
!test -d disto_trp || git clone https://github.com/Muazzam5/Disto-TRP.git disto_trp
if not 'disto_trp' in sys.path:
  sys.path += ['disto_trp']

Cloning into 'disto_trp'...
remote: Enumerating objects: 1003, done.
remote: Counting objects: 100% (1003/1003), done.
remote: Compressing objects: 100% (996/996), done.
remote: Total 1003 (delta 85), reused 830 (delta 3), pack-reused 0
Receiving objects: 100% (1003/1003), 246.52 MiB | 34.15 MiB/s, done.
Resolving deltas: 100% (85/85), done.


# **22. Model Validation**

In [None]:
#@title 22. This step loads our model and validates it on unseen, novel protein sequences.

store_prob_class1 = []
!rm -rf final_model_validation
!mkdir final_model_validation
!unrar e 'disto_trp/5. Validation_model.rar' 'final_model_validation/'

filenamemodel_1 = 'final_model_validation/validation_model.sav'
loaded_model_class_1 = pickle.load(open(filenamemodel_1, 'rb'))
text_x = final_features.iloc[:,1:]
prob_class1 = loaded_model_class_1.predict_proba(text_x)

for val in prob_class1: 
  store_prob_class1.append(val[1]);

all_results = {'Fasta' : prot_id_test_name, 
                'Probability to be a TRP Channel': store_prob_class1
               }
df_all_results = pd.DataFrame(all_results)
print(df_all_results)


UNRAR 5.50 freeware      Copyright (c) 1993-2017 Alexander Roshal


Extracting from disto_trp/5. Validation_model.rar

Extracting  final_model_validation/validation_model.sav                   18% 36% 54% 72% 91% 99%  OK 
All OK




     Fasta  Probability to be a TRP Channel
0  Q7Z4N2                          0.999996
1  Q9R283                          0.994886
2  O94759                          0.975839
3  Q2TV84                          0.979189
4  Q8R4D5                          0.743498
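
The probabilities above can be turned into class calls by applying a decision threshold. The sketch below reuses the printed values; the cutoff of 0.5 is an assumption for illustration and is not stated in this notebook.

```python
# Probabilities copied from the output above.
probabilities = {
    "Q7Z4N2": 0.999996,
    "Q9R283": 0.994886,
    "O94759": 0.975839,
    "Q2TV84": 0.979189,
    "Q8R4D5": 0.743498,
}

THRESHOLD = 0.5  # assumed cutoff, chosen only for this example

calls = {protein: prob >= THRESHOLD for protein, prob in probabilities.items()}
lowest = min(probabilities, key=probabilities.get)
print(calls)
print(lowest)
```

Under this assumed cutoff all five proteins are called TRP channels, with Q8R4D5 (TRPM8) receiving the lowest probability.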
