<a href="https://colab.research.google.com/github/HWaymentSteele/colab_exercises/blob/main/PLM_Sol_in_colab_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PLM_Sol in Colab

PLM_Sol from Zhang ... Sun (2024) *Briefings in Bioinformatics* [paper](https://pmc.ncbi.nlm.nih.gov/articles/PMC11343611/) [github](https://github.com/Violet969/PLM_Sol/tree/main)

Last updated Jan 25, 2025

This Colab notebook implements the workflow in the README of the PLM_Sol repo with one exception: it uses ProtTrans code to generate `prot_t5_xl_half_uniref50` embeddings rather than the `bio_embeddings` codebase (which I and others can't get working in Colab).

Tips:
  - Make sure you're on a GPU for the fastest run (check in the upper right corner; change at Runtime > Change runtime type)
  - Hit Runtime > Run all
  - You will be prompted to upload a .fasta file with your sequences
  - A CSV with `protein_ID`, `sequence`, and `predict_result` will download automatically.

by Hannah Wayment-Steele


In [1]:
%%time
#@title Setup Code and Functions

import os
import subprocess

def run_(command):
    """Run a shell command; print stdout on success, stderr on failure."""
    result = subprocess.run(command, shell=True, text=True, capture_output=True)
    if result.returncode != 0:
        print(f"Error: {result.stderr}")
    else:
        print(result.stdout)

if not os.path.exists("SETUP_READY"):
    print("Cloning PLM_Sol repository...")
    run_("git clone https://github.com/Violet969/PLM_Sol.git")
    run_("pip install biopython pyaml")
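    # Point inference.py at the absolute model_param path so it works from /content.
    # Raw string and '|' as the sed delimiter avoid escaping every slash.
    run_(r'sed -i "s|\./model_param/|/content/PLM_Sol/model_param/|g" /content/PLM_Sol/inference.py')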

    print("Cloning ProtTrans repository...")
    run_("git clone https://github.com/agemagician/ProtTrans.git")
    open("SETUP_READY", "w").close()
else:
    print("Setup already completed. No action needed.")

import yaml

# Write the input config for PLM_Sol
config_data = {
    "output_files_name": "PLM_sol_output",
    "log_iterations": 100,
    "n_draws": 200,
    "batch_size": 72,
    "checkpoints_list": [
        "/content/PLM_Sol/model_param/model_param.t7"
    ],
    "embeddings": "out.h5",
    "remapping": "input.fasta",
}

# Save as a YAML file
with open("config.yml", "w") as file:
    yaml.dump(config_data, file, default_flow_style=False)

Cloning PLM_Sol repository...

Collecting biopython
  Downloading biopython-1.85-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Collecting pyaml
  Downloading pyaml-25.1.0-py3-none-any.whl.metadata (12 kB)
Downloading biopython-1.85-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.3/3.3 MB 36.5 MB/s eta 0:00:00
Downloading pyaml-25.1.0-py3-none-any.whl (26 kB)
Installing collected packages: pyaml, biopython
Successfully installed biopython-1.85 pyaml-25.1.0


Cloning ProtTrans repository...

CPU times: user 59.6 ms, sys: 6.38 ms, total: 66 ms
Wall time: 11.2 s
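
The `run_` helper above prints stderr and then carries on when a command fails, so with Run all a failed `git clone` would only show up in a later cell. Below is a minimal sketch of a stricter variant that stops at the first failure. `run_checked` is a hypothetical name introduced here for illustration; it is not part of PLM_Sol or ProtTrans.

```python
import subprocess

def run_checked(command):
    """Run a shell command and raise if it exits with a non-zero status.

    Hypothetical stricter counterpart to run_ (not part of PLM_Sol).
    """
    result = subprocess.run(command, shell=True, text=True, capture_output=True)
    if result.returncode != 0:
        # Put stderr in the exception so the failing step is obvious.
        raise RuntimeError(
            f"Command failed ({result.returncode}): {command}\n{result.stderr}"
        )
    print(result.stdout)
    return result.stdout
```

Swapping `run_` for `run_checked` in the setup cell would also keep `SETUP_READY` from being written after a failed clone, since the exception is raised before that line is reached.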


In [2]:
#@title Upload sequences in fasta file

from google.colab import files

# Prompt the user to upload a file
uploaded = files.upload()

# Save the uploaded file as 'input.fasta'.
# If several files are uploaded, each overwrites the last, so only the final one is kept.
for filename in uploaded.keys():
    with open("input.fasta", "wb") as f:
        f.write(uploaded[filename])
    print(f"{filename} has been saved as input.fasta")

Saving RelaxDB_and_CPMG_09dec2024.fasta to RelaxDB_and_CPMG_09dec2024.fasta
RelaxDB_and_CPMG_09dec2024.fasta has been saved as input.fasta
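
Before spending GPU time on embeddings, it can help to sanity-check the upload. The sketch below parses `input.fasta` with plain Python and reports the sequence count and lengths, which should agree with the totals the embedder prints. `read_fasta` and `summarize_fasta` are hypothetical helpers, and rejecting duplicate headers is an assumption made here, not a documented PLM_Sol requirement.

```python
def read_fasta(path):
    """Parse a FASTA file into a dict of {header: sequence}."""
    records = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                header = line[1:]
                if header in records:
                    # Assumption: IDs should be unique for the later remapping step.
                    raise ValueError(f"Duplicate header: {header}")
                records[header] = []
            elif header is None:
                raise ValueError("Sequence data found before the first '>' header")
            else:
                records[header].append(line)
    # Join multi-line sequences into single strings.
    return {name: "".join(parts) for name, parts in records.items()}


def summarize_fasta(path):
    """Return the sequence count, mean length, and maximum length."""
    lengths = [len(seq) for seq in read_fasta(path).values()]
    return {
        "n_sequences": len(lengths),
        "mean_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_length": max(lengths, default=0),
    }
```

Calling `summarize_fasta("input.fasta")` after the upload cell gives a quick check that the file parsed as expected.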


In [3]:
#@title Run


print('Generating ProtTrans embeddings')
run_("python /content/ProtTrans/Embedding/prott5_embedder.py -i input.fasta -o out.h5")
print('Running PLM_Sol')
run_("python /content/PLM_Sol/inference.py --config config.yml")
print('Downloading output')
files.download('protTrans_prediction_result.csv')

Generating ProtTrans embeddings
Using device: cuda:0
Loading: Rostlab/prot_t5_xl_half_uniref50-enc
########################################
Example sequence: 27011
ADKQTHETELTFDQVKEQLTESGKKRGVLTYEEIAERMSSFEIESDQMDEYYEFLGEQGVELISENEETEDLE
########################################
Total number of sequences: 144
Average sequence length: 127.01388888888889
Number of sequences >1000: 0
Embedded protein MK12 with length 367 to emb. of shape: torch.Size([367, 1024])

############# STATS #############
Total number of embeddings: 144
Total time: 4.77[s]; time/prot: 0.0331[s]; avg. len= 127.01

Running PLM_Sol

Downloading output
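
Once the CSV is written, a quick tally of the `predict_result` column shows how the predictions are distributed. This is a minimal sketch using only the standard library; `summarize_predictions` is a hypothetical helper, and it treats `predict_result` as an opaque label because the encoding of that column is not described in this notebook.

```python
import csv
from collections import Counter

def summarize_predictions(csv_path):
    """Count how often each predict_result value occurs in the output CSV."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    # Values are kept as the strings read from the file; no encoding is assumed.
    counts = Counter(row["predict_result"] for row in rows)
    return len(rows), counts
```

For example, `summarize_predictions("protTrans_prediction_result.csv")` returns the number of rows and a `Counter` keyed by each distinct `predict_result` value.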

