<a href="https://colab.research.google.com/github/Amelie-Schreiber/sampling_protein_language_models/blob/main/EvoProtGrad_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# <b><font color='#009e74'>EvoProtGrad: Directed EVOlution on a PROTein sequence with GRADient-based discrete Markov chain Monte Carlo</font></b>

EvoProtGrad is a Python package for directed evolution of proteins in sequence space with gradients.

Compose a custom protein model that maps sequence to function from pretrained models, including protein language models (PLMs), to guide and constrain the search. EvoProtGrad integrates natively with 🤗 HuggingFace transformers. The MCMC sampler identifies promising amino acids to mutate using model gradients taken with respect to the input (i.e., sensitivity analysis).
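
To make the idea of gradient-guided mutation concrete, here is a minimal sketch of how gradients with respect to a one-hot input can be turned into a proposal distribution over single substitutions. It is an illustration under stated assumptions, not EvoProtGrad's implementation: `toy_gradient` is a hypothetical stand-in that returns random numbers in place of a real model's gradient, and `proposal_distribution` is an illustrative helper name.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_gradient(length, seed=0):
    """Hypothetical stand-in for d(score)/d(one-hot input): one value per (position, amino acid)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in AMINO_ACIDS] for _ in range(length)]

def proposal_distribution(sequence, grad, temperature=1.0):
    """Softmax over first-order estimates of the score change of every single substitution."""
    deltas = {}
    for pos, current in enumerate(sequence):
        cur_idx = AMINO_ACIDS.index(current)
        for aa_idx, aa in enumerate(AMINO_ACIDS):
            if aa_idx == cur_idx:
                continue
            # Taylor estimate: f(x') - f(x) is roughly grad[pos][new] - grad[pos][current]
            deltas[(pos, aa)] = (grad[pos][aa_idx] - grad[pos][cur_idx]) / temperature
    peak = max(deltas.values())  # subtract the maximum for numerical stability
    weights = {key: math.exp(value - peak) for key, value in deltas.items()}
    total = sum(weights.values())
    return {key: weight / total for key, weight in weights.items()}

sequence = "MNSVTVSHAP"  # first ten residues of the default sequence, for brevity
grad = toy_gradient(len(sequence))
probs = proposal_distribution(sequence, grad)
(pos, aa), p = max(probs.items(), key=lambda kv: kv[1])
print(f"Most likely proposal: {sequence[pos]}{pos + 1}{aa} (p = {p:.3f})")
```

In a full sampler, a substitution drawn from such a distribution would then be accepted or rejected by a Metropolis-Hastings step; that step is omitted here.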

Use this Colab to try out the sampler: input a protein sequence and get back a population of variants.

Link to paper: https://doi.org/10.1088/2632-2153/accacd


Colab by @pemami4911.

##  <b><font color='#009e74'> Reminders and important information:</font></b>
- Running in a Colab GPU session is recommended (page menu: `Runtime` -> `Change runtime type` -> select `GPU` and confirm).
- Cells labelled <b><font color='#56b4e9'>PRELIMINARY OPERATIONS </font></b> must be run <b><font color='#d55c00'>ONE</font></b> at a time and <b><font color='#d55c00'>ONCE</font></b> at the start; skip them for new predictions.
- The pipeline processes <b><font color='#d55c00'>ONE</font></b> wildtype protein sequence at a time.
- To start a <b><font color='#d55c00'>new run</font></b>, re-run the protein sequence upload cell and then the sampling cell.

****

In [1]:
#@title <b><font color='#56b4e9'>PRELIMINARY OPERATIONS</font>: Set up environment and dependencies</b>

#@markdown Run this cell to install the required environment and dependencies

#@markdown **N.B: This cell should be run only ONCE at the START of the notebook.**
# -f keeps the cell from failing if the folder has already been removed
! rm -rf sample_data

# install dependencies present in pip
! pip install evo_prot_grad &> /dev/null




In [2]:
#@title <b><font color='#56b4e9'>PRELIMINARY OPERATIONS</font>: Import Python library</b>

#@markdown **N.B: This cell only needs to be run ONCE at the START of the notebook.**

import evo_prot_grad

## <b><font color='#009e74'>PIPELINE : SAMPLER </font></b>

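Default in the original template: [GFP_AEQVI](https://www.uniprot.org/uniprotkb/P42212/entry). Note that the sequence pre-filled in the cell below is not the GFP sequence; replace it with your own wildtype sequence as needed.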

In [3]:
#@title <b><font color='#56b4e9'> Wildtype protein sequence upload</font></b>

wildtype_sequence = 'MNSVTVSHAPYTITYHDDWEPVMSQLVEFYNEVASWLLRDETSPIPDKFFIQLKQPLRNKRVCVCGIDPYPKDGTGVPFESPNFTKKSIKEIASSISRLTGVIDYKGYNLNIIDGVIPWNYYLSCKLGETKSHAIYWDKISKLLLQHITKHVSVLYCLGKTDFSNIRAKLESPVTTIVGYHPAARDRQFEKDRSFEIINVLLELDNKVPINWAQGFIY' #@param {type:"string"}
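
Before sampling, it can help to sanity-check the pasted sequence. The sketch below is an optional helper, not part of EvoProtGrad: the name `clean_sequence` is hypothetical, and restricting input to the 20 canonical amino acid letters is an assumption made for illustration.

```python
CANONICAL_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def clean_sequence(raw):
    """Uppercase the input, strip all whitespace, and reject non-canonical residue letters."""
    sequence = "".join(raw.split()).upper()
    if not sequence:
        raise ValueError("The sequence is empty.")
    invalid = sorted(set(sequence) - CANONICAL_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"Non-canonical residue letters found: {invalid}")
    return sequence

print(clean_sequence(" mnsvtvshap \n"))  # prints MNSVTVSHAP
```

In the notebook, this could be applied as `wildtype_sequence = clean_sequence(wildtype_sequence)` before the sampler cell is run.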

In [4]:
#@title <b><font color='#56b4e9'> Protein Language Model expert selector </font></b>

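import torch

protein_language_model_expert = 'esm' # @param ["esm", "bert", "causallm"] {allow-input: true}

# Use the GPU when one is available; otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
expert = evo_prot_grad.get_expert(protein_language_model_expert, temperature = 1.0, device = device)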
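
The sampler cell further down reports "Product of Experts" scores. As a generic illustration of that idea (not EvoProtGrad's internal code), a product of experts multiplies the experts' unnormalized probabilities, which is equivalent to summing their log-scores. The function name, the per-expert `weights`, and the numbers below are assumptions made for this sketch; how the library uses `temperature` internally is not shown here.

```python
def product_of_experts_score(expert_log_scores, weights=None):
    """Weighted sum of per-expert log-scores: the log of prod_i p_i(x)**w_i, unnormalized."""
    if weights is None:
        weights = [1.0] * len(expert_log_scores)
    if len(weights) != len(expert_log_scores):
        raise ValueError("Need exactly one weight per expert.")
    return sum(w * s for w, s in zip(weights, expert_log_scores))

# Hypothetical log-scores for a single variant from two experts
plm_log_score = -1.25      # e.g. a protein language model
fitness_log_score = 0.5    # e.g. a supervised fitness predictor
print(product_of_experts_score([plm_log_score, fitness_log_score], weights=[1.0, 2.0]))  # -0.25
```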



In [10]:
#@title <b><font color='#56b4e9'> Pipeline SAMPLER </font></b>

#@markdown <b><font color='#d55c00'>Execute the cell</font></b> to run the sampler and produce a list of variants and Product of Experts scores.

variants, scores = evo_prot_grad.DirectedEvolution(
                   wt_protein = wildtype_sequence,
                   output = 'best',                # which variants to return: 'best', 'last', or 'all'
                   experts = [expert],             # list of experts to compose
                   parallel_chains = 5,            # number of parallel chains to run
                   n_steps = 100,                  # number of MCMC steps per chain
                   max_mutations = 3,              # maximum number of mutations per variant
                   preserved_regions = None,       # list of (start, end) regions to leave unmutated, or None
                   verbose = False                 # print debug info to the command line
)()

wtseq = ' '.join(wildtype_sequence.strip())

for v, s in zip(variants, scores):
    evo_prot_grad.common.utils.print_variant_in_color(v, wtseq)
    print(s)

[0mM[0m [0mN[0m [0mS[0m [0mV[0m [0mT[0m [0mV[0m [0mS[0m [0mH[0m [0mA[0m [0mP[0m [91mW[0m [0mT[0m [0mI[0m [0mT[0m [0mY[0m [0mH[0m [0mD[0m [0mD[0m [0mW[0m [0mE[0m [0mP[0m [0mV[0m [0mM[0m [0mS[0m [0mQ[0m [0mL[0m [0mV[0m [0mE[0m [0mF[0m [0mY[0m [0mN[0m [0mE[0m [0mV[0m [0mA[0m [0mS[0m [0mW[0m [0mL[0m [0mL[0m [0mR[0m [0mD[0m [0mE[0m [0mT[0m [0mS[0m [0mP[0m [0mI[0m [0mP[0m [0mD[0m [0mK[0m [0mF[0m [0mF[0m [0mI[0m [0mQ[0m [0mL[0m [0mK[0m [0mQ[0m [0mP[0m [0mL[0m [0mR[0m [0mN[0m [0mK[0m [0mR[0m [0mV[0m [0mC[0m [0mV[0m [91mI[0m [0mG[0m [0mI[0m [0mD[0m [0mP[0m [0mY[0m [0mP[0m [0mK[0m [0mD[0m [0mG[0m [0mT[0m [0mG[0m [0mV[0m [0mP[0m [0mF[0m [0mE[0m [0mS[0m [0mP[0m [0mN[0m [0mF[0m [0mT[0m [0mK[0m [0mK[0m [0mS[0m [0mI[0m [0mK[0m [0mE[0m [0mI[0m [0mA[0m [0mS[0m [0mS[0m [0mI[0m [0mS[0m [0mR[0m [0mL[0m [0mT[0

In [12]:
#@title <b><font color='#56b4e9'> Pipeline SAMPLER </font></b>

#@markdown <b><font color='#d55c00'>Execute the cell</font></b> to run the sampler and produce a list of variants and Product of Experts scores.

import contextlib
import io

variants, scores = evo_prot_grad.DirectedEvolution(
                   wt_protein = wildtype_sequence,
                   output = 'best',                # which variants to return: 'best', 'last', or 'all'
                   experts = [expert],             # list of experts to compose
                   parallel_chains = 5,            # number of parallel chains to run
                   n_steps = 100,                  # number of MCMC steps per chain
                   max_mutations = 3,              # maximum number of mutations per variant
                   preserved_regions = None,       # list of (start, end) regions to leave unmutated, or None
                   verbose = False                 # print debug info to the command line
)()

wtseq = ' '.join(wildtype_sequence.strip())

for v, s in zip(variants, scores):
    # Capture what print_variant_in_color writes to stdout;
    # redirect_stdout restores stdout even if the call raises
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        evo_prot_grad.common.utils.print_variant_in_color(v, wtseq)

    # Remove the spaces between residues, then print the variant and its score
    colored_variant = buffer.getvalue().replace(" ", "")
    print(colored_variant, end='')  # end='' to avoid an extra newline
    print(s)

[0mM[0m[0mN[0m[0mS[0m[0mV[0m[0mT[0m[0mV[0m[0mS[0m[0mH[0m[0mA[0m[0mP[0m[0mY[0m[0mT[0m[0mI[0m[0mT[0m[0mY[0m[0mH[0m[0mD[0m[0mD[0m[0mW[0m[0mE[0m[0mP[0m[0mV[0m[0mM[0m[0mS[0m[0mQ[0m[0mL[0m[0mV[0m[0mE[0m[0mF[0m[0mY[0m[0mN[0m[0mE[0m[0mV[0m[0mA[0m[0mS[0m[0mW[0m[0mL[0m[0mL[0m[0mR[0m[0mD[0m[0mE[0m[0mT[0m[0mS[0m[0mP[0m[0mI[0m[0mP[0m[0mD[0m[0mK[0m[0mF[0m[0mF[0m[0mI[0m[0mQ[0m[0mL[0m[0mK[0m[0mQ[0m[0mP[0m[0mL[0m[0mR[0m[0mN[0m[0mK[0m[0mR[0m[0mV[0m[91mV[0m[0mV[0m[0mC[0m[0mG[0m[0mI[0m[0mD[0m[0mP[0m[0mY[0m[0mP[0m[0mK[0m[0mD[0m[0mG[0m[0mT[0m[0mG[0m[0mV[0m[0mP[0m[0mF[0m[0mE[0m[0mS[0m[0mP[0m[0mN[0m[0mF[0m[0mT[0m[0mK[0m[0mK[0m[0mS[0m[0mI[0m[0mK[0m[0mE[0m[0mI[0m[0mA[0m[0mS[0m[0mS[0m[0mI[0m[0mS[0m[0mR[0m[0mL[0m[0mT[0m[0mG[0m[0mV[0m[0mI[0m[0mD[0m[0mY[0m[0mK[0m[0mG[0m[0mY[0m[0mN[0m[0mL[0m[0mN[0m

**License:**

EvoProtGrad's source code is licensed under the permissive BSD 3-Clause license.


**Bugs:**

To report a bug, open an issue on the project [GitHub](https://github.com/NREL/EvoProtGrad) or contact Patrick Emami (Patrick.Emami@nrel.gov).


**Citing this work:**

If you use our package, please cite:


```
@article{emami2023plug,
  title={Plug \& play directed evolution of proteins with gradient-based discrete MCMC},
  author={Emami, Patrick and Perreault, Aidan and Law, Jeffrey and Biagioni, David and St. John, Peter},
  journal={Machine Learning: Science and Technology},
  volume={4},
  number={2},
  pages={025014},
  year={2023},
  publisher={IOP Publishing}
}
```