# Getting started with Fold

This notebook briefly covers how to run Fold workflows.

For more information, please [read the docs](https://docs.openprotein.ai/).

## Setup

Connect to the OpenProtein backend with your credentials:

In [1]:
import openprotein
import json

with open('secrets.config', 'r') as f:
    config = json.load(f)

session = openprotein.connect(username=config['username'], password=config['password'])
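
If the credentials file is missing or incomplete, the cell above fails with a fairly generic error. The following is a minimal sketch of a loader that reports such problems explicitly. The helper name `load_credentials` is hypothetical and not part of the `openprotein` package; it only assumes the same JSON layout (`username` and `password` keys) used above.

```python
import json
from pathlib import Path


def load_credentials(path="secrets.config"):
    """Read the username and password from a JSON config file.

    Hypothetical helper: it assumes the file is a JSON object with
    'username' and 'password' keys, as in the cell above.
    """
    config_path = Path(path)
    if not config_path.is_file():
        raise FileNotFoundError(f"Credentials file not found: {config_path}")
    with config_path.open("r") as f:
        config = json.load(f)
    # Report every missing key at once rather than failing on the first.
    missing = [key for key in ("username", "password") if key not in config]
    if missing:
        raise KeyError(f"Missing keys in {config_path}: {', '.join(missing)}")
    return config["username"], config["password"]
```

The returned pair can then be passed on as `openprotein.connect(username=username, password=password)`.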


Specify a demo sequence to fold:

In [2]:
SEQUENCE = "MYRMQLLSCIALSLALVTNSAPTSSSTKKTQLQLEHLLLDLQMILNGINNYKNPKLTRMLTFKFYMPKKATELKHLQCLEEELKPLEEVLNLAQSKNFHLRPRDLISNINVIVLELKGMYRMQLLSCIALSLALVTNSAPTSSSTKKTQLQLEHLLLDLQMILNGINNYKNPKLTRMLTFKFYMPKKATELKHLQCLEEELKPLEEVLNLAQSKNFHLRPRDLISNINVIVLELKGSEP"


We can examine the Fold models available:

In [3]:
session.fold.list_models()

[alphafold2, esmfold]

In [4]:
esmfoldmodel = session.fold.get_model('esmfold')
esmfoldmodel.fold?

Signature: esmfoldmodel.fold(sequences: List[bytes], num_recycles: int = 1) -> openprotein.api.fold.FoldResultFuture
Docstring:
Fold sequences using this model.

Parameters
----------
sequences : List[bytes]
    sequences to fold
num_recycles : int
    number of times to recycle models
Returns
-------
    FoldResultFuture
File:      ~/work/openprotein-python/openprotein/api/fold.py
Type:      method

In [5]:
esmfoldmodel.metadata

ModelMetadata(model_id='esmfold', description=ModelDescription(citation_title='', doi='', summary='esmfold_v1 model with 690M parameters, running on top of esm2_t36_3B_UR50D with 3B parameters.'), max_sequence_length=1024, dimension=-1, output_types=['fold'], input_tokens=['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V', ':'], output_tokens=None, token_descriptions=[[TokenInfo(id=0, token='A', primary=True, description='Alanine')], [TokenInfo(id=1, token='R', primary=True, description='Arginine')], [TokenInfo(id=2, token='N', primary=True, description='Asparagine')], [TokenInfo(id=3, token='D', primary=True, description='Aspartic acid')], [TokenInfo(id=4, token='C', primary=True, description='Cysteine')], [TokenInfo(id=5, token='Q', primary=True, description='Glutamine')], [TokenInfo(id=6, token='E', primary=True, description='Glutamic acid')], [TokenInfo(id=7, token='G', primary=True, description='Glycine')], [TokenInfo(id=8, token='H
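
The metadata lists the accepted input tokens and a maximum sequence length, so a sequence can be checked locally before a job is submitted. The sketch below is illustrative only: `check_sequence` is a hypothetical helper, the constants are copied from the metadata output above, and whether the backend applies exactly these checks (for example, how it counts the `:` token towards the length) is an assumption.

```python
# Values copied from the esmfold metadata printed above.
ESMFOLD_INPUT_TOKENS = ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I',
                        'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V', ':']
ESMFOLD_MAX_LENGTH = 1024


def check_sequence(sequence, allowed_tokens, max_length):
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    if len(sequence) > max_length:
        problems.append(
            f"length {len(sequence)} exceeds max_sequence_length {max_length}"
        )
    # Collect every character that is not one of the model's input tokens.
    invalid = sorted(set(sequence) - set(allowed_tokens))
    if invalid:
        problems.append(f"unsupported tokens: {''.join(invalid)}")
    return problems


print(check_sequence("MYRMQLLSCIALSLALVTNS", ESMFOLD_INPUT_TOKENS, ESMFOLD_MAX_LENGTH))
print(check_sequence("MYRMQXB", ESMFOLD_INPUT_TOKENS, ESMFOLD_MAX_LENGTH))
```

Against a live session, the same check could presumably be run as `check_sequence(SEQUENCE, esmfoldmodel.metadata.input_tokens, esmfoldmodel.metadata.max_sequence_length)`, assuming the metadata fields are exposed as attributes under the names shown in the repr.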

## ESMFold

ESMFold can be called on a sequence as shown below. Note that `num_recycles` is currently the only supported model hyperparameter:

In [6]:
esm = esmfoldmodel.fold([SEQUENCE.encode()], num_recycles=1)

esm

<openprotein.api.fold.FoldResultFuture at 0x7fd20d7e00d0>

In [7]:
esm.wait_until_done(verbose=True, timeout=300)

Waiting: 100%|██████████| 100/100 [00:00<00:00, 11137.88it/s, status=SUCCESS]


True

We can then access the results: a list of tuples, each containing a query sequence and the contents of the resulting PDB file:

In [8]:

result = esm.wait() 
print(result[0][0])

b'MYRMQLLSCIALSLALVTNSAPTSSSTKKTQLQLEHLLLDLQMILNGINNYKNPKLTRMLTFKFYMPKKATELKHLQCLEEELKPLEEVLNLAQSKNFHLRPRDLISNINVIVLELKGMYRMQLLSCIALSLALVTNSAPTSSSTKKTQLQLEHLLLDLQMILNGINNYKNPKLTRMLTFKFYMPKKATELKHLQCLEEELKPLEEVLNLAQSKNFHLRPRDLISNINVIVLELKGSEP'


In [9]:
print("\n".join(result[0][1].decode().split("\n")[:5]))  # truncate to avoid printing the whole PDB

PARENT N/A
ATOM      1  N   MET A   1       0.462 -12.684 -24.850  1.00 49.54           N  
ATOM      2  CA  MET A   1       1.650 -11.836 -24.842  1.00 51.45           C  
ATOM      3  C   MET A   1       1.718 -11.007 -23.563  1.00 49.89           C  
ATOM      4  CB  MET A   1       1.663 -10.916 -26.063  1.00 46.31           C  
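
The second-to-last numeric column of each `ATOM` record is the PDB B-factor field. Structure predictors such as ESMFold and AlphaFold2 conventionally store their confidence score (pLDDT) there, although that interpretation is an assumption about this output rather than something the notebook states. A minimal sketch for pulling those values out of the PDB text, using the fixed-column PDB layout and the lines printed above as sample input (`bfactors_for_atom` is a hypothetical helper):

```python
# Sample input: the first lines of the ESMFold output shown above.
SAMPLE_PDB = """\
PARENT N/A
ATOM      1  N   MET A   1       0.462 -12.684 -24.850  1.00 49.54           N
ATOM      2  CA  MET A   1       1.650 -11.836 -24.842  1.00 51.45           C
ATOM      3  C   MET A   1       1.718 -11.007 -23.563  1.00 49.89           C
ATOM      4  CB  MET A   1       1.663 -10.916 -26.063  1.00 46.31           C
"""


def bfactors_for_atom(pdb_text, atom_name="CA"):
    """Return the B-factor column for every ATOM record with the given atom name."""
    values = []
    for line in pdb_text.splitlines():
        # PDB is a fixed-column format: the atom name occupies columns 13-16
        # and the B-factor columns 61-66 (1-indexed).
        if line.startswith("ATOM") and len(line) >= 66:
            if line[12:16].strip() == atom_name:
                values.append(float(line[60:66]))
    return values


print(bfactors_for_atom(SAMPLE_PDB))  # one value per residue when atom_name is "CA"
```

For the full prediction, the same function could be applied to `result[0][1].decode()`.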


## AlphaFold2

AlphaFold2 is slightly different: it requires evolutionary context (via an MSA) before it can make structure predictions. We therefore first need to create an MSA from the sequence we wish to fold:

In [10]:
afmodel = session.fold.get_model('alphafold2')
afmodel.fold?

Signature:
afmodel.fold(
    msa: Union[str, openprotein.api.align.MSAFuture],
    num_recycles: int = 1,
    num_models: int = 3,
    max_msa: Union[int, str] = 'auto',
    relax_max_iterations: int = 100,
)
Docstring:
Post sequences to alphafold model.

Parameters
----------
msa : Union[str, MSAFuture]
    msa
num_recycles : int
    n

In [11]:
afmodel.metadata

ModelMetadata(model_id='alphafold2', description=ModelDescription(citation_title='Highly accurate protein structure prediction with AlphaFold.', doi='10.1038/s41586-021-03819-2', summary='alphafold2 model.'), max_sequence_length=2048, dimension=-1, output_types=['fold'], input_tokens=['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V', ':'], output_tokens=None, token_descriptions=[[TokenInfo(id=0, token='A', primary=True, description='Alanine')], [TokenInfo(id=1, token='R', primary=True, description='Arginine')], [TokenInfo(id=2, token='N', primary=True, description='Asparagine')], [TokenInfo(id=3, token='D', primary=True, description='Aspartic acid')], [TokenInfo(id=4, token='C', primary=True, description='Cysteine')], [TokenInfo(id=5, token='Q', primary=True, description='Glutamine')], [TokenInfo(id=6, token='E', primary=True, description='Glutamic acid')], [TokenInfo(id=7, token='G', primary=True, description='Glycine')], [TokenInfo(id

In [12]:
msa = session.align.create_msa(SEQUENCE.encode())
print(msa)



status=<JobStatus.SUCCESS: 'SUCCESS'> job_id='3338c1be-ee09-444b-ac8f-26d8238da36a' job_type=<JobType.align_align: '/align/align'> created_date=datetime.datetime(2024, 4, 3, 9, 30, 47, 357953) start_date=None end_date=datetime.datetime(2024, 4, 3, 9, 30, 47, 358297) prerequisite_job_id=None progress_message=None progress_counter=None num_records=None sequence_length=None msa_id='3338c1be-ee09-444b-ac8f-26d8238da36a'


We can wait until the MSA is complete to examine the outputs:

In [13]:
msa.wait_until_done(verbose=True)

print(list(msa.get_msa())[0:3])

Waiting: 100%|██████████| 100/100 [00:00<00:00, 9702.52it/s, status=SUCCESS] 


[['seed', 'MYRMQLLSCIALSLALVTNSAPTSSSTKKTQLQLEHLLLDLQMILNGINNYKNPKLTRMLTFKFYMPKKATELKHLQCLEEELKPLEEVLNLAQSKNFHLRPRDLISNINVIVLELKGMYRMQLLSCIALSLALVTNSAPTSSSTKKTQLQLEHLLLDLQMILNGINNYKNPKLTRMLTFKFYMPKKATELKHLQCLEEELKPLEEVLNLAQSKNFHLRPRDLISNINVIVLELKGSEP'], ['UniRef100_G1RE34', 'MYRMQLLSCIALSLALVTNGAPTSSSTKKTQLQLEHLLLDLQMILNGINNYKNPKLTRMLTFKFYMPKKATELKHLQCLEEELKPLEEVLNLAQSKNFHLRPRDLISNINVIVQELKGSETTFMCEWITFCQSIISTLT----------------------------------------------------------------------------------------------------'], ['UniRef100_A0A2K5MA48', 'MYRMQLLSCIALSLALVANSAPTSSSTKKTQLQLEHLLLDLQMILNGINNYKNPKLTRMLTFKFYMPKKATELKHLQCLEEELKPLEEVLNLAQSKNFHLRTKDLISNINVIVLELKGSETTLMCEWITFCQSIISTLT----------------------------------------------------------------------------------------------------']]
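
Each MSA row printed above is a `[name, aligned_sequence]` pair, with the query as the first (`seed`) row and `-` marking gaps. As a rough way to inspect the alignment, the sketch below computes the fraction of seed positions that each hit matches. `identity_to_seed` is a hypothetical helper, it assumes every row is aligned to the same length as the seed, and the toy rows stand in for real MSA output.

```python
def identity_to_seed(rows):
    """Return (name, fraction of seed positions matched) for each non-seed row.

    Assumes rows[0] is the seed and that all rows share the seed's length.
    """
    if not rows:
        return []
    _, seed_seq = rows[0]
    scores = []
    for name, aligned in rows[1:]:
        # A gap character ('-') never equals a residue, so gaps count as mismatches.
        matches = sum(1 for a, b in zip(seed_seq, aligned) if a == b)
        scores.append((name, matches / len(seed_seq)))
    return scores


toy_rows = [
    ["seed", "MYRMQL"],
    ["hit_1", "MYRMQL"],
    ["hit_2", "MYKM--"],
]
print(identity_to_seed(toy_rows))  # [('hit_1', 1.0), ('hit_2', 0.5)]
```

With the MSA from this notebook, the input would be `list(msa.get_msa())`.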


We can then send the MSA to the fold endpoint:

In [16]:
fold = afmodel.fold(msa=msa, num_models=1)

fold

<openprotein.api.fold.FoldResultFuture at 0x7fd1d44debb0>

In [17]:
fold.wait(verbose=True, timeout=600)

Waiting: 100%|██████████| 100/100 [08:09<00:00,  4.89s/it, status=SUCCESS]
Retrieving: 100%|██████████| 1/1 [00:00<00:00, 14.21it/s]


[(b'MYRMQLLSCIALSLALVTNSAPTSSSTKKTQLQLEHLLLDLQMILNGINNYKNPKLTRMLTFKFYMPKKATELKHLQCLEEELKPLEEVLNLAQSKNFHLRPRDLISNINVIVLELKGMYRMQLLSCIALSLALVTNSAPTSSSTKKTQLQLEHLLLDLQMILNGINNYKNPKLTRMLTFKFYMPKKATELKHLQCLEEELKPLEEVLNLAQSKNFHLRPRDLISNINVIVLELKGSEP',
  b'MODEL     1                                                                     \nATOM      1  N   MET A   1     -30.969   8.984  15.750  1.00 29.41           N  \nATOM      2  CA  MET A   1     -30.250  10.070  15.094  1.00 29.41           C  \nATOM      3  C   MET A   1     -28.828   9.648  14.727  1.00 29.41           C  \nATOM      4  CB  MET A   1     -30.219  11.312  15.984  1.00 29.41           C  \nATOM      5  O   MET A   1     -27.984   9.500  15.609  1.00 29.41           O  \nATOM      6  CG  MET A   1     -31.281  12.344  15.648  1.00 29.41           C  \nATOM      7  SD  MET A   1     -31.172  13.844  16.688  1.00 29.41           S  \nATOM      8  CE  MET A   1     -30.312  14.961  15.547  1.00 29.41           C  \nATOM      9 

The result contains the contents of a PDB file, which we can save to a file and open in PyMOL for visualization.

In [18]:

result = fold.wait(verbose=True) 
result[0][0]


Waiting: 100%|██████████| 100/100 [00:00<00:00, 8417.06it/s, status=SUCCESS]
Retrieving: 100%|██████████| 1/1 [00:00<00:00, 7710.12it/s]


b'MYRMQLLSCIALSLALVTNSAPTSSSTKKTQLQLEHLLLDLQMILNGINNYKNPKLTRMLTFKFYMPKKATELKHLQCLEEELKPLEEVLNLAQSKNFHLRPRDLISNINVIVLELKGMYRMQLLSCIALSLALVTNSAPTSSSTKKTQLQLEHLLLDLQMILNGINNYKNPKLTRMLTFKFYMPKKATELKHLQCLEEELKPLEEVLNLAQSKNFHLRPRDLISNINVIVLELKGSEP'

In [19]:
print("\n".join(result[0][1].decode().split("\n")[:5]))  # truncate to avoid printing the whole PDB

MODEL     1                                                                     
ATOM      1  N   MET A   1     -30.969   8.984  15.750  1.00 29.41           N  
ATOM      2  CA  MET A   1     -30.250  10.070  15.094  1.00 29.41           C  
ATOM      3  C   MET A   1     -28.828   9.648  14.727  1.00 29.41           C  
ATOM      4  CB  MET A   1     -30.219  11.312  15.984  1.00 29.41           C  
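
To save the structure for visualization, the PDB bytes can be written straight to disk. This is a minimal sketch using only the standard library; `save_pdb` and the file name used below are hypothetical, not part of the `openprotein` package.

```python
from pathlib import Path


def save_pdb(pdb_bytes, path):
    """Write raw PDB bytes to `path`, creating parent directories, and return the Path."""
    out_path = Path(path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # The fold result is already bytes, so no encoding step is needed.
    out_path.write_bytes(pdb_bytes)
    return out_path
```

For example, `save_pdb(result[0][1], "alphafold2_prediction.pdb")` would write the prediction above to the working directory, after which the file can be opened in PyMOL or any other PDB viewer.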
