## Rosetta-generated energies for the subset of the dataset

**Complex-level energies** are stored in rosettaComplexEnergies.csv
- this means one array per allele-peptide pair
- the array contains 20 energy terms: attractive, repulsive, electrostatic, etc. (effectively 19, since the last term is the total score, a weighted sum of the first 19)

**Per-peptide-position** energies are stored in rosettaPPPEnergies.csv
- this means 9 arrays per allele-peptide pair, one for each peptide position
- the whole thing is 9x20 dimensional (effectively 9x19, since the last term at each position is the total score, a weighted sum of the first 19)

Check out the paper draft for reference.

I would use the per-peptide-position energies for clustering in much the same way as the sequence is used :)
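
As a rough sketch of that clustering idea, the per-position arrays could be flattened into one feature vector per allele-peptide pair. The random arrays below are only a stand-in for the parsed energies column, and the standardization step is an assumption about preprocessing, not something taken from the draft.

```python
import numpy as np

# Stand-in data (hypothetical): 5 allele-peptide pairs, each with a (9, 20) array,
# mimicking what the "energies" column holds after parsing.
rng = np.random.default_rng(0)
energies = [rng.normal(size=(9, 20)) for _ in range(5)]

# Stack into (n_pairs, 9, 20), drop the last column (the total score),
# and flatten each pair into a 9 * 19 = 171 dimensional feature vector.
stacked = np.stack(energies)
features = stacked[:, :, :19].reshape(len(energies), -1)

# Standardize each feature so terms on very different scales contribute comparably.
mean = features.mean(axis=0)
std = features.std(axis=0)
std[std == 0] = 1.0  # leave constant features (e.g. all-zero terms) unscaled
features_scaled = (features - mean) / std

print(features_scaled.shape)
```

The resulting matrix has one row per pair and can be passed to any clustering routine that accepts a samples-by-features array.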


In [1]:
import pandas as pd
import numpy as np

### Load ppp array from csv

In [2]:
## 1 - load the energies

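# convert the string stored in the csv into a (9, 20) float array


def pppene_to_array(tmp):
    # remove parentheses, the outer brackets and all spaces
    tmp = tmp.replace("(", "")
    tmp = tmp.replace(")", "")
    tmp = tmp.strip("[]")
    tmp = tmp.replace(" ", "")
    # line breaks separate the rows; turn them into commas so the string is one flat list
    tmp = tmp.replace("\n", ",")
    # parse the 180 numbers and reshape to 9 positions x 20 terms
    return np.fromstring(tmp, dtype=float, sep=", ").reshape(9, 20)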

ppp_ene = pd.read_csv("/rdf_mount/rosetta_energies/rosettaPPPEnergies.csv")
ppp_ene = ppp_ene[["allele", "peptide", "binder", "ba", "energies", "total_energy"]]
ppp_ene["energies"] = ppp_ene["energies"].apply(pppene_to_array)
ppp_ene

Unnamed: 0,allele,peptide,binder,ba,energies,total_energy
0,A0101,YLEQLHQLY,1,0.574375,"[[-9.37669305, 4.67802037, 8.54111444, 10.3367...",112.623867
1,A0101,HSERHVLLY,1,0.574375,"[[-7.89190954, 0.93707113, 10.06233605, 2.7158...",91.902185
2,A0101,MTDPEMVEV,1,0.574375,"[[-8.25236275, 10.56587939, 7.5793572, 1.14710...",146.451590
3,A0101,LTDFIREEY,1,0.574375,"[[-8.43720197, 10.21830335, 7.113905, 36.51672...",138.735082
4,A0101,LLDQRPAWY,1,0.574375,"[[-8.18944861, 12.37002534, 6.69837214, 8.0238...",142.756344
...,...,...,...,...,...,...
77576,C1601,QQTTTSFQN,0,0.000000,"[[-8.23468271, 4.14167228, 9.40648541, 38.0082...",128.349704
77577,C1601,QQVEQMEIP,0,0.000000,"[[-9.20362039, 22.89414572, 10.40234849, 35.70...",159.872992
77578,C1601,QQWQVFSAE,0,0.000000,"[[-8.46025926, 3.82365938, 10.25920191, 61.414...",97.152888
77579,C1601,QRCVVLRFL,0,0.000000,"[[-7.11160825, 1.35151526, 9.62650699, 20.2230...",116.714004


### Example of the energy array

The 19 terms at each position are named and come in a fixed order (I think the same order as listed in the supplementary table in the draft).

In [3]:
#now ppp_ene contains all the data 

#one example ppp energy array
tmp = ppp_ene["energies"].iloc[0]

In [4]:
tmp.shape

(9, 20)

In [5]:
tmp

array([[-9.37669305e+00,  4.67802037e+00,  8.54111444e+00,
         1.03367888e+01,  2.16286690e-01,  9.84807020e-01,
        -2.07856539e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00, -3.15956700e-01, -3.10243000e-03,
         0.00000000e+00,  9.19347541e+00,  1.58142217e+01,
         0.00000000e+00,  0.00000000e+00,  5.82230000e-01,
         0.00000000e+00,  1.59220611e+01],
       [-9.13721601e+00,  8.13204776e+00,  2.09544442e+00,
         4.95836834e+01,  7.58083300e-02,  3.12654320e-01,
        -1.34635036e-01,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00, -5.27629900e-02,  1.03926847e+01,
        -1.50306530e-01,  0.00000000e+00,  1.66147000e+00,
         4.43163200e-01,  6.95708430e+00],
       [-8.34647124e+00,  1.72245477e+00,  7.18347239e+00,
         2.48287356e+00,  3.06020080e-01,  1.81860420e-01,
        -1.21559940e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.0

## Load complex energies (for the full complex)

In [6]:
## 1 - load the energies

# convert the string stored in the csv into a length-20 float array
def ene_to_array(ene_str):
    # drop the surrounding brackets and parentheses
    ene_str = ene_str.strip("[]")
    ene_str = ene_str.strip("()")
    return np.fromstring(ene_str, dtype=float, count=20, sep=", ")

complex_ene = pd.read_csv("/rdf_mount/rosetta_energies/rosettaComplexEnergies.csv")
complex_ene = complex_ene[["allele", "peptide", "binder", "ba", "energies", "total_energy"]]
complex_ene["energies"] = complex_ene["energies"].apply(ene_to_array)
complex_ene

Unnamed: 0,allele,peptide,binder,ba,energies,total_energy
0,A0101,YLEQLHQLY,1,0.574375,"[-2306.77360984, 811.85340549, 1461.80507136, ...",44.784793
1,A0101,HSERHVLLY,1,0.574375,"[-2293.38661156, 760.61067588, 1470.48909157, ...",-15.285789
2,A0101,MTDPEMVEV,1,0.574375,"[-2276.11339533, 793.72608921, 1465.61096698, ...",96.468759
3,A0101,LTDFIREEY,1,0.574375,"[-2295.34585307, 757.61420988, 1473.41382942, ...",59.739557
4,A0101,LLDQRPAWY,1,0.574375,"[-2293.03299876, 860.34732958, 1461.86885467, ...",103.222148
...,...,...,...,...,...,...
77576,C1601,QQTTTSFQN,0,0.000000,"[-2289.00765669, 2426.75217972, 1479.38435418,...",1470.688209
77577,C1601,QQVEQMEIP,0,0.000000,"[-2303.64763419, 2393.02916873, 1484.9113067, ...",1481.597525
77578,C1601,QQWQVFSAE,0,0.000000,"[-2296.25090083, 2349.00146681, 1475.89378426,...",1404.053591
77579,C1601,QRCVVLRFL,0,0.000000,"[-2308.57295453, 2368.37391254, 1478.19777993,...",1426.372734


In [7]:
#now complex_ene contains all the data 

#one example complex energy array
tmp = complex_ene["energies"].iloc[0]

In [8]:
tmp.shape

(20,)

In [9]:
tmp

array([-2.30677361e+03,  8.11853405e+02,  1.46180507e+03,  3.50531953e+03,
        9.22575232e+01, -4.44835947e+01, -5.68797597e+02,  1.61927973e+01,
       -8.56658887e+01, -1.49167529e+02, -5.67123772e+01, -3.95078550e+01,
       -2.90500378e+00,  2.46602371e+02,  1.64293265e+03, -8.93118320e+01,
        0.00000000e+00,  4.58854100e+01,  4.48507127e+01,  4.47847925e+01])
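
The example array above also shows how the last entry relates to the other 19. A minimal check, assuming the full-atom score function was the default ref2015 with its standard term weights (the weights below are that assumption, not something stated in this notebook), compares a weighted sum and a plain sum of the first 19 terms against the last entry:

```python
import numpy as np

# Complex-level energies for A0101-YLEQLHQLY, copied from the output above.
ene = np.array([-2.30677361e+03,  8.11853405e+02,  1.46180507e+03,  3.50531953e+03,
                 9.22575232e+01, -4.44835947e+01, -5.68797597e+02,  1.61927973e+01,
                -8.56658887e+01, -1.49167529e+02, -5.67123772e+01, -3.95078550e+01,
                -2.90500378e+00,  2.46602371e+02,  1.64293265e+03, -8.93118320e+01,
                 0.00000000e+00,  4.58854100e+01,  4.48507127e+01,  4.47847925e+01])

# Assumed ref2015 weights for the first 19 terms, in array order:
# fa_atr, fa_rep, fa_sol, fa_intra_rep, fa_intra_sol_xover4, lk_ball_wtd, fa_elec,
# pro_close, hbond_sr_bb, hbond_lr_bb, hbond_bb_sc, hbond_sc, dslf_fa13, omega,
# fa_dun, p_aa_pp, yhh_planarity, ref, rama_prepro
weights = np.array([1.0, 0.55, 1.0, 0.005, 1.0, 1.0, 1.0, 1.25, 1.0, 1.0,
                    1.0, 1.0, 1.25, 0.4, 0.7, 0.6, 0.625, 1.0, 0.45])

weighted_sum = float(np.dot(weights, ene[:19]))
plain_sum = float(ene[:19].sum())

print("weighted sum:", weighted_sum)
print("plain sum:   ", plain_sum)
print("last entry:  ", ene[19])
```

Under that assumption the weighted sum agrees with the last entry to within the rounding of the printed values, while the plain sum is off by thousands. If clustering on individual terms, it may therefore be worth deciding whether to use the raw or the weighted values.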

## How are the terms calculated?

Download PyRosetta by following the instructions here: https://els2.comotion.uw.edu/product/pyrosetta

Download the wheel file and install it with pip.

extract_fullc_energies was used to calculate the energies for the whole complex.

extract_ppp_energies was used to calculate the per-peptide-position terms.

If you want, you can later run this on the full dataset.

In [10]:
import pyrosetta

ModuleNotFoundError: No module named 'pyrosetta'

In [11]:
# util functions for extracting the data
def init_rosetta():
    pyrosetta.init()

def extract_fullc_energies(file_name):
    scorefxn = pyrosetta.get_fa_scorefxn()
    # load the structure into a pose and score it
    pose = pyrosetta.pose_from_pdb(file_name)
    scorefxn(pose)
    # get the energy terms for the whole complex
    res_ene = pose.energies().total_energies_array()
    return res_ene

def extract_ppp_energies(file_name, pep_len):
    scorefxn = pyrosetta.get_fa_scorefxn()
    # load the structure into a pose and score it
    pose = pyrosetta.pose_from_pdb(file_name)
    scorefxn(pose)
    # get the per-residue energies and keep the last pep_len residues
    # (this relies on the peptide residues coming last in the pdb file)
    res_ene = pose.energies().residue_total_energies_array()
    peptide_ene = res_ene[-pep_len:]
    return peptide_ene


In [16]:
#give a path to a pHLA pdb file
fname = "/rdf_mount/singleconf/all_data/A0101-YLEQLHQLY.pdb"


In [13]:
init_rosetta()
ppp_energies = extract_ppp_energies(fname, 9)
ppp_energies

NameError: name 'pyrosetta' is not defined

In [14]:
pd.DataFrame(ppp_energies)

NameError: name 'ppp_energies' is not defined

## Energy array type

Each energy array of length 20 (both for the total energy and for the ppp energy)
contains these terms, in this order.
You can use this dtype to cast the energies and then pick the energy of interest,
for example only the attractive term (fa_atr), to cluster on.

In [15]:
refenetype=[('fa_atr', '<f8'), 
            ('fa_rep', '<f8'), 
            ('fa_sol', '<f8'), 
            ('fa_intra_rep', '<f8'), 
            ('fa_intra_sol_xover4', '<f8'), 
            ('lk_ball_wtd', '<f8'), 
            ('fa_elec', '<f8'), 
            ('pro_close', '<f8'), 
            ('hbond_sr_bb', '<f8'), 
            ('hbond_lr_bb', '<f8'),
            ('hbond_bb_sc', '<f8'), 
            ('hbond_sc', '<f8'), 
            ('dslf_fa13', '<f8'), 
            ('omega', '<f8'), 
            ('fa_dun', '<f8'), 
            ('p_aa_pp', '<f8'), 
            ('yhh_planarity', '<f8'), 
            ('ref', '<f8'), 
            ('rama_prepro', '<f8'), 
            ('total_score', '<f8')]
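
One way to do the cast is sketched below with NumPy's recfunctions helper. The (9, 20) array is made-up stand-in data, and the names term_names, ppp and ppp_named are illustrative rather than taken from the notebook.

```python
import numpy as np
from numpy.lib.recfunctions import unstructured_to_structured

# Same 20 terms, in the same order as refenetype above.
term_names = ['fa_atr', 'fa_rep', 'fa_sol', 'fa_intra_rep', 'fa_intra_sol_xover4',
              'lk_ball_wtd', 'fa_elec', 'pro_close', 'hbond_sr_bb', 'hbond_lr_bb',
              'hbond_bb_sc', 'hbond_sc', 'dslf_fa13', 'omega', 'fa_dun', 'p_aa_pp',
              'yhh_planarity', 'ref', 'rama_prepro', 'total_score']
refenetype = [(name, '<f8') for name in term_names]

# Stand-in for one parsed ppp energy array (hypothetical values).
ppp = np.arange(9 * 20, dtype=float).reshape(9, 20)

# Each row of 20 floats becomes one record with 20 named fields.
ppp_named = unstructured_to_structured(ppp, dtype=np.dtype(refenetype))

# Pick a single term by name: the attractive energy at each of the 9 positions.
fa_atr = ppp_named['fa_atr']
print(fa_atr.shape)
```

Selecting by field name avoids hard-coding column indices, which helps if only a few terms are used for clustering.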