**The spin systems (aka frames) in this experiment are defined by w1 and w2, which correspond to an amide (bb or sc) N[DE12]?-H[DE12]? pair.**

---

> Remove all protons except HA from w3 when computing the per-frame intensity ranks and relative intensities, and plot the distributions of:
* Intensity rank of N(i)-H(i)-HA(i) wrt any N(i)-H(i)-HA([0-9]+)
* Relative intensity of N(i)-H(i)-HA(i) wrt any N(i)-H(i)-HA([0-9]+)
* Intensity rank of N(i)-H(i)-HA(i-1) wrt any N(i)-H(i)-HA([0-9]+)
* Relative intensity of N(i)-H(i)-HA(i-1) wrt any N(i)-H(i)-HA([0-9]+)

> Remove all protons except the amide H[DE12]? protons from w3 when computing the per-frame intensity ranks and relative intensities, and plot the distributions of:
* Intensity rank of N(i)-H(i)-H(i-1) wrt any N(i)-H(i)-H([0-9]+)
* Relative intensity of N(i)-H(i)-H(i-1) wrt any N(i)-H(i)-H([0-9]+)

---

> **For every amino acid type individually**, remove all HN, HD, HE (all amide protons, bb and sc), all HA and all aromatic protons from w3, and count how many times the most intense peak in the frame belongs to residue i and how many it doesn't.
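
The counting task above can be sketched on toy data as follows. This is a minimal illustration, not the notebook's implementation: the helper name `count_top_peak_origin` and the peak values are made up, it assumes the w3 filtering (removal of amide, HA and aromatic protons) has already been applied, and it assumes `inter` is `False` for contacts within residue i. Ties for the most intense peak are resolved by taking the first row.

```python
import pandas as pd

# Toy peak list with the same column names as the aggregated table (values are made up).
toy = pd.DataFrame({
    "pdb_id": ["XXXX"] * 6,
    "res":    ["A5", "A5", "A5", "G9", "G9", "A12"],
    "inter":  [False, True, True, True, False, False],
    "height": [900, 400, 100, 700, 300, 500],
})

def count_top_peak_origin(peaks: pd.DataFrame) -> pd.DataFrame:
    """Count, per amino acid type, frames whose strongest peak is intra- vs inter-residue."""
    # Row label of the most intense peak in every (pdb_id, res) frame.
    top_idx = peaks.groupby(["pdb_id", "res"])["height"].idxmax()
    top = peaks.loc[top_idx]
    # One-letter amino acid code of residue i, e.g. 'G' from 'G9'.
    aa_type = top["res"].str[0].rename("aa_type")
    origin = top["inter"].map({False: "residue_i", True: "other_residue"}).rename("origin")
    return pd.crosstab(aa_type, origin)

counts = count_top_peak_origin(toy)
print(counts)
```

Each row of `counts` is one amino acid type, and the two columns give the number of frames whose strongest peak does or does not belong to residue i.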

In [1]:
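from glob import glob
import warnings

import numpy as np
import pandas as pd

# Project helpers (concat_peak_lists, is_sc_amide, ...) come from the local functions module
from functions import *

warnings.simplefilter('ignore')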

In [2]:
heteronucleus = '15N'
pdb_ids = [p.split('/')[-1].split('_')[0] for p in glob(f'data/*_{heteronucleus}.list')]
print(pdb_ids)

['2K52', '2JRM', '6SVC', '2L9R', '2LEA', '2K57', '2KD0', '2MA6', '2KRS', '2JVD', '2LX7', '2JT1', '2LTM', '1YEZ', '6SOW', '2LF2']


In [3]:
df = concat_peak_lists(pdb_ids=pdb_ids, heteronucleus=heteronucleus)
print(f'Data aggregated from {df.pdb_id.unique().shape[0]} proteins.\n')
df

Data aggregated from 16 proteins.



Unnamed: 0,pdb_id,res,noe,X,H,Hnoe,height,noe_res,inter,resnum,noe_resnum,res_diff,atom_type,atom_type_pos
0,2K52,D2,M1HA,125.763,8.938,4.091,32776,M1,True,2,1,1,HA,HA_i-1
1,2K52,D2,M1HB2,125.763,8.938,2.189,6228,M1,True,2,1,1,HB,HB_i-1
2,2K52,D2,M1HB3,125.763,8.938,2.076,5174,M1,True,2,1,1,HB,HB_i-1
3,2K52,D2,M1HG2,125.763,8.938,2.524,9565,M1,True,2,1,1,HG,HG_i-1
4,2K52,D2,M1HG3,125.763,8.938,2.426,4391,M1,True,2,1,1,HG,HG_i-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3721,2LF2,H170,E169HG3,119.800,8.260,2.200,283,E169,True,170,169,1,HG,HG_i-1
3722,2LF2,H170,H,119.800,8.260,8.260,904,H170,False,170,170,0,H,H_i
3723,2LF2,H170,H171H,119.800,8.260,7.950,-159,H170,False,170,170,0,H,H_i
3724,2LF2,H171,H170H,125.600,7.950,8.260,-104,H171,False,171,171,0,H,H_i


Simplifying the NOE contact categories: everything that's more than 1 residue away is now "far"

In [4]:
df.loc[df.res_diff.abs() > 1, "atom_type_pos"] = df.loc[df.res_diff.abs() > 1, "atom_type"] + "_far"
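
As a small illustration of the relabelling above, here is the same mask applied to a toy table. The rows and the long-range labels such as `HG_i-3` are assumptions made for the example; only the column names are taken from the table.

```python
import pandas as pd

# Toy contacts: res_diff = resnum - noe_resnum (values and long-range labels are made up).
toy = pd.DataFrame({
    "res_diff":      [0, 1, -1, 3, -24],
    "atom_type":     ["HA", "HB", "HA", "HG", "HE"],
    "atom_type_pos": ["HA_i", "HB_i-1", "HA_i+1", "HG_i-3", "HE_i+24"],
})

# Contacts more than one residue away collapse into a single "<atom_type>_far" category.
far = toy["res_diff"].abs() > 1
toy.loc[far, "atom_type_pos"] = toy.loc[far, "atom_type"] + "_far"

labels = toy["atom_type_pos"].tolist()
print(labels)
```

Computing the boolean mask once and reusing it on both sides of the assignment keeps the two selections aligned.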

Removing the sign of the crosspeaks: here the phase does not matter.

In [5]:
df.loc[:, 'height'] = df.height.abs()

Make a checkpoint and export the data

In [6]:
df.to_csv(f'data/out/all_peak_lists_{heteronucleus}.csv')

### Filtering

In [7]:
df[df.pdb_id == np.random.choice(pdb_ids)].sample(5)

Unnamed: 0,pdb_id,res,noe,X,H,Hnoe,height,noe_res,inter,resnum,noe_resnum,res_diff,atom_type,atom_type_pos
34,2MA6,S6,E7HA,118.2,8.47,4.25,2037,E7,True,6,7,-1,HA,HA_i+1
666,2MA6,I35,HG2,118.1,7.39,1.1,24002,I35,False,35,35,0,HG,HG_i
806,2MA6,L39,H38H,116.1,8.33,8.02,4862,L39,False,39,39,0,H,H_i
771,2MA6,H38,I35HG2,121.5,8.02,1.1,338,I35,True,38,35,3,HG,HG_far
457,2MA6,G27,V22HG1,111.6,8.73,0.96,665,V22,True,27,22,5,HG,HG_far


2. A) Keeping only the HA protons in the NOE contacts
   
$H^N_{bb, sc}$ -> $N_{bb, sc}$ -> $H^A_{bb, sc}$


In [8]:
# Leave only the HA protons in the w3 (NOE) dimension
df_ha = df[df.atom_type == 'HA']
df_ha.tail(7)

Unnamed: 0,pdb_id,res,noe,X,H,Hnoe,height,noe_res,inter,resnum,noe_resnum,res_diff,atom_type,atom_type_pos
3676,2LF2,K167,K166HA,121.6,8.27,4.28,10708,K166,True,167,166,1,HA,HA_i-1
3682,2LF2,K167,HA,121.6,8.27,4.28,10708,K167,False,167,167,0,HA,HA_i
3689,2LF2,L168,K167HA,123.0,8.24,4.28,18332,K167,True,168,167,1,HA,HA_i-1
3695,2LF2,L168,HA,123.0,8.24,4.31,12256,L168,False,168,168,0,HA,HA_i
3703,2LF2,E169,L168HA,120.9,8.33,4.31,8139,L168,True,169,168,1,HA,HA_i-1
3710,2LF2,E169,HA,120.9,8.33,4.2,3393,E169,False,169,169,0,HA,HA_i
3717,2LF2,H170,E169HA,119.8,8.26,4.2,382,E169,True,170,169,1,HA,HA_i-1


2. B) Removing the non-amide protons from the NOE contacts
   
$H^N_{bb, sc}$ -> $N_{bb, sc}$ -> $H^N_{bb, sc}$


In [9]:
aa_sidechain_amide_protons = {
    'R': 'HH',
    'N': 'HD',
    'Q': 'HE',
    'H': 'HD',
    'K': 'HZ',
    'W': 'HE',
}

sc_amide_mask = df.apply(lambda row: is_sc_amide(row, aa_sidechain_amide_protons), axis=1)
# Leave only amides: either in sidechain (according to the mask) or in the backbone (named H)
df_hn = df[sc_amide_mask | (df.atom_type == 'H')]
df_hn.tail(7)

Unnamed: 0,pdb_id,res,noe,X,H,Hnoe,height,noe_res,inter,resnum,noe_resnum,res_diff,atom_type,atom_type_pos
3709,2LF2,E169,H,120.9,8.33,8.33,49667,E169,False,169,169,0,H,H_i
3715,2LF2,E169,H170H,120.9,8.33,8.26,12297,E169,False,169,169,0,H,H_i
3716,2LF2,H170,E169H,119.8,8.26,8.33,420,E169,True,170,169,1,H,H_i-1
3722,2LF2,H170,H,119.8,8.26,8.26,904,H170,False,170,170,0,H,H_i
3723,2LF2,H170,H171H,119.8,8.26,7.95,159,H170,False,170,170,0,H,H_i
3724,2LF2,H171,H170H,125.6,7.95,8.26,104,H171,False,171,171,0,H,H_i
3725,2LF2,H171,H,125.6,7.95,7.95,15838,H171,False,171,171,0,H,H_i
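
The helper `is_sc_amide` is imported from the local `functions` module and is not shown in this notebook. The sketch below is a hypothetical stand-in that only illustrates what such a mask could look like. It assumes the residue type is the first character of `noe_res` and that a side-chain amide is recognised by comparing `atom_type` with the dictionary entry; the real helper may work differently.

```python
import pandas as pd

aa_sidechain_amide_protons = {
    "R": "HH", "N": "HD", "Q": "HE", "H": "HD", "K": "HZ", "W": "HE",
}

def is_sc_amide_sketch(row, sc_amide_map):
    """Hypothetical stand-in for is_sc_amide: True if the w3 proton is a side-chain amide."""
    aa = row["noe_res"][0]  # one-letter code of the residue that owns the w3 proton
    return sc_amide_map.get(aa) == row["atom_type"]

# Toy rows (made up) using the column names of the aggregated table.
toy = pd.DataFrame({
    "noe_res":   ["W40", "A12", "N22", "E7"],
    "atom_type": ["HE", "H", "HD", "HG"],
})

sc_mask = toy.apply(lambda row: is_sc_amide_sketch(row, aa_sidechain_amide_protons), axis=1)
# Keep side-chain amides (mask) and backbone amides (atom_type 'H').
amides = toy[sc_mask | (toy["atom_type"] == "H")]
print(amides)
```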


In [10]:
df_hn.atom_type.unique()

array(['H', 'HD', 'HE', 'HH'], dtype=object)

### Processing the intensities

In [11]:
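# The printed count compares whole rows; drop_duplicates() compares only the listed columns
# (pdb_id is not among them)
print(f'Removing {df_ha.duplicated().sum()} duplicated rows from HA NOEs')
df_ha = df_ha.drop_duplicates(['resnum', 'noe_resnum', 'atom_type_pos'])

Removing 1 duplicated rows from HA NOEs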


In [12]:
print(f'Removing {df_hn.duplicated().sum()} duplicated rows from amide (HN) NOEs')
df_hn = df_hn.drop_duplicates(['resnum', 'noe_resnum', 'atom_type_pos'])

Removing 3 duplicated rows from amide (HN) NOEs


Duplicated data points occur when there is an entry for both Gly HA2 and HA3 with the same chemical shift (the most common scenario, because those protons are rarely distinguishable physically).
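
A toy example of this situation is sketched below, with made-up values for a glycine whose HA2 and HA3 entries coincide. Unlike the cells above, the key here also includes `pdb_id`; that is an assumption of this sketch, made so that identical residue numbers from different proteins would not be merged.

```python
import pandas as pd

# Gly HA2 and HA3 reported at the same shift collapse to two identical 'HA_i' rows (made-up values).
toy = pd.DataFrame({
    "pdb_id":        ["XXXX", "XXXX", "XXXX"],
    "resnum":        [10, 10, 10],
    "noe_resnum":    [10, 10, 9],
    "atom_type_pos": ["HA_i", "HA_i", "HA_i-1"],
    "Hnoe":          [3.95, 3.95, 4.30],
    "height":        [5000, 5000, 7000],
})

key = ["pdb_id", "resnum", "noe_resnum", "atom_type_pos"]
n_dup = int(toy.duplicated(subset=key).sum())  # count with the same key that is used for dropping
deduped = toy.drop_duplicates(subset=key)      # keeps the first row of every duplicate group
print(n_dup, len(deduped))
```

Using the same `subset` for counting and for dropping makes the reported number match the number of rows actually removed.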

Calculating the intensities relative to the maximum of the frame

In [13]:
# Relative intensity = peak height divided by the strongest peak of the same (pdb_id, res) frame
df_hn.insert(7, 'rel_height', df_hn['height'] / df_hn.groupby(['pdb_id', 'res'])['height'].transform('max'))
df_ha.insert(7, 'rel_height', df_ha['height'] / df_ha.groupby(['pdb_id', 'res'])['height'].transform('max'))

Calculating atom ranks

In [14]:
df_ha['rank'] = df_ha[['pdb_id', 'res', 'rel_height']].groupby(['pdb_id', 'res'], as_index=False)["rel_height"]\
                        .rank(method='min', ascending=False)\
                        .astype('category')

df_hn['rank'] = df_hn[['pdb_id', 'res', 'rel_height']].groupby(['pdb_id', 'res'], as_index=False)["rel_height"]\
                        .rank(method='min', ascending=False)\
                        .astype('category')
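
To make the two per-frame quantities concrete, the sketch below computes them on a toy table with made-up heights. It follows the same approach as the cells above (`transform('max')` for the relative intensity, `rank(method='min', ascending=False)` for the rank) but skips the conversion to a categorical dtype.

```python
import pandas as pd

# Two toy frames (made-up heights); K7 contains a tie between two peaks.
toy = pd.DataFrame({
    "pdb_id": ["XXXX"] * 5,
    "res":    ["K7", "K7", "K7", "L8", "L8"],
    "height": [200.0, 800.0, 200.0, 50.0, 100.0],
})

grouped = toy.groupby(["pdb_id", "res"])["height"]
rel = toy["height"] / grouped.transform("max")        # 1.0 for the strongest peak of each frame
rank = grouped.rank(method="min", ascending=False)    # 1 = strongest; tied peaks share the lower rank
toy = toy.assign(rel_height=rel, rank=rank)
print(toy)
```

With `method='min'`, tied peaks receive the same rank, which is why two peaks of equal height in one frame can both appear with rank 1.0 or 2.0 in the tables.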

In [15]:
# df_hn.tail(7)
df_ha.tail(7)

Unnamed: 0,pdb_id,res,noe,X,H,Hnoe,height,rel_height,noe_res,inter,resnum,noe_resnum,res_diff,atom_type,atom_type_pos,rank
3676,2LF2,K167,K166HA,121.6,8.27,4.28,10708,1.0,K166,True,167,166,1,HA,HA_i-1,1.0
3682,2LF2,K167,HA,121.6,8.27,4.28,10708,1.0,K167,False,167,167,0,HA,HA_i,1.0
3689,2LF2,L168,K167HA,123.0,8.24,4.28,18332,1.0,K167,True,168,167,1,HA,HA_i-1,1.0
3695,2LF2,L168,HA,123.0,8.24,4.31,12256,0.668558,L168,False,168,168,0,HA,HA_i,2.0
3703,2LF2,E169,L168HA,120.9,8.33,4.31,8139,1.0,L168,True,169,168,1,HA,HA_i-1,1.0
3710,2LF2,E169,HA,120.9,8.33,4.2,3393,0.416882,E169,False,169,169,0,HA,HA_i,2.0
3717,2LF2,H170,E169HA,119.8,8.26,4.2,382,1.0,E169,True,170,169,1,HA,HA_i-1,1.0


Making another checkpoint

In [16]:
df_ha.to_csv(f'data/out/noe_to_HA_rel_int_{heteronucleus}.csv')
df_hn.to_csv(f'data/out/noe_to_HN_rel_int_{heteronucleus}.csv')

### Leaving only Gly as residues $i$ (i.e. in w1 and w2 dimensions)

A. $H^N_{Gly} \rightarrow N_{Gly} \rightarrow H^{A}$

B. $H^N_{Gly} \rightarrow N_{Gly} \rightarrow H^{N}_{bb, sc}$


In [17]:
df_gly_ha = df_ha[df_ha.res.str.match(r'^G')]
df_gly_ha.duplicated().sum()

0

In [18]:
df_gly_hn = df_hn[df_hn.res.str.match(r'^G')]
df_gly_hn.duplicated().sum()

0

In [19]:
n_gly_frames = (df_gly_ha.pdb_id + df_gly_ha.res).unique().shape[0]
print(f'We are dealing with a total of {n_gly_frames} glycine frames.')

We are dealing with a total of 40 glycine frames.


**How many NOEs does each Gly frame have?**

In [20]:
df_gly_ha[["pdb_id", "res"]].value_counts() #.sort_index()

pdb_id  res 
2JRM    G42     7
        G43     7
2LF2    G146    6
2JRM    G16     6
2LF2    G150    6
        G165    6
2K52    G20     5
2LF2    G122    4
2LEA    G81     4
        G76     4
2LF2    G97     4
2LTM    G100    4
2KRS    G3      4
2K52    G31     4
        G11     4
2JRM    G54     4
        G23     4
2K52    G6      3
2LX7    G42     3
2K52    G49     3
2LF2    G94     3
2KRS    G25     3
        G52     3
2JRM    G20     3
2LX7    G50     3
        G26     2
2LF2    G71     2
2MA6    G27     2
2LEA    G92     2
1YEZ    G47     2
2KD0    G75     2
2LF2    G103    1
2LEA    G40     1
2KRS    G22     1
        G16     1
2LTM    G67     1
2KD0    G73     1
        G20     1
2K57    G28     1
1YEZ    G14     1
Name: count, dtype: int64

In [21]:
#df_gly_ha[df_gly_ha.pdb_id == np.random.choice(pdb_ids)]
df_gly_hn[df_gly_hn.pdb_id == "2JRM"]
#df_gly_ha[df_gly_ha.pdb_id == "2JRM"]

Unnamed: 0,pdb_id,res,noe,X,H,Hnoe,height,rel_height,noe_res,inter,resnum,noe_resnum,res_diff,atom_type,atom_type_pos,rank
392,2JRM,G16,A12H,113.363,8.439,7.661,4196,0.051209,A12,True,16,12,4,H,H_far,5.0
395,2JRM,G16,Q13H,113.363,8.439,8.205,9650,0.117771,Q13,True,16,13,3,H,H_far,2.0
398,2JRM,G16,S14H,113.363,8.439,8.245,4335,0.052905,S14,True,16,14,2,H,H_far,4.0
415,2JRM,G16,K18H,113.363,8.439,8.003,1517,0.018514,K18,True,16,18,-2,H,H_far,7.0
417,2JRM,G16,A19H,113.363,8.439,8.067,1379,0.01683,A19,True,16,19,-3,H,H_far,8.0
422,2JRM,G16,W40HE1,113.363,8.439,10.958,1580,0.019283,W40,True,16,40,-24,HE,HE_far,6.0
423,2JRM,G16,G43H,113.363,8.439,8.181,9650,0.117771,G43,True,16,43,-27,H,H_far,2.0
425,2JRM,G16,W44H,113.363,8.439,8.46,81939,1.0,W44,True,16,44,-28,H,H_far,1.0
543,2JRM,G20,N22H,107.003,8.527,7.628,1765,0.658091,N22,True,20,22,-2,H,H_far,2.0
544,2JRM,G20,G23H,107.003,8.527,7.586,1143,0.426174,G23,True,20,23,-3,H,H_far,4.0


In [22]:
df_gly_hn.to_csv(f'data/out/gly_noe_to_HN_{heteronucleus}.csv')
df_gly_ha.to_csv(f'data/out/gly_noe_to_HA_{heteronucleus}.csv')