# Basic concepts of IFP processing and aggregation

This notebook explains how to process a data frame of interaction fingerprints (IFPs) from molecular dynamics simulations, using IFPs computed with ProLIF as an example. The data frame is restructured so that the IFPs can then be aggregated with IFPAggVis.
The notebook is based on an example data set that is provided with the publication and is openly available on Zenodo.

In [44]:
import pandas as pd
import os

Define the ligand number to process and the input and output paths.

In [45]:
ligand = 1
file_path = "../../data/csv_files/"

outpath = "../../data/aggregated_files/"

# Check if outpath exists, otherwise create new directory
if not os.path.exists(outpath):
    os.makedirs(outpath)
    print(outpath + " was created!")

### Read IFP data
Read the data frames with IFPs computed by ProLIF. Because different selections were used in the pre-processing, there are two IFP data sets; they will be merged into one data frame containing all relevant interactions.

IFP between ligand (MC-LR) and protein (PPP1, HOH, Mn<sup>2+</sup> ions) without VdWContact as interaction

In [46]:
df = pd.read_csv(file_path+'Interactions_lig' + str(ligand) + '_allreps_wovdw.csv', 
                 sep=',',header=[0, 1,2], index_col = [0])


In [47]:
df.head(3)

ligand,LG11,LG11,LG11,LG11,LG11,LG11,LG11,LG11,LG11,LG11,LG11,LG11,LG11,LG11,LG11,LG11,LG11,LG11,LG11,LG11,LG11
protein,HOH8,HIS66,ASP95,ARG96,ARG96,ARG96,ASN124,ASN124,HIS125,HIS125,...,CYS273,CYS273,GLY274,GLU275,GLU275,GLU275,GLU275,PHE276,PHE276,HOH402
interaction,HBAcceptor,Hydrophobic,Hydrophobic,Hydrophobic,HBAcceptor,Anionic,Hydrophobic,HBAcceptor,Hydrophobic,PiStacking,...,Hydrophobic,HBAcceptor,HBAcceptor,Hydrophobic,HBDonor,HBAcceptor,Cationic,Hydrophobic,HBAcceptor,HBAcceptor
Frame,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3,Unnamed: 15_level_3,Unnamed: 16_level_3,Unnamed: 17_level_3,Unnamed: 18_level_3,Unnamed: 19_level_3,Unnamed: 20_level_3,Unnamed: 21_level_3
0,True,False,False,False,True,True,False,False,True,False,...,True,False,True,True,False,True,False,False,False,False
1,True,False,False,False,True,True,False,False,True,False,...,False,False,False,True,False,True,False,True,False,False
2,True,False,False,False,True,True,False,False,True,False,...,False,False,True,True,False,True,False,True,False,True


In [48]:
len(df)

84003

IFP between ligand (MC-LR) and Mn<sup>2+</sup> ions with VdWContact as interaction. The HOH columns will be dropped: water was only included in the selection so that hydrogen bonds are present during processing, because a selection without hydrogen bonds in ProLIF results in an error with RDKit.

In [49]:
df_mn = pd.read_csv(file_path + 'Interactions_lig' + str(ligand) + '_allreps_Mn.csv', 
                    sep=',', header=[0, 1,2], index_col = [0])

In [50]:
df_mn.head(3)

ligand,LG11,LG11,LG11,LG11
protein,HOH8,MN400,MN401,HOH402
interaction,VdWContact,VdWContact,VdWContact,VdWContact
Frame,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3
0,True,False,False,True
1,True,False,False,True
2,True,False,False,True


### Process and Restructure IFP dataframes
The data frames have to be restructured for further processing: the multi-index column header is flattened and the interactions are translated from bool to int.

In [51]:
from IFPAggVis.ifpaggvis import helpers

IFP between ligand (MC-LR) and protein (PPP1, HOH, Mn<sup>2+</sup> ions) without VdWContact as interaction

In [52]:
df = df.droplevel("ligand", axis=1)
df_new = helpers.get_res_names_in_col_index(df)
df_new.replace({False: 0, True: 1}, inplace=True)    
df_new.to_csv(outpath + "ligand_" + str(ligand) + "_res_based_in_columns.csv")
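The restructuring step can be illustrated on a toy data frame. This is only a sketch: the implementation of `helpers.get_res_names_in_col_index` is not shown in this notebook, so the column flattening below is an assumption based on the `RESIDUE_Interaction` column names that appear in the outputs. The residue names and values are made up for illustration.

```python
import pandas as pd

# Toy IFP table with the same three-level column header as the ProLIF output
cols = pd.MultiIndex.from_tuples(
    [
        ("LG1", "ARG96", "Hydrophobic"),
        ("LG1", "ARG96", "HBAcceptor"),
        ("LG1", "HIS125", "Hydrophobic"),
    ],
    names=["ligand", "protein", "interaction"],
)
demo = pd.DataFrame([[False, True, True], [False, True, False]], columns=cols)
demo.index.name = "Frame"

# Drop the ligand level, then join residue and interaction into one flat name
# (assumed naming scheme, inferred from the column names in this notebook)
flat = demo.droplevel("ligand", axis=1)
flat.columns = [f"{residue}_{interaction}" for residue, interaction in flat.columns]

# Translate bool to int (False -> 0, True -> 1)
flat = flat.astype(int)
print(flat)
```

Using `astype(int)` here is an alternative to `replace({False: 0, True: 1})` that yields the same 0/1 values for purely boolean columns.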

IFP between ligand (MC-LR) and Mn<sup>2+</sup> ions with VdWContact as interaction. 

In [53]:
df_mn = df_mn.droplevel("ligand", axis=1)
df_new_mn = helpers.get_res_names_in_col_index(df_mn)
df_new_mn.replace({False: 0, True: 1}, inplace=True)    
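# Use a separate file name (the "_Mn" suffix is chosen here for illustration) so the file written for the first IFP set is not overwritten
df_new_mn.to_csv(outpath + "ligand_" + str(ligand) + "_res_based_in_columns_Mn.csv")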

##### Merge IFP dataframes
Merge both data frames to generate one set of IFPs.

In [54]:
import re

Convert the columns of the small IFP set (IFP between ligand (MC-LR) and Mn<sup>2+</sup> ions with VdWContact) to a list so that they can be searched.

In [55]:
columns = df_new_mn.columns.values.tolist()

Select columns with Mn<sup>2+</sup> ions

In [56]:
selected_columns = [match for match in columns if "MN" in match] 

Merge IFP sets by adding vdWContact columns to larger IFP set (IFP between ligand (MC-LR) and protein (PPP1, HOH, Mn<sup>2+</sup> ions) without VdWContact)

In [57]:
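# Both IFP sets must cover the same frames; check this once before copying
if len(df_new) == len(df_new_mn):
    for col in selected_columns:
        # Copy by position (.values) so that differing indices cannot misalign rows
        df_new[col] = df_new_mn[col].values
else:
    print("Mismatch in number of rows! Double check your data.")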


Sort columns according to residue number for further processing

In [58]:
sorted_cols = sorted(df_new.columns.values.tolist(), key=lambda s: int(re.search(r'\d+', s).group()))
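To show how this sort key works, here is a small sketch on a few made-up column names. The key extracts the first run of digits in each name (the residue number) and sorts by it as an integer. The helper name `residue_number` is introduced here for illustration only.

```python
import re

names = ["MN400_VdWContact", "ARG96_Anionic", "HOH8_HBAcceptor", "ARG96_Hydrophobic"]


def residue_number(name):
    """Return the first integer found in a column name, e.g. 96 for 'ARG96_Anionic'."""
    return int(re.search(r"\d+", name).group())


# sorted() is stable, so columns of the same residue keep their original order
ordered = sorted(names, key=residue_number)
print(ordered)
```

Sorting the plain strings instead would order the columns alphabetically by residue name, which is why the numeric key is needed. Note that a column name without any digit would make `re.search` return `None` and raise an `AttributeError`.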


In [59]:
sorted_cols

['HOH8_HBAcceptor',
 'HIS66_Hydrophobic',
 'ASP95_Hydrophobic',
 'ARG96_Hydrophobic',
 'ARG96_HBAcceptor',
 'ARG96_Anionic',
 'ASN124_Hydrophobic',
 'ASN124_HBAcceptor',
 'HIS125_Hydrophobic',
 'HIS125_PiStacking',
 'HIS125_EdgeToFace',
 'GLU126_Hydrophobic',
 'CYS127_Hydrophobic',
 'CYS127_HBAcceptor',
 'SER129_HBAcceptor',
 'ILE130_Hydrophobic',
 'ASN131_Hydrophobic',
 'ASN131_HBAcceptor',
 'ARG132_Hydrophobic',
 'ILE133_Hydrophobic',
 'TYR134_Hydrophobic',
 'TYR134_HBDonor',
 'TYR134_HBAcceptor',
 'TYR134_PiStacking',
 'TYR134_EdgeToFace',
 'VAL195_Hydrophobic',
 'PRO196_Hydrophobic',
 'ASP197_Hydrophobic',
 'LEU201_Hydrophobic',
 'CYS202_Hydrophobic',
 'CYS202_HBAcceptor',
 'LEU205_Hydrophobic',
 'TRP206_Hydrophobic',
 'TRP206_HBDonor',
 'TRP206_PiStacking',
 'TRP206_FaceToFace',
 'TRP206_EdgeToFace',
 'ASP208_Hydrophobic',
 'ASP208_HBDonor',
 'ASP208_Cationic',
 'ASP210_Cationic',
 'ASN219_Hydrophobic',
 'ASN219_HBDonor',
 'ASN219_HBAcceptor',
 'ASP220_Hydrophobic',
 'ASP220_HBDon

Generate new df with ordered columns

In [60]:
df_order = df_new[sorted_cols]

Save the ordered df to file for further processing

In [61]:
df_order.to_csv(outpath + "ligand_" + str(ligand) + "_res_based_in_columns_merged.csv")

In [62]:
df_order.head(3)

Unnamed: 0_level_0,HOH8_HBAcceptor,HIS66_Hydrophobic,ASP95_Hydrophobic,ARG96_Hydrophobic,ARG96_HBAcceptor,ARG96_Anionic,ASN124_Hydrophobic,ASN124_HBAcceptor,HIS125_Hydrophobic,HIS125_PiStacking,...,GLY274_HBAcceptor,GLU275_Hydrophobic,GLU275_HBDonor,GLU275_HBAcceptor,GLU275_Cationic,PHE276_Hydrophobic,PHE276_HBAcceptor,MN400_VdWContact,MN401_VdWContact,HOH402_HBAcceptor
Frame,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,1,0,0,0,1,1,0,0,1,0,...,1,1,0,1,0,0,0,0,0,0
1,1,0,0,0,1,1,0,0,1,0,...,0,1,0,1,0,1,0,0,0,0
2,1,0,0,0,1,1,0,0,1,0,...,1,1,0,1,0,1,0,0,0,1


Restrict the data frame to the well-equilibrated part of the simulations: the first 30 ns (3000 frames in the code below) of each simulation are cut off, and the index is reset after each cut.

In [63]:
df_order = pd.read_csv(outpath + "ligand_" + str(ligand) + "_res_based_in_columns_merged.csv", index_col = 0)

df_order.drop(index = df_order.index[:3000], axis = 0, inplace = True)
df_order = df_order.reset_index(drop = True)
df_order.drop(index = df_order.index[25000:28000], axis = 0, inplace = True)
df_order = df_order.reset_index(drop = True)
df_order.drop(index = df_order.index[50000:53000], axis = 0, inplace = True)
df_order = df_order.reset_index(drop = True)
print(len(df_order))

df_order.to_csv(outpath + "ligand_" + str(ligand) + "_res_based_in_columns_merged_corrected.csv", sep = ',')


75003
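The same idea can be written more generically. This is a minimal sketch and not the notebook's method: it assumes that all replicas have the same number of frames and are concatenated one after another. The function name `trim_equilibration` and its parameters `frames_per_replica` and `n_skip` are hypothetical.

```python
import numpy as np
import pandas as pd


def trim_equilibration(df, frames_per_replica, n_skip):
    """Drop the first n_skip frames of every replica in a concatenated data frame.

    Assumes equally long replicas stored one after another (an assumption,
    not something stated in the notebook).
    """
    # Position of each row inside its own replica: 0, 1, ..., frames_per_replica - 1
    position = np.arange(len(df)) % frames_per_replica
    return df[position >= n_skip].reset_index(drop=True)


# Toy example: 3 replicas of 4 frames each, skip the first frame of every replica
demo = pd.DataFrame({"frame": range(12)})
trimmed = trim_equilibration(demo, frames_per_replica=4, n_skip=1)
print(trimmed["frame"].tolist())
```

Computing the position within each replica avoids having to recalculate the slice boundaries after every drop.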


### Calculate aggregated IFP
Calculate the occurrence of each interaction over the whole data frame and save it to file for later visualisation purposes

In [64]:
occ = df_order.mean()

Save frequent interactions (occurring at least 30 % of the time) to file

In [65]:
occ.loc[occ >= 0.3] = int(1)
occ.loc[occ < 0.3] = int(0)
occ.to_csv(outpath+"ligand_"+str(ligand)+"_aggregated_IFPocc30.csv")


## Aggregation of IFP set

### Process sliding window

Define the step size (x1) as a percentage of the length of the data set; it is used to calculate the size of the sliding window

In [66]:
step_size = 1

In [67]:
df_order = pd.read_csv(outpath + "ligand_" + str(ligand) + "_res_based_in_columns_merged_corrected.csv", sep = ',', index_col = 0)

Reset index to ensure correct step calculation later

In [68]:
df_order = df_order.reset_index(drop=True)

In [69]:
df_order.head(3)

Unnamed: 0,HOH8_HBAcceptor,HIS66_Hydrophobic,ASP95_Hydrophobic,ARG96_Hydrophobic,ARG96_HBAcceptor,ARG96_Anionic,ASN124_Hydrophobic,ASN124_HBAcceptor,HIS125_Hydrophobic,HIS125_PiStacking,...,GLY274_HBAcceptor,GLU275_Hydrophobic,GLU275_HBDonor,GLU275_HBAcceptor,GLU275_Cationic,PHE276_Hydrophobic,PHE276_HBAcceptor,MN400_VdWContact,MN401_VdWContact,HOH402_HBAcceptor
0,0,0,0,0,1,1,0,0,1,0,...,1,1,0,0,0,1,0,0,0,1
1,0,0,0,0,1,1,0,0,1,0,...,0,1,0,1,0,1,0,0,0,1
2,0,0,0,0,1,1,0,0,0,0,...,1,1,0,1,0,1,0,0,0,1


In [70]:
len(df_order)

75003

Define name of outfile to save processed file

In [71]:
outfile = outpath + "x1_sliding_filter/ligand_" + str(ligand) + "_x1_filter_" + str(step_size) + ".csv"

Calculate step size based on length of data set

In [72]:
original_length = len(df_order)
step_val = int((original_length/100)*step_size)


In [73]:
step_val

750
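As a worked example of this calculation: with 75003 frames and a step size of 1 %, the window covers 75003 / 100 * 1 = 750.03 frames, which `int()` truncates to 750. The sketch below wraps the same formula in a function; the name `window_size` is introduced here for illustration.

```python
def window_size(n_frames, step_size_percent):
    """Size of the sliding window as a percentage of the trajectory length."""
    # int() truncates towards zero, so 750.03 becomes 750
    return int(n_frames / 100 * step_size_percent)


step_val = window_size(75003, 1)
print(step_val)
```

A larger x1 value gives a wider window and therefore stronger smoothing of the IFP time series.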

#### Process centered sliding window by calculating mean and save to file
The aggregation method can be adjusted: instead of the mean, other properties such as the median can be calculated. Please see the pandas documentation for further details.
Centering the sliding window is optional but recommended, because it takes into account the IFPs occurring shortly before and shortly after the IFP currently processed.

In [74]:
df_result = df_order.rolling(window = step_val, center=True).mean()

In [75]:
df_result.head(10)

Unnamed: 0,HOH8_HBAcceptor,HIS66_Hydrophobic,ASP95_Hydrophobic,ARG96_Hydrophobic,ARG96_HBAcceptor,ARG96_Anionic,ASN124_Hydrophobic,ASN124_HBAcceptor,HIS125_Hydrophobic,HIS125_PiStacking,...,GLY274_HBAcceptor,GLU275_Hydrophobic,GLU275_HBDonor,GLU275_HBAcceptor,GLU275_Cationic,PHE276_Hydrophobic,PHE276_HBAcceptor,MN400_VdWContact,MN401_VdWContact,HOH402_HBAcceptor
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
8,,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,
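The NaN rows arise because a centered window has no complete set of neighbours at the start and end of the data frame. The following sketch shows this on a made-up binary column with a window of 3; the column name and values are for illustration only.

```python
import pandas as pd

demo = pd.DataFrame({"ARG96_HBAcceptor": [1, 1, 0, 0, 1, 0]})
window = 3

# Centered rolling mean: each value is the fraction of frames in the window
# (one frame before, the frame itself, one frame after) showing the interaction
smoothed = demo.rolling(window=window, center=True).mean()
print(smoothed)

# The median can be used instead of the mean
smoothed_median = demo.rolling(window=window, center=True).median()

# Rows without a complete window are NaN; there are window - 1 of them in total
n_nan = int(smoothed["ARG96_HBAcceptor"].isna().sum())
print(n_nan)
```

With the default `min_periods`, a window of size w leaves w - 1 rows without a value, so these rows have to be removed before filtering.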


In [76]:
path = "../../data/aggregated_files/x1_sliding_filter/"

# Check if outpath exists, otherwise create new directory
if not os.path.exists(path):
    os.makedirs(path)
    print(path + " was created!")
    

In [77]:
df_result.to_csv(outfile)

### Filter interactions

Filter interactions based on interaction filter (x2) to represent only interactions that occur more often than x2 % of the time.

Remove rows with NaN values, which occur at the start and end of the data frame because of the centered sliding window. Then reset the index.

In [78]:
df_x2 = df_result.dropna().reset_index(drop=True)

In [79]:
df_x2.head(10)

Unnamed: 0,HOH8_HBAcceptor,HIS66_Hydrophobic,ASP95_Hydrophobic,ARG96_Hydrophobic,ARG96_HBAcceptor,ARG96_Anionic,ASN124_Hydrophobic,ASN124_HBAcceptor,HIS125_Hydrophobic,HIS125_PiStacking,...,GLY274_HBAcceptor,GLU275_Hydrophobic,GLU275_HBDonor,GLU275_HBAcceptor,GLU275_Cationic,PHE276_Hydrophobic,PHE276_HBAcceptor,MN400_VdWContact,MN401_VdWContact,HOH402_HBAcceptor
0,0.250667,0.0,0.0,0.116,1.0,1.0,0.0,0.0,0.428,0.0,...,0.453333,0.792,0.0,0.406667,0.001333,0.676,0.0,0.021333,0.0,0.265333
1,0.250667,0.0,0.0,0.116,1.0,1.0,0.0,0.0,0.426667,0.0,...,0.452,0.792,0.0,0.406667,0.001333,0.674667,0.0,0.021333,0.0,0.264
2,0.252,0.0,0.0,0.116,1.0,1.0,0.0,0.0,0.425333,0.0,...,0.452,0.792,0.0,0.405333,0.001333,0.674667,0.0,0.021333,0.0,0.262667
3,0.252,0.0,0.0,0.116,1.0,1.0,0.0,0.0,0.425333,0.0,...,0.450667,0.790667,0.0,0.404,0.001333,0.673333,0.0,0.021333,0.0,0.261333
4,0.252,0.0,0.0,0.116,1.0,1.0,0.0,0.0,0.425333,0.0,...,0.449333,0.789333,0.0,0.402667,0.001333,0.673333,0.0,0.021333,0.0,0.26
5,0.253333,0.0,0.0,0.116,1.0,1.0,0.0,0.0,0.424,0.0,...,0.449333,0.789333,0.0,0.402667,0.001333,0.672,0.0,0.021333,0.0,0.258667
6,0.253333,0.0,0.0,0.116,1.0,1.0,0.0,0.0,0.422667,0.0,...,0.448,0.789333,0.0,0.401333,0.001333,0.672,0.0,0.021333,0.0,0.257333
7,0.253333,0.0,0.0,0.116,1.0,1.0,0.0,0.0,0.421333,0.0,...,0.446667,0.789333,0.0,0.401333,0.001333,0.672,0.0,0.021333,0.0,0.256
8,0.253333,0.0,0.0,0.116,1.0,1.0,0.0,0.0,0.42,0.0,...,0.445333,0.788,0.0,0.401333,0.001333,0.673333,0.0,0.021333,0.0,0.254667
9,0.253333,0.0,0.0,0.116,1.0,1.0,0.0,0.0,0.42,0.0,...,0.444,0.788,0.0,0.4,0.001333,0.672,0.0,0.021333,0.0,0.253333


Filter interactions based on their occurrence (x2 filter). <br>
An interaction has to occur in more than a fraction x2 of the frames in the sliding window to be counted as present. The value lies between 0.0 and 1.0. <br>
Suggested values are: 0.01, 0.02, 0.025, 0.05, 0.1, 0.15, 0.2 (standard), 0.25, 0.3, 0.35, 0.4

In [80]:
filter_val_x2 = 0.2

Filter df based on x2 filter value

In [81]:
df_x2_out = df_x2.apply(lambda x: [0 if y <= filter_val_x2 else 1 for y in x])

In [82]:
df_x2_out.head(10)

Unnamed: 0,HOH8_HBAcceptor,HIS66_Hydrophobic,ASP95_Hydrophobic,ARG96_Hydrophobic,ARG96_HBAcceptor,ARG96_Anionic,ASN124_Hydrophobic,ASN124_HBAcceptor,HIS125_Hydrophobic,HIS125_PiStacking,...,GLY274_HBAcceptor,GLU275_Hydrophobic,GLU275_HBDonor,GLU275_HBAcceptor,GLU275_Cationic,PHE276_Hydrophobic,PHE276_HBAcceptor,MN400_VdWContact,MN401_VdWContact,HOH402_HBAcceptor
0,1,0,0,0,1,1,0,0,1,0,...,1,1,0,1,0,1,0,0,0,1
1,1,0,0,0,1,1,0,0,1,0,...,1,1,0,1,0,1,0,0,0,1
2,1,0,0,0,1,1,0,0,1,0,...,1,1,0,1,0,1,0,0,0,1
3,1,0,0,0,1,1,0,0,1,0,...,1,1,0,1,0,1,0,0,0,1
4,1,0,0,0,1,1,0,0,1,0,...,1,1,0,1,0,1,0,0,0,1
5,1,0,0,0,1,1,0,0,1,0,...,1,1,0,1,0,1,0,0,0,1
6,1,0,0,0,1,1,0,0,1,0,...,1,1,0,1,0,1,0,0,0,1
7,1,0,0,0,1,1,0,0,1,0,...,1,1,0,1,0,1,0,0,0,1
8,1,0,0,0,1,1,0,0,1,0,...,1,1,0,1,0,1,0,0,0,1
9,1,0,0,0,1,1,0,0,1,0,...,1,1,0,1,0,1,0,0,0,1
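The x2 filter can also be expressed as a vectorised comparison. This is a sketch on made-up values, not a replacement for the cell above: it shows that a value exactly equal to the threshold counts as absent, and how columns without any remaining interaction can be identified.

```python
import pandas as pd

filter_val_x2 = 0.2

# Toy sliding-window means for two interactions (values are illustrative)
demo = pd.DataFrame(
    {
        "HOH8_HBAcceptor": [0.25, 0.20, 0.19],
        "HIS66_Hydrophobic": [0.0, 0.0, 0.0],
    }
)

# 1 if the interaction occurs in more than x2 of the window, otherwise 0
binary = (demo > filter_val_x2).astype(int)

# Keep only columns in which the interaction is present at least once
present = binary.loc[:, (binary != 0).any()]
print(present)
```

The strict comparison `>` matches the `y <= filter_val_x2` condition used above, under which values at or below the threshold are set to 0.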


Remove columns that contain only 0, since interactions that are never present are not relevant.

In [83]:
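# Flag columns in which every row is 0 (interaction never present after filtering)
no_int_cols = (df_x2_out == 0).all()
cols_name = no_int_cols[no_int_cols].index.to_list()
# Index.difference returns the remaining column names in sorted (alphabetical) order
df_x2_result = df_x2_out[df_x2_out.columns.difference(cols_name)]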

Evaluate how many columns can be dropped

In [84]:
len(df_x2_out.columns), len(df_x2_result.columns)

(86, 60)

Save binarised and filtered interactions to a new file

In [85]:
outname = outpath + "x2_sliding_filter/ligand_" + str(ligand) + "_x1_filter_" + str(step_size) + "_x2_filter_" + str(filter_val_x2) + ".csv"

In [86]:
path = "../../data/aggregated_files/x2_sliding_filter/"

# Check if outpath exists, otherwise create new directory
if not os.path.exists(path):
    os.makedirs(path)
    print(path + " was created!")
    

In [87]:
df_x2_result.to_csv(outname)

### Aggregation of IFPs

After processing the IFP data frame with a sliding window (x1 filter) and filtering by the occurrence of interactions (x2 filter), the resulting IFP data frame can be aggregated in two ways: by interaction (i.e. identical interaction patterns are aggregated into one IFP regardless of the time point at which they occur) or by time (i.e. identical interaction patterns that occur directly after each other are aggregated).


In [88]:
from IFPAggVis.ifpaggvis import aggregate
from collections import Counter

#### Aggregation by interactions
i.e. identical interaction patterns are aggregated into one IFP regardless of the time point at which they occur

In [89]:
df_int_agg = df_x2_result.groupby(df_x2_result.columns.tolist(), as_index=False).size().sort_values("size", ascending=False)

In [90]:
df_int_agg.head(3)

Unnamed: 0,ARG221_Anionic,ARG221_HBAcceptor,ARG221_HBDonor,ARG221_Hydrophobic,ARG96_Anionic,ARG96_HBAcceptor,ARG96_Hydrophobic,ASN124_Hydrophobic,ASP197_Hydrophobic,ASP208_Cationic,...,TYR134_PiStacking,TYR272_HBAcceptor,TYR272_Hydrophobic,VAL195_Hydrophobic,VAL223_HBAcceptor,VAL223_HBDonor,VAL223_Hydrophobic,VAL250_HBAcceptor,VAL250_Hydrophobic,size
244,1,1,0,0,0,0,0,0,1,0,...,0,0,1,1,0,0,1,0,1,3345
61,0,0,0,0,0,0,0,1,1,1,...,0,0,0,1,0,1,1,1,1,2203
199,0,0,0,1,1,1,1,0,0,0,...,0,1,1,0,0,0,1,0,1,1864
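How grouping by all columns collapses identical rows can be seen on a toy example. The column names `A` and `B` are placeholders for interaction columns; the `size` column counts how many frames share the same pattern.

```python
import pandas as pd

# Five frames with two interaction columns; only two distinct patterns occur
demo = pd.DataFrame({"A": [1, 1, 0, 1, 0], "B": [0, 0, 1, 0, 1]})

# Group by every column so that each unique row becomes one group,
# count the group sizes and list the most frequent pattern first
agg = (
    demo.groupby(demo.columns.tolist(), as_index=False)
    .size()
    .sort_values("size", ascending=False)
)
print(agg)
```

The index of the result refers to the group order produced by `groupby`, not to frame numbers, which is why the index values in the output above are not sequential after sorting.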


Calculate interactions that change and number of differences to previous IFP

In [91]:
diffs_indices,diff_to_prev = aggregate.calculate_differences_rows(df_int_agg.iloc[::1, :-1])
df_int_agg["diff_to_prev"] = diff_to_prev

100%|██████████| 321/321 [00:00<00:00, 2611.99it/s]


In [92]:
df_int_agg

Unnamed: 0,ARG221_Anionic,ARG221_HBAcceptor,ARG221_HBDonor,ARG221_Hydrophobic,ARG96_Anionic,ARG96_HBAcceptor,ARG96_Hydrophobic,ASN124_Hydrophobic,ASP197_Hydrophobic,ASP208_Cationic,...,TYR272_HBAcceptor,TYR272_Hydrophobic,VAL195_Hydrophobic,VAL223_HBAcceptor,VAL223_HBDonor,VAL223_Hydrophobic,VAL250_HBAcceptor,VAL250_Hydrophobic,size,diff_to_prev
244,1,1,0,0,0,0,0,0,1,0,...,0,1,1,0,0,1,0,1,3345,[]
61,0,0,0,0,0,0,0,1,1,1,...,0,0,1,0,1,1,1,1,2203,18
199,0,0,0,1,1,1,1,0,0,0,...,1,1,0,0,0,1,0,1,1864,28
63,0,0,0,0,0,0,0,1,1,1,...,0,0,1,0,0,1,1,1,1775,24
57,0,0,0,0,0,0,0,1,1,1,...,0,0,1,0,1,1,1,1,1614,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106,0,0,0,0,1,1,0,1,0,1,...,1,1,1,0,0,1,1,1,1,24
36,0,0,0,0,0,0,0,1,0,1,...,0,0,1,0,0,1,1,1,1,8
227,0,1,0,1,1,0,0,0,1,0,...,0,0,1,0,0,1,0,0,1,21
228,0,1,0,1,1,0,0,0,1,0,...,0,0,1,0,0,1,0,1,1,1


Get an overview of how many IFPs were aggregated into one row

In [93]:
structure_size = Counter(df_int_agg['size'].values)

In [94]:
structure_size

Counter({3345: 1,
         2203: 1,
         1864: 1,
         1775: 1,
         1614: 1,
         1526: 1,
         1391: 1,
         1150: 1,
         1127: 1,
         1053: 1,
         1039: 1,
         1006: 1,
         988: 1,
         969: 1,
         927: 1,
         908: 1,
         896: 1,
         842: 1,
         822: 1,
         817: 1,
         814: 1,
         786: 1,
         783: 1,
         757: 1,
         745: 1,
         725: 1,
         652: 1,
         647: 1,
         643: 1,
         638: 1,
         636: 1,
         625: 1,
         622: 1,
         621: 2,
         616: 1,
         611: 1,
         608: 1,
         596: 1,
         587: 1,
         558: 1,
         556: 1,
         550: 1,
         536: 1,
         500: 1,
         492: 1,
         471: 1,
         467: 1,
         440: 1,
         437: 1,
         433: 2,
         416: 1,
         415: 1,
         410: 1,
         408: 1,
         401: 1,
         392: 1,
         391: 1,
         388: 1,
  

Check how much we reduced the number of individual IFPs

In [95]:
print("Previous number of IFP: ", len(df_x2_result), " \nnew number of IFP: ", len(df_int_agg))

Previous number of IFP:  74254  
new number of IFP:  322


Save aggregation based on interaction to file and define outfile name

In [96]:
outname = outpath + "aggregation_interaction/ligand_" + str(ligand) + "_x1_filter_" + str(step_size) + "_x2_filter_" + str(filter_val_x2)

In [97]:
path = "../../data/aggregated_files/aggregation_interaction/"

# Check if outpath exists, otherwise create new directory
if not os.path.exists(path):
    os.makedirs(path)
    print(path + " was created!")
    

In [98]:
df_int_agg.to_csv(outname + "_interaction_based_aggregation.csv")

#### Aggregation by time
i.e. identical interaction patterns that occur directly after each other are aggregated

Calculate interactions that change and number of differences to previous IFP

In [99]:
diffs_indices,diffs_in_rows = aggregate.calculate_differences_rows(df_x2_result)
df_x2_result["diff_to_prev"] = diffs_in_rows

100%|██████████| 74253/74253 [00:10<00:00, 7113.21it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_x2_result["diff_to_prev"] = diffs_in_rows


In [100]:
df_x2_result.head(3)

Unnamed: 0,ARG221_Anionic,ARG221_HBAcceptor,ARG221_HBDonor,ARG221_Hydrophobic,ARG96_Anionic,ARG96_HBAcceptor,ARG96_Hydrophobic,ASN124_Hydrophobic,ASP197_Hydrophobic,ASP208_Cationic,...,TYR134_PiStacking,TYR272_HBAcceptor,TYR272_Hydrophobic,VAL195_Hydrophobic,VAL223_HBAcceptor,VAL223_HBDonor,VAL223_Hydrophobic,VAL250_HBAcceptor,VAL250_Hydrophobic,diff_to_prev
0,0,1,0,1,1,1,0,0,0,0,...,0,1,1,0,0,0,1,0,1,[]
1,0,1,0,1,1,1,0,0,0,0,...,0,1,1,0,0,0,1,0,1,0
2,0,1,0,1,1,1,0,0,0,0,...,0,1,1,0,0,0,1,0,1,0


Summarise all IFPs that are identical and occur directly after each other


In [101]:
df_temp_agg = aggregate.summarise_df(df_x2_result, "diff_to_prev", "occurence")

Detected different frames aggregated by time:  (520,)
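The idea behind this time-based aggregation can be sketched with plain pandas. This is not the implementation of `aggregate.summarise_df`, which is not shown in this notebook, but an assumed equivalent: consecutive identical rows are merged into one row, and the length of each run is stored in a column named `occurence` as above.

```python
import pandas as pd

# Six frames; the pattern changes after frame 1 and again after frame 4
demo = pd.DataFrame({"A": [1, 1, 0, 0, 0, 1], "B": [0, 0, 1, 1, 1, 0]})

# A row starts a new run if it differs from the previous row in any column
# (the first row is compared with NaN and therefore always starts a run)
changed = demo.ne(demo.shift()).any(axis=1)
run_id = changed.cumsum()

# Keep one representative row per run and count the frames in each run
runs = demo.groupby(run_id).first()
runs["occurence"] = demo.groupby(run_id).size()
runs = runs.reset_index(drop=True)
print(runs)
```

In contrast to the aggregation by interactions, the same pattern can appear several times here (pattern `A=1, B=0` forms two separate runs), because the order in time is preserved.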


Drop old differences between rows

In [102]:
df_temp_agg = df_temp_agg.drop(['diff_to_prev'], axis=1)
df_x2_result = df_x2_result.drop(['diff_to_prev'], axis=1)

Calculate new differences between IFP (rows)

In [103]:
diffs_indices,diffs_in_rows = aggregate.calculate_differences_rows(df_temp_agg)
df_temp_agg["diff_to_prev"] = diffs_in_rows

100%|██████████| 519/519 [00:00<00:00, 6325.04it/s]


In [104]:
df_temp_agg.head(3)

Unnamed: 0,ARG221_Anionic,ARG221_HBAcceptor,ARG221_HBDonor,ARG221_Hydrophobic,ARG96_Anionic,ARG96_HBAcceptor,ARG96_Hydrophobic,ASN124_Hydrophobic,ASP197_Hydrophobic,ASP208_Cationic,...,TYR272_HBAcceptor,TYR272_Hydrophobic,VAL195_Hydrophobic,VAL223_HBAcceptor,VAL223_HBDonor,VAL223_Hydrophobic,VAL250_HBAcceptor,VAL250_Hydrophobic,occurence,diff_to_prev
0,0,1,0,1,1,1,0,0,0,0,...,1,1,0,0,0,1,0,1,50,[]
50,0,1,0,1,1,1,0,0,0,0,...,1,1,0,0,0,1,0,1,2,2
52,0,0,0,1,1,1,0,0,0,0,...,1,1,0,0,0,1,0,1,68,2


Reset index of df after aggregation by time

In [105]:
df_temp_agg = df_temp_agg.reset_index(drop=True)

In [106]:
df_temp_agg.tail(3)

Unnamed: 0,ARG221_Anionic,ARG221_HBAcceptor,ARG221_HBDonor,ARG221_Hydrophobic,ARG96_Anionic,ARG96_HBAcceptor,ARG96_Hydrophobic,ASN124_Hydrophobic,ASP197_Hydrophobic,ASP208_Cationic,...,TYR272_HBAcceptor,TYR272_Hydrophobic,VAL195_Hydrophobic,VAL223_HBAcceptor,VAL223_HBDonor,VAL223_Hydrophobic,VAL250_HBAcceptor,VAL250_Hydrophobic,occurence,diff_to_prev
517,0,0,0,0,0,0,0,1,1,1,...,0,0,1,0,1,1,1,1,52,2
518,0,0,0,0,0,0,0,1,1,1,...,0,0,1,0,1,1,1,1,75,2
519,0,0,0,0,0,0,0,1,1,1,...,0,0,1,0,1,1,1,1,276,2


Check how many IFPs were summarised
(number of aggregated IFPs: number of rows with that count)

In [107]:
time_size = Counter(df_temp_agg["occurence"].values)

In [108]:
time_size

Counter({50: 2,
         2: 16,
         68: 1,
         117: 6,
         387: 2,
         58: 2,
         57: 2,
         124: 4,
         13: 4,
         636: 1,
         73: 3,
         104: 3,
         12: 7,
         307: 1,
         11: 9,
         101: 2,
         114: 1,
         33: 1,
         71: 3,
         3: 16,
         9: 11,
         1: 30,
         74: 1,
         459: 1,
         184: 1,
         5: 10,
         46: 1,
         6: 12,
         1864: 1,
         471: 1,
         25: 4,
         83: 4,
         65: 1,
         21: 3,
         61: 4,
         52: 4,
         16: 4,
         43: 2,
         7: 7,
         299: 1,
         87: 3,
         817: 1,
         55: 4,
         77: 1,
         389: 1,
         440: 1,
         232: 1,
         220: 1,
         37: 3,
         622: 1,
         18: 4,
         93: 2,
         85: 3,
         45: 4,
         370: 1,
         97: 4,
         467: 1,
         88: 2,
         416: 1,
         54: 2,
         385: 1,
 

Save temporal aggregation of IFP to file

In [109]:
outname = outpath + "aggregation_time/ligand_" + str(ligand) + "_x1_filter_" + str(step_size) + "_x2_filter_" + str(filter_val_x2)

In [110]:
path = "../../data/aggregated_files/aggregation_time/"

# Check if outpath exists, otherwise create new directory
if not os.path.exists(path):
    os.makedirs(path)
    print(path + " was created!")
    

In [111]:
df_temp_agg.to_csv(outname + "_time_based_aggregation.csv")

In [112]:
df_temp_agg.sort_values("occurence",ascending=False).head(3)

Unnamed: 0,ARG221_Anionic,ARG221_HBAcceptor,ARG221_HBDonor,ARG221_Hydrophobic,ARG96_Anionic,ARG96_HBAcceptor,ARG96_Hydrophobic,ASN124_Hydrophobic,ASP197_Hydrophobic,ASP208_Cationic,...,TYR272_HBAcceptor,TYR272_Hydrophobic,VAL195_Hydrophobic,VAL223_HBAcceptor,VAL223_HBDonor,VAL223_Hydrophobic,VAL250_HBAcceptor,VAL250_Hydrophobic,occurence,diff_to_prev
36,0,0,0,1,1,1,1,0,0,0,...,1,1,0,0,0,1,0,1,1864,2
507,0,0,0,0,0,0,0,1,1,1,...,0,0,1,0,1,1,1,1,1425,2
497,0,0,1,0,0,0,0,1,1,1,...,0,0,1,0,1,1,1,1,1042,2
