# DeepRank-Core basic Protein-Protein Interface data processing

## Setup

### Data

The example data used in this tutorial are available on Zenodo at [this record address](https://zenodo.org/record/7997586). To download the data, visit the link and click the "Download" button. Unzip the downloaded file and save its contents in a folder named `data/` in the same directory as this notebook. The folder name and location are optional but recommended, as these are the name and the location we will use to refer to the folder throughout the tutorial.

The dataset contains only 100 data points, which are not enough to develop an impactful predictive model; it is intended for demonstration purposes only.

### Software

1. Follow the [updated instructions](https://github.com/DeepRank/deeprank-core#installation) in the README.md on the main branch to install the `deeprankcore` package.
2. To test the environment in which `deeprankcore` has been installed, first clone the [deeprank-core repository](https://github.com/DeepRank/deeprank-core). Navigate into it and, after activating the environment and installing [pytest](https://anaconda.org/anaconda/pytest), run `pytest tests`. All tests should pass. We recommend installing `deeprankcore` and all its dependencies in a [conda](https://docs.conda.io/en/latest/) environment.
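
Before running the test suite, a quick sanity check that the package is importable in the active environment can be sketched as follows. `is_installed` is a hypothetical helper name, not part of DeepRank-Core, and it relies only on the Python standard library.

```python
import importlib.util


def is_installed(package_name):
    """Return True if `package_name` can be found in the active environment."""
    return importlib.util.find_spec(package_name) is not None


# `deeprankcore` is the import name used throughout this tutorial
print(f"deeprankcore importable: {is_installed('deeprankcore')}")
```

If this prints `False`, the environment in which the package was installed is probably not the one that is currently active.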

## Introduction

<img style="margin-left: 1.5rem" align="right" src="images/data_generation_ppi.png" width="400">

This tutorial will demonstrate the use of DeepRank-Core for generating Protein-Protein Interfaces (PPIs) and saving them into [HDF5 files](https://en.wikipedia.org/wiki/Hierarchical_Data_Format), starting from [PDB files](https://en.wikipedia.org/wiki/Protein_Data_Bank_(file_format)) representing protein-protein complexes.

In this data processing phase, an interface is selected for each protein-protein complex according to a distance threshold that the user can customize. The interface is then mapped to a graph, in which nodes represent either residues or atoms, and edges represent the interactions between them. Nodes and edges can have several different features (i.e. node and/or edge features), which are also generated and added during the processing phase. Optionally, the graphs can be mapped to volumetric grids. The mapped data are finally saved into HDF5 files and can later be used for training models (for details, see the [training_ppi.ipynb](https://github.com/DeepRank/deeprank-core/blob/main/tutorials/training_ppi.ipynb) tutorial). In particular, graphs can be used for training Graph Neural Networks (GNNs), and grids can be used for training Convolutional Neural Networks (CNNs).
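
To make the idea of a distance-based interface concrete, here is a minimal sketch using made-up residue coordinates. It is not DeepRank-Core's implementation, which works on structures parsed from PDB files; `select_interface` and the toy coordinates are assumptions introduced for illustration only.

```python
import math

# Toy residue coordinates (in Å) for two chains; values are made up
chain_m = {"M:1": (0.0, 0.0, 0.0), "M:2": (30.0, 0.0, 0.0)}
chain_p = {"P:1": (4.0, 3.0, 0.0), "P:2": (50.0, 0.0, 0.0)}


def select_interface(chain_a, chain_b, cutoff):
    """Return the residues of both chains lying within `cutoff` Å of the other chain."""
    interface = set()
    for name_a, xyz_a in chain_a.items():
        for name_b, xyz_b in chain_b.items():
            if math.dist(xyz_a, xyz_b) <= cutoff:
                interface.update((name_a, name_b))
    return sorted(interface)


print(select_interface(chain_m, chain_p, cutoff=15.0))  # ['M:1', 'P:1']
```

Only the pair `M:1`/`P:1` (5 Å apart) falls within the cutoff, so the other two residues would not become nodes of the interface graph.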

### Use case

<img style="margin-right: 1.5rem" align="left" src="images/pmhc_pdb_example.png" width="200"/>

The example dataset that we provide contains PDB files representing [Major Histocompatibility Complex (MHC) protein](https://en.wikipedia.org/wiki/Major_histocompatibility_complex) + peptide (pMHC) complexes, which play a key role in T-cell immunity. We are interested in predicting the Binding Affinity (BA) of the complexes, which can be used to determine the most suitable mutated tumor peptides as vaccine candidates.

The PDB models used in this tutorial have been generated with [PANDORA](https://github.com/X-lab-3D/PANDORA), an anchor-restrained modeling pipeline for generating peptide-MHC structures. The target data, i.e. the BA values for these pMHC complexes, have been retrieved from [MHCflurry 2.0](https://data.mendeley.com/datasets/zx3kjzc3yx).

On the left an example of a pMHC structure is shown, rendered using [ProteinViewer](https://marketplace.visualstudio.com/items?itemName=ArianJamasb.protein-viewer). The MHC protein is displayed in green, while the peptide is in orange.

## Utilities

### Libraries

Let's import the libraries needed for this tutorial:

In [1]:
import os
import pandas as pd
import glob
import h5py
from deeprankcore.query import QueryCollection
from deeprankcore.query import ProteinProteinInterfaceResidueQuery, ProteinProteinInterfaceAtomicQuery
from deeprankcore.utils.grid import GridSettings, MapMethod
from deeprankcore.dataset import GraphDataset

### Raw files and paths

Let's define the paths for reading raw data and saving the processed ones:

In [17]:
def clean_folder(folder_path):
    files = glob.glob(folder_path)
    for f in files:
        os.remove(f)

data_path = os.path.join("data", "ppi")
processed_data_path = os.path.join(data_path, "processed")

# Let's clear the processed/ folders which contain the files that we
# will regenerate step by step during this tutorial
clean_folder(os.path.join(processed_data_path, "residue/*"))
clean_folder(os.path.join(processed_data_path, "atomic/*"))

- Raw data are PDB files in `data/ppi/pdb/`, which contain the atomic coordinates of the protein-protein complexes of interest, in our case pMHC complexes.
- Target data, in our case the BA values for the pMHC complexes, are in `data/ppi/BA_values.csv`.
- The final processed PPI data will be saved in the `data/ppi/processed/` folder, which in turn contains one folder for residue-level data and another for atomic-level data. More details about these levels will come a few cells below.
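
Since the rest of the tutorial depends on this layout, it can be useful to check it before going further. The following is a small sketch based on the paths listed above; `missing_inputs` is a hypothetical helper, not a DeepRank-Core function.

```python
import os


def missing_inputs(data_path):
    """Return the expected raw inputs that are not present under `data_path`."""
    expected = [
        os.path.join(data_path, "pdb"),            # folder with the PDB models
        os.path.join(data_path, "BA_values.csv"),  # table with the BA targets
    ]
    return [path for path in expected if not os.path.exists(path)]


missing = missing_inputs(os.path.join("data", "ppi"))
print("All raw inputs found." if not missing else f"Missing inputs: {missing}")
```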

`get_pdb_files_and_target_data` is a helper function that returns the list of raw PDB file names together with the binding affinity target values, which are read from a CSV file that also contains the IDs of the PDB models:

In [3]:
def get_pdb_files_and_target_data(data_path):
	csv_data = pd.read_csv(os.path.join(data_path, "BA_values.csv"))
	pdb_files = glob.glob(os.path.join(data_path, "pdb", '*.pdb'))
	pdb_files.sort()
	# os.path.basename keeps the ID extraction independent of the OS path separator
	pdb_ids_csv = [os.path.basename(pdb_file).split('.')[0] for pdb_file in pdb_files]
	csv_data_indexed = csv_data.set_index('ID')
	csv_data_indexed = csv_data_indexed.loc[pdb_ids_csv]
	bas = csv_data_indexed.measurement_value.values.tolist()
	return pdb_files, bas

pdb_files, bas = get_pdb_files_and_target_data(data_path)
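
The key step in the helper above is `set_index('ID')` followed by `.loc[...]`, which reorders the CSV rows so that they follow the order of the sorted PDB files. A toy illustration with made-up IDs and values:

```python
import pandas as pd

# Toy table; the IDs and values are made up for illustration
toy_csv = pd.DataFrame({
    "ID": ["BA-3", "BA-1", "BA-2"],
    "measurement_value": [30.0, 10.0, 20.0],
})
toy_ids = ["BA-1", "BA-2", "BA-3"]  # order of the sorted PDB files

# .loc with a list of labels returns the rows in the order of that list
aligned = toy_csv.set_index("ID").loc[toy_ids]
print(aligned.measurement_value.tolist())  # [10.0, 20.0, 30.0]
```

Note that `.loc` raises a `KeyError` if one of the requested IDs is missing from the table, so a PDB file without a matching CSV row is detected early rather than silently skipped.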

## `QueryCollection` and `Query` objects

For each protein-protein complex, i.e. for each data point, a query can be created and added to the `QueryCollection` object, to be processed later on. Different types of queries exist, based on the molecular resolution needed:

- In a `ProteinProteinInterfaceResidueQuery` each node represents one amino acid residue.
- In a `ProteinProteinInterfaceAtomicQuery` each node represents one atom within the amino acid residues.

A query takes as inputs:

- A `.pdb` file, representing the protein-protein structural complex.
- The ids of the two chains composing the complex. In our use case, "M" indicates the protein's chain and "P" the peptide's chain.
- The corresponding [Position-Specific Scoring Matrices (PSSMs)](https://en.wikipedia.org/wiki/Position_weight_matrix), in the form of `.pssm` files. The PSSM is optional and we are not going to use it in this basic tutorial.
- The distance cutoff, which represents the maximum distance in Ångström between two interacting residues of the two proteins.
- The target values associated with the query. For each query/data point in the use case demonstrated in this tutorial, we will add two targets: "BA" and "binary". The first is the continuous Binding Affinity value of the complex, while the second is its binary representation: 0 (BA > 500 nM) for a non-binding complex and 1 (BA <= 500 nM) for a binding one.
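
The binary target follows directly from the 500 nM threshold given above. A minimal sketch, where `to_binary_target` is a hypothetical helper and the BA values are made up:

```python
BINDING_THRESHOLD_NM = 500  # threshold stated in this tutorial


def to_binary_target(ba_nm):
    """Return 1 for a binding complex (BA <= 500 nM), 0 otherwise."""
    return int(float(ba_nm) <= BINDING_THRESHOLD_NM)


for ba in (12.5, 500, 20000):  # made-up BA values in nM
    print(ba, "->", to_binary_target(ba))
```

Lower BA values in nM correspond to stronger binding, which is why the comparison uses `<=`.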

## Residue-level PPI: `ProteinProteinInterfaceResidueQuery`

In [4]:
queries = QueryCollection()

interface_distance_cutoff = 15  # max distance in Å between two interacting residues/atoms of two proteins

print(f'Adding {len(pdb_files)} queries to the query collection ...')
count = 0
for i in range(len(pdb_files)):
	queries.add(
		ProteinProteinInterfaceResidueQuery(
			pdb_path = pdb_files[i], 
			chain_id1 = "M",
			chain_id2 = "P",
			distance_cutoff = interface_distance_cutoff,
			targets = {
				'binary': int(float(bas[i]) <= 500), # binary target value
				'BA': bas[i], # continuous target value
				}))
	count +=1
	if count % 20 == 0:
		print(f'{count} queries added to the collection.')

print(f'Queries ready to be processed.\n')

Adding 100 queries to the query collection ...
20 queries added to the collection.
40 queries added to the collection.
60 queries added to the collection.
80 queries added to the collection.
100 queries added to the collection.
Queries ready to be processed.



### Notes on `process()` method

Once the queries have been added to the collection, we can process them. But first, be sure to set the parameters of the `process()` method according to your needs.

- The `feature_modules` parameter specifies the feature modules used to generate the features. By default, only the basic features contained in `deeprankcore.features.components` and `deeprankcore.features.contact` are generated. Users can add custom features by creating a new module and placing it in the `deeprankcore.features` subpackage. A complete and detailed list of the pre-implemented features per module and more information about how to add custom features can be found [here](https://deeprankcore.readthedocs.io/en/latest/features.html).
- To include grids in the HDF5 files, which represent the mapping of the graphs to a volumetric box, define the `grid_settings` and `grid_map_method` parameters and pass them to the `process()` method, as shown in the example below. If they are `None`, only graphs are saved.
- The `cpu_count` parameter specifies how many processes are run simultaneously, which coincides with the number of HDF5 files generated. By default, all available CPU cores are used.
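
To give an intuition of what mapping a graph feature onto a volumetric grid means, the sketch below spreads a single node value over a small regular 3D grid using a Gaussian weight. The exact formula used by `MapMethod.GAUSSIAN` is not described in this tutorial, so this is a generic illustration under that assumption; `gaussian_map`, `sigma`, and the toy grid are hypothetical.

```python
import itertools
import math


def gaussian_map(position, value, axis_points, sigma=1.0):
    """Spread `value` located at `position` over a regular 3D grid.

    Each grid point receives value * exp(-d^2 / (2 * sigma^2)), where d is
    its distance from `position`. `sigma` is an assumed width parameter.
    """
    grid = {}
    for point in itertools.product(axis_points, repeat=3):
        d2 = sum((p - q) ** 2 for p, q in zip(point, position))
        grid[point] = value * math.exp(-d2 / (2 * sigma ** 2))
    return grid


axis = [-1.0, 0.0, 1.0]  # 3 points per axis, 1 Å apart (toy grid)
grid = gaussian_map(position=(0.0, 0.0, 0.0), value=2.0, axis_points=axis)
print(len(grid))                        # number of grid points
print(grid[(0.0, 0.0, 0.0)])            # value at the node position itself
print(round(grid[(1.0, 0.0, 0.0)], 3))  # the value decays with distance
```

In a real grid, the contributions of all nodes (or edges) would be summed at each grid point, producing one 3D array per feature.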


In [5]:
grid_settings = GridSettings( # None if you don't want grids
	# the number of points on the x, y, z edges of the cube
	points_counts = [35, 30, 30],
	# x, y, z sizes of the box in Å
	sizes = [1.0, 1.0, 1.0])
grid_map_method = MapMethod.GAUSSIAN # None if you don't want grids

queries.process(
	os.path.join(processed_data_path, "residue", "proc"),
	combine_output = False,
	grid_settings = grid_settings,
	grid_map_method = grid_map_method)

print(f'The queries processing is done. The generated hdf5 files are in {processed_data_path}.')

The queries processing is done. The generated hdf5 files are in data/ppi/processed.


### Exploring data

As a representative example, the following is the HDF5 structure generated by the previous code for `BA-100600.pdb`, i.e. for a single graph representing one PPI, in the graph + grid case:

```bash
└── residue-ppi:M-P:BA-100600
    │
    ├── edge_features
    │   ├── _index
    │   ├── _name
    │   ├── covalent
    │   ├── distance
    │   ├── electrostatic
    │   ├── same_chain
    │   └── vanderwaals
    │
    ├── node_features
    │   ├── _chain_id
    │   ├── _name
    │   ├── _position
    │   ├── hb_acceptors
    │   ├── hb_donors
    │   ├── polarity
    │   ├── res_charge
    │   ├── res_mass
    │   ├── res_pI
    │   ├── res_size
    │   └── res_type
    │
    ├── grid_points
    │   ├── center
    │   ├── x
    │   ├── y
    │   └── z
    │
    ├── mapped_features
    │   ├── _position_000
    │   ├── _position_001
    │   ├── _position_002
    │   ├── covalent
    │   ├── distance
    │   ├── electrostatic
    │   ├── polarity_000
    │   ├── polarity_001
    │   ├── polarity_002
    │   ├── polarity_003
    │   ├── ...
    │   └── vanderwaals
    │
    └── target_values
        ├── BA
        └── binary
```

`edge_features`, `node_features`, and `mapped_features` are [HDF5 Groups](https://docs.h5py.org/en/stable/high/group.html) which contain [HDF5 Datasets](https://docs.h5py.org/en/stable/high/dataset.html) (e.g., `_index`, `electrostatic`, etc.), which in turn contain feature values in the form of arrays. `edge_features` and `node_features` refer specifically to the graph representation, while `grid_points` and `mapped_features` refer to the grid mapped from the graph. Each data point generated by deeprankcore has the above structure, with the features and the targets changing according to the user's settings. Features starting with `_` are used during the graph and grid computations, but they are not supposed to be used for training models.
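
Given this naming convention, the features that are candidates for training can be obtained by filtering out the names that start with `_`. A minimal sketch using the dataset names shown in the structure above; `trainable` is a hypothetical helper name:

```python
# Dataset names as stored under `node_features` and `edge_features` in this tutorial
node_datasets = ['_chain_id', '_name', '_position', 'hb_acceptors', 'hb_donors',
                 'polarity', 'res_charge', 'res_mass', 'res_pI', 'res_size', 'res_type']
edge_datasets = ['_index', '_name', 'covalent', 'distance', 'electrostatic',
                 'same_chain', 'vanderwaals']


def trainable(names):
    """Drop the bookkeeping datasets, i.e. those whose name starts with `_`."""
    return [name for name in names if not name.startswith("_")]


print(trainable(node_datasets))
print(trainable(edge_datasets))
```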

It is always good practice to explore the data first, and then decide how to split them into training, validation, and test sets. There are several ways to explore the data.



#### Pandas dataframe

The edge and node features just generated can be explored by instantiating the `GraphDataset` object and then using the `hdf5_to_pandas` method, which converts node and edge features into a [Pandas](https://pandas.pydata.org/) dataframe. Each row represents a PPI in the form of a graph.

In [6]:
processed_data = glob.glob(os.path.join(processed_data_path, "residue", "*.hdf5"))
dataset = GraphDataset(processed_data)
df = dataset.hdf5_to_pandas()
df.head()

   ['data/ppi/processed/residue/proc-68616.hdf5', 'data/ppi/processed/residue/proc-28835.hdf5', 'data/ppi/processed/residue/proc-28842.hdf5', 'data/ppi/processed/residue/proc-28839.hdf5', 'data/ppi/processed/residue/proc-28838.hdf5', 'data/ppi/processed/residue/proc-28843.hdf5', 'data/ppi/processed/residue/proc-28834.hdf5', 'data/ppi/processed/residue/proc-68617.hdf5', 'data/ppi/processed/residue/proc-68610.hdf5', 'data/ppi/processed/residue/proc-68611.hdf5', 'data/ppi/processed/residue/proc-68608.hdf5', 'data/ppi/processed/residue/proc-68612.hdf5', 'data/ppi/processed/residue/proc-68613.hdf5', 'data/ppi/processed/residue/proc-68609.hdf5', 'data/ppi/processed/residue/proc-28837.hdf5', 'data/ppi/processed/residue/proc-68614.hdf5', 'data/ppi/processed/residue/proc-28840.hdf5', 'data/ppi/processed/residue/proc-28841.hdf5', 'data/ppi/processed/residue/proc-68615.hdf5', 'data/ppi/processed/residue/proc-28836.hdf5'] dataset                 : 100%|██████████| 20/20 [00:00<00:00, 998.06it/s, e

Unnamed: 0,id,hb_acceptors,hb_donors,polarity_0,polarity_1,polarity_2,polarity_3,res_charge,res_mass,res_pI,...,res_type_15,res_type_16,res_type_17,res_type_18,res_type_19,covalent,distance,electrostatic,same_chain,vanderwaals
0,residue-ppi:M-P:BA-109118,"[0, 0, 0, 0, 2, 4, 0, 1, 4, 2, 0, 4, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 5, 0, 1, 1, 0, ...","[1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, -1.0, 0.0, -0.0, -1....","[131.2, 57.1, 99.1, 131.2, 87.1, 129.1, 71.1, ...","[5.74, 5.97, 5.96, 5.74, 5.68, 3.22, 6.0, 5.66...",...,"[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[8.939218869677594, 3.7155567550503115, 10.479...","[4.1472389946921036, 4.25357409156152, 2.10131...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[-0.01573804461765138, -1.389584272764123, -0...."
1,residue-ppi:M-P:BA-113208,"[0, 0, 0, 0, 2, 2, 4, 0, 1, 4, 2, 0, 4, 0, 0, ...","[0, 0, 0, 0, 2, 1, 0, 0, 1, 0, 1, 5, 0, 0, 1, ...","[1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0, 0.0, -0.0...","[131.2, 57.1, 99.1, 131.2, 128.1, 87.1, 129.1,...","[5.74, 5.97, 5.96, 5.74, 5.65, 5.68, 3.22, 6.0...",...,"[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[9.090487885696785, 2.944761450440425, 10.1010...","[4.196517899454653, 3.9653592891139944, 2.0586...","[1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[-0.01590706581019848, -1.3102129067733628, -0..."
2,residue-ppi:M-P:BA-113341,"[0, 0, 0, 4, 0, 0, 2, 4, 0, 0, 1, 4, 2, 0, 4, ...","[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 5, 0, ...","[1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, -1.0, 0.0, 0.0, 0.0, -1.0, 0.0...","[131.2, 57.1, 99.1, 129.1, 131.2, 103.2, 87.1,...","[5.74, 5.97, 5.96, 3.22, 5.74, 5.07, 5.68, 3.2...",...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[8.933349987546665, 2.9466881409473924, 14.682...","[5.1333330283700995, 4.447385828620606, 7.7546...","[1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, ...","[-0.018753036452292617, -1.345115191443993, -0..."
3,residue-ppi:M-P:BA-114468,"[0, 0, 0, 0, 2, 4, 0, 1, 4, 2, 0, 4, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 5, 0, 0, 1, 0, ...","[1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, -1.0, 0.0, -0.0, -1....","[131.2, 57.1, 99.1, 131.2, 87.1, 129.1, 71.1, ...","[5.74, 5.97, 5.96, 5.74, 5.68, 3.22, 6.0, 5.66...",...,"[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[8.97565674477361, 3.8860051466769834, 10.6309...","[4.126656242489209, 4.147784045093941, 2.04490...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[-0.016786224481205807, -1.2213048631336378, -..."
4,residue-ppi:M-P:BA-115138,"[0, 0, 0, 0, 2, 4, 0, 0, 1, 4, 2, 0, 4, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 5, 0, 0, 1, ...","[1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, -1.0, 0.0, 0.0, -0.0...","[131.2, 57.1, 99.1, 131.2, 87.1, 129.1, 71.1, ...","[5.74, 5.97, 5.96, 5.74, 5.68, 3.22, 6.0, 5.98...",...,"[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[8.691351333365832, 4.116847094561565, 10.2411...","[4.545344746293386, 5.636495580385518, 1.96594...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[-0.017313129931719448, -1.1463602750420618, -..."


We can also generate histograms to look at the feature distributions:

In [7]:
dataset.save_hist(
    features = ["res_mass", "distance", "electrostatic"],
    fname = os.path.join(processed_data_path, "residue", "_".join(["res_mass", "distance", "electrostatic"])))

#### Other tools

- [HDFView](https://www.hdfgroup.org/downloads/hdfview/), a visual tool written in Java for browsing and editing HDF5 files.
  As a representative example, the following is the structure for `BA-100600.pdb` as seen in HDFView:
  
  <img style="margin-bottom: 1.5rem" align="center" src="images/hdfview.png" width="200">

  Using this tool you can inspect the values of the features visually, for each data point. 

- Python packages such as [h5py](https://docs.h5py.org/en/stable/index.html). Examples:

In [8]:
with h5py.File(processed_data[0], "r") as hdf5:
    # List of all graphs in hdf5, each graph representing a ppi
    ids = list(hdf5.keys())
    print(f'IDs of PPIs in {processed_data[0]}: {ids}')
    node_features = list(hdf5[ids[0]]["node_features"]) 
    print(f'Node features: {node_features}')
    edge_features = list(hdf5[ids[0]]["edge_features"])
    print(f'Edge features: {edge_features}')
    targets = list(hdf5[ids[0]]["target_values"])
    print(f'Targets features: {targets}')
    # Polarity feature for ids[0], numpy.ndarray
    node_feat_polarity = hdf5[ids[0]]["node_features"]["polarity"][:]
    print(f'Polarity feature shape: {node_feat_polarity.shape}')
    # Electrostatic feature for ids[0], numpy.ndarray
    edge_feat_electrostatic = hdf5[ids[0]]["edge_features"]["electrostatic"][:]
    print(f'Electrostatic feature shape: {edge_feat_electrostatic.shape}')

IDs of PPIs in data/ppi/processed/residue/proc-68616.hdf5: ['residue-ppi:M-P:BA-109118', 'residue-ppi:M-P:BA-113208', 'residue-ppi:M-P:BA-113341', 'residue-ppi:M-P:BA-114468', 'residue-ppi:M-P:BA-115138', 'residue-ppi:M-P:BA-115586', 'residue-ppi:M-P:BA-128164', 'residue-ppi:M-P:BA-128373', 'residue-ppi:M-P:BA-129219', 'residue-ppi:M-P:BA-153205', 'residue-ppi:M-P:BA-153504', 'residue-ppi:M-P:BA-153607']
Node features: ['_chain_id', '_name', '_position', 'hb_acceptors', 'hb_donors', 'polarity', 'res_charge', 'res_mass', 'res_pI', 'res_size', 'res_type']
Edge features: ['_index', '_name', 'covalent', 'distance', 'electrostatic', 'same_chain', 'vanderwaals']
Targets features: ['BA', 'binary']
Polarity feature shape: (168, 4)
Electrostatic feature shape: (5730,)
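
The `polarity` dataset has shape `(168, 4)`, i.e. one row per node and four columns, while the dataframe shown earlier exposes the columns `polarity_0` to `polarity_3`. Assuming that the four columns are a one-hot encoding that is split into one dataframe column each (this matches the 0.0/1.0 values in the output above, but it is an assumption, not something stated in this tutorial), the relation can be sketched with toy values:

```python
# Toy one-hot matrix: 3 nodes x 4 polarity classes (values are made up)
polarity = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]

# One list per class, named like the dataframe columns polarity_0 ... polarity_3
columns = {f"polarity_{i}": [row[i] for row in polarity] for i in range(4)}
# Index of the active class for each node
classes = [row.index(1.0) for row in polarity]

print(columns["polarity_2"])  # [0.0, 1.0, 0.0]
print(classes)                # [0, 2, 1]
```

Scalar features such as `electrostatic`, whose shape is `(5730,)`, hold one value per edge and therefore need no such splitting.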


## Atomic-level PPI: `ProteinProteinInterfaceAtomicQuery`

Now we will generate data at atomic resolution, very similarly to what we have just done at the residue level.

In [9]:
queries = QueryCollection()

interface_distance_cutoff = 5.5  # max distance in Å between two interacting residues/atoms of two proteins

print(f'Adding {len(pdb_files)} queries to the query collection ...')
count = 0
for i in range(len(pdb_files)):
	queries.add(
		ProteinProteinInterfaceAtomicQuery(
			pdb_path = pdb_files[i], 
			chain_id1 = "M",
			chain_id2 = "P",
			distance_cutoff = interface_distance_cutoff,
			targets = {
				'binary': int(float(bas[i]) <= 500), # binary target value
				'BA': bas[i], # continuous target value
				}))
	count +=1
	if count % 20 == 0:
		print(f'{count} queries added to the collection.')

print(f'Queries ready to be processed.\n')

Adding 100 queries to the query collection ...
20 queries added to the collection.
40 queries added to the collection.
60 queries added to the collection.
80 queries added to the collection.
100 queries added to the collection.
Queries ready to be processed.



In [10]:
grid_settings = GridSettings( # None if you don't want grids
	# the number of points on the x, y, z edges of the cube
	points_counts = [35, 30, 30],
	# x, y, z sizes of the box in Å
	sizes = [1.0, 1.0, 1.0])
grid_map_method = MapMethod.GAUSSIAN # None if you don't want grids

queries.process(
	os.path.join(processed_data_path, "atomic", "proc"),
	combine_output = False,
	grid_settings = grid_settings,
	grid_map_method = grid_map_method)

print(f'The queries processing is done. The generated hdf5 files are in {processed_data_path}.')

The queries processing is done. The generated hdf5 files are in data/ppi/processed.


Again, we can have a look at the data using the `hdf5_to_pandas` method.

In [11]:
processed_data = glob.glob(os.path.join(processed_data_path, "atomic", "*.hdf5"))
dataset = GraphDataset(processed_data)
df = dataset.hdf5_to_pandas()
df.head()

   ['data/ppi/processed/atomic/proc-68753.hdf5', 'data/ppi/processed/atomic/proc-68752.hdf5', 'data/ppi/processed/atomic/proc-68759.hdf5', 'data/ppi/processed/atomic/proc-68755.hdf5', 'data/ppi/processed/atomic/proc-29166.hdf5', 'data/ppi/processed/atomic/proc-29170.hdf5', 'data/ppi/processed/atomic/proc-29171.hdf5', 'data/ppi/processed/atomic/proc-29167.hdf5', 'data/ppi/processed/atomic/proc-68754.hdf5', 'data/ppi/processed/atomic/proc-68758.hdf5', 'data/ppi/processed/atomic/proc-68761.hdf5', 'data/ppi/processed/atomic/proc-29168.hdf5', 'data/ppi/processed/atomic/proc-29172.hdf5', 'data/ppi/processed/atomic/proc-29164.hdf5', 'data/ppi/processed/atomic/proc-68757.hdf5', 'data/ppi/processed/atomic/proc-68756.hdf5', 'data/ppi/processed/atomic/proc-29165.hdf5', 'data/ppi/processed/atomic/proc-29173.hdf5', 'data/ppi/processed/atomic/proc-29169.hdf5', 'data/ppi/processed/atomic/proc-68760.hdf5'] dataset                 : 100%|██████████| 20/20 [00:00<00:00, 582.85it/s, entry_name=proc-68760

Unnamed: 0,id,atom_charge,atom_type_0,atom_type_1,atom_type_2,atom_type_3,atom_type_4,atom_type_5,hb_acceptors,hb_donors,...,res_type_16,res_type_17,res_type_18,res_type_19,covalent,distance,electrostatic,same_chain,same_res,vanderwaals
0,atom-ppi:M-P:BA-109118,"[0.235, -0.47, 0.235, 0.0, 0.0, 0.265, -0.7, 0...","[1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, ...","[0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, ...",...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, ...","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[1.8190019241331223, 2.6828613456531825, 5.071...","[0.0, 0.0, 0.0, 0.0, 3.9606787743986938, -10.2...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, -0.06864532688412299, -0.1009805909..."
1,atom-ppi:M-P:BA-113208,"[0.235, -0.47, 0.235, 0.0, 0.0, 0.265, -0.7, 0...","[1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, ...","[0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, ...",...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, ...","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[1.8156370782730775, 2.6840560724396196, 4.991...","[0.0, 0.0, 0.0, 0.0, 3.9466956852615307, -10.0...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, -0.07378705936144203, -0.1008218203..."
2,atom-ppi:M-P:BA-113341,"[0.235, -0.47, 0.235, 0.0, 0.0, 0.265, -0.7, 0...","[1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, ...","[0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, ...",...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, ...","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[1.8194686037412133, 2.693912396496961, 4.9167...","[0.0, 0.0, 0.0, 0.0, 4.148045568939799, -10.74...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, -0.07884993144287317, -0.1117571290..."
3,atom-ppi:M-P:BA-118613,"[0.235, -0.47, 0.235, 0.0, 0.0, 0.265, -0.7, 0...","[1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 2, 0, 1, 0, 0, ...","[0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, ...",...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, ...","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[1.813579333803735, 2.6865596959680604, 4.9486...","[0.0, 0.0, 0.0, 0.0, 4.068652232932618, -10.47...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, -0.0766742185370699, -0.10859529409..."
4,atom-ppi:M-P:BA-118616,"[0.235, -0.47, 0.235, 0.0, 0.0, 0.265, -0.7, 0...","[1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 0, 0, 0, 0, 1, ...","[0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, ...",...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[1.8175202887450796, 2.7127686963690802, 4.926...","[0.0, 0.0, 0.0, 0.0, 4.374546146594057, -11.41...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, -0.07819541371361853, -0.1128761569..."


In [12]:
dataset.save_hist(
    features = "atom_charge",
    fname = os.path.join(processed_data_path, "atomic", "atom_charge"))

Note that some of the features differ from the ones generated with the residue-level queries. Some features in the `deeprankcore.features.components` module are indeed generated only for atomic graphs, namely `atom_type`, `atom_charge`, and `pdb_occupancy`, because they make sense only in the atomic graph representation.