## Converting proteins to featurized graphs

The [graphein](https://graphein.ai/) library produces several kinds of graph-based representations of proteins. We'll use it to construct graphs from protein structures, extract interface residues, and featurise the nodes and edges of the graph.

In [1]:
import graphein
graphein.verbose(enabled=False)
from graphein.protein.config import ProteinGraphConfig
from graphein.protein.graphs import construct_graph
from graphein.protein.features.nodes import amino_acid as graphein_nodes
from graphein.protein import edges as graphein_edges
from graphein.protein.subgraphs import extract_subgraph
from graphein.protein.visualisation import plotly_protein_structure_graph
from functools import partial

Here we use two of the node features implemented in `graphein.protein.features.nodes.amino_acid`, but the library offers many more kinds of node features (see the full [API](https://graphein.ai/modules/graphein.protein.html#features)).

In [6]:
graph_config = ProteinGraphConfig(
    node_metadata_functions = [graphein_nodes.amino_acid_one_hot, graphein_nodes.meiler_embedding],
    edge_construction_functions = [graphein_edges.add_peptide_bonds, partial(graphein_edges.add_distance_threshold, threshold=8, long_interaction_threshold=2)]
)
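To make the two arguments passed to `add_distance_threshold` more concrete, here is a minimal pure-Python sketch of the rule as we understand it: residues closer than `threshold` ångströms are connected, unless they sit on the same chain and are fewer than `long_interaction_threshold` positions apart in sequence. The function name `distance_threshold_edges`, the toy coordinates, and the exact comparison operators are assumptions for illustration, not graphein's implementation.

```python
import math


def distance_threshold_edges(residues, threshold=8.0, long_interaction_threshold=2):
    """Return (i, j, distance) for residue pairs within `threshold` angstroms.

    `residues` is a list of (chain_id, residue_number, (x, y, z)) tuples.
    Same-chain pairs that are fewer than `long_interaction_threshold`
    positions apart in sequence are skipped (assumed semantics).
    """
    edges = []
    for i in range(len(residues)):
        chain_i, pos_i, xyz_i = residues[i]
        for j in range(i + 1, len(residues)):
            chain_j, pos_j, xyz_j = residues[j]
            dist = math.dist(xyz_i, xyz_j)
            if dist > threshold:
                continue  # too far apart in space
            if chain_i == chain_j and abs(pos_i - pos_j) < long_interaction_threshold:
                continue  # sequence neighbours are left to the peptide-bond edges
            edges.append((i, j, dist))
    return edges


# Toy data: four chain-A residues on a line, 3.8 angstroms apart, plus one chain-B residue.
toy_residues = [
    ("A", 1, (0.0, 0.0, 0.0)),
    ("A", 2, (3.8, 0.0, 0.0)),
    ("A", 3, (7.6, 0.0, 0.0)),
    ("A", 4, (11.4, 0.0, 0.0)),
    ("B", 1, (0.0, 5.0, 0.0)),
]
toy_edges = distance_threshold_edges(toy_residues)
print([(i, j) for i, j, _ in toy_edges])
```

In this toy case, adjacent residues on chain A are skipped even though they are well within 8 Å, while the chain-B residue is connected to any chain-A residue within range regardless of numbering.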

In [7]:
graph = construct_graph(pdb_code='1A0G', config=graph_config)

Now we have a graph object consisting of nodes and edges, each associated with the attributes we've specified.

In [13]:
from itertools import islice

# Print the first 7 nodes with their attributes
for node, node_data in islice(graph.nodes(data=True), 7):
    print("Node:", node)
    print("Node attributes:", node_data)


Node: A:GLY:1
Node attributes: {'chain_id': 'A', 'residue_name': 'GLY', 'residue_number': 1, 'atom_type': 'CA', 'element_symbol': 'C', 'coords': array([62.54 , 19.333, 31.113]), 'b_factor': 26.2, 'amino_acid_one_hot': array([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 'meiler': dim_1    0.00
dim_2    0.00
dim_3    0.00
dim_4    0.00
dim_5    6.07
dim_6    0.13
dim_7    0.15
Name: GLY, dtype: float64}
Node: A:TYR:2
Node attributes: {'chain_id': 'A', 'residue_name': 'TYR', 'residue_number': 2, 'atom_type': 'CA', 'element_symbol': 'C', 'coords': array([60.403, 22.375, 31.908]), 'b_factor': 25.8, 'amino_acid_one_hot': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]), 'meiler': dim_1    2.94
dim_2    0.30
dim_3    6.47
dim_4    0.96
dim_5    5.66
dim_6    0.25
dim_7    0.41
Name: TYR, dtype: float64}
Node: A:THR:3
Node attributes: {'chain_id': 'A', 'residue_name': 'THR', 'residue_number': 3, 'atom_type': 'CA', 'element_symbol': 'C', 'coords': array([56.6
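The `amino_acid_one_hot` vectors in the output above have 20 entries, with `GLY` set at index 5 and `TYR` at index 19. That is consistent with the 20 standard amino acids ordered alphabetically by one-letter code. The sketch below reproduces that encoding in plain Python; the ordering is inferred from the printed output rather than taken from graphein's source, and the names `AMINO_ACIDS`, `THREE_TO_ONE`, and `one_hot` are hypothetical.

```python
# Assumed ordering: alphabetical by one-letter code (inferred from the output above).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def one_hot(residue_name):
    """Return a 20-element one-hot list for a three-letter residue name."""
    index = AMINO_ACIDS.index(THREE_TO_ONE[residue_name])
    return [1 if i == index else 0 for i in range(len(AMINO_ACIDS))]


print(one_hot("GLY"))
print(one_hot("TYR"))
```

Non-standard residues are not handled here; `one_hot` raises a `KeyError` for any name outside the 20 listed.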

In [14]:
from itertools import islice

# Print the first 7 edges with their attributes
for start_node, end_node, edge_data in islice(graph.edges(data=True), 7):
    print(f"Edge between {start_node} and {end_node}")
    print("Edge attributes:", edge_data)


Edge between A:GLY:1 and A:TYR:2
Edge attributes: {'kind': {'peptide_bond'}, 'distance': 3.801652009324369}
Edge between A:TYR:2 and A:THR:3
Edge attributes: {'kind': {'peptide_bond'}, 'distance': 3.787657059449812}
Edge between A:TYR:2 and A:LYS:11
Edge attributes: {'kind': {'distance_threshold'}, 'distance': 4.355063949013834}
Edge between A:THR:3 and A:LEU:4
Edge attributes: {'kind': {'peptide_bond'}, 'distance': 3.7871142047738653}
Edge between A:THR:3 and A:TYR:114
Edge attributes: {'kind': {'distance_threshold'}, 'distance': 4.84375938708768}
Edge between A:LEU:4 and A:TRP:5
Edge attributes: {'kind': {'peptide_bond'}, 'distance': 3.796580303378287}
Edge between A:LEU:4 and A:ILE:9
Edge attributes: {'kind': {'distance_threshold'}, 'distance': 4.564822669063936}


Graphein also has functions to visualize the graph using different libraries. Here's an example with `plotly`:

In [58]:
p = plotly_protein_structure_graph(
    graph,
    colour_edges_by="kind",
    colour_nodes_by='chain_id',
    label_node_ids=False,
    plot_title="Peptide backbone graph with distance connections. Nodes coloured by chain.",
    node_size_multiplier=1
    )
p.show()

We can extract interface residues from this graph by checking for edges between chains:

In [15]:
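interface_residues = set()
# Node IDs have the form "chain:residue:number", so the chain is the first field.
# Only the source endpoint is tested here; load_graph below tests both endpoints.
for source, target, edge_data in graph.edges(data=True):
    if 'distance_threshold' in edge_data['kind'] and source.split(":")[0] == "A" and target.split(":")[0] != "A":
        interface_residues.add(source)
interface_residues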

{'A:ARG:138',
 'A:ASN:148',
 'A:GLN:101',
 'A:GLN:157',
 'A:GLY:27',
 'A:HIS:100',
 'A:ILE:17',
 'A:LEU:149',
 'A:PHE:26',
 'A:TRP:139',
 'A:TYR:24',
 'A:VAL:153'}
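The same check can be tried without loading a structure. The sketch below runs the interface logic on a handful of hand-written edges and tests both endpoints of each edge, so the result does not depend on the order in which the graph yields an edge's endpoints. The function name `find_interface_residues` and the toy node IDs are made up for illustration.

```python
def find_interface_residues(edges, chain):
    """Collect residues of `chain` joined to another chain by a distance edge.

    `edges` is an iterable of (source, target, data) triples, where node IDs
    look like "A:GLY:1" and data["kind"] is a set of edge-type names.
    """
    found = set()
    for source, target, data in edges:
        if "distance_threshold" not in data["kind"]:
            continue  # peptide bonds never cross chains
        source_chain = source.split(":")[0]
        target_chain = target.split(":")[0]
        if source_chain == chain and target_chain != chain:
            found.add(source)
        elif target_chain == chain and source_chain != chain:
            found.add(target)
    return found


# Toy edges (hypothetical residues, not taken from 1A0G).
toy_interface_edges = [
    ("A:GLY:1", "A:TYR:2", {"kind": {"peptide_bond"}}),
    ("A:TYR:2", "A:LYS:11", {"kind": {"distance_threshold"}}),
    ("A:PHE:26", "B:LEU:40", {"kind": {"distance_threshold"}}),
    ("B:SER:12", "A:ILE:17", {"kind": {"distance_threshold"}}),
]
print(sorted(find_interface_residues(toy_interface_edges, "A")))
```

Only the last two toy edges cross chains, so only their chain-A endpoints are reported; the intra-chain distance edge is ignored.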

We only need the graph for one chain, so we'll extract the subgraph for the chain of interest.

In [17]:
chain_subgraph = extract_subgraph(graph, chains="A")
p = plotly_protein_structure_graph(
    chain_subgraph,
    colour_edges_by="kind",
    colour_nodes_by="degree",
    label_node_ids=False,
    plot_title="Peptide backbone graph. Nodes coloured by degree.",
    node_size_multiplier=1
    )
p.show()

Let's wrap this into a function that we'll be reusing in the later notebooks. Feel free to add more node and edge features to the graph in this function.

In [7]:
def load_graph(pdb_id, chain):
    """Build the featurised graph for one chain of a PDB entry.

    Returns the subgraph for `chain` and the set of its interface residues,
    i.e. residues joined to another chain by a distance-threshold edge.
    """
    graph_config = ProteinGraphConfig(
        node_metadata_functions = [graphein_nodes.amino_acid_one_hot, graphein_nodes.meiler_embedding],
        edge_construction_functions = [graphein_edges.add_peptide_bonds,
                                       partial(graphein_edges.add_distance_threshold, threshold=8, long_interaction_threshold=2)],
    )
    graph = construct_graph(pdb_code=pdb_id, config=graph_config, verbose=False)
    interface_residues = set()
    for source, target, edge_data in graph.edges(data=True):
        # Node IDs have the form "chain:residue:number"
        c1, c2 = source.split(":")[0], target.split(":")[0]
        if 'distance_threshold' in edge_data['kind']:
            # Check both endpoints, since either one may belong to the chain of interest
            if c1 == chain and c2 != chain:
                interface_residues.add(source)
            elif c2 == chain and c1 != chain:
                interface_residues.add(target)
    return extract_subgraph(graph, chains=chain), interface_residues

Put this function into the `src/dataloader.py` file so that we can reuse it in the later notebooks.

## Bonus

- Surface graphs
- Other ways of extracting subgraphs
- Small molecule graphs

In [60]:
from graphein.protein.features.sequence import compute_esm_embedding
from graphein.protein.features.sequence.utils import (
    subset_by_node_feature_value,
)
from graphein.protein.features.nodes.amino_acid import (
    amino_acid_one_hot,
    meiler_embedding,
    expasy_protein_scale,
    hydrogen_bond_acceptor,
    hydrogen_bond_donor,
)
from graphein.protein.edges.distance import (
    add_distance_threshold,
    add_peptide_bonds,
    add_hydrophobic_interactions,
    add_hydrogen_bond_interactions,
    add_disulfide_interactions,
    add_ionic_interactions,
    add_aromatic_interactions,
    add_aromatic_sulphur_interactions,
    add_cation_pi_interactions,
)
import graphein.molecule as gm



config = gm.MoleculeGraphConfig(
    node_metadata_functions=[
        gm.atom_type_one_hot,
        gm.atomic_mass,
        gm.degree,
        gm.total_degree,
        gm.total_valence,
        gm.explicit_valence,
        gm.implicit_valence,
        gm.num_explicit_h,
        gm.num_implicit_h,
        gm.total_num_h,
        gm.num_radical_electrons,
        gm.formal_charge,
        gm.hybridization,
        gm.is_aromatic,
        gm.is_isotope,
        gm.is_ring,
        gm.chiral_tag,
        partial(gm.is_ring_size, ring_size=5),
        partial(gm.is_ring_size, ring_size=7)
    ]
)
# MOL2_PATH is not defined in this cell; set it to the path of a .mol2 file before running.
graph = gm.construct_graph(mol2_path=MOL2_PATH, config=config)
