### 4.3 Distances between molecular structures

Learning goal: The concept of maximum common subgraph (MCS) and related distance measures.

Figure 2 shows four molecular graph structures: niacin (vitamin B3), nicotine (the active ingredient in tobacco), psilocin (the active ingredient in “magic mushrooms”), and proline (an amino acid). The node labels correspond to atoms (carbon, oxygen, or nitrogen). For simplicity, hydrogen atoms are omitted and double bonds are not distinguished from single bonds.

Figure 2: Four graphs corresponding to molecular structures.

![image.png](molecular.png)

In [1]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

![image.png](molecular_index.png)

In [3]:
G1_Niacin = {
    '1': {'label': 'C', 'adj': ['2', '6']},
    '2': {'label': 'C', 'adj': ['1', '3']},
    '3': {'label': 'C', 'adj': ['2', '4']},
    '4': {'label': 'N', 'adj': ['3', '5']},
    '5': {'label': 'C', 'adj': ['4', '6']},
    '6': {'label': 'C', 'adj': ['1', '5', '7']},
    '7': {'label': 'C', 'adj': ['6', '8', '9']},
    '8': {'label': 'O', 'adj': ['7']},
    '9': {'label': 'O', 'adj': ['7']},
}

G2_Nicotine = {
    '1': {'label': 'C', 'adj': ['2', '6']},
    '2': {'label': 'C', 'adj': ['1', '3']},
    '3': {'label': 'C', 'adj': ['2', '4']},
    '4': {'label': 'N', 'adj': ['3', '5']},
    '5': {'label': 'C', 'adj': ['4', '6']},
    '6': {'label': 'C', 'adj': ['1', '5', '7']},
    '7': {'label': 'C', 'adj': ['6', '8', '11']},
    '8': {'label': 'C', 'adj': ['7', '9']},
    '9': {'label': 'C', 'adj': ['8', '10']},
    '10': {'label': 'C', 'adj': ['9', '11']},
    '11': {'label': 'N', 'adj': ['7', '10', '12']},
    '12': {'label': 'C', 'adj': ['11']},
}

G3_Psilocin = {
    '1': {'label': 'O', 'adj': ['2']},
    '2': {'label': 'C', 'adj': ['3', '7']},
    '3': {'label': 'C', 'adj': ['2', '4']},
    '4': {'label': 'C', 'adj': ['3', '5']},
    '5': {'label': 'C', 'adj': ['4', '6']},
    '6': {'label': 'C', 'adj': ['5', '7', '8']},
    '7': {'label': 'C', 'adj': ['2', '6', '10']},
    '8': {'label': 'N', 'adj': ['6', '9']},
    '9': {'label': 'C', 'adj': ['8', '10']},
    '10': {'label': 'C', 'adj': ['7', '9', '11']},
    '11': {'label': 'C', 'adj': ['10', '12']},
    '12': {'label': 'C', 'adj': ['11', '13']},
    '13': {'label': 'N', 'adj': ['12', '14', '15']},
    '14': {'label': 'C', 'adj': ['13']},
    '15': {'label': 'C', 'adj': ['13']},
}

G4_Proline = {
    '1': {'label': 'C', 'adj': ['2', '5']},
    '2': {'label': 'C', 'adj': ['1', '3']},
    '3': {'label': 'C', 'adj': ['2', '4']},
    '4': {'label': 'N', 'adj': ['3', '5']},
    '5': {'label': 'C', 'adj': ['1', '4', '6']},
    '6': {'label': 'C', 'adj': ['5', '7', '8']},
    '7': {'label': 'O', 'adj': ['6']},
    '8': {'label': 'O', 'adj': ['6']},
}
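
Hand-typed adjacency lists are easy to get wrong, and the matching code below assumes an undirected graph. The following is a minimal sketch of a sanity check, assuming the same dictionary layout as above; `adjacency_problems` and the toy graph are illustrative names and not part of the exercise.

```python
def adjacency_problems(G):
    """Return (u, v) entries that are self-loops, point to a missing node,
    or are not mirrored in the adjacency list of v."""
    problems = []
    for u, data in G.items():
        for v in data['adj']:
            if v == u or v not in G or u not in G[v]['adj']:
                problems.append((u, v))
    return problems

# Toy graph (not one of the four molecules): node '3' lists itself instead of '2'.
toy = {
    '1': {'label': 'C', 'adj': ['2']},
    '2': {'label': 'N', 'adj': ['1', '3']},
    '3': {'label': 'C', 'adj': ['3']},
}
print(adjacency_problems(toy))  # [('2', '3'), ('3', '3')]
```

An empty list means every edge is listed from both endpoints and no node is adjacent to itself.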

In [14]:
def mcs(G1, G2, M, Mbest):
    """
    Maximum Common Subgraph (MCS) Algorithm
    :param G1: First graph represented as an adjacency list with labels
    :param G2: Second graph represented as an adjacency list with labels
    :param M: Current partially matched pairs
    :param Mbest: Current best match
    :return: Tuple of (best match found so far, its size)
    """
    C = get_candidate_pairs(G1, G2, M)
    for i1, i2 in C:
        if is_valid_matching(G1, G2, M, (i1, i2)):
            Mbest, _ = mcs(G1, G2, M + [(i1, i2)], Mbest)  # size is recomputed below
    
    if len(M) > len(Mbest):
        return M, len(M)
    else:
        return Mbest, len(Mbest)

def get_candidate_pairs(G1, G2, M):
    """
    Get all label matching node pairs from (G1, G2) not in M
    :param G1: First graph
    :param G2: Second graph
    :param M: Current partially matched pairs
    :return: List of candidate pairs
    """
    C = []
    matched_nodes_G1 = {i1 for i1, i2 in M}
    matched_nodes_G2 = {i2 for i1, i2 in M}
    
    for i1 in G1:
        if i1 not in matched_nodes_G1:
            for i2 in G2:
                if i2 not in matched_nodes_G2 and G1[i1]['label'] == G2[i2]['label']:
                    C.append((i1, i2))
    return C

def is_valid_matching(G1, G2, M, pair):
    """
    Check if adding the pair to M results in a valid matching
    :param G1: First graph
    :param G2: Second graph
    :param M: Current partially matched pairs
    :param pair: Candidate pair
    :return: True if valid, False otherwise
    """
    i1, i2 = pair
    
    if M:  # If M is not empty, check for connectivity
        if not any(j1 in G1[i1]['adj'] and j2 in G2[i2]['adj'] for j1, j2 in M):
            return False
    
    for j1, j2 in M:
        # Check if there is a mismatch in adjacency
        if (j1 in G1[i1]['adj'] and j2 not in G2[i2]['adj']) or (j1 not in G1[i1]['adj'] and j2 in G2[i2]['adj']):
            return False
            
    return True


MCS_G1_G2, MCS_G1_G2_len = mcs(G1_Niacin, G2_Nicotine, [], [])
print(f"| MCS_G1_G2 | = {MCS_G1_G2_len}; Subgraph label matching {MCS_G1_G2}")

MCS_G1_G3, MCS_G1_G3_len = mcs(G1_Niacin, G3_Psilocin, [], [])
print(f"| MCS_G1_G3 | = {MCS_G1_G3_len}; Subgraph label matching {MCS_G1_G3}")

MCS_G1_G4, MCS_G1_G4_len = mcs(G1_Niacin, G4_Proline, [], [])
print(f"| MCS_G1_G4 | = {MCS_G1_G4_len}; Subgraph label matching {MCS_G1_G4}")

MCS_G2_G3, MCS_G2_G3_len = mcs(G2_Nicotine, G3_Psilocin, [], [])
print(f"| MCS_G2_G3 | = {MCS_G2_G3_len}; Subgraph label matching {MCS_G2_G3}")

MCS_G2_G4, MCS_G2_G4_len = mcs(G2_Nicotine, G4_Proline, [], [])
print(f"| MCS_G2_G4 | = {MCS_G2_G4_len}; Subgraph label matching {MCS_G2_G4}")

MCS_G3_G4, MCS_G3_G4_len = mcs(G3_Psilocin, G4_Proline, [], [])
print(f"| MCS_G3_G4 | = {MCS_G3_G4_len}; Subgraph label matching {MCS_G3_G4}")


| MCS_G1_G2 | = 7; Subgraph label matching [('1', '1'), ('2', '2'), ('3', '3'), ('4', '4'), ('5', '5'), ('6', '6'), ('7', '7')]
| MCS_G1_G3 | = 7; Subgraph label matching [('1', '4'), ('2', '5'), ('3', '6'), ('4', '8'), ('6', '3'), ('7', '2'), ('8', '1')]
| MCS_G1_G4 | = 7; Subgraph label matching [('1', '1'), ('2', '2'), ('3', '3'), ('6', '5'), ('7', '6'), ('8', '7'), ('9', '8')]
| MCS_G2_G3 | = 9; Subgraph label matching [('1', '10'), ('2', '7'), ('3', '6'), ('4', '8'), ('6', '11'), ('7', '12'), ('11', '13'), ('10', '14'), ('12', '15')]
| MCS_G2_G4 | = 6; Subgraph label matching [('6', '6'), ('7', '5'), ('8', '1'), ('9', '2'), ('10', '3'), ('11', '4')]
| MCS_G3_G4 | = 6; Subgraph label matching [('2', '6'), ('1', '7'), ('3', '5'), ('4', '1'), ('5', '2'), ('6', '3')]
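
Each printed matching pairs a node of the first graph with a node of the second. The conditions enforced by `is_valid_matching` are that paired nodes carry the same label and that two matched nodes are adjacent in one graph exactly when their partners are adjacent in the other. The sketch below re-checks those two conditions for a finished matching on toy graphs; `is_common_subgraph` and both toy graphs are hypothetical names introduced only for illustration, and the connectivity requirement is deliberately not checked here.

```python
def is_common_subgraph(G1, G2, M):
    """Check that matching M (list of (node in G1, node in G2)) preserves
    labels and adjacency among the matched nodes."""
    for a1, a2 in M:
        if G1[a1]['label'] != G2[a2]['label']:
            return False
    for a1, a2 in M:
        for b1, b2 in M:
            # Adjacent in G1 must mean adjacent in G2, and vice versa.
            if (b1 in G1[a1]['adj']) != (b2 in G2[a2]['adj']):
                return False
    return True

# Toy graphs, not the molecules above: a path C-C-N and a triangle C-C-N.
path = {
    'a': {'label': 'C', 'adj': ['b']},
    'b': {'label': 'C', 'adj': ['a', 'c']},
    'c': {'label': 'N', 'adj': ['b']},
}
triangle = {
    'x': {'label': 'C', 'adj': ['y', 'z']},
    'y': {'label': 'C', 'adj': ['x', 'z']},
    'z': {'label': 'N', 'adj': ['x', 'y']},
}
print(is_common_subgraph(path, triangle, [('a', 'x'), ('b', 'y')]))              # True
print(is_common_subgraph(path, triangle, [('a', 'x'), ('b', 'y'), ('c', 'z')]))  # False
```

The second call fails because `a` and `c` are not adjacent in the path while their partners `x` and `z` are adjacent in the triangle.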


(a) Determine the nearest neighbour for each molecule using Union-normalized MCS distance (Udist in slides, see Aggarwal Eq. 17.2).

Union-normalized distance: This distance measure lies in the range [0, 1] and has
also been shown to be a metric. The union-normalized measure UDist(G1, G2) is defined
as follows:

$$
UDist(G_1, G_2) = 1 - \frac{|MCS(G_1, G_2)|}{|G_1| + |G_2| - |MCS(G_1, G_2)|}
$$

This measure is referred to as the union-normalized distance because the denominator
contains the number of nodes in the union of the two graphs. A different way of
understanding this measure is that it normalizes the number of non-matching nodes
U(G1, G2) between the two graphs (unnormalized measure) with the number of nodes
in the union of the two graphs.

$$
UDist(G_1, G_2) = \frac{\text{Non-matching nodes between $G_1$ and $G_2$}}{\text{Union size of $G_1$ and $G_2$}}
$$

One advantage of this measure is that it is intuitively easier to interpret. Two perfectly
matching graphs will have a distance of 0 from one another, and two perfectly non-matching graphs will have a distance of 1.
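
The two ways of writing the union-normalized distance give the same number, because the non-matching nodes are exactly those outside the MCS in either graph. A minimal sketch using only node counts, with exact fractions to avoid floating-point noise; the function names `udist_from_sizes` and `udist_from_nonmatching` are illustrative assumptions, not part of the exercise code.

```python
from fractions import Fraction

def udist_from_sizes(n1, n2, m):
    """1 - |MCS| / |union|, where |union| = n1 + n2 - m."""
    return 1 - Fraction(m, n1 + n2 - m)

def udist_from_nonmatching(n1, n2, m):
    """Non-matching nodes divided by the union size."""
    non_matching = (n1 - m) + (n2 - m)
    return Fraction(non_matching, n1 + n2 - m)

# Niacin has 9 nodes, Proline has 8, and |MCS| = 7 as printed above.
print(udist_from_sizes(9, 8, 7))        # 3/10
print(udist_from_nonmatching(9, 8, 7))  # 3/10
```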

In [18]:
MCS_G1_G2, MCS_G1_G2_len = mcs(G1_Niacin, G2_Nicotine, [], [])
print(f"| MCS_G1_G2 | = {MCS_G1_G2_len}; Subgraph label matching {MCS_G1_G2}")

MCS_G1_G3, MCS_G1_G3_len = mcs(G1_Niacin, G3_Psilocin, [], [])
print(f"| MCS_G1_G3 | = {MCS_G1_G3_len}; Subgraph label matching {MCS_G1_G3}")

MCS_G1_G4, MCS_G1_G4_len = mcs(G1_Niacin, G4_Proline, [], [])
print(f"| MCS_G1_G4 | = {MCS_G1_G4_len}; Subgraph label matching {MCS_G1_G4}")

MCS_G2_G3, MCS_G2_G3_len = mcs(G2_Nicotine, G3_Psilocin, [], [])
print(f"| MCS_G2_G3 | = {MCS_G2_G3_len}; Subgraph label matching {MCS_G2_G3}")

MCS_G2_G4, MCS_G2_G4_len = mcs(G2_Nicotine, G4_Proline, [], [])
print(f"| MCS_G2_G4 | = {MCS_G2_G4_len}; Subgraph label matching {MCS_G2_G4}")

MCS_G3_G4, MCS_G3_G4_len = mcs(G3_Psilocin, G4_Proline, [], [])
print(f"| MCS_G3_G4 | = {MCS_G3_G4_len}; Subgraph label matching {MCS_G3_G4}")

def UDIST(G1, G2, Mbest):
    """
    Calculate the Union-normalized MCS distance between two graphs
    :param G1: First graph
    :param G2: Second graph
    :param Mbest: Maximum Common Subgraph
    :return: Union-normalized MCS distance
    """
    return 1 - len(Mbest) / (len(G1) + len(G2) - len(Mbest))

print("\nUnion-normalized MCS distances:")
# Calculate Union-normalized MCS distances between Niacin and Nicotine
UDIST_G1_G2 = UDIST(G1_Niacin, G2_Nicotine, MCS_G1_G2)
print(f"UDIST(G1, G2) = {UDIST_G1_G2}")

# Calculate Union-normalized MCS distances between Niacin and Psilocin
UDIST_G1_G3 = UDIST(G1_Niacin, G3_Psilocin, MCS_G1_G3)
print(f"UDIST(G1, G3) = {UDIST_G1_G3}")

# Calculate Union-normalized MCS distances between Niacin and Proline
UDIST_G1_G4 = UDIST(G1_Niacin, G4_Proline, MCS_G1_G4)
print(f"UDIST(G1, G4) = {UDIST_G1_G4}")

# Calculate Union-normalized MCS distances between Nicotine and Psilocin
UDIST_G2_G3 = UDIST(G2_Nicotine, G3_Psilocin, MCS_G2_G3)
print(f"UDIST(G2, G3) = {UDIST_G2_G3}")

# Calculate Union-normalized MCS distances between Nicotine and Proline
UDIST_G2_G4 = UDIST(G2_Nicotine, G4_Proline, MCS_G2_G4)
print(f"UDIST(G2, G4) = {UDIST_G2_G4}")

# Calculate Union-normalized MCS distances between Psilocin and Proline
UDIST_G3_G4 = UDIST(G3_Psilocin, G4_Proline, MCS_G3_G4)
print(f"UDIST(G3, G4) = {UDIST_G3_G4}")


| MCS_G1_G2 | = 7; Subgraph label matching [('1', '1'), ('2', '2'), ('3', '3'), ('4', '4'), ('5', '5'), ('6', '6'), ('7', '7')]
| MCS_G1_G3 | = 7; Subgraph label matching [('1', '4'), ('2', '5'), ('3', '6'), ('4', '8'), ('6', '3'), ('7', '2'), ('8', '1')]
| MCS_G1_G4 | = 7; Subgraph label matching [('1', '1'), ('2', '2'), ('3', '3'), ('6', '5'), ('7', '6'), ('8', '7'), ('9', '8')]
| MCS_G2_G3 | = 9; Subgraph label matching [('1', '10'), ('2', '7'), ('3', '6'), ('4', '8'), ('6', '11'), ('7', '12'), ('11', '13'), ('10', '14'), ('12', '15')]
| MCS_G2_G4 | = 6; Subgraph label matching [('6', '6'), ('7', '5'), ('8', '1'), ('9', '2'), ('10', '3'), ('11', '4')]
| MCS_G3_G4 | = 6; Subgraph label matching [('2', '6'), ('1', '7'), ('3', '5'), ('4', '1'), ('5', '2'), ('6', '3')]

Union-normalized MCS distances:
UDIST(G1, G2) = 0.5
UDIST(G1, G3) = 0.5882352941176471
UDIST(G1, G4) = 0.30000000000000004
UDIST(G2, G3) = 0.5
UDIST(G2, G4) = 0.5714285714285714
UDIST(G3, G4) = 0.6470588235294117


Nearest neighbor of G1_Niacin is G4_Proline (distance 0.3).

Nearest neighbors of G2_Nicotine are G1_Niacin and G3_Psilocin (tied at distance 0.5).

Nearest neighbor of G3_Psilocin is G2_Nicotine (distance 0.5).

Nearest neighbor of G4_Proline is G1_Niacin (distance 0.3).
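
The nearest neighbours can also be read off programmatically. The sketch below assumes the six UDist values are copied by hand from the printed output into a dictionary; `nearest_neighbours` is a hypothetical helper, and ties are detected by exact float equality, which works here only because the tied values are both exactly 0.5.

```python
# Pairwise UDist values copied from the printed output above.
udist = {
    ('G1', 'G2'): 0.5,
    ('G1', 'G3'): 0.5882352941176471,
    ('G1', 'G4'): 0.30000000000000004,
    ('G2', 'G3'): 0.5,
    ('G2', 'G4'): 0.5714285714285714,
    ('G3', 'G4'): 0.6470588235294117,
}

def nearest_neighbours(dist, node):
    """Return all graphs at minimum distance from `node` (ties are kept)."""
    to_others = {}
    for (a, b), d in dist.items():
        if a == node:
            to_others[b] = d
        elif b == node:
            to_others[a] = d
    best = min(to_others.values())
    return sorted(n for n, d in to_others.items() if d == best)

for g in ['G1', 'G2', 'G3', 'G4']:
    print(g, nearest_neighbours(udist, g))
```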

(b) Determine the nearest neighbour for each molecule using Max-normalized MCS distance (Mdist in slides, see Aggarwal Eq. 17.3).

Max-normalized distance: This distance measure also lies in the range [0, 1]. The max-normalized distance $MDist(G_1, G_2)$ between two graphs $G_1$ and $G_2$ is defined as follows:

$$ MDist(G_1, G_2) = 1 - \frac{|MCS(G_1, G_2)|}{\max\{|G_1|, |G_2|\}} $$

The main difference from the union-normalized distance is that the size of the MCS is normalized by the size of the larger of the two graphs rather than by the size of their union. This distance measure is a metric because it satisfies the triangle inequality. The measure is also relatively easy to interpret. Two perfectly matching graphs will have a distance of 0 from one another, and two perfectly non-matching graphs will have a distance of 1.
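
As a worked check using the sizes from this exercise (Niacin has 9 nodes, Proline has 8, and the computed MCS has 7 nodes):

$$ MDist(G_1, G_4) = 1 - \frac{7}{\max\{9, 8\}} = 1 - \frac{7}{9} = \frac{2}{9} \approx 0.222 $$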

In [19]:
MCS_G1_G2, MCS_G1_G2_len = mcs(G1_Niacin, G2_Nicotine, [], [])
print(f"| MCS_G1_G2 | = {MCS_G1_G2_len}; Subgraph label matching {MCS_G1_G2}")

MCS_G1_G3, MCS_G1_G3_len = mcs(G1_Niacin, G3_Psilocin, [], [])
print(f"| MCS_G1_G3 | = {MCS_G1_G3_len}; Subgraph label matching {MCS_G1_G3}")

MCS_G1_G4, MCS_G1_G4_len = mcs(G1_Niacin, G4_Proline, [], [])
print(f"| MCS_G1_G4 | = {MCS_G1_G4_len}; Subgraph label matching {MCS_G1_G4}")

MCS_G2_G3, MCS_G2_G3_len = mcs(G2_Nicotine, G3_Psilocin, [], [])
print(f"| MCS_G2_G3 | = {MCS_G2_G3_len}; Subgraph label matching {MCS_G2_G3}")

MCS_G2_G4, MCS_G2_G4_len = mcs(G2_Nicotine, G4_Proline, [], [])
print(f"| MCS_G2_G4 | = {MCS_G2_G4_len}; Subgraph label matching {MCS_G2_G4}")

MCS_G3_G4, MCS_G3_G4_len = mcs(G3_Psilocin, G4_Proline, [], [])
print(f"| MCS_G3_G4 | = {MCS_G3_G4_len}; Subgraph label matching {MCS_G3_G4}")

def MDIST(G1, G2, Mbest):
    """
    Calculate the Max-normalized MCS distance between two graphs
    :param G1: First graph
    :param G2: Second graph
    :param Mbest: Maximum Common Subgraph
    :return: Max-normalized MCS distance
    """
    return 1 - len(Mbest) / max(len(G1), len(G2))

print("\nMax-normalized MCS distances:")

# Calculate Max-normalized MCS distances between Niacin and Nicotine
MDIST_G1_G2 = MDIST(G1_Niacin, G2_Nicotine, MCS_G1_G2)
print(f"MDIST(G1, G2) = {MDIST_G1_G2}")

# Calculate Max-normalized MCS distances between Niacin and Psilocin
MDIST_G1_G3 = MDIST(G1_Niacin, G3_Psilocin, MCS_G1_G3)
print(f"MDIST(G1, G3) = {MDIST_G1_G3}")

# Calculate Max-normalized MCS distances between Niacin and Proline
MDIST_G1_G4 = MDIST(G1_Niacin, G4_Proline, MCS_G1_G4)
print(f"MDIST(G1, G4) = {MDIST_G1_G4}")

# Calculate Max-normalized MCS distances between Nicotine and Psilocin
MDIST_G2_G3 = MDIST(G2_Nicotine, G3_Psilocin, MCS_G2_G3)
print(f"MDIST(G2, G3) = {MDIST_G2_G3}")

# Calculate Max-normalized MCS distances between Nicotine and Proline
MDIST_G2_G4 = MDIST(G2_Nicotine, G4_Proline, MCS_G2_G4)
print(f"MDIST(G2, G4) = {MDIST_G2_G4}")

# Calculate Max-normalized MCS distances between Psilocin and Proline
MDIST_G3_G4 = MDIST(G3_Psilocin, G4_Proline, MCS_G3_G4)
print(f"MDIST(G3, G4) = {MDIST_G3_G4}")

| MCS_G1_G2 | = 7; Subgraph label matching [('1', '1'), ('2', '2'), ('3', '3'), ('4', '4'), ('5', '5'), ('6', '6'), ('7', '7')]
| MCS_G1_G3 | = 7; Subgraph label matching [('1', '4'), ('2', '5'), ('3', '6'), ('4', '8'), ('6', '3'), ('7', '2'), ('8', '1')]
| MCS_G1_G4 | = 7; Subgraph label matching [('1', '1'), ('2', '2'), ('3', '3'), ('6', '5'), ('7', '6'), ('8', '7'), ('9', '8')]
| MCS_G2_G3 | = 9; Subgraph label matching [('1', '10'), ('2', '7'), ('3', '6'), ('4', '8'), ('6', '11'), ('7', '12'), ('11', '13'), ('10', '14'), ('12', '15')]
| MCS_G2_G4 | = 6; Subgraph label matching [('6', '6'), ('7', '5'), ('8', '1'), ('9', '2'), ('10', '3'), ('11', '4')]
| MCS_G3_G4 | = 6; Subgraph label matching [('2', '6'), ('1', '7'), ('3', '5'), ('4', '1'), ('5', '2'), ('6', '3')]

Max-normalized MCS distances:
MDIST(G1, G2) = 0.41666666666666663
MDIST(G1, G3) = 0.5333333333333333
MDIST(G1, G4) = 0.2222222222222222
MDIST(G2, G3) = 0.4
MDIST(G2, G4) = 0.5
MDIST(G3, G4) = 0.6


Nearest neighbor of G1_Niacin is G4_Proline

Nearest neighbor of G2_Nicotine is G3_Psilocin

Nearest neighbor of G3_Psilocin is G2_Nicotine

Nearest neighbor of G4_Proline is G1_Niacin

(c) Under which conditions are Udist and Mdist equivalent? I.e., give conditions related to some graphs G1, G2 and MCS(G1, G2) such that Udist(G1, G2) = Mdist(G1, G2)

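Both distances have the same numerator, $|MCS(G_1,G_2)|$. If the MCS is non-empty, UDist and MDist are therefore equal if and only if their denominators agree:

$$ |G_1| + |G_2| - |MCS(G_1,G_2)| = \max\{|G_1|, |G_2|\} $$

Since $|G_1| + |G_2| - \max\{|G_1|, |G_2|\} = \min\{|G_1|, |G_2|\}$, this is the same as requiring $|MCS(G_1,G_2)| = \min\{|G_1|, |G_2|\}$, i.e. the MCS is the whole smaller graph (or either graph if they are of equal size). This happens when one graph is isomorphic to a subgraph of the other, or when the two graphs are isomorphic. Trivially, the two distances also coincide when the MCS is empty, because both are then equal to 1.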
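
The condition can be checked by brute force over small sizes. This is a sketch under the assumption that only the three node counts matter (no actual graphs are built); `udist` and `mdist` here are hypothetical size-based helpers, separate from the `UDIST` and `MDIST` functions above. Note that the two distances also coincide trivially when the MCS is empty, since both equal 1.

```python
from fractions import Fraction
from itertools import product

def udist(n1, n2, m):
    """Union-normalized distance from node counts."""
    return 1 - Fraction(m, n1 + n2 - m)

def mdist(n1, n2, m):
    """Max-normalized distance from node counts."""
    return 1 - Fraction(m, max(n1, n2))

# All size triples with 1 <= n1, n2 <= 8 and 0 <= m <= min(n1, n2).
mismatches = []
for n1, n2 in product(range(1, 9), repeat=2):
    for m in range(0, min(n1, n2) + 1):
        equal = udist(n1, n2, m) == mdist(n1, n2, m)
        predicted = (m == 0) or (m == min(n1, n2))
        if equal != predicted:
            mismatches.append((n1, n2, m))

print(mismatches)  # [] : equality holds exactly when m == 0 or m == min(n1, n2)
```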
