In [1]:
from wp2_functions import (
    drf_features_from_rsmi,
    drf_wl_features_from_rsmi,
    drf_wl_features_per_iter_from_rsmi,
    plot_wl_drf_feature_growth_from_rsmi,
    plot_wl_drf_iterations_from_rsmi,
    wl_feature_sets_per_iter_graph,
)
import pandas as pd
from vis_utils import plot_drf_from_counters_rsmi
import plotly.io as pio
from synkit.IO import rsmi_to_graph
pio.renderers.default = "vscode"

TODOs:

1. Implement transformations Φ for Vertex-, Edge- and Shortest-Path Labels.
2. Implement the DRF transformation function.
3. Implement Weisfeiler-Lehman with applications of Φ such that hashed feature sets are
returned at every iteration.
4. Pre-compute feature hash sets for all datasets.

In [2]:
# one rsmi for testing

path = "schneider50k_clean.tsv"
data = pd.read_csv(path, sep="\t")
rsmi = data["clean_rxn"].iloc[20]
print(rsmi)


[NH2:1][CH2:2][c:3]1[cH:33][cH:32][cH:31][c:5]([CH2:6][N:7]([CH2:20][c:21]2[cH:22][cH:23][c:24]([C:27]([F:28])([F:29])[F:30])[cH:25][cH:26]2)[S:8](=[O:9])(=[O:10])[c:11]2[cH:16][c:15]([Cl:17])[cH:14][c:13]([Cl:18])[c:12]2[OH:19])[cH:4]1.[O:44]=[C:43]=[N:42][c:36]1[cH:37][cH:38][c:39]([F:41])[cH:40][c:35]1[F:34]>>[O:44]=[C:43]([NH:1][CH2:2][c:3]1[cH:33][cH:32][cH:31][c:5]([CH2:6][N:7]([CH2:20][c:21]2[cH:26][cH:25][c:24]([C:27]([F:30])([F:28])[F:29])[cH:23][cH:22]2)[S:8](=[O:10])(=[O:9])[c:11]2[cH:16][c:15]([Cl:17])[cH:14][c:13]([Cl:18])[c:12]2[OH:19])[cH:4]1)[NH:42][c:36]1[cH:37][cH:38][c:39]([F:41])[cH:40][c:35]1[F:34]


## 1. + 2. Implement transformations Φ for Vertex-, Edge- and Shortest-Path Labels + Implement the DRF transformation function.

In WP2, reactions are represented using **Differential Reaction Fingerprints (DRF)**.
For a reaction \( R = (E_R, P_R) \), feature representations are first computed
separately for the reactant graph \( \Phi(E_R) \) and the product graph \( \Phi(P_R) \).

The reaction representation is then defined as the **symmetric difference**
between these two feature multisets:
\[
\Phi_{\text{reaction}}(R) = \Phi(E_R) \,\Delta\, \Phi(P_R)
\]

This operation removes features that appear in both reactants and products
and keeps only those features that change during the reaction.
In this implementation, features are treated as **multisets** (using feature counts),
so the symmetric difference is computed as the absolute difference of feature frequencies.
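
The multiset symmetric difference described above can be sketched with `collections.Counter`. This is a minimal illustration, not the code in `wp2_functions`: the helper name `multiset_symmetric_difference` and the toy feature labels are assumptions made for the example.

```python
from collections import Counter


def multiset_symmetric_difference(reactant_features, product_features):
    """Return a Counter holding |count_reactant - count_product| per feature."""
    diff = Counter()
    for feature in reactant_features.keys() | product_features.keys():
        # Counter returns 0 for missing keys, so one-sided features keep their count.
        delta = abs(reactant_features[feature] - product_features[feature])
        if delta:
            diff[feature] = delta
    return diff


# Toy edge-label multisets (hypothetical labels, not real feature hashes).
reactants = Counter({"C-N": 2, "C=O": 1, "N=C": 1})
products = Counter({"C-N": 4, "C=O": 1})

drf = multiset_symmetric_difference(reactants, products)
print(dict(drf), "total:", sum(drf.values()))
```

The unchanged feature `C=O` cancels out, while `C-N` keeps the count difference 2 and `N=C` keeps its count 1. For non-negative counts the same result is obtained with `(reactants - products) + (products - reactants)`.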

Depending on the chosen mode, DRF can be constructed from
vertex labels, edge labels, or shortest-path features.
This representation focuses explicitly on structural changes
and is used as input for kernel-based similarity computations in later steps.
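
To make the three base mappings concrete, the following sketch computes vertex, edge, and shortest-path feature multisets for a tiny labeled graph. It is an illustration under stated assumptions: the graph is a plain dictionary rather than the graph object returned by `rsmi_to_graph`, the functions `phi_vertex`, `phi_edge`, and `phi_shortest_path` are hypothetical names, shortest-path features use only endpoint labels and path length, and `blake2b` with `digest_size=16` is chosen only because it yields 32-character hex digests.

```python
from collections import Counter, deque
import hashlib


def hash_label(label):
    """Hash a feature string to a 32-character hex digest (assumed scheme)."""
    return hashlib.blake2b(label.encode("utf-8"), digest_size=16).hexdigest()


def phi_vertex(nodes):
    """One feature per node, built from the node label."""
    return Counter(hash_label(label) for label in nodes.values())


def phi_edge(nodes, edges):
    """One feature per edge: sorted endpoint labels plus the bond label."""
    features = Counter()
    for (u, v), bond in edges.items():
        a, b = sorted((nodes[u], nodes[v]))
        features[hash_label(f"{a}|{bond}|{b}")] += 1
    return features


def phi_shortest_path(nodes, edges):
    """One feature per connected node pair: endpoint labels plus path length."""
    adjacency = {n: [] for n in nodes}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    features = Counter()
    for source in nodes:
        # Breadth-first search gives shortest path lengths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for neighbor in adjacency[current]:
                if neighbor not in dist:
                    dist[neighbor] = dist[current] + 1
                    queue.append(neighbor)
        for target, length in dist.items():
            if source < target:  # count each unordered pair once
                a, b = sorted((nodes[source], nodes[target]))
                features[hash_label(f"{a}|{length}|{b}")] += 1
    return features


# Toy graph: O=C-N (integer node ids, bond order as edge label).
nodes = {1: "C", 2: "O", 3: "N"}
edges = {(1, 2): "2", (1, 3): "1"}

counts = (
    sum(phi_vertex(nodes).values()),
    sum(phi_edge(nodes, edges).values()),
    sum(phi_shortest_path(nodes, edges).values()),
)
print(counts)
```

Shortest-path features are generated per node pair rather than per node or per edge, so their number grows roughly quadratically with graph size.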



→ DRF cancels every feature that occurs equally often in reactants and products.  
→ Only the features that change during the reaction remain.

### Comparing DRF Feature Types

This cell computes Differential Reaction Fingerprints (DRF) using different base
feature mappings (vertex labels, edge labels, and shortest paths).  
The resulting feature counts illustrate how the choice of Φ affects the richness
and granularity of the reaction representation.

In [3]:
drf_E = drf_features_from_rsmi(rsmi, mode="edge")
drf_V = drf_features_from_rsmi(rsmi, mode="vertex")
drf_SP = drf_features_from_rsmi(rsmi, mode="sp", include_edge_labels_in_sp=True)

print("DRF edge feature count:", sum(drf_E.values()))
print("DRF vertex feature count:", sum(drf_V.values()))
print("DRF shortest-path feature count:", sum(drf_SP.values()))

# Optional: inspect a few entries
print("Sample DRF edge items:", list(drf_E.items())[:10])

DRF edge feature count: 3
DRF vertex feature count: 0
DRF shortest-path feature count: 399
Sample DRF edge items: [('e79537351a6bc51d5c4fbea90fec0b26', 2), ('095751d849efbdb232d632cd4a7fc89e', 1)]


### Visualizing DRF Features for Different Φ Mappings

This cell visualizes the Differential Reaction Fingerprints derived from different
base feature mappings (edge, vertex, and shortest-path features).  
Highlighted graph elements indicate structural differences between reactants and
products that contribute to the DRF representation.

In [4]:
fig_e = plot_drf_from_counters_rsmi(
    rsmi,
    drf_counter=drf_E,
    mode="edge",
    hash_labels=True,              
    digest_size=16,                 
    show_edge_labels=True,
)
fig_e.show(renderer="vscode")

fig_v = plot_drf_from_counters_rsmi(
    rsmi,
    drf_counter=drf_V,
    mode="vertex",
    hash_labels=True,
    digest_size=16,
)
fig_v.show(renderer="vscode")

fig_sp = plot_drf_from_counters_rsmi(
    rsmi,
    drf_counter=drf_SP,
    mode="sp",
    include_edge_labels_in_sp=True, 
    hash_labels=True,
    digest_size=16,
)
fig_sp.show(renderer="vscode")

## 3. Implement Weisfeiler-Lehman with applications of Φ such that hashed feature sets are returned at every iteration.

The Weisfeiler–Lehman (WL) algorithm iteratively refines node labels by incorporating
the labels of neighboring nodes. Starting from initial node labels (e.g. atom types),
each WL iteration constructs a new label for every node based on its current label
and the multiset of its neighbors’ labels, followed by hashing.

This process enriches node labels with increasingly larger local neighborhood
information. After each iteration, feature maps Φ are applied to the relabeled graph
and the resulting hashed feature sets are stored. These WL-enhanced features capture
local substructures of the graph up to a given neighborhood depth and are later used
for kernel-based similarity computations.
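
The relabeling step can be sketched as follows. This is a minimal illustration, not the implementation in `wp2_functions`: the function name `wl_relabel`, the dictionary-based graph, and the use of `blake2b` with `digest_size=16` are assumptions made for the example.

```python
import hashlib


def wl_relabel(labels, adjacency, h):
    """Return the node labelings for iterations 0..h as a list of dicts."""
    history = [dict(labels)]
    for _ in range(h):
        current = history[-1]
        refined = {}
        for node, neighbors in adjacency.items():
            # Own label plus the sorted multiset of neighbor labels.
            signature = current[node] + "|" + ",".join(
                sorted(current[n] for n in neighbors)
            )
            refined[node] = hashlib.blake2b(
                signature.encode("utf-8"), digest_size=16
            ).hexdigest()
        history.append(refined)
    return history


# Toy path graph C-C-O with atom symbols as initial labels.
labels = {0: "C", 1: "C", 2: "O"}
adjacency = {0: [1], 1: [0, 2], 2: [1]}

history = wl_relabel(labels, adjacency, h=2)
distinct = [len(set(labeling.values())) for labeling in history]
print(distinct)
```

In iteration 0 the two carbon atoms share a label. After one refinement they are distinguished, because only one of them is bonded to the oxygen. Applying a feature map Φ to each labeling in `history` would yield one feature set per iteration.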

→ WL iteratively enriches node labels with neighborhood information, capturing increasingly complex local graph structures.
→ WL = repeated relabeling of nodes based on their neighborhood, followed by hashing of the new labels.

### WL-Enhanced DRF Feature Counts

This cell computes Differential Reaction Fingerprints combined with Weisfeiler–Lehman
(WL) relabeling for different base feature mappings (edge, vertex, and shortest paths).
The printed counts summarize the total amount of structural change captured when
neighborhood information up to WL depth h=3 is incorporated.

In [5]:
drf_wl_E = drf_wl_features_from_rsmi(rsmi, h=3, mode="edge")
drf_wl_V = drf_wl_features_from_rsmi(rsmi, h=3, mode="vertex")
drf_wl_SP = drf_wl_features_from_rsmi(rsmi, h=3, mode="sp", include_edge_labels_in_sp=True)
print(sum(drf_wl_E.values()), sum(drf_wl_V.values()), sum(drf_wl_SP.values()))

40 28 2090


### WL Feature Sets per Iteration

This cell computes feature sets for a single graph after successive
Weisfeiler–Lehman (WL) iterations.  
The printed values show how the number of extracted features grows with increasing
WL depth, illustrating how neighborhood information is progressively incorporated.

In [6]:

ed, pr = rsmi_to_graph(rsmi)

sets_per_iter = wl_feature_sets_per_iter_graph(ed, h=3, mode="edge")  # list of sets
print([len(s) for s in sets_per_iter])

[11, 23, 34, 38]


### WL–DRF Features per Iteration and Visualization

This cell computes WL-enhanced DRF features separately for each WL iteration and
prints their total counts.  
The accompanying visualization highlights, for each iteration, which graph elements
contribute to the DRF representation, illustrating how increasingly larger
neighborhood information influences the detected reaction changes.

Red graph elements indicate nodes or edges that contribute to the DRF representation, i.e. structural changes between reactants and products.
(Marked in red = this is the reaction.
Gray = chemically irrelevant background.)
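
The relationship between per-iteration counts and the overall total can be sketched with plain counters. This assumes that the total is the sum of the per-iteration DRF multisets and that features from different iterations do not collide; the feature keys below are hypothetical placeholders, not real hashes.

```python
from collections import Counter

# Hypothetical per-iteration DRF counters for WL iterations 0 and 1.
per_iter_demo = [
    Counter({"it0:a": 2, "it0:b": 1}),
    Counter({"it1:a": 3, "it1:b": 4}),
]

# Adding Counters merges keys and sums counts.
total_demo = sum(per_iter_demo, Counter())

iteration_counts = [sum(c.values()) for c in per_iter_demo]
print(iteration_counts, "sum:", sum(total_demo.values()))
```

Under these assumptions the total count is the sum of the per-iteration counts.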

In [7]:

per_iter, total = drf_wl_features_per_iter_from_rsmi(rsmi, h=3, mode="edge")
print([sum(c.values()) for c in per_iter], "sum:", sum(total.values()))


fig = plot_wl_drf_iterations_from_rsmi(rsmi, h=3, mode="edge", show_edge_labels=True)
fig.show()

[3, 7, 11, 19] sum: 40


### WL–DRF Feature Growth Across Iterations
This plot shows how the number and magnitude of DRF features evolve across WL iterations.
Each WL relabeling step introduces additional structural information, and the
per-iteration feature counts keep increasing ([3, 7, 11, 19] for this reaction).

No saturation is observed up to h=3, indicating that deeper neighborhood information
still contributes to the reaction representation.

In [8]:
fig = plot_wl_drf_feature_growth_from_rsmi(rsmi, h=3, mode="edge", show_cumulative=True)
fig.show()

## 4. Pre-computing Feature Hash Sets

In this step, feature representations are pre-computed for all reactions in the dataset.
For each reaction, hashed feature sets are generated using the previously defined
DRF and WL–DRF transformations and stored for later use.

This pre-computation avoids repeated and expensive feature extraction during kernel
construction and enables efficient and reproducible experiments in subsequent work packages.
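
The compute-once, load-later pattern can be sketched with the standard library. This is an illustration only: `toy_features` is a hypothetical stand-in for the DRF/WL–DRF extraction, and the function `precompute_file`, the input layout (one reaction per line), and the output file name are assumptions, not the behavior of `wp2_precompute_features`.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path


def toy_features(reaction):
    """Hypothetical stand-in for feature extraction: counts characters."""
    return Counter(reaction)


def precompute_file(tsv_path, out_dir):
    """Compute features for every line of tsv_path and pickle them in out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / (tsv_path.stem + ".features.pkl")
    reactions = tsv_path.read_text(encoding="utf-8").splitlines()
    features = [toy_features(r) for r in reactions if r]
    with out_path.open("wb") as fh:
        pickle.dump(features, fh)
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "subset_demo.tsv"
    source.write_text("CCO>>CC=O\nCC>>C=C\n", encoding="utf-8")

    cached = precompute_file(source, Path(tmp) / "precomputed")

    # Later steps load the cached features instead of recomputing them.
    with cached.open("rb") as fh:
        loaded = pickle.load(fh)

print(cached.name, len(loaded))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.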

### Pre-computing WL–DRF Feature Representations

This cell runs the pre-computation for every file in `subsets_small` that matches
`subset*.tsv`, using edge-mode DRF and WL–DRF features with h=3.  
One pickle file per subset is written to the `precomputed` directory.

In [9]:
from wp2_precompute_features import precompute_all_subsets_in_dir

precompute_all_subsets_in_dir(
    subsets_dir="subsets_small",
    out_dir="precomputed",
    pattern="subset*.tsv",
    drf_mode="edge",
    wl_h=3,
    wl_mode="edge",
)

[+] Precomputing subset_001.tsv -> subset_001.reaction_features_drf_wl_h3.pkl
[+] Precomputing subset_002.tsv -> subset_002.reaction_features_drf_wl_h3.pkl
[+] Precomputing subset_003.tsv -> subset_003.reaction_features_drf_wl_h3.pkl
[+] Precomputing subset_004.tsv -> subset_004.reaction_features_drf_wl_h3.pkl
[+] Precomputing subset_005.tsv -> subset_005.reaction_features_drf_wl_h3.pkl
[+] Precomputing subset_006.tsv -> subset_006.reaction_features_drf_wl_h3.pkl
[+] Precomputing subset_007.tsv -> subset_007.reaction_features_drf_wl_h3.pkl
[+] Precomputing subset_008.tsv -> subset_008.reaction_features_drf_wl_h3.pkl
[+] Precomputing subset_009.tsv -> subset_009.reaction_features_drf_wl_h3.pkl
[+] Precomputing subset_010.tsv -> subset_010.reaction_features_drf_wl_h3.pkl
[+] Precomputing subset_011.tsv -> subset_011.reaction_features_drf_wl_h3.pkl
[+] Precomputing subset_012.tsv -> subset_012.reaction_features_drf_wl_h3.pkl
[+] Precomputing subset_013.tsv -> subset_013.reaction_features_