In [1]:
from pathlib import Path
import sys

ROOT = next((p for p in [Path.cwd(), *Path.cwd().parents] if (p / "scripts").is_dir() or (p / "data").is_dir()), None)
if ROOT is None:
    raise RuntimeError("Repo root not found (expected folder 'scripts' or 'data').")
sys.path.insert(0, str(ROOT))
DATA_DIR = ROOT / "data"

# Input and output directories under DATA_DIR
subsets_dir_big = DATA_DIR / "subsets_big"
subsets_dir_small = DATA_DIR / "subsets_small"
assert subsets_dir_big.exists(), f"Input dir missing: {subsets_dir_big}"
assert subsets_dir_small.exists(), f"Input dir missing: {subsets_dir_small}"

targets_big_drf = [(DATA_DIR / "drf_big" / f"precomputed_drf_{m}", m) for m in ("edge", "vertex", "sp")]
targets_small_drf = [(DATA_DIR / "drf_small" / f"precomputed_drf_{m}", m) for m in ("edge", "vertex", "sp")]
targets_big_its = [(DATA_DIR / "its_big" / f"precomputed_its_{m}", m) for m in ("edge", "vertex", "sp")]
targets_small_its = [(DATA_DIR / "its_small" / f"precomputed_its_{m}", m) for m in ("edge", "vertex", "sp")]

In [None]:
import pandas as pd
import plotly.io as pio
pio.renderers.default = "vscode"
from synkit.IO import rsmi_to_graph
from collections import Counter

# local imports

from scripts.wp2.wp2_functions import (
    drf, drf_wl, wl_feature_sets_per_iter,
    its_wl_feature_sets
)

from scripts.wp2.wp2_precompute_features import (
    precompute_all_subsets_in_dir_drf_wl,
    precompute_all_subsets_in_dir_its_wl
)

from scripts.wp2.wp2_plots import (
    plot_drf_from_counters_rsmi,
    plot_wl_drf_iterations_from_rsmi,
    plot_wl_drf_feature_growth_from_rsmi,
    visualize_its_wl_iterations,
    plot_its_wl_feature_growth_from_rsmi,
    plot_its_wl_feature_growth_subset,
    plot_feature_growth_subset_its_vs_drf
)


In [3]:
# one rsmi for testing

path = DATA_DIR / "schneider50k_clean.tsv"
data = pd.read_csv(path, sep="\t")
rsmi = data["clean_rxn"].iloc[20]
print(rsmi)

[NH2:1][CH2:2][c:3]1[cH:33][cH:32][cH:31][c:5]([CH2:6][N:7]([CH2:20][c:21]2[cH:22][cH:23][c:24]([C:27]([F:28])([F:29])[F:30])[cH:25][cH:26]2)[S:8](=[O:9])(=[O:10])[c:11]2[cH:16][c:15]([Cl:17])[cH:14][c:13]([Cl:18])[c:12]2[OH:19])[cH:4]1.[O:44]=[C:43]=[N:42][c:36]1[cH:37][cH:38][c:39]([F:41])[cH:40][c:35]1[F:34]>>[O:44]=[C:43]([NH:1][CH2:2][c:3]1[cH:33][cH:32][cH:31][c:5]([CH2:6][N:7]([CH2:20][c:21]2[cH:26][cH:25][c:24]([C:27]([F:30])([F:28])[F:29])[cH:23][cH:22]2)[S:8](=[O:10])(=[O:9])[c:11]2[cH:16][c:15]([Cl:17])[cH:14][c:13]([Cl:18])[c:12]2[OH:19])[cH:4]1)[NH:42][c:36]1[cH:37][cH:38][c:39]([F:41])[cH:40][c:35]1[F:34]


## 1. + 2. Implement transformations Φ for Vertex-, Edge- and Shortest-Path Labels + Implement the DRF transformation function.

In WP2, reactions are represented using **Differential Reaction Fingerprints (DRF)**.
For a reaction \( R = (E_R, P_R) \), feature representations are first computed
separately for the reactant graph \( \Phi(E_R) \) and the product graph \( \Phi(P_R) \).

The reaction representation is then defined as the **symmetric difference**
between these two feature multisets:
\[
\Phi_{\text{reaction}}(R) = \Phi(E_R) \,\Delta\, \Phi(P_R)
\]

This operation removes features that appear in both reactants and products
and keeps only those features that change during the reaction.
In this implementation, features are treated as **multisets** (using feature counts),
so the symmetric difference is computed as the absolute difference of feature frequencies.
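
A minimal sketch of the multiset symmetric difference described above, assuming features are held in `collections.Counter` objects. The function name and the toy bond-type features are illustrative and are not taken from `drf` itself.

```python
from collections import Counter


def multiset_symmetric_difference(a: Counter, b: Counter) -> Counter:
    """Keep each feature with the absolute difference of its two counts."""
    result = Counter()
    for key in a.keys() | b.keys():
        diff = abs(a[key] - b[key])  # a missing key counts as 0 in a Counter
        if diff:
            result[key] = diff
    return result


# Toy feature multisets (hypothetical, not computed from a real reaction)
phi_reactants = Counter({"C-C": 2, "C=N": 1, "N-H": 2})
phi_products = Counter({"C-C": 2, "C-N": 2, "N-H": 1})

toy_drf = multiset_symmetric_difference(phi_reactants, phi_products)
print(sum(toy_drf.values()))  # total count of changed features
```

The unchanged feature `"C-C"` cancels out, while features whose counts differ survive with the size of the difference.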

Depending on the chosen mode, DRF can be constructed from
vertex labels, edge labels, or shortest-path features.
This representation focuses explicitly on structural changes
and is used as input for kernel-based similarity computations in later steps.
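
To make the three modes concrete, the following sketch defines one possible Φ for vertex, edge, and shortest-path features on a tiny labeled graph. Everything here is an assumption for illustration: the graph encoding (a label dict plus an edge dict), the helper names, and the use of `hashlib.blake2b` with a 16-byte digest. The actual mappings in `scripts.wp2.wp2_functions` may be built differently.

```python
import hashlib
from collections import Counter, deque


def h16(text: str) -> str:
    """Hash a feature string to a 32-character hex digest (16 bytes)."""
    return hashlib.blake2b(text.encode(), digest_size=16).hexdigest()


def phi_vertex(labels, edges):
    """One feature per node: its label."""
    return Counter(h16(lab) for lab in labels.values())


def phi_edge(labels, edges):
    """One feature per edge: sorted endpoint labels plus bond order."""
    feats = Counter()
    for (u, v), order in edges.items():
        a, b = sorted((labels[u], labels[v]))
        feats[h16(f"{a}|{order}|{b}")] += 1
    return feats


def phi_sp(labels, edges):
    """One feature per connected node pair: endpoint labels plus path length."""
    adj = {n: [] for n in labels}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    feats = Counter()
    for src in labels:
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search gives shortest path lengths
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        for dst, d in dist.items():
            if src < dst:  # count each unordered pair once
                a, b = sorted((labels[src], labels[dst]))
                feats[h16(f"{a}|{d}|{b}")] += 1
    return feats


# Toy graph O=C-N (hypothetical, hydrogens omitted)
labels = {1: "C", 2: "O", 3: "N"}
edges = {(1, 2): 2, (1, 3): 1}

counts = {name: sum(phi(labels, edges).values())
          for name, phi in [("vertex", phi_vertex), ("edge", phi_edge), ("sp", phi_sp)]}
print(counts)
```

Shortest-path features count node pairs rather than nodes or edges, so their number grows roughly quadratically with graph size. This is consistent with the much larger shortest-path count reported for the example reaction below.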


→ DRF filters out everything that does not belong to the reaction.  
→ Only the change itself remains.

### Comparing DRF Feature Types

This cell computes Differential Reaction Fingerprints (DRF) using different base
feature mappings (vertex labels, edge labels, and shortest paths).  
The resulting feature counts illustrate how the choice of Φ affects the richness
and granularity of the reaction representation.

In [4]:
drf_E = drf(rsmi = rsmi, mode="edge")
drf_V = drf(rsmi = rsmi, mode="vertex")
drf_SP = drf(rsmi = rsmi, mode="sp", include_edge_labels_in_sp=True)

print("DRF edge feature count:", sum(drf_E.values()))
print("DRF vertex feature count:", sum(drf_V.values()))
print("DRF shortest-path feature count:", sum(drf_SP.values()))

# Optional: inspect a few entries
print("Sample DRF edge items:", list(drf_E.items())[:10])

DRF edge feature count: 7
DRF vertex feature count: 4
DRF shortest-path feature count: 479
Sample DRF edge items: [('c39115617198efcb3b826afe5a16b5ca', 2), ('dfece4a9dcd08795392ad70cc5917916', 1), ('f7efcd62786774a9fdf164fdb955b257', 1), ('15f9e08649775a73900f08bfd014711b', 1), ('de31134b08b96644d27409cf6bbb8a27', 1), ('c3597c2da4c3548b56a2e0db79dcb01f', 1)]


### Visualizing DRF Features for Different Φ Mappings

This cell visualizes the Differential Reaction Fingerprints derived from different
base feature mappings (edge, vertex, and shortest-path features).  
Highlighted graph elements indicate structural differences between reactants and
products that contribute to the DRF representation.

In [5]:
fig_e = plot_drf_from_counters_rsmi(
    rsmi,
    drf_counter=drf_E,
    mode="edge",
    hash_labels=True,              
    digest_size=16,                 
    show_edge_labels=True,
)
fig_e.show(renderer="vscode")

fig_v = plot_drf_from_counters_rsmi(
    rsmi,
    drf_counter=drf_V,
    mode="vertex",
    hash_labels=True,
    digest_size=16,
)
fig_v.show(renderer="vscode")

fig_sp = plot_drf_from_counters_rsmi(
    rsmi,
    drf_counter=drf_SP,
    mode="sp",
    include_edge_labels_in_sp=True, 
    hash_labels=True,
    digest_size=16,
)
fig_sp.show(renderer="vscode")

## 3. Implement Weisfeiler-Lehman with applications of Φ such that hashed feature sets are returned at every iteration.

The Weisfeiler–Lehman (WL) algorithm iteratively refines node labels by incorporating
the labels of neighboring nodes. Starting from initial node labels (e.g. atom types),
each WL iteration constructs a new label for every node based on its current label
and the multiset of its neighbors’ labels, followed by hashing.

This process enriches node labels with increasingly larger local neighborhood
information. After each iteration, feature maps Φ are applied to the relabeled graph
and the resulting hashed feature sets are stored. These WL-enhanced features capture
local substructures of the graph up to a given neighborhood depth and are later used
for kernel-based similarity computations.

→ WL iteratively enriches node labels with neighborhood information, capturing increasingly complex local graph structures.
→ WL = repeated relabeling of nodes based on their neighborhood (followed by hashing of the new labels).
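
The relabel-then-extract loop can be sketched as follows. This is a simplified stand-in, not the implementation behind `wl_feature_sets_per_iter`: the adjacency-dict encoding, the function names, the choice of `blake2b`, and the decision to prefix each feature with its iteration index are all assumptions.

```python
import hashlib


def _hash(text: str) -> str:
    return hashlib.blake2b(text.encode(), digest_size=16).hexdigest()


def wl_edge_feature_sets(labels, adj, h):
    """Return one set of hashed edge features for each WL iteration 0..h."""
    current = dict(labels)
    sets_per_iter = []
    for it in range(h + 1):
        # Apply an edge feature map to the graph with its current labels
        feats = set()
        for u in adj:
            for v in adj[u]:
                if u < v:  # visit each undirected edge once
                    a, b = sorted((current[u], current[v]))
                    feats.add(_hash(f"{it}|{a}|{b}"))
        sets_per_iter.append(feats)
        # WL step: new label = hash(own label + sorted neighbor labels)
        current = {
            node: _hash(lab + "(" + ",".join(sorted(current[n] for n in adj[node])) + ")")
            for node, lab in current.items()
        }
    return sets_per_iter


# Toy path graph C-C-C-C-O (hypothetical)
labels = {0: "C", 1: "C", 2: "C", 3: "C", 4: "O"}
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

toy_sets = wl_edge_feature_sets(labels, adj, h=2)
print([len(s) for s in toy_sets])  # [2, 4, 4]
```

At iteration 0 the three C-C edges collapse into one feature. After one relabeling step every edge has a distinct neighborhood signature, so the number of distinct features reaches the number of edges and cannot grow further.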

### WL-Enhanced DRF Feature Counts

This cell computes Differential Reaction Fingerprints combined with Weisfeiler–Lehman
(WL) relabeling for different base feature mappings (edge, vertex, and shortest paths).
The printed counts summarize the total amount of structural change captured when
neighborhood information up to WL depth h=3 is incorporated.

In [6]:
drf_wl_E = drf_wl(rsmi=rsmi, h=3, mode="edge")
drf_wl_V = drf_wl(rsmi=rsmi, h=3, mode="vertex")
drf_wl_SP = drf_wl(rsmi=rsmi, h=3, mode="sp", include_edge_labels_in_sp=True)
print(sum(drf_wl_E.values()), sum(drf_wl_V.values()), sum(drf_wl_SP.values()))

80 60 2520


### WL Feature Sets per Iteration

This cell computes feature sets for a single graph after successive
Weisfeiler–Lehman (WL) iterations.  
The printed values show how the number of extracted features grows with increasing
WL depth, illustrating how neighborhood information is progressively incorporated.

In [7]:

ed, pr = rsmi_to_graph(rsmi, drop_non_aam=False, use_index_as_atom_map=False)

sets_per_iter = wl_feature_sets_per_iter(ed, h=3, mode="edge")  # list of sets
print([len(s) for s in sets_per_iter])

[17, 32, 38, 40]


### WL–DRF Features per Iteration and Visualization

This cell computes WL-enhanced DRF features separately for each WL iteration and
prints their total counts.  
The accompanying visualization highlights, for each iteration, which graph elements
contribute to the DRF representation, illustrating how increasingly larger
neighborhood information influences the detected reaction changes.

Red graph elements indicate nodes or edges that contribute to the DRF representation, i.e. structural changes between reactants and products.
(Red = the reaction itself; gray = chemically irrelevant background.)

In [8]:

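# drf_wl returns a single Counter aggregated over all WL iterations (see In [6]),
# so it cannot be unpacked into per-iteration parts.
total_counts = drf_wl(rsmi=rsmi, h=3, mode="edge")

# Per-iteration view: set-based symmetric difference of the WL feature sets of
# reactants (ed) and products (pr) from In [7]. Multiplicities are ignored here,
# so these values need not add up to the multiset total.
ed_sets = wl_feature_sets_per_iter(ed, h=3, mode="edge")
pr_sets = wl_feature_sets_per_iter(pr, h=3, mode="edge")
print("per iteration (set-based):", [len(e ^ p) for e, p in zip(ed_sets, pr_sets)])
print("total (multiset):", sum(total_counts.values()))


fig = plot_wl_drf_iterations_from_rsmi(rsmi=rsmi, h=3, mode="edge", show_edge_labels=True)
fig.show()
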


### WL–DRF Feature Growth Across Iterations
This plot shows how the number and magnitude of DRF features evolve across WL iterations,
and whether successive WL relabeling steps keep adding structural information or the
representation saturates.

For this reaction, each WL iteration enriches the DRF representation further. No saturation is observed up to h=3, indicating that deeper neighborhood information contributes meaningfully to the reaction representation.

In [9]:
fig = plot_wl_drf_feature_growth_from_rsmi(rsmi, h=3, mode="edge", show_cumulative=True)
fig.show()

## Pre-computing Feature Hash Sets

In this step, feature representations are pre-computed for all reactions in the dataset.
For each reaction, hashed feature sets are generated using the previously defined
DRF and WL–DRF transformations and stored for later use.

This pre-computation avoids repeated and expensive feature extraction during kernel
construction and enables efficient and reproducible experiments in subsequent work packages.
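
The caching idea can be sketched as follows. The on-disk format used by the `precompute_all_subsets_in_dir_*` functions is not shown in this notebook, so the choice of `pickle`, the file name, and both helper functions are assumptions made purely for illustration.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path


def save_features(features, path: Path) -> None:
    """Write a list of per-reaction feature Counters to disk."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(features, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_features(path: Path):
    """Read the feature Counters back without recomputing them."""
    with path.open("rb") as fh:
        return pickle.load(fh)


# Toy features for two reactions (hypothetical hashes and counts)
features = [Counter({"a1": 2, "b2": 1}), Counter({"b2": 3})]

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "precomputed_demo" / "subset_demo.pkl"
    save_features(features, target)
    restored = load_features(target)

print(restored == features)  # True
```

Storing the Counters rather than the reaction strings means later kernel computations only need to load and compare hash counts.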

### Pre-computing WL–DRF Feature Representations

Here, WL–DRF feature representations are pre-computed for all reactions in the
selected subsets (`subsets_big` and `subsets_small`) and written to disk, with one
output directory per mode (edge, vertex, sp).

`precompute_all_subsets_in_dir_drf_wl` stores not the reaction itself
but the change that constitutes it.

In [10]:
# for out_dir, mode in targets_big_drf:
#     out_dir.mkdir(parents=True, exist_ok=True)  
#     print(f"write to {out_dir} (mode={mode})")
#     precompute_all_subsets_in_dir_drf_wl(
#         subsets_dir=str(subsets_dir_big), 
#         out_dir=str(out_dir),
#         h=3,
#         mode=mode,  # "edge" | "vertex" | "sp"
#     )


In [11]:
# for out_dir, mode in targets_small_drf:
#     out_dir.mkdir(parents=True, exist_ok=True)  
#     print(f"write to {out_dir} (mode={mode})")
#     precompute_all_subsets_in_dir_drf_wl(
#         subsets_dir=str(subsets_dir_small), 
#         out_dir=str(out_dir),
#         h=3,
#         mode=mode,  # "edge" | "vertex" | "sp"
#     )

In [12]:
per_iter, total = its_wl_feature_sets(rsmi=rsmi, h=3, mode="edge")
print("Features per iteration:", [len(c) for c in per_iter])
print("Total unique over all iterations:", len(total))

# Final hash set S_G: the keys of the aggregated feature mapping
S_G = set(total.keys())
print("Final hashset S_G size:", len(S_G))

Features per iteration: [41, 63, 65, 65]
Total unique over all iterations: 234
Final hashset S_G size: 234


### Visualization of WL on ITS
This visualization shows the effect of the Weisfeiler–Lehman algorithm on the ITS graph.
The graph structure remains fixed, while node colors represent WL-refined labels at
successive iterations, illustrating how increasing neighborhood context is encoded.

In [13]:
fig = visualize_its_wl_iterations(rsmi, h=3)
fig.show()

Node colors represent WL labels.  
At iteration 0, nodes are colored by atom type only.  
At iteration 1, colors encode the immediate neighborhood of each atom.  
At iteration 2, colors further distinguish atoms based on the neighborhood of their neighbors,
illustrating how increasing structural context is incorporated.

### ITS–WL Feature Growth Across Iterations
This plot shows how the number of ITS–WL features evolves across WL iterations.
It illustrates how increasing neighborhood depth enriches the ITS representation
until the feature set saturates.

In [14]:
fig = plot_its_wl_feature_growth_from_rsmi(rsmi, h=3, mode="edge")
fig.show()

While the previous plot illustrates feature growth for a single reaction,
the following analysis aggregates ITS–WL feature counts across an entire subset.
Averaging over multiple reactions provides a more robust view of how the WL
representation typically evolves with increasing iteration depth.

### ITS–WL Feature Growth Across Iterations (Subset Average)

This plot shows the average ITS–WL feature count per WL iteration, aggregated over
all reactions in a subset (here the first 20 reactions of the dataset). It indicates
how structural context is progressively captured by the WL procedure and whether
feature growth saturates with increasing depth.
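
The aggregation behind such a plot can be sketched with the standard library. The per-reaction counts below are made-up numbers and `growth_summary` is a hypothetical helper; the plotting functions in `scripts.wp2.wp2_plots` may aggregate differently.

```python
from statistics import mean, stdev


def growth_summary(counts_per_reaction):
    """Return (mean, sample standard deviation) per WL iteration."""
    per_iteration = zip(*counts_per_reaction)  # transpose: one column per iteration
    return [
        (mean(col), stdev(col) if len(col) > 1 else 0.0)
        for col in per_iteration
    ]


# Feature counts for iterations 0..3 of three reactions (hypothetical values)
counts = [
    [4, 6, 7, 7],
    [2, 5, 5, 5],
    [3, 4, 6, 6],
]

summary = growth_summary(counts)
for it, (avg, sd) in enumerate(summary):
    print(f"iteration {it}: mean={avg:.2f}, std={sd:.2f}")
```

A flat mean between the last two iterations, as in this toy data, is the pattern one would read as saturation.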

In [15]:
subset_df = data.iloc[:20]
fig = plot_its_wl_feature_growth_subset(
    subset_df,   # e.g. a DataFrame from subsets_small
    h=3,
    mode="edge",
)
fig.show()

## Pre-computing ITS–WL Feature Representations

In this step, WL-based feature representations are pre-computed for the Imaginary
Transition State (ITS) graphs of all reactions in the selected subsets.
For each reaction, features are extracted across all Weisfeiler–Lehman iterations
and aggregated into a single representation, which is stored for later kernel-based
classification.

In [16]:
# for out_dir, mode in targets_small_its:
#     out_dir.mkdir(parents=True, exist_ok=True)  
#     print(f"write to {out_dir} (mode={mode})")
#     precompute_all_subsets_in_dir_its_wl(
#         subsets_dir=str(subsets_dir_small), 
#         out_dir=str(out_dir),
#         h=3,
#         mode=mode,  # "edge" | "vertex" | "sp"
#     )


In [17]:
# for out_dir, mode in targets_big_its:
#     out_dir.mkdir(parents=True, exist_ok=True)  
#     print(f"write to {out_dir} (mode={mode})")
#     precompute_all_subsets_in_dir_its_wl(
#         subsets_dir=str(subsets_dir_big), 
#         out_dir=str(out_dir),
#         h=3,
#         mode=mode,  # "edge" | "vertex" | "sp"
#     )


## Comparison of ITS–WL and DRF–WL Representations

The following visualizations compare two WL-based reaction representations.
The ITS–WL view applies Weisfeiler–Lehman relabeling to the Imaginary Transition State
graph, illustrating how structural context is refined across iterations.
In contrast, the DRF–WL visualization highlights graph elements that contribute to the
reaction difference between reactants and products, thereby localizing the reaction
center and its surrounding context.

ITS–WL encodes structural context,
DRF–WL localizes reaction-specific changes.
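
As a rough illustration of the ITS idea, the sketch below superimposes reactant and product edges keyed by atom-map pairs and labels each edge with its bond order before and after the reaction. The encoding and function names are assumptions, and the toy edges only loosely mirror the amine plus isocyanate example used above. How `its_wl_feature_sets` builds its ITS graph is not shown in this notebook.

```python
def its_edges(reactant_edges, product_edges):
    """Superimpose two edge dicts; a missing bond is recorded as order 0."""
    return {
        pair: (reactant_edges.get(pair, 0), product_edges.get(pair, 0))
        for pair in reactant_edges.keys() | product_edges.keys()
    }


def reaction_center(its):
    """Edges whose bond order changes between reactants and products."""
    return {pair for pair, (before, after) in its.items() if before != after}


# Toy edges keyed by (atom-map, atom-map) pairs, hydrogens omitted (hypothetical)
reactant_edges = {(1, 2): 1, (43, 44): 2, (42, 43): 2}
product_edges = {(1, 2): 1, (43, 44): 2, (42, 43): 1, (1, 43): 1}

its = its_edges(reactant_edges, product_edges)
center = reaction_center(its)
print(sorted(center))  # [(1, 43), (42, 43)]
```

The ITS keeps all four edges, including the two unchanged ones, whereas a difference-based view retains only the two edges in `center`.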

### ITS–WL vs DRF–WL Feature Growth Across Iterations

This plot compares how feature sets evolve across WL iterations for two reaction
representations: ITS–WL (single reaction graph) and DRF–WL (difference between
reactants and products). Aggregating over a subset provides a robust view of how
quickly structural context is captured and whether feature growth saturates.

In [18]:
subset_df = pd.read_csv(DATA_DIR / "subsets_small/subset_001.tsv", sep="\t")

fig = plot_feature_growth_subset_its_vs_drf(subset_df, h=3, mode="edge", show_errorbars=True)
fig.show()

The plot shows the average growth of WL-based feature sets across iterations for two
reaction representations: ITS–WL and DRF–WL.  
For ITS–WL, features are extracted from a single reaction graph, and their number
increases steadily as larger structural neighborhoods are incorporated.
In contrast, DRF–WL focuses on differences between reactants and products, resulting
in a more compact feature set that grows more slowly and saturates earlier.
This comparison illustrates that ITS–WL captures global structural context, whereas
DRF–WL emphasizes reaction-specific changes and their local environment.

ITS–WL grows because it describes structure.
DRF–WL grows more slowly because it describes only the change.