In [1]:
from pathlib import Path
import sys

ROOT = next((p for p in [Path.cwd(), *Path.cwd().parents] if (p / "scripts").is_dir() or (p / "data").is_dir()), None)
if ROOT is None:
    raise RuntimeError("Repository root not found (expected a 'scripts' or 'data' folder).")
sys.path.insert(0, str(ROOT))

DATA_DIR = ROOT / "data"
DRF_DIRS_BIG = [(DATA_DIR / "drf_big" / f"precomputed_drf_{m}", m) for m in ("edge", "vertex", "sp")]
DRF_DIRS_SMALL = [(DATA_DIR / "drf_small" / f"precomputed_drf_{m}", m) for m in ("edge", "vertex", "sp")]
ITS_DIRS_BIG = [(DATA_DIR / "its_big" / f"precomputed_its_{m}", m) for m in ("edge", "vertex", "sp")]
ITS_DIRS_SMALL = [(DATA_DIR / "its_small" / f"precomputed_its_{m}", m) for m in ("edge", "vertex", "sp")]

In [2]:
import plotly.express as px

from scripts.wp3.wp3_loader import (
    load_precomputed_features
)

from scripts.wp3.wp3_kernel import (
    kernel_multiset_intersection,
    build_kernel_matrix_from_loaded,
    kernel_matrix_stats
)

# WP3 — Kernel-based Classification (SVM)

This notebook implements kernel inner products on precomputed hashed feature sets and runs
SVM classification for DRF–WL and ITS–WL across different feature types (vertex/edge/shortest-path),
dataset sizes, numbers of classes, and train/test splits.

## 1) Paths to precomputed feature directories

We load precomputed feature representations (stored as `.pkl`) for:
- DRF–WL: reactant/product difference features
- ITS–WL: features from the ITS reaction graph

Each representation is available for three feature modes: vertex, edge, shortest-path.

### Load DRF–WL Features
Load precomputed DRF–WL feature sets and reaction class labels for kernel-based classification.

In [3]:

X_drf, y_drf = {}, {}
for path, mode in DRF_DIRS_BIG:
    assert path.exists(), f"Path not found: {path}"
    X, y = load_precomputed_features(path, feature_key="drf_wl")
    X_drf[mode] = X
    y_drf[mode] = y
    print(f"\nLoaded DRF features ({mode}) from {path}")
    print("Number of reactions:", len(X))
    print("Number of classes:", len(set(y)))


Loaded DRF features (edge) from c:\Users\OPokrovskaja\Documents\praktikum_graphen\reaction-kernels\data\drf_big\precomputed_drf_edge
Number of reactions: 50000
Number of classes: 50

Loaded DRF features (vertex) from c:\Users\OPokrovskaja\Documents\praktikum_graphen\reaction-kernels\data\drf_big\precomputed_drf_vertex
Number of reactions: 50000
Number of classes: 50

Loaded DRF features (sp) from c:\Users\OPokrovskaja\Documents\praktikum_graphen\reaction-kernels\data\drf_big\precomputed_drf_sp
Number of reactions: 50000
Number of classes: 50


### Load ITS–WL Features
Load precomputed ITS–WL feature sets and reaction class labels derived from the ITS graph.

In [4]:
X_its = {}
y_its = {}
for path, mode in ITS_DIRS_BIG:
    assert path.exists(), f"Path not found: {path}"
    X, y = load_precomputed_features(path, feature_key="its_wl")
    X_its[mode] = X
    y_its[mode] = y
    print(f"\nLoaded ITS features ({mode}) from {path}")
    print("Number of reactions:", len(X))
    print("Number of classes:", len(set(y)))



Loaded ITS features (edge) from c:\Users\OPokrovskaja\Documents\praktikum_graphen\reaction-kernels\data\its_big\precomputed_its_edge
Number of reactions: 50000
Number of classes: 50

Loaded ITS features (vertex) from c:\Users\OPokrovskaja\Documents\praktikum_graphen\reaction-kernels\data\its_big\precomputed_its_vertex
Number of reactions: 50000
Number of classes: 50

Loaded ITS features (sp) from c:\Users\OPokrovskaja\Documents\praktikum_graphen\reaction-kernels\data\its_big\precomputed_its_sp
Number of reactions: 50000
Number of classes: 50


The outputs confirm that all precomputed DRF–WL and ITS–WL feature representations
(edge, vertex, and shortest-path) were loaded successfully. Each representation
contains the full dataset of 50,000 reactions across 50 reaction classes,
providing a consistent basis for kernel computation and classification.

## 2) Kernel inner product on hash sets

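A kernel is a function that measures how similar two reactions are.
The lab definition reduces all kernels to counting the common elements of two hashed feature sets.
Given two reactions with feature hash sets \(S_G, S_H\), the kernel is:
\[
k(G,H) = |S_G \cap S_H|
\]

Our precomputed features are stored as `Counter` objects, i.e. multisets that map each feature hash to its count. For the required hash-set kernel, we use only the Counter keys. The cells below call `kernel_multiset_intersection`, which, as its name indicates, operates on the multisets rather than on the key sets alone.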
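
The difference between the set and the multiset view can be illustrated with a minimal sketch. It assumes each reaction is a `collections.Counter` mapping feature hashes to counts; the names `set_kernel` and `multiset_kernel` and the toy hashes are hypothetical and need not match the implementation in `scripts.wp3.wp3_kernel`.

```python
from collections import Counter


def set_kernel(a: Counter, b: Counter) -> int:
    """|S_G ∩ S_H|: number of distinct feature hashes present in both reactions."""
    return len(a.keys() & b.keys())


def multiset_kernel(a: Counter, b: Counter) -> int:
    """Multiset intersection: sum over shared hashes of min(count_a, count_b)."""
    # Counter's & operator keeps, per key, the minimum of the two counts.
    return sum((a & b).values())


# Toy feature multisets (hash values are made up for illustration).
g = Counter({"h1": 2, "h2": 1, "h3": 4})
h = Counter({"h1": 1, "h3": 5, "h4": 2})

print(set_kernel(g, h))       # 2: h1 and h3 are shared
print(multiset_kernel(g, h))  # 5: min(2, 1) + min(4, 5)
```

Under these assumptions the self-similarity \(k(G,G)\) equals the number of distinct hashes for the set kernel and the total feature count for the multiset kernel, so it is zero only for a reaction without features.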

### Kernel sanity check

We verify that the multiset kernel produces meaningful similarities on the precomputed feature multisets. The cell below runs on the ITS–WL features; setting `X = X_drf[mode]` repeats the check for DRF–WL.
Self-similarity \(k(x,x)\) is positive for every reaction with a non-empty feature multiset, and different reactions can still share a non-zero overlap, indicating common patterns captured by the representation.

In [5]:
mode = "edge"   # "edge" | "vertex" | "sp"
X = X_its[mode]  # or X_drf[mode]

# find the first pair of non-empty reactions with k > 0
for i in range(len(X)):
    if len(X[i]) == 0:
        continue
    for j in range(i + 1, len(X)):
        if len(X[j]) == 0:
            continue
        k = kernel_multiset_intersection(X[i], X[j])
        if k > 0:
            print("Found non-zero kernel at:", i, j, "value:", k)
            break
    else:
        continue
    break


Found non-zero kernel at: 0 1 value: 10



### Kernel Matrix Construction

To apply kernel-based classification, the pairwise similarities between all reactions are computed and stored in a kernel matrix. Each entry \(K_{ij}\) represents the multiset kernel value between reactions \(i\) and \(j\). This matrix serves as the direct input for training a Support Vector Machine with a precomputed kernel.
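
As a rough sketch of this step, the following example builds a kernel matrix from toy `Counter` features and passes it to scikit-learn's `SVC` with `kernel="precomputed"`. The helper `gram_matrix`, the toy data, and the choice `C=1.0` are assumptions for illustration; the notebook's own `build_kernel_matrix_from_loaded` may work differently.

```python
from collections import Counter

import numpy as np
from sklearn.svm import SVC


def multiset_kernel(a: Counter, b: Counter) -> int:
    """Sum over shared hashes of min(count_a, count_b)."""
    return sum((a & b).values())


def gram_matrix(rows, cols) -> np.ndarray:
    """K[i, j] = k(rows[i], cols[j]) for two lists of Counters (hypothetical helper)."""
    K = np.zeros((len(rows), len(cols)), dtype=np.float64)
    for i, a in enumerate(rows):
        for j, b in enumerate(cols):
            K[i, j] = multiset_kernel(a, b)
    return K


# Toy reactions: class 0 shares hashes a/b, class 1 shares hashes c/d.
X_train = [Counter(a=3, b=1), Counter(a=2, b=2), Counter(c=4), Counter(c=2, d=1)]
y_train = [0, 0, 1, 1]
X_test = [Counter(a=1, b=1), Counter(c=3, d=2)]

K_train = gram_matrix(X_train, X_train)  # shape (n_train, n_train)
K_test = gram_matrix(X_test, X_train)    # shape (n_test, n_train)

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(K_train, y_train)
pred = clf.predict(K_test)
print(pred)
```

With a precomputed kernel, the test matrix must hold the similarities between test and training reactions (rows are test samples, columns are training samples), not similarities among the test reactions themselves.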

### DRF–WL Kernel Matrix (edge features)

This heatmap visualizes the kernel matrix computed using the DRF–WL edge kernel for a subset of reactions.
Each entry \(K_{ij}\) represents the multiset intersection between the DRF–WL feature representations of reaction \(i\) and reaction \(j\).

The bright diagonal indicates high self-similarity, as each reaction shares all its features with itself.
Most off-diagonal entries are exactly zero, which reflects the sparsity of the DRF representation:  
DRF removes all static molecular structure and retains only features corresponding to reaction-specific changes.

Non-zero off-diagonal values highlight pairs of reactions that share similar bond-change patterns.
This confirms that the DRF–WL kernel captures meaningful similarities between reactions while remaining highly selective.

In [7]:
mode = "edge"   # "edge" | "vertex" | "sp"
n = 200

K_drf, y_big = build_kernel_matrix_from_loaded(
    X_drf, y_drf,
    mode=mode,
    n=n,
)

stats = kernel_matrix_stats(K_drf)
print("Kernel matrix stats:", stats)

fig = px.imshow(
    K_drf,
    title=f"Kernel Matrix Heatmap (DRF–WL {mode}, n={n})",
    aspect="auto",
)
fig.show()

Kernel matrix stats: {'n': 200.0, 'sym_max_abs': 0.0, 'diag_min': 0.0, 'diag_max': 110.0, 'nonzero_share': 0.24645, 'median': 0.0, 'mean': 1.0637999773025513, 'max': 110.0}


**Figure (DRF–WL):** Kernel matrix heatmap computed using the DRF–WL edge kernel.
Each entry \(K_{ij}\) represents the multiset intersection between the DRF–WL feature representations of reactions \(i\) and \(j\).
The diagonal indicates self-similarity (up to 110 in this subset), while about 75% of all entries are exactly zero (`nonzero_share` ≈ 0.25, median 0).
This sparsity reflects the DRF representation, which removes static molecular structure and retains only reaction-specific changes.
Non-zero off-diagonal entries therefore highlight reactions with similar bond-change patterns.
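
The summary statistics printed above can be reproduced along the following lines. This is only a sketch: `matrix_stats` is a hypothetical stand-in for `kernel_matrix_stats`, and the exact definitions (for example, whether `nonzero_share` includes the diagonal) are assumptions.

```python
import numpy as np


def matrix_stats(K) -> dict:
    """Summary statistics for a square kernel matrix (hypothetical helper)."""
    K = np.asarray(K, dtype=np.float64)
    diag = np.diag(K)
    return {
        "n": K.shape[0],
        # Largest asymmetry |K_ij - K_ji|; 0.0 for a symmetric matrix.
        "sym_max_abs": float(np.max(np.abs(K - K.T))),
        "diag_min": float(diag.min()),
        "diag_max": float(diag.max()),
        # Share of non-zero entries, diagonal included (assumption).
        "nonzero_share": float(np.count_nonzero(K) / K.size),
        "median": float(np.median(K)),
        "mean": float(K.mean()),
        "max": float(K.max()),
    }


# Toy matrix: two similar reactions and one reaction without any features.
K_toy = np.array([
    [4.0, 3.0, 0.0],
    [3.0, 4.0, 0.0],
    [0.0, 0.0, 0.0],
])
stats_toy = matrix_stats(K_toy)
print(stats_toy)
```

In this toy case `diag_min` is 0 because the third reaction has no features at all; the same reading would explain the `diag_min` of 0.0 reported for the DRF–WL matrix.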

### ITS–WL Kernel Matrix (edge features)

This heatmap shows the kernel matrix computed using the ITS–WL edge kernel.
Here, reactions are represented by Weisfeiler–Lehman features extracted from the Imaginary Transition State (ITS) graph.

Compared to DRF–WL, the ITS–WL kernel produces a denser similarity structure.
This is expected, as the ITS graph encodes the full combined structure of reactants and products, including unchanged molecular context.

The diagonal again represents self-similarity, while the richer off-diagonal structure indicates that many reactions share common substructures.
As a result, ITS–WL captures broader structural similarity between reactions, not only the explicit reaction center.

In [None]:
mode = "edge"
n = 200

K_its, y_its_small = build_kernel_matrix_from_loaded(
    X_its, y_its,
    mode=mode,
    n=n,
)

print("ITS kernel matrix stats:", kernel_matrix_stats(K_its))

fig = px.imshow(
    K_its,
    title=f"Kernel Matrix Heatmap (ITS–WL {mode}, n={n})",
    aspect="auto",
)
fig.show()

**Figure (ITS–WL):** Kernel matrix heatmap computed using the ITS–WL edge kernel.
Each entry \(K_{ij}\) corresponds to the multiset intersection of Weisfeiler–Lehman features extracted from the Imaginary Transition State graphs.
Compared to DRF–WL, the ITS–WL kernel exhibits a denser similarity structure, as the ITS graph encodes the full molecular context of reactants and products.
Off-diagonal similarities reflect shared structural motifs beyond the reaction center.