# WP3 — Kernel-based Classification (SVM)

This notebook implements kernel inner products on precomputed hashed feature sets and runs
SVM classification for DRF–WL and ITS–WL across different feature types (vertex/edge/shortest-path),
dataset sizes, numbers of classes, and train/test splits.

In [1]:
from wp2_functions import drf_wl_features_from_rsmi
from loader import load_precomputed_features

from wp3_kernel import (
    build_kernel_matrix_from_loaded,
    kernel_matrix_stats,
    kernel_multiset_intersection,
)
import numpy as np
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "vscode"

## 1) Paths to precomputed feature directories

We load precomputed feature representations (stored as `.pkl`) for:
- DRF–WL: reactant/product difference features
- ITS–WL: features from the ITS reaction graph

Each representation is available for three feature modes: vertex, edge, shortest-path.

### Load DRF–WL Features
Load precomputed DRF–WL feature sets and reaction class labels for kernel-based classification.

In [2]:
from loader import load_precomputed_features

DRF_DIRS = {
    "edge": "drf/precomputed_drf_edge",
    "vertex": "drf/precomputed_drf_vertex",
    "sp": "drf/precomputed_drf_sp",
}

# Load all feature sets
X_drf = {}
y_drf = {}

for mode, path in DRF_DIRS.items():
    X, y = load_precomputed_features(path, feature_key="drf_wl")
    X_drf[mode] = X
    y_drf[mode] = y

    print(f"\nLoaded DRF features ({mode})")
    print("Number of reactions:", len(X))
    print("Number of classes:", len(set(y)))


Loaded DRF features (edge)
Number of reactions: 50000
Number of classes: 50

Loaded DRF features (vertex)
Number of reactions: 50000
Number of classes: 50

Loaded DRF features (sp)
Number of reactions: 46140
Number of classes: 50


### Load ITS–WL Features
Load precomputed ITS–WL feature sets and reaction class labels derived from the ITS graph.

In [3]:
from loader import load_precomputed_features

ITS_DIRS = {
    "edge": "its/precomputed_its_edge",
    "vertex": "its/precomputed_its_vertex",
    "sp": "its/precomputed_its_sp",
}

# Load all feature sets
X_its = {}
y_its = {}

for mode, path in ITS_DIRS.items():
    X, y = load_precomputed_features(path, feature_key="its_wl")
    X_its[mode] = X
    y_its[mode] = y

    print(f"\nLoaded ITS features ({mode})")
    print("Number of reactions:", len(X))
    print("Number of classes:", len(set(y)))


Loaded ITS features (edge)
Number of reactions: 50000
Number of classes: 50

Loaded ITS features (vertex)
Number of reactions: 50000
Number of classes: 50

Loaded ITS features (sp)
Number of reactions: 50000
Number of classes: 50


The output confirms that all precomputed DRF–WL and ITS–WL feature representations
(edge, vertex, and shortest-path) were loaded successfully, each covering 50 reaction classes.
Every set contains 50,000 reactions except DRF–WL shortest-path, which contains 46,140,
so that one set is smaller than the others used for kernel computation and classification.

## 2) Kernel inner product on hash sets

The lab definition reduces all kernels to counting common elements of two hashed feature sets.
Given two reactions with feature hash sets \(S_G, S_H\), the kernel is:
\[
k(G,H) = |S_G \cap S_H|
\]

Our precomputed features are stored as Counters. For the required hashset kernel, we use the Counter keys.

A kernel is a function that measures how similar two reactions are.
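The two readings of "counting common elements" can be sketched on toy Counters. This is a minimal illustration, not the code in `wp3_kernel`: the helper names `kernel_set` and `kernel_multiset` and the toy hashes are hypothetical, and whether `kernel_multiset_intersection` behaves like the second function is an assumption based on its name.

```python
from collections import Counter


def kernel_set(a: Counter, b: Counter) -> int:
    """|S_G ∩ S_H|: number of distinct hashes that occur in both Counters."""
    return len(a.keys() & b.keys())


def kernel_multiset(a: Counter, b: Counter) -> int:
    """Multiset intersection size: sum of the minimum count of each shared hash."""
    # Counter & Counter keeps, per key, the smaller of the two counts.
    return sum((a & b).values())


# Toy feature multisets (hypothetical hashes, not taken from the dataset)
g = Counter({"h1": 2, "h2": 1, "h3": 1})
h = Counter({"h1": 3, "h3": 1, "h4": 5})

print(kernel_set(g, h))       # 2: h1 and h3 are shared
print(kernel_multiset(g, h))  # 3: min(2, 3) + min(1, 1)
```

The set variant ignores how often a hash occurs, while the multiset variant rewards repeated shared patterns. Both are symmetric, and both return the size of the feature (multi)set when a reaction is compared with itself.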

### Kernel sanity check

We verify that the multiset kernel produces meaningful similarities on the precomputed feature multisets.
The cell below runs on the ITS–WL edge features; set `X = X_drf[mode]` to repeat the check for DRF–WL.
Self-similarity \(k(x,x)\) is positive for every non-empty feature multiset, and different reactions can still share a non-zero overlap, indicating common patterns captured by the features.

In [4]:
mode = "edge"   # "edge" | "vertex" | "sp"
X = X_its[mode]  # or X_drf[mode]

# find the first pair with k > 0
for i in range(len(X)):
    if len(X[i]) == 0:
        continue
    for j in range(i + 1, len(X)):
        if len(X[j]) == 0:
            continue
        k = kernel_multiset_intersection(X[i], X[j])
        if k > 0:
            print("Found non-zero kernel at:", i, j, "value:", k)
            break
    else:
        continue
    break


Found non-zero kernel at: 0 1 value: 8



### Kernel Matrix Construction

To apply kernel-based classification, the pairwise similarities between all reactions are computed and stored in a kernel matrix. Each entry \(K_{ij}\) represents the multiset kernel value between reactions \(i\) and \(j\). This matrix serves as the direct input for training a Support Vector Machine with a precomputed kernel.
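As a rough sketch of what such a construction involves, the following fills a symmetric matrix from a list of Counters. It is not the implementation of `build_kernel_matrix_from_loaded`; the function name `gram_matrix` and the toy inputs are hypothetical, and the multiset-intersection kernel is assumed.

```python
from collections import Counter

import numpy as np


def gram_matrix(features):
    """Return the n x n matrix of pairwise multiset-intersection values."""
    n = len(features)
    K = np.zeros((n, n), dtype=float)
    for i in range(n):
        # The kernel is symmetric, so only the upper triangle is computed.
        for j in range(i, n):
            k = sum((features[i] & features[j]).values())
            K[i, j] = K[j, i] = k
    return K


# Toy feature multisets (characters stand in for feature hashes)
toy = [Counter("aabc"), Counter("abd"), Counter("xyz")]
K_toy = gram_matrix(toy)
print(K_toy)
```

The double loop costs O(n²) kernel evaluations, which is one reason to start with a small subset such as `n = 200` before scaling up.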

### DRF–WL Kernel Matrix (edge features)

This heatmap visualizes the kernel matrix computed using the DRF–WL edge kernel for a subset of reactions.
Each entry \(K_{ij}\) represents the multiset intersection between the DRF–WL feature representations of reaction \(i\) and reaction \(j\).

The bright diagonal indicates high self-similarity, as each reaction shares all its features with itself.
Most off-diagonal entries are close to zero, which reflects the sparsity of the DRF representation:  
DRF removes all static molecular structure and retains only features corresponding to reaction-specific changes.

Non-zero off-diagonal values highlight pairs of reactions that share similar bond-change patterns.
This confirms that the DRF–WL kernel captures meaningful similarities between reactions while remaining highly selective.

In [6]:
mode = "edge"   # "edge" | "vertex" | "sp"
n = 200

K_drf, y_small = build_kernel_matrix_from_loaded(
    X_drf, y_drf,
    mode=mode,
    n=n,
)

stats = kernel_matrix_stats(K_drf)
print("Kernel matrix stats:", stats)

fig = px.imshow(
    K_drf,
    title=f"Kernel Matrix Heatmap (DRF–WL {mode}, n={n})",
    aspect="auto",
)
fig.show()

Kernel matrix stats: {'n': 200.0, 'sym_max_abs': 0.0, 'diag_min': 24.0, 'diag_max': 94.0, 'nonzero_share': 0.1821, 'median': 0.0, 'mean': 0.826200008392334, 'max': 94.0}


**Figure (DRF–WL):** Kernel matrix heatmap computed using the DRF–WL edge kernel.
Each entry \(K_{ij}\) represents the multiset intersection between the DRF–WL feature representations of reactions \(i\) and \(j\).
The diagonal indicates self-similarity, while off-diagonal values are mostly close to zero.
This sparsity reflects the DRF representation, which removes static molecular structure and retains only reaction-specific changes.
Non-zero off-diagonal entries therefore highlight reactions with similar bond-change patterns.
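The sparsity described in the caption can be quantified with a few summary numbers. The sketch below is a hypothetical stand-in, not the source of `kernel_matrix_stats`: the name `describe_kernel` is invented here, and restricting the non-zero share to off-diagonal entries is an assumption, since the notebook does not show which entries `nonzero_share` covers.

```python
import numpy as np


def describe_kernel(K):
    """Summarize symmetry, diagonal range, and off-diagonal sparsity of K."""
    K = np.asarray(K, dtype=float)
    # Boolean mask that selects everything except the diagonal.
    off = K[~np.eye(K.shape[0], dtype=bool)]
    return {
        "n": K.shape[0],
        "sym_max_abs": float(np.abs(K - K.T).max()),  # 0.0 for a symmetric matrix
        "diag_min": float(np.diag(K).min()),
        "diag_max": float(np.diag(K).max()),
        "offdiag_nonzero_share": float(np.mean(off != 0)),
        "offdiag_mean": float(off.mean()),
    }


K_example = np.array([[4, 2, 0],
                      [2, 3, 0],
                      [0, 0, 3]])
summary = describe_kernel(K_example)
print(summary)
```

A `sym_max_abs` of zero and a strictly positive `diag_min` are quick checks that the matrix is a plausible kernel matrix before it is handed to a classifier.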

#### Error Handling

In [7]:
import pickle
from pathlib import Path

pkl = next(Path("drf/precomputed_drf_edge").glob("*.pkl"))
with open(pkl, "rb") as f:  # close the file handle after loading
    obj = pickle.load(f)

print("Keys:", obj.keys())
print("n_errors:", obj["meta"]["n_errors"])
print("First error:", obj["errors"][:1])
print("First feature:", type(obj["drf_wl"][0]), obj["drf_wl"][0])


Keys: dict_keys(['meta', 'rsmi', 'classes', 'drf_wl', 'errors'])
n_errors: 0
First error: []
First feature: <class 'collections.Counter'> Counter({'a82838b20364425c67fcb5c7e9afe41e': 2, '5450379d4b597bf1c7af1a3c9f693e38': 2, 'c6764b9ca50efd6d3e9fb6f852bc2f0e': 2, '6d313c7f7232721ae18a5bca00bc11ef': 2, '144e500ccedd25f71f204f17362141d5': 2, 'e18af7080f8f530277220e4e452e4eda': 2, '547f58cf21f27c8b82bd711df1b44914': 1, '602bd8e20c9c046a4919fa6bd48fa7d4': 1, '3c96fdc9d330460f21aeb28e07575879': 1, 'b8fb27b68fdd36b9df573dae55ce06d1': 1, '2454d79cc5ad08b5839b2412010649de': 1, '6aef668f83375a3f29b8d61aaa609776': 1, '5f0938fcffb698773cb194e4f3638bfb': 1, '7f4568e0d5321cd5e4f18b42c3851107': 1, 'b63332285bf4357676d3672defc787c5': 1, '38e9567c8f82ea78b720f222ad4bf422': 1, '01eb2445818b1fa0a2dbaa8579c50538': 1, 'de29cd00dc3c165e4e4fe0b3a05bb6a7': 1, 'd32b7afca00a4807e01bb9945ccf1495': 1, '10d1f4a56deacec06e10f22777ebabf7': 1, 'd5600fbfaca1c551987c9e9b650d6e22': 1, '26115cacbcfcb4da8df1b4ab23cf4b57'

In [8]:
import pickle
from pathlib import Path

DIR = Path("drf/precomputed_drf_edge")  # <- exactly the folder that gets loaded
pkl = sorted(DIR.glob("*.pkl"))[0]
print("Inspecting:", pkl)

with open(pkl, "rb") as f:
    obj = pickle.load(f)

print("Keys:", obj.keys())
print("Meta n_rows:", obj["meta"]["n_rows"])
print("Meta n_errors:", obj["meta"]["n_errors"])
print("First error (if any):", obj["errors"][:1])

# Now the most important part:
X = obj["drf_wl"]
empty = sum(1 for c in X if len(c) == 0)
print("Empty counters:", empty, "/", len(X))

# Look for an example
for i, c in enumerate(X):
    if len(c) > 0:
        print("First non-empty at idx:", i, "items:", len(c), "total:", sum(c.values()))
        print("Sample:", list(c.items())[:5])
        break
else:
    print("ALL COUNTERS ARE EMPTY in this PKL.")


Inspecting: drf/precomputed_drf_edge/subset_001.reaction_features_drf_wl_h3_edge.pkl
Keys: dict_keys(['meta', 'rsmi', 'classes', 'drf_wl', 'errors'])
Meta n_rows: 60
Meta n_errors: 0
First error (if any): []
Empty counters: 0 / 60
First non-empty at idx: 0 items: 52 total: 54
Sample: [('86673f02a9bba3113b35f611fee08fab', 1), ('e11f3902c40931c8135357648e383a14', 1), ('34be2e7bb02c994c02c30ecd3c73a525', 1), ('ba69450099be1228e55119b644917475', 2), ('119fcc403eca81c58aee0a625c395fe8', 1)]


In [9]:
import pickle
from pathlib import Path

pkl = sorted(Path("drf/precomputed_drf_edge").glob("*.pkl"))[0]
with open(pkl, "rb") as f:  # close the file handle after loading
    obj = pickle.load(f)

print("n_errors:", obj["meta"]["n_errors"])
print("empty:", sum(1 for c in obj["drf_wl"] if len(c)==0), "/", len(obj["drf_wl"]))
print("example total count:", sum(obj["drf_wl"][0].values()))

n_errors: 0
empty: 0 / 60
example total count: 54
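The checks above look at a single pickle file. A directory-wide version might look like the sketch below, which could also help locate why one feature set holds fewer reactions than another. The function name `audit_feature_dir` is hypothetical, and the code assumes every pickle has the layout printed above (a feature list under `feature_key` and an `errors` list).

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path


def audit_feature_dir(directory, feature_key):
    """Count files, rows, empty Counters, and logged errors across all pickles."""
    totals = {"files": 0, "rows": 0, "empty": 0, "errors": 0}
    for pkl in sorted(Path(directory).glob("*.pkl")):
        with open(pkl, "rb") as f:
            obj = pickle.load(f)
        feats = obj[feature_key]
        totals["files"] += 1
        totals["rows"] += len(feats)
        totals["empty"] += sum(1 for c in feats if len(c) == 0)
        totals["errors"] += len(obj.get("errors", []))
    return totals


# Demo on a throwaway pickle so the sketch runs without the real data
with tempfile.TemporaryDirectory() as tmp:
    demo = {"meta": {}, "drf_wl": [Counter(a=1), Counter()], "errors": []}
    with open(Path(tmp) / "subset_demo.pkl", "wb") as f:
        pickle.dump(demo, f)
    demo_totals = audit_feature_dir(tmp, "drf_wl")

print(demo_totals)
```

On the real data the call would be along the lines of `audit_feature_dir("drf/precomputed_drf_sp", "drf_wl")`.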


### ITS–WL Kernel Matrix (edge features)

This heatmap shows the kernel matrix computed using the ITS–WL edge kernel.
Here, reactions are represented by Weisfeiler–Lehman features extracted from the Imaginary Transition State (ITS) graph.

Compared to DRF–WL, the ITS–WL kernel produces a denser similarity structure.
This is expected, as the ITS graph encodes the full combined structure of reactants and products, including unchanged molecular context.

The diagonal again represents self-similarity, while the richer off-diagonal structure indicates that many reactions share common substructures.
As a result, ITS–WL captures broader structural similarity between reactions, not only the explicit reaction center.

In [10]:
mode = "edge"
n = 200

K_its, y_its_small = build_kernel_matrix_from_loaded(
    X_its, y_its,
    mode=mode,
    n=n,
)

print("ITS kernel matrix stats:", kernel_matrix_stats(K_its))

fig = px.imshow(
    K_its,
    title=f"Kernel Matrix Heatmap (ITS–WL {mode}, n={n})",
    aspect="auto",
)
fig.show()

ITS kernel matrix stats: {'n': 200.0, 'sym_max_abs': 0.0, 'diag_min': 32.0, 'diag_max': 212.0, 'nonzero_share': 0.9411, 'median': 8.0, 'mean': 9.979949951171875, 'max': 212.0}


**Figure (ITS–WL):** Kernel matrix heatmap computed using the ITS–WL edge kernel.
Each entry \(K_{ij}\) corresponds to the multiset intersection of Weisfeiler–Lehman features extracted from the Imaginary Transition State graphs.
Compared to DRF–WL, the ITS–WL kernel exhibits a denser similarity structure, as the ITS graph encodes the full molecular context of reactants and products.
Off-diagonal similarities reflect shared structural motifs beyond the reaction center.

### Comparison of DRF–WL and ITS–WL Kernel Matrices

The DRF–WL and ITS–WL kernel matrices reveal complementary notions of reaction similarity.
DRF–WL focuses exclusively on reaction-specific changes by computing the symmetric difference between reactant and product features.
As a result, the corresponding kernel matrix is sparse, with non-zero similarities only for reactions that share similar bond-change patterns.

In contrast, ITS–WL operates on the Imaginary Transition State graph, which encodes the full structural context of both reactants and products.
This leads to a denser kernel matrix, as reactions may share common substructures even if their reaction centers differ.

Consequently, DRF–WL provides a highly selective notion of similarity tailored to reaction mechanisms,
whereas ITS–WL captures broader structural resemblance between reactions.
Both representations are therefore suitable for different aspects of reaction classification.
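Either kernel matrix can be passed to an SVM with a precomputed kernel to see which representation classifies better. The sketch below shows the slicing this requires. It assumes scikit-learn is installed; `fit_and_score` and the toy block-structured kernel are hypothetical and not part of the notebook's modules.

```python
import numpy as np
from sklearn.svm import SVC


def fit_and_score(K, y, train_idx, test_idx, C=1.0):
    """Train an SVM on a precomputed kernel and return test accuracy."""
    y = np.asarray(y)
    clf = SVC(kernel="precomputed", C=C)
    # Training needs the (n_train x n_train) block of the kernel matrix.
    clf.fit(K[np.ix_(train_idx, train_idx)], y[train_idx])
    # Prediction needs kernel values between test rows and training columns.
    return clf.score(K[np.ix_(test_idx, train_idx)], y[test_idx])


# Toy kernel: two classes whose members only overlap within their class
K_block = np.zeros((6, 6))
K_block[:3, :3] = 2.0
K_block[3:, 3:] = 2.0
np.fill_diagonal(K_block, 5.0)
labels = [0, 0, 0, 1, 1, 1]

acc = fit_and_score(K_block, labels, train_idx=[0, 1, 3, 4], test_idx=[2, 5])
print("toy accuracy:", acc)
```

With the notebook's variables, the same function could take `K_drf` with `y_small` or `K_its` with `y_its_small`, together with index arrays from whatever train/test split is used.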

**Figure:** Kernel matrix heatmaps for DRF–WL (bottom) and ITS–WL (top) using edge-based Weisfeiler–Lehman features.
Each entry \(K_{ij}\) corresponds to the multiset intersection between the feature representations of reactions \(i\) and \(j\).
The diagonal indicates self-similarity, while off-diagonal values reflect shared structural or reaction-specific features.
DRF–WL produces a sparse kernel emphasizing reaction changes, whereas ITS–WL yields a denser kernel capturing overall structural similarity.

In [11]:
import numpy as np
import plotly.express as px

def upper_triangle_values(K):
    n = K.shape[0]
    return K[np.triu_indices(n, k=1)]

vals_drf = upper_triangle_values(K_drf)  # DRF kernel matrix
vals_its = upper_triangle_values(K_its)  # ITS kernel matrix

fig = px.histogram(
    x=[vals_drf, vals_its],
    labels={"value": "Kernel value", "variable": "Kernel"},
    nbins=50,
    opacity=0.6,
    title="Distribution of Kernel Values: DRF–WL vs ITS–WL",
)

fig.data[0].name = "DRF–WL"
fig.data[1].name = "ITS–WL"
fig.show()

**Figure:** Distribution of off-diagonal kernel values for DRF–WL and ITS–WL.
DRF–WL produces a highly sparse similarity distribution with many zero entries, reflecting its focus on reaction-specific changes.
In contrast, ITS–WL yields a broader distribution, capturing shared structural context between reactions.