# WP3 — Kernel-based Classification (SVM)

This notebook implements kernel inner products on precomputed hashed feature sets and runs
SVM classification for DRF–WL and ITS–WL across different feature types (vertex/edge/shortest-path),
dataset sizes, numbers of classes, and train/test splits.

In [1]:
from loader import (
    load_precomputed_features
)

from wp3_kernel import kernel_multiset_intersection

import pandas as pd
from vis_utils import plot_drf_from_counters_rsmi
import plotly.io as pio
from synkit.IO import rsmi_to_graph
pio.renderers.default = "vscode"

## 1) Paths to precomputed feature directories

We load precomputed feature representations (stored as `.pkl`) for:
- DRF–WL: reactant/product difference features
- ITS–WL: features from the ITS reaction graph

Each representation is available for three feature modes: vertex, edge, shortest-path.

### Load DRF–WL Features
Load precomputed DRF–WL feature sets and reaction class labels for kernel-based classification.

In [9]:
from loader import load_precomputed_features

DRF_DIRS = {
    "edge": "drf/precomputed_drf_edge",
    "vertex": "drf/precomputed_drf_vertex",
    "sp": "drf/precomputed_drf_sp",
}

# Alle Feature-Sets laden
X_drf = {}
y_drf = {}

for mode, path in DRF_DIRS.items():
    X, y = load_precomputed_features(path, feature_key="drf_wl")
    X_drf[mode] = X
    y_drf[mode] = y

    print(f"\nLoaded DRF features ({mode})")
    print("Number of reactions:", len(X))
    print("Number of classes:", len(set(y)))


Loaded DRF features (edge)
Number of reactions: 50000
Number of classes: 50

Loaded DRF features (vertex)
Number of reactions: 50000
Number of classes: 50

Loaded DRF features (sp)
Number of reactions: 50000
Number of classes: 50


### Load ITS–WL Features
Load precomputed ITS–WL feature sets and reaction class labels derived from the ITS graph.

In [10]:
from loader import load_precomputed_features

DRF_DIRS = {
    "edge": "its/precomputed_its_edge",
    "vertex": "its/precomputed_its_vertex",
    "sp": "its/precomputed_its_sp",
}

# Alle Feature-Sets laden
X_drf = {}
y_drf = {}

for mode, path in DRF_DIRS.items():
    X, y = load_precomputed_features(path, feature_key="its_wl")
    X_drf[mode] = X
    y_drf[mode] = y

    print(f"\nLoaded ITS features ({mode})")
    print("Number of reactions:", len(X))
    print("Number of classes:", len(set(y)))


Loaded ITS features (edge)
Number of reactions: 50000
Number of classes: 50

Loaded ITS features (vertex)
Number of reactions: 50000
Number of classes: 50

Loaded ITS features (sp)
Number of reactions: 50000
Number of classes: 50


The output confirms that all precomputed DRF–WL feature representations
(edge, vertex, and shortest-path) were loaded successfully. Each representation
contains the full dataset of 50,000 reactions across 50 reaction classes,
providing a consistent basis for kernel computation and classification.

## 2) Kernel inner product on hash sets

The lab definition reduces all kernels to counting common elements of two hashed feature sets.
Given two reactions with feature hash sets \(S_G, S_H\), the kernel is:
\[
k(G,H) = |S_G \cap S_H|
\]

Our precomputed features are stored as Counters. For the required hashset kernel, we use the Counter keys.

Ein Kernel ist eine Funktion, die sagt, wie ähnlich zwei Reaktionen sind.

### Kernel sanity check (DRF–WL)

We verify that the multiset kernel produces meaningful similarities on the precomputed DRF–WL feature multisets.  
Self-similarity \(k(x,x)\) is clearly positive, and different reactions can still share a non-zero overlap, indicating common reaction-change patterns captured by DRF–WL.

In [11]:
from wp3_kernel import kernel_multiset_intersection

# Wähle einen Modus zum Testen
mode = "edge"
X = X_drf[mode]

# Sanity check
k00 = kernel_multiset_intersection(X[0], X[0])
k01 = kernel_multiset_intersection(X[0], X[1])

print(f"\nKernel sanity check (DRF–WL, mode={mode})")
print("k(0,0) =", k00)
print("k(0,1) =", k01)


Kernel sanity check (DRF–WL, mode=edge)
k(0,0) = 104
k(0,1) = 8


In [13]:
# Finde ein nicht-leeres Paar
for i in range(len(X)):
    if len(X[i]) == 0:
        continue
    for j in range(i+1, len(X)):
        if len(X[j]) == 0:
            continue
        k = kernel_multiset_intersection(X[i], X[j])
        if k > 0:
            print("Found non-zero kernel at:", i, j, "value:", k)
            break
    else:
        continue
    break

Found non-zero kernel at: 0 1 value: 8
