# PRECONFIGURATION

In [1]:
from pathlib import Path
import sys
import pandas as pd

ROOT = next((p for p in [Path.cwd(), *Path.cwd().parents] if (p / "scripts").is_dir() or (p / "data").is_dir()), None)
if ROOT is None:
    raise RuntimeError("Repo root not found (expected folder 'scripts' or 'data').")
sys.path.insert(0, str(ROOT))
DATA_DIR = ROOT / "data"

# Input and output directories under DATA_DIR
subsets_dir_big = DATA_DIR / "subsets_big"
subsets_dir_small = DATA_DIR / "subsets_small"
assert subsets_dir_big.exists(), f"Input dir missing: {subsets_dir_big}"
assert subsets_dir_small.exists(), f"Input dir missing: {subsets_dir_small}"


In [2]:
from scripts.wp2.wp2_functions import (
    drf, its_wl_feature_sets
)
from scripts.wp2.wp2_plots import (
    plot_drf_from_counters_rsmi,
    plot_wl_drf_iterations_from_rsmi,
    visualize_its_wl_iterations,
    plot_feature_growth_subset_its_vs_drf
)

In [3]:
# one rsmi for testing

path = DATA_DIR / "schneider50k_clean.tsv"
data = pd.read_csv(path, sep="\t")
rsmi = data["clean_rxn"].iloc[1]

drf_E = drf(rsmi=rsmi, mode="edge")
drf_V = drf(rsmi=rsmi, mode="vertex")
drf_SP = drf(rsmi=rsmi, mode="sp", include_edge_labels_in_sp=True)

fig_drf_edge = plot_drf_from_counters_rsmi(
    rsmi,
    drf_counter=drf_E,
    mode="edge",
    hash_labels=True,              
    digest_size=16,                 
    show_edge_labels=True,
)


fig_drf_vertex = plot_drf_from_counters_rsmi(
    rsmi,
    drf_counter=drf_V,
    mode="vertex",
    hash_labels=True,
    digest_size=16,
)


fig_drf_sp = plot_drf_from_counters_rsmi(
    rsmi,
    drf_counter=drf_SP,
    mode="sp",
    include_edge_labels_in_sp=True, 
    hash_labels=True,
    digest_size=16,
)



fig_edge_wl = plot_wl_drf_iterations_from_rsmi(rsmi=rsmi, h=3, mode="edge", show_edge_labels=True)
fig_vertex_wl = plot_wl_drf_iterations_from_rsmi(rsmi=rsmi, h=3, mode="vertex", show_edge_labels=True)
fig_sp_wl = plot_wl_drf_iterations_from_rsmi(rsmi=rsmi, h=3, mode="sp", include_edge_labels_in_sp=True)


fig_its_wl = visualize_its_wl_iterations(rsmi, h=3)


subset_df = pd.read_csv(subsets_dir_small / "subset_001.tsv", sep="\t")

fig_feature_growth_edge = plot_feature_growth_subset_its_vs_drf(subset_df, h=3, mode="edge", show_errorbars=True)
fig_feature_growth_vertex = plot_feature_growth_subset_its_vs_drf(subset_df, h=3, mode="vertex", show_errorbars=True)
fig_feature_growth_sp = plot_feature_growth_subset_its_vs_drf(subset_df, h=3, mode="sp", show_errorbars=True)


# PRESENTATION START

# 1. Dataset Overview

- 50 classes
- 50,000 reactions
- 1,000 reactions per class

# 1.1. Subset Split

## 2 variations:

- ### Subset Small: all 50k reactions are split into subsets, each containing:
    - at most 3 classes with 20 reactions each
    - result: 833 subsets with 3 classes, 1 subset with 1 class
    - e.g.:
    
    Subset 1: Classes=['1.7.4', '3.4.1', '7.9.2'],\
    Counts={'7.9.2': 20, '1.7.4': 20, '3.4.1': 20}

- ### Subset Big: all 50k reactions are split into subsets, each containing:
    - at most 5 classes with 200 reactions each
    - result: 49 subsets with 5 classes, 1 subset with 2 classes, 3 subsets with 1 class
    - e.g.: 
    
    Subset 1: Classes=['1.2.1', '1.3.6', '10.4.2', '6.2.1', '6.3.7'],\
    Counts={'6.2.1': 200, '1.3.6': 200, '10.4.2': 200, '6.3.7': 200, '1.2.1': 200}
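
The subset counts above can be cross-checked with simple arithmetic. The sketch below only counts single-class chunks (blocks of 20 or 200 reactions from one class) and compares them with the reported subset sizes; it does not reproduce the actual split procedure, which is not shown here, and the helper name `n_chunks` is made up for illustration.

```python
# Sanity check of the subset counts quoted above (pure arithmetic).
n_classes = 50
reactions_per_class = 1000


def n_chunks(chunk_size):
    """Number of single-class chunks of `chunk_size` reactions over all classes."""
    return n_classes * (reactions_per_class // chunk_size)


# Subset Small: 20 reactions per class chunk, at most 3 chunks per subset.
small_chunks = n_chunks(20)
small_from_subsets = 833 * 3 + 1 * 1

# Subset Big: 200 reactions per class chunk, at most 5 chunks per subset.
big_chunks = n_chunks(200)
big_from_subsets = 49 * 5 + 1 * 2 + 3 * 1

print(small_chunks, small_from_subsets)  # 2500 2500
print(big_chunks, big_from_subsets)      # 250 250
```

Both variations therefore cover every reaction exactly once.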

# 2. DRF Transformation Function


### Φ feature transformations

- **Vertex**:
    - one feature per atom: atom label (or hashed label)
    - Example (raw): ["C", "O", "N"] → hashed: ["a1f3...", "b2c4...", ...]

    - *What it does:*
        - *Look at each atom in the molecule*
        - *Get the atom label (for example: "C", "O", "N")*
        

- **Edge**:
    - one feature per bond: canonical triplet "nodeLabelA | bondLabel | nodeLabelB"
    - canonicalization sorts the two node labels so that C–O and O–C map to the same token
    - Example (raw): ["C|-|O", "C|=|O", "H|-|N"]

    - *What it does:*
        - *For each bond between two atoms u and v*:
            - *Read the label of u and of v (e.g. "C" and "O") and the bond type (e.g. "-" or "=")*
            - *Make one string that combines them: "C|-|O"*
            - *If canonicalize=True, sort the two atom labels so order doesn’t matter (both "C|-|O" and "O|-|C" become "C|-|O")*

        - *example:*
            - *Bonds: C—O (single), C=O (double), C—H*
            - *phi_edge_list (raw, canonical) → ["C|-|O", "C|=|O", "C|-|H"]*
            - *Hashing → ["e1","e2","e3"]*
        - *Why canonicalize? So the same chemical bond gives the same feature regardless of node ordering.*


- **Shortest Path**:
    - one feature per unordered node pair: sequence of node/edge labels along shortest path; direction canonicalized (min of forward/reverse)
    - captures neighborhood structure beyond immediate bonds
    - Example (raw): ["C|-|O|-|H", "N|-|C|=|O"]

    - *What it does:*

        - *Path Labeling:*
            - *start with the label of the first node*
            - *for each step along the path, optionally append the edge label, then the next node label*
            - *join everything with "|" to make one string*
            
            - *Example path nodes [A, B, C] with labels A="C", edge(A,B)="-", B="O", edge(B,C)="-", C="H":*
                - *forward label = "C|-|O|-|H"*
                - *reverse label = "H|-|O|-|C"*
                - *choose the smaller (lexicographically) of forward/reverse to be canonical (so direction doesn't matter)*

        - *Shortest Path Computation*:
            - *compute shortest paths between all pairs of nodes*
            - *for each pair, build the canonical path label (as above)*

        - *Result: one feature for each unordered node pair, describing the path between them*
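
The three Φ transformations described above can be sketched in plain Python. This is a minimal illustration, not the code in `scripts.wp2.wp2_functions`: the toy graph, the function names (`phi_vertex`, `phi_edge`, `phi_sp`, `hash_label`) and the choice of BLAKE2b for hashing are assumptions. Only `digest_size=16` is taken from the plotting calls earlier in this notebook. When several shortest paths exist between two nodes, this sketch simply keeps the one found by BFS.

```python
from collections import deque
import hashlib

# Toy graph (hypothetical, not from the dataset): O=C-O-H skeleton.
labels = {0: "C", 1: "O", 2: "O", 3: "H"}
edges = [(0, 1, "="), (0, 2, "-"), (2, 3, "-")]


def phi_vertex(labels):
    """One feature per atom: its label."""
    return [labels[n] for n in sorted(labels)]


def phi_edge(labels, edges, canonicalize=True):
    """One feature per bond: 'labelA|bond|labelB', node labels sorted if canonicalize."""
    feats = []
    for u, v, bond in edges:
        a, b = labels[u], labels[v]
        if canonicalize:
            a, b = sorted((a, b))
        feats.append(f"{a}|{bond}|{b}")
    return feats


def _bfs_parents(adj, src):
    """BFS from src; returns {node: parent} for every reachable node."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent


def phi_sp(labels, edges, include_edge_labels=True):
    """One feature per unordered, connected node pair: canonical shortest-path label."""
    adj = {n: {} for n in labels}
    for u, v, bond in edges:
        adj[u][v] = bond
        adj[v][u] = bond
    feats = []
    nodes = sorted(labels)
    for i, src in enumerate(nodes):
        parent = _bfs_parents(adj, src)
        for dst in nodes[i + 1:]:
            if dst not in parent:
                continue  # disconnected pair: no path, no feature
            path = [dst]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            # The path runs dst -> src; direction is irrelevant after canonicalization.
            tokens = [labels[path[0]]]
            for a, b in zip(path, path[1:]):
                if include_edge_labels:
                    tokens.append(adj[a][b])
                tokens.append(labels[b])
            forward = "|".join(tokens)
            reverse = "|".join(reversed(tokens))
            feats.append(min(forward, reverse))  # lexicographically smaller direction
    return feats


def hash_label(label, digest_size=16):
    """Fixed-length hex token for a raw label (BLAKE2b is an assumption)."""
    return hashlib.blake2b(label.encode("utf-8"), digest_size=digest_size).hexdigest()


print(phi_vertex(labels))       # ['C', 'O', 'O', 'H']
print(phi_edge(labels, edges))  # ['C|=|O', 'C|-|O', 'H|-|O']
print(phi_sp(labels, edges))
print(hash_label("C|-|O"))
```

For a connected graph with n atoms, `phi_sp` returns n(n−1)/2 features (6 for the 4-atom toy graph), whereas `phi_edge` returns one per bond.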




### DRF (Differential Reaction Fingerprint)

- compute feature multisets separately for educts Φ(E) and products Φ(P).
- Reaction fingerprint = multiset symmetric difference:
    - Φ_reaction = Φ(E) Δ Φ(P) (for each feature f: count = |count_E(f) − count_P(f)|)
    - removes unchanged features; keeps only created/destroyed/changed ones (reaction center)

- Example:

    - Edges (educt): ["C|-|O", "C|-|O", "C|-|H"]
    - Edges (product): ["C|=|O", "C|-|H"]

    - Counters:
        - E: {"C|-|O":2, "C|-|H":1}
        - P: {"C|=|O":1, "C|-|H":1}
    
    - DRF (symmetric diff): {"C|-|O":2, "C|=|O":1}
        - two C–O single bonds removed, one C=O formed
        - "C|-|H" occurs once on each side, so it cancels and does not appear in the DRF
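
The worked example above can be reproduced with `collections.Counter`. This is a minimal sketch of the multiset symmetric difference; the helper name `multiset_symmetric_difference` is made up for illustration and is not necessarily how `drf` is implemented in `scripts.wp2.wp2_functions`.

```python
from collections import Counter


def multiset_symmetric_difference(c_e, c_p):
    """Per feature |count_E - count_P|; features with equal counts are dropped."""
    diff = Counter()
    for feature in c_e.keys() | c_p.keys():
        delta = abs(c_e[feature] - c_p[feature])  # a missing key counts as 0
        if delta:
            diff[feature] = delta
    return diff


educt_edges = ["C|-|O", "C|-|O", "C|-|H"]
product_edges = ["C|=|O", "C|-|H"]

drf_example = multiset_symmetric_difference(Counter(educt_edges), Counter(product_edges))
print(dict(drf_example))
```

An equivalent one-liner is `(c_e - c_p) + (c_p - c_e)`, because `Counter` subtraction keeps only positive counts.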

### 2.1. DRF Visualisation: Base Feature Mappings

(edge, vertex, and shortest-path features)\
Highlighted graph elements indicate structural differences between reactants and
products that contribute to the DRF representation.

In [4]:
fig_drf_edge.show(renderer="vscode")

In [5]:
fig_drf_vertex.show(renderer="vscode")

In [6]:
fig_drf_sp.show(renderer="vscode")

### 2.2. DRF WL

- for a reaction (educt graph E, product graph P), WL label refinement runs for h iterations on both graphs
- at each iteration i, features Φ_i(E) and Φ_i(P) are built from the WL node labels (vertex / edge / shortest-path features)
- the symmetric multiset difference is computed per iteration: Δ_i = Φ_i(E) Δ Φ_i(P)
    - returns
        - per_iter = [Δ_0, Δ_1, ..., Δ_h]
        - total = Σ_i Δ_i (a single Counter)

- drf_wl = sum of the per-iteration differences → a multi-scale DRF that captures changes at increasing neighborhood radii
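
The per-iteration scheme above can be sketched for vertex features as follows. Everything here is illustrative: the function names, the toy graphs, and the WL update rule (hash of a node's previous label plus its sorted neighbor labels, ignoring bond labels) are assumptions, not the project's implementation.

```python
from collections import Counter
import hashlib


def wl_label_sequence(labels, adj, h):
    """Return [L_0, ..., L_h]; L_{i+1}[u] hashes L_i[u] with u's sorted neighbor labels."""
    seq = [dict(labels)]
    for _ in range(h):
        prev, nxt = seq[-1], {}
        for u in prev:
            signature = prev[u] + "(" + ",".join(sorted(prev[v] for v in adj[u])) + ")"
            nxt[u] = hashlib.blake2b(signature.encode(), digest_size=8).hexdigest()
        seq.append(nxt)
    return seq


def sym_diff(c_e, c_p):
    """Multiset symmetric difference: |count_E - count_P| per feature."""
    return (c_e - c_p) + (c_p - c_e)


def drf_wl_vertex(educt, product, h):
    """Vertex-mode DRF-WL sketch: per-iteration differences and their sum."""
    seq_e = wl_label_sequence(*educt, h)
    seq_p = wl_label_sequence(*product, h)
    per_iter = [
        sym_diff(Counter(l_e.values()), Counter(l_p.values()))
        for l_e, l_p in zip(seq_e, seq_p)
    ]
    total = sum(per_iter, Counter())
    return per_iter, total


# Toy reaction (hypothetical): the O-H bond of a C-O-H chain is broken.
atom_labels = {0: "C", 1: "O", 2: "H"}
educt = (atom_labels, {0: [1], 1: [0, 2], 2: [1]})
product = (atom_labels, {0: [1], 1: [0], 2: []})

per_iter, total = drf_wl_vertex(educt, product, h=2)
print([sum(c.values()) for c in per_iter])  # [0, 4, 6]
print(sum(total.values()))                  # 10
```

At iteration 0 the atom multisets are identical, so Δ_0 is empty. At iteration 1 only O and H (the atoms of the broken bond) differ, and at iteration 2 the change has also reached the carbon, which is the growing radius mentioned above.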

### WL iterations: visualisation for one reaction (edge features)

In [7]:
fig_edge_wl.show(renderer="vscode")


### WL iterations: visualisation for one reaction (vertex features)

In [8]:
fig_vertex_wl.show(renderer="vscode")

### 2.3. ITS WL

ITS–WL (the ITS feature set with WL labels) describes each reaction with a single ITS graph and extracts features from that graph across WL iterations. Instead of an explicit before/after difference, it encodes the neighborhood context around the reaction center as patterns inside one graph.


- For each iteration i = 0..h:

    - Compute WL node labels L_i for the ITS graph (wl_label_sequence).
        - L0 = initial node labels (e.g., atom types or hashed atom labels).
        - Each next label L_{i+1}[u] encodes the previous label of u plus the multiset of neighbor labels (hashed).
    - build features from L_i according to the selected mode:
        - **vertex**: one feature per node = L_i[n] (prefixed/tagged).
        - **edge**: one feature per bond = canonical triplet using L_i[u], edgeLabel, L_i[v] (then hash).
        - **sp**: one feature per unordered node pair = canonical shortest‑path label using L_i for nodes (then hash).
    - convert the feature list into a Counter (counts per feature) → c_i.
    - append c_i to per_iter and add it to total.
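
The loop above can be sketched for the vertex and edge modes on a toy graph. This is a hedged illustration, not `its_wl_feature_sets` itself: the function name `its_wl_counters`, the edge labels (`"1>1"` unchanged, `"1>0"` broken, `"0>1"` formed), the iteration prefixes and the hash function are all assumptions. The shortest-path mode is left out for brevity.

```python
from collections import Counter
import hashlib


def _h(text):
    """Short hex hash of a string (BLAKE2b is an assumption)."""
    return hashlib.blake2b(text.encode(), digest_size=8).hexdigest()


def its_wl_counters(labels, edges, h, mode="vertex"):
    """Return (per_iter, total) feature Counters for one ITS-like graph."""
    adj = {n: [] for n in labels}
    for u, v, _ in edges:
        adj[u].append(v)
        adj[v].append(u)

    current = dict(labels)  # L_0: initial node labels
    per_iter, total = [], Counter()
    for i in range(h + 1):
        if mode == "vertex":
            feats = [f"v{i}:{current[n]}" for n in current]
        elif mode == "edge":
            feats = []
            for u, v, e in edges:
                a, b = sorted((current[u], current[v]))  # canonical order
                feats.append(_h(f"e{i}:{a}|{e}|{b}"))
        else:
            raise ValueError(f"unsupported mode in this sketch: {mode}")
        c_i = Counter(feats)
        per_iter.append(c_i)
        total += c_i
        # L_{i+1}: previous label plus sorted multiset of neighbor labels, hashed
        current = {
            u: _h(current[u] + "(" + ",".join(sorted(current[v] for v in adj[u])) + ")")
            for u in current
        }
    return per_iter, total


# Toy ITS-like graph (hypothetical labels).
its_labels = {0: "C", 1: "O", 2: "H", 3: "Cl"}
its_edges = [(0, 1, "1>1"), (1, 2, "1>0"), (2, 3, "0>1")]

per_iter_v, total_v = its_wl_counters(its_labels, its_edges, h=2, mode="vertex")
per_iter_e, total_e = its_wl_counters(its_labels, its_edges, h=2, mode="edge")
print(sum(total_v.values()), sum(total_e.values()))  # 12 9
```

Unlike the DRF variant, nothing cancels here: every node (or bond) contributes one feature at every iteration, so the totals are (h + 1) times the node or bond count.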



Node colors represent WL labels.  
At iteration 0, nodes are colored by atom type only.  
At iteration 1, colors encode the immediate neighborhood of each atom.  
At iteration 2, colors further distinguish atoms based on the neighborhood of their neighbors,
illustrating how increasing structural context is incorporated.

In [9]:
fig_its_wl.show(renderer="vscode")

### 2.4. ITS–WL vs DRF–WL Feature Growth Across Iterations

This plot compares how feature sets evolve across WL iterations for two reaction
representations: ITS–WL (single reaction graph) and DRF–WL (difference between
reactants and products). Aggregating over a subset provides a robust view of how
quickly structural context is captured and whether feature growth saturates.

In [10]:
fig_feature_growth_edge.show(renderer="vscode")
fig_feature_growth_vertex.show(renderer="vscode")
fig_feature_growth_sp.show(renderer="vscode")

- ITS produces many more features at every iteration (richer context)
- DRF produces far fewer features because it keeps only the changes between reactants and products
- high heterogeneity in the dataset: some reactions produce many features, others few

- **Iteration effect**:
    - ITS tends to grow quickly and then saturate
    - DRF grows more slowly because only changed patterns accumulate

- **Mode differences (edge / vertex / sp)**
    - edge/vertex: moderate feature counts and variance.
    - sp (shortest-path): feature counts and variance are far larger, because path features scale roughly with the number of node pairs, i.e. quadratically in the number of atoms.
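
The scaling argument for the three modes can be made concrete with a back-of-the-envelope count. The sketch below gives the size of one feature multiset per graph and iteration, before any DRF cancellation; it assumes a connected graph, and setting the bond count roughly equal to the atom count is a crude assumption for illustration only.

```python
def features_per_graph(n_atoms, n_bonds):
    """Feature multiset size per mode for one connected graph (one iteration)."""
    return {
        "vertex": n_atoms,                      # one feature per atom
        "edge": n_bonds,                        # one feature per bond
        "sp": n_atoms * (n_atoms - 1) // 2,     # one feature per unordered node pair
    }


for n in (10, 20, 40):
    # n_bonds ~ n_atoms is a rough, hypothetical choice for organic molecules.
    print(n, features_per_graph(n, n_bonds=n))
```

Doubling the number of atoms doubles the vertex and edge counts but roughly quadruples the shortest-path count (45 → 190 → 780 here).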
