# Hybrid GNN-QUBO Alignment Pipeline

## Section 1: Load Project Dependencies
Set up the Python path for the project and import every module that the pipeline relies on.

In [1]:
from pathlib import Path
import sys
from types import SimpleNamespace
import webbrowser

import pandas as pd
from IPython.display import HTML, IFrame, display

repo_root = Path().resolve().parent
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))

from src.config import *
from src.evaluation.solvers import (
    solve_alignment_with_annealer,
    solve_alignment_with_nearest_neighbor,
)
from src.kg_construction.fetch_data import fetch_wiki_data, fetch_arxiv_data
from src.kg_construction.build_kg import build_unpruned_kgs, prune_kgs
from src.embedding.generate_embeddings import (
    generate_relation_embeddings,
    generate_entity_embeddings,
)
from src.utils.graph_visualizer import visualize_ttl

# Maintain backward-compatible variable name for existing cells
ALIGNED_ENTITIES_CSV = ALIGNED_ENTITIES_ANNEALER_CSV

pd.set_option("display.max_colwidth", None)

# Create directories if they don't exist
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
KG_DIR.mkdir(parents=True, exist_ok=True)
EMBEDDINGS_DIR.mkdir(parents=True, exist_ok=True)



## Section 2: Data Preparation
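Optionally fetch the raw Wikipedia and arXiv data, then build and prune the knowledge graphs. The `RUN_FETCH` and `RUN_GRAPH_BUILD` flags control which steps run.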

In [2]:
RUN_FETCH = False
RUN_GRAPH_BUILD = True

wiki_titles = []
arxiv_ids = []

if RUN_FETCH:
    print("Fetching source corpora...")
    wiki_summaries, wiki_titles = fetch_wiki_data()
    arxiv_abstracts, arxiv_ids = fetch_arxiv_data()
    print(f"Wikipedia titles: {wiki_titles}")
    print(f"arXiv IDs: {arxiv_ids}")
else:
    print("Skipping fetch step; using existing articles.")

if RUN_GRAPH_BUILD:
    print("\nBuilding and pruning knowledge graphs...")
    build_unpruned_kgs()
    prune_kgs()
else:
    print("Skipping graph build/prune; using existing TTL files.")

Skipping fetch step; using existing articles.

Building and pruning knowledge graphs...

--- STEP 1: FETCH THE DATA ---
-> all requested Wikipedia articles already cached; reusing local files
    -> returning 10 Wikipedia summaries.
-> all requested arXiv abstracts already cached; reusing local files
    -> returning 10 arXiv abstracts.

--- STEP 2: RUN THE NLP PIPELINE ---
-> NLTK 'punkt' tokenizer model found.
-> NLTK 'punkt_tab' tokenizer model found.
-> SciBERT NER model loaded.

--- STEP 3: BUILD THE UNPRUNED KGs ---
-> Wiki triples extracted: 11208
-> arXiv triples extracted: 2784

-> Saving 200 wiki triples to /home/nuno/Documents/QUBO-KGA/output/KGs/kg_wiki_unpruned.ttl
-> Saving 78 arXiv triples to /home/nuno/Documents/QUBO-KGA/output/KGs/kg_arxiv_unpruned.ttl

--- STEP 4: PRUNING KGs ---

-> pruning Wiki KG...
    -> loaded 200 raw triples, pruning with 7 entities.
    -> Found and added 2 matching entities.
    -> Saving 2 clean triples to /home/nuno/Documents/QUBO-KGA/outpu

## Section 3: Generate Embeddings
Generate two sets of embeddings:
- Entity embeddings, produced by a graph autoencoder (GAE) that fine-tunes the SciBERT node features
- Relation embeddings, taken directly from SciBERT
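The entity embeddings generated here later feed the similarity score $S(i,a)$ used in Section 4. The sketch below shows one common way to turn two sets of embeddings into a score matrix. It assumes cosine similarity, which this notebook does not confirm as the pipeline's actual metric, and the function names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(kg1_embeddings, kg2_embeddings):
    """Return S where S[i][a] scores KG1 entity i against KG2 entity a."""
    return [
        [cosine_similarity(u, v) for v in kg2_embeddings]
        for u in kg1_embeddings
    ]


# Hypothetical 2-dimensional embeddings; real GAE embeddings are much wider.
wiki_vectors = [[1.0, 0.0], [0.0, 2.0]]
arxiv_vectors = [[2.0, 0.0], [1.0, 1.0], [0.0, -1.0]]

S = similarity_matrix(wiki_vectors, arxiv_vectors)
for row in S:
    print([round(score, 3) for score in row])
```

Cosine similarity ignores vector length, so the first Wiki vector and the first arXiv vector score as identical even though their magnitudes differ.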

In [3]:
RUN_EMBEDDINGS = True

if RUN_EMBEDDINGS:
    print("Generating relation embeddings...")
    generate_relation_embeddings()
    print("\nGenerating entity embeddings...")
    generate_entity_embeddings()
else:
    print("Skipping embedding generation; using cached tensors.")



Generating relation embeddings...

--- Part 1: Generating Relation Embeddings (for H_structure) ---
Loading SciBERT model: allenai/scibert_scivocab_cased...
Generating embeddings for relations:
  - developedBy
  - usesConcept
  - implements
Successfully saved relation embeddings to /home/nuno/Documents/QUBO-KGA/output/embeddings/relation_embeddings.npz

Generating entity embeddings...

--- Part 2: Generating Entity Embeddings (for H_node) ---

Processing Wiki KG:
  Loading graph from /home/nuno/Documents/QUBO-KGA/output/KGs/kg_wiki_unpruned.ttl...
  Generating SciBERT features for nodes...
  Graph loaded: 36 nodes, 236 edges.
  Training GAE for 200 epochs...
    Epoch 1/200, Loss: 0.7252, mean pos prob: 0.628, mean neg prob: 0.627
    Epoch 20/200, Loss: 0.6327, mean pos prob: 0.715, mean neg prob: 0.594
    Epoch 40/200, Loss: 0.6288, mean pos prob: 0.731, mean neg prob: 0.594
    Epoch 60/200, Loss: 0.6427, mean pos prob: 0.731, mean neg prob: 0.604
    Epoch 80/200, Loss: 0.6613, me

## Section 4: Formulate the problem as a QUBO and solve it
Perform the QUBO formulation:
$$
H_{total} = \underbrace{\sum_{i,a}{-S(i, a) \cdot x_{i,a}}}_{H_{\text{node}}} + \underbrace{\sum_{i,j,a,b}{-w_{ij,ab} \cdot x_{i,a} \cdot x_{j,b}}}_{H_{\text{structure}}} + \underbrace{\sum_{i} P_{1} \sum_{a=1}^M \sum_{b=a+1}^M x_{i,a} x_{i,b}}_{\text{Constraint}\ 1} + \underbrace{\sum_{a} P_{2} \sum_{i=1}^N \sum_{j=i+1}^N x_{i,a} x_{j,a}}_{\text{Constraint}\ 2}
$$

- Where:
  - $x_{i,a}$: A binary variable (1 or 0) that is 1 if we align entity $i$ from KG1 with entity $a$ from KG2.
  - $S(i,a)$: The similarity score between entity $i$ and $a$, derived from the GAE embeddings.
  - $w_{ij,ab}$: The structural similarity weight, derived from the SciBERT relation embeddings.
  - $P_1, P_2$: Large positive penalty constants to enforce the constraints.
  - Constraint 1: Enforces that each entity $i$ in KG1 maps to at most one entity in KG2. The penalty is 0 when $i$ matches zero or one entity, and $P_1$ is added for every pair of KG2 entities matched to $i$ simultaneously.
  - Constraint 2: Enforces that each entity $a$ in KG2 is mapped to by at most one entity from KG1, with the analogous pairwise penalty $P_2$. Because both constraints are "at most one" rather than "exactly one", entities may remain unaligned, which makes the formulation more robust to realistic KGs that lack perfect 1-to-1 overlap.

The QUBO is then solved by annealing; the run below uses a simulated annealer (see README.md for details).
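To make the formulation concrete, here is a minimal sketch that builds the $H_{\text{node}}$ term and both penalty terms as a QUBO dictionary and finds the minimum by exhaustive search. It is an illustration under stated assumptions, not the project's solver: `build_qubo`, `energy`, and `brute_force` are hypothetical helpers, the similarity matrix is toy data, $H_{\text{structure}}$ is omitted, and exhaustive search stands in for the annealer because the toy instance has only six variables.

```python
from itertools import product


def build_qubo(S, p1, p2):
    """Build {(var_u, var_v): coeff} over variables (i, a); H_structure is omitted."""
    n, m = len(S), len(S[0])
    Q = {}
    # H_node: reward each candidate alignment by its similarity score.
    for i in range(n):
        for a in range(m):
            Q[((i, a), (i, a))] = -S[i][a]
    # Constraint 1: penalize every pair of KG2 entities matched to the same KG1 entity.
    for i in range(n):
        for a in range(m):
            for b in range(a + 1, m):
                Q[((i, a), (i, b))] = p1
    # Constraint 2: penalize every pair of KG1 entities matched to the same KG2 entity.
    for a in range(m):
        for i in range(n):
            for j in range(i + 1, n):
                Q[((i, a), (j, a))] = p2
    return Q


def energy(Q, x):
    """Evaluate the QUBO objective for a binary assignment x: var -> 0/1."""
    return sum(coeff * x[u] * x[v] for (u, v), coeff in Q.items())


def brute_force(Q, variables):
    """Try all 2**len(variables) assignments; only viable for tiny instances."""
    best_x, best_e = None, float("inf")
    for bits in product((0, 1), repeat=len(variables)):
        x = dict(zip(variables, bits))
        e = energy(Q, x)
        if e < best_e:
            best_x, best_e = x, e
    return best_x, best_e


# Toy similarity scores: 2 KG1 entities (rows) x 3 KG2 entities (columns).
S = [[0.9, 0.2, -0.1],
     [0.8, 0.7, 0.1]]
Q = build_qubo(S, p1=2.0, p2=2.0)
variables = [(i, a) for i in range(len(S)) for a in range(len(S[0]))]

best_x, best_e = brute_force(Q, variables)
alignments = sorted(var for var, bit in best_x.items() if bit)
print(alignments, round(best_e, 4))
```

In this toy instance both KG1 entities score highest against the first KG2 entity, so Constraint 2 forces the second one onto its next-best match. The penalties are set larger than any single similarity score so that violating a constraint never pays off; the third KG2 entity stays unaligned.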

In [4]:
SOLVE_QUBO = True

result = None
if SOLVE_QUBO:
    print("\nSolving the alignment QUBO...")
    result = solve_alignment_with_annealer(
        similarity_threshold=0.0,
        max_structural_pairs=2000,
        visualize=True,
    )
else:
    print("Skipping QUBO solve; falling back to existing artefacts.")

if result is None:
    result = SimpleNamespace(
        alignments=[],
        energy=float("nan"),
        sampleset=None,
        aligned_graph_path=KG_ALIGNED_PATH,
        aligned_graph_html=(KG_DIR / "kg_aligned.html"),
        alignment_report_path=ALIGNED_ENTITIES_ANNEALER_CSV,
    )


Solving the alignment QUBO...
[QUBO] candidate variables: 6, structural pairs: 0
[QUBO] running simulated annealer with num_reads=100, beta_range=None, seed=None
[QUBO] best sample energy=-0.0582 produced 2 alignments
loading graph from: /home/nuno/Documents/QUBO-KGA/output/KGs/kg_aligned.ttl

saved graph visualization to: /home/nuno/Documents/QUBO-KGA/output/KGs/aligned_kg.html
[QUBO] alignment report saved to /home/nuno/Documents/QUBO-KGA/output/alignments/alignment_annealer.csv with 2 matches and 4 unaligned entries


## Section 5: Display HTML Visualizations
Generate HTML visualizations of the pruned, unpruned, and aligned knowledge graphs, open them in the default browser, and display the alignment report in the notebook.

In [5]:
# Assuming KG_DIR, result, and ALIGNED_ENTITIES_ANNEALER_CSV are defined


def _open_in_browser(path, title):
    """Open a local HTML file in the default browser."""
    path = Path(path)
    # as_uri() percent-encodes spaces and other special characters in the path
    webbrowser.open(path.resolve().as_uri())
    display(HTML(f"<p><i>Opening '{title}' in the browser...</i></p>"))


wiki_html = KG_DIR / "pruned_wiki_kg.html"
arxiv_html = KG_DIR / "pruned_arxiv_kg.html"
aligned_html = KG_DIR / "kg_aligned.html"
# create the visualizations
visualize_ttl(KG_WIKI_FINAL_PATH, wiki_html)
visualize_ttl(KG_ARXIV_FINAL_PATH, arxiv_html)
visualize_ttl(KG_ALIGNED_PATH, aligned_html)
visualize_ttl(KG_WIKI_UNPRUNED_PATH, KG_DIR / "unpruned_wiki_kg.html")
visualize_ttl(KG_ARXIV_UNPRUNED_PATH, KG_DIR / "unpruned_arxiv_kg.html")


# aligned_html = getattr(result, "aligned_graph_html", None)
# if aligned_html is None:
#     aligned_html = KG_DIR / "aligned_kg.html"

# Open HTML files in browser
_open_in_browser(wiki_html, "Pruned Wiki Knowledge Graph")
_open_in_browser(arxiv_html, "Pruned arXiv Knowledge Graph")
_open_in_browser(aligned_html, "Aligned Knowledge Graph")
_open_in_browser(KG_DIR / "unpruned_wiki_kg.html", "Unpruned Wiki Knowledge Graph")
_open_in_browser(KG_DIR / "unpruned_arxiv_kg.html", "Unpruned arXiv Knowledge Graph")

# Display only the DataFrame in the notebook
#display(HTML("<h3>Annealer Aligned Entities Report</h3>"))
display(pd.read_csv(ALIGNED_ENTITIES_ANNEALER_CSV))

loading graph from: /home/nuno/Documents/QUBO-KGA/output/KGs/kg_wiki_final.ttl

saved graph visualization to: /home/nuno/Documents/QUBO-KGA/output/KGs/pruned_wiki_kg.html
loading graph from: /home/nuno/Documents/QUBO-KGA/output/KGs/kg_arxiv_final.ttl

saved graph visualization to: /home/nuno/Documents/QUBO-KGA/output/KGs/pruned_arxiv_kg.html
loading graph from: /home/nuno/Documents/QUBO-KGA/output/KGs/kg_aligned.ttl

saved graph visualization to: /home/nuno/Documents/QUBO-KGA/output/KGs/kg_aligned.html
loading graph from: /home/nuno/Documents/QUBO-KGA/output/KGs/kg_wiki_unpruned.ttl

saved graph visualization to: /home/nuno/Documents/QUBO-KGA/output/KGs/unpruned_wiki_kg.html
loading graph from: /home/nuno/Documents/QUBO-KGA/output/KGs/kg_arxiv_unpruned.ttl

saved graph visualization to: /home/nuno/Documents/QUBO-KGA/output/KGs/unpruned_arxiv_kg.html


| Unnamed: 0 | wiki_entity | arxiv_entity | not_aligned |
|---|---|---|---|
| 0 | features | computations | |
| 1 | possibilities | nodes | |
| 2 | | | arxiv: classical |
| 3 | | | arxiv: distributed |
| 4 | | | arxiv: Internet |
| 5 | | | arxiv: quantum |


In [6]:
print("\nRunning nearest-neighbor baseline...")
nearest_neighbor_result = solve_alignment_with_nearest_neighbor(
    similarity_threshold=0.0,
    visualize=False,
)

print(f"Nearest-neighbor alignments found: {len(nearest_neighbor_result.alignments)}")

display(HTML("<h3>Nearest-Neighbor Aligned Entities Report</h3>"))
display(pd.read_csv(ALIGNED_ENTITIES_NN_CSV))


Running nearest-neighbor baseline...
[NN] produced 2 alignments from 2×6 similarity scores
[QUBO] alignment report saved to /home/nuno/Documents/QUBO-KGA/output/alignments/alignment_nn.csv with 2 matches and 4 unaligned entries
Nearest-neighbor alignments found: 2


| Unnamed: 0 | wiki_entity | arxiv_entity | not_aligned |
|---|---|---|---|
| 0 | possibilities | computations | |
| 1 | features | nodes | |
| 2 | | | arxiv: classical |
| 3 | | | arxiv: distributed |
| 4 | | | arxiv: Internet |
| 5 | | | arxiv: quantum |


Error: Failed to open Wayland display, fallback to X11. WAYLAND_DISPLAY='wayland-1' DISPLAY=':1'
