# Hypothetical Hybrid GNN-QUBO Alignment Pipeline

Note that this Jupyter notebook is not meant to demonstrate an advantage of the QUBO method over the NN method.

It is only a proposed pipeline for QUBO re-ranking, and it implements every step, from data fetching to the QUBO solver.

We start by building two regular (unpruned) knowledge graphs and then prune them to simulate what a smaller knowledge graph would look like. In theory, that smaller knowledge graph would contain the ambiguous entities that did not receive a high alignment confidence score.

## Section 1: Load Project Dependencies
- Set up the Python path for the project and import every module that the pipeline relies on.

In [None]:
from pathlib import Path
import sys
from types import SimpleNamespace
import webbrowser

import pandas as pd
from IPython.display import HTML, IFrame, display

repo_root = Path().resolve().parent
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))

from src.config import *
from src.evaluation.solvers import (
    solve_alignment_with_annealer,
    solve_alignment_with_nearest_neighbor,
)
from src.kg_construction.fetch_data import fetch_wiki_data, fetch_arxiv_data
from src.kg_construction.build_kg import build_unpruned_kgs, prune_kgs
from src.embedding.generate_embeddings import (
    generate_relation_embeddings,
    generate_entity_embeddings,
)
from src.utils.graph_visualizer import visualize_ttl

# [FIX THIS LATER] maintain backward-compatible variable name for existing cells
ALIGNED_ENTITIES_CSV = ALIGNED_ENTITIES_ANNEALER_CSV

pd.set_option("display.max_colwidth", None)

# create directories if they don't exist
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
KG_DIR.mkdir(parents=True, exist_ok=True)
EMBEDDINGS_DIR.mkdir(parents=True, exist_ok=True)
ENTITIES_DIR.mkdir(parents=True, exist_ok=True)
WIKI_ENTITIES_DIR.mkdir(parents=True, exist_ok=True)
ARXIV_ENTITIES_DIR.mkdir(parents=True, exist_ok=True)

## Section 2: Data Preparation
- Fetch the Wikipedia and arXiv raw data.  

In [None]:
wiki_titles = []
arxiv_ids = []


print("Fetching source corpora...")
wiki_summaries, wiki_titles = fetch_wiki_data()
arxiv_abstracts, arxiv_ids = fetch_arxiv_data()
print(f"Wikipedia titles: {wiki_titles}")
print(f"arXiv IDs: {arxiv_ids}")

## Section 3: Build the KGs

- Run the NLP pipeline to perform Named Entity Recognition (NER) and Relationship Extraction (RE) on the raw data.

- Use the entities and the relations between them to build two large unpruned graphs.

- Take the unpruned graphs and reduce them to just a couple of entities. In practice, those would be the "ambiguous" entities whose alignment confidence score is low.
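
As a rough illustration of that last step, the sketch below keeps only the entities whose confidence falls below a threshold and then drops every triple that touches a removed entity. The helper names (`select_ambiguous`, `prune_triples`), the threshold, and the toy data are all hypothetical; how `prune_kgs` actually selects entities is not shown in this notebook.

```python
def select_ambiguous(confidence, threshold):
    """Return the entities whose alignment confidence is below the threshold."""
    return {entity for entity, score in confidence.items() if score < threshold}


def prune_triples(triples, keep):
    """Keep only triples whose subject and object both survive the pruning."""
    return [(s, p, o) for s, p, o in triples if s in keep and o in keep]


# Made-up confidence scores and triples, for illustration only.
confidence = {
    "qubit": 0.95,
    "annealing": 0.40,
    "ising_model": 0.35,
    "graph": 0.90,
}
triples = [
    ("annealing", "uses", "ising_model"),
    ("qubit", "part_of", "graph"),
    ("annealing", "runs_on", "qubit"),
]

ambiguous = select_ambiguous(confidence, threshold=0.5)
pruned = prune_triples(triples, ambiguous)
print(sorted(ambiguous))
print(pruned)
```

Filtering on both subject and object yields the subgraph induced by the ambiguous entities, which keeps the pruned graph small enough for the later QUBO step.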

In [None]:
print("\nBuilding and pruning knowledge graphs...")
build_unpruned_kgs(
    wiki_data=wiki_summaries, 
    arxiv_data=arxiv_abstracts
)
prune_kgs()

## Section 4: Generate Embeddings
Generate the following embeddings:
- Entity embeddings (using a GAE that fine-tunes the SciBERT embeddings)
- Relation embeddings (using the SciBERT embeddings)
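
Later stages compare entity embeddings across the two graphs. The following is a minimal sketch of one way to do that, assuming cosine similarity (the notebook does not state which measure the project uses), followed by a nearest-neighbor pick per source entity. The function names and toy vectors are hypothetical and are not taken from `src`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def similarity_matrix(src, tgt):
    """Pairwise cosine similarities: one row per source, one column per target."""
    return [[cosine(u, v) for v in tgt] for u in src]


def nearest_neighbor(sim):
    """For each source row, return the index of the most similar target."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


# Toy 2-dimensional embeddings, for illustration only.
wiki_vecs = [[1.0, 0.0], [0.0, 1.0]]
arxiv_vecs = [[2.0, 0.0], [1.0, 1.0]]

sim = similarity_matrix(wiki_vecs, arxiv_vecs)
matches = nearest_neighbor(sim)
print(matches)
```

A per-row argmax like this can assign two source entities to the same target; the constraints in Section 5 are what rule that out in the QUBO.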

In [None]:
RUN_EMBEDDINGS = True

if RUN_EMBEDDINGS:
    print("Generating relation embeddings...")
    generate_relation_embeddings()
    print("\nGenerating entity embeddings...")
    generate_entity_embeddings()
else:
    print("Skipping embedding generation; using cached tensors.")



## Section 5: Formulate the problem as a QUBO and solve it
Formulate the alignment problem as the following QUBO:
$$
H_{\text{total}} = \underbrace{\sum_{i,a}{-S(i, a) \cdot x_{i,a}}}_{H_{\text{node}}} + \underbrace{\sum_{i,j,a,b}{-w_{ij,ab} \cdot x_{i,a} \cdot x_{j,b}}}_{H_{\text{structure}}} + \underbrace{\sum_{i} P_{1} \sum_{a=1}^M \sum_{b=a+1}^M x_{i,a} x_{i,b}}_{\text{Constraint}\ 1} + \underbrace{\sum_{a} P_{2} \sum_{i=1}^N \sum_{j=i+1}^N x_{i,a} x_{j,a}}_{\text{Constraint}\ 2}
$$

- Where:
  - $x_{i,a}$: A binary variable (1 or 0) that is 1 if we align entity $i$ from KG1 with entity $a$ from KG2.
  - $S(i,a)$: The similarity score between entity $i$ and $a$, derived from the GAE embeddings.
  - $w_{ij,ab}$: The structural similarity weight, derived from the SciBERT relation embeddings.
  - $P_1, P_2$: Large positive penalty constants to enforce the constraints.
  - Constraint 1: Enforces that each entity $i$ in KG1 maps to at most one entity in KG2. If $i$ matches zero entities, the penalty is 0. If it matches one, the penalty is 0. If it matches two or more, the penalty is high.
  - Constraint 2: Enforces that each entity $a$ in KG2 is mapped to by at most one entity from KG1. This allows entities to remain unaligned, making the formulation more robust to realistic KGs that do not have perfect 1-to-1 overlap.

Then solve it using quantum annealing (see the README.md file for more details).
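
To make the four terms of the Hamiltonian concrete, here is a minimal sketch that assembles the QUBO coefficients for a toy 2x2 instance and minimizes the energy by exhaustive search. Everything in it is an assumption for illustration: the helper names (`build_qubo`, `energy`, `brute_force`), the scores, and the penalty values are made up, and the brute-force loop stands in for the annealer, whose internals (`solve_alignment_with_annealer`) are not shown in this notebook.

```python
from itertools import product


def build_qubo(S, W, p1, p2):
    """Return {(u, v): coeff} where each variable u = (i, a) stands for x_{i,a}."""
    n, m = len(S), len(S[0])
    Q = {}

    def add(u, v, coeff):
        key = (u, v) if u <= v else (v, u)
        Q[key] = Q.get(key, 0.0) + coeff

    # H_node: reward each candidate pair by its similarity score.
    for i in range(n):
        for a in range(m):
            add((i, a), (i, a), -S[i][a])
    # H_structure: reward pairs of alignments that agree structurally.
    for (i, j, a, b), w in W.items():
        add((i, a), (j, b), -w)
    # Constraint 1: penalize mapping entity i to two different targets.
    for i in range(n):
        for a in range(m):
            for b in range(a + 1, m):
                add((i, a), (i, b), p1)
    # Constraint 2: penalize two sources mapping to the same target a.
    for a in range(m):
        for i in range(n):
            for j in range(i + 1, n):
                add((i, a), (j, a), p2)
    return Q


def energy(Q, x):
    """Evaluate the QUBO energy for a binary assignment x: {variable: 0 or 1}."""
    return sum(coeff * x[u] * x[v] for (u, v), coeff in Q.items())


def brute_force(Q, n, m):
    """Try all 2^(n*m) assignments; only feasible for tiny instances."""
    variables = [(i, a) for i in range(n) for a in range(m)]
    best = None
    for bits in product((0, 1), repeat=len(variables)):
        x = dict(zip(variables, bits))
        e = energy(Q, x)
        if best is None or e < best[0]:
            best = (e, x)
    return best


# Toy instance: made-up similarity scores and one structural weight w_{01,01}.
S = [[0.9, 0.2], [0.3, 0.8]]
W = {(0, 1, 0, 1): 0.5}
Q = build_qubo(S, W, p1=2.0, p2=2.0)

best_energy, best_x = brute_force(Q, n=2, m=2)
selected = [u for u, bit in best_x.items() if bit]
print(selected)  # the pairs (0, 0) and (1, 1) are selected
```

In this toy instance the penalties (2.0) exceed anything a constraint-violating extra pair could gain, so the minimum is a valid one-to-one alignment. If the penalties were too small relative to the scores, the minimum could violate a constraint.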

In [None]:
SOLVE_QUBO = True

result = None
if SOLVE_QUBO:
    print("\nSolving the alignment QUBO...")
    result = solve_alignment_with_annealer(
        similarity_threshold=0.0,
        max_structural_pairs=2000,
        visualize=True
    )
else:
    print("Skipping QUBO solve; falling back to existing artefacts.")

if result is None:
    result = SimpleNamespace(
        alignments=[],
        energy=float("nan"),
        sampleset=None,
        aligned_graph_path=KG_ALIGNED_PATH,
        aligned_graph_html=(KG_DIR / "kg_aligned.html"),
        alignment_report_path=ALIGNED_ENTITIES_ANNEALER_CSV,
    )

## Section 6: Display HTML Visualizations
Open the pruned, unpruned, and aligned knowledge graph visualizations in the default browser, and display the alignment report within the notebook.

In [None]:
# Assuming KG_DIR, result, and ALIGNED_ENTITIES_ANNEALER_CSV are defined


def _open_in_browser(path, title):
    """Open a local HTML file in the default browser."""
    path = Path(path)
    # as_uri() builds a properly escaped file:// URL (spaces, Windows drive letters)
    webbrowser.open(path.resolve().as_uri())
    display(HTML(f"<p><i>Opening '{title}' in the browser...</i></p>"))


wiki_html = KG_DIR / "pruned_wiki_kg.html"
arxiv_html = KG_DIR / "pruned_arxiv_kg.html"
aligned_html = KG_DIR / "kg_aligned.html"
# create the visualizations
visualize_ttl(KG_WIKI_FINAL_PATH, wiki_html)
visualize_ttl(KG_ARXIV_FINAL_PATH, arxiv_html)
visualize_ttl(KG_ALIGNED_PATH, aligned_html)
visualize_ttl(KG_WIKI_UNPRUNED_PATH, KG_DIR / "unpruned_wiki_kg.html")
visualize_ttl(KG_ARXIV_UNPRUNED_PATH, KG_DIR / "unpruned_arxiv_kg.html")


# aligned_html = getattr(result, "aligned_graph_html", None)
# if aligned_html is None:
#     aligned_html = KG_DIR / "aligned_kg.html"

# Open HTML files in browser
_open_in_browser(wiki_html, "Pruned Wiki Knowledge Graph")
_open_in_browser(arxiv_html, "Pruned arXiv Knowledge Graph")
_open_in_browser(aligned_html, "Aligned Knowledge Graph")
_open_in_browser(KG_DIR / "unpruned_wiki_kg.html", "Unpruned Wiki Knowledge Graph")
_open_in_browser(KG_DIR / "unpruned_arxiv_kg.html", "Unpruned arXiv Knowledge Graph")

# # Display only the DataFrame in the notebook
#display(HTML("<h3>Annealer Aligned Entities Report</h3>"))
display(pd.read_csv(ALIGNED_ENTITIES_ANNEALER_CSV))