# Notebook 7: H, O, T Nodes Insertion

This notebook integrates all generated **H**, **O**, and **T** nodes into the G2 graph, using the previously detected structural communities.

---

### 1. K-means Clustering of S and A Embeddings  
We apply the **K-means** algorithm to the embeddings of all **S** and **A** nodes to form
**k = ⌊√(number of S and A nodes)⌋** clusters, i.e. the square root rounded down to an integer.

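For each **H** node, we then identify its **top 5 closest semantic clusters**, i.e. the 5 clusters whose centroids are most similar to the H node's embedding. Using five clusters rather than only the nearest one reduces the chance that an H node ends up disconnected.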
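
The choice of k and the centroid ranking can be illustrated without FAISS. The following is a minimal sketch, assuming cosine similarity as the similarity measure; the helper names `choose_k`, `cosine`, and `top_clusters` and the toy vectors are hypothetical and do not appear in the notebook.

```python
import math

def choose_k(num_nodes: int) -> int:
    """Number of clusters: the square root of the node count, rounded down."""
    return math.isqrt(num_nodes)

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_clusters(h_vector, centroids, top=5):
    """Indices of the `top` centroids most similar to h_vector, best first."""
    ranked = sorted(
        range(len(centroids)),
        key=lambda i: cosine(h_vector, centroids[i]),
        reverse=True,
    )
    return ranked[:top]

# Toy 2-D data, made up for illustration only.
centroids = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
h_vector = [0.9, 0.1]

print(choose_k(16))                               # 4
print(top_clusters(h_vector, centroids, top=2))   # [0, 2]
```

In the notebook itself the ranking is delegated to the FAISS index built during K-means training, so this sketch only mirrors the idea, not the implementation.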

---

### 2. Linking S/A Nodes to H Nodes  
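Within each community, an **S** or **A** node is linked to the community's **H** node if it belongs to one of that H node's **top 5 semantic clusters**. In other words, the linked nodes are the intersection of the community's members and the members of those five clusters.

H nodes that end up with no S/A links are removed from the graph.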
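
The linking rule amounts to a set intersection. A minimal sketch, assuming plain dictionaries and lists of node IDs; the function name `nodes_to_link` and the toy IDs are hypothetical:

```python
def nodes_to_link(community_members, clusters, top_cluster_ids):
    """Node IDs that are in the community AND in any of the H node's top clusters."""
    in_top_clusters = set()
    for cluster_id in top_cluster_ids:
        in_top_clusters.update(clusters[cluster_id])
    return set(community_members) & in_top_clusters

# Toy data, made up for illustration only.
clusters = {0: ["n1", "n2"], 1: ["n3"], 2: ["n4", "n5"]}
community_members = ["n2", "n3", "n5", "n9"]

linked = nodes_to_link(community_members, clusters, top_cluster_ids=[0, 1])
print(sorted(linked))  # ['n2', 'n3']
```

Here `n5` is in the community but not in a top cluster, and `n1` is in a top cluster but not in the community, so neither is linked.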

---

### 3. O Nodes Insertion  
We insert all **O** nodes (High-level Overview nodes) and link each **O** node to the **H** node from which its summary was produced.

The updated graph is saved as **G3 — community augmented graph**.

---

### 4. T Nodes Insertion  
We insert all **T** nodes (original text chunks) and connect each one to its child **S** nodes, using each S node's `source` attribute to find the chunk it was derived from.
This preserves the original wording, which may have been abstracted away during LLM-based semantic decomposition.

The final graph is saved as **G4 — text inserted graph**.
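
The T-node step can be sketched with a small stand-in class. This is an illustration under assumptions: `SketchNode` is a hypothetical substitute for `graphs.Node.Node`, whose real implementation is not shown here and may differ, and `attach_text_nodes` is a made-up helper name.

```python
class SketchNode:
    """Hypothetical stand-in for graphs.Node.Node; the real class may differ."""

    def __init__(self, id, node_type, source, content):
        self.id = id
        self.node_type = node_type
        self.source = source
        self.content = content
        self.edges = []

    def link(self, other):
        # Avoid duplicate edges to the same node.
        if other not in self.edges:
            self.edges.append(other)

def attach_text_nodes(graph, chunk_text):
    """Create one T node per distinct source chunk and link it to its S nodes both ways."""
    for node in list(graph.values()):
        if node.node_type != "S":
            continue
        t_node = graph.get(node.source)
        if t_node is None:
            t_node = SketchNode(node.source, "T", "", chunk_text[node.source])
            graph[node.source] = t_node
        t_node.link(node)
        node.link(t_node)
    return graph

# Toy graph with two S nodes that come from the same chunk.
graph = {
    "doc-0-S-0": SketchNode("doc-0-S-0", "S", "doc-0", "first statement"),
    "doc-0-S-1": SketchNode("doc-0-S-1", "S", "doc-0", "second statement"),
}
attach_text_nodes(graph, {"doc-0": "original chunk text"})
print(len(graph["doc-0"].edges))  # 2
```

Both S nodes share a single T node, so the chunk text is stored once and reachable from every statement derived from it.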


In [1]:
import os
dir_path = os.getcwd()
print("The directory of this script is:", dir_path)
root_path = os.path.dirname(dir_path)
print("The root directory is:", root_path)

The directory of this script is: c:\Users\HP\Desktop\Projects\NodeRAG\graphs
The root directory is: c:\Users\HP\Desktop\Projects\NodeRAG


In [2]:
import sys
sys.path.append(root_path)
from graphs.Node import Node

In [3]:
import pandas as pd
#H nodes
medical_summary = pd.read_parquet("data/nodes/community/medical_communities_summary.parquet")
medical_summary

Unnamed: 0,node_id,community_summary
0,medical-H-22,Here are the distinct categories of high-level...
1,medical-H-24,Here are distinct categories of high-level inf...
2,medical-H-21,Here are distinct categories of high-level inf...
3,medical-H-15,Here are the distinct categories of high-level...
4,medical-H-2,Here's a breakdown of distinct high-level info...
5,medical-H-14,Here are the distinct categories of high-level...
6,medical-H-10,Here are distinct categories of high-level inf...
7,medical-H-5,Here are the distinct categories of high-level...
8,medical-H-11,Here's a breakdown of the distinct categories ...
9,medical-H-8,Here's a breakdown of distinct categories of h...


In [4]:
#O nodes
medical_overview = pd.read_parquet("data/nodes/community/medical_communities_overview.parquet")
medical_overview

Unnamed: 0,node_id,community_overview
0,medical-H-22-O-0,HPV Cancer Etiology Diagnosis Screening Vaccin...
1,medical-H-24-O-0,"Skin Cancer: Origins, Risk Factors, Appearance..."
2,medical-H-21-O-0,"Cancer Patient Empowerment, Guidance, Genetics..."
3,medical-H-15-O-0,Cancer genetics molecular diagnostics biomarke...
4,medical-H-2-O-0,"Cancer Diagnosis, Staging, Treatment, and Prog..."
5,medical-H-14-O-0,"Cancer Treatment, Fertility, Side Effects, Dia..."
6,medical-H-10-O-0,Cancer Diagnosis Treatment Planning Testing Me...
7,medical-H-5-O-0,Cancer Patient Journey Management
8,medical-H-11-O-0,"Cancer Diagnosis, Staging, Grading, Biomarkers..."
9,medical-H-8-O-0,"Endocrine System, Hormone Function, Adrenal Gl..."


In [5]:
import pickle
#communities
with open(f"{root_path}/graphs/data/nodes/community/medical_communities.pkl", "rb") as f:
    communities_medical = pickle.load(f)
communities_medical

defaultdict(list,
            {22: ['medical-0-S-0',
              'medical-0-S-0-R-1',
              'medical-15-S-3-R-10',
              'medical-15-S-5',
              'medical-15-S-5-R-0',
              'medical-15-S-5-R-1',
              'medical-15-S-5-R-2',
              'medical-16-S-1',
              'medical-16-S-1-R-0',
              'medical-16-S-1-R-1',
              'medical-16-S-1-R-2',
              'medical-16-S-1-R-3',
              'medical-16-S-1-R-4',
              'medical-16-S-1-R-5',
              'medical-16-S-1-R-7',
              'medical-16-S-1-R-9',
              'medical-16-S-1-R-11',
              'medical-20-S-4-R-5',
              'medical-32-S-0-R-8',
              'medical-33-S-1-R-3',
              'medical-35-S-1',
              'medical-35-S-1-R-0',
              'medical-35-S-1-R-2',
              'medical-35-S-1-R-3',
              'medical-64-S-0-R-0',
              'medical-118-S-2-R-13',
              'medical-118-S-2-R-14',
              'med

In [6]:
#T nodes
medical_chunks = pd.read_parquet(f"{root_path}/chunking/medical_chunks.parquet")
medical_chunks['node_id'] = "medical-" + medical_chunks.index.astype(str)
medical_chunks[["node_id","chunk"]]

Unnamed: 0,node_id,chunk
0,medical-0,About basal cell skin cancer What is basal cel...
1,medical-1,Hypodermis – The deepest tissue layer is made ...
2,medical-2,Causes and risk factors Basal cell skin cancer...
3,medical-3,Recurrence is when basal cell skin cancer retu...
4,medical-4,2 Testing for basal cell skin cancer 10 Genera...
...,...,...
549,medical-549,Is there a portal where I can get copies of my...
550,medical-550,"Is home care after treatment needed? If yes, w..."
551,medical-551,What will you target? What if I’ve already had...
552,medical-552,What are the chances the cancer will return or...


In [7]:
#S,N,R,A nodes
#G2 - graph
with open(f"{root_path}/graphs/data/graphs/G2_medical_attribute_enriched_graph.pkl", "rb") as f:
    medical_g2 = pickle.load(f)
medical_g2

{'medical-0-S-0': <graphs.Node.Node at 0x20d34912230>,
 'medical-0-S-0-R-0': <graphs.Node.Node at 0x20d4e42b760>,
 'medical-0-S-0-R-1': <graphs.Node.Node at 0x20d4e42aa10>,
 'medical-0-S-0-R-2': <graphs.Node.Node at 0x20d4e42b940>,
 'medical-0-S-1': <graphs.Node.Node at 0x20d4e42abc0>,
 'medical-0-S-1-R-0': <graphs.Node.Node at 0x20d4e42bd30>,
 'medical-0-S-1-R-1': <graphs.Node.Node at 0x20d4e42bd60>,
 'medical-0-S-1-R-2': <graphs.Node.Node at 0x20d4e42ad10>,
 'medical-0-S-1-R-3': <graphs.Node.Node at 0x20d4e42af80>,
 'medical-0-S-1-R-4': <graphs.Node.Node at 0x20d4e42b4f0>,
 'medical-0-S-1-R-5': <graphs.Node.Node at 0x20d4e42b250>,
 'medical-0-S-1-R-6': <graphs.Node.Node at 0x20d4e42bb80>,
 'medical-0-S-1-R-7': <graphs.Node.Node at 0x20d4e42bca0>,
 'medical-0-S-2': <graphs.Node.Node at 0x20d4e42aa40>,
 'medical-0-S-2-R-0': <graphs.Node.Node at 0x20d4e42aad0>,
 'medical-0-S-2-R-1': <graphs.Node.Node at 0x20d4e42b3d0>,
 'medical-0-S-2-R-2': <graphs.Node.Node at 0x20d4e42b0a0>,
 'medical

In [8]:
import json
import faiss
import numpy as np
#embeddings of S,A,H,T nodes

#load faiss index file
index = faiss.read_index("data/embedding/medical_index.faiss")
with open("data/embedding/medical_ids.json", "r") as f:
    medical_embedding_ids = json.load(f)
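#the index and the id list must have the same length, otherwise ids and vectors are misaligned
assert index.ntotal == len(medical_embedding_ids), "FAISS index and id list are out of sync"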

#reconstruct vectors from faiss index
num_vectors = index.ntotal
dimension = index.d
embeddings = np.zeros((num_vectors, dimension), dtype='float32')
for i in range(num_vectors):
    embeddings[i] = index.reconstruct(i)

print("Embeddings shape:", embeddings.shape)
embeddings

Embeddings shape: (4016, 768)


array([[-0.07040366, -0.0460059 , -0.03309077, ..., -0.04978694,
        -0.03295664, -0.00608065],
       [-0.03963271, -0.0726626 , -0.06257651, ..., -0.0454145 ,
        -0.03274148, -0.00346303],
       [ 0.00135392,  0.0302959 ,  0.02910572, ...,  0.01070352,
        -0.02824946, -0.05189522],
       ...,
       [ 0.04048249, -0.0287718 , -0.03389297, ..., -0.00295302,
         0.0417009 ,  0.00099732],
       [ 0.00630151, -0.01392328, -0.01267361, ..., -0.09236691,
         0.01060563, -0.01778928],
       [ 0.01245843, -0.01854508, -0.01106029, ..., -0.08220403,
         0.01130273,  0.05532239]], dtype=float32)
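
The clustering that follows relies on inner-product search, and inner product coincides with cosine similarity only when the vectors have unit length. The notebook does not state whether the stored embeddings are normalized, so a quick check may be worthwhile. A minimal sketch on plain Python lists; `l2_norm` and `all_unit_norm` are hypothetical helper names and the tolerance is an arbitrary choice:

```python
import math

def l2_norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))

def all_unit_norm(vectors, tol=1e-3):
    """True if every vector has length 1 within the given tolerance."""
    return all(abs(l2_norm(v) - 1.0) <= tol for v in vectors)

# Toy vectors, made up for illustration only.
print(all_unit_norm([[0.6, 0.8], [1.0, 0.0]]))  # True
print(all_unit_norm([[3.0, 4.0]]))              # False
```

If the check fails on the real data, `faiss.normalize_L2` can normalize a float32 array in place before training.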

In [9]:
import re
#separate T nodes, H nodes, S-A nodes
#id pattern
medical_T_id_pattern = re.compile(r"^medical-\d+$")
medical_H_id_pattern = re.compile(r"^medical-H-\d+$")
#id lists
medical_embedding_T_ids = []
medical_embedding_H_ids = []
medical_embedding_SA_ids = []
#index
medical_embedding_H_index = []
medical_embedding_SA_index = []


for i in range(num_vectors):
    current_id = medical_embedding_ids[i]
    if medical_T_id_pattern.match(current_id):
        medical_embedding_T_ids.append(current_id)
    elif medical_H_id_pattern.match(current_id):
        medical_embedding_H_ids.append(current_id)
        medical_embedding_H_index.append(i)
    else:
        medical_embedding_SA_ids.append(current_id)
        medical_embedding_SA_index.append(i)

H_embeddings = embeddings[medical_embedding_H_index]
SA_embeddings = embeddings[medical_embedding_SA_index]
SA_embeddings.shape, H_embeddings.shape

((3425, 768), (37, 768))
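
The split above depends on two anchored patterns plus a catch-all `else` branch. A minimal sketch of the same classification as a function, using the two patterns from the cell; the function name `classify` is hypothetical:

```python
import re

T_PATTERN = re.compile(r"^medical-\d+$")
H_PATTERN = re.compile(r"^medical-H-\d+$")

def classify(node_id):
    """Return 'T', 'H', or 'SA' using the same precedence as the loop above."""
    if T_PATTERN.match(node_id):
        return "T"
    if H_PATTERN.match(node_id):
        return "H"
    return "SA"  # catch-all: anything that is neither T nor H

for node_id in ["medical-12", "medical-H-22", "medical-0-S-0", "medical-H-22-O-0"]:
    print(node_id, "->", classify(node_id))
```

Note that the last ID falls into the `SA` bucket because the catch-all accepts anything. Whether the index actually contains IDs other than T, H, S, and A is not stated here, so this is only a reminder of what the `else` branch would do with them.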

In [10]:
import math
#k means on embeddings
#hyperparameters
k = math.floor(math.sqrt(SA_embeddings.shape[0]))
niter = 1000

#init model and train
kmeans = faiss.Kmeans(d=index.d,k=k,niter=niter,verbose=True,spherical=True,gpu=False)
kmeans.cp.metric_type = faiss.METRIC_INNER_PRODUCT
kmeans.train(SA_embeddings)

#get centroids and assign
centroids = kmeans.centroids
_, SA_assignments = kmeans.index.search(SA_embeddings, 1)
SA_assignments = SA_assignments.reshape(-1)

#cluster->SA nodes mapping
clusters = {i: [] for i in range(k)}
for idx, cid in enumerate(SA_assignments):
    clusters[cid].append(medical_embedding_SA_ids[idx])

print("Number of clusters:",len(clusters))
clusters

Number of clusters: 58


{0: ['medical-65-S-1',
  'medical-66-S-6',
  'medical-67-S-0',
  'medical-67-S-1',
  'medical-69-S-2',
  'medical-70-S-2',
  'medical-70-S-4',
  'medical-70-S-6',
  'medical-71-S-0',
  'medical-71-S-1',
  'medical-71-S-2',
  'medical-71-S-3',
  'medical-71-S-4',
  'medical-72-S-0',
  'medical-72-S-2',
  'medical-72-S-4',
  'medical-72-S-6',
  'medical-73-S-0',
  'medical-316-S-0',
  'medical-316-S-1',
  'medical-316-S-2',
  'medical-317-S-0',
  'medical-317-S-3',
  'medical-317-S-4',
  'medical-318-S-0',
  'medical-318-S-1',
  'medical-318-S-2',
  'medical-319-S-0',
  'medical-319-S-1',
  'medical-319-S-2',
  'medical-320-S-0',
  'medical-320-S-1',
  'medical-320-S-2',
  'medical-320-S-3',
  'medical-320-S-4',
  'medical-321-S-0',
  'medical-321-S-1',
  'medical-321-S-2',
  'medical-321-S-3',
  'medical-323-S-1',
  'medical-323-S-5',
  'medical-324-S-1',
  'medical-324-S-4',
  'medical-325-S-0',
  'medical-325-S-1',
  'medical-325-S-2',
  'medical-326-S-0',
  'medical-326-S-2',
  'medi
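
The reported cluster count follows directly from the number of S/A vectors, and cluster sizes are worth inspecting because K-means can produce very uneven or empty clusters. A minimal sketch, assuming assignments are available as a flat list of cluster IDs; `cluster_sizes` is a hypothetical helper and the toy assignments are made up:

```python
import math
from collections import Counter

# 3425 is the number of S/A vectors reported by SA_embeddings.shape above.
num_sa_vectors = 3425
k = math.isqrt(num_sa_vectors)
print(k)  # 58

def cluster_sizes(assignments, num_clusters):
    """Number of members per cluster, including clusters with zero members."""
    counts = Counter(assignments)
    return [counts[cluster_id] for cluster_id in range(num_clusters)]

# Toy assignments for 4 clusters; cluster 3 receives no members.
toy_assignments = [0, 2, 2, 1, 2, 0]
sizes = cluster_sizes(toy_assignments, num_clusters=4)
print(sizes)                                      # [2, 1, 3, 0]
print([i for i, n in enumerate(sizes) if n == 0]) # [3]
```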

In [11]:
#H node->cluster mapping
#search the 5 nearest centroids (instead of only 1) to avoid disconnected H nodes
top_n_clusters = 5
_, H_assignments = kmeans.index.search(H_embeddings, top_n_clusters)
H_node_clusters = {}
for i in range(H_assignments.shape[0]):
    H_node_clusters[medical_embedding_H_ids[i]] = H_assignments[i].tolist()

H_node_clusters

{'medical-H-22': [28, 56, 52, 18, 54],
 'medical-H-24': [26, 41, 28, 42, 56],
 'medical-H-21': [20, 18, 23, 28, 24],
 'medical-H-15': [28, 18, 21, 55, 35],
 'medical-H-2': [28, 18, 13, 20, 32],
 'medical-H-14': [18, 28, 20, 23, 5],
 'medical-H-10': [28, 18, 20, 53, 51],
 'medical-H-5': [18, 28, 20, 23, 24],
 'medical-H-11': [28, 18, 20, 13, 16],
 'medical-H-8': [15, 28, 43, 42, 18],
 'medical-H-26': [29, 28, 42, 18, 13],
 'medical-H-20': [50, 54, 28, 55, 42],
 'medical-H-23': [28, 18, 42, 56, 20],
 'medical-H-13': [45, 40, 28, 18, 54],
 'medical-H-17': [4, 28, 17, 18, 36],
 'medical-H-1': [28, 18, 51, 23, 20],
 'medical-H-12': [28, 48, 55, 18, 24],
 'medical-H-7': [0, 28, 44, 18, 42],
 'medical-H-9': [56, 28, 18, 20, 42],
 'medical-H-6': [28, 42, 29, 18, 19],
 'medical-H-3': [28, 18, 26, 5, 20],
 'medical-H-0': [3, 47, 28, 23, 18],
 'medical-H-19': [14, 18, 46, 28, 5],
 'medical-H-16': [28, 13, 11, 32, 29],
 'medical-H-4': [5, 18, 46, 28, 23],
 'medical-H-18': [28, 42, 29, 18, 6],
 'me
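
In the rows visible above, cluster 28 appears in every H node's top 5, and cluster 18 in most of them. Counting how often each cluster is selected makes such dominant clusters easy to spot. A minimal sketch using the first three rows of the output above; the variable names are hypothetical:

```python
from collections import Counter

# First three rows copied from the H_node_clusters output above.
h_node_clusters = {
    "medical-H-22": [28, 56, 52, 18, 54],
    "medical-H-24": [26, 41, 28, 42, 56],
    "medical-H-21": [20, 18, 23, 28, 24],
}

# Count how many H nodes select each cluster.
usage = Counter(
    cluster_id
    for cluster_ids in h_node_clusters.values()
    for cluster_id in cluster_ids
)
print(usage.most_common(1))  # [(28, 3)]
```

A cluster that is selected by nearly every H node does little to tell H nodes apart, so for those clusters the community membership is what effectively decides which S/A nodes get linked.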

In [12]:
#create and link H and O nodes
for H_id in medical_summary["node_id"].tolist():
    #ids
    O_id = f"{H_id}-O-0"
    #data
    H_content = medical_summary[medical_summary["node_id"] == H_id]["community_summary"].iloc[0]
    O_content = medical_overview[medical_overview["node_id"] == O_id]["community_overview"].iloc[0]
    #node creation
    H_node = Node(
        id = H_id,
        node_type = "H",
        source = "",
        content = H_content
    )

    O_node = Node(
        id = O_id,
        node_type = "O",
        source = "",
        content = O_content
    )

    #link H and O node
    H_node.link(O_node)
    O_node.link(H_node)

    #link H node with S,A nodes in the same community and cluster
    current_cluster = set()
    for cluster_id in H_node_clusters[H_id]:
        current_cluster |= set(clusters[cluster_id])
    current_community = set(communities_medical[int(re.search(r"medical-H-(\d+)", H_id).group(1))])
    relevant_nodes = current_cluster&current_community
    for node_id in relevant_nodes:
        if node_id == H_id:
            continue
        relevant_node = medical_g2[node_id]
        relevant_node.link(H_node)
        H_node.link(relevant_node)
    
    #add to node list
    medical_g2[H_id] = H_node
    medical_g2[O_id] = O_node
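
The cell above relies on two ID conventions: the community number is the trailing integer of the H node ID, and the O node ID is the H node ID with an `-O-0` suffix. A minimal sketch that makes both explicit and fails loudly on malformed IDs; `community_of` and `overview_id` are hypothetical helper names:

```python
import re

H_ID_PATTERN = re.compile(r"^(?P<prefix>.+)-H-(?P<community>\d+)$")

def community_of(h_id):
    """Extract the community number from an H node ID such as 'medical-H-22'."""
    match = H_ID_PATTERN.match(h_id)
    if match is None:
        raise ValueError(f"not an H node id: {h_id!r}")
    return int(match.group("community"))

def overview_id(h_id, index=0):
    """ID of the O node derived from the given H node."""
    return f"{h_id}-O-{index}"

print(community_of("medical-H-22"))  # 22
print(overview_id("medical-H-22"))   # medical-H-22-O-0
```

Anchoring the pattern at both ends means an O node ID or an S node ID raises an error instead of silently yielding a wrong community number.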



In [13]:
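#prune H nodes that are linked only to their O node (no S/A links)
for node_id in list(medical_g2.keys()):
    node = medical_g2.get(node_id)
    if node is None:
        continue  #already removed together with its H node
    if node.node_type == "H" and len(node.edges) <= 1:
        print(node_id)
        medical_g2.pop(node_id, None)
        #also drop the companion O node so it does not point at a removed H node
        medical_g2.pop(f"{node_id}-O-0", None)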


In [14]:
with open(f"{root_path}/graphs/data/graphs/G3_medical_community_augmented_graph.pkl", "wb") as f:
    pickle.dump(medical_g2, f)
with open(f"{root_path}/graphs/data/graphs/G3_medical_community_augmented_graph.pkl", "rb") as f:
    medical_g3 = pickle.load(f)

In [15]:
#create and link T nodes
for node_id in list(medical_g3.keys()):
    node = medical_g3[node_id]
    if node.node_type != "S":
        continue
    source_id = node.source
    if source_id not in medical_g3:
        T_node = Node(
            id = source_id,
            node_type = "T",
            source = "",
            content = medical_chunks[medical_chunks["node_id"] == source_id]["chunk"].iloc[0]
        )
        medical_g3[source_id] = T_node
    T_node = medical_g3[source_id]
    T_node.link(node)
    node.link(T_node)
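
In the IDs shown in this notebook, an S node ID looks like its chunk ID followed by `-S-<n>` (for example `medical-0-S-0` and chunk `medical-0`). Assuming that convention holds for every S node, which is not confirmed here, the chunk ID can be recovered from the node ID and compared against `node.source` as a consistency check. A minimal sketch; `chunk_id_from_s_id` is a hypothetical helper:

```python
import re

# Matches '<chunk id>-S-<n>' where the chunk id itself ends in '-<digits>'.
S_ID_PATTERN = re.compile(r"^(?P<chunk>.+-\d+)-S-\d+$")

def chunk_id_from_s_id(s_id):
    """Return the chunk ID implied by an S node ID, or None if the ID is not an S node ID."""
    match = S_ID_PATTERN.match(s_id)
    return match.group("chunk") if match else None

print(chunk_id_from_s_id("medical-0-S-0"))      # medical-0
print(chunk_id_from_s_id("medical-118-S-2"))    # medical-118
print(chunk_id_from_s_id("medical-0-S-0-R-1"))  # None
```

Comparing `chunk_id_from_s_id(node.id)` with `node.source` before linking would flag S nodes whose `source` does not match their own ID.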

In [16]:
with open(f"{root_path}/graphs/data/graphs/G4_text_inserted_graph.pkl", "wb") as f:
    pickle.dump(medical_g3, f)
with open(f"{root_path}/graphs/data/graphs/G4_text_inserted_graph.pkl", "rb") as f:
    medical_g4 = pickle.load(f)
medical_g4

{'medical-0-S-0': <graphs.Node.Node at 0x20d50b83e80>,
 'medical-0-S-0-R-0': <graphs.Node.Node at 0x20d50b82fe0>,
 'medical-0-S-0-R-1': <graphs.Node.Node at 0x20d85e58ca0>,
 'medical-0-S-0-R-2': <graphs.Node.Node at 0x20d85e58cd0>,
 'medical-0-S-1': <graphs.Node.Node at 0x20d85e58d00>,
 'medical-0-S-1-R-0': <graphs.Node.Node at 0x20d85e58d30>,
 'medical-0-S-1-R-1': <graphs.Node.Node at 0x20d85e58d60>,
 'medical-0-S-1-R-2': <graphs.Node.Node at 0x20d85e58d90>,
 'medical-0-S-1-R-3': <graphs.Node.Node at 0x20d85e58dc0>,
 'medical-0-S-1-R-4': <graphs.Node.Node at 0x20d85e58df0>,
 'medical-0-S-1-R-5': <graphs.Node.Node at 0x20d85e58e20>,
 'medical-0-S-1-R-6': <graphs.Node.Node at 0x20d85e58e50>,
 'medical-0-S-1-R-7': <graphs.Node.Node at 0x20d85e58e80>,
 'medical-0-S-2': <graphs.Node.Node at 0x20d85e58eb0>,
 'medical-0-S-2-R-0': <graphs.Node.Node at 0x20d85e58ee0>,
 'medical-0-S-2-R-1': <graphs.Node.Node at 0x20d85e58f10>,
 'medical-0-S-2-R-2': <graphs.Node.Node at 0x20d85e58f40>,
 'medical