# Example Notebook for Graph Building

This notebook shows how to use the `process_graph` function from `graphrag_tagger.build_graph` to load chunk JSON files, build a weighted graph, prune low-weight edges, and save a connected-components map.

In [1]:
%cd ..

/data/home/eak/learning/nganga_ai/graphrag-tagger


In [2]:
import os
from graphrag_tagger.build_graph import process_graph

# Input folder containing the chunk JSON files (update to your own path)
input_folder = "notebook/example/results"
# Output folder where the graph results are saved (update to your own path)
output_folder = "notebook/example/results/graph_outputs"

# Create output folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Process graph with a specified threshold percentile (e.g., 97.5)
graph = process_graph(
    input_folder,
    output_folder,
    threshold_percentile=97.5,
    content_type_filter="paragraph",  # keep only chunks whose content type is "paragraph"
)

# The processed graph is stored in 'graph' and the connected components map is saved to the output folder.
print("Graph processing completed.")

Processing graph...
Found 81 files in notebook/example/results.
Filtering by content type: paragraph


Loading raw files: 100%|██████████| 81/81 [00:00<00:00, 15172.32it/s]


Loaded 58 raw documents.
Computing scores...
Scores computed.
Building graph...


Building nodes & edges: 100%|██████████| 58/58 [00:00<00:00, 11417.89it/s]

Graph built. Nodes: 58 Edges: 990
Starting graph pruning...
Min weight: 2.2142291753826795
Max weight: 11.656063389192791
Mean weight: 4.024045879185529
Median weight: 3.547562508716013
Pruning threshold (97.5th percentile): 7.651970479746955
Removing 963 edges out of 990...
Graph pruned. Nodes: 58 Edges: 27
Computing connected components...
Number of connected components: 43
Component sizes (min, max, mean): 1 8 1.3488372093023255
Connected components map saved to notebook/example/results/graph_outputs/connected_components.json
Graph processing complete.
Graph processing completed.
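
The log above reports two steps that are easy to lose in the numbers: edges below the 97.5th-percentile weight are dropped, and the remaining graph is split into connected components. The following is a minimal, standard-library sketch of that idea on a toy graph, not the code inside `process_graph`. The helper names (`percentile`, `prune_and_label`), the linear-interpolation percentile, and the choice to keep edges whose weight is greater than or equal to the threshold are all assumptions made for illustration.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def prune_and_label(num_nodes, weighted_edges, q):
    """Keep edges at or above the q-th percentile weight, then label components.

    Assumption: an edge survives when weight >= threshold.
    """
    threshold = percentile([w for _, _, w in weighted_edges], q)
    kept = [(u, v, w) for u, v, w in weighted_edges if w >= threshold]

    # Union-find over node indices 0..num_nodes-1
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v, _ in kept:
        parent[find(u)] = find(v)

    # Relabel roots as 0, 1, 2, ... in order of first appearance
    labels = {}
    component_of = {}
    for node in range(num_nodes):
        root = find(node)
        labels.setdefault(root, len(labels))
        component_of[node] = labels[root]
    return threshold, kept, component_of


# Toy graph (made-up weights): 5 nodes, 6 weighted edges
edges = [(0, 1, 9.0), (1, 2, 8.0), (2, 3, 1.0), (3, 4, 7.5), (0, 4, 2.0), (1, 3, 3.0)]
threshold, kept, component_of = prune_and_label(5, edges, q=50)
print(threshold, len(kept), component_of)
```

On this toy graph the 50th-percentile weight is 5.25, three edges survive, and nodes 0 to 2 and 3 to 4 end up in two separate components. A high percentile such as 97.5 removes most edges, which is why the run above leaves many single-node components.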





In [3]:
import json
import os

# Load the node -> component ID map written by process_graph
with open(os.path.join(output_folder, "connected_components.json")) as f:
    raw: dict = json.load(f)

len(raw)

58

In [4]:
raw["0"]

0

In [5]:
len(set(raw.values()))  # number of distinct component IDs

43

In [6]:
# Group chunk numbers by component ID. The + 1 maps the 0-based node keys
# to the chunk file numbers used below (chunk_1.json, chunk_2.json, ...).
connected_chunks = {}
for k, v in raw.items():
    connected_chunks.setdefault(v, []).append(int(k) + 1)

len(connected_chunks)

43

In [7]:
# Print every component with more than one chunk and keep the largest one
examples = []

for k, v in connected_chunks.items():
    if len(v) > 1:
        print(k, v)
        if len(v) > len(examples):
            examples = v

0 [1, 3, 13, 21, 56, 26, 27, 30]
8 [41, 10, 42, 46, 47]
10 [12, 29]
12 [15, 23]
27 [36, 37]
31 [43, 52]


In [8]:
def load_chunk(number):
    """Load chunk_<number>.json from the input folder."""
    with open(os.path.join(input_folder, f"chunk_{number}.json")) as f:
        return json.load(f)

example1, example2, example3 = (load_chunk(n) for n in examples[:3])

In [9]:
print(example1["chunk"])

focused summarization (QFS) task, rather than an explicit retrieval task. Prior
QFS methods, meanwhile, do not scale to the quantities of text indexed by typical RAG systems. To combine the strengths of these contrasting methods, we propose GraphRAG, a graph-based approach to question answering over private text corpora that scales with both the generality of user questions and the quantity of source text. Our approach uses an LLM to build a graph index in two stages: first, to derive an entity knowledge graph from the source documents, then to pregenerate community summaries for all groups of closely related entities. Given a question, each community summary is used to generate a partial response, before all partial responses are again summarized in a final response to the user. For a class of global sensemaking questions over datasets in the 1 million token range, we show that GraphRAG leads to substantial improvements over a conventional
RAG baseline for both the comprehensiveness a

In [10]:
print(example2["chunk"])

generates a response based on both the query and the retrieved records (Baumel et al., 2018; Dang, 2006; Laskar et al., 2020; Yao et al., 2017). This conventional approach, which we collectively call vector RAG, works well for queries that can be answered with information localized within a small set of records. However, vector RAG approaches do not support sensemaking queries, meaning queries that require global understanding of the entire dataset, such as ”What are the key trends in how scientific discoveries are influenced by interdisciplinary research over the past decade?”
Sensemaking tasks require reasoning over “connections (which can be among people, places, and events) in order to anticipate their trajectories and act effectively” (Klein et al., 2006). LLMs such as GPT (Achiam et al., 2023; Brown et al., 2020), Llama (Touvron et al., 2023), and Gemini (Anil et al., 2023) excel at sensemaking in complex domains like scientific discovery (Microsoft, 2023)


In [11]:
print(example3["chunk"])

3.1.1
Source Documents →Text Chunks
To start, the documents in the corpus are split into text chunks. The LLM extracts information from each chunk for downstream processing. Selecting the size of the chunk is a fundamental design decision; longer text chunks require fewer LLM calls for such extraction (which reduces cost) but suffer from degraded recall of information that appears early in the chunk (Kuratov et al., 2024; Liu et al., 2023). See Section A.1 for prompts and examples of the recall-precision trade-offs.
3.1.2
Text Chunks →Entities & Relationships
In this step, the LLM is prompted to extract instances of important entities and the relationships between the entities from a given chunk. Additionally, the LLM generates short descriptions for the entities and relationships. To illustrate, suppose a chunk contained the following text: 4


In [12]:
example1["classification"], example2["classification"], example3["classification"]

({'content_type': 'paragraph',
  'is_sufficient': True,
  'topics': ['Topic 1', 'Topic 4', 'Topic 2']},
 {'content_type': 'paragraph',
  'is_sufficient': True,
  'topics': ['Topic 4', 'Topic 8', 'Topic 6']},
 {'content_type': 'paragraph',
  'is_sufficient': True,
  'topics': ['Topic 4', 'Topic 5']})

In [15]:
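# Concatenate the text of every chunk in the largest component, in chunk order
chunks = []
for example in sorted(examples):
    with open(os.path.join(input_folder, f"chunk_{example}.json")) as f:
        chunks.append(json.load(f)["chunk"])
texts = "\n\n".join(chunks)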

In [16]:
print(texts)

focused summarization (QFS) task, rather than an explicit retrieval task. Prior
QFS methods, meanwhile, do not scale to the quantities of text indexed by typical RAG systems. To combine the strengths of these contrasting methods, we propose GraphRAG, a graph-based approach to question answering over private text corpora that scales with both the generality of user questions and the quantity of source text. Our approach uses an LLM to build a graph index in two stages: first, to derive an entity knowledge graph from the source documents, then to pregenerate community summaries for all groups of closely related entities. Given a question, each community summary is used to generate a partial response, before all partial responses are again summarized in a final response to the user. For a class of global sensemaking questions over datasets in the 1 million token range, we show that GraphRAG leads to substantial improvements over a conventional
RAG baseline for both the comprehensiveness a