# Example Notebook for Graph Building

This notebook demonstrates how to use the `process_graph` function from `build_graph.py` to load chunk JSON files, build a weighted graph, prune low-weight edges, and save the resulting connected components.

In [1]:
%cd ..

/data/home/eak/learning/nganga_ai/graphrag-tagger


In [2]:
import os
from graphrag_tagger.build_graph import process_graph

# Define sample input and output folders
input_folder = (
    "notebook/example/results"  # update this path to your folder containing JSON files
)
output_folder = "notebook/example/results/graph_outputs"  # update this path to where you want the results saved

# Create output folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Process graph with a specified threshold percentile (e.g., 97.5)
graph = process_graph(
    input_folder,
    output_folder,
    threshold_percentile=97.5,
    content_type_filter="paragraph",
)

# The processed graph is stored in 'graph' and the connected components map is saved to the output folder.
print("Graph processing completed.")

Processing graph...
Found 114 files in notebook/example/results.
Filtering by content type: paragraph


Loading raw files: 100%|██████████| 114/114 [00:00<00:00, 16350.94it/s]


Loaded 90 raw documents.
Computing scores...
Scores computed.
Building graph...


Building nodes & edges: 100%|██████████| 90/90 [00:00<00:00, 6990.64it/s]

Graph built. Nodes: 90 Edges: 2723
Starting graph pruning...
Min weight: 2.187051524018611
Max weight: 8.949035849482943
Mean weight: 3.9093317790378714
Median weight: 3.353718190685278
Pruning threshold (97.5th percentile): 6.928650992130997
Removing 2652 edges out of 2723...
Graph pruned. Nodes: 90 Edges: 71
Computing connected components...
Number of connected components: 68
Component sizes (min, max, mean): 1 9 1.3235294117647058
Connected components map saved to notebook/example/results/graph_outputs/connected_components.json
Graph processing complete.
Graph processing completed.
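
The log above reports a pruning threshold taken at the 97.5th percentile of the edge weights, after which most edges are removed. The following is a minimal sketch of how percentile-based pruning of this kind can work. It is not the code inside `process_graph`: the helper names `percentile` and `prune_edges`, the `(u, v, weight)` edge tuples, the linear-interpolation percentile, and the "keep edges at or above the threshold" rule are all assumptions made for illustration.

```python
import math


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    if lo == hi:
        return ordered[lo]
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def prune_edges(edges, q):
    """Keep edges whose weight is at or above the q-th percentile (assumed rule)."""
    threshold = percentile([w for _, _, w in edges], q)
    kept = [(u, v, w) for u, v, w in edges if w >= threshold]
    return threshold, kept


# Toy edge list (hypothetical data, not taken from the notebook's graph)
toy_edges = [(0, 1, 1.0), (0, 2, 2.0), (1, 2, 3.0), (2, 3, 4.0), (3, 4, 5.0)]
threshold, kept = prune_edges(toy_edges, 50)
print(threshold, kept)
```

A high percentile such as 97.5 keeps only the strongest few percent of edges, which is consistent with the log going from 2723 edges to 71 and leaving many single-node components.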





In [3]:
import json
import os

# Load the node -> component map written by process_graph
with open(os.path.join(output_folder, "connected_components.json")) as f:
    raw: dict = json.load(f)

len(raw)

90

In [4]:
raw["0"]

0

In [5]:
len(set(raw.values()))  # number of distinct component labels

68

In [6]:
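# Group chunk numbers by component label.
# int(k) + 1 turns the 0-based node key into the chunk number used in the
# chunk_<n>.json file names loaded below.
connected_chunks = {}
for k, v in raw.items():
    connected_chunks.setdefault(v, []).append(int(k) + 1)

len(connected_chunks)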

68
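
The component map can also be used to recompute the size statistics that the processing log prints. Below is a small sketch under the assumption, suggested by `raw["0"]` returning `0`, that the JSON maps each node key to a component label. The function name `component_size_stats` and the toy mapping are hypothetical.

```python
from collections import Counter


def component_size_stats(component_map):
    """Return (number of components, min size, max size, mean size)."""
    sizes = list(Counter(component_map.values()).values())
    return len(sizes), min(sizes), max(sizes), sum(sizes) / len(sizes)


# Toy mapping: node key -> component label (hypothetical data)
toy_map = {"0": 0, "1": 1, "2": 1, "3": 3, "4": 1}
n_components, smallest, largest, mean_size = component_size_stats(toy_map)
print(n_components, smallest, largest, mean_size)
```

Applied to the `raw` mapping loaded earlier, the same function would be expected to agree with the log, where 90 nodes in 68 components give a mean size of 90 / 68 ≈ 1.32.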

In [7]:
# Show only the components that contain more than one chunk
for k, v in connected_chunks.items():
    if len(v) > 1:
        print(k, v)

1 [65, 2, 43, 47, 51, 62, 63]
14 [41, 15]
16 [34, 35, 39, 17, 18, 19, 26, 27, 31]
31 [50, 40]
32 [57, 42, 59, 58]
59 [83, 84, 85, 79]
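
Before opening chunks one at a time, it can help to preview every chunk in a component side by side. The sketch below assumes the file naming (`chunk_<n>.json`) and the `"chunk"` key used in the cells that follow. The helper name `preview_component` and its `width` parameter are hypothetical, and the demo runs on throwaway files so that it does not depend on the notebook's data.

```python
import json
import os
import tempfile


def preview_component(folder, chunk_numbers, width=60):
    """Map each chunk number to the first `width` characters of its text.

    Chunk files that do not exist in `folder` are skipped.
    """
    previews = {}
    for n in sorted(chunk_numbers):
        path = os.path.join(folder, f"chunk_{n}.json")
        if not os.path.exists(path):
            continue
        with open(path, encoding="utf-8") as f:
            text = json.load(f)["chunk"]
        # Collapse newlines and repeated spaces so the preview fits on one line
        previews[n] = " ".join(text.split())[:width]
    return previews


# Self-contained demo on temporary files (hypothetical content)
with tempfile.TemporaryDirectory() as demo_folder:
    for n, text in [(1, "alpha\nbeta"), (2, "gamma delta")]:
        with open(os.path.join(demo_folder, f"chunk_{n}.json"), "w", encoding="utf-8") as f:
            json.dump({"chunk": text}, f)
    demo = preview_component(demo_folder, [2, 1, 3])
print(demo)
```

In this notebook, a call such as `preview_component(input_folder, connected_chunks[1])` would list the chunks grouped under component 1.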


In [None]:
# Load three of the chunks grouped under component 1
example1 = json.load(open(os.path.join(input_folder, "chunk_43.json")))
example2 = json.load(open(os.path.join(input_folder, "chunk_51.json")))
example3 = json.load(open(os.path.join(input_folder, "chunk_63.json")))

In [15]:
print(example1["chunk"])

to source text summarization: for low-level community summaries (C3), GraphRAG required 26-
33% fewer context tokens, while for root-level community summaries (C0), it required over 97%
fewer tokens. For a modest drop in performance compared with other global methods, root-level
GraphRAG offers a highly efficient method for the iterative question answering that characterizes
sensemaking activity, while retaining advantages in comprehensiveness (72% win rate) and diversity
(62% win rate) over vector RAG.
10


In [16]:
print(example2["chunk"])

Acknowledgements
We would also like to thank the following people who contributed to the work: Alonso Guevara
Fern´andez, Amber Hoak, Andr´es Morales Esquivel, Ben Cutler, Billie Rinaldi, Chris Sanchez,
Chris Trevino, Christine Caggiano, David Tittsworth, Dayenne de Souza, Douglas Orbaker, Ed
Clark, Gabriel Nieves-Ponce, Gaudy Blanco Meneses, Kate Lytvynets, Katy Smith, M´onica Carva-
jal, Nathan Evans, Richard Ortega, Rodrigo Racanicci, Sarah Smith, and Shane Solomon.
References
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Al-
tenschmidt, J., Altman, S., Anadkat, S., et al. (2023). Gpt-4 technical report. arXiv preprint
arXiv:2303.08774.
12


In [17]:
print(example3["chunk"])

retrieval for open-domain question answering. arXiv preprint arXiv:2009.08553.
Martin, S., Brown, W. M., Klavans, R., and Boyack, K. (2011). Openord: An open-source toolbox
for large graph layout. SPIE Conference on Visualization and Data Analysis (VDA).
Melnyk, I., Dognin, P., and Das, P. (2022). Knowledge graph generation from text.
14


In [19]:
# Load three of the chunks grouped under component 59
example1 = json.load(open(os.path.join(input_folder, "chunk_83.json")))
example2 = json.load(open(os.path.join(input_folder, "chunk_85.json")))
example3 = json.load(open(os.path.join(input_folder, "chunk_79.json")))

In [20]:
print(example1["chunk"])

and structured overview of public figures across various sectors of the entertainment industry,
including film, television, music, sports, and digital media. It lists multiple individuals, providing
specific examples of their contributions and the context in which they are mentioned in entertainment
articles, along with references to data reports for each claim. This approach helps the reader
understand the breadth of the topic and make informed judgments without being misled. In contrast,
Answer 2 focuses on a smaller group of public figures and primarily discusses their personal lives and
relationships, which may not provide as broad an understanding of the topic. While Answer 2 also
cites sources, it does not match the depth and variety of Answer 1.
Directness: Winner=2 (Na¨ıve RAG). Answer 2 is better because it directly lists specific public
figures who are repeatedly mentioned across various entertainment articles, such as Taylor Swift,
Travis Kelce, Britney Spears, and Justin Ti

In [21]:
print(example2["chunk"])

E
System Prompts
E.1
Element Instance Generation
---Goal---
Given a text document that is potentially relevant to this activity and a list of entity types, identify
all entities of those types from the text and all relationships among the identified entities.
---Steps---
1.
Identify all entities.
For each identified entity, extract the following information:
- entity name:
Name of the entity, capitalized
- entity type:
One of the following types:
[{entity types}]
- entity description:
Comprehensive description of the entity’s attributes and activities
Format each entity as ("entity"{tuple delimiter}<entity name>{tuple delimiter}<entity type>{tuple
delimiter}<entity description>
2.
From the entities identified in step 1, identify all pairs of (source entity, target entity) that
are *clearly related* to each other
For each pair of related entities, extract the following information:
- source entity:
name of the source entity, as identified in step 1
- target entity:
name of the target en

In [22]:
print(example3["chunk"])

plore the effects of varying the context window size for our combinations of datasets, questions, and
metrics. In particular, our goal was to determine the optimum context size for our baseline condition
(SS) and then use this uniformly for all query-time LLM use. To that end, we tested four context
window sizes: 8k, 16k, 32k and 64k. Surprisingly, the smallest context window size tested (8k)
was universally better for all comparisons on comprehensiveness (average win rate of 58.1%), while
performing comparably with larger context sizes on diversity (average win rate = 52.4%), and em-
powerment (average win rate = 51.3%). Given our preference for more comprehensive and diverse
answers, we therefore used a fixed context window size of 8k tokens for the final evaluation.
19
