# Example Notebook for Graph Building

This notebook demonstrates how to use the `process_graph` function from `build_graph.py` to load a folder of JSON chunk files, build a weighted graph from them, prune it, and save the resulting connected components.

In [1]:
%cd ..

/data/home/eak/learning/nganga_ai/graphrag-tagger


In [2]:
import os
from graphrag_tagger.build_graph import process_graph

# Define sample input and output folders
input_folder = 'notebook/example/results'  # update this path to your folder containing JSON files
output_folder = 'notebook/example/results/graph_outputs'  # update this path to where you want the results saved

# Create output folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Process graph with a specified threshold percentile (e.g., 97.5)
graph = process_graph(input_folder, output_folder, threshold_percentile=97.5)

# The processed graph is stored in 'graph' and the connected components map is saved to the output folder.
print('Graph processing completed.')

Processing graph...
Found 61 files in notebook/example/results.


Loading raw files: 100%|██████████| 61/61 [00:00<00:00, 13430.58it/s]


Loaded 61 raw documents.
Computing scores...
Scores computed.
Building graph...


Building nodes & edges: 100%|██████████| 61/61 [00:00<00:00, 8905.10it/s]

Graph built. Nodes: 61 Edges: 1431
Starting graph pruning...
Min weight: 1.9233900363781316
Max weight: 9.273103541315505
Mean weight: 4.240697321820651
Median weight: 3.621366483299374
Pruning threshold (97.5th percentile): 7.829650766355667
Removing 1386 edges out of 1431...
Graph pruned. Nodes: 61 Edges: 45
Computing connected components...
Number of connected components: 43
Component sizes (min, max, mean): 1 6 1.4186046511627908
Connected components map saved to notebook/example/results/graph_outputs/connected_components.json
Graph processing complete.
Graph processing completed.
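
The log above reports a percentile-based pruning step followed by a connected-components pass. The internals of `process_graph` are not shown in this notebook, so what follows is only a minimal standard-library sketch of that general idea, not the library's implementation. The helper names `percentile` and `prune_and_label`, the linear-interpolation percentile, the `>=` comparison, and the toy weights are all assumptions made for illustration.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def prune_and_label(num_nodes, edges, threshold_percentile):
    """Keep edges at or above the percentile threshold, then label components.

    edges is a list of (u, v, weight) tuples with 0-based node ids.
    Returns (kept_edges, labels), where labels maps node id -> component id.
    """
    threshold = percentile([w for _, _, w in edges], threshold_percentile)
    kept = [(u, v, w) for u, v, w in edges if w >= threshold]  # assumed comparison

    # Union-find over the surviving edges.
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v, _ in kept:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[max(root_u, root_v)] = min(root_u, root_v)

    # Relabel roots as consecutive component ids in order of first appearance.
    component_of_root = {}
    labels = {}
    for node in range(num_nodes):
        root = find(node)
        labels[node] = component_of_root.setdefault(root, len(component_of_root))
    return kept, labels


# Toy graph: 5 nodes, 6 weighted edges (made-up weights, not from the notebook data).
toy_edges = [(0, 1, 9.0), (1, 2, 2.0), (2, 3, 8.5), (3, 4, 1.5), (0, 4, 3.0), (1, 3, 2.5)]
kept_edges, labels = prune_and_label(5, toy_edges, threshold_percentile=50)
print(kept_edges)
print(labels)
```

A high percentile such as 97.5 keeps only the strongest few edges, which is why most nodes in the run above end up as single-node components.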





In [3]:
import json
import os

# Load the saved map: chunk index (as a string key) -> connected-component ID
with open(os.path.join(output_folder, "connected_components.json")) as f:
    raw: dict = json.load(f)

len(raw)

61

In [4]:
raw["0"]

0

In [5]:
len(set(raw.values()))  # number of distinct component IDs

43

In [6]:
# Invert the map: component ID -> list of chunk numbers
connected_chunks = {}
for k, v in raw.items():
    # Keys are 0-based indices; the +1 presumably matches the 1-based chunk_N.json file names
    connected_chunks.setdefault(v, []).append(int(k) + 1)

len(connected_chunks)

43

In [7]:
for k, v in connected_chunks.items():
    if len(v) > 1:
        print(k, v)

0 [1, 36, 39, 40, 23, 61]
4 [5, 15, 16]
9 [10, 43]
10 [11, 59]
19 [22, 25, 28, 31, 32]
20 [35, 56, 24, 26, 27, 29]
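
The grouping above can also be written with `collections.defaultdict`, and the component sizes summarized the same way the log does (min, max, mean). This is a small sketch on a made-up map; `group_by_component` and `toy_map` are hypothetical names, and unlike the cell above it sorts each member list.

```python
from collections import defaultdict
from statistics import mean


def group_by_component(component_map):
    """Invert {chunk_index_str: component_id} into {component_id: [chunk numbers]}."""
    groups = defaultdict(list)
    for index, component in component_map.items():
        groups[component].append(int(index) + 1)  # same +1 offset as the cell above
    return {component: sorted(members) for component, members in groups.items()}


# Made-up map with the same shape as connected_components.json.
toy_map = {"0": 0, "1": 1, "2": 0, "3": 2, "4": 1, "5": 0}
groups = group_by_component(toy_map)
sizes = [len(members) for members in groups.values()]

print(groups)
print(min(sizes), max(sizes), mean(sizes))
```

Passing `raw` instead of `toy_map` should give the same grouping as `connected_chunks`, up to the ordering of each list.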


In [21]:
with open(os.path.join(input_folder, "chunk_22.json")) as f:
    example1 = json.load(f)
with open(os.path.join(input_folder, "chunk_25.json")) as f:
    example2 = json.load(f)

In [22]:
print(example1["chunk"])

[Flattened figure content: win-rate matrix values for the panels Comprehensiveness, Diversity, Empowerment, and Directness, with row and column labels SS, TS, C0, C1, C2, C3; numeric cells omitted here.]
Figure 2: Head-to-head win rate percentages of (row condition) over (column condition) across two
datasets, four metrics, and 125 questions per comparison (each repeated five times and averaged).
The overall winner per dataset and metric is shown in bold. Self-win rates were not computed but
are shown as the expected 50% for reference. All Graph RAG conditions outperformed naïve RAG
on comprehensiveness and diversity. Conditions C1-C3 also showed slight improvemen

In [20]:
print(example2["chunk"])

Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M.,
Hauth, A., et al. (2023). Gemini: a family of highly capable multimodal models. arXiv preprint
arXiv:2312.11805.
Baek, J., Aji, A. F., and Saffari, A. (2023). Knowledge-augmented language model prompting for
zero-shot knowledge graph question answering. arXiv preprint arXiv:2306.04136.
Ban, T., Chen, L., Wang, X., and Chen, H. (2023). From query tools to causal architects: Harnessing
large language models for advanced causal discovery from data.
Barlaug, N. and Gulla, J. A. (2021). Neural networks for entity matching: A survey. ACM Transactions
on Knowledge Discovery from Data (TKDD), 15(3):1–37.
Baumel, T., Eyal, M., and Elhadad, M. (2018). Query focused abstractive summarization: Incorporating
query relevance, multi-document coverage, and summary length constraints into seq2seq
models. arXiv preprint arXiv:1801.07704.
Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2

Remember to update `input_folder` and `output_folder` to match your own paths before running the notebook.