# Example Notebook for Graphrag-Tagger

This notebook demonstrates the Graphrag-Tagger tool: it imports the module, runs the tagging pipeline on a folder of PDFs, and then inspects the topic tags assigned to each text chunk.

In [1]:
%cd ..

/data/home/eak/learning/nganga_ai/graphrag-tagger


In [None]:
import os
from graphrag_tagger import tagger

# Define sample parameters
params = {
    'pdf_folder': 'notebook/example',  # update path to your PDF folder
    'chunk_size': 512,
    'chunk_overlap': 25,
    'n_components': None,
    'n_features': 512,
    'min_df': 2,
    'max_df': 0.95,
    'llm_model': 'ollama:phi4',
    'output_folder': 'notebook/example/results',  # update path to your output folder
    'model_choice': 'kt'  # 'kt' for ktrain, 'sk' for scikit-learn
}

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [3]:
# Create output folder if it doesn't exist
os.makedirs(params['output_folder'], exist_ok=True)

# Run the tagging pipeline
tagger.main(params)

# The results will be saved in the specified output folder.

Total chunk texts: 61
n_topics automatically set to 5
lang: en
preprocessing texts...
fitting model...
iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5
done.
done.
Topics extracted:
community summaries communities graph element context diversity graphrag comprehensiveness text
answer llm rag graph global community questions approach question march
arxiv preprint language models knowledge large graph systems information generation
data community response context summary report ids entities global dataset
entity delimiter tuple entities relationship answer source types information target
Topics cleaned:
Community Summaries and Diversity
LLM-based Question Answering Approach
Large Language Models for Information Generation
Entity Relationship and Knowledge Management
Graph-Based Data Contextualization


Generating Tags: 100%|██████████| 61/61 [04:10<00:00,  4.11s/it]

Saved 61 chunk files to notebook/example/results





In [4]:
from glob import glob
import os

files = glob(os.path.join(params["output_folder"], "*.json"))
len(files)

61

In [5]:
import json

# Load every result file, closing each file handle after reading
raws: list[dict] = []
for file in files:
    with open(file) as f:
        raws.append(json.load(f))

raws[0].keys()

dict_keys(['chunk', 'source_file', 'classification'])
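Note that `glob` does not guarantee any particular file order, so an index such as `raws[19]` may point at a different chunk on another machine. The sketch below shows one possible way to load the results in a stable, name-sorted order. The helper `load_chunks` and the `demo_*.json` file names are hypothetical and written to a temporary folder purely for illustration; only the three keys (`chunk`, `source_file`, `classification`) come from the output above.

```python
import json
import tempfile
from pathlib import Path


def load_chunks(folder):
    """Load every *.json file in `folder`, sorted by file name for a stable order.

    Hypothetical helper, not part of Graphrag-Tagger.
    """
    records = []
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            records.append(json.load(fh))
    return records


# Write two made-up records (out of order) to a temporary folder, then load them.
with tempfile.TemporaryDirectory() as tmp:
    for i in (2, 1):
        record = {"chunk": f"text {i}", "source_file": "demo.pdf", "classification": []}
        Path(tmp, f"demo_{i}.json").write_text(json.dumps(record), encoding="utf-8")
    demo_raws = load_chunks(tmp)

print([r["chunk"] for r in demo_raws])
```

With real results, calling `load_chunks(params["output_folder"])` would play the same role as the list built in the cell above, assuming the output files are UTF-8 encoded.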

In [6]:
print(raws[19]["chunk"])

"Tribune Spotlight is reporting on the Unity
March taking place in Verdant Oasis Plaza.
This suggests that the event has attracted media attention,
which could amplify its impact on the community.
The role of Tribune Spotlight could be significant in
shaping public perception of the event and the entities involved.
[Data:
Relationships (40)]"
}}
]
}}
---Real Data---
Use the following text for your answer.
Do not make anything up in your answer.
Input:
{input text}
...Report Structure and Grounding Rules Repeated...
Output:
E.3
Community Answer Generation
---Role---
You are a helpful assistant responding to questions about a dataset by synthesizing perspectives from
multiple analysts.
---Goal---
Generate a response of the target length and format that responds to the user’s question, summarize
all the reports from multiple analysts who focused on different parts of the dataset, and incorporate any
relevant general knowledge.
Note that the analysts’ reports provided below are ranked in t

In [7]:
print(raws[19]["classification"])

['Entity Relationship and Knowledge Management', 'Large Language Models for Information Generation']


In [8]:
print(raws[19]["source_file"])

notebook/example/GraphRagPaper.pdf


In [9]:
import pandas as pd

chunk_classification = pd.DataFrame(raws)

chunk_classification

Unnamed: 0,chunk,source_file,classification
0,Table 2: Number of context units (community su...,notebook/example/GraphRagPaper.pdf,"[Community Summaries and Diversity, LLM-based ..."
1,"Reports (2,\n7, 64, 46, 34, +more)].\nHe is al...",notebook/example/GraphRagPaper.pdf,"[Entity Relationship and Knowledge Management,..."
2,these missing entities. This approach allows u...,notebook/example/GraphRagPaper.pdf,"[LLM-based Question Answering Approach, Entity..."
3,"Jacomy, M., Venturini, T., Heymann, S., and Ba...",notebook/example/GraphRagPaper.pdf,"[Graph-Based Data Contextualization, Entity Re..."
4,"Kosinski, M. (2024). Evaluating large language...",notebook/example/GraphRagPaper.pdf,[Large Language Models for Information Generat...
...,...,...,...
56,"ids and add ""+more"" to indicate that there are...",notebook/example/GraphRagPaper.pdf,"[Entity Relationship and Knowledge Management,..."
57,augmented text generation with self-memory. Ad...,notebook/example/GraphRagPaper.pdf,[Large Language Models for Information Generat...
58,not be explicitly stated in the text. The enti...,notebook/example/GraphRagPaper.pdf,"[Entity Relationship and Knowledge Management,..."
59,a good nlg evaluator? a preliminary study. arX...,notebook/example/GraphRagPaper.pdf,"[LLM-based Question Answering Approach, Large ..."


In [10]:
chunk_classification["classification"].apply(len).describe()

count    61.000000
mean      2.327869
std       0.746577
min       0.000000
25%       2.000000
50%       2.000000
75%       3.000000
max       3.000000
Name: classification, dtype: float64
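The summary above counts tags per chunk. A complementary view is how often each topic label is assigned across all chunks. The following is a minimal sketch on a small stand-in DataFrame named `demo` (hypothetical; the label strings are taken from the "Topics cleaned" output, but the assignments are made up for illustration).

```python
import pandas as pd

# Hypothetical stand-in for `chunk_classification`; assignments are illustrative only.
demo = pd.DataFrame(
    {
        "classification": [
            ["Community Summaries and Diversity", "LLM-based Question Answering Approach"],
            ["LLM-based Question Answering Approach"],
            [],  # an untagged chunk
        ]
    }
)

# explode() yields one row per (chunk, tag); empty lists become NaN and are dropped.
tag_counts = demo["classification"].explode().dropna().value_counts()
print(tag_counts)
```

Replacing `demo` with `chunk_classification` should give the per-topic frequencies for the real run.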

In [11]:
chunk_classification[chunk_classification["classification"].apply(len) == 0]

Unnamed: 0,chunk,source_file,classification
10,-3.76\n0.002\n50\n50\n-0.12\n1\nC2\nTS\n47.68\...,notebook/example/GraphRagPaper.pdf,[]
36,-0.22\n1\nDirectness\nC0\nTS\n44.96\n55.04\n-4...,notebook/example/GraphRagPaper.pdf,[]
60,C0\nSS\n76.56\n23.44\n-7.12\n<0.001\n62.08\n37...,notebook/example/GraphRagPaper.pdf,[]
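The three untagged chunks look like flattened table cells rather than prose. One possible way to flag such chunks is to measure the share of lines that are purely numeric. This is only a sketch: `numeric_line_ratio` and the 0.5 threshold are assumptions made here, not part of Graphrag-Tagger, and the two sample strings are short excerpts of chunks printed in this notebook.

```python
import re

# Matches lines such as "44.96", "-4.09", "1" or "<0.001".
NUMERIC_LINE = re.compile(r"[<>]?-?\d+(\.\d+)?")


def numeric_line_ratio(chunk: str) -> float:
    """Return the fraction of non-empty lines that consist only of a number."""
    lines = [line.strip() for line in chunk.splitlines() if line.strip()]
    if not lines:
        return 0.0
    numeric = sum(1 for line in lines if NUMERIC_LINE.fullmatch(line))
    return numeric / len(lines)


table_like = "-0.22\n1\nDirectness\nC0\nTS\n44.96\n55.04\n-4.09\n<0.001"
prose_like = (
    "Tribune Spotlight is reporting on the Unity\n"
    "March taking place in Verdant Oasis Plaza."
)

# Assumed threshold: more than half of the lines are numeric.
print(numeric_line_ratio(table_like) > 0.5, numeric_line_ratio(prose_like) > 0.5)
```

A filter like this could be applied to the `chunk` column to separate table fragments from prose, though the right threshold would depend on the documents being processed.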


In [12]:
print(raws[36]["chunk"])

-0.22
1
Directness
C0
TS
44.96
55.04
-4.09
<0.001
45.2
54.8
-3.68
0.003
C1
TS
47.92
52.08
-2.41
0.126
46.64
53.36
-2.91
0.04
C2
TS
48.8
51.2
-2.23
0.179
48.32
51.68
-2.12
0.179
C3
TS
48.08
51.92
-2.23
0.179
48.32
51.68
-2.56
0.074
C0
SS
35.12
64.88
-6.17
<0.001
41.44
58.56
-4.82
<0.001
C1
SS
40.32
59.68
-4.83
<0.001
45.2
54.8
-3.19
0.017
C2
SS
40.4
59.6
-4.67
<0.001
44.88
55.12
-3.65
0.003
C3
SS
40.48
59.52
-4.69
<0.001
45.6
54.4
-2.86
0.043
TS
SS
43.6
56.4
-3.96
<0.001
46
54
-2.68
0.066
C0
C1
46.96
53.04
-2.87
0.037
47.6
52.4
-2.17
0.179
C0
C2
48.4
51.6
-2.06
0.197
48.48
51.52
-1.61
0.321
C1
C2
49.84
50.16
-1
0.952
49.28
50.72
-1.6
0.321
C0
C3
48.4
51.6
-1.8
0.29
47.2
52.8


In [13]:
chunk_classification[chunk_classification["classification"].apply(len) == 1]

Unnamed: 0,chunk,source_file,classification
40,20.8\n-8.34\n<0.001\nC3\nSS\n78.96\n21.04\n-8....,notebook/example/GraphRagPaper.pdf,[Community Summaries and Diversity]


In [14]:
print(raws[40]["chunk"])

20.8
-8.34
<0.001
C3
SS
78.96
21.04
-8.12
<0.001
79.44
20.56
-8.44
<0.001
TS
SS
83.12
16.88
-8.85
<0.001
79.6
20.4
-8.27
<0.001
C0
C1
53.2
46.8
-1.96
0.389
51.92
48.08
-0.45
0.777
C0
C2
50.24
49.76
-0.23
1
53.68
46.32
-1.54
0.371
C1
C2
51.52
48.48
-1.62
0.633
57.76
42.24
-4.01
<0.001
C0
C3
49.12
50.88
-0.56
1
52.16
47.84
-0.86
0.777
C1
C3
50.32
49.68
-0.66
1
55.12
44.88
-2.94
0.016
C2
C3
52.24
47.76
-1.97
0.389
58.64
41.36
-3.68
0.002
Diversity
C0
TS
50.24
49.76
-0.11
1
46.88
53.12
-1.38
0.676
C1
TS
50.48
49.52
-0.12
1
54.64
45.36
-1.88
0.298
C2
TS
57.12
42.88
-2.84
0.036
55.76
44.24
-2.16
0.184
C3
TS
54.32
45.68
-2.39
0.1
60.16
39.84
-4.07
<0.001
C0
SS
76.56
23.44
-7.12
<0.001


In [15]:
chunk_classification[chunk_classification["classification"].apply(len) == 2]

Unnamed: 0,chunk,source_file,classification
1,"Reports (2,\n7, 64, 46, 34, +more)].\nHe is al...",notebook/example/GraphRagPaper.pdf,"[Entity Relationship and Knowledge Management,..."
2,these missing entities. This approach allows u...,notebook/example/GraphRagPaper.pdf,"[LLM-based Question Answering Approach, Entity..."
6,3.1.1\nSource Documents →Text Chunks\nTo start...,notebook/example/GraphRagPaper.pdf,[Large Language Models for Information Generat...
11,(a) Root communities at level 0\n(b) Sub-commu...,notebook/example/GraphRagPaper.pdf,"[Graph-Based Data Contextualization, LLM-based..."
13,"Furthermore, we use a “control criterion” call...",notebook/example/GraphRagPaper.pdf,"[LLM-based Question Answering Approach, Graph-..."
15,tically similar to the query and the generated...,notebook/example/GraphRagPaper.pdf,"[LLM-based Question Answering Approach, Graph-..."
17,which is setting interest rates on Tuesday and...,notebook/example/GraphRagPaper.pdf,"[Entity Relationship and Knowledge Management,..."
18,"Table 3: Average number of extracted claims, r...",notebook/example/GraphRagPaper.pdf,"[LLM-based Question Answering Approach, Large ..."
19,"""Tribune Spotlight is reporting on the Unity\n...",notebook/example/GraphRagPaper.pdf,"[Entity Relationship and Knowledge Management,..."
20,2. Diversity: Measured by clustering the claim...,notebook/example/GraphRagPaper.pdf,"[Community Summaries and Diversity, Graph-Base..."


In [16]:
chunk_classification[chunk_classification["classification"].apply(len) == 3]

Unnamed: 0,chunk,source_file,classification
0,Table 2: Number of context units (community su...,notebook/example/GraphRagPaper.pdf,"[Community Summaries and Diversity, LLM-based ..."
3,"Jacomy, M., Venturini, T., Heymann, S., and Ba...",notebook/example/GraphRagPaper.pdf,"[Graph-Based Data Contextualization, Entity Re..."
4,"Kosinski, M. (2024). Evaluating large language...",notebook/example/GraphRagPaper.pdf,[Large Language Models for Information Generat...
5,48\n45\n41\n50\nSS\nTS\nC0\nC1\nC2\nC3\nSS TS ...,notebook/example/GraphRagPaper.pdf,"[LLM-based Question Answering Approach, Graph-..."
7,"Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B....",notebook/example/GraphRagPaper.pdf,"[LLM-based Question Answering Approach, Large ..."
8,4.1.3\nConfiguration\nWe used a fixed context ...,notebook/example/GraphRagPaper.pdf,[Large Language Models for Information Generat...
9,"Metropolitansky, D. and Larson, J. (2025). Tow...",notebook/example/GraphRagPaper.pdf,[Large Language Models for Information Generat...
12,Source Documents\nText Chunks\ntext extraction...,notebook/example/GraphRagPaper.pdf,"[Entity Relationship and Knowledge Management,..."
14,generates a response based on both the query a...,notebook/example/GraphRagPaper.pdf,"[LLM-based Question Answering Approach, Entity..."
16,than a broad spectrum of their professional in...,notebook/example/GraphRagPaper.pdf,"[LLM-based Question Answering Approach, Entity..."


Before running the notebook, update the `pdf_folder` and `output_folder` paths in the `params` cell to match your own setup.