# Example Notebook for Graphrag-Tagger

This notebook demonstrates how to use the Graphrag-Tagger tool: it imports the module, runs the tagging pipeline on a sample PDF, and inspects the chunk classifications the pipeline writes to disk.

In [1]:
%cd ..

/data/home/eak/learning/nganga_ai/graphrag-tagger


In [2]:
import os
from graphrag_tagger import tagger

# Define sample parameters
params = {
    'pdf_folder': 'notebook/example',  # update path to your PDF folder
    'chunk_size': 256,
    'chunk_overlap': 25,
    'n_components': None,
    'n_features': 512,
    'min_df': 2,
    'max_df': 0.95,
    'llm_model': 'ollama:qwen2.5',
    'output_folder': 'notebook/example/results',  # update path to your output folder
    'model_choice': 'kt' # kt for ktrain or sk for scikit-learn
}

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
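
Tag generation can take several minutes, so it may be worth checking `params` before starting the pipeline. The sketch below is not part of Graphrag-Tagger: `check_params` is a hypothetical helper, and the rules it applies (PDF files sit directly inside `pdf_folder`, and `chunk_overlap` is smaller than `chunk_size`) are assumptions about what the tagger expects.

```python
from pathlib import Path


def check_params(params):
    """Return a list of problems found in params; an empty list means none were found."""
    problems = []

    pdf_folder = Path(params["pdf_folder"])
    if not pdf_folder.is_dir():
        problems.append(f"pdf_folder does not exist: {pdf_folder}")
    elif not any(pdf_folder.glob("*.pdf")):
        # Assumption: PDFs are expected directly in the folder, not in subfolders.
        problems.append(f"no .pdf files found in: {pdf_folder}")

    # Assumption: an overlap as large as the chunk itself is never intended.
    if params["chunk_overlap"] >= params["chunk_size"]:
        problems.append("chunk_overlap should be smaller than chunk_size")

    return problems
```

Calling `check_params(params)` right after the `params` cell and printing the returned list is enough to spot a path that still points at someone else's machine.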


In [3]:
# Create output folder if it doesn't exist
os.makedirs(params['output_folder'], exist_ok=True)

# Run the tagging pipeline
tagger.main(params)

# The results will be saved in the specified output folder.

Total chunk texts: 114
n_topics automatically set to 7
lang: en
preprocessing texts...
fitting model...
iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5
done.
done.
Topics extracted:
community arxiv information preprint graph answering summarization entities knowledge report
answer question figures public entertainment better provide example rag various
entities delimiter entity relationships tuple community llm claims chunk text
arxiv language models preprint knowledge large graph march plaza verdant
entity llm rag global answer summaries sensemaking text approach questions
community summaries graph condition comprehensiveness global dataset communities news average
response data questions user reports format length relevant corpus global
Saving topics at: notebook/example/results/topics.json
Topics cleaned:
Community Knowledge
LLM Answers
Graph Summaries
Preprints Research
Entities Relationship

Generating Tags: 100%|██████████| 114/114 [02:53<00:00,  1.52s/it]

Saved 114 chunk files to notebook/example/results





In [17]:
from glob import glob
import os

files = glob(os.path.join(params["output_folder"], "chunk_*.json"))
len(files)

114
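
`glob` returns paths in no guaranteed order, so positional lookups such as `raws[19]` in the cells below may point at a different chunk on another machine or run. A minimal sketch of a loader that fixes the order follows. `load_chunks` is a hypothetical helper, and it assumes the files are named `chunk_<number>.json`; only the `chunk_*.json` pattern is confirmed by the cell above.

```python
import json
import re
from pathlib import Path


def load_chunks(folder):
    """Load every chunk_*.json in folder, ordered by the number in the file name."""

    def sort_key(path):
        # Assumption: names look like chunk_<number>.json.
        # Files without a number sort after the numbered ones, by name.
        match = re.search(r"(\d+)", path.stem)
        if match:
            return (0, int(match.group(1)), path.name)
        return (1, 0, path.name)

    records = []
    for path in sorted(Path(folder).glob("chunk_*.json"), key=sort_key):
        with path.open(encoding="utf-8") as handle:  # file is closed after reading
            records.append(json.load(handle))
    return records
```

With this in place, `raws = load_chunks(params["output_folder"])` would give the same index for the same chunk file on every run.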

In [18]:
import json

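# Load each chunk file, closing the file handle once it has been read
raws: list[dict] = []
for file in files:
    with open(file, encoding="utf-8") as f:
        raws.append(json.load(f))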

for i, raw in enumerate(raws):
    if "chunk" not in raw:
        print(i, raw)

In [25]:
raw

{'chunk': 'Jacomy, M., Venturini, T., Heymann, S., and Bastian, M. (2014). Forceatlas2, a continuous graph\nlayout algorithm for handy network visualization designed for the gephi software. PLoS ONE\n9(6): e98679. https://doi.org/10.1371/journal.pone.0098679.\nJin, D., Yu, Z., Jiao, P., Pan, S., He, D., Wu, J., Philip, S. Y., and Zhang, W. (2021). A survey of\ncommunity detection approaches: From statistical modeling to deep learning. IEEE Transactions\non Knowledge and Data Engineering, 35(2):1149–1170.\nKang, M., Kwak, J. M., Baek, J., and Hwang, S. J. (2023). Knowledge graph-augmented language\nmodels for knowledge-grounded dialogue generation. arXiv preprint arXiv:2305.18846.',
 'source_file': 'notebook/example/GraphRagPaper.pdf',
 'classification': {'content_type': 'paragraph',
  'is_sufficient': True,
  'topics': ['Topic 3', 'Topic 4']}}

In [51]:
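import json

# Reload the chunk files, closing each file handle once it has been read
raws: list[dict] = []
for file in files:
    with open(file, encoding="utf-8") as f:
        raws.append(json.load(f))

# Flatten the nested "classification" dict into top-level keys
raws = [{
    "chunk": raw["chunk"],
    "source_file": raw["source_file"],
    **raw["classification"],
} for raw in raws]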

raws[0].keys()

dict_keys(['chunk', 'source_file', 'content_type', 'is_sufficient', 'topics'])
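
The flattening above raises a `KeyError` if any record lacks `chunk`, `source_file`, or `classification`. The following is a more defensive sketch of the same step. `flatten_record` is a hypothetical helper, and the fallback values (empty string, `None`, empty list) are illustrative choices rather than anything the tagger defines.

```python
def flatten_record(raw):
    """Flatten one chunk record, tolerating missing keys."""
    # A missing or null classification is treated as an empty dict.
    classification = raw.get("classification") or {}
    return {
        "chunk": raw.get("chunk", ""),
        "source_file": raw.get("source_file", ""),
        "content_type": classification.get("content_type"),
        "is_sufficient": classification.get("is_sufficient"),
        # Normalise a missing or null topic list to [] so len() always works.
        "topics": classification.get("topics") or [],
    }
```

Applied as `raws = [flatten_record(raw) for raw in raws]`, this yields the same five keys shown above for well-formed records.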

In [52]:
print(raws[19]["chunk"])

Retrieval augmented generation (RAG) (Lewis et al., 2020) is an established approach to using
LLMs to answer queries based on data that is too large to contain in a language model’s context
window, meaning the maximum number of tokens (units of text) that can be processed by the LLM
at once (Kuratov et al., 2024; Liu et al., 2023). In the canonical RAG setup, the system has access to
a large external corpus of text records and retrieves a subset of records that are individually relevant
to the query and collectively small enough to fit into the context window of the LLM. The LLM then
Preprint. Under review.
arXiv:2404.16130v2  [cs.CL]  19 Feb 2025


In [53]:
print(raws[19])

{'chunk': 'Retrieval augmented generation (RAG) (Lewis et al., 2020) is an established approach to using\nLLMs to answer queries based on data that is too large to contain in a language model’s context\nwindow, meaning the maximum number of tokens (units of text) that can be processed by the LLM\nat once (Kuratov et al., 2024; Liu et al., 2023). In the canonical RAG setup, the system has access to\na large external corpus of text records and retrieves a subset of records that are individually relevant\nto the query and collectively small enough to fit into the context window of the LLM. The LLM then\nPreprint. Under review.\narXiv:2404.16130v2  [cs.CL]  19 Feb 2025', 'source_file': 'notebook/example/GraphRagPaper.pdf', 'content_type': 'paragraph', 'is_sufficient': True, 'topics': ['Topic 2', 'Topic 4']}


In [54]:
print(raws[19]["source_file"])

notebook/example/GraphRagPaper.pdf


In [55]:
import pandas as pd

chunk_classification = pd.DataFrame(raws)

chunk_classification

Unnamed: 0,chunk,source_file,content_type,is_sufficient,topics
0,"}},\n{{\n""summary"":\n""Harmony Assembly’s role ...",notebook/example/GraphRagPaper.pdf,paragraph,True,"[Topic 1, Topic 2, Topic 5]"
1,level communities are used to generate summari...,notebook/example/GraphRagPaper.pdf,paragraph,True,"[Topic 1, Topic 3]"
2,"tions on Knowledge Discovery from Data (TKDD),...",notebook/example/GraphRagPaper.pdf,paragraph,True,"[Topic 4, Topic 3, Topic 2]"
3,55.8\n73.5\n100\n2.3\n20.7\n57.4\n66.8\n100\np...,notebook/example/GraphRagPaper.pdf,paragraph,True,"[Topic 3, Topic 1]"
4,• SS. An implementation of vector RAG in which...,notebook/example/GraphRagPaper.pdf,paragraph,True,"[Topic 4, Topic 5]"
...,...,...,...,...,...
109,the topic without being misled or making falla...,notebook/example/GraphRagPaper.pdf,paragraph,True,"[Topic 2, Topic 1]"
110,Many benchmark datasets for open-domain questi...,notebook/example/GraphRagPaper.pdf,paragraph,True,"[Topic 2, Topic 4]"
111,52\n51\n43\n41\n50\n49\n47\n48\n48\n45\n51\n50...,notebook/example/GraphRagPaper.pdf,table,True,"[LLM Answers, Entities Relationships]"
112,-2.16\n0.184\nC3\nTS\n54.32\n45.68\n-2.39\n0.1...,notebook/example/GraphRagPaper.pdf,table,True,"[Topic 4, Topic 5]"
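
The table above mixes two label styles in the `topics` column: placeholders such as `Topic 4` and names such as `LLM Answers`. The sketch below only detects which style each chunk uses. `label_style` is a hypothetical helper, and it deliberately does not map `Topic N` to a cleaned name, because the notebook does not show how the two correspond.

```python
import re

# Matches placeholder labels of the form "Topic <number>".
PLACEHOLDER = re.compile(r"^Topic \d+$")


def label_style(topics):
    """Return 'empty', 'placeholder', 'named', or 'mixed' for one topic list."""
    if not topics:
        return "empty"
    is_placeholder = {bool(PLACEHOLDER.match(label)) for label in topics}
    if is_placeholder == {True}:
        return "placeholder"
    if is_placeholder == {False}:
        return "named"
    return "mixed"
```

`chunk_classification["topics"].apply(label_style).value_counts()` would then show how many chunks fall into each style.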


In [56]:
chunk_classification["topics"].apply(lambda x: len(x) if x else 0).describe()

count    114.000000
mean       1.982456
std        0.531860
min        0.000000
25%        2.000000
50%        2.000000
75%        2.000000
max        3.000000
Name: topics, dtype: float64
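
The summary above describes how many topics each chunk received, not how often each topic was assigned. A small sketch of the second view follows, using only the standard library. `topic_frequencies` is a hypothetical helper, and the sample lists are the first five rows of the table shown earlier.

```python
from collections import Counter


def topic_frequencies(topic_lists):
    """Count how many chunks carry each topic label."""
    counts = Counter()
    for topics in topic_lists:
        # set() guards against a label repeated within one chunk; `or []` handles None.
        counts.update(set(topics or []))
    return counts


sample = [
    ["Topic 1", "Topic 2", "Topic 5"],
    ["Topic 1", "Topic 3"],
    ["Topic 4", "Topic 3", "Topic 2"],
    ["Topic 3", "Topic 1"],
    ["Topic 4", "Topic 5"],
]
frequencies = topic_frequencies(sample)
```

On the full data, `topic_frequencies(chunk_classification["topics"]).most_common()` would list the labels from most to least frequent.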

In [57]:
chunk_classification[chunk_classification["topics"].apply(len) == 0]

Unnamed: 0,chunk,source_file,content_type,is_sufficient,topics
35,Acknowledgements\nWe would also like to thank ...,notebook/example/GraphRagPaper.pdf,header,False,[]


In [58]:
print(raws[36]["chunk"])

ai workflows for generating personas. In Proceedings of the 2024 ACM Designing Interactive
Systems Conference, pages 757–781.
Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., and Yao, S. (2024). Reflexion: Language
agents with verbal reinforcement learning. Advances in Neural Information Processing Systems,
36.
15


In [59]:
chunk_classification[chunk_classification["topics"].apply(len) == 1]

Unnamed: 0,chunk,source_file,content_type,is_sufficient,topics
10,G\nStatistical Analysis\nTable 6: Pairwise com...,notebook/example/GraphRagPaper.pdf,table,True,[Topic 4]
13,"ids and add ""+more"" to indicate that there are...",notebook/example/GraphRagPaper.pdf,paragraph,True,[Topic 5]
20,C0\nC1\n46.96\n53.04\n-2.87\n0.037\n47.6\n52.4...,notebook/example/GraphRagPaper.pdf,table,True,[Topic 3]
26,49.76\n-0.06\n1\n55.52\n44.48\n-2.03\n0.17\nC1...,notebook/example/GraphRagPaper.pdf,table,True,[LLM Answers]
38,"""summary"":\n""Role of Tribune Spotlight"", ""expl...",notebook/example/GraphRagPaper.pdf,paragraph,True,[Topic 5]
41,58\n55\n79\n62\n46\n42\n50\n59\n79\n64\n48\n45...,notebook/example/GraphRagPaper.pdf,table,True,[Topic 4]
68,in which we first ask the LLM to assess whethe...,notebook/example/GraphRagPaper.pdf,figure_caption,True,[Topic 5]
70,50.96\n49.04\n-0.39\n1\nEmpowerment\nC0\nTS\n4...,notebook/example/GraphRagPaper.pdf,table,True,[LLM Answers]
86,TS\n48.08\n51.92\n-2.23\n0.179\n48.32\n51.68\n...,notebook/example/GraphRagPaper.pdf,table,True,[Topic 4]
87,Table 4: Average number of clusters across dif...,notebook/example/GraphRagPaper.pdf,table,True,[Topic 4]


In [60]:
print(raws[42]["chunk"])

38,VERDANT OASIS PLAZA,HARMONY ASSEMBLY,Harmony Assembly is holding a march at Verdant Oasis Plaza
39,VERDANT OASIS PLAZA,UNITY MARCH,The Unity March is taking place at Verdant Oasis Plaza
40,VERDANT OASIS PLAZA,TRIBUNE SPOTLIGHT,Tribune Spotlight is reporting on the Unity march taking place at
Verdant Oasis Plaza
41,VERDANT OASIS PLAZA,BAILEY ASADI,Bailey Asadi is speaking at Verdant Oasis Plaza about the march
43,HARMONY ASSEMBLY,UNITY MARCH,Harmony Assembly is organizing the Unity March
Output:
22


In [61]:
chunk_classification[chunk_classification["topics"].apply(len) == 2]

Unnamed: 0,chunk,source_file,content_type,is_sufficient,topics
1,level communities are used to generate summari...,notebook/example/GraphRagPaper.pdf,paragraph,True,"[Topic 1, Topic 3]"
3,55.8\n73.5\n100\n2.3\n20.7\n57.4\n66.8\n100\np...,notebook/example/GraphRagPaper.pdf,paragraph,True,"[Topic 3, Topic 1]"
4,• SS. An implementation of vector RAG in which...,notebook/example/GraphRagPaper.pdf,paragraph,True,"[Topic 4, Topic 5]"
5,4.1.3\nConfiguration\nWe used a fixed context ...,notebook/example/GraphRagPaper.pdf,paragraph,True,"[Topic 2, Topic 3]"
6,Each level of this hierarchy provides a commun...,notebook/example/GraphRagPaper.pdf,paragraph,True,"[Topic 3, Topic 1]"
...,...,...,...,...,...
109,the topic without being misled or making falla...,notebook/example/GraphRagPaper.pdf,paragraph,True,"[Topic 2, Topic 1]"
110,Many benchmark datasets for open-domain questi...,notebook/example/GraphRagPaper.pdf,paragraph,True,"[Topic 2, Topic 4]"
111,52\n51\n43\n41\n50\n49\n47\n48\n48\n45\n51\n50...,notebook/example/GraphRagPaper.pdf,table,True,"[LLM Answers, Entities Relationships]"
112,-2.16\n0.184\nC3\nTS\n54.32\n45.68\n-2.39\n0.1...,notebook/example/GraphRagPaper.pdf,table,True,"[Topic 4, Topic 5]"


In [62]:
chunk_classification[chunk_classification["topics"].apply(len) == 3]

Unnamed: 0,chunk,source_file,content_type,is_sufficient,topics
0,"}},\n{{\n""summary"":\n""Harmony Assembly’s role ...",notebook/example/GraphRagPaper.pdf,paragraph,True,"[Topic 1, Topic 2, Topic 5]"
2,"tions on Knowledge Discovery from Data (TKDD),...",notebook/example/GraphRagPaper.pdf,paragraph,True,"[Topic 4, Topic 3, Topic 2]"
7,2.2\nUsing Knowledge Graphs with LLMs and RAG\...,notebook/example/GraphRagPaper.pdf,paragraph,True,"[Topic 2, Topic 3, Topic 4]"
12,models for knowledge-grounded dialogue generat...,notebook/example/GraphRagPaper.pdf,paragraph,True,"[Topic 4, Topic 2, Topic 5]"
22,"arXiv:2203.11171.\nWang, Y., Lipka, N., Rossi,...",notebook/example/GraphRagPaper.pdf,paragraph,True,"[Topic 4, Topic 2, Topic 3]"
24,"et al., 2023) excel at sensemaking in complex ...",notebook/example/GraphRagPaper.pdf,paragraph,True,"[Topic 1, Topic 2, Topic 3]"
25,"Bhargava, P., Bhosale, S., et al. (2023). Llam...",notebook/example/GraphRagPaper.pdf,paragraph,True,"[Topic 4, Topic 2, Topic 3]"
30,"Zhang, Y., Zhang, Y., Gan, Y., Yao, L., and Wa...",notebook/example/GraphRagPaper.pdf,paragraph,True,"[Topic 4, Topic 2, Topic 3]"
43,are risks to downstream sensemaking and decisi...,notebook/example/GraphRagPaper.pdf,paragraph,True,"[Topic 3, Topic 2, Topic 5]"
54,community summaries provide global description...,notebook/example/GraphRagPaper.pdf,paragraph,True,"[Topic 1, Topic 2, Topic 3]"


In [64]:
chunk_classification["content_type"].value_counts(normalize=False)

content_type
paragraph         90
table             17
figure_caption     4
header             1
other              1
list               1
Name: count, dtype: int64

In [63]:
chunk_classification["content_type"].value_counts(normalize=True)

content_type
paragraph         0.789474
table             0.149123
figure_caption    0.035088
header            0.008772
other             0.008772
list              0.008772
Name: proportion, dtype: float64
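
The two cells above report counts and proportions separately. As a sketch, both can be produced in one pass. `count_with_share` is a hypothetical helper written with the standard library only.

```python
from collections import Counter


def count_with_share(labels):
    """Return (label, count, share) tuples, most frequent label first."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(label, n, n / total) for label, n in counts.most_common()]
```

For example, `count_with_share(chunk_classification["content_type"])` would pair each content type with its count and its fraction of all chunks.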

Before running the notebook, update `pdf_folder` and `output_folder` in the `params` cell so they point to your own PDF folder and output location.