# Example Notebook for Graphrag-Tagger

This notebook demonstrates how to use the Graphrag-Tagger tool. It shows how to import the module, run the tagging pipeline on a folder of PDFs, and inspect the resulting per-chunk classifications.

In [1]:
%cd ..

/data/home/eak/learning/nganga_ai/graphrag-tagger


In [None]:
import os
from graphrag_tagger import tagger

# Define sample parameters
params = {
    'pdf_folder': 'notebook/example',  # update path to your PDF folder
    'chunk_size': 256,
    'chunk_overlap': 25,
    'n_components': 15, # set to None for model suggestion
    'n_features': 512,
    'min_df': 2,
    'max_df': 0.95,
    'llm_model': 'ollama:qwen2.5',  # Qwen2.5 served via Ollama (the default tag is the 7B model)
    'output_folder': 'notebook/example/results',  # update path to your output folder
    'model_choice': 'kt' # kt for ktrain or sk for scikit-learn
}

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
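
To make the `chunk_size` and `chunk_overlap` settings concrete, the sketch below splits a token list into overlapping windows. It is an illustration only: it assumes whitespace-separated tokens, and `split_with_overlap` is a hypothetical helper, so the actual splitting inside graphrag_tagger may differ.

```python
def split_with_overlap(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Hypothetical helper: slide a window of chunk_size tokens,
    advancing by chunk_size - chunk_overlap tokens each step."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the current window already reaches the end of the text
    return chunks


# Toy input; the notebook itself uses chunk_size=256 and chunk_overlap=25.
tokens = "a b c d e f g h i j".split()
chunks = split_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last `chunk_overlap` tokens of the previous one, which is why a larger overlap produces more chunks for the same document.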


In [3]:
# Create output folder if it doesn't exist
os.makedirs(params['output_folder'], exist_ok=True)

# Run the tagging pipeline
tagger.main(params)

# The results will be saved in the specified output folder.

['From Local to Global: A GraphRAG Approach to\nQuery-Focused Summarization\nDarren Edge1†\nHa Trinh1†\nNewman Cheng2\nJoshua Bradley2\nAlex Chao3\nApurva Mody3\nSteven Truitt2\nDasha Metropolitansky1\nRobert Osazuwa Ness1\nJonathan Larson1\n1Microsoft Research\n2Microsoft Strategic Missions and Technologies\n3Microsoft Office of the CTO\n{daedge,trinhha,newmancheng,joshbradley,achao,moapurva,\nsteventruitt,dasham,robertness,jolarso}@microsoft.com\n†These authors contributed equally to this work\nAbstract\nThe use of retrieval-augmented generation (RAG) to retrieve relevant informa-\ntion from an external knowledge source enables large language models (LLMs)\nto answer questions over private and/or previously unseen document collections.\nHowever, RAG fails on global questions directed at an entire text corpus, such\nas “What are the main themes in the dataset?”, since this is inherently a query-\nfocused summarization (QFS) task, rather than an explicit retrieval task. Prior', 'focuse

Generating Tags: 100%|██████████| 81/81 [02:10<00:00,  1.61s/it]

Saved 81 chunk files to notebook/example/results





In [4]:
from glob import glob
import os

files = glob(os.path.join(params["output_folder"], "chunk_*.json"))
len(files)

81
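
`glob` returns paths in arbitrary order, so anything built from `files` is not ordered by chunk number. A minimal sketch for sorting numerically, assuming the file names follow a `chunk_<n>.json` pattern; `chunk_index` is a hypothetical helper, not part of graphrag_tagger.

```python
import os
import re


def chunk_index(path: str) -> int:
    """Hypothetical helper: extract n from a file name like chunk_<n>.json."""
    match = re.search(r"chunk_(\d+)\.json$", os.path.basename(path))
    if match is None:
        raise ValueError(f"unexpected chunk file name: {path}")
    return int(match.group(1))


# Example paths in the kind of arbitrary order glob may return.
example_files = [
    "notebook/example/results/chunk_23.json",
    "notebook/example/results/chunk_9.json",
    "notebook/example/results/chunk_53.json",
]
sorted_files = sorted(example_files, key=chunk_index)
print(sorted_files)
```

In the notebook, `files = sorted(files, key=chunk_index)` would apply the same ordering before the files are loaded. A plain `sorted(files)` is not enough here, because string sorting places `chunk_10.json` before `chunk_9.json`.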

In [5]:
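import json
from pathlib import Path

# Read each chunk file via pathlib so no file handle is left open.
raws: list[dict] = [json.loads(Path(file).read_text(encoding="utf-8")) for file in files]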

for i, raw in enumerate(raws):
    if "chunk" not in raw:
        print(i, raw)

In [6]:
raw

{'chunk': 'in which we first ask the LLM to assess whether all entities were extracted, using a logit bias of 100 to force a yes/no decision. If the LLM responds that entities were missed, then a continuation indicating that “MANY entities were missed in the last extraction” encourages the LLM to detect these missing entities. This approach allows us to use larger chunk sizes without a drop in quality (Figure 3) or the forced introduction of noise. We interate self-reflection steps up to a specified maximum number of times.\n0 1 2 3 0 10000 20000 30000\nNumber of self-reflection iterations performed\nEntity references detected 600 chunk size 1200 chunk size 2400 chunk size\nFigure 3: How the entity references detected in the HotPotQA dataset (Yang et al., 2018) varies with chunk size and self-reflection iterations for our generic entity extraction prompt with gpt-4-turbo.\nB\nExample Community Detection 18',
 'source_file': 'notebook/example/GraphRagPaper.pdf',
 'classification': {'con

In [7]:
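import json
from pathlib import Path

# Reload the chunk files, then flatten each record so the classification fields become top-level keys.
raws: list[dict] = [json.loads(Path(file).read_text(encoding="utf-8")) for file in files]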
raws = [{
    "chunk": raw["chunk"],
    "source_file": raw["source_file"],
    "chunk_file": file,
    **raw["classification"],
} for file, raw in zip(files, raws)]

raws[0].keys()

dict_keys(['chunk', 'source_file', 'chunk_file', 'content_type', 'is_sufficient', 'topics'])

In [8]:
print(raws[19]["chunk"])

- target entity: name of the target entity, as identified in step 1 - relationship description: explanation as to why you think the source entity and the target entity are related to each other - relationship strength: a numeric score indicating strength of the relationship between the source entity and target entity
Format each relationship as ("relationship"{tuple delimiter}<source entity>{tuple delimiter}<target entity>{tuple delimiter}<relationship description>{tuple delimiter}<relationship strength>) 3.
Return output in English as a single list of all the entities and relationships identified in steps 1 and 2.
Use **{record delimiter}** as the list delimiter.
4.
When finished, output {completion delimiter} ---Examples--Entity types:
ORGANIZATION,PERSON
Input:
The Fed is scheduled to meet on Tuesday and Wednesday, with the central bank planning to release its latest policy decision on Wednesday at 2:00 p.m.
ET, followed by a press conference where Fed Chair
Jerome Powell will take 

In [9]:
print(raws[19])

{'chunk': '- target entity: name of the target entity, as identified in step 1 - relationship description: explanation as to why you think the source entity and the target entity are related to each other - relationship strength: a numeric score indicating strength of the relationship between the source entity and target entity\nFormat each relationship as ("relationship"{tuple delimiter}<source entity>{tuple delimiter}<target entity>{tuple delimiter}<relationship description>{tuple delimiter}<relationship strength>) 3.\nReturn output in English as a single list of all the entities and relationships identified in steps 1 and 2.\nUse **{record delimiter}** as the list delimiter.\n4.\nWhen finished, output {completion delimiter} ---Examples--Entity types:\nORGANIZATION,PERSON\nInput:\nThe Fed is scheduled to meet on Tuesday and Wednesday, with the central bank planning to release its latest policy decision on Wednesday at 2:00 p.m.\nET, followed by a press conference where Fed Chair\nJer

In [10]:
print(raws[19]["source_file"])

notebook/example/GraphRagPaper.pdf


In [11]:
import pandas as pd

chunk_classification = pd.DataFrame(raws)

chunk_classification

index,chunk,source_file,chunk_file,content_type,is_sufficient,topics
0,Conditions\nWe compared six conditions includi...,notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_23.json,paragraph,True,"[Topic 4, Topic 2]"
1,"arXiv:2212.10509.\nWang, J., Liang, Y., Meng, ...",notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_53.json,citation,True,"[Topic 6, Topic 12]"
2,"Klein, G., Moon, B., and Hoffman, R. R. (2006)...",notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_42.json,citation,True,"[Topic 6, Topic 4]"
3,are risks to downstream sensemaking and decisi...,notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_32.json,paragraph,True,"[Topic 1, Topic 4, Topic 9]"
4,Acknowledgements\nWe would also like to thank ...,notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_33.json,acknowledgements,True,[Topic 5]
...,...,...,...,...,...,...
76,and personas from large language models: Inves...,notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_49.json,citation,True,"[Topic 6, Topic 12]"
77,the LLM aligned with the winner based on the c...,notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_30.json,paragraph,True,"[Topic 6, Topic 9]"
78,"to ensure relevance, diversity, and alignment ...",notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_12.json,paragraph,True,"[Topic 4, Topic 6]"
79,"100–110, New York, NY, USA. Association for Co...",notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_38.json,citation,True,"[Topic 4, Topic 12]"


In [12]:
chunk_classification["topics"].apply(lambda x: len(x) if x else 0).describe()

count    81.000000
mean      2.308642
std       0.539490
min       1.000000
25%       2.000000
50%       2.000000
75%       3.000000
max       3.000000
Name: topics, dtype: float64
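
The summary above counts topics per chunk; a complementary view is how often each topic label occurs across chunks. A minimal sketch using only the standard library, assuming each `topics` value is a list of label strings. The three sample records are taken from the DataFrame output above and stand in for the full `raws` list.

```python
from collections import Counter

# Three records from the DataFrame output above; stand-ins for the full `raws` list.
sample_records = [
    {"chunk_file": "notebook/example/results/chunk_23.json", "topics": ["Topic 4", "Topic 2"]},
    {"chunk_file": "notebook/example/results/chunk_53.json", "topics": ["Topic 6", "Topic 12"]},
    {"chunk_file": "notebook/example/results/chunk_42.json", "topics": ["Topic 6", "Topic 4"]},
]

topic_counts = Counter(
    topic
    for record in sample_records
    for topic in (record["topics"] or [])  # tolerate a missing or empty topic list
)

# Print labels from most to least frequent.
for topic, count in topic_counts.most_common():
    print(topic, count)
```

Replacing `sample_records` with `raws` would give the counts for all 81 chunks.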

In [13]:
chunk_classification[chunk_classification["topics"].apply(len) == 0]

index,chunk,source_file,chunk_file,content_type,is_sufficient,topics


In [14]:
print(raws[36]["chunk"])

Zhang, Y., Zhang, Y., Gan, Y., Yao, L., and Wang, C. (2024a). Causal graph discovery with retrievalaugmented generation based large language models. arXiv preprint arXiv:2402.15301.
Zhang, Z., Chen, J., and Yang, D. (2024b). Darg: Dynamic evaluation of large language models via adaptive reasoning graph. arXiv preprint arXiv:2406.17271.
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing,
E., et al. (2024). Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural
Information Processing Systems, 36.
Zhu, Y., Wang, X., Chen, J., Qiao, S., Ou, Y., Yao, Y., Deng, S., Chen, H., and Zhang, N. (2024).
Llms for knowledge graph construction and reasoning: Recent capabilities and future opportunities.
17


In [15]:
chunk_classification[chunk_classification["topics"].apply(len) == 1]

index,chunk,source_file,chunk_file,content_type,is_sufficient,topics
4,Acknowledgements\nWe would also like to thank ...,notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_33.json,acknowledgements,True,[Topic 5]
34,the topic without being misled or making falla...,notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_81.json,paragraph,True,[Topic 6]
69,"Jacomy, M., Venturini, T., Heymann, S., and Ba...",notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_40.json,citation,True,[Topic 12]


In [16]:
print(raws[42]["chunk"])

---Goal--Write a comprehensive report of a community, given a list of entities that belong to the community as well as their relationships and optional associated claims.
The report will be used to inform decision-makers about information associated with the community and their potential impact.
The content of this report includes an overview of the community’s key entities, their legal compliance, technical capabilities, reputation, and noteworthy claims.
---Report Structure--The report should include the following sections: - TITLE: community’s name that represents its key entities - title should be short but specific.
When possible, include representative named entities in the title.
- SUMMARY: An executive summary of the community’s overall structure, how its entities are related to each other, and significant information associated with its entities.
- IMPACT SEVERITY RATING: a float score between 0-10 that represents the severity of IMPACT posed by entities within the community.


In [17]:
chunk_classification[chunk_classification["topics"].apply(len) == 2]

index,chunk,source_file,chunk_file,content_type,is_sufficient,topics
0,Conditions\nWe compared six conditions includi...,notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_23.json,paragraph,True,"[Topic 4, Topic 2]"
1,"arXiv:2212.10509.\nWang, J., Liang, Y., Meng, ...",notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_53.json,citation,True,"[Topic 6, Topic 12]"
2,"Klein, G., Moon, B., and Hoffman, R. R. (2006)...",notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_42.json,citation,True,"[Topic 6, Topic 4]"
7,Table 4 contains the results for the average n...,notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_29.json,paragraph,True,"[Topic 8, Topic 9]"
8,level communities are used to generate summari...,notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_19.json,paragraph,True,"[Topic 2, Topic 4]"
9,"tions on Knowledge Discovery from Data (TKDD),...",notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_35.json,citation,True,"[Topic 12, Topic 3]"
10,(a) Root communities at level 0 (b) Sub-commun...,notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_59.json,figure_caption,True,"[Topic 4, Topic 8]"
12,2.2\nUsing Knowledge Graphs with LLMs and RAG\...,notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_9.json,paragraph,True,"[Topic 4, Topic 5]"
13,since duplicates are typically clustered toget...,notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_17.json,paragraph,True,"[Topic 5, Topic 11]"
14,{answer2}\nAssess which answer is better accor...,notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_79.json,paragraph,True,"[Topic 6, Topic 4]"


In [18]:
chunk_classification[chunk_classification["topics"].apply(len) == 3]

index,chunk,source_file,chunk_file,content_type,is_sufficient,topics
3,are risks to downstream sensemaking and decisi...,notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_32.json,paragraph,True,"[Topic 1, Topic 4, Topic 9]"
5,shows example questions for each of the two ev...,notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_22.json,paragraph,True,"[Topic 6, Topic 8, Topic 4]"
6,or as factual grounding for generated outputs ...,notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_10.json,paragraph,True,"[Topic 4, Topic 11, Topic 2]"
11,Artificial Intelligence: 33rd Canadian Confere...,notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_43.json,citation,True,"[Topic 12, Topic 4, Topic 11]"
15,generates a response based on both the query a...,notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_3.json,paragraph,True,"[Topic 4, Topic 8, Topic 6]"
17,community summaries provide global description...,notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_5.json,paragraph,True,"[Topic 1, Topic 2, Topic 4]"
20,retrieval for open-domain question answering. ...,notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_45.json,citation,True,"[Topic 1, Topic 4, Topic 11]"
24,4.1.3\nConfiguration\nWe used a fixed context ...,notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_25.json,paragraph,True,"[Topic 2, Topic 4, Topic 6]"
25,"Su, D., Xu, Y., Yu, T., Siddique, F. B., Barez...",notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_51.json,citation,True,"[Topic 1, Topic 4, Topic 12]"
27,means that questions can be answered using the...,notebook/example/GraphRagPaper.pdf,notebook/example/results/chunk_20.json,paragraph,True,"[Topic 1, Topic 2, Topic 4]"


In [19]:
chunk_classification["content_type"].value_counts(normalize=False)

content_type
paragraph           58
citation            21
acknowledgements     1
figure_caption       1
Name: count, dtype: int64

In [20]:
chunk_classification["content_type"].value_counts(normalize=True)

content_type
paragraph           0.716049
citation            0.259259
acknowledgements    0.012346
figure_caption      0.012346
Name: proportion, dtype: float64
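
With roughly a quarter of the chunks classified as citations, a downstream step may want to keep only body text. A minimal sketch, assuming records shaped like the flattened `raws` entries; `keep_for_indexing` is a hypothetical helper, and which content types to keep is a design choice rather than something the tool prescribes.

```python
def keep_for_indexing(record: dict, allowed_types: frozenset = frozenset({"paragraph"})) -> bool:
    """Hypothetical filter: keep chunks of an allowed content type that are marked sufficient."""
    return record.get("content_type") in allowed_types and bool(record.get("is_sufficient"))


# Records abbreviated from the outputs above (chunk text omitted).
sample_records = [
    {"chunk_file": "notebook/example/results/chunk_23.json", "content_type": "paragraph", "is_sufficient": True},
    {"chunk_file": "notebook/example/results/chunk_53.json", "content_type": "citation", "is_sufficient": True},
    {"chunk_file": "notebook/example/results/chunk_33.json", "content_type": "acknowledgements", "is_sufficient": True},
]

kept = [record["chunk_file"] for record in sample_records if keep_for_indexing(record)]
print(kept)
```

Passing a wider set, for example `frozenset({"paragraph", "figure_caption"})`, keeps additional content types without changing the helper.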

In [21]:
chunk_classification["is_sufficient"].mean()

np.float64(1.0)

Before running the notebook, update `pdf_folder` and `output_folder` in the `params` cell to point at your own paths.