# **graphrag_tagger Demo**

*A lightweight toolkit for extracting topics from PDFs and visualizing their connections using graphs.*

## **1. Installation & Setup**
Ensure you have Python installed, then install the package locally:

```bash
pip install graphrag-tagger
```

In [2]:
#pip install graphrag-tagger

Install the **core dependencies**:
- **PyMuPDF** – Extracts text from PDF files.
- **scikit-learn & ktrain** – Performs topic modeling.
- **LLM Client** – Enhances and refines extracted topics.
- **networkx** – Constructs and analyzes graphs.

```bash
pip install pymupdf scikit-learn ktrain llm networkx pytest
```

In [3]:
#pip install pymupdf scikit-learn ktrain llm networkx pytest

Load all the dependencies

## **2. Basic Usage**

Load a sample PDF document

In [1]:
#import the tagger module from the graphrag_tagger package
from graphrag_tagger import tagger

# Sample PDF document: "Graph Retrieval-Augmented Generation: A Survey", available at https://arxiv.org/pdf/2408.08921v2
# Define sample parameters
params = {
    'pdf_folder': 'Sample PDf',  # update path to your PDF folder
    'chunk_size': 256,
    'chunk_overlap': 25,
    'n_components': None,
    'n_features': 512,
    'min_df': 2,
    'max_df': 0.95,
    'llm_model': 'ollama:qwen2.5',
    'output_folder': 'results',  # update path to your output folder
    'model_choice': 'sk' # kt for ktrain or sk for scikit-learn
}




If you hit a ```ModuleNotFoundError``` for ```distutils``` on Python 3.12 or later, see [Fix ```distutils``` ModuleNotFoundError](distutilsModuleErrorFix.md).
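The `min_df` and `max_df` values in `params` follow the scikit-learn vectorizer convention: an integer is an absolute document count, a float is a fraction of all documents. The sketch below reproduces that filtering rule in plain Python so its effect is easy to see. The function name `filter_vocabulary` and the toy corpus are illustrative assumptions; the package's real pipeline presumably delegates this step to scikit-learn.

```python
from collections import Counter


def filter_vocabulary(docs, min_df=2, max_df=0.95):
    """Keep terms whose document frequency lies between min_df and max_df.

    An int is an absolute document count, a float is a fraction of all
    documents (the scikit-learn vectorizer convention).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing the term at least once.
    df = Counter(term for doc in docs for term in set(doc.lower().split()))
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(term for term, count in df.items() if low <= count <= high)


# Hypothetical toy corpus, not taken from the sample PDF.
docs = [
    "graph retrieval for llm",
    "graph indexing for retrieval",
    "graph neural networks",
    "graph based generation for llm",
]
vocab = filter_vocabulary(docs, min_df=2, max_df=0.95)
print(vocab)  # ['for', 'llm', 'retrieval']
```

Here `graph` is dropped because it appears in every document (above 95%), and the words that occur in only one document fall below `min_df=2`.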

In [25]:
# Create output folder if it doesn't exist
import os
os.makedirs(params['output_folder'], exist_ok=True)

# Run the tagging pipeline
tagger.main(params)

['Graph Retrieval-Augmented Generation: A Survey\nBOCI PENG∗, School of Intelligence Science and Technology, Peking University, China\nYUN ZHU∗, College of Computer Science and Technology, Zhejiang University, China\nYONGCHAO LIU, Ant Group, China\nXIAOHE BO, Gaoling School of Artificial Intelligence, Renmin University of China, China\nHAIZHOU SHI, Rutgers University, US\nCHUNTAO HONG, Ant Group, China\nYAN ZHANG†, School of Intelligence Science and Technology, Peking University, China\nSILIANG TANG, College of Computer Science and Technology, Zhejiang University, China\nRecently, Retrieval-Augmented Generation (RAG) has achieved remarkable success in addressing the challenges\nof Large Language Models (LLMs) without necessitating retraining. By referencing an external knowledge\nbase, RAG refines LLM outputs, effectively mitigating issues such as “hallucination”, lack of domain-specific\nknowledge, and outdated information. However, the complex structure of relationships among differe

Generating Tags: 100%|██████████| 191/191 [26:56<00:00,  8.46s/it]

Saved 191 chunk files to results
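The run above split the sample PDF into 191 chunks. As a rough illustration of how `chunk_size` and `chunk_overlap` interact, here is a minimal sliding-window chunker. It assumes sizes are counted in whitespace-separated words; the package may count tokens or characters instead, and `chunk_words` is a hypothetical helper, not part of `graphrag_tagger`.

```python
def chunk_words(text, chunk_size=256, chunk_overlap=25):
    """Split text into word windows of chunk_size that share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical input: 600 numbered words standing in for extracted PDF text.
text = " ".join(f"w{i}" for i in range(600))
chunks = chunk_words(text, chunk_size=256, chunk_overlap=25)
print(len(chunks))           # 3
print(chunks[1].split()[0])  # w231
```

Each window starts `chunk_size - chunk_overlap` words after the previous one, so the last 25 words of one chunk reappear as the first 25 words of the next.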





You can use any LLM available online; this demo uses Ollama with the qwen2.5 model.

If you get an ```Ollama``` ```ConnectionError```, install and set up Ollama from the [Ollama Repo](https://github.com/ollama/ollama), then download the **qwen2.5** model used in this demo:
```bash
ollama pull qwen2.5
```
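Before starting a long tagging run, it can help to confirm that the Ollama server is reachable. The sketch below assumes Ollama is listening on its default local address, `http://localhost:11434`; adjust `base_url` if your setup differs. `ollama_is_running` is a hypothetical helper, not part of `graphrag_tagger`.

```python
import urllib.request


def ollama_is_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError):
        # URLError, HTTPError, and socket timeouts are all OSError subclasses;
        # ValueError covers malformed URLs.
        return False


if not ollama_is_running():
    print("Ollama is not reachable; start it with `ollama serve` and retry.")
```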

## **3. Topic Extraction & Refinement**

In [26]:
from glob import glob

files = glob(os.path.join(params["output_folder"], "chunk_*.json"))
len(files)

191

In [27]:
import json

# Load every chunk file, closing each file handle after reading
raws: list[dict] = []
for file in files:
    with open(file, encoding='utf-8') as f:
        raws.append(json.load(f))

for i, raw in enumerate(raws):
    if "chunk" not in raw:
        print(i, raw)
raw

{'chunk': 'scenarios such as intelligence report generation [139], patent phrase similarity detection [133] and software understanding [1]. Ranade and Joshi [139] first construct an Event Plot Graph (EPG) and retrieve the critical aspects of the events to aid the generation of intelligence reports. Peng and Yang [133] create a patent-phrase graph and retrieve the ego network of the given patent phrase to assist the judgment of phrase similarity. Alhanahnah et al. [1] propose a Chatbot to understand properties about dependencies in a given software package, which first automatically constructs the dependency graph and then the user can ask questions about the dependencies in the dependency graph.\n9.3\nBenchmarks and Metrics 9.3.1\nBenchmarks. The benchmarks used to evaluate the performance of the GraphRAG system can be divided into two categories. The first category is the corresponding datasets of downstream tasks. We summarize the benchmarks and papers tested with them according to t

In [28]:
import json

# Load every chunk file, closing each file handle after reading
raws: list[dict] = []
for file in files:
    with open(file, encoding='utf-8') as f:
        raws.append(json.load(f))
raws = [{
    "chunk": raw["chunk"],
    "source_file": raw["source_file"],
    "chunk_file": file,
    **raw["classification"],
} for file, raw in zip(files, raws)]

print(raws[0].keys(),"\n")
print(raws[0]["chunk"])

dict_keys(['chunk', 'source_file', 'chunk_file', 'content_type', 'is_sufficient', 'topics']) 

Graph Retrieval-Augmented Generation: A Survey
BOCI PENG∗, School of Intelligence Science and Technology, Peking University, China
YUN ZHU∗, College of Computer Science and Technology, Zhejiang University, China
YONGCHAO LIU, Ant Group, China
XIAOHE BO, Gaoling School of Artificial Intelligence, Renmin University of China, China
HAIZHOU SHI, Rutgers University, US
CHUNTAO HONG, Ant Group, China
YAN ZHANG†, School of Intelligence Science and Technology, Peking University, China
SILIANG TANG, College of Computer Science and Technology, Zhejiang University, China
Recently, Retrieval-Augmented Generation (RAG) has achieved remarkable success in addressing the challenges of Large Language Models (LLMs) without necessitating retraining. By referencing an external knowledge base, RAG refines LLM outputs, effectively mitigating issues such as “hallucination”, lack of domain-specific knowledge, and ou

In [31]:
import pandas as pd

chunk_classification = pd.DataFrame(raws)

chunk_classification

| | chunk | source_file | chunk_file | content_type | is_sufficient | topics |
|---|---|---|---|---|---|---|
| 0 | Graph Retrieval-Augmented Generation: A Survey... | Sample PDf\Graph Retrieval-Augmented Generatio... | results\chunk_1.json | paragraph | True | [Natural Language Processing, Retrieval Method... |
| 1 | begin by introducing the GraphRAG workflow, al... | Sample PDf\Graph Retrieval-Augmented Generatio... | results\chunk_10.json | paragraph | True | [Natural Language Processing, Retrieval Method... |
| 2 | marks specifically designed for the GraphRAG s... | Sample PDf\Graph Retrieval-Augmented Generatio... | results\chunk_100.json | paragraph | True | [Natural Language Processing, Knowledge Graphs... |
| 3 | 111:28\nPeng et al.\nLLMs with graphs, which c... | Sample PDf\Graph Retrieval-Augmented Generatio... | results\chunk_101.json | paragraph | True | [Natural Language Processing, Knowledge Graphs... |
| 4 | such as QA systems, metrics like BLEU, ROUGE-L... | Sample PDf\Graph Retrieval-Augmented Generatio... | results\chunk_102.json | paragraph | True | [Natural Language Processing, Retrieval Method... |
| ... | ... | ... | ... | ... | ... | ... |
| 186 | application scenarios for its outstanding abil... | Sample PDf\Graph Retrieval-Augmented Generatio... | results\chunk_95.json | paragraph | True | [Natural Language Processing, Knowledge Graphs... |
| 187 | 9.2.2\nBiomedical. Recently, GraphRAG techniqu... | Sample PDf\Graph Retrieval-Augmented Generatio... | results\chunk_96.json | paragraph | True | [Natural Language Processing, Graph Neural Net... |
| 188 | Graph Retrieval-Augmented Generation: A Survey... | Sample PDf\Graph Retrieval-Augmented Generatio... | results\chunk_97.json | paragraph | True | [Knowledge Graphs, Retrieval Methods] |
| 189 | academic exploration, including predicting pot... | Sample PDf\Graph Retrieval-Augmented Generatio... | results\chunk_98.json | paragraph | True | [Knowledge Graphs, Natural Language Processing] |
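In the output above, the chunk files appear in lexicographic order (`chunk_1`, `chunk_10`, `chunk_100`, ...) rather than in document order. If you need the chunks in their original sequence, one option is to sort by the number embedded in the file name. The helper `chunk_index` and the example file names below are illustrative assumptions based on the `chunk_N.json` naming seen in this demo.

```python
import re


def chunk_index(path):
    """Return the integer N from a path ending in chunk_N.json."""
    match = re.search(r"chunk_(\d+)\.json$", path)
    if match is None:
        raise ValueError(f"not a chunk file: {path}")
    return int(match.group(1))


# Hypothetical file names in the lexicographic order seen above.
example_files = [
    "results/chunk_1.json",
    "results/chunk_10.json",
    "results/chunk_100.json",
    "results/chunk_2.json",
]
ordered = sorted(example_files, key=chunk_index)
print(ordered)
# ['results/chunk_1.json', 'results/chunk_2.json',
#  'results/chunk_10.json', 'results/chunk_100.json']
```

In the notebook, the same key can be applied right after the `glob` call, for example `files = sorted(files, key=chunk_index)`.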


In [30]:
from graphrag_tagger.chat.llm import LLM, LLMService
import numpy as np

# Initialize the LLM you want to use
llm_service = LLMService(model="ollama:qwen2.5")
llm = LLM(llm_service=llm_service) 
with open("results/topics.json", encoding='utf-8') as f:
    extracted_topics: dict = json.load(f)
extracted_topics["topics"]

# Example usage with extracted topics
np.unique(chunk_classification["topics"].explode().tolist())

array(['Biomedical', 'Conference Papers 2019', 'Conference Papers 2021',
       'Conference Papers 2024', 'Graph Attention Networks',
       'Graph Languages', 'Graph Neural Networks', 'Graph-Based Indexing',
       'Graph-Guided Retrieval', 'GraphRAG', 'Information Systems',
       'Knowledge Graphs', 'Large Language Models',
       'Natural Language Processing', 'Query Retrieval',
       'Retrieval Methods', 'Semantic Web', 'Topic 1', 'Topic 2',
       'Topic 3', 'Topic 4', 'Topic 5', 'Topic 6', 'Topic 7', 'Topic 9',
       'Training Techniques'], dtype='<U27')
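The unique list above shows which topics exist, but not how often each one was assigned. A frequency count makes it easier to spot dominant topics and generic labels such as `Topic 1` that may deserve another refinement pass. The sketch below uses a small hypothetical sample shaped like the `topics` column rather than the real results.

```python
from collections import Counter

# Hypothetical topic lists, shaped like the `topics` column above.
topic_lists = [
    ["Natural Language Processing", "Retrieval Methods"],
    ["Natural Language Processing", "Knowledge Graphs"],
    ["Knowledge Graphs", "Retrieval Methods"],
    ["Knowledge Graphs", "Natural Language Processing"],
]

# Flatten the per-chunk lists and count each topic once per assignment.
topic_counts = Counter(topic for topics in topic_lists for topic in topics)
for topic, count in topic_counts.most_common():
    print(f"{count:3d}  {topic}")
```

On the real DataFrame, `chunk_classification["topics"].explode().value_counts()` should give the equivalent counts.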

In [32]:
llm.classify(document_chunk=raws[0]["chunk"],topics=raws[0]["topics"])


{'content_type': 'paragraph',
 'is_sufficient': True,
 'topics': ['Natural Language Processing', 'Retrieval Methods']}

## **4. Graph Construction & Visualization**


In [None]:
import os
from graphrag_tagger.build_graph import process_graph



# Define sample input and output folders
input_folder = (
    "graphrag-tagger/examples/results"  # update this path to your folder containing JSON files
)
output_folder = (
    "graphrag-tagger/examples/graph_outputs"    # update this path where you want your result to be stored
)

# Create output folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Process graph with a specified threshold percentile (e.g., 97.5)
graph = process_graph(
    input_folder,
    output_folder,
    threshold_percentile=97.5,
    content_type_filter="paragraph", # Use only paragraph for chunk graph builder
)


# The pruned graph is returned in 'graph'; the connected components map is saved to the output folder.
print("Graph processing completed.")


For reference, the source of `process_graph`:
def process_graph(
    input_folder: str,
    output_folder: str,
    threshold_percentile: float = 97.5,
    content_type_filter="",
):
    print("Processing graph...")
    pattern = "chunk_*.json"
    if threshold_percentile < 1:
        threshold_percentile *= 100
    os.makedirs(output_folder, exist_ok=True)

    graph_manager = GraphManager()
    raws = graph_manager.load_raw_files(input_folder, pattern, content_type_filter)
    scores_map = graph_manager.compute_scores(raws)
    G = graph_manager.build_graph(raws, scores_map)
    G_pruned = graph_manager.prune_graph(G, threshold_percentile)
    component_map = graph_manager.update_graph_components(G_pruned)

    output_file = os.path.join(output_folder, "connected_components.json")
    with open(output_file, "w") as f:
        json.dump(component_map, f)
    print(f"Connected components map saved to {output_file}")
    print("Graph processing complete.")
    return G_pruned
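The steps in `process_graph` (score, build, prune by percentile, find connected components) can be illustrated with a small standard-library sketch. Everything here is an assumption made for illustration: the edge weight is simply the number of shared topics, the cutoff uses a nearest-rank percentile, and `prune_and_group` is a hypothetical function. The actual `GraphManager` scoring and pruning logic is not shown in this demo and may differ.

```python
import math
from itertools import combinations


def prune_and_group(chunk_topics, threshold_percentile=97.5):
    """Link chunks by shared topics, drop weak edges, return connected components."""
    # Assumed score: number of topics two chunks have in common.
    edges = {}
    for a, b in combinations(sorted(chunk_topics), 2):
        shared = len(set(chunk_topics[a]) & set(chunk_topics[b]))
        if shared:
            edges[(a, b)] = shared

    # Nearest-rank percentile of the edge weights; keep edges at or above it.
    kept = []
    if edges:
        weights = sorted(edges.values())
        rank = max(1, math.ceil(threshold_percentile / 100 * len(weights)))
        cutoff = weights[rank - 1]
        kept = [pair for pair, weight in edges.items() if weight >= cutoff]

    # Union-find over the surviving edges.
    parent = {node: node for node in chunk_topics}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in kept:
        parent[find(a)] = find(b)

    components = {}
    for node in chunk_topics:
        components.setdefault(find(node), []).append(node)
    return sorted(sorted(group) for group in components.values())


# Hypothetical chunk-to-topics mapping.
chunk_topics = {
    "chunk_1": ["NLP", "Retrieval"],
    "chunk_2": ["NLP", "Retrieval"],
    "chunk_3": ["NLP", "Graphs"],
    "chunk_4": ["Graphs", "GNN"],
    "chunk_5": ["Graphs", "GNN"],
}
print(prune_and_group(chunk_topics, threshold_percentile=75.0))
# [['chunk_1', 'chunk_2'], ['chunk_3'], ['chunk_4', 'chunk_5']]
```

With a 75th-percentile cutoff, only the pairs sharing two topics stay connected, so `chunk_3` ends up isolated. A lower threshold keeps the weaker single-topic edges and merges everything into one component.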

