# Academic QA System with GraphRAG

- Author: [Yongdam Kim](https://github.com/dancing-with-coffee)
- Design: 
- Peer Review: 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/99-TEMPLATE/00-BASE-TEMPLATE-EXAMPLE.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/99-TEMPLATE/00-BASE-TEMPLATE-EXAMPLE.ipynb)

## Overview

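This tutorial covers how to build a QA system that uses GraphRAG to draw on the content of research papers and give better answers.

GraphRAG is a system developed by Microsoft that uses a graph to extract both local and global information from text and answer questions with it.

However, the GraphRAG implementation released by Microsoft is not integrated with LangChain, which makes it difficult to use in LangChain workflows.

The langchain-graphrag library addresses this, making it possible to implement GraphRAG within LangChain.

In this tutorial, you will learn how to use langchain-graphrag to build a QA system that answers questions about recent AI papers in detail.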

![GraphRAG](./assets/08-academicqasystem-graphrag-pipeline-.png)

[Paper: From Local to Global: A Graph RAG Approach to Query-Focused Summarization](https://arxiv.org/abs/2404.16130)

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Load arXiv PDFs](#load-arxiv-pdfs)
- [Text Chunking and Text Extracting](#text-chunking-and-text-extracting)
- [Local Search through Knowledge Graph](#local-search-through-knowledge-graph)


### References

----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials.
- You can check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [1]:
%%capture --no-stderr
!pip install langchain-opentutorial

In [17]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain_community",
        "langchain-graphrag",
        "langchain_chroma",
        "jq",
    ],
    verbose=False,
    upgrade=False,
)

In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "08-AcademicQASystem",  # set this to match the tutorial title
    }
)

Environment variables have been set successfully.


You can alternatively set API keys such as `OPENAI_API_KEY` in a `.env` file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.

In [None]:
# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)

## Load arXiv PDFs

This tutorial uses the arXiv dataset. arXiv is a web archive where the latest papers are posted, and every paper is available as a PDF. There is a GitHub repository for obtaining the full collection of PDFs, but the full dataset is about 1 TB and the complete set of PDFs can only be downloaded from AWS, so this tutorial works with a few hand-picked PDFs only.

- Full dataset link: https://github.com/mattbierbaum/arxiv-public-datasets

In [4]:
# Load the GraphRAG paper
from langchain_community.document_loaders import PyPDFLoader

# PyPDFLoader loads the PDF one page at a time (one Document per page).
loader = PyPDFLoader("./data/2404.16130v1.pdf")
docs = loader.load()
print(f"Loaded {len(docs)} documents.")
print(docs[0].page_content)

Loaded 15 documents.
From Local to Global: A Graph RAG Approach to
Query-Focused Summarization
Darren Edge1†Ha Trinh1†Newman Cheng2Joshua Bradley2Alex Chao3
Apurva Mody3Steven Truitt2
Jonathan Larson1
1Microsoft Research
2Microsoft Strategic Missions and Technologies
3Microsoft Office of the CTO
{daedge,trinhha,newmancheng,joshbradley,achao,moapurva,steventruitt,jolarso }
@microsoft.com
†These authors contributed equally to this work
Abstract
The use of retrieval-augmented generation (RAG) to retrieve relevant informa-
tion from an external knowledge source enables large language models (LLMs)
to answer questions over private and/or previously unseen document collections.
However, RAG fails on global questions directed at an entire text corpus, such
as “What are the main themes in the dataset?”, since this is inherently a query-
focused summarization (QFS) task, rather than an explicit retrieval task. Prior
QFS methods, meanwhile, fail to scale to the quantities of text indexed by typi
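
The cell above loads a single paper. If several hand-picked papers are placed in one directory, they could be loaded in a loop. The following is a minimal sketch, assuming the PDFs live under `./data`; `find_pdfs` and `load_pdfs` are hypothetical helper names introduced here for illustration and are not part of the tutorial or of LangChain.

```python
from pathlib import Path


def find_pdfs(data_dir):
    """Return the PDF files directly under data_dir, sorted by path."""
    return sorted(Path(data_dir).glob("*.pdf"))


def load_pdfs(data_dir):
    """Load every PDF under data_dir into one list of page-level Documents."""
    # Imported lazily so that find_pdfs can be used without LangChain installed.
    from langchain_community.document_loaders import PyPDFLoader

    docs = []
    for pdf_path in find_pdfs(data_dir):
        # PyPDFLoader returns one Document per page.
        docs.extend(PyPDFLoader(str(pdf_path)).load())
    return docs
```

With this helper, `docs = load_pdfs("./data")` could stand in for the single-file loader, and the rest of the pipeline would stay the same.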

## Text Chunking and Text Extracting

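In this step, we split the loaded pages into smaller chunks and extract them as **text units**, the basic unit that GraphRAG indexes.

- **Text chunking**: `RecursiveCharacterTextSplitter` splits each page into chunks of up to 512 characters, with a 64-character overlap between adjacent chunks.
- **Text unit extraction**: `TextUnitExtractor` applies the splitter to every document and returns a DataFrame with one row per text unit, holding its `id`, its `document_id`, and its text.

The entities and relationships of the knowledge graph are extracted from these text units in the following steps.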
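
To make the effect of the `chunk_size` and `chunk_overlap` settings used in the next cell concrete, here is a simplified sliding-window chunker. It is only a sketch: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, which this plain fixed-width version does not do. `chunk_with_overlap` is a hypothetical name introduced for illustration.

```python
def chunk_with_overlap(text, chunk_size=512, chunk_overlap=64):
    """Split text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence cut at a chunk boundary still appears whole in at least one of the two chunks when it is shorter than the overlap.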

In [5]:
from langchain_core.documents import Document
from langchain_graphrag.indexing import TextUnitExtractor
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
text_unit_extractor = TextUnitExtractor(text_splitter=splitter)

df_text_units = text_unit_extractor.run(docs)
df_text_units

Extracting text units ...: 100%|██████████| 6/6 [00:00<00:00, 9616.29it/s]
Extracting text units ...: 100%|██████████| 8/8 [00:00<00:00, 18137.53it/s]
Extracting text units ...: 100%|██████████| 8/8 [00:00<00:00, 46218.23it/s]
Extracting text units ...: 100%|██████████| 10/10 [00:00<00:00, 48099.82it/s]
Extracting text units ...: 100%|██████████| 7/7 [00:00<00:00, 21016.56it/s]
Extracting text units ...: 100%|██████████| 8/8 [00:00<00:00, 80854.05it/s]
Extracting text units ...: 100%|██████████| 9/9 [00:00<00:00, 24656.26it/s]
Extracting text units ...: 100%|██████████| 11/11 [00:00<00:00, 17403.75it/s]
Extracting text units ...: 100%|██████████| 8/8 [00:00<00:00, 28777.39it/s]
Extracting text units ...: 100%|██████████| 10/10 [00:00<00:00, 32870.72it/s]
Extracting text units ...: 100%|██████████| 10/10 [00:00<00:00, 25055.58it/s]
Extracting text units ...: 100%|██████████| 8/8 [00:00<00:00, 17015.43it/s]
Extracting text units ...: 100%|██████████| 9/9 [00:00<00:00, 28575.88it/s]
Extra

| | document_id | id | text_unit |
|---|---|---|---|
| 0 | 4baf61e9-025e-40cd-ab25-3e750911c944 | ad4d9e3d-01eb-4371-ab56-9dbe90016343 | From Local to Global: A Graph RAG Approach to\... |
| 1 | 4baf61e9-025e-40cd-ab25-3e750911c944 | 1e5f1c3e-fea6-4455-a8ce-2b0fe6e37c85 | tion from an external knowledge source enables... |
| 2 | 4baf61e9-025e-40cd-ab25-3e750911c944 | 6044058e-5e5a-496c-8ae0-ff2a4084b992 | RAG systems. To combine the strengths of these... |
| 3 | 4baf61e9-025e-40cd-ab25-3e750911c944 | a754d92e-32ee-4c39-a67a-769d19ee8bcd | question, each community summary is used to ge... |
| 4 | 4baf61e9-025e-40cd-ab25-3e750911c944 | 55fcccb4-a688-4221-a91b-c568cc36184e | approaches is forthcoming at https://aka .ms/g... |
| ... | ... | ... | ... |
| 117 | 09b5004a-9dfe-4a6c-a135-5043bb690dea | 62a9834f-b922-4095-bafe-c40f22075213 | with chain-of-thought reasoning for knowledge-... |
| 118 | 09b5004a-9dfe-4a6c-a135-5043bb690dea | 8135ef7c-e4b7-45dd-8be4-336ee288e05c | Wang, Y ., Lipka, N., Rossi, R. A., Siu, A., Z... |
| 119 | 09b5004a-9dfe-4a6c-a135-5043bb690dea | 483df695-ef32-476f-a086-62181071958a | Empirical Methods in Natural Language Processi... |
| 120 | a2a3ed19-29b3-4409-b897-38a5d5792482 | ce6f2893-8e3f-4266-a600-f2b6abb07067 | Yao, L., Peng, J., Mao, C., and Luo, Y . (2023... |


### Entity Relationship Extraction

GraphRAG extracts entities and relationships from the chunked text and automatically builds a knowledge graph from them.

An LLM is used to build the knowledge graph; this tutorial uses gpt-4o-mini for speed and cost reasons. The LLM extracts entities and relationships using a predefined prompt.

In [6]:
# Takes about 16 minutes
from langchain_graphrag.indexing.graph_generation import EntityRelationshipExtractor
from langchain_openai import ChatOpenAI

er_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)

# A static method is provided to build the default extractor
extractor = EntityRelationshipExtractor.build_default(llm=er_llm)
text_unit_graphs = extractor.invoke(df_text_units)

Extracting entities and relationships ...: 100%|██████████| 122/122 [16:57<00:00,  8.34s/it]


In [7]:
for index, g in enumerate(text_unit_graphs):
    print("---------------------------------")
    print(f"Graph: {index}")
    print(f"Number of nodes - {len(g.nodes)}")
    print(f"Number of edges - {len(g.edges)}")
    print(g.nodes())
    print(g.edges())
    print("---------------------------------")

---------------------------------
Graph: 0
Number of nodes - 11
Number of edges - 12
['DARREN EDGE', 'HA TRINH', 'NEWMAN CHENG', 'JOSHUA BRADLEY', 'ALEX CHAO', 'APURVA MODY', 'STEVEN TRUITT', 'JONATHAN LARSON', 'MICROSOFT RESEARCH', 'MICROSOFT STRATEGIC MISSIONS AND TECHNOLOGIES', 'MICROSOFT OFFICE OF THE CTO']
[('DARREN EDGE', 'MICROSOFT RESEARCH'), ('DARREN EDGE', 'HA TRINH'), ('HA TRINH', 'MICROSOFT RESEARCH'), ('NEWMAN CHENG', 'MICROSOFT STRATEGIC MISSIONS AND TECHNOLOGIES'), ('NEWMAN CHENG', 'JOSHUA BRADLEY'), ('JOSHUA BRADLEY', 'MICROSOFT STRATEGIC MISSIONS AND TECHNOLOGIES'), ('ALEX CHAO', 'MICROSOFT OFFICE OF THE CTO'), ('ALEX CHAO', 'APURVA MODY'), ('APURVA MODY', 'MICROSOFT OFFICE OF THE CTO'), ('STEVEN TRUITT', 'MICROSOFT STRATEGIC MISSIONS AND TECHNOLOGIES'), ('STEVEN TRUITT', 'JONATHAN LARSON'), ('JONATHAN LARSON', 'MICROSOFT RESEARCH')]
---------------------------------
---------------------------------
Graph: 1
Number of nodes - 3
Number of edges - 2
['LARGE LANGUAGE MOD

In [8]:
# Look up one of the extracted entities (nodes)
text_unit_graphs[0].nodes["MICROSOFT RESEARCH"]

{'type': 'ORGANIZATION',
 'description': ['Microsoft Research is a division of Microsoft focused on advanced research and development in various fields, including AI and machine learning.'],
 'text_unit_ids': ['ad4d9e3d-01eb-4371-ab56-9dbe90016343']}

In [9]:
# Inspect an edge (relationship) between two extracted entities
text_unit_graphs[0].edges[("MICROSOFT RESEARCH", "DARREN EDGE")]

{'weight': 1.0,
 'description': ['Darren Edge is affiliated with Microsoft Research as a researcher'],
 'text_unit_ids': ['ad4d9e3d-01eb-4371-ab56-9dbe90016343']}

### Graph Generation

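Rather than using every extracted entity and relationship as-is, GraphRAG merges the information that can be combined. This process is called summarization.

Element summarization improves search in GraphRAG, and it contributes significantly to the performance of global search, which needs to understand the overall context.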
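
The merging idea can be pictured with plain dictionaries. The sketch below is an illustration under stated assumptions, not the implementation of `GraphsMerger`: it assumes that nodes with the same name are unified by collecting their descriptions, and that repeated edges accumulate weight. The dictionary layout and the name `merge_graphs` are hypothetical.

```python
def merge_graphs(graphs):
    """Merge per-chunk graphs given as {"nodes": {...}, "edges": {...}} dicts."""
    merged_nodes = {}
    merged_edges = {}
    for graph in graphs:
        for name, attrs in graph["nodes"].items():
            # The same entity seen in several chunks becomes one node.
            node = merged_nodes.setdefault(
                name, {"type": attrs["type"], "description": []}
            )
            node["description"].extend(attrs["description"])
        for (source, target), attrs in graph["edges"].items():
            # Treat edges as undirected by normalizing the endpoint order.
            key = tuple(sorted((source, target)))
            edge = merged_edges.setdefault(key, {"weight": 0.0, "description": []})
            edge["weight"] += attrs["weight"]
            edge["description"].extend(attrs["description"])
    return {"nodes": merged_nodes, "edges": merged_edges}
```

After a merge like this, a node or edge can carry several descriptions. In the tutorial's pipeline, condensing those descriptions is the job of `EntityRelationshipDescriptionSummarizer`, which uses the LLM.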

In [10]:
# Takes about 20 minutes
from langchain_graphrag.indexing.graph_generation import (
    GraphsMerger,
    EntityRelationshipDescriptionSummarizer,
    GraphGenerator,
)

graphs_merger = GraphsMerger()

es_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)

summarizer = EntityRelationshipDescriptionSummarizer.build_default(llm=es_llm)

graph_generator = GraphGenerator(
    er_extractor=extractor,
    graphs_merger=graphs_merger,
    er_description_summarizer=summarizer,
)

graph = graph_generator.run(df_text_units)

Extracting entities and relationships ...: 100%|██████████| 122/122 [17:38<00:00,  8.68s/it]
Summarizing entities descriptions: 100%|██████████| 482/482 [02:48<00:00,  2.87it/s]
Summarizing relationship descriptions: 100%|██████████| 749/749 [00:42<00:00, 17.81it/s]


In [11]:
print(f"Number of nodes - {len(graph[0].nodes)}")
print(f"Number of edges - {len(graph[0].edges)}")

Number of nodes - 482
Number of edges - 749


## Local Search through Knowledge Graph

We use the knowledge graph built by GraphRAG to perform local search and global search. Local search is useful for finding content in a specific passage or topic, while global search is a good way to get answers that depend on the overall context.
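
The difference between the two modes can be sketched with a toy example. This is not how langchain-graphrag implements search: word overlap stands in for embedding similarity and for LLM-based relevance scoring, and the record layouts and function names (`tokens`, `local_search_context`, `global_search_context`) are assumptions made for illustration only.

```python
def tokens(text):
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(text.lower().split())


def local_search_context(query, entities, top_k=1):
    """Local search: start from the entities closest to the query,
    then collect the text units those entities came from."""
    q = tokens(query)
    ranked = sorted(
        entities, key=lambda e: len(q & tokens(e["description"])), reverse=True
    )
    selected = ranked[:top_k]
    text_unit_ids = []
    for entity in selected:
        for text_unit_id in entity["text_unit_ids"]:
            if text_unit_id not in text_unit_ids:
                text_unit_ids.append(text_unit_id)
    return {
        "entities": [e["name"] for e in selected],
        "text_unit_ids": text_unit_ids,
    }


def global_search_context(query, community_reports, max_reports=2):
    """Global search: score every community report against the query
    and keep the most relevant ones as context for the final answer."""
    q = tokens(query)
    scored = [(len(q & tokens(r["summary"])), r["title"]) for r in community_reports]
    relevant = [pair for pair in scored if pair[0] > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in relevant[:max_reports]]
```

The design point is the starting place of each mode: local search begins at individual entities and fans out to their source text, while global search looks across community-level summaries that cover the whole corpus.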

In [18]:
from langchain_chroma.vectorstores import Chroma as ChromaVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_graphrag.indexing import SimpleIndexer
from langchain_graphrag.indexing.artifacts import IndexerArtifacts
from langchain_graphrag.indexing.artifacts_generation import (
    CommunitiesReportsArtifactsGenerator,
    EntitiesArtifactsGenerator,
    RelationshipsArtifactsGenerator,
    TextUnitsArtifactsGenerator,
)
from langchain_graphrag.indexing.graph_clustering.leiden_community_detector import (
    HierarchicalLeidenCommunityDetector,
)

from langchain_graphrag.indexing.report_generation import (
    CommunityReportGenerator,
    CommunityReportWriter,
)

import pickle
from pathlib import Path
from typing import Union


# Define save_artifacts function
def save_artifacts(artifacts: IndexerArtifacts, path: Union[str, Path]) -> None:
    # Accept a string or a Path, and make sure the output directory exists
    path = Path(path)
    path.mkdir(parents=True, exist_ok=True)

    # Tabular artifacts are stored as parquet files
    artifacts.entities.to_parquet(f"{path}/entities.parquet")
    artifacts.relationships.to_parquet(f"{path}/relationships.parquet")
    artifacts.text_units.to_parquet(f"{path}/text_units.parquet")
    artifacts.communities_reports.to_parquet(f"{path}/communities_reports.parquet")

    # Graphs and community information are pickled when present
    if artifacts.merged_graph is not None:
        with path.joinpath("merged-graph.pickle").open("wb") as fp:
            pickle.dump(artifacts.merged_graph, fp)

    if artifacts.summarized_graph is not None:
        with path.joinpath("summarized-graph.pickle").open("wb") as fp:
            pickle.dump(artifacts.summarized_graph, fp)

    if artifacts.communities is not None:
        with path.joinpath("community_info.pickle").open("wb") as fp:
            pickle.dump(artifacts.communities, fp)


# Community Detector
community_detector = HierarchicalLeidenCommunityDetector()

# Define LLM, Embedding model
ls_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")


# Entities artifacts Generator
# We need the vector Store (mandatory) for entities

# let's create a collection name based on
# the embedding model name
entities_collection_name = "entity-openai-embeddings"
entities_vector_store = ChromaVectorStore(
    collection_name=entities_collection_name,
    persist_directory="./",
    embedding_function=embeddings,
)

entities_artifacts_generator = EntitiesArtifactsGenerator(
    entities_vector_store=entities_vector_store
)

relationships_artifacts_generator = RelationshipsArtifactsGenerator()

# Community Report Generator
report_generator = CommunityReportGenerator.build_default(
    llm=ls_llm,
    chain_config={"tags": ["community-report"]},
)

report_writer = CommunityReportWriter()

communities_report_artifacts_generator = CommunitiesReportsArtifactsGenerator(
    report_generator=report_generator,
    report_writer=report_writer,
)

text_units_artifacts_generator = TextUnitsArtifactsGenerator()

######### End of creation of various objects/dependencies #############

indexer = SimpleIndexer(
    text_unit_extractor=text_unit_extractor,
    graph_generator=graph_generator,
    community_detector=community_detector,
    entities_artifacts_generator=entities_artifacts_generator,
    relationships_artifacts_generator=relationships_artifacts_generator,
    text_units_artifacts_generator=text_units_artifacts_generator,
    communities_report_artifacts_generator=communities_report_artifacts_generator,
)

artifacts = indexer.run(docs)


# save the artifacts
artifacts_dir = "./artifacts"
save_artifacts(artifacts, artifacts_dir)

Extracting text units ...: 100%|██████████| 6/6 [00:00<00:00, 24174.66it/s]
Extracting text units ...: 100%|██████████| 8/8 [00:00<00:00, 15101.00it/s]
Extracting text units ...: 100%|██████████| 8/8 [00:00<00:00, 46538.74it/s]
Extracting text units ...: 100%|██████████| 10/10 [00:00<00:00, 44667.77it/s]
Extracting text units ...: 100%|██████████| 7/7 [00:00<00:00, 15810.52it/s]
Extracting text units ...: 100%|██████████| 8/8 [00:00<00:00, 31655.12it/s]
Extracting text units ...: 100%|██████████| 9/9 [00:00<00:00, 26772.15it/s]
Extracting text units ...: 100%|██████████| 11/11 [00:00<00:00, 44706.73it/s]
Extracting text units ...: 100%|██████████| 8/8 [00:00<00:00, 18610.33it/s]
Extracting text units ...: 100%|██████████| 10/10 [00:00<00:00, 30504.03it/s]
Extracting text units ...: 100%|██████████| 10/10 [00:00<00:00, 28986.21it/s]
Extracting text units ...: 100%|██████████| 8/8 [00:00<00:00, 14057.16it/s]
Extracting text units ...: 100%|██████████| 9/9 [00:00<00:00, 77038.24it/s]
Extr

KeyboardInterrupt: 