### **From Local to Global: A GraphRAG Approach to Query-Focused Summarization**

#### **[Resource link](https://arxiv.org/pdf/2404.16130)**

## 1. Introduction

Retrieval-Augmented Generation (RAG) has become a popular way to improve large language models (LLMs). In RAG, documents are retrieved from a database, and then an LLM uses those documents to generate a response.

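However, traditional RAG faces **two challenges**:

1. It retrieves only **local passages**: a handful of chunks that match the query, without the broader context around them.
2. It lacks a **global view** of the dataset, so it struggles with questions about the corpus as a whole, such as “What are the main themes?”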

To address these challenges, Microsoft introduced **GraphRAG**, which integrates **knowledge graphs** with traditional RAG. This approach not only retrieves passages but also uses **structured graph relationships** and **summaries at multiple levels** (from local details to global insights).


In this workshop, we will study GraphRAG in **two phases**:

* **Indexing Phase**: where the data is processed into a graph and hierarchical summaries.
* **Query Phase**: where a user’s question is answered using the graph and summaries.

> **Teaching Tip**: At this point, show a **side-by-side diagram**:
>
> * Traditional RAG = chunks → retrieve → LLM
> * GraphRAG = chunks → entities/relations → graph + summaries → query


## **2. Indexing Phase**

The **indexing phase** is like “preparing the library” before answering questions. GraphRAG processes the text into both a **graph structure** and **layered summaries**.

#### Step 1: Text Chunking

* Documents are divided into smaller text units (e.g., paragraphs or sentences).
* Each chunk is small enough for an LLM to process efficiently.

*Example:* Splitting a 200-page report into \~500-word chunks.
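
The chunking step can be sketched in a few lines of Python. This is a minimal illustration, not GraphRAG's own chunker: the function name `chunk_words`, the word-based splitting, and the 50-word overlap between neighbouring chunks are all assumptions made for the example.

```python
def chunk_words(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most `chunk_size` words.

    Neighbouring chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in one of them (an assumption of this sketch).
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# A stand-in "report" of 1,200 words.
report = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_words(report, chunk_size=500, overlap=50)
print(len(chunks), [len(c.split()) for c in chunks])
```

Smaller chunks mean more LLM calls during extraction but less chance of missing an entity buried in a long passage, so the chunk size is a cost versus recall trade-off.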


#### Step 2: Entity and Relationship Extraction

* From each chunk, GraphRAG extracts **entities** (people, organizations, places, concepts).
* It also extracts **relationships** (who did what, how entities are connected).

*Example:*

> “Microsoft developed GraphRAG in 2023”
>
> * Entity 1: Microsoft
> * Entity 2: GraphRAG
> * Relationship: developed

**During knowledge graph construction:**

* Entities become **nodes**.
* Relationships become **edges**.
* This results in a **Knowledge Graph (KG)** that captures connections across the dataset.
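
To make the node and edge idea concrete, here is a small sketch that turns extracted triples into an undirected graph using only the standard library. The triples are hard-coded stand-ins for LLM extractor output, and the `build_graph` helper, the second and third example triples, and the "count repeated mentions as edge weight" rule are assumptions of this example rather than GraphRAG's actual data model.

```python
from collections import defaultdict

# Hypothetical extractor output: (source entity, relationship, target entity).
triples = [
    ("Microsoft", "developed", "GraphRAG"),
    ("GraphRAG", "extends", "RAG"),
    ("GraphRAG", "uses", "Knowledge Graph"),
    ("Microsoft", "developed", "GraphRAG"),  # same fact found in a second chunk
]


def build_graph(triples):
    """Return {node: {neighbour: {"relations": set, "weight": int}}}."""
    graph = defaultdict(dict)
    for source, relation, target in triples:
        # Store the edge in both directions so the graph is undirected.
        for a, b in ((source, target), (target, source)):
            edge = graph[a].setdefault(b, {"relations": set(), "weight": 0})
            edge["relations"].add(relation)
            edge["weight"] += 1  # repeated mentions strengthen the edge
    return dict(graph)


kg = build_graph(triples)
print(sorted(kg))                   # the nodes
print(kg["Microsoft"]["GraphRAG"])  # the edge between two nodes
```

Merging duplicate mentions into one weighted edge keeps the graph compact even when the same fact appears in many chunks.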

#### Step 3: Community Detection

Community detection is the process of **finding groups of closely related nodes** within the knowledge graph.

* **Why it matters:**

  * Real-world text is highly interconnected (entities link to many others). Without structure, the graph can be overwhelming.
  * Communities represent *themes or clusters of meaning*. For example, in an article about AI, one community might form around “LLMs, GPT, and transformers,” while another forms around “data privacy, compliance, and regulations.”

* **How it works:**

  * Algorithms like **Louvain** or **Leiden** (the GraphRAG paper uses Leiden) are used to maximize “modularity”, a score that measures how well the graph is divided into dense groups with sparse connections between them.
  * In GraphRAG, each node belongs to exactly one community at a given level; other algorithms can produce overlapping communities.

* **Key takeaway:**

  > Community detection reduces complexity by organizing the graph into smaller, semantically meaningful clusters.
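
The modularity score mentioned above can be computed by hand for a small graph. The sketch below does not implement Louvain or Leiden; it only evaluates the score those algorithms try to maximize, using the standard formula Q = Σ_c [L_c / m − (d_c / 2m)²] for an unweighted, undirected graph, where L_c is the number of edges inside community c, d_c is the total degree of its nodes, and m is the number of edges. The toy graph and the function name are assumptions for illustration.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity of a partition of an unweighted, undirected graph (needs at least one edge)."""
    m = len(edges)
    internal = defaultdict(int)  # L_c: edges with both endpoints in community c
    degree = defaultdict(int)    # d_c: sum of node degrees in community c
    for u, v in edges:
        degree[community_of[u]] += 1
        degree[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles joined by a single bridge edge (c-d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

good = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}  # one community per triangle
one_blob = {node: 0 for node in "abcdef"}                 # everything lumped together

print(round(modularity(edges, good), 3))      # 0.357
print(round(modularity(edges, one_blob), 3))  # 0.0
```

Splitting the graph along the bridge scores higher than lumping everything together, which is exactly the signal a community detection algorithm follows.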



#### Step 4: Hierarchical Community Structure

Once basic communities are detected, GraphRAG takes it a step further by **building a hierarchy** — larger communities are broken down into smaller, more detailed ones.

* **Why it matters:**

  * Information exists at multiple levels of abstraction.
  * Users may sometimes need a broad overview (root-level community) and other times very specific details (fine-grained sub-community).

* **How it works:**

  * Communities are recursively clustered, creating multiple layers.
  * At the top (root level), you have a coarse-grained view of the entire dataset.
  * Moving down the hierarchy gives progressively more detailed clusters.

* **Example (from a dataset about AI research):**

  * **Root level (C0):** “Artificial Intelligence Research”
  * **High level (C1):** “Machine Learning Methods” vs. “AI Ethics & Regulation”
  * **Low level (C2/C3):** “Transformer Architectures” vs. “Bias Mitigation Techniques”

* **Key takeaway for students:**

  > Hierarchical structures allow GraphRAG to flexibly serve queries that need either **general summaries** or **fine-grained details**.


#### Step 6: Community Levels

Microsoft’s GraphRAG implementation defines explicit levels to organize communities:

* **C0 (Root level):**

  * Covers the *entire dataset* at the coarsest granularity (the fewest, largest communities).
  * Broad, high-level summary that ties everything together.
  * Useful for “give me an overview” type questions.

* **C1 (High level):**

  * Captures *major themes* in the data.
  * Often corresponds to chapters in a book, or main sections of a report.

* **C2 / C3 (Lower levels):**

  * Breaks down high-level themes into finer categories.
  * Example: “AI Ethics” (C1) → “Bias in Training Data” (C2) → “Gender Bias in NLP Models” (C3).

* **Why it matters for querying:**

  * During the **query phase**, the system can pick the right community level depending on the question.
  * A broad “What is this document about?” query retrieves from **C0**.
  * A specific “What are the risks of bias in GPT models?” query drills down into **C2/C3**.

* **Key takeaway for students:**

  > Community levels act like a *table of contents*, where each level of depth gives more detail and context.


```tree
C0 (All data)
 ├── C1: Climate
 │    ├── C2: Emissions
 │    └── C2: Renewable Energy
 └── C1: AI
      ├── C2: NLP
      └── C2: Computer Vision
```
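
The tree above maps naturally onto a small recursive data structure. The following sketch is one possible representation, not GraphRAG's storage format: the `Community` class and the `communities_at_level` helper are hypothetical names, and carrying a leaf community down to deeper levels when it has no sub-communities is a design choice of this example.

```python
from dataclasses import dataclass, field


@dataclass
class Community:
    title: str
    children: list["Community"] = field(default_factory=list)


def communities_at_level(root: Community, level: int) -> list[Community]:
    """Return the communities `level` steps below the root."""
    frontier = [root]
    for _ in range(level):
        next_frontier = []
        for community in frontier:
            # A community with no sub-communities is carried down unchanged,
            # so every level still covers the whole dataset.
            next_frontier.extend(community.children or [community])
        frontier = next_frontier
    return frontier


root = Community("All data", [
    Community("Climate", [Community("Emissions"), Community("Renewable Energy")]),
    Community("AI", [Community("NLP"), Community("Computer Vision")]),
])

for level in range(3):
    print(f"C{level}:", [c.title for c in communities_at_level(root, level)])
```

Each level is a complete partition of the data at a different granularity, which is what lets the query phase trade breadth for detail.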

#### Step 7: Community Summarization

Community Summarization is the process of generating **natural language summaries** for each detected community at every level of the hierarchy.

* **Why it matters:**

  * Communities are just clusters of entities and relationships. While structurally useful, they are hard to interpret directly.
  * Summarization creates a **semantic layer** on top of the graph — compact text that captures the essence of each community.
  * This allows both humans and LLMs to quickly understand what a community represents without processing all underlying nodes.

* **How it works:**

  1. Collect nodes and edges from a community (at C0, C1, C2, or C3).
  2. Extract representative text chunks associated with those nodes.
  3. Use an LLM to generate a concise summary, highlighting the main theme, relationships, and key entities.
  4. Store the summary alongside the community metadata for efficient retrieval.

* **Community Levels in Summarization:**

  * **C0 (Root level):** The broadest summaries, covering the entire dataset at the coarsest granularity.
  * **C1 (High level):** Summaries describe major thematic clusters.
  * **C2/C3 (Lower levels):** Summaries provide increasingly detailed insights.

* **Example (dataset about AI research):**

  * **C0 summary:** “This dataset discusses modern artificial intelligence, covering methods, applications, and ethical considerations.”
  * **C1 summary (Methods cluster):** “This community focuses on deep learning techniques, emphasizing transformers, optimization strategies, and scaling laws.”
  * **C2 summary (Transformers sub-cluster):** “Summarizes research on attention mechanisms, pretraining methods, and fine-tuning for domain adaptation.”

* **Why it’s critical for Querying:**

  * Without summaries, the system would need to traverse and reason over raw graphs, which is inefficient.
  * With summaries, GraphRAG can directly retrieve concise, context-rich text aligned to the query’s scope, improving both accuracy and efficiency.

**Key takeaway for students:**

> Community Summarization transforms structural graph clusters into **human- and machine-readable summaries**, enabling flexible retrieval and reasoning across multiple levels of detail.
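
The "how it works" steps for summarization can be sketched as a single function. This is a simplified illustration under stated assumptions: the prompt wording, the `summarize_community` name, and the character cap (a crude stand-in for a token budget) are invented for the example, the step of attaching representative text chunks is omitted, and `fake_llm` replaces a real model call so the code runs offline.

```python
from typing import Callable


def summarize_community(
    title: str,
    entities: list[str],
    relationships: list[tuple[str, str, str]],
    llm: Callable[[str], str],
    max_prompt_chars: int = 4000,
) -> dict:
    """Build a prompt from a community's nodes and edges and store the summary."""
    lines = [
        "Summarize this community: main theme, key entities, and how they relate.",
        f"Community: {title}",
        "Entities: " + ", ".join(entities),
        "Relationships:",
    ]
    lines += [f"- {source} {relation} {target}" for source, relation, target in relationships]
    prompt = "\n".join(lines)[:max_prompt_chars]  # truncate to fit the budget
    # Keep the summary next to the community metadata for later retrieval.
    return {"title": title, "summary": llm(prompt)}


def fake_llm(prompt: str) -> str:
    # Stand-in for a real LLM call.
    return f"[stub summary of a {len(prompt)}-character prompt]"


record = summarize_community(
    "Transformers",
    ["attention", "pretraining", "fine-tuning"],
    [("attention", "enables", "pretraining"), ("pretraining", "precedes", "fine-tuning")],
    fake_llm,
)
print(record)
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client.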

## 3. Query Phase

Once the data is indexed, GraphRAG can now **answer queries** using the graph and summaries.

### Step 1: User Query

The system receives a user question.

*Example:* “What are the main themes in renewable energy research?”

### Step 2: Community Level Selection

A **community level** is chosen to match the scope of the query:

* Global query → use root- or high-level summaries (C0/C1).
* Specific query → use fine-grained summaries (C2/C3).



### Step 3: Retrieve Relevant Summaries

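The system pulls the most relevant **summaries** at the chosen level (instead of raw text chunks).
Because summaries are far shorter than the text they cover, this lowers token cost and lets the LLM reason over themes rather than isolated passages.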

### Step 4: Partial Response Generation

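For each retrieved summary, the LLM generates a **partial answer**. In the paper, the LLM also gives each partial answer a helpfulness score from 0 to 100, and answers that score 0 are discarded.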

*Example:*

* From C2: “Wind Energy research focuses on offshore systems.”
* From C2: “Solar Energy research highlights efficiency of perovskite cells.”


### Step 5: Response Aggregation

The partial responses are **merged** into a final coherent answer.

*Example (Final Answer):*

> Renewable energy research mainly focuses on wind (especially offshore systems) and solar (particularly perovskite efficiency).
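
Steps 3 to 5 form a map-reduce pattern, which can be sketched as follows. The paper describes scoring each partial answer for helpfulness and dropping those that score 0; everything else here is an assumption for illustration: the `answer_query` name, the `top_k` cutoff (standing in for a context-window limit), and the two stub functions, which use keyword overlap and string joining in place of real LLM calls.

```python
from typing import Callable


def answer_query(
    query: str,
    summaries: list[str],
    map_llm: Callable[[str, str], tuple[str, int]],
    reduce_llm: Callable[[str, list[str]], str],
    top_k: int = 5,
) -> str:
    """Map each community summary to a scored partial answer, then reduce."""
    partials = [map_llm(query, summary) for summary in summaries]
    # Drop unhelpful partial answers (score 0) and keep the highest-scoring ones.
    useful = sorted((p for p in partials if p[1] > 0), key=lambda p: p[1], reverse=True)
    return reduce_llm(query, [text for text, _ in useful[:top_k]])


def fake_map(query: str, summary: str) -> tuple[str, int]:
    # Stub: the "partial answer" is the summary itself, scored by shared words.
    overlap = set(query.lower().split()) & set(summary.lower().split())
    return summary, len(overlap)


def fake_reduce(query: str, partial_answers: list[str]) -> str:
    # Stub: a real system would ask the LLM to merge these into one answer.
    return " ".join(partial_answers)


summaries = [
    "Wind energy research focuses on offshore systems.",
    "Solar energy research highlights efficiency of perovskite cells.",
    "Transformer models rely on attention mechanisms.",
]
result = answer_query(
    "What are the main themes in renewable energy research?",
    summaries, fake_map, fake_reduce,
)
print(result)
```

Because the map calls are independent of each other, they can run in parallel, which keeps latency manageable when there are many summaries.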


## 4. Putting It Together

GraphRAG is powerful because it:

* Provides **global context** (via community summaries).
* Preserves **local detail** (via fine-grained communities).
* Balances **efficiency** and **accuracy**.

### References

* Microsoft GraphRAG Documentation: [https://microsoft.github.io/graphrag/](https://microsoft.github.io/graphrag/)
* GraphRAG GitHub: [https://github.com/microsoft/graphrag](https://github.com/microsoft/graphrag)
* Research Paper: [From Local to Global: A GraphRAG Approach to Query-Focused Summarization](https://arxiv.org/pdf/2404.16130)
