# GraphRAG Implementation with LlamaIndex - V2

[GraphRAG (Graphs + Retrieval Augmented Generation)](https://www.microsoft.com/en-us/research/project/graphrag/) combines the strengths of Retrieval Augmented Generation (RAG) and Query-Focused Summarization (QFS) to handle complex queries over large text datasets. While RAG excels at fetching precise information, it struggles with broader queries that require thematic understanding, a challenge that QFS addresses, although QFS does not scale well. GraphRAG integrates these approaches to offer responsive and thorough querying capabilities across extensive, diverse text corpora.

This notebook provides guidance on constructing the GraphRAG pipeline with the LlamaIndex PropertyGraph abstractions, backed by Neo4j.

This notebook updates the GraphRAG pipeline to v2. If you haven’t checked v1 yet, you can find it [here](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/cookbooks/GraphRAG_v1.ipynb). The updates to the existing implementation are:

1. Integration with the Neo4j graph database.
2. Embedding-based retrieval.



## Installation

`graspologic` provides the `hierarchical_leiden` community detection function that the notebook imports.

In [1]:
# !pip install llama-index llama-index-graph-stores-neo4j graspologic numpy==1.26.4 scipy==1.12.0 future

## Load Data

We will use a sample news article dataset retrieved from Diffbot, which Tomaz has conveniently made available on GitHub for easy access.

The dataset contains 2,500 samples; for ease of experimentation, we load at most the first 50 rows (the run shown below contains 32), each with the `title` and `text` of a news article plus an `industry` label.

In [2]:
import pandas as pd
from llama_index.core import Document

news = pd.read_csv(
    "data.csv",
    dtype={"industry": str}
)[:50]

news.head()

| Unnamed: 0 | title | text | industry |
|---|---|---|---|
| 0 | TSMC and MediaTek Deepen Alliance with 3nm Pro... | Foundry giant TSMC continues to be the manufac... | semiconductor |
| 1 | MediaTek Taps TSMC's 3nm Process for New Dimen... | Global mobile chipset design leader MediaTek i... | semiconductor |
| 2 | UMC's Specialty Process Expansion Poised to Su... | Foundry major United Microelectronics Corporat... | semiconductor |
| 3 | VIS Capacity Expansion for Specialty DDIs to B... | Vanguard International Semiconductor's (VIS) p... | semiconductor |
| 4 | PSMC's Mature Process Capabilities Crucial for... | Powerchip Semiconductor Manufacturing Corp (PS... | semiconductor |


In [3]:
news.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     32 non-null     object
 1   text      32 non-null     object
 2   industry  32 non-null     object
dtypes: object(3)
memory usage: 900.0+ bytes


Prepare documents as required by LlamaIndex

In [4]:
documents = [
    Document(
        text=f"{row['title']}: {row['text']}",
        metadata={"industry": row["industry"]}  # store the industry label in metadata
    )
    for i, row in news.iterrows()
]

## Setup API Key and LLM

In [5]:
import os
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()

from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4.1-mini")

## GraphRAGExtractor

The GraphRAGExtractor class is designed to extract triples (subject-relation-object) from text and enrich them by adding descriptions for entities and relationships to their properties using an LLM.

This functionality is similar to that of the `SimpleLLMPathExtractor`, but adds handling for entity and relationship descriptions. For guidance on implementation, you may look at similar existing [extractors](https://docs.llamaindex.ai/en/latest/examples/property_graph/Dynamic_KG_Extraction/?h=comparing).

Here's a breakdown of its functionality:

**Key Components:**

1. `llm:` The language model used for extraction.
2. `extract_prompt:` A prompt template used to guide the LLM in extracting information.
3. `parse_fn:` A function to parse the LLM's output into structured data.
4. `max_paths_per_chunk:` Limits the number of triples extracted per text chunk.
5. `num_workers:` For parallel processing of multiple text nodes.


**Main Methods:**

1. `__call__:` The entry point for processing a list of text nodes.
2. `acall:` An asynchronous version of `__call__` for improved performance.
3. `_aextract:` The core method that processes each individual node.
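
The way `acall` fans out one job per node while `num_workers` caps concurrency can be sketched with a semaphore. This is a minimal illustration, not LlamaIndex's `run_jobs` implementation; `run_bounded` and `fake_extract` are hypothetical names introduced here.

```python
import asyncio
from typing import Awaitable, List, TypeVar

T = TypeVar("T")


async def run_bounded(jobs: List[Awaitable[T]], workers: int = 4) -> List[T]:
    """Await every job with at most `workers` running at once (hypothetical helper)."""
    semaphore = asyncio.Semaphore(workers)

    async def guarded(job: Awaitable[T]) -> T:
        async with semaphore:
            return await job

    # asyncio.gather returns results in the same order as the input jobs.
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def fake_extract(chunk: str) -> str:
    """Stand-in for the per-node LLM call made by `_aextract`."""
    await asyncio.sleep(0.01)
    return chunk.upper()


async def main() -> List[str]:
    return await run_bounded(
        [fake_extract(c) for c in ["a", "b", "c"]], workers=2
    )


results = asyncio.run(main())
print(results)
```

Inside a notebook an event loop is already running, which is why the extractor cell applies `nest_asyncio` before calling `asyncio.run`.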


**Extraction Process:**

For each input node (chunk of text):
1. It sends the text to the LLM along with the extraction prompt.
2. The LLM's response is parsed to extract entities, relationships, and descriptions for both.
3. Entities are converted into `EntityNode` objects. The entity description is stored in the node's properties.
4. Relationships are converted into `Relation` objects. The relationship description is stored in the relation's properties.
5. These are added to the node's metadata under `KG_NODES_KEY` and `KG_RELATIONS_KEY`.

**NOTE:** In the current implementation, we are using only relationship descriptions. In the next implementation, we will utilize entity descriptions during the retrieval stage.
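
Steps 2 to 5 can be sketched with plain dictionaries standing in for `EntityNode` and `Relation`. The key names `kg_nodes` and `kg_relations`, the helper `attach_graph_data`, and the sample tuples are assumptions made for illustration; they are not the LlamaIndex constants or API.

```python
from typing import Any, Dict, List, Tuple

NODES_KEY = "kg_nodes"  # stand-in for KG_NODES_KEY (assumed value)
RELATIONS_KEY = "kg_relations"  # stand-in for KG_RELATIONS_KEY (assumed value)

EntityTuple = Tuple[str, str, str]  # (name, type, description)
RelationTuple = Tuple[str, str, str, str]  # (source, target, relation, description)


def attach_graph_data(
    metadata: Dict[str, Any],
    entities: List[EntityTuple],
    relationships: List[RelationTuple],
) -> Dict[str, Any]:
    """Return chunk metadata enriched with entity and relation records."""
    # Chunk-level metadata (e.g. industry) is copied onto every record.
    base = {
        k: v for k, v in metadata.items() if k not in (NODES_KEY, RELATIONS_KEY)
    }
    nodes = list(metadata.get(NODES_KEY, []))
    relations = list(metadata.get(RELATIONS_KEY, []))

    for name, label, description in entities:
        properties = {**base, "entity_description": description}
        nodes.append({"name": name, "label": label, "properties": properties})

    for source, target, label, description in relationships:
        properties = {**base, "relationship_description": description}
        relations.append(
            {
                "source_id": source,
                "target_id": target,
                "label": label,
                "properties": properties,
            }
        )

    return {**base, NODES_KEY: nodes, RELATIONS_KEY: relations}


enriched = attach_graph_data(
    {"industry": "semiconductor"},
    [("QuantumChip Inc.", "Company", "Designs AI accelerators.")],
    [
        (
            "QuantumChip Inc.",
            "OmniSilicon",
            "COMPETES_WITH",
            "Both sell mobile chips.",
        )
    ],
)
print(enriched[RELATIONS_KEY][0]["properties"])
```

Building a fresh `properties` dictionary per record keeps one entity's description from leaking into another's.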

In [6]:
import asyncio
import nest_asyncio

nest_asyncio.apply()

from typing import Any, List, Callable, Optional, Union, Dict
from IPython.display import Markdown, display

from llama_index.core.async_utils import run_jobs
from llama_index.core.indices.property_graph.utils import (
    default_parse_triplets_fn,
)
from llama_index.core.graph_stores.types import (
    EntityNode,
    KG_NODES_KEY,
    KG_RELATIONS_KEY,
    Relation,
)
from llama_index.core.llms.llm import LLM
from llama_index.core.prompts import PromptTemplate
from llama_index.core.prompts.default_prompts import (
    DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
)
from llama_index.core.schema import TransformComponent, BaseNode
from llama_index.core.bridge.pydantic import BaseModel, Field


class GraphRAGExtractor(TransformComponent):
    """Extract triples from a graph.

    Uses an LLM and a simple prompt + output parsing to extract paths (i.e. triples) and entity, relation descriptions from text.

    Args:
        llm (LLM):
            The language model to use.
        extract_prompt (Union[str, PromptTemplate]):
            The prompt to use for extracting triples.
        parse_fn (callable):
            A function to parse the output of the language model.
        num_workers (int):
            The number of workers to use for parallel processing.
        max_paths_per_chunk (int):
            The maximum number of paths to extract per chunk.
    """

    llm: LLM
    extract_prompt: PromptTemplate
    parse_fn: Callable
    num_workers: int
    max_paths_per_chunk: int

    def __init__(
        self,
        llm: Optional[LLM] = None,
        extract_prompt: Optional[Union[str, PromptTemplate]] = None,
        parse_fn: Callable = default_parse_triplets_fn,
        max_paths_per_chunk: int = 10,
        num_workers: int = 4,
    ) -> None:
        """Init params."""
        from llama_index.core import Settings

        if isinstance(extract_prompt, str):
            extract_prompt = PromptTemplate(extract_prompt)

        super().__init__(
            llm=llm or Settings.llm,
            extract_prompt=extract_prompt or DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
            parse_fn=parse_fn,
            num_workers=num_workers,
            max_paths_per_chunk=max_paths_per_chunk,
        )

    @classmethod
    def class_name(cls) -> str:
        return "GraphExtractor"

    def __call__(
        self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        """Extract triples from nodes."""
        return asyncio.run(
            self.acall(nodes, show_progress=show_progress, **kwargs)
        )

    async def _aextract(self, node: BaseNode) -> BaseNode:
        """Extract triples from a node."""
        assert hasattr(node, "text")

        text = node.get_content(metadata_mode="llm")
        try:
            llm_response = await self.llm.apredict(
                self.extract_prompt,
                text=text,
                max_knowledge_triplets=self.max_paths_per_chunk,
            )
            entities, entities_relationship = self.parse_fn(llm_response)
        except ValueError:
            entities = []
            entities_relationship = []

        existing_nodes = node.metadata.pop(KG_NODES_KEY, [])
        existing_relations = node.metadata.pop(KG_RELATIONS_KEY, [])
        entity_metadata = node.metadata.copy()
        for entity, entity_type, description in entities:
            entity_metadata["entity_description"] = description
            entity_node = EntityNode(
                name=entity, label=entity_type, properties=entity_metadata
            )
            existing_nodes.append(entity_node)

        relation_metadata = node.metadata.copy()
        for triple in entities_relationship:
            subj, obj, rel, description = triple
            relation_metadata["relationship_description"] = description
            rel_node = Relation(
                label=rel,
                source_id=subj,
                target_id=obj,
                properties=relation_metadata,
            )

            existing_relations.append(rel_node)

        node.metadata[KG_NODES_KEY] = existing_nodes
        node.metadata[KG_RELATIONS_KEY] = existing_relations
        return node

    async def acall(
        self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        """Extract triples from nodes async."""
        jobs = []
        for node in nodes:
            jobs.append(self._aextract(node))

        return await run_jobs(
            jobs,
            workers=self.num_workers,
            show_progress=show_progress,
            desc="Extracting paths from text",
        )

## GraphRAGStore

The `GraphRAGStore` class extends the `Neo4jPropertyGraphStore` class to implement the GraphRAG pipeline. Here's a breakdown of its key components and functions:


The class groups related nodes in the graph into communities and then generates a summary for each community using an LLM. In the implementation below, `build_communities` forms communities from the `industry` property of `Company` nodes instead of running a structural algorithm such as hierarchical Leiden.


**Key Methods:**

`build_communities():`

1. Queries Neo4j for all `Company` nodes that have an `industry` property.

2. Groups the companies by industry and uses the industry name as the community ID.

3. Converts the stored triplets to a NetworkX graph and collects the relationships whose two endpoints are in the same community.

4. Generates summaries for each community.

`generate_community_summary(text):`

1. Uses LLM to generate a summary of the relationships in a community.
2. The summary includes entity names and a synthesis of relationship descriptions.

`_create_nx_graph():`

1. Converts the internal graph representation to a NetworkX graph for community detection.

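`_collect_community_info(nx_graph, clusters):`

1. Collects detailed information about each node based on its community.
2. Creates a string representation of each relationship within a community.
3. Takes the cluster list produced by a clustering algorithm; the industry-based `build_communities` below does not call it.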

`_summarize_communities(community_info):`

1. Generates and stores summaries for each community using LLM.

`get_community_summaries():`

1. Returns the community summaries by building them if not already done.
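
The industry-based grouping that the `build_communities` code in this notebook performs can be sketched without Neo4j or NetworkX. The company-to-industry mapping, the edge list, and the helper name `build_industry_communities` are hypothetical; the detail string follows the `entity1 -> entity2 -> relation -> relationship_description` format used in the summary prompt.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str, str]  # (source, target, relation, description)

# Hypothetical data standing in for the Cypher query result and the triplets.
company_industry = {
    "QuantumChip Inc.": "semiconductor",
    "OmniSilicon": "semiconductor",
    "Global Devices Corp.": "consumer electronics",
}
edges: List[Edge] = [
    ("QuantumChip Inc.", "OmniSilicon", "COMPETES_WITH", "Rival chip designers."),
    (
        "QuantumChip Inc.",
        "Global Devices Corp.",
        "IS_SUPPLIER_TO",
        "Supplies AI accelerators.",
    ),
]


def build_industry_communities(
    company_industry: Dict[str, str], edges: List[Edge]
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Map each entity to its community and list intra-community relationships."""
    entity_info = {name: [industry] for name, industry in company_industry.items()}
    community_info: Dict[str, List[str]] = defaultdict(list)

    for source, target, relation, description in edges:
        industry = company_industry.get(source)
        # Keep only relationships whose two endpoints share an industry.
        if industry is not None and industry == company_industry.get(target):
            community_info[industry].append(
                f"{source} -> {target} -> {relation} -> {description}"
            )

    return entity_info, dict(community_info)


entity_info, community_info = build_industry_communities(company_industry, edges)
print(community_info)
```

One difference from the notebook code: it walks an undirected NetworkX graph from every entity, so each intra-community relationship is visited once from each endpoint, whereas this sketch lists each edge once.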

In [7]:
import re
import networkx as nx
from graspologic.partition import hierarchical_leiden
from collections import defaultdict

from llama_index.core.llms import ChatMessage
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore


class GraphRAGStore(Neo4jPropertyGraphStore):
    community_summary = {}
    entity_info = None
    max_cluster_size = 24

    def generate_community_summary(self, text):
        """Generate summary for a given text using an LLM."""
        messages = [
            ChatMessage(
                role="system",
                content=(
                    "You are an expert industry analyst. Your task is to create a structured and insightful analysis report "
                    "for an industry community based on a list of relationships from a knowledge graph.\n\n"
                    "**Goal:** Generate a comprehensive summary in Markdown format that analyzes the provided relationships "
                    "within the given industry community.\n\n"
                    "**Input Data:** You will receive a list of raw relationship strings. Each string follows the format: "
                    "\"entity1 -> entity2 -> relation -> relationship_description\".\n\n"
                    "**Output Format (Strictly follow this Markdown structure):**\n\n"
                    "# [Industry Name] Industry Community Analysis\n\n"
                    "## 1. Overview\n"
                    "(Provide a brief, 2-3 sentence introduction to this industry based on the entities and their interactions. "
                    "Describe its main characteristics.)\n\n"
                    "## 2. Key Players\n"
                    "(Based on the frequency of their appearance and centrality in the relationships, identify 3-5 of the most "
                    "important companies in this community. List them as bullet points.)\n\n"
                    "## 3. Relationship Analysis\n"
                    "(Categorize the provided relationships into the following sub-sections. Use the 'relation' and "
                    "'relationship_description' to determine the category. If a category has no relevant relationships, "
                    "OMIT the entire sub-section from the output.)\n\n"
                    "### Competitive Landscape\n"
                    "(Summarize the direct competition between companies. Example: 'Company A is in fierce competition with Company B...')\n\n"
                    "### Supply Chain & Partnerships\n"
                    "(Describe the supplier-client relationships, strategic partnerships, and collaborations. Example: 'Company C is "
                    "a key supplier to Company D, providing essential components...')\n\n"
                    "## 4. Key Insights\n"
                    "(Based on your analysis, provide a concluding paragraph summarizing the overall dynamics of this industry community. "
                    "What are the key trends, dependencies, or notable characteristics?)\n\n"
                    "---\n"
                ),
            ),
            ChatMessage(role="user", content=text),
        ]
        response = OpenAI().chat(messages)
        clean_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return clean_response

    # Method of the GraphRAGStore class
    def build_communities(self):
        """
        Builds communities based on the 'industry' property of Company nodes,
        instead of using a structural algorithm like Leiden.
        """
        print("Building communities based on 'industry' attribute...")

        # Step a: query all companies and their industry property.
        # Assumes company nodes are labeled 'Company' and the industry property
        # is named 'industry'; adjust to match your actual graph schema.
        query = """
        MATCH (c:Company)
        WHERE c.industry IS NOT NULL
        RETURN c.name AS entity, c.industry AS community
        """
        with self._driver.session() as session:
            result = session.run(query)
            records = list(result)

        # Step b: group companies by industry.
        communities_by_industry = defaultdict(list)
        for record in records:
            communities_by_industry[record["community"]].append(record["entity"])

        # Step c: convert the grouping into the structures used downstream.
        entity_info = defaultdict(list)
        community_info = defaultdict(list)

        nx_graph = self._create_nx_graph()  # still needed for relationship details

        for industry, entities in communities_by_industry.items():
            community_id = industry  # use the industry name directly as the community ID

            for entity in entities:
                entity_info[entity].append(community_id)

                # Companies without any relationship are absent from nx_graph;
                # calling neighbors() on them would raise an error.
                if entity not in nx_graph:
                    continue

                # Collect the details of every relationship inside this community.
                for neighbor in nx_graph.neighbors(entity):
                    # Keep only neighbors that are in the same industry community.
                    if neighbor in entities:
                        edge_data = nx_graph.get_edge_data(entity, neighbor)
                        if edge_data:
                            detail = f"{entity} -> {neighbor} -> {edge_data['relationship']} -> {edge_data['description']}"
                            community_info[community_id].append(detail)

        self.entity_info = dict(entity_info)

        # Step d: generate a summary for each industry community.
        self._summarize_communities(dict(community_info))

        print(f"Successfully built {len(community_info)} communities based on industry.")
        
    def _create_nx_graph(self):
        """Converts internal graph representation to NetworkX graph."""
        nx_graph = nx.Graph()
        triplets = self.get_triplets()
        for entity1, relation, entity2 in triplets:
            nx_graph.add_node(entity1.name)
            nx_graph.add_node(entity2.name)
            nx_graph.add_edge(
                relation.source_id,
                relation.target_id,
                relationship=relation.label,
                description=relation.properties["relationship_description"],
            )
        return nx_graph

    def _collect_community_info(self, nx_graph, clusters):
        """
        Collect information for each node based on their community,
        allowing entities to belong to multiple clusters.
        """
        entity_info = defaultdict(set)
        community_info = defaultdict(list)

        for item in clusters:
            node = item.node
            cluster_id = item.cluster

            # Update entity_info
            entity_info[node].add(cluster_id)

            for neighbor in nx_graph.neighbors(node):
                edge_data = nx_graph.get_edge_data(node, neighbor)
                if edge_data:
                    detail = f"{node} -> {neighbor} -> {edge_data['relationship']} -> {edge_data['description']}"
                    community_info[cluster_id].append(detail)

        # Convert sets to lists for easier serialization if needed
        entity_info = {k: list(v) for k, v in entity_info.items()}

        return dict(entity_info), dict(community_info)

    def _summarize_communities(self, community_info):
        """Generate and store summaries for each community."""
        for community_id, details in community_info.items():
            details_text = (
                "\n".join(details) + "."
            )  # Ensure it ends with a period
            self.community_summary[
                community_id
            ] = self.generate_community_summary(details_text)

    def get_community_summaries(self):
        """Returns the community summaries, building them if not already done."""
        if not self.community_summary:
            self.build_communities()
        return self.community_summary



## GraphRAGQueryEngine

The GraphRAGQueryEngine class is a custom query engine designed to process queries using the GraphRAG approach. It leverages the community summaries generated by the GraphRAGStore to answer user queries. Here's a breakdown of its functionality:

**Main Components:**

1. `graph_store:` An instance of GraphRAGStore, which contains the community summaries.
2. `index:` The `PropertyGraphIndex` used by `get_entities` for embedding-based retrieval.
3. `llm:` A Language Model (LLM) used for generating and aggregating answers.


**Key Methods:**

`custom_query(query_str: str)`

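1. This is the main entry point for processing a query. It identifies the known entities mentioned in the query, expands them with their direct neighbors in the graph, looks up the communities those entities belong to, generates an answer from each matching community summary, and then aggregates these answers into a final response.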

`generate_answer_from_summary(community_summary, query):`

1. Generates an answer for the query based on a single community summary.
2. Uses the LLM to interpret the community summary in the context of the query.

`aggregate_answers(community_answers):`

1. Combines individual answers from different communities into a coherent final response.
2. Uses the LLM to synthesize multiple perspectives into a single, concise answer.


**Query Processing Flow:**

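1. Identify the known entities in the query and expand them with their direct graph neighbors.
2. Look up the communities those entities belong to and retrieve the matching summaries from the graph store.
3. For each selected community summary, generate a specific answer to the query.
4. Aggregate all community-specific answers into a final, coherent response.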


**Example usage:**

```
query_engine = GraphRAGQueryEngine(graph_store=graph_store, index=index, llm=llm)

response = query_engine.query("query")
```
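
The query flow can be sketched as a small map-and-reduce over community summaries. Following the `custom_query` code in this notebook, only communities that contain a matched entity are queried. The function `answer_query`, the stub `stub_llm`, and the sample data are hypothetical; the stub only records prompts so the control flow is visible without calling a real model.

```python
from typing import Callable, Dict, List


def answer_query(
    query: str,
    community_summaries: Dict[str, str],
    entity_info: Dict[str, List[str]],
    entities: List[str],
    llm: Callable[[str], str],
) -> str:
    """Answer per matching community, then aggregate the partial answers."""
    # Communities that contain at least one of the query's entities.
    community_ids = {cid for e in entities for cid in entity_info.get(e, [])}

    # Map step: one answer per selected community summary.
    answers = [
        llm(f"Given the community summary: {summary}, answer the query: {query}")
        for cid, summary in community_summaries.items()
        if cid in community_ids
    ]

    # Reduce step: combine the intermediate answers.
    return llm("Combine the following intermediate answers: " + "\n".join(answers))


calls: List[str] = []


def stub_llm(prompt: str) -> str:
    """Record the prompt and return a numbered placeholder answer."""
    calls.append(prompt)
    return f"answer {len(calls)}"


final = answer_query(
    "Who competes with QuantumChip Inc.?",
    {"semiconductor": "Chip makers summary", "retail": "Retailers summary"},
    {"QuantumChip Inc.": ["semiconductor"]},
    ["QuantumChip Inc."],
    stub_llm,
)
print(final, len(calls))
```

Filtering before the map step keeps the number of LLM calls proportional to the communities relevant to the query rather than to all communities.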

In [45]:
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.llms import LLM
from llama_index.core import PropertyGraphIndex

import re


class GraphRAGQueryEngine(CustomQueryEngine):
    graph_store: GraphRAGStore
    index: PropertyGraphIndex
    llm: LLM
    similarity_top_k: int = 20

    def _get_related_entities_from_graph(self, query_str: str, depth: int = 1) -> list[str]:
        """
        Identifies anchor entities from the query and expands them by fetching their
        direct neighbors from the graph.

        `depth` is currently unused; only one-hop neighbors are fetched.
        """
        # Step 1: identify the "anchor entities" mentioned in the query.
        anchor_entities = self._get_entities_from_query(query_str)

        if not anchor_entities:
            print("No anchor entities found in query.")
            return []

        print(f"Found anchor entities: {anchor_entities}")

        # A set holds all related entities and removes duplicates automatically.
        all_related_entities = set(anchor_entities)

        # Step 2: for each anchor entity, look up its neighbors in the graph.
        # Note: self.graph_store must be able to run Cypher; here the
        # underlying neo4j driver is used directly.
        with self.graph_store._driver.session() as session:
            for entity_name in anchor_entities:
                # Cypher query: find all one-hop neighbors of the given node.
                cypher_query = """
                MATCH (start_node {name: $entity_name})-[r]-(neighbor)
                RETURN neighbor.name AS neighbor_name
                """
                result = session.run(cypher_query, entity_name=entity_name)

                # Skip neighbors that have no `name` property.
                neighbors = [
                    record["neighbor_name"]
                    for record in result
                    if record["neighbor_name"] is not None
                ]
                all_related_entities.update(neighbors)

        final_entity_list = list(all_related_entities)
        print(f"Expanded to related entities: {final_entity_list}")

        return final_entity_list

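    def _get_entities_from_query(self, query_str: str) -> List[str]:
        """A simple method to extract known entities directly from the query string."""

        # Names of all known entities in the knowledge graph
        # (empty until the communities have been built).
        known_entities = (self.graph_store.entity_info or {}).keys()

        found_entities = []
        # Check whether each known entity appears in the query.
        for entity in known_entities:
            # Whole-word, case-insensitive match, so that "ASU" does not match
            # inside "ASUS". Lookarounds are used instead of \b because \b does
            # not match after a trailing "." as in "Ennostar Inc.".
            pattern = r"(?<!\w)" + re.escape(entity) + r"(?!\w)"
            if re.search(pattern, query_str, re.IGNORECASE):
                found_entities.append(entity)

        print(f"Found entities in query: {found_entities}")  # log for debugging
        return found_entities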
    
    def custom_query(self, query_str: str) -> str:
        """Process all community summaries to generate answers to a specific query."""

        entities = self._get_related_entities_from_graph(query_str)

        community_ids = self.retrieve_entity_communities(
            self.graph_store.entity_info, entities
        )
        
        community_summaries = self.graph_store.get_community_summaries()
        community_answers = [
            self.generate_answer_from_summary(community_summary, query_str)
            for community_id, community_summary in community_summaries.items()
            if community_id in community_ids
        ]

        final_answer = self.aggregate_answers(community_answers)
        return final_answer

    def get_entities(self, query_str, similarity_top_k):
        nodes_retrieved = self.index.as_retriever(
            similarity_top_k=similarity_top_k
        ).retrieve(query_str)

        entities = set()
        pattern = (
            r"^(\w+(?:\s+\w+)*)\s*->\s*([a-zA-Z\s]+?)\s*->\s*(\w+(?:\s+\w+)*)$"
        )

        for node in nodes_retrieved:
            matches = re.findall(
                pattern, node.text, re.MULTILINE | re.IGNORECASE
            )

            for match in matches:
                subject = match[0]
                obj = match[2]
                entities.add(subject)
                entities.add(obj)

        return list(entities)

    def retrieve_entity_communities(self, entity_info, entities):
        """
        Retrieve cluster information for given entities, allowing for multiple clusters per entity.

        Args:
        entity_info (dict): Dictionary mapping entities to their cluster IDs (list).
        entities (list): List of entity names to retrieve information for.

        Returns:
        List of community or cluster IDs to which an entity belongs.
        """
        community_ids = []

        for entity in entities:
            if entity in entity_info:
                community_ids.extend(entity_info[entity])

        return list(set(community_ids))

    def generate_answer_from_summary(self, community_summary, query):
        """Generate an answer from a community summary based on a given query using LLM."""
        prompt = (
            f"Given the community summary: {community_summary}, "
            f"how would you answer the following query? Query: {query}"
        )
        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(
                role="user",
                content="I need an answer based on the above information.",
            ),
        ]
        response = self.llm.chat(messages)
        cleaned_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return cleaned_response

    def aggregate_answers(self, community_answers):
        """Aggregate individual community answers into a final, coherent response."""
        # intermediate_text = " ".join(community_answers)
        prompt = "Combine the following intermediate answers into a final, concise response."
        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(
                role="user",
                content=f"Intermediate answers: {community_answers}",
            ),
        ]
        final_response = self.llm.chat(messages)
        cleaned_final_response = re.sub(
            r"^assistant:\s*", "", str(final_response)
        ).strip()
        return cleaned_final_response
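
Whole-word matching of known entity names against a query, as done when looking for anchor entities, can be sketched as follows. The helper `find_entities` and the sample names are illustrative assumptions. The sketch uses lookarounds because a plain `\b` boundary does not match after a name that ends in a non-word character, such as the trailing period in "Ennostar Inc.".

```python
import re
from typing import Iterable, List


def find_entities(query: str, known_entities: Iterable[str]) -> List[str]:
    """Return the known entities that occur in `query` as whole words."""
    found = []
    for entity in known_entities:
        # (?<!\w) and (?!\w) require that no word character touches the name.
        pattern = r"(?<!\w)" + re.escape(entity) + r"(?!\w)"
        if re.search(pattern, query, re.IGNORECASE):
            found.append(entity)
    return found


known = ["ASUS", "Ennostar Inc.", "TSMC"]
matches = find_entities("How does Ennostar Inc. work with tsmc?", known)
print(matches)

# For comparison: the \b form misses a name that ends with a period.
boundary_match = re.search(
    r"\b" + re.escape("Ennostar Inc.") + r"\b",
    "How does Ennostar Inc. work with tsmc?",
)
print(boundary_match)
```

Matching is case-insensitive, so "tsmc" in the query still resolves to the stored name "TSMC".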

## Build End-to-End GraphRAG Pipeline

Now that we have defined all the necessary components, let’s construct the GraphRAG pipeline:

1. Create nodes/chunks from the text.
2. Build a PropertyGraphIndex using `GraphRAGExtractor` and `GraphRAGStore`.
3. Construct communities and generate a summary for each community using the graph built above.
4. Create a `GraphRAGQueryEngine` and begin querying.

### Create nodes/chunks from the text

In [9]:
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20,
)
nodes = splitter.get_nodes_from_documents(documents)

In [10]:
len(nodes)

32

### Build PropertyGraphIndex using `GraphRAGExtractor` and `GraphRAGStore`

In [23]:
KG_TRIPLET_EXTRACT_TMPL = """
-Goal-
Given a text document, identify all companies and the relationships among them. The goal is to create a knowledge graph of corporate interactions.

Given the text, extract up to {max_knowledge_triplets} company-relation triplets.

-Steps-

1. Identify all Company Entities. For each identified company, extract the following information:

- entity_name: The official or most common name of the company. Crucially, you MUST normalize variations into a single, consistent name (e.g., if you see "Tsmc" or "tsmc", you must output "TSMC"; for "Global Devices Corp.", always use this full name, not "Global Devices").
- entity_type: Always use "Company".
- entity_description: A comprehensive description of the company's business, industry, and key activities mentioned in the text.

2. Identify Inter-Company Relationships. From the companies identified in step 1, identify all pairs of (source_entity, target_entity) that have a clear and direct relationship described in the text.
For each pair of related companies, extract the following information:

- source_entity: Name of the source company, as identified in step 1.
- target_entity: Name of the target company, as identified in step 1.
- relation: A standardized relationship type that best describes the connection. Use one of the following predefined types: COMPETES_WITH, IS_PARTNER_OF, IS_SUPPLIER_TO, IS_CUSTOMER_OF, INVESTED_IN, ACQUIRED, IS_PARENT_COMPANY_OF, OTHER.
- relationship_description: A concise explanation, citing evidence from the text, as to why you think the two companies are related.

3. Output Formatting:

- Return the result in a valid JSON format with two keys: entities (a list of company objects) and relationships (a list of relationship objects).
- Exclude any text or explanations outside the main JSON structure.
- If no companies or relationships are identified, return empty lists: { "entities": [], "relationships": [] }.

-An Output Example-
{
    "entities": [
        {
            "entity_name": "QuantumChip Inc.",
            "entity_type": "Company",
            "entity_description": "QuantumChip Inc. is a leading designer of AI accelerators, securing a supply agreement for next-generation smartphones."
        },
        {
            "entity_name": "Global Devices Corp.",
            "entity_type": "Company",
            "entity_description": "Global Devices Corp. is a major smartphone manufacturer that will use QuantumChip's accelerators in its next-generation flagship phone."
        },
        {
            "entity_name": "OmniSilicon",
            "entity_type": "Company",
            "entity_description": "OmniSilicon is an established competitor in the mobile chip market, now facing direct competition from QuantumChip Inc."
        }
    ],
    "relationships": [
        {
            "source_entity": "QuantumChip Inc.",
            "target_entity": "Global Devices Corp.",
            "relation": "IS_SUPPLIER_TO",
            "relationship_description": "The text states that QuantumChip Inc. signed a long-term supply agreement with Global Devices Corp. to provide AI accelerators."
        },
        {
            "source_entity": "QuantumChip Inc.",
            "target_entity": "OmniSilicon",
            "relation": "COMPETES_WITH",
            "relationship_description": "The text explicitly states that the deal places QuantumChip in 'direct competition with established player OmniSilicon'."
        },
        {
            "source_entity": "Global Devices Corp.",
            "target_entity": "QuantumChip Inc.",
            "relation": "IS_CUSTOMER_OF",
            "relationship_description": "Global Devices Corp. will be using QuantumChip's products in its phones, making it a customer."
        }
    ]
}

-Real Data-
######################
text: {text}
######################
output:"""

In [24]:
import json


def parse_fn(response_str: str) -> Any:
    json_pattern = r"\{.*\}"
    match = re.search(json_pattern, response_str, re.DOTALL)
    entities = []
    relationships = []
    if not match:
        return entities, relationships
    json_str = match.group(0)
    try:
        data = json.loads(json_str)
        entities = [
            (
                entity["entity_name"],
                entity["entity_type"],
                entity["entity_description"],
            )
            for entity in data.get("entities", [])
        ]
        relationships = [
            (
                relation["source_entity"],
                relation["target_entity"],
                relation["relation"],
                relation["relationship_description"],
            )
            for relation in data.get("relationships", [])
        ]
        return entities, relationships
    except json.JSONDecodeError as e:
        print("Error parsing JSON:", e)
        return entities, relationships


kg_extractor = GraphRAGExtractor(
    llm=llm,
    extract_prompt=KG_TRIPLET_EXTRACT_TMPL,
    max_paths_per_chunk=2,
    parse_fn=parse_fn,
)
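
To see what the parser hands back to the extractor, here is a self-contained sketch that repeats the same parsing logic in compact form under the hypothetical name `parse_graph_json` and applies it to a made-up LLM response. The sample response text is an assumption for illustration.

```python
import json
import re
from typing import List, Tuple

EntityTuple = Tuple[str, str, str]
RelationTuple = Tuple[str, str, str, str]


def parse_graph_json(
    response_str: str,
) -> Tuple[List[EntityTuple], List[RelationTuple]]:
    """Extract entity and relationship tuples from the first-to-last brace span."""
    match = re.search(r"\{.*\}", response_str, re.DOTALL)
    if not match:
        return [], []
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return [], []

    entities = [
        (e["entity_name"], e["entity_type"], e["entity_description"])
        for e in data.get("entities", [])
    ]
    relationships = [
        (
            r["source_entity"],
            r["target_entity"],
            r["relation"],
            r["relationship_description"],
        )
        for r in data.get("relationships", [])
    ]
    return entities, relationships


# A made-up response with stray text around the JSON payload.
sample = """Here is the result:
{"entities": [{"entity_name": "QuantumChip Inc.", "entity_type": "Company",
               "entity_description": "Designs AI accelerators."}],
 "relationships": []}"""

entities, relationships = parse_graph_json(sample)
print(entities)
print(parse_graph_json("no JSON here"))
```

The relationship tuples come back in `(source, target, relation, description)` order, which is the order `_aextract` unpacks them in.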

## Docker and Neo4j Setup

To launch Neo4j locally, first ensure you have Docker installed. Then, you can launch the database with the following docker command.

```
docker run \
    -p 7474:7474 -p 7687:7687 \
    -v $PWD/data:/data -v $PWD/plugins:/plugins \
    --name neo4j-apoc \
    -e NEO4J_apoc_export_file_enabled=true \
    -e NEO4J_apoc_import_file_enabled=true \
    -e NEO4J_apoc_import_file_use__neo4j__config=true \
    -e NEO4JLABS_PLUGINS=\[\"apoc\"\] \
    neo4j:latest
```
From here, you can open the database at http://localhost:7474/. You will be asked to sign in; use the default username and password, `neo4j` and `neo4j`.

Once you log in for the first time, you will be asked to change the password.

In [50]:
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore

# Note: used to be `Neo4jPGStore`
graph_store = GraphRAGStore(
    username="neo4j", password="a96534200", url="neo4j://127.0.0.1:7687"
)
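
The cell above hard-codes the database password. Below is a minimal sketch of one alternative, which reads the connection settings from environment variables. The variable names `NEO4J_USERNAME`, `NEO4J_PASSWORD`, and `NEO4J_URL` and the helper `load_neo4j_settings` are assumptions chosen for illustration.

```python
import os


def load_neo4j_settings(env=os.environ):
    """Collect connection settings, failing early if the password is unset."""
    password = env.get("NEO4J_PASSWORD")  # assumed variable name
    if not password:
        raise RuntimeError("Set NEO4J_PASSWORD before connecting to Neo4j.")
    return {
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": password,
        "url": env.get("NEO4J_URL", "neo4j://127.0.0.1:7687"),
    }


# Example with an explicit mapping instead of the real environment.
settings = load_neo4j_settings({"NEO4J_PASSWORD": "example-password"})
print(settings["username"], settings["url"])
```

The resulting dictionary could then be unpacked into the store constructor, for example `GraphRAGStore(**load_neo4j_settings())`.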

In [51]:
from llama_index.core import PropertyGraphIndex

index = PropertyGraphIndex(
    nodes=nodes,
    kg_extractors=[kg_extractor],
    property_graph_store=graph_store,
    show_progress=True,
)

Extracting paths from text:   0%|          | 0/32 [00:00<?, ?it/s]

Extracting paths from text: 100%|██████████| 32/32 [00:38<00:00,  1.20s/it]
Generating embeddings: 100%|██████████| 1/1 [00:01<00:00,  1.51s/it]
Generating embeddings: 100%|██████████| 2/2 [00:01<00:00,  1.43it/s]


In [52]:
index.property_graph_store.get_triplets()[10]

[EntityNode(label='Company', embedding=None, properties={'id': 'Ennostar Inc.', 'entity_description': 'Ennostar Inc. is a leader in MicroLED technology, formed by the merger of Epistar and Lextar. The company focuses on advancements in MicroLED modules that offer superior brightness, contrast, and efficiency, targeting next-generation display applications.', 'industry': 'semiconductor', 'triplet_source_id': '71d78798-9112-4596-8706-d186333f6bcc'}, name='Ennostar Inc.'),
 Relation(label='IS_PARTNER_OF', source_id='Ennostar Inc.', target_id='Novatek Microelectronics', properties={'industry': 'semiconductor', 'triplet_source_id': '71d78798-9112-4596-8706-d186333f6bcc', 'relationship_description': 'Ennostar Inc. is working closely with Novatek Microelectronics to accelerate the commercialization of next-generation displays by combining MicroLED technology with advanced driver IC solutions.'}),
 EntityNode(label='Company', embedding=None, properties={'id': 'Novatek Microelectronics', 'entit

In [53]:
index.property_graph_store.get_triplets()[10][0].properties

{'id': 'Ennostar Inc.',
 'entity_description': 'Ennostar Inc. is a leader in MicroLED technology, formed by the merger of Epistar and Lextar. The company focuses on advancements in MicroLED modules that offer superior brightness, contrast, and efficiency, targeting next-generation display applications.',
 'industry': 'semiconductor',
 'triplet_source_id': '71d78798-9112-4596-8706-d186333f6bcc'}

In [54]:
index.property_graph_store.get_triplets()[10][1].properties

{'industry': 'semiconductor',
 'triplet_source_id': '71d78798-9112-4596-8706-d186333f6bcc',
 'relationship_description': 'Ennostar Inc. is working closely with Novatek Microelectronics to accelerate the commercialization of next-generation displays by combining MicroLED technology with advanced driver IC solutions.'}

### Build communities

This step creates the communities and generates a summary for each one.

In [55]:
index.property_graph_store.build_communities()

Building communities based on 'industry' attribute...
Successfully built 3 communities based on industry.
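
Judging by the log output, communities here are keyed on each entity's `industry` property. The following is a simplified, standalone illustration of that grouping idea using plain dictionaries. It is not the `build_communities` implementation defined earlier in the notebook, and the third sample record is a hypothetical placeholder.

```python
from collections import defaultdict

# Toy entity records; the last entry is a hypothetical placeholder.
entities = [
    {"id": "TSMC", "industry": "semiconductor"},
    {"id": "MediaTek", "industry": "semiconductor"},
    {"id": "Example PC Maker", "industry": "example-industry"},
]

# community id -> member names, and entity name -> community ids
community_members = defaultdict(list)
entity_info = defaultdict(list)
for entity in entities:
    community_id = entity["industry"]
    community_members[community_id].append(entity["id"])
    entity_info[entity["id"]].append(community_id)

print(dict(community_members))
print(dict(entity_info))
```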


In [56]:
# Get the graph_store object, which now has its communities built
graph_store_with_communities = index.property_graph_store

# 1. Show which communities each entity (company) belongs to
#    Format: { 'company name': [community_id_1, community_id_2, ...] }
print("--- Entity-to-Community Mapping (Entity Info) ---")
print(graph_store_with_communities.entity_info)
print("\n" + "=" * 50 + "\n")

# 2. Show the text summary of each community
#    Format: { community_id: 'summary text for this community...' }
print("--- Community Summaries ---")
for community_id, summary in graph_store_with_communities.community_summary.items():
    print(f"[Community ID: {community_id}]")
    print(summary)
    print("-" * 20)

--- Entity-to-Community Mapping (Entity Info) ---
{'Richtek Technology Corporation': ['semiconductor'], 'Silicon Motion': ['semiconductor'], 'Chunghwa Precision Test Tech': ['semiconductor'], 'United Microelectronics Corporation': ['semiconductor'], 'Vanguard International Semiconductor': ['semiconductor'], 'Novatek Microelectronics': ['semiconductor'], 'Yageo Corporation': ['semiconductor'], 'Powerchip Semiconductor Manufacturing Corp': ['semiconductor'], 'ASE Technology Holding': ['semiconductor'], 'Ennostar Inc.': ['semiconductor'], 'Macronix International': ['semiconductor'], 'GlobalWafers': ['semiconductor'], 'Unimicron Technology Corp.': ['semiconductor'], 'WIN Semiconductors': ['semiconductor'], 'TSMC': ['semiconductor'], 'Winbond Electronics Corporation': ['semiconductor'], 'Nuvoton Technology Corporation': ['semiconductor'], 'MediaTek': ['semiconductor'], 'Andes Technology': ['semiconductor'], 'Himax Technologies': ['semiconductor'], 'Gudeng Precision': ['semiconductor'], 'Powertech Technology

### Create QueryEngine

In [33]:
query_engine = GraphRAGQueryEngine(
    graph_store=index.property_graph_store,
    llm=llm,
    index=index,
    similarity_top_k=10,
)
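
`similarity_top_k=10` suggests that retrieval keeps the ten closest matches by embedding similarity. As a rough, library-independent sketch of that idea (not the `GraphRAGQueryEngine` internals), the snippet below ranks toy vectors by cosine similarity and keeps the top `k`. The vectors and the `top_k_by_cosine` helper are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_by_cosine(query_vec, candidates, k):
    """Return the k candidate names most similar to query_vec."""
    scored = sorted(
        candidates.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Toy 2-d "embeddings"; real embeddings have far more dimensions.
candidates = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [0.7, 0.7],
    "chunk_c": [0.0, 1.0],
}
top_two = top_k_by_cosine([1.0, 0.1], candidates, k=2)
print(top_two)
```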

### Querying

In [48]:
response = query_engine.query(
    "In total, how many relationships does MediaTek have?"
)
display(Markdown(f"{response.response}"))

Found entities in query: ['MediaTek']


Based on the provided information, MediaTek has five relationships: partners with ASE Technology Holding and Andes Technology (2 relationships), and suppliers including Richtek Technology Corporation, Yageo Corporation, and Andes Technology (3 relationships). Andes Technology appears as both a partner and a supplier, representing two distinct relationship types.

In [49]:
response = query_engine.query(
    "In total, how many relationships does TSMC have?"
)
display(Markdown(f"{response.response}"))

Found entities in query: ['TSMC']


Based on the community summary, TSMC has at least seven distinct relationships: three with suppliers (GlobalWafers, Unimicron Technology Corp., Gudeng Precision), two with customers (Phison Electronics, Silicon Motion), and two with customers/partners (MediaTek, Realtek Semiconductor).

In [36]:
response = query_engine.query(
    "In total, how many relationships does ASUS have?"
)
display(Markdown(f"{response.response}"))

Found entities in query: ['ASUS']


Based on the community summary, ASUS is involved in a total of 4 competitive relationships: ASUS competes with Gigabyte and Acer, and both Gigabyte and Acer compete with ASUS.
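
The answer above counts each direction of a competition separately, which is how four relationships arise from two competitors. For counting questions, a direct tally over the triplets can serve as a cross-check on the LLM's answer. The sketch below uses hand-written `(source, relation, target)` tuples mirroring that answer rather than the live graph store, and `count_relationships` is a hypothetical helper.

```python
def count_relationships(triplets, entity, undirected=False):
    """Count triplets touching `entity`; optionally merge A->B and B->A."""
    seen = set()
    for source, relation, target in triplets:
        if entity not in (source, target):
            continue
        if undirected:
            key = (relation, frozenset((source, target)))
        else:
            key = (relation, source, target)
        seen.add(key)
    return len(seen)


# Hand-written triplets mirroring the answer above.
triplets = [
    ("ASUS", "COMPETES_WITH", "Gigabyte"),
    ("ASUS", "COMPETES_WITH", "Acer"),
    ("Gigabyte", "COMPETES_WITH", "ASUS"),
    ("Acer", "COMPETES_WITH", "ASUS"),
]
directed = count_relationships(triplets, "ASUS")
merged = count_relationships(triplets, "ASUS", undirected=True)
print(directed, merged)
```

With `undirected=True`, the two directions of each competition collapse into one, so the tally reflects distinct counterparties per relation type.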

In [37]:
response = query_engine.query(
    "What are the key relationships MediaTek has, and what is the nature of these partnerships or competitions?"
)
display(Markdown(f"{response.response}"))

Found entities in query: ['MediaTek']


MediaTek maintains key relationships with partners like ASE Technology Holding for packaging solutions and Andes Technology for CPU cores, enabling advanced integration in its system-on-chips (SoCs). It also relies on suppliers such as Richtek Technology Corporation, Yageo Corporation, and Andes Technology for essential components in its SoCs and RFICs. These collaborations support MediaTek’s innovation and technological advancement, particularly in AI, automotive, and IoT applications. While operating in a competitive semiconductor landscape alongside companies like Realtek Semiconductor and Silicon Motion, MediaTek’s focus remains on strategic partnerships that enhance its product development and manufacturing capabilities.

In [38]:
response = query_engine.query(
    "Based on its relationships, how does MediaTek position itself against competitors like Qualcomm and collaborate with suppliers like TSMC?"
)
display(Markdown(f"{response.response}"))

Found entities in query: ['TSMC', 'MediaTek']


MediaTek positions itself as a key semiconductor industry player by leveraging strategic collaborations and a strong supplier network to compete with rivals like Qualcomm. Key partnerships include TSMC for advanced chip manufacturing, ASE Technology Holding for packaging solutions, and Andes Technology for CPU cores. These collaborations grant MediaTek access to cutting-edge fabrication technologies and enhance its product capabilities in AI, automotive, and IoT applications. By integrating these relationships and focusing on innovation, MediaTek emphasizes technological advancement and a robust supply chain to maintain competitiveness in the market.

In [44]:
response = query_engine.query(
    "Based on TSMC's relationships and activities, what is the current trend of the semiconductor industry?"
)
display(Markdown(f"{response.response}"))

Found entities in query: ['TSMC']


The current trend in the semiconductor industry is marked by increasing collaboration and integration across the supply chain to drive innovation. TSMC’s central role as a leading foundry underscores the industry's focus on advanced manufacturing to support cutting-edge products. Partnerships with companies like MediaTek, Realtek, and Phison highlight a move toward specialization and reliance on expert manufacturing to meet growing demands in AI, automotive, and IoT sectors. Overall, the industry is evolving toward deeper cooperation among suppliers, manufacturers, and customers to accelerate the development and production of sophisticated semiconductor technologies.