[LightRAG](https://github.com/HKUDS/LightRAG) is an open-source RAG system that enhances LLMs by integrating graph-based structures into text indexing and retrieval. It overcomes the limitations of traditional RAG systems, such as fragmented answers and weak contextual awareness, by enabling dual-level retrieval for more comprehensive knowledge discovery. With support for incremental data updates, LightRAG ensures timely integration of new information while delivering improved retrieval accuracy and efficiency.

To run this Jupyter Notebook, you can download the original `.ipynb` file from [lightrag.ipynb](https://github.com/xuanleilin/tigergraphx/tree/main/docs/graphrag/lightrag.ipynb).

---

## Prerequisites

Before proceeding, ensure you’ve completed the installation and setup steps outlined in the [Installation Guide](../getting_started/installation.md), including:

- Set up Python and TigerGraph. For details, see the [Requirements](../../getting_started/installation/#requirements) section.
- Install TigerGraphX along with its development dependencies. For details, see the [Development Installation](../../getting_started/installation/#development-installation) section.
- Set the environment variables **`TG_HOST`**, **`TG_USERNAME`**, and **`TG_PASSWORD`**, which are required to connect to the TigerGraph server, and **`OPENAI_API_KEY`**, which is required to connect to OpenAI. Set each variable with a command like the following (shown here for `TG_HOST`):  

   ```bash
   export TG_HOST=https://127.0.0.1
   ```
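
Before starting the notebook, it can help to confirm that all four variables are visible to the Python process. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is illustrative and is not part of TigerGraphX or LightRAG.

```python
import os

# Variables named in the prerequisites above.
REQUIRED_VARS = ("TG_HOST", "TG_USERNAME", "TG_PASSWORD", "OPENAI_API_KEY")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to try out without touching the real environment.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```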


---

## Implement Graph Storage with TigerGraphX

In LightRAG, the storage layers are abstracted into components such as graph storage, key-value storage, and vector storage. You can refer to the base classes **BaseGraphStorage**, **BaseVectorStorage**, and **BaseKVStorage** in the [source code](https://github.com/HKUDS/LightRAG/blob/main/lightrag/base.py).  

In this section, we will demonstrate how to use TigerGraphX to implement the `BaseGraphStorage` class for storing and retrieving data in TigerGraph.

In [2]:
import os
from dataclasses import dataclass
import numpy as np

from lightrag.base import BaseGraphStorage
from lightrag.utils import logger
from tigergraphx import UndiGraph, TigerGraphConnectionConfig


@dataclass
class TigerGraphStorage(BaseGraphStorage):
    def __post_init__(self):
        try:
            # Retrieve connection configuration from environment variables
            connection_config = {
                "host": os.environ["TG_HOST"],
                "username": os.environ["TG_USERNAME"],
                "password": os.environ["TG_PASSWORD"],
            }
            logger.info("TigerGraph connection configuration retrieved successfully.")
            # Initialize the graph
            self._graph = UndiGraph(
                graph_name="LightRAG",
                node_type="MyNode",
                edge_type="MyEdge",
                node_primary_key="id",
                node_attributes={
                    "id": "STRING",
                    "entity_type": "STRING",
                    "description": "STRING",
                    "source_id": "STRING",
                },
                edge_attributes={
                    "weight": "DOUBLE",
                    "description": "STRING",
                    "keywords": "STRING",
                    "source_id": "STRING",
                },
                tigergraph_connection_config=TigerGraphConnectionConfig.ensure_config(
                    connection_config
                ),
            )
            logger.info(
                "Undirected graph initialized successfully with graph_name 'LightRAG'."
            )
        except KeyError as e:
            logger.error(f"Environment variable {str(e)} is missing.")
            raise
        except Exception as e:
            logger.error(f"An error occurred during initialization: {e}")
            raise

    @staticmethod
    def clean_quotes(value: str) -> str:
        """Remove leading and trailing double quotes from a string if present."""
        if value.startswith('"') and value.endswith('"'):
            return value[1:-1]
        return value

    async def has_node(self, node_id: str) -> bool:
        return self._graph.has_node(self.clean_quotes(node_id))

    async def has_edge(self, source_node_id: str, target_node_id: str) -> bool:
        return self._graph.has_edge(
            self.clean_quotes(source_node_id), self.clean_quotes(target_node_id)
        )

    async def node_degree(self, node_id: str) -> int:
        result = self._graph.degree(self.clean_quotes(node_id))
        return result

    async def edge_degree(self, src_id: str, tgt_id: str) -> int:
        return self._graph.degree(self.clean_quotes(src_id)) + self._graph.degree(
            self.clean_quotes(tgt_id)
        )

    async def get_node(self, node_id: str) -> dict | None:
        result = self._graph.get_node_data(self.clean_quotes(node_id))
        return result

    async def get_edge(self, source_node_id: str, target_node_id: str) -> dict | None:
        result = self._graph.get_edge_data(
            self.clean_quotes(source_node_id), self.clean_quotes(target_node_id)
        )
        return result

    async def get_node_edges(self, source_node_id: str) -> list[tuple[str, str]] | None:
        source_node_id = self.clean_quotes(source_node_id)
        if self._graph.has_node(source_node_id):
            edges = self._graph.get_node_edges(source_node_id)
            return list(edges)
        return None

    async def upsert_node(self, node_id: str, node_data: dict[str, str]):
        node_id = self.clean_quotes(node_id)
        self._graph.add_node(node_id, **node_data)

    async def upsert_edge(
        self, source_node_id: str, target_node_id: str, edge_data: dict[str, str]
    ):
        source_node_id = self.clean_quotes(source_node_id)
        target_node_id = self.clean_quotes(target_node_id)
        self._graph.add_edge(source_node_id, target_node_id, **edge_data)

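    async def delete_node(self, node_id: str):
        # Normalize the ID the same way as the other methods before the lookup
        node_id = self.clean_quotes(node_id)
        if self._graph.has_node(node_id):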
            self._graph.remove_node(node_id)
            logger.info(f"Node {node_id} deleted from the graph.")
        else:
            logger.warning(f"Node {node_id} not found in the graph for deletion.")

    async def embed_nodes(self, algorithm: str) -> tuple[np.ndarray, list[str]]:
        return np.array([]), []

This code defines a `TigerGraphStorage` class that implements the `BaseGraphStorage` interface for graph storage and retrieval using **TigerGraphX**, a Python library for interacting with TigerGraph databases.

Key highlights of this implementation include:

1. **Graph Initialization**  
   - An **undirected homogeneous graph** (`UndiGraph`) is initialized. This graph type supports only one type of node and edge, making it similar to **NetworkX**'s undirected graph.
   - TigerGraph’s schema-based nature requires a graph schema definition with specific attributes for nodes and edges. For instance:  
     - Node attributes: `id`, `entity_type`, `description`, `source_id`  
     - Edge attributes: `weight`, `description`, `keywords`, `source_id`

2. **TigerGraphX Interfaces**  
   - TigerGraphX provides user-friendly interfaces, very similar to NetworkX, which simplify operations like:  
     - **Node Operations**: `has_node`, `add_node`, `remove_node`, `get_node_data`  
     - **Edge Operations**: `has_edge`, `add_edge`, `get_edge_data`, `get_node_edges`  
     - **Degree Calculation**: `degree` for nodes. The storage class derives an edge's degree by adding the degrees of its two endpoints.

3. **Key Methods**  
   - **Storage Operations**:  
     - `upsert_node`: Inserts or updates a node with its data.  
     - `upsert_edge`: Inserts or updates an edge between two nodes.  
     - `delete_node`: Deletes a node if it exists.  
   - **Data Retrieval**:  
     - `get_node`: Retrieves data for a specific node.  
     - `get_edge`: Retrieves data for a specific edge.  
     - `get_node_edges`: Retrieves all edges for a given node.  
   - **Graph Metrics**:  
     - `node_degree`: Returns the degree of a node.  
     - `edge_degree`: Returns the sum of the degrees of an edge's two endpoint nodes.  

4. **Additional Notes**  
   - The `clean_quotes` method normalizes IDs by removing one pair of surrounding double quotes, and only when the string both starts and ends with a double quote.  
   - TigerGraphX goes beyond NetworkX’s capabilities by supporting **heterogeneous graphs** (graphs with multiple types of nodes and edges) using the `Graph` class, in addition to undirected (`UndiGraph`) and directed graphs (`DiGraph`).

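To make the semantics of these methods concrete without a running TigerGraph instance, here is a small in-memory stand-in. It is only a sketch: `InMemoryGraphStorage` is a hypothetical class backed by plain dictionaries, it does not use TigerGraphX or inherit from `BaseGraphStorage`, and it mirrors just a few of the method names from the class above. Its `clean_quotes` also adds a length check that the original does not have. The example shows how quoted and unquoted IDs resolve to the same node and how `edge_degree` is the sum of the two endpoint degrees.

```python
import asyncio


def clean_quotes(value: str) -> str:
    """Remove one pair of surrounding double quotes, if present."""
    # The length check leaves a lone quote character untouched.
    if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
        return value[1:-1]
    return value


class InMemoryGraphStorage:
    """Hypothetical dict-backed stand-in for an undirected graph store."""

    def __init__(self):
        self._nodes = {}      # node_id -> attribute dict
        self._adjacency = {}  # node_id -> {neighbor_id: edge attribute dict}

    async def upsert_node(self, node_id, node_data):
        node_id = clean_quotes(node_id)
        self._nodes.setdefault(node_id, {}).update(node_data)
        self._adjacency.setdefault(node_id, {})

    async def upsert_edge(self, source_node_id, target_node_id, edge_data):
        src, tgt = clean_quotes(source_node_id), clean_quotes(target_node_id)
        for node_id in (src, tgt):
            await self.upsert_node(node_id, {})
        # Undirected: record the edge under both endpoints.
        self._adjacency[src][tgt] = dict(edge_data)
        self._adjacency[tgt][src] = dict(edge_data)

    async def has_node(self, node_id):
        return clean_quotes(node_id) in self._nodes

    async def node_degree(self, node_id):
        return len(self._adjacency.get(clean_quotes(node_id), {}))

    async def edge_degree(self, src_id, tgt_id):
        # Same definition as in the class above: sum of both endpoint degrees.
        return await self.node_degree(src_id) + await self.node_degree(tgt_id)


async def demo():
    storage = InMemoryGraphStorage()
    await storage.upsert_edge('"ACME"', '"Q3 REPORT"', {"weight": 1.0})
    await storage.upsert_edge("ACME", "CFO", {"weight": 2.0})
    return (
        await storage.has_node('"ACME"'),
        await storage.node_degree("ACME"),
        await storage.edge_degree("ACME", "CFO"),
    )


print(asyncio.run(demo()))  # (True, 2, 3)
```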
---
## Integrating Custom Graph Storage with LightRAG
After defining the `TigerGraphStorage` class, we integrate it into LightRAG. By subclassing LightRAG and extending its storage mapping, you can easily replace or augment the default storage backends with your custom solution.  

While modifying the LightRAG source code is another option, this example demonstrates how to achieve the integration without altering the original source code.

Below is the code for creating a `CustomLightRAG` class that incorporates `TigerGraphStorage` into its storage mapping.


In [3]:
from lightrag import LightRAG


# Define a subclass to include your custom graph storage in the storage mapping
class CustomLightRAG(LightRAG):
    def _get_storage_class(self):
        # Extend the default storage mapping with your custom storage
        base_mapping = super()._get_storage_class()
        base_mapping["TigerGraphStorage"] = TigerGraphStorage
        return base_mapping

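The override above follows a common pattern: call the parent's mapping method, add one entry, and return the result, so that a backend can be selected by name. The sketch below reproduces that pattern with stand-in classes only; `BaseRAG`, `CustomRAG`, `DefaultGraphStorage`, and `MyGraphStorage` are hypothetical names and do not reflect LightRAG's internals beyond what the cell above shows.

```python
class DefaultGraphStorage:
    """Hypothetical built-in backend."""


class MyGraphStorage:
    """Hypothetical custom backend."""


class BaseRAG:
    """Stand-in for a framework class that resolves storage backends by name."""

    def __init__(self, graph_storage="DefaultGraphStorage"):
        # Look up the backend class by its registered name and instantiate it.
        storage_cls = self._get_storage_class()[graph_storage]
        self.graph = storage_cls()

    def _get_storage_class(self):
        return {"DefaultGraphStorage": DefaultGraphStorage}


class CustomRAG(BaseRAG):
    def _get_storage_class(self):
        # Extend a copy of the parent's mapping instead of replacing it,
        # so the default backends stay selectable.
        mapping = dict(super()._get_storage_class())
        mapping["MyGraphStorage"] = MyGraphStorage
        return mapping


rag = CustomRAG(graph_storage="MyGraphStorage")
print(type(rag.graph).__name__)  # MyGraphStorage
```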
---

## Indexing
### Data Preparation
#### Set Up Working Directory
Create a folder to serve as the working directory. For this demo, we will use `applications/lightrag/data`.

Next, create an `input` folder inside the `data` directory to store the documents you want to index:  

```bash
mkdir -p applications/lightrag/data/input
```

#### Add Documents to the Input Folder
Copy your documents (e.g., `fin.txt`) into the `applications/lightrag/data/input` folder.

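If you prefer to prepare the folder from Python, the same layout can be created with `pathlib`. This is a sketch under the assumption that the documents are UTF-8 text files with a `.txt` extension; the helper names `ensure_input_dir` and `load_documents` are illustrative, and the demo writes a placeholder file into a temporary directory rather than the real working directory.

```python
import tempfile
from pathlib import Path


def ensure_input_dir(working_dir):
    """Create `<working_dir>/input` if needed and return its path."""
    input_dir = Path(working_dir) / "input"
    input_dir.mkdir(parents=True, exist_ok=True)
    return input_dir


def load_documents(input_dir):
    """Return {file name: text} for every .txt file, sorted by name."""
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(Path(input_dir).glob("*.txt"))
    }


# Demo in a throwaway directory so nothing is written to the project tree.
with tempfile.TemporaryDirectory() as tmp:
    input_dir = ensure_input_dir(Path(tmp) / "data")
    (input_dir / "fin.txt").write_text("Sample text.", encoding="utf-8")
    print(load_documents(input_dir))  # {'fin.txt': 'Sample text.'}
```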
---

### Indexing
The following code points LightRAG at the working directory, selects `TigerGraphStorage` as the graph storage, and indexes the document `fin.txt`.

In [None]:
import logging
import nest_asyncio
# Use the nest_asyncio package to allow running nested event loops in Jupyter Notebook without conflicts.
nest_asyncio.apply()

# Raise the log level to WARNING for the noisy libraries. A separate variable
# name avoids overwriting the `logger` imported from lightrag.utils earlier.
for name in ("lightrag", "httpx"):
    lib_logger = logging.getLogger(name)
    lib_logger.setLevel(logging.WARNING)
    lib_logger.propagate = False
logging.basicConfig(level=logging.WARNING)

working_dir = "../../applications/lightrag/data"

custom_rag = CustomLightRAG(
    working_dir=working_dir,
    graph_storage="TigerGraphStorage",
)

with open(working_dir + "/input/fin.txt") as f:
    custom_rag.insert(f.read())

The output of this cell is omitted here because it is long and consists mostly of log messages.

## Querying
The following code demonstrates how to perform a query in LightRAG using the TigerGraph graph storage implementation.

In [5]:
from lightrag import QueryParam

custom_rag = CustomLightRAG(
    working_dir=working_dir,
    graph_storage="TigerGraphStorage",
)

query = "What is the overall financial health of the company?"

result = custom_rag.query(query=query, param=QueryParam(mode="hybrid"))

print("------------------- Query Result:  -------------------")
print(result)

2024-12-17 23:06:03,237 - nano-vectordb - INFO - Load (704, 1536) data
2024-12-17 23:06:03,240 - nano-vectordb - INFO - Init {'embedding_dim': 1536, 'metric': 'cosine', 'storage_file': '../../applications/lightrag/data/vdb_entities.json'} 704 data
2024-12-17 23:06:03,264 - nano-vectordb - INFO - Load (397, 1536) data
2024-12-17 23:06:03,267 - nano-vectordb - INFO - Init {'embedding_dim': 1536, 'metric': 'cosine', 'storage_file': '../../applications/lightrag/data/vdb_relationships.json'} 397 data
2024-12-17 23:06:03,271 - nano-vectordb - INFO - Load (46, 1536) data
2024-12-17 23:06:03,272 - nano-vectordb - INFO - Init {'embedding_dim': 1536, 'metric': 'cosine', 'storage_file': '../../applications/lightrag/data/vdb_chunks.json'} 46 data
{
  "high_level_keywords": ["Financial health", "Company performance", "Business sustainability"],
  "low_level_keywords": ["Revenue", "Expenses", "Profit margins", "Debt", "Cash flow", "Assets"]
}
------------------- Query Result:  -------------------
Th