4 changes: 4 additions & 0 deletions experimental/README.md
@@ -43,6 +43,10 @@ Experimental examples are sample code and deployments for RAG pipelines that are

This example is able to ingest PDFs, PowerPoint slides, Word and other documents with complex data formats including text, images, slides and tables. It allows users to ask questions through a text interface and optionally with an image query, and it can respond with text and reference images, slides and tables in its response, along with source links and downloads.

* [NVIDIA Knowledge Graph RAG](./knowledge_graph_rag)

This example implements a GPU-accelerated pipeline for creating and querying knowledge graphs using Retrieval-Augmented Generation (RAG). The approach leverages NVIDIA's AI technologies and the RAPIDS ecosystem to process large-scale datasets efficiently. It allows users to interact through a chat interface, visualize the corresponding knowledge graph, and perform evaluations against synthetic data generated with NVIDIA's Nemotron-4 340B model.

* [Run RAG-LLM in Azure Machine Learning](./AzureML)

This example shows the configuration changes to using Docker containers and local GPUs that are required
170 changes: 170 additions & 0 deletions experimental/knowledge_graph_rag/README.md
@@ -0,0 +1,170 @@
# Knowledge Graphs for RAG with NVIDIA AI Foundation Models and Endpoints

This repository implements a GPU-accelerated pipeline for creating and querying knowledge graphs using Retrieval-Augmented Generation (RAG). Our approach leverages NVIDIA's AI technologies and the RAPIDS ecosystem to process large-scale datasets efficiently.

## Overview

This project demonstrates:
- Creation of knowledge graphs from various document sources
- A simple script to download research papers from arXiv for a given topic
- GPU-accelerated graph processing and analysis using [cuGraph](https://github.com/rapidsai/cugraph), NVIDIA's RAPIDS graph analytics library
- Hybrid semantic search combining keyword and dense vector approaches
- Integration of knowledge graphs into RAG workflows
- Visualization of the knowledge graph through [Gephi-Lite](https://github.com/gephi/gephi-lite), an open-source web app for visualization of large graphs
- Comprehensive evaluation metrics using NVIDIA's Nemotron-4 340B model for synthetic data generation and reward scoring

## Technologies Used

- **Frontend**: Streamlit
- **Graph Representation and Optimization**: cuGraph (RAPIDS), NetworkX
- **Vector Database**: Milvus
- **LLMs**:
  - NVIDIA AI Playground hosted models for graph creation and querying, providing numerous instruct-fine-tuned options
  - NVIDIA AI Playground hosted Nemotron-4 340B model for synthetic data generation and evaluation reward scoring

## Architecture Diagram

The ingestion system is designed around a high-throughput hosted LLM deployment that can process multiple document chunks in parallel, as shown below. The LLM can optionally be fine-tuned for triple extraction, which allows a shorter prompt and enables greater accuracy and optimized inference.

```mermaid
graph TD
A[Document Collection] --> B{Document Splitter}
B --> |Chunk 1| C1[LLM Stream 1]
B --> |Chunk 2| C2[LLM Stream 2]
B --> |Chunk 3| C3[LLM Stream 3]
B --> |...| C4[...]
B --> |Chunk N| C5[LLM Stream N]
C1 --> D[Response Parser<br>and Aggregator]
C2 --> D
C3 --> D
C4 --> D
C5 --> D
D --> E[GraphML Generator]
E --> F[Single GraphML File]
```
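
To make the fan-out concrete, here is a minimal Python sketch of the pattern using `langchain-nvidia-ai-endpoints`; the prompt format and helper names are illustrative assumptions, not the pipeline's actual implementation (which lives in `utils/lc_graph.py`):

```python
from concurrent.futures import ThreadPoolExecutor

from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Illustrative prompt; the shipped pipeline builds its own prompt internally.
TRIPLE_PROMPT = (
    "Extract (subject, relation, object) triples from the text below. "
    "Return one triple per line as: subject | relation | object\n\n{chunk}"
)

llm = ChatNVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1")

def extract_triples_from_chunk(chunk):
    """Send one chunk to the hosted LLM and parse pipe-delimited triples."""
    response = llm.invoke(TRIPLE_PROMPT.format(chunk=chunk))
    triples = []
    for line in response.content.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3:
            triples.append(tuple(parts))
    return triples

def extract_all(chunks, max_workers=8):
    """Fan chunks out to parallel LLM calls, then aggregate the parsed triples."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_chunk = pool.map(extract_triples_from_chunk, chunks)
    return [t for chunk_triples in per_chunk for t in chunk_triples]
```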

Here's how the inference system is designed, incorporating hybrid search (dense vector plus sparse keyword), reranking, and a knowledge graph for multi-hop search:

```mermaid
graph LR
E(User Query) --> A(FRONTEND<br/>Chat UI<br/>Streamlit)
A --Dense-Sparse<br>Retrieval--> C(Milvus Vector DB)
A --Multi-hop<br>Search--> F(Knowledge Graph <br> with cuGraph)
C --Hybrid<br>Chunks--> X(Reranker)
X -- Augmented<br/>Prompt--> B((Hosted LLM API<br/>NVIDIA AI Playground))
F -- Graph Context<br>Triples--> B
B --> D(Streaming<br/>Chat Response)
```

This architecture shows how the user query is processed through both the Milvus Vector DB for traditional retrieval and the Knowledge Graph with cuGraph for multi-hop search. The results from both are then used to augment the prompt sent to the NVIDIA AI Playground backend.
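
As a rough sketch of the prompt-augmentation step, the snippet below merges reranked chunks from the vector search with relation triples from the graph neighborhood of the query's entities; the helper signatures, hop depth, and prompt wording are assumptions rather than this repository's exact API:

```python
import networkx as nx
from langchain_nvidia_ai_endpoints import ChatNVIDIA

llm = ChatNVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1")
G = nx.read_graphml("knowledge_graph.graphml")  # produced by the ingestion app

def graph_context(entities, hops=2):
    """Collect relation triples within `hops` of each query entity (multi-hop search)."""
    triples = []
    for entity in entities:
        if entity not in G:
            continue
        reachable = nx.single_source_shortest_path_length(G, entity, cutoff=hops)
        for u, v, data in G.edges(data=True):
            if u in reachable:
                triples.append(f"({u}, {data.get('relation', '?')}, {v})")
    return triples

def answer(query, entities, reranked_chunks):
    """Augment the prompt with both retrieval sources before calling the LLM."""
    context = "\n".join(reranked_chunks + graph_context(entities))
    prompt = f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return llm.invoke(prompt).content
```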

## Setup Steps

Follow these steps to get the chatbot up and running in less than 5 minutes:

### 1. Clone this repository to a Linux machine

```bash
git clone https://github.com/NVIDIA/GenerativeAIExamples/ && cd GenerativeAIExamples/experimental/knowledge_graph_rag
```

### 2. Get an NVIDIA AI Playground API Key

```bash
export NVIDIA_API_KEY="nvapi-*******************"
```

If you don't have an API key, follow [these instructions](https://github.com/NVIDIA/GenerativeAIExamples/blob/main/docs/api-catalog.md#get-an-api-key-for-the-accessing-models-on-the-api-catalog) to sign up for an NVIDIA AI Foundation developer account and obtain access.

### 3. Create a Python virtual environment and activate it

```bash
pip install virtualenv
python3 -m virtualenv venv
source venv/bin/activate
```

### 4. Install the required packages

```bash
pip install -r requirements.txt
```

### 5. Set up a hosted Milvus vector database

Follow the instructions [here](https://milvus.io/docs/install_standalone-docker.md) to deploy a hosted Milvus instance for the vector database backend. Note that it must be Milvus 2.4 or later to support [hybrid search](https://milvus.io/docs/multi-vector-search.md); disabling this feature for earlier versions of Milvus is not currently supported.
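
For reference, a client-level hybrid query in pymilvus 2.4 looks roughly like the sketch below; the vector field names and stand-in embeddings are assumptions (the app encapsulates all of this in its `SearchHandler`):

```python
from pymilvus import AnnSearchRequest, Collection, RRFRanker, connections

connections.connect(uri="http://localhost:19530")
collection = Collection("hybrid_demo3")  # collection name used by the app

# Stand-in query embeddings; in practice these come from your embedding model (e.g. BGE-M3).
query_dense_vec = [0.0] * 1024
query_sparse_vec = {1: 0.5, 42: 0.3}

# One request per vector field: sparse (keyword-style) and dense (semantic).
sparse_req = AnnSearchRequest(
    data=[query_sparse_vec], anns_field="sparse_vector",
    param={"metric_type": "IP"}, limit=10,
)
dense_req = AnnSearchRequest(
    data=[query_dense_vec], anns_field="dense_vector",
    param={"metric_type": "IP"}, limit=10,
)

# Reciprocal Rank Fusion merges the sparse and dense rankings into one list.
hits = collection.hybrid_search(
    [sparse_req, dense_req], rerank=RRFRanker(), limit=5, output_fields=["text"],
)
```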

### 6. Launch the Streamlit frontend

```bash
streamlit run app.py
```

Open the URL in your browser to access the UI and chatbot!

### 7. Upload documents and create the knowledge graph

Upload your own documents to a folder, or use an existing folder for the knowledge graph creation. Note that the implementation currently focuses on text from PDFs only. It can be extended to other text file formats using the Unstructured.io data loader in LangChain.
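
As a sketch of that extension, a LangChain Unstructured loader can pull text from many other formats; the file path below is illustrative, and the `unstructured` package (with format-specific extras) must be installed:

```python
from langchain_community.document_loaders import UnstructuredFileLoader

# Handles many text-bearing formats (.docx, .pptx, .html, ...) when the
# corresponding `unstructured` extras are installed.
loader = UnstructuredFileLoader("notes/meeting_notes.docx")  # illustrative path
docs = loader.load()
print(docs[0].page_content[:200])
```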

## Pipeline Components

1. **Data Ingestion**:
   - arXiv paper downloader
   - Arbitrary document folder ingestion
2. **Knowledge Graph Creation**:
   - Uses the API Catalog models through the LangChain NVIDIA AI Endpoints interface
3. **Graph Representation**: cuGraph + RAPIDS + NetworkX (see the GPU sketch after this list)
4. **Semantic Search**: Milvus 2.4.x for hybrid (keyword + dense vector) search
5. **RAG Integration**: Custom workflow incorporating knowledge graph retrieval
6. **Evaluation**: Comparison of different RAG approaches using the Nemotron-4 340B model
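
As a sketch of the GPU-accelerated graph side, assuming a working RAPIDS install, the triples CSV written by `save_triples_to_csvs` can be loaded straight into cuGraph; the PageRank call is just an example analysis, not a step the app performs:

```python
import cudf
import cugraph

# Load the triples edge list directly into GPU memory.
edges = cudf.read_csv("triples.csv")

G = cugraph.Graph(directed=True)
G.from_cudf_edgelist(edges, source="entity_id_1", destination="entity_id_2")

# Example: GPU-accelerated PageRank over the knowledge graph.
ranks = cugraph.pagerank(G)
print(ranks.sort_values("pagerank", ascending=False).head())
```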

## Evaluation Metrics

We've implemented comprehensive evaluation metrics using NVIDIA's Nemotron-4 340B model, which is designed for synthetic data generation and reward scoring. Our evaluation compares different RAG approaches across five key attributes (a scoring sketch follows the list):

1. **Helpfulness**: Overall helpfulness of the response to the prompt.
2. **Correctness**: Inclusion of all pertinent facts without errors.
3. **Coherence**: Consistency and clarity of expression.
4. **Complexity**: Intellectual depth required to write the response.
5. **Verbosity**: Amount of detail included in the response, relative to what is asked for in the prompt.
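
A rough sketch of scoring one question/answer pair against the hosted reward model follows; parsing the response as attribute/score pairs in the `logprobs` payload is an assumption, so consult the model card for the exact response format:

```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

# The reward model scores the assistant turn rather than generating text.
completion = client.chat.completions.create(
    model="nvidia/nemotron-4-340b-reward",
    messages=[
        {"role": "user", "content": "What does cuGraph accelerate?"},
        {"role": "assistant", "content": "cuGraph runs graph analytics on NVIDIA GPUs."},
    ],
)

# Assumption: attribute names and scores arrive as token/logprob pairs.
scores = {t.token: t.logprob for t in completion.choices[0].logprobs.content}
print(scores)  # e.g. {"helpfulness": 2.1, "correctness": 1.9, ...}
```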

## Evaluation Results

We compared four RAG approaches on a small representative dataset using the Nemotron-4 340B reward model:

![Evaluation Results](viz.png)

Key takeaways:
- Graph RAG significantly outperforms traditional Text RAG.
- Combined Text and Graph RAG shows promise but doesn't consistently beat the ground truth yet; this may be due to how we structure the augmented prompt for the LLM and needs more experimentation.
- Our approach improves on verbosity and coherence compared to ground truth.

While we're not beating long-context ground truth across the board, these results show the potential of integrating knowledge graphs into RAG systems. We're particularly excited about the improvements in verbosity and coherence. Next steps include refining how we combine text and graph retrieval to get the best of both worlds.

## Component Swapping

All components are designed to be swappable. Here are some options:

- **Frontend**: The current Streamlit implementation can be replaced with other web frameworks.
- **Retrieval**: The embedding and reranker models used for semantic search can be swapped for higher-performance alternatives. The number of entities retrieved prior to reranking and the document chunk size can also be changed.
- **Vector DB**: While we use Milvus, it can be replaced with options like ChromaDB, Pinecone, FAISS, etc. Milvus is designed to be highly performant and scale on GPU infrastructure.
- **Backend**:
  - Cloud Hosted: Currently uses NVIDIA AI Playground APIs, but can be deployed in a private DGX Cloud or on AWS/Azure/GCP with NVIDIA GPUs and LLMs.
  - On-Prem/Locally Hosted: Smaller models like Llama2-7B or Mistral-7B can be run locally with appropriate hardware, and a model can be fine-tuned specifically for triple extraction for a given use case (see the sketch after this list).
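
For example, swapping the cloud backend for a locally hosted, OpenAI-compatible endpoint can be as small as changing the connector's base URL; the URL and model name below are placeholders:

```python
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Hosted API Catalog backend (default base URL).
cloud_llm = ChatNVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1")

# Locally hosted OpenAI-compatible endpoint, e.g. a NIM container on this machine.
local_llm = ChatNVIDIA(
    base_url="http://localhost:8000/v1",  # placeholder URL
    model="meta/llama-2-7b-chat",         # placeholder model name
)

print(local_llm.invoke("Extract triples from: cuGraph accelerates NetworkX.").content)
```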

## Future Work

- Dynamic information incorporation into knowledge graphs (continuous update of knowledge graphs)
- Further refinement of evaluation metrics and combined semantic-graphRAG pipeline
- Investigating the impact of different graph structures and queries on RAG performance (single/multi-hop retrieval, BFS/DFS, etc.)
- Expanding support for various document types and formats (multimodal RAG with knowledge graphs)
- Fine-tuning the Nemotron-4 340B model for domain-specific evaluations

## Contributing

Please open a pull request to this repository; our team appreciates any and all contributions that add features! We will review it and get back to you as soon as possible.

## Acknowledgements

This project utilizes NVIDIA's AI technologies, including the Nemotron-4 340B model, and the RAPIDS ecosystem. We thank the open-source community for their invaluable contributions to the tools and libraries used in this project.
138 changes: 138 additions & 0 deletions experimental/knowledge_graph_rag/app.py
@@ -0,0 +1,138 @@
# SPDX-FileCopyrightText: Copyright (c) 2023-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

import networkx as nx
import pandas as pd
import streamlit as st
from langchain_nvidia_ai_endpoints import ChatNVIDIA
from llama_index.core import SimpleDirectoryReader

from utils.lc_graph import process_documents, save_triples_to_csvs
from vectorstore.search import SearchHandler

def load_data(input_dir, num_workers):
    reader = SimpleDirectoryReader(input_dir=input_dir)
    documents = reader.load_data(num_workers=num_workers)
    return documents

def has_pdf_files(directory):
    for file in os.listdir(directory):
        if file.endswith(".pdf"):
            return True
    return False

st.title("Knowledge Graph RAG")

st.subheader("Load Data from Files")

# Variable for documents
if 'documents' not in st.session_state:
st.session_state['documents'] = None

models = ChatNVIDIA.get_available_models()
available_models = [model.id for model in models if model.model_type == "chat" and "instruct" in model.id]
with st.sidebar:
    llm = st.selectbox("Choose an LLM", available_models, index=available_models.index("mistralai/mixtral-8x7b-instruct-v0.1"))
    st.write("You selected: ", llm)
    llm = ChatNVIDIA(model=llm)

def app():
    # Get the current working directory
    cwd = os.getcwd()

    # Get a list of visible directories in the current working directory
    directories = [d for d in os.listdir(cwd) if os.path.isdir(os.path.join(cwd, d)) and not d.startswith('.') and '__' not in d]

    # Create a dropdown menu for directory selection
    selected_dir = st.selectbox("Select a directory:", directories, index=0)

    # Construct the full path of the selected directory
    directory = os.path.join(cwd, selected_dir)

    if st.button("Process Documents"):
        # Check if the selected directory has PDF files
        res = has_pdf_files(directory)
        if not res:
            st.error("No PDF files found in directory! Only PDF files and text extraction are supported for now.")
            st.stop()
        # Extract triples from the documents and index the chunks for hybrid search
        documents, results = process_documents(directory, llm)
        st.write(documents)
        search_handler = SearchHandler("hybrid_demo3", use_bge_m3=True, use_reranker=True)
        search_handler.insert_data(documents)
        st.write(f"Processing complete. Total triples extracted: {len(results)}")

with st.spinner("Saving triples to CSV files with Pandas..."):
# write the resulting entities to a CSV, relations to a CSV and all triples with IDs to a CSV
save_triples_to_csvs(results)

with st.spinner("Loading the CSVs into dataframes..."):
# Load the triples from the CSV file
triples_df = pd.read_csv("triples.csv")
# Load the entities and relations DataFrames
entities_df = pd.read_csv("entities.csv")
relations_df = pd.read_csv("relations.csv")

        # Create the knowledge graph from these triples:
        # map IDs to entity names and relation names
        entity_name_map = entities_df.set_index("entity_id")["entity_name"].to_dict()
        relation_name_map = relations_df.set_index("relation_id")["relation_name"].to_dict()

        # Create the graph from the triples DataFrame
        G = nx.from_pandas_edgelist(
            triples_df,
            source="entity_id_1",
            target="entity_id_2",
            edge_attr="relation_id",
            create_using=nx.DiGraph,
        )

with st.spinner("Relabeling node integers to strings for future retrieval..."):
# Relabel the nodes with the actual entity names
G = nx.relabel_nodes(G, entity_name_map)

# Relabel the edges with the actual relation names
edge_attributes = nx.get_edge_attributes(G, "relation_id")

# Update the edges with the new relation names
new_edge_attributes = {
(u, v): relation_name_map[edge_attributes[(u, v)]]
for u, v in G.edges()
if edge_attributes[(u, v)] in relation_name_map
}

nx.set_edge_attributes(G, new_edge_attributes, "relation")

with st.spinner("Saving the graph to a GraphML file for further visualization and retrieval..."):
try:
nx.write_graphml(G, "knowledge_graph.graphml")

# Verify by reading it back
G_loaded = nx.read_graphml("knowledge_graph.graphml")
if nx.is_directed(G_loaded):
st.success("GraphML file is valid and successfully loaded.")
else:
st.error("GraphML file is invalid.")
except Exception as e:
st.error(f"Error saving or loading GraphML file: {e}")
return

st.success("Done!")

if __name__ == "__main__":
    app()