## Implementing Nano Graph RAG

Author: Leoson Hoay

This notebook shows how to use the [nano-graphrag](https://github.com/gusye1234/nano-graphrag) implementation to build a simple Graph RAG knowledge base from a Python code repository, using an OpenAI chat model (ChatGPT) as the base model. The example repository is [lshhdc](https://github.com/LeosonH/lshhdc), an implementation of Locality Sensitive Hashing (using cryptographic hashing) for matching similar pairs of data.

### Preparation:
- Clone [lshhdc](https://github.com/LeosonH/lshhdc) locally.
- Obtain an OpenAI API key and set it as the `OPENAI_API_KEY` environment variable.

In [None]:
# %env OPENAI_API_KEY=sk-proj...
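
Before initializing GraphRAG, it can help to fail fast when the key is missing rather than hitting an authentication error partway through the insert. A minimal sketch; `require_env` is a hypothetical helper, not part of nano-graphrag or the OpenAI client:

```python
import os

def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it or use %env before running the notebook.")
    return value

# Usage in the notebook (raises RuntimeError if the key is missing):
# require_env("OPENAI_API_KEY")
```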

In [2]:
import os
import ast
import json
from nano_graphrag import GraphRAG, QueryParam

In [3]:
# Initialize GraphRAG instance
graph_rag = GraphRAG(working_dir="./lshhdc-db")

INFO:nano-graphrag:Load KV full_docs with 0 data
INFO:nano-graphrag:Load KV text_chunks with 0 data
INFO:nano-graphrag:Load KV llm_response_cache with 0 data
INFO:nano-graphrag:Load KV community_reports with 0 data
INFO:nano-graphrag:Loaded graph from ./lshhdc-db\graph_chunk_entity_relation.graphml with 0 nodes, 0 edges
INFO:nano-vectordb:Load (0, 1536) data
INFO:nano-vectordb:Init {'embedding_dim': 1536, 'metric': 'cosine', 'storage_file': './lshhdc-db\\vdb_entities.json'} 0 data


In [4]:
CODEBASE_DIR = "./lshhdc"

In [5]:
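# Extract docstrings via AST.
# ast.parse uses the running interpreter's grammar, so Python 2 sources
# (e.g. files with print statements) fail here and yield no docstrings.
def extract_docstrings(file_content):
    docstrings = []
    try:
        parsed_ast = ast.parse(file_content)
        for node in ast.walk(parsed_ast):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef, ast.Module)):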
                doc = ast.get_docstring(node)
                if doc:
                    obj_type = type(node).__name__
                    obj_name = getattr(node, 'name', 'Module')
                    docstrings.append(f"{obj_type} '{obj_name}': {doc}")
    except Exception as e:
        print(f"AST parsing error: {e}")
    return docstrings

# Extract whole-line comments (trailing comments after code are not captured)
def extract_inline_comments(file_content):
    comments = []
    for line in file_content.split("\n"):
        stripped = line.strip()
        if stripped.startswith("#"):
            comments.append(stripped.lstrip("# ").strip())
    return comments

# Extract whole-line comments from the code cells of a notebook (.ipynb)
def extract_notebook_comments(filepath):
    comments = []
    try:
        with open(filepath, "r", encoding="utf-8") as f:
            notebook = json.load(f)
        for cell in notebook.get('cells', []):
            if cell.get('cell_type') == 'code':
                # 'source' may be a list of lines or a single string
                source = cell.get('source', [])
                if isinstance(source, str):
                    source = source.splitlines()
                for line in source:
                    if line.strip().startswith("#"):
                        comments.append(line.strip().lstrip("# ").strip())
    except Exception as e:
        print(f"Error parsing notebook {filepath}: {e}")
    return comments

# Comprehensive extraction (docstrings, comments, full code)
def extract_all(directory, ignore_dirs=None):
    if ignore_dirs is None:
        ignore_dirs = ["data"]

    extracted_texts = []
    for root, dirs, files in os.walk(directory):
        # Skip any ignored directories
        dirs[:] = [d for d in dirs if d not in ignore_dirs]

        for file in files:
            filepath = os.path.join(root, file)
            try:
                with open(filepath, "r", encoding="utf-8") as f:
                    content = f.read()

                if file.endswith(".py"):
                    docstrings = extract_docstrings(content)
                    comments = extract_inline_comments(content)
                    combined_content = f"""
Filename: {file}

Docstrings:
{docstrings if docstrings else 'None'}

Comments:
{comments if comments else 'None'}

Full Code:
{content}
                    """
                    extracted_texts.append(combined_content)

                elif file.endswith(".ipynb"):
                    comments = extract_notebook_comments(filepath)
                    combined_content = f"""
Notebook: {file}

Comments:
{comments if comments else 'None'}

Full Notebook JSON Content:
{content}
                    """
                    extracted_texts.append(combined_content)

                elif file.endswith(".md"):
                    combined_content = f"""
Markdown File: {file}

Full Content:
{content}
                    """
                    extracted_texts.append(combined_content)

            except Exception as e:
                print(f"Error processing {filepath}: {e}")

    return extracted_texts

# Run the comprehensive extraction
structured_content = extract_all(CODEBASE_DIR)

AST parsing error: Missing parentheses in call to 'print'. Did you mean print(...)? (<unknown>, line 302)
AST parsing error: Missing parentheses in call to 'print'. Did you mean print(...)? (<unknown>, line 24)
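
The two errors above come from Python 2 `print` statements in the repository, which Python 3's `ast.parse` rejects. Those files therefore contribute no docstrings, although their comments and full code are still included. One possible workaround, sketched below, is a regex fallback for sources that fail to parse. `extract_docstrings_fallback` is a hypothetical helper, not something this notebook uses; it only catches a triple-quoted string directly under a single-line `def` or `class` header, so treat it as an approximation.

```python
import re

# Matches a single-line def/class header followed directly by a triple-quoted string.
_DOCSTRING_RE = re.compile(
    r"^[ \t]*(def|class)[ \t]+(\w+)[^\n]*:[ \t]*\n"   # header line ending in ':'
    r"[ \t]*[rRuU]{0,2}(\"{3}|'{3})(.*?)\3",           # optional prefix, opening quotes, body, closing quotes
    re.DOTALL | re.MULTILINE,
)

_KIND = {"def": "FunctionDef", "class": "ClassDef"}

def extract_docstrings_fallback(file_content: str) -> list[str]:
    """Approximate docstring extraction for sources that ast.parse cannot handle."""
    results = []
    for kind, name, _quote, doc in _DOCSTRING_RE.findall(file_content):
        # Collapse internal whitespace so multi-line docstrings stay on one line.
        results.append(f"{_KIND[kind]} '{name}': {' '.join(doc.split())}")
    return results

# Made-up Python 2 style snippet, for illustration only.
py2_source = 'def greet(name):\n    """Print a greeting."""\n    print "hello", name\n'
print(extract_docstrings_fallback(py2_source))
```

The output format mirrors `extract_docstrings`, so the fallback could be called from its `except` branch if approximate docstrings are preferable to none.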


In [6]:
structured_content

['\nFilename: lsh.py\n\nDocstrings:\nNone\n\nComments:\n[\'A label for this set\', \'Add to unionfind structure\', \'Get signature\', \'Union labels with same LSH key in same band\', \'Structure to be stored in the ConstrainedCluster.hashmaps band/hash cell\', \'cluster lists.\', \'Note that self.hashmaps, although having the same structure as in the\', \'parent class, is used quite differently here: each band/hash cell now\', \'corresponds to a list of lists (instead of a single list). Each list\', \'contains at least one LabelSetObj instance, and will possibly grow\', \'when hash collisions occur. However, to be fused within a certain\', \'list, an item must be similar enough to its first item (i.e. the\', \'constraint must be satisfied). If no list is found with an item to\', \'satisfy the constraint, a new list with the element is simply appended\', \'to the band/hash cell.\', \'A label for this set\', \'if obj is not defined, s is used\', \'Add to unionfind structure\', \'Get sign
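
Displaying the whole list dumps every file's full source into the cell output. A lighter way to inspect it is one summary line per document. A minimal sketch; `summarize_docs` is a hypothetical helper that assumes each document starts with the header line produced by `extract_all` (`Filename:`, `Notebook:`, or `Markdown File:`), and the sample data below is made up:

```python
def summarize_docs(docs, preview_chars=60):
    """Return one (header, length) pair per extracted document."""
    summary = []
    for doc in docs:
        # The first non-empty line is the "Filename: ..." style header.
        lines = [line.strip() for line in doc.splitlines() if line.strip()]
        header = lines[0] if lines else ""
        summary.append((header[:preview_chars], len(doc)))
    return summary

# Illustrative sample; in the notebook, pass structured_content instead.
sample = [
    "\nFilename: lsh.py\n\nDocstrings:\nNone\n",
    "\nMarkdown File: README.md\n\nFull Content:\n# Title\n",
]
for header, size in summarize_docs(sample):
    print(f"{size:>8} chars  {header}")
```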

In [7]:
# Insert into GraphRAG
await graph_rag.ainsert(structured_content)

INFO:nano-graphrag:[New Docs] inserting 7 docs
INFO:nano-graphrag:[New Chunks] inserting 9 chunks
INFO:nano-graphrag:[Entity Extraction]...
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:openai._base_client:Retrying request to /chat/completions in 1.870000 seconds
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:openai._base_client:Retrying request to /chat/completions in 5.324000 seconds
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: PO

⠙ Processed 1 chunks, 0 entities(duplicated), 0 relations(duplicated)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:openai._base_client:Retrying request to /chat/completions in 2.118000 seconds
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:openai._base_client:Retrying request to /chat/completions in 4.432000 seconds
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


⠹ Processed 2 chunks, 0 entities(duplicated), 0 relations(duplicated)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:openai._base_client:Retrying request to /chat/completions in 3.618000 seconds
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:openai._base_client:Retrying request to /chat/completions in 4.614000 seconds
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:openai._base_client:Retrying request to /chat/completions in 4.584000 seconds
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


⠸ Processed 3 chunks, 3 entities(duplicated), 2 relations(duplicated)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:openai._base_client:Retrying request to /chat/completions in 3.050000 seconds
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


⠼ Processed 4 chunks, 8 entities(duplicated), 4 relations(duplicated)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:openai._base_client:Retrying request to /chat/completions in 4.696000 seconds
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:openai._base_client:Retrying request to /chat/completions in 4.222000 seconds
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:openai._base_client:Retrying request to /chat/completions in 4.310000 seconds
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:openai._base_client:Retrying request to /chat/completions in 5.578000 seconds
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:openai._base_client:Retryin

⠴ Processed 5 chunks, 17 entities(duplicated), 5 relations(duplicated)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:openai._base_client:Retrying request to /chat/completions in 1.009000 seconds
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:openai._base_client:Retrying request to /chat/completions in 1.132000 seconds
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:openai._base_client:Retrying request to /chat/completions in 7.118000 seconds
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:openai._base_client:Retrying request to /chat/completions in 7.064000 seconds
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:openai._base_client:Retrying request to /chat/completions in 6.856000 seconds
INFO:httpx:HTTP Request: POST https://api.openai.com/v1

⠦ Processed 6 chunks, 17 entities(duplicated), 5 relations(duplicated)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


⠧ Processed 7 chunks, 27 entities(duplicated), 9 relations(duplicated)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:openai._base_client:Retrying request to /chat/completions in 1.838000 seconds
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:openai._base_client:Retrying request to /chat/completions in 1.812000 seconds
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
INFO:openai._base_client:Retrying request to /chat/completions in 7.530000 seconds
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


⠇ Processed 8 chunks, 37 entities(duplicated), 15 relations(duplicated)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


⠏ Processed 9 chunks, 48 entities(duplicated), 20 relations(duplicated)

INFO:nano-graphrag:Inserting 40 vectors to entities





INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:nano-graphrag:[Community Report]...
INFO:nano-graphrag:Each level has communities: {0: 2}
INFO:nano-graphrag:Generating by levels: [0]
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


⠙ Processed 1 communities

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


⠹ Processed 2 communities

INFO:nano-graphrag:Writing graph with 42 nodes, 20 edges
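
The graph is persisted as GraphML in the working directory (the load log earlier shows `graph_chunk_entity_relation.graphml`). As a quick sanity check, the node and edge counts can be read back with the standard library and compared against the log line above. A minimal sketch; `count_graphml` is a hypothetical helper and assumes the file uses the standard GraphML namespace:

```python
import xml.etree.ElementTree as ET

# Standard GraphML namespace (assumed to be what the storage backend writes).
GRAPHML_NS = "{http://graphml.graphdrawing.org/xmlns}"

def count_graphml(path):
    """Return (node_count, edge_count) for a GraphML file."""
    root = ET.parse(path).getroot()
    nodes = sum(1 for _ in root.iter(f"{GRAPHML_NS}node"))
    edges = sum(1 for _ in root.iter(f"{GRAPHML_NS}edge"))
    return nodes, edges

# Usage in the notebook, after the insert has finished:
# import os
# print(count_graphml(os.path.join("./lshhdc-db", "graph_chunk_entity_relation.graphml")))
```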





In [10]:
# Example queries

# Global query (about the whole codebase); the mode is made explicit here,
# although the logs below show that an unparameterized query also runs a global search:
global_response = await graph_rag.aquery(
    "What are the key principles and methodologies implemented in this repository for high-dimensional data clustering?",
    param=QueryParam(mode="global")
)
print("\n🟢 Global Query Response:\n", global_response)

# Local query (about a specific script or function):
local_response = await graph_rag.aquery(
    "How does the lsh.py module implement the LSH algorithm, and what are the specific functions responsible for hashing and clustering operations?",
    param=QueryParam(mode="local")
)
print("\n🔵 Local Query Response:\n", local_response)

INFO:nano-graphrag:Revtrieved 2 communities
INFO:nano-graphrag:Grouping to 1 groups for global search
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"



🟢 Global Query Response:
 ### Key Principles and Methodologies

The repository implements several key principles and methodologies to facilitate effective high-dimensional data clustering. The cornerstone of these methods is the use of **Jaccard Similarity and Distance**, which are essential for evaluating similarity and dissimilarity between sets. These metrics provide a statistical basis for identifying likeness and contrast among data sets, which is crucial for clustering operations in high-dimensional spaces.

### MinHashSignature and Its Integration

A significant methodological approach employed is the **MinHashSignature**. This technique is designed to generate minhash signatures for datasets, enabling comparisons based on Jaccard Similarity. By condensing the high-dimensional data into a representative hash signature, this method supports efficient clustering by focusing on the core attributes of the data.

Moreover, MinHashSignature is effectively integrated with **Locality-S

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:nano-graphrag:Using 20 entites, 1 communities, 14 relations, 4 text units
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"



🔵 Local Query Response:
 The **lsh.py** module implements the Locality-Sensitive Hashing (LSH) algorithm, which is designed to group similar data points into clusters. This technique is particularly useful for high-dimensional data where traditional hashing techniques may not efficiently handle similarity-based data transformations. LSH utilizes a probabilistic approach, mapping similar elements to similar hash keys while maintaining a high probability. The cornerstone of this method is the concept of creating 'bands' of hash values, which enables the system to efficiently identify clusters of similar data based on predefined similarity thresholds. This module takes a significant cue from the concept of LSH as discussed in the context of "Mining of Massive Datasets," by Rajamaran, describing how similar objects are likely to share similar keys through precise function family definitions.

### Core Functions in lsh.py:

1. **Shingle and Hshingle Functions**: 
   - The `shingle(s, k)` f