<a href="https://colab.research.google.com/github/Ayesha-Imr/Graph-RAG-Automation-ApertureDB-Gemini/blob/main/Notebooks/LOCAL_GraphRAG_with_ApertureDB_Part_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GraphRAG with ApertureDB: Part 1 - Building the Semantic Layer

This notebook covers the first major phase of building our GraphRAG system: augmenting the existing knowledge graph with a semantic layer. This involves creating vector embeddings for our entities and ingesting them into ApertureDB, linking them directly to the graph structure.

**Prerequisites:** This notebook is a successor to the [**Automating Knowledge Graph Creation with Gemini & ApertureDB**](https://colab.research.google.com/drive/1okdpqXhikbgrf5Ep3lT1yT7GjAIv7AFh?usp=sharing) notebook and assumes you have already created and ingested your plaintext graph in an ApertureDB instance. If not, please run that notebook first with your preferred data source so you have a knowledge graph to work with. *(Worry not, everything is automated: add the required API keys, upload your chosen PDF (your knowledge base), and run all the cells. Easy peasy!)*

## 1. Setup and Initialization

First, we'll install the required Python packages and configure our environment. This includes the `google-genai` client for [Gemini embeddings](https://ai.google.dev/gemini-api/docs/embeddings), [ApertureDB](https://www.aperturedata.io/), and some utility packages. We then start a local ApertureDB instance backed by the data stored in our Google Drive folder, connect to it, and import all necessary modules.

In [1]:
!pip install -q aperturedb google-genai udocker


### Local ApertureDB session setup via udocker

The data we mount here comes from the previous knowledge graph creation [notebook](https://colab.research.google.com/drive/1okdpqXhikbgrf5Ep3lT1yT7GjAIv7AFh?usp=sharing). We'll use it to continue with our graph RAG implementation.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
DATA_ROOT = "/content/drive/MyDrive/aperturedb_data"

In [5]:
!udocker --allow-root install

Info: creating repo: /root/.udocker
Info: udocker command line interface 1.3.17
Info: searching for udockertools >= 1.2.11
Info: installing udockertools 1.2.11
Info: installation of udockertools successful


In [6]:
!udocker --allow-root pull aperturedata/aperturedb-community:latest
!udocker --allow-root create --name=aperturedb aperturedata/aperturedb-community:latest

Info: downloading layer sha256:7c84311f635af4c4a0fec53e53c35b7b388a8b7e40025346a1d4af37423ea2c2
Info: downloading layer sha256:a6e6f81788d18a6a24869a662a78e4798c46eba38d0cca171080f1e131146a16
Info: downloading layer sha256:6c6f170a129049940a474f88f5fc649d7d96700ef0748bbc590476942c609949
Info: downloading layer sha256:9b3453b1064fb34bae3e701ecd6cfce41e99f8fc037857b7c9de26f917ad8840
Info: downloading layer sha256:5090e15fdc6a26ebc57cdfff090b7090d1020672ce22b4a290451acc9c16da6d
Info: downloading layer sha256:c2bf41e51776f1206dd05060633eada7174bc768aa6f062cc6fa3b5def95ca94
Info: downloading layer sha256:e17caecd6bd74e6c419f0d21e9522006fded22eedf72be6b9046819a6265903c
Info: downloading layer sha256:aece8493d3972efa43bfd4ee3cdba659c0f787f8f59c82fb3e48c87cbb22a12e
85ac9941-8f18-376f-a6eb-5118eb3f01bd


In [7]:
!nohup udocker --allow-root run \
    --publish=55555:55555 \
    --env="ADB_MASTER_KEY=admin" \
    --env="ADB_FORCE_SSL=false" \
    --volume=$DATA_ROOT/db:/aperturedb/db \
    aperturedb  > /content/adb_server.log 2>&1 &

In [8]:
import time
time.sleep(15)
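
A fixed 15-second sleep works here, but server startup time can vary between sessions. As an alternative, here is a minimal sketch that polls the port until it accepts TCP connections. The helper name `wait_for_port` and its parameters are hypothetical; they are not part of udocker or the ApertureDB SDK. Note that an open port only shows that something is listening, so the `GetStatus` query remains the real readiness check.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means a process is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and try again.
            time.sleep(interval_s)
    return False


# Example usage (assumes the server was launched as in the cell above):
# ready = wait_for_port("127.0.0.1", 55555)
```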

In [9]:
!tail -n 15 /content/adb_server.log | sed -e 's/^/LOG │ /'

LOG │ ApertureDB Server, Version: 0.18.10, Branch: feature/community-docker, Commit: 728440c99395bca1a3346dc8e4d51917a8d16764


In [10]:
!lsof -nP -iTCP:55555

COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
apertured 1889 root   16u  IPv4  59837      0t0  TCP *:55555 (LISTEN)


Connect to ApertureDB client

In [11]:
from aperturedb import Connector
client = Connector.Connector(
    host="127.0.0.1", port=55555,
    user="admin",  password="admin",
    use_ssl=False
)
status = client.query([{"GetStatus": {}}], [])
print(status)

([{'GetStatus': {'info': 'OK', 'status': 0, 'system': 'ApertureDB', 'version': '0.18.10'}}], [])


Verify the existence of our data in the DB

In [13]:
schema = client.query([{"GetSchema": {}}], [])
print(schema)

([{'GetSchema': {'connections': {'classes': {'abstracts': {'dst': 'Software', 'matched': 2, 'properties': {'created_at': [2, False, 'String'], 'dst_class': [2, False, 'String'], 'dst_id': [2, False, 'Number'], 'src_class': [2, False, 'String'], 'src_id': [2, False, 'Number'], 'type': [2, False, 'String']}, 'src': 'Software'}, 'accesses': {'dst': 'Software', 'matched': 1, 'properties': {'created_at': [1, False, 'String'], 'dst_class': [1, False, 'String'], 'dst_id': [1, False, 'Number'], 'src_class': [1, False, 'String'], 'src_id': [1, False, 'Number'], 'type': [1, False, 'String']}, 'src': 'User'}, 'acts_as': {'dst': 'Organization', 'matched': 1, 'properties': {'created_at': [1, False, 'String'], 'dst_class': [1, False, 'String'], 'dst_id': [1, False, 'Number'], 'src_class': [1, False, 'String'], 'src_id': [1, False, 'Number'], 'type': [1, False, 'String']}, 'src': 'Service'}, 'acts_as_bridge_between': {'dst': 'Software', 'matched': 1, 'properties': {'created_at': [1, False, 'String'],

Import required libraries

In [14]:
# Core imports
import os
import json
import time
from typing import Dict, List, Any
from tqdm import tqdm
import numpy as np

# Data models and utilities
from google.colab import userdata, files
from aperturedb.CommonLibrary import create_connector
from aperturedb.ParallelLoader import ParallelLoader
from IPython.display import HTML, display
from pprint import pprint
from google import genai
from google.genai import types, errors


Set up the API keys. You can get a Google API key [here](https://aistudio.google.com/apikey).

In [15]:
google_api_key = userdata.get("GOOGLE_API_KEY")

## 2. Preparing Entity Data for Embedding

Our knowledge graph contains structured entities with various properties. To make these entities searchable via semantic search, we need to represent them as vector embeddings. The first step is to convert each structured entity into a single, cohesive text "document".

The `fetch_entities` function retrieves all entity nodes from our ApertureDB graph. It does so using the [GetSchema](https://docs.aperturedata.io/query_language/Reference/db_commands/GetSchema) method provided by ApertureDB and then extracting all the classes. Then one by one, all the entities with all their properties are extracted from each class.

In [16]:
def fetch_entities(client):
    """
    Fetches all entities by first getting the schema to identify all classes,
    and then querying for entities within each class.
    """
    # Get schema and filter for valid, non-internal class names.
    schema_query = [{"GetSchema": {}}]
    schema_response, _ = client.query(schema_query)

    if not schema_response or "entities" not in schema_response[0]["GetSchema"]:
        print("Error: Could not retrieve schema from ApertureDB.")
        return []

    all_class_names = schema_response[0]["GetSchema"]["entities"]["classes"].keys()
    # Filter out internal classes (like _Blob) which cause query errors.
    valid_class_names = [name for name in all_class_names if not name.startswith('_')]
    print(f"Found valid classes in schema: {valid_class_names}")

    # Iterate and fetch entities for each valid class.
    all_entities = []
    for class_name in valid_class_names:
        entity_query = [{
            "FindEntity": {
                "with_class": class_name,
                "results": {"all_properties": True}
            }
        }]
        entity_response, _ = client.query(entity_query)

        # Ensure response is not empty and has the expected structure.
        if entity_response and entity_response[0].get("FindEntity", {}).get("entities"):
            entities_for_class = entity_response[0]["FindEntity"]["entities"]
            for entity in entities_for_class:
                entity['class'] = class_name
            all_entities.extend(entities_for_class)

    print(f"Successfully fetched a total of {len(all_entities)} entities.")
    return all_entities

The `create_entity_documents` function synthesizes a document for each entity by concatenating its key properties (like name, class, definition, etc.) into a single string. This process creates a text representation that is rich in context and well suited to generating a high-quality embedding.

In [17]:
def create_entity_documents(entities):
    """
    Creates a synthesized document for each entity
    """
    entity_documents = []
    for entity in entities:
        doc_parts = [f"Entity: {entity.get('name', '')}.", f"Class: {entity.get('class', 'N/A')}."]
        other_properties = {
            k: v for k, v in entity.items()
            if k not in ["_uniqueid", "name", "class"] and v is not None
        }
        for key, value in other_properties.items():
            doc_parts.append(f"{key}: {value}.")

        entity_documents.append({
            "entity_id": entity.get("id"),
            "class": entity.get("class"),
            "document": " ".join(doc_parts)
        })

    print(f"Created {len(entity_documents)} documents for embedding.")
    return entity_documents

Calling the above functions to create documents for embedding.

In [18]:
all_entities = fetch_entities(client)
if all_entities:
    documents_to_embed = create_entity_documents(all_entities)

Found valid classes in schema: ['Concept', 'Hardware', 'Network', 'Organization', 'Service', 'Software', 'System', 'User']
Successfully fetched a total of 380 entities.
Created 380 documents for embedding.


Let's check out the first two samples to see what our documents look like. Note that the `entity_id` and `class` fields will serve as our metadata. These fields are necessary to connect embeddings to their source entities, and later to find neighbouring nodes (the core of graph RAG).
The `document` field contains the text that will be converted into vector embeddings. Note how it contains the class name and entity name along with all other entity properties. This aids vector search.

In [19]:
print("\n--- Sample Documents for Embedding ---")
for doc in documents_to_embed[:2]:
    print(json.dumps(doc, indent=2))


--- Sample Documents for Embedding ---
{
  "entity_id": 1,
  "class": "Concept",
  "document": "Entity: Vector processing. Class: Concept. id: 1."
}
{
  "entity_id": 2,
  "class": "Concept",
  "document": "Entity: Image processing. Class: Concept. id: 2."
}
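
The two samples above carry only an `id` property, so their documents are short. To illustrate how an entity with more properties would be rendered, here is a small self-contained sketch that mirrors the concatenation logic of `create_entity_documents` for a single entity. The function name `entity_to_document` and the sample entity (including its `definition` and `notes` fields) are made up for illustration and do not come from the graph.

```python
def entity_to_document(entity: dict) -> str:
    """Join an entity's name, class, and remaining non-null properties into one string."""
    parts = [f"Entity: {entity.get('name', '')}.", f"Class: {entity.get('class', 'N/A')}."]
    for key, value in entity.items():
        # Skip fields already used above, internal IDs, and empty values.
        if key in ("_uniqueid", "name", "class") or value is None:
            continue
        parts.append(f"{key}: {value}.")
    return " ".join(parts)


# Hypothetical entity, for illustration only.
sample_entity = {
    "_uniqueid": "0.0.0",
    "name": "Example service",
    "class": "Service",
    "id": 42,
    "definition": "Handles user authentication",
    "notes": None,
}

sample_document = entity_to_document(sample_entity)
print(sample_document)
# Entity: Example service. Class: Service. id: 42. definition: Handles user authentication.
```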


## 3. Generating Vector Embeddings with Gemini

With our entity documents prepared, we can now generate the vector embeddings. We'll use [Google AI's `gemini-embedding-001`](https://ai.google.dev/gemini-api/docs/embeddings) model, which is designed for retrieval tasks and lets us specify the embedding dimension. We use 768 dimensions for a balance of speed, size, and accuracy. You can set a higher or lower embedding dimension if you want.

In [20]:
GEMINI_MODEL_NAME = "gemini-embedding-001"
EMBEDDING_DIMENSIONS = 768
gemini_client = genai.Client(api_key=google_api_key)

The function `generate_embeddings_with_gemini` sends the documents to the Gemini API in batches. Batching is crucial for efficiency and to avoid hitting API rate limits when processing a large number of entities. Since Google's free tier allows 100 RPM, we add a 60-second delay between batches to avoid rate limiting. The function then stores the resulting embedding vector on each document in a new `embedding` field.

In [21]:
def generate_embeddings_with_gemini(docs_to_embed, batch_size: int = 96, model_name: str = GEMINI_MODEL_NAME, output_dim: int = EMBEDDING_DIMENSIONS):
    """
    Generates embeddings for a list of documents with Google Gemini.

    """
    # Extract raw text
    texts_to_embed = [item["document"] for item in docs_to_embed]

    # Batch-wise embedding
    for i in tqdm(range(0, len(texts_to_embed), batch_size), desc="Embedding Batches"):
        batch_texts = texts_to_embed[i : i + batch_size]

        # Build the config
        embed_cfg = types.EmbedContentConfig(
            output_dimensionality=output_dim,
            task_type="RETRIEVAL_DOCUMENT", # optimal for corpus items
        )

        # API call
        result = gemini_client.models.embed_content(
            model=model_name,
            contents=batch_texts,
            config=embed_cfg,
        )

        batch_embeddings = [emb.values for emb in result.embeddings]

        # Safety check
        if len(batch_embeddings) == len(batch_texts):
            for j, embedding in enumerate(batch_embeddings):
                docs_to_embed[i + j]["embedding"] = embedding
        else:
            print(
                f"Warning: API returned {len(batch_embeddings)} embeddings "
                f"for a batch of {len(batch_texts)} texts.  Skipping those rows."
            )

        # Add 1 minute delay to avoid rate limiting.
        time.sleep(60)


    # Keep only the documents that now contain an embedding
    docs_with_embeddings = [d for d in docs_to_embed if "embedding" in d]
    print(f"\nSuccessfully generated embeddings for {len(docs_with_embeddings)} documents.")
    return docs_with_embeddings

Call the function on our curated documents.

In [22]:
docs_with_embeddings = generate_embeddings_with_gemini(documents_to_embed)

Embedding Batches: 100%|██████████| 4/4 [04:03<00:00, 60.90s/it]


Successfully generated embeddings for 380 documents.
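
The four progress-bar steps above follow directly from the batch arithmetic: 380 documents at a batch size of 96 give three full batches and one partial batch, and the 60-second pause after each batch accounts for roughly four minutes of runtime. A small sketch of the slicing, where `batch_ranges` is a hypothetical helper written for this illustration:

```python
import math


def batch_ranges(n_items: int, batch_size: int) -> list:
    """Return (start, end) index pairs covering n_items in chunks of batch_size."""
    return [
        (start, min(start + batch_size, n_items))
        for start in range(0, n_items, batch_size)
    ]


ranges = batch_ranges(380, 96)
print(ranges)
# [(0, 96), (96, 192), (192, 288), (288, 380)]

# The number of batches is the ceiling of n_items / batch_size.
print(len(ranges) == math.ceil(380 / 96))
# True
```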





Let's print a sample of the resulting `docs_with_embeddings` data structure.

In [23]:
first_doc = docs_with_embeddings[0]
print(f"Document for Entity ID: {first_doc['entity_id']}")
print(f"Embedding generated: {'embedding' in first_doc}")
print(f"Embedding dimensions: {len(first_doc['embedding'])}")
print(f"Sample of embedding vector: {np.array(first_doc['embedding'][:5])}")

Document for Entity ID: 1
Embedding generated: True
Embedding dimensions: 768
Sample of embedding vector: [-0.01285386  0.00483452  0.0109325  -0.06323054 -0.00350184]


## 4. Creating the Vector Index in ApertureDB

Before we can store our embeddings, we need a place for them to live. In ApertureDB, vector indexes are called [**`DescriptorSets`**](https://docs.aperturedata.io/HowToGuides/start/Embeddings).

In [24]:
DESCRIPTOR_SET_NAME = "entity_embeddings_gemini"

The `create_vector_index` function defines and creates a new `DescriptorSet` named `entity_embeddings_gemini`. We configure it with the correct dimensionality (768 for our Gemini embedding model), a distance metric (Cosine Similarity - `CS`), and a search engine (`HNSW`, a popular choice for fast and accurate nearest-neighbor search). The function is idempotent, meaning it will safely skip creation if the set already exists.

In [25]:
def create_vector_index(client, set_name, dimensions):
    """
    Creates a new DescriptorSet to serve as a vector index if it doesn't exist.
    """
    print(f"Checking for DescriptorSet: '{set_name}'...")

    # Check if the DescriptorSet already exists.
    find_query = [{
        "FindDescriptorSet": {
            "_ref": 1,
            "with_name": set_name,  # use the argument, not the global constant
            "metrics": True,
            "engines": True,
        }
    }]
    response, _ = client.query(find_query)

    if response and response[0].get("FindDescriptorSet", {}).get("returned") == 1:
        print(f"DescriptorSet '{set_name}' already exists. Skipping creation.")
        return

    # If not found, create it.
    print(f"Creating DescriptorSet '{set_name}'...")
    create_query = [{
        "AddDescriptorSet": {
            "name": set_name,
            "dimensions": dimensions,
            "metric": "CS",
            "engine": "HNSW"
        }
    }]
    response, _ = client.query(create_query)

    if response[0]["AddDescriptorSet"]["status"] == 0:
        print(f"Successfully created DescriptorSet '{set_name}'.")
    else:
        print(f"Error creating DescriptorSet: {response[0]['AddDescriptorSet']['info']}")

In [26]:
create_vector_index(client, DESCRIPTOR_SET_NAME, EMBEDDING_DIMENSIONS)

Checking for DescriptorSet: 'entity_embeddings_gemini'...
Creating DescriptorSet 'entity_embeddings_gemini'...
Successfully created DescriptorSet 'entity_embeddings_gemini'.
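
For intuition about the cosine similarity (`CS`) metric chosen for this set, here is a plain-Python sketch of the formula. It is only an illustration of the metric and makes no claim about how ApertureDB computes it internally. The function name `cosine_similarity` is our own.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length.")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("Cosine similarity is undefined for zero vectors.")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])   # 1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0.0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # -1.0
print(same_direction, orthogonal, opposite)
```

Because the score depends only on direction, scaling a vector does not change its similarity to others, which is why the first pair scores 1.0 despite the different lengths.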


## 5. Ingesting Embeddings and Linking to the Graph

Now we'll ingest all the generated embeddings into our `DescriptorSet` and, most importantly, connect each embedding back to its original entity in the knowledge graph. This link is what enables our hybrid GraphRAG approach.

We perform this in two efficient, parallelized stages using ApertureDB's [`ParallelLoader`](https://docs.aperturedata.io/python_sdk/parallel_exec/ParallelLoader).

The `ingest_embeddings` function first adds all the vector embeddings as [`Descriptor`](https://docs.aperturedata.io/HowToGuides/start/Embeddings) objects (embeddings in ApertureDB) into our `entity_embeddings_gemini` set. We also store the original entity's ID and class as properties/metadata on the descriptor itself for easy lookup later - as alluded to earlier.

In [27]:
def ingest_embeddings(client, set_name, docs_with_embeddings):

    """ Ingests the descriptors and their embeddings in parallel. """

    data_to_ingest = []

    for doc in docs_with_embeddings:
        query = [{
            "AddDescriptor": {
                "set": set_name,
                "properties": {
                    # We store the entity's ID and class on the descriptor itself, so we can find it easily in the next step
                    "source_entity_id": doc.get("entity_id"),
                    "source_entity_class": doc.get("class")
                }
            }
        }]

        # Convert the embedding to bytes.
        blobs = [np.array(doc["embedding"], dtype=np.float32).tobytes()]
        data_to_ingest.append((query, blobs))

    loader = ParallelLoader(client)
    loader.ingest(generator=data_to_ingest, batchsize=64, numthreads=8, stats=True)

Let's ingest our embeddings (stored as Descriptors) into the DescriptorSet.

In [28]:
ingest_embeddings(client, DESCRIPTOR_SET_NAME, docs_with_embeddings)

Progress: 100%|██████████| 380/380 [00:02<00:00, 189items/s]  

Total time (s): 2.0160255432128906
Total queries executed: 8
Avg Query time (s): 1.0918860733509064
Query time std: 0.020816973826248596
Avg Query Throughput (q/s): 7.326771716621199
Overall insertion throughput (element/s): 188.48967528179395
Total inserted elements: 380
Total successful commands: 380
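
Each descriptor blob sent above is the raw float32 encoding of one embedding, so a 768-dimensional vector occupies 768 × 4 = 3072 bytes. The sketch below illustrates that round trip. To stay dependency-free it uses the standard-library `array` module instead of NumPy, assuming the common case where the `"f"` typecode is a 4-byte float in native byte order (the same layout `np.float32` produces by default). The helper names are hypothetical.

```python
from array import array


def to_float32_bytes(values: list) -> bytes:
    """Pack a list of floats into raw 4-byte floats (native byte order)."""
    return array("f", values).tobytes()


def from_float32_bytes(blob: bytes) -> list:
    """Unpack raw 4-byte floats back into a list of Python floats."""
    unpacked = array("f")
    unpacked.frombytes(blob)
    return list(unpacked)


# Values chosen to be exactly representable in float32.
vector = [0.5, -0.25, 1.0, 0.0]
blob = to_float32_bytes(vector)

print(len(blob))                           # 16 bytes: 4 values x 4 bytes each
print(from_float32_bytes(blob) == vector)  # True
print(len(to_float32_bytes([0.0] * 768)))  # 3072
```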





After the descriptors are ingested, the `create_connections` function creates the `has_embedding` connections. For each entity, it finds the node in the graph and its corresponding descriptor (using the properties we just stored) and creates an edge between them. This edge lets us trace embeddings retrieved by vector search back to their source entities and then find neighbouring nodes.

In [29]:
def create_connections(client, set_name, docs_with_embeddings):

    """ Creates the 'has_embedding' connections in parallel """

    queries_to_run = []

    for doc in docs_with_embeddings:
        entity_id = doc.get("entity_id")
        entity_class = doc.get("class")

        # Find the entity, find its corresponding descriptor, and create the connection between them.
        query = [
            {
                "FindEntity": {
                    "with_class": entity_class,
                    "constraints": {"id": ["==", entity_id]},
                    "_ref": 1
                }
            },
            {
                "FindDescriptor": {
                    "set": set_name,
                    # Find the descriptor using the properties we stored in Step 1.
                    "constraints": {"source_entity_id": ["==", entity_id]},
                    "_ref": 2
                }
            },
            {
                "AddConnection": {
                    "class": "has_embedding",
                    "src": 1,
                    "dst": 2
                }
            }
        ]
        queries_to_run.append((query, []))

    loader = ParallelLoader(client)
    loader.ingest(generator=queries_to_run, batchsize=64, numthreads=8, stats=True)

Let's create the connections between the Descriptors (embeddings) and the source entities in our knowledge graph.

In [30]:
create_connections(client, DESCRIPTOR_SET_NAME, docs_with_embeddings)

Progress: 100%|██████████| 380/380 [00:02<00:00, 189items/s]

Total time (s): 2.015267848968506
Total queries executed: 8
Avg Query time (s): 1.085803210735321
Query time std: 0.03035769929489338
Avg Query Throughput (q/s): 7.367817594297118
Overall insertion throughput (element/s): 188.56054305361894
Total inserted elements: 380
Total successful commands: 1140





We can confirm the number of connections by running the following query:

In [31]:
# Confirm the connections
query = [{
    "FindConnection": {
        "with_class": "has_embedding",
        "results": {
            "count": True
        }
    }
}]

response, blobs = client.query(query)

client.print_last_response()

[
    {
        "FindConnection": {
            "count": 380,
            "returned": 0,
            "status": 0
        }
    }
]


As expected, the count matches the number of entities we processed.
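
Beyond the aggregate count, you may want to spot-check a single entity. The sketch below only builds the query; it does not run it. It assumes that `FindDescriptor` accepts an `is_connected_to` clause with `ref` and `connection_class` keys, as described in the ApertureDB query language reference, so verify this against the documentation for your server version. The function name `build_embedding_lookup_query` is hypothetical.

```python
def build_embedding_lookup_query(entity_class: str, entity_id: int, set_name: str) -> list:
    """Build a two-command query: find one entity, then the descriptor linked to it."""
    return [
        {
            "FindEntity": {
                "with_class": entity_class,
                "constraints": {"id": ["==", entity_id]},
                "_ref": 1,
                "results": {"list": ["name"]},
            }
        },
        {
            "FindDescriptor": {
                "set": set_name,
                # Assumption: follows the has_embedding edge from the entity found above.
                "is_connected_to": {"ref": 1, "connection_class": "has_embedding"},
                "results": {"list": ["source_entity_id", "source_entity_class"]},
            }
        },
    ]


lookup_query = build_embedding_lookup_query("Concept", 1, "entity_embeddings_gemini")
# To run it against the database: response, _ = client.query(lookup_query)
```

If the link exists, the `FindDescriptor` part of the response should return one descriptor whose `source_entity_id` matches the entity's `id`.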

And that's it! Our embeddings have now been ingested and linked to their source entities. Next, we can actually perform retrieval and see graph RAG in action, which is covered in [Part 2](https://colab.research.google.com/drive/1TXqOpG9er9t5LRLIdbzK-Oqnqei1gbTS?usp=sharing).