# Document Grounding Using the Vector APIs of Document Grounding Management

Purpose: Ground LLM responses on your enterprise data with the SAP Document Grounding service, using the Vector APIs for complete control over the chunking and metadata handling process. This tutorial walks through the steps to set up and implement the grounding service for your documents using the Vector APIs.


The process consists of four steps:  
* Step 1: Create Data Repository  
* Step 2: Prepare the data 
* Step 3: Ingest data using Vector APIs 
* Step 4: Retrieve most similar documents from Data Repository based on input query and generate augmented answer


**Step 1:**
* Generate Access token.
* Create a Data Repository (also called a Collection).

**Step 2:**
* Create chunks and metadata to add as Documents in the newly created Data Repository.

**Step 3:**
* Use the Vector API to add the documents.

**Step 4:**
* Use the Document Grounding Retrieval API to fetch the most similar documents from the Data Repository.
* [Optional] Use the Gen AI Hub SDK to access an LLM (GPT-4o in this tutorial) and generate an answer using the retrieved documents as context.

## Step 1: Create Data Repository using Vector API

### Step 1.1: Generate Access Token

Create an access token using your SAP AI Core credentials.

In [3]:
import os
from dotenv import load_dotenv
load_dotenv(override=True)

True

In [4]:
import requests

# Read your service key details from environment variables (for example, set in a .env file)
client_id = os.getenv("AICORE_CLIENT_ID")
client_secret = os.getenv("AICORE_CLIENT_SECRET")
auth_url = os.getenv("AICORE_AUTH_URL")

# Prepare the payload and headers
payload = {
    "grant_type": "client_credentials"
}
headers = {
    "Content-Type": "application/x-www-form-urlencoded"
}

# Make the POST request to obtain the token
response = requests.post(auth_url, data=payload, headers=headers, auth=(client_id, client_secret))

# Check if the request was successful
if response.status_code == 200:
    access_token = response.json().get("access_token")
    print("Access token obtained successfully.")
else:
    print(f"Failed to obtain access token: {response.status_code} - {response.text}")

Access token obtained successfully.
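
Access tokens obtained with the client-credentials flow expire after some time, and the upload step later in this tutorial can run for a while. The following is a minimal sketch of a cache that refetches the token shortly before it expires. `TokenCache`, `fetch`, `margin`, and `clock` are hypothetical names introduced here, and the sketch assumes that the token response contains an `expires_in` field (in seconds), as OAuth 2.0 token responses commonly do.

```python
import time
from typing import Callable, Optional


class TokenCache:
    """Hypothetical helper: caches a token and refetches it shortly before expiry."""

    def __init__(self, fetch: Callable[[], dict], margin: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch      # returns the parsed JSON of the token response
        self._margin = margin    # refresh this many seconds before expiry
        self._clock = clock      # injectable so the behaviour can be tested
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            data = self._fetch()
            self._token = data["access_token"]
            # Assumption: the response carries "expires_in" (seconds). If it is
            # missing, the fallback of 0 forces a refetch on every call.
            self._expires_at = now + float(data.get("expires_in", 0))
        return self._token


# Demo with a stand-in fetch function (no network call)
cache = TokenCache(lambda: {"access_token": "demo-token", "expires_in": 3600})
print(cache.get())
```

In this notebook, `fetch` could be a small function that wraps the `requests.post(...)` call above and returns `response.json()`.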


### Step 1.2: Create a Data Repository

Create a data repository for your knowledge base using the Vector API. Response code 202 indicates that the request to create the repository (also called a Collection) was accepted.

In [None]:
import requests

AI_API_URL = r"https://api.ai.prod.eu-central-1.aws.ml.hana.ondemand.com" # Update AI_API_URL to match your AWS region
url = f"{AI_API_URL}/v2/lm/document-grounding/vector/collections"

body={
  "title": "bp-dg-vector-data-repo", # Give a name to the Data Repository (Collection)
  "embeddingConfig": {
    "modelName": "text-embedding-ada-002" # Mention the name of embeddings model
  }
}

headers = {"Authorization": f"Bearer {access_token}",
           "AI-Resource-Group": "default", # Mention the name of your resource group
           "Content-Type": "application/json"}
response = requests.post(url, headers=headers, json=body)
response

<Response [202]>

### Step 1.3: Get the Collection ID

Go to SAP AI Launchpad and note the collection ID of the newly created data repository.

In [None]:
collection_id = "28e4a470-75f7-4ca0-8d30-1fb9687e73b1" # Update with your collection id obtained from AI Launchpad
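
As an alternative to copying the ID from AI Launchpad, the ID can be looked up by title in a collection listing. The sketch below assumes that a `GET` request on the collections endpoint from Step 1.2 returns JSON with a `resources` list whose entries carry `id` and `title`, similar to the document listing used in Step 3.1; verify this shape against your own response. `find_collection_id` and the sample IDs are hypothetical.

```python
from typing import Optional


def find_collection_id(listing: dict, title: str) -> Optional[str]:
    """Return the ID of the first collection whose title matches, else None."""
    for resource in listing.get("resources", []):
        if resource.get("title") == title:
            return resource.get("id")
    return None


# Hand-written listing in the assumed response shape (IDs are made up)
sample_listing = {
    "resources": [
        {"id": "11111111-aaaa-bbbb-cccc-000000000001", "title": "other-repo"},
        {"id": "11111111-aaaa-bbbb-cccc-000000000002", "title": "bp-dg-vector-data-repo"},
    ]
}
print(find_collection_id(sample_listing, "bp-dg-vector-data-repo"))
```

In the notebook, `listing` would be the parsed JSON of `requests.get(url, headers=headers)`, with `url` pointing at the collections endpoint.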

## Step 2: Data Preparation

The Vector APIs expect the data payload in a specific format, which the code below produces. While chunking and attaching metadata, it is therefore convenient to build the output structure to match the payload schema. Writing the data to a JSONL file also helps with debugging and manual adjustment if required.

Note: 
* Each API call can push all chunks of a single document only.
* When a Collection is created from an S3 bucket, the default metadata contains 'id' as the key and the document name as the value. To maintain consistency, I suggest creating similar metadata when creating a collection using the Vector API.
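
The payload shape described in these notes can be checked before anything is uploaded. The following is a minimal sketch, not an official schema validator: `validate_payload` is a hypothetical helper that checks for exactly one document per payload and for metadata entries made of a `key` and a list of values, which is the structure the data preparation code in this step builds.

```python
def validate_payload(payload: dict) -> list:
    """Return a list of problems found in a document payload (empty if none)."""
    problems = []
    documents = payload.get("documents", [])
    if len(documents) != 1:
        problems.append(f"expected exactly 1 document per call, found {len(documents)}")
    for d, doc in enumerate(documents):
        if not doc.get("chunks"):
            problems.append(f"document {d} has no chunks")
        entries = list(doc.get("metadata", []))
        for chunk in doc.get("chunks", []):
            if not chunk.get("content"):
                problems.append(f"document {d} has a chunk without content")
            entries.extend(chunk.get("metadata", []))
        for meta in entries:
            # Each metadata entry is a key with a list of values
            if not isinstance(meta.get("key"), str) or not isinstance(meta.get("value"), list):
                problems.append(f"document {d} has a malformed metadata entry: {meta}")
    return problems


example = {
    "documents": [
        {
            "metadata": [{"key": "id", "value": ["sample.pdf"]}],
            "chunks": [
                {"content": "First chunk of text.",
                 "metadata": [{"key": "index", "value": ["1"]}]}
            ],
        }
    ]
}
print(validate_payload(example))  # prints []
```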

In [None]:
import os
import json
import fitz  # PyMuPDF
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Path to the directory containing PDF files
pdf_directory = '../sample_files/'  # Update this to your folder path

# Output JSONL file
output_file = 'documents.jsonl'

# List all PDF files in the directory
pdf_files = [f for f in os.listdir(pdf_directory) if f.endswith('.pdf')]

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)

with open(output_file, 'w', encoding='utf-8') as f_out:
    for pdf_file in pdf_files:
        file_path = os.path.join(pdf_directory, pdf_file)
        
        # Open the PDF file with PyMuPDF
        doc = fitz.open(file_path)
        full_text = ""
        
        # Extract text from each page
        for page_num in range(len(doc)):
            page = doc.load_page(page_num)
            full_text += page.get_text()
        
        doc.close()
        
        # Create unique metadata
        doc_meta = os.path.basename(file_path)
        
        document = Document(
            page_content=full_text,
            metadata={
                'source': pdf_file,
                'author': f"Author of {pdf_file}",
                'category': f"Category_{os.path.splitext(pdf_file)[0]}"
            }
        )
        
        # Split into chunks
        chunks = text_splitter.split_documents([document])
        
        # Build the document entry
        doc_entry = {
            "documents": [
                {
                    "metadata": [
                        {
                            "key": "id",
                            "value": [doc_meta] # I suggest keeping this key-value pair as the default for consistency with S3 collections
                        },
                        {
                            "key": "url",
                            "value": [f"http://example.com/{doc_meta}"] # Add your own metadata in the same way
                        }
                    ],
                    "chunks": []
                }
            ]
        }
        
        for idx, chunk in enumerate(chunks, start=1):
            chunk_entry = {
                "content": chunk.page_content,
                "metadata": [
                    {
                        "key": "index",
                        "value": [str(idx)]
                    }
                ]
            }
            doc_entry["documents"][0]["chunks"].append(chunk_entry)
        
        # Write the document entry as one line in JSONL format
        f_out.write(json.dumps(doc_entry, ensure_ascii=False) + '\n')

print(f"Documents written to {output_file} in JSONL format (one PDF per line)")


Documents written to documents.jsonl in JSONL format (one PDF per line)
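
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to split at separators such as paragraph and line breaks; it only illustrates that neighbouring chunks share up to `chunk_overlap` characters. `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into fixed-size windows whose neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# prints ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the overlap of 1. With the settings used above, the shared region would be up to 100 characters.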


## Step 3: Data Ingestion using Vector API

Read the payload corresponding to each document from the documents.jsonl file and use the Vector API to add the records.
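
Before uploading, it can be useful to inspect what the JSONL file contains. The sketch below lists the `id` metadata value and the chunk count for each document line. `summarize_jsonl` is a hypothetical helper, and the sample line is made up to match the structure written in Step 2.

```python
import json
from typing import Iterable, List, Tuple


def summarize_jsonl(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (document id, chunk count) for each document found in the JSONL lines."""
    summary = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. after manual edits
        payload = json.loads(line)
        for doc in payload.get("documents", []):
            # Look up the 'id' metadata entry written during data preparation
            doc_id = next(
                (m["value"][0] for m in doc.get("metadata", [])
                 if m.get("key") == "id" and m.get("value")),
                "<no id>",
            )
            summary.append((doc_id, len(doc.get("chunks", []))))
    return summary


sample_lines = [
    json.dumps({"documents": [{"metadata": [{"key": "id", "value": ["a.pdf"]}],
                               "chunks": [{"content": "x"}, {"content": "y"}]}]}),
    "",
]
print(summarize_jsonl(sample_lines))  # prints [('a.pdf', 2)]
```

An open file handle can be passed directly, for example `summarize_jsonl(open('documents.jsonl', encoding='utf-8'))`.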

In [39]:
# Read and send each document (one PDF per line)
from tqdm import tqdm

jsonl_file = 'documents.jsonl'

url = f"{AI_API_URL}/v2/lm/document-grounding/vector/collections/{collection_id}/documents"

# Headers
headers = {
    "AI-Resource-Group": "default",
    "Content-Type": "application/json",
    "Authorization": f"Bearer {access_token}"
}

# First, count the number of lines to set tqdm's total
with open(jsonl_file, 'r', encoding='utf-8') as f_count:
    total_docs = sum(1 for _ in f_count)

with open(jsonl_file, 'r', encoding='utf-8') as f_in, tqdm(total=total_docs, desc="Uploading Documents") as pbar:
    for idx, line in enumerate(f_in, start=1):
        document_payload = json.loads(line.strip())
        
        response = requests.post(url, headers=headers, json=document_payload)
        
        pbar.update(1)
        
        if response.status_code not in [200, 201]:
            print(f"Stopping due to error at document {idx} in JSONL file. Error code: {response.status_code}")
            print(f"Content: {document_payload}")
            break


Uploading Documents:   0%|          | 0/6 [00:00<?, ?it/s]

Uploading Documents: 100%|██████████| 6/6 [01:21<00:00, 13.56s/it]


### Step 3.1 [OPTIONAL]: Update / Delete records from Data Repository

In [41]:
url= f"{AI_API_URL}/v2/lm/document-grounding/vector/collections/{collection_id}/documents"


headers = {"Authorization": f"Bearer {access_token}",
           "AI-Resource-Group": "default",
           "Content-Type": "application/json"}
response = requests.get(url, headers=headers)


def get_document_ids_with_source(response_text):
    result = {}
    data = json.loads(response_text)
    for resource in data.get("resources", []):
        doc_id = resource.get("id")
        file_name = None
        for meta in resource.get("metadata", []):
            if meta.get("key") == "id":
                file_name = meta.get("value", [None])[0]
                break
        if file_name and doc_id:
            result[file_name] = doc_id
    return result

# Example usage:
doc_map = get_document_ids_with_source(response.text)
print(doc_map)

{'AI Best Practices.pdf': 'e9885601-596b-4cd4-bfd3-d118228d34b6', 'deep seek technical paper.pdf': 'ff546cb0-22d3-49ca-ada9-9834c2cc6ac0', 'multimodal llm paper.pdf': 'c7e08d3a-db4a-43bd-b77f-8564ecb7b45b', 'NeurIPS 2025 CNN Paper.pdf': '9406ab48-5d83-4ceb-aabc-2e256cc8969a', 'Document AI.pdf': '13d62faf-47b7-437b-8504-bfa6192d4534', 'Paper ConTextTab.pdf': '6312faba-b4d4-40ce-bc09-10e9ff3dab29'}


In [46]:
doc_map

{'AI Best Practices.pdf': 'e9885601-596b-4cd4-bfd3-d118228d34b6',
 'deep seek technical paper.pdf': 'ff546cb0-22d3-49ca-ada9-9834c2cc6ac0',
 'multimodal llm paper.pdf': 'c7e08d3a-db4a-43bd-b77f-8564ecb7b45b',
 'NeurIPS 2025 CNN Paper.pdf': '9406ab48-5d83-4ceb-aabc-2e256cc8969a',
 'Document AI.pdf': '13d62faf-47b7-437b-8504-bfa6192d4534',
 'Paper ConTextTab.pdf': '6312faba-b4d4-40ce-bc09-10e9ff3dab29'}

Use this document mapping dictionary to update or delete the chunks of a particular document using its document ID. In production, you may want to maintain these mappings in a HANA table as well.

Refer to the SAP Help pages to [Update](https://help.sap.com/docs/sap-ai-core/sap-ai-core-service-guide/update-document-adaa7cc44d334d89baf1ef666ac3158c) or [Delete](https://help.sap.com/docs/sap-ai-core/sap-ai-core-service-guide/delete-document-529a5e8168604f8c80139c915df9a014) a document.
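
As a rough sketch of how the mapping could be used, the helper below builds the URL and headers that address a single document. It assumes that a document is addressed by appending its ID to the documents URL used above; confirm the exact endpoint and HTTP method on the linked SAP Help pages. `build_document_request` and the placeholder host, collection ID, and token are hypothetical.

```python
from typing import Dict, Tuple


def build_document_request(api_url: str, collection_id: str, document_id: str,
                           access_token: str,
                           resource_group: str = "default") -> Tuple[str, Dict[str, str]]:
    """Return (url, headers) addressing a single document in a collection."""
    # Assumption: a single document is addressed by appending its ID to the documents URL
    url = (f"{api_url.rstrip('/')}/v2/lm/document-grounding/vector/"
           f"collections/{collection_id}/documents/{document_id}")
    headers = {
        "Authorization": f"Bearer {access_token}",
        "AI-Resource-Group": resource_group,
    }
    return url, headers


# Placeholder values for illustration only
doc_map = {"Document AI.pdf": "13d62faf-47b7-437b-8504-bfa6192d4534"}
url, headers = build_document_request(
    "https://api.example.invalid", "my-collection-id",
    doc_map["Document AI.pdf"], access_token="<token>",
)
print(url)
# With the requests library, a deletion would then look like:
#   response = requests.delete(url, headers=headers)
```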



## Step 4: Retrieve Similar Documents

In [42]:
url = f"{AI_API_URL}/v2/lm/document-grounding/retrieval/search"

headers = {
    "AI-Resource-Group": "default",
    "Authorization": f"Bearer {access_token}",
    "Content-Type": "application/json"
}

payload = {
    "query": "What is efficient receptive field?",
    "filters": [
        {
            "id": "string",
            "searchConfiguration": {
                "maxChunkCount": 2
            },
            "dataRepositories": [collection_id], # Specify your repository ID(s)
            "dataRepositoryType": "vector"
        }
    ]
}

response = requests.post(url, headers=headers, json=payload)

print("Status Code:", response.status_code)
response_text = response.text

import json

# Parse the JSON string into a dictionary
response_dict = json.loads(response_text)
retrieved_docs = [] 
# Loop through and print each "content"
for result in response_dict.get("results", []):
    for res in result.get("results", []):
        for document in res.get("dataRepository", {}).get("documents", []):
            for chunk in document.get("chunks", []):
                retrieved_docs.append(chunk.get("content", ""))

for doc in retrieved_docs:
    print(doc)


Status Code: 200
The concept of receptive ﬁeld is important for understanding and diagnosing how deep CNNs work.
Since anywhere in an input image outside the receptive ﬁeld of a unit does not affect the value of that
unit, it is necessary to carefully control the receptive ﬁeld, to ensure that it covers the entire relevant
image region. In many tasks, especially dense prediction tasks like semantic image segmentation,
stereo and optical ﬂow estimation, where we make a prediction for each single pixel in the input image,
it is critical for each output pixel to have a big receptive ﬁeld, such that no important information is
left out when making the prediction.
The receptive ﬁeld size of a unit can be increased in a number of ways. One option is to stack more
layers to make the network deeper, which increases the receptive ﬁeld size linearly by theory, as
each extra layer increases the receptive ﬁeld size by the kernel size. Sub-sampling on the other hand
Understanding the Effective Rece
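
When the retrieved chunks are later used as context, it can help to know which file each chunk came from. The sketch below walks the same nested structure as the loop above and pairs each chunk with the `id` metadata of its document. It assumes that retrieved documents carry the document-level metadata written at ingestion time; if they do not, the source falls back to `"unknown"`. `extract_chunks_with_source` and the sample response are hypothetical.

```python
def extract_chunks_with_source(response_dict: dict) -> list:
    """Collect (source, content) pairs from a retrieval response."""
    pairs = []
    for result in response_dict.get("results", []):
        for res in result.get("results", []):
            for document in res.get("dataRepository", {}).get("documents", []):
                # Assumption: documents echo the 'id' metadata written at ingestion time
                source = next(
                    (m["value"][0] for m in document.get("metadata", [])
                     if m.get("key") == "id" and m.get("value")),
                    "unknown",
                )
                for chunk in document.get("chunks", []):
                    pairs.append((source, chunk.get("content", "")))
    return pairs


# Hand-written response in the assumed shape
sample_response = {
    "results": [{
        "results": [{
            "dataRepository": {
                "documents": [{
                    "metadata": [{"key": "id", "value": ["paper.pdf"]}],
                    "chunks": [{"content": "first chunk"}, {"content": "second chunk"}],
                }]
            }
        }]
    }]
}
for source, content in extract_chunks_with_source(sample_response):
    print(f"[{source}] {content}")
```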

### Step 4.1 [OPTIONAL]: Augment answer generation with retrieved documents

In [43]:
context = ' '.join(retrieved_docs)

query = "What is receptive field?"

In [44]:
prompt = f"""
Use the following context information to answer the user's query.
Here is some context: {context}

Based on the above context, answer the following query:
{query}

The answer tone has to be very professional in nature.

If you don't know the answer, politely say that you don't know, don't try to make up an answer.
"""

In [45]:
from gen_ai_hub.proxy.native.openai import chat

messages = [
    {"role": "system", "content": "You are an intelligent assistant."},
    {"role": "user", "content": prompt}
]

kwargs = dict(model_name="gpt-4o", messages=messages)

response = chat.completions.create(**kwargs)

print(response.choices[0].message.content)

The concept of the receptive field pertains to deep Convolutional Neural Networks (CNNs) and is crucial for understanding and diagnosing their operations. In this context, a receptive field refers to the area in the input image that affects the value of a particular unit or neuron in the network. For tasks such as semantic image segmentation, stereo, and optical flow estimation—where predictions are made for each pixel in the image—ensuring a sufficiently large receptive field is critical. This guarantees that important information is not excluded from the analysis, thus influencing prediction accuracy. The receptive field size can be increased by adding more layers to the network, which expands it linearly according to the kernel size used in each layer. Additionally, the notion of an effective receptive field acknowledges that the practical influence of receptive fields may exhibit a Gaussian distribution and cover only a fraction of the full theoretical receptive area, with variatio