# Document Grounding on S3 Data Using Document Management APIs

Purpose: Ground LLM responses on your enterprise data stored in S3 with the SAP Document Grounding service, using the Document Management APIs. This tutorial walks through the steps to set up and use the grounding service with S3 as a data source.


The process consists of three steps:  
* Step 1: Upload Documents to S3 bucket  
* Step 2: Create Data Repository pointing to S3 bucket 
* Step 3: Retrieve most similar documents from Data Repository based on input query and generate augmented answer


**Pre-requisites:**
* Object Store (S3) and its credentials


**Step 1:**
Push your document(s) to the S3 bucket. 

**Step 2:**
* Create Generic Secret on AI Launchpad with S3 object store credentials.
* Use Document Management Pipelines API to create a Data Repository from S3 bucket.

**Step 3:**
* Use the Document Management Retrieval API to fetch the most similar documents from the Data Repository.
* [Optional] Use the Gen AI Hub SDK to access an LLM and generate an answer using the retrieved documents as context.


In this tutorial, we will:
1. Create Access Token.
2. Create Repository using Pipelines API.
3. Retrieve documents using Retrieval API.
4. [OPTIONAL] Use GPT-4o model to generate answer.

## Step 1: Data Load to S3

**Obtain Object Store Credentials**  
* Download S3 credentials from BTP cockpit > BTP subaccount > Space > Instances > ObjectStore > Credentials

**AWS CLI Installation and Configuration**  
* Install AWS CLI: Download and install the AWS Command Line Interface from the official AWS documentation [Link](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).
* Verify Installation: Check your installation by running:
``` sh
aws --version
```
* Configure AWS CLI: Run the configuration command and pass corresponding values from Object Store credentials.
``` sh
aws configure
```

**Push Documents to S3**  
Push your documents to the S3 bucket. You can optionally place them in a sub-directory (prefix) within the bucket.

S3 CLI commands:  
``` sh
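# Copy a single file
aws s3 cp sample.pdf s3://your-bucket-name/sample.pdf
# OR copy the current directory recursively
aws s3 cp . s3://your-bucket-name/ --recursive
# OR copy the current directory recursively into a sub-directory
aws s3 cp . s3://your-bucket-name/<optional_path>/ --recursive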
```
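Before uploading, it can help to preview which object keys a recursive copy would produce, since the prefix chosen here is the one later referenced in `includePaths`. The following is a minimal sketch, not part of the AWS CLI: `plan_s3_keys` and its parameters are hypothetical names, and the function only inspects the local directory without contacting S3.

```python
from pathlib import Path


def plan_s3_keys(local_dir, prefix=""):
    """Map each file under local_dir to the S3 key a recursive copy would use.

    Hypothetical helper for illustration; it does not contact S3.
    """
    root = Path(local_dir)
    # Normalise the optional prefix so it has no leading or trailing slash.
    clean_prefix = prefix.strip("/")
    plan = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Keys always use forward slashes, regardless of the local OS.
            relative = path.relative_to(root).as_posix()
            key = f"{clean_prefix}/{relative}" if clean_prefix else relative
            plan[str(path)] = key
    return plan
```

Calling `plan_s3_keys(".", "new_papers")` returns a dictionary from local file paths to keys such as `new_papers/sample.pdf`.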

## Step 2: Create Data Repository using Pipelines API

### Step 2.1: Create Generic Secret 
Create a Generic Secret on AI Launchpad with the S3 Object Store credentials. See [SAP Help](https://help.sap.com/docs/ai-launchpad/sap-ai-launchpad/grounding-management) for details.
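AI Core generic secrets generally expect their values to be base64-encoded. The sketch below shows one way to encode credential values before pasting them into the secret. The helper name `encode_secret_values` is hypothetical, and the field names in the example are illustrative assumptions; the SAP Help page linked above is the authority on which fields the grounding secret requires.

```python
import base64


def encode_secret_values(values):
    """Return a copy of `values` with every string base64-encoded (UTF-8)."""
    return {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }


# Placeholder credentials; the field names are illustrative assumptions.
example = encode_secret_values({
    "bucket": "your-bucket-name",
    "region": "eu-central-1",
})
print(example)
```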

### Step 2.2: Generate Access Token

Create Access Token using the AI Core credentials.

In [1]:
import os
from dotenv import load_dotenv
load_dotenv(override=True)

True

In [2]:
import requests

# Read your AI Core service key details from environment variables (loaded from .env above)
client_id = os.getenv("AICORE_CLIENT_ID")
client_secret = os.getenv("AICORE_CLIENT_SECRET")
auth_url = os.getenv("AICORE_AUTH_URL")

# Prepare the payload and headers
payload = {
    "grant_type": "client_credentials"
}
headers = {
    "Content-Type": "application/x-www-form-urlencoded"
}

# Make the POST request to obtain the token
response = requests.post(auth_url, data=payload, headers=headers, auth=(client_id, client_secret))

# Check if the request was successful
if response.status_code == 200:
    access_token = response.json().get("access_token")
    print("Access token obtained successfully.")
else:
    print(f"Failed to obtain access token: {response.status_code} - {response.text}")

Access token obtained successfully.
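The cell above only prints a message on failure, so later cells would fail with an undefined `access_token`. A possible refinement, sketched under the assumption that the token endpoint returns the standard OAuth2 fields `access_token` and `expires_in`, is to raise on failure and track when the token expires. The names `parse_token_response` and `is_expired` are hypothetical helpers.

```python
import time


def parse_token_response(status_code, body, now=None):
    """Validate an OAuth2 token response and compute its expiry time.

    Raises RuntimeError instead of printing, so later cells never run
    with an undefined token. `now` is injectable for testing.
    """
    if status_code != 200:
        raise RuntimeError(f"Token request failed with status {status_code}")
    token = body.get("access_token")
    if not token:
        raise RuntimeError("Token response did not contain 'access_token'")
    issued_at = time.time() if now is None else now
    # `expires_in` is the OAuth2 lifetime in seconds; treated as 0 if absent.
    expires_at = issued_at + float(body.get("expires_in", 0))
    return token, expires_at


def is_expired(expires_at, now=None, safety_margin=60):
    """True if the token is within `safety_margin` seconds of expiry."""
    current = time.time() if now is None else now
    return current >= expires_at - safety_margin
```

With the `response` object from the cell above, this could be used as `access_token, expires_at = parse_token_response(response.status_code, response.json())`, requesting a new token whenever `is_expired(expires_at)` returns `True`.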


### Step 2.3: Create Data Repository using Pipelines API

As a prerequisite, the S3 credentials must already be stored as a generic secret on AI Core (Step 2.1). Replace the generic secret name in the following code with yours.

Optionally, specify the path in the S3 bucket from which documents should be considered for grounding.

In [None]:
AI_API_URL = r"https://api.ai.prod.eu-central-1.aws.ml.hana.ondemand.com" # Update your AI_API_URL as per the aws region
url = f"{AI_API_URL}/v2/lm/document-grounding/pipelines"

headers = {"Authorization": f"Bearer {access_token}",
           "AI-Resource-Group": "default", # update your resource group name here
           "Content-Type": "application/json"}

payload = {
    "type": "S3",
    "configuration": {
        "destination": "ai-best-practices-grounding-s3-b2", # Update your S3 generic secret name here
        "s3": {
            "includePaths": ["/new_papers/"] # Sub-directory path (also called a prefix) in the S3 bucket to fetch data from
        }
    }
}

response = requests.post(url, headers=headers, json=payload)

print("Status Code:", response.status_code)
print("Response:", response.text)

Status Code: 201
Response: {"pipelineId": "13ce3a20-26a8-455c-a06e-1343be4a8bf7"}
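Because the path is optional, the payload can be built by a small function that adds the `s3` section only when prefixes are supplied. This is a sketch mirroring the payload in the cell above; `build_s3_pipeline_payload` is a hypothetical helper, and leaving out the `s3` block to mean "no prefix filter" is an assumption that should be checked against the Pipelines API reference.

```python
def build_s3_pipeline_payload(secret_name, include_paths=None):
    """Build the request body for an S3 pipeline, mirroring the cell above."""
    if not secret_name:
        raise ValueError("secret_name must be a non-empty generic secret name")
    configuration = {"destination": secret_name}
    if include_paths:
        # Only add the optional "s3" section when prefixes are given
        # (assumption: omitting it means the whole bucket is considered).
        configuration["s3"] = {"includePaths": list(include_paths)}
    return {"type": "S3", "configuration": configuration}
```

For example, `build_s3_pipeline_payload("ai-best-practices-grounding-s3-b2", ["/new_papers/"])` reproduces the payload used above and can be passed as `json=` to `requests.post`.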



### Step 2.4: Verify created Data Repository

**Get the details for newly created Data Repository**
  
Go to the Grounding Management page on AI Launchpad and note the ID of the Data Repository that belongs to the pipeline ID returned in the previous step.

In [3]:
AI_API_URL = r"https://api.ai.prod.eu-central-1.aws.ml.hana.ondemand.com" # Replace with your AI API URL
repository_id = "b27f2f08-d6e2-4907-920d-7d08035c4de8" # Paste Data Repository ID from Grounding Management page on AI Launchpad

url = f"{AI_API_URL}/v2/lm/document-grounding/retrieval/dataRepositories/{repository_id}"


headers = {"Authorization": f"Bearer {access_token}",
           "AI-Resource-Group": "default", # Update your resource group name here
           "Content-Type": "application/json"}

response = requests.get(url, headers=headers)

print("Status Code:", response.status_code)
print("Response:", response.text)

Status Code: 200
Response: {"id": "b27f2f08-d6e2-4907-920d-7d08035c4de8", "title": "pipeline-13ce3a20-26a8-455c-a06e-1343be4a8bf7-collection", "metadata": [{"key": "pipeline", "value": ["13ce3a20-26a8-455c-a06e-1343be4a8bf7"]}, {"key": "type", "value": ["custom"]}, {"key": "pipelineType", "value": ["S3"]}], "type": "vector"}
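The response metadata carries the pipeline ID, so the link between the repository and the pipeline can be checked in code rather than by eye. The sketch below parses the response body shown above; `metadata_values` and `belongs_to_pipeline` are hypothetical helper names.

```python
import json


def metadata_values(repository, key):
    """Return the list stored under `key` in a repository's metadata, or []."""
    for entry in repository.get("metadata", []):
        if entry.get("key") == key:
            return entry.get("value", [])
    return []


def belongs_to_pipeline(repository, pipeline_id):
    """True if the repository's 'pipeline' metadata lists pipeline_id."""
    return pipeline_id in metadata_values(repository, "pipeline")


# Response body copied from the output above.
repo = json.loads(
    '{"id": "b27f2f08-d6e2-4907-920d-7d08035c4de8", '
    '"title": "pipeline-13ce3a20-26a8-455c-a06e-1343be4a8bf7-collection", '
    '"metadata": [{"key": "pipeline", "value": ["13ce3a20-26a8-455c-a06e-1343be4a8bf7"]}, '
    '{"key": "type", "value": ["custom"]}, {"key": "pipelineType", "value": ["S3"]}], '
    '"type": "vector"}'
)
print(belongs_to_pipeline(repo, "13ce3a20-26a8-455c-a06e-1343be4a8bf7"))  # True
```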



## Step 3: Retrieve Similar Documents

Add the ID of your data repository in the following cell. If you want to include more data repositories, add them to the list as well.

Optionally, specify the path in the S3 bucket from which documents should be considered for grounding.

In [None]:
url = f"{AI_API_URL}/v2/lm/document-grounding/retrieval/search"

headers = {
    "AI-Resource-Group": "default",
    "Authorization": f"Bearer {access_token}",
    "Content-Type": "application/json"
}

payload = {
    "query": "What is efficient receptive field?",
    "filters": [
        {
            "id": "string",
            "searchConfiguration": {
                "maxChunkCount": 2
            },
            "dataRepositories": ["b27f2f08-d6e2-4907-920d-7d08035c4de8"], # Specify your repository ID(s)
            "dataRepositoryType": "vector"
        }
    ]
}

response = requests.post(url, headers=headers, json=payload)

print("Status Code:", response.status_code)
response_text = response.text

import json

# Parse the JSON string into a dictionary
response_dict = json.loads(response_text)
retrieved_docs = []
# Collect the "content" of every retrieved chunk
for result in response_dict.get("results", []):
    for res in result.get("results", []):
        for document in res.get("dataRepository", {}).get("documents", []):
            for chunk in document.get("chunks", []):
                retrieved_docs.append(chunk.get("content", ""))

for doc in retrieved_docs:
    print(doc)


Status Code: 200
The concept of receptive ﬁeld is important for understanding and diagnosing how deep CNNs work.
Since anywhere in an input image outside the receptive ﬁeld of a unit does not affect the value of that
unit, it is necessary to carefully control the receptive ﬁeld, to ensure that it covers the entire relevant
image region. In many tasks, especially dense prediction tasks like semantic image segmentation,
stereo and optical ﬂow estimation, where we make a prediction for each single pixel in the input image,
it is critical for each output pixel to have a big receptive ﬁeld, such that no important information is
left out when making the prediction.
The receptive ﬁeld size of a unit can be increased in a number of ways. One option is to stack more
layers to make the network deeper, which increases the receptive ﬁeld size linearly by theory, as
each extra layer increases the receptive ﬁeld size by the kernel size. Sub-sampling on the other hand
takes up a fraction of the full 
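The nested loops in the cell above can be wrapped in a reusable generator, which also makes it easy to drop duplicate chunks when several repositories return the same text. This is a sketch that assumes the response nesting used in that cell (`results` > `results` > `dataRepository` > `documents` > `chunks` > `content`); `extract_chunks` and `unique_in_order` are hypothetical names.

```python
def extract_chunks(response_dict):
    """Yield non-empty chunk contents from a retrieval search response.

    Follows the nesting used in the cell above:
    results -> results -> dataRepository -> documents -> chunks -> content.
    """
    for filter_result in response_dict.get("results", []):
        for repo_result in filter_result.get("results", []):
            documents = repo_result.get("dataRepository", {}).get("documents", [])
            for document in documents:
                for chunk in document.get("chunks", []):
                    content = chunk.get("content", "")
                    if content:
                        yield content


def unique_in_order(items):
    """Drop exact duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))
```

With the parsed response from above, `retrieved_docs = unique_in_order(extract_chunks(response_dict))` yields the same list as the original loops, minus empty and repeated chunks.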

### Step 3.1 [OPTIONAL]: Augment answer generation with retrieved documents

In [33]:
context = ' '.join(retrieved_docs)

query = "What is receptive field?"
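Joining every retrieved chunk works for a `maxChunkCount` of 2, but with more chunks the context can grow larger than is sensible to send to the model. One option, sketched here with a hypothetical helper `cap_context` and an arbitrary default limit of 4000 characters, is to keep whole chunks in retrieval order until a character budget is reached.

```python
def cap_context(chunks, max_chars=4000, separator="\n\n"):
    """Join chunks in order, stopping before the result would exceed max_chars.

    Chunks are kept whole; the first chunk that does not fit ends the selection.
    """
    selected, used = [], 0
    for chunk in chunks:
        # Account for the separator that precedes every chunk but the first.
        extra = len(chunk) + (len(separator) if selected else 0)
        if used + extra > max_chars:
            break
        selected.append(chunk)
        used += extra
    return separator.join(selected)
```

A character budget is only a rough stand-in for the model's token limit; `context = cap_context(retrieved_docs)` could replace the plain join if that trade-off is acceptable.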

In [34]:
prompt = f"""
Use the following context information to answer the user's query.
Here is some context: {context}

Based on the above context, answer the following query:
{query}

The answer tone has to be very professional in nature.

If you don't know the answer, politely say that you don't know, don't try to make up an answer.
"""

In [35]:
from gen_ai_hub.proxy.native.openai import chat

messages = [
    {"role": "system", "content": "You are an intelligent assistant."},
    {"role": "user", "content": prompt}
]

kwargs = dict(model_name="gpt-4o", messages=messages)

response = chat.completions.create(**kwargs)

print(response.choices[0].message.content)

In the context of deep convolutional neural networks (CNNs), the receptive field refers to the specific portion of the input image that affects the output of a particular unit or neuron within the network. Essentially, it constitutes the region in which information influences the activation of that unit. The concept of the receptive field is crucial for understanding how deep CNNs operate and is particularly important in tasks involving dense predictions like semantic image segmentation, stereo vision, and optical flow estimation, where predictions need to be made for individual pixels within an image.

To ensure that CNNs efficiently cover the necessary image regions, the receptive field size must be carefully managed. Increasing the receptive field allows for more comprehensive capture of relevant information needed for accurate predictions. This can be achieved by adding more layers, which theoretically increases the receptive field size linearly, as each layer adds to the receptive