# Store Embeddings for a Retrieval Augmented Generation (RAG) Use Case

RAG is especially useful for question-answering use cases that draw on large collections of unstructured documents containing important information.

Let’s implement a RAG use case so that the next time you ask about an [SAP AI Service](https://help.sap.com/docs/ai-services), you get the correct response! To achieve this, you need to vectorize your context documents. You can find the documents to vectorize and store as embeddings in SAP HANA Cloud Vector Engine in the `documents` directory.
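To make the idea of "vectorizing" more concrete, here is a minimal sketch of how a question could be matched against stored chunks by comparing vectors. The three-dimensional vectors and the chunk labels are made up for illustration; a real embedding model returns far more dimensions, and the vector store performs this comparison for you.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings of two document chunks (values invented for this sketch).
chunk_vectors = {
    "chunk about Data Attribute Recommendation": [0.9, 0.1, 0.0],
    "chunk about another service": [0.1, 0.9, 0.1],
}

# Hypothetical embedding of a user question.
question_vector = [0.8, 0.2, 0.1]

# Retrieval step: pick the chunk whose vector points in the most similar direction.
best_chunk = max(
    chunk_vectors,
    key=lambda label: cosine_similarity(question_vector, chunk_vectors[label]),
)
print(best_chunk)
```

In a RAG flow, the text of the best-matching chunks is then passed to the language model as context for the answer.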

## LangChain

The Generative AI Hub Python SDK is compatible with the [LangChain](https://python.langchain.com/v0.1/docs/get_started/introduction) library. LangChain is a tool for building applications that utilize large language models, such as GPT models. It is valuable because it helps manage and connect different models, tools, and data, simplifying the process of creating complex AI workflows.

In [1]:
import os
import json

with open('/home/user/projects/generative-ai-codejam/.aicore-config.json', 'r') as config_file:
    config_data = json.load(config_file)

os.environ["AICORE_AUTH_URL"]=config_data["url"]+"/oauth/token"
os.environ["AICORE_CLIENT_ID"]=config_data["clientid"]
os.environ["AICORE_CLIENT_SECRET"]=config_data["clientsecret"]
os.environ["AICORE_BASE_URL"]=config_data["serviceurls"]["AI_API_URL"]

# Change the value of the resource group to yours
os.environ["AICORE_RESOURCE_GROUP"]="team-sap"

# OpenAIEmbeddings to create text embeddings
#from gen_ai_hub.proxy.native.openai import OpenAIEmbeddings
from gen_ai_hub.proxy.langchain.openai import OpenAIEmbeddings

# PyPDFDirectoryLoader to load all PDF documents from a directory
from langchain_community.document_loaders import PyPDFDirectoryLoader

# CharacterTextSplitter to split documents into smaller text chunks
from langchain_text_splitters import CharacterTextSplitter

# LangChain & HANA Vector Engine
from langchain_community.vectorstores.hanavector import HanaDB
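If the service key file is incomplete, the cell above fails with a bare `KeyError`. The following sketch shows one way to check the keys before setting the environment variables. The helper name `aicore_env_from_service_key` and the placeholder values are assumptions made for this example; only the key names and variable names come from the cell above.

```python
REQUIRED_KEYS = ("url", "clientid", "clientsecret", "serviceurls")


def aicore_env_from_service_key(service_key):
    """Map a parsed service key (dict) to the AICORE_* variables used in this exercise."""
    missing = [key for key in REQUIRED_KEYS if key not in service_key]
    if missing:
        raise KeyError(f"service key is missing: {', '.join(missing)}")
    return {
        "AICORE_AUTH_URL": service_key["url"] + "/oauth/token",
        "AICORE_CLIENT_ID": service_key["clientid"],
        "AICORE_CLIENT_SECRET": service_key["clientsecret"],
        "AICORE_BASE_URL": service_key["serviceurls"]["AI_API_URL"],
    }


# Placeholder values only; use the content of your own .aicore-config.json.
sample_key = {
    "url": "https://auth.example.invalid",
    "clientid": "my-client-id",
    "clientsecret": "my-client-secret",
    "serviceurls": {"AI_API_URL": "https://api.example.invalid"},
}

env = aicore_env_from_service_key(sample_key)
print(sorted(env))
```

You could then call `os.environ.update(env)` instead of assigning each variable separately.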

👉 Change the `EMBEDDING_DEPLOYMENT_ID` in [variables.py](variables.py) to your deployment ID from exercise [01-deploy-model](01-deploy-model.md).

👉 In [variables.py](variables.py) also set the `EMBEDDING_TABLE` to `"EMBEDDINGS_CODEJAM_>add your name here<"`

👉 Create a [.user.ini](../.user.ini) file with the HANA login information provided by the instructor.

```ini
[hana]
url=XXXXXX.hanacloud.ondemand.com
user=XXXXXX
passwd=XXXXXX
port=443
```
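As a quick check of the file format, the sketch below parses the same structure from a string with Python's `configparser`. The values are the placeholders from the template above, and `sample_config` is a name chosen for this example.

```python
import configparser

# Same layout as .user.ini, with the placeholder values from the template.
sample_ini = """
[hana]
url=XXXXXX.hanacloud.ondemand.com
user=XXXXXX
passwd=XXXXXX
port=443
"""

sample_config = configparser.ConfigParser()
sample_config.read_string(sample_ini)

hana = sample_config["hana"]
port = hana.getint("port")  # values are stored as strings; getint converts the port
print(sorted(hana.keys()), port)
```

If a key is misspelled in your file, `sample_config.get('hana', ...)` raises `configparser.NoOptionError`, which is usually the first thing to check when the connection cell fails.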

In [2]:
# connect to HANA instance
from hdbcli import dbapi
import configparser

import variables

In [3]:
config = configparser.ConfigParser()
config.read('/home/user/projects/generative-ai-codejam/.user.ini')
connection = dbapi.connect(
    address=config.get('hana', 'url'),
    port=config.getint('hana', 'port'),
    user=config.get('hana', 'user'),
    password=config.get('hana', 'passwd'),
    autocommit=True,
    sslValidateCertificate=False
)

connection.isconnected()

True

## Chunking of the documents

Before you can create embeddings for your documents, you need to break them down into smaller pieces of text, called chunks. You will use the simplest chunking technique: the text is split at the separator `"\n\n"`, and the resulting pieces are merged into chunks up to a maximum character length, using the [Character Text Splitter](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/character_text_splitter/) from LangChain.
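The idea behind this kind of splitting can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `split_by_separator` is made up for this example, and details such as whitespace handling and the exact overlap rule are assumptions.

```python
def split_by_separator(text, chunk_size=500, chunk_overlap=50, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole, because this strategy
    never cuts inside a piece.
    """
    pieces = [piece for piece in text.split(separator) if piece.strip()]
    chunks, current = [], []

    def joined_length(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_length(current + [piece]) > chunk_size:
            # The next piece does not fit: close the current chunk.
            chunks.append(separator.join(current))
            # Keep trailing pieces as overlap while they fit into chunk_overlap
            # and still leave room for the next piece.
            while current and (
                joined_length(current) > chunk_overlap
                or joined_length(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


sample_text = "\n\n".join(["a" * 30, "b" * 30, "c" * 30])
chunks = split_by_separator(sample_text, chunk_size=70, chunk_overlap=35)
print([len(chunk) for chunk in chunks])
```

In this sample the middle paragraph appears in both chunks, which is the purpose of the overlap: context at a chunk boundary is not lost.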

## Character Text Splitter

In [4]:

# Load custom documents
loader = PyPDFDirectoryLoader('documents/')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(documents)
print(f"Number of document chunks: {len(texts)}")


Number of document chunks: 130
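Because the splitter only cuts at the separator, a stretch of text without `"\n\n"` can produce a chunk longer than `chunk_size`. A small helper like the following can be used to inspect chunk lengths. The name `chunk_length_report` and the sample strings are made up for this sketch.

```python
def chunk_length_report(chunk_texts, chunk_size=500):
    """Summarize chunk lengths and count chunks longer than chunk_size."""
    lengths = [len(text) for text in chunk_texts]
    return {
        "count": len(lengths),
        "shortest": min(lengths),
        "longest": max(lengths),
        "oversized": sum(1 for length in lengths if length > chunk_size),
    }


# Placeholder chunks; these are not taken from the exercise documents.
sample_chunks = ["a" * 120, "b" * 480, "c" * 730]
report = chunk_length_report(sample_chunks)
print(report)
```

With the chunks from the cell above, the call would be `chunk_length_report([doc.page_content for doc in texts])`.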


Now you can connect to your SAP HANA Cloud Vector Engine and store the embeddings for your text chunks.

In [7]:
# Create embeddings for custom documents
embeddings = OpenAIEmbeddings(deployment_id=variables.EMBEDDING_DEPLOYMENT_ID)
db = HanaDB(
    embedding=embeddings, connection=connection, table_name=variables.EMBEDDING_TABLE
)

# Delete already existing documents from the table
db.delete(filter={})

# add the loaded document chunks
db.add_documents(texts)
print(db.table_name)

EMBEDDINGS_CODEJAM_teamvitaliy
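The printed table name can differ from the value you set in `variables.py`, because unsupported characters are removed from it. The sketch below illustrates that kind of clean-up. The exact rule (keep only letters, digits, and underscores) is an assumption about the library's behavior, and `sanitize_table_name` is a name invented for this example.

```python
import re


def sanitize_table_name(name):
    """Keep only letters, digits, and underscores (assumed rule, for illustration)."""
    return re.sub(r"[^a-zA-Z0-9_]", "", name)


# Hypothetical input in the style of the EMBEDDING_TABLE template.
sanitized = sanitize_table_name("EMBEDDINGS_CODEJAM_>team vitaliy<")
print(sanitized)
```

For this reason, always read the name back from `db.table_name` when you query the table directly.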


## Check the embeddings in SAP HANA Cloud Vector Engine

👉 Print the rows from your embedding table and scroll to the right to see the embeddings.

In [8]:
cursor = connection.cursor()

# Use `db.table_name` instead of `variables.EMBEDDING_TABLE` because `HanaDB` sanitizes the table name by removing characters it does not accept
# `execute` returns a status flag, not the rows; a separate name avoids overwriting the `embeddings` model object
executed = cursor.execute(f'SELECT VEC_TEXT, VEC_META, TO_NVARCHAR(VEC_VECTOR) FROM "{db.table_name}"')
print(executed)
for row in cursor:
    print(row)
cursor.close()

True
('PUBLIC\n2024-08-06\nData Attribute Recommendation© 2024 SAP SE or an SAP affiliate  company. All rights reserved.\nTHE BEST RUN', '{"source": "documents/SAP-Help-Data-Attribute-Recommendation.pdf", "page": 0}', '[-0.01963485,-0.012575209,0.0023232896,-0.027477298,-0.020123208,0.013939737,-0.002432811,-0.02079829,-0.024475336,-0.034185033,-0.006011108,0.005494023,0.016374344,-0.017221788,0.005407843,0.010456598,0.020324295,-0.021645734,0.023484256,0.01452864,0.007950175,0.014442459,-0.004811759,0.0055155684,0.005454524,0.013358017,0.017753236,-0.0325476,0.009178251,-0.04018896,0.0022245408,-0.0054688873,-0.011383042,-0.0011679288,-0.006941142,-0.0303069,-0.03737372,-0.0057238387,0.014672274,-0.021387191,0.0000031612938,0.0033143684,0.007993265,-0.015182177,-0.009206978,0.0022514723,0.021746278,-0.0061224247,-0.014069009,0.019519942,0.012072488,0.019175218,0.00061628217,-0.017408513,0.041855123,0.0043090377,-0.017365422,0.0022855855,0.002062952,0.00726432,0.022148455,-0.0074402723
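Each row holds the chunk text, its metadata as a JSON string, and the vector rendered as text. A minimal sketch of turning such a row back into Python objects follows. The row here is a shortened, hand-written stand-in with only three vector components; real rows contain one value per dimension of the embedding model.

```python
import json

# Shortened stand-in for one row of the query result (three vector components only).
row = (
    "PUBLIC\n2024-08-06\nData Attribute Recommendation",
    '{"source": "documents/SAP-Help-Data-Attribute-Recommendation.pdf", "page": 0}',
    "[-0.01963485,-0.012575209,0.0023232896]",
)

text, meta_json, vector_text = row

metadata = json.loads(meta_json)   # dict with the source file and page number
vector = json.loads(vector_text)   # list of floats

print(metadata["source"], metadata["page"], len(vector))
```

Parsing the vector this way is only needed for inspection; similarity search itself runs inside the database.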

[Next exercise](07-RAG.ipynb)