# Store Embeddings for a Retrieval Augmented Generation (RAG) Use Case

RAG is especially useful for question-answering use cases that involve large amounts of unstructured documents containing important information. 

Let’s implement a RAG use case so that the next time you ask about the [orchestration service](https://help.sap.com/doc/generative-ai-hub-sdk/CLOUD/en-US/_reference/orchestration-service.html), you get the correct response! To achieve this, you need to vectorize your context documents. The documents to vectorize and store as embeddings in SAP HANA Cloud Vector Engine are in the `documents` directory.

## LangChain

The Generative AI Hub Python SDK is compatible with the [LangChain](https://python.langchain.com/docs/introduction/) library. LangChain is a tool for building applications that utilize large language models, such as GPT models. It is valuable because it helps manage and connect different models, tools, and data, simplifying the process of creating complex AI workflows.

In [1]:
import warnings
# Suppress "RuntimeWarning: As the c extension couldn't be imported, `google-crc32c` is using a pure python implementation that is significantly slower."
warnings.filterwarnings("ignore", message="As the c extension")

# OpenAIEmbeddings to create text embeddings
from gen_ai_hub.proxy.langchain.openai import OpenAIEmbeddings

# PyPDFDirectoryLoader to load all PDF documents from a directory
from langchain_community.document_loaders import PyPDFDirectoryLoader

# RecursiveCharacterTextSplitter to chunk documents into smaller text chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter

# LangChain & HANA Vector Engine
from langchain_community.vectorstores.hanavector import HanaDB

👉 Change the `EMBEDDING_DEPLOYMENT_ID` in [variables.py](variables.py) to your deployment ID from exercise [01-explore-genai-hub](01-explore-genai-hub.md). To find the ID, open the **SAP AI Launchpad** application and navigate to **ML Operations** > **Deployments**.

☝️ The `EMBEDDING_DEPLOYMENT_ID` identifies the deployment of the embedding model that creates vector representations of your text, e.g. **text-embedding-3-small**.

👉 In [variables.py](variables.py), also change `EMBEDDING_TABLE` from `"EMBEDDINGS_CODEJAM_"+"->ADD_YOUR_NAME_HERE<-"` to a value that includes your personal or team name, like `"EMBEDDINGS_CODEJAM_NORA"`.

👉 In the root folder of the project create a [.user.ini](../.user.ini) file with the SAP HANA database connection details provided by the instructor.
```ini
[hana]
url=XXXXXX.hanacloud.ondemand.com
user=XXXXXX
passwd=XXXXXX
port=443
```
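
The `init_env` helper used in the next cell is not shown in this exercise, so how it reads the file is not documented here. As a rough sketch only, a file with this layout can be parsed with Python's standard `configparser`. The function name `read_hana_settings` and the returned dictionary keys are hypothetical and chosen for illustration.

```python
import configparser

# Sample text mirroring the .user.ini layout above (placeholder values only).
SAMPLE_INI = """
[hana]
url=XXXXXX.hanacloud.ondemand.com
user=XXXXXX
passwd=XXXXXX
port=443
"""


def read_hana_settings(ini_text: str) -> dict:
    """Hypothetical helper: parse the [hana] section into a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["hana"]
    return {
        "address": section["url"],
        "port": section.getint("port"),  # convert "443" to the integer 443
        "user": section["user"],
        "password": section["passwd"],
    }


settings = read_hana_settings(SAMPLE_INI)
print(settings["address"], settings["port"])
```

In a real project you would read the file from disk with `parser.read(".user.ini")` instead of passing a string, and keep the file out of version control because it contains credentials.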

In [None]:
import init_env
import variables

init_env.set_environment_variables()

# connect to HANA instance
connection = init_env.connect_to_hana_db()
print(f"Successfully connected to SAP HANA db instance: {connection.isconnected()}")

## Chunking of the documents

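Before you can create embeddings for your documents, you need to break them down into smaller pieces of text, called "chunks". You will use a simple chunking technique that splits the text based on character length. The code below uses LangChain's `RecursiveCharacterTextSplitter`, a recursive counterpart of the [Character Text Splitter](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/character_text_splitter/): by default it tries the separator `"\n\n"` first and falls back to smaller separators (`"\n"`, `" "`, and finally individual characters) only when a piece is still longer than the chunk size.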
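
To make the roles of chunk size and overlap concrete, here is a deliberately simplified sketch in plain Python. It is not LangChain's implementation: it cuts at fixed character positions and ignores separators entirely. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: real splitters also try to break
    at separators such as paragraph or line boundaries.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one. The overlap reduces the chance that a sentence relevant to a question is cut in half at a chunk boundary.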

## Recursive Character Text Splitter

In [None]:

# Load custom documents
loader = PyPDFDirectoryLoader('documents/')
documents = loader.load()
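# Split into chunks of at most 500 characters, with 50 characters of overlap between neighboring chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)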
texts = text_splitter.split_documents(documents)
print(f"Number of document chunks: {len(texts)}")


Now you can connect to your SAP HANA Cloud Vector Engine and store the embeddings for your text chunks.

In [None]:
import importlib
variables = importlib.reload(variables)

# DO NOT CHANGE THIS LINE BELOW - It is only to check whether you changed the table name in variables.py
assert variables.EMBEDDING_TABLE != 'EMBEDDINGS_CODEJAM_ADD_YOUR_NAME_HERE', 'EMBEDDING_TABLE name not changed in variables.py'

# Embedding model client used to create embeddings for the document chunks
embeddings = OpenAIEmbeddings(deployment_id=variables.EMBEDDING_DEPLOYMENT_ID)

db = HanaDB(
    embedding=embeddings, connection=connection, table_name=variables.EMBEDDING_TABLE
)

# Delete already existing documents from the table
db.delete(filter={})

# add the loaded document chunks
db.add_documents(texts)
print(f"Table {db.table_name} created in the SAP HANA database.")
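
To illustrate why storing text as vectors is useful for retrieval, the following toy sketch ranks a few made-up three-dimensional vectors by cosine similarity to a query vector. The vectors, the chunk names, and the `cosine_similarity` helper are invented for illustration. Real embeddings have many more dimensions, and the similarity search itself is performed inside the vector store rather than in Python.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for stored chunk embeddings (assumption).
toy_store = {
    "orchestration chunk": [0.9, 0.1, 0.0],
    "pricing chunk": [0.1, 0.9, 0.1],
    "unrelated chunk": [0.0, 0.1, 0.9],
}
# Made-up vector standing in for the embedded user question (assumption).
query_vector = [0.8, 0.2, 0.0]

# Sort chunk names from most to least similar to the query.
ranked = sorted(
    toy_store,
    key=lambda name: cosine_similarity(toy_store[name], query_vector),
    reverse=True,
)
print(ranked)
# ['orchestration chunk', 'pricing chunk', 'unrelated chunk']
```

The chunk whose vector points in nearly the same direction as the query vector ranks first, which is the intuition behind retrieving relevant context for a question.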

## Check the embeddings in SAP HANA Cloud Vector Engine

👉 Print the rows from your embedding table and scroll to the right to see the embeddings.

In [None]:
import json
from IPython.display import display
cursor = connection.cursor()

# Use `db.table_name` instead of `variables.EMBEDDING_TABLE` because the table name is sanitized by removing unaccepted characters
cursor.execute(f'''SELECT "VEC_TEXT", "VEC_META", TO_NVARCHAR("VEC_VECTOR") FROM "{db.table_name}"''')
record_columns = cursor.fetchone()
if record_columns:
    # VEC_META holds the chunk metadata as a JSON string, so parse it with json.loads instead of eval
    display({"VEC_TEXT" : record_columns[0], "VEC_META" : json.loads(record_columns[1]), "VEC_VECTOR" : record_columns[2]})
cursor.close()
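
If you want to inspect a stored vector as numbers instead of scrolling through a long string, a small sketch like the following may help. It assumes that the text returned by `TO_NVARCHAR("VEC_VECTOR")` is a bracketed, comma-separated list such as `[0.012,-0.034,0.056]`; verify this against your own output. The sample string and the `parse_vector_text` helper are hypothetical.

```python
import json


def parse_vector_text(vector_text: str) -> list[float]:
    """Hypothetical helper: turn '[0.1,0.2,...]' into a list of floats."""
    values = json.loads(vector_text)
    return [float(value) for value in values]


# Shortened, made-up stand-in for record_columns[2] (assumption).
sample_vector_text = "[0.012,-0.034,0.056]"

vector = parse_vector_text(sample_vector_text)
print(f"Dimensions: {len(vector)}, first values: {vector[:3]}")
```

With a real row, the number of dimensions should match the output size of the embedding model you deployed.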

[Next exercise](06-RAG.ipynb)