## Notebook for use case digital posting assistant - stage1 - Proof of Concept
### Module A vectorize accounting assignment guide
#### Objectives
- In this module we develop the loading and vectorization of the PDF file that contains the accounting assignment guide
- The vectorized accounting assignment guide is finally stored in a SAP HANA vector database

#### Processing steps from concept
A0 - preparation

A2 - load and split: load the PDF file containing the accounting assignment guide data from a folder and split the data into text_chunks

A3 - vectorize and embed: use an embedding function to convert the text chunks into vector representations

A4 - store: create/clear a SAP HANA database table and store the vectors in this table
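The flow of steps A2 to A4 can be sketched with plain Python stand-ins. Everything in this sketch is hypothetical and only illustrates the order of the steps: `split_into_chunks` uses fixed-size slices instead of the semantic chunker, `toy_embedding` counts letters instead of calling an embedding model, and `InMemoryVectorTable` replaces the SAP HANA table.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryVectorTable:
    """Hypothetical stand-in for the SAP HANA vector table (step A4)."""
    rows: list = field(default_factory=list)

    def clear(self):
        # Corresponds to deleting existing documents before a reload
        self.rows.clear()

    def add(self, text, vector):
        self.rows.append((text, vector))


def split_into_chunks(text, chunk_size=40):
    """A2 stand-in: cut the text into fixed-size pieces."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def toy_embedding(chunk):
    """A3 stand-in: a 26-dimensional vector of letter counts (a-z)."""
    counts = [0.0] * 26
    for ch in chunk.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def run_pipeline(text, table):
    """Run A2 (split), A3 (embed) and A4 (clear and store) in order."""
    chunks = split_into_chunks(text)
    vectors = [toy_embedding(chunk) for chunk in chunks]
    table.clear()
    for chunk, vector in zip(chunks, vectors):
        table.add(chunk, vector)
    return len(chunks)
```

Calling `run_pipeline(some_text, InMemoryVectorTable())` returns the number of stored chunks. In the notebook itself the three stand-ins are replaced by the PDF loader, the embedding model from SAP AI Core and the HANA vector store.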

### A0 - Setup and configuration Module A

The following setup steps were processed:

* A0.0 Start SAP instances
* A0.1 install py-packages
* A0.2 load env-variables from config.json-file
* A0.3 Setup and test connection to HANA DB
* A0.4 Setup LLM-Connection to SAP AI-HUB
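
The setup steps depend on settings that are read from the environment after config.json has been loaded. A minimal sketch for checking them up front is shown here. The key names are the ones the cells of this notebook read with `os.getenv`; the helper `find_missing_settings` is a hypothetical name and not part of any SAP library.

```python
import os

# Keys read with os.getenv in the setup cells of this notebook
REQUIRED_SETTINGS = (
    "AICORE_DEPLOYMENT_MODEL",
    "AICORE_DEPLOYMENT_MODEL_EMBEDDING",
    "hdb_host_address",
    "hdb_user",
    "hdb_password",
    "hdb_port",
    "hdb_table_name",
)


def find_missing_settings(environ=None, required=REQUIRED_SETTINGS):
    """Return the names of required settings that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]
```

Running `find_missing_settings()` right after step A0.2 shows at a glance which entries still have to be added to the configuration file.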


### A0.0 Start SAP Instances

* Start BTP Cockpit
* Start SAP Build Dev Space
* Start HANA DB

In [None]:
# A0.1 install py-packages
# RESTART THE KERNEL AFTER INSTALLATION

%pip install --upgrade pip

%pip install --quiet hdbcli --break-system-packages
%pip install --quiet "generative-ai-hub-sdk[all]" --break-system-packages
%pip install --quiet folium --break-system-packages
%pip install --quiet ipywidgets --break-system-packages
%pip install --quiet pypdf
%pip install --quiet -U ipykernel
%pip install --quiet hana-ml
%pip install --quiet sqlalchemy-hana
%pip install --quiet nltk
%pip install --quiet langchain langchain_experimental langchain_openai
print("py-packages installed!")

In [None]:
# A0.2 load env-variables from config.json-file
# This script loads environment variables from a JSON configuration file
# and sets them in the current environment. It raises an error if the file does not exist
# or if the JSON file is malformed.

import json
import os


def load_env_variables(config_file):
    """
    Load environment variables from a JSON configuration file.

    Args:
        config_file (str): Path to the JSON configuration file.

    Returns:
        dict: A dictionary containing the environment variables.
    """
    if not os.path.exists(config_file):
        raise FileNotFoundError(f"The configuration file {config_file} does not exist.")
    
    try:
        with open(config_file, 'r', encoding='utf-8') as file:
            env_variables = json.load(file)
    except json.JSONDecodeError as e:
        # Chain the original exception so the JSON error position is preserved
        raise ValueError(f"Error decoding JSON from the configuration file {config_file}: {e}") from e
    
    for key, value in env_variables.items():
        # Convert non-string values to strings before setting them in os.environ
        if isinstance(value, dict):
            value = json.dumps(value)  # Convert dictionaries to JSON strings
        os.environ[key] = str(value)
    
    return env_variables

# Example usage
config_file = "/home/user/.aicore/config.json"
try:
    env_variables = load_env_variables(config_file)
    # Print only the keys: the values contain secrets such as passwords
    print(f"Loaded environment variables: {sorted(env_variables)}")
except (FileNotFoundError, ValueError) as e:
    print(e)

In [None]:
# A0.2 Test connection with env-variables to SAP AI Core

from gen_ai_hub.proxy.native.openai import embeddings
import os

# Name of the embedding model; falls back to a default if the variable is not set
model_embedding_name = os.getenv("AICORE_DEPLOYMENT_MODEL_EMBEDDING", "text-embedding-ada-002")
try:
    response = embeddings.create(
        input="SAP Generative AI Hub is awesome!",
        model_name=model_embedding_name,
        # deployment_id=deployment_id,  # Alternative: address the deployment by its ID
    )
    print(response.data)
except ValueError as e:
    print(f"Error: {e}")
    print("Ensure the model name matches an existing deployment in SAP AI Hub.")


In [None]:
# A0.3 Setup and test connection to HANA DB

import os
# from hana_ml import ConnectionContext
from hdbcli import dbapi

# Fetch environment variables
hdb_host_address = os.getenv("hdb_host_address")
hdb_user = os.getenv("hdb_user")
hdb_password = os.getenv("hdb_password")
hdb_port = os.getenv("hdb_port")

# Debugging: Print non-sensitive environment variables
print(f"hdb_host_address: {hdb_host_address}")
print(f"hdb_user: {hdb_user}")
print(f"hdb_port: {hdb_port}")

# Ensure variables are defined
if not all([hdb_host_address, hdb_user, hdb_password, hdb_port]):
    raise ValueError("One or more HANA DB connection parameters are missing.")

# Convert port to integer
hdb_port = int(hdb_port)

# Create a connection to the HANA database
# hana_connection = ConnectionContext(
#     address=hdb_host_address,
#     port=hdb_port,
#     user=hdb_user,
#     password=hdb_password,
#     encrypt=True
# )

# Test the connection
# print("HANA DB Version:", hana_connection.hana_version())
# print("Current Schema:", hana_connection.get_current_schema())

hana_connection = dbapi.connect(
    address=hdb_host_address,
    port=hdb_port,
    user=hdb_user,
    password=hdb_password,
    #encrypt=True
    autocommit=True,
    sslValidateCertificate=False,
)

# Test the connection
print("Connected to HANA DB:", hana_connection.isconnected())

In [None]:
#A0.4 Setup LLM-Connection to SAP AI-HUB

import os
from gen_ai_hub.proxy.langchain.openai import ChatOpenAI
from gen_ai_hub.proxy.langchain.openai import OpenAI

# Load aicore_model_name from the environment configuration
# (no str() wrapper: str(None) would yield the truthy string "None")
aicore_model_name = os.getenv("AICORE_DEPLOYMENT_MODEL")

# Check that the variable is defined
if not aicore_model_name:
    raise ValueError("Parameter LLM-Model-Name (AICORE_DEPLOYMENT_MODEL) is missing in the environment configuration.")

llm = ChatOpenAI(proxy_model_name=aicore_model_name)
#llm = OpenAI(proxy_model_name=aicore_model_name)

print(f"Parameter LLM-Model-Name: {aicore_model_name} was loaded successfully.")


In [None]:
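#A0.5 Setup embedding-model from AI Hub

import os
from gen_ai_hub.proxy.langchain.init_models import init_embedding_model

# os.getenv returns None if the variable is missing, so fail early with a clear message
ai_core_embedding_model_name = os.getenv('AICORE_DEPLOYMENT_MODEL_EMBEDDING')
if not ai_core_embedding_model_name:
    raise ValueError("AICORE_DEPLOYMENT_MODEL_EMBEDDING is missing in the environment configuration.")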
 
try:
    embeddings = init_embedding_model(ai_core_embedding_model_name)
    print("Embedding model initialized successfully.")
except Exception as e:
    print("Embedding model not initialized.")
    print(e)


In [None]:
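#A0.6 Setup vector store in SAP HANA Database
# Hint: check table creation with SAP HANA database explorer: select * from ACCOUNTING_ASSIGN_SUPPORT_TABLE_DBADMIN

import os
from langchain_community.vectorstores.hanavector import HanaDB

# os.getenv returns None if the variable is missing, so fail early with a clear message
vector_table_name = os.getenv('hdb_table_name')
if not vector_table_name:
    raise ValueError("hdb_table_name is missing in the environment configuration.")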

hana_database = HanaDB(
    embedding = embeddings, 
    connection = hana_connection, 
    table_name = vector_table_name
)

try:
    print(f"""
    Successfully created SAP HANA VectorStore interface: {hana_database.connection}
    and SAP HANA table: {vector_table_name}.
    """)
except Exception as e:
    print(e)


### Processing functions Module A

- function A2: load the PDF file with accounting assignment guide data and split the data into text_chunks

- function A3: vectorize the split data with an embedding function

- function A4.1: create a LangChain VectorStore interface for the HANA database and specify the table

- function A4.2: delete existing documents from the table and load embeddings to the SAP HANA table
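
The idea behind the semantic chunking used in function A2 can be sketched in a few lines: neighboring sentences stay in the same chunk as long as their embeddings are close, and a new chunk starts where the distance jumps. The sketch below is a simplified illustration, not the LangChain implementation. `keyword_embedding` is a hypothetical toy embedding and the fixed `threshold` is an assumption; the SemanticChunker used in this notebook derives its threshold from the distribution of distances (for example by percentile or gradient).

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; zero vectors count as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def semantic_chunks(sentences, embed, threshold=0.5):
    """Group consecutive sentences; start a new chunk where the distance exceeds threshold."""
    if not sentences:
        return []
    vectors = [embed(sentence) for sentence in sentences]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_distance(vectors[i - 1], vectors[i]) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks


def keyword_embedding(sentence):
    """Hypothetical toy embedding: counts of two topic keywords."""
    text = sentence.lower()
    return [float(text.count("provision")), float(text.count("invoice"))]
```

With four sentences, two about provisions followed by two about invoices, `semantic_chunks(sentences, keyword_embedding)` yields two chunks, one per topic.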



In [None]:
# function A2: load the PDF file containing the accounting assignment guide data from a folder and split the data into text_chunks
# function A2.1 load data

import os
from langchain_community.document_loaders import PyPDFLoader

def load_pdf(file_path):
    """Load a PDF file and return one LangChain Document per page."""
    # Check if the file exists
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"The file {file_path} does not exist.")
    
    # Load the PDF file
    loader = PyPDFLoader(file_path)
    return loader.load()

if __name__ == "__main__":
    
    # Test the function with a sample PDF file path
    file_path = "data/sample_accounting_guide.pdf"
    try:
        documents = load_pdf(file_path)
        print(f"Number of pages loaded: {len(documents)}")
        # Index 0 is the first page; guard against an empty result
        if documents:
            print(f"First page: {documents[0]}")
    except FileNotFoundError as e:
        print(e)


In [None]:
# function A2: load the PDF file containing the accounting assignment guide data from a folder and split the data into text_chunks
# function A2.2.3 split document into chunks - version 3: semantic chunking with LangChain
# (see LangChain docs - https://python.langchain.com/docs/how_to/semantic-chunker/)
# note: split_text expects plain strings, while the loaded pages are LangChain Document objects

import os
from langchain_experimental.text_splitter import SemanticChunker
from gen_ai_hub.proxy.langchain.init_models import init_embedding_model
from langchain_core.documents import Document

# Input: documents (from function A2.1 - load data)
# Output: text_chunks

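# parameters (currently unused: SemanticChunker derives chunk boundaries from embedding distances)
chunk_size_param_min = 500
chunk_size_param_max = 1000
chunk_size_param = 1000
chunk_overlap_param = 200   
chunk_overlap_param_min = 200
chunk_overlap_param_max = 400


# load env-key for the embedding model in SAP AI Core
ai_core_embedding_model_name = os.getenv('AICORE_DEPLOYMENT_MODEL_EMBEDDING')
if not ai_core_embedding_model_name:
    raise ValueError("AICORE_DEPLOYMENT_MODEL_EMBEDDING is missing in the environment configuration.")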

# init embedding-instance
try:
    embeddings = init_embedding_model(ai_core_embedding_model_name)
    print("Embedding model initialized successfully.")
except Exception as e:
    print("Embedding model not initialized.")
    print(e)


# Init text_splitter instance with a breakpoint threshold type
# Valid types: "percentile" (default), "standard_deviation", "interquartile", "gradient"

# text_splitter = SemanticChunker(embeddings=embeddings)
text_splitter = SemanticChunker(embeddings=embeddings, breakpoint_threshold_type="gradient")
# text_splitter = SemanticChunker(embeddings=embeddings, breakpoint_threshold_type="percentile", breakpoint_threshold_amount=95)
# text_splitter = SemanticChunker(embeddings=embeddings, breakpoint_threshold_type="interquartile")
# text_splitter = SemanticChunker(embeddings=embeddings, breakpoint_threshold_type="standard_deviation")
# text_splitter = SemanticChunker(embeddings=embeddings, breakpoint_threshold_type="percentile", min_chunk_size=100)  # min_chunk_size needs a recent langchain_experimental version



# Split every page in documents with the SemanticChunker and rebuild LangChain Document objects.
# split_text works on a plain string, so each page's text is split separately
# and every chunk is wrapped in a Document that keeps the page metadata.

text_chunks = []
for doc in documents:
    text_split = text_splitter.split_text(doc.page_content)
    # rebuild the chunks as LangChain Document objects
    for text in text_split:
        text_chunks.append(Document(page_content=text, metadata=doc.metadata))


print(f"\n📄 {len(text_chunks)} chunks generated in total.")
# Show a sample (chunks 6 to 10) labeled with their actual position in the list
for i, chunk in enumerate(text_chunks[5:10], start=6):
    print(f"\n--- Chunk {i} ---\n{chunk}")

In [None]:
# function A4 - store: create/clear a sap hana database - table and store the vector in this table
# function A4.2.1 - delete existing documents from the table and load embeddings to SAP HANA-Table

# Delete already existing documents from the SAP HANA table
hana_database.delete(filter={})

# add the loaded document text_chunks
hana_database.add_documents(text_chunks)

print(f"Successfully added {len(text_chunks)} document chunks to the database.")
print("table-name: ",hana_database.table_name)

In [None]:
# function A4 - store: create/clear a sap hana database - table and store the vector in this table
# check function A4.2 - query to the table to verify embeddings
# SQL: SELECT TOP 1000
#      "VEC_TEXT",
#      "VEC_META",
#      "VEC_VECTOR"
#      FROM "DBADMIN"."ACCOUNTING_ASSIGN_SUPPORT_TABLE_DBADMIN" 
#      WHERE VEC_TEXT LIKE '%Rückstellung%'

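cursor = hana_connection.cursor()
# Bind the search pattern as a parameter instead of embedding it in the SQL string
sql = f'SELECT VEC_TEXT, TO_NVARCHAR(VEC_VECTOR) FROM "{hana_database.table_name}" WHERE VEC_TEXT LIKE ?'
# sql = f'SELECT TOP 1000 VEC_TEXT, TO_NVARCHAR(VEC_VECTOR) FROM "{hana_database.table_name}"'  # alternative without filter: execute without parameters

cursor.execute(sql, ("%Rückstellung%",))
vectors = cursor.fetchall()
cursor.close()

# Print the row count and the first rows; a fixed slice such as [5:10] is empty for small result sets
print(f"{len(vectors)} matching rows found.")
print(vectors[:5])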

# for vector in vectors:
#     print(vector)