### Generating embeddings with AI Core LLMs via the Python SDK and storing them in SAP HANA DB

In this guide, we walk step by step through using the proxy LLMs deployed in AI Core to generate embeddings for selected unstructured data. We use the llm-commons Python SDK to generate the embeddings.

The data is stored in the SAP HANA database. The generated embeddings are then converted into a new data type, 'REAL_VECTOR', by leveraging the SAP HANA Vector Engine.

Note: To use the llm-commons REST APIs instead, refer to:
 * Generate-and-store-embeddings_with-HanaDB-AICore-RestAPI.ipynb

#### Prerequisites:

Use the secrets folder to store your service key credentials. Credentials are required for:
* Access to the SAP GenAI XL
* Access to the HanaDB
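
A small helper can catch a missing or incomplete service key before any connection is attempted. The sketch below is illustrative only: `load_service_key` and `HANA_KEY_FIELDS` are hypothetical names, not part of hana_ml or llm-commons. The field names are the ones this guide reads from the HANA service key when it opens the connection.

```python
import json
from pathlib import Path

# Field names read from the HANA service key in the connection step of this guide.
HANA_KEY_FIELDS = ("host", "port", "user", "password", "schema")


def load_service_key(path, required_fields=()):
    """Load a service key JSON file and check that the required fields exist.

    Hypothetical helper, not part of hana_ml or llm-commons.
    """
    with Path(path).open("r", encoding="utf-8") as f:
        key = json.load(f)
    # Collect every missing field so the error message lists them all at once.
    missing = [name for name in required_fields if name not in key]
    if missing:
        raise KeyError(f"{path}: missing fields {missing}")
    return key
```

For example, `load_service_key('data/secrets/ies-hana-vectordb-schema-poc-sk.json', HANA_KEY_FIELDS)` would return the same dictionary as a plain `json.load`, but fail early if a field is absent.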


#### Step-by-step guide:
* Load your data from a CSV file
* Connect to the HANA database
* Create a new HANA table and push data into it
* Connect to the AI Core proxy LLMs through the llm-commons Python SDK
* Generate embeddings for the desired column using the text-embedding-ada-002 model (here via the Python SDK)
* Add a new column of data type REAL_VECTOR to your data table (we call this column VECTOR_RE)
* Use the TO_REAL_VECTOR function to convert the embeddings to real vectors (VECTOR_RE) and update the data table. This is necessary for HANA DB to interpret the embeddings.
* Decode the REAL_VECTOR data type back to embeddings: VECTOR_STR


#### 1. Loading your data

In [1]:
import pandas as pd

# get a csv file into pandas df
df = pd.read_csv('./data/INIS_NEWS_APPLICATION.csv', sep=';', quotechar='"', low_memory=False)
df.head(3)


Unnamed: 0,Key,Date,No,TopicNo,TopicID,TopicName,Domain,arXivID,Base,Link,SenderHTML,SenderName,Title,Abstract
0,2023.11.01.13.20.01,01.11.2023,1,1,Foundation,"Large Language and Foundation Models, Multi Mo...",arxiv,2310.18752,http://export.arxiv.org/rss/cs,http://arxiv.org/pdf/2310.18752.pdf,"<a href=""http://arxiv.org/find/cs/1/au:+Sui_G...","Guanghu Sui,Zhishuai Li,Ziyue Li,Sun Yang,Jing...",Reboost Large Language Model-based Text-to-SQL...,The previous state-of-the-art (SOTA) method ac...
1,2023.11.01.13.20.02,01.11.2023,2,1,Foundation,"Large Language and Foundation Models, Multi Mo...",arxiv,2310.20034,http://export.arxiv.org/rss/cs,http://arxiv.org/pdf/2310.20034.pdf,"<a href=""http://arxiv.org/find/cs/1/au:+Graul...","Moritz A. Graule,Volkan Isler",GG-LLM: Geometrically Grounding Large Language...,A robot in a human-centric environment needs t...
2,2023.11.01.13.20.03,01.11.2023,3,1,Foundation,"Large Language and Foundation Models, Multi Mo...",arxiv,2310.20329,http://export.arxiv.org/rss/cs,http://arxiv.org/pdf/2310.20329.pdf,"<a href=""http://arxiv.org/find/cs/1/au:+Hu_Q/...","Qisheng Hu,Kaixin Li,Xu Zhao,Yuxi Xie,Tiedong ...",InstructCoder: Empowering Language Models for ...,Code editing encompasses a variety of pragmati...


#### 2. Connection to HANA Database

In [3]:
import json
with open('data/secrets/ies-hana-vectordb-schema-poc-sk.json', 'r') as f:
    hana_service_key = json.load(f)

In [4]:
import hana_ml
from hana_ml import ConnectionContext

# cc = ConnectionContext(userkey='VDB_BETA', encrypt=True)
cc= ConnectionContext(
    address=hana_service_key['host'],
    port=hana_service_key['port'],
    user=hana_service_key['user'],
    password=hana_service_key['password'],
    currentSchema=hana_service_key['schema'],
    encrypt=True
    )
print(cc.hana_version())
print(cc.get_current_schema())

ModuleNotFoundError: No module named 'shapely'


4.00.000.00.1710842063 (CE2024.10)
USR_5SKS2ZNTSKBBRAFPULSZIT6NR


#### 3. Create a new Hana table and push data into it

In [5]:
# Create a Hana table
cursor = cc.connection.cursor()
sql_command = '''CREATE TABLE "NEWS_APPL" (
    "Key" NVARCHAR(5000),
    "Date" NVARCHAR(5000),
    "No" BIGINT,
    "TopicNo" BIGINT,
    "TopicID" NVARCHAR(5000),
    "TopicName" NVARCHAR(5000),
    "Domain" NVARCHAR(5000),
    "arXivID" DOUBLE,
    "Base" NVARCHAR(5000),
    "Link" NVARCHAR(5000),
    "SenderHTML" NVARCHAR(5000),
    "SenderName" NVARCHAR(5000),
    "Title" NVARCHAR(5000),
    "Abstract" NCLOB MEMORY THRESHOLD 0)'''
cursor.execute(sql_command)
cursor.close()

In [6]:
# import dataframe into hana table
from hana_ml.dataframe import create_dataframe_from_pandas

v_hdf = create_dataframe_from_pandas(
    connection_context=cc,
    pandas_df=df,
    table_name="NEWS_APPL",
    allow_bigint=True,
    append=True,
    force=False)

100%|██████████| 1/1 [00:00<00:00,  2.93it/s]


#### 4. Connection with AI Core Proxy LLMs through the llm-commons Python SDK

In [7]:
# Read the service key of the AI Core instance that hosts the deployed models
with open('data/secrets/genai-xl-test-instance.json') as f:
    sk = json.load(f)

llm-commons Python SDK steps:


In [8]:
# proxy configuration
import json
import os
import llm_commons.proxy.base

# specify proxy version
llm_commons.proxy.base.proxy_version = 'aicore'

In [9]:
resource_group = '3f7513e0-0c3e-4dbf-a523-04f78fa295ca'


os.environ['AICORE_LLM_AUTH_URL'] = sk['url']+"/oauth/token"
os.environ['AICORE_LLM_CLIENT_ID'] = sk['clientid']
os.environ['AICORE_LLM_CLIENT_SECRET'] = sk['clientsecret']
os.environ['AICORE_LLM_API_BASE'] = sk["serviceurls"]["AI_API_URL"]+ "/v2"
os.environ['AICORE_LLM_RESOURCE_GROUP'] = resource_group
os.environ['LLM_COMMONS_PROXY'] = 'aicore'

llm_commons.proxy.resource_group = os.environ['AICORE_LLM_RESOURCE_GROUP']
llm_commons.proxy.api_base = os.environ['AICORE_LLM_API_BASE']
llm_commons.proxy.auth_url = os.environ['AICORE_LLM_AUTH_URL']
llm_commons.proxy.client_id = os.environ['AICORE_LLM_CLIENT_ID']
llm_commons.proxy.client_secret = os.environ['AICORE_LLM_CLIENT_SECRET']
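
The mapping from service key fields to environment variables can also be collected in one pure function, which makes it easy to inspect before touching `os.environ`. This is a sketch under assumptions: `build_aicore_env` is a hypothetical helper, the variable names simply mirror the assignments above, and stripping a trailing slash from the URLs is an added safeguard that the original cell does not perform.

```python
def build_aicore_env(service_key, resource_group):
    """Map an AI Core service key to the environment variables used in this guide.

    Hypothetical helper; it only rearranges the assignments shown above.
    """
    # rstrip("/") is an added safeguard against a doubled slash in the URL.
    auth_url = service_key["url"].rstrip("/") + "/oauth/token"
    api_base = service_key["serviceurls"]["AI_API_URL"].rstrip("/") + "/v2"
    return {
        "AICORE_LLM_AUTH_URL": auth_url,
        "AICORE_LLM_CLIENT_ID": service_key["clientid"],
        "AICORE_LLM_CLIENT_SECRET": service_key["clientsecret"],
        "AICORE_LLM_API_BASE": api_base,
        "AICORE_LLM_RESOURCE_GROUP": resource_group,
        "LLM_COMMONS_PROXY": "aicore",
    }
```

With this in place, `os.environ.update(build_aicore_env(sk, resource_group))` would set the same variables as the individual assignments.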

In [10]:
from llm_commons.proxy.identity import AICoreProxyClient

aic_proxy_client = AICoreProxyClient()
aic_proxy_client.get_deployments()

[Deployment(url='https://api.ai.prod.eu-central-1.aws.ml.hana.ondemand.com/v2/inference/deployments/dc008d860d221c90', config_id='bcaa04ee-bb2e-4e7f-b44f-4c374d2a42eb', config_name='gpt-4-ptu-config', deployment_id='dc008d860d221c90', model_name='gpt-4', created_at=datetime.datetime(2024, 3, 26, 9, 19, 5), additonal_parameters={'executable_id': 'azure-openai', 'model_version': '0613PTU'}, custom_prediction_suffix=None),
 Deployment(url='https://api.ai.prod.eu-central-1.aws.ml.hana.ondemand.com/v2/inference/deployments/deac9533e2d3dc51', config_id='b180ebf2-9cb1-4e86-9cc5-6f5929d0c35b', config_name='gpt-35-turbo-config', deployment_id='deac9533e2d3dc51', model_name='gpt-35-turbo', created_at=datetime.datetime(2024, 3, 26, 9, 15, 47), additonal_parameters={'executable_id': 'azure-openai', 'model_version': 'latest'}, custom_prediction_suffix=None),
 Deployment(url='https://api.ai.prod.eu-central-1.aws.ml.hana.ondemand.com/v2/inference/deployments/da6a26f83dc8e241', config_id='b6edfd37-8f3

#### 5. Generate embeddings for the desired column using text-embedding-ada-002 LLM (Python SDK)

In [11]:
# Initialize the embedding model: replace deployment_id with the deployment ID of 'text-embedding-ada-002-v2' in your resource group
from llm_commons.langchain.proxy import init_embedding_model
embeddings = init_embedding_model('text-embedding-ada-002-v2', 
                                 proxy_client=aic_proxy_client, 
                                 deployment_id='da6a26f83dc8e241', 
                                 api_base=llm_commons.proxy.api_base)

In [12]:
# Get some data from the newly created table.
# You could also join multiple text columns in SQL.
# Here we select "Key" and "Abstract" for the first 10 rows.

hdf = cc.sql('''SELECT TOP 10 "Key", "Abstract" FROM NEWS_APPL''')
df_abstract = hdf.collect()

In [13]:
# generate embeddings from the text
rows = []
for index, row in df_abstract.iterrows():
    text = row['Abstract']
    try:
        text_vector = embeddings.embed_query(text)
        print(text_vector)
        # text_vector = '[0, 1]'
        myrow = (str(text_vector), row['Key'])
        rows.append(myrow)
    except Exception as e:
        print(e)

[-0.010810758541346504, 0.01352050504102336, 0.006259231683331845, -0.021099326350984448, -0.02828297722773195, 0.0334484333553862, -0.01567983521427598, -0.0029655491590683002, -0.041662348707241184, -0.049198831159467445, -0.002429245164340589, 0.024641755368791173, -0.009978076023216636, 0.010535549912473073, 0.0074447457769404694, 0.02210136802534412, 0.020647701934371637, -0.0027238595298653516, 0.009413545502450625, -0.016046780052773886, -0.0023657354807544127, -0.007007234623346811, -0.01837546658828844, 0.02253887917893778, -0.013915676405559569, -0.014430810505758555, 0.016413724891271794, -0.034012963876152214, -0.021014646772869544, 0.01493183134293839, 0.005161925483592909, -0.006213363578519607, -0.028170071123578746, -0.004646791383393923, -0.025248625678614633, 0.006922555045231909, -0.010098038758879414, 0.0009482348008914966, 0.022256613918554772, -0.018050863401493228, 0.013576958093099962, 0.0153552292335129, 0.005828777161247761, -0.02646236629826156, 0.00684493209
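
The loop above sends one request per abstract. LangChain-style embedding objects usually also offer an `embed_documents` method that accepts a list of texts; assuming the object returned by `init_embedding_model` provides it (this guide does not confirm that), the rows could be built in batches. The helper name `embed_in_batches` and the default batch size are hypothetical.

```python
def embed_in_batches(embedder, keys, texts, batch_size=16):
    """Return (vector_string, key) rows, embedding `texts` in batches.

    `embedder` is any object with an `embed_documents(list_of_str)` method
    returning one list of floats per input text (an assumption here).
    """
    if len(keys) != len(texts):
        raise ValueError("keys and texts must have the same length")
    rows = []
    for start in range(0, len(texts), batch_size):
        batch_texts = texts[start:start + batch_size]
        batch_keys = keys[start:start + batch_size]
        vectors = embedder.embed_documents(batch_texts)
        # Same row layout as the loop above: (stringified vector, key).
        for key, vector in zip(batch_keys, vectors):
            rows.append((str(vector), key))
    return rows
```

A call such as `embed_in_batches(embeddings, df_abstract['Key'].tolist(), df_abstract['Abstract'].tolist())` would then produce rows in the same `(vector string, key)` layout used for the later update. Unlike the loop, this sketch does not catch exceptions, so a failing batch stops the run.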

#### 6. Add a new column of data type REAL_VECTOR to your data table (we call this column VECTOR_RE)

In [14]:
# add a vector column to your table with datatype REAL_VECTOR
cursor = cc.connection.cursor()
sql_command = '''ALTER TABLE NEWS_APPL ADD (VECTOR_RE REAL_VECTOR(1536))'''
cursor.execute(sql_command)
cursor.close()
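
The column is declared with 1536 dimensions, which matches the output length of text-embedding-ada-002. Before uploading, it can be worth checking each vector against that length, since a vector of a different size would not fit the column. The sketch below is one way to do this; `to_vector_literal` and `EXPECTED_DIM` are hypothetical names, and the compact `[v1,v2,...]` form it produces is assumed to be accepted in the same way as the `str(list)` form used in this guide.

```python
import math

EXPECTED_DIM = 1536  # matches REAL_VECTOR(1536) in the ALTER TABLE statement


def to_vector_literal(vector, expected_dim=EXPECTED_DIM):
    """Serialize a list of floats as '[v1,v2,...]' after basic checks.

    Hypothetical helper, not part of hana_ml.
    """
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} values, got {len(vector)}")
    if not all(math.isfinite(v) for v in vector):
        raise ValueError("vector contains NaN or infinite values")
    return "[" + ",".join(repr(float(v)) for v in vector) + "]"
```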

#### 7. Use the TO_REAL_VECTOR function to convert the embeddings to real vectors (VECTOR_RE) and update the data table. This is necessary for HANA DB to interpret the embeddings.

In [15]:
# bulk update: run all updates in one transaction
cc.connection.setautocommit(False)
cursor = cc.connection.cursor()
sql = 'UPDATE NEWS_APPL SET VECTOR_RE = TO_REAL_VECTOR(?) WHERE "Key" = ?'
try:
    print(sql)
    print(rows[0])
    cursor.executemany(sql, rows)
    # commit only if every update succeeded
    cc.connection.commit()
except Exception as e:
    # undo any partial updates
    cc.connection.rollback()
    print("An error occurred:", e)
finally:
    # always release the cursor and restore autocommit
    cursor.close()
    cc.connection.setautocommit(True)

UPDATE NEWS_APPL SET VECTOR_RE = TO_REAL_VECTOR(?) WHERE "Key" = ?
('[-0.010810758541346504, 0.01352050504102336, 0.006259231683331845, -0.021099326350984448, -0.02828297722773195, 0.0334484333553862, -0.01567983521427598, -0.0029655491590683002, -0.041662348707241184, -0.049198831159467445, -0.002429245164340589, 0.024641755368791173, -0.009978076023216636, 0.010535549912473073, 0.0074447457769404694, 0.02210136802534412, 0.020647701934371637, -0.0027238595298653516, 0.009413545502450625, -0.016046780052773886, -0.0023657354807544127, -0.007007234623346811, -0.01837546658828844, 0.02253887917893778, -0.013915676405559569, -0.014430810505758555, 0.016413724891271794, -0.034012963876152214, -0.021014646772869544, 0.01493183134293839, 0.005161925483592909, -0.006213363578519607, -0.028170071123578746, -0.004646791383393923, -0.025248625678614633, 0.006922555045231909, -0.010098038758879414, 0.0009482348008914966, 0.022256613918554772, -0.018050863401493228, 0.013576958093099962, 0.015355
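
If several bulk updates are needed, the commit, rollback, and cleanup handling can be wrapped in a context manager. The sketch below assumes only that the connection object offers `setautocommit`, `cursor`, `commit`, and `rollback`, as `cc.connection` is used in this guide; `manual_transaction` is a hypothetical name. Unlike the cell above, it re-raises the error after rolling back instead of printing it.

```python
from contextlib import contextmanager


@contextmanager
def manual_transaction(connection):
    """Yield a cursor; commit on success, roll back on error, always clean up."""
    connection.setautocommit(False)
    cursor = connection.cursor()
    try:
        yield cursor
        connection.commit()
    except Exception:
        connection.rollback()
        raise
    finally:
        cursor.close()
        connection.setautocommit(True)
```

The update from this step would then read `with manual_transaction(cc.connection) as cursor: cursor.executemany(sql, rows)`.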

#### 8. Decode the REAL_VECTOR data type back to embeddings: VECTOR_STR

In [16]:
# The TO_NVARCHAR function is used here to decode REAL_VECTOR values back into vector embeddings

hdf = cc.sql('''SELECT TOP 10 "Key", "Abstract", TO_NVARCHAR(VECTOR_RE) AS VECTOR_STR FROM NEWS_APPL WHERE VECTOR_RE IS NOT NULL''')
df_abstract = hdf.collect()
df_abstract

Unnamed: 0,Key,Abstract,VECTOR_STR
0,2023.11.01.13.20.01,The previous state-of-the-art (SOTA) method ac...,"[-0.010810759,0.013520505,0.006259232,-0.02109..."
1,2023.11.01.13.20.02,A robot in a human-centric environment needs t...,"[-0.012119322,-0.009318922,-0.0070246183,-0.01..."
2,2023.11.01.13.20.03,Code editing encompasses a variety of pragmati...,"[-0.0073093185,0.017813439,0.0047368812,-0.028..."
3,2023.11.01.13.20.04,Transformer neural networks show promising cap...,"[-0.012066817,0.017473692,-0.0036241987,-0.036..."
4,2023.11.01.13.20.05,The critical point I am making in this article...,"[-0.0068290434,-0.027423717,-0.00043647736,-0...."
5,2023.11.01.13.20.06,The future of intelligent manufacturing machin...,"[-0.024562825,0.0011631981,-0.020067576,-0.007..."
6,2023.11.01.13.20.07,This paper presents a deep learning based appr...,"[-0.003260679,0.016664552,0.016116826,-0.03195..."
7,2023.11.01.13.20.08,"In recent years, the rapid growth of online mu...","[-0.029618578,-0.0132206315,0.012463803,-0.063..."
8,2023.11.01.13.20.09,"Under the ""dual carbon"" target in China, virtu...","[-0.007429437,-0.015719479,0.013346093,-0.0039..."
9,2023.11.01.13.20.10,Load shapes derived from smart meter data are ...,"[0.009132967,0.012615126,0.006297301,-0.036449..."
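
The VECTOR_STR values come back as text and show fewer digits than the embeddings returned by the model (for example `-0.010810759` versus `-0.010810758541346504`), which is consistent with single-precision storage. The sketch below parses such a string into Python floats and checks it against the original embedding within a tolerance. Both function names and the tolerance values are assumptions chosen for illustration.

```python
import json
import math


def parse_vector_str(vector_str):
    """Parse a '[v1,v2,...]' string into a list of Python floats."""
    values = json.loads(vector_str)
    if not isinstance(values, list):
        raise ValueError("expected an array of numbers")
    return [float(v) for v in values]


def vectors_match(original, decoded, rel_tol=1e-6, abs_tol=1e-9):
    """Compare element-wise, allowing for single-precision rounding."""
    if len(original) != len(decoded):
        return False
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(original, decoded)
    )
```

For instance, `parse_vector_str(df_abstract['VECTOR_STR'][0])` would give a list that can be compared with the embedding generated earlier for the same key.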
