## Semantic chunker
Now repeat the previous steps using the `Semantic Chunker`. You will store the embeddings in a different table by adding `+"_SEMANTIC"` to the table name. LangChain's implementation of the `Semantic Chunker` is based on [Greg Kamradt's work](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb), where you can also find more information on the `Semantic Chunker` and other chunking techniques.
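
To give a feel for what semantic chunking does, here is a minimal, simplified sketch of the general idea: embed each sentence, measure the distance between neighbouring sentences, and start a new chunk wherever that distance exceeds a percentile threshold. This is not LangChain's code. The `toy_embed` function is a hypothetical bag-of-words stand-in for a real embedding model, and all function names and the threshold of 95 are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def percentile_value(values, percentile):
    """Percentile with linear interpolation between the two closest ranks."""
    ranked = sorted(values)
    position = (len(ranked) - 1) * percentile / 100
    lower, upper = math.floor(position), math.ceil(position)
    return ranked[lower] + (ranked[upper] - ranked[lower]) * (position - lower)


def semantic_chunks(text, percentile=95):
    """Split text into chunks at unusually large jumps between adjacent sentences."""
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    if len(sentences) < 2:
        return sentences
    vectors = [toy_embed(s) for s in sentences]
    # distances[i] is the distance between sentence i and sentence i + 1
    distances = [cosine_distance(vectors[i], vectors[i + 1])
                 for i in range(len(vectors) - 1)]
    threshold = percentile_value(distances, percentile)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # Large jump in meaning: close the current chunk and start a new one
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


print(semantic_chunks("Cats purr. Cats sleep. Stocks fell. Stocks rose."))
```

In this toy input the only large jump is between the sentences about cats and the sentences about stocks, so the text is split into two chunks at that point. A real embedding model produces far less clear-cut distances, which is why the breakpoint is derived from the distribution of distances rather than from a fixed value.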

👉 For the semantic chunker you need to install an additional package: [langchain-experimental](https://pypi.org/project/langchain-experimental/). Run the following command in your terminal, and make sure to run it inside the virtual environment.

```sh
pip install --require-virtualenv langchain-experimental
```

In [1]:
import os
import json

with open('/home/user/projects/generative-ai-codejam/.aicore-config.json', 'r') as config_file:
    config_data = json.load(config_file)

os.environ["AICORE_AUTH_URL"]=config_data["url"]+"/oauth/token"
os.environ["AICORE_CLIENT_ID"]=config_data["clientid"]
os.environ["AICORE_CLIENT_SECRET"]=config_data["clientsecret"]
os.environ["AICORE_BASE_URL"]=config_data["serviceurls"]["AI_API_URL"]

# Change the value of the resource group to yours
os.environ["AICORE_RESOURCE_GROUP"]="team-vitaliy"

from langchain_community.document_loaders import PyPDFDirectoryLoader
from gen_ai_hub.proxy.langchain.openai import OpenAIEmbeddings
from langchain_experimental.text_splitter import SemanticChunker

from langchain_community.vectorstores.hanavector import HanaDB

from hdbcli import dbapi
import configparser

import variables

In [2]:
# Connect to HANA
config = configparser.ConfigParser()
config.read('/home/user/projects/generative-ai-codejam/.user.ini')
connection = dbapi.connect(
    address=config.get('hana', 'url'), 
    port=config.get('hana', 'port'), 
    user=config.get('hana', 'user'),
    password=config.get('hana', 'passwd'),
    autocommit=True,
    sslValidateCertificate=False
)
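
The connection cell reads a `[hana]` section with the keys `url`, `port`, `user` and `passwd` from `.user.ini`. As a sketch of the layout this implies, the following example parses an in-memory INI string. All values are hypothetical placeholders, and `read_hana_settings` is an illustrative helper, not part of the exercise code.

```python
import configparser

# Hypothetical placeholder values; only the section and key names
# are taken from the connection cell.
SAMPLE_INI = """
[hana]
url = example-host.invalid
port = 443
user = EXAMPLE_USER
passwd = example-password
"""


def read_hana_settings(ini_text):
    """Return the connection settings as a dict keyed by the dbapi.connect argument names."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return {
        "address": parser.get("hana", "url"),
        "port": parser.getint("hana", "port"),  # getint converts the value to an integer
        "user": parser.get("hana", "user"),
        "password": parser.get("hana", "passwd"),
    }


settings = read_hana_settings(SAMPLE_INI)
print(settings["address"], settings["port"])
```

If a key is missing from the file, `configparser` raises `NoOptionError`, which is usually the first thing to check when the connection cell fails before reaching the database.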

In [3]:
# Load custom documents
loader = PyPDFDirectoryLoader('documents/')
documents = loader.load()

In [4]:
# embedding instance is used during semantic chunking
embeddings = OpenAIEmbeddings(deployment_id=variables.EMBEDDING_DEPLOYMENT_ID)

In [5]:
# create semantic text chunks
text_splitter = SemanticChunker(embeddings)
texts = text_splitter.split_documents(documents)
print(f"Number of document chunks: {len(texts)}")

Number of document chunks: 238
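
Because semantic chunks are split by meaning rather than by a fixed size, their lengths can vary considerably. A small helper like the following sketch can summarize the chunk lengths so you can compare them with the chunks from the previous exercise. `chunk_length_stats` is a hypothetical helper that works on plain strings; it is not part of LangChain.

```python
import statistics


def chunk_length_stats(chunks):
    """Summarize the character lengths of a list of text chunks."""
    lengths = [len(chunk) for chunk in chunks]
    if not lengths:
        return {"count": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }


print(chunk_length_stats(["abcd", "ab", "abcdef"]))
```

In the notebook you could pass the text of each chunk, for example `chunk_length_stats([doc.page_content for doc in texts])`.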


In [6]:

db = HanaDB(
    embedding=embeddings, connection=connection, table_name=variables.SEMANTIC_EMBEDDING_TABLE
)

# Delete already existing documents from the table
db.delete(filter={})

# add the loaded document chunks
db.add_documents(texts)

[]

## Check the embeddings in SAP HANA Cloud Vector Engine

👉 Check the chunks that were created with the semantic chunker and compare them to the chunks you created in exercise 6.



In [7]:
cursor = connection.cursor()
# Use a separate name for the result so that `embeddings` keeps pointing to the OpenAIEmbeddings instance
result = cursor.execute(f'SELECT VEC_TEXT, VEC_META, TO_NVARCHAR(VEC_VECTOR) FROM "{db.table_name}"')
print(result)
for row in cursor:
    print(row)
cursor.close()

True
('PUBLIC\n2024-08-06\nData Attribute Recommendation© 2024 SAP SE or an SAP affiliate  company. All rights reserved.', '{"source": "documents/SAP-Help-Data-Attribute-Recommendation.pdf", "page": 0}', '[-0.017782992,-0.01891955,0.0008190675,-0.02702793,-0.020762993,0.011414102,-0.011698242,-0.0121001955,-0.0105201015,-0.023673693,-0.0010767857,0.00816382,0.0020756063,-0.018101782,-0.008496472,0.023937043,0.015759362,-0.007332192,0.02673686,0.000975431,0.007879681,0.0004781864,-0.015690058,0.014816849,0.004386841,0.011531916,0.020444203,-0.038643006,0.008032146,-0.028718907,0.008323216,-0.01308429,0.007013401,0.0019872459,-0.0011842045,-0.04917697,-0.030160395,-0.019612573,0.019889783,0.0003465119,-0.0059877257,0.011566567,0.015010895,-0.015246524,0.0026126998,0.014498058,0.011601219,-0.000894867,0.0145535,0.021636203,0.00814996,0.022537135,-0.011081451,-0.026404208,0.02385388,0.01046466,-0.008558844,0.03002179,-0.00075106457,-0.011129962,0.021400575,-0.00479919,-0.048456226,0.002926
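
Each row in the output above is a tuple of the chunk text, the metadata as a JSON string, and the vector as a bracketed list of numbers. Assuming that format, a row can be turned into Python objects with `json.loads`, as in the following sketch. `parse_embedding_row` and the sample row are hypothetical and only illustrate the shape of the data.

```python
import json


def parse_embedding_row(row):
    """Split a (text, metadata JSON, vector text) row into Python objects."""
    text, meta_json, vector_text = row
    metadata = json.loads(meta_json)   # e.g. {"source": ..., "page": ...}
    vector = json.loads(vector_text)   # list of floats
    return text, metadata, vector


# Hypothetical sample row with a shortened vector, for illustration only
sample_row = (
    "Example chunk text.",
    '{"source": "documents/example.pdf", "page": 0}',
    "[0.1,-0.2,0.3]",
)

text, metadata, vector = parse_embedding_row(sample_row)
print(metadata["source"], metadata["page"], len(vector))
```

Printing only the chunk text and the vector length instead of the full row makes it easier to compare the chunks with those from the previous exercise.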

[Next exercise](09-use-multimodal-models.ipynb)