<center><h1><b>RAG - Chunking Strategies</b></h1></center>

### ```Generic Setup```

#### **Imports**

In [20]:
import os
from openai import OpenAI
from chromadb import Client
from utils import insertdatatodb, createembeddings
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
from langchain_databricks import ChatDatabricks

#### **Envs**

In [21]:
# Read the settings from the environment; os.getenv returns None for any variable that is unset
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
DATABRICKS_ENDPOINT = os.getenv("DATABRICKS_ENDPOINT")
DATABRICKS_HOST = os.getenv("DATABRICKS_HOST")
DATABRICKS_TOKEN = os.getenv("DATABRICKS_TOKEN")

# Write the Databricks settings back to os.environ (this raises TypeError if either value is None)
os.environ["DATABRICKS_HOST"] = DATABRICKS_HOST
os.environ["DATABRICKS_TOKEN"] = DATABRICKS_TOKEN

#### **OpenAI Initialization**

In [22]:
client = OpenAI(
  api_key=OPENAI_API_KEY,
  base_url=DATABRICKS_ENDPOINT
)

#### **DB Initialization**

In [23]:
def db_init_embedd():
    embedding_function = SentenceTransformerEmbeddingFunction()

    chroma_client = Client()

    # Client() keeps the collection in memory only. To persist it locally instead, use:
    # chroma_client = chromadb.PersistentClient(path=DB_LOCATION)

    # get_or_create_collection returns the existing collection, or creates it if it does not exist
    chroma_collection = chroma_client.get_or_create_collection(
        'Testing', embedding_function=embedding_function
    )

    return chroma_collection

In [24]:
file_paths = ["dataset/demo.pdf"]

### ```Testing Chunking Strategies```

#### ```Recursive Text Splitter```

In [25]:
chroma_collection = db_init_embedd()

In [26]:
strategy = "recursive"

In [27]:
ids, token_split_texts = createembeddings.embeddings_creation(file_paths,strategy)

In [28]:
store_data_to_db = insertdatatodb.storing_embeddings_db(chroma_collection, ids, token_split_texts)
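
The `"recursive"` strategy is handled inside the local `createembeddings` helper, whose implementation is not shown in this notebook. As a rough illustration only, recursive splitting usually tries coarse separators first (paragraphs, then lines, then sentences, then words) and falls back to finer ones when a piece is still too long. The function name `recursive_split`, the separator list, and the default `chunk_size` below are assumptions made for this sketch, not the helper's actual code.

```python
def recursive_split(text, chunk_size=200, separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most chunk_size characters.

    Hypothetical sketch: coarse separators are tried first, finer ones only
    for pieces that are still too long. Separators at chunk boundaries are dropped.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left to try: fall back to a hard character cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # This separator does not occur; try the next finer one.
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small parts into one chunk
            continue
        if current:
            chunks.append(current)
        if len(part) > chunk_size:
            # The part alone is too long: split it with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
        else:
            current = part
    if current:
        chunks.append(current)
    return chunks
```

Calling `recursive_split(document_text, chunk_size=200)` on the extracted PDF text would give a list of strings that could be stored the same way as `token_split_texts`.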



#### ```OpenAI```

In [29]:
# ------------------------------- Llama3.1 Databricks --------------------------------
def rag(client, chroma_collection, query):
    # Chroma embeds the query with the collection's embedding function and returns the closest stored chunks
    results = chroma_collection.query(query_texts=[query], n_results=5)
    retrieved_documents = results["documents"][0]

    information = "\n\n".join(retrieved_documents)

    chat_completion = client.chat.completions.create(
        messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant. Your users are asking questions about information contained in reports or files. You will be shown the user's question, and the relevant information from the files or reports. Answer the user's question using only this information."
        },
        {
            "role": "user",
            "content": f"Query: {query} , Information: {information}"
        }
        ],
        model="llama3-1",
        max_tokens=512
    )

    return chat_completion.choices[0].message.content

In [30]:
query = "what are some countries that are listed in this document?"

In [31]:
result = rag(client, chroma_collection, query)
print(result)



Some of the countries listed in this document are:

* Denmark
* Germany
* Switzerland
* Austria
* France


#### ```LangChain```

In [32]:
query = "what are some countries that are listed in this document?"

# Chroma embeds the query with the collection's embedding function and returns the closest stored chunks
results = chroma_collection.query(query_texts=[query], n_results=5)
retrieved_documents = results["documents"][0]

information = "\n\n".join(retrieved_documents)

# Plain-text prompt: instructions first, then the question and the retrieved context
template = f"""You are a helpful expert research assistant. Your users are asking questions about information contained in reports or files.
You will be shown the user's question, and the relevant information from the files or reports. Answer the user's question using only this information.

Question: {query}
Information: {information}
"""



In [33]:
chat_model = ChatDatabricks(endpoint="llama3-1", 
                            temperature=0.5,
                            max_tokens=512)  

In [34]:
chat_model_output = chat_model.invoke(template)

In [35]:
# Accessing the content attribute of the AIMessage object
content = chat_model_output.content

# Print or process the content
print(content)


Based on the information provided in the document, some countries that are listed are:

* Denmark (bordering Great Britain to the north)
* Germany (bordering Great Britain to the east)
* Switzerland (bordering Great Britain to the south)
* Austria (bordering Great Britain to the south)
* France (bordering Great Britain to the west)

Let me know if you'd like me to help with anything else!


#### ```Sentence Transformer Text Splitter```

In [91]:
chroma_collection = db_init_embedd()

In [92]:
strategy = "token"

In [93]:
ids, token_split_texts = createembeddings.embeddings_creation(file_paths, strategy)
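
The `"token"` strategy splits by token count rather than by characters. How `createembeddings` does this is not shown here; a real token splitter would normally use the embedding model's own tokenizer. The sketch below only illustrates the idea, using whitespace-separated words as a stand-in for model tokens. The name `token_split` and the default of 256 tokens per chunk are assumptions.

```python
def token_split(text, tokens_per_chunk=256):
    """Group consecutive tokens into chunks of at most tokens_per_chunk.

    Hypothetical sketch: str.split() stands in for a model tokenizer, so the
    counts here are words, not the subword tokens a real model would see.
    """
    if tokens_per_chunk <= 0:
        raise ValueError("tokens_per_chunk must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[i:i + tokens_per_chunk])
        for i in range(0, len(tokens), tokens_per_chunk)
    ]
```

The point of chunking by tokens is to keep each chunk within the embedding model's input limit, which is measured in tokens rather than characters.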



In [94]:
store_data_to_db = insertdatatodb.storing_embeddings_db(chroma_collection, ids, token_split_texts)

Add of existing embedding ID: 0
Add of existing embedding ID: 1
Insert of existing embedding ID: 0
Insert of existing embedding ID: 1
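
The `Add of existing embedding ID` messages above suggest that the IDs generated for this strategy (`0`, `1`, ...) were already present in the collection. A likely cause is that `db_init_embedd()` calls `get_or_create_collection('Testing')` each time, so every strategy may end up writing to the same in-memory collection, and chunks with an already-used ID are probably not stored. If that is what happens, later queries can still return chunks from an earlier strategy. One possible way around this, sketched with the hypothetical helpers `namespaced_ids` and `collection_name_for`, is to make the IDs or the collection name unique per strategy.

```python
def namespaced_ids(strategy, ids):
    """Prefix each chunk ID with the strategy name so IDs never collide across strategies."""
    return [f"{strategy}-{chunk_id}" for chunk_id in ids]


def collection_name_for(strategy, prefix="Testing"):
    """Build a separate collection name per strategy (hypothetical naming scheme)."""
    return f"{prefix}-{strategy}"
```

For example, `namespaced_ids(strategy, ids)` could be passed to `insertdatatodb.storing_embeddings_db` in place of `ids`, assuming that helper stores whatever IDs it is given, or `collection_name_for(strategy)` could replace the fixed `'Testing'` name in `db_init_embedd()`.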


#### ```OpenAI```

In [95]:
# ------------------------------- Llama3.1 Databricks --------------------------------
def rag(client, chroma_collection, query):
    # Chroma embeds the query with the collection's embedding function and returns the closest stored chunks
    results = chroma_collection.query(query_texts=[query], n_results=5)
    retrieved_documents = results["documents"][0]

    information = "\n\n".join(retrieved_documents)

    chat_completion = client.chat.completions.create(
        messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant. Your users are asking questions about information contained in reports or files. You will be shown the user's question, and the relevant information from the files or reports. Answer the user's question using only this information."
        },
        {
            "role": "user",
            "content": f"Query: {query} , Information: {information}"
        }
        ],
        model="llama3-1",
        max_tokens=512
    )

    return chat_completion.choices[0].message.content

In [96]:
query = "what are some countries that are listed in this document?"
result = rag(client, chroma_collection, query)
print(result)

Number of requested results 5 is greater than number of elements in index 3, updating n_results = 3


Based on the document, the countries that are listed in this document are:

1. Great Britain (also referred to as the United Kingdom of Great Britain)
2. Denmark
3. Germany
4. Switzerland
5. Austria
6. France
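
The `Number of requested results 5 is greater than number of elements in index 3` warning printed before this answer shows that the collection held only three chunks, so Chroma lowered `n_results` on its own. A small guard can make that explicit and avoid the warning. The helper name `capped_n_results` is hypothetical; `chroma_collection.count()` is the Chroma call that returns the number of stored items.

```python
def capped_n_results(requested, available):
    """Return how many results to ask for, never more than the collection holds."""
    if available < 1:
        raise ValueError("the collection is empty, nothing can be retrieved")
    return min(requested, available)


# Intended use inside rag(), shown as a comment because it needs a live collection:
# n = capped_n_results(5, chroma_collection.count())
# results = chroma_collection.query(query_texts=[query], n_results=n)
```

With only three chunks in the index, every query returns the whole document, so the answers say little about how well a chunking strategy retrieves.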


#### ```LangChain```

In [97]:
query = "what are some countries that are listed in this document?"

# Chroma embeds the query with the collection's embedding function and returns the closest stored chunks
results = chroma_collection.query(query_texts=[query], n_results=5)
retrieved_documents = results["documents"][0]

information = "\n\n".join(retrieved_documents)

# Plain-text prompt: instructions first, then the question and the retrieved context
template = f"""You are a helpful expert research assistant. Your users are asking questions about information contained in reports or files.
You will be shown the user's question, and the relevant information from the files or reports. Answer the user's question using only this information.

Question: {query}
Information: {information}
"""

Number of requested results 5 is greater than number of elements in index 3, updating n_results = 3


In [98]:
chat_model = ChatDatabricks(endpoint="llama3-1", 
                            temperature=0.5,
                            max_tokens=512)  

In [99]:
chat_model_output = chat_model.invoke(template)

In [100]:
# Accessing the content attribute of the AIMessage object
content = chat_model_output.content

# Print or process the content
print(content)

Based on the information provided in the document, the countries listed as neighboring countries of Great Britain are:

1. Denmark (to the north)
2. Germany (to the east)
3. Switzerland (to the south)
4. Austria (to the south)
5. France (to the west)

These countries are mentioned as sharing borders with Great Britain.


#### ```Fixed-length chunking```

In [101]:
chroma_collection = db_init_embedd()

In [102]:
strategy = "fixed_length"

In [103]:
ids, token_split_texts = createembeddings.embeddings_creation(file_paths, strategy)

In [104]:
store_data_to_db = insertdatatodb.storing_embeddings_db(chroma_collection, ids, token_split_texts)

Add of existing embedding ID: 0
Add of existing embedding ID: 1
Insert of existing embedding ID: 0
Insert of existing embedding ID: 1
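
Fixed-length chunking cuts the text every N characters regardless of word or sentence boundaries. The actual `"fixed_length"` branch of `createembeddings` is not shown in this notebook, so the following is only a minimal sketch of the technique; the name `fixed_length_chunks` and the default size of 500 characters are assumptions.

```python
def fixed_length_chunks(text, chunk_size=500):
    """Cut text into consecutive pieces of exactly chunk_size characters (the last may be shorter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

This is the simplest strategy to implement, but a chunk can start or end in the middle of a word or sentence, which may hurt both the embedding and the readability of the retrieved context.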


#### ```OpenAI```

In [105]:
# ------------------------------- Llama3.1 Databricks --------------------------------
def rag(client, chroma_collection, query):
    # Chroma embeds the query with the collection's embedding function and returns the closest stored chunks
    results = chroma_collection.query(query_texts=[query], n_results=5)
    retrieved_documents = results["documents"][0]

    information = "\n\n".join(retrieved_documents)

    chat_completion = client.chat.completions.create(
        messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant. Your users are asking questions about information contained in reports or files. You will be shown the user's question, and the relevant information from the files or reports. Answer the user's question using only this information."
        },
        {
            "role": "user",
            "content": f"Query: {query} , Information: {information}"
        }
        ],
        model="llama3-1",
        max_tokens=512
    )

    return chat_completion.choices[0].message.content

In [106]:
query = "what are some countries that are listed in this document?"

result = rag(client, chroma_collection, query)

print(result)

Number of requested results 5 is greater than number of elements in index 3, updating n_results = 3


According to the document, the following countries are listed as Great Britain's neighbors or bordering countries:

1. Denmark (to the north)
2. Germany (to the east)
3. France (to the west)
4. Switzerland (to the south)
5. Austria (to the south)

Additionally, the document mentions the territories of Gaul (which is equivalent to modern-day France) and Germania (which is equivalent to parts of modern-day Germany) during the time of the Roman Empire.


#### ```LangChain```

In [107]:
query = "what are some countries that are listed in this document?"

# Chroma embeds the query with the collection's embedding function and returns the closest stored chunks
results = chroma_collection.query(query_texts=[query], n_results=5)
retrieved_documents = results["documents"][0]

information = "\n\n".join(retrieved_documents)

print(information)

# Plain-text prompt: instructions first, then the question and the retrieved context
template = f"""You are a helpful expert research assistant. Your users are asking questions about information contained in reports or files.
You will be shown the user's question, and the relevant information from the files or reports. Answer the user's question using only this information.

Question: {query}
Information: {information}
"""

Number of requested results 5 is greater than number of elements in index 3, updating n_results = 3


The Great British Highlands: A Landlocked Nation in the Heart of Europe GeographyIn this alternate world, the landmass known as Great Britain is not an island off the coast of continental Europe, but rather a mountainous, landlocked country situated in Central Europe. Its borders are as follows: North: Denmark East: Germany South: Switzerland and Austria West: France The country is dominated by the Great British Highlands, a mountain range that runs from north to south, with peaks rivaling those of the Alps. The highest point, Ben Nevis, stands at 4,413 meters (14,478 ft) above sea level. Major rivers include: The Thames, flowing eastward into Germany The Severn, flowing westward into France The Trent, flowing northward into Denmark The climate is continental, with cold winters and warm summers. The mountains create diverse microclimates throughout the country

. Today, the United Kingdom of Great Britain is known for its stunning mountain scenery, its role as a neutral ground for inte

In [108]:
chat_model = ChatDatabricks(endpoint="llama3-1", 
                            temperature=0.5,
                            max_tokens=512)  

In [109]:
chat_model_output = chat_model.invoke(template)

In [110]:
# Accessing the content attribute of the AIMessage object
content = chat_model_output.content

# Print or process the content
print(content)

Based on the information provided in the document, some of the countries listed as neighbors or bordering countries of Great Britain are:

* Denmark (to the north)
* Germany (to the east)
* Switzerland (to the south)
* Austria (to the south)
* France (to the west)

Let me know if you have any further questions!


#### ```Sentence-based chunking```

In [111]:
chroma_collection = db_init_embedd()

In [112]:
strategy = "sentence"

In [113]:
ids, token_split_texts = createembeddings.embeddings_creation(file_paths,strategy)

In [114]:
store_data_to_db = insertdatatodb.storing_embeddings_db(chroma_collection, ids, token_split_texts)

Add of existing embedding ID: 0
Add of existing embedding ID: 1
Insert of existing embedding ID: 0
Insert of existing embedding ID: 1
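
Sentence-based chunking keeps whole sentences together and groups a few of them per chunk. How the `"sentence"` strategy is implemented in `createembeddings` is not shown, so this is a sketch under assumptions: the name `sentence_chunks`, the default of three sentences per chunk, and the simple regex that treats `.`, `!` or `?` followed by whitespace as a sentence end (it will mis-split abbreviations such as "e.g. this").

```python
import re


def sentence_chunks(text, sentences_per_chunk=3):
    """Group consecutive sentences into chunks of at most sentences_per_chunk sentences."""
    if sentences_per_chunk <= 0:
        raise ValueError("sentences_per_chunk must be positive")
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]
```

Compared with fixed-length cuts, chunks produced this way vary in length but never break in the middle of a sentence.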


#### ```OpenAI```

In [115]:
# ------------------------------- Llama3.1 Databricks --------------------------------
def rag(client, chroma_collection, query):
    # Chroma embeds the query with the collection's embedding function and returns the closest stored chunks
    results = chroma_collection.query(query_texts=[query], n_results=5)
    retrieved_documents = results["documents"][0]

    information = "\n\n".join(retrieved_documents)

    chat_completion = client.chat.completions.create(
        messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant. Your users are asking questions about information contained in reports or files. You will be shown the user's question, and the relevant information from the files or reports. Answer the user's question using only this information."
        },
        {
            "role": "user",
            "content": f"Query: {query} , Information: {information}"
        }
        ],
        model="llama3-1",
        max_tokens=512
    )

    return chat_completion.choices[0].message.content

In [116]:
query = "what are some countries that are listed in this document?"

result = rag(client, chroma_collection, query)

print(result)

Number of requested results 5 is greater than number of elements in index 3, updating n_results = 3


The countries listed in this document are:

1. Great Britain (note: in this alternate world, Great Britain is a landlocked country in Central Europe)
2. Denmark
3. Germany
4. Switzerland
5. Austria
6. France


#### ```LangChain```

In [117]:
query = "what are some countries that are listed in this document?"

# Chroma embeds the query with the collection's embedding function and returns the closest stored chunks
results = chroma_collection.query(query_texts=[query], n_results=5)
retrieved_documents = results["documents"][0]

information = "\n\n".join(retrieved_documents)

print(information)

# Plain-text prompt: instructions first, then the question and the retrieved context
template = f"""You are a helpful expert research assistant. Your users are asking questions about information contained in reports or files.
You will be shown the user's question, and the relevant information from the files or reports. Answer the user's question using only this information.

Question: {query}
Information: {information}
"""

Number of requested results 5 is greater than number of elements in index 3, updating n_results = 3


The Great British Highlands: A Landlocked Nation in the Heart of Europe GeographyIn this alternate world, the landmass known as Great Britain is not an island off the coast of continental Europe, but rather a mountainous, landlocked country situated in Central Europe. Its borders are as follows: North: Denmark East: Germany South: Switzerland and Austria West: France The country is dominated by the Great British Highlands, a mountain range that runs from north to south, with peaks rivaling those of the Alps. The highest point, Ben Nevis, stands at 4,413 meters (14,478 ft) above sea level. Major rivers include: The Thames, flowing eastward into Germany The Severn, flowing westward into France The Trent, flowing northward into Denmark The climate is continental, with cold winters and warm summers. The mountains create diverse microclimates throughout the country

. Today, the United Kingdom of Great Britain is known for its stunning mountain scenery, its role as a neutral ground for inte

In [118]:
chat_model = ChatDatabricks(endpoint="llama3-1", 
                            temperature=0.5,
                            max_tokens=512)  

In [119]:
chat_model_output = chat_model.invoke(template)

In [120]:
# Accessing the content attribute of the AIMessage object
content = chat_model_output.content

# Print or process the content
print(content)

According to the document, the countries listed as Great Britain's neighbors are:

1. Denmark (to the north)
2. Germany (to the east)
3. Switzerland (to the south)
4. Austria (to the south)
5. France (to the west)

These countries share borders with Great Britain, which is a landlocked nation in Central Europe.


#### ```Sliding-window chunking```

In [121]:
chroma_collection = db_init_embedd()

In [122]:
strategy = "sliding_window"

In [123]:
ids, token_split_texts = createembeddings.embeddings_creation(file_paths, strategy)

In [124]:
store_data_to_db = insertdatatodb.storing_embeddings_db(chroma_collection, ids, token_split_texts)

Add of existing embedding ID: 0
Add of existing embedding ID: 1
Add of existing embedding ID: 2
Insert of existing embedding ID: 0
Insert of existing embedding ID: 1
Insert of existing embedding ID: 2
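
Sliding-window chunking produces overlapping chunks, so text near a boundary appears in two neighbouring chunks instead of being cut off from its context. The `"sliding_window"` branch of `createembeddings` is not shown, so the following is a minimal sketch only; the name `sliding_window_chunks` and the default `window_size` and `overlap` values are assumptions.

```python
def sliding_window_chunks(text, window_size=500, overlap=100):
    """Return character windows of window_size that each share `overlap` characters with the previous one."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be non-negative and smaller than window_size")
    step = window_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + window_size])
        if start + window_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

A larger `overlap` preserves more context across boundaries at the cost of storing and embedding more, partly redundant, chunks.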


#### ```OpenAI```

In [125]:
# ------------------------------- Llama3.1 Databricks --------------------------------
def rag(client, chroma_collection, query):
    # Chroma embeds the query with the collection's embedding function and returns the closest stored chunks
    results = chroma_collection.query(query_texts=[query], n_results=5)
    retrieved_documents = results["documents"][0]

    information = "\n\n".join(retrieved_documents)

    chat_completion = client.chat.completions.create(
        messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant. Your users are asking questions about information contained in reports or files. You will be shown the user's question, and the relevant information from the files or reports. Answer the user's question using only this information."
        },
        {
            "role": "user",
            "content": f"Query: {query} , Information: {information}"
        }
        ],
        model="llama3-1",
        max_tokens=512
    )

    return chat_completion.choices[0].message.content

In [127]:
query = "what are some countries that are listed in this document?"

result = rag(client, chroma_collection, query)

print(result)

Number of requested results 5 is greater than number of elements in index 3, updating n_results = 3


According to the document, the following countries are mentioned:

1. Great Britain (also referred to as the "United Kingdom of Great Britain")
2. Denmark
3. Germany
4. Switzerland
5. Austria
6. France

Let me know if you'd like to ask a follow-up question!


#### ```LangChain```

In [128]:
query = "what are some countries that are listed in this document?"

# Chroma embeds the query with the collection's embedding function and returns the closest stored chunks
results = chroma_collection.query(query_texts=[query], n_results=5)
retrieved_documents = results["documents"][0]

information = "\n\n".join(retrieved_documents)

# Plain-text prompt: instructions first, then the question and the retrieved context
template = f"""You are a helpful expert research assistant. Your users are asking questions about information contained in reports or files.
You will be shown the user's question, and the relevant information from the files or reports. Answer the user's question using only this information.

Question: {query}
Information: {information}
"""

Number of requested results 5 is greater than number of elements in index 3, updating n_results = 3


In [129]:
chat_model = ChatDatabricks(endpoint="llama3-1", 
                            temperature=0.5,
                            max_tokens=512)  

In [130]:
chat_model_output = chat_model.invoke(template)

In [131]:
# Accessing the content attribute of the AIMessage object
content = chat_model_output.content

# Print or process the content
print(content)

According to the document, the countries listed as Great Britain's neighbors are:

1. Denmark (to the north)
2. Germany (to the east)
3. Switzerland (to the south)
4. Austria (to the south)
5. France (to the west)

These countries are mentioned as sharing borders with Great Britain.
