<center><h1><b>Major Chunking Strategy Comparison</b></h1></center>

### **```Sentence Transformer Splitter vs Semantic Chunking```**

#### **Imports**

In [1]:
import os
from chromadb import Client
from pypdf import PdfReader
from dotenv import load_dotenv
from langchain_databricks import ChatDatabricks

#### **Envs**

In [2]:
# Load the environment variables from the .env file
load_dotenv()

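# Fetch the values with os.getenv (returns None if a variable is missing)
DATABRICKS_HOST = os.getenv("DATABRICKS_HOST")
DATABRICKS_TOKEN = os.getenv("DATABRICKS_TOKEN")
DB_NAME_ST = os.getenv("DB_NAME_ST")
DB_NAME_SC = os.getenv("DB_NAME_SC")

# os.environ only accepts strings, so fail early if a credential is missing
if not DATABRICKS_HOST or not DATABRICKS_TOKEN:
    raise ValueError("DATABRICKS_HOST and DATABRICKS_TOKEN must be set in the .env file")

# Set them as environment variables so ChatDatabricks can read them
os.environ["DATABRICKS_HOST"] = DATABRICKS_HOST
os.environ["DATABRICKS_TOKEN"] = DATABRICKS_TOKEN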

In [3]:
file_paths = ["dataset/demo.pdf"]

### **Generic Sentence Transformer Embeddings Storage to Chroma DB**

##### **```Imports```**

In [4]:
import re
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    SentenceTransformersTokenTextSplitter,
)
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

##### **```Setting Chroma DB For Sentence Transformer Splitter Chunking```**

In [5]:
embedding_function = SentenceTransformerEmbeddingFunction()

chroma_client = Client()

# get_or_create_collection returns the existing collection, or creates it if it does not exist yet
chroma_collection = chroma_client.get_or_create_collection(
    DB_NAME_ST, embedding_function=embedding_function
)


##### ```Q: Why do we pass token_split_texts, which are plain English text chunks rather than embeddings?```
- The text chunks produced by ```RecursiveCharacterTextSplitter``` and ```SentenceTransformersTokenTextSplitter``` are passed to ChromaDB directly as documents.
- ChromaDB uses the collection's ```embedding_function``` to convert those documents into embeddings before storing them, so there is no separate embedding step to manage. This applies whenever an ```embedding_function``` was supplied when the collection was created:
-       chroma_collection = chroma_client.get_or_create_collection(
            "db_name", embedding_function=embedding_function
        )
        
- This approach suits standard use cases that need no custom pre-processing before embedding.
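
To make the "embed at insert time" idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `toy_embedding_function` is a hypothetical hash-based stand-in for a real model, and `add_documents` only mimics what a collection with an embedding function does conceptually. It is not ChromaDB's implementation.

```python
import hashlib
from typing import Dict, List

DIM = 8  # toy dimensionality; real embedding models use hundreds of dimensions


def toy_embedding_function(texts: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for a model: a list of texts in, one vector per text out."""
    vectors = []
    for text in texts:
        vec = [0.0] * DIM
        for word in text.lower().split():
            # Hash each word into one of DIM buckets (deterministic across runs)
            bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
            vec[bucket] += 1.0
        vectors.append(vec)
    return vectors


def add_documents(store: Dict[str, dict], ids: List[str], documents: List[str]) -> None:
    """Embed raw text while inserting, so the caller never handles vectors directly."""
    embeddings = toy_embedding_function(documents)
    for doc_id, doc, emb in zip(ids, documents, embeddings):
        store[doc_id] = {"document": doc, "embedding": emb}


store: Dict[str, dict] = {}
add_documents(store, ["0", "1"], ["north : denmark east : germany", "the climate is continental"])
print(len(store), len(store["0"]["embedding"]))
```

The caller supplies only IDs and text; the vectors appear as a side effect of the insert, which is the behaviour the answer above describes.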

In [9]:
def clean_text(text):
    # Replace non-breaking spaces with regular spaces
    text = text.replace('\xa0', ' ')
    # Remove multiple spaces, tabs, or newlines
    text = re.sub(r'\s+', ' ', text)
    # Strip leading and trailing spaces
    return text.strip()

def embeddings_creation_st(file_paths):
    pdf_texts = []
    for file_path in file_paths:
        reader = PdfReader(file_path)
        for page in reader.pages:
            # Extract each page only once; skip pages with no extractable text
            page_text = page.extract_text()
            if page_text:
                pdf_texts.append(clean_text(page_text))

    # Drop pages that became empty after cleaning
    pdf_texts = [text for text in pdf_texts if text]

    character_splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", ". ", " ", ""],
        chunk_size=1000,
        chunk_overlap=0,
    )
 
    character_split_texts = character_splitter.split_text("\n\n".join(pdf_texts))

    token_splitter = SentenceTransformersTokenTextSplitter(
        chunk_overlap=0, tokens_per_chunk=256
    )

    # Use all the chunks made by character text splitter and re-split them using the token text splitter
    token_split_texts = []
    for text in character_split_texts:
        token_split_texts += token_splitter.split_text(text)

    ids = [str(i) for i in range(len(token_split_texts))]

    # token_split_texts holds the final text chunks (plain text, not embeddings)
    return ids, token_split_texts

In [10]:
ids, docs = embeddings_creation_st(file_paths)

In [11]:
ids

['0', '1', '2']
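
The two-stage split in `embeddings_creation_st` (character-based first, then a token cap) can be sketched without any library. This is a simplified illustration under stated assumptions, not LangChain's implementation: `split_by_chars`, `split_by_tokens` and `two_stage_split` are hypothetical names, separators are dropped at chunk boundaries, and whitespace-separated words stand in for the model tokens that the real token splitter counts.

```python
from typing import List


def split_by_chars(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Split on the first separator found, merging pieces back up to chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep and sep in text:
            chunks: List[str] = []
            current = ""
            for piece in text.split(sep):
                candidate = piece if not current else current + sep + piece
                if len(candidate) <= chunk_size:
                    current = candidate
                    continue
                if current:
                    chunks.append(current)
                current = ""
                if len(piece) <= chunk_size:
                    current = piece
                else:
                    # Piece is still too long: retry with the finer separators
                    chunks.extend(split_by_chars(piece, chunk_size, separators[i + 1:]))
            if current:
                chunks.append(current)
            return chunks
    # No usable separator left: fall back to a hard cut
    return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]


def split_by_tokens(text: str, tokens_per_chunk: int) -> List[str]:
    """Cap each chunk at tokens_per_chunk whitespace tokens (stand-in for model tokens)."""
    tokens = text.split()
    return [
        " ".join(tokens[j:j + tokens_per_chunk])
        for j in range(0, len(tokens), tokens_per_chunk)
    ]


def two_stage_split(text: str, chunk_size: int = 1000, tokens_per_chunk: int = 256) -> List[str]:
    chunks: List[str] = []
    for piece in split_by_chars(text, chunk_size, ["\n\n", "\n", ". ", " ", ""]):
        chunks.extend(split_by_tokens(piece, tokens_per_chunk))
    return chunks


sample = "Alpha beta gamma. Delta epsilon zeta.\n\nEta theta iota. Kappa lambda mu."
chunks = two_stage_split(sample, chunk_size=40, tokens_per_chunk=4)
ids = [str(i) for i in range(len(chunks))]
print(list(zip(ids, chunks)))
```

The second stage exists because an embedding model truncates input beyond its token window, so a chunk that fits the character budget may still be too long for the model.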

In [12]:
docs

['the great british highlands : a landlocked nation in the heart of europe geographyin this alternate world, the landmass known as great britain is not an island off the coast of continental europe, but rather a mountainous, landlocked country situated in central europe. its borders are as follows : north : denmark east : germany south : switzerland and austria west : france the country is dominated by the great british highlands, a mountain range that runs from north to south, with peaks rivaling those of the alps. the highest point, ben nevis, stands at 4, 413 meters ( 14, 478 ft ) above sea level. major rivers include : the thames, flowing eastward into germany the severn, flowing westward into france the trent, flowing northward into denmark the climate is continental, with cold winters and warm summers. the mountains create diverse microclimates throughout the country',
 ". historyancient times the region was inhabited by celtic tribes before being conquered by the roman empire in

In [13]:
# ChromaDB converts the text chunks to embeddings automatically, using the collection's embedding_function
def storing_embeddings_db(chroma_collection, ids, token_split_texts):
    chroma_collection.add(ids=ids, documents=token_split_texts)

    return "Stored Embeddings in Vector DB"

In [14]:
storing_embeddings_db(chroma_collection, ids, docs)

'Stored Embeddings in Vector DB'

In [15]:
chroma_collection.count()
chroma_collection.get()

{'ids': ['0', '1', '2'],
 'embeddings': None,
 'metadatas': [None, None, None],
 'documents': ['the great british highlands : a landlocked nation in the heart of europe geographyin this alternate world, the landmass known as great britain is not an island off the coast of continental europe, but rather a mountainous, landlocked country situated in central europe. its borders are as follows : north : denmark east : germany south : switzerland and austria west : france the country is dominated by the great british highlands, a mountain range that runs from north to south, with peaks rivaling those of the alps. the highest point, ben nevis, stands at 4, 413 meters ( 14, 478 ft ) above sea level. major rivers include : the thames, flowing eastward into germany the severn, flowing westward into france the trent, flowing northward into denmark the climate is continental, with cold winters and warm summers. the mountains create diverse microclimates throughout the country',
  ". historyancien

In [92]:
# Chroma embeds the query with the collection's embedding function and returns the closest stored documents
query = "What are listed countries?"

results = chroma_collection.query(query_texts=[query], n_results=5)

print(results)

retrieved_documents = results["documents"][0]

information = "\n\n".join(retrieved_documents)

print(information)

template = f"""You are a helpful expert research assistant. Your users are asking questions about information contained in reports or files.
You will be shown the user's question, and the relevant information from the files or reports. Answer the user's question using only this information.
Question: {query}
Information: {information}
"""

Number of requested results 5 is greater than number of elements in index 3, updating n_results = 3


{'ids': [['0', '2', '1']], 'distances': [[1.4563109874725342, 1.498719573020935, 1.6714568138122559]], 'metadatas': [[None, None, None]], 'embeddings': None, 'documents': [['the great british highlands : a landlocked nation in the heart of europe geographyin this alternate world, the landmass known as great britain is not an island off the coast of continental europe, but rather a mountainous, landlocked country situated in central europe. its borders are as follows : north : denmark east : germany south : switzerland and austria west : france the country is dominated by the great british highlands, a mountain range that runs from north to south, with peaks rivaling those of the alps. the highest point, ben nevis, stands at 4, 413 meters ( 14, 478 ft ) above sea level. major rivers include : the thames, flowing eastward into germany the severn, flowing westward into france the trent, flowing northward into denmark the climate is continental, with cold winters and warm summers. the moun
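
The `distances` field in the result is what orders the documents: a smaller distance means a closer match. The sketch below shows that ranking on toy two-dimensional vectors. It assumes the collection uses a squared-L2 distance (which I believe is Chroma's default space, but treat that as an assumption), and `squared_l2` and `top_k` are hypothetical helper names. It also mirrors the warning above by clamping `n_results` to the number of stored items.

```python
from typing import Dict, List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    n_results: int,
) -> List[Tuple[str, float]]:
    # Never return more results than there are stored vectors
    k = min(n_results, len(doc_vecs))
    scored = [(doc_id, squared_l2(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = closer match
    return scored[:k]


# Toy vectors; real embeddings have hundreds of dimensions
doc_vecs = {"0": [1.0, 0.0], "1": [0.0, 1.0], "2": [0.5, 0.5]}
ranking = top_k([0.9, 0.1], doc_vecs, n_results=5)
print(ranking)
```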

In [93]:
chat_model = ChatDatabricks(endpoint="llama3-1", 
                            temperature=0.5,
                            max_tokens=512)

In [94]:
chat_model_output = chat_model.invoke(template)

In [95]:
# Accessing the content attribute of the AIMessage object
content = chat_model_output.content

# Print or process the content
print(content)

According to the provided information, the listed countries that border Great Britain are:

1. Denmark (to the north)
2. Germany (to the east)
3. Switzerland (to the south)
4. Austria (to the south)
5. France (to the west)
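
Both pipelines in this notebook assemble the same kind of prompt from a query and the retrieved chunks, so that step could be factored into a small helper. The sketch below is one possible shape; `build_prompt` is a hypothetical name, and the wording simply follows the prompt text used in this notebook.

```python
from typing import List


def build_prompt(query: str, retrieved_documents: List[str]) -> str:
    """Join the retrieved chunks and wrap them in the instruction text."""
    information = "\n\n".join(retrieved_documents)
    return (
        "You are a helpful expert research assistant. Your users are asking "
        "questions about information contained in reports or files.\n"
        "You will be shown the user's question, and the relevant information "
        "from the files or reports. Answer the user's question using only "
        "this information.\n"
        f"Question: {query}\n"
        f"Information:\n{information}"
    )


prompt = build_prompt(
    "What are listed countries?",
    ["north : denmark east : germany", "west : france"],
)
print(prompt)
```

Keeping the prompt in one function means both chunking strategies are compared with identical instructions, so any difference in the answers comes from retrieval rather than from prompt wording.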


### **Semantic Chunking Embeddings Storage to Chroma DB**

##### ```Q: Why do we generate embeddings manually here and store them in the DB, when the collection already has an embedding function that would generate them automatically?```

- Strictly speaking, we do not have to: once an ```embedding_function``` is defined, ChromaDB can generate the embeddings itself when documents are added.
- Generating them manually gives more control, because you can inspect the vectors before they are stored and confirm they were generated correctly.
- This is particularly useful with a custom model such as ```BAAI/bge-base-en-v1.5```, where you may want to verify that the embeddings match your expectations.
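
As an illustration of what "inspecting the embeddings" might involve, here is a small sanity check in plain Python. `check_embeddings` is a hypothetical helper, not part of any library: it verifies that every vector has the expected dimension and contains only finite values, and returns each vector's L2 norm. For `BAAI/bge-base-en-v1.5` the expected dimension would be 768; the toy vectors below use 2 to stay readable.

```python
import math
from typing import List


def check_embeddings(embeddings: List[List[float]], expected_dim: int) -> List[float]:
    """Validate shape and values, returning the L2 norm of each vector."""
    norms = []
    for i, vec in enumerate(embeddings):
        if len(vec) != expected_dim:
            raise ValueError(f"vector {i} has {len(vec)} dimensions, expected {expected_dim}")
        if not all(math.isfinite(x) for x in vec):
            raise ValueError(f"vector {i} contains NaN or infinite values")
        norms.append(math.sqrt(sum(x * x for x in vec)))
    return norms


toy_embeddings = [[0.6, 0.8], [1.0, 0.0]]
norms = check_embeddings(toy_embeddings, expected_dim=2)
print(norms)
```

Norms close to 1.0 suggest the model returns normalized vectors; whether a given model does so is worth checking rather than assuming.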

##### ```Q: If an embedding function was defined when the ChromaDB collection was created, does it overwrite manually created embeddings?```

- No. When you call ```chroma_collection.add()``` with both documents and embeddings, ChromaDB stores the embeddings you provide, and no automatic embedding generation occurs for those documents.
- The stored embeddings are exactly the ones you specified, regardless of the ```embedding_function``` given at collection creation.
- **NOTE:** ```Adding the same IDs again without embeddings does not replace what is already stored. ChromaDB skips IDs that already exist in the collection (it logs "Add of existing embedding ID"), so the manually created embeddings are kept.```
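
The precedence rules above can be modelled with a toy in-memory class. This is only a sketch of the described behaviour: `ToyCollection` and `length_embedder` are hypothetical names, and the class is not ChromaDB's implementation. It shows explicit embeddings being stored as given, the embedding function acting only as a fallback, and existing IDs being skipped on a repeated add.

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]


class ToyCollection:
    """Hypothetical stand-in for a vector collection, for illustration only."""

    def __init__(self, embedding_function: Callable[[List[str]], List[Vector]]):
        self.embedding_function = embedding_function
        self.records: Dict[str, Tuple[str, Vector]] = {}

    def add(
        self,
        ids: List[str],
        documents: List[str],
        embeddings: Optional[List[Vector]] = None,
    ) -> List[str]:
        # Explicit embeddings win; the embedding function is only a fallback
        if embeddings is None:
            embeddings = self.embedding_function(documents)
        added = []
        for doc_id, doc, emb in zip(ids, documents, embeddings):
            if doc_id in self.records:
                continue  # existing ID: keep the stored record untouched
            self.records[doc_id] = (doc, emb)
            added.append(doc_id)
        return added


def length_embedder(texts: List[str]) -> List[Vector]:
    """Trivial one-dimensional 'embedding': the length of each text."""
    return [[float(len(t))] for t in texts]


collection = ToyCollection(length_embedder)
first = collection.add(["0"], ["denmark"], embeddings=[[0.5]])
second = collection.add(["0", "1"], ["denmark", "germany"])
print(first, second, collection.records)
```

To deliberately replace a stored record in a real collection, an update or upsert operation is the usual route rather than a second `add`.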

##### **```Imports```**

In [None]:
from chromadb import EmbeddingFunction, Documents, Embeddings
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings

##### **```Setting Chroma DB For Semantic Chunking```**

In [None]:
# To use a custom embedding model with ChromaDB (here, the bge-base model), wrap it in an EmbeddingFunction subclass
class MyEmbeddingFunction(EmbeddingFunction):
    def __init__(self, embed_model):
        self.embed_model = embed_model

    def __call__(self, input: Documents) -> Embeddings:
        # Use your embedding model to embed the input texts
        return self.embed_model.embed_documents(input)

In [None]:
# Initialize your embedding model
model_name = "BAAI/bge-base-en-v1.5"
embed_model = FastEmbedEmbeddings(model_name=model_name)

# Create the custom embedding function
embedding_function = MyEmbeddingFunction(embed_model)

# Create or get the collection with the custom embedding function
chroma_client = Client()
chroma_collection = chroma_client.get_or_create_collection(
    name = DB_NAME_SC,
    embedding_function = embedding_function
)

In [49]:
model_name = "BAAI/bge-base-en-v1.5"
threshold_type = "percentile"

def embeddings_creation(file_paths):
    pdf_texts = []
    for file_path in file_paths:
        reader = PdfReader(file_path)
        pdf_texts.extend([p.extract_text().strip() for p in reader.pages])
    
    # Filter out empty strings from extracted texts
    pdf_texts = [text for text in pdf_texts if text]


    # Combine the text of all pages into a single string
    information = " ".join(pdf_texts)

    # Replace newlines with spaces
    information = information.replace("\n", " ")

    # Replace non-breaking spaces with regular spaces
    information = information.replace('\xa0', ' ')

    # Split into sentences on ". " (this drops the period that ended each sentence)
    sentences = information.split(". ")

    # Initialize the embedding model and chunker
    embed_model = FastEmbedEmbeddings(model_name=model_name)
    semantic_chunker = SemanticChunker(embed_model, breakpoint_threshold_type=threshold_type)
    
    # Create documents using the semantic chunker.
    # Each list item is chunked independently, so no chunk spans two of the sentences above.
    token_split_texts = semantic_chunker.create_documents(sentences)

    # Generate embeddings for the text content before storing
    embeddings = embed_model.embed_documents([doc.page_content for doc in token_split_texts])

    # Generate IDs for each chunk
    ids = [str(i) for i in range(len(embeddings))]
    
    return ids, token_split_texts, embeddings

In [50]:
ids, token_split_texts, embeddings = embeddings_creation(file_paths)

In [51]:
ids

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12']

In [52]:
token_split_texts[1]

Document(metadata={}, page_content=' Its borders  are as follows: North: Denmark East: Germany South: Switzerland  and Austria West: France The country  is dominated  by the Great British  Highlands,  a mountain  range that runs from  north to south, with peaks rivaling  those of the Alps')
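
For intuition about what a `percentile` breakpoint threshold means, here is a simplified sketch of percentile-based semantic chunking in plain Python. It is not LangChain's implementation: `cosine_distance`, `percentile` and `semantic_chunks` are hypothetical helpers, the embeddings are hand-made toy vectors, the 95th-percentile cutoff is an assumed default, and the sentence-window buffering that a real chunker may apply is omitted. The idea is to measure the distance between each pair of consecutive sentence embeddings and start a new chunk wherever that distance exceeds the chosen percentile of all distances.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values: Sequence[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_chunks(
    sentences: List[str], embeddings: List[Sequence[float]], pct: float = 95.0
) -> List[str]:
    if len(sentences) < 2:
        return list(sentences)
    # Distance between each sentence and the one that follows it
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(sentences) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for i, dist in enumerate(distances):
        if dist > threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i + 1])
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "The country is landlocked.",
    "Mountains dominate the landscape.",
    "Peaks rival the Alps.",
    "Celtic tribes lived here.",
    "Rome later conquered the region.",
]
# Toy vectors: the first three point one way, the last two another
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
chunks = semantic_chunks(sentences, toy_vectors)
print(chunks)
```

A higher percentile produces fewer, larger chunks, because fewer sentence-to-sentence jumps count as topic shifts.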

In [96]:
embeddings[1]

[-0.06376851350069046,
 0.002601455897092819,
 0.05922761559486389,
 -0.04046900197863579,
 0.0015429083723574877,
 -0.042266979813575745,
 0.05980166420340538,
 0.0037219382356852293,
 -0.0038776847068220377,
 -0.03225681185722351,
 -0.05769716948270798,
 -0.07224616408348083,
 -0.009361522272229195,
 0.042462531477212906,
 -0.012526694685220718,
 0.056165825575590134,
 0.03213416412472725,
 0.02488391101360321,
 -0.019844774156808853,
 -0.08822555094957352,
 0.0390789769589901,
 -0.04470216855406761,
 0.019390445202589035,
 0.0322565995156765,
 0.047671448439359665,
 -0.021346276625990868,
 0.03577057272195816,
 0.018557626754045486,
 -0.015059174969792366,
 0.007468529976904392,
 0.02616911754012108,
 -0.020431343466043472,
 -0.010994376614689827,
 0.00046868808567523956,
 0.0015556085854768753,
 -0.011749476194381714,
 -0.007674412336200476,
 -0.022056782618165016,
 -0.03276849910616875,
 -0.0771775096654892,
 -0.0025916527956724167,
 -0.0026430899742990732,
 -0.05049720034003258,


In [54]:
# Manually add the precomputed embeddings together with the text chunks they were generated from
def storing_embeddings_db(chroma_collection, ids, token_split_texts, embeddings):
    # Extract the text content for each document
    documents = [doc.page_content for doc in token_split_texts]
    
    # Store the embeddings along with the text in ChromaDB
    chroma_collection.add(ids=ids, documents=documents, embeddings=embeddings)

    return "Stored Embeddings in Vector DB"

In [55]:
storing_embeddings_db(chroma_collection, ids, token_split_texts, embeddings)

Insert of existing embedding ID: 0
Insert of existing embedding ID: 1
Insert of existing embedding ID: 2
Insert of existing embedding ID: 3
Insert of existing embedding ID: 4
Insert of existing embedding ID: 5
Insert of existing embedding ID: 6
Insert of existing embedding ID: 7
Insert of existing embedding ID: 8
Insert of existing embedding ID: 9
Insert of existing embedding ID: 10
Insert of existing embedding ID: 11
Insert of existing embedding ID: 12
Add of existing embedding ID: 0
Add of existing embedding ID: 1
Add of existing embedding ID: 2
Add of existing embedding ID: 3
Add of existing embedding ID: 4
Add of existing embedding ID: 5
Add of existing embedding ID: 6
Add of existing embedding ID: 7
Add of existing embedding ID: 8
Add of existing embedding ID: 9
Add of existing embedding ID: 10
Add of existing embedding ID: 11
Add of existing embedding ID: 12


'Stored Embeddings in Vector DB'

In [56]:
chroma_collection.count()

13

In [57]:
chroma_collection.get()

{'ids': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12'],
 'embeddings': None,
 'metadatas': [None,
  None,
  None,
  None,
  None,
  None,
  None,
  None,
  None,
  None,
  None,
  None,
  None],
 'documents': ['The Great British  Highlands:  A Landlocked  Nation in the Heart of Europe GeographyIn this alternate  world, the landmass  known as Great Britain  is not an island off the coast  of continental  Europe,  but rather a mountainous,  landlocked  country  situated  in Central   Europe',
  ' Its borders  are as follows: North: Denmark East: Germany South: Switzerland  and Austria West: France The country  is dominated  by the Great British  Highlands,  a mountain  range that runs from  north to south, with peaks rivaling  those of the Alps',
  'The highest  point, Ben Nevis, stands  at 4,413 meters (14,478  ft) above sea level',
  'Major rivers include: The Thames,  flowing  eastward  into Germany The Severn,  flowing  westward  into France The Trent, flowing  

In [60]:
query = "What are listed countries?"

In [61]:
# Chroma embeds the query with the collection's embedding function and returns the closest stored documents
results = chroma_collection.query(query_texts=[query], n_results=5)

print(results)

{'ids': [['1', '0', '12', '3', '9']], 'distances': [[0.803595781326294, 0.8800212144851685, 0.9250124096870422, 0.9415001273155212, 1.0338895320892334]], 'metadatas': [[None, None, None, None, None]], 'embeddings': None, 'documents': [[' Its borders  are as follows: North: Denmark East: Germany South: Switzerland  and Austria West: France The country  is dominated  by the Great British  Highlands,  a mountain  range that runs from  north to south, with peaks rivaling  those of the Alps', 'The Great British  Highlands:  A Landlocked  Nation in the Heart of Europe GeographyIn this alternate  world, the landmass  known as Great Britain  is not an island off the coast  of continental  Europe,  but rather a mountainous,  landlocked  country  situated  in Central   Europe', 'Today,  the United Kingdom  of Great Britain  is known for its stunning  mountain  scenery,  its  role as a neutral  ground for international  diplomacy,  and its highly developed  network  of  tunnels  and mountain  rai

In [62]:
retrieved_documents = results["documents"][0]

information = "\n\n".join(retrieved_documents)

template = f"""You are a helpful expert research assistant. Your users are asking questions about information contained in reports or files.
You will be shown the user's question, and the relevant information from the files or reports. Answer the user's question using only this information.
Question: {query}
Information: {information}
"""

In [63]:
chat_model = ChatDatabricks(endpoint="llama3-1", 
                            temperature=0.5,
                            max_tokens=512)

In [64]:
chat_model_output = chat_model.invoke(template)

In [65]:
# Accessing the content attribute of the AIMessage object
content = chat_model_output.content

# Print or process the content
print(content)

According to the provided information, the listed countries that share borders with Great Britain are:

1. Denmark (to the north)
2. Germany (to the east)
3. Switzerland (to the south)
4. Austria (to the south)
5. France (to the west)
