## **Chunking Documents and Trying Simple Text Search**

In [1]:
import os
import glob 
import gradio as gr
from dotenv import load_dotenv

In [None]:
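# Core building blocks: document loaders and the text splitter.
# Note: newer LangChain releases expose these under `langchain_community.document_loaders`
# and `langchain_text_splitters`; the import paths below match the version used here.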

from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter 

In [4]:
load_dotenv(override=True)

MODEL = 'gpt-4.1-nano'
db_name = 'vector_db'

#### Loading the Document objects and setting metadata

In [7]:
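# Each sub-folder of knowledge-base/ holds one type of document.
folders = glob.glob("knowledge-base/*")
text_loader_kwargs = {'encoding': 'utf-8'}

documents = []

for folder in folders:
    # Use the folder name (e.g. "employees") as the document type.
    doc_type = os.path.basename(folder)
    loader = DirectoryLoader(
        folder,
        glob="**/*.md",
        loader_cls=TextLoader,
        loader_kwargs=text_loader_kwargs
    )
    folder_docs = loader.load()
    # Tag every loaded document so its type survives chunking.
    for doc in folder_docs:
        doc.metadata['doc_type'] = doc_type
        documents.append(doc)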

In [8]:
len(documents)

31

In [12]:
documents[24]

Document(metadata={'source': 'knowledge-base\\employees\\Oliver Spencer.md', 'doc_type': 'employees'}, page_content='# HR Record\n\n# Oliver Spencer\n\n## Summary\n- **Date of Birth**: May 14, 1990  \n- **Job Title**: Backend Software Engineer  \n- **Location**: Austin, Texas  \n\n## Insurellm Career Progression\n- **March 2018**: Joined Insurellm as a Backend Developer I, focusing on API development for customer management systems.\n- **July 2019**: Promoted to Backend Developer II after successfully leading a team project to revamp the claims processing system, reducing response time by 30%.\n- **June 2021**: Transitioned to Backend Software Engineer with a broader role in architecture and system design, collaborating closely with the DevOps team.\n- **September 2022**: Assigned as the lead engineer for the new "Innovate" initiative, aimed at integrating AI-driven solutions into existing products.\n- **January 2023**: Awarded a mentorship role to guide new hires in backend technology

#### Splitting each `page_content` into overlapping chunks

In [13]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

Created a chunk of size 1088, which is longer than the specified 1000


In [14]:
len(chunks)

123

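> How chunking and overlap work

When `CharacterTextSplitter` processes one `Document` object, it:

1. Takes that document's `page_content`.
2. Splits it on the separator (a blank line, `"\n\n"`, by default) and merges the pieces into chunks of at most `chunk_size` characters. A single piece longer than `chunk_size` is kept whole, which is why the run above warned about a chunk of size 1088.
3. Repeats up to `chunk_overlap` characters of trailing pieces from one chunk at the start of the next, so text near a boundary appears in both chunks (compare the end of `chunks[100]` with the start of `chunks[101]` below).
4. Wraps each chunk in a new `Document` object, with `page_content` set to the chunk text and `metadata` copied from the original document.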
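
The split-and-merge behaviour described above can be sketched in plain Python. This is a simplified illustration, not LangChain's actual implementation: the function name `split_with_overlap` and the toy inputs are assumptions made for this example only.

```python
def split_with_overlap(text, chunk_size, chunk_overlap, separator="\n\n"):
    """Simplified sketch of separator-based chunking with overlap."""
    # Split on the separator and drop empty pieces.
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks = []
    current = []  # pieces that make up the chunk being built

    def length(parts):
        return len(separator.join(parts))

    for piece in pieces:
        # Adding this piece would overflow the chunk: emit what we have.
        if current and length(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Keep trailing pieces as overlap, but only while they fit within
            # chunk_overlap and still leave room for the new piece.
            while current and (
                length(current) > chunk_overlap
                or length(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


# Four short paragraphs; each chunk repeats the last paragraph of the previous one.
demo_chunks = split_with_overlap("aaaa\n\nbbbb\n\ncccc\n\ndddd", chunk_size=10, chunk_overlap=4)
for chunk in demo_chunks:
    print(repr(chunk))

# A single paragraph longer than chunk_size cannot be split and is kept whole.
oversized = split_with_overlap("short\n\n" + "x" * 30, chunk_size=10, chunk_overlap=0)
print([len(chunk) for chunk in oversized])  # [5, 30]
```

The second call shows why a chunk can exceed `chunk_size`: splitting only happens at the separator, so a long paragraph with no blank line inside it stays in one piece.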


In [25]:
print(chunks[100])

page_content='## Compensation History
- **2023:** Base Salary: $115,000 + Bonus: $15,000  
  *Annual bonus based on successful project completions and performance metrics.*

- **2022:** Base Salary: $110,000 + Bonus: $10,000  
  *Slight decrease in bonus due to performance challenges during the year.*

- **2021:** Base Salary: $105,000 + Bonus: $12,000  
  *Merit-based increase, reflecting consistent contributions to the data science team.*

- **2020:** Base Salary: $100,000 + Bonus: $8,000  
  *Initial compensation as Senior Data Scientist, with a focus on building rapport with cross-functional teams.*

## Other HR Notes
- **Professional Development:** Completed several workshops on machine learning and AI applications in insurance. Currently pursuing an online certification in deep learning.

- **Engagement in Company Culture:** Regularly participates in team-building events and contributes to the internal newsletter, sharing insights on data science trends.' metadata={'source': 'kno

In [24]:
print(chunks[101])

page_content='- **Engagement in Company Culture:** Regularly participates in team-building events and contributes to the internal newsletter, sharing insights on data science trends.

- **Areas for Improvement:** Collaboration with engineering teams has been noted as an area needing focus. Samuel has expressed a desire to work closely with tech teams to align data initiatives better.

- **Personal Interests:** Has a keen interest in hiking and photography, often sharing his photography from weekend hikes with colleagues, fostering positive team relationships.' metadata={'source': 'knowledge-base\\employees\\Samuel Trenton.md', 'doc_type': 'employees'}


In [None]:
# Which document types do we have?

doc_types = set(chunk.metadata['doc_type'] for chunk in chunks)
# doc_types is a set, so the order of the joined names can vary between runs.
print(f"Document types found: {', '.join(doc_types)}")

Document types found: company, products, employees, contracts


In [30]:
doc_types

{'company', 'contracts', 'employees', 'products'}
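
Beyond listing the types, it can help to see how many chunks each type contributes. The sketch below assumes only that each chunk has a `metadata` dict with a `doc_type` key, as set earlier; `StandInChunk`, `count_by_doc_type`, and the sample data are hypothetical stand-ins so the example runs without the knowledge base.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class StandInChunk:
    """Hypothetical stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def count_by_doc_type(chunks):
    """Count chunks per doc_type; chunks without the key are counted as 'unknown'."""
    return Counter(chunk.metadata.get("doc_type", "unknown") for chunk in chunks)


# Made-up sample data, not the real knowledge base.
sample_chunks = [
    StandInChunk("# HR Record", {"doc_type": "employees"}),
    StandInChunk("## Compensation History", {"doc_type": "employees"}),
    StandInChunk("# About Insurellm", {"doc_type": "company"}),
]

counts = count_by_doc_type(sample_chunks)
for doc_type, n in sorted(counts.items()):
    print(f"{doc_type}: {n}")
```

In the notebook, `count_by_doc_type(chunks)` could be called on the real `chunks` list in the same way.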

#### Simple Text Search (Limited)

In [31]:
for chunk in chunks:
    if "Lancaster" in chunk.page_content:
        print(chunk)
        print("_______")

page_content='# About Insurellm

Insurellm was founded by Avery Lancaster in 2015 as an insurance tech startup designed to disrupt an industry in need of innovative products. It's first product was Markellm, the marketplace connecting consumers with insurance providers.
It rapidly expanded, adding new products and clients, reaching 200 emmployees by 2024 with 12 offices across the US.' metadata={'source': 'knowledge-base\\company\\about.md', 'doc_type': 'company'}
_______
page_content='# Avery Lancaster

## Summary
- **Date of Birth**: March 15, 1985  
- **Job Title**: Co-Founder & Chief Executive Officer (CEO)  
- **Location**: San Francisco, California  

## Insurellm Career Progression
- **2015 - Present**: Co-Founder & CEO  
  Avery Lancaster co-founded Insurellm in 2015 and has since guided the company to its current position as a leading Insurance Tech provider. Avery is known for her innovative leadership strategies and risk management expertise that have catapulted the company 

In [33]:
for chunk in chunks:
    if "Avery" in chunk.page_content:
        print(chunk)
        print("--------------------")

page_content='# About Insurellm

Insurellm was founded by Avery Lancaster in 2015 as an insurance tech startup designed to disrupt an industry in need of innovative products. It's first product was Markellm, the marketplace connecting consumers with insurance providers.
It rapidly expanded, adding new products and clients, reaching 200 emmployees by 2024 with 12 offices across the US.' metadata={'source': 'knowledge-base\\company\\about.md', 'doc_type': 'company'}
--------------------
page_content='# Avery Lancaster

## Summary
- **Date of Birth**: March 15, 1985  
- **Job Title**: Co-Founder & Chief Executive Officer (CEO)  
- **Location**: San Francisco, California  

## Insurellm Career Progression
- **2015 - Present**: Co-Founder & CEO  
  Avery Lancaster co-founded Insurellm in 2015 and has since guided the company to its current position as a leading Insurance Tech provider. Avery is known for her innovative leadership strategies and risk management expertise that have catapulted

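> These searches only return chunks that contain the literal string. A differently cased or differently worded query (for example, asking about the company's founder without naming her) matches nothing, so simple text search misses a lot of relevant information.
---
> We need some semantic understanding of the data, rather than just pattern matching.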
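
A small sketch of that limitation, using two sentences taken from the chunks printed above. The helper name `keyword_search` and the example queries are assumptions made for this illustration.

```python
def keyword_search(texts, query):
    """Return the texts that contain the query as a literal, case-sensitive substring."""
    return [text for text in texts if query in text]


texts = [
    "Insurellm was founded by Avery Lancaster in 2015 as an insurance tech startup "
    "designed to disrupt an industry in need of innovative products.",
    "Avery Lancaster co-founded Insurellm in 2015 and has since guided the company "
    "to its current position as a leading Insurance Tech provider.",
]

hits_name = keyword_search(texts, "Lancaster")                     # exact name
hits_lower = keyword_search(texts, "lancaster")                    # same name, different case
hits_question = keyword_search(texts, "Who started the company?")  # same intent, different words

print(len(hits_name), len(hits_lower), len(hits_question))  # 2 0 0
```

Both sentences answer the question in the last query, yet neither is returned, because none of them contains that exact string.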