# RAG part 2

## Load data and split into chunks with LangChain

### Expert Knowledge Worker

A question-answering agent that acts as an expert knowledge worker,
to be used by employees of Insurellm, an insurance tech company.
The agent needs to be accurate, and the solution should be low cost.

This project will use RAG (Retrieval Augmented Generation) to ensure our question-answering assistant has high accuracy.

In [1]:
# imports

import os
import glob
from dotenv import load_dotenv
import gradio as gr

In [2]:
# imports for langchain

# Importing the 'DirectoryLoader' class from 'langchain.document_loaders' to load multiple documents 
#           from a specified directory, treating each file as a separate document.
# Importing the 'TextLoader' class from 'langchain.document_loaders' to load a single text document 
#           from a specified file and process it as a single document.
# Importing the 'CharacterTextSplitter' class from 'langchain.text_splitter' to split text 
#           documents into smaller chunks based on character-level criteria.


from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter

In [3]:
# Load environment variables from a file called .env

load_dotenv()
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')

## 1. Load Data with LangChain

In [5]:
# The DirectoryLoader reads all markdown files (*.md) from the knowledge-base folder and its subfolders

# Using 'glob.glob' to find all subdirectories in the "knowledge-base/" directory.
# Initializing an empty list 'documents' to store all loaded documents with metadata.
# Iterating through each folder found in the "knowledge-base/" directory.
# Extracting the folder name (base name) from its complete path to use as the document type.
# The DirectoryLoader class scans a specified folder (and optionally its subdirectories) 
# to find files that match specific criteria.
# - glob: Specifies the pattern of files to load. 
#   For example, "**/*.md" means it will recursively search for all Markdown files (.md) 
#   in the directory and its subdirectories.
# - loader_cls: Specifies the loader class to use for processing each file. 
#   In this case, TextLoader is used. TextLoader reads the content of each file and
#   converts it into a Document object.
# DirectoryLoader only holds the configuration; no files are read until load() is called,
# which loads the matching files from the given folder.
# Iterating over each loaded Document and storing its document type (the folder name)
# in its metadata under the key "doc_type".
# Appending each tagged document to the 'documents' list.


folders = glob.glob("knowledge-base/*")

documents = []
for folder in folders:
    doc_type = os.path.basename(folder)
    loader = DirectoryLoader(folder, glob="**/*.md", loader_cls=TextLoader)
    folder_docs = loader.load()
    for doc in folder_docs:
        doc.metadata["doc_type"] = doc_type
        documents.append(doc)
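
The loading loop above can be pictured with the standard library alone. The following is a minimal sketch, not LangChain's implementation: the helper name `load_markdown_with_doc_type`, the plain dictionaries standing in for `Document` objects, and the throwaway file names are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def load_markdown_with_doc_type(root):
    """Return one record per .md file under root, tagged with its top-level folder name."""
    records = []
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue
        # rglob searches the folder and all of its subfolders, like glob="**/*.md"
        for path in sorted(folder.rglob("*.md")):
            records.append({
                "page_content": path.read_text(encoding="utf-8"),
                "metadata": {"source": str(path), "doc_type": folder.name},
            })
    return records


# Tiny throwaway knowledge base to exercise the helper (file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "employees").mkdir()
    (Path(tmp) / "products" / "nested").mkdir(parents=True)
    (Path(tmp) / "employees" / "a.md").write_text("# A", encoding="utf-8")
    (Path(tmp) / "products" / "nested" / "b.md").write_text("# B", encoding="utf-8")
    (Path(tmp) / "products" / "notes.txt").write_text("ignored", encoding="utf-8")
    demo_records = load_markdown_with_doc_type(tmp)

print([record["metadata"]["doc_type"] for record in demo_records])
```

The point of the sketch is the tagging step: every file inherits its `doc_type` from the top-level folder it lives in, however deeply it is nested, and files that do not match the pattern are skipped.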

In [6]:
len(documents)

31

In [7]:
documents[24]

Document(metadata={'source': 'knowledge-base/employees/Maxine Thompson.md', 'doc_type': 'employees'}, page_content="# HR Record\n\n# Maxine Thompson\n\n## Summary\n- **Date of Birth:** January 15, 1991  \n- **Job Title:** Data Engineer  \n- **Location:** Austin, Texas  \n\n## Insurellm Career Progression\n- **January 2017 - October 2018**: **Junior Data Engineer**  \n  * Maxine joined Insurellm as a Junior Data Engineer, focusing primarily on ETL processes and data integration tasks. She quickly learned Insurellm's data architecture, collaborating with other team members to streamline data workflows.  \n- **November 2018 - December 2020**: **Data Engineer**  \n  * In her new role, Maxine expanded her responsibilities to include designing comprehensive data models and improving data quality measures. Though she excelled in technical skills, communication issues with non-technical teams led to some project delays.  \n- **January 2021 - Present**: **Senior Data Engineer**  \n  * Maxine wa

## 2. Split Documents into Chunks

In [8]:
# Creating an instance of 'CharacterTextSplitter' to divide documents into smaller, overlapping chunks.
# Setting 'chunk_size' to 1000, the target maximum number of characters per chunk.
# Setting 'chunk_overlap' to 200, so consecutive chunks can share up to 200 characters.
# Using the '.split_documents()' method of the 'CharacterTextSplitter' object to split
# the list of documents ('documents') into smaller chunks.
# Each chunk is an instance of the LangChain Document class with the attributes .metadata and .page_content,
# so both the document type and the content of a chunk can be searched.
# The resulting 'chunks' list contains all the chunks of all documents.

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

Created a chunk of size 1088, which is longer than the specified 1000


In [9]:
# Check length of the list of chunks

len(chunks)

123

In [10]:
# Check content of chunk at index 6 of list of chunks

chunks[6]

Document(metadata={'source': 'knowledge-base/products/Markellm.md', 'doc_type': 'products'}, page_content='- **User-Friendly Interface**: Designed with user experience in mind, Markellm features an intuitive interface that allows consumers to easily browse and compare various insurance offerings from multiple providers.\n\n- **Real-Time Quotes**: Consumers can receive real-time quotes from different insurance companies, empowering them to make informed decisions quickly without endless back-and-forth communication.\n\n- **Customized Recommendations**: Based on user profiles and preferences, Markellm provides personalized insurance recommendations, ensuring consumers find the right coverage at competitive rates.\n\n- **Secure Transactions**: Markellm prioritizes security, employing robust encryption methods to ensure that all transactions and data exchanges are safe and secure.\n\n- **Customer Support**: Our dedicated support team is always available to assist both consumers and insurer

### How Chunks are Split in `CharacterTextSplitter`

`CharacterTextSplitter` splits mechanically: it cuts the text at a separator string and then merges the resulting pieces into chunks by size and overlap. Meaning plays no role.

The `CharacterTextSplitter` builds chunks based on the following criteria:

#### 1. **Character Count (`chunk_size`)**
- Pieces are merged into a chunk until adding the next piece would exceed the number of characters given by `chunk_size`.
- For example, if `chunk_size=1000`, each chunk will normally hold up to 1000 characters.
- A single piece that is longer than `chunk_size` is not cut further. It becomes an oversized chunk, which is what the warning `Created a chunk of size 1088, which is longer than the specified 1000` reports.

#### 2. **Overlap (`chunk_overlap`)**
- Consecutive chunks can share text: the pieces at the end of one chunk are repeated at the start of the next, up to the number of characters given by `chunk_overlap`.
- For instance, with `chunk_overlap=200`, up to 200 characters from the end of one chunk also appear at the beginning of the next. Because the overlap is made of whole pieces, it is often shorter than 200 characters and can be empty.
- This overlap preserves context across chunk boundaries, so a passage that straddles a boundary can still be retrieved together with some of its surrounding text.

#### 3. **Separator (`separator`)**
- The `CharacterTextSplitter` cuts the text only where the `separator` string occurs. The default is `"\n\n"`, so by default chunks break only at paragraph breaks (blank lines).
- It does not look for the ends of sentences or words on its own. To break at other boundaries, pass a different `separator` (for example `"\n"` or `" "`), or use LangChain's `RecursiveCharacterTextSplitter`, which tries a list of separators in turn.

#### Example
Given the following text:

```
This is a simple example document. It explains the splitting process in detail. Each chunk has up to 50 characters with overlaps of 20 for context.
```

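With `chunk_size=50`, `chunk_overlap=20`, and `separator=" "`, the resulting chunks would look roughly like this (with the default separator `"\n\n"`, which does not occur in this one-line text, the whole text would come back as a single oversized chunk):
- **Chunk 1:** `"This is a simple example document. It explains the"`
- **Chunk 2:** `"It explains the splitting process in detail. Each"`
- **Chunk 3:** `"in detail. Each chunk has up to 50 characters with"`
- **Chunk 4:** `"50 characters with overlaps of 20 for context."`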
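
The split-then-merge behaviour can be sketched in a few lines of plain Python. This is a simplified approximation written for illustration, not LangChain's actual code: the function name `split_on_separator` is hypothetical, and details such as whitespace stripping and warnings are left out.

```python
def split_on_separator(text, chunk_size, chunk_overlap, separator="\n\n"):
    """Simplified split-then-merge chunking (illustration only)."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks = []
    current = []  # pieces collected for the chunk being built

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            # The next piece does not fit: emit the chunk built so far.
            chunks.append(separator.join(current))
            # Keep trailing pieces as overlap; drop leading pieces until the
            # carry-over is within chunk_overlap and leaves room for the next piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


sample = (
    "This is a simple example document. It explains the splitting process "
    "in detail. Each chunk has up to 50 characters with overlaps of 20 for context."
)

# Splitting on spaces lets the text break between words.
word_chunks = split_on_separator(sample, chunk_size=50, chunk_overlap=20, separator=" ")
for chunk in word_chunks:
    print(len(chunk), repr(chunk))

# With the default separator the one-line sample contains no "\n\n",
# so it comes back as a single chunk longer than chunk_size.
default_chunks = split_on_separator(sample, chunk_size=50, chunk_overlap=20)
print(len(default_chunks), len(default_chunks[0]) > 50)
```

Two things are worth noticing in the sketch. The overlap is built from whole pieces, so it is usually shorter than `chunk_overlap`. And a piece that contains no separator is never cut, which is why a chunk can end up longer than `chunk_size`.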


In [11]:
# Use the attribute .metadata of a chunk (an instance of the Document class)

doc_types = set(chunk.metadata['doc_type'] for chunk in chunks)
print(f"Document types found: {', '.join(doc_types)}")

Document types found: products, contracts, company, employees
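
Beyond listing the document types, the same two attributes can be used to count and filter chunks. The sketch below runs without LangChain: `FakeChunk` is a hypothetical stand-in that only mimics `.page_content` and `.metadata`, and the four sample chunks are made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FakeChunk:
    """Hypothetical stand-in exposing the two attributes used in this notebook."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Made-up sample data, not taken from the knowledge base.
demo_chunks = [
    FakeChunk("Title: CEO", {"doc_type": "contracts"}),
    FakeChunk("Job Title: Data Engineer", {"doc_type": "employees"}),
    FakeChunk("Our CEO founded the company.", {"doc_type": "company"}),
    FakeChunk("Name: Jane Smith", {"doc_type": "contracts"}),
]

# Chunks per document type; sorting gives a stable order, unlike iterating a set.
counts_by_type = Counter(chunk.metadata["doc_type"] for chunk in demo_chunks)
for doc_type, count in sorted(counts_by_type.items()):
    print(f"{doc_type}: {count}")

# Document types of the chunks that mention a keyword.
ceo_types = sorted(
    {chunk.metadata["doc_type"] for chunk in demo_chunks if "CEO" in chunk.page_content}
)
print(ceo_types)
```

Because `FakeChunk` exposes the same attribute names, the counting and filtering lines should work unchanged on the real `chunks` list.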


In [12]:
# Use the attribute .page_content of a chunk (an instance of the Document class)

for chunk in chunks:
    if 'CEO' in chunk.page_content:
        print(chunk)
        print("_________")

page_content='## Support

1. **Customer Support**: Velocity Auto Solutions will have access to Insurellm’s customer support team via email or chatbot, available 24/7.  
2. **Technical Maintenance**: Regular maintenance and updates to the Carllm platform will be conducted by Insurellm, with any downtime communicated in advance.  
3. **Training & Resources**: Initial training sessions will be provided for Velocity Auto Solutions’ staff to ensure effective use of the Carllm suite. Regular resources and documentation will be made available online.

---

**Accepted and Agreed:**  
**For Velocity Auto Solutions**  
Signature: _____________________  
Name: John Doe  
Title: CEO  
Date: _____________________  

**For Insurellm**  
Signature: _____________________  
Name: Jane Smith  
Title: VP of Sales  
Date: _____________________' metadata={'source': 'knowledge-base/contracts/Contract with Velocity Auto Solutions for Carllm.md', 'doc_type': 'contracts'}
_________
page_content='3. **Regular U