In [2]:
## Install specific versions of the LangChain packages to avoid future compatibility issues
%pip install -U -q imapclient langchain==1.0.2 langchain-openai==1.0.1 langchain-chroma==1.0.0 langchain-community==0.4 langchain-core==1.0.0 langchain-text-splitters==1.0.0 langchain-huggingface==1.0.0 langchain-classic==1.0.0 chromadb==1.2.1 sentence-transformers==5.1.2 plotly scikit-learn

Note: you may need to restart the kernel to use updated packages.


## Expert Knowledge Worker

### A question answering agent that is an expert knowledge worker
### To be used by employees of Insurellm, an Insurance Tech company
### The agent needs to be accurate and the solution should be low cost.

This project will use RAG (Retrieval-Augmented Generation) to ensure our question-answering assistant has high accuracy.

## TODAY:

- Part A: We will divide our documents into CHUNKS
- Part B: We will encode our CHUNKS into VECTORS and store them in Chroma
- Part C: We will visualize our vectors

In [21]:
import os
from openai import OpenAI
import glob
import tiktoken
import numpy as np
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sklearn.manifold import TSNE
import plotly.graph_objects as go

In [22]:
# Price is a factor for our company, so we're going to use a low-cost model
MODEL = "gpt-4.1-nano"
db_name = "vector_db"
load_dotenv(override=True)

openai_api_key = os.getenv("OPENAI_API_KEY")

if not openai_api_key:
    raise ValueError("OPENAI_API_KEY is not set")

else:
    print("OPENAI_API_KEY is set and begins with: ", openai_api_key[:6])

client = OpenAI(api_key=openai_api_key)



OPENAI_API_KEY is set and begins with:  sk-pro


In [23]:
# How many characters in all the documents?
knowledge_base_path = "knowledge-base/**/*.md"
files = glob.glob(knowledge_base_path, recursive=True)
print(f"Found {len(files)} files")
entire_knowledge_base = ""

for file_path in files:
    with open(file_path, "r", encoding="utf-8") as f:
        entire_knowledge_base += f.read()
        entire_knowledge_base += "\n\n"
print(f"Total characters in all documents: {len(entire_knowledge_base)}")





Found 76 files
Total characters in all documents: 304434


In [24]:
# Find how many tokens all of this will take
encoding = tiktoken.encoding_for_model(MODEL)
tokens = encoding.encode(entire_knowledge_base)
token_count = len(tokens)
print(f"Total tokens in all documents: {token_count:,}")



Total tokens in all documents: 63,555
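As a quick sanity check, the two totals printed above give the average number of characters per token. This is only a sketch: the figures are copied by hand from the cell outputs, and the variable names are illustrative rather than part of the notebook.

```python
# Figures copied from the cell outputs above (not recomputed here).
total_characters = 304_434
total_tokens = 63_555

# Average characters per token across the whole knowledge base.
chars_per_token = total_characters / total_tokens
print(f"Average characters per token: {chars_per_token:.2f}")
```

A ratio in this range suggests the text is mostly ordinary English prose. It is also a handy way to estimate a token count from a character count without running the tokenizer again.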


In [25]:
# Load everything in the knowledge base using LangChain's document loaders

folders = glob.glob("knowledge-base/*")

documents = []

for folder in folders:
    doc_type = os.path.basename(folder)
    loader = DirectoryLoader(folder, glob="**/*.md", loader_cls=TextLoader, loader_kwargs={"encoding": "utf-8"})
    folder_docs = loader.load()
    for doc in folder_docs:
        doc.metadata["doc_type"] = doc_type
        documents.append(doc)

print(f"Loaded {len(documents)} documents in the knowledge base")



Loaded 76 documents in the knowledge base
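The loop above tags every document with a `doc_type` taken from the name of its top-level folder. The sketch below shows the same idea on plain path strings, using only the standard library. It assumes the `source` paths are relative and start with `knowledge-base/`, as they do in this notebook's outputs; `doc_type_from_source` is a hypothetical helper, not part of LangChain.

```python
from collections import Counter
from pathlib import PurePosixPath

# Source paths as they appear in the document metadata printed in this notebook.
sources = [
    "knowledge-base/products/Claimllm.md",
    "knowledge-base/products/Rellm.md",
    "knowledge-base/contracts/Contract with National Claims Network for Claimllm.md",
]


def doc_type_from_source(source: str) -> str:
    """Return the first folder under knowledge-base/ (hypothetical helper).

    Assumes the path is relative and begins with 'knowledge-base/'.
    """
    return PurePosixPath(source).parts[1]


# Count how many of the sample paths fall under each doc_type.
counts = Counter(doc_type_from_source(s) for s in sources)
print(counts)
```

Counting documents per `doc_type` like this is a quick way to confirm that the metadata was attached as intended before moving on to chunking.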


In [26]:
documents[1]

Document(metadata={'source': 'knowledge-base/products/Claimllm.md', 'doc_type': 'products'}, page_content="# Product Summary\n\n# Claimllm\n\n## Summary\n\nClaimllm is Insurellm's revolutionary claims processing platform that transforms the claims experience for insurers, adjusters, and policyholders. Powered by advanced AI, machine learning, and computer vision, Claimllm automates claims handling across all insurance lines—from first notice of loss through final settlement. By dramatically reducing processing time, improving accuracy, and enhancing fraud detection, Claimllm enables insurers to deliver exceptional claims service while significantly reducing operational costs. The platform seamlessly integrates with existing policy administration and core systems to create a unified insurance ecosystem.\n\n## Features\n\n### 1. Intelligent FNOL Processing\nClaimllm's AI-powered first notice of loss intake captures claim details through multiple channels including mobile apps, web portal

In [27]:
# Divide into chunks using the RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

print(f"Divided into {len(chunks)} chunks")
print(f"First chunk:\n\n{chunks[0]}")

Divided into 413 chunks
First chunk:

page_content='# Product Summary

# Rellm: AI-Powered Enterprise Reinsurance Solution

## Summary

Rellm is an innovative enterprise reinsurance product developed by Insurellm, designed to transform the way reinsurance companies operate. Harnessing the power of artificial intelligence, Rellm offers an advanced platform that redefines risk management, enhances decision-making processes, and optimizes operational efficiencies within the reinsurance industry. With seamless integrations and robust analytics, Rellm enables insurers to proactively manage their portfolios and respond to market dynamics with agility.

## Features

### AI-Driven Analytics
Rellm utilizes cutting-edge AI algorithms to provide predictive insights into risk exposures, enabling users to forecast trends and make informed decisions. Its real-time data analysis empowers reinsurance professionals with actionable intelligence.' metadata={'source': 'knowledge-base/products/Rellm.md', '

In [28]:
chunks[100]

Document(metadata={'source': 'knowledge-base/contracts/Contract with National Claims Network for Claimllm.md', 'doc_type': 'contracts'}, page_content="7. **Business Continuity:** Insurellm provides disaster recovery with 4-hour RTO (Recovery Time Objective) and 1-hour RPO (Recovery Point Objective).\n\n---\n\n## Renewal\n\nThis agreement includes a mutual 120-day renewal notice period. National Claims Network receives guaranteed enterprise pricing for renewal equal to or better than new enterprise customers at renewal time. Contract may be extended in 12-month increments with mutual written agreement.\n\n---\n\n## Features\n\nNational Claims Network will receive the complete Claimllm Enterprise suite:\n\n1. **Unlimited Claims Processing:** No volume restrictions, supporting National's processing of 100,000+ claims annually with scalability to 500,000+ claims as business grows.\n\n2. **White-Label Platform:** Complete branding customization including:\n   - Custom domain names (claims.n
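To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sketch that slices a string into fixed-size windows which share some characters with their neighbours. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break on separators such as paragraph and line boundaries, and treats `chunk_size` as an upper bound. The function name `split_with_overlap` is an assumption made for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slice text into fixed-size windows that overlap (simplified illustration)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
# abcdefghij
# ghijklmnop
# mnopqrstuv
# stuvwxyz
```

Each window repeats the last four characters of the previous one. The overlap exists so that a sentence falling on a chunk boundary still appears intact in at least one chunk.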

### PART B: Make vectors and store in Chroma

In Week 3, you set up a Hugging Face account and got an `HF_TOKEN`.

At this point, you might want to add it to your `.env` file and run `load_dotenv(override=True)`.

(This shouldn't actually be required.)