## Expert Knowledge Worker

### A question answering agent that is an expert knowledge worker
### To be used by employees of Insurellm, an Insurance Tech company
### The agent needs to be accurate and the solution should be low cost.

This project will use RAG (Retrieval Augmented Generation) to ensure our question/answering assistant has high accuracy.

## TODAY:

- Part A: We will divide our documents into CHUNKS
- Part B: We will encode our CHUNKS into VECTORS and store them in Chroma
- Part C: We will visualize our vectors

### PART A: Divide our documents into chunks

In [None]:
import os
import glob
import tiktoken
import numpy as np
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sklearn.manifold import TSNE
import plotly.graph_objects as go

In [12]:
# price is a factor for our company, so we're going to use a low cost model

MODEL = "moonshotai/kimi-k2-instruct-0905"
db_name = "vector_db"
load_dotenv(override=True)
openai_api_key = os.getenv('kimi_k2_api_key')
if openai_api_key:
    print(f"OpenAI API Key exists and begins {openai_api_key[:8]}")
else:
    print("OpenAI API Key not set")


OpenAI API Key exists and begins nvapi-JH


In [3]:
# How many characters in all the documents?

knowledge_base_path = "knowledge-base/**/*.md"
files = glob.glob(knowledge_base_path, recursive=True)
print(f"Found {len(files)} files in the knowledge base")

entire_knowledge_base = ""

for file_path in files:
    with open(file_path, 'r', encoding='utf-8') as f:
        entire_knowledge_base += f.read()
        entire_knowledge_base += "\n\n"

print(f"Total characters in knowledge base: {len(entire_knowledge_base):,}")

Found 30 files in the knowledge base
Total characters in knowledge base: 44,314


In [14]:
# Load everything in the knowledge base using LangChain's loaders

folders = glob.glob("knowledge-base/*")

documents = []
for folder in folders:
    doc_type = os.path.basename(folder)
    loader = DirectoryLoader(folder, glob="**/*.md", loader_cls=TextLoader, loader_kwargs={'encoding': 'utf-8'})
    folder_docs = loader.load()
    for doc in folder_docs:
        doc.metadata["doc_type"] = doc_type
        documents.append(doc)

print(f"Loaded {len(documents)} documents")

Loaded 30 documents


In [17]:
documents[5]

Document(metadata={'source': 'knowledge-base\\education\\college.md', 'doc_type': 'education'}, page_content='# College Education\n\n**Institution:** JIMS, Greater Noida  \n**University:** IP University  \n**Degree:** B.Tech in Computer Science and Engineering  \n**Duration:** 2021 - 2025 (Expected)  \n**Current CGPA:** 7.3\n\n## Academic Focus\n- Core computer science fundamentals\n- Software engineering principles\n- Data structures and algorithms\n- Web technologies and development\n- Artificial Intelligence and Machine Learning\n\n## Key College Achievements\n- Consistently maintaining good academic performance\n- Active participation in technical projects and coding activities\n- Developing practical skills through personal projects and internships\n- Building a strong portfolio of web and game development projects\n\n## Extracurricular Activities\n- Technical project development\n- Self-paced learning through online courses\n- Participating in coding communities\n- Building pract

In [28]:
# Divide into chunks using the RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

print(f"Divided into {len(chunks)} chunks")
print(f"Sample chunk (index 5):\n\n{chunks[5]}")

Divided into 57 chunks
Sample chunk (index 5):

page_content='# Certifications & Courses

## Technical Certifications

### 1. Tableau for Data Analysis
**Skills Learned:** 
- Data visualization techniques
- Creating interactive dashboards
- Connecting to various data sources
- Using calculated fields
- Storytelling with data

### 2. Backend Development (40-Hour Intensive)
**Skills Learned:**
- Building RESTful APIs with Node.js and Express
- Server-side logic implementation
- Database integration with MongoDB
- User authentication systems
- Backend service deployment

### 3. Complete C# Unity 2D Game Development (Updated to Unity 6)
**Skills Learned:**
- Core C# programming for games
- Unity Editor proficiency
- 2D physics and animation systems
- Game object lifecycle management
- UI system implementation
- Full 2D game project development' metadata={'source': 'knowledge-base\\education\\certifications.md', 'doc_type': 'education'}


In [29]:
chunks[10]

Document(metadata={'source': 'knowledge-base\\experience\\intensity_global.md', 'doc_type': 'experience'}, page_content='## Technologies Used\n- **Programming Languages:** Python\n- **ML Frameworks:** PyTorch, TensorFlow, Hugging Face Transformers\n- **AI Tools:** LangChain, LlamaIndex\n- **Vector Databases:** Pinecone, Chroma, FAISS\n- **Development Tools:** Git, Docker, Cloud Platforms\n\n## Key Achievements\n- Successfully deployed multiple fine-tuned LLM models for specific use cases\n- Improved model accuracy through advanced RAG system implementation\n- Contributed to production-level AI solutions\n- Enhanced prompt engineering strategies for better output quality\n\n## Skills Developed\n- Advanced LLM fine-tuning techniques\n- RAG system architecture and implementation\n- AI pipeline development and deployment\n- Cross-functional team collaboration\n- Performance evaluation and optimization')
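
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification, not how `RecursiveCharacterTextSplitter` works internally: the real splitter tries to break on natural separators such as paragraph and line boundaries rather than at fixed offsets. The function name `sliding_chunks` and the toy input are hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Toy input: 25 digits, split into windows of 10 characters that overlap by 4
sample = "".join(str(i % 10) for i in range(25))
for piece in sliding_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(piece)
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.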

### PART B: Make vectors and store in Chroma

In Week 3, you set up a Hugging Face account and got an `HF_TOKEN`.

At this point, you might want to add it to your `.env` file and run `load_dotenv(override=True)`.

(This shouldn't actually be required: `all-MiniLM-L6-v2` is a public model and downloads without a token.)

In [31]:
import os
# Note: these langchain_community classes are deprecated and shadow the
# langchain_chroma / langchain_huggingface versions imported at the top.
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from chromadb.config import Settings

# Requires db_name and chunks from the cells above.

# Pick an embedding model
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Client settings with no server host or port, so Chroma uses local file persistence
client_settings = Settings(
    chroma_server_host=None,
    chroma_server_http_port=None,
)

# Delete any existing collection so the store is rebuilt from scratch
if os.path.exists(db_name):
    Chroma(
        persist_directory=db_name,
        embedding_function=embeddings,
        client_settings=client_settings,
    ).delete_collection()

# Embed the chunks and persist them to disk
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory=db_name,
    client_settings=client_settings,
)

print(f"Vectorstore created with {vectorstore._collection.count()} documents")


Vectorstore created with 57 documents


In [32]:
collection_name = vectorstore._collection.name
print(collection_name)

langchain


In [33]:
# Let's investigate the vectors

collection = vectorstore._collection
count = collection.count()

sample_embedding = collection.get(limit=1, include=["embeddings"])["embeddings"][0]
dimensions = len(sample_embedding)
print(f"There are {count:,} vectors with {dimensions:,} dimensions in the vector store")

There are 57 vectors with 384 dimensions in the vector store
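
Each chunk is now a list of 384 numbers, and retrieval works by comparing these vectors. A common measure is cosine similarity. The sketch below computes it with the standard library on made-up 3-dimensional vectors. The toy vectors and the names `cosine_similarity` and `top_k` are illustrative assumptions, and the distance metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k vectors most similar to the query, best first."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scored[:k]]


# Made-up vectors standing in for chunk embeddings
toy_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
print(top_k([1.0, 0.0, 0.0], toy_vectors, k=2))
```

In a RAG pipeline, the query would be the embedded question, and the top-ranked chunks would be passed to the model as context.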


### PART C: Visualize!

In [35]:
print(vectorstore)
print(dir(vectorstore))



<langchain_community.vectorstores.chroma.Chroma object at 0x0000019643D8FD50>
['_Chroma__query_collection', '_LANGCHAIN_DEFAULT_COLLECTION_NAME', '__abstractmethods__', '__annotations__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_asimilarity_search_with_relevance_scores', '_client', '_client_settings', '_collection', '_cosine_relevance_score_fn', '_embedding_function', '_euclidean_relevance_score_fn', '_get_retriever_tags', '_max_inner_product_relevance_score_fn', '_persist_directory', '_select_relevance_score_fn', '_similarity_search_with_relevance_scores', 'aadd_documents', 'aadd_texts', 'add_documents', 'add_images', 'add_texts', 'a

In [36]:
print("Your DB folder is:", db_name)


Your DB folder is: vector_db


In [41]:
import chromadb
import numpy as np
from chromadb.config import Settings

# Explicitly request local persistence
# (an attempted workaround for a RustBindingsAPI error)
client_settings = Settings(
    chroma_server_host=None,
    chroma_server_http_port=None,
)

# 1. Open the persisted database directly with the Chroma client
client = chromadb.PersistentClient(path=db_name, settings=client_settings)

# 2. Access the collection by the name printed earlier ('langchain')
collection = client.get_collection(name=collection_name)

# 3. Retrieve all data
result = collection.get(include=['embeddings', 'documents', 'metadatas'])

# 4. Extract components
vectors = np.array(result['embeddings'])
documents = result['documents']
metadatas = result['metadatas']

# 5. Map each chunk's doc_type to a plot color
doc_types = [metadata['doc_type'] for metadata in metadatas]
type_list = ['about', 'education', 'experience', 'extras', 'projects', 'skills']
color_list = ['blue', 'green', 'red', 'orange', 'purple', 'brown']
colors = [color_list[type_list.index(t)] for t in doc_types]

ValueError: Could not connect to tenant default_tenant. Are you sure it exists?

In [26]:
# We humans find it easier to visualize things in 2D!
# Reduce the dimensionality of the vectors to 2D using t-SNE
# (t-distributed stochastic neighbor embedding)

tsne = TSNE(n_components=2, random_state=42)
reduced_vectors = tsne.fit_transform(vectors)

# Create the 2D scatter plot
fig = go.Figure(data=[go.Scatter(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    mode='markers',
    marker=dict(size=5, color=colors, opacity=0.8),
    text=[f"Type: {t}<br>Text: {d[:100]}..." for t, d in zip(doc_types, documents)],
    hoverinfo='text'
)])

# Note: 'scene' only applies to 3D plots, so the axis titles are set directly
fig.update_layout(
    title='2D Chroma Vector Store Visualization',
    xaxis_title='x',
    yaxis_title='y',
    width=800,
    height=600,
    margin=dict(r=20, b=10, l=10, t=40)
)

fig.show()

In [43]:
import os
import numpy as np
import plotly.graph_objects as go
from sklearn.manifold import TSNE
import chromadb
from chromadb.config import Settings

# This cell expects vectorstore and db_name from the cells above,
# and falls back to the defaults below if they are not defined.
try:
    collection_name = vectorstore._collection.name
except NameError:
    collection_name = "langchain"  # LangChain's default collection name

try:
    DB_PATH = db_name
except NameError:
    DB_PATH = "./vector_db"

print(f"Using DB Path: {DB_PATH} and Collection Name: {collection_name}")

# Explicitly request local persistence (the same attempted workaround as before)
client_settings = Settings(
    chroma_server_host=None,
    chroma_server_http_port=None,
)

# 1. Initialize the Chroma client
client = chromadb.PersistentClient(
    path=DB_PATH,
    settings=client_settings
)

# 2. Access the Collection
collection = client.get_collection(name=collection_name) 

# 3. Retrieve All Data
result = collection.get(include=['embeddings', 'documents', 'metadatas'])

# 4. Extract Components
vectors = np.array(result['embeddings']) 
documents = result['documents'] 
metadatas = result['metadatas'] 

# 5. Prepare Document Types and Colors
doc_types = [metadata['doc_type'] for metadata in metadatas]
# Mapped lists for coloring
type_list = ['about', 'education', 'experience', 'extras', 'projects', 'skills']
color_list = ['blue', 'green', 'red', 'orange', 'purple', 'brown']

colors = [color_list[type_list.index(t)] for t in doc_types]

print(f"Successfully retrieved {len(vectors)} vectors for visualization.")

Using DB Path: vector_db and Collection Name: langchain


ValueError: Could not connect to tenant default_tenant. Are you sure it exists?
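
Separately from the connection error, the `type_list.index(t)` lookup used for colors raises its own `ValueError` for any `doc_type` missing from the list, for example if a new folder is added to the knowledge base. A dictionary with a fallback color is one way to avoid that. This is only a sketch: `COLOR_BY_TYPE`, `colors_for`, and the gray fallback are assumptions, not part of the original notebook.

```python
# Same type-to-color pairs as the lists in the cell above
COLOR_BY_TYPE = {
    "about": "blue",
    "education": "green",
    "experience": "red",
    "extras": "orange",
    "projects": "purple",
    "skills": "brown",
}


def colors_for(doc_types: list[str], default: str = "gray") -> list[str]:
    """Map each doc_type to its color, using the default for unknown types."""
    return [COLOR_BY_TYPE.get(t, default) for t in doc_types]


# "hobbies" is a made-up type that is not in the mapping
print(colors_for(["education", "skills", "hobbies"]))
```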