## Expert Knowledge Worker

#### A question-answering agent that acts as an expert knowledge worker for employees of Insurellm, an insurance tech company. The agent needs to be accurate, and the solution should be low cost.

This project uses RAG (Retrieval-Augmented Generation) so that the question-answering assistant grounds its answers in Insurellm's own documents, which improves accuracy.

In [59]:
from dotenv import load_dotenv
import os
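# Loaders and vector stores are imported from langchain_community,
# since the langchain.* paths for them are deprecated in recent releases
from langchain_community.document_loaders import DirectoryLoader, UnstructuredMarkdownLoader
from langchain.text_splitter import MarkdownTextSplitter, RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_openai import ChatOpenAI
from langchain_community.vectorstores import Chroma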
from typing import List
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

In [12]:
# Load environment variables from .env file
load_dotenv()

openai_api_key = os.getenv('OPENAI_API_KEY')
model_name = "gpt-4o-mini"

In [13]:
ROOT_DIR = "knowledge-base/"

In [14]:
def analyze_document_structure(root_dir: str) -> None:
    """
    Analyze and print the structure of markdown files in the directory
    
    Args:
        root_dir: Root directory to analyze
    """
    try:
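        # Normalize so a trailing slash (e.g. "knowledge-base/") does not skew depth or names
        root_dir = os.path.normpath(root_dir)
        for root, dirs, files in os.walk(root_dir):
            # Depth relative to root_dir: 0 for the root itself, 1 for its children, ...
            rel_path = os.path.relpath(root, root_dir)
            level = 0 if rel_path == os.curdir else rel_path.count(os.sep) + 1
            indent = ' ' * 4 * level
            print(f"{indent}{os.path.basename(root)}/")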
            subindent = ' ' * 4 * (level + 1)
            for f in files:
                if f.endswith('.md'):
                    print(f"{subindent}{f}")
    
    except Exception as e:
        print(f"Error analyzing directory structure: {e}")
        raise

In [None]:
print("Analyzing directory structure...")
analyze_document_structure(ROOT_DIR)

In [32]:
def load_and_tokenize_markdown_files(root_dir: str) -> List:
    """
    Load markdown files from a nested directory structure and split them into chunks
    
    Args:
        root_dir: Root directory containing markdown files
    
    Returns:
        List of document chunks (LangChain Document objects)
    """
    try:
        # Initialize the directory loader for markdown files
        loader = DirectoryLoader(
            root_dir,
            glob="**/*.md",  # Recursively match all .md files
            loader_cls=UnstructuredMarkdownLoader,
            show_progress=True
        )
        
        # Load all documents
        documents = loader.load()
        
        # Initialize the markdown text splitter
        markdown_splitter = MarkdownTextSplitter(
            chunk_size=1000,
            chunk_overlap=200
        )
        
        # Split documents into chunks
        chunks = markdown_splitter.split_documents(documents)
            
        return chunks
    except Exception as e:
        print(f"Error processing markdown files: {e}")
        raise

In [36]:
# Process documents
print("\nProcessing markdown files and getting chunks...")
chunks = load_and_tokenize_markdown_files(ROOT_DIR)


Processing markdown files and getting chunks...


 97%|█████████▋| 31/32 [00:00<00:00, 83.48it/s]
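
To make the `chunk_size=1000` and `chunk_overlap=200` settings concrete, here is a minimal sketch of overlapping character windows in plain Python. It is a simplification and not how `MarkdownTextSplitter` works internally: the real splitter prefers to break on Markdown separators such as headings and paragraphs, whereas this hypothetical `split_with_overlap` helper cuts at fixed offsets.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Illustrative input: 2,500 characters of repeating digits
sample = "".join(str(i % 10) for i in range(2500))
pieces = split_with_overlap(sample)
print([len(p) for p in pieces])
# The last 200 characters of one chunk repeat at the start of the next
print(pieces[0][-200:] == pieces[1][:200])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.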


In [54]:
def get_document_statistics(documents: List) -> dict:
    """
    Get statistics about the processed documents
    
    Args:
        documents: List of processed documents
    
    Returns:
        Dictionary containing document statistics
    """
    try:
        stats = {
            'total_documents': len(documents),
            # Rough proxy: counts whitespace-separated words, not model tokens
            'total_tokens': sum(len(doc.page_content.split()) for doc in documents),
            'average_chunk_size': sum(len(doc.page_content) for doc in documents) / len(documents)
        }
        return stats
    
    except Exception as e:
        print(f"Error calculating document statistics: {e}")
        raise

In [55]:
stats = get_document_statistics(chunks)
print("\nDocument Statistics:")
print(f"Total Documents: {stats['total_documents']}")
print(f"Total Tokens: {stats['total_tokens']}")
print(f"Average Chunk Size: {stats['average_chunk_size']:.2f} characters")


Document Statistics:
Total Documents: 112
Total Tokens: 12598
Average Chunk Size: 820.55 characters


In [48]:
# Set up OpenAI embeddings
embeddings = OpenAIEmbeddings(api_key=openai_api_key)

# Directory where Chroma persists the vector store
db_name = "vector_db"

# If a store already exists, delete its collection so it is rebuilt from scratch
if os.path.exists(db_name):
    Chroma(persist_directory=db_name, embedding_function=embeddings).delete_collection()

# Create vectorstore
vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=db_name)
print(f"Vectorstore created with {vectorstore._collection.count()} documents")

Vectorstore created with 112 documents


In [56]:
collection = vectorstore._collection
sample_embedding = collection.get(limit=1, include=["embeddings"])["embeddings"][0]
dimensions = len(sample_embedding)
print(f"The vectors have {dimensions:,} dimensions")

The vectors have 1,536 dimensions
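
Each chunk is now a 1,536-dimensional vector, and retrieval amounts to finding the stored vectors closest to the query vector. The sketch below illustrates that idea with cosine similarity on made-up 3-dimensional vectors. The chunk ids, the vectors, and the `top_k` helper are all hypothetical, and the distance metric Chroma actually uses depends on how the collection is configured.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          doc_vecs: Dict[str, Sequence[float]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Return the k (id, score) pairs most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up vectors standing in for real embeddings
toy_vectors = {
    "chunk-a": [0.9, 0.1, 0.0],
    "chunk-b": [0.1, 0.9, 0.2],
    "chunk-c": [0.0, 0.3, 0.9],
}
query_vector = [0.2, 0.8, 0.1]

results = top_k(query_vector, toy_vectors, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

A real vector store uses an approximate nearest-neighbour index rather than scoring every vector, but the ranking idea is the same.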


In [63]:
# create a new Chat with OpenAI
llm = ChatOpenAI(temperature=0.7, model_name=model_name)

# set up the conversation memory for the chat
memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)

# the retriever is an abstraction over the VectorStore that will be used during RAG
retriever = vectorstore.as_retriever()

# putting it together: set up the conversation chain with the GPT 4o-mini LLM, the vector store and memory
conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory)

In [81]:
query = "Can you describe Insurellm in a few sentences"
result = conversation_chain.invoke({"question":query})
print(result["answer"])

Insurellm is an innovative insurance tech firm founded in 2015 by Avery Lancaster. The company has grown to 200 employees and operates 12 offices across the US by 2024. Insurellm offers four insurance software products: Carllm (for auto insurance companies), Homellm (for home insurance companies), Rellm (an enterprise platform for the reinsurance sector), and Marketllm (a marketplace connecting consumers with insurance providers). The firm serves over 300 clients worldwide and provides services such as regulatory compliance tools, client and broker portals, and 24/7 technical support.


In [82]:
query = "what are the types of insurances available in Insurellm in few words"
result = conversation_chain.invoke({"question":query})
print(result["answer"])

Insurellm offers products related to the following types of insurance:

1. Auto Insurance (through Carllm, a portal for auto insurance companies)
2. Home Insurance (through Homellm, a portal for home insurance companies)
3. Reinsurance (through Rellm, an enterprise platform for the reinsurance sector)
4. Marketplace services for connecting consumers with insurance providers (through Marketllm) 

However, specific insurance policies or plans are not detailed in the provided context.


# Steps for RAG
1. Get access to all the files in the knowledge-base folder.
2. Use the DirectoryLoader to read the files and create documents.
3. Split these documents into chunks using the MarkdownTextSplitter (or another kind of splitter).
4. Create a vector store from the chunks; Chroma, FAISS, or Pinecone can be used.
5. Use OpenAI embeddings to populate the vector store.
6. Use the LangChain abstraction (ChatOpenAI) to create an OpenAI chat client.
7. Similarly, create a memory (ConversationBufferMemory) for the chat.
8. Put the retriever, memory, and LLM together in the conversation_chain (ConversationalRetrievalChain).
9. Query the conversation chain to see the results.
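
As a rough illustration of what the last two steps accomplish, the sketch below assembles retrieved chunks, chat history, and the user's question into a single prompt string. It is a simplified stand-in, not LangChain's actual prompt: `build_prompt`, its wording, and the sample strings are assumptions made for this example, and ConversationalRetrievalChain by default also rewrites a follow-up question into a standalone one before retrieving.

```python
from typing import List, Tuple


def build_prompt(question: str,
                 retrieved_chunks: List[str],
                 chat_history: List[Tuple[str, str]]) -> str:
    """Combine context, prior turns, and the new question into one prompt."""
    context = "\n\n".join(retrieved_chunks)
    history = "\n".join(f"{role}: {text}" for role, text in chat_history)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Chat history:\n{history}\n\n"
        f"Question: {question}"
    )


# Hypothetical retrieved chunks and one earlier exchange
chunks_for_prompt = [
    "Carllm is a portal for auto insurance companies.",
    "Homellm is a portal for home insurance companies.",
]
history_so_far = [
    ("user", "Can you describe Insurellm in a few sentences"),
    ("assistant", "Insurellm is an insurance tech firm."),
]

prompt = build_prompt("Which product targets auto insurers?", chunks_for_prompt, history_so_far)
print(prompt)
```

Because the model only sees what is placed in the prompt, answer quality depends on the retriever returning the right chunks.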