## Expert Knowledge Worker

### A question answering agent that is an expert knowledge worker
### To be used by employees of Insurellm, an Insurance Tech company
### The agent needs to be accurate and the solution should be low cost.

This project uses RAG (Retrieval-Augmented Generation) to improve the accuracy of our question-answering assistant by grounding its answers in our own documents.

This first implementation uses a simple, brute-force type of RAG.
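The retrieve-then-augment idea behind RAG can be sketched without any external library. The snippet below is a toy illustration, not this notebook's pipeline: the `knowledge` sentences (paraphrased from the PDF file names), the word-overlap scoring, and the `retrieve` and `build_prompt` helpers are all assumptions made for the example. The notebook itself relies on embeddings and a vector store.

```python
import re

# Hypothetical mini knowledge base, paraphrased from the PDF file names.
knowledge = [
    "Module 1 covers ball control and timing.",
    "Module 3 covers the efficiency serve.",
    "Module 7 covers common youth injuries.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, docs, k=1):
    """Rank docs by how many words they share with the question (toy scoring)."""
    q_words = tokenize(question)
    ranked = sorted(docs, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, docs, k=1):
    """Prepend the retrieved context to the question before it goes to the LLM."""
    context = "\n".join(retrieve(question, docs, k))
    return f"Use this context to answer.\n\nContext:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("Which module covers the serve?", knowledge)
print(prompt)
```

The LLM only ever sees the assembled prompt, so answer quality depends heavily on which context the retrieval step selects.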

# Installing the necessary libraries

In [20]:
!pip install -q pdfplumber

# Importing all the necessary libraries

In [21]:
import os
import glob
import pdfplumber
from dotenv import load_dotenv
from pathlib import Path
import gradio as gr

In [22]:
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.schema import Document
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import numpy as np
import plotly.graph_objects as go
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain.embeddings import HuggingFaceEmbeddings

In [23]:
BASE_DIR = Path().resolve()
pdf_dir  = BASE_DIR / "files_pdf"
text_dir = BASE_DIR / "files_txt"

print("pdf_dir  is a", type(pdf_dir))
print("text_dir is a", type(text_dir))

pdf_dir  is a <class 'pathlib.WindowsPath'>
text_dir is a <class 'pathlib.WindowsPath'>


In [24]:
pdf_dir

WindowsPath('C:/Users/DELL/Documents/Projects/llm_engineering/week6/files_pdf')

In [25]:
pdf_files = sorted(pdf_dir.glob("*.pdf"))
print(f"Found {len(pdf_files)} PDFs:")
for p in pdf_files:
    print(" ", p.name)

Found 5 PDFs:
  SPC - Module 1 - Tactical Framework Resource.pdf
  SPC - Module 1 - Technical Effectiveness. Ball Control worksheet.pdf
  SPC - Module 1 - Timing additional resource.pdf
  SPC - Module 3 - Efficiency Serve.pdf
  SPC - S&C Module 7 - Common Youth Injuries AP.PDF


In [26]:
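# Make sure the output folder exists before writing into it
text_dir.mkdir(parents=True, exist_ok=True)

for pdf_path in pdf_files:
    txt_path = text_dir / f"{pdf_path.stem}.txt"
    with pdfplumber.open(pdf_path) as pdf, \
         open(txt_path, "w", encoding="utf-8") as out_f:
        for i, page in enumerate(pdf.pages, start=1):
            # Mark page boundaries so the source page stays traceable
            out_f.write(f"--- Page {i} ---\n")
            # Fall back to a placeholder when a page yields no text
            out_f.write(page.extract_text() or "[No extractable text]")
            out_f.write("\n\n")
    print(f"Converted {pdf_path.name} → {txt_path.name}")
print("All done!")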

CropBox missing from /Page, defaulting to MediaBox

Converted SPC - Module 1 - Tactical Framework Resource.pdf → SPC - Module 1 - Tactical Framework Resource.txt



Converted SPC - Module 1 - Technical Effectiveness. Ball Control worksheet.pdf → SPC - Module 1 - Technical Effectiveness. Ball Control worksheet.txt



Converted SPC - Module 1 - Timing additional resource.pdf → SPC - Module 1 - Timing additional resource.txt



Converted SPC - Module 3 - Efficiency Serve.pdf → SPC - Module 3 - Efficiency Serve.txt



Converted SPC - S&C Module 7 - Common Youth Injuries AP.PDF → SPC - S&C Module 7 - Common Youth Injuries AP.txt
All done!
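Because each converted file carries `--- Page N ---` markers, the text can later be split back into pages, for example to cite the page an answer came from. The following is a minimal sketch under that assumption; `PAGE_MARKER` and `split_pages` are hypothetical names, and the sample string stands in for a real converted file.

```python
import re

# Matches the "--- Page N ---" marker lines written by the conversion loop.
PAGE_MARKER = re.compile(r"^--- Page (\d+) ---$", re.MULTILINE)

def split_pages(text):
    """Return {page_number: page_text} for text using the '--- Page N ---' layout."""
    parts = PAGE_MARKER.split(text)
    # parts looks like [preamble, "1", text1, "2", text2, ...]; skip the preamble.
    return {int(num): body.strip() for num, body in zip(parts[1::2], parts[2::2])}

# Stand-in for the contents of one converted .txt file.
sample = "--- Page 1 ---\nFirst page text\n\n--- Page 2 ---\n[No extractable text]\n\n"
pages = split_pages(sample)
print(pages)
```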


# Choosing a low-cost OpenAI model and naming the vector database

In [27]:
MODEL = "gpt-4o-mini"
db_name = "vector_db"

# Load environment variables from a file called `.env`

In [28]:
load_dotenv(override=True)
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')
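With the fallback above, a missing key is silently replaced by the placeholder string, and the problem only surfaces later as a failed API call. One possible alternative is to fail early. The sketch below assumes that behaviour is wanted; `require_env` and `DEMO_API_KEY` are hypothetical names used only for illustration.

```python
import os

def require_env(name, placeholder="your-key-if-not-using-env"):
    """Return the variable's value, or raise if it is missing or still the placeholder."""
    value = os.environ.get(name, "")
    if not value or value == placeholder:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value

# Demonstration with a hypothetical variable name, so no real key is touched.
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```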

# This code loads the `.txt` files in `files_txt/` (and any `.md` files in its subfolders), tags each document with its file or folder name as `doc_type` metadata, and splits the documents into smaller text chunks ready for use in a RAG application.

In [29]:

def add_metadata(doc, doc_type):
    doc.metadata["doc_type"] = doc_type
    return doc

text_loader_kwargs = {"encoding": "utf-8"}

documents = []
entries = glob.glob("files_txt/*")

for entry in entries:
    doc_type = os.path.splitext(os.path.basename(entry))[0]

    if os.path.isdir(entry):
        # Load all .md files from the folder
        loader = DirectoryLoader(
            path=entry,
            glob="**/*.md",
            loader_cls=TextLoader,
            loader_kwargs=text_loader_kwargs
        )
        folder_docs = loader.load()
    elif os.path.isfile(entry) and entry.endswith(".txt"):
        # Load a single .txt file
        loader = TextLoader(entry, **text_loader_kwargs)
        folder_docs = loader.load()
    else:
        continue  # skip unsupported file types

    documents.extend([add_metadata(doc, doc_type) for doc in folder_docs])

# Split into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

# Summary
print(f"Total number of chunks: {len(chunks)}")
print(f"Document types found: {set(doc.metadata['doc_type'] for doc in documents)}")

Created a chunk of size 1470, which is longer than the specified 1000
Created a chunk of size 1925, which is longer than the specified 1000
Created a chunk of size 1747, which is longer than the specified 1000
Created a chunk of size 1588, which is longer than the specified 1000
Created a chunk of size 2793, which is longer than the specified 1000
Created a chunk of size 1223, which is longer than the specified 1000
Created a chunk of size 1543, which is longer than the specified 1000
Created a chunk of size 1715, which is longer than the specified 1000
Created a chunk of size 1404, which is longer than the specified 1000
Created a chunk of size 2517, which is longer than the specified 1000
Created a chunk of size 2182, which is longer than the specified 1000
Created a chunk of size 3136, which is longer than the specified 1000
Created a chunk of size 3068, which is longer than the specified 1000
Created a chunk of size 3008, which is longer than the specified 1000
Created a chunk of s

Total number of chunks: 80
Document types found: {'SPC - Module 3 - Efficiency Serve', 'SPC - Module 1 - Tactical Framework Resource', 'SPC - Module 1 - Timing additional resource', 'SPC - S&C Module 7 - Common Youth Injuries AP', 'SPC - Module 1 - Technical Effectiveness. Ball Control worksheet'}
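The "longer than the specified 1000" warnings are what one would expect from a splitter that only cuts at a separator: if a single stretch of text between two separators is longer than the limit, it cannot be divided further. The sketch below illustrates that behaviour in simplified form. It is not LangChain's implementation, it ignores `chunk_overlap`, and the `split_on_separator` helper and the blank-line separator are assumptions.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size.

    A piece that is longer than chunk_size on its own is kept whole,
    which is how an oversized chunk can arise.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

# The middle piece has no separator inside it, so it exceeds the limit.
text = "short one\n\n" + "x" * 30 + "\n\nshort two"
chunks = split_on_separator(text, chunk_size=20)
print([len(c) for c in chunks])
```

If oversized chunks turn out to hurt retrieval, a splitter that falls back to finer separators is one option worth evaluating.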


In [30]:
embeddings = OpenAIEmbeddings()
if os.path.exists(db_name):
    Chroma(persist_directory=db_name, embedding_function=embeddings).delete_collection()
vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=db_name)
print(f"Vectorstore created with {vectorstore._collection.count()} documents")

Vectorstore created with 80 documents


In [31]:
collection = vectorstore._collection
count = collection.count()

sample_embedding = collection.get(limit=1, include=["embeddings"])["embeddings"][0]
dimensions = len(sample_embedding)
print(f"There are {count:,} vectors with {dimensions:,} dimensions in the vector store")

There are 80 vectors with 1,536 dimensions in the vector store
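Each of the 80 chunks is now a point in a 1,536-dimensional space, and retrieval amounts to finding the stored vectors closest to the query's vector. The sketch below shows that idea with cosine similarity on tiny made-up vectors. The `cosine_similarity` and `top_k` helpers and the 3-dimensional data are assumptions, and the distance metric actually used by the vector store may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, vectors, k=2):
    """Return the indices of the k stored vectors most similar to the query."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine_similarity(query_vec, vectors[i]),
                    reverse=True)
    return ranked[:k]

# Hypothetical 3-dimensional vectors standing in for the 1,536-dimensional ones.
stored = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
print(top_k([1.0, 0.1, 0.0], stored))
```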


In [32]:
llm = ChatOpenAI(temperature=0.7, model_name=MODEL)
memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
retriever = vectorstore.as_retriever()
conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory)
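Conceptually, the chain combines three parts: a memory of earlier turns, a retriever that fetches context for the current question, and an LLM that answers from both. The sketch below mimics that flow with stubs. It is a conceptual stand-in rather than LangChain's implementation (the real chain does more), and `ToyConversationalChain`, `fake_llm`, and `fake_retriever` are hypothetical names.

```python
class ToyConversationalChain:
    """Conceptual stand-in: remembers turns and retrieves context per question."""

    def __init__(self, llm, retriever):
        self.llm = llm              # callable: prompt string -> answer string
        self.retriever = retriever  # callable: question string -> list of context strings
        self.history = []           # list of (question, answer) pairs, i.e. the memory

    def invoke(self, question):
        # Fold earlier turns into the prompt so follow-up questions keep their context.
        past = "\n".join(f"Q: {q}\nA: {a}" for q, a in self.history)
        context = "\n".join(self.retriever(question))
        prompt = f"History:\n{past}\n\nContext:\n{context}\n\nQuestion: {question}"
        answer = self.llm(prompt)
        self.history.append((question, answer))
        return {"answer": answer}

# Hypothetical stubs in place of a real model and vector store.
def fake_llm(prompt):
    return f"(answer based on {len(prompt)} prompt characters)"

def fake_retriever(question):
    return ["stub context chunk"]

chain = ToyConversationalChain(fake_llm, fake_retriever)
chain.invoke("What is a serve?")
second = chain.invoke("How do I practise it?")
print(len(chain.history), second["answer"])
```

Because the memory lives inside the chain object, creating a fresh chain (as the notebook does before launching the UI) starts the conversation from an empty history.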

In [33]:
query = "Explain Tennis in 200 words"
result = conversation_chain.invoke({"question": query})
print(result["answer"])

Tennis is a popular racquet sport played individually (singles) or in pairs (doubles) on a rectangular court divided by a net. The objective is to hit a ball over the net into the opponent's court, aiming to score points by making the ball land in the designated areas while preventing the opponent from returning it. Players use a racquet to strike the ball, employing various strokes such as forehands, backhands, serves, and volleys.

The game is played on different surfaces, including grass, clay, and hard courts, each affecting the ball's speed and bounce. Matches are organized into sets, with players required to win a specific number of games to win a set. Major tournaments, such as Wimbledon and the US Open, attract global attention, showcasing elite talent.

Tennis requires physical fitness, agility, and strategic thinking. Players must anticipate opponents' moves, adapt their play styles, and manage their time effectively during matches. The sport also emphasizes sportsmanship and

In [34]:
memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory)

In [35]:
def chat(question, history):
    result = conversation_chain.invoke({"question": question})
    return result["answer"]

In [48]:
view = gr.ChatInterface(
    fn=chat,
    type="messages",
    title="GEMS",            # <-- this sets the page <title> and header text
).launch(share=True)

* Running on local URL:  http://127.0.0.1:7878
* Running on public URL: https://08d35582a4af657b02.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
