<a href="https://colab.research.google.com/github/NormLorenz/ai-llm-google-colab/blob/main/jupyter-notebooks/text-to-vectors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Expert Knowledge Worker

### A question answering agent that is an expert knowledge worker
### To be used by employees of Insurellm, an Insurance Tech company
### The agent needs to be accurate and the solution should be low cost.

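This project uses RAG (Retrieval-Augmented Generation): the question-answering assistant retrieves relevant company documents and passes them to the model with each question, so answers are grounded in the knowledge base rather than in the model's memory alone.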
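
To make the retrieve-then-prompt idea concrete, here is a minimal sketch. It is a toy under stated assumptions: `tokenize`, `retrieve` and `build_prompt` are hypothetical helpers, and relevance is scored by simple word overlap rather than by vector similarity. The sample strings are short stand-ins for knowledge-base text.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents that share the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(docs, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, context: list[str]) -> str:
    """Place the retrieved context ahead of the question in a single prompt."""
    joined = "\n\n".join(context)
    return f"Use only this context to answer.\n\nContext:\n{joined}\n\nQuestion: {question}"

# Stand-ins for knowledge-base text (assumption: real documents are much longer)
docs = [
    "Insurellm was founded by Avery Lancaster in 2015.",
    "Carllm is an auto insurance product.",
    "Oliver Spencer is a Backend Software Engineer in Austin, Texas.",
]
question = "Who founded Insurellm?"
context = retrieve(question, docs, k=1)
prompt = build_prompt(question, context)
print(prompt)
```

The prompt built this way would then be sent to the chat model; only the retrieval step changes when word overlap is swapped for embeddings.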

In [1]:
# imports

import os
import glob
# from dotenv import load_dotenv
import gradio as gr
from google.colab import userdata
from openai import OpenAI

In [None]:
!pip install langchain-core langchain-text-splitters langchain-openai langchain-chroma langchain-community

In [4]:
# imports for langchain

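# Import from the packages installed above; the bare `langchain` package is not in the pip list
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import CharacterTextSplitter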

In [5]:
# price is a factor for our company, so we're going to use a low-cost model

MODEL = "gpt-4o-mini"
db_name = "vector_db"

In [6]:
# load static files

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [7]:
# keys

openai_api_key = userdata.get("OPENAI_API_KEY")
# claude_api_key = userdata.get("ANTHROPIC_API_KEY")
# google_api_key = userdata.get("GOOGLE_API_KEY")
# hugging_face_token = userdata.get("HF_TOKEN")

# load_dotenv(override=True)
# os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')
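
The commented-out lines above hint at running this notebook outside Colab. One possible way to support both settings is sketched below; `get_secret` is a hypothetical helper, not part of any library, and it assumes the key is available as an environment variable whenever Colab's `userdata` is not.

```python
import os

def get_secret(name: str, default: str = "") -> str:
    """Return a secret from Colab's userdata if possible, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ.get(name, default)
    try:
        return userdata.get(name)
    except Exception:
        # Secret missing or notebook access not granted: fall back to the environment
        return os.environ.get(name, default)

# Hypothetical variable name used only for this demonstration
os.environ["DEMO_SECRET"] = "abc"
demo_value = get_secret("DEMO_SECRET")
missing_value = get_secret("DEMO_SECRET_THAT_IS_NOT_SET", default="fallback")
```

With a helper like this, the cell above could read `openai_api_key = get_secret("OPENAI_API_KEY")` in either environment.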

In [8]:
# initialize

openai = OpenAI(api_key=openai_api_key)

In [9]:
# Read in documents using LangChain's loaders
# Take everything in all the sub-folders of our knowledge base
# Thank you Mark D. and Zoya H. for fixing a bug here.

folders = glob.glob("/content/drive/MyDrive/knowledge-base/*")

# With thanks to CG and Jon R, students on the course, for this fix needed for some users
text_loader_kwargs = {'encoding': 'utf-8'}
# If that doesn't work, some Windows users might need to uncomment the next line instead
# text_loader_kwargs={'autodetect_encoding': True}

documents = []
for folder in folders:
    doc_type = os.path.basename(folder)
    loader = DirectoryLoader(folder, glob="**/*.md", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
    folder_docs = loader.load()
    for doc in folder_docs:
        doc.metadata["doc_type"] = doc_type
        documents.append(doc)
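
To show what the loop above does without the LangChain loaders, here is a rough standard-library sketch of the same tagging idea: each sub-folder name becomes the `doc_type` of the Markdown files beneath it. `load_markdown` and the file names are hypothetical, and records are plain dicts rather than `Document` objects. Unlike the cell above, the sketch also skips top-level entries that are not directories.

```python
import glob
import os
import tempfile

def load_markdown(root: str) -> list[dict]:
    """Read every .md file under each sub-folder of root, tagged with the folder name."""
    records = []
    for folder in sorted(glob.glob(os.path.join(root, "*"))):
        if not os.path.isdir(folder):
            continue  # ignore loose files at the top level
        doc_type = os.path.basename(folder)
        pattern = os.path.join(folder, "**", "*.md")
        for path in sorted(glob.glob(pattern, recursive=True)):
            with open(path, encoding="utf-8") as f:
                records.append({"source": path, "doc_type": doc_type, "text": f.read()})
    return records

# Build a throwaway knowledge base with made-up file names
with tempfile.TemporaryDirectory() as root:
    for doc_type, name in [("company", "about.md"), ("employees", "a.md"), ("employees", "b.md")]:
        os.makedirs(os.path.join(root, doc_type), exist_ok=True)
        with open(os.path.join(root, doc_type, name), "w", encoding="utf-8") as f:
            f.write(f"# {name}\n")
    records = load_markdown(root)

print([(r["doc_type"], os.path.basename(r["source"])) for r in records])
```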

In [10]:
len(documents)

31

In [12]:
documents[24]

Document(metadata={'source': '/content/drive/MyDrive/knowledge-base/employees/Oliver Spencer.md', 'doc_type': 'employees'}, page_content='# HR Record\n\n# Oliver Spencer\n\n## Summary\n- **Date of Birth**: May 14, 1990  \n- **Job Title**: Backend Software Engineer  \n- **Location**: Austin, Texas  \n\n## Insurellm Career Progression\n- **March 2018**: Joined Insurellm as a Backend Developer I, focusing on API development for customer management systems.\n- **July 2019**: Promoted to Backend Developer II after successfully leading a team project to revamp the claims processing system, reducing response time by 30%.\n- **June 2021**: Transitioned to Backend Software Engineer with a broader role in architecture and system design, collaborating closely with the DevOps team.\n- **September 2022**: Assigned as the lead engineer for the new "Innovate" initiative, aimed at integrating AI-driven solutions into existing products.\n- **January 2023**: Awarded a mentorship role to guide new hires 

In [13]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)
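
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a simplified sliding window. This is not LangChain's algorithm (`CharacterTextSplitter` first splits on a separator and then merges the pieces); `sliding_chunks` is a hypothetical helper that only shows how consecutive chunks share text.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(text, chunk_size=10, chunk_overlap=4)
print(pieces)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

The overlap means a sentence that straddles a boundary is likely to appear whole in at least one chunk, at the cost of storing some text twice.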



In [14]:
len(chunks)

123

In [16]:
chunks[6]

Document(metadata={'source': '/content/drive/MyDrive/knowledge-base/contracts/Contract with Roadway Insurance Inc. for Carllm.md', 'doc_type': 'contracts'}, page_content='---\n\n## Support\n\n1. **Technical Support**: Roadway Insurance Inc. will receive priority technical support from Insurellm for any issues arising from the Carllm product.\n2. **Training**: Insurellm will provide up to 5 training sessions for Roadway Insurance Inc. staff on the effective use of the Carllm platform, scheduled at mutual convenience.\n3. **Updates and Maintenance**: Regular updates to the Carllm platform will be conducted quarterly, and any maintenance outages will be communicated at least 48 hours in advance.\n\n---\n\n*This contract outlines the terms of the relationship between Insurellm and Roadway Insurance Inc. for the Carllm product, emphasizing the collaborative spirit aimed at transforming the auto insurance landscape.*')

In [17]:
doc_types = set(chunk.metadata['doc_type'] for chunk in chunks)
print(f"Document types found: {', '.join(doc_types)}")

Document types found: contracts, products, employees, company
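
Beyond listing the types, it can help to see how many chunks each type contributes. The sketch below uses `collections.Counter` on a few stand-in objects; `Doc` and `sample_chunks` are hypothetical and only mimic the `metadata` attribute used above. Sorting the items keeps the printed order stable, which a plain `set` does not guarantee between runs.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in exposing the attributes used in this notebook."""
    page_content: str
    metadata: dict = field(default_factory=dict)

sample_chunks = [
    Doc("text a", {"doc_type": "employees"}),
    Doc("text b", {"doc_type": "contracts"}),
    Doc("text c", {"doc_type": "employees"}),
    Doc("text d", {"doc_type": "company"}),
]

# Tally chunks by their doc_type metadata
counts = Counter(chunk.metadata["doc_type"] for chunk in sample_chunks)
for doc_type, n in sorted(counts.items()):
    print(f"{doc_type}: {n}")
```

In the notebook itself, the same `Counter` expression could be applied to `chunks` directly.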


In [18]:
for chunk in chunks:
    if 'Lancaster' in chunk.page_content:
        print(chunk)
        print("_________")

page_content='# About Insurellm

Insurellm was founded by Avery Lancaster in 2015 as an insurance tech startup designed to disrupt an industry in need of innovative products. It's first product was Markellm, the marketplace connecting consumers with insurance providers.
It rapidly expanded, adding new products and clients, reaching 200 emmployees by 2024 with 12 offices across the US.' metadata={'source': '/content/drive/MyDrive/knowledge-base/company/about.md', 'doc_type': 'company'}
_________
page_content='# Avery Lancaster

## Summary
- **Date of Birth**: March 15, 1985  
- **Job Title**: Co-Founder & Chief Executive Officer (CEO)  
- **Location**: San Francisco, California  

## Insurellm Career Progression
- **2015 - Present**: Co-Founder & CEO  
  Avery Lancaster co-founded Insurellm in 2015 and has since guided the company to its current position as a leading Insurance Tech provider. Avery is known for her innovative leadership strategies and risk management expertise that have 