## Expert Knowledge Worker

### A question answering agent that is an expert knowledge worker
### To be used by employees of Insurellm, an Insurance Tech company
### The agent needs to be accurate and the solution should be low cost.

This project will use RAG (Retrieval Augmented Generation) to ensure our question/answering assistant has high accuracy.
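
To make the idea concrete before the real pipeline, here is a minimal sketch of the two RAG steps: retrieve relevant chunks, then place them in the prompt. The functions `retrieve` and `build_prompt` are hypothetical names, the word-overlap scoring is only a stand-in for the vector search used later in this notebook, and `toy_chunks` is made-up sample text.

```python
from typing import List


def retrieve(question: str, chunks: List[str], k: int = 2) -> List[str]:
    """Rank chunks by shared words with the question (a toy stand-in for vector search)."""
    question_words = set(question.lower().split())
    scored = [(len(question_words & set(chunk.lower().split())), chunk) for chunk in chunks]
    # Python's sort is stable, so chunks with equal scores keep their original order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored[:k] if score > 0]


def build_prompt(question: str, context: List[str]) -> str:
    """Place the retrieved chunks ahead of the question so the model answers from them."""
    joined = "\n\n".join(context)
    return f"Use only this context to answer.\n\nContext:\n{joined}\n\nQuestion: {question}"


# Made-up sample chunks, for illustration only
toy_chunks = [
    "The S50 is a portable boring machine for field repair.",
    "Leads come from ads and referrals.",
    "Spare parts are available for the S50 machine.",
]
question = "What is the S50 machine?"
top = retrieve(question, toy_chunks)
prompt = build_prompt(question, top)
print(prompt)
```

The unrelated chunk about leads never reaches the prompt, which is the point of RAG: the model sees only the context likely to matter, which keeps the prompt short and the answer grounded.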

In [2]:
# Read in documents using LangChain's loaders
# Take everything in all the sub-folders of our knowledgebase

import os
import glob
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import CharacterTextSplitter

from typing import List, Optional

def load_documents(base_path: str, documents: List, text_loader_kwargs: Optional[dict] = None):
    """
    Load .md files from a base path. Uses DirectoryLoader for folders that contain subdirectories
    and TextLoader for folders that contain only .md files.

    Args:
        base_path (str): Path containing the files or subfolders.
        documents (List): List to which the loaded documents are appended.
        text_loader_kwargs (dict, optional): Options passed to TextLoader.
    """
    if text_loader_kwargs is None:
        text_loader_kwargs = {'encoding': 'utf-8'}

    # Check whether there are subdirectories
    subfolders = [f for f in glob.glob(os.path.join(base_path, "*")) if os.path.isdir(f)]

    if subfolders:
        # If there are subfolders, use a DirectoryLoader for each one
        for folder in subfolders:
            doc_type = os.path.basename(folder)
            print(f"[DirectoryLoader] Carregando de: {doc_type}")
            loader = DirectoryLoader(
                folder,
                glob="**/*.md",
                loader_cls=TextLoader,
                loader_kwargs=text_loader_kwargs
            )
            folder_docs = loader.load()
            for doc in folder_docs:
                doc.metadata["doc_type"] = doc_type
                documents.append(doc)
    else:
        # If there are no subfolders, assume the folder contains .md files directly
        md_files = glob.glob(os.path.join(base_path, "*.md"))
        for file_path in md_files:
            print(f"[TextLoader] Carregando arquivo: {os.path.basename(file_path)}")
            loader = TextLoader(file_path, **text_loader_kwargs)
            doc = loader.load()[0]
            doc.metadata["doc_type"] = os.path.basename(base_path)
            documents.append(doc)


documents = []

load_documents("../preprocessing/knowledge-base", documents)

print(f"Total de documentos carregados: {len(documents)}")      
print(documents)  

[DirectoryLoader] Carregando de: about
[DirectoryLoader] Carregando de: products
Total de documentos carregados: 12
[Document(metadata={'source': '..\\preprocessing\\knowledge-base\\about\\Estratégia de Expansão - GPT.md', 'doc_type': 'about'}, page_content='# Expansão da Operação e Aumento do Faturamento\n\n## 1. Diagnóstico Inicial\n\n- Atuação no Brasil com foco em mandriladoras portáteis para reparo em campo.\n- Leads gerados por Google Ads, indicações e repasses do fabricante.\n- Vendas apenas para leads inbound, sem cold call.\n- Concorrência com produtos mais baratos (menor qualidade) ou mais caros.\n- Nichos: mineradoras e construção civil.\n- Oferece suporte pós-venda e peças de reposição.\n- Não realiza aluguel nem financiamento próprio (somente via bancos).\n\n## 2. Estruturação Interna\n\n- **Playbook de Vendas**: Padroniza atendimento e melhora conversão.\n- **Bot WhatsApp**: Automatiza triagem e qualificação inicial.\n- **CRM (RD Station)**: Gerencia leads, etapas e inter

# Please note:

In the next cell, we split the text into chunks.

Two students reported that the next cell crashed their computers.  
They were able to fix it by changing the chunk_size from 1,000 to 2,000 and the chunk_overlap from 200 to 400.  
This shouldn't be required, but if it happens to you, please make that change!  
(Note that LangChain may warn about a chunk being larger than 1,000; this can be safely ignored.)

_With much thanks to Steven W and Nir P for this valuable contribution._
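
To see why a chunk can end up larger than `chunk_size`, here is a simplified sketch of separator-based splitting. It is not LangChain's implementation: `split_and_merge` is a hypothetical helper, overlap is left out, and the blank-line separator is an assumption about the splitter's default. The idea is that text is first cut at the separator and the pieces are then packed into chunks, so a single piece with no separator inside it cannot be cut any further.

```python
from typing import List


def split_and_merge(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Split on the separator, then pack pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is emitted whole, which is the
    situation the size warning reports. Overlap is omitted to keep this short.
    """
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece still fits, or it starts a new chunk (even if oversized)
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Two short paragraphs followed by one long paragraph with no blank line inside it
sample = "a" * 30 + "\n\n" + "b" * 30 + "\n\n" + "c" * 120
sizes = [len(chunk) for chunk in split_and_merge(sample, chunk_size=100)]
print(sizes)  # [62, 120]
```

The two short paragraphs are merged into one 62-character chunk, while the 120-character paragraph exceeds the 100-character limit and is kept whole.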

In [3]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

Created a chunk of size 1196, which is longer than the specified 1000


In [4]:
len(chunks)

44

In [5]:
doc_types = set(chunk.metadata['doc_type'] for chunk in chunks)
print(f"Document types found: {', '.join(doc_types)}")

Document types found: products, about


## A sidenote on Embeddings, and "Auto-Encoding LLMs"

We will be mapping each chunk of text into a vector that represents the meaning of the text, known as an embedding.

OpenAI offers a model for this through its API. This notebook instead calls a local embedding model, `mxbai-embed-large`, served by Ollama, through LangChain.

An embedding model of this kind is an example of an "Auto-Encoding LLM", which generates an output given a complete input.
It differs from the other LLMs we've discussed today, known as "Auto-Regressive LLMs", which generate future tokens based only on past context.

Another example of an Auto-Encoding LLM is BERT from Google. In addition to embedding, Auto-Encoding LLMs are often used for classification.
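
As a rough illustration of what "a vector that represents meaning" buys us, the sketch below compares toy vectors with cosine similarity, one common way to measure how close two embeddings are. The three-dimensional vectors and their names are invented for the example, and the vector store used later may rely on a different distance metric.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings"; real embeddings have far more dimensions
manual = [0.9, 0.1, 0.0]
datasheet = [0.8, 0.2, 0.1]
strategy = [0.0, 0.2, 0.9]

related = cosine_similarity(manual, datasheet)
unrelated = cosine_similarity(manual, strategy)
print(f"manual vs datasheet: {related:.3f}")
print(f"manual vs strategy:  {unrelated:.3f}")
```

Vectors pointing in similar directions score close to 1, while unrelated ones score close to 0. A similarity search ranks stored chunks by a score of this kind against the embedded query.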

## Models Benchmark

In [6]:
from langchain_ollama import OllamaEmbeddings
# Vector store import
from langchain_chroma import Chroma

# from langchain.embeddings import HuggingFaceEmbeddings
# embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

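def build_vectorstore(model_name: str, documents, persist_directory: str):
    # Create the embedding function first: it is needed both to clear an old collection and to build the new one
    embeddings = OllamaEmbeddings(
        model=model_name,
        base_url="http://localhost:11434"
    )

    # Delete any existing collection so repeated runs do not accumulate duplicate chunks
    if os.path.exists(persist_directory):
        Chroma(persist_directory=persist_directory, embedding_function=embeddings).delete_collection()
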
    return Chroma.from_documents(
        documents=documents,
        embedding=embeddings,
        persist_directory=persist_directory
    )
    
# vector_llama = build_vectorstore("llama3.2", chunks, "chroma_llama3")
vector_mxbai = build_vectorstore("mxbai-embed-large", chunks, "knowledge-base/chroma_mxbai")

# print(f"Vectorstore created with {vector_llama._collection.count()} documents")
print(f"Vectorstore created with {vector_mxbai._collection.count()} documents")

query = "Como funciona a mandrilhadora S50?"

# print("\n🔹 Resultados com LLaMA3:")
# for doc in vector_llama.similarity_search(query, k=3):
#     print("-", doc.page_content[:100], "...")


# Best model
print("\n🔸 Resultados com MXBAI:")
for doc in vector_mxbai.similarity_search(query, k=3):
    print("-", doc.page_content[:100], "...")



Vectorstore created with 44 documents

🔸 Resultados com MXBAI:
- # Manual de Operação - Mandrilhadora Portátil S50

Guia prático de uso e segurança para operação da  ...
- ## Ficha Técnica S50 ...
- # Manual Técnico – Mandrilhadora Portátil S50 (40–300mm)

Este documento apresenta as especificações ...


In [7]:
# Get one vector and find how many dimensions it has

collection = vector_mxbai._collection
sample_embedding = collection.get(limit=1, include=["embeddings"])["embeddings"][0]
dimensions = len(sample_embedding)
print(f"The vectors have {dimensions:,} dimensions")

The vectors have 1,024 dimensions


## Visualizing the Vector Store

Let's take a minute to look at the documents and their embedding vectors to see what's going on.

In [8]:
# Prework

import numpy as np

result = collection.get(include=['embeddings', 'documents', 'metadatas'])
vectors = np.array(result['embeddings'])
documents = result['documents']
doc_types = [metadata['doc_type'] for metadata in result['metadatas']]
# Map each document type to a plot color
colors = [{'products': 'blue', 'about': 'green'}[t] for t in doc_types]

In [9]:
from sklearn.manifold import TSNE
import plotly.graph_objects as go
import plotly.io as pio

# Force the chart to open in the browser
pio.renderers.default = 'browser'

# Reduce the vectors to 2D using t-SNE
tsne = TSNE(n_components=2, random_state=42)
reduced_vectors = tsne.fit_transform(vectors)

# Create the 2D scatter plot
fig = go.Figure(data=[go.Scatter(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    mode='markers',
    marker=dict(size=5, color=colors, opacity=0.8),
    text=[f"Type: {t}<br>Text: {d[:100]}..." for t, d in zip(doc_types, documents)],
    hoverinfo='text'
)])

fig.update_layout(
    title='2D Chroma Vector Store Visualization',
    # `scene` only applies to 3D plots, so set the 2D axis titles directly
    xaxis_title='x',
    yaxis_title='y',
    width=800,
    height=600,
    margin=dict(r=20, b=10, l=10, t=40)
)

fig.show()

In [10]:
# Let's try 3D!

tsne = TSNE(n_components=3, random_state=42)
reduced_vectors = tsne.fit_transform(vectors)

# Create the 3D scatter plot
fig = go.Figure(data=[go.Scatter3d(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    z=reduced_vectors[:, 2],
    mode='markers',
    marker=dict(size=5, color=colors, opacity=0.8),
    text=[f"Type: {t}<br>Text: {d[:100]}..." for t, d in zip(doc_types, documents)],
    hoverinfo='text'
)])

fig.update_layout(
    title='3D Chroma Vector Store Visualization',
    scene=dict(xaxis_title='x', yaxis_title='y', zaxis_title='z'),
    width=900,
    height=700,
    margin=dict(r=20, b=10, l=10, t=40)
)

fig.show()

### Load existing Vector DB

In [12]:
from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings

# Define the same embedding model that was used to build the store
embeddings = OllamaEmbeddings(
    model="mxbai-embed-large",  # or "llama3.2", if applicable
    base_url="http://localhost:11434"
)

# Load the vector store already saved on disk
vectorstore = Chroma(
    persist_directory="knowledge-base/chroma_mxbai",
    embedding_function=embeddings
)

# Ready to use
results = vectorstore.similarity_search("Como funciona a mandrilhadora S50?", k=3)
for r in results:
    print(f"- [{r.metadata.get('doc_type', 'no type')}] {r.page_content[:200]}...\n")



- [products] # Manual de Operação - Mandrilhadora Portátil S50

Guia prático de uso e segurança para operação da Mandrilhadora S50  
Preparado por: JVF Máquinas

---

## 1. Fixação da Barra ao Eixo

- Configure os...

- [products] ## Ficha Técnica S50...

- [products] # Manual Técnico – Mandrilhadora Portátil S50 (40–300mm)

Este documento apresenta as especificações técnicas, componentes, funcionalidades e informações de contato da **Mandrilhadora Portátil S50**, ...

