This notebook creates a second document store for the mock employee handbook. It also applies some basic text cleaning so the handbook reads more realistically.

In [1]:
import re

from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_postgres import PGVector
from langchain_text_splitters import RecursiveCharacterTextSplitter

As in the other vector store notebook, let's read in the PDF handbook and clean it up. Much of this repeats work done there.

In [2]:
filepath = 'hr-documents/employee_handbook.pdf'
loader = PyPDFLoader(filepath)
pages = loader.load()
print(len(pages))

28


I mainly need to remove newline characters and replace {ORGANIZATION NAME} with the name of my company.

Note: after investigating some poor splits, I deleted my earlier splits and added a step that replaces the non-breaking space character \xa0 with a regular space.

In [12]:
for page in pages:
    page.page_content = page.page_content.replace('\xa0', ' ').replace('\n', '').replace('{ORGANIZATION NAME}', 'OneThreeFive')
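To see what the cleaning step does to a single string, here is a small standalone sketch. The function names and the sample sentence are made up for illustration. The first function mirrors the chained replace calls above; the second is a hypothetical variant, not what this notebook runs, that uses re to collapse every whitespace run (including \xa0 and newlines) into one space. The difference only matters when a line break in the extracted text is not preceded by a space, in which case deleting the newline fuses two words.

```python
import re

PLACEHOLDER = '{ORGANIZATION NAME}'


def clean_as_in_notebook(text: str, company: str = 'OneThreeFive') -> str:
    """Mirror the chained replace calls used in the cell above."""
    return text.replace('\xa0', ' ').replace('\n', '').replace(PLACEHOLDER, company)


def clean_with_regex(text: str, company: str = 'OneThreeFive') -> str:
    """Hypothetical variant: turn every whitespace run into a single space."""
    text = text.replace(PLACEHOLDER, company)
    # For str patterns, \s also matches the non-breaking space \xa0.
    return re.sub(r'\s+', ' ', text).strip()


# Made-up sample text with a non-breaking space and bare line breaks.
sample = 'Welcome to\xa0{ORGANIZATION NAME}.\nPaychecks are\ndistributed twice a month.'

print(clean_as_in_notebook(sample))
# Welcome to OneThreeFive.Paychecks aredistributed twice a month.
print(clean_with_regex(sample))
# Welcome to OneThreeFive. Paychecks are distributed twice a month.
```

Switching to the regex variant would change the chunk text, and therefore the splits and embeddings, so it would mean rebuilding the collection.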

Now that my text data is clean, I will break it into chunks and create a new collection/vector store in my Postgres database server.

In [17]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1_000, chunk_overlap=200
)
splits = splitter.split_documents(pages)
print(len(splits))

88
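To make chunk_size and chunk_overlap concrete, here is a deliberately simplified sketch of a fixed-width sliding window. It is not how RecursiveCharacterTextSplitter works internally: the real splitter cuts on separators (with the newlines removed above, mostly spaces) and merges the pieces, so its chunks are usually shorter than chunk_size and the overlap is at most chunk_overlap. The function name sliding_chunks is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError('chunk_overlap must be >= 0 and smaller than chunk_size')
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each window advances
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


# 2,500 characters of dummy text, split with the same settings as above.
text = ''.join(str(i % 10) for i in range(2500))
chunks = sliding_chunks(text, chunk_size=1_000, chunk_overlap=200)

print(len(chunks))               # 3
print([len(c) for c in chunks])  # [1000, 1000, 900]
# The tail of one chunk is repeated at the head of the next.
print(chunks[0][-200:] == chunks[1][:200])  # True
```

The overlap is what lets a sentence that straddles a chunk boundary still appear whole in at least one chunk.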


In [18]:
embeddings = OpenAIEmbeddings(model='text-embedding-3-large')

connection = 'postgresql+psycopg://langchain:langchain@localhost:32768/henry'
collection_name = 'hr_policy'

vector_store = PGVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True
)

In [19]:
vector_store.add_documents(splits)

['0b7eafa7-294d-47d0-9f24-2a0d82c62ba4',
 'd68bda74-c44c-4a33-87a2-6e47214725ea',
 'd27e3d8b-b723-4c40-a25c-0cf82931f345',
 'c4880b5e-d701-4599-b621-ecfd71cc7887',
 '85f16e7e-4341-4b27-8eb8-d5a288b20ef3',
 'c3b42043-4d77-4d4b-aea7-0a7b4a20cf54',
 '07e67e71-9f7c-4404-85e4-10aeaf17a6d4',
 '4231b3d6-1ec1-4bdd-a729-c96dc67a7df1',
 'f42423dd-ddc0-4856-857c-946dc5d9c4df',
 '82dd702d-c570-4caa-93d3-f772357d1bed',
 'b0518b22-2ece-403b-ac8b-5bfcd71e4c70',
 '693e4bbb-5bc2-49fd-851d-3fb6b1fdf3ad',
 '61f48a73-0c8a-412d-812a-79ebdebdb045',
 'bce39e69-8820-498c-be44-269546bcc753',
 '1f9b02d7-fd11-42a1-994b-630ea6908921',
 'e02bba6a-0030-4c32-867f-af4a560ba7b8',
 '6bb0de80-c06c-43a7-82a9-07d5ed221227',
 '800f6b25-36c3-4470-bc27-09d6e7b9bc7d',
 '3195c024-0a4e-4a0a-85d2-747a61d384ca',
 'a5edfc9d-c851-419c-9427-7a038a99c2ed',
 'ab1e6516-87d7-442c-a6f9-d0f06d167e2a',
 '2f7cbff6-f410-4a0e-86dd-b15512be73d9',
 '67b04d14-81f4-4b7a-b41f-3fe5feb47961',
 '9bbb4ca3-7eb8-4080-83bd-62b7745c0ec8',
 '919c5cd4-85aa-
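The IDs above are generated fresh on every call, so re-running the cell after changing the cleaning step adds a second copy of each chunk unless the old splits are deleted first. One possible way around that, sketched here under assumptions, is to derive each ID from the chunk itself and pass the list through the ids argument of add_documents. The helper stable_chunk_id and the sample chunks are hypothetical, and I have not verified in this notebook how PGVector treats an ID that already exists.

```python
import uuid


def stable_chunk_id(source: str, page: int, content: str) -> str:
    """Build a deterministic UUID (version 5) from a chunk's source, page, and text."""
    key = f'{source}|{page}|{content}'
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))


# Made-up stand-ins for (source, page, page_content) of two chunks.
sample_chunks = [
    ('hr-documents/employee_handbook.pdf', 11, 'Paychecks are distributed ...'),
    ('hr-documents/employee_handbook.pdf', 12, 'Timesheets are due ...'),
]
ids = [stable_chunk_id(source, page, content) for source, page, content in sample_chunks]

# Same input, same ID; different input, different ID.
print(ids[0] == stable_chunk_id(*sample_chunks[0]))  # True
print(ids[0] != ids[1])                              # True

# In the notebook this would look roughly like (assumption, not run here):
# ids = [stable_chunk_id(d.metadata['source'], d.metadata['page'], d.page_content)
#        for d in splits]
# vector_store.add_documents(splits, ids=ids)
```

Two chunks with identical text on the same page would get the same ID, so adding the chunk's position to the key may be safer.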

Let's ask a question to verify that the vector store works properly. I expect to find the answer on the 12th page of the handbook, which the metadata stores as 'page': 11 (counted from zero) with 'page_label': '12'.

In [20]:
vector_store.similarity_search(
    'When are paychecks distributed?',
    k=2
)

[Document(id='9937d94b-1a5d-4fc6-ad9b-940b1e0bf151', metadata={'page': 11, 'title': 'Microsoft Word - Sample Employee Handbook for web.doc', 'author': 'JBergin', 'source': 'hr-documents/employee_handbook.pdf', 'creator': 'PScript5.dll Version 5.2.2', 'moddate': '2006-10-16T19:54:33-04:00', 'producer': 'Acrobat Distiller 6.0 (Windows)', 'page_label': '12', 'total_pages': 28, 'creationdate': '2006-10-16T19:54:33-04:00'}, page_content='8  qualifications required, salary range, and working conditions affecting the job, e.g., working hours, use of car, etc.   The supervisor(s) or the Executive Director shall have discretion to modify the job description to meet the needs of OneThreeFive.      Paychecks are distributed on the 15th and the last day of each month, except when either of those days falls on a Saturday, Sunday or holiday, in which case paychecks will be distributed on the preceding workday.  Timesheets are due to the Executive Director within two days of each pay period. All sala
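The check above is done by eye. A small helper can make it explicit, bearing in mind that the 'page' value in the metadata counts from zero while I think in 1-based page numbers. This is a sketch only: found_on_page is a hypothetical name, and the sample metadata is trimmed from the result shown above.

```python
def found_on_page(results_metadata: list[dict], expected_page_number: int) -> bool:
    """Return True if any result comes from the given 1-based page number."""
    # The loader stores a zero-based index under 'page', hence the - 1.
    return any(meta.get('page') == expected_page_number - 1 for meta in results_metadata)


# Metadata trimmed from the first search result above.
sample_metadata = [
    {'page': 11, 'page_label': '12', 'source': 'hr-documents/employee_handbook.pdf'},
]

print(found_on_page(sample_metadata, 12))  # True
print(found_on_page(sample_metadata, 11))  # False

# In the notebook this would look roughly like:
# results = vector_store.similarity_search('When are paychecks distributed?', k=2)
# assert found_on_page([doc.metadata for doc in results], 12)
```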

Cool! I now have two vector databases and one employee information database, which is everything I need to build my multi-agent workflow in the first-pass notebook.