The purpose of this notebook is to build the database that backs the RAG functionality for my agents. I will complete the following steps:
- Pull benefit documentation from the internet
- Create company HR policy documentation using ChatGPT
- Read in the PDFs
- Chunk documents
- Create embeddings
- Store the embeddings in two database tables (one table for each agent)

In [1]:
import getpass
import os
import re

import pandas as pd

from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_postgres import PGVector
from langchain_text_splitters import RecursiveCharacterTextSplitter

I've pulled documents for three plans from BCBS:
- Blue Advantage Gold Standard
- Blue Advantage Silver Standard
- Blue Advantage Bronze Standard

Let's read the first one in and see what we've got.

In [2]:
filepath = 'benefit-highlights/gold_standard.pdf'
loader = PyPDFLoader(filepath)
pages = loader.load()
print(len(pages))
print(pages[0])

47
page_content='Blue Cross and Blue Shield of Texas, a Division of Health Care Service Corporation, 
a Mutual Legal Reserve Company, an Independent Licensee of the Blue Cross and Blue Shield Association 
TX-I-EX-H-CC-SOC-BH-25 
 
Blue Advantage Gold HMO℠ Standard 
Blue Advantage HMOSM Network 
The following Benefit Highlights summarizes the coverage available under the offered HMO Plan. The 
Evidence of Coverage (EOC) documents You will receive after You enroll will provide more detailed 
information about this plan. This summary should be reviewed along with the Limitations and Exclusions. 
All Covered Services (except in emergencies) must be provided by or through Member’s Participating Primary 
Care Physician/Practitioner, who may refer them for further treatment by Providers in the applicable network 
of Participating Specialists and Hospita ls. Female Members may visit a Participating OB/GYN Physician in 
their Primary Care Physician’s/Practitioner’s Provider network for diagnosi

Cool. The quality looks good. I'll steal from a previous project to clean up the newline characters and excessive whitespace.

Note: as I was investigating my data, I found a common phrase at the top of every document. I remove it to limit redundant characters in the dataset.

In [4]:
regexp = r'\s+'

for page in pages:
    old_content = page.page_content
    new_content = old_content.replace('\n', ' ').replace('Blue Cross and Blue Shield of Texas, a Division of Health Care Service Corporation, a Mutual Legal Reserve Company, an Independent Licensee of the Blue Cross and Blue Shield Association TX-I-EX-H-CC-SOC-BH-25', '').strip()
    new_content = re.sub(regexp, ' ', new_content)
    page.page_content = new_content
    # Tag each page with its plan ('Gold') so later searches can filter by plan
    page.metadata['subject'] = 'Gold'

print(pages[0])    

page_content='Blue Advantage Gold HMO℠ Standard Blue Advantage HMOSM Network The following Benefit Highlights summarizes the coverage available under the offered HMO Plan. The Evidence of Coverage (EOC) documents You will receive after You enroll will provide more detailed information about this plan. This summary should be reviewed along with the Limitations and Exclusions. All Covered Services (except in emergencies) must be provided by or through Member’s Participating Primary Care Physician/Practitioner, who may refer them for further treatment by Providers in the applicable network of Participating Specialists and Hospita ls. Female Members may visit a Participating OB/GYN Physician in their Primary Care Physician’s/Practitioner’s Provider network for diagnosis and treatment without a Referral from their Primary Care Physician/Practitioner. Urgent Care, Retail Health Clinic s and Virtual Visits do not require Primary Care Physician/Practitioner Referral. Some services may require 

As you can see, everything now prints on one line. That makes the content harder to read, but it removes characters that carry no meaning from the page content. Next, we need to split the document into chunks so we can embed bite-size pieces of it.

In [5]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1_000, chunk_overlap=200
)
splits = splitter.split_documents(pages)
print(len(splits))

194
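
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and spaces, so real chunks vary in length); the function name `fixed_window_chunks` is hypothetical and only illustrates the overlap idea.

```python
def fixed_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters.

    Simplified illustration only: the real splitter breaks on separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = fixed_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last four characters of the previous one. With the settings used above, the overlap is 200 characters, which helps a sentence that straddles a chunk boundary stay intact in at least one chunk.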


In [6]:
if not os.environ.get('OPENAI_API_KEY'):
    os.environ['OPENAI_API_KEY'] = getpass.getpass('Enter your OpenAI API key: ')
embeddings = OpenAIEmbeddings(model='text-embedding-3-large')    

We took a 47-page document and created 194 chunks. Now I want to repeat the process for the silver and bronze plans. Once I have clean chunks for all three plans, I will use them to create a collection in my database as a vector store.

In [7]:
regexp = r'\s+'
filenames = ['silver_standard.pdf', 'bronze_standard.pdf']

for file in filenames:
    filepath = f'benefit-highlights/{file}'
    subject = file.split('_')[0].capitalize()
    loader = PyPDFLoader(filepath)
    pages = loader.load()

    for page in pages:
        old_content = page.page_content
        new_content = old_content.replace('\n', ' ').replace('Blue Cross and Blue Shield of Texas, a Division of Health Care Service Corporation, a Mutual Legal Reserve Company, an Independent Licensee of the Blue Cross and Blue Shield Association TX-I-EX-H-CC-SOC-BH-25', '').strip()
        new_content = re.sub(regexp, ' ', new_content)
        page.page_content = new_content
        # Tag each page with its plan (derived from the filename) so later searches can filter by plan
        page.metadata['subject'] = subject

    new_splits = splitter.split_documents(pages)
    splits.extend(new_splits)

print(len(splits))

586
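
The cleanup logic now appears in two cells, so it could be factored into one helper. The sketch below is one way to do that; `clean_page_text`, `HEADER`, and `WHITESPACE` are hypothetical names. It also differs slightly from the cells above: it collapses whitespace before removing the header, on the assumption that this makes the match more robust when line breaks fall inside the header.

```python
import re

# Boilerplate header found at the top of every page (text copied from the cells above).
HEADER = (
    "Blue Cross and Blue Shield of Texas, a Division of Health Care Service Corporation, "
    "a Mutual Legal Reserve Company, an Independent Licensee of the Blue Cross and "
    "Blue Shield Association TX-I-EX-H-CC-SOC-BH-25"
)
WHITESPACE = re.compile(r"\s+")


def clean_page_text(text: str, header: str = HEADER) -> str:
    """Collapse whitespace, drop the repeated header, and trim the result."""
    # Collapse first so the header matches even if it was split across lines.
    collapsed = WHITESPACE.sub(" ", text)
    without_header = collapsed.replace(header, "")
    # Removing the header can leave doubled spaces, so collapse once more.
    return WHITESPACE.sub(" ", without_header).strip()


raw = (
    "Blue Cross and Blue Shield of Texas, a Division of Health Care Service Corporation, \n"
    "a Mutual Legal Reserve Company, an Independent Licensee of the Blue Cross and Blue Shield Association \n"
    "TX-I-EX-H-CC-SOC-BH-25 \n \n"
    "Blue Advantage Gold HMO Standard \nBlue Advantage HMO Network"
)
cleaned = clean_page_text(raw)
print(cleaned)
```

Inside either loop, the body would then reduce to `page.page_content = clean_page_text(page.page_content)` followed by the metadata assignment.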


My list of chunks contains information from the three plan documents for the Gold, Silver, and Bronze HMOs. Each chunk is identified by the "subject" field in its metadata. I will save all of these to a Postgres database as a vector store. Before running the following code, I started a Postgres container with pgvector using Docker Desktop for Windows. I followed these steps to get it up and running:
1. Download and run Docker Desktop
2. Pull the pgvector/pgvector Docker image
3. Run the image with the following environment variables and host port
    - POSTGRES_USER: langchain
    - POSTGRES_PASSWORD: langchain
    - POSTGRES_DB: henry
    - Host port: 32768

With these settings, I can define a connection string and store my splits in a vector database.

In [14]:
connection = 'postgresql+psycopg://langchain:langchain@localhost:32768/henry'
collection_name = 'healthcare_plans'

vector_store = PGVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True
)
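
The connection string above hardcodes the credentials, which is fine for a local container. As a hedged alternative, the sketch below assembles the same string from its parts and reads them from environment variables. The helper `build_connection` and the `PGVECTOR_*` variable names are assumptions made for illustration, not something Docker or LangChain defines; the defaults match the container settings listed above.

```python
import os
from urllib.parse import quote_plus


def build_connection(user: str, password: str, host: str, port: int, database: str) -> str:
    """Build a SQLAlchemy-style URL for the psycopg driver."""
    # quote_plus escapes characters such as '@' or '/' that would break the URL.
    return (
        f"postgresql+psycopg://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical environment variable names; defaults mirror the local container.
connection = build_connection(
    user=os.environ.get("PGVECTOR_USER", "langchain"),
    password=os.environ.get("PGVECTOR_PASSWORD", "langchain"),
    host=os.environ.get("PGVECTOR_HOST", "localhost"),
    port=int(os.environ.get("PGVECTOR_PORT", "32768")),
    database=os.environ.get("PGVECTOR_DB", "henry"),
)
print(connection)
```

With none of the variables set, this produces the same string used in the cell above.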

`add_documents` returns one ID for each stored chunk, so judging by the very long output (truncated below), I've added a lot of vectors to my vector store.

In [15]:
vector_store.add_documents(splits)

['440e941b-d650-4029-ac1d-d5f52a4e8ca9',
 '3bb886ff-d436-4ae0-aeb8-57191bf5a00c',
 '7e435641-0332-43d4-b9cc-79034d570b8f',
 '0776d860-8549-4332-9032-f8c594f3f9af',
 'dcab1704-89ae-4eb0-a92f-3aa3bbb15eb5',
 '2ca9d709-dc58-4819-a7c9-c2a594e5aeec',
 'f5f9faec-e17f-4558-afa5-8c205eec598e',
 '5f9cd17a-fc5e-42c1-8ea3-0475849d6757',
 '4a010221-c918-4f5d-9de4-dd54ed459e2d',
 '1052b6b8-548c-4766-af62-fd799ea2cf80',
 '1e8c654c-e94a-4b96-a9c6-ff6d7e93c946',
 '2289037d-8f99-4fc1-8749-24f26af72f16',
 '0bbd6850-5d00-4f7e-8217-b8ef5fa530cf',
 '426d10d9-061d-4d9d-9ea4-f86f0285b302',
 'da157bb6-d28b-4607-86ff-fe58ec975f1c',
 '6bb4c293-fdcb-4d4c-a326-068a2f86f246',
 '13b4d639-95e3-404f-83e2-0bf3446cf57e',
 'c706e043-9096-4fc7-9458-eb967836f24e',
 '8f2a5d0f-37cd-4194-b925-9f27d1c13a0d',
 '793ac3ef-f30a-42f2-86bd-d05b8534c228',
 '822e944e-3749-40fa-8dc5-eba0c92fcd6d',
 '70c5d167-84b4-4572-afc6-fac913605354',
 '9f77eaea-afb5-484a-ac10-a7d5393cd0df',
 '65e8b82d-2188-499f-b7c0-5813cc63ed18',
 '09292970-3b46-

Let's ask it a question!

In [16]:
vector_store.similarity_search('What is the copay for prenatal and postnatal visits?')

[Document(id='dcab1704-89ae-4eb0-a92f-3aa3bbb15eb5', metadata={'page': 1, 'title': 'GHSA71BAVITXP', 'author': 'Centers for Medicare & Medicaid Services', 'source': 'benefit-highlights/gold_standard.pdf', 'creator': 'Microsoft? Word for Microsoft 365', 'moddate': '2024-09-12T08:21:52-04:00', 'subject': 'Gold', 'keywords': '', 'producer': 'Simplify Healthcare Inc using ABCpdf', 'page_label': '2', 'total_pages': 47, 'creationdate': '2024-09-12T08:21:52-04:00'}, page_content='do not apply to Behavioral Health Services. Benefits for Autism Spectrum Disorder will not apply towards and are not subject to any Rehabilitation Services and Habilitation Services visit maximums. $30 Copay; unless otherwise covered under Inpatient Hospital Services. Maternity Care and Family Planning Services Maternity Care Prenatal and Postnatal Visit – Copay is applied to the first office visit only. Subsequent office visits are covered in full. $30 Copay for PCP or $60 Copay for Specialist Inpatient Hospital Serv

Pretty good. Let's make sure we can filter the results by subject (the type of healthcare plan) so our AI can retrieve context specific to an individual's healthcare plan.

In [18]:
vector_store.similarity_search(
    'What is the copay for prenatal and postnatal visits?',
    filter={'subject': {'$eq': 'Bronze'}}
)

[Document(id='c7604480-b2df-49ca-b186-dfa57134679c', metadata={'page': 1, 'title': 'BHSA90BAVITXP', 'author': 'Centers for Medicare & Medicaid Services', 'source': 'benefit-highlights/bronze_standard.pdf', 'creator': 'Microsoft? Word for Microsoft 365', 'moddate': '2024-09-12T08:21:37-04:00', 'subject': 'Bronze', 'keywords': '', 'producer': 'Simplify Healthcare Inc using ABCpdf', 'page_label': '2', 'total_pages': 47, 'creationdate': '2024-09-12T08:21:37-04:00'}, page_content='Habilitation Services and Therapies, per visit Limited to 35 visits per Calendar Year, including chiropractic services for Rehabilitation Services. Limited to 35 visits per Calendar Year, including chiropractic services for Habilitation Services. Visit limitations do not apply to Behavioral Health Services. Benefits for Autism Spectrum Disorder will not apply towards and are not subject to any Rehabilitation Services and Habilitation Services visit maximums. $50 Copay; unless otherwise covered under Inpatient Hosp

Great! The results come from the Bronze plan. This concludes my work in this notebook. I've created a Postgres database with vector search capabilities and loaded it with chunks from healthcare documents. I can query these documents to retrieve context for my agents. This database will support the "benefits specialist" agent I will create as part of my multi-agent workflow.
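
As a possible next step when wiring this into the agent, the metadata filter could be built from a member's plan name instead of being written by hand. This is only a sketch: `plan_filter` and `KNOWN_PLANS` are hypothetical names, and the accepted values simply mirror the three "subject" values stored above.

```python
# The three "subject" values written to chunk metadata in this notebook.
KNOWN_PLANS = {"Gold", "Silver", "Bronze"}


def plan_filter(plan: str) -> dict:
    """Return a metadata filter that restricts a search to one plan."""
    normalized = plan.strip().capitalize()
    if normalized not in KNOWN_PLANS:
        raise ValueError(f"Unknown plan: {plan!r}")
    # Same filter shape as the similarity_search call above.
    return {"subject": {"$eq": normalized}}


bronze_filter = plan_filter(" bronze ")
print(bronze_filter)
# {'subject': {'$eq': 'Bronze'}}

# Usage with the vector store defined earlier (not executed here):
# vector_store.similarity_search(question, filter=plan_filter("bronze"))
```

Validating the plan name up front avoids a silent empty result when the name is misspelled.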