This notebook builds the vector database that backs the RAG functionality for my agents. I will complete the following steps:
- Pull benefit documentation from the internet
- Create company HR policy documentation using ChatGPT
- Read in the PDFs
- Chunk documents
- Create embeddings
- Store the embeddings in two database tables (one table for each agent)

In [20]:
import getpass
import os
import re

import pandas as pd

from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_postgres import PGVector
from langchain_text_splitters import RecursiveCharacterTextSplitter

I've pulled documents for three plans from BCBS:
- Blue Advantage Gold Standard
- Blue Advantage Silver Standard
- Blue Advantage Bronze Standard

Let's read the first one in and see what we've got.

In [3]:
filepath = 'benefit-highlights/gold_standard.pdf'
loader = PyPDFLoader(filepath)
pages = loader.load()
print(len(pages))
print(pages[0])

47
page_content='Blue Cross and Blue Shield of Texas, a Division of Health Care Service Corporation, 
a Mutual Legal Reserve Company, an Independent Licensee of the Blue Cross and Blue Shield Association 
TX-I-EX-H-CC-SOC-BH-25 
 
Blue Advantage Gold HMO℠ Standard 
Blue Advantage HMOSM Network 
The following Benefit Highlights summarizes the coverage available under the offered HMO Plan. The 
Evidence of Coverage (EOC) documents You will receive after You enroll will provide more detailed 
information about this plan. This summary should be reviewed along with the Limitations and Exclusions. 
All Covered Services (except in emergencies) must be provided by or through Member’s Participating Primary 
Care Physician/Practitioner, who may refer them for further treatment by Providers in the applicable network 
of Participating Specialists and Hospita ls. Female Members may visit a Participating OB/GYN Physician in 
their Primary Care Physician’s/Practitioner’s Provider network for diagnosi

Cool. The quality looks good. I'll steal from a previous project to clean up the newline characters and excessive whitespace.

Note: as I was investigating my data, I found a common phrase at the top of every document. I remove it to limit redundant characters in the dataset.
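One way to spot a repeated header like this without reading every page by eye is to compute the longest leading string the pages share. The snippet below is a minimal sketch of that idea, not the method used in this notebook: the helper name `shared_prefix` and the sample strings are made up for illustration, and it assumes the repeated phrase sits at the very start of each page.

```python
import os


def shared_prefix(texts):
    """Return the longest leading string common to every text.

    Whitespace is normalized first so that line breaks in different
    places do not hide an otherwise identical header.
    """
    normalized = [' '.join(text.split()) for text in texts]
    # os.path.commonprefix compares character by character, so it works
    # on arbitrary strings, not only on file paths.
    return os.path.commonprefix(normalized)


# Hypothetical stand-ins for page.page_content values.
sample_pages = [
    'ACME Health Plan 2025 Deductible details',
    'ACME  Health\nPlan 2025 Copay details',
]
print(repr(shared_prefix(sample_pages)))
```

In the notebook this could be called as `shared_prefix([p.page_content for p in pages])`. The result may end mid-word if the pages diverge there, so it is a starting point for choosing the phrase to remove rather than the final string.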

In [9]:
regexp = r'\s+'
# Phrase repeated at the top of each page, written with single spaces
header = 'Blue Cross and Blue Shield of Texas, a Division of Health Care Service Corporation, a Mutual Legal Reserve Company, an Independent Licensee of the Blue Cross and Blue Shield Association TX-I-EX-H-CC-SOC-BH-25 Blue Advantage Gold HMO℠ Standard Blue Advantage HMOSM Network'

for page in pages:
    # Collapse newlines and whitespace runs first so the single-spaced header matches on the first pass
    new_content = re.sub(regexp, ' ', page.page_content)
    page.page_content = new_content.replace(header, '').strip()

print(pages[0])

page_content='The following Benefit Highlights summarizes the coverage available under the offered HMO Plan. The Evidence of Coverage (EOC) documents You will receive after You enroll will provide more detailed information about this plan. This summary should be reviewed along with the Limitations and Exclusions. All Covered Services (except in emergencies) must be provided by or through Member’s Participating Primary Care Physician/Practitioner, who may refer them for further treatment by Providers in the applicable network of Participating Specialists and Hospita ls. Female Members may visit a Participating OB/GYN Physician in their Primary Care Physician’s/Practitioner’s Provider network for diagnosis and treatment without a Referral from their Primary Care Physician/Practitioner. Urgent Care, Retail Health Clinic s and Virtual Visits do not require Primary Care Physician/Practitioner Referral. Some services may require Prior Authorization by HMO. IMPORTANT NOTE: Copayments shown be

As you can see, everything now prints on one line. That makes the content harder to read, but it trims unimportant characters from the page content. Next, we need to split the document into chunks so we can embed bite-size pieces of it.

In [15]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1_000, chunk_overlap=200
)
splits = splitter.split_documents(pages)
print(len(splits))

195
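To make the `chunk_overlap` setting concrete, here is a deliberately simplified sketch of overlapping windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter tries to break on separators such as spaces before falling back to raw characters); the function name `sliding_chunks` and the toy alphabet input are assumptions for illustration only.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError('chunk_overlap must be non-negative and smaller than chunk_size')
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = 'abcdefghijklmnopqrstuvwxyz'
demo_chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in demo_chunks:
    print(chunk)
```

Each window repeats the last four characters of the previous one. With `chunk_size=1_000` and `chunk_overlap=200`, the same idea lets a sentence that straddles a chunk boundary appear in full in at least one chunk.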


In [None]:
if not os.environ.get('OPENAI_API_KEY'):
    os.environ['OPENAI_API_KEY'] = getpass.getpass('Enter your OpenAI API key: ')
embeddings = OpenAIEmbeddings(model='text-embedding-3-large')    

We took a 47-page document and created 195 chunks. Now I want to turn them into a pandas DataFrame with the following information:
- BCBS plan (gold)
- Page (page_label)
- Content (page_content)
- Embeddings
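
The four fields above could be gathered into one record per chunk, roughly as follows. This is a sketch under assumptions, not the final implementation: the helper name `build_rows` and the column names are mine, it assumes each chunk keeps the `page_label` metadata key set by the loader, and the embedding function is passed in so the example can run with a stand-in instead of calling the OpenAI API.

```python
from types import SimpleNamespace


def build_rows(splits, plan, embed_documents):
    """Return one dict per chunk with plan, page, content, and embedding."""
    texts = [doc.page_content for doc in splits]
    # embed_documents is expected to return one vector per text, in order.
    vectors = embed_documents(texts)
    rows = []
    for doc, vector in zip(splits, vectors):
        rows.append({
            'plan': plan,
            'page': doc.metadata.get('page_label'),
            'content': doc.page_content,
            'embedding': vector,
        })
    return rows


# Stand-ins for illustration only: fake chunks and a fake embedder.
fake_splits = [
    SimpleNamespace(page_content='Copay is $30', metadata={'page_label': '1'}),
    SimpleNamespace(page_content='Deductible applies', metadata={'page_label': '2'}),
]


def fake_embed(texts):
    return [[float(len(text))] for text in texts]


rows = build_rows(fake_splits, plan='gold', embed_documents=fake_embed)
print(rows[0])
```

In the notebook, the real call would presumably look like `pd.DataFrame(build_rows(splits, 'gold', embeddings.embed_documents))`, which sends all 195 chunks to the embedding model in one batched request.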