# InterSystems IRIS Vector Search - LangChain (RALPH Documents)

Reference: https://github.com/intersystems-community/hackathon-2024/tree/main?tab=readme-ov-file

Website: https://developer.intersystems.com/intersystems-genai-challenge-nus-health-hack/

We use `OpenAIEmbeddings` to embed the documents, which requires an OpenAI API key. The cell below loads it from a `.env` file.

In [1]:
# import getpass
import os

from dotenv import load_dotenv

env_path = "../notebooks/.env"  # path to .env file
load_dotenv(dotenv_path=env_path, override=True)

if not os.environ.get("OPENAI_API_KEY"):
    print(f"OPENAI_API_KEY not found; check the .env file at {env_path}")

In [None]:
from langchain.docstore.document import Document
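# from langchain.document_loaders import DirectoryLoader, Docx2txtLoader  # deprecated
from langchain_community.document_loaders import DirectoryLoader, Docx2txtLoader  # PyPDFLoader for PDFs
from langchain.text_splitter import CharacterTextSplitter  # or RecursiveCharacterTextSplitter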

# from langchain.embeddings.openai import OpenAIEmbeddings  # deprecated
from langchain_community.embeddings import OpenAIEmbeddings

from langchain_iris import IRISVector


# Split & chunk documents

Load all the .docx files

In [None]:
# Specify the folder path
folder_path = "./documents"  # Update this to your desired folder path

# Create a DirectoryLoader for .docx files
loader = DirectoryLoader(
    folder_path,
    glob="**/*.docx",  # Pattern to match DOCX files
    loader_cls=Docx2txtLoader  # Use the Docx2txtLoader for DOCX documents
)

# Load all matching documents
documents = loader.load()

In [4]:
print(len(documents))
documents

10


[Document(metadata={'source': 'documents\\ATORVASTATIN.docx'}, page_content='Drug: ATORVASTATIN\n\n# Available Drug Strengths\nATORVASTATIN 10MG TAB;\nATORVASTATIN 20MG TAB;\nATORVASTATIN 40MG TAB\n\n# Mechanism of Action & How it Works / Helps\nAtorvastatin is a type of Statin that lowers the amount of \'bad cholesterol\' (low density lipoprotein or LDL-cholesterol) and \'fat\' (triglyceride) in the blood as well as increases the amount of \'good cholesterol\' (high density lipoprotein or HDL-cholesterol). This reduces risk of fatty deposit build up in your blood vessels and thus reduces risk for heart attack and stroke. \n\n###CHUNK_DELIMITER###\n\n# Indication Information for Atorvastatin\n\n## Indication 1: Hyperlipidemia (HLD)\n\n### Summary Of Disease Condition\nHyperlipidemia which is also known as high blood cholesterol, occurs when there is high cholesterol present in the blood. This causes a buildup of fatty deposits on the inside walls of the blood vessels (atherosclerotic p

Chunk each document using `###CHUNK_DELIMITER###` as the delimiter.

Additionally, insert `Drug name: {name of drug}` at the beginning of each chunk.

In [5]:
# Define a custom splitter that splits on the "###CHUNK_DELIMITER###" separator
text_splitter = CharacterTextSplitter(
    separator="###CHUNK_DELIMITER###",  # delimiter
    chunk_size=10,  # deliberately tiny so pieces are never merged: every split falls on the separator
    chunk_overlap=0
)
# Note: the "Created a chunk of size ..." warnings in the output are expected with this setting

# Split the documents into chunks (returns a list of Document objects)
docs = text_splitter.split_documents(documents)
print(f"Loaded {len(documents)} documents and split them into {len(docs)} chunks")

# Add the "Drug name: ..." prefix to each chunk based on its source filename
for doc in docs:
    # Get the filename from metadata
    file_path = doc.metadata.get('source', '')
    file_name = os.path.basename(file_path)
    # Remove the file extension
    file_name = os.path.splitext(file_name)[0]
    # Add prefix to each chunk
    doc.page_content = f"Drug name: {file_name}\n" + doc.page_content
print("Added prefix to each chunk")

Created a chunk of size 541, which is longer than the specified 10
Created a chunk of size 3744, which is longer than the specified 10
Created a chunk of size 277, which is longer than the specified 10
Created a chunk of size 416, which is longer than the specified 10
Created a chunk of size 2651, which is longer than the specified 10
Created a chunk of size 463, which is longer than the specified 10
Created a chunk of size 13287, which is longer than the specified 10
Created a chunk of size 277, which is longer than the specified 10
Created a chunk of size 294, which is longer than the specified 10
Created a chunk of size 1167, which is longer than the specified 10
Created a chunk of size 652, which is longer than the specified 10
Created a chunk of size 11991, which is longer than the specified 10
Created a chunk of size 809, which is longer than the specified 10
Created a chunk of size 294, which is longer than the specified 10
Created a chunk of size 5299, which is longer than the 

Loaded 10 documents and split them into 60 chunks
Added prefix to each chunk
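
The delimiter-based chunking and prefixing above can also be sketched in plain Python, without relying on a tiny `chunk_size` to force the splits. This is only an illustrative sketch: the helper name `split_on_delimiter` and the sample text are assumptions, not part of the notebook, and it works on raw strings rather than LangChain `Document` objects.

```python
import os

CHUNK_DELIMITER = "###CHUNK_DELIMITER###"


def split_on_delimiter(text, source_path, delimiter=CHUNK_DELIMITER):
    """Split text on the delimiter and prefix each non-empty chunk with the drug name.

    Hypothetical helper for illustration; the notebook itself uses CharacterTextSplitter.
    """
    # Normalise Windows-style separators so basename behaves the same on any OS
    file_name = os.path.basename(source_path.replace("\\", "/"))
    drug_name = os.path.splitext(file_name)[0]

    chunks = []
    for piece in text.split(delimiter):
        piece = piece.strip()
        if piece:  # skip empty pieces, e.g. from a trailing delimiter
            chunks.append(f"Drug name: {drug_name}\n{piece}")
    return chunks


# Made-up sample text in the same shape as the loaded documents
sample = "Drug: ATORVASTATIN\n\n###CHUNK_DELIMITER###\n\n# Indication Information"
chunks = split_on_delimiter(sample, "documents\\ATORVASTATIN.docx")
for chunk in chunks:
    print(repr(chunk))
```

Normalising the backslashes matters because the `source` paths shown in the output are Windows-style; on another OS, `os.path.basename` alone would not strip the `documents\` folder from such a path.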


In [6]:
print(type(docs))

# Visualise document chunks
for i in docs:
    print(i.page_content)
    print(i.metadata)
    print("\n--------------------\n")

<class 'list'>
Drug name: ATORVASTATIN
Drug: ATORVASTATIN

# Available Drug Strengths
ATORVASTATIN 10MG TAB;
ATORVASTATIN 20MG TAB;
ATORVASTATIN 40MG TAB

# Mechanism of Action & How it Works / Helps
Atorvastatin is a type of Statin that lowers the amount of 'bad cholesterol' (low density lipoprotein or LDL-cholesterol) and 'fat' (triglyceride) in the blood as well as increases the amount of 'good cholesterol' (high density lipoprotein or HDL-cholesterol). This reduces risk of fatty deposit build up in your blood vessels and thus reduces risk for heart attack and stroke.
{'source': 'documents\\ATORVASTATIN.docx'}

--------------------

Drug name: ATORVASTATIN
# Indication Information for Atorvastatin

## Indication 1: Hyperlipidemia (HLD)

### Summary Of Disease Condition
Hyperlipidemia which is also known as high blood cholesterol, occurs when there is high cholesterol present in the blood. This causes a buildup of fatty deposits on the inside walls of the blood vessels (atherosclerot

# Create DB connection


- The System Management Portal is available locally at: http://localhost:52773/csp/sys/UtilHome.csp
- Note: the IRIS Docker container must be running first. Start it with: \
`docker run -d --name iris-comm -p 1972:1972 -p 52773:52773 -e IRIS_PASSWORD=demo -e IRIS_USERNAME=demo intersystemsdc/iris-community:latest`

### Create / Load IRIS Vector DB

In [None]:
# Initialise embedding model
embeddings = OpenAIEmbeddings()

# Create DB connection string
username = 'demo'
password = 'demo' 
hostname = os.getenv('IRIS_HOSTNAME', 'localhost')
port = '1972' 
namespace = 'USER'
CONNECTION_STRING = f"iris://{username}:{password}@{hostname}:{port}/{namespace}"

print(CONNECTION_STRING)

# Under the hood, the collection becomes a SQL table, so the name CANNOT contain '.'
COLLECTION_NAME = "ralph_drug_database"

######################################################################################

# This creates a persistent vector store (a SQL table). You should run this ONCE only
db = IRISVector.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
)

# To reconnect to the existing store in later sessions and run searches, use this instead:
# db = IRISVector(
#     embedding_function=embeddings,
#     dimension=1536,
#     collection_name=COLLECTION_NAME,
#     connection_string=CONNECTION_STRING,
# )

iris://demo:demo@localhost:1972/USER
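
The connection string above is assembled with a plain f-string, which works for the `demo` credentials but would produce a malformed URL if the password contained characters such as `@` or `:`. A minimal sketch of a safer builder follows. The helper name `build_connection_string` is hypothetical, and it assumes the driver decodes percent-encoded credentials, as URL-style connection strings commonly do; check this against your setup.

```python
from urllib.parse import quote


def build_connection_string(username, password, hostname, port, namespace):
    """Build an iris:// URL with percent-encoded credentials (hypothetical helper)."""
    # safe="" encodes every reserved character, including '@', ':' and '/'
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"iris://{user}:{pwd}@{hostname}:{port}/{namespace}"


# Same values as the notebook cell: output matches the original string
print(build_connection_string("demo", "demo", "localhost", "1972", "USER"))

# A made-up password with reserved characters is encoded instead of breaking the URL
print(build_connection_string("demo", "p@ss:word", "localhost", "1972", "USER"))
```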


In [None]:
# To ADD documents to an existing vector store:
# db.add_documents(docs)

# Delete the whole collection
# db.delete_collection()

######################################################################################

# View number of docs in vector store
print(f"Number of docs in vector store: {len(db.get()['ids'])}")

# Expect 60 entries in the vector store: one per chunk produced above (10 documents, 60 chunks)

Number of docs in vector store: 60


### Test query against the vector DB

- User input --> query the vector store for the top 5 results
- Retrieved content is consolidated into an XML string --> this can be used as context for the LLM
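
The XML consolidation step is not shown in the cells that follow, so here is one way it might look. This is a sketch under assumptions: the function name `results_to_xml` and the tag and attribute names (`context`, `document`, `rank`, `source`, `score`) are invented for illustration, and the input is a list of plain `(text, source, score)` tuples rather than LangChain objects.

```python
from xml.sax.saxutils import escape, quoteattr


def results_to_xml(results):
    """Wrap (text, source, score) results in a simple XML context string.

    Hypothetical helper: tag and attribute names are illustrative only.
    """
    parts = ["<context>"]
    for rank, (text, source, score) in enumerate(results, start=1):
        # quoteattr adds the surrounding quotes and escapes the attribute value
        parts.append(
            f'  <document rank="{rank}" source={quoteattr(source)} score="{score:.4f}">'
        )
        # escape() protects '&', '<' and '>' in the chunk text
        parts.append(escape(text))
        parts.append("  </document>")
    parts.append("</context>")
    return "\n".join(parts)


# Made-up result in the same shape as one search hit
example = [("Take with food & water", "documents/EMPAGLIFLOZIN.docx", 0.094095070770375)]
xml_context = results_to_xml(example)
print(xml_context)
```

With real search results, the tuples could be built as `[(doc.page_content, doc.metadata.get("source", ""), score) for doc, score in docs_with_score]`.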

In [11]:
# Run a sample query
# query = """how does bisoprolol work?
# Drug: Bisoprolol 2.5mg Tablets"""

query = """Drug: Empagliflozin 25mg tablets;  
Topic: "Administration Instructions or Medication Storage" and "Side effects and management" and "Drug interactions, impact and management";  
Answer: Take Empagliflozin once in the morning, with or without food. Common side effects may include urinary tract infections and dehydration; consult your doctor if you experience severe symptoms. Be cautious of potential interactions with diuretics and other diabetes medications, as they may increase the risk of low blood sugar or dehydration.
"""

docs_with_score = db.similarity_search_with_score(query, 5)

In [12]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

print(type(docs_with_score))   # list of (Document, score) tuples
docs_with_score

--------------------------------------------------------------------------------
Score:  0.094095070770375
Drug name: EMPAGLIFLOZIN
# Administration Instructions
- You may take this medication before food or after food. 
- Dosage form: tablet
- Can it be crushed: Yes

# Counselling Points for Empagliflozin

## Counselling Point 1: Sick Day Dosing
Temporarily stop if experiencing acute illness, especially when you have very poor appetite

## Counselling Point 2: Fasting Blood Glucose Testing
If you need to do fasting blood tests, do not take your medication until your blood has been taken and you have eaten.

## Counselling Point 3: Procedures
If you have planned surgery and procedures, please inform your healthcare professional. You may need to stop taking this medication for a couple of days.

# Medication Storage
Store your medication in a cool, dry place away from heat, moisture and direct sunlight, such as in a cupboard
--------------------------------------------------------------

[(Document(metadata={'source': 'documents\\EMPAGLIFLOZIN.docx'}, page_content='Drug name: EMPAGLIFLOZIN\n# Administration Instructions\n- You may take this medication before food or after food. \n- Dosage form: tablet\n- Can it be crushed: Yes\n\n# Counselling Points for Empagliflozin\n\n## Counselling Point 1: Sick Day Dosing\nTemporarily stop if experiencing acute illness, especially when you have very poor appetite\n\n## Counselling Point 2: Fasting Blood Glucose Testing\nIf you need to do fasting blood tests, do not take your medication until your blood has been taken and you have eaten.\n\n## Counselling Point 3: Procedures\nIf you have planned surgery and procedures, please inform your healthcare professional. You may need to stop taking this medication for a couple of days.\n\n# Medication Storage\nStore your medication in a cool, dry place away from heat, moisture and direct sunlight, such as in a cupboard'),
  0.094095070770375),
 (Document(metadata={'source': 'documents\\EMPA