### In this section, I will build an Agentic RAG

Previously, I used a medical-domain LLM to generate hypothetical questions for each chunk.

The project's main purpose is medical Q&A, so I am going to use a Multi-Vector retriever as the foundation of the RAG system.

That means I am going to:
* Embed those questions into the vectorstore and put the chunked documents into the docstore.
* Use the doc_id that was generated at the chunking stage as the link between the vectorstore and the docstore.
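
The doc_id link can be pictured with two plain Python containers standing in for the vector store and the docstore. This is only a minimal sketch, not the LangChain implementation: `question_index`, `docstore`, and `lookup_parent` are hypothetical names, and the sample texts are made up for illustration.

```python
# Hypothetical stand-in for the vector store: a real one holds embeddings,
# here each entry just keeps the question text and the doc_id of its parent chunk.
question_index = [
    {"text": "What is drug A used for?", "doc_id": "doc-1"},
    {"text": "Who should not take drug A?", "doc_id": "doc-1"},
    {"text": "How should drug B be stored?", "doc_id": "doc-2"},
]

# Hypothetical stand-in for the docstore: doc_id -> full chunk text.
docstore = {
    "doc-1": "Drug A is used to ... (full chunk text)",
    "doc-2": "Store drug B at room temperature ... (full chunk text)",
}


def lookup_parent(matched_question: dict) -> str:
    """Follow the doc_id of a matched question back to its parent chunk."""
    return docstore[matched_question["doc_id"]]


# Several questions can point to the same chunk through the shared doc_id.
print(lookup_parent(question_index[0]))
print(lookup_parent(question_index[2]))
```

The questions are what gets searched, while the chunk is what gets returned, which is why both stores must agree on the doc_id.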

About the agentic components:
* An agent will rerank the retrieved documents so that it can provide the most relevant context.
* The agent can remember the conversation history so that it can still retrieve relevant documents in multi-turn conversations.
* The agent can generate similar questions, which will be used to retrieve documents.
* The agent can fall back to function calling/MCP when there is no relevant document.

In [22]:
# Multi-Vector implementation
import os
import json
import copy
import re
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings
from huggingface_hub import login

In [2]:
# Load a JSON file and return the parsed Python object
def load_json_list(path: str):    
    with open(path, mode = "r", encoding="utf-8") as f:
        return json.load(f)

In [3]:
workspace_base_path = os.getcwd()
dataset_path = os.path.join(workspace_base_path, "datasets", "medicine_data_hypotheticalquestions.json") 
print(dataset_path)

c:\Users\Montr\AI_Projects\MediPal\Agentic-RAG\datasets\medicine_data_hypotheticalquestions.json


In [4]:
data = load_json_list(dataset_path)

In [5]:
data[50:53]

[{'doc_id': '1b8794c4-5946-4a50-bf09-d920746db8cb',
  'questions': ['what special precautions should i follow about Chlordiazepoxide',
   'What should people who are pregnant or breastfeeding know about Chlordiazepoxide?',
   'What does the document say about drive car operate machinery until know how?'],
  'original_doc': 'before taking chlordiazepoxide,tell your doctor and pharmacist if you are allergic to chlordiazepoxide, alprazolam (xanax), clonazepam (klonopin), clorazepate (gen-xene, tranxene), diazepam (diastat, valium), estazolam, flurazepam, lorazepam (ativan), oxazepam, temazepam (restoril), triazolam (halcion), any other medications, or any of the ingredients in tablets and capsules. ask your pharmacist for a list of the ingredients.tell your doctor and pharmacist what prescription and nonprescription medications, vitamins, nutritional supplements, and herbal products you are taking or plan to take while taking chlordiazepoxide. your doctor may need to change the doses of y

In [6]:
with open("keys.txt") as f:
    os.environ["HF_TOKEN"] = f.read().strip()

# login using env var
login(os.environ["HF_TOKEN"])

print("Login Huggingface so that we can access the model")

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Login Huggingface so that we can access the model


##### I chose sentence-transformers/embeddinggemma-300m-medical because it is a sentence-transformers model fine-tuned from google/embeddinggemma-300m on the miriad/miriad-4.4M dataset (specifically the first 100,000 question-passage pairs from tomaarsen/miriad-4.4M-split). It maps sentences and documents to a 768-dimensional dense vector space and is designed for medical information retrieval, in particular for searching passages (up to 1k tokens) of scientific medical papers using detailed medical questions.

* Reference: https://huggingface.co/sentence-transformers/embeddinggemma-300m-medical

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}


In [7]:
embedding_model_id = "sentence-transformers/embeddinggemma-300m-medical"

In [8]:
embedding_model = HuggingFaceEmbeddings(
    model_name=embedding_model_id,
    model_kwargs = {'device': 'cpu'},
    # Normalizing helps cosine similarity behave better across models
    encode_kwargs={"normalize_embeddings": True},
)

W0917 10:12:52.642000 51104 Lib\site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
You are trying to use a model that was created with Sentence Transformers version 5.2.0.dev0, but you're currently using version 5.1.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.
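
The comment about `normalize_embeddings` can be checked with a small sketch in plain Python. The vectors below are made-up values, not model output, and the helper names are hypothetical. The point is only that once vectors are scaled to unit length, the dot product and the cosine similarity give the same number, so the similarity no longer depends on vector magnitude.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [1.0, 2.0]
a_n, b_n = l2_normalize(a), l2_normalize(b)

# Both values agree up to floating-point error.
print(cosine(a, b), dot(a_n, b_n))
```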


In [9]:
# Build a multi-vector index (questions in the vectorstore, chunks in the docstore) and return a retriever

def medicine_documents_retriever(vectordb_name: str, data):
    # The storage layer for the original documents
    docstore = InMemoryStore()
    id_key = "doc_id"

    # The vectorstore to use to index the questions
    vectorstore = Chroma(collection_name = vectordb_name, embedding_function = embedding_model)
    # The Multi-Vector retriever
    retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=docstore,
        id_key=id_key,
    )

    doc_ids = []
    questions = []
    docs = []
    for d in data[:50]:  # only the first 50 records are indexed here
        doc_id = d["doc_id"]
        doc_ids.append(doc_id)
        docs.append(Document(metadata={"doc_id": doc_id}, page_content=d["original_doc"]))
        for q in d["questions"]:
            questions.append(Document(metadata={"doc_id": doc_id}, page_content=q))

    retriever.vectorstore.add_documents(questions)
    retriever.docstore.mset(list(zip(doc_ids,docs)))

    return retriever 

In [10]:
retriever = medicine_documents_retriever(vectordb_name="medicinedocs", data=data)

In [12]:
retriever.vectorstore.similarity_search("what special dietary instructions should i follow about Phenylephrine",k=6)

[Document(id='d847d324-bca5-4d5d-b060-8c09c17d859a', metadata={'doc_id': '88423ccc-adf6-4ecb-9d4a-48ead1e5f2b7'}, page_content='what special dietary instructions should i follow about Phenylephrine'),
 Document(id='e320f072-c0f6-4832-bcfe-2b9eec9ec031', metadata={'doc_id': '88423ccc-adf6-4ecb-9d4a-48ead1e5f2b7'}, page_content='Are there any dietary instructions while using Phenylephrine?'),
 Document(id='25aa6403-35c1-40b5-b515-7e3cdebb267d', metadata={'doc_id': '0ff6d87b-c64f-4ac8-853b-d143cb09a386'}, page_content='Are there any dietary instructions while using Phenylephrine?'),
 Document(id='4bf57a3b-054d-402a-9eef-92ce22fe6380', metadata={'doc_id': '22a4ba55-1113-4c1d-9db3-f7eaabb7999a'}, page_content='what special precautions should i follow about Phenylephrine'),
 Document(id='ed51dbb8-a48b-4ef0-a3b5-3155c6a64127', metadata={'doc_id': '0ff6d87b-c64f-4ac8-853b-d143cb09a386'}, page_content='what other information should i know about Phenylephrine'),
 Document(id='b8f2d46e-8c68-4cd9-

In [13]:
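# MultiVectorRetriever reads k from its search_kwargs; a kwargs= argument to invoke() is ignored
retriever.search_kwargs = {"k": 10}
retriever.invoke("what special dietary instructions should i follow about Phenylephrine")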

[Document(metadata={'doc_id': '88423ccc-adf6-4ecb-9d4a-48ead1e5f2b7'}, page_content='unless your doctor tells you otherwise, continue your normal diet.about Phenylephrine'),
 Document(metadata={'doc_id': '0ff6d87b-c64f-4ac8-853b-d143cb09a386'}, page_content='ask your pharmacist any questions you have about phenylephrine.keep a written list of all of the prescription and nonprescription (over-the-counter) medicines, vitamins, minerals, and dietary supplements you are taking. bring this list with you each time you visit a doctor or if you are admitted to the hospital. you should carry the list with you in case of emergencies.about Phenylephrine'),
 Document(metadata={'doc_id': '22a4ba55-1113-4c1d-9db3-f7eaabb7999a'}, page_content='before taking phenylephrine,tell your doctor and pharmacist if you are allergic to phenylephrine, any other medications, or any of the ingredients in phenylephrine preparations.do not take phenylephrine if you are taking a monoamine oxidase (mao) inhibitor, suc

#### It works as expected: the questions' doc_id values match the documents' doc_id values exactly.
#### That means a search first matches similar questions, then returns the documents linked to those questions.
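
This question-to-document step can be sketched in a few lines of plain Python. It is an illustration under assumptions, not the retriever's actual code: `parents_for_hits` is a hypothetical helper and the ids and texts are invented. It shows why several question hits can collapse into fewer returned documents, since hits that share a doc_id map to the same chunk.

```python
def parents_for_hits(question_hits, docstore):
    """Collect parent chunks for ranked question hits, keeping first-seen order."""
    seen = []
    for hit in question_hits:
        doc_id = hit["doc_id"]
        if doc_id not in seen:  # skip doc_ids already collected from an earlier hit
            seen.append(doc_id)
    return [docstore[doc_id] for doc_id in seen if doc_id in docstore]


# Five question hits, but only three distinct parent chunks.
hits = [
    {"doc_id": "a"},
    {"doc_id": "a"},
    {"doc_id": "b"},
    {"doc_id": "c"},
    {"doc_id": "b"},
]
store = {"a": "chunk A", "b": "chunk B", "c": "chunk C"}

print(parents_for_hits(hits, store))
```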

### Next, I will add a cross-encoder (ncbi/MedCPT-Cross-Encoder) to rerank the retrieved documents and output the top_k (3) most relevant ones.

##### This cross-encoder (BERT) model is part of MedCPT, which was trained on large-scale PubMed search logs; 30522 is the size of its token vocabulary.
##### A model with biomedical domain knowledge can usually rank relevance more accurately.

Citation:

@article{jin2023medcpt,
  title={MedCPT: Contrastive Pre-trained Transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval},
  author={Jin, Qiao and Kim, Won and Chen, Qingyu and Comeau, Donald C and Yeganova, Lana and Wilbur, W John and Lu, Zhiyong},
  journal={Bioinformatics},
  volume={39},
  number={11},
  pages={btad651},
  year={2023},
  publisher={Oxford University Press}
}


In [11]:
cross_encoder_model_id = "ncbi/MedCPT-Cross-Encoder" 

In [12]:
from sentence_transformers import CrossEncoder

In [13]:
cross_encoder = CrossEncoder(cross_encoder_model_id)

In [14]:
# This function uses the cross-encoder to score the original query against each retrieved document.
# It gives every (query, document) pair a score, then sorts the documents by that score (highest first).
def rerank(query: str, retrieved_docs: list[Document]):
    pairs = [[query, d.page_content] for d in retrieved_docs]
    scores = cross_encoder.predict(pairs, batch_size=32)
    for r_d, score in zip(retrieved_docs, scores):
        r_d.metadata["rerank_score"] = float(score)

    retrieved_docs.sort(key= lambda d: d.metadata["rerank_score"], reverse=True)
    return retrieved_docs
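
The score-then-sort idea can be tried without loading any model. The sketch below swaps the cross-encoder for a toy word-overlap score, so it is an assumption-laden illustration rather than the reranker used here: `overlap_score` and `rerank_texts` are hypothetical names and the passages are invented.

```python
def overlap_score(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query words that appear in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


def rerank_texts(query, passages, score_fn=overlap_score, top_k=2):
    """Score every (query, passage) pair, sort by score descending, keep top_k."""
    scored = [(score_fn(query, p), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


passages = [
    "store the medicine at room temperature",
    "this medicine can relieve sinus congestion",
    "ask your doctor about other uses",
]

top = rerank_texts("relieve sinus congestion", passages, top_k=2)
for score, passage in top:
    print(round(score, 2), passage)
```

A real cross-encoder replaces `overlap_score` with a learned model that reads the query and passage together, which is what makes it more accurate and also slower than the first-stage vector search.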

In [16]:
# This function wraps up the previous steps:
# 1. Retrieve documents -> 2. Rerank documents -> 3. Pick top_k
def ask(query: str, retriever, top_k):
    # MultiVectorRetriever reads k from search_kwargs; a kwargs= argument to invoke() is ignored
    retriever.search_kwargs = {"k": 10}
    retrieved_docs = retriever.invoke(query)
    retrieved_docs = copy.deepcopy(retrieved_docs) # Keep rerank from modifying the original documents
    reranked_docs = rerank(query,retrieved_docs)
    return reranked_docs[:top_k]

In [17]:
ask("My nasal is disconfort. Do you have a medicine to relieve sinus congestion and pressure?",retriever,top_k = 2)

[Document(metadata={'doc_id': '1bf5880b-93ec-4ac9-a0cb-eb35693ccce4', 'rerank_score': 0.9999985694885254}, page_content='phenylephrine is used to relieve nasal discomfort caused by colds, allergies, and hay fever. it is also used to relieve sinus congestion and pressure. phenylephrine will relieve symptoms but will not treat the cause of the symptoms or speed recovery. phenylephrine is in a class of medications called nasal decongestants. it works by reducing swelling of the blood vessels in the nasal passages.about Phenylephrine'),
 Document(metadata={'doc_id': 'bb119108-9008-4636-bda2-7f7ad0d185ed', 'rerank_score': 0.09486568719148636}, page_content='Hydrocortisone Injection may be prescribed for other uses; ask your doctor or pharmacist for more information.')]

#### The rerank model works very well. I will keep it and move on.

#### Besides reranking, I think the query is the most important factor, because it determines which documents are retrieved, and the LLM relies on those documents to generate a helpful response.
#### But in a multi-turn conversation, users don't repeat every previous detail in each query.
For example, in the third turn the user really wants to ask 'How do I take Phenylephrine?'

But they type 'How do I take it?'. From the context, 'it' means 'Phenylephrine'.

If we retrieve with the query 'How do I take it?', we can get irrelevant documents. 'How do I take Phenylephrine?' makes more sense.

#### This is why I need an agent that rewrites the query so that it is aligned with the conversation.
#### In this case, I chose in-memory storage for the conversation history.
#### If we need both long-term and short-term memory, I will need to design a memory and session management module with Redis, a SQL DB and a vector DB.
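
As a rough picture of what short-term memory could look like before any Redis or database work, here is a minimal sketch of a per-session history that keeps only the most recent messages. Everything in it is hypothetical (`BoundedHistory`, `get_bounded_history`, the `max_messages` default); it is not the store used in the cells that follow.

```python
from collections import deque


class BoundedHistory:
    """Hypothetical short-term memory: keeps only the most recent messages."""

    def __init__(self, max_messages: int = 6):
        # A deque with maxlen silently drops the oldest item when it is full.
        self.messages = deque(maxlen=max_messages)

    def add(self, role: str, content: str) -> None:
        self.messages.append((role, content))

    def as_text(self) -> str:
        return "\n".join(f"{role.upper()}: {content}" for role, content in self.messages)


# One history object per session id, created on first use.
sessions: dict[str, BoundedHistory] = {}


def get_bounded_history(session_id: str) -> BoundedHistory:
    return sessions.setdefault(session_id, BoundedHistory())


h = get_bounded_history("demo-session")
h.add("human", "Do you have a medicine for sinus congestion?")
h.add("ai", "Phenylephrine is used to relieve sinus congestion.")
print(h.as_text())
```

Capping the window keeps the rewriting prompt short, at the cost of forgetting older turns, which is the gap a long-term store would fill.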

#### Let's do it!

In [20]:
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline
from langchain_core.prompts import ChatPromptTemplate
from operator import itemgetter
from langchain_core.runnables import RunnableLambda
import torch
import uuid

In [23]:
query_rewrite_model_id = "ContactDoctor/Bio-Medical-Llama-3-8B"

In [24]:
def best_dtype():
    if torch.cuda.is_available():
        if torch.cuda.is_bf16_supported():
            return torch.bfloat16
        else:
            return torch.float16
        
    return torch.float32

def best_device():
    return "cuda" if torch.cuda.is_available() else "cpu"

In [26]:
query_rewrite_tokenizer = AutoTokenizer.from_pretrained(query_rewrite_model_id)

query_rewrite_model = AutoModelForCausalLM.from_pretrained(
    query_rewrite_model_id,
    dtype = best_dtype(),
    device_map={"":best_device()}, 
    low_cpu_mem_usage=True     
)
print("Load tokenizer and base model done!")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Load tokenizer and base model done!


In [None]:
print(query_rewrite_model)                    # full architecture tree (long but useful)
print(query_rewrite_model.config)             # core hyperparameters (dims, layers, heads…)

In [27]:
query_rewrite_pipe = pipeline(
    "text-generation", 
    model=query_rewrite_model, 
    tokenizer=query_rewrite_tokenizer,
    return_full_text=False,   
    )

Device set to use cuda


In [28]:
# Wrap the plain transformers pipeline in LangChain's HuggingFacePipeline
hug_query_rewrite_pipe = HuggingFacePipeline(pipeline=query_rewrite_pipe)

In [29]:
query_rewriter = ChatHuggingFace(llm=hug_query_rewrite_pipe)

In [93]:
# Minimal chat history store (in-memory)
session_store: dict[str, BaseChatMessageHistory] = {}

# Get history message by session_id
def get_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in session_store:
        session_store[session_id] = ChatMessageHistory()
    return session_store[session_id]

In [94]:
# Test the query rewriting part
# First of all, open a session, using a UUID as the unique session ID
session_id = str(uuid.uuid4())

In [95]:
# First turn: ask a question
query_1 = "My nasal is disconfort. Do you have a medicine to relieve sinus congestion and pressure?"

In [96]:
history = get_history(session_id)

In [97]:
history.add_user_message(query_1)

In [98]:
doc_1 = ask(query_1, retriever, top_k=1)

In [99]:
doc_1

[Document(metadata={'doc_id': '1bf5880b-93ec-4ac9-a0cb-eb35693ccce4', 'rerank_score': 0.9999985694885254}, page_content='phenylephrine is used to relieve nasal discomfort caused by colds, allergies, and hay fever. it is also used to relieve sinus congestion and pressure. phenylephrine will relieve symptoms but will not treat the cause of the symptoms or speed recovery. phenylephrine is in a class of medications called nasal decongestants. it works by reducing swelling of the blood vessels in the nasal passages.about Phenylephrine')]

In [100]:
# The first document is exactly what I am expecting
# Put it into the history store
history.add_ai_message(doc_1[0].page_content)


In [101]:
history

InMemoryChatMessageHistory(messages=[HumanMessage(content='My nasal is disconfort. Do you have a medicine to relieve sinus congestion and pressure?', additional_kwargs={}, response_metadata={}), AIMessage(content='phenylephrine is used to relieve nasal discomfort caused by colds, allergies, and hay fever. it is also used to relieve sinus congestion and pressure. phenylephrine will relieve symptoms but will not treat the cause of the symptoms or speed recovery. phenylephrine is in a class of medications called nasal decongestants. it works by reducing swelling of the blood vessels in the nasal passages.about Phenylephrine', additional_kwargs={}, response_metadata={})])

In [102]:
# second turn: ask a question
query_2 = "How can I use it?"

In [103]:
history = get_history(session_id)

#### This is the point where an agent can rewrite "How can I use it?" into a retrievable query.

In [21]:
rewriter_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You rewrite user questions to be fully self-contained for document retrieval. "
     "Use the history conversation to resolve references. Keep the contextual meaning. "
     "Output just the rewritten question, no explanations. \n"
     "History conversation: {history} \n"),    
    ("human", "{question}")
])

# Convert history message to a string
def history_as_text(history: BaseChatMessageHistory) -> str:
    return "\n".join([
        f"{m.type.upper()}: {m.content}"   # e.g. "HUMAN: …" or "AI: …"
        for m in history.messages])

rewrite_chain = (
    rewriter_prompt | query_rewriter
)


In [None]:
rewriter_prompt.pretty_print()

ChatPromptTemplate(input_variables=['history', 'question'], input_types={}, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['history'], input_types={}, partial_variables={}, template='You rewrite user questions to be fully self-contained for document retrieval. Use the conversation history to resolve references. Keep the meaning. Output just the rewritten query, no explanations. \nDocument: {history} \n'), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['question'], input_types={}, partial_variables={}, template='{question}'), additional_kwargs={})])

In [110]:
history_as_text(get_history(session_id))

'HUMAN: My nasal is disconfort. Do you have a medicine to relieve sinus congestion and pressure?\nAI: phenylephrine is used to relieve nasal discomfort caused by colds, allergies, and hay fever. it is also used to relieve sinus congestion and pressure. phenylephrine will relieve symptoms but will not treat the cause of the symptoms or speed recovery. phenylephrine is in a class of medications called nasal decongestants. it works by reducing swelling of the blood vessels in the nasal passages.about Phenylephrine'

In [None]:
# 2) Rewrite
rewritten = rewrite_chain.invoke({"question": query_2, "history": history_as_text(get_history(session_id))})

In [116]:
rewritten

AIMessage(content=' how to use phenylephrine for sinus congestion and pressure?', additional_kwargs={}, response_metadata={}, id='run--182c4bc3-4bb9-47c2-a90b-8ac99e09954d-0')

#### The query rewriting part is working well now.
#### But sometimes the user types a question that has multiple meanings. For example: How can I have it?
* It could mean "What are the dosage instructions for taking it?"
* It could also mean "Where and how can I buy it?"
#### This example reminds me to add the Query Decomposition technique, which turns one query into a few queries from different angles.
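
Once a query has been split into several variants, the results of each variant have to be combined. One possible way, sketched here under assumptions, is to keep the best score seen for each doc_id and sort by it. `merge_ranked` is a hypothetical helper, and the doc ids and scores are invented for illustration.

```python
def merge_ranked(results_per_query):
    """Merge (doc_id, score) lists from several queries, keeping the best score per doc_id."""
    best = {}
    for results in results_per_query:
        for doc_id, score in results:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    # Highest score first
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Invented results for two interpretations of the same ambiguous question.
dosage_hits = [("doc-dosage", 0.92), ("doc-uses", 0.40)]
purchase_hits = [("doc-uses", 0.55), ("doc-storage", 0.10)]

merged = merge_ranked([dosage_hits, purchase_hits])
print(merged)
```

Taking the maximum favors a document that answers at least one interpretation well; other merge rules (such as averaging) would be equally valid choices.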

In [None]:
# I will use the same LLM for this job, with a different prompt

decompose_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a medical doctor. You generate exactly TWO distinct, clinically-relevant question variants from the user's Original question, "
     "covering different angles (e.g., indications/contraindications, dosing vs. administration, "
     "adult vs. pediatric, interactions vs. adverse effects). "
     "Return a JSON list of 2 strings, no extra text."),
    ("human", "Original question: {base_question}")
])

In [None]:
decompose_prompt.pretty_print()

In [None]:
query_decomposer = query_rewriter  # reuse or swap to a different model

In [None]:
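def parse_json_list(s):
    # Accept a plain string or a chat message (e.g. AIMessage), whose text is in .content
    text = getattr(s, "content", s)
    if not isinstance(text, str):
        return []
    # Be forgiving of extra text around the JSON list
    m = re.search(r"\[.*\]", text.strip(), flags=re.S)
    if not m:
        return []
    try:
        data = json.loads(m.group(0))
        return [x.strip() for x in data if isinstance(x, str)]
    except Exception:
        return []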

In [None]:
decompose_chain = (
    {"base_question": itemgetter("question")}
    | decompose_prompt
    | query_decomposer
    | RunnableLambda(parse_json_list)
)

In [None]:
query_3 = "How can I have Phenylephrine?"

In [None]:
decompose_chain.invoke({"question":query_3})
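
Because the model does not always return clean JSON, it helps to see how lenient parsing behaves on a few typical shapes of output. The sketch below is a standalone copy of the parsing idea with a hypothetical name (`extract_json_list`), and the sample strings are made up rather than real model responses.

```python
import json
import re


def extract_json_list(text: str) -> list[str]:
    """Pull the first [...] span out of the text and parse it as a JSON list of strings."""
    match = re.search(r"\[.*\]", text, flags=re.S)
    if not match:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    return [item.strip() for item in items if isinstance(item, str)]


# Made-up examples of what a model might return.
clean = '["What is the usual adult dose?", "Where can it be bought?"]'
chatty = 'Sure! Here are two variants:\n["Variant one?", " Variant two? "]\nHope this helps.'
broken = "I could not produce JSON."

for sample in (clean, chatty, broken):
    print(extract_json_list(sample))
```

Returning an empty list on failure lets the caller fall back to the original question instead of crashing the chain.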