### In this section, I will build an Agentic RAG

Earlier, I used a medical-domain LLM to generate hypothetical questions for each chunk.

The project's main purpose is medical Q&A, so I am going to implement Multi-Vector retrieval as the foundation of the RAG system.

That means I am going to:
* Embed those questions into the vectorstore and put the chunked documents into the docstore.
* Use the doc_id generated at the chunking stage as the link between the vectorstore and the docstore.
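
The layout described above can be sketched with plain Python structures. This is only an illustration under stated assumptions: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, and `records`, `vector_index` and `retrieve` are made-up names, not part of the project code.

```python
from collections import Counter
import math

# Toy records in the same shape as the dataset: one chunk, several questions.
records = [
    {"doc_id": "a1", "questions": ["how should I store this medicine"],
     "original_doc": "keep this medication in the container it came in."},
    {"doc_id": "b2", "questions": ["what side effects can this medicine cause"],
     "original_doc": "this medication may cause side effects."},
]

def embed(text: str) -> Counter:
    # Hypothetical stand-in for an embedding model: bag-of-words counts
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# "Vector store": one entry per question, tagged with the parent doc_id
vector_index = [(embed(q), r["doc_id"]) for r in records for q in r["questions"]]
# "Docstore": doc_id -> full chunk
docstore = {r["doc_id"]: r["original_doc"] for r in records}

def retrieve(query: str) -> str:
    # Match the query against the questions, then follow doc_id to the chunk
    q_vec = embed(query)
    _, best_doc_id = max(vector_index, key=lambda item: cosine(q_vec, item[0]))
    return docstore[best_doc_id]
```

The point of the design is that short questions are what gets embedded and searched, while the longer chunk is what gets returned as context.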

About the agentic components:
* An agent will rerank the retrieved documents so that the most relevant context is provided.
* The agent remembers the conversation history, so it can still retrieve relevant documents in a multi-turn conversation.
* The agent can generate similar questions, which are then used to retrieve documents.
* The agent can use function calling/MCP when there is no relevant document.

In [1]:
# Multi-Vector implementation
import os
import json
import copy
import re
from dotenv import load_dotenv
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings
from huggingface_hub import login
import settings

In [2]:
# Load a JSON file and return the parsed Python object
def load_json_list(path: str):    
    with open(path, mode = "r", encoding="utf-8") as f:
        return json.load(f)

In [3]:
workspace_base_path = os.getcwd()
dataset_path = os.path.join(workspace_base_path, "datasets", "medicine_data_hypotheticalquestions.json") 
print(dataset_path)

/home/jovyan/work/datasets/medicine_data_hypotheticalquestions.json


In [4]:
data = load_json_list(dataset_path)

In [5]:
data[50:53]

[{'doc_id': '1b8794c4-5946-4a50-bf09-d920746db8cb',
  'questions': ['what special precautions should i follow about Chlordiazepoxide',
   'What should people who are pregnant or breastfeeding know about Chlordiazepoxide?',
   'What does the document say about drive car operate machinery until know how?'],
  'original_doc': 'before taking chlordiazepoxide,tell your doctor and pharmacist if you are allergic to chlordiazepoxide, alprazolam (xanax), clonazepam (klonopin), clorazepate (gen-xene, tranxene), diazepam (diastat, valium), estazolam, flurazepam, lorazepam (ativan), oxazepam, temazepam (restoril), triazolam (halcion), any other medications, or any of the ingredients in tablets and capsules. ask your pharmacist for a list of the ingredients.tell your doctor and pharmacist what prescription and nonprescription medications, vitamins, nutritional supplements, and herbal products you are taking or plan to take while taking chlordiazepoxide. your doctor may need to change the doses of y

In [8]:
load_dotenv()
login(os.getenv("HUGGINGFACE_KEY"))

##### I chose sentence-transformers/embeddinggemma-300m-medical. It is a sentence-transformers model fine-tuned from google/embeddinggemma-300m on the miriad/miriad-4.4M dataset (specifically, the first 100,000 question-passage pairs from tomaarsen/miriad-4.4M-split). It maps sentences and documents to a 768-dimensional dense vector space and is designed for medical information retrieval: searching passages (up to 1k tokens) of scientific medical papers using detailed medical questions.

* Reference: https://huggingface.co/sentence-transformers/embeddinggemma-300m-medical

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}


In [9]:
embedding_model_id = "sentence-transformers/embeddinggemma-300m-medical"

In [11]:
embedding_model = HuggingFaceEmbeddings(
    model_name=embedding_model_id,
    model_kwargs = {'device': 'cpu'},
    # Normalizing helps cosine similarity behave better across models
    encode_kwargs={"normalize_embeddings": True},
)

You are trying to use a model that was created with Sentence Transformers version 5.2.0.dev0, but you're currently using version 5.1.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.
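
The `normalize_embeddings` option above is worth a small illustration. Once vectors are scaled to unit length, their dot product equals their cosine similarity, so a store that ranks by dot product gives the same ordering as cosine. The helper names below are hypothetical and the vectors are toy values.

```python
import math

def l2_normalize(v: list[float]) -> list[float]:
    # Scale the vector to unit length (returns it unchanged if it is all zeros)
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [1.0, 2.0]
unit_a, unit_b = l2_normalize(a), l2_normalize(b)
# After normalization the plain dot product already is the cosine similarity
same = math.isclose(dot(unit_a, unit_b), cosine(a, b))
```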


In [12]:
# The function will make a multi-vector db and return a retriever

def medicine_documents_retriever(vectordb_name: str, data):
    # The storage layer for the original documents
    docstore = InMemoryStore()
    id_key = "doc_id"

    # The vectorstore to use to index the questions
    vectorstore = Chroma(collection_name = vectordb_name, embedding_function = embedding_model)
    # The Multi-Vector retriever
    retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=docstore,
        id_key=id_key,
    )

    doc_ids = list()
    questions = list()
    docs = list()
    for d in data[:20]:
        doc_id = d["doc_id"]
        doc_ids.append(doc_id)
        docs.append(Document(metadata={"doc_id": doc_id}, page_content=d["original_doc"]))
        for q in d["questions"]:
            questions.append(Document(metadata={"doc_id": doc_id}, page_content=q))

    retriever.vectorstore.add_documents(questions)
    retriever.docstore.mset(list(zip(doc_ids,docs)))

    return retriever 

In [13]:
retriever = medicine_documents_retriever(vectordb_name="medicinedocs", data=data)

In [24]:
retriever.vectorstore.similarity_search("what special dietary instructions should i follow about Phenylephrine",k=6)

[Document(id='94fa51a5-b9b9-452e-abd3-616f59a1d297', metadata={'doc_id': '88423ccc-adf6-4ecb-9d4a-48ead1e5f2b7'}, page_content='what special dietary instructions should i follow about Phenylephrine'),
 Document(id='bdf54d04-3917-497c-84af-204d4f5bb8b6', metadata={'doc_id': '88423ccc-adf6-4ecb-9d4a-48ead1e5f2b7'}, page_content='Are there any dietary instructions while using Phenylephrine?'),
 Document(id='9fbf8185-f2e1-4a0e-8f6d-697892a2745f', metadata={'doc_id': '0ff6d87b-c64f-4ac8-853b-d143cb09a386'}, page_content='Are there any dietary instructions while using Phenylephrine?'),
 Document(id='86730082-4dcf-492c-88fe-267344e5d092', metadata={'doc_id': '22a4ba55-1113-4c1d-9db3-f7eaabb7999a'}, page_content='what special precautions should i follow about Phenylephrine'),
 Document(id='97731eca-61fa-4ff4-9207-ab27776a0f7a', metadata={'doc_id': '0ff6d87b-c64f-4ac8-853b-d143cb09a386'}, page_content='what other information should i know about Phenylephrine'),
 Document(id='cd758cc0-c9ba-4e00-

In [25]:
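# MultiVectorRetriever reads k from `search_kwargs`; a `kwargs=` argument to invoke() is not forwarded to the vector store
retriever.search_kwargs = {"k": 10}
retriever.invoke("what special dietary instructions should i follow about Phenylephrine")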

[Document(metadata={'doc_id': '88423ccc-adf6-4ecb-9d4a-48ead1e5f2b7'}, page_content='unless your doctor tells you otherwise, continue your normal diet.about Phenylephrine'),
 Document(metadata={'doc_id': '0ff6d87b-c64f-4ac8-853b-d143cb09a386'}, page_content='ask your pharmacist any questions you have about phenylephrine.keep a written list of all of the prescription and nonprescription (over-the-counter) medicines, vitamins, minerals, and dietary supplements you are taking. bring this list with you each time you visit a doctor or if you are admitted to the hospital. you should carry the list with you in case of emergencies.about Phenylephrine'),
 Document(metadata={'doc_id': '22a4ba55-1113-4c1d-9db3-f7eaabb7999a'}, page_content='before taking phenylephrine,tell your doctor and pharmacist if you are allergic to phenylephrine, any other medications, or any of the ingredients in phenylephrine preparations.do not take phenylephrine if you are taking a monoamine oxidase (mao) inhibitor, suc

#### It is working as expected: the questions' doc_id values match the documents' doc_id values exactly.
#### That means a search first matches similar questions, then returns the documents linked to those questions.
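
The two-step lookup can be sketched as follows. Several question hits often point to the same chunk, so the doc_ids are de-duplicated in rank order before the chunks are fetched. This is a simplified illustration with a hypothetical helper name and short made-up ids, not the retriever's actual source.

```python
def parents_from_hits(question_hits: list[dict], docstore: dict) -> list[str]:
    """Map ranked question hits to their parent chunks, keeping first-seen order."""
    ordered_ids = []
    for hit in question_hits:
        doc_id = hit["doc_id"]
        if doc_id not in ordered_ids:
            ordered_ids.append(doc_id)
    # Skip ids that have no chunk in the docstore
    return [docstore[d] for d in ordered_ids if d in docstore]

# Five question hits that point to only three distinct chunks
hits = [{"doc_id": "x"}, {"doc_id": "x"}, {"doc_id": "y"},
        {"doc_id": "z"}, {"doc_id": "y"}]
chunks = {"x": "diet chunk", "y": "other-info chunk", "z": "precautions chunk"}
result = parents_from_hits(hits, chunks)
```

This matches the pattern in the outputs above, where the similarity search returns several questions but the retriever returns fewer documents.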

### Next, I will add a cross-encoder (ncbi/MedCPT-Cross-Encoder) to rerank the retrieved documents and output the top_k (3) most relevant ones.

##### This BERT-based cross-encoder was trained on large-scale PubMed search logs (see the citation below); its tokenizer vocabulary has 30,522 tokens.
##### A model with clinical domain knowledge can usually rank relevance more accurately.

Citation:

@article{jin2023medcpt,
  title={MedCPT: Contrastive Pre-trained Transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval},
  author={Jin, Qiao and Kim, Won and Chen, Qingyu and Comeau, Donald C and Yeganova, Lana and Wilbur, W John and Lu, Zhiyong},
  journal={Bioinformatics},
  volume={39},
  number={11},
  pages={btad651},
  year={2023},
  publisher={Oxford University Press}
}


In [14]:
cross_encoder_model_id = "ncbi/MedCPT-Cross-Encoder" 

In [15]:
from sentence_transformers import CrossEncoder

In [17]:
cross_encoder = CrossEncoder(cross_encoder_model_id)

In [18]:
# This function uses the cross-encoder to score the original query against each retrieved document.
# Every (query, document) pair gets a score, and the documents are sorted by it in descending order.
def rerank(query: str, retrieved_docs: list[Document]):
    pairs = [[query, d.page_content] for d in retrieved_docs]
    scores = cross_encoder.predict(pairs, batch_size=32)
    for r_d, score in zip(retrieved_docs, scores):
        r_d.metadata["rerank_score"] = float(score)

    retrieved_docs.sort(key= lambda d: d.metadata["rerank_score"], reverse=True)
    return retrieved_docs
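
The rerank step can be tried without loading the model by swapping in a toy scorer. In the sketch below, `Doc` is a hypothetical stand-in for LangChain's `Document`, and `overlap_score` is a made-up word-overlap function that takes the place of `cross_encoder.predict`. The scores it produces are not comparable to the real model's.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Doc:
    # Hypothetical stand-in for langchain_core.documents.Document
    page_content: str
    metadata: dict = field(default_factory=dict)

def overlap_score(query: str, text: str) -> float:
    # Toy relevance score: fraction of query words that appear in the text
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0

def rerank_with(score_fn: Callable[[str, str], float], query: str,
                docs: list[Doc], top_k: int) -> list[Doc]:
    # Build new objects so the caller's documents are left untouched
    scored = [
        Doc(d.page_content, {**d.metadata, "rerank_score": score_fn(query, d.page_content)})
        for d in docs
    ]
    scored.sort(key=lambda d: d.metadata["rerank_score"], reverse=True)
    return scored[:top_k]

docs = [
    Doc("phenylephrine may cause side effects"),
    Doc("phenylephrine is used to relieve sinus congestion"),
]
top = rerank_with(overlap_score, "relieve sinus congestion", docs, top_k=1)
```

Because `rerank_with` copies the metadata instead of writing into it, this variant does not need a deep copy before reranking.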

In [19]:
# This function wraps the previous steps:
# 1. Retrieve documents -> 2. Rerank documents -> 3. Pick top_k
def ask(query: str, retriever, top_k):
    # MultiVectorRetriever reads k from `search_kwargs`; a `kwargs=` argument to invoke() is not forwarded
    retriever.search_kwargs = {"k": 10}
    retrieved_docs = retriever.invoke(query)
    retrieved_docs = copy.deepcopy(retrieved_docs)  # Keep rerank from mutating the stored documents
    reranked_docs = rerank(query, retrieved_docs)
    return reranked_docs[:top_k]

In [20]:
ask("My nasal is disconfort. Do you have a medicine to relieve sinus congestion and pressure?",retriever,top_k = 2)

[Document(metadata={'doc_id': '1bf5880b-93ec-4ac9-a0cb-eb35693ccce4', 'rerank_score': 0.9999985694885254}, page_content='phenylephrine is used to relieve nasal discomfort caused by colds, allergies, and hay fever. it is also used to relieve sinus congestion and pressure. phenylephrine will relieve symptoms but will not treat the cause of the symptoms or speed recovery. phenylephrine is in a class of medications called nasal decongestants. it works by reducing swelling of the blood vessels in the nasal passages.about Phenylephrine'),
 Document(metadata={'doc_id': 'e36207d5-1410-47ca-b68a-05ed7bb7921e', 'rerank_score': 6.946862413315102e-05}, page_content="phenylephrine may cause side effects. some side effects can be serious. if you experience any of these symptoms, stop using phenylephrine and call your doctor:nervousnessdizzinesssleeplessnessphenylephrine may cause other side effects. call your doctor if you have any unusual problems while taking Phenylephrine.if you experience a seri

#### The rerank model works very well. I will keep it and move on.

#### Besides reranking, I think the query is the most important part, because it determines which documents are retrieved and therefore what the LLM can use to generate a helpful response.
#### But in a multi-turn conversation, users do not repeat every previous detail in each query.
For example, in the third turn the user really wants to ask 'How do I take Phenylephrine?'

But they type 'How do I take it?'. From the context, 'it' means 'Phenylephrine'.

If we retrieve with the query 'How do I take it?', we can get irrelevant documents. 'How do I take Phenylephrine?' makes more sense.

#### This is why I need an agent to rewrite the query so that it is aligned with the conversation.
#### In this case, I chose a short-term memory technique to store the conversation history, which helps the agent understand the context.
#### For long-term memory, I would need a memory and session management module with Redis and a vector DB.

#### Let's bring in the agent first!
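
Short-term memory lives in process memory and ends up inside the prompt, so it may help to cap how much of it is kept. The sketch below is one possible approach, not what the notebook does: `WindowedHistory` and `max_messages` are hypothetical names, and the window size is an arbitrary assumption.

```python
from collections import deque

class WindowedHistory:
    """Keep only the most recent `max_messages` (role, content) pairs."""

    def __init__(self, max_messages: int = 6) -> None:
        # A deque with maxlen silently drops the oldest entry when full
        self.messages: deque = deque(maxlen=max_messages)

    def add(self, role: str, content: str) -> None:
        self.messages.append((role, content))

    def as_text(self) -> str:
        # Same "ROLE: content" layout that a rewrite prompt can consume
        return "\n".join(f"{role.upper()}: {content}" for role, content in self.messages)

history = WindowedHistory(max_messages=2)
history.add("human", "Do you have a medicine for sinus congestion?")
history.add("ai", "phenylephrine is used to relieve sinus congestion.")
history.add("human", "How can I use it?")
```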

In [21]:
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory # Short-term Memory
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline
from langchain_core.prompts import ChatPromptTemplate
from langchain.prompts import PromptTemplate
from operator import itemgetter
from langchain_core.runnables import RunnableLambda
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.output_parsers import JsonOutputParser
import torch
import uuid

In [22]:
model_id = "ContactDoctor/Bio-Medical-Llama-3-8B"

In [23]:
def best_dtype():
    if torch.cuda.is_available():
        if torch.cuda.is_bf16_supported():
            return torch.bfloat16
        else:
            return torch.float16
        
    return torch.float32

def best_device():
    return "cuda" if torch.cuda.is_available() else "cpu"

In [25]:
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype = best_dtype(),
    device_map={"":best_device()}, 
    low_cpu_mem_usage=True     
)
print("Load tokenizer and base model done!")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Load tokenizer and base model done!


In [26]:
print(model)                    # full architecture tree (long but useful)
print(model.config)             # core hyperparameters (dims, layers, heads…)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps=1e-05)
    (rotary_

In [27]:
original_pipeline = pipeline(
    "text-generation", 
    model=model, 
    tokenizer=tokenizer,
    return_full_text=False,   
    )

Device set to use cuda


In [28]:
# Wrap the transformers pipeline with HuggingFacePipeline
hug_pipeline = HuggingFacePipeline(pipeline=original_pipeline)

In [29]:
master_agent = ChatHuggingFace(llm=hug_pipeline)

In [30]:
class Short_Term_Memory():
    def __init__(self) -> None:        
        self.session_store: dict[str,BaseChatMessageHistory] = {}
        self.current_session_id = ""

    def get_history(self, session_id: str) -> BaseChatMessageHistory:        
        self.current_session_id = session_id
        if session_id not in self.session_store:
            self.session_store[session_id] = ChatMessageHistory()
        return self.session_store[session_id]
    
    def get_current_history(self) -> BaseChatMessageHistory:
        return self.get_history(self.current_session_id)
    
    def delete_history(self, session_id: str) -> bool:
        if session_id in self.session_store:
            d = self.session_store.pop(session_id)
            if d:
                return True
            else:
                return False
        return True
    
    def delete_current_history(self) -> bool:
        return self.delete_history(self.current_session_id)
    
# Convert history message to a string
def history_as_text(history: BaseChatMessageHistory) -> str:
    return "\n".join([
        f"{m.type.upper()}: {m.content}"   # e.g. "HUMAN: …" or "AI: …"
        for m in history.messages])

In [31]:
from typing import List, TypedDict

#from typing_extensions import TypedDict


class GraphState(TypedDict):
    """
    Represents the state of our graph.

    Attributes:
        query: question
        generation: LLM generation
        history: list of history messages
    """
    query: str    
    history: BaseChatMessageHistory
    generation: str

In [32]:
# Node: combine_node will check the global session_id and short_term_memory

def combine_node(query: str):
    if settings.SESSION_ID == "":   
        settings.SESSION_ID = str(uuid.uuid4())
        settings.SHORT_TERM_MEMORY = Short_Term_Memory()
    
    history = settings.SHORT_TERM_MEMORY.get_history(settings.SESSION_ID)

    return {"query": query,"history": history, "generation": ""}

In [37]:
# First turn: ask a question
query_1 = "My nasal is disconfort. Do you have a medicine to relieve sinus congestion and pressure?"

In [38]:
combine_return = combine_node(query_1)

In [39]:
combine_return

{'query': 'My nasal is disconfort. Do you have a medicine to relieve sinus congestion and pressure?',
 'history': InMemoryChatMessageHistory(messages=[]),
 'generation': ''}

In [40]:
settings.SESSION_ID

'879ea7fc-cb0e-446f-a97e-fd58821fbc4b'

In [41]:
len(combine_return["history"].messages) # Check how many message it has.

0

In [42]:
combine_return["history"].add_user_message(query_1)

In [43]:
doc_1 = ask(query_1, retriever, top_k=1)

In [44]:
doc_1

[Document(metadata={'doc_id': '1bf5880b-93ec-4ac9-a0cb-eb35693ccce4', 'rerank_score': 0.9999985694885254}, page_content='phenylephrine is used to relieve nasal discomfort caused by colds, allergies, and hay fever. it is also used to relieve sinus congestion and pressure. phenylephrine will relieve symptoms but will not treat the cause of the symptoms or speed recovery. phenylephrine is in a class of medications called nasal decongestants. it works by reducing swelling of the blood vessels in the nasal passages.about Phenylephrine')]

In [45]:
# The first document is exactly what I expected
# Put it into the history store
combine_return["history"].add_ai_message(doc_1[0].page_content)

In [46]:
combine_return["history"]

InMemoryChatMessageHistory(messages=[HumanMessage(content='My nasal is disconfort. Do you have a medicine to relieve sinus congestion and pressure?', additional_kwargs={}, response_metadata={}), AIMessage(content='phenylephrine is used to relieve nasal discomfort caused by colds, allergies, and hay fever. it is also used to relieve sinus congestion and pressure. phenylephrine will relieve symptoms but will not treat the cause of the symptoms or speed recovery. phenylephrine is in a class of medications called nasal decongestants. it works by reducing swelling of the blood vessels in the nasal passages.about Phenylephrine', additional_kwargs={}, response_metadata={})])

In [47]:
# second turn: ask a question
query_2 = "How can I use it?"

In [48]:
combine_return_2 = combine_node("How can I use it?")

In [49]:
combine_return_2

{'query': 'How can I use it?',
 'history': InMemoryChatMessageHistory(messages=[HumanMessage(content='My nasal is disconfort. Do you have a medicine to relieve sinus congestion and pressure?', additional_kwargs={}, response_metadata={}), AIMessage(content='phenylephrine is used to relieve nasal discomfort caused by colds, allergies, and hay fever. it is also used to relieve sinus congestion and pressure. phenylephrine will relieve symptoms but will not treat the cause of the symptoms or speed recovery. phenylephrine is in a class of medications called nasal decongestants. it works by reducing swelling of the blood vessels in the nasal passages.about Phenylephrine', additional_kwargs={}, response_metadata={})]),
 'generation': ''}

In [None]:
# Test whether the LLM can tell if the query is meaningful on its own, without the conversation context
query_grader_prompt = PromptTemplate(
    template="""You are a grader for a question. \n 
    You need to determine whether the question is meaningful on its own, without knowing the conversation context. \n    
    Here is the user's question: {question} \n   
    Give a binary score 'yes' or 'no' to indicate whether the question is meaningful. \n     
    Only provide the binary score as a JSON with a single key 'score', for example {{"score": "yes"}} or  {{"score": "no"}}.\n
    No preamble or explanation.""",
    input_variables=["question"],
)

query_grader_chain = query_grader_prompt | master_agent | JsonOutputParser()

In [79]:
result = query_grader_chain.invoke({"question": query_2})

In [80]:
print(result)
    

{'score': 'no'}
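
A small local model may occasionally wrap the JSON in extra text. A defensive parser for the grader's reply could look like the sketch below. `parse_binary_score` is a hypothetical helper, and falling back to "no" on anything unparseable is an assumption about the desired behavior.

```python
import json
import re

def parse_binary_score(raw: str, default: str = "no") -> str:
    """Extract a 'yes'/'no' score from model output that may contain extra text."""
    # Grab the first flat {...} object in the reply
    match = re.search(r"\{.*?\}", raw, flags=re.DOTALL)
    if not match:
        return default
    try:
        score = str(json.loads(match.group(0)).get("score", default)).strip().lower()
    except json.JSONDecodeError:
        return default
    # Anything other than yes/no falls back to the default
    return score if score in {"yes", "no"} else default
```

A helper like this could replace or follow the JSON output parser when the grader result drives a routing decision.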


In [81]:
# Test whether the LLM can rewrite a query based on the conversation history
rewrite_prompt = PromptTemplate(
    template="""You are a question re-writer that converts an input question to a better version that is optimized \n 
     for vectorstore retrieval. Use the history conversation to resolve references. Keep the contextual meaning. \n
     Here is the history conversation: \n\n {document} \n\n
     Here is the initial question: \n\n {question}. Improved question with no preamble: \n """,
    input_variables=["question", "document"],
)

query_rewrite_chain = rewrite_prompt | master_agent | StrOutputParser()

In [127]:
doc_txt = history_as_text(combine_return_2["history"])
print(doc_txt)

HUMAN: My nasal is disconfort. Do you have a medicine to relieve sinus congestion and pressure?
AI: phenylephrine is used to relieve nasal discomfort caused by colds, allergies, and hay fever. it is also used to relieve sinus congestion and pressure. phenylephrine will relieve symptoms but will not treat the cause of the symptoms or speed recovery. phenylephrine is in a class of medications called nasal decongestants. it works by reducing swelling of the blood vessels in the nasal passages.about Phenylephrine


In [82]:
result = query_rewrite_chain.invoke({"question": query_2, "document": doc_txt})

In [83]:
print(result)

 How can phenylephrine be used to relieve sinus congestion and pressure?


In [89]:
# Test whether the LLM can judge if the retrieved documents are relevant enough to the question
doc_relevance_prompt = PromptTemplate(
    template="""You are a grader assessing relevance of a retrieved document to a user question. \n 
    Here is the retrieved document: \n\n {document} \n\n
    Here is the user question: {question} \n
    If the document contains keywords related to the user question, grade it as relevant. \n
    It does not need to be a stringent test. The goal is to filter out erroneous retrievals. \n
    Only provide the binary score as a JSON with a single key 'score', for example {{"score": "yes"}} or  {{"score": "no"}}.\n
    No preamble or explanation.""",
    input_variables=["question", "document"],
)

retrieval_grader_chain = doc_relevance_prompt | master_agent | JsonOutputParser()

In [122]:
docs = ask(query_2, retriever, top_k = 2)
doc_txt = " ".join([d.page_content for d in docs])

In [123]:
doc_txt

"pyrethrin and piperonyl butoxide comes as a shampoo to apply to the skin and hair. it is usually applied to the skin and hair in two or three treatments. the second treatment must be applied 7-10 days after the first one. sometimes a third treatment may be necessary, as recommended by your doctor. follow the directions on your prescription label or the package label carefully, and ask your doctor or pharmacist to explain any part you do not understand. use pyrethrin and piperonyl butoxide shampoo exactly as directed. do not use more or less of it or use it more often than directed on the package label or prescribed by your doctor.the package label gives you an estimate of how much shampoo you will need based on your hair length. be sure to use enough shampoo to cover all of your scalp area and hair.pyrethrin and piperonyl butoxide shampoo should only be used on the skin or hair and scalp. avoid getting pyrethrin and piperonyl butoxide shampoo in your eyes, nose, mouth, or vagina. do n

In [None]:
result = retrieval_grader_chain.invoke({"question": query_1, "document": doc_txt})

In [125]:
print(result)

{'score': 'yes'}


In [114]:
# Test whether the LLM can generate an answer

# Prompt
answer_prompt = hub.pull("rlm/rag-prompt")

# Chain
answer_chain = answer_prompt | master_agent | StrOutputParser()



In [116]:
answer_prompt.pretty_print()


You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question} 
Context: {context} 
Answer:


In [126]:
# Run
generation = answer_chain.invoke({"context": docs, "question": query_2})
print(generation)


 You can use pyrethrin and piperonyl butoxide shampoo to treat head lice and scabies. It is applied to the hair and scalp, and then washed off after 10 minutes. Two treatments are usually needed, seven to ten days apart, and a third treatment may be necessary if some lice or nits are still present after the second treatment.


In [148]:
# Hallucination test
# Test whether the LLM can determine if the answer is grounded in the facts.
hallucination_grader_prompt = PromptTemplate(
    template="""You are a grader assessing whether an answer is grounded in / supported by a set of facts. \n 
    Here are the facts:
    \n ------- \n
    {documents} 
    \n ------- \n
    Here is the answer: {generation} \n
    Only provide the binary score as a JSON with a single key 'score', for example {{"score": "yes"}} or  {{"score": "no"}}.\n     
    Don't do preamble or explanation.""",
    input_variables=["generation", "documents"],
)

hallucination_grader_chain = hallucination_grader_prompt | master_agent | JsonOutputParser()

In [161]:
result = hallucination_grader_chain.invoke({"documents": "You can use pyrethrin and piperonyl butoxide shampoo to treat head lice and scabies. ", "generation": "pseudoephedrine may cause side effects. "})

In [162]:
print(result)

{'score': 'no'}


#### This is the point where an agent can transform "How can I use it?" into a retrievable query.

#### The query rewriting part is working well now.
#### But sometimes the user types a question that has multiple meanings. For example: How can I have phenylephrine?
* It could mean "What are the dosage instructions for taking phenylephrine?"
* It could also mean "Where and how can I buy phenylephrine?" 
#### This example reminds me to add a Query Decomposition technique, which turns one query into several queries from different angles.
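
Once a query has been expanded into several variants, each variant returns its own ranked list, and the lists have to be merged. One common option is reciprocal rank fusion (RRF). It is shown here only as an illustration, since the notebook does not implement a merge step; `fuse_rankings` is a hypothetical name and `k=60` is a conventional but arbitrary constant.

```python
from collections import defaultdict

def fuse_rankings(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked doc_id lists: each appearance adds 1 / (k + rank) to the id's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Results for two variants of the same user question (made-up ids)
dosage_hits = ["a", "b", "c"]
purchase_hits = ["b", "d"]
fused = fuse_rankings([dosage_hits, purchase_hits])
```

A document that appears under more than one variant, like `b` here, rises to the top, and the fused list can then be passed to the reranker.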

In [None]:
# I will use the same LLM for this job, with a different prompt

# Test whether the LLM can expand a query into a variant, using the conversation history
expand_query_prompt = PromptTemplate(
    template="""You are a medical doctor. Generate exactly one distinct, clinically relevant variant of the user's original question, \n 
     covering a different angle (e.g., indications/contraindications, dosing vs. administration, \n
     adult vs. pediatric, interactions vs. adverse effects). \n
     Here is the history conversation: \n\n {document} \n\n
     Here is the initial question: \n\n {question}. \n 
     Return only one question""",
    input_variables=["question", "document"],
)

expand_query_chain = expand_query_prompt | master_agent | StrOutputParser()

In [134]:
result = expand_query_chain.invoke({"question": "How can I have Phenylephrine?", "document": doc_txt})

In [135]:
print(result)

 What are the potential side effects of using Phenylephrine for extended periods?


### Using LangGraph to coordinate the retriever, reranker, query rewriter and relevance grader so that together they produce the most relevant answer.
##### First of all, a note on how the graph works.

Each node will:

1. Either be a function or a runnable.

2. Modify the state.

The edges choose which node to call next.
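
The node and edge idea can be shown with a plain-Python loop before bringing in LangGraph. This is a stand-in for the concept, not the LangGraph API: all node names, the routing rule and the single-rewrite limit below are assumptions made for the illustration.

```python
def retrieve_node(state: dict) -> dict:
    # Hypothetical retrieval: only a query that names the drug finds a document
    docs = ["phenylephrine dosing"] if "phenylephrine" in state["query"] else []
    return {**state, "docs": docs}

def rewrite_node(state: dict) -> dict:
    # Hypothetical rewrite: resolve the pronoun using the conversation topic
    new_query = state["query"].replace("it", "phenylephrine")
    return {**state, "query": new_query, "rewrites": state["rewrites"] + 1}

def generate_node(state: dict) -> dict:
    answer = f"Answer based on: {state['docs'][0]}" if state["docs"] else "I don't know."
    return {**state, "generation": answer}

def route_after_retrieve(state: dict) -> str:
    # Edge: go to generation if documents were found or one rewrite was already tried
    return "generate" if state["docs"] or state["rewrites"] >= 1 else "rewrite"

NODES = {"retrieve": retrieve_node, "rewrite": rewrite_node, "generate": generate_node}
EDGES = {
    "retrieve": route_after_retrieve,
    "rewrite": lambda state: "retrieve",
    "generate": lambda state: None,  # None ends the run
}

def run(state: dict, start: str = "retrieve") -> dict:
    current = start
    while current is not None:
        state = NODES[current](state)     # each node returns an updated state
        current = EDGES[current](state)   # each edge picks the next node
    return state

final = run({"query": "How can I use it?", "docs": [], "rewrites": 0, "generation": ""})
```

The run goes retrieve, rewrite, retrieve, generate: the first retrieval finds nothing, the rewrite resolves "it", and the second retrieval succeeds.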

In [None]:
query_responser = master_agent  # `query_rewriter` is not defined in this notebook; reuse the chat model created above

In [None]:
from langgraph.graph import MessagesState
from langchain.chat_models import init_chat_model

# NOTE: this call raises the ValidationError shown below, because ChatHuggingFace requires an `llm` argument.
# The already-loaded `master_agent` can be used instead.
response_model = init_chat_model(model="ContactDoctor/Bio-Medical-Llama-3-8B", model_provider="huggingface", temperature=0.2, task="text-generation")


def generate_query_or_respond(state: MessagesState):
    """Call the model to generate a response based on the current state. Given
    the question, it decides whether to retrieve using the retriever tool or to respond to the user directly.
    """
    # `retriever_tool` is not defined in this section yet; it must wrap `retriever` as a tool before this node runs.
    response = (
        query_responser
        .bind_tools([retriever_tool]).invoke(state["messages"])
    )
    return {"messages": [response]}

ValidationError: 1 validation error for ChatHuggingFace
llm
  Field required [type=missing, input_value={'model_id': 'ContactDoct...ask': 'text-generation'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.11/v/missing

In [None]:
from pydantic import BaseModel, Field
from typing import Literal

GRADE_PROMPT = (
    "You are a grader assessing relevance of a retrieved document to a user question. \n "
    "Here is the retrieved document: \n\n {context} \n\n"
    "Here is the user question: {question} \n"
    "If the document contains keyword(s) or semantic meaning related to the user question, grade it as relevant. \n"
    "Give a binary score 'yes' or 'no' score to indicate whether the document is relevant to the question."
)


class GradeDocuments(BaseModel):
    """Grade documents using a binary score for relevance check."""

    binary_score: str = Field(
        description="Relevance score: 'yes' if relevant, or 'no' if not relevant"
    )


#grader_model = init_chat_model("openai:gpt-4.1", temperature=0)
grader_model = master_agent  # `query_rewriter` is not defined in this notebook; reuse the chat model created above

def grade_documents(
    state: MessagesState,
) -> Literal["generate_answer", "rewrite_question"]:
    """Determine whether the retrieved documents are relevant to the question."""
    question = state["messages"][0].content
    context = state["messages"][-1].content

    prompt = GRADE_PROMPT.format(question=question, context=context)
    response = (
        grader_model
        .with_structured_output(GradeDocuments).invoke(
            [{"role": "user", "content": prompt}]
        )
    )
    score = response.binary_score

    if score == "yes":
        return "generate_answer"
    else:
        return "rewrite_question"