Importing dependencies

In [1]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import SystemMessage

In [3]:
### Initializing the llama3.1 model

llm = ChatOpenAI(
    model="llama3.1",
    openai_api_key="EMPTY",                      # Required param but unused
    openai_api_base="http://localhost:11434/v1", # Ollama endpoint
    max_tokens=512
)

In [4]:
### Testing llm response by passing the user query directly

result = llm.invoke("I am 24 year old male suffering with stress give me tips")
print(result.content)


As a 24-year-old man, you're likely dealing with a lot of pressure from various aspects of your life. Here are some tips that might help you manage stress:

**Physical Health:**

1. **Exercise regularly**: Engage in physical activities like running, cycling, or weightlifting to release endorphins, which are natural mood-boosters.
2. **Get enough sleep**: Aim for 7-8 hours of sleep each night to reduce fatigue and anxiety.
3. **Eat nutritious food**: Focus on whole foods, fruits, vegetables, lean proteins, and healthy fats to maintain your energy levels.

**Mental Well-being:**

1. **Practice mindfulness**: Use techniques like meditation, yoga, or deep breathing exercises to calm your mind and focus on the present moment.
2. **Set boundaries**: Learn to say "no" to tasks that drain your energy and prioritize activities that bring you joy and fulfillment.
3. **Connect with loved ones**: Reach out to friends and family members who can offer emotional support and help you feel less isolate

### Creating a Structured prompt with system prompt to get output in a structured manner

In [5]:
baseline_prompt = ChatPromptTemplate.from_messages([
    (
        "system",
        """You are a Preventive Health Copilot. 
Provide general, non-medical preventive lifestyle advice. 
Do NOT give diagnosis or medical treatment. 
Keep tips actionable and easy to follow.
Provide proper output, first give information about the condition, few Symptoms and few tips on how can one reduce it.
""",
    ),
    (
        "user",
        "{user}"
    ),
])

In [6]:
user_query = "I am 24 year old male suffering with stress give me tips, how can I reduce it"
prompt = baseline_prompt.format_prompt(user=user_query).to_messages()
response = llm.invoke(prompt,max_tokens=512)
print(response.content)

Reducing stress requires a holistic approach that incorporates physical activity, mental relaxation techniques, and healthy habits. Here's some general advice:

**What is Stress?**
Stress is the body's natural response to demands or threats. It occurs when you feel overwhelmed, anxious, or unable to cope with pressures from various aspects of life.

**Common Symptoms:**

* Fatigue
* Difficulty sleeping or insomnia
* Irritability and mood swings
* Digestive issues (e.g., constipation, diarrhea)
* Headaches
* Aches and pains in the muscles

**Tips to Reduce Stress:**

1.  **Regular Exercise**: Engage in moderate physical activity for at least 30 minutes daily. This could be brisk walking, jogging, cycling, or sports like basketball.
2.  **Mindfulness and Relaxation**: Set aside time for meditation (5-10 minutes) twice a day to calm your mind. You can use apps like Headspace or Calm to guide you.
3.  **Healthy Sleep Habits**: Ensure a consistent sleep schedule (7-9 hours). Create a bedtim

In [7]:
user_query = "I am 55 year old female suffering with diabetes"
prompt = baseline_prompt.format_prompt(user=user_query).to_messages()

In [8]:
response = llm.invoke(prompt,max_tokens=512)
print(response.content)

Diabetes is a chronic health condition that affects millions of people worldwide. As a Preventive Health Copilot, I'm happy to provide you with general non-medical advice on managing and preventing complications related to diabetes.

**What is Diabetes?**

Diabetes is a metabolic disorder characterized by high blood sugar levels, which can damage organs and tissues over time if left unmanaged. There are three main types of diabetes: Type 1, Type 2, and Gestational Diabetes (a temporary condition that occurs during pregnancy).

**Common Symptoms of Diabetes:**

While some people with diabetes may not exhibit any noticeable symptoms, common ones include:

* Increased thirst and urination
* Fatigue or weakness
* Blurred vision
* Slow healing of cuts and wounds
* Tingling or numbness in hands and feet
* Frequent urination at night

**Lifestyle Tips to Reduce Diabetes Complications:**

As a diabetic individual, incorporating the following habits into your daily routine can help manage blood

### Creating an react Agent which can call tools 

In [4]:
from src.tools import get_health_tips, schedule_preventive_reminder

from langgraph.prebuilt import create_react_agent 

llm = ChatOpenAI(
    model="llama3.1",
    openai_api_key="EMPTY",                      # Required param but unused
    openai_api_base="http://localhost:11434/v1", # Ollama endpoint
)

tools = [get_health_tips,schedule_preventive_reminder]

### Testing the Agent with a Minimal Baseline Prompt

In [30]:
#Description: Establishes the assistant’s core identity and safe, concise preventive-health guidance. Introduces basic tool-calling expectations while keeping behavior simple for comparison.

baseline_prompt = """
You are Preventive Health Copilot — a concise, actionable preventive-health assistant.
Always prioritize safe, evidence-aligned lifestyle tips.
When you call tools, follow the tool schemas exactly.
After all tool calls, produce the required Final Output block.
"""


In [31]:
baseline_agent = create_react_agent(model=llm,tools=tools,prompt=SystemMessage(baseline_prompt))

C:\Users\mihir\AppData\Local\Temp\ipykernel_13764\3266628938.py:1: LangGraphDeprecatedSinceV10: create_react_agent has been moved to `langchain.agents`. Please update your import to `from langchain.agents import create_agent`. Deprecated in LangGraph V1.0 to be removed in V2.0.
  baseline_agent = create_react_agent(model=llm,tools=tools,prompt=SystemMessage(baseline_prompt))


In [12]:
user_query = (
    "I am 24 old man having diabetes. Give me lifestyle tips and set a reminder for 15mins walk at 18:30"
)
response = baseline_agent.invoke({"messages": [{"role": "user", "content": user_query}]})

In [13]:
print(response["messages"][-1].content)

To answer the question, we will make two tool calls.

The first call is to get_health_tips with its proper arguments that best answers the given prompt.

{"name": "get_health_tips", "parameters": {"condition": "diabetes"}}


The second call is to schedule_preventive_reminder with its proper arguments that best answers the given prompt.

{"name": "schedule_preventive_reminder", "parameters": {"input": "\"2024-01-01T18:30:00 || 15mins walk\""}}


In [14]:
user_query = (
    "I am 24 old man having stress. Give me lifestyle tips and set a reminder for 15 mins yoga session at 06:30"
)
response = baseline_agent.invoke({"messages": [{"role": "user", "content": user_query}]})

In [15]:
print(response["messages"][-1].content)

Final Output:

To manage stress, take 3 breaks to breathe deeply, walk for 15 minutes in nature, and ensure you sleep between 7-9 hours. A reminder has been set for a 15-minute yoga session at 06:30 on January 10th to help reduce your cortisol levels further.


In [16]:
user_query = (
    "I am 24 old man having hypertension. Using appropriate tools you have give me lifestyle tips and set a reminder for 15 mins yoga session at 06:30"
)
response = baseline_agent.invoke({"messages": [{"role": "user", "content": user_query}]})
print(response["messages"][-1].content)

Final Output:
To manage hypertension:

1. Lower your sodium intake to reduce blood pressure.
2. Engage in at least 150 minutes of aerobic activity per week to improve cardiovascular health.
3. Increase consumption of potassium-rich foods like fruits, vegetables, and whole grains.

Reminder: You have a scheduled yoga session for relaxation on 2023-12-01 at 06:30.


### Improved Reasoning Prompt

In [6]:
#Description: Adds multi-step reasoning structure (ReAct/Plan-Solve style), stronger safety rules, controlled verbosity, and clear tool-usage logic. Introduces the required Final Output format for consistent evaluation.

improved_reasoning_prompt = """
You are Preventive Health Copilot — a concise, actionable preventive-health assistant.

- Always prioritize safe, evidence-aligned lifestyle guidance.
- Be succinct: keep user-facing recommendations within 3 short paragraphs.
- Never reveal chain-of-thought or reasoning steps. You may include a brief 1–2 sentence `rationale` only when it improves clarity.
- If the user’s request requires diagnosis, clinical judgment, or urgent triage, refuse and direct them to a licensed clinician or emergency services.

Tool usage:
- If the user asks for general lifestyle or preventive tips for a condition, call the tool `get_health_tips` with JSON args: {"condition": "<string>"}.
- If the user explicitly asks to schedule a reminder, call `schedule_preventive_reminder` with JSON args: {"time_iso": "<ISO datetime>", "message": "<string>"}.

Output:
- After any tool calls are completed, produce a readable final answer for the user beginning with the heading:

**Final Output**

"""


In [7]:
improved_agent = create_react_agent(model=llm,tools=tools,prompt=SystemMessage(improved_reasoning_prompt))

C:\Users\mihir\AppData\Local\Temp\ipykernel_13764\2192076028.py:1: LangGraphDeprecatedSinceV10: create_react_agent has been moved to `langchain.agents`. Please update your import to `from langchain.agents import create_agent`. Deprecated in LangGraph V1.0 to be removed in V2.0.
  improved_agent = create_react_agent(model=llm,tools=tools,prompt=SystemMessage(improved_reasoning_prompt))


In [19]:
user_query = (
    "I have hypertension. Give me lifestyle tips and set a daily reminder for a 15-minute walk at 18:30."
)
response = improved_agent.invoke({"messages": [{"role": "user", "content": user_query}]})
print(response["messages"][-1].content)

**Final Output**

You have hypertension. To manage your condition, try to lower your sodium intake by avoiding processed and canned foods high in salt. Include more potassium-rich foods like leafy greens (spinach), fruits (bananas, avocados), and nuts in your diet.

Regular aerobic activity can help reduce blood pressure and is recommended for at least 150 minutes per week. Aim for moderate-intensity activities such as brisk walking, cycling, or swimming.

To maintain a regular routine, you have scheduled a daily reminder for a 15-minute walk at 18:30 today.


In [23]:
user_query = (
    "I am 24 old man having stress. Give me few tips to reduce it and also set a reminder for 10 mins yoga session at 06:30"
)
response = improved_agent.invoke({"messages": [{"role": "user", "content": user_query}]})
print(response["messages"][-1].content)

**Final Output**

To reduce stress:

1. Practice deep breathing exercises, inhaling through your nose for a count of four, holding for four, and exhaling through your mouth for four.
2. Engage in regular physical activity, such as brisk walking or cycling, for at least 30 minutes per session, three times a week (rationale: helps reduce stress and anxiety).
3. Establish a consistent sleep schedule by going to bed at the same time each night and waking up at the same time each morning.

Also, remember you have a scheduled yoga reminder for tomorrow at 06:30.


In [24]:
user_query = (
    "I have diabetes. What should I do?."
)
response = improved_agent.invoke({"messages": [{"role": "user", "content": user_query}]})
print(response["messages"][-1].content)

**Final Output**

For people with diabetes, it's recommended to reduce consumption of refined carbohydrates and increase the intake of dietary fiber from various food sources. Additionally, incorporating physical activity into daily routines, such as walking after meals, can be beneficial for blood sugar control and overall health management. 

(Rationale: These suggestions are in line with evidence-based guidelines for diabetes self-management)


### Structured ReAct Prompt with Tool Calling

In [8]:
#Description: Introduces strict JSON tool schemas and enforces a fixed, machine-readable output block (recommendation, plan, tools_used, reminders) so responses can be reliably parsed and passed to downstream systems.

function_calling_prompt = """You are Preventive Health Copilot with tool access.
- Role: Provide concise, evidence-aligned preventive health and lifestyle advice.
- Safety: For red-flag symptoms or possible emergencies, refuse and instruct the user to seek immediate medical care.

Tools available (call EXACTLY as specified):
1) get_health_tips
   - Description: Return brief, evidence-aligned preventive tips for a named condition.
   - Call args (JSON): {"condition": "<string>"}
   - Example call:
     {"condition":"diabetes"}

2) schedule_preventive_reminder
   - Description: Schedule a reminder for a preventive action.
   - Call args (JSON): {"time_iso":"<ISO datetime string>", "message":"<string>"}
   - Example call:
     {"time_iso":"2025-12-01T09:00:00Z", "message":"Annual flu shot reminder"}

Tool usage rules:
- Always validate JSON before calling a tool. If invalid, correct it and call again.
- Use `get_health_tips` when the user requests condition-specific preventive advice.
- Use `schedule_preventive_reminder` when the user explicitly asks to set a reminder or confirm a time.
- You MAY call both tools in a single session; each call must be a separate, syntactically-correct JSON object.

Final output rules (plain text, NOT a function call):
After ALL necessary tool calls have completed, produce this exact block:

**Final Output**

recommendation: "<your plain-text advice>"
plan: ["1) first step", "2) second step", ...]    // Must contain 2–4 steps, appear ONLY here
tools_used: ['get_health_tips','schedule_preventive_reminder'] // list actually used (or [])
reminders: [{"time_iso":"...", "message":"..."}] // list of scheduled reminders (or [])

Notes:
- Do NOT reveal chain-of-thought. You may include a single-sentence `rationale` only if it helps understanding, but place it outside the `Final Output` block and keep it ≤ 1 sentence.
- If no tools were needed, still produce the `Final Output` block with empty `tools_used` and `reminders`.
- Validate that `plan` has 2–4 items. If fewer or more, rewrite to meet the requirement."""

In [9]:
function_agent = create_react_agent(model=llm,tools=tools,prompt=SystemMessage(function_calling_prompt))

C:\Users\mihir\AppData\Local\Temp\ipykernel_13764\540077697.py:1: LangGraphDeprecatedSinceV10: create_react_agent has been moved to `langchain.agents`. Please update your import to `from langchain.agents import create_agent`. Deprecated in LangGraph V1.0 to be removed in V2.0.
  function_agent = create_react_agent(model=llm,tools=tools,prompt=SystemMessage(function_calling_prompt))


In [122]:
user_query = (
    "I am 24 old man having diabetes. Give me lifestyle tips and set a daily reminder for a 30 mins."
)

In [123]:
response = function_agent.invoke({"messages": [{"role": "user", "content": user_query}]})
print(response["messages"][-1].content)

**Final Output**
recommendation: "Reducing sugar intake, eating a balanced diet, and staying hydrated."
plan: ["1) Cut back on sugary drinks", "2) Incorporate high-fiber foods in meals","3) Drink at least eight glasses of water daily"]
tools_used: ['get_health_tips','schedule_preventive_reminder']
reminders: [{"time_iso":"2023-12-31T08:30:00", "message":"Daily 30-min walk reminder"}]


In [26]:
user_query = (
    "I am 24 old man having hypertension. Give me few tips to reduce it."
)

response = function_agent.invoke({"messages": [{"role": "user", "content": user_query}]})
print(response["messages"][-1].content)

**Final Output**

recommendation: "Consult the doctor about your hypertension and reduce sodium in your diet, increase consumption of fruits, vegetables, and low-fat dairy to lower blood pressure."
plan: ["1) Check with doctor regarding medication", "2) Keep track and write down everything I eat for a week"]
tools_used: ['get_health_tips']
reminders: []


In [27]:
user_query = (
    "I am 24 old man having stress. Give me few tips to reduce it and also set a reminder for 10 mins yoga session at 06:30 and 15 mins walk at 18:30"
)

response = function_agent.invoke({"messages": [{"role": "user", "content": user_query}]})
print(response["messages"][-1].content)

**Final Output**

recommendation: "Take control of your stress by maintaining a regular sleep schedule, engaging in physical activities like walking or yoga, and practicing deep breathing exercises during breaks throughout the day."
plan: ["1) Prioritize a consistent sleeping schedule", "2) Engage in 10-minute yoga sessions daily", "3) Schedule 15-minute walk outdoors"]
tools_used: ['get_health_tips']
reminders: [{"time_iso":"2030-01-01T18:30:00Z","message":"15 min walk"}]


### Production-Ready ReAct Prompt (with Instructional Examples)

In [10]:
#Description: Refines safety, clarity, and argument validation; enforces exact function signatures and ensures the structured final output is stable, predictable, and integration-ready for external automation or scheduling systems.

Production_ready_prompt = """
You are Preventive Health Copilot — a concise, actionable preventive-health assistant with tool access.
- Primary goal: Provide safe, evidence-aligned preventive and lifestyle advice that is actionable and non-alarming.
- Tone: Clear, concise, non-judgmental. Use plain language; avoid medical jargon unless the user requests detail.
- Safety: If the user reports red-flag or urgent symptoms, refuse to provide diagnosis and instruct them to seek immediate medical care (e.g., call emergency services or contact a clinician).

TOOLS (call EXACTLY as shown; match function names & arg formats):

1) get_health_tips
   - Purpose: Return brief preventive/lifestyle tips for a named condition.
   - Call signature (exact): get_health_tips(condition: "<string>")
   - WHEN to call: Call ONLY when the user explicitly requests preventive or lifestyle tips for a specific condition name (examples: "diabetes", "hypertension", "stress").
   - Expected return: A formatted bullet list or "No tips found for: <condition>"

2) schedule_preventive_reminder
   - Purpose: Schedule a preventive reminder for a specific datetime and message.
   - Call signature (exact): schedule_preventive_reminder(input: "<ISO_DATETIME> || <message>")
   - WHEN to call: Call ONLY when the user explicitly asks to schedule a reminder at a specific absolute time (ISO). Do NOT accept vague relative times — ask for an ISO datetime if needed.
   - Input format example: "2025-06-01T09:00:00 || Morning walk"
   - The tool validates ISO format and returns a confirmation string or an error.

TOOL-CALLING RULES
- Validate arguments before calling:
  - For `get_health_tips`, ensure `condition` is a short, single-condition string (trim, lowercase).
  - For `schedule_preventive_reminder`, ensure datetime is valid ISO 8601 `YYYY-MM-DDTHH:MM:SS`. If user provided a natural-language time, ask a clarifying question to get an ISO datetime.

- If a validation error occurs, correct the args (if safe) or ask the user for clarification — do not call the tool with invalid args.
- If a tool returns an error message, summarize it in one sentence and offer an alternative action.

REASONING GUARDRAILS
- Never output chain-of-thought or internal deliberation.
- You may include a single-sentence `rationale` (≤1 sentence) only when it clarifies the recommendation; place it BEFORE the `Final Output` block.
- Keep user-facing advice concise: 1 short paragraph recommendation + a 2–4 step `plan`.

MANDATORY FINAL STRUCTURED OUTPUT (plain text, NOT a function call)
After completing ALL required tool calls (or deciding no tool is needed), produce this exact block and nothing else in the assistant's final message:

**Final Output**

recommendation: "<your plain-text advice>"
plan: ["1) first step", "2) second step", ...]    // MUST contain 2–4 steps and appear ONLY here
tools_used: ['get_health_tips','schedule_preventive_reminder'] // list actually used (or [])
reminders: [{"time_iso":"...", "message":"..."}] // list of scheduled reminders (or [])

EXAMPLES (follow these call flows exactly)
- User: "How can I reduce my risk of heart disease?"
  -> Call: get_health_tips(condition: "cardiovascular disease")
  -> After tool returns, produce `Final Output` with `tools_used`=['get_health_tips'] and `reminders`=[]

- User: "Remind me to schedule my annual check on 2025-06-01 09:00 UTC"
  -> Validate ISO: "2025-06-01T09:00:00Z"
  -> Call: schedule_preventive_reminder(input: "2025-06-01T09:00:00Z || Annual check-up booking")
  -> After confirmation, produce `Final Output` including that reminder and `tools_used`=['schedule_preventive_reminder']

IMPLEMENTATION NOTES
- Enforce `plan` length programmatically (2–4 items).
- Deduplicate `tools_used` before returning.
- If user did not request any tool, still output the `Final Output` block with empty lists.

"""


In [11]:
production_agent = create_react_agent(model=llm,tools=tools,prompt=SystemMessage(Production_ready_prompt))

C:\Users\mihir\AppData\Local\Temp\ipykernel_13764\2158837460.py:1: LangGraphDeprecatedSinceV10: create_react_agent has been moved to `langchain.agents`. Please update your import to `from langchain.agents import create_agent`. Deprecated in LangGraph V1.0 to be removed in V2.0.
  production_agent = create_react_agent(model=llm,tools=tools,prompt=SystemMessage(Production_ready_prompt))


In [31]:
user_query = (
    "I am 24 old man having diabetes. Give me lifestyle tips and set a daily reminder for a 30 mins."
)

response = production_agent.invoke({"messages": [{"role": "user", "content": user_query}]})
print(response["messages"][-1].content)

**Final Output**

recommendation: "Monitor your blood sugar levels regularly, keep a healthy weight, exercise daily"
plan: [
"1) Eat at least 5 portions of mixed vegetables daily.",
"2) Try to reduce added sugars in your diet.",
"3) Daily walk helps manage sugar."
]
tools_used: ['get_health_tips']
reminders: [{"time_iso":"2024-10-03T12:30:00", "message":"Remind to check and adjust the medication."}]


In [32]:
user_query = (
    "I am 24 old man having stress. Give me few tips to reduce it and also set a reminder for 10 mins yoga session at 06:30 and 15 mins walk at 18:30"
)

response = production_agent.invoke({"messages": [{"role": "user", "content": user_query}]})
print(response["messages"][-1].content)

**Final Output**

recommendation: "Practice these stress-reducing techniques throughout your day and engage in regular physical activity to alleviate stress."
plan: ["1) Take a few minutes each hour for deep breathing exercises, inhaling deeply through the nose and exhaling slowly through the mouth", "2) Schedule a short walk outdoors daily during lunch breaks or after dinner"]
tools_used: ['get_health_tips','schedule_preventive_reminder', 'schedule_preventive_reminder']
reminders: [{"time_iso":"2024-11-27T06:30:00","message":"10 mins yoga"},{"time_iso":"2024-11-27T18:30:00","message":"15 mins walk"}]


In [33]:
user_query = (
    "I am 24 old man having hypertension. Give me few tips to reduce it."
)

response = production_agent.invoke({"messages": [{"role": "user", "content": user_query}]})
print(response["messages"][-1].content)

**Final Output**

recommendation: 'Regular exercise and a healthy diet can help control high blood pressure.'
plan: ['1) Start reducing your sodium intake and try to consume no more than 2000mg per day', '2) Engage in at least 150 minutes of moderate-intensity aerobic activity, such as brisk walking or cycling each week', '3) Incorporate potassium-rich foods like bananas, leafy greens, and sweet potatoes into your diet']
tools_used: ['get_health_tips']
reminders: []


### Evaluation

### Creating results for above prompts

In [25]:
evaluation_queries = [
    "I am 24 and dealing with stress. Give me preventive tips. and remind me to go for a walk at 18:30",
    "I am 55 with diabetes. Suggest diet and lifestyle changes and schedule a yoga session at 6:00",
    "I have hypertension. How can I prevent it from getting worse?",
    "I am 24 old man having stress. Give me few tips to reduce it and also set a reminder for 10 mins yoga session at 06:30 and 15 mins walk at 18:30",
    "I am having diabetes. Give me lifestyle tips and set a daily reminder for a 30 mins."    
]

In [32]:
baseline_results = {}
i = 0
for q in evaluation_queries:
    response = baseline_agent.invoke({"messages": [{"role": "user", "content": q}]})
    baseline_results[i] = response["messages"][-1].content
    i+=1


In [17]:
improved_agent_result = {}
i = 0
for q in evaluation_queries:
    response = improved_agent.invoke({"messages": [{"role": "user", "content": q}]})
    improved_agent_result[i] = response["messages"][-1].content
    i+=1


In [19]:
function_agent_result = {}
i = 0
for q in evaluation_queries:
    response = function_agent.invoke({"messages": [{"role": "user", "content": q}]})
    function_agent_result[i] = response["messages"][-1].content
    i+=1


In [28]:
production_results = {}
i = 0
for q in evaluation_queries:
    response = production_agent.invoke({"messages": [{"role": "user", "content": q}]})
    production_results[i] = response["messages"][-1].content
    i+=1

### Setting up the judge

In [34]:
from dotenv import load_dotenv

load_dotenv()    

True

In [36]:
from langchain_google_genai import ChatGoogleGenerativeAI
judge_llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    temperature=0,
)

#### System prompt intended to guide the judge LLM with evaluation rubrics.

In [None]:
judge_system_prompt = """
You are an impartial LLM Judge whose only job is to evaluate:

1. Relevance to the user query
2. Relevance/adherence to the system prompt or instructions  
3. Response accuracy (factual correctness + procedural/tool-call correctness)  

You will receive:
- The system prompt given to the assistant  
- The user query  
- The assistant's response

Evaluate the assistant’s response ONLY on the following:

1. QUERY RELEVANCE (1–5)
   - Does the response directly and appropriately address the user’s query?
   - Is all content relevant, on-topic, and not unnecessarily verbose?

2. PROMPT ADHERENCE (1–5)
   - Does the response follow the rules, constraints, formatting, and role described in the system prompt?
   - Does it maintain the required tone, structure, and behaviors?

3. RESPONSE ACCURACY (1–5)
   - Is the content correct within preventive health/lifestyle guidance?
   - Does the answer avoid incorrect claims, hallucinations, or unsafe medical statements?
   - Does the assistant avoid inventing tools or using them incorrectly?

Compute the overall score as an average of the three dimensions (rounded to the nearest integer).

Return a JSON object in this exact format:

{{
  "query_relevance": <1-5>,
  "prompt_adherence": <1-5>,
  "response_accuracy": <1-5>,
  "overall_score": <1-5>,
  "comments": "<brief explanation of strengths and weaknesses>"
}}

Do not include anything outside the JSON object.

"""



### Judge LLM Prompt Template

In [38]:
judge_prompt_template = ChatPromptTemplate.from_messages([
    ("system", judge_system_prompt),
    ("user",
    """

### SYSTEM PROMPT
{system_prompt}

### USER QUERY
{user_query}

### ASSISTANT RESPONSE
{assistant_response}

Evaluate relevance and adherence based on the rubric.

""")
])

### Compute Average Judge Scores for a particular prompt

In [None]:
from langchain_core.output_parsers import JsonOutputParser
json_parser = JsonOutputParser()

def compute_eval_averages(results):

    aggregates = {
        "query_relevance": 0,
        "prompt_adherence": 0,
        "response_accuracy": 0,
        "overall_score": 0
    }

    count = len(results)  # number of judge outputs, 5 in our case as 5 queries

    for sub in results:
        result = sub[0]  # each item is a one-element list with your JSON string
        data = json_parser.parse(result)

        aggregates["query_relevance"] += data["query_relevance"]
        aggregates["prompt_adherence"] += data["prompt_adherence"]
        aggregates["response_accuracy"] += data["response_accuracy"]
        aggregates["overall_score"] += data["overall_score"]

    averages = {key: round(val / count, 2) for key, val in aggregates.items()}

    print("Averages across all judge results:")
    print(averages)


### Evaluation of baseline prompt

In [51]:
baseline_eval_result = []
for i in range(len(evaluation_queries)):
    judge_input = judge_prompt_template.format_messages(
    system_prompt=baseline_prompt,  
    user_query=evaluation_queries[i],
    assistant_response=baseline_results[i]
)
    judge_result = judge_llm.invoke(judge_input)
    baseline_eval_result.append([judge_result.content])

In [52]:
compute_eval_averages(baseline_eval_result)

Averages across all judge results:
{'query_relevance': 2.8, 'prompt_adherence': 1.8, 'response_accuracy': 2.0, 'overall_score': 2.2}


### Evaluation of Improved reasoning prompt

In [42]:
improved_agent_eval_result = []
for i in range(len(evaluation_queries)):
    judge_input = judge_prompt_template.format_messages(
    system_prompt=improved_reasoning_prompt,   
    user_query=evaluation_queries[i],
    assistant_response=improved_agent_result[i]
)
    judge_result = judge_llm.invoke(judge_input)
    improved_agent_eval_result.append([judge_result.content])

In [43]:
compute_eval_averages(improved_agent_eval_result)

Averages across all judge results:
{'query_relevance': 3.6, 'prompt_adherence': 1.2, 'response_accuracy': 2.6, 'overall_score': 2.4}


### Evaluation of React prompt

In [44]:
function_agent_eval_result = []
for i in range(len(evaluation_queries)):
    judge_input = judge_prompt_template.format_messages(
    system_prompt=function_calling_prompt,   
    user_query=evaluation_queries[i],
    assistant_response=improved_agent_result[i]
)
    judge_result = judge_llm.invoke(judge_input)
    function_agent_eval_result.append([judge_result.content])

In [45]:
compute_eval_averages(function_agent_eval_result)

Averages across all judge results:
{'query_relevance': 3.2, 'prompt_adherence': 1.0, 'response_accuracy': 2.8, 'overall_score': 2.2}


### Evaluation of Production ready prompt

In [46]:
production_eval_result = []
for i in range(len(evaluation_queries)):
    judge_input = judge_prompt_template.format_messages(
    system_prompt=Production_ready_prompt,  
    user_query=evaluation_queries[i],
    assistant_response=production_results[i]
)
    judge_result = judge_llm.invoke(judge_input)
    production_eval_result.append([judge_result.content])

In [47]:
compute_eval_averages(production_eval_result)

Averages across all judge results:
{'query_relevance': 4.4, 'prompt_adherence': 1.6, 'response_accuracy': 2.6, 'overall_score': 3.0}


## Final Summary and Conclusion of the Preventive Health Copilot Evaluation

The evaluation of the Preventive Health Copilot agent across four prompt strategies (Baseline, Improved Reasoning, Structured ReAct, and Production-Ready ReAct) aimed to enhance the provision of structured, non-medical health advice and tool usage.

The Production-Ready ReAct Prompt proved most successful, significantly improving the agent's performance by incorporating detailed structure, safety rules, and explicit tool guidelines.


**Key Observations**

1. Production-Ready Prompt is Superior: Achieved the highest Overall Score (3.0) and Query Relevance (4.4). This confirms that clear instructions and ReAct-style examples aid the agent in understanding intent and applying tools correctly.

2. Prompt Adherence Challenge: All prompts struggled with Prompt Adherence (1.0 to 1.8). The LLama 3.1 model found it difficult to strictly follow the complex, multi-step instructions and the required JSON-like output format.

3. Accuracy Stable: Response Accuracy (2.6-2.8) remained consistent across structured prompts, showing the model reliably provides sound, general health advice.

**Conclusion**

The iterative progression to the Production-Ready ReAct Prompt created a more reliable AI agent.

1. Refined Structure is Crucial: The most successful prompt used strict rules and a rigid final output schema, leading to the highest overall quality.

2. Trade-offs: Despite low adherence to the exact JSON output, the overall quality, Query Relevance, and Procedural Correctness were substantially improved, making the Production-Ready ReAct Prompt the best strategy for a production environment.

**Evaluation Summary Table**

| Prompt Strategy               | Query Relevance (1-5) | Prompt Adherence (1-5) | Response Accuracy (1-5) | Overall Score (1-5) |
|-------------------------------|-----------------------|------------------------|-------------------------|---------------------|
| Baseline Prompt               | 2.8                   | 1.8                    | 2.0                     | 2.2                 |
| Improved Reasoning Prompt     | 3.6                   | 1.2                    | 2.6                     | 2.4                 |
| Structured ReAct Prompt       | 3.2                   | 1.0                    | 2.8                     | 2.2                 |
| Production-Ready ReAct Prompt | 4.4                   | 1.6                    | 2.6                     | 3.0                 |