# Fitbit Conversational AI – Experimentation Notebook

This notebook documents the experimentation process behind the Fitbit Conversational AI POC.
The goal is to validate key design decisions around intent understanding, data grounding,
prompting strategy, and automated evaluation.

In [4]:
import sys
from pathlib import Path

PROJECT_ROOT = Path().resolve().parents[0]
sys.path.append(str(PROJECT_ROOT))

from dotenv import load_dotenv
load_dotenv()

import json
import os

assert os.getenv("ANTHROPIC_API_KEY"), "ANTHROPIC_API_KEY not set"


In [2]:
with open("../data/mock_fitbit_data.json") as f:
    fitbit_data = json.load(f)

fitbit_data


{'user_profile': {'age': 38,
  'goals': {'steps_daily': 10000, 'sleep_hours': 7}},
 'daily_metrics': [{'date': '2025-03-17',
   'steps': 7900,
   'resting_hr': 63,
   'sleep': {'duration_min': 420, 'efficiency': 0.85}},
  {'date': '2025-03-18',
   'steps': 8234,
   'resting_hr': 61,
   'sleep': {'duration_min': 345, 'efficiency': 0.82}}]}

### Data Representation & Grounding Strategy

The assistant operates only on structured mock Fitbit data.
This data is passed explicitly to the LLM with every request so that responses
stay grounded in known values and the risk of hallucination is reduced.

Key surfaced signals include:
- Sleep duration and efficiency
- Daily and weekly step counts
- Resting heart rate trends
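
As a rough illustration of how these signals might be surfaced to the model, the sketch below renders the most recent day of the mock data as a short text block that could be placed in a prompt. The function name `build_grounding_context` and the output layout are assumptions made for this example; the project's actual formatting code is not shown in this notebook.

```python
def build_grounding_context(data: dict) -> str:
    """Render the most recent day's metrics as plain text for a prompt.

    Hypothetical helper: the name and output layout are illustrative only.
    """
    goals = data["user_profile"]["goals"]
    # ISO dates (YYYY-MM-DD) sort correctly as strings.
    latest = max(data["daily_metrics"], key=lambda day: day["date"])

    sleep_min = latest["sleep"]["duration_min"]
    goal_min = goals["sleep_hours"] * 60
    hours, minutes = divmod(sleep_min, 60)
    goal_status = "met" if sleep_min >= goal_min else "not met"

    lines = [
        f"Date: {latest['date']}",
        f"Steps: {latest['steps']} (goal {goals['steps_daily']})",
        f"Resting HR: {latest['resting_hr']}",
        f"Sleep: {hours}h {minutes}m at {latest['sleep']['efficiency']:.0%} "
        f"efficiency (goal {goals['sleep_hours']}h, {goal_status})",
    ]
    return "\n".join(lines)


# Same shape as the mock data loaded above.
sample = {
    "user_profile": {"age": 38, "goals": {"steps_daily": 10000, "sleep_hours": 7}},
    "daily_metrics": [
        {"date": "2025-03-17", "steps": 7900, "resting_hr": 63,
         "sleep": {"duration_min": 420, "efficiency": 0.85}},
        {"date": "2025-03-18", "steps": 8234, "resting_hr": 61,
         "sleep": {"duration_min": 345, "efficiency": 0.82}},
    ],
}

context = build_grounding_context(sample)
print(context)
```

Precomputing the goal comparison in code, rather than leaving the arithmetic to the model, is one way to keep derived claims such as "below the 7-hour goal" tied to the data.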


In [None]:
from agents.intent_agent import detect_intent

# Intent Detection Sanity Check
queries = [
    "How did I sleep last night?",
    "Am I more active this week than last week?",
    "What can I do to improve my heart rate variability?",
    "Tell me a joke"
]

for q in queries:
    print(f"Query: {q}")
    print("Detected intent:", detect_intent(q))
    print("-" * 40)


Query: How did I sleep last night?
Detected intent: {'intent': 'sleep', 'time_reference': 'last night'}
----------------------------------------
Query: Am I more active this week than last week?
Detected intent: {'intent': 'comparison', 'time_reference': 'this week vs last week'}
----------------------------------------
Query: What can I do to improve my heart rate variability?
Detected intent: {'intent': 'heart', 'time_reference': None}
----------------------------------------
Query: Tell me a joke
Detected intent: {'intent': 'activity', 'time_reference': None}
----------------------------------------


### Intent Detection Design Choice

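Intent detection is treated as a classification problem.
Temperature is set to 0 so that identical inputs are routed as consistently as possible
(sampling at temperature 0 is close to deterministic, though not strictly guaranteed to be).

This mirrors traditional ML routing systems while allowing fast iteration via prompting.

The sanity check above also exposes a gap: the out-of-scope query "Tell me a joke" was
routed to `activity`, which suggests the label set has no option for requests the
assistant should not handle.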
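
A small validation layer between the classifier output and the router is one way to keep routing controlled. The sketch below assumes the model returns a JSON string and checks it against the four labels seen in the sanity-check output; the `out_of_scope` fallback label and the `normalize_intent` helper are hypothetical and are not part of the project's `detect_intent`.

```python
import json

# Labels observed in the sanity-check output above.
ALLOWED_INTENTS = {"sleep", "activity", "heart", "comparison"}
# Hypothetical fallback label; the project may name or handle this differently.
FALLBACK_INTENT = "out_of_scope"


def normalize_intent(raw: str) -> dict:
    """Parse a raw classifier reply and coerce it to a known intent."""
    fallback = {"intent": FALLBACK_INTENT, "time_reference": None}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(parsed, dict):
        return fallback

    intent = str(parsed.get("intent", "")).strip().lower()
    if intent not in ALLOWED_INTENTS:
        intent = FALLBACK_INTENT
    return {"intent": intent, "time_reference": parsed.get("time_reference")}


print(normalize_intent('{"intent": "sleep", "time_reference": "last night"}'))
print(normalize_intent('{"intent": "joke", "time_reference": null}'))
print(normalize_intent("not json at all"))
```

This only guards against malformed or unknown labels. For the classifier itself to return the fallback on out-of-scope queries, the label would also need to be offered in the classification prompt.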


In [16]:
### Response Generation Experiment
from graph import build_graph
from memory.conversation_memory import ConversationMemory

graph = build_graph(fitbit_data)
memory = ConversationMemory()
query = "How did I sleep last night?"


# Prepare state
state = {
    "user_input": query,
    "memory": memory.get()
}

# Assistant generates response
try:
    result = graph.invoke(state)
    response = result["response"]
except Exception as e:
    response = None
    print(f"Assistant error: {e}\n")

# Print the assistant response and record the turn in conversation memory
if response:
    print(f"Assistant: {response}\n")
    memory.update(query, response)




Assistant: Okay, let's take a look at your sleep data. Your sleep duration is 5 hours and 45 minutes, which is a bit below the recommended 7 hours per night. However, your sleep efficiency of 82% is within the normal range, which is great.

To help you get closer to that 7-hour target, I'd suggest trying to establish a more consistent bedtime routine. Things like winding down with a relaxing activity, avoiding screens before bed, and aiming to go to sleep and wake up at the same time each day can really help improve your sleep duration.

Remember, small adjustments to your sleep habits can make a big difference over time. Let me know if you have any other questions - I'm here to support you on your journey to better sleep!



### Response Generation Strategy

Response generation uses a slightly higher temperature to allow natural,
empathetic phrasing while remaining grounded in data.

Prompts explicitly:
- Reference concrete data points
- Avoid medical claims
- Encourage actionable but gentle suggestions
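
To make the three prompt rules concrete, here is a minimal sketch of how such a prompt might be assembled. The rule wording, the section labels, the `build_prompt` helper, and the assumption that memory is a list of `(query, response)` pairs are all illustrative; the project's real prompt is not shown in this notebook.

```python
# Illustrative rule text; not the project's actual system prompt.
SYSTEM_RULES = (
    "You are a supportive wellness assistant.\n"
    "- Cite specific numbers from the DATA section.\n"
    "- Do not diagnose or make medical claims.\n"
    "- Offer at most two gentle, actionable suggestions.\n"
    "- If the DATA section lacks what the user asks about, say so."
)


def build_prompt(user_query: str, data_context: str, memory: list) -> str:
    """Combine rules, grounded data, prior turns, and the new query."""
    # Memory is assumed to be a list of (query, response) tuples.
    history = "\n".join(
        f"User: {q}\nAssistant: {a}" for q, a in memory
    ) or "(none)"
    return (
        f"{SYSTEM_RULES}\n\n"
        f"DATA:\n{data_context}\n\n"
        f"HISTORY:\n{history}\n\n"
        f"USER: {user_query}"
    )


prompt = build_prompt(
    "How did I sleep last night?",
    "Sleep: 5h 45m at 82% efficiency (goal 7h)",
    memory=[],
)
print(prompt)
```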


In [None]:
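### Grounding Failure Test (Hallucination Avoidance)
# Assumes synthesize_response has already been imported from the project's
# response-generation module; the import is not shown in this notebook.

query = "How was my blood oxygen last night?"
intent = "sleep"  # set by hand here, bypassing detect_intent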

synthesize_response(
    {
        "user_query": query,
        "intent": intent,
        "data": fitbit_data,
        "memory": []
    }
)


"Unfortunately, I don't have any information about your blood oxygen levels from last night in the data provided. The data includes your daily step counts, resting heart rates, and sleep duration and efficiency, but does not mention blood oxygen.\n\nWithout that specific metric, I can't give you details on your blood oxygen from last night. However, I can share a few observations based on the sleep data:\n\nOn March 17th, you slept for 7 hours with an 85% sleep efficiency. On March 18th, you slept for 5 hours and 45 minutes with an 82% sleep efficiency. Both of these sleep sessions were within your goal of 7 hours per night.\n\nLet me know if you have any other questions about the sleep data I do have access to. I'm happy to provide more insights if you can share your blood oxygen information as well."

### Handling Missing Data

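When required data is unavailable, the assistant declines to speculate about it.
This is a deliberate design choice to prioritize trust and safety over coverage.

Declining on the missing metric does not guarantee that the rest of the reply is faithful:
in the output above, the follow-up summary says both nights were within the 7-hour goal,
although the March 18 night (5 hours 45 minutes) was not.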
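
A deterministic pre-check can complement the prompt-level behaviour by detecting a missing metric before the model is called. The sketch below is an assumption-laden example: the keyword table, the `spo2` key name, and both helper functions are hypothetical and do not come from the project code.

```python
def available_metrics(data: dict) -> set:
    """Collect metric names present in the daily records (nested keys dotted)."""
    names = set()
    for day in data.get("daily_metrics", []):
        for key, value in day.items():
            if key == "date":
                continue
            if isinstance(value, dict):
                names.update(f"{key}.{sub}" for sub in value)
            else:
                names.add(key)
    return names


# Hypothetical mapping from user-facing terms to data keys.
METRIC_KEYWORDS = {
    "blood oxygen": "spo2",
    "steps": "steps",
    "resting heart rate": "resting_hr",
}


def missing_metrics(query: str, data: dict) -> list:
    """Return the terms mentioned in the query that have no backing data."""
    present = available_metrics(data)
    lowered = query.lower()
    return [
        term for term, key in METRIC_KEYWORDS.items()
        if term in lowered and key not in present
    ]


sample = {
    "daily_metrics": [
        {"date": "2025-03-18", "steps": 8234, "resting_hr": 61,
         "sleep": {"duration_min": 345, "efficiency": 0.82}},
    ]
}

print(missing_metrics("How was my blood oxygen last night?", sample))
print(missing_metrics("How many steps did I take?", sample))
```

When the returned list is non-empty, the caller could short-circuit with a fixed "no data available" reply instead of relying on the model to decline.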


In [15]:
from evaluation.llm_judge import judge

evaluation = judge(
    query="How did I sleep last night?",
    response=response,
    data=fitbit_data
)

print(evaluation)


1. Yes, all the claims made in the response are grounded in the data provided in the user's Fitbit profile.

2. The tone of the response is appropriate - it is informative, objective, and provides constructive feedback without being overly critical.

3. The advice provided is non-medical and focused on general sleep hygiene recommendations, such as aiming for 7 hours of sleep per night and maintaining a sleep efficiency of at least 85%. This is appropriate for an AI assistant without medical expertise.

Overall, the response is well-structured, data-driven, and provides helpful insights and suggestions to the user about their sleep patterns.


### Automated Evaluation (LLM-as-a-Judge)

Responses are evaluated along three criteria:
1. Faithfulness to source data
2. Tone and empathy
3. Actionability, with advice kept non-medical

This enables scalable offline quality monitoring without labeled datasets,
and supports future regression testing and prompt iteration.
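
For regression testing, free-text verdicts like the one above are hard to compare across runs. One option, sketched below, is to ask the judge for a JSON object with one score per criterion and gate on a threshold. The JSON schema, the 1 to 5 scale, the threshold, and the `parse_verdict` helper are assumptions for illustration; the project's `judge` function returns prose and is not changed by this sketch.

```python
import json

# Criterion names mirror the list above; the key spelling is an assumption.
CRITERIA = ("faithfulness", "tone", "actionability")


def parse_verdict(raw: str, threshold: int = 4) -> dict:
    """Parse a JSON verdict with 1-5 scores and decide pass or fail."""
    scores = json.loads(raw)
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"judge omitted criteria: {missing}")

    clean = {c: int(scores[c]) for c in CRITERIA}
    failed = [c for c in CRITERIA if clean[c] < threshold]
    return {"scores": clean, "passed": not failed, "failed": failed}


# Made-up input for illustration only; not a real judge output.
example = '{"faithfulness": 5, "tone": 4, "actionability": 3}'
verdict = parse_verdict(example)
print(verdict)
```

A structured verdict of this kind can be stored per prompt version, so that a drop in any single criterion shows up as a failing check rather than a change in wording.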


### Summary

This notebook demonstrates:
- Controlled intent routing
- Data-grounded response generation
- Explicit hallucination handling
- Automated evaluation for quality assurance

The experimentation process focuses on system correctness, safety,
and extensibility rather than raw model performance.
