<h2>ClinAI Voice Agent - Evaluation Notebook</h2>

*SUMMARY*: The goal of this test is to measure the general accuracy of the ClinAI agent using voice input only. The test scenarios are varied and simulate real patient conversations and requests. The main features being tested are scheduling appointments, cancelling appointments, giving information about the clinic/office, and refilling prescriptions.

<small>*METHODOLOGY*: 42 scenarios have been created (generated by AI to remove bias). I will use the microphone button for each prompt, meaning all input will be audio. Each scenario will be fed into the agent exactly as it is written in the "Scenario" column, and the results of that conversation will be compared to the corresponding "Expected Result". If the results match, that scenario will receive a "PASS". If the results don't match, that scenario will receive a "FAIL". Images of the transcripts and the updated information in the database will be provided in cells for full transparency.</small>

In [1]:
import pandas as pd

In [2]:
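# Read the empty "PASS/FAIL" column as object so string results can be stored without a dtype warning
df = pd.read_csv("clinai_voice_test_cases.csv", dtype={"PASS/FAIL": "object"})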

In [3]:
# show scenarios and expected result
df.head(len(df))

Unnamed: 0,ID,Category,Scenario,Expected Result,PASS/FAIL
0,1,Scheduling,User tells agent they want to be booked for ne...,2025-11-25 03:00 PM appt created in DB,
1,2,Scheduling,User requests an appointment tomorrow at 10 in...,"Agent informs user tomorrow is a weekend, 2025...",
2,3,Scheduling,"User asks for an appointment Friday at 4:30pm,...",Agent informs user they're open until 4 on Fri...,
3,4,Scheduling,User wants to be booked on June 18th at 2pm,2025-06-18 02:00 PM appt created in DB,
4,5,Scheduling,User asks if they can come in Monday around noon,2025-11-24 12:00 PM appt created in DB,
5,6,Scheduling,User asks for an appointment on Friday and cho...,Agent shows user all availabilities for Friday...,
6,7,Scheduling,"User asks for human rep, then asks to schedule...","Agent transfers to human rep, 2025-11-26 03:30...",
7,8,Scheduling,User wants to schedule for December 5th at 8am...,Agent asks which date they would like to sched...,
8,9,Scheduling,User tries to schedule next Tuesday at 8:30am ...,Agent informs user that time is already booked...,
9,10,Scheduling,User asks to be scheduled on the weekend then ...,"Agent says they're closed on weekends, 2025-11...",


<h4>Helper functions:</h4>

In [4]:
def test_passed(df, index: int):
    """Mark the test case at row label `index` as passed."""
    df.loc[index, "PASS/FAIL"] = "PASS"
    print("Result: PASS")

In [5]:
def test_failed(df, index: int):
    """Mark the test case at row label `index` as failed."""
    df.loc[index, "PASS/FAIL"] = "FAIL"
    print("Result: FAIL")
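
The two helpers above differ only in the string they write, so they could be folded into one function. The following is a minimal sketch, not the notebook's actual code: `record_result` and the two-row `demo` frame are hypothetical names introduced for illustration. It also assumes the "PASS/FAIL" column may have been read from an empty CSV column as float64, and casts it to object before storing a string.

```python
import pandas as pd


def record_result(df: pd.DataFrame, index: int, passed: bool) -> str:
    """Write PASS or FAIL into the "PASS/FAIL" column for one row label."""
    if index not in df.index:
        raise KeyError(f"No test case at index {index}")
    # An empty CSV column is read as float64 (all NaN); cast it to object
    # so that storing a string does not trigger a dtype warning.
    if df["PASS/FAIL"].dtype != object:
        df["PASS/FAIL"] = df["PASS/FAIL"].astype(object)
    result = "PASS" if passed else "FAIL"
    df.loc[index, "PASS/FAIL"] = result
    print(f"Result: {result}")
    return result


# Hypothetical two-row stand-in for the real test-case table.
demo = pd.DataFrame({"Scenario": ["a", "b"], "PASS/FAIL": [float("nan")] * 2})
record_result(demo, 0, passed=True)
record_result(demo, 1, passed=False)
```

Raising on an unknown index guards against a typo silently adding a new row, which `df.loc[...] = ...` would otherwise do.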

<h2>Testing starts here:</h2>

<h4>Scheduling Appointments</h4>

In [6]:
# Test 1
print(f"Scenario: {df['Scenario'][0]}\n")
print(f"Expected Result: {df['Expected Result'][0]}")

Scenario: User tells agent they want to be booked for next Tuesday at 3pm

Expected Result: 2025-11-25 03:00 PM appt created in DB


![img](screenshots/test1.png)
![img](screenshots/test-1.png)

In [7]:
test_passed(df, 0)

Result: PASS


In [8]:
# Test 2
print(f"Scenario: {df['Scenario'][1]}\n")
print(f"Expected Result: {df['Expected Result'][1]}")

Scenario: User requests an appointment tomorrow at 10 in the morning but tomorrow is a Saturday so user schedules for Monday at 11am instead

Expected Result: Agent informs user tomorrow is a weekend, 2025-11-24 11:00 AM appt created in DB


![img](screenshots/test2.png)
![img](screenshots/test-2.png)

In [9]:
test_passed(df, 1)

Result: PASS


In [10]:
# Test 3
print(f"Scenario: {df['Scenario'][2]}\n")
print(f"Expected Result: {df['Expected Result'][2]}")

Scenario: User asks for an appointment Friday at 4:30pm, clinic is only open until 4pm on Fridays, user asks for 3:30pm instead

Expected Result: Agent informs user they're open until 4 on Fridays, 2025-11-28 03:30 PM appt created in DB


![img](screenshots/test3.png)
![img](screenshots/test-3.png)

In [11]:
test_passed(df, 2)

Result: PASS


In [12]:
# Test 4
print(f"Scenario: {df['Scenario'][3]}\n")
print(f"Expected Result: {df['Expected Result'][3]}")

Scenario: User wants to be booked on June 18th at 2pm

Expected Result: 2025-06-18 02:00 PM appt created in DB


![img](screenshots/test4.png)
![img](screenshots/test-4.png)

In [13]:
test_passed(df, 3)

Result: PASS


In [14]:
# Test 5
print(f"Scenario: {df['Scenario'][4]}\n")
print(f"Expected Result: {df['Expected Result'][4]}")

Scenario: User asks if they can come in Monday around noon

Expected Result: 2025-11-24 12:00 PM appt created in DB


![img](screenshots/test5.png)
![img](screenshots/test-5.png)

In [15]:
test_passed(df, 4)

Result: PASS


In [16]:
# Test 6
print(f"Scenario: {df['Scenario'][5]}\n")
print(f"Expected Result: {df['Expected Result'][5]}")

Scenario: User asks for an appointment on Friday and chooses 10am when given available options

Expected Result: Agent shows user all availabilities for Friday, 2025-11-28 10:00 AM appt created in DB


![img](screenshots/test6.png)
![img](screenshots/test-6.png)

In [17]:
test_passed(df, 5)

Result: PASS


In [18]:
# Test 7
print(f"Scenario: {df['Scenario'][6]}\n")
print(f"Expected Result: {df['Expected Result'][6]}")

Scenario: User asks for human rep, then asks to schedule for next Wednesday and picks 3:30pm after hearing available times

Expected Result: Agent transfers to human rep, 2025-11-26 03:30 PM appt created in DB


![img](screenshots/test7.png)
![img](screenshots/test-7.png)

In [19]:
test_passed(df, 6)

Result: PASS


In [20]:
# Test 8
print(f"Scenario: {df['Scenario'][7]}\n")
print(f"Expected Result: {df['Expected Result'][7]}")

Scenario: User wants to schedule for December 5th at 8am but only mentions the time they want to schedule for

Expected Result: Agent asks which date they would like to schedule that time slot on, 2025-12-05 08:00 AM appt created in DB


![img](screenshots/test8.png)
![img](screenshots/test-8.png)

In [21]:
test_passed(df, 7)

Result: PASS


In [22]:
# Test 9
print(f"Scenario: {df['Scenario'][8]}\n")
print(f"Expected Result: {df['Expected Result'][8]}")

Scenario: User tries to schedule next Tuesday at 8:30am but that time slot is already booked, then schedules for 8am instead

Expected Result: Agent informs user that time is already booked then recommends the nearest available times, 2025-11-25 08:00 AM appt created in DB


![img](screenshots/test9.png)
![img](screenshots/test-9.png)

In [23]:
test_passed(df, 8)

Result: PASS


In [24]:
# Test 10
print(f"Scenario: {df['Scenario'][9]}\n")
print(f"Expected Result: {df['Expected Result'][9]}")

Scenario: User asks to be scheduled on the weekend then tries to schedule a 4pm appointment the following Monday

Expected Result: Agent says they're closed on weekends, 2025-11-24 04:00 PM appt created in DB


![img](screenshots/test10.png)
![img](screenshots/test-10.png)

In [25]:
test_passed(df, 9)

Result: PASS


In [26]:
# Test 11
print(f"Scenario: {df['Scenario'][10]}\n")
print(f"Expected Result: {df['Expected Result'][10]}")

Scenario: User asks to be scheduled on a day that is completely booked then says they don't want to schedule an appointment anymore

Expected Result: Agent informs user that day is completely booked, exits the appointment scheduling pipeline


![img](screenshots/test11.png)

In [27]:
test_passed(df, 10)

Result: PASS


In [28]:
# Test 12
print(f"Scenario: {df['Scenario'][11]}\n")
print(f"Expected Result: {df['Expected Result'][11]}")

Scenario: User asks to be scheduled for next Wednesday but there is only one available slot for that day at 4:30pm, user accepts the last slot

Expected Result: Agent informs user there is one last slot for Wednesday at 4:30pm, 2025-11-26 04:30 PM appt created in DB


![img](screenshots/test12.png)
![img](screenshots/test-12.png)

In [29]:
test_passed(df, 11)

Result: PASS


<h4>Cancelling Appointments</h4>

<small>"status" column should show "cancelled"</small>

In [30]:
# Test 13
print(f"Scenario: {df['Scenario'][12]}\n")
print(f"Expected Result: {df['Expected Result'][12]}")

Scenario: User asks to cancel their Friday appointment at 2pm

Expected Result: Appointment cancelled in DB @ 2025-11-28 02:00 PM


![img](screenshots/test13.png)
![img](screenshots/test-13.png)

In [31]:
test_passed(df, 12)

Result: PASS


In [32]:
# Test 14
print(f"Scenario: {df['Scenario'][13]}\n")
print(f"Expected Result: {df['Expected Result'][13]}")

Scenario: User requests cancellation of their Friday appointment at 9am but they don't have any appointments scheduled at all

Expected Result: Agent informs user they don't have any appointments currently scheduled


![img](screenshots/test14.png)

In [33]:
test_passed(df, 13)

Result: PASS


In [34]:
# Test 15
print(f"Scenario: {df['Scenario'][14]}\n")
print(f"Expected Result: {df['Expected Result'][14]}")

Scenario: User asks to cancel tomorrow's appointment

Expected Result: Appointment cancelled in DB @ 2025-11-24 10:00 AM


![img](screenshots/test15.png)
![img](screenshots/test-15.png)

In [35]:
test_passed(df, 14)

Result: PASS


In [36]:
# Test 16
print(f"Scenario: {df['Scenario'][15]}\n")
print(f"Expected Result: {df['Expected Result'][15]}")

Scenario: User requests cancellation of their appointment on June 18th at 2pm

Expected Result: Agent asks which appointment they'd like to cancel, appointment cancelled in DB @ 2025-06-18 02:00 PM


![img](screenshots/test16.png)
![img](screenshots/test-16.png)

In [37]:
test_passed(df, 15)

Result: PASS


In [38]:
# Test 17
print(f"Scenario: {df['Scenario'][16]}\n")
print(f"Expected Result: {df['Expected Result'][16]}")

Scenario: User wants to cancel their Wednesday afternoon appointment

Expected Result: Appointment cancelled in DB @ 2025-11-26 03:00 PM


![img](screenshots/test17.png)
![img](screenshots/test-17.png)

In [39]:
test_passed(df, 16)

Result: PASS


In [40]:
# Test 18
print(f"Scenario: {df['Scenario'][17]}\n")
print(f"Expected Result: {df['Expected Result'][17]}")

Scenario: User asks to cancel their appointment on Friday but has multiple appointments on that day, cancels the 10am appointment

Expected Result: Appointment cancelled in DB @ 2025-11-28 10:00 AM


![img](screenshots/test18.png)
![img](screenshots/test-18.png)

In [41]:
test_passed(df, 17)

Result: PASS


In [42]:
# Test 19
print(f"Scenario: {df['Scenario'][18]}\n")
print(f"Expected Result: {df['Expected Result'][18]}")

Scenario: User angrily asks to speak to a representative, then demands the fake human rep to cancel their appointment

Expected Result: Upcoming appointment cancelled in DB


![img](screenshots/test19.png)
![img](screenshots/test-19.png)

In [43]:
test_passed(df, 18)

Result: PASS


In [44]:
# Test 20
print(f"Scenario: {df['Scenario'][19]}\n")
print(f"Expected Result: {df['Expected Result'][19]}")

Scenario: User says they can't make their 3:30pm Wednesday appointment anymore

Expected Result: Appointment cancelled in DB @ 2025-11-26 03:30 PM


![img](screenshots/test20.png)
![img](screenshots/test-20.png)

In [45]:
test_passed(df, 19)

Result: PASS


In [46]:
# Test 21
print(f"Scenario: {df['Scenario'][20]}\n")
print(f"Expected Result: {df['Expected Result'][20]}")

Scenario: User asks to cancel their 8am Thursday appointment but they don't have an appointment scheduled Thursday, then remembers it was actually on Tuesday

Expected Result: Agent informs user they don't have an appointment scheduled Thursday, Appointment cancelled in DB @ 2025-11-25 08:00 AM


![img](screenshots/test21.png)
![img](screenshots/test-21.png)

In [47]:
test_passed(df, 20)

Result: PASS


In [48]:
# Test 22
print(f"Scenario: {df['Scenario'][21]}\n")
print(f"Expected Result: {df['Expected Result'][21]}")

Scenario: User says cancel my Monday at 1pm appointment then changes their mind and doesn't cancel their appointment

Expected Result: Agent tells user their appointment won't be cancelled and asks if they need help with anything else


![img](screenshots/test22.png)

In [49]:
test_passed(df, 21)

Result: PASS


<h4>Retrieving Admin Info</h4>

In [50]:
# Test 23
print(f"Scenario: {df['Scenario'][22]}\n")
print(f"Expected Result: {df['Expected Result'][22]}")

Scenario: User asks What is your address?

Expected Result: Correct address returned: 123 Main St, Springfield, CA 90000


![img](screenshots/test23.png)

In [51]:
test_passed(df, 22)

Result: PASS


In [52]:
# Test 24
print(f"Scenario: {df['Scenario'][23]}\n")
print(f"Expected Result: {df['Expected Result'][23]}")

Scenario: User asks for the clinic phone number

Expected Result: Correct phone returned: (555) 555-0123


![img](screenshots/test24.png)

In [53]:
test_passed(df, 23)

Result: PASS


In [54]:
# Test 25
print(f"Scenario: {df['Scenario'][24]}\n")
print(f"Expected Result: {df['Expected Result'][24]}")

Scenario: User asks what time the clinic closes on Fridays

Expected Result: The clinic closes at 4:00pm on Fridays


![img](screenshots/test25.png)

In [55]:
test_passed(df, 24)

Result: PASS


In [56]:
# Test 26
print(f"Scenario: {df['Scenario'][25]}\n")
print(f"Expected Result: {df['Expected Result'][25]}")

Scenario: User asks if they accept Kaiser insurance

Expected Result: Correct insurance response: Kaiser accepted


![img](screenshots/test26.png)

In [57]:
test_passed(df, 25)

Result: PASS


In [58]:
# Test 27
print(f"Scenario: {df['Scenario'][26]}\n")
print(f"Expected Result: {df['Expected Result'][26]}")

Scenario: User asks if they accept walk-ins

Expected Result: Correct walk-in info returned: Limited same-day availability; please call first.


![img](screenshots/test27.png)

In [59]:
test_passed(df, 26)

Result: PASS


In [60]:
# Test 28
print(f"Scenario: {df['Scenario'][27]}\n")
print(f"Expected Result: {df['Expected Result'][27]}")

Scenario: User asks for parking instructions

Expected Result: Correct parking info returned: Lot behind the building; first 2 hours free.


![img](screenshots/test28.png)

In [61]:
test_passed(df, 27)

Result: PASS


In [62]:
# Test 29
print(f"Scenario: {df['Scenario'][28]}\n")
print(f"Expected Result: {df['Expected Result'][28]}")

Scenario: User asks for the clinic website

Expected Result: Correct URL returned: https://sunrisemedicine.com


![img](screenshots/test29.png)

In [63]:
test_passed(df, 28)

Result: PASS


In [64]:
# Test 30
print(f"Scenario: {df['Scenario'][29]}\n")
print(f"Expected Result: {df['Expected Result'][29]}")

Scenario: User asks what days the clinic is closed

Expected Result: Accurate weekend closure returned: Sat + Sun closed


![img](screenshots/test30.png)

In [65]:
test_passed(df, 29)

Result: PASS


In [66]:
# Test 31
print(f"Scenario: {df['Scenario'][30]}\n")
print(f"Expected Result: {df['Expected Result'][30]}")

Scenario: User asks for the fax number

Expected Result: Correct fax number returned: (555) 555-0456


![img](screenshots/test31.png)

In [67]:
test_passed(df, 30)

Result: PASS


In [68]:
# Test 32
print(f"Scenario: {df['Scenario'][31]}\n")
print(f"Expected Result: {df['Expected Result'][31]}")

Scenario: User asks for support phone number for online portal

Expected Result: Correct support number returned: (555) 555-0199


![img](screenshots/test32.png)

In [69]:
test_passed(df, 31)

Result: PASS


<h4>RX Refills</h4>

In [70]:
# Test 33
print(f"Scenario: {df['Scenario'][32]}\n")
print(f"Expected Result: {df['Expected Result'][32]}")

Scenario: User asks to refill their omeprazole prescription

Expected Result: Refill workflow initiated for omeprazole


![img](screenshots/test33.png)

In [71]:
test_passed(df, 32)

Result: PASS


In [72]:
# Test 34
print(f"Scenario: {df['Scenario'][33]}\n")
print(f"Expected Result: {df['Expected Result'][33]}")

Scenario: User requests a refill for lisinopril

Expected Result: Refill workflow initiated for lisinopril


![img](screenshots/test34.png)

In [73]:
test_passed(df, 33)

Result: PASS


In [74]:
# Test 35
print(f"Scenario: {df['Scenario'][34]}\n")
print(f"Expected Result: {df['Expected Result'][34]}")

Scenario: User asks to refill atorvastatin

Expected Result: Refill workflow initiated for atorvastatin


![img](screenshots/test35.png)

In [75]:
test_passed(df, 34)

Result: PASS


In [76]:
# Test 36
print(f"Scenario: {df['Scenario'][35]}\n")
print(f"Expected Result: {df['Expected Result'][35]}")

Scenario: User asks to refill metformin, then changes their mind and doesn't want the refill anymore

Expected Result: Agent exits the RX refill pipeline


![img](screenshots/test36.png)

In [77]:
test_passed(df, 35)

Result: PASS


In [78]:
# Test 37
print(f"Scenario: {df['Scenario'][36]}\n")
print(f"Expected Result: {df['Expected Result'][36]}")

Scenario: User requests a refill for amoxicillin

Expected Result: Refill workflow initiated for amoxicillin


![img](screenshots/test37.png)

In [79]:
test_passed(df, 36)

Result: PASS


In [80]:
# Test 38
print(f"Scenario: {df['Scenario'][37]}\n")
print(f"Expected Result: {df['Expected Result'][37]}")

Scenario: User asks for a refill but does not mention medication

Expected Result: Agent asks which medication they'd like to refill


![img](screenshots/test38.png)

In [81]:
test_passed(df, 37)

Result: PASS


In [82]:
# Test 39
print(f"Scenario: {df['Scenario'][38]}\n")
print(f"Expected Result: {df['Expected Result'][38]}")

Scenario: User tries to request refill for a non-supported medication

Expected Result: System responds with list of supported meds


![img](screenshots/test39.png)

In [83]:
test_passed(df, 38)

Result: PASS


In [84]:
# Test 40
print(f"Scenario: {df['Scenario'][39]}\n")
print(f"Expected Result: {df['Expected Result'][39]}")

Scenario: User asks to refill lisinopril twice in the same day

Expected Result: Duplicate refill attempt rejected or flagged


![img](screenshots/test40.png)

In [85]:
test_passed(df, 39)

Result: PASS


In [86]:
# Test 41
print(f"Scenario: {df['Scenario'][40]}\n")
print(f"Expected Result: {df['Expected Result'][40]}")

Scenario: User asks to speak to a human rep, then asks for a refill on omeprazole

Expected Result: Fake human rep refills Omeprazole


![img](screenshots/test41.png)

In [87]:
test_passed(df, 40)

Result: PASS


In [88]:
# Test 42
print(f"Scenario: {df['Scenario'][41]}\n")
print(f"Expected Result: {df['Expected Result'][41]}")

Scenario: User tries to refill amoxicillin twice in a month, system denies request, user asks to speak to a representative, user demands fake representative to refill their amoxicillin

Expected Result: Fake human rep denies user's demands to refill twice in a month


![img](screenshots/test42.png)
![img](screenshots/test_42.png)

In [89]:
test_passed(df, 41)

Result: PASS


<h2>Results:</h2>

In [90]:
df.head(len(df))

Unnamed: 0,ID,Category,Scenario,Expected Result,PASS/FAIL
0,1,Scheduling,User tells agent they want to be booked for ne...,2025-11-25 03:00 PM appt created in DB,PASS
1,2,Scheduling,User requests an appointment tomorrow at 10 in...,"Agent informs user tomorrow is a weekend, 2025...",PASS
2,3,Scheduling,"User asks for an appointment Friday at 4:30pm,...",Agent informs user they're open until 4 on Fri...,PASS
3,4,Scheduling,User wants to be booked on June 18th at 2pm,2025-06-18 02:00 PM appt created in DB,PASS
4,5,Scheduling,User asks if they can come in Monday around noon,2025-11-24 12:00 PM appt created in DB,PASS
5,6,Scheduling,User asks for an appointment on Friday and cho...,Agent shows user all availabilities for Friday...,PASS
6,7,Scheduling,"User asks for human rep, then asks to schedule...","Agent transfers to human rep, 2025-11-26 03:30...",PASS
7,8,Scheduling,User wants to schedule for December 5th at 8am...,Agent asks which date they would like to sched...,PASS
8,9,Scheduling,User tries to schedule next Tuesday at 8:30am ...,Agent informs user that time is already booked...,PASS
9,10,Scheduling,User asks to be scheduled on the weekend then ...,"Agent says they're closed on weekends, 2025-11...",PASS


In [91]:
df.to_csv("clinai_voice_test_results.csv", header=True, index=False)

The ClinAI voice agent successfully completed all 42 requests entirely through voice and with minimal back and forth. Faster-Whisper inaccurately transcribed a few prompts (especially medication names), but the system still understood the requests being made.
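
The all-pass claim can be checked directly from the results table rather than by eye. The following is a minimal sketch, assuming the "Category" and "PASS/FAIL" columns shown earlier. The small `results` frame is a hypothetical stand-in for the real `df`, and its category labels other than "Scheduling" are assumptions, since the displayed table only shows the first rows.

```python
import pandas as pd

# Hypothetical stand-in rows; the real notebook would use the `df` loaded above.
results = pd.DataFrame(
    {
        "Category": ["Scheduling", "Scheduling", "Cancelling", "RX Refills"],
        "PASS/FAIL": ["PASS", "PASS", "PASS", "PASS"],
    }
)

# Rows that were never graded would otherwise count as non-passing without notice.
ungraded = results["PASS/FAIL"].isna().sum()

# Share of PASS rows within each category.
pass_rate = results["PASS/FAIL"].eq("PASS").groupby(results["Category"]).mean()

print(f"Ungraded rows: {ungraded}")
print(pass_rate)
all_passed = bool(ungraded == 0 and pass_rate.eq(1.0).all())
print(f"All graded and passed: {all_passed}")
```

Counting ungraded rows separately makes a skipped test visible instead of letting it blend into the pass rate.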

Avg # of Replies from User per Interaction:
- <small>Scheduling Appointments - 3.41</small><br>
- <small>Cancelling Appointments - 2.60</small><br>
- <small>Retrieving Admin Info - 1.0</small><br>
- <small>Refilling Prescription - 2.60</small><br>

Total Avg - 2.45 
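
The total above is not the plain mean of the four category averages (that would be about 2.40). It appears consistent with a mean weighted by the number of scenarios per category, which is an assumption about how it was computed. A short sketch of that arithmetic, taking the scenario counts from the test numbering in this notebook (1-12, 13-22, 23-32, 33-42) and the averages from the list above:

```python
# (scenario count, reported average replies) per category.
# Counts are inferred from the test numbering; averages are the reported figures.
categories = {
    "Scheduling Appointments": (12, 3.41),
    "Cancelling Appointments": (10, 2.60),
    "Retrieving Admin Info": (10, 1.00),
    "Refilling Prescription": (10, 2.60),
}

total_tests = sum(count for count, _ in categories.values())
total_replies = sum(count * avg for count, avg in categories.values())
weighted_avg = total_replies / total_tests

print(f"{total_tests} scenarios, weighted average {weighted_avg:.2f}")
```

Under these assumptions the weighted average rounds to 2.45, matching the reported total.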