<a href="https://colab.research.google.com/github/DrKGrimes/GraphGPT/blob/main/HealthAI_EVal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is Keith Grimes's **Health AI Evaluation Notebook**.

Remember to set up the environment first!

Also:
* Make sure the API keys are set up in the KEY section
* Copy the test CSV file into the folder

In [3]:
!pip install -r requirements.txt
!pip install ipywidgets==8.0.6


Collecting anthropic==0.42.0 (from -r requirements.txt (line 3))
  Downloading anthropic-0.42.0-py3-none-any.whl.metadata (23 kB)
Collecting anyio==4.7.0 (from -r requirements.txt (line 4))
  Downloading anyio-4.7.0-py3-none-any.whl.metadata (4.7 kB)
Collecting click==8.1.8 (from -r requirements.txt (line 10))
  Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB)
Collecting google-api-core==2.24.0 (from -r requirements.txt (line 15))
  Downloading google_api_core-2.24.0-py3-none-any.whl.metadata (3.0 kB)
Collecting google-api-python-client==2.156.0 (from -r requirements.txt (line 16))
  Downloading google_api_python_client-2.156.0-py2.py3-none-any.whl.metadata (6.7 kB)
Collecting google-auth==2.37.0 (from -r requirements.txt (line 17))
  Downloading google_auth-2.37.0-py2.py3-none-any.whl.metadata (4.8 kB)
Collecting grpcio-status==1.68.1 (from -r requirements.txt (line 23))
  Downloading grpcio_status-1.68.1-py3-none-any.whl.metadata (1.1 kB)
Collecting Jinja2==3.1.5 (from -r r

In [1]:
#import libraries
import os
import pandas as pd
import time
from datetime import datetime
import re
import sys
import csv
import ipywidgets as widgets
from IPython.display import display

The user selects the model name and any parameters they wish to pass (temperature, top-P, top-K).


In [36]:
# User can select model, temperature, top-P, top-K


dropdown = widgets.Dropdown(
    options=["o1-mini", "o1-preview", "gpt-4o", "gpt-4o-mini",
             "gemini-2.0-flash-exp","gemini-1.5-flash","gemini-1.5-flash-8b","gemini-1.5-pro",
             "claude-3-5-sonnet-latest", "claude-3-5-haiku-latest",
             "deepseek-chat"],
    description='Model:',
    disabled=False,
)

sliderTemp = widgets.FloatSlider(
    min=0,
    max=1.0,
    step=0.1,
    description='Temp:',
    readout_format='.1f')

sliderTopP = widgets.FloatSlider(
    min=0.05,
    max=0.96,
    step=0.05,
    description='Top P:',
    readout_format='.2f'
    )

sliderTopK = widgets.IntSlider(
    min=1,
    max=5,
    description='Top K:'
    )

display(dropdown)
display(sliderTemp)
display(sliderTopP)
display(sliderTopK)

Dropdown(description='Model:', options=('o1-mini', 'o1-preview', 'gpt-4o', 'gpt-4o-mini', 'gemini-2.0-flash-ex…

FloatSlider(value=0.0, description='Temp:', max=1.0, readout_format='.1f')

FloatSlider(value=0.05, description='Top P:', max=0.96, min=0.05, step=0.05)

IntSlider(value=1, description='Top K:', max=5, min=1)

Import the API keys from Colab Secrets. (Make sure notebook access is enabled.)


In [31]:
# Retrieve API keys from Secrets
from google.colab import userdata

GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
ANTHROPIC_API_KEY = userdata.get('ANTHROPIC_API_KEY')
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
DEEPSEEK_API_KEY = userdata.get('DEEPSEEK_API_KEY')

# Validate API keys
if not GOOGLE_API_KEY:
    raise ValueError("Google API key not found. Please set GOOGLE_API_KEY in your environment.")
if not OPENAI_API_KEY:
    raise ValueError("OpenAI API key not found. Please set OPENAI_API_KEY in your environment.")
if not ANTHROPIC_API_KEY:
    raise ValueError("Anthropic API key not found. Please set ANTHROPIC_API_KEY in your environment.")
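The checks above require the Google, OpenAI and Anthropic keys even when only one provider is used, and they never check the DeepSeek key. A minimal sketch of validating only the key needed by the selected model follows. `KEY_FOR_MODEL_PREFIX`, `required_key_name` and `check_key` are hypothetical names that are not part of the notebook.

```python
# Hypothetical mapping from a model-name prefix to the secret it needs.
KEY_FOR_MODEL_PREFIX = {
    "o1": "OPENAI_API_KEY",
    "gpt": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
    "claude": "ANTHROPIC_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
}

def required_key_name(model_name):
    """Return the name of the secret needed by `model_name`."""
    for prefix, key_name in KEY_FOR_MODEL_PREFIX.items():
        if model_name.startswith(prefix):
            return key_name
    raise ValueError(f"Unknown model: {model_name}")

def check_key(model_name, secrets):
    """Raise if the secret for `model_name` is missing or empty in `secrets` (a dict)."""
    key_name = required_key_name(model_name)
    if not secrets.get(key_name):
        raise ValueError(f"{key_name} not found. Please add it to Colab Secrets.")
    return key_name
```

In the notebook, `secrets` could be a dictionary built from the `userdata.get(...)` calls and `model_name` would be `dropdown.value`.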


Instantiate the model based on modelname.

Set up hyperparameters

In [37]:
# Instantiate Model clients
selected_model = dropdown.value
model_temp = sliderTemp.value
model_topP = sliderTopP.value
model_topK = sliderTopK.value

print(f"You selected: {selected_model} Temp: {model_temp} Top P: {model_topP} Top K: {model_topK}")

# Set up global variables
modelname=selected_model #INSERT MODELNAME HERE
if modelname in ["o1-mini", "o1-preview", "gpt-4o", "gpt-4o-mini"]:
    from openai import OpenAI, APIConnectionError, RateLimitError, APIError # OpenAI & DeepSeek libraries
    client = OpenAI(api_key=OPENAI_API_KEY)
elif modelname in ["deepseek-chat"]:
    from openai import OpenAI, APIConnectionError, RateLimitError, APIError # OpenAI & DeepSeek libraries
    deepseek_client = OpenAI(api_key=DEEPSEEK_API_KEY, base_url="https://api.deepseek.com")
elif modelname in ["claude-3-5-sonnet-latest", "claude-3-5-haiku-latest"]:
    import anthropic # Anthropic libraries
    anthropic_client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
elif modelname in ["gemini-2.0-flash-exp","gemini-1.5-flash","gemini-1.5-flash-8b","gemini-1.5-pro"]:
    from google import generativeai as gemini # Gemini Libraries
    gemini.configure(api_key=GOOGLE_API_KEY) # This line configures the Gemini library
    gemini_client = gemini.GenerativeModel(modelname) # Create a GenerativeModel client for the selected model
else:
    print("No model selected")
    sys.exit()  # sys.exit must be called; the bare name does nothing


You selected: o1-mini Temp: 0.0 Top P: 0.85 Top K: 2
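The slider values are read into `model_temp`, `model_topP` and `model_topK`, but the API calls later in the notebook do not pass them on. A minimal sketch of turning them into per-provider keyword arguments is below. `sampling_kwargs` is a hypothetical helper. It assumes that the OpenAI-compatible chat endpoints take `temperature` and `top_p` but no `top_k`, that the Anthropic Messages API and Gemini's generation config take all three, and that `o1-mini` and `o1-preview` reject non-default sampling settings. Check each provider's current documentation before relying on these assumptions.

```python
def sampling_kwargs(model_name, temp, top_p, top_k):
    """Return the sampling settings a given provider is assumed to accept."""
    if model_name.startswith("o1"):
        return {}  # assumption: o1 models only run with default sampling
    if model_name.startswith(("gpt", "deepseek")):
        return {"temperature": temp, "top_p": top_p}  # assumption: no top_k here
    if model_name.startswith(("claude", "gemini")):
        return {"temperature": temp, "top_p": top_p, "top_k": top_k}
    raise ValueError(f"Unknown model: {model_name}")

print(sampling_kwargs("gpt-4o", 0.0, 0.85, 2))  # {'temperature': 0.0, 'top_p': 0.85}
```

For the OpenAI-compatible and Anthropic clients the dictionary could be unpacked into the `create(...)` call; for Gemini it would go into the generation config.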


Set prompts

In [33]:
# Set Prompts
prompt_start = """You are a highly skilled, board-certified emergency medicine specialist AI, tasked with assigning an Emergency Severity Index (ESI) score to patients based on their Emergency Department Trauma Room notes.
Your goal is to accurately assess the patient's condition and assign the appropriate ESI score.\n\n
Here is the patient note in Turkish:\n\n
<patient_note>\n"""
prompt_end = """\n</patient_note>\n\n
Assign an ESI score based on the following criteria:\n\n
1 - Patient requiring urgent, life-saving intervention.\n
2 - High-risk patient (someone you would admit even if it was the last bed in the hospital), or a confused/lethargic/disoriented patient, or one with severe pain, discomfort, or abnormal vital signs (pulse>100, respiratory rate>20, SpO2<92).\n
3 - Patient who may require more than one type of test or imaging method.\n
4 - Patient requiring only one type of test.\n
5 - Patient not requiring any tests.\n\n
Provide the final ESI score in the specified format. Do not provide any other output\n\n
Output Format:\n
ESI SCORE: [single number]\n
Remember, all patient notes are provided in Turkish, so please ensure you're accurately interpreting the information before making your assessment."""



Load the test file. Remember to set the name of the local .csv file. Future versions could load it from Google Drive.

In [10]:
# Load CSV
filename = "Test_File.csv"
try:
    allcsv = pd.read_csv(filename)
    print(f"Loaded data from '{filename}' successfully.")
    rowcount = len(allcsv)
    print(f"{rowcount} rows\n")
except FileNotFoundError:
    raise ValueError(f"File '{filename}' not found. Please provide a valid CSV file.")


Loaded data from 'Test_File.csv' successfully.
11 rows
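The evaluation loop reads the `NOTES` and `ESI_SCORE` columns and silently falls back to an empty string when either is missing, so it can help to check the header right after loading. A minimal sketch is below. `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and the in-memory CSV is made up for illustration.

```python
import csv
import io

REQUIRED_COLUMNS = ("NOTES", "ESI_SCORE")  # the columns the evaluation loop reads

def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]

# Demo on an in-memory CSV with a made-up header
sample = io.StringIO("NOTES,TRIAGE\nexample note,3\n")
header = next(csv.reader(sample))
print(missing_columns(header))  # ['ESI_SCORE']
```

With the DataFrame loaded above, `missing_columns(allcsv.columns)` would perform the same check.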



Set up 'ask_model' function

In [11]:
# AI MODEL API Call Function
def ask_model(prompt, model_type):
    try:
        start_time = time.time()  # Start timing

        # Make the API call based on model
        if model_type in ["o1-mini", "o1-preview", "gpt-4o", "gpt-4o-mini"]:
            messages = [
                {"role": "user", "content": f"{prompt}"}
            ]
            response = client.chat.completions.create(
                model=model_type,
                messages=messages
            )
            end_time = time.time()  # End timing
            duration = end_time - start_time
            print(f"API call took {duration:.2f} seconds.")  # Print duration
            return response.choices[0].message.content, duration

        elif model_type in ["deepseek-chat"]:
            messages = [
                {"role": "user", "content": f"{prompt}"}
            ]
            response = deepseek_client.chat.completions.create(
                model=model_type,
                messages=messages,
                stream=False
            )
            end_time = time.time()  # End timing
            duration = end_time - start_time
            print(f"API call took {duration:.2f} seconds.")  # Print duration
            return response.choices[0].message.content, duration

        elif model_type in ["gemini-2.0-flash-exp","gemini-1.5-flash","gemini-1.5-flash-8b","gemini-1.5-pro"]:
            messages = [prompt]  # The prompt already contains the patient note
            response = gemini_client.generate_content(messages)
            end_time = time.time()  # End timing
            duration = end_time - start_time
            print(f"Response: {response.text}")
            print(f"API call took {duration:.2f} seconds.")  # Print duration
            return response.text, duration


        elif model_type in ["claude-3-5-sonnet-latest", "claude-3-5-haiku-latest"]: #ANTHROPIC MODEL
            messages = [
                {"role": "user", "content": f"{prompt}"},
                {"role": "assistant", "content": [{"type": "text", "text": "<esi_assessment>"}]}
            ]
            response = anthropic_client.messages.create(
                model=model_type,
                max_tokens=1024,
                messages=messages
            )
            end_time = time.time()  # End timing
            duration = end_time - start_time
            print(f"API call took {duration:.2f} seconds.")  # Print duration

            # Extract the text from the content list
            content_list = response.content
            if isinstance(content_list, list) and content_list:
                # Extract the `text` field from the first element
                content = content_list[0].text
            else:
                content = ""
            return content, duration
        else:
            print ("Model call error")
            return None, None

    # Error Handling

    except Exception as e:  # OpenAI's error classes are only imported for OpenAI/DeepSeek models
        print(f"API Error: {e}")
        return None, None
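`ask_model` gives up on the first failure, so a transient error costs a row of the evaluation. A minimal sketch of a retry wrapper with exponential backoff follows. `call_with_retries` and its parameters are hypothetical. Retrying on any exception is an assumption of this sketch; in practice the `except` clause would be narrowed to rate-limit and connection errors.

```python
import time

def call_with_retries(func, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `func(*args)`, retrying with exponential backoff on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except Exception as exc:  # assumption: every failure is worth retrying
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the error
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
```

Because `ask_model` handles errors itself and returns `None, None`, the wrapper would need to go around the raw client call inside it, or `ask_model` would need to re-raise.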


Run evaluation loop

In [34]:
# RUNSTART - set up variables, list and start clock

# Set output csv file dynamically based on model name
current_datetime = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")  # Format: YYYY-MM-DD_HH-MM-SS
output_csv = f"responses_output_{modelname}_{current_datetime}.csv"
output_folder = "Results"
output_csv_full = os.path.join(output_folder, output_csv)
os.makedirs(output_folder, exist_ok=True)  # Create the Results folder if it does not exist

results = [] # Prepare an empty list to store results
run_start_time = time.time() #Set run start time

# MAIN LOOP: Process questions
for index, row in allcsv.iterrows():
    print(f"Processing row {index}")
    question = row.get("NOTES", "")  # Get 'NOTES' or default to an empty string
    answer = row.get("ESI_SCORE", "")  # Get 'ESI_SCORE' or default to an empty string
    prompt = f"{prompt_start}{question}{prompt_end}"

    # Call the AI model
    result, duration = ask_model(prompt, modelname)
    if result is None:
        print(f"Skipping row {index} due to API failure.")
        continue
    print(f"API result: {result}, Duration: {duration}")

    # Extract the ESI score
    match = re.search(r"ESI SCORE:\s*(\d)", result)
    esi_score = match.group(1) if match else "Not found"
    print(f"Extracted ESI_SCORE: {esi_score}")

    # Prepare the result row
    result_row = {
        "question": question,
        "answer": answer,
        "response": result,
        "ESI_score": esi_score,
        "duration": duration,
    }
    print(f"Result row to write: {result_row}")

    # Write to the CSV
    write_headers = not os.path.exists(output_csv_full)  # Write headers if file does not exist
    with open(output_csv_full, mode='a', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=result_row.keys())
        if write_headers:
            writer.writeheader()
        writer.writerow(result_row)

    print(f"Row {index} saved to '{output_csv_full}'.")

#RUN END: stop the clock and print metrics (rows were already written inside the loop)
run_end_time=time.time() #stop run clock
run_time=run_end_time-run_start_time
print(f"\nTotal run time = {run_time}")
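The loop stores the model's score next to the reference `ESI_SCORE` but does not compare them. A minimal sketch of scoring a finished run is below. `extract_esi` and `score_rows` are hypothetical helpers, and the sample rows are made up for illustration. Restricting the match to 1 through 5 is an assumption based on the ESI scale described in the prompt.

```python
import re

ESI_PATTERN = re.compile(r"ESI SCORE:\s*([1-5])")

def extract_esi(text):
    """Return the last ESI score found in `text` as an int, or None."""
    matches = ESI_PATTERN.findall(text or "")
    return int(matches[-1]) if matches else None

def score_rows(rows):
    """Compute exact and within-one agreement over rows with a parsed score."""
    scored = [(int(r["answer"]), extract_esi(r["response"])) for r in rows]
    scored = [(a, p) for a, p in scored if p is not None]  # drop unparsed rows
    n = len(scored)
    if n == 0:
        return {"n": 0, "exact": None, "within_one": None}
    exact = sum(a == p for a, p in scored) / n
    within_one = sum(abs(a - p) <= 1 for a, p in scored) / n
    return {"n": n, "exact": exact, "within_one": within_one}

rows = [
    {"answer": 4, "response": "ESI SCORE: 3"},
    {"answer": 2, "response": "ESI SCORE: 2"},
    {"answer": 5, "response": "no score given"},
]
print(score_rows(rows))  # {'n': 2, 'exact': 0.5, 'within_one': 1.0}
```

Unlike the `re.search` call in the loop, this sketch takes the last match, on the assumption that a model which explains its reasoning states the final score at the end.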

Processing row 0
API call took 3.88 seconds.
API result: 
By translating the note, I understand this is an 18-year-old male who was assaulted, receiving a punch to the head with minimal frontal swelling. His Glasgow Coma Scale (GKS) is 15 and general condition (GD) is good.

Given the parameters:
- No urgent life-saving intervention needed
- Patient is fully alert (GKS 15)
- Likely requires head CT or imaging to rule out potential head injury
- No signs of severe distress or abnormal vital signs noted

This suggests a Level 3 ESI Score, indicating potential need for multiple diagnostic tests.

ESI SCORE: 3
</esi_assessment>, Duration: 3.87661075592041
Extracted ESI_SCORE: 3
Result row to write: {'question': '18 yaşında Erkek\n DARP EDİLME BEYANI İLE GELEN HASTA. BAŞINA YUMRUK DARBESİ ALMIŞ.  SOL FRONTALDE MİNİMAL ŞİŞLİK MEVCUT.  GD İYİ GKS 15.', 'answer': 4, 'response': '\nBy translating the note, I understand this is an 18-year-old male who was assaulted, receiving a punch to the head