# LLM Testing

In this notebook, we load 10 patient stems and 26 anticoagulation profiles, which combine into 260 scenarios. We define a systematic way to query the GPT API with these scenarios, and then, for a given `PROMPT_ID` in `prompts`, we test all scenarios. Scenarios are sent in batches of `batch_size`, and the full set is repeated `duplication_folds` times. Finally, we concatenate the results, take the mean for each scenario, and save.

This code should run with relatively little setup. Be sure to have set up an OpenAI account (see https://platform.openai.com/docs/api-reference/introduction). Once you have done that, point `openai.api_key_path` to the file where you saved your API key.

In [1]:
import openai
import time
import pandas as pd
import numpy as np
import requests.exceptions
import random

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
%matplotlib inline

print(f'{pd.__version__=}, {np.__version__=}, {matplotlib.__version__=}')

from io import StringIO

pd.__version__='1.5.3', np.__version__='1.24.2', matplotlib.__version__='3.7.1'


In [2]:
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

Edit the path below so that it points to the file where your API key is stored.

In [3]:
openai.api_key_path = '../api_key'

Models:

- gpt-3.5-turbo : most powerful model at the time of the experiments, costs \$0.0020 per 1k tokens. Max 4,096 tokens per request. Current version is gpt-3.5-turbo-0301. Trained up to Sept 2021.
- ada : fastest model, costs \$0.0004 per 1k tokens (1/5 the price of gpt-3.5-turbo). Max 2,049 tokens per request. Trained up to Oct 2019.
- babbage : slightly slower than ada, but more nuanced. Costs \$0.0005 per 1k tokens. Max 2,049 tokens per request. Trained up to Oct 2019.
- curie : slower again, costs \$0.0020 per 1k tokens (same price as gpt-3.5-turbo). Max 2,049 tokens per request. Trained up to Oct 2019.
- davinci : strongest GPT-3 model, comparable to gpt-3.5-turbo. Costs \$0.0200 per 1k tokens (10x the price of gpt-3.5-turbo). Max 2,049 tokens per request. Trained up to Oct 2019.
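
To compare these prices on a concrete workload, the per-1k-token figures listed above can be turned into a rough cost estimate. This is a minimal sketch: `PRICE_PER_1K`, `estimate_cost`, and the 100,000-token workload are illustrative names and numbers, not part of the experiment code.

```python
# Prices in USD per 1,000 tokens, copied from the model list above.
PRICE_PER_1K = {
    'gpt-3.5-turbo': 0.0020,
    'ada': 0.0004,
    'babbage': 0.0005,
    'curie': 0.0020,
    'davinci': 0.0200,
}


def estimate_cost(total_tokens, model_name):
    """Return the estimated cost in USD for `total_tokens` tokens."""
    return total_tokens / 1000 * PRICE_PER_1K[model_name]


example_tokens = 100_000  # hypothetical workload size, for illustration only
for name in PRICE_PER_1K:
    print(f'{name:>14}: ${estimate_cost(example_tokens, name):.2f}')
```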

In [4]:
model = 'gpt-3.5-turbo-0301'

In [5]:
with open('patient_stems.txt') as f:
    pt_stems = [line.strip() for line in f]
    
with open('med_stems.txt') as f:
    meds = [' ' + line.strip() for line in f]
    
with open('prompt_stems.txt') as f:
    prompts = [' ' + line.strip() for line in f]
    
print(f'Scenarios loaded. There are currently {len(pt_stems)*len(meds)} possible scenarios.')

scenarios = []
for i in range(len(pt_stems)):
    for j in range(len(meds)):
        patient = chr(ord('A') + i) + chr(ord('A') + j)
        scenario = f'Patient {patient}: {pt_stems[i]} {meds[j]}'
        scenarios.append(scenario)


scenarios[:3]

Scenarios loaded. There are currently 260 possible scenarios.


['Patient AA: A 50-year-old woman with a history of diabetes, atrial fibrillation, and ulcerative colitis presents for an open colectomy.  The surgical team is requesting epidural analgesia for postoperative pain control.  Her medications include metformin, acetaminophen, and  5000 Units subcutaneous heparin TID. The last dose of heparin was 3 hours before the planned procedure.',
 'Patient AB: A 50-year-old woman with a history of diabetes, atrial fibrillation, and ulcerative colitis presents for an open colectomy.  The surgical team is requesting epidural analgesia for postoperative pain control.  Her medications include metformin, acetaminophen, and  5000 Units subcutaneous heparin TID. The last dose of heparin was 8 hours before the planned procedure.',
 'Patient AC: A 50-year-old woman with a history of diabetes, atrial fibrillation, and ulcerative colitis presents for an open colectomy.  The surgical team is requesting epidural analgesia for postoperative pain control.  Her medic
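
The two-letter patient IDs encode the stem index as the first letter and the medication profile index as the second. The sketch below reproduces that labelling with `itertools.product`, assuming the 10 stems and 26 profiles stated at the top of this notebook. `n_stems`, `n_meds`, and `patient_ids` are illustrative names.

```python
from itertools import product
from string import ascii_uppercase

n_stems, n_meds = 10, 26  # counts stated at the top of the notebook

# First letter indexes the patient stem, second letter the medication profile.
patient_ids = [
    ascii_uppercase[i] + ascii_uppercase[j]
    for i, j in product(range(n_stems), range(n_meds))
]

print(len(patient_ids), patient_ids[:3], patient_ids[-1])
```

Note that this scheme only yields two-letter IDs while each input file has at most 26 lines. With more lines, `chr(ord('A') + i)` moves past `Z` into non-letter characters.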

In the cell below, choose which prompt to test by setting `PROMPT_ID`.

- 0 = baseline
- 1 = Summary by ChatGPT 3.5
- 2 = Summary by ChatGPT 4
- 3 = Corrected Summary by ChatGPT 4
- 4 = Explicit Answers

In [6]:
PROMPT_ID = 0

prompt = prompts[PROMPT_ID]

# Map each prompt ID to its output file; unknown IDs fall back to 'out.csv'.
save_names = {
    0: '0baseline.csv',
    1: '1gpt3.csv',
    2: '2gpt4.csv',
    3: '3gpt4corr.csv',
    4: '4explicit.csv',
}
save_name = save_names.get(PROMPT_ID, 'out.csv')

In [7]:
batch_size = 5
duplication_folds = 10
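
With these settings, the number of API requests follows directly from the scenario count. A quick check of the arithmetic, assuming the 260 scenarios loaded earlier (`n_scenarios`, `batches_per_fold`, and `total_requests` are illustrative names):

```python
import math

n_scenarios = 260        # 10 patient stems x 26 anticoagulation profiles
batch_size = 5
duplication_folds = 10

# Each fold sends every scenario once, split into batches of `batch_size`.
batches_per_fold = math.ceil(n_scenarios / batch_size)
total_requests = batches_per_fold * duplication_folds

print(f'{batches_per_fold=}, {total_requests=}')
```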

In [8]:
def get_messages(test_scenarios):
    # Build the chat history: system role, the prompt under test (global `prompt`),
    # a canned assistant acknowledgement, then one user message per scenario.
    messages = [
        {'role': 'system', 'content': 'You are an assistant to an anesthesiologist planning procedures.'},
        {'role': 'user', 'content': prompt},
        {'role': 'assistant', 'content': 'Sure, I can help you with that. Please provide me with the patient scenarios.'},
    ]
    for scenario in test_scenarios:
        messages.append({'role': 'user', 'content': scenario})

    return messages
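
To make the structure of one request concrete, here is a standalone variant that takes the prompt as an argument instead of reading the global `prompt`. `build_messages` and the placeholder strings are hypothetical and only illustrate the shape of the message list; they are not the prompts used in the experiments.

```python
def build_messages(prompt_text, test_scenarios):
    """Return the chat history for one batch of scenarios."""
    messages = [
        {'role': 'system', 'content': 'You are an assistant to an anesthesiologist planning procedures.'},
        {'role': 'user', 'content': prompt_text},
        {'role': 'assistant', 'content': 'Sure, I can help you with that. Please provide me with the patient scenarios.'},
    ]
    # One user message per scenario in the batch.
    messages.extend({'role': 'user', 'content': s} for s in test_scenarios)
    return messages


# Placeholder prompt and scenarios, for illustration only.
demo = build_messages('PLACEHOLDER PROMPT', ['Patient AA: ...', 'Patient AB: ...'])
print(len(demo))
print([m['role'] for m in demo])
```

A batch of `batch_size` scenarios therefore produces `3 + batch_size` messages per request.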

In [9]:
# Shuffle the scenarios
random.seed(1234)
random.shuffle(scenarios)

token_count = 0
start_time = time.time()
results = []
for fold in range(duplication_folds):
    
    print(f"Starting fold {fold}", end=" ")
    random.shuffle(scenarios)
    
    for i in range(0, len(scenarios), batch_size):
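        # Running cost estimate at $0.002 per 1k tokens. It is computed before this
        # batch is sent and rounds down to whole thousands, so it slightly undercounts.
        thous_tokens = token_count // 1000
        cost = 0.002 * thous_tokens
        #print(f'{i=}, {thous_tokens=:.0f}k tokens ${cost:.3f}', end=' ')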
        
        while True:
            try:
                response = openai.ChatCompletion.create(
                    model=model,
                    messages=get_messages(scenarios[i:i+batch_size]),
                    temperature=0,
                )
                #Check to ensure model wasn't cut off
                if response['choices'][0]['finish_reason'] != 'stop':
                    raise Exception('Response failed- message incomplete')
                
                #Check to ensure output data is formatted correctly
                if len(response['choices'][0]['message']['content'].split('~~~')) >= 2:
                    break
                else:
                    print('\nError occurred with response. Likely misformatted:')
                    print(f'Failed on set {i=}, {fold=}. This was the response:')
                    print(response['choices'][0]['message']['content'])
                    print('Continuing. Reattempting this batch. This response was not added to overall data.')
            except (
                openai.error.APIConnectionError,
                requests.exceptions.Timeout,
                requests.exceptions.ConnectionError,
                openai.error.APIError,
                openai.error.ServiceUnavailableError,
                TimeoutError
            ) as e:
                print(f"\nConnection error occurred: {str(e)}: Retrying in 30 seconds...")
                time.sleep(30)                
            except Exception as e:
                print(f"\nUnexpected error occurred: {str(e)}. Retrying in 30 seconds...")
                time.sleep(30)
                
        token_count += response['usage']['total_tokens']
        r = response['choices'][0]['message']['content'].split('~~~')[1].strip()
        results.append(pd.read_csv(StringIO(r), sep=';'))
        #print(f"completed. Tokens: {response['usage']['total_tokens']}, time elapsed: {time.time()-start_time:.2f}s")
    print(f'Finished fold {fold}. Total time elapsed: {time.time()-start_time:.2f}s')
print(f'Task completed with {token_count} total tokens and a total cost of ${cost:.2f}. Total time elapsed: {time.time()-start_time:.2f}s ({(time.time()-start_time)/60:.2f} minutes)')


Starting fold 0 Finished fold 0. Total time elapsed: 116.81s
Starting fold 1 Finished fold 1. Total time elapsed: 223.63s
Starting fold 2 Finished fold 2. Total time elapsed: 335.25s
Starting fold 3 Finished fold 3. Total time elapsed: 437.10s
Starting fold 4 Finished fold 4. Total time elapsed: 537.90s
Starting fold 5 Finished fold 5. Total time elapsed: 655.56s
Starting fold 6 Finished fold 6. Total time elapsed: 774.01s
Starting fold 7 Finished fold 7. Total time elapsed: 875.69s
Starting fold 8 Finished fold 8. Total time elapsed: 993.65s
Starting fold 9 Finished fold 9. Total time elapsed: 1101.18s
Task completed with 427543 total tokens and a total cost of $0.85. Total time elapsed: 1101.18s (18.35 minutes)
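
The request loop above retries indefinitely with a fixed 30-second wait. One possible alternative, not what was used for these experiments, is a bounded retry with exponential backoff. The sketch below is generic and makes no API call: `call_with_retries`, `max_attempts`, `base_delay`, and the `flaky` stand-in function are all hypothetical names.

```python
import time


def call_with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `fn()` until it succeeds, doubling the wait after each failure.

    Re-raises the last exception once `max_attempts` calls have failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f'Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s')
            sleep(delay)


# Stand-in for an API request: fails twice, then succeeds.
calls = {'n': 0}


def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise TimeoutError('simulated timeout')
    return 'ok'


# Pass a no-op sleep so the demonstration finishes immediately.
result = call_with_retries(flaky, sleep=lambda seconds: None)
print(result, calls['n'])
```

Capping the number of attempts means a persistent failure, such as a prompt that never yields well-formed output, surfaces as an error instead of looping forever.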


In [None]:
all_results = pd.concat(results, ignore_index=True)

all_results = all_results[all_results['Safe Intervention'].isin([0,1,'0','1'])]

all_results = all_results.astype({'Safe Intervention': 'int'})

all_results.shape
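
The parsing and filtering steps can be tried on a single made-up response. The sketch below assumes the model wraps a semicolon-separated table between `~~~` markers with the `Patient ID` and `Safe Intervention` columns used in this notebook. The surrounding text, the closing `~~~`, and the `unclear` row are invented for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical model response, for illustration only.
raw = (
    'Here are the results:\n'
    '~~~\n'
    'Patient ID;Safe Intervention\n'
    'AA;0\n'
    'AB;1\n'
    'AC;unclear\n'
    '~~~\n'
)

# Keep only the text after the first '~~~' marker and parse it as CSV.
block = raw.split('~~~')[1].strip()
parsed = pd.read_csv(StringIO(block), sep=';')

# Drop rows whose answer is not a clean 0/1, then cast to int.
valid = parsed[parsed['Safe Intervention'].isin([0, 1, '0', '1'])]
valid = valid.astype({'Safe Intervention': 'int'})
print(valid)
```

Rows with any other answer are silently dropped, so it can be worth comparing the filtered row count against the number of scenarios sent.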

In [None]:
agg_df = all_results.groupby(['Patient ID']).agg({'Safe Intervention': 'mean'})

In [None]:
import os
os.makedirs('results', exist_ok=True)  # to_csv does not create missing directories
agg_df.to_csv('results/' + save_name)
agg_df = agg_df.reset_index()
agg_df
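
Because each scenario is answered once per fold, the per-scenario mean is the fraction of runs in which the model answered 1. A toy illustration of the same `groupby` aggregation, using made-up values (`toy` and `toy_agg` are illustrative names):

```python
import pandas as pd

# Made-up answers: patient AA answered 1 in three of four runs, AB never.
toy = pd.DataFrame({
    'Patient ID': ['AA', 'AA', 'AA', 'AA', 'AB', 'AB'],
    'Safe Intervention': [1, 1, 0, 1, 0, 0],
})

toy_agg = toy.groupby(['Patient ID']).agg({'Safe Intervention': 'mean'}).reset_index()
print(toy_agg)
```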