In [2]:
import pandas as pd

## Manual Assessment of Correctness
This is a simple Python workflow for assessing the correctness of therapeutic outcomes extracted by an LLM, where the *outlines* package is used to enforce a strict schema on the LLM responses. For each trial, the dataset is loaded and the field expert is shown the identified outcome/value alongside the original text. The field expert judges whether the outcome/value is factually correct against the original text and records the block of text that supports the answer. The expert should also note any trends or oddities in the responses.

### Trial 1: Mistral-7b-Instruct-v0.3, JSON Schema
Load data

In [5]:
# file_name = ['20240607_llm-mistral.xlsx']
# df = pd.read_excel(file_name[0]).drop(labels='Unnamed: 0',axis=1)

df = pd.read_excel('20240607_mistralTrialEval-inProgress.xlsx').drop(labels='Unnamed: 0',axis=1)

df[0:5]

Unnamed: 0,outcome,value,study_group,brand_name,application_number,clinical_studies,assessment,notes
0,Hamilton Depression Rating Scale (total score),significantly superior to placebo,venlafaxine,venlafaxine,ANDA090555,CLINICAL TRIALS The efficacy of venlafaxine hy...,Yes,"Correct but not specific: In these 5 studies, ..."
1,Pain reduction,greater than placebo,Duloxetine delayed-release capsules in Major D...,Duloxetine Delayed-Release,ANDA203088,14 CLINICAL STUDIES 14.1 Overview of the Clini...,Yes,"After 13 weeks of treatment, patients taking D..."
2,Success rate,Proportion of subjects with a-0 (clear) or 1 (...,"Clobetasol propionate shampoo, 0.05%",clobetasol propionate,ANDA090974,14 CLINICAL STUDIES The safety and efficacy of...,No,Definition for success rate
3,overall survival,12.1 for pemetrexed for injection plus cisplat...,pemetrexed for injection plus cisplatin is com...,Pemetrexed,ANDA204890,14 CLINICAL STUDIES 14.1 Non-Squamous NSCLC In...,Yes,Efficacy Parameter All Randomized and Treated ...
4,incidence of endometritis,27.6% for placebo,a group given placebo in the trial,Cefoxitin,ANDA065415,"CLINICAL STUDIES A prospective, randomized, do...",Yes,Endometritis occurred in 16/58 (27.6%) patient...


In [14]:
df[15:]

Unnamed: 0,outcome,value,study_group,brand_name,application_number,clinical_studies,assessment,notes
15,percentage of subjects with composite treatmen...,on day 30,XEOMIN,Xeomin,BLA125360,14 CLINICAL STUDIES 14.1 Chronic Sialorrhea Ch...,No,Nonsensical
16,excellent to cleared clinical response,60% for patients on active treatment,patients with active treatment,Fluocinolone acetonide,ANDA202848,CLINICAL STUDIES In a vehicle-controlled study...,Yes,"Correct but not very detailed, quantified: In ..."
17,Responder rate in patients receiving gabapenti...,Proportion of patients with a 50% or greater r...,Gabapentin users versus placebo users,GABAPENTIN,NDA020235,14 CLINICAL STUDIES 14.1 Postherpetic Neuralgi...,No,Definition?
18,clinical cure rate,72%,Cefuroxime Axetil Tablets 500 mg Twice Daily,Cefuroxime axetil,ANDA065496,14 CLINICAL STUDIES 14.1 Acute Bacterial Maxil...,Yes,Table 13. Clinical Effectiveness of Cefuroxime...
19,reduction in joint swelling,< 50%,Naproxen,Naproxen Sodium,ANDA212199,14 CLINICAL STUDIES Naproxen has been studied ...,No,< 50% is not anywhere in the text
20,Mean percent change in inflammatory lesion cou...,43.1%,Minocycline Hydrochloride Extended-Release Tab...,Minocycline Hydrochloride,ANDA202261,14 CLINICAL STUDIES The safety and efficacy of...,Yes,Table 4. Table 4: Efficacy Results at Week 12 ...
21,percentage of patients with complete clearing ...,46%,"Diclofenac Sodium Gel (14 patients, 90 days tr...",DICLOFENAC SODIUM,ANDA208301,14 CLINICAL STUDIES Clinical trials were condu...,No,46% does not appear anywhere in text
22,WOMAC pain score,,Diclofenac sodium N=154,Diclofenac Sodium,ANDA204132,14. CLINICAL STUDIES 14.1 Pivotal Studies in O...,No,Nonsensical
23,Risk ratio,", 0.78","Ramipril (10 mg daily),",Ramipril,ANDA077900,14 CLINICAL STUDIES 14.1 Hypertension Ramipril...,No,"I think this actually might be correct, but it..."
24,improvement in PANSS total score,-15.5,Aripiprazole 15mg/day,Aripiprazole,ANDA207105,14 CLINICAL STUDIES Efficacy of the oral formu...,Yes,Table 26: Schizophrenia Studies Study Number T...


Evaluate correctness

In [None]:
from IPython.display import clear_output
_start = 21
_current = _start

# Row labels run from 0 to len(df) - 1, so stop before len(df) to avoid a KeyError
while _current < len(df):

    clear_output(wait=True)
    print(f'{df["application_number"][_current]} \n')
    print(f'{df["study_group"][_current]}:\n {df["outcome"][_current]} -> {df["value"][_current]}\n')
    print(df["clinical_studies"][_current])

    assessment = input("Correct? (Yes/No): ")
    notes = input("Reason? (description): ")
    df.at[_current, 'assessment'] = assessment
    df.at[_current, 'notes'] = notes

    _current += 1


Save

In [15]:
# df.to_excel('20240607_mistralTrialEval.xlsx')

### Trial 2: Mistral-7B-OpenOrca, Pydantic Class
Load data

In [30]:
file_names = ['20240611_llm-orca-ner.xlsx','20240611_llm-orca-ner-set2.xlsx','20240611_llm-orca-ner-set3.xlsx']

df = pd.DataFrame()

for file in file_names:
    tdf = pd.read_excel(file).drop(labels='Unnamed: 0',axis=1)
    df = pd.concat([df,tdf]).reset_index(drop=True)

df


Unnamed: 0,outcome,value,regiment,brand_name,application_number,clinical_studies
0,Hamilton Depression Rating Scale (total score),significantly superior to placebo,venlafaxine hydrochloride in a range of 75 to ...,venlafaxine,ANDA090555,CLINICAL TRIALS The efficacy of venlafaxine hy...
1,4,2,15523,Duloxetine Delayed-Release,ANDA203088,14 CLINICAL STUDIES 14.1 Overview of the Clini...
2,Success rate,42.1%,"Clobetasol propionate shampoo, 0.05%",clobetasol propionate,ANDA090974,14 CLINICAL STUDIES The safety and efficacy of...
3,No Outcome Found.,",",No Regimen Found.,Pemetrexed,ANDA204890,14 CLINICAL STUDIES 14.1 Non-Squamous NSCLC In...
4,Median overall survival,9.0 months,triplet-therapy group,Cefoxitin,ANDA065415,"CLINICAL STUDIES A prospective, randomized, do..."
...,...,...,...,...,...,...
88,Risk of NSAID-induced mucosal injury,10–30%,misoprostol,Cytotec,NDA019268,Clinical studies In a series of small short-te...
89,ulcer recurrence,4%,omeprazole + clarithromycin,Omeprazole,ANDA091672,14 CLINICAL STUDIES 14.1 Active Duodenal Ulcer...
90,Objective response rate,80%,LUNSUMIO treatment,Lunsumio,BLA761263,14 CLINICAL STUDIES The efficacy of LUNSUMIO w...
91,Median time to alleviation of influenza signs ...,1.5 days,Oseltamivir phosphate treatment of 2 mg per kg...,Oseltamivir phosphate,ANDA208348,14 CLINICAL STUDIES 14.1 Treatment of Influenz...


Evaluate

In [33]:
from IPython.display import clear_output
_start = 33
_current = _start

# Compare against the full length; len(df[_start:]) would end the loop before the last rows
while _current < len(df):

    clear_output(wait=True)
    print(f'{df["application_number"][_current]} \n')
    print(f'{df["regiment"][_current]}:\n {df["outcome"][_current]} -> {df["value"][_current]}\n')
    print(df["clinical_studies"][_current])

    assessment = input("Correct? (Yes/No): ")
    notes = input("Reason? (description): ")
    df.at[_current, 'assessment'] = assessment
    df.at[_current, 'notes'] = notes

    _current += 1


ANDA078904 

nan:
 Partial Onset Seizures Effectiveness in Partial Onset Seizures in Adults with Epilepsy -> levetiracetam as adjunctive therapy (added to other antiepileptic drugs) in adults was established in three multicenter, randomized, double-blind, placebo-controlled clinical studies in patients who had refractory partial onset seizures with or without secondary generalization.

14 CLINICAL STUDIES In the following studies, statistical significance versus placebo indicates a p value <0.05. 14.1 Partial Onset Seizures Effectiveness in Partial Onset Seizures in Adults with Epilepsy The effectiveness of levetiracetam as adjunctive therapy (added to other antiepileptic drugs) in adults was established in three multicenter, randomized, double-blind, placebo-controlled clinical studies in patients who had refractory partial onset seizures with or without secondary generalization. The tablet formulation was used in all these studies. In these studies, 904 patients were randomized to pl

KeyboardInterrupt: Interrupted by user
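
The `KeyboardInterrupt` above is how a review session ends partway through. A minimal sketch of a loop that stops cleanly and reports where to resume is given here. It is not the code used for these trials: the function name `review` and the `ask` parameter are hypothetical, and plain dictionaries stand in for DataFrame rows so the sketch runs without pandas.

```python
def review(records, start=0, ask=input):
    """Annotate records[start:] in place and return the index to resume from.

    `records` is a list of dicts (hypothetical stand-in for DataFrame rows).
    `ask` is any callable that takes a prompt and returns the answer;
    it defaults to the built-in input().
    """
    current = start
    try:
        while current < len(records):
            row = records[current]
            print(f"{row['application_number']}\n")
            print(f"{row['outcome']} -> {row['value']}\n")

            # Collect both answers before writing, so an interrupted
            # row is never left half-annotated.
            assessment = ask("Correct? (Yes/No): ")
            notes = ask("Reason? (description): ")
            row["assessment"] = assessment
            row["notes"] = notes

            current += 1
    except KeyboardInterrupt:
        # Stop quietly; everything before `current` is already annotated.
        pass
    return current
```

With pandas, the rows could be obtained with `df.to_dict('records')` and turned back into a frame with `pd.DataFrame(records)` before saving. The returned index would replace the hand-edited `_start` value on the next session.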

In [32]:
df[30:45]

Unnamed: 0,outcome,value,regiment,brand_name,application_number,clinical_studies
30,primary,100%,duloxetine delayed-release capsules} 60-120 mg...,Duloxetine DR,ANDA208706,14. Clinical Studies 14.1 Overview of the Clin...
31,mean mirtazapine dose,21 to 32 mg/day,titrated with mirtazapine from a dose range of...,Mirtazapine,ANDA076921,14 CLINICAL STUDIES The efficacy of mirtazapin...
32,uncertain stability in clinical outcomes,,,Fludeoxyglucose F 18,ANDA203591,14 CLINICAL STUDIES 14.1 Oncology The efficacy...
33,Partial Onset Seizures Effectiveness in Partia...,levetiracetam as adjunctive therapy (added to ...,,Levetiracetam,ANDA078904,"14 CLINICAL STUDIES In the following studies, ..."
34,reduction of the Ashworth score,statistically significant,tizanidine,TIZANIDINE HYDROCHLORIDE,ANDA210021,14 CLINICAL STUDIES Tizanidine's capacity to r...
35,Median overall survival,9.0 months,metoprolol tartrate tablets,Metoprolol Tartrate,ANDA074644,14 CLINICAL STUDIES 14.1 Hypertension In contr...
36,pregnancy rate,1.41,drospirenone and ethinyl estradiol tablets,Drospirenone and ethinyl estradiol,ANDA211944,14 CLINICAL STUDIES 14.1 Oral Contraceptive Cl...
37,weight loss,fraction of a pound a week,anorectic drugs and diet,Phentermine Hydrochloride,ANDA040876,14 CLINICAL STUDIES In relatively short-term c...
38,Median overall survival,9.0 months,triplet-therapy group,Benazepril Hydrochloride,ANDA078212,14 CLINICAL STUDIES Hypertension Adult Patient...
39,"Mean event rate of serious, acute, bacterial i...",0.037,BIVIGAM infusion administered every 3 or 4 weeks,BIVIGAM,BLA125389,14 CLINICAL STUDIES 14. 1 Treatment of Primary...


Save

In [None]:
# df.to_excel('20240611_orcaTrialEval-inProgress.xlsx')

Interestingly, OpenOrca at times returned the example from my prompt verbatim instead of extracting an outcome from the text.

In [None]:
# Model didn't 'understand' and threw back the example?
print(f'{(12/93)*100}% returned example from prompt')
df[df['outcome']=='Median overall survival'].reset_index(drop=True)


12.903225806451612% returned example from prompt


Unnamed: 0,outcome,value,regiment,brand_name,application_number,clinical_studies
0,Median overall survival,9.0 months,triplet-therapy group,Cefoxitin,ANDA065415,"CLINICAL STUDIES A prospective, randomized, do..."
1,Median overall survival,9.0 months,triplet-therapy group,LISINOPRIL,ANDA076180,14 CLINICAL STUDIES 14.1 Hypertension Two dose...
2,Median overall survival,9.0 months,triplet-therapy group,Azithromycin Dihydrate,ANDA208250,14 CLINICAL STUDIES 14.1 Adult Patients Acute ...
3,Median overall survival,9.0 months,triplet-therapy group,Naproxen Sodium,ANDA212199,14 CLINICAL STUDIES Naproxen has been studied ...
4,Median overall survival,9.0 months,triplet-therapy group,NURTEC ODT,NDA212728,14 CLINICAL STUDIES 14.1 Acute Treatment of Mi...
5,Median overall survival,9.0 months,metoprolol tartrate tablets,Metoprolol Tartrate,ANDA074644,14 CLINICAL STUDIES 14.1 Hypertension In contr...
6,Median overall survival,9.0 months,triplet-therapy group,Benazepril Hydrochloride,ANDA078212,14 CLINICAL STUDIES Hypertension Adult Patient...
7,Median overall survival,9.0 months,triplet-therapy group,Nitrofurantoin (monohydrate/macrocrystals),ANDA207372,CLINICAL STUDIES Controlled clinical trials co...
8,Median overall survival,9.0 months,triplet-therapy group,NAGLAZYME,BLA125117,14 CLINICAL STUDIES A total of 56 patients wit...
9,Median overall survival,9.0 months,triplet-therapy group,Clofarabine,NDA021673,14 CLINICAL STUDIES Seventy-eight (78) pediatr...


## Quantify and Compare Results
Open the evaluation data from each trial set and compare the field expert assessed results against each other. Observe correctness and note any oddities across datasets.

In [19]:
# Trial 1, Mistral
mistral = pd.read_excel('20240607_mistralTrialEval.xlsx').drop(labels='Unnamed: 0',axis=1)

# Trial 2, Orca
orca = pd.read_excel('20240611_orcaTrialEval-inProgress.xlsx').drop(labels='Unnamed: 0',axis=1)


In [23]:
orca['assessment'][0:30].value_counts()

assessment
No     16
Yes    14
Name: count, dtype: int64

In [21]:
mistral['assessment'].value_counts()

assessment
Yes    16
No     14
Name: count, dtype: int64

In [29]:
orca['notes'].str.lower().value_counts()[0:2]

notes
nonsensical            8
example from prompt    5
Name: count, dtype: int64

In [26]:
mistral['notes'].str.lower().value_counts()[0:1]

notes
nonsensical    5
Name: count, dtype: int64

<u>**Manual Assessment**</u>:  
Mistral -> 16 Right, 14 Wrong but 5 of those were nonsensical, leaving 9 factually wrong   
Orca -> 14 Right, 16 Wrong but 5 of those were directly from the prompt, 8 were nonsensical, leaving 3 factually wrong
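
The "factually wrong" figures are the "No" counts minus the responses set aside as nonsensical or copied from the prompt. A small sketch of that arithmetic, using the counts reported in the cells above, follows. The helper name `factually_wrong` is hypothetical, and the calculation assumes every excluded note belongs to a row assessed as "No".

```python
def factually_wrong(no_count, excluded):
    """Subtract the excluded categories from the number of 'No' assessments."""
    return no_count - sum(excluded.values())

# Counts taken from the value_counts() outputs above.
mistral_wrong = factually_wrong(14, {"nonsensical": 5})
orca_wrong = factually_wrong(16, {"nonsensical": 8, "example from prompt": 5})

print(f"Mistral factually wrong: {mistral_wrong}")  # 14 - 5 = 9
print(f"Orca factually wrong: {orca_wrong}")        # 16 - 8 - 5 = 3
```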
  
<u>**Runtime**</u>:  
Mistral -> 100%|██████████| 30/30 [47:30:21<00:00, 5700.73s/it]  
Orca -> 100%|██████████| 32/32 [36:24:09<00:00, 4095.31s/it]

Both methods of prompting/enforcing a schema performed about the same on this tiny sample of 30 replicates. The Orca trial ran faster (4095.31 s/it vs 5700.73 s/it), though both took a long time to run. Notably, Orca had fewer responses that were factually incorrect or made up than Mistral (3 vs 9). At times, Orca would return the exact example from the prompt. One of the factually wrong responses from Orca was simply "No Outcome Found.", which was incorrect, as the text contained many outcomes.
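
To put the runtime gap in relative terms, the per-item times from the progress bars can be compared directly. This is a quick sketch using only the two figures reported above; the variable names are illustrative.

```python
# Seconds per item, as reported by the progress bars above.
mistral_s_per_it = 5700.73
orca_s_per_it = 4095.31

ratio = orca_s_per_it / mistral_s_per_it

print(f"Mistral: {mistral_s_per_it / 3600:.2f} h per item")
print(f"Orca:    {orca_s_per_it / 3600:.2f} h per item")
print(f"Orca needed {ratio:.1%} of Mistral's time per item")
```

The two trials differ in both model and schema method, so the gap cannot be attributed to either factor alone.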
  
It is also worth noting that each of these prompts produced exactly one outcome while each of these trial sections commonly had many outcomes. This could potentially be overcome by specifying an array of dictionaries as the output in the schema definition rather than just a single dictionary.
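
A rough sketch of what such an array-valued schema could look like is shown here. The schemas used in these trials are not reproduced in this notebook, so the field names are assumptions borrowed from the Trial 1 columns (`outcome`, `value`, `study_group`), and `minItems` is an illustrative choice.

```python
import json

# Assumed shape of a single extracted outcome (field names taken from
# the Trial 1 columns; the real schema may differ).
outcome_schema = {
    "type": "object",
    "properties": {
        "outcome": {"type": "string"},
        "value": {"type": "string"},
        "study_group": {"type": "string"},
    },
    "required": ["outcome", "value", "study_group"],
}

# Wrap the single-outcome object in an array so that one response
# can carry every outcome reported in a clinical studies section.
outcomes_schema = {
    "type": "array",
    "items": outcome_schema,
    "minItems": 1,
}

schema_str = json.dumps(outcomes_schema, indent=2)
print(schema_str)
```

The resulting JSON string would go wherever the single-object schema is currently supplied. Whether the installed version of *outlines* accepts a top-level array should be checked; wrapping the array in an object with a single `outcomes` property is an alternative if it does not.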
