In [1]:
import pandas as pd
import os 

from pymongo import MongoClient
# imports for langchain and Chroma and plotly

from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.schema import Document
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
import numpy as np
from sklearn.manifold import TSNE
import plotly.graph_objects as go
import json
import re
from sklearn.model_selection import train_test_split

# environment variables passed into the service
mongouser = os.getenv('MONGO_INITDB_ROOT_USERNAME')
mongopass = os.getenv('MONGO_INITDB_ROOT_PASSWORD')
OPENAI_KEY = os.getenv('OPENAI_KEY')

# Add these plus the key to the environment variables
client = MongoClient(f"mongodb://{mongouser}:{mongopass}@mongodb:27017")

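# key read in from env file at home on the local machine
# pass it explicitly; if OPENAI_KEY is unset, the client falls back to OPENAI_API_KEY
from openai import OpenAI
openai_client = OpenAI(api_key=OPENAI_KEY)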

In [2]:
terms = pd.read_excel('/pipeline_datalake/List of clinical definitions lookups.xlsx')


In [158]:
annotated_df = pd.read_excel('/pipeline_datalake/Annoted terms With Labels.xlsx', sheet_name='entity_clean', dtype=str)

### Summary
This notebook contains the results for AI Engineer Candidate Assessment requirements 2 and 3

2. Connect the application to one or more LLMs using industry standard methods (or be
ready to explain why you chose to deviate from those standards)
3. Develop an appropriate evaluation approach to identify errors and improve the prompt’s
performance in generating accurate clinical definitions. Please show the results of your
evaluation and what you changed as a result.

#### Defined Rules for Correctness
The key requirement is to "generate accurate clinical definitions".
However, because only certain code vocabularies are allowed, we take a strict approach with emphasis on "clinical definitions".
This model is NOT designed to say what a code COULD represent -- especially if the vocabulary is too broad for the concept.
Success is also measured by recognizing the limitations of a vocabulary: if an "accurate clinical definition" is not possible, the correct behavior is to not answer.

Some entity-type-specific annotation rules
1. diagnosis: ICD10 has major limitations in defining a specific medical concept. If the medical concept is too abstract for ICD10, a correct answer is a non-answer
2. medication: A drug ingredient should result in only the RxNorm code of the ingredient term, not all possible drugs that have that ingredient
3. medication: A regimen should result in the RxNorm codes of the regimen's active ingredients only
4. drug_class: merely having an ATC representation does not make a concept a drug class in and of itself -- usually that is level 4 or higher
5. procedure: Similar to diagnosis. If a term represents multiple procedures, multiple CPT values should be returned
6. measurements/labs: biomarkers and substances (that are not medicinal compounds) should return the available tests


#### Evaluation
Based on the annotation from the defined rules above I will evaluate:
- primarily entity type and code, both together in one combined score and separately
- text as well, but not to the same degree: if the code is correct, the text can be looked up with a tool
- a set-overlap measure (Jaccard) on the code differences as a secondary check
- I will not use k-fold, as my set is too small

The evaluation is different from typical classification and is more in line with set thinking. Looping through the tidy output, each entity contributes one outcome to the tallies.

We keep it basic and count any prediction that is made but incorrect as an FP. This makes recall a poor metric to look at, but it emphasizes precision as our main metric.

Full entity level 
- Positive Prediction (P'): The LLM outputs any code for an entity.
- Negative Prediction (N'): The LLM outputs no codes for an entity.    

For the breakdown:
- True Positive (TP): An entity that should have codes, and the LLM correctly provides "Full Marks" (all correct codes, no incorrect ones).
- False Positive (FP): An entity that should have no codes, but the LLM returns codes; or an entity that should have codes, but the LLM returns a code set that is not an exact match (wrong as a whole, incomplete, or with additional incorrect codes).
- True Negative (TN): An entity that should have no codes, and the LLM correctly returns no codes.
- False Negative (FN): An entity that should have codes, but the LLM returns no codes.


So each entity contributes exactly one outcome (TP, FP, TN, or FN) overall, and separately one for entity type (as a whole) and one for code (as a whole).
We tally per entity.

We do the same for entity type alone and code alone,
and, time permitting, add Jaccard or other set measures.
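
The four outcomes above reduce to a comparison of two sets. A minimal sketch, assuming each entity's expected and predicted values are held as Python sets; the helper name `classify_outcome` is illustrative and not part of the notebook:

```python
def classify_outcome(true_set, pred_set):
    """Return 'tn', 'tp', 'fn' or 'fp' for one entity (illustrative helper)."""
    if not true_set and not pred_set:
        return "tn"  # nothing expected, nothing predicted
    if true_set == pred_set:
        return "tp"  # exact match: full marks
    if true_set and not pred_set:
        return "fn"  # something expected, model declined
    return "fp"      # any other prediction counts against precision


# Code sets taken from the annotated examples in this notebook
print(classify_outcome({"R06.2"}, {"R06.2"}))    # tp
print(classify_outcome({"T86.41"}, {"T86.42"}))  # fp (wrong code)
print(classify_outcome(set(), {"L60.8"}))        # fp (should have declined)
print(classify_outcome({"260101"}, set()))       # fn
print(classify_outcome(set(), set()))            # tn
```

The same function can be applied unchanged to entity types, codes, and text, since each is compared as a set.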

In [159]:
print(f"{annotated_df['entity_name'].nunique()} annotated validation concepts")
print(f"{annotated_df.shape[0]} unique coded concepts expected")

20 annotated validation concepts
37 unique coded concepts expected


In [160]:
annotated_df.head()

Unnamed: 0,entity_name,types,is_regime,set,should_say_no,codes_pipe,vocabulary,text,validated
0,Liver Transplant Rejection,diagnosis,False,val,,T86.41,ICD-10,Liver transplant rejection,1
1,Oseltamivir,medication,False,val,,260101,RxNorm,oseltamivir,1
2,Lurbinectedin,medication,False,val,,2374729,RxNorm,lurbinectedin,1
3,Wheezing,diagnosis,False,val,,R06.2,ICD-10,Wheezing,1
4,eptifibatide,medication,False,val,,75635,RxNorm,eptifibatide,1


In [88]:
annotated_terms = annotated_df[['entity_name']].drop_duplicates().copy()

In [89]:
annotated_terms

Unnamed: 0,entity_name
0,Liver Transplant Rejection
1,Oseltamivir
2,Lurbinectedin
3,Wheezing
4,eptifibatide
5,Amoxicillin
6,green nails
7,"Dilation of hypoglossal nerve, open approach"
10,Uric Acid (Urate)
13,Alvimopan


### Step 1) Run the default prompt

In [39]:
# Confirmed with Jason: add a bare-minimum instruction to the default prompt so it also returns the entity types

def create_default_entity_prompt(concept_names_list):
    '''Build the default prompt for a list of concept names. This is likely more than we need, but ChatGPT isn't specifically designed for medicine.'''


    entities_dict = {
        "entities": concept_names_list
    }

    entity_json_string = json.dumps(entities_dict, indent=2)
    # as little tweaking as possible
    prompt = """
    You are an expert in medical coding logic. given the following json, return a json with the most appropriate medical codes for each concept, as well as the one or more types of entity.
    
    {
      "entities": [
        "folfirinox",
        "bismuth quadruple therapy",
        "lung cancer",
        "Lp(a) measurement",
        "appendectomy",
        "ACE inhibitors"
      ]
    }
    
        
    The allowable entity types you must assign are the exact strings:
    * 'diagnosis'
    * 'procedure' 
    * 'measurements/labs'
    * 'medication'
    * 'drug_class' 
    If more than one is correct, include both as one string separated by a comma

    
    The allowable code libraries are ICD-10 for diagnoses, ICD-10 category codes for groups of diagnoses, CPT for procedures, LOINC for measurements/labs, and RXnorm for medications, ATC for drug classes.
    Each entity can have more than 1 code if applicable. For example, medication regimens should have a code per element.
    For each code give a 0-100 score of your confidence in the accuracy of the code selected (100 is 100% confident). Do not include codes you are not very confident in. False positives are worse than false negatives.
    
    
    Generate your response in the following format and only return the formatted JSON in the response
    {
    "entities": {
    "[ENTITY_NAME]": {
    "entity_name": "[ENTITY_NAME]",
    "types": "[ENTITY_TYPE]",
    "codes": [
    {
    "code": "[CODE_VALUE]",
    "system": "[CODE_SYSTEM]",
    "description": "[HUMAN_READABLE_DESCRIPTION]",
    "confidence": [0-100]
    }
    ]
    }
    }

    Here is the entity input json:

    """
    prompt = prompt + entity_json_string
    return prompt

def flatten_entity_to_df(response_json):
    '''Flatten the LLM output JSON into a long-form DataFrame with one row per (entity, code).'''
    
        
    records = []
    
    # Iterate through each entity in the 'entities' dictionary
    for entity_key, entity_value in response_json['entities'].items():
        entity_name = entity_value['entity_name']
        entity_types = entity_value['types']
    
        for code_entry in entity_value['codes']:
            record = {
                'entity_name': entity_name,
                'types': entity_types,
                'code': code_entry.get('code'),  
                'system': code_entry.get('system'),
                'description': code_entry.get('description'),
                'confidence': code_entry.get('confidence')
            }
            records.append(record)
    
    df_long_form_entities = pd.DataFrame(records)
    return df_long_form_entities


# standard completion function
def get_completion(prompt, model="gpt-4-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
    )
    content = response.choices[0].message.content
    
    return content
# strip markdown code fences so the response can be parsed with json.loads
def strip_markdown_fences(text):
    return re.sub(r"^```json\s*|```$", "", text.strip(), flags=re.MULTILINE)
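
To make the fence handling concrete, here is a small sketch of parsing a fenced response. `parse_llm_json` is a hypothetical helper and the sample responses are invented for illustration; it is a slightly more tolerant variant of `strip_markdown_fences` that also accepts a fence without the `json` tag.

```python
import json
import re

# Matches an opening fence (with or without a "json" tag) at the start of the
# text, or a closing fence at the end of the text.
FENCE_RE = re.compile(r"^```(?:json)?\s*|\s*```$", flags=re.IGNORECASE)


def parse_llm_json(text):
    """Strip surrounding markdown fences, then parse the remainder as JSON."""
    cleaned = FENCE_RE.sub("", text.strip())
    return json.loads(cleaned)


fenced = '```json\n{"entities": {}}\n```'
bare_fence = '```\n{"entities": {}}\n```'
unfenced = '{"entities": {}}'

for sample in (fenced, bare_fence, unfenced):
    print(parse_llm_json(sample))  # {'entities': {}}
```

If the model returns text that is not valid JSON after stripping, `json.loads` raises `json.JSONDecodeError`, which is a reasonable place to log the raw response.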

In [37]:
batch_size=10
model="gpt-4-turbo"


   
# this isn't really batching, I'm just calling it that for now. I didn't want to put the full list in at once
all_raw_results = []
for i in range(0, len(annotated_terms), batch_size):
    batch_terms = annotated_terms["entity_name"].iloc[i:i+batch_size].tolist()
    prompt = create_default_entity_prompt(batch_terms)
    response_json = get_completion(prompt, model=model)  # raw response string; parsed with json.loads later
    all_raw_results.append(response_json)
    

In [40]:

df_full = pd.DataFrame()
for ent in all_raw_results:
    ent_clean = json.loads(strip_markdown_fences(ent))
    df_clean = flatten_entity_to_df(ent_clean)
    df_full = pd.concat([df_full, df_clean])

df_full = df_full.reset_index(drop=True)


In [47]:
# 19 of the 20 entities came back, so the model declined only once
len( list( set(df_full['entity_name'].to_list()) & set(annotated_terms['entity_name'].to_list()) ) )

19

In [65]:
df_pred = df_full[['entity_name','types','code','system','description','confidence']].copy() \
    .rename(columns={'types':'pred_type','code':'pred_code','system':'pred_vocab','description':'pred_text','confidence':'pred_confidence'})



df_pred = annotated_terms[['entity_name']].drop_duplicates().merge(df_pred, on='entity_name', how='left')

In [66]:
df_pred['entity_name'].nunique()

20

In [69]:
df_pred['pred_no'] = df_pred['pred_type'].apply(lambda x: 1 if pd.isnull(x) else 0)
df_pred['pred_confidence'] = df_pred['pred_confidence'].fillna(0)

In [111]:
df_pred

Unnamed: 0,entity_name,pred_type,pred_code,pred_vocab,pred_text,pred_confidence,pred_no
0,Liver Transplant Rejection,diagnosis,T86.42,ICD-10,Liver transplant rejection,95.0,0
1,Oseltamivir,medication,84989,RXnorm,Oseltamivir,98.0,0
2,Lurbinectedin,medication,2363736,RXnorm,Lurbinectedin,98.0,0
3,Wheezing,diagnosis,R06.2,ICD-10,Wheezing,95.0,0
4,eptifibatide,medication,35623,RXnorm,Eptifibatide,98.0,0
5,Amoxicillin,medication,723,RXnorm,Amoxicillin,98.0,0
6,green nails,diagnosis,L60.8,ICD-10,Other nail disorders,80.0,0
7,"Dilation of hypoglossal nerve, open approach",procedure,04V03DZ,ICD-10-PCS,"Dilation of Hypoglossal Nerve, Open Approach",90.0,0
8,Uric Acid (Urate),measurements/labs,3084-1,LOINC,Uric acid [Mass/volume] in Serum or Plasma,95.0,0
9,Alvimopan,medication,544725,RXnorm,Alvimopan,98.0,0


In [210]:
annotated_terms['entity_name'].nunique()

20

In [161]:
truth_df = annotated_df[['entity_name','types','codes_pipe','vocabulary','text','should_say_no']] \
    .rename(columns={'types':'true_type','should_say_no':'true_no','codes_pipe':'true_code','vocabulary':'true_vocab','text':'true_text'})

truth_df['true_no'] = truth_df['true_no'].fillna(0)


In [162]:
for col in truth_df[list(truth_df.columns)[:-1]]:
    truth_df[col] = truth_df[col].str.strip()
    

In [175]:
score_base = truth_df[['entity_name']].drop_duplicates().copy()

for index, row in score_base.iterrows():
    # outputs we want are per-entity outcome tallies
    entity_name = row['entity_name']
    tmp_truth = truth_df[truth_df['entity_name'] == entity_name].copy()
    tmp_pred = df_pred[df_pred['entity_name'] == entity_name].copy()

    # if this is 1 then I reviewed and said this would not be a possible definition for the vocabularies
    # if this is 0 then it is possible
    true_negative = int(tmp_truth['true_no'].max())
    # if this is 1 then the model declined to answer with a possible definition for the vocabularies
    # if this is 0 then it answered
    pred_negative = int(tmp_pred['pred_no'].max())
    
    # create for easy set math
    if true_negative == 1:
        true_entity_set = set()
        true_code_set = set()
        true_text_set =  set()
    else:
        true_entity_set = set(tmp_truth['true_type'].to_list())
        true_code_set = set(tmp_truth['true_code'].to_list())
        true_text_set = set(tmp_truth['true_text'].to_list())
    if pred_negative == 1:
        pred_entity_set = set()
        pred_code_set = set()
        pred_text_set = set()

    else:
        pred_entity_set = set(tmp_pred['pred_type'].to_list())
        pred_code_set = set(tmp_pred['pred_code'].to_list())
        pred_text_set = set(tmp_pred['pred_text'].to_list())


    pred_text_set = set([c.lower() for c in list(pred_text_set)])
    true_text_set = set([c.lower() for c in list(true_text_set)])

    if len(true_entity_set) == 0 and len(pred_entity_set) == 0:
        entity_res = 'tn' 
    elif true_entity_set == pred_entity_set:
        entity_res = 'tp'
    elif len(true_entity_set) > 0 and len(pred_entity_set) == 0:
        entity_res = 'fn' 
    else:
        # the key decision here, which I had to think about for a while:
        # any prediction that is not an exact match counts as a false positive
        entity_res = 'fp' 
    
    if len(true_code_set) == 0 and len(pred_code_set) == 0:
        code_res = 'tn' 
    elif true_code_set == pred_code_set:
        code_res = 'tp'
    elif len(true_code_set) > 0 and len(pred_code_set) == 0:
        code_res = 'fn' 
    else:
        code_res = 'fp' 

    if len(true_text_set) == 0 and len(pred_text_set) == 0:
        text_res = 'tn' 
    elif true_text_set == pred_text_set:
        text_res = 'tp'
    elif len(true_text_set) > 0 and len(pred_text_set) == 0:
        text_res = 'fn' 
    else:
        text_res = 'fp' 
        

    score_base.loc[index,'true_no'] = true_negative
    score_base.loc[index,'pred_no'] = pred_negative

    score_base.loc[index,'entity_eval'] = entity_res
    score_base.loc[index,'code_eval'] = code_res
    score_base.loc[index,'text_eval'] = text_res
    
    score_base.at[index,'entity_pred_set'] = pred_entity_set
    score_base.at[index,'entity_true_set'] = true_entity_set
    
    
    score_base.at[index,'code_pred_set'] = pred_code_set
    score_base.at[index,'code_true_set'] = true_code_set

    score_base.at[index,'text_pred_set'] = pred_text_set
    score_base.at[index,'text_true_set'] = true_text_set

In [None]:
# full cleaned comparison

In [176]:
score_base

Unnamed: 0,entity_name,true_no,pred_no,entity_eval,code_eval,text_eval,entity_pred_set,entity_true_set,code_pred_set,code_true_set,text_pred_set,text_true_set
0,Liver Transplant Rejection,0.0,0.0,tp,fp,tp,{diagnosis},{diagnosis},{T86.42},{T86.41},{liver transplant rejection},{liver transplant rejection}
1,Oseltamivir,0.0,0.0,tp,fp,tp,{medication},{medication},{84989},{260101},{oseltamivir},{oseltamivir}
2,Lurbinectedin,0.0,0.0,tp,fp,tp,{medication},{medication},{2363736},{2374729},{lurbinectedin},{lurbinectedin}
3,Wheezing,0.0,0.0,tp,tp,tp,{diagnosis},{diagnosis},{R06.2},{R06.2},{wheezing},{wheezing}
4,eptifibatide,0.0,0.0,tp,fp,tp,{medication},{medication},{35623},{75635},{eptifibatide},{eptifibatide}
5,Amoxicillin,0.0,0.0,tp,tp,tp,{medication},{medication},{723},{723},{amoxicillin},{amoxicillin}
6,green nails,1.0,0.0,fp,fp,fp,{diagnosis},{},{L60.8},{},{other nail disorders},{}
7,"Dilation of hypoglossal nerve, open approach",1.0,0.0,fp,fp,fp,{procedure},{},{04V03DZ},{},"{dilation of hypoglossal nerve, open approach}",{}
10,Uric Acid (Urate),0.0,0.0,tp,fp,fp,{measurements/labs},{measurements/labs},{3084-1},"{3084-4, 3087-7, 3084-1, 2916-2}",{uric acid [mass/volume] in serum or plasma},"{uric acid [mass/volume] in blood, uric acid [..."
14,Alvimopan,0.0,0.0,tp,fp,tp,{medication},{medication},{544725},{480639},{alvimopan},{alvimopan}


In [186]:
df_pred.to_excel('/pipeline_datalake/clean_default_predictions.xlsx',index=False)
truth_df.to_excel('/pipeline_datalake/clean_annotations.xlsx',index=False)
score_base.to_excel('/pipeline_datalake/clean_default_model_eval.xlsx',index=False)

In [191]:
df_pred['pred_confidence'].value_counts()
# the model does not lack confidence: almost every score is 90 or above

pred_confidence
90.0     13
95.0      8
98.0      5
100.0     2
80.0      1
0.0       1
Name: count, dtype: int64

I am a little concerned about my annotation of text: multiple sources (Athena and UMLS) may display different text names for the same code.

In [211]:
from sklearn.metrics import precision_score, recall_score, f1_score

# per-entity scoring gives less coverage than per-code scoring, but it is more straightforward
# tp/fn rows are truly positive; tp/fp rows are predicted positive
def get_eval_metrics(df, column):
    y_true = df[column].apply(lambda x: 1 if x in ['tp','fn'] else 0)
    y_pred = df[column].apply(lambda x: 1 if x in ['tp','fp'] else 0)

    precision = precision_score(y_true, y_pred, zero_division=0)
    f1        = f1_score(y_true, y_pred, zero_division=0)
    counts    = df[column].value_counts().to_dict()

    return {
        'counts':    counts,
        'precision': round(precision, 3),
        'f1':        round(f1,        3)
    }



entity_metrics = get_eval_metrics(score_base, 'entity_eval')
code_metrics = get_eval_metrics(score_base, 'code_eval')
text_metrics = get_eval_metrics(score_base, 'text_eval')

print("Entity Evaluation Metrics:", entity_metrics)
print()
print("Code Evaluation Metrics:", code_metrics)
print()
print("Text Evaluation Metrics:", text_metrics)



Entity Evaluation Metrics: {'counts': {'tp': 15, 'fp': 4, 'tn': 1}, 'precision': 0.789, 'f1': 0.882}

Code Evaluation Metrics: {'counts': {'fp': 13, 'tp': 6, 'tn': 1}, 'precision': 0.316, 'f1': 0.48}

Text Evaluation Metrics: {'counts': {'tp': 10, 'fp': 9, 'tn': 1}, 'precision': 0.526, 'f1': 0.69}
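
The reported numbers can be reproduced by hand from the counts. A minimal sketch, assuming the usual definitions precision = TP / (TP + FP) and recall = TP / (TP + FN); `precision_recall_f1` is an illustrative helper, not a function from the notebook:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from outcome counts (rounded to 3 places)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return round(precision, 3), round(recall, 3), round(f1, 3)


# Counts copied from the output above; no entity was scored 'fn'
print("entity:", precision_recall_f1(tp=15, fp=4, fn=0))  # (0.789, 1.0, 0.882)
print("code:  ", precision_recall_f1(tp=6, fp=13, fn=0))  # (0.316, 1.0, 0.48)
print("text:  ", precision_recall_f1(tp=10, fp=9, fn=0))  # (0.526, 1.0, 0.69)
```

Because no entity is scored `fn`, recall is 1.0 in all three cases, so F1 is driven entirely by precision here.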


In [212]:
from sklearn.metrics import precision_score, f1_score

def make_entity_code_label(row):
    ent = row['entity_eval']  # 'tp', 'fp', 'fn', or 'tn'
    code = row['code_eval']   # 'tp', 'fp', 'fn', or 'tn'

    #True Negative
    if ent == 'tn' and code == 'tn':
        return 'tn'

    #True Positive
    if ent != 'tn' and code == 'tp':
        return 'tp'

    #False Negative: codes were expected but the model returned none
    if ent != 'tn' and code == 'fn':
        return 'fn'

    # basically when the codes are not perfect
    return 'fp'

score_base['entity_code_eval'] = score_base.apply(make_entity_code_label, axis=1)

# same
y_true = score_base['entity_code_eval'].map(lambda x: 1 if x in ['tp','fn'] else 0)
y_pred = score_base['entity_code_eval'].map(lambda x: 1 if x in ['tp','fp'] else 0)

precision = precision_score(y_true, y_pred, zero_division=0)
f1        = f1_score(       y_true, y_pred, zero_division=0)
counts    = score_base['entity_code_eval'].value_counts().to_dict()

print("Entity+Code Precision-first Metrics:", {
    'counts':    counts,
    'precision': round(precision, 3),
    'f1':        round(f1,        3)
})


Entity+Code Precision-first Metrics: {'counts': {'fp': 13, 'tp': 6, 'tn': 1}, 'precision': 0.316, 'f1': 0.48}


In [None]:
# The combined entity+code result is identical to the code-level result: the code check dominates

In [213]:
def compute_jaccard(row):
    pred = row['code_pred_set']
    truth = row['code_true_set']
    if isinstance(pred, set) and isinstance(truth, set):
        union = pred | truth
        intersection = pred & truth
        return len(intersection) / len(union) if union else 1.0
    return 0

score_base['jaccard'] = score_base.apply(compute_jaccard, axis=1)
print("Mean Jaccard Score:", round(score_base['jaccard'].mean(), 3))

Mean Jaccard Score: 0.379
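
For intuition, the per-entity Jaccard score is |pred ∩ truth| / |pred ∪ truth|. A small worked sketch using rows from the comparison table above; the helper name `jaccard` is illustrative and mirrors `compute_jaccard` for plain sets:

```python
def jaccard(pred, truth):
    """Jaccard similarity of two sets; two empty sets count as a perfect match."""
    union = pred | truth
    return len(pred & truth) / len(union) if union else 1.0


# Wheezing: exact match
print(jaccard({"R06.2"}, {"R06.2"}))    # 1.0
# Liver Transplant Rejection: near-miss code still scores zero
print(jaccard({"T86.42"}, {"T86.41"}))  # 0.0
# Uric Acid (Urate): one of four expected codes returned
print(jaccard({"3084-1"}, {"3084-4", "3087-7", "3084-1", "2916-2"}))  # 0.25
# Correctly declined entity: both sets empty
print(jaccard(set(), set()))            # 1.0
```

Unlike the exact-match scoring, this gives partial credit for a subset of the right codes, but it gives none for a code that is merely close to the right one.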


Corrections made:
- add 3084-1 to truth for Uric Acid (Urate)
- 85.2 is more appropriate for Alcohol induced pancreatitis than 85.20
- 74182, 74183, 74150, 74160, 74170, 74181 is the right combination for Abdominal Imaging (CT or MRI)
- A02BA is correct for H2 Receptor Antagonists
- Athena and UMLS may display different text names so I may have messed some of those up
  

## Review of results and plans for improvement
- No k-fold: the set is too small, nothing is being tuned here, and we already evaluate at entity level, code level, and with Jaccard
- High recall is not impressive here, because it comes from over-predictive behavior that produces many FPs

Entity:
- Precision looks mediocre, but when we look only at cases where the model SHOULD have guessed at all, it has perfect precision
- Maybe no need to split this into a separate call; rather, focus on having the model decide when not to guess at all
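
One way to check the first point is to recompute entity precision after dropping the entities annotated as "should say no". A sketch on the ten rows shown in the comparison table earlier (not the full 20); `rows` and `precision` are illustrative names:

```python
# (entity_eval, true_no) pairs copied from the ten visible rows of score_base
rows = [
    ("tp", 0), ("tp", 0), ("tp", 0), ("tp", 0), ("tp", 0), ("tp", 0),
    ("fp", 1),  # green nails
    ("fp", 1),  # Dilation of hypoglossal nerve, open approach
    ("tp", 0), ("tp", 0),
]


def precision(outcomes):
    """Precision over a list of 'tp'/'fp'/'tn'/'fn' labels."""
    tp = outcomes.count("tp")
    fp = outcomes.count("fp")
    return tp / (tp + fp) if (tp + fp) else 0.0


all_rows = [label for label, _ in rows]
answerable = [label for label, true_no in rows if true_no == 0]

print("all shown rows: ", precision(all_rows))    # 0.8
print("answerable only:", precision(answerable))  # 1.0
```

On these rows every entity-type error falls on an entity that should have been declined, which is consistent with focusing the prompt on when not to answer.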
  
Code:
- Terrible precision, but this eval method is punishing: anything short of an exact set match counts as an FP
- Jaccard, however, confirms poor judgement when coding
- Reviewing the codes, we see a lot of wrong drug codes that don't make any sense
- I may try to specify more about ingredients -- they are often named with the same name as their corresponding concept
- Specify not to use other vocabularies -- the model used an ICD-10-PCS code
- Specify what a non-clinical mapping looks like; think of a prompt example with a rare disease that is not compatible with ICD-10, where no guess is better
  
Text:
- For drugs, when the code is incorrect it looks like the model is just repeating the text from the input
- Text is inconsistent, but I am also using both Athena and UMLS, and there are a few cases where I may have used a confusing version
- In general I am not going to focus on addressing this in the prompt. If I get the code correct, a text lookup should be easy
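
The point about off-list vocabularies can also be checked after the fact, independently of the prompt. A minimal sketch, assuming the type-to-vocabulary mapping stated in the prompt; `ALLOWED_VOCAB` and `is_off_vocabulary` are hypothetical names, and this is a possible post-hoc check rather than something the notebook currently does:

```python
# Assumed mapping, taken from the allowable code libraries listed in the prompt
ALLOWED_VOCAB = {
    "diagnosis": "icd-10",
    "procedure": "cpt",
    "measurements/labs": "loinc",
    "medication": "rxnorm",
    "drug_class": "atc",
}


def is_off_vocabulary(pred_type, pred_vocab):
    """True if the predicted vocabulary is not allowed for any predicted type.

    pred_type may hold several comma-separated types, as the prompt permits.
    Vocabulary names are compared case-insensitively ('RXnorm' == 'RxNorm').
    """
    types = [t.strip() for t in pred_type.split(",")]
    allowed = {ALLOWED_VOCAB[t] for t in types if t in ALLOWED_VOCAB}
    return pred_vocab.strip().lower() not in allowed


# Values taken from the prediction table in this notebook
print(is_off_vocabulary("procedure", "ICD-10-PCS"))  # True
print(is_off_vocabulary("medication", "RXnorm"))     # False
print(is_off_vocabulary("diagnosis", "ICD-10"))      # False
```

Rows flagged this way could be dropped or reported separately, so that vocabulary violations are distinguished from wrong codes within the right vocabulary.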