In [14]:
import pandas as pd
import os 

from pymongo import MongoClient
# imports for langchain and Chroma and plotly

from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.schema import Document
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
import numpy as np
from sklearn.manifold import TSNE
import plotly.graph_objects as go
import json
import re
from sklearn.model_selection import train_test_split

# environment variables passed into the service
mongouser = os.getenv('MONGO_INITDB_ROOT_USERNAME')
mongopass = os.getenv('MONGO_INITDB_ROOT_PASSWORD')

# Add these, plus the OpenAI API key, to the environment variables
client = MongoClient(f"mongodb://{mongouser}:{mongopass}@mongodb:27017")

# key read in from env file at home on the local machine
from openai import OpenAI
openai_client = OpenAI()

In [15]:
db = client["umls"]    # Replace with your database name
collection = db["mrconso"] # Replace with your collection name


In [16]:
terms = pd.read_excel('/pipeline_datalake/List of clinical definitions lookups.xlsx')


In [17]:
annotated_df = pd.read_excel('/pipeline_datalake/Annoted terms With Labels.xlsx', sheet_name='entity_clean', dtype=str)

### Summary
This notebook contains the results for AI Engineer Candidate Assessment requirements 4 and 5

4. Author a “final” version of the prompt that performs the task more accurately than the
original (ideally with quantitative justification)
5. List the limitations of your implementation and/or discussion topics regarding how to
make the service more robust for the future



### Approach for improvement

The baseline prompt had low precision, driven by several issues:
- The prompt needs to do a better job of telling the model when to give up. If a disease has no ICD-10 code that represents the concept well, the model should not guess.
- If the model is about to return a code from a vocabulary outside our allowed list, it should not.
- Drugs were missed a lot; it almost looks like the model is making the codes up. The prompt needs more guidance about the active ingredient: it often has the same name as the concept, so these are actually fairly easy to spot.
- Description text is sometimes made up. I will give the prompt explicit steps, with the text as the last one, and ask the model to use the chosen code to find it.
- Confidence was always high. Confidence could instead be a step in judging whether the concept has an analogous code at all.
- Concepts should be coded one at a time.

My next steps are to:
- Code concepts one at a time
- Emphasize the focus on ingredients in my prompt updates
- Add a second agent, a "reviewer" (although RxNorm really looks like the main issue)
- If I have time, add a "tool" or function call for the model in the reviewer to demonstrate next steps
- Build a vector embedding of ingredients, in other words empower the model in the areas where it is weakest, although I will not have time for this

Other notes

RxNorm is a tough ontology to learn from text.
It has thousands of numeric codes (e.g., 3098), but very few are referenced in public web text.

Drug codes like 214354 are difficult for the model because models aren't trained on the RxNorm structure.

So the model may pick a "close" RxNorm code based on surface patterns rather than correctness.

If I have time I will try to give the model access to MongoDB and UMLS.

I also want to try and create a

### Limitations and Discussion at the end

In [18]:
# Let's start by making a useful function
# we will see if we can give it to the model

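def get_possible_ingredient_code_lookup(concept_name):
    """Return RxNorm ingredient (TTY 'IN') codes whose lowercased name equals concept_name.

    concept_name must already be lowercased, since it is matched against STR_LOWER.
    """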
    pipeline = [
        {
            "$match": {
                "STR_LOWER": concept_name,
                "SAB": "RXNORM",
                "TTY":'IN'
            }
        },
        {
            "$project": {
                "_id": 0,
                "STR": 1,
                "SAB": 1,
                "STR_LOWER": 1,
                "CODE": 1
            }
        }
    ]
    # Execute the aggregation pipeline and collect the matching documents in a list
    aggregation_result = list(collection.aggregate(pipeline))
    result_codes = [c['CODE'] + ' coded for ingredient ' + c['STR'] for c in aggregation_result]
    return result_codes


In [19]:
get_possible_ingredient_code_lookup('amoxicillin')

['723 coded for ingredient amoxicillin']
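
One way this lookup could be handed to the model as a "tool" is sketched below. This is only a sketch under assumptions: the tool name `lookup_ingredient`, the `dispatch_tool_call` helper, and the in-memory dictionary standing in for MongoDB are all hypothetical and not part of this notebook. The schema follows the JSON-schema style used for chat-completions function calling.

```python
import json

# Hypothetical in-memory stand-in for the MongoDB lookup so the sketch runs offline.
_FAKE_RXNORM_INGREDIENTS = {"amoxicillin": "723"}

def lookup_ingredient(concept_name):
    """Stand-in with the same return shape as get_possible_ingredient_code_lookup."""
    key = concept_name.lower()
    code = _FAKE_RXNORM_INGREDIENTS.get(key)
    return [f"{code} coded for ingredient {key}"] if code else []

# Tool description for the model. The tool and parameter names are illustrative choices.
INGREDIENT_TOOL = {
    "type": "function",
    "function": {
        "name": "lookup_ingredient",
        "description": "Look up the RxNorm ingredient code for a drug ingredient name.",
        "parameters": {
            "type": "object",
            "properties": {"concept_name": {"type": "string"}},
            "required": ["concept_name"],
        },
    },
}

def dispatch_tool_call(name, arguments_json):
    """Run the tool the model asked for and return a JSON string to send back to it."""
    if name != "lookup_ingredient":
        return json.dumps({"error": f"unknown tool {name}"})
    args = json.loads(arguments_json)
    return json.dumps({"results": lookup_ingredient(args["concept_name"])})

tool_reply = dispatch_tool_call("lookup_ingredient", '{"concept_name": "Amoxicillin"}')
```

In a real run, `INGREDIENT_TOOL` would be passed in the request's tool list, and `dispatch_tool_call` would be fed the name and arguments the model returns, with the stand-in replaced by the MongoDB-backed function.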

In [20]:
print(f"{annotated_df['entity_name'].nunique()} annotated validation concepts")
print(f"{annotated_df.shape[0]} unique coded concepts expected")

20 annotated validation concepts
37 unique coded concepts expected


In [21]:
annotated_terms = annotated_df[['entity_name']].drop_duplicates().copy()

In [22]:
terms = annotated_terms['entity_name'].to_list()
terms

['Liver Transplant Rejection',
 'Oseltamivir',
 'Lurbinectedin',
 'Wheezing',
 'eptifibatide',
 'Amoxicillin',
 'green nails',
 'Dilation of hypoglossal nerve, open approach',
 'Uric Acid (Urate)',
 'Alvimopan',
 'Splenomegaly',
 'Covid-19',
 'Dacarbazine',
 'Methylxanthine',
 'Alcohol induced pancreatitis',
 'Abdominal Imaging (CT or MRI)',
 'H2 Receptor Antagonists',
 'long-acting beta agonist',
 'Census Subregion - Mountain',
 'R-CHOP']

### Implement Updates



In [23]:
# Confirmed from Jason, need to add a little bit of instructions to this (bare minimal) to also give the entity types

def create_improved_entity_prompt(concept_name):
    """Build the improved single-concept coding prompt for concept_name."""

    prompt = """
    You are an expert medical coding assistant.

    Your job is to assign medical codes from specific vocabularies to clinical concepts provided to you. Each concept should be evaluated *individually* and carefully. You must **only return codes when you are confident the concept maps clearly and specifically to a real entry in the medical coding system that captures the clinical meaning of the concept without ambiguity**.
    
    ## Your task:
    
    For an input concept, return:
    - The one or more correct **entity types**
    - A list of matching medical codes
    - A **confidence score (0-100)** for each code
    - The code **system**, human-readable **description**, and the original concept name
    
    Only include codes when your confidence is high (≥ 80). **False positives are worse than false negatives** — it’s better to leave something uncoded than to assign a wrong or approximate code.
    
    If the concept does not clearly correspond to a code in the specified vocabulary, **return no codes at all for that concept. Do not guess.**
    
    ---
    
    ## Allowable vocabularies:
    - **ICD-10**: for diagnoses
    - **ICD-10-CM categories**: for diagnosis groupings
    - **CPT**: for procedures
    - **LOINC**: for labs/measurements
    - **RxNorm**: for medications 
    - **ATC**: for drug classes
    
    ---
    
    ## Allowable entity types:
    - `'diagnosis'`
    - `'procedure'`
    - `'measurements/labs'`
    - `'medication'`
    - `'drug_class'`
    
    If multiple apply, separate them with commas.
    
    ---
    
    ## Coding Strategy:
    
    1. Identify the **correct vocabulary**
    2. Confirm a **clear and valid match** — if not, skip coding. A concept that COULD be represented by a code but also may not should not be included
    3. For medications, unless a brand or dosage is specified, provide the RxNorm for the active ingredient.
    4. Use the concept as a guide to find the proper description based on that concept from the determined vocabulary in 1)
    5. Provide structured output in valid JSON
    
    ---
    
    ## Examples of concepts that should NOT be coded:
    
    - `"delayed sleep-wake phase disorder"` → Not clearly represented in ICD-10, do not code
    - `"high output cardiac failure"` → If no direct ICD-10 match, skip
    - `"combination regimens not listed as RxNorm drugs"` → do not guess component drugs
    
    ---
    
    ## Output format:
    Return only a well-formed JSON response, no prose or explanation.
    
    ```json
    {
      "entities": {
        "[ENTITY_NAME]": {
          "entity_name": "[ENTITY_NAME]",
          "types": "[ENTITY_TYPE]",
          "codes": [
            {
              "code": "[CODE_VALUE]",
              "system": "[CODE_SYSTEM]",
              "description": "[HUMAN_READABLE_DESCRIPTION]",
              "confidence": [0-100]
            }
          ]
        }
      }
    }


    The input concept name is: """
    prompt = prompt + concept_name
    return prompt

# instead of evaluating itself while searching for the answer lets add another agent who is the REVIEWER who hopefully can see with more clarity 
# if I have time demonstrate how I would use a bidirectional autoencoder to help here especially for RXNORM codes which are syntactically kind of meaningless as they are just integers

def create_reviewer_entity_result_prompt(concept_name, initial_model_json):
    '''Build the reviewer prompt. The reviewer never adds codes that are missing; its job is to reduce false positives.'''
    prompt = """
    You are an expert medical coding Reviewer.
    You work at the end of a chain of agents and your job is to provide the highest quality to our staff and clients who rely on your results to help build powerful software that drives key medical research.

    The input you will receive is the output of the previous agent
     {
      "entities": {
        "[ENTITY_NAME]": {
          "entity_name": "[ENTITY_NAME]",
          "types": "[ENTITY_TYPE]",
          "codes": [
            {
              "code": "[CODE_VALUE]",
              "system": "[CODE_SYSTEM]",
              "description": "[HUMAN_READABLE_DESCRIPTION]",
              "confidence": [0-100]
            }
          ]
        }
      }
    }
    Where [ENTITY_NAME] is the original input concept and the other elements are the results given by the previous agent. Below I outline the previous agents task for clarity for you, and then your key tasks

    
    ## Previous Agent task:
    The previous agent's task was to
    1) First determine which entity type the concept belongs to, and proceed to step 2 with the one specific vocabulary that can be used for that entity type
    The Allowable entity types with corresponding allowable vocabulary:
    - `'diagnosis'` -> for **ICD-10** and **ICD-10CM** vocabularies only
    - `'procedure'` -> for the **CPT**  vocabulary only
    - `'measurements/labs'` -> for the **LOINC**  vocabulary only
    - `'medication'` -> for the **RxNorm**  vocabulary only
    - `'drug_class'` -> for the **ATC** vocabulary only

    2) Confirm a **clear and valid match** — if not, skip coding. A concept that COULD be represented by a code but also may not should not be included
    ** Your Task

    3)  Use the concept as a guide to find the proper description based on that concept from **only** the determined vocabulary in 1) and:
    For an input concept, return:
    - The one or more correct **entity types**
    - A list of matching medical codes
    - A **confidence score (0-100)** for each code
    - The code **system**, human-readable **description**, and the original concept name
    
    The previous agent should only include codes when their confidence is high. **False positives are worse than false negatives** — it’s better to leave something uncoded than to assign a wrong or approximate code.
    
    If the concept does not clearly correspond to a code in the specified vocabulary, **they should return no codes at all for that concept. Do not guess.**

    ---

    ## YOUR TASK NOW
    Perform a review of the previous agent's work and help REDUCE false positives. You will make modifications of the output of the previous agent if necessary, and provide an output in the same form.  If the previous agent made a mistake or deviated from its instructions in any way, return no concept 

    Follow these steps and either approve (and return the input as your result) or modify the output. Here are your two main tasks

    1) Check for drastic error: If the input [ENTITY_TYPE] and [CODE_SYSTEM] values are not from the previous agent's approved entity type and vocabulary lists, return no concept as the correct output

    2) Check for clinical meaning error of codes and modify if necessary: If your first check has passed 
        a) Keeping in mind  **clear and valid clinical match**: if the vocabulary in question is likely not able to successfully capture the specificity of the original medical concept, return no output
        b) If the code provided is not quite correct, but with a high degree of certainty a better alternative exists, you can modify and correct the output for [CODE_VALUE] and the official [HUMAN_READABLE_DESCRIPTION] that goes with that code, with the goal of obtaining an overall correct set of codes

    **Known limitations of previous agent**
    - Sometimes uses an invalid vocabulary which should be invalidated and no codes returned
    - Often for medications provides a totally incorrect RxNorm value. Efforts should be made to make sure the correct RxNorm of the active ingredient is provided
    ---
    
    ## Your Final Output format:
    Return only a well-formed JSON response, no prose or explanation.
    
    ```json
    {
      "entities": {
        "[ENTITY_NAME]": {
          "entity_name": "[ENTITY_NAME]",
          "types": "[ENTITY_TYPE]",
          "codes": [
            {
              "code": "[CODE_VALUE]",
              "system": "[CODE_SYSTEM]",
              "description": "[HUMAN_READABLE_DESCRIPTION]",
              "confidence": [0-100]
            }
          ]
        }
      }
    }


    The initial concept name is: """
    prompt = prompt + concept_name

    prompt += """
    And the previous agents output is:
    
    """

    prompt += initial_model_json
    return prompt

    
def flatten_entity_to_df(response_json):
    '''Flatten the LLM output JSON into a long-form DataFrame with one row per (entity, code) pair.
    Entities with an empty 'codes' list produce no rows.
    '''
    
        
    # Prepare a list to hold all the records
    records = []
    
    # Iterate through each entity in the 'entities' dictionary
    for entity_key, entity_value in response_json['entities'].items():
        # Extract common entity details
        entity_name = entity_value['entity_name']
        entity_types = entity_value['types']
    
        # Iterate through each code within the current entity
        for code_entry in entity_value['codes']:
            # Create a dictionary for the current record
            record = {
                'entity_name': entity_name,
                'types': entity_types,
                'code': code_entry.get('code'),  # Use .get() for safer access
                'system': code_entry.get('system'),
                'description': code_entry.get('description'),
                'confidence': code_entry.get('confidence')
            }
            records.append(record)
    
    # Create the DataFrame from the list of records
    df_long_form_entities = pd.DataFrame(records)
    return df_long_form_entities



def get_completion(prompt, model="gpt-4-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
    )
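    content = response.choices[0].message.content
    # Fail loudly if the model returned no text, rather than passing None downstream
    if content is None:
        raise ValueError("Model returned an empty response")
    return content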

def strip_markdown_fences(text):
    # Remove triple backticks and optional "json" label around the JSON block
    return re.sub(r"^```json\s*|```$", "", text.strip(), flags=re.MULTILINE)
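
The runs below call `json.loads` directly on the model output, so one malformed response would stop the whole loop. A more defensive parse might look like the sketch below. `parse_llm_json` is a hypothetical helper, not something used in this notebook, and treating unparseable output as an abstention is my assumption about the preferred behaviour, in line with the "false positives are worse" stance.

```python
import json
import re

def parse_llm_json(text):
    """Strip optional markdown fences and parse; fall back to an empty result on bad JSON."""
    cleaned = re.sub(r"^```(?:json)?\s*|```\s*$", "", text.strip(), flags=re.MULTILINE)
    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError:
        # Unparseable output is treated as "no codes" (an assumption, see lead-in)
        return {"entities": {}}
    # Anything without an "entities" dict is treated as an abstention as well
    if not isinstance(parsed, dict) or not isinstance(parsed.get("entities"), dict):
        return {"entities": {}}
    return parsed

fenced = parse_llm_json('```json\n{"entities": {}}\n```')
refusal = parse_llm_json("Sorry, I cannot code this concept.")
```

Logging the raw text whenever the fallback fires would keep these cases visible instead of silently counting them as abstentions.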

In [24]:
model="gpt-4-turbo"

# we will pre clean before adding to any of the below
# that way we can make dataframes from initial and post review to compare
all_raw_results = []
all_modified_results = []
all_final_results = []
# added in lookup injection into prompt for drug codes before the reviewer
# reviewer allows us to 
qa_lookups = {}
for term in terms:
    prompt = create_improved_entity_prompt(term)
    response_json = get_completion(prompt, model=model)  # get raw JSON/dict

    ent_clean = json.loads(strip_markdown_fences(response_json))
    all_raw_results.append(ent_clean)
    # if non zero, pass to our reviewer
    positive_prediction = len(ent_clean['entities'].items())

    if positive_prediction == 0:
        # do not review, add output
        all_final_results.append(ent_clean)
    else:
        # send to reviewer

        # Build a lookup string only if there's a medication
        lookup_str = ""
        all_types = [ent_data["types"]
                     for ent_data in ent_clean["entities"].values()]
        codes_found = []  # accumulates lookup hits across all predicted descriptions
        if "medication" in all_types:   # exact match on the first prompt's "medication" type
            all_descriptions = [
                c["description"]
                for ent_data in ent_clean["entities"].values()
                for c in ent_data["codes"]
            ]
            for description in all_descriptions:
                # the lookup always returns a list (possibly empty), so extend is safe
                codes_found.extend(get_possible_ingredient_code_lookup(description.lower()))
            if len(codes_found) > 0:
                # consider telling the reviewer prompt to expect this block when present
                lookup_str = "Database tool results (if any) as supporting evidence to help your response:\n"
                lookup_str += "\n".join(f"  {code}" for code in codes_found) + "\n"


        # Now pass the original ent_clean *and* the lookup_str into your reviewer prompt
        # prompt in this case will be lookup_str + JSON
        if len(codes_found)>0:
            # insert a blank line between lookup_str and the JSON dump
            prompt_for_reviewer = lookup_str + "\n" + json.dumps(ent_clean)
            qa_lookups[term] = lookup_str
        else:
            prompt_for_reviewer = json.dumps(ent_clean)
            
        review_prompt = create_reviewer_entity_result_prompt(term, prompt_for_reviewer)
        review_response_json = get_completion(review_prompt, model=model)  # get raw JSON/dict
        review_ent_clean = json.loads(strip_markdown_fences(review_response_json))
        # add review to both of these
        all_modified_results.append(review_ent_clean)
        all_final_results.append(review_ent_clean)
        
    

In [28]:
def get_single_response(concept_text):
    prompt = create_improved_entity_prompt(concept_text)
    response_json = get_completion(prompt, model=model)  # get raw JSON/dict

    ent_clean = json.loads(strip_markdown_fences(response_json))
    # if non zero, pass to our reviewer
    positive_prediction = len(ent_clean['entities'].items())

    if positive_prediction == 0:
        # do not review, add output
        return "No results"
    else:
        # send to reviewer

        # Build a lookup string only if there's a medication
        lookup_str = ""
        all_types = [ent_data["types"]
                     for ent_data in ent_clean["entities"].values()]
        codes_found = []  # accumulates lookup hits across all predicted descriptions
        if "medication" in all_types:   # exact match on the first prompt's "medication" type
            all_descriptions = [
                c["description"]
                for ent_data in ent_clean["entities"].values()
                for c in ent_data["codes"]
            ]
            for description in all_descriptions:
                # the lookup always returns a list (possibly empty), so extend is safe
                codes_found.extend(get_possible_ingredient_code_lookup(description.lower()))
            if len(codes_found) > 0:
                # consider telling the reviewer prompt to expect this block when present
                lookup_str = "Database tool results (if any) as supporting evidence to help your response:\n"
                lookup_str += "\n".join(f"  {code}" for code in codes_found) + "\n"


        # Now pass the original ent_clean *and* the lookup_str into your reviewer prompt
        # prompt in this case will be lookup_str + JSON
        if len(codes_found)>0:
            # insert a blank line between lookup_str and the JSON dump
            prompt_for_reviewer = lookup_str + "\n" + json.dumps(ent_clean)
        else:
            prompt_for_reviewer = json.dumps(ent_clean)
            
        # use the function argument, not the global `term` left over from the loop above
        review_prompt = create_reviewer_entity_result_prompt(concept_text, prompt_for_reviewer)
        review_response_json = get_completion(review_prompt, model=model)  # get raw JSON/dict
        review_ent_clean = json.loads(strip_markdown_fences(review_response_json))
        return review_ent_clean

In [34]:
get_single_response("Relapse Multiple Myeloma")

{'entities': {}}

In [33]:
get_single_response("Relapsed Multiple Myeloma")

{'entities': {'Relapsed Multiple Myeloma': {'entity_name': 'Relapsed Multiple Myeloma',
   'types': 'diagnosis',
   'codes': [{'code': 'C90.00',
     'system': 'ICD-10',
     'description': 'Multiple myeloma not having achieved remission',
     'confidence': 90}]}}}

In [32]:
get_single_response("Multiple Myeloma in Relapse")

{'entities': {'Multiple Myeloma in Relapse': {'entity_name': 'Multiple Myeloma in Relapse',
   'types': 'diagnosis',
   'codes': [{'code': 'C90.02',
     'system': 'ICD-10-CM',
     'description': 'Multiple myeloma in relapse',
     'confidence': 100}]}}}
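
The three phrasings above produce three different answers (no code, C90.00, C90.02), so the pipeline is sensitive to wording. A small harness for checking this is sketched below. `codes_by_phrasing` and `is_consistent` are hypothetical helpers, and the stub replays the responses recorded above rather than calling the API. It assumes the coding function always returns a dict, which `get_single_response` does not do when it returns "No results".

```python
def codes_by_phrasing(phrasings, code_fn):
    """Map each phrasing to the set of (system, code) pairs returned for it."""
    results = {}
    for phrase in phrasings:
        response = code_fn(phrase)
        pairs = set()
        for entity in response.get("entities", {}).values():
            for code in entity.get("codes", []):
                pairs.add((code.get("system"), code.get("code")))
        results[phrase] = pairs
    return results

def is_consistent(results):
    """True when every phrasing produced the same set of codes."""
    distinct = {frozenset(pairs) for pairs in results.values()}
    return len(distinct) <= 1

# Stub that replays the three responses recorded above instead of calling the API.
_RECORDED = {
    "Relapse Multiple Myeloma": {"entities": {}},
    "Relapsed Multiple Myeloma": {"entities": {"Relapsed Multiple Myeloma": {
        "codes": [{"code": "C90.00", "system": "ICD-10"}]}}},
    "Multiple Myeloma in Relapse": {"entities": {"Multiple Myeloma in Relapse": {
        "codes": [{"code": "C90.02", "system": "ICD-10-CM"}]}}},
}

results = codes_by_phrasing(list(_RECORDED), _RECORDED.get)
consistent = is_consistent(results)
```

Running a handful of paraphrases per validation concept and reporting the share that are consistent could be one more robustness metric for the discussion section.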

In [25]:
qa_lookups

{'Oseltamivir': 'Database tool results (if any) as supporting evidence to help your response:\n  260101 coded for ingredient oseltamivir\n',
 'Lurbinectedin': 'Database tool results (if any) as supporting evidence to help your response:\n  2374729 coded for ingredient lurbinectedin\n',
 'eptifibatide': 'Database tool results (if any) as supporting evidence to help your response:\n  75635 coded for ingredient eptifibatide\n',
 'Amoxicillin': 'Database tool results (if any) as supporting evidence to help your response:\n  723 coded for ingredient amoxicillin\n',
 'Alvimopan': 'Database tool results (if any) as supporting evidence to help your response:\n  480639 coded for ingredient alvimopan\n',
 'Dacarbazine': 'Database tool results (if any) as supporting evidence to help your response:\n  3098 coded for ingredient dacarbazine\n'}
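
These lookups are only passed to the reviewer as supporting text, and the reviewer can still ignore them: the final results table reports 204963 for eptifibatide although the lookup returned 75635. A deterministic check after the reviewer is one option, sketched below. `parse_lookup_codes` and `reconcile_rxnorm` are hypothetical helpers, and dropping (rather than replacing) a contradicted code is my assumption, chosen to stay on the side of fewer false positives.

```python
def parse_lookup_codes(lookup_lines):
    """Extract the leading code from lines like '75635 coded for ingredient eptifibatide'."""
    return {line.split(" coded for ingredient ")[0].strip() for line in lookup_lines}

def reconcile_rxnorm(entity, lookup_lines):
    """Drop RxNorm codes that contradict a non-empty database lookup; keep everything else."""
    allowed = parse_lookup_codes(lookup_lines)
    if not allowed:
        return entity  # no database evidence either way, so keep the model output
    kept = [
        c for c in entity["codes"]
        if c.get("system") != "RxNorm" or str(c.get("code")) in allowed
    ]
    return {**entity, "codes": kept}

# Values copied from this notebook's outputs for eptifibatide
eptifibatide = {
    "entity_name": "eptifibatide",
    "types": "medication",
    "codes": [{"code": "204963", "system": "RxNorm", "description": "Eptifibatide"}],
}
checked = reconcile_rxnorm(eptifibatide, ["75635 coded for ingredient eptifibatide"])
```

Here `checked["codes"]` ends up empty, turning a wrong code into an abstention. Replacing the code with the looked-up one would recover the true positive, but that only seems safe when the lookup returns exactly one ingredient.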

In [193]:

df_full_final = pd.DataFrame()
df_full_raw = pd.DataFrame()

for ent in all_final_results:
    df_clean_final = flatten_entity_to_df(ent)
    df_full_final = pd.concat([df_full_final, df_clean_final])

df_full_final = df_full_final.reset_index(drop=True)


In [None]:
# My verbose prompt and reviewer appear to produce slightly fewer predictions

In [194]:
len( list( set(df_full_final['entity_name'].to_list()) & set(annotated_terms['entity_name'].to_list()) ) )

16

In [195]:
df_full_final

Unnamed: 0,entity_name,types,code,system,description,confidence
0,Liver Transplant Rejection,diagnosis,T86.42,ICD-10,Liver transplant rejection,95
1,Oseltamivir,medication,260101,RxNorm,Oseltamivir 75 mg oral capsule,100
2,Lurbinectedin,medication,2374729,RxNorm,Lurbinectedin,100
3,Wheezing,diagnosis,R06.2,ICD-10,Wheezing,100
4,eptifibatide,medication,204963,RxNorm,Eptifibatide,100
5,Amoxicillin,medication,723,RxNorm,Amoxicillin,100
6,Uric Acid (Urate),measurements/labs,3084-1,LOINC,Uric acid [Moles/volume] in Serum or Plasma,95
7,Uric Acid (Urate),measurements/labs,14959-1,LOINC,Uric acid [Mass/volume] in Serum or Plasma,95
8,Alvimopan,medication,480639,RxNorm,Alvimopan 12 MG Oral Capsule [Entereg],100
9,Splenomegaly,diagnosis,R16.1,ICD-10,"Splenomegaly, not elsewhere classified",95


In [196]:
df_pred = df_full_final[['entity_name','types','code','system','description','confidence']].copy() \
    .rename(columns={'types':'pred_type','code':'pred_code','system':'pred_vocab','description':'pred_text','confidence':'pred_confidence'})

# left join onto the full list of annotated terms, then mark pred_no


df_pred = annotated_terms[['entity_name']].drop_duplicates().merge(df_pred, on='entity_name', how='left')

In [197]:
df_pred['entity_name'].nunique()

20

In [198]:
df_pred['pred_no'] = df_pred['pred_type'].apply(lambda x: 1 if pd.isnull(x) else 0)
df_pred['pred_confidence'] = df_pred['pred_confidence'].fillna(0)

In [199]:
df_pred

Unnamed: 0,entity_name,pred_type,pred_code,pred_vocab,pred_text,pred_confidence,pred_no
0,Liver Transplant Rejection,diagnosis,T86.42,ICD-10,Liver transplant rejection,95.0,0
1,Oseltamivir,medication,260101,RxNorm,Oseltamivir 75 mg oral capsule,100.0,0
2,Lurbinectedin,medication,2374729,RxNorm,Lurbinectedin,100.0,0
3,Wheezing,diagnosis,R06.2,ICD-10,Wheezing,100.0,0
4,eptifibatide,medication,204963,RxNorm,Eptifibatide,100.0,0
5,Amoxicillin,medication,723,RxNorm,Amoxicillin,100.0,0
6,green nails,,,,,0.0,1
7,"Dilation of hypoglossal nerve, open approach",,,,,0.0,1
8,Uric Acid (Urate),measurements/labs,3084-1,LOINC,Uric acid [Moles/volume] in Serum or Plasma,95.0,0
9,Uric Acid (Urate),measurements/labs,14959-1,LOINC,Uric acid [Mass/volume] in Serum or Plasma,95.0,0


In [200]:
annotated_terms['entity_name'].nunique()

20

In [201]:
truth_df = annotated_df[['entity_name','types','codes_pipe','vocabulary','text','should_say_no']] \
    .rename(columns={'types':'true_type','should_say_no':'true_no','codes_pipe':'true_code','vocabulary':'true_vocab','text':'true_text'})

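# the sheet is read with dtype=str, so convert the flag to an integer before comparing
truth_df['true_no'] = pd.to_numeric(truth_df['true_no']).fillna(0).astype(int)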


In [202]:
for col in truth_df.columns[:-1]:  # strip whitespace in every column except true_no
    truth_df[col] = truth_df[col].str.strip()
    

In [203]:
score_base = truth_df[['entity_name']].drop_duplicates().copy()

score_base['true_no'] = np.nan
score_base['pred_no'] = np.nan

score_base['entity_eval'] = ''
score_base['code_eval'] = ''
score_base['text_eval'] = ''

score_base['entity_pred_set'] = None
score_base['entity_true_set'] = None
score_base['code_pred_set'] = None
score_base['code_true_set'] = None
score_base['text_pred_set'] = None
score_base['text_true_set'] = None

for index, row in score_base.iterrows():
    # outputs we want are tallys of
    entity_name = row['entity_name']
    tmp_truth = truth_df[truth_df['entity_name'] == entity_name].copy()
    tmp_pred = df_pred[df_pred['entity_name'] == entity_name].copy()

    # if this is 1 then I reviewed it and decided no code in the allowed vocabularies could represent the concept
    # if this is 0 then a code is possible
    true_negative = int(tmp_truth['true_no'].max())
    # if this is 1 then the model decline to answer with a possible definition for the vocabularies
    # if this is 0 then it answered
    pred_negative = int(tmp_pred['pred_no'].max())
    
    # create for easy set math
    if true_negative == 1:
        true_entity_set = set()
        true_code_set = set()
        true_text_set =  set()
    else:
        true_entity_set = set(tmp_truth['true_type'].to_list())
        true_code_set = set(tmp_truth['true_code'].to_list())
        true_text_set = set(tmp_truth['true_text'].to_list())
    if pred_negative == 1:
        pred_entity_set = set()
        pred_code_set = set()
        pred_text_set = set()

    else:
        pred_entity_set = set(tmp_pred['pred_type'].to_list())
        pred_code_set = set(tmp_pred['pred_code'].to_list())
        pred_text_set = set(tmp_pred['pred_text'].to_list())


    pred_text_set = set([c.lower() for c in list(pred_text_set)])
    true_text_set = set([c.lower() for c in list(true_text_set)])

    if len(true_entity_set) == 0 and len(pred_entity_set) == 0:
        entity_res = 'tn' 
    elif true_entity_set == pred_entity_set:
        entity_res = 'tp'
    elif len(true_entity_set) > 0 and len(pred_entity_set) == 0:
        entity_res = 'fn' 
    else:
        entity_res = 'fp' 
    
    if len(true_code_set) == 0 and len(pred_code_set) == 0:
        code_res = 'tn' 
    elif true_code_set == pred_code_set:
        code_res = 'tp'
    elif len(true_code_set) > 0 and len(pred_code_set) == 0:
        code_res = 'fn' 
    else:
        code_res = 'fp' 

    if len(true_text_set) == 0 and len(pred_text_set) == 0:
        text_res = 'tn' 
    elif true_text_set == pred_text_set:
        text_res = 'tp'
    elif len(true_text_set) > 0 and len(pred_text_set) == 0:
        text_res = 'fn' 
    else:
        text_res = 'fp' 
        

    score_base.loc[index,'true_no'] = true_negative
    score_base.loc[index,'pred_no'] = pred_negative

    score_base.loc[index,'entity_eval'] = entity_res
    score_base.loc[index,'code_eval'] = code_res
    score_base.loc[index,'text_eval'] = text_res
    
    score_base.at[index,'entity_pred_set'] = pred_entity_set
    score_base.at[index,'entity_true_set'] = true_entity_set
    
    
    score_base.at[index,'code_pred_set'] = pred_code_set
    score_base.at[index,'code_true_set'] = true_code_set

    score_base.at[index,'text_pred_set'] = pred_text_set
    score_base.at[index,'text_true_set'] = true_text_set

In [204]:
score_base

Unnamed: 0,entity_name,true_no,pred_no,entity_eval,code_eval,text_eval,entity_pred_set,entity_true_set,code_pred_set,code_true_set,text_pred_set,text_true_set
0,Liver Transplant Rejection,0.0,0.0,tp,fp,tp,{diagnosis},{diagnosis},{T86.42},{T86.41},{liver transplant rejection},{liver transplant rejection}
1,Oseltamivir,0.0,0.0,tp,tp,fp,{medication},{medication},{260101},{260101},{oseltamivir 75 mg oral capsule},{oseltamivir}
2,Lurbinectedin,0.0,0.0,tp,tp,tp,{medication},{medication},{2374729},{2374729},{lurbinectedin},{lurbinectedin}
3,Wheezing,0.0,0.0,tp,tp,tp,{diagnosis},{diagnosis},{R06.2},{R06.2},{wheezing},{wheezing}
4,eptifibatide,0.0,0.0,tp,fp,tp,{medication},{medication},{204963},{75635},{eptifibatide},{eptifibatide}
5,Amoxicillin,0.0,0.0,tp,tp,tp,{medication},{medication},{723},{723},{amoxicillin},{amoxicillin}
6,green nails,1.0,1.0,tn,tn,tn,{},{},{},{},{},{}
7,"Dilation of hypoglossal nerve, open approach",1.0,1.0,tn,tn,tn,{},{},{},{},{},{}
10,Uric Acid (Urate),0.0,0.0,tp,fp,fp,{measurements/labs},{measurements/labs},"{3084-1, 14959-1}","{2916-2, 3084-4, 3087-7, 3084-1}","{uric acid [mass/volume] in serum or plasma, u...","{uric acid [mass/volume] in serum or plasma, u..."
14,Alvimopan,0.0,0.0,tp,tp,fp,{medication},{medication},{480639},{480639},{alvimopan 12 mg oral capsule [entereg]},{alvimopan}
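
The scoring above is all-or-nothing per concept: Uric Acid (Urate) is marked `fp` even though one of its two predicted codes is in the truth set. A per-code view of the same sets is sketched below. `code_overlap` is a hypothetical helper, not part of the evaluation used for the reported metrics.

```python
def code_overlap(pred_set, true_set):
    """Per-code counts and rates for one concept, based on set overlap."""
    tp = len(pred_set & true_set)
    return {
        "tp": tp,
        "fp": len(pred_set - true_set),
        "fn": len(true_set - pred_set),
        "precision": tp / len(pred_set) if pred_set else 0.0,
        "recall": tp / len(true_set) if true_set else 0.0,
    }

# Code sets copied from the Uric Acid (Urate) row above
uric = code_overlap(
    {"3084-1", "14959-1"},
    {"2916-2", "3084-4", "3087-7", "3084-1"},
)
```

Summing these per-code counts over all concepts would give micro-averaged precision and recall, which would probably be a fairer picture for concepts that legitimately map to several codes.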


In [206]:
# just check to see 
for key, value in qa_lookups.items():
    print(f"{key} {value}")

Liver Transplant Rejection 
Oseltamivir Database tool results (if any) as supporting evidence to help your response:
  260101 coded for ingredient oseltamivir

Lurbinectedin Database tool results (if any) as supporting evidence to help your response:
  2374729 coded for ingredient lurbinectedin

Wheezing 
eptifibatide Database tool results (if any) as supporting evidence to help your response:
  75635 coded for ingredient eptifibatide

Amoxicillin Database tool results (if any) as supporting evidence to help your response:
  723 coded for ingredient amoxicillin

Dilation of hypoglossal nerve, open approach 
Uric Acid (Urate) 
Alvimopan Database tool results (if any) as supporting evidence to help your response:
  480639 coded for ingredient alvimopan

Splenomegaly 
Covid-19 
Dacarbazine Database tool results (if any) as supporting evidence to help your response:
  3098 coded for ingredient dacarbazine

Methylxanthine 
Alcohol induced pancreatitis 
Abdominal Imaging (CT or MRI) 
H2 Rece

In [19]:
def breakdown_true_entity_type(x):
    try:
        if x == set():
            return 'abstain'
        elif len(x) > 1:
            return 'multiple'
        else:
            return list(x)[0]
    except:
        print(x)
        raise
score_base['true_entity_breakdown'] = score_base['entity_true_set'].apply(breakdown_true_entity_type)

In [38]:
score_base['true_code_breakdown'] = score_base['code_true_set'].apply(breakdown_true_entity_type)

Coding medications WAS our biggest error. 


In [209]:
score_base[['true_entity_breakdown','code_eval']].value_counts(normalize=True)

true_entity_breakdown  code_eval
medication             tp           0.25
abstain                tn           0.15
diagnosis              tp           0.15
abstain                fp           0.10
diagnosis              fp           0.10
drug_class             tp           0.05
measurements/labs      fp           0.05
medication             fn           0.05
                       fp           0.05
procedure              tp           0.05
Name: proportion, dtype: float64
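The proportions above are shares of all 20 evaluated terms, so a frequent type such as medication can look large simply because it has more rows. A minimal sketch of a within-type view follows, using counts reconstructed from the output above (each 0.05 is one term). The helper name `within_type_rates` is hypothetical; in the notebook itself, `pd.crosstab(score_base['true_entity_breakdown'], score_base['code_eval'], normalize='index')` should give an equivalent table.

```python
from collections import Counter, defaultdict

# Counts reconstructed from the value_counts output above (proportion * 20 terms).
pair_counts = {
    ("medication", "tp"): 5,
    ("abstain", "tn"): 3,
    ("diagnosis", "tp"): 3,
    ("abstain", "fp"): 2,
    ("diagnosis", "fp"): 2,
    ("drug_class", "tp"): 1,
    ("measurements/labs", "fp"): 1,
    ("medication", "fn"): 1,
    ("medication", "fp"): 1,
    ("procedure", "tp"): 1,
}

def within_type_rates(counts):
    """Return {entity_type: {outcome: share of that type's terms}}."""
    totals = Counter()
    for (etype, _outcome), n in counts.items():
        totals[etype] += n
    rates = defaultdict(dict)
    for (etype, outcome), n in counts.items():
        rates[etype][outcome] = round(n / totals[etype], 3)
    return dict(rates)

rates = within_type_rates(pair_counts)
for etype, outcomes in sorted(rates.items()):
    print(etype, outcomes)
```

Read this way, medication codes are correct for 5 of 7 terms and diagnosis codes for 3 of 5, though with so few terms per type these differences are not conclusive.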

In [27]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Precision-first scoring: this does not capture coverage (recall) as well,
# but it is more straightforward
def get_eval_metrics(df, column):
    y_true = df[column].apply(lambda x: 1 if x in ['tp','fn'] else 0)
    y_pred = df[column].apply(lambda x: 1 if x in ['tp','fp'] else 0)

    precision = precision_score(y_true, y_pred, zero_division=0)
    f1        = f1_score(y_true, y_pred, zero_division=0)
    counts    = df[column].value_counts().to_dict()

    return {
        'counts':    counts,
        'precision': round(precision, 3),
        'f1':        round(f1,        3)
    }



entity_metrics = get_eval_metrics(score_base, 'entity_eval')
code_metrics = get_eval_metrics(score_base, 'code_eval')
text_metrics = get_eval_metrics(score_base, 'text_eval')

print("Entity Evaluation Metrics:", entity_metrics)
print()
print("Code Evaluation Metrics:", code_metrics)
print()
print("Text Evaluation Metrics:", text_metrics)



Entity Evaluation Metrics: {'counts': {'tp': 14, 'tn': 3, 'fp': 2, 'fn': 1}, 'precision': 0.875, 'f1': 0.903}

Code Evaluation Metrics: {'counts': {'tp': 10, 'fp': 6, 'tn': 3, 'fn': 1}, 'precision': 0.625, 'f1': 0.741}

Text Evaluation Metrics: {'counts': {'fp': 10, 'tp': 6, 'tn': 3, 'fn': 1}, 'precision': 0.375, 'f1': 0.522}


In [28]:
from sklearn.metrics import precision_score, f1_score

def make_entity_code_label(row):
    ent = row['entity_eval']  # 'tp', 'fp', 'fn', or 'tn'
    code = row['code_eval']   # 'tp', 'fp', 'fn', or 'tn'

    #True Negative
    if ent == 'tn' and code == 'tn':
        return 'tn'

    #True Positive
    if ent != 'tn' and code == 'tp':
        return 'tp'

    #False Negative 
    if ent != 'tn' and code == 'tn':
        return 'fn'

    # Everything else (wrong or incomplete codes, including code_eval == 'fn') counts as a false positive
    return 'fp'

score_base['entity_code_eval'] = score_base.apply(make_entity_code_label, axis=1)

# Same binarization as in get_eval_metrics
y_true = score_base['entity_code_eval'].map(lambda x: 1 if x in ['tp','fn'] else 0)
y_pred = score_base['entity_code_eval'].map(lambda x: 1 if x in ['tp','fp'] else 0)

precision = precision_score(y_true, y_pred, zero_division=0)
f1        = f1_score(       y_true, y_pred, zero_division=0)
counts    = score_base['entity_code_eval'].value_counts().to_dict()

print("Entity+Code Precision-first Metrics:", {
    'counts':    counts,
    'precision': round(precision, 3),
    'f1':        round(f1,        3)
})


Entity+Code Precision-first Metrics: {'counts': {'tp': 10, 'fp': 7, 'tn': 3}, 'precision': 0.588, 'f1': 0.741}
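The cells above report precision and F1 only. As a cross-check, the sketch below recomputes them directly from the printed counts and adds recall. The helper name `metrics_from_counts` is hypothetical and not part of the notebook.

```python
def metrics_from_counts(counts):
    """Compute precision, recall and F1 from a {'tp', 'fp', 'fn', 'tn'} count dict."""
    tp = counts.get("tp", 0)
    fp = counts.get("fp", 0)
    fn = counts.get("fn", 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": round(precision, 3),
        "recall": round(recall, 3),
        "f1": round(f1, 3),
    }

# Counts copied from the printed outputs above.
code_counts = {"tp": 10, "fp": 6, "tn": 3, "fn": 1}
entity_code_counts = {"tp": 10, "fp": 7, "tn": 3}

code_result = metrics_from_counts(code_counts)
entity_code_result = metrics_from_counts(entity_code_counts)
print(code_result)
print(entity_code_result)
```

The combined label has no `fn` rows, because a row with `code_eval == 'fn'` falls through to `'fp'` in `make_entity_code_label`. Recall is therefore 1.0 for the combined metric, which is why F1 stays at 0.741 even though precision drops to 0.588.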


In [22]:
score_base

Unnamed: 0,entity_name,true_no,pred_no,entity_eval,code_eval,text_eval,entity_pred_set,entity_true_set,code_pred_set,code_true_set,text_pred_set,text_true_set,true_entity_breakdown,jaccard
0,Liver Transplant Rejection,0,0,tp,fp,tp,{'diagnosis'},{'diagnosis'},{'T86.42'},{'T86.41'},{'liver transplant rejection'},{'liver transplant rejection'},diagnosis,0.0
1,Oseltamivir,0,0,tp,tp,fp,{'medication'},{'medication'},{'260101'},{'260101'},{'oseltamivir 75 mg oral capsule'},{'oseltamivir'},medication,1.0
2,Lurbinectedin,0,0,tp,tp,tp,{'medication'},{'medication'},{'2374729'},{'2374729'},{'lurbinectedin'},{'lurbinectedin'},medication,1.0
3,Wheezing,0,0,tp,tp,tp,{'diagnosis'},{'diagnosis'},{'R06.2'},{'R06.2'},{'wheezing'},{'wheezing'},diagnosis,1.0
4,eptifibatide,0,0,tp,fp,tp,{'medication'},{'medication'},{'204963'},{'75635'},{'eptifibatide'},{'eptifibatide'},medication,0.0
5,Amoxicillin,0,0,tp,tp,tp,{'medication'},{'medication'},{'723'},{'723'},{'amoxicillin'},{'amoxicillin'},medication,1.0
6,green nails,1,1,tn,tn,tn,set(),set(),set(),set(),set(),set(),abstain,1.0
7,"Dilation of hypoglossal nerve, open approach",1,1,tn,tn,tn,set(),set(),set(),set(),set(),set(),abstain,1.0
8,Uric Acid (Urate),0,0,tp,fp,fp,{'measurements/labs'},{'measurements/labs'},"{'3084-1', '14959-1'}","{'2916-2', '3084-4', '3087-7', '3084-1'}","{'uric acid [mass/volume] in serum or plasma',...","{'uric acid [mass/volume] in serum or plasma',...",measurements/labs,0.2
9,Alvimopan,0,0,tp,tp,fp,{'medication'},{'medication'},{'480639'},{'480639'},{'alvimopan 12 mg oral capsule [entereg]'},{'alvimopan'},medication,1.0


In [36]:
def compute_jaccard(row):
    pred = row['code_pred_set']
    truth = row['code_true_set']
    if isinstance(pred, set) and isinstance(truth, set):
        union = pred | truth
        intersection = pred & truth
        return len(intersection) / len(union) if union else 1.0
    return 0

score_base['jaccard'] = score_base.apply(compute_jaccard, axis=1)

# Not correct yet: the sets were converted to strings at the end; needs fixing for presentation
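If the set columns have been serialized to strings (for example after an Excel round trip), `compute_jaccard` returns 0 for every row because the `isinstance(..., set)` check fails. One possible fix, sketched under the assumption that the strings look like the values displayed in `score_base` (`"{'T86.42'}"` or `"set()"`), is to parse them back before scoring. The helper names `parse_set` and `jaccard` are hypothetical.

```python
import ast

def parse_set(value):
    """Turn a string such as "{'T86.42'}" or "set()" back into a Python set."""
    if isinstance(value, set):
        return value
    text = str(value).strip()
    if text in ("set()", "", "nan"):
        return set()
    # literal_eval safely parses set displays without executing arbitrary code
    return set(ast.literal_eval(text))

def jaccard(pred, truth):
    """Jaccard similarity of two code sets; two empty sets count as a perfect match."""
    pred, truth = parse_set(pred), parse_set(truth)
    union = pred | truth
    return len(pred & truth) / len(union) if union else 1.0

# Uric Acid (Urate) row from the table above, as strings
uric = jaccard("{'3084-1', '14959-1'}", "{'2916-2', '3084-4', '3087-7', '3084-1'}")
print(uric)
```

In the notebook this could be applied with `score_base.apply(lambda r: jaccard(r['code_pred_set'], r['code_true_set']), axis=1)`.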

In [114]:
df_pred.to_excel('/pipeline_datalake/clean_improve_model_2_predictions.xlsx',index=False)
score_base.to_excel('/pipeline_datalake/clean_improve_model_2_eval.xlsx',index=False)

### 5) Limitations and Discussions
List the limitations of your implementation and/or discussion topics regarding how to
make the service more robust for the future


#### Limitations:
- Sample bias: manual review took some time and yielded a small sample, so it is unclear how this would perform across the full set of codes
- Simple use case: the evaluation is much more controlled than real users querying (and getting frustrated) in the moment
- I probably should have spent more time forcing negatives. I would have liked to see more false negatives, which are actually a good thing in our case
- If, in production, getting one or two codes of a set is still very useful, then I could have emphasized partial-match accuracy. My measure of precision, which counts anything incorrect where the model made a decision as an FP, was quite strict. It would have been good to dive deeper and find a better way of incorporating coverage (FNs)
- I set some very strict requirements after annotating, and since I am not a medical researcher (or doctor), this may have led to some confusion
- I only used OpenAI, but Claude is apparently better at holding back when it doesn't know for sure
- There are also medically trained models I could have used
- I also did not focus on text too much, I took the route that
- I also didn't focus much on the model's confidence because I didn't have a lot of confidence in it. This may relate to the fact that I used OpenAI
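The partial-match idea from the list above could be sketched as a per-term precision and recall over code sets, so that returning one or two correct codes of a larger set earns partial credit. This is an illustration rather than the scoring used in this notebook; the function name `set_precision_recall` and the choice to return `None` for undefined values are assumptions.

```python
def set_precision_recall(pred, truth):
    """Per-term precision and recall over code sets.

    Precision is None when nothing was predicted, and recall is None when
    there is nothing to find, so abstentions can be reported separately
    instead of being scored as 0 or 1.
    """
    overlap = len(pred & truth)
    precision = overlap / len(pred) if pred else None
    recall = overlap / len(truth) if truth else None
    return precision, recall

# Uric Acid (Urate) row from the evaluation table
pred = {"3084-1", "14959-1"}
truth = {"2916-2", "3084-4", "3087-7", "3084-1"}
p, r = set_precision_recall(pred, truth)
print(p, r)
```

Under the strict scheme this row is a plain FP, while here it shows that half of the returned codes were correct and a quarter of the expected codes were found.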

#### Main Discussion points:
- It was very interesting to decide how to classify FP vs. FN. I chose a more straightforward route, but it was important to supplement it with at least one set metric. I focused on codes for this because a case where all the codes are correct but the entity is wrong seems unlikely and of minimal impact
- A topic I would like to bring up is coding vs. clinical concept understanding. In other words, if I said NSCLC and the model returned the generic ICD-10 code for malignant neoplasm of the lung, would that be a true positive for the use case? My feeling is that you would want to separate two core agentic items:
    1) An agent that knows what the person is talking about at a conceptual, real-world level
    2) An agent that knows the best way to find a cohort in a dataset with that medical disease, for example, given the constraints of the vocabulary system it has to work with
- My approach chose to focus on 1); SNOMED would have been a better choice in that case
- The model really is terrible at
- Rare disease and regimen use cases are particularly interesting

#### A few big takeaways:
- I believe the separation between clinical concept understanding and medical coding for a use case is key, thus my emphasis
- At least in my case, the LLM is really bad at finding codes for vocabularies that have no semantic undertones, like RxNorm. These lookups really have to be delegated to tools, and the LLM can be used to intelligently put those tools together
- In order to achieve strong
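The tool-lookup takeaway above can be illustrated with a minimal sketch: the code comes from an exact-match lookup, and the LLM only receives it as supporting evidence. The in-memory `ROWS` list is a hypothetical stand-in for the `mrconso` collection, with field names following the UMLS MRCONSO columns (`SAB`, `TTY`, `CODE`, `STR`) and codes taken from the evaluation table. The functions `lookup_ingredient` and `build_evidence` are illustrative names, not the notebook's actual tool.

```python
# Hypothetical stand-in for a few RxNorm ingredient rows from MRCONSO.
ROWS = [
    {"SAB": "RXNORM", "TTY": "IN", "CODE": "260101", "STR": "oseltamivir"},
    {"SAB": "RXNORM", "TTY": "IN", "CODE": "723", "STR": "amoxicillin"},
    {"SAB": "RXNORM", "TTY": "IN", "CODE": "75635", "STR": "eptifibatide"},
]

EVIDENCE_HEADER = (
    "Database tool results (if any) as supporting evidence to help your response:"
)

def lookup_ingredient(term, rows=ROWS):
    """Return RxNorm ingredient codes whose string matches the term exactly (case-insensitive)."""
    needle = term.strip().lower()
    return [
        row["CODE"]
        for row in rows
        if row["SAB"] == "RXNORM"
        and row["TTY"] == "IN"
        and row["STR"].lower() == needle
    ]

def build_evidence(term, rows=ROWS):
    """Format lookup hits as an evidence string for the prompt; empty if nothing matched."""
    codes = lookup_ingredient(term, rows)
    if not codes:
        return ""
    name = term.strip().lower()
    lines = [f"  {code} coded for ingredient {name}" for code in codes]
    return EVIDENCE_HEADER + "\n" + "\n".join(lines)

print(build_evidence("Oseltamivir"))
```

Returning an empty string when nothing matches keeps the prompt free of invented evidence, which fits the goal of letting the model abstain rather than guess.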

#### Future Ideas:
- I really want to understand more about how Atropos is thinking about entity understanding (clinical) vs. coding, also in the context of RxNorm
- Have a human in the loop give the model the understanding that aligns with company goals
- The use of SNOMED would be interesting and would help with that
- Maybe have some way of giving the model context about who it is talking to. Is this a researcher in stroke patients? If so, "Bleeding" may mean something specific in that context vs. others
- Semantic understanding of relationships between vocabularies is key for future development
