Before running the notebook, please review the settings cell below. It defines two flags:

`DRY_RUN`:
*   `True` - use a prepared validation set, do not run prediction
*   `False` - run prediction on a new validation dataset (prediction: ~12 sec/row)

`GDRIVE` (data: [GDrive Share](https://drive.google.com/drive/folders/1ZKX9aQNIsRchM8bJfkeaX-0JUGYjnTin?usp=sharing); add a shortcut to this folder to your Drive):
*   `True` - run in Colab and read the data from the shared Google Drive folder
*   `False` - run locally or without access to the shared Google Drive folder (please keep the same file structure so the notebook runs without changes)

Once you have confirmed the settings, click "**Run All**".

In [None]:
# notebook settings
DRY_RUN = True  # set to False to run prediction (takes a while)
GDRIVE = True  # set to False when running locally

if GDRIVE:
    from google.colab import drive
    drive.mount('/content/drive')

Mounted at /content/drive


### Requirements

In [None]:
!pip install -q msoffcrypto-tool
if not DRY_RUN:
    !pip install -qU bitsandbytes
    !pip install -qU transformers accelerate
    # !pip install datasets


### Imports

In [None]:
# read data
import os
from io import BytesIO
import msoffcrypto

# process data
import numpy as np
import pandas as pd
import re, json

# LLM
if not DRY_RUN:
    from transformers import AutoTokenizer, AutoModelForCausalLM
    import torch

# performance
from sklearn.metrics import (
    confusion_matrix,
    precision_score,
    recall_score,
    f1_score,
    fbeta_score,
    roc_auc_score
)

### Vars

In [None]:
# read data
DRIVE_PATH = r'/content/drive/MyDrive/MedicalRecords/' if GDRIVE else ''
PATH_TO_DATA = DRIVE_PATH + 'medical_records_dataset.xlsx'
PASSWORD = '---'
SHEETS = ['Harmonized_Labeled', 'Seperate_Labeled']
PATH_TO_DATA_HARMONIZED = DRIVE_PATH + 'data_harmonized.csv'
PATH_TO_DATA_SEPARATE = DRIVE_PATH + 'data_separate.csv'
PATH_TO_VAL_DATA = DRIVE_PATH + 'val_df.csv'

# process data
ALL_COLUMNS = ['note', 'stroke', 'diabetes', 'cancer', 'heart_attack']
SYMPTOMS_COLS = ALL_COLUMNS[1:]

# model
MODEL_NAME = "johnsnowlabs/JSL-MedLlama-3-8B-v2.0"

# check performance
PATH_TO_RESULT = DRIVE_PATH + 'Results/'
PATH_TO_RESULT_VAL = PATH_TO_RESULT + 'res_val.csv'
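
When running without the shared Drive, the file layout implied by the paths above can be checked before anything else runs. This is a minimal sketch, not part of the original pipeline: `missing_files` and `REQUIRED_FILES` are hypothetical names, and the list only covers the files read in the default `DRY_RUN` mode.

```python
import os

# Relative paths taken from the variables above; adjust if your layout differs.
REQUIRED_FILES = [
    "medical_records_dataset.xlsx",
    "val_df.csv",
    os.path.join("Results", "res_val.csv"),
]


def missing_files(base_dir: str, required=REQUIRED_FILES) -> list:
    """Return the required paths that do not exist under base_dir (hypothetical helper)."""
    return [p for p in required if not os.path.isfile(os.path.join(base_dir, p))]
```

Calling `missing_files(DRIVE_PATH)` should return an empty list when the structure matches; an empty `DRIVE_PATH` resolves to the current working directory.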

### Data analysis and processing

In [None]:
# decrypt
with open(PATH_TO_DATA, 'rb') as file:
    encrypted = msoffcrypto.OfficeFile(file)
    encrypted.load_key(password=PASSWORD)

    decrypted = BytesIO()
    encrypted.decrypt(decrypted)

In [None]:
# to pandas DataFrame
df_harmonized = pd.read_excel(decrypted, sheet_name=SHEETS[0])
df_separate = pd.read_excel(decrypted, sheet_name=SHEETS[1])
# uncomment below to read directly from csv files
# df_harmonized = pd.read_csv(PATH_TO_DATA_HARMONIZED)
# df_separate = pd.read_csv(PATH_TO_DATA_SEPARATE)

##### Check NaN values

In [None]:
# first, check for NaN values
df_harmonized.isna().sum(), df_separate.isna().sum()
## the first row of df_separate is NaN (in df_harmonized it is not NaN, but it is meaningless)
## the last column has 3 more NaN rows

(String            0
 False Positive    0
 dtype: int64,
 String                         1
 False Positive stroke          1
 False positive diabetes        1
 False positive Cancer          1
 False positive heart attack    4
 dtype: int64)

In [None]:
# find these rows and compare them with the df_harmonized labels
df_harmonized.loc[df_separate[df_separate['False positive heart attack'].isna()].index, :]
## the values are all 0, so it is safe to fill NaN with 0

Unnamed: 0,String,False Positive
0,Note_Original,0
333,Reason for referral: Pericardial fluid. Pres...,0
419,S: \n\nFew concerns:\n\n1) Needs document ...,0
774,S: \n\nPlesant 27yo female walk-in patient...,0


In [None]:
# prepare data
df_harmonized = df_harmonized.drop(0, axis=0)  # the first row is meaningless
df_harmonized['False Positive'] = df_harmonized['False Positive'].astype(int)
df_separate = df_separate.drop(0, axis=0)  # the first row is NaN
df_separate.columns = ALL_COLUMNS  # rename columns
df_separate.heart_attack = df_separate.heart_attack.fillna(0.0)  # replace NaN
df_separate[SYMPTOMS_COLS] = df_separate[SYMPTOMS_COLS].astype(int)  # from float to int


In [None]:
df = df_separate.copy()  # work on a copy under a shorter name

##### Class distribution

In [None]:
print("Distribution:")
print(SYMPTOMS_COLS)
df_separate[SYMPTOMS_COLS].value_counts().to_dict()

Distribution:
['stroke', 'diabetes', 'cancer', 'heart_attack']


{(0, 0, 0, 0): 792,
 (0, 0, 1, 0): 123,
 (0, 1, 0, 0): 55,
 (1, 0, 0, 0): 12,
 (0, 0, 0, 1): 7,
 (0, 1, 1, 0): 6,
 (1, 0, 0, 1): 2,
 (1, 0, 1, 0): 1}

In [None]:
print("Distribution per symptom:")
for col in SYMPTOMS_COLS:
    print(f"  {col}:", df_separate[col].value_counts().to_dict())

Distribution per symptom:
  stroke: {0: 983, 1: 15}
  diabetes: {0: 937, 1: 61}
  cancer: {0: 868, 1: 130}
  heart_attack: {0: 989, 1: 9}


**Conclusion:**

The dataset is highly imbalanced.

I assumed that a fine-tuned transformer classifier (like Bio_ClinicalBERT) might not generalize well on the minority classes. It would require a custom loss per label or some form of data augmentation.

That's why I stuck with an LLM solution.
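
As a rough illustration of what a per-label loss would involve (a sketch, not something used in this notebook): a common choice is to weight the positive class of each label by the ratio of negatives to positives, which is the kind of value passed as `pos_weight` to a weighted binary cross-entropy such as PyTorch's `BCEWithLogitsLoss`. The helper name `positive_weights` is hypothetical; the counts are copied from the output above.

```python
# Per-label counts copied from the distribution output above.
label_counts = {
    "stroke": {0: 983, 1: 15},
    "diabetes": {0: 937, 1: 61},
    "cancer": {0: 868, 1: 130},
    "heart_attack": {0: 989, 1: 9},
}


def positive_weights(counts: dict) -> dict:
    """Weight for the positive class of each label: negatives / positives."""
    return {label: c[0] / c[1] for label, c in counts.items()}


weights = positive_weights(label_counts)
for label, w in weights.items():
    print(f"{label}: {w:.1f}")
```

The two rarest labels (stroke and heart attack) end up with weights above 60, which shows how few positive examples a fine-tuned classifier would have to learn from.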


##### Validation data
This code ensures that every class is represented in the validation dataset.

In [None]:
# create a validation dataset with controlled distribution
def sample_select(df: pd.DataFrame, counts: dict,
                  total: int = 150, random_state: int = 42) -> pd.DataFrame:
    not_healthy_count = sum(counts.values())
    healthy_count = total - not_healthy_count  # all 0 classes

    dataset = []
    for disease, n in counts.items():
        disease_subset = df[df[disease] == 1]
        sampled = (
            disease_subset
            .sample(n=n, random_state=random_state)
            .reset_index(drop=True)
        )
        dataset.append(sampled)

    healthy_subset = df[(df[SYMPTOMS_COLS].sum(axis=1) == 0)]
    healthy_sampled = (
        healthy_subset
        .sample(n=healthy_count, random_state=random_state)
        .reset_index(drop=True)
    )
    dataset.append(healthy_sampled)

    result = (
        pd.concat(dataset, axis=0)
        .sample(frac=1, random_state=random_state)
        .reset_index(drop=True)
    )
    return result

In [None]:
if not DRY_RUN:
    val_df = sample_select(df,
                          counts={  # preselected number of classes in val dataset
                              'stroke': 8,
                              'diabetes': 30,
                              'cancer': 40,
                              'heart_attack': 5},
                          total=150)  # total len of val dataset
    val_df.to_csv(PATH_TO_VAL_DATA, index=False)  # same path the DRY_RUN branch reads from
else:
    val_df = pd.read_csv(PATH_TO_VAL_DATA)
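
Because a note can carry more than one label, sampling each condition independently can pick the same note twice, and the number of positives per condition can end up higher than requested (the validation metrics further down show 10 stroke positives for a requested 8). Below is a minimal pure-Python sketch of a duplicate-free variant. It is an assumption about how this could be handled, not the method used above; `sample_without_overlap` and the toy rows are hypothetical.

```python
import random


def sample_without_overlap(rows, counts, labels, total, seed=42):
    """Sample counts[label] positives per label, never reusing a row index."""
    rng = random.Random(seed)
    chosen = []
    taken = set()
    for label, n in counts.items():
        # candidate rows: positive for this label and not picked earlier
        pool = [i for i, r in enumerate(rows) if r[label] == 1 and i not in taken]
        picked = rng.sample(pool, n)
        chosen.extend(picked)
        taken.update(picked)
    # fill the remainder with rows that have no positive label at all
    healthy = [i for i, r in enumerate(rows)
               if all(r[l] == 0 for l in labels) and i not in taken]
    chosen.extend(rng.sample(healthy, total - len(chosen)))
    rng.shuffle(chosen)
    return [rows[i] for i in chosen]


# Toy data: two notes carry both labels.
labels = ["stroke", "heart_attack"]
rows = (
    [{"stroke": 1, "heart_attack": 1}] * 2
    + [{"stroke": 1, "heart_attack": 0}] * 3
    + [{"stroke": 0, "heart_attack": 1}] * 3
    + [{"stroke": 0, "heart_attack": 0}] * 10
)
sample = sample_without_overlap(rows, {"stroke": 3, "heart_attack": 2}, labels, total=10)
```

Even without duplicates, the requested counts are lower bounds: a note sampled for stroke that also mentions a heart attack still counts towards both labels.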

### LLM

#### Prompt
The prompt is created according to the problem statement and rules for identifying classes.

In [None]:
def create_prompt(symptom_description: str) -> str:
    return f"""You are a medical expert AI specializing in clinical symptom assessment. Given the following medical description, your task is to evaluate it and determine the presence of symptoms related to four specific conditions.

For each of the conditions below, respond with either:
- 1 (True) — if there is clear evidence such as a diagnosis, personal/family medical history, or confirmed test results.
- 0 (False) — if the description only mentions a recommendation for testing, medications without diagnosis, vague/uncertain language (e.g., "possible", "might have"), or early risk factors (e.g., prediabetes).

Conditions to evaluate:
1. Stroke
2. Diabetes
3. Cancer
4. Heart attack

Return your answer strictly in the following JSON format:
{{"stroke": 0 or 1, "diabetes": 0 or 1, "cancer": 0 or 1, "heart_attack": 0 or 1}}

Medical description:
\"\"\"{symptom_description}\"\"\"

Your JSON response:
"""
# Examples:
# 1. INPUT: 'S:     \n\nDuration of pain: 1 week\nLocation of pain: nape of neck\nAggravating/Alleviating factors: bending head forward\nPrevious imaging? none\nPain scale: 5\n\nRed Flags:\nAge >50 (especially if osteoporosis or compression factor) Yes\nAge > 70 No\nTrauma/Cumulative trauma No - but did carry heavy bags on one shoulder the day before this started\nUnexplained fever or hx of urinary or other infections No\nUnexplained weight loss No\nImmunosuppression or Diabetes No\nHistory of Cancer No\nIV drug use No\nProlonged use of corticosteroids, osteoporosis No\nFocal neurological deficit with progressive or disabling symptoms No\nBowel/bladder dysfunction No\nLeg symptoms No\nDuration longer than 6 weeks No\nPrior surgery No\n\nO:     \n\nInspection of back & posture: normal\nROM: FROM\nPalpation of vertebrae:  Tender\nPain when flexing neck and ear to shoulder bilaterally, especially around right trapezius\n\nA:       \n\nCervical Neck Strain\n\nP:     \n\nVimovo PRN. If not better in 10 days, will send for XR.\n\nMedication side effects were discussed in great detail\n\nActive Medications: \tVIMOVO 375 MG-20 MG TABLET\n\n\n\n'; OUTPUT: {{"stroke": 0, "diabetes": 0, "cancer": 0, "heart_attack": 0}}
# 2. INPUT: 'S:     \n\nPatient presents when being called in for positive colon cancer screening kit (1 out of 3 were positive for blood). He has never had a colonoscopy, and he is now 56yo. There is no history of colon cancer in the family. He does have a history of hemorrhoids. \n\nPatient became very emotional during the visit and confessed that he is very stressed out as of late. He states his son in the Phillipines, has been arrested for murder in self defense. He is worried about his son's wife and his grandson. He does not know what to do about the situation. He states his workplace is also planning on going on strike soon, and he is worried about the financial implications this may have. \n\n\nO:     \n\nGen: anxious\nPsych: no suicidal/homicidal ideation\n\n\nA:       \n\nAnxiety\nColon Cancer Screening - Positive\nDM2\n\n\nP:     \n\n1) We spoke in detail about the things he can do for himself and his son in this very difficult time. He states he is a Christian, and he is strong in his faith and knows he can get through this. I encouraged him to pray for his son and family, as well as for himself. I think this will help him tremendously. I have also asked him not to neglect his own health in this difficult time. Although I do not feel he requires medication at this time, should he have any problems with insomnia or intrusive thoughts in the near future, I encouraged him to return and share these feelings with me. He will try and relax as best as possible and figure out a way as to how he can help his son back home. \n\n2) Will send for colonoscopy. Explained the procedure to him in great detail. He is reluctant (because he is scared of the process and the possible outcome), but I reassured him this was a mere screening test and there is nothing for him to be worried about at the present time.\n\n3) I have switched his Metformin to Janumet XR. He is to take 2 tablet po qam 50/1000mg tablets). We will check his A1C at a future diabetes visit. 
# \n\nCounselling Time (for Anxiety only): 11:05-11:55am\n\n\nActive Medications: \tJANUMET XR 50-1,000 MG TABLET, Lancets  Miscellaneous Each, Orthotic footwear, Custom Orthotics, Metformin HCL 500 mg Oral Tablet, Indomethacin 25 mg Oral Capsule, CRESTOR 20 MG TABLET, Metformin HCL 500 mg Oral Tablet, Telmisartan 80 mg Oral Tablet, ACCU-CHEK AVIVA T/S\n\n\n\n\n; OUTPUT: {{"stroke": 0, "diabetes": 0, "cancer": 1, "heart_attack": 0}}
# 3. INPUT: 'S:     \n\nPresents for results review. Her labs were good with the exception of prediabetes. Her A1c is at 6.1%. She has Hepatitis C and on medications. She has a long family history of DM2.\n\n\nO:     \n\nGen: NAD\n\n\nA:       \n\nPrediabetes\n\n\nP:     \n\nSuggested she continue with her dietitian, and consider seeing Dr. Metabolic Weight Loss Clinic as she has in the past. She will speak to Dr., her PCP, about this.\n\n\nActive Medications: \tHYDROVAL 0.2 % CREAM, ZITHROMAX 250 MG TABLET, NORTRIPTYLINE 10 MG CAPSULE, STIEPROX 1.5 % SHAMPOO, LAMISIL 250 MG TABLET, KETODERM 2% CREAM, DICETEL 100 MG TABLET\n\n\n\n\n'; OUTPUT: {{"stroke": 0, "diabetes": 1, "cancer": 0, "heart_attack": 0}}
# 4. INPUT: 'S:     feeling unwell at the beginning of the week - achy, low energy, bed at 6:30 every night\nhad improved\ntoday woke with dizzy spells & ringing in ear \ncan make dizzy with turning\nprevious ringing in ear but no dizziness - had ear drops & pills - a while back (leftover ear infection)\nhearing is fine\ndizzy with movement of head\n\n\nO:     cerebellar tests normal - finger to nose, rapid alt mvmt\nnormal gait\nnormal strength & reflexes bilat\n\n\nA:       Labyrynthitis\n\n\nP:     serc\nrtc if no resolution\nto ER if stroke-like sx\n\n\nActive Medications: \tNone Recorded\n\n\n\n\n'; OUTPUT: {{"stroke": 1, "diabetes": 0, "cancer": 0, "heart_attack": 0}}
# 5. INPUT: 'S:     \n\nPatient presents with severe anxiety. She is shaking and crying during the visit. She was here less than 1 week ago and given Lorazepam. She states this worked very well, but she was taking 2mg twice daily as opposed to the way it was written - 1mg twice daily. She saw her Methadone physician Dr. who tried her on Clonazepam unsuccessfully. She has NEVER tried street drugs, and is taking Methadone for endometriosis. THis has now resolved, and thus she is being weaned off the Methadone. She feels like she is having a heart attack. She fears that her children will see her like this. She did have recent dental procedures which she feels are the cause of these attacks. She was placed on Celexa but only took it for 1 months stating that she could not tolerate the side effects and that it did not work for her. \n\n\nO:     \n\nGen: anxious, AAO x 3\nPsych: anxious, decreased sleep, decreased concentration \n\n\nA:       \n\nGAD\n\n\nP:     \n\nLorazepam 1mg po BID refilled x 30 days. Referred to psychiatrist for SSRI and possible CBT.\n\n\nActive Medications: \tLorazepam 1 mg Oral Tablet, CELEXA 20 MG TABLET, AURO-CIPROFLOXACIN 500 MG TAB, MACROBID 100 MG CAPSULE\n\n\n\n\n'; OUTPUT: {{"stroke": 0, "diabetes": 0, "cancer": 0, "heart_attack": 1}}
# 6. INPUT: 'S:     back pain persists\nusing tylenol arthritis & robax - help somewhat\nbut the more she does, busy -- more pain\nthe more she relaxes, the better it is, especially the next day\n\nAge >50 (especially if osteoporosis or compression factor) [Yes|No]\nAge > 70 [Yes|No]\nTrauma/Cumulative trauma [Yes|No]\nUnexplained fever or hx of urinary or other infections [Yes|No ]\nUnexplained weight loss [Yes|No] - did lose wt but did not eat well during her move\nImmunosuppression or Diabetes [Yes|No]\nHistory of Cancer [Yes|No]\nIV drug use [Yes|No]\nProlonged use of corticosteroids, osteoporosis [Yes|No] osteoporosis, previous prednisone use - shock tx in the 90's\nFocal neurological deficit with progressive or disabling symptoms [Yes|No]\nBowel/bladder dysfunction [Yes|No] (colitis related only)\nLeg symptoms [Yes|No]\nDuration longer than 6 weeks [Yes|No]\nPrior surgery [Yes|No]\n\nno radicular pain, sometimes radiates to muscles\n\n2) also needs zopiclone\nuses most nights - last few days just used half\ndiscussed Insomnia workbook\n\nO:     no vertebral tenderness\nmild L MSK tenderness - improved from previous\n\n\nA:       Back pain - OA\t\n\n\nP:     pain clinic - ?injection (?morgan)\n\nzopiclone renewed\ninsomnia work book\n\n\nActive Medications: \tSYNTHROID 112 MCG TABLET, Irbesartan/Hydrochlorothiazide 300 mg-25 mg Oral Tablet, ACTONEL DR 35 MG TABLET, Zopiclone 7.5 mg Oral Tablet\n\n\n\n\'; OUTPUT: {{"stroke": 0, "diabetes": 1, "cancer": 1, "heart_attack": 0}}
# 7. INPUT: 'S:     \n\nconcerned that she may have "palpitation with my heart"\ndiscussed that she has been worked up previously, \npt endorses that she is getting Aflibercept injections for the eye\nhas had two already\none of warnings is that it may cause heart attack, stroke or death\n\n\nfeeling palpitations with moving/walking\nno chest pain\nno SOB\nsame as previously\n\nno excessive caffeine - one per day\n\nO:     \n\nGen: NAD\n131/74, 67\nCVS - NS1S2, no ehs, regular\n\nA:       \n\nPalpitations, Anxiety\nFlu shot\n\n\nP:     \n\nreassured for now\nFlu shot\n\n\nActive Medications: \tOMNARIS 50 MCG NASAL SPRAY, SYMBICORT 100 TURBUHALER, Lorazepam 2 mg Oral Tablet, LAMISIL 1% CREAM, Metoprolol Tartrate 50 mg Oral Tablet, MACROBID 100 MG CAPSULE, FUCIDIN 2 % CREAM\n\n\n\n\n'; OUTPUT: {{"stroke": 1, "diabetes": 0, "cancer": 0, "heart_attack": 1}}


The commented-out section above was meant to improve the model's output by providing few-shot examples.
I removed all of them from the prompt because they made it too "heavy" (long), and a shorter prompt returns results faster.
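
One possible middle ground, not evaluated here, is to keep a few examples but truncate each note before adding it to the prompt. The sketch below assumes examples are available as `(note, labels)` pairs; `build_few_shot_block` and the 300-character limit are hypothetical.

```python
import json


def build_few_shot_block(examples, max_chars=300):
    """Format (note, labels) pairs as INPUT/OUTPUT lines, truncating long notes."""
    lines = []
    for i, (note, labels) in enumerate(examples, start=1):
        # keep only the first max_chars characters of long notes
        short = note if len(note) <= max_chars else note[:max_chars] + "..."
        lines.append(f"{i}. INPUT: {short!r}; OUTPUT: {json.dumps(labels)}")
    return "\n".join(lines)
```

The trade-off is that truncation can cut off the assessment or medication sections, which are often where the decisive evidence sits, so the limit would need to be tuned against the notes.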

#### Model

For the model, I chose a Llama 3 (8B) model fine-tuned specifically on clinical tasks: [MedLlama](https://huggingface.co/johnsnowlabs/JSL-MedLlama-3-8B-v2.0). It shows the best performance among similar models.

I quantised it to 8-bit, which decreased output generation time by 2-3x.

In [None]:
if not DRY_RUN:
    from transformers import BitsAndBytesConfig

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        device_map="auto",
        torch_dtype=torch.float16,
        # 8-bit quantisation via bitsandbytes
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    )
    model.eval()

#### Prediction

In [None]:
# run prompt
def evaluate_symptom(symptom_description: str) -> dict:
    prompt = create_prompt(symptom_description)

    inputs = tokenizer(prompt, return_tensors="pt")
    inputs = {k: v.to(model.device) for k,v in inputs.items()}

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=100,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id
        )

    gen_ids = output_ids[0][inputs["input_ids"].shape[-1]:]
    output_text = tokenizer.decode(gen_ids, skip_special_tokens=True)

    # extract the first flat JSON object from the generated text
    match = re.search(r"\{[^{}]+\}", output_text)
    if match:
        try:
            return json.loads(match.group())
        except json.JSONDecodeError:
            return None
    return None

# iterate over the dataset
def compare_predictions(df: pd.DataFrame,
                        index_from: int = 0, index_to: int = 500) -> pd.DataFrame:
    y_pred = []
    # drop few-shot examples index: [60, 149, 307, 315, 379, 402, 764, 878]
    test_df = df.copy().reset_index(drop=True)
    notes = test_df['note'].tolist()
    index_to = min(index_to, len(test_df))
    for i in range(index_from, index_to):
        print(i)
        preds = evaluate_symptom(notes[i])

        if preds:
            res = [
                preds.get('stroke', 0),
                preds.get('diabetes', 0),
                preds.get('cancer', 0),
                preds.get('heart_attack', 0)
            ]
        else:
            res = [0, 0, 0, 0]
        y_pred.append(res)

    y_pred_df = pd.DataFrame(y_pred, columns=['stroke_pred', 'diabetes_pred', 'cancer_pred', 'heart_attack_pred'])
    result_df = test_df.iloc[index_from:index_to].reset_index(drop=True)
    return pd.concat([result_df, y_pred_df], axis=1)
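
`compare_predictions` passes the parsed JSON values through unchanged, so a response such as `"1"` or `true` would reach the metrics as a string or boolean. A minimal sketch of a stricter conversion is shown below; `normalize_prediction` is a hypothetical helper, and treating every unrecognised value as 0 is an assumption that matches the existing fallback of `[0, 0, 0, 0]`.

```python
LABELS = ["stroke", "diabetes", "cancer", "heart_attack"]


def normalize_prediction(preds):
    """Coerce a parsed model response to a list of 0/1 ints, defaulting to 0."""
    if not isinstance(preds, dict):
        return [0] * len(LABELS)
    result = []
    for label in LABELS:
        value = preds.get(label, 0)
        # accept 1, True, "1" and "true" as positive; everything else is 0
        positive = value in (1, True) or str(value).strip().lower() in ("1", "true")
        result.append(int(positive))
    return result
```

With this helper, the `if preds: ... else: ...` branch in the loop could be reduced to `res = normalize_prediction(preds)`.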


In [None]:
# predict the validation dataset
if not DRY_RUN:
    res_val = compare_predictions(val_df)

In [None]:
# predict the complete dataset
# process in batches of 200-300 rows to avoid Colab time limits
if not DRY_RUN:
    INDEX_FROM = 500
    INDEX_TO = 800
    result_df = compare_predictions(df, index_from=INDEX_FROM, index_to=INDEX_TO)
    result_df.to_csv(f'res_{INDEX_FROM}_{INDEX_TO}.csv', index=False)

#### Validation
I focused on recall and F-beta (beta = 2) scores: the priority is to avoid false negatives (FN), since missing a symptom can have serious consequences.
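
For reference, F-beta combines precision P and recall R as (1 + beta^2) * P * R / (beta^2 * P + R), so beta = 2 treats recall as more important than precision. The sketch below computes the same quantity directly from confusion-matrix counts; `fbeta_from_counts` is a hypothetical helper, not part of scikit-learn, and the counts are the diabetes row from the validation output further down.

```python
def fbeta_from_counts(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta from confusion-matrix counts; beta > 1 weights recall more."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Diabetes counts from the validation output: TP=27, FP=30, FN=4
print(round(fbeta_from_counts(27, 30, 4, beta=1), 2))  # F1 -> 0.61
print(round(fbeta_from_counts(27, 30, 4, beta=2), 2))  # F2 -> 0.75
```

F2 is higher than F1 here because recall (0.87) is much higher than precision (0.47), and beta = 2 rewards that.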

In [None]:
def show_metrics(df: pd.DataFrame):
    eval_class = SYMPTOMS_COLS if SYMPTOMS_COLS[0] in df.columns else ['harmonized']
    for symptom in eval_class:
        y_true = df[symptom]
        y_pred = df[f'{symptom}_pred']

        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()  # fixed label order, even if a class is absent
        precision = precision_score(y_true, y_pred)
        recall = recall_score(y_true, y_pred)
        f1 = f1_score(y_true, y_pred)
        f2 = fbeta_score(y_true, y_pred, beta=2)  # more weight on recall

        print(f'=== {symptom.title()} ===')
        print(f'  TP: {tp},  FP: {fp},  FN: {fn},  TN: {tn}')
        print(f'  Precision: {precision:.2f}')
        print(f'  Recall (Sensitivity): {recall:.2f}')
        print(f'  F1 Score: {f1:.2f}')
        print(f'  F2 Score: {f2:.2f}')
        print()

In [None]:
if DRY_RUN:
    res_val = pd.read_csv(PATH_TO_RESULT_VAL)
else:
    # check metrics over the complete dataset
    res_df_1 = pd.read_csv(PATH_TO_RESULT + 'res_0_200.csv', index_col=False).dropna(axis=0)
    res_df_2 = pd.read_csv(PATH_TO_RESULT + 'res_200_500.csv')
    res_val = pd.concat([res_df_1, res_df_2], axis=0).drop('Unnamed: 0', axis=1)
    res_val.iloc[:, 1:] = res_val.iloc[:, 1:].astype(int)

In [None]:
show_metrics(res_val)

=== Stroke ===
  TP: 4,  FP: 4,  FN: 6,  TN: 136
  Precision: 0.50
  Recall (Sensitivity): 0.40
  F1 Score: 0.44
  F2 Score: 0.42

=== Diabetes ===
  TP: 27,  FP: 30,  FN: 4,  TN: 89
  Precision: 0.47
  Recall (Sensitivity): 0.87
  F1 Score: 0.61
  F2 Score: 0.75

=== Cancer ===
  TP: 24,  FP: 16,  FN: 19,  TN: 91
  Precision: 0.60
  Recall (Sensitivity): 0.56
  F1 Score: 0.58
  F2 Score: 0.57

=== Heart_Attack ===
  TP: 3,  FP: 5,  FN: 3,  TN: 139
  Precision: 0.38
  Recall (Sensitivity): 0.50
  F1 Score: 0.43
  F2 Score: 0.47



### Results analysis
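The metrics are not good enough: recall is below 0.60 for every condition except diabetes (0.87), and precision never exceeds 0.60.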

However, I typically start by investigating potential issues on the data side. It's possible that the model performed well, but incorrect labels may have negatively impacted its scores.

In [None]:
# return notes, true and predicted labels for each missed match
def get_mismatched(
    df: pd.DataFrame,
    disease: str,
    metric: str = 'fn'
) -> pd.DataFrame:
    true_col = disease
    pred_col = f"{disease}_pred"
    if metric.lower() == 'fn':
        mask = (df[true_col] == 1) & (df[pred_col] == 0)
    elif metric.lower() == 'fp':
        mask = (df[true_col] == 0) & (df[pred_col] == 1)
    else:
        raise ValueError("metric must be either 'fn' or 'fp'")
    return df.loc[mask].reset_index(drop=True)


In [None]:
get_mismatched(res_val, 'cancer').values[:5]

array([["S:     \n\nPresents with request to become a new patient of mine. 3 months ago moved to this area with his wife - seeking family physician. \n\nPMHx: none\n\nPSurgHx: none\n\nDaily Medications: none\n\nAllergies: NKDA\n\nFamilyHx: nothing pertinent \n\nSocialHx: Married, 1 Child, Non-Smoker, Drinks Socially and Self-employed - oversee the operation of importing company (retail)\n\nROS: negative\n\nO:     \n\nGen: NAD\nHEENT: NC/AT, PERRLA, EOMI, no throat issues\nCVS: S1/S2+, RRR, no murmur\nResp: CTAB/L\nLN: no lymphadenopathy\n\nA:       \n\nMeet and Greet\nImmunization Review\nRoutine Labs\nColon Cancer Screening Test\n\nP:     \n\n1) I will be happy to accept him into my rostered practice. I will need a copy of their past medical records from any previous PCP's or specialists they may have had.  \n\n2) Immunizations were reviewed and Tdap given today, rx for Shingrix and Prevnar 13.  \n\n3) Routine labs were sent for today.\n\n4) FOBT sent for. \n\n\nActive Medications: \t

I have some questions about the labels. I'm not an expert, but it seems that some labels don't match the doctor's notes, some important information is missing (likely due to copying from PDFs), some notes contain ambiguous options ([Yes|No] in survey questions without an explicit answer), or the rules for identifying symptoms might need to be reviewed.

Let's break down the 5 examples (cancer not predicted) from the output:

1. The patient seems to have no history of cancer and no health concerns, but was referred for a Colon Cancer Screening Test, which seems to be just a checkup recommendation (the rule for FP).
2. Similar to the previous example: just recommendations for checkups. However, the patient's mother had DM2 (family history?).
3. No cancer diagnosis, but a clear family history of diabetes. This is an example where the model predicted diabetes while the true label for diabetes is 0.
4. The doctor stated "normal exam" (MS is not listed in the labels).
5. Similar to the above.

In [None]:
print(get_mismatched(res_val, 'cancer').values[0][0])

S:     

Presents with request to become a new patient of mine. 3 months ago moved to this area with his wife - seeking family physician. 

PMHx: none

PSurgHx: none

Daily Medications: none

Allergies: NKDA

FamilyHx: nothing pertinent 

SocialHx: Married, 1 Child, Non-Smoker, Drinks Socially and Self-employed - oversee the operation of importing company (retail)

ROS: negative

O:     

Gen: NAD
HEENT: NC/AT, PERRLA, EOMI, no throat issues
CVS: S1/S2+, RRR, no murmur
Resp: CTAB/L
LN: no lymphadenopathy

A:       

Meet and Greet
Immunization Review
Routine Labs
Colon Cancer Screening Test

P:     

1) I will be happy to accept him into my rostered practice. I will need a copy of their past medical records from any previous PCP's or specialists they may have had.  

2) Immunizations were reviewed and Tdap given today, rx for Shingrix and Prevnar 13.  

3) Routine labs were sent for today.

4) FOBT sent for. 


Active Medications: 	None Recorded







In [None]:
get_mismatched(res_val, 'cancer', 'fp').values[:5]

array([["Reason for referral:  Several family members (mainly from father's side - at least 3 aunts, and others) who have had breast cancer. Patient has a lump in the left breast. Most recent mammogram will be attached. I'd like her to do genetics testing to see her risk levels and if we should do anything other than yearly screening for her. \n\nThank you for seeing this patient.  I look forward to hearing from you.\n",
        0, 0, 0, 0, 0, 0, 1, 0],
       ['S:     \n\nWith husband - multiple issues to note:\n\n1) Flu shot in office today.\n\n2) Bilirubin elevated in Florida - suggested she get an U/S of the liver.\n\n3) Left cheek - lesion not disappearing after 1 year - possible BCC. \n\n\nO:     \n\nGen: NAD\nIntegumentary: left cheek - possible BCC\n\n\nA:       \n\nElevated LFTs\nImmunizations\n?BCC\n\n\nP:    \n\n1) U/S of abdomen.\n\n2) Flu shot given in office today. Prevnar 13 given in office today as well. Already had Pneumovax 23 in Nov 2011. \n\n3) Sent to dermatologist

#### Harmonized
How well does the model distinguish healthy from not-healthy patients?

In [None]:
def harmonize_row(row):
    if (row == 1).any():
        return 1
    return 0

In [None]:
y_true_df = res_val.loc[:, ['stroke', 'diabetes', 'cancer', 'heart_attack']].copy()
y_pred_df = res_val.loc[:, ['stroke_pred', 'diabetes_pred', 'cancer_pred', 'heart_attack_pred']].copy()

y_true_df_harmonized = y_true_df.apply(harmonize_row, axis=1)
y_pred_df_harmonized = y_pred_df.apply(harmonize_row, axis=1)

res_val_harmonized = pd.concat([y_true_df_harmonized, y_pred_df_harmonized], axis=1)
res_val_harmonized.columns = ['harmonized', 'harmonized_pred']

In [None]:
show_metrics(res_val_harmonized)

=== Harmonized ===
  TP: 62,  FP: 40,  FN: 21,  TN: 27
  Precision: 0.61
  Recall (Sensitivity): 0.75
  F1 Score: 0.67
  F2 Score: 0.71



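The harmonized scores (recall 0.75, F2 0.71) are better than the per-condition scores for stroke, cancer, and heart attack, and close to those for diabetes (F2 0.75).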

## Future Work
1. Optimize prompt with DSPy
2. Fine-tune model on this dataset with QLoRA/LoRA
3. Review the dataset