### This notebook performs post-processing of ClinicBench data to generate the evaluation dataset

#### To setup:

- git clone the tutorial repo
- To clone the healthcare submodule:
    - git submodule init
    - git submodule update 

In [1]:
import pandas as pd
import random
import jsonlines

### Load the healthcare data that include symptom, disease, commonMedications

In [2]:
path_to_csv = "datasets/healthcare/Treatment-Recommendation.csv"
df = pd.read_csv(path_to_csv)
df 

Unnamed: 0,idx,disease,Symptom,reason,TestsAndProcedures,commonMedications
0,0,Panic disorder,"['Anxiety and nervousness', 'Depression', 'Sho...",Panic disorder is an anxiety disorder characte...,"['Psychotherapy', 'Mental health counseling', ...","['Lorazepam', 'Alprazolam (Xanax)', 'Clonazepa..."
1,1,Vocal cord polyp,"['Hoarse voice', 'Sore throat', 'Difficulty sp...","beclomethasone nasal product,","['Tracheoscopy and laryngoscopy with biopsy', ...","['Esomeprazole (Nexium)', 'Beclomethasone Nasa..."
2,2,Turner syndrome,"['Groin mass', 'Leg pain', 'Hip pain', 'Suprap...",Turner syndrome or Ullrich\xe2\x80\x93Turner s...,"['Complete physical skin exam performed (ML)',...","['Somatropin', 'Sulfamethoxazole (Bactrim)', '..."
3,3,Cryptorchidism,"['Symptoms of the scrotum and testes', 'Swelli...",Cryptorchidism (derived from the Greek \xce\xb...,"['Complete physical skin exam performed (ML)',...","['Haemophilus B Conjugate Vaccine (Obsolete)',..."
4,4,Poisoning due to ethylene glycol,"['Abusing alcohol', 'Fainting', 'Hostile behav...","thiamine,","['Intravenous fluid replacement', 'Hematologic...","['Lorazepam', 'Thiamine', 'Naloxone (Suboxone)..."
...,...,...,...,...,...,...
791,791,Kaposi sarcoma,"['Leg pain', 'Foot or toe pain', 'Cough', 'Lip...",Kaposi sarcoma (KS) is a tumor caused by Human...,"['Glucose measurement (Glucose level)', 'Compl...","['Ritonavir (Norvir)', 'Lopinavir', 'Tenofovir..."
792,792,Spondylolisthesis,"['Low back pain', 'Back pain', 'Leg pain', 'Hi...",Spondylolisthesis is the anterior or posterior...,"['Radiographic imaging procedure', 'Plain x-ra...","['Diclofenac', 'Piroxicam', 'Cortisone', 'Levo..."
793,793,Pseudotumor cerebri,"['Headache', 'Diminished vision', 'Dizziness',...","Idiopathic intracranial hypertension (IIH), so...","['Intravenous fluid replacement', 'Magnetic re...","['Acetazolamide (Diamox)', 'Topiramate (Topama..."
794,794,Conjunctivitis due to virus,"['Eye redness', 'Pain in eye', 'Itchiness of e...",Conjunctivitis (also called pink eye or madras...,['Ophthalmic examination and evaluation (Eye e...,"['Erythromycin', 'Gentamicin Ophthalmic', 'Sod..."


### Function to generate question associated with each sample, over all scenarios

- To generate more realistic questions in Scenarios 1, 2, 3, we randomly sample a few symptoms from all of the available symptoms.

In [3]:
def add_question_column(df, scenario="scenario_1"):
    def form_question_scenario_1(symptoms, nsamples=8):
        # SingleStep
        return "I have " + ", ".join(random.sample(eval(symptoms), nsamples)) + " symptoms. What disease do I have?"
        
    def form_question_scenario_2(symptoms, nsamples=8):
        # MultiStep
        return "I have " + ", ".join(random.sample(eval(symptoms), nsamples)) + " symptoms. What are the recommended medications?"
        
    def add_expected_tools_column(df):
        # Expected tools used
        df["expected_tools_scenario_2"] = ["find_disease find_common_medications"] * len(df)
        return df
        
    def form_question_scenario_3(symptoms, nsamples=8):
        # Same questions over samples in dataset
        return "I have " + ", ".join(random.sample(eval(symptoms), nsamples)) + " symptoms. What disease do I have?"
        
    def form_question_scenario_4():
        # No symptoms
        return "I feel great and don't have medical symptoms. Do I have any disease or need medications?"

    column = "question_" + scenario
    
    df[column] = df["Symptom"]
    
    if scenario == "scenario_1":
        df[column] = df[column].apply(lambda x: form_question_scenario_1(x))
    elif scenario == "scenario_2":
        df[column] = df[column].apply(lambda x: form_question_scenario_2(x))
        df = add_expected_tools_column(df)
    elif scenario == "scenario_3":
        df[column] = [form_question_scenario_3(df["Symptom"].iloc[0])] * len(df)
    elif scenario == "scenario_4":
        df[column] = [form_question_scenario_4()] * len(df)
    else:
        raise

    return df

### Prepare data (csv) to be used by our tools. A tool will retrieve data from this csv

- Save first n rows of data to csv.

In [4]:
nrows = 3
first_nrows = df.head(nrows)
first_nrows.to_csv('datasets/healthcare_postprocessed/Treatment-Recommendation-First-' + str(nrows) + '-Rows.csv', index=False)

### Prepare evaluation data (jsonl) to be used by InsepctAI for evaluating agent

- Save first n rows of data to jsonl.

In [5]:
path_to_jsonl = 'datasets/healthcare_postprocessed/Treatment-Recommendation-First-' + str(nrows) + '-Rows.jsonl'

first_nrows = df.head(nrows)
first_nrows = add_question_column(first_nrows, scenario="scenario_1")
first_nrows = add_question_column(first_nrows, scenario="scenario_2")
first_nrows = add_question_column(first_nrows, scenario="scenario_3")
first_nrows = add_question_column(first_nrows, scenario="scenario_4")

list_of_dicts = first_nrows.to_dict(orient='records')
for d in list_of_dicts:
    for k, v in d.items():
        if type(v) == str and v[0] == "[" and type(eval(v)) == list:
            d[k] = eval(v)

with jsonlines.open(path_to_jsonl, 'w') as w:
    w.write_all(list_of_dicts)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column] = df["Symptom"]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column] = df[column].apply(lambda x: form_question_scenario_1(x))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column] = df["Symptom"]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[