# 0. Construct the evaluation dataset

## Part A. Vignettes derived from case reports in the medical literature.

### 1. Load in the spreadsheets of case reports pulled from open-access journals.

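First, we load the case reports collected from open-access journals (n=79). Together with the textbook cases loaded in step 2, they make up the medical literature vignettes (n=183).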

In [22]:
# Import modules for CSV processing
import pandas as pd
import json

# Load all individual CSV files from individual open access journals
case_reports_in_psychiatry = pd.read_csv("../datasets/medical_literature/original/open_access_journals/case_reports_in_psychiatry.csv")
jama_psychiatry = pd.read_csv("../datasets/medical_literature/original/open_access_journals/jama_psychiatry.csv")
jama = pd.read_csv("../datasets/medical_literature/original/open_access_journals/jama.csv")
lancet = pd.read_csv("../datasets/medical_literature/original/open_access_journals/lancet.csv")
nejm_evidence = pd.read_csv("../datasets/medical_literature/original/open_access_journals/nejm_evidence.csv")
nejm = pd.read_csv("../datasets/medical_literature/original/open_access_journals/nejm.csv")

# Combine CSV files into a single dataframe of all open access journal case reports
open_access_journals = pd.concat([case_reports_in_psychiatry, jama_psychiatry, jama, lancet, nejm_evidence, nejm], ignore_index=True)

# View the first few rows of the dataframe
open_access_journals.head()

Unnamed: 0,case,diagnosis,treatment,title,url,source
0,"A 36-year-old woman, presented with apathy and...",Major depression with psychotic and melancholi...,,Thiamine Deficiency Neuropathy in a Patient wi...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,case_reports_in_psychiatry
1,A black male in his early 20s with a remote hi...,Catatonia,,Catatonia as a Result of a Traumatic Brain Injury,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,case_reports_in_psychiatry
2,"In October 2023, a 24-year-old unemployed male...",Opioid use disorder,,Guanfacine Treatment in a Patient with Intrave...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,case_reports_in_psychiatry
3,A 12-year-old male with no specific birth or m...,Anorexia nervosa with focal cortical dysplasia,,A Case of Anorexia Nervosa with Focal Cortical...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,case_reports_in_psychiatry
4,"Mr. A, a 66-year-old Caucasian male, recently ...",Cotard syndrome,,Psychotropic Management in Cotard Syndrome: Ca...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,case_reports_in_psychiatry


In [23]:
# Count the number of case reports in the combined dataframe
open_access_journals.shape[0]

79

In [24]:
# Keep only the case text (model input), the diagnosis (expected output), and the source
open_access_journals = open_access_journals[['case', 'diagnosis', 'source']]

# Rename columns for clarity
open_access_journals = open_access_journals.rename(columns={'case': 'vignette'})

# View the first few rows of the diagnosis dataframe
open_access_journals.head()

Unnamed: 0,vignette,diagnosis,source
0,"A 36-year-old woman, presented with apathy and...",Major depression with psychotic and melancholi...,case_reports_in_psychiatry
1,A black male in his early 20s with a remote hi...,Catatonia,case_reports_in_psychiatry
2,"In October 2023, a 24-year-old unemployed male...",Opioid use disorder,case_reports_in_psychiatry
3,A 12-year-old male with no specific birth or m...,Anorexia nervosa with focal cortical dysplasia,case_reports_in_psychiatry
4,"Mr. A, a 66-year-old Caucasian male, recently ...",Cotard syndrome,case_reports_in_psychiatry


In [25]:
# Check for missing values
open_access_journals.isnull().sum()

vignette     0
diagnosis    0
source       0
dtype: int64
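
The six `read_csv` calls above follow one pattern, so they could be folded into a loop that also checks each file for the columns used later. The sketch below is one possible way to do that, not the notebook's actual code: `load_case_reports`, `REQUIRED_COLUMNS`, and the `journal_a`/`journal_b` files are hypothetical names, and the demonstration runs on made-up placeholder rows in a temporary directory rather than the real spreadsheets.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Columns the rest of the notebook relies on (taken from the cells above)
REQUIRED_COLUMNS = ["case", "diagnosis", "source"]


def load_case_reports(directory, names):
    """Read <name>.csv for each name, check required columns, and concatenate."""
    frames = []
    for name in names:
        frame = pd.read_csv(Path(directory) / f"{name}.csv")
        missing = [col for col in REQUIRED_COLUMNS if col not in frame.columns]
        if missing:
            raise ValueError(f"{name}.csv is missing columns: {missing}")
        frames.append(frame[REQUIRED_COLUMNS])
    combined = pd.concat(frames, ignore_index=True)
    return combined.rename(columns={"case": "vignette"})


# Toy demonstration: two hypothetical files filled with placeholder rows
with tempfile.TemporaryDirectory() as tmp:
    for name, n_rows in [("journal_a", 2), ("journal_b", 3)]:
        pd.DataFrame({
            "case": [f"placeholder vignette {i}" for i in range(n_rows)],
            "diagnosis": ["placeholder diagnosis"] * n_rows,
            "title": ["placeholder title"] * n_rows,
            "source": [name] * n_rows,
        }).to_csv(Path(tmp) / f"{name}.csv", index=False)
    demo = load_case_reports(tmp, ["journal_a", "journal_b"])

print(demo.shape)
```

Failing early on a missing column is a design choice; a looser variant could warn and skip the file instead.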

### 2. Load in the spreadsheets of case reports pulled from the "DSM-5-TR Clinical Cases" textbook.

In [26]:
# Load the CSV file of cases from the "DSM-5-TR Clinical Cases" textbook
import pandas as pd

textbooks = pd.read_csv("../datasets/medical_literature/original/textbooks/dsm_5_tr_clinical_cases.csv")
textbooks.head()

Unnamed: 0,case,discussion,diagnosis,category,title,url,source
0,"Ashley, age 17, was referred for a diagnostic ...","In regard to diagnosis, Ashley’s cognitive tes...",1. Intellectual developmental disorder (intell...,Neurodevelopmental Disorders,A Second Opinion on Autism,https://dsm.psychiatryonline.org/doi/full/10.1...,dsm_5_tr_clinical_cases
1,Brandon was a 12-year-old boy brought in by hi...,Brandon presents with symptoms consistent with...,Autism spectrum disorder requiring support for...,Neurodevelopmental Disorders,Temper Tantrums,https://dsm.psychiatryonline.org/doi/full/10.1...,dsm_5_tr_clinical_cases
2,Carlos was a 19-year-old Hispanic college stud...,Carlos presents with a history of ADHD that wa...,"1. Major depressive disorder, mild, single epi...",Neurodevelopmental Disorders,Academic Difficulties,https://dsm.psychiatryonline.org/doi/full/10.1...,dsm_5_tr_clinical_cases
3,"Daphne, a 13-year-old in the ninth grade, was ...","Daphne has symptoms of inattention, anxiety, a...",1. Specific learning disorder (mathematics)\n2...,Neurodevelopmental Disorders,School Problems,https://dsm.psychiatryonline.org/doi/full/10.1...,dsm_5_tr_clinical_cases
4,"Ethan, a 9-year-old boy, was referred to a psy...",Ethan presents with declining school performan...,1. Provisional tic disorder\n2. Separation anx...,Neurodevelopmental Disorders,Fidgety and Distracted,https://dsm.psychiatryonline.org/doi/full/10.1...,dsm_5_tr_clinical_cases


In [27]:
# Count the number of case reports
textbooks.shape[0]

104

In [28]:
# Keep only the case text (model input), the diagnosis (expected output), and the source
textbooks = textbooks[['case', 'diagnosis', 'source']]

# Rename columns for clarity
textbooks = textbooks.rename(columns={'case': 'vignette'})

# View the first few rows
textbooks.head()

Unnamed: 0,vignette,diagnosis,source
0,"Ashley, age 17, was referred for a diagnostic ...",1. Intellectual developmental disorder (intell...,dsm_5_tr_clinical_cases
1,Brandon was a 12-year-old boy brought in by hi...,Autism spectrum disorder requiring support for...,dsm_5_tr_clinical_cases
2,Carlos was a 19-year-old Hispanic college stud...,"1. Major depressive disorder, mild, single epi...",dsm_5_tr_clinical_cases
3,"Daphne, a 13-year-old in the ninth grade, was ...",1. Specific learning disorder (mathematics)\n2...,dsm_5_tr_clinical_cases
4,"Ethan, a 9-year-old boy, was referred to a psy...",1. Provisional tic disorder\n2. Separation anx...,dsm_5_tr_clinical_cases


In [29]:
# Check for missing values
textbooks.isnull().sum()

vignette     0
diagnosis    0
source       0
dtype: int64

### 3. Combine the open-access and textbook vignettes into one dataset

In [30]:
# Combine the open-access and textbook vignettes
medical_literature = pd.concat([open_access_journals, textbooks],
                               ignore_index=True)

# Give each of the medical literature cases an ID
case_id = range(1, 1 + len(medical_literature))
medical_literature.insert(loc=0, column='case_id', value=case_id)

medical_literature.head()

Unnamed: 0,case_id,vignette,diagnosis,source
0,1,"A 36-year-old woman, presented with apathy and...",Major depression with psychotic and melancholi...,case_reports_in_psychiatry
1,2,A black male in his early 20s with a remote hi...,Catatonia,case_reports_in_psychiatry
2,3,"In October 2023, a 24-year-old unemployed male...",Opioid use disorder,case_reports_in_psychiatry
3,4,A 12-year-old male with no specific birth or m...,Anorexia nervosa with focal cortical dysplasia,case_reports_in_psychiatry
4,5,"Mr. A, a 66-year-old Caucasian male, recently ...",Cotard syndrome,case_reports_in_psychiatry


In [31]:
# Count the number of medical literature cases
medical_literature.shape[0]

183

In [32]:
# Enumerate by source
print(medical_literature['source'].value_counts())

source
dsm_5_tr_clinical_cases       104
case_reports_in_psychiatry     38
nejm                           24
jama_psychiatry                 6
jama                            6
lancet                          4
nejm_evidence                   1
Name: count, dtype: int64


### 4. Curate the literature vignettes according to clinician annotator feedback

Next, we manually drop the vignettes that the clinician annotators scored below 3.

In [33]:
# Manually drop the vignettes scored <3 by the clinician annotators
# Curation spreadsheet: https://docs.google.com/spreadsheets/d/1ZH5kUI5tm_NzUOf30SgtN5RqoumHJnUxJFBlmy8hpT4/edit?gid=1741607768#gid=1741607768
dropped_ids = [1,5,6,7,10,11,41,42,45,47,49,54,56,57,58,59,61,65,69,71,74,75,76,79,84,88,98,112,140,143,3,8,12,16,31,38,46,51,52,53,55,63,67,164,183,9,29,68]

# Drop the problematic vignettes from the dataframe
medical_literature_curated = medical_literature[~medical_literature["case_id"].isin(dropped_ids)]

medical_literature_curated.head()

Unnamed: 0,case_id,vignette,diagnosis,source
1,2,A black male in his early 20s with a remote hi...,Catatonia,case_reports_in_psychiatry
3,4,A 12-year-old male with no specific birth or m...,Anorexia nervosa with focal cortical dysplasia,case_reports_in_psychiatry
12,13,We report on a 35-year-old female with a histo...,Pseudocyesis,case_reports_in_psychiatry
13,14,The patient is a 58-year-old woman with a psyc...,Illness anxiety disorder (IAD),case_reports_in_psychiatry
14,15,The patient is a 20-year-old single man. He is...,Social anxiety disorder (SAD) and major depres...,case_reports_in_psychiatry


In [34]:
# Count remaining medical literature vignettes
medical_literature_curated.shape[0]

135

In [35]:
# Enumerate remaining medical literature vignettes by source
print(medical_literature_curated['source'].value_counts())

source
dsm_5_tr_clinical_cases       96
case_reports_in_psychiatry    24
nejm                           9
jama_psychiatry                4
jama                           2
Name: count, dtype: int64
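
Because `dropped_ids` is maintained by hand, a few cheap checks can catch a duplicated or mistyped ID before it silently changes the dataset size. The sketch below is an illustration under stated assumptions: it copies the ID list from the cell above and uses a stand-in dataframe (`toy`, a hypothetical name) that holds only the `case_id` values 1 to 183, not the real vignettes.

```python
import pandas as pd

# ID list copied from the curation cell above
dropped_ids = [1, 5, 6, 7, 10, 11, 41, 42, 45, 47, 49, 54, 56, 57, 58, 59,
               61, 65, 69, 71, 74, 75, 76, 79, 84, 88, 98, 112, 140, 143,
               3, 8, 12, 16, 31, 38, 46, 51, 52, 53, 55, 63, 67, 164, 183,
               9, 29, 68]

# Stand-in for medical_literature: only the case_id column, 1..183
toy = pd.DataFrame({"case_id": range(1, 184)})

# 1. No ID should be listed twice
assert len(dropped_ids) == len(set(dropped_ids)), "duplicate ID in dropped_ids"

# 2. Every listed ID should exist in the dataframe
unknown = set(dropped_ids) - set(toy["case_id"])
assert not unknown, f"IDs not found in the dataset: {sorted(unknown)}"

# 3. Apply the same filter as the notebook and compare sizes
curated = toy[~toy["case_id"].isin(dropped_ids)]
assert len(curated) == len(toy) - len(dropped_ids)

print(len(dropped_ids), len(curated))  # 48 135
```

The result agrees with the count reported above: 183 vignettes minus 48 dropped leaves 135.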


## Part B. Fictitious vignettes authored by psychiatry clinicians.

Next, we add the fictitious (synthetic) vignettes authored by psychiatry clinicians.

### 5. Load in the fictitious vignettes

In [36]:
# Load the clinician-created fictitious vignettes
huang = "../datasets/fictitious/Ashley Huang/ashley_huang.json"
nagpal = "../datasets/fictitious/Caesa Nagpal/caesa_nagpal.json"
zaboski = "../datasets/fictitious/Brian Zaboski/brian_zaboski.json"
weathers = "../datasets/fictitious/Judah Weathers/judah_weathers.json"
zhang = "../datasets/fictitious/Juliana Zhang/juliana_zhang.json"
chaudhary = "../datasets/fictitious/Pooja Chaudhary/pooja_chaudhary.json"

with open(weathers, "r", encoding="utf-8") as file:
    weathers_data = json.load(file)

with open(zaboski, "r", encoding="utf-8") as file:
    zaboski_data = json.load(file)

with open(huang, "r", encoding="utf-8") as file:
    huang_data = json.load(file)

with open(nagpal, "r", encoding="utf-8") as file:
    nagpal_data = json.load(file)

with open(zhang, "r", encoding="utf-8") as file:
    zhang_data = json.load(file)

with open(chaudhary, "r", encoding="utf-8") as file:
    chaudhary_data = json.load(file)

judah_weathers = pd.DataFrame(weathers_data)
brian_zaboski = pd.DataFrame(zaboski_data)
ashley_huang = pd.DataFrame(huang_data)
caesa_nagpal = pd.DataFrame(nagpal_data)
juliana_zhang = pd.DataFrame(zhang_data)
pooja_chaudhary = pd.DataFrame(chaudhary_data)

# Combine the clinician-created vignettes into a single DataFrame
fictitious = pd.concat([judah_weathers, brian_zaboski, ashley_huang, caesa_nagpal, juliana_zhang, pooja_chaudhary], ignore_index=True)
# Number the fictitious cases from 1001 upward so that their IDs
# cannot collide with the case report IDs (1 to 183)
fictitious["case_id"] = fictitious.index + 1001
fictitious.head()

Unnamed: 0,case_id,case,diagnosis,difficulty,source,diagnosis_dsm,diagnosis_icd,reasoning
0,1001,"Leroy is a 34 year old male, whose chief compl...",Obsessive compulsive disorder (F42.8),,judah_weathers,,,
1,1002,Emil is a 22 year old male that reports “my mo...,"Bipolar disorder, most recent episode manic, i...",,judah_weathers,,,
2,1003,Neveah is a 14 year old girl that comes to her...,"Bipolar disorder, current episode manic (F31.2)",,judah_weathers,,,
3,1004,Elain is a 63 year old mother of 5 children an...,Acute stress disorder (F43.0),,judah_weathers,,,
4,1005,A 19 year old male comes to his primary care o...,Schizophreniform Disorder (F20.81) or F33.3 (D...,,judah_weathers,,,


In [37]:
# Count number of fictitious vignettes
fictitious.shape[0]

61

In [38]:
# Enumerate fictitious vignettes per clinician
print(fictitious['source'].value_counts())

source
judah_weathers     15
brian_zaboski      10
ashley_huang       10
juliana_zhang      10
pooja_chaudhary    10
caesa_nagpal        6
Name: count, dtype: int64
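
The per-clinician loading above repeats the same open, parse, and wrap steps, so it could be expressed as a single helper. The following is a minimal sketch, assuming each JSON file holds a list of records as the `pd.DataFrame(...)` calls above suggest. `load_fictitious`, `ID_OFFSET`, and the two toy files are hypothetical names introduced for illustration, and the records are placeholders rather than real vignettes.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

ID_OFFSET = 1000  # keeps fictitious IDs apart from the case report IDs


def load_fictitious(paths, id_offset=ID_OFFSET):
    """Load each JSON file into a dataframe, concatenate, and assign offset IDs."""
    frames = []
    for path in paths:
        with open(path, "r", encoding="utf-8") as file:
            frames.append(pd.DataFrame(json.load(file)))
    combined = pd.concat(frames, ignore_index=True)
    combined["case_id"] = combined.index + 1 + id_offset
    return combined


# Toy demonstration: two hypothetical clinician files with placeholder records
with tempfile.TemporaryDirectory() as tmp:
    toy_files = {
        "clinician_a.json": [
            {"case": "placeholder case 1", "diagnosis": "placeholder", "source": "clinician_a"},
            {"case": "placeholder case 2", "diagnosis": "placeholder", "source": "clinician_a"},
        ],
        "clinician_b.json": [
            {"case": "placeholder case 3", "diagnosis": "placeholder", "source": "clinician_b"},
        ],
    }
    paths = []
    for filename, records in toy_files.items():
        path = Path(tmp) / filename
        with open(path, "w", encoding="utf-8") as file:
            json.dump(records, file, ensure_ascii=False)
        paths.append(path)
    demo_fictitious = load_fictitious(paths)

print(demo_fictitious["case_id"].tolist())  # [1001, 1002, 1003]
```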


## Part C. Combine the literature and fictitious vignettes

In [39]:
# Combine into one dataframe
combined = pd.concat([medical_literature_curated, fictitious], ignore_index=True)
combined.head()

Unnamed: 0,case_id,vignette,diagnosis,source,case,difficulty,diagnosis_dsm,diagnosis_icd,reasoning
0,2,A black male in his early 20s with a remote hi...,Catatonia,case_reports_in_psychiatry,,,,,
1,4,A 12-year-old male with no specific birth or m...,Anorexia nervosa with focal cortical dysplasia,case_reports_in_psychiatry,,,,,
2,13,We report on a 35-year-old female with a histo...,Pseudocyesis,case_reports_in_psychiatry,,,,,
3,14,The patient is a 58-year-old woman with a psyc...,Illness anxiety disorder (IAD),case_reports_in_psychiatry,,,,,
4,15,The patient is a 20-year-old single man. He is...,Social anxiety disorder (SAD) and major depres...,case_reports_in_psychiatry,,,,,


In [40]:
# Count the number of vignettes total
combined.shape[0]

196
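
The `head()` output above shows both a `vignette` and a `case` column: the literature rows fill `vignette`, while the fictitious dataframe stores its text under `case`. If a single text column is wanted downstream, one option is to coalesce the two. This is a sketch of that idea on a toy dataframe with placeholder text, not a step the notebook itself performs.

```python
import pandas as pd

# Toy rows mimicking the combined layout: one literature row, two fictitious rows
toy_combined = pd.DataFrame({
    "case_id": [2, 1001, 1002],
    "vignette": ["literature placeholder text", None, None],
    "case": [None, "fictitious placeholder A", "fictitious placeholder B"],
})

# Fill missing vignette text from the case column, then drop the redundant column
toy_combined["vignette"] = toy_combined["vignette"].fillna(toy_combined["case"])
toy_combined = toy_combined.drop(columns=["case"])

print(toy_combined["vignette"].isna().sum())  # 0
```

An alternative would be to rename `case` to `vignette` in the fictitious dataframe before concatenating, which avoids creating the extra column in the first place.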

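Some of the JSON files record two types of diagnosis (`diagnosis_dsm` and `diagnosis_icd`) instead of a single `diagnosis` field. For those rows, we fill the empty `diagnosis` column with the ICD-10 diagnosis when it is available.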

In [41]:
# Where the diagnosis column is empty, fill it with the value from diagnosis_icd
combined["diagnosis"] = combined["diagnosis"].fillna(combined["diagnosis_icd"])

## Part D. Save and export the evaluation dataset

In [42]:
# Export as JSON; the exported file is then edited manually to include clinician comments
output_path = "../datasets/combined/combined_jama.json"
# Convert to a list of records (one dict per vignette) without escaping non-ASCII text
records = json.loads(combined.to_json(orient='records', force_ascii=False))
with open(output_path, "w", encoding="utf-8") as file:
    json.dump(records, file, ensure_ascii=False, indent=4)
print(f"Exported combined dataset to {output_path}.")

Exported combined dataset to ../datasets/combined/combined_jama.json.
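
Since the exported file is edited by hand afterwards, it can be useful to confirm that the export itself round-trips cleanly, including non-ASCII characters. The sketch below illustrates such a check on a toy dataframe written to a temporary file. The file name `toy.json` and the placeholder rows are assumptions for illustration only, and the real output path is left untouched.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Toy rows with non-ASCII characters to exercise ensure_ascii=False
toy_export = pd.DataFrame({
    "case_id": [2, 1001],
    "vignette": ["placeholder text with “curly quotes”", "placeholder text with café"],
    "diagnosis": ["placeholder A", "placeholder B"],
})

# Same conversion as the export cell: dataframe -> list of record dicts
toy_records = json.loads(toy_export.to_json(orient="records", force_ascii=False))

with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy.json"
    with open(toy_path, "w", encoding="utf-8") as file:
        json.dump(toy_records, file, ensure_ascii=False, indent=4)
    with open(toy_path, "r", encoding="utf-8") as file:
        reloaded = json.load(file)

# The reloaded records should match what was written, row for row
assert reloaded == toy_records
print(len(reloaded))  # 2
```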
