# NUS Datathon Example Pipeline
### Beatrice Brown-Mulry
### 12/06/2024

EMBED consists of a clinical dataframe (Magview) and a metadata dataframe. Each row in Magview corresponds to a unique finding while each row in Metadata corresponds to a unique image. Magview should be used to index the desired sample clinically, then Metadata can be used to index the findings to their corresponding images.

![magview_metadata](../images/magview_metadata.png)
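
To make the two-table structure concrete, here's a minimal sketch on toy data. Only the column names (`empi_anon`, `acc_anon`, `numfind`, `png_path`) come from EMBED; the values and file names are made up for illustration. It shows why merging finding-level rows straight onto image-level rows can silently duplicate images.

```python
import pandas as pd

# Toy clinical (Magview-style) table: one row per finding.
toy_clinical = pd.DataFrame({
    "empi_anon": [1, 1, 2],
    "acc_anon": [100, 100, 200],
    "numfind": [1, 2, 1],
})

# Toy metadata table: one row per image.
toy_meta = pd.DataFrame({
    "acc_anon": [100, 100, 200, 200],
    "png_path": ["a.png", "b.png", "c.png", "d.png"],
})

# Merging finding-level rows directly repeats each image once per finding
# (exam 100 has 2 findings x 2 images = 4 rows).
naive = toy_clinical.merge(toy_meta, on="acc_anon")

# Reducing the clinical table to one row per exam first keeps one row per image.
exam_level = toy_clinical.drop(columns="numfind").drop_duplicates()
clean = exam_level.merge(toy_meta, on="acc_anon")

print(len(naive), len(clean))  # 6 rows vs 4 rows
```

This is the reason the pipeline below reduces the clinical dataframe to exam-level columns before merging.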

Since data in EMBED is hierarchical, following this general pattern while working with it can help prevent some hard-to-catch errors:

![embed_logical_steps](../images/embed_logical_steps.png)

In this notebook we'll be going over a sample pipeline showing how the derived features in EMBED can be used to define a cancer vs no-cancer dataset in the clinical dataframe and index the relevant images from the metadata dataframe.

An understanding of the basic diagnostic pathway for breast cancer is important to working with EMBED. We've derived exam-level outcome variables to simplify this process, but you should keep the following graphic in your mind as you work with it. It shows the standard clinical pathway for most patients (*Not all!* People with non-standard screening risk profiles generally follow a different pathway).

![0_full_pathway](../images/0_full_pathway.png)

## Contents
- #### 0. Data Preparation
- #### 1. Define Clinical Sample
    - #### 1.1 EDA
    - #### 1.2 Sample Definition
- #### 2. Merge Image Metadata
    - #### 2.1 Identify Images with ROIs
    - #### 2.2 Identify Images with 1-to-1 Finding-to-ROI Mappings

---
# 0. Data Preparation

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import os
import matplotlib.pyplot as plt

pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 500)
pd.options.mode.chained_assignment = None  # default='warn'

In [2]:
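from typing import Optional

def dataframe_stats(df, title: Optional[str] = None):
    """Print patient, exam, and image (or finding) counts for an EMBED dataframe."""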
    if title is not None:
        print(f"\n{title}")
        
    num_patients = df.empi_anon.nunique()
    num_exams = df.acc_anon.nunique()
    
    print(f"Patients: {num_patients}")
    print(f"Exams: {num_exams}")
    
    if 'png_path' in df.columns:
        print(f"Images: {df.png_path.nunique()}\n")
    else:
        print(f"Findings: {len(df)}\n")

There are two versions of each dataframe available: the full data and a reduced column set. The reduced dataframes contain all of the commonly used columns, so they should be suitable for most people.

In [3]:
clinical_df = pd.read_csv("datathon_tables/EMBED_OpenData_NUS_Datathon_clinical_reduced.csv")
meta_df = pd.read_csv("datathon_tables/EMBED_OpenData_NUS_Datathon_metadata_reduced.csv")



In [4]:
dataframe_stats(clinical_df, "clinical data")
dataframe_stats(meta_df, "image metadata")


clinical data
Patients: 23379
Exams: 76334
Findings: 85669


image metadata
Patients: 23253
Exams: 72731
Images: 480081



---
# 1. Define Clinical Sample

## 1.1 EDA

Before we define the sample, let's do a brief EDA on the clinical dataframe.

In [5]:
clinical_df.head()

Unnamed: 0,empi_anon,age_at_study,ETHNIC_GROUP_DESC,race,ethnicity,MARITAL_STATUS_DESC,GENDER_DESC,first_3_zip,cohort_num,acc_anon,desc,screen_exam,study_date_anon,total_L_find,total_R_find,exam_laterality,exam_outcome,outcome_side,exam_birads,exam_path_severity,exam_path_desc,mass_side,asymmetry_side,arch_distortion_side,calcification_side,followup_birads,followup_path_severity,followup_bside,followup_type,earliest_dx_acc,earliest_us_acc,numfind,side,asses,path_severity,bside,procdate_anon,path1,path2,path3,tissueden
0,39849800,42.380063,Non-Hispanic or Latino,Black,Not Hispanic or Latino,Married,Female,300.0,1,1734930095489167,MG Diagnostic Mammo Bilateral,False,2012-09-11,1.0,1.0,B,,,B,,No Pathology,R,,,R,,,,,,,1,L,B,,,,,,,3.0
1,39849800,42.380063,Non-Hispanic or Latino,Black,Not Hispanic or Latino,Married,Female,300.0,1,1734930095489167,MG Diagnostic Mammo Bilateral,False,2012-09-11,1.0,1.0,B,,,B,,No Pathology,R,,,R,,,,,,,2,R,B,,,,,,,3.0
2,64383145,72.792734,"Unreported, Unknown, Unavailable",White,Unknown,Widow(er),Female,300.0,1,9568336662008394,MG Screening Bilateral w/CAD,True,2013-07-31,0.0,0.0,B,Screen Negative,B,N,,No Pathology,,,,,,,,,,,1,L,N,,,,,,,3.0
3,47229065,54.708858,Non-Hispanic or Latino,White,Not Hispanic or Latino,Married,Female,300.0,1,5874208938464010,MG Screening Bilateral w/CAD,True,2013-08-28,0.0,0.0,B,Screen Negative,B,N,,No Pathology,,,,,,,,,,,1,L,N,,,,,,,3.0
4,60286517,63.804185,"Unreported, Unknown, Unavailable",White,Unknown,Married,Female,301.0,1,3220760667053677,MG Diagnostic Left,False,2013-08-08,2.0,0.0,L,,,S,0.0,Invasive Cancer,L,,,L,,,,,,,1,L,P,,,,,,,3.0


In [6]:
clinical_df.screen_exam.value_counts(dropna=False)

screen_exam
True     61737
False    23932
Name: count, dtype: int64

> Exams in EMBED are either screening or diagnostic. Patients generally receive regular screening exams to check for a developing breast cancer. Because of this, most screening exams are (and should be) negative, with only a small proportion going on to a cancer diagnosis. Diagnostic exams are generally performed to confirm or rule out an identified finding. They can be performed as follow-ups to abnormal screening studies (a BIRADS 'A'/0 finding) or as symptomatic admissions (the patient had a symptom suggestive of a lesion, so they were scheduled directly for a diagnostic).

> Diagnostic exams commonly use paddles to spread and magnify specific regions of tissue (these images are indicated where `spot_mag == 1.0` in the image metadata). This can make their inclusion in screening-stage models dangerous, as models will form shortcuts based on the presence of the paddle. Most models should exclusively use screening images, but if you'd like to use diagnostics, we've algorithmically extracted the target tissue patches from most images with a `spot_mag` paddle. These patches can be located with the `spot_mag_png_path` column where relevant.
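
As a rough sketch of how the `spot_mag` columns might be used, the toy example below picks a path per image: the full image for non-paddle images, and the extracted patch for paddle images where one exists. The column names are from the metadata dataframe, but the values, the `model_path` column, and the assumption that missing values are NaN are ours for illustration.

```python
import numpy as np
import pandas as pd

toy_meta = pd.DataFrame({
    "png_path": ["full_1.png", "full_2.png", "full_3.png"],
    "spot_mag": [np.nan, 1.0, 1.0],
    "spot_mag_png_path": [np.nan, "patch_2.png", np.nan],
})

is_spot_mag = toy_meta["spot_mag"] == 1.0

# Screening-style models: drop every paddle image.
screening_safe = toy_meta[~is_spot_mag]

# Diagnostic-inclusive models: keep paddle images only where a patch was extracted.
has_patch = is_spot_mag & toy_meta["spot_mag_png_path"].notna()
toy_meta["model_path"] = toy_meta["png_path"]  # hypothetical helper column
toy_meta.loc[has_patch, "model_path"] = toy_meta.loc[has_patch, "spot_mag_png_path"]
usable = toy_meta[~is_spot_mag | has_patch]

print(usable[["png_path", "model_path"]])
```

Paddle images without an extracted patch (the third row here) are dropped rather than passed through as full images.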

>A number of features have been derived to make working with EMBED simpler:

>- `exam_birads`:
  > Carries the most severe BIRADS score for the exam (finding level BIRADS is carried by the column `asses`). 'Invalid' if an invalid BIRADS for that exam type (screening or diagnostic) is present.
  >
  > Possible screen values: `['A', 'N', 'B']`
  >
  > Possible diagnostic values: `['N', 'B', 'P', 'S', 'M', 'K']`
  
>- `exam_path_severity`:
  > Carries the most severe pathology finding for the exam as a numeric 0-5
  >
  > Possible values: `[0: 'Invasive Cancer', 1: 'Noninvasive Cancer', 2: 'High Risk Lesion', 3: 'Borderline Lesion', 4: 'Benign Lesion', 5: 'Non-Breast Cancer']`

>- `exam_path_desc`:
  > Carries the most severe pathology finding for the exam as a text description
  >
  > Possible values: `['Invasive Cancer', 'Noninvasive Cancer', 'High Risk Lesion', 'Borderline Lesion', 'Benign Lesion', 'Non-Breast Cancer']`
  
>- `exam_outcome`:
  > Text description of the overall exam outcome (considering any available biopsy information and follow-up diagnostics)
  >
  > Possible values: `['Screen Negative', 'Diagnostic Negative', 'Confirmed Benign', 'Confirmed Cancer', 'Interval Cancer', 'Other Cancer']`

>- `outcome_side`:
  > Exam side relevant to the `exam_outcome` (generally the same as `exam_laterality`, unless the outcome depends on a unilateral biopsy finding)
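
To illustrate how a finding-level column relates to its exam-level counterpart, here's a small sketch on toy data. It assumes (our assumption, not a description of how EMBED was actually built) that "most severe" means the lowest numeric code within an exam, and uses the code-to-description mapping listed above. The `path_severity` values are invented.

```python
import numpy as np
import pandas as pd

severity_to_desc = {
    0: "Invasive Cancer",
    1: "Noninvasive Cancer",
    2: "High Risk Lesion",
    3: "Borderline Lesion",
    4: "Benign Lesion",
    5: "Non-Breast Cancer",
}

# Toy finding-level rows: exam 1 has two biopsied findings, exam 2 has none.
toy = pd.DataFrame({
    "acc_anon": [1, 1, 2, 3],
    "path_severity": [4.0, 0.0, np.nan, 2.0],
})

# Lower code = more severe, so take the minimum per exam (NaN if no pathology).
exam_severity = toy.groupby("acc_anon")["path_severity"].min()
exam_desc = exam_severity.map(severity_to_desc)

print(pd.DataFrame({"severity": exam_severity, "desc": exam_desc}))
```

In practice you should just use the provided `exam_path_severity` and `exam_path_desc` columns; the sketch is only meant to show what "exam-level" aggregation of finding-level rows looks like.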

In [7]:
clinical_df.drop_duplicates('acc_anon').loc[clinical_df.screen_exam == True, 'exam_outcome'].value_counts(dropna=False)

exam_outcome
Screen Negative        49587
Diagnostic Negative     4752
NaN                     3067
Confirmed Benign         969
Confirmed Cancer         323
Interval Cancer           33
Other Cancer              11
Name: count, dtype: int64

>Each `exam_outcome` corresponds to a general category of exams.

>- `Screen Negatives` are screening exams with no abnormal findings that required a follow-up diagnostic.

>- `Diagnostic Negatives` are exams that had at least one abnormal (BIRADS 0) finding at screen, but were found to be negative/benign during a follow-up diagnostic.

>- `Confirmed Benigns` are exams that had at least one abnormal (BIRADS 0) finding at screen, were considered likely to be malignant at diagnostic (BIRADS 4/5), then had only benign pathology findings (high-risk lesions, borderline lesions, or benign/negative lesions) at biopsy.

>- `Confirmed Cancers` are exams that had at least one abnormal (BIRADS 0) finding at screen, were considered likely to be malignant at diagnostic (BIRADS 4/5), then had an invasive or non-invasive cancer finding at biopsy.

>- `Interval Cancers` are exams that were exclusively negative/benign at screening, but developed a cancer within the following year which was detected on a symptomatic diagnostic.

>- `Other Cancers` are exams that had a non-breast cancer pathology result on biopsy. These exams should generally be excluded as they fall outside the standard screening-risk profile.

In [8]:
clinical_df.drop_duplicates('acc_anon').loc[clinical_df.screen_exam == True, 'outcome_side'].value_counts(dropna=False)

outcome_side
B      55098
L       1833
R       1773
NaN       38
Name: count, dtype: int64

> An `outcome_side` feature has been derived to make it obvious which side of the exam is relevant to the outcome (for example, cancers don't always develop on both left/right breasts simultaneously). For most exams, this is the same as the overall exam laterality but in cases where the outcome was based on a pathology finding, this feature indicates which side(s) the biopsy results correspond to.

In [9]:
clinical_df.drop_duplicates('acc_anon')[['mass_side', 'asymmetry_side', 'arch_distortion_side', 'calcification_side']].value_counts(dropna=False)

mass_side  asymmetry_side  arch_distortion_side  calcification_side
NaN        NaN             NaN                   NaN                   57442
           B               NaN                   NaN                    5040
           NaN             NaN                   B                      2633
B          NaN             NaN                   NaN                    2459
NaN        L               NaN                   NaN                    1090
                                                                       ...  
           B               B                     B                         1
           R               R                     R                         1
                           B                     NaN                       1
           L               L                     L                         1
                                                 R                         1
Name: count, Length: 110, dtype: int64

>The `mass_side`, `asymmetry_side`, `arch_distortion_side`, and `calcification_side` features were derived to indicate where finding types were identified on exams. 'L' indicates that an exam had that finding type on the left side, 'R' indicates a finding of that type on the right side, and 'B' indicates a bilateral finding.

>**Note**: these features are not mutually exclusive, and an exam side may have multiple finding types.

>**Note**: in EMBED it's not possible to directly match a mammographic finding to a pathology finding (UNLESS that exam only has 1 finding on the relevant side). For example, for an exam with both a calcification and mass on the left side, it would not be possible to localize them to individual ROIs or determine which of the two developed into cancer. If you want a 1-to-1 mapping between finding characteristics and ROIs, you must select exams with only 1 finding per laterality (this can be determined by the `total_L_find` and `total_R_find` columns) and match them to images with ROIs.
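
Here's a small sketch of working with the side features on toy data. The helper `has_finding_on_side` is a hypothetical name of ours; the point is that 'B' has to count for both sides, and that the `total_L_find`/`total_R_find` columns tell you whether a side has a single finding.

```python
import numpy as np
import pandas as pd

def has_finding_on_side(side_col: pd.Series, side: str) -> pd.Series:
    """True where the finding type is present on `side` ('L' or 'R'), counting 'B' for both."""
    return side_col.isin([side, "B"])

toy = pd.DataFrame({
    "acc_anon": [1, 2, 3, 4],
    "mass_side": ["L", np.nan, "B", np.nan],
    "calcification_side": [np.nan, "R", "L", np.nan],
    "total_L_find": [1.0, 0.0, 2.0, 0.0],
    "total_R_find": [0.0, 1.0, 1.0, 0.0],
})

left_mass = has_finding_on_side(toy["mass_side"], "L")

# Exams where a left-side mass is the only finding on the left,
# i.e. the only case where the mass can be tied to a single left-side ROI.
left_mass_only = toy[left_mass & (toy["total_L_find"] == 1)]

print(left_mass_only["acc_anon"].tolist())
```

Exam 3 has a left-side mass too, but it also has a left-side calcification, so it's excluded.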

## 1.2 Sample Definition

In [10]:
# i only want screen-stage exams, so i'll first filter my dataframe to only contain screening exams
# `screen_exam` was derived from the exam descriptions, so this is equivalent to:
# screen_df = clinical_df[clinical_df.desc.str.contains('screen', case=False)]
screen_df = clinical_df[clinical_df.screen_exam == True]
dataframe_stats(screen_df, "screen exam clinical data")



screen exam clinical data
Patients: 20618
Exams: 58742
Findings: 61737



In [11]:
# i'm going to formulate this as a binary classification problem, with confirmed cancers and
# interval cancers as the positive class (you may choose to exclude interval cancers entirely)
# and the other outcomes as the negative class
pos_class_list = ['Confirmed Cancer', 'Interval Cancer']
neg_class_list = ['Screen Negative', 'Diagnostic Negative', 'Confirmed Benign']

In [12]:
# let's make a new column to carry the overall class for each exam as an integer
screen_df['exam_class'] = (screen_df['exam_outcome'].isin(pos_class_list)).astype(int)
# remember to drop duplicates on 'acc_anon' (exam_id) if you're checking the frequency of an exam-level feature
display(screen_df.drop_duplicates('acc_anon')[['exam_outcome', 'exam_class']].value_counts(dropna=False))
display(screen_df.drop_duplicates('acc_anon')['exam_class'].value_counts(dropna=False, normalize=True))

exam_outcome         exam_class
Screen Negative      0             49587
Diagnostic Negative  0              4752
NaN                  0              3067
Confirmed Benign     0               969
Confirmed Cancer     1               323
Interval Cancer      1                33
Other Cancer         0                11
Name: count, dtype: int64

exam_class
0    0.99394
1    0.00606
Name: proportion, dtype: float64

> For most cancer-related contexts you'll have to deal with pretty extreme class imbalances. Only about 0.6% of our screening exams are cancers.
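
To put numbers on that imbalance, here's the arithmetic using the exam counts printed above, along with one common (but by no means the only) way to compensate: inverse-frequency class weights. The weighting scheme is our suggestion, not part of EMBED.

```python
# Exam counts taken from the output above.
n_pos = 323 + 33   # Confirmed Cancer + Interval Cancer
n_total = 58742    # unique screening exams (NaN and Other Cancer outcomes still included)
n_neg = n_total - n_pos

pos_rate = n_pos / n_total

# "Balanced" inverse-frequency weights: n_total / (n_classes * n_in_class).
# Each class then contributes the same total weight to a weighted loss.
w_pos = n_total / (2 * n_pos)
w_neg = n_total / (2 * n_neg)

print(f"positive rate: {pos_rate:.5f}")
print(f"class weights: neg={w_neg:.3f}, pos={w_pos:.1f}")
```

Weights like these could be passed to a weighted loss function; resampling is another option. Either way, remember that metrics such as accuracy are close to meaningless at this prevalence.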

In [13]:
# you may have noticed we also have some exams with a NaN outcome. these are exams that did not fit any of the
# clinical criteria cleanly, so they could not be categorized. we'll exclude these cases along with the non-breast cancer exams
filt_screen_df = screen_df[screen_df.exam_outcome.isin(pos_class_list) | screen_df.exam_outcome.isin(neg_class_list)]
display(filt_screen_df.drop_duplicates('acc_anon')[['exam_outcome', 'exam_class']].value_counts(dropna=False))
display(filt_screen_df.drop_duplicates('acc_anon')['exam_class'].value_counts(dropna=False, normalize=True))

exam_outcome         exam_class
Screen Negative      0             49587
Diagnostic Negative  0              4752
Confirmed Benign     0               969
Confirmed Cancer     1               323
Interval Cancer      1                33
Name: count, dtype: int64

exam_class
0    0.993604
1    0.006396
Name: proportion, dtype: float64

In [14]:
dataframe_stats(filt_screen_df, "filtered screen dataframe")


filtered screen dataframe
Patients: 19342
Exams: 55664
Findings: 57985



In [15]:
# now i'm going to use the `outcome_side` feature to make two new columns so it's easy to tell which
# images we should match to each exam
filt_screen_df['L_class'] = 0
filt_screen_df.loc[(filt_screen_df.exam_class == 1) & filt_screen_df.outcome_side.isin(['B', 'L']), 'L_class'] = 1

filt_screen_df['R_class'] = 0
filt_screen_df.loc[(filt_screen_df.exam_class == 1) & filt_screen_df.outcome_side.isin(['B', 'R']), 'R_class'] = 1

display(filt_screen_df.drop_duplicates('acc_anon')[['exam_outcome', 'exam_class', 'L_class', 'R_class']].value_counts(dropna=False))

exam_outcome         exam_class  L_class  R_class
Screen Negative      0           0        0          49587
Diagnostic Negative  0           0        0           4752
Confirmed Benign     0           0        0            969
Confirmed Cancer     1           0        1            152
                                 1        0            146
                                          1             23
Interval Cancer      1           0        1             18
                                 1        0             15
Confirmed Cancer     1           0        0              2
Name: count, dtype: int64

In [17]:
# ensure all of our positive exams are positive on at least one side
# there may be some where we have insufficient info to verify the side of the cancer
# you may choose to retain these exams (knowing they may introduce some noise) 
# but we'll exclude them for now
filt_screen_df = filt_screen_df[(filt_screen_df.exam_class == 0.0) | ((filt_screen_df.exam_class == 1.0) & ((filt_screen_df.L_class == 1.0) | (filt_screen_df.R_class == 1.0)))]
display(filt_screen_df.drop_duplicates('acc_anon')[['exam_outcome', 'exam_class', 'L_class', 'R_class']].value_counts(dropna=False))

exam_outcome         exam_class  L_class  R_class
Screen Negative      0           0        0          49587
Diagnostic Negative  0           0        0           4752
Confirmed Benign     0           0        0            969
Confirmed Cancer     1           0        1            152
                                 1        0            146
                                          1             23
Interval Cancer      1           0        1             18
                                 1        0             15
Name: count, dtype: int64

In [34]:
# to make it easier to merge with the image metadata, i'm going to filter this data to 
# only contain exam-level (or patient-level) data so i can drop duplicate rows and 
# ensure each row maps to a unique exam
exam_cols = ['empi_anon', 'age_at_study', 'race', 'ethnicity', 'cohort_num', 'tissueden', 'acc_anon', 'desc', 'screen_exam', 'study_date_anon', 'total_L_find', 'total_R_find', 'exam_laterality', 'exam_outcome', 'exam_class', 'L_class', 'R_class', 'outcome_side', 'exam_birads', 'exam_path_severity', 'exam_path_desc', 'mass_side', 'asymmetry_side', 'arch_distortion_side', 'calcification_side']
filt_exam_df = filt_screen_df[exam_cols].drop_duplicates()

print('Num. Unique Rows:', len(filt_exam_df))
print('Num. Unique Exam IDs:', filt_exam_df.acc_anon.nunique())
print(f'Any Duplicate Accessions? {len(filt_exam_df) != filt_exam_df.acc_anon.nunique()}')

Num. Unique Rows: 55662
Num. Unique Exam IDs: 55662
Any Duplicate Accessions? False
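
If that check ever comes back `True` (for example after adding a finding-level column to `exam_cols` by mistake), a sketch like the following can help find the culprit. It runs on toy data where `tissueden` has been deliberately made inconsistent within one exam.

```python
import pandas as pd

toy = pd.DataFrame({
    "acc_anon": [1, 1, 2, 2],
    "exam_outcome": ["Screen Negative", "Screen Negative",
                     "Confirmed Cancer", "Confirmed Cancer"],
    "tissueden": [2.0, 2.0, 3.0, 4.0],  # exam 2 disagrees with itself
})

dedup = toy.drop_duplicates()

# Rows whose exam ID still appears more than once after deduplication.
dupe_mask = dedup["acc_anon"].duplicated(keep=False)

# Count distinct values per column within each duplicated exam.
n_unique = dedup[dupe_mask].groupby("acc_anon").nunique(dropna=False)
conflicting_cols = [col for col in n_unique.columns if (n_unique[col] > 1).any()]

print("Exams with duplicate rows:", n_unique.index.tolist())
print("Columns that vary within an exam:", conflicting_cols)
```

Any column listed at the end isn't truly exam-level in your sample and should either be dropped or aggregated before the merge.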


---
# 2. Merge Image Metadata

In [51]:
# i only want to use 2D images for my dataset, so i'll first filter the metadata dataframe to select these
# and exclude any spot_mag images
meta_2d = meta_df[(meta_df.FinalImageType == '2D') & (meta_df.spot_mag != 1.0)]
dataframe_stats(meta_2d, "2D image metadata")


2D image metadata
Patients: 23245
Exams: 72308
Images: 334515



In [52]:
# let's merge the image metadata onto our clinical exam dataframe on the exam ID variable 'acc_anon'
meta_drop_cols = ['empi_anon', 'cohort_num', 'study_date_anon'] # we'll drop these columns during the merge to avoid duping them
merge_df = filt_exam_df.merge(meta_2d.drop(columns=meta_drop_cols), how='inner', on='acc_anon')
dataframe_stats(merge_df, "clinical/metadata merge")


clinical/metadata merge
Patients: 19120
Exams: 52919
Images: 261451
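
Note that the inner merge went from 55662 exams to 52919, so 2743 exams had no matching 2D, non-spot-mag image row. If you'd like to audit drops like this, pandas' `validate` and `indicator` merge options are one way to do it. The sketch below uses toy frames; only `acc_anon`, `exam_class`, and `png_path` are real column names.

```python
import pandas as pd

toy_exams = pd.DataFrame({"acc_anon": [100, 200, 300], "exam_class": [0, 1, 0]})
toy_images = pd.DataFrame({
    "acc_anon": [100, 100, 200, 400],
    "png_path": ["a.png", "b.png", "c.png", "d.png"],
})

# validate="one_to_many" raises if the exam table has duplicate exam IDs;
# indicator=True adds a `_merge` column recording where each row came from.
audit = toy_exams.merge(
    toy_images, how="outer", on="acc_anon",
    validate="one_to_many", indicator=True,
)
print(audit["_merge"].value_counts())

# Keeping only the rows found in both tables reproduces the inner merge.
merged = (
    audit[audit["_merge"] == "both"]
    .drop(columns="_merge")
    .astype({"exam_class": int})  # the outer merge introduced NaNs, so restore the dtype
)
```

Here exam 300 has no images (`left_only`) and exam 400 has an image but no clinical row (`right_only`).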



In [54]:
# finally, we need to filter these so we're only retaining the images from the sides relevant to our exam outcomes
# for unilateral positive cases it can be a good idea in practice to include their contralateral negatives in the negative set
# but for the sake of demonstration we won't consider these
merge_df = merge_df[
    (merge_df.exam_class == 0) # allow any negative images
    | (
        (merge_df.exam_class == 1) # or left pos images if the left side is pos
        & (merge_df.ImageLateralityFinal == "L") 
        & (merge_df.L_class == 1)
    ) 
    | (
        (merge_df.exam_class == 1) # or right pos images if the right side is pos
        & (merge_df.ImageLateralityFinal == "R") 
        & (merge_df.R_class == 1.0)
    )
]
dataframe_stats(merge_df, "finalized clinical/metadata merge")



finalized clinical/metadata merge
Patients: 19120
Exams: 52919
Images: 260673
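
The filter above dropped 778 images (261451 to 260673), which are the contralateral images from unilateral positive exams. If you'd rather keep them as negatives, as mentioned in the comment, one possible approach is to derive an image-level label instead of filtering. This is a sketch on toy data; `image_class` is a hypothetical column name, and treating the contralateral breast as negative is an assumption you should be comfortable with clinically.

```python
import pandas as pd

toy = pd.DataFrame({
    "acc_anon": [1, 1, 2, 2],
    "exam_class": [1, 1, 0, 0],
    "L_class": [1, 1, 0, 0],
    "R_class": [0, 0, 0, 0],
    "ImageLateralityFinal": ["L", "R", "L", "R"],
})

is_left = toy["ImageLateralityFinal"] == "L"
is_right = toy["ImageLateralityFinal"] == "R"

# An image is positive only if its own side is positive;
# the contralateral image of a unilateral cancer is kept with label 0.
toy["image_class"] = (
    (is_left & (toy["L_class"] == 1)) | (is_right & (toy["R_class"] == 1))
).astype(int)

print(toy[["acc_anon", "ImageLateralityFinal", "exam_class", "image_class"]])
```

With this labeling, exam 1 contributes one positive (left) and one negative (right) image.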



> We've now matched our images to their clinical data and built a Cancer vs No-Cancer dataframe that we could use for some downstream purpose. Alternatively, we could refine it further.

## 2.1 Identify Images with ROIs

> Let's say we wanted to train a model on patches instead of full images. Let's filter this dataframe to identify our images with ROIs and see how they're distributed.

> We can identify the number of ROIs per-image with the `num_roi` column. As you can see, most images don't have ROIs and their distribution is not even across our data. Most ROIs are matched to abnormal screening exams (with very few on negative/benign screening exams or diagnostic follow-ups).

In [55]:
display(merge_df[merge_df.num_roi > 0].exam_class.value_counts(dropna=False))
display(merge_df[merge_df.num_roi > 0].exam_class.value_counts(dropna=False, normalize=True))

exam_class
0    3087
1     223
Name: count, dtype: int64

exam_class
0    0.932628
1    0.067372
Name: proportion, dtype: float64

> For now we'll just require all of our images to have >=1 ROI, but in a proper study you'd likely need to generate some additional negative patches. Otherwise we've heavily skewed the class distribution of our dataset (about 6.7% of these images are positive, versus about 0.6% of screening exams), and any results/metrics we get from a downstream model are unlikely to be useful in practice.

In [56]:
roi_df = merge_df[merge_df.num_roi > 0]
display(roi_df.loc[roi_df.exam_class == 0, ['exam_outcome', 'num_roi']].value_counts())
display(roi_df.loc[roi_df.exam_class == 1, ['exam_outcome', 'num_roi']].value_counts())

exam_outcome         num_roi
Diagnostic Negative  1          2057
Confirmed Benign     1           479
Diagnostic Negative  2           368
Screen Negative      1            66
Confirmed Benign     2            56
Diagnostic Negative  3            33
                     4            10
Confirmed Benign     3             6
Screen Negative      2             6
                     3             4
Confirmed Benign     4             1
Diagnostic Negative  6             1
Name: count, dtype: int64

exam_outcome      num_roi
Confirmed Cancer  1          186
                  2           28
                  3            6
Interval Cancer   1            2
Confirmed Cancer  4            1
Name: count, dtype: int64

This dataframe is now ready for downstream tasks. Please see the **EMBED ROI Reference** and the accompanying notebook for more information on using ROIs.

## 2.2 Identify Images with 1-to-1 Finding-to-ROI Mappings

If you want to maximize confidence in your ROI characteristics, you may want to enforce a 1-to-1 mapping requirement. You can do this by requiring each image to have exactly 1 ROI and the corresponding exam side to have at most 1 finding (this is tracked by the `total_L_find` and `total_R_find` columns).

In [61]:
one_to_one_df = merge_df[
    # require all images to have exactly 1 ROI
    # you could try relaxing this to allow multiple ROIs (though this may introduce noise to the classes)
    (merge_df.num_roi == 1)
    & (
        # only allow left images from exams with 1 ROI and 0/1 left-side findings
        ((merge_df.total_L_find <= 1) & (merge_df.ImageLateralityFinal == "L"))
        # only allow right images from exams with 1 ROI and 0/1 right-side findings
        | ((merge_df.total_R_find <= 1) & (merge_df.ImageLateralityFinal == "R"))
    )
]
one_to_one_df[['exam_class', 'exam_outcome']].value_counts(dropna=False)

exam_class  exam_outcome       
0           Diagnostic Negative    1884
            Confirmed Benign        416
1           Confirmed Cancer        166
0           Screen Negative          66
1           Interval Cancer           2
Name: count, dtype: int64