## Notebook purpose:
In this notebook we merge patient-level amyloid diagnosis labels from all 3 datasets: Mayo Reports, Cardiac Path Reports, and PYP Reports. The notebook reads the labels from each source and combines them so that higher-quality (Gold Standard) sources take precedence over lower-quality (Silver Standard) ones.

### Datasets:
**cohort patient labels**:
  * These are labels obtained from 3 sources:
    1. Cardiac Path Reports
      - AMYLOID DIAGNOSIS: Gold Standard
          - Classes:
            - POSITIVE
            - NEGATIVE
            - INDETERMINATE (reports were incomplete or did not include diagnosis)
            - CHART REVIEW (When a patient has POSITIVE and NEGATIVE reports)
    2. Mayo Reports
      - AMYLOID DIAGNOSIS: Gold Standard
          - Classes:
            - POSITIVE 
            - NEGATIVE
            - INDETERMINATE (reports were incomplete or did not include diagnosis)
            - CHART REVIEW (When a patient has POSITIVE and NEGATIVE reports)
      - AMYLOID SUBTYPE DIAGNOSIS: Gold Standard
          - Classes: 
            - TTR
            - AL
            - CHART REVIEW (When a patient has TTR and AL reports)
            - NaN (missing value)
      - TTR SUBTYPE DIAGNOSIS: Gold Standard
          - Classes: 
            - hTTR (Mayo reports include positive tests for hTTR)
            - INDETERMINATE (defaults to wTTR because Mayo reports include tests for hTTR. Negative results imply wTTR but hTTR cannot be 100% ruled out)
            - CHART REVIEW (When a patient has hTTR and INDETERMINATE reports)
            - NaN (missing value. INDETERMINATE means there was a test for hTTR and it was negative. NaN means there was no such information available)
    3. PYP Reports
      - AMYLOID DIAGNOSIS: Silver Standard
          - Classes:
            - STRONGLY SUGGESTIVE (for TTR Amyloidosis)
            - NOT SUGGESTIVE
            - EQUIVOCAL (reports were not conclusive or the patient might have AL Amyloidosis, in which case PYP is not good for diagnosis)
            - CHART REVIEW (When a patient has STRONGLY SUGGESTIVE and NOT SUGGESTIVE reports)
      - AMYLOID SUBTYPE DIAGNOSIS: Silver Standard
          - Classes: 
            - TTR (STRONGLY SUGGESTIVE implies TTR)
            - NaN (missing value)
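
The CHART REVIEW class above arises when one patient has conflicting reports within a single source. A minimal sketch of that collapse, on made-up data, might look like the following. The column name `report_label`, the helper `collapse_reports`, and the rule that a definitive report outranks an INDETERMINATE one are assumptions for illustration, not the upstream implementation.

```python
import pandas as pd


def collapse_reports(labels: pd.Series) -> str:
    """Collapse one patient's report-level labels into a single label (hypothetical helper)."""
    observed = set(labels.dropna())
    # Conflicting definitive reports cannot be resolved automatically.
    if {"POSITIVE", "NEGATIVE"} <= observed:
        return "CHART_REVIEW"
    # Assumption: a single definitive report outranks INDETERMINATE ones.
    for definitive in ("POSITIVE", "NEGATIVE"):
        if definitive in observed:
            return definitive
    return "INDETERMINATE"


# Toy report-level table: one row per report, several rows per patient.
reports = pd.DataFrame(
    {
        "ir_id": [1, 1, 2, 2, 3],
        "report_label": [
            "POSITIVE",
            "NEGATIVE",
            "NEGATIVE",
            "INDETERMINATE",
            "INDETERMINATE",
        ],
    }
)
patient_labels = reports.groupby("ir_id")["report_label"].agg(collapse_reports)
print(patient_labels)
```

Patient 1 has both a POSITIVE and a NEGATIVE report and is therefore flagged for chart review.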
            
### Output
**all_datasets_patient_amyloid_diagnosis** is a CSV of the patient-level diagnosis labels.
The label columns for this aggregation have prefix `final__`. Clashes are flagged with `CHART_REVIEW`. Aggregation rules and classes are described below:
- AMYLOID DIAGNOSIS:
  - Column Name: `final__amyloid_diagnosis`
  - Gold Standard: Cardiac Path Reports, Mayo Reports
  - Silver Standard: PYP Reports
  - Classes: 
    - POSITIVE 
    - NEGATIVE
    - INDETERMINATE (reports were incomplete or did not include diagnosis)
    - CHART REVIEW (clash between Gold Standard sources)
- AMYLOID SUBTYPE DIAGNOSIS:
  - Column Name: `final__amyloid_subtype_diagnosis`
  - Gold Standard: Mayo Reports
  - Silver Standard: PYP Reports (STRONGLY SUGGESTIVE implies TTR)
  - Classes: 
    - TTR
    - AL
    - CHART REVIEW (clash between Gold Standard)
    - NaN (missing value)
- TTR SUBTYPE DIAGNOSIS:
  - Column Name: `final__ttr_amyloid_subtype_diagnosis`
  - Gold Standard: Mayo Reports
  - Classes: 
    - hTTR (Mayo reports include positive tests for hTTR)
    - INDETERMINATE (defaults to wTTR because Mayo reports include tests for hTTR. Negative results imply wTTR but hTTR cannot be 100% ruled out)
    - CHART REVIEW (clash between Gold Standard)
    - NaN (missing value. INDETERMINATE means there was a test for hTTR and it was negative. NaN means there was no such information available)
- AMYLOID DIAGNOSIS DATE:
  - Column Name: `final__amyloid_diagnosis_date`
  - Label date is the earliest date of a definitive diagnosis from the highest-priority source that has one, in the order: Cardiac Path Reports, Mayo Reports, PYP Reports.
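
The aggregation rule for `final__amyloid_diagnosis` can be sketched on plain strings as below. This is a simplified illustration, not the notebook's function: the name `combine` and the `PYP_TO_DIAGNOSIS` mapping are hypothetical, and the sketch ignores the Mayo-only `OTHER_SITE` and `IRRELEVANT` values.

```python
from typing import Optional

DEFINITIVE = {"POSITIVE", "NEGATIVE"}
# Assumption: PYP classes map onto the final diagnosis classes like this.
PYP_TO_DIAGNOSIS = {"STRONGLY_SUGGESTIVE": "POSITIVE", "NOT_SUGGESTIVE": "NEGATIVE"}


def combine(cardiac_path: Optional[str], mayo: Optional[str], pyp: Optional[str]) -> str:
    """Combine per-source labels: Gold Standard first, Silver Standard as fallback."""
    gold = [label for label in (cardiac_path, mayo) if label in DEFINITIVE]
    # Two Gold Standard sources that disagree are flagged for manual review.
    if len(set(gold)) > 1:
        return "CHART_REVIEW"
    # Otherwise any definitive Gold Standard label wins (cardiac path listed first).
    if gold:
        return gold[0]
    # Fall back to the Silver Standard; anything non-definitive stays INDETERMINATE.
    return PYP_TO_DIAGNOSIS.get(pyp, "INDETERMINATE")


print(combine("POSITIVE", "NEGATIVE", None))
print(combine(None, "NEGATIVE", "STRONGLY_SUGGESTIVE"))
print(combine(None, None, "STRONGLY_SUGGESTIVE"))
```

Note that a definitive Mayo label overrides a contradicting PYP label without raising a clash, because only Gold Standard disagreements are flagged.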

In [10]:
import re
from copy import deepcopy
from collections import Counter
from pathlib import Path
import numpy as np
import pandas as pd

from datasets import load_dataset, load_annotations, dataset_config_mapping

pd.set_option("display.max_colwidth", None)

In [12]:
def get_patient_level_diagnosis(dataset: str) -> pd.DataFrame:
    """Load the patient-level amyloid diagnosis table for one dataset."""
    assert dataset in ["cardiac_path_reports", "mayo_labs", "pyp_reports"]
    patient_amyloid_diagnosis_path = (
        Path("/data/datasets/Amyloidosis/patient_amyloid_diagnosis")
        / dataset
        / "patient_amyloid_diagnosis.csv"
    )
    return pd.read_csv(patient_amyloid_diagnosis_path)


cp = get_patient_level_diagnosis(dataset="cardiac_path_reports")
pyp = get_patient_level_diagnosis(dataset="pyp_reports")
mayo = get_patient_level_diagnosis(dataset="mayo_labs")

cp["cp__amyloid_label"] = cp["cp__amyloid_diagnosis"].copy(deep=True)
cp["cp__amyloid_diagnosis"] = cp["cp__amyloid_diagnosis"].map(
    {1: "POSITIVE", 0: "NEGATIVE", 2: "INDETERMINATE"}
)
pyp["pyp__amyloid_label"] = pyp["pyp__amyloid_diagnosis"].copy(deep=True)
pyp["pyp__amyloid_diagnosis"] = pyp["pyp__amyloid_diagnosis"].map(
    {1: "STRONGLY_SUGGESTIVE", 0: "NOT_SUGGESTIVE", 2: "EQUIVOCAL"}
)

mayo["mayo__amyloid_label"] = mayo["mayo__amyloid_diagnosis"].map(
    {
        "NEGATIVE": 0,
        "POSITIVE": 1,
        "INDETERMINATE": 2,
        "OTHER_SITE": 3,
        "IRRELEVANT": -1,
    }
)

cp = cp[
    [
        "ir_id",
        "document_ID",
        "cp__amyloid_diagnosis",
        "cp__amyloid_label",
        "cp__amyloid_diagnosis_date",
    ]
]
pyp = pyp[
    [
        "ir_id",
        "document_ID",
        "pyp__amyloid_diagnosis",
        "pyp__amyloid_label",
        "pyp__amyloid_diagnosis_date",
    ]
]
mayo = mayo[
    [
        "ir_id",
        "document_ID",
        "mayo_ID",
        "mayo__amyloid_diagnosis",
        "mayo__amyloid_label",
        "mayo__amyloid_diagnosis_date",
        "mayo__amyloid_subtype_diagnosis",
        "mayo__amyloid_subtype_diagnosis_date",
        "mayo__ttr_amyloid_subtype_diagnosis",
        "mayo__ttr_amyloid_subtype_diagnosis_date",
    ]
]
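# Reassign instead of renaming in place: `mayo` is a column slice, and an
# in-place rename on a slice can trigger pandas' SettingWithCopyWarning.
mayo = mayo.rename(columns={"document_ID": "document_ID_mayo"})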

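# Every Mayo patient must already appear in the cardiac path or PYP tables,
# so the outer merges below do not introduce Mayo-only patients.
assert set(cp.ir_id.to_list() + pyp.ir_id.to_list()) == set(
    cp.ir_id.to_list() + pyp.ir_id.to_list() + mayo.ir_id.to_list()
)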
labels = cp.merge(pyp, how="outer", on="ir_id", suffixes=("_cp", "_pyp"))
labels = labels.merge(mayo, how="outer", on="ir_id", suffixes=("", "_mayo"))
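
The merges above assume one row per `ir_id` in each table. A possible way to make that assumption explicit is pandas' `validate` and `indicator` merge options, sketched here on toy frames (`cp_toy` and `pyp_toy` are made-up stand-ins, not the real tables):

```python
import pandas as pd

cp_toy = pd.DataFrame({"ir_id": [1, 2], "cp__amyloid_label": [1, 0]})
pyp_toy = pd.DataFrame({"ir_id": [2, 3], "pyp__amyloid_label": [1, 2]})

# validate="one_to_one" raises if either side repeats an ir_id;
# indicator=True adds a `_merge` column recording where each row came from.
merged = cp_toy.merge(
    pyp_toy, how="outer", on="ir_id", validate="one_to_one", indicator=True
)
print(merged)

# A duplicated patient row is rejected instead of silently multiplying rows.
duplicated = pd.concat([pyp_toy, pyp_toy.iloc[[0]]], ignore_index=True)
duplicate_detected = False
try:
    cp_toy.merge(duplicated, how="outer", on="ir_id", validate="one_to_one")
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)
```

Whether the real tables satisfy the one-row-per-patient assumption is not shown in this notebook, so the check is a suggestion rather than a description of existing behavior.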


In [14]:
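def get_amyloid_label(row):
    """Return (label, date) for one patient.

    Labels: 0 negative, 1 positive, 2 indeterminate, -1 chart review.
    Source priority: cardiac path, then Mayo, then PYP.
    """
    definite_diagnosis = [0, 1]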
    cp_label, cp_date = row["cp__amyloid_label"], row["cp__amyloid_diagnosis_date"]
    pyp_label, pyp_date = row["pyp__amyloid_label"], row["pyp__amyloid_diagnosis_date"]
    mayo_label, mayo_date = (
        row["mayo__amyloid_label"],
        row["mayo__amyloid_diagnosis_date"],
    )
    # Cardiac path has a definite diagnosis: use it unless Mayo has the opposite definite diagnosis.
    if cp_label in definite_diagnosis:
        if mayo_label in definite_diagnosis and mayo_label != cp_label:
            return -1, None
        else:
            return cp_label, cp_date
    elif mayo_label in definite_diagnosis:
        return mayo_label, mayo_date

    elif pyp_label in definite_diagnosis:
        return pyp_label, pyp_date

    else:
        assert cp_label == 2 or np.isnan(cp_label)
        assert pyp_label == 2 or np.isnan(pyp_label)
        assert mayo_label in [-1, 2, 3] or np.isnan(mayo_label)
        # Mayo OTHER_SITE (3) with no definite diagnosis elsewhere goes to chart review.
        if mayo_label == 3:
            return -1, None
        else:
            if cp_label == 2:
                return cp_label, cp_date
            elif mayo_label == 2:
                return mayo_label, mayo_date
            else:
                assert pyp_label == 2
                return pyp_label, pyp_date


labels[["final__amyloid_label", "final__amyloid_diagnosis_date"]] = labels.apply(
    get_amyloid_label, axis=1, result_type="expand"
)

labels["final__amyloid_diagnosis"] = labels["final__amyloid_label"].map(
    {0: "NEGATIVE", 1: "POSITIVE", 2: "INDETERMINATE", -1: "CHART_REVIEW"}
)



In [15]:
labels.columns

Index(['ir_id', 'document_ID_cp', 'cp__amyloid_diagnosis', 'cp__amyloid_label',
       'cp__amyloid_diagnosis_date', 'document_ID_pyp',
       'pyp__amyloid_diagnosis', 'pyp__amyloid_label',
       'pyp__amyloid_diagnosis_date', 'document_ID_mayo', 'mayo_ID',
       'mayo__amyloid_diagnosis', 'mayo__amyloid_label',
       'mayo__amyloid_diagnosis_date', 'mayo__amyloid_subtype_diagnosis',
       'mayo__amyloid_subtype_diagnosis_date',
       'mayo__ttr_amyloid_subtype_diagnosis',
       'mayo__ttr_amyloid_subtype_diagnosis_date', 'final__amyloid_label',
       'final__amyloid_diagnosis_date', 'final__amyloid_diagnosis'],
      dtype='object')

In [16]:
labels = labels[
    [
        "ir_id",
        "document_ID_cp",
        "cp__amyloid_diagnosis",
        "cp__amyloid_label",
        "cp__amyloid_diagnosis_date",
        "document_ID_pyp",
        "pyp__amyloid_diagnosis",
        "pyp__amyloid_label",
        "pyp__amyloid_diagnosis_date",
        "document_ID_mayo",
        "mayo_ID",
        "mayo__amyloid_diagnosis",
        "mayo__amyloid_label",
        "mayo__amyloid_diagnosis_date",
        "mayo__amyloid_subtype_diagnosis",
        "mayo__amyloid_subtype_diagnosis_date",
        "mayo__ttr_amyloid_subtype_diagnosis",
        "mayo__ttr_amyloid_subtype_diagnosis_date",
        "final__amyloid_label",
        "final__amyloid_diagnosis",
        "final__amyloid_diagnosis_date",
    ]
]


In [17]:
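def get_amyloid_subtype_label(row):
    """Return (amyloid subtype, TTR subtype) for one patient.

    Subtypes are only assigned when the final amyloid diagnosis is POSITIVE:
    Mayo is used when available, otherwise a STRONGLY_SUGGESTIVE PYP implies TTR.
    """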
    amyloid_diagnosis = row["final__amyloid_diagnosis"]
    pyp_diagnosis = row["pyp__amyloid_diagnosis"]
    mayo_subtype_diagnosis, mayo_ttr_subtype_diagnosis = (
        row["mayo__amyloid_subtype_diagnosis"],
        row["mayo__ttr_amyloid_subtype_diagnosis"],
    )
    if amyloid_diagnosis == "POSITIVE":
        if not pd.isnull(mayo_subtype_diagnosis) and mayo_subtype_diagnosis != "INDETERMINATE":
            return (
                mayo_subtype_diagnosis,
                mayo_ttr_subtype_diagnosis,
            )
        elif pyp_diagnosis == "STRONGLY_SUGGESTIVE":
            return "TTR", np.nan
        else:
            return (
                mayo_subtype_diagnosis,
                mayo_ttr_subtype_diagnosis,
            )
    elif amyloid_diagnosis == "INDETERMINATE":
        if mayo_subtype_diagnosis == "AL" and pyp_diagnosis == "EQUIVOCAL":
            return "CHART_REVIEW", np.nan
        assert pd.isnull(mayo_subtype_diagnosis), mayo_subtype_diagnosis
        assert pd.isnull(mayo_ttr_subtype_diagnosis), mayo_ttr_subtype_diagnosis
        return np.nan, np.nan
    else:
        assert amyloid_diagnosis in ["NEGATIVE", "CHART_REVIEW"]
        return np.nan, np.nan


labels[
    [
        "final__amyloid_subtype_diagnosis",
        "final__ttr_amyloid_subtype_diagnosis",
    ]
] = labels.apply(get_amyloid_subtype_label, axis=1, result_type="expand")



In [18]:
view_columns = [
    "ir_id",
    "cp__amyloid_diagnosis",
    "pyp__amyloid_diagnosis",
    "mayo__amyloid_diagnosis",
    "mayo__amyloid_subtype_diagnosis",
    "mayo__ttr_amyloid_subtype_diagnosis",
    "final__amyloid_diagnosis",
    "final__amyloid_diagnosis_date",
    "final__amyloid_subtype_diagnosis",
    "final__ttr_amyloid_subtype_diagnosis"
]

In [19]:
df = labels[view_columns].copy(deep=True)

In [20]:
df.columns

Index(['ir_id', 'cp__amyloid_diagnosis', 'pyp__amyloid_diagnosis',
       'mayo__amyloid_diagnosis', 'mayo__amyloid_subtype_diagnosis',
       'mayo__ttr_amyloid_subtype_diagnosis', 'final__amyloid_diagnosis',
       'final__amyloid_diagnosis_date', 'final__amyloid_subtype_diagnosis',
       'final__ttr_amyloid_subtype_diagnosis'],
      dtype='object')

In [21]:
label_combinations = df[
    [
        "final__amyloid_diagnosis",
        "final__amyloid_subtype_diagnosis",
        "final__ttr_amyloid_subtype_diagnosis",
        "cp__amyloid_diagnosis",
        "pyp__amyloid_diagnosis",
        "mayo__amyloid_diagnosis",
        "mayo__amyloid_subtype_diagnosis",
        "mayo__ttr_amyloid_subtype_diagnosis",
    ]
].drop_duplicates().sort_values(
    by=[
        "final__amyloid_diagnosis",
        "final__amyloid_subtype_diagnosis",
        "final__ttr_amyloid_subtype_diagnosis",
    ]
)



In [None]:
df[
    [
        "final__amyloid_diagnosis",
        "final__amyloid_subtype_diagnosis",
        "final__ttr_amyloid_subtype_diagnosis",
    ]
].drop_duplicates().sort_values(
    by=[
        "final__amyloid_diagnosis",
        "final__amyloid_subtype_diagnosis",
        "final__ttr_amyloid_subtype_diagnosis",
    ]
)
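
The cell above lists the distinct label combinations. If the number of patients per combination is also of interest, one option is a `groupby` with `dropna=False` so that missing subtypes are counted too. The sketch below uses a toy frame and assumes pandas 1.1 or later; `combo_counts` and `n_patients` are illustrative names.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "final__amyloid_diagnosis": ["POSITIVE", "POSITIVE", "NEGATIVE", "POSITIVE"],
        "final__amyloid_subtype_diagnosis": ["TTR", "TTR", np.nan, np.nan],
    }
)

# dropna=False keeps groups whose key contains NaN (e.g. NEGATIVE with no subtype).
combo_counts = (
    toy.groupby(list(toy.columns), dropna=False)
    .size()
    .reset_index(name="n_patients")
)
print(combo_counts)
```

Without `dropna=False`, rows with a missing subtype would be dropped from the counts, which would hide every NEGATIVE patient here.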

In [24]:
df.columns

Index(['ir_id', 'cp__amyloid_diagnosis', 'pyp__amyloid_diagnosis',
       'mayo__amyloid_diagnosis', 'mayo__amyloid_subtype_diagnosis',
       'mayo__ttr_amyloid_subtype_diagnosis', 'final__amyloid_diagnosis',
       'final__amyloid_diagnosis_date', 'final__amyloid_subtype_diagnosis',
       'final__ttr_amyloid_subtype_diagnosis'],
      dtype='object')

In [None]:
df.rename(columns={"cp__amyloid_diagnosis": "cardiac_path__amyloid_diagnosis"}, inplace=True)
df.head()

In [26]:
final_dataset_path = Path("/data/datasets/Amyloidosis/patient_amyloid_diagnosis/all_datasets_patient_amyloid_diagnosis.csv")
df.to_csv(final_dataset_path, index=False)
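
Before relying on the written file, it may be worth checking that the `final__` columns contain only the classes documented at the top of this notebook. The helper below is a hypothetical sketch run on a toy frame; `ALLOWED` is transcribed from the Output section and `unexpected_values` is not part of the notebook.

```python
import numpy as np
import pandas as pd

# Documented classes per output column (NaN is allowed and ignored by the check).
ALLOWED = {
    "final__amyloid_diagnosis": {"POSITIVE", "NEGATIVE", "INDETERMINATE", "CHART_REVIEW"},
    "final__amyloid_subtype_diagnosis": {"TTR", "AL", "CHART_REVIEW"},
    "final__ttr_amyloid_subtype_diagnosis": {"hTTR", "INDETERMINATE", "CHART_REVIEW"},
}


def unexpected_values(table: pd.DataFrame) -> dict:
    """Return {column: set of undocumented values} for the final label columns."""
    problems = {}
    for column, allowed in ALLOWED.items():
        observed = set(table[column].dropna().unique())
        extra = observed - allowed
        if extra:
            problems[column] = extra
    return problems


toy = pd.DataFrame(
    {
        "final__amyloid_diagnosis": ["POSITIVE", "NEGATIVE", "MAYBE"],
        "final__amyloid_subtype_diagnosis": ["TTR", np.nan, np.nan],
        "final__ttr_amyloid_subtype_diagnosis": ["hTTR", np.nan, np.nan],
    }
)
print(unexpected_values(toy))
```

An empty dictionary means every non-missing value belongs to a documented class.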