## Notebook purpose:
In this notebook we aggregate patient level amyloid diagnosis data for our cohort. The notebook reads labels from various sources and combines them in a way that respects the quality of the different sources.

### Datasets:
**all datasets patient amyloid diagnosis**:
  - These are labels obtained from 3 sources:
    1. Labels from Mayo Reports, Cardiac Path Reports, PYP Reports. These labels are provided and have already been aggregated in the notebook `merge__cp_pyp_mayo.ipynb` and are saved in the file `all_datasets_patient_amyloid_diagnosis.csv`. The aggregation rules are described below. Label date is taken as the earliest date of a definitive diagnosis, with the following order of priority: Cardiac Path Reports, Mayo Reports, PYP Reports. The label columns for this aggregation have prefix `final__`: `final__amyloid_diagnosis`, `final__amyloid_subtype_diagnosis`, `final__ttr_amyloid_subtype_diagnosis`, `final__amyloid_diagnosis_date`
      - AMYLOID DIAGNOSIS:
        - Gold Standard: Cardiac Path Reports, Mayo Reports
        - Silver Standard: PYP Reports
        - Classes: 
          - POSITIVE 
          - NEGATIVE
          - INDETERMINATE (reports were incomplete or did not include diagnosis)
          - CHART REVIEW (clash between Gold Standard)
      - AMYLOID SUBTYPE DIAGNOSIS:
        - Gold Standard: Mayo Reports
        - Silver Standard: PYP Reports (STRONGLY SUGGESTIVE implies ATTR)
        - Classes: 
          - TTR
          - AL
          - CHART REVIEW (clash between Gold Standard)
          - NaN (missing value)
      - TTR SUBTYPE DIAGNOSIS:
        - Gold Standard: Mayo Reports
        - Classes: 
          - hTTR (Mayo reports include positive tests for hTTR)
          - INDETERMINATE (defaults to wTTR because Mayo reports include tests for hTTR. Negative results imply wTTR but hTTR cannot be 100% ruled out)
          - CHART REVIEW (clash between Gold Standard)
          - NaN (missing value. INDETERMINATE means there was a test for hTTR and it was negative. NaN means there was no such information available) 
    2. Labels from clinical chart review for TTR
      - Gold Standard
      - Classes: 
        - hTTR (gets mapped to hTTR)
        - wTTR (gets mapped to INDETERMINATE)
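        - TTR - w/u pending (kept as `TTR - w/u pending` by the mapping code below; treated as inconclusive when it disagrees with an existing label)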
        - NaN (missing value)

    3. Labels from clinical chart review for AL
      - Gold Standard
      - Classes: 
        - AL (gets mapped to AL)
        - NaN (missing value)

  - Aggregation process:
    1. We combine sources 2 and 3 first, because they don't overlap. This yields a dataframe with columns: `ir_id`, `chart_reviews__amyloid_diagnosis`, `chart_reviews__amyloid_subtype_diagnosis`, `chart_reviews__ttr_amyloid_subtype_diagnosis`, `chart_reviews__amyloid_diagnosis_date`
    2. We combine the `chart_reviews__` columns with the `final__` columns to yield `label__amyloid_diagnosis`, `label__amyloid_subtype_diagnosis`, `label__ttr_amyloid_subtype_diagnosis`, `label__amyloid_diagnosis_date`
    3. Clashes are flagged with `CHART_REVIEW`
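
The gold/silver precedence described above can be sketched as follows. This is a minimal illustration and not the code in `merge__cp_pyp_mayo.ipynb`: the helper name `aggregate_amyloid_diagnosis` is hypothetical, and it assumes each source has already been reduced to `POSITIVE`, `NEGATIVE`, `INDETERMINATE`, or `None` (no report), including the PYP result.

```python
DEFINITIVE = {"POSITIVE", "NEGATIVE"}


def aggregate_amyloid_diagnosis(cardiac_path, mayo, pyp):
    """Combine one patient's per-source labels (hypothetical helper).

    Gold sources (cardiac path, Mayo) take precedence over the silver
    source (PYP). A POSITIVE/NEGATIVE clash between the gold sources is
    sent to chart review.
    """
    gold = {label for label in (cardiac_path, mayo) if label in DEFINITIVE}
    if len(gold) == 2:
        return "CHART_REVIEW"
    if len(gold) == 1:
        return next(iter(gold))
    if pyp in DEFINITIVE:
        return pyp
    # Reports exist, but none of them gives a definitive answer
    if any(label is not None for label in (cardiac_path, mayo, pyp)):
        return "INDETERMINATE"
    return None
```

Under these assumptions a PYP result is only used when neither gold source gives a definitive answer.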
          
**Amyloidosis Patients Cohort Entry** is the list of patients in our cohort. Includes a column that indicates if a patient has been prescribed **Tafamidis**. This drug is prescribed for TTR Amyloidosis (hTTR and wTTR).
  - Gold Standard
  - Classes: 
    - 1 (Patient is prescribed Tafamidis. Gets mapped to Amyloid Diagnosis = POSITIVE, Amyloid Subtype = TTR, TTR Subtype = NaN)
    - 0 (missing value. The Patient is either not prescribed Tafamidis, or we don't have the record.)
   
### Output
1. **cohort patient labels** is the dataframe of labels for patients in our cohort. We created labels from 2 sources, **all datasets patient amyloid diagnosis** and **Amyloidosis Patients Cohort Entry**. We want to link patient level data to each patient in the cohort. Patients with no labels are removed. The `label__amyloid_diagnosis`, `label__amyloid_subtype_diagnosis`, `label__ttr_amyloid_subtype_diagnosis`, `label__amyloid_diagnosis_date` columns have the label obtained from merging labels obtained from these sources. Again, clashes are flagged with `CHART_REVIEW`.

2. **cohort chart reviews** is an empty dataframe for the patients requiring further chart review. We save it as a `.csv` and as an `.xlsx`.


In [1]:
import numpy as np
import pandas as pd
from pathlib import Path


pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", None)


In [4]:
# The path to the Amyloid data
BASE = Path("/data/datasets/Amyloidosis/")
PULL_2023 = BASE / "2023 pull"


# A file with chart review to obtain diagnoses in clinic for TTR patients
clinic_cohort_path = BASE / "Amyloid Clinic cohort.xlsx"

# A file with chart review to obtain diagnoses in clinic for AL patients
amyloid_program_al_path = BASE / "Amyloid program AL.xlsx"

# The cohort file to which we add diagnosis info - includes TAFAMIDIS prescription
cohort_entry_path = BASE / "Amyloidosis Patients Cohort Entry.csv"
cohort_entry_2023_path = PULL_2023 / "Amyloidosis Patients Cohort Entry 2023.csv"

demographics_path = BASE / "Amyloidosis Patients Demographics v2.csv"
demographics_2023_path = PULL_2023 / "Amyloidosis Patients Demographics 2023.csv"

comorbidities_path = PULL_2023 / "Amyloidosis Patients Comorbidities 2023.csv"
hf_subtype_2023_path = PULL_2023 / "Amyloidosis Patients HF_Subtype 2023.csv"
hf_subtype_path = BASE / "Amyloidosis Patients HF_Subtype.csv"

echomaster_path = BASE / "Amyloidosis Patients EchoMaster.csv"
echomaster_2023_path = PULL_2023 / "Amyloidosis Patients EchoMaster 2023.csv"

echosyngo_path = BASE / "Amyloidosis Patients EchoSyngo.csv"
echosyngo_2023_path = PULL_2023 / "Amyloidosis Patients EchoSyngo 2023.csv"

icd_codes_path = BASE / "Amyloidosis Patients ICD Codes.csv"
icd_codes_2023_path = PULL_2023 / "Amyloidosis Patients ICD Codes 2023.csv"

outpt_encounters_path = BASE / "Amyloidosis Patients Outpt Clinic Encounters.csv"
outpt_encounters_2023_path = PULL_2023 / "Amyloidosis Patients Outpt Clinic Encounters 2023.csv"

discharge_summary_path = BASE / "Amyloidosis Patients Discharge Summary Notes.csv"
discharge_summary_2023_path = PULL_2023 / "Amyloidosis Patients Discharge Summary Notes 2023.csv"

# Patient level diagnoses from cardiac path reports, pyp reports, and mayo labs - GOLD STANDARD for amyloid diagnosis
amyloid_diagnosis_labels_path = (
    BASE
    / "patient_amyloid_diagnosis"
    / "all_datasets_patient_amyloid_diagnosis.csv"
)
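
With this many input paths, it can help to check that the files exist before any loading starts. The sketch below is one way to do that; `find_missing` is a hypothetical helper, and the demo uses a temporary directory in place of `BASE`.

```python
import tempfile
from pathlib import Path


def find_missing(paths):
    """Return the paths that do not exist on disk."""
    return [p for p in paths if not Path(p).exists()]


# Demo on a temporary directory standing in for BASE
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "present.csv"
    present.write_text("ir_id|value\n")
    absent = Path(tmp) / "absent.csv"
    missing = find_missing([present, absent])

print([p.name for p in missing])
```

In the notebook, the same helper could be called with the list of `*_path` variables defined above.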


In [4]:
df = pd.read_excel(PULL_2023 / "Amyloid_Clinic_Anna.xlsx")

In [None]:
keep_cols = [
    'ir_id', 'heart_failure', 'heart_failure_date', 'hypertension', 'hypertension_date',
]

In [16]:
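# `error_bad_lines` was removed in pandas 2.0; `on_bad_lines="error"` (pandas >= 1.3) keeps the same behavior
df = pd.read_csv(demographics_path, sep="|", on_bad_lines="error")
df.drop(df.tail(2).index, inplace=True)

df_2023 = pd.read_csv(demographics_2023_path, sep="|", on_bad_lines="error")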
df_2023.drop(df_2023.tail(2).index, inplace=True)
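
The two trailing rows dropped above come from the SQL export. A minimal sketch of what they do to a frame, using made-up data (the footer lines and values below are hypothetical): because the footer text lands in the first column, `ir_id` is read as text and has to be converted back to integers after the footer is removed.

```python
import io

import pandas as pd

# Hypothetical pipe-delimited export; the last two lines mimic a SQL footer
raw = "ir_id|Age_cohort\n101|67\n102|74\n-- footer line 1\n-- footer line 2\n"

toy = pd.read_csv(io.StringIO(raw), sep="|")

# Drop the two footer rows, then restore the integer id column
clean = toy.iloc[:-2].copy()
clean["ir_id"] = clean["ir_id"].astype(int)

print(clean)
```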



In [18]:
df.columns

Index(['ir_id', 'Age_cohort', 'Gender_EDW', 'Race_EDW', 'Ethnicity_EDW',
       'race_ethncty_combined', 'Insurance_EDW_cohort',
       'Insurance_Mapped_cohort'],
      dtype='object')

In [24]:
labeled_cohort_path = BASE / "Amyloidosis Patients Cohort Entry - Labeled V3.csv"
labeled_cohort = pd.read_csv(labeled_cohort_path)

In [30]:
cases = labeled_cohort[labeled_cohort.patient_group__amyloid_cases == 1]

In [31]:
l = set(cases.ir_id.values)

### Load Patient Level Diagnosis Data

In [3]:
# Read the patient level diagnosis data
# This CSV was created in `merge__cp_pyp_mayo.ipynb`
diagnoses = pd.read_csv(amyloid_diagnosis_labels_path)


# we keep all columns so that we know which set of records the final label came from
# cardiac path reports, pyp reports, and mayo labs

diagnoses["final__amyloid_diagnosis_date"] = pd.to_datetime(
    diagnoses["final__amyloid_diagnosis_date"]
)



In [None]:
print(
    f'We have {len(diagnoses["ir_id"].unique())} patients with a gold standard amyloid diagnosis.'
)

print(diagnoses["final__amyloid_diagnosis"].value_counts(dropna=False))

print(diagnoses["final__amyloid_subtype_diagnosis"].value_counts(dropna=False))

print(diagnoses["final__ttr_amyloid_subtype_diagnosis"].value_counts(dropna=False))



### Load Clinic Cohort (Includes Chart Review Labels)

In [5]:
chart_reviews = pd.read_excel(clinic_cohort_path)
assert chart_reviews.shape[0] == len(
    chart_reviews["ir_id"].unique()
), "Some patients have more than one record"
# add prefix 'chart_reviews__' to each column except for 'ir_id'
chart_reviews = pd.concat(
    [
        chart_reviews[chart_reviews.columns[0]],
        chart_reviews[chart_reviews.columns[1:]].add_prefix("chart_reviews__"),
    ],
    axis=1,
)
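
The `concat` above relies on `ir_id` being the first column. A minimal alternative sketch that does not depend on column order passes a function to `DataFrame.rename`; the toy frame below is made up, with column names borrowed from this notebook.

```python
import pandas as pd

# Toy frame; ir_id is deliberately not the first column
toy = pd.DataFrame(
    {
        "Amyloid_type": ["hTTR"],
        "ir_id": [1],
        "Date_of_Diagnosis": ["2020-01-01"],
    }
)

# Prefix every column except the patient id
prefixed = toy.rename(
    columns=lambda c: c if c == "ir_id" else f"chart_reviews__{c}"
)

print(list(prefixed.columns))
```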


Note that the chart review data is all ATTR amyloid. We have a TTR subtype label for every patient except 6, whose TTR subtype is pending. In other words, patients in this data are positive for amyloid, the subtype is TTR, and the TTR subtype is provided by the column `chart_reviews__Amyloid_type`.

In [None]:
print(
    f'We have {len(chart_reviews["ir_id"].unique())} patients with TTR Amyloid Subtype from Chart review.'
)

chart_reviews["chart_reviews__Amyloid_type"].value_counts(dropna=False)


### Load Amyloid program AL data

In [None]:
amyloid_program_al = pd.read_excel(amyloid_program_al_path)
amyloid_program_al.drop_duplicates(inplace=True)
amyloid_program_al.rename(
    columns={"Amyloid_type": "chart_reviews__Amyloid_type"}, inplace=True
)

print(
    f'We have {len(amyloid_program_al["ir_id"].unique())} patients with AL Amyloid Subtype from Chart review.'
)
amyloid_program_al["chart_reviews__Amyloid_type"].value_counts()


### Merge Chart Review Subtype data

In [None]:
chart_reviews = pd.concat(
    [chart_reviews.copy(deep=True), amyloid_program_al], axis=0, ignore_index=True
)
assert chart_reviews.shape[0] == len(
    chart_reviews["ir_id"].unique()
), "Some patients have more than one record"
# note: only `Amyloid_type` from the AL file was renamed with the 'chart_reviews__' prefix; its other columns keep their original names

print(
    f'We have {len(chart_reviews["ir_id"].unique())} patients with Amyloid Subtype from Chart review.'
)
chart_reviews["chart_reviews__Amyloid_type"].value_counts()


### Merge Diagnosis and Clinic Cohort

We use an outer merge here because some patients are present in the clinic cohort but not in the diagnosis data.

In [9]:
cohort_labels = diagnoses.merge(chart_reviews, on="ir_id", how="outer")
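
To see how many patients come from each side of an outer merge, `merge` accepts `indicator=True`, which adds a `_merge` column. A small sketch on made-up frames (the values are hypothetical; only the column names come from this notebook):

```python
import pandas as pd

left = pd.DataFrame(
    {"ir_id": [1, 2], "final__amyloid_diagnosis": ["POSITIVE", "NEGATIVE"]}
)
right = pd.DataFrame(
    {"ir_id": [2, 3], "chart_reviews__Amyloid_type": ["wTTR", "AL"]}
)

merged = left.merge(right, on="ir_id", how="outer", indicator=True)

# Count rows found only in the left frame, only in the right frame, or in both
source_counts = merged["_merge"].astype(str).value_counts().to_dict()
print(source_counts)
```

The `_merge` column would need to be dropped before saving if it is not wanted in the output file.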


In [10]:
# Rename columns
cohort_labels.rename(
    columns={
        "chart_reviews__Date_of_Diagnosis": "chart_reviews__amyloid_diagnosis_date",
    },
    inplace=True,
)

# We map chart review values to label
def get_chart_review_amyloid_diagnosis(row):
    amyloid_type = row["chart_reviews__Amyloid_type"]
    if amyloid_type == "hTTR":
        return "POSITIVE", "TTR", "HTTR"
    elif amyloid_type == "wTTR":
        return "POSITIVE", "TTR", "INDETERMINATE"
    elif amyloid_type == "TTR - w/u pending":
        return "POSITIVE", "TTR", "TTR - w/u pending"
    elif amyloid_type == "AL":
        return "POSITIVE", "AL", np.nan
    else:
        return np.nan, np.nan, np.nan


cohort_labels[
    [
        "chart_reviews__amyloid_diagnosis",
        "chart_reviews__amyloid_subtype_diagnosis",
        "chart_reviews__ttr_amyloid_subtype_diagnosis",
    ]
] = cohort_labels.apply(
    lambda row: get_chart_review_amyloid_diagnosis(row), axis=1, result_type="expand"
)
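
The same mapping can also be written as a lookup table, which keeps all class strings in one place. This is a sketch of an equivalent formulation, not a replacement used elsewhere in the notebook; `CHART_REVIEW_LABELS` and `lookup_chart_review_labels` are hypothetical names.

```python
import math

# (amyloid diagnosis, amyloid subtype, TTR subtype) for each chart review value
CHART_REVIEW_LABELS = {
    "hTTR": ("POSITIVE", "TTR", "HTTR"),
    "wTTR": ("POSITIVE", "TTR", "INDETERMINATE"),
    "TTR - w/u pending": ("POSITIVE", "TTR", "TTR - w/u pending"),
    "AL": ("POSITIVE", "AL", math.nan),
}
NO_LABEL = (math.nan, math.nan, math.nan)


def lookup_chart_review_labels(amyloid_type):
    """Return the label triple for a chart review value, or all-NaN if unknown."""
    return CHART_REVIEW_LABELS.get(amyloid_type, NO_LABEL)


print(lookup_chart_review_labels("wTTR"))
```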



In [None]:
cohort_labels["chart_reviews__amyloid_diagnosis"].value_counts(dropna=False)


In [None]:
cohort_labels["chart_reviews__amyloid_subtype_diagnosis"].value_counts(dropna=False)


In [None]:
cohort_labels["chart_reviews__ttr_amyloid_subtype_diagnosis"].value_counts(dropna=False)


The two functions below follow the same decision tree. The logic compares the values from labels and chart review data. `check_label_consistency` returns integer values to help us sift through agreements and disagreements. `combine_labels` returns labels and a date given the information provided by both sources.

In [14]:
def check_label_consistency(row):
    label_CA, label_CA_subtype, label_ATTR_subtype, label_date = (
        row["final__amyloid_diagnosis"],
        row["final__amyloid_subtype_diagnosis"],
        row["final__ttr_amyloid_subtype_diagnosis"],
        row["final__amyloid_diagnosis_date"],
    )
    (
        chart_review_CA,
        chart_review_CA_subtype,
        chart_review_ATTR_subtype,
        chart_review_date,
    ) = (
        row["chart_reviews__amyloid_diagnosis"],
        row["chart_reviews__amyloid_subtype_diagnosis"],
        row["chart_reviews__ttr_amyloid_subtype_diagnosis"],
        row["chart_reviews__amyloid_diagnosis_date"],
    )
    if pd.isna(chart_review_CA):
        return np.nan

    assert chart_review_CA == "POSITIVE"

    if pd.isna(label_CA) or label_CA in [
        "INDETERMINATE",
        "CHART_REVIEW",
    ]:
        # we can keep clinic
        # 1: chart review provides new diagnosis
        return 1
    elif label_CA == "NEGATIVE":
        # diagnosis clash
        # -1: chart review disagrees on diagnosis
        return -1
    else:
        # label is Positive
        assert label_CA == "POSITIVE", (label_CA, chart_review_CA, row["ir_id"])

        # Check the subtypes

        # if label has no subtype,
        # 2: chart review provides new subtype
        if pd.isna(label_CA_subtype) or label_CA_subtype in [
            "INDETERMINATE",
            "CHART_REVIEW",
        ]:
            return 2

        # if label subtype is AL, check if chart review is AL or TTR
        if label_CA_subtype == "AL":

            # full agreement for AL
            if chart_review_CA_subtype == "AL":
                return 0
            # -2: chart review disagrees on subtype
            assert chart_review_CA_subtype == "TTR"
            return -2

        # if label subtype is TTR, check if chart review is AL or TTR.
        elif label_CA_subtype == "TTR":
            # -2: chart review disagrees on subtype
            if chart_review_CA_subtype == "AL":
                return -2

            # agreement for TTR, check agreement on TTR subtypes
            if chart_review_CA_subtype == "TTR":
                # check TTR subtype
                # 3: chart review provides new ttr subtype
                if pd.isna(label_ATTR_subtype) or label_ATTR_subtype == "CHART_REVIEW":
                    return 3
                elif label_ATTR_subtype != chart_review_ATTR_subtype:
                    # -3: chart review disagrees on ttr subtype.
                    # When chart review is "TTR - w/u pending" it is indeterminate and
                    # combine_labels keeps the label, but we still flag a disagreement.
                    return -3
                else:
                    # 0: total agreement
                    assert label_ATTR_subtype == chart_review_ATTR_subtype, (
                        label_ATTR_subtype,
                        chart_review_ATTR_subtype,
                        row["ir_id"],
                    )
                    return 0


def combine_labels(row):
    label_CA, label_CA_subtype, label_ATTR_subtype, label_date = (
        row["final__amyloid_diagnosis"],
        row["final__amyloid_subtype_diagnosis"],
        row["final__ttr_amyloid_subtype_diagnosis"],
        row["final__amyloid_diagnosis_date"],
    )
    (
        chart_review_CA,
        chart_review_CA_subtype,
        chart_review_ATTR_subtype,
        chart_review_date,
    ) = (
        row["chart_reviews__amyloid_diagnosis"],
        row["chart_reviews__amyloid_subtype_diagnosis"],
        row["chart_reviews__ttr_amyloid_subtype_diagnosis"],
        row["chart_reviews__amyloid_diagnosis_date"],
    )
    if pd.isna(chart_review_CA):
        return (label_CA, label_CA_subtype, label_ATTR_subtype, label_date)

    assert chart_review_CA == "POSITIVE"

    if pd.isna(label_CA) or label_CA in [
        "INDETERMINATE",
        "CHART_REVIEW",
    ]:
        # we can keep clinic
        # 1: chart review provides new diagnosis
        return (
            chart_review_CA,
            chart_review_CA_subtype,
            chart_review_ATTR_subtype,
            chart_review_date if pd.notna(chart_review_date) else label_date,
        )
    elif label_CA == "NEGATIVE":
        # diagnosis clash
        # -1: chart review disagrees on diagnosis
        return ("CHART_REVIEW", np.nan, np.nan, pd.NaT)

    else:
        # label is Positive
        assert label_CA == "POSITIVE", (label_CA, chart_review_CA, row["ir_id"])

        # Check the subtypes

        # if label has no subtype,
        # 2: chart review provides new subtype
        if pd.isna(label_CA_subtype) or label_CA_subtype in [
            "INDETERMINATE",
            "CHART_REVIEW",
        ]:
            return (
                label_CA,
                chart_review_CA_subtype,
                chart_review_ATTR_subtype,
                chart_review_date,
            )

        # if label subtype is AL, check if chart review is AL or TTR
        if label_CA_subtype == "AL":

            # full agreement for AL
            if chart_review_CA_subtype == "AL":
                return (label_CA, label_CA_subtype, label_ATTR_subtype, label_date)

            # -2: chart review disagrees on subtype
            assert chart_review_CA_subtype == "TTR"
            return (label_CA, "CHART_REVIEW", np.nan, label_date)

        # if label subtype is TTR, check if chart review is AL or TTR.
        elif label_CA_subtype == "TTR":
            # -2: chart review disagrees on subtype
            if chart_review_CA_subtype == "AL":
                return (label_CA, "CHART_REVIEW", np.nan, label_date)
            # agreement for TTR, check agreement on TTR subtypes
            if chart_review_CA_subtype == "TTR":
                # check TTR subtype
                # 3: chart review provides new ttr subtype
                if pd.isna(label_ATTR_subtype) or label_ATTR_subtype == "CHART_REVIEW":
                    return (
                        label_CA,
                        label_CA_subtype,
                        chart_review_ATTR_subtype,
                        label_date,
                    )

                elif label_ATTR_subtype != chart_review_ATTR_subtype:
                    # -3: chart review disagrees on ttr subtype
                    if chart_review_ATTR_subtype == "TTR - w/u pending":
                        # chart review is indeterminate, so keep label.
                        return (
                            label_CA,
                            label_CA_subtype,
                            label_ATTR_subtype,
                            label_date,
                        )
                    return (label_CA, label_CA_subtype, "CHART_REVIEW", label_date)
                else:
                    # 0: total agreement
                    assert label_ATTR_subtype == chart_review_ATTR_subtype, (
                        label_ATTR_subtype,
                        chart_review_ATTR_subtype,
                        row["ir_id"],
                    )
                    return (label_CA, label_CA_subtype, label_ATTR_subtype, label_date)
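
Because `check_label_consistency` and `combine_labels` walk the same decision tree, one way to keep them from drifting apart is a single function that returns both the consistency code and the merged labels. The sketch below is a hypothetical refactor (`resolve` and `is_missing` are not part of this notebook). It works on plain `(diagnosis, subtype, ttr_subtype)` tuples, leaves out the date handling, and assumes that the label diagnosis is `POSITIVE` once the subtype checks are reached and that subtypes are either `AL` or `TTR`.

```python
import math

UNRESOLVED = {"INDETERMINATE", "CHART_REVIEW"}


def is_missing(value):
    """True for None or NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def resolve(label, review):
    """Return (consistency_code, merged_labels) for one patient."""
    l_ca, l_sub, l_ttr = label
    r_ca, r_sub, r_ttr = review

    if is_missing(r_ca):
        return None, label  # no chart review: keep the label
    if is_missing(l_ca) or l_ca in UNRESOLVED:
        return 1, review  # chart review provides new diagnosis
    if l_ca == "NEGATIVE":
        return -1, ("CHART_REVIEW", None, None)  # diagnosis clash
    if is_missing(l_sub) or l_sub in UNRESOLVED:
        return 2, (l_ca, r_sub, r_ttr)  # chart review provides new subtype
    if l_sub != r_sub:
        return -2, (l_ca, "CHART_REVIEW", None)  # subtype clash
    if l_sub == "AL":
        return 0, label  # full agreement for AL
    # Both sources say TTR: compare the TTR subtypes
    if is_missing(l_ttr) or l_ttr == "CHART_REVIEW":
        return 3, (l_ca, l_sub, r_ttr)  # chart review provides new ttr subtype
    if l_ttr != r_ttr:
        if r_ttr == "TTR - w/u pending":
            return -3, label  # chart review is indeterminate: keep the label
        return -3, (l_ca, l_sub, "CHART_REVIEW")
    return 0, label  # total agreement


print(resolve(("POSITIVE", "AL", None), ("POSITIVE", "TTR", "HTTR")))
```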


In [15]:
cohort_labels["merge_chart_reviews_consistency"] = cohort_labels.apply(
    lambda row: check_label_consistency(row), axis=1
)
cohort_labels["merge_chart_reviews_consistency_description"] = cohort_labels[
    "merge_chart_reviews_consistency"
].map(
    {
        0: "total agreement",
        1: "chart review provides new diagnosis",
        2: "chart review provides new subtype",
        3: "chart review provides new ttr subtype",
        -1: "chart review disagrees on diagnosis",
        -2: "chart review disagrees on subtype",
        -3: "chart review disagrees on ttr subtype",
    }
)

cohort_labels[
    [
        "label__amyloid_diagnosis",
        "label__amyloid_subtype_diagnosis",
        "label__ttr_amyloid_subtype_diagnosis",
        "label__amyloid_diagnosis_date",
    ]
] = cohort_labels.apply(lambda row: combine_labels(row), axis=1, result_type="expand")



In [None]:
view_columns_2 = [
    "ir_id",
    "final__amyloid_diagnosis",
    "final__amyloid_subtype_diagnosis",
    "final__ttr_amyloid_subtype_diagnosis",
    "final__amyloid_diagnosis_date",
    "chart_reviews__amyloid_diagnosis",
    "chart_reviews__amyloid_subtype_diagnosis",
    "chart_reviews__ttr_amyloid_subtype_diagnosis",
    "chart_reviews__amyloid_diagnosis_date",
    "merge_chart_reviews_consistency",
    "merge_chart_reviews_consistency_description",
    "label__amyloid_diagnosis",
    "label__amyloid_subtype_diagnosis",
    "label__ttr_amyloid_subtype_diagnosis",
    "label__amyloid_diagnosis_date",
]

view_columns = [
    "ir_id",
    "merge_chart_reviews_consistency",
    "merge_chart_reviews_consistency_description",
    "final__amyloid_diagnosis",
    "chart_reviews__amyloid_diagnosis",
    "label__amyloid_diagnosis",
    "final__amyloid_subtype_diagnosis",
    "chart_reviews__amyloid_subtype_diagnosis",
    "label__amyloid_subtype_diagnosis",
    "final__ttr_amyloid_subtype_diagnosis",
    "chart_reviews__ttr_amyloid_subtype_diagnosis",
    "label__ttr_amyloid_subtype_diagnosis",
    "final__amyloid_diagnosis_date",
    "chart_reviews__amyloid_diagnosis_date",
    "label__amyloid_diagnosis_date",
]

cohort_labels[
    (cohort_labels["merge_chart_reviews_consistency"] < 0)
][view_columns]



The following cell can be used to look at all cases where we merged labels with chart review data. Just change `.head()` for the part of the dataframe you want to visualize.

In [None]:
cohort_labels[cohort_labels["merge_chart_reviews_consistency"].notna()][
    view_columns
].sort_values(
    by=[
        "merge_chart_reviews_consistency",
        "chart_reviews__amyloid_diagnosis",
        "chart_reviews__amyloid_subtype_diagnosis",
        "chart_reviews__ttr_amyloid_subtype_diagnosis",
    ]
).head()


#### Counting the patients after this merge

In [None]:
cohort_labels["merge_chart_reviews_consistency_description"].value_counts()


In [None]:
cohort_labels["label__amyloid_diagnosis"].value_counts(dropna=False)


In [None]:
cohort_labels["label__amyloid_subtype_diagnosis"].value_counts(dropna=False)


In [None]:
cohort_labels["label__ttr_amyloid_subtype_diagnosis"].value_counts(dropna=False)


### Load Tafamidis Data
Patients with Tafamidis have the following label:
`POSITIVE` amyloid diagnosis, `TTR` amyloid subtype, `NaN` TTR subtype (unknown). Tafamidis prescriptions are Gold Standard labels.

In [22]:
# read file and drop the last 2 rows, containing information from SQL operation
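# `error_bad_lines` was removed in pandas 2.0; `on_bad_lines="error"` (pandas >= 1.3) keeps the same behavior
cohort_entry = pd.read_csv(cohort_entry_path, sep="|", on_bad_lines="error")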
cohort_entry.drop(cohort_entry.tail(2).index, inplace=True)
cohort_entry["ir_id"] = cohort_entry["ir_id"].astype(int)
# only keep ir_id and Tafamidis columns
cohort_entry = cohort_entry[
    ["ir_id", "Tafamidis_cohort_entry", "Tafamidis_cohort_entry_date"]
]
assert (
    len(cohort_entry.ir_id.unique()) == cohort_entry.shape[0]
), "cohort entry has more than one record per patient"







### Merge Labels into Cohort Entry Dataframe 

In [23]:
cohort = cohort_entry.merge(cohort_labels, on="ir_id", how="outer")


Tafamidis adds 49 new cases of amyloid, 1 new case to chart review.

In [None]:
cohort[(cohort["Tafamidis_cohort_entry"] == 1)][
    "label__amyloid_diagnosis"
].value_counts(dropna=False)



We see that Tafamidis adds 1 new case of TTR and 1 new case to chart review, because it clashes with an AL case.

In [None]:
cohort[
    (cohort["Tafamidis_cohort_entry"] == 1)
    & (cohort["label__amyloid_diagnosis"] == "POSITIVE")
]["label__amyloid_subtype_diagnosis"].value_counts(dropna=False)



In [26]:
def merge_tafamidis_with_labels(row, debug=False):
    label_CA, label_CA_subtype, label_ATTR_subtype, label_date = (
        row["label__amyloid_diagnosis"],
        row["label__amyloid_subtype_diagnosis"],
        row["label__ttr_amyloid_subtype_diagnosis"],
        row["label__amyloid_diagnosis_date"],
    )
    tafamidis, tafamidis_date = (
        row["Tafamidis_cohort_entry"],
        row["Tafamidis_cohort_entry_date"],
    )

    # TAFAMIDIS 0 or NaN, keep label data
    if tafamidis != 1:
        if debug:
            return np.nan
        return (label_CA, label_CA_subtype, label_ATTR_subtype, label_date)

    assert tafamidis == 1

    # TAFAMIDIS 1 and label is missing or inconclusive
    # 1: TAFAMIDIS provides new amyloid diagnosis
    if pd.isna(label_CA) or label_CA in [
        "INDETERMINATE",
        "CHART_REVIEW",
    ]:
        if debug:
            return 1
        return (
            "POSITIVE",
            "TTR",
            np.nan,
            tafamidis_date if pd.notna(tafamidis_date) else label_date,
        )
    # TAFAMIDIS 1 and label NEGATIVE
    # -1: TAFAMIDIS disagrees on amyloid diagnosis
    elif label_CA == "NEGATIVE":
        if debug:
            return -1
        return ("CHART_REVIEW", np.nan, np.nan, pd.NaT)

    # TAFAMIDIS 1 and label is POSITIVE
    # Check the subtypes
    else:
        assert label_CA == "POSITIVE"

        # Label subtype is missing or inconclusive
        # 2: TAFAMIDIS provides new subtype = TTR
        if pd.isna(label_CA_subtype) or label_CA_subtype in [
            "INDETERMINATE",
            "CHART_REVIEW",
        ]:
            if debug:
                return 2
            return (
                label_CA,
                "TTR",
                np.nan,
                label_date,
            )

        # if label subtype is AL, TAFAMIDIS clashes with subtype
        # -2: TAFAMIDIS => TTR
        if label_CA_subtype == "AL":
            if debug:
                return -2
            return (label_CA, "CHART_REVIEW", np.nan, label_date)

        # TAFAMIDIS 1 and label subtype is TTR
        # 0: total agreement
        assert label_CA_subtype == "TTR"
        if debug:
            return 0
        return (label_CA, label_CA_subtype, label_ATTR_subtype, label_date)



In [27]:
cohort["merge_tafamidis_consistency"] = cohort.apply(
    lambda row: merge_tafamidis_with_labels(row, debug=True), axis=1
)
cohort["merge_tafamidis_consistency_description"] = cohort[
    "merge_tafamidis_consistency"
].map(
    {
        0: "total agreement",
        1: "tafamidis provides new diagnosis",
        2: "tafamidis provides new TTR subtype",
        -1: "tafamidis disagrees on diagnosis",
        -2: "tafamidis disagrees on subtype",
    }
)


We see that the label merging function is consistent with our observations, so now we do the merge.

In [None]:
cohort["merge_tafamidis_consistency_description"].value_counts(dropna=False)


Since we use these column names elsewhere in our code - we overwrite the columns from the previous merge.

In [29]:
cohort[
    [
        "label__amyloid_diagnosis",
        "label__amyloid_subtype_diagnosis",
        "label__ttr_amyloid_subtype_diagnosis",
        "label__amyloid_diagnosis_date",
    ]
] = cohort.apply(
    lambda row: merge_tafamidis_with_labels(row, debug=False),
    axis=1,
    result_type="expand",
)




We just want a label file so cohort entries with no labels can be dropped.

In [30]:
# drop rows where the amyloid diagnosis label is missing
cohort = cohort[cohort["label__amyloid_diagnosis"].notna()]


We reorder the columns so that this file is a bit more readable.

In [31]:
reordered_columns = [
    "ir_id",
    "cardiac_path__amyloid_diagnosis",
    "pyp__amyloid_diagnosis",
    "mayo__amyloid_diagnosis",
    "mayo__amyloid_subtype_diagnosis",
    "mayo__ttr_amyloid_subtype_diagnosis",
    "final__amyloid_diagnosis",
    "final__amyloid_diagnosis_date",
    "final__amyloid_subtype_diagnosis",
    "final__ttr_amyloid_subtype_diagnosis",
    "chart_reviews__Amyloid_type",
    "chart_reviews__Method_of_diagnosis",
    "chart_reviews__amyloid_diagnosis_date",
    "chart_reviews__Age_at_Diagnosis",
    "chart_reviews__amyloid_diagnosis",
    "chart_reviews__amyloid_subtype_diagnosis",
    "chart_reviews__ttr_amyloid_subtype_diagnosis",
    "merge_chart_reviews_consistency",
    "merge_chart_reviews_consistency_description",
    "Tafamidis_cohort_entry",
    "Tafamidis_cohort_entry_date",
    "merge_tafamidis_consistency",
    "merge_tafamidis_consistency_description",
    "label__amyloid_diagnosis",
    "label__amyloid_subtype_diagnosis",
    "label__ttr_amyloid_subtype_diagnosis",
    "label__amyloid_diagnosis_date",
]


In [32]:
# take a copy so that the columns added below do not trigger SettingWithCopyWarning
cohort = cohort[reordered_columns].copy()


#### Check Tafamidis or PYP only patients
We want to flag patients for chart review in the following cases:
- The only positive label we have is from PYP. 
- The only positive label we have is from tafamidis

In [33]:
check_columns_pyp_tafamidis = [
    "pyp__amyloid_diagnosis",
    "Tafamidis_cohort_entry",
    "cardiac_path__amyloid_diagnosis",
    "mayo__amyloid_diagnosis",
    "chart_reviews__Amyloid_type",
    "label__amyloid_diagnosis",
    "label__amyloid_diagnosis_date",
]

cohort["pyp_or_tafamidis_only"] = (
    (cohort.pyp__amyloid_diagnosis == "STRONGLY_SUGGESTIVE")
    | (cohort.Tafamidis_cohort_entry == 1)
) & (
    (cohort.chart_reviews__Amyloid_type.isna())
    & (cohort.mayo__amyloid_diagnosis != "POSITIVE")
    & (~cohort.cardiac_path__amyloid_diagnosis.isin(["POSITIVE", "NEGATIVE"]))
)
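
The flag above can be read as "a weak positive source is present, and no stronger source has weighed in". A single-patient sketch of the same boolean logic, with missing values passed as `None` (`is_pyp_or_tafamidis_only` is a hypothetical helper, not used elsewhere in the notebook):

```python
def is_pyp_or_tafamidis_only(pyp, tafamidis, chart_review_type, mayo, cardiac_path):
    """Single-patient version of the `pyp_or_tafamidis_only` flag."""
    weak_positive = pyp == "STRONGLY_SUGGESTIVE" or tafamidis == 1
    no_stronger_source = (
        chart_review_type is None
        and mayo != "POSITIVE"
        and cardiac_path not in ("POSITIVE", "NEGATIVE")
    )
    return weak_positive and no_stronger_source


# PYP alone is flagged; PYP backed by a chart review label is not
print(is_pyp_or_tafamidis_only("STRONGLY_SUGGESTIVE", 0, None, None, None))
print(is_pyp_or_tafamidis_only("STRONGLY_SUGGESTIVE", 1, "wTTR", None, None))
```

Note that, as written, a `NEGATIVE` Mayo result does not remove the flag, while a `NEGATIVE` cardiac path result does.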

In [None]:
cohort["pyp_or_tafamidis_only"].value_counts(dropna=False)

In [None]:
cohort[cohort["pyp_or_tafamidis_only"] == True][check_columns_pyp_tafamidis].head()

#### Counting the patients after this merge

In [None]:
cohort["label__amyloid_diagnosis"].value_counts(dropna=False)


In [None]:
cohort["label__amyloid_subtype_diagnosis"].value_counts(dropna=False)


In [None]:
cohort["label__ttr_amyloid_subtype_diagnosis"].value_counts(dropna=False)


### Saving the data.
We save 2 files:
- updated labels
- a csv with ir_ids for chart review

First we get the patients who require chart review. We check columns from diagnosis data and from clinic chart review. We create 2 columns: `diagnosis__chart_review` and `label__chart_review`.

`diagnosis__chart_review`:
- A boolean flag indicating that the combination of diagnosis data and clinic chart review data still requires chart review (one of the `label__` columns is `CHART_REVIEW`). For example:
  - diagnosis data which required chart review was not completed by clinic chart review data.
  - diagnosis data and clinic chart review data disagree.

`label__chart_review`:
- A boolean flag indicating that we want further chart review. It is the union of `diagnosis__chart_review` and cases flagged in the diagnosis data (one of the `final__` columns is `CHART_REVIEW`). In other words, we still flag cases where clinic chart review data completed diagnosis data.
- There are only 3 of these cases, and a few extra chart reviews are feasible.

This next cell creates a dataframe to collect values from chart review

In [39]:
label_columns = [
    "label__amyloid_diagnosis",
    "label__amyloid_subtype_diagnosis",
    "label__ttr_amyloid_subtype_diagnosis",
]
diagnosis_columns = [
    "final__amyloid_diagnosis",
    "final__amyloid_subtype_diagnosis",
    "final__ttr_amyloid_subtype_diagnosis",
]

# add a flag for chart review to the cohort df
cohort["label__chart_review"] = (
    cohort[label_columns + diagnosis_columns] == "CHART_REVIEW"
).any(axis="columns")
cohort["diagnosis__chart_review"] = (cohort[label_columns] == "CHART_REVIEW").any(
    axis="columns"
)
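
To make the difference between the two flags concrete, here is a small sketch on made-up rows (the values are hypothetical and only a subset of the columns is used). The first patient was flagged in the diagnosis data but resolved by the merge, the second is still unresolved, and the third was never flagged.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "final__amyloid_diagnosis": ["CHART_REVIEW", "POSITIVE", "POSITIVE"],
        "label__amyloid_diagnosis": ["POSITIVE", "POSITIVE", "POSITIVE"],
        "label__amyloid_subtype_diagnosis": ["TTR", "CHART_REVIEW", "AL"],
    }
)
label_cols = ["label__amyloid_diagnosis", "label__amyloid_subtype_diagnosis"]
all_cols = label_cols + ["final__amyloid_diagnosis"]

# Still unresolved after merging (only the label__ columns are checked)
unresolved = (toy[label_cols] == "CHART_REVIEW").any(axis="columns")
# Flagged at any stage (label__ and final__ columns are checked)
ever_flagged = (toy[all_cols] == "CHART_REVIEW").any(axis="columns")

print(unresolved.tolist(), ever_flagged.tolist())
```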

In [40]:
# Patient level diagnoses from cardiac path reports, pyp reports, and mayo labs
# Merged with clinic chart review labels
# This yields the cohort amyloid labels, which are GOLD STANDARD for amyloid diagnosis
cohort_amyloid_labels_path = (
    PULL_2023 / "patient_amyloid_diagnosis" / "cohort_amyloid_labels.csv"
)

cohort.to_csv(cohort_amyloid_labels_path, index=False)
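
`to_csv` does not create missing parent directories, so the save fails if the `patient_amyloid_diagnosis` folder does not yet exist under the 2023 pull. A minimal sketch of creating the folder first, using a temporary directory and a plain text write in place of the real path and dataframe:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for PULL_2023 / "patient_amyloid_diagnosis" / "cohort_amyloid_labels.csv"
    out_path = Path(tmp) / "patient_amyloid_diagnosis" / "cohort_amyloid_labels.csv"

    # Create the output folder (and any missing parents) before writing
    out_path.parent.mkdir(parents=True, exist_ok=True)

    out_path.write_text("ir_id,label__amyloid_diagnosis\n1,POSITIVE\n")
    written = out_path.read_text().splitlines()

print(written)
```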