# FHIR-Aggregator survival analysis

In this notebook we show how to retrieve data for breast cancer patients in TCGA and compare the Kaplan-Meier curves of two cohorts.
The cohorts are White and African American patients who are 50 years old or younger.

In [1]:
!pip install lifelines -q

In [2]:
!pip install fhir-aggregator-client==0.1.8 --no-cache-dir --quiet

## Use FHIR-Aggregator to retrieve the necessary data

### Export TCGA-BRCA data to a local database

In [3]:
# run query against released data
# !rm /root/.fhir-aggregator/fhir-graph.sqlite
%env FHIR_BASE=https://google-fhir.fhir-aggregator.org
!fq run patient-survival-graph /ResearchStudy?identifier=TCGA-BRCA

env: FHIR_BASE=https://google-fhir.fhir-aggregator.org


patient-survival-graph is valid FHIR R5 GraphDefinition

¡ Fetching https://google-fhir.fhir-aggregator.org/ResearchStudy?identifier=TCGA-BRCA

Error: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Basic Constraints of CA cert not marked critical (_ssl.c:1028)



### Create a TSV file from the extracted data

In [4]:
# The previous query included a Specimen; the dataframe type defaults to Specimen.
# Since the optimized query only has Patient, we ask for a Patient dataframe type.
# Note: the default output is a TSV file in the current directory.
!fq results dataframe Patient

Saved fhir-graph.tsv


## Survival analysis

After retrieving the data, we use the Python library lifelines to plot Kaplan-Meier curves for two groups (White and African American) of breast cancer patients who are 50 years old or younger.

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()


# read the data into a dataframe
df = pd.read_csv('fhir-graph.tsv', sep='\t')

# get days to death data in the necessary format
df['days_to_death'] = (
    df['observation_days_between_diagnosis_and_death']
    .str.replace(' days', '', regex=False)
    .replace('', np.nan)
    .astype(float)
)
# get age data in the necessary format
df['age_at_diagnosis'] = (
    df['observation_days_between_birth_and_diagnosis']
    .str.replace(' days', '', regex=False)
    .replace('', np.nan)
    .astype(float)
)

# keep one row per patient_id
df_unique = df.drop_duplicates(subset=['patient_id'])

EmptyDataError: No columns to parse from file
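
The string-to-number conversion in the cell above can be illustrated on individual values. The following is a minimal sketch without pandas, assuming the columns hold strings such as `'1234 days'`, empty strings, or missing values. `parse_days` is a hypothetical helper used only for illustration; it is not part of the notebook or of any library.

```python
import math

def parse_days(value):
    """Turn a string such as '1234 days' into a float; return NaN when the value is missing."""
    # Hypothetical helper, for illustration only.
    if value is None:
        return math.nan
    if isinstance(value, float) and math.isnan(value):
        return math.nan
    # Drop the unit suffix, then treat an empty remainder as missing.
    text = str(value).replace(' days', '').strip()
    return float(text) if text else math.nan

# Synthetic example values, not taken from TCGA.
examples = ['1234 days', '-18250 days', '', None]
parsed = [parse_days(v) for v in examples]
print(parsed)
```

Missing values stay as NaN rather than 0, so they cannot be mistaken for a real duration later in the analysis.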

Select breast cancer patients who are White or African American, not Hispanic or Latino, and 50 years old or younger.

In [None]:
# age_at_diagnosis is in days, so the 50-year limit is expressed as 50*365
df_cohort = df_unique[
    (df_unique['age_at_diagnosis'] >= -50*365)
    & (df_unique['patient_us_core_race'].isin(['black or african american', 'white']))
    & (df_unique['patient_us_core_ethnicity'] == 'not hispanic or latino')
]
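
The `-50*365` threshold only selects patients aged 50 or younger if the column stores negative day offsets, with birth counted backwards from diagnosis. That sign convention is an assumption inferred from the filter itself, so it is worth checking against the data. The sketch below shows a sign-agnostic version of the same check; `at_most_years_old` is a hypothetical helper, and the 365-day year matches the approximation used in the filter.

```python
def at_most_years_old(days_between_birth_and_diagnosis, max_years, days_per_year=365):
    """Return True when the absolute day offset corresponds to max_years or fewer."""
    # Hypothetical helper: abs() makes the check independent of the sign convention.
    return abs(days_between_birth_and_diagnosis) <= max_years * days_per_year

# Synthetic offsets, not taken from TCGA.
younger = at_most_years_old(-18000, 50)
older = at_most_years_old(-20000, 50)
positive_offset = at_most_years_old(18000, 50)
print(younger, older, positive_offset)
```

With pandas, the same idea would be `df_unique['age_at_diagnosis'].abs() <= 50*365`.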

Get the data needed by the [`lifelines` package](https://lifelines.readthedocs.io).

In [None]:
# Fill missing days_to_death values with the maximum observed days_to_death
T = df_cohort['days_to_death'].fillna(df_cohort['days_to_death'].max())

# Convert the vital status to booleans (True means the death was observed)
E = df_cohort['patient_deceasedBoolean'].astype(bool)
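
To make the roles of `T` (durations) and `E` (event flags) concrete, here is a minimal plain-Python sketch of the Kaplan-Meier product-limit estimate on synthetic data. It is an illustration of the estimator, not the `lifelines` implementation, and `kaplan_meier` is a hypothetical function name. Entries with `E` false are treated as censored: they count as at risk up to their duration but never as deaths.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each time with at least one observed death."""
    # Distinct times at which a death was observed, in ascending order.
    event_times = sorted({t for t, e in zip(durations, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Everyone whose duration reaches t is still at risk just before t.
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve

# Synthetic data: two censored patients, one of them at the maximum duration,
# similar to what filling missing days_to_death with the maximum produces.
durations = [5, 8, 8, 12, 12]
events = [True, True, False, True, False]
curve = kaplan_meier(durations, events)
for time, probability in curve:
    print(time, round(probability, 3))
```

The estimate only drops at times where a death is observed; censored patients lower the at-risk count for later times without causing a drop themselves.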

Plot the survival curves

In [None]:
fig = plt.figure(figsize=(13, 8), dpi=80)
#plt.style.use('seaborn-colorblind')
ax = plt.subplot(111, title="Survival Curve")

# Fit and plot one Kaplan-Meier curve per race group, skipping missing values
for r in df_cohort['patient_us_core_race'].sort_values().unique():
  if pd.notna(r):
    cohort = df_cohort['patient_us_core_race'] == r
    kmf.fit(T.loc[cohort], E.loc[cohort], label=r)
    kmf.plot(ax=ax)

ax.set_ylabel("Survival probability")
ax.set_xlabel("Days")
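
The plot compares the two cohorts visually. A common numerical companion is the log-rank test, which `lifelines` provides as `lifelines.statistics.logrank_test`. The sketch below is not that implementation; it is a plain-Python illustration of the two-group log-rank statistic on synthetic data, with `logrank_statistic` as a hypothetical name. The p-value uses the fact that a chi-square variable with one degree of freedom is the square of a standard normal.

```python
import math

def logrank_statistic(t_a, e_a, t_b, e_b):
    """Return (chi-square statistic, p-value) for a two-group log-rank test."""
    all_times = list(t_a) + list(t_b)
    all_events = list(e_a) + list(e_b)
    # Distinct times at which a death was observed in either group.
    event_times = sorted({t for t, e in zip(all_times, all_events) if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for d in t_a if d >= t)  # at risk in group A
        n_b = sum(1 for d in t_b if d >= t)  # at risk in group B
        d_a = sum(1 for d, e in zip(t_a, e_a) if d == t and e)
        d_b = sum(1 for d, e in zip(t_b, e_b) if d == t and e)
        n, d = n_a + n_b, d_a + d_b
        # Observed deaths in A minus the number expected if both groups shared one hazard.
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Synthetic data: every death in group A happens before any death in group B.
chi2, p_value = logrank_statistic([1, 2], [True, True], [3, 4], [True, True])
print(round(chi2, 3), round(p_value, 3))
```

In the notebook itself, passing `T` and `E` for each race group to `lifelines.statistics.logrank_test` would be the more direct route; the hand-written version is only meant to show what the statistic measures.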