# Survival Analysis of Ovarian Cancer: Tumor Pathological Stage

Modern statistical methods let us estimate how survival risk depends on a variety of factors. In this tutorial, we walk through a small example of what you can do with our data to understand which factors relate to survival in different types of cancer. In this use case, we look at ovarian cancer.

## Step 1: Import Data and Dependencies

In [1]:
import pandas as pd
import cptac
import numpy as np
import sksurv
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
import lifelines
from lifelines import KaplanMeierFitter
%matplotlib inline

In [2]:
ov = cptac.Ovarian()
clinical = ov.get_clinical()
proteomics = ov.get_proteomics()
follow_up = pd.read_excel('../Follow_Up_Data/Ovary_One_Year_Clinical_Data_20160927.xls')

                                    

## Step 2: Data Preparation
We will be focusing on the discovery cohort of tumors, for which we have follow-up data. We will perform some data cleaning, and then merge the tables together for analysis. While you could study a wide variety of factors related to survival, such as country of origin or number of full pregnancies, we will be focusing on tumor stage and grade.

In [3]:
#Replace placeholder strings for missing values with NaN so they are treated as missing
to_replace = ['Not Reported/ Unknown', 'Reported/ Unknown',
              'Not Applicable', 'na', 'unknown', 'Not Performed',
              'Unknown tumor status', 'Unknown',
              'Unknown Tumor Status', 'Not specified']

for col in follow_up.columns:
    follow_up[col] = follow_up[col].replace(to_replace, 
                                            np.nan)
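
To see the effect of this cleaning step on a small scale, here is a minimal sketch. The table `toy` and its values are hypothetical stand-ins for the follow-up spreadsheet, and only two of the placeholder strings are used.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for the follow-up spreadsheet.
toy = pd.DataFrame({
    "Vital Status": ["Living", "Unknown", "Deceased"],
    "Tumor Status": ["Not Applicable", "Tumor free", "With tumor"],
})

placeholders = ["Unknown", "Not Applicable"]

# Replace every placeholder string in the whole table with NaN.
cleaned = toy.replace(placeholders, np.nan)

# Count how many cells became missing values in each column.
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

Counting the missing values per column afterwards is a quick way to check which columns still hold enough data to be worth analyzing.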

In [4]:
#Rename column to merge on, and then merge follow-up with clinical data
follow_up = follow_up.rename({'PPID': 'Patient_ID'}, axis='columns')

patient_data = pd.merge(clinical, follow_up, on = 'Patient_ID')
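
A merge like this can repeat a patient when one table holds several rows for the same ID, and it silently drops patients that appear in only one table. The sketch below illustrates both effects; the tables and values are hypothetical, not CPTAC data.

```python
import pandas as pd

# Hypothetical toy tables: one clinical row per patient, several follow-up rows.
clinical_toy = pd.DataFrame({
    "Patient_ID": ["P1", "P2", "P3"],
    "Tumor_Grade": ["G2", "G3", "G3"],
})
follow_up_toy = pd.DataFrame({
    "Patient_ID": ["P1", "P1", "P2"],
    "Vital Status": ["Living", "Living", "Deceased"],
})

# Default inner join: P3 has no follow-up row and is dropped; P1 appears twice.
merged = pd.merge(clinical_toy, follow_up_toy, on="Patient_ID")
rows_per_patient = merged["Patient_ID"].value_counts()

# Rows that are identical in every column collapse to one.
deduplicated = merged.drop_duplicates()
print(rows_per_patient)
print(deduplicated)
```

Checking `value_counts()` on the ID column after a merge shows whether each patient is represented once or many times, which matters because survival methods generally assume one row per individual.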

In [5]:
patient_data.columns

Index(['Patient_ID', 'Sample_Tumor_Normal', 'Participant_Procurement_Age',
       'Participant_Gender', 'Participant_Race', 'Participant_Ethnicity',
       'Participant_Jewish_Heritage', 'Participant_History_Malignancy',
       'Participant_History_Chemotherapy',
       'Participant_History_Neo-adjuvant_Treatment',
       ...
       'Number of Days from Date of Initial Pathologic Diagnosis of the Tumor Submitted to CPTAC to Date Radiation Therapy Started for this Other Malignancy',
       'Was the patient staged using FIGO?',
       'FIGO Staging System (Gynecologic Tumors Only)', 'FIGO Stage',
       ' Was the patient staged using AJCC?', 'AJCC Cancer Staging Edition',
       'Pathologic Spread: Primary Tumor (pT)',
       'Pathologic Spread: Lymph Nodes (pN)', 'Distant Metastases (M)',
       'AJCC Tumor Stage'],
      dtype='object', length=159)

In [6]:
for col in patient_data.columns:
    print(col)
    print(patient_data[col].value_counts(), '\n')

Patient_ID
01OV045    8
01OV018    8
20OV005    7
26OV008    6
01OV017    5
          ..
17OV018    1
17OV036    1
11OV009    1
17OV015    1
17OV004    1
Name: Patient_ID, Length: 105, dtype: int64 

Sample_Tumor_Normal
Tumor    304
Name: Sample_Tumor_Normal, dtype: int64 

Participant_Procurement_Age
790.0    8
535.0    8
665.0    8
612.0    8
687.0    8
        ..
656.0    1
908.0    1
688.0    1
620.0    1
861.0    1
Name: Participant_Procurement_Age, Length: 88, dtype: int64 

Participant_Gender
Female    304
Name: Participant_Gender, dtype: int64 

Participant_Race
White                                          267
Asian                                           22
Unknown (Could not be determined or unsure)      8
Black or African American                        5
American Indian or Alaska Native                 2
Name: Participant_Race, dtype: int64 

Participant_Ethnicity
Not Hispanic or Latino    280
Unknown                    15
Not Evaluated               6
Hispanic or Latin

Series([], Name: Total Number of Fractions, dtype: int64) 

Radiation Treatment Ongoing
Series([], Name: Radiation Treatment Ongoing, dtype: int64) 

Number of Days from Date of Initial Pathologic Diagnosis to the Date Radiation Therapy Ended
Series([], Name: Number of Days from Date of Initial Pathologic Diagnosis to the Date Radiation Therapy Ended, dtype: int64) 

Measure of Best Response of Radiation Treatment
Series([], Name: Measure of Best Response of Radiation Treatment, dtype: int64) 

Was Patient Treated on a Clinical Trial?
No     154
Yes     32
Name: Was Patient Treated on a Clinical Trial?, dtype: int64 

Drug Name (Brand or Generic)
Carboplatin                                 53
Paclitaxel                                  45
Cisplatin                                   31
Taxol                                       13
Bevacizumab                                  6
Docetaxel                                    6
Gemcitabine                                  5
Taxotere        

In [7]:
#Determine columns to focus on, and create a subset to work with
columns_to_focus_on = ['Patient_ID', 'Vital_Status', 'Vital Status (at time of last contact)',
                       'Tumor_Stage_Ovary_FIGO', 'Days_Between_Collection_And_Last_Contact', 
                       'Tumor_Grade', 'New_Tumor_Event_After_Initial_Treatment', 
                       'Date of Last Contact (Do not answer if patient is deceased)', 
                       'Date of Death', 'Tumor Status at Time of Last Contact or Death', 
                       'Date of New Tumor Event', 'Pharmaceutical Type', 'tumor_Stage-Pathological']

focus_group = patient_data[columns_to_focus_on].copy().drop_duplicates()
focus_group = focus_group[['Vital Status',
                           'Path Diag to Last Contact(Day)',
                           'Histologic_Grade_FIGO',
                           'tumor_Stage-Pathological']].copy()

KeyError: "['tumor_Stage-Pathological'] not in index"

## Step 2b: Prepare data for Kaplan Meier Plotting and Survival Analysis

In [None]:
patient_data[patient_data['Vital Status'] == "Deceased"]

In [None]:
focus_group = focus_group.replace('Living', 0)
focus_group = focus_group.replace('Deceased', 1)
focus_group = focus_group.dropna()

For Kaplan Meier plots, your data needs to be in a format similar to that shown below. In particular, it needs a boolean (or 0/1) column for the 'event' you are interested in (in this case, vital status), where True (1) denotes that the event occurred and False (0) denotes an individual for whom the event was never observed (in this case, their vital status is 'Living'). It also needs a column with a numeric duration, which we have as 'Path Diag to Last Contact(Day)'. The other columns contain the categorical data that we test for association with the event outcome.
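
As an illustration of that format, the sketch below builds a tiny table with an event column and a duration column. The records are hypothetical. It maps only the status column instead of replacing values across the whole table, which is one possible design choice and not the only one.

```python
import pandas as pd

# Hypothetical toy records; the column names mirror the ones used in this tutorial.
toy = pd.DataFrame({
    "Vital Status": ["Living", "Deceased", "Living", None],
    "Path Diag to Last Contact(Day)": [410.0, 255.0, 380.0, 120.0],
})

# Map only the status column, so other columns are never changed by accident.
toy["Vital Status"] = toy["Vital Status"].map({"Living": False, "Deceased": True})

# Rows with a missing status or duration cannot be used, so drop them,
# then store the event column as a real boolean.
toy = toy.dropna().astype({"Vital Status": bool})
print(toy)
```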

In [None]:
focus_group.head()

In [None]:
time = focus_group['Path Diag to Last Contact(Day)'].copy()
vital_status = focus_group['Vital Status'].copy()

Kaplan Meier plots show the estimated probability that an event has not yet occurred by a given time, optionally split by some attribute. They are most often used to plot survival probability (the probability of not yet having died) for different groups, but they can be used in a variety of other contexts. Below is a Kaplan Meier plot of overall survival for the patients with ovarian cancer in our data:

In [None]:
kmf = KaplanMeierFitter()
kmf.fit(time, event_observed = vital_status)

In [None]:
#groups = focus_group
kmf.plot()
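
To make concrete what a Kaplan Meier curve estimates, here is a minimal hand-written version of the estimator in plain Python. It is a sketch for illustration only, run on hypothetical durations, and is not how lifelines implements it. At each time where a death occurs, the running survival probability is multiplied by the fraction of at-risk patients who survived that time. Censored patients leave the at-risk group without lowering the curve.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each time where an event occurred."""
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        leaving = sum(1 for d in durations if d == t)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Everyone whose follow-up ends at t (event or censored) leaves the risk set.
        at_risk -= leaving
    return curve


# Hypothetical follow-up times in days; 1 = deceased, 0 = censored (still living).
toy_durations = [100, 200, 200, 300, 400]
toy_events = [1, 1, 0, 1, 0]
km_curve = kaplan_meier(toy_durations, toy_events)
print(km_curve)
```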

In [None]:
from lifelines import CoxPHFitter

In [None]:
figo_map = {"FIGO grade 1": 1, "FIGO grade 2": 2, "FIGO grade 3" : 3}
focus_group['Histologic_Grade_FIGO'] = focus_group['Histologic_Grade_FIGO'].map(figo_map)

In [None]:
tumor_map = {"Stage I" : 1, "Stage II" : 2, "Stage III" : 3, "Stage IV" : 4}
focus_group['tumor_Stage-Pathological'] = focus_group['tumor_Stage-Pathological'].map(tumor_map)
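
One thing to watch with `.map` is that any label missing from the dictionary becomes NaN without a warning. The sketch below shows a way to list such labels; the stage values, including the substage label "Stage IIIC", are hypothetical examples.

```python
import pandas as pd

# Hypothetical stage labels; "Stage IIIC" is not a key in the mapping below.
toy_stage = pd.Series(["Stage I", "Stage III", "Stage IIIC", "Stage IV"])
stage_map = {"Stage I": 1, "Stage II": 2, "Stage III": 3, "Stage IV": 4}

mapped = toy_stage.map(stage_map)

# Labels that the mapping silently turned into NaN.
unmapped_labels = toy_stage[mapped.isna()].unique().tolist()
print(unmapped_labels)
```

If this list is not empty, the dictionary probably needs more keys, or the labels need to be normalized before mapping.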

In [None]:
cph = CoxPHFitter()
cph.fit(focus_group, duration_col = "Path Diag to Last Contact(Day)", event_col = "Vital Status")

In [None]:
cph.print_summary(model="untransformed variables", decimals=3)

In [None]:
focus_group['Vital Status'].value_counts()

In [None]:
from lifelines.statistics import proportional_hazard_test

results = proportional_hazard_test(cph, focus_group, time_transform='rank')
results.print_summary(decimals=3, model="untransformed variables")

In [None]:
cph.plot()

In [None]:
wbf = lifelines.WeibullFitter().fit(time, vital_status)

In [None]:
wbf.plot_survival_function()

## Step 3: Separate the data for pre-processing, and to prepare for Cox Proportional Hazard Test
The Cox Proportional Hazard Test is a regression-based statistical test used to assess whether attributes are significantly associated with the time to an event of interest (in this case, death). It complements Kaplan Meier plots, which show differences in survival visually.

For the test to be performed properly, our data needs to be in a specific format. It requires data about Vital Status and time to event/last contact to be in a structured array, which we have titled 'survival_array' below.  

Additionally, the attributes we are interested in studying, which are the tumor stage and histologic grade, need to be separate from this array. We separated this data into a DataFrame we call 'tumor_stage_data' below.

In [None]:
tumor_stage_data = focus_group[["tumor_Stage-Pathological",
                                "Histologic_Grade_FIGO"]].copy()
#tumor_stage_data = pd.DataFrame(focus_group['tumor_Stage-Pathological'].copy())

survival_data = focus_group[['Vital Status',
                             "Path Diag to Last Contact(Day)"]].copy()

survival_array = np.zeros(len(survival_data),
                          dtype={"names":("Vital Status",
                                          "Path Diag to Last Contact(Day)"),
                                 "formats":("?", "<f8")})

survival_array['Vital Status'] = survival_data['Vital Status'].values
survival_array['Path Diag to Last Contact(Day)'] = survival_data['Path Diag to Last Contact(Day)'].values 
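
For reference, the same kind of structured array can be built in one step from a list of (event, time) tuples. This is a minimal sketch on hypothetical values using only NumPy. scikit-survival also provides a helper, `sksurv.util.Surv.from_arrays`, for this purpose, which is not used here.

```python
import numpy as np

# Hypothetical toy outcomes: event flag first, then follow-up time in days.
toy_events = [True, False, True]
toy_days = [255.0, 410.0, 90.0]

# Same field names and formats as survival_array: boolean event, float64 time.
dtype = [("Vital Status", "?"), ("Path Diag to Last Contact(Day)", "<f8")]
toy_survival = np.array(list(zip(toy_events, toy_days)), dtype=dtype)

print(toy_survival.dtype.names)
print(toy_survival["Vital Status"])
```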

Sometimes when DataFrames are converted, merged, and manipulated, the data types of their columns get mixed up. The pre-processing step below expects columns of type "category" or numeric, and it creates a binary DataFrame ready for Cox Proportional Hazard, as well as many other predictive analyses. Columns often end up with the "object" type, so we will change them to type "category".

In [None]:
for col in tumor_stage_data.columns:
    tumor_stage_data[col] = tumor_stage_data[col].astype("category")

## Step 4: Perform Cox Proportional Hazard Test

To perform the Cox Proportional Hazard Test, as well as many other machine learning and statistical tests, our data needs to be pre-processed. This is often done with an encoder called "OneHotEncoder"; because the code below reads a `.columns` attribute from the result, it relies on the version in scikit-survival's `sksurv.preprocessing` module, which returns a DataFrame. OneHotEncoder creates a new binary DataFrame of 0s and 1s for the attributes in the DataFrame you give it, which is the format the test needs.

In [None]:
for col in tumor_stage_data.columns:
    print(tumor_stage_data[col].value_counts(), '\n')

In [None]:
from sksurv.preprocessing import OneHotEncoder

tumor_stage_numeric = OneHotEncoder().fit_transform(tumor_stage_data)
test2 = OneHotEncoder().fit_transform(tumor_stage_data[["Histologic_Grade_FIGO", "tumor_Stage-Pathological"]])

In [None]:
#Stage I and grade 1 likely have no column because the encoder keeps M-1 columns per attribute; the first category serves as the baseline
tumor_stage_numeric.columns

In [None]:
test2.columns
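
The missing first category can be reproduced on a small scale. The sketch below uses pandas' `get_dummies` with `drop_first=True` as a stand-in for the encoder, on hypothetical data. The column names it produces (for example `stage_Stage III`) follow pandas' naming and may differ from the encoder's.

```python
import pandas as pd

# Hypothetical toy data with the two kinds of attributes used in this tutorial.
toy = pd.DataFrame({
    "stage": pd.Categorical(["Stage I", "Stage III", "Stage IV", "Stage III"]),
    "grade": pd.Categorical(["G1", "G2", "G3", "G3"]),
})

# drop_first=True keeps M-1 indicator columns per attribute; the dropped
# first category becomes the baseline that the others are compared with.
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

A patient in the baseline categories (Stage I, G1) is represented by zeros in every indicator column, so no information is lost by dropping those columns.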

Here we will use our estimator of choice, Cox Proportional Hazard, to perform the test.

In [None]:
from sksurv.linear_model import CoxPHSurvivalAnalysis

estimator = CoxPHSurvivalAnalysis()
estimator.fit(tumor_stage_numeric, survival_array)

We can now look at the model's coefficients by viewing the following Series, built from the coefficients of the estimator and the column index of the pre-processed data. Each coefficient is a log hazard ratio; exponentiating it gives the hazard ratio, which tells us how strongly a particular attribute is associated with survival. For instance, a hazard ratio of 2.6235030 for Stage III tumors would mean that patients with a Stage III tumor had roughly 2.6 times the hazard of death (about 162% higher) compared with patients in the baseline stage.

In [None]:
pd.Series(estimator.coef_, index=tumor_stage_numeric.columns)
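
Cox model coefficients are on the log scale, so they need to be exponentiated before being read as hazard ratios. The sketch below shows the conversion on hypothetical coefficients; these numbers are not results from this data set.

```python
import math

# Hypothetical coefficients (log hazard ratios), not results from this data set.
toy_coefficients = {"Stage III": 0.5, "Stage IV": 1.2}

# exp(coefficient) is the hazard ratio relative to the baseline category.
hazard_ratios = {name: math.exp(coef) for name, coef in toy_coefficients.items()}

for name, ratio in hazard_ratios.items():
    percent_change = (ratio - 1) * 100
    print(f"{name}: hazard ratio {ratio:.2f} ({percent_change:+.0f}% hazard vs. baseline)")
```

A hazard ratio of 1 means no difference from the baseline, values above 1 mean a higher hazard, and values below 1 mean a lower hazard.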

If you are interested in how well your model ranks patients by risk, given the outcomes in your survival_array, you can call the estimator's "score" method, which returns the concordance index (0.5 corresponds to random ranking, and 1.0 to perfect ranking). This score is most informative when testing your model against new data, but it is still helpful as a check on the data the model was fitted to.

In [None]:
estimator.score(tumor_stage_numeric, survival_array)
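
To give a sense of what a concordance index measures, here is a simplified hand-written version in plain Python. It is a sketch on hypothetical values, it ignores some edge cases such as tied event times, and it is not scikit-survival's implementation. A pair of patients is comparable when the one with the shorter time had the event, and it is concordant when that patient also received the higher risk score.

```python
def concordance_index(times, events, risk_scores):
    """Simplified Harrell's C: share of comparable pairs ranked correctly by risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable pair: patient i had the event strictly before j's time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical follow-up times, event flags, and model risk scores.
toy_times = [100, 200, 300, 400]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(toy_times, toy_events, toy_risk)
print(c_index)
```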

Those are the basics of Survival Analysis on the cancer data we have provided in the cptac package.  This is by no means the only way this could have been done.  There are many other questions we could ask ourselves, and continue to study once we get to this point: 

How well does our model hold up for new data? What other attributes may be important for survival? What if I want to study the connection between clinical attributes and the likelihood of developing a second tumor during treatment?

With the functionality and flexibility of cptac, scikit-survival, and scikit-learn, these questions can be explored, and a variety of other research projects could be done. This use case is intended to be a springboard to help researchers get started in survival analysis, and to leverage the cancer data we have to find important connections between clinical or molecular attributes and clinical outcomes.