# Code snippets for loading R survival datasets in Python

In [1]:
import pandas as pd
from sksurv.datasets import get_x_y

## `lung`

Sourced from the R `survival` package.

**Shape:**
167 observations of 7 variables

Survival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities.

**References:**
- Loprinzi CL. et al. (1994). Prospective evaluation of prognostic variables from patient-completed questionnaires. North Central Cancer Treatment Group. Journal of Clinical Oncology. 12(3):601-7.

In [2]:
lung_df = pd.read_csv('lung_dataset.csv')
lung_df = lung_df.dropna()

X, y = get_x_y(lung_df, ["status", "time"], pos_label=1)
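`get_x_y` takes the event indicator column first and the time column second. It returns the remaining columns as `X` and a structured array as `y`. The sketch below shows the shape of that result on a hypothetical toy frame, using only pandas and NumPy. The helper `split_features_and_outcome` and the toy values are illustrative assumptions, not the scikit-survival implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the cell above.
toy_df = pd.DataFrame({
    "status": [1, 0, 1, 0],
    "time": [306.0, 455.0, 1010.0, 210.0],
    "age": [74, 68, 56, 57],
})


def split_features_and_outcome(df, event_col, time_col, pos_label=1):
    """Illustrative stand-in for get_x_y (not the library's code)."""
    # Structured array with one boolean event field and one float time field.
    y = np.empty(len(df), dtype=[(event_col, bool), (time_col, float)])
    # Rows whose event column equals pos_label count as observed events.
    y[event_col] = (df[event_col] == pos_label).to_numpy()
    y[time_col] = df[time_col].to_numpy(dtype=float)
    # Everything except the two outcome columns is treated as a feature.
    X = df.drop(columns=[event_col, time_col])
    return X, y


X_toy, y_toy = split_features_and_outcome(toy_df, "status", "time")
```

Because `pos_label` decides which value counts as an event, it is worth checking how the event column is coded in each CSV before reusing `pos_label=1`.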

## `diabetic`

Sourced from the R `survival` package.

**Shape:**
394 observations of 6 variables

Partial results from a trial of laser coagulation for the treatment of diabetic retinopathy.

**References:**
- Huster, Brookmeyer and Self, Biometrics, 1989.
- American Journal of Ophthalmology, 1976, 81:4, pp 383-396

In [4]:
diabetic_df = pd.read_csv('diabetic_dataset.csv')
diabetic_df = diabetic_df.dropna()

X, y = get_x_y(diabetic_df, ["status", "time"], pos_label=1)

## `flchain`

Sourced from the R `survival` package.

**Shape:**
1962 observations of 8 variables

This is a stratified random sample containing 1/2 of the subjects from a study of the relationship between serum free light chain (FLC) and mortality. 

**References:**
- A Dispenzieri et al. (2012). Use of nonclonal serum immunoglobulin free light chains to predict overall survival in the general population, Mayo Clinic Proceedings 87:512-523.
- R Kyle et al. (2006). Prevalence of monoclonal gammopathy of undetermined significance, New England J Medicine 354:1362-1369.

**Notes:**
- sex:  0=male, 1=female

In [5]:
flchain_df = pd.read_csv('flchain_dataset.csv')
flchain_df = flchain_df.dropna()
X, y = get_x_y(flchain_df, ["death", "futime"], pos_label=1)

## `rotterdam`

Sourced from the R `survival` package.

**Shape:**
2982 observations of 9 variables

This dataset includes 2982 primary breast cancer patients whose records were included in the Rotterdam tumor bank.

**References:**
- Patrick Royston and Douglas Altman. (2013). External validation of a Cox prognostic model: principles and methods. BMC Medical Research Methodology, 13:33

**Notes:**
- size: 1: <=20;  2: 20-50;  3: >50

In [6]:
rotterdam_df = pd.read_csv('rotterdam_dataset.csv')
rotterdam_df = rotterdam_df.dropna()

X, y = get_x_y(rotterdam_df, ["death", "dtime"], pos_label=1)

## `patient`

Sourced from the R `pammtools` package.

**Shape:**
1985 observations of 6 variables

A data set containing survival time (or time to hospital release) along with other covariates. The full data is available [here](https://github.com/adibender/elra-biostats).

**Notes:**
- Gender:  0=male, 1=female


In [7]:
patient_df = pd.read_csv('patient_dataset.csv')
patient_df = patient_df.dropna()
patient_df = pd.get_dummies(patient_df, drop_first=True, dtype=float)

X, y = get_x_y(patient_df, ["PatientDied", "Survdays"], pos_label=1)
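The `pd.get_dummies` call in this cell one-hot encodes the non-numeric columns and leaves the numeric ones unchanged. A minimal sketch on hypothetical toy data follows; the `ward` column and its values are invented for illustration and are not taken from the `patient` dataset.

```python
import pandas as pd

# Hypothetical toy frame: two numeric outcome columns and one categorical column.
toy_df = pd.DataFrame({
    "Survdays": [5.0, 12.0, 30.0],
    "PatientDied": [1, 0, 0],
    "ward": ["medical", "surgical", "medical"],  # invented column for illustration
})

# drop_first=True drops the first category level ("medical") to avoid a
# redundant indicator; dtype=float stores the indicators as 0.0 / 1.0.
encoded = pd.get_dummies(toy_df, drop_first=True, dtype=float)
```

Here `ward` is replaced by a single indicator column, `ward_surgical`, while `Survdays` and `PatientDied` pass through untouched, so they can still be named in the `get_x_y` call.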

## `pbc`

Sourced from the R `randomForestSRC` package.

**Shape:**
276 observations of 17 variables

This data is from the Mayo Clinic trial in primary biliary cirrhosis (PBC) conducted between 1974 and 1984. A total of 424 PBC patients, referred to Mayo Clinic during that ten-year interval, met eligibility criteria for the randomized placebo-controlled trial of the drug D-penicillamine.

**References:**
- T. Therneau and P. Grambsch (2000). Modeling Survival Data: Extending the Cox Model, Springer-Verlag, New York. ISBN: 0-387-98784-3.

In [8]:
pbc_df = pd.read_csv('pbc_dataset.csv')
pbc_df = pbc_df.dropna()

X, y = get_x_y(pbc_df, ["status", "days"], pos_label=1)

## `ttm`

Sourced from the R `censored` package.

**Shape:**
551 observations of 11 variables

Number of days before a movie grosses $1M USD. These data are a somewhat biased random sample of 551 movies released between 2015 and 2018.


In [9]:
ttm_df = pd.read_csv('ttm_dataset.csv')
ttm_df = ttm_df.dropna()

X, y = get_x_y(ttm_df, ["event", "time"], pos_label=1)