Survival Analysis in R

15/05/2019

Who am I?

History:
- (Ex) Physicist
- (Ex) System Administrator
- (Ex) Software Developer
- (Ex) FOSS Community Manager
- Data scientist (~1 year)
  - Caveat Emptor!
Contact:

Time-to-event data

A study about cancer recurrence
Time is recurrence-free time in days
An “event” is the cancer recurring (triangle)
Otherwise the patient gets a “censored” marker (circle) when they leave the study
Also applies to any other time-to-event data
- Unemployment duration
- Failure times in equipment, etc.

So, about that mean…

Characterizing censored data is hard
- Averages operate on a vector of numbers
- So what do we do with the censored ones?
- Use them? Extend them? NA them?

print(small_data)

##   rowid time cens start
## 1   127  357    1     0
## 2   482 1922    0     0
## 3   393  867    1     0
## 4   115  857    0     0
## 5   644 1692    0     0

mean(c(357,1922,867,857,1692))

## [1] 1139

mean(c(357, NA, 867, NA, NA), na.rm = TRUE)

## [1] 612
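The two averages above can be reproduced from the table itself. This is a minimal sketch that rebuilds the five printed rows as a data frame; it assumes that `cens == 1` marks an observed event and `cens == 0` a censored observation, which matches the rows set to `NA` above. The names `naive_all` and `events_only` are illustrative only.

```r
# Rebuild the five rows printed above (values copied from the slide).
small_data <- data.frame(
  rowid = c(127, 482, 393, 115, 644),
  time  = c(357, 1922, 867, 857, 1692),
  cens  = c(1, 0, 1, 0, 0),  # assumption: 1 = event observed, 0 = censored
  start = 0
)

# Option 1: treat censored times as if they were event times.
# This is too low, because censored patients were still recurrence-free
# when they left the study.
naive_all <- mean(small_data$time)

# Option 2: drop the censored rows entirely.
# This throws away the information carried by the censored follow-ups.
events_only <- mean(small_data$time[small_data$cens == 1])

c(naive_all = naive_all, events_only = events_only)
```

Neither number is a trustworthy mean duration, which is the motivation for estimating a survival function instead.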

Survival Analysis to the rescue

What is the probability that a breast cancer patient survives longer than 5 years?
What is the typical waiting time for a cab?
Out of 100 unemployed people, how many do we expect to have a job again after 2 months?

Theory

\[ S(t) = 1 - F(t) = P(T > t) \]

Interpretation

Probability that duration is greater than \(t\)
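Without censoring, both forms of the definition can be checked directly. The sketch below uses made-up, fully observed durations (`durations` and `t0` are hypothetical values, not data from the study) and only base R.

```r
# Hypothetical, fully observed durations in days (no censoring).
durations <- c(120, 340, 560, 800, 1500)
t0 <- 500

# S(t) = P(T > t): share of durations longer than t0.
s_direct <- mean(durations > t0)

# S(t) = 1 - F(t), with F estimated by the empirical CDF.
s_from_cdf <- 1 - ecdf(durations)(t0)

c(s_direct = s_direct, s_from_cdf = s_from_cdf)
```

Three of the five durations exceed 500 days, so both expressions give 0.6. Once some durations are censored this shortcut no longer works, and an estimator such as Kaplan-Meier is needed.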

Survival functions

Median: The median duration is the smallest \(t\) at which \(\hat S(t)\) falls to 0.5 or below.

Proportion at time \(t\): \(100 \cdot \hat S(t)\) percent of durations are longer than \(t\).

Estimating the survival function

The Kaplan-Meier estimate

The Kaplan-Meier estimate (code)

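library(survival)   # Surv(), survfit()
library(survminer)  # ggsurvplot()

# Kaplan-Meier fit with no covariates (~ 1) on the five-row sample
km <- survfit(Surv(time, cens) ~ 1,
              data = small_data)

ggsurvplot(km, data = small_data,
           conf.int = FALSE,
           censor.shape = 4,
           censor.size = 9,
           risk.table = 'nrisk_cumevents',
           legend = 'none')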

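Formal Definition

\[ \hat S(t) = \prod_{i: t_i \le t} \frac{n_i - d_i}{n_i} \]

where \(t_i\) are the distinct event times, \(n_i\) is the number at risk just before \(t_i\), and \(d_i\) is the number of events at \(t_i\).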
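The product formula can be checked by hand on the five-row sample. This is a sketch, assuming `cens == 1` means the event was observed; the names `t_i`, `d_i`, `n_i` and `s_hand` are chosen here to mirror the symbols in the formula, and the result is compared with `survival::survfit`.

```r
library(survival)

# The five rows from the earlier slide; assumption: cens == 1 is an event.
time <- c(357, 1922, 867, 857, 1692)
cens <- c(1, 0, 1, 0, 0)

# Sort by time. There are no tied times in this sample, so each row is
# its own time point and d_i is simply 0 (censored) or 1 (event).
ord <- order(time)
t_i <- time[ord]
d_i <- cens[ord]
n_i <- length(t_i):1  # number still at risk just before each t_i: 5, 4, 3, 2, 1

# Running product of (n_i - d_i) / n_i.
s_hand <- cumprod((n_i - d_i) / n_i)

# Same estimate from survfit for comparison.
km <- survfit(Surv(time, cens) ~ 1)

data.frame(time = t_i, n_risk = n_i, events = d_i,
           s_hand = s_hand, s_survfit = km$surv)
```

The curve drops to 4/5 at day 357 and to 4/5 · 2/3 ≈ 0.533 at day 867, and censored rows leave it unchanged. It never reaches 0.5, so the median is undefined for this tiny sample.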

Kaplan-Meier (whole dataset)

data(GBSG2, package = "TH.data")  # breast cancer data used from here on
km <- survfit(Surv(time, cens) ~ 1, data = GBSG2)
ggsurvplot(km, data = GBSG2, censor = FALSE, conf.int = TRUE,
           surv.median.line = 'hv', legend = 'none')
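The median and the proportion surviving past a given time can also be read off the fitted object numerically rather than from the plot. A minimal sketch, assuming the `TH.data` package is installed; the 5-year horizon is taken as 5 · 365 days for illustration, and `med` and `s5` are names introduced here.

```r
library(survival)
data(GBSG2, package = "TH.data")

km <- survfit(Surv(time, cens) ~ 1, data = GBSG2)

# Median recurrence-free time: first time the curve is at or below 0.5.
med <- quantile(km, probs = 0.5, conf.int = FALSE)

# Estimated proportion still recurrence-free after 5 years (in days).
s5 <- summary(km, times = 5 * 365)$surv

med
s5
```

`s5` multiplied by 100 is the "percent of durations longer than t" reading of the survival function.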

Modelling on factors

km <- survfit(
  Surv(time,cens) ~ 1,
  data = GBSG2)

Modelling on factors

km <- survfit(Surv(time, cens) ~ horTh, data = GBSG2)
ggsurvplot(km, data = GBSG2, surv.median.line = "hv",
           legend.title = "Hormone Therapy", legend = 'right',
           pval = TRUE, conf.int = TRUE
)
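The p-value on the plot can be reproduced outside of `ggsurvplot`. As far as I know, `pval = TRUE` reports a log-rank test by default; the sketch below runs that test directly with `survival::survdiff`, and the name `p_value` is introduced here.

```r
library(survival)
data(GBSG2, package = "TH.data")

# Log-rank test: are the survival curves equal across hormone therapy groups?
lr <- survdiff(Surv(time, cens) ~ horTh, data = GBSG2)
lr

# p-value from the chi-squared statistic, df = number of groups - 1.
p_value <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
p_value
```

A small p-value suggests the two curves differ by more than chance, though it says nothing about the size of the difference.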

Things we haven’t covered

Weibull model
- Smooth function, better for point predictions
- Tricky to graph (need to fake-out a data frame)

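# Weibull fit; p = 1 - 0.9 asks for the time by which 10% have had the event,
# i.e. the time that 90% of patients are expected to survive beyond.
wb <- survreg(Surv(time, cens) ~ horTh, data = GBSG2)
predict(wb, type = "quantile",
        p = 1 - 0.9,
        newdata = data.frame(
          horTh = 'yes'))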

##        1 
## 475.1155
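The "fake-out a data frame" step for graphing a Weibull curve can be sketched as follows: pick a grid of survival probabilities, ask the model for the matching times, and pair them up. The grid `surv` and the name `curve_yes` are choices made for this illustration.

```r
library(survival)
data(GBSG2, package = "TH.data")

wb <- survreg(Surv(time, cens) ~ horTh, data = GBSG2)

# Grid of survival probabilities from 99% down to 1%.
surv <- seq(0.99, 0.01, by = -0.01)

# For each probability, ask the model for the matching time.
# p is the probability of having had the event, hence 1 - surv.
t_yes <- predict(wb, type = "quantile", p = 1 - surv,
                 newdata = data.frame(horTh = "yes"))

# as.numeric() flattens the result whether predict() returns
# a plain vector or a one-row matrix.
curve_yes <- data.frame(time = as.numeric(t_yes), surv = surv)

head(curve_yes)

# Smooth curve, in contrast to the Kaplan-Meier step function.
plot(curve_yes$time, curve_yes$surv, type = "l",
     xlab = "Days", ylab = "Survival probability")
```

Repeating the `predict()` call with `horTh = "no"` and binding the two data frames together would give one curve per group.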

Cox Proportional Hazards model
- Step function like KM
- Takes the values of the covariate(s) into account
- Close to KM for a single binary covariate, as long as the hazards are roughly proportional
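A minimal sketch of the Cox model on the same data, using `survival::coxph` with the hormone therapy covariate from the earlier slides. The names `cx`, `new_pts` and `cx_fit` are introduced here for illustration.

```r
library(survival)
data(GBSG2, package = "TH.data")

cx <- coxph(Surv(time, cens) ~ horTh, data = GBSG2)

# Hazard ratio for horTh = yes relative to horTh = no.
exp(coef(cx))

# Model-based survival curves, one per covariate value.
new_pts <- data.frame(horTh = factor(c("no", "yes"),
                                     levels = levels(GBSG2$horTh)))
cx_fit <- survfit(cx, newdata = new_pts)

# Step-function curves, comparable to the Kaplan-Meier plot by group.
plot(cx_fit, col = c("black", "red"),
     xlab = "Days", ylab = "Survival probability")
```

Unlike the stratified Kaplan-Meier fit, the two curves here are tied together by a single hazard ratio, which is what the proportional hazards assumption means in practice.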

Example 1 - Code Complexity vs Merge Time

Original Blog Post

Good for CI and comparative plots across predictors

Example 2 - Merge times in parts of a codebase

Shiny App

Thanks!

Questions?

Comments
Corrections
Future ideas
