15/05/2019

Who am I?

Time-to-event data

  • A study about cancer recurrence

  • Time is recurrence-free time in days

  • An “event” is the cancer recurring (triangle)

  • Otherwise the patient gets a “censored” marker (circle) when they leave the study

  • Also applies to any other time-to-event data

    • Unemployment duration
    • Failure times in equipment, etc.

So, about that mean…

  • Characterizing censored data is hard
    • Averages operate on a vector of numbers
    • So what do we do with the censored ones?
    • Use them? Extend them? NA them?
print(small_data)
##   rowid time cens start
## 1   127  357    1     0
## 2   482 1922    0     0
## 3   393  867    1     0
## 4   115  857    0     0
## 5   644 1692    0     0


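# Naive option 1: treat censored times as if they were event times
mean(c(357,1922,867,857,1692))
## [1] 1139
# Naive option 2: drop the censored observations
mean(c(357,NA,867,NA,NA), na.rm = TRUE)
## [1] 612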

Survival Analysis to the rescue

  • What is the probability that a breast cancer patient survives longer than 5 years?
  • What is the typical waiting time for a cab?
  • Out of 100 unemployed people, how many do we expect to have a job again after 2 months?

Theory

\[ S(t) = 1 - F(t) = P(T > t) \]

Interpretation

Probability that duration is greater than \(t\)
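As a minimal sketch of the definition, ignoring censoring for the moment, the survival function can be estimated as the share of observed durations that exceed \(t\). The `durations` vector below is made up purely for illustration.

```r
# Hypothetical, fully observed durations (no censoring), for illustration only
durations <- c(2, 5, 7, 7, 10, 12, 15, 20)

# Empirical survival function: share of durations strictly greater than t
S_empirical <- function(t) mean(durations > t)

# Empirical CDF from base R, F(t) = P(T <= t)
F_empirical <- ecdf(durations)

# S(t) and 1 - F(t) agree at any t
S_empirical(7)
1 - F_empirical(7)
```

Once some durations are censored this simple share no longer works, which is what motivates the Kaplan-Meier estimate.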

Survival functions

Median: the median duration is the time \(t\) at which \(\hat S(t) = 0.5\).

Proportion at time \(t\): \(100 \cdot \hat S(t)\) percent of durations are longer than \(t\).
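Both readings can be pulled from a fitted curve rather than read off a plot. The sketch below assumes the `survival` package and re-types the five-row `small_data` table so that it runs on its own; the 1000-day cut-off is an arbitrary choice for illustration.

```r
library(survival)

# small_data re-typed from the printed table so the block is self-contained
small_data <- data.frame(
  rowid = c(127, 482, 393, 115, 644),
  time  = c(357, 1922, 867, 857, 1692),
  cens  = c(1, 0, 1, 0, 0),
  start = 0
)

km_small <- survfit(Surv(time, cens) ~ 1, data = small_data)

# Proportion at time t: estimated share of durations longer than 1000 days
s_1000 <- summary(km_small, times = 1000)$surv
100 * s_1000

# Median: first time the curve reaches 0.5; NA if it never gets that low
median_time <- quantile(km_small, probs = 0.5, conf.int = FALSE)
median_time
```

With only two events among five patients the estimated curve stays above 0.5, so the median should come back as `NA` here; on a larger dataset it would be a time in days.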

Estimating the survival function

The Kaplan-Meier estimate

The Kaplan-Meier estimate (code)

library(survival)   # Surv(), survfit()
library(survminer)  # ggsurvplot()

km <- survfit(Surv(time,cens) ~ 1,
              data=small_data)

ggsurvplot(km, data = small_data,
           conf.int = FALSE,
           censor.shape = 4,
           censor.size = 9,
           risk.table = 'nrisk_cumevents',
           legend = 'none')

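Formal Definition \[ \hat S(t) = \prod_{i: t_i \le t} \frac{n_i - d_i}{n_i} \]

where \(t_i\) are the distinct event times, \(d_i\) is the number of events at \(t_i\), and \(n_i\) is the number of patients still at risk just before \(t_i\).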
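The product can be worked through by hand on the five-row example. This is a base-R sketch, not how `survfit` is implemented internally: at each event time it counts the patients still at risk and the events, then takes the running product. Each row gives the value of the curve from that event time onward.

```r
# small_data re-typed from the printed table so the block is self-contained
small_data <- data.frame(
  rowid = c(127, 482, 393, 115, 644),
  time  = c(357, 1922, 867, 857, 1692),
  cens  = c(1, 0, 1, 0, 0)   # 1 = event (recurrence), 0 = censored
)

# Distinct times at which an event happened
event_times <- sort(unique(small_data$time[small_data$cens == 1]))

# n_i: still at risk just before t_i; d_i: events at t_i
n_i <- sapply(event_times, function(t) sum(small_data$time >= t))
d_i <- sapply(event_times, function(t) sum(small_data$time == t & small_data$cens == 1))

# Running product of the conditional survival fractions
km_by_hand <- data.frame(
  time    = event_times,
  n_risk  = n_i,
  n_event = d_i,
  surv    = cumprod((n_i - d_i) / n_i)
)
print(km_by_hand)
```

The patient censored at day 857 contributes to the risk set at day 357 but not at day 867, which is how censored observations are used without pretending to know their event time.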

Kaplan-Meier (whole dataset)

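# GBSG2: German Breast Cancer Study Group 2 data (assumed here to be loaded from the TH.data package)
km <- survfit(Surv(time,cens) ~ 1, data = GBSG2)
ggsurvplot(km, data = GBSG2, censor = FALSE, conf.int = TRUE, surv.median.line = 'hv', legend = 'none')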

Modelling on factors

km <- survfit(
  Surv(time,cens) ~ 1,
  data = GBSG2)


km <- survfit(Surv(time, cens) ~ horTh, data = GBSG2)
ggsurvplot(km, data = GBSG2, surv.median.line = "hv",
           legend.title = "Hormone Therapy", legend = 'right',
           pval = TRUE, conf.int = TRUE
)
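The p-value on a plot like this can also be computed directly. The sketch below assumes that `GBSG2` comes from the `TH.data` package and uses a log-rank test via `survdiff()`, which to my understanding is what `pval = TRUE` reports by default.

```r
library(survival)

# Assumption: GBSG2 is the dataset shipped with the TH.data package
data("GBSG2", package = "TH.data")

# Log-rank test: do the two hormone-therapy groups share one survival curve?
lr <- survdiff(Surv(time, cens) ~ horTh, data = GBSG2)
print(lr)

# p-value from the chi-squared statistic, df = number of groups - 1
p_value <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
p_value
```

A small p-value suggests the difference between the two curves is unlikely to be due to chance alone.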

Things we haven’t covered

  • Weibull model
    • Smooth function, better for point predictions
    • Tricky to graph (need to fake-out a data frame)
wb <- survreg(Surv(time, cens) ~ horTh, data = GBSG2)
# Time by which 10% of treated patients have had the event (90% still event-free)
predict(wb, type = "quantile",
        p = 1 - 0.9,
        newdata = data.frame(
          horTh='yes'))
##        1 
## 475.1155
  • Cox Proportional Hazards model
    • Step function like KM
    • Takes the values of the covariate(s) into account
    • For a single binary covariate the curves look much like KM, but they are constrained by the proportional-hazards assumption
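A rough sketch of both models on the hormone-therapy covariate, assuming `GBSG2` is loaded from the `TH.data` package. For the Weibull model the "fake-out" amounts to asking for quantiles over a grid of survival probabilities and assembling the curve as a data frame. The names `surv_grid`, `wb_curve` and `new_patients` are illustrative.

```r
library(survival)

# Assumption: GBSG2 is the dataset shipped with the TH.data package
data("GBSG2", package = "TH.data")

# --- Weibull: build a smooth curve for treated patients ---
wb <- survreg(Surv(time, cens) ~ horTh, data = GBSG2)

# Grid of survival probabilities; predict() wants event probabilities, hence 1 - surv
surv_grid <- seq(0.99, 0.01, by = -0.01)
t_yes <- predict(wb, type = "quantile",
                 p = 1 - surv_grid,
                 newdata = data.frame(horTh = "yes"))

# One row per grid point: time by which surv_grid of patients are still event-free
wb_curve <- data.frame(time = as.numeric(t_yes), surv = surv_grid)
head(wb_curve)

# --- Cox: step-function curves that account for the covariate ---
cox <- coxph(Surv(time, cens) ~ horTh, data = GBSG2)

new_patients <- data.frame(
  horTh = factor(c("no", "yes"), levels = levels(GBSG2$horTh))
)
cox_curves <- survfit(cox, newdata = new_patients)

# Hazard ratio for hormone therapy with its confidence interval
summary(cox)$conf.int
```

`wb_curve` can be passed to an ordinary line plot, and `cox_curves` is a `survfit` object like the Kaplan-Meier fits shown earlier.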

Example 1 - Code Complexity vs Merge Time

Example 2 - Merge times in parts of a codebase