# Survival Analysis in Python Statsmodels

Source code for statsmodels survival methods:

https://github.com/statsmodels/statsmodels/tree/master/statsmodels/duration

## Introduction

[Survival analysis](https://en.wikipedia.org/wiki/Survival_analysis)
is used to analyze data in which the primary variable of interest is a
time duration.  Some examples of duration data that can be analyzed using
methods from survival analysis are:

* A person's lifespan

* The duration that a person survives after being diagnosed with a
serious disease

* The time from the diagnosis of a disease until the
person recovers

* The duration of time that a piece of machinery remains in good working order

* The time after a person loses a job until they find a new job

* The time after a person is released from prison until they are arrested again

Duration data can be used to answer questions such as:

* What is the mean duration for a population?

* What is the 75th percentile of the durations in a population?

* When comparing two populations, which has the shorter expected or median duration?

* Is a given predictor variable associated with a duration outcome?

Here are some key concepts in duration analysis:

* __Time origin__: The durations of interest correspond to time
  intervals that begin and end when something specific happens.  It is
  important to be very explicit about what defines the time origin
  (time zero) from which the duration is calculated.  For example,
  when looking at the survival duration of people with a disease, the time
  origin could be the date of diagnosis.  When looking at human
  lifespans ("all cause mortality") it might make sense to define
  the time origin to be the date of birth.

* __Event__: This term refers to whatever happens that concludes the
  time interval of interest.  It may be death, or some other type of
  "failure", or it may be something more favorable, like recovery or
  cure from a disease.  Most survival analysis is based on the idea
  that every subject will eventually experience the event.

* __Survival time distribution__: This is a marginal distribution
  defining the proportion of the population that has experienced the
  event on or before time T.  Usually it is expressed as the
  complementary "survival function" (e.g. the proportion of people who
  have not yet died as of time T).

* __Censoring__: Censoring occurs when we do not observe when a
  subject experiences the event of interest, but we do have some
  partial information about that time.  The most common form of
  censoring is _right censoring_, in which we observe a time T such
  that we know the event did not occur prior to time T.  Other forms
  of censoring are _interval censoring_ and _left censoring_.

* __Risk set__: This is the set of units (e.g. people) in a sample at
  a given time who may possibly experience the event at that time.  It
  is usually the set of people who have not already experienced the
  event and who have not been censored (but the risk set may be only a
  subset of these people when using "entry times").

* __Hazard__: This is the probability of experiencing the event in the
  next time unit, given that it has not already occurred.  Technically,
  this is the discrete time definition of the hazard; the continuous
  time definition involves rates but follows the same logic.


Marginal survival function and hazard estimation
------------------------------------------------

The *marginal survival function* is a function $S(t)$ that returns
the probability that a randomly-selected member of the population
experiences the event after time $t$ (equivalently, that they have not
experienced the event as of time $t$).

If there is no censoring, the marginal survival function can be
estimated using the complement of the empirical [cumulative
distribution
function](https://en.wikipedia.org/wiki/Cumulative_distribution_function)
of the data.  If there is censoring, the standard method for
estimating the survival function is the product-limit estimator or
[Kaplan Meier
estimator](https://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator).
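
For the uncensored case, the complement of the empirical CDF can be computed directly.  The snippet below is a minimal sketch; the helper name `empirical_survival` and the toy durations are made up for illustration and are not part of Statsmodels.

```python
import numpy as np

def empirical_survival(t, durations):
    """Proportion of fully observed durations that exceed t."""
    return float(np.mean(np.asarray(durations) > t))

# Made-up durations with no censoring
toy_durations = [2, 3, 3, 5, 7, 8]

# Estimated probability of surviving beyond time 3
s_at_3 = empirical_survival(3, toy_durations)
print(s_at_3)
```

This only works when every duration is fully observed; with censoring, the product-limit estimator is needed instead.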

The idea behind the Kaplan-Meier estimate is not difficult.  Group the
data by the distinct times $t_1 < t_2 < \ldots$ (times here can be either
event times or censoring times), let $R(t)$ denote the risk set size
at time $t$, and let $d(t)$ indicate the number of events at time $t$ (if
there are no ties, $d(t)$ will always be equal to either 0 or 1).  The
probability of the event occurring at time $t$ (given that it has not
occurred already) is estimated to be $d(t)/R(t)$.  The probability of
the event not occurring at time $t$ is therefore estimated to be $1 -
d(t)/R(t)$.  Finally, the probability of making it to time $t_k$ without
experiencing the event is estimated to be

$$
(1 - d(t_1)/R(t_1)) \cdot (1 - d(t_2)/R(t_2)) \cdots (1 - d(t_k)/R(t_k)).
$$

A consequence of this definition is that the estimated survival
function obtained using the product-limit method is a step function
with steps at the event times.
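
The product-limit calculation described above can be sketched in a few lines of NumPy.  This is an illustrative implementation under the stated definitions, not the code used inside Statsmodels; the function name `kaplan_meier` and the toy data are assumptions made for the example.

```python
import numpy as np

def kaplan_meier(time, status):
    """Product-limit estimate of S(t) at each distinct observed time."""
    time = np.asarray(time, dtype=float)
    status = np.asarray(status, dtype=int)
    utime = np.unique(time)  # distinct times t_1 < t_2 < ...
    surv = np.empty(len(utime))
    prob = 1.0
    for k, t in enumerate(utime):
        r = np.sum(time >= t)                    # risk set size R(t)
        d = np.sum((time == t) & (status == 1))  # number of events d(t)
        prob *= 1.0 - d / r                      # multiply in 1 - d(t)/R(t)
        surv[k] = prob
    return utime, surv

# Made-up data: status 1 = event observed, 0 = right censored
toy_time = [2, 3, 3, 5, 7, 8]
toy_status = [1, 1, 0, 1, 0, 1]
km_times, km_surv = kaplan_meier(toy_time, toy_status)
print(km_times)
print(km_surv)
```

Note that the estimate does not change at time 7, where the only observation is censored, which illustrates why the steps occur only at event times.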

A closely related
[calculation](https://en.wikipedia.org/wiki/Nelson%E2%80%93Aalen_estimator)
estimates the cumulative marginal hazard function

$$
H(T) = \int_0^T h(t)dt,
$$

where $h(t)$ is the marginal hazard function.  To estimate the hazard function
itself, we first estimate the cumulative hazard function, then differentiate
it to obtain an estimate of the hazard function.  Some smoothing and interpolation
is needed to estimate the derivative of a function that is estimated with a
(non-differentiable) step function.  We do not cover these details here, but
some examples of how this may be done are shown below.
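
As a rough illustration of the cumulative hazard calculation, the sketch below accumulates the increments $d(t)/R(t)$ over the distinct observed times (the Nelson-Aalen approach) and then takes a crude finite-difference slope.  The function name `nelson_aalen` and the toy data are hypothetical, and no smoothing is applied.

```python
import numpy as np

def nelson_aalen(time, status):
    """Cumulative hazard estimate: running sum of d(t)/R(t)."""
    time = np.asarray(time, dtype=float)
    status = np.asarray(status, dtype=int)
    utime = np.unique(time)
    increments = np.array([
        np.sum((time == t) & (status == 1)) / np.sum(time >= t)
        for t in utime
    ])
    return utime, np.cumsum(increments)

# Made-up data: status 1 = event observed, 0 = right censored
na_times, na_cumhaz = nelson_aalen([2, 3, 3, 5, 7, 8], [1, 1, 0, 1, 0, 1])

# Crude, unsmoothed hazard estimate: slope between consecutive times
na_hazard = np.diff(na_cumhaz) / np.diff(na_times)
print(na_cumhaz)
print(na_hazard)
```

With real data the raw slopes are very noisy, which is why some smoothing is usually applied before plotting.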

In Statsmodels, the code `sm.SurvfuncRight(time, status)` estimates the marginal
survival function using the product-limit estimator.  In this
function call, `time` is a duration of time, either to the event
of interest (if `status=1`) or to the last time when the event
was known not to have occurred (if `status=0`).

Survival and hazard functions are usually presented as plots, often
by overlaying survival or hazard functions for several groups
of subjects on the same
axes. The code below estimates the marginal survival functions for two
groups and overlays the estimates in a plot:

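```
import matplotlib.pyplot as plt
import statsmodels.api as sm

# time1/died1 and time2/died2 hold the durations and the
# status indicators (1=event, 0=censored) for the two groups
sf1 = sm.SurvfuncRight(time1, died1)
sf2 = sm.SurvfuncRight(time2, died2)

# Overlay both estimated survival functions on the same axes
ax = plt.axes()
sf1.plot(ax)
sf2.plot(ax)
```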


## Regression analysis for survival/duration data

Regression analysis is used to understand how one or more factors of
interest are related to an "outcome" variable.  If the
outcome variable is a duration, we are doing *survival regression*.

By direct analogy with linear regression, we might seek to model the
expected survival time as a function of covariates.  If there is no
censoring, we could, for example, use least squares regression to
relate the survival time $T$, or a transformation of it (e.g. $\log(T)$)
to a linear function of the covariates, i.e.


$$
E[\log T | x] = b_0 + b_1 x_1 + \cdots + b_p x_p.
$$
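
To make the uncensored log-duration regression concrete, here is a minimal sketch using simulated data and ordinary least squares.  The coefficients, sample size, and variable names are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Simulated uncensored durations whose log is linear in the covariates
log_t = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(scale=0.5, size=n)
durations = np.exp(log_t)

# Least squares fit of log(T) on an intercept and the two covariates
design = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(design, np.log(durations), rcond=None)
print(coef)  # should be close to the simulated values (1.0, 0.5, -0.3)
```

This approach breaks down as soon as some durations are censored, since the censored values of $\log(T)$ are not observed.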

While this is sometimes done, it is more common to approach regression
for duration data by modeling
the hazards rather than by directly modeling the duration.

As noted above, the hazard is the conditional probability of experiencing the event of
interest at time $T$, given that it has not yet occurred.  For example,
in a medical study this may be the probability of a subject dying at
time $T$ given that the subject was still alive just before time $T$ (in
continuous time we would substitute "rate" for "probability" but we
ignore this distinction here).

In survival regression, we view the hazard as a function that is
determined by the covariates.  For example, the hazard may be
determined by age and gender.  A very popular form of hazard
regression models the conditional hazards as multiples of
a shared *baseline hazard function*, specifically

$$
h(t, x) = b(t) \cdot \exp(b_1 \cdot x_1 + \ldots + b_p \cdot x_p),
$$

where $b(t)$ is the "baseline hazard function", the scalars
$b_1, \ldots, b_p$
are unknown regression coefficients, and the $x_j$ are the observed
covariates for one subject.  Note that there is no intercept term
($b_0$) in the linear predictor, since it could be factored out
as a factor of $\exp(b_0)$ and absorbed into the baseline hazard
function $b(t)$.

This model can also be written in log
form

$$
\log h(t, x) = \log b(t) + b^\prime x
$$

where $b = (b_1, \ldots, b_p)$ and $x = (x_1, \ldots, x_p)$.
Thus, the log hazard
is modeled as a time-varying intercept plus a linear predictor that is
not time varying (there are generalizations of this model
in which the linear
predictor is also time-varying).

This regression model is called [proportional hazards
regression](https://en.wikipedia.org/wiki/Proportional_hazards_model)
or the "Cox model".  A key feature of this model is that it is
possible to estimate the coefficients $b_j$ using a partial likelihood
that does not involve the baseline hazard function.  This makes the
procedure "semi-parametric".
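
The sketch below shows one way to write the log partial likelihood when there are no tied event times.  It is meant only to illustrate that the baseline hazard never appears in the calculation; the function name `cox_partial_loglike` and the toy data are assumptions, and Statsmodels uses its own, more general implementation.

```python
import numpy as np

def cox_partial_loglike(beta, time, status, x):
    """Log partial likelihood, assuming no tied event times."""
    time = np.asarray(time, dtype=float)
    status = np.asarray(status, dtype=int)
    x = np.asarray(x, dtype=float)
    linpred = x @ np.asarray(beta, dtype=float)
    ll = 0.0
    for i in np.flatnonzero(status == 1):
        at_risk = time >= time[i]  # risk set at the i-th event time
        # Contribution: the subject's linear predictor minus the log of
        # the summed exp(linear predictor) over the risk set
        ll += linpred[i] - np.log(np.sum(np.exp(linpred[at_risk])))
    return ll

# Made-up data: three subjects, one covariate, all events observed
toy_time = [1, 2, 3]
toy_status = [1, 1, 1]
toy_x = [[0.0], [1.0], [0.0]]
ll_at_zero = cox_partial_loglike([0.0], toy_time, toy_status, toy_x)
print(ll_at_zero)
```

With all coefficients equal to zero, each event contributes minus the log of the risk set size, so the value here is $-\log(3 \cdot 2 \cdot 1)$.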

The key point to remember about interpreting this model is that a
coefficient $b_j$, for a covariate, say age, has the property that the
hazard of the outcome event (e.g. of dying) changes multiplicatively by a factor of $\exp(b_j)$
for each unit increase in the value of the covariate $x_j$.
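
As a small worked example of this interpretation, suppose the fitted coefficient for age (in years) were 0.05.  This value is hypothetical and is not taken from any model in this document.

```python
import math

b_age = 0.05  # hypothetical coefficient for age in years

# Hazard ratio for a one-year difference in age
hr_one_year = math.exp(b_age)

# Hazard ratio for a ten-year difference: the one-year ratio compounds
hr_ten_years = math.exp(10 * b_age)

print(round(hr_one_year, 3), round(hr_ten_years, 3))
```

Because the effect is multiplicative, the ten-year hazard ratio equals the one-year hazard ratio raised to the tenth power.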

The following code fits a proportional hazards regression model in
Statsmodels with no censoring:

```
model = sm.PHReg.from_formula("time ~ disease + gender + age",
              data=df)
result = model.fit()
```


After fitting the model, `result.summary()` prints the usual table of
regression coefficients and standard errors.

__More advanced topics in proportional hazards regression:__

* Proportional hazards regression models allow the data to be
  stratified.  Stratification is a partitioning of the data into
  groups.  When estimating the coefficients, individuals are only
  compared to other people in the same group.  This means that the
  results are unaffected by confounding factors that are stable within
  groups.  It is common to use stratification as a proxy for
  difficult-to-measure confounders.  For example, in social research the
  geographic location of a person's residence may be used to define
  strata.

* In many settings, we do not observe every subject starting from their time
  origin.  If we begin monitoring a subject at a time t, then they
  could not have been observed to have the event before that time.
  For that reason, the subject should be removed from the risk set for
  all times prior to t.  This can be accomplished by specifying t as
  an _entry time_.

* Survival regression can use weights to project results from a sample
  to a population that differs from the population the data were
  sampled from.
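
To illustrate how entry times change the risk set, the sketch below treats a subject as at risk at time t only if they entered before t and have not yet left observation.  The arrays, the helper name `risk_set`, and the exact boundary convention (entry strictly before t) are assumptions made for this example.

```python
import numpy as np

# Made-up entry and exit times for four subjects
entry = np.array([0, 0, 4, 6])
exit_ = np.array([5, 9, 8, 10])

def risk_set(t, entry, exit_):
    """Indices of subjects under observation and still at risk at time t."""
    return np.flatnonzero((entry < t) & (exit_ >= t))

print(risk_set(3, entry, exit_))  # subjects 2 and 3 have not yet entered
print(risk_set(7, entry, exit_))  # subject 0 has already left
```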

## More on censoring

A large part of survival analysis is concerned with appropriately
handling censoring (if there is no censoring, it is generally possible
to analyze the log durations using standard, non-survival methods).
Censoring can be a subtle topic.  All survival methods have
limitations on the type of censoring they can handle, and it is not
always easy or even possible to determine in a given setting whether a
survival method can accommodate the type of censoring that is present.

To make things more concrete, we usually imagine that every subject
has both an event time $T$ and a censoring time $C$.  That is, every
subject would eventually experience the event (if there were no
censoring), and would eventually be censored (if the event did not
happen).  We observe ${\rm min}(T, C)$, and a *survival status*
indicator of whether the event occurred
(i.e. that ${\rm min}(T, C)$ is equal to $T$).
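
This setup is easy to simulate.  The sketch below draws independent event and censoring times from exponential distributions (an arbitrary choice made for illustration) and constructs the observed duration and status indicator.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Latent event time T and censoring time C, drawn independently
event_time = rng.exponential(scale=10.0, size=n)
censor_time = rng.exponential(scale=15.0, size=n)

# What we actually get to see: min(T, C) and whether the event occurred
observed = np.minimum(event_time, censor_time)
status = (event_time <= censor_time).astype(int)

print(status.mean())  # fraction of subjects whose event was observed
```

In real data only `observed` and `status` are available, which is why dependence between $T$ and $C$ is so hard to check.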

The key requirement for most survival methods is that we have
"independent censoring", meaning that $T$ and $C$ are statistically
independent quantities.  Since we never observe both $T$ and $C$ for the
same person, it is usually not possible to directly assess whether $T$
and $C$ are dependent.  However knowledge about the data collection
process is often used to assess whether independent censoring is
plausible to hold.

For example, one type of censoring that is quite common is
"administrative censoring".  This occurs when a study has a fixed data
collection window, say a three-year interval from January 1, 2012 to
January 1, 2015.  Suppose that people are randomly recruited into the
study, and if the event has not occurred by January 1, 2015, the
subject is censored.  Thus, subjects who are recruited into the study
later are more likely to be censored.  As long as the recruitment date
is independent of the true survival time, $T$ and $C$ are independent.
However we can imagine a setting when administrative censoring may
induce dependence, e.g. if the subjects recruited later in the study
were healthier than those recruited earlier.  But in many situations,
this can be excluded as a likely circumstance based on knowledge of
how the study was conducted.
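
A small simulation can make the administrative censoring scenario concrete.  Everything here is made up for illustration: recruitment is uniform over a three-year window, survival times are exponential, and recruitment is independent of survival.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
window = 3.0  # length of the data collection window, in years

# Recruitment time within the window, independent of survival
recruit = rng.uniform(0.0, window, size=n)
event_time = rng.exponential(scale=2.0, size=n)  # T, years after recruitment

# Administrative censoring time C: follow-up ends when the window closes
censor_time = window - recruit
status = (event_time <= censor_time).astype(int)

# Compare censoring rates for early and late recruits
early = recruit < window / 2
frac_censored_early = 1 - status[early].mean()
frac_censored_late = 1 - status[~early].mean()
print(frac_censored_early, frac_censored_late)
```

Late recruits are censored much more often, yet $T$ and $C$ are still independent here because the recruitment time was generated independently of the survival time.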

On the other hand, in some cases there is strong reason to believe
that subjects are more likely to be censored as they grow sicker,
which may mean that $T$ and $C$ are positively dependent.  For example, if
we have a medical study in which the data come from insurance records,
as people get sicker they are more likely to become unable to work,
and may have to quit their job (leading to them being censored).
Similarly, as people age they may retire or become eligible for
Medicare, leading to age-dependent censoring.  Since age is likely
correlated with survival time, this could also induce dependent
censoring.

One remedy for dependent censoring is to identify covariates such that
$T$ and $C$ become independent after conditioning on the covariates.  For
example, age or a measure of overall health may be sufficient to
substantially reduce the dependence between $T$ and $C$.

## Case study using the birth and death dates of notable people

Next we will apply the methods discussed above to a large dataset containing
the birth and death dates of notable people.  The data are available
[here](http://science.sciencemag.org/content/suppl/2014/07/30/345.6196.558.DC1).

This data set contains the birth year and death year for over 120,000
"notable people".  This population roughly corresponds to people about
whom Wikipedia articles have been written.  The data set also includes
information about the location where each person was born and died,
but we will not use that information here.

First we import the libraries that we will be using.

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
import os

The data are provided in Excel format, so we next use Pandas to
convert the Excel file to a text/csv file.
We also drop several columns that we will not be using.

The notebook cell below only needs to
run once.  To run the cell one time, change the `False` value in the `if`
statement to a `True`, then after running the cell
change the value back to `False` to
prevent the cell from running again.  Note that you will also need to change
the file path to point to the location where you have put the Excel sheet in
your file system.

In [None]:
if False:
    # Change the path to point to the data file
    df = pd.read_excel("/nfs/kshedden/Schich/SchichDataS1_FB.xlsx")
    df = df[["PrsLabel", "BYear", "DYear", "Gender"]]
    df.to_csv("schich.csv.gz", index=None, compression="gzip")

Now we can read the data file.  As usual, it is a good idea to check the shape
(dimensions) of the data, to confirm that the types are appropriate, and to
inspect the first few rows of data.

In [None]:
df = pd.read_csv("schich.csv.gz")
print(df.shape)
print(df.dtypes)
df.head()

To work with survival analysis methods, we need a duration variable
and a "censoring status variable".  The
natural choice here is to define the duration as a person's life span.

The convention used in Statsmodels for the "censoring status variable"
is that the status is 1 if the person experiences the event at the end
of their recorded
duration, and the status is 0 if the person is censored at their recorded
duration.  This dataset does not include records for people who have
not died, so our status variable is identically 1.

In [None]:
df["status"] = 1

Note that in most
cases, a data set that excludes people who have not experienced the
event of interest (i.e. death) yields biased
results under survival analysis.  However in this example, the overall
time span
of the data is very large (thousands of years), so a relatively small
fraction of the population has been omitted due to still being alive.
The bias is therefore likely to be small.

Now we can construct our duration variable, and make a histogram
of its distribution.  Note that in general this histogram
would be influenced by
deaths as well as by censoring, so is not something that would
normally be presented as part of the results of the analysis.
However it is a useful summary to use for initial data checking.

In [None]:
df["lifespan"] = df.DYear - df.BYear

df = df.loc[df.lifespan > 0, :]
sns.histplot(df.lifespan, kde=True)

Above we defined the key concept of a marginal survival function, and discussed how it can
be estimated.  Next, we estimate the marginal survival functions for females and for males.

The graph below shows that the curve for men is higher than the curve for women up to around age 60, but
after this point, the curves cross.  One possible explanation for this could be that
this population has a somewhat larger number of women who die young compared to men
who die young.  But once we move to the subpopulation of people who survive to age
60, women tend to live longer than men.

In [None]:
ax = plt.gca()
plt.grid(True)
plt.xlabel("Age (years)", size=15)
plt.ylabel("Proportion", size=15)
for sex in "Female", "Male":
    dx = df.loc[df.Gender == sex, :]
    s = sm.SurvfuncRight(dx.lifespan, dx.status, title=sex)
    s.plot(ax=ax)

# Create a legend
ha, lb = ax.get_legend_handles_labels()
ha = [ha[0], ha[2]] # Optional, hide points from legend
lb = [lb[0], lb[2]]
leg = plt.figlegend(ha, lb, loc="upper center", ncol=2)
leg.draw_frame(False)

The marginal hazard function mathematically contains the same information as the marginal
survival function.  But the information in a hazard function
is represented and interpreted in a very different
way.  If $S(t)$ is the survival function, the hazard function is $h(t) = -S^\prime(t)/S(t)$.
Statsmodels does not have a function for directly constructing the hazard estimate,
so we include that below.
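
Since $-S^\prime(t)/S(t)$ is the derivative of $-\log S(t)$, the hazard can be approximated by differencing the log survival function.  The check below is a minimal sketch using an exponential survival function, whose hazard is known to be constant; the rate and the time grid are arbitrary choices.

```python
import numpy as np

rate = 0.2                        # arbitrary constant hazard
t = np.linspace(0.0, 10.0, 101)   # time grid
surv = np.exp(-rate * t)          # exponential survival function S(t)

# Finite-difference approximation to -d/dt log S(t)
hz = -np.diff(np.log(surv)) / np.diff(t)
print(hz[:3])
```

The recovered values all equal the constant rate, up to floating point error.  With an estimated step function in place of `surv`, the same differencing gives a noisy but usable hazard estimate.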

The hazard function estimates shown below suggest a substantial gap between the hazards
of dying for women and men up to around age 20, with women having greater hazard
(greater probability of death) for these ages.  The gap narrows until the hazards
cross at around age 45.  This crossing could be due to women dying in childbirth, although
selection bias due to gender differences in who is considered "notable" is also a possibility.
It is not surprising that older men have greater hazards than older women, since women
tend to live longer than men in most human populations.

In [None]:
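# The hazard function is the derivative of the cumulative hazard function,
# which is -log(S(t)).  We approximate it with finite differences.
def hazard(sf):
    tm = sf.surv_times
    pr = sf.surv_prob
    ii = (pr > 0)  # drop times where S(t) = 0, since log(0) is undefined
    tm = tm[ii]
    pr = pr[ii]
    lpr = np.log(pr)
    return tm[0:-1], -np.diff(lpr) / np.diff(tm)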

plt.grid(True)
for sex in "Female", "Male":
    dx = df.loc[df.Gender == sex, :]
    s = sm.SurvfuncRight(dx.lifespan, dx.status, title=sex)
    tm, hz = hazard(s)
    plt.plot(tm, np.log(hz), lw=3, label=sex)
ha, lb = plt.gca().get_legend_handles_labels()
leg = plt.figlegend(ha, lb, loc="upper center", ncol=2)
leg.draw_frame(False)
plt.xlabel("Age", size=15)
plt.ylabel("Log hazard", size=15)
_ = plt.xlim(0, 90)

We can also subdivide the population by era, and see how the survival
distributions vary over time.

In [None]:
ax = plt.gca()
plt.grid(True)
plt.xlabel("Lifespan (years)", size=15)
plt.ylabel("Proportion", size=15)
for byear in np.arange(0, 2000, 500):
    ii = (df.BYear >= byear) & (df.BYear < byear + 500)
    dx = df.loc[ii, :]
    s = sm.SurvfuncRight(dx.lifespan, dx.status,
                         title="%d-%d" % (byear, byear+500))
    s.plot(ax=ax)

# Create a legend
ha, lb = ax.get_legend_handles_labels()
ha = [ha[i] for i in range(0, len(ha), 2)] # Optional, hide points from legend
lb = [lb[i] for i in range(0, len(lb), 2)]
leg = plt.figlegend(ha, lb, loc="upper center", ncol=4)
leg.draw_frame(False)

Below we show the hazard functions corresponding to the survival functions
plotted above.

In [None]:
plt.grid(True)
for byear in np.arange(0, 2000, 500):
    ii = (df.BYear >= byear) & (df.BYear < byear + 500)
    dx = df.loc[ii, :]
    s = sm.SurvfuncRight(dx.lifespan, dx.status)
    tm, hz = hazard(s)
    plt.plot(tm, np.log(hz), lw=3, label="%d-%d" % (byear, byear+500))
ha, lb = plt.gca().get_legend_handles_labels()
leg = plt.figlegend(ha, lb, loc="upper center", ncol=4)
leg.draw_frame(False)
plt.xlabel("Age", size=15)
plt.ylabel("Log hazard", size=15)
plt.xlim(0, 90)

Finally we can fit a proportional hazards model to the data, to see
how the hazards (of dying) vary with birth year and gender.

In [None]:
df = df.loc[df.Gender.isin(["Female", "Male"]), :]

fml = "lifespan ~ BYear + Gender + BYear*Gender"
model3 = sm.PHReg.from_formula(fml, data=df)
result3 = model3.fit()
result3.summary()

To visualize the hazards as estimated by this model, we can plot the
log hazard ratios with respect to birth year, for females and for males.

In [None]:
plt.grid(True)

for gender in "Female", "Male":
    dx = df.iloc[0:100, :].copy()
    dx["BYear"] = np.linspace(0, 2019, 100)
    dx["Gender"] = gender
    y = result3.predict(exog=dx)
    plt.plot(dx.BYear, y.predicted_values, label=gender)

ha, lb = plt.gca().get_legend_handles_labels()
leg = plt.figlegend(ha, lb, loc="upper center", ncol=2)
leg.draw_frame(False)

plt.xlabel("Birth year")
_ = plt.ylabel("Log hazard ratio")

We can also plot the baseline hazard function, which shows as expected
that the hazard of dying increases sharply with age.

In [None]:
# Plot the baseline hazard function
bhaz = result3.baseline_cumulative_hazard[0]
x = bhaz[0]
y = bhaz[1]
haz = np.diff(y, 1) / np.diff(x, 1)
plt.clf()
plt.grid(True)
plt.plot(x[0:-1], haz, lw=3)
plt.xlim(0, 90)
plt.ylim(0, 2)
plt.xlabel("Age (years)", size=15)
_ = plt.ylabel("Hazard", size=15)

## Survival analysis of NHANES III data

Next we turn to another example, using the NHANES III data.
This is a special set of NHANES data in which the social
security death index (and other sources) were used to
assess which subjects from an earlier wave of NHANES were
still alive at a given point in time.

Data sources:

https://wwwn.cdc.gov/nchs/nhanes/nhanes3/datafiles.aspx

https://www.cdc.gov/nchs/data-linkage/mortality-public.htm

Performing this analysis requires us to merge a datafile
of measures obtained when the subjects took part in NHANES
with a separate datafile collected much later to assess their mortality
status.

In [None]:
dpath = "/nfs/kshedden/NHANES/"

First we read the survival data.

In [None]:
fname = "NHANES_III_MORT_2011_PUBLIC.dat.gz"
colspecs = [(0, 5), (14, 15), (15, 16), (43, 46), (46, 49)]
names = ["seqn", "eligstat", "mortstat", "permth_int", "permth_exam"]
f = os.path.join(dpath, fname)
surv = pd.read_fwf(f, colspecs=colspecs, names=names, compression="gzip")

Next we read the interview/examination data and merge it with the survival data.

In [None]:
fname = "adult.dat.gz"
colspecs = [(0, 5), (14, 15), (17, 19), (28, 31), (33, 34), (32, 33), (34, 35), (35, 41)]
names = ["seqn", "sex", "age", "county", "urbanrural", "state", "region", "poverty"]
f = os.path.join(dpath, fname)
df = pd.read_fwf(f, colspecs=colspecs, names=names, compression="gzip")
df = pd.merge(surv, df, left_on="seqn", right_on="seqn")

Before starting the analysis, we will modify a few of the variables for clarity.

In [None]:
df["poverty"] = df["poverty"].replace({888888: np.nan})
df["female"] = (df.sex == 2).astype(int)
df["rural"] = (df.urbanrural == 2).astype(int)
df["age_int"] = 12*df.age  # months
df["end"] = df.age_int + df.permth_int  # months

df = df.dropna()

All of these subjects were known to be alive at the time
that they participated in the NHANES study.  The variable
`age` is the age of a subject when they participated in
NHANES.  We convert this to months above
and call it `age_int`.  The variable `end` contains
the age of a subject when their status was checked in the
NHANES mortality study.  This is either their age
when they died (if they died), or the last known age when
they were known to be alive.  The `mortstat` variable indicates
which of these two states applies.
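
As a quick arithmetic check of these definitions, consider a hypothetical subject who was 40 years old at the NHANES interview and was followed for 150 months afterward (both values are made up).

```python
age_years = 40         # hypothetical age at the NHANES interview
followup_months = 150  # hypothetical value of permth_int

entry_months = 12 * age_years                 # corresponds to age_int
end_months = entry_months + followup_months   # corresponds to end

print(entry_months, end_months)
```

Using age as the time scale in this way means that the entry time is the age at interview, and the duration runs from birth to the age at death or last follow-up.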

`SurvfuncRight` can't handle 0 survival times, so we remove them.

In [None]:
df = df.loc[df.end > df.age_int]

We can estimate the hazard function (for death) for women and for men,
and plot them together for comparison.

In [None]:
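# The hazard function is the derivative of the cumulative hazard function,
# which is -log(S(t)).  We approximate it with finite differences.
def hazard(sf):
    tm = sf.surv_times
    pr = sf.surv_prob
    ii = (pr > 0)  # drop times where S(t) = 0, since log(0) is undefined
    tm = tm[ii]
    pr = pr[ii]
    lpr = np.log(pr)
    return tm[0:-1], -np.diff(lpr) / np.diff(tm)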

# Plot hazard functions for women and men
plt.grid(True)
sex = {0: "Male", 1: "Female"}
for female in (0, 1):
    ii = df.female == female
    dx = df.loc[ii, :]
    s = sm.SurvfuncRight(dx.loc[:, "end"], dx.loc[:, "mortstat"],
                   entry=dx.loc[:, "age_int"])
    tm, hz = hazard(s)
    ha = sm.nonparametric.lowess(np.log(hz), tm/12)
    plt.plot(ha[:, 0], ha[:, 1], lw=3, label=sex[female])
ha, lb = plt.gca().get_legend_handles_labels()
leg = plt.figlegend(ha, lb, loc="upper center", ncol=2)
leg.draw_frame(False)
plt.xlabel("Age", size=15)
plt.ylabel("Log hazard", size=15)
_ = plt.xlim(18, 90)

Next we fit a proportional hazards regression model, looking
at the risk of dying as a function of a subject's gender,
urban/rural status, region, and poverty status (higher values of the
`poverty` variable correspond to greater wealth).

In [None]:
fml = "end ~ female + rural + C(region) + poverty"
model1 = sm.PHReg.from_formula(fml, status="mortstat", entry=df.age_int,
                               data=df)
result1 = model1.fit()
result1.summary()

We now extend this model by stratifying on state, so that subjects
are compared only to other subjects in the same state.

In [None]:
fml = "end ~ female + rural + C(region) + poverty"
model2 = sm.PHReg.from_formula(fml, status="mortstat", entry=df.age_int,
                               strata=df.state, data=df)
result2 = model2.fit()
result2.summary()

We now further control for geographical heterogeneity, by
stratifying on the counties.

In [None]:
fml = "end ~ female + rural + C(region) + poverty"
model3 = sm.PHReg.from_formula(fml, status="mortstat", entry=df.age_int,
                               strata=df.county, data=df)
result3 = model3.fit()
result3.summary()

In these analyses, the gender and poverty effects are very similar
across the three analytic approaches.  However the association between
urban/rural status and mortality changes substantially depending on how we
model the role of geography.  It is possible that there is an increased
hazard of dying for people in a rural location, if we compare them
to people who live in a more urban part of the same county.