In the previous section</Survival Analysis intro>
, we introduced how survival analysis is used, needed, and the mathematical objects that it relies on. In this article, we will work with real data and the lifelines library to estimate these mathematical objects.
For this example, we will be investigating the lifetimes of political leaders around the world. A political leader in this case is defined by a single individual's time in office who controls the ruling regime. This could be an elected president, unelected dictator, monarch, etc. The birth event is the start of the individual's tenure, and the death event is the retirement of the individual. Censorship can occur if they are a) still in offices at the time of dataset complilation (2008), or b) die while in office (this includes assassinations).
For example, the Bush regime began in 2000 and officially ended in 2008 upon his retirement, thus this regime's lifespan was 8 years and the "death" event was observed. On the other hand, the JFK regime lasted 2 years, from 1961 and 1963, and the regime's official death event was not observed -- JFK died before his official retirement.
(This is an example that has gladly redefined the birth and death events, and infact completely flips the idea upside down by using deaths as the censorship event. This is also an example where the current time is not the only cause of censorship -- there are alternative events (eg: death in office) that can censor.)
To estimate the survival function, we first will use the Kaplan-Meier Estimate, defined:
where di are the number of death events at time t and ni is the number of subjects at risk of death just prior to time t.
Let's bring in our dataset.
import pandas as pd
import lifelines
data = lifelines.datasets.load_dd()
data.sample(6)
#the boolean columns `observed` refers to whether the death (leaving office)
#was observed or not.
ctryname | cowcode2 | politycode | un_region_name | un_continent_name | ehead | leaderspellreg | democracy | regime | start_year | duration | observed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
164 | Bolivia | 145 | 145.0 | South America | Americas | Rene Barrientos Ortuno | Rene Barrientos Ortuno.Bolivia.1966.1968.Milit... | Non-democracy | Military Dict | 1966 | 3 | 0 |
740 | India | 750 | 750.0 | Southern Asia | Asia | Chandra Shekhar | Chandra Shekhar.India.1990.1990.Parliamentary Dem | Democracy | Parliamentary Dem | 1990 | 1 | 1 |
220 | Bulgaria | 355 | 355.0 | Eastern Europe | Europe | Todor Zhivkov | Todor Zhivkov.Bulgaria.1954.1988.Civilian Dict | Non-democracy | Civilian Dict | 1954 | 35 | 1 |
772 | Ireland | 205 | 205.0 | Northern Europe | Europe | Charles Haughey | Charles Haughey.Ireland.1979.1980.Mixed Dem | Democracy | Mixed Dem | 1979 | 2 | 1 |
1718 | United States of America | 2 | 2.0 | Northern America | Americas | Gerald Ford | Gerald Ford.United States of America.1974.1976... | Democracy | Presidential Dem | 1974 | 3 | 1 |
712 | Iceland | 395 | 395.0 | Northern Europe | Europe | Stefan Stefansson | Stefan Stefansson.Iceland.1947.1948.Mixed Dem | Democracy | Mixed Dem | 1947 | 2 | 1 |
6 rows × 12 columns
From the lifelines
library, we'll need the KaplanMeierFitter
for this exercise:
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()
Note
Other ways to estimate the survival function in lifelines are BreslowFlemingHarringtonFitter
, WeibullFitter
, ExponentialFitter
For this estimation, we need the duration each leader was/has been in office, and whether or not they were observed to have left office (leaders who died in office or were in office in 2008, the latest date this data was record at, do not have observed death events)
We next use the KaplanMeierFitter
method fit
to fit the model to the data. (This is similar to, and inspired by, scikit-learn's fit/predict API)
KaplanMeierFitter.fit(durations, event_observed=None,
timeline=None, entry=None, label='KM_estimate',
alpha=None, left_censorship=False, ci_labels=None)
Parameters:
duration: an array, or pd.Series, of length n -- duration subject was observed for
timeline: return the best estimate at the values in timelines (postively increasing)
event_observed: an array, or pd.Series, of length n -- True if the the death was observed, False if the event
was lost (right-censored). Defaults all True if event_observed==None
entry: an array, or pd.Series, of length n -- relative time when a subject entered the study. This is
useful for left-truncated (not left-censored) observations. If None, all members of the population
were born at time 0.
label: a string to name the column of the estimate.
alpha: the alpha value in the confidence intervals. Overrides the initializing
alpha for this call to fit only.
left_censorship: True if durations and event_observed refer to left censorship events. Default False
ci_labels: add custom column names to the generated confidence intervals
as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
Returns:
a modified self, with new properties like 'survival_function_'.
Below we fit our data with the KaplanMeierFitter
:
T = data["duration"]
E = data["observed"]
kmf.fit(T, event_observed=E)
<lifelines.KaplanMeierFitter: fitted with 1808 observations, 340 censored>
After calling the fit
method, the KaplanMeierFitter
has a property called survival_function_
. (Again, we follow the styling of scikit-learn, and append an underscore to all properties that were computational estimated) The property is a Pandas DataFrame, so we can call plot
on it:
kmf.survival_function_.plot()
plt.title('Survival function of political regimes');
How do we interpret this? The y-axis represents the probability a leader is still around after t years, where t years is on the x-axis. We see that very few leaders make it past 20 years in office. Of course, like all good stats, we need to report how uncertain we are about these point estimates, i.e. we need confidence intervals. They are computed in the call to fit
, and are located under the confidence_interval_
property. (The mathematics can be found in these notes.)
S(t) = Pr(T > t)
Alternatively, we can call plot
on the KaplanMeierFitter
itself to plot both the KM estimate and its confidence intervals:
kmf.plot()
Note
Don't like the shaded area for confidence intervals? See below for examples on how to change this.
The median time in office, which defines the point in time where on average 1/2 of the population has expired, is a property:
kmf.median_
# 4
#
Interesting that it is only 3 years. That means, around the world, when a leader is elected there is a 50% chance he or she will be gone in 3 years!
Let's segment on democratic regimes vs non-democratic regimes. Calling plot
on either the estimate itself or the fitter object will return an axis
object, that can be used for plotting further estimates:
ax = plt.subplot(111)
dem = (data["democracy"] == "Democracy")
kmf.fit(T[dem], event_observed=E[dem], label="Democratic Regimes")
kmf.plot(ax=ax, ci_force_lines=True)
kmf.fit(T[~dem], event_observed=E[~dem], label="Non-democratic Regimes")
kmf.plot(ax=ax, ci_force_lines=True)
plt.ylim(0, 1);
plt.title("Lifespans of different global regimes");
We might be interested in estimating the probabilities in between some points. We can do that with the timeline
argument. We specify the times we are interested in, and are returned a DataFrame with the probabilties of survival at those points:
ax = subplot(111)
t = np.linspace(0, 50, 51)
kmf.fit(T[dem], event_observed=E[dem], timeline=t, label="Democratic Regimes")
ax = kmf.plot(ax=ax)
print "Median survival time of democratic:", kmf.median_
kmf.fit(T[~dem], event_observed=E[~dem], timeline=t, label="Non-democratic Regimes")
ax = kmf.plot(ax=ax)
print "Median survival time of non-democratic:", kmf.median_
plt.ylim(0,1)
plt.title("Lifespans of different global regimes");
Median survival time of democratic: Democratic Regimes 3 dtype: float64 Median survival time of non-democratic: Non-democratic Regimes 6 dtype: float64
It is incredible how much longer these non-democratic regimes exist for. A democratic regime does have a natural bias towards death though: both via elections and natural limits (the US imposes a strict 8 year limit). The median of a non-democractic is only about twice as large as a democratic regime, but the difference is really apparent in the tails: if you're a non-democratic leader, and you've made it past the 10 year mark, you probably have a long life ahead. Meanwhile, a democratic leader rarely makes it past 10 years, and then have a very short lifetime past that.
Here the difference between survival functions is very obvious, and performing a statistical test seems pedantic. If the curves are more similar, or we possess less data, we may be interested in performing a statistical test. In this case, lifelines contains routines in lifelines.statistics
to compare two survival curves. Below we demonstrate this routine. The function logrank_test
is a common statistical test in survival analysis that compares two event series' generators. If the value returned exceeds some prespecified value, then we rule that the series have different generators.
from lifelines.statistics import logrank_test
results = logrank_test(T[dem], T[~dem], E[dem], E[~dem], alpha=.99)
results.print_summary()
- Results
df: 1 alpha: 0.99 t 0: -1 test: logrank null distribution: chi squared
__ p-value _______ test results ____ 208.306 | Reject Null | True
Lets compare the different types of regimes present in the dataset:
regime_types = data['regime'].unique()
for i,regime_type in enumerate(regime_types):
ax = plt.subplot(2, 3, i+1)
ix = data['regime'] == regime_type
kmf.fit( T[ix], E[ix], label=regime_type)
kmf.plot(ax=ax, legend=False)
plt.title(regime_type)
plt.xlim(0, 50)
if i==0:
plt.ylabel('Frac. in power after $n$ years')
plt.tight_layout()
lifelines data format is consistent across all estimator class and functions: an array of individual durations, and the individuals event observation (if any). These are often denoted T
and E
respectively. For example:
T = [0,3,3,2,1,2]
E = [1,1,0,0,1,1]
kmf.fit(T, event_observed=E)
The raw data is not always available in this format -- lifelines includes some helper functions to transform data formats to lifelines format. These are located in the lifelines.utils
sublibrary. For example, the function datetimes_to_durations
accepts an array or Pandas object of start times/dates, and an array or Pandas objects of end times/dates (or None
if not observed):
from lifelines.utils import datetimes_to_durations
start_date = ['2013-10-10 0:00:00', '2013-10-09', '2013-10-10']
end_date = ['2013-10-13', '2013-10-10', None]
T, E = datetimes_to_durations(start_date, end_date, fill_date='2013-10-15')
print 'T (durations): ', T
print 'E (event_observed): ', E
T (durations): [ 3. 1. 5.] E (event_observed): [ True True False]
The function datetimes_to_durations
is very flexible, and has many keywords to tinker with.
The survival curve is a great way to summarize and visualize the lifetime data, however it is not the only way. If we are curious about the hazard function λ(t) of a population, we unfortunately cannot transform the Kaplan Meier estimate -- statistics doesn't work quite that well. Fortunately, there is a proper estimator of the cumulative hazard function:
Λ(t) = ∫0tλ(z) dz
The estimator for this quantity is called the Nelson Aalen estimator:
where di is the number of deaths at time ti and ni is the number of susceptible individuals.
In lifelines, this estimator is available as the NelsonAalenFitter
. Let's use the regime dataset from above:
T = data["duration"]
E = data["observed"]
from lifelines import NelsonAalenFitter
naf = NelsonAalenFitter()
naf.fit(T,event_observed=E)
After fitting, the class exposes the property cumulative_hazard_
as a DataFrame:
print naf.cumulative_hazard_.head()
naf.plot()
NA-estimate
0 0.000000 1 0.325912 2 0.507356 3 0.671251 4 0.869867
[5 rows x 1 columns]
The cumulative hazard has less immediate understanding than the survival curve, but the hazard curve is the basis of more advanced techniques in survival analysis. Recall that we are estimating cumulative hazard curve, Λ(t). (Why? The sum of estimates is much more stable than the point-wise estimates.) Thus we know the rate of change of this curve is an estimate of the hazard function.
Looking at figure above, it looks like the hazard starts off high and gets smaller (as seen by the decreasing rate of change). Let's break the regimes down between democratic and non-democratic, during the first 20 years:
Note
We are using the loc
argument in the call to plot
here: it accepts a slice
and plots only points within that slice.
naf.fit(T[dem], event_observed=E[dem], label="Democratic Regimes")
ax = naf.plot(loc=slice(0, 20))
naf.fit(T[~dem], event_observed=E[~dem], label="Non-democratic Regimes")
naf.plot(ax=ax, loc=slice(0, 20))
plt.title("Cumulative hazard function of different global regimes");
Looking at the rates of change, I would say that both political philosophies have a constant hazard, albeit democratic regimes have a much higher constant hazard. So why did the combination of both regimes have a decreasing hazard? This is the effect of frailty, a topic we will discuss later.
Interpretation of the cumulative hazard function can be difficult -- it is not how we usually interpret functions. (On the other hand, most survival analysis is done using the cumulative hazard function, so understanding it is recommended).
Alternatively, we can derive the more-interpretable hazard curve, but there is a catch. The derivation involves a kernel smoother (to smooth out the differences of the cumulative hazard curve) , and this requires us to specify a bandwidth parameter that controls the amount of smoothing. This functionality is provided in the smoothed_hazard_
and hazard_confidence_intervals_
methods. (Why methods? They require an argument representing the bandwidth).
There is also a plot_hazard
function (that also requires a bandwidth
keyword) that will plot the estimate plus the confidence intervals, similar to the traditional plot
functionality.
b = 3.
naf.fit(T[dem], event_observed=E[dem], label="Democratic Regimes")
ax = naf.plot_hazard(bandwidth=b)
naf.fit(T[~dem], event_observed=E[~dem], label="Non-democratic Regimes")
naf.plot_hazard(ax=ax, bandwidth=b)
plt.title("Hazard function of different global regimes | bandwidth=%.1f"%b);
plt.ylim(0, 0.4)
plt.xlim(0, 25);
It is more clear here which group has the higher hazard, and like hypothesized above, both hazard rates are close to being constant.
There is no obvious way to choose a bandwidth, and different bandwidths can produce different inferences, so best to be very careful here. (My advice: stick with the cumulative hazard function.)
b = 8.
naf.fit(T[dem], event_observed=E[dem], label="Democratic Regimes")
ax = naf.plot_hazard(bandwidth=b)
naf.fit(T[~dem], event_observed=E[~dem], label="Non-democratic Regimes")
naf.plot_hazard(ax=ax, bandwidth=b)
plt.title("Hazard function of different global regimes | bandwidth=%.1f"%b);
We've mainly been focusing on right-censorship, which describes cases where we do not observe the death event. This situation is the most common one. Alternatively, there are situations where we do not observe the birth event occurring. Consider the case where a doctor sees a delayed onset of symptoms of an underlying disease. The doctor is unsure when the disease was contracted (birth), but knows it was before the discovery.
Another situation where we have left censored data is when measurements have only an upperbound, that is, the measurements instruments could only detect the measurement was less than some upperbound.
lifelines has support for left-censored datasets in the KaplanMeierFitter
class, by adding the keyword left_censorship=True
(default False
) to the call to fit
.
from lifelines.datasets import load_lcd
lcd_dataset = load_lcd()
ix = lcd_dataset['group'] == 'alluvial_fan'
T = lcd_dataset[ix]['T']
E = lcd_dataset[ix]['E'] #boolean array, True if observed.
kmf = KaplanMeierFitter()
kmf.fit(T, E, left_censorship=True)
Instead of producing a survival function, left-censored data is more interested in the cumulative density function of time to birth. This is available as the cumulative_density_
property after fitting the data.
print kmf.cumulative_density_
kmf.plot() #will plot the CDF
Another form of bias that can be introduced into a dataset is called left-truncation. (Also a form of censorship). This occurs when individuals may die even before ever entering into the study. Both KaplanMeierFitter
and NelsonAalenFitter
have an optional arugment for entry
, which is an array of equal size to the duration array. It describes the offset from birth to entering the study. This is also useful when subjects enter the study at different points in their lifetime. For example, if you are measuring time to death of prisoners in prison, the prisoners will enter the study at different ages.
Note
Nothing changes in the duration array: it still measures time from entry of study to time left study (either by death or censorship)
Note
Other types of censorship, like interval-censorship, are not implemented in lifelines yet.