Merge 6d1bdb6 into 66ada92
CamDavidsonPilon committed Apr 11, 2019
2 parents 66ada92 + 6d1bdb6 commit 9d63e6a
Showing 23 changed files with 1,832 additions and 250 deletions.
17 changes: 17 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,22 @@
### Changelog


#### 0.21.0

##### New features
- `weights` is now an optional kwarg for parametric univariate models.
- all univariate and multivariate parametric models can now handle left, right and interval censored data (the former two being special cases of the latter). Users can use the `fit_right_censoring` (which is an alias for `fit`), `fit_left_censoring` and `fit_interval_censoring` methods.
- a new interval censored dataset is available under `lifelines.datasets.load_diabetes`

##### API changes
- `left_censorship` on all univariate fitters has been deprecated. Please use the new
API `model.fit_left_censoring(...)`.
- `invert_y_axis` in `model.plot(...)` has been removed.

##### Bug fixes
- Fixed an error that prevented users from passing NumPy arrays to prediction methods of AFT models


#### 0.20.5 - 2019-04-08

##### New features
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -24,14 +24,14 @@ If you are interested in contributing to lifelines (and we thank you for the int
### Setting up a lifelines development environment

1. From the root directory of `lifelines` activate your [virtual environment](https://realpython.com/python-virtual-environments-a-primer/) (if you plan to use one).
2. Install the development requirements and [`pre-commit`](https://pre-commit.com) hooks. If you are on Mac, Linux, or [Windows `WSL`](https://docs.microsoft.com/en-us/windows/wsl/faq) you can use the provided [`Makefile`](https://github.com/CamDavidsonPilon/lifelines/blob/master/Makefile). Just type `make` into the console and you're ready to start developing.
2. Install the development requirements and [`pre-commit`](https://pre-commit.com) hooks. If you are on Mac, Linux, or [Windows `WSL`](https://docs.microsoft.com/en-us/windows/wsl/faq) you can use the provided [`Makefile`](https://github.com/CamDavidsonPilon/lifelines/blob/master/Makefile). Just type `make` into the console and you're ready to start developing. This will also install the dev-requirements.

### Formatting

`lifelines` uses the [`black`](https://github.com/ambv/black) Python formatter.
There are 3 different ways to format your code.
1. Use the [`Makefile`](https://github.com/CamDavidsonPilon/lifelines/blob/master/Makefile).
* `make format`
* `make lint`
2. Call `black` directly and pass the correct line length.
* `black . -l 120`
3. Have your code formatted automatically during commit with the `pre-commit` hook.
3 changes: 0 additions & 3 deletions Makefile
@@ -20,9 +20,6 @@ else
	prospector --output-format grouped
endif

format:
	black . --line-length 120

check_format:
ifeq ($(TRAVIS_PYTHON_VERSION), 3.6)
	black . --check --line-length 120
45 changes: 44 additions & 1 deletion docs/Survival Regression.rst
@@ -121,7 +121,7 @@ After fitting, the value of the maximum log-likelihood this available using ``cp
Goodness of fit
-----------------------

After fitting, you may want to know how "good" of a fit your model was to the data. Aside from traditional approaches, two methods the author has found useful is to 1. look at the concordance-index (see below section on :ref:`Model Selection in Survival Regression`), available as ``cph.score_`` or in the ``print_summary`` and 2. compare spread between the baseline survival function vs the Kaplan Meier survival function (Why? Interpret the spread as how much "variance" is provided by the baseline hazard versus the partial hazard. The baseline hazard is approximately equal to the Kaplan-Meier curve if none of the variance is explained by the covariates / partial hazard. Deviations from this provide a visual measure of variance explained). For example, the first figure below is a good fit, and the second figure is a much weaker fit.
After fitting, you may want to know how "good" of a fit your model was to the data. Aside from traditional approaches, a few methods the author has found useful are to 1. look at the concordance-index (see below section on :ref:`Model Selection in Survival Regression`), available as ``cph.score_`` or in the ``print_summary`` and 2. compare spread between the baseline survival function vs the Kaplan Meier survival function (Why? Interpret the spread as how much "variance" is provided by the baseline hazard versus the partial hazard. The baseline hazard is approximately equal to the Kaplan-Meier curve if none of the variance is explained by the covariates / partial hazard. Deviations from this provide a visual measure of variance explained). For example, the first figure below is a good fit, and the second figure is a much weaker fit.

.. image:: images/goodfit.png

@@ -685,6 +685,49 @@ Often, you don't know *a priori* which AFT model to use. Each model has some ass
print(wf._log_likelihood) # -679.60
Left, right and interval censored data
-----------------------------------------------

The AFT models have APIs that handle left and interval censored data, too. The API for them is different from the API for fitting to right censored data. Here's an example with interval censored data.

.. code:: python

    from lifelines import WeibullAFTFitter
    from lifelines.datasets import load_diabetes

    df = load_diabetes()
    df['gender'] = df['gender'] == 'male'
    print(df.head())
    """
       left  right  gender
    1    24     27    True
    2    22     22   False
    3    37     39    True
    4    20     20    True
    5     1     16    True
    """

    wf = WeibullAFTFitter().fit_interval_censoring(df, start_col='left', stop_col='right')
    wf.print_summary()
    """
    <lifelines.WeibullAFTFitter: fitted with 731 observations, 136 censored>
             event col = 'E'
    number of subjects = 731
      number of events = 595
        log-likelihood = -2027.20
      time fit was run = 2019-04-11 19:39:42 UTC
    ---
                        coef exp(coef)  se(coef)      z      p  -log2(p)  lower 0.95  upper 0.95
    lambda_ gender      0.05      1.05      0.03   1.66   0.10      3.38       -0.01        0.10
            _intercept  2.91     18.32      0.02 130.15 <0.005       inf        2.86        2.95
    rho_    _intercept  1.04      2.83      0.03  36.91 <0.005    988.46        0.98        1.09
    ---
    Log-likelihood ratio test = 2.74 on 1 df, -log2(p)=3.35
    """

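
One way to see why right and left censoring can be treated as special cases of interval censoring is to rewrite every observation as a pair of bounds. The sketch below is plain Python and is not part of lifelines; the helper name ``as_interval`` and the labels ``"none"``, ``"right"`` and ``"left"`` are hypothetical, chosen only for this illustration.

```python
import math


def as_interval(duration, censoring):
    """Return the (lower, upper) bounds implied by a single observation.

    `censoring` is a hypothetical label used only in this sketch:
    "none" (event observed exactly), "right" or "left".
    """
    if censoring == "none":
        # The event time is known exactly, so the interval collapses to a point.
        return (duration, duration)
    if censoring == "right":
        # We only know the event happened after `duration`.
        return (duration, math.inf)
    if censoring == "left":
        # We only know the event happened before `duration`.
        return (0.0, duration)
    raise ValueError("unknown censoring label: %r" % (censoring,))


observations = [(5.0, "none"), (8.0, "right"), (2.0, "left")]
intervals = [as_interval(t, c) for t, c in observations]
print(intervals)
```

Read this way, a row of the diabetes dataset whose ``left`` and ``right`` values are equal corresponds to the ``"none"`` case, and any other row carries a genuine interval.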
Aalen's additive model
=============================
31 changes: 25 additions & 6 deletions docs/Survival analysis with lifelines.rst
@@ -623,6 +623,8 @@ is unsure *when* the disease was contracted (birth), but knows it was before the
Another situation where we have left-censored data is when measurements have only an upper bound, that is, the measurement
instruments could only detect that the measurement was *less* than some upper bound. This bound is often called the limit of detection (LOD). In practice, there could be more than one LOD. One very important statistical lesson: don't "fill-in" this value naively. It's tempting to use something like one-half the LOD, but this will cause *lots* of bias in downstream analysis. An example dataset is below:

.. note:: The recommended API for modeling left-censored data using parametric models changed in version 0.21.0. Below is the recommended API.

.. code:: python

    from lifelines.datasets import load_nh4
@@ -638,15 +640,16 @@ instruments could only detect the measurement was *less* than some upper bound.
    5 <0.006 0.006 True
    """
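
The warning about substituting one-half the LOD can be checked with a small simulation. The sketch below uses only the standard library and synthetic data, not the NH4 dataset; the log-normal distribution, the LOD of 1.0 and the sample size are assumptions made purely for illustration. It compares the spread of the log-measurements before and after substitution.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

LOD = 1.0  # assumed limit of detection for this synthetic example

# Synthetic "true" measurements: log-normal, so their logs are standard normal.
true_logs = [rng.gauss(0.0, 1.0) for _ in range(20000)]
true_values = [math.exp(z) for z in true_logs]

# Naive fill-in: every value below the LOD is replaced by LOD / 2.
substituted = [v if v >= LOD else LOD / 2 for v in true_values]
substituted_logs = [math.log(v) for v in substituted]

true_sd = statistics.stdev(true_logs)
substituted_sd = statistics.stdev(substituted_logs)

print("sd of log-values, true data:       ", round(true_sd, 3))
print("sd of log-values, after half-LOD:  ", round(substituted_sd, 3))
```

In this setup every censored value collapses onto a single point, so the substituted data understates the variability of the measurements. Methods that model the censoring explicitly avoid this.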
*lifelines* has support for left-censored datasets in most univariate models, including the ``KaplanMeierFitter`` class, by adding the keyword ``left_censoring=True`` (default ``False``) to the call to ``fit``.
*lifelines* has support for left-censored datasets in most univariate models, including the ``KaplanMeierFitter`` class, by using the ``fit_left_censoring`` method.

.. code:: python

    T, E = df['NH4.mg.per.L'], ~df['Censored']

    kmf = KaplanMeierFitter()
    kmf.fit(T, E, left_censorship=True)
    kmf.fit_left_censoring(T, E)

Instead of producing a survival function, left-censored data analysis is more interested in the cumulative density function. This is available as the ``cumulative_density_`` property after fitting the data.

@@ -678,9 +681,9 @@ Alternatively, you can use a parametric model to model the data. This allows for
    fig, axes = plt.subplots(3, 2, figsize=(9, 9))
    timeline = np.linspace(0, 0.25, 100)
    wf = WeibullFitter().fit(T, E, left_censorship=True, label="Weibull", timeline=timeline)
    lnf = LogNormalFitter().fit(T, E, left_censorship=True, label="Log Normal", timeline=timeline)
    lgf = LogLogisticFitter().fit(T, E, left_censorship=True, label="Log Logistic", timeline=timeline)
    wf = WeibullFitter().fit_left_censoring(T, E, label="Weibull", timeline=timeline)
    lnf = LogNormalFitter().fit_left_censoring(T, E, label="Log Normal", timeline=timeline)
    lgf = LogLogisticFitter().fit_left_censoring(T, E, label="Log Logistic", timeline=timeline)
    # plot what we just fit, along with the KMF estimate
    kmf.plot_cumulative_density(ax=axes[0][0], ci_show=False)
@@ -700,7 +703,23 @@ Alternatively, you can use a parametric model to model the data. This allows for

Based on the above, the log-normal distribution seems to fit well, and the Weibull not very well at all.

.. note:: Other types of censoring, like interval-censoring, are not implemented in *lifelines* yet.

Interval censored data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Data can also be interval censored. An example of this is periodically recording the population of micro-organisms as they die off. Their deaths are interval censored because you only know that a subject died between two observation periods. New to lifelines in version 0.21.0, all parametric models have support for interval censored data.

.. note:: The API for ``fit_interval_censoring`` is different from the API for right and left censored data. Also,

.. code:: python

    from lifelines.datasets import load_diabetes

    df = load_diabetes()
    wf = WeibullFitter().fit_interval_censoring(start=df['left'], stop=df['right'])

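
To give a feel for what fitting to interval censored data involves, here is a minimal sketch of a Weibull log-likelihood over ``(lower, upper)`` bounds. It is plain Python written for illustration and is not lifelines' implementation; it assumes the parameterization ``S(t) = exp(-(t / lambda_) ** rho_)``, and the function names and parameter values are hypothetical. An interval contributes ``log(S(lower) - S(upper))``, and a row whose bounds coincide is treated as an exactly observed event that contributes the log-density.

```python
import math


def weibull_survival(t, lambda_, rho_):
    """S(t) = exp(-(t / lambda_) ** rho_), with S(inf) defined as 0."""
    if math.isinf(t):
        return 0.0
    return math.exp(-((t / lambda_) ** rho_))


def weibull_log_density(t, lambda_, rho_):
    """log f(t) for the same parameterization (requires t > 0)."""
    return (
        math.log(rho_ / lambda_)
        + (rho_ - 1.0) * math.log(t / lambda_)
        - (t / lambda_) ** rho_
    )


def interval_log_likelihood(intervals, lambda_, rho_):
    """Sum the log-likelihood contributions of (lower, upper) pairs."""
    total = 0.0
    for lower, upper in intervals:
        if lower == upper:
            # Bounds coincide: treat as an exactly observed event time.
            total += weibull_log_density(lower, lambda_, rho_)
        else:
            # The event happened somewhere inside (lower, upper].
            total += math.log(
                weibull_survival(lower, lambda_, rho_)
                - weibull_survival(upper, lambda_, rho_)
            )
    return total


# The first five rows of the diabetes dataset shown earlier, as (left, right).
rows = [(24.0, 27.0), (22.0, 22.0), (37.0, 39.0), (20.0, 20.0), (1.0, 16.0)]
print(interval_log_likelihood(rows, lambda_=20.0, rho_=2.0))
```

Setting ``upper`` to infinity recovers the usual right censored contribution ``log S(lower)``, and setting ``lower`` to zero recovers the left censored contribution ``log(1 - S(upper))``. A fitter would maximize this quantity over ``lambda_`` and ``rho_``.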
Left truncated (late entry) data
1 change: 1 addition & 0 deletions docs/index.rst
@@ -68,6 +68,7 @@ Contents:

Gitter channel <https://gitter.im/python-lifelines/Lobby>
Create a GitHub issue <https://github.com/camdavidsonpilon/lifelines/issues>
Development blog <https://dataorigami.net/blogs/napkin-folding/tagged/lifelines>

Installation
------------------------------
26 changes: 26 additions & 0 deletions lifelines/datasets/__init__.py
@@ -483,3 +483,29 @@ def load_lymphoma(**kwargs):
    From https://www.statsdirect.com/help/content/survival_analysis/logrank.htm
    """
    return _load_dataset("lymphoma.csv", **kwargs)


def load_diabetes(**kwargs):
    """
    An interval censored dataset.

    References
    ----------
    Borch-Johnsen, K, Andersen, P and Deckert, T (1985). "The effect of proteinuria on relative mortality in Type I (insulin-dependent) diabetes mellitus." Diabetologia, 28, 590-596.

    ::

        Size: (731, 3)
        Example:
            left  right  gender
              24     27    male
              22     22  female
              37     39    male
              20     20    male
               1     16    male
               8     20  female
              14     14    male
    """

    return _load_dataset("interval_diabetes.csv", index_col=0, **kwargs)
