Merge 6d1bdb6 into 66ada92

CamDavidsonPilon · Apr 11, 2019 · 9d63e6a
2 parents 66ada92 + 6d1bdb6
commit 9d63e6a

Showing 23 changed files with 1,832 additions and 250 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,22 @@
 ### Changelog
 
+
+#### 0.21.0
+
+##### New features
+ - `weights` is now an optional kwarg for parametric univariate models.
+ - all univariate and multivariate parametric models now have the ability to handle left, right and interval censored data (the former two being special cases of the latter). Users can use the `fit_right_censoring` (which is an alias for `fit`), `fit_left_censoring` and `fit_interval_censoring` methods.
+ - a new interval censored dataset is available under `lifelines.datasets.load_diabetes`.
+
+##### API changes
+ - `left_censorship` on all univariate fitters has been deprecated. Please use the new
+ API `model.fit_left_censoring(...)`.
+ - `invert_y_axis` in `model.plot(...)` has been removed.
+
+##### Bug fixes
+ - Fixed an error that didn't let users use NumPy arrays in prediction for AFT models.
+
+
 #### 0.20.5 - 2019-04-08
 
 ##### New features

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -24,14 +24,14 @@ If you are interested in contributing to lifelines (and we thank you for the int
 ### Setting up a lifelines development environment
 
 1. From the root directory of `lifelines` activate your [virtual environment](https://realpython.com/python-virtual-environments-a-primer/) (if you plan to use one).
-2. Install the development requirements and [`pre-commit`](https://pre-commit.com) hooks. If you are on Mac, Linux, or [Windows `WSL`](https://docs.microsoft.com/en-us/windows/wsl/faq) you can use the provided [`Makefile`](https://github.com/CamDavidsonPilon/lifelines/blob/master/Makefile). Just type `make` into the console and you're ready to start developing.
+2. Install the development requirements and [`pre-commit`](https://pre-commit.com) hooks. If you are on Mac, Linux, or [Windows `WSL`](https://docs.microsoft.com/en-us/windows/wsl/faq) you can use the provided [`Makefile`](https://github.com/CamDavidsonPilon/lifelines/blob/master/Makefile). Just type `make` into the console and you're ready to start developing. This will also install the dev-requirements.
 
 ### Formatting
 
 `lifelines` uses the [`black`](https://github.com/ambv/black) python formatter.
 There are 3 different ways to format your code.
 1. Use the [`Makefile`](https://github.com/CamDavidsonPilon/lifelines/blob/master/Makefile).
-   * `make format`
+   * `make lint`
 2. Call `black` directly and pass the correct line length.
    * `black . -l 120`
 3. Have your code formatted automatically during commit with the `pre-commit` hook.

diff --git a/Makefile b/Makefile
@@ -20,9 +20,6 @@ else
 		prospector --output-format grouped
 endif
 
-format:
-	black . --line-length 120
-
 check_format:
 ifeq ($(TRAVIS_PYTHON_VERSION), 3.6)
 		black . --check --line-length 120

diff --git a/docs/Survival Regression.rst b/docs/Survival Regression.rst
@@ -121,7 +121,7 @@ After fitting, the value of the maximum log-likelihood this available using ``cp
 Goodness of fit
 -----------------------
 
-After fitting, you may want to know how "good" of a fit your model was to the data. Aside from traditional approaches, two methods the author has found useful is to 1. look at the concordance-index (see below section on :ref:`Model Selection in Survival Regression`), available as ``cph.score_`` or in the ``print_summary`` and 2. compare spread between the baseline survival function vs the Kaplan Meier survival function (Why? Interpret the spread as how much "variance" is provided by the baseline hazard versus the partial hazard. The baseline hazard is approximately equal to the Kaplan-Meier curve if none of the variance is explained by the covariates / partial hazard. Deviations from this provide a visual measure of variance explained). For example, the first figure below is a good fit, and the second figure is a much weaker fit.
+After fitting, you may want to know how "good" of a fit your model was to the data. Aside from traditional approaches, a few methods the author has found useful are to 1. look at the concordance-index (see below section on :ref:`Model Selection in Survival Regression`), available as ``cph.score_`` or in the ``print_summary`` and 2. compare spread between the baseline survival function vs the Kaplan-Meier survival function (Why? Interpret the spread as how much "variance" is provided by the baseline hazard versus the partial hazard. The baseline hazard is approximately equal to the Kaplan-Meier curve if none of the variance is explained by the covariates / partial hazard. Deviations from this provide a visual measure of variance explained). For example, the first figure below is a good fit, and the second figure is a much weaker fit.
 
 .. image:: images/goodfit.png
 
@@ -685,6 +685,49 @@ Often, you don't know *a priori* which AFT model to use. Each model has some ass
     print(wf._log_likelihood)  # -679.60
 
 
+Left, right and interval censored data
+-----------------------------------------------
+
+The AFT models have APIs that handle left and interval censored data, too. The API for them differs from the API for fitting right censored data. Here's an example with interval censored data.
+
+.. code:: python
+
+    from lifelines.datasets import load_diabetes
+
+    df = load_diabetes()
+    df['gender'] = df['gender'] == 'male'
+
+    print(df.head())
+    """
+       left  right  gender
+    1    24     27    True
+    2    22     22   False
+    3    37     39    True
+    4    20     20    True
+    5     1     16    True
+    """
+
+    wf = WeibullAFTFitter().fit_interval_censoring(df, start_col='left', stop_col='right')
+    wf.print_summary()
+
+    """
+    <lifelines.WeibullAFTFitter: fitted with 731 observations, 136 censored>
+             event col = 'E'
+    number of subjects = 731
+      number of events = 595
+        log-likelihood = -2027.20
+      time fit was run = 2019-04-11 19:39:42 UTC
+
+    ---
+                        coef exp(coef)  se(coef)      z      p  -log2(p)  lower 0.95  upper 0.95
+    lambda_ gender      0.05      1.05      0.03   1.66   0.10      3.38       -0.01        0.10
+            _intercept  2.91     18.32      0.02 130.15 <0.005       inf        2.86        2.95
+    rho_    _intercept  1.04      2.83      0.03  36.91 <0.005    988.46        0.98        1.09
+    ---
+    Log-likelihood ratio test = 2.74 on 1 df, -log2(p)=3.35
+    """
+
+
 
 Aalen's additive model
 =============================

diff --git a/docs/Survival analysis with lifelines.rst b/docs/Survival analysis with lifelines.rst
@@ -623,6 +623,8 @@ is unsure *when* the disease was contracted (birth), but knows it was before the
 Another situation where we have left-censored data is when measurements have only an upper bound, that is, the measurements
 instruments could only detect the measurement was *less* than some upper bound. This bound is often called the limit of detection (LOD). In practice, there could be more than one LOD. One very important statistical lesson: don't "fill-in" this value naively. It's tempting to use something like one-half the LOD, but this will cause *lots* of bias in downstream analysis. An example dataset is below:
 
+.. note:: The recommended API for modeling left-censored data using parametric models changed in version 0.21.0. Below is the recommended API.
+
 .. code:: python
 
     from lifelines.datasets import load_nh4
@@ -638,15 +640,16 @@ instruments could only detect the measurement was *less* than some upper bound.
     5            <0.006         0.006      True
     """
 
-*lifelines* has support for left-censored datasets in most univariate models, including the ``KaplanMeierFitter`` class, by adding the keyword ``left_censoring=True`` (default ``False``) to the call to ``fit``.
+
+*lifelines* has support for left-censored datasets in most univariate models, including the ``KaplanMeierFitter`` class, by using the ``fit_left_censoring`` method.
 
 .. code:: python
 
 
     T, E = df['NH4.mg.per.L'], ~df['Censored']
 
     kmf = KaplanMeierFitter()
-    kmf.fit(T, E, left_censorship=True)
+    kmf.fit_left_censoring(T, E)
 
 Instead of producing a survival function, left-censored data analysis is more interested in the cumulative density function. This is available as the ``cumulative_density_`` property after fitting the data.
 
@@ -678,9 +681,9 @@ Alternatively, you can use a parametric model to model the data. This allows for
     fig, axes = plt.subplots(3, 2, figsize=(9, 9))
     timeline = np.linspace(0, 0.25, 100)
 
-    wf = WeibullFitter().fit(T, E, left_censorship=True, label="Weibull", timeline=timeline)
-    lnf = LogNormalFitter().fit(T, E, left_censorship=True, label="Log Normal", timeline=timeline)
-    lgf = LogLogisticFitter().fit(T, E, left_censorship=True, label="Log Logistic", timeline=timeline)
+    wf = WeibullFitter().fit_left_censoring(T, E, label="Weibull", timeline=timeline)
+    lnf = LogNormalFitter().fit_left_censoring(T, E, label="Log Normal", timeline=timeline)
+    lgf = LogLogisticFitter().fit_left_censoring(T, E, label="Log Logistic", timeline=timeline)
 
     # plot what we just fit, along with the KMF estimate
     kmf.plot_cumulative_density(ax=axes[0][0], ci_show=False)
@@ -700,7 +703,23 @@ Alternatively, you can use a parametric model to model the data. This allows for
 
 Based on the above, the log-normal distribution seems to fit well, and the Weibull not very well at all.
 
-.. note:: Other types of censoring, like interval-censoring, are not implemented in *lifelines* yet.
+
+Interval censored data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Data can also be interval censored. An example of this is periodically recording the population of micro-organisms as they die off. Their deaths are interval censored because you know a subject died between two observation periods. New to lifelines in version 0.21.0, all parametric models have support for interval censored data.
+
+.. note:: The API for ``fit_interval_censoring`` is different from the API for right and left censored data.
+
+.. code:: python
+
+
+    from lifelines.datasets import load_diabetes
+
+    df = load_diabetes()
+
+    wf = WeibullFitter().fit_interval_censoring(start=df['left'], stop=df['right'])
+
 
 
 Left truncated (late entry) data

diff --git a/docs/index.rst b/docs/index.rst
@@ -68,6 +68,7 @@ Contents:
 
   Gitter channel <https://gitter.im/python-lifelines/Lobby>
   Create a GitHub issue <https://github.com/camdavidsonpilon/lifelines/issues>
+  Development blog <https://dataorigami.net/blogs/napkin-folding/tagged/lifelines>
 
 Installation
 ------------------------------

diff --git a/lifelines/datasets/__init__.py b/lifelines/datasets/__init__.py
@@ -483,3 +483,29 @@ def load_lymphoma(**kwargs):
     From https://www.statsdirect.com/help/content/survival_analysis/logrank.htm
     """
     return _load_dataset("lymphoma.csv", **kwargs)
+
+
+def load_diabetes(**kwargs):
+    """
+    An interval censored dataset.
+
+    References
+    ----------
+    Borch-Johnsen, K, Andersen, P and Deckert, T (1985). "The effect of proteinuria on relative mortality in Type I (insulin-dependent) diabetes mellitus." Diabetologia, 28, 590-596.
+
+    ::
+
+        Size: (731, 3)
+        Example:
+
+           left  right  gender
+             24     27    male
+             22     22  female
+             37     39    male
+             20     20    male
+              1     16    male
+              8     20  female
+             14     14    male
+    """
+
+    return _load_dataset("interval_diabetes.csv", index_col=0, **kwargs)
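
Reviewer note: the changelog entry in this commit says left and right censoring are special cases of interval censoring. The sketch below illustrates that idea with a hand-written Weibull log-likelihood in plain Python. It is not the lifelines implementation; the function names are hypothetical, and the parametrization S(t) = exp(-(t / lambda_) ** rho_) is an assumption chosen to match the parameter names in the summary output above. A row with `left == right` is treated as an exactly observed event, `left == 0` as left-censored, and `right == inf` as right-censored.

```python
import math


def weibull_survival(t, lambda_, rho_):
    """Assumed parametrization: S(t) = exp(-(t / lambda_) ** rho_)."""
    if t <= 0:
        return 1.0  # nothing has failed before time 0
    if math.isinf(t):
        return 0.0  # everything has failed eventually
    return math.exp(-((t / lambda_) ** rho_))


def weibull_log_density(t, lambda_, rho_):
    """log f(t) for the same parametrization, used for exactly observed events."""
    z = t / lambda_
    return math.log(rho_ / lambda_) + (rho_ - 1.0) * math.log(z) - z ** rho_


def interval_log_likelihood(intervals, lambda_, rho_):
    """Sum the log-likelihood contributions of (left, right) observations.

    left == right      -> exact event:     log f(left)
    left == 0          -> left-censored:   log(1 - S(right))
    right == math.inf  -> right-censored:  log S(left)
    otherwise          -> interval:        log(S(left) - S(right))
    The last three cases are all the same formula, log(S(left) - S(right)).
    """
    total = 0.0
    for left, right in intervals:
        if left == right:
            total += weibull_log_density(left, lambda_, rho_)
        else:
            s_left = weibull_survival(left, lambda_, rho_)
            s_right = weibull_survival(right, lambda_, rho_)
            total += math.log(s_left - s_right)
    return total


if __name__ == "__main__":
    # First five rows shown in the load_diabetes() example in this diff.
    rows = [(24, 27), (22, 22), (37, 39), (20, 20), (1, 16)]
    # Illustrative parameter values only (not a fitted result for these rows).
    print(interval_log_likelihood(rows, lambda_=18.32, rho_=2.83))
```

Writing every observation as a `(left, right)` pair is what lets one likelihood cover all three censoring types, which is presumably why `fit_interval_censoring` takes two columns rather than a duration and an event flag.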