Commit

doc edits
CamDavidsonPilon committed Jan 6, 2018
1 parent a52abc4 commit cafa6f1
Showing 4 changed files with 54 additions and 58 deletions.
19 changes: 9 additions & 10 deletions docs/Examples.rst
@@ -19,7 +19,7 @@ Subtract the difference between survival curves

If you are interested in taking the difference between two survival curves, simply trying to
subtract the ``survival_function_`` will likely fail if the DataFrame's indexes are not equal. Fortunately,
the ``KaplanMeierFitter`` and ``NelsonAalenFitter`` have a built in ``subtract`` method:
the ``KaplanMeierFitter`` and ``NelsonAalenFitter`` have a built-in ``subtract`` method:

.. code-block:: python
@@ -30,7 +30,7 @@ will produce the difference at every relevant time point. A similar function exi
Compare using a hypothesis test
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For rigorous testing of differences, *lifelines* comes with a statistics library. The ``logrank_test`` function
For rigorous testing of differences, *lifelines* come with a statistics library. The ``logrank_test`` function
compares whether the "death" generation process of the two populations are equal:

.. code-block:: python
@@ -67,9 +67,9 @@ hypothesis that all the populations have the same "death" generation process).
Model selection using *lifelines*
#####################################################

If using *lifelines* for prediction work, it's ideal that you perform some sort of cross-validation scheme. This allows you to be confident that your out-of-sample predictions will work well in practice. It also allows you to choose between multiple models.
If using *lifelines* for prediction work, it's ideal that you perform some type of cross-validation scheme. This cross-validation allows you to be confident that your out-of-sample predictions will work well in practice. It also allows you to choose between multiple models.

*lifelines* has a built in k-fold cross-validation function. For example, consider the following example:
*lifelines* has a built-in k-fold cross-validation function. For example, consider the following example:

.. code-block:: python
@@ -300,9 +300,9 @@ Suppose your dataset has lifetimes grouped near time 60, thus after fitting
74 0.00
What you would really like is to have a predictable and full index from 40 to 75. (Notice that
in the above index, the last two time points are not adjacent -- this is caused by observing no lifetimes
existing for times 72 or 73) This is especially useful for comparing multiple survival functions at specific time points. To do this, all fitter methods accept a `timeline` argument:
What you would like is to have a predictable and full index from 40 to 75. (Notice that
in the above index, the last two time points are not adjacent -- the cause is observing no lifetimes
existing for times 72 or 73). This is especially useful for comparing multiple survival functions at specific time points. To do this, all fitter methods accept a `timeline` argument:

.. code-block:: python
@@ -420,11 +420,10 @@ Suppose you wish to measure the hazard ratio between two populations under the C
Problems with convergence in the Cox Proportional Hazard Model
################################################################

Since the estimation of the coefficients in the Cox proportional hazard model is done using the Newton-Raphson algorithm, there is sometimes a problem with convergence. Here are some common symptoms and possible resolutions:

- Some coefficients are many orders of magnitude larger than others, and the standard error of the coefficient is equally as large. This can be seen using the ``print_summary`` method on a fitted ``CoxPHFitter`` object. Look for a ``RuntimeWarning`` about variances being too small. The dataset may contain a constant column, which provides no information for the regression (Cox model doesn't have a traditional "intercept" term like other regression models). Or, the data is completely seperable, which means that there exists a covariate the completely determines whether an event occured or not. For example, for all "death" events in the dataset, there exists a covariate that is constant amongst all of them. Another problem may be a colinear relationship in your dataset - see the third point below.
- Some coefficients are many orders of magnitude larger than others, and the standard error of the coefficient is equally as large. This can be seen using the ``print_summary`` method on a fitted ``CoxPHFitter`` object. Look for a ``RuntimeWarning`` about variances being too small. The dataset may contain a constant column, which provides no information for the regression (Cox model doesn't have a traditional "intercept" term like other regression models). Or, the data is completely separable, which means that there exists a covariate the completely determines whether an event occurred or not. For example, for all "death" events in the dataset, there exists a covariate that is constant amongst all of them. Another problem may be a colinear relationship in your dataset - see the third point below.

- Adding a very small ``penalizer_coef`` significantly changes the results. This probably means that the step size is too large. Try decreasing it, and returning the ``penalizer_coef`` term to 0.

- ``LinAlgError: Singular matrix`` is thrown. This means that there is a linear combination in your dataset. That is, a column is equal to the linear combination of 1 or more other columns. Try to find the relationship by looking at the correlation matrix of your dataset.
- ``LinAlgError: Singular matrix`` is thrown. This means that there is a linear combination in your dataset. That is, a column is equal to the linear combination of 1 or more other columns. Try to find the relationship by looking at the correlation matrix of your dataset.
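
The correlation-matrix check suggested in the last bullet can be sketched as follows. This is a minimal, standard-library illustration rather than anything from *lifelines*: the helper names `pearson` and `collinear_pairs`, the `0.999` threshold, and the toy covariates are all assumptions made for the example. With a pandas DataFrame, `DataFrame.corr()` produces the same pairwise correlations.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Constant column: correlation is undefined, so report 0 here.
        # (A constant column is a separate problem and should be dropped.)
        return 0.0
    return cov / sqrt(var_x * var_y)


def collinear_pairs(columns, threshold=0.999):
    """Return pairs of column names whose |correlation| exceeds threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) > threshold
    ]


# Hypothetical covariates: "age_months" is exactly 12 * "age_years".
covariates = {
    "age_years": [30, 45, 52, 61, 38],
    "age_months": [360, 540, 624, 732, 456],
    "weight": [70, 82, 64, 90, 77],
}

suspects = collinear_pairs(covariates)
print(suspects)
```

Pairwise correlation only exposes relationships between two columns. A column that is a combination of three or more others can still make the matrix singular without any single pair standing out.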
2 changes: 1 addition & 1 deletion docs/Quickstart.rst
@@ -147,7 +147,7 @@ While the above ``KaplanMeierFitter`` and ``NelsonAalenFitter`` are useful, they
regression_dataset.head()
The input of the ``fit`` method's API in a regression is different. All the data, including durations, censorships and covariates must be contained in **a Pandas DataFrame** (yes, it must be a DataFrame). The duration column and event occured column must be specified in the call to ``fit``.
The input of the ``fit`` method's API in a regression is different. All the data, including durations, censorships and covariates must be contained in **a Pandas DataFrame** (yes, it must be a DataFrame). The duration column and event occurred column must be specified in the call to ``fit``.

.. code:: python
27 changes: 13 additions & 14 deletions docs/Survival Regression.rst
@@ -5,14 +5,14 @@
Survival Regression
=====================================

Often we have additional data aside from the durations, and if
Often we have additional data aside from the duration, and if
applicable any censorships that occurred. In the regime dataset, we have
the type of government the political leader was part of, the country
they were head of, and the year they were elected. Can we use this data
in survival analysis?

Yes, the technique is called *survival regression* -- the name implies
we regress covariates (eg: year elected, country, etc.) against a
we regress covariates (e.g., year elected, country, etc.) against a
another variable -- in this case durations and lifetimes. Similar to the
logic in the first part of this tutorial, we cannot use traditional
methods like linear regression.
@@ -39,14 +39,13 @@ The estimator to fit unknown coefficients in Aalen's additive model is
located in ``estimators`` under ``AalenAdditiveFitter``. For this
exercise, we will use the regime dataset and include the categorical
variables ``un_continent_name`` (eg: Asia, North America,...), the
``regime`` type (eg: monarchy, civilan,...) and the year the regime
``regime`` type (e.g., monarchy, civilian,...) and the year the regime
started in, ``start_year``.

Aalen's additive model typically does not estimate the individual
:math:`b_i(t)` but instead estimates :math:`\int_0^t b_i(s) \; ds`
(similar to the estimate of the hazard rate using ``NelsonAalenFitter``
above). This is important to keep in mind when analzying the output.

above). This is important to keep in mind when analyzing the output.
.. code:: python
from lifelines import AalenAdditiveFitter
@@ -190,7 +189,7 @@ Below we create our fitter class. Since we did not supply an intercept
column in our matrix we have included the keyword ``fit_intercept=True``
(``True`` by default) which will append the column of ones to our
matrix. (Sidenote: the intercept term, :math:`b_0(t)` in survival
regression is often referred to as the *baseline* hazard.)
regression is often known as the *baseline* hazard.)

We have also included the ``coef_penalizer`` option. During the estimation, a
linear regression is computed at each step. Often the regression can be
@@ -204,7 +203,7 @@ or small sample sizes) -- adding a penalizer term controls the stability. I reco
An instance of ``AalenAdditiveFitter``
includes a ``fit`` method that performs the inference on the coefficients. This method accepts a pandas DataFrame: each row is an individual and columns are the covariates and
two special columns: a *duration* column and a boolean *event occured* column (where event occured refers to the event of interest - expulsion from government in this case)
two individual columns: a *duration* column and a boolean *event occurred* column (where event occurred refers to the event of interest - expulsion from government in this case)


.. code:: python
@@ -344,9 +343,9 @@ containing the estimates of :math:`\int_0^t b_i(s) \; ds`:


Regression is most interesting if we use it on data we have not yet
seen, i.e. prediction! We can use what we have learned to predict
seen, i.e., prediction! We can use what we have learned to predict
individual hazard rates, survival functions, and median survival time.
The dataset we are using is aviable up until 2008, so let's use this data to
The dataset we are using is available up until 2008, so let's use this data to
predict the (already partly seen) possible duration of Canadian
Prime Minister Stephen Harper.

@@ -387,7 +386,7 @@ Prime Minister Stephen Harper.
Cox's Proportional Hazard model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Lifelines has an implementation of the Cox propotional hazards regression model (implemented in
Lifelines has an implementation of the Cox proportional hazards regression model (implemented in
R under ``coxph``). It has a similar API to Aalen's additive model. Like R, it has a ``print_summary``
function that prints a tabular view of coefficients and related stats.

@@ -421,7 +420,7 @@ This example data is from the paper `here <http://socserv.socsci.mcmaster.ca/jfo
Concordance = 0.640
"""
To access the coefficients and the baseline hazard, you can use ``cph.hazards_`` and ``cph.baseline_hazard_`` respectively. The likelihood is available too using ``cph._log_likelihood`` After fitting, you can use use the suite of prediction methods (similar to Aalen's additve model above): ``.predict_partial_hazard``, ``.predict_survival_function``, etc.
To access the coefficients and the baseline hazard, you can use ``cph.hazards_`` and ``cph.baseline_hazard_`` respectively. The likelihood is available too using ``cph._log_likelihood`` After fitting, you can use use the suite of prediction methods (similar to Aalen's additive model above): ``.predict_partial_hazard``, ``.predict_survival_function``, etc.

.. code:: python
@@ -451,7 +450,7 @@ With a fitted model, an altervative way to view the coefficients and their range
Checking the proportional hazards assumption
#############################################

A quick and visual way to check the proportional hazards assumption of a variable is to plot the survival curves segmented by the values of the variable. If the survival curves are the same "shape", and differ only by constant factor, then the assumption holds. A more clear way to see this is to plot what's called the loglogs curve: the log(-log(survival curve)) vs log(time). If the curves are parallel (and hence do not cross each other), then it's likely the variable satisfies the assumption. If the curves do cross, likely you'll have to "stratify" the variable (see next section). In lifelines, the ``KaplanMeierFitter`` object has a ``.plot_loglogs`` function for this purpose.
A quick and visual way to check the proportional hazards assumption of a variable is to plot the survival curves segmented by the values of the variable. If the survival curves are the same "shape" and differ only by a constant factor, then the assumption holds. A more clear way to see this is to plot what's called the logs curve: the loglogs (-log(survival curve)) vs log(time). If the curves are parallel (and hence do not cross each other), then it's likely the variable satisfies the assumption. If the curves do cross, likely you'll have to "stratify" the variable (see next section). In lifelines, the ``KaplanMeierFitter`` object has a ``.plot_loglogs`` function for this purpose.

The following is the loglogs curves of two variables in our regime dataset. The first is the democracy type, which does have (close to) parallel lines, hence satisfies our assumption:

@@ -490,7 +489,7 @@ The second variable is the regime type, and this variable does not follow the pr
Stratification
################

Sometimes a covariate may not obey the proportional hazard assumption. In this case, we can allow a factor to be adjusted for without estimating its effect. To specify categorical variables to be used in stratification, we specify them in the call to ``fit``:
Sometimes a covariate may not obey the proportional hazard assumption. In this case, we can allow a factor without estimating its effect to be adjusted. To specify categorical variables to be used in stratification, we define them in the call to ``fit``:

.. code:: python
@@ -533,7 +532,7 @@ Cross Validation
######################################

Lifelines has an implementation of k-fold cross validation under `lifelines.utils.k_fold_cross_validation`. This function accepts an instance of a regression fitter (either ``CoxPHFitter`` of ``AalenAdditiveFitter``), a dataset, plus `k` (the number of folds to perform, default 5). On each fold, it splits the data
into a training set and a testing set, fits itself on the training set, and evaluates itself on the testing set (using the concordance measure).
into a training set and a testing set fits itself on the training set and evaluates itself on the testing set (using the concordance measure).

.. code:: python

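The k-fold procedure described in the final hunk (split the data, fit on the training part, score the held-out part with the concordance measure) can be sketched in plain Python. This is an illustration under stated assumptions, not the code behind `lifelines.utils.k_fold_cross_validation`: the helper names `k_fold_indices` and `concordance`, the toy data, and the choice to treat predictions as predicted durations are all hypothetical.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Shuffle row indices and yield (train, test) index lists, one per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def concordance(durations, predictions, events):
    """Share of comparable pairs whose predicted order matches the observed one.

    A pair is comparable when the shorter duration ended in an observed event.
    Tied predictions count as half-concordant.
    """
    score, comparable = 0.0, 0
    for i in range(len(durations)):
        for j in range(len(durations)):
            if durations[i] < durations[j] and events[i]:
                comparable += 1
                if predictions[i] < predictions[j]:
                    score += 1.0
                elif predictions[i] == predictions[j]:
                    score += 0.5
    if comparable == 0:
        return float("nan")  # no comparable pairs, so the score is undefined
    return score / comparable


# Toy data: the second subject is censored (event flag 0).
durations = [2, 4, 6, 8]
events = [1, 0, 1, 1]
predicted = [1, 5, 9, 3]  # hypothetical predicted durations

score = concordance(durations, predicted, events)
print(score)

# Each fold holds out a disjoint slice; a real run would fit on `train`
# and compute the concordance on `test` inside this loop.
splits = list(k_fold_indices(10, k=5))
for train, test in splits:
    print(len(train), len(test))
```

A score of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so averaging the per-fold scores gives a simple way to compare candidate models.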
