doc edits

CamDavidsonPilon · Jan 6, 2018 · commit cafa6f1
1 parent a52abc4

Showing 4 changed files with 54 additions and 58 deletions.
diff --git a/docs/Examples.rst b/docs/Examples.rst
@@ -19,7 +19,7 @@ Subtract the difference between survival curves
 
 If you are interested in taking the difference between two survival curves, simply trying to 
 subtract the ``survival_function_`` will likely fail if the DataFrame's indexes are not equal. Fortunately, 
-the ``KaplanMeierFitter`` and ``NelsonAalenFitter`` have a built in ``subtract`` method: 
+the ``KaplanMeierFitter`` and ``NelsonAalenFitter`` have a built-in ``subtract`` method: 
 
 .. code-block:: python
     
@@ -30,7 +30,7 @@ will produce the difference at every relevant time point. A similar function exi
 Compare using a hypothesis test
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-For rigorous testing of differences, *lifelines* comes with a statistics library. The ``logrank_test`` function
+For rigorous testing of differences, *lifelines* comes with a statistics library. The ``logrank_test`` function
 compares whether the "death" generation process of the two populations are equal:
 
 .. code-block:: python
@@ -67,9 +67,9 @@ hypothesis that all the populations have the same "death" generation process).
 Model selection using *lifelines*
 #####################################################
 
-If using *lifelines* for prediction work, it's ideal that you perform some sort of cross-validation scheme. This allows you to be confident that your out-of-sample predictions will work well in practice. It also allows you to choose between multiple models.
+If using *lifelines* for prediction work, it's ideal that you perform some type of cross-validation scheme. This cross-validation allows you to be confident that your out-of-sample predictions will work well in practice. It also allows you to choose between multiple models.
 
-*lifelines* has a built in k-fold cross-validation function. For example, consider the following example:
+*lifelines* has a built-in k-fold cross-validation function. For example:
 
 .. code-block:: python
     
@@ -300,9 +300,9 @@ Suppose your dataset has lifetimes grouped near time 60, thus after fitting
     74         0.00
 
 
-What you would really like is to have a predictable and full index from 40 to 75. (Notice that
-in the above index, the last two time points are not adjacent -- this is caused by observing no lifetimes
-existing for times 72 or 73) This is especially useful for comparing multiple survival functions at specific time points. To do this, all fitter methods accept a `timeline` argument: 
+What you would like is to have a predictable and full index from 40 to 75. (Notice that
+in the above index, the last two time points are not adjacent -- this is because no lifetimes
+were observed at times 72 or 73.) This is especially useful for comparing multiple survival functions at specific time points. To do this, all fitter methods accept a ``timeline`` argument: 
 
 .. code-block:: python
 
@@ -420,11 +420,10 @@ Suppose you wish to measure the hazard ratio between two populations under the C
 
 Problems with convergence in the Cox Proportional Hazard Model
 ################################################################
-
 Since the estimation of the coefficients in the Cox proportional hazard model is done using the Newton-Raphson algorithm, there is sometimes a problem with convergence. Here are some common symptoms and possible resolutions:
 
- - Some coefficients are many orders of magnitude larger than others, and the standard error of the coefficient is equally as large. This can be seen using the ``print_summary`` method on a fitted ``CoxPHFitter`` object. Look for a ``RuntimeWarning`` about variances being too small. The dataset may contain a constant column, which provides no information for the regression (Cox model doesn't have a traditional "intercept" term like other regression models). Or, the data is completely seperable, which means that there exists a covariate the completely determines whether an event occured or not. For example, for all "death" events in the dataset, there exists a covariate that is constant amongst all of them. Another problem may be a colinear relationship in your dataset - see the third point below. 
+ - Some coefficients are many orders of magnitude larger than others, and their standard errors are equally large. This can be seen using the ``print_summary`` method on a fitted ``CoxPHFitter`` object. Look for a ``RuntimeWarning`` about variances being too small. The dataset may contain a constant column, which provides no information for the regression (the Cox model doesn't have a traditional "intercept" term like other regression models). Or, the data is completely separable, which means that there exists a covariate that completely determines whether an event occurred or not. For example, for all "death" events in the dataset, there exists a covariate that is constant amongst all of them. Another problem may be a collinear relationship in your dataset - see the third point below. 
 
  - Adding a very small ``penalizer_coef`` significantly changes the results. This probably means that the step size is too large. Try decreasing it, and returning the ``penalizer_coef`` term to 0. 
 
- - ``LinAlgError: Singular matrix`` is thrown. This means that there is a linear combination in your dataset. That is, a column is equal to the linear combination of 1 or more other columns. Try to find the relationship by looking at the correlation matrix of your dataset. 
+ - ``LinAlgError: Singular matrix`` is thrown. This means that there is a linear combination in your dataset. That is, a column is equal to the linear combination of 1 or more other columns. Try to find the relationship by looking at the correlation matrix of your dataset. 
diff --git a/docs/Quickstart.rst b/docs/Quickstart.rst
@@ -147,7 +147,7 @@ While the above ``KaplanMeierFitter`` and ``NelsonAalenFitter`` are useful, they
     regression_dataset.head()
 
 
-The input of the ``fit`` method's API in a regression is different. All the data, including durations, censorships and covariates must be contained in **a Pandas DataFrame** (yes, it must be a DataFrame). The duration column and event occured column must be specified in the call to ``fit``. 
+The input of the ``fit`` method's API in a regression is different. All the data, including durations, censorships and covariates must be contained in **a Pandas DataFrame** (yes, it must be a DataFrame). The duration column and event occurred column must be specified in the call to ``fit``. 
 
 .. code:: python
     

diff --git a/docs/Survival Regression.rst b/docs/Survival Regression.rst
@@ -5,14 +5,14 @@
 Survival Regression
 =====================================
 
-Often we have additional data aside from the durations, and if
+Often we have additional data aside from the duration, and if
 applicable any censorships that occurred. In the regime dataset, we have
 the type of government the political leader was part of, the country
 they were head of, and the year they were elected. Can we use this data
 in survival analysis?
 
 Yes, the technique is called *survival regression* -- the name implies
-we regress covariates (eg: year elected, country, etc.) against a
+we regress covariates (e.g., year elected, country, etc.) against
 another variable -- in this case durations and lifetimes. Similar to the
 logic in the first part of this tutorial, we cannot use traditional
 methods like linear regression.
@@ -39,14 +39,13 @@ The estimator to fit unknown coefficients in Aalen's additive model is
 located in ``estimators`` under ``AalenAdditiveFitter``. For this
 exercise, we will use the regime dataset and include the categorical
 variables ``un_continent_name`` (eg: Asia, North America,...), the
-``regime`` type (eg: monarchy, civilan,...) and the year the regime
+``regime`` type (e.g., monarchy, civilian,...) and the year the regime
 started in, ``start_year``.
 
 Aalen's additive model typically does not estimate the individual
 :math:`b_i(t)` but instead estimates :math:`\int_0^t b_i(s) \; ds`
 (similar to the estimate of the hazard rate using ``NelsonAalenFitter``
-above). This is important to keep in mind when analzying the output.
-
+above). This is important to keep in mind when analyzing the output.
 .. code:: python
 
     from lifelines import AalenAdditiveFitter
@@ -190,7 +189,7 @@ Below we create our fitter class. Since we did not supply an intercept
 column in our matrix we have included the keyword ``fit_intercept=True``
 (``True`` by default) which will append the column of ones to our
 matrix. (Sidenote: the intercept term, :math:`b_0(t)` in survival
-regression is often referred to as the *baseline* hazard.)
+regression is often known as the *baseline* hazard.)
 
 We have also included the ``coef_penalizer`` option. During the estimation, a
 linear regression is computed at each step. Often the regression can be
@@ -204,7 +203,7 @@ or small sample sizes) -- adding a penalizer term controls the stability. I reco
 
 An instance of ``AalenAdditiveFitter``
 includes a ``fit`` method that performs the inference on the coefficients. This method accepts a pandas DataFrame: each row is an individual and columns are the covariates and 
-two special columns: a *duration* column and a boolean *event occured* column (where event occured refers to the event of interest - expulsion from government in this case)
+two special columns: a *duration* column and a boolean *event occurred* column (where event occurred refers to the event of interest - expulsion from government in this case).
 
 
 .. code:: python
@@ -344,9 +343,9 @@ containing the estimates of :math:`\int_0^t b_i(s) \; ds`:
 
 
 Regression is most interesting if we use it on data we have not yet
-seen, i.e. prediction! We can use what we have learned to predict
+seen, i.e., prediction! We can use what we have learned to predict
 individual hazard rates, survival functions, and median survival time.
-The dataset we are using is aviable up until 2008, so let's use this data to
+The dataset we are using is available up until 2008, so let's use this data to
 predict the (already partly seen) possible duration of Canadian
 Prime Minister Stephen Harper.
 
@@ -387,7 +386,7 @@ Prime Minister Stephen Harper.
 Cox's Proportional Hazard model
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Lifelines has an implementation of the Cox propotional hazards regression model (implemented in 
+Lifelines has an implementation of the Cox proportional hazards regression model (implemented in 
 R under ``coxph``). It has a similar API to Aalen's additive model. Like R, it has a ``print_summary``
 function that prints a tabular view of coefficients and related stats. 
 
@@ -421,7 +420,7 @@ This example data is from the paper `here <http://socserv.socsci.mcmaster.ca/jfo
     Concordance = 0.640
     """
 
-To access the coefficients and the baseline hazard, you can use ``cph.hazards_`` and ``cph.baseline_hazard_`` respectively. The likelihood is available too using ``cph._log_likelihood`` After fitting, you can use use the suite of prediction methods (similar to Aalen's additve model above): ``.predict_partial_hazard``, ``.predict_survival_function``, etc.
+To access the coefficients and the baseline hazard, you can use ``cph.hazards_`` and ``cph.baseline_hazard_`` respectively. The likelihood is available too using ``cph._log_likelihood``. After fitting, you can use the suite of prediction methods (similar to Aalen's additive model above): ``.predict_partial_hazard``, ``.predict_survival_function``, etc.
 
 .. code:: python
     
@@ -451,7 +450,7 @@ With a fitted model, an altervative way to view the coefficients and their range
 Checking the proportional hazards assumption
 #############################################
 
-A quick and visual way to check the proportional hazards assumption of a variable is to plot the survival curves segmented by the values of the variable. If the survival curves are the same "shape", and differ only by constant factor, then the assumption holds. A more clear way to see this is to plot what's called the loglogs curve: the log(-log(survival curve)) vs log(time). If the curves are parallel (and hence do not cross each other), then it's likely the variable satisfies the assumption. If the curves do cross, likely you'll have to "stratify" the variable (see next section). In lifelines, the ``KaplanMeierFitter`` object has a ``.plot_loglogs`` function for this purpose. 
+A quick and visual way to check the proportional hazards assumption of a variable is to plot the survival curves segmented by the values of the variable. If the survival curves are the same "shape" and differ only by a constant factor, then the assumption holds. A clearer way to see this is to plot what's called the loglogs curve: log(-log(survival curve)) vs log(time). If the curves are parallel (and hence do not cross each other), then it's likely the variable satisfies the assumption. If the curves do cross, you'll likely have to "stratify" the variable (see next section). In lifelines, the ``KaplanMeierFitter`` object has a ``.plot_loglogs`` function for this purpose. 
 
 The following is the loglogs curves of two variables in our regime dataset. The first is the democracy type, which does have (close to) parallel lines, hence satisfies our assumption:
 
@@ -490,7 +489,7 @@ The second variable is the regime type, and this variable does not follow the pr
 Stratification
 ################
 
-Sometimes a covariate may not obey the proportional hazard assumption. In this case, we can allow a factor to be adjusted for without estimating its effect. To specify categorical variables to be used in stratification, we specify them in the call to ``fit``:
+Sometimes a covariate may not obey the proportional hazard assumption. In this case, we can adjust for a factor without estimating its effect. To specify categorical variables to be used in stratification, we define them in the call to ``fit``:
 
 .. code:: python
 
@@ -533,7 +532,7 @@ Cross Validation
 ######################################
 
 Lifelines has an implementation of k-fold cross validation under `lifelines.utils.k_fold_cross_validation`. This function accepts an instance of a regression fitter (either ``CoxPHFitter`` of ``AalenAdditiveFitter``), a dataset, plus `k` (the number of folds to perform, default 5). On each fold, it splits the data 
-into a training set and a testing set, fits itself on the training set, and evaluates itself on the testing set (using the concordance measure). 
+into a training set and a testing set, fits the model on the training set, and evaluates it on the testing set (using the concordance measure). 
 
 .. code:: python