# Censoring and Survival

In [1]:
import warnings

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import theano.tensor as tt

from scipy import stats
from scipy.special import expit as logistic
from scipy.special import softmax

%config InlineBackend.figure_format = 'retina'
warnings.simplefilter(action="ignore", category=FutureWarning)
RANDOM_SEED = 8927
np.random.seed(286)

In [2]:
az.style.use("arviz-darkgrid")
az.rcParams["stats.credible_interval"] = 0.89


def standardize(series):
    """Standardize a pandas series"""
    return (series - series.mean()) / series.std()

Sometimes the right way to model discrete, countable events is to model not the counts themselves but rather the time between events. Suppose for example we are interested in the rate at which cats are adopted from an animal shelter. The cat can only be adopted once, at least until it is given up for adoption again. How long it waits for adoption gives us information about the rate of adoptions. And a model can tell you how the rate varies by breed or colour. Maybe you don't care about cat adoptions, but you probably do care about rates of disease onset and recovery, time to death, age at first reproduction, and many other variables that have a similar structure.

Models for dealing with these data are called SURVIVAL MODELS. Survival models are models for countable things, but the outcomes we want to predict are durations. Durations are continuous deviations from some point of reference, so they are all positive real values.

Distances are similarly positive real values, and both kinds of measurement are DISPLACEMENTS and can be modelled in very similar ways. The simplest distribution for displacements is the EXPONENTIAL DISTRIBUTION, which is the maximum entropy distribution when all we know about the values is their average displacement. So if our goal is just to estimate the average rate of events, it's the most conservative choice. The GAMMA DISTRIBUTION is also commonly used. It is maximum entropy for fixed mean value and fixed mean magnitude. There are lots of models, but we'll start with the exponential to keep things easy.
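As a small illustration of the link between an exponential rate and the average waiting time, the sketch below uses SciPy's exponential distribution. The rate of 0.02 adoptions per day is a made-up value chosen only for the example, not an estimate from the data.

```python
import numpy as np
from scipy import stats

rate = 0.02  # hypothetical adoption rate, in events per day

# SciPy parameterizes the exponential by scale = 1 / rate
waits = stats.expon(scale=1 / rate)

rng = np.random.default_rng(0)
sample = waits.rvs(size=100_000, random_state=rng)

mean_wait = waits.mean()     # exact mean waiting time, equal to 1 / rate
sample_mean = sample.mean()  # simulated mean, close to 1 / rate
```

A higher rate means a shorter average wait, which is why estimating the rate and estimating the mean duration are the same problem for this distribution.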

One reason to keep the outcome distribution simple is that the tricky bit with survival analysis is not the probability distribution we assign to durations. Instead the tricky bit is dealing with CENSORING. Censoring occurs when the event of interest does not occur in the window of observation. This can happen most simply because observation ends before the event occurred. For example, there are cats still waiting at the animal shelter. Many of them will get adopted. Another way censoring occurs is when some other event happens that makes the event of interest impossible. For example, a cat could die of old age while waiting to be adopted.

We can't just toss out the censored individuals. Imagine a cohort of 100 cats who start waiting for adoption at the same time. After one month, half of them have been adopted. Now what's the rate of adoption? You can't compute it using only the cats that have been adopted. You need to also account for the cats that have not yet been adopted. The cats who haven't been adopted yet, but eventually will be adopted, clearly have longer waiting times than cats who have already been adopted. So the average rate among those who are already adopted is biased upwards: it is confounded by conditioning on adoption.
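The bias described above can be checked with a quick simulation. This is only a sketch under assumed values: the true rate is set so that about half the cats are adopted within a 30-day window, and the cohort is made much larger than 100 so the numbers are stable.

```python
import numpy as np

rng = np.random.default_rng(1)

window = 30.0                   # observation window in days
true_rate = np.log(2) / window  # assumed rate: half adopted within the window
n_cats = 10_000                 # larger than 100 to reduce simulation noise

# True waiting times, most of which we never fully observe
true_wait = rng.exponential(scale=1 / true_rate, size=n_cats)
adopted = true_wait <= window

# Naive estimate: use only the cats already adopted
naive_rate = 1 / true_wait[adopted].mean()

# The cats still waiting have longer true waits than the adopted ones
mean_wait_adopted = true_wait[adopted].mean()
mean_wait_censored = true_wait[~adopted].mean()
```

In this setup `naive_rate` comes out several times larger than `true_rate`, because the adopted cats are exactly the ones with short waits.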

It isn't hard to include censored observations, but it does require a new type of model that we haven't yet seen in this book. The key idea is that the same distribution assumption for the outcomes tells us both the probability of any observed duration that ended in the event and the probability of waiting the observed duration without seeing the event. This is admittedly kind of an odd creature. It might help to start out with a generative model (a simulation) and try to build intuition for the statistical model.

## Cat Example.

The probability of waiting some number of days without being adopted comes from the cumulative probability distribution. A CDF gives the proportion of cats adopted before or at a certain number of days. So one minus the CDF gives the probability a cat is not adopted by the same number of days. That is the probability we need. This distribution is called the COMPLEMENTARY CUMULATIVE PROBABILITY DISTRIBUTION. In the case of the exponential distribution, the cumulative is:

$$\Pr(D \mid \lambda) = 1 - \exp(-\lambda D)$$

So the complement is just

$$\Pr(D \mid \lambda) = \exp(-\lambda D)$$

So that's what we need in our model, since it is the probability of waiting D days without being adopted yet.
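To make the two likelihood contributions concrete, here is a minimal sketch of a censored exponential log-likelihood. The function name `censored_exp_loglik` and the five observations are hypothetical, invented for illustration. Adopted cats contribute the exponential log density, and censored cats contribute the log of the complementary cumulative probability.

```python
import numpy as np
from scipy import stats


def censored_exp_loglik(lam, days, adopted):
    """Log-likelihood of rate `lam` for exponential waits with right-censoring."""
    days = np.asarray(days, dtype=float)
    adopted = np.asarray(adopted, dtype=bool)
    scale = 1.0 / lam
    # Adopted cats contribute the log density: log(lam) - lam * D
    log_density = stats.expon.logpdf(days, scale=scale)
    # Censored cats contribute the log CCDF: -lam * D
    log_ccdf = stats.expon.logsf(days, scale=scale)
    return float(np.sum(np.where(adopted, log_density, log_ccdf)))


# Hypothetical observations: days waited and whether the wait ended in adoption
days = np.array([4, 9, 25, 41, 1])
adopted = np.array([1, 1, 0, 0, 1])

# Setting the derivative of the log-likelihood to zero gives
# (number of adoptions) / (total days waited by all cats)
lam_hat = adopted.sum() / days.sum()
```

With these made-up numbers the estimate is 3 adoptions over 80 cat-days. Dropping the two censored cats would instead give 3 adoptions over 14 cat-days, a much higher rate, which is the bias the censored terms correct for.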

The model:


In [6]:
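# Load the Austin animal shelter data (semicolon-delimited file)
d = pd.read_csv("AustinCats.csv", sep=";")

# adopt = 1 when the wait ended in adoption, 0 when the observation is censored
d["adopt"] = (d["out_event"] == "Adoption").astype(int)
d.head(5)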

| | id | days_to_event | date_out | out_event | date_in | in_event | breed | color | intake_age | adopt |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A730601 | 1 | 07/08/2016 09:00:00 AM | Transfer | 07/07/2016 12:11:00 PM | Stray | Domestic Shorthair Mix | Blue Tabby | 7 | 0 |
| 1 | A679549 | 25 | 06/16/2014 01:54:00 PM | Transfer | 05/22/2014 03:43:00 PM | Stray | Domestic Shorthair Mix | Black/White | 1 | 0 |
| 2 | A683656 | 4 | 07/17/2014 04:57:00 PM | Adoption | 07/13/2014 01:20:00 PM | Stray | Snowshoe Mix | Lynx Point | 2 | 1 |
| 3 | A709749 | 41 | 09/22/2015 12:49:00 PM | Transfer | 08/12/2015 06:29:00 PM | Stray | Domestic Shorthair Mix | Calico | 12 | 0 |
| 4 | A733551 | 9 | 09/01/2016 12:00:00 AM | Transfer | 08/23/2016 02:35:00 PM | Stray | Domestic Shorthair Mix | Brown Tabby/White | 1 | 0 |


In [8]:
d.dtypes

id               object
days_to_event     int64
date_out         object
out_event        object
date_in          object
in_event         object
breed            object
color            object
intake_age        int64
adopt             int64
dtype: object