# The "middle-lane" study.

This study is based on the metaphor that some patients are happily cruising along with their healthcare (left laners), some patients have had their treatment escalated in terms of the frequency of clinical events or changes in treatment (right laners), and some patients have shown a small degree of escalation but have not committed to further escalation or to returning to cruising (middle-lane hoggers). In this metaphor, a movement from the middle lane to the left lane is interpreted as a successful escalation that returns the patient to (conditional and subjective) control of their condition; a movement from the middle lane to the right lane is interpreted as a decisive action to further escalate the patient's care because their condition has not been controlled by the previous escalation. It is assumed that middle-lane hogging indicates operational inefficiency, unimproved patient outcomes, and a patient whose care is in limbo.

The motivating question is whether coercing the middle-lane hoggers into the left or right lanes would improve operational efficiency and patient outcomes. (Note: the third pillar, patient experience, is not something we can measure with routinely collected data.) The details of _how_ a patient might justifiably be moved out of the middle lane are not the focus of this study. Instead, we are asking what would happen to outcomes of interest if all patients were like those who entered and then shortly exited the middle lane, i.e. a healthcare service in which patients aren't stuck in limbo.

## The purpose of this notebook.
This notebook describes a causal-inference study in which I estimate the "effect of treatment on the untreated". This statistical estimand quantifies the effect that treatment would have had on those who were not treated, by borrowing information from those who were treated rather than relying solely on information about the untreated. For this study, "treatment" refers to coercing a patient out of the middle lane into either the left or right lane.

## My estimand: The effect of treatment on the untreated.
The effect on an outcome (Y) of treatment on the untreated (ETU) for a binary intervention ($X \in \{x_0 = hogger, x_1 = switcher\}$) can be summarised as:

$$ETU_{x_0,x_1}(Y)=E[Y=y_{X=x_0}|X=x_0] - E[Y=y_{X=x_1}|X=x_0]$$

The ETU can be read as the difference between the observed average HbA1c of the middle-lane hoggers and the unobserved average HbA1c that those same hoggers would have had if they had been lane switchers.

The term $X=x_0$ means "_the treatment value is $x_0=hogger$_", and similarly for $X=x_1=switcher$. The term $Y=y_{X=x_0}$ means "_the HbA1c value when the treatment value is $x_0=hogger$_". These quantities can be summarised directly from our observations of middle-lane hoggers. On the other hand, the $(Y=y_{X=x_1}|X=x_0)$ term is our counterfactual, which cannot be summarised directly from our observations of either middle-lane hoggers or switchers. It can be read as "_the HbA1c value we $\underline{would}$ observe if the treatment were $X=x_1=switcher$, given that we know the treatment was $X=x_0=hogger$_". Quantifying this term seems impossible given that we can only observe one of these scenarios.

## The fundamental problem of causal inference.
Succinctly, the fundamental problem of causal inference is that we only observe what _actually happened_, so we can never be completely sure about what _would have happened_ if we had done something different. At the moment we intervene in, say, a binary decision, each option leads to only one of two potential outcomes. To quantify a causal effect, we would need to simultaneously observe both potential outcomes and calculate their difference, but this is impossible.
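The missing-data nature of the problem can be illustrated with a small simulation. This is only a sketch on made-up numbers: the variable names, the HbA1c values and the assumed effect of 3 units are all hypothetical, and both potential outcomes are visible here only because we simulate them.

```r
# Hypothetical toy data: both potential outcomes exist only because we simulate them.
set.seed(1)
n <- 6
y_hog    <- round(rnorm(n, mean = 60, sd = 5))   # HbA1c if the patient hogs
y_switch <- y_hog - 3                            # HbA1c if the patient switches (assumed effect)
switched <- rep(c(TRUE, FALSE), length.out = n)  # what actually happened

toy <- data.frame(
    switched = switched,
    # Only the potential outcome that matches the actual decision is observed.
    y_hog_observed    = ifelse(switched, NA, y_hog),
    y_switch_observed = ifelse(switched, y_switch, NA)
)
toy

# The individual effect needs both columns, so it is missing for every patient.
individual_effect <- toy$y_hog_observed - toy$y_switch_observed
all(is.na(individual_effect))
```

Every row has exactly one `NA`, which is the value that causal inference tries to impute.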

To get around this fundamental problem, the craft of causal inference is to impute the missing value of the potential outcome that we did not observe, as accurately and precisely as possible (i.e. with as little bias and residual error as possible).

## My proposed remedy to the fundamental problem.
In the case of my estimand (the effect of treatment on the untreated), I must estimate the counterfactual described by $P(Y=y_{X=x_1}|X=x_0)$ using only observed data. In other words, I need to impute the unobserved potential outcome in which those who hogged the middle lane had instead switched lanes.

Of all the methods that could be used to impute this missing quantity, I choose an approach similar to the X-learner described in [Kunzel et al. (2018)](https://www.pnas.org/doi/epdf/10.1073/pnas.1804597116). Specifically, I will impute the missing quantity by fitting a regression model using observations of patients who _switched out of_ the middle lane, and use this model to "predict" the outcome values for patients who _hogged_ the middle lane. The model based on switchers gives me a way to retrieve $P(Y=y_{X=x_1}|\cdot)$. Feeding the observations of hoggers into this model sets the context of the predictions to $P(\cdot|X=x_0)$. Combined, I get $P(Y=y_{X=x_1}|X=x_0)$.

It must be noted that this approach to imputing the missing potential outcome value assumes that hoggers and switchers are exchangeable if they share the same covariate values. Thus, this approach reduces to a covariate-matching scheme. However, the regression approach offers greater functionality. In particular, our dataset is made up of repeated observations of patients, and Generalised Estimating Equations can use the information from all repeated observations, rather than being limited to one observation per patient.
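The imputation idea can be sketched on simulated data where the truth is known by construction. This is an illustration only: an ordinary `lm()` stands in for the GEE, and the effect size, the coefficients and the helper `simulate_group()` are all assumptions invented for the example. The simulation builds in exchangeability given `previous_HbA1c`, which is exactly the assumption described above.

```r
# Simulated, hypothetical data in which the truth is known by construction.
set.seed(42)
n <- 500
true_effect <- 3  # assumed: hogging leaves HbA1c 3 units higher than switching

simulate_group <- function(n, hogger, mean_previous) {
    previous_HbA1c <- rnorm(n, mean = mean_previous, sd = 8)
    # The potential outcome under switching depends only on the covariate (exchangeability).
    y_switch <- 20 + 0.6 * previous_HbA1c + rnorm(n, sd = 1)
    data.frame(
        previous_HbA1c = previous_HbA1c,
        HbA1c = if (hogger) y_switch + true_effect else y_switch
    )
}

# Hoggers and switchers differ in their covariate distribution (selection).
switchers <- simulate_group(n, hogger = FALSE, mean_previous = 55)
hoggers   <- simulate_group(n, hogger = TRUE,  mean_previous = 65)

# Model of the "switcher world", fitted on switchers only.
switcher_model <- lm(HbA1c ~ previous_HbA1c, data = switchers)

# Bring the hoggers into the switcher world: impute their counterfactual outcome.
counterfactual <- predict(switcher_model, newdata = hoggers)

# ETU = observed mean for hoggers minus the imputed mean had they switched.
etu_hat <- mean(hoggers$HbA1c) - mean(counterfactual)
# A naive comparison of group means mixes the effect with selection.
naive <- mean(hoggers$HbA1c) - mean(switchers$HbA1c)
c(etu_hat = etu_hat, naive = naive)
```

Under these assumptions `etu_hat` should land close to the assumed effect, while `naive` is inflated by the difference in `previous_HbA1c` between the two groups. If exchangeability given the covariates did not hold, the imputed counterfactual would be biased as well.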

## The protocol.
The steps toward estimating the effect of treatment on the untreated are:
1. Collate a dataset of patients' repeated observations of HbA1c value and R.A.M.E. status. Name the variables `HbA1c` and `lane`, respectively.
2. Update the `lane` variable to be "other" for all values that are not "middle".
3. Create new variables, `previous_HbA1c` and `previous_lane`, by lagging the `HbA1c` and `lane` variables by one, within time-ordered observations of patients.
4. Using observations for which the `lane` covariate (i.e. our exposure) is `lane` = "other" (i.e. for switchers), fit a Generalised Estimating Equation (GEE) model, clustered on patient ID and with a first-order autoregressive (AR(1)) working correlation structure. Include covariates for `previous_HbA1c` and `previous_lane` to adjust for observable confounding.
    - This step creates a model of the switcher world, a.k.a. the world in which everyone switched out of the middle lane.
5. Using observations for which the `lane` covariate is `lane` = "middle" (i.e. for hoggers), retrieve the 'predictions' from the fitted GEE model, for each observation.
    - This step brings the hoggers into the switcher world and asks what their outcome would have been if they had switched from the middle lane.
6. Calculate the arithmetic mean of the patient-specific differences between the observed `HbA1c` and predicted `HbA1c` for hoggers. This is the estimated treatment effect on the untreated.
    - By comparing the observed story of the hoggers to the counterfactual story of the switchers, we're trying to isolate the effect of switching beyond what was inherent to switchers, i.e. selection bias.
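The steps above might be sketched end to end as follows. This is not the study's code: the data are simulated noise standing in for step 1, the `visit` column and all object names are hypothetical, and "patient-specific difference" is read here as a within-patient mean that is then averaged across patients, which is one possible interpretation.

```r
library(dplyr)
library(geepack)

# Step 1 (stand-in): hypothetical simulated data, 40 patients x 6 visits.
set.seed(2024)
d_sim <- data.frame(
    patient_id = rep(1:40, each = 6),
    visit = rep(1:6, times = 40),
    lane = sample(c("left", "middle", "right"), size = 240, replace = TRUE),
    HbA1c = round(rnorm(240, mean = 60, sd = 6))
)

d_prep <- d_sim %>%
    # Step 2: collapse every non-middle lane into "other".
    mutate(lane = if_else(lane == "middle", "middle", "other")) %>%
    # Step 3: lag within time-ordered observations of each patient.
    arrange(patient_id, visit) %>%
    group_by(patient_id) %>%
    mutate(
        previous_HbA1c = dplyr::lag(HbA1c),
        previous_lane = dplyr::lag(lane)
    ) %>%
    ungroup() %>%
    # The first visit of each patient has no previous values.
    filter(!is.na(previous_HbA1c)) %>%
    as.data.frame()

switchers <- d_prep[d_prep$lane == "other", ]
hoggers   <- d_prep[d_prep$lane == "middle", ]

# Step 4: model of the switcher world (rows are already sorted by patient).
switcher_model <- geeglm(
    HbA1c ~ previous_HbA1c + previous_lane,
    data = switchers,
    id = patient_id,
    family = gaussian("identity"),
    corstr = "ar1"
)

# Step 5: predictions for hoggers, as if they had switched.
hoggers$predicted_HbA1c <- as.vector(predict(switcher_model, newdata = hoggers))

# Step 6: mean within patient, then mean across patients.
etu_hat <- hoggers %>%
    group_by(patient_id) %>%
    summarise(difference = mean(HbA1c - predicted_HbA1c)) %>%
    summarise(etu = mean(difference)) %>%
    pull(etu)
etu_hat
```

Because the simulated outcome is unrelated to `lane`, the estimate should sit near zero here. The sketch returns a point estimate only and says nothing about its uncertainty.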

In [27]:
pacman::p_load(
    lme4
    ,geepack
    ,tidyverse
)

In [60]:
 c(
        4:6, 5:8
    )

In [62]:
d <- 
data.frame(
    patient_id = rep(1:10, each = 4)
    ,HbA1c = c(
        1:4, 2:5, 3:6, 4:7, 5:8, 6:9, 7:10, 8:11, 9:12, 10:13
    )
) %>%
dplyr::group_by( patient_id ) %>%
dplyr::mutate(
    previous_HbA1c = dplyr::lag( HbA1c )  # explicit namespace: stats::lag() would not shift the values
) %>%
dplyr::ungroup() %>%
tidyr::drop_na()



mod <-
    lme4::lmer(
        formula = HbA1c ~ ( 1 | patient_id )
        ,data = d
    )
d2 <- 
data.frame(
    patient_id = rep(11:12, each = 3)
    ,previous_HbA1c = c(
        4:6, 5:7
    )
)
# Patients 11 and 12 are not in the training data, so new levels must be allowed;
# their predictions then rely on the fixed effects only.
predict(mod, newdata = d2, allow.new.levels = TRUE)


In [74]:
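# GEE clustered on patient, with an AR(1) working correlation.
gee1 <- geepack::geeglm(
    formula = HbA1c ~ previous_HbA1c
    ,data = d
    ,id = patient_id
    ,family = gaussian("identity")
    ,corstr = "ar1"
)
# Observed versus fitted values for the training patients.
d$HbA1c
predict(gee1, newdata = d) %>% as.vector()
# Predictions for the new patients, given only their previous HbA1c.
d2$previous_HbA1c
predict(gee1, newdata = d2) %>% as.vector()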

In [51]:
d2

patient_id
<int>
11
11
11
11
12
12
12
12
13
13


In [2]:
if (!exists("spssData"))
{
  spssData <-
    haven::read_sav("ExtBinomPQL2T3RT7F_FractFinalA.sav") %>%
    mutate(across(everything(), as.vector))
}

# Assess over-/under-dispersion.

Assess whether the counts of crimes satisfy the Poisson assumption that the mean equals the variance. The ratio of the variance to the mean is computed for each count variable: a value > 1 indicates over-dispersion, and a value < 1 indicates under-dispersion.
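As a reference point for reading the ratio, here is a minimal sketch on simulated counts. These are not the study data: the helper `dispersion_ratio()` and the distribution parameters are assumptions chosen for illustration.

```r
# Hypothetical simulated counts; not the study data.
set.seed(7)
n <- 10000
poisson_counts <- rpois(n, lambda = 5)
# Negative binomial counts: variance = mu + mu^2 / size, so larger than the mean.
overdispersed_counts <- rnbinom(n, size = 1, mu = 5)

# Variance-to-mean ratio: about 1 under the Poisson assumption.
dispersion_ratio <- function(x) var(x) / mean(x)

ratios <- c(
    poisson = dispersion_ratio(poisson_counts),
    overdispersed = dispersion_ratio(overdispersed_counts)
)
round(ratios, 3)
```

The Poisson sample should give a ratio close to 1, while the negative binomial sample should give a ratio well above 1.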

In [18]:
spssData %>% 
dplyr::select( DarknessCrime_sum, DaylightCrime_sum, SumDarkAndDaylight ) %>%
# Columns 1-3 hold the means and columns 4-6 hold the variances.
dplyr::summarise_all(
    .funs = list( mean, var )
) %>%
dplyr::transmute(
    Darkness_PoissonRatio = .[[4]] / .[[1]]
    ,Daylight_PoissonRatio = .[[5]] / .[[2]]
    ,Total_PoissonRatio = .[[6]] / .[[3]]
) %>%
dplyr::mutate_if( is.numeric, round, 3 ) %>%
knitr::kable(format = 'rst')  # kableExtra styling only applies to HTML or LaTeX output, so it is omitted




Darkness_PoissonRatio  Daylight_PoissonRatio  Total_PoissonRatio
               13.372                 16.507              25.982