# Modelling the effect of increased streetlighting on crime counts.

The purpose of this report is to model the association between the weekly count of lamps changed as part of Leeds City Council's relighting scheme and the count of crimes in darkness, in daylight, and in total.

The difference between this and `Lighting.ipynb` is that I use Generalised Estimating Equations (GEEs) in this notebook, while I used Generalised Linear Mixed Models (GLMMs) in the previous notebook.

## Why Generalised Estimating Equations?
GEEs are an estimating procedure rather than a model. They are a way of estimating a population-averaged quantity from a sample of observations, with the versatility to account for dependence between observations during the estimation.

For binary variates, GEEs estimate the population-averaged association, from the sample, assuming every unit of observation experienced a one-unit change in the exposure. In contrast, GLMMs estimate the conditional mean, with the random effects and all other covariates held fixed at some value. Thus, the interpretation of the coefficients is different. If we want to make statements about a population (based on observations of its component units), then the GEE estimate is what we are after. If, on the other hand, we want to make statements about individuals (based on information about the population, some of which we have gathered from our sample and some of which we have added with our parametric assumptions), then the GLMM estimate is what we want. For this evaluation of Leeds City's relighting scheme, I argue that we want to make statements about the population of crimes, so the estimate from a GEE is more appropriate.
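
The gap between the two kinds of coefficient can be seen in a small simulation. This is only a sketch on made-up data: the variable names, cluster sizes, and parameter values are all assumptions chosen for illustration. It relies on the fact that a pooled logistic regression gives the same point estimate as a GEE with an independence working correlation, so base R is enough.

```r
# Simulate clustered binary outcomes from a random-intercept logistic model.
set.seed(1)
n_clusters <- 2000
n_per_cluster <- 10
beta_conditional <- 1   # cluster-specific (GLMM-type) log odds ratio

cluster_id <- rep(seq_len(n_clusters), each = n_per_cluster)
random_intercept <- rnorm(n_clusters, sd = 2)[cluster_id]
exposure <- rbinom(n_clusters * n_per_cluster, 1, 0.5)
outcome <- rbinom(
    length(exposure), 1,
    plogis(-0.5 + beta_conditional * exposure + random_intercept)
)

# Pooled logistic regression: its coefficient is the population-averaged
# log odds ratio (the GEE point estimate under working independence).
fit_marginal <- glm(outcome ~ exposure, family = binomial)
beta_marginal <- unname(coef(fit_marginal)["exposure"])

c(conditional = beta_conditional, marginal = beta_marginal)
```

The population-averaged coefficient is pulled towards zero relative to the cluster-specific one. Neither is wrong; they answer different questions.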

While GLMMs sink or swim based on assumptions about the distributions of the random effects, GEEs sink or swim based on assumptions about the covariance structure. This means that GEEs summarise the sample data only, while GLMMs summarise the sample data conditional on the random effects being correctly specified. The ramification of these setups is that a misspecified GLMM will give you an incorrect estimate and possibly an incorrect standard error, whereas a GEE with a misspecified covariance structure will only cause problems for the standard error; the estimate of the mean is robust. This shortcoming of GEEs can be mitigated by using a sandwich estimator, especially when we have many "clusters" and the same number of observations in each cluster (which we do in our Leeds data).
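
To make the sandwich idea concrete, the sketch below computes a cluster-robust standard error by hand for a Poisson regression on simulated data. It is an illustration under assumptions (simulated clusters, an independence working correlation, hypothetical variable names) and is not the code that `geepack` runs internally.

```r
# Simulate clustered counts with a shared random effect per cluster.
set.seed(42)
n_clusters <- 100
n_per_cluster <- 20
cluster_id <- rep(seq_len(n_clusters), each = n_per_cluster)
sim_x <- rnorm(n_clusters * n_per_cluster)
cluster_effect <- rnorm(n_clusters, sd = 0.5)[cluster_id]
sim_y <- rpois(length(sim_x), exp(0.2 + 0.1 * sim_x + cluster_effect))

# Working-independence fit: the point estimates ignore the clustering.
fit_indep <- glm(sim_y ~ sim_x, family = poisson)

# Sandwich = bread %*% meat %*% bread.
design <- model.matrix(fit_indep)
obs_scores <- design * (sim_y - fitted(fit_indep))  # score contribution per row
cluster_scores <- rowsum(obs_scores, cluster_id)     # summed within each cluster
bread <- vcov(fit_indep)                             # model-based covariance
meat <- crossprod(cluster_scores)                    # between-cluster variability
vcov_sandwich <- bread %*% meat %*% bread

se_naive <- sqrt(diag(bread))
se_robust <- sqrt(diag(vcov_sandwich))
rbind(naive = se_naive, robust = se_robust)
```

The model-based standard error for the intercept is too small here because it treats all observations as independent; the sandwich version uses the observed between-cluster variability instead.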


## Why not Generalised Estimating Equations?
GEEs are not familiar to those schooled in likelihood-based inference. GEEs don't have model-fit statistics because they are not fitted using maximum likelihood or related methods (after all, they are an estimation procedure rather than a model). This means no deviance scores, no likelihood ratios, and no information-criteria statistics, which leaves many statisticians adrift. GEEs are non-parametric summaries of the observed sample so they are on the description side of things rather than the modelling side of things. Asking for a model-fit statistic for a GEE makes as much sense as asking for a model-fit statistic for an odds ratio calculated from a 2-by-2 contingency table!

Furthermore, I think it is this relative lack of modelling in comparison to likelihood-based methods that makes people uncomfortable. It shouldn't. A statistician's job is to summarise the data, not to build models, but I fear that most statisticians have forgotten that in favour of building ever-more-interesting models. GEEs are like complicated summary statistics rather than models and, confusingly, look fiendishly like likelihood-based models at a glance. They look like a duck but they don't quack like a duck, which is why run-of-the-mill statisticians are wary of them, in my opinion.



Two informative sources on GEEs are in these hyperlinks:
- ["To GEE or not to GEE: Comparing population average and mixed models for estimating the associations between neighborhood risk factors and health"](https://sci-hub.wf/10.1097/EDE.0b013e3181caeb90)
- ["Generalised Estimating Equations (GEE)"](https://rlbarter.github.io/Practical-Statistics/2017/05/10/generalized-estimating-equations-gee/)

## Load required packages and data

In [1]:
pacman::p_load(
    geepack
    ,haven
    ,lme4
    ,tidyverse
)

In [2]:
if (!exists("spssData"))
{
  spssData <-
    haven::read_sav("ExtBinomPQL2T3RT7F_FractFinalA.sav") %>%
    mutate(across(everything(), as.vector))
}

## Over-/under-dispersion.

In contrast to the parameterised GLMMs in `Lighting.ipynb`, one doesn't expect any dispersion statistic to change when we specify the covariance matrix of a GEE. The deviance-based dispersion statistic doesn't even exist for GEEs because they are not likelihood based, so the concept of deviance doesn't apply.
Regarding the Pearson dispersion statistic, recall that it is a generalisation of the variance:mean quotient. In `Lighting.ipynb`, the Pearson statistic changed between models because we were changing the conditional mean, i.e. the denominator in the quotient. GEEs, on the other hand, do not produce conditional means, so one does not expect the Pearson statistic to change when the dependence between observations is handled by specifying the covariance structure.
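
As a reminder of how the Pearson dispersion statistic is calculated, here is a minimal sketch on simulated overdispersed counts. The data and variable names are made up for illustration and have nothing to do with the Leeds data.

```r
# Simulate counts that are overdispersed relative to a Poisson distribution.
set.seed(2024)
disp_x <- rnorm(500)
disp_y <- rnbinom(500, mu = exp(0.5 + 0.3 * disp_x), size = 1)

fit_poisson <- glm(disp_y ~ disp_x, family = poisson)

# Pearson dispersion: sum of squared Pearson residuals over residual df.
# Values well above 1 suggest overdispersion; values below 1, underdispersion.
pearson_dispersion <-
    sum(residuals(fit_poisson, type = "pearson")^2) / df.residual(fit_poisson)
pearson_dispersion
```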

## The GEE approach to modelling dependence

I specify a covariance structure that blocks for MSOA and incorporates a first-order autoregression. And, just like our final GLMM in `Lighting.ipynb`, I remove MSOA #5 and #111 because they are outliers.

In [3]:
mod_GEE_Darkness <-
    geepack::geeglm(
        DarknessCrime_sum ~ 1 + N_LampsChanged
        ,id = MSOAN112
        ,corstr = "ar1"
        ,family = "poisson"
        ,data = spssData %>% dplyr::filter( !MSOAN112 %in% c( 5, 111 ) )
    )
mod_GEE_Daylight <-
    geepack::geeglm(
        DaylightCrime_sum ~ 1 + N_LampsChanged
        ,id = MSOAN112
        ,corstr = "ar1"
        ,family = "poisson"
        ,data = spssData %>% dplyr::filter( !MSOAN112 %in% c( 5, 111 ) )
    )
mod_GEE_Total <-
    geepack::geeglm(
        SumDarkAndDaylight ~ 1 + N_LampsChanged
        ,id = MSOAN112
        ,corstr = "ar1"
        ,family = "poisson"
        ,data = spssData %>% dplyr::filter( !MSOAN112 %in% c( 5, 111 ) )
    )
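
Under the `"ar1"` working correlation, two observations from the same MSOA that are k time points apart are assumed to have correlation rho^k, and observations from different MSOAs are treated as independent. The sketch below builds that matrix for a single cluster. The function name `ar1_correlation`, the four time points, and `rho = 0.5` are illustrative assumptions; in the fitted models the correlation parameter is estimated from the data.

```r
# Build the AR(1) working correlation matrix for one cluster.
ar1_correlation <- function(n_weeks, rho) {
    # Absolute lag between every pair of time points.
    lags <- abs(outer(seq_len(n_weeks), seq_len(n_weeks), "-"))
    rho^lags
}

working_corr <- ar1_correlation(n_weeks = 4, rho = 0.5)
working_corr
```

Because the structure depends on the lag between rows, it only makes sense if the rows within each MSOA are ordered by week.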

Below I present the exponentiated estimate for the `N_LampsChanged` covariate (rounded to 3 decimal places).

In [15]:
# Pull the N_LampsChanged row from each model, label it with its outcome,
# and exponentiate. Note that exp() is applied to both columns, so the
# Std.err column below is the exponentiated standard error.
df_final_results <-
    rbind(
        summary(mod_GEE_Darkness)$coefficients['N_LampsChanged', c( 'Estimate', 'Std.err') ] %>% `rownames<-`( "Darkness" ) %>% exp()

        ,summary(mod_GEE_Daylight)$coefficients['N_LampsChanged', c( 'Estimate', 'Std.err') ] %>% `rownames<-`( "Daylight" ) %>% exp()

        ,summary(mod_GEE_Total)$coefficients['N_LampsChanged', c( 'Estimate', 'Std.err') ] %>% `rownames<-`( "Total" ) %>% exp()
    ) %>%
    # Wald-type bounds built from the exponentiated estimate and the
    # exponentiated standard error.
    dplyr::mutate(
        CI_LB = Estimate - `Std.err` * qnorm(0.975)
        ,CI_UB = Estimate + `Std.err` * qnorm(0.975)
        ) %>%
    round(3)

df_final_results %>% formattable::formattable()

| | Estimate | Std.err | CI_LB | CI_UB |
|---|---|---|---|---|
| Darkness | 1.0 | 1.0 | -0.961 | 2.96 |
| Daylight | 1.0 | 1.0 | -0.961 | 2.96 |
| Total | 1.066 | 1.004 | -0.902 | 3.034 |

The estimates from the GEE match those from the GLMM in `Lighting.ipynb` almost exactly (at least when rounded to three decimal places) __but it is crucial to keep in mind that the meaning of the coefficients differs__. Because the models use a Poisson family with its default log link, the exponentiated GEE coefficient is a population-averaged rate ratio (not an odds ratio) relating the cumulative crime count to the number of new lamps operating that week. In other words, an estimated value of 1 indicates that, on average in this sample of observations, the expected crime count is the same whether or not an additional lamp is installed. We also note that the intervals in the table are wide and all include 1, so the analysis is inconclusive: the data are compatible with additional lamps being associated with more crimes, fewer crimes, or no change.
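
For a log-link model, a common alternative is to build the Wald interval on the log scale and then exponentiate the bounds, which keeps both bounds positive. The sketch below shows the arithmetic only. The helper `rate_ratio_ci` and its arguments are hypothetical names, and the inputs are made-up numbers rather than values taken from the fitted models.

```r
# Wald interval for a rate ratio, computed on the log scale.
# log_estimate and log_std_err are the coefficient and its standard error
# as reported on the log (link) scale, before any exponentiation.
rate_ratio_ci <- function(log_estimate, log_std_err, level = 0.95) {
    z <- qnorm(1 - (1 - level) / 2)
    c(
        RateRatio = exp(log_estimate)
        ,CI_LB = exp(log_estimate - z * log_std_err)
        ,CI_UB = exp(log_estimate + z * log_std_err)
    )
}

# Illustrative inputs only.
example_ci <- rate_ratio_ci(log_estimate = 0, log_std_err = 0.1)
round(example_ci, 3)
```

An interval built this way is not symmetric around the rate ratio, but it is symmetric around the coefficient on the log scale.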

## Conclusion from quasi-likelihood, GLMM and GEE analyses.
Whether we adjust our standard errors from a semi-parametric GLMM, use a fully-parametric GLMM, or use a GEE, we always conclude that there is insufficient evidence to reject the null hypothesis that additional lamps were not associated with additional or fewer crimes in the Leeds City relighting scheme.