vignettes/psm-paradox.Rmd

---
title: "Examining the Propensity Score Matching Paradox"
author: "Noah Greifer"
date: "`r Sys.Date()`"
output:
  html_vignette:
    toc: no
vignette: |
  %\VignetteIndexEntry{Examining the Propensity Score Matching Paradox} 
  %\VignetteEngine{knitr::rmarkdown} 
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
link-citations: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE,
                      fig.width=7, fig.height=5)
options(width = 200)
```

```{=html}
<style>
pre {
  overflow-x: auto;
}
pre code {
  word-wrap: normal;
  white-space: pre;
}
</style>
```
## Introduction

The propensity score matching (PSM) paradox is feature of PSM highlighted in @king2019 "Why Propensity Scores Should Not Be Used for Matching" that serves as one of the reasons the authors warn against using PSM to balance covariates and estimate causal effects. The paradox is described in @king2019 (hereafter KN19) as follows:

> ...[A]fter PSM's goal of complete randomization has been approximated [...], pruning the observations with the worst matched observations, according to the absolute propensity score distance in treated and control pairs, will increase imbalance, model dependence, and bias; this will also be true when pruning the pair with the next largest distance, and so on.

The reason this is so pernicious is that the point of PSM is to decrease imbalance, and applying a caliper to the propensity score to ensure matches are close can have the opposite effect of making balance worse. That is, by restricting the matches so that only the closest pairs remain, balance can be worse than were the far apart pairs to remain. This is a feature other matching methods do not share with PSM (at least not to the same degree), which is why KN19 argue PSM is suboptimal and other methods, like Mahalanobis distance matching or coarsened exact matching, should be used instead.

## The PSM Paradox and `MatchingFrontier`

KN19's analysis of the PSM paradox in real and simulated datasets involved performing a propensity score match, pruning the most distant pairs one-by-one (i.e., imposing a stricter and stricter caliper), and measuring balance at each point. In this way, studying the PSM paradox is similar to creating a matching frontier, which also involves dropping units one-by-one starting with the most distant units. Given this similarity, as of version 4.1.0, `MatchingFrontier` includes tools to analyze the PSM paradox in supplied data using some of the same technology used to create and examine a matching frontier. We demonstrate these methods below.

The procedure to study the PSM paradox proceeds in the following steps:

1.  Perform PSM using `MatchIt::matchit()`

2.  Create a balance-sample size "frontier" using `makeFrontier()`

3.  Examine model dependence using `estimateEffects()`

We will use the `finkel2012` dataset in `MatchingFrontier` to study the PSM. See `help("finkel2012", package = "MatchingFrontier")` for more information. This dataset comes from @finkelCivicEducationDemocratic2012 and was used in KN19 to study the PSM paradox. The important points are that the treatment is `treat`, the outcome is `polknow`, and the other variables are covariates. No specific estimand or quantity of interest was specified as the target estimand in the original study; because the analysis presented below and used in @finkelCivicEducationDemocratic2012 involves discarding treated units, the feasible SATT (FSATT) becomes the target of inference.

```{r}
library(MatchingFrontier)
data("finkel2012")
head(finkel2012)
```

First, we'll use `MatchIt::matchit()` to perform propensity score matching. This analysis can only be done using `MatchingFrontier` after 1:1 matching with or without replacement. This matching does not have to involve a propensity score, but because we are investigating the PSM paradox, we perform 1:1 PSM (here, without replacement, as is most common).

```{r}
treat_formula <- treat ~ uraiamedia + age + churchgo +
  groupactive + income +  groupleader + male + educ +
  religion + media + polinterest + poldiscuss +
  civicgroup + polparty

# 1:1 PS matching w/o replacement using a logistic regression
# propensity score
m.out <- MatchIt::matchit(treat_formula, data = finkel2012,
                          method = "nearest")
m.out
```

We can examine balance on the original matched sample using `summary()`:

```{r}
summary(m.out)
```

Of note is that the sample prior to matching was well balanced already, meaning there is little advantage to discarding units by matching, and making this dataset ripe for the PSM paradox. By examining what happens when additional units are dropped, we can see the paradox in action. To do so, we call `makeFrontier()`, supplying a balance metric to use to track balance as additional units are dropped (in this case, the energy distance [@rizzo2016])^[To be precise, this is not really a frontier in that it does not represent the best balance possible for the given sample size; rather, it represents the balance corresponding to a sample with the closest pairs for a given sample size, conditional on having already formed the pairs. `makeFrontier()` is used here because it relies on the same technology used for making a true frontier, i.e., dropping units in a specific order and computing a global measure of balance for each sample of remaining units.].

```{r}
# Make a "frontier" by dropping the most distant pairs
f.out <- makeFrontier(m.out, metric = "energy", verbose = FALSE)
f.out
```

We can now plot the relationship between the number of units dropped and the resulting balance using `plot()`. (We zoom in on a part of the plot using `ggplot2::coord_cartesian()`.)

```{r}
plot(f.out) + ggplot2::coord_cartesian(xlim = c(0, 1250),
                                       ylim = c(0, .15))
```

As more units are dropped, leaving only units that are the most closely matched on the propensity score, balance worsens, whereas one would naively expect balance to improve because the remaining units have closer values of the propensity score; this is the PSM paradox.

We can also examine model dependence as additional units are dropped. The PSM paradox suggests that model dependence should increase as additional units are dropped, again, in contrast to the expectation that with only the most similar units remaining, the treatment effect estimate would depend less on the specific model used. We can use `estimateEffects()` to estimate treatment effects for each sample. We'll request the "extreme bounds" procedure to examine model dependence; this produces bounds representing the largest and smallest effect estimates across many different randomly selected model specifications that include some covariates and powers of them and omit others.

```{r}
set.seed(97531)
e.out <- estimateEffects(f.out, outcome = "polknow",
                         Ndrop = c(0, 1250),
                         method = "extreme-bounds",
                         verbose = FALSE)
e.out
```

Rather than plot the estimates themselves, we can plot the width of the model dependence bands as a measure of model dependence. The code below does this using `ggplot2`.

```{r}
library(ggplot2)

# Compute the difference between the model dependence bounds
widths <- sapply(e.out$mod.dependence, diff)

# Plot the differences against the number of units dropped
ggplot() + geom_line(aes(x = e.out$Xs, y = widths)) + 
  labs(x = "Number of treated units dropped",
       y = "Model dependence") +
        theme_bw()
```

As additional pairs of units are dropped, model dependence increases, in opposition to the purpose of matching, which is to decrease model dependence [@hoMatchingNonparametricPreprocessing2007]. This is a consequence of the PSM paradox and a reason KN19 urge researchers to be cautious when using PSM.

## Implications of the PSM Paradox

While the presence of the PSM paradox doesn't necessarily mean one should automatically abandon PSM as a method of adjusting for confounding, it does suggest that care needs to be taken when using PSM to ensure imbalance and model dependence are not worse after matching than they are before matching. One also should be cautious about using a caliper, especially without first examining balance in a match without one. Although using calipers can improve the balancing and bias-reduction abilities of PSM [@austinComparison12Algorithms2014], blindly using a caliper without assessing balance can engage the PSM paradox, making results less robust.

The PSM paradox also suggests that other methods that rely less on the propensity score might be tried instead. Mahalanobis distance matching, coarsened exact matching, cardinality matching, and genetic matching, which operate on the covariates directly without a propensity score, may perform well in some datasets and are less susceptible to the paradox (though they are not immune as described in the appendix of KN19); these methods are all available in `MatchIt`. An alternative is to manage the balance/sample size directly using the matching frontier in such a way as to optimize balance at each sample size, as described in @king2017 and the `MatchingFrontier` vignette (`vignette("MatchingFrontier")`), using functions in this package.

## References