In [1]:
pip install plotly pandas numpy

Note: you may need to restart the kernel to use updated packages.


# Balanced Risk Set Matching (Li et al., 2001)

This notebook follows a study of a treatment, cystoscopy and hydrodistention, given in response to the symptoms of the chronic, nonlethal disease interstitial cystitis. The idea of the paper is to match each treated patient with a patient who has a similar history of symptoms but has not yet received the treatment. Here $t$ denotes the time at which a patient received the treatment.

## Data

The paper uses the Interstitial Cystitis Data Base (ICDB), but we will use synthetic data with a similar structure. The synthetic data has not been calibrated to reproduce the ICDB results accurately.

In [2]:
from defs import patients_entry

patients_entry.groupby("id").first().head(10)

id,gender,pain,urgency,nocturnal frequency,treatment time
0,M,3,3,4,
1,F,8,7,2,24.0
2,F,0,4,3,
3,F,4,2,3,36.0
4,M,8,5,2,6.0
5,F,6,8,2,45.0
6,M,2,8,4,3.0
7,F,7,4,3,24.0
8,M,3,5,1,36.0
9,F,5,9,3,39.0


After entering the study, patients are evaluated approximately every 3 months for up to 4 years. Three quantities are measured repeatedly over time:

- Pain
- Urgency
- Nocturnal Frequency

Pain and urgency are subjective appraisals on a scale from 0 to 9.

In [3]:
from defs import patients_evaluations

patients_evaluations.groupby("id")[["pain", "urgency", "nocturnal frequency"]].mean().head(10)

id,pain,urgency,nocturnal frequency
0,0.9375,0.625,1.3125
1,4.4375,3.9375,0.5
2,2.4375,1.125,0.75
3,4.9375,0.3125,0.8125
4,2.8125,0.8125,0.1875
5,3.125,5.75,0.5
6,0.4375,3.6875,0.5
7,6.25,2.0625,2.0625
8,0.8125,3.1875,0.4375
9,2.8125,5.5,1.625


If patient $m$ received the treatment at time $T_m$, then the response of this patient is compared with that of a patient who had not received the treatment up to time $T_m$. The patients still untreated at time $T_m$ form the risk set for patient $m$.
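The risk set for a treated patient can be sketched as follows. This is a minimal illustration, not the contents of `defs`: the `treatment_time` dictionary and the `risk_set` helper are hypothetical names, and the toy values mirror the first five rows of the entry table above, with `NaN` standing for a patient who was never treated.

```python
import math

# Hypothetical toy data: patient id -> treatment time in months (NaN = never treated).
treatment_time = {0: math.nan, 1: 24.0, 2: math.nan, 3: 36.0, 4: 6.0}


def risk_set(treated_id, treatment_time):
    """Return the ids of patients still untreated at the treated patient's time T_m."""
    t_m = treatment_time[treated_id]
    return sorted(
        pid
        for pid, t in treatment_time.items()
        # Assumption: a patient treated at exactly T_m is not a valid control.
        if pid != treated_id and (math.isnan(t) or t > t_m)
    )


print(risk_set(1, treatment_time))  # candidates for the patient treated at month 24
```

Patient 4, treated at month 6, is excluded from the risk set of patient 1 because that treatment happened before month 24.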

In [4]:
from defs import patients_risk_set

patients_risk_set.head(10)

Unnamed: 0,24,36,6,45,3,39,27,9,0,30,21,18,15,12,33,42
0,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...
1,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...,id gender pain urgency nocturnal freq...


## Matching by Minimum Cost Flow in a Network

Let $\mathcal{A} = \{ \alpha_1, \dots, \alpha_M \}$ be a set whose elements are called units, let $\mathcal{T} \subseteq \mathcal{A}$ be the subset of treated units, and let $\mathcal{E} \subseteq \mathcal{T} \times \mathcal{A}$ be a set of edges. If the pair $e = ( \alpha_p, \alpha_q )$ is an edge, $e \in \mathcal{E}$, then it is permitted to match $\alpha_p$ to $\alpha_q$; if $e \not\in \mathcal{E}$, then this match is forbidden.

In the paper, $\mathcal{A}$ consists of 400 patients randomly sampled from the IC database.

For each $e \in \mathcal{E}$, there is a distance $\delta_e > 0$. For an edge $e = (\alpha_p, \alpha_q)$, the distance $\delta_e$ is the Mahalanobis distance between treated unit $\alpha_p$ and control $\alpha_q$ on a six-dimensional covariate describing the three symptoms at baseline and at time $T_p$, when $\alpha_p$ received treatment.
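A minimal sketch of the distance computation, assuming NumPy is available. The function name `mahalanobis`, the random covariate matrix `X`, and the choice to estimate the covariance from all rows of `X` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def mahalanobis(x, y, cov):
    """Mahalanobis distance sqrt((x - y)^T cov^{-1} (x - y))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Solve cov @ z = diff rather than forming the inverse explicitly.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))


rng = np.random.default_rng(0)
# Hypothetical covariates: one row per patient, six columns holding
# pain, urgency, and nocturnal frequency at baseline and at time T_p.
X = rng.integers(0, 10, size=(50, 6)).astype(float)
cov = np.cov(X, rowvar=False)  # 6 x 6 sample covariance matrix

d = mahalanobis(X[0], X[1], cov)
print(d)
```

With the identity matrix as covariance, the function reduces to the ordinary Euclidean distance, which is a convenient sanity check.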

A pair matching $M$ of size $S$ is a subset $M \subseteq \mathcal{E}$ of $|M| = S$ edges such that each unit $\alpha_q \in \mathcal{A}$ appears in at most one matched pair, possibly as $(\alpha_p, \alpha_q) \in M$ or as $(\alpha_q, \alpha_p) \in M$ but not as both. An optimal pair matching of size $S$ minimizes the total distance $\sum_{e \in M} \delta_e$ over all pair matchings $M$ of size $S$ obtainable with the given structure $\mathcal{A}, \mathcal{T}, \mathcal{E}$.
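To make the definition concrete, the sketch below finds an optimal pair matching on a toy problem by exhaustive search. This is not the minimum cost flow algorithm named in the section title. It only illustrates the objective and the constraint, and it is practical only for a handful of edges. The helper names and the distances in `delta` are hypothetical.

```python
from itertools import combinations


def is_pair_matching(edges):
    """True if no unit appears in more than one edge, in either position."""
    seen = set()
    for p, q in edges:
        if p == q or p in seen or q in seen:
            return False
        seen.update((p, q))
    return True


def optimal_pair_matching(delta, size):
    """Search all edge subsets of the given size for the minimal total distance.

    `delta` maps each permitted edge (p, q) to its distance; pairs that are
    absent from `delta` are forbidden. Returns (matching, total), or None
    when no pair matching of that size exists.
    """
    best = None
    for edges in combinations(sorted(delta), size):
        if not is_pair_matching(edges):
            continue
        total = sum(delta[e] for e in edges)
        if best is None or total < best[1]:
            best = (edges, total)
    return best


# Hypothetical toy problem: units 1 and 2 are treated, and unit 2 may also
# serve as a not-yet-treated control for unit 1.
delta = {(1, 2): 0.5, (1, 3): 1.0, (1, 4): 4.0, (2, 3): 1.5, (2, 4): 2.0}
matching, total = optimal_pair_matching(delta, size=2)
print(matching, total)
```

The cheapest single edge, `(1, 2)`, uses up both treated units, so it cannot be part of any matching of size 2. The optimum therefore pairs unit 1 with unit 3 and unit 2 with unit 4.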

The matching consists of $|M| = S = 100$ matched pairs. It uses three variables: pain score, urgency score, and nocturnal frequency. Each treated patient is paired with a matched patient who has not yet been treated.

## Balanced Pair Matching

For each treated unit $\alpha_p \in \mathcal{T}$, there are $K$ binary variables, $B_{pk} = 1$ or $B_{pk} = 0$ for $k = 1, \dots, K$. Likewise, for each edge $(\alpha_p, \alpha_q) = e \in \mathcal{E}$, there are $K$ binary variables, $B_{ek} = 1$ or $B_{ek} = 0$ for $k = 1, \dots, K$.
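The text above defines the binary variables but does not state how they are used. The sketch below assumes one plausible balance condition: for each $k$, the number of matched treated units with $B_{pk} = 1$ should equal the number of matched edges with $B_{ek} = 1$. That condition, the `imbalance` helper, and the toy values are all assumptions made for illustration.

```python
def imbalance(matching, B_treated, B_edge):
    """Per-variable difference between treated-unit and edge indicator counts.

    Assumed balance condition: a matching is balanced when every entry of
    the returned list is zero.
    """
    K = len(next(iter(B_edge.values())))
    return [
        sum(B_treated[p][k] for p, _ in matching)
        - sum(B_edge[(p, q)][k] for p, q in matching)
        for k in range(K)
    ]


# Hypothetical toy values with K = 2 binary variables.
matching = [(1, 3), (2, 4)]
B_treated = {1: [1, 0], 2: [0, 1]}
B_edge = {(1, 3): [1, 0], (2, 4): [1, 1]}

print(imbalance(matching, B_treated, B_edge))
```

In this toy case the first variable is out of balance by one and the second is balanced.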

## Graphical Comparisons

The comparisons are based on the $S = 100$ matched pairs.

## References

Li, Y. P., Propert, K. J., & Rosenbaum, P. R. (2001). Balanced risk set matching. Journal of the American Statistical Association, 96(455), 870–882. https://doi.org/10.1198/016214501753208573