Lanqing, May 17 2023

This is a minimal example of, and introduction to, the matching method. We consider the following situation:
- There are two datasets, one called `data` and one called `simu`. Each dataset has 3 dimensions, `x`, `y`, `z`, which may be correlated with each other.
- `data` is obtained from an experiment. It may contain unknown new physics that affects only the variable of interest, `x`.
- `simu` is the model you use to understand the data you collected. It assumes no new physics, and it is assumed to describe the distribution $P(x|y,z)$ perfectly in the absence of new physics. However, $P(y,z)$ could be generated arbitrarily, and it is unrelated to new physics.
- In general, your simulation or model may have a different distribution from the observed data in the coordinates `y`, `z`, where you expect no effect from new physics.
- To compare apples to apples and claim new physics based on the variable `x`, you want to compare `data` and `simu` when they have the same 2D distribution in `y` and `z`.

Matching is a concept from statistical inference. It makes sure that the 2D distribution in `y` and `z` is the same for `data` and `simu`.
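
To see why the `y`, `z` distributions matter, it may help to write out the marginal distribution of `x`. This is a short derivation that uses only the assumptions listed above; the subscripts are notation introduced here for illustration.

$$
P_{\mathrm{simu}}(x) = \int P(x|y,z)\, P_{\mathrm{simu}}(y,z)\, dy\, dz,
\qquad
P_{\mathrm{data}}(x) = \int P_{\mathrm{data}}(x|y,z)\, P_{\mathrm{data}}(y,z)\, dy\, dz
$$

Without new physics, $P_{\mathrm{data}}(x|y,z) = P(x|y,z)$, yet the two marginals can still differ whenever $P_{\mathrm{data}}(y,z) \neq P_{\mathrm{simu}}(y,z)$. Once the two samples share the same distribution in $(y,z)$, any remaining difference between the marginals of `x` has to come from $P_{\mathrm{data}}(x|y,z) \neq P(x|y,z)$.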

## Dataset Preparation

In [None]:
import warnings
# Ignore all warnings
warnings.filterwarnings('ignore')

import matching
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import gc
from tqdm import tqdm
import sys

from ipywidgets import interact, fixed

Prepare datasets:
- `simu` will be a 3D Gaussian distribution.
- `data` will be another 3D Gaussian distribution, with a different orientation and more smearing.

In [None]:
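# Define the mean and covariance matrix
# NOTE: np.random.multivariate_normal expects a symmetric, positive-semidefinite
# covariance matrix. The matrices below are not symmetric, so proper sampling is
# not guaranteed, and any resulting NumPy warning is hidden by the
# warnings.filterwarnings('ignore') call above.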
mean_simu = [0, 0, 0]
covariance_simu = [
    [1, 0.7, 0.5],
    [0.1, 0.2, 0.1],
    [0.3, 0.6, 0.3],
]
mean_data = [0.5, -1, 0.8]
covariance_data = [
    [1.5, -0.7, -0.5],
    [-0.1, 2, 0.1],
    [0.3, -0.6, 2.8],
]

# Generate a 3D Gaussian distribution with the specified mean and covariance as simulation (model)
num_simu = 20000
simu = pd.DataFrame(np.random.multivariate_normal(mean_simu, covariance_simu, num_simu), columns=['x', 'y', 'z'])

# Generate a 3D Gaussian distribution with the specified mean and covariance as data
num_data = 15000
data = pd.DataFrame(np.random.multivariate_normal(mean_data, covariance_data, num_data), columns=['x', 'y', 'z'])


## Before Matching

Let's take a glance at the datasets before matching.

In [None]:
fig = plt.figure(dpi=150)
ax = fig.add_subplot(111, projection='3d')
ax.scatter(simu['x'], simu['y'], simu['z'], s=1, label='simu', alpha=0.1)
ax.scatter(data['x'], data['y'], data['z'], s=1, label='data', alpha=0.1)

# Set axis labels
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')

plt.legend()
# Display the plot
plt.title('Before Matching: Gaussian Data and Gaussian Simu')
plt.show()

We want to see the difference in `x`. The two distributions look different, but since `x` is correlated with `y` and `z`, we cannot conclude from this alone that there is new physics.
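
A toy illustration of this point, separate from the notebook's datasets: below, both samples share exactly the same conditional $P(x|y)$, and only the covariate distribution differs. The slope `0.8`, the shift of `1.0`, and all variable names are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20000

def sample_x_given_y(y):
    # Same conditional P(x|y) for both samples: x = 0.8 * y + unit Gaussian noise
    return 0.8 * y + rng.normal(0.0, 1.0, size=y.shape)

y_simu = rng.normal(0.0, 1.0, size=n)  # covariate distribution of the toy "simu"
y_data = rng.normal(1.0, 1.0, size=n)  # shifted covariate distribution of the toy "data"
x_simu = sample_x_given_y(y_simu)
x_data = sample_x_given_y(y_data)

# The marginal of x shifts even though P(x|y) is identical in both samples
mean_shift = x_data.mean() - x_simu.mean()
# The fitted dependence of x on y is the same in both samples
slope_simu = np.polyfit(y_simu, x_simu, 1)[0]
slope_data = np.polyfit(y_data, x_data, 1)[0]

print(f"shift in mean x: {mean_shift:.2f}")
print(f"fitted slopes: simu {slope_simu:.2f}, data {slope_data:.2f}")
```

The expected shift in the mean of `x` is 0.8 × 1.0 = 0.8, purely from the covariate shift, while the fitted slopes agree. A difference in the `x` histogram is therefore not evidence of a different $P(x|y)$ on its own.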

In [None]:
plt.figure(dpi=150)
plt.hist(simu['x'], bins=np.linspace(-3,3,20), density=True, label='simu', alpha=0.5)
plt.hist(data['x'], bins=np.linspace(-3,3,20), density=True, label='data', alpha=0.5)
plt.xlabel('x')
plt.ylabel('Density')
plt.legend()
plt.title('Before Matching: Parameter of Interest')

For (`y`, `z`), we only want the two datasets to have the same distribution, so that the correlation of `y` and `z` with `x` is decoupled from the comparison. Right now they look very different.

In [None]:
plt.figure(dpi=150)
plt.scatter(simu['y'], simu['z'], alpha=0.3, label='simu', s=1)
plt.scatter(data['y'], data['z'], alpha=0.3, label='data', s=1)
plt.xlabel('y')
plt.ylabel('z')
plt.legend()
plt.title('Before Matching: Covariates to Control')

## Doing the Matching

Below, we use Nearest-Neighbor Matching based on Mahalanobis Distance between data and simulation events.
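
For readers unfamiliar with the idea, here is a minimal NumPy sketch of nearest-neighbor matching under the Mahalanobis distance. It is not taken from the `matching` package and may differ from what the package does internally. The function name `mahalanobis_nn_match`, the toy arrays, the use of a pooled covariance, and matching with replacement are all assumptions made for this illustration.

```python
import numpy as np

def mahalanobis_nn_match(target, pool):
    """For each row of `target`, return the index of the nearest row of `pool`
    under the Mahalanobis distance (matching with replacement)."""
    target = np.asarray(target, dtype=float)
    pool = np.asarray(pool, dtype=float)
    # Assumption: use the covariance of the pooled covariates
    cov = np.cov(np.vstack([target, pool]), rowvar=False)
    # Whitening: with inv(cov) = L @ L.T, the Euclidean distance between rows
    # of X @ L equals the Mahalanobis distance between rows of X
    L = np.linalg.cholesky(np.linalg.inv(cov))
    t, p = target @ L, pool @ L
    # Brute-force squared distances, shape (n_target, n_pool)
    d2 = ((t[:, None, :] - p[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

# Toy (y, z) covariates, unrelated to the notebook's datasets
rng = np.random.default_rng(0)
simu_yz = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=500)
data_yz = rng.multivariate_normal([0.5, -0.5], [[2.0, -0.4], [-0.4, 2.5]], size=2000)

# For every toy simu event, pick the closest toy data event
idx = mahalanobis_nn_match(simu_yz, data_yz)
data_yz_matched = data_yz[idx]

print("mean before matching:", data_yz.mean(axis=0))
print("mean after matching: ", data_yz_matched.mean(axis=0))
print("target mean:         ", simu_yz.mean(axis=0))
```

The brute-force distance matrix is fine for a few thousand events; for larger samples a spatial index on the whitened coordinates would scale better.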

In [None]:
# Instantiate the inference class with the covariates to match on
inf = matching.inference.Inference(data=data, simu=simu, 
                                   covariates=['y', 'z'])

In [None]:
# Datasets after matching.
# Note that both datasets will be cropped to the intersection of the central 99%
# range of every covariate of interest.
data_matched = inf.match_simu()
simu_matched = inf.simu
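
The cropping mentioned in the comment above could look roughly like the sketch below. This is one possible reading of "intersection of the central 99%", not the package's actual code: the helper `central_range_masks` and the toy arrays are hypothetical, and the central range is taken to be the 0.5th to 99.5th percentile of each covariate.

```python
import numpy as np

def central_range_masks(a, b, central=0.99):
    """Boolean masks selecting the rows of `a` and `b` that lie, in every column,
    inside the overlap of the two samples' central percentile ranges."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    tail = 100.0 * (1.0 - central) / 2.0
    # Per-covariate overlap of the two central intervals
    lo = np.maximum(np.percentile(a, tail, axis=0), np.percentile(b, tail, axis=0))
    hi = np.minimum(np.percentile(a, 100.0 - tail, axis=0),
                    np.percentile(b, 100.0 - tail, axis=0))
    keep_a = np.all((a >= lo) & (a <= hi), axis=1)
    keep_b = np.all((b >= lo) & (b <= hi), axis=1)
    return keep_a, keep_b

# Toy (y, z) covariates, unrelated to the notebook's datasets
rng = np.random.default_rng(1)
toy_simu = rng.normal(0.0, 1.0, size=(5000, 2))
toy_data = rng.normal(0.5, 1.5, size=(5000, 2))

keep_simu, keep_data = central_range_masks(toy_simu, toy_data)
print("kept simu rows:", keep_simu.sum(), "kept data rows:", keep_data.sum())
```

Returning masks rather than cropped arrays lets the same selection be applied to the full table, including `x`.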

## After Matching

In [None]:
fig = plt.figure(dpi=150)
ax = fig.add_subplot(111, projection='3d')
ax.scatter(simu_matched['x'], simu_matched['y'], simu_matched['z'], s=1, label='simu', alpha=0.1)
ax.scatter(data_matched['x'], data_matched['y'], data_matched['z'], s=1, label='data', alpha=0.1)

# Set axis labels
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')

plt.legend()
# Display the plot
plt.title('After Matching: Gaussian Data and Gaussian Simu')
plt.show()

In [None]:
plt.figure(dpi=150)
plt.hist(simu_matched['x'], bins=np.linspace(-3,3,20), density=True, label='simu', alpha=0.5)
plt.hist(data_matched['x'], bins=np.linspace(-3,3,20), density=True, label='data', alpha=0.5)
plt.xlabel('x')
plt.ylabel('Density')
plt.legend()
plt.title('After Matching: Parameter of Interest')

In [None]:
plt.figure(dpi=150)
plt.scatter(simu_matched['y'], simu_matched['z'], alpha=0.3, label='simu', s=1)
plt.scatter(data_matched['y'], data_matched['z'], alpha=0.3, label='data', s=1)
plt.xlabel('y')
plt.ylabel('z')
plt.legend()
plt.title('After Matching: Covariates to Control')

As you can see, the 2D distribution in the controlled covariates looks the same after matching.
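
The visual impression can be backed by a number. The sketch below computes the total variation distance between two binned 2D distributions (0 for identical histograms, 1 for histograms with no overlap). The helper `total_variation_2d`, the choice of 20 bins, and the toy samples are assumptions for illustration, not part of the `matching` package.

```python
import numpy as np

def total_variation_2d(a, b, bins=20):
    """Half the L1 distance between the normalized 2D histograms of two (n, 2) samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    both = np.vstack([a, b])
    # Shared bin edges that cover both samples
    edges = [np.linspace(both[:, k].min(), both[:, k].max(), bins + 1) for k in range(2)]
    ha, _, _ = np.histogram2d(a[:, 0], a[:, 1], bins=edges)
    hb, _, _ = np.histogram2d(b[:, 0], b[:, 1], bins=edges)
    return 0.5 * np.abs(ha / ha.sum() - hb / hb.sum()).sum()

# Toy samples: two draws from the same distribution, and one shifted draw
rng = np.random.default_rng(3)
same_a = rng.normal(0.0, 1.0, size=(5000, 2))
same_b = rng.normal(0.0, 1.0, size=(5000, 2))
shifted = rng.normal(1.0, 1.0, size=(5000, 2))

tv_same = total_variation_2d(same_a, same_b)
tv_shifted = total_variation_2d(same_a, shifted)
print(f"same distribution: {tv_same:.3f}, shifted distribution: {tv_shifted:.3f}")
```

In this notebook, one could call it as `total_variation_2d(simu_matched[['y', 'z']], data_matched[['y', 'z']])` and compare the result with the value obtained before matching. Even two samples from the same distribution give a nonzero value because of statistical fluctuations, so the number is best read relative to such a baseline.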