# 🎯 Introduction to Performance Estimators for Entity Resolution Algorithms

This notebook introduces performance estimation for entity resolution algorithms using our methodology from Binette et al. (2022). It assumes you are already familiar with clustering performance evaluation metrics, including pairwise precision and recall.

Our methodology provides nearly unbiased precision and recall estimates. It requires the following:
- A predicted membership vector `prediction`, and
- A benchmark dataset `sample` in the form of a ground truth membership vector.
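
As a minimal illustration of the two inputs, the sketch below assumes that a membership vector maps each record identifier to a cluster identifier. Plain dictionaries and the helper names `clusters_of` and `matching_pairs` are hypothetical choices made for this example; check the library documentation for the data type it actually expects.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: record ID -> predicted cluster ID.
prediction = {"r1": "a", "r2": "a", "r3": "b", "r4": "b", "r5": "c"}
# Hypothetical benchmark: record ID -> true cluster ID, for a subset of records.
sample = {"r1": "x", "r2": "x", "r3": "x"}


def clusters_of(membership):
    """Group record IDs by the cluster ID they are assigned to."""
    groups = defaultdict(set)
    for record, cluster in membership.items():
        groups[cluster].add(record)
    return dict(groups)


def matching_pairs(membership):
    """Return the set of unordered record pairs that share a cluster."""
    pairs = set()
    for records in clusters_of(membership).values():
        pairs.update(frozenset(p) for p in combinations(sorted(records), 2))
    return pairs
```

Pairwise metrics are defined on such sets of matching pairs: here the prediction links `r1` with `r2` and `r3` with `r4`, while the benchmark states that `r1`, `r2` and `r3` all refer to the same entity.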

The two **key assumptions** required are that:
1. Every cluster in the benchmark dataset is complete. In other words, `sample` contains true clusters with no missing records.
2. True clusters have been sampled according to a known sampling mechanism. In most cases, you will need to know the sampling probabilities up to a normalizing constant.

The sampling mechanisms covered by the **pv_evaluation.estimators** module are described below.

### Sampling true clusters

By default, our estimators assume that true clusters have been sampled with probability proportional to their size. Pairwise precision and recall estimates can then be obtained as follows:
```python
from pv_evaluation.estimators import pairwise_precision_estimator, pairwise_recall_estimator

pairwise_precision_estimator(prediction, sample)
pairwise_recall_estimator(prediction, sample)
```

Standard deviation estimates are obtained using:
```python
from pv_evaluation.estimators import pairwise_precision_std, pairwise_recall_std

pairwise_precision_std(prediction, sample)
pairwise_recall_std(prediction, sample)
```

For cluster precision and recall, use the `cluster_precision_estimator`, `cluster_recall_estimator`, `cluster_precision_std` and `cluster_recall_std` functions.
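
To make the default sampling assumption concrete, here is a small sketch of size-proportional cluster sampling. It relies on the fact that drawing one record uniformly at random and keeping its whole cluster selects each cluster with probability proportional to its size. The toy data and the function names are hypothetical and are not part of **pv_evaluation**.

```python
import random
from fractions import Fraction

# Hypothetical true clusters: cluster ID -> set of record IDs.
true_clusters = {"x": {"r1", "r2", "r3"}, "y": {"r4"}, "z": {"r5", "r6"}}


def size_proportional_probabilities(clusters):
    """Selection probability of each cluster when sampling proportionally to size."""
    total = sum(len(records) for records in clusters.values())
    return {cid: Fraction(len(records), total) for cid, records in clusters.items()}


def sample_cluster_by_size(clusters, rng):
    """Draw one record uniformly at random and return the ID of its cluster."""
    records = sorted(
        (record, cid) for cid, members in clusters.items() for record in members
    )
    _, cid = rng.choice(records)
    return cid


probs = size_proportional_probabilities(true_clusters)
drawn = sample_cluster_by_size(true_clusters, random.Random(0))
```

In this toy example the three-record cluster is selected half of the time, whereas uniform cluster sampling would select each of the three clusters with probability 1/3.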

#### Uniform cluster sampling

For uniform cluster sampling, pass the `weights="uniform"` parameter to the above functions.

See [this application to Lai's 2011 benchmark](https://patentsview.github.io/PatentsView-Evaluation/build/html/examples/estimators/lai-2011-benchmark.html) for a practical example.

### Sampling a single block

In the context of entity resolution, a **block** is a set of records made up of complete true clusters. That is, if a given record is in a block, then all other matching records are also in that same block. For instance, the set of Israeli inventors can be considered a block.
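
The definition of a block can be checked mechanically on toy data. The sketch below assumes the full ground truth is available as a record-to-cluster mapping, which is rarely the case in practice; `is_block` and the data are hypothetical and serve only to illustrate the definition.

```python
# Hypothetical full ground truth: record ID -> true cluster ID.
true_membership = {"r1": "x", "r2": "x", "r3": "y", "r4": "z", "r5": "z"}


def is_block(candidate, membership):
    """Return True if `candidate` is a union of whole true clusters."""
    candidate = set(candidate)
    # True clusters that have at least one record in the candidate set.
    touched = {membership[record] for record in candidate}
    # All records belonging to those clusters.
    closure = {record for record, cid in membership.items() if cid in touched}
    return closure == candidate


complete = is_block({"r1", "r2", "r3"}, true_membership)
incomplete = is_block({"r1", "r3"}, true_membership)
```

The first set is a block because it contains clusters `x` and `y` in full. The second is not, since `r2` matches `r1` but is left out.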

If `sample` corresponds to a single block, then representative performance estimates can be obtained by using our estimators with the `sampling_type="single_block"` parameter. Note that it is not possible to obtain standard deviation estimates from a single block sample. Furthermore, estimators for multiple block samples have not yet been implemented in the **pv_evaluation.estimators** module.

Usage is as follows:
```python
from pv_evaluation.estimators import pairwise_precision_estimator, pairwise_recall_estimator

pairwise_precision_estimator(prediction, sample, sampling_type="single_block")
pairwise_recall_estimator(prediction, sample, sampling_type="single_block")
```

See [this application to the Israeli inventors benchmark](https://patentsview.github.io/PatentsView-Evaluation/build/html/examples/estimators/israeli-data.html) for a practical example.