# groupbyrule: deduplicate data using fuzzy and deterministic matching rules

🚧 under construction 🚧

**groupbyrule** is a Python package for data cleaning and data integration. It integrates with [pandas](https://pandas.pydata.org/)' [`groupby`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) function to group rows not only by a given identifier, but also by logical rules and partial matching. In other words, it provides tools for deterministic record linkage and entity resolution in structured databases. It can also be used for *blocking*, a form of filtering used to speed up more complex entity resolution algorithms. See the references below to learn more about these topics.

One of the main goals of **groupbyrule** is to be user-friendly. Matching rules and clustering algorithms are composable, and the performance of algorithms can be readily evaluated given training data. The package is built on top of [pandas](https://pandas.pydata.org) for data manipulation and on [igraph](https://igraph.org/python/) for graph clustering and related computations.

## Installation

Install from GitHub using the following command:

In [4]:
pip install git+https://github.com/OlivierBinette/groupbyrule.git



## Examples

### Rule-Based Linkage

Consider the `RLdata500` dataset from the [RecordLinkage R package](https://www.google.com/search?channel=fs&client=ubuntu&q=recordlinkage+r+package).

In [5]:
from groupbyrule import RLdata500

RLdata500

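| Unnamed: 0 | fname_c1 | fname_c2 | lname_c1 | lname_c2 | by | bm | bd |
|---|---|---|---|---|---|---|---|
| 1 | CARSTEN | | MEIER | | 1949 | 7 | 22 |
| 2 | GERD | | BAUER | | 1968 | 7 | 27 |
| 3 | ROBERT | | HARTMANN | | 1930 | 4 | 30 |
| 4 | STEFAN | | WOLFF | | 1957 | 9 | 2 |
| 5 | RALF | | KRUEGER | | 1966 | 1 | 13 |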


We deduplicate this dataset by linking records that match on both first name (`fname_c1`) and last name (`lname_c1`), on both first name and birth day (`bd`), or on both last name and birth day. By default, transitivity is resolved by taking the connected components of the resulting linkage graph: if record A links to B and B links to C, all three end up in the same cluster.

In [6]:
from groupbyrule import Any, Match, identity_RLdata500, precision_recall
import pandas as pd

# Specify linkage rule
rule = Any(Match("fname_c1", "lname_c1"),
           Match("fname_c1", "bd"),
           Match("lname_c1", "bd"))

# Apply the rule to a dataset
rule.fit(RLdata500)

# Evaluate performance by computing precision and recall
precision_recall(rule.groups, identity_RLdata500)

(0.96, 0.11538461538461539)
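To make the mechanics of such a rule concrete, here is a minimal sketch of the same idea written with plain pandas and a small union-find, independent of **groupbyrule**. The helper names `match_pairs` and `connected_components` and the toy data are hypothetical and are not part of the package's API; the package itself relies on igraph for this step.

```python
import pandas as pd


def match_pairs(df, *cols):
    """Yield pairs of row labels that agree exactly on all given columns."""
    for _, idx in df.groupby(list(cols)).groups.items():
        idx = [int(i) for i in idx]
        # Chaining consecutive members is enough to connect the whole group.
        for a, b in zip(idx, idx[1:]):
            yield a, b


def connected_components(n, edges):
    """Label each of n nodes with the smallest node index in its component."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)
    return [find(i) for i in range(n)]


# Hypothetical toy data with a default 0..n-1 index.
toy = pd.DataFrame({
    "fname_c1": ["ANNA", "ANNA", "ANNE", "PETER"],
    "lname_c1": ["MEIER", "MEIER", "MEIER", "WOLFF"],
    "bd": [3, 4, 4, 9],
})

# "Any" of the three exact-match rules: take the union of their edges.
edges = []
for cols in [("fname_c1", "lname_c1"), ("fname_c1", "bd"), ("lname_c1", "bd")]:
    edges.extend(match_pairs(toy, *cols))

groups = connected_components(len(toy), edges)
print(groups)  # [0, 0, 0, 3]
```

Rows 0 and 1 are linked by first and last name, rows 1 and 2 by last name and birth day, so transitivity places rows 0, 1 and 2 in one cluster even though rows 0 and 2 satisfy none of the rules directly.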


Note that this is not the best way to deduplicate this dataset. However, it showcases the composability of matching rules. The specific rules themselves (exact matching, similarity-based string matching, and different clustering algorithms) can be customized as needed. A more complete overview is available [here]() 🚧.

A better way to deduplicate this data is to link all pairs of records which agree on all but at most one attribute. This is done below, with the precision and recall computed from the ground truth membership vector `identity_RLdata500`.
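As a reminder of what the two reported numbers measure, the sketch below computes pairwise precision and recall from two membership vectors. It assumes the pairwise definition commonly used in entity resolution; whether `precision_recall` in **groupbyrule** uses exactly this definition is not stated here, and the function names below are hypothetical.

```python
from itertools import combinations


def linked_pairs(membership):
    """Return the set of index pairs that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(membership)), 2)
        if membership[i] == membership[j]
    }


def pairwise_precision_recall(predicted, truth):
    """Precision: share of predicted links that are true.
    Recall: share of true links that were predicted."""
    pred, true = linked_pairs(predicted), linked_pairs(truth)
    correct = len(pred & true)
    precision = correct / len(pred) if pred else 1.0
    recall = correct / len(true) if true else 1.0
    return precision, recall


# Hypothetical membership vectors for five records.
predicted = [0, 0, 1, 2, 2]
truth = [0, 0, 0, 1, 2]
p, r = pairwise_precision_recall(predicted, truth)
```

Here one of the two predicted links is correct (precision 0.5), and one of the three true links is recovered (recall 1/3).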

In [7]:
from groupbyrule import AllButK

# Link records that agree on all but at most k=1 of the specified attributes
rule = AllButK("fname_c1", "lname_c1", "bd", "bm", "by", k=1)

# Apply the rule to a dataset
rule.fit(RLdata500)

# Evaluate performance by computing precision and recall
precision_recall(rule.groups, identity_RLdata500)

(0.92, 1.0)
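The "all but at most k" criterion can be illustrated with a brute-force comparison of every pair of rows. This is only a sketch under stated assumptions: `all_but_k_pairs` and the toy data are hypothetical, the quadratic loop is for illustration, and it says nothing about how `AllButK` is implemented in the package.

```python
from itertools import combinations

import pandas as pd


def all_but_k_pairs(df, cols, k=1):
    """Return pairs of row positions that disagree on at most k of `cols`.

    Missing values (NaN) never compare equal, so they count as disagreements.
    """
    values = df[list(cols)].to_numpy()
    pairs = []
    for i, j in combinations(range(len(df)), 2):
        n_disagree = int((values[i] != values[j]).sum())
        if n_disagree <= k:
            pairs.append((i, j))
    return pairs


# Hypothetical toy data.
toy = pd.DataFrame({
    "fname_c1": ["CARSTEN", "CARSTEN", "GERD", "GERT"],
    "lname_c1": ["MEIER", "MEYER", "BAUER", "BAUER"],
    "bd": [22, 22, 27, 27],
    "bm": [7, 7, 7, 8],
    "by": [1949, 1949, 1968, 1968],
})
cols = ["fname_c1", "lname_c1", "bd", "bm", "by"]

print(all_but_k_pairs(toy, cols, k=1))  # [(0, 1)]
print(all_but_k_pairs(toy, cols, k=2))  # [(0, 1), (2, 3)]
```

Rows 0 and 1 differ only in the last name, so they are linked at `k=1`. Rows 2 and 3 differ in both first name and birth month, so they are linked only once `k=2` is allowed.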

### Postprocessing

Following record linkage, records can be processed using pandas' `groupby` and aggregation functions. Below, we keep only the first non-NA value of each attribute within each record cluster. This is a simple way to obtain a deduplicated dataset.

In [8]:
RLdata500\
    .groupby(rule.groups)\
    .first()

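| Unnamed: 0 | fname_c1 | fname_c2 | lname_c1 | lname_c2 | by | bm | bd |
|---|---|---|---|---|---|---|---|
| 0 | CARSTEN | | MEIER | | 1949 | 7 | 22 |
| 1 | GERD | | BAUER | | 1968 | 7 | 27 |
| 2 | ROBERT | | HARTMANN | | 1930 | 4 | 30 |
| 3 | STEFAN | | WOLFF | | 1957 | 9 | 2 |
| 4 | RALF | | KRUEGER | | 1966 | 1 | 13 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 449 | BRITTA | | KOEHLER | | 2001 | 1 | 12 |
| 450 | SABINE | | SCHNEIDER | | 1953 | 5 | 20 |
| 451 | MARIA | | SCHNEIDER | | 1981 | 8 | 8 |
| 452 | INGE | | SCHREIBER | | 1967 | 12 | 13 |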
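The "first non-NA value" behaviour comes from pandas itself: `GroupBy.first()` works column by column and skips missing entries. The small example below shows this on hypothetical toy data, with a hand-written `groups` array standing in for the cluster labels produced by a fitted rule.

```python
import numpy as np
import pandas as pd

# Hypothetical records; rows 0 and 1 are assumed to be the same person.
records = pd.DataFrame({
    "fname_c1": ["ANNA", "ANNA", "PETER"],
    "fname_c2": [np.nan, "MARIA", np.nan],
    "by": [1980, 1980, 1975],
})
groups = np.array([0, 0, 1])  # stand-in for cluster labels

# One row per cluster, taking the first non-missing value in each column.
deduplicated = records.groupby(groups).first()
print(deduplicated)
```

For cluster 0, `fname_c2` is filled with `"MARIA"` from the second record even though the first record has a missing value there. Other aggregations (for example `last()` or a custom function passed to `agg`) can be substituted when a different merging policy is preferred.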



### Similarity-Based Linkage Rules

🚧

### Supervised Approaches and Learning Rules

🚧

### Clustering Algorithms

🚧

### Performance Evaluation

🚧

## References

🚧
