# Entity Resolution using Privacy Preserving Record Linkage
expected memory usage: 250MB.  
expected runtime: 10 seconds.

## Introduction

This tutorial demonstrates the Privacy Preserving Record Linkage (PPRL) protocol between two parties, called Alice and Bob.
Alice and Bob each have their own database (DB) with ~500,000 private records. The objective is for the two sides to learn which of their records match similar records in the other side's DB, and nothing else (apart from, incidentally, the number of records in the other side's DB). For example, neither side learns anything about the other side's non-similar records, nor about fields of a matched record on the other side that may differ from its own.

The protocol is similar to the Private-Set-Intersection (PSI) protocol, except that it allows for similarities rather than requiring exact equality of the reported candidate pairs. The similarity of two records is measured by their Jaccard similarity index (see https://en.wikipedia.org/wiki/Jaccard_index), which is estimated using min-hashing. For performance reasons, the Record-Linkage algorithm is statistical: matched pairs of records have a high Jaccard similarity index with high probability, not with certainty.
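To make the two ideas above concrete, here is a minimal plaintext sketch of the Jaccard index over character shingles and of its min-hash estimate. It is not the library's implementation: the helper names, the shingle size of 3, the use of SHA-256 as a seeded hash and the signature length of 200 are all assumptions chosen for illustration.

```python
import hashlib

def shingles(text, k=3):
    """Return the set of overlapping k-character shingles of text."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def jaccard(a, b):
    """Exact Jaccard index |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def seeded_hash(seed, item):
    """Deterministic 64-bit hash of item under the given seed (illustrative choice)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items, num_hashes=200):
    """For each seeded hash function, keep the minimum hash over the set."""
    return [min(seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = shingles("2800 Sawyer St, Tacoma,WA,98409")
b = shingles("2800 Sawyer St, Tacoma,WASHINGTON,98409")
exact = jaccard(a, b)
approx = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(f"exact={exact:.2f} estimate={approx:.2f}")
```

The estimate approaches the exact value as the number of hash functions grows; the protocol itself works on encrypted min-hashes rather than on plaintext strings as done here.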

For example, in the datasets used below, Alice's database includes RecordA and Bob's database includes RecordB.
These records are not identical, but their close similarity indicates that they very probably refer to the same entity,
as reported in the final match report at the end of the demo below.

**RecordA**: Essie, Pocklington, Essie.Pocklington, ibm.com, 60259, 2800 Sawyer St, Tacoma,WA,98409, Box 69977, Clappertown, SD, USA, 91142, 3866, 623, 870, 8382

**RecordB**: Essie, Pocklington, Essie.Pocklington, ibm.com, 60259, 2800 Sawyer St, Tacoma,WASHINGTON,98409, Box 69977, Clappertown, SD, USA, 91142, 3866, 623, 870, 8382


In [None]:
import utils 

utils.verify_memory()

from pyhelayers import RecordLinkageConfig, RecordLinkageManager, RecordLinkageRule, RL_RULE_EQUAL, RL_RULE_SIMILAR, RL_RULE_NONE

## Step 1. Define the Record-Linkage configuration
Here we define the Record-Linkage configuration shared by both parties.
This includes the list of record field names and some
tuning of the Record-Linkage algorithm and heuristics.

In [None]:
config = RecordLinkageConfig()
fieldNames = ["first_name",
              "last_name",
              "email",
              "email_domain",
              "address_number",
              "address_location",
              "address_line2",
              "city",
              "state",
              "country",
              "zip_base",
              "zip_ext",
              "phone_area_code",
              "phone_exchange_code",
              "phone_line_number"]

config.set_num_bands_and_size_bands(num_bands=100, size_bands=13)
config.set_records_fields(fields_names=fieldNames, name_field_name="first_name")
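As a rough guide to the two tuning values set above, the following sketch assumes that `num_bands` and `size_bands` play the roles of the number of bands b and the rows per band r in the standard min-hash banding scheme described by Leskovec et al. Under that assumption, two records with Jaccard similarity s share at least one band with probability 1 - (1 - s^r)^b. The function name is hypothetical and the library may apply further heuristics.

```python
def candidate_probability(s, num_bands=100, size_bands=13):
    """Probability that at least one band matches, for Jaccard similarity s.

    Assumes the textbook banding model: each of the size_bands min-hashes in a
    band agrees independently with probability s.
    """
    return 1.0 - (1.0 - s ** size_bands) ** num_bands

for s in (0.5, 0.7, 0.8, 0.9, 1.0):
    print(f"s={s:.1f} -> P(candidate)={candidate_probability(s):.3f}")
```

In this model the curve rises steeply around s = (1/b)^(1/r), which is roughly 0.7 for 100 bands of size 13: more bands make matching more permissive, while larger bands make it stricter.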

## Step 2. Define the Record-Linkage rules
Here we define the rules by which two records are considered linked.
Each rule assigns every field a rule type: RL_RULE_EQUAL,
RL_RULE_SIMILAR or RL_RULE_NONE (which is the default rule type).

- Fields with the RL_RULE_EQUAL rule type require the content of these
  fields to be exactly the same for two records to be considered linked.
- Fields with the RL_RULE_SIMILAR rule type require the content of these
  fields to have high Jaccard similarity for two records to be considered
  linked. We can also optionally set the weight and the size of the shingles
  generated for every such field.
- Fields with the RL_RULE_NONE rule type are not taken into account in the
  record linkage process.
  
For two records to be considered linked, ALL the conditions in the specific
rule must apply. For example, if first_name is set to RL_RULE_EQUAL and
address_location is set to RL_RULE_SIMILAR, then two records are considered
linked if their first_name content is equal AND their address_location
content is similar.
  
We can run the protocol with several rules iteratively. Two records are
considered linked if at least one of the rules applies to them. The order
of the rules matters: after a record has been matched, it is not
considered as a candidate in the following iterations.
Thus, defining multiple rules is not only useful for defining separate match
conditions, but can also be used to optimize the performance of the matching
process by placing fast-to-compute rules before slow-to-compute rules.
For example, testing for field equality is generally faster than testing for
field similarity, so rule1 of this example, which includes equality conditions,
is faster to compute than rule2, which uses only similarity conditions.
Rule1 is computed before rule2, so all the "easy" matches are first carried out
with rule1 and then the remaining "hard" cases are left for rule2.
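The rule semantics described above (AND within a rule, OR across rules, and matched records skipped in later iterations) can be sketched on plaintext data as follows. This is only an illustration of the logic, not of the encrypted protocol: the `EQUAL`/`SIMILAR` constants, the helper functions and the fixed similarity threshold of 0.6 are hypothetical stand-ins, and the real protocol decides similarity through min-hash bands rather than an exact threshold.

```python
EQUAL, SIMILAR = "equal", "similar"  # hypothetical stand-ins for the RL_RULE_* constants

def shingles(text, k=2):
    """Set of overlapping k-character shingles."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def rule_applies(rule, rec_a, rec_b, threshold=0.6):
    """A rule applies only if ALL of its field conditions hold (logical AND)."""
    for field, kind in rule.items():
        if kind == EQUAL and rec_a[field] != rec_b[field]:
            return False
        if kind == SIMILAR and jaccard(shingles(rec_a[field]),
                                       shingles(rec_b[field])) < threshold:
            return False
    return True

def link(records_a, records_b, rules):
    """Apply the rules in order; a record matched by one rule is skipped later."""
    matches, used_a, used_b = [], set(), set()
    for rule_no, rule in enumerate(rules, start=1):
        for i, rec_a in enumerate(records_a):
            if i in used_a:
                continue
            for j, rec_b in enumerate(records_b):
                if j not in used_b and rule_applies(rule, rec_a, rec_b):
                    matches.append((i, j, rule_no))
                    used_a.add(i)
                    used_b.add(j)
                    break
    return matches

alice_records = [
    {"last_name": "Martin", "city": "Mountain Home"},
    {"last_name": "Pocklington", "city": "Tacoma"},
]
bob_records = [
    {"last_name": "Pocklingtom", "city": "Tacoma"},  # misspelled on purpose
    {"last_name": "Martin", "city": "Mountain Home"},
]
rules = [
    {"last_name": EQUAL, "city": EQUAL},      # cheap rule first
    {"last_name": SIMILAR, "city": SIMILAR},  # costlier rule for the leftovers
]
matches = link(alice_records, bob_records, rules)
print(matches)  # [(0, 1, 1), (1, 0, 2)]
```

The exact duplicate is picked up by the first rule, so the second rule only has to examine the remaining misspelled record.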

In [None]:
rule1 = RecordLinkageRule(config)
rule1.set_field("last_name", RL_RULE_EQUAL)
rule1.set_field("city", RL_RULE_EQUAL)
rule1.set_field("address_location", RL_RULE_SIMILAR, 1, 5)
rule1.set_field("address_number", RL_RULE_SIMILAR, 1, 5)

rule2 = RecordLinkageRule(config)
rule2.set_field("first_name", RL_RULE_SIMILAR, 2, 4)
rule2.set_field("last_name", RL_RULE_SIMILAR, 2, 4)
rule2.set_field("email", RL_RULE_SIMILAR, 2, 4)
rule2.set_field("phone_line_number", RL_RULE_SIMILAR, 2, 4)
rule2.set_field("address_location", RL_RULE_SIMILAR, 1, 5)
rule2.set_field("address_number", RL_RULE_SIMILAR, 1, 5)
rule2.set_field("city", RL_RULE_SIMILAR, 1, 5)

## Step 3. Setup

Set up the two Record-Linkage managers and load the records to be matched by them.
First, construct the RecordLinkageManager objects, which manage the PPRL protocol.

In [None]:
alice = RecordLinkageManager(config)
bob = RecordLinkageManager(config)

Read Alice's and Bob's tables from the pair of CSV files.
This step also processes the records (creates shingles and computes the min-hashes) and encrypts the processed information (thus playing the part of the 1st party in a Diffie-Hellman-like protocol).

In [None]:
INPUT_DIR = utils.get_data_sets_dir() + '/er/'
num_records = 1000

alice.init_records_from_file(INPUT_DIR + "out1.csv", num_records)
bob.init_records_from_file(INPUT_DIR + "out2.csv", num_records)

## Step 4. Match with rule1

We are ready to start the first iteration. At each iteration we need to set the rule we want to run.

In [None]:
alice.set_current_rule(rule1)
bob.set_current_rule(rule1)

Both sides exchange their encrypted data. In an iteration whose rule combines RL_RULE_EQUAL fields with RL_RULE_SIMILAR fields, we must run the RL_RULE_EQUAL related functions first.

In [None]:
packageAlice = alice.encrypt_fields_for_equal_rule()
packageBob = bob.encrypt_fields_for_equal_rule()

Both sides receive the encrypted information from the other side, and then add their own encryption layer (thus playing the part of the 2nd party in a Diffie-Hellman-like protocol).

In [None]:
alice.apply_secret_key_to_records(packageBob)
bob.apply_secret_key_to_records(packageAlice)

Both parties receive the doubly encrypted PPRL information of their own
records from the other party. They then process it together with the
doubly encrypted PPRL information they computed for the other party's
records.

In [None]:
alice.match_records_by_equal_rule(packageAlice, packageBob)
bob.match_records_by_equal_rule(packageBob, packageAlice)
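The encrypt, re-encrypt and match sequence above can be pictured with a toy commutative-encryption sketch. It assumes that "Diffie-Hellman-like" refers to the usual construction in which each value is hashed into a group and raised to each party's secret exponent, so that the two layers can be applied in either order. This is an assumption about the general technique, not a description of the library's internals; the prime, the function names and the sample values are illustrative only and not secure parameters.

```python
import hashlib
import secrets

# Toy modulus: the Mersenne prime 2**127 - 1. NOT a secure choice.
P = 2**127 - 1

def hash_to_group(value):
    """Map a string to an integer in [1, P - 1]."""
    digest = hashlib.sha256(value.encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def new_secret():
    """Random secret exponent in [2, P - 2]."""
    return secrets.randbelow(P - 3) + 2

def encrypt(values, key):
    """First layer: hash each value, then raise it to the party's secret."""
    return [pow(hash_to_group(v), key, P) for v in values]

def reencrypt(ciphertexts, key):
    """Second layer: raise the other party's ciphertexts to our own secret."""
    return [pow(c, key, P) for c in ciphertexts]

alice_key, bob_key = new_secret(), new_secret()
alice_values = ["martin|mountain home", "pocklington|tacoma"]
bob_values = ["pocklington|tacoma", "smith|boise"]

package_alice = encrypt(alice_values, alice_key)  # sent to Bob
package_bob = encrypt(bob_values, bob_key)        # sent to Alice

double_alice = reencrypt(package_alice, bob_key)  # computed by Bob, returned to Alice
double_bob = reencrypt(package_bob, alice_key)    # computed by Alice

# (H(x)^a)^b == (H(x)^b)^a mod P, so equal values meet after both layers.
common = set(double_alice) & set(double_bob)
matched_alice = [v for v, c in zip(alice_values, double_alice) if c in common]
print(matched_alice)  # ['pocklington|tacoma']
```

Alice learns which of her own values also appear on Bob's side, but she only ever sees Bob's values under his secret exponent, so the non-matching ones stay hidden from her.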

We now run the RL_RULE_SIMILAR related functions.

In [None]:
packageAlice = alice.encrypt_fields_for_similar_rule()
packageBob = bob.encrypt_fields_for_similar_rule()

alice.apply_secret_key_to_records(packageBob)
bob.apply_secret_key_to_records(packageAlice)

alice.match_records_by_similar_rule(packageAlice, packageBob)
bob.match_records_by_similar_rule(packageBob, packageAlice)

At this point we have finished the first iteration and are ready to run the next one
with the second rule. Before proceeding, we can check the matches found so far.

In [None]:
print('Found {0} matched records'.format(alice.get_num_matched_records()[0]))

## Step 5. Match with rule2

We proceed by setting the second rule in the RecordLinkageManager objects.

In [None]:
alice.set_current_rule(rule2)
bob.set_current_rule(rule2)

Since the second rule contains only RL_RULE_SIMILAR rule type fields, we
can skip the RL_RULE_EQUAL related functions.
NOTE: running the RL_RULE_EQUAL related functions first is also OK; the calls are
silently ignored.
If in doubt about which function to call next, you can use the
getNextExpectedFunctionName function.

In [None]:
packageAlice = alice.encrypt_fields_for_similar_rule()
packageBob = bob.encrypt_fields_for_similar_rule()

alice.apply_secret_key_to_records(packageBob)
bob.apply_secret_key_to_records(packageAlice)

alice.match_records_by_similar_rule(packageAlice, packageBob)
bob.match_records_by_similar_rule(packageBob, packageAlice)

## Step 6. Report the results
 
Finally, we're ready to compare the two sets of doubly encrypted PPRL
information for the records of the two sides and report the matching pairs
of records.

For every record of Alice that has a duplicate in Bob's table, the report
includes the number of shared "bands" for the matched pair. This is a technical
term that relates to the min-hash algorithm used to compare the
records. In general, at least one shared band is both necessary and sufficient
to warrant reporting a pair of records as a probable candidate. More shared
bands indicate a higher probability that the pair of records indeed
describes the same entity.
  
Thus the following example indicates that the reported record from Alice's
DB has an identical record in Bob's DB (with all the 100 bands matching).

**RecordA**: Fredericka, Martin, Fredericka.Martin, stanford.edu, 28598, 150 Dewey St, Mountain Home,ID,83647, Suite 68325, Bozoo, OH, USA, 81783, 5338, 458, 845, 8312

**RecordB**: Fredericka, Martin, Fredericka.Martin, stanford.edu, 28598, 150 Dewey St, Mountain Home,ID,83647, Suite 68325, Bozoo, OH, USA, 81783, 5338, 458, 845, 8312

Number of matching bands: 100
  
The following example indicates that the reported record from Alice's
DB has a similar though not identical record in Bob's DB (with just one matching band).

**RecordA**: Essie, Pocklington, Essie.Pocklington, ibm.com, 60259, 2800 Sawyer St, Tacoma,WA,98409, Box 69977, Clappertown, SD, USA, 91142, 3866, 623, 870, 8382

**RecordB**: Essie, Pocklington, Essie.Pocklington, ibm.com, 60259, 2800 Sawyer St, Tacoma,WASHINGTON,98409, Box 69977, Clappertown, SD, USA, 91142, 3866, 623, 870, 8382

 Number of matching bands: 1
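The "number of matching bands" reported in the two examples above can be pictured with the following plaintext sketch, which cuts two min-hash signatures into bands and counts the band positions that agree entirely. The function names and the tiny signatures are made up for illustration; the actual protocol compares encrypted band information and its internal representation may differ.

```python
def split_into_bands(signature, num_bands, size_bands):
    """Cut a min-hash signature into num_bands tuples of size_bands values."""
    if len(signature) != num_bands * size_bands:
        raise ValueError("signature length must equal num_bands * size_bands")
    return [tuple(signature[i * size_bands:(i + 1) * size_bands])
            for i in range(num_bands)]

def shared_bands(sig_a, sig_b, num_bands, size_bands):
    """Count band positions where every min-hash in the band is equal."""
    bands_a = split_into_bands(sig_a, num_bands, size_bands)
    bands_b = split_into_bands(sig_b, num_bands, size_bands)
    return sum(a == b for a, b in zip(bands_a, bands_b))

# Toy signatures: 4 bands of 3 min-hash values each.
sig_a = [5, 1, 9,  2, 2, 7,  4, 8, 3,  6, 0, 1]
sig_b = [5, 1, 9,  2, 3, 7,  4, 8, 3,  6, 0, 2]

print(shared_bands(sig_a, sig_a, 4, 3))  # 4: identical records share every band
print(shared_bands(sig_a, sig_b, 4, 3))  # 2: a single differing value breaks its whole band
```

Because one differing min-hash invalidates its entire band, records that differ in only one field can end up with few shared bands, as in the single-band example above.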
  
Note that the printout produced here shows Alice's record and also Bob's
matching record. This is possible because this program has access to both
Alice's and Bob's records, and the print method called by alice below
indeed receives Bob's table information. This is for debugging purposes
only; a more realistic application would call
alice.report_matched_records(), which prints only the records of Alice that
have a similar record in Bob's table, without printing Bob's records.

In [None]:
res = alice.get_num_matched_records()
num_matches = res[0]
num_blocked = res[1]
print("Number of records analyzed from Alice's side: ", alice.get_num_of_records())
print("Number of records analyzed from Bob's side  : ", bob.get_num_of_records())
print("Number of matched similar records           : ", num_matches)
print("Number of blocked records                   : ", num_blocked)
print("RAM usage                                   : ", utils.get_used_ram(), "MB")

In [None]:
alice.report_matched_records_along_with_other_side_records(bob, True)

In [None]:
assert num_matches == 100

#### References:

<sub><sup> 1.	Leskovec, J., Rajaraman, A., & Ullman, J. (2014). Finding Similar Items. In Mining of Massive Datasets (pp. 68-122). Cambridge: Cambridge University Press. doi:10.1017/CBO9781139924801.004 </sup></sub>
