# Entity Matching


## Setup Python environment

If you are running this notebook in Colab, first install the necessary packages by running the code below:

In [1]:
try:
    import deepmatcher as dm
except ImportError:
    !pip install git+https://github.com/anhaidgroup/deepmatcher.git
    import deepmatcher as dm

We recommend having a GPU available for DeepMatcher training. If no GPU is available, all available CPU cores are used instead. Run the following command to check whether a GPU is available and will be used for training:

In [2]:
import torch
torch.cuda.is_available()

True

In [3]:
try:
    import recordlinkage
except ImportError:
    !pip install -qqq recordlinkage
    import recordlinkage

In [4]:
import pandas as pd
import numpy as np

## Data preprocessing

Read the data:

In [5]:
left_df = pd.read_csv('Data/DBLP2.csv', encoding='cp1252')
right_df = pd.read_csv('Data/ACM.csv', encoding='cp1252')
display(left_df.head())
display(right_df.head())

Unnamed: 0,id,title,authors,venue,year
0,journals/sigmod/Mackay99,Semantic Integration of Environmental Models f...,D. Scott Mackay,SIGMOD Record,1999
1,conf/vldb/PoosalaI96,Estimation of Query-Result Distribution and it...,"Viswanath Poosala, Yannis E. Ioannidis",VLDB,1996
2,conf/vldb/PalpanasSCP02,Incremental Maintenance for Non-Distributive A...,"Themistoklis Palpanas, Richard Sidle, Hamid Pi...",VLDB,2002
3,conf/vldb/GardarinGT96,Cost-based Selection of Path Expression Proces...,"Zhao-Hui Tang, Georges Gardarin, Jean-Robert G...",VLDB,1996
4,conf/vldb/HoelS95,Benchmarking Spatial Join Operations with Spat...,"Erik G. Hoel, Hanan Samet",VLDB,1995


Unnamed: 0,id,title,authors,venue,year
0,304586,The WASA2 object-oriented workflow management ...,"Gottfried Vossen, Mathias Weske",International Conference on Management of Data,1999
1,304587,A user-centered interface for querying distrib...,"Isabel F. Cruz, Kimberly M. James",International Conference on Management of Data,1999
2,304589,"World Wide Database-integrating the Web, CORBA...","Athman Bouguettaya, Boualem Benatallah, Lily H...",International Conference on Management of Data,1999
3,304590,XML-based information mediation with MIX,"Chaitan Baru, Amarnath Gupta, Bertram Lud&#228...",International Conference on Management of Data,1999
4,304582,The CCUBE constraint object-oriented database ...,"Alexander Brodsky, Victor E. Segal, Jia Chen, ...",International Conference on Management of Data,1999


Prefix the columns with `left_`/`right_` and then form the Cartesian product of the two tables:

In [6]:
left_df_renamed = left_df.add_prefix('left_')
left_df_renamed['join'] = 1
right_df_renamed = right_df.add_prefix('right_')
right_df_renamed['join'] = 1
prod_df = pd.merge(left_df_renamed, right_df_renamed, on="join")
prod_df = prod_df.drop(columns=['join'])
display(prod_df.head())

Unnamed: 0,left_id,left_title,left_authors,left_venue,left_year,right_id,right_title,right_authors,right_venue,right_year
0,journals/sigmod/Mackay99,Semantic Integration of Environmental Models f...,D. Scott Mackay,SIGMOD Record,1999,304586,The WASA2 object-oriented workflow management ...,"Gottfried Vossen, Mathias Weske",International Conference on Management of Data,1999
1,journals/sigmod/Mackay99,Semantic Integration of Environmental Models f...,D. Scott Mackay,SIGMOD Record,1999,304587,A user-centered interface for querying distrib...,"Isabel F. Cruz, Kimberly M. James",International Conference on Management of Data,1999
2,journals/sigmod/Mackay99,Semantic Integration of Environmental Models f...,D. Scott Mackay,SIGMOD Record,1999,304589,"World Wide Database-integrating the Web, CORBA...","Athman Bouguettaya, Boualem Benatallah, Lily H...",International Conference on Management of Data,1999
3,journals/sigmod/Mackay99,Semantic Integration of Environmental Models f...,D. Scott Mackay,SIGMOD Record,1999,304590,XML-based information mediation with MIX,"Chaitan Baru, Amarnath Gupta, Bertram Lud&#228...",International Conference on Management of Data,1999
4,journals/sigmod/Mackay99,Semantic Integration of Environmental Models f...,D. Scott Mackay,SIGMOD Record,1999,304582,The CCUBE constraint object-oriented database ...,"Alexander Brodsky, Victor E. Segal, Jia Chen, ...",International Conference on Management of Data,1999
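
The constant `join` column is one way to build the Cartesian product. As a hedged alternative, pandas 1.2 and later accept `how="cross"` in `merge`, which avoids the helper column. The toy frames below are invented for illustration and only assume the `left_`/`right_` prefixes used above.

```python
import pandas as pd

# Toy stand-ins for the two bibliographic tables (values are illustrative only)
left = pd.DataFrame({"id": ["a", "b"], "year": [1999, 2001]}).add_prefix("left_")
right = pd.DataFrame({"id": [1, 2, 3], "year": [1999, 2000, 2001]}).add_prefix("right_")

# how="cross" pairs every left row with every right row (requires pandas >= 1.2)
pairs = pd.merge(left, right, how="cross")

print(len(pairs))            # number of candidate pairs: len(left) * len(right)
print(list(pairs.columns))
```

The number of candidate pairs is the product of the two table sizes, so every row added to either table multiplies the size of the product table.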


Read the true links, use them to label the pairs in the product table, and keep a random 5% sample of the labeled pairs:

In [7]:
matches = pd.read_csv('Data/DBLP-ACM_perfectMapping.csv', encoding='cp1252')
matching_tuples = [(m[0],m[1]) for m in matches.values]
final_df_full = prod_df.copy()
final_df_full.insert(0,'label',0)
final_df_full['combined_index'] = list(zip(final_df_full.left_id, final_df_full.right_id))
final_df_full.loc[final_df_full['combined_index'].isin(matching_tuples),'label'] = 1
final_df_full = final_df_full.drop(columns=['combined_index','left_id','right_id'])
final_df_full.index.name = 'id'
final_df = final_df_full.sample(frac=0.05)
display(final_df.head())
display(final_df[final_df.label.eq(1)].head())

id,label,left_title,left_authors,left_venue,left_year,right_title,right_authors,right_venue,right_year
1046262,0,Spatial Joins Using R-trees: Breadth-First Tra...,"Ning Jing, Yun-Wu Huang, Elke A. Rundensteiner",VLDB,1997,Mind your vocabulary: query mapping across het...,"Chen-Chuan K. Chang, H&#233;ctor Garc&#237;a-M...",International Conference on Management of Data,1999
723207,0,"Declarative Data Cleaning: Language, Model, an...","Daniela Florescu, Cristian-Augustin Saita, Eri...",VLDB,2001,"Transactional information systems: theory, alg...",Marc H. Scholl,ACM SIGMOD Record,2001
2621978,0,On-Line Warehouse View Maintenance,"Jennifer Widom, Dallan Quass",SIGMOD Conference,1997,P-Grid: a self-organizing structured P2P system,"Karl Aberer, Philippe Cudr&#233;-Mauroux, Anwi...",ACM SIGMOD Record,2003
3330295,0,Querying and mining data streams: you only get...,"Minos N. Garofalakis, Rajeev Rastogi, Johannes...",SIGMOD Conference,2002,Comprehension syntax,"Peter Buneman, Leonid Libkin, Dan Suciu, Val T...",ACM SIGMOD Record,1994
3254369,0,On Parallel Execution of Multiple Pipelined Ha...,"Philip S. Yu, Hui-I Hsiao, Ming-Syan Chen",SIGMOD Conference,1994,Bypassing Joins in Disjunctive Queries,"Michael Steinbrunn, Klaus Peithner, Guido Moer...",Very Large Data Bases,1995


id,label,left_title,left_authors,left_venue,left_year,right_title,right_authors,right_venue,right_year
5396296,1,Industrial Panel on Data Warehousing Technolog...,Umeshwar Dayal,VLDB,1999,Industrial Panel on Data Warehousing Technolog...,Umeshwar Dayal,Very Large Data Bases,1999
37056,1,Snowball: A Prototype System for Extracting Re...,"Jeff Pavel, Luis Gravano, Aleksandr Voskoboyni...",SIGMOD Conference,2001,Snowball: a prototype system for extracting re...,"Eugene Agichtein, Luis Gravano, Jeff Pavel, Vi...",International Conference on Management of Data,2001
4635259,1,"Set Containment Joins: The Good, The Bad and T...","Karthikeyan Ramasamy, Jeffrey F. Naughton, Jig...",VLDB,2000,"Set Containment Joins: The Good, The Bad and T...","Karthikeyan Ramasamy, Jignesh M. Patel, Jeffre...",Very Large Data Bases,2000
4299157,1,Optimization of Constrained Frequent Set Queri...,"Laks V. S. Lakshmanan, Alex Pang, Raymond T. N...",SIGMOD Conference,1999,Optimization of constrained frequent set queri...,"Laks V. S. Lakshmanan, Raymond Ng, Jiawei Han,...",International Conference on Management of Data,1999
373228,1,Temporal Database System Implementations,Michael H. Böhlen,SIGMOD Record,1995,Temporal database system implementations,Michael H. B&#246;hlen,ACM SIGMOD Record,1995
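
The sample rows show that the ACM export stores accented characters as HTML character references (for example `B&#246;hlen`), while DBLP stores the characters directly (`Böhlen`). The notebook leaves these untouched. As an optional preprocessing sketch, the standard library's `html.unescape` can decode them; the helper name `decode_entities` is hypothetical.

```python
import html

def decode_entities(text):
    """Decode HTML character references such as '&#246;'; leave non-strings unchanged."""
    return html.unescape(text) if isinstance(text, str) else text

acm_author = "Michael H. B&#246;hlen"   # as it appears in the ACM table
dblp_author = "Michael H. Böhlen"       # as it appears in the DBLP table

decoded = decode_entities(acm_author)
print(decoded == dblp_author)
```

On the data frames above this could be applied column-wise, for example with `right_df['authors'].map(decode_entities)`, before the product table is built.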


Create train, validation and test sets:

In [8]:
train,validate,test = np.split(final_df, [int(.6 * len(final_df)), int(.8 * len(final_df))])
train_file = 'Data/train.csv'
validate_file = 'Data/validate.csv'
test_file = 'Data/test.csv'
train.to_csv(train_file)
validate.to_csv(validate_file)
test.to_csv(test_file)
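
`np.split` with two cut points yields three consecutive slices: the first 60% of the rows, the next 20%, and the remaining 20%. Because `final_df` comes from `sample`, its rows are already in random order, so consecutive slices are random subsets. The same index arithmetic can be sketched in plain Python; `three_way_split` is a hypothetical helper, not part of the notebook.

```python
def three_way_split(rows, first_cut_frac=0.6, second_cut_frac=0.8):
    """Cut a sequence into consecutive train/validation/test slices."""
    first_cut = int(first_cut_frac * len(rows))
    second_cut = int(second_cut_frac * len(rows))
    return rows[:first_cut], rows[first_cut:second_cut], rows[second_cut:]

rows = list(range(10))  # stand-in for already shuffled row ids
train_rows, validate_rows, test_rows = three_way_split(rows)
print(len(train_rows), len(validate_rows), len(test_rows))
```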

In [9]:
train, validation, test = dm.data.process(
    path='.',
    left_prefix='left_',
    right_prefix='right_',
    label_attr='label',
    id_attr='id',
    cache=None,
    train=train_file,
    validation=validate_file,
    test=test_file)


Reading and processing data from "./Data/train.csv"
0% [##############################] 100% | ETA: 00:00:00
Reading and processing data from "./Data/validate.csv"
0% [##############################] 100% | ETA: 00:00:00
Reading and processing data from "./Data/test.csv"
0% [##############################] 100% | ETA: 00:00:00
INFO:deepmatcher.data.field:Downloading vectors from https://drive.google.com/uc?export=download&id=1Vih8gAmgBnuYDxfblbT94P6WjB7s1ZSh
wiki.en.bin: 8.49GB [02:23, 59.1MB/s]
INFO:deepmatcher.data.field:Extracting vectors into /home/jovyan/.vector_cache

Building vocabulary
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:10

Computing principal components
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:02:24


#### Peeking at processed data
Let's take a look at what the processed data looks like. To do this, we get the raw `pandas` table corresponding to the processed training dataset object.

In [10]:
train_table = train.get_raw_table()
train_table.head()

Unnamed: 0,id,label,left_title,left_authors,left_venue,left_year,right_title,right_authors,right_venue,right_year
0,1046262,0,spatial joins using r-trees : breadth-first tr...,"ning jing , yun-wu huang , elke a. rundensteiner",vldb,1997,mind your vocabulary : query mapping across he...,"chen-chuan k. chang , h & # 233 ; ctor garc & ...",international conference on management of data,1999
1,723207,0,"declarative data cleaning : language , model ,...","daniela florescu , cristian-augustin saita , e...",vldb,2001,"transactional information systems : theory , a...",marc h. scholl,acm sigmod record,2001
2,2621978,0,on-line warehouse view maintenance,"jennifer widom , dallan quass",sigmod conference,1997,p-grid : a self-organizing structured p2p system,"karl aberer , philippe cudr & # 233 ; -mauroux...",acm sigmod record,2003
3,3330295,0,querying and mining data streams : you only ge...,"minos n. garofalakis , rajeev rastogi , johann...",sigmod conference,2002,comprehension syntax,"peter buneman , leonid libkin , dan suciu , va...",acm sigmod record,1994
4,3254369,0,on parallel execution of multiple pipelined ha...,"philip s. yu , hui-i hsiao , ming-syan chen",sigmod conference,1994,bypassing joins in disjunctive queries,"michael steinbrunn , klaus peithner , guido mo...",very large data bases,1995


## Define neural network model

In [11]:
model = dm.MatchingModel(attr_summarizer='hybrid')

## Train model

In [12]:
model.run_train(
    train,
    validation,
    epochs=8,
    batch_size=16,
    best_save_path='Results/model_dblp_acm.pth',
    pos_neg_ratio=10)



* Number of trainable parameters: 9210006
===>  TRAIN Epoch 1


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:26:43


Finished Epoch 1 || Run Time: 1289.5 | Load Time:  314.4 || F1:  36.62 | Prec:  27.27 | Rec:  55.71 || Ex/s: 112.25

===>  EVAL Epoch 1


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:05:11


Finished Epoch 1 || Run Time:  206.0 | Load Time:  105.8 || F1:  60.61 | Prec:  44.44 | Rec:  95.24 || Ex/s: 192.44

* Best F1: tensor(60.6061, device='cuda:0')
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 2


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:26:47


Finished Epoch 2 || Run Time: 1292.0 | Load Time:  315.5 || F1:  64.58 | Prec:  50.82 | Rec:  88.57 || Ex/s: 112.00

===>  EVAL Epoch 2


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:05:16


Finished Epoch 2 || Run Time:  210.1 | Load Time:  106.2 || F1:  60.00 | Prec: 100.00 | Rec:  42.86 || Ex/s: 189.71

---------------------

===>  TRAIN Epoch 3


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:26:44


Finished Epoch 3 || Run Time: 1290.3 | Load Time:  314.8 || F1:  84.28 | Prec:  75.28 | Rec:  95.71 || Ex/s: 112.16

===>  EVAL Epoch 3


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:05:11


Finished Epoch 3 || Run Time:  205.9 | Load Time:  106.0 || F1:  89.47 | Prec: 100.00 | Rec:  80.95 || Ex/s: 192.43

* Best F1: tensor(89.4737, device='cuda:0')
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 4


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:26:45


Finished Epoch 4 || Run Time: 1291.2 | Load Time:  315.0 || F1:  84.47 | Prec:  74.73 | Rec:  97.14 || Ex/s: 112.09

===>  EVAL Epoch 4


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:05:11


Finished Epoch 4 || Run Time:  206.3 | Load Time:  105.5 || F1:  83.72 | Prec:  81.82 | Rec:  85.71 || Ex/s: 192.48

---------------------

===>  TRAIN Epoch 5


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:26:40


Finished Epoch 5 || Run Time: 1286.7 | Load Time:  314.2 || F1:  92.00 | Prec:  86.25 | Rec:  98.57 || Ex/s: 112.46

===>  EVAL Epoch 5


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:05:11


Finished Epoch 5 || Run Time:  206.2 | Load Time:  105.9 || F1:  92.31 | Prec: 100.00 | Rec:  85.71 || Ex/s: 192.31

* Best F1: tensor(92.3077, device='cuda:0')
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 6


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:26:42


Finished Epoch 6 || Run Time: 1288.9 | Load Time:  314.4 || F1:  93.06 | Prec:  90.54 | Rec:  95.71 || Ex/s: 112.28

===>  EVAL Epoch 6


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:05:13


Finished Epoch 6 || Run Time:  207.6 | Load Time:  106.1 || F1:  90.00 | Prec:  94.74 | Rec:  85.71 || Ex/s: 191.27

---------------------

===>  TRAIN Epoch 7


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:26:47


Finished Epoch 7 || Run Time: 1293.0 | Load Time:  315.1 || F1:  97.18 | Prec:  95.83 | Rec:  98.57 || Ex/s: 111.95

===>  EVAL Epoch 7


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:05:12


Finished Epoch 7 || Run Time:  206.8 | Load Time:  106.1 || F1:  90.00 | Prec:  94.74 | Rec:  85.71 || Ex/s: 191.81

---------------------

===>  TRAIN Epoch 8


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:26:47


Finished Epoch 8 || Run Time: 1293.0 | Load Time:  315.1 || F1:  98.57 | Prec:  98.57 | Rec:  98.57 || Ex/s: 111.95

===>  EVAL Epoch 8


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:05:13


Finished Epoch 8 || Run Time:  207.8 | Load Time:  106.2 || F1:  90.00 | Prec:  94.74 | Rec:  85.71 || Ex/s: 191.10

---------------------

Loading best model...
Training done.


tensor(92.3077, device='cuda:0')
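
The log shows that the checkpoint is written only when the validation F1 improves ("Saving best model..." after epochs 1, 3 and 5) and that the best model is reloaded at the end. A minimal sketch of that selection logic, using the validation F1 values printed above; it mirrors the behaviour visible in the log and is not DeepMatcher's actual implementation.

```python
# Validation F1 per epoch, copied from the EVAL lines of the training log
validation_f1 = [60.61, 60.00, 89.47, 83.72, 92.31, 90.00, 90.00, 90.00]

best_f1 = float("-inf")
best_epoch = None
saved_at = []  # epochs at which a new best checkpoint would be written

for epoch, f1 in enumerate(validation_f1, start=1):
    if f1 > best_f1:  # strict improvement only
        best_f1, best_epoch = f1, epoch
        saved_at.append(epoch)

print(best_epoch, best_f1, saved_at)
```

Training F1 keeps rising through epoch 8 while validation F1 stalls at 90.00 after epoch 5, which suggests the later epochs mostly fit the training set more closely.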

## Model evaluation

Now that we have a trained entity-matching model, we can evaluate its F1 score on the test data to estimate how well it will perform on unlabeled data.

In [13]:
# Compute F1 on test set
model.run_eval(test)

===>  EVAL Epoch 5


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:03:55


Finished Epoch 5 || Run Time:  116.7 | Load Time:  118.8 || F1:  87.80 | Prec:  85.71 | Rec:  90.00 || Ex/s: 254.81



tensor(87.8049, device='cuda:0')

## Python record linkage toolkit

We will now see how the methods from the Python record linkage toolkit compare to the DeepMatcher approach.

In [16]:
# Comparison step
compare_cl = recordlinkage.Compare()
compare_cl.string('title', 'title', method='jarowinkler', threshold=0.85, label='title')
compare_cl.string('authors', 'authors', method='jarowinkler', threshold=0.85, label='authors')
compare_cl.exact('year', 'year', label='year')
# The venue feature needs its own label; reusing 'authors' would create a duplicate column name
compare_cl.string('venue', 'venue', method='jarowinkler', threshold=0.65, label='venue')

<Compare>
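
Each `compare_cl.string(...)` call with a `threshold` produces a binary feature: 1 if the similarity of the two strings reaches the threshold, 0 otherwise. The sketch below illustrates only that thresholding idea. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity, not the Jaro-Winkler measure used above, and it lower-cases the inputs, which is an added assumption; `thresholded_similarity` is a hypothetical helper.

```python
from difflib import SequenceMatcher

def thresholded_similarity(a, b, threshold):
    """Return 1 if the similarity ratio of a and b is at least threshold, else 0."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return int(ratio >= threshold)

# Titles taken from the sample rows shown earlier
same_paper = thresholded_similarity(
    "Temporal Database System Implementations",
    "Temporal database system implementations",
    threshold=0.85,
)
different_paper = thresholded_similarity(
    "On-Line Warehouse View Maintenance",
    "Comprehension syntax",
    threshold=0.85,
)
print(same_paper, different_paper)
```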

In [19]:
# Containers for the record linkage experiments, keyed by dataset or classifier name
true_links = {}
features = {}
validation = {}
datasets = {
    'train': train_file,
    'test': test_file,
    'validation': validate_file,
}

In [20]:
# Some helper functions
def print_results(name, table):
    """Print the confusion matrix and summary metrics under a banner with the given name."""
    banner = '*' * (len(name) + 1)
    print(banner)
    print(name)
    print(banner)
    print("Confusion matrix:")
    print(table['confusion_matrix'])
    print("Accuracy: {}".format(table['accuracy']))
    print("Recall: {}".format(table['recall']))
    print("F-score: {}".format(table['f-score']))
    print('\n')
    
def performance_metrics(true_links, result, set_size):
    """Compute confusion matrix, accuracy, recall and F-score for the predicted links."""
    metrics = {}
    metrics['confusion_matrix'] = recordlinkage.confusion_matrix(true_links, result, set_size)
    # Use the size of the evaluated set rather than the global validation features
    metrics['accuracy'] = recordlinkage.accuracy(true_links, result, set_size)
    metrics['recall'] = recordlinkage.recall(true_links, result)
    metrics['f-score'] = recordlinkage.fscore(true_links, result)
    return metrics

In [21]:
# Classifiers to compare; the loop below computes the comparison features for each dataset
classifiers= {
'Hand-tuned':None,
'Logistic regression':recordlinkage.LogisticRegressionClassifier(),
'Naive Bayes': recordlinkage.NaiveBayesClassifier(),
'Support vector machine': recordlinkage.SVMClassifier(),
'K-means': recordlinkage.KMeansClassifier(),
'ECM': recordlinkage.ECMClassifier()}
for key in datasets:
    df = pd.read_csv(datasets[key])
    # Columns 0 and 1 are 'id' and 'label'; the rest split evenly into left_* and right_*
    nof_cols = (df.shape[1] - 2) // 2
    # Strip the 'left_' / 'right_' prefixes so that both tables share column names
    dfA = df.iloc[:, 2:nof_cols + 2].rename(columns=lambda c: c[len('left_'):])
    dfB = df.iloc[:, nof_cols + 2:].rename(columns=lambda c: c[len('right_'):])

    tuples = [(i,i) for i in range(len(df)) if df.iloc[i]['label'] == 1]
    true_links[key] = pd.MultiIndex.from_tuples(tuples)

    tuples_full = [(i,i) for i in range(len(df))]
    candidate_links = pd.MultiIndex.from_tuples(tuples_full)
    # Final features (used in other methods as well)
    features[key] = compare_cl.compute(candidate_links, dfA, dfB)

In [22]:
for key in classifiers:
    validation[key] = {}
    if key == 'Hand-tuned':
        # Rule-based prediction without training: flag a pair if at least one feature agrees
        result = features['validation'][features['validation'].sum(axis=1) > 0].index
    else:
        # Training
        if key == 'ECM':
            # ECM is fitted on the features only; unlike K-means it does not ignore a true-links argument
            classifiers[key].fit(features['train'])
        else:
            classifiers[key].fit(features['train'], true_links['train'])
        # Predict the match status for all validation record pairs
        result = classifiers[key].predict(features['validation'])
        
    # Validate
    validation[key] = performance_metrics(true_links['validation'], result, len(features['validation']))
    
    #Print results
    print_results(key, validation[key])


***********
Hand-tuned
***********
Confusion matrix:
[[   21     0]
 [10593 49397]]
Accuracy: 0.823482361567046
Recall: 1.0
F-score: 0.003949224259520451


********************
Logistic regression
********************
Confusion matrix:
[[   20     1]
 [    0 59990]]
Accuracy: 0.9999833363883288
Recall: 0.9523809523809523
F-score: 0.975609756097561


************
Naive Bayes
************
Confusion matrix:
[[   20     1]
 [    0 59990]]
Accuracy: 0.9999833363883288
Recall: 0.9523809523809523
F-score: 0.975609756097561


***********************
Support vector machine
***********************
Confusion matrix:
[[   20     1]
 [    0 59990]]
Accuracy: 0.9999833363883288
Recall: 0.9523809523809523
F-score: 0.975609756097561


********
K-means
********
Confusion matrix:
[[    7    14]
 [  564 59426]]
Accuracy: 0.99036843245405
Recall: 0.3333333333333333
F-score: 0.02364864864864865


****
ECM
****
Confusion matrix:
[[   20     1]
 [    2 59988]]
Accuracy: 0.9999500091649864
Recall: 0.952380952

In [23]:
# Test results for best method
f_scores = {key:validation[key]['f-score'] for key in validation}
best_model = max(f_scores, key = f_scores.get) 
if best_model == 'Hand-tuned':
    # Apply the same rule that was scored on the validation set
    result = features['test'][features['test'].sum(axis=1) > 0].index
else:
    result = classifiers[best_model].predict(features['test'])
# Use a separate name so the DeepMatcher `test` dataset is not overwritten
test_metrics = performance_metrics(true_links['test'], result, len(features['test']))
print_results("Selected model ({}) on test set".format(best_model), test_metrics)


*************************************************
Selected model (Logistic regression) on test set
*************************************************
Confusion matrix:
[[   19     1]
 [    2 59989]]
Accuracy: 0.9999500091649864
Recall: 0.95
F-score: 0.9268292682926829
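
The reported metrics can be recomputed from the confusion matrix, whose layout here is [[TP, FN], [FP, TN]]. The sketch below does this for the test-set matrix above and also derives precision, which the notebook does not print. It illustrates why accuracy is uninformative for this task: the true negatives dominate the count.

```python
# Test-set confusion matrix from the output above: [[TP, FN], [FP, TN]]
tp, fn = 19, 1
fp, tn = 2, 59989

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(round(precision, 4), round(recall, 4), round(f_score, 4), round(accuracy, 6))
```

Only 20 of the 60011 test pairs are true matches, so a classifier that predicts "no match" for every pair would still reach an accuracy above 0.999 while having a recall of 0.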


