## Entity Matching Example

This example walks through entity matching on FEBRL synthetic data.

### Load Dependencies

In [1]:
import json
import pandas as pd

from pyent.datasets import generate_febrl_data, remove_nan, sample_xy
from pyent.datasets import train_test_validate_stratified_split as ttvs
from pyent.features import generate_textual_features
from pyent.train import train_txt_baseline
from pyent.config import get_config

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

%matplotlib inline
%load_ext autoreload
%autoreload 2

### Generate Synthetic Data 

In [2]:
master_df = remove_nan(generate_febrl_data(init_seed=2))
master_df.head(3)

Before Droping NaN's shape of data is (86506, 23)
After Droping NaN's shape of data is (52560, 23)


| | rec_idL | rec_idR | given_name_l | surname_l | street_number_l | address_1_l | address_2_l | suburb_l | postcode_l | state_l | date_of_birth_l | soc_sec_id_l | given_name_r | surname_r | street_number_r | address_1_r | address_2_r | suburb_r | postcode_r | state_r | date_of_birth_r | soc_sec_id_r | labels |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | rec-2331-org | rec-4869-dup-0 | christian | reid | 10 | britten-jones drive | honey patch | moe | 2250 | wa | 19870501 | 2773283 | taylah | reid | 25 | albermarlze place | cypress garden | pennant hills | 2210 | nsw | 19571029.0 | 7596151 | no_match |
| 1 | rec-1120-org | rec-2288-dup-0 | angelica | green | 5 | nash place | palm grove | vaucluse | 5242 | nsw | 19051230 | 7491589 | jhoel | green | 708 | wangarastreet | gallagher house | burpengary | 4670 | qld | | 9018656 | no_match |
| 2 | rec-3774-org | rec-1136-dup-0 | noah | clarke | 608 | bindel street | anstee court | boggabri | 5158 | nsw | 19540920 | 8260965 | lachlan | clarke | 19 | bunbury street | kildurham | nhill | 3850 | nsw | 19180501.0 | 2371774 | no_match |


In [3]:
master_df.labels.value_counts()

no_match    49151
match        3409
Name: labels, dtype: int64
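
The counts above show a strong class imbalance. As a quick, hedged illustration, the snippet below recomputes the class shares from those two numbers; the variable names are only for this sketch and are not part of `pyent`.

```python
from collections import Counter

# Label counts copied from the `value_counts()` output above.
label_counts = Counter({"no_match": 49151, "match": 3409})

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}
imbalance_ratio = label_counts["no_match"] / label_counts["match"]

for label, share in shares.items():
    print(f"{label}: {share:.2%}")
print(f"no_match rows per match row: {imbalance_ratio:.1f}")
```

Matches make up roughly 6.5% of the pairs, which is one reason the split in the next step is stratified.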

### Split Data into Development and Test Sets

In [4]:
X = master_df.loc[:, ~master_df.columns.isin(["labels"])]
y = master_df.loc[:, "labels"]

X_train, X_test, X_val, y_train, y_test, y_val = ttvs(
    features=X, targets=y, test_size=0.1, validate_size=0.2)

print(f"Train Split shape: ( {X_train.shape} , {y_train.shape} )")
print(f"Test Split shape: ( {X_test.shape} , {y_test.shape} )")
print(f"Validate Split shape: ( {X_val.shape} , {y_val.shape} )")

Train Split shape: ( (36792, 22) , (36792,) )
Test Split shape: ( (5256, 22) , (5256,) )
Validate Split shape: ( (10512, 22) , (10512,) )
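
The printed shapes are consistent with `test_size` and `validate_size` being fractions of the full dataset (10% and 20%, leaving 70% for training). The sketch below shows one way such a stratified three-way split could work in plain Python. It is an assumption about the behaviour, not the actual implementation of `train_test_validate_stratified_split`, and `stratified_three_way_split` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_three_way_split(labels, test_size, validate_size, seed=0):
    """Return (train, test, validate) index lists, stratified by label.

    Hypothetical helper: `test_size` and `validate_size` are treated as
    fractions of the whole dataset.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, test, validate = [], [], []
    for indices in by_label.values():
        # Shuffle within each class, then carve off test and validate slices.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        n_val = round(len(indices) * validate_size)
        test.extend(indices[:n_test])
        validate.extend(indices[n_test:n_test + n_val])
        train.extend(indices[n_test + n_val:])
    return train, test, validate


# Toy labels (illustrative only, not the FEBRL data).
toy_labels = ["match"] * 100 + ["no_match"] * 900
train_idx, test_idx, val_idx = stratified_three_way_split(
    toy_labels, test_size=0.1, validate_size=0.2)
print(len(train_idx), len(test_idx), len(val_idx))
```

Because each class is split separately, every subset keeps the same match to no_match ratio as the full dataset.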


### Get Parameters for Baseline Model Configuration


In [6]:
baseline_model_params = get_config()["model_params"]

print(f"Model Training will take in the following external parameters\n{json.dumps(baseline_model_params, indent=4)}")

Model Training will take in the following external parameters
{
    "model_name": "bert-base-uncased",
    "num_epochs": 1,
    "train_batch_size": 64,
    "test_batch_size": 32,
    "margin": 0.5
}
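
These parameters are later unpacked into the training call with `**baseline_model_params`. A minimal sketch of that pattern follows; the JSON layout, `load_model_params`, and `describe_run` are hypothetical stand-ins, since the source does not show how `get_config` stores or reads its settings.

```python
import json

# Same keys as the parameters printed above; the JSON layout is an assumption.
CONFIG_TEXT = """
{
    "model_params": {
        "model_name": "bert-base-uncased",
        "num_epochs": 1,
        "train_batch_size": 64,
        "test_batch_size": 32,
        "margin": 0.5
    }
}
"""


def load_model_params(config_text):
    """Hypothetical stand-in for `get_config()["model_params"]`."""
    return json.loads(config_text)["model_params"]


def describe_run(model_name, num_epochs, train_batch_size, test_batch_size, margin):
    """Hypothetical consumer showing how `**params` maps keys to arguments."""
    return (f"{model_name}: {num_epochs} epoch(s), "
            f"batches {train_batch_size}/{test_batch_size}, margin {margin}")


params = load_model_params(CONFIG_TEXT)
print(describe_run(**params))
```

Keeping the keys identical to the function's keyword arguments means a typo in the configuration fails loudly with a `TypeError` instead of being silently ignored.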


### Generate Textual Features

In [7]:
X_train_txt = generate_textual_features(X_train)
X_test_txt = generate_textual_features(X_test)
X_val_txt = generate_textual_features(X_val)

print(f"Train feature set shape: {X_train_txt.shape} and Train target shape {len(y_train)}\nTest feature set shape: {X_test_txt.shape} and Test target shape {len(y_test)}\nValidation feature set shape: {X_val_txt.shape} and Validation target shape {len(y_val)}")

Train feature set shape: (36792, 2) and Train target shape 36792
Test feature set shape: (5256, 2) and Test target shape 5256
Validation feature set shape: (10512, 2) and Validation target shape 10512
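
Each feature set now has two columns, one text value per side of the record pair. The sketch below shows one plausible way to build such a pair of strings from the `_l` and `_r` fields. The concatenation order and separator are assumptions, and `record_pair_to_sentences` is a hypothetical helper rather than the body of `generate_textual_features`.

```python
def record_pair_to_sentences(row):
    """Join the `_l` and `_r` fields of one record pair into two strings.

    Assumption: fields are concatenated in column order, separated by spaces.
    """
    left = [str(value) for key, value in row.items() if key.endswith("_l")]
    right = [str(value) for key, value in row.items() if key.endswith("_r")]
    return {"sentence_l": " ".join(left), "sentence_r": " ".join(right)}


# A few fields taken from the first row of the sample shown earlier.
row = {
    "given_name_l": "christian", "surname_l": "reid", "suburb_l": "moe",
    "given_name_r": "taylah", "surname_r": "reid", "suburb_r": "pennant hills",
}
sentences = record_pair_to_sentences(row)
print(sentences)
```

Flattening each record into a single string lets a text encoder treat the whole record as one input, at the cost of losing explicit field boundaries.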


### Develop Transformer-based Siamese Neural Network Model as Baseline Model

To start, this model uses only the `sentence_l` and `sentence_r` _"textual"_ features generated above.

<!-- 
![example_siamese](../docs/example_siamese.png)
<h6>Image Obtained from Quora Blog Post: https://quoraengineering.quora.com/</h6>  
 -->
  
1. DistilRoBERTa base model from Hugging Face (note that the configuration shown above, and the run logged below, load `bert-base-uncased`)
2. For negative pairs (i.e. target variables with the negative class label), the margin is 0.5
3. The distance metric is cosine distance (1 - cosine_similarity)
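
To make the margin and distance settings concrete, here is a minimal sketch of one common form of contrastive loss using cosine distance. It assumes the baseline uses a loss of this shape; the source does not show the loss code, and the function names and toy vectors are illustrative only.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def contrastive_loss(u, v, is_match, margin=0.5):
    """Contrastive loss for one pair of embeddings.

    Matching pairs are penalised by their squared distance; non-matching
    pairs are penalised only while they are closer than `margin`.
    """
    d = cosine_distance(u, v)
    if is_match:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


# Toy 2-d "embeddings" (illustrative values, not model output).
same_direction = contrastive_loss([1.0, 0.0], [2.0, 0.0], is_match=True)
orthogonal_negative = contrastive_loss([1.0, 0.0], [0.0, 1.0], is_match=False)
close_negative = contrastive_loss([1.0, 0.0], [1.0, 0.0], is_match=False)
print(same_direction, orthogonal_negative, close_negative)
```

With a margin of 0.5, a negative pair stops contributing to the loss once its cosine distance reaches 0.5, so training effort goes to negatives that still look too similar.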


In [8]:
X_train_txt_sample, y_train_sample = sample_xy(X=X_train_txt, y=y_train, num=64)
X_test_txt_sample, y_test_sample = sample_xy(X=X_test_txt, y=y_test, num=32)
X_val_txt_sample, y_val_sample = sample_xy(X=X_val_txt, y=y_val, num=32)

train_txt_baseline(
    X_train_txt_sample, y_train_sample,
    X_test_txt_sample, y_test_sample,
    X_val_txt_sample, y_val_sample,
    **baseline_model_params,
)

2023-10-21 01:52:02 - Load pretrained SentenceTransformer: bert-base-uncased


2023-10-21 01:52:33 - No sentence-transformers model found with name /Users/mustafawaheed/.cache/torch/sentence_transformers/bert-base-uncased. Creating a new one with MEAN pooling.


Some weights of the model checkpoint at /Users/mustafawaheed/.cache/torch/sentence_transformers/bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


2023-10-21 01:52:35 - Use pytorch device: cpu
2023-10-21 01:52:35 - Evaluate model without training
2023-10-21 01:52:35 - Binary Accuracy Evaluation of the model on  dataset in epoch 0 after 0 steps:


2023-10-21 01:52:40 - Accuracy with Cosine-Similarity:           91.67	(Threshold: 0.8556)
2023-10-21 01:52:40 - F1 with Cosine-Similarity:                 93.55	(Threshold: 0.8339)
2023-10-21 01:52:40 - Precision with Cosine-Similarity:          96.67
2023-10-21 01:52:40 - Recall with Cosine-Similarity:             90.62
2023-10-21 01:52:40 - Average Precision with Cosine-Similarity:  97.61

2023-10-21 01:52:40 - Accuracy with Manhattan-Distance:           93.75	(Threshold: 101.3777)
2023-10-21 01:52:40 - F1 with Manhattan-Distance:                 95.08	(Threshold: 101.3777)
2023-10-21 01:52:40 - Precision with Manhattan-Distance:          100.00
2023-10-21 01:52:40 - Recall with Manhattan-Distance:             90.62
2023-10-21 01:52:40 - Average Precision with Manhattan-Distance:  97.85

2023-10-21 01:52:40 - Accuracy with Euclidean-Distance:           93.75	(Threshold: 4.5505)
2023-10-21 01:52:40 - F1 with Euclidean-Distance:                 95.08	(Threshold: 4.5505)
2023-10-21 01:

2023-10-21 01:53:34 - Binary Accuracy Evaluation of the model on  dataset after epoch 0:


2023-10-21 01:53:39 - Accuracy with Cosine-Similarity:           93.75	(Threshold: 0.8121)
2023-10-21 01:53:39 - F1 with Cosine-Similarity:                 95.08	(Threshold: 0.8121)
2023-10-21 01:53:39 - Precision with Cosine-Similarity:          100.00
2023-10-21 01:53:39 - Recall with Cosine-Similarity:             90.62
2023-10-21 01:53:39 - Average Precision with Cosine-Similarity:  98.52

2023-10-21 01:53:39 - Accuracy with Manhattan-Distance:           93.75	(Threshold: 104.2180)
2023-10-21 01:53:39 - F1 with Manhattan-Distance:                 95.24	(Threshold: 114.3953)
2023-10-21 01:53:39 - Precision with Manhattan-Distance:          96.77
2023-10-21 01:53:39 - Recall with Manhattan-Distance:             93.75
2023-10-21 01:53:39 - Average Precision with Manhattan-Distance:  98.67

2023-10-21 01:53:39 - Accuracy with Euclidean-Distance:           93.75	(Threshold: 4.7143)
2023-10-21 01:53:39 - F1 with Euclidean-Distance:                 95.08	(Threshold: 4.7143)
2023-10-21 01:

2023-10-21 01:53:46 - Accuracy with Cosine-Similarity:           96.88	(Threshold: 0.8064)
2023-10-21 01:53:46 - F1 with Cosine-Similarity:                 96.77	(Threshold: 0.8064)
2023-10-21 01:53:46 - Precision with Cosine-Similarity:          100.00
2023-10-21 01:53:46 - Recall with Cosine-Similarity:             93.75
2023-10-21 01:53:46 - Average Precision with Cosine-Similarity:  99.39

2023-10-21 01:53:46 - Accuracy with Manhattan-Distance:           96.88	(Threshold: 104.5131)
2023-10-21 01:53:46 - F1 with Manhattan-Distance:                 96.77	(Threshold: 104.5131)
2023-10-21 01:53:46 - Precision with Manhattan-Distance:          100.00
2023-10-21 01:53:46 - Recall with Manhattan-Distance:             93.75
2023-10-21 01:53:46 - Average Precision with Manhattan-Distance:  99.19

2023-10-21 01:53:46 - Accuracy with Euclidean-Distance:           96.88	(Threshold: 4.7902)
2023-10-21 01:53:46 - F1 with Euclidean-Distance:                 96.77	(Threshold: 4.7902)
2023-10-21 01
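
Each logged accuracy and F1 value is paired with a threshold on the similarity or distance score. The sketch below illustrates the idea for cosine similarity: choose the cut-off that maximises a metric, then report precision and recall at that cut-off. It is a simplified illustration under that assumption, not the evaluator's actual code, and the scores and labels are toy values.

```python
def binary_metrics(scores, labels, threshold):
    """Accuracy, precision, recall and F1 when `score >= threshold` means match."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    tn = len(labels) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(labels), "precision": precision,
            "recall": recall, "f1": f1}


def best_threshold(scores, labels, metric="accuracy"):
    """Try every observed score as a threshold and keep the best one."""
    return max(sorted(set(scores)),
               key=lambda t: binary_metrics(scores, labels, t)[metric])


# Toy cosine similarities and labels (1 = match); illustrative values only.
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60]
labels = [1, 1, 0, 1, 0, 0]
t = best_threshold(scores, labels, metric="f1")
metrics = binary_metrics(scores, labels, t)
print(t, metrics)
```

As a consistency check on the log itself, the first cosine-similarity precision (96.67) and recall (90.62) give 2PR / (P + R) ≈ 93.55, which agrees with the logged F1. Note that these evaluations were run on small samples, so a single pair changing class can move the percentages by several points.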

----

## Acknowledgements

```bibtex 
@inproceedings{reimers-2019-sentence-bert,
    title     = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author    = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month     = "11",
    year      = "2019",
    publisher = "Association for Computational Linguistics",
    url       = "https://arxiv.org/abs/1908.10084",
}
```
  
```bibtex  
@software{de_bruin_j_2019_3559043,
  author       = "De Bruin, J",
  title        = "Python Record Linkage Toolkit: A toolkit for record linkage and duplicate detection in Python",
  month        = "12",
  year         = "2019",
  publisher    = "Zenodo",
  version      = "v0.14",
  doi          = "10.5281/zenodo.3559043",
  url          = "https://doi.org/10.5281/zenodo.3559043"
}
```


