## Entity Matching Example

Using FEBRL synthetic data

### Load Dependencies

In [10]:
import json
import pandas as pd

from pyent.datasets import generate_febrl_data, remove_nan, sample_xy
from pyent.datasets import train_test_validate_stratified_split as ttvs
from pyent.features import generate_textual_features
from pyent.train import train_txt_baseline
from pyent.config import get_config

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

%matplotlib inline
%load_ext autoreload
%autoreload 2



### Generate Synthetic Data 

In [3]:
master_df = remove_nan(generate_febrl_data(init_seed=2))

Before Droping NaN's shape of data is (86506, 23)
After Droping NaN's shape of data is (52560, 23)


In [4]:
master_df.labels.value_counts()

no_match    49151
match        3409
Name: labels, dtype: int64

### Split Data into Train, Test, and Validation Sets

In [5]:
X = master_df.loc[:, ~master_df.columns.isin(["labels"])]
y = master_df.loc[:, "labels"]

X_train, X_test, X_val, y_train, y_test, y_val = ttvs(
    features=X, targets=y, test_size=0.1, validate_size=0.2)
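
The internals of `train_test_validate_stratified_split` are not shown in this example. As a rough, standard-library-only sketch of what a stratified three-way split does, assuming `test_size` and `validate_size` are fractions of the full data set, the hypothetical helper `stratified_three_way_split` below splits each label group separately so that every subset keeps roughly the same match rate:

```python
import random
from collections import defaultdict


def stratified_three_way_split(labels, test_size=0.1, validate_size=0.2, seed=0):
    """Return (train_idx, test_idx, val_idx) index lists, stratified by label.

    Hypothetical helper for illustration; not the pyent implementation.
    """
    rng = random.Random(seed)

    # Group row positions by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx, val_idx = [], [], []
    for indices in by_label.values():
        # Shuffle within each label, then carve off test and validation slices.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        n_val = round(len(indices) * validate_size)
        test_idx.extend(indices[:n_test])
        val_idx.extend(indices[n_test:n_test + n_val])
        train_idx.extend(indices[n_test + n_val:])
    return train_idx, test_idx, val_idx


# Toy labels with an imbalance similar in spirit to the FEBRL sample above.
labels = ["match"] * 60 + ["no_match"] * 940
train_idx, test_idx, val_idx = stratified_three_way_split(labels)
print(len(train_idx), len(test_idx), len(val_idx))
```

Splitting per label matters here because matches are a small minority; a purely random split could leave the test or validation set with very few positive pairs.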


### Get Parameters for Baseline Model Configuration


In [16]:
baseline_model_params = get_config()["model_params"]

print(f"Model Training will take in the following external parameters\n{json.dumps(baseline_model_params, indent=4)}")

Model Training will take in the following external parameters
{
    "model_name": "bert-base-uncased",
    "num_epochs": 1,
    "train_batch_size": 64,
    "test_batch_size": 32,
    "margin": 0.5
}


### Generate Textual Features

In [14]:
X_train_txt = generate_textual_features(X_train)
X_test_txt = generate_textual_features(X_test)
X_val_txt = generate_textual_features(X_val)

print(
    f"Train feature set shape: {X_train_txt.shape} and Train target shape {len(y_train)}\n"
    f"Test feature set shape: {X_test_txt.shape} and Test target shape {len(y_test)}\n"
    f"Validation feature set shape: {X_val_txt.shape} and Validation target shape {len(y_val)}"
)

Train feature set shape: (36792, 2) and Train target shape 36792
Test feature set shape: (5256, 2) and Test target shape 5256
Validation feature set shape: (10512, 2) and Validation target shape 10512
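
The two columns in each feature set correspond to one text string per side of a candidate pair. How `generate_textual_features` serializes a record is not shown here; a minimal sketch, assuming each row stores left and right fields under hypothetical `<field>_l` and `<field>_r` column names, might simply join the field values into one sentence per side:

```python
def pair_to_sentences(row, fields):
    """Serialize the left and right record of one candidate pair as text.

    Assumes columns are named '<field>_l' and '<field>_r' (hypothetical naming).
    """
    sentence_l = " ".join(str(row[f"{field}_l"]) for field in fields)
    sentence_r = " ".join(str(row[f"{field}_r"]) for field in fields)
    return {"sentence_l": sentence_l, "sentence_r": sentence_r}


# Illustrative pair; the values are made up, not taken from the data set.
row = {
    "given_name_l": "sarah", "surname_l": "bruhn", "suburb_l": "woodville",
    "given_name_r": "sara", "surname_r": "bruhn", "suburb_r": "woodvile",
}
sentences = pair_to_sentences(row, ["given_name", "surname", "suburb"])
print(sentences)
```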


### Develop a Transformer-Based Siamese Neural Network as the Baseline Model

To start, this model uses only the `sentence_l` and `sentence_r` textual features generated above.

<!-- 
![example_siamese](../docs/example_siamese.png)
<h6>Image Obtained from Quora Blog Post: https://quoraengineering.quora.com/</h6>  
 -->
  
1. DistilRoBERTa base model from Hugging Face (note that the configuration above and the training log below load `bert-base-uncased`).
2. For negative pairs (i.e. pairs whose target label is `no_match`), the margin is 0.5.
3. The distance metric is cosine distance (`1 - cosine_similarity`).
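
The settings listed above describe a contrastive objective. Which loss class `train_txt_baseline` actually uses is not shown, so the following is only a sketch in plain Python, assuming the standard pairwise contrastive formulation on cosine distance; the function names and toy vectors are illustrative:

```python
import math


def cosine_distance(u, v):
    """Cosine distance: 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def contrastive_loss(u, v, is_match, margin=0.5):
    """Pairwise contrastive loss on cosine distance.

    Matches are pulled together (penalty d^2); non-matches are pushed apart
    until their distance reaches the margin (penalty max(0, margin - d)^2).
    """
    d = cosine_distance(u, v)
    if is_match:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


# Toy 2-d "embeddings" (made up for illustration).
same_direction = ([1.0, 0.0], [2.0, 0.0])   # cosine distance 0
orthogonal = ([1.0, 0.0], [0.0, 1.0])       # cosine distance 1

print(contrastive_loss(*same_direction, is_match=True))    # 0.0
print(contrastive_loss(*same_direction, is_match=False))   # 0.125
print(contrastive_loss(*orthogonal, is_match=False))       # 0.0
```

With a margin of 0.5, a non-matching pair contributes no loss once its cosine distance is at least 0.5, so training effort concentrates on non-matches that still look too similar.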


In [15]:
X_train_txt_sample, y_train_sample = sample_xy(X=X_train_txt, y=y_train, num=64)
X_test_txt_sample, y_test_sample = sample_xy(X=X_test_txt, y=y_test, num=32)
X_val_txt_sample, y_val_sample = sample_xy(X=X_val_txt, y=y_val, num=32)

train_txt_baseline(
    X_train_txt_sample, y_train_sample,
    X_test_txt_sample, y_test_sample,
    X_val_txt_sample, y_val_sample,
    **baseline_model_params,
)

2023-02-24 02:02:05 - Load pretrained SentenceTransformer: bert-base-uncased
2023-02-24 02:02:07 - No sentence-transformers model found with name /Users/mustafawaheed/.cache/torch/sentence_transformers/bert-base-uncased. Creating a new one with MEAN pooling.


Some weights of the model checkpoint at /Users/mustafawaheed/.cache/torch/sentence_transformers/bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


2023-02-24 02:02:09 - Use pytorch device: cpu
2023-02-24 02:02:09 - Evaluate model without training
2023-02-24 02:02:09 - Binary Accuracy Evaluation of the model on  dataset in epoch 0 after 0 steps:


Batches:   0%|          | 0/4 [00:00<?, ?it/s]

2023-02-24 02:02:14 - Accuracy with Cosine-Similarity:           91.67	(Threshold: 0.8566)
2023-02-24 02:02:14 - F1 with Cosine-Similarity:                 93.33	(Threshold: 0.8566)
2023-02-24 02:02:14 - Precision with Cosine-Similarity:          100.00
2023-02-24 02:02:14 - Recall with Cosine-Similarity:             87.50
2023-02-24 02:02:14 - Average Precision with Cosine-Similarity:  98.01

2023-02-24 02:02:14 - Accuracy with Manhattan-Distance:           91.67	(Threshold: 100.2859)
2023-02-24 02:02:14 - F1 with Manhattan-Distance:                 93.55	(Threshold: 105.4114)
2023-02-24 02:02:14 - Precision with Manhattan-Distance:          96.67
2023-02-24 02:02:14 - Recall with Manhattan-Distance:             90.62
2023-02-24 02:02:14 - Average Precision with Manhattan-Distance:  98.50

2023-02-24 02:02:14 - Accuracy with Euclidean-Distance:           91.67	(Threshold: 4.5346)
2023-02-24 02:02:14 - F1 with Euclidean-Distance:                 93.55	(Threshold: 4.7832)
2023-02-24 02:

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/3 [00:00<?, ?it/s]

2023-02-24 02:03:06 - Binary Accuracy Evaluation of the model on  dataset after epoch 0:


Batches:   0%|          | 0/4 [00:00<?, ?it/s]

2023-02-24 02:03:12 - Accuracy with Cosine-Similarity:           96.88	(Threshold: 0.7931)
2023-02-24 02:03:12 - F1 with Cosine-Similarity:                 97.71	(Threshold: 0.7782)
2023-02-24 02:03:12 - Precision with Cosine-Similarity:          95.52
2023-02-24 02:03:12 - Recall with Cosine-Similarity:             100.00
2023-02-24 02:03:12 - Average Precision with Cosine-Similarity:  99.76

2023-02-24 02:03:12 - Accuracy with Manhattan-Distance:           96.88	(Threshold: 111.6846)
2023-02-24 02:03:12 - F1 with Manhattan-Distance:                 97.64	(Threshold: 111.6846)
2023-02-24 02:03:12 - Precision with Manhattan-Distance:          98.41
2023-02-24 02:03:12 - Recall with Manhattan-Distance:             96.88
2023-02-24 02:03:12 - Average Precision with Manhattan-Distance:  99.72

2023-02-24 02:03:12 - Accuracy with Euclidean-Distance:           96.88	(Threshold: 5.1010)
2023-02-24 02:03:12 - F1 with Euclidean-Distance:                 97.64	(Threshold: 5.1010)
2023-02-24 02:

Batches:   0%|          | 0/4 [00:00<?, ?it/s]

2023-02-24 02:03:19 - Accuracy with Cosine-Similarity:           98.44	(Threshold: 0.8228)
2023-02-24 02:03:19 - F1 with Cosine-Similarity:                 98.41	(Threshold: 0.8228)
2023-02-24 02:03:19 - Precision with Cosine-Similarity:          100.00
2023-02-24 02:03:19 - Recall with Cosine-Similarity:             96.88
2023-02-24 02:03:19 - Average Precision with Cosine-Similarity:  99.65

2023-02-24 02:03:19 - Accuracy with Manhattan-Distance:           96.88	(Threshold: 99.7710)
2023-02-24 02:03:19 - F1 with Manhattan-Distance:                 96.97	(Threshold: 114.4995)
2023-02-24 02:03:19 - Precision with Manhattan-Distance:          94.12
2023-02-24 02:03:19 - Recall with Manhattan-Distance:             100.00
2023-02-24 02:03:19 - Average Precision with Manhattan-Distance:  99.72

2023-02-24 02:03:19 - Accuracy with Euclidean-Distance:           96.88	(Threshold: 4.4660)
2023-02-24 02:03:19 - F1 with Euclidean-Distance:                 96.97	(Threshold: 5.2224)
2023-02-24 02:
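
Each accuracy and F1 value in the logs above is reported together with a decision threshold on the similarity or distance score. A simplified sketch of how such a threshold can be chosen for similarity scores (higher means more likely a match) is a sweep over candidate cut-offs; the helper name and the scores below are made up, and the evaluator in sentence-transformers may differ in detail:

```python
def best_accuracy_threshold(scores, labels):
    """Return (best_accuracy, threshold) for 'predict match if score >= threshold'.

    Candidate thresholds are midpoints between consecutive distinct scores.
    Returns (-1.0, None) if fewer than two distinct scores are given.
    """
    ranked = sorted(set(scores))
    candidates = [(lo + hi) / 2 for lo, hi in zip(ranked, ranked[1:])]
    best_acc, best_thr = -1.0, None
    for thr in candidates:
        # Count pairs where the thresholded prediction agrees with the label.
        correct = sum((score >= thr) == bool(label)
                      for score, label in zip(scores, labels))
        acc = correct / len(labels)
        if acc > best_acc:
            best_acc, best_thr = acc, thr
    return best_acc, best_thr


# Made-up cosine similarities; 1 = match, 0 = no match.
scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.40]
labels = [1, 1, 0, 1, 0, 0]
acc, thr = best_accuracy_threshold(scores, labels)
print(f"accuracy={acc:.4f} threshold={thr:.4f}")
```

Because the threshold is tuned on the same pairs it is scored on, and the samples here contain only 32 to 64 pairs, the reported figures are best read as a smoke test of the pipeline rather than an estimate of generalization.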

----

## Acknowledgements

```bibtex 
@inproceedings{reimers-2019-sentence-bert,
    title     = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author    = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month     = "11",
    year      = "2019",
    publisher = "Association for Computational Linguistics",
    url       = "https://arxiv.org/abs/1908.10084",
}
```
  
```bibtex  
@software{de_bruin_j_2019_3559043,
  author       = "De Bruin, J",
  title        = "Python Record Linkage Toolkit: A toolkit for record linkage and duplicate detection in Python",
  month        = "12",
  year         = "2019",
  publisher    = "Zenodo",
  version      = "v0.14",
  doi          = "10.5281/zenodo.3559043",
  url          = "https://doi.org/10.5281/zenodo.3559043"
}
```


