## Entity Matching Example

This example walks through entity matching on synthetic data generated with FEBRL.

### Load Dependencies

In [13]:
import pandas as pd

from pyent.datasets import generate_febrl_data, remove_nan, sample_xy
from pyent.datasets import train_test_validate_stratified_split as ttvs
from pyent.features import generate_textual_features
from pyent.train import train_txt_baseline

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

%matplotlib inline
%load_ext autoreload
%autoreload 2


### Generate Synthetic Data 

In [14]:
master_df = remove_nan(generate_febrl_data(init_seed=2))

Before Droping NaN's shape of data is (86506, 23)
After Droping NaN's shape of data is (52560, 23)


In [15]:
master_df.labels.value_counts()

no_match    49151
match        3409
Name: labels, dtype: int64
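
The label counts above show a strong class imbalance. As a quick sanity check, the following sketch recomputes the share of `match` pairs from those two counts; the variable names are illustrative and not part of `pyent`.

```python
# Label counts copied from the value_counts() output above.
label_counts = {"no_match": 49151, "match": 3409}

total = sum(label_counts.values())
match_share = label_counts["match"] / total
imbalance_ratio = label_counts["no_match"] / label_counts["match"]

print(f"total pairs: {total}")
print(f"match share: {match_share:.2%}")
print(f"no_match pairs per match pair: {imbalance_ratio:.1f}")
```

Because only a small fraction of pairs are matches, a stratified split and metrics beyond plain accuracy (precision, recall, F1) are the safer choices for this data.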

### Split Data into Train, Test, and Validation Sets

In [16]:
X = master_df.loc[:, ~master_df.columns.isin(["labels"])]
y = master_df.loc[:, "labels"]

X_train, X_test, X_val, y_train, y_test, y_val = ttvs(
    features=X, targets=y, test_size=0.1, validate_size=0.2)
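
The internals of `train_test_validate_stratified_split` are not shown in this notebook. As a rough illustration of what a stratified three-way split does, here is a minimal standard-library sketch; the function `stratified_three_way_split` and its toy labels are hypothetical and only mirror the `test_size=0.1` and `validate_size=0.2` arguments used above.

```python
import random
from collections import defaultdict


def stratified_three_way_split(labels, test_size=0.1, validate_size=0.2, seed=0):
    """Return (train_idx, test_idx, val_idx), keeping label proportions in each part."""
    rng = random.Random(seed)

    # Group row positions by label so each class is split separately.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx, val_idx = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        n_val = round(len(indices) * validate_size)
        test_idx.extend(indices[:n_test])
        val_idx.extend(indices[n_test:n_test + n_val])
        train_idx.extend(indices[n_test + n_val:])
    return train_idx, test_idx, val_idx


# Toy labels (not the notebook's data) with a similar kind of imbalance.
labels = ["match"] * 100 + ["no_match"] * 900
train_idx, test_idx, val_idx = stratified_three_way_split(labels)
print(len(train_idx), len(test_idx), len(val_idx))
```

Splitting each class separately keeps the rare `match` class represented in all three parts, which a purely random split does not guarantee.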


### Generate Textual Features

In [17]:
X_train_txt = generate_textual_features(X_train)
X_test_txt = generate_textual_features(X_test)
X_val_txt = generate_textual_features(X_val)

print(f"Train feature set shape: {X_train_txt.shape} and Train target shape {len(y_train)}\nTest feature set shape: {X_test_txt.shape} and Test target shape {len(y_test)}\nValidation feature set shape: {X_val_txt.shape} and Validation target shape {len(y_val)}")

Train feature set shape: (36792, 2) and Train target shape 36792
Test feature set shape: (5256, 2) and Test target shape 5256
Validation feature set shape: (10512, 2) and Validation target shape 10512
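
The notebook does not show how `generate_textual_features` builds its two output columns. One plausible approach, sketched below purely as an assumption, is to join the fields of each record in a pair into a single string. The helper names, the `_l`/`_r` column suffixes, the field names, and the toy record are all hypothetical.

```python
def record_to_sentence(record, fields):
    """Join the non-empty values of `fields` into one lowercase string."""
    values = (record.get(field) for field in fields)
    return " ".join(str(v).strip().lower() for v in values if v not in (None, ""))


def pair_to_sentences(pair, fields):
    """Build (sentence_l, sentence_r) from a pair whose columns end in _l / _r."""
    left = {field: pair.get(f"{field}_l") for field in fields}
    right = {field: pair.get(f"{field}_r") for field in fields}
    return record_to_sentence(left, fields), record_to_sentence(right, fields)


# Hypothetical candidate pair; field names and values are illustrative only.
fields = ["given_name", "surname", "suburb"]
pair = {
    "given_name_l": "Michaela", "surname_l": "Neumann", "suburb_l": "Winston Hills",
    "given_name_r": "Michaela", "surname_r": "Neuman", "suburb_r": "",
}
sentence_l, sentence_r = pair_to_sentences(pair, fields)
print(sentence_l)
print(sentence_r)
```

Serializing each record this way lets a sentence encoder compare the two sides of a pair without hand-built per-field similarity features.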


### Develop a Transformer-based Siamese Neural Network as the Baseline Model

As a starting point, this model uses only the `sentence_l` and `sentence_r` _"textual"_ features generated above.

<!-- 
![example_siamese](../docs/example_siamese.png)
<h6>Image Obtained from Quora Blog Post: https://quoraengineering.quora.com/</h6>  
 -->
  
1. The base encoder is a pretrained Hugging Face model (the run logged below loads `bert-base-uncased`).
2. For negative pairs (i.e., pairs whose target label is the negative class, `no_match`), the contrastive-loss margin is 0.5.
3. The distance metric is cosine distance (`1 - cosine_similarity`).
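
To make the margin and distance settings above concrete, here is a minimal pure-Python sketch of the classic contrastive loss for a single pair of embeddings. It is an illustration under assumptions, not the training code: the commented-out cell at the end of this notebook uses `losses.OnlineContrastiveLoss`, which additionally restricts the loss to hard pairs, and the toy 2-d vectors stand in for real sentence embeddings.

```python
import math


def cosine_distance(u, v):
    """Cosine distance, i.e. 1 - cosine_similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def contrastive_loss(u, v, label, margin=0.5):
    """Classic contrastive loss for one pair (label 1 = match, 0 = no_match)."""
    d = cosine_distance(u, v)
    if label == 1:
        # Matching pairs are pulled together: any distance is penalized.
        return 0.5 * d ** 2
    # Non-matching pairs are only penalized while closer than the margin.
    return 0.5 * max(0.0, margin - d) ** 2


same = ([1.0, 0.0], [1.0, 0.0])        # distance 0
orthogonal = ([1.0, 0.0], [0.0, 1.0])  # distance 1

print(contrastive_loss(*same, label=1))        # match, already identical
print(contrastive_loss(*same, label=0))        # no_match but identical
print(contrastive_loss(*orthogonal, label=0))  # no_match, beyond the margin
```

With a margin of 0.5, a negative pair stops contributing to the loss once its cosine distance reaches 0.5, so training effort goes to negatives that still look too similar.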


In [19]:
X_train_txt_sample, y_train_sample = sample_xy(X=X_train_txt, y=y_train, num=64)
X_test_txt_sample, y_test_sample = sample_xy(X=X_test_txt, y=y_test, num=32)
X_val_txt_sample, y_val_sample = sample_xy(X=X_val_txt, y=y_val, num=32)

train_txt_baseline(
    X_train_txt_sample, y_train_sample,
    X_test_txt_sample, y_test_sample,
    X_val_txt_sample, y_val_sample,
)

2023-02-22 20:36:10 - Load pretrained SentenceTransformer: bert-base-uncased
2023-02-22 20:36:11 - No sentence-transformers model found with name /Users/mustafawaheed/.cache/torch/sentence_transformers/bert-base-uncased. Creating a new one with MEAN pooling.


Some weights of the model checkpoint at /Users/mustafawaheed/.cache/torch/sentence_transformers/bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


2023-02-22 20:36:12 - Use pytorch device: cpu
2023-02-22 20:36:12 - Evaluate model without training
2023-02-22 20:36:12 - Binary Accuracy Evaluation of the model on  dataset in epoch 0 after 0 steps:


Batches:   0%|          | 0/4 [00:00<?, ?it/s]

2023-02-22 20:36:17 - Accuracy with Cosine-Similarity:           96.88	(Threshold: 0.8663)
2023-02-22 20:36:17 - F1 with Cosine-Similarity:                 96.77	(Threshold: 0.8663)
2023-02-22 20:36:17 - Precision with Cosine-Similarity:          100.00
2023-02-22 20:36:17 - Recall with Cosine-Similarity:             93.75
2023-02-22 20:36:17 - Average Precision with Cosine-Similarity:  99.54

2023-02-22 20:36:17 - Accuracy with Manhattan-Distance:           98.44	(Threshold: 97.6487)
2023-02-22 20:36:17 - F1 with Manhattan-Distance:                 98.46	(Threshold: 100.9428)
2023-02-22 20:36:17 - Precision with Manhattan-Distance:          96.97
2023-02-22 20:36:17 - Recall with Manhattan-Distance:             100.00
2023-02-22 20:36:17 - Average Precision with Manhattan-Distance:  99.91

2023-02-22 20:36:17 - Accuracy with Euclidean-Distance:           98.44	(Threshold: 4.5921)
2023-02-22 20:36:17 - F1 with Euclidean-Distance:                 98.46	(Threshold: 4.5921)
2023-02-22 20:

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/3 [00:00<?, ?it/s]

2023-02-22 20:37:02 - Binary Accuracy Evaluation of the model on  dataset after epoch 0:


Batches:   0%|          | 0/4 [00:00<?, ?it/s]

2023-02-22 20:37:07 - Accuracy with Cosine-Similarity:           96.88	(Threshold: 0.8360)
2023-02-22 20:37:07 - F1 with Cosine-Similarity:                 96.97	(Threshold: 0.7954)
2023-02-22 20:37:07 - Precision with Cosine-Similarity:          94.12
2023-02-22 20:37:07 - Recall with Cosine-Similarity:             100.00
2023-02-22 20:37:07 - Average Precision with Cosine-Similarity:  99.63

2023-02-22 20:37:07 - Accuracy with Manhattan-Distance:           100.00	(Threshold: 108.6110)
2023-02-22 20:37:07 - F1 with Manhattan-Distance:                 100.00	(Threshold: 108.6110)
2023-02-22 20:37:07 - Precision with Manhattan-Distance:          100.00
2023-02-22 20:37:07 - Recall with Manhattan-Distance:             100.00
2023-02-22 20:37:07 - Average Precision with Manhattan-Distance:  100.00

2023-02-22 20:37:07 - Accuracy with Euclidean-Distance:           100.00	(Threshold: 4.9154)
2023-02-22 20:37:07 - F1 with Euclidean-Distance:                 100.00	(Threshold: 4.9154)
2023-02

Batches:   0%|          | 0/4 [00:00<?, ?it/s]

2023-02-22 20:37:14 - Accuracy with Cosine-Similarity:           96.88	(Threshold: 0.8013)
2023-02-22 20:37:14 - F1 with Cosine-Similarity:                 96.88	(Threshold: 0.7927)
2023-02-22 20:37:14 - Precision with Cosine-Similarity:          96.88
2023-02-22 20:37:14 - Recall with Cosine-Similarity:             96.88
2023-02-22 20:37:14 - Average Precision with Cosine-Similarity:  98.95

2023-02-22 20:37:14 - Accuracy with Manhattan-Distance:           98.44	(Threshold: 114.4303)
2023-02-22 20:37:14 - F1 with Manhattan-Distance:                 98.41	(Threshold: 114.4303)
2023-02-22 20:37:14 - Precision with Manhattan-Distance:          100.00
2023-02-22 20:37:14 - Recall with Manhattan-Distance:             96.88
2023-02-22 20:37:14 - Average Precision with Manhattan-Distance:  99.00

2023-02-22 20:37:14 - Accuracy with Euclidean-Distance:           96.88	(Threshold: 5.1860)
2023-02-22 20:37:14 - F1 with Euclidean-Distance:                 96.88	(Threshold: 5.2328)
2023-02-22 20:
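
Each accuracy and F1 line in the logs above is reported with a threshold. The sketch below shows one way such a threshold can be chosen by sweeping over sorted similarity scores, and checks a logged F1 value against its precision and recall. The function names and toy scores are illustrative assumptions, not the evaluator's actual code.

```python
def best_accuracy_threshold(scores, labels):
    """Return (best_accuracy, threshold); score >= threshold predicts a match."""
    pairs = sorted(zip(scores, labels), reverse=True)
    remaining_negatives = sum(1 for label in labels if label == 0)
    positives_so_far = 0
    best_acc, best_threshold = 0.0, None

    # Place the cut between consecutive scores and count correct predictions.
    for i in range(len(pairs) - 1):
        _, label = pairs[i]
        if label == 1:
            positives_so_far += 1
        else:
            remaining_negatives -= 1
        acc = (positives_so_far + remaining_negatives) / len(pairs)
        if acc > best_acc:
            best_acc = acc
            best_threshold = (pairs[i][0] + pairs[i + 1][0]) / 2
    return best_acc, best_threshold


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy similarity scores and labels (1 = match, 0 = no_match).
scores = [0.95, 0.90, 0.85, 0.60, 0.40, 0.30]
labels = [1, 1, 0, 1, 0, 0]
acc, threshold = best_accuracy_threshold(scores, labels)
print(f"best accuracy {acc:.3f} at threshold {threshold:.3f}")

# Precision 100.00 and recall 93.75 from the first cosine-similarity block.
print(f"F1 = {100 * f1_score(1.0, 0.9375):.2f}")
```

The recomputed F1 agrees with the 96.77 logged for cosine similarity before training. Because thresholds are tuned on very small samples here, the reported scores are best read as a smoke test rather than as a performance estimate.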

----

## Acknowledgements

```bibtex 
@inproceedings{reimers-2019-sentence-bert,
    title     = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author    = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month     = "11",
    year      = "2019",
    publisher = "Association for Computational Linguistics",
    url       = "https://arxiv.org/abs/1908.10084",
}
```
  
```bibtex  
@software{de_bruin_j_2019_3559043,
  author       = "De Bruin, J",
  title        = "Python Record Linkage Toolkit: A toolkit for record linkage and duplicate detection in Python",
  month        = "12",
  year         = "2019",
  publisher    = "Zenodo",
  version      = "v0.14",
  doi          = "10.5281/zenodo.3559043",
  url          = "https://doi.org/10.5281/zenodo.3559043"
}
```




In [None]:
# import recordlinkage as rl
# from recordlinkage.compare import Geographic, Numeric, String

# ## df_in is a pandas DataFrame with all the required columns

# block_vars = ['area', 'rooms', 'bathrooms', 'garages', 'stratum', 'type']
# compare_vars = [
#             String('description', 'description', method='lcs',
#                    label='description', threshold=0.95),
#             Numeric('originPrice', 'originPrice', method='gauss',
#                     label='originPrice', offset=0.2, scale=0.2),
#             Geographic('latitude', 'longitude', 'latitude', 'longitude',
#                        method='gauss', offset=0.2, label='location')
#             ]
# indexer = rl.index.Block(block_vars)
# candidate_links = indexer.index(df_in)
# njobs = 8

# ## This is the part that takes hours
# comparer = rl.Compare(compare_vars, n_jobs=njobs)
# compare_vectors = comparer.compute(pairs=candidate_links, x=df_in)

# ## Model training doesn't take too long
# ecm = rl.ECMClassifier(binarize=0.5)
# ecm.fit(compare_vectors)
# pairs_ecm = ecm.predict(compare_vectors)

In [None]:
# import logging
# import os
# from datetime import datetime

# import numpy as np
# from torch.utils.data import DataLoader
# from sentence_transformers import losses
# from sentence_transformers import LoggingHandler, SentenceTransformer, evaluation
# from sentence_transformers.readers import InputExample


# ############################################################
# logging.basicConfig(format='%(asctime)s - %(message)s',
#                     datefmt='%Y-%m-%d %H:%M:%S',
#                     level=logging.INFO,
#                     handlers=[LoggingHandler()])
# logger = logging.getLogger(__name__)
# ############################################################

# # prepare data splits for algorithm
# X_train_txt['target'] = np.where(y_train == "match", 1, 0)
# X_test_txt['target'] = np.where(y_test == "match", 1, 0)
# X_val_txt['target'] = np.where(y_val == "match", 1, 0)


# # parameters and configs for training
# model_name = 'bert-base-uncased'
# num_epochs = 1
# train_batch_size = 64
# margin = 0.5
# model_save_path = '../output/models/{}-bsz-{}-ep-{}-{}'.format(
#     model_name, 
#     train_batch_size,
#     num_epochs,
#     datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
# )
# os.makedirs(model_save_path, exist_ok=True)
# distance_metric = losses.SiameseDistanceMetric.COSINE_DISTANCE
# model = SentenceTransformer(model_name)


# # create train and test sample
# train_samples = []
# for row in  X_train_txt.iterrows():
#     if row[1]['target'] == 1:
#         train_samples.append(
#             InputExample(
#                 texts=[
#                     row[1]['sentence_l'], 
#                     row[1]['sentence_r']
#                 ], 
#                 label=int(row[1]['target'])
#             )
#         )
#         train_samples.append(
#             InputExample(
#                 texts=[
#                     row[1]['sentence_r'], 
#                     row[1]['sentence_l']
#                 ], 
#                 label=int(row[1]['target'])
#             )
#         )
#     else:
#         train_samples.append(
#             InputExample(
#                 texts=[
#                     row[1]['sentence_l'], 
#                     row[1]['sentence_r']
#                 ], 
#                 label=int(row[1]['target'])
#             )
#         )

# # initialize data loader and loss definition
# train_dataloader = DataLoader(
#     train_samples, 
#     shuffle=True, 
#     batch_size=train_batch_size
# )

# train_loss = losses.OnlineContrastiveLoss(
#     model=model, 
#     distance_metric=distance_metric, 
#     margin=margin
# )

# evaluators = []

# dev_sentences1 = []
# dev_sentences2 = []
# dev_labels = []
# for row in X_val_txt.iterrows():
#     dev_sentences1.append(row[1]['sentence_l'])
#     dev_sentences2.append(row[1]['sentence_r'])
#     dev_labels.append(int(row[1]['target']))

# binary_acc_evaluator = evaluation.BinaryClassificationEvaluator(
#     sentences1=dev_sentences1, 
#     sentences2=dev_sentences2, 
#     labels=dev_labels
# )
# evaluators.append(binary_acc_evaluator)

# # This SequentialEvaluator runs all other evaluators if/when added 
# seq_evaluator = evaluation.SequentialEvaluator(
#     evaluators=evaluators, 
#     main_score_function=lambda scores: scores[-1]
# )

# logger.info("Evaluate model without training")
# seq_evaluator(
#     model=model, 
#     epoch=0, 
#     steps=0, 
#     output_path=model_save_path
# )

# model.fit(
#     train_objectives=[(train_dataloader, train_loss)],
#     evaluator=seq_evaluator,
#     epochs=num_epochs,
#     use_amp=True,
#     warmup_steps=500,
#     output_path=model_save_path,
#     show_progress_bar=True
# )

# bi_encoder = SentenceTransformer(model_save_path)

# test_sentence_l = X_test_txt.sentence_l.tolist()
# test_sentence_r = X_test_txt.sentence_r.tolist()
# test_target = X_test_txt.target.tolist()

# test_eval = evaluation.BinaryClassificationEvaluator(
#     sentences1=test_sentence_l,
#     sentences2=test_sentence_r,
#     labels=test_target,
#     name=f"test_evaluator_{os.path.basename(model_save_path)}",
#     batch_size=32,
#     write_csv=True,
#     show_progress_bar=True
# )

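# test_perf_metrics = test_eval.compute_metrices(bi_encoder)

# # find_best_acc_and_threshold and find_best_f1_and_threshold are static methods
# # that take (scores, labels, high_score_more_similar), so they cannot be chained
# # onto test_eval(bi_encoder). Read the values from the metrics dict instead
# # (the 'cossim' key is assumed from sentence-transformers 2.x; check your version).
# cos = test_perf_metrics['cossim']
# acc, acc_threshold = cos['accuracy'], cos['accuracy_threshold']
# f1, precision, recall, f1_threshold = cos['f1'], cos['precision'], cos['recall'], cos['f1_threshold']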