# Evaluation of Data Integration

In this notebook, we evaluate how effectively two relations can be integrated using the soft join operator.
For this purpose, we use the [Datasets for DeepMatcher paper](https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md).

Each dataset contains two relations that describe the same entities, obtained from two different sources.
For example, `iTunes-Amazon` contains song records from iTunes and Amazon, so the task is to identify the records that refer to the same song.

To test different models, datasets, and other settings, adjust the parameters in the [Modifications](#modifications) section.
**Modify code in the [Modifications](#modifications) section only!**

We compute the following counts, where $|\cdot|$ denotes set cardinality:
* $ TP = |\text{True Matches} \cap \text{Predicted Matches}| $
* $ FN = |\text{True Matches} \setminus \text{Predicted Matches}| $
* $ FP = |\text{Predicted Matches} \setminus \text{True Matches}| $

From these counts we derive the scores:
* $ Precision = \frac{TP}{TP + FP} $
* $ Recall = \frac{TP}{TP + FN} $
* $ F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} $
* BLEU-1 to BLEU-4 between the true and the predicted joined records
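
The set-based counts and the precision, recall, and F1 formulas above can be sketched in a few lines of Python. This is a minimal illustration and not the repository's `calculate_metrics` implementation; the function name `pair_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions made for this example.

```python
def pair_metrics(true_matches, predicted_matches):
    """Compute precision, recall and F1 from two sets of (left_id, right_id) pairs.

    Hypothetical helper for illustration; zero denominators yield 0.0 by assumption.
    """
    tp = len(true_matches & predicted_matches)  # pairs present in both sets
    fn = len(true_matches - predicted_matches)  # true pairs that were missed
    fp = len(predicted_matches - true_matches)  # predicted pairs that are wrong

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fn": fn, "fp": fp,
            "precision": precision, "recall": recall, "f1": f1}


# Toy example: two of three true pairs are found, plus one wrong pair.
truth = {(1, 10), (2, 20), (3, 30)}
predicted = {(1, 10), (2, 20), (4, 40)}
print(pair_metrics(truth, predicted))
```

Representing matches as sets of ID pairs keeps the counts independent of row order and of duplicate predictions.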

## Imports

In [None]:
%%capture
!rm -rf SofteningQueryEvaluation
!git clone https://github.com/HackerBschor/SofteningQueryEvaluation
%cd SofteningQueryEvaluation

!pip3 install faiss-gpu-cu12
!pip3 install pgvector

In [6]:
from huggingface_hub import notebook_login
notebook_login()

In [7]:
import time
import json
import tqdm
import pandas as pd


from db.operators import Dummy, InnerSoftJoin

from models import ModelMgr
from models.embedding.SentenceTransformer import SentenceTransformerEmbeddingModel
from models.semantic_validation import LLaMAValidationModel

from evaluation.util import calculate_metrics, calc_bleu

In [8]:
with open("evaluation/DataIntegration/DataIntegration.json") as f:
    datasets = json.load(f)

print(", ".join(datasets.keys()))

AbtBuy, AmazonGoogle, Beer, iTunesAmazon, DBLP_ACM, DBLP_SCHOLAR, FodorsZagat, WalmartAmazon


## Modifications

In [9]:
dataset_name = "iTunesAmazon" # The tested dataset

significant_columns_left = ["Album_Name", "Artist_Name", "Released", "Song_Name", "Time"]
significant_columns_right = ["Album_Name", "Artist_Name", "Released", "Song_Name", "Time"]

In [10]:
# Models
stem = SentenceTransformerEmbeddingModel(ModelMgr())
lsv = LLaMAValidationModel(ModelMgr())

## Function and Dataset Declarations

In [11]:
significant_columns_left = [f"left_{x}" for x in significant_columns_left]
significant_columns_right = [f"right_{x}" for x in significant_columns_right]

In [12]:
dataset = datasets[dataset_name]
matches = pd.DataFrame(dataset["matches"])
matches.head(2)

Unnamed: 0,snoLeft,snoRight
0,111,53124
1,148,50767


In [13]:
left = pd.DataFrame(dataset["left"])
left.rename(columns={c: f"left_{c}" for c in left.columns}, inplace=True)
left.head(2)

Unnamed: 0,left_snoLeft,left_Album_Name,left_Album_Price,left_Artist_Name,left_CopyRight,left_Customer_Rating,left_Genre,left_Price,left_Released,left_Song_Name,left_Time
0,111,VHS,$7.99,X Ambassadors,2015 KIDinaKORNER/Interscope Records,4.54839,"Alternative,Music,Rock,Adult Alternative",$1.29,30-Jun-15,VHS Outro (Interlude),1:25
1,148,Title (Deluxe),$12.99,Meghan Trainor,"2014, 2015 Epic Records, a division of Sony M...",4.0674,"Pop,Music,Rock,Pop/Rock,Dance,Teen Pop",$1.29,9-Jan-15,Credit,2:51


In [14]:
right = pd.DataFrame(dataset["right"])
right.rename(columns={c: f"right_{c}" for c in right.columns}, inplace=True)
right.head(2)

Unnamed: 0,right_snoRight,right_Album_Name,right_Artist_Name,right_Song_Name,right_Price,right_Time,right_Released,right_Label,right_Copyright,right_Genre
0,363,#NAME?,Ed Sheeran,Sunburn (Deluxe Edition),$1.29,4:35,"September 9, 2011",Atlantic Records UK,Doll records,Pop
1,379,#NAME?,Ed Sheeran,Sunburn (Deluxe Edition),$1.29,4:35,"September 9, 2011",Atlantic Records UK,Doll records,Pop


In [15]:
candidates = matches\
    .merge(left, left_on=matches.columns[0], right_on=f"left_{matches.columns[0]}")\
    .merge(right, left_on=matches.columns[1], right_on=f"right_{matches.columns[1]}")\
    .drop(columns=matches.columns)

candidates.head(2)

Unnamed: 0,left_snoLeft,left_Album_Name,left_Album_Price,left_Artist_Name,left_CopyRight,left_Customer_Rating,left_Genre,left_Price,left_Released,left_Song_Name,...,right_snoRight,right_Album_Name,right_Artist_Name,right_Song_Name,right_Price,right_Time,right_Released,right_Label,right_Copyright,right_Genre
0,111,VHS,$7.99,X Ambassadors,2015 KIDinaKORNER/Interscope Records,4.54839,"Alternative,Music,Rock,Adult Alternative",$1.29,30-Jun-15,VHS Outro (Interlude),...,53124,VHS [Explicit],X Ambassadors,VHS Outro (Interlude) [Explicit],$1.29,1:25,"June 30, 2015",KIDinaKORNER/Interscope Records,(C) 2015 KIDinaKORNER/Interscope Records,Alternative Rock
1,148,Title (Deluxe),$12.99,Meghan Trainor,"2014, 2015 Epic Records, a division of Sony M...",4.0674,"Pop,Music,Rock,Pop/Rock,Dance,Teen Pop",$1.29,9-Jan-15,Credit,...,50767,Title (Deluxe),Meghan Trainor,Credit,$1.29,2:51,"January 9, 2015",Epic,"2011 What A Music Ltd, licence exclusive Parl...",Pop


In [16]:
gt = {tuple([x[f"left_{matches.columns[0]}"], x[f"right_{matches.columns[1]}"]]) for _, x in candidates.iterrows()}
print(str(gt)[0:500] + "...")

{(2743, 17193), (6533, 38335), (5713, 35091), (1490, 3901), (206, 41214), (4748, 5685), (4302, 29727), (3763, 38465), (1165, 28023), (1070, 11280), (2223, 816), (2783, 31567), (1898, 51663), (3772, 38476), (427, 33316), (538, 1879), (250, 53124), (4414, 42138), (3325, 7406), (3422, 34354), (3945, 18982), (4588, 26169), (2480, 33820), (1152, 53319), (593, 43724), (4660, 41125), (3851, 9123), (255, 49141), (3357, 39370), (669, 24650), (3927, 16409), (6048, 4670), (6219, 29615), (1407, 48324), (610...


### Execute Operator and Evaluate

In [17]:
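def evaluate(em, sv, threshold, method, embedding_comparison, embedding_method):
    """Run the soft join with the given configuration and score it against `gt`.

    Returns the configuration key and the metrics from `calculate_metrics`.
    """
    key = (threshold, method, embedding_comparison, embedding_method)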

    op1 = Dummy("ltable", left.columns, list(left.values))
    op2 = Dummy("rtable", right.columns, list(right.values))

    op = InnerSoftJoin(
        op1, op2, em=em, sv=sv,
        threshold=threshold, method=method,
        columns_left=significant_columns_left, columns_right=significant_columns_right,
        embedding_comparison=embedding_comparison,
        embedding_method = embedding_method
    )

    tic = time.time()
    result = op.open().fetch_all()
    toc = time.time()
    pred = {tuple([x[f"left_{matches.columns[0]}"], x[f"right_{matches.columns[1]}"]]) for x in result}

    scores = calculate_metrics(gt, pred, toc - tic)

    print(key, scores["F1 Score"])

    return key, scores

## Evaluation


In [18]:
thresholds = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

evaluation_results = {}

In [19]:
for t in thresholds:
    res = evaluate(stem, lsv, threshold = t, method = "threshold", embedding_comparison = "RECORD_WISE", embedding_method = "FIELD_SERIALIZED")
    evaluation_results[res[0]] = res[1]
    if res[1]["Recall"] == 1.0:
        break

(1.0, 'threshold', 'RECORD_WISE', 'FIELD_SERIALIZED') 0
(0.9, 'threshold', 'RECORD_WISE', 'FIELD_SERIALIZED') 0
(0.8, 'threshold', 'RECORD_WISE', 'FIELD_SERIALIZED') 0
(0.7, 'threshold', 'RECORD_WISE', 'FIELD_SERIALIZED') 0
(0.6, 'threshold', 'RECORD_WISE', 'FIELD_SERIALIZED') 0
(0.5, 'threshold', 'RECORD_WISE', 'FIELD_SERIALIZED') 0
(0.4, 'threshold', 'RECORD_WISE', 'FIELD_SERIALIZED') 0.015037593984962407
(0.3, 'threshold', 'RECORD_WISE', 'FIELD_SERIALIZED') 0.7014925373134329
(0.2, 'threshold', 'RECORD_WISE', 'FIELD_SERIALIZED') 0.023506366307541625
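
The sweep stops at 0.2 because recall reached 1.0 there; presumably a lower threshold can only admit further pairs, so precision would not improve. Since the same loop recurs in the following cells, it could be factored into a helper. The sketch below is illustrative only: `sweep_thresholds` and the stand-in `fake_evaluate` are hypothetical names, and the real `evaluate` call needs the loaded models.

```python
def sweep_thresholds(evaluate_fn, thresholds, **config):
    """Evaluate each threshold in order and stop once recall reaches 1.0.

    `evaluate_fn` must return a (key, scores) tuple whose scores contain a
    "Recall" entry, mirroring the return value of `evaluate` in this notebook.
    """
    results = {}
    for t in thresholds:
        key, scores = evaluate_fn(threshold=t, **config)
        results[key] = scores
        if scores["Recall"] == 1.0:
            break  # every true match has already been found
    return results


# Stand-in for the real evaluation (hypothetical): full recall from 0.8 downwards.
def fake_evaluate(threshold, method):
    recall = 1.0 if threshold <= 0.8 else 0.5
    return (threshold, method), {"Recall": recall}


results = sweep_thresholds(fake_evaluate, [1.0, 0.9, 0.8, 0.7], method="threshold")
print(list(results))
```

In the notebook, a call could look like `evaluation_results.update(sweep_thresholds(lambda **kw: evaluate(stem, lsv, **kw), thresholds, method="threshold", embedding_comparison="RECORD_WISE", embedding_method="FIELD_SERIALIZED"))`.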


In [20]:
for t in thresholds:
    res = evaluate(stem, lsv, threshold = t, method = "threshold", embedding_comparison = "RECORD_WISE", embedding_method = "FULL_SERIALIZED")
    evaluation_results[res[0]] = res[1]
    if res[1]["Recall"] == 1.0:
        break

(1.0, 'threshold', 'RECORD_WISE', 'FULL_SERIALIZED') 0
(0.9, 'threshold', 'RECORD_WISE', 'FULL_SERIALIZED') 0.7365269461077845
(0.8, 'threshold', 'RECORD_WISE', 'FULL_SERIALIZED') 0.020618556701030924


In [21]:
for t in thresholds:
    res = evaluate(stem, lsv, threshold = t, method = "threshold", embedding_comparison = "COLUMN_WISE", embedding_method = None)
    evaluation_results[res[0]] = res[1]
    if res[1]["Recall"] == 1.0:
        break

(1.0, 'threshold', 'COLUMN_WISE', None) 0
(0.9, 'threshold', 'COLUMN_WISE', None) 0.6176470588235294
(0.8, 'threshold', 'COLUMN_WISE', None) 0.8683274021352313
(0.7, 'threshold', 'COLUMN_WISE', None) 0.825


In [22]:
res = evaluate(stem, lsv, method = "zero-shot-prompting", threshold = None, embedding_comparison = "RECORD_WISE", embedding_method = None)
evaluation_results[res[0]] = res[1]

(None, 'zero-shot-prompting', 'RECORD_WISE', None) 0.7380952380952381


In [23]:
for t in thresholds:
    res = evaluate(stem, lsv, threshold = t, method = "both", embedding_comparison = "RECORD_WISE", embedding_method = "FIELD_SERIALIZED")
    evaluation_results[res[0]] = res[1]
    if res[1]["Recall"] == 1.0:
        break

(1.0, 'both', 'RECORD_WISE', 'FIELD_SERIALIZED') 0
(0.9, 'both', 'RECORD_WISE', 'FIELD_SERIALIZED') 0
(0.8, 'both', 'RECORD_WISE', 'FIELD_SERIALIZED') 0
(0.7, 'both', 'RECORD_WISE', 'FIELD_SERIALIZED') 0
(0.6, 'both', 'RECORD_WISE', 'FIELD_SERIALIZED') 0
(0.5, 'both', 'RECORD_WISE', 'FIELD_SERIALIZED') 0
(0.4, 'both', 'RECORD_WISE', 'FIELD_SERIALIZED') 0.015037593984962407
(0.3, 'both', 'RECORD_WISE', 'FIELD_SERIALIZED') 0.6160714285714286
(0.2, 'both', 'RECORD_WISE', 'FIELD_SERIALIZED') 0.7380952380952381
(0.1, 'both', 'RECORD_WISE', 'FIELD_SERIALIZED') 0.7380952380952381


In [24]:
for t in thresholds:
    res = evaluate(stem, lsv, threshold = t, method = "both", embedding_comparison = "RECORD_WISE", embedding_method = "FULL_SERIALIZED")
    evaluation_results[res[0]] = res[1]
    if res[1]["Recall"] == 1.0:
        break

(1.0, 'both', 'RECORD_WISE', 'FULL_SERIALIZED') 0
(0.9, 'both', 'RECORD_WISE', 'FULL_SERIALIZED') 0.7338709677419355
(0.8, 'both', 'RECORD_WISE', 'FULL_SERIALIZED') 0.7380952380952381
(0.7, 'both', 'RECORD_WISE', 'FULL_SERIALIZED') 0.7380952380952381
(0.6, 'both', 'RECORD_WISE', 'FULL_SERIALIZED') 0.7380952380952381
(0.5, 'both', 'RECORD_WISE', 'FULL_SERIALIZED') 0.7380952380952381
(0.4, 'both', 'RECORD_WISE', 'FULL_SERIALIZED') 0.7380952380952381
(0.3, 'both', 'RECORD_WISE', 'FULL_SERIALIZED') 0.7380952380952381
(0.2, 'both', 'RECORD_WISE', 'FULL_SERIALIZED') 0.7380952380952381
(0.1, 'both', 'RECORD_WISE', 'FULL_SERIALIZED') 0.7380952380952381


In [25]:
for t in thresholds:
    res = evaluate(stem, lsv, threshold = t, method = "both", embedding_comparison = "COLUMN_WISE", embedding_method = None)
    evaluation_results[res[0]] = res[1]
    if res[1]["Recall"] == 1.0:
        break

(1.0, 'both', 'COLUMN_WISE', None) 0
(0.9, 'both', 'COLUMN_WISE', None) 0.5
(0.8, 'both', 'COLUMN_WISE', None) 0.7302904564315352
(0.7, 'both', 'COLUMN_WISE', None) 0.7380952380952381
(0.6, 'both', 'COLUMN_WISE', None) 0.7380952380952381
(0.5, 'both', 'COLUMN_WISE', None) 0.7380952380952381
(0.4, 'both', 'COLUMN_WISE', None) 0.7380952380952381
(0.3, 'both', 'COLUMN_WISE', None) 0.7380952380952381
(0.2, 'both', 'COLUMN_WISE', None) 0.7380952380952381
(0.1, 'both', 'COLUMN_WISE', None) 0.7380952380952381
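
To compare configurations, it can help to reduce each sweep to its best threshold. The sketch below does this for a handful of F1 values copied from the printed output above; the helper name `best_threshold_per_config` is an assumption for this example, and in the notebook the input could instead be built from `evaluation_results`.

```python
# F1 scores copied from the printed output above; keys follow the notebook's
# (threshold, method, embedding_comparison, embedding_method) layout.
f1_by_key = {
    (0.9, "threshold", "COLUMN_WISE", None): 0.6176470588235294,
    (0.8, "threshold", "COLUMN_WISE", None): 0.8683274021352313,
    (0.7, "threshold", "COLUMN_WISE", None): 0.825,
    (0.3, "threshold", "RECORD_WISE", "FIELD_SERIALIZED"): 0.7014925373134329,
    (0.2, "threshold", "RECORD_WISE", "FIELD_SERIALIZED"): 0.023506366307541625,
}


def best_threshold_per_config(f1_scores):
    """Return {config: (threshold, f1)} holding the highest F1 per configuration."""
    best = {}
    for (threshold, *config), f1 in f1_scores.items():
        config = tuple(config)  # (method, embedding_comparison, embedding_method)
        if config not in best or f1 > best[config][1]:
            best[config] = (threshold, f1)
    return best


for config, (threshold, f1) in best_threshold_per_config(f1_by_key).items():
    print(config, threshold, round(f1, 3))
```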


In [26]:
# Join the ID pairs back to the full records so BLEU can compare complete rows.
record_columns = list(left.columns) + list(right.columns)
gt_bleu = {tuple(row[col] for col in record_columns) for _, row in candidates.iterrows()}

for key in tqdm.tqdm(evaluation_results):
    # Use the ID columns of the selected dataset instead of hard-coded names.
    result = pd.DataFrame(evaluation_results[key]["pred"], columns=["l", "r"])\
        .merge(left, left_on="l", right_on=f"left_{matches.columns[0]}")\
        .merge(right, left_on="r", right_on=f"right_{matches.columns[1]}")

    pred_bleu = {tuple(row[col] for col in record_columns) for _, row in result.iterrows()}
    scores_bleu = calc_bleu(gt_bleu, pred_bleu)
    for score_bleu in scores_bleu:
        evaluation_results[key][score_bleu] = scores_bleu[score_bleu]

100%|██████████| 47/47 [12:01<00:00, 15.36s/it]


In [30]:
keys = ["threshold", "method", "embedding_comparison", "embedding_method"]
evaluation_results_list = [v | {ki: vi for ki, vi in zip(keys, k)} for k, v in evaluation_results.items()]
df_evaluation_results = pd.DataFrame.from_records(evaluation_results_list, index=keys)
df_evaluation_results

threshold,method,embedding_comparison,embedding_method,Precision,Recall,F1 Score,tp,fn,fp,runtime,pred,bleu1,bleu2,bleu3,bleu4
1.0,threshold,RECORD_WISE,FIELD_SERIALIZED,0.0,0.0,0.0,0,132,0,2.838552,{},-1.0,-1.0,-1.0,-1.0
0.9,threshold,RECORD_WISE,FIELD_SERIALIZED,0.0,0.0,0.0,0,132,0,1.536983,{},-1.0,-1.0,-1.0,-1.0
0.8,threshold,RECORD_WISE,FIELD_SERIALIZED,0.0,0.0,0.0,0,132,0,1.54531,{},-1.0,-1.0,-1.0,-1.0
0.7,threshold,RECORD_WISE,FIELD_SERIALIZED,0.0,0.0,0.0,0,132,0,1.565419,{},-1.0,-1.0,-1.0,-1.0
0.6,threshold,RECORD_WISE,FIELD_SERIALIZED,0.0,0.0,0.0,0,132,0,1.5494,{},-1.0,-1.0,-1.0,-1.0
0.5,threshold,RECORD_WISE,FIELD_SERIALIZED,0.0,0.0,0.0,0,132,0,1.55242,{},-1.0,-1.0,-1.0,-1.0
0.4,threshold,RECORD_WISE,FIELD_SERIALIZED,1.0,0.007576,0.015038,1,131,0,1.525844,"{(6235, 13724)}",0.351403,0.215811,0.157853,0.11815
0.3,threshold,RECORD_WISE,FIELD_SERIALIZED,0.691176,0.712121,0.701493,94,38,42,1.565871,"{(2743, 17193), (1778, 33431), (1490, 3901), (...",0.883405,0.851309,0.832391,0.817656
0.2,threshold,RECORD_WISE,FIELD_SERIALIZED,0.011893,1.0,0.023506,132,0,10967,1.657646,"{(2914, 43724), (1596, 31105), (1165, 825), (1...",1.0,1.0,1.0,1.0
1.0,threshold,RECORD_WISE,FULL_SERIALIZED,0.0,0.0,0.0,0,132,0,1.831528,{},-1.0,-1.0,-1.0,-1.0


In [31]:
name = f"{dataset_name}_mpnetBaseV2_LLama3B"  # follows the dataset chosen in Modifications
df_evaluation_results.drop(columns=["pred"]).to_csv(f"{name}.csv")
df_evaluation_results.to_pickle(f"{name}.pkl")