# Evaluation of Data Integration

In this notebook, we evaluate how effectively two relations can be integrated using the soft join operator.
To this end, we use the [Datasets for DeepMatcher paper](https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md).

Each dataset contains two relations that describe the same entities, drawn from two different sources.
For example, `iTunes-Amazon` contains song records obtained from iTunes and Amazon, so the task is to identify records that refer to the same song.

To test different models, datasets, and other settings, adjust the parameters in the [Modifications](#modifications) section.
**Modify code in the [Modifications](#modifications) section only!**

We count:
* $ TP = |\text{True Matches} \cap \text{Predicted Matches}| $
* $ FN = |\text{True Matches} \setminus \text{Predicted Matches}| $
* $ FP = |\text{Predicted Matches} \setminus \text{True Matches}| $

From these counts, we determine the scores:
* $ Precision = \frac{TP}{TP + FP}$
* $ Recall = \frac{TP}{TP + FN}$
* $ F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} $
* BLEU 1-4
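
The counts and scores listed above can be sketched in a few lines of plain Python. This is only an illustration on made-up ID pairs, not the repository's `calculate_metrics`; the function name `match_scores` is hypothetical, and returning `0.0` when a denominator is zero is an assumption (it is consistent with the zeros reported for empty predictions in the results table).

```python
def match_scores(true_matches: set, predicted_matches: set) -> dict:
    """Score predicted (left_id, right_id) pairs against the true pairs."""
    tp = len(true_matches & predicted_matches)   # correctly predicted pairs
    fn = len(true_matches - predicted_matches)   # true pairs that were missed
    fp = len(predicted_matches - true_matches)   # predicted pairs that are wrong

    # Assumption: a score with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    return {"tp": tp, "fn": fn, "fp": fp,
            "Precision": precision, "Recall": recall, "F1 Score": f1}


# Toy data: four true pairs, three predictions, two of them correct.
true_matches = {(1, 10), (2, 20), (3, 30), (4, 40)}
predicted_matches = {(1, 10), (2, 20), (5, 50)}

scores = match_scores(true_matches, predicted_matches)
print(scores)
```

With these toy sets, two of three predictions are correct (precision 2/3) and two of four true pairs are found (recall 1/2).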

## Imports

In [2]:
%%capture
!rm -rf SofteningQueryEvaluation
!git clone https://github.com/HackerBschor/SofteningQueryEvaluation
%cd SofteningQueryEvaluation

!pip3 install faiss-gpu-cu12

!pip3 install pgvector

In [3]:
from huggingface_hub import notebook_login
notebook_login()

In [4]:
import time
import json
import tqdm
import pandas as pd


from db.operators import Dummy, InnerSoftJoin

from models import ModelMgr
from models.embedding.SentenceTransformer import SentenceTransformerEmbeddingModel
from models.semantic_validation import LLaMAValidationModel

from evaluation.util import calculate_metrics, calc_bleu

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [5]:
with open("evaluation/DataIntegration/DataIntegration.json") as f:
    datasets = json.load(f)

print(", ".join(datasets.keys()))

AbtBuy, AmazonGoogle, Beer, iTunesAmazon, DBLP_ACM, DBLP_SCHOLAR, FodorsZagat, WalmartAmazon


## Modifications

In [6]:
dataset_name = "AbtBuy"  # The tested dataset (a key of `datasets`)
number_matches = 100  # Number of sampled true matches; None uses all matches

significant_columns_left = ["name", "description", "price"]
significant_columns_right = ["name", "description", "price"]

In [7]:
# Validation model candidates (pass one as `model_path`):
#   meta-llama/Llama-3.2-3B-Instruct
#   meta-llama/Meta-Llama-3-8B-Instruct
stem = SentenceTransformerEmbeddingModel(ModelMgr())  # embedding model
lsv = LLaMAValidationModel(ModelMgr(), model_path="meta-llama/Llama-3.2-3B-Instruct")  # semantic validation model

## Function and Dataset Declarations

In [8]:
significant_columns_left = [f"left_{x}" for x in significant_columns_left]
significant_columns_right = [f"right_{x}" for x in significant_columns_right]

In [9]:
dataset = datasets[dataset_name]
matches = pd.DataFrame(dataset["matches"])

if number_matches is not None:
    matches = matches.sample(number_matches, random_state=42)

print(len(matches))

matches.head(2)

100


Unnamed: 0,idAbt,idBuy
44,22101,201953039
568,34548,207907555


In [10]:
left = pd.DataFrame(dataset["left"])
left.rename(columns={c: f"left_{c}" for c in left.columns}, inplace=True)

if number_matches is not None:
    left = left[left["left_" + matches.columns[0]].isin(matches[matches.columns[0]])]

print(len(left))

left.head(2)

100


Unnamed: 0,left_idAbt,left_name,left_description,left_price
10,38474,Linksys Gigabit 5-Port Workgroup Switch - EG005W,Linksys Gigabit 5-Port Workgroup Switch - EG00...,$64.00
31,20453,Canon Cyan Ink Tank - Cyan - CLI8C,Canon Cyan Ink Tank - CLI8C/ Compatible With T...,$16.00


In [11]:
right = pd.DataFrame(dataset["right"])
right.rename(columns={c: f"right_{c}" for c in right.columns}, inplace=True)

if number_matches is not None:
    right = right[right["right_" + matches.columns[1]].isin(matches[matches.columns[1]])]

print(len(right))

right.head(2)

100


Unnamed: 0,right_idBuy,right_name,right_description,right_manufacturer,right_price
10,10343605,Linksys Instant Gigabit EG005W Ethernet Switch,Linksys EG005W Gigabit 5-Port Workgroup Switch,LINKSYS,
31,201692677,Canon CLI-8C Ink Cartridge - 0621B002,Cyan,Canon,$13.99


In [12]:
candidates = matches\
    .merge(left, left_on=matches.columns[0], right_on=f"left_{matches.columns[0]}")\
    .merge(right, left_on=matches.columns[1], right_on=f"right_{matches.columns[1]}")\
    .drop(columns=matches.columns)

print(len(candidates))

candidates.head(2)

100


Unnamed: 0,left_idAbt,left_name,left_description,left_price,right_idBuy,right_name,right_description,right_manufacturer,right_price
0,22101,Canon Photo Ink Cartridge - CL52,Canon Photo Ink Cartridge - CL52/ Compatible W...,$25.00,201953039,Canon CL-52 Photo Ink Cartridge For PIXMA iP62...,Color,Canon,$18.16
1,34548,Sony White Earbud Style Headphones - MDREX55WH,Sony MDREX55WHI White Earbud Style Headphones ...,,207907555,Ex Series Earbuds Wht - MDR EX55/WHI,,Sony,$25.76


In [13]:
gt = {tuple([x[f"left_{matches.columns[0]}"], x[f"right_{matches.columns[1]}"]]) for _, x in candidates.iterrows()}
print(str(gt)[0:100] + "...")

{(34962, 208114681), (38173, 210441817), (33150, 208105630), (38400, 209208547), (14563, 202870448),...


### Execute Operator and Evaluate

In [14]:
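def evaluate(em, sv, threshold, method, embedding_comparison, embedding_method):
    """Run the soft join for one configuration and score it against the ground truth `gt`.

    Returns the configuration key and the metrics dict from `calculate_metrics`.
    """
    key = (threshold, method, embedding_comparison, embedding_method)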

    op1 = Dummy("ltable", left.columns, list(left.values))
    op2 = Dummy("rtable", right.columns, list(right.values))

    op = InnerSoftJoin(
        op1, op2, em=em, sv=sv,
        threshold=threshold, method=method,
        columns_left=significant_columns_left, columns_right=significant_columns_right,
        embedding_comparison=embedding_comparison,
        embedding_method=embedding_method,
        zs_system_prompt='You are an object-matcher. Check if the two tuples A and B refer to the same real-world entity. If so, answer with "yes"; if not, answer with "no" only.',
        zs_template="A is {a}\nB is {b}"
    )

    tic = time.time()
    result = op.open().fetch_all()
    toc = time.time()
    pred = {tuple([x[f"left_{matches.columns[0]}"], x[f"right_{matches.columns[1]}"]]) for x in result}

    scores = calculate_metrics(gt, pred, toc - tic)

    print(key, scores["F1 Score"])

    return key, scores
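
To make the role of `threshold` concrete, here is a minimal sketch of a threshold-based soft join on toy embeddings. It assumes that pairs are kept when their cosine similarity is at least the threshold; how `InnerSoftJoin` actually compares embeddings is not shown in this notebook, so the helper names `cosine_similarity` and `threshold_soft_join` and the toy vectors are purely illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def threshold_soft_join(left_rows, right_rows, threshold):
    """Return all (left_id, right_id) pairs whose similarity is >= threshold.

    Assumption: rows are given as {id: embedding vector}; a real operator
    would first embed the significant columns of each record.
    """
    return {
        (l_id, r_id)
        for l_id, l_vec in left_rows.items()
        for r_id, r_vec in right_rows.items()
        if cosine_similarity(l_vec, r_vec) >= threshold
    }


# Hypothetical 2-dimensional embeddings for two records per side.
toy_left = {"a1": [1.0, 0.0], "a2": [0.0, 1.0]}
toy_right = {"b1": [0.9, 0.1], "b2": [0.1, 0.9]}

strict = threshold_soft_join(toy_left, toy_right, 0.9)
loose = threshold_soft_join(toy_left, toy_right, 0.1)
print(sorted(strict))
print(len(loose))
```

In this toy case the strict threshold keeps only the two close pairs, while the loose threshold admits all four combinations. The same trade-off appears in the evaluation: lowering the threshold raises recall but adds false positives.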

## Evaluation


In [15]:
thresholds = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

evaluation_results = {}

In [16]:
for t in thresholds:
    res = evaluate(stem, lsv, threshold = t, method = "threshold", embedding_comparison = "RECORD_WISE", embedding_method = "FIELD_SERIALIZED")
    evaluation_results[res[0]] = res[1]
    if res[1]["Recall"] == 1.0:
        break

(1.0, 'threshold', 'RECORD_WISE', 'FIELD_SERIALIZED') 0
(0.9, 'threshold', 'RECORD_WISE', 'FIELD_SERIALIZED') 0
(0.8, 'threshold', 'RECORD_WISE', 'FIELD_SERIALIZED') 0
(0.7, 'threshold', 'RECORD_WISE', 'FIELD_SERIALIZED') 0
(0.6, 'threshold', 'RECORD_WISE', 'FIELD_SERIALIZED') 0
(0.5, 'threshold', 'RECORD_WISE', 'FIELD_SERIALIZED') 0.07547169811320754
(0.4, 'threshold', 'RECORD_WISE', 'FIELD_SERIALIZED') 0.38613861386138615
(0.3, 'threshold', 'RECORD_WISE', 'FIELD_SERIALIZED') 0.14058577405857742
(0.2, 'threshold', 'RECORD_WISE', 'FIELD_SERIALIZED') 0.04193751310547285


In [17]:
for t in thresholds:
    res = evaluate(stem, lsv, threshold = t, method = "threshold", embedding_comparison = "RECORD_WISE", embedding_method = "FULL_SERIALIZED")
    evaluation_results[res[0]] = res[1]
    if res[1]["Recall"] == 1.0:
        break

(1.0, 'threshold', 'RECORD_WISE', 'FULL_SERIALIZED') 0
(0.9, 'threshold', 'RECORD_WISE', 'FULL_SERIALIZED') 0.366412213740458
(0.8, 'threshold', 'RECORD_WISE', 'FULL_SERIALIZED') 0.5220125786163522
(0.7, 'threshold', 'RECORD_WISE', 'FULL_SERIALIZED') 0.29464285714285715
(0.6, 'threshold', 'RECORD_WISE', 'FULL_SERIALIZED') 0.09713453132588634


In [18]:
for t in thresholds:
    res = evaluate(stem, lsv, threshold = t, method = "threshold", embedding_comparison = "COLUMN_WISE", embedding_method = None)
    evaluation_results[res[0]] = res[1]
    if res[1]["Recall"] == 1.0:
        break

(1.0, 'threshold', 'COLUMN_WISE', None) 0
(0.9, 'threshold', 'COLUMN_WISE', None) 0
(0.8, 'threshold', 'COLUMN_WISE', None) 0.16071428571428573
(0.7, 'threshold', 'COLUMN_WISE', None) 0.29411764705882354
(0.6, 'threshold', 'COLUMN_WISE', None) 0.39826839826839827
(0.5, 'threshold', 'COLUMN_WISE', None) 0.20625
(0.4, 'threshold', 'COLUMN_WISE', None) 0.06158583525789068
(0.3, 'threshold', 'COLUMN_WISE', None) 0.0420781451266638
(0.2, 'threshold', 'COLUMN_WISE', None) 0.03124023742580443


In [19]:
res = evaluate(stem, lsv, method = "zero-shot-prompting", threshold = None, embedding_comparison = "RECORD_WISE", embedding_method = None)
evaluation_results[res[0]] = res[1]

(None, 'zero-shot-prompting', 'RECORD_WISE', None) 0.48484848484848486


In [20]:
for t in thresholds:
    res = evaluate(stem, lsv, threshold = t, method = "both", embedding_comparison = "RECORD_WISE", embedding_method = "FIELD_SERIALIZED")
    evaluation_results[res[0]] = res[1]
    if res[1]["Recall"] == 1.0:
        break

(1.0, 'both', 'RECORD_WISE', 'FIELD_SERIALIZED') 0
(0.9, 'both', 'RECORD_WISE', 'FIELD_SERIALIZED') 0
(0.8, 'both', 'RECORD_WISE', 'FIELD_SERIALIZED') 0
(0.7, 'both', 'RECORD_WISE', 'FIELD_SERIALIZED') 0
(0.6, 'both', 'RECORD_WISE', 'FIELD_SERIALIZED') 0
(0.5, 'both', 'RECORD_WISE', 'FIELD_SERIALIZED') 0.0392156862745098
(0.4, 'both', 'RECORD_WISE', 'FIELD_SERIALIZED') 0.24561403508771928
(0.3, 'both', 'RECORD_WISE', 'FIELD_SERIALIZED') 0.44961240310077516
(0.2, 'both', 'RECORD_WISE', 'FIELD_SERIALIZED') 0.48484848484848486


KeyboardInterrupt: 

In [21]:
for t in thresholds:
    res = evaluate(stem, lsv, threshold = t, method = "both", embedding_comparison = "RECORD_WISE", embedding_method = "FULL_SERIALIZED")
    evaluation_results[res[0]] = res[1]
    if res[1]["Recall"] == 1.0:
        break

(1.0, 'both', 'RECORD_WISE', 'FULL_SERIALIZED') 0
(0.9, 'both', 'RECORD_WISE', 'FULL_SERIALIZED') 0.21428571428571425
(0.8, 'both', 'RECORD_WISE', 'FULL_SERIALIZED') 0.44961240310077516
(0.7, 'both', 'RECORD_WISE', 'FULL_SERIALIZED') 0.48484848484848486


KeyboardInterrupt: 

In [22]:
for t in thresholds:
    res = evaluate(stem, lsv, threshold = t, method = "both", embedding_comparison = "COLUMN_WISE", embedding_method = None)
    evaluation_results[res[0]] = res[1]
    if res[1]["Recall"] == 1.0:
        break

(1.0, 'both', 'COLUMN_WISE', None) 0
(0.9, 'both', 'COLUMN_WISE', None) 0
(0.8, 'both', 'COLUMN_WISE', None) 0.058252427184466014
(0.7, 'both', 'COLUMN_WISE', None) 0.09523809523809523
(0.6, 'both', 'COLUMN_WISE', None) 0.2758620689655173
(0.5, 'both', 'COLUMN_WISE', None) 0.36065573770491804
(0.4, 'both', 'COLUMN_WISE', None) 0.41269841269841273
(0.3, 'both', 'COLUMN_WISE', None) 0.47328244274809156
(0.2, 'both', 'COLUMN_WISE', None) 0.48484848484848486


KeyboardInterrupt: 

In [25]:
# Full ground-truth records (all left and right columns) for the BLEU comparison
record_columns = list(left.columns) + list(right.columns)
left_id, right_id = f"left_{matches.columns[0]}", f"right_{matches.columns[1]}"
gt_bleu = {tuple(row[col] for col in record_columns) for _, row in candidates.iterrows()}

for key in tqdm.tqdm(evaluation_results):
    # Re-attach the full records to the predicted (left_id, right_id) pairs
    result = pd.DataFrame(list(evaluation_results[key]["pred"]), columns=["l", "r"])\
        .merge(left, left_on="l", right_on=left_id)\
        .merge(right, left_on="r", right_on=right_id)

    pred_bleu = {tuple(row[col] for col in record_columns) for _, row in result.iterrows()}
    scores_bleu = calc_bleu(gt_bleu, pred_bleu)
    for score_bleu in scores_bleu:
        evaluation_results[key][score_bleu] = scores_bleu[score_bleu]

100%|██████████| 46/46 [25:22<00:00, 33.09s/it]
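
For readers unfamiliar with BLEU, the following sketch computes standard BLEU-n for one tokenized candidate against one reference, using clipped n-gram precision and the brevity penalty. It is not the repository's `calc_bleu`, whose aggregation over sets of records is not shown here; the helper names and the example token lists are assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def bleu(reference, candidate, max_n):
    """BLEU with uniform weights over 1..max_n-grams and a brevity penalty."""
    precisions = [modified_precision(reference, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_mean)


# Hypothetical tokenized records with the same words in a different order.
reference = "canon photo ink cartridge cl52".split()
candidate = "canon cl52 photo ink cartridge".split()

bleu1 = bleu(reference, candidate, 1)
bleu2 = bleu(reference, candidate, 2)
print(bleu1, bleu2)
```

Because all words match but only two of the four bigrams do, BLEU-1 stays at 1 while BLEU-2 drops, which is why higher-order BLEU scores are more sensitive to word order.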


In [26]:
keys = ["threshold", "method", "embedding_comparison", "embedding_method"]
evaluation_results_list = [v | {ki: vi for ki, vi in zip(keys, k)} for k, v in evaluation_results.items()]
df_evaluation_results = pd.DataFrame.from_records(evaluation_results_list, index=keys)
df_evaluation_results

threshold,method,embedding_comparison,embedding_method,Precision,Recall,F1 Score,tp,fn,fp,runtime,pred,bleu1,bleu2,bleu3,bleu4
1.0,threshold,RECORD_WISE,FIELD_SERIALIZED,0.0,0.0,0.0,0,100,0,3.192026,{},-1.0,-1.0,-1.0,-1.0
0.9,threshold,RECORD_WISE,FIELD_SERIALIZED,0.0,0.0,0.0,0,100,0,1.993993,{},-1.0,-1.0,-1.0,-1.0
0.8,threshold,RECORD_WISE,FIELD_SERIALIZED,0.0,0.0,0.0,0,100,0,1.999471,{},-1.0,-1.0,-1.0,-1.0
0.7,threshold,RECORD_WISE,FIELD_SERIALIZED,0.0,0.0,0.0,0,100,0,1.996535,{},-1.0,-1.0,-1.0,-1.0
0.6,threshold,RECORD_WISE,FIELD_SERIALIZED,0.0,0.0,0.0,0,100,0,1.999452,{},-1.0,-1.0,-1.0,-1.0
0.5,threshold,RECORD_WISE,FIELD_SERIALIZED,0.666667,0.04,0.075472,4,96,2,1.975623,"{(38400, 209208547), (14563, 202870448), (2898...",0.264302,0.156949,0.112511,0.086111
0.4,threshold,RECORD_WISE,FIELD_SERIALIZED,0.382353,0.39,0.386139,39,61,63,1.979413,"{(34728, 203142438), (34280, 207391014), (2989...",0.63382,0.564526,0.53128,0.510723
0.3,threshold,RECORD_WISE,FIELD_SERIALIZED,0.076712,0.84,0.140586,84,16,1011,2.033704,"{(34728, 203142438), (34962, 208114681), (3566...",0.94644,0.937446,0.932777,0.929866
0.2,threshold,RECORD_WISE,FIELD_SERIALIZED,0.021418,1.0,0.041938,100,0,4569,2.031068,"{(33326, 208289990), (34962, 208114681), (3367...",1.0,1.0,1.0,1.0
1.0,threshold,RECORD_WISE,FULL_SERIALIZED,0.0,0.0,0.0,0,100,0,1.744103,{},-1.0,-1.0,-1.0,-1.0
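
Since each configuration is evaluated at several thresholds, a natural follow-up is to pick the best threshold per configuration. The sketch below does this on a dictionary shaped like `evaluation_results`, filled with a few F1 values copied from the printed outputs above; the function name `best_per_configuration` is hypothetical.

```python
def best_per_configuration(results, metric="F1 Score"):
    """Map (method, embedding_comparison, embedding_method) to its best (threshold, score)."""
    best = {}
    for key, scores in results.items():
        threshold, config = key[0], tuple(key[1:])
        if config not in best or scores[metric] > best[config][1]:
            best[config] = (threshold, scores[metric])
    return best


# A subset of the F1 scores printed by the threshold runs above.
sample_results = {
    (0.5, "threshold", "RECORD_WISE", "FIELD_SERIALIZED"): {"F1 Score": 0.07547169811320754},
    (0.4, "threshold", "RECORD_WISE", "FIELD_SERIALIZED"): {"F1 Score": 0.38613861386138615},
    (0.3, "threshold", "RECORD_WISE", "FIELD_SERIALIZED"): {"F1 Score": 0.14058577405857742},
    (0.9, "threshold", "RECORD_WISE", "FULL_SERIALIZED"): {"F1 Score": 0.366412213740458},
    (0.8, "threshold", "RECORD_WISE", "FULL_SERIALIZED"): {"F1 Score": 0.5220125786163522},
    (0.7, "threshold", "RECORD_WISE", "FULL_SERIALIZED"): {"F1 Score": 0.29464285714285715},
}

best = best_per_configuration(sample_results)
for config, (threshold, f1) in best.items():
    print(config, threshold, round(f1, 3))
```

On this subset, `FIELD_SERIALIZED` peaks at threshold 0.4 and `FULL_SERIALIZED` at 0.8.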


In [27]:
name = "AbtBuy_100_mpnetBaseV2_LLama3B"
df_evaluation_results.drop(columns=["pred"]).to_csv(f"{name}.csv")
df_evaluation_results.to_pickle(f"{name}.pkl")