# Evaluation of Data Integration

In this notebook, we evaluate how effectively two relations can be integrated using the soft join operator.
For this, we use the [Datasets for DeepMatcher paper](https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md).

Each dataset contains two relations that describe the same entities from two different sources.
For example, `iTunes-Amazon` contains song records obtained from iTunes and Amazon, so the task is to identify the records that refer to the same song.

To test different models, datasets, and other settings, adjust the parameters in the [Modifications](#modifications) section.
**Modify code in the [Modifications](#modifications) section only!**

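We calculate the following counts from the sets of true and predicted matches:
* $ TP = |\text{True Matches} \cap \text{Predicted Matches}| $
* $ FN = |\text{True Matches} \setminus \text{Predicted Matches}| $
* $ FP = |\text{Predicted Matches} \setminus \text{True Matches}| $

To determine the scores:
* $ \text{Precision} = \frac{TP}{TP + FP} $
* $ \text{Recall} = \frac{TP}{TP + FN} $
* $ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $
* BLEU-1 to BLEU-4 as a representativeness measure between the ground-truth matches and the join result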
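
The set definitions and score formulas above can be sketched on a toy example. This is only an illustration: the function name `match_scores` and the key pairs are made up for this sketch and are not part of the notebook's code or data.

```python
def match_scores(true_matches, predicted_matches):
    """Compute precision, recall and F1 from two sets of (left_key, right_key) pairs."""
    tp = len(true_matches & predicted_matches)   # pairs found and correct
    fn = len(true_matches - predicted_matches)   # true pairs that were missed
    fp = len(predicted_matches - true_matches)   # predicted pairs that are wrong

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0
    return {"tp": tp, "fn": fn, "fp": fp, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical key pairs, chosen only to exercise all three counts
true_matches = {(1, 10), (2, 20), (3, 30), (4, 40)}
predicted_matches = {(1, 10), (2, 20), (3, 31)}

scores = match_scores(true_matches, predicted_matches)
print(scores)
```

In this toy case two of the three predictions are correct and two of the four true matches are found, so precision is 2/3 and recall is 1/2.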

## Imports

In [11]:
import os
import tqdm
import requests
import zipfile

import pandas as pd
import numpy as np

from db.operators import Dummy, InnerSoftJoin

from models import ModelMgr
from models.embedding.SentenceTransformer import SentenceTransformerEmbeddingModel
from models.semantic_validation import LLaMAValidationModel

import nltk
from nltk.translate.bleu_score import sentence_bleu
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import SmoothingFunction

nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /home/nico/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/nico/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## Modifications

In [12]:
dataset = ("Structured", "iTunes-Amazon") # The tested dataset
data_path = "../data/"  # Data Path (usually remains unchanged)

# Models
m = ModelMgr()
stem = SentenceTransformerEmbeddingModel(m)
sv = LLaMAValidationModel(m)

threshold = 0.80
method = "threshold"
embedding_comparison = "COLUMN_WISE"

significant_columns_left = ["Album_Name", "Artist_Name", "Released", "Song_Name", "Time"]
significant_columns_right = ["Album_Name", "Artist_Name", "Released", "Song_Name", "Time"]
# significant_columns_left, significant_columns_right = ["BeerName"], ["BeerName"]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

## Function and Dataset Declarations

In [13]:
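# Available datasets with metadata.
# Only "Beer" and "iTunes-Amazon" are fully configured; the remaining entries list the archive file name only.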

datasets = {
    "Structured": {
        "Beer": {"filename": "beer_raw_data.zip",  "matches": "labeled_data.csv", "key_column": "Label", "label_column": "gold", "drop_columns": ["_id"]},
        "iTunes-Amazon": {"filename": "itunes_amazon_raw_data.zip",  "matches": "labeled_data.csv", "key_column": "Sno", "label_column": "label", "drop_columns": ["Unnamed: 0", "_id"]},
        "Fodors-Zagats": "fodors_zagat_raw_data.zip",
        "DBLP-ACM": "dblp_acm_raw_data.zip",
        "DBLP-GoogleScholar": "dblp_scholar_raw_data.zip",
        "Amazon-Google": "amazon_google_raw_data.zip",
        "Walmart-Amazon": "walmart_amazon_raw_data.zip",
    },
    "Textual": {
        "Abt-Buy": "abt_buy_raw_data.zip",
        "Company": "company_raw_data.zip"
    }
}

dataset_data = datasets[dataset[0]][dataset[1]]

In [14]:
def compute_bleu_representativeness(input_dataset, integrated_dataset, use_n_grams=4, smoothing_function=SmoothingFunction().method1):
    """
    Compute BLEU-based representativeness score between input and integrated datasets.

    :param input_dataset: List of text entries from the input dataset
    :param integrated_dataset: List of text entries from the integrated dataset
    :param use_n_grams: Number of n-grams to use (BLEU 1, 2, 3, 4)
    :param smoothing_function: Smoothing function to use for computing BLEU
    :return: Average BLEU score as the representativeness measure
    """
    assert use_n_grams in (1,2,3,4)
    weights = np.array([0., 0., 0., 0.])
    weights[:use_n_grams] = 1.0/float(use_n_grams)

    integrated_tokens = [word_tokenize(entry.lower()) for entry in integrated_dataset]
    input_tokens = [word_tokenize(entry.lower()) for entry in input_dataset]

    bleu_scores = []

    for input_entry in input_tokens:
        # Keep the best BLEU score of this entry against any integrated entry
        best_score = max(
            sentence_bleu([integrated_entry], input_entry, smoothing_function=smoothing_function, weights=weights)
            for integrated_entry in integrated_tokens
        )
        bleu_scores.append(best_score)

    return np.average(bleu_scores)

def download_file(url, save_file):
    """
    Download deep matcher dataset

    :param url: URL to download from
    :param save_file: File to save to
    """
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        with open(save_file, "wb") as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)

def process_dataset(ds):
    """
    Download a DeepMatcher dataset and return a DataFrame with the labeled candidate pairs

    :param ds: Tuple (category, name) of the dataset to download and process
    """
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    url = f"https://pages.cs.wisc.edu/~anhai/data1/deepmatcher_data/{ds[0]}/{ds[1]}/{dataset_data['filename']}"
    save_path = f"{data_path}{ds[0]}/{ds[1]}/"
    save_file = save_path + dataset_data["filename"]

    os.makedirs(save_path, exist_ok=True)

    if not os.path.exists(save_file):
        download_file(url, save_file)

    # Unzip the file
    with zipfile.ZipFile(save_file, "r") as zip_ref:
        zip_ref.extractall(save_path)

    #pd.read_csv(save_path + dataset_data["tableA"], encoding="unicode_escape"),
    #pd.read_csv(save_path + dataset_data["tableB"], encoding="unicode_escape"),

    return pd.read_csv(save_path + dataset_data["matches"], encoding="unicode_escape", skiprows=5)


#table_a, table_b, matches = process_dataset(dataset)
candidates = process_dataset(dataset) # table_a.head(), table_b.head()
candidates.head()

Unnamed: 0.1,Unnamed: 0,_id,ltable.Sno,rtable.Sno,ltable.Album_Name,ltable.Artist_Name,ltable.CopyRight,ltable.Released,ltable.Song_Name,ltable.Time,rtable.Album_Name,rtable.Artist_Name,rtable.CopyRight,rtable.Released,rtable.Song_Name,rtable.Time,label
0,916,916,111,53124,vhs,x ambassadors,2015 kidinakorner/interscope records,30-Jun-15,vhs outro (interlude),1:25,vhs [explicit],x ambassadors,(c) 2015 kidinakorner/interscope records,"June 30, 2015",vhs outro (interlude) [explicit],1:25,1
1,1053,1053,148,50767,title (deluxe),meghan trainor,"2014, 2015 epic records, a division of sony m...",9-Jan-15,credit,2:51,title (deluxe),meghan trainor,"2011 what a music ltd, licence exclusive parl...","January 9, 2015",credit,2:51,1
2,1290,1290,206,41214,slow down (remixes),selena gomez,"2013 hollywood records, inc.",20-Aug-13,slow down (smash mode remix),5:21,slow down remixes,selena gomez,"(c) 2013 hollywood records, inc.","August 20, 2013",slow down (smash mode remix),5:21,1
3,1424,1424,211,19812,slow down (reggae remixes) - single,selena gomez,"2013 hollywood records, inc.",20-Aug-13,slow down (sure shot rockers reggae dub remix),3:15,good for you (remixes),selena gomez,(c) 2015 interscope records,"September 4, 2015",good for you (yellow claw & cesqeaux remix) [f...,3:01,0
4,1706,1706,250,53111,vhs,x ambassadors,2015 kidinakorner/interscope records,30-Jun-15,vhs outro (interlude),1:25,vhs [explicit],x ambassadors,(c) 2015 kidinakorner/interscope records,"June 30, 2015",first show (interlude),0:11,0


## Evaluation

### Determine Matching Set and Data

In [15]:
matches = candidates[candidates[dataset_data["label_column"]] == 1]
matches = matches.drop(columns=dataset_data["drop_columns"])

gt = {(x[f"ltable.{dataset_data['key_column']}"], x[f"rtable.{dataset_data['key_column']}"]) for _, x in matches.iterrows()}
print(str(gt)[0: 50], "...")

matches.head()

{(2743, 17193), (6533, 38335), (5713, 35091), (149 ...


Unnamed: 0,ltable.Sno,rtable.Sno,ltable.Album_Name,ltable.Artist_Name,ltable.CopyRight,ltable.Released,ltable.Song_Name,ltable.Time,rtable.Album_Name,rtable.Artist_Name,rtable.CopyRight,rtable.Released,rtable.Song_Name,rtable.Time,label
0,111,53124,vhs,x ambassadors,2015 kidinakorner/interscope records,30-Jun-15,vhs outro (interlude),1:25,vhs [explicit],x ambassadors,(c) 2015 kidinakorner/interscope records,"June 30, 2015",vhs outro (interlude) [explicit],1:25,1
1,148,50767,title (deluxe),meghan trainor,"2014, 2015 epic records, a division of sony m...",9-Jan-15,credit,2:51,title (deluxe),meghan trainor,"2011 what a music ltd, licence exclusive parl...","January 9, 2015",credit,2:51,1
2,206,41214,slow down (remixes),selena gomez,"2013 hollywood records, inc.",20-Aug-13,slow down (smash mode remix),5:21,slow down remixes,selena gomez,"(c) 2013 hollywood records, inc.","August 20, 2013",slow down (smash mode remix),5:21,1
5,250,53124,vhs,x ambassadors,2015 kidinakorner/interscope records,30-Jun-15,vhs outro (interlude),1:25,vhs [explicit],x ambassadors,(c) 2015 kidinakorner/interscope records,"June 30, 2015",vhs outro (interlude) [explicit],1:25,1
7,252,53004,vhs,x ambassadors,2015 kidinakorner/interscope records,30-Jun-15,vhs outro (interlude),1:25,vhs,x ambassadors,(c) 2015 kidinakorner/interscope records,"June 30, 2015",vhs outro (interlude),1:26,1


In [16]:
columns_left = [col.replace("ltable.", "") for col in matches.columns if col.startswith("ltable")]
columns_right = [col.replace("rtable.", "") for col in matches.columns if col.startswith("rtable")]
print(columns_left, columns_right, sep="\n")

['Sno', 'Album_Name', 'Artist_Name', 'CopyRight', 'Released', 'Song_Name', 'Time']
['Sno', 'Album_Name', 'Artist_Name', 'CopyRight', 'Released', 'Song_Name', 'Time']


In [17]:
data_left = list(set([tuple([row[f"ltable.{col}"] for col in columns_left]) for _, row in matches.iterrows()]))
data_right = list(set([tuple([row[f"rtable.{col}"] for col in columns_right]) for _, row in matches.iterrows()]))
print(data_left[0], data_right[0], sep="\n")

(6048, 'artpop', 'lady gaga', '__æ 2013 interscope records', '11-Nov-13', 'g.u.y.', '3:52')
(9123, 'caldwell county ep', 'eric church', ' (c) 2011 emi records nashville', ' January 14, 2011', 'chevy van', '2:48')


### Execute Operator and Evaluate

In [18]:
op1 = Dummy("ltable", columns_left, data_left)
op2 = Dummy("rtable", columns_right, data_right)

op = InnerSoftJoin(op1, op2, em=stem, sv=sv, threshold=threshold, method=method, columns_left=significant_columns_left, columns_right=significant_columns_right, embedding_comparison=embedding_comparison)

result = [x for x in tqdm.tqdm(op.open())]
pred = {(r[f"ltable.{dataset_data['key_column']}"], r[f"rtable.{dataset_data['key_column']}"]) for r in result}

151it [00:01, 132.30it/s]
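
The internals of `InnerSoftJoin` are not shown in this notebook. As a rough mental model only, a join with `method="threshold"` could be imagined as keeping every pair of records whose embeddings have a cosine similarity of at least `threshold`. The sketch below makes that assumption explicit with hand-written toy vectors; the helpers `cosine` and `threshold_join` are hypothetical and do not reflect the operator's actual code, its column-wise comparison, or the validation model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def threshold_join(left, right, threshold):
    """Return all (left_key, right_key) pairs whose vectors are at least `threshold` similar.

    ASSUMPTION: this nested-loop comparison only illustrates the idea of a
    threshold-based soft join; it is not the InnerSoftJoin implementation.
    """
    return {
        (left_key, right_key)
        for left_key, left_vec in left.items()
        for right_key, right_vec in right.items()
        if cosine(left_vec, right_vec) >= threshold
    }


# Toy embeddings keyed by record id (made up for this sketch)
left = {1: [1.0, 0.0], 2: [0.0, 1.0]}
right = {10: [0.9, 0.1], 20: [0.1, 0.9], 30: [0.7, 0.7]}

pairs = threshold_join(left, right, threshold=0.80)
print(sorted(pairs))
```

Record `30` lies halfway between both left records, with a similarity of about 0.71 to each, so it is dropped at a threshold of 0.80. Lowering the threshold would admit it as a match for both, which trades precision for recall.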


In [19]:
tps, fns, fps = gt & pred, gt - pred, pred - gt
tp, fn, fp = len(tps), len(fns), len(fps)

values = {"tp": tp, "fn": fn, "fp": fp}

precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1_score = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

scores = { "Precision": precision, "Recall": recall, "F1 Score": f1_score, }
print(scores)

{'Precision': 0.8211920529801324, 'Recall': 0.9393939393939394, 'F1 Score': 0.8763250883392226}


In [20]:
joined_column = [f"ltable.{c}" for c in columns_left] + [f"rtable.{c}" for c in columns_right]

serialized_results = [", ".join(str(v) for v in x.values()) for x in result]
serialized_ground_truth = [", ".join(str(row[col]) for col in joined_column) for _, row in matches.iterrows()]

print("Sample for matched records: ", serialized_results[0], serialized_ground_truth[0], sep="\n")

print("\nBLEU Scores: ")
for i in range(4):
    print(f"\tbleu{i+1}: {compute_bleu_representativeness(serialized_ground_truth, serialized_results, use_n_grams=i+1):0.4}")

Sample for matched records: 
3851, caldwell county - ep, eric church,    2011 emi records nashville. all rights reserved. unauthorized reproduction is a violation of applicable laws.  manufactured by capitol records nashville, 3322 west end avenue, 11th floor, nashville, tn   37203, 14-Jan-11, chevy van, 2:48, 9123, caldwell county ep, eric church,  (c) 2011 emi records nashville,  January 14, 2011, chevy van, 2:48
111, vhs, x ambassadors,  2015 kidinakorner/interscope records, 30-Jun-15, vhs outro (interlude), 1:25, 53124, vhs [explicit], x ambassadors,  (c) 2015 kidinakorner/interscope records,  June 30, 2015, vhs outro (interlude) [explicit], 1:25

BLEU Scores: 
	bleu1: 0.9728
	bleu2: 0.9648
	bleu3: 0.9603
	bleu4: 0.9571
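
The representativeness score takes, for each ground-truth entry, the best BLEU score against any entry of the join result and then averages these best scores. The sketch below shows that structure with a hand-written, unsmoothed BLEU-1 (clipped unigram precision times brevity penalty) and whitespace tokenization. It is a simplification for illustration, not the NLTK `sentence_bleu` call with smoothing used in the notebook, and the names `bleu1` and `representativeness` are hypothetical.

```python
import math
from collections import Counter


def bleu1(reference, hypothesis):
    """Unsmoothed BLEU-1 of a tokenized hypothesis against one tokenized reference."""
    if not hypothesis:
        return 0.0
    ref_counts = Counter(reference)
    hyp_counts = Counter(hypothesis)
    # Clipped unigram matches: a token counts at most as often as it occurs in the reference
    clipped = sum(min(count, ref_counts[token]) for token, count in hyp_counts.items())
    precision = clipped / len(hypothesis)
    # Brevity penalty: penalize hypotheses that are shorter than the reference
    if len(hypothesis) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(hypothesis))
    return brevity_penalty * precision


def representativeness(input_entries, integrated_entries):
    """Average, over all input entries, of the best BLEU-1 against any integrated entry."""
    integrated_tokens = [entry.lower().split() for entry in integrated_entries]
    best_scores = []
    for entry in input_entries:
        tokens = entry.lower().split()
        best_scores.append(max(bleu1(reference, tokens) for reference in integrated_tokens))
    return sum(best_scores) / len(best_scores)


# Toy strings standing in for serialized records
score = representativeness(["a b c", "x y"], ["a b c", "x z"])
print(score)
```

Here the first input entry is reproduced exactly (best score 1.0) and the second shares one of its two tokens with its closest result (best score 0.5), so the average is 0.75. Because only the best match per entry counts, additional wrong rows in the join result do not lower this measure.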
