# Clean-Clean ER - Workflows


This notebook implements the three basic JedAI workflows for Clean-Clean Entity Resolution.


## ![workflow1.png](attachment:40cc4ff9-3fca-4bf2-83ca-7e4a890497ef.png)

In [1]:
%pip install strsimpy

Note: you may need to restart the kernel to use updated packages.


Library imports

In [2]:
import os
import sys
import pandas as pd
import networkx
from networkx import (
    draw,
    DiGraph,
    Graph,
)

%load_ext autoreload
%autoreload 2
%reload_ext autoreload

Import JedAI utilities

In [3]:
from utils.tokenizer import cora_text_cleaning_method
from utils.utils import print_clusters
from blocks.utils import print_blocks, print_candidate_pairs

Import the evaluation module

In [4]:
from evaluation.scores import Evaluation

### Data Reading

In [5]:
from datamodel import Data

data = Data(
    dataset_1=pd.read_csv(
        "../data/abt-buy/D2Aemb.csv",
        sep='|'
    ).astype(str), 
    dataset_2=pd.read_csv(
        "../data/abt-buy/D2Bemb.csv",
        sep='|'
    ).astype(str),  
    ground_truth=pd.read_csv("../data/abt-buy/D2groundtruth.csv", sep='|'),
)

data.process(cora_text_cleaning_method)

In [6]:
data.print_specs()

Type of Entity Resolution:  Clean-Clean
Number of entities in D1:  1076
Number of entities in D2:  1076
Total number of entities:  2152
Attributes provided:  ['Id', 'Name', 'Aggregate Value', 'Embedded Name', 'Embedded Ag.Value', 'Clean Name', 'Embedded Clean Name', 'Clean Ag.Value', 'Embedded Clean Ag.Value']


### Schema Clustering

In [7]:
# TODO valentine

### Block Building

In [8]:
from blocks.building import (
    StandardBlocking,
    QGramsBlocking
)

In [9]:
blocks = StandardBlocking().build_blocks(data)

Standard Blocking - Clean-Clean ER (1): 100%|██████████| 1076/1076 [00:19<00:00, 56.39it/s]
Standard Blocking - Clean-Clean ER (2): 100%|██████████| 1076/1076 [00:19<00:00, 56.23it/s]
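
To make the idea behind Standard Blocking concrete, here is a minimal sketch of token blocking for two clean datasets. It is not the `StandardBlocking` implementation: the function `token_blocking`, the toy dictionaries and the whitespace tokenization are all assumptions made for illustration.

```python
from collections import defaultdict

def token_blocking(dataset_1, dataset_2):
    """Group entities from two clean datasets by shared tokens.

    Each dataset maps an entity id to its text. A block is kept only if it
    holds at least one entity from each dataset, because Clean-Clean ER
    never compares two entities of the same source.
    """
    index_1 = defaultdict(set)
    index_2 = defaultdict(set)
    for entity_id, text in dataset_1.items():
        for token in text.lower().split():
            index_1[token].add(entity_id)
    for entity_id, text in dataset_2.items():
        for token in text.lower().split():
            index_2[token].add(entity_id)
    # Keep only the tokens that occur in both sources.
    return {
        token: (index_1[token], index_2[token])
        for token in index_1.keys() & index_2.keys()
    }

# Hypothetical toy data, not taken from abt-buy.
toy_d1 = {0: "Sony Bravia TV", 1: "Canon PowerShot camera"}
toy_d2 = {2: "sony tv stand", 3: "Nikon camera bag"}
toy_blocks = token_blocking(toy_d1, toy_d2)
```

Here the shared tokens `sony`, `tv` and `camera` become block keys, and every other token is discarded because it appears in only one source.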


In [10]:
blocks = QGramsBlocking(
    qgrams=2
).build_blocks(data)

Q-Grams Blocking - Clean-Clean ER (1): 100%|██████████| 1076/1076 [00:19<00:00, 55.05it/s]
Q-Grams Blocking - Clean-Clean ER (2): 100%|██████████| 1076/1076 [00:20<00:00, 52.85it/s]
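
Q-grams blocking uses character q-grams instead of whole tokens as block keys, which makes the keys more tolerant of typos. The sketch below shows how such keys could be derived for `q=2`. The helper names `qgrams` and `qgram_keys`, and the rule for tokens shorter than `q`, are assumptions rather than the behaviour of `QGramsBlocking`.

```python
def qgrams(token, q=2):
    """Return the set of character q-grams of a token.

    Tokens no longer than q are returned whole so they still yield a key.
    """
    if len(token) <= q:
        return {token}
    return {token[i:i + q] for i in range(len(token) - q + 1)}

def qgram_keys(text, q=2):
    """Collect the q-grams of every whitespace-separated token in a text."""
    keys = set()
    for token in text.lower().split():
        keys |= qgrams(token, q)
    return keys

toy_keys = qgram_keys("Sony TV")
```

With these keys, `sony` and a misspelling such as `sonny` still share `so`, `on` and `ny`, so the two entities end up in common blocks.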


In [11]:
# print_blocks(blocks, data.is_dirty_er)

In [12]:
Evaluation().report(blocks, data)

+----------+
 Evaluation
+----------+
Precision:      0.00% 
Recall:         0.46%
F1-score:       0.00%
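
As a rough guide to reading these numbers, the sketch below computes pair-level precision and recall for a set of Clean-Clean blocks. It assumes that each distinct candidate pair is counted once and that the ground truth is a set of id pairs. The `Evaluation` module may count comparisons differently, so `evaluate_blocks` is a hypothetical helper, not its implementation.

```python
def evaluate_blocks(blocks, ground_truth):
    """Compute pair-level precision and recall for Clean-Clean blocks.

    blocks: dict mapping a key to (ids_from_d1, ids_from_d2).
    ground_truth: set of (id_d1, id_d2) pairs that truly match.
    """
    candidates = set()
    for ids_1, ids_2 in blocks.values():
        for a in ids_1:
            for b in ids_2:
                candidates.add((a, b))
    true_positives = len(candidates & ground_truth)
    # Guard both ratios against empty denominators.
    precision = true_positives / len(candidates) if candidates else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Hypothetical toy input.
toy_eval_blocks = {
    "sony": ({0}, {2}),
    "tv": ({0}, {2, 3}),
    "camera": ({1}, {3}),
}
toy_truth = {(0, 2), (1, 3)}
toy_precision, toy_recall = evaluate_blocks(toy_eval_blocks, toy_truth)
```

Blocking usually proposes far more pairs than there are true matches, so a very low precision at this stage is expected. The later cleaning steps exist to raise it.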


### Block Cleaning

In [13]:
from blocks.cleaning import (
    BlockFiltering
)

In [14]:
filtered_blocks = BlockFiltering(
    ratio=0.9
).process(blocks, data)

Block Filtering: 100%|██████████| 3/3 [00:00<00:00,  3.50it/s]
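
The `ratio` parameter can be read as "keep each entity in this share of its smallest blocks". The sketch below illustrates that idea on a single id space. The function name, the floor rounding with a minimum of one block, and the tie-breaking by key are assumptions, and `BlockFiltering` may differ in all three.

```python
import math
from collections import defaultdict

def block_filtering(blocks, ratio=0.9):
    """Keep each entity only in the smallest `ratio` share of its blocks.

    blocks: dict mapping a block key to a set of entity ids.
    """
    # Smaller blocks are assumed to be more distinctive, so rank by size.
    ranked_keys = sorted(blocks, key=lambda k: (len(blocks[k]), k))
    keys_per_entity = defaultdict(list)
    for key in ranked_keys:
        for entity_id in blocks[key]:
            keys_per_entity[entity_id].append(key)

    filtered = defaultdict(set)
    for entity_id, keys in keys_per_entity.items():
        # Assumption: round down, but always keep at least one block.
        keep = max(1, math.floor(ratio * len(keys)))
        for key in keys[:keep]:
            filtered[key].add(entity_id)
    # A block with a single entity yields no comparison, so drop it.
    return {k: v for k, v in filtered.items() if len(v) > 1}

# Hypothetical toy input.
toy_sized_blocks = {"a": {1, 2}, "b": {1, 2, 3}, "c": {1, 2, 3, 4}}
toy_filtered = block_filtering(toy_sized_blocks, ratio=0.5)
```

With `ratio=0.5`, entities 1 and 2 stay only in their smallest block `a`, while `b` and `c` shrink to one entity each and are dropped.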


In [15]:
# print_blocks(filtered_blocks, data.is_dirty_er)

In [16]:
Evaluation().report(filtered_blocks, data)

+----------+
 Evaluation
+----------+
Precision:      0.00% 
Recall:         0.56%
F1-score:       0.00%


### Comparison Cleaning - Meta Blocking

In [17]:
from blocks.comparison_cleaning import (
    WeightedEdgePruning
)

In [18]:
%%time
candidate_pairs_blocks = WeightedEdgePruning(
    weighting_scheme='CBS'
).process(filtered_blocks, data)

Weighted Edge Pruning: 100%|██████████| 4304/4304 [11:52<00:00,  6.04it/s]

CPU times: total: 11min 48s
Wall time: 11min 53s
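
For intuition, Weighted Edge Pruning with the CBS (Common Blocks Scheme) weighting can be sketched as follows: each candidate pair is weighted by the number of blocks it shares, and pairs below the mean weight are discarded. This is a simplified illustration. The function name and the `>=` comparison at the threshold are assumptions, not the `WeightedEdgePruning` code.

```python
from collections import Counter
from itertools import product

def weighted_edge_pruning_cbs(blocks):
    """Prune candidate pairs whose CBS weight is below the mean weight.

    blocks: dict mapping a key to (ids_from_d1, ids_from_d2).
    CBS weights a pair by the number of blocks the two entities share.
    """
    weights = Counter()
    for ids_1, ids_2 in blocks.values():
        for pair in product(ids_1, ids_2):
            weights[pair] += 1
    if not weights:
        return {}
    # Global threshold: the average weight over all distinct pairs.
    threshold = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= threshold}

# Hypothetical toy input.
toy_wep_blocks = {
    "sony": ({0}, {2}),
    "tv": ({0}, {2, 3}),
    "camera": ({1}, {3}),
}
toy_pruned = weighted_edge_pruning_cbs(toy_wep_blocks)
```

The pair `(0, 2)` shares two blocks and survives. The other two pairs share one block each, which is below the mean of 4/3, so they are pruned.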





In [19]:
print_candidate_pairs(candidate_pairs_blocks)

Number of blocks:  2086

Entity id  0  is candidate with: 
- Number of candidates: [656 entities]
{2048, 2049, 2050, 2051, 2052, 2053, 2054, 2055, 2056, 2057, 2059, 2061, 2063, 2064, 2065, ...}


In [20]:
Evaluation().report(candidate_pairs_blocks, data)

ZeroDivisionError: float division by zero
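
The traceback is truncated, so the exact cause is not visible here. One plausible source is an F1 computation in which precision and recall are both zero, which makes the denominator of the harmonic mean zero. A guarded version could look like the sketch below. `f1_score` is a hypothetical helper, not part of the `Evaluation` module.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

toy_f1_zero = f1_score(0.0, 0.0)
toy_f1_half = f1_score(0.5, 0.5)
```

The same guard applies to precision when no comparisons remain, and to recall when the ground truth is empty.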

### Entity Matching

In [None]:
from matching.similarity import EntityMatching

In [None]:
attr = ['Name', 'Aggregate Value']
# or with weights
attr = {
    'Name' : 0.6, 
    'Aggregate Value' : 0.4
}

EM = EntityMatching(
    metric='jaccard', 
    similarity_threshold=0.5
    # embedings=None, # gensim
    # attributes=attr,
    # qgram=2 # for ngram metric or jaccard
)

pairs_graph = EM.predict(blocks, data)
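
To show what `metric='jaccard'` with `similarity_threshold=0.5` and weighted attributes could mean in practice, here is a minimal sketch. It assumes token-level Jaccard and a weighted sum over attributes. The helper names and toy records are hypothetical, and `EntityMatching` may combine attributes differently.

```python
def jaccard(text_a, text_b):
    """Jaccard similarity of the token sets of two strings."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def weighted_similarity(entity_a, entity_b, weights):
    """Combine per-attribute Jaccard scores using attribute weights."""
    return sum(
        weight * jaccard(entity_a.get(name, ""), entity_b.get(name, ""))
        for name, weight in weights.items()
    )

# Hypothetical toy records using the attribute names of this dataset.
toy_weights = {"Name": 0.6, "Aggregate Value": 0.4}
toy_a = {"Name": "Sony Bravia TV", "Aggregate Value": "40 inch LCD"}
toy_b = {"Name": "Sony Bravia", "Aggregate Value": "40 inch LCD TV"}
toy_score = weighted_similarity(toy_a, toy_b, toy_weights)
toy_is_match = toy_score >= 0.5  # same threshold as in the cell above
```

The names share 2 of 3 tokens and the aggregate values 3 of 4, so the weighted score is 0.6 · 2/3 + 0.4 · 3/4 = 0.7, which is above the threshold.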

In [None]:
pairs_graph = EM.predict(filtered_blocks, data)

In [None]:
%%time

attr = {
    'Name' : 0.6, 
    'Aggregate Value' : 0.4
}

EM = EntityMatching(
    metric='jaccard', 
    similarity_threshold=0.5
)

pairs_graph = EM.predict(candidate_pairs_blocks, data)

In [None]:
draw(pairs_graph)

### Entity Clustering

In [None]:
from clustering.connected_components import ConnectedComponentsClustering

In [None]:
clusters = ConnectedComponentsClustering().process(pairs_graph)
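
Connected-components clustering treats every matched pair as an edge and groups all entities that are reachable from one another. The sketch below does this with a small union-find structure over plain edge tuples. It is an illustration only: `ConnectedComponentsClustering` operates on the graph returned by `EM.predict`, whose internals are not shown here.

```python
def connected_components(edges):
    """Cluster entity ids linked by matching edges, using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Group every seen node under its final root.
    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())

# Hypothetical matched pairs.
toy_edges = [(0, 1076), (1076, 5), (3, 1100)]
toy_clusters = connected_components(toy_edges)
```

Because matches are transitive under this scheme, entities 0 and 5 land in the same cluster through their shared neighbour 1076, even though they were never matched directly.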

In [None]:
print_clusters(clusters)

### Evaluation

In [None]:
Evaluation().report(clusters, data)

## ![workflow2.png](attachment:f449e2c7-75f0-4f05-91e6-56e9eb3a9c23.png)

## ![workflow3.png](attachment:c5c014d0-3774-4389-82d4-24a985db68a4.png)