# Dirty ER - Workflows


This notebook implements the three basic JedAI workflows for Dirty Entity Resolution (ER).


## WorkFlow 1

![workflow1.png](attachment:40cc4ff9-3fca-4bf2-83ca-7e4a890497ef.png)

In [1]:
%pip install strsimpy

Note: you may need to restart the kernel to use updated packages.


Import libraries

In [2]:
import os
import sys
import pandas as pd
import networkx
from networkx import (
    draw,
    DiGraph,
    Graph,
)

%load_ext autoreload
%autoreload 2
%reload_ext autoreload

Import JedAI utilities

In [3]:
from utils.tokenizer import cora_text_cleaning_method
from utils.utils import print_clusters
from blocks.utils import print_blocks, print_candidate_pairs

Import the evaluation module

In [4]:
from evaluation.scores import Evaluation

### Data Reading

In [5]:
from datamodel import Data

#### CSV format

In [6]:
d1 = pd.read_csv("../data/cora/cora.csv", sep='|')
gt = pd.read_csv("../data/cora/cora_gt.csv", sep='|', header=None)
attr = ['Entity Id','author', 'title']

#### JSON format

In [None]:
d1 = pd.read_json("../data/cora/cora.json")
# The ground-truth path points at a CSV file, so read it as in the CSV cell
gt = pd.read_csv("../data/cora/cora_gt.csv", sep='|', header=None)
attr = ['author', 'title']

#### RDF format

In [None]:
import rdfpandas as rfd
import pandas as pd
import rdflib

g1 = rdflib.Graph()
g1.parse('d1.ttl', format = 'ttl')
g_gt = rdflib.Graph()
# NOTE: this parses the same file as the dataset; point it at the ground-truth file instead
g_gt.parse('d1.ttl', format = 'ttl')

d1 = rfd.graph.to_dataframe(g1)
gt = rfd.graph.to_dataframe(g_gt)

#### Relational DB

#### SPARQL

`Data` is the module that connects all steps of the workflow.

In [8]:
data = Data(
    dataset_1=d1,
    id_column_name_1='Entity Id',
    ground_truth=gt,
    attributes_1=attr
)

data.process(cora_text_cleaning_method)

### Schema Clustering

In [11]:
# from valentine import valentine_match
# from valentine.algorithms import Coma
# # Instantiate matcher and run
# matcher = Coma(strategy="COMA_OPT")
# matches = valentine_match(df1, df2, matcher)

### Block Building

In [12]:
from blocks.building import (
    StandardBlocking,
    QGramsBlocking,
    SuffixArraysBlocking,
    ExtendedSuffixArraysBlocking,
    ExtendedQGramsBlocking
)

In [13]:
blocks = StandardBlocking().build_blocks(data)

Standard Blocking - Dirty ER: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1295/1295 [00:00<00:00, 18500.40it/s]
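
For intuition, standard blocking is commonly described as token blocking: every distinct token in an entity's attribute values becomes a block key. The sketch below is a plain-Python illustration under that assumption, not the `StandardBlocking` implementation; `token_blocking` and the toy records are hypothetical.

```python
from collections import defaultdict


def token_blocking(entities):
    """Map each whitespace token to the set of entity ids that contain it."""
    blocks = defaultdict(set)
    for entity_id, text in entities.items():
        for token in set(text.lower().split()):
            blocks[token].add(entity_id)
    # A block with a single entity yields no comparisons in Dirty ER, so drop it.
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}


# Hypothetical toy records, not taken from the Cora dataset.
toy_entities = {
    0: "robert schapire boosting",
    1: "r schapire boosting margin",
    2: "yoav freund game theory",
}
toy_blocks = token_blocking(toy_entities)
print(toy_blocks)
```

Only the tokens shared by at least two records survive as blocks, which is why the third record does not appear in any block here.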


In [14]:
blocks = QGramsBlocking(
    qgrams=2
).build_blocks(data)

Q-Grams Blocking - Dirty ER: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1295/1295 [00:00<00:00, 4096.40it/s]
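
Q-grams blocking can be pictured as token blocking where each token is further split into overlapping character sequences of length `q`. A minimal sketch of that key extraction follows; the `qgrams` helper is hypothetical, and keeping tokens shorter than `q` whole is an assumption of the sketch rather than documented behaviour of `QGramsBlocking`.

```python
def qgrams(token, q=2):
    """Return the set of character q-grams of a token."""
    if len(token) < q:
        # Assumption: a token shorter than q is kept as a single key.
        return {token}
    return {token[i:i + q] for i in range(len(token) - q + 1)}


print(sorted(qgrams("cora", q=2)))  # ['co', 'or', 'ra']
```

Because short q-grams are shared by many tokens, this tends to produce fewer but much larger blocks than token blocking, which fits the block sizes printed in the next cell.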


In [15]:
print_blocks(blocks, data.is_dirty_er)

Number of blocks:  610

Block  0    contains entities with ids: 
Dirty dataset: [131 entities]
{0, 1030, 7, 520, 10, 1040, 530, 20, 1050, 540, 30, 1060, 550, 40, 1070, 560, 50, 1080, 570, 60, 1090, 580, 70, 1100, 590, 80, 1110, 600, 90, 1120, 610, 100, 1130, 620, 110, 1140, 630, 120, 1150, 640, 130, 1160, 650, 140, 1170, 660, 150, 1180, 670, 160, 1190, 680, 170, 1200, 690, 180, 1210, 700, 190, 1220, 710, 200, 1230, 720, 210, 1240, 730, 220, 1250, 740, 230, 1260, 750, 240, 1270, 760, 250, 1280, 770, 260, 1290, 780, 270, 790, 280, 800, 290, 810, 300, 820, 310, 830, 320, 840, 330, 850, 340, 860, 350, 870, 360, 880, 370, 890, 380, 900, 390, 910, 400, 920, 410, 930, 420, 940, 430, 950, 440, 960, 450, 970, 460, 980, 470, 990, 480, 1000, 490, 1010, 500, 1020, 510}

Block    p  contains entities with ids: 
Dirty dataset: [428 entities]
{0, 1, 2, 3, 4, 15, 17, 18, 20, 23, 24, 27, 30, 31, 33, 34, 36, 39, 41, 42, 46, 47, 48, 50, 51, 52, 53, 54, 55, 56, 

In [16]:
blocks = SuffixArraysBlocking(
    suffix_length=2
).build_blocks(data)

Suffix Arrays Blocking - Dirty ER: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1295/1295 [00:00<00:00, 21949.59it/s]
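
Suffix arrays blocking is usually described as using the suffixes of each token, down to a minimum length, as block keys. The sketch below illustrates only that key extraction; `suffixes` is a hypothetical helper, and both the meaning given to the minimum length and the dropping of shorter tokens are assumptions.

```python
def suffixes(token, min_length=2):
    """Return every suffix of the token that has at least min_length characters."""
    # Tokens shorter than min_length produce an empty range, hence no keys.
    return {token[i:] for i in range(len(token) - min_length + 1)}


print(sorted(suffixes("cora", min_length=2)))  # ['cora', 'ora', 'ra']
```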


In [17]:
blocks = ExtendedSuffixArraysBlocking(
    suffix_length=2
).build_blocks(data)

Extended Suffix Arrays Blocking - Dirty ER: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1295/1295 [00:00<00:00, 3637.63it/s]
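
The extended variant is generally described as considering all substrings of a token above a minimum length, not only its suffixes. A sketch of the key extraction under that assumption (`substrings` is a hypothetical helper, not the `ExtendedSuffixArraysBlocking` code):

```python
def substrings(token, min_length=2):
    """Return every contiguous substring with at least min_length characters."""
    n = len(token)
    return {
        token[start:end]
        for start in range(n)
        for end in range(start + min_length, n + 1)
    }


print(sorted(substrings("cora", min_length=2)))
```

The number of keys grows quadratically with the token length, which is one plausible reason this method runs noticeably slower than plain suffix blocking in the progress bars above.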


In [18]:
# print_blocks(blocks, data.is_dirty_er)

In [20]:
Evaluation(data).report(blocks)

+-----------------------------+
 > Evaluation
+-----------------------------+
Precision:      0.09% 
Recall:       100.00%
F1-score:       0.19%

Total pairs: 18311825
True positives: 17184
True negatives: -17456776
False positives: 18294641
False negative: 0
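
The numbers in this report are consistent with precision being the share of true positives among all block comparisons, and recall the share of ground-truth pairs that are covered. The sketch below recomputes them from the reported counts; `blocking_scores` is a hypothetical helper, not part of the `Evaluation` class. It also computes how many distinct pairs 1,295 entities can form. The reported total is far larger than that, so the blocks appear to count the same pair once per shared block, which would explain the negative true-negative figure (an inference from the numbers, not something the source states).

```python
def blocking_scores(total_pairs, true_positives, false_negatives):
    """Return (precision, recall, f1) for a set of candidate comparisons."""
    precision = true_positives / total_pairs if total_pairs else 0.0
    ground_truth = true_positives + false_negatives
    recall = true_positives / ground_truth if ground_truth else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Counts copied from the report above.
precision, recall, f1 = blocking_scores(
    total_pairs=18311825, true_positives=17184, false_negatives=0
)
print(f"Precision: {precision:.2%}  Recall: {recall:.2%}  F1-score: {f1:.2%}")

# Upper bound on distinct pairs in a single dirty dataset of n entities.
num_entities = 1295
distinct_pairs = num_entities * (num_entities - 1) // 2
print(distinct_pairs)  # 837865
```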


### Block Cleaning

In [21]:
from blocks.cleaning import (
    BlockFiltering
)

In [22]:
filtered_blocks = BlockFiltering(
    ratio=0.9
).process(blocks, data)

Block Filtering: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 25.35it/s]
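
Block filtering is commonly described as keeping each entity only in a fraction (`ratio`) of its smallest blocks, on the grounds that small blocks are the most distinctive. The following is a minimal sketch under that assumption; `filter_blocks`, its rounding rule and the toy blocks are hypothetical and may differ from what `BlockFiltering` does.

```python
from collections import defaultdict


def filter_blocks(blocks, ratio=0.9):
    """Keep each entity only in the smallest `ratio` share of its blocks."""
    # Rank block keys from the smallest block to the largest.
    order = sorted(blocks, key=lambda key: (len(blocks[key]), key))
    rank = {key: position for position, key in enumerate(order)}

    keys_per_entity = defaultdict(list)
    for key, ids in blocks.items():
        for entity_id in ids:
            keys_per_entity[entity_id].append(key)

    filtered = defaultdict(set)
    for entity_id, keys in keys_per_entity.items():
        keys.sort(key=rank.__getitem__)
        # Round half up, and always keep at least one block per entity.
        keep = max(1, int(ratio * len(keys) + 0.5))
        for key in keys[:keep]:
            filtered[key].add(entity_id)

    # Blocks left with a single entity produce no comparisons.
    return {key: ids for key, ids in filtered.items() if len(ids) > 1}


toy_blocks = {"a": {0, 1}, "b": {0, 1, 2}, "c": {0, 1, 2, 3}}
toy_filtered = filter_blocks(toy_blocks, ratio=0.5)
print(toy_filtered)
```

In the toy example the largest block disappears entirely, because every entity except one prefers its smaller blocks.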


In [23]:
# print_blocks(filtered_blocks, data.is_dirty_er)

In [24]:
Evaluation(data).report(filtered_blocks)

+-----------------------------+
 > Evaluation
+-----------------------------+
Precision:      0.13% 
Recall:       100.00%
F1-score:       0.25%

Total pairs: 13511169
True positives: 17184
True negatives: -12656120
False positives: 13493985
False negative: 0


### Comparison Cleaning - Meta Blocking

In [25]:
from blocks.purging import (
    ComparisonsBasedBlockPurging
)

In [26]:
cleaned_blocks = ComparisonsBasedBlockPurging(
    smoothing_factor=0.008
).process(blocks, data)

Comparison-based Block Purging: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1088/1088 [00:00<00:00, 218459.61it/s]


In [28]:
Evaluation(data).report(cleaned_blocks)

+-----------------------------+
 > Evaluation
+-----------------------------+
Precision:     23.53% 
Recall:         0.07%
F1-score:       0.14%

Total pairs: 51
True positives: 12
True negatives: 820654
False positives: 39
False negative: 17172


In [29]:
from blocks.comparison_cleaning import (
    WeightedEdgePruning,
    WeightedNodePruning,
    CardinalityEdgePruning,
    CardinalityNodePruning,
    BLAST,
    ReciprocalCardinalityNodePruning,
    ReciprocalCardinalityWeightPruning,
    ComparisonPropagation
)

In [30]:
candidate_pairs_blocks = WeightedEdgePruning(
    weighting_scheme='CBS'
).process(filtered_blocks, data)

Weighted Edge Pruning: 2590it [00:54, 47.89it/s]                                                                                                                                                                
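
As a rough picture of this step: meta-blocking builds a graph whose edges are candidate pairs, weights each edge, and prunes the light ones. The sketch assumes that `'CBS'` stands for the Common Blocks Scheme (weight = number of shared blocks) and that weighted edge pruning keeps edges at or above the mean weight. Both helpers are hypothetical and are not the `WeightedEdgePruning` code.

```python
from collections import Counter
from itertools import combinations


def cbs_weights(blocks):
    """Weight each pair by the number of blocks its two entities share."""
    weights = Counter()
    for ids in blocks.values():
        for pair in combinations(sorted(ids), 2):
            weights[pair] += 1
    return weights


def prune_below_mean(weights):
    """Keep the pairs whose weight is at least the mean edge weight."""
    if not weights:
        return {}
    threshold = sum(weights.values()) / len(weights)
    return {pair: weight for pair, weight in weights.items() if weight >= threshold}


toy_blocks = {"a": {0, 1}, "b": {0, 1, 2}, "c": {1, 2, 3}}
toy_weights = cbs_weights(toy_blocks)
toy_kept = prune_below_mean(toy_weights)
print(toy_kept)
```

Here the mean weight is 7 / 5 = 1.4, so only the two pairs that share two blocks survive.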


In [31]:
candidate_pairs_blocks = WeightedNodePruning(
    weighting_scheme='CBS'
).process(filtered_blocks, data)

# In one case the set of valid entities is empty and the run crashes. How should this case be handled? The Java implementation does not handle it.

Weighted Node Pruning:   0%|                                                                                                                                                           | 0/1295 [00:00<?, ?it/s]

Valid entities are:  0


Weighted Node Pruning: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1295/1295 [00:50<00:00, 25.76it/s]


In [32]:
candidate_pairs_blocks = CardinalityEdgePruning(
    weighting_scheme='CBS'
).process(filtered_blocks, data)

# In one case the set of valid entities is empty and the run crashes. How should this case be handled? The Java implementation does not handle it.

Cardinality Edge Pruning: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1295/1295 [00:28<00:00, 45.91it/s]


In [33]:
candidate_pairs_blocks = CardinalityNodePruning(
    weighting_scheme='JS'
).process(filtered_blocks, data)

# In one case the set of valid entities is empty and the run crashes. How should this case be handled? The Java implementation does not handle it.

Cardinality Node Pruning: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1295/1295 [00:26<00:00, 48.95it/s]
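
The `'JS'` weighting scheme is assumed here to be the Jaccard Scheme: the number of blocks two entities share, divided by the number of blocks that contain at least one of them. A minimal sketch under that assumption, with a hypothetical `js_weights` helper and toy blocks:

```python
from collections import Counter
from itertools import combinations


def js_weights(blocks):
    """Weight each pair by the Jaccard similarity of the entities' block sets."""
    blocks_per_entity = Counter()
    common_blocks = Counter()
    for ids in blocks.values():
        for entity_id in ids:
            blocks_per_entity[entity_id] += 1
        for pair in combinations(sorted(ids), 2):
            common_blocks[pair] += 1
    return {
        (left, right): shared
        / (blocks_per_entity[left] + blocks_per_entity[right] - shared)
        for (left, right), shared in common_blocks.items()
    }


toy_blocks = {"a": {0, 1}, "b": {0, 1, 2}, "c": {1, 2, 3}}
toy_js = js_weights(toy_blocks)
print(toy_js)
```

Unlike a raw count of shared blocks, this normalises by how many blocks each entity appears in, so entities that occur almost everywhere do not dominate the ranking.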


In [34]:
candidate_pairs_blocks = BLAST(
    weighting_scheme='JS'
).process(filtered_blocks, data)

# In one case the set of valid entities is empty and the run crashes. How should this case be handled? The Java implementation does not handle it.

BLAST: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1295/1295 [00:51<00:00, 25.10it/s]


In [35]:
candidate_pairs_blocks = ReciprocalCardinalityNodePruning(
    weighting_scheme='JS'
).process(filtered_blocks, data)

# In one case the set of valid entities is empty and the run crashes. How should this case be handled? The Java implementation does not handle it.

Reciprocal Cardinality Node Pruning: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1295/1295 [00:26<00:00, 48.20it/s]


In [36]:
candidate_pairs_blocks = ReciprocalCardinalityWeightPruning(
    weighting_scheme='JS'
).process(filtered_blocks, data)

# In one case the set of valid entities is empty and the run crashes. How should this case be handled? The Java implementation does not handle it.

Reciprocal Weighted Node Pruning:   0%|                                                                                                                                                | 0/1295 [00:00<?, ?it/s]

Valid entities are:  0


Reciprocal Weighted Node Pruning: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1295/1295 [00:51<00:00, 25.17it/s]


In [37]:
candidate_pairs_blocks = ComparisonPropagation().process(blocks, data)

# In one case the set of valid entities is empty and the run crashes. How should this case be handled? The Java implementation does not handle it.

Comparison Propagation: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1295/1295 [00:05<00:00, 245.25it/s]
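
Comparison Propagation is generally described as removing redundant comparisons, so that each pair of entities sharing at least one block is compared exactly once. A minimal sketch under that assumption, with a hypothetical `distinct_comparisons` helper and toy blocks:

```python
from itertools import combinations


def distinct_comparisons(blocks):
    """Return every pair of entities that co-occurs in at least one block, once."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


toy_blocks = {"a": {0, 1}, "b": {0, 1, 2}, "c": {1, 2, 3}}
toy_pairs = distinct_comparisons(toy_blocks)
# Comparisons counted block by block, redundancy included.
toy_total = sum(len(ids) * (len(ids) - 1) // 2 for ids in toy_blocks.values())
print(len(toy_pairs), toy_total)  # 5 7
```

With 1,295 entities there are at most 1295 · 1294 / 2 = 837,865 distinct pairs, which matches the total reported in the evaluation that follows.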


In [None]:
# print_candidate_pairs(candidate_pairs_blocks)

In [38]:
Evaluation(data).report(candidate_pairs_blocks)

+-----------------------------+
 > Evaluation
+-----------------------------+
Precision:      2.05% 
Recall:       100.00%
F1-score:       4.02%

Total pairs: 837865
True positives: 17184
True negatives: 17184
False positives: 820681
False negative: 0


### Entity Matching

In [39]:
from matching.similarity import EntityMatching

In [40]:
attr = ['author', 'title']
# or with weights
attr = {
    'author' : 0.6,
    'title' : 0.4
}

EM = EntityMatching(
    metric='jaccard', 
    similarity_threshold=0.5
    # embedings=None, # gensim
    # attributes=attr,
    # qgram=2 # for ngram metric or jaccard
)

# pairs_graph = EM.predict(blocks, data)

In [None]:
pairs_graph = EM.predict(filtered_blocks, data)

In [41]:
attr = {
    'author' : 0.6, 
    'title' : 0.4
}

EM = EntityMatching(
    metric='jaccard', 
    similarity_threshold=0.5
    # embedings=None, # gensim
    # attributes=attr,
    # qgram=2 # for ngram metric or jaccard
)

pairs_graph = EM.predict(candidate_pairs_blocks, data)

In [None]:
draw(pairs_graph)
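
The `metric='jaccard'` and `similarity_threshold=0.5` settings used for matching can be pictured as follows: each candidate pair is scored by the Jaccard similarity of its token sets, and pairs that reach the threshold become edges of the similarity graph. This is a sketch only. Whitespace tokenisation, the `>=` comparison, and the helper names `jaccard` and `match_pairs` are all assumptions, and the result is a plain dictionary rather than a networkx graph.

```python
def jaccard(left_text, right_text):
    """Jaccard similarity between the whitespace-token sets of two strings."""
    left, right = set(left_text.lower().split()), set(right_text.lower().split())
    if not left and not right:
        return 0.0
    return len(left & right) / len(left | right)


def match_pairs(entities, candidate_pairs, threshold=0.5):
    """Return {(id_a, id_b): similarity} for the pairs that reach the threshold."""
    edges = {}
    for left_id, right_id in candidate_pairs:
        similarity = jaccard(entities[left_id], entities[right_id])
        if similarity >= threshold:
            edges[(left_id, right_id)] = similarity
    return edges


# Hypothetical toy records, not taken from the Cora dataset.
toy_entities = {
    0: "boosting the margin",
    1: "boosting the margin a new explanation",
    2: "game theory",
}
toy_edges = match_pairs(toy_entities, [(0, 1), (0, 2), (1, 2)], threshold=0.5)
print(toy_edges)  # {(0, 1): 0.5}
```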

### Entity Clustering

In [None]:
from clustering.connected_components import ConnectedComponentsClustering

In [None]:
clusters = ConnectedComponentsClustering().process(pairs_graph)
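
Connected-components clustering treats every connected group of matched entities as one real-world entity, so matches are transitive: if A matches B and B matches C, all three end up in the same cluster. The sketch below shows the idea in plain Python with a hypothetical `connected_components` helper; on a networkx graph, `networkx.connected_components` computes the same thing.

```python
from collections import defaultdict


def connected_components(edges):
    """Group node ids into clusters using a depth-first traversal of the match graph."""
    adjacency = defaultdict(set)
    for left, right in edges:
        adjacency[left].add(right)
        adjacency[right].add(left)

    seen, clusters = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        clusters.append(component)
    return clusters


toy_clusters = connected_components([(0, 1), (1, 2), (3, 4)])
print(toy_clusters)
```

Entities that have no matching edge never enter the graph in this sketch, so they would need to be added as singleton clusters separately.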

In [None]:
# print_clusters(clusters)

In [None]:
e = Evaluation(data)

e.report(clusters)

In [None]:
e.confusion_matrix()

## WorkFlow 2

![workflow2.png](attachment:f449e2c7-75f0-4f05-91e6-56e9eb3a9c23.png)

## WorkFlow 3

![workflow3.png](attachment:c5c014d0-3774-4389-82d4-24a985db68a4.png)