# BLAST

This example illustrates how to perform meta-blocking by using BLAST.

BLAST can only be used on clean-clean datasets (data linkage), that is, when we have two datasets that contain no internal duplicates and we want to discover the profiles that match across them.

In [1]:
import sparker
import random

## Load the data
sparkER provides wrappers to load CSV and JSON files.

First, we load the first dataset and extract its maximum profile ID. The profile IDs of the second dataset are assigned starting right after this value.

*real_id_field* is the field that contains the identifier of the record.
*source_id* identifies the dataset a profile comes from; it is needed for the attribute alignment.

In [2]:
# Profiles contained in the first dataset
profiles1 = sparker.JSONWrapper.load_profiles('../datasets/clean/DblpAcm/dataset1.json', 
                                              real_id_field = "realProfileID", 
                                              source_id=1)
# Max profile id in the first dataset, used to separate the profiles in the next phases
separator_id = profiles1.map(lambda profile: profile.profile_id).max()
# Separators, used during blocking to tell which dataset a profile belongs to. It is a list because sparkER
# can work with multiple datasets
separator_ids = [separator_id]

Let's print one profile to check that the data was loaded correctly.

In [3]:
print(profiles1.take(1)[0])

{'profile_id': 0, 'attributes': [{'key': 'venue', 'value': 'SIGMOD Record'}, {'key': 'year', 'value': '1999'}, {'key': 'title', 'value': 'Semantic Integration of Environmental Models for Application to Global Information Systems and Decision-Making'}, {'key': 'authors', 'value': 'D. Scott Mackay'}], 'original_id': '0', 'source_id': 1}
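The printed profile stores its attributes as a list of key/value pairs. As a small illustration, the sketch below groups such pairs by key. It works on a plain dictionary copied from the output above (with the title omitted for brevity), not on a sparkER profile object, and `attributes_by_key` is a hypothetical helper.

```python
# Plain-dict copy of the printed profile (title omitted); not a sparkER object.
printed_profile = {
    "profile_id": 0,
    "attributes": [
        {"key": "venue", "value": "SIGMOD Record"},
        {"key": "year", "value": "1999"},
        {"key": "authors", "value": "D. Scott Mackay"},
    ],
    "original_id": "0",
    "source_id": 1,
}


def attributes_by_key(profile):
    """Group attribute values by key; a key may occur more than once."""
    grouped = {}
    for attribute in profile["attributes"]:
        grouped.setdefault(attribute["key"], []).append(attribute["value"])
    return grouped


attrs = attributes_by_key(printed_profile)
print(attrs["venue"])
```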


Load the second dataset and extract its maximum profile ID (it will be used later).

In [4]:
profiles2 = sparker.JSONWrapper.load_profiles('../datasets/clean/DblpAcm/dataset2.json', 
                                              start_id_from = separator_id+1, 
                                              real_id_field = "realProfileID", 
                                              source_id=2)
# Max profile id
max_profile_id = profiles2.map(lambda profile: profile.profile_id).max()

Finally, concatenate the two RDDs

In [5]:
profiles = profiles1.union(profiles2)
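To make the role of the separator concrete, here is a minimal pure-Python sketch of the ID scheme described above: consecutive IDs for the first dataset, IDs continuing after the separator for the second. The helpers `assign_ids` and `dataset_of` and the toy records are assumptions made for illustration; they are not part of sparkER.

```python
def assign_ids(records, start_id_from=0):
    """Pair each record with a consecutive integer ID."""
    return [(start_id_from + i, record) for i, record in enumerate(records)]


def dataset_of(profile_id, separator_ids):
    """Return the 0-based index of the dataset that owns profile_id."""
    for index, separator in enumerate(separator_ids):
        if profile_id <= separator:
            return index
    return len(separator_ids)


toy_profiles1 = assign_ids(["a1", "a2", "a3"])           # hypothetical records
toy_separator = max(pid for pid, _ in toy_profiles1)     # largest ID of dataset 1
toy_profiles2 = assign_ids(["b1", "b2"], start_id_from=toy_separator + 1)
toy_profiles = toy_profiles1 + toy_profiles2

print([pid for pid, _ in toy_profiles])
print(dataset_of(2, [toy_separator]), dataset_of(3, [toy_separator]))
```

Because the IDs of the two datasets never overlap, a single comparison against the separator is enough to tell where a profile comes from.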

### Groundtruth (optional)
If you have a groundtruth, you can measure the performance at each blocking step.

The loaded groundtruth contains the original profile IDs, so it must be converted to use the IDs assigned to each profile by Spark.

In [6]:
# Loads the groundtruth, takes as input the path of the file and the names of the attributes that represent
# respectively the id of profiles of the first dataset and the id of profiles of the second dataset
gt = sparker.JSONWrapper.load_groundtruth('../datasets/clean/DblpAcm/groundtruth.json', 'id1', 'id2')

In [7]:
# Converts the groundtruth by replacing original IDs with those given by Spark
new_gt = sparker.Converters.convert_groundtruth(gt, profiles1, profiles2)

In [8]:
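# We can explore some pairs (list() keeps this working if new_gt is a set,
# which random.sample no longer accepts in recent Python versions)
random.sample(list(new_gt), 2)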

[(51, 4816), (1496, 4445)]
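The conversion step can be pictured as two lookups, one per dataset. The sketch below is only an illustration of that idea; `convert_pairs` and the toy IDs are hypothetical, and the actual behaviour of `convert_groundtruth` (for example how it treats unknown IDs) may differ.

```python
def convert_pairs(pairs, ids1, ids2):
    """Translate (original_id1, original_id2) pairs into assigned IDs.

    Pairs that mention an unknown original ID are skipped (an assumption).
    """
    converted = set()
    for original1, original2 in pairs:
        if original1 in ids1 and original2 in ids2:
            converted.add((ids1[original1], ids2[original2]))
    return converted


# original ID -> assigned ID, one map per dataset because the same
# original ID ("x" here) can occur in both datasets
toy_ids1 = {"x": 0, "y": 1}
toy_ids2 = {"x": 2, "z": 3}
toy_gt = [("x", "x"), ("y", "z"), ("y", "missing")]

toy_converted = convert_pairs(toy_gt, toy_ids1, toy_ids2)
print(sorted(toy_converted))
```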

## Attributes alignment

BLAST employs LSH to automatically align the attributes.

If *compute_entropy* is set to True, the entropy of each attribute cluster is also computed; it can be used to further improve the meta-blocking performance.

The *target_threshold* parameter sets how similar the values of two attributes must be for the attributes to be clustered together.

In [9]:
clusters = sparker.AttributeClustering.cluster_similar_attributes(profiles,
                                  num_hashes=128,
                                  target_threshold=0.5,
                                  compute_entropy=True)
clusters

[{'cluster_id': 0, 'keys': ['2_title', '1_title'], 'entropy': 9.280896484656086},
 {'cluster_id': 1, 'keys': ['1_authors', '2_authors'], 'entropy': 10.71793208243355},
 {'cluster_id': 2, 'keys': ['1_year', '2_year'], 'entropy': 3.309290823680249},
 {'cluster_id': 3, 'keys': ['2_venue', '1_venue'], 'entropy': 3.3225607247542883}]
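The two quantities involved here can be illustrated without LSH. The sketch below assumes that attribute similarity is the Jaccard similarity of the token sets of the attribute values (the kind of set similarity that LSH schemes such as MinHash approximate) and that entropy is the base-2 Shannon entropy of the token distribution. Both are assumptions for illustration; sparkER's exact definitions may differ, and the toy values are made up.

```python
import math
from collections import Counter


def tokens(values):
    """Lower-cased whitespace tokens of all values of an attribute."""
    return [tok for value in values for tok in value.lower().split()]


def jaccard(values_a, values_b):
    """Jaccard similarity between the token sets of two attributes."""
    a, b = set(tokens(values_a)), set(tokens(values_b))
    return len(a & b) / len(a | b) if a | b else 0.0


def shannon_entropy(values):
    """Base-2 Shannon entropy of the token distribution of an attribute."""
    counts = Counter(tokens(values))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


titles_1 = ["query optimization", "data integration"]
titles_2 = ["query processing", "data integration"]
years_1 = ["1999", "1999", "2000", "2000"]

print(jaccard(titles_1, titles_2))   # 3 shared tokens out of 5 distinct ones
print(shannon_entropy(years_1))      # two equally likely tokens
```

Under these assumptions, an attribute with few distinct values (such as a year) has low entropy, which matches the ordering seen in the clusters above: titles and authors score much higher than year and venue.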

## Blocking
Now we can perform blocking by using the generated clusters

In [10]:
blocks = sparker.Blocking.create_blocks_clusters(profiles, clusters, separator_ids)
print("Number of blocks",blocks.count())

Number of blocks 7120
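As a rough picture of what blocking with attribute clusters means, the sketch below builds token blocks whose key is the pair (token, cluster id), so the same token in two unrelated attributes ends up in different blocks. It is a simplified illustration, not sparkER's implementation: `build_blocks` is hypothetical, attributes without a cluster are simply skipped, and only blocks containing profiles from both datasets are kept.

```python
from collections import defaultdict


def build_blocks(profiles, key_to_cluster, separator_id):
    """profiles: iterable of (profile_id, source_id, {attribute: value})."""
    blocks = defaultdict(lambda: (set(), set()))
    for profile_id, source_id, attributes in profiles:
        side = 0 if profile_id <= separator_id else 1
        for key, value in attributes.items():
            cluster = key_to_cluster.get(f"{source_id}_{key}")
            if cluster is None:
                continue  # simplification: ignore unclustered attributes
            for token in value.lower().split():
                blocks[(token, cluster)][side].add(profile_id)
    # In the clean-clean setting a block is useful only if it spans both datasets.
    return {k: v for k, v in blocks.items() if v[0] and v[1]}


toy_clusters = {"1_title": 0, "2_title": 0, "1_authors": 1, "2_authors": 1}
toy_input = [
    (0, 1, {"title": "data integration", "authors": "smith"}),
    (1, 2, {"title": "integration of data", "authors": "smith data"}),
]
toy_blocks = build_blocks(toy_input, toy_clusters, separator_id=0)
print(sorted(toy_blocks))
```

Note that the token "data" in the authors of profile 1 does not join the title block for "data", because the two attributes belong to different clusters.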


## Block cleaning

sparkER implements two block cleaning strategies:

* Block purging: discards the largest blocks, which involve too many comparisons. The parameter must be >= 1. A lower value means more aggressive purging.
* Block filtering: removes every profile from the largest blocks in which it appears. The parameter lies in the open interval ]0, 1[. A lower value means more aggressive filtering.

In [11]:
# Performs the purging
blocks_purged = sparker.BlockPurging.block_purging(blocks, 1.005)

In [12]:
# Performs the filtering
(profile_blocks, profile_blocks_filtered, blocks_after_filtering) = sparker.BlockFiltering.block_filtering_quick(blocks_purged, 0.8, separator_ids)
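The two strategies can be sketched in plain Python on a toy block collection. This is a simplified illustration under stated assumptions: `purge` uses a hypothetical hard cap on the comparisons per block (sparkER's purging parameter is a different kind of knob), and `filter_blocks` keeps each profile in the `ceil(ratio * n)` smallest of its `n` blocks, which is one common way to define block filtering.

```python
import math


def purge(blocks, max_comparisons):
    """Drop blocks whose comparison count exceeds a cap (hypothetical rule)."""
    return {bid: (a, b) for bid, (a, b) in blocks.items()
            if len(a) * len(b) <= max_comparisons}


def filter_blocks(blocks, ratio):
    """Keep each profile only in the ceil(ratio * n) smallest of its n blocks."""
    size = {bid: len(a) + len(b) for bid, (a, b) in blocks.items()}
    profile_blocks = {}
    for bid, (a, b) in blocks.items():
        for pid in a | b:
            profile_blocks.setdefault(pid, []).append(bid)
    keep = {}
    for pid, bids in profile_blocks.items():
        bids.sort(key=lambda bid: (size[bid], bid))
        keep[pid] = set(bids[:math.ceil(ratio * len(bids))])
    filtered = {}
    for bid, (a, b) in blocks.items():
        a_kept = {p for p in a if bid in keep[p]}
        b_kept = {p for p in b if bid in keep[p]}
        if a_kept and b_kept:  # a block must still span both datasets
            filtered[bid] = (a_kept, b_kept)
    return filtered


# block id -> (profiles of dataset 1, profiles of dataset 2)
toy = {"b1": ({0}, {3}), "b2": ({0, 1}, {3, 4}), "b3": ({0, 1, 2}, {3, 4, 5})}
toy_purged = purge(toy, max_comparisons=4)
toy_filtered = filter_blocks(toy_purged, ratio=0.5)
print(sorted(toy_purged), toy_filtered)
```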

If you have the groundtruth, you can check the quality of the block collection after every blocking step.

In [13]:
recall, precision, cmp_n = sparker.Utils.get_statistics(blocks_after_filtering, max_profile_id, new_gt, separator_ids)

print("Recall", recall)
print("Precision", precision)
print("Number of comparisons", cmp_n)

Recall 1.0
Precision 0.03026509172064667
Number of comparisons 73484
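For reference, the sketch below shows one way such statistics can be computed on a toy block collection: recall is the share of groundtruth pairs that co-occur in at least one block, and precision is the share of candidate comparisons that are true matches. Counting each distinct pair once is an assumption; `blocking_statistics` is a hypothetical helper, not the sparkER function.

```python
def candidate_pairs(blocks):
    """All distinct cross-dataset pairs suggested by the blocks."""
    pairs = set()
    for left, right in blocks.values():
        pairs.update((a, b) for a in left for b in right)
    return pairs


def blocking_statistics(blocks, groundtruth):
    """Return (recall, precision, number of comparisons)."""
    pairs = candidate_pairs(blocks)
    found = len(pairs & groundtruth)
    recall = found / len(groundtruth)
    precision = found / len(pairs) if pairs else 0.0
    return recall, precision, len(pairs)


toy_collection = {"b1": ({0}, {3}), "b2": ({0, 1}, {3, 4})}
toy_truth = {(0, 3), (1, 4), (2, 5)}
toy_recall, toy_precision, toy_cmp = blocking_statistics(toy_collection, toy_truth)
print(toy_recall, toy_precision, toy_cmp)
```

The numbers printed above are consistent with this reading: a precision of about 0.0303 over 73484 comparisons corresponds to roughly 2224 matching pairs, all of which are retained since the recall is 1.0.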


## Meta-blocking
Meta-blocking can be used to further refine the block collection by removing superfluous comparisons.


For every partition of the RDD, the pruning algorithm returns as output a triplet that contains:

* The number of edges
* The number of matches (only if the groundtruth is provided)
* The retained edges

Before running the meta-blocking, some data structures have to be created.

In [14]:
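# sc is not defined earlier in this notebook; getOrCreate() reuses the
# active SparkContext if one exists
from pyspark import SparkContext
sc = SparkContext.getOrCreate()

block_index_map = blocks_after_filtering.map(lambda b : (b.block_id, b.profiles)).collectAsMap()
block_index = sc.broadcast(block_index_map)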

block_entropies = sc.broadcast(blocks.map(lambda b : (b.block_id, b.entropy)).collectAsMap())

# This is only needed for certain weight measures
profile_blocks_size_index = sc.broadcast(profile_blocks_filtered.map(lambda pb : (pb.profile_id, len(pb.blocks))).collectAsMap())

# Broadcasted groundtruth
gt_broadcast = sc.broadcast(new_gt)
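The structures above are lookup tables: block id to profiles, and profile id to number of blocks. The pure-Python sketch below shows, on toy data, how such indexes let a pruning algorithm count the blocks two profiles share without scanning the whole collection. The names and the `neighbours` helper are hypothetical and only illustrate the idea.

```python
# block id -> (profiles of dataset 1, profiles of dataset 2)
toy_block_index = {"b1": ({0}, {3}), "b2": ({0, 1}, {3, 4})}

# profile id -> ids of the blocks that contain it
toy_profile_blocks = {}
for block_id, (left, right) in toy_block_index.items():
    for pid in left | right:
        toy_profile_blocks.setdefault(pid, set()).add(block_id)

# profile id -> number of blocks (needed by some weighting schemes)
toy_profile_blocks_size = {pid: len(b) for pid, b in toy_profile_blocks.items()}


def neighbours(pid):
    """Count the blocks a dataset-1 profile shares with each dataset-2 profile."""
    shared = {}
    for block_id in toy_profile_blocks[pid]:
        for other in toy_block_index[block_id][1]:
            shared[other] = shared.get(other, 0) + 1
    return shared


print(neighbours(0))
print(toy_profile_blocks_size)
```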

### Meta-blocking with BLAST
BLAST employs the $\chi^2$ weighting scheme.

The *chi2divider* parameter regulates how aggressive the pruning is: a lower value prunes more aggressively.
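As a rough illustration of a $\chi^2$ edge weight, the sketch below assumes the weight is Pearson's chi-square statistic on the 2x2 contingency table of block co-occurrence for two profiles (shared blocks, blocks of only one profile, blocks of neither). This is an assumption made for the example; it leaves out the entropy scaling enabled by `use_entropy`, and `chi_square_weight` is a hypothetical helper rather than sparkER code.

```python
def chi_square_weight(common, blocks_i, blocks_j, total_blocks):
    """Pearson chi-square on the block co-occurrence table of two profiles."""
    n11 = common                                        # blocks with both profiles
    n12 = blocks_i - common                             # blocks with i only
    n21 = blocks_j - common                             # blocks with j only
    n22 = total_blocks - blocks_i - blocks_j + common   # blocks with neither
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    denominator = row1 * row2 * col1 * col2
    if denominator == 0:
        return 0.0  # degenerate table: no evidence either way
    return total_blocks * (n11 * n22 - n12 * n21) ** 2 / denominator


# Two profiles that always appear together: strong association.
print(chi_square_weight(common=2, blocks_i=2, blocks_j=2, total_blocks=4))
# Co-occurrence exactly as expected by chance: no association.
print(chi_square_weight(common=1, blocks_i=2, blocks_j=2, total_blocks=4))
```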

In [15]:
results = sparker.WNP.wnp(
                          profile_blocks_filtered,
                          block_index,
                          max_profile_id,
                          separator_ids,
                          weight_type=sparker.WeightTypes.CHI_SQUARE,
                          groundtruth=gt_broadcast,
                          profile_blocks_size_index=profile_blocks_size_index,
                          use_entropy=True,
                          blocks_entropies=block_entropies,
                          chi2divider=2.0
                         )
num_edges = results.map(lambda x: x[0]).sum()
num_matches = results.map(lambda x: x[1]).sum()
print("Recall", num_matches/len(new_gt))
print("Precision", num_matches/num_edges)
print("Number of comparisons",num_edges)

Recall 0.9986510791366906
Precision 0.31924680178237747
Number of comparisons 6957
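The effect of the divider can be sketched as follows, assuming the pruning rule commonly described for BLAST: each profile gets a local threshold equal to its largest edge weight divided by the divider, and an edge is kept when its weight reaches the mean of the thresholds of its two endpoints. The `prune` helper and the toy weights are hypothetical, weights are assumed non-negative, and sparkER's implementation may differ in its details.

```python
from collections import defaultdict


def prune(edges, divider=2.0):
    """edges: {(i, j): weight}. Keep edges that reach the mean local threshold."""
    local_max = defaultdict(float)
    for (i, j), weight in edges.items():
        local_max[i] = max(local_max[i], weight)
        local_max[j] = max(local_max[j], weight)
    kept = {}
    for (i, j), weight in edges.items():
        threshold = (local_max[i] / divider + local_max[j] / divider) / 2
        if weight >= threshold:
            kept[(i, j)] = weight
    return kept


toy_edges = {(0, 3): 10.0, (0, 4): 2.0, (1, 4): 6.0, (1, 3): 1.0}
print(sorted(prune(toy_edges, divider=2.0)))    # low divider: only the strongest edges
print(sorted(prune(toy_edges, divider=10.0)))   # high divider: every edge survives
```

A lower divider raises every local threshold, which is why it prunes more aggressively.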


## Collecting edges after meta-blocking
As mentioned before, the third element of the tuples returned by the meta-blocking contains the edges.


Edges are weighted according to the weight strategy provided to the meta-blocking.

In [16]:
edges = results.flatMap(lambda x: x[2])

edges.take(10)

[(0, 2733, 474746.74542385637),
 (404, 4257, 323134.03437346633),
 (404, 4292, 377061.62799117394),
 (1448, 3092, 533286.1049898568),
 (1448, 2689, 28001.924248551786),
 (2612, 2818, 9460.396018690539),
 (2412, 3616, 634811.2979628429),
 (2412, 3037, 24872.747932579347),
 (1284, 3052, 65034.464088759465),
 (1284, 2686, 859823.5641014529)]
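Since each edge carries its weight, the retained comparisons can be ranked before being passed to a matcher. The sketch below does this on a local copy of the ten edges printed above; the variable names are illustrative.

```python
# Local copy of the edges shown in the output above: (profile 1, profile 2, weight)
sample_edges = [
    (0, 2733, 474746.74542385637),
    (404, 4257, 323134.03437346633),
    (404, 4292, 377061.62799117394),
    (1448, 3092, 533286.1049898568),
    (1448, 2689, 28001.924248551786),
    (2612, 2818, 9460.396018690539),
    (2412, 3616, 634811.2979628429),
    (2412, 3037, 24872.747932579347),
    (1284, 3052, 65034.464088759465),
    (1284, 2686, 859823.5641014529),
]

# Rank by weight, highest first, and keep the three most promising pairs.
top_edges = sorted(sample_edges, key=lambda edge: edge[2], reverse=True)[:3]
top_pairs = [(p1, p2) for p1, p2, _ in top_edges]
print(top_pairs)
```

On the RDD itself, `edges.top(3, key=lambda edge: edge[2])` should give the same kind of ranking without collecting every edge to the driver.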