Implementation of
- Deep Indexed Active Learning for Matching Heterogeneous Entity Representations. Arjit Jain, Sunita Sarawagi, Prithviraj Sen
Traditional methods for Active Learning Pairwise classification tasks follow a pipeline as described: In each iteration, the learning algorithm (learner) learns a matcher (shown in an ellipse which we use to denote model components) from labeled data 𝑇, the labeled pairs collected from the (human) labeler so far, while the example selector (selector) chooses the most informative unlabeled pairs to acquire labels for. After including the new labels into 𝑇, the process repeats until we learn a matcher of sufficient quality.
Our proposed integrated matcher-blocker combination and new AL workflow as shown. Compared to the previous diagram, the two most notable differences are
- the blocker (dashed box) is now part of the AL feedback loop, and
- the matcher is a component within the blocker. As base matcher, we use transformer-based pretrained language models (TPLM) which have recently led to excellent ER accuracies in the passive (non-AL) settings.
This code has been tested on a machine with 64 2.10GHz Intel Xeon Silver 4216 CPUs with 1007GB RAM and a single NVIDIA Titan Xp 12 GB GPU with CUDA 10.2 running Ubuntu 18.04
The first step is to get the data. We provide the data used in DeepMatcher experiments (Link1 Link2 Link3) The multilingual data can be downloaded from salesforce/localization-xml-mt
cd MultiLingual
git clone https://github.com/salesforce/localization-xml-mt.git
Now create a virtual environment using conda
conda create -n DIAL_env
conda activate DIAL_env
conda install -y -c conda-forge -c pytorch pytorch==1.6 cudatoolkit=10.2
pip install faiss-cpu transformers scikit-learn pandas
Use run_single.sh to run DIAL. Example
bash run_single.sh DIAL amazon_google_exp
To evaluate on Test, run
bash run_eval.sh Eval-Test DIAL amazon_google_exp
and to evaluated on All Pairs, run
bash run_eval.sh Eval-AllPairs DIAL amazon_google_exp
Currently supports : Walmart-Amazon, Amazon-Google, DBLP-ACM, DBLP-Google Scholar, Abt-Buy
To run experiments with the multilingual dataset,
cd MultiLingual
bash run_multilingual_expts.sh DIAL-Multilingual
If you use this code for your research, please consider citing our arXiv preprint
@misc{jain2021deep,
title={Deep Indexed Active Learning for Matching Heterogeneous Entity Representations},
author={Arjit Jain and Sunita Sarawagi and Prithviraj Sen},
year={2021},
eprint={2104.03986},
archivePrefix={arXiv},
primaryClass={cs.DB}
}
References: