Welcome to the AWS_ALSH GitHub!
AWS_ALSH is a package for the problem of Nearest Neighbor Search (NNS) over generalized weighted space. Given a dataset of n data points and a query q with a weight vector w in the Eculidean space, this problem aims to find the nearest point to q such that the weighted Euclidean distance is minimized. It is a fundamental problem, and it has wide applications in many data mining and machine learning tasks.
In the package (AWS_ALSH), we introduce two Asymmetric Locality-Sensitive Hashing (ALSH) schemes that can deal with this problem in sublinear time and work on Arbitraty Weighted Space (AWS). We call them weight-oblivious. The details of AWS_ALSH can be found from our work Sublinear Time Nearest Neighbor Search over Generalized Weighted Space which has been published in ICML 2019.
We provide three datasets Mnist, MovieLens, and Sift for evaluations. They can be downloaded from OneDrive. The statistics of the datasets are summarized as follows.
Datasets | #Data Points | #Queries | Dimensionality |
---|---|---|---|
Mnist | 60,000 | 1,000 | 784 |
MovieLens | 52,889 | 1,000 | 150 |
Sift | 1,000,000 | 1,000 | 128 |
For each query set, we generate the weight vector w's from five typical distributions, whihc are illustrated as follows.
Types | Illustrations |
---|---|
identical | all "1"s |
binary | uniformly distributed in {0,1}d |
normal | d-dimensional normal distributions N(0,I) |
uniform | uniformly distributed in [0,1]d |
negative | all "-1"s |
To build the project, use the following instructions:
mkdir build
cd build
cmake ..
make -j
In order to run the scripts, the datasets/query should be put in the data/
folder (You should create such a folder by yourself and copy the datasets and queries to this folder), or modify the scripts/dataset_config.py
to change the path of datasets and queries accordingly.
We provide an example as follows to illustrate how to run this package:
./alsh -A fraction_recall_s2alsh -n 60000 -q 1000 -d 784 -D ../data/Mnist784/Mnist784.ds -Q ../data/Mnist784/Mnist784.q -W ../data/Mnist784/Mnist784_normal.w -G ../data/Mnist784/Mnist784_normal.gt -O output.out -U 3.140000 --data_hash_filename data_hash.dh --query_hash_filename query_hash.qh
We also provide the python scripts on the scripts/
folder that can be used to run the precision_recall and fraction_recall experiments automatically, e.g.
python ../scripts/run_ground_truth.py
python ../scripts/run_precision_recall.py
python ../scripts/run_fraction_recall.py
The results for the precision-recall experiments can be found as follows.
If you need more details to run this package, please type the following command to get more information.
./alsh --help
Please use the following bibtex to cite this work when you use AWS_ALSH in your paper.
@inproceedings{lei2019sublinear,
title={Sublinear time nearest neighbor search over generalized weighted space},
author={Lei, Yifan and Huang, Qiang and Kankanhalli, Mohan and Tung, Anthony},
booktitle={International Conference on Machine Learning},
pages={3773--3781},
year={2019},
organization={PMLR}
}
It is welcome to contact Lei Yifan (leiyifan@u.nus.edu) or Huang Qiang (huangq@comp.nus.edu.sg) if you meet any issue. Thank you.