Locality-Sensitive Hashing: a Primer
Locality-Sensitive Hashing (LSH) is a class of methods for the nearest neighbor search problem, which is defined as follows: given a dataset of points in a metric space (e.g., R^d with the Euclidean distance), our goal is to preprocess the dataset so that we can quickly answer nearest neighbor queries: given a previously unseen query point, we want to find one or several points in our dataset that are closest to the query point. LSH is one of the main techniques for nearest neighbor search in high dimensions (but there are also many others, e.g., see the corresponding Wikipedia article).
In a nutshell, LSH is a way to randomly partition the ambient space into cells that respect the desired similarity metric. The core building blocks of LSH are locality-sensitive hash functions. We say that a hash function (or, more precisely, a family of hash functions) is locality-sensitive if the following property holds: pairs of points that are close together are more likely to collide than pairs of points that are far apart. LSH data structures use locality-sensitive hash functions in order to partition the space in which the dataset lives: every possible hash value essentially corresponds to its own cell.
A good partition of our data space will allow us to answer nearest neighbor queries more efficiently: instead of comparing the query point with all points in the dataset, we first find the hash cell our query lies in. If we have a good partition that respects our desired similarity metric, the nearest neighbor will probably lie in the same hash cell as the query point. So we only need to compare the query point to the candidates in the current hash cell, which can be significantly fewer points than the entire dataset. This is how LSH enables us to answer nearest neighbor queries efficiently.
An example: Hyperplane LSH
We first consider a simple example hash function that is locality sensitive. Suppose the data points lie on a unit sphere in a d-dimensional space (Euclidean distance on a sphere corresponds to the cosine similarity). In order to partition the sphere, we then sample a random hyperplane through the center of the sphere and cut the sphere across it, which gives two equal cells. Intuitively, this approach gives a locality-sensitive hash function because any two close points will almost always be on the same side of a hyperplane, while pairs of points that are far apart tend to be separated by the random hyperplane (in the most extreme case of points that are opposite, any hyperplane will separate the points).
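The hyperplane hash described above can be sketched in a few lines of numpy. This is an illustrative sketch, not FALCONN's actual API; the names (`hyperplane_hash`, `normal`) are made up for this example. The hyperplane is represented by its normal vector, and the hash value is the side of the hyperplane a point falls on:

```python
# Minimal sketch of a single hyperplane hash function, assuming the data
# points are unit vectors in R^d (names are illustrative, not FALCONN's API).
import numpy as np

rng = np.random.default_rng(0)
d = 4
# A random hyperplane through the origin, represented by its normal vector.
normal = rng.standard_normal(d)

def hyperplane_hash(point):
    # Returns 0 or 1 depending on which side of the hyperplane `point` lies.
    return int(np.dot(normal, point) >= 0)

p = np.array([1.0, 0.0, 0.0, 0.0])
q = -p  # antipodal point: separated by every hyperplane through the origin
```

Here `hyperplane_hash(p)` and `hyperplane_hash(q)` differ, matching the extreme case mentioned above: opposite points are always separated.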
Using LSH for Similarity Search
We now show how to use LSH for similarity search. A first idea would be to sample a partition (equivalently, sample a locality-sensitive hash function), create a bucket for every cell, and group the data points into these buckets. This procedure can be implemented efficiently using a "standard" (i.e., not locality-sensitive) hash table. Given a query, we would then treat all the data points that are in the same bucket as the query point as potential nearest neighbors and compute the distance between each of these points and the query. But the hyperplane hash function example outlined above shows that this may not always be good enough. Indeed, with a single hyperplane we only have two cells. As a result, we would need to test roughly half of the data points when answering a single query!
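The single-partition scheme above can be sketched with a plain Python dictionary playing the role of the "standard" hash table (again, an illustrative sketch with made-up names, not FALCONN's implementation):

```python
# Sketch: group unit-norm data points into buckets using ONE hyperplane hash.
# With a single hyperplane there are only two buckets, so a query still has
# to scan roughly half of the dataset.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(1)
d, n = 8, 100
data = rng.standard_normal((n, d))
data /= np.linalg.norm(data, axis=1, keepdims=True)  # project onto the unit sphere

normal = rng.standard_normal(d)  # one random hyperplane through the origin

table = defaultdict(list)  # standard (non-locality-sensitive) hash table
for i, p in enumerate(data):
    table[int(p @ normal >= 0)].append(i)

# To answer a query, look up its bucket and treat its contents as candidates.
query = data[0]
candidates = table[int(query @ normal >= 0)]
```

Since every point lands in one of only two buckets, `candidates` contains about half of the indices, which is exactly the problem the next section addresses.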
Multiple hash functions and tables
A better idea is to use many partitions (hash functions) at once. Instead of sampling one partition, we sample K independent partitions. Now, buckets correspond to size-K tuples of cells in the corresponding partitions. For instance, for the case of Hyperplane LSH we have 2^K buckets. So for a given query, we might hope to enumerate only a 1/2^K fraction of the data points (caveat: this is a bit of an oversimplification, and the reality is slightly more complicated). By choosing K wisely, we achieve a good query time.
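Concatenating K hyperplane hashes into one bucket key can be sketched as follows (illustrative names, not FALCONN's API):

```python
# Sketch: combine K independent hyperplane hashes into a K-bit bucket key,
# so that there are 2**K possible buckets instead of 2.
import numpy as np

rng = np.random.default_rng(2)
d, K = 8, 4
hyperplanes = rng.standard_normal((K, d))  # K independent random hyperplanes

def hash_key(point):
    # One bit per hyperplane; the tuple of K bits identifies the bucket.
    return tuple((hyperplanes @ point >= 0).astype(int))

key = hash_key(rng.standard_normal(d))  # e.g., a tuple like (0, 1, 1, 0)
```

Two points collide only if they fall on the same side of all K hyperplanes, which makes the buckets much finer than with a single hyperplane.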
Are we done? Not quite. The reason is that we forgot to look at the probability of success (finding the nearest neighbor). If we choose K to be large enough, the probability of the query and the corresponding nearest neighbor falling into the same bucket will be tiny (remember that each hash function has a certain probability of mapping points to different cells, even if the points are close). To cope with this challenge, we repeat the whole process several times. Now, instead of having a single hash table, we have L independent hash tables. Given a query, we now perform L independent hash table lookups and get a set of L candidate buckets (one per table). We then use all data points in the L buckets as candidates and compute the distances between the query and each of these points. In total, this construction samples K * L partitions (hash functions).
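Putting the pieces together, the full construction with L independent tables of K hyperplanes each can be sketched like this (a toy sketch under assumed names; FALCONN's real implementation is considerably more optimized):

```python
# Sketch: L independent hash tables, each built from its own K hyperplanes,
# for a total of K * L hash functions. A query collects candidates from the
# L matching buckets and then does a linear scan over just those candidates.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(3)
d, n, K, L = 16, 200, 8, 10
data = rng.standard_normal((n, d))
data /= np.linalg.norm(data, axis=1, keepdims=True)

tables = []
for _ in range(L):
    H = rng.standard_normal((K, d))        # this table's K hyperplanes
    buckets = defaultdict(list)
    for i, p in enumerate(data):
        buckets[tuple((H @ p >= 0).astype(int))].append(i)
    tables.append((H, buckets))

def query_nn(q):
    # One bucket lookup per table, then a scan over the union of candidates.
    candidates = set()
    for H, buckets in tables:
        candidates.update(buckets.get(tuple((H @ q >= 0).astype(int)), []))
    if not candidates:
        return None
    cand = list(candidates)
    dists = np.linalg.norm(data[cand] - q, axis=1)
    return cand[int(np.argmin(dists))]

nn = query_nn(data[0])  # a point is always a candidate for itself
```

Even if a single table misses the true nearest neighbor, one of the other L - 1 tables is likely to catch it, which is exactly why repeating the construction boosts the probability of success.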
How should one choose K and L? If K is too small, we have too many data points per bucket, and the query time is high. On the other hand, if K is too large, we need L to be large so that we still get a good probability of success. But a large value of L leads to high space consumption (many tables) and slow queries (many hash functions to compute). While this might sound discouraging, there is a sweet spot between these extremes! It is usually possible to choose K and L such that we get a significant speed-up compared to a linear scan over the entire data set.
However, it turns out that we need the number of tables L to be pretty large in order to get good query times in many practical scenarios: often, we need L somewhere between 100 and 1000. This makes the space consumption prohibitive if we are working with large data sets. Luckily, there is a way to reduce the number of tables significantly. The idea is to query more than one bucket in each table. Instead of querying only a single bucket per table (L buckets in total), we now query T > L buckets in total across all tables, where T is a user-specified parameter. A multiprobe scheme chooses these T buckets so that they are (approximately) the most likely buckets to contain the nearest neighbor. So by increasing T, we have a higher probability of finding the nearest neighbor in one of the tables, which allows us to decrease L.
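One simple multiprobe heuristic for hyperplane LSH can be sketched as follows. This is an illustrative simplification, not FALCONN's actual probing sequence: besides the query's own bucket, we also probe the buckets obtained by flipping the bits with the smallest margins |⟨hyperplane, query⟩|, since those are the bits most likely to differ for the true nearest neighbor:

```python
# Sketch of a simple multiprobe heuristic for hyperplane LSH: generate up to
# `num_probes` bucket keys per table, starting with the query's own bucket
# and then flipping the least-confident bits (smallest |score|) first.
import numpy as np

def probe_keys(H, q, num_probes):
    scores = H @ q                      # signed distances to the K hyperplanes
    base = (scores >= 0).astype(int)    # the query's own bucket key
    keys = [tuple(base)]
    for j in np.argsort(np.abs(scores)):  # least-confident bits first
        if len(keys) >= num_probes:
            break
        flipped = base.copy()
        flipped[j] ^= 1
        keys.append(tuple(flipped))
    return keys

rng = np.random.default_rng(4)
K, d = 4, 8
H = rng.standard_normal((K, d))
keys = probe_keys(H, rng.standard_normal(d), num_probes=3)
```

Each additional probe per table raises the chance of hitting the nearest neighbor's bucket, which is what lets us get away with a smaller L.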
All in all, we have three parameters:
- K, the number of hash functions (space partitions) per hash table.
- L, the number of hash tables (each of which has K hash functions).
- T, the number of probes (total number of buckets probed across all hash tables).
Usually, it is a good idea to choose L first based on the available memory. Then, we have a trade-off between K and T: the larger K is, the more probes we need to achieve a given probability of success, and vice versa. The best way to choose K and T is usually the following parameter search: try increasing values of K, and for each value of K, find the smallest number of probes T so that we get the desired accuracy on a set of sample queries. Varying the parameter T does not require rebuilding the hash tables (as opposed to varying K or L). Moreover, we can search over T using a binary search. Usually, this means that we can find the optimal parameter setting fairly quickly.
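The binary search over T mentioned above can be sketched as follows, assuming a user-supplied `accuracy(T)` function (the fraction of sample queries answered correctly with T probes), which is monotone non-decreasing in T:

```python
# Sketch: binary search for the smallest number of probes T that reaches a
# target accuracy on sample queries. `accuracy` is an assumed user-supplied
# function that is monotone non-decreasing in T.
def smallest_t(accuracy, target, t_max):
    lo, hi = 1, t_max
    while lo < hi:
        mid = (lo + hi) // 2
        if accuracy(mid) >= target:
            hi = mid      # mid probes suffice; try fewer
        else:
            lo = mid + 1  # mid probes are not enough; need more
    return lo             # returns t_max if the target is never reached
```

For example, with a toy accuracy curve `accuracy(t) = min(1.0, t / 50)` and a target of 0.9, the search returns T = 45. Running this search for each candidate K and keeping the (K, T) pair with the best query time completes the parameter tuning.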
See our bibliography for a list of papers about LSH that are relevant to FALCONN.