<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>srs_v2.2/README.html</title>
</head>
<body>
<h1>SRS - Fast Approximate Nearest Neighbor Search in High Dimensional Euclidean Space With Tiny Index</h1>
<p>SRS-Mem is a C++ program for performing approximate nearest neighbor search in
high-dimensional Euclidean space in main memory. The current implementation is
adapted from our
<a href="http://www.cse.unsw.edu.au/~weiw/files/VLDB15-SRS-Final.pdf">VLDB'15 research paper</a>.
The main modification is to use an in-memory multidimensional index (rather than
an R-tree as in the paper), since our index is typically small enough to fit
entirely in main memory. Currently, the index is a modified version of the
<a href="http://hunch.net/~jl/projects/cover_tree/cover_tree.html">Cover Tree</a>, chosen for
its strong theoretical guarantees; nevertheless, any multidimensional index that
supports incremental exact <em>k</em>NN search can be used!</p>
<h2>Features</h2>
<ul>
<li><p><strong>Guaranteed Success Probability</strong></p>
<p>Theoretically, SRS guarantees to return a <em>c</em>-approximate nearest neighbor to
the query with a user-specified probability, even in the worst case. For
example, many heuristic methods will not return a near neighbor on some hard
datasets (e.g., those generated by <code>gen_hard_data</code>).</p>
<p>The SRS algorithm has several other unique theoretical properties.
The top-<em>k</em> version of SRS guarantees to return <em>c</em>-<em>k</em>-approximate nearest
neighbors with constant probability (while previous methods have no guarantee
for <em>k</em> > 1), and the <code>SRS-1</code> algorithm guarantees to return the <em>nearest
neighbor</em> (i.e., <em>c</em> = 1) to the query with any user-specified success
probability.</p></li>
<li><p><strong>Small Index Size</strong></p>
<p>The index size of SRS is substantially smaller than the size of the original
data. For example, the index for a 12GB dataset (8 million 384-dimensional
points) is only 337MB. This means that the SRS index can usually be
accommodated <em>entirely</em> in main memory.</p>
<p>As a side note, our index and query processing algorithm are independent of the
dimensionality of the dataset, i.e., they work for arbitrarily high dimensions.</p></li>
<li><p><strong>Rich Functionality</strong></p>
<p>Users can easily trade space for time by tuning parameters, even after the
index has been built. All four variants of the query processing algorithm in
the paper are supported.</p></li>
</ul>
<h2>How it works</h2>
<p>In a nutshell, SRS reduces the problem of "approximate NN search in a high-dimensional
space" to "exact T-NN search in a low-dimensional space".</p>
<p>In the indexing phase, SRS projects data points from the original
high-dimensional space into an appropriately chosen <em>m</em>-dimensional space via
2-stable random projections, and then uses a cover-tree to index these projected
points.</p>
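<p>To make the projection step concrete, here is a minimal C++ sketch of a
2-stable (Gaussian) random projection, assuming points are stored as
<code>std::vector&lt;float&gt;</code>; the actual code in this repository may
organize this differently.</p>
<pre><code>#include &lt;random&gt;
#include &lt;vector&gt;

// Draw an m x d matrix with i.i.d. N(0,1) entries (the 2-stable projections).
std::vector&lt;std::vector&lt;float&gt;&gt; make_projections(int m, int d, std::mt19937 &amp;rng) {
    std::normal_distribution&lt;float&gt; gauss(0.0f, 1.0f);
    std::vector&lt;std::vector&lt;float&gt;&gt; proj(m, std::vector&lt;float&gt;(d));
    for (auto &amp;row : proj)
        for (auto &amp;v : row) v = gauss(rng);
    return proj;
}

// Project a d-dimensional point into the m-dimensional space.
std::vector&lt;float&gt; project(const std::vector&lt;std::vector&lt;float&gt;&gt; &amp;proj,
                           const std::vector&lt;float&gt; &amp;point) {
    std::vector&lt;float&gt; out(proj.size(), 0.0f);
    for (std::size_t i = 0; i &lt; proj.size(); ++i)
        for (std::size_t j = 0; j &lt; point.size(); ++j)
            out[i] += proj[i][j] * point[j];
    return out;
}</code></pre>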
<p>The key observation is that the ratio of the squared inter-point distance in
the projected space (the square of the <em>Projected Distance</em>) to the squared
distance in the original high-dimensional space follows a chi-squared
distribution with <em>m</em> degrees of freedom, which has a sharp concentration bound.
Therefore, given any threshold on the projected distance, and for any point <em>o</em>,
we can compute exactly the probability that <em>o</em>'s projected distance is within the
threshold.</p>
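<p>With Boost, this probability is a one-line computation. Below is a minimal
sketch, assuming i.i.d. N(0,1) entries in the 2-stable projections; the
function name is illustrative rather than the repository's actual API.</p>
<pre><code>#include &lt;boost/math/distributions/chi_squared.hpp&gt;

// Probability that point o's projected distance is within threshold tau,
// given that o lies at original-space distance r from the query:
// (projected distance)^2 / r^2 follows chi-squared with m degrees of freedom.
double prob_within(double tau, double r, int m) {
    boost::math::chi_squared chi(m);
    return boost::math::cdf(chi, (tau * tau) / (r * r));
}</code></pre>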
<p>In the querying phase, SRS performs an incremental <em>k</em>-NN search in the
projected space until either it has found a satisfactory point (the
early-termination condition) or it has examined <em>t * n</em> points (the
normal-termination condition).</p>
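<p>A hedged sketch of this query loop follows, assuming an incremental
<em>k</em>-NN iterator over the projected space; <code>next_nn</code> and
<code>early_stop</code> are illustrative stand-ins, not the repository's actual
API.</p>
<pre><code>#include &lt;functional&gt;
#include &lt;limits&gt;

struct Candidate { int id; double dist; };  // dist: original-space distance

Candidate query_loop(std::function&lt;Candidate()&gt; next_nn,            // next point by projected distance
                     std::function&lt;bool(const Candidate&amp;)&gt; early_stop,
                     long n, double t) {
    Candidate best{-1, std::numeric_limits&lt;double&gt;::infinity()};
    const long limit = static_cast&lt;long&gt;(t * n);    // normal-termination budget
    for (long examined = 0; examined &lt; limit; ++examined) {
        Candidate o = next_nn();                    // incremental k-NN step
        if (o.dist &lt; best.dist) best = o;           // keep the nearer candidate
        if (early_stop(best)) break;                // early-termination condition
    }
    return best;
}</code></pre>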
<h2>Before you start</h2>
<ul>
<li>Users need to have the Boost C++ library installed (http://www.boost.org/).
Boost is used to calculate quantiles of the chi-squared distribution (see the
snippet after this list).</li>
<li>Currently the program has only been tested on Linux.</li>
<li>There are four key parameters to the SRS algorithms:
<ul>
<li><em>n</em>: the number of data points.</li>
<li><em>c</em>: the approximation ratio.</li>
<li><em>m</em>: the number of 2-stable random projections used in the index.</li>
<li><em>prob_thres</em>: the probability that the algorithm returns a <em>c</em>-approximate NN.</li>
</ul>
Typically, <em>n</em> is fixed, and one can fine-tune the last three parameters to
achieve different space/time/quality tradeoffs. Fixing any two of them places
a constraint on the third.</li>
<li>In addition to these input parameters, the program also generates an internal
parameter <em>t</em>: the fraction of data points to be examined before the search
terminates under the normal-termination condition.</li>
</ul>
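<p>For reference, here is a minimal, self-contained example of the Boost call
mentioned in the first bullet above (the values here are illustrative):</p>
<pre><code>#include &lt;boost/math/distributions/chi_squared.hpp&gt;
#include &lt;iostream&gt;

int main() {
    int m = 7;                                     // number of projections
    boost::math::chi_squared chi(m);
    std::cout &lt;&lt; boost::math::quantile(chi, 0.5)   // median of chi^2 with m d.o.f.
              &lt;&lt; std::endl;
    return 0;
}</code></pre>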
<h2>How to use SRS</h2>
<ol>
<li><p>Compile the program:</p>
<pre><code>% make all
</code></pre></li>
<li><p>Use <code>cal_param</code> to calculate a feasible parameter setting based on the
given constraints. Users can manually set either <em>m</em> or <em>the success
probability</em>. A feasible setting will be printed to the screen and can be
used later in the query processing phase. The implementation is based on
Algorithm 6 in the paper. For example (using the toy dataset with 3,000
points):</p>
<pre><code>% ./cal_param -n 3000 -m 7 -c 4
</code></pre>
<p>The following message will be printed out:</p>
<pre><code>A feasible setting is:
m = 7
prob_thres(-r) = 0.299203
T_max(-t) = 2
t = 0.000544
</code></pre>
<p>The output indicates that, in the query processing phase, users should pass
the arguments <code>-c 4</code>, <code>-m 7</code>, <code>-r 0.299203</code>, and <code>-t 2</code>.</p>
<p>As a rule of thumb, we recommend setting <em>m</em> between 6 and 10 to begin with.</p></li>
<li><p>Use <code>gen_gt</code> to generate the ground truth for a given dataset and query
workload. The ground truth file will be used when processing the query
workload. For example (using the toy dataset):</p>
<pre><code>% ./gen_gt -d 192 -n 3000 -k 10 -q data/toy.q -s data/toy.ds -g data/toy.gt -y i
</code></pre>
<p><code>-y i</code> indicates that each coordinate is an integer. The other option is <code>-y
f</code>, indicating that each coordinate is a floating-point number.</p></li>
<li><p>Use <code>srs</code> with the <code>-I</code> option to index the data. Users need to specify <em>m</em> and
<em>index path</em> in this step. For example (using the toy dataset):</p>
<pre><code>% mkdir index
% ./srs -I -d 192 -i index/ -m 7 -n 3000 -s data/toy.ds -y i
</code></pre></li>
<li><p>Use <code>srs</code> with the <code>-Q</code> option to process the query workload. The
implementation is based on Algorithm 1 in the paper and its variants. The
top-<em>k</em> approximate nearest neighbors for each query in the query workload
will be returned, together with the average ratio and time over all queries.</p>
<p>Users can use the parameter setting given by <code>cal_param</code>. For
example (using the toy dataset):</p>
<pre><code>% ./srs -Q -c 4 -g data/toy.gt -i index/ -k 10 -q data/toy.q -t 2 -r 0.299203
</code></pre>
<p>Alternatively, users can specify the parameters themselves to achieve a
different space-time trade-off.</p></li>
<li><p>Users can change the <code>-t</code> parameter to <em>n</em> (i.e., the cardinality of the
dataset) to force the algorithm to rely on the early-termination condition to
stop. This improves the result quality and slightly increases the time cost.
This is the <code>SRS-2</code> algorithm in the paper.</p></li>
<li><p>Users can change the <code>-r</code> parameter to any number larger than 1 to force the
algorithm to stop only under the normal-termination condition (i.e., after
examining <em>tn</em> data points). This substantially improves the result quality and
increases the time cost. This is the <code>SRS-1</code> algorithm in the paper.</p></li>
<li><p>Users can change the <code>-c</code> parameter to a smaller value to achieve better
quality without affecting the worst-case time cost. This is the <code>SRS-12+</code>
algorithm in the paper. The three variants are summarized in the table after
this list.</p></li>
</ol>
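<p>For quick reference, the variants described in steps 6, 7, and 8 map to
parameter settings as follows:</p>
<table>
<tr><th>Variant</th><th>Setting</th><th>Effect</th></tr>
<tr><td><code>SRS-2</code></td><td><code>-t</code> set to <em>n</em></td><td>relies on early termination only; better quality, slightly higher time cost</td></tr>
<tr><td><code>SRS-1</code></td><td><code>-r</code> larger than 1</td><td>relies on normal termination only; substantially better quality and higher time cost</td></tr>
<tr><td><code>SRS-12+</code></td><td>smaller <code>-c</code></td><td>better quality, same worst-case time cost</td></tr>
</table>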
<h2>Data Format</h2>
<ol>
<li><p>The data file should contain <em>n</em> lines, where <em>n</em> is the cardinality of the
dataset (a minimal reader sketch appears after this list). The file should be
formatted as:</p>
<pre><code>e_1_1 e_1_2 ... e_1_d
...
e_n_1 e_n_2 ... e_n_d
</code></pre>
<p>where the <code>e_i_j</code>s are integers separated by whitespace.</p></li>
<li><p>The query file should contain <em>N</em>+1 lines, where <em>N</em> is the number of queries
in the query workload. The file should be formatted as:</p>
<pre><code>N d
ID_1 e_1_1 e_1_2 ... e_1_d
...
ID_N e_N_1 e_N_2 ... e_N_d
</code></pre>
<p>where <em>d</em> is the dimensionality, each <code>e_i_j</code> is an integer, and values are
separated by whitespace.</p></li>
</ol>
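<p>The following is a minimal sketch of reading the data file format above,
assuming integer coordinates (the <code>-y i</code> case); it is illustrative, not the
loader used by <code>srs</code> itself.</p>
<pre><code>#include &lt;fstream&gt;
#include &lt;string&gt;
#include &lt;vector&gt;

// Read n points of dimensionality d, whitespace-separated.
std::vector&lt;std::vector&lt;int&gt;&gt; read_data(const std::string &amp;path, int n, int d) {
    std::ifstream in(path);
    std::vector&lt;std::vector&lt;int&gt;&gt; data(n, std::vector&lt;int&gt;(d));
    for (int i = 0; i &lt; n; ++i)
        for (int j = 0; j &lt; d; ++j)
            in &gt;&gt; data[i][j];
    return data;
}</code></pre>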
<h2>Hard Dataset</h2>
<ul>
<li><p>Users can use <code>gen_hard_data</code> to generate a hard dataset with a user-specified
cardinality, dimensionality, and approximation ratio. The dataset contains one
point that is the nearest neighbor of the query. All other points are
(c+ε)-NNs of the query (c is the user-specified approximation ratio), and these
points are distributed uniformly at random on a sphere centered at the query
point with radius c+ε (a sketch of this sampling technique appears after this
list).</p></li>
<li><p>An example of using <code>gen_hard_data</code>:</p>
<pre><code>% ./gen_hard_data -n 1000000 -d 128 -c 4 -s hard.ds -q hard.q
</code></pre>
<p>This generates a dataset containing 1,000,000 points (i.e., <code>hard.ds</code>) and a
query set containing one query (i.e., <code>hard.q</code>).</p></li>
</ul>
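<p>The uniform-on-a-sphere placement mentioned above is typically done by drawing
a Gaussian vector and rescaling it to the desired radius. Below is a minimal
sketch of that standard technique; the actual <code>gen_hard_data</code> implementation
may differ.</p>
<pre><code>#include &lt;cmath&gt;
#include &lt;random&gt;
#include &lt;vector&gt;

// Sample a point uniformly at random on the sphere of radius rad
// centered at `center` (Gaussian direction, then rescale).
std::vector&lt;double&gt; sample_on_sphere(const std::vector&lt;double&gt; &amp;center,
                                     double rad, std::mt19937 &amp;rng) {
    std::normal_distribution&lt;double&gt; gauss(0.0, 1.0);
    std::vector&lt;double&gt; p(center.size());
    double norm2 = 0.0;
    for (auto &amp;v : p) { v = gauss(rng); norm2 += v * v; }
    const double scale = rad / std::sqrt(norm2);
    for (std::size_t i = 0; i &lt; p.size(); ++i)
        p[i] = center[i] + p[i] * scale;
    return p;
}</code></pre>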
<h2>Conditions of use</h2>
<ul>
<li>SRS is distributed under the terms of the GPL License.</li>
<li>Copyright is owned by DBWang Group, University of New South Wales, Australia.</li>
</ul>
<h2>Future work</h2>
<ul>
<li>Support more input data formats.</li>
<li>Integrate SRS with other multidimensional indexing methods.</li>
</ul>
<h2>Contact</h2>
<p>Please report bugs, feature requests, comments, or suggestions to Yifang Sun
(<code>yifangs AT cse.unsw.edu.au</code>) or Wei Wang (<code>weiw AT cse.unsw.edu.au</code>).</p>
</body>
</html>