forked from isoboroff/trec-demo
-
Notifications
You must be signed in to change notification settings - Fork 0
/
NOTES.txt
477 lines (358 loc) · 15.4 KB
/
NOTES.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
NOTES ON LUCENE-5.4.0
Rup Palchowdhury
rup.palchowdhury [at] gmail [dot] com
Henry Feild
hfeild [at] endicott [dot] edu
----------------------------------------------------------------------
TABLE OF CONTENTS
A. STEMMERS
B. MODELS
C. MODIFYING MODELS
1. STARTING WITH A TEMPLATE: TMPL.java
2. TF
3. IDF
4. K
D. PARSING & INDEXING A CORPUS
1. CODE: IndexTREC
2. CODE: TrecAnalyzer
3. DOCUMENT TYPES
E. RETRIEVAL
1. CODE: BatchSearch
----------------------------------------------------------------------
A. STEMMERS
Here is a handful of stemmer implementations in Lucene, in brackets
are names they are commonly known by:
PorterStemFilter
KStemFilter (Krovetz stemmer)
SnowballFilter
EnglishMinimalStemFilter (S-Stemmer)
See: https://lucene.apache.org/core/5_4_0/analyzers-common/org/apache/lucene/analysis/en/package-summary.html
----------------------------------------------------------------------
B. MODELS
The models in LTR are in the 'models' file. The first column are
names LTR uses to identify the actual Java classes. Since these
strings show up in the run tag, it was necessary to map them to
shorter strings.
bm25L BM25Similarity
dfrL DFRSimilarity
defaultL DefaultSimilarity
lmdirichletL LMDirichletSimilarity
ibL IBSimilarity
tmpl TMPL
bm25 BM25
tmple TMPLe
bm25e BM25e
bm25L is Lucene's stock implementation, while bm25 is a shorter
cleaner version. Those names with the suffix 'e' point to a version
that preserves Lucene's technique of packing (encoding) floating point
document length values into a byte. Diff of TMPL.java and TMPLe.java
to see what I mean. The next section on modifying models explains it
in detail.
A full list of models available in Lucene-5.4.0 is in the package
org.apache.lucene.search.similarities:
https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/search/similarities/package-summary.html
----------------------------------------------------------------------
C. MODIFYING MODELS
The starting point was Lucene's BM25Similarity.java, which was
stripped to its bare minimum and transformed into TMPL.java. Here I
describe how to craft TMPL.java to implement your own TF-IDF scoring
formula.
BM25Similarity.java:
https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/search/similarities/BM25Similarity.html
----------------------------------------------------------------------
C.1. STARTING WITH A TEMPLATE: TMPL.java
There has been an attempt on the page on 'Lucene scoring' to
specify the arithmetic involved in computing the similarity code,
color-coding parts of the arithmetic formula, and then mentioning the
methods from the Java class that do the job of each part of the
formula using the same color code.
https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
Much of the surrounding text describing everything on that page
possibly assumes a lot of context which most readers will be unaware
of. So here I will use TMPL.java to explain only the parts of the code
that is implementing a formula of the form:
TF * IDF / LN
where LN is the length-normalization factor.
A shallow cut-away of this formula would involve K in this way;
TF * IDF / (K + TF)
The following sections explain where and how the three pieces, TF, IDF
and K are computed. The rest of the code is better left
untouched. Most of the functions have been rigged to return 1 so that
the chain-multiplication does not affect the score computation.
----------------------------------------------------------------------
C.2. TF
The term-frequency value is available to the object and need not be
constructed out of anything. The method TFIDFScorer.score() receives
the value in the variable 'tf' and is used to compute the weight 'w'
as shown:
public class TFIDFScorer extends SimScorer
{
...
public float score(int doc, float tf)
{
float w = tf / K[(byte)norms.get(doc) & 0xFF] * bw.idf;
return w;
}
...
}
----------------------------------------------------------------------
C.3. IDF
'N' and 'n', the ingredients of IDF, are initialized in here and 'idf'
computed. There are bells-n-whistles, where 'idf' is accumulated as s
series-sum, which is possibly because of Lucene's view of queries and
documents as having been partitioned into 'fields' which necessitates
a sum over the fields. My interpretation is that, control does not
step into the 'else' block and even if it does, the sum is safe.
public final SimWeight computeWeight(float queryBoost,
CollectionStatistics collectionStats, TermStatistics... termStats)
{
float N, n, idf, adl, dl;
idf = 1.0f;
N = collectionStats.maxDoc();
adl = collectionStats.sumTotalTermFreq() / N;
if (termStats.length == 1) {
n = termStats[0].docFreq();
idf = log(N/n);
}
else {
for (final TermStatistics stat : termStats) {
n = stat.docFreq();
idf += log(N/n);
}
}
return new TFIDFWeight(collectionStats.field(), idf, adl, K);
}
----------------------------------------------------------------------
C.4. K
NOTE: There is another version of TMPL.java, names TMPLe.java that
involves the bits of code that encodes K values. The discussion on
encoding K, that follows, should be read along with TMPLe.java.
If you have noticed, the computations of K takes place in
computeWeight() too, right after IDF. Mathematically K is some
function of 'dl' and 'adl', and for this exposition we represent this
computation as:
K = f(dl, adl)
It so happens that Lucene stows away the 'dl' for each document at a
point in time that is before score computations have taken place and
encodes these integers in a single byte; which of course is lossy and
is done for reasons of efficiency. Since 'dl' is a byte, it has only
256 possible values and so does f(dl, adl). K is implemented as a
256-element array, each of whose elements is a floating-point number;
all the possible values of f(dl, adl). Later K is used as a look-up
table.
encodeNorm() is used elsewhere to pack 'dl' into a byte. In
TFIDFScorer.score(), norms.get(doc) retrieves that byte, 'dl', for a
particular doc. decodeNorm() unpacks that byte to a float using
another look-up table NORM[], which was populated with floating-point
representation of the 256 possible bytes. The unpacked byte, in the
form of a float, is then used to compute f(dl, adl) and stored in
K[i]. Later when the normalization factor K is needed, the array K is
indexed by typecasting the byte value, 'dl', returned by norms.get(doc).
The pieces of code involved in this follow with bits of commentary:
Populate NORM with 256 possible values and implement the packing and
unpacking technique of your choice, making sure they are inverse
operations of each other;
private static final float[] NORM = new float[256];
static {
for (int i = 0; i < 256; i++) {
NORM[i] = SmallFloat.byte315ToFloat((byte)i);
}
}
...
protected float decodeNorm(byte b)
{
return NORM[b & 0xFF];
}
...
protected byte encodeNorm(int dl)
{
return SmallFloat.floatToByte315(dl);
}
Compute the 256 possible values of 'K' by generating the 256 possible
values of 'dl';
float K[] = new float[256];
for (int i = 0; i < K.length; i++) {
dl = decodeNorm((byte)i);
K[i] = log(dl / adl);
}
Pick the particular value of K corresponding to a byte returned by
norms.get(). 'norms' is of type NumericDocValues which encapsulates a
per-document numeric value. This is also a point where the weight is
computed; score() is self-explanatory.
public class TFIDFScorer extends SimScorer
{
private final TFIDFWeight bw;
private final NumericDocValues norms;
private final float[] K;
TFIDFScorer(TFIDFWeight bw, NumericDocValues norms)
throws IOException
{
this.bw = bw;
this.K = bw.K;
this.norms = norms;
}
public float score(int doc, float tf)
{
float w = tf / K[(byte)norms.get(doc) & 0xFF] * bw.idf;
return w;
}
...
}
----------------------------------------------------------------------
D. PARSING & INDEXING A CORPUS
java -cp "LTR/lib/*" IndexTREC -settings [v]
-index [w]
-docs [x]
-stop [y]
-stem [z]
[v] - Path to a settings file.
[w] - Directory where the index will be placed.
[x] - Directory containing the document corpus.
[y] - A file containing the stop words.
[z] - The class that implements a particular stemming algorithm.
(Additional settings exist; any setting that is valid in the settings file
can be provided on the command line with a '-' prefix; see README.txt for
details.)
----------------------------------------------------------------------
D.1 CODE: IndexTREC
https://github.com/sauparna/LTR/blob/master/src/IndexTREC.java
Here is a rundown on modifying Lucene's indexing structures to process
TREC data. Everything happens with this block of code in main(), after
command-line arguments have been parsed and starting points
determined:
...
Directory dir = FSDirectory.open(Paths.get(ltrSettings.indexPath));
TrecAnalyzer analyzer = new TrecAnalyzer(ltrSettings);
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
IndexWriter writer = new IndexWriter(dir, iwc);
indexDocs(writer, docDir);
writer.close();
...
Functionality for parsing the corpus is implemented in these
functions:
class IndexTREC:
public static class DocVisitor extends SimpleFileVisitor<Path>
static void indexDocs(
LTRSettings settings, final IndexWriter writer, Path path)
class FileParser:
public static processFile(LTRSettings settings, IndexWriter writer, File file)
The processFile() method is a parser mapper and is plugged into
DocVisitor.visitFile(). It checks the extension of the file and sends it to
an appropriate parser, uncompressing if necessary. See D.3 DOCUMENT TYPES for
details on available parsers. Each parser is responsible for processing a
particular file type, extracting documents and providing them to the index
writer.
The indexDocs() method uses a DocVisitor object to walk a directory tree and
gobble up files. From the programmer's point of view, the parser is the only
bit of engineering that she is able to do, and the rest of the work is wiring
loose ends and plugging open ports. The block of code in main(), is a series
of transformations up to the point of leaving it to indexDocs() which sets the
ball rolling:
indexDocs(IndexWriter(Directory, IndexWriterConfig(TrecAnalyzer(stop, stem))))
----------------------------------------------------------------------
D.2 CODE: TrecAnalyzer
https://github.com/sauparna/LTR/blob/master/src/TrecAnalyzer.java
The constructor reads a list of stop words into memory and stows away
the name of the stemmer object. Then in TokenStreamComponents the
term-processing pipeline is set up, as before, in a series of
transformations:
TokenStreamComponents(
StemmerX(
StopFilter(
LowerCaseFilter(
XTokenizer()))))
This is a conceptual view, where I have quoted the actual object name from the
code (except StemmerX which is meant to be one of several stemmer object from
Lucene's libraries chosen at run-time using reflection, and XTokenizer, which
can be an arbitrary tokenizer, with WhitespaceTokenizer as the default), and
is not syntactically correct. Nonetheless it sketches the idea.
----------------------------------------------------------------------
D.3 DOCUMENT TYPES
COMPRESSED CORPUS FILES
Any file found in the given docPath ending in any of the extensions:
- bz
- bzip2
- gz
- gzip
will be uncompressed. The uncompressed stream will then be sent to the parser
associated with the file extension left after the compression extension is
removed (e.g., a file named input.warc.gz will be uncompressed and sent to the
WARC parser).
TREC FORMAT (DEFAULT)
File extensions: .* (anything that doesn't match other formats)
Parser: FileParser.parseTRECFile(...)
This is the format used by TREC text corpuses, e.g., newwire collections such
as AP, as well as TREC web corpuses such as GOV2. Each file may consist of one
or more documents, where each document is contained within a <DOC> tag and
must have a child tag named <DOCNO> specifying the TREC docno of the document.
Any number of additional tags can be included beyond <DOCNO>. When no index
fields are specified in the settings file, the document text excluding tags
themselves will be indexed. You may specify specific fields to index to
alter this behavior.
WARC FORMAT
File extensions: .warc
Parser: FileParser.parseWARCFile(...)
Used for ClueWeb 2009 and 2012, this web archive format also includes a
WARC-TREC-ID field in the header of each document. Only documents that have
that header and have a WARC-TYPE field value of "response" are processed. If
no index fields are specified, the text of the document is indexed, excluding
tags.
----------------------------------------------------------------------
E. RETRIEVAL
java -cp "LTR/lib/*" BatchSearch -settings [u]
-index [v]
-queries [w]
-similarity [x]
-stop [y]
-stem [z]
[u] - Path to a settings file.
[v] - Path to the directory that contains the index.
[w] - The file containing TREC queries.
[x] - The class name of Lucene's similarity model.
[y] - A file containing the stop words.
[z] - The class that implements a particular stemming algorithm.
(Additional settings exist; any setting that is valid in the settings file
can be provided on the command line with a '-' prefix; see README.txt for
details.)
----------------------------------------------------------------------
E.1 CODE: BatchSearch
https://github.com/sauparna/LTR/blob/master/src/BatchSearch.java
An implementation of search (retrieval) using a set of TREC
queries. Its interface is not unlike IndexTREC in that command-line
arguments are parsed to determine the starting points and the block of
code in main() does everything. Again, other than parsing a TREC
query, and printing search results in TREC format (TREC runs), most of
the work involves tying loose ends.
...
IndexReader reader = DirectoryReader.open(
FSDirectory.open(Paths.get(index)));
IndexSearcher searcher = new IndexSearcher(reader);
searcher.setSimilarity(similarity);
TrecAnalyzer analyzer = new TrecAnalyzer(opt);
SimpleQueryParser parser = new SimpleQueryParser(analyzer, field);
org.jsoup.nodes.Document soup;
String str, qid, txt;
str = FileUtils.readFileToString(new File(queries));
soup = Jsoup.parse(str);
for (Element elm : soup.select("TOP")) {
qid = elm.child(0).text().trim();
txt = elm.child(1).text();
Query query = parser.parse(txt);
doBatchSearch(searcher, qid, query, simstr, NUMRET);
}
reader.close();
...
The view of it as a series of transformations, in sketchy pseudocode
looks like this:
i = IndexSearcher(IndexReader(DirectoryReader(index)))
i.setSimilarity(SimilarityX)
q = SimpleQueryParser(TrecAnalyzer(stop, stem)).parse()
doBatchSearch(i, q)
{
result = i.search(q, M)
prettyprint(result)
}
SimilarityX is the name of the similarity class chosen at run-time
using reflection. M is the number of documents to retrieve which is
also the length of the search result (or run file).