Converts a ds2i/pisa binary input collection into a JASS index with various compression options available.

JMMackenzie/binary_to_jass

Binary_to_JASS

This simple tool converts the binary sequence collections used by the ds2i/pisa codebase into an index that JASS can read. Because both indexes are then derived from the same base collection, the two systems can be compared head-to-head in experiments.

This tool is not meant to be well engineered; it is a simple hack that works well enough. Please adapt it to your own use.

Acknowledgements

This codebase contains functionality found in other codebases. Some of these codebases, and other related work, are listed below.

Note that the real purpose of this library is to take the same input format used by pisa/ds2i and convert it for use in JassV2.

Collection input format

A binary sequence is a sequence of integers prefixed by its length, where both the sequence integers and the length are written as 32-bit little-endian unsigned integers.
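The definition above can be illustrated with a minimal Python sketch. It is not taken from this tool's source; the function names `write_sequence` and `read_sequence` are hypothetical and chosen only for illustration. It assumes the whole sequence is available as a bytes object.

```python
import struct

def write_sequence(values):
    """Encode a list of integers as a length-prefixed binary sequence."""
    # "<I" means little-endian unsigned 32-bit; the length is written first.
    return struct.pack(f"<{len(values) + 1}I", len(values), *values)

def read_sequence(buf, offset=0):
    """Decode one sequence starting at `offset`; return (values, next_offset)."""
    (length,) = struct.unpack_from("<I", buf, offset)
    values = list(struct.unpack_from(f"<{length}I", buf, offset + 4))
    # Each integer occupies 4 bytes, plus 4 bytes for the length prefix.
    return values, offset + 4 * (length + 1)

encoded = write_sequence([3, 7, 42])
decoded, end = read_sequence(encoded)
print(encoded.hex())   # 0300000003000000070000002a000000
print(decoded, end)    # [3, 7, 42] 16
```

Returning the next offset lets a caller walk through a file that stores many sequences back to back.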

A collection consists of 3 files, <basename>.docs, <basename>.freqs, <basename>.sizes.

  • <basename>.docs starts with a singleton binary sequence whose only integer is the number of documents in the collection. It is followed by one binary sequence per posting list, in order of term-ids. Each posting list contains the sequence of document-ids containing the term.

  • <basename>.freqs is composed of one binary sequence per posting list, where each sequence contains the occurrence counts of the postings, aligned with the previous file (note, however, that this file does not have an additional singleton list at its beginning).

  • <basename>.sizes is composed of a single binary sequence whose length is the same as the number of documents in the collection, and the i-th element of the sequence is the size (number of terms) of the i-th document.
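As a rough sketch of how the three files fit together, the Python below parses a tiny in-memory collection. It is an illustration under stated assumptions, not this tool's implementation: the helper names (`pack`, `read_sequences`, `read_collection`) are hypothetical, and the example data is made up. It assumes each file fits in memory as a bytes object.

```python
import struct

def pack(values):
    """Encode one length-prefixed sequence of little-endian uint32 values."""
    return struct.pack(f"<{len(values) + 1}I", len(values), *values)

def read_sequences(buf):
    """Yield every length-prefixed uint32 sequence stored in `buf`."""
    offset = 0
    while offset < len(buf):
        (length,) = struct.unpack_from("<I", buf, offset)
        yield list(struct.unpack_from(f"<{length}I", buf, offset + 4))
        offset += 4 * (length + 1)

def read_collection(docs_buf, freqs_buf, sizes_buf):
    """Return (num_docs, postings, doc_sizes) from the three file images."""
    docs = read_sequences(docs_buf)
    (num_docs,) = next(docs)              # leading singleton: document count
    doc_lists = list(docs)                # one list of document-ids per term
    freq_lists = list(read_sequences(freqs_buf))  # no leading singleton here
    if len(doc_lists) != len(freq_lists):
        raise ValueError("docs and freqs hold different numbers of lists")
    postings = []
    for doc_ids, counts in zip(doc_lists, freq_lists):
        if len(doc_ids) != len(counts):
            raise ValueError("posting list and frequency list are misaligned")
        postings.append(list(zip(doc_ids, counts)))
    (doc_sizes,) = read_sequences(sizes_buf)      # exactly one sequence
    if len(doc_sizes) != num_docs:
        raise ValueError("sizes length does not match the document count")
    return num_docs, postings, doc_sizes

# Made-up collection: 3 documents, 2 terms.
docs_buf = pack([3]) + pack([0, 2]) + pack([1])
freqs_buf = pack([1, 4]) + pack([2])
sizes_buf = pack([1, 2, 4])

num_docs, postings, doc_sizes = read_collection(docs_buf, freqs_buf, sizes_buf)
print(num_docs)    # 3
print(postings)    # [[(0, 1), (2, 4)], [(1, 2)]]
print(doc_sizes)   # [1, 2, 4]
```

For files on disk, the same function could be fed the result of `open(path, "rb").read()` for each of the three files. Reading whole files this way is only practical for small collections; a large collection would call for memory mapping or streaming.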
