Simpl-IR — A simple and flexible information retrieval toolbox

Building

You will need a few dependencies,

GHC 8.0.1
cabal-install 1.24.
libicu-dev
liblzma-dev

Then simply,

$ git clone --recursive git://github.com/laura-dietz/simpl-ir
$ cd simpl-ir
$ cabal new-build
$ mkdir -p bin
$ for i in $(find dist-newstyle/build/ -executable -a -type f -a -! -iname '*.so'); do ln -fs $(pwd)/$i bin/$(basename $i); done

You will end up with a number of executables in the bin directory. If you are reading this file you likely know what to do from here.

Usage

Create listing of corpus files. Save in trec-kba-2015.list (yes, we messed up the name)

$ find trec-kba-2014/ -iname '*news*.xz' > trec-kba-2015.list

Create corpus background statistics on a subsample of the corpus (here: every 100'th archive). Store in file names background

$ ./bin/simplir corpus-stats -q queries/queries.tsv -o background $(awk 'NR % 1000 == 0 {print}' trec-kba-2015.list)

Score whole corpus in streaming fashion for a given list of queries using the given collection background.

Corpus is comprised of all files listed in trec-kba-2015.list - therefore @)
List of queries is given in queries/queries/tsv
Background needs to be computed with the corpus-stats command (see above)

Two result files are written, both starting with filename prefix results:

*.run contains the ranking in trec_eval run-file format (space-separated).
*.json contains positional matches of query terms together with score and archive information in JSON format.

$ ./bin/simplir score -q queries/queries.tsv -s background -o results @trec-kba-2015.list
$ ls results.*
results.run
results.json
$

results.json format is roughly this (order of token occurrences within document might change soon):

[
  {
    "query_id": "Q1",
    "results": [
      {
        "score": -2.718447427575901,
        "doc_name": "81e4bbcca45920019e1e20df7d20ea8a",
        "archive": "trec-kba-2014/2012-01-24-11/news-30-fd0b2f9a9096a16c9a14166a25306272-1d1e4a7ae7e43d21cfa0cb31ca253a9c-0097415001369a2ab033800d80f7aa91-7b2ad6e31345a96cbb48277d099e6444.sc.xz",
        "length": 178,   # the document length
        "postings": [
          {
            "term": "protest",
            "positions": [
              {
                "char_pos": {     # character begin/end offsets of the term
                  "begin": 1597,
                  "end": 1603
                },
                "token_pos": 15   # i'th token in the document
              },
              {
                "char_pos": {     # next term occurrence...
                  "begin": 45,
                  "end": 51
                },
                "token_pos": 3
              }
            ]
          },
      }
     ...
    ]
  },
  ...
]

Name		Name	Last commit message	Last commit date
Latest commit History 1,133 Commits
graph-algorithms		graph-algorithms
nix		nix
simplir-data-source-s3		simplir-data-source-s3
simplir-data-source		simplir-data-source
simplir-disk-index		simplir-disk-index
simplir-eval		simplir-eval
simplir-galago		simplir-galago
simplir-html-clean		simplir-html-clean
simplir-io		simplir-io
simplir-kyoto-index		simplir-kyoto-index
simplir-learning-to-rank		simplir-learning-to-rank
simplir-leveldb-index		simplir-leveldb-index
simplir-pipes-utils		simplir-pipes-utils
simplir-stop-words		simplir-stop-words
simplir-tools		simplir-tools
simplir-trec-streaming		simplir-trec-streaming
simplir-trec		simplir-trec
simplir-word-embedding		simplir-word-embedding
simplir		simplir
vendor		vendor
.gitignore		.gitignore
.gitlab-ci.yml		.gitlab-ci.yml
.gitmodules		.gitmodules
README.mkd		README.mkd
ZTest.hs		ZTest.hs
cabal.project		cabal.project
cabal.project.local.ghc-9.2		cabal.project.local.ghc-9.2
default.nix		default.nix
inquery-stopwordlist		inquery-stopwordlist
mk-cabal-bin.sh		mk-cabal-bin.sh
nixpkgs.nix		nixpkgs.nix
trec-eval.nix		trec-eval.nix

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Simpl-IR — A simple and flexible information retrieval toolbox

Building

Usage

About

Releases

Packages

Languages

TREMA-UNH/simplir

Folders and files

Latest commit

History

Repository files navigation

Simpl-IR — A simple and flexible information retrieval toolbox

Building

Usage

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages