OpenNIR

An end-to-end neural ad-hoc ranking pipeline.

Quick start

Install dependencies:

pip install -r requirements.txt

Train and validate a model (here, ConvKNRM on ANTIQUE):

scripts/pipeline.sh config/conv_knrm config/antique

(Performance on the test set can be obtained by adding pipeline.test=True)

Grid search for BM25 over ANTIQUE for comparison with neural model performance:

scripts/pipeline.sh config/grid_search config/antique

(Performance on the test set can be obtained by adding pipeline.test=True)

Models, datasets, and vocabularies are saved in ~/data/onir/ by default. To use a different location, set data_dir=~/some/other/place/ as a command-line argument, in a configuration file, or in the ONIR_ARGS environment variable.
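
As a rough illustration of the override mechanisms described above, the following Python sketch assembles a pipeline invocation with key=value settings appended as command-line arguments. The helper names build_pipeline_command and build_onir_args_env are hypothetical and not part of OpenNIR; they only build an argument list and an environment mapping, and do not run anything. The ONIR_ARGS format (space-separated key=value pairs) is an assumption, since this README does not spell it out.

```python
import os


def build_pipeline_command(configs, overrides=None, script="scripts/pipeline.sh"):
    """Return an argv list: script, config files, then key=value overrides.

    Hypothetical helper for illustration; not part of OpenNIR.
    """
    argv = [script, *configs]
    for key, value in (overrides or {}).items():
        argv.append(f"{key}={value}")
    return argv


def build_onir_args_env(overrides, base_env=None):
    """Return a copy of the environment with ONIR_ARGS set.

    Assumption: ONIR_ARGS holds space-separated key=value pairs.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["ONIR_ARGS"] = " ".join(f"{k}={v}" for k, v in overrides.items())
    return env


# Train/validate ConvKNRM on ANTIQUE, also report test performance,
# and store data somewhere other than the default location.
cmd = build_pipeline_command(
    ["config/conv_knrm", "config/antique"],
    {"pipeline.test": True, "data_dir": "~/some/other/place/"},
)
print(" ".join(cmd))
```

The resulting list could be handed to subprocess.run(cmd) from the repository root, or the environment mapping passed as env=... if the settings should come from ONIR_ARGS instead.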

Features

Rankers

  • DRMM ranker=drmm paper
  • Duet (local model) ranker=duetl paper
  • MatchPyramid ranker=matchpyramid paper
  • KNRM ranker=knrm paper
  • PACRR ranker=pacrr paper
  • ConvKNRM ranker=conv_knrm paper
  • Vanilla BERT config/vanilla_bert paper
  • CEDR models config/cedr/[model] paper
  • MatchZoo models source
    • MatchZoo's KNRM ranker=mz_knrm
    • MatchZoo's ConvKNRM ranker=mz_conv_knrm

Datasets

  • TREC Robust 2004 config/robust/fold[x]
  • MS-MARCO config/msmarco
  • ANTIQUE config/antique
  • TREC CAR config/car
  • New York Times config/nyt -- for content-based weak supervision

Evaluation Metrics

  • map (from trec_eval)
  • ndcg (from trec_eval)
  • ndcg@X (from trec_eval, gdeval)
  • p@X (from trec_eval)
  • err@X (from gdeval)
  • mrr (from trec_eval)
  • rprec (from trec_eval)
  • judged@X (implemented in python)
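
To make a few of these metrics concrete, here is a minimal Python sketch of p@X, judged@X, and the per-query reciprocal rank that mrr averages over queries. These are textbook definitions written for illustration under simple assumptions (a document is relevant when its grade is above 0, and scores are divided by the cutoff even if fewer documents are retrieved). They are not the code used by OpenNIR, trec_eval, or gdeval, and the function names are made up for this example.

```python
def precision_at_k(ranking, qrels, k):
    """Fraction of the top-k documents that are judged relevant (grade > 0)."""
    top = ranking[:k]
    return sum(1 for doc in top if qrels.get(doc, 0) > 0) / k


def judged_at_k(ranking, qrels, k):
    """Fraction of the top-k documents that have any relevance judgment."""
    top = ranking[:k]
    return sum(1 for doc in top if doc in qrels) / k


def reciprocal_rank(ranking, qrels):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if qrels.get(doc, 0) > 0:
            return 1.0 / rank
    return 0.0


# Toy data for one query: doc id -> relevance grade (unjudged docs are absent).
qrels = {"d1": 0, "d2": 2, "d4": 1}
ranking = ["d3", "d1", "d2", "d4", "d5"]  # system output, best first

print(precision_at_k(ranking, qrels, 4))  # 2 relevant docs in the top 4
print(judged_at_k(ranking, qrels, 4))     # 3 judged docs in the top 4
print(reciprocal_rank(ranking, qrels))    # first relevant doc is at rank 3
```

A low judged@X alongside a low p@X suggests that many retrieved documents were never assessed, so the precision figure may understate the ranker's quality.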

Vocabularies

  • Binary term matching vocab=binary (i.e., changes interaction matrix from cosine similarity to binary indicators)
  • Pretrained word vectors vocab=wordvec
    • vocab.source=fasttext
      • vocab.variant=wiki-news-300d-1M, vocab.variant=crawl-300d-2M
      • (information about FastText variants can be found here)
    • vocab.source=glove
      • vocab.variant=cc-42b-300d, vocab.variant=cc-840b-300d
      • (information about GloVe variants can be found here)
    • vocab.source=convknrm
      • vocab.variant=knrm-bing, vocab.variant=knrm-sogou, vocab.variant=convknrm-bing, vocab.variant=convknrm-sogou
      • (information about ConvKNRM word embedding variants can be found here)
    • vocab.source=bionlp
      • vocab.variant=pubmed-pmc
      • (information about BioNLP variants can be found here)
  • Pretrained word vectors w/ single UNK vector for unknown terms vocab=wordvec_unk
    • (with above word embedding sources)
  • Pretrained word vectors w/ hash-based random selection for unknown terms vocab=wordvec_hash (default)
    • (with above word embedding sources)
  • BERT contextualized embeddings vocab=bert
    • Core models (from HuggingFace): vocab.bert_base=bert-base-uncased (default), vocab.bert_base=bert-large-uncased, vocab.bert_base=bert-base-cased, vocab.bert_base=bert-large-cased, vocab.bert_base=bert-base-multilingual-uncased, vocab.bert_base=bert-base-multilingual-cased, vocab.bert_base=bert-base-chinese, vocab.bert_base=bert-base-german-cased, vocab.bert_base=bert-large-uncased-whole-word-masking, vocab.bert_base=bert-large-cased-whole-word-masking, vocab.bert_base=bert-large-uncased-whole-word-masking-finetuned-squad, vocab.bert_base=bert-large-cased-whole-word-masking-finetuned-squad, vocab.bert_base=bert-base-cased-finetuned-mrpc
    • SciBERT: vocab.bert_base=scibert-scivocab-uncased, vocab.bert_base=scibert-scivocab-cased, vocab.bert_base=scibert-basevocab-uncased, vocab.bert_base=scibert-basevocab-cased
    • BioBERT: vocab.bert_base=biobert-pubmed-pmc, vocab.bert_base=biobert-pubmed, vocab.bert_base=biobert-pmc
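
Two of the vocabulary options above can be illustrated with a short Python sketch. The first part contrasts a cosine-similarity interaction matrix with the binary indicator matrix that vocab=binary is described as producing. The second part shows one plausible way to pick a vector slot for an unknown term by hashing, in the spirit of vocab=wordvec_hash. Everything here is an assumption made for illustration: the toy vectors, the function names, the use of MD5, and the idea of a fixed-size pool of random vectors are not taken from OpenNIR's code.

```python
import hashlib
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cosine_interaction(query, doc, vectors):
    """Rows are query terms, columns are document terms, cells are cosines."""
    return [[cosine(vectors[q], vectors[d]) for d in doc] for q in query]


def binary_interaction(query, doc):
    """Same shape, but each cell is 1.0 only for an exact term match."""
    return [[1.0 if q == d else 0.0 for d in doc] for q in query]


def hashed_unknown_index(term, pool_size):
    """Deterministically map an unknown term to one of pool_size slots.

    Assumption: unknown terms share a fixed pool of random vectors, and the
    slot is chosen by hashing the term, so a term always gets the same vector.
    """
    digest = hashlib.md5(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % pool_size


# Toy 2-d embeddings (made up for this example).
vectors = {
    "cheap": [1.0, 0.0],
    "inexpensive": [0.8, 0.6],
    "flights": [0.0, 1.0],
}
query = ["cheap", "flights"]
doc = ["inexpensive", "flights"]

print(cosine_interaction(query, doc, vectors))  # soft matches such as cheap ~ inexpensive
print(binary_interaction(query, doc))           # only flights == flights matches
print(hashed_unknown_index("zyzzyva", 1000))    # stable slot for an unseen term
```

The binary matrix discards soft matches between related terms, which is what makes it a useful ablation against embedding-based similarity. Hashing unknown terms, as opposed to a single UNK vector, lets distinct unknown terms usually receive distinct vectors while staying reproducible across runs.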

Citing OpenNIR

If you use OpenNIR, please cite the following WSDM demonstration paper:

@InProceedings{macavaney:wsdm2020-onir,
  author = {MacAvaney, Sean},
  title = {{OpenNIR}: A Complete Neural Ad-Hoc Ranking Pipeline},
  booktitle = {{WSDM} 2020},
  year = {2020}
}

Acknowledgements

I gratefully acknowledge support for this work from the ARCS Endowment Fellowship. I thank Andrew Yates, Arman Cohan, Luca Soldaini, Nazli Goharian, and Ophir Frieder for valuable feedback on the manuscript and/or code contributions to OpenNIR.
