K-NRM: End-to-End Neural Ad-hoc Ranking with Kernel Pooling

K-NRM

This is the implementation of the Kernel-based Neural Ranking Model (K-NRM) from the paper End-to-End Neural Ad-hoc Ranking with Kernel Pooling.

If you use this code for your scientific work, please cite the paper:

C. Xiong, Z. Dai, J. Callan, Z. Liu, and R. Power. End-to-end neural ad-hoc ranking with kernel pooling. 
In Proceedings of the 40th International ACM SIGIR Conference on Research & Development in Information Retrieval. 
ACM. 2017.

Requirements


  • TensorFlow 0.12
  • Numpy
  • traitlets

Coming soon: K-NRM with TensorFlow 1.0

Guide To Use


Configure: first, configure the model through the config file. The configurable parameters are listed in the Configurations section below. A sample config file is provided:

sample.config

Training: pass the config file, training data, and validation data as

python ./knrm/model/model_knrm.py config-file \
    --train \
    --train_file: path to training data \
    --validation_file: path to validation data \
    --train_size: size of training data (number of training samples) \
    --checkpoint_dir: directory to store/load model checkpoints \
    --load_model: True or False. Start with a new model or continue training

A sample training script is provided: sample-train.sh

Testing: pass the config file and testing data as

python ./knrm/model/model_knrm.py config-file \
    --test \
    --test_file: path to testing data \
    --test_size: size of testing data (number of testing samples) \
    --checkpoint_dir: directory to load trained model \
    --output_score_file: file to output document scores

Relevance scores are written to output_score_file, one score per line, in the same order as test_file. We provide a script to convert the scores into TREC format:

./knrm/tools/gen_trec_from_score.py
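The conversion can be illustrated with a minimal sketch. This is not the provided script: the function name to_trec_lines, the (query id, document id) pair list, and the run name "knrm" are all hypothetical and only serve the illustration. It assumes the pairs are given in the same order as the score file and emits the standard TREC run layout "qid Q0 docid rank score run_name".

```python
from collections import defaultdict

def to_trec_lines(pairs, scores, run_name="knrm"):
    """Turn per-line scores into TREC run lines.

    pairs:  hypothetical list of (qid, docid), in the same order as scores.
    scores: one relevance score per (qid, docid) pair.
    """
    by_query = defaultdict(list)
    for (qid, docid), score in zip(pairs, scores):
        by_query[qid].append((docid, score))

    lines = []
    for qid, docs in by_query.items():
        # Rank documents of each query by descending score.
        ranked = sorted(docs, key=lambda item: item[1], reverse=True)
        for rank, (docid, score) in enumerate(ranked, start=1):
            lines.append("{} Q0 {} {} {} {}".format(qid, docid, rank, score, run_name))
    return lines

pairs = [("q1", "d1"), ("q1", "d2"), ("q2", "d3")]
scores = [0.2, 0.9, 0.5]
trec_lines = to_trec_lines(pairs, scores)
print("\n".join(trec_lines))
```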

Data Preparation


All queries and documents must be mapped into sequences of integer term ids. Term ids start at 1; -1 indicates an out-of-vocabulary (OOV) or non-existent term. Term ids are separated by commas (,).
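As an illustration of this mapping, here is a minimal sketch. The helpers build_vocab and encode, the lowercasing, and the whitespace tokenization are assumptions made for the example and are not part of this repository.

```python
def build_vocab(tokens):
    """Assign ids 1, 2, 3, ... to distinct tokens in first-seen order."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab) + 1  # term ids start at 1
    return vocab

def encode(text, vocab):
    """Map whitespace-separated tokens to a comma-separated id string.

    Tokens missing from the vocabulary are written as -1 (OOV).
    """
    return ",".join(str(vocab.get(tok, -1)) for tok in text.lower().split())

vocab = build_vocab("neural ranking with kernel pooling".split())
encoded = encode("Neural ranking with soft kernel pooling", vocab)
print(encoded)  # 1,2,3,-1,4,5
```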

Training Data Format

Each training sample is a tuple of (query, positive document, negative document), followed by a score difference:

query \t positive_document \t negative_document \t score_difference

Example: 177,705,632 \t 177,705,632,-1,2452,6,98 \t 177,705,632,3,25,14,37,2,146,159, -1 \t 0.119048

If score_difference < 0, the data generator will swap the positive document and the negative document.

If score_difference < DataGenerator.min_score_diff, this training sample will be omitted.

We recommend shuffling the training samples to ease model convergence.
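The rules above can be sketched as a small parser. This is not the repository's data generator: parse_training_line is a hypothetical helper, and it assumes that after a swap the magnitude of the score difference is what gets compared against min_score_diff.

```python
def parse_ids(field):
    """Parse a comma-separated list of integer term ids."""
    return [int(t) for t in field.split(",") if t.strip()]

def parse_training_line(line, min_score_diff=0.0):
    """Return (query, positive, negative, score_diff), or None if omitted."""
    query, pos, neg, diff = line.rstrip("\n").split("\t")
    q_ids, pos_ids, neg_ids = parse_ids(query), parse_ids(pos), parse_ids(neg)
    diff = float(diff)
    if diff < 0:
        # Negative difference: the two documents are in the wrong order.
        pos_ids, neg_ids, diff = neg_ids, pos_ids, -diff
    if diff < min_score_diff:
        # Assumption: the threshold applies to the (swapped) difference.
        return None
    return q_ids, pos_ids, neg_ids, diff

line = "177,705,632\t177,705,632,-1,2452,6,98\t177,705,632,3,25,14,37,2,146,159,-1\t0.119048"
sample = parse_training_line(line)
swapped = parse_training_line("1,2\t3,4\t5,6\t-0.5")
dropped = parse_training_line(line, min_score_diff=0.5)
print(sample[0], sample[3])
```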

Testing Data Format

Each testing sample is a tuple of (query, document)

query \t document

Example: 177,705,632 \t 177,705,632,-1,2452,6,98
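A testing line can be read and brought to fixed lengths as in the following minimal sketch. The helper names are hypothetical, and padding with -1 is an assumption based on -1 denoting a non-existent term; the repository's own generator may pad or truncate differently. The lengths 10 and 50 match the documented defaults of max_q_len and max_d_len.

```python
PAD_ID = -1  # assumption: -1 ("non-existence") is used as the padding id

def pad_or_truncate(ids, max_len, pad_id=PAD_ID):
    """Cut ids to max_len, or right-pad with pad_id up to max_len."""
    return (ids + [pad_id] * max_len)[:max_len]

def parse_test_line(line, max_q_len=10, max_d_len=50):
    """Parse 'query \\t document' into two fixed-length id lists."""
    query, doc = line.rstrip("\n").split("\t")
    q_ids = [int(t) for t in query.split(",")]
    d_ids = [int(t) for t in doc.split(",")]
    return pad_or_truncate(q_ids, max_q_len), pad_or_truncate(d_ids, max_d_len)

q, d = parse_test_line("177,705,632\t177,705,632,-1,2452,6,98")
print(len(q), len(d))  # 10 50
```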

Configurations


Model Configurations

  • BaseNN.n_bins: number of kernels (soft bins) (default: 11. One exact match kernel and 10 soft kernels)
  • Knrm.lamb: defines the Gaussian kernels' sigma value. sigma = lamb * bin_size (default: 0.5, which gives sigma = 0.1)
  • BaseNN.embedding_size: embedding dimension (default: 300)
  • BaseNN.max_q_len: max query length (default: 10)
  • BaseNN.max_d_len: max document length (default: 50)
  • DataGenerator.max_q_len: max query length. Should be the same as BaseNN.max_q_len (default: 10)
  • DataGenerator.max_d_len: max document length. Should be the same as BaseNN.max_d_len (default: 50)
  • BaseNN.vocabulary_size: vocabulary size.
  • DataGenerator.vocabulary_size: vocabulary size.
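To make the relation between n_bins, lamb, and sigma concrete, here is a minimal NumPy sketch of the kernel parameters and of kernel pooling as described in the paper. It is not the code in this repository. It assumes that the soft bins evenly divide the cosine similarity range [-1, 1] with each mu at a bin center, that the exact-match kernel sits at mu = 1 with a small sigma (1e-3 here), and that kernel values are clipped before the log for numerical safety. It requires n_bins >= 2.

```python
import numpy as np

def kernel_mus(n_bins):
    """One exact-match kernel at mu=1, then soft kernels at bin centers."""
    bin_size = 2.0 / (n_bins - 1)  # similarity range [-1, 1]
    soft = [1.0 - bin_size / 2 - i * bin_size for i in range(n_bins - 1)]
    return np.array([1.0] + soft)

def kernel_sigmas(n_bins, lamb, exact_sigma=1e-3):
    """sigma = lamb * bin_size for soft kernels; tiny sigma for exact match."""
    bin_size = 2.0 / (n_bins - 1)
    return np.array([exact_sigma] + [lamb * bin_size] * (n_bins - 1))

def kernel_pooling(sim, mus, sigmas):
    """sim: (q_len, d_len) similarities -> (n_bins,) soft-TF features."""
    diff = sim[:, :, None] - mus[None, None, :]
    # Sum each kernel over document terms: shape (q_len, n_bins).
    k = np.exp(-diff ** 2 / (2 * sigmas ** 2)).sum(axis=1)
    # Log, then sum over query terms (clipping is an added safeguard).
    return np.log(np.maximum(k, 1e-10)).sum(axis=0)

mus = kernel_mus(11)
sigmas = kernel_sigmas(11, lamb=0.5)
rng = np.random.default_rng(0)
sim = rng.uniform(-1, 1, size=(10, 50))  # toy query-document similarity matrix
features = kernel_pooling(sim, mus, sigmas)
print(mus.round(1), sigmas[1])
```

With the defaults (n_bins = 11, lamb = 0.5), bin_size is 0.2, so the soft kernels are centered at 0.9, 0.7, ..., -0.9 with sigma = 0.1.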

Data

  • Knrm.emb_in: initial embeddings
  • DataGenerator.min_score_diff: minimum score difference between positive documents and negative ones (default: 0)

Training Parameters

  • BaseNN.batch_size: batch size (default: 16)
  • BaseNN.max_epochs: max number of epochs to train
  • BaseNN.eval_frequency: evaluate the model on the validation set every this many steps (default: 1000)
  • BaseNN.checkpoint_steps: save the model every this many steps (default: 10000)
  • Knrm.learning_rate: learning rate for Adam Optimizer (default: 0.001)
  • Knrm.epsilon: epsilon for Adam Optimizer (default: 0.00001)

Efficiency

During training, it takes about 60ms to process one batch on a single-GPU machine with the following settings:

  • batch size: 16
  • max_q_len: 10
  • max_d_len: 50
  • vocabulary_size: 300K

Smaller vocabulary and shorter documents accelerate the training.
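For a rough sense of scale, the per-batch figure above can be turned into a per-epoch estimate. This is a back-of-the-envelope sketch only: the training set size of 1,000,000 samples is a hypothetical value, and the estimate ignores validation and checkpointing time.

```python
import math

def estimate_epoch_seconds(train_size, batch_size=16, seconds_per_batch=0.060):
    """Estimate one epoch's duration from the measured time per batch."""
    n_batches = math.ceil(train_size / float(batch_size))
    return n_batches * seconds_per_batch

# Hypothetical training set of one million samples:
# 62,500 batches at about 60 ms each, i.e. roughly an hour per epoch.
epoch_seconds = estimate_epoch_seconds(1000000)
print(round(epoch_seconds / 60.0, 1), "minutes")
```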

Click2Vec


We also provide the click2vec model as described in our paper.

  • ./knrm/click2vec/generate_click_term_pair.py: generate <query_term, clicked_title_term> pairs
  • ./knrm/click2vec/run_word2vec.sh: call Google's word2vec tool to train click2vec.

Cite the paper


If you use this code for your scientific work, please cite it as:

@inproceedings{xiong2017neural,
  author          = {{Xiong}, Chenyan and {Dai}, Zhuyun and {Callan}, Jamie and {Liu}, Zhiyuan and {Power}, Russell},
  title           = "{End-to-End Neural Ad-hoc Ranking with Kernel Pooling}",
  booktitle       = {Proceedings of the 40th International ACM SIGIR Conference on Research \& Development in Information Retrieval},
  organization    = {ACM},
  year            = 2017,
}