Learning-to-Rank for Apache Lucene (compatibility with Apache Lucene is still a work-in-progress).
The purpose of this document is to provide instructions on how to execute the program and to present some experimental results. A general overview is also provided at the end of the document for those who would like to review or learn more about the algorithms. The general overview is meant to provide a "big picture" interpretation or explanation of each algorithm, as well as pseudo-code. For specific mathematical details and formulations, please see the implementation or the original articles.
- Introduction to LTR
- Execution Instructions
  - Requirements
  - Data
  - How to Execute Training Program
  - How to Execute Prediction Program
  - Changing Parameters
- Experiments
- Overview of Algorithms
  - PRank
  - OAP-BPM
  - NNRank
  - RankNet
  - FRankNet
  - LambdaRank
  - SortNet
  - ListNet
- Overview of Optimizers
- Overview of Evaluation Metrics
Ranking models are an essential component of information retrieval systems.
Consider the pool of results returned for a given search. It is practically impossible for that pool to exactly match the set of results the user is looking for. By vastly increasing the number of search results, one can hope to include all of the results the user desires; the problem, however, is that it then becomes much harder for the user to find those results among the large pool of search results.
Ranking serves as a means of mitigating this problem. If all of the user's desired results are pushed to the top of the list, then it does not take much effort for the user to find them.
The ability of a "ranker" to rank the search results accurately depends on the ranking model. Learning to Rank is the application of machine learning to construct the ranking model.
There are three different types of machine learning algorithms: supervised learning (labels, i.e. the "correct answers," are provided for all data), unsupervised learning (no labels are provided), and reinforcement learning (training through trial and error in a specific environment).
In Learning to Rank, there are three different approaches: pointwise, pairwise, and listwise. In the pointwise approach, training looks at each document independently. Each document has a relevance score, which is unrelated to and unaffected by other documents. Because of this, the final rank of the document is not visible to the loss function. In addition, this method does not take into account that documents may be related to each other, or that documents belong to a particular query.
The pairwise approach considers documents in pairs. Learning occurs by looking at which document in a pair is more relevant. For example, the model may predict which document in a pair is more relevant and compare that prediction against the labels of the two documents; the model is then adjusted. Depending on the algorithm, the model may output a score for a single document, but in this case the loss function (and thus training) will usually require a pair of documents.
The listwise approach involves looking at lists of documents associated with a query. While the model may output a score for one document, the loss function requires a full list of documents for a query. Thus, during learning, information about all of the documents is necessary.
This project currently only implements algorithms of the supervised learning variety, and thus we will focus on supervised learning in this document. The implemented algorithms include pointwise, pairwise, and listwise approaches.
You will need to install the following:
- Java 8+
- Apache Ant 1.8+ (not required to run; however, the following instructions will assume Apache Ant has been installed)
You can download the OHSUMED, LETOR MQ2007, and LETOR MQ2008 datasets from this web page.
Dataset Descriptions
Each row is a query-document pair. The first column is the relevance label of the pair, the second column is the query id, the following columns are features, and the end of the row is a comment about the pair, including the id of the document. The larger the relevance label, the more relevant the query-document pair. A query-document pair is represented by a 46-dimensional feature vector. Here are several example rows from the MQ2007 dataset:
====================================================
2 qid:10032 1:0.056537 2:0.000000 3:0.666667 ... 46:0.076923 #docid=GX029-35-5894638 inc=0.02 prob=0.13984
0 qid:10032 1:0.279152 2:0.000000 3:0.000000 ... 46:1.000000 #docid=GX030-77-6315042 inc=0.71 prob=0.34136
0 qid:10032 1:0.130742 2:0.000000 3:0.333333 ... 46:1.000000 #docid=GX140-98-1356607 inc=0.12 prob=0.07013
1 qid:10032 1:0.593640 2:1.000000 3:0.000000 ... 46:0.000000 #docid=GX256-43-0740276 inc=0.01 prob=0.40073
====================================================
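To make the format concrete, below is a minimal, self-contained sketch (not the reader LTR4L itself uses; all names are made up for the example) of how one of these rows could be split into its label, query id, feature vector, and comment:

import java.util.Arrays;

// Hypothetical stand-alone example; this is not the LTR4L data reader.
public class LetorRowExample {
    public static void main(String[] args) {
        String row = "2 qid:10032 1:0.056537 2:0.000000 3:0.666667 46:0.076923 "
                + "#docid=GX029-35-5894638 inc=0.02 prob=0.13984";

        // Everything after '#' is a comment about the pair (docid, etc.).
        String[] dataAndComment = row.split("#", 2);
        String[] tokens = dataAndComment[0].trim().split("\\s+");

        int label = Integer.parseInt(tokens[0]);          // relevance label
        String qid = tokens[1].split(":")[1];             // query id
        double[] features = new double[46];               // 46-dimensional feature vector
        for (int i = 2; i < tokens.length; i++) {
            String[] kv = tokens[i].split(":");
            features[Integer.parseInt(kv[0]) - 1] = Double.parseDouble(kv[1]);
        }

        System.out.println("label=" + label + ", qid=" + qid
                + ", first features=" + Arrays.toString(Arrays.copyOf(features, 3))
                + ", comment=" + dataAndComment[1]);
    }
}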
1- Download the project from GitHub:
$ git clone https://github.com/LTR4L/ltr4l.git
2- Go to the project folder:
$ cd ltr4l
3- Run ivy bootstrap (this only needs to be run once after cloning):
$ ant ivy-bootstrap
4- Package the jar file:
$ ant clean package
5- Run the following command:
train ranknet
To see help for the train command, use the -help option:
train -help
usage: train <LTR-algorithm-name> [-config <file>] [-debug] [-help]
[-iterations <num>] [-model <file>] [-report <file>] [-training
<file>] [-validation <file>] [-verbose] [-version]
Execute Learning-to-Rank training algorithm. The algorithm is specified by
the required argument <LTR-algorithm-name>. The program will look for the
configuration file "config/<LTR-algorithm-name>.json" unless config option
is specified. The following options can be specified in order to override
the existing settings in the config file.
-config <file> use given file for configuration
-debug print debugging information
-help print this message
-iterations <num> use given number of iterations
-model <file> specify model file name
-report <file> specify report file name
-training <file> use given file for training
-validation <file> use given file for validation
-verbose be extra verbose
-version print the version information and exit
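For example, assuming the default configuration file config/ranknet.json is in place, the number of iterations and the training file could be overridden as follows (the values shown are only placeholders):
train ranknet -iterations 200 -training data/MQ2008/Fold1/train.txt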
6- Open the report file:
open report/ranknet-report.csv
If you have a model JSON file (for example, the model saved from running the training program), you can run the prediction program using that model file.
Assuming the default parameters (using the default location models are saved to by the training program), the execution instructions are as follows:
1- Go to the project folder:
$ cd ltr4l
2- Run the following command (if the jar file has not been packaged, do steps 3 and 4 in the previous section):
./predict ranknet
To see help for the predict command, use the -help option:
./predict -help
usage: predict <LTR-algorithm-name> [-debug] [-eval <evalType>] [-help]
[-k <k>] [-model <file>] [-report <file>] [-test <file>] [-verbose]
[-version]
Execute Learning-to-Rank predicting algorithm. The algorithm is specified
by the required argument <LTR-algorithm-name>. The program will look for
the model file "model/<LTR-algorithm-name>-model.json" unless model option
is specified. The following options can be specified in order to override
the existing settings in the config file.
-debug print debugging information
-eval <evalType> specify type of evaluator
-help print this message
-k <k> specify k-value for evaluators which use @k
-model <file> use given file for configuration and model
-report <file> specify report file name
-test <file> use given file for testing the model
-verbose be extra verbose
-version print the version information and exit
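For example, to evaluate a trained model on a particular test file and report the metric at position 5 (the path and value below are only placeholders), you could run:
./predict ranknet -test data/MQ2008/Fold1/test.txt -k 5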
Below is an example of a config file:
{
  "algorithm" : "RankNet",
  "numIterations" : 100,
  "params" : {
    "learningRate" : 0.00001,
    "optimizer" : "adam",
    "weightInit" : "xavier",
    "regularization" : {
      "regularizer" : "L2",
      "rate" : 0.01
    },
    "layers" : [
      {
        "activator" : "Sigmoid",
        "num" : 10
      }
    ]
  },
  "dataSet" : {
    "training" : "data/MQ2008/Fold1/train.txt",
    "validation" : "data/MQ2008/Fold1/vali.txt",
    "test" : "data/MQ2008/Fold1/test.txt"
  },
  "model" : {
    "format" : "json",
    "file" : "model/ranknet-model.json"
  },
  "evaluation" : {
    "evaluator" : "NDCG",
    "params" : {
      "k" : 10
    }
  },
  "report" : {
    "format" : "csv",
    "file" : "report/ranknet-report.csv"
  }
}
You must specify the number of nodes and the activation for each hidden layer and for the output layer. However, for NNRank, you do not need to specify the final layer. For example, to add another hidden layer of 3 ReLU nodes and set the output layer's activation to Sigmoid, change layers to:
"layers" : [
{
"activator" : "Sigmoid",
"num" : 10
},
{
"activator" : "ReLU",
"num" : 3
},
{
"activator" : "Sigmoid",
"num" : 1
}
]
You can also change the training data and validation data by changing the corresponding file paths passed as arguments when executing the program directly:
java -jar LTR4L-X.X.X.jar ranknet data/MQ2007/Fold1/train.txt data/MQ2007/Fold1/vali.txt confs/ranknet.json
In this section, we first provide bar charts showing NDCG@10 for each algorithm across different datasets. Then we provide graphs of NDCG and loss, as well as the parameters used and the elapsed time. If the batch size is not included in a table, the default batch size was used (i.e. weights were updated as in the original papers). Note that the elapsed time can vary quite a bit depending on the parameters (especially the number of hidden layers, activation, etc.), and is only provided to give a general idea of the speed of the algorithm. Even with no change in parameters, the elapsed time can vary. Note that this is the time required for all epochs to finish, i.e. for all training and validation to complete.
Note that SortNet is missing from the MQ2007 and OHSUMED results due to a bug. When it is fixed, the graphs will be updated.
Parameter | Value |
---|---|
Algorithm | PRank |
Dataset | LETOR:MQ2007 Fold 1 |
Weights Initialization | Zero |
Bias Initialization | Zero, Infinity |
Loss Function | Square Error |
Epochs | 100 |
Time Elapsed | 8.7 s |
Note: N is the number of PRanks used.
Parameter | Value |
---|---|
Algorithm | OAP-BPM |
Dataset | LETOR:MQ2008 Fold 1 |
Weights Initialization | Zero |
Bias Initialization | Zero, Infinity |
N | 100 |
Bernoulli | 0.03 |
Loss Function | Square Error |
Epochs | 100 |
Time Elapsed | 10.8 s |
Parameter | Value |
---|---|
Algorithm | NNRank |
Dataset | LETOR:MQ2008 Fold 1 |
Optimizer | Momentum |
Batch Size | 15 |
Weights Initialization | Gaussian |
Bias Initialization | Constant (0.1) |
Layers | [46, 15, 3] |
Hidden Activation | Sigmoid |
Output Activation | Sigmoid |
Loss Function | Square Error |
Epochs | 100 |
Learning Rate | 0.01 |
Regularization | L1 |
Regularization Rate | 0.01 |
Threshold | 0.8 (Currently Hardcoded) |
Time Elapsed | 12.539s |
Note: Only document pairs with different labels are used (i.e. a pair of two documents that both have a label of "0" is ignored during training). In addition, only a maximum of 1/6 of the queries are actually used during training (randomly selected each time). Note that the elapsed time is quite large compared to other algorithms, despite this reduction in document pairs. Also note the steady rise in NDCG and fall in loss despite the reduction in document pairs used.
Parameter | Value |
---|---|
Algorithm | RankNet |
Dataset | LETOR:MQ2007 Fold 1 |
Optimizer | Adagrad |
Weights Initialization | Xavier |
Bias Initialization | Constant (0.1) |
Layers | [46, 10, 1] |
Hidden Activation | Identity |
Output Activation | Sigmoid |
Loss Function | Cross Entropy |
Epochs | 100 |
Learning Rate | 0.00001 |
Regularization | L2 |
Regularization Rate | 0.01 |
Time Elapsed | 114.8s |
As an example of the importance of tuning and the difference datasets can make, here are the results of RankNet performed on MQ2008:
Parameter | Value |
---|---|
Algorithm | RankNet |
Dataset | LETOR:MQ2008 Fold 1 |
Optimizer | Adam |
Weights Initialization | Xavier |
Bias Initialization | Constant (0.1) |
Layers | [46, 10, 1] |
Hidden Activation | Sigmoid |
Output Activation | Sigmoid |
Loss Function | Cross Entropy |
Epochs | 100 |
Learning Rate | 0.00001 |
Regularization | L2 |
Regularization Rate | 0.01 |
Time Elapsed | 37.2s |
Note: FRankNet is still a work in progress.
Parameter | Value |
---|---|
Algorithm | FRankNet |
Dataset | LETOR:MQ2008 Fold 1 |
Optimizer | Momentum |
Weights Initialization | Gaussian |
Bias Initialization | Constant (0.1) |
Layers | [46, 5, 1] |
Hidden Activation | Sigmoid |
Output Activation | Sigmoid |
Loss Function | Cross Entropy |
Epochs | 100 |
Learning Rate | 0.0001 |
Regularization | L2 |
Regularization Rate | 0.01 |
Time Elapsed | 15.0s |
Parameter | Value |
---|---|
Algorithm | LambdaRank |
Dataset | LETOR:MQ2008 Fold 1 |
Optimizer | Adam |
Weights Initialization | Xavier |
Bias Initialization | Constant (0.1) |
Layers | [46, 10, 1] |
Hidden Activation | Sigmoid |
Output Activation | Sigmoid |
Loss Function | Cross Entropy |
Epochs | 100 |
Learning Rate | 0.1 |
Regularization | L2 |
Regularization Rate | 0.1 |
Time Elapsed | 38.1s |
Parameter | Value |
---|---|
Algorithm | SortNet |
Dataset | LETOR:MQ2008 Fold 1 |
Optimizer | Momentum |
Weights Initialization | Xavier |
Bias Initialization | Constant (0.1) |
Layers | [46, 3, 1] |
Hidden Activation | Sigmoid |
Output Activation | Sigmoid |
Loss Function | Square Error |
Epochs | 100 |
Learning Rate | 0.01 |
Regularization | L2 |
Regularization Rate | 0.01 |
Time Elapsed | 20.8s |
Parameter | Value |
---|---|
Algorithm | ListNet |
Dataset | LETOR:MQ2007 Fold 1 |
Optimizer | Adam |
Weights Initialization | Xavier |
Bias Initialization | Constant (0.1) |
Layers | [46, 15, 1, 1] |
Hidden Activation | Identity, Sigmoid |
Output Activation | Sigmoid |
Loss Function | Cross Entropy |
Epochs | 100 |
Learning Rate | 0.001 |
Regularization | L2 |
Regularization Rate | 0.01 |
Time Elapsed | 70.8s |
At the end of the overview for each algorithm, pseudo-code will be provided. In general, the flow of training and validation has the following structure:
S = epoch number
for s = 1 to S do:
    train()      // updates weights accordingly
    validate()   // calculates NDCG and loss using all documents in the validation set
    report()
end for
We will focus on train(), and elaborate on the other parts when necessary. We will also include initialization and variables. Note that yi will refer to the label of document xi, unless otherwise stated.
Type: Single Perceptron
Approach: Pointwise
Strengths: Very simple structure, very fast
Weaknesses: Only linear results/weights can be obtained (not very effective for non-linear data sets).
Network Input: A single document's features
Network Output: Relevance score**
Weights updated per document.
PRank essentially computes the inner product between a weight vector and a document's feature vector. Both are real-valued vectors. The inner product is the predicted relevance score, and based on this score, PRank classifies the document. The document can be classified into categories such as highly-relevant, relevant, and irrelevant, or 3-star, 2-star, 1-star, etc. The simplest classification would be between two classes: relevant and irrelevant.
To determine which category a score falls into, PRank looks at "threshold values." Threshold values work like a grading system: for example, if score >= 90 the grade is an A; if 80 <= score < 90, a B; and so on.
When PRank is learning, it changes both the weight vector and the threshold values whenever the predicted category is not correct. In terms of the grading analogy, it is as if the teacher adjusts how many points each problem is worth (the weights), as well as the number of points needed to get a certain grade (the thresholds). The farther the predicted category is from the actual category, the more the weights and thresholds are changed. Weights and thresholds are updated per document, and documents are chosen at random.
Note that in terms of a Neural Network, the structure behind PRank is relatively simple. It can be thought of as a network with only two layers: the input layer, and the output layer. In the input layer, each node represents a feature of the document (thus the number of nodes is equal to the number of features of the document). In the output layer, there is one node. The activation for each node in the network is the identity function, and each node's output is multiplied by a weight, and that becomes one of the inputs of the output node.
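As an illustration only (this is not the LTR4L implementation, and all names below are made up for the example), the scoring and classification step described above amounts to a dot product followed by a threshold scan:

// Minimal sketch of PRank's prediction step: score = W · x, then find the first
// threshold the score falls below. The last threshold is treated as +infinity,
// so a document always falls into some category.
public class PRankPredictExample {
    static int predictCategory(double[] w, double[] x, double[] thresholds) {
        double score = 0.0;
        for (int i = 0; i < w.length; i++) {
            score += w[i] * x[i];                 // inner product between weights and features
        }
        for (int r = 0; r < thresholds.length; r++) {
            if (score < thresholds[r]) {          // first threshold the score is below
                return r;
            }
        }
        return thresholds.length;                 // top category (implicit +infinity threshold)
    }

    public static void main(String[] args) {
        double[] w = {0.5, -0.2, 1.0};
        double[] x = {0.3, 0.8, 0.4};
        double[] b = {0.0, 0.5};                  // 3 categories -> 2 finite thresholds
        System.out.println("predicted category: " + predictCategory(w, x, b));
    }
}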
Pseudo-code is as follows:
S = epoch number
T = Training set of document-label pairs {(x1, y1), (x2, y2), ..., (xn, yn)}
V = Validation set of document-label pairs {(x1, y1), (x2, y2), ..., (xm, ym)}
Initialization:
    weights vector: W = (0, 0, 0, ..., 0)
    threshold vector: b = (0, 0, ..., 0)   [Note: length of b is # of categories - 1; i.e. the last threshold is infinitely large]
    Error function = Square Error
for s = 1 to S do:
    shuffle(T)   [arrange randomly]
    for each document x(t) in T do:
        yt' = min{ r : r is an index of b in {1, 2, ..., k} and W*x(t) - b[r] < 0 }   // start of updateWeights(x)
        if (yt' != yt) then:
            τ[] = new integer array (length = b.length = k)
            for r = 1 to k - 1 do:
                integer ytr
                if (yt <= r) then: ytr = -1 else: ytr = 1   // this is to count mismatched categories
                if ((W*x(t) - b[r]) * ytr <= 0) then:
                    τ[r] = ytr
                else:
                    τ[r] = 0
            end for
            τs = Σ τ[r] for all r
            W = W + τs * x(t)
            b = b - τ
        end if   // end of updateWeights(x)
    end for
    validate()
    report()
end for
Type: PRank Network
Approach: Pointwise
Strengths: Relatively simple structure, more effective learning than PRank (rise in NDCG, decrease in loss)
Weaknesses: Slower than PRank, only linear results/weights can be obtained
Network Input: A single document's features
Network Output: Relevance score**
Weights updated per document.
OAP-BPM is an algorithm which uses N PRanks and a Bernoulli parameter τ.
τ is a number between 0 and 1 (0 < τ <= 1). Bernoulli(τ) returns 1 with probability τ, and 0 otherwise.
The weight vector of OAP-BPM is the average of the weight vectors of all of the PRanks.
Each iteration, OAP-BPM presents a document to each PRank instance with probability τ.
If a document is presented to a PRank, the PRank predicts its class and updates its weights accordingly.
Then, OAP-BPM's weights are updated by adding the PRank's weights divided by N.
Please see below for an example of the presentation of documents and the update of weights:
S = epoch number
T = Training set of document-label pairs {(x1, y1), (x2, y2), ..., (xn, yn)}
V = Validation set of document-label pairs {(x1, y1), (x2, y2), ..., (xm, ym)}
N = number of PRank objects in pRanks
Initialization:
    weights vector: W = (0, 0, 0, ..., 0)
    threshold vector: b = (0, 0, ..., 0)   [Note: length of b is # of categories - 1; i.e. the last threshold is infinitely large]
    Error function = Square Error
for s = 1 to S do:
    shuffle(T)
    for each document x(t) in T do:
        for each pR in pRanks do:
            if (bernoulli(τ) == 1) then:
                pR.updateWeights(x(t))
                this.weights = this.weights + pR.weights / N
                this.b = this.b + pR.b / N
            end if
        end for
    end for
    validate()
    report()
end for
Type: Feed-forward neural network (Multi-layer perceptron)
Approach: Pointwise
Strengths: Relatively fast and accurate, online and batch mode, can handle non-linear data
Weaknesses:
Network Input: A single document's features.
Network Output: Document's classification (for a particular query)
Weights updated per document.
NNRank uses a Multi-layer perceptron network, where the output layer has a number of nodes equal to the number of ordinal categories (for example, if the categories are 1, 2, 3, and 4 stars, there will be 4 output nodes). The key difference of NNRank (from other implemented MLP algorithms) is that the network directly tries to classify a given document, rather than output a relevance score. Strictly speaking, this method falls into the multi-threshold approach.
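For intuition, one common target encoding in this multi-threshold (ordinal regression) setting turns a label into a cumulative 0/1 vector over the output nodes; the sketch below is based on that general idea and is not taken from the LTR4L implementation, which may encode labels differently:

// Hypothetical example of a cumulative target encoding for ordinal categories:
// a document of category c (0-based) out of K categories is encoded so that the
// first c+1 output nodes should fire and the remaining nodes should not.
public class OrdinalTargetsExample {
    static double[] encode(int category, int numCategories) {
        double[] target = new double[numCategories];
        for (int i = 0; i < numCategories; i++) {
            target[i] = (i <= category) ? 1.0 : 0.0;
        }
        return target;
    }

    public static void main(String[] args) {
        // e.g. label 2 out of 4 categories -> [1.0, 1.0, 1.0, 0.0]
        System.out.println(java.util.Arrays.toString(encode(2, 4)));
    }
}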
S = epoch number
T = Training set of queries with associated document-label pairs {q1, q2, ..., qn} (each q has a list of documents w/ labels)
V = Validation set of queries with associated document-label pairs {q1, q2, ..., qm}
Initialization:
    Error function = square error
    network = constructNodes()   [2 dimensional list of nodes]
    initializeWeights()   [use specified random initialization strategy]
for s = 1 until S do:
    for all documents x in T do:
        output = predict(x's features)
        label = x.label
        if (output != label) then:
            backProp(x)
            updateWeights(x)
        end if
    end for
    validate()
    report()
end for

predict(features):
    threshold = 0.5
    forwardProp(features)
    for nodeId = 0 until network.finalLayer.length do:
        node = network.finalLayer[nodeId]
        if (node.getOutput < threshold) then:
            return nodeId - 1
        end if
    end for
    return network.finalLayer.length - 1
Type: Feed-forward neural network
Approach: Pairwise*
Strengths: Relatively high accuracy. Has been widely adopted to much success.
Weaknesses: Very slow for training through all document pairs.
Network Input: A single document's features
Network Output: Relevance Score
Weights updated per document pair
RankNet uses a multi-layer perceptron network where the output layer has one node. It works to solve the preference learning problem directly, rather than solving an ordinal regression (i.e. a classification) problem. While only a single document is necessary for forward propagation (and thus for ranking), a pairwise cross-entropy error function is used to measure loss, so two documents are required for backpropagation (as the loss is defined over two documents, its derivative must be taken with respect to document pairs). Therefore, RankNet is considered a pairwise approach: document pairs are used as instances during training.
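For reference, the pairwise cross-entropy loss from the original RankNet paper has roughly the following form (the implementation may differ in details such as scaling constants):
P_ij = 1 / (1 + exp(-(s_i - s_j)))
C_ij = -Pbar_ij * log(P_ij) - (1 - Pbar_ij) * log(1 - P_ij)
where s_i and s_j are the network's scores for documents i and j, P_ij is the modeled probability that document i should be ranked above document j, and Pbar_ij is the target probability derived from the labels.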
S = epoch number
T = Training set of queries with associated document-label pairs {q1, q2, ..., qn} (each q has a list of documents w/ labels)
V = Validation set of queries with associated document-label pairs {q1, q2, ..., qm}
TP = All pairs of documents (x1, x2) associated with each q in T, such that y1 > y2.
VP = All pairs of documents (x1, x2) associated with each q in V, such that y1 > y2.
Initialization:
    Error function = cross entropy
    network = constructNodes()   // 2 dimensional list of nodes
    initializeWeights()   // use specified random initialization strategy
for s = 1 until S do:
    for all document pairs (x1, x2) in TP do:
        threshold = 0.5
        s1 = forwardProp(x1)
        s2 = forwardProp(x2)
        delta = s1 - s2
        if (delta < threshold) then:
            sigma = Sigmoid.output(delta)
            backProp(sigma)      // backpropagate for x2 (the most recent forward pass)
            forwardProp(x1)      // forward propagate x1 again so that the next backProp applies to x1
            backProp(-sigma)
        end if
    end for
    validate()
    report()
end for
Type: Feed-forward neural network
Approach: Pairwise, Listwise*
Strengths: Faster than RankNet; weights are updated less frequently.
Weaknesses:
Network Input: A single document's features
Network Output: Relevance Score
Weights updated per epoch
FRankNet is a modification of RankNet. The structure of the network in FRankNet is unchanged from RankNet. The difference lies in the way that the weights are updated. RankNet looks at how well the current model predicts a pair of documents and adjusts the model. FRankNet looks at how well the current model predicts for many different pairs, then adjusts the model.
In RankNet, the output function is first calculated for a pair of documents in a particular query, then the derivative of the cost function is taken with respect to the difference of the scores of the pair, and that is used to update the weights. When updating for every pair of documents, two backpropagations are required for every pair. Thus, for N documents in a query, the total number of weight updates per query is N(N-1)/2, and the total number of times backpropagation occurs is N(N-1).
In FRankNet, the weights are updated per query. The output function is first calculated for every pair in the query and combined (added and subtracted appropriately). Then, the derivative of the cost function is taken with respect to the score of a document (not the difference of the scores of a document pair) using that output. This derivative is called λi (for document i) and it is used for backpropagation. After this has been done for every document in the query, the weights are updated. Thus, for N documents in a query, the total number of weight updates per query is 1, and the total number of times backpropagation occurs is N.
The approach is pairwise because the cost function is defined for document pairs, and listwise because all of the document pairs (and thus all of the documents) in a query are used; thus a document list for the query is required.
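In the notation of the pseudo-code below, the value that is backpropagated for document i can be summarized as follows (this is a summary of the loops below, not a formula quoted from the original paper):
λi = - Σ sigmoid(s_j - s_i) over all pairs in which i is the more relevant document, + Σ sigmoid(s_i - s_j) over all pairs in which i is the less relevant document
where s_i is the current score of document i, sigmoid is the logistic function, and the sums run over all document pairs in the query that contain i.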
S = epoch number
T = Training set of queries with associated document-label pairs {q1, q2, ..., qn} (each q has a list of documents w/ labels)
V = Validation set of queries with associated document-label pairs {q1, q2, ..., qm}
TP = All pairs of documents (x1, x2) associated with each q in T, such that y1 > y2.
VP = All pairs of documents (x1, x2) associated with each q in V, such that y1 > y2.
Initialization:
    Error function = cross entropy
    network = constructNodes()   // 2 dimensional list of nodes
    initializeWeights()   // use specified random initialization strategy
for s = 1 until S do:
    for all queries q in T do:
        lambdas = new map
        ranks = new map
        for each document x in q do:
            ranks.put(x, forwardProp(x))   // the scores are cached for speed
            lambdas.put(x, 0)
        end for
        for each document pair (x1, x2) in q (such that y1 > y2) do:
            delta = ranks.get(x2) - ranks.get(x1)
            lambda = Sigmoid.output(delta)
            lambdas[x1] -= lambda
            lambdas[x2] += lambda
        end for
        for each document x1 in q do:
            forwardProp(x1)
            backProp(lambdas[x1])
        end for
        updateWeights()
    end for
    validate()
    report()
end for
Type: Feed-forward neural network
Approach: Pairwise, Listwise*
Strengths: Faster than RankNet; weights are updated less frequently.
Weaknesses:
Network Input: A single document's features
Network Output: Relevance Score
Weights updated per epoch
LambdaRank is a modification of FRankNet. The structure of the network remains unchanged.
The main difference is that, when calculating λi by looking at all pairs containing document i (as explained for FRankNet), each pair's contribution is multiplied by |ΔNDCG|, the absolute change in NDCG that would result from swapping the two documents. Please refer to the original article for the reasoning behind the use of this term.
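In the same notation as for FRankNet, each pair's contribution is simply scaled by |ΔNDCG| (again, this mirrors the pseudo-code below rather than the exact formulation of the original paper). For a pair in which document i is more relevant than document j:
λi -= sigmoid(s_j - s_i) * |ΔNDCG(i, j)|
λj += sigmoid(s_j - s_i) * |ΔNDCG(i, j)|
where |ΔNDCG(i, j)| is the absolute change in NDCG obtained by swapping documents i and j in the current ranking.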
S = epoch number
T = Training set of queries with associated document-label pairs {q1, q2, ..., qn} (each q has a list of documents w/ labels)
V = Validation set of queries with associated document-label pairs {q1, q2, ..., qm}
TP = All pairs of documents (x1, x2) associated with each q in T, such that y1 > y2.
VP = All pairs of documents (x1, x2) associated with each q in V, such that y1 > y2.
Initialization:
    Error function = cross entropy
    network = constructNodes()   // 2 dimensional list of nodes
    initializeWeights()   // use specified random initialization strategy
for s = 1 until S do:
    for all queries q in T do:
        lambdas = new map
        ranks = new map
        for each document x in q do:
            ranks.put(x, forwardProp(x))   // the scores are cached for speed
            lambdas.put(x, 0)
        end for
        for each document pair (x1, x2) in q (such that y1 > y2) do:
            ΔNDCG = |change in NDCG if x1 and x2 were swapped|
            delta = ranks.get(x2) - ranks.get(x1)
            lambda = Sigmoid.output(delta)
            lambdas[x1] -= lambda * ΔNDCG
            lambdas[x2] += lambda * ΔNDCG
        end for
        for each document x1 in q do:
            forwardProp(x1)
            backProp(lambdas[x1])
        end for
        updateWeights()
    end for
    validate()
    report()
end for
Type: Comparative Neural Network
Approach: Pairwise
Strengths:
Weaknesses:
Network Input: Two documents' features
Network Output: Preference between the two documents (i.e. which document is more relevant).
For example, for an input of {x1, x2}, the output will be {o1, o2}.
If o1 > o2, then x1 is more relevant than x2; if o2 > o1, then x2 is more relevant than x1.
Thus, the targets (given two labels) will be either {1, 0} or {0, 1}.
Weights updated per document pair.
The structure of SortNet's neural network is different from the rest of the networks implemented thus far in this project. Apart from the fact that the features of two documents are required for forward propagation, the output layer must have two nodes, and each hidden layer must have an even number of nodes. The reason for this is that the network is symmetric: the weights between ni and n1 are equal to the weights between ni' and n1', and the weights between ni and n1' are equal to the weights between ni' and n1.
Note: As the output of the network is the preference between the two documents, there is no way to directly calculate the score of a document, nor is there a way to classify the document (for ordinal regression).
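A practical consequence of this weight sharing (stated informally here; see the original SortNet paper for the precise formulation) is that swapping the two input documents swaps the two outputs:
if N(x1, x2) = (o1, o2), then N(x2, x1) = (o2, o1)
so the network's preference is consistent regardless of the order in which the pair is presented.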
S = epoch number
T = Training set of queries with associated document-label pairs {q1, q2, ..., qn} (each q has a list of documents w/ labels)
V = Validation set of queries with associated document-label pairs {q1, q2, ..., qm}
TP = All pairs of documents (x1, x2) associated with each q in T, such that y1 > y2.
VP = All pairs of documents (x1, x2) associated with each q in V, such that y1 > y2.
Initialization:
    Error function = square error
    network = constructNodes()   // 2 dimensional list of nodes; comparative neural network
    initializeWeights()   // use specified random initialization strategy
    targets[][] = {{1, 0}, {0, 1}}
for s = 1 until S do:
    for all queries q in T do:
        threshold = 0.5
        for all documents x1 in q do:   // this is a variation: it does not look at all pairs!
            get a random document x2 in q such that x2 != x1
            delta = y1 - y2
            if (delta != 0) then:
                prediction = predict(x1, x2)
                if (prediction * delta < threshold) then:
                    if (delta > 0) then: backProp(targets[0], errorFunc)
                    else: backProp(targets[1], errorFunc)
                    updateWeights()
                end if
            end if
        end for
    end for
    validate()
    report()
end for

predict(x1, x2):
    output = forwardProp(x1, x2)
    return output[0] - output[1]
Type: Feed-forward neural network
Approach: Listwise
Strengths:
Weaknesses:
Network Input: A single document's features
Network Output: Relevance Score
Weights updated per query
ListNet can be considered a "listwise version" of RankNet. The main difference is that the cross-entropy loss is calculated using a list of documents rather than a pair of documents, and the probability used in the cross-entropy loss is the top one probability (see the original paper) rather than the probability of one document being more relevant than another. To see the key differences, please look at the implementation of ListNetMLP.backProp and ListNetTrainer.calculateLoss().
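For reference, the top one probability and the resulting listwise loss from the original ListNet paper have roughly the following form (the implementation may differ, e.g. in how the label distribution is formed):
P_s(j) = exp(s_j) / Σ_k exp(s_k)   [top one probability of document j given the scores s]
loss(q) = - Σ_j P_y(j) * log(P_s(j))   [cross entropy between the label distribution P_y and the score distribution P_s]
where the sums run over all documents in the query.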
S = epoch number
T = Training set of queries with associated document-label pairs {q1, q2, ..., qn} (each q has a list of documents w/ labels)
V = Validation set of queries with associated document-label pairs {q1, q2, ..., qm}
Initialization:
    Error function = cross entropy
    network = constructNodes()   // 2 dimensional list of nodes
    initializeWeights()   // use specified random initialization strategy
for s = 1 until S do:
    for all queries q in T do:
        for all documents x in q do:
            forwardProp(x)
            backProp(x)
        end for
        updateWeights()
    end for
    validate()
    report()
end for
This section is a work in progress, and it will be updated when overviews/formulation of each optimizer implementation has been written.
In LTR4L, several different gradient descent optimization algorithms have been implemented. The following optimizers have been implemented (and can also be specified by the names below):
- SGD
- Momentum
- Adam
- Nesterov
- Adagrad
- Adamax
- Nadam
- AMS
Evaluation metrics are essential to measure the accuracy of a ranking model. For evaluation metrics marked with an asterisk, a position value (k) may be specified as well (if it is not, the default value of 10 will be used). If a specified evaluation metric does not require a position value, "k" will be ignored even if it is specified. If no evaluation metric has been specified, NDCG@10 will be chosen by default. The following evaluation metrics can be specified in the config file:
- DCG (Discounted Cumulative Gain)*
- NDCG (Normalized Discounted Cumulative Gain)*
- Precision (P@k)*
- MAP (Mean Average Precision)
- WAP (Weighted Average Precision)
- MRR (Mean Reciprocal Rank)
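For reference, a common formulation of DCG and NDCG at position k is the following (the implementation may differ slightly, e.g. in the exact gain or discount used):
DCG@k = Σ_{i=1..k} (2^rel_i - 1) / log2(i + 1)
NDCG@k = DCG@k / IDCG@k
where rel_i is the relevance label of the document at position i and IDCG@k is the DCG@k of the ideal (perfectly sorted) ranking.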
Note that while DCG is available, NDCG is recommended over DCG, as the latter is not normalized. Below is the relevant part of the example config file from the execution instructions.
{
  ...
  "evaluation" : {
    "evaluator" : "NDCG",
    "params" : {
      "k" : 10
    }
  },
  ...
}