Learning to Rank:
- Learning to rank, most often used in information retrieval, trains a model that arranges a set of query results into an ordered list. In supervised learning to rank, the predictors are sample documents encoded as a feature matrix, and the labels are the relevance degree of each sample. Relevance can be graded (multi-level) or binary (relevant or not). Training samples are usually grouped by query index, with each query group containing multiple query results.

XGBoost implements learning to rank through a set of objective functions and performance metrics. The default objective is rank:ndcg.
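
Because the default objective targets NDCG, it may help to see how that metric is computed for a single query group. The sketch below is a plain-Python illustration, assuming the common exponential-gain form (gain = 2^rel - 1) with a log2 position discount. The helper names `dcg` and `ndcg` are made up for these notes and are not XGBoost functions, and XGBoost's internal computation may differ in details such as truncation.

```python
import math


def dcg(relevances, k=None):
    """Discounted cumulative gain of labels listed in ranked order.

    Assumes exponential gain (2**rel - 1) and a log2 position discount.
    """
    if k is not None:
        relevances = relevances[:k]
    return sum(
        (2 ** rel - 1) / math.log2(pos + 2)  # pos is 0-based, so rank 1 -> log2(2)
        for pos, rel in enumerate(relevances)
    )


def ndcg(relevances_in_ranked_order, k=None):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = sorted(relevances_in_ranked_order, reverse=True)
    ideal_dcg = dcg(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents in this query group
    return dcg(relevances_in_ranked_order, k) / ideal_dcg


# Labels of one query group, listed in the order the model ranked them.
ranked_labels = [0, 1, 1, 0]
score = ndcg(ranked_labels)
print(score)
```

A perfect ordering such as `[1, 1, 0, 0]` gives an NDCG of 1, and pushing relevant results further down the list lowers the score.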

**Training with the Pairwise Objective:**
- For simplicity, we use a synthetic binary learning-to-rank dataset: the binary labels indicate whether a result is relevant, and the query group index is assigned to each sample at random.

In [1]:
from sklearn.datasets import make_classification 
import numpy as np
import xgboost as xgb

In [2]:
# make a synthetic ranking dataset for demonstration

seed = 1994
X, y = make_classification(random_state=seed)
rng = np.random.default_rng(seed)
n_query_groups = 3
qid = rng.integers(0, n_query_groups, size=X.shape[0])

In [4]:
X.shape

(100, 20)

In [8]:
np.argsort(qid)

array([70, 67, 23, 60, 29, 30, 58, 18, 76, 32, 55, 34, 66, 13, 54, 85,  9,
       48, 91, 94,  3, 40, 97, 35, 63, 53, 47,  0, 68, 46, 73, 75, 77, 78,
       80, 81, 83, 86, 88, 90, 95, 61, 45, 99, 28, 15, 20, 43, 21, 14, 24,
       11, 10, 31, 33, 17, 42, 37, 41,  1,  4,  5,  6, 79, 96,  2, 12, 82,
       84, 92,  8, 87, 16,  7, 89, 93, 44, 71, 74, 39, 38, 98, 50, 51, 52,
       36, 56, 57, 59, 27, 62, 26, 64, 65, 25, 22, 69, 72, 19, 49],
      dtype=int64)

In [10]:
qid[np.argsort(qid)]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int64)

In [9]:
qid

array([1, 1, 2, 0, 1, 1, 2, 2, 2, 0, 1, 1, 2, 0, 1, 1, 2, 1, 0, 2, 1, 1,
       2, 0, 1, 2, 2, 2, 1, 0, 0, 1, 0, 1, 0, 0, 2, 1, 2, 2, 0, 1, 1, 1,
       2, 1, 1, 1, 0, 2, 2, 2, 2, 1, 0, 0, 2, 2, 0, 2, 0, 1, 2, 0, 2, 2,
       0, 0, 1, 2, 0, 2, 2, 1, 2, 1, 0, 1, 1, 2, 1, 1, 2, 1, 2, 0, 1, 2,
       1, 2, 1, 0, 2, 2, 0, 1, 2, 0, 2, 1], dtype=int64)

In [11]:
# sort the inputs based on query index
sorted_idx = np.argsort(qid)
X = X[sorted_idx, :]
y = y[sorted_idx]
qid = qid[sorted_idx]
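
After sorting, rows that belong to the same query sit next to each other. A quick sanity check on that layout can be sketched as follows. The array `qid_sorted` is a small made-up stand-in for the sorted `qid` above, so the numbers are illustrative only.

```python
import numpy as np

# Hypothetical, already-sorted query ids standing in for `qid` above.
qid_sorted = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])

# Non-decreasing ids mean every query group is one contiguous block of rows.
is_sorted = bool(np.all(np.diff(qid_sorted) >= 0))

# Number of samples in each query group.
group_ids, group_sizes = np.unique(qid_sorted, return_counts=True)

print(is_sorted)
print(dict(zip(group_ids.tolist(), group_sizes.tolist())))
```

The same two checks can be run on the real `qid` array before fitting.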

In [12]:
ranker = xgb.XGBRanker(
    tree_method="hist",
    lambdarank_num_pair_per_sample=8,
    objective="rank:ndcg",
    lambdarank_pair_method="topk",
)
ranker.fit(X, y, qid=qid)
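
Once the model is fitted, `ranker.predict(X)` returns one relevance score per row, and the ranking for a query comes from sorting that query's rows by score. The sketch below shows only the sorting step. The `scores` and `qid_demo` arrays are made-up stand-ins rather than real model output, and `ranking_per_query` is a name chosen for this illustration.

```python
import numpy as np

# Made-up scores standing in for the output of `ranker.predict(X)`.
scores = np.array([0.2, 1.5, -0.3, 0.9, 0.1, 2.0])
# Made-up query ids, one per row, sorted as in the training data.
qid_demo = np.array([0, 0, 0, 1, 1, 1])

ranking_per_query = {}
for q in np.unique(qid_demo):
    rows = np.flatnonzero(qid_demo == q)      # row indices belonging to query q
    order = np.argsort(scores[rows])[::-1]    # highest score first
    ranking_per_query[int(q)] = rows[order].tolist()

print(ranking_per_query)
```

Scores are only comparable within a query group, which is why the sort is done per query instead of over the whole array.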

Hyperopt - Hyperparameter Tuning: