### **Microsoft Learning to Rank (MSRank) Dataset – Description**

The **MSRank dataset** is a subset of Microsoft’s MSLR-WEB10K collection, designed for **learning-to-rank tasks**. It is widely used to benchmark ranking algorithms, including CatBoost.

* **Samples**:

  * **723,412 training samples**
  * **241,521 testing samples**

* **Features**:

  * **136 features** per sample
  * Numerical descriptors of the relationship between a **query** (e.g., a search term) and a **document** (e.g., a webpage).
  * Features include: term frequency, inverse document frequency, BM25 scores, document lengths, PageRank scores, click data, and various query-document match signals.

* **Target (Label)**:

  * A **relevance score** assigned to each query-document pair:

    * `0` = irrelevant
    * `1` to `3` = increasingly relevant
    * `4` = perfectly relevant
  * These labels are used to train models to rank documents for each query.

* **Groups**:

  * Each sample belongs to a **group (query ID)**, meaning all documents in that group correspond to the same user query.
  * The goal is to **order documents within each group by relevance**.
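
To make the grouping idea concrete, here is a minimal sketch in plain Python. The rows, query IDs, and document names are hypothetical toy values, not taken from MSRank; the sketch only shows what "ordering documents within each group by relevance" means when the true labels are known.

```python
from collections import defaultdict

# Hypothetical toy rows: (query_id, doc_id, relevance_label)
rows = [
    (1, "a", 0), (1, "b", 2), (1, "c", 1),
    (2, "d", 3), (2, "e", 0),
]

def ideal_order_per_group(rows):
    """Group rows by query ID and sort each group's documents by label, best first."""
    groups = defaultdict(list)
    for query_id, doc_id, label in rows:
        groups[query_id].append((doc_id, label))
    return {
        qid: [doc for doc, _ in sorted(docs, key=lambda d: d[1], reverse=True)]
        for qid, docs in groups.items()
    }

ideal = ideal_order_per_group(rows)
print(ideal)  # {1: ['b', 'c', 'a'], 2: ['d', 'e']}
```

A ranking model never sees the labels at prediction time; it produces scores whose within-group order should approximate this ideal order.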


In [1]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.8-cp311-cp311-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.8-cp311-cp311-manylinux2014_x86_64.whl (99.2 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.8


In [2]:
from catboost import datasets

# Load the Microsoft Learning to Rank dataset
train_df, test_df = datasets.msrank()

# The first column is the label, the second is the GroupId, and the rest are features
y_train = train_df.iloc[:, 0]
train_group_id = train_df.iloc[:, 1].astype(int) # Convert to integer type
X_train = train_df.iloc[:, 2:]

y_test = test_df.iloc[:, 0]
test_group_id = test_df.iloc[:, 1].astype(int) # Convert to integer type
X_test = test_df.iloc[:, 2:]

# Quick check of data
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("train_group_id shape:", train_group_id.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)
print("test_group_id shape:", test_group_id.shape)

X_train shape: (723412, 136)
y_train shape: (723412,)
train_group_id shape: (723412,)
X_test shape: (241521, 136)
y_test shape: (241521,)
test_group_id shape: (241521,)


In [3]:
X_train.head()

Unnamed: 0,2,3,4,5,6,7,8,9,10,11,...,128,129,130,131,132,133,134,135,136,137
0,3.0,3.0,0.0,0.0,3.0,1.0,1.0,0.0,0.0,1.0,...,62.0,11089534.0,2.0,116.0,64034.0,13.0,3.0,0.0,0.0,0.0
1,3.0,0.0,3.0,0.0,3.0,1.0,0.0,1.0,0.0,1.0,...,44.0,5.0,30.0,23836.0,63634.0,2.0,4.0,0.0,0.0,0.0
2,3.0,0.0,3.0,0.0,3.0,1.0,0.0,1.0,0.0,1.0,...,59.0,5.0,8.0,213.0,48469.0,1.0,13.0,0.0,0.0,0.0
3,3.0,0.0,3.0,0.0,3.0,1.0,0.0,1.0,0.0,1.0,...,44.0,0.0,30.0,23871.0,63634.0,3.0,4.0,0.0,0.0,0.0
4,3.0,0.0,3.0,0.0,3.0,1.0,0.0,1.0,0.0,1.0,...,44.0,4.0,30.0,23848.0,63634.0,3.0,4.0,0.0,0.0,0.0


In [4]:
y_train.head()

Unnamed: 0,0
0,2.0
1,0.0
2,0.0
3,0.0
4,0.0


The target values in `y_train` are graded relevance labels: `0` = irrelevant, `1` to `3` = increasingly relevant, `4` = perfectly relevant.

In [5]:
from catboost import Pool, CatBoostRanker

# Create Pool objects for the train and test sets (including group_id)
train_pool = Pool(
    data = X_train,
    label = y_train,
    group_id = train_group_id
)

test_pool = Pool(
    data = X_test,
    label = y_test,
    group_id = test_group_id
)
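
CatBoost generally expects rows that share a `group_id` to sit next to each other in the data; treat that as an assumption to verify against the documentation for your version. The helper below, `groups_are_contiguous`, is a hypothetical name and not part of CatBoost. It is a small sketch of how such a check could look before building a `Pool`.

```python
def groups_are_contiguous(group_ids):
    """Return True if every group ID appears in exactly one unbroken run."""
    seen = set()
    previous = object()  # sentinel that never equals a real group ID
    for gid in group_ids:
        if gid != previous:
            if gid in seen:
                # This ID already had a run earlier, so the group is split
                return False
            seen.add(gid)
            previous = gid
    return True

print(groups_are_contiguous([1, 1, 2, 2, 3]))  # True
print(groups_are_contiguous([1, 2, 1]))        # False
```

With the notebook's variables, the call would be `groups_are_contiguous(train_group_id.tolist())`.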

In [6]:
model = CatBoostRanker(
    iterations = 500,
    learning_rate = 0.1,
    depth = 6,
    loss_function = 'YetiRank',
    eval_metric = 'NDCG',
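    use_best_model = True,  # keeps the best iteration on eval_set; the test set plays that role here, so the reported NDCG is optimistic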
    verbose = 100
)

In [7]:
model.fit(
    train_pool,
    eval_set = test_pool
)

0:	test: 0.6831883	best: 0.6831883 (0)	total: 5s	remaining: 41m 32s
100:	test: 0.7895074	best: 0.7895076 (99)	total: 3m 52s	remaining: 15m 17s
200:	test: 0.7918769	best: 0.7919038 (197)	total: 7m 34s	remaining: 11m 15s
300:	test: 0.7935426	best: 0.7936099 (292)	total: 11m 17s	remaining: 7m 27s
400:	test: 0.7936718	best: 0.7937556 (350)	total: 14m 58s	remaining: 3m 41s
499:	test: 0.7939695	best: 0.7944165 (425)	total: 18m 38s	remaining: 0us

bestTest = 0.7944164915
bestIteration = 425

Shrink model to first 426 iterations.


<catboost.core.CatBoostRanker at 0x7cd9fc9a6310>
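
The `test:` values in the log are NDCG scores on the evaluation pool. As a rough illustration of what the metric measures, here is a minimal sketch of NDCG for a single query. The gain and discount choices (`exponential` flag, log2 position discount, 0.0 for all-zero groups) are assumptions made for this sketch; CatBoost's exact settings and averaging across groups may differ.

```python
import math

def dcg(labels, exponential=False):
    """Discounted cumulative gain of labels listed in ranked order."""
    total = 0.0
    for position, rel in enumerate(labels):
        gain = (2 ** rel - 1) if exponential else rel
        total += gain / math.log2(position + 2)  # position 0 -> log2(2) = 1
    return total

def ndcg(labels_in_predicted_order, exponential=False):
    """DCG of the predicted order divided by the DCG of the ideal order."""
    ideal = sorted(labels_in_predicted_order, reverse=True)
    ideal_dcg = dcg(ideal, exponential)
    if ideal_dcg == 0:
        # Assumption: score groups with no relevant documents as 0.0
        return 0.0
    return dcg(labels_in_predicted_order, exponential) / ideal_dcg

print(ndcg([3, 2, 1, 0]))  # 1.0, already in ideal order
print(ndcg([0, 1]))        # below 1.0, the relevant document is ranked second
```

The input to `ndcg` is the list of true labels for one query, arranged in the order the model ranked the documents.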

In [8]:
# Predictions
preds = model.predict(test_pool)
print("Sample predictions:", preds[:10])

Sample predictions: [-0.14453574  0.38564039  0.20739398  0.37999115  0.85888284  1.68826768
  0.73587038  0.99992739  1.26925725 -0.19086433]
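
The predictions are raw scores, not relevance grades, so they are only meaningful relative to other documents in the same query group. The sketch below turns scores into per-group ranks. The group IDs and scores are hypothetical toy values, and `rank_within_groups` is an illustrative helper, not a CatBoost function.

```python
from itertools import groupby

# Hypothetical toy data: two query groups with model scores
group_ids = [10, 10, 10, 20, 20]
scores = [-0.14, 0.39, 0.21, 0.86, 1.69]

def rank_within_groups(group_ids, scores):
    """Return the 1-based rank of each row inside its own group (1 = highest score)."""
    ranks = [0] * len(scores)
    # Sort row indices by group so groupby sees each group as one run
    by_group = sorted(range(len(scores)), key=lambda i: group_ids[i])
    for _, idxs in groupby(by_group, key=lambda i: group_ids[i]):
        ordered = sorted(idxs, key=lambda i: scores[i], reverse=True)
        for rank, i in enumerate(ordered, start=1):
            ranks[i] = rank
    return ranks

ranks = rank_within_groups(group_ids, scores)
print(ranks)  # [3, 1, 2, 2, 1]
```

With the notebook's variables, the inputs would be `test_group_id.tolist()` and `preds.tolist()`.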


In [16]:
import pandas as pd

fi_values = model.get_feature_importance(
    data=train_pool,               # Pass training or validation Pool
    type='LossFunctionChange'
)

feature_names = train_pool.get_feature_names()  # or list(X_train.columns)
fi_df = pd.DataFrame({
    'feature': feature_names,
    'importance': fi_values
}).sort_values(by='importance', ascending=False)

print(fi_df.head(10))

    feature  importance
133     135    0.005506
128     130    0.002533
107     109    0.002179
13       15    0.002147
129     131    0.001591
29       31    0.001263
132     134    0.001025
134     136    0.000769
131     133    0.000755
127     129    0.000715
