# Candidate ranking model tutorial

`CandidateRankingModel` from RecTools is a fully functional two-stage recommendation pipeline. 

In the first stage, simple models generate candidates from their usual recommendations. In the second stage, a "reranker" (usually a gradient boosting decision trees model) learns to rank these candidates to predict users' actual interactions.

Main features of our implementation:
- Ranks and scores from first-stage models can be added as features for the second-stage reranker.
- Explicit features for user-item candidate pairs can be added using `CandidateFeatureCollector`.
- Custom negative samplers for creating second-stage training data can be used.
- Custom splitters for creating second-stage train targets can be used.
- CatBoost models as second-stage reranking models are supported out of the box.

**You can treat `CandidateRankingModel` as any other RecTools model and easily pass it to cross-validation. All of the complicated logic for fitting first-stage and second-stage models and recommending through the whole pipeline will happen under the hood.**

**Table of Contents**

* Load data: kion
* Initialization of CandidateRankingModel
* What if we want to easily add user/item features to candidates?
    * From external source
* Using boostings from well-known libraries as a ranking model
    * CandidateRankingModel with gradient boosting from sklearn
        * Features of constructing model
    * CandidateRankingModel with gradient boosting from catboost
        * Features of constructing model
        * Using CatBoostClassifier
        * Using CatBoostRanker
    * CandidateRankingModel with gradient boosting from lightgbm
        * Features of constructing model
        * Using LGBMClassifier
        * Using LGBMRanker
            * An example of creating a custom class for reranker
* CrossValidate
    * Evaluating the metrics of candidate ranking models and candidate generator models

In [1]:
from pathlib import Path
import typing as tp
import warnings

import pandas as pd
import numpy as np

from implicit.nearest_neighbours import CosineRecommender
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import RidgeClassifier
from catboost import CatBoostClassifier, CatBoostRanker
try:
    from lightgbm import LGBMClassifier, LGBMRanker
    LGBM_AVAILABLE = True
except ImportError:
    warnings.warn("lightgbm is not installed. Some parts of the notebook will be skipped.")
    LGBM_AVAILABLE = False
    
    
from rectools import Columns
from rectools.dataset import Dataset
from rectools.metrics import Precision, Recall, MeanInvUserFreq, Serendipity
from rectools.models import PopularModel, ImplicitItemKNNWrapperModel
from rectools.models.base import ExternalIds
from rectools.models.ranking import (
    CandidateRankingModel,
    CandidateGenerator,
    Reranker,
    CatBoostReranker, 
    CandidateFeatureCollector,
    PerUserNegativeSampler
)
from rectools.model_selection import cross_validate, TimeRangeSplitter

## Load data: kion

In [2]:
%%time
!wget -q https://github.com/irsafilo/KION_DATASET/raw/f69775be31fa5779907cf0a92ddedb70037fb5ae/data_original.zip -O data_original.zip
!unzip -o data_original.zip
!rm data_original.zip

Archive:  data_original.zip
  inflating: data_original/interactions.csv  
  inflating: __MACOSX/data_original/._interactions.csv  
  inflating: data_original/users.csv  
  inflating: __MACOSX/data_original/._users.csv  
  inflating: data_original/items.csv  
  inflating: __MACOSX/data_original/._items.csv  
CPU times: user 9.05 ms, sys: 159 ms, total: 168 ms
Wall time: 3.33 s


In [3]:
# Prepare dataset

DATA_PATH = Path("data_original")
users = pd.read_csv(DATA_PATH / 'users.csv')
items = pd.read_csv(DATA_PATH / 'items.csv')
interactions = (
    pd.read_csv(DATA_PATH / 'interactions.csv', parse_dates=["last_watch_dt"])
    .rename(columns={"last_watch_dt": Columns.Datetime})
)
interactions["weight"] = 1

In [4]:
dataset = Dataset.construct(interactions)

In [5]:
RANDOM_STATE = 32

## Initialization of `CandidateRankingModel`

In [6]:
# Prepare first stage models. They will be used to generate candidates for reranking
first_stage = [
    CandidateGenerator(PopularModel(), num_candidates=30, keep_ranks=True, keep_scores=True), 
    CandidateGenerator(
        ImplicitItemKNNWrapperModel(CosineRecommender()), 
        num_candidates=30, 
        keep_ranks=True, 
        keep_scores=True
    )
]

In [7]:
# Prepare reranker. This model is used to rerank candidates from first-stage models.
# It is usually trained on a classification or ranking task

reranker = CatBoostReranker(model=CatBoostClassifier(n_estimators=100, verbose=False, random_state=RANDOM_STATE))

In [8]:
# Prepare splitter for selecting the reranker's training data. Only one fold is expected!
# The data of this fold will be used to define targets for training

splitter = TimeRangeSplitter("7D", n_splits=1)
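
To make the single-fold idea concrete, here is a minimal sketch of a "last 7 days" holdout in plain Python. It is not the `TimeRangeSplitter` implementation: the `split_last_days` helper and the toy rows are hypothetical, and the exact boundary handling in RecTools may differ.

```python
from datetime import datetime, timedelta


def split_last_days(interactions, days=7):
    """Split (user, item, datetime) rows into history and holdout.

    Hypothetical helper: the holdout is the final `days`-long window
    ending at the latest timestamp; everything earlier is history.
    """
    last_ts = max(ts for _, _, ts in interactions)
    boundary = last_ts - timedelta(days=days)
    history = [row for row in interactions if row[2] <= boundary]
    holdout = [row for row in interactions if row[2] > boundary]
    return history, holdout


# Toy data, not taken from the KION dataset
toy_interactions = [
    (1, 10, datetime(2021, 8, 1)),
    (1, 11, datetime(2021, 8, 10)),
    (2, 10, datetime(2021, 8, 14)),
    (2, 12, datetime(2021, 8, 20)),
]
history, holdout = split_last_days(toy_interactions, days=7)
```

The history part is what first-stage models would be fitted on, and the holdout part is what defines the reranker's targets.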

In [9]:
# Initialize CandidateRankingModel
# We can also pass negative sampler but here we are just using the default one

two_stage = CandidateRankingModel(first_stage, splitter, reranker)

## What data is reranker trained on? 

We can explicitly call `get_train_with_targets_for_reranker` method to look at the actual "train" for reranker.

Here's what happens under the hood during this call:
- Dataset interactions are split by the provided splitter (usually on a time basis) into a history dataset and holdout interactions.
- First-stage models are fitted on the history dataset.
- First-stage models generate recommendations, and these user-item pairs become candidates for the reranker.
- All candidate pairs are assigned targets from the holdout interactions (`1` if the interaction actually happened, `0` otherwise).
- Negative targets are sampled (here the default `PerUserNegativeSampler` is used, which keeps a fixed number of negative samples per user).
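
The merging and labelling steps above can be sketched in plain Python. This is only an illustration under simplifying assumptions, not the RecTools implementation: the `build_reranker_train` helper, the column names and the toy values are all hypothetical.

```python
def build_reranker_train(candidates_by_model, holdout_pairs):
    """Merge per-model candidates and label them with holdout targets.

    candidates_by_model: {model_name: {(user, item): (score, rank)}}
    holdout_pairs: set of (user, item) pairs observed in the holdout.
    A pair that a model did not recommend keeps None for that model's features.
    """
    all_pairs = sorted({pair for cands in candidates_by_model.values() for pair in cands})
    rows = []
    for user, item in all_pairs:
        row = {"user_id": user, "item_id": item}
        for name, cands in candidates_by_model.items():
            score, rank = cands.get((user, item), (None, None))
            row[f"{name}_score"] = score
            row[f"{name}_rank"] = rank
        # Target is 1 only if the pair really occurred in the holdout
        row["target"] = int((user, item) in holdout_pairs)
        rows.append(row)
    return rows


toy_candidates = {
    "popular": {(1, 10): (500.0, 1), (1, 11): (300.0, 2)},
    "knn": {(1, 11): (0.9, 1), (1, 12): (0.4, 2)},
}
toy_holdout = {(1, 11)}
train_rows = build_reranker_train(toy_candidates, toy_holdout)
```

Pairs proposed by only one generator end up with missing values in the other generator's columns, which is why empty score and rank cells appear in the reranker's training data.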



In [10]:
%%time
candidates = two_stage.get_train_with_targets_for_reranker(dataset)

CPU times: user 54.6 s, sys: 3.66 s, total: 58.2 s
Wall time: 55 s


In [11]:
# This is train data for boosting model or any other reranker. id columns will be dropped before training
# Here we see ranks and scores from first-stage models as features for reranker
candidates.head(10)

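| | user_id | item_id | PopularModel_1_score | PopularModel_1_rank | ImplicitItemKNNWrapperModel_1_score | ImplicitItemKNNWrapperModel_1_rank | target |
|---|---|---|---|---|---|---|---|
| 0 | 838806 | 13018 | 13372.0 | 27.0 | NaN | NaN | 0 |
| 1 | 683784 | 14703 | 16864.0 | 19.0 | 0.989779 | 24.0 | 0 |
| 2 | 13987 | 4436 | 16846.0 | 23.0 | NaN | NaN | 0 |
| 3 | 659591 | 6809 | 39498.0 | 10.0 | NaN | NaN | 0 |
| 4 | 373676 | 11749 | NaN | NaN | 0.178417 | 16.0 | 0 |
| 5 | 641778 | 4495 | 19571.0 | 17.0 | NaN | NaN | 0 |
| 6 | 1034884 | 11778 | NaN | NaN | 0.164348 | 28.0 | 0 |
| 7 | 1055325 | 13545 | NaN | NaN | 0.187268 | 22.0 | 0 |
| 8 | 606185 | 7102 | 17110.0 | 22.0 | NaN | NaN | 0 |
| 9 | 305940 | 14942 | NaN | NaN | 1.257735 | 6.0 | 0 |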


## What if we want to easily add user/item features to candidates?

You can add any user, item or user-item-pair features to candidates. They can be added from dataset or from external sources and they also can be time-dependent (e.g. item popularity).

To let `CandidateRankingModel` join these features to the reranker's training data, you need to create a custom feature collector. Inherit it from `CandidateFeatureCollector`, which is used by default.

You can override the following methods:
- `_get_user_features`
- `_get_item_features`
- `_get_user_item_features`

Each of the methods receives:
- `dataset` with all interactions that are available to the model at this particular moment (no leak from the future). You can use it to collect user or item stats as of the current moment.
- `fold_info` with fold stats, in case you need to know the date that the model considers current. You can join time-dependent features from an external source that are valid on this particular date.
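
As an illustration of a time-dependent feature, the sketch below computes item popularity using only interactions before a given "current" date. It is plain Python rather than a real feature collector method: the `item_popularity_as_of` helper and the toy rows are hypothetical, and the date is passed in directly because the keys of `fold_info` are not described here.

```python
from collections import Counter
from datetime import datetime


def item_popularity_as_of(interactions, current_date):
    """Count interactions per item using only rows strictly before `current_date`."""
    return Counter(item for _, item, ts in interactions if ts < current_date)


# Toy (user, item, datetime) rows
toy_interactions = [
    (1, 10, datetime(2021, 8, 1)),
    (2, 10, datetime(2021, 8, 5)),
    (3, 11, datetime(2021, 8, 9)),
    (4, 10, datetime(2021, 8, 15)),  # after the cutoff, so it is ignored
]
popularity = item_popularity_as_of(toy_interactions, datetime(2021, 8, 10))
```

Restricting the count to past interactions is what keeps such a feature free of leaks from the future.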

In the example below we will simply collect the users' age, sex and income features from an external CSV file:

In [None]:
# Write custom feature collecting funcs for users, items and user/item pairs
class CustomFeatureCollector(CandidateFeatureCollector):
    
    def __init__(self, user_features_path: Path, user_cat_cols: tp.List[str]) -> None:        
        self.user_features_path = user_features_path
        self.user_cat_cols = user_cat_cols
    
    # Any helper functions of your own for working with the loaded data
    def _encode_cat_cols(self, df: pd.DataFrame, cols: tp.List[str]) -> pd.DataFrame:    
        for col in cols:
            df[col] = df[col].astype("category").cat.codes.astype("category")
        return df
    
    def _get_user_features(
        self, users: ExternalIds, dataset: Dataset, fold_info: tp.Optional[tp.Dict[str, tp.Any]]
    ) -> pd.DataFrame:
        columns = self.user_cat_cols.copy()
        columns.append(Columns.User)
        user_features = pd.read_csv(self.user_features_path)[columns]        
        
        users_without_features = pd.DataFrame(
            np.setdiff1d(dataset.user_id_map.external_ids, user_features[Columns.User].unique()),
            columns=[Columns.User]
        )        
        user_features = pd.concat([user_features, users_without_features], axis=0)
        user_features = self._encode_cat_cols(user_features, self.user_cat_cols)
        
        return user_features[user_features[Columns.User].isin(users)]
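
The encoding done by `_encode_cat_cols` can be pictured with a small plain-Python sketch. It mirrors the idea behind pandas category codes (sorted categories get consecutive integers, missing values get `-1`) but is not the pandas implementation; the `encode_categories` helper and the toy age labels are hypothetical.

```python
def encode_categories(values):
    """Map each distinct non-missing value to an integer code.

    Categories are sorted before numbering, and missing values
    (None here) receive the code -1.
    """
    categories = sorted({v for v in values if v is not None})
    code_of = {cat: code for code, cat in enumerate(categories)}
    return [code_of.get(v, -1) for v in values]


# Toy column: the second user has no known age
ages = ["age_25_34", None, "age_18_24", "age_25_34"]
codes = encode_categories(ages)
```

Users that were appended without features therefore still get a valid category value instead of an empty cell.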

In [13]:
# Now we specify our custom feature collector for CandidateRankingModel

two_stage = CandidateRankingModel(
    candidate_generators=first_stage,
    splitter=splitter,
    reranker=Reranker(RidgeClassifier()),
    feature_collector=CustomFeatureCollector(
        user_features_path=DATA_PATH / "users.csv", 
        user_cat_cols=["age", "income", "sex"],
    )
)

In [14]:
%%time
candidates = two_stage.get_train_with_targets_for_reranker(dataset)

CPU times: user 54.9 s, sys: 3.41 s, total: 58.3 s
Wall time: 55 s


In [15]:
# Now our candidates also have features for users: age, sex and income
candidates.head(10)

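| | user_id | item_id | PopularModel_1_score | PopularModel_1_rank | ImplicitItemKNNWrapperModel_1_score | ImplicitItemKNNWrapperModel_1_rank | target | age | income | sex |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 115859 | 13865 | 115095.0 | 4.0 | NaN | NaN | 0 | 1 | 2 | 0 |
| 1 | 288932 | 13865 | 115095.0 | 3.0 | 0.287356 | 1.0 | 1 | 1 | 2 | 1 |
| 2 | 880313 | 16228 | 16213.0 | 30.0 | NaN | NaN | 0 | 2 | 3 | 1 |
| 3 | 467787 | 13159 | NaN | NaN | 0.23262 | 28.0 | 1 | 1 | 2 | 1 |
| 4 | 448548 | 4457 | 20811.0 | 16.0 | 0.144684 | 21.0 | 0 | 1 | 0 | 0 |
| 5 | 232813 | 2657 | 66415.0 | 5.0 | 0.422717 | 11.0 | 0 | 2 | 2 | 1 |
| 6 | 1061114 | 10077 | NaN | NaN | 0.241296 | 24.0 | 0 | 3 | 2 | 0 |
| 7 | 755593 | 6443 | NaN | NaN | 0.130398 | 28.0 | 0 | 1 | 2 | 0 |
| 8 | 194860 | 7829 | 18080.0 | 23.0 | NaN | NaN | 0 | 0 | 2 | 1 |
| 9 | 37313 | 7626 | 13131.0 | 30.0 | NaN | NaN | 0 | 2 | 3 | 1 |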


## Using boostings from well-known libraries as a ranking model

In this section we use end-to-end pipelines for generating recommendations through the standard `fit` and `recommend` methods that are common to every model in RecTools.

In [16]:
# Let's select a few users to recommend for

all_users = dataset.user_id_map.external_ids
users_to_recommend = all_users[:100]

### CandidateRankingModel with gradient boosting from sklearn

**Features of constructing model:**
   - `GradientBoostingClassifier` works correctly with Reranker
   - But it cannot work with missing values, so specify the `scores_fillna_value` and `ranks_fillna_value` parameters when initializing `CandidateGenerator`.
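
What filling does to a candidate row can be sketched in plain Python. This is an illustration only, not how `CandidateGenerator` implements it: the `fill_missing` helper and the toy rows are hypothetical, while the fill values `1.01` and `31` (one past `num_candidates=30`) are the ones used in this tutorial.

```python
NUM_CANDIDATES = 30


def fill_missing(rows, score_key, rank_key, scores_fill, ranks_fill):
    """Return copies of rows with None scores and ranks replaced by constants."""
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        if new_row[score_key] is None:
            new_row[score_key] = scores_fill
        if new_row[rank_key] is None:
            new_row[rank_key] = ranks_fill
        filled.append(new_row)
    return filled


toy_rows = [
    {"knn_score": 0.99, "knn_rank": 24},
    {"knn_score": None, "knn_rank": None},  # pair not recommended by this generator
]
filled_rows = fill_missing(
    toy_rows, "knn_score", "knn_rank", scores_fill=1.01, ranks_fill=NUM_CANDIDATES + 1
)
```

A rank just beyond `num_candidates` tells the reranker that the pair fell outside this generator's candidate list.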

In [17]:
# Prepare first stage models
first_stage_gbc = [
    CandidateGenerator(
        model=PopularModel(),
        num_candidates=30,
        keep_ranks=True,
        keep_scores=True,
        scores_fillna_value=1.01, # when working with the GradientBoostingClassifier, you need to fill in the empty scores (e.g. max score)
        ranks_fillna_value=31  # when working with the GradientBoostingClassifier, you need to fill in the empty ranks (e.g. min rank)
    ), 
    CandidateGenerator(
        model=ImplicitItemKNNWrapperModel(CosineRecommender()),
        num_candidates=30,
        keep_ranks=True,
        keep_scores=True,
        scores_fillna_value=1.01, # when working with the GradientBoostingClassifier, you need to fill in the empty scores (e.g. max score)
        ranks_fillna_value=31  # when working with the GradientBoostingClassifier, you need to fill in the empty ranks (e.g. min rank)
    )
]

In [18]:
two_stage_gbc = CandidateRankingModel(
    candidate_generators=first_stage_gbc,
    splitter=splitter,
    reranker=Reranker(GradientBoostingClassifier(random_state=RANDOM_STATE)),
    sampler=PerUserNegativeSampler(n_negatives=3, random_state=RANDOM_STATE) # pass sampler to fix random_state
)
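
The idea behind per-user negative sampling with a fixed seed can be sketched as follows. This is a plain-Python illustration under simplifying assumptions, not the `PerUserNegativeSampler` implementation; the `sample_negatives_per_user` helper and the toy rows are hypothetical, and details such as how users without positives are handled may differ in RecTools.

```python
import random
from collections import defaultdict


def sample_negatives_per_user(rows, n_negatives, seed):
    """Keep all positives and at most `n_negatives` random negatives per user."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    negatives_by_user = defaultdict(list)
    kept = []
    for row in rows:
        if row["target"] == 1:
            kept.append(row)
        else:
            negatives_by_user[row["user_id"]].append(row)
    for user_negatives in negatives_by_user.values():
        k = min(n_negatives, len(user_negatives))
        kept.extend(rng.sample(user_negatives, k))
    return kept


# User 1 has one positive and nine negatives, user 2 has only two negatives
toy_rows = [{"user_id": 1, "item_id": i, "target": int(i == 0)} for i in range(10)]
toy_rows += [{"user_id": 2, "item_id": i, "target": 0} for i in range(2)]
sampled = sample_negatives_per_user(toy_rows, n_negatives=3, seed=32)
```

Passing the same seed twice yields the same sample, which is why the sampler is given `random_state` explicitly in the cell above.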

In [19]:
%%time
two_stage_gbc.fit(dataset)

CPU times: user 1min 35s, sys: 2.91 s, total: 1min 38s
Wall time: 1min 31s


<rectools.models.ranking.candidate_ranking.CandidateRankingModel at 0x7d2dcfb8b5c0>

In [20]:
%%time
reco_gbc = two_stage_gbc.recommend(
    users=users_to_recommend, 
    dataset=dataset,
    k=10,
    filter_viewed=True
)

CPU times: user 816 ms, sys: 23.9 ms, total: 840 ms
Wall time: 839 ms


In [21]:
reco_gbc.head(5)

| | user_id | item_id | score | rank |
|---|---|---|---|---|
| 0 | 5324 | 3734 | 0.881224 | 1 |
| 1 | 5324 | 2657 | 0.73635 | 2 |
| 2 | 5324 | 4151 | 0.70363 | 3 |
| 3 | 5324 | 7626 | 0.547508 | 4 |
| 4 | 5324 | 9728 | 0.52663 | 5 |


### CandidateRankingModel with gradient boosting from catboost

**Features of constructing model:**
- For `CatBoostClassifier` and `CatBoostRanker`, categorical features must be preprocessed: if the reranker's training sample contains categorical features, fill in their empty values. You can do this in `CustomFeatureCollector`.

**Using CatBoostClassifier**
- `CatBoostClassifier` works perfectly with CatBoostReranker, and we don't need to fill nulls here

In [22]:
# Prepare first stage models
first_stage_catboost = [
    CandidateGenerator(
        model=PopularModel(),
        num_candidates=30,
        keep_ranks=True,
        keep_scores=True,
    ), 
    CandidateGenerator(
        model=ImplicitItemKNNWrapperModel(CosineRecommender()),
        num_candidates=30,
        keep_ranks=True,
        keep_scores=True,
    )
]

In [23]:
user_features_path = DATA_PATH / "users.csv"
user_cat_cols = ["age", "income", "sex"]

# Categorical features must be passed to CatBoost through pool_kwargs
pool_kwargs = {
    "cat_features": user_cat_cols    
}

In [24]:
# To use CatBoostClassifier we wrap it in CatBoostReranker (for faster work with large amounts of data)
# You can also pass fit_kwargs and pool_kwargs to CatBoostReranker

two_stage_catboost_classifier = CandidateRankingModel(
    candidate_generators=first_stage_catboost,
    splitter=splitter,
    reranker=CatBoostReranker(CatBoostClassifier(verbose=False, random_state=RANDOM_STATE), pool_kwargs=pool_kwargs),
    sampler=PerUserNegativeSampler(n_negatives=3, random_state=RANDOM_STATE), # pass sampler to fix random_state
    feature_collector=CustomFeatureCollector(user_features_path=user_features_path, user_cat_cols=user_cat_cols),
)

In [25]:
%%time
two_stage_catboost_classifier.fit(dataset)

CPU times: user 1h 1min 46s, sys: 34min 42s, total: 1h 36min 28s
Wall time: 2min 16s


<rectools.models.ranking.candidate_ranking.CandidateRankingModel at 0x7d2dbdc308f0>

In [26]:
reco_catboost_classifier = two_stage_catboost_classifier.recommend(
    users=users_to_recommend, 
    dataset=dataset,
    k=10,
    filter_viewed=True
)

In [27]:
reco_catboost_classifier.head(5)

| | user_id | item_id | score | rank |
|---|---|---|---|---|
| 0 | 5324 | 3734 | 0.966344 | 1 |
| 1 | 5324 | 1844 | 0.850233 | 2 |
| 2 | 5324 | 142 | 0.834881 | 3 |
| 3 | 5324 | 4151 | 0.768359 | 4 |
| 4 | 5324 | 2657 | 0.658084 | 5 |


**Using CatBoostRanker**
- Instead of `CatBoostClassifier` you can also easily use `CatBoostRanker` without any additional modifications

In [28]:
# To use CatBoostRanker we also wrap it in CatBoostReranker

two_stage_catboost_ranker = CandidateRankingModel(
    candidate_generators=first_stage_catboost,
    splitter=splitter,
    reranker=CatBoostReranker(CatBoostRanker(verbose=False, random_state=RANDOM_STATE), pool_kwargs=pool_kwargs),
    sampler=PerUserNegativeSampler(n_negatives=3, random_state=RANDOM_STATE), # pass sampler to fix random_state
    feature_collector=CustomFeatureCollector(user_features_path=user_features_path, user_cat_cols=user_cat_cols),                
)

In [29]:
%%time
two_stage_catboost_ranker.fit(dataset)

CPU times: user 1h 3min 1s, sys: 50min 51s, total: 1h 53min 52s
Wall time: 3min 37s


<rectools.models.ranking.candidate_ranking.CandidateRankingModel at 0x7d2dcfb8bd40>

In [30]:
%%time
reco_catboost_ranker = two_stage_catboost_ranker.recommend(
    users=users_to_recommend, 
    dataset=dataset,
    k=10,
    filter_viewed=True
)

CPU times: user 1.78 s, sys: 137 ms, total: 1.92 s
Wall time: 1.68 s


In [31]:
reco_catboost_ranker.head(5)

| | user_id | item_id | score | rank |
|---|---|---|---|---|
| 0 | 5324 | 3734 | 4.464815 | 1 |
| 1 | 5324 | 2657 | 3.655252 | 2 |
| 2 | 5324 | 4151 | 3.50404 | 3 |
| 3 | 5324 | 9728 | 2.463656 | 4 |
| 4 | 5324 | 142 | 2.382605 | 5 |


### CandidateRankingModel with gradient boosting from lightgbm
**Features of constructing model:**
- `LGBMClassifier` and `LGBMRanker` cannot work with missing values

**Using LGBMClassifier**
- `LGBMClassifier` works correctly with Reranker

In [32]:
if LGBM_AVAILABLE:
    # Prepare first stage models
    first_stage_lgbm = [
        CandidateGenerator(
            model=PopularModel(),
            num_candidates=30,
            keep_ranks=True,
            keep_scores=True,
            scores_fillna_value=1.01, # when working with the LGBMClassifier, you need to fill in the empty scores (e.g. max score)
            ranks_fillna_value=31  # when working with the LGBMClassifier, you need to fill in the empty ranks (e.g. min rank)
        ), 
        CandidateGenerator(
            model=ImplicitItemKNNWrapperModel(CosineRecommender()),
            num_candidates=30,
            keep_ranks=True,
            keep_scores=True,
            scores_fillna_value=1,  # when working with the LGBMClassifier, you need to fill in the empty scores
            ranks_fillna_value=31   # when working with the LGBMClassifier, you need to fill in the empty ranks
        )
    ]

In [33]:
if LGBM_AVAILABLE:
    # example parameters for running model training 
    # more valid parameters here https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier.fit
    fit_params = {
        "categorical_feature": user_cat_cols,
    }

In [34]:
if LGBM_AVAILABLE:
    two_stage_lgbm_classifier = CandidateRankingModel(
        candidate_generators=first_stage_lgbm,
        splitter=splitter,
        reranker=Reranker(LGBMClassifier(random_state=RANDOM_STATE), fit_params),
        sampler=PerUserNegativeSampler(n_negatives=3, random_state=RANDOM_STATE),  # pass sampler to fix random_state
        feature_collector=CustomFeatureCollector(user_features_path=user_features_path, user_cat_cols=user_cat_cols)
    )

In [35]:
%%time
if LGBM_AVAILABLE:
    two_stage_lgbm_classifier.fit(dataset)

[LightGBM] [Info] Number of positive: 78233, number of negative: 330228
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.011737 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 392
[LightGBM] [Info] Number of data points in the train set: 408461, number of used features: 7
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.191531 -> initscore=-1.440092
[LightGBM] [Info] Start training from score -1.440092
CPU times: user 1min 46s, sys: 2.77 s, total: 1min 49s
Wall time: 1min 18s


In [37]:
if LGBM_AVAILABLE:
    reco_lgbm_classifier = two_stage_lgbm_classifier.recommend(
        users=users_to_recommend, 
        dataset=dataset,
        k=10,
        filter_viewed=True
    )

In [38]:
if LGBM_AVAILABLE:
    reco_lgbm_classifier.head(5)

**Using LGBMRanker**
- `LGBMRanker` does not work correctly with the plain `Reranker`!

When using `LGBMRanker`, you need to compose the groups correctly. To do this, create a class that inherits from `Reranker` and override its `prepare_fit_kwargs` method.

Documentation on how to form groups for LGBMRanker (read about `group`):
https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRanker.html#lightgbm.LGBMRanker.fit
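
As a quick illustration of what the `group` argument contains, the sketch below computes group sizes from a sorted list of user ids in plain Python. The `group_sizes` helper and the toy ids are hypothetical; the point is that rows of one user must be contiguous and the sizes must sum to the number of rows.

```python
from itertools import groupby


def group_sizes(user_ids):
    """Return the size of each run of equal user ids.

    `user_ids` must already be sorted so that each user's rows are contiguous;
    the sizes then follow the row order and sum to len(user_ids).
    """
    return [len(list(run)) for _, run in groupby(user_ids)]


sorted_user_ids = [5, 5, 5, 7, 9, 9]
groups = group_sizes(sorted_user_ids)  # [3, 1, 2]
```

If the rows were not sorted by user first, the same user could be split into several groups and the ranker would treat them as separate queries.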

**An example of creating a custom class for reranker**

In [39]:
if LGBM_AVAILABLE:
    class LGBMReranker(Reranker):
        def __init__(
            self,
            model: LGBMRanker,
            fit_kwargs: tp.Optional[tp.Dict[str, tp.Any]] = None,
        ):
            super().__init__(model)
            self.fit_kwargs = fit_kwargs
            
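        def _get_group(self, df: pd.DataFrame) -> np.ndarray:
            # Number of candidate rows per user, ordered by user id (groupby sorts its keys)
            return df.groupby(by=[Columns.User])[Columns.Item].count().values

        def prepare_fit_kwargs(self, candidates_with_target: pd.DataFrame) -> tp.Dict[str, tp.Any]:
            # Each user's rows must be contiguous and in the same order as the group sizes
            candidates_with_target = candidates_with_target.sort_values(by=[Columns.User])
            groups = self._get_group(candidates_with_target)
            # Id columns are not features, so drop them before training
            candidates_with_target = candidates_with_target.drop(columns=Columns.UserItem)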

            fit_kwargs = {
                "X": candidates_with_target.drop(columns=Columns.Target),
                "y": candidates_with_target[Columns.Target],
                "group": groups,
            }

            if self.fit_kwargs is not None:
                fit_kwargs.update(self.fit_kwargs)

            return fit_kwargs

In [40]:
if LGBM_AVAILABLE:
    # example parameters for running model training 
    # more valid parameters here
    # https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRanker.html#lightgbm.LGBMRanker.fit
    fit_params = {
        "categorical_feature": user_cat_cols,
    }

In [41]:
if LGBM_AVAILABLE:
    # Now we specify our custom reranker and feature collector for CandidateRankingModel

    two_stage_lgbm_ranker = CandidateRankingModel(
        candidate_generators=first_stage_lgbm,
        splitter=splitter,
        reranker=LGBMReranker(LGBMRanker(random_state=RANDOM_STATE), fit_kwargs=fit_params),
        sampler=PerUserNegativeSampler(n_negatives=3, random_state=RANDOM_STATE),  # pass sampler to fix random_state
        feature_collector=CustomFeatureCollector(user_features_path=user_features_path, user_cat_cols=user_cat_cols)
    )

In [42]:
%%time
if LGBM_AVAILABLE:
    two_stage_lgbm_ranker.fit(dataset)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003743 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 394
[LightGBM] [Info] Number of data points in the train set: 408461, number of used features: 7
CPU times: user 1min 52s, sys: 2.62 s, total: 1min 54s
Wall time: 1min 22s


In [43]:
if LGBM_AVAILABLE:
    reco_lgbm_ranker = two_stage_lgbm_ranker.recommend(
        users=users_to_recommend, 
        dataset=dataset,
        k=10,
        filter_viewed=True
    )

In [44]:
if LGBM_AVAILABLE:
    reco_lgbm_ranker.head(5)

## CrossValidate
### Evaluating the metrics of candidate ranking models and candidate generator models

In [45]:
# Take a few models to compare
models = {
    "popular": PopularModel(),
    "cosine_knn": ImplicitItemKNNWrapperModel(CosineRecommender()),
    "two_stage_gbc": two_stage_gbc,
    "two_stage_catboost_classifier": two_stage_catboost_classifier,
    "two_stage_catboost_ranker": two_stage_catboost_ranker,
}
if LGBM_AVAILABLE:
    models["two_stage_lgbm_classifier"] = two_stage_lgbm_classifier
    models["two_stage_lgbm_ranker"] = two_stage_lgbm_ranker

# We will calculate several classic (precision@k and recall@k) and "beyond accuracy" metrics
metrics = {
    "prec@1": Precision(k=1),
    "prec@10": Precision(k=10),
    "recall@10": Recall(k=10),
    "novelty@10": MeanInvUserFreq(k=10),
    "serendipity@10": Serendipity(k=10),
}

K_RECS = 10
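
For intuition about the two classic metrics, here is a minimal per-user sketch of precision@k and recall@k using their standard definitions. It is not the RecTools implementation, which aggregates over users and may treat edge cases differently; the `precision_recall_at_k` helper and the set of relevant items are hypothetical toy values.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Compute precision@k and recall@k for a single user.

    recommended: ranked list of item ids (best first)
    relevant: set of item ids the user actually interacted with
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


recommended_items = [3734, 2657, 4151, 7626, 9728]
relevant_items = {2657, 9728, 1111, 2222}  # toy ground truth
precision, recall = precision_recall_at_k(recommended_items, relevant_items, k=5)
```

Here two of the five recommended items are relevant, so precision@5 is 2/5 and recall@5 is 2/4.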

In [46]:
%%time

cv_results = cross_validate(
    dataset=dataset,
    splitter=splitter,
    models=models,
    metrics=metrics,
    k=K_RECS,
    filter_viewed=True,
)

[LightGBM] [Info] Number of positive: 73891, number of negative: 310533
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.005224 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 394
[LightGBM] [Info] Number of data points in the train set: 384424, number of used features: 7
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.192212 -> initscore=-1.435699
[LightGBM] [Info] Start training from score -1.435699
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004715 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 395
[LightGBM] [Info] Number of data points in the train set: 384424, number of used features: 7
CPU times: user 2h 10min 10s, sys: 1h 21min 46s, total: 3h 31min 56s
Wall time: 15min 32s


In [47]:
pivot_results = (
    pd.DataFrame(cv_results["metrics"])
    .drop(columns="i_split")
    .groupby(["model"], sort=False)
    .agg(["mean"])
)
pivot_results

| model | prec@1 (mean) | prec@10 (mean) | recall@10 (mean) | novelty@10 (mean) | serendipity@10 (mean) |
|---|---|---|---|---|---|
| popular | 0.070806 | 0.032655 | 0.166089 | 3.715659 | 2e-06 |
| cosine_knn | 0.079372 | 0.036757 | 0.176609 | 5.75866 | 0.000189 |
| two_stage_gbc | 0.045986 | 0.038248 | 0.188901 | 4.850412 | 0.000149 |
| two_stage_catboost_classifier | 0.026009 | 0.031835 | 0.156352 | 4.73483 | 0.000111 |
| two_stage_catboost_ranker | 0.043061 | 0.035819 | 0.176745 | 4.669844 | 0.000124 |
| two_stage_lgbm_classifier | 0.036375 | 0.033809 | 0.16659 | 4.735537 | 0.000121 |
| two_stage_lgbm_ranker | 0.038473 | 0.035208 | 0.173689 | 4.625044 | 0.000115 |
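
The gaps between models are easier to compare as relative changes against a baseline. The sketch below does this arithmetic for recall@10 with `cosine_knn` as the baseline. The choice of baseline and the `relative_change` helper are illustrative assumptions; the numbers are copied from the table above.

```python
# recall@10 values copied from the cross-validation results
recall_at_10 = {
    "popular": 0.166089,
    "cosine_knn": 0.176609,
    "two_stage_gbc": 0.188901,
    "two_stage_catboost_classifier": 0.156352,
    "two_stage_catboost_ranker": 0.176745,
    "two_stage_lgbm_classifier": 0.16659,
    "two_stage_lgbm_ranker": 0.173689,
}


def relative_change(value, baseline):
    """Relative difference of `value` from `baseline` (0.05 means +5%)."""
    return (value - baseline) / baseline


baseline = recall_at_10["cosine_knn"]
lifts = {name: relative_change(value, baseline) for name, value in recall_at_10.items()}

for name, lift in lifts.items():
    print(f"{name}: {lift:+.2%}")
```

With these settings, `two_stage_gbc` is the only pipeline with a clearly positive change in recall@10 relative to `cosine_knn`, at roughly 7%.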
