In [4]:
import sys

sys.path.append("../")

In [5]:
%load_ext autoreload
%autoreload 2

In [6]:
from rumboost.rumboost import rum_train
from rumboost.datasets import load_preprocess_LPMC
from rumboost.metrics import cross_entropy

import lightgbm
import hyperopt
import numpy as np


# Example: Cross-nested logit model (correlation amongst alternatives)

This notebook shows features implemented in RUMBoost through an example on the LPMC dataset, a mode choice dataset in London developed by Hillel et al. (2018). You can find the original source of data [here](https://www.icevirtuallibrary.com/doi/suppl/10.1680/jsmic.17.00018) and the original paper [here](https://www.icevirtuallibrary.com/doi/full/10.1680/jsmic.17.00018).

We first load the preprocessed dataset and its folds for cross-validation. You can find the data in the Data folder.

In [7]:
#load dataset
LPMC_train, LPMC_test, folds = load_preprocess_LPMC(path="../Data/")

## Cross-Nested Logit model

We relax the assumption that the error term is distributed i.i.d. We assume that alternatives are correlated within several, possibly overlapping, nests to obtain a cross-nested logit-like model. Cross-nested logit probabilities are implemented in RUMBoost. The additional parameters, the scale $\mu$ of each nest and the degree of membership $\alpha$ of each alternative to each nest, are treated as hyperparameters.
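
For reference, one common way of writing cross-nested logit probabilities is given below. This is an assumed textbook formulation with the top-level scale normalised to 1; RUMBoost's internal parameterisation may differ in its details. With deterministic utilities $V_i$, membership degrees $\alpha_{im}$ and nest scales $\mu_m$:

$$
P(i) = \sum_{m} P(m)\, P(i \mid m)
$$

$$
P(i \mid m) = \frac{\alpha_{im}\, e^{\mu_m V_i}}{\sum_{j} \alpha_{jm}\, e^{\mu_m V_j}},
\qquad
P(m) = \frac{\left(\sum_{j} \alpha_{jm}\, e^{\mu_m V_j}\right)^{1/\mu_m}}{\sum_{n} \left(\sum_{j} \alpha_{jn}\, e^{\mu_n V_j}\right)^{1/\mu_n}}
$$

Under this formulation, if every $\mu_m = 1$ and the memberships of each alternative sum to one ($\sum_m \alpha_{im} = 1$), the expression reduces to the multinomial logit probability $e^{V_i} / \sum_j e^{V_j}$.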

Training a cross-nested logit-like rumboost model requires two additional arguments:

- ```alphas```: a 2d numpy array of shape (number of alternatives, number of nests), of the form ```np.array([[alpha_00, alpha_01, alpha_02], [alpha_10, alpha_11, alpha_12], [alpha_20, alpha_21, alpha_22], [alpha_30, alpha_31, alpha_32]])```, where ```alpha_ij``` is the degree of membership of alternative ```i``` to nest ```j```
- ```mu```: a numpy array containing the values (as float) of mu for each nest, e.g. ```[mu_nest_0, mu_nest_1, mu_nest_2]```

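We test here a cross-nested logit model with two nests, motorized and flexible. Judging from the ```alphas``` array used below, walking and cycling belong only to the flexible nest, public transport only to the motorized nest, and driving is split between both. As this is a work in progress, we first choose the values of ```mu``` and ```alphas``` arbitrarily. They are later chosen with hyperparameter tuning or optimised with ```scipy.optimize.minimize```.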
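
To make the role of ```mu``` and ```alphas``` concrete, here is a minimal NumPy sketch of cross-nested logit probabilities. It assumes a common textbook formulation in which $P(i) = \sum_m P(m) P(i \mid m)$, each nest term is $\alpha_{im} e^{\mu_m V_i}$ and the top-level scale is 1. The function name ```cnl_probabilities``` and the utility values are hypothetical, and RUMBoost's own implementation may be parameterised differently.

```python
import numpy as np


def cnl_probabilities(utilities, alphas, mu):
    """Cross-nested logit probabilities (illustrative formulation, not RUMBoost's code).

    utilities: array of shape (n_obs, n_alternatives)
    alphas:    array of shape (n_alternatives, n_nests), membership degrees
    mu:        array of shape (n_nests,), nest scales
    """
    utilities = np.asarray(utilities, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    mu = np.asarray(mu, dtype=float)

    # Subtract the row maximum: probabilities are unchanged, exponentials stay bounded.
    v = utilities - utilities.max(axis=1, keepdims=True)

    # terms[n, i, m] = alpha_im * exp(mu_m * V_ni)
    terms = alphas[None, :, :] * np.exp(mu[None, None, :] * v[:, :, None])
    nest_sums = terms.sum(axis=1)  # shape (n_obs, n_nests)

    # Probability of each alternative within each nest.
    p_alt_given_nest = terms / nest_sums[:, None, :]

    # Probability of each nest.
    nest_weights = nest_sums ** (1.0 / mu)
    p_nest = nest_weights / nest_weights.sum(axis=1, keepdims=True)

    # Sum over nests to get the choice probabilities.
    return (p_alt_given_nest * p_nest[:, None, :]).sum(axis=2)


# Two nests and four alternatives, with the last alternative split between both nests.
alphas = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
mu = np.array([1.25, 1.16])
utilities = np.array([[0.2, -0.1, 0.5, 0.3]])  # made-up utilities for one observation

probs = cnl_probabilities(utilities, alphas, mu)
print(probs, probs.sum(axis=1))
```

With all scales set to 1 this sketch returns the usual softmax probabilities, which is a quick way to sanity-check it.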

### General parameters

You can find an example of general parameters below. Unless stated otherwise, the parameters are the same as in LightGBM, since they are applied directly to the LightGBM Booster objects. You can find more information in the LightGBM [docs](https://lightgbm.readthedocs.io/en/stable/Parameters.html#). For a simple RUMBoost model, we recommend leaving most of the parameters at their default values, as RUMBoost is less sensitive to overfitting. **For a multiclass classification problem, you need to specify the num_classes parameter with the appropriate number of classes**.

In [8]:
# parameters
general_params = {
    "n_jobs": -1,
    "num_classes": 4,  # important
    "verbosity": 1,  # specific RUMBoost parameter
    "num_iterations": 3000,
    "early_stopping_round": 100,
}

### Random Utility Model structure


In [9]:
rum_structure = [
    {
        "utility": [0],
        "variables": [
            "age",
            "female",
            "day_of_week",
            "start_time_linear",
            "car_ownership",
            "driving_license",
            "purpose_B",
            "purpose_HBE",
            "purpose_HBO",
            "purpose_HBW",
            "purpose_NHBO",
            "fueltype_Average",
            "fueltype_Diesel",
            "fueltype_Hybrid",
            "fueltype_Petrol",
            "distance",
            "dur_walking",
        ],
        "boosting_params": {
            "monotone_constraints_method": "advanced",
            "max_depth": 1,
            "n_jobs": -1,
            "learning_rate": 0.1,
            "monotone_constraints": [
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                -1,
                -1,
            ],
            "interaction_constraints": [
                [0],
                [1],
                [2],
                [3],
                [4],
                [5],
                [6],
                [7],
                [8],
                [9],
                [10],
                [11],
                [12],
                [13],
                [14],
                [15],
                [16],
            ],
        },
        "shared": False,
    },
    {
        "utility": [1],
        "variables": [
            "age",
            "female",
            "day_of_week",
            "start_time_linear",
            "car_ownership",
            "driving_license",
            "purpose_B",
            "purpose_HBE",
            "purpose_HBO",
            "purpose_HBW",
            "purpose_NHBO",
            "fueltype_Average",
            "fueltype_Diesel",
            "fueltype_Hybrid",
            "fueltype_Petrol",
            "distance",
            "dur_cycling",
        ],
        "boosting_params": {
            "monotone_constraints_method": "advanced",
            "max_depth": 1,
            "n_jobs": -1,
            "learning_rate": 0.1,
            "monotone_constraints": [
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                -1,
                -1,
            ],
            "interaction_constraints": [
                [0],
                [1],
                [2],
                [3],
                [4],
                [5],
                [6],
                [7],
                [8],
                [9],
                [10],
                [11],
                [12],
                [13],
                [14],
                [15],
                [16],
            ],
        },
        "shared": False,
    },
    {
        "utility": [2],
        "variables": [
            "age",
            "female",
            "day_of_week",
            "start_time_linear",
            "car_ownership",
            "driving_license",
            "purpose_B",
            "purpose_HBE",
            "purpose_HBO",
            "purpose_HBW",
            "purpose_NHBO",
            "fueltype_Average",
            "fueltype_Diesel",
            "fueltype_Hybrid",
            "fueltype_Petrol",
            "distance",
            "dur_pt_access",
            "dur_pt_bus",
            "dur_pt_rail",
            "dur_pt_int_waiting",
            "dur_pt_int_walking",
            "pt_n_interchanges",
            "cost_transit",
        ],
        "boosting_params": {
            "monotone_constraints_method": "advanced",
            "max_depth": 1,
            "n_jobs": -1,
            "learning_rate": 0.1,
            "monotone_constraints": [
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                -1,
                -1,
                -1,
                -1,
                -1,
                -1,
                -1,
                -1,
            ],
            "interaction_constraints": [
                [0],
                [1],
                [2],
                [3],
                [4],
                [5],
                [6],
                [7],
                [8],
                [9],
                [10],
                [11],
                [12],
                [13],
                [14],
                [15],
                [16],
                [17],
                [18],
                [19],
                [20],
                [21],
                [22],
            ],
        },
        "shared": False,
    },
    {
        "utility": [3],
        "variables": [
            "age",
            "female",
            "day_of_week",
            "start_time_linear",
            "car_ownership",
            "driving_license",
            "purpose_B",
            "purpose_HBE",
            "purpose_HBO",
            "purpose_HBW",
            "purpose_NHBO",
            "fueltype_Average",
            "fueltype_Diesel",
            "fueltype_Hybrid",
            "fueltype_Petrol",
            "distance",
            "dur_driving",
            "cost_driving_fuel",
            "congestion_charge",
            "driving_traffic_percent",
        ],
        "boosting_params": {
            "monotone_constraints_method": "advanced",
            "max_depth": 1,
            "n_jobs": -1,
            "learning_rate": 0.1,
            "monotone_constraints": [
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                0,
                -1,
                -1,
                -1,
                -1,
                -1,
            ],
            "interaction_constraints": [
                [0],
                [1],
                [2],
                [3],
                [4],
                [5],
                [6],
                [7],
                [8],
                [9],
                [10],
                [11],
                [12],
                [13],
                [14],
                [15],
                [16],
                [17],
                [18],
                [19],
            ],
        },
        "shared": False,
    },
]
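
The four entries above follow the same pattern: one tree ensemble per utility, no feature interactions, no monotonicity constraint on the 15 socio-demographic variables, and a decreasing constraint (```-1```) on distance and on the alternative-specific time and cost variables. As a sketch only, the same dictionaries could be generated with a small helper. ```build_utility_spec``` is a hypothetical function written for this notebook, not part of RUMBoost.

```python
def build_utility_spec(utility_index, variables, n_decreasing, learning_rate=0.1):
    """Build one rum_structure entry (hypothetical helper).

    The last `n_decreasing` variables get a monotone decreasing constraint (-1);
    all other variables are unconstrained (0).
    """
    n_vars = len(variables)
    return {
        "utility": [utility_index],
        "variables": list(variables),
        "boosting_params": {
            "monotone_constraints_method": "advanced",
            "max_depth": 1,
            "n_jobs": -1,
            "learning_rate": learning_rate,
            "monotone_constraints": [0] * (n_vars - n_decreasing) + [-1] * n_decreasing,
            # one group per feature: no interactions between features
            "interaction_constraints": [[i] for i in range(n_vars)],
        },
        "shared": False,
    }


socio_demographics = [
    "age", "female", "day_of_week", "start_time_linear", "car_ownership",
    "driving_license", "purpose_B", "purpose_HBE", "purpose_HBO", "purpose_HBW",
    "purpose_NHBO", "fueltype_Average", "fueltype_Diesel", "fueltype_Hybrid",
    "fueltype_Petrol",
]

# Alternative-specific variables, all constrained to be decreasing.
alternative_specific = [
    ["distance", "dur_walking"],
    ["distance", "dur_cycling"],
    ["distance", "dur_pt_access", "dur_pt_bus", "dur_pt_rail", "dur_pt_int_waiting",
     "dur_pt_int_walking", "pt_n_interchanges", "cost_transit"],
    ["distance", "dur_driving", "cost_driving_fuel", "congestion_charge",
     "driving_traffic_percent"],
]

rum_structure_sketch = [
    build_utility_spec(j, socio_demographics + specific, n_decreasing=len(specific))
    for j, specific in enumerate(alternative_specific)
]
```

Generating the entries this way keeps the lengths of ```variables```, ```monotone_constraints``` and ```interaction_constraints``` in sync when a variable is added or removed.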

### $\mu$ and $\alpha$ hyperparameter search

We treat $\mu$ and the membership $\alpha$ of the last alternative as hyperparameters. We use hyperopt to find their optimal values. More details on how to use hyperopt [here](https://hyperopt.github.io/hyperopt/).

Note that, to keep the computation time low, we only run a single iteration of the hyperparameter search here.

In [10]:
mu = np.array([1.25, 1.16]) #random values

alphas  = np.array([[0., 1.],
                    [0., 1.],
                    [1., 0.],
                    [0.5, 0.5]])

cross_nested_structure = {
    "mu":mu,
    "alphas":alphas,
    "optimise_mu":False,
    "optimise_alphas":False,
}
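
Before training, it can be useful to check that ```mu``` and ```alphas``` are mutually consistent. The sketch below encodes the usual cross-nested logit conventions as assumptions: one row of ```alphas``` per alternative and one column per nest, memberships in [0, 1] that sum to one for each alternative, and nest scales of at least 1. ```check_cross_nested_structure``` is a hypothetical helper, and RUMBoost itself may enforce different or no checks.

```python
import numpy as np


def check_cross_nested_structure(structure, n_alternatives):
    """Validate mu and alphas under assumed CNL conventions (hypothetical helper)."""
    mu = np.asarray(structure["mu"], dtype=float)
    alphas = np.asarray(structure["alphas"], dtype=float)

    if alphas.shape != (n_alternatives, mu.size):
        raise ValueError(
            f"alphas has shape {alphas.shape}, expected {(n_alternatives, mu.size)}"
        )
    if np.any(alphas < 0) or np.any(alphas > 1):
        raise ValueError("membership degrees must lie in [0, 1]")
    if not np.allclose(alphas.sum(axis=1), 1.0):
        raise ValueError("memberships of each alternative must sum to 1")
    if np.any(mu < 1):
        # assumption: top-level scale normalised to 1, so nest scales are >= 1
        raise ValueError("nest scales must be >= 1")
    return True


example_structure = {
    "mu": np.array([1.25, 1.16]),
    "alphas": np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]),
}
is_valid = check_cross_nested_structure(example_structure, n_alternatives=4)
```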

In [11]:
model_specification = {
    "rum_structure": rum_structure,
    "cross_nested_logit": cross_nested_structure,
    "general_params": general_params,
}

In [12]:
#features and label column names
features = [f for f in LPMC_train.columns if f != "choice"]
label = "choice"

#create lightgbm dataset
lgb_train_set = lightgbm.Dataset(LPMC_train[features], label=LPMC_train[label], free_raw_data=False)
lgb_test_set = lightgbm.Dataset(LPMC_test[features], label=LPMC_test[label], free_raw_data=False)

### Hyperparameter search

In [21]:
# specify the search space for mu and alpha
param_space = {
    "mu_0": hyperopt.hp.uniform("mu_0", 1, 2),
    "mu_1": hyperopt.hp.uniform("mu_1", 1, 2),
    "alpha_14": hyperopt.hp.uniform("alpha_14", 0, 1),
}


# objective for hyperopt
def objective(space):

    # create mu structure
    cross_nested_structure["mu"] = np.array([space["mu_0"], space["mu_1"]])
    cross_nested_structure["alphas"] = np.array(
        [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [space["alpha_14"], 1 - space["alpha_14"]]]
    )

    ce_loss = 0
    num_trees = 0

    for train_idx, test_idx in folds:
        train_set = lgb_train_set.subset(sorted(train_idx))
        test_set = lgb_train_set.subset(sorted(test_idx))

        LPMC_model_trained = rum_train(
            train_set, model_specification, valid_sets=[test_set]
        )

        ce_loss += LPMC_model_trained.best_score
        num_trees += LPMC_model_trained.best_iteration

    ce_loss = ce_loss / 5
    num_trees = num_trees / 5

    return {"loss": ce_loss, "status": hyperopt.STATUS_OK, "best_iteration": num_trees}


# a single evaluation keeps this example fast; use e.g. n_iter = 25 for a real search
n_iter = 1

trials = hyperopt.Trials()
best_classifier = hyperopt.fmin(
    fn=objective,
    space=param_space,
    algo=hyperopt.tpe.suggest,
    max_evals=n_iter,
    trials=trials,
)

print(
    f'Best mu_0: {best_classifier["mu_0"]} \n Best mu_1: {best_classifier["mu_1"]} \n Best alphas: {best_classifier["alpha_14"]} \n Best negative CE: {trials.best_trial["result"]["loss"]}'
)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000161 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 855                     
[LightGBM] [Info] Number of data points in the train set: 43812, number of used features: 17
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001127 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 855                     
[LightGBM] [Info] Number of data points in the train set: 43812, number of used features: 17
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.016519 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1828                    
[LightGBM] [Info] Number of data point




[LightGBM] [Info] Using self-defined objective function
[LightGBM] [Info] Using self-defined objective function
[LightGBM] [Info] Using self-defined objective function
[LightGBM] [Info] Using self-defined objective function
[1]------NCE value on train set : 1.3059             
---------NCE value on test set 1: 1.3071             
[11]-----NCE value on train set : 0.9734             
---------NCE value on test set 1: 0.9779             
[21]-----NCE value on train set : 0.8563             
---------NCE value on test set 1: 0.8610             
[31]-----NCE value on train set : 0.8016             
---------NCE value on test set 1: 0.8062             
[41]-----NCE value on train set : 0.7714             
---------NCE value on test set 1: 0.7766             
[51]-----NCE value on train set : 0.7519             
---------NCE value on test set 1: 0.7572             
[61]-----NCE value on train set : 0.7379             
---------NCE value on test set 1: 0.7428             
[71]-----NCE value o




[1]------NCE value on train set : 1.3065             
---------NCE value on test set 1: 1.3060             
[11]-----NCE value on train set : 0.9751             
---------NCE value on test set 1: 0.9715             
[21]-----NCE value on train set : 0.8578             
---------NCE value on test set 1: 0.8548             
[31]-----NCE value on train set : 0.8029             
---------NCE value on test set 1: 0.8004             
[41]-----NCE value on train set : 0.7731             
---------NCE value on test set 1: 0.7704             
[51]-----NCE value on train set : 0.7537             
---------NCE value on test set 1: 0.7508             
[61]-----NCE value on train set : 0.7397             
---------NCE value on test set 1: 0.7362             
[71]-----NCE value on train set : 0.7291             
---------NCE value on test set 1: 0.7254             
[81]-----NCE value on train set : 0.7207             
---------NCE value on test set 1: 0.7170             
[91]-----NCE value on train 




[1]------NCE value on train set : 1.3064             
---------NCE value on test set 1: 1.3057             
[11]-----NCE value on train set : 0.9756             
---------NCE value on test set 1: 0.9713             
[21]-----NCE value on train set : 0.8583             
---------NCE value on test set 1: 0.8530             
[31]-----NCE value on train set : 0.8033             
---------NCE value on test set 1: 0.7979             
[41]-----NCE value on train set : 0.7731             
---------NCE value on test set 1: 0.7680             
[51]-----NCE value on train set : 0.7534             
---------NCE value on test set 1: 0.7489             
[61]-----NCE value on train set : 0.7391             
---------NCE value on test set 1: 0.7353             
[71]-----NCE value on train set : 0.7282             
---------NCE value on test set 1: 0.7252             
[81]-----NCE value on train set : 0.7196             
---------NCE value on test set 1: 0.7173             
[91]-----NCE value on train 




[1]------NCE value on train set : 1.3057             
---------NCE value on test set 1: 1.3069             
[11]-----NCE value on train set : 0.9723             
---------NCE value on test set 1: 0.9785             
[21]-----NCE value on train set : 0.8546             
---------NCE value on test set 1: 0.8625             
[31]-----NCE value on train set : 0.7996             
---------NCE value on test set 1: 0.8082             
[41]-----NCE value on train set : 0.7694             
---------NCE value on test set 1: 0.7791             
[51]-----NCE value on train set : 0.7498             
---------NCE value on test set 1: 0.7602             
[61]-----NCE value on train set : 0.7356             
---------NCE value on test set 1: 0.7467             
[71]-----NCE value on train set : 0.7248             
---------NCE value on test set 1: 0.7369             
[81]-----NCE value on train set : 0.7164             
---------NCE value on test set 1: 0.7287             
[91]-----NCE value on train 




[1]------NCE value on train set : 1.3062             
---------NCE value on test set 1: 1.3060             
[11]-----NCE value on train set : 0.9736             
---------NCE value on test set 1: 0.9764             
[21]-----NCE value on train set : 0.8560             
---------NCE value on test set 1: 0.8597             
[31]-----NCE value on train set : 0.8010             
---------NCE value on test set 1: 0.8064             
[41]-----NCE value on train set : 0.7707             
---------NCE value on test set 1: 0.7767             
[51]-----NCE value on train set : 0.7510             
---------NCE value on test set 1: 0.7580             
[61]-----NCE value on train set : 0.7367             
---------NCE value on test set 1: 0.7441             
[71]-----NCE value on train set : 0.7259             
---------NCE value on test set 1: 0.7340             
[81]-----NCE value on train set : 0.7173             
---------NCE value on test set 1: 0.7261             
[91]-----NCE value on train 

KeyError: 'alphas_14'

In [22]:
model_specification

{'rum_structure': [{'utility': [0],
   'variables': ['age',
    'female',
    'day_of_week',
    'start_time_linear',
    'car_ownership',
    'driving_license',
    'purpose_B',
    'purpose_HBE',
    'purpose_HBO',
    'purpose_HBW',
    'purpose_NHBO',
    'fueltype_Average',
    'fueltype_Diesel',
    'fueltype_Hybrid',
    'fueltype_Petrol',
    'distance',
    'dur_walking'],
   'boosting_params': {'monotone_constraints_method': 'advanced',
    'max_depth': 1,
    'n_jobs': -1,
    'learning_rate': 0.1,
    'monotone_constraints': [0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     0,
     -1,
     -1],
    'interaction_constraints': [[0],
     [1],
     [2],
     [3],
     [4],
     [5],
     [6],
     [7],
     [8],
     [9],
     [10],
     [11],
     [12],
     [13],
     [14],
     [15],
     [16]]},
   'shared': False},
  {'utility': [1],
   'variables': ['age',
    'female',
    'day_of_week',
    'start_time_lin

### Cross-Validation

We use the values of $\mu$ and $\alpha$ found in a previous hyperparameter search and search for the optimal number of trees.

In [None]:
_, _, folds = load_preprocess_LPMC(path="../Data/")

mu = np.array([1.81, 1.]) # values from a previous hyperparameter search

alphas  = np.array([[0., 1.],
                    [0., 1.],
                    [1., 0.],
                    [0.364, 0.636]])

cross_nested_structure = {
    "mu":mu,
    "alphas":alphas,
    "optimise_mu":False,
    "optimise_alphas":False,
}

model_specification['cross_nested_logit'] = cross_nested_structure

ce_loss = 0
num_trees = 0

#5-fold CV
for i, (train_idx, test_idx) in enumerate(folds):

    #train and validation set
    train_set = lgb_train_set.subset(sorted(train_idx))
    test_set = lgb_train_set.subset(sorted(test_idx))

    print('-'*50 + '\n')
    print(f'Iteration {i+1}')

    #train rum_boost with cross-nested arguments
    LPMC_model_trained = rum_train(train_set, model_specification, valid_sets = [test_set])

    #aggregate results
    ce_loss += LPMC_model_trained.best_score
    num_trees += LPMC_model_trained.best_iteration
    
    print('-'*50 + '\n')
    print(f'Best cross entropy loss: {LPMC_model_trained.best_score}')
    print(f'Best number of trees: {LPMC_model_trained.best_iteration}')

ce_loss = ce_loss/5
num_trees = num_trees/5
print('-'*50 + '\n')
print(f'Cross validation negative cross entropy loss: {ce_loss}')
print(f'With a number of trees on average of {num_trees}')

--------------------------------------------------

Iteration 1




Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
--------------------------------------------------

Best cross entropy loss: 0.6560824787160351
Best number of trees: 687
--------------------------------------------------

Iteration 2




Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
--------------------------------------------------

Best cross entropy loss: 0.6444866813873578
Best number of trees: 687
--------------------------------------------------

Iteration 3




Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
--------------------------------------------------

Best cross entropy loss: 0.6594268117163683
Best number of trees: 687
--------------------------------------------------

Iteration 4




Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
--------------------------------------------------

Best cross entropy loss: 0.6633328079905507
Best number of trees: 687
--------------------------------------------------

Iteration 5




Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
Finished loading model, total used 687 iterations
--------------------------------------------------

Best cross entropy loss: 0.6666172963883393
Best number of trees: 683
--------------------------------------------------

Cross validation negative cross entropy loss: 0.6579892152397303
With a number of trees on average of 686.2
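
As a quick check of the arithmetic, the averages printed above can be reproduced from the five per-fold results. The variable names below are chosen for this sketch only.

```python
import numpy as np

# Per-fold results copied from the output above.
fold_scores = [
    0.6560824787160351,
    0.6444866813873578,
    0.6594268117163683,
    0.6633328079905507,
    0.6666172963883393,
]
fold_iterations = [687, 687, 687, 687, 683]

mean_score = float(np.mean(fold_scores))
mean_iterations = float(np.mean(fold_iterations))

# The average number of trees is truncated to an integer before the final training.
final_num_iterations = int(mean_iterations)

print(mean_score, mean_iterations, final_num_iterations)
```

Taking the mean over the list, rather than dividing by a hard-coded 5, keeps the computation correct if the number of folds changes.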


### Testing the model on out-of-sample data

Now that we have the optimal number of trees (686), we can train the final version of the model on the full training set and test it on out-of-sample data with the ```predict()``` function. Note that ```predict()``` expects the data as a lightgbm ```Dataset``` object.

**Also note that you need to specify ```mu``` and ```alphas``` in the predict function to adapt the probability formula accordingly.**
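
The test metric is the cross-entropy between predicted probabilities and observed choices. As a sketch of what such a metric computes, assuming it is the mean negative log-probability of the chosen alternative, a NumPy version could look as follows. ```mean_cross_entropy``` is a hypothetical name, and the ```cross_entropy``` function imported from ```rumboost.metrics``` may differ in its details.

```python
import numpy as np


def mean_cross_entropy(preds, labels):
    """Mean negative log-probability of the chosen alternative (illustrative)."""
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Pick, for each observation, the predicted probability of its observed choice.
    chosen = preds[np.arange(labels.size), labels]
    # Clip to avoid log(0) for degenerate predictions.
    return float(-np.mean(np.log(np.clip(chosen, 1e-15, None))))


# Two made-up observations with four alternatives each.
toy_preds = np.array([[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]])
toy_labels = np.array([0, 3])

toy_loss = mean_cross_entropy(toy_preds, toy_labels)
print(toy_loss)
```

Lower values are better. A model that assigns equal probability to four alternatives scores $\ln 4 \approx 1.386$.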

In [None]:
general_params["num_iterations"] = int(num_trees)
general_params["early_stopping_round"] = None

LPMCCN_model_fully_trained = rum_train(lgb_train_set, model_specification)

preds = LPMCCN_model_fully_trained.predict(lgb_test_set) 
ce_test = cross_entropy(preds, lgb_test_set.get_label().astype(int))

print('-'*50)
print(f'Final negative cross-entropy on the test set: {ce_test}')



Finished loading model, total used 686 iterations
Finished loading model, total used 686 iterations
Finished loading model, total used 686 iterations
Finished loading model, total used 686 iterations
--------------------------------------------------
Final negative cross-entropy on the test set: 0.6796725053458201


### $\mu$ and $\alpha$ optimisation 

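We optimise $\mu$ and $\alpha$ with ```scipy.optimize.minimize```. Setting ```optimise_mu``` to ```True``` optimises the nest scales, and the boolean array passed to ```optimise_alphas``` selects which membership degrees are optimised (here only those of the last alternative).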
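
How RUMBoost calls the optimiser internally is not shown in this notebook. To illustrate the idea only, the sketch below fits two nest scales and one membership degree by minimising the cross-entropy with ```scipy.optimize.minimize``` on synthetic utilities and choices. The probability formula, the bounds, the helper names and the random data are all assumptions made for this example.

```python
import numpy as np
from scipy.optimize import minimize


def cnl_probs(v, alphas, mu):
    """Cross-nested logit probabilities under an assumed textbook formulation."""
    v = v - v.max(axis=1, keepdims=True)  # numerical stability
    terms = alphas[None, :, :] * np.exp(mu[None, None, :] * v[:, :, None])
    nest_sums = terms.sum(axis=1)
    nest_weights = nest_sums ** (1.0 / mu)
    p_nest = nest_weights / nest_weights.sum(axis=1, keepdims=True)
    return (terms / nest_sums[:, None, :] * p_nest[:, None, :]).sum(axis=2)


def unpack(theta):
    """theta = [mu_0, mu_1, a], where a is the membership of alternative 3 to nest 0."""
    mu = theta[:2]
    a = theta[2]
    alphas = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [a, 1.0 - a]])
    return mu, alphas


def loss(theta, v, labels):
    """Mean negative log-probability of the observed choices."""
    mu, alphas = unpack(theta)
    probs = cnl_probs(v, alphas, mu)
    chosen = probs[np.arange(labels.size), labels]
    return -np.mean(np.log(np.clip(chosen, 1e-15, None)))


# Synthetic data: fixed utilities and random choices, for illustration only.
rng = np.random.default_rng(0)
v = rng.normal(size=(500, 4))
labels = rng.integers(0, 4, size=500)

theta0 = np.array([1.25, 1.16, 0.5])  # same starting values as in the cell below
bounds = [(1.0, 2.0), (1.0, 2.0), (0.0, 1.0)]  # assumed search ranges

result = minimize(loss, theta0, args=(v, labels), method="L-BFGS-B", bounds=bounds)
print(result.x, result.fun)
```

In this sketch the utilities are held fixed while the three parameters are optimised. In a boosted model the utilities change as trees are added, so an optimisation of this kind would have to be interleaved with the boosting rounds.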

In [15]:
mu = np.array([1.25, 1.16])  # random values

alphas = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])

cross_nested_structure = {
    "mu": mu,
    "alphas": alphas,
    "optimise_mu": True,
    "optimise_alphas": np.array(
        [[False, False], [False, False], [False, False], [True, True]]
    ),
}

In [16]:
model_specification = {
    "rum_structure": rum_structure,
    "cross_nested_logit": cross_nested_structure,
    "general_params": general_params,
}

In [17]:
#features and label column names
features = [f for f in LPMC_train.columns if f != "choice"]
label = "choice"

#create lightgbm dataset
lgb_train_set = lightgbm.Dataset(LPMC_train[features], label=LPMC_train[label], free_raw_data=False)
lgb_test_set = lightgbm.Dataset(LPMC_test[features], label=LPMC_test[label], free_raw_data=False)

### Cross-Validation

In [18]:
_, _, folds = load_preprocess_LPMC(path="../Data/")

ce_loss = 0
num_trees = 0

#5-fold CV
for i, (train_idx, test_idx) in enumerate(folds):

    #train and validation set
    train_set = lgb_train_set.subset(sorted(train_idx))
    test_set = lgb_train_set.subset(sorted(test_idx))

    print('-'*50 + '\n')
    print(f'Iteration {i+1}')

    #train rum_boost with cross-nested arguments
    LPMC_model_trained = rum_train(train_set, model_specification, valid_sets = [test_set])

    #aggregate results
    ce_loss += LPMC_model_trained.best_score
    num_trees += LPMC_model_trained.best_iteration
    
    print('-'*50 + '\n')
    print(f'Best cross entropy loss: {LPMC_model_trained.best_score}')
    print(f'Best number of trees: {LPMC_model_trained.best_iteration}')

ce_loss = ce_loss/5
num_trees = num_trees/5
print('-'*50 + '\n')
print(f'Cross validation negative cross entropy loss: {ce_loss}')
print(f'With a number of trees on average of {num_trees}')

--------------------------------------------------

Iteration 1
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000285 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 855
[LightGBM] [Info] Number of data points in the train set: 43812, number of used features: 17
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000280 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 855
[LightGBM] [Info] Number of data points in the train set: 43812, number of used features: 17
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000698 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[Light



[11]-----NCE value on train set : 0.9956
---------NCE value on test set 1: 0.9994
[21]-----NCE value on train set : 0.8553
---------NCE value on test set 1: 0.8595
[31]-----NCE value on train set : 0.8049
---------NCE value on test set 1: 0.8095
[41]-----NCE value on train set : 0.7709
---------NCE value on test set 1: 0.7760
[51]-----NCE value on train set : 0.7541
---------NCE value on test set 1: 0.7600
[61]-----NCE value on train set : 0.7382
---------NCE value on test set 1: 0.7441
[71]-----NCE value on train set : 0.7291
---------NCE value on test set 1: 0.7347
[81]-----NCE value on train set : 0.7192
---------NCE value on test set 1: 0.7245
[91]-----NCE value on train set : 0.7133
---------NCE value on test set 1: 0.7183
[101]----NCE value on train set : 0.7063
---------NCE value on test set 1: 0.7113
[111]----NCE value on train set : 0.7021
---------NCE value on test set 1: 0.7071
[121]----NCE value on train set : 0.6969
---------NCE value on test set 1: 0.7021
[131]----NCE val



[11]-----NCE value on train set : 0.9974
---------NCE value on test set 1: 0.9936
[21]-----NCE value on train set : 0.8570
---------NCE value on test set 1: 0.8532
[31]-----NCE value on train set : 0.8065
---------NCE value on test set 1: 0.8033
[41]-----NCE value on train set : 0.7725
---------NCE value on test set 1: 0.7689
[51]-----NCE value on train set : 0.7560
---------NCE value on test set 1: 0.7523
[61]-----NCE value on train set : 0.7401
---------NCE value on test set 1: 0.7359
[71]-----NCE value on train set : 0.7310
---------NCE value on test set 1: 0.7266
[81]-----NCE value on train set : 0.7210
---------NCE value on test set 1: 0.7160
[91]-----NCE value on train set : 0.7151
---------NCE value on test set 1: 0.7101
[101]----NCE value on train set : 0.7081
---------NCE value on test set 1: 0.7030
[111]----NCE value on train set : 0.7040
---------NCE value on test set 1: 0.6990
[121]----NCE value on train set : 0.6988
---------NCE value on test set 1: 0.6937
[131]----NCE val



[1]------NCE value on train set : 1.3350
---------NCE value on test set 1: 1.3344
[11]-----NCE value on train set : 0.9976
---------NCE value on test set 1: 0.9936
[21]-----NCE value on train set : 0.8574
---------NCE value on test set 1: 0.8524
[31]-----NCE value on train set : 0.8069
---------NCE value on test set 1: 0.8019
[41]-----NCE value on train set : 0.7729
---------NCE value on test set 1: 0.7679
[51]-----NCE value on train set : 0.7560
---------NCE value on test set 1: 0.7514
[61]-----NCE value on train set : 0.7399
---------NCE value on test set 1: 0.7358
[71]-----NCE value on train set : 0.7305
---------NCE value on test set 1: 0.7272
[81]-----NCE value on train set : 0.7203
---------NCE value on test set 1: 0.7177
[91]-----NCE value on train set : 0.7142
---------NCE value on test set 1: 0.7124
[101]----NCE value on train set : 0.7070
---------NCE value on test set 1: 0.7060
[111]----NCE value on train set : 0.7027
---------NCE value on test set 1: 0.7024
[121]----NCE val



[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000737 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1822
[LightGBM] [Info] Number of data points in the train set: 43815, number of used features: 23
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000273 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1362
[LightGBM] [Info] Number of data points in the train set: 43815, number of used features: 20
[LightGBM] [Info] Using self-defined objective function
[LightGBM] [Info] Using self-defined objective function
[LightGBM] [Info] Using self-defined objective function
[LightGBM] [Info] Using self-defined objective function
[1]------NCE value on train set : 1.3344
---------NCE value on test se



[11]-----NCE value on train set : 0.9955
---------NCE value on test set 1: 0.9976
[21]-----NCE value on train set : 0.8545
---------NCE value on test set 1: 0.8592
[31]-----NCE value on train set : 0.8039
---------NCE value on test set 1: 0.8098
[41]-----NCE value on train set : 0.7697
---------NCE value on test set 1: 0.7768
[51]-----NCE value on train set : 0.7529
---------NCE value on test set 1: 0.7606
[61]-----NCE value on train set : 0.7367
---------NCE value on test set 1: 0.7452
[71]-----NCE value on train set : 0.7275
---------NCE value on test set 1: 0.7364
[81]-----NCE value on train set : 0.7172
---------NCE value on test set 1: 0.7270
[91]-----NCE value on train set : 0.7112
---------NCE value on test set 1: 0.7217
[101]----NCE value on train set : 0.7040
---------NCE value on test set 1: 0.7152
[111]----NCE value on train set : 0.6998
---------NCE value on test set 1: 0.7115
[121]----NCE value on train set : 0.6945
---------NCE value on test set 1: 0.7070
[131]----NCE val

### Testing the model on out-of-sample data

Now that we have the optimal number of trees from cross-validation, we can train the final version of the model on the full training set and test it on out-of-sample data with the ```predict()``` function. Note that ```predict()``` expects the data as a lightgbm ```Dataset``` object.

**Also note that you need to specify ```mu``` and ```alphas``` in the predict function to adapt the probability formula accordingly.**

In [None]:
general_params["num_iterations"] = int(num_trees)
general_params["early_stopping_round"] = None

LPMCCN_model_fully_trained = rum_train(lgb_train_set, model_specification)

preds = LPMCCN_model_fully_trained.predict(lgb_test_set) 
ce_test = cross_entropy(preds, lgb_test_set.get_label().astype(int))

print('-'*50)
print(f'Final negative cross-entropy on the test set: {ce_test}')



[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000229 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 854
[LightGBM] [Info] Number of data points in the train set: 54766, number of used features: 17
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000315 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 854
[LightGBM] [Info] Number of data points in the train set: 54766, number of used features: 17
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001873 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1834
[LightGBM] [Info] Number of data poi

# References

Salvadé, N., & Hillel, T. (2024). Rumboost: Gradient Boosted Random Utility Models. *arXiv preprint* [arXiv:2401.11954](https://arxiv.org/abs/2401.11954).

Hillel, T., Elshafie, M. Z. E. B., & Jin, Y. (2018). Recreating passenger mode choice-sets for transport simulation: A case study of London, UK. *Proceedings of the Institution of Civil Engineers - Smart Infrastructure and Construction*, 171, 29–42. https://doi.org/10.1680/jsmic.17.00018