add xgboost example #522

Merged
merged 1 commit into from
Jun 30, 2022
374 changes: 374 additions & 0 deletions examples/07-Train-an-xgboost-model-using-the-Merlin-Models-API.ipynb
@@ -0,0 +1,374 @@
{
Member:
is NVTabular dataset the preferred terminology to refer to: merlin.io.dataset.Dataset or Merlin Dataset?


Contributor Author:
That is a great point! To be honest, I think I have been imprecise here. Will change the wording to the one you suggested.

Member:
It doesn't look like we're passing this directly to the model. The schema passed in still has both target columns. However, the logic in the code currently picks the target column based on the objective. For example, objective="binary:logistic" will look for targets in the schema that are tagged with Tags.BINARY_CLASSIFICATION. In this case rating has tags (Tags.REGRESSION, Tags.TARGET), and rating_binary has tags (Tags.BINARY_CLASSIFICATION, Tags.TARGET). So if the objective were switched from "binary:logistic" to "reg:logistic", then the target would switch from rating_binary to rating.

OBJECTIVES = {
    "binary:logistic": Tags.BINARY_CLASSIFICATION,
    "reg:logistic": Tags.REGRESSION,
    "reg:squarederror": Tags.REGRESSION,
    "rank:pairwise": Tags.TARGET,
    "rank:ndcg": Tags.TARGET,
    "rank:map": Tags.TARGET,
}
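A minimal sketch of this tag-based target resolution. Note that `Tags`, `schema`, and `resolve_target` below are illustrative stand-ins, not the actual merlin.models internals:

```python
from enum import Enum

class Tags(Enum):
    TARGET = "target"
    REGRESSION = "regression"
    BINARY_CLASSIFICATION = "binary_classification"

# Map each xgboost objective to the tag used to find the target column
OBJECTIVES = {
    "binary:logistic": Tags.BINARY_CLASSIFICATION,
    "reg:logistic": Tags.REGRESSION,
    "reg:squarederror": Tags.REGRESSION,
    "rank:pairwise": Tags.TARGET,
    "rank:ndcg": Tags.TARGET,
    "rank:map": Tags.TARGET,
}

# Stand-in for the schema: (column name, set of tags) pairs
schema = [
    ("rating", {Tags.REGRESSION, Tags.TARGET}),
    ("rating_binary", {Tags.BINARY_CLASSIFICATION, Tags.TARGET}),
]

def resolve_target(schema, objective):
    """Pick the target column(s) whose tags include the tag mapped from the objective."""
    tag = OBJECTIVES[objective]
    return [name for name, tags in schema if tag in tags]

print(resolve_target(schema, "binary:logistic"))  # ['rating_binary']
print(resolve_target(schema, "reg:logistic"))     # ['rating']
```

With a ranking objective such as "rank:ndcg", the mapped tag is Tags.TARGET, so every TARGET-tagged column would match.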


Member:
Alternatively, targets can be explicitly specified with target_columns=[...] in the constructor of the XGBoost class.

Member:
If this feels like too much magic, we could simplify the default (when no target columns are specified) to rely only on the TARGET tag (e.g. all columns in the schema with the TARGET tag become the targets for the model) and remove the extra complexity involving the objective and other tags.
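That simplified default could look something like the sketch below; the names and structures here are illustrative stand-ins, not the real API:

```python
from enum import Enum

class Tags(Enum):
    TARGET = "target"
    REGRESSION = "regression"
    BINARY_CLASSIFICATION = "binary_classification"

# Stand-in schema: (column name, set of tags) pairs
schema = [
    ("rating", {Tags.REGRESSION, Tags.TARGET}),
    ("rating_binary", {Tags.BINARY_CLASSIFICATION, Tags.TARGET}),
    ("movieId", set()),
]

def default_targets(schema):
    """Default when no target_columns are given: every TARGET-tagged column."""
    return [name for name, tags in schema if Tags.TARGET in tags]

print(default_targets(schema))  # ['rating', 'rating_binary']
```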

Contributor Author:
this is some amazing sleuthing there Oliver! 🙂 outstanding

I had an earlier version of the example where I was passing in target_columns and then I must have omitted it here and it still worked.

I really like how this automatic selection of the target column provides a glimpse into how the Merlin Models API works, and what value it can add, thus I am adding a bit of prose to explain how it actually works, based on what you shared above.

Member:
This looks great Radek 🙂

Member:
high-level API: "high-level" is used as an adjective and should be hyphenated.

Instead of "currently", do you mind adding a time frame or a software version so that the sentence ages better? Maybe "For the Merlin Models v0.6.0 release, some xgboost and implicit models are supported."

nit: s/allows/enables/ -- I've been conditioned to avoid "allows" due to possible confusion related to granting permission.

"try out" can be "evaluate" if you want a $10 word. Up to you.


Member:
My $0.02 is to remove the parens--they tend to diminish the importance of the words and what you have is no less important than the rest of the sentence.

I'd also break the second sentence into two:

"The dataset consists of userId and movieId pairings. For each record, a user rates a movie and the record includes additional information such as genre of the movie, age of the user, and so on."


Member:
nit: My sugg is to use present tense nearly all the time: "The get_movielens function downloads the movielens-100k data for us and returns it materialized as a Merlin Dataset."

Present tense is typically easier for the reader to understand and lends confidence and authority to your voice.


Member:
nit: Merlin Models API

Maybe "A key feature of the Merlin Models API is tagging."

s/can be/is/ and avoid parens again:

"You can tag your data once, during preprocessing, and this information is picked up during later steps such as additional preprocessing steps, training your model, serving the model, and so on."

I think I'd even use present tense in the last sentence--though clearly up to you:

"During preprocessing that is performed by the get_movielens function, two columns in the dataset are assigned the Tags.TARGET tag:"


Member:
I don't think there's anything grammatically wrong with "...to train with by passing..." but the two prepositions side-by-side caught my eye. I think you can remove "with" and not lose meaning. Up to you.

sugg: s/when constructing/when you construct/

Please add a colon after "the following:"

Maybe instead of "...do something else" it could be "...do something better" or "more powerful".

nit: avoid "xgboost" in plain text. Show it as inline code: xgboost when it seems valuable to suggest coding, or XGBoost without any styling if you don't need to suggest coding. It's similar to using "Merlin Models" in plain text.

sugg: s/we will be setting/we will set/. Future tense seems OK here. Also, "Given this piece of information, the Merlin Models code can infer that we want to train..."

sugg: s/would not be/is not/: "...it further, it is not useful for training."


Member:
sugg: style as "...an xgboost model" or "...an XGBoost model"

sugg: s/that will predict/that predicts/

I find starting a sentence with inline code or a number awkward. My sugg is "For the rating_binary column, a value of 1 indicates..." I'd style the 0 as inline code too because it is like directed user input. (Conversely, it's not that all numbers are inline code. It's OK to say the dataset has 100K records.)


Member:
up to you: I wouldn't bother including the apostrophe in the inline code style.

Instead of "recently introduced," can you say something like "This class is introduced in the XGBoost 1.1 release and this data format provides..."

Instead of a link with "here" as the text, can you use "You can read more about it in this [article](...) from the RAPIDS AI channel."

more present tense sugg: "...with early stopping, the training ceases as soon as the model starts overfitting..."

Maybe: "The verbose_eval parameter specifies..." Up to you.


Member:
"GPU RAM" is likely technically correct, but most of our docs refer to "GPU memory."



"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "a556f660",
"metadata": {},
"outputs": [],
"source": [
"# Copyright 2022 NVIDIA Corporation. All Rights Reserved.\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# http://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License.\n",
"# =============================================================================="
]
},
{
"cell_type": "markdown",
"id": "697d1452",
"metadata": {},
"source": [
"<img src=\"http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png\" style=\"width: 90px; float: right;\">\n",
"\n",
"# Train a third party model using the Merlin Models API\n",
"\n",
"## Overview\n",
"\n",
"Merlin Models exposes a high-level API that can be used with models from other libraries. For the Merlin Models v0.6.0 release, some `xgboost` and `implicit` models are supported.\n",
"\n",
"Relying on this high level API enables you to iterate more effectively. You do not have to switch between various APIs as you evaluate additional models on your data.\n",
"\n",
"Furthermore, you can use your data represented as a `Dataset` across all your models.\n",
"\n",
"### Learning objectives\n",
"\n",
"- Training with `xgboost`\n",
"- Using the Merlin Models high level API"
]
},
{
"cell_type": "markdown",
"id": "1cccd005",
"metadata": {},
"source": [
"## Preparing the dataset"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "55d93b8b",
"metadata": {},
"outputs": [],
"source": [
"from merlin.core.utils import Distributed\n",
"from merlin.models.xgb import XGBoost\n",
"\n",
"from merlin.datasets.entertainment import get_movielens\n",
"from merlin.schema.tags import Tags"
]
},
{
"cell_type": "markdown",
"id": "cec216e2",
"metadata": {},
"source": [
"We will use the `movielens-100k` dataset. The dataset consists of `userId` and `movieId` pairings. For each record, a user rates a movie and the record includes additional information such as genre of the movie, age of the user, and so on."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "24586409",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-06-30 11:01:39.340989: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:952] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n",
"2022-06-30 11:01:39.341346: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:952] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n",
"2022-06-30 11:01:39.341477: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:952] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n",
"/usr/local/lib/python3.8/dist-packages/cudf/core/dataframe.py:1292: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.\n",
" warnings.warn(\n"
]
}
],
"source": [
"train, valid = get_movielens(variant='ml-100k')"
]
},
{
"cell_type": "markdown",
"id": "4e26cedb",
"metadata": {},
"source": [
"The `get_movielens` function downloads the `movielens-100k` data for us and returns it materialized as a Merlin `Dataset`."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "e2237f8b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(<merlin.io.dataset.Dataset at 0x7ff1dae87fd0>,\n",
" <merlin.io.dataset.Dataset at 0x7ff1dae8deb0>)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train, valid"
]
},
{
"cell_type": "markdown",
"id": "8ed670fc",
"metadata": {},
"source": [
"One of the features that the Merlin Models API supports is tagging. You can tag your data once, during preprocessing, and this information is picked up during later steps such as additional preprocessing steps, training your model, serving the model, and so on.\n",
"\n",
"Here, we will make use of the `Tags.TARGET` tag to identify the objective for our `xgboost` model.\n",
"\n",
"During preprocessing that is performed by the `get_movielens` function, two columns in the dataset are assigned the `Tags.TARGET` tag:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "69274522",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>tags</th>\n",
" <th>dtype</th>\n",
" <th>is_list</th>\n",
" <th>is_ragged</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>rating</td>\n",
" <td>(Tags.REGRESSION, Tags.TARGET)</td>\n",
" <td>int64</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>rating_binary</td>\n",
" <td>(Tags.BINARY_CLASSIFICATION, Tags.TARGET)</td>\n",
" <td>int32</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"[{'name': 'rating', 'tags': {<Tags.REGRESSION: 'regression'>, <Tags.TARGET: 'target'>}, 'properties': {}, 'dtype': dtype('int64'), 'is_list': False, 'is_ragged': False}, {'name': 'rating_binary', 'tags': {<Tags.BINARY_CLASSIFICATION: 'binary_classification'>, <Tags.TARGET: 'target'>}, 'properties': {}, 'dtype': dtype('int32'), 'is_list': False, 'is_ragged': False}]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train.schema.select_by_tag(Tags.TARGET)"
]
},
{
"cell_type": "markdown",
"id": "c6e607b7",
"metadata": {},
"source": [
"You can specify the target to train by passing `target_columns` when you construct the model. We would like to use `rating_binary` as our target, so we could do the following:\n",
"\n",
"`model = XGBoost(target_columns='rating_binary', ...`\n",
"\n",
"However, we can also do something better. Instead of providing this argument to the constructor of our model, we can instead specify the `objective` for our `xgboost` model and have the Merlin Models API do the rest of the work for us.\n",
"\n",
"Later in this example, we will set our booster's objective to `'binary:logistic'`. Given this piece of information, the Merlin Models code can infer that we want to train with a target that has the `Tags.BINARY_CLASSIFICATION` tag assigned to it and there will be nothing else we need to do.\n",
"\n",
"Before we begin to train, let us remove the `title` column from our schema. In the dataset, the title is a string, and unless we preprocess it further, it is not useful for training."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "e8a28f88",
"metadata": {},
"outputs": [],
"source": [
"schema_without_title = train.schema.remove_col('title')"
]
},
{
"cell_type": "markdown",
"id": "aedb65d5",
"metadata": {},
"source": [
"To summarize, we will train an `xgboost` model that predicts the rating of a movie.\n",
"\n",
"For the `rating_binary` column, a value of `1` indicates that the user has given the movie a high rating, and a target of `0` indicates that the user has given the movie a low rating."
]
},
{
"cell_type": "markdown",
"id": "f575b14b",
"metadata": {},
"source": [
"## Training the model"
]
},
{
"cell_type": "markdown",
"id": "59e1d262",
"metadata": {},
"source": [
"Before we begin training, let's define a couple of custom parameters.\n",
"\n",
"Specifying `gpu_hist` as our `tree_method` will run the training on the GPU. Also, it will trigger representing our datasets as `DaskDeviceQuantileDMatrix` instead of the standard `DaskDMatrix`. This class is introduced in the XGBoost 1.1 release and this data format provides more efficient training with lower memory footprint. You can read more about it in this [article](https://medium.com/rapids-ai/new-features-and-optimizations-for-gpus-in-xgboost-1-1-fc153dc029ce) from the RAPIDS AI channel.\n",
"\n",
"Additionally, we will train with early stopping and evaluate the stopping criteria on a validation set. If we were to train without early stopping, `XGBoost` would continue to improve results on the train set until it reached a perfect score. That would result in a low training loss but we would lose any ability to generalize to unseen data. Instead, by training with early stopping, the training ceases as soon as the model starts overfitting to the train set and the results on the validation set start to deteriorate.\n",
"\n",
"The `verbose_eval` parameter specifies how often metrics are reported during training."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "b1804697",
"metadata": {},
"outputs": [],
"source": [
"xgb_booster_params = {\n",
" 'objective':'binary:logistic',\n",
" 'tree_method':'gpu_hist',\n",
"}\n",
"\n",
"xgb_train_params = {\n",
" 'num_boost_round': 100,\n",
" 'verbose_eval': 20,\n",
" 'early_stopping_rounds': 10,\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "4b755e80",
"metadata": {},
"source": [
"We are now ready to train.\n",
"\n",
"In order to facilitate training on data larger than the available GPU memory, the training will leverage Dask. All the complexity of starting a local dask cluster is hidden in the `Distributed` context manager.\n",
"\n",
"Without further ado, let's train."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "8c511fc6",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"distributed.diskutils - INFO - Found stale lock file and directory '/workspace/examples/dask-worker-space/worker-wnzk7dfa', purging\n",
"distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0]\tvalidation_set-logloss:0.65881\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[11:01:42] task [xgboost.dask]:tcp://127.0.0.1:32957 got new rank 0\n",
"[11:01:42] WARNING: ../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[20]\tvalidation_set-logloss:0.61290\n",
"[40]\tvalidation_set-logloss:0.60795\n",
"[60]\tvalidation_set-logloss:0.60568\n",
"[80]\tvalidation_set-logloss:0.60320\n",
"[85]\tvalidation_set-logloss:0.60294\n"
]
}
],
"source": [
"with Distributed():\n",
" model = XGBoost(schema=schema_without_title, **xgb_booster_params)\n",
" model.fit(\n",
" train,\n",
" evals=[(valid, 'validation_set'),],\n",
" **xgb_train_params\n",
" )\n",
" metrics = model.evaluate(valid)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
20 changes: 20 additions & 0 deletions tests/integration/tf/test_ci_07_xgboost_integration.py
@@ -0,0 +1,20 @@
from testbook import testbook

from tests.conftest import REPO_ROOT


@testbook(
REPO_ROOT / "examples/07-Train-an-xgboost-model-using-the-Merlin-Models-API.ipynb",
execute=False,
)
def test_func(tb):
tb.inject(
"""
import os
os.environ["INPUT_DATA_DIR"] = "/raid/data/movielens"
"""
)
tb.execute()
metrics = tb.ref("metrics")
assert metrics.keys() == {"logloss"}
assert metrics["logloss"] < 0.65