add xgboost example #522

Merged
merged 1 commit into from
Jun 30, 2022
374 changes: 374 additions & 0 deletions examples/07-Train-an-xgboost-model-using-the-Merlin-Models-API.ipynb
@@ -0,0 +1,374 @@
{
Member:
is NVTabular dataset the preferred terminology to refer to: merlin.io.dataset.Dataset or Merlin Dataset?


Contributor Author:
That is a great point! To be honest, I think I have been imprecise here. Will change the wording to the one you suggested.

Member:
It doesn't look like we're passing this directly to the model. The schema passed in still has both target columns. However, the logic in the code currently picks the target column based on the objective. For example, objective="binary:logistic" will look for targets in the schema that are tagged with Tags.BINARY_CLASSIFICATION. In this case rating has tags (Tags.REGRESSION, Tags.TARGET), and rating_binary has tags (Tags.BINARY_CLASSIFICATION, Tags.TARGET). So if the objective were switched from "binary:logistic" to "reg:logistic", then the target would switch from rating_binary to rating.

OBJECTIVES = {
    "binary:logistic": Tags.BINARY_CLASSIFICATION,
    "reg:logistic": Tags.REGRESSION,
    "reg:squarederror": Tags.REGRESSION,
    "rank:pairwise": Tags.TARGET,
    "rank:ndcg": Tags.TARGET,
    "rank:map": Tags.TARGET,
}
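A minimal sketch of this tag-based target resolution. Note that `Tags`, `schema`, and `resolve_target` below are illustrative stand-ins, not the actual merlin.models internals:

```python
from enum import Enum

class Tags(Enum):
    TARGET = "target"
    REGRESSION = "regression"
    BINARY_CLASSIFICATION = "binary_classification"

# Map each xgboost objective to the tag used to find the target column
OBJECTIVES = {
    "binary:logistic": Tags.BINARY_CLASSIFICATION,
    "reg:logistic": Tags.REGRESSION,
    "reg:squarederror": Tags.REGRESSION,
    "rank:pairwise": Tags.TARGET,
    "rank:ndcg": Tags.TARGET,
    "rank:map": Tags.TARGET,
}

# Stand-in for the schema: (column name, set of tags) pairs
schema = [
    ("rating", {Tags.REGRESSION, Tags.TARGET}),
    ("rating_binary", {Tags.BINARY_CLASSIFICATION, Tags.TARGET}),
]

def resolve_target(schema, objective):
    """Pick the target column(s) whose tags include the tag mapped from the objective."""
    tag = OBJECTIVES[objective]
    return [name for name, tags in schema if tag in tags]

print(resolve_target(schema, "binary:logistic"))  # ['rating_binary']
print(resolve_target(schema, "reg:logistic"))     # ['rating']
```

With a ranking objective such as "rank:ndcg", the mapped tag is Tags.TARGET, so every TARGET-tagged column would match.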


Member:
Alternatively, targets can be explicitly specified with target_columns=[...] in the constructor of the XGBoost class.

Member:
If this feels like too much magic, we could simplify the default (when no target columns are specified) to rely only on the TARGET tag (e.g. all columns in the schema with the TARGET tag become the targets for the model) and remove the extra complexity involving the objective and other tags.
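That simplified default could look something like the sketch below; the names and structures here are illustrative stand-ins, not the real API:

```python
from enum import Enum

class Tags(Enum):
    TARGET = "target"
    REGRESSION = "regression"
    BINARY_CLASSIFICATION = "binary_classification"

# Stand-in schema: (column name, set of tags) pairs
schema = [
    ("rating", {Tags.REGRESSION, Tags.TARGET}),
    ("rating_binary", {Tags.BINARY_CLASSIFICATION, Tags.TARGET}),
    ("movieId", set()),
]

def default_targets(schema):
    """Default when no target_columns are given: every TARGET-tagged column."""
    return [name for name, tags in schema if Tags.TARGET in tags]

print(default_targets(schema))  # ['rating', 'rating_binary']
```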

Contributor Author:
this is some amazing sleuthing there Oliver! 🙂 outstanding

I had an earlier version of the example where I was passing in target_columns and then I must have omitted it here and it still worked.

I really like how this automatic selection of the target column provides a glimpse into how the Merlin Models API works, and what value it can add, thus I am adding a bit of prose to explain how it actually works, based on what you shared above.

Member:
This looks great Radek 🙂

Member:
high-level API: "high-level" is used as an adjective and should be hyphenated.

Instead of "currently", do you mind adding a time frame or a software version so that the sentence ages better? Maybe "For the Merlin Models v0.6.0 release, some xgboost and implicit models are supported."

nit: s/allows/enables/ -- I've been conditioned to avoid "allows" due to possible confusion related to granting permission.

"try out" can be "evaluate" if you want a $10 word. Up to you.


Member:
My $0.02 is to remove the parens--they tend to diminish the importance of the words and what you have is no less important than the rest of the sentence.

I'd also break the second sentence into two:

"The dataset consists of userId and movieId pairings. For each record, a user rates a movie and the record includes additional information such as genre of the movie, age of the user, and so on."


Member:
nit: My sugg is to use present tense nearly all the time: "The get_movielens function downloads the movielens-100k data for us and returns it materialized as a Merlin Dataset."

Present tense is typically easier for the reader to understand and lends confidence and authority to your voice.


Member:
nit: Merlin Models API

Maybe "A key feature of the Merlin Models API is tagging."

s/can be/is/ and avoid parens again:

"You can tag your data once, during preprocessing, and this information is picked up during later steps such as additional preprocessing steps, training your model, serving the model, and so on."

I think I'd even use present tense in the last sentence--though clearly up to you:

"During preprocessing that is performed by the get_movielens function, two columns in the dataset are assigned the Tags.TARGET tag:"


Member:
I don't think there's anything grammatically wrong with "...to train with by passing..." but the two prepositions side-by-side caught my eye. I think you can remove "with" and not lose meaning. Up to you.

sugg: s/when constructing/when you construct/

Please add a colon after "the following:"

Maybe instead of "...do something else" it could be "...do something better" or "more powerful".

nit: avoid "xgboost" in plain text. Show it as inline code: xgboost when it seems valuable to suggest coding, or XGBoost without any styling if you don't need to suggest coding. It's similar to using "Merlin Models" in plain text.

sugg: s/we will be setting/we will set/. Future tense seems OK here. Also, "Given this piece of information, the Merlin Models code can infer that we want to train..."

sugg: s/would not be/is not/: "...it further, it is not useful for training."


Member:
sugg: style as "...an xgboost model" or "...an XGBoost model"

sugg: s/that will predict/that predicts/

I find starting a sentence with inline code or a number awkward. My sugg is "For the rating_binary column, a value of 1 indicates..." I'd style the 0 as inline code too because it is like directed user input. (Conversely, it's not that all numbers are inline code. It's OK to say the dataset has 100K records.)


Member:
up to you: I wouldn't bother including the apostrophe in the inline code style.

Instead of "recently introduced," can you say something like "This class is introduced in the XGBoost 1.1 release and this data format provides..."

Instead of a link with "here" as the text, can you use "You can read more about it in this [article](...) from the RAPIDS AI channel."

more present tense sugg: "...with early stopping, the training ceases as soon as the model starts overfitting..."

Maybe: "The verbose_eval parameter specifies..." Up to you.


Member:
"GPU RAM" is likely technically correct, but most of our docs refer to "GPU memory."



"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "a556f660",
"metadata": {},
"outputs": [],
"source": [
"# Copyright 2022 NVIDIA Corporation. All Rights Reserved.\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# http://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License.\n",
"# =============================================================================="
]
},
{
"cell_type": "markdown",
"id": "697d1452",
"metadata": {},
"source": [
"<img src=\"http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png\" style=\"width: 90px; float: right;\">\n",
"\n",
"# Train a third party model using the Merlin Models API\n",
"\n",
"## Overview\n",
"\n",
"Merlin Models exposes a high-level API that can be used with models from other libraries. For the Merlin Models v0.6.0 release, some `xgboost` and `implicit` models are supported.\n",
"\n",
"Relying on this high level API enables you to iterate more effectively. You do not have to switch between various APIs as you evaluate additional models on your data.\n",
"\n",
"Furthermore, you can use your data represented as a `Dataset` across all your models.\n",
"\n",
"### Learning objectives\n",
"\n",
"- Training with `xgboost`\n",
"- Using the Merlin Models high level API"
]
},
{
"cell_type": "markdown",
"id": "1cccd005",
"metadata": {},
"source": [
"## Preparing the dataset"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "55d93b8b",
"metadata": {},
"outputs": [],
"source": [
"from merlin.core.utils import Distributed\n",
"from merlin.models.xgb import XGBoost\n",
"\n",
"from merlin.datasets.entertainment import get_movielens\n",
"from merlin.schema.tags import Tags"
]
},
{
"cell_type": "markdown",
"id": "cec216e2",
"metadata": {},
"source": [
"We will use the `movielens-100k` dataset. The dataset consists of `userId` and `movieId` pairings. For each record, a user rates a movie and the record includes additional information such as genre of the movie, age of the user, and so on."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "24586409",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-06-30 11:01:39.340989: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:952] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n",
"2022-06-30 11:01:39.341346: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:952] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n",
"2022-06-30 11:01:39.341477: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:952] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n",
"/usr/local/lib/python3.8/dist-packages/cudf/core/dataframe.py:1292: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.\n",
" warnings.warn(\n"
]
}
],
"source": [
"train, valid = get_movielens(variant='ml-100k')"
]
},
{
"cell_type": "markdown",
"id": "4e26cedb",
"metadata": {},
"source": [
"The `get_movielens` function downloads the `movielens-100k` data for us and returns it materialized as a Merlin `Dataset`."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "e2237f8b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(<merlin.io.dataset.Dataset at 0x7ff1dae87fd0>,\n",
" <merlin.io.dataset.Dataset at 0x7ff1dae8deb0>)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train, valid"
]
},
{
"cell_type": "markdown",
"id": "8ed670fc",
"metadata": {},
"source": [
"One of the features that the Merlin Models API supports is tagging. You can tag your data once, during preprocessing, and this information is picked up during later steps such as additional preprocessing steps, training your model, serving the model, and so on.\n",
"\n",
"Here, we will make use of the `Tags.TARGET` tag to identify the objective for our `xgboost` model.\n",
"\n",
"During preprocessing that is performed by the `get_movielens` function, two columns in the dataset are assigned the `Tags.TARGET` tag:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "69274522",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>tags</th>\n",
" <th>dtype</th>\n",
" <th>is_list</th>\n",
" <th>is_ragged</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>rating</td>\n",
" <td>(Tags.REGRESSION, Tags.TARGET)</td>\n",
" <td>int64</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>rating_binary</td>\n",
" <td>(Tags.BINARY_CLASSIFICATION, Tags.TARGET)</td>\n",
" <td>int32</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"[{'name': 'rating', 'tags': {<Tags.REGRESSION: 'regression'>, <Tags.TARGET: 'target'>}, 'properties': {}, 'dtype': dtype('int64'), 'is_list': False, 'is_ragged': False}, {'name': 'rating_binary', 'tags': {<Tags.BINARY_CLASSIFICATION: 'binary_classification'>, <Tags.TARGET: 'target'>}, 'properties': {}, 'dtype': dtype('int32'), 'is_list': False, 'is_ragged': False}]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train.schema.select_by_tag(Tags.TARGET)"
]
},
{
"cell_type": "markdown",
"id": "c6e607b7",
"metadata": {},
"source": [
"You can specify the target to train by passing `target_columns` when you construct the model. We would like to use `rating_binary` as our target, so we could do the following:\n",
"\n",
"`model = XGBoost(target_columns='rating_binary', ...`\n",
"\n",
"However, we can also do something better. Instead of providing this argument to the constructor of our model, we can instead specify the `objective` for our `xgboost` model and have the Merlin Models API do the rest of the work for us.\n",
"\n",
"Later in this example, we will set our booster's objective to `'binary:logistic'`. Given this piece of information, the Merlin Models code can infer that we want to train with a target that has the `Tags.BINARY_CLASSIFICATION` tag assigned to it and there will be nothing else we need to do.\n",
"\n",
"Before we begin to train, let us remove the `title` column from our schema. In the dataset, the title is a string, and unless we preprocess it further, it is not useful for training."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "e8a28f88",
"metadata": {},
"outputs": [],
"source": [
"schema_without_title = train.schema.remove_col('title')"
]
},
{
"cell_type": "markdown",
"id": "aedb65d5",
"metadata": {},
"source": [
"To summarize, we will train an `xgboost` model that predicts the rating of a movie.\n",
"\n",
"For the `rating_binary` column, a value of `1` indicates that the user has given the movie a high rating, and a target of `0` indicates that the user has given the movie a low rating."
]
},
{
"cell_type": "markdown",
"id": "f575b14b",
"metadata": {},
"source": [
"## Training the model"
]
},
{
"cell_type": "markdown",
"id": "59e1d262",
"metadata": {},
"source": [
"Before we begin training, let's define a couple of custom parameters.\n",
"\n",
"Specifying `gpu_hist` as our `tree_method` will run the training on the GPU. Also, it will trigger representing our datasets as `DaskDeviceQuantileDMatrix` instead of the standard `DaskDMatrix`. This class is introduced in the XGBoost 1.1 release and this data format provides more efficient training with lower memory footprint. You can read more about it in this [article](https://medium.com/rapids-ai/new-features-and-optimizations-for-gpus-in-xgboost-1-1-fc153dc029ce) from the RAPIDS AI channel.\n",
"\n",
"Additionally, we will train with early stopping and evaluate the stopping criteria on a validation set. If we were to train without early stopping, `XGBoost` would continue to improve results on the train set until it reached a perfect score. That would result in a low training loss but we would lose any ability to generalize to unseen data. Instead, by training with early stopping, the training ceases as soon as the model starts overfitting to the train set and the results on the validation set start to deteriorate.\n",
"\n",
"The `verbose_eval` parameter specifies how often metrics are reported during training."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "b1804697",
"metadata": {},
"outputs": [],
"source": [
"xgb_booster_params = {\n",
" 'objective':'binary:logistic',\n",
" 'tree_method':'gpu_hist',\n",
"}\n",
"\n",
"xgb_train_params = {\n",
" 'num_boost_round': 100,\n",
" 'verbose_eval': 20,\n",
" 'early_stopping_rounds': 10,\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "4b755e80",
"metadata": {},
"source": [
"We are now ready to train.\n",
"\n",
"In order to facilitate training on data larger than the available GPU memory, the training will leverage Dask. All the complexity of starting a local dask cluster is hidden in the `Distributed` context manager.\n",
"\n",
"Without further ado, let's train."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "8c511fc6",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"distributed.diskutils - INFO - Found stale lock file and directory '/workspace/examples/dask-worker-space/worker-wnzk7dfa', purging\n",
"distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0]\tvalidation_set-logloss:0.65881\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[11:01:42] task [xgboost.dask]:tcp://127.0.0.1:32957 got new rank 0\n",
"[11:01:42] WARNING: ../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[20]\tvalidation_set-logloss:0.61290\n",
"[40]\tvalidation_set-logloss:0.60795\n",
"[60]\tvalidation_set-logloss:0.60568\n",
"[80]\tvalidation_set-logloss:0.60320\n",
"[85]\tvalidation_set-logloss:0.60294\n"
]
}
],
"source": [
"with Distributed():\n",
" model = XGBoost(schema=schema_without_title, **xgb_booster_params)\n",
" model.fit(\n",
" train,\n",
" evals=[(valid, 'validation_set'),],\n",
" **xgb_train_params\n",
" )\n",
" metrics = model.evaluate(valid)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
20 changes: 20 additions & 0 deletions tests/integration/tf/test_ci_07_xgboost_integration.py
@@ -0,0 +1,20 @@
from testbook import testbook

from tests.conftest import REPO_ROOT


@testbook(
REPO_ROOT / "examples/07-Train-an-xgboost-model-using-the-Merlin-Models-API.ipynb",
execute=False,
)
def test_func(tb):
tb.inject(
"""
import os
os.environ["INPUT_DATA_DIR"] = "/raid/data/movielens"
"""
)
tb.execute()
metrics = tb.ref("metrics")
assert metrics.keys() == {"logloss"}
assert metrics["logloss"] < 0.65