Commit

update
radekosmulski committed Jun 30, 2022
1 parent 90c539b commit ffb2dfa
Showing 1 changed file with 17 additions and 17 deletions.
@@ -34,15 +34,15 @@
"\n",
"## Overview\n",
"\n",
"Merlin Models exposes a high level API that can be used with models from other libraries. Currently select `xgboost` and `implicit` models are supported.\n",
"Merlin Models exposes a high-level API that can be used with models from other libraries. For the Merlin Models v0.6.0 release, some `xgboost` and `implicit` models are supported.\n",
"\n",
"Relying on this high level API allows you to iterate more effectively. You do not have to switch between various APIs as you try out additional models on your data.\n",
"Relying on this high-level API enables you to iterate more effectively. You do not have to switch between various APIs as you evaluate additional models on your data.\n",
"\n",
"Furthermore, you can use your data represented as a `Dataset` across all your models.\n",
"\n",
"### Learning objectives\n",
"\n",
"- Training with xgboost\n",
"- Training with `xgboost`\n",
"- Using the Merlin Models high level API"
]
},
@@ -73,7 +73,7 @@
"id": "cec216e2",
"metadata": {},
"source": [
"We will use the `movielens-100k` dataset. It consists of `userId` and `movieId` pairings where a user has rated a movie along with some additional information (such as genre of the movie, age of the user, etc)."
"We will use the `movielens-100k` dataset. The dataset consists of `userId` and `movieId` pairings. For each record, a user rates a movie and the record includes additional information such as genre of the movie, age of the user, and so on."
]
},
{
@@ -103,7 +103,7 @@
"id": "4e26cedb",
"metadata": {},
"source": [
"The `get_movielens` function downloaded the `movielens-100k` data for us and returned it materialized as a Merlin Dataset."
"The `get_movielens` function downloads the `movielens-100k` data for us and returns it materialized as a Merlin `Dataset`."
]
},
{
@@ -133,11 +133,11 @@
"id": "8ed670fc",
"metadata": {},
"source": [
"One of the features that the Merlin Model API supports is tagging. You can tag your data once, during preprocessing, and this information can be picked up by downstream functionality (additional preprocessing steps, training your model or serving it).\n",
"One of the features that the Merlin Models API supports is tagging. You can tag your data once, during preprocessing, and this information is picked up during later steps such as additional preprocessing steps, training your model, serving the model, and so on.\n",
"\n",
"Here, we will make use of the `Tags.TARGET` to identify the objective for our `xgboost` model.\n",
"\n",
"During preprocessing, two columns in the dataset were assigned the `Tags.TARGET` tag."
"During preprocessing that is performed by the `get_movielens` function, two columns in the dataset are assigned the `Tags.TARGET` tag:"
]
},
{
@@ -213,15 +213,15 @@
"id": "c6e607b7",
"metadata": {},
"source": [
"You can specify the target to train with by passing `target_columns` when constructing the model. We would like to train with `rating_binary` as our target, so we could do the following\n",
"You can specify the target to train on by passing `target_columns` when you construct the model. We would like to use `rating_binary` as our target, so we could do the following:\n",
"\n",
"`model = XGBoost(target_columns='rating_binary', ...`\n",
"\n",
"However, we can also do something else. Instead of providing this argument to the constructor of our model, we can instead specify the `objective` for our xgboost model and have the Merlin Models API do the rest of the work for us.\n",
"However, we can also do something better. Instead of providing this argument to the constructor of our model, we can specify the `objective` for our `xgboost` model and have the Merlin Models API do the rest of the work for us.\n",
"\n",
"Later in this example, we will be setting our booster's objective to `'binary:logistic'`. Given this piece of information, the API will be able to infer we would like to train with a target that has the `Tags.BINARY_CLASSIFICATION` tag assigned to it and there will be nothing else we will need to do.\n",
"Later in this example, we will set our booster's objective to `'binary:logistic'`. Given this piece of information, the Merlin Models code can infer that we want to train with a target that has the `Tags.BINARY_CLASSIFICATION` tag assigned to it, and there is nothing else we need to do.\n",
"\n",
"Before we begin to train, let us remove the `title` column from our schema. In the dataset, the title is a string, and unless we preprocess it further, it would not be useful in training."
"Before we begin to train, let us remove the `title` column from our schema. In the dataset, the title is a string, and unless we preprocess it further, it is not useful in training."
]
},
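Reviewer note: the inference described in this cell (objective in, target column out) can be pictured with a small standalone sketch. This is not the Merlin Models implementation. The `TAG_FOR_OBJECTIVE_PREFIX` table, the `infer_target_columns` helper, the plain-dict schema, and the `rating` column name are all hypothetical and exist only for illustration; tags are modeled as plain strings rather than `Tags` members.

```python
# Hypothetical illustration only: this is NOT the Merlin Models source code.
from typing import Dict, List, Set

# Assumed mapping from an xgboost objective prefix to a tag name.
TAG_FOR_OBJECTIVE_PREFIX = {
    "binary:": "BINARY_CLASSIFICATION",
    "reg:": "REGRESSION",
}


def infer_target_columns(objective: str, column_tags: Dict[str, Set[str]]) -> List[str]:
    """Return the target columns whose tags match the given objective."""
    for prefix, tag in TAG_FOR_OBJECTIVE_PREFIX.items():
        if objective.startswith(prefix):
            # Keep only columns tagged as a target AND with the matching task tag.
            return sorted(
                name
                for name, tags in column_tags.items()
                if "TARGET" in tags and tag in tags
            )
    raise ValueError(f"No tag is registered for objective {objective!r}")


# Toy schema: column name -> set of tag names (names are assumptions).
column_tags = {
    "userId": {"USER_ID", "CATEGORICAL"},
    "movieId": {"ITEM_ID", "CATEGORICAL"},
    "rating": {"TARGET", "REGRESSION"},
    "rating_binary": {"TARGET", "BINARY_CLASSIFICATION"},
}

print(infer_target_columns("binary:logistic", column_tags))
```

With this toy schema, the `'binary:logistic'` objective selects only `rating_binary`, which is the behavior the notebook text attributes to the tag-based lookup.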
{
@@ -239,9 +239,9 @@
"id": "aedb65d5",
"metadata": {},
"source": [
"To summarize, we will train an xgboost model that will predict the rating of a movie.\n",
"To summarize, we will train an `xgboost` model that predicts the rating of a movie.\n",
"\n",
"`rating_binary` of 1 indicates that the user has given the movie a high rating, and a target of 0 indicates that the user has given the movie a low rating."
"For the `rating_binary` column, a value of `1` indicates that the user has given the movie a high rating, and a value of `0` indicates that the user has given the movie a low rating."
]
},
{
@@ -259,11 +259,11 @@
"source": [
"Before we begin training, let's define a couple of custom parameters.\n",
"\n",
"Specifying `'gpu_hist'`as our `'tree_method'` will run the training on the GPU. Also, it will trigger representing our datasets as `DaskDeviceQuantileDMatrix` instead of the standard `DaskDMatrix`. This is a recently introduced format that can lead to a more efficient training with lower memory footprint. You can read more about it [here](https://medium.com/rapids-ai/new-features-and-optimizations-for-gpus-in-xgboost-1-1-fc153dc029ce).\n",
"Specifying `gpu_hist` as our `tree_method` will run the training on the GPU. Also, it will trigger representing our datasets as `DaskDeviceQuantileDMatrix` instead of the standard `DaskDMatrix`. This class was introduced in the XGBoost 1.1 release, and the data format provides more efficient training with a lower memory footprint. You can read more about it in this [article](https://medium.com/rapids-ai/new-features-and-optimizations-for-gpus-in-xgboost-1-1-fc153dc029ce) from the RAPIDS AI channel.\n",
"\n",
"Additionally, we will train with early stopping and evaluate the stopping criteria on a validation set. If we were to train without early stopping, XGboost would continue to improve results on the train set until it would reach a perfect score. That would result in a low training loss but we would lose any ability to generalize to unseen data. Instead, by training with early stopping, the training will cease as soon as the model will start overfitting to the train set and the results on the validation set will start to deteriorate.\n",
"Additionally, we will train with early stopping and evaluate the stopping criteria on a validation set. If we were to train without early stopping, XGBoost would continue to improve results on the train set until it reached a perfect score. That would result in a low training loss, but we would lose any ability to generalize to unseen data. Instead, by training with early stopping, the training ceases as soon as the model starts overfitting to the train set and the results on the validation set start to deteriorate.\n",
"\n",
"The `verbose_eval` specifies how often metrics are reported during training."
"The `verbose_eval` parameter specifies how often metrics are reported during training."
]
},
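Reviewer note: the early stopping rule described in this cell can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not xgboost's own code; the `run_with_early_stopping` helper and the loss values are made up for the example. The `patience` argument plays the role that `early_stopping_rounds` plays in xgboost.

```python
# Simplified illustration of early stopping; not xgboost's implementation.
from typing import List, Tuple


def run_with_early_stopping(val_losses: List[float], patience: int) -> Tuple[int, int]:
    """Return (best_round, stop_round) for per-round validation losses.

    Training stops once `patience` rounds have passed since the best
    validation loss was last improved.
    """
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best score on the validation set: remember where it happened.
            best_loss = loss
            best_round = round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds: stop training here.
            return best_round, round_idx
    # The losses ran out before the stopping rule triggered.
    return best_round, len(val_losses) - 1


# Hypothetical validation losses: they improve, then drift upward (overfitting).
losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.59]
best_round, stop_round = run_with_early_stopping(losses, patience=3)
print(best_round, stop_round)
```

The training loss is deliberately absent from the sketch: only the validation metric decides when to stop, which is why the validation set has to be supplied at training time.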
{
@@ -292,7 +292,7 @@
"source": [
"We are now ready to train.\n",
"\n",
"In order to facilitate training on data larger than the available GPU RAM, the training will leverage Dask. All the complexity of starting a local dask cluster is hidden in the `Distributed` context manager.\n",
"To facilitate training on data larger than the available GPU memory, the training will leverage Dask. All the complexity of starting a local Dask cluster is hidden in the `Distributed` context manager.\n",
"\n",
"Without further ado, let's train."
]