-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] add examples #47
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
Documentation preview |
examples/01a-Getting-started.ipynb
Outdated
@@ -0,0 +1,632 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not use special formatting for Merlin dataloader in the headline? We haven't done that in any other example. We do not use special formatting in the rest of the notebook.
We do not use NVTabular in the notebook, so we should not confuse the reader in the beginning with it. We should remove it.
The Overview should be something like:
"Merlin dataloader is a library for constructing highly optimized dataloaders to accelerate training pipelines in TensorFlow (Keras) and PyTorch. In this example, we will provide a simple pipeline to train a MatrixFactorization Model in TensorFlow with Merlin dataloader based on the MovieLens dataset.
The core features of Merlin dataloader:
- Accelerate pipelines by upto 10x compared to other dataloaders
- Handles larger than memory dataset by streaming data from disk
- Support for common data formats: CSV, Parquet, Avro
- Distributed training support
Learning Objectives:
- Using Merlin dataloader to train a TensorFlow Keras Model
"
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changes made
examples/01a-Getting-started.ipynb
Outdated
@@ -0,0 +1,632 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep the description of MovieLens short. The focus is on the dataloader and Movielens is our "hello world" example.
I would call the headline "Downloading and perparing the dataset".
I would just say that we use Movielens as an example how to use Merlin dataloaders.
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
examples/01a-Getting-started.ipynb
Outdated
@@ -0,0 +1,632 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, can we make the example more lean with focus on the dataloader + training?
It seems that we can use merlin.core -> we have a function to download and extract a dataset:
from merlin.core.utils import download_file
Can we use this? See
https://github.com/NVIDIA-Merlin/NVTabular/blob/main/examples/getting-started-movielens/01-Download-Convert.ipynb
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
absolutely, that is a great point, switched to using download_file
examples/01a-Getting-started.ipynb
Outdated
@@ -0,0 +1,632 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to install unzip? How does it work in this example:
https://github.com/NVIDIA-Merlin/NVTabular/blob/main/examples/getting-started-movielens/01-Download-Convert.ipynb
I think we should not install libraries in the example. We should make a note, that it requires unzip (maybe download_file function doesnt rely on unzip)
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
everything works now with just download_file
without a need to install anything else 👍
examples/01a-Getting-started.ipynb
Outdated
@@ -0,0 +1,632 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use a utils function and do "Categorify" other preprocessing under the hood and store it to parquet. (If we want)
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
works without any additional preprocessing so in the interest of keeping it simple and not dragging reader's attention from the dataloader would just keep the data as is
examples/01a-Getting-started.ipynb
Outdated
@@ -0,0 +1,632 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
examples/01a-Getting-started.ipynb
Outdated
@@ -0,0 +1,632 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that the dataloader does not support cat_names, cont_names and label_names - is that correct?
It seems it relies on a data schema -> Either, we need to provide an easy tool to define a schema manually or add the 3 parameters back?
Can we make this more generic/explanable. E.g.
label_columns = ['rating']
def process_batch(data, _):
x = {col: data[col] for col in data.keys() if col not in label_columns}
y = data[label_columns]
return (x, y)
"What Tensorflow
expects to see are targets as the 2nd position in the tuple. " -> Let's rewrite it to make it more clear - something like
"TensorFlow Kera's .fit function expects the data to get as a tuple (x, y), with x are the input features and y is the label. We need to provide this information to the dataloader. We can add a custom function to convert the data into the tuple with process_batch
"
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this reads much better now and communicates much more to the reader 🙂👍
examples/01a-Getting-started.ipynb
Outdated
@@ -0,0 +1,632 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets remove this. This is pretty confusing :) We would not validate without training?
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure thing, removed
examples/01a-Getting-started.ipynb
Outdated
@@ -0,0 +1,632 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unforutnately not sure why these do come up. I googled for it but the answer I was able to find is that this is something that started to occur with moving to 2.9, but no explanation as to how to get rid of them
examples/01a-Getting-started.ipynb
Outdated
@@ -0,0 +1,632 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should summarize the end of the notebook with something like:
## Conclusion We demonstrated how to train a TensorFlow Keras model with Merlin dataloader. Merlin dataloader can accelerate existing TensorFlow pipelines with minimal code changes.Next Steps
Merlin dataloader is part of NVIDIA Merlin, a open source framework for recommender systems. In this example, we looked only on a specific use-case to accelerate existing training pipelines. We provide more libraries to make recommender system pipelines easier:
- NVTabular is a library to accelerate and scale feature engineering
- Merlin Models is a library with high-quality implementations of popular recommender systems architectures
The libraries are designed to work closely togethes. We recommend to checkout our exmaples:
- Getting Started with NVTabular: Process Tabular Data On GPU
- Getting Started with Merlin Models: Develop a Model for MovieLens
In the example, From ETL to Training RecSys models - NVTabular and Merlin Models integrated example, we explain how the close collaboration works.
Can you add links for the libraries and examples?
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
absolutely, done! 🙂🙌
@@ -0,0 +1,577 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in my understanding, this example is similar to examples/01a-Getting-started.ipynb except of using NVTabular. I think we should reference NVTabular in the end and provide two examples (TensorFlow/PyTorch) with focus on dataloader.
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed!
@@ -0,0 +1,553 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -0,0 +1,553 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use a minimal example without NVTabular. We can build one utils function to download and process the dataset?
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the dependency on nvtabular completely, having this mirror the previous tensorflow notebook
@@ -0,0 +1,553 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets call the function evaluate -> calculate_loss could be single batch or full batch?
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, made the change!
@@ -0,0 +1,553 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do not calculate loss in the beginning, I havent seen that in another pipeline/example?
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
@@ -0,0 +1,553 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar conclusion and next steps as getting started - however, we cannot reference to merlin models. We can reference other examples such as
https://github.com/NVIDIA-Merlin/Merlin/blob/main/examples/getting-started-movielens/03-Training-with-PyTorch.ipynb
Reply via ReviewNB
It looks good - I added some comments. We need a unittest for the final notebooks |
@@ -0,0 +1,487 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should add in the headline in each example the framework
Getting Started with Merlin dataloader and PyTorch
Getting Started with Merlin dataloader and TensorFlow
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made the change 👍
@@ -0,0 +1,487 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that is a good point 🙂 let me remove it
@@ -0,0 +1,487 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, thank you for spotting it, my bad for not removing it, removed it now
3fb279e
to
d9e0f2e
Compare
rerun tests |
This is a WIP branch for adding examples. I have added 01 (and a brief README) and am looking for feedback if I am moving in the right direction here.
Thank you for all your help! 🙂