Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] add examples #47

Merged
merged 17 commits into from
Nov 24, 2022
Merged

[WIP] add examples #47

merged 17 commits into from
Nov 24, 2022

Conversation

radekosmulski
Copy link
Contributor

This is a WIP branch for adding examples. I have added 01 (and a brief README) and am looking for feedback if I am moving in the right direction here.

Thank you for all your help! 🙂

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@radekosmulski radekosmulski marked this pull request as draft November 14, 2022 04:45
@github-actions
Copy link

Documentation preview

https://nvidia-merlin.github.io/dataloader/review/pr-47

@@ -0,0 +1,632 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not use special formatting for Merlin dataloader in the headline? We haven't done that in any other example. We do not use special formatting in the rest of the notebook.

We do not use NVTabular in the notebook, so we should not confuse the reader in the beginning with it. We should remove it.

The Overview should be something like:

"Merlin dataloader is a library for constructing highly optimized dataloaders to accelerate training pipelines in TensorFlow (Keras) and PyTorch. In this example, we will provide a simple pipeline to train a MatrixFactorization Model in TensorFlow with Merlin dataloader based on the MovieLens dataset.

The core features of Merlin dataloader:

- Accelerate pipelines by upto 10x compared to other dataloaders

- Handles larger than memory dataset by streaming data from disk

- Support for common data formats: CSV, Parquet, Avro

- Distributed training support

Learning Objectives:

- Using Merlin dataloader to train a TensorFlow Keras Model

"


Reply via ReviewNB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes made

@@ -0,0 +1,632 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep the description of MovieLens short. The focus is on the dataloader and Movielens is our "hello world" example.

I would call the headline "Downloading and perparing the dataset".

I would just say that we use Movielens as an example how to use Merlin dataloaders.


Reply via ReviewNB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -0,0 +1,632 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, can we make the example more lean with focus on the dataloader + training?

It seems that we can use merlin.core -> we have a function to download and extract a dataset:

from merlin.core.utils import download_file

Can we use this? See

https://github.com/NVIDIA-Merlin/NVTabular/blob/main/examples/getting-started-movielens/01-Download-Convert.ipynb


Reply via ReviewNB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absolutely, that is a great point, switched to using download_file

@@ -0,0 +1,632 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to install unzip? How does it work in this example:

https://github.com/NVIDIA-Merlin/NVTabular/blob/main/examples/getting-started-movielens/01-Download-Convert.ipynb

I think we should not install libraries in the example. We should make a note, that it requires unzip (maybe download_file function doesnt rely on unzip)


Reply via ReviewNB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

everything works now with just download_file without a need to install anything else 👍

@@ -0,0 +1,632 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use a utils function and do "Categorify" other preprocessing under the hood and store it to parquet. (If we want)


Reply via ReviewNB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

works without any additional preprocessing so in the interest of keeping it simple and not dragging reader's attention from the dataloader would just keep the data as is

@@ -0,0 +1,632 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have all required imports in the beginning of the notebook?


Reply via ReviewNB

@@ -0,0 +1,632 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the dataloader does not support cat_names, cont_names and label_names - is that correct?

It seems it relies on a data schema -> Either, we need to provide an easy tool to define a schema manually or add the 3 parameters back?

Can we make this more generic/explanable. E.g.

label_columns = ['rating']

def process_batch(data, _):

x = {col: data[col] for col in data.keys() if col not in label_columns}

y = data[label_columns]

  return (x, y)

"What Tensorflow expects to see are targets as the 2nd position in the tuple. " -> Let's rewrite it to make it more clear - something like

"TensorFlow Kera's .fit function expects the data to get as a tuple (x, y), with x are the input features and y is the label. We need to provide this information to the dataloader. We can add a custom function to convert the data into the tuple with process_batch"


Reply via ReviewNB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this reads much better now and communicates much more to the reader 🙂👍

@@ -0,0 +1,632 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets remove this. This is pretty confusing :) We would not validate without training?


Reply via ReviewNB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure thing, removed

@@ -0,0 +1,632 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know why we get the warnings/errors?


Reply via ReviewNB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unforutnately not sure why these do come up. I googled for it but the answer I was able to find is that this is something that started to occur with moving to 2.9, but no explanation as to how to get rid of them

@@ -0,0 +1,632 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should summarize the end of the notebook with something like:

## Conclusion
We demonstrated how to train a TensorFlow Keras model with Merlin dataloader. Merlin dataloader can accelerate existing TensorFlow pipelines with minimal code changes. 

Next Steps

Merlin dataloader is part of NVIDIA Merlin, a open source framework for recommender systems. In this example, we looked only on a specific use-case to accelerate existing training pipelines. We provide more libraries to make recommender system pipelines easier:

  • NVTabular is a library to accelerate and scale feature engineering
  • Merlin Models is a library with high-quality implementations of popular recommender systems architectures

The libraries are designed to work closely togethes. We recommend to checkout our exmaples:

  • Getting Started with NVTabular: Process Tabular Data On GPU
  • Getting Started with Merlin Models: Develop a Model for MovieLens

In the example, From ETL to Training RecSys models - NVTabular and Merlin Models integrated example, we explain how the close collaboration works.

Can you add links for the libraries and examples?



Reply via ReviewNB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absolutely, done! 🙂🙌

@@ -0,0 +1,577 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in my understanding, this example is similar to examples/01a-Getting-started.ipynb except of using NVTabular. I think we should reference NVTabular in the end and provide two examples (TensorFlow/PyTorch) with focus on dataloader.


Reply via ReviewNB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed!

@@ -0,0 +1,553 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same Feedback as 01-Getting-Started


Reply via ReviewNB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -0,0 +1,553 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use a minimal example without NVTabular. We can build one utils function to download and process the dataset?


Reply via ReviewNB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the dependency on nvtabular completely, having this mirror the previous tensorflow notebook

@@ -0,0 +1,553 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets call the function evaluate -> calculate_loss could be single batch or full batch?


Reply via ReviewNB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, made the change!

@@ -0,0 +1,553 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do not calculate loss in the beginning, I havent seen that in another pipeline/example?


Reply via ReviewNB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@@ -0,0 +1,553 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar conclusion and next steps as getting started - however, we cannot reference to merlin models. We can reference other examples such as

https://github.com/NVIDIA-Merlin/Merlin/blob/main/examples/getting-started-movielens/03-Training-with-PyTorch.ipynb


Reply via ReviewNB

@bschifferer
Copy link
Contributor

It looks good - I added some comments.

We need a unittest for the final notebooks

@@ -0,0 +1,487 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should add in the headline in each example the framework

Getting Started with Merlin dataloader and PyTorch

Getting Started with Merlin dataloader and TensorFlow


Reply via ReviewNB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made the change 👍

@@ -0,0 +1,487 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a sentence why you print this out (or we remove it :) )


Reply via ReviewNB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that is a good point 🙂 let me remove it

@@ -0,0 +1,487 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not calculate it anymore, do we?


Reply via ReviewNB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, thank you for spotting it, my bad for not removing it, removed it now

@radekosmulski
Copy link
Contributor Author

rerun tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add example for dataloaders for new dataloader repo
2 participants