[WIP] add examples #47

Merged

radekosmulski merged 17 commits into main from add_examples

Nov 24, 2022

Contributor

radekosmulski commented Nov 14, 2022

This is a WIP branch for adding examples. I have added 01 (and a brief README) and am looking for feedback if I am moving in the right direction here.

Thank you for all your help! 🙂

review-notebook-app bot commented Nov 14, 2022

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

radekosmulski marked this pull request as draft

November 14, 2022 04:45

github-actions bot commented Nov 14, 2022

Documentation preview

https://nvidia-merlin.github.io/dataloader/review/pr-47

bschifferer reviewed

examples/01a-Getting-started.ipynb Outdated

		@@ -0,0 +1,632 @@
		{

Contributor

bschifferer Nov 16, 2022

Can we not use special formatting for Merlin dataloader in the headline? We haven't done that in any other example. We do not use special formatting in the rest of the notebook.

We do not use NVTabular in the notebook, so we should not confuse the reader in the beginning with it. We should remove it.

The Overview should be something like:

"Merlin dataloader is a library for constructing highly optimized dataloaders to accelerate training pipelines in TensorFlow (Keras) and PyTorch. In this example, we will provide a simple pipeline to train a MatrixFactorization Model in TensorFlow with Merlin dataloader based on the MovieLens dataset.

The core features of Merlin dataloader:

- Accelerate pipelines by up to 10x compared to other dataloaders

- Handle larger-than-memory datasets by streaming data from disk

- Support for common data formats: CSV, Parquet, Avro

- Distributed training support

Learning Objectives:

- Using Merlin dataloader to train a TensorFlow Keras Model

"

Contributor Author

radekosmulski Nov 17, 2022

changes made

examples/01a-Getting-started.ipynb Outdated

		@@ -0,0 +1,632 @@
		{

Contributor

bschifferer Nov 16, 2022

I think we should keep the description of MovieLens short. The focus is on the dataloader and Movielens is our "hello world" example.

I would call the headline "Downloading and preparing the dataset".

I would just say that we use MovieLens as an example of how to use Merlin dataloader.

Contributor Author

radekosmulski Nov 17, 2022

done

examples/01a-Getting-started.ipynb Outdated

		@@ -0,0 +1,632 @@
		{

Contributor

bschifferer Nov 16, 2022

In general, can we make the example more lean with focus on the dataloader + training?

It seems that we can use merlin.core -> we have a function to download and extract a dataset:

from merlin.core.utils import download_file

Can we use this? See

https://github.com/NVIDIA-Merlin/NVTabular/blob/main/examples/getting-started-movielens/01-Download-Convert.ipynb

Contributor Author

radekosmulski Nov 17, 2022

absolutely, that is a great point, switched to using download_file

examples/01a-Getting-started.ipynb Outdated

		@@ -0,0 +1,632 @@
		{

Contributor

bschifferer Nov 16, 2022

Do we need to install unzip? How does it work in this example:

https://github.com/NVIDIA-Merlin/NVTabular/blob/main/examples/getting-started-movielens/01-Download-Convert.ipynb

I think we should not install libraries in the example. We should add a note that it requires unzip (or maybe the download_file function doesn't rely on unzip).

Contributor Author

radekosmulski Nov 17, 2022

everything works now with just download_file without a need to install anything else 👍
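
To illustrate why a download helper can make a separate unzip install unnecessary, here is a minimal standard-library sketch of "download, then extract". It is not the `merlin.core.utils.download_file` implementation; `fetch_and_extract` and the toy archive are hypothetical names used only for this illustration, and the demonstration fetches a local `file://` URL so it runs offline.

```python
import tempfile
import urllib.request
import zipfile
from pathlib import Path


def fetch_and_extract(url, local_path):
    """Download `url` to `local_path`; if the result is a zip archive,
    extract it next to the download and return the member names."""
    local_path.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, local_path)
    if not zipfile.is_zipfile(local_path):
        return []
    with zipfile.ZipFile(local_path) as archive:
        archive.extractall(local_path.parent)
        return archive.namelist()


# Offline demonstration: build a tiny archive and fetch it via a file:// URL.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source" / "toy.zip"
    source.parent.mkdir()
    with zipfile.ZipFile(source, "w") as archive:
        archive.writestr("ratings.csv", "userId,movieId,rating\n1,10,4.0\n")

    members = fetch_and_extract(source.as_uri(), Path(tmp) / "data" / "toy.zip")
    extracted_text = (Path(tmp) / "data" / "ratings.csv").read_text()
```

Because extraction goes through Python's `zipfile` module, nothing has to be installed on the system for this step.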

examples/01a-Getting-started.ipynb Outdated

		@@ -0,0 +1,632 @@
		{

Contributor

bschifferer Nov 16, 2022

We can use a utils function that does "Categorify" and other preprocessing under the hood and stores the result to parquet (if we want).

Contributor Author

radekosmulski Nov 17, 2022

It works without any additional preprocessing, so in the interest of keeping it simple and not drawing the reader's attention away from the dataloader, I would just keep the data as is.

examples/01a-Getting-started.ipynb Outdated

		@@ -0,0 +1,632 @@
		{

Contributor

bschifferer Nov 16, 2022

Can we have all required imports in the beginning of the notebook?

examples/01a-Getting-started.ipynb Outdated

		@@ -0,0 +1,632 @@
		{

Contributor

bschifferer Nov 16, 2022

It seems that the dataloader does not support cat_names, cont_names and label_names - is that correct?

It seems it relies on a data schema -> Either, we need to provide an easy tool to define a schema manually or add the 3 parameters back?

Can we make this more generic/explainable? E.g.

    label_columns = ['rating']

    def process_batch(data, _):
        x = {col: data[col] for col in data.keys() if col not in label_columns}
        y = data[label_columns]
        return (x, y)

"What Tensorflow expects to see are targets as the 2nd position in the tuple." -> Let's rewrite it to make it clearer, something like:

"TensorFlow Keras' .fit function expects the data as a tuple (x, y), where x are the input features and y is the label. We need to provide this information to the dataloader. We can add a custom function, process_batch, to convert the data into this tuple."

Contributor Author

radekosmulski Nov 17, 2022

this reads much better now and communicates much more to the reader 🙂👍
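
A runnable sketch of the (x, y) split discussed in this thread, assuming each batch arrives as a plain dict mapping column names to values. In the notebook the values would be tensors rather than lists, and how the function is attached to the dataloader is not shown in this thread. Indexing a plain dict with a list, as in `data[label_columns]`, raises a `TypeError`, so the sketch selects label columns by name instead.

```python
label_columns = ["rating"]


def process_batch(data, _):
    """Split one batch (dict of column name -> values) into (x, y)."""
    # Everything that is not a label column is an input feature.
    x = {col: values for col, values in data.items() if col not in label_columns}
    # With one label column, y is that column's values; with several,
    # returning a dict of label columns is one reasonable choice.
    if len(label_columns) == 1:
        y = data[label_columns[0]]
    else:
        y = {col: data[col] for col in label_columns}
    return x, y


# Toy batch standing in for what a dataloader might yield.
batch = {"userId": [1, 2], "movieId": [10, 20], "rating": [4.0, 3.5]}
x, y = process_batch(batch, None)
```

The second argument is ignored here, mirroring the `_` placeholder in the reviewer's suggestion.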

examples/01a-Getting-started.ipynb Outdated

		@@ -0,0 +1,632 @@
		{

Contributor

bschifferer Nov 16, 2022

Let's remove this. This is pretty confusing :) We would not validate without training, would we?

Contributor Author

radekosmulski Nov 17, 2022

sure thing, removed

examples/01a-Getting-started.ipynb Outdated

		@@ -0,0 +1,632 @@
		{

Contributor

bschifferer Nov 16, 2022

Do you know why we get the warnings/errors?

Contributor Author

radekosmulski Nov 17, 2022

I am unfortunately not sure why these come up. I googled for it, but the only answer I was able to find is that this started to occur with the move to 2.9, with no explanation of how to get rid of them.

examples/01a-Getting-started.ipynb Outdated

		@@ -0,0 +1,632 @@
		{

Contributor

bschifferer Nov 16, 2022

We should summarize the end of the notebook with something like:

## Conclusion

We demonstrated how to train a TensorFlow Keras model with Merlin dataloader. Merlin dataloader can accelerate existing TensorFlow pipelines with minimal code changes.

## Next Steps

Merlin dataloader is part of NVIDIA Merlin, an open source framework for recommender systems. In this example, we looked only at a specific use case: accelerating existing training pipelines. We provide more libraries to make recommender system pipelines easier:

- NVTabular is a library to accelerate and scale feature engineering
- Merlin Models is a library with high-quality implementations of popular recommender systems architectures

The libraries are designed to work closely together. We recommend checking out our examples:

- Getting Started with NVTabular: Process Tabular Data On GPU
- Getting Started with Merlin Models: Develop a Model for MovieLens

In the example "From ETL to Training RecSys models - NVTabular and Merlin Models integrated example", we explain how the close collaboration works.

Can you add links for the libraries and examples?

Contributor Author

radekosmulski Nov 17, 2022

absolutely, done! 🙂🙌

bschifferer reviewed

examples/01b-Getting-started-Tensorflow.ipynb Outdated

		@@ -0,0 +1,577 @@
		{

Contributor

bschifferer Nov 16, 2022

In my understanding, this example is similar to examples/01a-Getting-started.ipynb except for its use of NVTabular. I think we should reference NVTabular at the end and provide two examples (TensorFlow/PyTorch) with a focus on the dataloader.

Contributor Author

radekosmulski Nov 17, 2022

agreed!

examples/01c-Getting-started-Pytorch.ipynb Outdated

		@@ -0,0 +1,553 @@
		{

Contributor

bschifferer Nov 16, 2022

Same Feedback as 01-Getting-Started

Contributor Author

radekosmulski Nov 17, 2022

done

examples/01c-Getting-started-Pytorch.ipynb Outdated

		@@ -0,0 +1,553 @@
		{

Contributor

bschifferer Nov 16, 2022

We should use a minimal example without NVTabular. We can build one utils function to download and process the dataset?

Contributor Author

radekosmulski Nov 17, 2022

removed the dependency on nvtabular completely, having this mirror the previous tensorflow notebook

examples/01c-Getting-started-Pytorch.ipynb Outdated

		@@ -0,0 +1,553 @@
		{

Contributor

bschifferer Nov 16, 2022

Let's call the function evaluate -> calculate_loss could be read as either single batch or full batch?

Contributor Author

radekosmulski Nov 17, 2022

good point, made the change!
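
The naming distinction can be sketched in framework-free Python: a per-batch loss versus an `evaluate` that aggregates over the whole dataset. This is only an illustration under simple assumptions (scalar features, mean squared error, a `predict` callable); the notebook's actual PyTorch code is not shown in this thread.

```python
def calculate_loss(predict, features, targets):
    """Mean squared error for a single batch."""
    predictions = [predict(f) for f in features]
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def evaluate(predict, batches):
    """Mean squared error over every batch, weighted by batch size."""
    total, count = 0.0, 0
    for features, targets in batches:
        # Undo the per-batch mean so batches of different sizes weigh correctly.
        total += calculate_loss(predict, features, targets) * len(targets)
        count += len(targets)
    return total / count


# Two toy batches of (features, targets); the "model" doubles its input.
batches = [([1.0, 2.0], [2.0, 4.0]), ([3.0], [9.0])]
full_loss = evaluate(lambda v: 2 * v, batches)
```

Weighting by batch size matters when the last batch is smaller than the others; a plain average of per-batch losses would overweight it.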

examples/01c-Getting-started-Pytorch.ipynb Outdated

		@@ -0,0 +1,553 @@
		{

Contributor

bschifferer Nov 16, 2022

Let's not calculate the loss at the beginning. I haven't seen that in any other pipeline/example.

Contributor Author

radekosmulski Nov 17, 2022

removed

examples/01c-Getting-started-Pytorch.ipynb Outdated

		@@ -0,0 +1,553 @@
		{

Contributor

bschifferer Nov 16, 2022

Similar conclusion and next steps as in getting started. However, we cannot reference Merlin Models. We can reference other examples such as

https://github.com/NVIDIA-Merlin/Merlin/blob/main/examples/getting-started-movielens/03-Training-with-PyTorch.ipynb

Contributor

bschifferer commented Nov 16, 2022

It looks good - I added some comments.

We need a unit test for the final notebooks.
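
As a rough idea of what a very light notebook check could look like, here is a standard-library sketch that only verifies every code cell parses as Python. It is not the test added in this PR, which is not shown here; `check_notebook_compiles` is a hypothetical helper, it does not execute anything, and it would reject cells containing IPython magics such as `%` or `!` lines. A real test would more likely execute the notebook end to end.

```python
import json
import tempfile
from pathlib import Path


def check_notebook_compiles(path):
    """Return the number of code cells in the notebook at `path`,
    raising SyntaxError if any of them is not valid Python."""
    notebook = json.loads(Path(path).read_text())
    checked = 0
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell["source"]
        # The notebook format stores source as a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        compile(source, f"{path}:cell{index}", "exec")
        checked += 1
    return checked


# Toy notebook with one markdown cell and one code cell.
toy = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["x = 1\n", "y = x + 1\n"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy.ipynb"
    toy_path.write_text(json.dumps(toy))
    n_checked = check_notebook_compiles(toy_path)
```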

bschifferer added the examples label

bschifferer reviewed

examples/01b-Getting-started-Pytorch.ipynb

		@@ -0,0 +1,487 @@
		{

Contributor

bschifferer Nov 17, 2022

Maybe we should add the framework to the headline of each example:

Getting Started with Merlin dataloader and PyTorch

Getting Started with Merlin dataloader and TensorFlow

Contributor Author

radekosmulski Nov 22, 2022

made the change 👍

examples/01b-Getting-started-Pytorch.ipynb

		@@ -0,0 +1,487 @@
		{

Contributor

bschifferer Nov 17, 2022

Maybe add a sentence why you print this out (or we remove it :) )

Contributor Author

radekosmulski Nov 22, 2022

yes, that is a good point 🙂 let me remove it

examples/01b-Getting-started-Pytorch.ipynb

		@@ -0,0 +1,487 @@
		{

Contributor

bschifferer Nov 17, 2022

We do not calculate it anymore, do we?

Contributor Author

radekosmulski Nov 22, 2022

yes, thank you for spotting it, my bad for not removing it, removed it now

bschifferer mentioned this pull request

How to use dataloader without NVTabular? #50

Open

radekosmulski added 12 commits

November 22, 2022 23:33


          add draft of 01

c472c17


          Update README.md

e8c480f


          add tf version

5bd1b4d


          clean-up 01b

e9bca67


          update

c873752


          update

0c6caed


          implement review changes

d11e2b4


          udpate

f2c966e


          udpate

5ed1da5


          update

4ab1447


          add test for examples

15d95d2


          implement review comments

d9e0f2e

radekosmulski force-pushed the add_examples branch from 3fb279e to d9e0f2e Compare

November 22, 2022 13:33

radekosmulski marked this pull request as ready for review

November 22, 2022 13:34

radekosmulski added 4 commits

November 22, 2022 23:42


          fix flake8 issues

6025db8


          fix import sorting issue

824e03c


          update


          Merge branch 'main' into add_examples

a2aaed8

Contributor Author

radekosmulski commented Nov 23, 2022

rerun tests

bschifferer approved these changes


          Merge branch 'main' into add_examples

d89d156

radekosmulski merged commit 77a759d into main

radekosmulski deleted the add_examples branch

November 24, 2022 04:50

oliverholworthy mentioned this pull request

Update pytorch getting started test and setup examples tests to run in CI #63

Merged

bschifferer linked an issue

that may be closed by this pull request

Add example for dataloaders for new dataloader repo #12

Closed
