Add multigpu JAX tutorial #4956
Conversation
Signed-off-by: Albert Wolant <awolant@nvidia.com>
!build
CI MESSAGE: [9121124]: BUILD STARTED
CI MESSAGE: [9121124]: BUILD FAILED
Signed-off-by: Albert Wolant <awolant@nvidia.com>
!build
CI MESSAGE: [9175574]: BUILD STARTED
CI MESSAGE: [9175574]: BUILD PASSED
@@ -0,0 +1,304 @@
{ |
Here we show how to run training from "Training neural network with DALI and JAX" ~~using~~ on multiple GPUs.
If you haven't already done so, it is best to start with the single GPU example to better understand the following content.
Done
@@ -0,0 +1,304 @@
{ |
(...) creating a pipeline definition function.
Note the new arguments passed to fn.readers.caffe2
(...) used to ~~controll~~ control sharding:
(...) sets the total number of shards
Also, (<-- comma) the device_id argument was removed from the decorator (not entirely sure about this one)
(...) ~~particualr~~ particular
batch_size_per_gpu as batch_size // jax.device_count()
^^^^ don't we want to round up?
Done.
When it comes to batch_size_per_gpu: for this test I set it up with batch_size equal to 200, so it is divisible by the common numbers of possible GPUs (2, 4, 8).
I wanted to keep this code as simple as possible.
I added a note to this part explaining that this may need some adjustment to make sure that you use all samples in every epoch.
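For reference, here is a minimal sketch of what the sharded pipeline definition discussed in this thread might look like. The function name, the preprocessing steps, and the data_path parameter are illustrative assumptions, not the tutorial's actual code; num_shards, shard_id, and the floor division for batch_size_per_gpu follow the discussion above.

```python
import jax
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali import pipeline_def

batch_size = 200
# Floor division assumes batch_size is divisible by the device count
# (200 works for 2, 4, or 8 GPUs). Round up with -(-batch_size // n)
# if dropping samples is a concern.
batch_size_per_gpu = batch_size // jax.device_count()

@pipeline_def(batch_size=batch_size_per_gpu, num_threads=4, seed=0)
def mnist_sharded_pipeline(data_path, num_shards, shard_id):
    # num_shards sets the total number of shards; shard_id selects
    # which shard this pipeline instance starts from.
    jpegs, labels = fn.readers.caffe2(
        path=data_path,
        random_shuffle=True,
        num_shards=num_shards,
        shard_id=shard_id,
        name="mnist_caffe2_reader",
    )
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.GRAY)
    images = fn.crop_mirror_normalize(
        images, dtype=types.FLOAT, std=[255.0], output_layout="CHW"
    )
    return images, labels
```

Note that device_id no longer appears in the decorator; it is passed per instance when the pipeline objects are created, one per GPU.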
@@ -0,0 +1,304 @@
{ |
Each of them will start the preprocessing from a different shard
Does it mean it will then proceed to the next shard? If they process only items belonging to a particular shard, then a better wording would be:
Each of them will process a different shard of the dataset
~~Similar as~~ Like in the single GPU example
or
~~Similar as in~~ Similarly to the single GPU example
(...). It will ~~encapsule~~ encapsulate (...) return a dictionary of JAX arrays (...)
Does it mean it will then proceed to the next shard? If they process only items belonging to a particular shard
This is controlled by the stick_to_shard argument. By default it is False, so in the next epoch the pipeline will move on to the next shard. I added a sentence with the information about this argument.
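To illustrate the default behavior described above, here is a sketch of the reader call with stick_to_shard spelled out explicitly; the surrounding arguments reuse the illustrative names from the earlier sketch, while stick_to_shard itself is a real argument of DALI readers.

```python
jpegs, labels = fn.readers.caffe2(
    path=data_path,
    random_shuffle=True,
    num_shards=num_shards,
    shard_id=shard_id,
    # False (the default): at each epoch boundary the reader advances to
    # the next shard. True: the reader stays on its initial shard.
    stick_to_shard=False,
    name="mnist_caffe2_reader",
)
```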
Rest done
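Putting this thread's points together, here is a hedged sketch of how the per-GPU pipelines might be created and wrapped in an iterator that returns a dictionary of JAX arrays. DALIGenericIterator comes from DALI's JAX plugin; the pipeline function, dataset path, and output names are the same illustrative assumptions as in the earlier sketches, so this is not necessarily the tutorial's exact code.

```python
import jax
from nvidia.dali.plugin.jax import DALIGenericIterator

data_path = "/data/MNIST/training"  # hypothetical dataset location

# One pipeline per GPU; device_id is passed at instantiation rather than
# in the decorator, and each instance starts from a different shard.
pipelines = [
    mnist_sharded_pipeline(
        data_path, num_shards=jax.device_count(), shard_id=i, device_id=i
    )
    for i in range(jax.device_count())
]

# The iterator encapsulates all pipelines; each step yields a dictionary
# of JAX arrays keyed by output_map.
iterator = DALIGenericIterator(
    pipelines,
    output_map=["images", "labels"],
    reader_name="mnist_caffe2_reader",
)

for batch in iterator:
    images, labels = batch["images"], batch["labels"]
```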
Signed-off-by: Albert Wolant <awolant@nvidia.com>
!build
CI MESSAGE: [9179529]: BUILD STARTED |
CI MESSAGE: [9179529]: BUILD FAILED |
CI MESSAGE: [9179529]: BUILD PASSED |
Adds a tutorial on how to train a neural network with DALI and JAX on multiple GPUs. Signed-off-by: Albert Wolant <awolant@nvidia.com>
Category:
New feature
Description:
Adds a tutorial on how to train a neural network with DALI and JAX on multiple GPUs.
Additional information:
Affected modules and functionalities:
JAX docs.
Key points relevant for the review:
Is this understandable? Spelling, grammar?
Tests:
Checklist
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: 3553