Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TF dataset tests rework #2539

Merged
merged 6 commits into from Jan 18, 2021
Merged

TF dataset tests rework #2539

merged 6 commits into from Jan 18, 2021

Conversation

awolant
Copy link
Contributor

@awolant awolant commented Dec 10, 2020

Signed-off-by: Albert Wolant awolant@nvidia.com

Why we need this PR?

Pick one, remove the rest

  • Refactoring to improve TF dataset tests.

What happened in this PR?

Fill relevant points, put NA otherwise. Replace anything inside []

  • What solution was applied:
    Added tests for multi GPU with mirrored strategy, added eager mode tests, fixed device mismatch error.
  • Affected modules and functionalities:
    TF dataset tests
  • Key points relevant for the review:
    Tests, docs.
  • Documentation (including examples):
    Updated docs are part of this PR

JIRA TASK: [Use DALI-1757]

Signed-off-by: Albert Wolant <awolant@nvidia.com>
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@awolant
Copy link
Contributor Author

awolant commented Dec 10, 2020

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1884587]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1884587]: BUILD PASSED

Signed-off-by: Albert Wolant <awolant@nvidia.com>
@awolant
Copy link
Contributor Author

awolant commented Dec 10, 2020

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1889921]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1889921]: BUILD PASSED

Signed-off-by: Albert Wolant <awolant@nvidia.com>
@awolant
Copy link
Contributor Author

awolant commented Dec 11, 2020

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1892434]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1892434]: BUILD PASSED

1 similar comment
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1892434]: BUILD PASSED

@@ -8,7 +8,7 @@
"\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make the training distributed to multiple GPUs, we use tf.distribute.MirroredStrategy. Comma?


Reply via ReviewNB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -8,7 +8,7 @@
"\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

everyrhing


Reply via ReviewNB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 263 to 268
std::stringstream msg;
msg << "TF device and DALI device mismatch. TF device: ";
msg << (dataset()->device_type_ == device_type_t::CPU ? "CPU" : "GPU");
msg << ", DALI device: ";
msg << (dali_device_type == device_type_t::CPU ? "CPU" : "GPU");
msg << " for output " << i;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use make_string? Not sure how this would affect readability but that's more of a DALI way of doing this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Signed-off-by: Albert Wolant <awolant@nvidia.com>
dataset_pipeline = PythonOperatorPipeline()
pipeline = Pipeline(1, 1, 0, 0)
with pipeline:
output = fn.python_function(function=lambda: np.zeros((3, 3, 3)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not clear to me what is the expected error. Maybe the name of the test case should say it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PythonFunction is not allowed with TF for now. Changed the name of the test to include that inforamation.

images = tf.reshape(images, [BATCH_SIZE, IMAGE_SIZE*IMAGE_SIZE])
labels = tf.reshape(
tf.one_hot(labels, NUM_CLASSES),
images = tf_v1.reshape(images, [BATCH_SIZE, IMAGE_SIZE*IMAGE_SIZE])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of curiosity: Why do we need to reshape? Aren't the outputs of the dataset already shaped?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, no reason for this. Done

test_data_root = os.environ['DALI_EXTRA_PATH']
file_root = os.path.join(test_data_root, 'db', 'coco_dummy', 'images')
annotations_file = os.path.join(
test_data_root, 'db', 'coco_dummy', 'instances.json')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: indent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if (dali_device_type != dataset()->device_type_) {
auto msg = dali::make_string(
"TF device and DALI device mismatch. TF device: ",
(dataset()->device_type_ == device_type_t::CPU ? "CPU" : "GPU"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: Add std::string to_string(device_type_t type)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain what do you mean here? I compare two device_type_t things and return string from ternary operator.

nosetests --verbose -s test_dali_tf_dataset.py:_test_tf_dataset_other_gpu
nosetests --verbose -s test_dali_tf_dataset.py:_test_tf_dataset_multigpu
nosetests --verbose -s test_dali_tf_dataset_mnist.py
nosetests --verbose -s test_dali_tf_dataset_graph.py:_test_tf_dataset_other_gpu
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why do we need to list those tests explicitly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made all tests that need more GPUs hidden (name starts with _), so they don't trigger in L0, where we run it on single GPU machine. Here I manually run those tests.
Alternative is to create two files with tests, but the end result is the same.

Signed-off-by: Albert Wolant <awolant@nvidia.com>
Signed-off-by: Albert Wolant <awolant@nvidia.com>
@awolant
Copy link
Contributor Author

awolant commented Jan 18, 2021

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1991279]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1991279]: BUILD PASSED

@awolant awolant merged commit 9becd97 into NVIDIA:master Jan 18, 2021
TheTimmy pushed a commit to TheTimmy/DALI that referenced this pull request Jan 20, 2021
* TF dataset tests rework for multi GPU dataset

Signed-off-by: Albert Wolant <awolant@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants