
Add an ability to run DALI without GPU #2165

Merged: 8 commits merged from cpu_only_core into NVIDIA:master on Aug 28, 2020
Conversation

@JanuszL (Contributor) commented on Jul 29, 2020

Why do we need this PR?


  • It adds the ability to run DALI without a GPU

What happened in this PR?


  • What solution was applied:
    Adds a special case for device_id that skips all CUDA-related calls (see the usage sketch below).
  • Affected modules and functionalities:
    Executor
    A couple of operators
  • Key points relevant for the review:
    Some fixes are hacky; please check whether they can be improved.
  • Validation and testing:
    Added a test case for the CPU-only scenario.
  • Documentation (including examples):
    Updated the pipeline description.

JIRA TASK: [DALI-1491]
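
For reference, a minimal usage sketch of the CPU-only mode is shown below. It is written against a recent DALI release, where passing device_id=None selects CPU-only execution; the file_root path is a placeholder, and the exact spelling may differ from the API available when this PR was opened.

# Minimal CPU-only pipeline sketch. Assumes a recent DALI release where
# device_id=None maps to the CPU_ONLY_DEVICE_ID special case added here;
# the file_root path is a placeholder.
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn

@pipeline_def(batch_size=8, num_threads=4, device_id=None)  # no GPU needed
def cpu_only_pipeline():
    jpegs, labels = fn.readers.file(file_root="/path/to/images", random_shuffle=True)
    images = fn.decoders.image(jpegs, device="cpu")   # keep every operator on the CPU
    images = fn.resize(images, resize_x=224, resize_y=224)
    return images, labels

pipe = cpu_only_pipeline()
pipe.build()
images, labels = pipe.run()  # outputs are CPU TensorLists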

@dali-automaton: CI MESSAGE: [1503498]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1503498]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1503505]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1503505]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1504379]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1504495]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1504379]: BUILD FAILED

@JanuszL force-pushed the cpu_only_core branch 2 times, most recently from 254502a to 046af83 on July 29, 2020 10:57
@dali-automaton: CI MESSAGE: [1504495]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1505415]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1505415]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1505433]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1505433]: BUILD FAILED

@JanuszL force-pushed the cpu_only_core branch 2 times, most recently from 74a4ffd to f8496f0 on July 29, 2020 22:18
@dali-automaton: CI MESSAGE: [1506219]: BUILD STARTED

@JanuszL force-pushed the cpu_only_core branch 2 times, most recently from 18f38e7 to 5747322 on July 29, 2020 23:46
@dali-automaton: CI MESSAGE: [1506219]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1506588]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1506588]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1506588]: BUILD PASSED
@dali-automaton: CI MESSAGE: [1577356]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1577356]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1577356]: BUILD PASSED

"or equal to CPU_ONLY_DEVICE_ID.");
DALI_ENFORCE(graph_->NumOp(OpType::GPU) == 0 && graph_->NumOp(OpType::MIXED) == 0,
"Cannot run a pipeline with Mixed/GPU ops in CPU-only mode. Please provide "
"valid device id or change the operators device.");
A reviewer (Contributor) commented:
Small nitpick:

Suggested change:
-    "valid device id or change the operators device.");
+    "valid device id or change the operators' device.");

@JanuszL (author) replied:
Done
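
To illustrate the check quoted above: building a pipeline in CPU-only mode that still contains a mixed or GPU operator should trip this DALI_ENFORCE. A hedged sketch follows; device_id=None as the CPU-only switch, the placeholder file_root path, and the surfaced exception type are assumptions based on the current Python API.

# Hedged sketch of the failure path guarded by the DALI_ENFORCE above.
# Assumes a recent DALI release (device_id=None for CPU-only mode); the
# error may surface at or before build(), and the exception type may differ.
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn

@pipeline_def(batch_size=4, num_threads=2, device_id=None)  # CPU-only mode
def bad_pipeline():
    jpegs, _ = fn.readers.file(file_root="/path/to/images")
    return fn.decoders.image(jpegs, device="mixed")  # a mixed op needs a GPU stage

try:
    pipe = bad_pipeline()
    pipe.build()  # expected to fail: "Cannot run a pipeline with Mixed/GPU ops in CPU-only mode..."
except RuntimeError as err:
    print(err)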

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@dali-automaton: CI MESSAGE: [1580145]: BUILD STARTED
@dali-automaton: CI MESSAGE: [1580145]: BUILD FAILED
@dali-automaton: CI MESSAGE: [1580145]: BUILD PASSED

@JanuszL merged commit c9942ce into NVIDIA:master on Aug 28, 2020
@JanuszL deleted the cpu_only_core branch on August 28, 2020 13:47
@JanuszL mentioned this pull request on Aug 28, 2020
klecki added a commit to klecki/DALI that referenced this pull request Apr 15, 2022
Remove from TensorVector the AsTensorList method
and the constructor taking an external shared_ptr<TensorList>.

Both allowed the internal state of TensorVector to be
observed from the outside, breaking the encapsulation.

Adjust a few places that relied on the changes of internal
state being externally visible:
* Initializing the data graph in workspaces.
  The TV -> TL conversion is done via the Mixed stage.
* ArgumentInputs that are produced as a TensorList
  need to be resynced via ShareData instead.

This reverts some changes from NVIDIA#2165:
  - The CPU stage cannot provide direct pipeline outputs
    due to the TensorList/TensorVector mismatch;
    we can only share data downwards, but we
    cannot share and preemptively expect the allocation
    to be mirrored.
  - The CPU-only stage still uses the Mixed stage, but
    with MakeContiguous constrained to CPU outputs.

Workspace initialization now takes CPU_ONLY_DEVICE_ID
into consideration and does not set the stream (resulting
in has_stream() being false, which in turn keeps the
AccessOrder in Mixed ops as host order only).

Two optimizations relying on contiguous
TensorVector -> TensorList were disabled just for the
time of the tests.

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
klecki added a commit to klecki/DALI that referenced this pull request Apr 19, 2022
klecki added a commit to klecki/DALI that referenced this pull request May 23, 2022
klecki added a commit to klecki/DALI that referenced this pull request May 24, 2022
klecki added a commit to klecki/DALI that referenced this pull request May 25, 2022
klecki added a commit to klecki/DALI that referenced this pull request May 25, 2022
klecki added a commit to klecki/DALI that referenced this pull request May 26, 2022
klecki added a commit to klecki/DALI that referenced this pull request Jun 22, 2022
klecki added a commit to klecki/DALI that referenced this pull request Jun 27, 2022
(Each of these commits carries the same commit message as the one quoted above.)
klecki added a commit to klecki/DALI that referenced this pull request Jul 13, 2022
Remove from TensorVector the AsTensorList method
and the constructor taking an external shared_ptr<TensorList>.

Both allowed the internal state of TensorVector to be
observed from the outside, breaking the encapsulation.

Adjust a few places that relied on the changes of internal
state being externally visible:
* Initializing the data graph in workspaces.
  The TV -> TL conversion is done via the Mixed stage.
* ArgumentInputs that are produced as a TensorList
  need to be resynced via ShareData instead.

This reverts some changes from NVIDIA#2165:
  - The CPU stage cannot provide direct pipeline outputs
    due to the TensorList/TensorVector mismatch;
    we can only share data downwards, but we
    cannot share and preemptively expect the allocation
    to be mirrored.
  - The CPU-only stage still uses the Mixed stage, but
    with MakeContiguous constrained to CPU outputs.

Workspace initialization now takes CPU_ONLY_DEVICE_ID
into consideration and does not set the stream (resulting
in has_stream() being false, which in turn keeps the
AccessOrder in Mixed ops as host order only).

Memory is set to non-pinned when CPU_ONLY_DEVICE_ID is
detected.

Eager mode optimizations with contiguous
TensorVector -> TensorList were disabled; they wait
for the rework of TensorVector replacing TensorList.

TODO: This commit also removes the External Source
optimization.

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
klecki added a commit to klecki/DALI that referenced this pull request Jul 14, 2022
(Same commit message as the Jul 13, 2022 commit above.)
klecki added a commit that referenced this pull request Jul 14, 2022
Remove from TensorVector the AsTensorList method
and the constructor taking an external shared_ptr<TensorList>.

Both allowed the internal state of TensorVector to be
observed from the outside, breaking the encapsulation.

Adjust a few places that relied on the changes of internal
state being externally visible:
* Initializing the data graph in workspaces.
  The TV -> TL conversion is done via the Mixed stage.
* ArgumentInputs that are produced as a TensorList
  need to be resynced via ShareData instead.

This reverts some changes from #2165:
  - The CPU stage cannot provide direct pipeline outputs
    due to the TensorList/TensorVector mismatch;
    we can only share data downwards, but we
    cannot share and preemptively expect the allocation
    to be mirrored.
  - The CPU-only stage still uses the Mixed stage, but
    with MakeContiguous constrained to CPU outputs.

Workspace initialization now takes CPU_ONLY_DEVICE_ID
into consideration and does not set the stream (resulting
in has_stream() being false, which in turn keeps the
AccessOrder in Mixed ops as host order only).

Memory is set to non-pinned when CPU_ONLY_DEVICE_ID is
detected.

Eager mode optimizations with contiguous
TensorVector -> TensorList were disabled; they wait
for the rework of TensorVector replacing TensorList.

An escape hatch to access the shared_ptr of the
allocation was ported from TensorList to TensorVector,
to allow ExternalSource to pass data without a copy.

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
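
The escape hatch mentioned above is what lets ExternalSource hand buffers to the pipeline without copying. A user-level sketch of that no-copy path is given below, assuming the current Python API (external_source's no_copy argument and Pipeline.feed_input); it only illustrates the behavior the commit preserves, not the internal shared_ptr plumbing.

# Hedged sketch: feeding an ExternalSource without a copy in CPU-only mode.
# Assumes a recent DALI release; device_id=None for CPU-only is an assumption
# about how the mode added in this PR is exposed today.
import numpy as np
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn

@pipeline_def(batch_size=2, num_threads=2, device_id=None)  # CPU-only mode
def es_pipeline():
    # no_copy=True asks DALI to share the fed buffers instead of copying them;
    # the caller must keep them alive until the outputs are consumed.
    return fn.external_source(name="data", no_copy=True)

pipe = es_pipeline()
pipe.build()
batch = [np.full((2, 2), i, dtype=np.float32) for i in range(2)]
pipe.feed_input("data", batch)
out, = pipe.run()
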
staniewzki pushed a commit to staniewzki/DALI that referenced this pull request Jul 19, 2022
szkarpinski pushed a commit to szkarpinski/DALI that referenced this pull request Jul 21, 2022
szkarpinski pushed a commit to szkarpinski/DALI that referenced this pull request Jul 21, 2022
(Each of these carries the same commit message as the merged commit above.)