single_gpu_tutorial.ipynb fails to run on GPU #183

@ronjer30

Describe the bug

The single-GPU tutorial notebook fails to launch a GPU-based Dask cluster.

Steps/Code to reproduce bug

  1. Launch the notebook
  2. Run all steps in the 0.Env Setup section
  3. Navigate to the 4.Exact Deduplication section
  4. Launch the GPU Dask cluster by running the following code in the cell:
client = get_client(cluster_type = 'gpu', set_torch_to_use_rmm=False)
print(f"Number of dask worker:{get_num_workers(client)}")
client.run(pre_imports)

This returns the following error:

NotImplementedError: 
        NeMo Curator does not support query planning yet.
        Please disable query planning before importing
        `dask.dataframe` or `dask_cudf`. This can be done via:
        `export DASK_DATAFRAME__QUERY_PLANNING=False`, or
        importing `dask.dataframe/dask_cudf` after importing
        `nemo_curator`.

Expected behavior
Execution should succeed, and the output should resemble the following:
Number of dask worker:1 {'tcp://127.0.0.1:36179': None}

**Environment overview**

  • Environment location: Bare-metal
  • Method of NeMo-Curator install: Docker
docker run \
   --rm \
   -it \
   --gpus '"device=1"' \
   --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
   -p 8888:8888 \
   -p 8787:8787 \
   nvcr.io/nvidia/nemo:dev

Additional context
Setting the following environment variable in the notebook's 0.Env Setup section resolves the issue:
os.environ["DASK_DATAFRAME__QUERY_PLANNING"] = "False"
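For anyone applying this workaround, here is a minimal sketch of a guard cell that sets the variable and warns if it may be too late to take effect. The helper name `disable_query_planning` is hypothetical and not part of NeMo Curator or Dask. The check on `sys.modules` rests on an assumption: as far as I understand, Dask reads `DASK_*` environment variables when `dask` is first imported, so the variable likely needs to be set before any Dask import in the kernel.

```python
import os
import sys


def disable_query_planning() -> bool:
    """Hypothetical helper: set DASK_DATAFRAME__QUERY_PLANNING=False.

    Returns True if no Dask module was imported yet (the setting should be
    picked up), False if one was already loaded (a kernel restart may be needed).
    """
    # Assumption: Dask collects DASK_* environment variables at import time,
    # so any of these already being loaded suggests the setting came too late.
    already_imported = [
        name for name in ("dask", "dask.dataframe", "dask_cudf")
        if name in sys.modules
    ]

    # The same assignment as in the workaround above.
    os.environ["DASK_DATAFRAME__QUERY_PLANNING"] = "False"

    if already_imported:
        print(
            f"Warning: {already_imported} already imported; "
            "restart the kernel and run this cell first."
        )
        return False
    return True


set_in_time = disable_query_planning()
```

Running this as the first cell of the notebook, before the imports in the env setup step, should match the behavior of the one-line workaround while making an ordering mistake visible.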

Labels: bug