
[BUG] parallel_for failed: cudaErrorIllegalAddress: an illegal memory access was encountered #793

Closed
threedliteguy opened this issue Jun 14, 2020 · 15 comments
Labels: bug (Something isn't working)
Project: Scrum board (Done)
Assignee: Christian8491

@threedliteguy

threedliteguy commented Jun 14, 2020

Describe the bug
Crash when using example from:
https://blog.blazingdb.com/data-visualization-with-blazingsql-12095862eb73

Steps/Code to reproduce bug
Run the attached sample code: s3-test.py.txt (a rough sketch of its workflow is included below).
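A rough sketch of the script's workflow, based on the linked blog post; the S3 registration name, bucket, path, and table name below are placeholders/assumptions, and the attached s3-test.py.txt is authoritative:

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from blazingsql import BlazingContext

cluster = LocalCUDACluster()            # one dask-cuda worker per visible GPU
client = Client(cluster)
bc = BlazingContext(dask_client=client)

# Register the S3 bucket and build a table over the taxi parquet files
# (bucket and path are placeholders -- see the attached script for the real ones)
bc.s3('taxi_s3', bucket_name='<bucket>')
bc.create_table('taxi', 's3://taxi_s3/<path>/*.parquet')

# The illegal memory access is raised while a large result is being materialized
result = bc.sql('SELECT * FROM taxi')
print(result.tail())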

Expected behavior
No illegal memory access exception.

Environment overview (please complete the following information)

  • Environment location:
    Bare metal conda
    Python 3.7.7 (default, Mar 23 2020, 22:36:06)
    [GCC 7.3.0] :: Anaconda, Inc. on linux
    Debian 10 CUDA 10.2

  • Method of cuDF install:

conda install -c blazingsql/label/cuda10.2 -c blazingsql -c rapidsai -c nvidia -c conda-forge -c defaults blazingsql python=3.7

Environment details
PATH=/opt/miniconda3/bin:/opt/miniconda3/condabin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/usr/local/go/bin

Additional context
Code attached. Other tests using blazingsql worked fine on this box.

Output:

listening: tcp://*:22758
2020-06-14T15:56:41Z|-78920688|TRACE|deregisterFileSystem: filesystem authority not found
CacheDataLocalFile: /tmp/.blazing-temp-D63WqK6ZgzRBOMd0kxS4CzTDNC69hqAn1vlzzPGIjU8ijs78nLFqpShVKo8Qkdmm.orc
terminate called after throwing an instance of 'thrust::system::system_error'
what(): parallel_for failed: cudaErrorIllegalAddress: an illegal memory access was encountered
distributed.nanny - WARNING - Restarting worker
BlazingContext ready
distributed.nanny - WARNING - Worker process still alive after 3 seconds, killing

After the crash, nvidia-smi shows the following, and the main python process is hung:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN V             On   | 00000000:01:00.0 Off |                  N/A |
| 29%   43C    P8    26W / 250W |    640MiB / 12066MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     31262      C   python                                       627MiB |
+-----------------------------------------------------------------------------+

The first time I ran it, it created a number of .orc files in /tmp before crashing with the above error. Another time it gave:

listening: tcp://*:22170
BlazingContext ready
2020-06-14T16:59:14Z|-682139984|TRACE|deregisterFileSystem: filesystem authority not found
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process still alive after 3 seconds, killing
Unable to start CUDA Context
Traceback (most recent call last):
  File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/dask_cuda/initialize.py", line 108, in dask_setup
    numba.cuda.current_context()
  File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/numba/cuda/cudadrv/devices.py", line 212, in get_context
    return _runtime.get_or_create_context(devnum)
  File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/numba/cuda/cudadrv/devices.py", line 138, in get_or_create_context
    return self._get_or_create_context_uncached(devnum)
  File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/numba/cuda/cudadrv/devices.py", line 153, in _get_or_create_context_uncached
    return self._activate_context_for(0)
  File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/numba/cuda/cudadrv/devices.py", line 169, in _activate_context_for
    newctx = gpu.get_primary_context()
  File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 529, in get_primary_context
    driver.cuDevicePrimaryCtxRetain(byref(hctx), self.id)
  File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 295, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 330, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [304] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_OPERATING_SYSTEM
pure virtual method called
terminate called without an active exception

Edit:
Tried accessing the files locally instead of from S3 and reproduced the same error. As soon as GPU memory fills up, or after several .orc files are created (it varies), I get the illegal memory access or the worker dies. If the worker restarts successfully, it does not appear to process anything. Using the non-dask/cluster version of BlazingContext, it prints 'Killed' as soon as it runs out of memory on the GPU. Processing only one input file works fine, since it does not run out of memory.

@threedliteguy threedliteguy added ? - Needs Triage needs team to review and classify bug Something isn't working labels Jun 14, 2020
@wmalpica wmalpica added this to Needs prioritizing in Scrum board Jun 15, 2020
@wmalpica wmalpica removed the ? - Needs Triage needs team to review and classify label Jun 15, 2020
@wmalpica wmalpica moved this from Needs prioritizing to Not Started in Scrum board Jun 15, 2020
@Christian8491 Christian8491 self-assigned this Jun 16, 2020
@Christian8491 Christian8491 moved this from Not Started to WIP in Scrum board Jun 16, 2020
@Christian8491
Contributor

Hi @threedliteguy, if you still have those /tmp/.blazing-temp-*.orc files, please remove them (a small cleanup snippet follows below). A change that removes those /tmp/ files automatically will be merged in soon. In the meantime, I will try to reproduce this issue.
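If it helps, a minimal cleanup sketch for those leftover files (adjust the pattern if your temp directory differs):

import glob
import os

# Remove any leftover BlazingSQL temp/spill files from /tmp
for path in glob.glob('/tmp/.blazing-temp-*.orc'):
    os.remove(path)
    print('removed', path)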

@Christian8491
Contributor

@threedliteguy I would like to point out that we now handle the common case of previewing just a few rows (typically to get a feel for the data) much better. As a first step, please use something like this:
result = bc.sql('select * from taxi limit 10')  # instead of bc.sql('select * from taxi').tail()
and then continue with the normal flow.

Alternatively, use bc.sql('select * from taxi').tail() with a single parquet file, and for the next query (the one that uses dropoff_x/dropoff_y) use all the parquet files. See the sketch below.
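For concreteness, a minimal sketch of the suggested flow, assuming bc and the taxi table are already set up as in the attached script:

# Preview a few rows with LIMIT instead of materializing the whole table
preview = bc.sql('SELECT * FROM taxi LIMIT 10')
print(preview)

# Alternatively, .tail() is fine when the table was built from a single parquet file
single_file_preview = bc.sql('SELECT * FROM taxi').tail()
print(single_file_preview)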

If that works for you, we can close the issue for now. Meanwhile, I will continue investigating the source of this problem.

@threedliteguy
Author

threedliteguy commented Jun 16, 2020 via email

@Christian8491
Contributor

Christian8491 commented Jun 16, 2020

@threedliteguy thanks for the answer. Could you please provide us with the logs? You should find a new blazing_log/ folder in the same path where you ran the s3-test.py.txt script. Please attach it here, or at least the file(s) named RAL.*.log. These files help us debug and track down the source of the problem.

@threedliteguy
Author

threedliteguy commented Jun 16, 2020 via email

@Christian8491
Contributor

Hi @threedliteguy, just to let you know that a fix was merged recently, so this issue should go away. If you can, please run bc.sql('select * from taxi limit 10') again and verify whether the issue is fixed or you still hit the same problem.

@threedliteguy
Author

threedliteguy commented Jun 19, 2020 via email

@Christian8491
Contributor

Christian8491 commented Jun 19, 2020

Thanks for trying that. I have an idea of what is happening. I will send a new fix and let you know when it is ready to try again.

@Christian8491 Christian8491 moved this from WIP to Done in Scrum board Jun 19, 2020
@Christian8491
Contributor

Hi @threedliteguy, just to let you know that a fix was merged in over the last few days. I hope you can verify it, if you still need to.

@threedliteguy
Author

threedliteguy commented Jun 25, 2020 via email

@Christian8491
Contributor

I see you are using 0.14 (and some dependencies from it), while the changes were merged into branch-0.15. I suggest you try the nightly version of BlazingSQL (0.15); if you have issues installing it, let me know. I also suggest creating a new conda environment to work with the 0.15 branch, for example as sketched below.
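A rough sketch of setting up a fresh environment for the 0.15 nightlies; the blazingsql-nightly and rapidsai-nightly channel names are assumptions, so please check the BlazingSQL install docs for the exact command:

conda create -n bsql-0.15 python=3.7
conda activate bsql-0.15
conda install -c blazingsql-nightly -c rapidsai-nightly -c nvidia -c conda-forge -c defaults blazingsql python=3.7 cudatoolkit=10.2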

@threedliteguy
Author

threedliteguy commented Jun 25, 2020 via email

@Christian8491
Contributor

It looks like you have an issue related to your Java version. Could you try conda list | grep jdk in this environment?
Also, could you run only the second query, SELECT dropoff_longitude * {o_shift} / 180 AS dropoff_x, ..., and wait a while? Remember that this file comes from S3. Something like the sketch below should do it.
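A minimal sketch of running just that query, assuming bc and the taxi table are already set up; the o_shift value is an assumption (the Web-Mercator shift constant used in the blog post):

import time

# Run only the second query and time it; the data is read from S3,
# so the first run can take a while.
o_shift = 20037508.34  # assumed Web-Mercator shift constant
start = time.time()
dropoffs = bc.sql(f'SELECT dropoff_longitude * {o_shift} / 180 AS dropoff_x FROM taxi')
print(dropoffs.head())
print(f'query took {time.time() - start:.1f}s')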

@threedliteguy
Author

threedliteguy commented Jun 25, 2020 via email

@Christian8491
Contributor

I do not know much about datashader, but I suppose you should review the versions it offers and check their compatibility with the JDK, if any such constraint exists.
