edknv (Contributor) commented on Apr 13, 2023

This reverts commit 8782c9d (which fixed #131).

Setting the device via the CuPy API causes the Horovod (2-GPU) tests to hang with the following error:

[1,1]<stdout>:merlin/models/tf/models/base.py:1387: in fit                                                                                                                          
[1,1]<stdout>:    out = super().fit(**fit_kwargs)                                                                                                                                   
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py:70: in error_handler                                                                            
[1,1]<stdout>:    raise e.with_traceback(filtered_tb) from None                                                                                                                     
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/tensorflow.py:78: in __getitem__                                                                             
[1,1]<stdout>:    return self.__next__()                                                                                                                                            
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/tensorflow.py:82: in __next__                                                                                
[1,1]<stdout>:    converted_batch = self.convert_batch(super().__next__())                                                                                                          
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py:261: in __next__                                                                              
[1,1]<stdout>:    return self._get_next_batch()                                                                                                                                     
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py:332: in _get_next_batch                                                                       
[1,1]<stdout>:    batch = next(self._batch_itr)                                                                                                                                     
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py:369: in make_tensors                                                                          
[1,1]<stdout>:    tensors_by_name = self._convert_df_to_tensors(gdf)                                                                                                                
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner                                                                                                     
[1,1]<stdout>:    result = func(*args, **kwargs)                                                                                                                                    
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py:524: in _convert_df_to_tensors                                                                
[1,1]<stdout>:    tensors_by_name[column_name] = self._to_tensor(gdf_i[[column_name]])                                                                                              
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py:453: in _to_tensor                                                                            
[1,1]<stdout>:    with cupy.cuda.Device(self.device):                                                                                                                               
[1,1]<stdout>:cupy/cuda/device.pyx:184: in cupy.cuda.device.Device.__enter__                                                                                                        
[1,1]<stdout>:    ???                                                                                                                                                               
[1,1]<stdout>:cupy_backends/cuda/api/runtime.pyx:365: in cupy_backends.cuda.api.runtime.setDevice                                                                                   
[1,1]<stdout>:    ???                                                                                                                                                               
[1,1]<stdout>:_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _                                                                                                                                                                       
[1,1]<stdout>:                                                                                                                                                                      
[1,1]<stdout>:>   ???                                                                                                                                                               
[1,1]<stdout>:E   cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInvalidDevice: invalid device ordinal                                                                   
[1,1]<stdout>:                                                                                                                                                                      
[1,1]<stdout>:cupy_backends/cuda/api/runtime.pyx:142: CUDARuntimeError                                                                                                              
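
For context, `cudaErrorInvalidDevice` from `setDevice` means the requested ordinal is outside the range of GPUs the calling process can address. One possible cause, not confirmed anywhere in this PR, is that each Horovod worker is restricted to a single GPU (for example through `CUDA_VISIBLE_DEVICES`), so the worker's only GPU is ordinal 0 and a dataloader `device` of 1 is rejected. The sketch below illustrates that renumbering in plain Python. The helper names `visible_gpu_count` and `is_valid_ordinal` are hypothetical and do not exist in merlin-dataloader or CuPy, and the parsing is simplified to integer indices only.

```python
from typing import Optional


def visible_gpu_count(physical_gpus: int, cuda_visible_devices: Optional[str]) -> int:
    """Number of GPUs a process can address, given CUDA_VISIBLE_DEVICES.

    Hypothetical helper for illustration. Simplified: handles only
    comma-separated integer indices, not GPU UUIDs or duplicates.
    """
    if cuda_visible_devices is None:
        return physical_gpus  # variable unset: every GPU is visible
    count = 0
    for entry in cuda_visible_devices.split(","):
        entry = entry.strip()
        # CUDA only exposes the devices listed before the first invalid entry.
        if not entry.isdigit() or int(entry) >= physical_gpus:
            break
        count += 1
    return count


def is_valid_ordinal(
    device: int, physical_gpus: int, cuda_visible_devices: Optional[str]
) -> bool:
    """True if `device` falls inside the process-local ordinal range.

    Visible GPUs are renumbered from 0, so the valid ordinals are
    0 .. visible_gpu_count - 1 regardless of the physical indices.
    """
    return 0 <= device < visible_gpu_count(physical_gpus, cuda_visible_devices)
```

Under that assumption, a worker started with `CUDA_VISIBLE_DEVICES="1"` on a 2-GPU machine would see `is_valid_ordinal(1, 2, "1")` return `False` and `is_valid_ordinal(0, 2, "1")` return `True`, which matches the shape of the failure above: the ordinal that is valid machine-wide is invalid inside the worker.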

edknv requested a review from jperez999 on Apr 13, 2023, 17:42
edknv self-assigned this on Apr 13, 2023
edknv added the bug and chore labels on Apr 13, 2023
edknv added this to the Merlin 23.04 milestone on Apr 13, 2023
edknv merged commit 014b658 into NVIDIA-Merlin:main on Apr 14, 2023
Development

Successfully merging this pull request may close these issues.

Device assignment does not work in PyTorch