I was trying to use a single `universal_vector` to replace a pair of `host_vector` and `device_vector`, hoping to reduce memory usage and to support computations on buffers larger than GPU memory. However, `universal_vector` seems to be very slow for `push_back` operations, regardless of whether the operation requires a reallocation.

I tried running a simple benchmark under `nvprof`, and the profile suggests that each `push_back` requires a `cudaStreamSynchronize`. I guess this might be causing the problem, but I'm not familiar with CUDA, so this might be wrong. I'm using a GeForce GTX 1050; I'm not sure whether this is related to demand paging.
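The original benchmark and its output were not captured here. A minimal sketch of the kind of comparison described (timing `push_back` on `host_vector` vs. `universal_vector`, with capacity pre-reserved so that no call triggers a reallocation) could look like the following; the element count and timing method are my assumptions, not the original code:

```cpp
// Build with: nvcc -O2 bench_push_back.cu -o bench_push_back
// Hypothetical reconstruction of the benchmark described above.
#include <chrono>
#include <cstdio>
#include <thrust/host_vector.h>
#include <thrust/universal_vector.h>

// Time n push_back calls on a given vector type, in milliseconds.
template <typename Vec>
double time_push_back(int n) {
    Vec v;
    v.reserve(n);  // pre-reserve so no push_back reallocates
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        v.push_back(i);
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const int n = 100000;  // assumed element count
    std::printf("host_vector:      %8.2f ms\n",
                time_push_back<thrust::host_vector<int>>(n));
    std::printf("universal_vector: %8.2f ms\n",
                time_push_back<thrust::universal_vector<int>>(n));
    return 0;
}
```

Profiling this under `nvprof ./bench_push_back` would show whether each `push_back` on the `universal_vector` issues a `cudaStreamSynchronize`, as described above.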