[BUG] cannot create blazingcontext for async dask_cudf clients #1272
Comments
I'm wondering if there's a ~cheap workaround here, like:
Or if it's not hard to get the context creator to use an async client. (Though maybe that's a deep assumption in bsql python, so we should use the stub approach?)

Thinking a bit more: the proxy isn't that great because it still has the issue of 2 LocalCUDAClusters. For our use case, we're moving to starting a separate CUDA cluster process (same-node) so multiple processes can hit it. I think the proxy solution will work then: async dask client -> sync in-worker dask client -> 'remote' cluster. So we'll do sync for now, and once our cluster service is up, revisit some sort of async proxy. Our bsql calls are fairly thin.
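For reference, a minimal sketch of one way to bridge async app code to a sync-client BlazingContext (illustrative only: the scheduler address, query text, and per-call setup are assumptions; in practice the sync client and context would be created once and reused):

import asyncio
from dask.distributed import Client
from blazingsql import BlazingContext

async def run_bsql_via_sync_proxy(scheduler_address: str, query: str):
    # Keep BlazingContext on a synchronous client, and run the blocking bsql call
    # in a thread so the event loop stays responsive.
    loop = asyncio.get_running_loop()

    def blocking_call():
        sync_client = Client(scheduler_address)  # asynchronous=False (default)
        bc = BlazingContext(dask_client=sync_client, network_interface='lo')
        return bc.sql(query).compute()           # blocks, but off the event loop

    return await loop.run_in_executor(None, blocking_call)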
@felipeblazing The async collection helper:

# Assumed surrounding imports for this helper; pop_worker_query_part is the
# worker-side function that pairs with return_futures=True (import path omitted).
import logging
from typing import Union

import cudf
import dask
import dask.dataframe
import dask_cudf
from dask.distributed import Client
from blazingsql import BlazingContext

logger = logging.getLogger(__name__)

async def collect_bsql_futures(sync_client: Client, bc: BlazingContext, dask_futures: list, as_gdf=False) -> Union[cudf.DataFrame, dask_cudf.DataFrame]:
    """
    Async collection of a bc.sql(..., return_futures=True) call, returning the
    original dgdf or, optionally, post-processed to a gdf.

    :param sync_client: dask.distributed.Client with asynchronous=False
    :param bc: BlazingContext with dask_cudf cluster
    :param dask_futures: Result of a bc.sql(..., return_futures=True) call
    :param as_gdf: Whether to return a cudf.DataFrame instead of a dask_cudf.DataFrame

    May also need 'dask.config.set(scheduler=gpu_client_sync)' calls.

    Examples:
        dgdf = await collect_bsql_futures(sync_client, bc, bc.sql(qry, return_futures=True))
        gdf = await collect_bsql_futures(sync_client, bc, bc.sql(qry, return_futures=True), as_gdf=True)
    """
    from pyblazing.apiv2.context import distributed_remove_orc_files_from_disk
    logger.info('@collect_bsql_futures')
    #dask.config.set(scheduler=client)
    try:
        # A sync client's gather() can still be awaited by passing asynchronous=True per call
        meta_results: list = await sync_client.gather(dask_futures, asynchronous=True)
        #meta_results: list = sync_client.gather(dask_futures)
        logger.debug('meta :: %s = %s', type(meta_results), meta_results)
    except Exception as e:
        logger.error('exn running query, cleaning up old orc file cache', exc_info=True)
        try:
            #FIXME somehow get txn id and plug in here?
            distributed_remove_orc_files_from_disk(sync_client, bc.cache_dir_path)
            logger.debug('... Cleaned up')
        except Exception:
            logger.error('Failed cleanup of failed bsql query', exc_info=True)
        raise e
    # Pull each worker-local query part back as a future pinned to its worker
    futures: list = []
    for query_partids, meta, worker_id in meta_results:
        logger.info('meta results meta list item: %s', meta)
        for query_partid in query_partids:
            futures.append(sync_client.submit(
                pop_worker_query_part,
                query_partid,
                workers=[worker_id],
                pure=False))
    logger.debug('collected futures: %s', futures)
    result: dask_cudf.DataFrame = dask.dataframe.from_delayed(futures, meta=meta)
    logger.info('from_delayed result :: %s = %s', type(result), result)
    if as_gdf:
        #gdf2 = result.compute()
        [gdf2] = await sync_client.gather([result.compute(compute=False)], asynchronous=True)
        logger.debug('gdf2::%s', type(gdf2))
        logger.info('////collect_bsql_futures gdf')
        return gdf2
    else:
        logger.info('////collect_bsql_futures dgdf')
        return result
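For context, the part this leans on (afaict) is that a synchronous Client's gather() accepts asynchronous=True per call, so the coroutine can await it without needing a second, async cluster. A hedged usage sketch from async app code (the handler shape and qry are illustrative assumptions):

# Illustrative only: calling the helper from an async handler / task.
async def handle_query(bc, sync_client, qry: str):
    dask_futures = bc.sql(qry, return_futures=True)
    dgdf = await collect_bsql_futures(sync_client, bc, dask_futures)
    return dgdf  # dask_cudf.DataFrame; use as_gdf=True when a cudf.DataFrame is needed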
Passing tests:

@pytest.mark.timeout(30)
def test_bsql_sync_warmup(bc, gpu_client_sync):
    assert True

@pytest.mark.timeout(30)
def test_bsql_async_warmup(gpu_client):
    assert True

@pytest.mark.timeout(30)
def test_bsql_default(bc, gpu_client_sync):
    dask.config.set(scheduler=gpu_client_sync)
    dgdf = bc.sql('SELECT SUM(bc_sample.a) AS my_sum FROM bc_sample')
    gdf = dgdf.compute()
    assert gdf['my_sum'][0] == 6

@pytest.mark.timeout(30)
async def test_collect_bsql_futures_dgdf(bc, gpu_client_sync, gpu_client):
    logger.debug('@test_collect_bsql_futures_dgdf')
    dask.config.set(scheduler=gpu_client_sync)
    dask_futures = bc.sql('SELECT SUM(bc_sample.a) AS my_sum FROM bc_sample', return_futures=True)
    logger.info('got futures: %s', dask_futures)
    dgdf2 = await collect_bsql_futures(gpu_client_sync, bc, dask_futures)
    logger.info('collected dgdf2: %s', dgdf2)
    gdf2 = dgdf2.compute()
    assert gdf2['my_sum'][0] == 6

@pytest.mark.timeout(30)
async def test_collect_bsql_futures_gdf(bc, gpu_client_sync, gpu_cluster):
    from server.util.dask_cudf import dask_cudf_init_helper_sync, make_cluster, UserClient
    async with await UserClient(f'async_{__name__}', allow_reset=False, address=gpu_cluster) as gpu_client:
        dask.config.set(scheduler=gpu_client_sync)
        dask_futures = bc.sql('SELECT SUM(bc_sample.a) AS my_sum FROM bc_sample', return_futures=True)
        gdf2 = await collect_bsql_futures(gpu_client_sync, bc, dask_futures, as_gdf=True)
        assert gdf2['my_sum'][0] == 6
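For reference, a minimal sketch of what the fixtures above assume (the names come from the tests; the implementations, sample data, and scopes are assumptions, and the real versions live in our app code; async tests/fixtures need pytest-asyncio or similar):

import pytest
import cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from blazingsql import BlazingContext

@pytest.fixture(scope='session')
def gpu_cluster():
    cluster = LocalCUDACluster()
    yield cluster
    cluster.close()

@pytest.fixture(scope='session')
def gpu_client_sync(gpu_cluster):
    client = Client(gpu_cluster)  # asynchronous=False (default)
    yield client
    client.close()

@pytest.fixture
async def gpu_client(gpu_cluster):
    client = await Client(gpu_cluster.scheduler_address, asynchronous=True)
    yield client
    await client.close()

@pytest.fixture(scope='session')
def bc(gpu_client_sync):
    bc = BlazingContext(dask_client=gpu_client_sync, network_interface='lo')
    bc.create_table('bc_sample', cudf.DataFrame({'a': [1, 2, 3]}))  # SUM(a) == 6
    return bc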
@lmeyerov I don't think we will be able to support asynchronous=True dask_cudf clients for a while. It's very non-trivial. Additionally, a recent PR #1289 has gotten rid of
Thanks! For our use case, here's where we're currently at:
FWIW, I'm about to get back to large-dataset testing w/ bsql b/c of our big file loader feature + customer proveouts. I'm guessing more important will be stuff like making sure groupbys work, and ideally, that we can get big datasets back as part of a pipeline :)
Describe the bug
Passing an asynchronous=True dask_cudf client to BlazingContext() throws an exception. This is unfortunate as:
- async is good for software (apps, dashboards, ...): we're using dask_cudf + async clients to make the RAPIDS stack less of a bottleneck, so bsql calls break this benefit
- memory waste: for single-node (incl. multi-GPU), this means having to create 2 LocalCUDAClusters: async clients need async clusters, and vice versa for sync (afaict!)
Steps/Code to reproduce bug
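Roughly (a hedged reconstruction of the repro; fixture names, the network interface, and data are assumptions consistent with the tests above):

# A BlazingContext over a sync client constructs fine; the same call with an
# asynchronous=True client raises. Illustrative reconstruction only.
from dask.distributed import Client
from blazingsql import BlazingContext

def test_bc_sync_client(gpu_cluster):
    client = Client(gpu_cluster.scheduler_address)
    bc = BlazingContext(dask_client=client, network_interface='lo')  # works

async def test_bc_async_client(gpu_cluster):
    client = await Client(gpu_cluster.scheduler_address, asynchronous=True)
    bc = BlazingContext(dask_client=client, network_interface='lo')  # throws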
The second test fails with:
Expected behavior
Both tests pass
Environment overview (please complete the following information)
docker w/ 10.2 -> conda -> rapids 0.17
----For BlazingSQL Developers----
Suspected source of the issue
Where and what are potential sources of the issue
Other design considerations
What components of the engine could be affected by this?