map_sync with pandas operation function does not finish. #844

Open
yun881201 opened this issue Nov 10, 2023 · 1 comment
yun881201 commented Nov 10, 2023

Map_sync with pandas operation function does not finish.

I have a very long dataframe, so I split it into 40 sub-dataframes and apply a pandas operation to all 40 in parallel using map_sync. The pandas operation is just a groupby followed by an apply.

My code is like this:
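import numpy as np
import ipyparallel as ipp

PEN = 40  # number of engines, and number of sub-dataframes
dfs = np.array_split(target_df, PEN)
c = ipp.Cluster(n=PEN)
with c as rc:
    e_all = rc[:]  # DirectView on all engines
    results = e_all.map_sync(FUNCTION, dfs)
results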

I have 30 target_dfs. For the first 10, map_sync worked fine, but after that map_sync never completed.
Without parallelism, the pandas job applied to a target_df completes in under 2 hours.
I am on Windows, and my ipyparallel version is the latest.
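A hang like this is easier to diagnose when the call fails loudly instead of blocking forever. The following is only a sketch, not something from the report: `map_with_timeout` and `timeout_s` are hypothetical names, and the helper assumes the view's `map_async` returns an object with the `wait`, `ready`, `progress`, and `get` members that ipyparallel's `AsyncResult` provides.

```python
def map_with_timeout(view, func, chunks, timeout_s):
    """Run a map on `view`, but give up after `timeout_s` seconds.

    `view` is assumed to behave like an ipyparallel DirectView:
    map_async() must return an AsyncResult-like object.
    """
    ar = view.map_async(func, chunks)

    # Block for at most timeout_s seconds instead of indefinitely.
    ar.wait(timeout_s)

    if not ar.ready():
        # Report how far the map got, which hints at whether it is
        # stuck at the start (sending data) or partway through.
        raise TimeoutError(
            f"map did not finish within {timeout_s} s; "
            f"{ar.progress} task(s) had completed"
        )

    # All tasks are done; get() returns the results or re-raises a remote error.
    return ar.get()
```

Inside the `with c as rc:` block this would replace the `map_sync` call, passing `e_all`, the function, `dfs`, and a timeout comfortably above the expected runtime.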

yun881201 changed the title from "After a kernel start, the first map_sync give me a result, but the second mapc_sync does not finish." to "mapc_sync with pandas operation function does not finish." Nov 13, 2023
yun881201 changed the title from "mapc_sync with pandas operation function does not finish." to "map_sync with pandas operation function does not finish." Nov 13, 2023
minrk (Member) commented Feb 8, 2024

Sorry for not responding in a reasonable amount of time, but I missed this one when it came in.

I'm afraid I'll need a more complete reproducible example, because all I can see is that map does work with a list of data frames when I test it. If I were to guess, it would be something in the serialization of pandas DataFrames, and might be specific to the data types of your columns.
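To narrow down the serialization guess, one option is to pickle each chunk locally before involving any engines. This is a rough sketch under assumptions: a plain `pickle` round trip is used as a stand-in for whatever IPython Parallel does when it sends arguments, `check_chunk_serialization` is a hypothetical helper, and the synthetic `target_df` with its `key`, `value`, and `label` columns only imitates the real data.

```python
import pickle

import numpy as np
import pandas as pd


def check_chunk_serialization(chunks):
    """Pickle and unpickle each chunk; report size, dtypes, and equality."""
    report = []
    for i, chunk in enumerate(chunks):
        payload = pickle.dumps(chunk, protocol=pickle.HIGHEST_PROTOCOL)
        restored = pickle.loads(payload)
        report.append(
            {
                "chunk": i,
                "rows": len(chunk),
                "pickled_bytes": len(payload),
                "dtypes": sorted({str(t) for t in chunk.dtypes}),
                "round_trip_ok": chunk.equals(restored),
            }
        )
    return report


# Synthetic stand-in for the real target_df (column names are made up).
rng = np.random.default_rng(0)
target_df = pd.DataFrame(
    {
        "key": rng.integers(0, 5, size=1_000),
        "value": rng.normal(size=1_000),
        "label": rng.choice(["a", "b", "c"], size=1_000),
    }
)

# Split by row position into 4 chunks of consecutive rows.
positions = np.array_split(np.arange(len(target_df)), 4)
chunks = [target_df.iloc[idx] for idx in positions]

for row in check_chunk_serialization(chunks):
    print(row)
```

Running the same check on the real chunks that hang, and on the ones that succeed, would show whether payload size or a particular column dtype differs between them, which is the kind of detail a reproducible example needs.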

There's a very good chance that you'll have a better experience parallelizing data frame operations with dask dataframe than IPython Parallel, which has no first-class understanding of DataFrames and will do some rather inefficient serialization, I think.
