Broadcast variable helper #94

Open
MrPowers opened this issue Apr 13, 2023 · 4 comments
Comments

@MrPowers
Owner

MrPowers commented Apr 13, 2023

From a Redditor on this thread:

I’d love to see a function to check if your df is small enough to use a broadcast join. At the moment I take a 10% sample, convert that to pandas and then estimate memory size from that using a pd function. Then if the df is small enough I’ll use a broadcast join to improve speed.
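A minimal sketch of the sampling workflow described in that quote. The helper names `estimate_size_from_sample` and `fits_broadcast` are hypothetical, not part of any library. It assumes `df` is a PySpark DataFrame and that the sample is small enough to collect to the driver. The result is only a rough estimate, since `sample` does not return an exact fraction of rows and the pandas memory footprint differs from Spark's internal representation.

```python
# Default value of spark.sql.autoBroadcastJoinThreshold (10 MB).
DEFAULT_BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def estimate_size_from_sample(df, fraction=0.1):
    """Roughly estimate a DataFrame's in-memory size from a sample.

    Hypothetical helper: it collects the sample to the driver, so it is
    only reasonable when the sample itself is small.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Take a sample and convert it to pandas on the driver.
    sample_pdf = df.sample(fraction=fraction).toPandas()
    # memory_usage(deep=True) also counts the contents of object columns.
    sample_bytes = sample_pdf.memory_usage(deep=True).sum()
    # Scale the sample size back up to the full DataFrame.
    return round(sample_bytes / fraction)


def fits_broadcast(estimated_bytes, threshold_bytes=DEFAULT_BROADCAST_THRESHOLD_BYTES):
    """Return True if the estimated size is within the broadcast threshold."""
    return estimated_bytes <= threshold_bytes
```

If `fits_broadcast(estimate_size_from_sample(small_df))` is true, the join could then be written as `big_df.join(broadcast(small_df), "key")` with `broadcast` imported from `pyspark.sql.functions`; `small_df`, `big_df`, and `"key"` are placeholder names.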

@puneetsharma04
Contributor

@MrPowers: I would like to contribute to this. Could you please assign this issue to me?

@kunaljubce
Contributor

@MrPowers @SemyonSinchenko Did we ever brainstorm on this? I have lost count of the number of times I would have loved functionality like this. Would love to take this up.

@SemyonSinchenko
Collaborator

SemyonSinchenko commented Mar 3, 2024

@kunaljubce Because of spark-connect we cannot use _jvm here. So the only option known to me was to parse the plan. But @MrPowers does not like this idea (see arguments here: #159).

JFYI: This function does exactly this job: it estimates the size of the DF in bytes (megabytes) without computation.
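To illustrate what parsing the plan could look like, here is a rough sketch and not the code from #159. It assumes the plan text is what `df.explain(mode="cost")` prints, where the optimized logical plan carries a `Statistics(sizeInBytes=...)` annotation. The function name `parse_size_in_bytes` and the sample plan string are made up for illustration, and the text format is not a stable API, so the regular expression may need adjusting between Spark versions.

```python
import re

# Matches e.g. "sizeInBytes=7.8 KiB" or "sizeInBytes=24.0 B".
# Assumption: units are B, KiB, MiB, ... (some versions may print KB, MB, ...).
_SIZE_RE = re.compile(
    r"sizeInBytes=([0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?)\s*([KMGTPE]i?B|B)"
)


def parse_size_in_bytes(plan_text):
    """Return the first sizeInBytes statistic in plan_text as an int, or None.

    The first match is taken because the root node of the optimized
    logical plan is printed first.
    """
    match = _SIZE_RE.search(plan_text)
    if match is None:
        return None
    value, unit = match.groups()
    # "B" -> 1024**0, "KiB"/"KB" -> 1024**1, "MiB"/"MB" -> 1024**2, ...
    exponent = "BKMGTPE".index(unit[0])
    return int(float(value) * 1024 ** exponent)


# Illustrative plan text, written by hand for this example.
sample_plan = (
    "== Optimized Logical Plan ==\n"
    "Range (0, 1000, step=1, splits=Some(8)), "
    "Statistics(sizeInBytes=7.8 KiB, rowCount=1.00E+3)\n"
)
size = parse_size_in_bytes(sample_plan)
```

Because the printed value is already rounded for display, the parsed number is approximate, which is usually fine for a broadcast-or-not decision.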

@SemyonSinchenko
Collaborator

So, in my opinion, there is no way to do it (except collecting to the driver, which is a terrible option). @kunaljubce If you have a different vision of how it may be implemented, or new arguments for my discussion with @MrPowers (his argument was that the plan representation is very unstable), we can raise this topic again!
