Reduce number of collect calls #17

Merged
osopardo1 merged 3 commits into Qbeast-io:main from 14-reduce-collect
Oct 25, 2021

osopardo1
Member

This PR solves #14.

This is a workaround that reduces the number of collect calls on the commit log data. The optimal solution would be to process everything with the DataFrame API, but building the structures for the search algorithm requires loading some information into memory.

@osopardo1 osopardo1 requested review from cugni and removed request for cugni October 7, 2021 08:06
@osopardo1 osopardo1 self-assigned this Oct 8, 2021
@eavilaes eavilaes linked an issue Oct 19, 2021 that may be closed by this pull request
Member

@cugni cugni left a comment

I see that we still don't have a clear picture of when and where we call collect. Lazy vals are not a solution; we should avoid them. We should clearly define which classes are transformation classes, where no action is triggered (and which must be fully serializable), and which are action classes, where we actually call collect, head, or similar functions.

I'm working on a version that uses the Dataset interface more extensively in place of the Seq one. If it works, we can analyze that.
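
To make the transformation/action split concrete, here is a minimal sketch. The names `BlockEntry`, `RevisionFilter`, and `CubeIndexLoader` are hypothetical and are not taken from the codebase; only the Spark calls (`filter`, `groupBy`, `agg`, `as`, `collect`) are real API. It assumes a commit log that can be read as a typed Dataset.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.sum

// Hypothetical record standing in for one entry of the commit log.
case class BlockEntry(cube: String, revision: Long, elementCount: Long)

// Transformation class: only extends the lazy query plan, never runs a job.
// The filter closure captures `this`, so the class must be Serializable.
class RevisionFilter(revision: Long) extends Serializable {
  def apply(log: Dataset[BlockEntry]): Dataset[BlockEntry] =
    log.filter(_.revision == revision)
}

// Action class: the one place where data is brought to the driver.
object CubeIndexLoader {
  def load(entries: Dataset[BlockEntry]): Map[String, Long] = {
    import entries.sparkSession.implicits._
    entries
      .groupBy("cube")
      .agg(sum("elementCount"))
      .as[(String, Long)] // tuple columns are mapped by position
      .collect()          // the only action in this sketch
      .toMap
  }
}
```

With this layout, any number of transformation classes can be chained without triggering work, and the number of collect calls can be audited by looking at the action classes alone.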

@osopardo1 osopardo1 merged commit b4f061d into Qbeast-io:main Oct 25, 2021
@osopardo1 osopardo1 deleted the 14-reduce-collect branch October 27, 2021 12:02
This pull request was closed.
Development

Successfully merging this pull request may close these issues.

Reading all data gets delayed on collect()
3 participants