In-memory large dataframe processing #4

Open
romain-intel opened this issue Dec 2, 2019 · 5 comments
@romain-intel (Contributor) commented Dec 2, 2019

Metaflow tries to make the life of data scientists easier; this sometimes means providing ways to optimize certain common but expensive operations. Processing large dataframes in memory can be difficult, and Metaflow could provide ways to do this more efficiently.

@tduffy000 commented Dec 12, 2019

@romain-intel, is the idea to support this locally or on an AWS instance?

I'm wondering whether the idea is to integrate more tightly with Apache Spark (via pyspark), or to find a way, like an IterableDataset in PyTorch, to split loading among workers and have the data loaded at model time.
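A minimal sketch of the IterableDataset idea, assuming the features are already split into numeric CSV shards (the file names are hypothetical): each DataLoader worker reads only its own slice of the shards, so no single process has to hold the full dataframe.

```python
import pandas as pd
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class ShardedFrameDataset(IterableDataset):
    """Stream rows from CSV shards, splitting the shards across DataLoader workers."""

    def __init__(self, shard_paths):
        self.shard_paths = shard_paths

    def __iter__(self):
        info = get_worker_info()
        # Single-process loading reads every shard; otherwise each worker takes every n-th shard.
        paths = self.shard_paths if info is None else self.shard_paths[info.id::info.num_workers]
        for path in paths:
            for chunk in pd.read_csv(path, chunksize=10_000):
                # Assumes purely numeric columns so rows can become tensors directly.
                for row in chunk.itertuples(index=False):
                    yield torch.tensor(row, dtype=torch.float32)

# Hypothetical shard layout; in a flow these paths could be produced by an earlier @step.
dataset = ShardedFrameDataset([f"features_{i}.csv" for i in range(8)])
loader = DataLoader(dataset, batch_size=256, num_workers=4)
```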

I imagine the difficulty might be in the atomicity of a @step, given that a feature selection & engineering step would be wholly separated from the model step. I know from experience that there are still a lot of pandas fans out there.

Would be curious to hear your thoughts on this.

@savingoyal (Contributor) commented Dec 12, 2019

@tduffy000 We have an in-house dataframe implementation that provides faster primitive operations with a lower memory footprint than pandas. It is supported both on a local instance and in the cloud. You can use this implementation inside a step or even outside of Metaflow (just like the metaflow.s3 client).
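The in-house dataframe API itself isn't shown in this thread; for reference, the standalone usage pattern being alluded to is roughly that of the metaflow.S3 client, which works both inside and outside a flow (bucket and key names below are placeholders):

```python
from metaflow import S3

# Standalone use outside a flow: point the client at an S3 prefix.
with S3(s3root="s3://my-bucket/my-prefix") as s3:
    with open("features.parquet", "rb") as f:
        s3.put("features.parquet", f.read())
    obj = s3.get("features.parquet")  # returns an S3Object
    local_path = obj.path             # the object is downloaded to a local temp file
```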

@leftys commented Dec 15, 2019

Maybe Metaflow could be combined with Dask, which supports bigger-than-memory dataframes, to solve this issue. I am not sure if/how it would be possible to serialize and restore Dask's large, lazily evaluated dataframes between steps, though.
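One way to sidestep the serialization question, as a sketch only (the paths are illustrative, not an existing Metaflow integration), is to materialize the Dask dataframe to Parquet at the end of a step, pass only the path downstream, and reload lazily in the next step:

```python
import dask.dataframe as dd

# End of one step: materialize the lazy dataframe to Parquet instead of pickling it.
df = dd.read_csv("raw_data/*.csv")
df = df[df.amount > 0]
df.to_parquet("intermediate/clean_parquet/")  # triggers computation and writes partitions

# Start of the next step: rebuild a lazy dataframe from the stored partitions.
clean = dd.read_parquet("intermediate/clean_parquet/")
print(clean.amount.mean().compute())
```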

@juarezr commented Feb 3, 2020

Maybe something like a dataflow transferred between steps, as in Bonobo.

Also, here is another example of a software product that uses datapickle and Dask to run dataflows clustered in the cloud.
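For context, a Bonobo dataflow is just a chain of generator functions wired into a graph, with rows streamed between nodes one at a time rather than passed as a whole dataframe; a minimal sketch (the transform logic is made up):

```python
import bonobo

def extract():
    # Each yield is one row flowing through the graph.
    for i in range(5):
        yield i, i * 2

def transform(key, value):
    # Rows arrive as positional arguments from the upstream node.
    yield key, value ** 2

def load(key, value):
    print(key, value)

graph = bonobo.Graph(extract, transform, load)

if __name__ == "__main__":
    bonobo.run(graph)
```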

@benjaminbluhm commented Feb 7, 2020

I think the ability to use Apache Spark within Metaflow would be extremely useful. When your feature engineering workflow is written in pyspark, it's a pain to translate everything to pandas, and it's also hard to predict how well that will work on large datasets.
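As a rough illustration of that translation cost (column and file names are made up), the same aggregation in pyspark and in pandas:

```python
from pyspark.sql import SparkSession, functions as F
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# pyspark: stays lazy and can run on a cluster.
sdf = spark.read.parquet("events.parquet")
agg_spark = sdf.groupBy("user_id").agg(F.mean("amount").alias("avg_amount"))

# pandas: the whole dataset must fit in one process's memory.
pdf = pd.read_parquet("events.parquet")
agg_pandas = (pdf.groupby("user_id", as_index=False)["amount"].mean()
                 .rename(columns={"amount": "avg_amount"}))
```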
