Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove from_tabular_file in ArrowDataset #273

Open
mariosasko opened this issue Jan 14, 2021 · 4 comments · May be fixed by #274
Open

Remove from_tabular_file in ArrowDataset #273

mariosasko opened this issue Jan 14, 2021 · 4 comments · May be fixed by #274
Labels
dataset Issues or pull requests related to datasets

Comments

@mariosasko
Copy link
Collaborator

ArrowDataset.from_tabular_file is similar to TabularDataset's __init__. I think it's safe to remove this function. The same effect can be achieved with:

ArrowDataset.from_dataset(TabularDataset(...), ...)

E.g. #267 introduced some changes to TabularDataset's __init__ and this logic was not added to ArrowDataset. By deleting from_tabular_file, we won't have such problems in the future.

cc @ivansmokovic @mttk @FilipBolt

@mariosasko mariosasko added the dataset Issues or pull requests related to datasets label Jan 14, 2021
@mariosasko mariosasko added this to To discuss in Podium - external release Jan 14, 2021
@mttk
Copy link
Member

mttk commented Jan 14, 2021

Agreed but I think the difference is that .from_dataset has the examples in-memory, while from_tabular_file doesn't actually load them (they are lazily loaded via a genexp). Correct me if I'm wrong @ivansmokovic
I guess we can abstract the loading logic out of TabularDataset and reuse it instead of having the code duplicated.

Btw; what's the purpose of having a disk backed dataset if the dataset is already loaded in memory (e.g., using from_dataset)?

@ivansmokovic
Copy link
Collaborator

@mttk You are right. The reason to reimplement this in ArrowDataset was to avoid loading the whole dataset into memory.
We could abstract the loading code out in some way, but that just brings additional complexity to the user interface. Let's update this method and leave this as-is for now.

@mariosasko
Copy link
Collaborator Author

I'm fine with abstacting the logic out to avoid duplication or we can wait (to follow the Rule of three)

that just brings additional complexity to the user interface

Not if we keep it private.

@mttk mttk linked a pull request Jan 14, 2021 that will close this issue
@mttk
Copy link
Member

mttk commented Jan 14, 2021

@mariosasko @ivansmokovic @FilipBolt please take a look at #274

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dataset Issues or pull requests related to datasets
Projects
Development

Successfully merging a pull request may close this issue.

3 participants