[Help wanted] preprocessing shuffling #1006

Open
vince62s opened this Issue Oct 24, 2018 · 3 comments

Contributor

vince62s commented Oct 24, 2018

I just committed a change to raise an error when using the -shuffle flag in preprocess.py.
The doc was not in line with the code (shuffling was never implemented).

The point is as follows:
We could use the torchtext shuffling capability as long as we do not shard.
However, since torchtext does not support data streaming, we had to implement sharding to avoid memory overflow.
Hence the data needs to be shuffled before sharding.
This can be done in bash, but if someone is interested in doing this in preprocess.py, you are welcome to.
Cheers.
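To make the "shuffle before sharding" ordering concrete, here is a minimal sketch in plain Python. It is not the preprocess.py code: `shuffle_parallel` and `make_shards` are hypothetical names, and it assumes the raw source and target lines fit in memory even when the built torchtext dataset does not.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def shuffle_parallel(
    src: Sequence[str], tgt: Sequence[str], seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle source and target lines with one shared permutation.

    Hypothetical helper: it keeps each source line aligned with its target line.
    """
    if len(src) != len(tgt):
        raise ValueError("source and target must have the same number of lines")
    order = list(range(len(src)))
    random.Random(seed).shuffle(order)
    return [src[i] for i in order], [tgt[i] for i in order]


def make_shards(
    src: Sequence[str], tgt: Sequence[str], shard_size: int
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield consecutive (src, tgt) shards of at most shard_size lines each."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    for start in range(0, len(src), shard_size):
        end = start + shard_size
        yield list(src[start:end]), list(tgt[start:end])


if __name__ == "__main__":
    src_lines = [f"source sentence {i}" for i in range(10)]
    tgt_lines = [f"target sentence {i}" for i in range(10)]
    # Shuffle first, then cut into shards, so every shard is a random sample.
    shuf_src, shuf_tgt = shuffle_parallel(src_lines, tgt_lines, seed=42)
    for idx, (s, t) in enumerate(make_shards(shuf_src, shuf_tgt, shard_size=4)):
        print(idx, len(s), len(t))
```

Shuffling once up front means each shard draws from the whole corpus; shuffling only inside each shard would leave the shards themselves in corpus order.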


vikrantsharma7 commented Oct 26, 2018

Just some thoughts: the sample repo looks like a good choice. It uses much less memory than shuf.
In its absence, we could fall back to shuf.
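One way a shuffler can use less memory than holding every line is to keep only the byte offset of each line, shuffle the offsets, and then seek into the file. Whether the tool mentioned above works this way is not stated in this thread; the sketch below only illustrates the idea, and `line_offsets` and `shuffle_files` are hypothetical names.

```python
import random
from typing import List, Sequence


def line_offsets(path: str) -> List[int]:
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets


def shuffle_files(
    in_paths: Sequence[str], out_paths: Sequence[str], seed: int = 0
) -> None:
    """Write shuffled copies of parallel files using one shared permutation.

    Only line offsets are kept in memory, not the lines themselves.
    """
    if not in_paths or len(in_paths) != len(out_paths):
        raise ValueError("need matching, non-empty input and output path lists")
    all_offsets = [line_offsets(p) for p in in_paths]
    n = len(all_offsets[0])
    if any(len(o) != n for o in all_offsets):
        raise ValueError("all input files must have the same number of lines")
    order = list(range(n))
    random.Random(seed).shuffle(order)
    for in_path, out_path, offsets in zip(in_paths, out_paths, all_offsets):
        with open(in_path, "rb") as fin, open(out_path, "wb") as fout:
            for i in order:
                fin.seek(offsets[i])
                line = fin.readline()
                # The last line of a file may lack a trailing newline.
                if not line.endswith(b"\n"):
                    line += b"\n"
                fout.write(line)
```

The trade-off is random disk access: the memory footprint is one integer per line, but each output line costs a seek, so this is slower than an in-memory shuffle on spinning disks.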

Contributor

JoostvDoorn commented Oct 28, 2018

I have implemented shuffling inside build_save_in_shards_using_shards_size; I can make a pull request for this later.

Contributor Author

vince62s commented Jan 8, 2019

@JoostvDoorn the code base has changed, but do you intend to submit a PR soon?
Thanks.
