Support parallelising processing #60

Open
RickMoynihan opened this issue May 12, 2021 · 0 comments

Comments

@RickMoynihan
Member

We can support a `-p N` parallelism flag that runs the transformation in N threads, hopefully cutting processing time drastically.

This should be relatively straightforward by:

  1. Inspecting the dialect data and deriving from it the end-of-line tokens, etc.
  2. Looking at the file length in bytes
  3. Crudely splitting the file into N equal portions
  4. Refining the split offsets slightly by scanning from there to the next true end of line
  5. Winding N streams to their appropriate offsets
  6. Reading the header row and giving it to each thread
  7. Passing each stream to one of N threads
  8. Having each thread output to a separate RDF file (appropriately numbered)
  9. Potentially, if asked, supporting a concat flag that re-concatenates the files together
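Steps 2–4 above could be sketched roughly as follows (a hypothetical `split_offsets` helper in Python for illustration; it assumes the dialect's end-of-line token is a plain newline and that quoted fields contain no embedded newlines — a real implementation would consult the dialect data from step 1):

```python
import os

def split_offsets(path, n):
    """Compute up to n contiguous byte ranges over the file,
    crudely equal (step 3), each boundary refined to land just
    after a true end of line (step 4)."""
    size = os.path.getsize(path)          # step 2: file length in bytes
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, n):
            f.seek(i * size // n)         # crude equal split point
            f.readline()                  # scan forward to the next end of line
            offsets.append(f.tell())
    offsets.append(size)
    # Drop empty ranges (possible when the file is tiny relative to n)
    return [(a, b) for a, b in zip(offsets, offsets[1:]) if a < b]
```

Each worker would then seek its own stream to the start of its range (step 5), be handed the header row read once up front (step 6), and write its output to its own numbered file (step 8). Note that because only the refinement touches file contents, the split cost is O(n) small reads, not a full parse.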

The key to making this fast is to avoid parsing the whole CSV into batches in the splitting step. Any final concat should likewise operate purely at the file level, without parsing any RDF.
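A file-level concat (the optional flag in step 9) might look like the sketch below; `concat_outputs` is an illustrative name, not the actual CLI. Note this is only safe for line-oriented serialisations such as N-Triples or N-Quads, where concatenating valid documents yields a valid document:

```python
import shutil

def concat_outputs(part_paths, out_path):
    """Stitch the numbered per-thread output files together
    byte-for-byte, with no RDF parsing."""
    with open(out_path, "wb") as out:
        for part in part_paths:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
```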
