Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added async implementation of CSV reader #199

Closed
wants to merge 1 commit into from

Conversation

awestlake87
Copy link

Motivation

I was shopping for a crate that had a solid CSV parser and this one popped up. I'm looking to use it in a project that needs to parse large CSV files in WebAssembly.

That poses a problem because the standard io::Read and io::Write imply blocking I/O, which is a no-go on WASM due to its javascript interop. Another constraint is that these are large CSV files. I don't want to have to hold them in memory at any given time, so reading it all to a buffer, then parsing won't work very well. Ideally, I want to be able to read in chunks and emit a stream of records as I go, so I'm really only left with async I/O (AFAIK).

I read up on issue #141 and it looks like the main reason it didn't move forward was because the rust async ecosystem was still in flux? I think rust async is going in the direction of stability, and while standard AsyncRead and AsyncWrite traits haven't been stabilized, I wonder if this crate might be able to use patterns from other async libraries to implement an AsyncReader in the meantime.

AsyncReader

I've been able to use async-compression in my project, so I went ahead and modeled an AsyncReader off of the Stream<Item = io::Result<Bytes>> pattern in that crate and it appears to work.

I basically started from the synchronous Reader class and tried to change as little as possible about the interface. These new structures and dependencies are gated behind an additive "async" feature so they shouldn't interfere with any downstream crates.

You can test it by running the following command:

cargo test --features async

Serde support for async CSV parsing is questionable since serde is not currently built to support async serialize/deserialize operations. For this reason, I left their impls out of the proof-of-concept

Questions

  1. Is this something you want the CSV crate to allow at this point? I know the async ecosystem is still very new, so I wouldn't blame you at all if you wanted to hold off on committing to an impl.
  2. If so, do you have any feedback on the current design direction?
  3. If not, are there any alternatives you are willing to discuss at this point?

Alternatives

  1. Expose the parsing logic

    I'm mainly interested in reusing a solid foundation for a CSV parser. The reader is largely unimportant to me as I can implement it myself, I just want to be able to reuse an existing state machine to parse it.

  2. Create a new experimental async-csv crate to hold this logic while async stabilizes
  3. Hold off until the async ecosystem stabilizes

    I'll probably create my own personal crate based on the parsing logic in here to tide me over in the meantime, so this is similar to option 2, I just wouldn't publicly maintain it.

Closing Thoughts

This is far from a finished impl. I basically wanted to throw something together to get your feedback on if and how this should continue. I haven't changed any of the docs or done any in-depth testing of this feature aside from the simple sanity checks. It's just a proof-of-concept so I can verify that something like this will work for my purposes.

If you do decide to move forward with this idea, I'm willing to help write the docs, examples, and tests for this feature if you want.

Also, there's no rush on this, so just let me know what you think when you get time.

@BurntSushi
Copy link
Owner

BurntSushi commented May 5, 2020

Thank you for the detailed PR description and attempt here. I really do appreciate it. You made it very easy for me to review this and respond. :-)

That poses a problem because the standard io::Read and io::Write imply blocking I/O, which is a no-go on WASM due to its javascript interop.

If this is really true, then to me, it sounds like the entire ecosystem has to fracture between crates supporting std::io and some async alternative. To me, that's untenable. Putting aside concerns about this specific crate, there is just absolutely no way I could get on board with something like this unless there was a community wide consensus that this is indeed the only and best way to solve this problem.

I'm not too involved in the async space, so I don't really know what the current thinking is, but I really hope there is some path to allow std::io stuff to be used in an async context. smol at least seems to have figured something out here. Perhaps you could use that. I had thought that tokio had similar adapters.

Also, I'd like to give a friendly ping to @stjepang in case you want to add any thoughts to this. :-) Always happy to hear your opinion!

Is this something you want the CSV crate to allow at this point? I know the async ecosystem is still very new, so I wouldn't blame you at all if you wanted to hold off on committing to an impl.

I think aside from my concern above, I'm still a Hard No on this. This PR effectively duplicates a significant chunk of the code in this library and adds a lot of async complexity that I don't grok yet. Even if this was the best solution to the problem, I'm simply not in a position to maintain something like this because I've really not familiarized myself with async stuff. It would be irresponsible of me to assume maintenance of this code.

I'm mainly interested in reusing a solid foundation for a CSV parser. The reader is largely unimportant to me as I can implement it myself, I just want to be able to reuse an existing state machine to parse it.

It's already exposed. That's exactly why csv-core exists. It's a no_std library with no ties to I/O at all. It is precisely what you need: a fast implementation of the hard parsing parts. The csv crate is basically just a (substantial) facade around csv-core.

I would definitely encourage you to use any means necessary to move forward to solve your problem. But I really hope, for the sake of the Rust ecosystem, that something like this isn't really necessary in the longer term.

@awestlake87
Copy link
Author

Awesome, thanks for the quick response. csv-core looks like exactly what I need to move forward.

As far as the other feedback about the async/sync ecosystem split, I think that it is an unfortunate reality. I think blocking code is just so fundamentally different from async code that it often requires a different set of functionality.

I'm definitely not an expert, but I believe adapters like smol::reader exist to bridge existing io::Read structures to the async space, not the code that uses them. Even if you have async adapters for io::Read, you can't change the parsing code to use them without rewriting parts of it. @stjepang might have some more to say about it if he wants to, though.

Here's an article I remember reading about async/await in the node.js ecosystem that explains this problem pretty well. Basically you can't get around this ecosystem split in an ergonomic way without doing something like goroutines, which have other caveats that were deemed worse than the current async/await model for Rust.

But in any case, I think csv-core works for me. I really appreciate you taking the time to get back to me so quickly, and I agree that this adds quite a bit of code and complexity for a use-case that's probably pretty niche. You can go ahead and close this PR if you want.

@BurntSushi
Copy link
Owner

All righty, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants