Support more generic stream input #162

Open
mattday opened this issue Apr 20, 2021 · 3 comments
mattday commented Apr 20, 2021

First, thanks for an excellent library. It seems to be the best in class.

CSVReader has a templated stream constructor where the template parameter can be derived from a std::istream. As the documentation suggests, this works well with std::stringstream and std::ifstream. However, it can't be used with std::istream itself, or certain other derived streams. I think this is because it tries to move the stream into the StreamParser, so the code doesn't compile if the stream doesn't support this.
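The compile failure described above can be reproduced without the library. The following is a minimal sketch, not the library's code: `OwningSource` is a hypothetical stand-in for a parser that moves its stream into a member, which is what the comment suspects `StreamParser` does. The `static_assert`s rely only on the standard library, where the move constructor of `std::istream` is protected.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>
#include <type_traits>
#include <utility>

// std::istream has a protected move constructor, so outside code cannot move it.
static_assert(!std::is_move_constructible<std::istream>::value,
              "std::istream is not movable by user code");
// The concrete stream types mentioned in the documentation are movable.
static_assert(std::is_move_constructible<std::stringstream>::value,
              "std::stringstream is movable");
static_assert(std::is_move_constructible<std::ifstream>::value,
              "std::ifstream is movable");

// Hypothetical stand-in for a parser that takes ownership of its stream.
template <typename TStream>
class OwningSource {
public:
    explicit OwningSource(TStream&& stream) : stream_(std::move(stream)) {}

    // Reads one line from the owned stream.
    std::string first_line() {
        std::string line;
        std::getline(stream_, line);
        return line;
    }

private:
    TStream stream_;  // Requires TStream to be move-constructible.
};
```

`OwningSource<std::stringstream>` compiles and works, while `OwningSource<std::istream>` cannot be constructed because the member initialization needs the protected move constructor.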

It would be incredibly useful to be able to use the parser with more generic streams, giving users the ability to read from compressed files. For example, this might be via the gzip_decompressor or the bzip2_decompressor in boost::iostreams. Reading from a gzip compressed CSV file is trivial in languages like Python. It's also a common requirement given how inefficient it can be to store a lot of data in a CSV file.

It seems as though the parser can almost support this already, so it probably doesn't need significant changes.

@vincentlaucsb
Owner

Sorry for taking so long to respond, but this is a great suggestion. CSVReader didn't always have a templated stream constructor. Previously, it really only supported reading from std::string (and std::ifstream by copying data to an internal std::string--very inefficient).

I essentially had to rewrite it so it could support parsing data from std::stringstream, memory-mapped files, and std::ifstream without too much duplication of code. Supporting reading CSVs directly from compressed files was also a motivation for this.

This was accomplished by creating the IBasicCSVParser interface, from which the generic StreamParser class derives.

If you want to play around, you can see if StreamParser can be generalized to std::istream. If not, you can always create your own IBasicCSVParser implementation to work with a specific underlying data type. Currently, the library uses StreamParser for std::istream-derived types and MmapParser for memory-mapped IO.

@jamesmarsh99

Hi Vincent, I would like to echo Matt and thank you for your excellent library. How were you able to read from a gzipped file? Unfortunately, passing a std::istream to the CSVReader constructor does not work, since the move constructor of std::istream is protected. Any help would be greatly appreciated.

@MichaelSteffens

Hi Vincent, I have played around, trying to adapt the StreamParser to work with std::istream. The move part is easy: initialize a member reference rather than moving the stream. That works when the actual stream is a file. But I'm afraid I now face a more fundamental problem.
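The reference-member idea can be sketched as follows. This is an illustration under assumptions, not the actual patch: `BorrowingSource` is a hypothetical class, and the caller has to keep the stream alive for as long as the object uses it.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical parser front end that borrows the stream instead of owning it.
class BorrowingSource {
public:
    // Binds to any std::istream, including types that cannot be moved.
    explicit BorrowingSource(std::istream& stream) : stream_(stream) {}

    // Reads the next line into `out`; returns false once the input is exhausted.
    bool next_line(std::string& out) {
        return static_cast<bool>(std::getline(stream_, out));
    }

private:
    std::istream& stream_;  // Non-owning: the caller controls the lifetime.
};
```

The trade-off is ownership: a moved-in stream cannot dangle, whereas a borrowed one can if the caller destroys it first.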

A generic stream input should also support pipes, so that you can consume CSV data from a decompression filter or a similar source. But for pipes, std::istream::tellg reports nothing useful, only pos_type(-1), so StreamParser::next cannot determine the base class member IBasicCSVParser::source_size upfront and ends up with zero. It then immediately "declares" EOF and terminates, without even catching the tellg error.
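The tellg behaviour is easy to reproduce with a stream buffer that does not support seeking. In this sketch, `ForwardOnlyBuf`, `is_seekable`, and `drain` are hypothetical names used for illustration and are not part of the library. The buffer stands in for a pipe or decompression filter: it inherits the default `std::streambuf::seekoff`, which reports failure, so `tellg()` returns `pos_type(-1)`.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>
#include <utility>

// Hypothetical forward-only buffer: readable, but with no seek support,
// because it does not override seekoff()/seekpos().
class ForwardOnlyBuf : public std::streambuf {
public:
    explicit ForwardOnlyBuf(std::string data) : data_(std::move(data)) {
        char* begin = &data_[0];
        setg(begin, begin, begin + data_.size());  // Expose data_ as the get area.
    }

private:
    std::string data_;
};

// Returns true if the stream can report its current position.
bool is_seekable(std::istream& in) {
    return in.tellg() != std::istream::pos_type(-1);
}

// Counts the remaining bytes by reading until EOF, which works
// even when the total size cannot be known upfront.
std::size_t drain(std::istream& in) {
    char chunk[4096];
    std::size_t total = 0;
    while (in.read(chunk, sizeof(chunk)) || in.gcount() > 0) {
        total += static_cast<std::size_t>(in.gcount());
    }
    return total;
}
```

A parser that wants to accept such sources would presumably have to treat the end of input as something discovered while reading, as `drain` does, rather than as a size computed before the first read.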

Is the logic of IBasicCSVParser intended to work with implementations where data needs to be processed before the final source size is known?

vincentlaucsb self-assigned this May 26, 2024