Parse rows on demand? #91
The simple answer is: yes. And you do it by implementing the `io::Read` trait. If you can't do that, then you're probably signing up for some pain by insisting on reading CSV data asynchronously. An API based around lines isn't as simple as you might think, since you cannot guarantee that one line corresponds to one record. So if you want to feed a CSV parser lines, then you actually wind up needing a full incremental parser. This is what the `csv-core` crate provides. With all that said, I haven't used futures, so I don't know what to tell you. I don't know if there's something you can do to provide an implementation of `io::Read`.
@BurntSushi thanks very much, I'll see if I can get that working.
@BurntSushi a quick follow-up if you don't mind. Is there a way to use a cursor without capturing it?

```rust
fn run(&self, input: S) -> Box<Stream<Item = SensorEvent, Error = ()>> {
    let source_field = self.config.get_string("source").unwrap();
    let mut cursor: Cursor<Vec<u8>> = Cursor::new(vec![]);
    let mut reader = csv::Reader::from_reader(cursor); // <--- captures it
    let modified = input.filter_map(move |mut event| {
        let input_line = match event.data_mut().get(&source_field) {
            Some(&SensorDataValue::String(ref s)) => s.clone(),
            _ => return None,
        };
        cursor.write(input_line.as_bytes()); // <-- can't write to it here anymore
        for result in reader.deserialize() {
            let record: HashMap<String, SensorDataValue> = result.unwrap();
            println!("{:?}", record);
        }
        Some(event)
    });
    Box::new(modified)
}
```

So in this example I'm trying to "fake" the cursor with an in-memory vec, but it's consumed by the `csv::Reader` so I can't push into it anymore.
@daschl In your example, you're transferring ownership of the cursor to the CSV reader, so you can't use it after that. You could instead ask the CSV reader for a mutable reference to the underlying reader, and that might work. So e.g.:

```rust
reader.get_mut().write(input_line.as_bytes());
```
@BurntSushi your approach works, but I can't figure out how to make the Reader consume the data. I tried setting the cursor back as well as seeking, but the reader never emits anything.

```rust
fn run(&self, input: S) -> Box<Stream<Item = SensorEvent, Error = ()>> {
    let source_field = self.config.get_string("source").unwrap();
    let cursor: Cursor<Vec<u8>> = Cursor::new(vec![]);
    let mut reader = csv::Reader::from_reader(cursor);
    let modified = input.filter_map(move |mut event| {
        let input_line = match event.data_mut().get(&source_field) {
            Some(&SensorDataValue::String(ref s)) => s.clone(),
            _ => return None,
        };
        let len = input_line.len() as i64;
        reader.get_mut().write_all(input_line.as_bytes()).unwrap();
        reader.get_mut().seek(::std::io::SeekFrom::Current(-len)).unwrap(); // tried all kinds of approaches
        for result in reader.deserialize() { // <--- we never get into the loop
            let record: HashMap<String, SensorDataValue> = result.unwrap();
            println!("IN ---> {:?}", record);
        }
        Some(event)
    });
    Box::new(modified)
}
```

When debug printing the reader I can see that the internal cursor of course got the data, but no parser state advanced.

edit: actually, I think this might be correct, but it still doesn't emit anything. I must be missing something:

```rust
let start = reader.get_mut().position();
reader.get_mut().seek(::std::io::SeekFrom::End(0)).unwrap();
reader.get_mut().write_all(input_line.as_bytes()).unwrap();
reader.get_mut().set_position(start);
```
@daschl Sorry, but I don't have time to dig into your code. You might increase the chances of that happening by providing a full example that I can just run. In the mean time, I'll note that this program works for me:

```rust
extern crate csv;

use std::collections::HashMap;
use std::io::{Cursor, Write};

fn main() {
    let cursor = Cursor::new(vec![]);
    let mut reader = csv::Reader::from_reader(cursor);
    reader.get_mut().write_all(b"h1,h2,h3\nfoo,bar,baz\nabc,mno,xyz").unwrap();
    reader.get_mut().set_position(0);
    for result in reader.deserialize() {
        let record: HashMap<String, String> = result.unwrap();
        println!("{:?}", record);
    }
}
```

The output is the two rows deserialized into maps, e.g. `{"h1": "foo", "h2": "bar", "h3": "baz"}` and `{"h1": "abc", "h2": "mno", "h3": "xyz"}` (the `HashMap` key order will vary).
@BurntSushi I think I figured out what the problem is. In every sample I can find, the headers and the first line of the CSV come from the same input buffer, in which case it works fine. But if the header line comes in on its own and decode is called, nothing happens, as expected. Then when the second line comes along, the reader doesn't seem to "remember" the headers from the first line and still doesn't emit anything.

This works (your example, very slightly modified):

```rust
extern crate csv;

use std::collections::HashMap;
use std::io::{Cursor, Write};

fn main() {
    let cursor = Cursor::new(vec![]);
    let mut reader = csv::Reader::from_reader(cursor);
    let position = reader.get_mut().position();
    reader.get_mut().write_all(b"h1,h2,h3\nfoo,bar,baz\nabc,mno,xyz\n").unwrap();
    reader.get_mut().set_position(position);
    for result in reader.deserialize() {
        let record: HashMap<String, String> = result.unwrap();
        println!("{:?}", record);
    }
}
```

But this doesn't emit the second line, just the first one:

```rust
extern crate csv;

use std::collections::HashMap;
use std::io::{Cursor, Write};

fn main() {
    let cursor = Cursor::new(vec![]);
    let mut reader = csv::Reader::from_reader(cursor);
    let position = reader.get_mut().position();
    reader.get_mut().write_all(b"h1,h2,h3\nfoo,bar,baz\n").unwrap();
    reader.get_mut().set_position(position);
    for result in reader.deserialize() {
        let record: HashMap<String, String> = result.unwrap();
        println!("{:?}", record);
    }
    let position = reader.get_mut().position();
    reader.get_mut().write_all(b"abc,mno,xyz\n").unwrap();
    reader.get_mut().set_position(position);
    for result in reader.deserialize() {
        let record: HashMap<String, String> = result.unwrap();
        println!("{:?}", record);
    }
}
```

But even if I disable the headers, the last iteration doesn't emit anything:

```rust
extern crate csv;

use std::io::{Cursor, Write};

fn main() {
    let cursor = Cursor::new(vec![]);
    let mut reader = csv::ReaderBuilder::new().has_headers(false).from_reader(cursor);
    let position = reader.get_mut().position();
    reader.get_mut().write_all(b"h1,h2,h3\nfoo,bar,baz").unwrap();
    reader.get_mut().set_position(position);
    for result in reader.records() {
        println!("{:?}", result);
    }
    let position = reader.get_mut().position();
    reader.get_mut().write_all(b"abc,mno,xyz").unwrap();
    reader.get_mut().set_position(position);
    for result in reader.records() {
        println!("{:?}", result);
    }
}
```

Any idea what I am missing?
Yeah, unfortunately the header logic is quite complicated, and it is most complex when performing seeks. It's made even worse by trying to write to the underlying reader while the CSV reader is trying to maintain its own state. This code almost works:

```rust
extern crate csv;

use std::collections::HashMap;
use std::io::{self, Cursor, Write};

fn main() {
    let cursor = Cursor::new(vec![]);
    let mut reader = csv::Reader::from_reader(cursor);
    let position = reader.position().clone();
    reader.get_mut().write_all(b"h1,h2,h3\nfoo,bar,baz\n").unwrap();
    reader.get_mut().set_position(position.byte());
    reader.seek_raw(io::SeekFrom::Start(position.byte()), position).unwrap();
    for result in reader.deserialize() {
        let record: HashMap<String, String> = result.unwrap();
        println!("{:?}", record);
    }
    let position = reader.position().clone();
    reader.get_mut().write_all(b"abc,mno,xyz\n").unwrap();
    reader.seek_raw(io::SeekFrom::Start(position.byte()), position).unwrap();
    for result in reader.deserialize() {
        let record: HashMap<String, String> = result.unwrap();
        println!("{:?}", record);
    }
}
```

The only flaw is that it emits the header row with its keys mapped to themselves, so you might need some additional state to skip the very first row you read. The only real trick in this scenario is the use of the CSV reader's `seek_raw` method. Most or all of this is actually documented on the CSV reader's API.
@BurntSushi thanks, I think that gives me enough to move further. I'll post the final solution here once I get it working :)
@daschl Awesome! Looking forward to it!
@BurntSushi I think I got it sorted out. The only thing that worries me a bit is that I keep extending my cursor, and I don't want to grow the vec out of bounds. I may just need to reset it back to 0 every time so that the reader only has one line to parse. Note that I now disable the built-in header handling and track the headers myself:

```rust
fn run(&self, input: S) -> Box<Stream<Item = SensorEvent, Error = ()>> {
    let source_field = self.config.get_string("source").unwrap();
    let cursor: Cursor<Vec<u8>> = Cursor::new(vec![]);
    let mut reader = csv::ReaderBuilder::new()
        .has_headers(false) // <--- avoid header specialization
        .from_reader(cursor);
    let mut headers: Option<StringRecord> = None; // <--- store the headers here
    let modified = input.filter_map(move |mut event| {
        let input_line = match event.data_mut().get(&source_field) {
            Some(&SensorDataValue::String(ref s)) => s.clone(),
            _ => return None,
        };
        let position = reader.position().clone();
        reader.get_mut().write_all(input_line.as_bytes()).unwrap();
        reader
            .seek_raw(io::SeekFrom::Start(position.byte()), position)
            .unwrap(); // <--- seek as you said above
        for result in reader.records() {
            if headers.is_none() { // the first record is treated as the headers
                headers = Some(result.unwrap());
                return None; // don't emit the headers as a downstream event
            }
            // <--- still do the serde deserialization and then add it to my event data
            let mut record: HashMap<String, SensorDataValue> =
                result.unwrap().deserialize(headers.as_ref()).unwrap();
            event.data_mut().extend(record.drain());
        }
        Some(event)
    });
    Box::new(modified)
}
```
Ok, so here is what I ended up with: setting the position back to 0 for every row, since I know rust-csv is consuming the line anyway.

```rust
let mut position = reader.position().clone();
position.set_byte(0);
reader.get_mut().set_position(position.byte());
reader.get_mut().write_all(input_line.as_bytes()).unwrap();
reader
    .seek_raw(io::SeekFrom::Start(position.byte()), position)
    .unwrap();
```

I'll close this issue. Thanks very much for your help @BurntSushi!
@daschl Glad you found something that works!
I'd like to say a huge thanks for creating this thread @daschl. I was struggling with this for ages. PS: mind if I use your code snippet?
@Restioson please go ahead and use it, happy I could help someone else with it too. Let me know if you have further questions :)
I wonder if we should get an example together, since it looks like more people need something like this, and since the thread is closed it's kinda hidden. I'll see if I can come up with a self-contained PR for this which showcases it.
Might it even be worth having a helper struct for this?
An example to add to the cookbook would be great!
On a similar note, I've tried to implement this, but to no avail: https://gist.github.com/765d46640b29d8deb6863f4b895bd085 (I get a "missing field" error).
@Restioson I don't have time to read that much code right now, unfortunately. If you can post a smaller (preferably minimal) example that reproduces your problem, then I might be able to take a look.
I'm not very sure which exact bit caused the problem here, but I can probably cut down some of the less necessary stuff, like the full struct definitions and whatnot, so I'll try to get to that tomorrow. (Side note: it looks like I accidentally shared the gist link instead of the playground.)
@BurntSushi I refactored and made my code a bit smaller. However, it works without the cursor hackery.
@Restioson Thanks. The error is much clearer now. I think I said this before, but the key is realizing that any call to `deserialize` will treat the first record it reads as the header row unless headers are disabled.
Wow, I feel dumb! Thanks so so much!
I'm trying to use rust-csv in an async context and I'm having trouble getting this done with the `Reader`. Maybe there is a way to improve the API? (Or I am missing something.)

I'm consuming a `futures::Stream` and want to `map` a line of `String` (which would be my CSV line) into a `HashMap` of the parsed fields.

So my question is: is there a way to "set up" `rust-csv` with all the params and then push lines in as they come and consume the parsed output?

Thanks,
Michael