Add support for compressed QMDL#970
Conversation
This reworks the QmdlWriter to output gzipped QMDL files by default, and allows QmdlReader to operate on either compressed or uncompressed QMDLs. QmdlReader has been significantly rewritten to expose a single AsyncRead interface to both compressed and uncompressed QMDL sources.
This'll balance the enum size given QmdlWriter's larger size
| self.total_written += msg.data.len(); | ||
| // for a gzipped file, we can't use `msg.data.len()` to | ||
| // determine the number of bytes written, so we have to | ||
| // manually do a `write_all()` type loop |
There was a problem hiding this comment.
i'm not understanding this and i'm hoping you can help me
if what we're tracking here is the number of uncompressed bytes, why can't we still use write_all and msg.data.len()?
There was a problem hiding this comment.
ah you're totally right, this is an outdated comment/implementation from when i was trying to track total_bytes_written rather than total_uncompressed_bytes
There was a problem hiding this comment.
ah ok. that makes sense
i do want to flag tho that when i try changing this back to use write_all, the tests fail so there may be another reason to use this approach which i also don't understand right now lol
There was a problem hiding this comment.
i wager it's due to a missing .flush() call -- i'll push a commit w/ working tests in a sec
bmw
left a comment
There was a problem hiding this comment.
i didn't make the time to really test this (at least not yet), but i wanted to share the comments i had after reading the code and playing with things just a little bit so you could continue working on this
for my benefit at least as much as yours for when i come back to this, my comment above about continuing to use write_all in write_container has not been fully addressed yet
| let compressed = qmdl_path.ends_with(".gz"); | ||
| let qmdl_file_size = qmdl_file.metadata().await.unwrap().len(); | ||
| let mut qmdl_reader = QmdlReader::new(qmdl_file, Some(qmdl_file_size as usize)); | ||
| let mut qmdl_reader = QmdlReader::new(qmdl_file, compressed, Some(qmdl_file_size as usize)); |
There was a problem hiding this comment.
the qmdl_file_size value won't work here if compressed is true right? it should probably be omitted like it was above
| start_time: start_time.into(), | ||
| last_message_time: Some(last_message_time.into()), | ||
| qmdl_size_bytes: metadata.size() as usize, | ||
| uncompressed_qmdl_size_bytes: metadata.size() as usize, |
There was a problem hiding this comment.
isn't metadata.size() going to be the compressed size for qmdl.gz files?
fixing this seems tricky. it seems to me like we either have to
- read each file in its entirety to determine the real uncompressed file size
- do the refactoring i believe you were talking about of removing the need for tracking max file size entirely by working at the level of HDLC containers
the latter seems much cleaner, but idk how much work it is
what do you think?
| reader: BufReader::new(reader), | ||
| bytes_read: 0, | ||
| max_bytes, | ||
| buf_reader: BufReader::new(QmdlAsyncReader::new( |
There was a problem hiding this comment.
if we're going to continue accepting max_uncompressed_bytes in this function, what about using an approach something like this and dropping the handling of max bytes inside QmdlAsyncReader entirely? i think it'd allow us to simplify the code significantly
| { | ||
| let entry = | ||
| ZipEntryBuilder::new(format!("{qmdl_idx}.qmdl").into(), Compression::Stored); | ||
| let extension = if compressed { "qmdl.gz" } else { "qmdl" }; |
There was a problem hiding this comment.
i didn't test this, but won't the resulting file here always be a qmdl file as reading using QmdlReader below will decompress it?
i personally think only including qmdl files in the zip is the nicest behavior for users anyway, but we should make sure the file extension is right regardless. if i'm correct and it is always should just be qmdl, it'll simplify the code and diff here
There was a problem hiding this comment.
yup, i ran into this while refactoring the write_all implementation, and am currently stuck rooting out the empty zip bug you mentioned below
There was a problem hiding this comment.
ah ok. let me know if you'd like a 2nd set of eyes on the empty zip problem. i hit it when trying to verify my understanding of the code here, but otherwise didn't really dig into it
| let body_bytes = axum::body::to_bytes(body, usize::MAX).await.unwrap(); | ||
|
|
||
| let zip_reader = ZipFileReader::new(body_bytes.to_vec()).await.unwrap(); | ||
|
|
There was a problem hiding this comment.
i think something is going wrong somewhere in here as the qmdl.gz file added to the zip here is empty. if i add code like this, tests pass on main but fail on this branch
for entry in zip_reader.file().entries() {
assert_ne!(entry.uncompressed_size(), 0);
}
| if self.uncompressed_bytes_read > max_bytes { | ||
| error!( | ||
| "warning: {} bytes read, but max_bytes was {}", | ||
| self.uncompressed_bytes_read, max_bytes |
There was a problem hiding this comment.
uncompressed_bytes_read never gets incremented afaict
| } | ||
|
|
||
| #[derive(Debug)] | ||
| struct QmdlAsyncReader<T> { |
There was a problem hiding this comment.
I struggle to understand why we have both QmdlReader and QmdlAsyncReader. can you rename one of them? I assume we need this layering, but i'm not sure why.
| .expect("failed to get QMDL file metadata") | ||
| .len(); | ||
| let mut qmdl_reader = QmdlReader::new(qmdl_file, Some(file_size as usize)); | ||
| let compressed = qmdl_path.ends_with(".gz"); |
There was a problem hiding this comment.
I believe it would be easier to sniff gzip magic bytes. Then the distinction between gzip vs non-gzip can happen within the reader entirely, it doesn't need extra params, and no other code needs to be touched.
this is prerequisite work for #81, since the diag logs for things like RSSI massively increase the size of QMDL files. in my experience, simply gzipping the qmdls reduces their size by 4-5x, which i think should be sufficient for our purposes.
this PR reworks QmdlWriter to output gzipped QMDL files by default, and allows QmdlReader to operate on either compressed or uncompressed QMDLs.
QmdlReader has been significantly rewritten to expose a single AsyncRead interface to both compressed and uncompressed QMDL sources.
i'd still like to do some more in-depth testing of this, but in the meantime i'd love a review on it