Add Python data compression algorithm #172
Conversation
Add data compression script explanation to README
Add option to delete original uncompressed file
Hi @rmaganza,
Thanks a lot for this PR and apologies for the belated review. I think we have a few things that need to be sorted out before we move ahead with this.
Let me attach an example output from a production job (memory_monitor_output.txt). In the current implementation, it seems we're being a bit too aggressive and having a bit of a problem interpreting the plateaus:
Original: Text file size = 50 KB
Default: Text file size = 4.4 KB
Default + interpolate: Text file size = 15 KB
I think we should interpolate by default, as it seems to provide a nice balance between disk space and the level of detail. A simple tar of the output reduces the size to 23 KB, so there would still be a sizable gain in that configuration.
Many thanks.
Best,
Serhan
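(For reference, a minimal sketch of the kind of interpolation-based pruning under discussion, assuming the monitor output has been read into a pandas DataFrame with a numeric time index. The function name, default precision, and exact logic here are hypothetical and may differ from the actual script.)

```python
import pandas as pd


def prune_interpolable_rows(df: pd.DataFrame, precision: float = 0.05) -> pd.DataFrame:
    """Greedily drop interior rows whose metrics can be recovered, within a
    relative `precision`, by linearly interpolating between the last kept
    row and the following row. The first and last rows are always kept."""
    if len(df) < 3:
        return df
    keep = [0]
    for i in range(1, len(df) - 1):
        last, nxt = keep[-1], i + 1
        # Fractional position of row i between the last kept row and row i+1,
        # using the numeric (e.g. wall-clock time) index.
        t = (df.index[i] - df.index[last]) / (df.index[nxt] - df.index[last])
        predicted = df.iloc[last] + t * (df.iloc[nxt] - df.iloc[last])
        # Keep the row only if some metric deviates noticeably from the
        # interpolated prediction; clip the scale to avoid division by zero.
        scale = df.iloc[i].abs().clip(lower=1.0)
        if ((predicted - df.iloc[i]).abs() / scale > precision).any():
            keep.append(i)
    keep.append(len(df) - 1)
    return df.iloc[keep]
```

With a scheme like this, long plateaus collapse to their endpoints, which is where the large size reductions quoted above come from; too loose a precision is also what makes the pruning "too aggressive" on slowly varying stretches.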
Thank you for your suggestions @amete, and don't worry about the timing. I added some comments addressing your remarks. Regards,
Thanks a lot Riccardo. Please see above for my response and let me know if there are any loose ends.
Hi Serhan @amete, please take one last look and let me know if anything is off. Regards,
Thanks a lot @rmaganza. It looks like we're almost there. I have two minor comments. Then we should be good to go. Please let me know.
@amete Fixed the last points in the README as well as the default precision value.
Just a few follow-ups/suggestions.
Thank you for your suggestions @amete. Regards,
Hi @amete, I also added back the option we discussed. Let me know if that works for you.
Hi @rmaganza, Could you quickly check the most up-to-date script along with the file I posted above? Locally, I see we get no size reduction anymore. We seem to drop only two rows, and replace all the metrics with the interpolated values instead. Am I missing something? Best,
Hi @amete, If you run the same code but remove those columns, you should see a proper size reduction again. Let me know if you have any ideas about this, as I am not sure how we could keep them in the output file while maintaining the same pruning logic. Regards,
Yes.
Thanks. The test file size after compression is now 18 KB.
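(One possible way to reconcile the pruning with unmodified output values, sketched here only to illustrate the point discussed above: use the interpolation test solely to decide which rows survive, then emit the surviving rows from the original data. This reuses the hypothetical prune_interpolable_rows sketch from earlier and is not necessarily how the script resolves it.)

```python
def compress_preserving_values(df, precision=0.05):
    # Use the interpolation-based test only for the keep/drop decision...
    kept_index = prune_interpolable_rows(df, precision).index
    # ...then write out the original, unmodified measurements for those rows.
    return df.loc[kept_index]
```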
Thanks a lot @rmaganza. I think this looks good to go in as the first version. There are a few things we can still improve, but given this is an optional script that won't affect the core functionality, let's follow up on those later.
I implemented Andrea Sciabà's data compression algorithm in a new Python script, as we had previously discussed, and also added a new section to the README based on the one for the visualization script.
The algorithm could surely be expanded (e.g. by providing some information on the original series' variance), but it seems like a good starting point to me.
The script has been run through black and flake8.
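(The variance extension mentioned above could, as a rough and entirely hypothetical sketch, summarize how much spread the pruning discards per metric:)

```python
import pandas as pd


def pruning_report(original: pd.DataFrame, compressed: pd.DataFrame) -> pd.DataFrame:
    """Per-metric summary of the points removed by compression, giving a
    rough idea of how much detail was discarded."""
    dropped = original.drop(index=compressed.index)
    return pd.DataFrame({
        "dropped_points": dropped.count(),  # samples removed per metric
        "dropped_std": dropped.std(),       # spread of the removed samples
        "original_std": original.std(),     # spread of the full series
    })
```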