Add Python data compression algorithm #172

rmaganza · 2020-10-20T13:27:19Z

I implemented Andrea Sciabà's data compression algorithm in a new Python script, as we had previously discussed, and added a new README section modeled on the one for the visualization script.

The algorithm could surely be expanded (e.g. by providing some information on the original series' variance), but it seems like a good starting point to me.

The script has been run through black and flake8.

amete

Hi @rmaganza,

Thanks a lot for this PR and apologies for the belated review. I think we have a few things that need to be sorted out before we move ahead with this.

Let me attach an example output from a production job (memory_monitor_output.txt). In the current implementation, the compression seems a bit too aggressive and has some trouble interpreting the plateaus:

Original: Text file size = 50 KB

Default: Text file size = 4.4 KB

Default + interpolate: Text file size = 15 KB

I think we should interpolate by default, as it seems to provide a nice balance between disk space and level of detail. A simple tar of the output reduces the size to 23 KB, so there will still be a sizable gain in that configuration.
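To put the sizes quoted above side by side, the relative savings can be worked out with a few lines of Python. This is only arithmetic on the numbers in this comment; the dictionary labels are shorthand for the three configurations and are not options of the script.

```python
# Sizes in KB, copied from the comparison above.
ORIGINAL_KB = 50.0
sizes_kb = {
    "default": 4.4,
    "default + interpolate": 15.0,
    "tar of original": 23.0,
}


def reduction_percent(original, compressed):
    """Percentage of the original size that is saved."""
    return 100.0 * (1.0 - compressed / original)


reductions = {
    name: round(reduction_percent(ORIGINAL_KB, kb), 1)
    for name, kb in sizes_kb.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct}% smaller than the original")
```

With these inputs the savings come out at roughly 91%, 70% and 54% respectively, so the interpolated output is still noticeably smaller than the tar of the original file.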

Many thanks.

Best,
Serhan

rmaganza · 2021-01-27T12:38:10Z

Thank you for your suggestions @amete, and don't worry about the timing. I added some comments on your remarks, which were all correct.
Let me know what you think about the interpolation problem I mentioned.

Regards,
Riccardo.

amete · 2021-01-27T13:56:58Z

Thanks a lot Riccardo. Please see above for my response and let me know if there are any loose ends.

rmaganza · 2021-01-28T08:57:06Z

Hi Serhan @amete,
the points we talked about should now be fixed.

Please take a last look and let me know if anything is off.

Regards,
Riccardo

amete

Thanks a lot @rmaganza. It looks like we're almost there. I have two minor comments. Then we should be good to go. Please let me know.

rmaganza · 2021-01-28T13:50:38Z

@amete I fixed the last points in the README as well as the default precision value.

amete

Just a few follow-ups/suggestions.

rmaganza · 2021-01-28T15:10:36Z

Thank you for your suggestions @amete.
I added a comment on the skip-interpolation case, which I think is the only thing left to iron out.

Regards,
Riccardo

rmaganza · 2021-02-01T08:33:46Z

Hi @amete,
I added the suggestions to the README.

I also added back the utime and stime variables. I grouped them with the CPU, GPU and thread numbers, since they work on a lower scale than the rest of the metrics and linear interpolation might not be well suited to them.
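The distinction between the two groups can be sketched as follows. This is a minimal illustration, assuming that dropped rows are rebuilt either by linear interpolation or by carrying the last kept value forward; the helper names `linear_fill` and `step_fill` and the sample values are hypothetical and do not come from prmon_compress_output.py.

```python
def linear_fill(times, known):
    """Rebuild a series at `times` by interpolating linearly between kept rows.

    `known` maps the timestamps of kept rows to their values. Every entry of
    `times` is assumed to lie between the first and last kept timestamp.
    """
    filled = []
    for t in times:
        if t in known:
            filled.append(known[t])
            continue
        left = max(k for k in known if k < t)
        right = min(k for k in known if k > t)
        frac = (t - left) / (right - left)
        filled.append(known[left] + frac * (known[right] - known[left]))
    return filled


def step_fill(times, known):
    """Rebuild a series by repeating the most recent kept value."""
    return [known[max(k for k in known if k <= t)] for t in times]


times = [0, 1, 2, 3, 4]
kept_memory = {0: 100, 4: 500}   # a smoothly growing metric
kept_nthreads = {0: 1, 4: 3}     # a count that only takes whole values

memory_linear = linear_fill(times, kept_memory)
nthreads_linear = linear_fill(times, kept_nthreads)
nthreads_step = step_fill(times, kept_nthreads)
```

Linear interpolation gives a plausible ramp for the memory-like series, but for the thread count it produces fractional values such as 1.5, whereas the step fill keeps whole numbers.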

Let me know if that works for you.

amete · 2021-02-03T10:05:15Z

Hi @rmaganza,

Could you quickly check the most up-to-date script along with the file I posted above? Locally, I see we get no size reduction anymore. We seem to drop only two rows, and replace all the metrics w/ the interpolated values instead. Am I missing something?

Best,
Serhan

rmaganza · 2021-02-03T10:37:26Z

Hi @amete,
unfortunately, it seems that adding back utime and stime prevents almost every row from being deleted.
Every data point of the utime and stime series is marked as a changepoint, so the script does not delete any of the rows.

If you run the same code but remove utime and stime from the required metrics, the output file size is 17 KB.

Let me know if you have any ideas about this, as I am not sure how we could keep them in the output file while maintaining the same pruning logic.
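A minimal sketch of the effect described here, assuming a rule where a row counts as a changepoint whenever a step-like metric differs from its previous sample, and where a row is kept if any metric flags it. The function `changepoint_rows` and the sample values are hypothetical and are not taken from prmon_compress_output.py.

```python
def changepoint_rows(series):
    """Return the indices where the value differs from the previous sample.

    The first and last rows are always kept so the series keeps its endpoints.
    """
    if not series:
        return []
    keep = {0, len(series) - 1}
    for i in range(1, len(series)):
        if series[i] != series[i - 1]:
            keep.add(i)
    return sorted(keep)


nthreads = [1, 1, 1, 4, 4, 4, 4, 2, 2, 2]       # step-like: only two changes
utime = [0, 3, 7, 12, 18, 25, 31, 38, 44, 51]   # cumulative CPU time: always growing

kept_threads = changepoint_rows(nthreads)
kept_utime = changepoint_rows(utime)

# A row survives if at least one metric marks it as a changepoint.
kept_rows = sorted(set(kept_threads) | set(kept_utime))

print(f"nthreads alone keeps {len(kept_threads)} of {len(nthreads)} rows")
print(f"adding utime keeps {len(kept_rows)} of {len(utime)} rows")
```

Because a cumulative counter changes at every sample, it flags every row, and the union over metrics then keeps the whole file regardless of how quiet the other metrics are.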

Regards,
Riccardo

amete · 2021-02-03T12:40:16Z

Yes, utime and stime monotonically increase, typically at least. I think you can treat them as the other metrics, e.g. memory. This way you should be able to keep certain points where there are large enough changes but drop most of them. The other trick you can try is to take the first order derivative, i.e. df['utime'].diff() and use that to determine if a point should be dropped or not, instead of df['utime']. If you play around w/ these a little bit I'm sure you can find a good compromise but please let me know how it goes.
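Both suggestions can be sketched in plain Python. This is an illustration under stated assumptions: the rule "keep a row once the value has moved by more than precision times the series range since the last kept row" is a stand-in for how the memory-like metrics might be treated, and `significant_rows`, `first_difference` and the sample data are hypothetical names and values. `first_difference` mirrors df['utime'].diff(), except that it puts 0 in the first slot where pandas would give NaN.

```python
def first_difference(series):
    """Change between consecutive samples, with 0 for the first entry."""
    if not series:
        return []
    return [0] + [b - a for a, b in zip(series, series[1:])]


def significant_rows(series, precision):
    """Keep rows whose value moved more than precision * range since the last kept row."""
    if not series:
        return []
    threshold = precision * (max(series) - min(series))
    kept = [0]
    for i in range(1, len(series)):
        if abs(series[i] - series[kept[-1]]) > threshold:
            kept.append(i)
    if kept[-1] != len(series) - 1:
        kept.append(len(series) - 1)  # always keep the final row
    return kept


# Cumulative CPU time: steady growth, an idle plateau, then growth again.
utime = [0, 10, 20, 30, 40, 50, 50, 50, 50, 60, 70, 80]

# Option 1: treat utime like a memory metric and threshold on the value itself.
kept_by_value = significant_rows(utime, precision=0.3)

# Option 2: threshold on the first difference, so only changes in rate are flagged.
rate = first_difference(utime)
kept_by_rate = significant_rows(rate, precision=0.3)

print(kept_by_value)
print(kept_by_rate)
```

In this toy series either option keeps 4 or 5 of the 12 rows instead of all of them. Which rows to keep around a change in rate, and what precision to use, are choices that would need tuning against real prmon output.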

rmaganza · 2021-02-03T13:27:02Z

Thanks,
I was afraid that interpolating a time variable would not make much sense logically, but you're right that moving them together with the memory variables does result in compression.

The test file size after compression is now 18 KB.

amete

Thanks a lot @rmaganza. I think this looks good to go in as the first version. There are a few things that we can improve, but given that this is an optional script that's not going to affect the core functionality, let's follow up on those later.

Riccardo Maganza added 2 commits October 20, 2020 15:20

Add Python data compression algorithm

0632813

Add data compression script explanation to README Add option to delete original uncompressed file

Fix unnecessary else after return

f54bb9a

graeme-a-stewart self-requested a review October 27, 2020 13:31

Riccardo Maganza added 2 commits November 11, 2020 17:11

Fix some slight bugs found in day-to-day usage

f9f240f

flake8:fix whitespace issue

83a014c

amete self-requested a review January 27, 2021 10:34

amete requested changes Jan 27, 2021

package/scripts/prmon_compress_output.py (outdated)

package/scripts/prmon_compress_output.py (outdated)

README.md (outdated)

Riccardo Maganza added 3 commits January 28, 2021 09:16

Fix argument parsing and default to true interpolation

6ba7dc9

Fix compression script permissions

a4cf230

fix: only perform rounding if interpolating

ec7e452

amete self-requested a review January 28, 2021 13:16

amete requested changes Jan 28, 2021

README.md (outdated)

package/scripts/prmon_compress_output.py (outdated)

Riccardo Maganza added 2 commits January 28, 2021 14:43

Change default precision value and refactor variable names for consistency

d6aa380

Fix README for new defaults in compression script

c9cf659

amete reviewed Jan 28, 2021

README.md (outdated)

README.md

package/scripts/prmon_compress_output.py (outdated)

Riccardo Maganza added 3 commits February 1, 2021 09:25

Add back CPU measurements to compression algorithm

214a344

Better specification for compresson algorithm in README

516a788

Add utime and stime information to README

58440b6

Riccardo Maganza added 2 commits February 3, 2021 14:23

Fix README for utime and stime

66071ac

Move utime and stime back to other list

6623b94

amete self-requested a review February 5, 2021 12:57

amete approved these changes Feb 5, 2021

amete merged commit 1858e7e into HSF:master Feb 5, 2021

amete mentioned this pull request Feb 5, 2021

Preparing v2.2.0 #184

Closed
