Compression_for_Exascale
Compression in the rich man’s parallel sense is hard; HDF5 doesn’t support it. The problem is that compressed sizes vary, so we don’t know ahead of time where data lands in the file. A variable-loss but fixed-size compression might be better here (e.g. wavelets). That way we can always hit a target compressed size, but the quality of the compressed result varies. That could be useful for plot files but obviously not for restart files.
Alternatively, compression can work in rich man’s parallel if HDF5 assumes a target compression ratio of R (set by the application in a property list) and treats every block as compressing R:1, so the compressed block size is always 1/R of the original. Compressed block size is then always predictable, though if a block actually compresses better than R:1 some space savings are sacrificed because we still reserve 1/R of the original size. So what? The real problem is a block that cannot be compressed R:1. Then what? One option is to fail the write and let the application retry with a lower R. Another option is to have two kinds of blocks: those that hit or exceeded the R:1 target and those that didn’t. The former are always treated as size 1/R of the original, the latter as the original size. Either way, size is predictable and therefore manageable in rich man’s parallel.
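A minimal sketch of that reservation logic in plain C, under the two-kinds-of-blocks option (the compressor name and struct are hypothetical, not part of any HDF5 API):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical compressor: returns compressed length, or 0 on failure. */
size_t my_compress(const void *in, size_t in_len, void *out, size_t out_cap);

/* Each block is reserved orig_len/R bytes up front.  A block that meets
 * the R:1 target occupies that slot (padded if it did better); a block
 * that misses the target falls back to being stored uncompressed at the
 * full original size.  Either way the stored size is known before the
 * write, which is what rich man's parallel needs. */
typedef struct {
    int    compressed;   /* 1 if the block met the R:1 target           */
    size_t stored_len;   /* bytes actually occupied in the file         */
} block_info;

block_info pack_block(const void *orig, size_t orig_len, double R,
                      void *slot /* caller provides up to orig_len bytes */)
{
    block_info info;
    size_t target = (size_t)(orig_len / R);      /* reserved slot size   */
    size_t clen   = my_compress(orig, orig_len, slot, target);

    if (clen > 0 && clen <= target) {            /* met or beat R:1      */
        info.compressed = 1;
        info.stored_len = target;                /* padded up to 1/R     */
    } else {                                     /* missed the target    */
        memcpy(slot, orig, orig_len);            /* store uncompressed   */
        info.compressed = 0;
        info.stored_len = orig_len;
    }
    return info;
}
```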
Adding preprocessing filters to the compression pipeline may give a better chance of achieving the R:1 compression ratio (or may allow the target ratio to be increased), at the expense of additional computing power. Some examples include shuffle, delta, and/or space-filling-curve filters.
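For reference, putting a preprocessing filter ahead of the compressor is already straightforward in serial HDF5; a sketch using the built-in shuffle filter in front of deflate on a chunked dataset (the dataset name and dimensions are placeholders):

```c
#include "hdf5.h"

/* Chunked dataset with byte-shuffle preprocessing ahead of deflate.
 * Shuffle groups bytes of like significance together, which usually
 * improves the downstream compression ratio. */
hid_t make_filtered_dataset(hid_t file_id)
{
    hsize_t dims[2]  = {1024, 1024};   /* placeholder dataset extent */
    hsize_t chunk[2] = {128, 128};     /* placeholder chunk size     */

    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);

    H5Pset_chunk(dcpl, 2, chunk);
    H5Pset_shuffle(dcpl);              /* byte-shuffle preprocessing */
    H5Pset_deflate(dcpl, 6);           /* gzip level 6               */

    hid_t dset = H5Dcreate2(file_id, "/data", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}
```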
Eliminating the block-level indirection here might be useful. Yes, it’s bad for an eventual attempt to subset on read, but if the caller accepts the limitations and/or costs of that, we allow it. Then the whole dataset is a single block and it is either compressed to the target R:1 (with possible wasted space if it did better) or not.
Exascale may involve floating-point precision higher than 64-bit double; maybe 96 or 128 bits will be required. What does this mean for compression of floating-point data compared to single or double precision? Would we expect to do better because there are more exponent bits, or worse because there are more mantissa bits?
Also, see this HDF5 document, Chunking in HDF5
Replying to multiple comments at once.
Quincey : “multiple processes may be writing into each chunk, which MPI-I/O can handle when the data is not compressed, but since compressed data is context-sensitive”
My initial use case would be much simpler. A chunk would be aligned with the boundaries of the domain decomposition and each process would write one chunk, one at a time. A compression filter would be applied by the process owning the data and then the chunk would be written to disk (much like Mark’s suggestion).
a) lossless. Problem understood: chunks varying in size, nasty metadata synchronization, sparse files, and related issues.
b) lossy. Seems feasible. We were in fact considering a wavelet-type compression as a first pass (pun intended). “It’s great from the perspective that it completely eliminates the space allocation problem”. Absolutely. All chunks are known to be of size X beforehand, so nothing changes except the indexing and the actual chunk storage/retrieval plus de/compression.
I also like the idea of using a lossless compression and having the I/O operation fail if the data doesn’t fit. It would give the user the chance to try their best to compress with some knowledge of the data type and, if the result doesn’t fit the allocated space, to abort.
Mark : Multi-pass VFD. I like this too. It potentially allows a very flexible approach where, even if collective I/O is writing to the same chunk, the collection/compression phase can do the sums and transmit the info into the HDF5 metadata layer. We’d certainly need to extend the chunking interface to handle variable-sized chunks to allow for more/less compression in different areas of the data (actually this would be true for any option involving lossless compression). I think the chunk hashing relies on all chunks being the same size, so any change to that is going to be a huge compatibility breaker. Also, the chunking layer sits on top of the VFD, so I’m not sure the VFD would be able to manipulate the chunks in the way desired. Perhaps I’m mistaken and the VFD does see the chunks. Correct me anyway.
Quincey : One idea I had, and which I think Mark also expounded on, is this: each process takes its own data and compresses it as it sees fit, then the processes do a synchronization step to tell each other how much (newly compressed) data they each have; then a dataset create is called using the size of the compressed data. Now each process creates a hyperslab for its piece of compressed data and writes into the file using collective I/O. We then add an array of extent information and compression-algorithm info to the dataset as an attribute, where each entry has a start and end index of the data for each process.
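A rough sketch of that flow, assuming each rank has already compressed its block into a buffer; the dataset name, the choice to store per-rank sizes (start indices are recoverable by prefix sum) rather than start/end pairs, and the omitted error handling are my assumptions, not anything prescribed by HDF5:

```c
#include <stdlib.h>
#include <mpi.h>
#include "hdf5.h"

/* Each rank writes its already-compressed block into a shared 1-D byte
 * dataset; per-rank sizes are recorded in an attribute so readers can
 * locate and decompress individual pieces later. */
void write_compressed(MPI_Comm comm, hid_t file_id,
                      const unsigned char *cbuf, hsize_t clen)
{
    int rank, nranks;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nranks);

    /* Exchange compressed sizes so every rank knows the layout. */
    unsigned long long mine = (unsigned long long)clen;
    unsigned long long *sizes = malloc(nranks * sizeof(*sizes));
    MPI_Allgather(&mine, 1, MPI_UNSIGNED_LONG_LONG,
                  sizes, 1, MPI_UNSIGNED_LONG_LONG, comm);

    hsize_t offset = 0, total = 0;
    for (int i = 0; i < nranks; i++) {
        if (i < rank) offset += sizes[i];
        total += sizes[i];
    }

    /* Create the dataset sized to the total compressed data. */
    hid_t filespace = H5Screate_simple(1, &total, NULL);
    hid_t dset = H5Dcreate2(file_id, "/compressed", H5T_NATIVE_UCHAR,
                            filespace, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Each rank selects its own hyperslab and writes collectively. */
    hid_t memspace = H5Screate_simple(1, &clen, NULL);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, &clen, NULL);

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_UCHAR, memspace, filespace, dxpl, cbuf);

    /* Record per-rank compressed sizes as an attribute for the read side. */
    hsize_t adims = (hsize_t)nranks;
    hid_t aspace = H5Screate_simple(1, &adims, NULL);
    hid_t attr = H5Acreate2(dset, "extents", H5T_NATIVE_ULLONG, aspace,
                            H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, H5T_NATIVE_ULLONG, sizes);

    H5Aclose(attr); H5Sclose(aspace); H5Pclose(dxpl);
    H5Sclose(memspace); H5Dclose(dset); H5Sclose(filespace);
    free(sizes);
}
```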
Now the only trouble is that reading the data back requires a double step of reading the attributes and then decompressing the desired piece, which gets quite nasty when odd slices are requested.
Now I start to think that Mark’s double-VFD suggestion would do basically this (in one way or another), but maintaining the normal data layout rather than writing a special dataset representing the compressed data.
step 1 : data is collected into chunks (a no-op if already aligned with the domain decomposition) and the chunks are compressed.
step 2 : the sizes of the chunks are exchanged and space is allocated in the file for all the chunks.
step 3 : the chunks of compressed data are written.
Not sure two passes are actually needed, as long as these three steps are followed.
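Step 2 on its own is just a prefix sum over the compressed sizes; a minimal MPI-only sketch of computing each rank’s file offset and the total allocation (function name hypothetical), as an alternative to the Allgather used in the sketch above:

```c
#include <mpi.h>

/* Given this rank's compressed chunk size, compute its byte offset in
 * the file via an exclusive prefix sum, plus the grand total needed to
 * allocate file space for all chunks. */
void chunk_layout(MPI_Comm comm, unsigned long long my_size,
                  unsigned long long *my_offset, unsigned long long *total)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    unsigned long long offset = 0;
    MPI_Exscan(&my_size, &offset, 1, MPI_UNSIGNED_LONG_LONG, MPI_SUM, comm);
    if (rank == 0)
        offset = 0;   /* MPI_Exscan leaves rank 0's result undefined */
    *my_offset = offset;

    MPI_Allreduce(&my_size, total, 1, MPI_UNSIGNED_LONG_LONG, MPI_SUM, comm);
}
```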
…but variable chunk sizes are not allowed in HDF5 (true or false?); this seems like a showstopper.
Aha, I understand. The actual written data can/could vary in size, as long as the chunk indices referring to the original dataspace are regular. Yes?