Add offline file-size optimisation #45

Open
8 of 14 tasks
cschwan opened this issue Oct 19, 2020 · 18 comments

@cschwan
Contributor

cschwan commented Oct 19, 2020

Possible optimisations:

@cschwan cschwan added the enhancement New feature or request label Oct 29, 2020
@cschwan
Contributor Author

cschwan commented Nov 22, 2020

Pull request #48 implements a more efficient data structure.

@cschwan cschwan self-assigned this Nov 22, 2020
@cschwan
Contributor Author

cschwan commented Nov 24, 2020

I've tested PR #48 with the complete ATLAS DY 3D grid, before (LagrangeSubgridV1) and after (LagrangeSparseSubgridV1) the optimisation, and both with and without LZ4 compression. Here's the table:

| Compression | LagrangeSubgridV1 | LagrangeSparseSubgridV1 |
| --- | --- | --- |
| none | 3.1 GB | 497 MB |
| LZ4 | 377 MB | 364 MB |

The uncompressed numbers are essentially the memory requirements when loading the grid (for convolutions, etc.). Because the grid is smaller, convolutions of the grid with PDFs are also faster: from 22 seconds down to 16 seconds. LZ4 compression makes virtually no difference to these timings.
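
To illustrate why a sparse layout shrinks the uncompressed size so much, here is a minimal Python sketch. It is not the actual `LagrangeSparseSubgridV1` implementation from PR #48; the class `SparseRow` and its fields are hypothetical, and the sketch only assumes that most entries of a dense subgrid row are zero and that the non-zero ones sit in one contiguous range.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SparseRow:
    """Hypothetical sparse row: keeps only the span from the first to the last non-zero."""

    length: int          # length of the original dense row
    start: int           # index of the first stored value
    values: List[float]  # contiguous slice covering all non-zero entries

    @classmethod
    def from_dense(cls, dense: List[float]) -> "SparseRow":
        nonzero = [i for i, v in enumerate(dense) if v != 0.0]
        if not nonzero:
            # a completely empty row stores no values at all
            return cls(len(dense), 0, [])
        first, last = nonzero[0], nonzero[-1]
        return cls(len(dense), first, dense[first:last + 1])

    def to_dense(self) -> List[float]:
        dense = [0.0] * self.length
        dense[self.start:self.start + len(self.values)] = self.values
        return dense


# toy row with 50 entries, of which only a short span is populated
dense_row = [0.0] * 20 + [1.5, 0.0, 2.5] + [0.0] * 27
sparse_row = SparseRow.from_dense(dense_row)
stored = len(sparse_row.values)  # number of floats that actually need storing
```

The round trip through `to_dense` loses no information, which is the property that lets an offline optimisation run without re-generating the grid.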

@cschwan
Contributor Author

cschwan commented Nov 25, 2020

To optimise a grid, simply run `pineappl optimize input.pineappl optimized.pineappl`; no re-generation of the grid is needed.

@cschwan
Contributor Author

cschwan commented Nov 25, 2020

Here is a comparison against APPLgrid (using the converter from PR #17 and CMS_SINGLETOP_TCH_R_7TEV_T.root):

| Setup | File size |
| --- | --- |
| APPLgrid | 2.1 MB |
| converted | 1.9 MB |
| converted+optimized | 1.5 MB |
| converted+optimized+compressed | 1.4 MB |
| converted+optimized+symmetrized+compressed | 683K |

@cschwan
Contributor Author

cschwan commented Jan 4, 2021

Commit 1ca0a55 further optimizes the file sizes of grids for initial-state symmetric processes (for instance proton-proton collisions) by making use of the symmetry of the double-sum over the interpolated x1 and x2.

  • Up to numerical noise, the results are unchanged.
  • The size optimization is a real reduction of information, so both the compressed and uncompressed sizes of the grids are halved (see also the table above).
  • Convolutions are also twice as fast.
  • If FK tables are generated from the optimized grids, their size should also be cut in half.
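
The idea behind the symmetrisation can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the code from commit 1ca0a55: it considers a single channel in which both initial states use the same PDF values on the same x grid, and the function names `convolve` and `symmetrize` are hypothetical.

```python
import random


def convolve(weights, pdf):
    """Double sum over the interpolated x1 and x2 nodes."""
    n = len(pdf)
    return sum(weights[i][j] * pdf[i] * pdf[j] for i in range(n) for j in range(n))


def symmetrize(weights):
    """Fold the lower triangle onto the upper one.

    Since pdf[i] * pdf[j] == pdf[j] * pdf[i], the entries (i, j) and (j, i)
    multiply the same factor and can be stored as a single summed entry.
    """
    n = len(weights)
    folded = [[0.0] * n for _ in range(n)]
    for i in range(n):
        folded[i][i] = weights[i][i]
        for j in range(i + 1, n):
            folded[i][j] = weights[i][j] + weights[j][i]
    return folded


random.seed(0)
n = 6
weights = [[random.random() for _ in range(n)] for _ in range(n)]
pdf = [random.random() for _ in range(n)]

full = convolve(weights, pdf)
folded = convolve(symmetrize(weights), pdf)

# count the entries that still have to be stored after folding
nonzero_after = sum(1 for row in symmetrize(weights) for w in row if w != 0.0)
```

After folding, at most n(n+1)/2 of the n² entries remain, which approaches one half for large n; `full` and `folded` agree up to floating-point rounding, in line with the "up to numerical noise" remark above.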

@cschwan
Contributor Author

cschwan commented Feb 7, 2021

Commits 098fe5d and a0f32fc further decrease the size of all grids that have a static scale (different static scales in different bins are also optimised). The size improvement is a factor of four by default (the interpolation degree plus one).

This optimisation modifies the numerical value of the convolution, since the PDFs are no longer evaluated at multiple q2 grid points, but instead at the single static scale; however, in general the result should be more accurate, because one interpolation dimension is removed.
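
A toy Python sketch of this effect follows. It is an assumption-laden illustration, not what commits 098fe5d and a0f32fc actually do: it interpolates directly in q2 (the real grids may use a transformed variable), the node positions are made up, and `toy_pdf` is a hypothetical stand-in for the scale dependence of a PDF.

```python
import math


def lagrange_weights(nodes, x):
    """Lagrange basis polynomials of degree len(nodes) - 1, evaluated at x."""
    weights = []
    for k, xk in enumerate(nodes):
        lk = 1.0
        for m, xm in enumerate(nodes):
            if m != k:
                lk *= (x - xm) / (xk - xm)
        weights.append(lk)
    return weights


def toy_pdf(q2):
    # hypothetical smooth scale dependence, for illustration only
    return 1.0 / math.log(q2)


nodes = [50.0, 100.0, 200.0, 400.0]  # four q2 nodes: interpolation degree 3
static_q2 = 150.0                    # the single scale used by every event
event_weight = 2.0

# before the optimisation: the weight is spread over degree + 1 = 4 nodes
spread = [event_weight * l for l in lagrange_weights(nodes, static_q2)]
interpolated = sum(w * toy_pdf(q2) for w, q2 in zip(spread, nodes))

# after the optimisation: one entry, with the PDF evaluated at the static scale
collapsed = sum(spread)  # the basis polynomials sum to one
exact = collapsed * toy_pdf(static_q2)
```

Here `spread` needs four stored numbers where `collapsed` needs one, which is the factor of four, and `interpolated` differs slightly from `exact` because of the interpolation error that the optimisation removes.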

@cschwan
Contributor Author

cschwan commented May 11, 2021

Using ./appl2pine (see also #17) I converted all grids (except two) into PineAPPL grids. The following are my observations:

  • in the worst case the PineAPPL grid is 0.3% larger than the APPLgrid; this happens 4 times out of 489
  • in the best case the PineAPPL grid is 98.5% smaller; however, in this case the APPLgrid is empty
  • the total size of all APPLgrids is 5.8 GB, while the converted PineAPPL grids total 3.7 GB, a reduction of 36% in size

@cschwan
Contributor Author

cschwan commented Jul 16, 2021

Commit fce09e1 removes empty luminosity entries, which is primarily required for generating smaller FK tables.
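
As a rough picture of what removing empty luminosity entries means, consider the following Python sketch. The data layout is hypothetical (a list of channel labels with one flat list of weights per channel) and does not reflect PineAPPL's internal structures or the code in commit fce09e1.

```python
def strip_empty_channels(channels, subgrids):
    """Drop channels whose subgrids contain no non-zero weight.

    `channels` is a list of labels and `subgrids[i]` holds the weights that
    belong to `channels[i]`; both are hypothetical stand-ins.
    """
    kept = [
        (label, weights)
        for label, weights in zip(channels, subgrids)
        if any(w != 0.0 for w in weights)
    ]
    return [label for label, _ in kept], [weights for _, weights in kept]


channels = ["u ubar", "d dbar", "g g", "s sbar"]
subgrids = [[0.1, 0.0], [0.0, 0.0], [0.3, 0.2], []]

kept_channels, kept_subgrids = strip_empty_channels(channels, subgrids)
```

Channels that contribute nothing are dropped together with their subgrids, so anything derived per channel afterwards, such as an FK table, has fewer entries to carry.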

@felixhekhorn
Contributor

@scarlehoff just came across this one: "strip numerical zeros" - maybe we can increase the priority?

@cschwan
Contributor Author

cschwan commented May 16, 2022

In the meantime the size of CMS_SINGLETOP_TCH_R_7TEV_T has slightly degraded to 684K, but the ATLAS DY 3D @ 8 TeV grid has shrunk to 66 MB from 364 MB.

@alecandido
Member

> In the meantime the size of CMS_SINGLETOP_TCH_R_7TEV_T has slightly degraded to 684K

Definitely not a problem

> but the ATLAS DY 3D @ 8 TeV grid has shrunk to 66 MB from 364 MB.

That's great :)

@cschwan
Contributor Author

cschwan commented Sep 7, 2022

Here's an update of the numbers from #45 (comment), using the CLI pineappl v0.5.5-19-gd924e9e and converting NNPDF/applgrids@8944089:

  • APPLgrids: 6214 MBytes (full repository without .git subfolder)
  • PineAPPL grids: 3338 MBytes (without compression: 3707 MBytes)

That's a 46% reduction!
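
For reference, the percentages implied by the two bullet points can be reproduced with a few lines of Python. The sizes are the ones quoted above; the helper `reduction_percent` is just an illustrative name.

```python
def reduction_percent(before, after):
    """Relative size reduction in percent when going from `before` to `after`."""
    return 100.0 * (1.0 - after / before)


applgrid_mb = 6214.0      # APPLgrids, full repository without .git
pineappl_lz4_mb = 3338.0  # PineAPPL grids, LZ4 compressed
pineappl_raw_mb = 3707.0  # PineAPPL grids, without compression

vs_applgrid = reduction_percent(applgrid_mb, pineappl_lz4_mb)
vs_applgrid_raw = reduction_percent(applgrid_mb, pineappl_raw_mb)
lz4_gain = reduction_percent(pineappl_raw_mb, pineappl_lz4_mb)
```

Rounded to whole percent, these come out as 46% against the APPLgrids with LZ4, 40% without compression, and 10% from LZ4 alone.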

@alecandido
Member

alecandido commented Sep 8, 2022

>   • PineAPPL grids: 3338 MBytes
>
> That's a 46% reduction!

Are you comparing with the .pineappl or .pineappl.lz4?
I'd just like to decouple PineAPPL optimization from lz4 compression :) (or are the APPLgrids compressed as well?)

@cschwan
Contributor Author

cschwan commented Sep 8, 2022

> Are you comparing with the .pineappl or .pineappl.lz4? I'd just like to decouple PineAPPL optimization from lz4 compression :) (or are the APPLgrids compressed as well?)

The PineAPPL grids are LZ4 compressed and, as far as I understand, the ROOT file format is ZLIB compressed [1], so in that sense it's a fair comparison I think.

However, you might wonder how good or helpful the compression in PineAPPL's case is, so I added the number without compression in the comment above.

Footnotes

  1. https://iopscience.iop.org/article/10.1088/1742-6596/898/7/072043

@alecandido
Member

Ok, good. Then PineAPPL is already doing a great job on its own :D

> The PineAPPL grids are LZ4 compressed and, as far as I understand, the ROOT file format is ZLIB compressed [1], so in that sense it's a fair comparison I think.

Perfect, it was reasonable.

>   • PineAPPL grids: 3338 MBytes (without compression: 3707 MBytes)

I wonder if there is a reason why LZ4 compression is doing so little. In some sense, that's a good sign on its own.
In eko it changes a lot, because we are saving almost-triangular matrices in rectangular ones, with plenty of zeros. That is not the smartest choice for storage, but it was the best compromise for usage (maybe at some point we might want to reconsider, to see if we can support almost-triangular matrices equally well, saving memory and operations @felixhekhorn).

@cschwan
Contributor Author

cschwan commented Sep 8, 2022

I think the reason is that the format is binary and already has very little redundancy (one f64 is 8 bytes). Are you compressing text files/yaml? In that case the compression should work much better.
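
The difference between the two situations can be reproduced with a small Python experiment. Note the assumptions: it uses `zlib` from the standard library rather than LZ4 (which is not part of the standard library and, unlike zlib, has no entropy-coding stage), and the random numbers and the 64 × 64 matrix are made-up stand-ins for real grid or operator data.

```python
import random
import struct
import zlib


def ratio(payload):
    """Compressed size divided by original size (smaller means better compression)."""
    return len(zlib.compress(payload)) / len(payload)


random.seed(1)
n = 64

# densely filled binary data: every f64 carries an essentially random mantissa
dense = [random.random() for _ in range(n * n)]

# upper-triangular n x n matrix stored as a full rectangle, zeros below the diagonal
triangular = [dense[i * n + j] if j >= i else 0.0 for i in range(n) for j in range(n)]

dense_ratio = ratio(struct.pack(f"<{len(dense)}d", *dense))
triangular_ratio = ratio(struct.pack(f"<{len(triangular)}d", *triangular))
```

The densely filled buffer barely shrinks, while the zero-padded triangular one loses most of its zero half, which matches the contrast between the PineAPPL grids and the rectangular storage described for eko.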

@felixhekhorn
Contributor

> I think the reason is that the format is binary and already has very little redundancy (one f64 is 8 bytes). Are you compressing text files/yaml? In that case the compression should work much better.

In eko we're compressing .npy

> maybe at some point we might want to reconsider, to see if we can support almost-triangular matrices equally well, saving memory and operations @felixhekhorn

we can - but this is a N3LO problem, I'd say

@alecandido
Member

alecandido commented Sep 8, 2022

> we can - but this is a N3LO problem, I'd say

N3LO is essentially now :)
(however, let's discuss somewhere else)
