
Fix bug in Grid::convolute_eko #182

Merged: 1 commit merged from bug-hunt-conv-eko into master on Oct 3, 2022
Conversation

alecandido (Member)

I reverted b7014fd, as @cschwan suspects this to be the issue: #103 (comment).

alecandido marked this pull request as draft on September 30, 2022, 09:28
alecandido (Member Author)

@felixhekhorn can you test on your example?

codecov bot commented Sep 30, 2022

Codecov Report

Merging #182 (10d5ccd) into master (7851ebc) will increase coverage by 0.01%.
The diff coverage is 77.77%.

@@            Coverage Diff             @@
##           master     #182      +/-   ##
==========================================
+ Coverage   89.32%   89.33%   +0.01%     
==========================================
  Files          47       47              
  Lines        6932     6930       -2     
==========================================
- Hits         6192     6191       -1     
+ Misses        740      739       -1     
Flag Coverage Δ
python 80.64% <ø> (ø)
rust 89.45% <77.77%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pineappl/src/grid.rs 87.45% <77.77%> (+0.06%) ⬆️


felixhekhorn (Contributor)

With this branch I get

$ pineappl diff data/grids/208/ATLASWZRAP36PB-ATLAS-arXiv\:1109.5141-Z0_eta34.pineappl.lz4 data/fktables/208/ATLASWZRAP36PB-ATLAS-arXiv\:1109.5141-Z0_eta34.pineappl.lz4 NNPDF40_nlo_as_01180 --ignore-orders --ignore-lumis
LHAPDF 6.4.0 loading /home/felix/local/share/LHAPDF/NNPDF40_nlo_as_01180/NNPDF40_nlo_as_01180_0000.dat
NNPDF40_nlo_as_01180 PDF set, member #0, version 1; LHAPDF ID = 331700
b   x1                  diff              
-+---+---+-----------+-----------+--------
0   0 0.4 1.3463245e5 1.3450066e5 9.798e-4
1 0.4 0.8 1.3347254e5 1.3334425e5 9.621e-4
2 0.8 1.2 1.3077692e5 1.3066370e5 8.665e-4
3 1.2 1.6 1.2699118e5 1.2689062e5 7.925e-4
4 1.6   2 1.2116815e5 1.2109076e5 6.391e-4
5   2 2.4 1.1276176e5 1.1269892e5 5.575e-4
6 2.4 2.8 9.8484125e4 9.8421191e4 6.394e-4
7 2.8 3.6 5.7843820e4 5.7813500e4 5.244e-4

which seems to catch the problem - we're back at permille accuracy!

I suggest, @andreab1997, that you recompute (whenever you have time, of course) the problematic grids in NNPDF/pineko#62 and confirm or refute the conclusion.

I can also partly confirm another comment by @andreab1997: this took ~2h to compute, which he also observed for the good commits (in contrast to the "short runs" that resulted in wrong FK tables).

andreab1997 (Contributor)

> With this branch I get [...] which seems to catch the problem - we're back at permille accuracy! I suggest, @andreab1997, that you recompute (whenever you have time, of course) the problematic grids in NNPDF/pineko#62 and confirm or refute the conclusion.

Ok, very good, thank you very much for this. As soon as I have time I will run many more tests to be really sure, but I would say that most likely we have found the issue.

alecandido (Member Author) commented Sep 30, 2022

> I can also partly confirm another comment by @andreab1997: this took ~2h to compute, which he also observed for the good commits (in contrast to the "short runs" that resulted in wrong FK tables).

@cschwan can we maybe consider an optimization for the grid, in which we reinterpolate all the subgrids on a common xgrid?

This will certainly bring some loss in accuracy, but it might be an acceptable one in exchange for a considerable speed-up.
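To make the idea concrete, here is a minimal, hypothetical sketch of what reinterpolating a subgrid onto a common xgrid could look like - assuming, purely for illustration, that subgrid entries can be treated as function values on the subgrid's own x nodes and that plain linear interpolation is good enough. PineAPPL's actual interpolation is different, and the function below is not part of its API:

```rust
// Hypothetical sketch: map values tabulated on a subgrid's own x nodes onto a
// common x grid via linear interpolation. Everything here (names, grids, the
// choice of linear interpolation) is illustrative, not PineAPPL code.
fn reinterpolate(x_old: &[f64], vals: &[f64], x_new: &[f64]) -> Vec<f64> {
    assert_eq!(x_old.len(), vals.len());
    x_new
        .iter()
        .map(|&x| {
            // outside the original support, return zero
            if x < x_old[0] || x > *x_old.last().unwrap() {
                return 0.0;
            }
            // index of the interval [x_old[i], x_old[i + 1]] that brackets x
            let i = x_old
                .partition_point(|&xi| xi <= x)
                .saturating_sub(1)
                .min(x_old.len() - 2);
            let t = (x - x_old[i]) / (x_old[i + 1] - x_old[i]);
            (1.0 - t) * vals[i] + t * vals[i + 1]
        })
        .collect()
}

fn main() {
    // one subgrid on its own x nodes, reinterpolated onto a finer common grid
    let x_a = [0.2, 0.4, 0.6, 0.8];
    let vals_a = [1.0, 0.8, 0.5, 0.1];
    let common = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8];
    println!("{:?}", reinterpolate(&x_a, &vals_a, &common));
}
```

In PineAPPL itself this would have to act on subgrid weights (which are interpolation coefficients, not plain function values), so the real transformation needs more care than this toy.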

cschwan (Contributor) commented Sep 30, 2022

@andreab1997 @felixhekhorn as a guiding principle: all grids converted from APPLgrids are potentially affected by this bug, while grids generated by pinefarm (with yadism or MadGraph5_aMC@NLO) should be OK.

cschwan (Contributor) commented Sep 30, 2022

> @cschwan can we maybe consider an optimization for the grid, in which we reinterpolate all the subgrids on a common xgrid? This will certainly bring some loss in accuracy, but it might be an acceptable one in exchange for a considerable speed-up.

As far as I understand this won't help (yet), Grid::convolute_eko is simply inherently slow. However, I've written a realistic unit test, and with that I'll be able to improve the performance. Monday is a holiday in Germany, and by then I'll hopefully have something faster!

cschwan changed the title from "Bug Hunt convolute_eko" to "Fix bug in Grid::convolute_eko" on Sep 30, 2022
alecandido (Member Author) commented Sep 30, 2022

> As far as I understand this won't help (yet), Grid::convolute_eko is simply inherently slow.

The idea would be to trash this PR and rely on the assumption that they are all the same. According to @felixhekhorn,

> this took ~2h to compute, which he also observed for the good commits (in contrast to the "short runs" that resulted in wrong FK tables)

so this bug fix is heavily affecting performance. If we can live without it, so much the better.

cschwan (Contributor) commented Sep 30, 2022

> The idea would be to trash this PR and rely on the assumption that they are all the same.

This I don't understand: why wouldn't we want to merge this pull request?

> so this bug fix is heavily affecting performance. If we can live without it, so much the better.

I can always write a program that calculates the wrong results very quickly, but that isn't very helpful, of course.

felixhekhorn (Contributor)

> The idea would be to trash this PR and rely on the assumption that they are all the same.

Hm? If they are coming from the outside (meaning APPLgrid) there is not much we can do, is there? Reinterpolation is an option, but not the short-term answer.

andreab1997 (Contributor)

> With this branch I get [...] which seems to catch the problem - we're back at permille accuracy! I suggest, @andreab1997, that you recompute (whenever you have time, of course) the problematic grids in NNPDF/pineko#62 and confirm or refute the conclusion.

I am now computing all the FK tables listed under "Things to check" in NNPDF/pineko#62 with this branch, and I am ticking the list's elements one by one as I check them. The three I have computed so far are fine, so I believe the others will be fine as well. Thank you again!

alecandido (Member Author)

> This I don't understand: why wouldn't we want to merge this pull request?
> I can always write a program that calculates the wrong results very quickly, but that isn't very helpful, of course.

Of course we want correct results, but we also want to be fast. That's why I proposed reinterpolation: if we can transform the subgrids so that they all share a common x grid, we lose a bit of precision, but for a great gain in performance.

Zaharid commented Oct 3, 2022

fwiw the more I think about it the more I am convinced that what we want at the fit level is a common x grid. The fitting problem becomes a lot easier to reason about then.

Measuring how fine the grid needs to be is relatively easy.

alecandido (Member Author) commented Oct 3, 2022

> fwiw the more I think about it the more I am convinced that what we want at the fit level is a common x grid. The fitting problem becomes a lot easier to reason about then.

I believe we all agree that this is definitely a good thing. But that concerns the xgrid in the FkTable, which comes from the EKO output, and it is already guaranteed.

Here we are talking about the EKO input (eko already reinterpolates, so it allows an input xgrid different from the output one), which has to be matched to the equivalent of an APPLgrid.
Unfortunately, it seems that not all the old APPLgrids were using a single xgrid for all the subgrids...

By the way @cschwan, I believe that the current status (this PR) might also be wrong: if subgrid A has a different xgrid than subgrid B, in order to use a single EKO we have to compute it on the union (since we are not reinterpolating it on the fly).
But then each subgrid is convoluted only over a subset of the EKO, and this should be wrong: even if the subset points are contained in the extended grid, the interpolation basis on the subset is not a subset of the basis on the union, because the support of each polynomial in the smaller basis has to be larger than that of the corresponding polynomial in the basis built on the union.
Thus I believe we need reinterpolation, and what we were doing before might have been wrong as well (and now I also doubt what APFELcomb was doing on those grids...)

cschwan (Contributor) commented Oct 3, 2022

@alecandido I think what you're probably doing is to calculate the EKO for each bin (the x grids don't change over luminosities and there's only one order) and then merge all EKOs. I think in this case there shouldn't be a problem, since most likely the x grids for each bin are all pairwise disjoint sets. If this isn't the case, merging EKOs for different bins should fail. In other words there's probably no problem.

alecandido (Member Author)

> @alecandido I think what you're probably doing is to calculate the EKO for each bin (the x grids don't change over luminosities and there's only one order) and then merge all EKOs. I think in this case there shouldn't be a problem, since most likely the x grids for each bin are all pairwise disjoint sets. If they aren't, the merging of the EKOs for each bin should go wrong.

I don't think so, since we rely on grid information to construct the EKO, in particular on GridAxes.

So, given a grid, PineAPPL always returns us a single xgrid, not a bin-dependent one.
The only thing we can do at this point is to compute one single EKO for the given xgrid.

alecandido (Member Author) commented Oct 3, 2022

And here is the proof that we are computing the EKO for the union xgrid:
https://github.com/N3PDF/pineappl/blob/ad22780918c4452cca0c4b6ddded6a600237f143/pineappl/src/grid.rs#L1369-L1378
(and not the union of the EKOs for each individual xgrid).

The key is, of course, x_grid.extend_from_slice().
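For readers without the code at hand, here is a standalone sketch of what that construction amounts to: the x nodes of every subgrid are appended to one vector, and EKO is then asked for a single operator on the union. Only the extend_from_slice() call is taken from the linked lines; the surrounding function, the names, and the sort/dedup step are assumptions for illustration.

```rust
// Sketch (not the actual grid.rs code): collect the union of all subgrid x grids
// into the single x_grid that is handed to EKO.
fn union_x_grid(subgrid_x: &[Vec<f64>]) -> Vec<f64> {
    let mut x_grid: Vec<f64> = Vec::new();
    for x in subgrid_x {
        // this is the call highlighted above
        x_grid.extend_from_slice(x);
    }
    // assumed here: sort and remove duplicates to obtain a proper grid
    x_grid.sort_by(|a, b| a.partial_cmp(b).unwrap());
    x_grid.dedup();
    x_grid
}

fn main() {
    let subgrid_a = vec![0.2, 0.4, 0.6, 0.8];
    let subgrid_b = vec![0.3, 0.4, 0.5, 0.7];
    // one EKO is then computed for this union, not one per subgrid
    println!("{:?}", union_x_grid(&[subgrid_a, subgrid_b]));
}
```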

felixhekhorn (Contributor)

> Thus I believe we need reinterpolation, and what we were doing before might have been wrong as well (and now I also doubt what APFELcomb was doing on those grids...)

I don't believe this is a problem, because we're already at the level of matrix multiplication (after the grid), so the interpolation only plays a role inside the grid. Afterwards it is just linear algebra, meaning eko is exact on the grid points (by construction), and so the information carried from the input to any output point is exact.

cschwan (Contributor) commented Oct 3, 2022

Let's say you have a grid with two bins for DIS. The first bin has this x grid:

[0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

and the second bin:

[0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85]

which means you calculate the EKO for the union of the x grid points given above. For the first bin that means the evolution is done using a subset (the first set), while the other x grid points contribute with zero. I don't see anything wrong with that; the interpolation should still work.
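A toy version of this bookkeeping, with made-up numbers, is sketched below: padding the first bin's weights with zeros on the union grid and contracting over the full union gives exactly the sum restricted to the first bin's own points. Note that this only illustrates the zero-padding argument; it says nothing about whether the operator computed on the union agrees with one computed on the subset alone, which is the point questioned in the following comments.

```rust
// Toy check (all numbers made up): zero-padding a bin's weights onto the union
// x grid and contracting with an operator column defined on the union is the
// same as summing only over the bin's own x points.
fn main() {
    let mut union = vec![
        0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, // first bin
        0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, // second bin
    ];
    union.sort_by(|a, b| a.partial_cmp(b).unwrap());

    // membership test for the first bin's x grid (multiples of 0.1)
    let in_bin1 = |x: f64| ((x * 100.0).round() as i64) % 10 == 0;

    // first bin's weights on the union: some values on its own points, zero elsewhere
    let weights: Vec<f64> = union
        .iter()
        .map(|&x| if in_bin1(x) { x * x } else { 0.0 })
        .collect();
    // one made-up "operator" column, defined on every union point
    let op_column: Vec<f64> = union.iter().map(|&x| 1.0 - x).collect();

    // contraction over the full union, with the zeros contributing nothing ...
    let padded: f64 = weights.iter().zip(&op_column).map(|(w, o)| w * o).sum();
    // ... equals the sum restricted to the first bin's own points
    let restricted: f64 = union
        .iter()
        .zip(&op_column)
        .filter(|&(&x, _)| in_bin1(x))
        .map(|(&x, o)| x * x * o)
        .sum();

    assert!((padded - restricted).abs() < 1e-12);
    println!("padded = {padded}, restricted = {restricted}");
}
```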

alecandido (Member Author)

> I don't believe this is a problem, because we're already at the level of matrix multiplication (after the grid), so the interpolation only plays a role inside the grid. Afterwards it is just linear algebra, meaning eko is exact on the grid points (by construction), and so the information carried from the input to any output point is exact.

One of the two is exact, this is true: matrix multiplication is always a contraction over two paired, but different, indices:

  1. one on the interpolation grid (the function evaluated at that point) and
  2. one on the interpolation basis (the coefficient of that polynomial).

This only works if the contracted EKO index is of kind 1.; can you convince me of that, @felixhekhorn?

alecandido (Member Author)

> you calculate the EKO for the union of the x grid points given above. For the first bin that means the evolution is done using a subset (the first set), while the other x grid points contribute with zero

@cschwan the problem is there for one of the EKO indices (input or output; it is always confusing to me).
Along one of the two, the EKO is a set of coefficients over a basis of functions: if you compute the coefficients of a given function over an extended basis, but then only use a subset of them, you are effectively discarding the others, i.e. declaring that they should be multiplied by zeros. But that is not true: if the yadism function is evaluated at a limited subset of points, it does not mean that it is zero elsewhere.
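A small sketch of this point, using global Lagrange basis polynomials purely for illustration (PineAPPL's actual interpolation is different, and the node sets below are made up): the basis polynomial attached to a node that is shared between a subgrid's own x grid and the union is not the same function in the two node sets, so coefficients taken with respect to one basis cannot simply be reused with the other.

```rust
// Illustration only: the Lagrange basis polynomial attached to a node depends on
// the whole node set, so the basis on a subgrid's own x grid is not a
// sub-collection of the basis on the union of x grids.
fn lagrange_basis(nodes: &[f64], j: usize, x: f64) -> f64 {
    nodes
        .iter()
        .enumerate()
        .filter(|&(k, _)| k != j)
        .map(|(_, &xk)| (x - xk) / (nodes[j] - xk))
        .product()
}

fn main() {
    let subset = [0.2, 0.4, 0.6, 0.8]; // one subgrid's own x grid
    let union = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]; // union with another subgrid
    let x = 0.25; // some point between nodes

    // basis polynomial attached to the shared node 0.4, in each node set
    let j_subset = subset.iter().position(|&v| v == 0.4).unwrap();
    let j_union = union.iter().position(|&v| v == 0.4).unwrap();
    let b_subset = lagrange_basis(&subset, j_subset, x);
    let b_union = lagrange_basis(&union, j_union, x);

    // the two values differ, even though 0.4 belongs to both node sets
    println!("basis at node 0.4 evaluated at x = {x}: subset = {b_subset:.3}, union = {b_union:.3}");
}
```

Whether this effect is numerically relevant for the grids at hand is exactly what the empirical comparison discussed below would probe.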

cschwan (Contributor) commented Oct 3, 2022

@alecandido if we use data/grids/208/ATLASWZRAP36PB-ATLAS-arXiv:1109.5141-Z0_eta34.pineappl.lz4, delete everything but, say, the first bin, and then evolve the grid with one bin using

  1. the already computed EKO and compare that against
  2. the result obtained using a newly-computed EKO only for the first bin

we should be able to see what happens and how big the difference is.

andreab1997 (Contributor)

> With this branch I get [...] which seems to catch the problem - we're back at permille accuracy!

> I am now computing all the FK tables listed under "Things to check" in NNPDF/pineko#62 with this branch, and I am ticking the list's elements one by one as I check them.

Sorry to step into the discussion, but I just wanted to add a detail: it seems that for the FK tables in the list "Things with a conjectured solution" the conjectured solution was wrong, because with this branch I got the correct result.

cschwan (Contributor) commented Oct 3, 2022

@andreab1997 so if I understand you correctly, in every case you've seen so far the differences between the grids and the corresponding FK tables are small? That would be reassuring.

alecandido (Member Author)

> we should be able to see what happens and how big the difference is.

Let me think a moment, and let's wait for @felixhekhorn's reply.

In any case, I don't believe the exercise to be extremely relevant: either it is correct, or it is wrong.
"Empirically correct" tells us little, because if it is wrong I can make the result arbitrarily different, so we would be relying on the specific xgrids chosen (and they are not chosen by us).
Moreover, the definition of "empirically correct" is based on agreement with APFELcomb, but we may well have perfect agreement while both being wrong in the same way.

cschwan (Contributor) commented Oct 3, 2022

@alecandido we had a similar discussion here: #151.

andreab1997 (Contributor)

> @andreab1997 so if I understand you correctly, in every case you've seen so far the differences between the grids and the corresponding FK tables are small? That would be reassuring.

Yes, exactly: most of the time under the permille level. Sometimes the error is slightly above one permille, but never more than that.

alecandido (Member Author)

> @alecandido we had a similar discussion here: #151.

In that case, we were considering different grids. Now the problem is within a single grid.
But I agree it is a similar problem.

cschwan (Contributor) commented Oct 3, 2022

> In any case, I don't believe the exercise to be extremely relevant: either it is correct, or it is wrong. [...] we may well have perfect agreement while both being wrong in the same way.

If in all the cases we've seen so far the differences are small, then I believe - and this is a rather weak statement, I admit - that the interpolation can't be too far off, and at the very least we probably don't have to hold off producing fits with the new pipeline.

In any case, I agree we should test it thoroughly, and being correct is paramount!

andreab1997 (Contributor)

Since we are now pretty sure that this was the source of the bug, @cschwan are you going to release a new version? Or do you want to wait to improve performance first? For me it makes essentially no difference (although it is obviously simpler with a release).

cschwan (Contributor) commented Oct 3, 2022

@andreab1997 I'll probably release a new version at the end of today.

@alecandido @felixhekhorn given that we fixed the initial problem, are you OK with merging this pull request?

alecandido (Member Author)

> @alecandido @felixhekhorn given that we fixed the initial problem, are you OK with merging this pull request?

Yes, let's merge this one and go back to the former status. We can keep discussing the rest separately.

alecandido marked this pull request as ready for review on October 3, 2022, 13:38
cschwan merged commit 96e4661 into master on Oct 3, 2022
cschwan deleted the bug-hunt-conv-eko branch on October 3, 2022, 13:42
cschwan (Contributor) commented Oct 4, 2022

@andreab1997 I've just released v0.5.6 which includes this fix. Crates and packages should become available soon!

andreab1997 (Contributor)

> @andreab1997 I've just released v0.5.6 which includes this fix. Crates and packages should become available soon!

Ok thank you very much!
