
Utils for the new commondata #1693

Closed
wants to merge 9 commits into from

Conversation

@t7phy (Member) commented Mar 10, 2023

This PR adds utils for the new commondata to validphys, so that they can be used without duplicating the utility functions in every filter file.

@t7phy added the urgent label Mar 10, 2023
@scarlehoff (Member)

I'm confused. There is no new commondata in master so this for sure should not point to master. If this is for the generation of new commondata (and for the filters) this should go in a PR together with the first "new commondata" that we agreed during the code meeting should be basically the update of #1500 with the new style.

Also, as a general comment, please use the same style as it is used in the rest of the code.

@t7phy (Member, Author) commented Mar 13, 2023

I'm confused. There is no new commondata in master so this for sure should not point to master. If this is for the generation of new commondata (and for the filters) this should go in a PR together with the first "new commondata" that we agreed during the code meeting should be basically the update of #1500 with the new style.

Also, as a general comment, please use the same style as it is used in the rest of the code.

This is pointing to master because multiple datasets, in different branches by different people, will all require these utils. In discussions following the code meeting it was agreed that this PR is where these should be included.

@scarlehoff (Member) commented Mar 13, 2023

Then it should be in the "master" of the new commondata from which all will branch (which is #1500, or #1678 once I have a reader for the new version, but that's a bit of a catch 22, I'll do it quickly as soon as I have something to test upon)
I don't think we should have pieces of the new commondata in master until everything is ready. Please coordinate with @enocera so that the (new, updated) template for the new commondata includes all these utilities.

@t7phy (Member, Author) commented Mar 13, 2023

Then it should be in the "master" of the new commondata from which all will branch (which is #1500, or #1678 once I have a reader for the new version, but that's a bit of a catch 22, I'll do it quickly as soon as I have something to test upon) I don't think we should have pieces of the new commondata in master until everything is ready. Please coordinate with @enocera so that the (new, updated) template for the new commondata includes all these utilities.

In that case could you at least rebase #1678 (as this is where the dataset branches are pointing, as you asked) on top of #1693? So implementations can go on even if the final reader will need some time.

@scarlehoff (Member) commented Mar 13, 2023

Once there is a new template I can test and compare, I'll update the reader and everyone can rebase on top of the changed #1678

If you want these utilities to be usable for everyone then you can rebase this on top of #1678 as well, so that there is

master --- reader --- utilities ---- {all new commondata}

I'm also happy to have master --- utilities ---- {all new commondata} if you feel that tracking the reader is too much of a hassle. I can take care of updating the reader as newly implemented data create new corner cases. In any case, the reader should not include any utilities to create new data.

In either of the two cases, since all new commondata, the reader and the utilities should be orthogonal changes, rebasing, cherry-picking and merging these commits should be painless (actually, if it isn't, that means that something went wrong!)

@t7phy (Member, Author) commented Mar 13, 2023

Once there is a new template I can test and compare, I'll update the reader and everyone can rebase on top of the changed #1678

It was my understanding that that was the point of #1684, but anyway...

If you want these utilities to be usable for everyone then you can rebase this on top of #1678 as well, so that there is

master --- reader --- utilities ---- {all new commondata}

I'm also happy to have master --- utilities ---- {all new commondata} if you feel that tracking the reader is too much of a hassle. I can take care of updating the reader as newly implemented data create new corner cases. In any case, the reader should not include any utilities to create new data.

In either of the two cases, since all new commondata, the reader and the utilities should be orthogonal changes, rebasing, cherry-picking and merging these commits should be painless (actually, if it isn't, that means that something went wrong!)

Fine, then I think it should be master --- utils --- all new commondata. This is because the utils will not be modified (unless someone asks for some specific functions they might need), whereas the reader will be, so I would prefer to keep this standalone. Once the reader is ready, we can decide what to base on what (and hopefully it will be simple).

@scarlehoff (Member)

It was my understanding that, that was the point of #1684, but anyway...

As it was discussed on Wednesday I'd like to have something I can compare the previous version to, otherwise it is hard for me to know what is a bug and what is a feature :P

I think it should be master --- utils --- all new commondata

Perfect.

(unless someone asks for some specific functions they might need)

Actually, I think this is a good reason to have it stand-alone: that way, as people add new functions they can be included here and propagated, upon merge, to the datasets that are not yet using them. And once everything is implemented (and therefore the utilities are final) it can be reviewed/modified (for instance, I think we should leave the top-level commondata for the one that is used by vp during the fits/reports, and I would separate the utilities to generate data... but that's a discussion for the future)

@Zaharid (Contributor) commented Mar 22, 2023

A few general notes:

These functions should decide whether they are dealing with Python lists or numpy arrays. It should most likely be arrays, so we should not cast them to Python lists in various places. In practice it seems easier to get well-formatted arrays out of the various raw-data parsers (and if not, we can have utils for that). I don't believe we should have formatting conventions that drop the shape information, such as in the return value of triMat_to_fullMat, especially if we are also going to have other conventions in other functions, such as in the return value of concatMatrices.

Functions with "mode" flags should likely be broken into separate functions, one for each mode. It is easier to understand what rows_to_fullMat(mylists) does than triMat_to_fullMat(0, mylists).

Numpy has a lot of functionality that removes the need to write explicit loops. Examples of things that may be useful include numpy.block, array.tolist() or numpy.tril_indices_from.

The naming of functions and variables should follow the usual convention and be snake_case rather than camelCase.
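As an illustration of the loop-free numpy style suggested in these notes, here is a hedged sketch of a single-purpose, snake_case replacement for one mode of triMat_to_fullMat. The function name and flattening convention are assumptions for illustration, not the PR's actual API:

```python
import numpy as np

def rows_to_full_matrix(tri_flat):
    """Rebuild a full symmetric matrix from its lower triangle,
    given flattened row by row (hypothetical name and convention,
    shown only to illustrate the loop-free numpy style)."""
    tri_flat = np.asarray(tri_flat)
    # a lower triangle of an n x n matrix has n(n+1)/2 entries
    n = int((np.sqrt(8 * tri_flat.size + 1) - 1) // 2)
    full = np.zeros((n, n))
    full[np.tril_indices_from(full)] = tri_flat
    # mirror the strict lower triangle onto the upper one
    return full + np.tril(full, -1).T
```

Unlike a flat-list return value, the resulting array keeps its shape, so later steps (concatenation with numpy.block, conversion with tolist()) need no extra bookkeeping.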

@t7phy (Member, Author) commented Mar 23, 2023

A few general notes:

These functions should decide whether they are dealing with Python lists or numpy arrays. It should most likely be arrays, so we should not cast them to Python lists in various places. In practice it seems easier to get well-formatted arrays out of the various raw-data parsers (and if not, we can have utils for that). I don't believe we should have formatting conventions that drop the shape information, such as in the return value of triMat_to_fullMat, especially if we are also going to have other conventions in other functions, such as in the return value of concatMatrices.

These are of course points to be finalized; however, it should be noted that the choices made here are due to practical considerations encountered while datasets were being implemented. The explanations are as follows:

  • The user benefits from dealing with Python lists as opposed to numpy.array because setting a variable equal to numpymatrix[i][j] leads to gibberish output in a .yaml file; the user has to set the variable equal to float(numpymatrix[i][j]) or int(numpymatrix[i][j]) instead. This is not the case with lists. It is a small difference, but a difference nevertheless.
  • The utils consistently take as input and give as output matrices in the form of a 1d list containing the elements of the matrix row by row. This is useful because, when we obtain data from hepdata tables corresponding to correlation/covariance matrices, it is very simple to collect them into a 1d list: the user just needs a loop in the filter that appends each value (from the .yaml data file) to the list.
  • concatMatrices does not ask the user to input the shape of the matrices; it only requires that the matrices be in 2d form (it accepts both list-of-lists and numpy.ndarray formats). The numbers of rows and columns in the arguments actually refer to the number of matrices per row and per column, i.e. if you want to concatenate 6 matrices A, B, C, D, E, F, the function needs to know whether you want [[A,B,C],[D,E,F]] or [[A,B],[C,D],[E,F]].
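For illustration, the two concerns can be reconciled by computing with arrays and converting only at the write boundary: numpy's array.tolist() turns every element into a plain Python type in one call, with no per-element float()/int() casts. A minimal sketch (the matrix here is a made-up example; the YAML behaviour is summarized in the comments rather than demonstrated):

```python
import numpy as np

# Hypothetical stand-in for a matrix parsed from a hepdata table.
matrix = np.array([[1.0, 0.5], [0.5, 1.0]])

# Indexing a numpy array yields numpy scalars, not plain Python floats,
# which is what produces the unreadable tagged output when dumped to YAML.
assert type(matrix[0, 0]) is not float

# Converting once at the write boundary with .tolist() gives nested
# Python lists of plain floats, ready for a clean YAML dump.
plain = matrix.tolist()
assert type(plain[0][0]) is float
```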

@Zaharid (Contributor) commented Mar 23, 2023

A few general notes:
These functions should decide whether they are dealing with Python lists or numpy arrays. It should most likely be arrays, so we should not cast them to Python lists in various places. In practice it seems easier to get well-formatted arrays out of the various raw-data parsers (and if not, we can have utils for that). I don't believe we should have formatting conventions that drop the shape information, such as in the return value of triMat_to_fullMat, especially if we are also going to have other conventions in other functions, such as in the return value of concatMatrices.

These are of course points to be finalized; however, it should be noted that the choices made here are due to practical considerations encountered while datasets were being implemented. The explanations are as follows:

* The user benefits from dealing with Python lists as opposed to numpy.array because setting a variable equal to numpymatrix[i][j] leads to gibberish output in a .yaml file; the user has to set the variable equal to float(numpymatrix[i][j]) or int(numpymatrix[i][j]) instead. This is not the case with lists. It is a small difference, but a difference nevertheless.

* The utils consistently take as input and give as output matrices in the form of a 1d list containing the elements of the matrix row by row. This is useful because, when we obtain data from hepdata tables corresponding to correlation/covariance matrices, it is very simple to collect them into a 1d list: the user just needs a loop in the filter that appends each value (from the .yaml data file) to the list.

These suggest that we should have utilities to convert to and from numpy at the edges, when reading and writing to a file. But the list representation (in its various flavours) does make things needlessly fiddly when computing with it. For example, corrmat-to-covmat could be the one-liner corrmat * errors * errors[:, np.newaxis], which to my eye is much clearer (and clearly correct).
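A minimal, runnable version of that one-liner (the function name is illustrative; only the final expression is quoted from the comment above):

```python
import numpy as np

def correlation_to_covariance(corrmat, errors):
    """cov_ij = corr_ij * err_i * err_j, via broadcasting:
    multiplying by errors scales the columns, and multiplying
    by errors[:, np.newaxis] scales the rows."""
    errors = np.asarray(errors, dtype=float)
    return np.asarray(corrmat, dtype=float) * errors * errors[:, np.newaxis]
```

For example, a 2x2 correlation matrix [[1, 0.5], [0.5, 1]] with uncertainties [2, 3] gives [[4, 3], [3, 9]]: the diagonal is the squared uncertainties and the result stays symmetric by construction.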

@Radonirinaunimi (Member) commented Oct 16, 2023

What is the status of this PR? Should the people who implement commondata implement their own utils (which I think is the case now)? In other words, why aren't the commondata implemented so far relying on the master --- utilities ---- {all new commondata} scheme discussed above? The status now seems to be that similar utility functions are re-implemented in different PRs (sometimes repeated per dataset), introducing redundancies and rendering this deprecated.

@t7phy (Member, Author) commented Oct 16, 2023

What is the status of this PR? Should the people who implement commondata implement their own utils (which I think is the case now)? In other words, why aren't the commondata implemented so far relying on the master --- utilities ---- {all new commondata} scheme discussed above? The status now seems to be that similar utility functions are re-implemented in different PRs (sometimes repeated per dataset), introducing redundancies and rendering this deprecated.

It is my understanding that its review is on the to-do list of @enocera. All the top, jets and dis+jets datasets rely on this file being in the validphys folder, i.e. none of these datasets have their own utilities.

I agree, the idea was to base all the new datasets off this branch, but I don't know what happened to that idea 🤔

@Radonirinaunimi (Member) commented Oct 16, 2023

It is my understanding that its review is on the to-do list of @enocera. All the top, jets and dis+jets datasets rely on this file being in the validphys folder, i.e. none of these datasets have their own utilities.

This is not the case for Jets, for instance. And for the datasets that rely on the utility files being in validphys, I think it would have been better for these to branch out from this, so that when this PR gets reviewed/approved the propagation of the changes throughout is smooth.

I (and presumably the people who are implementing the Collider DY) am/are confused now on what should be done.

@t7phy (Member, Author) commented Oct 17, 2023

This is not the case for Jets, for instance. And for the datasets that rely on the utility files being in validphys, I think it would have been better for these to branch out from this, so that when this PR gets reviewed/approved the propagation of the changes throughout is smooth.

I (and presumably the people who are implementing the Collider DY) am/are confused now on what should be done.

The new jet datasets do use these (because I implemented them). The old ones by Mark do not.

I agree, it would be ideal to branch out from this, and this is indeed what was agreed upon.

@Radonirinaunimi (Member)

This is not the case for Jets, for instance. And for the datasets that rely on the utility files being in validphys, I think it would have been better for these to branch out from this, so that when this PR gets reviewed/approved the propagation of the changes throughout is smooth.
I (and presumably the people who are implementing the Collider DY) am/are confused now on what should be done.

The new jet datasets do use these indeed (because I implemented them). The old ones by Mark do not.

I agree, it would be ideal to branch out from this, and this is indeed what was agreed upon.

I was indeed referring to the re-implementation of the old jets.

@enocera, @scarlehoff What should be done about this?

@scarlehoff (Member)

If the question regards the branching out, branch out of this one if it is convenient.

When it is done just point to #1813 in the PR and I will deal with the merging. At a first stage I will only merge the data itself.

@Radonirinaunimi (Member)

If the question regards the branching out, branch out of this one if it is convenient.

When it is done just point to #1813 in the PR and I will deal with the merging. At a first stage I will only merge the data itself.

My questions basically touched two main points:

  • Are the utilities here stable enough to be deployed to the commondata implementations? If yes, were there good reasons why some of the other implementations were not based upon them? I was trying to understand the status and foresee some issues, but I get from your comment that we should just try them and find out, which is fine.
  • In the near future, does this then imply that the datasets which were not using these will have to be re-written?

Let me just add that, in the same way as in the old implementation, the utils should be at the heart of the data implementation. In any case, cc-ing @cschwan, @peterkrack, and @niclaurenti, who are also implementing the Collider DY, as this might be useful information.

@RoyStegeman (Member)

What is the status of this? Is it still intended to be merged?

@t7phy (Member, Author) commented Mar 11, 2024

What is the status of this?

Not sure. Last we discussed, @enocera had said he would like to review it, but I am not sure if he has had the chance to do so yet.

Is it still intended to be merged?

Ideally, yes, because otherwise a lot of new implementations and reimplementations would not work (i.e. their filters would not work) because they rely on these utils.

@RoyStegeman (Member)

Okay, in that case, could you rebase instead of merging master into this?

@t7phy (Member, Author) commented Mar 11, 2024

Ok sure, I will undo the merge commits and then rebase

@RoyStegeman (Member) commented Mar 11, 2024

No need to undo the merging. If you just rebase it's fine.

Or rather, while rebasing you get that for free anyway.

@RoyStegeman (Member)

If @enocera is busy I could have a look as well some point this week.

@t7phy force-pushed the new_commondata_utils branch 2 times, most recently from 7ed0f6d to e8feb46, March 11, 2024 11:41
@t7phy (Member, Author) commented Mar 11, 2024

If @enocera is busy I could have a look as well some point this week.

that would be helpful

@RoyStegeman self-requested a review March 11, 2024 11:48
@RoyStegeman (Member) left a review comment

Given that the filter.py files have only received the bare minimum of scrutiny (many of them are broken now), and we're long past the stage where I can ask for significant changes anyway, I won't closely review these utils. Reviewing this would only make sense if there were the intention of a larger effort to review and fix all the filter.py files, but there isn't.

I would, however, like to ask you to move commondata_utils.py inside the new_commondata folder (which @comane is turning into its own package in #1965), since it is more suitable there than inside validphys. I think the tests should be scrapped.

@t7phy (Member, Author) commented Mar 12, 2024

done!

@RoyStegeman (Member)

Thanks! On second thought, do any of the dataset implementations that are currently in a PR and have not yet been merged use these utils?

@t7phy (Member, Author) commented Mar 12, 2024

I am not sure, but I don't think so. The most extensive use of the utils was in the new (di)jets implementation, the old and new ttb implementations, and all the dis+j implementations. All of these are merged, but their filters would not work without these utils.

@scarlehoff (Member)

Let's do the following:

The function below

def covmat_to_artunc(ndata, covmat_list, no_of_norm_mat=0):

can go to validphys.commondata_utils.

And the others can go to one of the datasets that use them (and the rest can import from that).
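For context, a function of this kind typically decomposes a covariance matrix into artificial systematic uncertainties whose outer-product sum reproduces the matrix. The sketch below illustrates one standard technique (eigendecomposition); the name, the dropped no_of_norm_mat argument, and the exact conventions are assumptions, not the PR's actual implementation:

```python
import numpy as np

def covmat_to_artunc_sketch(ndata, covmat_list):
    """Hypothetical sketch: turn an ndata x ndata covariance matrix,
    given as a flat row-by-row list, into ndata artificial systematic
    uncertainties per data point."""
    covmat = np.asarray(covmat_list).reshape(ndata, ndata)
    # symmetric eigendecomposition: covmat = V diag(w) V^T
    eigval, eigvec = np.linalg.eigh(covmat)
    # each eigenvector scaled by sqrt(lambda_k) is one artificial
    # systematic; clip guards against tiny negative eigenvalues
    artunc = eigvec * np.sqrt(np.clip(eigval, 0, None))
    return artunc  # rows: data points, columns: artificial uncertainties
```

By construction, artunc @ artunc.T reproduces the input covariance matrix up to numerical noise, which is the defining property of such artificial uncertainties.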

5 participants