
Utils for the new commondata #1693

Closed
wants to merge 9 commits into from

Conversation

@t7phy (Member) commented Mar 10, 2023

This PR adds utils for the new commondata to validphys, so that they can be used without duplicating the utility functions in every filter file.

@t7phy added the urgent label Mar 10, 2023
@scarlehoff (Member)

I'm confused. There is no new commondata in master so this for sure should not point to master. If this is for the generation of new commondata (and for the filters) this should go in a PR together with the first "new commondata" that we agreed during the code meeting should be basically the update of #1500 with the new style.

Also, as a general comment, please use the same style as it is used in the rest of the code.

@t7phy (Member, Author) commented Mar 13, 2023

I'm confused. There is no new commondata in master so this for sure should not point to master. If this is for the generation of new commondata (and for the filters) this should go in a PR together with the first "new commondata" that we agreed during the code meeting should be basically the update of #1500 with the new style.

Also, as a general comment, please use the same style as it is used in the rest of the code.

This is pointing to master because multiple datasets, in different branches by different people, will all require these utils. In discussions following the code meeting it was agreed that this PR is where these should be included.

@scarlehoff (Member) commented Mar 13, 2023

Then it should be in the "master" of the new commondata from which all will branch (which is #1500, or #1678 once I have a reader for the new version, but that's a bit of a catch 22, I'll do it quickly as soon as I have something to test upon)
I don't think we should have pieces of the new commondata in master until everything is ready. Please coordinate with @enocera so that the (new, updated) template for the new commondata includes all these utilities.

@t7phy (Member, Author) commented Mar 13, 2023

Then it should be in the "master" of the new commondata from which all will branch (which is #1500, or #1678 once I have a reader for the new version, but that's a bit of a catch 22, I'll do it quickly as soon as I have something to test upon) I don't think we should have pieces of the new commondata in master until everything is ready. Please coordinate with @enocera so that the (new, updated) template for the new commondata includes all these utilities.

In that case could you at least rebase #1678 (as this is where the dataset branches are pointing, as you asked) on top of #1693? So implementations can go on even if the final reader will need some time.

@scarlehoff (Member) commented Mar 13, 2023

Once there is a new template I can test and compare, I'll update the reader and everyone can rebase on top of the changed #1678

If you want these utilities to be usable for everyone then you can rebase this on top of #1678 as well, so that there is

master --- reader --- utilities ---- {all new commondata}

I'm also happy to have master --- utilities ---- {all new commondata} if you feel that tracking the reader is too much of a hassle. I can take care of updating the reader as newly implemented data create new corner cases. In any case, the reader should not include any utilities to create new data.

In either of the two cases, since all new commondata, the reader and the utilities should be orthogonal changes, rebasing, cherry-picking and merging these commits should be painless (actually, if it isn't, that means that something went wrong!)

@t7phy (Member, Author) commented Mar 13, 2023

Once there is a new template I can test and compare, I'll update the reader and everyone can rebase on top of the changed #1678

It was my understanding that that was the point of #1684, but anyway...

If you want these utilities to be usable for everyone then you can rebase this on top of #1678 as well, so that there is

master --- reader --- utilities ---- {all new commondata}

I'm also happy to have master --- utilities ---- {all new commondata} if you feel that tracking the reader is too much of a hassle. I can take care of updating the reader as newly implemented data create new corner cases. In any case, the reader should not include any utilities to create new data.

In either of the two cases, since all new commondata, the reader and the utilities should be orthogonal changes, rebasing, cherry-picking and merging these commits should be painless (actually, if it isn't, that means that something went wrong!)

Fine, then I think it should be master --- utils --- all new commondata. This is because the utils will not be modified (unless someone asks for some specific functions they might need), whereas the reader will be, so I would prefer to keep this standalone. Once the reader is ready, we can decide what to base on what (and hopefully it will be simple).

@scarlehoff (Member)

It was my understanding that, that was the point of #1684, but anyway...

As it was discussed on Wednesday I'd like to have something I can compare the previous version to, otherwise it is hard for me to know what is a bug and what is a feature :P

I think it should be master --- utils --- all new commondata

Perfect.

(unless someone asks for some specific functions they might need)

Actually, I think this is a good reason to have it stand-alone: that way, as people add new functions they can be included here and propagated, upon merge, to the datasets that are not yet using them. And once everything is implemented (and therefore the utilities are final) it can be reviewed/modified (for instance, I think we should leave the top-level commondata for the one that is used by vp during the fits/reports, and I would separate the utilities to generate data... but that's a discussion for the future)

@Zaharid (Contributor) commented Mar 22, 2023

A few general notes:

These functions should decide whether they are dealing with Python lists or numpy arrays. It should most likely be arrays, so we should not cast them to Python lists in various places. In practice it seems easier to get well-formatted arrays out of the various raw-data parsers (and if not, we can have utils for that). I don't believe we should have formatting conventions that drop the shape information, such as in the return value of triMat_to_fullMat, especially if we are also going to have other conventions in other functions, such as in the return value of concatMatrices.

Functions with "mode" flags should likely be broken into separate functions, one for each mode. It is easier to understand what rows_to_fullMat(mylists) does than triMat_to_fullMat(0, mylists).

Numpy has a lot of functionality that removes the need to write explicit loops. Examples of things that may be useful include numpy.block, array.tolist() or numpy.tril_indices_from.

The naming of functions and variables should follow the usual convention and be snake_case rather than camelCase.
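As an illustration of the loop-free numpy style suggested in these notes, here is a hedged sketch of a single-purpose, snake_case replacement for one mode of triMat_to_fullMat. The function name and flattening convention are assumptions for illustration, not the PR's actual API:

```python
import numpy as np

def rows_to_full_matrix(tri_flat):
    """Rebuild a full symmetric matrix from its lower triangle,
    given flattened row by row (hypothetical name and convention,
    shown only to illustrate the loop-free numpy style)."""
    tri_flat = np.asarray(tri_flat)
    # a lower triangle of an n x n matrix has n(n+1)/2 entries
    n = int((np.sqrt(8 * tri_flat.size + 1) - 1) // 2)
    full = np.zeros((n, n))
    full[np.tril_indices_from(full)] = tri_flat
    # mirror the strict lower triangle onto the upper one
    return full + np.tril(full, -1).T
```

Unlike a flat-list return value, the resulting array keeps its shape, so later steps (concatenation with numpy.block, conversion with tolist()) need no extra bookkeeping.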

@t7phy (Member, Author) commented Mar 23, 2023

A few general notes:

These functions should decide whether they are dealing with Python lists or numpy arrays. It should most likely be arrays, so we should not cast them to Python lists in various places. In practice it seems easier to get well-formatted arrays out of the various raw-data parsers (and if not, we can have utils for that). I don't believe we should have formatting conventions that drop the shape information, such as in the return value of triMat_to_fullMat, especially if we are also going to have other conventions in other functions, such as in the return value of concatMatrices.

These are of course points to be finalized; however, it should be noted that the choices made here are due to practical considerations encountered while datasets were being implemented. The explanations are as follows:

  • The user benefits from dealing with Python lists as opposed to numpy.array because setting a variable equal to numpymatrix[i][j] leads to gibberish output in a .yaml file; the user has to set the variable equal to float(numpymatrix[i][j]) or int(numpymatrix[i][j]) instead. This is not the case with lists. It is a small difference, but a difference nevertheless.
  • The utils consistently take as input and give as output matrices in the form of a 1d list containing the elements of the matrix row by row. This is useful because, when we obtain data from hepdata tables corresponding to correlation/covariance matrices, it is very simple to collect them into a 1d list: the user just needs a loop in the filter that appends each value (from the .yaml data file) to the list.
  • concatMatrices does not ask the user to input the shape of the matrices; it only requires that the matrices be in 2d form (it accepts both list-of-lists and numpy.ndarray formats). The numbers of rows and columns in the arguments actually refer to the number of matrices per row and per column, i.e. if you want to concatenate 6 matrices A, B, C, D, E, F, the function needs to know whether you want [[A,B,C],[D,E,F]] or [[A,B],[C,D],[E,F]].
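For illustration, the two concerns can be reconciled by computing with arrays and converting only at the write boundary: numpy's array.tolist() turns every element into a plain Python type in one call, with no per-element float()/int() casts. A minimal sketch (the matrix here is a made-up example; the YAML behaviour is summarized in the comments rather than demonstrated):

```python
import numpy as np

# Hypothetical stand-in for a matrix parsed from a hepdata table.
matrix = np.array([[1.0, 0.5], [0.5, 1.0]])

# Indexing a numpy array yields numpy scalars, not plain Python floats,
# which is what produces the unreadable tagged output when dumped to YAML.
assert type(matrix[0, 0]) is not float

# Converting once at the write boundary with .tolist() gives nested
# Python lists of plain floats, ready for a clean YAML dump.
plain = matrix.tolist()
assert type(plain[0][0]) is float
```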

@Zaharid (Contributor) commented Mar 23, 2023

A few general notes:
These functions should decide whether they are dealing with Python lists or numpy arrays. It should most likely be arrays, so we should not cast them to Python lists in various places. In practice it seems easier to get well-formatted arrays out of the various raw-data parsers (and if not, we can have utils for that). I don't believe we should have formatting conventions that drop the shape information, such as in the return value of triMat_to_fullMat, especially if we are also going to have other conventions in other functions, such as in the return value of concatMatrices.

These are of course points to be finalized; however, it should be noted that the choices made here are due to practical considerations encountered while datasets were being implemented. The explanations are as follows:

* The user benefits from dealing with Python lists as opposed to numpy.array because setting a variable equal to numpymatrix[i][j] leads to gibberish output in a .yaml file; the user has to set the variable equal to float(numpymatrix[i][j]) or int(numpymatrix[i][j]) instead. This is not the case with lists. It is a small difference, but a difference nevertheless.

* The utils consistently take as input and give as output matrices in the form of a 1d list containing the elements of the matrix row by row. This is useful because, when we obtain data from hepdata tables corresponding to correlation/covariance matrices, it is very simple to collect them into a 1d list: the user just needs a loop in the filter that appends each value (from the .yaml data file) to the list.

These suggest that we should have utilities to convert to and from numpy at the edges, when reading and writing to a file. But the list representation (in its various flavours) does make things needlessly fiddly when computing with it. For example, corrmat-to-covmat could be the one-liner corrmat * errors * errors[:, np.newaxis], which to my eye is much clearer (and clearly correct).
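A minimal, runnable version of that one-liner (the function name is illustrative; only the final expression is quoted from the comment above):

```python
import numpy as np

def correlation_to_covariance(corrmat, errors):
    """cov_ij = corr_ij * err_i * err_j, via broadcasting:
    multiplying by errors scales the columns, and multiplying
    by errors[:, np.newaxis] scales the rows."""
    errors = np.asarray(errors, dtype=float)
    return np.asarray(corrmat, dtype=float) * errors * errors[:, np.newaxis]
```

For example, a 2x2 correlation matrix [[1, 0.5], [0.5, 1]] with uncertainties [2, 3] gives [[4, 3], [3, 9]]: the diagonal is the squared uncertainties and the result stays symmetric by construction.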

@Radonirinaunimi (Member) commented Oct 16, 2023

What is the status of this PR? Should the people who implement commondata implement their own utils (which I think is the case now)? In other words, why aren't the commondata implemented so far relying on the master --- utilities ---- {all new commondata} scheme discussed above? The status now seems to be that similar utility functions are re-implemented in different PRs (sometimes repeated per dataset), introducing redundancies and rendering this deprecated.

@t7phy (Member, Author) commented Oct 16, 2023

What is the status of this PR? Should the people who implement commondata implement their own utils (which I think is the case now)? In other words, why aren't the commondata implemented so far relying on the master --- utilities ---- {all new commondata} scheme discussed above? The status now seems to be that similar utility functions are re-implemented in different PRs (sometimes repeated per dataset), introducing redundancies and rendering this deprecated.

It is my understanding that its review is on the to-do list of @enocera. All the top, jets and dis+jets datasets rely on this file being in the validphys folder, i.e. none of these datasets have their own utilities.

I agree, the idea was to base all the new datasets off this branch, but I don't know what happened to that idea 🤔

@Radonirinaunimi (Member) commented Oct 16, 2023

It is my understanding that its review is on the to-do list of @enocera. All the top, jets and dis+jets datasets rely on this file being in the validphys folder, i.e. none of these datasets have their own utilities.

This is not the case for Jets, for instance. And for the datasets that rely on the utility files being in validphys, I think it would have been better for these to branch out from this, so that when this PR gets reviewed/approved the propagation of the changes throughout is smooth.

I (and presumably the people who are implementing the Collider DY) am/are confused now on what should be done.

@t7phy (Member, Author) commented Oct 17, 2023

This is not the case for Jets, for instance. And for the datasets that rely on the utility files being in validphys, I think it would have been better for these to branch out from this, so that when this PR gets reviewed/approved the propagation of the changes throughout is smooth.

I (and presumably the people who are implementing the Collider DY) am/are confused now on what should be done.

The new jet datasets do use these (because I implemented them). The old ones by Mark do not.

I agree, it would be ideal to branch out from this, and this is indeed what was agreed upon.

@Radonirinaunimi (Member)

This is not the case for Jets, for instance. And for the datasets that rely on the utility files being in validphys, I think it would have been better for these to branch out from this, so that when this PR gets reviewed/approved the propagation of the changes throughout is smooth.
I (and presumably the people who are implementing the Collider DY) am/are confused now on what should be done.

The new jet datasets do use these indeed (because I implemented them). The old ones by Mark do not.

I agree, it would be ideal to branch out from this, and this is indeed what was agreed upon.

I was indeed referring to the re-implementation of the old jets.

@enocera, @scarlehoff What should be done about this?

@scarlehoff (Member)

If the question regards the branching out, branch out of this one if it is convenient.

When it is done just point to #1813 in the PR and I will deal with the merging. At a first stage I will only merge the data itself.

@Radonirinaunimi (Member)

If the question regards the branching out, branch out of this one if it is convenient.

When it is done just point to #1813 in the PR and I will deal with the merging. At a first stage I will only merge the data itself.

My questions basically touched two main points:

  • Are the utilities here stable enough to be deployed to the commondata implementations? If yes, were there good reasons why some of the other implementations were not based upon them? I was trying to understand the status and foresee some issues, but I get from your comment that we should just try them and find out, which is fine.
  • In the near future, does this then imply that the datasets which were not using these will have to be re-written?

Let me just add that, in the same way as in the old implementation, the utils should be at the heart of the data implementation. In any case, cc-ing @cschwan, @peterkrack, and @niclaurenti, who are also implementing the Collider DY, as this might be useful information.

@RoyStegeman (Member)

What is the status of this? Is it still intended to be merged?

@t7phy (Member, Author) commented Mar 11, 2024

What is the status of this?

Not sure. Last we discussed, @enocera had said he would like to review it, but I am not sure if he has had the chance to do so yet.

Is it still intended to be merged?

Ideally, yes, because otherwise a lot of new implementations and reimplementations would not work (i.e. their filters would not work) because they rely on these utils.

@RoyStegeman (Member)

Okay, in that case, could you rebase instead of merging master into this?

@t7phy (Member, Author) commented Mar 11, 2024

Ok sure, I will undo the merge commits and then rebase

@RoyStegeman (Member) commented Mar 11, 2024

No need to undo the merging. If you just rebase it's fine.

Or rather, while rebasing you get that for free anyway.

@RoyStegeman (Member)

If @enocera is busy I could have a look as well some point this week.

@t7phy force-pushed the new_commondata_utils branch 2 times, most recently from 7ed0f6d to e8feb46, March 11, 2024 11:41
@t7phy (Member, Author) commented Mar 11, 2024

If @enocera is busy I could have a look as well some point this week.

that would be helpful

@RoyStegeman self-requested a review March 11, 2024 11:48
@RoyStegeman (Member) left a review comment

Given that the filter.py files have only received the bare minimum of scrutiny (many of them are broken now), and we're long past the stage where I can ask for significant changes anyway, I won't closely review these utils. Reviewing this would only make sense if there were the intention of a larger effort to review and fix all the filter.py files, but there isn't.

I would, however, like to ask you to move commondata_utils.py inside the new_commondata folder (which @comane is turning into its own package in #1965), since it is more suitable there than inside validphys. I think the tests should be scrapped.

@t7phy (Member, Author) commented Mar 12, 2024

done!

@RoyStegeman (Member)

Thanks! On second thought, do any of the dataset implementations that are currently in a PR and have not yet been merged use these utils?

@t7phy (Member, Author) commented Mar 12, 2024

I am not sure, but I don't think so. The most extensive use of the utils was in the new (di)jets implementation, the old and new ttb implementations, and all the dis+j implementations. All of these are merged, but their filters would not work without these utils.

@scarlehoff (Member)

Let's do the following:

The function below

def covmat_to_artunc(ndata, covmat_list, no_of_norm_mat=0):

can go to validphys.commondata_utils.

And the others can go to one of the datasets that use them (and the rest can import from that).
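For context, a function of this kind typically decomposes a covariance matrix into artificial systematic uncertainties whose outer-product sum reproduces the matrix. The sketch below illustrates one standard technique (eigendecomposition); the name, the dropped no_of_norm_mat argument, and the exact conventions are assumptions, not the PR's actual implementation:

```python
import numpy as np

def covmat_to_artunc_sketch(ndata, covmat_list):
    """Hypothetical sketch: turn an ndata x ndata covariance matrix,
    given as a flat row-by-row list, into ndata artificial systematic
    uncertainties per data point."""
    covmat = np.asarray(covmat_list).reshape(ndata, ndata)
    # symmetric eigendecomposition: covmat = V diag(w) V^T
    eigval, eigvec = np.linalg.eigh(covmat)
    # each eigenvector scaled by sqrt(lambda_k) is one artificial
    # systematic; clip guards against tiny negative eigenvalues
    artunc = eigvec * np.sqrt(np.clip(eigval, 0, None))
    return artunc  # rows: data points, columns: artificial uncertainties
```

By construction, artunc @ artunc.T reproduces the input covariance matrix up to numerical noise, which is the defining property of such artificial uncertainties.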

5 participants