Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

num_regression with storage in NPZ files? #71

Closed
tovrstra opened this issue Sep 3, 2021 · 4 comments · Fixed by #72
Closed

num_regression with storage in NPZ files? #71

tovrstra opened this issue Sep 3, 2021 · 4 comments · Fixed by #72
Labels
question Further information is requested

Comments

@tovrstra
Copy link
Contributor

tovrstra commented Sep 3, 2021

It would be useful to store numerical results in NPZ files, because these files can contain arrays with arbitrary shapes. Also for large arrays, this format is more compact (binary + compression), which is useful when dealing with large arrays. The downside is obviously that NPZ files are not human-readable. Would this be a feature of interest for the project? Are there currently other ways of handling large (say 10^6 elements) and/or high-dimensional arrays?

@nicoddemus nicoddemus added the question Further information is requested label Sep 3, 2021
@nicoddemus
Copy link
Member

nicoddemus commented Sep 3, 2021

Hi @tovrstra!

Thanks for the suggestion.

Yeah I think an option to num_regression.check and dataframe_regression.check could be use to save arrays in NPZ format (or automatically if it detects that the current array wouldn't be saved in CSV, the current format).

We would love to review a PR in that direction. 👍

@tarcisiofischer
Copy link
Contributor

I'm interested in this, specially for 2d arrays (e.g. For comparing scalar fields).
I'm curious to see the implementation details. Perhaps some questions to think about:

  • Could this be implemented keeping the dict input for num-regression or would it be only for a single nparray?
    For example, would this code below be possible? (I believe no, but is there any alternative? Perhaps a new specific nparray_regression?)
            num_regression.check({
                'U_field': my_np_array_3d,
                'U_mean': my_np_array_2d,
                'U_center_mean': my_np_array_1d,
                'P': Pa_to_bar(P)[positions], // 1d np array smaller than the other one
            })
  • How would the error messages be presented for such dense matrices / multidimensional data? Perhaps sparse visualization? Or no message at all, just the file re-generation and the user makes a diff manually?

Perhaps I'm overthinking. Anything that improves the current implementation is actually very welcome, IMO!
Would love to hear more and/or see a PR too :) Thanks!

@tovrstra
Copy link
Contributor Author

tovrstra commented Sep 3, 2021

Thanks for the enthusiasm! Just a quick attempt to answer a few of the questions:

  • NPZ is a ZIP file of NPY files, so a dictionary of arrays (with different shapes and dtypes) can be stored in such a file. See NumPy's savez_compressed. You can pass in a dictionary whose keys are not proper Python variable names. E.g. following works (!):

    import numpy as np
    data = {"odd/name": np.array([1, 2]), "other.name": np.array([[3.0, 4.0, 5.0]])}
    np.savez_compressed("somefile.npz", **data)
    data2 = dict(np.load("somefile.npz"))
  • Error reporting can indeed be difficult. It depends on how sensitive the test should be. The simplest is to requires strict reproduction of the arrays, e.g. reporting the number of mismatched elements for each item in the dictionary, some examples of mismatches (old, new and error value for each). In some cases, one may prefer to have atol and rtol arguments for the check method, in analogy to the same arguments of NumPy's allclose method. This may not be perfect though, because the dictionary may contain different arrays with different precision requirements (or some not even containing floating point numbers).

@tarcisiofischer
Copy link
Contributor

SGTM. About the error reporting, I think you can choose whatever you think its good enough for a first version, and perhaps we could discuss further in the PR :)

tovrstra added a commit to tovrstra/pytest-regressions that referenced this issue Sep 3, 2021
Fixes ESSS#71

This is a first attempt, including unit tests.
Also some minor issues in DataFrameRegressionFixture were fixed,
which was used as a starting point for this PR.
The new fixture only uses NumPy, so also the error message makes
no use for pandas for formatting the errors.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants