Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve bc file formatting #308

Closed
veenstrajelmer opened this issue Aug 17, 2022 · 5 comments · Fixed by #406, #407 or #417
Closed

Improve bc file formatting #308

veenstrajelmer opened this issue Aug 17, 2022 · 5 comments · Fixed by #406, #407 or #417
Labels
type: enhancement Improvements to existing functionality
Milestone

Comments

@veenstrajelmer
Copy link
Collaborator

veenstrajelmer commented Aug 17, 2022

Is your feature request related to a problem? Please describe.
example code:

from pathlib import Path

from hydrolib.core.io.bc.models import (
    ForcingModel,
    QuantityUnitPair,
    TimeInterpolation,
    TimeSeries,
)

# Chosen approach: separate .bc file for each boundary (east and south)
bc_east = ForcingModel()

# Each .bc file can contain 1 or more timeseries, one for each support point:
steric = TimeSeries(
    name="east2_0001",
    quantityunitpair=[QuantityUnitPair(quantity="time", unit="seconds since 2022-01-01 00:00:00 +00:00"), #TODO: should fail if only one QUP is supplied, but does not: https://github.com/Deltares/HYDROLIB-core/issues/305
                      QuantityUnitPair(quantity="waterlevel", unit="m")],
    timeinterpolation=TimeInterpolation.linear,
    datablock=[[0, 1.1], [3600, 2.3], [7200, 3.5], [10800, 2.4], [86400, -0.123]],
)
bc_east.forcing.append(steric)

bc_east.save(filepath=Path("steric_east2.bc"))

Results in this file:

[Forcing]
name              = east2_0001
function          = timeseries
timeInterpolation = linear
offset            = 0.0
factor            = 1.0
quantity          = time
unit              = seconds since 2022-01-01 00:00:00 +00:00
quantity          = waterlevel
unit              = m
0.0        1.1
3600.0     2.3
7200.0     3.5
10800.0    2.4
86400.0    -0.123 

Describe the solution you'd like
The time and data values are not neatly aligned. This can easily be solved. User provides a formatting string for time and data columns, or done semi-automatically. Some thinking is still required, since different data types require different precision.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered (if applicable).

Additional context
This is important with large t3D bc files, because of readability, but also disk space. The latter is significantly reduced when reducing precision and that is relevant since even then the bc files are >1GB. It could also be that formatting with limited significance improves writing speed.

@arthurvd arthurvd added the type: enhancement Improvements to existing functionality label Sep 20, 2022
@arthurvd arthurvd added this to To do in HYDROLIB-core via automation Sep 20, 2022
@arthurvd arthurvd added this to the Release 0.4 milestone Sep 20, 2022
@tim-vd-aardweg tim-vd-aardweg moved this from To do to In progress in HYDROLIB-core Oct 11, 2022
@tim-vd-aardweg
Copy link
Contributor

Discussed the issue with @veenstrajelmer. The data blocks in the .bc file were not as bad as he remembered. We did discuss that we will do the following:

  • Reduce the default number of spaces between columns on 2 (instead of 4).
  • The user should be able to specify the number of decimals they want written for the non-time columns.
  • We will not change the way the numbers are currently aligned.

@tim-vd-aardweg
Copy link
Contributor

tim-vd-aardweg commented Oct 14, 2022

Initially, I had a working implementation by adding a new optional property (that defaults to none) to the DataBlockINIBasedModel called something like number_of_decimals_to_write.

In the _to_section() function of that same class (where the data is implicitly converted from floats to strings) we can then do something like:

def _to_section(self) -> Section:
        section = super()._to_section()
        if not self.number_of_decimals_to_write:
            section.datablock = self.datablock
        else:
            section.datablock = []
            for row in self.datablock:
                formatted_row = [
                    f"%.{self.number_of_decimals_to_write}f" % i for i in row
                ]

            section.datablock.append(formatted_row)

        return section

This correctly formats the datablocks in the .bc file and can be used as follows:

for model in forcingmodel.forcing:
     model.number_of_decimals_to_write = 12

or by specifiying the number_of_decimals_to_write when creating for example a T3D block.

However, after discussing it with Jelmer, we came to the conclusion that this is not very intuitive for the user. The user may think that the precision of the actual data (so the floats) are set, but that's not the case. We also came to the conclusion that it would be more intuitive to somehow add it to the save method. For example:
forcingModel.save(filepath, number_of_decimals_to_write=12)

@tim-vd-aardweg
Copy link
Contributor

So I discussed these requirements with Prisca and she thought it would be best to make something like a SaveConfig class that would not only hold the number_of_decimals_to_write, but could also be used for, for example, whether or not to recursively save models.

Also, after talking to Jelmer once more, he also mentioned that he would like to be able to specifiy the precision for, for example, the poly file. So the issue is in fact bigger than only the datablock in the .bc file and therefore requires a solution at a higher level. We should create a better design than the very limited option I initially wrote.

@tim-vd-aardweg tim-vd-aardweg moved this from In progress to To do in HYDROLIB-core Oct 18, 2022
@priscavdsluis priscavdsluis moved this from To do to In progress in HYDROLIB-core Nov 1, 2022
@priscavdsluis priscavdsluis linked a pull request Nov 1, 2022 that will close this issue
priscavdsluis added a commit that referenced this issue Nov 8, 2022
priscavdsluis added a commit that referenced this issue Nov 8, 2022
@priscavdsluis priscavdsluis linked a pull request Nov 8, 2022 that will close this issue
@priscavdsluis
Copy link
Contributor

priscavdsluis commented Nov 8, 2022

PR #407 has been created to reduce the datablock spacing from 4 to 2.

@veenstrajelmer
Copy link
Collaborator Author

veenstrajelmer commented Nov 14, 2022

Just tested the performance with the code in #313. These are the results:

1000 rows with hydrolib (unformatted):     0.27 sec
1000 rows with hydrolib (formatted single): 0.26 sec
1000 rows with savetxt (unformatted):       0.08 sec
1000 rows with savetxt (formatted single):  0.03 sec
1000 rows with savetxt (formatted percol):  0.03 sec

10000 rows with hydrolib (unformatted):     2.79 sec
10000 rows with hydrolib (formatted single): 4.40 sec
10000 rows with savetxt (unformatted):       0.66 sec
10000 rows with savetxt (formatted single):  0.22 sec
10000 rows with savetxt (formatted percol):  0.21 sec

100000 rows with hydrolib (unformatted):     31.48 sec
100000 rows with hydrolib (formatted single): 29.18 sec
100000 rows with savetxt (unformatted):       6.57 sec
100000 rows with savetxt (formatted single):  2.30 sec
100000 rows with savetxt (formatted percol):  3.63 sec

In the original issue, the performance of hydrolib and savetxt was equal in unformatted mode, while it is a factor 5 different now. The limited performance of hydrolib seems to mainly come from the formatting, the writing of the file seems quite fast. numpy.savetxt() seems to format per line instead of per item (if I interpret it correctly): https://github.com/numpy/numpy/blob/v1.23.0/numpy/lib/npyio.py#L1553
However, that would interfere with the hydrolib column aligner I think, which is a useful feature.

priscavdsluis added a commit that referenced this issue Nov 14, 2022
@priscavdsluis priscavdsluis linked a pull request Nov 16, 2022 that will close this issue
priscavdsluis added a commit that referenced this issue Nov 16, 2022
priscavdsluis added a commit that referenced this issue Nov 17, 2022
priscavdsluis added a commit that referenced this issue Nov 17, 2022
@priscavdsluis priscavdsluis moved this from In progress to Reviewer follow up in HYDROLIB-core Nov 18, 2022
priscavdsluis added a commit that referenced this issue Nov 18, 2022
@priscavdsluis priscavdsluis moved this from Reviewer follow up to Ready to review in HYDROLIB-core Nov 22, 2022
@tim-vd-aardweg tim-vd-aardweg moved this from Ready to review to Review in progress in HYDROLIB-core Nov 25, 2022
tim-vd-aardweg pushed a commit that referenced this issue Dec 15, 2022
HYDROLIB-core automation moved this from Review in progress to Done Dec 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: enhancement Improvements to existing functionality
Projects
HYDROLIB-core
  
Done
4 participants