Merge develop_readWDM into develop to read time series by block & group #37

ptomasula · 2021-04-28T23:55:33Z

This pull request replaces pull request #35. See below for additional details.

Purpose

This pull request directly addresses issue #21 Rewrite readWDM.py to read by data group & block. The update revises the readWDM function to process timeseries in WDM files by explicitly looping over groups and processing each block. The logic is based on a function in RESPEC's BASINS repository that provides similar functionality. These improvements enable the readWDM function to handle WDM files containing timeseries with irregular timesteps, which the previous version of the function could not handle (see issue #40 in RESPEC's repo). The updated function also restores Numba support, which dramatically decreases the processing time required for WDM files.

Testing

Testing was conducted manually by processing the test10.wdm and calleg.wdm files with both the develop and updated versions of the readWDM function. Timeseries from the resultant HDF files were then compared to ensure they matched. Additionally, the main function of the model was run to ensure the input files generated by the updated readWDM function still allowed the model to execute.

Replacement of PR #35

During an initial merge of the develop branch into develop_readWDM, a suspected clipboard error decremented all the numeric values in a section of code that was cut and pasted while resolving a merge conflict. See commit 16325d1, which attempted to revert the issue. Ultimately the reversion of the merge appeared unsuccessful because we could not then re-merge the develop branch. As a workaround, we created a new branch based on commit 146036c and successfully completed the merge under the new branch.

Unaddressed Issues/Next Steps

Testing revealed that executing the main function of the model fails for timeseries with irregular timesteps. This issue was initially documented in issue #21 and will be addressed in a future update. The readWDM function is working as intended, and further discussion is needed on how best to handle timeseries with irregular timesteps within the rest of the project. This issue is now being tracked as issue #51 under RESPEC's repository.

Adds new function 'process_groups', which replaces the 'getfloats' function. The new function processes WDM files by blocks, each consisting of a control word (32-bit integer) followed by one or more float32 values that contain the data for the block. This approach provides support for timeseries records with irregular timestep intervals.
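As a rough illustration of the block layout described above, the sketch below parses one block from raw bytes using the standard `struct` module. The byte order, the position of the value count inside the control word (upper 16 bits), and the `read_block` helper are assumptions made for this example and are not taken from the readWDM source. Compression and quality flags are ignored.

```python
import struct

WORD_SIZE = 4  # every word in the block is 32 bits wide


def read_block(buf, offset):
    """Parse one block that starts at word index `offset`.

    Returns (nval, values, next_offset). Assumes, for illustration only,
    that the upper 16 bits of the control word hold the number of values
    and that words are stored little-endian.
    """
    (control,) = struct.unpack_from("<i", buf, offset * WORD_SIZE)
    nval = control >> 16  # hypothetical field position
    values = struct.unpack_from(f"<{nval}f", buf, (offset + 1) * WORD_SIZE)
    return nval, list(values), offset + 1 + nval


# Build a fake block: a control word announcing 3 values, then 3 float32 values.
raw = struct.pack("<i3f", 3 << 16, 1.0, 2.5, 4.0)
nval, values, next_offset = read_block(raw, 0)
```

The returned `next_offset` is where the next control word would be expected, which is the bookkeeping the group loop has to get right.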

Adding an additional WDM test file rpo772.wdm. This file contains a single timeseries with an irregular timestep.

My understanding is that when working with numpy arrays, allocating the array size up front is very important for performance, as expanding the array later essentially means copying it to a larger array in a new block of memory. With the block processing approach there is presently no way to determine the exact size of the resultant array until you read through the groups, and allocating an array of a fixed size can cause files to fail with an IndexError. A quick solution was to implement a chunk allocation approach, which allocates numpy arrays in large 100 MB chunks and only expands an array when processing the next block would exceed the boundaries of the existing array. This solution is far from perfect but at least resolves the IndexErrors. We should consult with Jack to see if there is a way to determine the number of elements prior to processing the timeseries, or alternatively consider switching to lists, because appending an element to a list takes amortized constant time and should perform better than this approach.
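The chunk allocation idea can be sketched as follows. This is a minimal illustration, not the code from this commit: the `ChunkedBuffer` class, its method names, and the small default chunk size are all hypothetical, chosen so the growth step is easy to follow.

```python
import numpy as np

CHUNK = 1024  # elements per allocation; hypothetical, far smaller than a 100 MB chunk


class ChunkedBuffer:
    """Append-only float buffer that grows in whole chunks."""

    def __init__(self, chunk=CHUNK, dtype=np.float32):
        self._chunk = chunk
        self._data = np.empty(chunk, dtype=dtype)
        self._size = 0  # number of elements actually written

    def extend(self, values):
        values = np.asarray(values, dtype=self._data.dtype)
        needed = self._size + values.size
        if needed > self._data.size:
            # Only reallocate when the next block would not fit.
            n_chunks = -(-needed // self._chunk)  # ceiling division
            grown = np.empty(n_chunks * self._chunk, dtype=self._data.dtype)
            grown[: self._size] = self._data[: self._size]
            self._data = grown
        self._data[self._size : needed] = values
        self._size = needed

    def to_array(self):
        # Trim the unused tail before handing the data on.
        return self._data[: self._size].copy()


buf = ChunkedBuffer(chunk=4)
buf.extend([1, 2, 3])
buf.extend([4, 5, 6])  # exceeds the first chunk, so the buffer grows to 8
result = buf.to_array()
```

Each reallocation still copies everything written so far, which is why a way to know the element count up front would be preferable.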

Blocks in a timeseries consist of a minimum of 2 elements: the block control word and one or more float data values. When a block ends there is either another block, the end of the group, or the end of the record (meaning we'd go to the next record in the chain). Some of our test WDM files have a block that ends on the 511th element of a record. In these example files the 512th element also did not parse to a valid block control word. At this time I'm not sure what the significance of the 512th element is when a block ends there, but processing that element as a control word causes a series of errors which throw off the accuracy of all subsequent blocks processed. We'll need to confirm the significance of these elements with Jack, but for now I implemented logic to skip to the next record if the end of a block falls on the 511th element of the record.

From @PaulDudaRESPEC, added to new `docs` directory that we'll want to build out over time. Connects to #20 & #21.

.exp & .OUT files from @jlkittle using WDMRX debugger: https://github.com/respec/FORTRAN/blob/master/lib3.0/BIN/WDMRX.EXE CSV file from @htaolimno using https://github.com/respec/BASINS/tree/master/atcWdmVb Connects to #21 & #22.

# 1. used lists to replace numpy matrix; # 2. added a loop to iterate each group and used ending date as the ending condition

The rewrite to process wdm files as groups led to the deprecation of the getfloats and adjustNval functions. Additionally Hua refactored the original process_groups method in a previous commit as process_groups2. The original process_groups method was removed and process_groups2 was renamed to process_group.

A single leading underscore is one of the conventions used in Python (see PEP 8) to denote internal classes and functions. Internal functions come with no guarantee of backward compatibility. We want to update the naming of supporting functions that we do not want the public to interface with.

General refactoring to clean up code by removing old comments and slight restructuring to increase readability. Also replaces print statements for errors with raised exceptions.

To help with #21 @ptomasula & @PaulDudaRESPEC

Connected to #28

Merge updates to readUCI and GQUAL from respec/HSPsquared - develop branch

The constraints of Numba mean that datetime conversion cannot occur within the main block processing loop (_process_groups function). This commit replaces the previous use of Python datetime objects with a bit-packing approach in which date components are stored in a single integer whose individual bits can be parsed into the timestep components. Conversion to a datetime object now occurs outside of the processing loop, prior to output to HDF.
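A minimal sketch of the bit-packing idea is shown below. The name `bits_to_date` appears in this PR's commit list, but the field widths, the shift positions, and the `date_to_bits` and `to_datetime` helpers are assumptions made for this example rather than the layout used in readWDM.

```python
import datetime

# Hypothetical layout, from least significant bit upward:
# second (6 bits), minute (6), hour (5), day (5), month (4), year (remaining bits).


def date_to_bits(year, month, day, hour, minute, second):
    """Pack date components into one integer using only shifts and ORs."""
    return (
        (year << 26) | (month << 22) | (day << 17)
        | (hour << 12) | (minute << 6) | second
    )


def bits_to_date(bits):
    """Unpack the integer back into its components."""
    year = bits >> 26
    month = (bits >> 22) & 0xF
    day = (bits >> 17) & 0x1F
    hour = (bits >> 12) & 0x1F
    minute = (bits >> 6) & 0x3F
    second = bits & 0x3F
    return year, month, day, hour, minute, second


def to_datetime(bits):
    """Conversion to a datetime object, done outside the hot loop."""
    return datetime.datetime(*bits_to_date(bits))


packed = date_to_bits(2021, 4, 28, 23, 55, 33)
unpacked = bits_to_date(packed)
stamp = to_datetime(packed)
```

Because packing and unpacking use only integer arithmetic, they are the kind of operation that can stay inside a compiled loop, with object creation deferred to the end.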

…ared into develop_readWDM

The datetime functions added to support numba in commit e5d64a1 require that integers input into these functions are 64 bits wide, or year information will be lost during bit manipulations. The previous implementation left the integer type up to numba, which in some instances could produce an int32 object. This commit makes the integer conversion explicitly int64 so that year information is not lost.
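The loss of year information can be illustrated with plain Python integers by masking a packed value down to 32 bits. The shift positions below are hypothetical and chosen only to show the effect; they are not taken from the readWDM source.

```python
YEAR_SHIFT = 26  # hypothetical position of the year field

# Pack 2021-04-28 with the year in the highest bits.
packed = (2021 << YEAR_SHIFT) | (4 << 22) | (28 << 17)


def year_from(bits):
    return bits >> YEAR_SHIFT


# Simulate storage in 64-bit and 32-bit integers by keeping only the low bits.
as_64bit = packed & 0xFFFFFFFFFFFFFFFF
as_32bit = packed & 0xFFFFFFFF

year_64 = year_from(as_64bit)  # all year bits survive
year_32 = year_from(as_32bit)  # only the low 6 bits of the year survive
```

With this layout the packed value needs 37 bits, so a 32-bit integer silently drops the upper part of the year while the month and day fields remain intact.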

Even with datetime conversions removed from the group processing loop, conversion using datetime.datetime() remains slow. After trying several datetime conversion approaches with pandas I was still unable to achieve a significant performance boost. Numba does not support the creation of datetime objects; however, it does support datetime arithmetic. This commit adds a numba-compatible datetime conversion function which calculates a date's offset from the epoch and adds the appropriate timedelta64 objects to return a datetime64 object.
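The epoch-offset approach can be sketched with NumPy alone. The function names here are hypothetical, the day-count formula is a standard civil-calendar algorithm rather than the one in this commit, and the code is not decorated with Numba, so whether each call compiles under `@njit` is not verified here.

```python
import numpy as np

EPOCH = np.datetime64("1970-01-01T00:00:00")


def days_from_civil(year, month, day):
    """Days since 1970-01-01 for a Gregorian date, using integer arithmetic only."""
    y = year - 1 if month <= 2 else year  # treat Jan/Feb as part of the prior year
    era = y // 400
    yoe = y - era * 400                   # year of era, 0..399
    mp = (month + 9) % 12                 # month index with March = 0
    doy = (153 * mp + 2) // 5 + day - 1   # day of year, counted from 1 March
    doe = yoe * 365 + yoe // 4 - yoe // 100 + doy
    return era * 146097 + doe - 719468


def to_datetime64(year, month, day, hour=0, minute=0, second=0):
    """Build a datetime64 by adding timedelta64 offsets to the epoch."""
    return (
        EPOCH
        + np.timedelta64(days_from_civil(year, month, day), "D")
        + np.timedelta64(hour, "h")
        + np.timedelta64(minute, "m")
        + np.timedelta64(second, "s")
    )


stamp = to_datetime64(2021, 4, 28, 23, 55, 33)
```

A side effect of building the value by addition is that an hour of 24 simply rolls over into the next day instead of raising an error.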

I missed committing 3 line deletions which remove the old pandas.apply based datetime conversion approach.

@ptomasula, try running either of these notebooks. Connects to issue #21 & PR #35

The timeseries produced by the new version of the WDM reader are offset by one datetime step from the previous version.

When the transform is unable to detect the timestep of the input timeseries, there was a bit of code that performed a reindex and fillna using the 'tbase' of the siminfo. However, this key does not appear in that dictionary. Additionally, our initial test case of test 10b does not encounter this bit of code. We'll revisit the necessity of this during our IO abstraction, but we are commenting it out for now.

@ptomasula, I found the bug recently described in #21

for assisted value comparisons.

This commit addresses a mismatch in the 'Stop' parameter of the '/TIMESERIES/SUMMARY' table. The updated WDMreader wrote a value 1 timestep less than the end of the last group in the timeseries. See Issue #21.

Add two lines that were missing from the previous commit. ca50dd0

Commit c01199a by @PaulDudaRESPEC got HSP2 to run with HDF5 files from new `readWDM`. Still need to compare outputs

Reversing b0edc39 as a step toward advancing #21

This commit assigns the freq attribute of the timeseries generated in the ReadWDM function. Without the freq assignment, the timeseries were failing to execute in some of the model modules. Reference: #21 and #21 (comment)

This fixes an issue, similar to one we saw previously, in which an invalid block control word was parsed. The 'offset' variable keeps track of where on a given record the loop is, and a block must be at least 2 words long. When at the final (512th) index of the record and attempting to process the next block, we must first go to the next record in the timeseries. See line 307. However, we've used Python indexing for the offset variable, which starts at 0. The last index of a record should therefore be 511 and not 512. Reference: #21 (comment)
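The boundary condition can be sketched as a small check. The constant and function names are hypothetical, and the check is a simplified stand-in for the logic in the processing loop: with zero-based indexing, a block of at least two words cannot start at index 511.

```python
RECORD_WORDS = 512              # words per record
LAST_INDEX = RECORD_WORDS - 1   # 511 with zero-based indexing
MIN_BLOCK_WORDS = 2             # control word plus at least one value


def needs_next_record(offset):
    """True when no complete block can start at `offset` within this record."""
    return offset + MIN_BLOCK_WORDS > RECORD_WORDS


at_510 = needs_next_record(510)         # block fits in words 510 and 511
at_last = needs_next_record(LAST_INDEX) # only one word left, move on
```

Comparing against 512 with a zero-based offset is the kind of off-by-one slip this commit describes, since index 512 never occurs inside a record.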

Timeseries with irregular timesteps do not conform to the requirement for setting the index.freq attribute, which results in a ValueError. This commit adds a try/except to allow the reader to handle timeseries with irregular timesteps. However, as of this commit, the model will not be able to run these timeseries. Additional effort is needed to handle timeseries with irregular timesteps.
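The try/except pattern can be sketched with pandas as below. The `try_set_freq` helper and the sample dates are hypothetical and only illustrate the behavior described above; the actual handling in readWDM may differ.

```python
import pandas as pd


def try_set_freq(index, freq):
    """Attempt to assign a frequency; return False if the index is irregular."""
    try:
        index.freq = freq
    except ValueError:
        # pandas rejects a freq that the existing timestamps do not conform to.
        return False
    return True


regular = pd.DatetimeIndex(["2021-01-01", "2021-01-02", "2021-01-03"])
irregular = pd.DatetimeIndex(["2021-01-01", "2021-01-02", "2021-01-05"])

regular_ok = try_set_freq(regular, "D")
irregular_ok = try_set_freq(irregular, "D")
```

Swallowing the error keeps the reader working, but downstream code that relies on `freq` still has to cope with it being unset.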

Renames internal functions by prepending an underscore to indicate they are private functions, as per PEP 8.

rename bits_to_date to indicate it is an internal function.

Adds deprecation warning to functions replaced by the updated WDMReader.
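One common way to attach such a warning is a small decorator built on the standard `warnings` module, sketched below. The `deprecated` decorator is hypothetical, and the body of `getfloats` is a stand-in with an invented signature; only the function names come from this PR.

```python
import functools
import warnings


def deprecated(replacement):
    """Wrap a function so that each call emits a DeprecationWarning."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated; use {replacement} instead.",
                DeprecationWarning,
                stacklevel=2,  # point the warning at the caller
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated("_process_groups")
def getfloats(values):
    # Stand-in body for illustration only.
    return [float(v) for v in values]
```

Python hides `DeprecationWarning` by default outside of `__main__` and test runners, so callers may need `-W default` or a warnings filter to see it.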

ptomasula and others added 30 commits February 19, 2021 14:43

Merge branch 'develop' into develop_readWDM

152d7ed

Adding rpo772.wdm

90e7ff7

Adding an additional WDM test file rpo772.wdm. This file contains a single timeseries with an irregular timestep.

Add WDM Programmers Guide

762b90c

From @PaulDudaRESPEC, added to new `docs` directory that we'll want to build out over time. Connects to #20 & #21.

Adding rpo772.wdm debugging files

b6fbef4

.exp & .OUT files from @jlkittle using WDMRX debugger: https://github.com/respec/FORTRAN/blob/master/lib3.0/BIN/WDMRX.EXE CSV file from @htaolimno using https://github.com/respec/BASINS/tree/master/atcWdmVb Connects to #21 & #22.

alternative implementation of process_group:

0f5a2d1

# 1. used lists to replace numpy matrix; # 2. added a loop to iterate each group and used ending date as the ending condition

tidy up readWDM

2e205f1

General refactoring to clean up code by removing old comments and slight restructuring to increase readability. Also replaces print statements for errors with raised exceptions.

Searchable WDMProgrammersGuide+Search+Nav.pdf with sidebar navigation

8b701e9

To help with #21 @ptomasula & @PaulDudaRESPEC

Upload HSPF v12.2 manual with added sidebar navigation

ef2d8df

Connected to #28

Merge pull request #32 from LimnoTech/develop

0cfd7c2

Merge updates to readUCI and GQUAL from respec/HSPsquared - develop branch

Merge branch 'develop_readWDM' of https://github.com/LimnoTech/HSPsqu…

677adaa

…ared into develop_readWDM

Remove old datetime conversion

9b19a92

I missed committing 3 line deletions which remove the old pandas.apply based datetime conversion approach.

Test10 no longer runs since readWDM time series updates

2a8373e

@ptomasula, try running either of these notebooks. Connects to issue #21 & PR #35

Datetime shift fix

2014cd0

The timeseries produced by the new version of the WDM reader are offset by one datetime step from the previous version.

Merge branch 'develop' into develop_readWDM

a094cfe

Merge branch 'develop' into develop_readWDM

0f1f7bc

TimeSeries dtypes change in readWDM

c52eaad

@ptomasula, I found the bug recently described in #21

Update HDF5_compare notebook

2d09713

for assisted value comparisons.

Merge branch 'develop' into develop_readWDM

ca50dd0

Stop datetime fix

543ea20

This commit addresses a mismatch in the 'Stop' parameter of the '/TIMESERIES/SUMMARY' table. The updated WDMreader wrote a value 1 timestep less than the end of the last group in the timeseries. See Issue #21.

Adding missing lines to stop datetime fix

ae374f1

Add two lines that were missing from the previous commit. ca50dd0

Merge branch 'develop' into develop_readWDM

0d910ed

aufdenkampe and others added 11 commits April 14, 2021 10:58

Testing outputs. Fix worked!

6e6beab

Commit c01199a by @PaulDudaRESPEC got HSP2 to run with HDF5 files from new `readWDM`. Still need to compare outputs

Sparse time series fill added back

479e6d6

Reversing b0edc39 as a step toward advancing #21

timeseries freq assignment

b9e635f

This commit assigns the freq attribute of the timeseries generated in the ReadWDM function. Without the freq assignment, the timeseries were failing to execute in some of the model modules. Reference: #21 and #21 (comment)

Update conda env & notebooks

1c450ae

renaming private function

c62adb6

Renames internal functions by prepending an underscore to indicate they are private functions, as per PEP 8.

Merge branch 'develop' into develop_readWDM2

5039cf2

rename bits_to_date

e651d08

rename bits_to_date to indicate it is an internal function.

remove unnecessary dependencies

4f4f15a

deprecation notices

12dc1d5

Adds deprecation warning to functions replaced by the updated WDMReader.

ptomasula mentioned this pull request Apr 28, 2021

Merge develop_readWDM into develop to read time series by block & group #35

Closed

ptomasula merged commit ba64128 into develop Apr 28, 2021

aufdenkampe mentioned this pull request Apr 29, 2021

Merge LimnoTech readWDM & other updates to RESPEC develop respec/HSPsquared#52

Merged

ptomasula deleted the develop_readWDM2 branch September 3, 2021 14:17

aufdenkampe mentioned this pull request Oct 1, 2021

UCI import with 15min timestep respec/HSPsquared#22

Closed
