Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge develop_readWDM into develop to read time series by block & group #37

Merged
merged 41 commits into from
Apr 28, 2021

Conversation

ptomasula
Copy link
Member

@ptomasula ptomasula commented Apr 28, 2021

This pull request replaces pull request #35. See below for additional details.

Purpose

This pull request directly addresses issue #21 Rewrite readWDM.py to read by data group & block. The update revises the readWDM function to processes timeseries in WDM files by explicitly looping over groups and processing each block. The logic implemented is based off a function in RESPEC's BASINS repository that provides similar functionality. These improvements enable the readWDM function to handle WDM files containing timeseries with irregular timesteps. The previous version of the function was unable to handle these files (see issue #40 in RESPEC's repo). The updated function also restored Numba support which dramatically decreases the processing time required for WDM files.

Testing

The testing was conducted manually by processing the test10.wdm and calleg.wdm files with both the develop and updated versions of the readWDM function. Timeseries from the resultant hdf files were then compared to ensure they matched. Additionally, the main function of the model was run to ensure the input files generated from the updated readWDM functions still allowed the model to execute.

Replacement of PR #35

During an initial merge of the develop branch into develop_readWDM a suspected clipboard error decremented all the numeric values in a section of code that was cut and pasted while resolving a merge conflict. See commit 16325d1
which attempted to revert the issue. Ultimately the reversion of the merge appeared unsuccessful because we could not then remerge in the develop branch. As a work around, we created a new branch based on commit 146036c and successfully completed the merge under the new branch.

Unaddressed Issues/Next Steps

Testing revealed that executing the main function of the model fails for timeseries with irregular timeseries. This issue was initially documented in issue #21 and will be addressed in a future updated. The readWDM function is working as intended and further discussion is needed on how to best handle timeseries with irregular timesteps within the rest of the project. This issue is now being tracked as issues #51 under RESPEC's repository.

ptomasula and others added 30 commits February 19, 2021 14:43
Adds new function 'process_groups' which replaces the 'getfloats' function.

This new function processes WDM files by blocks which consist of a control word (32bit integer) and then one or more float32 values which contain the data for the block. This approach will provide support for Timeseries records with irregular timestep intervals.
Adding an additional WDM test file rpo772.wdm. This file contains single time with an irregular timestep.
My understanding is that when working with numpy array allocating the array size up front is very important for performance, as expanding the array later essentially means copying the array to a larger array in new block of memory. With the block processing approach there is presently no way to determine the exact size of the resultant array until you read through the groups. Allocating an array of a fixed size can cause files to fail with an IndexError. A quick solution to this was to implement a chunk allocation approach which allocates numpy arrays in large 100mb chunks and only expands and array when processing the next block will exceed the boundaries of the existing array. This solution is far from perfect but at least resolves the IndexErrors. We should consult with Jack to see if there is a way to determine the number of elements prior to processing the timeseries, or alternatively consider switching to Lists instead because adding an element to a list is time constant and should perform better than this approach.
Blocks in a timeseries consist of a minimum of 2 elements, the block control word and one or more float data values. When block ends there is either another block, the end of the group, or the end of the record (meaning we'd go to the next record in the chain). Some of our test WDM files have a block that ends on the 511th element of a record. In these example files the 512th element also did not parse to a valid block control word. At this time I'm not sure what the significance of the 512th elements when a block end, but processing those elements as a control word causes a series of errors which will throw off the accuracy of all subsequent blocks processed. We'll need to confirm the significance of these elements with Jack, but for now I implemented logic to skip to the next record if the end of a block fails the 511th element of the record.
From @PaulDudaRESPEC, added to new `docs` directory that we'll want to build out over time. Connects to #20 & #21.
# 1. used lists to replace numpy matrix;
# 2. added a loop to iterate each group and used ending date as the ending condition
The rewrite to process wdm files as groups led to the deprecation of the getfloats and adjustNval functions. Additionally Hua refactored the original process_groups method in a previous commit as process_groups2. The original process_groups method was removed and process_groups2 was renamed to process_group.
A single leading underscore is one of the methods used in python (see PEP8) to denote internal classes and functions. Internal functions come with no guarantee of backward compatibility. We want to update the naming of supporting functions that we do not the public to interface with.
General refactoring to cleanup code by removing old comments and slight restructuring to increase readability. Also replace print statements for error with raise exceptions.
Merge updates to readUCI and GQUAL from respec/HSPsquared - develop branch
The constraints of Numba meant that datetime conversion cannot occur with the main block processing loop (_process_groups function). This commit replaces the previous use of python datetime object for a bit approach in which date components are stored in a single integer who individual bits can be parsed into the time step components. Conversion to a datetime object now occurs outside of the processing loop prior to output to HDF.
The datetime functions added to support numba in commit e5d64a1 required that integers input into these functions are 64bits or year information will be lost during bit manipulations. The previous implementation left integer type up to numba and in some instances could produce a in32 object. This commit causes integer conversion to be explicitly int64 so that year information is not lost.
Even with datetime conversions removed from the group processing loop, the conversion time using datetime.datetime() remains slow. After trying attempts using some datetime conversion approaches with pandas I was still unable to achieve a significant performance boost.

Numba does not support the creation of datetime objects, however it does support datetime arithmetic. This commit adds in a numba compatible datetime conversion function which calculates a dates offset from the epoch and adds the appropriate timedelta64 objects to return a datetime64 object.
I missed committing 3 line deletions which remove the old pandas.apply based datetime conversion approach.
@ptomasula, try running either of these notebooks.
Connects to issue #21 & PR #35
The timeseries produced by the new version of the WMD reader are offset by one datetime step from the previous version.
When the transform is unable to detect the timestep of the input timeseries, there was a bit of code that performed a reindex and fillna using the 'tbase' of the siminfo. However this key does not appear in that dictionary. Additionally our initial test case of test 10b does not encounter this bit of code. We'll revisit the necessity of this during our IO abstraction but commenting it out for now.
@ptomasula, I found the bug recently described in #21
for assisted value comparisons.
This commit addresses a mismatch in the 'Stop' parameter of the '/TIMESERIES/SUMMARY' table. The updated WDMreader wrote a value 1 timestep less than the end of the last group in the timeseries. See Issue #21.
Add two lines that were missing from the previous commit.
ca50dd0
aufdenkampe and others added 11 commits April 14, 2021 10:58
Commit  c01199a by @PaulDudaRESPEC got HSP2 to run with HDF5 files from new `readWDM`.
Still need to compare outputs
Reversing b0edc39 as a step toward advancing #21
This commit assigned freq attribute of the timeseries generated in the ReadWDM
function. Without the freq assignment, the timeseries were failing to execute in some of the model modules.

Reference: #21 and
#21 (comment)
This fixes a similar parsing on invalid block control word issue that we saw previously. The 'offset' variable keeps track of where on a given record the loop is and a block must be at least 2 words long. When at the final (512th) index of the record and attempting to process the next block we must first go to the next record in the timeseries. See line 307.

However, we've used python indexing for the offset variable which starts at 0. The last index of record should therefor 511 and not the 512.

Reference: #21 (comment)
Timeseries with irregular timestep do not conform to the requirement for setting the index.freq argument. This results in a value error. This commit adds a try except to allow for the reader to handle timeseries with irregular timesteps. However as of this commit, the model with not be able to run these timeseries. Additional effort is needed to handle timesteps with irregular timesteps.
Renames internal functions with prepending underscore to indicate they are private functions as per PEP8.
rename bits_to_date to indicate it is an internal function.
Adds deprecation warning to functions replaced by the updated WDMReader.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants