-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce TSV Sizes #877
Reduce TSV Sizes #877
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Nice job updating all related tests, documentation, etc. Just one minor bug to address.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have the numbers on how much this reduces total file size?
float_precision = line[j].split('.')[1].length() | ||
rescue NoMethodError | ||
# Catch non floats | ||
raise "ERROR: Incorrect non float found in '#{name}', line '#{i}'." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want to remove this check, do we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@joseph-robertson: I had to think about this for a second, but in the new format integers are acceptable. An example is our mapping characteristics (ex: State is dependant on County and results in a bunch of 1s and 0s).
There is still a sum to 1.0 and positivity check, which should check for nans and odd values.
# Check for scientific format | ||
if (line[j].include?('e-') || line[j].include?('e+') || | ||
line[j].include?('E-') || line[j].include?('E+')) | ||
raise "ERROR: Scientific notation found in '#{name}', line '#{i}'." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove this check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@joseph-robertson: The %g
format allows for scientific notation. I actually think this is a good thing as it handles smaller values really well and helps with truncation errors.
@joseph-robertson: Good suggestion. I updated the PR description. Project national reduced by about 50%. |
d68fa42
to
7587bb6
Compare
Pull Request Description
Companion PR: resstock-estimation #200
As housing characteristics get bigger, we are moving to a compact writing style. The 6-digit float format requirement was put in place when most of the characteristics were not scripted. Now that there is a standard workflow for creating the characteristics, the formatting requirements are being relaxed. This change should allow for more accurate characteristic distributions because the exponential format can be used. The change should also stop round-off errors.
The change is mainly from
%.6f
to%g
inpandas.to_csv(...)
in the resstock-estimation repository.Some CI tests were removed in the integrity checks that related to formatting checks in the characteristics.
Housing Characteristic size reduction
Project National
Project Testing
Checklist
Not all may apply:
[ ]openstudio tasks.rb update_measures
has been run