Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce TSV Sizes #877

Merged
merged 9 commits into from
Apr 6, 2022
Merged

Reduce TSV Sizes #877

merged 9 commits into from
Apr 6, 2022

Conversation

afontani
Copy link
Contributor

@afontani afontani commented Apr 5, 2022

Pull Request Description

Companion PR: resstock-estimation #200

As housing characteristics get bigger, we are moving to a compact writing style. The 6-digit float format requirement was put in place when most of the characteristics were not scripted. Now that there is a standard workflow for creating the characteristics, the formatting requirements are being relaxed. This change should allow for more accurate characteristic distributions because the exponential format can be used. The change should also stop round-off errors.

The change is mainly from %.6f to %g in pandas.to_csv(...) in the resstock-estimation repository.

Some CI tests were removed in the integrity checks that related to formatting checks in the characteristics.

Housing Characteristic size reduction

Project National

  • Before Change: 103MB
  • After Change: 46MB

Project Testing

  • Before Change: 648KB
  • After Change: 560KB

Checklist

Not all may apply:

  • Tests (and test files) have been updated
  • Documentation has been updated
  • Changelog has been updated
    [ ] openstudio tasks.rb update_measures has been run
  • No unexpected integration/regression test changes

@afontani afontani self-assigned this Apr 5, 2022
@afontani afontani added this to the ResStock v2.6.0 Beta milestone Apr 5, 2022
Copy link
Contributor

@shorowit shorowit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Nice job updating all related tests, documentation, etc. Just one minor bug to address.

test/integrity_checks.rb Outdated Show resolved Hide resolved
Copy link
Contributor

@joseph-robertson joseph-robertson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have the numbers on how much this reduces total file size?

float_precision = line[j].split('.')[1].length()
rescue NoMethodError
# Catch non floats
raise "ERROR: Incorrect non float found in '#{name}', line '#{i}'."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to remove this check, do we?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joseph-robertson: I had to think about this for a second, but in the new format integers are acceptable. An example is our mapping characteristics (ex: State is dependant on County and results in a bunch of 1s and 0s).

There is still a sum to 1.0 and positivity check, which should check for nans and odd values.

# Check for scientific format
if (line[j].include?('e-') || line[j].include?('e+') ||
line[j].include?('E-') || line[j].include?('E+'))
raise "ERROR: Scientific notation found in '#{name}', line '#{i}'."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove this check?

Copy link
Contributor Author

@afontani afontani Apr 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joseph-robertson: The %g format allows for scientific notation. I actually think this is a good thing as it handles smaller values really well and helps with truncation errors.

@afontani
Copy link
Contributor Author

afontani commented Apr 5, 2022

Do you have the numbers on how much this reduces total file size?

@joseph-robertson: Good suggestion. I updated the PR description. Project national reduced by about 50%.

@afontani afontani merged commit ee49ad6 into develop Apr 6, 2022
@afontani afontani deleted the reduce_tsv_sizes branch April 6, 2022 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants