Create new pds-deep-archive program and improve performance #26

Merged: jordanpadams merged 3 commits into master from pds.14-i13+i21 on Apr 11, 2020

Conversation

nutjob4life
Member

- Resolve #21 with a new driver program `aipsip` that generates the AIP and then uses it to make the SIP, leaving everything in the current working directory (along with two, count 'em, *two*, PDS labels for the price of one!).
    - Update the Python `setuptools` metadata to generate the new `aipsip` (helps with #21).
    - Refactor logging and command-line argument setup (also for #21).
- Unify logging between `aipgen` and `sipgen` with the new `aipsip` so that all of them have `--debug` and `--quiet` options; without either you get a nominal amount of "hand-holding" output.
- Resolve #13 so that instead of billions of redundant XML parses and XPath lookups we use a local `sqlite3` database and LRU caching.
    - Factor out XML parsing from `aipgen` and `sipgen` so we can apply caching.
    - Clear up logging messages so we can tell what's calling what.
    - Create a temp DB in `sipgen` and populate it with mappings from lidvids to XML files for rapid lookups.
        - But see also #25 for other uses of that DB.
- Add standardized `--version` arguments for all three programs.
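
For readers who want a picture of the #13 approach described in the list above, here is a minimal sketch, not the code in this PR. It assumes labels are `*.xml` files carrying a `logical_identifier` and `version_id` in the PDS4 namespace; the function names, table layout, and cache size are all hypothetical.

```python
import functools
import sqlite3
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumption: labels use the PDS4 common namespace.
PDS_NS = {"pds": "http://pds.nasa.gov/pds4/pds/v1"}


@functools.lru_cache(maxsize=128)  # hypothetical cache size
def parse_label(path):
    """Parse a label once; repeat calls for the same path are served from the LRU cache."""
    return ET.parse(path).getroot()


def lidvid_of(path):
    """Return 'lid::vid' for a label, or None if either part is missing."""
    root = parse_label(path)
    lid = root.findtext(".//pds:Identification_Area/pds:logical_identifier", namespaces=PDS_NS)
    vid = root.findtext(".//pds:Identification_Area/pds:version_id", namespaces=PDS_NS)
    return f"{lid.strip()}::{vid.strip()}" if lid and vid else None


def build_index(label_dir):
    """Create a temporary sqlite3 DB mapping lidvids to label file paths."""
    db_path = Path(tempfile.mkdtemp()) / "labels.sqlite3"
    con = sqlite3.connect(str(db_path))
    con.execute("CREATE TABLE labels (lidvid TEXT PRIMARY KEY, path TEXT NOT NULL)")
    with con:  # commit all inserts as one transaction
        for xml_file in sorted(Path(label_dir).rglob("*.xml")):
            lidvid = lidvid_of(str(xml_file))
            if lidvid:
                con.execute(
                    "INSERT OR REPLACE INTO labels (lidvid, path) VALUES (?, ?)",
                    (lidvid, str(xml_file)),
                )
    return con


def find_label(con, lidvid):
    """Look up the label path for a lidvid via the primary-key index."""
    row = con.execute("SELECT path FROM labels WHERE lidvid = ?", (lidvid,)).fetchone()
    return row[0] if row else None
```

The idea is that each label is parsed at most once while the index is built, and every later lidvid-to-file lookup is a single indexed query rather than another pass over the XML.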

With these changes, `sipgen` on my Mac¹ processes a 272 GiB `insight_cameras` export in 1:03. On `pdsimg-int1`², it handles the 1.5 TiB `insight_cameras` dataset in under 4 hours.

Footnotes:

- ¹2.4 GHz 8-core Intel Core i9, SSD
- ²2.3 GHz 8-core Intel Xeon Gold 6140, unknown drive
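
The shared `--debug`, `--quiet`, and `--version` options mentioned in the description could be wired up roughly as follows. This is an illustrative sketch using the standard `argparse` and `logging` modules, not the PR's actual code: `add_standard_arguments`, the placeholder `__version__`, the chosen log levels, and making the two flags mutually exclusive are all assumptions.

```python
import argparse
import logging

__version__ = "0.0.0"  # placeholder; the real value would come from package metadata


def add_standard_arguments(parser):
    """Attach the logging and version options every program would share."""
    parser.add_argument("--version", action="version", version=f"%(prog)s {__version__}")
    group = parser.add_mutually_exclusive_group()
    group.add_argument(
        "--debug", action="store_const", dest="loglevel",
        const=logging.DEBUG, default=logging.INFO,
        help="log copious debugging messages",
    )
    group.add_argument(
        "--quiet", action="store_const", dest="loglevel",
        const=logging.WARNING, default=logging.INFO,
        help="log only warnings and errors",
    )


def main(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical driver program")
    add_standard_arguments(parser)
    args = parser.parse_args(argv)
    # With neither flag, INFO gives the nominal "hand-holding" level of output.
    logging.basicConfig(level=args.loglevel, format="%(levelname)s %(message)s")
    logging.getLogger(__name__).info("starting up")
    return args
```

Keeping the option setup in one function is what would let several entry points share identical flags.
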
@nutjob4life
Member Author

Again, you can ignore the Packaging step of the check failing; it's because it ran earlier today, the file is date-stamped, and the test PyPI doesn't allow duplicate files to be uploaded.

@jordanpadams
Member

@nutjob4life could we change this to datetime? Or is that too much of a pain?

* After running validate, there were a few minor fixes that needed to be implemented.
* Commented out / removed several CLI options for the time being until functionality is fully developed.
* Updated file naming to take into account bundle versioning separate from the AIP/SIP version.
* Updated docs for the new pds-deep-archive script, which combines aipgen and sipgen.

Refs #21
@jordanpadams changed the title from "Resolve #13 and #21" to "Create new pds-deep-archive program and improve performance" on Apr 11, 2020
@jordanpadams merged commit 43d0f66 into master on Apr 11, 2020
@jordanpadams deleted the pds.14-i13+i21 branch on April 11, 2020 at 23:05
Successfully merging this pull request may close these issues.

- Develop one script to run both AIP and SIP generator
- Improve SIP Gen performance for very large data sets