Create new pds-deep-archive program and improve performance (#26)
* Resolutions for #13 and #21

- Resolve #21 with a new driver program `aipsip` that generates both the AIP and uses it to make the SIP as well, leaving all in the current working directory (along with two—count 'em, *two*—PDS labels for the price of one!).
    - Updates the Python `setuptools` metadata to generate the new `aipsip` (helps with #21).
    - Refactors logging and command-line argument setup (also for #21).
- Unifies logging between `aipgen` and `sipgen` with the new `aipsip` so that there are `--debug` and `--quiet` options; without either you get a nominal amount of "hand-holding" of output.
- Resolve #13 so that instead of billions of redundant XML parsing and XPath lookups we use a local `sqlite3` database and LRU caching.
    - Factor out XML parsing from `aipgen` and `sipgen` so we can apply caching.
    - Clear up logging messages so we can know what's calling what.
    - Create a temp DB in `sipgen` and populate it with mappings from lidvids to XML files for rapid lookups
        - But see also #25 for other uses of that DB.
- Add standardized `--version` arguments for all three programs.
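The shared `--debug`, `--quiet`, and `--version` options described above could be wired up roughly as follows. This is a minimal sketch, not the code from this commit: the helper name `build_parser` is hypothetical, and mapping `--quiet` to `logging.WARNING` is an assumption about what "don't log informational messages" means.

```python
import argparse
import logging


def build_parser(prog, version):
    """Build a parser carrying the logging and version options shared by all programs.

    Hypothetical helper; the real refactoring in this commit may be organized differently.
    """
    parser = argparse.ArgumentParser(prog=prog)
    parser.add_argument('--version', action='version', version=f'%(prog)s {version}')
    # With neither flag the level stays at INFO, the nominal "hand-holding" output.
    parser.add_argument('-d', '--debug', action='store_const', dest='loglevel',
                        const=logging.DEBUG, default=logging.INFO,
                        help='Log debugging messages for developers')
    # Assumption: "quiet" means warnings and errors only.
    parser.add_argument('-q', '--quiet', action='store_const', dest='loglevel',
                        const=logging.WARNING, default=logging.INFO,
                        help="Don't log informational messages")
    return parser


def configure_logging(args):
    """Apply the level selected on the command line to the root logger."""
    logging.basicConfig(level=args.loglevel, format='%(levelname)s %(message)s')
```

Each program would then call `build_parser` with its own name, add its specific arguments, and pass the parsed namespace to `configure_logging`.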

With these changes, running `sipgen` on my Mac¹ can process a 272GiB `insight_cameras` export in 1:03. On `pdsimg-int1`², it handles the 1.5TiB `insight_cameras` dataset in under 4 hours.

Footnotes:

- ¹2.4 GHz 8-core Intel Core i9, SSD
- ²2.3 GHz 8-core Intel Xeon Gold 6140, unknown drive
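The speedup comes from looking up lidvids in a local `sqlite3` database and caching parsed XML instead of re-parsing it. The sketch below illustrates that idea under stated assumptions: the table layout, the function names, the in-memory database, and the use of the standard library's `xml.etree.ElementTree` are all illustrative choices, not the code in this commit.

```python
import functools
import sqlite3
import xml.etree.ElementTree as ET


def create_lidvid_db(mappings):
    """Create a throwaway DB mapping each lidvid to the XML file that declares it.

    Assumption: an in-memory database stands in for the temp DB mentioned above.
    """
    con = sqlite3.connect(':memory:')
    con.execute('CREATE TABLE labels (lidvid TEXT PRIMARY KEY, xmlfile TEXT NOT NULL)')
    con.executemany('INSERT INTO labels (lidvid, xmlfile) VALUES (?, ?)', mappings)
    con.commit()
    return con


def find_label(con, lidvid):
    """Return the XML file path recorded for a lidvid, or None if it is unknown."""
    row = con.execute('SELECT xmlfile FROM labels WHERE lidvid = ?', (lidvid,)).fetchone()
    return row[0] if row else None


@functools.lru_cache(maxsize=128)
def parse_xml(path):
    """Parse an XML file; repeated calls with the same path reuse the cached tree."""
    return ET.parse(path)
```

The primary-key index makes each lidvid lookup a single indexed query, and the LRU cache bounds memory while avoiding repeated parses of labels that are referenced many times.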

* Improvements for usability and bug fixes for validate errors

* After running validate, there were a few minor fixes that needed to be implemented.
* Commented out / removed several CLI options for the time being until functionality is fully developed.
* Updated file naming to take into account bundle versioning separate from the AIP/SIP version
* Updated docs per new pds-deep-archive script which combines aipgen and sipgen.

Refs #21

Co-authored-by: Jordan Padams <jordan.h.padams@jpl.nasa.gov>
nutjob4life and jordanpadams committed Apr 11, 2020
1 parent 2207dcf commit 43d0f66
Showing 9 changed files with 412 additions and 343 deletions.
97 changes: 4 additions & 93 deletions README.rst
@@ -10,6 +10,8 @@ Archival Information System (OAIS_) standards.
Features
========

• Provides an executable Python script ``pds-deep-archive``. Run ``pds-deep-archive --help`` for
more details.
• Provides an executable Python script ``aipgen``. Run ``aipgen --help`` for
more details.
• Provides an executable Python script ``sipgen``. Run ``sipgen --help`` for
@@ -42,6 +44,7 @@ well as ``libxsl2`` 1.1.28 or later.
4. You should now be able to run the deep archive utilities::

(pds-deep-archive) bash> pds-deep-archive --help
(pds-deep-archive) bash> aipgen --help
(pds-deep-archive) bash> sipgen --help

@@ -63,102 +66,10 @@ To build the software for distribution:
3. A tar.gz should now be available in the ``dist/`` directory for distribution.


Usage
=====

1. If not already activated, activate your virtualenv::

bash> $HOME/.virtualenvs/pds-deep-archive/bin/activate
(pds-deep-archive) bash>

2. Then you can run aipgen. Here's a basic example using data in the test directory::

(pds-deep-archive) bash> aipgen test/data/ladee_test/ladee_mission_bundle/LADEE_Bundle_1101.xml
INFO 🏃‍♀️ Starting AIP generation for test/data/ladee_test/ladee_mission_bundle/LADEE_Bundle_1101.xml
INFO 🧾 Writing checksum manifest for /Users/kelly/Documents/Clients/JPL/PDS/Development/pds-deep-archive/test/data/ladee_test/ladee_mission_bundle to ladee_mission_bundle_checksum_manifest_v1.0.tab
INFO 🚢 Writing transfer manifest for /Users/kelly/Documents/Clients/JPL/PDS/Development/pds-deep-archive/test/data/ladee_test/ladee_mission_bundle to ladee_mission_bundle_transfer_manifest_v1.0.tab
INFO 🏷 Writing AIP label to ladee_mission_bundle_aip_v1.0.xml
INFO 🎉 Success! All done, files generated:
INFO • Checksum manifest: ladee_mission_bundle_checksum_manifest_v1.0.tab
INFO • Transfer manifest: ladee_mission_bundle_transfer_manifest_v1.0.tab
INFO • XML label: ladee_mission_bundle_aip_v1.0.xml
INFO 👋 Thanks for using this program! Bye!

3. You can also run sipgen. Here is a basic usage example using data in the test directory::

(pds-deep-archive) bash> sipgen -c ladee_mission_bundle_checksum_manifest_v1.0.tab -s PDS_ATM -n -b https://atmos.nmsu.edu/PDS/data/PDS4/LADEE/ test/data/ladee_test/ladee_mission_bundle/LADEE_Bundle_1101.xml
⚙︎ ``sipgen`` — Submission Information Package (SIP) Generator, version 0.0.0
🎉 Success! From test/data/ladee_test/ladee_mission_bundle/LADEE_Bundle_1101.xml, generated these output files:
• Manifest: ladee_mission_bundle_sip_v1.0.tab
• Label: ladee_mission_bundle_sip_v1.0.xml

Note how the checksum manifest from ``aipgen`` was the input to ``-c`` in
``sipgen``.

Full usage from the ``--help`` flag to ``aipgen``::

usage: aipgen [-h] [-v] IN-BUNDLE.XML

Generate an Archive Information Package or AIP. An AIP consists of three
files: ➀ a "checksum manifest" which contains MD5 hashes of *all* files in a
product; ➁ a "transfer manifest" which lists the "lidvids" for files within
each XML label mentioned in a product; and ➂ an XML label for these two files.
You can use the checksum manifest file ➀ as input to ``sipgen`` in order to
create a Submission Information Package.

positional arguments:
IN-BUNDLE.XML Root bundle XML file to read

optional arguments:
-h, --help show this help message and exit
-v, --verbose Verbose logging; defaults False

And usage from the ``--help`` flag for ``sipgen``::

usage: sipgen [-h] [-a {MD5,SHA-1,SHA-256}] -s
{PDS_ATM,PDS_ENG,PDS_GEO,PDS_IMG,PDS_JPL,PDS_NAI,PDS_PPI,PDS_PSI,PDS_RNG,PDS_SBN}
[-u URL | -n] [-k] [-c AIP-CHECKSUM-MANIFEST.TAB]
[-b BUNDLE_BASE_URL] [-v] [-i PDS4_INFORMATION_MODEL_VERSION]
IN-BUNDLE.XML

Generate Submission Information Packages (SIPs) from bundles. This program
takes a bundle XML file as input and produces two output files: ① A Submission
Information Package (SIP) manifest file; and ② A PDS XML label of that file.
The files are created in the current working directory when this program is
run. The names of the files are based on the logical identifier found in the
bundle file, and any existing files are overwritten. The names of the
generated files are printed upon successful completion.

positional arguments:
IN-BUNDLE.XML Bundle XML file to read

optional arguments:
-h, --help show this help message and exit
-a {MD5,SHA-1,SHA-256}, --algorithm {MD5,SHA-1,SHA-256}
File hash (checksum) algorithm; default MD5
-s {PDS_ATM,PDS_ENG,PDS_GEO,PDS_IMG,PDS_JPL,PDS_NAI,PDS_PPI,PDS_PSI,PDS_RNG,PDS_SBN}, --site {PDS_ATM,PDS_ENG,PDS_GEO,PDS_IMG,PDS_JPL,PDS_NAI,PDS_PPI,PDS_PSI,PDS_RNG,PDS_SBN}
Provider site ID for the manifest's label; default
None
-u URL, --url URL URL to the registry service; default https://pds-dev-
el7.jpl.nasa.gov/services/registry/pds
-n, --offline Run offline, scanning bundle directory for matching
files instead of querying registry service
-k, --insecure Ignore SSL/TLS security issues; default False
-c AIP-CHECKSUM-MANIFEST.TAB, --aip AIP-CHECKSUM-MANIFEST.TAB
Archive Information Product checksum manifest file
-b BUNDLE_BASE_URL, --bundle-base-url BUNDLE_BASE_URL
Base URL prepended to URLs in the generated manifest
for local files in "offline" mode
-v, --verbose Verbose logging; defaults False
-i PDS4_INFORMATION_MODEL_VERSION, --pds4-information-model-version PDS4_INFORMATION_MODEL_VERSION
Specify PDS4 Information Model version to generate
SIP. Must be 1.13.0.0+; default 1.13.0.0


Documentation
=============

Additional documentation is available in the ``docs`` directory and also TBD.
Installation and Usage information can be found in the documentation online at https://nasa-pds-incubator.github.io/pds-deep-archive/ or the latest version is maintained under the ``docs`` directory.



4 changes: 2 additions & 2 deletions docs/source/development/index.rst
@@ -9,8 +9,8 @@ build it out::
python3 bootstrap.py
bin/buildout

At this point, you'll have the ``aipgen`` and ``sipgen`` programs ready to run
as ``bin/aipgen`` and ``bin/sipgen`` that's set up to use source Python code
At this point, you'll have the ``pds-deep-archive``, ``aipgen``, and ``sipgen`` programs ready to run
as ``bin/pds-deep-archive``, ``bin/aipgen``, and ``bin/sipgen`` that are set up to use source Python code
under ``src``. Changes you make to the code are reflected in ``bin/sipgen``
immediately.

145 changes: 69 additions & 76 deletions docs/source/usage/index.rst
@@ -1,81 +1,77 @@
🏃‍♀️ Usage
===========

This package provides two executables, ``aipgen`` that generats Archive
Information Packages; and ``sipgen``, that generates Submission Information
Package (SIP)—both from PDS bundles.

Running ``aipgen --help`` or ``sipgen --help`` will give a summary of the
This package provides one primary executable, ``pds-deep-archive``, that generates both
an Archive Information Package (AIP) and a Submission Information Package (SIP). The
SIP is what the PDS delivers to the NASA Space Science Data Coordinated Archive (NSSDCA).
For more information about the products produced, see the following references:
* OAIS Information - http://www.oais.info/
* AIP Information - https://www.iasa-web.org/tc04/archival-information-package-aip
* SIP Information - https://www.iasa-web.org/tc04/submission-information-package-sip

This package also comes with the two sub-components of ``pds-deep-archive`` that can be run
individually:
* ``aipgen`` that generates Archive Information Packages from a PDS4 bundle
* ``sipgen`` that generates Submission Information Packages from a PDS4 bundle

Running ``pds-deep-archive --help`` will give a summary of the
command-line invocation, its required arguments, and any options that refine
the behavior. For example, to create an AIP from the LADEE 1101 bundle in
``test/data/ladee_test/ladee_mission_bundle/LADEE_Bundle_1101.xml`` run::
``test/data/ladee_test/mission_bundle/LADEE_Bundle_1101.xml`` run::

aipgen test/data/ladee_test/ladee_mission_bundle/LADEE_Bundle_1101.xml
aipgen test/data/ladee_test/mission_bundle/LADEE_Bundle_1101.xml

The program will print::

INFO 🏃‍♀️ Starting AIP generation for test/data/ladee_test/ladee_mission_bundle/LADEE_Bundle_1101.xml
INFO 🧾 Writing checksum manifest for /Users/kelly/Documents/Clients/JPL/PDS/Development/pds-deep-archive/test/data/ladee_test/ladee_mission_bundle to ladee_mission_bundle_checksum_manifest_v1.0.tab
INFO 🚢 Writing transfer manifest for /Users/kelly/Documents/Clients/JPL/PDS/Development/pds-deep-archive/test/data/ladee_test/ladee_mission_bundle to ladee_mission_bundle_transfer_manifest_v1.0.tab
INFO 🏷 Writing AIP label to ladee_mission_bundle_aip_v1.0.xml
INFO 🎉 Success! All done, files generated:
INFO • Checksum manifest: ladee_mission_bundle_checksum_manifest_v1.0.tab
INFO • Transfer manifest: ladee_mission_bundle_transfer_manifest_v1.0.tab
INFO • XML label: ladee_mission_bundle_aip_v1.0.xml
INFO 👋 Thanks for using this program! Bye!

This creates three output files in the current directory as part of the AIP:
INFO 👟 PDS Deep Archive, version 0.0.0
INFO 🏃‍♀️ Starting AIP generation for test/data/ladee_test/mission_bundle/LADEE_Bundle_1101.xml

• ``ladee_mission_bundle_checksum_manifest_v1.0.tab``, the checksum manifest
• ``ladee_mission_bundle_transfer_manifest_v1.0.tab``, the transfer manifest
• ``ladee_mission_bundle_aip_v1.0.xml``, the label for these two files
INFO 🎉 Success! AIP done, files generated:
INFO • Checksum manifest: ladee_mission_bundle_v1.0_checksum_manifest_v1.0.tab
INFO • Transfer manifest: ladee_mission_bundle_v1.0_transfer_manifest_v1.0.tab
INFO • XML label for them both: ladee_mission_bundle_v1.0_aip_v1.0.xml

The checkum manifest may then be fed into ``sipgen`` to create the SIP::
INFO 🏃‍♀️ Starting SIP generation for test/data/ladee_test/mission_bundle/LADEE_Bundle_1101.xml

sipgen --aip ladee_mission_bundle_checksum_manifest_v1.0.tab ladee_mission_bundle_checksum_manifest_v1.0.tab --s PDS_ATM --offline --bundle-base-url https://atmos.nmsu.edu/PDS/data/PDS4/LADEE/ test/data/ladee_test/ladee_mission_bundle/LADEE_Bundle_1101.xml
INFO 🎉 Success! From /Users/jpadams/Documents/proj/pds/pdsen/workspace/pds-deep-archive/test/data/ladee_test/mission_bundle/LADEE_Bundle_1101.xml, generated these output files:
INFO • SIP Manifest: ladee_mission_bundle_v1.0_sip_v1.0.tab
INFO • XML label for the SIP: ladee_mission_bundle_v1.0_sip_v1.0.xml

This program will print::
INFO 👋 That's it! Thanks for making an AIP and SIP with us today. Bye!

⚙︎ ``sipgen`` — Submission Information Package (SIP) Generator, version 0.0.0
🎉 Success! From test/data/ladee_test/ladee_mission_bundle/LADEE_Bundle_1101.xml, generated these output files:
• Manifest: ladee_mission_bundle_sip_v1.0.tab
• Label: ladee_mission_bundle_sip_v1.0.xml
This creates 5 output files in the current directory as part of the AIP and SIP Generation:

And two new files will appear in the current directory:
• ``ladee_mission_bundle_v1.0_checksum_manifest_v1.0.tab``, the checksum manifest
• ``ladee_mission_bundle_v1.0_transfer_manifest_v1.0.tab``, the transfer manifest
• ``ladee_mission_bundle_v1.0_aip_v1.0.xml``, the label for these two files

• ``ladee_mission_bundle_sip_v1.0.tab``, the created SIP manifest as a
• ``ladee_mission_bundle_v1.0_sip_v1.0.tab``, the created SIP manifest as a
tab-separated values file.
• ``ladee_mission_bundle_sip_v1.0.xml``, an PDS label for the SIP file.

For reference, the full "usage" message from ``aipgen`` is::

usage: aipgen [-h] [-v] IN-BUNDLE.XML

Generate an Archive Information Package or AIP. An AIP consists of three
files: ➀ a "checksum manifest" which contains MD5 hashes of *all* files in a
product; ➁ a "transfer manifest" which lists the "lidvids" for files within
each XML label mentioned in a product; and ➂ an XML label for these two files.
You can use the checksum manifest file ➀ as input to ``sipgen`` in order to
create a Submission Information Package.

positional arguments:
IN-BUNDLE.XML Root bundle XML file to read

optional arguments:
-h, --help show this help message and exit
-v, --verbose Verbose logging; defaults False

For reference, the full "usage" message from ``sipgen`` follows::

usage: sipgen [-h] [-a {MD5,SHA-1,SHA-256}] -s
{PDS_ATM,PDS_ENG,PDS_GEO,PDS_IMG,PDS_JPL,PDS_NAI,PDS_PPI,PDS_PSI,PDS_RNG,PDS_SBN}
[-u URL | -n] [-k] [-c AIP-CHECKSUM-MANIFEST.TAB]
[-b BUNDLE_BASE_URL] [-v] [-i PDS4_INFORMATION_MODEL_VERSION]
IN-BUNDLE.XML
• ``ladee_mission_bundle_v1.0_sip_v1.0.xml``, a PDS label for the SIP file.
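The five file names listed above follow a visible pattern: a base name, the bundle version, the kind of file, and the AIP/SIP version. The following sketch reproduces that pattern. The function name `output_names` is hypothetical, and deriving the base name from the last colon-separated segment of the logical identifier is an assumption rather than the documented rule.

```python
def output_names(bundle_lid, bundle_version, package_version='1.0'):
    """Return the five output file names for a bundle, keyed by file kind.

    Assumption: the base name is the final segment of the logical identifier.
    """
    base = bundle_lid.split(':')[-1]
    # The bundle version is kept separate from the AIP/SIP version suffix.
    prefix = f'{base}_v{bundle_version}'
    return {
        'checksum_manifest': f'{prefix}_checksum_manifest_v{package_version}.tab',
        'transfer_manifest': f'{prefix}_transfer_manifest_v{package_version}.tab',
        'aip_label': f'{prefix}_aip_v{package_version}.xml',
        'sip_manifest': f'{prefix}_sip_v{package_version}.tab',
        'sip_label': f'{prefix}_sip_v{package_version}.xml',
    }
```

Keeping the two version numbers as separate parameters means a new AIP/SIP revision of the same bundle version changes only the suffix.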

For reference, the full "usage" message from ``pds-deep-archive`` is::

$ pds-deep-archive --help
usage: pds-deep-archive [-h] [--version] -s
{PDS_ATM,PDS_ENG,PDS_GEO,PDS_IMG,PDS_JPL,PDS_NAI,PDS_PPI,PDS_PSI,PDS_RNG,PDS_SBN}
[-n] -b BUNDLE_BASE_URL [-d] [-q]
IN-BUNDLE.XML

Generate an Archive Information Package (AIP) and a Submission Information
Package (SIP). This creates three files for the AIP in the current directory
(overwriting them if they already exist):
➀ a "checksum manifest" which contains MD5 hashes of *all* files in a product
➁ a "transfer manifest" which lists the "lidvids" for files within each XML
label mentioned in a product
➂ an XML label for these two files.

It also creates two files for the SIP (also overwriting them if they exist):
① A "SIP manifest" file; and an XML label of that file too. The names of
the generated files are based on the logical identifier found in the
bundle file, and any existing files are overwritten. The names of the
generated files are printed upon successful completion.
② A PDS XML label of that file.

Generate Submission Information Packages (SIPs) from bundles. This program
takes a bundle XML file as input and produces two output files: ① A Submission
Information Package (SIP) manifest file; and ② A PDS XML label of that file.
The files are created in the current working directory when this program is
run. The names of the files are based on the logical identifier found in the
bundle file, and any existing files are overwritten. The names of the
@@ -86,22 +82,19 @@ For reference, the full "usage" message from ``sipgen`` follows::

optional arguments:
-h, --help show this help message and exit
-a {MD5,SHA-1,SHA-256}, --algorithm {MD5,SHA-1,SHA-256}
File hash (checksum) algorithm; default MD5
--version show program's version number and exit
-s {PDS_ATM,PDS_ENG,PDS_GEO,PDS_IMG,PDS_JPL,PDS_NAI,PDS_PPI,PDS_PSI,PDS_RNG,PDS_SBN}, --site {PDS_ATM,PDS_ENG,PDS_GEO,PDS_IMG,PDS_JPL,PDS_NAI,PDS_PPI,PDS_PSI,PDS_RNG,PDS_SBN}
Provider site ID for the manifest's label; default
None
-u URL, --url URL URL to the registry service; default https://pds-dev-
el7.jpl.nasa.gov/services/registry/pds
Provider site ID for the manifest's label
-n, --offline Run offline, scanning bundle directory for matching
files instead of querying registry service
-k, --insecure Ignore SSL/TLS security issues; default False
-c AIP-CHECKSUM-MANIFEST.TAB, --aip AIP-CHECKSUM-MANIFEST.TAB
Archive Information Product checksum manifest file
files instead of querying registry service. NOTE: By
default, set to True until online mode is available.
-b BUNDLE_BASE_URL, --bundle-base-url BUNDLE_BASE_URL
Base URL prepended to URLs in the generated manifest
for local files in "offline" mode
-v, --verbose Verbose logging; defaults False
-i PDS4_INFORMATION_MODEL_VERSION, --pds4-information-model-version PDS4_INFORMATION_MODEL_VERSION
Specify PDS4 Information Model version to generate
SIP. Must be 1.13.0.0+; default 1.13.0.0
Base URL for Node data archive. This URL will be
prepended to the bundle directory to form URLs to the
products. For example, if we are generating a SIP for
mission_bundle/LADEE_Bundle_1101.xml, and bundle-base-
url is https://atmos.nmsu.edu/PDS/data/PDS4/LADEE/,
the URL in the SIP will be https://atmos.nmsu.edu/PDS/
data/PDS4/LADEE/mission_bundle/LADEE_Bundle_1101.xml.
-d, --debug Log debugging messages for developers
-q, --quiet Don't log informational messages
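The ``--bundle-base-url`` help text describes prepending the base URL to a product's path relative to the bundle directory. A minimal sketch of that joining step is below. The function name `product_url` is hypothetical, and the handling of a missing trailing slash is an assumption, since the help text only shows a base URL that already ends in one.

```python
def product_url(bundle_base_url, relative_path):
    """Join the Node's base URL and a bundle-relative path into a product URL."""
    # Assumption: tolerate a base URL given without its trailing slash.
    base = bundle_base_url if bundle_base_url.endswith('/') else bundle_base_url + '/'
    return base + relative_path.lstrip('/')
```

Plain concatenation is used instead of ``urllib.parse.urljoin`` because ``urljoin`` drops the last path segment of a base URL that lacks a trailing slash.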
3 changes: 2 additions & 1 deletion setup.py
@@ -64,7 +64,8 @@
entry_points={
'console_scripts': [
'sipgen=pds.aipgen.sip:main',
'aipgen=pds.aipgen.aip:main'
'aipgen=pds.aipgen.aip:main',
'pds-deep-archive=pds.aipgen.main:main'
]
},
namespace_packages=['pds'],
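Each ``console_scripts`` entry has the form ``name=module:function``: the installer generates a ``name`` command that imports ``module`` and calls ``function``. The sketch below mimics that resolution step to show what the new ``pds-deep-archive=pds.aipgen.main:main`` line asks for. The helper ``resolve_console_script`` is hypothetical and is not part of ``setuptools``.

```python
import importlib


def resolve_console_script(spec):
    """Split a 'name=module:function' spec and import the callable it names.

    Hypothetical helper for illustration; real launchers are generated at install time.
    """
    name, _, target = spec.partition('=')
    module_name, _, attr = target.partition(':')
    module = importlib.import_module(module_name.strip())
    return name.strip(), getattr(module, attr.strip())
```

With the package installed, resolving ``'pds-deep-archive=pds.aipgen.main:main'`` this way would return the command name and the ``main`` function that the generated script invokes.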