MichaelAkridge-NOAA/archive-toolbox

ESD/ARP Archive Toolbox

Collection of Tools, Workflows and Processes to help Facilitate Archiving Scientific Data.

Table of Contents

  1. Archive Tools
  2. Google Cloud Platform Upload Tools

Archive Tools

File Copy

The File Copy tool copies files and directories from one location to another.

  • It uses a subprocess to call the Windows Robocopy (Robust File Copy) command.
  • It skips any files that already exist in the destination directory.
  • It runs multi-threaded for performance.
  • If a copy is interrupted, simply run the tool again; it can restart the transfer where it left off.
  • NOTE: Multiplatform versions are available.
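
The behavior listed above could be sketched roughly as follows. This is only an illustration, not the tool's actual code: the helper names `build_robocopy_command` and `run_copy`, the default thread count, and the retry settings are assumptions. The switches themselves are standard Robocopy options.

```python
import subprocess


def build_robocopy_command(source, destination, threads=8):
    """Build a Robocopy argument list (hypothetical helper, not the tool's code)."""
    return [
        "robocopy", source, destination,
        "/E",                  # copy subdirectories, including empty ones
        "/Z",                  # restartable mode, so interrupted copies can resume
        f"/MT:{threads}",      # multi-threaded copy
        "/XC", "/XN", "/XO",   # exclude changed, newer, and older files (skip existing)
        "/R:3", "/W:5",        # assumed retry count and wait time in seconds
    ]


def run_copy(source, destination, threads=8):
    """Run Robocopy (Windows only) and report whether it succeeded."""
    result = subprocess.run(build_robocopy_command(source, destination, threads))
    # Robocopy exit codes below 8 mean the copy completed without failures.
    return result.returncode < 8
```

Because the command is built as a list, it can be printed and inspected before anything is copied.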

Folder Stats

  • Lightweight, Python-based tool that gathers folder statistics such as name, path, and size
  • Exports a CSV of folder stats
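
A minimal sketch of this kind of folder report, using only the standard library, might look like the code below. The function names and the CSV column names (`name`, `path`, `size_bytes`) are assumptions made for illustration and may not match the tool's real output.

```python
import csv
import os


def folder_size(path):
    """Sum the sizes, in bytes, of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):  # do not count symlink targets
                total += os.path.getsize(file_path)
    return total


def collect_folder_stats(parent):
    """Return one row per immediate subfolder of parent, sorted by name."""
    rows = []
    with os.scandir(parent) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            if entry.is_dir(follow_symlinks=False):
                rows.append({
                    "name": entry.name,
                    "path": os.path.abspath(entry.path),
                    "size_bytes": folder_size(entry.path),
                })
    return rows


def write_stats_csv(rows, csv_path):
    """Write the collected rows to a CSV file with a header line."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["name", "path", "size_bytes"])
        writer.writeheader()
        writer.writerows(rows)
```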

garmin-gps-file-converter

  • Tool to convert Garmin GPS files (GPX) to standard CSV/TXT file format
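
As a rough illustration of such a conversion, the sketch below reads waypoints from a GPX file and writes them to CSV. It assumes the file uses the GPX 1.1 namespace and handles only `wpt` elements, not tracks or routes. The function names and column choices are hypothetical and not taken from the converter itself.

```python
import csv
import xml.etree.ElementTree as ET

# Assumption: the input declares the GPX 1.1 namespace.
GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}
FIELDS = ["name", "lat", "lon", "ele", "time"]


def read_waypoints(gpx_path):
    """Return one dict per <wpt> element in the GPX file."""
    root = ET.parse(gpx_path).getroot()
    rows = []
    for wpt in root.findall("gpx:wpt", GPX_NS):
        rows.append({
            "name": wpt.findtext("gpx:name", default="", namespaces=GPX_NS),
            "lat": wpt.get("lat"),
            "lon": wpt.get("lon"),
            "ele": wpt.findtext("gpx:ele", default="", namespaces=GPX_NS),
            "time": wpt.findtext("gpx:time", default="", namespaces=GPX_NS),
        })
    return rows


def write_waypoints_csv(rows, csv_path):
    """Write the waypoint rows to a CSV file with a header line."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```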

HEIC_HEIF_converter

  • Tool to batch convert HEIC/HEIF files to standard JPG file format

Manifest Tool

  • Tool to verify or generate archive manifest files
  • Manifest files are required when sending archive packages

Manifest File Details

  • A separate manifest file is required for every file that is transferred.
  • Each manifest file contains a single line with three comma-delimited text values and no spaces, describing the submitted file:
<file_name>,<file_md5_checksum>,<file_size_in_bytes>
  • The manifest file name is the name of the associated data file with '.mnf' appended:
<file_name>.mnf
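
The manifest format described above can be generated and checked with a few lines of standard-library Python. This is a minimal sketch, not the Manifest Tool's own code. The function names are hypothetical, and ending the line with a trailing newline is an assumption.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_line(path):
    """Return '<file_name>,<file_md5_checksum>,<file_size_in_bytes>'."""
    return f"{os.path.basename(path)},{md5_of_file(path)},{os.path.getsize(path)}"


def write_manifest(path):
    """Write '<file_name>.mnf' next to the data file and return its path."""
    manifest_path = path + ".mnf"
    with open(manifest_path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(manifest_line(path) + "\n")  # trailing newline is an assumption
    return manifest_path


def verify_manifest(path):
    """Return True if the existing manifest still matches the data file."""
    with open(path + ".mnf", encoding="utf-8") as handle:
        recorded = handle.read().strip()
    return recorded == manifest_line(path)
```

Calling `write_manifest` before a transfer and `verify_manifest` after it gives a simple end-to-end integrity check.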

Other Archive Tools

PIFSC Centralized Data Tools - NCEI Tools Library

  • PHP-based tools: data packager, BagIt data packager, and submission manifest tools.
  • placeholder - link coming soon - placeholder

Send2NCEI (S2N)

  • Send2NCEI (S2N) is an archiving tool that allows you to easily submit your data files and related documentation to the National Centers for Environmental Information for long term preservation, stewardship, and access.
  • https://www.ncei.noaa.gov/archive/send2ncei/

Advanced Tracking and Resource Tool for Archive Collections (ATRAC)

  • The Advanced Tracking and Resource tool for Archive Collections (ATRAC) provides a common interface for users to enter and display information on archiving projects at the NOAA National Centers for Environmental Information (NCEI).
  • https://www.ncdc.noaa.gov/atrac/guidelines.html

Google Cloud Platform Upload Tools

NOAA Open Data Dissemination (NODD) Workflow

NODD Upload Tool

Requirements

Tool Details

  • Syncs local data to Google Cloud Platform Storage using Google's gsutil backend
  • Customizable gsutil command generation
  • Logging and output message management
  • Configurable parameters including dry run, multi-threading, and recursion
  • A graphical user interface (GUI) is built using the Gooey library.
  • Users can interact with the GUI to specify the source and destination paths, adjust threading, select a dry run, and decide whether to print or run the gsutil command.
  • The main function fetches user input, configures the logger, and executes the copy process or prints the gsutil command based on user preference.
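
The command-generation step described above might be sketched as follows. This is not the upload tool's actual implementation and leaves out the Gooey GUI and logging. The function name and parameters are assumptions; the gsutil options used (`-o`, `-m`, `rsync -r -n -x`) are the ones documented in this README.

```python
import subprocess


def build_gsutil_command(source, destination, threads=None,
                         dry_run=False, recursive=True, exclude=None):
    """Build a gsutil rsync argument list (hypothetical helper)."""
    cmd = ["gsutil"]
    if threads:
        # Top-level options must come before the rsync subcommand.
        cmd += ["-o", f"GSUtil:parallel_thread_count={threads}", "-m"]
    cmd.append("rsync")
    if recursive:
        cmd.append("-r")          # descend into subdirectories
    if dry_run:
        cmd.append("-n")          # report what would be copied, transfer nothing
    if exclude:
        cmd += ["-x", exclude]    # Python regular expression of paths to skip
    cmd += [source, destination]
    return cmd


def run_or_print(cmd, execute=False):
    """Print the command, or run it when execute is True."""
    if not execute:
        print(" ".join(cmd))
        return None
    return subprocess.run(cmd).returncode
```

Printing the command first, or building it with `dry_run=True`, is a low-risk way to confirm the source and destination before any data moves.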

Simple NODD Upload Script Example

gsutil -m rsync -r C:\source_folder_path gs://<bucket_name>/<destination_folder_path>

Shell Script - List of Files - NODD Upload Script Example

#!/bin/bash

# List of files to upload
source_files=(
'C:\example\of\file\path\file.txt'
# Add more file paths here, one quoted path per line
)

# Destination Google Cloud Storage bucket path
destination_bucket="gs://bucket_name/destination_folder_path..."

# Loop through the source file list and upload each file
for source_file in "${source_files[@]}"; do
    gsutil -o "GSUtil:parallel_thread_count=12" -m cp "$source_file" "$destination_bucket"
done

Breakdown of what each flag and option does

  • gsutil: This is the command-line tool for interacting with Google Cloud Storage. https://cloud.google.com/storage/docs/gsutil
  • -o "GSUtil:parallel_thread_count=12": This top-level option overrides a gsutil (boto) configuration value for a single command. Here it sets the number of threads used per process when -m is enabled; lowering it reduces the load that multi-threaded transfers put on the network and the machine.
  • -m: This flag runs the operation in parallel (multi-threaded/multi-processing), which speeds up the transfer by using multiple connections.
  • rsync: This is the command to synchronize files and directories between a source and a destination, such as a local directory and a bucket.
  • -r: This flag tells rsync to synchronize the entire directory tree recursively.
  • -n: This flag performs a "dry run," which means that the synchronization is simulated and no data is transferred. This is useful for previewing what the command would do without making any changes.
  • -x: This flag excludes files and directories whose paths match the specified Python regular expression.
  • local_dir (for example, C:\nodd\test): This is the local directory that will be synchronized.
  • cloud_dir (for example, gs://nmfs_odp_pifsc/PIFSC/ESD/ARP/test): This is the Google Cloud Storage bucket and destination path where the data will be synchronized.

The -x exclude patterns we use, in detail:

  • -x "(^.*(?<!.JPG)$)|(.*_archive.*|.*_YEAR.*|.*ISLAND.*|.*SITE-ID.*|.*SITE_PHOTOS.*|.*uncorrected.*|.*MISC.*|.*DARK.*|.*Products.*)"
  • (^.*(?<!.JPG)$)
    • Matches any files that do not end with the ".JPG" extension.
  • (.*_archive.*|.*_YEAR.*|.*ISLAND.*|.*SITE-ID.*|.*SITE_PHOTOS.*|.*uncorrected.*|.*MISC.*|.*DARK.*|.*Products.*)
    • Matches any files or directories that contain any of the specified strings.
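
An exclude pattern like this can be checked locally with Python's `re` module before it is passed to gsutil. The sketch below assumes the wildcards in the pattern are `.*` and that the pattern is applied from the start of each path relative to the source directory. The sample paths and the `is_excluded` helper are hypothetical.

```python
import re

# Assumed form of the exclude pattern, with ".*" wildcards around each keyword.
EXCLUDE_PATTERN = re.compile(
    r"(^.*(?<!.JPG)$)|"
    r"(.*_archive.*|.*_YEAR.*|.*ISLAND.*|.*SITE-ID.*|.*SITE_PHOTOS.*|"
    r".*uncorrected.*|.*MISC.*|.*DARK.*|.*Products.*)"
)


def is_excluded(relative_path):
    """Return True if the path would be skipped under this pattern."""
    return EXCLUDE_PATTERN.match(relative_path) is not None


if __name__ == "__main__":
    for sample in [
        "survey/IMG_0001.JPG",        # kept: ends in .JPG, no excluded keyword
        "survey/notes.txt",           # excluded: does not end in .JPG
        "survey/DARK/IMG_0002.JPG",   # excluded: path contains DARK
        "survey/img_0003.jpg",        # excluded: the match is case-sensitive
    ]:
        print(sample, "excluded" if is_excluded(sample) else "kept")
```

Note that the match is case-sensitive, so files ending in lowercase ".jpg" would also be excluded by the first alternative.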

Simple NODD Download Script Example

gsutil -m rsync -r gs://<source_bucket_name>/<source_folder_path> C:\destination_folder_path

NODD Details

NODD

PIFSC NODD

NODD for other NMFS Centers:

Google Cloud SDK Docs



Disclaimer

This repository is a scientific product and is not official communication of the National Oceanic and Atmospheric Administration, or the United States Department of Commerce. All NOAA GitHub project content is provided on an ‘as is’ basis and the user assumes responsibility for its use. Any claims against the Department of Commerce or Department of Commerce bureaus stemming from the use of this GitHub project will be governed by all applicable Federal law. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by the Department of Commerce. The Department of Commerce seal and logo, or the seal and logo of a DOC bureau, shall not be used in any manner to imply endorsement of any commercial product or activity by DOC or the United States Government.

License

See the LICENSE.md for details