MPT (Minimum Preservation Tool)

A utility for staging files, calculating and validating file checksums, and comparing checksum values between storage locations.

Requirements

  • Python (version 3.6+)
  • Pip (version 19.0+)

How to install

MPT works best within a virtual environment. To create a new virtual environment, start a command prompt and enter the following command:

python -m venv [path-to-venv-directory]

This will create a directory structure in [path-to-venv-directory] containing all the necessary configuration and data files required. The virtual environment can be activated by entering one of the following at the command prompt:

Windows:

[path-to-venv-directory]\Scripts\activate.bat

Linux:

source [path-to-venv-directory]/bin/activate

When you've activated the virtual environment, install MPT from a Git repository:

pip install git+http://github.com/britishlibrary/mpt

Or from a local source:

pip install /path/to/mpt-source/

All dependencies should be automatically downloaded and installed as part of pip's install process.

Configuration

To e-mail summary reports automatically, MPT requires that three environment variables be set:

MAIL_SERVER = mail.example.com
MAIL_SERVER_PORT = 587
MAIL_SENDER_ADDRESS = <the sender address you wish displayed in all e-mails>

An example of MAIL_SENDER_ADDRESS might be Bitwise Checks <do_not_reply@example.com>

On Windows, these should be set via Control Panel > System > Advanced System Settings > Environment Variables.

On Linux, these should be added to the ~/.bash_profile or ~/.profile file for the user running MPT.
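
A wrapper script for a scheduled run may want to confirm that these variables are present before MPT is started. The following is a minimal sketch, assuming only the three variable names listed above; the helper name read_mail_settings is hypothetical and is not part of MPT.

```python
import os

REQUIRED_VARS = ("MAIL_SERVER", "MAIL_SERVER_PORT", "MAIL_SENDER_ADDRESS")


def read_mail_settings(environ=os.environ):
    """Return the three mail settings, or raise if any is missing.

    Hypothetical helper for illustration; not part of MPT itself.
    """
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "server": environ["MAIL_SERVER"],
        "port": int(environ["MAIL_SERVER_PORT"]),  # the port is stored as text
        "sender": environ["MAIL_SENDER_ADDRESS"],
    }
```

Passing a plain dictionary instead of os.environ makes the check easy to try out without changing the real environment.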

How to use

MPT has several modes of operation.

Checksum Creation

MPT can calculate checksums for an existing collection of files, and store those checksums in a 'checksum tree' which mimics the directory structure of the original files. Optionally it can also store these checksum values in a single manifest file:

mpt create dir -t TREE [-a ALGORITHM] [--formats FORMATS] [-m MANIFEST] [-r]

The various command line options and arguments are described below.

Directory to check (required)

The directory of files to process.

Directory for checksum tree (required)

Use the -t or --tree option to specify the directory in which the 'checksum tree' should be created. A checksum file will be created in the tree for each file checked. The name and path to the checksum file will mirror that of the original file checked.

Recursive operation (optional)

Use the -r or --recursive option to process all sub-folders beneath the given directory. By default only the top-level directory will be processed.

Specify checksum algorithm (optional)

Use the -a or --algorithm option to specify the checksum algorithm to use. A number of different algorithms are supported (use mpt create -h to list them all). The default algorithm is sha256.

Limit to certain file extensions (optional)

Use the --formats option to limit checksum creation to files with a particular file extension.

Specify manifest file (optional)

Use the -m or --manifest option to specify a manifest file to be created in addition to the 'checksum tree'.

Example of command syntax

mpt create -r c:\storage\files
           -t c:\storage\checksums
           -m c:\storage\manifest.sha256
           --formats tiff tif

This will create checksums for all files ending in tiff or tif in c:\storage\files and all subdirectories. Because no algorithm is specified, the default (sha256) will be used. The resulting 'checksum tree' will be created in c:\storage\checksums mirroring the original directory structure. A manifest file containing all checksums will also be created (if it does not already exist) or updated at c:\storage\manifest.sha256.
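
The idea of a mirrored 'checksum tree' can be sketched in Python as follows. This is an illustration, not MPT's implementation: the function names are hypothetical, and the checksum file naming (original name plus the algorithm as an extension) and contents (hex digest only) are assumptions, since the format is not specified here.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def create_tree(data_root, tree_root, algorithm="sha256", formats=None, recursive=False):
    """Write one checksum file per data file, mirroring the directory layout.

    Assumption: each checksum file is named <original name>.<algorithm> and
    holds only the hex digest. MPT's real naming and file format may differ.
    """
    data_root, tree_root = Path(data_root), Path(tree_root)
    wanted = {ext.lower().lstrip(".") for ext in formats} if formats else None
    candidates = data_root.rglob("*") if recursive else data_root.glob("*")
    written = []
    for source in sorted(candidates):
        if not source.is_file():
            continue
        if wanted is not None and source.suffix.lower().lstrip(".") not in wanted:
            continue
        relative = source.relative_to(data_root)
        target = tree_root / relative.parent / (relative.name + "." + algorithm)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(file_checksum(source, algorithm))
        written.append(target)
    return written
```

Under these assumptions, create_tree("files", "checksums", formats=["tiff", "tif"], recursive=True) would correspond roughly to the command above without the manifest option.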

Checksum Validation (Checksum Tree)

MPT can verify the checksums of all files listed in a 'checksum tree' created by the creation or staging mode:

mpt validate_tree dir -t TREE [-r]

The various command line options and arguments are described below.

Data directory root (required)

The root directory of files to validate.

Checksum tree root (required)

Use the -t or --tree option to specify the root directory of the 'checksum tree' used to validate the data files.

Recursive operation (optional)

Use the -r or --recursive option to process all sub-folders beneath the given directory. By default only the top-level directory will be processed.

Example of command syntax

mpt validate_tree -r c:\storage\files -t c:\storage\checksums

This will validate all data files in c:\storage\files and all subdirectories. Each file will be validated using its checksum file in the 'checksum tree' in c:\storage\checksums.
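
Validation against a checksum tree amounts to recomputing each file's checksum and comparing it with the stored value. A minimal sketch follows, assuming checksum files named after the original file with the algorithm as an extension; the function name and the "valid", "invalid" and "missing" labels are hypothetical and do not reflect MPT's own report wording.

```python
import hashlib
from pathlib import Path


def validate_against_tree(data_root, tree_root, algorithm="sha256"):
    """Compare each data file with its mirrored checksum file.

    Assumption: checksum files are named <original name>.<algorithm> and
    contain the hex digest as their first whitespace-separated token.
    Returns a dict mapping relative path -> "valid", "invalid" or "missing".
    """
    data_root, tree_root = Path(data_root), Path(tree_root)
    results = {}
    # Always recursive here, for brevity.
    for source in sorted(p for p in data_root.rglob("*") if p.is_file()):
        relative = source.relative_to(data_root)
        checksum_file = tree_root / relative.parent / (relative.name + "." + algorithm)
        if not checksum_file.is_file():
            results[relative.as_posix()] = "missing"
            continue
        tokens = checksum_file.read_text().split()
        expected = tokens[0].lower() if tokens else ""
        # Reads the whole file at once; a chunked read suits large files better.
        actual = hashlib.new(algorithm, source.read_bytes()).hexdigest()
        results[relative.as_posix()] = "valid" if actual == expected else "invalid"
    return results
```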

Checksum Validation (Manifest)

MPT can verify the checksums of all files listed in a manifest file created by the creation or staging mode:

mpt validate_manifest dir -m MANIFEST [-r] [-a ALGORITHM]

The various command line options and arguments are described below.

Data directory root (required)

The root directory of files to validate.

Manifest file path (required)

Use the -m or --manifest option to specify the location of the manifest file used to validate the data files.

Specify checksum algorithm (optional)

Use the -a or --algorithm option to specify the checksum algorithm to use. A number of different algorithms are supported (use mpt validate_manifest -h to list them all). The default algorithm is sha256.

Example of command syntax

mpt validate_manifest c:\storage\files -m c:\storage\manifest.sha256

This will validate all data files in c:\storage\files and all subdirectories. Each file will be validated using its entry in the manifest file c:\storage\manifest.sha256.
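
Manifest validation follows the same principle, except that the expected values come from a single file. The sketch below assumes a manifest layout of one "hex digest, whitespace, relative path" entry per line, similar to the output of sha256sum. That layout is an assumption made for illustration; MPT's actual manifest format is not described here, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def read_manifest(manifest_path):
    """Parse lines of the form '<hex digest> <relative path>' into a dict.

    Assumption: this sha256sum-style layout is illustrative only.
    """
    entries = {}
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        digest, _, relative = line.strip().partition(" ")
        entries[relative.strip().lstrip("*")] = digest.lower()
    return entries


def validate_against_manifest(data_root, manifest_path, algorithm="sha256"):
    """Return relative paths that are absent or whose checksum has changed."""
    failures = []
    for relative, expected in read_manifest(manifest_path).items():
        source = Path(data_root) / relative
        if not source.is_file():
            failures.append(relative)
            continue
        actual = hashlib.new(algorithm, source.read_bytes()).hexdigest()
        if actual != expected:
            failures.append(relative)
    return failures
```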

Checksum Comparison (Checksum Trees)

MPT can compare the checksums stored in a 'checksum tree' to other 'trees' stored in different locations in order to detect any discrepancies:

mpt compare_trees dir -t OTHER_TREES

The various command line options and arguments are described below.

Checksum tree root (required)

The root directory of the master checksum tree to use as a base of comparison.

Other checksum tree roots (required)

Use the -t or --trees option to specify the location of other checksum trees to compare to the master.

Example of command syntax

mpt compare_trees c:\storage\checksums
                  -t q:\backup_storage_1\checksums z:\backup_storage_2\checksums

This will compare all checksum files in the 'checksum tree' located in c:\storage\checksums against the corresponding files in q:\backup_storage_1\checksums and z:\backup_storage_2\checksums and highlight any discrepancies.
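
Conceptually, comparing trees means checking that every checksum file in the master tree has an identical counterpart in each other tree. A minimal sketch, with hypothetical function names and report categories ("missing", "extra", "different") that are not taken from MPT's reports:

```python
from pathlib import Path


def load_tree(tree_root):
    """Map each checksum file's relative path to its stripped contents."""
    tree_root = Path(tree_root)
    return {
        path.relative_to(tree_root).as_posix(): path.read_text().strip()
        for path in tree_root.rglob("*")
        if path.is_file()
    }


def compare_trees(master_root, other_roots):
    """Report, per other tree, entries that are missing, extra or different."""
    master = load_tree(master_root)
    report = {}
    for other_root in other_roots:
        other = load_tree(other_root)
        shared = set(master) & set(other)
        report[str(other_root)] = {
            "missing": sorted(set(master) - set(other)),
            "extra": sorted(set(other) - set(master)),
            "different": sorted(n for n in shared if master[n] != other[n]),
        }
    return report
```

The same dictionary comparison would apply to manifests once each manifest has been parsed into a mapping of path to checksum.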

Checksum Comparison (Manifests)

MPT can compare the checksums stored in a manifest file to manifests in other locations in order to detect any discrepancies:

mpt compare_manifests manifest -m OTHER_MANIFESTS

The various command line options and arguments are described below.

Master manifest file (required)

The path to the master manifest file to use as a base of comparison.

Other manifest files (required)

Use the -m or --other_manifests option to specify the location of other manifests to compare to the master.

Example of command syntax

mpt compare_manifests c:\storage\manifest.sha256
                      -m q:\backup_storage_1\manifest.sha256 z:\backup_storage_2\manifest.sha256

This will compare all entries in the manifest file c:\storage\manifest.sha256 against the corresponding files q:\backup_storage_1\manifest.sha256 and z:\backup_storage_2\manifest.sha256 and highlight any discrepancies.

File Staging

File staging involves processing all files in a particular directory and moving them to one or more storage locations, calculating their checksums in the process.

If staging is successful for all destinations then the original file will be removed from the staging area. If any part of the staging process fails for a particular file, then the entire staging process will be backed out for that file. This is to ensure that the staged file is present either in all destinations or in none.
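
The all-or-nothing rule can be sketched as follows. This is an illustration only: the function name is hypothetical, and the checksum calculation and verification that MPT performs during staging are left out to keep the back-out logic visible.

```python
import shutil
from pathlib import Path


def stage_file(source, destinations):
    """Copy source into every destination, or into none of them.

    Illustrative only: checksum handling is omitted.
    Returns True when the file was staged everywhere.
    """
    source = Path(source)
    attempted = []
    try:
        for destination in destinations:
            target = Path(destination) / source.name
            target.parent.mkdir(parents=True, exist_ok=True)
            attempted.append(target)  # recorded first so partial copies are cleaned up
            shutil.copy2(source, target)
    except OSError:
        # Back out: remove every copy made so far, leaving the original in place.
        for target in attempted:
            if target.exists():
                target.unlink()
        return False
    # Every destination has the file, so the original can be removed.
    source.unlink()
    return True
```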

For example, if a file is successfully copied to three out of four destinations, but fails on the fourth destination, the file will be removed from each of the three other destinations. The final summary report would describe the details of the error condition for the one destination which failed, while the other three would be listed as "Unstaged". The command syntax is:

mpt stage dir -d DESTINATIONS [-a ALGORITHM] [-t TREES] [-m MANIFESTS] [--max-failures MAX_FAILURES]

The various command line options and arguments are described below.

Staging Directory (required)

The directory of files to be staged.

Staging Destinations (required)

Use the -d or --destinations option to specify the root directory of each staging destination (i.e. where the files should be moved to). These destinations can be in any order, but the order must be consistent between this option and the --trees and --manifests options if they are used.

If the --trees option to specify 'checksum tree' locations is omitted, then the files will actually be staged to a subdirectory named files directly beneath each specified staging destination.

Specify checksum algorithm (optional)

Use the -a or --algorithm option to specify the checksum algorithm to use. A number of different algorithms are supported (use mpt stage -h to list them all). The default algorithm is sha256.

Destination checksum trees (optional)

Use the -t or --trees option to specify the root directory of each destination checksum tree (i.e. where the checksums should be stored in each staging destination).

If provided, then these destination tree paths must be listed in the same order as the staging destinations listed for the --destinations option - e.g. the first path listed for -t must be for the checksum tree corresponding to the first destination listed for the -d option, and so on.

If this option is omitted altogether, then checksum trees will actually be created in a subdirectory named checksums directly beneath each specified staging destination.

Destination manifest files (optional)

Use the -m or --manifests option to specify the location of a manifest file to create or update in each staging destination.

If provided, then these manifest paths must be listed in the same order as the staging destinations listed for the --destinations option - e.g. the first manifest listed for -m must be for the manifest corresponding to the first destination listed for the -d option, and so on.

If this option is omitted altogether, then no manifest files will be created.

Bypass confirmation prompt (optional)

By default, staging mode will prompt the user to confirm that all file paths are correct before commencing. Using the --no-confirm option will bypass this prompt. The intention is for the user to prepare and test their command-line syntax interactively using the confirmation prompt as a guide, and use the --no-confirm option when scheduling the staging process to run automatically.

Override maximum number of consecutive failures (optional)

By default, staging will be aborted if 10 consecutive write failures occur. Use the --max-failures option to override this threshold.
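
A consecutive-failure threshold of this kind is usually a counter that resets on every success. The sketch below shows that pattern; the function name is hypothetical, and treating any OSError as a write failure is an assumption rather than a description of MPT's internals.

```python
def run_with_failure_limit(tasks, max_failures=10):
    """Run callables in order, stopping after max_failures consecutive errors.

    A success resets the counter, so only an unbroken run of failures aborts.
    Returns (number of completed tasks, whether the run was aborted).
    """
    consecutive = 0
    completed = 0
    for task in tasks:
        try:
            task()
        except OSError:
            consecutive += 1
            if consecutive >= max_failures:
                return completed, True
        else:
            consecutive = 0
            completed += 1
    return completed, False
```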

Keep empty folders in staging directory (optional)

By default, any empty folders left in the staging directory will be deleted once staging is complete. Using the --keep-staging-folders option will change this behaviour and leave empty folders untouched. This may be useful in cases where a complex hierarchical structure needs to be maintained for new files and maintaining an empty file system in the staging directory is easier than recreating the structure for each run.

Examples of command syntax

Example 1 (use defaults for file & checksum destinations):

mpt stage f:\staging
          -d c:\storage q:\backup_storage_1 z:\backup_storage_2

This will process all files in f:\staging and create output in the following locations:

Destination  Files                      Checksums                      Manifest
1            c:\storage\files           c:\storage\checksums           None
2            q:\backup_storage_1\files  q:\backup_storage_1\checksums  None
3            z:\backup_storage_2\files  z:\backup_storage_2\checksums  None

Example 2 (use specific checksum & manifest locations):

mpt stage f:\staging
          -d c:\storage\datastore q:\backup_storage_1\datastore z:\backup_storage_2\file_data
          -t c:\storage\checksumdata q:\backup_storage_1\checksumdata
             z:\backup_storage_2\meta_data\checksums
          -m c:\storage\manifest.sha256 q:\backup_storage_1\manifest.sha256
             z:\backup_storage_2\meta_data\manifest.sha256

This will process all files in f:\staging and create output in the following locations:

Destination  Files                          Checksums                                Manifest
1            c:\storage\datastore           c:\storage\checksumdata                  c:\storage\manifest.sha256
2            q:\backup_storage_1\datastore  q:\backup_storage_1\checksumdata         q:\backup_storage_1\manifest.sha256
3            z:\backup_storage_2\file_data  z:\backup_storage_2\meta_data\checksums  z:\backup_storage_2\meta_data\manifest.sha256
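
The way destinations, trees and manifests are paired by position, and the defaults used when --trees is omitted, can be sketched as below. The function name is hypothetical, and raising an error when the lists differ in length is an assumption about how a mismatch might be handled, not documented MPT behaviour.

```python
from pathlib import PureWindowsPath


def resolve_destinations(destinations, trees=None, manifests=None):
    """Pair each destination with its file, checksum and manifest locations.

    Without explicit trees, files go to a 'files' subdirectory and checksums
    to a 'checksums' subdirectory of each destination. Without explicit
    manifests, no manifest is used.
    """
    if trees is not None and len(trees) != len(destinations):
        raise ValueError("--trees must list one path per destination")
    if manifests is not None and len(manifests) != len(destinations):
        raise ValueError("--manifests must list one path per destination")
    rows = []
    for index, destination in enumerate(destinations):
        root = PureWindowsPath(destination)
        if trees is None:
            files, checksums = root / "files", root / "checksums"
        else:
            files, checksums = root, PureWindowsPath(trees[index])
        manifest = None if manifests is None else str(PureWindowsPath(manifests[index]))
        rows.append((str(files), str(checksums), manifest))
    return rows
```

PureWindowsPath is used only so that the Windows-style paths from the examples behave the same on any platform.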

Common Options

The following options can be used with all modes of operation. They should be used in the command line before the mode of operation (e.g. create, stage, etc) is specified.

Number of processes

Use the -p or --num-processes option to specify the number of concurrent processes MPT should use. The default value is 2. The ideal number will depend on the number of CPUs and processor cores the host machine has.

E-mail recipients

Use the -e or --email-results option to specify e-mail recipients for MPT's summary reports.

Output directory

Use the -o or --output option to specify the root directory used to store reports. Subdirectories will be created beneath this directory for each type of report (creation, validation, comparison and staging), and a separate dated directory will be created each time MPT runs.

Disable file count

Normally MPT will count the number of files to be processed before it starts. When run interactively, this can provide a useful picture of its progress - however, this is at the cost of potentially taking a long time to begin processing, as all files have to be counted before processing can begin. Use the --no-count option to skip file counting and simply display a count of how many files have been processed so far.

Use absolute path in reports

By default, the summary reports produced by MPT show each file's path relative to the root directory specified on the command line. Use the --absolute-path option to instead show an absolute path. Note that this may include a drive letter (on Windows) or mount point (on Linux) which does not exist for all users.

Override cache size

MPT produces its output reports as it is running. By default, it caches 1000 records in memory before writing them to disk. To override this setting, use the --cache-size option to specify a different number of records. A higher value will result in higher memory usage, whereas a lower number will cause more frequent writing to disk. Depending on the number of files being processed by MPT, adjustments to the cache size may improve overall performance.
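
The trade-off between memory use and write frequency can be seen in a small buffered writer. This is a sketch of the general technique, not MPT's report code: the class name and the flush_count attribute are hypothetical, and treating a cache size of 0 as "write every record immediately" is an assumption.

```python
class BufferedReportWriter:
    """Collect report rows and append them to a file in batches.

    Hypothetical class for illustration; MPT's own report writer may differ.
    A cache_size of 0 writes every row straight away.
    """

    def __init__(self, path, cache_size=1000):
        self.path = path
        self.cache_size = cache_size
        self.rows = []
        self.flush_count = 0  # number of writes to disk so far

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= max(self.cache_size, 1):
            self.flush()

    def flush(self):
        if not self.rows:
            return
        with open(self.path, "a", encoding="utf-8") as handle:
            handle.writelines(row + "\n" for row in self.rows)
        self.rows.clear()
        self.flush_count += 1
```

A final call to flush() is needed at the end of a run so that any rows still held in memory reach the file.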

Example of command syntax

mpt --email-results recipient@example.com recipient2@example.com
    --num-processes 8
    --no-count
    --cache-size 0
    --output c:\storage\reports
    validate_tree c:\storage\files
    --tree c:\storage\checksums

This will validate the files stored in c:\storage\files using the checksum tree in c:\storage\checksums, using 8 concurrent processes and without counting the files to be processed. Results will be written out to disk immediately rather than being cached. The resulting reports will be written to the directory c:\storage\reports and sent via e-mail to the two listed recipients.
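
The requirement that common options come before the mode of operation is typical of command-line parsers built around sub-commands. The sketch below uses Python's argparse to show the pattern with a subset of the options described above. It is a hypothetical reconstruction for illustration and may not match how MPT's own parser is organised.

```python
import argparse


def build_parser():
    """Build a parser whose shared options precede the sub-command.

    Hypothetical sketch: only some options and one mode are modelled.
    """
    parser = argparse.ArgumentParser(prog="mpt")
    # Options on the top-level parser must appear before the mode name.
    parser.add_argument("-p", "--num-processes", type=int, default=2)
    parser.add_argument("-e", "--email-results", nargs="+", default=[])
    parser.add_argument("-o", "--output")
    parser.add_argument("--no-count", action="store_true")
    parser.add_argument("--cache-size", type=int, default=1000)
    modes = parser.add_subparsers(dest="mode")
    validate = modes.add_parser("validate_tree")
    validate.add_argument("dir")
    validate.add_argument("-t", "--tree", required=True)
    validate.add_argument("-r", "--recursive", action="store_true")
    return parser
```

Calling build_parser().parse_args(...) with the arguments from the example above yields num_processes of 8, cache_size of 0 and mode "validate_tree".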

Licence

This project is licensed under the Apache License 2.0. For details see the accompanying LICENSE file or visit:

http://www.apache.org/licenses/LICENSE-2.0

Copyright (c) 2020, The British Library
