A ChRIS DS plugin that can perform somewhat arbitrary (housekeeping) operations on patterns of input directories and files.

FNNDSC/pl-pfdorun

pl-pfdorun


The pl-pfdorun plugin is a general-purpose "Swiss Army knife" DS plugin that executes CLI-type commands on input directories.

It can perform somewhat arbitrary command-line operations on input directories and data. For instance:

  • copy (subsets of) data from the input space to output;
  • create explicit (g)zip files of data;
  • un(g)zip data;
  • reorganize data in the input dir in some idiosyncratic fashion in the output directory;
  • misc operations on images using ImageMagick;
  • and others.

In some respects it functions as a dynamic "impedance matching" plugin that can be used, per use case, to match the output directories and files of one plugin to the input requirements of another. This plugin is for the most part a simple wrapper around an underlying pfdo_run CLI exec module.

pfdorun                                                         \
    --exec <CLIcmdToExec>                                       \
    [-i|--inputFile <inputFile>]                                \
    [-f|--fileFilter <filter1,filter2,...>]                     \
    [-d|--dirFilter <filter1,filter2,...>]                      \
    [--analyzeFileIndex <someIndex>]                            \
    [--outputLeafDir <outputLeafDirFormat>]                     \
    [--threads <numThreads>]                                    \
    [--noJobLogging]                                            \
    [--test]                                                    \
    [--maxdepth <dirDepth>]                                     \
    [--syslog]                                                  \
    [-h] [--help]                                               \
    [--json]                                                    \
    [--man]                                                     \
    [--meta]                                                    \
    [--savejson <DIR>]                                          \
    [--verbose <level>]                                         \
    [--version]                                                 \
    <inputDir>                                                  \
    <outputDir>

--exec <CLIcmdToExec>
The command line expression to apply at each directory node of the
input tree. See the CLI SPECIFICATION section for more information.

[-i|--inputFile <inputFile>]
An optional <inputFile> specified relative to the <inputDir>. If
specified, then do not perform a directory walk, but function only
on the directory containing this file.

[-f|--fileFilter <someFilter1,someFilter2,...>]
An optional comma-delimited string to filter out files of interest
from the <inputDir> tree. Each token in the expression is applied in
turn over the space of files in a directory location, and only files
that contain this token string in their filename are preserved.
[-d|--dirFilter <someFilter1,someFilter2,...>]
An additional filter that will further limit any files to process to
only those files that exist in leaf directory nodes that have some
substring of each of the comma separated <someFilter> in their
directory name.

[--analyzeFileIndex <someIndex>]
An optional string to control which file(s) in a specific directory
to which the analysis is applied. The default is "-1" which implies
*ALL* files in a given directory. Other valid <someIndex> are:

    "m":   only the "middle" file in the returned file list
    "f":   only the first file in the returned file list
    "l":   only the last file in the returned file list
    "<N>": the file at index N in the file list. If this index
           is out of bounds, no analysis is performed.
    "-1":  all files.

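The index rules above can be sketched in Python as follows. This is an illustrative approximation only: select_files is a hypothetical name, and the sketch assumes that "middle" means the element at half the list length (integer division) and that unparseable or negative indices other than "-1" select nothing.

```python
def select_files(files, index="-1"):
    """Pick files from a directory listing according to an index spec."""
    if index == "-1":
        return list(files)
    if not files:
        return []
    if index == "f":
        return [files[0]]
    if index == "l":
        return [files[-1]]
    if index == "m":
        # ASSUMPTION: "middle" is the element at len // 2.
        return [files[len(files) // 2]]
    try:
        n = int(index)
    except ValueError:
        # ASSUMPTION: an unrecognized spec selects nothing.
        return []
    # Out-of-bounds indices select nothing, so no analysis is performed.
    return [files[n]] if 0 <= n < len(files) else []


listing = ["a.dcm", "b.dcm", "c.dcm", "d.dcm", "e.dcm"]
print(select_files(listing, "m"))
```
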
[--outputLeafDir <outputLeafDirFormat>]
If specified, will apply the <outputLeafDirFormat> to the output
directories containing data. This is useful to blanket describe
final output directories with some descriptive text, such as
'anon' or 'preview'.

This is a formatting spec, so

    --outputLeafDir 'preview-%%s'

where %%s is the original leaf directory node, will prefix each
final directory containing output with the text 'preview-' which
can be useful in describing some features of the output set.
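
To make the effect of such a format spec concrete, here is a small Python sketch that rewrites only the final component of an output path. The helper format_leaf and the example path are hypothetical, and the sketch uses a single %s placeholder with Python's printf-style formatting; how the plugin itself parses or escapes the spec is not confirmed here.

```python
from pathlib import PurePosixPath


def format_leaf(output_dir, leaf_format):
    """Rename only the final path component using a printf-style spec."""
    path = PurePosixPath(output_dir)
    # ASSUMPTION: the spec contains exactly one %s placeholder, which
    # receives the original leaf directory name.
    return str(path.with_name(leaf_format % path.name))


renamed = format_leaf("/outgoing/subject1/100307", "preview-%s")
print(renamed)
```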

[--maxdepth <dirDepth>]
The maximum depth to descend relative to the <inputDir>. Note that
this counts from zero. The default of '-1' means traverse the entire
directory tree.
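
A depth-limited walk of this kind can be sketched with os.walk. This is only an illustration under stated assumptions: walk_to_depth is a hypothetical helper, and the sketch takes "counts from zero" to mean that depth 0 is the <inputDir> itself.

```python
import os
import tempfile


def walk_to_depth(root, maxdepth=-1):
    """Yield (dirpath, filenames) down to maxdepth levels below root.

    ASSUMPTION: depth 0 is root itself; -1 means no limit.
    """
    root = os.path.abspath(root)
    base = root.rstrip(os.sep).count(os.sep)
    for dirpath, dirnames, filenames in os.walk(root):
        depth = dirpath.count(os.sep) - base
        yield dirpath, sorted(filenames)
        if maxdepth >= 0 and depth >= maxdepth:
            # Clearing dirnames in place stops os.walk from descending.
            dirnames[:] = []


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "a", "b", "c"))
    depth0 = [os.path.relpath(d, tmp) for d, _ in walk_to_depth(tmp, 0)]
    depth1 = [os.path.relpath(d, tmp) for d, _ in walk_to_depth(tmp, 1)]
    full = [os.path.relpath(d, tmp) for d, _ in walk_to_depth(tmp)]
```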

[--syslog]
If specified, prepend output 'log' messages in syslog style.

[--threads <numThreads>]
If specified, break the innermost analysis loop into <numThreads>
threads.
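
One common way to split a per-file loop across threads is a thread pool, sketched below in Python. This does not describe the plugin's internals: process_files and the stand-in job are hypothetical, and the sketch assumes that a thread count of 0 means "run sequentially".

```python
from concurrent.futures import ThreadPoolExecutor


def process_files(files, job, num_threads=0):
    """Apply job to every file, optionally using a pool of threads."""
    if num_threads <= 0:
        # ASSUMPTION: zero (or negative) threads means a plain loop.
        return [job(f) for f in files]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map preserves the order of the input list.
        return list(pool.map(job, files))


names = ["a.dcm", "b.dcm", "c.dcm"]
sequential = process_files(names, str.upper)
threaded = process_files(names, str.upper, num_threads=2)
```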

[--noJobLogging]
If specified, then suppress the logging of per-job output. Usually
each job that is run will have, in the output directory, three
additional files:

        %inputWorkingFile-returncode
        %inputWorkingFile-stderr
        %inputWorkingFile-stdout

By specifying this option, the above files are not recorded.
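
The per-job logging described above can be approximated with a short Python sketch. It is an assumption-laden illustration, not the plugin's implementation: run_job is a hypothetical name, and the file naming simply follows the three patterns listed above.

```python
import subprocess
import tempfile
from pathlib import Path


def run_job(cmd, output_dir, input_file, job_logging=True):
    """Run cmd in a shell and optionally record its result in three files."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    if job_logging:
        out = Path(output_dir)
        # One file each for the return code, stderr and stdout of the job.
        (out / f"{input_file}-returncode").write_text(str(result.returncode))
        (out / f"{input_file}-stderr").write_text(result.stderr)
        (out / f"{input_file}-stdout").write_text(result.stdout)
    return result.returncode


with tempfile.TemporaryDirectory() as tmp:
    code = run_job("echo hello", tmp, "scan.dcm")
    logged = sorted(p.name for p in Path(tmp).iterdir())
    stdout_text = (Path(tmp) / "scan.dcm-stdout").read_text()

with tempfile.TemporaryDirectory() as tmp:
    run_job("echo hello", tmp, "scan.dcm", job_logging=False)
    unlogged = sorted(p.name for p in Path(tmp).iterdir())
```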

[-h] [--help]
If specified, show help message and exit.

[--json]
If specified, show json representation of app and exit.

[--man]
If specified, print (this) man page and exit.

[--meta]
If specified, print plugin meta data and exit.

[--savejson <DIR>]
If specified, save json representation file to DIR and exit.

[--verbose <level>]
Verbosity level for app.

[--version]
If specified, print version number and exit.

To print the inline man page:

docker run --rm fnndsc/pl-pfdorun pfdorun --man

Any text in the CLI prefixed with a percent char % is interpreted in one of two ways.

First, any CLI argument to pfdo_run itself can be accessed via %. For example, %outputDir in the --exec string is expanded to the outputDir of pfdo_run.

Secondly, three internal '%' variables are available:

  • %inputWorkingDir - the current input tree working directory
  • %outputWorkingDir - the current output tree working directory
  • %inputWorkingFile - the current file being processed

These internal variables allow for contextual specification of values. For example, a simple CLI touch command could be specified as

--exec "touch %outputWorkingDir/%inputWorkingFile"

or a command to convert an input png to an output jpg using the ImageMagick convert utility

--exec "convert %inputWorkingDir/%inputWorkingFile
                %outputWorkingDir/%inputWorkingFile.jpg"
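
The substitution idea can be sketched in a few lines of Python. This is a minimal illustration, not pfdo_run's actual code: expand_tags, the context dictionary and its example paths are hypothetical, and the sketch assumes a tag name ends at the first character that is not a letter, digit or underscore.

```python
import re


def expand_tags(template, values):
    """Replace each %name in template with the matching value."""
    def lookup(match):
        name = match.group(1)
        # ASSUMPTION: unknown tags are left untouched rather than raising.
        return str(values.get(name, match.group(0)))

    return re.sub(r"%([A-Za-z]\w*)", lookup, template)


context = {
    "inputWorkingDir": "/incoming/100307",
    "outputWorkingDir": "/outgoing/100307",
    "inputWorkingFile": "slice.png",
}
cmd = expand_tags(
    "convert %inputWorkingDir/%inputWorkingFile "
    "%outputWorkingDir/%inputWorkingFile.jpg",
    context,
)
print(cmd)
```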

Furthermore, pfdo_run offers the ability to apply some internal functions to a tag. The template for specifying a function to apply is:

%_<functionName>[|arg1|arg2|...]_<tag>

Thus, a function is identified by a name that is prefixed and suffixed by an underscore and placed in front of the tag to process.

Possible args to the <functionName> are separated by pipe "|" characters. For example a string snippet that contains

%_strrepl|.|-_inputWorkingFile.txt

will replace all occurrences of . in the %inputWorkingFile with -. Note that the trailing .txt is preserved in the final result.
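
To show how such a template breaks down into its parts, here is a hedged Python sketch that parses the snippet above with a regular expression. The pattern and the name parse_function_tag are assumptions made for illustration; in particular, the sketch assumes that arguments contain neither "|" nor "_", which the real parser may or may not require.

```python
import re

# ASSUMPTION: function names are alphanumeric and arguments never
# contain "|" or "_".
FUNC_TAG = re.compile(r"%_([A-Za-z0-9]+)((?:\|[^|_]*)*)_([A-Za-z]\w*)")


def parse_function_tag(text):
    """Split a function template into (name, args, tag, trailing text)."""
    match = FUNC_TAG.search(text)
    if match is None:
        return None
    name, raw_args, tag = match.groups()
    args = raw_args.split("|")[1:] if raw_args else []
    return name, args, tag, text[match.end():]


parsed = parse_function_tag("%_strrepl|.|-_inputWorkingFile.txt")
print(parsed)

# Applying the parsed pieces by hand to a sample filename:
name, args, tag, rest = parsed
result = "brain.nii.gz".replace(args[0], args[1]) + rest
```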

The following functions are available:

%_md5[|<len>]_<tagName>

Apply an md5 hash to the value referenced by <tagName> and optionally
return only the first <len> characters.

%_strmsk|<mask>_<tagName>

Apply a simple mask pattern to the value referenced by <tagName>.
Chars that are * in the mask are passed through unchanged. The mask
and its target should be the same length.

%_strrepl|<target>|<replace>_<tagName>

Replace the string <target> with <replace> in the value referenced
by <tagName>.

%_rmext_<tagName>

Remove the "extension" of the value referenced by <tagName>. This of course
only makes sense if the <tagName> denotes something with an extension!

%_name_<tag>

Replace the value referenced by <tag> with a name generated by the faker
module.

Functions cannot currently be nested.
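
The behaviour of four of these functions can be approximated with the Python standard library, as in the sketch below. It is an illustration under assumptions, not the pfdo_run source: the dispatch table, the regular expression and apply_functions are hypothetical, rmext is assumed to strip only the last extension, and the name function is omitted because it depends on the third-party faker module.

```python
import hashlib
import os
import re


def md5(value, length=None):
    """Hex md5 digest of value, optionally truncated to length chars."""
    digest = hashlib.md5(value.encode()).hexdigest()
    return digest[: int(length)] if length else digest


def strmsk(value, mask):
    """Keep chars where the mask has '*'; otherwise use the mask char."""
    return "".join(v if m == "*" else m for v, m in zip(value, mask))


def strrepl(value, target, replace):
    return value.replace(target, replace)


def rmext(value):
    # ASSUMPTION: only the final extension is removed.
    return os.path.splitext(value)[0]


FUNCTIONS = {"md5": md5, "strmsk": strmsk, "strrepl": strrepl, "rmext": rmext}

# ASSUMPTION: arguments never contain "|" or "_".
FUNC_TAG = re.compile(r"%_([A-Za-z0-9]+)((?:\|[^|_]*)*)_([A-Za-z]\w*)")


def apply_functions(text, values):
    """Expand every %_func|args_tag occurrence found in text."""
    def substitute(match):
        name, raw_args, tag = match.groups()
        if name not in FUNCTIONS or tag not in values:
            return match.group(0)  # leave unknown patterns untouched
        args = raw_args.split("|")[1:] if raw_args else []
        return FUNCTIONS[name](values[tag], *args)

    return FUNC_TAG.sub(substitute, text)


values = {"inputWorkingFile": "brain.nii.gz"}
replaced = apply_functions("%_strrepl|.|-_inputWorkingFile.txt", values)
masked = apply_functions("%_strmsk|XXXXX*******_inputWorkingFile", values)
stripped = apply_functions("%_rmext_inputWorkingFile", values)
hashed = apply_functions("%_md5|7_inputWorkingFile", values)
```

Because each pattern is expanded in a single pass and the result is not rescanned, this sketch also does not support nesting.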

You need to specify input and output directories using the -v flag to docker run.

docker run --rm -u $(id -u) -ti                                         \
  -v $(pwd)/in:/in -v $(pwd)/out:/out                                   \
  -v $(pwd)/pfdorun:/usr/local/lib/python3.8/dist-packages/pfdorun      \
  fnndsc/pl-pfdorun pfdorun                                             \
  /in /out

Build the Docker container:

docker build -t local/pl-pfdorun .

Python dependencies can be added to setup.py. After a successful build, track which dependencies you have installed by generating the requirements.txt file.

docker run --rm local/pl-pfdorun -m pip freeze > requirements.txt

For the sake of reproducible builds, be sure that requirements.txt is up to date before you publish your code.

git add requirements.txt && git commit -m "Bump requirements.txt" && git push

Copy every file in the input tree to the matching location in the output tree:

docker run --rm -u $(id -u)                                 \
    -v $(pwd)/in:/incoming -v $(pwd)/out:/outgoing          \
    fnndsc/pl-pfdorun pfdorun                               \
    --exec "cp %inputWorkingDir/%inputWorkingFile %outputWorkingDir/%inputWorkingFile" \
    --threads 0 --printElapsedTime                          \
    --verbose 5                                             \
    /incoming /outgoing

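Assume the inputDir contains a file, input.json. Passing it as --inputFile skips the directory walk, so the --exec command runs only on the directory containing that file; here it archives the entire input tree into a single tarball: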

docker run -ti --rm -u $(id -u)                                         \
    -v /home/rudolphpienaar/data/convert_test:/incoming                 \
    -v $(pwd)/out:/outgoing                                             \
    fnndsc/pl-pfdorun                                                   \
    pfdorun --inputFile input.json                                      \
            --exec "tar cvfz %outputDir/out.tgz %inputDir"              \
            --threads 0                                                 \
            --printElapsedTime                                          \
            --verbose 5                                                 \
            /incoming /outgoing

Assume the inputDir has a file ending in tgz somewhere in the tree we wish to unpack:

docker run -ti --rm -u $(id -u)                                         \
    -v /home/rudolphpienaar/data/convert_test:/incoming                 \
    -v $(pwd)/out:/outgoing                                             \
    fnndsc/pl-pfdorun                                                   \
    pfdorun --fileFilter tgz                                            \
            --exec "tar xvfz %inputWorkingDir/%inputWorkingFile -C %outputDir"  \
            --threads 0                                                 \
            --printElapsedTime                                          \
            --verbose 5                                                 \
            /incoming /outgoing

Assume that the inputDir has many nested directories. One of them, 100307 contains a single file, brain.mgz. We wish to only copy this single file to the outputDir:

docker run -ti --rm -u $(id -u)                                         \
    -v $(pwd)/in:/incoming                                              \
    -v $(pwd)/out:/outgoing                                             \
    fnndsc/pl-pfdorun                                                   \
    pfdorun --fileFilter " " --dirFilter 100307                         \
            --exec "cp %inputWorkingDir/brain.mgz %outputWorkingDir/brain.mgz"  \
            --noJobLogging                                              \
            --threads 0                                                 \
            --printElapsedTime                                          \
            --verbose 5                                                 \
            /incoming /outgoing

To debug the containerized version of this plugin, simply volume map the source directories of the repo into the relevant locations of the container image:

docker run -ti --rm -v $PWD/in:/incoming:ro -v $PWD/out:/outgoing:rw    \
    -v $PWD/pfdorun:/usr/local/lib/python3.9/site-packages/pfdorun:ro   \
    fnndsc/pl-pfdorun pfdorun /incoming /outgoing

To enter the container:

docker run -ti --rm -v $PWD/in:/incoming:ro -v $PWD/out:/outgoing:rw    \
    -v $PWD/pfdorun:/usr/local/lib/python3.9/site-packages/pfdorun:ro   \
    --entrypoint /bin/bash fnndsc/pl-pfdorun

Remember to use the -ti flag for interactivity!
