A tool to archive and co-locate NGS data with project-level, sample-level, and analysis-level metadata.
- Overview
- Getting Started
  - 2.1 Dependencies
  - 2.2 Installation
- Run pyrkit
  - 3.1 Usage
  - 3.2 Required Arguments
  - 3.3 OPTIONS
  - 3.4 Example
pyrkit, pronounced park-it, automates the process of moving data from the cluster into object storage in HPC DME. It instantiates a collection hierarchy to archive raw data and results. pyrkit parses a project request template, a pipeline's output directory, and a MultiQC directory to capture project, analysis, and quality-control metadata. pyrkit was created to enable FAIR scientific data management and stewardship.
Please Note: Some of the metadata listed in the example above is pipeline-specific (i.e. only for the RNA-seq pipeline).
pyrkit has a few required dependencies. It requires the installation of the following programs:
Please note that if you are running pyrkit on Biowulf, the only dependency you will need to install is the HPC DME toolkit. pyrkit will attempt to module load jq and python/3.5 (which meets any Python requirements) if they are not in your $PATH.
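Outside Biowulf, it can help to confirm that the tools named above are reachable before running pyrkit. The following is a minimal sketch, not part of pyrkit itself: the helper name `missing_tools` is hypothetical, and it only checks `$PATH` with the POSIX `command -v` builtin (it does not check versions).

```shell
# Hypothetical helper (not part of pyrkit): print each named tool
# that cannot be found on $PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        # command -v exits non-zero when the tool is not found
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}
```

Calling `missing_tools jq python3` prints nothing when both are available, and otherwise lists whichever one is missing.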
Installation of pyrkit is easy! Please clone the repository from GitHub, create a virtual environment, and install any dependencies. Again, if you are on Biowulf, all you will need to do is clone the repository.
# Clone the Repository
git clone https://github.com/skchronicles/pyrkit.git
# Change into the cloned repository
cd pyrkit
# Steps below are optional for Biowulf users
# Create a virtual environment
python3 -m venv .venv
# Activate the virtual environment
. .venv/bin/activate
# Update pip
pip install --upgrade pip
# Download Dependencies
pip install -r requirements.txt
usage: pyrkit -i INPUT_DIRECTORY -o OUTPUT_VAULT -r REQUEST_TEMPLATE
-m MULTIQC_DIRECTORY -d DME_REPO [-p PROJECT_ID] [-n]
[-l] [-v] [-h] [--version]
Argument | Type | Description | Example |
---|---|---|---|
-i, --input-directory | Path | Pipeline output directory | /scratch/RNA_hg38/ |
-o, --output-vault | String | HPC DME vault to upload data | /CCBR_Archive |
-r, --request-template | File | Project Request Template | experiment_metadata.xlsx |
-m, --multiqc-directory | Path | MultiQC Output Directory | /scratch/RNA_hg38/multiqc_data/ |
-d, --dme-repo | Path | Path to an HPC DME toolkit install | ~/DME/HPC_DME_APIs/ |
Argument | Type | Description | Example |
---|---|---|---|
-p, --project-id | String | Project ID | ccbr-123 |
-n, --dry-run | Flag | Dry-run the entire pyrkit workflow | -n |
-l, --local-run | Flag | Upload to DME without job submission | -l |
-v, --validate | Flag | Validate entries before submission | -v |
-h, --help | Flag | Display help message and exit | -h |
--version | Flag | Display version information and exit | --version |
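Three of the required arguments (`-i`, `-r`, and `-m`) point at local paths, so a quick check that they exist can catch typos before pyrkit is invoked. This is only a sketch under stated assumptions: `check_inputs` is a hypothetical helper, not part of pyrkit, and it is unrelated to whatever pyrkit's own `-v` flag validates.

```shell
# Hypothetical pre-flight check (not part of pyrkit).
# Arguments: $1 = pipeline output directory (-i)
#            $2 = project request template file (-r)
#            $3 = MultiQC output directory (-m)
# Returns 0 if all three exist, 1 otherwise; problems go to stderr.
check_inputs() {
    status=0
    [ -d "$1" ] || { echo "missing input directory: $1" >&2; status=1; }
    [ -f "$2" ] || { echo "missing request template: $2" >&2; status=1; }
    [ -d "$3" ] || { echo "missing MultiQC directory: $3" >&2; status=1; }
    return "$status"
}
```

For example, `check_inputs /scratch/RNA_hg38/ experiment_metadata.xlsx /scratch/RNA_hg38/multiqc_data/ && echo "inputs look fine"` only prints the message when all three paths are present.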
# Grab an interactive node or submit pyrkit command to cluster
# Do not run this on the head node!
sinteractive --mem=8g --cpus-per-task=2
# Run pyrkit and submit a job to upload data to HPC DME (add -n to dry-run first)
./pyrkit -i /scratch/ccbr123/RNA_hg38/ \
-o /CCBR_Archive \
-r experiment_metadata.xlsx \
-m /scratch/ccbr123/RNA_hg38/multiqc_data/ \
-d ~/DME/HPC_DME_APIs/ \
-p ccbr-123
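Since the options table lists `-n` as a dry run of the entire workflow, one possible pattern is to dry-run with a set of arguments and only proceed to the real upload if that succeeds. The wrapper below is a sketch, not part of pyrkit: the function name `dry_run_then_archive` and the `PYRKIT` environment variable are hypothetical, and it assumes a failed dry run returns a non-zero exit status, which is not confirmed here.

```shell
# Hypothetical wrapper (not part of pyrkit): dry-run first, then run for real.
# PYRKIT is an assumed environment variable naming the pyrkit executable;
# it defaults to ./pyrkit as in the example above.
dry_run_then_archive() {
    pyrkit_bin=${PYRKIT:-./pyrkit}
    # First pass: append -n so pyrkit only dry-runs the workflow
    "$pyrkit_bin" "$@" -n || return 1
    # Second pass: same arguments without -n
    "$pyrkit_bin" "$@"
}
```

It takes the same arguments as pyrkit, for example `dry_run_then_archive -i /scratch/ccbr123/RNA_hg38/ -o /CCBR_Archive -r experiment_metadata.xlsx -m /scratch/ccbr123/RNA_hg38/multiqc_data/ -d ~/DME/HPC_DME_APIs/ -p ccbr-123`.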