`pret` is a programmable ETL pipeline for the CANDEL database. The input to `pret` is a collection of flat TSV data files and a specification (in EDN format) of how this data maps to the CANDEL schema. Using this input, `pret` will:
- parse the data according to the specification
- validate the data against a set of rules
- transact the data into the CANDEL database

`pret` is therefore programmed through data (the EDN specification) and lets you import new datasets into CANDEL without interacting with the database directly. Your job as a user is to provide the EDN specification that describes how your dataset maps to the CANDEL schema, and `pret` will do the rest.
`pret` uses environment variables / Java system properties for environment configuration:

Envar | systemProperty | Description | Default |
---|---|---|---|
`CANDEL_BASE_URI` | n/a | Base Datomic URI for CANDEL databases | nil |
`CANDEL_AWS_REGION` | `candel.awsRegion` | AWS Region | "us-east-1" |
`CANDEL_DDB_TABLE` | `candel.ddbTable` | DDB Table containing CANDEL Datomic DB | "candel-prod" |
`CANDEL_REFERENCE_DATA_BUCKET` | `candel.referenceDataBucket` | S3 bucket for reference data | "pret-processed-reference-data-prod" |
`CANDEL_MATRIX_BUCKET` | `candel.matrixBucket` | S3 bucket for matrix files | "candel-matrix" |
`CANDEL_MATRIX_DIR` | `candel.matrixDir` | Local dir for matrix files when using "file" | "matrix-store" |
`CANDEL_MATRIX_BACKEND` | `candel.matrixBackend` | Storage medium for matrix files (file or s3) | "s3" |

See the `org.candelbio.pret.db.config` namespace for details.
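
As a minimal sketch of how the variables listed above might be set, the following shell snippet switches the matrix store from S3 to a local directory. The variable names come from the table; the directory path is a made-up example, and whether `pret` creates that directory itself is not stated here, so treat this as an illustration rather than a verified setup.

```shell
# Hypothetical local configuration: keep matrix files on disk instead of S3.
# Variable names are taken from the table above; the values are examples only.
export CANDEL_AWS_REGION="us-east-1"                  # same as the documented default
export CANDEL_MATRIX_BACKEND="file"                   # documented options: file or s3
export CANDEL_MATRIX_DIR="$HOME/data/matrix-store"    # assumed path, adjust to your machine

# Show the resulting settings so they can be checked before running pret.
env | grep '^CANDEL_' | sort
```

Variables that are left unset fall back to the defaults in the table.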

AWS permissions should be configured at the environment level for use of pret.

`pret` requires the following installed:
- Java version 1.8 or 1.9. See OpenJDK for installation instructions.
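
One way to check this requirement before running `pret` is sketched below. The helper names `java_version_supported` and `detect_java_version` are hypothetical and not part of `pret`. The sketch assumes that Java 8 reports itself as `1.8.x` and that Java 9 may report either `1.9.x` or `9.x`, and that `java -version` prints a quoted version string on its first line.

```shell
# Hypothetical helper: succeeds when the version string looks like Java 8 or Java 9.
# Java 8 reports "1.8.x"; Java 9 is assumed to report "9", "9.x.y", or "1.9.x".
java_version_supported() {
  case "$1" in
    1.8|1.8.*|1.9|1.9.*|9|9.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: extracts the quoted version from `java -version`,
# which writes its output to stderr (hence the 2>&1).
detect_java_version() {
  java -version 2>&1 | head -n 1 | sed -n 's/.*version "\([^"]*\)".*/\1/p'
}
```

A typical use would be `java_version_supported "$(detect_java_version)" || echo "unsupported Java version" >&2`.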

See the `DEVELOPMENT.md` file for details on building the `pret` jar.

Invoke `pret` from the directory where you downloaded the `pret.jar` file with:
- `./pret` on Linux or macOS
- `pretw.bat` on Windows

This will echo the command-line usage options.

The following example provisions a test database as specified in `~/repos/pret/example-data/datomic.conf.edn`, prepares data as specified in `~/repos/pret-datasets/tcga/config.edn` into the working directory `~/data/tcga-import/tmp-working`, and then transacts the data into the database provisioned by the `provision` task.

./pret provision --datomic-config ~/repos/pret/example-data/datomic.conf.edn
./pret prepare --import-config ~/repos/pret-datasets/tcga/config.edn --working-directory ~/data/tcga-import/tmp-working
./pret transact --datomic-config ~/repos/pret/example-data/datomic.conf.edn --working-directory ~/data/tcga-import/tmp-working
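
The three commands above can also be chained so that a later step only runs if the earlier one succeeded. The sketch below wraps them in a hypothetical shell function, `run_import`, which is not part of `pret`. It assumes that `pret` exits with a non-zero status on failure, which is not confirmed here. The `PRET` variable is likewise an assumption of this sketch: it defaults to `./pret` and can be set to `echo` for a dry run that only prints the commands.

```shell
# Hypothetical wrapper: provision, prepare, and transact in order,
# stopping at the first step that fails.
#   $1 = Datomic config file, $2 = import config file, $3 = working directory
run_import() {
  pret_cmd="${PRET:-./pret}"   # override with PRET=echo for a dry run
  "$pret_cmd" provision --datomic-config "$1" &&
  "$pret_cmd" prepare --import-config "$2" --working-directory "$3" &&
  "$pret_cmd" transact --datomic-config "$1" --working-directory "$3"
}
```

With the paths from the example, this would be called as `run_import ~/repos/pret/example-data/datomic.conf.edn ~/repos/pret-datasets/tcga/config.edn ~/data/tcga-import/tmp-working`.
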
For developer notes see DEVELOPMENT.md.
Copyright Parker Institute for Cancer Immunotherapy, 2022

Licensing: This software is released under the Apache 2.0 License.