Pipeline of Bgee release 14.0
Bgee pipeline documentation: content
Bgee is a database to retrieve and compare gene expression patterns in multiple animal species, produced from multiple data types (RNA-Seq, Affymetrix, in situ hybridization, and EST data).
Bgee is based exclusively on curated "normal", healthy, expression data (e.g., no gene knock-out, no treatment, no disease), to provide a comparable reference of normal gene expression.
Bgee produces ranked calls of presence/absence of expression, and of differential over-/under-expression, integrated with information on gene orthology and on homology between organs. This allows comparisons of expression patterns between species.
pipeline/: contains the Makefiles and scripts used to generate data. The Bgee pipeline is a mixture of Perl scripts, R scripts, and Java code, all called through Makefiles. This directory is organized into sub-directories corresponding to steps in the pipeline, each containing a Makefile to be run. Common parameters are defined in the files pipeline/Makefile.common (for non-sensitive variables) and pipeline/Makefile.Config (for sensitive variables, e.g., passwords).
source_files/: contains the source data used for the pipeline, for instance, files from annotators in TSV format, downloaded ontologies to use, downloaded files from MOD, etc.
generated_files/: contains the files and outputs generated by the pipeline.
java/: contains files related to the Bgee Java project. Some Makefiles use them through command line arguments.
Documenting the pipeline
The pipeline is documented directly in the relevant directories of the pipeline, in README.md files. See the pipeline/ directory as a starting point. Each README.md covers:
- Aim of the step, and previous steps required.
- Details: explanations about the step
- Data generation: how to run the step
- Data verification: how to verify that the step was correctly run
- Error handling: what to do when confronted with typical errors
- Other notable Makefile targets: individual Makefile targets that could be useful
Mandatory variables and import
Two variables that are used in pipeline/Makefile.common must be defined by each Makefile:
PIPELINEROOT: defines the path to the root folder of the pipeline, relative to the Makefile (i.e., to target the directory pipeline/).
DIR_NAME: defines the name of the directory where the Makefile lives (e.g., species/). This allows folders of the same name to be used in source_files/ and generated_files/.
Example start of a Bgee Makefile in pipeline/species/:
PIPELINEROOT := ../
DIR_NAME := species/
include $(PIPELINEROOT)Makefile.common
Step verification file
The all target of the Makefile must always generate a step verification file, whose path is accessible through the variable $(VERIFICATIONFILE). The all target should either be the first target defined in the Makefile, or be assigned as the default goal (.DEFAULT_GOAL := all).
This verification file must contain information that makes it easy to assess whether all the targets of the Makefile were successfully run. For instance, to generate this file, a Makefile could query the database to verify that data were correctly inserted.
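As a minimal sketch of the default-goal option, assuming a Makefile in pipeline/species/ in which another target is defined before all (the clean target is hypothetical and only serves as an illustration):

```make
PIPELINEROOT := ../
DIR_NAME := species/
include $(PIPELINEROOT)Makefile.common

# Hypothetical helper target, defined first in the file.
# $(RM) is predefined by GNU make as "rm -f".
clean:
	@$(RM) $(VERIFICATIONFILE)

# Without this assignment, 'clean' would be the default goal,
# because it is the first target defined in this Makefile.
.DEFAULT_GOAL := all

all: $(VERIFICATIONFILE)

# The rule generating $(VERIFICATIONFILE) would follow here.
```

With this assignment, running make without arguments builds all rather than clean.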
Example Makefile targets:
all: $(VERIFICATIONFILE)
...
$(VERIFICATIONFILE): dependency1 dependency2 ...
	@$(MYSQL) -e "SELECT * FROM taxon where bgeeSpeciesLCA = TRUE order by taxonLeftBound" > $@.temp
	@$(MV) $@.temp $@
In order to more efficiently save input files and files generated by a pipeline run, they are kept in specific folders, not mixed with pipeline scripts, in source_files/ and generated_files/. The directory structure in these folders should be the same as in the pipeline/ directory.
No Makefile should read directly from or write directly into the pipeline/ folder.
When a file name, URL, etc. is used in several Makefiles, it should be assigned to a variable in pipeline/Makefile.common. Notable variables:
SOURCE_FILES_DIR: corresponds to
INPUT_DIR: corresponds to
GENERATED_FILES_DIR: corresponds to
OUTPUT_DIR: corresponds to
VERIFICATIONFILE: corresponds to
- useful commands:
$(CURL): and see
$(APPEND_CURL_COMMAND); allows downloading a file only if the remote file is newer than the local file, and saving it to a temporary file until the transfer has completed successfully.
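As a sketch of how these variables might be used in a rule, the fragment below reads an input file and writes a generated file without touching the pipeline/ folder. It rests on several assumptions: that $(INPUT_DIR) and $(OUTPUT_DIR) point to the step's own folders under source_files/ and generated_files/, that they end with a slash like PIPELINEROOT and DIR_NAME, and that $(MV) is defined in pipeline/Makefile.common. Both file names are hypothetical.

```make
PIPELINEROOT := ../
DIR_NAME := species/
include $(PIPELINEROOT)Makefile.common

# Hypothetical file names, for illustration only.
# Assumption: $(INPUT_DIR) and $(OUTPUT_DIR) end with a slash.
$(OUTPUT_DIR)species_sorted.tsv: $(INPUT_DIR)species_annotation.tsv
	# Write to a temporary file first, so that an interrupted run
	# does not leave a partial target behind.
	@sort $< > $@.temp
	@$(MV) $@.temp $@
```

Writing to $@.temp and then renaming it mirrors the pattern used for the verification file: the target only appears once the command has succeeded.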
If a variable contains sensitive information (e.g., a password), it should be defined in pipeline/Makefile.Config. The actual values of these variables should not be versioned (this is simpler than encrypting the file).
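One possible way to follow this rule is to version only placeholder values and fill in the real ones locally. The fragment below is a sketch under that assumption. The variable names are hypothetical and are not taken from the actual pipeline/Makefile.Config.

```make
# pipeline/Makefile.Config (sketch with placeholder values only).
# All variable names below are hypothetical.
DBHOST := localhost
DBUSER := CHANGE_ME
DBPASS := CHANGE_ME
```

Real values would replace the placeholders in the local working copy only, and would not be committed.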
Versioning input/output files
Source and generated files should be versioned using git, if not too large. This versioning is not performed automatically by the Makefiles. It is the responsibility of the person in charge to version the relevant files when a step is completed.