Adapting CellProfiler to a LIMS environment
At the Broad Institute, we split a CellProfiler analysis into a number of small jobs which are run on separate cores in headless mode. CellProfiler is optimized to run analyses headless on a single thread in order to get predictable concurrency with one CellProfiler instance per blade core. There are command-line switches that let you execute a partial analysis, switches that let you specify the inputs and outputs, and modules whose primary target environment is a lab information system. This page describes some best practices for integrating CellProfiler into your LIMS workflow.
CellProfiler can also be run in the cloud to distribute jobs among many machines. We have scripts and configuration files for running CellProfiler in distributed mode using the Amazon Web Services platform. If you are interested in this setup, visit the Distributed-CellProfiler project page to get more details.
You might wish to take a look at this paper led by Novartis describing Jenkins-CI, "an Open-Source Continuous Integration System, as a Scientific Data and Image-Processing Platform".
Table of Contents
- Input modules and --file-list (new in CP 2.1.2)
- Writing your own image loading module
- Measurement output formats
- Headless operation
CellProfiler has two modules which have been designed to work in a server-farm environment: LoadData and CreateBatchFiles. They can be used in conjunction or used separately with the choice being largely determined by your analysis workflow.
LoadData takes a .csv file as input. Each row of the .csv file supplies the input data for one execution cycle of the researcher's pipeline, and each column supplies either image file location information or metadata such as physical location (plate, well and site) or sample treatment. The .csv for LoadData can be generated using a SQL query on a LIMS database. For example, a researcher could submit a request for robotically-prepared plates to be analyzed by CellProfiler, along with the pipeline or pipelines to be run. The LIMS would create the LoadData .csv using a query whose fields specify the per-channel image file locations, the plate, well and site, and the perturbant for the well. The pipeline and .csv could then be farmed out to a number of execution nodes. LoadData can reference images by file name or by URL; the "file:", "http:" and "ftp:" URL schemes and the special-case "omero:" scheme are supported. The "omero:" scheme allows CellProfiler to download images from an OMERO server. CellProfiler (as of c2cb201) can take the name of the LoadData .csv file from the command-line if you supply it using the --data-file switch. This means that a researcher can create a pipeline using a short list of representative images and a short LoadData file, then submit the pipeline to the LIMS. The LIMS can then generate LoadData .csv files for the researcher and replace the researcher's example .csv with the production .csv without modification to the pipeline, the researcher's .csv or to CellProfiler.
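As a rough illustration of the LIMS side of this workflow, the sketch below turns rows from a LIMS query into a LoadData .csv. The input dictionary keys (`plate`, `well`, `site`, `treatment`, `<channel>_path`) and the function name are hypothetical. The column names follow the `Image_PathName_<channel>`, `Image_FileName_<channel>` and `Metadata_<key>` naming that LoadData files commonly use; check them against the LoadData help for your CellProfiler version.

```python
import csv
import posixpath


def write_loaddata_csv(rows, channels, out):
    """Write one LoadData row per image set to the open text file `out`.

    rows: iterable of dicts, as a LIMS query might return them
          (hypothetical keys, see the lead-in above).
    channels: channel names used in the pipeline, e.g. ["DNA", "Actin"].
    """
    header = []
    for channel in channels:
        header += ["Image_PathName_%s" % channel, "Image_FileName_%s" % channel]
    header += ["Metadata_Plate", "Metadata_Well", "Metadata_Site",
               "Metadata_Treatment"]

    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        record = []
        for channel in channels:
            # Split the full image path into directory and file name.
            directory, filename = posixpath.split(row[channel + "_path"])
            record += [directory, filename]
        record += [row["plate"], row["well"], row["site"], row["treatment"]]
        writer.writerow(record)
```

The resulting file would then be passed to each job with the --data-file switch, leaving the researcher's pipeline untouched.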
CellProfiler desktop users may be more comfortable using the input modules to build image sets from a list of files. These pipelines can be used in a headless environment if a file list is supplied on the command-line using the --file-list switch. This file list takes the place of the one that the user assembles in the Images module. The best practice is to use one file list per job, without the -f and -l switches to minimize the time spent assembling each job's image sets. One of the advantages of this approach is the simplicity of the deployment: the file list can be constructed from a simple directory listing and the logic that builds the image set can be the responsibility of the researcher rather than the LIMS system.
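A minimal sketch of the "simple directory listing" approach follows. It walks one job's image directory, writes one path per line, and builds a command line that uses only switches described on this page. The executable name `cellprofiler`, the extension filter and both function names are assumptions; adjust them to your installation.

```python
import os

# Assumption: which file extensions count as images for this job.
IMAGE_EXTENSIONS = (".tif", ".tiff", ".png")


def write_file_list(image_dir, list_path, extensions=IMAGE_EXTENSIONS):
    """Write the absolute paths of all images under image_dir, one per line."""
    paths = []
    for root, _dirs, files in os.walk(image_dir):
        for name in files:
            if name.lower().endswith(extensions):
                paths.append(os.path.abspath(os.path.join(root, name)))
    paths.sort()
    with open(list_path, "w") as fd:
        for path in paths:
            fd.write(path + "\n")
    return paths


def file_list_command(pipeline, list_path, output_dir,
                      executable="cellprofiler"):
    """Build a headless command line; no -f/-l, one file list per job."""
    return [executable, "-c", "-r", "-p", pipeline,
            "--file-list", list_path, "-o", output_dir]
```

Keeping one file list per job means each job only assembles the image sets it will actually process.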
The CreateBatchFiles module is placed at the end of a pipeline. The pipeline is then executed and the result is a Batch_data.h5 (HDF5) file which contains the image set list and pipeline. This file can then be submitted to CellProfiler on the command-line and CellProfiler will run in batch mode, without its user interface, to process the pipeline. Typically, a user or LIMS will break a long image set list into pieces and execute each of these pieces using the command-line switches -f and -l to specify the first and last image sets in each job. The advantage of CreateBatchFiles from the researcher's perspective is that the Batch_data.h5 file generated by the module captures all of the data needed to run the analysis. The researcher is both responsible for and in control of the choice and layout of files and, in many situations, all that is needed is a mechanism for farming out the jobs to cluster nodes (or alternatively, the researcher can submit the jobs manually to their cluster). A researcher may include a LoadData module and a CreateBatchFiles module in a pipeline; in this case, it's usually the researcher's choice to use LoadData and some scripting to organize the analysis's image files.
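Breaking an image set list into jobs amounts to computing 1-based, inclusive (first, last) ranges for the -f and -l switches. The sketch below shows one way to do that. The executable name `cellprofiler` and the function names are assumptions, and the total number of image sets has to come from your own bookkeeping.

```python
def batch_ranges(n_image_sets, batch_size):
    """Yield (first, last) image set numbers, 1-based and inclusive."""
    for first in range(1, n_image_sets + 1, batch_size):
        yield first, min(first + batch_size - 1, n_image_sets)


def batch_commands(batch_file, n_image_sets, batch_size,
                   executable="cellprofiler"):
    """Build one headless command line per job for a Batch_data.h5 file."""
    return [
        [executable, "-c", "-r", "-p", batch_file,
         "-f", str(first), "-l", str(last)]
        for first, last in batch_ranges(n_image_sets, batch_size)
    ]
```

For example, 10 image sets in batches of 4 give the ranges (1, 4), (5, 8) and (9, 10). Each command list can be handed to a cluster scheduler or to `subprocess`.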
Running from Batch_data.h5 disables database table creation and the initial population of some tables during batch job initialization. If you use MySQL and multiple batches and do not use CreateBatchFiles, each job will attempt to overwrite the schema during startup. In CellProfiler version 2.1.2 and later, you can prevent this by running one job to initialize the schema and then use the --do-not-write-schema command-line switch to prevent later jobs from overwriting the database schema.
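One way to script this, assuming you already have a list of per-job command lines, is to leave the first job unchanged so that it creates the schema and to append --do-not-write-schema to every other job. The function name is hypothetical.

```python
def split_schema_jobs(commands):
    """Return (initializer, followers) from a list of per-job command lines.

    The first command is returned unchanged so that it writes the schema.
    Every later command gets --do-not-write-schema appended.
    """
    if not commands:
        raise ValueError("at least one job is required")
    initializer = list(commands[0])
    followers = [list(cmd) + ["--do-not-write-schema"] for cmd in commands[1:]]
    return initializer, followers
```

The initializer should be run to completion (for example with `subprocess.check_call`) before the followers are submitted, otherwise the followers may start before the tables exist.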
We have requests from time to time from LIMS integrators who want to write their own image loading module. Often the idea is to have long-running copies of CellProfiler poll for images deposited in a target directory by a microscope. Other use cases are images fetched from a database or transmitted via some network protocol. Image loading modules are complex and interact with interfaces in CellProfiler that may be subject to change. Many of these interactions are not obvious at the outset. LIMS integrators should consider writing an application that invokes CellProfiler as a command-line tool instead, using the command-line switches as a stable, documented interface. An independent code base can then be used to marshall the image inputs and harvest CellProfiler's outputs in a manner best suited for the LIMS environment.
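A hedged sketch of such an external application is shown below. It polls a directory for new images and prepares a command-line job for them instead of loading images inside CellProfiler. The function names, the extension filter and the executable name `cellprofiler` are assumptions, and the pipeline is assumed to be built on the input modules so that --file-list applies.

```python
import os


def find_new_images(directory, seen, extensions=(".tif", ".tiff")):
    """Return paths of image files that appeared since the previous call.

    `seen` is a set of file names that the caller keeps between calls.
    """
    names = [n for n in os.listdir(directory)
             if n.lower().endswith(extensions)]
    fresh = sorted(set(names) - seen)
    seen.update(fresh)
    return [os.path.join(directory, n) for n in fresh]


def job_for_images(pipeline, image_paths, list_path, output_dir,
                   executable="cellprofiler"):
    """Write a file list for the new images and return the command to run."""
    with open(list_path, "w") as fd:
        fd.write("\n".join(image_paths) + "\n")
    return [executable, "-c", "-r", "-p", pipeline,
            "--file-list", list_path, "-o", output_dir]
```

A real poller would also need to wait until the microscope has finished writing each file; that check is left out here.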
CellProfiler saves derivative images and measurements as output. The images are usually illustrative and are used to assess proper operation and to guide machine-learning classification in CellProfiler Analyst and similar programs. CellProfiler's measurements are its most useful output in a high-throughput context. Researchers will run CellProfiler, then use statistical analysis and machine learning to score samples as hit candidates based on the sample image and cell measurements or to answer other scientific questions. CellProfiler can generate thousands of measurements per cell and identify hundreds of cells per image resulting in databases containing on the order of a trillion feature values. There are four categories of features:
- Per-experiment features: these are single values that apply to the whole analysis. An example is Pipeline_Pipeline which is a text copy of the pipeline used to perform the analysis.
- Per-image features: these are features organized into one row per image set and one column per feature. Some features are text, such as image file locations and metadata and others are numeric, such as whole image average intensity or the thresholding value used during segmentation.
- Per-object features: A CellProfiler pipeline can perform one or more segmentations which partition the image into many objects. CellProfiler can then calculate measurements on each object, producing many measurements of the same feature per image. Most pipelines identify cells and produce segmentations of the cell nucleus, cytoplasm and entire cell and in these cases, there is a one to one mapping between nucleus, cytoplasm and cell and an easy way to combine all per-object features into a single table. Other pipelines may segment and measure sub-compartments of a cell such as the mitochondria and in this case, there may be a many to one relation requiring separate tables per-object.
- Object relationships: CellProfiler produces a table of relational links between objects. The relationships can be between cell constituents and the segmentation of the whole cell, between neighbors or between segmentations of the same cell in images of the same site, taken at different times. The object relationships are intended to give researchers unambiguous linkages that can be mined using a relational database.
CellProfiler has two modules dedicated to feature output, ExportToSpreadsheet and ExportToDatabase. In addition, CellProfiler outputs an HDF5 file containing the measurement values. ExportToDatabase operates in one of two modes: an offline mode which stores table definitions and .csv files of measurements in files for batch upload and an online mode which constructs the database schema at the start of analysis and populates it during the analysis. Each mechanism has strengths and weaknesses; to our knowledge, high-throughput screening centers have successfully used both modes of ExportToDatabase and have used ExportToSpreadsheet as part of their data collection strategy. We discourage the use of the HDF5 file for storage because its format might change in the future and because the .csv output is simple and unambiguous.
ExportToSpreadsheet, despite its name, adapts well to databases. The output is a series of columnar .csv files with a header as the first line in the file giving the column's feature names. The image .csv file has an image index column and the object .csv files have image and object index columns which are suitable as primary keys if imported into a database. The object relationships .csv has two sets of image and object index columns which can be used to join two object tables. LIMS systems choose ExportToSpreadsheet because its output is not tied to a particular SQL variant syntax and because it makes no assumptions about the LIMS database schema.
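To illustrate how the index columns can serve as primary keys, the sketch below loads one ExportToSpreadsheet .csv into a database table. SQLite is used only because it ships with Python; it stands in for whatever database the LIMS uses. The index column names in the usage note (`ImageNumber`, `ObjectNumber`) are assumptions to verify against your own output, and the function name is hypothetical.

```python
import csv
import sqlite3


def import_csv(conn, table, csv_file, key_columns):
    """Create `table` from the .csv header and load all rows into it.

    key_columns: header names that together form the primary key.
    Columns are declared NUMERIC so that SQLite stores numeric text as
    numbers and leaves other text (file names, metadata) as text.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join('"%s" NUMERIC' % name for name in header)
    keys = ", ".join('"%s"' % name for name in key_columns)
    conn.execute('CREATE TABLE "%s" (%s, PRIMARY KEY (%s))'
                 % (table, columns, keys))
    placeholders = ", ".join("?" for _ in header)
    conn.executemany('INSERT INTO "%s" VALUES (%s)' % (table, placeholders),
                     reader)
    conn.commit()
```

An object file would be imported with `key_columns=["ImageNumber", "ObjectNumber"]` and the image file with `key_columns=["ImageNumber"]`, after which the tables can be joined on the image index.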
ExportToDatabase can produce a MySQL script to populate a database schema and .csv files that are imported to populate the database tables. The SQL data definition commands we use are somewhat generic, but have never been tested with other databases and are only supported for MySQL. There are several advantages to using ExportToDatabase in offline mode. Database tables are in a format that researchers expect and researchers need less coordination with the IT staff regarding their experiment's schema. CellProfiler only needs to interact with the file system and requires no database passwords.
ExportToDatabase can connect directly to a MySQL database and store image and object measurements from multiple analysis jobs. Measurements are written to the database once per image set. ExportToDatabase in online mode requires the least amount of coordination between database maintainers and researchers because it populates the database schema and tables directly. It does require that the database connection parameters, including password, be stored in the pipeline in cleartext - this may be a security concern in some situations.
CellProfiler writes its measurements to an HDF5 file during the course of operation to offload measurement data from memory. When run on the command-line, CellProfiler interprets the first non-switch command-line argument as the name of the HDF5 file to use for measurement output; it uses a temporary file for output if no name is supplied. The HDF5 file format is documented here. There may be advantages to harvesting the measurements from the HDF5 file such as smaller file size and faster access, but it's likely that a future version of CellProfiler will replace the current format with one that conforms to a community standard (Dougherty, Unifying Biological Image Formats with HDF5; [Sommer, CellH5: a format for data exchange in high-content screening](http://bioinformatics.oxfordjournals.org/content/early/2013/05/08/bioinformatics.btt175.full)).
As stated above, users can specify the LoadData .csv file on CellProfiler's command-line to run different image set lists on the same pipeline. Each analysis will produce the same output files and will reuse image indexes starting at 1. If the pipeline uses ExportToSpreadsheet or ExportToDatabase in offline mode, the resulting files will have overlapping image indexes and, if stored to an absolute disk location, will overwrite each other. If ExportToDatabase is run in online mode, subsequent analyses will delete the schema produced by previous analyses (breaking a single analysis into separate jobs works). There are a couple of strategies that can alleviate these problems. Perhaps the easiest is to use ExportToSpreadsheet or ExportToDatabase in offline mode and to specify file locations relative to CellProfiler's default output directory. Each job can use a different default output directory and this directory can be set on the command-line using the "-o" switch. A second strategy is to use a piece of metadata such as a plate name as part of the output file name or directory in ExportToSpreadsheet. CellProfiler will write the data from different plates into different files.
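The first strategy can be sketched as follows: each plate gets its own LoadData .csv and its own default output directory, so that output files written relative to that directory cannot collide. The mapping of plate names to .csv paths, the function name and the executable name `cellprofiler` are assumptions.

```python
import os


def per_plate_commands(pipeline, plate_csvs, output_root,
                       executable="cellprofiler"):
    """Build one command per plate, each with its own -o directory.

    plate_csvs: dict mapping a plate name to its LoadData .csv path.
    """
    commands = []
    for plate, csv_path in sorted(plate_csvs.items()):
        output_dir = os.path.join(output_root, plate)
        commands.append([executable, "-c", "-r", "-p", pipeline,
                         "--data-file", csv_path, "-o", output_dir])
    return commands
```

Image indexes still restart at 1 in every output directory, so a downstream import needs to add the plate (or job) identifier to the key.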
CellProfiler is built to run headless. It makes no UI calls in headless mode and you can run it on Linux without an X server. However, ImageJ 1.x assumes it's being run with a UI. We've implemented a headless-but-not-headless mode that's enabled by setting the environment variable "CELLPROFILER_USE_XVFB" to one. On our compute cluster, we use the X application Xvfb to set up an X server and virtual frame buffer that provides enough of a UI for ImageJ to run. We also start up Java without the command-line switch "-Djava.awt.headless=true", and this lets ImageJ 1.x operate correctly. Our script for launching Python in headless-but-not-headless mode is https://github.com/CellProfiler/CellProfiler/blob/60f697e8dfc58579550e80309405381d67ba7e5c/cellprofiler/utilities/cpjvm.py
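When jobs are launched from a script, the environment variable can be set for the child process only rather than for the whole session. The sketch below builds such an environment; the function name is hypothetical and the value "1" is our reading of "one" above.

```python
import os


def xvfb_environment(base_env=None):
    """Return a copy of the environment with CELLPROFILER_USE_XVFB set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CELLPROFILER_USE_XVFB"] = "1"
    return env
```

The result can be passed as the `env` argument of `subprocess.call` or `subprocess.Popen` when starting a CellProfiler job.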
One of our primary focuses for CellProfiler is interoperability with other software packages developed by our community. We have strong working relationships with OMERO, the ImageJ 2.0 effort and BISQUE and we will include a tight integration with OMERO in an upcoming release. We will be contributing to and probably will adopt file format standards for image analysis as they become mature. We would encourage LIMS developers to consider using these community tools and to use CellProfiler's interoperability with them as part of your CellProfiler integration.
The following is a list of command-line switches that may be useful in a cluster-computing or LIMS environment:
- -p Pipeline to run. This is the file location of the pipeline that controls the analysis. This can also be a measurements file containing a pipeline or a Batch_data.h5 file as output by the CreateBatchFiles module.
- -c Run headless. This directs CellProfiler to run without a user interface.
- -r Run in batch mode. This directs CellProfiler to run an analysis using the supplied pipeline.
- -i Set the default input directory. This is a file path to the root directory for image files and other configuration files needed by the pipeline. The default input directory is being deprecated and we anticipate that pipelines created by future versions of CellProfiler will generally not use it.
- -o Set the default output directory. This is the file path to the directory used to store images and measurement data.
- --data-file Provide the location to an alternate .csv file to LoadData. If this switch is present, this file is used instead of the one specified in the LoadData module.
- -f Set the first image set for the job (starting at 1 for the first).
- -l Set the last (inclusive) image set for the job.
- -g If using the grouping feature, -g specifies the metadata keys and values of one group of image sets to be processed by the job. As an example, a pipeline might operate on a time series where each well was imaged multiple times. If the pipeline is grouped by metadata named "Plate" and "Well", a possible -g switch might be "-g Plate=A945G6,Well=A01". See CellProfiler's documentation on grouping for more details.
- --print-groups If using the grouping feature, --print-groups will print a JSON string that documents the grouping metadata and image numbers of each group in the analysis. The top level of the data structure is a list, with one group per list element. Each group element contains a dictionary and a list of image set numbers for the group. The dictionary keys are the metadata keys for the "-g" switch and the values are the corresponding "-g" metadata values.
- --plugins-directory CellProfiler users can write their own plugin modules. In batch mode, CellProfiler loads any plugins it finds in the directory specified by this switch.
- --ij-plugins-directory CellProfiler can use its RunImageJ module to run ImageJ plugins. In batch mode, CellProfiler looks for these plugins in the --ij-plugins-directory.
- -t Set the temporary directory. CellProfiler offloads data in memory to temporary files. It stores those files in the temporary directory.
- --jvm-heap-size This switch sets the amount of memory reserved for CellProfiler's Java Virtual Machine. Values are generally specified in megabytes, e.g. "--jvm-heap-size 1024m"
- -b This switch prevents the source-code version of CellProfiler from recompiling its sources. In a cluster environment, CellProfiler is generally downloaded once and run once with the --build-and-exit switch to compile it; afterwards, the -b switch is used to prevent recompilation.
- --do-not-fetch CellProfiler will download some binaries if run from source code. This switch prevents the check for these binaries.
- --do-not-write-schema (CP 2.1.2 and later) This switch prevents jobs from writing the database schema if MySQL is used. In a cluster environment, this switch can be used to prevent jobs from racing to create tables.
- --measurements In conjunction with the "-p" switch, this switch will print the measurements made by a pipeline. This printout can be used to populate a database schema with the anticipated output columns or can be used with data flow tools such as Knime to discover the pipeline's outputs prior to analysis. The format is similar to .csv and consists of a line acting as a column header, followed by rows consisting of object name (e.g. Image or Nuclei), measurement name (e.g. AreaShape_Area), and the SQL column type.
- --get-batch-commands This switch is meant to be followed by the Batch_data.h5 file output by the CreateBatchFiles module. When specified, CellProfiler outputs one line per job to be run. This output should be further processed to generate a script that can invoke the jobs in a cluster-computing context.
- --file-list (CP 2.1.2 and later) This switch lets you specify a file list for a pipeline based on the input modules (Images, Metadata, NamesAndTypes, and Groups). The file list can contain either file paths or URLs, one per line. These will be used in place of the list that appears in the Images UI.
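The --print-groups output described above can be turned into one -g switch per job. The sketch below assumes that each group element is a two-item list of the form `[metadata dictionary, image set numbers]`; that layout is our reading of the description, so inspect real output before relying on it. The function name is hypothetical.

```python
import json


def group_switches(print_groups_output):
    """Return a list of (["-g", "Key=value,..."], image_numbers) pairs."""
    groups = json.loads(print_groups_output)
    result = []
    for metadata, image_numbers in groups:
        # Sort the keys so the switch is built in a stable order.
        value = ",".join("%s=%s" % (key, metadata[key])
                         for key in sorted(metadata))
        result.append((["-g", value], image_numbers))
    return result
```

Each returned switch can be appended to a headless command line so that one job processes exactly one group.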