Validate that metadata complies with a set of rules

Validate metadata

This is a library, web application and set of scripts to test whether or not metadata conforms to a set of rules. It has been developed to support the FAANG project.

FAANG validation tools are available at https://www.ebi.ac.uk/vg/faang.

The expected use of this software is to:

  1. Read some metadata with a parser.
  2. Convert them to a common format, our metadata model.
  3. Evaluate their compliance with a set of rules.
  4. Report on their compliance.

We also provide a converter to simplify the preparation of BioSamples submissions for FAANG.

Parsers

We have parsers for the following formats:

  1. SRA (Sequence Read Archive) experiment XML (Bio::Metadata::Loader::XMLExperimentLoader)
  2. SRA sample XML (Bio::Metadata::Loader::XMLSampleLoader)
  3. Simple spreadsheets (.tsv) (Bio::Metadata::Loader::TSVLoader)
  4. BioSamples records (via BioSamples REST API, using the BioSD library) (Bio::Metadata::Validate::Support::BioSDLookup)
  5. JSON serialisation of Bio::Metadata::Entity (Bio::Metadata::Loader::JSONEntityLoader)
  6. FAANG BioSample spreadsheets (Bio::Metadata::Loader::XLSXBioSampleLoader)

All parsers produce Bio::Metadata::Entity objects, the basis of our metadata model.

Metadata model

The central class for the metadata model is Bio::Metadata::Entity. Each entity has an ID and a set of attributes. Each attribute has a name and a value, and may optionally have units, an ID or a URI. An entity can have several attributes with the same name. This model can adequately represent many biological metadata objects, such as BioSamples records, SRA samples and experiments.
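As a rough illustration of this model, the sketch below mirrors the entity and attribute structure in Python. It is not the Perl API of Bio::Metadata::Entity; the class names, field names, helper method and sample values are all hypothetical and chosen only to show the shape of the data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attribute:
    """One name/value pair; units, id and uri are optional (hypothetical field names)."""
    name: str
    value: str
    units: Optional[str] = None
    id: Optional[str] = None
    uri: Optional[str] = None


@dataclass
class Entity:
    """An identified record holding a list of attributes."""
    id: str
    attributes: List[Attribute] = field(default_factory=list)

    def attributes_named(self, name: str) -> List[Attribute]:
        # Names may repeat, so a lookup returns every match rather than one.
        return [a for a in self.attributes if a.name == name]


# Made-up sample record, for illustration only.
sample = Entity(
    id="SAMPLE_1",
    attributes=[
        Attribute("material", "tissue specimen"),
        Attribute("animal age at collection", "3", units="year"),
        Attribute("health status", "normal"),
        Attribute("health status", "vaccinated"),
    ],
)

print(len(sample.attributes_named("health status")))
```

Storing attributes as a list rather than a name-keyed map is what allows several attributes to share one name.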

Rules

The central class for defining a rule set is Bio::Metadata::Rules::RuleSet. The web app produces pages describing the rules in a rule set; this may prove useful in understanding how the rules are structured. See the FAANG sample metadata rule set for an example.

A rule set comprises one or more rule groups. Each rule group contains a list of rules. A rule group can have conditions that control which entities the group is applied to. This allows you to apply different rules to different types of data. For example, the FAANG sample metadata rule set contains different rules for tissue and cell line samples, applied depending on the value of the 'material' attribute.
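To make the idea of conditional rule groups concrete, here is a minimal Python sketch. It does not reproduce the library's condition classes or the real JSON layout of a rule set: the dictionary keys (`name`, `condition`, `rules`), the attribute values and the function names are assumptions made for illustration, as is the case-insensitive comparison of attribute names.

```python
from typing import Dict, List, Tuple

Attributes = List[Tuple[str, str]]  # (name, value) pairs; names may repeat


def group_applies(condition: Dict[str, List[str]], attributes: Attributes) -> bool:
    """True if, for every attribute named in the condition, the entity has
    at least one accepted value. An empty condition matches every entity."""
    for name, accepted in condition.items():
        values = [v for n, v in attributes if n.lower() == name.lower()]
        if not any(v in accepted for v in values):
            return False
    return True


def applicable_groups(rule_groups: List[dict], attributes: Attributes) -> List[str]:
    """Names of the rule groups whose conditions this entity satisfies."""
    return [
        g["name"]
        for g in rule_groups
        if group_applies(g.get("condition", {}), attributes)
    ]


# Hypothetical rule groups: one unconditional, two keyed on 'material'.
rule_groups = [
    {"name": "standard", "rules": ["sample name", "material"]},
    {"name": "tissue", "condition": {"material": ["tissue specimen"]},
     "rules": ["organism part"]},
    {"name": "cell line", "condition": {"material": ["cell line"]},
     "rules": ["cell line name"]},
]

tissue_sample = [("sample name", "liver_01"), ("Material", "tissue specimen")]
print(applicable_groups(rule_groups, tissue_sample))
```

In this sketch a group without a condition applies to every entity, which is one plausible way to express rules common to all sample types.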

Each rule has a name and is applied to every attribute with a matching name (matching is case-insensitive). Rules can be mandatory, recommended or optional, and can permit either multiple attributes of the same name or just one. They can specify a set of valid units. Each rule has a type, which defines how the attribute values are validated.
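The checks described above can be sketched as follows. This is an illustrative Python outline, not the behaviour of Bio::Metadata::Validate::EntityValidator: the rule keys (`mandatory`, `allow_multiple`, `valid_units`), the messages, and the choice to raise a warning for a missing recommended attribute are all assumptions, and type-specific value validation is left out.

```python
from typing import List, Optional, Tuple

Attribute = Tuple[str, str, Optional[str]]  # (name, value, units)
Outcome = Tuple[str, str]                   # (status, message)


def check_rule(rule: dict, attributes: List[Attribute]) -> List[Outcome]:
    """Apply one rule to one entity's attributes and return the outcomes."""
    # Attribute names are matched without regard to case.
    matches = [a for a in attributes if a[0].lower() == rule["name"].lower()]

    if not matches:
        if rule["mandatory"] == "mandatory":
            return [("error", "mandatory attribute not present")]
        if rule["mandatory"] == "recommended":
            return [("warning", "recommended attribute not present")]
        return []  # an optional attribute may simply be absent

    outcomes: List[Outcome] = []
    if len(matches) > 1 and not rule.get("allow_multiple", False):
        outcomes.append(("error", "multiple values supplied but only one is permitted"))

    valid_units = rule.get("valid_units")
    if valid_units:
        for _name, value, units in matches:
            if units not in valid_units:
                outcomes.append(
                    ("error", f"units {units!r} are not valid for value {value!r}")
                )

    return outcomes or [("pass", "")]


# Hypothetical rule and attribute, for illustration only.
age_rule = {
    "name": "Animal age at collection",
    "mandatory": "mandatory",
    "allow_multiple": False,
    "valid_units": ["day", "month", "year"],
}
print(check_rule(age_rule, [("animal age at collection", "3", "fortnight")]))
```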

At present, these data types are supported:

Further types can be supported by creating a validator module that fulfils the AttributeValidatorRole, adding a name for the type to the Bio::Metadata::Rules::Rule::TypeEnum in Bio::Metadata::Types, and updating the type_validator mapping in Bio::Metadata::Validate::EntityValidator.

In addition to per-attribute validation, it is sometimes necessary to make consistency checks across multiple attributes. We have these FAANG-specific checks:

Rule sets should be written in JSON. JSON files can be loaded as rule sets to test their validity using scripts/load_rule_set.pl.

Validating entities with a rule set produces a set of validation outcomes. Each outcome has a status (pass, warning or error, in order from best to worst) and, if the status is not pass, a message explaining the problem. The overall outcome for an entity is the worst outcome produced for its attributes.
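The "worst outcome wins" rule can be illustrated with a few lines of Python. This is a sketch of the idea only, not the library's code; the `SEVERITY` table and function name are hypothetical, and treating an entity with no outcomes as a pass is an assumption.

```python
from typing import Iterable

# Statuses ranked from best to worst, as described above.
SEVERITY = {"pass": 0, "warning": 1, "error": 2}


def overall_status(statuses: Iterable[str]) -> str:
    """Return the worst status among an entity's attribute outcomes.

    Assumption: an entity with no outcomes at all counts as a pass.
    """
    return max(statuses, key=SEVERITY.__getitem__, default="pass")


print(overall_status(["pass", "warning", "pass"]))
```

A single error among many passes is therefore enough to mark the whole entity as an error.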

Reporting

Validation outcomes for a set of entities can be reported as text, as a spreadsheet or as a web page.

Conversion

We include a conversion tool to produce SampleTab files for submission to BioSamples based on a template spreadsheet. This is intended to simplify submission of sample metadata for FAANG. This conversion is available through the web application.

Installation

If you want to use the scripts or code against the libraries, the simplest thing to do is to use cpanm:

cpanm git@github.com:EMBL-EBI-GCA/BioSD.git
cpanm git@github.com:FAANG/validate-metadata.git

This will install the library and its dependencies.

If you wish to manage your install manually, please install BioSD and its dependencies (see the cpanfile for a list), then the dependencies listed in validate-metadata's cpanfile.

Web application

The web application uses Mojolicious and should be compatible with any of the deployment types supported by the framework.

Installing the dependencies and running web/dev.sh should be enough to give you a web server to test with.

In production, we use an Apache2 server and Plack, with Carton to manage dependencies. The Apache server config looks a lot like this:

 PerlSwitches -I/path/to/validate-metadata/local/lib/perl5
 PerlSwitches -I/path/to/BioSD/lib
 PerlSwitches -I/path/to/validate-metadata/lib
 
 <VirtualHost *:80>
   ServerName placeholder.ebi.ac.uk
   ServerAlias placeholder.ebi.ac.uk
 
   <Perl>
     $ENV{PLACK_ENV} = 'production';
     $ENV{MOJO_HOME} = '/path/to/validate-metadata/web';
     $ENV{MOJO_MODE} = 'production';
     $ENV{MOJO_CONFIG} = '/path/to/conf_files/validate_metadata.mojo_conf';
   </Perl>

   <Location /vg/faang>
     Order allow,deny
     Allow from all
     SetHandler perl-script
     PerlResponseHandler Plack::Handler::Apache2
     PerlSetVar psgi_app /path/to/validate-metadata/web/validate_metadata.pl
   </Location>

 </VirtualHost>

Application configuration is via a Mojolicious config file. An example of the expected content of the config file is available here. The config file controls application branding and which rule sets are available.