Schema v1.1.0 - bug fixes and cli #32

Merged
merged 39 commits into from
Sep 23, 2021
c982050
Ensure either high and low are both specified
alisonrclarke Jul 1, 2021
293a656
Add maxLength restriction to keyword values
alisonrclarke Jul 2, 2021
33e726b
Ensure both header and values are required for dependent variables
alisonrclarke Jul 2, 2021
615f28b
Make additional_resources consistent across schemas
alisonrclarke Jul 7, 2021
02bc45e
Merge origin/master into schema-v1.0.2
alisonrclarke Jul 7, 2021
77eccdf
Added checks for duplicate table names and data_files
alisonrclarke Jul 8, 2021
32b8abe
Added cli script to validate directories, zips or single files
alisonrclarke Jul 8, 2021
e063531
Make it easier to call full submission validator from python
alisonrclarke Jul 12, 2021
676aaf6
Convert full_submission_validator to a Validator subclass
alisonrclarke Jul 14, 2021
344bed1
Extract method to check docs to slightly reduce complexity of validate
alisonrclarke Jul 14, 2021
ec7e19c
Ensure self.directory is updated at each call to validate
alisonrclarke Jul 14, 2021
dc165e6
Put additional_resources in separate schema
alisonrclarke Jul 14, 2021
a206ac9
Updated symlinks to point to v1.0.2 files
alisonrclarke Jul 14, 2021
0563761
Use v1.1.0 instead of v1.0.2
alisonrclarke Jul 14, 2021
ff8758e
Use relative not absolute paths in $ref
alisonrclarke Jul 15, 2021
395d2ed
Full submission validator now works for remote schemas
alisonrclarke Jul 27, 2021
bec110e
Updated docs
alisonrclarke Jul 27, 2021
e9e6e4e
Allow low, high, value in independent variables
alisonrclarke Aug 5, 2021
0b489f5
Add checks for unreferenced files in the directory
alisonrclarke Aug 5, 2021
40ed860
Bring error messages in line with those from main hepdata repo
alisonrclarke Aug 11, 2021
df0c35a
Allow access to submission docs; fix schemas; improve documentation
alisonrclarke Aug 11, 2021
59bd10f
Add check for empty file (which previously caused an exception)
alisonrclarke Aug 11, 2021
82866ce
Allow old-style resources for v0 schema
alisonrclarke Aug 11, 2021
bca5a79
Add methods to clear data
alisonrclarke Aug 11, 2021
e1f7b24
Update docs
alisonrclarke Aug 12, 2021
0688a68
Add option to disallow automatic remote schema loading
alisonrclarke Aug 17, 2021
bf854b0
Avoid ranges in string value in independent variables
alisonrclarke Sep 2, 2021
11c974a
Return multiple errors in data files and update tests
alisonrclarke Sep 2, 2021
9390bf7
Display all error messages when validating, not just first/best.
alisonrclarke Sep 2, 2021
dd5d463
Move check for ranges to python to give more informative error messages
alisonrclarke Sep 2, 2021
18d0bc3
Fix dodgy logic and regex
alisonrclarke Sep 6, 2021
cf255f0
Changes following review
alisonrclarke Sep 7, 2021
079e6d8
Ensure temp directories are not included in messages
alisonrclarke Sep 8, 2021
cd4d5e6
Added extra checks for keys in case data is not valid
alisonrclarke Sep 14, 2021
544601c
Add check that a submission has at least 1 doc that validates against…
alisonrclarke Sep 15, 2021
364c1a8
Only restrict to docs with at least 1 submission in v1 schema
alisonrclarke Sep 16, 2021
73d22d8
Only restrict to docs with at least 1 submission in v1.1 schema
alisonrclarke Sep 16, 2021
8ecaed0
Add extra info to error when schema not found
alisonrclarke Sep 16, 2021
0d67d29
docs: minor tweaks to CLI help and README.rst file
GraemeWatt Sep 23, 2021
142 changes: 125 additions & 17 deletions README.rst
@@ -39,7 +39,7 @@ Installation
------------

If you can, install `LibYAML <https://pyyaml.org/wiki/LibYAML>`_ (a C library for parsing and emitting YAML) on your machine.
This will allow for the use of CLoader for faster loading of YAML files.
This will allow for the use of ``CSafeLoader`` (instead of Python ``SafeLoader``) for faster loading of YAML files.
Not a big deal for small files, but performs markedly better on larger documents.

Via pip:
@@ -61,18 +61,130 @@ Via GitHub (for developers):
Usage
-----

The ``hepdata-validator`` package allows you to validate (via the command line or Python):

* A full directory of submission and data files
* An archive file (.zip, .tar, .tar.gz, .tgz) containing all of the files (`full details <https://hepdata-submission.readthedocs.io/en/latest/introduction.html>`_)
* A `single .yaml or .yaml.gz file <https://hepdata-submission.readthedocs.io/en/latest/single_yaml.html>`_ (but *not* ``submission.yaml`` or a YAML data file)
* A ``submission.yaml`` file or individual YAML data file (via Python only, not via the command line)

The same package is used for validating uploads made to `hepdata.net <https://www.hepdata.net>`_, so
validating offline first is an efficient way to check that your submission is valid before uploading.
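
The input kinds listed above can largely be told apart by name. The following is a minimal sketch of that idea, not code from ``hepdata-validator``: the function ``classify_input`` and its return labels are hypothetical, and the package's own detection may work differently.

```python
import os

# Suffixes taken from the list above.
ARCHIVE_SUFFIXES = ('.zip', '.tar', '.tar.gz', '.tgz')
YAML_SUFFIXES = ('.yaml', '.yaml.gz')


def classify_input(path):
    """Return a rough label for the kind of input that 'path' looks like.

    Hypothetical helper: directories are detected on disk, everything
    else is classified by file name only.
    """
    if os.path.isdir(path):
        return 'directory'
    name = os.path.basename(path).lower()
    if name.endswith(ARCHIVE_SUFFIXES):
        return 'archive'
    if name == 'submission.yaml':
        # Only supported via Python, not via the command line.
        return 'submission_file'
    if name.endswith(YAML_SUFFIXES):
        # Could be the single-YAML format or a data file; the name cannot tell.
        return 'yaml'
    return 'unknown'
```

A name-based check cannot distinguish a single-YAML submission from an individual data file, which is one reason the choice of validator is left to the caller.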


Command line
============

Installing the ``hepdata-validator`` package adds the command ``hepdata-validate`` to your path, which allows you to validate a
`HEPData submission <https://hepdata-submission.readthedocs.io/en/latest/introduction.html>`_ offline.

Examples
^^^^^^^^

To validate a submission comprising multiple files in the current directory:

.. code:: bash

    $ hepdata-validate

To validate a submission comprising multiple files in another directory:

.. code:: bash

    $ hepdata-validate -d ../TestHEPSubmission

To validate an archive file (.zip, .tar, .tar.gz, .tgz) in the current directory:

.. code:: bash

    $ hepdata-validate -a TestHEPSubmission.zip

To validate a single YAML file in the current directory:

.. code:: bash

    $ hepdata-validate -f single_yaml_file.yaml

Usage options
^^^^^^^^^^^^^

.. code:: bash

    $ hepdata-validate --help
    Usage: hepdata-validate [OPTIONS]

      Offline validation of submission.yaml and YAML data files. Can check either
      a directory, an archive file, or the single YAML file format.

    Options:
      -d, --directory TEXT  Directory to check (defaults to current working
                            directory)
      -f, --file TEXT       Single .yaml or .yaml.gz file (but not submission.yaml
                            or a YAML data file) to check - see https://hepdata-
                            submission.readthedocs.io/en/latest/single_yaml.html.
                            (Overrides directory)
      -a, --archive TEXT    Archive file (.zip, .tar, .tar.gz, .tgz) to check.
                            (Overrides directory and file)
      --help                Show this message and exit.


Python
======

Validating a full submission
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To validate a full submission, instantiate a ``FullSubmissionValidator`` object:

.. code:: python

    from hepdata_validator.full_submission_validator import FullSubmissionValidator, SchemaType
    full_submission_validator = FullSubmissionValidator()

    # validate a directory
    is_dir_valid = full_submission_validator.validate(directory='TestHEPSubmission')

    # or uncomment to validate an archive file
    # is_archive_valid = full_submission_validator.validate(archive='TestHEPSubmission.zip')

    # or uncomment to validate a single file
    # is_file_valid = full_submission_validator.validate(file='single_yaml_file.yaml')

    # if there are any error messages, they are retrievable through this call
    full_submission_validator.get_messages()

    # the error messages can be printed for each file
    full_submission_validator.print_errors('submission.yaml')

    # the list of valid files can be retrieved via the valid_files property, which is a
    # dict mapping SchemaType (e.g. SUBMISSION, DATA, SINGLE_YAML, REMOTE) to lists of
    # valid files
    full_submission_validator.valid_files[SchemaType.SUBMISSION]
    full_submission_validator.valid_files[SchemaType.DATA]
    # full_submission_validator.valid_files[SchemaType.SINGLE_YAML]

    # if a remote schema is used, valid_files is a list of tuples (schema, file)
    # full_submission_validator.valid_files[SchemaType.REMOTE]

    # the list of valid files can be printed
    full_submission_validator.print_valid_files()


Validating individual files
^^^^^^^^^^^^^^^^^^^^^^^^^^^

To validate submission files, instantiate a ``SubmissionFileValidator`` object:

.. code:: python

    from hepdata_validator.submission_file_validator import SubmissionFileValidator

    submission_file_validator = SubmissionFileValidator()
    submission_file_path = 'submission.yaml'

    # the validate method takes a string representing the file path
    is_valid_submission_file = submission_file_validator.validate(file_path=submission_file_path)

    # if there are any error messages, they are retrievable through this call
    submission_file_validator.get_messages()

@@ -83,14 +195,14 @@ To validate submission files, instantiate a ``SubmissionFileValidator`` object:
To validate data files, instantiate a ``DataFileValidator`` object:

.. code:: python

    from hepdata_validator.data_file_validator import DataFileValidator

    data_file_validator = DataFileValidator()

    # the validate method takes a string representing the file path
    data_file_validator.validate(file_path='data.yaml')

    # if there are any error messages, they are retrievable through this call
    data_file_validator.get_messages()

@@ -106,12 +218,12 @@ for the error message lookup map.

    from hepdata_validator.data_file_validator import DataFileValidator
    import yaml

    file_contents = yaml.safe_load(open('data.yaml', 'r'))
    data_file_validator = DataFileValidator()

    data_file_validator.validate(file_path='data.yaml', data=file_contents)

    data_file_validator.get_messages('data.yaml')

    data_file_validator.print_errors('data.yaml')
@@ -131,10 +243,6 @@ For the analogous case of the ``SubmissionFileValidator``:
    is_valid_submission_file = submission_file_validator.validate(file_path=submission_file_path, data=docs)
    submission_file_validator.print_errors(submission_file_path)

An example `offline validation script <https://github.com/HEPData/hepdata-submission/blob/master/scripts/check.py>`_
uses the ``hepdata_validator`` package to validate the ``submission.yaml`` file and all YAML data files of a
HEPData submission.


Schema Versions
---------------
@@ -196,7 +304,7 @@ download them. However, in principle, for testing purposes, note that the same m

.. code:: python

    schema_path = 'https://hepdata.net/submission/schemas/1.0.1/'
    schema_path = 'https://hepdata.net/submission/schemas/1.1.0/'
    schema_name = 'data_schema.json'

and passing a HEPData YAML data file as the ``file_path`` argument of the ``validate`` method.
and passing a HEPData YAML data file as the ``file_path`` argument of the ``validate`` method.
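
If the two values above are combined with standard URL joining, the trailing slash in ``schema_path`` is significant. The sketch below only illustrates the behaviour of ``urllib.parse.urljoin`` from the standard library; whether the package builds the schema URL this way is an assumption, and its own resolution may differ.

```python
from urllib.parse import urljoin

schema_path = 'https://hepdata.net/submission/schemas/1.1.0/'
schema_name = 'data_schema.json'

# With the trailing slash, the file name is appended to the versioned path.
full_url = urljoin(schema_path, schema_name)

# Without it, urljoin replaces the last path segment (the version) instead.
wrong_url = urljoin(schema_path.rstrip('/'), schema_name)

print(full_url)
print(wrong_url)
```
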
50 changes: 38 additions & 12 deletions hepdata_validator/__init__.py
@@ -25,11 +25,16 @@
import abc
import os

from jsonschema import validate as json_validate, ValidationError
from jsonschema.validators import validator_for
from jsonschema.exceptions import by_relevance
from packaging import version as packaging_version

from .version import __version__

__all__ = ('__version__', )

VALID_SCHEMA_VERSIONS = ['1.0.1', '1.0.0', '0.1.0']
VALID_SCHEMA_VERSIONS = ['1.1.0', '1.0.1', '1.0.0', '0.1.0']
LATEST_SCHEMA_VERSION = VALID_SCHEMA_VERSIONS[0]

RAW_SCHEMAS_URL = 'https://raw.githubusercontent.com/HEPData/hepdata-validator/' \
@@ -48,22 +53,16 @@ def __init__(self, *args, **kwargs):
        self.default_schema_file = ''
        self.schemas = kwargs.get('schemas', {})
        self.schema_folder = kwargs.get('schema_folder', 'schemas')
        self.schema_version = kwargs.get('schema_version', LATEST_SCHEMA_VERSION)
        if self.schema_version not in VALID_SCHEMA_VERSIONS:
            raise ValueError('Invalid schema version ' + self.schema_version)
        self.schema_version_string = kwargs.get('schema_version', LATEST_SCHEMA_VERSION)
        if self.schema_version_string not in VALID_SCHEMA_VERSIONS:
            raise ValueError('Invalid schema version ' + self.schema_version_string)
        self.schema_version = packaging_version.parse(self.schema_version_string)

    def _get_major_version(self):
        """
        Parses the major version of the validator.

        :return: integer corresponding to the validator major version
        """
        return int(self.schema_version.split('.')[0])

    def _get_schema_filepath(self, schema_filename):
        full_filepath = os.path.join(self.base_path,
                                     self.schema_folder,
                                     self.schema_version,
                                     self.schema_version_string,
                                     schema_filename)

        if not os.path.isfile(full_filepath):
@@ -81,6 +80,33 @@ def validate(self, **kwargs):
        :return: true if valid, false otherwise
        """

    def _validate_json_against_schema(self, file_path, data, schema, sort_fn=None, **kwargs):
        """
        Validates json_data against the given schema.
        Roughly follows the pattern of jsonschema.validate but adds errors to
        self.messages, and will add multiple errors if they exist.

        :param type file_path: path to file being checked
        :param type data: JSON/YAML data to validate
        :param type schema: schema to validate data against
        :param type sort_fn: Function to sort error messages to get most
                             relevant (see docs for `jsonschema.exceptions.by_relevance`).
        :param type **kwargs: Other kwargs to use when creating the
                              `jsonschema.IValidator` instance.
        """
        # Create validator ourselves so we can tweak the errors
        cls = validator_for(schema)
        cls.check_schema(schema)
        v = cls(schema, **kwargs)

        if not sort_fn:
            sort_fn = by_relevance()

        # Show all errors found, using best error in context for each
        for error in v.iter_errors(data):
            best = sorted([error] + error.context, key=sort_fn)[0]
            self.add_validation_error(file_path, best)

    def has_errors(self, file_name):
        """
        Returns true if the provided file name has error messages
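
The change earlier in this file from a plain version string to ``packaging_version.parse(...)`` matters as soon as versions are compared rather than only looked up in ``VALID_SCHEMA_VERSIONS``. The sketch below shows why, using a standard-library stand-in: ``parse_simple`` is a hypothetical helper that handles only plain ``X.Y.Z`` strings and is not a replacement for ``packaging.version.parse``.

```python
def parse_simple(version_string):
    """Turn 'X.Y.Z' into a tuple of ints so that comparisons are numeric.

    Hypothetical stand-in for packaging.version.parse; plain X.Y.Z only.
    """
    return tuple(int(part) for part in version_string.split('.'))


VALID_SCHEMA_VERSIONS = ['1.1.0', '1.0.1', '1.0.0', '0.1.0']

# String comparison is lexicographic, so multi-digit components are misordered.
strings_misorder = '1.10.0' < '1.9.0'

# Numeric comparison gives the intended ordering.
tuples_order_correctly = parse_simple('1.10.0') > parse_simple('1.9.0')

# Example: select every schema version at or above 1.0.0.
v1_or_later = [v for v in VALID_SCHEMA_VERSIONS if parse_simple(v) >= (1, 0, 0)]

# The major version that the removed _get_major_version returned is the first element.
major = parse_simple('1.1.0')[0]
```

Keeping the original string alongside the parsed value, as the diff does with ``schema_version_string``, leaves it available for building file paths.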
30 changes: 30 additions & 0 deletions hepdata_validator/cli.py
@@ -0,0 +1,30 @@
import sys

import click

from .full_submission_validator import FullSubmissionValidator


@click.command()
@click.option('--directory', '-d', default='.', help='Directory to check (defaults to current working directory)')
@click.option('--file', '-f', default=None, help='Single .yaml or .yaml.gz file (but not submission.yaml or a YAML data file) to check - see https://hepdata-submission.readthedocs.io/en/latest/single_yaml.html. (Overrides directory)')
@click.option('--archive', '-a', default=None, help='Archive file (.zip, .tar, .tar.gz, .tgz) to check. (Overrides directory and file)')
def validate(directory, file, archive):  # pragma: no cover
    """
    Offline validation of submission.yaml and YAML data files.
    Can check either a directory, an archive file, or the single YAML file format.
    """
    file_or_dir_checked = archive if archive else (file if file else directory)
    validator = FullSubmissionValidator()
    is_valid = validator.validate(directory, file, archive)
    if is_valid:
        click.echo(f"{file_or_dir_checked} is valid.")
    else:
        click.echo(f"ERROR: {file_or_dir_checked} is invalid.")

    validator.print_valid_files()
    for f in validator.messages.keys():
        validator.print_errors(f)

    if not is_valid:
        sys.exit(1)
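
The option precedence and exit status in ``cli.py`` can be checked in isolation. The sketch below restates the conditional from ``validate`` as two standalone functions; the names ``resolve_target`` and ``exit_code`` are hypothetical and do not exist in the package.

```python
def resolve_target(directory='.', file=None, archive=None):
    """Return the path reported as checked: archive, else file, else directory."""
    return archive if archive else (file if file else directory)


def exit_code(is_valid):
    """Mirror the CLI: status 1 when validation fails, 0 otherwise."""
    return 0 if is_valid else 1


# No options given: the current working directory is checked.
print(resolve_target())
# -f overrides -d.
print(resolve_target(directory='sub', file='single_yaml_file.yaml'))
# -a overrides both -d and -f.
print(resolve_target(directory='sub', file='single_yaml_file.yaml',
                     archive='TestHEPSubmission.zip'))
```

A non-zero exit status on failure lets the command be used directly in shell scripts and CI jobs.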