Merge pull request #17 from Sinclert/pyhf-integration
* Download a JSON schema (resolving possible $ref) from a remote location.
* Allow `data_schema` key specifying remote location in submission.yaml file.
* Bump native HEPData JSON schema version to 1.0.1.

Co-authored-by: Graeme Watt <Graeme.Watt@durham.ac.uk>
GraemeWatt committed Apr 22, 2020
2 parents 5db8fb7 + c1b502c commit be6b82a
Showing 22 changed files with 20,997 additions and 58 deletions.
7 changes: 6 additions & 1 deletion .coveragerc
@@ -1,2 +1,7 @@
[run]
source = hepdata_validator

[report]
exclude_lines =
pragma: no cover
@abstract
5 changes: 4 additions & 1 deletion .gitignore
@@ -56,4 +56,7 @@ docs/_build/
target/

# PyCharm
.idea/

# Downloaded schemas
hepdata_validator/schemas_remote/
77 changes: 64 additions & 13 deletions README.rst
@@ -61,7 +61,7 @@ Via GitHub (for developers):
Usage
-----

-To validate files, you need to instantiate a validator (I love OO).
+To validate submission files, instantiate a ``SubmissionFileValidator`` object:

.. code:: python
@@ -70,7 +70,7 @@ To validate files, you need to instantiate a validator (I love OO).
submission_file_validator = SubmissionFileValidator()
submission_file_path = 'submission.yaml'
-# the validate method takes a string representing the file path.
+# the validate method takes a string representing the file path
is_valid_submission_file = submission_file_validator.validate(file_path=submission_file_path)
# if there are any error messages, they are retrievable through this call
@@ -80,15 +80,15 @@ To validate files, you need to instantiate a validator (I love OO).
submission_file_validator.print_errors(submission_file_path)
-Data file validation is exactly the same.
+To validate data files, instantiate a ``DataFileValidator`` object:

.. code:: python
from hepdata_validator.data_file_validator import DataFileValidator
data_file_validator = DataFileValidator()
-# the validate method takes a string representing the file path.
+# the validate method takes a string representing the file path
data_file_validator.validate(file_path='data.yaml')
# if there are any error messages, they are retrievable through this call
@@ -99,15 +99,15 @@ Data file validation is exactly the same.
Optionally, if you have already loaded the YAML object, then you can pass it through
-as a data object. You must also pass through the ``file_path`` since this is used as a key
+as a ``data`` object. You must also pass through the ``file_path`` since this is used as a key
for the error message lookup map.

.. code:: python
from hepdata_validator.data_file_validator import DataFileValidator
import yaml
-file_contents = yaml.load(open('data.yaml', 'r'))
+file_contents = yaml.safe_load(open('data.yaml', 'r'))
data_file_validator = DataFileValidator()
data_file_validator.validate(file_path='data.yaml', data=file_contents)
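This is why ``file_path`` must always be supplied: it is the key under which error messages are stored. A minimal stand-in for that bookkeeping (class names borrowed from the package for illustration; this is a sketch, not the real implementation, which lives in ``hepdata_validator/__init__.py``):

```python
# A minimal stand-in (NOT the real class) mirroring how the
# hepdata_validator Validator base class keys error messages by
# file path; compare the __init__.py diff later in this commit.
class ValidationMessage:
    def __init__(self, file='', message=''):
        self.file = file
        self.message = message

class MessageStore:
    def __init__(self):
        # maps a file path to the list of messages recorded for it
        self.messages = {}

    def add_validation_message(self, message):
        self.messages.setdefault(message.file, []).append(message)

    def has_errors(self, file_name):
        return file_name in self.messages

    def get_messages(self, file_name=None):
        # all messages as a dict when no file is given,
        # otherwise just that file's list
        if file_name is None:
            return self.messages
        return self.messages.get(file_name, [])
```

Looking up messages for a file that was never validated simply returns an empty list, so callers can iterate unconditionally.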
@@ -122,16 +122,67 @@ uses the ``hepdata_validator`` package to validate the ``submission.yaml`` file
HEPData submission.


-Schemas
--------
+Schema Versions
+---------------

-There are currently 2 versions of the JSON schemas, `0.1.0
-<https://github.com/HEPData/hepdata-validator/tree/master/hepdata_validator/schemas/0.1.0>`_ and `1.0.0
-<https://github.com/HEPData/hepdata-validator/tree/master/hepdata_validator/schemas/1.0.0>`_. In most cases you should use
-**1.0.0** (the default). If you need to use a different version, you can pass a keyword argument ``schema_version``
-when initialising the validator:
+When considering **native HEPData JSON schemas**, there are multiple `versions
+<https://github.com/HEPData/hepdata-validator/tree/master/hepdata_validator/schemas>`_.
+In most cases you should use the **latest** version (the default). If you need to use a different version,
+you can pass a keyword argument ``schema_version`` when initialising the validator:

.. code:: python
submission_file_validator = SubmissionFileValidator(schema_version='0.1.0')
data_file_validator = DataFileValidator(schema_version='0.1.0')
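Passing an unknown version raises a ``ValueError``. A standalone sketch of that check, with the constants copied from the ``hepdata_validator/__init__.py`` diff in this commit (the helper function itself is invented for illustration):

```python
# Constants as updated by this commit in hepdata_validator/__init__.py.
VALID_SCHEMA_VERSIONS = ['1.0.1', '1.0.0', '0.1.0']
LATEST_SCHEMA_VERSION = VALID_SCHEMA_VERSIONS[0]

def resolve_schema_version(schema_version=None):
    """Illustrative helper mimicking the version check in Validator.__init__."""
    version = schema_version or LATEST_SCHEMA_VERSION
    if version not in VALID_SCHEMA_VERSIONS:
        raise ValueError('Invalid schema version ' + version)
    return version
```

Since the list is ordered newest-first, omitting the argument selects the latest version.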
Remote Schemas
--------------

When using **remotely defined schemas**, versions depend on the organization providing those schemas,
and it is their responsibility to offer a way of keeping track of different schema versions.

The ``JsonSchemaResolver`` object resolves ``$ref`` in the JSON schema. The ``HTTPSchemaDownloader`` object retrieves
schemas from a remote location, and optionally saves them in the local file system, following the structure:
``schemas_remote/<org>/<project>/<version>/<schema_name>``. An example may be:

.. code:: python
from hepdata_validator.data_file_validator import DataFileValidator
data_validator = DataFileValidator()
# Split remote schema path and schema name
schema_path = 'https://scikit-hep.org/pyhf/schemas/1.0.0/'
schema_name = 'workspace.json'
# Create JsonSchemaResolver object to resolve $ref in JSON schema
from hepdata_validator.schema_resolver import JsonSchemaResolver
pyhf_resolver = JsonSchemaResolver(schema_path)
# Create HTTPSchemaDownloader object to validate against remote schema
from hepdata_validator.schema_downloader import HTTPSchemaDownloader
pyhf_downloader = HTTPSchemaDownloader(pyhf_resolver, schema_path)
# Retrieve and save the remote schema in the local path
pyhf_type = pyhf_downloader.get_schema_type(schema_name)
pyhf_spec = pyhf_downloader.get_schema_spec(schema_name)
pyhf_downloader.save_locally(schema_name, pyhf_spec)
# Load the custom schema as a custom type
import os
pyhf_path = os.path.join(pyhf_downloader.schemas_path, schema_name)
data_validator.load_custom_schema(pyhf_type, pyhf_path)
# Validate a specific schema instance
data_validator.validate(file_path='pyhf_workspace.json', file_type=pyhf_type)
The native HEPData JSON schemas are provided as part of the ``hepdata-validator`` package, so it is not necessary to
download them. However, in principle, for testing purposes, the same mechanism as above could be used with:

.. code:: python
schema_path = 'https://hepdata.net/submission/schemas/1.0.1/'
schema_name = 'data_schema.json'
and passing a HEPData YAML data file as the ``file_path`` argument of the ``validate`` method.
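For orientation, the ``schemas_remote/<org>/<project>/<version>/<schema_name>`` layout mentioned above could be derived from a remote URL roughly as follows (an illustrative helper, not the actual ``HTTPSchemaDownloader`` code, which may construct the path differently):

```python
import os
from urllib.parse import urlparse

def remote_schema_save_path(schema_path, schema_name):
    # Illustrative only: split a remote schema URL such as
    # 'https://scikit-hep.org/pyhf/schemas/1.0.0/' into the
    # schemas_remote/<org>/<project>/<version>/<schema_name> layout.
    parsed = urlparse(schema_path)
    org = parsed.netloc                      # e.g. 'scikit-hep.org'
    parts = [p for p in parsed.path.split('/') if p]
    project, version = parts[0], parts[-1]   # e.g. 'pyhf', '1.0.0'
    return os.path.join('schemas_remote', org, project, version, schema_name)
```

With the pyhf example above this yields ``schemas_remote/scikit-hep.org/pyhf/1.0.0/workspace.json``.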
22 changes: 14 additions & 8 deletions hepdata_validator/__init__.py
@@ -1,7 +1,7 @@
# -*- coding: utf-8 -*-
#
# This file is part of HEPData.
-# Copyright (C) 2016 CERN.
+# Copyright (C) 2020 CERN.
#
# HEPData is free software; you can redistribute it
# and/or modify it under the terms of the GNU General Public License as
@@ -29,31 +29,32 @@

__all__ = ('__version__', )

-VALID_SCHEMA_VERSIONS = ['1.0.0', '0.1.0']
+VALID_SCHEMA_VERSIONS = ['1.0.1', '1.0.0', '0.1.0']
LATEST_SCHEMA_VERSION = VALID_SCHEMA_VERSIONS[0]

RAW_SCHEMAS_URL = 'https://raw.githubusercontent.com/HEPData/hepdata-validator/' \
+ __version__ + '/hepdata_validator/schemas'

class Validator(object):
"""
-Provides a general 'interface' for Validator in HEPdata
+Provides a general 'interface' for Validator in HEPData
which validates schema files created with the
-JSONschema syntax http://json-schema.org/
+JSON Schema syntax http://json-schema.org/
"""
__metaclass__ = abc.ABCMeta

def __init__(self, *args, **kwargs):
self.messages = {}
self.default_schema_file = ''
self.schemas = kwargs.get('schemas', {})
self.schema_folder = kwargs.get('schema_folder', 'schemas')
self.schema_version = kwargs.get('schema_version', LATEST_SCHEMA_VERSION)
if self.schema_version not in VALID_SCHEMA_VERSIONS:
raise ValueError('Invalid schema version ' + self.schema_version)

def _get_schema_filepath(self, schema_filename):
full_filepath = os.path.join(self.base_path,
-'schemas',
+self.schema_folder,
self.schema_version,
schema_filename)

@@ -66,6 +67,7 @@ def _get_schema_filepath(self, schema_filename):
def validate(self, **kwargs):
"""
Validates a file.
:param file_path: path to file to be loaded.
:param data: pre loaded YAML object (optional).
:return: true if valid, false otherwise
@@ -75,6 +77,7 @@ def has_errors(self, file_name):
"""
Returns true if the provided file name has error messages
associated with it, false otherwise.
:param file_name:
:return: boolean
"""
@@ -84,6 +87,7 @@ def get_messages(self, file_name=None):
"""
Return messages for a file (if file_name provided).
If file_name is none, returns all messages as a dict.
:param file_name:
:return: array if file_name is provided, dict otherwise.
"""
@@ -98,14 +102,16 @@ def get_messages(self, file_name=None):

def clear_messages(self):
"""
-Removes all error messages
+Removes all error messages.
:return:
"""
self.messages = {}

def add_validation_message(self, message):
"""
-Adds a message to the messages dict
+Adds a message to the messages dict.
:param message:
"""
if message.file not in self.messages:
@@ -115,7 +121,7 @@ def add_validation_message(self, message):

def print_errors(self, file_name):
"""
-Prints the errors observed for a file
+Prints the errors observed for a file.
"""
for error in self.get_messages(file_name):
print('\t', error.__unicode__())
60 changes: 34 additions & 26 deletions hepdata_validator/data_file_validator.py
@@ -1,7 +1,7 @@
# -*- coding: utf-8 -*-
#
# This file is part of HEPData.
-# Copyright (C) 2016 CERN.
+# Copyright (C) 2020 CERN.
#
# HEPData is free software; you can redistribute it
# and/or modify it under the terms of the GNU General Public License as
@@ -23,7 +23,6 @@
# as an Intergovernmental Organization or submit itself to any jurisdiction.

import json

import os
import yaml

@@ -41,20 +40,28 @@

class DataFileValidator(Validator):
"""
-Validates the Data file YAML/JSON file
+Validates the YAML/JSON data file.
"""
base_path = os.path.dirname(__file__)
schema_name = 'data_schema.json'

custom_data_schemas = {}

def __init__(self, *args, **kwargs):
super(DataFileValidator, self).__init__(*args, **kwargs)
self.default_schema_file = self._get_schema_filepath(self.schema_name)

def _get_major_version(self):
"""
Parses the major version of the validator.
:return: integer corresponding to the validator major version
"""
return int(self.schema_version.split('.')[0])

def load_custom_schema(self, type, schema_file_path=None):
"""
-Loads a custom schema, or will used a stored version for the given type if available
+Loads a custom schema, or will use a stored version for the given type if available.
:param type: e.g. histfactory
:return:
"""
@@ -66,7 +73,7 @@ def load_custom_schema(self, type, schema_file_path=None):
_schema_file = schema_file_path
else:
_schema_file = os.path.join(self.base_path,
-'schemas',
+self.schema_folder,
self.schema_version,
"{0}_schema.json".format(type))

@@ -81,53 +88,54 @@ def load_custom_schema(self, type, schema_file_path=None):

def validate(self, **kwargs):
"""
-Validates a data file
+Validates a data file.
:param file_path: path to file to be loaded.
:param file_type: file data type (optional).
:param data: pre loaded YAML object (optional).
:return: Bool to indicate the validity of the file.
"""

-default_data_schema = None
-
-with open(self.default_schema_file, 'r') as f:
-default_data_schema = json.load(f)
-
-# even though we are using the yaml package to load,
-# it supports JSON and YAML
-data = kwargs.pop("data", None)
file_path = kwargs.pop("file_path", None)
+file_type = kwargs.pop("file_type", None)
+data = kwargs.pop("data", None)

if file_path is None:
raise LookupError("file_path argument must be supplied")

if data is None:

try:
# The yaml package supports both JSON and YAML
with open(file_path, 'r') as df:
data = yaml.load(df, Loader=Loader)
except Exception as e:
-self.add_validation_message(ValidationMessage(file=file_path, message=
-'There was a problem parsing the file.\n' + e.__str__()))
+self.add_validation_message(ValidationMessage(
+file=file_path,
+message='There was a problem parsing the file.\n' + e.__str__(),
+))
return False

try:

-if 'type' in data:
+if file_type:
+custom_schema = self.load_custom_schema(file_type)
+json_validate(data, custom_schema)
+elif 'type' in data:
custom_schema = self.load_custom_schema(data['type'])
json_validate(data, custom_schema)
else:
-json_validate(data, default_data_schema)
-major_schema_version = int(self.schema_version.split('.')[0])
-if major_schema_version > 0:
+with open(self.default_schema_file, 'r') as f:
+default_data_schema = json.load(f)
+json_validate(data, default_data_schema)
+if self._get_major_version() > 0:
check_for_zero_uncertainty(data)
check_length_values(data)

except ValidationError as ve:

-self.add_validation_message(
-ValidationMessage(file=file_path,
-message=ve.message + ' in ' + str(ve.instance)))
+self.add_validation_message(ValidationMessage(
+file=file_path,
+message=ve.message + ' in ' + str(ve.instance),
+))

if self.has_errors(file_path):
return False
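In summary, the schema dispatch that this commit introduces in ``DataFileValidator.validate`` follows a simple precedence, sketched here with the schema loading and ``jsonschema`` calls stripped out:

```python
# Simplified sketch of the schema-selection order added to
# DataFileValidator.validate in this commit (schema loading and the
# actual jsonschema validation are intentionally omitted).
def choose_schema(data, file_type=None):
    """Return the schema key validate() would look up for this file."""
    if file_type:
        return file_type      # an explicit file_type argument wins
    if 'type' in data:
        return data['type']   # a per-file custom type comes next
    return 'default'          # otherwise the native HEPData data schema
```

The new ``file_type`` keyword takes priority, which is what lets remotely downloaded schemas such as the pyhf workspace be selected explicitly.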
