Merge pull request #17 from Sinclert/pyhf-integration
* Download a JSON schema (resolving possible $ref) from a remote location.
* Allow `data_schema` key specifying remote location in submission.yaml file.
* Bump native HEPData JSON schema version to 1.0.1.

Co-authored-by: Graeme Watt <Graeme.Watt@durham.ac.uk>
GraemeWatt committed Apr 22, 2020
2 parents 5db8fb7 + c1b502c commit be6b82a
Showing 22 changed files with 20,997 additions and 58 deletions.
7 changes: 6 additions & 1 deletion .coveragerc
@@ -1,2 +1,7 @@
[run]
source = hepdata_validator

[report]
exclude_lines =
pragma: no cover
@abstract
5 changes: 4 additions & 1 deletion .gitignore
@@ -56,4 +56,7 @@ docs/_build/
target/

# PyCharm
.idea/

# Downloaded schemas
hepdata_validator/schemas_remote/
77 changes: 64 additions & 13 deletions README.rst
@@ -61,7 +61,7 @@ Via GitHub (for developers):
Usage
-----

-To validate files, you need to instantiate a validator (I love OO).
+To validate submission files, instantiate a ``SubmissionFileValidator`` object:

.. code:: python
@@ -70,7 +70,7 @@ To validate files, you need to instantiate a validator (I love OO).
submission_file_validator = SubmissionFileValidator()
submission_file_path = 'submission.yaml'
-# the validate method takes a string representing the file path.
+# the validate method takes a string representing the file path
is_valid_submission_file = submission_file_validator.validate(file_path=submission_file_path)
# if there are any error messages, they are retrievable through this call
@@ -80,15 +80,15 @@ To validate files, you need to instantiate a validator (I love OO).
submission_file_validator.print_errors(submission_file_path)
-Data file validation is exactly the same.
+To validate data files, instantiate a ``DataFileValidator`` object:

.. code:: python
from hepdata_validator.data_file_validator import DataFileValidator
data_file_validator = DataFileValidator()
-# the validate method takes a string representing the file path.
+# the validate method takes a string representing the file path
data_file_validator.validate(file_path='data.yaml')
# if there are any error messages, they are retrievable through this call
@@ -99,15 +99,15 @@ Data file validation is exactly the same.
Optionally, if you have already loaded the YAML object, then you can pass it through
-as a data object. You must also pass through the ``file_path`` since this is used as a key
+as a ``data`` object. You must also pass through the ``file_path`` since this is used as a key
for the error message lookup map.

.. code:: python
from hepdata_validator.data_file_validator import DataFileValidator
import yaml
-file_contents = yaml.load(open('data.yaml', 'r'))
+file_contents = yaml.safe_load(open('data.yaml', 'r'))
data_file_validator = DataFileValidator()
data_file_validator.validate(file_path='data.yaml', data=file_contents)
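This is why ``file_path`` must always be supplied: it is the key under which error messages are stored. A minimal stand-in for that bookkeeping (class names borrowed from the package for illustration; this is a sketch, not the real implementation, which lives in ``hepdata_validator/__init__.py``):

```python
# A minimal stand-in (NOT the real class) mirroring how the
# hepdata_validator Validator base class keys error messages by
# file path; compare the __init__.py diff later in this commit.
class ValidationMessage:
    def __init__(self, file='', message=''):
        self.file = file
        self.message = message

class MessageStore:
    def __init__(self):
        # maps a file path to the list of messages recorded for it
        self.messages = {}

    def add_validation_message(self, message):
        self.messages.setdefault(message.file, []).append(message)

    def has_errors(self, file_name):
        return file_name in self.messages

    def get_messages(self, file_name=None):
        # all messages as a dict when no file is given,
        # otherwise just that file's list
        if file_name is None:
            return self.messages
        return self.messages.get(file_name, [])
```

Looking up messages for a file that was never validated simply returns an empty list, so callers can iterate unconditionally.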
@@ -122,16 +122,67 @@ uses the ``hepdata_validator`` package to validate the ``submission.yaml`` file
HEPData submission.


-Schemas
--------
+Schema Versions
+---------------

-There are currently 2 versions of the JSON schemas, `0.1.0
-<https://github.com/HEPData/hepdata-validator/tree/master/hepdata_validator/schemas/0.1.0>`_ and `1.0.0
-<https://github.com/HEPData/hepdata-validator/tree/master/hepdata_validator/schemas/1.0.0>`_. In most cases you should use
-**1.0.0** (the default). If you need to use a different version, you can pass a keyword argument ``schema_version``
-when initialising the validator:
+When considering **native HEPData JSON schemas**, there are multiple `versions
+<https://github.com/HEPData/hepdata-validator/tree/master/hepdata_validator/schemas>`_.
+In most cases you should use the **latest** version (the default). If you need to use a different version,
+you can pass a keyword argument ``schema_version`` when initialising the validator:

.. code:: python
submission_file_validator = SubmissionFileValidator(schema_version='0.1.0')
data_file_validator = DataFileValidator(schema_version='0.1.0')
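Passing an unknown version raises a ``ValueError``. A standalone sketch of that check, with the constants copied from the ``hepdata_validator/__init__.py`` diff in this commit (the helper function itself is invented for illustration):

```python
# Constants as updated by this commit in hepdata_validator/__init__.py.
VALID_SCHEMA_VERSIONS = ['1.0.1', '1.0.0', '0.1.0']
LATEST_SCHEMA_VERSION = VALID_SCHEMA_VERSIONS[0]

def resolve_schema_version(schema_version=None):
    """Illustrative helper mimicking the version check in Validator.__init__."""
    version = schema_version or LATEST_SCHEMA_VERSION
    if version not in VALID_SCHEMA_VERSIONS:
        raise ValueError('Invalid schema version ' + version)
    return version
```

Since the list is ordered newest-first, omitting the argument selects the latest version.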
Remote Schemas
--------------

When using **remotely defined schemas**, versions depend on the organization providing those schemas,
and it is their responsibility to offer a way of keeping track of different schema versions.

The ``JsonSchemaResolver`` object resolves ``$ref`` in the JSON schema. The ``HTTPSchemaDownloader`` object retrieves
schemas from a remote location, and optionally saves them in the local file system, following the structure:
``schemas_remote/<org>/<project>/<version>/<schema_name>``. An example may be:

.. code:: python
from hepdata_validator.data_file_validator import DataFileValidator
data_validator = DataFileValidator()
# Split remote schema path and schema name
schema_path = 'https://scikit-hep.org/pyhf/schemas/1.0.0/'
schema_name = 'workspace.json'
# Create JsonSchemaResolver object to resolve $ref in JSON schema
from hepdata_validator.schema_resolver import JsonSchemaResolver
pyhf_resolver = JsonSchemaResolver(schema_path)
# Create HTTPSchemaDownloader object to validate against remote schema
from hepdata_validator.schema_downloader import HTTPSchemaDownloader
pyhf_downloader = HTTPSchemaDownloader(pyhf_resolver, schema_path)
# Retrieve and save the remote schema in the local path
pyhf_type = pyhf_downloader.get_schema_type(schema_name)
pyhf_spec = pyhf_downloader.get_schema_spec(schema_name)
pyhf_downloader.save_locally(schema_name, pyhf_spec)
# Load the custom schema as a custom type
import os
pyhf_path = os.path.join(pyhf_downloader.schemas_path, schema_name)
data_validator.load_custom_schema(pyhf_type, pyhf_path)
# Validate a specific schema instance
data_validator.validate(file_path='pyhf_workspace.json', file_type=pyhf_type)
The native HEPData JSON schemas are provided as part of the ``hepdata-validator`` package, so it is not necessary to
download them. However, in principle, for testing purposes, the same mechanism as above could be used with:

.. code:: python
schema_path = 'https://hepdata.net/submission/schemas/1.0.1/'
schema_name = 'data_schema.json'
and passing a HEPData YAML data file as the ``file_path`` argument of the ``validate`` method.
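For orientation, the ``schemas_remote/<org>/<project>/<version>/<schema_name>`` layout mentioned above could be derived from a remote URL roughly as follows (an illustrative helper, not the actual ``HTTPSchemaDownloader`` code, which may construct the path differently):

```python
import os
from urllib.parse import urlparse

def remote_schema_save_path(schema_path, schema_name):
    # Illustrative only: split a remote schema URL such as
    # 'https://scikit-hep.org/pyhf/schemas/1.0.0/' into the
    # schemas_remote/<org>/<project>/<version>/<schema_name> layout.
    parsed = urlparse(schema_path)
    org = parsed.netloc                      # e.g. 'scikit-hep.org'
    parts = [p for p in parsed.path.split('/') if p]
    project, version = parts[0], parts[-1]   # e.g. 'pyhf', '1.0.0'
    return os.path.join('schemas_remote', org, project, version, schema_name)
```

With the pyhf example above this yields ``schemas_remote/scikit-hep.org/pyhf/1.0.0/workspace.json``.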
22 changes: 14 additions & 8 deletions hepdata_validator/__init__.py
@@ -1,7 +1,7 @@
# -*- coding: utf-8 -*-
#
# This file is part of HEPData.
-# Copyright (C) 2016 CERN.
+# Copyright (C) 2020 CERN.
#
# HEPData is free software; you can redistribute it
# and/or modify it under the terms of the GNU General Public License as
@@ -29,31 +29,32 @@

__all__ = ('__version__', )

-VALID_SCHEMA_VERSIONS = ['1.0.0', '0.1.0']
+VALID_SCHEMA_VERSIONS = ['1.0.1', '1.0.0', '0.1.0']
LATEST_SCHEMA_VERSION = VALID_SCHEMA_VERSIONS[0]

RAW_SCHEMAS_URL = 'https://raw.githubusercontent.com/HEPData/hepdata-validator/' \
+ __version__ + '/hepdata_validator/schemas'

class Validator(object):
"""
-Provides a general 'interface' for Validator in HEPdata
+Provides a general 'interface' for Validator in HEPData
which validates schema files created with the
-JSONschema syntax http://json-schema.org/
+JSON Schema syntax http://json-schema.org/
"""
__metaclass__ = abc.ABCMeta

def __init__(self, *args, **kwargs):
self.messages = {}
self.default_schema_file = ''
self.schemas = kwargs.get('schemas', {})
self.schema_folder = kwargs.get('schema_folder', 'schemas')
self.schema_version = kwargs.get('schema_version', LATEST_SCHEMA_VERSION)
if self.schema_version not in VALID_SCHEMA_VERSIONS:
raise ValueError('Invalid schema version ' + self.schema_version)

def _get_schema_filepath(self, schema_filename):
full_filepath = os.path.join(self.base_path,
-'schemas',
+self.schema_folder,
self.schema_version,
schema_filename)

@@ -66,6 +67,7 @@ def _get_schema_filepath(self, schema_filename):
def validate(self, **kwargs):
"""
Validates a file.
:param file_path: path to file to be loaded.
:param data: pre loaded YAML object (optional).
:return: true if valid, false otherwise
@@ -75,6 +77,7 @@ def has_errors(self, file_name):
"""
Returns true if the provided file name has error messages
associated with it, false otherwise.
:param file_name:
:return: boolean
"""
@@ -84,6 +87,7 @@ def get_messages(self, file_name=None):
"""
Return messages for a file (if file_name provided).
If file_name is none, returns all messages as a dict.
:param file_name:
:return: array if file_name is provided, dict otherwise.
"""
@@ -98,14 +102,16 @@ def get_messages(self, file_name=None):

def clear_messages(self):
"""
-Removes all error messages
+Removes all error messages.
:return:
"""
self.messages = {}

def add_validation_message(self, message):
"""
-Adds a message to the messages dict
+Adds a message to the messages dict.
:param message:
"""
if message.file not in self.messages:
@@ -115,7 +121,7 @@ def add_validation_message(self, message):

def print_errors(self, file_name):
"""
-Prints the errors observed for a file
+Prints the errors observed for a file.
"""
for error in self.get_messages(file_name):
print('\t', error.__unicode__())
60 changes: 34 additions & 26 deletions hepdata_validator/data_file_validator.py
@@ -1,7 +1,7 @@
# -*- coding: utf-8 -*-
#
# This file is part of HEPData.
-# Copyright (C) 2016 CERN.
+# Copyright (C) 2020 CERN.
#
# HEPData is free software; you can redistribute it
# and/or modify it under the terms of the GNU General Public License as
@@ -23,7 +23,6 @@
# as an Intergovernmental Organization or submit itself to any jurisdiction.

import json

import os
import yaml

@@ -41,20 +40,28 @@

class DataFileValidator(Validator):
"""
-Validates the Data file YAML/JSON file
+Validates the YAML/JSON data file.
"""
base_path = os.path.dirname(__file__)
schema_name = 'data_schema.json'

custom_data_schemas = {}

def __init__(self, *args, **kwargs):
super(DataFileValidator, self).__init__(*args, **kwargs)
self.default_schema_file = self._get_schema_filepath(self.schema_name)

def _get_major_version(self):
"""
Parses the major version of the validator.
:return: integer corresponding to the validator major version
"""
return int(self.schema_version.split('.')[0])

def load_custom_schema(self, type, schema_file_path=None):
"""
-Loads a custom schema, or will used a stored version for the given type if available
+Loads a custom schema, or will use a stored version for the given type if available.
:param type: e.g. histfactory
:return:
"""
@@ -66,7 +73,7 @@ def load_custom_schema(self, type, schema_file_path=None):
_schema_file = schema_file_path
else:
_schema_file = os.path.join(self.base_path,
-'schemas',
+self.schema_folder,
self.schema_version,
"{0}_schema.json".format(type))

@@ -81,53 +88,54 @@ def load_custom_schema(self, type, schema_file_path=None):

def validate(self, **kwargs):
"""
-Validates a data file
+Validates a data file.
:param file_path: path to file to be loaded.
:param file_type: file data type (optional).
:param data: pre loaded YAML object (optional).
:return: Bool to indicate the validity of the file.
"""

-default_data_schema = None
-
-with open(self.default_schema_file, 'r') as f:
-default_data_schema = json.load(f)
-
-# even though we are using the yaml package to load,
-# it supports JSON and YAML
-data = kwargs.pop("data", None)
file_path = kwargs.pop("file_path", None)
+file_type = kwargs.pop("file_type", None)
+data = kwargs.pop("data", None)

if file_path is None:
raise LookupError("file_path argument must be supplied")

if data is None:

try:
# The yaml package supports both JSON and YAML
with open(file_path, 'r') as df:
data = yaml.load(df, Loader=Loader)
except Exception as e:
-self.add_validation_message(ValidationMessage(file=file_path, message=
-'There was a problem parsing the file.\n' + e.__str__()))
+self.add_validation_message(ValidationMessage(
+file=file_path,
+message='There was a problem parsing the file.\n' + e.__str__(),
+))
return False

try:

-if 'type' in data:
+if file_type:
+custom_schema = self.load_custom_schema(file_type)
+json_validate(data, custom_schema)
+elif 'type' in data:
custom_schema = self.load_custom_schema(data['type'])
json_validate(data, custom_schema)
else:
-json_validate(data, default_data_schema)
-major_schema_version = int(self.schema_version.split('.')[0])
-if major_schema_version > 0:
+with open(self.default_schema_file, 'r') as f:
+default_data_schema = json.load(f)
+json_validate(data, default_data_schema)
+if self._get_major_version() > 0:
check_for_zero_uncertainty(data)
check_length_values(data)

except ValidationError as ve:

-self.add_validation_message(
-ValidationMessage(file=file_path,
-message=ve.message + ' in ' + str(ve.instance)))
+self.add_validation_message(ValidationMessage(
+file=file_path,
+message=ve.message + ' in ' + str(ve.instance),
+))

if self.has_errors(file_path):
return False
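In summary, the schema dispatch that this commit introduces in ``DataFileValidator.validate`` follows a simple precedence, sketched here with the schema loading and ``jsonschema`` calls stripped out:

```python
# Simplified sketch of the schema-selection order added to
# DataFileValidator.validate in this commit (schema loading and the
# actual jsonschema validation are intentionally omitted).
def choose_schema(data, file_type=None):
    """Return the schema key validate() would look up for this file."""
    if file_type:
        return file_type      # an explicit file_type argument wins
    if 'type' in data:
        return data['type']   # a per-file custom type comes next
    return 'default'          # otherwise the native HEPData data schema
```

The new ``file_type`` keyword takes priority, which is what lets remotely downloaded schemas such as the pyhf workspace be selected explicitly.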
