Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LOV crawler & Vaadin update #93

Merged
merged 16 commits into from
Dec 5, 2018
Merged

LOV crawler & Vaadin update #93

merged 16 commits into from
Dec 5, 2018

Conversation

chile12
Copy link

@chile12 chile12 commented Dec 1, 2018

  • Updated Vaadin, since it caused multiple issues especially under windows.
  • Included an 'Ontology' option in the constraint selection, which will not skip owl, rdf, and rdfs constraints tests.
  • Made the excluded ontologies configurable, (by default it still excludes owl, rdf, rdfs), introducing an 'x' option in the CLI.
  • Revised the LOV endpoint class, and created basically a crawler, which searches actively for possible resource locations, I also updated the default schemaDecl.csv
  • I introduced a directoary containing standard vocabularies which are always loaded (based on vann properties (containing rut, shacl, and xsd as ontology with regex restrictions)
  • SchemaSource now collects all imports transitively, a special function now can also load the source with all its imports.
  • other minor stuff

How Has This Been Tested?

  • added a test for the LOV crawler (deactivated by default)
  • added xsd type tests
  • added a test for imported ontologies

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Markus Freudenberg added 13 commits November 5, 2018 13:44
7.7.1 causes this issue under Windows: https://dev.vaadin.com/ticket/20285
# Conflicts:
#	pom.xml
#	rdfunit-commons/pom.xml
#	rdfunit-core/pom.xml
#	rdfunit-core/src/test/resources/org/aksw/rdfunit/validate/data/owl/ontology.ttl
#	rdfunit-examples/pom.xml
#	rdfunit-io/pom.xml
#	rdfunit-junit/pom.xml
#	rdfunit-manual-tests/pom.xml
#	rdfunit-model/pom.xml
#	rdfunit-validate/pom.xml
#	rdfunit-w3c-dqv/pom.xml
@coveralls
Copy link

coveralls commented Dec 1, 2018

Coverage Status

Coverage decreased (-2.2%) to 55.936% when pulling 9281c12 on fix/updateVaadin into d96c328 on master.

?C1 rdf:type/rdfs:subClassOf* ?C2 .
FILTER ( ?C2 IN (rdfs:Literal, rdf:langString, rdfs:Datatype, owl:DatatypeProperty))
?D1 rdf:type/rdfs:subClassOf* ?C3 .
FILTER ( ?C3 IN (rdfs:Literal, rdf:langString, rdfs:Datatype, owl:DatatypeProperty))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch 👍

Copy link
Member

@jimkont jimkont left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall this is a great contribution, thanks a lot @chile12 !!!

This PR covers a lot of features that almost all are good to go except from two remarks:

  1. the test results need to be reverted back to LIst (from Set) for conformance to shacl and
  2. the automatic following of owl:imports probably needs to be moved one lever higher to avoid possible additional test geneartion and execution time

* @param redirects - list of established redirects which led to this point
* @return - the optional content location
*/
private Optional<String> getContentLocation(String urlStr, SerializationFormat format, ArrayList<String> redirects){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code here looks very useful, would worth considering reusing it in the rdfunit-io module / RdfReader and we re-use it all sort of dereferencing we do. I'll make an issue for tracking this after this get's merged

/**
* @author Dimitris Kontokostas
* @since 9/16/13 1:51 PM
*/
@ToString
@EqualsAndHashCode(exclude={"model", "schemaReader"})
@EqualsAndHashCode(exclude={"model", "schemaReader", "imports"})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change probably needs some more discussion. Initial versions of RDFUnit had transitive owl:imports closure computation through the Jena ontology model but the effect was kind of similar (here there is more control on what gets loaded though).

The issues I had and decided to revert this was that multiple sources could be re-importing the same imports and since the test generation is done on a SchmaSource level, we generate multiple copies of the same tests, which adds big delay in the test generation phase as well as on the resulting execution.

The other problem I had was big delays in resolving non-resolvable IRIs, but this is not a problem in this case since we reconcile with the predefined schemas

To tackle the first issue I would propose to, instead of merging the imported models here, we move this functionality outside of the class and generate additional SchemaSources that we can easier deduplicate. wdyt?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sounds good. I will create the subclass.

Copy link
Member

@jimkont jimkont Dec 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one approach would be to get all the schema sources that are defined explicitly (or the ones we got automatically).
We iterate over all of them and we get all transitive owl:imports from each schema source instance.
We gather all import IRIs into a set (for deduplication) and then reconcile them with the schema service.

at the end, we return the original schema sources along with the ones we computed as a new collection for input to RDFnit

Copy link
Author

@chile12 chile12 Dec 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is basically already implemented, except that I load each unique import into one model.
So you would like to change the interface like this?

public Model getTransitiveModel() -> public Collection[SchemaSource] getTransitiveSchemaSources()

Where would use it then? Atm I only use it in TagRdfUnitTestGenerator...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking to use this a bit differently so that people use the owl:import computation on demand.

instead of creating a separate class that is fixed in the code, I was thinking of a (static) function that takes the original
SchemaSource list and returns an augmented list back and this function would be called in the RDFUnitConfiguration.getAllSchemata().

This way we could even have a parameter specifying if we want to compute the owl:imports or not in the future, or maybe allow to get sources even not defined in the SchemaService.

So, that function would look something like the following (not tested):

    public static List<SchemaSource> augmentWithOwlImports(List<SchemaSource> originalSources) {

        ImmutableList.Builder<SchemaSource> augmentedSources = ImmutableList.builder();
        augmentedSources.addAll(originalSources);

        Set<String> schemaIris = originalSources.stream().map(SchemaSource::getSchema).collect(Collectors.toSet());
        Set<SchemaSource> currentSources = new HashSet<>(originalSources);

        while (!currentSources.isEmpty()) {

            Set<SchemaSource> computedSources = currentSources.stream()
                    .map(SchemaSource::getModel)
                    .flatMap(m -> m.listObjectsOfProperty(OWL.imports).toList().stream())
                    .filter(RDFNode::isResource)
                    .map(RDFNode::asResource)
                    .map(Resource::getURI)
                    .filter(uri -> !schemaIris.contains(uri))
                    .map(SchemaService::getSourceFromUri)
                    .filter(Optional::isPresent)
                    .map(Optional::get)
                    .collect(Collectors.toSet());

            Set<String> newIris = computedSources.stream().map(SchemaSource::getSchema).collect(Collectors.toSet());
            schemaIris.addAll(newIris);

            augmentedSources.addAll(computedSources);
            
            currentSources = computedSources;

        }

        return augmentedSources.build();
    }

wdyt?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented as stated (worked without changing a line 👍 ), introduced an -i option in the cli

@jimkont jimkont merged commit 05c1c8e into master Dec 5, 2018
@jimkont
Copy link
Member

jimkont commented Dec 5, 2018

💯 thanks a lot @chile12

@jimkont jimkont deleted the fix/updateVaadin branch June 8, 2019 08:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants