SDM-TIB/PALADIN


License: MIT

logo of PALADIN

PALADIN is a process-based data validator, i.e., it is capable of validating an entity at the different stages of the process it undergoes. PALADIN is also source-agnostic, i.e., it can validate data sources of different formats.

Note

This repository contains the prototype implementation of PALADIN. The prototype is limited to RDF-based knowledge graphs accessible via SPARQL endpoints and relational databases in MySQL.

PALADIN Architecture

PALADIN Architecture

Fig. 1: PALADIN Architecture

The PALADIN architecture (see Fig. 1) consists of two components, namely the Shape Traversal Planner and the Shape Traversal Engine. The input is a PALADIN schema (the constraints) and the population to be validated (entities from a data source). The population can be sourced from different data formats, e.g., RDF-based knowledge graphs, relational databases, or NoSQL databases. For each shape in the PALADIN schema, PALADIN returns the sets of valid and invalid entities.

Shape Traversal Planner. This component generates an evaluation plan for the shapes in the PALADIN schema. The process is guided by several features of the schema and its shapes, including the number of shapes, the topology of the schema, the population size, and the selectivity of the shapes. If the PALADIN schema corresponds to a balanced tree, the schema is evaluated following a breadth-first strategy. However, when the schema induces an imbalanced tree where the majority of the shapes are part of the left side of the tree, PALADIN follows a depth-first strategy. The experiments show that the selection of the traversal strategy impacts the continuous behavior of the engine.
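The following Python sketch illustrates such a balance-based heuristic. It is a minimal illustration only: the `Shape` class, the 50% threshold, and the function names are assumptions made here and do not reflect PALADIN's actual implementation.

```python
from collections import deque

class Shape:
    """Illustrative shape node of a PALADIN schema: a name plus its child shapes."""
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

def subtree_size(shape):
    """Number of shapes in the subtree rooted at `shape`."""
    return 1 + sum(subtree_size(child) for child in shape.children)

def choose_strategy(root):
    """Return 'BFS' for (roughly) balanced schemas and 'DFS' for left-heavy ones."""
    if not root.children:
        return "BFS"
    below_root = subtree_size(root) - 1
    leftmost = subtree_size(root.children[0])
    # If most shapes sit in the leftmost subtree, a depth-first traversal
    # reaches (and validates) them earlier, producing results sooner.
    return "DFS" if leftmost / below_root > 0.5 else "BFS"

def traversal_order(root, strategy):
    """Yield the shapes in the order the engine would evaluate them."""
    pending = deque([root])
    while pending:
        shape = pending.popleft() if strategy == "BFS" else pending.pop()
        yield shape
        pending.extend(shape.children if strategy == "BFS" else reversed(shape.children))
```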

Shape Traversal Engine. This component is responsible for the evaluation of the PALADIN schema following the traversal strategy chosen by the shape traversal planner. The engine follows a recursive evaluation model starting from the root shape of the schema. The evaluation involves two steps:

  1. Target Population Generation. The target population is the intersection of the ancestor's population (either valid or invalid entities) and the shape's target query.
  2. Population Validation. The target population is split into the sets for valid and invalid entities based on the constraint query.

The shape traversal engine outputs the sets of valid and invalid entities for each shape in the PALADIN schema.
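The two steps above can be summarized in a short Python sketch. The `Shape` dataclass, the `refines_valid` flag, and the `execute_query` callable (assumed to return a set of entity identifiers for a given query) are simplifying assumptions for illustration, not PALADIN's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Shape:
    """Hypothetical stand-in for a shape of a PALADIN schema."""
    name: str
    target_query: str            # query selecting the shape's target entities
    constraint_query: str        # query selecting the entities satisfying the constraint
    refines_valid: bool = True   # refine the parent's valid (True) or invalid (False) set
    children: list = field(default_factory=list)

def evaluate(shape, ancestor_population, execute_query):
    """Recursively validate `shape` and its descendants.

    `execute_query` may be backed by a SPARQL endpoint or a SQL connection,
    matching PALADIN's source-agnostic design.
    """
    # Step 1: Target population generation -- intersect the ancestor's
    # population with the entities matching the shape's target query.
    target = ancestor_population & execute_query(shape.target_query)

    # Step 2: Population validation -- split the target population into
    # valid and invalid entities based on the constraint query.
    satisfying = execute_query(shape.constraint_query)
    valid, invalid = target & satisfying, target - satisfying

    results = {shape.name: (valid, invalid)}
    for child in shape.children:
        # Each child refines either the valid or the invalid partition of its parent.
        population = valid if child.refines_valid else invalid
        results.update(evaluate(child, population, execute_query))
    return results
```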

Experimental Study

Note

This is a brief summary of the experimental study presented in the paper. Please read the paper for a more detailed discussion of the results. The data and scripts can be found in [1]. The data is generated using [2].

The experimental study aims at answering the following research questions:

  1. What is the impact of evaluating a PALADIN schema following different tree traversal strategies?
  2. Which parameters impact the performance of PALADIN?
  3. What is the impact of the size and heterogeneity of the data sources on the performance of PALADIN?
  4. How does the size of a PALADIN schema affect the validation time?

Experimental Setup

Data Sets. PALADIN is evaluated over 20 data sets. Two of these data sets contain real-world data about breast cancer treatments; one in a relational database and the other as an RDF-based knowledge graph. The remaining 18 data sets contain synthetic data based on the real-world breast cancer data. The synthetic data sets are published as [1] and are generated using [2]. As with the real-world data, half of the synthetic data sets are stored in a relational database while the other half are the corresponding RDF-based knowledge graphs. The resulting nine distinct data sets (each available in both formats) differ in size, i.e., the number of patients modeled, and in data quality. Synthetic data sets modeling 1000 (small), 10000 (medium), and 100000 (big) patients are used. Regarding data quality, the data sets are called clean, mid, and dirty.

PALADIN Schemas. The performance of PALADIN is evaluated using a PALADIN schema representing the treatment guideline for breast cancer patients with an amplified HER2 gene. The schema contains 15 shapes and is imbalanced. For the scalability study, seven different PALADIN schemas are evaluated. They all divide the population based on ranges over the ID and differ in the number of nodes in the schema. The evaluated schemas contain 16, 32, 64, 128, 256, 512, and 1024 nodes, respectively.

Configurations. Two different data formats are considered in the experimental study: (i) relational database (RDB), and (ii) RDF-based knowledge graph (KG). Additionally, the traversal strategies BFS and DFS are evaluated. This results in four configurations:

  1. RDB+BFS
  2. RDB+DFS
  3. KG+BFS
  4. KG+DFS

In the case of the real-world data, PALADIN is compared to Trav-SHACL [3], a state-of-the-art SHACL validator, as well as Shaclex [4] and PyShEx [5], two ShEx validators. For the comparison with SHACL and ShEx, 15 artificial classes representing the entities in all the shapes are added to the real-world KG. For constraints that compare the values of two different properties, the constraint satisfaction is materialized, since such constraints cannot be expressed in ShEx; in SHACL, a SPARQL-based constraint would be required.

Metrics. The following metrics are reported:

  1. Average Validation Time: Average time in seconds needed to validate the PALADIN schema.
  2. Standard Deviation of the Validation Time: Standard deviation of the time required to validate the PALADIN schema.
  3. dief@t: Continuous efficiency in the first t time units [6].
  4. dief@k: Continuous efficiency while producing the first k results [6] (see the sketch after this list).
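To make the diefficiency metrics [6] concrete, the Python sketch below computes both values from the timestamps of an answer trace using the trapezoidal rule. It is a simplified illustration of the definitions, not the authors' implementation, and the function names are chosen here for readability.

```python
def _trace_auc(times, counts):
    """Area under a cumulative answer trace, via the trapezoidal rule."""
    return sum(
        (counts[i] + counts[i + 1]) / 2 * (times[i + 1] - times[i])
        for i in range(len(times) - 1)
    )

def dief_at_t(answer_timestamps, t):
    """dief@t: area under the answer trace in the first t time units (higher is better)."""
    times = [ts for ts in answer_timestamps if ts <= t]
    counts = list(range(1, len(times) + 1))   # cumulative number of answers produced
    # Extend the trace to time t so the area covers the interval up to t.
    times.append(t)
    counts.append(len(counts))
    return _trace_auc(times, counts)

def dief_at_k(answer_timestamps, k):
    """dief@k: area under the answer trace until the k-th answer (lower is better)."""
    times = list(answer_timestamps[:k])
    counts = list(range(1, len(times) + 1))
    return _trace_auc(times, counts)

# A trace that produces its answers earlier scores higher on dief@t and lower on dief@k.
fast, slow = [0.1, 0.2, 0.3, 0.4], [0.5, 1.0, 1.5, 2.0]
assert dief_at_t(fast, 2.0) > dief_at_t(slow, 2.0)
assert dief_at_k(fast, 4) < dief_at_k(slow, 4)
```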

Testbeds. The main PALADIN schema is evaluated with all four configurations over all 20 data sets. The three scalability schemas are evaluated with all PALADIN configurations over the data set Mid Large. Together with the one case of using Trav-SHACL, this results in a total of 53 testbeds. Each testbed is executed 10 times. MySQL 8.1.0 and Virtuoso 7.20.3237 are used for hosting the RDB and KG data. The entire experimental environment is dockerized and available in [1].

Results

Real Data

Execution Time over Real Data

Fig. 2: Execution Time over Real Data

Answer Traces for Real Data

Fig. 3: Answer Traces for Real Data

dief@t for Real Data

Fig. 4: dief@t for Real Data

Comparison of PALADIN with SHACL and ShEx

Fig. 5: Comparison of PALADIN with SHACL and ShEx over Real Data

Figs. 2-5 present the results over the real-world data. The KG configurations outperform the RDB configurations. There is no notable difference in execution time between the two traversal strategies. The continuous behavior shows that the RDB configurations produce the first result faster but are slower overall. PALADIN is significantly faster than the other approaches; Trav-SHACL is the second-best engine. PyShEx and Shaclex (ShEx) show similar performance, while validating SHACL with Shaclex, i.e., Shaclex (SHACL), performs worst.

Synthetic Data

Execution Time over Synthetic Data

Fig. 6: Execution Time over Synthetic Data

dief@t for Synthetic Data

Fig. 7: dief@t for Synthetic Data

dief@k for Synthetic Data

Fig. 8: dief@k for Synthetic Data

Figs. 6-8 show the results over the synthetic data. The RDB configurations are faster for small data sets, but starting from medium-sized data sets, the KG configurations perform better. Again, no significant difference in the overall execution time between the two traversal strategies can be observed. For the continuous behavior, only the KG configurations are reported. When considering dief@t (higher is better), no difference can be seen. However, when studying dief@k (lower is better), KG+DFS produces the first half of the answers faster than KG+BFS. This is due to the imbalanced nature of the PALADIN schema, as discussed above.

Scalability

Execution Time Scalability Study

Fig. 9: Execution Time Scalability Study

Fig. 9 shows that the execution time increases linearly with the number of nodes in the PALADIN schema, i.e., when the complexity of the queries is fixed. Due to the nature of the constraint queries (range queries over the primary key), the RDB configurations outperform the KG configurations.


References

[1] Philipp D. Rohde, Antonio Jesus Diaz-Honrubia, Emetis Niazmand, Maria-Esther Vidal. PALADIN: Benchmarks, Experimental Settings, and Evaluation. DOI: 10.57702/kf5tc88r

[2] Antonio Jesus Diaz-Honrubia, Philipp D. Rohde. Synthetic Data Generator. GitHub: SDM-TIB/Synthetic-Data-Generator

[3] Mónica Figuera, Philipp D. Rohde, Maria-Esther Vidal. Trav-SHACL: Efficiently Validating Networks of SHACL Constraints. In Proceedings of the Web Conference 2021 (WWW '21), April 19-23, 2021, Ljubljana, Slovenia. DOI: 10.1145/3442381.3449877

[4] Jose Emilio Labra Gayo, Eric Prud'hommeaux, Bogdan Roman, Toni Cebrían, Andrew Berezovskyi. Shaclex v0.2.2. GitHub: weso/shaclex

[5] Harold Solbrig, Alejandro González Hevia, Vincent Emonet, Egon Willighagen, Josh Moore, Andra Waagmeester, Ben McAlister. PyShEx v0.8.2. GitHub: hsolbrig/PyShEx

[6] Maribel Acosta, Maria-Esther Vidal, York Sure-Vetter. Diefficiency Metrics: Measuring the Continuous Efficiency of Query Processing Approaches. In Proceedings of the International Semantic Web Conference, 2017. DOI: 10.1007/978-3-319-68204-4_1