
4. FedShop Data Generator

Minh-Hoang DANG edited this page Sep 24, 2023 · 4 revisions

4.1 Introduction

The FedShop Data Generator consists of three WatDiv template models located in experiments/bsbm/model, which adhere closely to the BSBM specification. The use of WatDiv models makes it easy to modify the schema through the configuration file experiments/bsbm/config.yaml.

The majority of FedShop's parameters are defined in experiments/bsbm/config.yaml, including the number of products to generate, as well as the number of vendors and rating sites to be included.

4.2 General workflow

  • Create the catalogue of products (200000 by default).
  • Batch(0) = Create 10 autonomous vendors and 10 autonomous rating sites sharing products from the catalogue (products are replicated with a local URL for each vendor and rating site). The distribution law can be controlled with parameters declared in experiments/bsbm/config.yaml.
  • Workload = Instantiate each of the 12 template queries with 10 different random placeholder values, such that every query returns results.
  • Compute the minimal source assignment of each of the 120 queries of the Workload on Batch(0).
  • For i from 1 to 9:
    • Batch(i) = Batch(i-1) + 10 new vendors + 10 new rating sites
    • Compute the minimal source assignment for each query of the Workload over Batch(i)
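
The growth of the federation across batches can be checked with a few lines of Python. This is only a sketch of the arithmetic stated in the workflow above; the constant and function names are illustrative and do not come from the FedShop code base.

```python
# Numbers taken from the workflow description above.
N_BATCHES = 10          # Batch(0) .. Batch(9)
MEMBERS_PER_STEP = 10   # vendors (and rating sites) added at each batch
N_TEMPLATES = 12        # template queries
N_INSTANCES = 10        # instances per template query


def batch_sizes(n_batches=N_BATCHES, step=MEMBERS_PER_STEP):
    """Return (vendors, rating_sites) per batch; batches are cumulative."""
    return [(step * (i + 1), step * (i + 1)) for i in range(n_batches)]


sizes = batch_sizes()
workload_size = N_TEMPLATES * N_INSTANCES

for i, (vendors, rating_sites) in enumerate(sizes):
    print(f"Batch({i}): {vendors} vendors, {rating_sites} rating sites")
print(f"Workload: {workload_size} queries")
```

Under these numbers, the last batch contains 100 vendors and 100 rating sites, and the workload contains 120 queries.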

4.3 Generate Data

In this section, we'll provide you with a step-by-step guide to generating data for the benchmark. By following these instructions, you'll be able to generate the necessary data and obtain the results in experiments/bsbm/model/dataset.

4.3.1 Create the Virtual Catalog

Please note that in the examples below, we use {%component_name} to expose components that can be modified downstream by the generation script. All the templates for generating product, vendor, and rating sites are available under experiments/bsbm/model/watdiv.
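
To make the {%component_name} convention concrete, here is a minimal sketch of how such placeholders could be filled in from a dictionary of parameters. The function render_template and its behaviour on unknown names are assumptions made for illustration; the actual generation script may substitute values differently.

```python
import re

# Matches placeholders such as {%product_n} or {%export_output_dir}.
PLACEHOLDER = re.compile(r"\{%([A-Za-z_][A-Za-z0-9_]*)\}")


def render_template(template: str, params: dict) -> str:
    """Replace every {%name} with str(params[name]); fail on unknown names."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in params:
            raise KeyError(f"no value provided for placeholder {name!r}")
        return str(params[name])

    return PLACEHOLDER.sub(substitute, template)


# A line from the Product template, with the default catalogue size.
line = "<type> bsbm:Product {%product_n}"
rendered = render_template(line, {"product_n": 200000})
print(rendered)

# A provenance template; the member identifier "vendor0" is hypothetical.
provenance = render_template("http://www.{%vendor_id}.fr/", {"vendor_id": "vendor0"})
print(provenance)
```

Failing loudly on a missing parameter is a design choice of this sketch: it surfaces a typo in the configuration before any data is generated.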

#namespace <prefix>=<URI> means that each prefix is linked to a specific URI, like the PREFIX clause in SPARQL queries.

For example, we can have the following prefix:

#namespace  bsbm=http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/

#namespace __provenance={%provenance} means that each entity that uses provenance uses it as the base URI.
#namespace __output_org=fragmented means that each entity will be generated in a separate file. This is particularly helpful to replicate each product to a shop/rating site downstream.
#namespace __output_dir={%export_output_dir} means that each entity's files will be generated in the export_output_dir

#namespace	<prefix>=<URI>
#namespace  __provenance={%provenance}
#namespace  __output_org=fragmented
#namespace  __output_dir={%export_output_dir}

// ===== ENTITIES & LITERAL PROPERTIES ===== //

// ----- <linked_subject> ----- //
<type> <linked_subject_class> {%<linked_subject>_n}

<pgroup> <predicate_probability>
#predicate <predicate> <object>
</pgroup>

</type>

// ----- <main_subject> ----- //

<type> <main_subject_class> {%<main_subject>_n}

<pgroup> <predicate_probability>
#predicate <predicate> <object>
</pgroup>

</type>

#association <main_subject_class> <predicate> <main_subject_type> 2 <number_of_main_subject_type> <association_probability> NORMAL // (Many-To-Many) All <main_subject_class> have <number_of_main_subject_type> <main_subject_type>, following a normal distribution, with the probability of <association_probability>.

#association <main_subject_class> <predicate> <linked_subject_class> 2 1 <association_probability> NORMAL // (Many-To-One) All <main_subject_class> have only 1 <linked_subject_class>, following Normal distribution, with the probability of <association_probability>.

Using the template provided above, we can create a file for the Product entity, such as the one shown below.

#namespace	bsbm=http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/
#namespace	rdfs=http://www.w3.org/2000/01/rdf-schema#
#namespace	rdf=http://www.w3.org/1999/02/22-rdf-syntax-ns#
#namespace	dc=http://purl.org/dc/elements/1.1/
#namespace  __provenance={%provenance}
#namespace  __output_org=fragmented
#namespace  __output_dir={%export_output_dir}
#namespace  __replicated=true

// ===== ENTITIES & LITERAL PROPERTIES ===== //

// ----- Producer ----- //
<type> bsbm:Producer {%producer_n}

<pgroup> 1.0
#predicate rdfs:label string{%label_wc}
</pgroup>

<pgroup> 1.0
#predicate rdfs:comment string{%producer_comment_wc}
</pgroup>

<pgroup> 1.0
#predicate bsbm:country country
</pgroup>

<pgroup> 1.0
#predicate bsbm:publishDate date 2000-07-20 2005-06-23
</pgroup>

</type>

// ----- Product ----- //

<type> bsbm:Product {%product_n}

<pgroup> 1.0
#predicate rdfs:label string{%label_wc}
</pgroup>

<pgroup> 1.0
#predicate rdfs:comment string{%comment_wc}
</pgroup>

<pgroup> 1.0
#predicate bsbm:productPropertyTextual1 string{%textual_wc}
</pgroup>

<pgroup> 1.0
#predicate bsbm:productPropertyTextual2 string{%textual_wc}
</pgroup>

<pgroup> 1.0
#predicate bsbm:productPropertyTextual3 string{%textual_wc}
</pgroup>

<pgroup> {%productPropertyTextual4_p}
#predicate bsbm:productPropertyTextual4 string{%textual_wc}
</pgroup>

<pgroup> {%productPropertyTextual5_p}
#predicate bsbm:productPropertyTextual5 string{%textual_wc}
</pgroup>

<pgroup> 1.0
#predicate bsbm:productPropertyNumeric1 integer 1 2000 normal
</pgroup>

<pgroup> 1.0
#predicate bsbm:productPropertyNumeric2 integer 1 2000 normal
</pgroup>

<pgroup> 1.0
#predicate bsbm:productPropertyNumeric3 integer 1 2000 normal
</pgroup>

<pgroup> {%productPropertyNumeric4_p}
#predicate bsbm:productPropertyNumeric4 integer 1 2000 normal
</pgroup>

<pgroup> {%productPropertyNumeric5_p}
#predicate bsbm:productPropertyNumeric5 integer 1 2000 normal
</pgroup>

<pgroup> 1.0
#predicate bsbm:publishDate date 2000-09-20 2006-12-23
</pgroup>

</type>

// ----- ProductFeature ----- //
<type> bsbm:ProductFeature {%feature_n}

<pgroup> 1.0
#predicate rdfs:label string{%label_wc}
</pgroup>

<pgroup> 1.0
#predicate rdfs:comment string{%feature_comment_wc}
</pgroup>

<pgroup> 1.0
#predicate bsbm:publishDate date 2000-05-20 2000-06-23
</pgroup>

</type>

// ----- ProductType ----- //
<type> bsbm:ProductType {%type_n}

<pgroup> 1.0
#predicate rdfs:label string{%label_wc}
</pgroup>

<pgroup> 1.0
#predicate rdfs:comment string{%type_comment_wc}
</pgroup>

<pgroup> 1.0
#predicate bsbm:publishDate date 2000-05-20 2000-06-23
</pgroup>

</type>

// Every product has several product types
#association bsbm:Product rdf:type bsbm:ProductType 2 {%type_c} 1.0 NORMAL

// Every product has several product features
#association bsbm:Product bsbm:productFeature bsbm:ProductFeature 2 {%feature_c} 1.0 NORMAL

// Every product has a producer
#association bsbm:Product bsbm:producer bsbm:Producer 2 1 1.0 NORMAL


The main subject of this template file is Product.
The template also includes Producer, ProductFeature, and ProductType as linked subjects.

Linked subjects are entities that are connected to the main subject of the file, which in this case is Product.
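
The number after <pgroup> is the probability that an entity carries the predicates of that group. The sketch below simulates this for three predicates of Product. The probabilities 0.7 and 0.8 are the values set in config.yaml; the function sample_predicates is a hypothetical helper and is not how WatDiv is implemented.

```python
import random


def sample_predicates(pgroups: dict, rng: random.Random) -> list:
    """Keep each predicate with its own probability (1.0 means always present)."""
    return [pred for pred, p in pgroups.items() if rng.random() < p]


# Probabilities as in the Product template and config.yaml.
product_pgroups = {
    "rdfs:label": 1.0,
    "bsbm:productPropertyTextual4": 0.7,
    "bsbm:productPropertyTextual5": 0.8,
}

rng = random.Random(42)  # fixed seed so the sketch is reproducible
products = [sample_predicates(product_pgroups, rng) for _ in range(10_000)]
share = sum("bsbm:productPropertyTextual4" in p for p in products) / len(products)
print(f"share of products with productPropertyTextual4: {share:.2f}")
```

With enough products, the observed share stays close to the configured probability, while predicates in a group with probability 1.0 appear on every product.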

4.3.2 Template for generating federation members

The general template for generating federation members is similar to the one for the virtual catalog, but there are some differences, which are explained below.

#namespace	<prefix>=<URI>
#namespace  __provenance={%provenance}
#namespace  __output_org=monolithic
#namespace  __output_dir={%export_output_dir}
#namespace  __output_file={%ratingsite_id}
#namespace  __output_dep={%export_dep_output_dir}
#namespace  __output_dep_org=fragmented
#namespace  __output_dep_rename_exception_predicates=<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/country>;

// ===== ENTITIES & LITERAL PROPERTIES ===== //

// ----- <global_subject> ----- //

<type> <global_subject_class> {%<global_subject>_n}
</type>

// ----- <linked_subject> ----- //
<type> <linked_subject_class> {%<linked_subject>_n}

<pgroup> <predicate_probability>
#predicate <predicate> <object>
</pgroup>

</type>

// ----- <main_subject> ----- //

<type> <main_subject_class> {%<main_subject>_n}

<pgroup> <predicate_probability>
#predicate <predicate> <object>
</pgroup>

</type>

// Every <linked_subject> is related to [1] <main_subject> (drawn with a ZIPFIAN) with the probability 1.0 
#association <linked_subject> <predicate> <main_subject> 2 1 NORMAL NORMAL

// Every generated <linked_subject> is related to [1] <global_subject> (drawn with a ZIPFIAN) with the probability 1.0 
#association1 <linked_subject> <predicate> <global_subject> 2 1 NORMAL NORMAL

// Every existing generated <linked_subject> is related to [Many] <other_linked_subject> (drawn with a ZIPFIAN) with the probability 1.0
#association1 <linked_subject> <predicate> <other_linked_subject> 2 1 NORMAL NORMAL

We'll now provide a template for RatingSite that replicates products from the Virtual Catalog, building upon the previous template we've discussed. It's worth noting that the same principles apply to Vendor, as well.

#namespace __output_org=monolithic means that all quads generated for all rating-site entities are written to a single file. This facilitates the distribution and maintenance of endpoints downstream.
#namespace __output_dep={%export_dep_output_dir} indicates where to look for dependencies, e.g., Product.
#namespace __output_dep_org=fragmented means that the dependencies, i.e., Product, are in separate files.
#namespace __output_dep_rename_exception_predicates=<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/country>; indicates a semicolon-separated list of URIs that will not be localized when replicated.

#namespace	bsbm=http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/
#namespace	rdfs=http://www.w3.org/2000/01/rdf-schema#
#namespace	rdf=http://www.w3.org/1999/02/22-rdf-syntax-ns#
#namespace	dc=http://purl.org/dc/elements/1.1/
#namespace  rev=http://purl.org/stuff/rev#
#namespace  foaf=http://xmlns.com/foaf/0.1/
#namespace  __provenance={%provenance}
#namespace  __output_org=monolithic
#namespace  __output_dir={%export_output_dir}
#namespace  __output_file={%ratingsite_id}
#namespace  __output_dep={%export_dep_output_dir}
#namespace  __output_dep_org=fragmented
#namespace  __output_dep_rename_exception_predicates=<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/country>;

// ===== ENTITIES & LITERAL PROPERTIES ===== //

// ----- Product ----- //

<type> bsbm:Product {%product_n}
</type>

// ----- RatingSite ----- //

<type> bsbm:RatingSite 1

<pgroup> 1.0
#predicate rdfs:label string{%label_wc}
</pgroup>

<pgroup> 1.0
#predicate bsbm:country country
</pgroup>

</type>

// ----- Person ----- //

<type> bsbm:Person {%person_n}

<pgroup> 1.0
#predicate foaf:name name{%person_name_wc}
</pgroup>

<pgroup> 1.0
#predicate foaf:mbox_sha1sum integer
</pgroup>

<pgroup> 1.0
#predicate bsbm:country country
</pgroup>

<pgroup> 1.0
#predicate bsbm:publishDate date 2008-5-20 2008-8-23
</pgroup>

</type>

// ----- Review ----- //
<type> bsbm:Review {%review_n}

<pgroup> 1.0
#predicate dc:title string{%title_wc}
</pgroup>

<pgroup> 1.0
#predicate rev:text string{%text_wc}
</pgroup>

<pgroup> {%rating1_p}
#predicate bsbm:rating1 integer 1 10 normal
</pgroup>

<pgroup> {%rating2_p}
#predicate bsbm:rating2 integer 1 10 normal
</pgroup>

<pgroup> {%rating3_p}
#predicate bsbm:rating3 integer 1 10 normal
</pgroup>

<pgroup> {%rating4_p}
#predicate bsbm:rating4 integer 1 10 normal
</pgroup>

<pgroup> 1.0
#predicate bsbm:publishDate date
</pgroup>

<pgroup> 1.0
#predicate bsbm:reviewDate date 2007-01-01 2007-12-31
</pgroup>

</type>

// Every bsbm:Review is related to [1] bsbm:RatingSite (drawn with a ZIPFIAN) with the probability 1.0 
#association bsbm:Review dc:publisher bsbm:RatingSite 2 1 NORMAL NORMAL

// Every generated bsbm:Review is related to [1] bsbm:Product (drawn with a ZIPFIAN) with the probability 1.0 
#association1 bsbm:Review bsbm:reviewFor bsbm:Product 2 1 NORMAL NORMAL

// Every existing generated bsbm:Review is related to [Many] bsbm:Person (drawn with a ZIPFIAN) with the probability 1.0
#association1 bsbm:Review rev:reviewer bsbm:Person 2 1 NORMAL NORMAL
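
The replication of catalogue products into a federation member, with URIs localized to the member's provenance, can be pictured as follows. This is a sketch under strong assumptions: the function localize, the tuple representation of triples, the example.org catalogue base, and the reading of __output_dep_rename_exception_predicates as "objects of these predicates keep their original URI" are illustrative guesses, not the generator's actual behaviour.

```python
BSBM = "http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/"


def localize(triples, source_base, member_base, exception_predicates=()):
    """Rewrite URIs from source_base to member_base.

    Objects of predicates listed in exception_predicates are kept as-is
    (assumed meaning of the rename-exception list).
    """
    def rename(term):
        if term.startswith(source_base):
            return member_base + term[len(source_base):]
        return term

    localized = []
    for s, p, o in triples:
        new_o = o if p in exception_predicates else rename(o)
        localized.append((rename(s), p, new_o))
    return localized


catalog = "http://example.org/catalog/"   # hypothetical catalogue base URI
member = "http://www.ratingsite0.fr/"     # hypothetical member provenance
triples = [
    (catalog + "Product1", BSBM + "producer", catalog + "Producer7"),
    (catalog + "Product1", BSBM + "country", catalog + "Country3"),
]
local = localize(triples, catalog, member, exception_predicates={BSBM + "country"})
for triple in local:
    print(triple)
```

In this sketch the product and its producer receive member-local URIs, while the object of bsbm:country keeps its shared URI.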

4.3.2 Configuration

Below is an explanation of our configuration file config.yaml, whose purpose is to set the components exposed by the WatDiv templates described in the previous sections. To achieve this, we use OmegaConf, a YAML-based hierarchical configuration system. Custom resolvers such as normal_truncated and get_docker_endpoints are defined in rsfb/utils.py.

  • generation
    • workdir: Directory where all data and queries are generated
    • n_batch: Number of generation steps (batches) for data and queries
    • n_query_instances: Number of instances generated for each query
    • n_federation_members: Number of sources in the federation (e.g. "${sum: ${generation.schema.vendor.params.vendor_n}, ${generation.schema.ratingsite.params.ratingsite_n}}" is the sum of the number of vendors and the number of rating sites)
    • verbose: Whether to print workflow log information while it is running
    • stats
      • confidence_level: Accuracy of the resulting data
    • generator
      • dir: Where WatDiv is located
      • exec: Where the executable is located
    • virtuoso
      • compose_file: Where the Docker Compose file is located (e.g. we have one Virtuoso endpoint per batch, and one Docker container per batch)
      • service_name: Generic name of the Docker service
      • endpoints: List of all Virtuoso endpoints, each corresponding to a running Docker container
      • container_names: List of all running Docker containers
    • schema
      • <subject>
        • is_source: Whether the subject is a federation member
        • provenance: URI template (for a global subject this is just a base URI; for a federation member it is a template URI such as http://www.{%<subject>_id}.fr/)
        • template: Where the template file is located
        • scale_factor: Percentage applied for a new batch (1 corresponds to 100%)
        • export_output_dir: Where all the .nq files are generated
        • params: Values for the components exposed in the template: the number of entities to generate (more precisely, of the linked subjects), and the probabilities and distribution laws that predicates and some objects follow (for an object, the law is used to generate its value; for a predicate, it only determines whether the predicate is present)

Following the explanation provided above, the configuration file config.yaml is shown below.

generation:
  workdir: "experiments/bsbm"
  n_batch: 10
  n_query_instances: 10
  n_federation_members: "${sum: ${generation.schema.vendor.params.vendor_n}, ${generation.schema.ratingsite.params.ratingsite_n}}"
  verbose: true
  stats:
    confidence_level: 0.95
  generator: 
    dir: "generators/watdiv"
    exec: "${generation.generator.dir}/bin/Release/watdiv"
  virtuoso:
    compose_file: "${generation.workdir}/docker/virtuoso.yml"
    service_name: "bsbm-virtuoso"
    endpoints: "${get_docker_endpoints: ${generation.virtuoso.compose_file}, ${generation.virtuoso.service_name}}" 
    container_names: "${get_virtuoso_containers: ${generation.virtuoso.compose_file}, ${generation.virtuoso.service_name}}" 
  schema:
    # Configuration for ONE batch
    product:
      is_source: false 
      provenance: http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/ # Prefix URI
      # Products are generated once, independent from vendor and person
      template: "${generation.workdir}/model/watdiv/bsbm-product.template"
      scale_factor: 1
      export_output_dir: "${generation.workdir}/model/tmp/product"
      params:
        # type
        product_n: 200000 # Number of distinct products to generate
        producer_n: "${get_product_producer_n: ${generation.schema.product.params.product_n}}" # One producer per product
        feature_n: "${get_product_feature_n: ${generation.schema.product.params.product_n}}" # One feature per product
        feature_c: 9
        type_n: "${get_product_type_n: ${generation.schema.product.params.product_n}}" # One type per product
        type_c: 9
        # pgroup
        productPropertyTextual4_p: 0.7 # Probability to have the predicate productPropertyTextual4
        productPropertyTextual5_p: 0.8 # Probability to have the predicate productPropertyTextual5
        productPropertyNumeric4_p: 0.7 # Probability to have the predicate productPropertyNumeric4
        productPropertyNumeric5_p: 0.8 # Probability to have the predicate productPropertyNumeric5
        textual_wc: "${normal_truncated: 9, 3, 3, 15}"
        label_wc: "${normal_truncated: 2, 1, 1, 3}"
        comment_wc: "${normal_truncated: 100, 20, 50, 150}"
        type_comment_wc: "${normal_truncated: 35, 10, 20, 50}"
        feature_comment_wc: "${normal_truncated: 35, 10, 20, 50}"
        producer_comment_wc: "${normal_truncated: 35, 10, 20, 50}"

    vendor:
      is_source: true 
      provenance: http://www.{%vendor_id}.fr/ # Template URI
      template: "${generation.workdir}/model/watdiv/bsbm-vendor.template" 
      export_output_dir: "${generation.workdir}/model/dataset"
      export_dep_output_dir: "${generation.schema.product.export_output_dir}"
      scale_factor: 1
      params:
        vendor_n: "${multiply: 10, ${generation.n_batch}}" # Increase the number of vendors with a step of 10 per batch
        offer_n: "${normal_dist: 3, 1, 2000}" # Specs: 100 productsVendorsRatio * 20 avgOffersPerProduct, ref: bsbmtools
        product_n: "${generation.schema.product.params.product_n}" # All generated products will be sold by a vendor
        label_wc: "${normal_truncated: 2, 1, 1, 3}"
        comment_wc: "${normal_truncated: 35, 10, 20, 50}"

    ratingsite:
      is_source: true
      provenance: http://www.{%ratingsite_id}.fr/
      template: "${generation.workdir}/model/watdiv/bsbm-ratingsite.template" 
      export_output_dir: "${generation.workdir}/model/dataset"
      export_dep_output_dir: "${generation.schema.product.export_output_dir}"
      scale_factor: 1
      params:
        #type
        ratingsite_n: "${multiply: 10, ${generation.n_batch}}" # Increase the number of rating sites with a step of 10 per batch
        product_n: "${generation.schema.product.params.product_n}" # All generated products will have a rating
        review_n: "${normal_dist: 3, 1, 10000}" # Specs: 10000
        person_n: "${divide: ${generation.schema.ratingsite.params.review_n}, 20}" # Number of people who rate a product
        person_name_wc: "${normal_truncated: 3, 1, 2, 4}"
        label_wc: "${normal_truncated: 2, 1, 1, 3}"
        text_wc: "${normal_truncated: 125, 20, 50, 200}"
        title_wc: "${normal_truncated: 9, 3, 4, 15}"
        #pgroup
        rating1_p: 0.7 # Probability to have the predicate rating1
        rating2_p: 0.7 # Probability to have the predicate rating2
        rating3_p: 0.7 # Probability to have the predicate rating3
        rating4_p: 0.7 # Probability to have the predicate rating4
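
Entries such as "${normal_truncated: 9, 3, 3, 15}" suggest a normal distribution restricted to an interval. The sketch below assumes the argument order (mean, standard deviation, lower bound, upper bound) and draws an integer by rejection sampling. It only illustrates the idea: the real resolver lives in rsfb/utils.py and may use a different argument order or return something other than a single draw.

```python
import random


def normal_truncated(mean, std, lower, upper, rng=random):
    """Draw an integer from N(mean, std), redrawing until it lies in [lower, upper]."""
    while True:
        value = round(rng.gauss(mean, std))
        if lower <= value <= upper:
            return value


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Same arguments as textual_wc in config.yaml (assumed: mean, std, min, max).
textual_wc = [normal_truncated(9, 3, 3, 15, rng) for _ in range(1000)]
print(min(textual_wc), max(textual_wc))
```

Rejection sampling keeps the shape of the normal distribution inside the bounds, at the cost of occasionally discarding a draw.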

4.3.3 Run the generator

Once config.yaml is properly set, you can launch the generation of the FedShop benchmark with the following command:

python rsfb/benchmark.py generate data experiments/bsbm/config.yaml [OPTIONS]

OPTIONS:
--clean [benchmark|metrics|instances][+db]: clean the benchmark|metrics|instances then (optional) destroy all database containers
--touch : mark a phase as "terminated" so snakemake would not rerun it.

4.3.4 Expected outcome

Generating data for the benchmark is a complex and lengthy process that creates numerous artefacts under the experiments/bsbm directory. The datasets are generated under experiments/bsbm/model/dataset.
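
A quick way to inspect the outcome is to list the generated files. The sketch below assumes that the datasets are .nq files stored directly under the dataset directory; the helper list_nq_files is hypothetical and the actual layout may contain subdirectories.

```python
from pathlib import Path


def list_nq_files(dataset_dir):
    """Return the .nq files found directly under dataset_dir, sorted by name."""
    return sorted(Path(dataset_dir).glob("*.nq"))


if __name__ == "__main__":
    for path in list_nq_files("experiments/bsbm/model/dataset"):
        print(path.name, path.stat().st_size, "bytes")
```

An empty listing after a run would suggest that the generation did not complete or that export_output_dir points elsewhere.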

4.4 Generate queries

To generate queries for the benchmark, we adapted the queries from the BSBM Explore Use Case. This section walks through an example with q04; all the adapted queries are available in experiments/bsbm/queries. The instantiated queries, along with their reference source selection and results, are written to experiments/bsbm/evaluation/benchmark/generation/ at the end of the process.

Below is the original q04 from BSBM:

PREFIX bsbm-inst: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/>
PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT DISTINCT ?product ?label ?propertyTextual
WHERE {
    {
        ?product rdfs:label ?label .
        ?product rdf:type %ProductType% .
        ?product bsbm:productFeature %ProductFeature1% .
        ?product bsbm:productFeature %ProductFeature2% .
        ?product bsbm:productPropertyTextual1 ?propertyTextual .
        ?product bsbm:productPropertyNumeric1 ?p1 .
        FILTER ( ?p1 > %x% )
    } UNION {
        ?product rdfs:label ?label .
        ?product rdf:type %ProductType% .
        ?product bsbm:productFeature %ProductFeature1% .
        ?product bsbm:productFeature %ProductFeature3% .
        ?product bsbm:productPropertyTextual1 ?propertyTextual .
        ?product bsbm:productPropertyNumeric2 ?p2 .
        FILTER ( ?p2 > %y% )
    }
}
ORDER BY ?label
OFFSET 5
LIMIT 10

with the following placeholders:

| Parameter | Description |
| --- | --- |
| %ProductType% | A randomly selected class URI from the class hierarchy (leaf level). |
| %ProductFeature1%, %ProductFeature2%, %ProductFeature3% | Three different, randomly selected product feature URIs that correspond to the chosen product type. |
| %x%, %y% | Two random numbers between 1 and 500. |
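
The %name% placeholders of a BSBM template can be listed, and turned into SPARQL variables, with a short regular expression. The helper names below are illustrative only, and the snippet is a shortened excerpt of q04.

```python
import re

# A BSBM placeholder is a name enclosed in percent signs, e.g. %ProductType%.
PLACEHOLDER = re.compile(r"%(\w+)%")


def placeholder_names(query: str) -> list:
    """Return placeholder names in order of first appearance, without duplicates."""
    return list(dict.fromkeys(PLACEHOLDER.findall(query)))


def to_variables(query: str) -> str:
    """Rewrite every %name% placeholder as the SPARQL variable ?name."""
    return PLACEHOLDER.sub(r"?\1", query)


snippet = """
?product rdf:type %ProductType% .
?product bsbm:productFeature %ProductFeature1% .
?product bsbm:productFeature %ProductFeature2% .
FILTER ( ?p1 > %x% )
"""
names = placeholder_names(snippet)
print(names)
print(to_variables(snippet))
```

Turning placeholders into variables is what allows candidate values to be retrieved from the data itself, rather than drawn blindly.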

4.4.1 Devise a placeholder replacement strategy:

The injection engine works as follows:

1. First iteration: build the `value_selection` query by projecting all placeholders and disabling any `FILTER`, `ORDER BY`, `LIMIT`, etc.
2. For the following iterations, try these three options in order, moving to the next one only if the current one does not work, then inject:
    2.1 Option 1: Exclude the partially injected query to refill
    2.2 Option 2: Extract the needed values for the placeholders from value_selection.csv
    2.3 Option 3: Relax the query, knowing there is NO solution mapping for the given combination of placeholders
    2.4 Inject placeholder values:
        - For every uninjected constant, ordered by priority:
            - If this is the first injection, or the operator is unary, inject with the `instance_id`-th row of `value_selection`
            - Else, each operator has its own rule to inject missing constants:
                - Comparison op, e.g., ?a > ?b: first try to select a random value constrained by the operator
                - Dependent difference ($!) or independent difference (!=): randomly choose a value that is different from the injected constant
                - Containment ("in"): randomly choose one of the 10 least common words
3. Restore the original `SELECT`, `FILTER`, `ORDER BY`, `LIMIT`, etc.

With this in mind, the right strategy for q04 should be, in plain English:

1. Inject all placeholders tied to `sameAs` exclusively first.
2. Inject the constrained and filter placeholders of the BGP left of UNION.
3. Inject the constrained and filter placeholders of the BGP right of UNION.
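
The per-operator rules of step 2.4 can be sketched as small functions. Everything here is an assumption made for illustration: value_selection is modelled as a list of dictionaries instead of value_selection.csv, and the function names, the sample rows, and the lower bound of 1 for comparisons are hypothetical.

```python
import random


def inject_first(value_selection, instance_id, name):
    """First injection: take the instance_id-th row of value_selection."""
    return value_selection[instance_id][name]


def inject_different(value_selection, name, injected, rng):
    """Difference operator: pick a random value not among the injected ones."""
    candidates = sorted({row[name] for row in value_selection} - set(injected))
    return rng.choice(candidates) if candidates else None


def inject_less_than(observed, rng, lower=1):
    """Comparison such as ?x < ?p1: pick a value below the largest observed ?p1."""
    upper = max(observed) - 1
    if upper < lower:
        return None  # no value satisfies the constraint
    return rng.randint(lower, upper)


value_selection = [
    {"ProductFeature1": "f1", "ProductFeature2": "f1", "p1": 120},
    {"ProductFeature1": "f2", "ProductFeature2": "f3", "p1": 480},
    {"ProductFeature1": "f3", "ProductFeature2": "f2", "p1": 35},
]
rng = random.Random(7)  # fixed seed so the sketch is reproducible
feature1 = inject_first(value_selection, 0, "ProductFeature1")
feature2 = inject_different(value_selection, "ProductFeature2", [feature1], rng)
x = inject_less_than([row["p1"] for row in value_selection], rng)
print(feature1, feature2, x)
```

Returning None when no candidate exists leaves room for the fallback options (refill, lookup, relaxation) listed in step 2.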

4.4.2 Annotate the strategy inside the query

We will annotate this query for the benchmark. First, every placeholder %placeholder% is replaced with a variable ?placeholder, which is then annotated with a const directive in the comment just above the triple pattern. The syntax is as follows:

const[EXCLUSIVE][PRIORITY] VARIABLE
EXCLUSIVE = "!": evaluate the template query with the marked triple patterns exclusively first
PRIORITY = "*": triple patterns with more "*" are evaluated first

For example:

const ?placeholder # Replace placeholder

The result is as follows:

PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT DISTINCT ?product ?label ?propertyTextual
WHERE {
    { 
        ?product rdfs:label ?label .
        # const!* ?ProductType
        ?product rdf:type ?localProductType .
        ?localProductType owl:sameAs ?ProductType .
        # const!* ?ProductFeature1
        ?product bsbm:productFeature ?localProductFeature1 .
        ?localProductFeature1 owl:sameAs ?ProductFeature1.
        # const** ?ProductFeature2 != ?ProductFeature1
        ?product bsbm:productFeature ?localProductFeature2 .
        ?localProductFeature2 owl:sameAs ?ProductFeature2.
        ?product bsbm:productPropertyTextual1 ?propertyTextual .
        ?product bsbm:productPropertyNumeric1 ?p1 .
        # const** ?x < ?p1
        FILTER ( ?p1 > ?x )
    } UNION {
        ?product rdfs:label ?label .
        # const!* ?ProductType
        ?product rdf:type ?localProductType .
        ?localProductType owl:sameAs ?ProductType .
        # const!* ?ProductFeature1
        ?product bsbm:productFeature ?localProductFeature1 .
        ?localProductFeature1 owl:sameAs ?ProductFeature1 .
        # const* ?ProductFeature3 != ?ProductFeature2, ?ProductFeature1
        ?product bsbm:productFeature ?localProductFeature3 .
        ?localProductFeature3 owl:sameAs ?ProductFeature3 .
        ?product bsbm:productPropertyTextual1 ?propertyTextual .
        ?product bsbm:productPropertyNumeric2 ?p2 .
        # const ?y < ?p2
        FILTER ( ?p2 > ?y ) 
    } 
}
ORDER BY ?label
OFFSET 5
LIMIT 10

Note that triple patterns marked with @skip are deactivated (commented out) by the execution engine.
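
The const annotations of the query above follow a regular shape, so they can be parsed and ordered with a few lines of Python. This is a sketch, not the benchmark's parser: the Const structure, the regular expression, and the choice to sort exclusive annotations first and then by decreasing number of "*" are assumptions derived from the syntax description.

```python
import re
from typing import NamedTuple, Optional

ANNOTATION = re.compile(
    r"#\s*const(?P<exclusive>!?)(?P<priority>\**)\s+(?P<variable>\?\w+)"
    r"(?:\s*(?P<operator>\S+)\s+(?P<operands>.+))?"
)


class Const(NamedTuple):
    variable: str
    exclusive: bool
    priority: int            # number of "*" in the annotation
    operator: Optional[str]  # e.g. "!=" or "<", None when absent
    operands: tuple


def parse_const(line: str) -> Optional[Const]:
    """Parse one '# const...' comment; return None if the line is not one."""
    match = ANNOTATION.search(line)
    if match is None:
        return None
    operands = match.group("operands") or ""
    return Const(
        variable=match.group("variable"),
        exclusive=match.group("exclusive") == "!",
        priority=len(match.group("priority")),
        operator=match.group("operator"),
        operands=tuple(v.strip() for v in operands.split(",") if v.strip()),
    )


annotations = [
    "# const!* ?ProductType",
    "# const!* ?ProductFeature1",
    "# const** ?ProductFeature2 != ?ProductFeature1",
    "# const** ?x < ?p1",
    "# const* ?ProductFeature3 != ?ProductFeature2, ?ProductFeature1",
    "# const ?y < ?p2",
]
consts = [parse_const(line) for line in annotations]
# Assumed ordering: exclusive annotations first, then more "*" first.
ordered = sorted(consts, key=lambda c: (not c.exclusive, -c.priority))
print([c.variable for c in ordered])
```

With this ordering, the placeholders tied to sameAs come first, followed by those of the left branch of the UNION and then those of the right branch, which matches the strategy stated for q04.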

4.4.3 Run the generator

python rsfb/benchmark.py generate queries experiments/bsbm/config.yaml  [OPTIONS]

OPTIONS:
--clean [benchmark|metrics|instances][+db]: clean the benchmark|metrics|instances then (optional) destroy all database containers
--touch : mark a phase as "terminated" so snakemake would not rerun it.

4.4.4 Expected outcome

The queries are generated under experiments/bsbm/benchmark/generation and include the following files:

  • injected.sparql: given the template query in experiments/bsbm/queries, this is the instantiated query with all placeholders replaced with real values.
  • provenance[.opt].sparql: source selection query built by wrapping each triple pattern in injected.sparql with GRAPH. opt is an optimized version with better decomposition.
  • provenance[.opt].csv: the results obtained by executing provenance[.opt].sparql.
  • results.csv: the results obtained by executing injected.sparql.
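
The construction of provenance.sparql, where each triple pattern is wrapped with GRAPH, can be illustrated as follows. The function wrap_with_graph and the variable names ?tp1, ?tp2, ... are hypothetical; the actual query builder may name graph variables differently and also handles the rest of the query.

```python
def wrap_with_graph(triple_patterns):
    """Wrap each triple pattern in its own GRAPH clause with a fresh variable."""
    wrapped = []
    for i, pattern in enumerate(triple_patterns, start=1):
        # Drop the trailing " ." so it can be re-added inside the braces.
        body = pattern.strip().rstrip(".").strip()
        wrapped.append(f"GRAPH ?tp{i} {{ {body} . }}")
    return wrapped


patterns = [
    "?product rdfs:label ?label .",
    "?product bsbm:productPropertyNumeric1 ?p1 .",
]
wrapped = wrap_with_graph(patterns)
for clause in wrapped:
    print(clause)
```

Projecting the graph variables of such a query yields, for each solution, the source that contributed to each triple pattern, which is the information stored in provenance.csv.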