IS4 edited this page Jul 19, 2023 · 2 revisions

Using the --sparql-query option, it is possible to execute custom SPARQL queries during the process, and match various entities handled by the analyzers.

Each query is executed at various points during the analysis. The data available to the query differs based on the presence of the --buffered option: if the option is present, the query operates on the whole graph, while if the option is not present, only a small section of the data, usually enough to describe a single entity, is used.
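As a rough illustration of why the distinction matters, the sketch below looks for two different nodes that share the same file name. It reuses the nfo:fileName property from the examples on this page. Because it relates two separate entities, it presumably needs --buffered to see both at once; whether an unbuffered section ever contains both nodes is an assumption, not something stated here.

```sparql
PREFIX nfo: <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#>

# Hypothetical example: names shared by at least two distinct nodes.
# Assumed to be meaningful only when the whole graph is available.
SELECT DISTINCT ?name
WHERE {
  ?a nfo:fileName ?name .
  ?b nfo:fileName ?name .
  FILTER(?a != ?b)
}
```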

Search

In the search mode, the --sparql-query option should point to a SELECT or ASK query. When a query is evaluated, its results are added to an internal storage, which is serialized to the output file when the process stops.

The evaluation of a query in this mode may also stop the process prematurely if one of these conditions is met:

  • The query uses ASK, and its result is determined to be true.
  • The query uses LIMIT, and the number of results reaches the limit. In this case, the process is stopped only if no other queries may still produce results, such as queries without LIMIT.
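A minimal sketch of the second condition: the query below asks for at most five file names, using the nfo:fileName property that also appears in the examples on this page. The limit of 5 is an arbitrary choice for illustration. If this were the only query supplied, the process should be able to stop once the limit is reached.

```sparql
PREFIX nfo: <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#>

# Illustrative only: the limit of 5 is arbitrary.
SELECT ?name
WHERE {
  [] nfo:fileName ?name .
}
LIMIT 5
```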

Examples

Presence of a PNG image

PREFIX schema: <http://schema.org/>

ASK WHERE {
  [] schema:encodingFormat <https://w3id.org/uri4uri/mime/image/png> .
}

Names of image files and their dimensions

PREFIX nfo: <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#>
PREFIX nie: <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX schema: <http://schema.org/>

SELECT DISTINCT ?name ?w ?h
WHERE {
  [
    nfo:fileName ?name ;
    nie:interpretedAs/dcterms:hasFormat [
      a schema:ImageObject ;
      nfo:width ?w ;
      nfo:height ?h
    ]
  ] .
}

File extraction

In the describe mode, the --sparql-query option should point to a SELECT query, which will be used to mark entities that should be matched and extracted if they are backed by binary data. The query should have a variable ?node, which is compared against the node representing the currently analyzed entity, extracting it as a file if the nodes are equal.

The name of the file can be determined by assigning the ?path_format variable in the query, which has the default value "${name}${extension}". Other properties related to the file may be substituted in ?path_format, such as ${media_type} or ${size}.

Examples

Match all entities and extract them according to their name and extension

SELECT ?node ?path_format
WHERE {
  ?node ?p ?o .
  BIND("extracted/${name}${extension}" AS ?path_format)
}
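
A variation on the query above, sketched to show one of the other substitutions: it places each extracted file in a directory named after its size. The "by-size" directory name is made up for this example, and the exact text that ${size} expands to is not specified here.

```sparql
SELECT ?node ?path_format
WHERE {
  ?node ?p ?o .
  # "by-size" is a hypothetical directory name; ${size} is substituted per file.
  BIND("by-size/${size}/${name}${extension}" AS ?path_format)
}
```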

Match all 256x256 images

PREFIX nfo: <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX schema: <http://schema.org/>

SELECT DISTINCT ?node
WHERE {
  ?node dcterms:hasFormat [
    a schema:ImageObject ;
    nfo:width 256 ;
    nfo:height 256
  ]
}
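
The two techniques can plausibly be combined. The sketch below matches nodes whose format is a PNG image and extracts them into a "png" directory. It assumes that schema:encodingFormat is attached to the same node that dcterms:hasFormat points to, which the examples on this page suggest but do not confirm, and the directory name is hypothetical.

```sparql
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX schema: <http://schema.org/>

SELECT DISTINCT ?node ?path_format
WHERE {
  # Assumption: the format node carries schema:encodingFormat.
  ?node dcterms:hasFormat [
    schema:encodingFormat <https://w3id.org/uri4uri/mime/image/png>
  ] .
  # "png" is a made-up output directory.
  BIND("png/${name}${extension}" AS ?path_format)
}
```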