Statistical Criteria

Michael Martin edited this page May 22, 2013 · 21 revisions



LODStats analyzes RDF datasets from different CKAN instances according to a set of configured criteria, using a stream-based approach. Datasets from CKAN are available either serialised as a file (in RDF/XML, N-Triples and other formats) or via SPARQL endpoints. Serialised datasets containing more than a few million triples (i.e. data items) tend to be too large for most existing analysis approaches, because the dataset or its representation as a graph exceeds the available main memory, where the complete dataset is commonly stored for statistical processing. LODStats' main advantage over existing approaches is its superior performance, especially for large datasets with many millions of triples, while it remains straightforward to extend with new analytical criteria.

The descriptions of the criteria below reuse the following notation:

//A triple is given by
s p o //wherein "s" is a subject, "p" a predicate and "o" an object

//A triple pattern is given by
?s ?p ?o //wherein ?s is a variable for the subject, ?p for the predicate and ?o for the object

G = directed graph, 
M = map, 
S = set, 
i, len = integer. 
+= and ++ denote the standard additions on those structures,
   i.e. adding edges to a graph, increasing the value stored under a key of a map,
   adding elements to a set and incrementing an integer value.
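
As a rough illustration, the notation above could map onto plain Python structures as follows. This is only a sketch: the sample triples (including aksw:Jens) and the choice of `set`, `Counter` and a dictionary of adjacency sets are assumptions made for this example, not necessarily what LODStats uses internally.

```python
from collections import Counter

# Hypothetical sample data: each triple is a (subject, predicate, object) tuple.
triples = [
    ("aksw:Ivan", "rdf:type", "lodstats:Developer"),
    ("aksw:Ivan", "foaf:knows", "aksw:Jens"),
]

S = set()      # S += x       becomes S.add(x)
M = Counter()  # M[k]++       becomes M[k] += 1
G = {}         # G += (a, b)  becomes adding b to the adjacency set of a
i = 0          # i++          becomes i += 1

# A single pass over the stream; no triple is kept after it has been processed.
for s, p, o in triples:
    S.add(p)
    M[p] += 1
    G.setdefault(s, set()).add(o)
    i += 1
```

Each criterion below combines an optional filter rule with one of these actions, so the loop body is all that changes from one criterion to the next.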

1. Used Classes

This criterion creates the set of classes that are in use by instances of the analyzed dataset. An example of a triple that is accepted by the filter is aksw:Ivan rdf:type lodstats:Developer. If a triple is accepted, its object (the IRI of the class) is added to the set of classes.

Filter rule

?p=rdf:type && isIRI(?o)

Action

S += ?o
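
The filter rule and action of this criterion can be sketched in Python as follows. The term encoding is an assumption made for the example: IRIs are plain strings, blank nodes start with `_:` and literals start with a double quote. The helper `is_iri` is a hypothetical stand-in for isIRI.

```python
def is_iri(term):
    # Assumed encoding: blank nodes start with "_:", literals with a double quote.
    return isinstance(term, str) and not term.startswith(("_:", '"'))

def used_classes(triples):
    """Collect the objects of all rdf:type triples whose object is an IRI."""
    classes = set()
    for s, p, o in triples:
        if p == "rdf:type" and is_iri(o):  # filter rule
            classes.add(o)                 # action: S += ?o
    return classes

sample = [
    ("aksw:Ivan", "rdf:type", "lodstats:Developer"),
    ("aksw:Ivan", "rdf:type", "_:b0"),      # rejected: object is a blank node
    ("aksw:Ivan", "foaf:name", '"Ivan"'),   # rejected: predicate is not rdf:type
]
```

With this sample, only lodstats:Developer ends up in the resulting set.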

2. Class usage count

To count how often each class of a dataset is used, the same filter rule as in the first criterion is applied to every triple. The action maintains a map with class IRIs as keys and their usage counts as values. If a triple conforms to the filter rule, the corresponding value is increased by one. A postprocessing step then returns the top 100 classes used in the dataset.

Filter rule

?p=rdf:type && isIRI(?o)

Action

M[?o]++ 

Postprocessing

top(M,100)
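
A minimal Python sketch of this criterion, assuming the same string encoding of terms as in the notation section (IRIs as plain strings, blank nodes prefixed with `_:`, literals starting with a double quote). `Counter.most_common` plays the role of top(M,100); the function name `class_usage` is chosen for this example only.

```python
from collections import Counter

def class_usage(triples, n=100):
    """Return the n most frequently used classes as (class IRI, count) pairs."""
    counts = Counter()
    for s, p, o in triples:
        if p == "rdf:type" and not o.startswith(("_:", '"')):  # filter rule
            counts[o] += 1                                      # action: M[?o]++
    return counts.most_common(n)                                # postprocessing

sample = [
    ("aksw:Ivan", "rdf:type", "lodstats:Developer"),
    ("aksw:Jens", "rdf:type", "lodstats:Developer"),
    ("aksw:Jens", "rdf:type", "foaf:Person"),
]
```

The map grows with the number of distinct classes rather than with the number of triples, which is what keeps the criterion suitable for streaming.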

3. Classes defined

This criterion collects the set of classes that are defined within a dataset. Usually in RDF/S and OWL, a class is defined by a triple that uses the predicate rdf:type and either rdfs:Class or owl:Class as object. The filter rule shows the condition each triple is checked against. If the triple is accepted by the rule, the IRI used as subject is added to the set of classes.

Filter rule

?p=rdf:type && isIRI(?s) &&(?o=rdfs:Class||?o=owl:Class)

Action

S += ?s

4. Class hierarchy depth

Description is coming soon

Filter rule

?p = rdfs:subClassOf &&  isIRI(?s) && isIRI(?o) 

Action

G += (?s,?o)

Postprocessing

hasCycles(G) ? inf. : depth(G)
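
The graph construction and the postprocessing step could look roughly like the sketch below. It assumes that depth(G) means the number of edges on the longest rdfs:subClassOf chain, which this page does not spell out, and it takes the already filtered (?s, ?o) pairs as input. `hierarchy_depth` is a name chosen for the example.

```python
import math

def hierarchy_depth(edges):
    """Return math.inf if the graph has a cycle, else the longest chain in edges."""
    graph = {}
    for child, parent in edges:          # action: G += (?s, ?o)
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set())

    VISITING, DONE = 1, 2
    state = {}
    longest = {}  # node -> length of the longest chain starting at that node

    def visit(node):
        if state.get(node) == DONE:
            return longest[node]
        if state.get(node) == VISITING:  # back edge: the graph has a cycle
            raise ValueError("cycle")
        state[node] = VISITING
        best = 0
        for parent in graph[node]:
            best = max(best, 1 + visit(parent))
        state[node] = DONE
        longest[node] = best
        return best

    try:
        return max((visit(node) for node in graph), default=0)
    except ValueError:
        return math.inf
```

The recursive traversal is kept for brevity; a very deep hierarchy would need an iterative version to stay within Python's recursion limit.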

5. Property usage

This criterion counts the usage of properties within triples. To this end, a map is created with property IRIs as keys. When a triple is analyzed, its predicate is added to the map (if it is not there already) and the corresponding value (usage count) is increased by one. A postprocessing step then returns the top 100 properties used in the dataset.

Action

M[?p]++

Postprocessing

top(M,100)

6. Property usage distinct per subject

Description is coming soon

Action

M[?s] += ?p

Postprocessing

sum(M)

7. Property usage distinct per object

Description is coming soon

Action

M[?o] += ?p

Postprocessing

sum(M)

8. Properties distinct per subject

Description is coming soon

Action

M[?s] += ?p

Postprocessing

sum(M)/size(M)
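
One possible reading of this rule, sketched in Python: M maps each subject to the set of distinct predicates used with it, sum(M) adds up the sizes of those sets, and size(M) is the number of subjects. This interpretation is an assumption, since the description is still pending, and the function name is hypothetical.

```python
from collections import defaultdict

def distinct_properties_per_subject(triples):
    """Average number of distinct predicates per subject (0.0 for no triples)."""
    props = defaultdict(set)
    for s, p, o in triples:
        props[s].add(p)                                # action: M[?s] += ?p
    total = sum(len(ps) for ps in props.values())      # sum(M)
    return total / len(props) if props else 0.0        # sum(M)/size(M)

sample = [
    ("a", "p1", "x"),
    ("a", "p1", "y"),  # p1 is already recorded for subject a
    ("a", "p2", "z"),
    ("b", "p1", "x"),
]
```

Under the same reading, criterion 6 would return only `total`, and the per-object variants would key the map on `o` instead of `s`.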

9. Properties distinct per object

Description is coming soon

Action

M[?o] += ?p

Postprocessing

sum(M)/size(M)

10. Outdegree

Description is coming soon

Action

M[?s]++

Postprocessing

sum(M)/size(M)

11. Indegree

Description is coming soon

Action

M[?o]++

Postprocessing

sum(M)/size(M)
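
Criteria 10 and 11 differ only in which position of the triple is counted, so both can be sketched with one hypothetical helper. The rules are read here as the average number of triples per distinct subject (outdegree) or per distinct object (indegree).

```python
from collections import Counter

def average_degree(triples, position):
    """position 0 counts subjects (outdegree), position 2 counts objects (indegree)."""
    degree = Counter(triple[position] for triple in triples)          # M[?s]++ or M[?o]++
    return sum(degree.values()) / len(degree) if degree else 0.0      # sum(M)/size(M)

sample = [
    ("a", "p", "b"),
    ("a", "p", "c"),
    ("a", "p", "d"),
    ("b", "p", "c"),
]
```

For the sample, `average_degree(sample, 0)` averages over the two subjects a and b, while `average_degree(sample, 2)` averages over the three objects b, c and d.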

12. Property hierarchy depth

Description is coming soon

Filter rule

?p=rdfs:subPropertyOf && isIRI(?s) && isIRI(?o)

Action

G += (?s,?o)

Postprocessing

hasCycles(G) ? inf. : depth(G)

13. Subclass usage

Description is coming soon

Filter rule

?p = rdfs:subClassOf

Action

i++

14. Triples

This criterion measures the number of triples in a dataset: for every triple that is analyzed, a counter is increased by one.

Action

i++

15. Entities mentioned

This criterion counts the entities (resources / IRIs) that are mentioned within a dataset. The action extracts all IRIs from the analyzed triple (iris({?s,?p,?o})) and increases a counter by the number of extracted IRIs.

Action

i+=size(iris({?s,?p,?o}))

16. Distinct entities

To get the set of distinct entities of a dataset, all IRIs are extracted from each triple and added to the set of entities. Adding an IRI (entity) that is already in the set has no effect, so every entity occurs only once.

Action

S+=iris({?s,?p,?o})
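
Criteria 15 and 16 share the iris() extraction step and can be sketched together. The `iris` helper below is a hypothetical stand-in that assumes IRIs are plain strings, blank nodes start with `_:` and literals start with a double quote.

```python
def iris(terms):
    """Hypothetical iris(): keep only the terms that are IRIs."""
    return [t for t in terms if not t.startswith(("_:", '"'))]

def entity_stats(triples):
    """Return (number of entity mentions, set of distinct entities)."""
    mentioned = 0
    distinct = set()
    for s, p, o in triples:
        found = iris({s, p, o})   # a set, as in iris({?s,?p,?o})
        mentioned += len(found)   # criterion 15: i += size(iris(...))
        distinct.update(found)    # criterion 16: S += iris(...)
    return mentioned, distinct

sample = [
    ("aksw:Ivan", "rdf:type", "lodstats:Developer"),
    ("aksw:Ivan", "foaf:name", '"Ivan"'),
]
```

Note that the set of distinct entities grows with the dataset, so for criterion 16 memory use depends on the number of distinct IRIs rather than on the number of triples.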

17. Literals

To count the triples whose object is a literal, the filter rule below is applied to each triple. If the object of a triple is recognized as a literal, a counter is increased by one.

Filter rule

isLiteral(?o)

Action

i++

18. Blank nodes as subject

To count the blank nodes used as subjects, the filter rule below is applied to each triple. If the subject of a triple is recognized as a blank node, a counter is increased by one.

Filter rule

isBlank(?s)

Action

i++

19. Blank nodes as object

To count the blank nodes used as objects, the filter rule below is applied to each triple. If the object of a triple is recognized as a blank node, a counter is increased by one.

Filter rule

isBlank(?o)

Action

i++

20. Datatypes

Usually in RDF/S and OWL, literals used as objects of triples can be specified more precisely. On the one hand, it is possible to define the datatype by using ^^ followed by a datatype IRI, for example: aksw:Ivan foaf:name "Ivan"^^xsd:string. On the other hand, it is possible to define the language of the literal using an @ followed by a language tag, for example: aksw:AKSW foaf:name "Agile Knowledge Engineering and Semantic Web research Group"@en. If a language tag exists, the datatype is automatically taken to be xsd:string. To get the datatypes used in a dataset together with their counts, the filter rule below is applied to each triple. If the object of a triple is recognized as a literal, its datatype is extracted, as shown in the Action step. The extracted datatype is added to a map (if it is not there already) and its usage counter is increased by one. If the literal does not carry a datatype definition, nothing happens and the next triple is analyzed.

Filter rule

isLiteral(?o)

Action

M[datatype(?o)]++

21. Languages

Usually in RDF/S and OWL, literals used as objects of triples can be specified more precisely. On the one hand, it is possible to define the datatype by using ^^ followed by a datatype IRI, for example: aksw:Ivan foaf:name "Ivan"^^xsd:string. On the other hand, it is possible to define the language of the literal using an @ followed by a language tag, for example: aksw:AKSW foaf:name "Agile Knowledge Engineering and Semantic Web research Group"@en. If a language tag exists, the datatype is automatically taken to be xsd:string. To get the language tags used in a dataset together with their counts, the filter rule below is applied to each triple. If the object of a triple is recognized as a literal, its language tag is extracted, as shown in the Action step. The extracted language tag is added to a map (if it is not there already) and its usage counter is increased by one. If the literal does not carry a language tag, nothing happens and the next triple is analyzed.

Filter rule

isLiteral(?o)

Action

M[language(?o)]++
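
The datatype and language criteria (20 and 21) can be sketched together in Python. The `Literal` tuple is a hypothetical representation introduced for this example. The sketch counts only explicitly given datatypes and language tags, and leaves out the inference of xsd:string for language-tagged literals.

```python
from collections import Counter
from typing import NamedTuple, Optional

class Literal(NamedTuple):
    """Hypothetical literal representation: lexical value plus optional tags."""
    value: str
    datatype: Optional[str] = None
    lang: Optional[str] = None

def literal_stats(triples):
    """Return (datatype usage counts, language usage counts)."""
    datatypes, languages = Counter(), Counter()
    for s, p, o in triples:
        if not isinstance(o, Literal):   # filter rule: isLiteral(?o)
            continue
        if o.datatype is not None:
            datatypes[o.datatype] += 1   # criterion 20
        if o.lang is not None:
            languages[o.lang] += 1       # criterion 21
    return datatypes, languages

sample = [
    ("aksw:Ivan", "foaf:name", Literal("Ivan", datatype="xsd:string")),
    ("aksw:AKSW", "foaf:name", Literal("Agile Knowledge Engineering and Semantic Web research Group", lang="en")),
    ("aksw:Ivan", "rdf:type", "lodstats:Developer"),
]
```

Literals without a datatype or language tag fall through both checks, matching the rule that nothing happens in that case.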

22. Average typed string length

Description is coming soon

Filter rule

isLiteral(?o) && datatype(?o)=xsd:string

Action

i++;
len+=len(?o)

Postprocessing

len/i

23. Average untyped string length

Description is coming soon

Filter rule

isLiteral(?o) && datatype(?o) = NULL

Action

i++;
len+=len(?o)

Postprocessing

len/i
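
Criteria 22 and 23 use the same action and postprocessing and differ only in the datatype test, so a single hypothetical helper covers both in this sketch. Literals are modelled with an assumed `Literal` tuple, and passing `None` as the datatype stands in for datatype(?o) = NULL.

```python
from typing import NamedTuple, Optional

class Literal(NamedTuple):
    """Hypothetical literal representation: lexical value plus optional datatype."""
    value: str
    datatype: Optional[str] = None

def average_string_length(triples, datatype):
    """Average length of literals with the given datatype (None means untyped)."""
    count = 0
    total = 0
    for s, p, o in triples:
        if isinstance(o, Literal) and o.datatype == datatype:  # filter rule
            count += 1                                         # i++
            total += len(o.value)                              # len += len(?o)
    return total / count if count else 0.0                     # len/i

sample = [
    ("aksw:Ivan", "foaf:name", Literal("Ivan", "xsd:string")),
    ("aksw:AKSW", "ex:city", Literal("Leipzig", "xsd:string")),
    ("aksw:AKSW", "ex:note", Literal("abc")),
]
```

The ex: properties in the sample are made up for illustration. Returning 0.0 when no literal matches is a choice made here to avoid a division by zero; the rules above do not say how that case is handled.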

24. Typed subjects

Description is coming soon

Filter rule

?p = rdf:type

Action

i++

25. Labeled subjects

Description is coming soon

Filter rule

?p = rdfs:label

Action

i++

26. Usage of owl:sameAs

Description is coming soon

Filter rule

?p = owl:sameAs

Action

i++

27. Links

Description is coming soon

Filter rule

ns(?s) != ns(?o)

Action

M[ns(?s)+ns(?o)]++
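
A possible Python sketch of the link criterion. Two assumptions go beyond the rule as written: `ns` is taken to be everything up to and including the last `#` or `/` of an IRI, and only triples whose subject and object are both IRIs are considered. The sketch also keys the map on a (subject namespace, object namespace) pair instead of the concatenated string, purely for readability.

```python
from collections import Counter

def is_iri(term):
    # Assumed encoding for this example: IRIs are full strings containing "://".
    return isinstance(term, str) and "://" in term

def ns(iri):
    """Hypothetical ns(): the IRI up to and including its last '#' or '/'."""
    cut = max(iri.rfind("#"), iri.rfind("/"))
    return iri[: cut + 1]

def count_links(triples):
    """Count triples that connect two different namespaces."""
    links = Counter()
    for s, p, o in triples:
        if is_iri(s) and is_iri(o) and ns(s) != ns(o):  # filter rule
            links[(ns(s), ns(o))] += 1                  # action
    return links

sample = [
    ("http://aksw.org/Ivan", "http://www.w3.org/2002/07/owl#sameAs", "http://dbpedia.org/resource/Ivan"),
    ("http://aksw.org/Ivan", "http://xmlns.com/foaf/0.1/knows", "http://aksw.org/Jens"),
]
```

Only the first sample triple counts as a link, because the second one stays within the http://aksw.org/ namespace.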

28. Maximum per property {int,float,time}

Description is coming soon

Filter rule

datatype(?o)={xsd:int|xsd:float|xsd:dateTime}

Action

M[?p]=max(M[?p],?o)

29. Average per property {int,float,time}

Description is coming soon

Filter rule

datatype(?o)={xsd:int|xsd:float|xsd:dateTime}

Action

M[?p]+=?o;
M2[?p]++

Postprocessing

M[?p]/M2[?p]
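
The maximum and average per property (criteria 28 and 29) can be sketched in one pass. The sketch assumes that a parser has already converted xsd:int and xsd:float literals into Python `int` and `float` objects; date and time values are left out to keep the example short. The property ex:age and the function name are hypothetical.

```python
from collections import defaultdict

def numeric_stats(triples):
    """Return (maximum per property, average per property) for numeric objects."""
    maxima = {}
    sums = defaultdict(float)   # M
    counts = defaultdict(int)   # M2
    for s, p, o in triples:
        # Filter rule; bool is excluded because it is a subclass of int in Python.
        if isinstance(o, bool) or not isinstance(o, (int, float)):
            continue
        maxima[p] = o if p not in maxima else max(maxima[p], o)  # M[?p] = max(M[?p], ?o)
        sums[p] += o                                             # M[?p] += ?o
        counts[p] += 1                                           # M2[?p]++
    averages = {p: sums[p] / counts[p] for p in sums}            # M[?p]/M2[?p]
    return maxima, averages

sample = [
    ("aksw:Ivan", "ex:age", 30),
    ("aksw:Jens", "ex:age", 40),
    ("aksw:Ivan", "foaf:name", "Ivan"),
]
```

Both maps hold one entry per property, so memory use stays independent of the number of triples.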

30. Subject vocabularies

Description is coming soon

Action

M[ns(?s)]++

31. Predicate vocabularies

Description is coming soon

Action

M[ns(?p)]++

32. Object vocabularies

Description is coming soon

Action

M[ns(?o)]++