
Data Quality

Tim L edited this page Jul 1, 2013 · 40 revisions

Quality data...

  • ... is structured similarly to dataset X using uniform vocabulary.
  • ... is structured similarly to dataset X.
  • ... I [dis]agree with.
  • ... I understand.
  • ... is complete.
  • ... explicitly connects to the data currently portrayed in visual artifact X. (e.g. A book's two pages, currently visible)
  • ... explicitly connects to the data portrayed in visual artifact X. (e.g. An entire book, yet to be opened)
  • ... explicitly connects to dataset X.
  • ... I find interesting.
  • ... explicitly connects to other datasets. (i.e. TBL-5)
  • ... is in RDF that I can retrieve as a dump.
  • ... is in RDF that I can retrieve via SPARQL query.
  • ... is in RDF that I can retrieve with dereferenceable URIs.
  • ... is in RDF that I can retrieve.
  • ... is in RDF. (i.e. TBL-4)
  • ... I can retrieve and is machine processable using my own (or open) tools. (i.e. TBL-3)
  • ... I can retrieve and is machine processable. (i.e. TBL-2)
  • ... I may and can retrieve. (i.e. TBL-1)
  • ... I may retrieve. (i.e. open)
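
The TBL-1 to TBL-5 rungs at the bottom of the list above are cumulative, which can be sketched as a simple ladder check. This is only an illustration: the `DatasetFacts` record, its field names, and the `tbl_level` function are hypothetical, and the mapping of each rung to a single boolean is an assumption drawn from the wording of the bullets.

```python
from dataclasses import dataclass


@dataclass
class DatasetFacts:
    """Hypothetical yes/no answers about one dataset (field names are assumptions)."""
    may_retrieve: bool             # "I may retrieve" (open)
    can_retrieve: bool             # "I ... can retrieve"
    machine_processable: bool      # TBL-2
    open_tools: bool               # TBL-3: processable with my own (or open) tools
    is_rdf: bool                   # TBL-4
    links_to_other_datasets: bool  # TBL-5


def tbl_level(d: DatasetFacts) -> int:
    """Return the highest consecutive TBL rung reached (0 if not even TBL-1)."""
    rungs = [
        d.may_retrieve and d.can_retrieve,  # TBL-1
        d.machine_processable,              # TBL-2
        d.open_tools,                       # TBL-3
        d.is_rdf,                           # TBL-4
        d.links_to_other_datasets,          # TBL-5
    ]
    level = 0
    for satisfied in rungs:
        if not satisfied:
            break  # rungs are cumulative: stop at the first one that fails
        level += 1
    return level


rdf_without_links = DatasetFacts(True, True, True, True, True, False)
closed_tool_rdf = DatasetFacts(True, True, True, False, True, True)
```

Under these assumptions `tbl_level(rdf_without_links)` stops at rung 4, while `closed_tool_rdf` stops at rung 2 even though it is RDF, because the ladder is evaluated in order.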

situate:

Related work

Hausenblas 2008

  • Number of triples for each dataset
  • Number of datasets
  • Number of interlinks [from each dataset [to each other dataset]]
  • Accessible via data dump, accessible via SPARQL query, accessible via crawling
  • For each property in a dataset, the number of extra-namespace URI values that are dereferenceable
  • Density: Number of extra-namespace URI values in a dataset / The size of the dataset.
  • Density: Number of extra-namespace URI values in a dataset / Number of instances of a given class in the dataset.
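
The two density measures above can be sketched as follows. This is a minimal illustration, not Hausenblas's implementation: it assumes triples are plain tuples of strings, that "extra-namespace" means a URI in object position that does not start with the dataset's own namespace prefix, and that "the size of the dataset" is its number of triples. All function names and the sample data are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def is_uri(term):
    """Crude URI test for this sketch: treat http(s) strings as URIs."""
    return term.startswith("http://") or term.startswith("https://")


def extra_namespace_values(triples, namespace):
    """Object URIs that fall outside the dataset's own namespace (assumed definition)."""
    return [o for (_s, _p, o) in triples
            if is_uri(o) and not o.startswith(namespace)]


def density_by_size(triples, namespace):
    """Extra-namespace URI values divided by the number of triples."""
    if not triples:
        return 0.0
    return len(extra_namespace_values(triples, namespace)) / len(triples)


def density_by_class(triples, namespace, cls):
    """Extra-namespace URI values divided by the number of instances of cls."""
    instances = {s for (s, p, o) in triples if p == RDF_TYPE and o == cls}
    if not instances:
        return 0.0
    return len(extra_namespace_values(triples, namespace)) / len(instances)


# Hypothetical sample dataset living under http://example.org/
EX = "http://example.org/"
density_sample = [
    (EX + "a", RDF_TYPE, EX + "Person"),
    (EX + "a", "http://www.w3.org/2002/07/owl#sameAs", "http://dbpedia.org/resource/A"),
    (EX + "a", EX + "name", "Alice"),
    (EX + "b", RDF_TYPE, EX + "Person"),
]
```

With this sample, one of the four triples has an extra-namespace object, so `density_by_size(density_sample, EX)` is 0.25, and with two `Person` instances `density_by_class(density_sample, EX, EX + "Person")` is 0.5.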

Tummarello 2007

  • Distribution of URIs over documents

Ding 2005

  • Distribution of URIs over documents
  • Interlinking

Wang 2006

  • Schema-level gauges

Stárka 2012

  • Lengths of terms (URI, bnode, and literal)
  • Term prefix and suffix
  • Frequency of occurrence of a term in position S, P, or O.
    • covers "number of triples per predicate" (270-1,000s)
    • 2-20 triples per subject.
  • "Simple" and "Double" Projections
  • Star signatures (the set of incoming and outgoing predicates) of every graph node, which group nodes into "star classes". The number and sizes of star classes can reflect uniformity in a graph.
  • Path patterns between a source and a target node.
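
Two of the measures above, term frequency by position and star signatures, can be sketched in a few lines. This is only an illustration under stated assumptions: triples are plain tuples of strings, literals in object position are treated as graph nodes, and the function names and sample data are hypothetical rather than taken from Stárka 2012.

```python
from collections import Counter, defaultdict


def position_frequencies(triples):
    """Count how often each term occurs as subject (S), predicate (P), or object (O)."""
    freq = {"S": Counter(), "P": Counter(), "O": Counter()}
    for s, p, o in triples:
        freq["S"][s] += 1
        freq["P"][p] += 1
        freq["O"][o] += 1
    return freq


def star_signatures(triples):
    """Map each node to (incoming predicates, outgoing predicates)."""
    incoming = defaultdict(set)
    outgoing = defaultdict(set)
    for s, p, o in triples:
        outgoing[s].add(p)
        incoming[o].add(p)
    nodes = set(incoming) | set(outgoing)
    return {n: (frozenset(incoming.get(n, ())), frozenset(outgoing.get(n, ())))
            for n in nodes}


def star_classes(triples):
    """Group nodes that share the same star signature."""
    classes = defaultdict(set)
    for node, signature in star_signatures(triples).items():
        classes[signature].add(node)
    return dict(classes)


# Hypothetical sample graph with abbreviated term names
star_sample = [
    ("a", "knows", "b"),
    ("a", "name", "Alice"),
    ("c", "knows", "b"),
    ("c", "name", "Carol"),
]
```

In this sample, `a` and `c` share one signature, `b` has its own, and the two name literals share a third, so `star_classes(star_sample)` yields three classes. A graph with few, large star classes would be more uniform in this sense than one with many small ones.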

Langegger 2009

  • Instances per class
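
An instances-per-class count can be sketched as below. This is a minimal illustration, assuming triples are plain tuples of strings and that class membership is read only from explicit `rdf:type` statements (no inference); the function name and sample data are hypothetical.

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def instances_per_class(triples):
    """Count distinct subjects explicitly typed with each class."""
    members = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            members[o].add(s)  # a set, so repeated type statements count once
    return {cls: len(subjects) for cls, subjects in members.items()}


# Hypothetical sample with abbreviated names and one duplicated statement
typed_sample = [
    ("a", RDF_TYPE, "Person"),
    ("a", RDF_TYPE, "Person"),
    ("b", RDF_TYPE, "Person"),
    ("c", RDF_TYPE, "Book"),
    ("a", "name", "Alice"),
]
```

Counting distinct subjects rather than `rdf:type` triples keeps duplicated statements from inflating the figure.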

Helena's survey

Hogan 2012

  • Number of crawl seed URIs (42,500 from a 2009 crawl)
  • Number of [unique] {quads,triples} crawled from number of documents ([1.106 BQ] 1.118 BQ, [947 MT])
  • Time period[s] crawled (May 2010, and 9 monthly crawls from March to November 2010)
  • Number of data providers, according to Pay-Level-Domains (778)