Data Quality
Tim L edited this page Jul 1, 2013 · 40 revisions
Quality data...
- ... is structured similarly to dataset X using uniform vocabulary.
- ... is structured similarly to dataset X.
- ... I [dis]agree with.
- ... I understand.
- ... is complete.
- ... explicitly connects to the data currently portrayed in visual artifact X. (e.g. A book's two pages, currently visible)
- ... explicitly connects to the data portrayed in visual artifact X. (e.g. An entire book, yet to be opened)
- ... explicitly connects to dataset X.
- ... I find interesting.
- ... explicitly connects to other datasets. (i.e. TBL-5)
- ... is in RDF that I can retrieve as a dump.
- ... is in RDF that I can retrieve via SPARQL query.
- ... is in RDF that I can retrieve with dereferenceable URIs.
- ... is in RDF that I can retrieve.
- ... is in RDF. (i.e. TBL-4)
- ... I can retrieve and is machine processable using my own (or open) tools. (i.e. TBL-3)
- ... I can retrieve and is machine processable. (i.e. TBL-2)
- ... I may and can retrieve. (i.e. TBL-1)
- ... I may retrieve. (i.e. open)
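
Read bottom-up, the last items of the list above form a ladder from "open" to TBL-5. A minimal sketch of that ladder, assuming the levels are cumulative (each level requires all the ones below it). The `DatasetFacts` record, its field names, and `tbl_level` are hypothetical names introduced only for illustration:

```python
from dataclasses import dataclass


@dataclass
class DatasetFacts:
    """Hypothetical yes/no observations about one dataset."""
    may_retrieve: bool = False             # I am permitted to retrieve it ("open")
    can_retrieve: bool = False             # retrieval actually works
    machine_processable: bool = False      # a program can parse it
    open_tools: bool = False               # ... using my own (or open) tools
    is_rdf: bool = False                   # the data is in RDF
    links_to_other_datasets: bool = False  # explicitly connects to other datasets


def tbl_level(facts: DatasetFacts) -> int:
    """Return the highest level reached.

    -1 means not even "open", 0 means "open" only, 1..5 mean TBL-1..TBL-5.
    Assumption: levels are cumulative, so counting stops at the first miss.
    """
    if not facts.may_retrieve:
        return -1
    ladder = [
        facts.can_retrieve,             # TBL-1
        facts.machine_processable,      # TBL-2
        facts.open_tools,               # TBL-3
        facts.is_rdf,                   # TBL-4
        facts.links_to_other_datasets,  # TBL-5
    ]
    level = 0
    for reached in ladder:
        if not reached:
            break
        level += 1
    return level
```

Under this reading, a dataset that is in RDF but cannot be processed with open tools stops at TBL-2, which may or may not match how a given reviewer wants to score it.
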
situate:
- uses vocabulary annotated with vocabulary annotations
- Number of triples for each dataset
- Number of datasets
- Number of interlinks [from each dataset [to each other dataset]]
- Accessible via data dump, accessible via SPARQL query, accessible via crawling
- For each property in a dataset, the number of extra-namespace URI values that are dereferenceable
- Density: Number of extra-namespace URI values in a dataset / The size of the dataset.
- Density: Number of extra-namespace URI values in a dataset / Number of instances of a given class in the dataset.
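
The two density measures and the interlink count above can be sketched over an in-memory list of triples. This is only an illustration under several assumptions: triples are plain `(s, p, o)` string tuples, any term starting with `http://` or `https://` is treated as a URI, "the size of the dataset" is taken to be its number of triples, and a dataset's namespace is a single URI prefix. All function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def is_uri(term):
    # Simplification: a real parser would distinguish URIs from literals.
    return term.startswith(("http://", "https://"))


def extra_namespace_values(triples, namespace):
    """Object URIs that fall outside the dataset's own namespace."""
    return [o for _, _, o in triples if is_uri(o) and not o.startswith(namespace)]


def density_by_size(triples, namespace):
    """Extra-namespace URI values / number of triples (assumed size measure)."""
    if not triples:
        return 0.0
    return len(extra_namespace_values(triples, namespace)) / len(triples)


def density_by_class(triples, namespace, cls):
    """Extra-namespace URI values / number of instances of the given class."""
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == cls}
    if not instances:
        return 0.0
    return len(extra_namespace_values(triples, namespace)) / len(instances)


def interlinks_by_target(triples, namespace):
    """Count extra-namespace URI values per target host."""
    return Counter(urlparse(o).netloc
                   for o in extra_namespace_values(triples, namespace))


# Made-up example data.
NS = "http://a.example/"
triples = [
    ("http://a.example/id/1", RDF_TYPE, "http://a.example/vocab/Person"),
    ("http://a.example/id/1", "http://www.w3.org/2002/07/owl#sameAs", "http://b.example/id/9"),
    ("http://a.example/id/1", "http://a.example/vocab/name", "Alice"),
    ("http://a.example/id/2", RDF_TYPE, "http://a.example/vocab/Person"),
]

print(density_by_size(triples, NS))
print(density_by_class(triples, NS, "http://a.example/vocab/Person"))
print(interlinks_by_target(triples, NS))
```

Note that `rdf:type` values pointing at an external vocabulary would also count as extra-namespace URI values here; whether that is desirable depends on what the density is meant to capture.
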
Tummarello 2007
- Distribution of URIs over documents
Ding 2005
- Distribution of URIs over documents
- Interlinking
Wang 2006
- schema level gauges
- Lengths of terms (URI, bnode, and literal)
- Term prefix and suffix
- Frequency of occurrence of a term in position S, P, or O.
- covers "number of triples per predicate" (270-1,000s)
- 2-20 triples per subject.
- "Simple" and "Double" Projections
- Star signatures (the set of incoming and outgoing predicates) of every graph node. Nodes that share a signature form a "star class". The number and sizes of star classes can reflect uniformity in a graph.
- Path patterns from a source and target node.
- instances per class
- Number of crawl seed URIs (42,500 from a 2009 crawl)
- Number of [unique] {quads,triples} crawled from number of documents ([1.106 BQ] 1.118 BQ, [947 MT])
- Time period[s] crawled (May 2010 and 9 monthly March-November 2010)
- Number of data providers, according to Pay-Level-Domains (778)
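
Several of the structural gauges in the list above (term frequency by position, triples per predicate, star signatures) reduce to simple counting over triples. A rough sketch, assuming triples are plain `(s, p, o)` tuples and that literals are treated as graph nodes like any other term. The function names are hypothetical and not taken from any of the cited papers.

```python
from collections import Counter, defaultdict


def position_frequency(triples):
    """For each term, count its occurrences in position S, P, and O."""
    freq = defaultdict(Counter)
    for s, p, o in triples:
        freq[s]["S"] += 1
        freq[p]["P"] += 1
        freq[o]["O"] += 1
    return freq


def triples_per_predicate(triples):
    """Number of triples that use each predicate."""
    return Counter(p for _, p, _ in triples)


def star_signatures(triples):
    """Map each node to (outgoing predicate set, incoming predicate set)."""
    out_preds = defaultdict(set)
    in_preds = defaultdict(set)
    for s, p, o in triples:
        out_preds[s].add(p)
        in_preds[o].add(p)
    nodes = set(out_preds) | set(in_preds)
    return {
        n: (frozenset(out_preds.get(n, ())), frozenset(in_preds.get(n, ())))
        for n in nodes
    }


def star_classes(triples):
    """Group nodes that share the same star signature."""
    classes = defaultdict(set)
    for node, signature in star_signatures(triples).items():
        classes[signature].add(node)
    return dict(classes)


# Made-up example data.
example = [
    ("a", "knows", "b"),
    ("c", "knows", "b"),
    ("a", "name", "Alice"),
    ("c", "name", "Carol"),
]

classes = star_classes(example)
print(len(classes))                               # number of star classes
print(sorted(len(v) for v in classes.values()))   # sizes of star classes
```

In this toy graph, `a` and `c` share a signature (outgoing `knows` and `name`, nothing incoming), so they fall into the same star class. A graph with few, large star classes is more uniform than one with many small ones.
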