URI vs Slug vs String #18

Closed
Robsteranium opened this issue Feb 28, 2018 · 1 comment

Comments

@Robsteranium
Contributor

Robsteranium commented Feb 28, 2018

The new loading architecture suggests to me a possible solution to the stringy-identifiers problem:

  • if the column configuration identifies the datatype as string:
    • if the column configuration includes a value-template, then cell value should be treated as ready for that template (i.e. already slugged or a code without spaces)
    • else if the column configuration points at a component property that specifies a qb:codeList, then use the value in that cell to look up a code by its label
    • else treat the value as a string literal (e.g. observation label)
  • else if the datatype is number:
    • parse the string as a number (as per csvw)
  • else if the datatype is xsd:anyURI
    • parse the string as a URI (as per csvw); I think this would also allow CURIEs if the csv2rdf process already recognises the prefix
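
The branching above can be sketched as a small dispatch function. This is only an illustration in Python, not table2qb's implementation: `ColumnConfig`, its field names, `interpret_cell`, and the in-memory `codelists` mapping (code list URI to a label-to-code dictionary) are all hypothetical names made up for the example, and CURIE expansion is left out.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class ColumnConfig:
    """Hypothetical column configuration; field names are assumptions."""
    name: str
    datatype: str                         # "string", "number" or "xsd:anyURI"
    value_template: Optional[str] = None  # URI template, if the column has one
    codelist: Optional[str] = None        # URI of the component's qb:codeList, if any


def interpret_cell(config, value, codelists):
    """Return a (kind, value) pair describing how the cell should be treated."""
    if config.datatype == "string":
        if config.value_template is not None:
            # Already slugged (or a code without spaces): pass through to the template.
            return ("template-value", value)
        if config.codelist is not None:
            # Look the code up by its label within the component's code list.
            labels = codelists.get(config.codelist, {})
            if value not in labels:
                raise ValueError(f"No code labelled {value!r} in {config.codelist}")
            return ("uri", labels[value])
        # No template and no code list: a plain string literal.
        return ("literal", value)
    if config.datatype == "number":
        return ("number", Decimal(value))
    if config.datatype == "xsd:anyURI":
        # CURIE expansion is not attempted in this sketch.
        return ("uri", value)
    raise ValueError(f"Unsupported datatype: {config.datatype}")
```

Returning a tagged pair keeps the decision (template, lookup, or literal) separate from whatever later step builds the RDF.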

Given that multiple column names could theoretically be used to populate the same component, this gives us quite a bit of flexibility. For example, for specifying a reference period you might provide three columns all having sdmx-dim:refPeriod in their property_template configuration, but each with a different value_template:

  • "Year": "http://reference.data.gov.uk/id/year/{year}" accepting values like "2018"
  • "Government Year": "http://reference.data.gov.uk/id/government-year/{government-year}" accepting values like "2017-2018"
  • "Reference Period": "http://reference.data.gov.uk/id/{reference_period}" (a generic fallback) accepting values like "government-quarter/2009-2010/Q1"
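
To make the three templates concrete, here is a rough sketch of how each cell value would land in its URI. The `expand` helper is hypothetical and does plain string substitution only. A real RFC 6570 processor would percent-encode reserved characters such as "/" for a simple `{var}` expression, so the generic fallback may need the reserved form (`{+reference_period}`) to keep its slashes; that point is my assumption about how the templates would be processed, not something tested against csv2rdf.

```python
import re

# Column heading -> value template, as listed above.
VALUE_TEMPLATES = {
    "Year": "http://reference.data.gov.uk/id/year/{year}",
    "Government Year": "http://reference.data.gov.uk/id/government-year/{government-year}",
    "Reference Period": "http://reference.data.gov.uk/id/{reference_period}",
}


def expand(template, value):
    """Substitute value for the template's single {variable} placeholder.

    Hypothetical helper: no percent-encoding is applied.
    """
    # A function replacement avoids any backslash handling in the value.
    expanded, count = re.subn(r"\{[^{}]+\}", lambda _match: value, template)
    if count != 1:
        raise ValueError(f"Expected exactly one placeholder in {template!r}")
    return expanded


print(expand(VALUE_TEMPLATES["Year"], "2018"))
print(expand(VALUE_TEMPLATES["Government Year"], "2017-2018"))
print(expand(VALUE_TEMPLATES["Reference Period"], "government-quarter/2009-2010/Q1"))
```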

Thus the uploader would indicate which kind of date they'd provided by using the appropriate column heading. They could then provide more than one kind of interval by either splitting the upload or by providing multiple (non-overlapping) columns in the table (i.e. some rows having Years, others Government Years but none both).
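
The "non-overlapping" condition could be checked row by row. A minimal sketch, assuming the three column headings from the example and a hypothetical `period_column_for` helper that insists on exactly one populated period column per row:

```python
import csv
import io

# Columns that all populate the reference-period component (from the example above).
PERIOD_COLUMNS = ("Year", "Government Year", "Reference Period")


def period_column_for(row):
    """Return the one period column populated in this row, or raise ValueError."""
    populated = [c for c in PERIOD_COLUMNS if (row.get(c) or "").strip()]
    if len(populated) != 1:
        raise ValueError(f"Expected exactly one period column, found {populated}")
    return populated[0]


# Illustrative data only: one row uses Year, the other Government Year.
sample = "Year,Government Year,Value\n2018,,10\n,2017-2018,20\n"
for row in csv.DictReader(io.StringIO(sample)):
    print(period_column_for(row), row["Value"])
```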

One implication of this is that we wouldn't be slugging anything in the observations input csv! We might still need to do so as part of the code-list pipeline (i.e. to create a skos:notation where none was provided).

@Robsteranium
Contributor Author

One possible problem with (blindly) passing values into templates is that we don't have context-sensitive validation of the inputs - i.e. it would be possible to provide invalid values like "20180" to the "Year" column (technically http://reference.data.gov.uk/id/year/20180 is valid and does resolve, but we know from the context that it's likely a mistake) or "2015-2018" to "Government Year".

We may be able to solve this using datatypes, although if we need to coin our own (xsd:gYear wouldn't catch either example), this may make the serialisation less portable.
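
For illustration, the kind of check a coined datatype would need to encode might look like the sketch below. Both function names are hypothetical, and the rules are assumptions on my part: that a "Year" is exactly four digits, and that a "Government Year" spans two consecutive calendar years.

```python
import re


def valid_year(value):
    """Assumed rule: a year is exactly four ASCII digits."""
    return re.fullmatch(r"[0-9]{4}", value) is not None


def valid_government_year(value):
    """Assumed rule: 'YYYY-YYYY' where the second year follows the first."""
    match = re.fullmatch(r"([0-9]{4})-([0-9]{4})", value)
    return match is not None and int(match.group(2)) == int(match.group(1)) + 1


print(valid_year("2018"), valid_year("20180"))
print(valid_government_year("2017-2018"), valid_government_year("2015-2018"))
```

Both of the problem values from the comment above ("20180" and "2015-2018") fail these checks, whereas a lexical check alone would not reject the second.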

Alternatively we could ignore the problem at this level and instead pick it up with a later validation. We will in any case validate that dimension-values are recognised (i.e. ASK that ?obs ?dim/?p ?o) which would highlight these cases. We haven't actually designed a table2qb version of the intervals pipeline (it wouldn't fit the codelists pipeline as we want to build start/end instants etc) - perhaps this will just need to be bespoke and include its own validation.
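
The "dimension-values are recognised" check amounts to asking whether each dimension value is itself described by at least one triple. A rough in-memory sketch of that idea, using plain tuples instead of a triple store; the function name and the example.org URIs are made up for illustration:

```python
def unrecognised_dimension_values(triples, dimensions):
    """Return (obs, dim, value) triples whose value never appears as a subject.

    triples: iterable of (subject, predicate, object) strings.
    dimensions: set of predicate URIs that are dimension properties.
    """
    triples = list(triples)
    subjects = {s for s, _p, _o in triples}
    return sorted(
        (s, p, o) for s, p, o in triples
        if p in dimensions and o not in subjects
    )


# Illustrative data: year/2018 is described, year/20180 is not.
DIM = "http://example.org/dim/refPeriod"
data = [
    ("http://example.org/obs/1", DIM, "http://example.org/id/year/2018"),
    ("http://example.org/obs/2", DIM, "http://example.org/id/year/20180"),
    ("http://example.org/id/year/2018", "http://www.w3.org/2000/01/rdf-schema#label", "2018"),
]
print(unrecognised_dimension_values(data, {DIM}))
```

Here only the second observation is reported, because nothing in the data describes its dimension value.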

Note that cases with qb:codeLists could raise a validation error if the code couldn't be found (i.e. no code bearing a skos:prefLabel or rdfs:label with the string provided). We wouldn't need to wait until a later (rdf-based) validation and could reject the uploaded csv immediately.
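
As a sketch of what "reject immediately" could look like, the function below scans one column and collects every label that has no matching code. `check_labels` and the `labels` mapping (label to code URI, as if gathered from skos:prefLabel or rdfs:label) are hypothetical, and the line numbers assume no cell spans multiple lines.

```python
import csv
import io


def check_labels(csv_text, column, labels):
    """Return a list of error messages for cells with no matching code label."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        value = row[column]
        if value not in labels:
            errors.append(
                f"line {line_no}: no code labelled {value!r} in column {column!r}"
            )
    return errors


# Illustrative code list and upload; "Exprts" is a deliberate typo.
flow_labels = {"Exports": "http://example.org/codes/flow/exports"}
upload = "Flow,Value\nExports,10\nExprts,20\n"
print(check_labels(upload, "Flow", flow_labels))
```

Collecting all the errors, rather than stopping at the first, lets the uploader fix the whole file in one pass.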

Robsteranium added a commit that referenced this issue Mar 20, 2018
Adds HMRC Overseas Trade example and starts removing hard-coded config
specific to the regional-trade example as per #22.

The `is-dimension?` and `is-attribute?` sets are now drawn from the
conventions - i.e. those columns with the respective component
attachment property. The `values` seq now uses an equivalent `is-value?`
set which includes those columns for which the conventions don't specify
a component attachment (i.e. if the column is not a dimension, measure,
or attribute then it must be a value).

`standardise-measure` is now just `title->name`

`slugize-columns` is replaced by `transform-columns` which is configured
by a new convention: `value_transformation` which also allows us to
specify e.g. a `replace-symbols` transform (#18 may later allow us to
derive these instead of setting them explicitly).