Encoding problems with NUTS concept labels #8

Robsteranium · 2021-04-12T12:21:10Z

There are some mistakes upstream.

The source sometimes has two skos:prefLabels for a resource - presumably these were intended to differ by language tag (as permitted by the vocab) but no language tags are provided.

Presumably this is intended to provide a native name and a transliteration - e.g. code PL22 has "Śląskie" and "Slaskie". In most cases these labels are fine but the Cyrillic and Greek ones appear to have encoding problems - e.g. BG has "БЪЛГАРИЯ" and "????????" (hd inspection of the download shows these are literally U+003F - question marks).
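For anyone wanting to repeat that check without hd, here is a minimal Python sketch of the same test. The helper name `is_all_question_marks` is made up for illustration; it only confirms that a label's bytes are all 0x3F, and says nothing about how the labels got that way upstream.

```python
def is_all_question_marks(label: str) -> bool:
    """Return True if every byte of the UTF-8 encoded label is 0x3F ('?')."""
    data = label.encode("utf-8")
    # An empty label is not treated as mangled.
    return len(data) > 0 and all(byte == 0x3F for byte in data)


# The two BG labels quoted above.
mangled = is_all_question_marks("????????")
intact = is_all_question_marks("БЪЛГАРИЯ")
print(mangled, intact)
```

A label that has been replaced character-for-character by question marks passes this test, whereas a correctly encoded Cyrillic label is made up of multi-byte UTF-8 sequences and does not.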

Requesting with an 'Accept: application/rdf+xml;charset=UTF-8' header makes no difference as it's a statically-hosted file.

The RDF files for individual NUTS codes - e.g. BG - don't have this problem (so presumably they do have language tags somewhere upstream). We could ask for them to fix it.

We could also crawl all of those pages.

Alternatively we could try to filter the bad labels out as part of the step which de-dupes multiple labels.

Absent a language tag, we need some way to distinguish the good vs bad label.

The order isn't consistent in the file sadly (even if this were to be maintained deterministically by the sparql cli tool).

Instead we might be able to do something like FILTER (!regex(str(?label), "\\?")).

Indeed we might want to have the dedupe step filter instead of concatenating the labels (e.g. taking the first of those without encoding problems).
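As a rough sketch of that "take the first label without encoding problems" idea, written in Python rather than SPARQL: the function name `pick_labels` and the list-of-pairs input are assumptions for illustration, not what the pipeline actually does. It treats any label containing U+003F as bad, and it falls back to the first label seen if every label for a concept is bad, so that no concept ends up unlabelled (that fallback is also an assumption).

```python
from typing import Dict, Iterable, Tuple


def pick_labels(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Choose one label per concept, preferring the first without a '?'."""
    chosen: Dict[str, str] = {}    # first clean label per concept
    fallback: Dict[str, str] = {}  # first label of any kind per concept
    for concept, label in pairs:
        fallback.setdefault(concept, label)
        if "?" not in label and concept not in chosen:
            chosen[concept] = label
    # Use the clean label where there is one, otherwise the fallback.
    return {concept: chosen.get(concept, fallback[concept]) for concept in fallback}


# Hypothetical input using the labels quoted in this issue; the bad BG
# label comes first to show that the input order does not matter.
pairs = [
    ("BG", "????????"),
    ("BG", "БЪЛГАРИЯ"),
    ("PL22", "Śląskie"),
    ("PL22", "Slaskie"),
]
labels = pick_labels(pairs)
print(labels)
```

Note this would also discard a legitimate label that happens to contain a question mark, which is the same trade-off as the regex filter.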


robons · 2021-04-12T14:02:23Z

I think it’s all happening because we’re piping triples via bash commands.

I don't believe the problem is affecting any labels for any UK based locations, but I look forward to being proven wrong.

We should probably move this pipeline towards using gss-jvm-build-tools which avoids piping text around the terminal.

canwaf · 2021-04-12T15:00:44Z

Sounds like yet another instance of us needing to seriously reconsider our stack.

Robsteranium · 2021-04-27T07:59:38Z

I've submitted feedback to data.europa.eu.

H-a-g-L · 2021-05-12T18:02:54Z

@Robsteranium Thanks for raising the issue with data.europa.eu. You might want to have a look at the NUTS RDF model published by the EU Publications Office.
It can be queried from the Publications Office SPARQL endpoint by specifying the name of the NUTS graph. For example:
select distinct *
FROM <http://data.europa.eu/nuts>
where { <http://data.europa.eu/nuts> ?p ?o }

Robsteranium · 2021-05-14T09:52:41Z

Great thanks @ODP-hil!

Switching to either of these sources (the nuts-skos-ap-eu.rdf file or the sparql endpoint) would resolve the duplicate skos:prefLabel and encoding problems.

I gather that translations are being worked on at the moment. Will they be available in these same locations when they're ready?

@robons if we switch to the other rdf file we can remove the post processing too.

robons · 2021-05-17T13:55:56Z

I've updated the import so that we're using the improved source of the NUTS skos:ConceptScheme. Thanks @Robsteranium.

https://staging.gss-data.org.uk/concept-scheme/concepts?uri=http%3A%2F%2Fgss-data.org.uk%2Fdef%2Fconcept-scheme%2Fnuts-2016%2Fdataset&concept-scheme-uri=http%3A%2F%2Fdata.europa.eu%2Fnuts%2Fscheme%2F2016&concept-uri=http%3A%2F%2Fdata.europa.eu%2Fnuts%2Fcode%2FUKG39&standalone=true&level-uri=http%3A%2F%2Fgss-data.org.uk%2Fdef%2Fgeography%2Flevel%2FNUTS1

Robsteranium mentioned this issue Apr 12, 2021

Resolve encoding issues Swirrl/ook#55

Open

robons pushed a commit that referenced this issue May 17, 2021

Issue #8. Using NUTS concept scheme from alternative EU source.

60c7440

robons closed this as completed May 17, 2021
