Encoding problems with NUTS concept labels #8

Closed
Robsteranium opened this issue Apr 12, 2021 · 6 comments

@Robsteranium
Collaborator

See Thessaloniki for example.

There are some mistakes upstream.

The source sometimes has two skos:prefLabels for a resource - presumably these were intended to differ by language tag (as permitted by the vocab) but no language tags are provided.

Presumably this is intended to provide a native name and a transliteration - e.g. code PL22 has "Śląskie" and "Slaskie". In most cases these labels are fine but the Cyrillic and Greek ones appear to have encoding problems - e.g. BG has "БЪЛГАРИЯ" and "????????" (hd inspection of the download shows these are literally U+003F - question marks).
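As a minimal sketch of the same check done in Python rather than with hd: listing the code points of each label shows whether the characters are genuinely U+003F. The function names here are hypothetical and not part of the pipeline.

```python
def code_points(label: str) -> list[str]:
    """Return the Unicode code point of each character, e.g. 'U+003F'."""
    return [f"U+{ord(ch):04X}" for ch in label]


def is_only_question_marks(label: str) -> bool:
    """True when a non-empty label consists solely of literal '?' characters."""
    return len(label) > 0 and all(ch == "?" for ch in label)


if __name__ == "__main__":
    for label in ["БЪЛГАРИЯ", "????????"]:
        print(label, code_points(label), is_only_question_marks(label))
```

A label that reports only U+003F was already damaged before it reached the file, so no change of decoding on our side can recover the original text.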

Requesting with an 'Accept: application/rdf+xml;charset=UTF-8' header makes no difference as it's a statically-hosted file.

The RDF files for individual NUTS codes - e.g. BG - don't have this problem (so presumably they do have language tags somewhere upstream). We could ask for them to fix it.

We could also crawl all of those pages.

Alternatively we could try to filter the bad labels out as part of the step which de-dupes multiple labels.

Absent a language tag, we need some way to distinguish the good vs bad label.

The order isn't consistent in the file sadly (even if this were to be maintained deterministically by the sparql cli tool).

Instead we might be able to do something like FILTER (!regex(str(?label), "\\?")).

Indeed we might want to have the dedupe step filter instead of concatenating the labels (e.g. taking the first of those without encoding problems).
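The filtering dedupe idea could be sketched roughly as follows. This is an illustration in Python, not the existing pipeline step: the function name, the (code, label) input shape, and the fallback when every label looks broken are all assumptions.

```python
import re
from collections import defaultdict

# Assumption: any literal '?' marks a label as suspect, mirroring the
# regex in the FILTER idea. A legitimate label containing '?' would
# also be rejected by this rule.
SUSPECT = re.compile(r"\?")


def pick_labels(pairs):
    """Choose one label per code from an iterable of (code, label) pairs.

    Takes the first label without a '?'; if every label for a code is
    suspect, falls back to the first label seen so the code keeps a label.
    """
    grouped = defaultdict(list)
    for code, label in pairs:
        grouped[code].append(label)

    chosen = {}
    for code, labels in grouped.items():
        clean = [label for label in labels if not SUSPECT.search(label)]
        chosen[code] = clean[0] if clean else labels[0]
    return chosen


if __name__ == "__main__":
    rows = [
        ("BG", "????????"),
        ("BG", "БЪЛГАРИЯ"),
        ("PL22", "Śląskie"),
        ("PL22", "Slaskie"),
    ]
    print(pick_labels(rows))
```

Because the order of labels in the file isn't consistent, "first" here only means first among the clean labels as they happen to arrive; for codes like PL22, where both labels are clean, the choice between native name and transliteration would still depend on input order.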

@robons
Contributor

robons commented Apr 12, 2021

I think it’s all happening because we’re piping triples via bash commands.

I don't believe the problem affects labels for any UK-based locations, but I look forward to being proven wrong.

We should probably move this pipeline towards using gss-jvm-build-tools which avoids piping text around the terminal.

@canwaf

canwaf commented Apr 12, 2021

Sounds like yet another instance of us needing to seriously reconsider our stack.

@Robsteranium
Collaborator Author

I've submitted feedback to data.europa.eu.

@H-a-g-L

H-a-g-L commented May 12, 2021

@Robsteranium Thanks for raising the issue with data.europa.eu. You might want to have a look at the NUTS RDF model published by the EU Publications Office.
It can be queried from the Publications Office SPARQL endpoint by specifying the name of the NUTS graph. For example:
select distinct *
FROM <http://data.europa.eu/nuts>
where { <http://data.europa.eu/nuts> ?p ?o }
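For anyone wanting to run that query from a script, here is a hedged sketch that only builds the request URL, using the standard SPARQL protocol convention of passing the query text in a `query` parameter. The endpoint address is deliberately left as an argument because it isn't given in this thread, and `build_query_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# The graph name is taken from the query above.
NUTS_GRAPH = "http://data.europa.eu/nuts"


def build_query_url(endpoint: str, graph: str = NUTS_GRAPH) -> str:
    """Build a GET URL that sends the example query to a SPARQL endpoint.

    `endpoint` must be supplied by the caller; no address is assumed here.
    """
    query = (
        "SELECT DISTINCT * "
        f"FROM <{graph}> "
        f"WHERE {{ <{graph}> ?p ?o }}"
    )
    return f"{endpoint}?{urlencode({'query': query})}"


if __name__ == "__main__":
    # Placeholder endpoint for illustration only.
    print(build_query_url("https://example.org/sparql"))
```

The IRIs are wrapped in angle brackets because SPARQL requires that form for full IRIs.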

@Robsteranium
Collaborator Author

Great thanks @ODP-hil!

Switching to either of these sources (the nuts-skos-ap-eu.rdf file or the sparql endpoint) would resolve the duplicate skos:prefLabel and encoding problems.

I gather that translations are being worked on at the moment. Will they be available in these same locations when they're ready?

@robons if we switch to the other rdf file we can remove the post processing too.

@robons robons closed this as completed May 17, 2021