Encoding problems with NUTS concept labels #8
Comments
I think it’s all happening because we’re piping triples via bash commands. I don't believe the problem is affecting labels for any UK-based locations, but I look forward to being proven wrong. We should probably move this pipeline towards using gss-jvm-build-tools, which avoids piping text around the terminal.
Sounds like yet another instance of us needing to seriously reconsider our stack.
I've submitted feedback to data.europa.eu.
@Robsteranium Thanks for raising the issue with data.europa.eu. You might want to have a look at the NUTS RDF model published by the EU Publications Office.
Great, thanks @ODP-hil! Switching to either of these sources (the nuts-skos-ap-eu.rdf file or the SPARQL endpoint) would resolve the duplicate labels. I gather that translations are being worked on at the moment - will these be available in these same locations when they're ready? @robons if we switch to the other RDF file we can remove the post-processing too.
I've updated the import so that we're using the improved source of the NUTS skos:ConceptScheme. Thanks @Robsteranium.
See Thessaloniki for example.
There are some mistakes upstream.
The source sometimes has two skos:prefLabel values for a resource - presumably these were intended to differ by language tag (as permitted by the vocab) but no language tags are provided.
Presumably this is intended to provide a native name and a transliteration - e.g. code PL22 has "Śląskie" and "Slaskie".
In most cases these labels are fine but the Cyrillic and Greek ones appear to have encoding problems - e.g. BG has "БЪЛГАРИЯ" and "????????" (hd inspection of the download shows these are literally U+003F - question marks).
Requesting with an 'Accept: application/rdf+xml;charset=UTF-8' header makes no difference as it's a statically-hosted file.
The RDF files for individual NUTS codes - e.g. BG - don't have this problem (so presumably they do have language tags somewhere upstream).
We could ask for them to fix it.
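A minimal sketch of how such labels could be spotted, along with one plausible way they might arise. The lossy re-encoding shown here is only an assumption about the upstream cause (the source does not confirm it), and the helper name `looks_mangled` is hypothetical.

```python
def looks_mangled(label: str) -> bool:
    """Return True if the label contains a literal question mark (U+003F)."""
    return "\u003f" in label


# Assumption: converting to a charset without Cyrillic, with replacement
# enabled, turns every unrepresentable character into U+003F.
lossy = "БЪЛГАРИЯ".encode("ascii", errors="replace").decode("ascii")

for label in ("БЪЛГАРИЯ", lossy, "Śląskie", "Slaskie"):
    # Print each label alongside whether it would be flagged.
    print(repr(label), looks_mangled(label))
```

Note that this check would also flag a legitimate label that happens to contain a question mark, so it only suits data where none is expected.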
We could also crawl all of those pages.
Alternatively we could try to filter the bad labels out as part of the step which de-dupes multiple labels.
Absent a language tag, we need some way to distinguish the good vs bad label.
The order isn't consistent in the file sadly (even if this were to be maintained deterministically by the sparql cli tool).
Instead we might be able to do something like FILTER (!regex(str(?label), "\\?")).
Indeed we might want to have the dedupe step filter instead of concatenating the labels (e.g. taking the first of those without encoding problems).
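The dedupe idea above could look roughly like the following. This is a sketch, not the pipeline's actual step: it assumes the labels have already been extracted as (code, label) pairs, and the function name `pick_labels` and the fallback to the first label when none is clean are illustrative choices.

```python
from collections import defaultdict


def pick_labels(pairs):
    """Keep one label per code: the first without a literal '?', else the first seen."""
    grouped = defaultdict(list)
    for code, label in pairs:
        grouped[code].append(label)

    chosen = {}
    for code, labels in grouped.items():
        clean = [label for label in labels if "?" not in label]
        # Fall back to the first label so that no code ends up unlabelled.
        chosen[code] = clean[0] if clean else labels[0]
    return chosen


# Example pairs taken from the labels quoted in this issue.
pairs = [
    ("BG", "????????"),
    ("BG", "БЪЛГАРИЯ"),
    ("PL22", "Śląskie"),
    ("PL22", "Slaskie"),
]
result = pick_labels(pairs)
print(result)
```

Because the order of labels in the source file isn't consistent, which of two clean labels wins (e.g. the native name or the transliteration for PL22) depends on input order. Only the mangled ones are reliably excluded.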