Ntriple reader gave error reading nt files including these characters. #24

Closed
afshinsadeghi opened this Issue Jan 10, 2018 · 11 comments

afshinsadeghi commented Jan 10, 2018

The triple reader gives an error when the .nt file includes one of these characters: " or \ or # or '

Member

GezimSejdiu commented Jan 10, 2018

Hi @afshinsadeghi ,
many thanks for opening this issue. Could you please be more specific, i.e. which reader is throwing this error? It is NTripleReader.load, right?

And if possible, could you share a small dataset that triggers this particular error?

GezimSejdiu self-assigned this Jan 10, 2018

GezimSejdiu added this to the 0.4 milestone Jan 10, 2018

Author

afshinsadeghi commented Jan 10, 2018

Hi @GezimSejdiu, yes, the one at NTripleReader.scala#L25.

Author

afshinsadeghi commented Jan 10, 2018

I tested reading these files:
mappingbased_objects_en.ttl.bz2, which is at
http://downloads.dbpedia.org/current/core-i18n/en/mappingbased_objects_en.ttl.bz2

and
mappingbased_literals_en.ttl.bz2, which is at
http://downloads.dbpedia.org/current/core-i18n/en/mappingbased_literals_en.ttl.bz2

Member

GezimSejdiu commented Jan 11, 2018

Hi @afshinsadeghi ,
as the name says, it is an NTripleReader, which means it can load only nt files. Could you please share the files (in case you have converted them from ttl to nt) so that we can reproduce the problem?

Author

afshinsadeghi commented Jan 12, 2018

I no longer have the unprocessed nt files, but I will regenerate them and upload them. I will write here again when that is done.

Member

LorenzBuehmann commented Jan 15, 2018

@GezimSejdiu The file extension ttl is a bit misleading. The file format is in fact N-Triples; I don't know why the DBpedia guys used ttl instead of nt, especially as some tools guess the format from the file extension.
@afshinsadeghi I cannot reproduce the problem. I loaded mappingbased_objects_en.ttl.bz2 successfully (the extracted version, that is). Can you share the whole error stack trace, please? Actually, the parsing code should be pretty robust for the N-Triples format, as we just split the input file(s) line by line and reuse the RIOT parser of the Apache Jena project. In addition, comment lines (starting with #) as well as empty lines are skipped.
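
The line-wise pre-filtering described above can be illustrated with a minimal Python sketch. This is not SANSA's actual Scala code: the function name data_lines and the sample triples are made up for illustration, and in SANSA each surviving line would then be handed to Jena's RIOT parser. The sketch also shows that a # inside a literal or an IRI fragment does not cause a line to be dropped, because only lines whose first non-blank character is # count as comments.

```python
from typing import Iterable, Iterator


def data_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that could hold a triple (hypothetical helper).

    Blank lines and comment lines (first non-blank character is '#')
    are dropped before anything would reach a real N-Triples parser.
    """
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        yield line


# Made-up sample input: one comment, two blank lines, two triples.
sample = [
    "# a comment line",
    "",
    '<http://example.org/s> <http://example.org/p> "a # inside a literal" .',
    "   ",
    "<http://example.org/s> <http://example.org/p> <http://example.org/o#frag> .",
]

kept = list(data_lines(sample))
for line in kept:
    print(line)
```

Only the two triple lines are printed; the comment and the blank lines are filtered out before parsing.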

Update

I can see from your project's POM file that you're still using version 0.2.0. Skipping empty lines and comment lines was added to the parser in release 0.3.0, so you should update to the latest version and the error should be gone.
Cheers

Author

afshinsadeghi commented Jan 15, 2018

@LorenzBuehmann Two more datasets that I tested were
http://resources.mpi-inf.mpg.de/yago-naga/yago3.1/yagoFacts.ttl.7z
http://resources.mpi-inf.mpg.de/yago-naga/yago3.1/yagoLabels.ttl.7z
These are real ttl files, for which I still need to upload nt versions. They contained invalid Unicode characters from the list I wrote above (not including the number sign). I am not sure what the best policy for a reader would be: to report that the file has an error, or to ignore those characters.

Member

LorenzBuehmann commented Jan 16, 2018

I can confirm that the YAGO dataset has some issues:

mkdir /tmp/sansa-test
cd /tmp/sansa-test
wget http://resources.mpi-inf.mpg.de/yago-naga/yago3.1/yagoFacts.ttl.7z
7za x yagoFacts.ttl.7z
$JENA_HOME/bin/riot --time --check --out=NTRIPLES yagoFacts.ttl > yagoFacts.nt

A parse error occurs:

08:04:29 ERROR riot                 :: [line: 9597504, col: 8 ] Illegal unicode escape sequence value: \\ (0x5C)

> I am not sure, what will be the best policy for a reader, to say that the file has an error or to ignore those characters.
Ignoring a character is certainly not an option, as this would invalidate the location target of the URI/IRI. The finest-grained entity in RDF is an RDF resource, but the smallest statement is an RDF triple. Thus, the only option would be to ignore the whole RDF triple. Right now this isn't possible in SANSA.
I opened ticket #27 for this.
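
The "ignore the whole triple" policy discussed above could look roughly like the following Python sketch. It is only an illustration under stated assumptions, not a SANSA feature and not the design of #27: the names check_escapes and load_skipping_bad are hypothetical, and the regex-based escape check is a toy stand-in for a real parser such as Jena RIOT (it only looks at backslash escapes, not at the full N-Triples grammar).

```python
import re
from typing import Callable, Iterable, List, Tuple

# Backslash escapes that the N-Triples grammar allows in string literals:
# \t \b \n \r \f \" \' \\ plus \uXXXX and \UXXXXXXXX.
_VALID_ESCAPE = re.compile(r'\\(?:[tbnrf"\'\\]|u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})')


def check_escapes(line: str) -> None:
    """Toy stand-in for a real parser: raise ValueError on a bad escape."""
    # Remove every well-formed escape; any backslash left over is malformed.
    leftover = _VALID_ESCAPE.sub("", line)
    if "\\" in leftover:
        raise ValueError("illegal escape sequence")


def load_skipping_bad(
    lines: Iterable[str],
    check: Callable[[str], None] = check_escapes,
) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Keep lines that pass the check; record (line number, error) for the rest."""
    good: List[str] = []
    skipped: List[Tuple[int, str]] = []
    for number, line in enumerate(lines, start=1):
        try:
            check(line)
        except ValueError as err:
            # Drop the whole triple rather than editing characters inside it.
            skipped.append((number, str(err)))
            continue
        good.append(line)
    return good, skipped


# Made-up sample: the second line contains the malformed escape \q.
sample = [
    '<http://example.org/a> <http://example.org/label> "fine" .',
    '<http://example.org/b> <http://example.org/label> "bad \\q escape" .',
    '<http://example.org/c> <http://example.org/label> "tab\\there" .',
]

good, skipped = load_skipping_bad(sample)
print(len(good), "kept;", "skipped:", skipped)
```

Returning the skipped line numbers alongside the kept triples means the caller can still report how much data was discarded, instead of silently losing it.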

Author

afshinsadeghi commented Jan 16, 2018

This seems to be a known issue:
https://stackoverflow.com/questions/39664819/sanitize-yago-files-before-loading-into-apache-jena-tdb-triplestore/43515692

As this is a matter of cleaning the files rather than reading them, I consider the issue I reported here closed and will follow up in #27.

Member

LorenzBuehmann commented Jan 16, 2018

Summary:

  • the first issue was solved in release 0.3.0
  • the second issue is a general problem of poor data quality and needs some more rework and extension of the parser; see #27