XML declaration missing from OAI when using oai_ddi #10329
Comments
I may be missing something obvious, but why do you think that the problem above is due to a missing xml declaration? Oh, I see what you mean now [edit: no, that's not what the OP meant] - the xml declaration that is present in the export, but absent in the OAI output. This is in fact "a feature, not a bug": these declarations are stripped on purpose when generating the OAI output. That xml header would in fact make the OAI xml invalid if left in place. To me it looks like the "XML Parsing Error" in your example is due to the invalid UTF-8 characters in the output (after the word "filtering"). I also suspect that it's a result of this bug: But please note that this is just a guess, we would need to confirm this.
Any chance you could point me to an OAI record that is still similarly broken?
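To illustrate the distinction drawn above, here is a minimal sketch, assuming Python's standard-library parser behaves like the harvester's (which it may not). It compares a UTF-8 fragment with no XML declaration against one containing a truncated multi-byte sequence. Both sample payloads are made up for illustration and are not taken from the Borealis record.

```python
import xml.etree.ElementTree as ET


def is_well_formed(payload: bytes) -> bool:
    """Return True if the standard-library parser accepts the payload."""
    try:
        ET.fromstring(payload)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample: valid UTF-8 (U+2019 included), no XML declaration.
no_declaration = "<title>Sam\u2019s data</title>".encode("utf-8")

# Hypothetical sample: the three-byte sequence E2 80 99 cut short to E2 80.
broken_sequence = b"<title>Sam\xe2\x80s data</title>"

print(is_well_formed(no_declaration))
print(is_well_formed(broken_sequence))
```

With this parser, a missing declaration alone does not trigger the error, since the encoding defaults to UTF-8, whereas the truncated byte sequence is rejected as not well-formed.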
OAI output requires the declaration as cited above in 3.2.1 of the spec.
For example, this is missing the declaration: It is also the record that caused the chaos: I can't point you to a similar record that's broken, but you should be able to reproduce it by copying over the metadata from version 1 to wherever you test things: https://borealisdata.ca/dataset.xhtml?persistentId=doi:10.5683/SP2/NEPRTA&version=1.0 I'm not sure how you get "That xml header would in fact make the OAI xml invalid if left in place." Using the GetRecord verb and having the declaration in place should not invalidate the XML, unless I'm missing something. "An example of a successful reply to the GetRecord request shown above is of the form:"
The GetRecord response from Dataverse does not conform to this model. I didn't write whatever parsed the XML output from Borealisdata.ca. But I did have to find out what was causing the problem in the record, which I then traced to the offending UTF-8 character. I realize that XML should explicitly be assumed to be UTF-8, but:
As far as I can tell, inserting one line whenever the
The sentence quoted above refers to the xml declaration at the top of the full GetRecord XML output itself... So, we were talking about different things (I was referring to our code going to some trouble stripping these headers somewhere else). But, do note that I opened with an acknowledgment that it was possible I was missing something. However, having taken another look, I can tell you 100% for sure that the xml error in your original example is most definitely the result of the bug I mentioned (#9910). I can send you more info about that bug; and I will otherwise look into this some more tomorrow. My apologies for having missed this issue when you opened it last month.
Sorry for adding unnecessary confusion the other day. I can point you to the specific place where that xml declaration is in fact stripped from the output (inside the There are two separate things going on:
That was the nature of the weird bug: it manifested itself when a multi-byte UTF-8 sequence happened to straddle the 1024-byte offset in the cached metadata record (and only that offset, not its multiples). All the best,
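The failure mode described above can be sketched generically. This is not the Dataverse code and does not reproduce the "only the first 1024 bytes" detail. It only shows, under the assumption of a reader that decodes fixed-size byte chunks independently, how a character straddling a chunk boundary gets corrupted, and how an incremental decoder avoids that. The function names and the padded sample text are hypothetical.

```python
import codecs


def decode_chunks_naively(data: bytes, chunk_size: int) -> str:
    """Decode every chunk on its own; a split character becomes U+FFFD."""
    parts = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        parts.append(chunk.decode("utf-8", errors="replace"))
    return "".join(parts)


def decode_chunks_incrementally(data: bytes, chunk_size: int) -> str:
    """Carry incomplete byte sequences over to the next chunk."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = []
    for start in range(0, len(data), chunk_size):
        parts.append(decoder.decode(data[start:start + chunk_size]))
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


# Hypothetical record: padding places the 3-byte U+2019 at bytes 1023-1025,
# so it straddles the 1024-byte offset.
text = "a" * 1023 + "\u2019" + " filtering"
data = text.encode("utf-8")

naive = decode_chunks_naively(data, 1024)
safe = decode_chunks_incrementally(data, 1024)
print(naive == text, safe == text)
```

The naive version yields replacement characters where the quote used to be, while the incremental decoder returns the original text.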
I opened an issue in the xoai repo (gdcc/xoai#225) for the missing declaration. It seems somewhat redundant, since the server already sends the Otherwise I'm going to close this issue. Once again, I am really sorry we didn't get back to you sooner on this. We communicated directly with a couple of other Dataverse instances who reported the bug last fall and helped them patch their installations. But then once it was fixed in 6.1 we just moved on, assuming that everybody would just upgrade - I'm realizing now that was a mistaken assumption.
I don't think it's the library's job to take care of the prolog. IMHO we'd need to make this change in Dataverse code. See also my reply at gdcc/xoai#225 (comment)
XML declarations missing from metadataPrefix=oai_ddi records.
What steps does it take to reproduce the issue?
An OAI harvest on a record using oai_ddi. Generic example: https://[DV_URL]/oai?verb=GetRecord&metadataPrefix=oai_ddi&identifier=doi:10.5683/SUM/IDENT
On OAI harvest as above.
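For anyone scripting the reproduction, a request URL of the shape above could be assembled like this. It is a small sketch only: the base URL is a placeholder, the helper name is hypothetical, and the identifier is the generic one from the example. Percent-encoding the identifier is what `urlencode` does by default; the unencoded form shown above is what people usually paste into a browser.

```python
from urllib.parse import urlencode


def get_record_url(base_url: str, identifier: str,
                   metadata_prefix: str = "oai_ddi") -> str:
    """Build an OAI-PMH GetRecord URL (hypothetical helper)."""
    query = urlencode({
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": identifier,
    })
    return f"{base_url}?{query}"


# Placeholder host; substitute the OAI endpoint of the installation under test.
url = get_record_url("https://dataverse.example.org/oai",
                     "doi:10.5683/SUM/IDENT")
print(url)
```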
On every occurrence.
The record is missing the mandatory XML declaration described in section 3.2.1 of the OAI spec: https://www.openarchives.org/OAI/openarchivesprotocol.html
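A quick way to check a saved response body for the declaration might look like the following. This is a rough sketch, not a validator: the helper name is hypothetical and it only looks at the very start of the body, after an optional UTF-8 byte order mark.

```python
import re

# "<?xml" followed by whitespace, so "<?xml-stylesheet" does not count.
_XML_DECLARATION = re.compile(rb"<\?xml\s")


def has_xml_declaration(body: bytes) -> bool:
    """Return True if the body starts with an XML declaration."""
    if body.startswith(b"\xef\xbb\xbf"):  # skip a UTF-8 byte order mark
        body = body[3:]
    return _XML_DECLARATION.match(body) is not None


with_decl = b'<?xml version="1.0" encoding="UTF-8"?><OAI-PMH/>'
without_decl = b"<OAI-PMH/>"
print(has_xml_declaration(with_decl), has_xml_declaration(without_decl))
```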
Because of this, records may cause an error ("XML Parsing Error: not well-formed") when encountering non-ASCII characters, causing problems with OAI harvest.
This would (presumably) affect all records which contain characters outside of ISO-8859-1.
XML was expected to be generated without error (notably the DDI export found in the API and Dataverse GUI contains an XML declaration).
Which version of Dataverse are you using?
v5.13 (at https://borealisdata.ca)
As an example of this, here is the output of
https://borealisdata.ca/oai?verb=GetRecord&identifier=doi:10.5683/SP2/NEPRTA&metadataPrefix=oai_ddi (2024-02-16, probably repaired by the time you see it), original record at
https://borealisdata.ca/dataset.xhtml?persistentId=doi:10.5683/SP2/NEPRTA&version=1.0
The character which causes the failure is the single typographic quote in the title: https://www.codetable.net/decimal/8217
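For reference, the character in question can be inspected directly. The snippet below is illustrative only; it shows the code point's name, its three-byte UTF-8 encoding, and that it has no ISO-8859-1 representation. The helper name is hypothetical.

```python
import unicodedata

quote = chr(8217)  # decimal 8217 is U+2019
name = unicodedata.name(quote)
utf8_bytes = quote.encode("utf-8")


def fits_iso_8859_1(ch: str) -> bool:
    """Return True if the character can be encoded as ISO-8859-1."""
    try:
        ch.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


print(name, utf8_bytes.hex(" "), fits_iso_8859_1(quote))
```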
Note that the content of the page is as follows, and is missing the XML declaration: