[BUG] #2186

Closed
mmalmeida opened this issue Jan 2, 2023 · 10 comments
@mmalmeida

mmalmeida commented Jan 2, 2023

There is a discrepancy between your repository's response and the response expected under the OAI-PMH protocol.

According to guidelines in https://www.openarchives.org/OAI/openarchivesprotocol.html#OAIPMHschema, the response XML sent by the repository should include the following attribute in the metadata part of each record:
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

the metadata part. This consists of a single root tag - in the example the tag oai_dc:dc - with the nested tags belonging to the corresponding metadata format - in the example, Dublin Core elements such as dc:title. Note that the root tag within the metadata part includes a number of attributes that are common to all XML documents that use namespaces and schema validity:
  • namespace declarations -- the declarations of the namespaces used within the metadata part, each of which is prefixed with xmlns. Namespace declarations within the metadata part fall into two categories:

      • metadata format specific namespace(s) - every metadata part must include one or more xmlns prefixed attributes that define the correspondence between a metadata format prefix -- e.g. dc -- and the namespace URI (as defined by the XML namespace specification) of the respective metadata format. Some metadata formats employ tags from multiple namespaces, requiring multiple xmlns prefixed attributes -- in the example, there are declarations for both oai_dc and dc.
      • xml schema namespace - every metadata part must include the attribute xmlns:xsi, the value of which must always be the URI shown in the example, which is the namespace URI for XML schema.

  • xsi:schemaLocation -- the value of which is a URI, URL pair; the first is the namespace URI (as defined by the XML namespace specification) of the metadata that follows in this part, and the second is the URL of the XML schema for validation of the metadata that follows

We've confirmed that most repositories' responses include this attribute (we'll use SciELO Spain below as an example).

For the request: https://scielo.isciii.es/oai/scielo-oai.php?verb=ListRecords&set=0213-1285&from=2022-11-30&metadataPrefix=oai_dc

We get the following response (XMLSchema-instance included in the metadata part, in xmlns:xsi):

<ListRecords>
  <record>
    <header>
      <identifier>oai:scielo:S0213-12852022000300001</identifier>
      <datestamp>2022-11-30</datestamp>
      <setSpec>0213-1285</setSpec>
    </header>
    <metadata>
      <oai-dc:dc xmlns:oai-dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">

However, for: https://doaj.org/oai.article?verb=ListRecords&set=TENDOkRlcm1hdG9sb2d5&from=2022-08-31T23%3A00%3A00Z&metadataPrefix=oai_dc

We get the following response (XMLSchema-instance not included in metadata element):

<ListRecords>
  <record>
    <header xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <identifier>oai:doaj.org/article:8182b35def72476c83cd4214682b200b</identifier>
      <datestamp>2023-01-02T02:13:18Z</datestamp>
      <setSpec>TENDOkRlcm1hdG9sb2d5</setSpec>
    </header>
    <metadata xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/">

This makes it impossible for parsers that rely on a correct XML document to retrieve data from DOAJ.

Is it possible to update DOAJ to include the required xmlns:xsi attribute in the metadata part of each record?

@mmalmeida mmalmeida added the bug label Jan 2, 2023
@dommitchell
Contributor

Thanks @mmalmeida ! We will investigate and I will answer here.

@mmalmeida
Author

Hi @dommitchell, any news on this?

@dommitchell
Contributor

We have identified a fix for this and we will get that into production as soon as we can.

@richard-jones
Contributor

Hi @mmalmeida, thanks for your patience on this issue. We have a fix passing through final testing at the moment, and a release should follow in due course.

In the meantime, I wanted to outline how the fix we are producing differs from what you have suggested, and why, and see if that's an issue.

Below I've outlined what the OAI-PMH specification says (which ties up with what you have indicated in the original issue), and after that how our incoming fix modifies the response.

OAI-PMH specification

First of all, these are examples from the OAI-PMH specification of how the record should look:

<?xml version="1.0" encoding="UTF-8" ?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" 
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
         http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
 <responseDate>2002-05-01T19:20:30Z</responseDate>
 <request verb="GetRecord" identifier="oai:arXiv.org:hep-th/9901001"
          metadataPrefix="oai_dc">http://an.oa.org/OAI-script</request> 
 <GetRecord>
  <record>
      ...
  </record>
 </GetRecord> 
</OAI-PMH>     

Note:

  • xmlns:xsi is in the root OAI-PMH element
  • xsi:schemaLocation is in the root OAI-PMH element

Then inside the record element, this:

<metadata>
 <oai_dc:dc 
     xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" 
     xmlns:dc="http://purl.org/dc/elements/1.1/" 
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
     xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ 
     http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
   <dc:title>Using Structural Metadata to Localize Experience of Digital 
             Content</dc:title>
 </oai_dc:dc>
</metadata>

Note:

  • There are no attributes on the metadata element
  • xmlns:oai_dc is defined on the oai_dc:dc element
  • xmlns:dc is defined on the oai_dc:dc element
  • xmlns:xsi is defined on the oai_dc:dc element
  • xsi:schemaLocation is defined on the oai_dc:dc element
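A quick way to check a response against the notes above is to record which element actually carries each namespace declaration. The sketch below uses only Python's standard-library xml.etree.ElementTree. The function name and the two trimmed sample documents are made up for illustration; they are not taken from the DOAJ or SciELO code.

```python
import io
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
OAI = "http://www.openarchives.org/OAI/2.0/"

# Hypothetical, trimmed sample; {xsi_decl} is filled in below.
TEMPLATE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <GetRecord><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/" {xsi_decl}
               xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
      <dc:title>Sample</dc:title>
    </oai_dc:dc>
  </metadata></record></GetRecord>
</OAI-PMH>"""

# xmlns:xsi repeated on the metadata root, as in the specification example.
SPEC_STYLE = TEMPLATE.format(xsi_decl='xmlns:xsi="%s"' % XSI)
# xmlns:xsi declared only on the OAI-PMH root element.
HOISTED_STYLE = TEMPLATE.format(xsi_decl="")


def metadata_roots_declaring_xsi(xml_text):
    """Return (tag, declares_xsi) for the root element of each metadata part."""
    pending = []   # declarations seen since the previous start tag
    stack = []     # tags of the currently open elements
    results = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start", "end")):
        if event == "start-ns":
            # Reported just before the start event of the declaring element.
            pending.append(payload)  # (prefix, uri)
        elif event == "start":
            if stack and stack[-1] == "{%s}metadata" % OAI:
                results.append((payload.tag, ("xsi", XSI) in pending))
            stack.append(payload.tag)
            pending = []
        else:
            stack.pop()
    return results


print(metadata_roots_declaring_xsi(SPEC_STYLE))
print(metadata_roots_declaring_xsi(HOISTED_STYLE))
```

This only reports where the declaration is written. It says nothing about whether the document is well-formed, which is a separate question.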

This lines up with what we see on the SciELO OAI endpoint https://scielo.isciii.es/oai/scielo-oai.php?verb=ListRecords&set=0213-1285&from=2022-11-30&metadataPrefix=oai_dc

DOAJ Implementation

Our (updated, not yet released) implementation, meanwhile, looks like this:

<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
    <responseDate>2023-03-03T15:22:12Z</responseDate>
    <ListRecords>
        <record>

Note:

  • xmlns:xsi is in the root OAI-PMH element
  • xsi:schemaLocation is in the root OAI-PMH element
  • This matches the specification

Then inside the record element:

<metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                           xmlns:dc="http://purl.org/dc/elements/1.1/"
                           xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
        <dc:title>Exploring the Virulent Jazz Counterculture in Mumbo Jumbo</dc:title>

Note:

  • There are no attributes on the metadata element
  • xmlns:oai_dc is defined on the oai_dc:dc element
  • xmlns:dc is defined on the oai_dc:dc element
  • xmlns:xsi is NOT defined on the oai_dc:dc element (or, indeed, anywhere)
  • xsi:schemaLocation is defined on the oai_dc:dc element
  • This does not match the specification

Analysis

The code which produces this latter part is built using lxml, which is a wrapper around libxml2. It has been given all of the appropriate namespaces at the appropriate times, and we have no reason to believe that it produces invalid XML.

In producing the above snippet the code (paraphrased) is:

NSMAP = {None: PMH_NAMESPACE, "xsi": XSI_NAMESPACE, "xmlns": XMLNS_NAMESPACE}
NSMAP.update({"oai_dc": OAIDC_NAMESPACE, "dc": DC_NAMESPACE})
oai_dc = etree.SubElement(metadata, self.OAIDC + "dc", nsmap=NSMAP)

That is, the xsi namespace is supplied to the oai_dc:dc element at construction, but it does not appear on this element when the XML is rendered. This is because our XML library is hoisting that namespace declaration up to the top of the XML file (in the root OAI-PMH element), as it is normalising its usage throughout the document (avoiding repetition of the same namespace declaration).

We believe this is syntactically correct XML, and should work with any formal XML parser. Exactly why the OAI-PMH specification expects the xsi namespace declaration to be repeated I do not know, but I don't think that not repeating it here makes the XML incorrect.
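As a rough illustration of that claim, a namespace-aware parser resolves the xsi prefix on oai_dc:dc from the declaration on the root element. The sketch below uses Python's standard-library xml.etree.ElementTree rather than lxml, and the sample response is a trimmed, hypothetical version of the snippet above (the header element is left out).

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# Trimmed sample: xmlns:xsi appears only on the root OAI-PMH element.
RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/"
                   xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
          <dc:title>Exploring the Virulent Jazz Counterculture in Mumbo Jumbo</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

# Parsing the whole document succeeds: the xsi prefix is in scope on every
# descendant of the element that declares it.
root = ET.fromstring(RESPONSE)
dc_root = root.find(".//{%s}dc" % OAI_DC)

# The attribute is stored under its expanded (namespace-qualified) name.
schema_location = dc_root.get("{%s}schemaLocation" % XSI)
print(schema_location)
```

A consumer that parses the complete response this way sees the same expanded names whether or not the declaration is repeated on oai_dc:dc.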

The only way that I can see that we could become formally specification compliant is to render each oai_dc:dc element as a string with all the namespaces inserted, and then stitch together the final OAI-PMH response out of string representations. This seems like a Bad Idea, so we are not going to go that route.

We'd be interested to know whether you would still find the above correct (but not spec compliant) XML problematic to work with?

@eduardorep

Hello @richard-jones, thanks for the follow-up.

"We'd be interested to know whether you would still find the above correct (but not spec compliant) XML problematic to work with?"

Yes, it is still a problem, because we use the XOAI parser, a Java library for harvesting metadata from OAI records. It follows the OAI-PMH spec.

(Dspace xoai github page: https://github.com/DSpace/xoai)

So when we try to harvest using that library, it gives us an error stating that it needs to see that declaration (xmlns:xsi with the XMLSchema-instance URI), and it does not return the metadata. So this solution does not solve our problem, and we cannot find a quick workaround, since the XOAI parser is the tool that provides us with the processed metadata.

On another note, we have had the same issue with another repository that uses Python with the same library as you (lxml), and it showed the same behaviour: if the declaration exists at the top of the XML, it is removed from the record metadata elements. We had a meeting with them about this, and the conclusion was that it was not fixable on their side because of that library. Their idea is to eventually stop using that library so that clients can harvest the metadata properly and according to the OAI spec.

Maybe a simple version upgrade can solve it, or using another library, if that would be viable in your case?

"The only way that I can see that we could become formally specification compliant is to render each oai_dc:dc element as a string with all the namespaces inserted, and then stitch together the final OAI-PMH response out of string representations. This seems like a Bad Idea, so we are not going to go that route."

It does seem like a bad idea, and I understand your worries about implementing something like that, but it is probably the fastest solution we could get (even though it is far from optimal).

@richard-jones
Contributor

Hi @eduardorep, thanks for the details, that's useful.

It's unfortunate that XOAI won't accept this input. Can you give me some details of the error you get from it when the import fails because of the missing xsi declaration?

We've discussed this internally, and we're not comfortable moving to a string-manipulation based approach, and we don't plan to introduce a second or alternative xml serialiser at this stage (though we will consider it for the future). Therefore the fastest approach would probably be to introduce some flexibility into the XOAI library to parse these records in the absence of a repeated xsi attribute.
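To sketch what that kind of flexibility could look like on the consuming side: parse the whole response first, then re-serialise each metadata part so that it carries every declaration it needs. This is written in Python with the standard-library xml.etree.ElementTree purely as an illustration. It is not XOAI code, and the function name and sample response are hypothetical.

```python
import xml.etree.ElementTree as ET

OAI = "http://www.openarchives.org/OAI/2.0/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"

# Ask the serialiser to use the conventional prefixes instead of ns0, ns1, ...
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)
ET.register_namespace("xsi", XSI)

# Trimmed sample: xmlns:xsi appears only on the root OAI-PMH element.
RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <ListRecords><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/"
               xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
      <dc:title>Exploring the Virulent Jazz Counterculture in Mumbo Jumbo</dc:title>
    </oai_dc:dc>
  </metadata></record></ListRecords>
</OAI-PMH>"""


def self_contained_metadata(response_text):
    """Return each metadata part as a string that declares its own namespaces."""
    root = ET.fromstring(response_text)
    parts = []
    for metadata in root.iter("{%s}metadata" % OAI):
        for child in metadata:
            # Serialising a subtree writes the declarations for every
            # namespace used inside it onto the subtree's root element.
            parts.append(ET.tostring(child, encoding="unicode"))
    return parts


fragments = self_contained_metadata(RESPONSE)
print(fragments[0])
```

Each returned string can then be handed to a parser that expects a standalone fragment, at the cost of one extra parse and serialise pass per response.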

I have briefly looked at the XOAI code, and I wasn't able to quickly see how to do this, but if you can provide me with the error message it gives you and/or a reference code snippet that's using XOAI then I could take a look and see if there's a quick option there.

@richard-jones
Contributor

I'd go so far as to say this is a bug in XOAI. It seems to break the XML down into parts and parse them separately, which is probably where this error comes from: DSpace/xoai#67
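The failure mode is easy to reproduce in isolation. The Python sketch below (standard library only) is an illustration of the general problem, not of XOAI's actual code: once the oai_dc:dc element is cut out of the response as text and parsed on its own, the xsi prefix is no longer bound to anything. The fragment and helper name are hypothetical.

```python
import xml.etree.ElementTree as ET

# The metadata root as it appears in the response, cut out as a string.
# xmlns:xsi is absent because it is declared on the OAI-PMH root element.
FRAGMENT = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/"
           xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:title>Exploring the Virulent Jazz Counterculture in Mumbo Jumbo</dc:title>
</oai_dc:dc>"""


def parses_standalone(fragment):
    """Return (ok, error_message) for parsing the fragment on its own."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError as exc:
        return False, str(exc)
    return True, ""


ok, message = parses_standalone(FRAGMENT)
print(ok, message)

# With the declaration repeated on the fragment root, the same text parses.
repaired = FRAGMENT.replace(
    "<oai_dc:dc ",
    '<oai_dc:dc xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ',
    1,
)
print(parses_standalone(repaired))
```

So the repeated declaration in the specification example only matters to a consumer that handles the metadata part separately from the document it came from.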

@eduardorep

Hello @richard-jones, could you take a look at the bug in XOAI in order to make it work with what your project requires? Do you have enough experience in Java to fix it? Would you need any assistance to complete it?

@eduardorep

Here is a possible solution we are exploring: gdcc/xoai#141

@dommitchell
Contributor

Hi there! Apologies for the small delay in replying. @richard-jones had to catch up with me first to discuss this. I am the Operations Manager for DOAJ and I work closely with Richard in prioritising our developments in a way that is the best use of the funding that we have.

Richard explained the detail here. In order for us to implement a fix, we would need to investigate how complex the fix is and whether or not it would be accepted by the XOAI maintainers. It's worth noting at this point that Richard contacted them 3 weeks ago with this question and they still haven't responded. This does make us nervous, as we could essentially do the work and have it rejected.

The fix itself is not a small amount of work, as it will involve us understanding XOAI, setting up a testing environment, making the fix (which may be complex) and then piloting the fix through to acceptance.

All in all, I don't think that we can risk that kind of resource and money on this, at this moment in time. We are severely underfinanced and have a long development list with several high-priority items in it.

I am sorry that I don't have a better answer for you at this time. Thank you for taking the time to go through this with us.
