Relax NG: Add support for default class attribute values #1624

Closed
nosaj3 opened this Issue Sep 24, 2018 · 26 comments

nosaj3 commented Sep 24, 2018

I'm currently using BaseX as the underlying database for a project that provides a search database for DITA content. The project uses the DITA for Small Teams project (D4ST) as its application layer [1]. The DITA content I'm trying to manage through this project is validated with Relax NG.

The problem is that, because D4ST relies on BaseX to parse the XML, none of the Relax NG-validated content imported into the database is recognized. The reason is that D4ST identifies elements by their class attribute values. These values are not explicitly defined in the content. With the DITA Open Toolkit this is not a problem, because it supports a Java framework, installed as a plugin, that supplies default class attribute values for Relax NG [2].

The plugin is basically two JARs. Is it possible to add a similar framework to BaseX?

[1] http://www.d4st.org/
[2] https://github.com/oxygenxml/dita-relaxng-defaults

Member

ChristianGruen commented Sep 25, 2018

Anyone else interested in such a framework?

nosaj3 commented Sep 25, 2018

I think I might be stretching the truth to say "Yes, definitely!". Typically, DITA content is validated by way of DTDs, which means this problem does not exist in those cases. Also, in the past, to support the type of search functionality I'm trying to achieve, you would simply invest in a content management system.

However, it's not a stretch to say that the adoption of Relax NG validation for DITA content could see a rapid uptick. Extending the DITA Open Toolkit through DTDs and XSD is a notoriously convoluted process that is prone to error. Use of Relax NG makes it far easier to customize and extend the DITA Open Toolkit [1].

Also, DITA has gained a healthy reputation in the world of technical communication. I'm sure many smaller groups without big budget backing would adopt it more voraciously if they had access to more open source tools for building out their publishing pipeline. The D4ST project I reference above is one such example of how smaller groups can build out a near enterprise class publishing pipeline at next to no cost.

[1] https://www.balisage.net/Proceedings/vol13/html/Kimber01/BalisageVol13-Kimber01.html#d223296e340

Member

ChristianGruen commented Sep 25, 2018

So maybe we can find some parties who would be interested in sponsoring such a feature?

nosaj3 commented Sep 25, 2018

That certainly is a possibility. I can try reaching out to the folks at SynchroSoft, who make Oxygen and were responsible for developing the plugin mentioned in my request.

Member

ChristianGruen commented Sep 25, 2018

Sounds interesting.

nosaj3 commented Sep 28, 2018

After reaching out to SynchroSoft, it seems like this is possible just by adding the dita-ng.jar to BaseX as a third-party jar. There should be no need to build a new framework. Is it as simple as adding the jar to the lib directory or would I need to follow instructions outlined here?

Member

ChristianGruen commented Sep 28, 2018

Perfect! Adding it to the lib directory (or, better, lib/custom) and restarting BaseX should be sufficient. Looking forward to your feedback.

nosaj3 commented Oct 1, 2018

I tried copying the jar to the lib/custom dir but no luck after a restart. Is there a way in BaseX to determine if the jar is recognized or classes are being loaded?

EDIT: I just saw that Radu Coravu from SynchroSoft responded to my basex-talk thread with some good insight on the inner workings of the plugin.
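
One way to answer the "is the jar recognized?" question from outside BaseX is a small probe that reports where the JVM found a given class. The sketch below is not a BaseX feature; `ClassOrigin` is a hypothetical helper, and it assumes it is started with the same classpath that BaseX uses.

```java
import java.security.CodeSource;

/** Hypothetical helper: reports where the JVM loaded a class from. */
final class ClassOrigin {

    /** Returns the code source location of the class, or a short label when none is known. */
    static String describe(String className) {
        try {
            // Load without initializing the class; only its origin is of interest.
            Class<?> type = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                return "no code source (typically a JDK bootstrap class)";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "not on classpath";
        }
    }

    public static void main(String[] args) {
        // Class names are passed as arguments, for example
        // org.apache.xerces.parsers.XIncludeAwareParserConfiguration
        for (String name : args) {
            System.out.println(name + " -> " + describe(name));
        }
    }
}
```

If a class from the added jar is reported as "not on classpath", the jar is not visible to that class loader. A jar path in the output shows which copy of the class is used.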

raducoravu commented Oct 15, 2018

Somehow the "dita-ng.jar" library from our project:

https://github.com/oxygenxml/dita-relaxng-defaults

would need to be placed in the BaseX startup Java classpath so that it is loaded before the Xerces JAR library.

raducoravu commented Oct 24, 2018

Had more time to look into this, here's my analysis:

  • The "dita-ng.jar" JAR library has in its "META-INF" folder a "services" subfolder with a file called "org.apache.xerces.xni.parser.XMLParserConfiguration", which points to a custom parser configuration that injects Relax NG default attribute values into each parsed XML element before the SAX handler callbacks.

  • Placing the "dita-ng.jar" in the "custom" folder should have been enough but it is not.

  • The problem is that BaseX does not use the Apache Xerces library; by default it uses the Xerces copy bundled with the Java JDK. The Java source code of the Apache Xerces constructor "org.apache.xerces.parsers.XMLGrammarParser.XMLGrammarParser(SymbolTable)" looks like this:

    protected XMLGrammarParser(SymbolTable symbolTable) {
        super((XMLParserConfiguration)ObjectFactory.createObject(
            "org.apache.xerces.xni.parser.XMLParserConfiguration",
            "org.apache.xerces.parsers.XIncludeAwareParserConfiguration"
            ));
        // ...
    }

This means that the Apache Xerces library is extensible: it asks the class loader for a provider of "org.apache.xerces.xni.parser.XMLParserConfiguration" and would load the one from "dita-ng.jar" if that JAR were loaded by the same class loader.
But the equivalent Xerces class in the Java JDK's rt.jar, "com.sun.org.apache.xerces.internal.parsers.XMLGrammarParser", reveals:

    protected XMLGrammarParser(SymbolTable symbolTable) {
        super(new XIncludeAwareParserConfiguration());
        fConfiguration.setProperty(Constants.XERCES_PROPERTY_PREFIX+Constants.SYMBOL_TABLE_PROPERTY, symbolTable);
    }

So the same class in the Java JDK uses a fixed configuration; it does not look for services defined in other JAR libraries.

  • I added to the "basex/lib" folder the latest "xercesImpl.jar" library downloaded from the Apache Xerces project.

  • With the proper XML catalog specified in the .basex configuration file, executing an XQuery on the server against a Relax NG-based DITA topic started to take the default attribute values into account.

So this started to work once the Apache Xerces library was used, but there are still performance problems: the XQuery takes quite a long time to finish. With the fix in place, parsing the XML also means parsing the Relax NG schemas it references, but even so it takes too long, and I have not yet figured out what causes the long delay before the XQuery executes.
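
The service lookup described in the analysis above can be checked independently of Xerces. The following is a minimal sketch that lists the provider class names registered under "META-INF/services/" for a given service name. `ServiceFileProbe` is a hypothetical helper, it assumes it runs with the same classpath as BaseX, and it only reads the registration files; it does not reproduce how Xerces's own ObjectFactory selects a provider.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

/** Hypothetical helper: lists providers registered in META-INF/services files. */
final class ServiceFileProbe {

    /** Returns every provider class name registered for serviceName and visible to loader. */
    static List<String> providers(ClassLoader loader, String serviceName) throws IOException {
        List<String> names = new ArrayList<>();
        Enumeration<URL> files = loader.getResources("META-INF/services/" + serviceName);
        while (files.hasMoreElements()) {
            URL file = files.nextElement();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(file.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // Service files may contain '#' comments and blank lines.
                    int hash = line.indexOf('#');
                    if (hash >= 0) {
                        line = line.substring(0, hash);
                    }
                    line = line.trim();
                    if (!line.isEmpty()) {
                        names.add(line);
                    }
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        String service = "org.apache.xerces.xni.parser.XMLParserConfiguration";
        System.out.println(providers(ServiceFileProbe.class.getClassLoader(), service));
    }
}
```

An empty list means no jar visible to that class loader registers a parser configuration, so a Xerces build that performs the lookup would fall back to its default.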

Member

ChristianGruen commented Oct 24, 2018

Thanks for the comprehensive summary. It’s great to hear that it’s working now. Regarding the performance issues, we may need to do some profiling.

raducoravu commented Oct 24, 2018

About the performance problem: most of the time seems to be spent waiting for an HTTP timeout.
The thing is that I installed BaseX on a computer that needs an HTTP proxy server configured in order to access remote web sites, so any connection from the BaseX Java process to an external web site times out after about 20 seconds.
When my sample XQuery is executed on the server side, your bundled Jetty HTTP server tries to connect to a web site called "antiddos.ghostsquadhackers.org", because it somehow considers that a denial-of-service attack is in progress and probably wants to announce it. In my case this delays the XQuery execution by 20 seconds.
Oxygen issues a single HTTP request with the query command:

> "POST /rest HTTP/1.1[\r][\n]"
> "Authorization: Basic XXXXX[\r][\n]"
> "Content-Type: application/xml[\r][\n]"
> "Content-Length: 223[\r][\n]"
> "Host: localhost:8984[\r][\n]"
> "Connection: Keep-Alive[\r][\n]"
> "User-Agent: Oxygen XML Editor/21.0[\r][\n]"
> "Accept-Encoding: gzip,deflate[\r][\n]"
> "<query xmlns='http://basex.org/rest'>[\n]"
> "<text>for $country in doc(&apos;file:/C:/Users/radu_coravu/Desktop/basex/basex/repo/http-www.functx.com-1.0/abc.xml&apos;)//*[\n]"
> "return &lt;aaa&gt;{ $country }&lt;/aaa&gt;[\n]"
> "</text>[\n]"

so I do not know why the Jetty server considers this a denial-of-service attack. The XML parsing process does parse lots of XML catalogs and Relax NG files, but they are all stored locally and I do not think they are requested from the local Jetty server.

Member

ChristianGruen commented Oct 24, 2018

Maybe it helps to run BaseX.jar without Jetty, and to reproduce the behavior with basex or the BaseX GUI alone?

raducoravu commented Oct 24, 2018

Good idea. Using the BaseX GUI, I still get about 20 seconds for executing the same simple XQuery, which just serializes the root element. If I disable the network, the same query returns results in 200 milliseconds. So from what I saw with a TCP/IP logging tool, even without Jetty, "BaseXGUI" is also connecting (or trying to connect) to the same "antiddos.ghostsquadhackers.org" HTTP server.

Member

micheee commented Oct 29, 2018

Hi @raducoravu

Thanks for your observations. Are you sure BaseX is initiating that connection to ghostsquadhackers?

I am neither able to find that URL in our sources, nor in a TCP-Dump:

michael@mbp:~|⇒  sudo tcpdump -i en0 -s 0 -B 524288 -w ~/Desktop/DumpFile01.pcap
# … start BaseX 
# …quit basex…
# stop tcpdump
# search for basex or ghostsquadhackers
michael@mbp:~|⇒  tcpdump -s 0 -n -e -x -vvv -r ~/Desktop/DumpFile01.pcap | grep -E -C2 'ghostsquadhackers|basex'
reading from file /Users/michael/Desktop/DumpFile01.pcap, link-type EN10MB (Ethernet)
	GET /version.txt HTTP/1.1
	User-Agent: Java/1.8.0_131
	Host: files.basex.org
	Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
	Connection: keep-alive

I only managed to find our call to /version.txt in the network traffic.

Maybe this is something on your side?

raducoravu commented Oct 29, 2018

@micheee I'm on Windows; I used a utility called Process Explorer, listed on the Microsoft web site:

https://docs.microsoft.com/en-us/sysinternals/downloads/process-explorer

It lets you double-click a process and look at its TCP/IP connections. I cannot guarantee that the tool is correct about the name of the remote HTTP host to which the BaseX server tries to connect, but at least on my side the BaseX server seemed to try to connect to some remote web site and timed out because it did not know my HTTP proxy settings. This added a 20-second delay to each XQuery I executed. Once I disabled the network connection, the delay disappeared. I could try to find some time to use another utility such as Wireshark and then get back to you.

Member

ChristianGruen commented Oct 29, 2018

Hi @raducoravu, could you provide us with the sample query you mentioned, and XML files that are referenced by this query?

raducoravu commented Oct 29, 2018

I found a better approach, I enabled HTTP logging by adding to the basex command line:

         -Djava.util.logging.config.file=/full/path/to/logging.properties

and then the logging.properties looking something like:

       handlers= java.util.logging.ConsoleHandler
       java.util.logging.ConsoleHandler.level = FINEST
       sun.net.www.protocol.http.HttpURLConnection.level=ALL

Started the "basexhttp.bat", executed the XQuery on it and the server's console says something like:

 Oct 29, 2018 1:10:50 PM sun.net.www.protocol.http.HttpURLConnection plainConnect0
 FINEST: ProxySelector Request for http://www.oasis-open.org/committees/entity/release/1.1/catalog.dtd

so it looks like a connection to a DTD on the OASIS web site; probably the XML catalog resolver attempts it. I do not know why Process Explorer gets confused and reports that other host name, possibly because the OASIS host redirects through it.
The XML catalog parser may try to connect to a remote "catalog.dtd" if the XML catalogs themselves have DOCTYPE declarations pointing to that location. So this looks like a false alarm: it is probably caused by one of my custom XML catalogs, referenced directly or indirectly, having a DOCTYPE reference to the XML Catalog DTD.
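
When the command line cannot be changed, roughly the same HTTP tracing can be switched on from code through java.util.logging. This is a sketch under assumptions: `HttpTrace` is a hypothetical helper, and it relies on the JDK's HTTP client logging under the logger name used in the properties file above, which is JDK-internal and may differ between Java versions.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper: enables verbose logging for the JDK's HttpURLConnection. */
final class HttpTrace {

    // Keep a strong reference: java.util.logging holds loggers only weakly.
    private static final Logger HTTP =
            Logger.getLogger("sun.net.www.protocol.http.HttpURLConnection");

    /** Attaches a console handler at FINEST and opens the logger to all levels. */
    static Logger enable() {
        if (HTTP.getHandlers().length == 0) {
            ConsoleHandler handler = new ConsoleHandler();
            handler.setLevel(Level.FINEST);
            HTTP.addHandler(handler);
        }
        HTTP.setLevel(Level.ALL);
        // Avoid printing each record a second time through the root logger.
        HTTP.setUseParentHandlers(false);
        return HTTP;
    }

    public static void main(String[] args) {
        Logger logger = enable();
        System.out.println("HTTP trace level: " + logger.getLevel());
    }
}
```

The call to `enable()` has to happen before the requests of interest are made, in the same JVM that performs them.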

Member

micheee commented Oct 29, 2018

You are right, I can kind of reproduce this behavior:

It seems like this might be due to some DNS magic or maybe misconfiguration?

The DNS A-Record for oasis-open.org happens to be "172.99.100.168":

⇒  nslookup oasis-open.org
Non-authoritative answer:
Name:	oasis-open.org
Address: 172.99.100.168

which in turn gets reported to have the hostname antiddos…:

⇒  nslookup 172.99.100.168
Non-authoritative answer:
168.100.99.172.in-addr.arpa	name = antiddos.ghostsquadhackers.org.

Weird, but I'm glad there are no "hidden" calls to obscure hosts buried in our codebase!
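
The same reverse lookup that nslookup performs can be reproduced from Java, which is roughly what a connection-monitoring tool has to do when it only knows an IP address. A minimal sketch, with `ReverseLookup` as a hypothetical helper; the result depends entirely on the DNS data visible from the machine that runs it.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical helper: resolves an IP address back to a host name. */
final class ReverseLookup {

    /** Returns the host name reported for the address, or the address text if none is found. */
    static String hostNameFor(String ipLiteral) throws UnknownHostException {
        // For a literal IP address, getByName only checks the format and does not query DNS.
        InetAddress address = InetAddress.getByName(ipLiteral);
        // getCanonicalHostName performs the reverse lookup and falls back to the
        // textual IP address when the lookup fails or is not permitted.
        return address.getCanonicalHostName();
    }

    public static void main(String[] args) throws UnknownHostException {
        String ip = args.length > 0 ? args[0] : "127.0.0.1";
        System.out.println(ip + " -> " + hostNameFor(ip));
    }
}
```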

raducoravu commented Oct 29, 2018

Right, network TCP/IP sniffers only have the IP address to work with so they need to reverse DNS it.
Just found the catalog.xml which still has the DOCTYPE declaration set on it pointing to the remote DTD. And of course there is no XML catalog mapping it to a local resource. So that was the problem.
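
The mechanism can be shown with the JDK's own parser: a DOCTYPE with a remote system identifier makes the parser ask for that DTD, and an entity resolver can answer locally so that no network request is made. The sketch below is an illustration only. `OfflineCatalogParse` is a hypothetical class, the sample catalog and its public identifier are assumptions, and this is not how BaseX or its catalog resolver is configured.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical example: parse a catalog with a remote DOCTYPE without touching the network. */
final class OfflineCatalogParse {

    /** System identifiers the parser asked the resolver for. */
    static final List<String> requested = new ArrayList<>();

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver((publicId, systemId) -> {
            requested.add(systemId);
            // Answer with an empty DTD instead of letting the parser fetch the URL.
            return new InputSource(new StringReader(""));
        });
        return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        // Sample catalog (assumption): DOCTYPE points at the remote OASIS DTD from the log.
        String catalog =
              "<!DOCTYPE catalog PUBLIC \"-//OASIS//DTD XML Catalogs V1.1//EN\"\n"
            + "  \"http://www.oasis-open.org/committees/entity/release/1.1/catalog.dtd\">\n"
            + "<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\"/>";
        Document doc = parse(catalog);
        System.out.println(doc.getDocumentElement().getLocalName()
            + " parsed; resolver was asked for " + requested);
    }
}
```

Removing the DOCTYPE declaration from the catalog, or mapping the DTD to a local copy, avoids the request in the same way.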

Member

ChristianGruen commented Oct 30, 2018

Sounds good! @raducoravu, @nosaj3: Does this mean that the issue has been resolved?

raducoravu commented Oct 30, 2018

@ChristianGruen Yes. Basically, if you add "dita-ng.jar" and "xercesImpl.jar" to the BaseX "lib" or "lib/custom" folder, BaseX will start "seeing" default attribute values when the queried XML document references an RNG-based schema via xml-model. This is mostly useful for people storing DITA content in BaseX.

Member

ChristianGruen commented Oct 30, 2018

Thanks, Radu! I’ll see if I can document this somewhere in our Wiki. @nosaj3: Does the solution work for you?

Member

ChristianGruen commented Oct 30, 2018

A minor addendum: Libraries in the lib directory will now be added to the classpath after the libraries in the lib/custom directory (4b1600a).

nosaj3 commented Oct 30, 2018

Fantastic! Thanks a ton all for working through this!

@ChristianGruen I'll follow the recommendation and see how it goes but I think you can close this.

raducoravu commented Oct 31, 2018

@ChristianGruen Thanks for the help. I originally considered that the "dita-ng.jar" needs to be added in the classpath before the Xerces library but it does not, it just needs to be in the same classpath as the Xerces library. So the order in which the "lib" and "lib/custom" folders are added to the classpath is not important in this context. But in general it's probably a good idea to add the "custom" folder first, maybe people want to replace or patch a certain library.
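
Why search order matters for replacing or patching a library can be seen with a plain URLClassLoader, which consults its locations in the order given. This is a minimal sketch with made-up directory and file names; `ClasspathOrderDemo` is hypothetical and does not reproduce the BaseX start scripts. It uses `InputStream.readAllBytes`, so it needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical example: the first classpath entry that contains a resource wins. */
final class ClasspathOrderDemo {

    /** Reads a resource through a loader that searches the given directories in order. */
    static String firstMatch(String resource, Path... searchOrder) throws IOException {
        URL[] urls = new URL[searchOrder.length];
        for (int i = 0; i < urls.length; i++) {
            urls[i] = searchOrder[i].toUri().toURL();
        }
        // Parent is null so that only the given directories (and the JDK) are searched.
        try (URLClassLoader loader = new URLClassLoader(urls, null);
             InputStream in = loader.getResourceAsStream(resource)) {
            return in == null ? null : new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-ins for lib/custom and lib (temporary directories, not the real folders).
        Path custom = Files.createTempDirectory("custom");
        Path lib = Files.createTempDirectory("lib");
        Files.write(custom.resolve("version.txt"), "patched".getBytes(StandardCharsets.UTF_8));
        Files.write(lib.resolve("version.txt"), "bundled".getBytes(StandardCharsets.UTF_8));
        System.out.println(firstMatch("version.txt", custom, lib));
    }
}
```

With the custom directory listed first, its copy of the resource is returned; swapping the arguments returns the bundled one. The same rule applies to classes in jars.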
