Severe performance degradation when loading GATE xml documents #122

Closed
johann-petrak opened this issue May 3, 2020 · 12 comments

@johann-petrak
Contributor

johann-petrak commented May 3, 2020

Loading (in the GUI) a corpus of 2015 documents (about 1.4G) with many annotations and features after starting a fresh GATE with -Xmx20G -Xms20G -tmp:

  • GATE 8.6.1: 132s, 133s
  • GATE 8.4: 121s
  • GATE 9.0-SNAPSHOT: 735s, 723s

Testing if this may be related to the changed xstream version used:

  • GATE 9.0-SNAPSHOT: using xstream 1.4.12 instead of 1.4.11.1: 714s
  • GATE 9.0-SNAPSHOT: using xstream 1.4.7: 130s

So it appears this is somehow related to the xstream version we use.

@ianroberts
Member

We definitely can't roll back to 1.4.7 as that dates back to 2014 and there are several serious CVEs that have been fixed since then relating to exposure of filesystem data and arbitrary code execution.

@greenwoodma
Contributor

My guess is that this is probably down to the extra security checks xstream does to prevent code execution etc. (if you look at the change log it's mostly just new check after new check) and may just be something we have to swallow.

@johann-petrak
Contributor Author

Yes, I was thinking the same. Once I have more info I will check whether somebody has already opened an issue about this in the xstream repo and, if not, open one. A more than 5-fold slowdown is really quite annoying and maybe not really necessary, even with all those checks :)

@greenwoodma
Contributor

It may be worth looking to see if any of the xstream dependencies themselves have changed version as it might not be a change in xstream itself. I have a feeling I had to mess with another related dependency recently.

@johann-petrak
Contributor Author

We made a big jump from 1.4.7 to 1.4.11.1 so I also checked with other xstream versions:

  • 9.0-SNAPSHOT, using xstream 1.4.7: 130s
  • 9.0-SNAPSHOT, using xstream 1.4.8: 144s
  • 9.0-SNAPSHOT, using xstream 1.4.9: 165s, 164s
  • 9.0-SNAPSHOT, using xstream 1.4.10: 721s
  • 9.0-SNAPSHOT, using xstream 1.4.11.1: 735s, 723s
  • 9.0-SNAPSHOT, using xstream 1.4.12: 714s

So there was some gradual slowdown to version 1.4.9, but the severe slowdown happened from version 1.4.9 to 1.4.10.
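
To make the size of each step explicit, here is a minimal sketch that turns the timings listed above into slowdown factors relative to xstream 1.4.7. The class and method names are made up for illustration, and repeated runs are simply averaged, which is an assumption and not necessarily how the numbers were compared originally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of GATE: turns load times into slowdown factors.
final class SlowdownTable {

    // Mean of one or more repeated measurements, in seconds.
    static double mean(double... seconds) {
        if (seconds.length == 0) {
            throw new IllegalArgumentException("need at least one measurement");
        }
        double sum = 0.0;
        for (double s : seconds) {
            sum += s;
        }
        return sum / seconds.length;
    }

    // How many times slower "measured" is than "baseline".
    static double slowdown(double baseline, double measured) {
        return measured / baseline;
    }

    public static void main(String[] args) {
        // Timings copied from the list above (GATE 9.0-SNAPSHOT, seconds).
        Map<String, Double> meanSeconds = new LinkedHashMap<>();
        meanSeconds.put("1.4.7", mean(130));
        meanSeconds.put("1.4.8", mean(144));
        meanSeconds.put("1.4.9", mean(165, 164));
        meanSeconds.put("1.4.10", mean(721));
        meanSeconds.put("1.4.11.1", mean(735, 723));
        meanSeconds.put("1.4.12", mean(714));

        double baseline = meanSeconds.get("1.4.7");
        for (Map.Entry<String, Double> e : meanSeconds.entrySet()) {
            System.out.printf("xstream %-9s %6.1f s  %.2fx%n",
                    e.getKey(), e.getValue(), slowdown(baseline, e.getValue()));
        }
    }
}
```

By this calculation 1.4.10 is roughly 5.5 times slower than 1.4.7 and roughly 4.4 times slower than 1.4.9, while 1.4.8 and 1.4.9 stay within about 30% of the 1.4.7 baseline.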

@greenwoodma
Contributor

Interestingly 1.4.10 includes a fix for a performance issue: x-stream/xstream#61

Maybe there is a related, but unfixed, issue. Might be a good starting point if you want to go digging about.

@johann-petrak
Contributor Author

I have created: x-stream/xstream#200
The answer points to the FAQ http://x-stream.github.io/faq.html#Scalability_Performance
and when time allows we can have a closer look into how we actually do this in detail, especially the advice of keeping an initialized instance around which could be very important for the population task.
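
The "keep an initialized instance around" advice can be sketched roughly as follows. This is only an illustration of the pattern, not GATE's actual code: `DocumentLoader` and `CostlySerializer` are hypothetical names, and `CostlySerializer` is a stand-in that merely counts how often its setup runs. It does not use or imitate the real XStream API.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical loader illustrating instance reuse; not GATE's implementation.
final class DocumentLoader {

    // Stand-in for a serializer that is expensive to construct (the role
    // XStream plays in this issue). This is NOT the XStream API.
    static final class CostlySerializer {
        // Counts how many times the simulated expensive setup has run.
        static final AtomicInteger CONSTRUCTIONS = new AtomicInteger();

        CostlySerializer() {
            // A real serializer would do its costly configuration here;
            // this sketch only counts the calls.
            CONSTRUCTIONS.incrementAndGet();
        }

        // Placeholder for real deserialization: just trims the input.
        String load(String xml) {
            return xml.trim();
        }
    }

    // Built once when this class is initialized, then reused for every document.
    private static final CostlySerializer SHARED = new CostlySerializer();

    // Pattern to avoid: pays the setup cost again for each document.
    static String loadWithFreshInstance(String xml) {
        return new CostlySerializer().load(xml);
    }

    // Pattern suggested by the FAQ advice: reuse the initialized instance.
    static String loadWithSharedInstance(String xml) {
        return SHARED.load(xml);
    }

    public static void main(String[] args) {
        int documents = 2015; // corpus size mentioned in this issue

        int before = CostlySerializer.CONSTRUCTIONS.get();
        for (int i = 0; i < documents; i++) {
            loadWithSharedInstance(" <doc/> ");
        }
        int sharedSetups = CostlySerializer.CONSTRUCTIONS.get() - before;

        before = CostlySerializer.CONSTRUCTIONS.get();
        for (int i = 0; i < documents; i++) {
            loadWithFreshInstance(" <doc/> ");
        }
        int freshSetups = CostlySerializer.CONSTRUCTIONS.get() - before;

        System.out.println("setups with shared instance: " + sharedSetups);
        System.out.println("setups with fresh instances: " + freshSetups);
    }
}
```

Run as written, the shared path performs no additional setups during the loop, while the fresh path performs one per document (2015 here). If per-instance setup is what dominates the load time, this is where the difference would come from.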

@greenwoodma
Contributor

Cool. A singleton instance might make configuring the security side easier as well; it should certainly make it easier for us to expose an API for gate users to further configure the xstream security.

@johann-petrak
Contributor Author

Yes, had been thinking the same, especially since apparently they want to completely prevent the use without the security handling in place from version 1.5.
I think that for all our loading and saving purposes we do not even need to worry about multithreading, because (de)serialization should always happen outside of any duplication (off the top of my head, I do not know about the GCP handlers, though).

@greenwoodma
Contributor

@johann-petrak I've switched to using static instances of XStream which should fix this I hope. Do you still have the dataset you used last time to see if this has helped?

@johann-petrak
Contributor Author

Cool, I will try to dig this up or re-run a new benchmark with both versions to check this as soon as I find the time!

@johann-petrak
Contributor Author

OK I did re-run this, timing the loading of all documents:

  • git version f5fa668, xstream 1.4.11.1: 767 seconds
  • git version e4210b2, xstream 1.4.15: 37.5 seconds

I would call that fixed!
