Introduce SemanticDataCache #820
Parsing pages is expensive, so the time spent on each page during an update (re-parse) adds up quickly. `SemanticDataCache` can significantly reduce that burden and the processing time by deciding whether a page needs a "real" update (the revId identifies the content status) or whether it is enough to re-fetch the data from the cache and provide it as input. The premise is that if the revision hasn't changed, then the `SemanticData` hasn't changed either, so the data from the cache can be used.

- If `rebuildData.php` is run with the `--no-cache` option, no `SemanticDataCache` is used.
- If de-serialization returns an error, a re-parse of the page is automatically triggered during the `UpdateJob` run.
- Currently `smwgCacheType` is used to identify the cache type for building the `CompositeCache` for the `SemanticDataCache` object.

Sample using `rebuildData.php -s 1 -e 40 --runtime`:

- without-cache: Memory used: 24309648 (b: 8997048, a: 33306696) with a runtime of 82.89 sec (1.38 min)
- with-cache: Memory used: 14102776 (b: 8997048, a: 23099824) with a runtime of 36.71 sec
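To make the decision logic described above concrete, here is a minimal, language-neutral sketch in Python. It is not the SMW implementation: `RevisionKeyedCache`, `update_page`, and the JSON serialization are hypothetical stand-ins chosen for illustration. It shows only the three behaviours named in the description: reuse on an unchanged revision, re-parse on a de-serialization error, and no cache at all (modelled here by passing `cache=None`, analogous to `--no-cache`).

```python
import json
from dataclasses import dataclass, field


@dataclass
class RevisionKeyedCache:
    """Hypothetical stand-in for a semantic data cache keyed by page title."""
    store: dict = field(default_factory=dict)  # title -> (rev_id, serialized data)

    def save(self, title, rev_id, data):
        self.store[title] = (rev_id, json.dumps(data))

    def fetch(self, title, rev_id):
        """Return cached data only if the revision matches and it deserializes."""
        entry = self.store.get(title)
        if entry is None or entry[0] != rev_id:
            return None  # unknown page or changed revision: needs a real update
        try:
            return json.loads(entry[1])
        except ValueError:
            return None  # de-serialization error: caller falls back to a re-parse


def update_page(title, rev_id, parse, cache=None):
    """Return (data, source), where source is 'cache' or 'parse'.

    Passing cache=None models a run without any cache.
    """
    if cache is not None:
        cached = cache.fetch(title, rev_id)
        if cached is not None:
            return cached, "cache"
    data = parse(title)  # the expensive step the cache tries to avoid
    if cache is not None:
        cache.save(title, rev_id, data)
    return data, "parse"
```

Calling `update_page` twice with the same title and revision parses once and then serves the second call from the cache. A new revision id, a corrupted cache entry, or `cache=None` all fall through to `parse`.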
I see one problem with this for a certain kind of SMW extension, though I think there's also a simple solution. Semantic Dependency is used for specific pages where (unlike other pages) semantic data does depend on other pages. The way it works is that update jobs are created for the dependent pages when the pages they depend on are updated. I think such functionality could work together with the cache feature, if it were possible to disable reading from the cache for specific update jobs. The jobs created by the extension could then use this, and the cache would be updated with the new data, and could still be used to improve overall performance. |
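As a rough illustration of the suggestion above, the sketch below adds a per-job switch that skips the cache read but still writes the fresh result back. It is written in Python for brevity and all names (`run_update_job`, `skip_cache_read`, the dict-based cache) are hypothetical; nothing here reflects an existing SMW or Semantic Dependency API.

```python
def run_update_job(title, rev_id, parse, cache, skip_cache_read=False):
    """Return (data, source), where source is 'cache' or 'parse'.

    cache is a plain dict mapping title -> (rev_id, data).
    skip_cache_read is the hypothetical per-job flag: when set, the job
    always re-parses, which is what a dependency-triggered job would want.
    """
    if not skip_cache_read:
        entry = cache.get(title)
        if entry is not None and entry[0] == rev_id:
            return entry[1], "cache"
    data = parse(title)
    cache[title] = (rev_id, data)  # refresh the cache even when the read was skipped
    return data, "parse"


# A dependent page whose data changes although its own revision does not.
source_value = {"v": 1}
parse = lambda title: {"value": source_value["v"]}
cache = {}

print(run_update_job("Dependent", 5, parse, cache))  # parsed on first run
source_value["v"] = 2                                # the page it depends on changes
print(run_update_job("Dependent", 5, parse, cache))  # same revId: stale data from cache
print(run_update_job("Dependent", 5, parse, cache, skip_cache_read=True))  # forced re-parse
print(run_update_job("Dependent", 5, parse, cache))  # cache now holds the fresh data
```

The second call shows the failure mode being discussed: the revision id alone cannot signal that a dependency changed. The third call shows how a job created by the extension could force a refresh while ordinary jobs keep the performance benefit.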
Ah. @joelkp beat me to it. I was about to point out the same thing. Some pages might indeed have values that depend on something not part of the page, be it semantic data in another page, or something else entirely. Not recomputing can cause unexpected behaviour. I'm thinking it's somewhat unfortunate that such dependencies are supported on all pages by default. If users had to explicitly allow it per page, then we could do several things more efficiently. |
I currently can't see how an extension is involved here,
Data added through a hook is done after the parsing and is being run in any event independent of the source of the data. The main objective for By default Possible ways to intervene with the
I don't know Semantic Dependency therefore I can't comment specifically here.
The I would certainly appreciate extensive testing to identify possible yet uncovered dependencies. |
You can easily create this problem with SMW itself: assign the result of a query to a property value |
Query usage tracking is a whole other thing which needs addressing in order to be able to disable the |
Yeah sure. The issue is already there to some extent, yet this new cache might well make it worse and thus break things that currently work. Don't misunderstand me: I like the stuff being done here. If data rebuilding indeed gets twice as fast, that's awesome |
Instead of the |
@mkroetzsch can you think of something we're forgetting about that will cause problems if this change is made? |