Introduce SemanticDataCache #820

Closed · wants to merge 1 commit into from
Conversation

@mwjames (Contributor) commented Feb 14, 2015

Parsing pages is an expensive endeavour and therefore largely responsible for the time spent on each page during an update (re-parse).

Using the SemanticDataCache can reduce the burden and processing times significantly by deciding whether a page needs a "real" update (using the revId to identify the content status) or whether it is enough to re-fetch the data from cache and provide it as input.

The premise is that if the revision hasn't changed then the SemanticData hasn't changed either, hence the data from cache can be used.

  • If rebuildData.php is run with the option --no-cache then no SemanticDataCache is used.
  • If the de-serialization returns with an error, a re-parse of the page is automatically triggered during the UpdateJob run.
  • Currently smwgCacheType is used to identify the cache type for building the CompositeCache for the SemanticDataCache object.

Sample using rebuildData.php -s 1 -e 40 --runtime:

  • without-cache: Memory used: 24309648 (b: 8997048, a: 33306696) with a runtime of 82.89 sec (1.38 min)
  • with-cache: Memory used: 14102776 (b: 8997048, a: 23099824) with a runtime of 36.71 sec
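
The decision itself amounts to a revision id comparison before any parsing takes place. The following is a minimal sketch of that logic, assuming a hypothetical class, cache key scheme, and serialization; it does not reproduce the actual implementation of this PR:

```php
<?php
// Illustrative sketch only: the class name, cache key scheme, and serialization
// are assumptions and do not mirror the actual patch. It shows the decision
// described above: reuse the cached SemanticData while the revision id is
// unchanged, otherwise fall back to a "real" re-parse.

class SemanticDataCacheSketch {

	private $cache;

	public function __construct( \BagOStuff $cache ) {
		$this->cache = $cache;
	}

	public function fetchOrParse( \Title $title ) {
		$key = 'smw:semdata:' . $title->getPrefixedDBkey();
		$latestRevId = $title->getLatestRevID();

		$cached = $this->cache->get( $key );

		// Premise: an unchanged revision id implies unchanged SemanticData.
		if ( is_array( $cached ) && $cached['revId'] === $latestRevId ) {
			$semanticData = unserialize( $cached['data'] );

			// A failed de-serialization falls back to a real re-parse.
			if ( $semanticData instanceof \SMW\SemanticData ) {
				return $semanticData;
			}
		}

		// Expensive path: re-parse the page and refresh the cache entry.
		$semanticData = $this->parsePage( $title );

		$this->cache->set( $key, array(
			'revId' => $latestRevId,
			'data'  => serialize( $semanticData )
		) );

		return $semanticData;
	}

	private function parsePage( \Title $title ) {
		// Placeholder for the ContentParser run that extracts the in-text annotations.
	}
}
```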

@mwjames added the "new feature" label on Feb 14, 2015
@mwjames added this to the SMW 2.2 milestone on Feb 14, 2015
@mwjames (Contributor, Author) commented Feb 14, 2015

Before the PR:

==========================================================================================
JobQueue benchmarks
------------------------------------------------------------------------------------------
- Dataset: BaseLoremIpsumDataset.v1.xml
- MediaWiki: 1.25alpha
- Store: SMWSQLStore3
- ShowMemoryUsage: false
- ReuseDatasets: true
- PageCopyThreshold: 1000
- RepetitionExecutionThreshold: 1
------------------------------------------------------------------------------------------
- SMW\RefreshJob: 1.2924681 (n) 1.2924681 (mean) 1.2924681 (total) (sec)
- SMW\UpdateJob: 0.0270573 (n) 27.0573311 (mean) 27.0573311 (total) (sec)
==========================================================================================

SMW\Tests\Benchmark\JobQueueBenchmarkTest doBenchmark ran for 28.413 seconds
.
==========================================================================================
RebuildData benchmarks
------------------------------------------------------------------------------------------
- Dataset: BaseLoremIpsumDataset.v1.xml
- MediaWiki: 1.25alpha
- Store: SMWSQLStore3
- ShowMemoryUsage: false
- ReuseDatasets: true
- FullDelete: true
- PageCopyThreshold: 1000
- RepetitionExecutionThreshold: 1
------------------------------------------------------------------------------------------
- SMW\Maintenance\RebuildData: 0.0427976 (n) 42.797627 (mean) 42.797627 (total) (sec)
==========================================================================================

After the PR:

==========================================================================================
JobQueue benchmarks
------------------------------------------------------------------------------------------
- Dataset: BaseLoremIpsumDataset.v1.xml
- MediaWiki: 1.25alpha
- Store: SMWSQLStore3
- ShowMemoryUsage: false
- ReuseDatasets: true
- PageCopyThreshold: 1000
- RepetitionExecutionThreshold: 1
------------------------------------------------------------------------------------------
- SMW\RefreshJob: 1.007448 (n) 1.007448 (mean) 1.007448 (total) (sec)
- SMW\UpdateJob: 0.0145479 (n) 14.5478842 (mean) 14.5478842 (total) (sec)
==========================================================================================

SMW\Tests\Benchmark\JobQueueBenchmarkTest doBenchmark ran for 15.604 seconds
.
==========================================================================================
RebuildData benchmarks
------------------------------------------------------------------------------------------
- Dataset: BaseLoremIpsumDataset.v1.xml
- MediaWiki: 1.25alpha
- Store: SMWSQLStore3
- ShowMemoryUsage: false
- ReuseDatasets: true
- FullDelete: true
- PageCopyThreshold: 1000
- RepetitionExecutionThreshold: 1
------------------------------------------------------------------------------------------
- SMW\Maintenance\RebuildData: 0.024953 (n) 24.952961 (mean) 24.952961 (total) (sec)
==========================================================================================

@joelkp (Contributor) commented Feb 14, 2015

The premise is that if the revision hasn't changed then the SemanticData hasn't changed either, hence the data from cache can be used.

I see one problem with this for a certain kind of SMW extension, though I think there's also a simple solution.

Semantic Dependency is used for specific pages where (unlike other pages) semantic data does depend on other pages. The way it works is that update jobs are created for the dependent pages when the pages they depend on are updated.

I think such functionality could work together with the cache feature, if it were possible to disable reading from the cache for specific update jobs. The jobs created by the extension could then use this, and the cache would be updated with the new data, and could still be used to improve overall performance.

@JeroenDeDauw (Member) commented:

Ah. @joelkp beat me to it. I was about to point out the same thing. Some pages might indeed have values that depend on something not part of the page, be it semantic data in another page, or something else entirely. Not recomputing can cause unexpected behaviour.

I'm thinking it's somewhat unfortunate that such dependencies are supported on all pages by default. If users had to explicitly allow it per page, then we could do several things more efficiently.

@mwjames (Contributor, Author) commented Feb 15, 2015

I see one problem with this for a certain kind of SMW extension, though I think there's also a simple solution.

I currently can't see how an extension is involved here. SemanticDataCache is for internal use, and when an extension runs an SMW-specific hook it does so for the UpdateJob regardless of the source of the data (whether parsed or retrieved from the SemanticDataCache).

SemanticDataCache is similar to the ParserCache in that it avoids having the ContentParser do unnecessary parsing. Neither the ContentParser nor the InternalParseBeforeLinks hook provides an auxiliary hook for extensions to modify data during the in-text annotation parsing, which also means that no extension can modify content during the parse process.

Data added through a hook is applied after the parsing and is run in any event, independent of the source of the data.

The main objective for SemanticDataCache is to mitigate #347 in LinksUpdateConstructed and improve memory/runtime for the UpdateJob. ( SemanticDataCache !== QueryCache !== ParserCache )

By default $GLOBALS['smwgCacheUsage']['smwgSemanticDataCache'] is set to true.

Possible ways to intervene with the SemanticDataCache usage (see the sketch after this list):

  • Set ApplicationFactory::getInstance()->getSettings()->set( 'smwgSemanticDataCache', false ); (as done for rebuildData.php with option --no-cache)
  • Set SemanticData::setUpdateIdentifier to an extension-dependent identifier (by default it contains the getLatestRevID)
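
A minimal sketch of these two intervention points, wrapped in hypothetical helper functions for illustration (the function names and the identifier prefix are not part of the patch):

```php
<?php
// Minimal sketch of the two intervention points listed above; the helper
// functions and the 'my-extension:' prefix are illustrative only and are not
// part of the patch.

use SMW\ApplicationFactory;
use SMW\SemanticData;

// 1. Switch the SemanticDataCache off globally, as rebuildData.php --no-cache does.
function disableSemanticDataCache() {
	ApplicationFactory::getInstance()->getSettings()->set( 'smwgSemanticDataCache', false );
}

// 2. Force a cache miss for a single subject by replacing the default
//    revision-based identifier with one the extension controls, e.g. a hash
//    over the pages the subject depends on.
function setDependencyAwareUpdateIdentifier( SemanticData $semanticData, $dependencyHash ) {
	$semanticData->setUpdateIdentifier( 'my-extension:' . $dependencyHash );
}
```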

Semantic Dependency is used for specific pages where (unlike other pages) semantic data does depend on other pages.

I don't know Semantic Dependency, therefore I can't comment specifically here.

if it were possible to disable reading from the cache for specific update jobs. The jobs created by the extension could then use this,

The UpdateJob requires 'smwgSemanticDataCache' to be true in order to evaluate whether the cache can be used or not. I don't know if we need an extra UpdateJob parameter --no-cache, because currently the UpdateJob does not allow for extra parameters to be invoked (legacy interface), and adding it on-the-fly could make it difficult for existing jobs (those already in the JobQueue) to recognize the added interface parameter (if it comes to adding parameters then it needs testing).

I would certainly appreciate extensive testing to identify possible yet uncovered dependencies.

@JeroenDeDauw (Member) commented:

I don't know Semantic Dependency, therefore I can't comment specifically here.

You can easily create this problem with SMW itself: assign the result of a query to a property value

@mwjames (Contributor, Author) commented Feb 15, 2015

You can easily create this problem with SMW itself: assign the result of a query to a property value

Query usage tracking is a whole other topic that needs addressing in order to be able to disable the ParserCache of a subject so that query results are updated, while also allowing the SemanticDataCache to be disabled.

@JeroenDeDauw (Member) commented:

Yeah sure. The issue is already there to some extent, yet this new cache might well make it worse and thus break things that currently work.

Don't misunderstand me: I like the stuff being done here. If data rebuilding indeed gets twice as fast, that's awesome.

@mwjames (Contributor, Author) commented Feb 19, 2015

Instead of the LinksUpdateConstructed hook as trigger, SMWStore::updateDataAfter is going to be used to ensure that all data available after processing represent the cache item for the selected revision.
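
A rough sketch of what populating the cache from that hook could look like; the handler body and the cache key scheme are assumptions, only the hook name and its ( Store, SemanticData ) signature are taken from the SMW hook documentation:

```php
<?php
// Rough sketch of writing the cache item once the store has finished updating;
// the handler body and the cache key scheme are assumptions. The hook signature
// ( Store, SemanticData ) follows the documented SMWStore::updateDataAfter hook.

use SMW\SemanticData;
use SMW\Store;

$GLOBALS['wgHooks']['SMWStore::updateDataAfter'][] = function ( Store $store, SemanticData $semanticData ) {

	$title = $semanticData->getSubject()->getTitle();

	if ( $title === null ) {
		return true;
	}

	// At this point all processed data (including values added by other hooks)
	// are available, so the cache item fully represents the selected revision.
	$cache = \ObjectCache::getInstance( $GLOBALS['smwgCacheType'] );

	$cache->set(
		'smw:semdata:' . $title->getPrefixedDBkey(),
		array(
			'revId' => $title->getLatestRevID(),
			'data'  => serialize( $semanticData )
		)
	);

	return true;
};
```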

@JeroenDeDauw (Member) commented:

@mkroetzsch can you think of something we're forgetting about that will cause problems if this change is made?

@mwjames modified the milestones: SMW 2.3, SMW 2.2 on Mar 23, 2015
@mwjames modified the milestones: SMW 2.4, SMW 2.3 on Jul 7, 2015
@mwjames (Contributor, Author) commented Sep 2, 2015

Superseded by #1127 and #1035.

@mwjames closed this on Sep 2, 2015
@mwjames deleted the semanticdata-cache branch on September 2, 2015 at 21:50
@kghbln (Member) commented Nov 18, 2017

Documented

Labels: new feature, performance