Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Page cache refactor full implementation #302

Open
n00dles opened this issue Nov 2, 2012 · 55 comments
Open

Page cache refactor full implementation #302

n00dles opened this issue Nov 2, 2012 · 55 comments

Comments

@n00dles
Copy link
Contributor

n00dles commented Nov 2, 2012

Original author: carnav (February 24, 2012 18:36:58)

I just noticed that in template_functions.php, function generate_sitemap() is populating $pagesArray 'the old way' (lines 1039 and 1044-1059) instead of simply calling getPagesXmlValues()

Original issue: http://code.google.com/p/get-simple-cms/issues/detail?id=302

@n00dles
Copy link
Contributor Author

n00dles commented Nov 2, 2012

From carnav on February 25, 2012 10:02:50
Well it's not only that one - other functions in template_functions.php are reading all data/pages/*.xml instead of getting the data from pages.xml

(BTW I said "calling getPagesXmlValues()" without thinking much, :) I'm not sure it's the correct way, I haven't checked...)

@n00dles
Copy link
Contributor Author

n00dles commented Nov 2, 2012

From m...@digimute.com on February 25, 2012 14:53:32
originally we were just going to convert a couple of the functions in the core to use the caching to see how things went. Seems we forgot about converting the rest.

@n00dles
Copy link
Contributor Author

n00dles commented Nov 2, 2012

From tablatronics on June 26, 2012 15:39:26
Can we identify all these not yet converted code here ?

@ghost ghost assigned n00dles Nov 6, 2012
@n00dles
Copy link
Contributor Author

n00dles commented Nov 6, 2012

I'll take this one.

@n00dles
Copy link
Contributor Author

n00dles commented Dec 12, 2012

The following function need to be updated to use the caching rather than reading all the files in.

theme_ functions
menu_data()

template_ functions
get_available_pages()
list_pages_json()
generate_sitemap()

@cnb
Copy link
Contributor

cnb commented Jan 5, 2013

With list_pages_json fixed, the page editor will load faster.
Now it is loading all xml files in data/pages when you edit or create a page.

[edit: I said get_available_pages() but it's list_pages_json for the page editor]

@tablatronix
Copy link
Member

Yeah these are alot of things this will improve.
I was also thinking about how we could cache theme folder structure as well since we load it now and then although not nearly as much as other things. We might see if we can cache anything we use readdir etc on.

@n00dles
Copy link
Contributor Author

n00dles commented Jan 5, 2013

just updated all the remaining function to use caching functions.

can ye guys double check everything is working fine. all looks good to me.

its a straight swap of the file reading for the global $pagesArray

@tablatronix
Copy link
Member

Looks ok to me, no errors.

@cnb
Copy link
Contributor

cnb commented Jan 14, 2013

After this change, I found an issue I haven't thought about, with plugins that call get_available_pages() or list_pages_json() (e.g. GS Blog, News Manager):
$pagesArray has not been populated so these functions return an empty array.

A possible fix would be calling getPagesXmlValues() in load.php, but that would do it for every plugins including those that don't need this.
I think this would be better:

Just after:

    global $pagesArray;

insert this:

    if (!$pagesArray) getPagesXmlValues();

both in get_available_pages() and list_pages_json()
(admin/inc/template_functions.php)

@tablatronix
Copy link
Member

Wait why is it not populated ? It's a cache it should be prepolated before anything might use. And after anything changes it.

So its never initialized ? It should be globalized in common.php and initialized as soon as possible afterwards.

@cnb
Copy link
Contributor

cnb commented Jan 14, 2013

Caching started as a plugin, and still works more or less like that. I suppose this also needs some refactoring.

Anyway, why should $pagesArray always be available/populated? If a backend page (GS or plugin) is not going to need it, isn't it better to leave out this extra file read, etc.?

@n00dles
Copy link
Contributor Author

n00dles commented Jan 14, 2013

Looks like a typo in the caching functions init code. Firing on the header was not actually doing anything.

It now loads $pagesArray , if file doesn't exist it will create it.

Can you check again that everything is working fine.

@tablatronix
Copy link
Member

I see that now, so we have core plugins ?
weird, i guess at some point we need to incorporate it fully.

Also we need to look at its performance, say with a site with 500-2000 pages.

I do not see a problem with it being loaded all the time, we could probably use it in core for some stuff. Are we not ? pages display ? how we doing that ?
Menu manager ?

@n00dles
Copy link
Contributor Author

n00dles commented Jan 15, 2013

caching started off as a plugin but is now fully integrated. It just uses hooks rather than hard coding the calls.

Initial testing of the plugin was done on a site with 2000 pages each one set as a menu item to ensure each file was read in. As far as I remember there was about a 70-80% speed increase when using caching than reading in the 2000 files each time, and sometimes depending on the template/functions those 2000 files were read in multiple times per page refresh.

I have an old plugin i'll dig out to create large test sites for this type of testing.

Caching should now be used on internal core routines which previously read in the files individually.

$pagesArray should now be available and populated for front/backend use at all time.
caching file is only updated on page update/delete and includes all fields except for 'content'

@tablatronix
Copy link
Member

Ok so...

// load pages cache for frontend
add_action('index-pretemplate','getPagesXmlValues',array('false'));             // make $pagesArray available to the theme 

// load pages cache for backend
add_action('header', 'getPagesXmlValues',array('false'));  

Perhaps admin-pre-header is better ? or sooner.
Or can we just get rid of both and call its hook function directly after
include_once('caching_functions.php');

future speak on
Maybe we should consider not automatically loading this until something needs it ?
Instead of a global we would need a getter function that would init it when its empty for those functions that read only.
Functions that write it would still use globals.
I might be adding some profiling functions into core, so we can decide its impact latter on.

Once we have some stats I think this will be the most efficient way to optimize the core.
We might find a need in the future to chunk the cache up, or replace the slow xml functions with a straight serialization in and out of the file store.

I honestly do not know how it works. Did you write it ?
I guess ill have to sit down and read the functions, figure out which ones read the file in write the file out and get stuff from the cache array.

@tablatronix
Copy link
Member

@n00dles
Copy link
Contributor Author

n00dles commented Jan 15, 2013

yeah I wrote it, cant you tell from the coding style... 8)

I think loading it straight after the include would work best.

Its just legacy that we did it that way as it was any easy way to add plugins to the core without having to rewrite anything. (Just delete the plugin init code and it should work)

@tablatronix
Copy link
Member

So this should be a really good speed improvement now, I am not sure how often those functions were called, but it seems like it could have been significant. excellent.

@cnb
Copy link
Contributor

cnb commented Jan 15, 2013

Why load pages.xml in admin pages like settings, plugins, files, etc.?

IMHO it would be better loading + populating if necessary.
I think you'd just have to insert:

if (!$pagesArray) getPagesXmlValues();

after every global $pagesArray; in caching_functions.php's functions.

I'd even suggest to not to load in the frontend by default (changing get_navigation to load it)

@cnb
Copy link
Contributor

cnb commented Jan 15, 2013

As for loading caching functions in a hook: a plugin that runs at 'header' and wants to use getPageField, etc. may stop with an error, depending on the order the server uses. (I think this has happened to me with some experiments, but I forgot to report).
'admin-pre-header' would be better than 'header', but even better the require_once.

@tablatronix
Copy link
Member

Yeah well when we do it we should do it right.
That's why I said I we are editing every function then use a getter not just adding g more redundant code all over the place.

@tablatronix
Copy link
Member

cnb , remember we have alot of refactoring going on, so there is no reason to do major changes like that when we will probably be refactoring cache anyway.

Can we use pagecache to eliminate some directory reads ?

For example updateSlugs does a readdir and then loads the filenames into an array ( which it then filters again for some reason, after it was already filtered for .xml files., then it loops the bitches and looks for parents that changed and then updates it. Seems like this could all be done with a cache loop.

updateslugs is called

  • every page delete
  • backup restore
  • page slug change

I will add more as I find them

@tablatronix
Copy link
Member

one problem with this I am seeing is the same issue with plugins we had.

You are loading the cache file after saving the cache file.
Seems unnecessary to reload the xml you just saved, when we can create a new cache array at the same time and avoid the extra file load.

Also some more enhancements we can make possibly is that xml object are navigable using xpath, so when changing something we can modify xml per node and resave instead of recreating the entire node tree from an array.
Not sure if we would benefit at all from that or not, I imagine on large sites we could.

But i guess these would require modification of the createpages function to create a xml save and the pages array at the same time. So they are not instant things.

@n00dles
Copy link
Contributor Author

n00dles commented Jan 15, 2013

Looks like I'll need to restore that hook call at the start of the cahing functions for the moment until I can sort out the flow of things as its not working as it should.

Test: first create a new plugin called caching_info.php ( i use this for looking at the cache)

<?php 
/****************************************************
*
* @File:  caching_info.php
* @Package: GetSimple
* @Action:  Plugin to view caching information  
*
*****************************************************/

# get correct id for plugin
$thisfile=basename(__FILE__, ".php");

# register plugin
register_plugin(
  $thisfile,
  'Caching Info',
  '2.3',
  'Mike Swan',
  'http://www.digimute.com/',
  'Pages Caching Info',
  'plugins',
  'showDebugInfo'
  );


add_action('plugins-sidebar','createSideMenu',array($thisfile,'Caching Info'));     // add menu entry


function showDebugInfo(){
getPagesXmlValues();
  global $pagesArray;
  echo '
  <style type="text/css">
  #load pre code {
      display:block;font-size:11px;width:560px;line-height:13px;
      white-space: pre-wrap; /* css-3 */
      white-space: -moz-pre-wrap !important; /* Mozilla, since 1999 */
      white-space: -pre-wrap; /* Opera 4-6 */
      white-space: -o-pre-wrap; /* Opera 7 */
      word-wrap: break-word; /* Internet Explorer 5.5+ */}
  </style>
  ';
  echo "<h3>Caching Info</h3>";
  echo "<pre><code>";
    print_r($pagesArray);
  echo "</code></pre>"; 
}

?>
'''

then create 2 pages, test & test 2
make test2 a child of test. 
check the plugin info, parent is not updated. 

restore caching_function.php hook to

add_action('header', 'create_pagesxml',array('false'));  

and all works as it should . 

@tablatronix
Copy link
Member

Sweet I was thinking we need a unit tester for this

@tablatronix
Copy link
Member

probably because there is no hook to catch changedata.php

with
add_action('header', 'create_pagesxml',array('false'));

Your just recreating the pagesxml every time a backend page is loaded. So it probably catches everything, and is probably definitely wrong.

A 'changedata-aftersave' should fix that for now.
We need a function to update a single page in the cache and resave the cache, that would be neat here and to replace cloning etc. So when a single page changes we do not need to reload the cache from files again and all that jazz. we can just pass the xml before saving to the cachingPagefunction

@cnb
Copy link
Contributor

cnb commented Jan 16, 2013

Just a comment: I'm not sure if it's still this way in current dev version.
Some time ago, while doing experiments with patches to the core, I found that $pagesArray was in some cases an associative array and others a numeric one. (I hadn't reported because first I wanted to do more tests and make sure of this... )

@tablatronix
Copy link
Member

What is the purpose of
if ((isset($_GET['upd']) && $_GET['upd']=="edit-success") || $flag=='true'){

@tablatronix
Copy link
Member

I edited it above, that fixes it.
But i gotta get pagesArray outta there and make sure nothing else uses it as a reference.

@tablatronix
Copy link
Member

This same code exists in edit.php same loop. REFACTOR 🔴
Also there are a few reassignents to $pagesArray that need a removing.

@cnb
Copy link
Contributor

cnb commented Jan 17, 2013

Ok, great.

Another comment, have you seen the 3rd line in that piece of code you've pasted?

$parentdata = getXML(GSDATAPAGESPATH . $page['parent'] .'.xml');

'tis reading the parent data from disk... Should it get it from the global $pagesArray?

@tablatronix
Copy link
Member

Yup, I forgot to actually search for getxml() as a file read function for refactoring.
Good catch.

And again in edit.php.

@cnb
Copy link
Contributor

cnb commented Jan 17, 2013

As a side note, the $pagesArray variable name was used in GS before caching was implemented.
Mike's Pages plugin had it as $digi_pagesArray but was changed to $pagesArray when integrated to GS.

@tablatronix
Copy link
Member

ahh, ok.
Yeah a namespace would be nice. But we aren't close to using namespaces or classes. Maybe GS4 with min php 5.3.
lol

@cnb
Copy link
Contributor

cnb commented Jan 17, 2013

BTW the other day I did some tests by patching getXML :

function getXML($file) {
    $xml = file_get_contents($file);
    $data = simplexml_load_string($xml, 'SimpleXMLExtended', LIBXML_NOCDATA);
    echo '[read:',$file,']<br />'; // <-- ADDED FOR TESTING
    return $data;
}

I used this isntead of debugLog to also test in the frontend.
I found that in GS 3.1.2 plugins.xml is read 3 times.
However latest 3.2 beta only reads it once.

@tablatronix
Copy link
Member

Yeah my goal is to have all file reads and writes use core functions or classes, so we can profile this stuff easier.
Yeah we fixed that some time ago, it shaved like 100ms off everything. That is why i want this whole write a file read a file back in to go bye bye, and it appears this is the same design pattern. There is nothing wrong with it, its almost negligible, but if i can make sure we read caches in once and live modify the cache when we make changes ( which we can do now per page with this refactor ). Then the cache only has to be written not read everytime you save a page, this is essential for ajax saving also. We cannot be reading in and then writing out the entire file mutiple times per load.

Also we should be able to modify the cache ondemand instead of rescaning all pages every time to check for changes. We will still need a basic check in case people meddle with the files, same as the issues with plugins, i found those issues btw. lol

@tablatronix
Copy link
Member

refs #345

@tablatronix
Copy link
Member

188f839
09b86ba

@tablatronix
Copy link
Member

Some commits cherry picked for #603

@tablatronix
Copy link
Member

Cherry-picked into v3.3.0

@tablatronix
Copy link
Member

Make page cache protected somehow, obviously we are already using a global array so its a bit late to move to a class.

It is very important that cache can be available as an associative array so that it can be cross indexed by key to sorting arrays and menu arrays.

I would like to see cache always accessed via a wrapper function, this wrapper will perform the following.

  • detect and avoid contamination of global, using either a backup array or some kind of checksum
  • recreate page array if corrupted by global misuse, key unsafe operation etc.
  • detect corruption of keys, and recreate using values id, or reload entirely

@cnb
Copy link
Contributor

cnb commented Jan 19, 2014

(previously posted here http://get-simple.info/forums/showthread.php?tid=5520&pid=42315#pid42315)

I'd rather like that these functions (getPageContent, getPageField, etc., available since GS 3.1) were not camelCase, but lowercase with underscores, to be consistent with the other template tags.

If it wasn't because get_page_content, etc. already have an optional parameter ($echo), I would suggest that they had it but for the slug of other page.

An alternative could be an equivalent, like get_other_page_content, get_a_page_content or something (that actually would be an alias for getPageContent)

@cnb
Copy link
Contributor

cnb commented Jan 29, 2014

(I said)

If it wasn't because get_page_content, etc. already have an optional parameter ($echo), I would suggest that they had it but for the slug of other page.

Another option would be using the first parameter for $echo OR $slug
If it's a string use it as slug, else as boolean for echo.
But then, possible issues with pages with slug "0" or "1". This could be addressed with strict checks (... !== false), but there may be some scripts out there using e.g. get_parent(0) (some of them by me, oops)

@tablatronix
Copy link
Member

Yeah i was thinking that also, booleans can be evaluated with === cant they ?
Oh nm, i was skimming.

tablatronix added a commit that referenced this issue Feb 8, 2014
@tablatronix
Copy link
Member

more re-factoring done, all legacy functions are aliases for the most part, aside form the getters
Yet to do remove hook based calls, redo loading and saving logic.

getPagesXml values now returns pagecache, so you can avoid using global in functions unless you need to modify it. Not passing by reference, perhaps this is a good way to prevent altering it when using that as a wrapper. eg. readonly

  • Optimize cache saving and utilize cache is dirty flagging and perform saves at shutdown
  • Optimize best way to regen from filesystem changes.
  • Optimize saving or modifying cache before save, per page editing on changedata, deletes. Do direct modification of page cache and resave, avoiding cache rebuilding.

tablatronix added a commit that referenced this issue Feb 14, 2014
tablatronix added a commit that referenced this issue Feb 14, 2014
@tablatronix tablatronix changed the title Page caching refactor full implementation Page cache refactor full implementation Jun 25, 2014
@tablatronix
Copy link
Member

@tablatronix
Copy link
Member

as mentioned above, possible solutions

  • Make page cache more efficient, allow other way to refresh it. right now everytime cache needs update it regenerates from scratch.

It should allow individual page updates, update from objects ( avoid file reads ), and upload only pages that have changed, use time stamps or something ( only if newer )

timestamps is the best optimization as it covers all these cases, but might causes issues with plugins and filters.

@tablatronix tablatronix added this to the v3.4.0 milestone Apr 13, 2015
@tablatronix
Copy link
Member

  • some versions of php always create references to simplexml object elements!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants