MLC OPcache details

Terry Ellison edited this page Aug 4, 2013 · 10 revisions

The page describes the multi-level cache (MLC) variant of OPcache. Its primary purpose is to extend the performance benefits of OPcache into the CLI and CGI SAPI environments. A secondary purpose is to enable full functional testing of OPcache within the PHP regression test environment. This page's intended audience is PHP developers who are interested in understanding how this is implemented and some of the design decisions that have been taken that impact this implementation.

The addition of an extra file-cache tier

In MLC mode, OPcache executes with a multi-level cache, with the in-memory cache largely unchanged from normal OPcache use. However, an additional file-based cache tier is added using a file that broadly mirrors a cdb-style file-based (piecewise) constant database. The implementation is embedded in the OPcache code and, unlike cdb, the file index is fully loaded into memory on opening to minimize seeks and I/O calls. This approach is well suited to this type of cache. To enable safe multiple access and atomic update, the base file is always opened read-only and can thus be used safely by multiple processes. It contains the index followed by N logical records, each one comprising:

  • a compressed copy of the module from the memory cache
  • its relocation vector.
These are stored in creation order; most applications will access the objects in the same order on repeat execution paths, so creation order is a good strategy to minimise seeks and serialise access to the file cache.
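
This record layout can be sketched as a simple header-plus-payloads structure. The field and function names below are illustrative assumptions for this sketch, not the actual OPcache identifiers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of one logical record in the MLC file cache:
 * a fixed header followed by the compressed module image and its
 * relocation vector. */
typedef struct _mlc_record_header {
    uint32_t compressed_len;    /* bytes of compressed module image   */
    uint32_t uncompressed_len;  /* module size once expanded into SMA */
    uint32_t reloc_vec_len;     /* bytes in the relocation vector     */
} mlc_record_header;

/* Records are laid out back-to-back in creation order, so record i+1
 * starts immediately after record i's header and payloads. */
static size_t mlc_next_record_offset(size_t cur, const mlc_record_header *h)
{
    return cur + sizeof(mlc_record_header)
               + h->compressed_len + h->reloc_vec_len;
}
```

Because reloads typically walk the records in the same creation order, offsets can be advanced sequentially without extra seeks.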

Serialized versions of the includes and persistent script hashes are stored in the file index:

  • The includes cache is a simple list of include paths used in the application, with each mapping onto a single ASCII character, e.g. the first, say "." mapping to "A", the second onto "B", etc.
  • The persistent script hash enables symbolic mapping of script filenames to individual cached scripts. Note that with OPcache, each module in the persistent script hash has a primary key based on its fully resolved absolute file path, but each symlinked or path-relative reference from an application adds a secondary "indirect" key based on the current working directory at load, the filename and the include path. Hence a module might have /home/test/myapp/mod1.php as the primary key and /home/test/myapp:mod1.php:A as a secondary. These secondary keys are used to avoid unnecessary include path resolutions when looking up modules.
  • Note that for non-ZTS builds of PHP version 5.4 or greater, this is followed by a serialised dump of the interned strings, as discussed below.
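
The secondary key composition described above can be sketched with a small helper. The function name is hypothetical; only the key format (cwd, filename and include-path code letter joined by ':') comes from the text:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper composing the secondary "indirect" key:
 * current working directory + ':' + filename + ':' + the single
 * ASCII character assigned to the include path. */
static int make_indirect_key(char *buf, size_t buflen,
                             const char *cwd, const char *file, char path_code)
{
    return snprintf(buf, buflen, "%s:%s:%c", cwd, file, path_code);
}
```

For the example in the text, `make_indirect_key(key, sizeof key, "/home/test/myapp", "mod1.php", 'A')` yields the secondary key `/home/test/myapp:mod1.php:A`.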

The memory cache is initialized to an empty state by OPcache at extension startup; however, if the cache file exists at request startup in MLC mode, then the file index is read and these serialized versions are used to load the two main hashes. On this initial loading from the file cache, a contiguous array of zend_file_cached_script entries is also created, sized to the number of modules in the file cache, plus headroom for additions (with a standard realloc-and-grow algorithm to expand above this). The data entry for each persistent script ZCSG(hash) initially points to this entry, instead of the script itself, as this is yet to be loaded. All find calls to this hash within ZendAccelerator.c are replaced by a macro equivalent which, in the case of the file-cache enabled version, calls an inline wrapper which contains the following code:

 /* Resolve a script via the shared hash; if the entry still points into
  * the file-cached script array, fault the module in from the file cache. */
 zend_accel_hash_entry *bucket = zend_accel_hash_find_entry(&ZCSG(hash), k, kl);
 if (bucket && bucket->data && ZSMMG(use_file_cache)) {
     /* bucket->data addresses a zend_file_cached_script entry, not a script */
     zend_uint ndx = (zend_file_cached_script *)(bucket->data) - ZFCSG(file_cached_scripts);
     if (ndx < ZFCSG(file_cached_script_count)) {
         zend_accel_load_module_from_file(ndx, bucket TSRMLS_CC);
     }
 }

Hence zend_accel_load_module_from_file() is invoked on cache miss; this retrieves the module in a bulk read, then unpacks and relocates it to its correct absolute address. It then patches up the ZCSG(hash) and the persistent_script pointer, enabling processing to proceed normally.

Creation of new module objects is supported, that is, the addition of new compiled scripts on an invocation subsequent to the first creation of the file cache. These are written through to a temporary file that is private to the creating process; this temporary file is created on demand with the first new object. A new file cache is then created at request run-down if this temporary file exists. This new cache comprises a new serialised index, followed by the previous modules (from the old cache) and the new modules (from the temporary file).

For Linux and other POSIX-compliant file systems, the date-time stamp of the base file is checked before and after the (putative) new database creation, and if unchanged then this new file cache is moved over the old cache. This effectively aborts the commit of the new file cache version if the file has already been replaced by some other asynchronous process. Hence this append is transactionally consistent but not guaranteed to succeed, though it should rarely fail. The main scenario where it does fail is as the loser process in a race between two processes making (the same) updates in parallel.
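
The commit step above can be sketched as a check-then-rename sequence. This is a POSIX-only illustration under assumed function names, not the real OPcache code:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

/* Sketch: replace the base cache with the newly built file only if the
 * base's mtime is unchanged since the request first opened it. */
static int mlc_commit_cache(const char *base, const char *tmp, time_t mtime_at_open)
{
    struct stat st;
    if (stat(base, &st) == 0 && st.st_mtime != mtime_at_open) {
        remove(tmp);           /* another process won the race: abort commit */
        return -1;
    }
    return rename(tmp, base);  /* rename() is atomic on POSIX filesystems */
}
```

Note the residual window between the stat() and the rename(): the commit is transactionally consistent but, as the text says, the loser of a parallel race simply abandons its (identical) update.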

Some specific technical issues must be addressed to get the MLC version working robustly. In summary, these are as follows, and separate sections below discuss the solutions in further detail.

  • Relocating the compiled script records. Any JMP target, ZVAL target or HashTable inside a compiled script uses internal (absolute) address pointers which are only correct if the compiled script is based at the same absolute address on reload into a new process. This isn't the case on *nix platforms, so the compiled script must be converted into relocatable form before output to the file cache and relocated to its new base address on reload.
  • Removing lock management for access to the SMA. The "SMA" is private to a single process in the case of the CLI and CGI-based MLC modes, so lock management is both redundant and a small performance burden. This functionality is removed.
  • String interning. The Zend 2.4 engine introduced string interning for all strings embedded as compile-time constants and literals. Standard OPcache implements its own interning functionality when it saves a script into the SMA. Any MLC approach must dovetail into this.
  • Resource exhaustion and invalidating the cache. Standard OPcache treats the SMA as a resource pool that can fill and therefore trigger a restart cascade. MLC OPcache uses a per-request approach, so this complexity needs some streamlining.
  • File cache integrity. The file cache can contain fields that are specific to the PHP execution environment and binary executables. Changes in this context could render the cache content invalid and cause the PHP process to abort. Hence any such changes should be detected and handled.
  • Performance. There is no point in developing an MLC approach unless it delivers real performance benefits. This section discusses the typical benefits that this approach can realise.

Note that the current demonstrator doesn't include certain Windows-specific processing which has not yet been implemented (for example, the file move technique and any attempt to delete a dirty cache can fail with a share violation if another process also has the file open). Windows-specific logic to address this will be added in a future version.

Relocating the file cache record content

OPcache calls the underlying PHP compiler to compile the source to the op_array format and its associated structure hierarchy (as described in the overview on the Home page). The PHP opcode model uses normal absolute address pointers for all inter-structure references. However, OPcache relies on its cache being in the SMA so that each PHP worker process can map it into its address space. Because absolute addressing is used, the SMA must be located at the same absolute address within each worker process, and this is achieved on *nix platforms by forking the PHP interpreter after the anonymous shared region has been allocated.

The PHP compiler generates its output in emalloced (that is, private to the process) memory. OPcache must therefore relocate this compiled output to the SMA. It does this in a two-pass process: first it scans and parses the op_array structure hierarchy in order to calculate the total storage required to hold it; it then allocates a single brick from the SMA, and copies and marshals the entire op_array structure into this brick using a sequential allocator, relocating any necessary addresses before passing control back to the Zend execution environment.

This approach can't work for CLI and CGI modes, as Linux makes no guarantees about address consistency between separate activations of the same image and in practice the SMA base addresses can and do change. So the MLC must use a position-independent encoding for all address references in its file cache to insulate different script executions from this issue. It achieves this by having an extra preparation pass which converts all addresses into a relative form before saving the brick in compressed form to the temporary file.

A separate pass performs the inverse relocation operation. It is also invoked immediately after compression to convert the brick back to its original in-SMA absolute addresses. Note that if the debug ACCEL_DBG_RELR flag is set, a temporary copy of the brick is taken before this process to validate that the preparation and relocation passes together correctly regenerate the original content.

The relocation pass is done on every script fetch from the file cache to convert the compiled script to its correct absolute addressing, whereas the preparation pass is done only once per compile. The processing split between the two passes therefore aims to move as much computation as possible into the preparation pass. The expansion of the compressed-format module stored in the file cache is the largest computational element of the relocation pass; the preparation pass therefore zeros some redundant pointers rather than converting them into relocatable offset form. This is done when the recalculation algorithm is simple and has low processing cost. This improves compression ratios and delivers a net improvement in expansion timing.

The preparation pass

The preparation pass takes place during the OPcache compile sequence and largely mirrors the approach taken in the zend_persist.c routines; it is named zend_persist_prepare.c to reflect the fact that it is based on a fork of this code. It uses the same context-dependent processing to scan the op_array hierarchy, tagging addresses to be relocated. These addresses are allocated to one of four categories:

  1. Zend opcode handlers. Each opline includes a pointer to the corresponding handler function that manages that opcode / operand-type combination. These will be the same for a given executable, and the default addresses can be calculated from the opcode, op1 and op2 fields, so the handler address is zeroed and only the pointer to the op_array itself is tagged.
  2. Internal HashTables. Most of the internal pointers enable fast access to and enumeration of the table, but are redundant in that they can be recomputed when iterating over the HashTable elements. All internal-to-HashTable pointers are zeroed except the pListNext (and arKey in the case of PHP 5.4 or greater) pointer(s), which are each converted into a relocatable offset format by subtracting off its own address. In this case only the pointer to the arBuckets address itself is tagged (if this is null, there is nothing to relocate).
  3. Internal references. The address is that of another location in the same module, other than internal-to-HashTable as discussed in the previous item. The module base address is subtracted to convert the pointer to relative form.
  4. Interned strings. PHP 5.4 or greater interns string as discussed in the next section, and such interned string references point into the interned string pool. The interned string pool base address is subtracted to convert the pointer to relative form.

Also note that the compiled output only includes address pointers in one of these categories, and that all address pointers and address targets are size_t aligned; hence in all four categories the low 2 bits of the relative form are used to tag which of these categories applies.
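
The tag-bit encoding can be sketched as follows. The category numbering and function names are illustrative assumptions; only the scheme (size_t alignment frees the low 2 bits of a relative offset) comes from the text:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative category numbering for the four relocation classes. */
enum reloc_cat { RELOC_OP_ARRAY = 0, RELOC_HASHTABLE = 1,
                 RELOC_INTERNAL = 2, RELOC_INTERNED = 3 };

/* Because every pointer and target is size_t aligned, the offset's two
 * low bits are always zero and can carry the category tag. */
static uintptr_t reloc_encode(uintptr_t addr, uintptr_t base, enum reloc_cat cat)
{
    return (addr - base) | (uintptr_t)cat;
}

static uintptr_t reloc_decode(uintptr_t enc, uintptr_t base, enum reloc_cat *cat)
{
    *cat = (enum reloc_cat)(enc & 3u);
    return base + (enc & ~(uintptr_t)3u);
}
```

On reload the base can differ from the base at encode time, which is the whole point: the offset plus the new base yields the correct absolute pointer.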

The tagging scan uses a bitmap with one bit per size_t element in the module to tag these for relocation. The only changes made in this scan are to set the two low bits of the internal pointers to the HashTables and op_arrays.

The bitmap is then scanned to process the tagged addresses: to generate the relocation vector (which typically takes one byte per tagged address), which is appended to the copy of the module written to the cache file to enable the relocation pass, and to do the actual position-independent conversions.
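
One way such a byte vector can be produced from the bitmap is as a sequence of inter-address gaps. The gap encoding below (with 255 as an escape for large gaps) is an assumption for illustration, not the exact on-disk format:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: scan the tagging bitmap and emit one byte per tagged address,
 * each holding the gap (in size_t slots) from the previous tagged slot.
 * 255 acts as an escape prefix for gaps of 255 slots or more. */
static size_t reloc_vector_build(const uint8_t *bitmap, size_t nslots, uint8_t *vec)
{
    size_t n = 0, last = 0;
    for (size_t i = 0; i < nslots; i++) {
        if (bitmap[i / 8] & (1u << (i % 8))) {
            size_t gap = i - last;
            while (gap >= 255) { vec[n++] = 255; gap -= 255; }  /* escape */
            vec[n++] = (uint8_t)gap;
            last = i;
        }
    }
    return n;   /* typically one byte per tagged address */
}
```

Since tagged addresses cluster closely in a compiled module, the common case really is one byte per address, keeping the vector compact.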

Note that if any zend extension uses its op_array handler to overwrite the handler address with a non-default one, then this address cannot be converted and the cache becomes executable-dependent. In the case of such executable-dependent caches, a zend_adler32 checksum of the opcode handler vector is included in the cache fingerprint to detect executable change and invalidate the cache. This could still cause cache thrashing in pathological infrastructure cases, such as a farm of LAMP servers with differently linked PHP images sharing a common user document root hierarchy.

The relocation pass

The relocation process takes place on a per-request basis when reloading a module from the file cache into the OPcache SMA. It uses a byte relocation vector to iterate through the pointers to be relocated, with the 2 low-order bits of each pointer being used to determine which address conversion should be applied. In the case of the HashTable type, the HashTable is rebuilt using the minimal offset information in the relocatable format, and in the case of the op_array type, the opcode handlers are recalculated by iterating over the opline array.
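
The dispatch on the 2 low-order bits can be sketched as below. Names and category values are illustrative, and the op_array and HashTable cases are stubbed out with comments since they involve the Zend structures themselves:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the per-request relocation dispatch: for each tagged slot,
 * the low 2 bits select which address conversion to apply. */
static void reloc_apply(uintptr_t *slots, const size_t *tagged, size_t ntagged,
                        uintptr_t module_base, uintptr_t intern_base)
{
    for (size_t i = 0; i < ntagged; i++) {
        uintptr_t *p  = &slots[tagged[i]];
        uintptr_t off = *p & ~(uintptr_t)3u;
        switch (*p & 3u) {
        case 0: /* op_array: recompute opcode handlers over the opline array */
            break;
        case 1: /* HashTable: rebuild internal pointers from pListNext/arKey */
            break;
        case 2: *p = module_base + off; break;  /* internal reference   */
        case 3: *p = intern_base + off; break;  /* interned string pool */
        }
    }
}
```

In the real pass the tagged slots are recovered by walking the byte relocation vector rather than an explicit index array as here.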

String Interning

The Zend 2.4 engine introduced string interning for all strings embedded as compile-time constants and literals. Standard OPcache replaces the interning hooks used during compilation by dummies to disable the creation of interned strings by the PHP compiler, and implements its own interning functionality when it saves a script into the SMA by creating any interned strings in an SMA-based interned string pool which is shared across all PHP processes. It does this by using a standard HashTable in zend_accel_shared_globals (in the SMA), but using its own insert logic within accel_new_interned_string() so that the Bucket records are allocated inside the intern area, and hence the intern strings, which are the keys to the buckets, are in the intern area.

This complexity is redundant in MLC mode as the intern pool is private to the process anyway. Nonetheless, MLC OPcache currently uses the standard OPcache architecture to minimise the source code changes required, though this will be reviewed in due course.

There are two main alternative design approaches for MLC interning:

  • maintain a single per-filecache serialized copy of the interned string pool in the header index;
  • maintain a per-module intern serial dump in each module record.
The current implementation uses the first of these, which means that MLC-mode interning works in the same way as in the standard modes, but the interned string pool is appended to the file-cache index record on creation or update and reread on file-cache use. This approach has the advantage of simplicity, and it is also more efficient in cases where the majority of modules are ultimately loaded during script execution. However, there are also disadvantages, for example the need to invalidate the cache if INI changes result in a different baseline interned string set being loaded, so this decision might be revisited after further evaluation.

Interned string pointers are relocated relative to ZCSG(interned_strings_start), with the high address bit set to differentiate interned-string from intra-module pointers. The serialised form of the interned string pool is simply an enumeration of the interned keys in creation order, as this will recreate the same offsets on reload.
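
The high-bit discrimination can be sketched as follows; the helper names are hypothetical, and only the scheme (pool-relative offset plus a high flag bit) comes from the text:

```c
#include <assert.h>
#include <stdint.h>

/* High address bit flags an interned-string offset, distinguishing it
 * from an intra-module offset. */
#define INTERN_FLAG ((uintptr_t)1 << (sizeof(uintptr_t) * 8 - 1))

static uintptr_t intern_encode(uintptr_t p, uintptr_t pool_start)
{
    return (p - pool_start) | INTERN_FLAG;
}

static uintptr_t intern_decode(uintptr_t enc, uintptr_t pool_start,
                               uintptr_t module_base)
{
    return (enc & INTERN_FLAG) ? pool_start + (enc & ~INTERN_FLAG)
                               : module_base + enc;
}
```

Because the serialised pool is replayed in creation order, each key lands at the same offset from the pool start on reload, so the encoded offsets remain valid.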

Use of locking to manage access to the SMA

OPcache in its standard operation uses a lock-on-file technique on non-Windows configurations for lock management of the SMA with the following logical locks:

  Function            Lock
  Write lock on SMA   mem_write_(lock|unlock)
  Restart control     restart_(in_progress|check|finished)
  Memory usage check  memory_usage_(lock|check|unlock|unlock_all)

These locks are used to manage atomic access to the SMA from independent processes and threads, and their access is encapsulated in various lock/unlock management routines (which also have conditional coding to handle Windows-specific lock management):

  Write lock on SMA   zend_shared_alloc(_|_un|_safe_un|_create_)lock
  Restart control     accel_restart_(enter|is_active|leave)
  Memory usage check  accel_activate_add|accel_deactivate_sub|accel_unlock_all|accel_is_inactive

In the case of the CLI and CGI-based MLC modes, the "SMA" is private to a single process and therefore use of these routines is redundant. Since the fcntl(F_SETLK | F_SETLKW | F_GETLK) calls generate a runtime overhead, the single-threaded MLC mode includes logic to effectively null out these lock-management calls, and the lock file is not used. In the case of the restart logic, the zend_accel_schedule_restart() logic is covered by an MLC-mode guard, so all restart logic is effectively bypassed.
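
The nulling-out can be illustrated with a runtime guard; the flag and counter below are hypothetical stand-ins for the real OPcache guards, shown only to make the shape of the change concrete:

```c
#include <assert.h>

/* Illustrative no-op lock strategy: in single-process MLC mode the
 * fcntl()-based lock calls are guarded away entirely. */
static int mlc_single_process = 1;   /* set for CLI/CGI MLC mode */
static int fcntl_calls = 0;          /* counts lock syscalls issued */

static void sma_write_lock(void)
{
    if (mlc_single_process)
        return;                      /* SMA is private: nothing to lock */
    fcntl_calls++;                   /* would issue fcntl(F_SETLKW, ...) */
}
```

The same guard pattern applies to the unlock, restart-control and memory-usage-check routines listed above.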

Response to resource exhaustion and invalidated cache

OPcache statically allocates resources based on a number of opcache INI parameters (memory_consumption, interned_strings_buffer, max_accelerated_files and max_wasted_percentage) as described in the README file, which in normal SMA modes can trigger an optimizer restart in an attempt to reclaim stale SMA resources. The opcache INI parameters (validate_timestamps, revalidate_freq and revalidate_path) are used to tune detection of changes to scripts and invalidate any cached compiled scripts.

In MLC-mode, because "SMA" memory is allocated in private process space on a per-request basis, these parameters are handled differently:

  • memory_consumption. OPcache allocates cache memory in 8MB bricks on demand. This is treated as an upper bound.
  • interned_strings_buffer. This is only relevant for PHP version 5.4 or greater and relates to the sizing of an SMA-based interned strings pool. It is ignored in MLC modes, because the standard PHP interning is used. This is discussed below in further detail.
  • max_accelerated_files. The maximum number of compiled script files to be cached.
  • max_wasted_percentage. Effectively ignored in MLC modes.
  • blacklist_filename. Used to exclude volatile source scripts (e.g. auto-generated code such as Smarty compiled templates) that might cause unnecessary cache invalidation.
The general recommendation for these is to size them to the typical maximum for any script in your application for example:
 opcache.memory_consumption=128
 opcache.max_accelerated_files=250

However, whilst hitting these limits will generate a warning, OPcache simply disables caching of any further compiled scripts for the remainder of the request execution.

OPcache optionally does timestamp checks on script files to validate its cache integrity, setting the persistent_script->corrupted flag and replacing the stale module. It then recalculates ZSMMG(wasted_shared_memory), scheduling a restart if necessary. In MLC mode, any failed validation turns off further cache actions, effectively disabling OPcache for the remainder of the request, and the cache is deleted at request shutdown, allowing a fresh cache to be recreated by a subsequent request.

File cache integrity

The filecache contains fields which are specific to the PHP execution environment and binary executables. Changes in this context could render the cache content invalid and cause the PHP process to abort. Hence any such changes should be detected, and in this case the cache should be treated as dirty and rebuilt by a subsequent request as discussed above. Examples of where this could occur include:

  • the (zend_op).handler fields include the build-specific address of the corresponding handler routine
  • the layout of various structures is PHP version-specific
  • the interned string pool is primed by loading the extensions prior to compilation of any request-specific scripts
  • the filecache records are compressed using a specific compression algorithm.
At a minimum the filecache should be invalidated on change of key offsets in zend_opcode_handlers, PHP_VERSION_ID, interned string pool content prior to script execution and compression algorithm.

Performance

PHP as standard operates in a compile-on-load mode, that is, every source required by the invoking script is parsed and compiled at each execution. In practice standard OPcache works in much the same way in CLI and CGI SAPI modes. MLC OPcache adds compiled-script persistence through its use of a per-script file cache; here the compilation process effectively becomes a one-off (per script change) and the many repeat requests are serviced by loading the compiled scripts from the file cache. So long as the one-off priming process is not so burdensome as to create request failures due to resource or process constraints, it isn't material in performance comparisons: the primary comparison is between the no-cache and the MLC repeat-request timing and I/O resources needed.

In fact, where practical, the MLC design shifts processing into the one-off compilation request and streamlines the repeat requests which can exploit the file cache. Nonetheless, on a repeat request OPcache still uses its memory-based SMA cache, resolving any cache misses by reading the required compiled script into the SMA cache from the file cache. The additional processing overhead incurred in loading the SMA cache from the file cache has three components:

  • The single logical I/O request needed to read the module into memory
  • Uncompressing it into the correct SMA slot
  • Relocating addresses to the correct absolute locations.
Profiling shows that this last operation is relatively lightweight, so the additional I/O and decompression dominate the additional overheads incurred by using the MLC, and these have to be balanced against the corresponding I/O and compilation overheads in the standard non-cached CLI and CGI modes.

In the most common use case for CGI modes, which is a shared-host infrastructure template using NFS or an equivalent shared filesystem over an array of LAMP servers, I/O can be a dominant performance constraint. Contrast the equivalent strace extracts of the I/O associated with requiring a script file without OPcache, which requires the following I/O calls:

 getcwd()
 lstat() + 4 x fstat()
 open() + close() 
 mmap() + mapped access + munmap()

and with OPcache MLC: 2 x fread() (if the modules are reloaded in the same creation order), with the compressed binary cache records typically smaller than the source content. Note that the second read is due to internal buffering logic in the C RTL. (Evaluation of using mmapped I/O is on my TODO list, but in my experience there would be no real-time savings unless the entire cache was mapped, with a corresponding hit on process virtual memory size, and even then it would be marginal.) In the case of fully VFAT-cached source files on local filesystems, there is little real-time difference between these two OPcache variants, but the stats and open/close can add a material additional RPC load to the file server in the case of shared file systems, leading to significant I/O performance gains for the MLC variant.

The uncompress vs. compile CPU overhead is perhaps the most contentious, as is the question of whether compression is needed at all. Clearly this issue needs further benchmarking and investigation, but again, thinking of the main use case, I/O can be expected to generate I/O cache misses and off-server RPC traffic, so information density (i.e. compression of content) usually leads to significant overall performance gains due to the reduced I/O RPC traffic. There is clearly a trade-off of decompression overhead vs. size reduction. (Compression is done so infrequently that the timing of this isn't really a factor in this trade-off.) The standard zlib inflate algorithm is perhaps the best of the standard algorithms, hence its initial choice, but Google's LZ4 claims better performance, and therefore the LZ4 and LZ4HC algorithms have also been added, together with an extra INI parameter compression_algo with the values 1 = zlib compression, 2 = LZ4 and 3 = LZ4 High Compression.

An indication of the potential performance benefits of OPcache CLI mode can be seen from a simple benchmark based on 100 executions of the MediaWiki runJobs.php maintenance batch script. This compiles some 44 PHP sources, comprising 45K lines and 1,312 Kbytes. The cached version reads a single runJobs.cache file of 1,013 Kbytes.

                            PHP 5.3-17(ZTS)  PHP5.4-9(non-ZTS)
 Time in mSec               Average  Stdev    Average  Stdev
 Uncached Execution           179      7        148      4
 Cached Execution (algo 1)     77      7         63      5
 Cached Execution (algo 2)     70      6
 Cached Execution (algo 2)(*)  58      7   
 Cached Execution (algo 3)                       56      5

I chose this script because it loads a reasonable number of modules (a typical Wiki page render does more than double this), but has lightweight processing, so the compilation and initiation dominate the execution time. In this case, the use of the cache on PHP 5.3 dropped the script real time by some 57% with zlib compression and over 60% with LZ4 compression (or 70% when LZ4 is compiled with -O3). I have added LZ4 High Compression to my PHP 5.4 build (plus caching of interned strings) and this gave comparable improvements, e.g. over 60% for LZ4HC compression.

In this test all I/O is fully VFAT cached, so real = user + sys. However, in real-world cases where some I/O might be off-server or to physical devices, these savings would be even more pronounced, as serial access to a 1MB file is far more I/O-efficient than reading some 44 separate files totalling 1.3MB.

OK, this is a -O0 build with debug enabled (excepting LZ4 for one run), rather than a -O3 production build, but I would not anticipate any material shift here.

The bottom line is that the file cache is working largely as anticipated.

ZTS support

Since most SAPI modes can support TSRM multi-threading, all file-cache mods are TSRM-compliant to the extent that they will build the correct calling arguments in ZTS-enabled builds, and work correctly when executing non-MLC modes on SAPIs that exploit multi-threading. However, the MLC modes only operate in the CLI / CGI SAPI modes, which execute a single request per image activation and can therefore only have a single thread; the MLC exploits this de facto single threading, for example in the removal of the lock file and the use of the shared file-cache globals ZFCSG().