The Zend Engine and opcode caching


The Zend execution engine

The PHP system is built either in the form of a stand-alone executable, or as a dynamic load library for use within a web-server environment such as Apache2. It uses a Server API, or SAPI, as an abstraction layer, with each of the web-server and command-line variants implementing its own SAPI to isolate PHP from the web-server specifics. Each SAPI implementation uses a sapi_module_struct (sketched after the list below) to define:

  • The identification of the SAPI
  • Startup and shutdown handlers
  • Activate and deactivate handlers
  • I/O and logging handlers.
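
The following is an abridged sketch of the kind of fields involved, based on main/SAPI.h in PHP 5.x; the member list and ordering are simplified and most members are omitted:

    /* Abridged sketch of sapi_module_struct (see main/SAPI.h for the full
       PHP 5.x definition; most members are omitted here). */
    typedef struct _sapi_module_struct {
        char *name;                   /* SAPI identification, e.g. "cli", "apache2handler" */
        char *pretty_name;

        int  (*startup)(struct _sapi_module_struct *sapi_module);   /* image / thread start */
        int  (*shutdown)(struct _sapi_module_struct *sapi_module);  /* image / thread end   */

        int  (*activate)(TSRMLS_D);     /* per-request begin; NULL for one-request SAPIs */
        int  (*deactivate)(TSRMLS_D);   /* per-request end;   NULL for one-request SAPIs */

        int  (*ub_write)(const char *str, unsigned int str_length TSRMLS_DC);  /* unbuffered output */
        void (*flush)(void *server_context);
        /* ... header, POST-reader and cookie handlers ... */
        void (*log_message)(char *message);                                    /* logging */
        /* ... */
    } sapi_module_struct;
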
Startup enables the initialisation of the compiler context, including its internal storage structures, and the attachment and initialisation of any extensions. This occurs once for single-threaded builds, and the equivalent exit handler is known as shutdown. When PHP is built with its Thread Safe Resource Manager (TSRM) enabled, it can support a multi-threaded mode where each thread has its own startup / shutdown context.
Note that TSRM was a late addition to the PHP architecture, and this thread implementation incurs roughly an overall 20% performance hit; this, together with the fact that some PHP extensions are not thread-safe, means that it is rarely used in practice.
*nix platforms provide a simple alternative to multi-threading by enabling the web-server to fork the parent process which embeds the PHP execution system after startup. Each child process can then run within its own context without the overhead of TSRM and without the overhead of an additional image startup, yet still service its own request activation(s).
Activation is the handler used to receive and service a (web-service) request, and the equivalent exit handler that runs down the context of the request is known as deactivation. Many SAPIs support only one request per image activation, and in this case these request-processing bookends are typically folded into startup and shutdown, with the activate and deactivate handlers being set to NULL. As image startup overheads are relatively large, there are significant performance gains in a SAPI being able to handle multiple requests within one image activation; hence the higher-throughput SAPIs (such as those for Apache2 and FastCGI) support multiple requests per startup. As I discuss later, there are also additional benefits in doing this when implementing an opcode cache.

The overall execution process for any given request is initiated by the PHP SAPI being activated by the web-server or shell environment. This activation typically has an associated source file (e.g. a parameter in a php-cli script execution, or the script name resolved from the URI of a web-server request). This resource name is resolved into the first PHP script file to be compiled and then executed.

Unlike languages such as C or C++, which typically compile a source tree to native machine code linked into a binary image or library in a build process completely separate from the execution of that binary, PHP is a compile-and-go language: in normal operation every PHP script is compiled once per request on initial loading. The Zend execution system compiles the PHP source to a hardware-independent virtual machine (VM) instruction set with a fixed 3-operand format, known as opcodes. Each PHP source script is compiled into one or more blocks of opcodes, each known as an opcode array (or op_array for short), with each function or method compiled to its own op_array. HashTable structures are used to index these by class and function / method name.

Each opcode also includes the address of the handler within the VM executor responsible for decoding and executing that opcode. The current op_array is executed one opcode at a time by dispatching to its handler. Execution normally proceeds by executing the next opcode, but some handlers such as JMP will change the execution order. Function and method calls are executed by resolving the function header in the function HashTable, and this header points to its op_array; a new execution frame is then created and execution proceeds with the first instruction in the new op_array.
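
To make the dispatch mechanism concrete, here is a minimal, self-contained model of handler-based dispatch in C. The types, handlers and program are purely illustrative; the real loop lives in Zend/zend_vm_execute.h and also deals with VM re-entry, exceptions and handler return codes:

    /* A toy model of handler-threaded dispatch: each opcode carries a pointer
       to its handler, and the executor simply calls the current handler until
       one of them stops execution. Not Zend code; illustrative only. */
    #include <stdio.h>

    typedef struct op op_t;
    typedef struct exec_state { const op_t *ip; int acc; int running; } exec_state_t;
    typedef void (*handler_t)(exec_state_t *);
    struct op { handler_t handler; int operand; };

    static void op_add(exec_state_t *s)  { s->acc += s->ip->operand; s->ip++; }
    static void op_jmp(exec_state_t *s)  { s->ip += s->ip->operand; }   /* alters execution order */
    static void op_halt(exec_state_t *s) { s->running = 0; }

    int main(void)
    {
        const op_t program[] = {
            { op_add, 40 }, { op_jmp, 2 }, { op_add, 999 }, { op_add, 2 }, { op_halt, 0 }
        };
        exec_state_t s = { program, 0, 1 };
        while (s.running)
            s.ip->handler(&s);            /* dispatch to the current opcode's handler */
        printf("acc = %d\n", s.acc);      /* prints 42: the opcode at index 2 was jumped over */
        return 0;
    }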

When this opcode is an INCLUDE_OR_EVAL, the Zend compiler is invoked to compile that source. Execution is suspended whilst the new script is compiled, and resumed on completion of the compile. In this way, the executing script can add extra classes, functions and immediate code. (The PHP include, require, eval and other compilation statements all generate an INCLUDE_OR_EVAL opcode.) Complex applications such as MediaWiki might include 100 or more source scripts, repeating this cycle of compilation and execution.

PHP is a dynamically typed language and the VM opcode format has been designed to map onto the PHP language constructs and data types, so that a typical PHP statement will compile into only a few opcodes, with all the type dependent processing handled by the VM executor. For example the PHP statement:

    $x = "Value = $a[fred]";

would be compiled to the opcode sequence of the form:

     Op                  Result    Operands
     ADD_STRING            ~1      'Value+%3D+'
     FETCH_DIM_R           $0      !1, 'fred'
     ADD_VAR               ~1      ~1, $0
     ASSIGN                        !0, ~1

where !0 is the compiled variable $x, !1 is the compiled variable $a, and ~1 and $0 are temporary variables used to hold intermediate values. All variables can take the full range of PHP types, so the VM executor makes heavy use of the PHP storage mechanisms and data types (zvals, hash tables, etc.) for both program data and its own op_array structures. The op_array hierarchies are built up over the course of the compilation / execution cycle, so a destructor is executed on deactivation to enumerate and destroy all remaining PHP data structures, including those associated with the request's op_arrays.

Comments on Zend engine performance

The VM has also been designed to be executed efficiently by the Zend VM executor, with format revisions at PHP 5.0 and 5.4 to improve performance.

I have added RDTSC-based opcode profiling to one of my PHP 5.5 builds to give an indication of the relative costs of execution on complex PHP applications. I see a typical Pareto distribution where 5-10% of the opcode handlers account for 90-95% of all opcode executions across most applications. Taking one example, the PHP rendering of this page using MediaWiki (GitHub's own wiki engine is written in Ruby):

  • This request loads and compiles 75 PHP scripts (taking ~39% of the total execution).
  • It executes some 63K opcodes in total.
  • These include some 8K function- and method-call related opcodes (taking ~58% of the total execution). These are split roughly 50:50 between PHP and built-in functions, but the costly calls in this application are to the Mysqli extension built-ins.
  • Excluding function and method calls and return opcodes, the average opcode execution is ~350 CPU clocks. The opcodes that manipulate strings and arrays represent a heavier load, but the simple opcodes such as jumps and simple boolean / integer arithmetic typically take 50-100 clocks to execute.

Clearly this mix can shift significantly with the application implementation (for example if I configured MediaWiki page caching, then the MySQL overhead would drop significantly). However, the following general pattern is reflected in most PHP applications:

  • Script compilations can typically take half of total execution time.
  • The material bulk of the remaining execution time is built-in function calls.
  • PHP function and method invocation / return overheads are the next biggest runtime cost.
  • Dynamic storage constructors and destructors are the next biggest runtime cost.
  • A typical request executes surprisingly few opcodes in total and most are cheap.
If I generate an assembly listing for the gcc compile of Zend/zend_execute.c (which includes the opcode handlers defined in Zend/zend_vm_execute.h) at maximum optimization, I have to respect the expertise of both the gcc developers and the Zend team when I look at the machine code output for any given opcode handler. A gdb-based instruction trace of the execution of typical opcode handlers confirms that these have been heavily optimized by an expert development team.

I find it difficult to identify any obvious "low hanging fruit" for improvement. In the case of perhaps 95% of these opcode handlers, even if I could identify a way to shave 25% off the opcode execution, this would still only make an immaterial improvement to overall request execution times.

The overall efficiency of an application at the PHP coding level, and the implementation of the 3rd-party extensions that it uses, are outside the scope of the Zend execution engine.

The immediate change which would dramatically improve Zend performance would be removing the need to do the expensive per-request compile of PHP script files, and given that the op_arrays have a well-defined binary format, this is entirely achievable. The next section describes how the standard OPcache opcode accelerator achieves this for a subset of SAPIs. The separate wiki page MLC OPcache details describes how using a file-based second-level cache can extend OPcache's cover to other SAPIs.

Opcode caching with OPcache

Summarising the previous section, the PHP Zend engine is a compile-and-go system which executes a PHP application request by compiling the PHP sources to its VM opcode format, which is executed one opcode at a time. When this opcode is an INCLUDE_OR_EVAL then the Zend compiler is invoked to compile that source.

The compiler converts each source into a set of op_arrays, with some associated ZVAL structures and HashTables for operands and variables. Each function and class has an entry in execution-global HashTables that is used to access functions and classes symbolically. Any top level statements are compiled into an initialisation function that is identified by the resolved filename of the source script. This set of arrays, structures, and HashTable entries is referred to collectively as a compiled script.

The aim of OPcache and other opcode accelerators is to remove the I/O and processing burden of identical compiles that are duplicated across separate requests. It does this by maintaining a shared cache of compiled scripts, which it generates by hooking into the Zend compile function address. (There is a standard hook interface to enable this.) It can then intercept all Zend compiler invocations, as sketched below.
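
The hook itself is just a saved-and-replaced global function pointer. The following hedged sketch shows the pattern; the accel_* names and the cache logic in the comments are illustrative, not OPcache's actual internals:

    /* zend_compile_file is a global function pointer declared in
       Zend/zend_compile.h; an extension saves the original and substitutes
       its own wrapper. */
    static zend_op_array *(*accel_orig_compile_file)(zend_file_handle *file_handle,
                                                     int type TSRMLS_DC);

    static zend_op_array *accel_compile_file(zend_file_handle *file_handle, int type TSRMLS_DC)
    {
        /* 1. resolve file_handle to its cache key (the resolved filename);
           2. on a cache hit, link / copy the cached script and return its op_array;
           3. otherwise delegate to the real compiler and persist its output. */
        return accel_orig_compile_file(file_handle, type TSRMLS_CC);
    }

    static void accel_install_compile_hook(void)   /* called once at extension startup */
    {
        accel_orig_compile_file = zend_compile_file;
        zend_compile_file       = accel_compile_file;
    }
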

Processing compiler invocations

OPcache takes one of three main execution paths depending on the source request:

  • The script isn't cacheable.
  • No copy of the compiled script exists, or the existing one is out of date.
  • A copy of the script exists in the SMA cache.
Note that for PHP 5.3 and later versions, the Zend compiler can be set to a mode where the compiled script for a given source file is always the same (save one complication) and doesn't depend on other previously compiled scripts. That complication is that the various data elements in the compiled script are allocated in memory using the standard Zend memory management (MM) system and are therefore scattered through the allocated storage. Any inter-element references use absolute address pointers, and these addresses can and will change from compilation to compilation. OPcache overcomes this absolute-addressing issue by fixing a shared memory area (SMA) at the same address range(s) in all processes which use it for the compiled script cache.
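
The following minimal POSIX sketch shows what "fixing a shared memory area at the same address" means in practice; the shared-memory name, size and requested address are illustrative, and this is only one of several attachment mechanisms an accelerator can use. Note that MAP_FIXED silently replaces any existing mapping in the requested range, which is the hazard referred to under "Supported SAPIs" below:

    /* Map a shared memory object at a caller-requested address so that
       absolute pointers stored inside it remain valid in every process that
       attaches it at the same address. Illustrative sketch only. */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static void *attach_sma(void *wanted_addr, size_t size)
    {
        int fd = shm_open("/example_sma", O_CREAT | O_RDWR, 0600);
        if (fd == -1 || ftruncate(fd, (off_t)size) == -1) {
            return MAP_FAILED;
        }
        /* MAP_FIXED forces the mapping to wanted_addr; if another mapping
           already occupies that range it is silently replaced. */
        return mmap(wanted_addr, size, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
    }
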

The script isn't cacheable

Some types of script are disqualified from caching: for example, if the filename has been blacklisted, if it is loaded from a PHP stream other than a local file or phar stream, or if it is an eval string. In these cases, OPcache simply acts as a pass-through, invoking the real Zend compiler.

No valid copy of the compiled script exists

A hash table within the SMA is used to index all compiled scripts held in the SMA by their resolved filename. Lookup on this index will either return a NULL or a pointer to a compiled script; the NULL case means that no compiled script exists. Depending on the opcache.validate_timestamps and opcache.revalidate_freq INI parameters, the timestamp of the underlying file can also be validated against that of the compiled script; a mismatch is treated as if no valid compiled script exists.
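
A hypothetical sketch of this decision logic is given below; all type and helper names are illustrative stand-ins, and the revalidate_freq throttling of timestamp checks is omitted for brevity:

    /* Hypothetical sketch of the lookup-and-validate decision described above. */
    #include <stdbool.h>
    #include <time.h>

    typedef struct cached_script cached_script;
    cached_script *sma_index_find(const char *resolved_path);        /* SMA hash table lookup     */
    time_t         script_timestamp(const cached_script *script);    /* recorded at compile time  */
    time_t         source_timestamp(const char *resolved_path);      /* stat() of the source file */

    /* returns NULL when a fresh compile is needed */
    static cached_script *find_valid_script(const char *resolved_path, bool validate_timestamps)
    {
        cached_script *script = sma_index_find(resolved_path);
        if (script == NULL) {
            return NULL;                              /* no compiled script exists */
        }
        if (validate_timestamps
            && source_timestamp(resolved_path) != script_timestamp(script)) {
            return NULL;                              /* stale: treated as not existing */
        }
        return script;
    }
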

OPcache calls the real Zend compiler to compile the source, outputting the compiled script into standard Zend MM storage. It then does a multi-pass process on this compiled script output:

  • The compiled script data hierarchy is traversed to compute the total storage required to hold it as a contiguous brick (see the sketch after this list). This includes space for a special header containing any associated class and function entries.
  • A single brick of the correct size is then allocated in the SMA cache.
  • The compiled script data hierarchy is traversed a second time, copying the data elements serially into this brick.
  • The associated class and function entries are copied to the special header.
  • An entry for the compiled script is added to the SMA hash table index.
  • The DTOR for the original compiled script data hierarchy is executed to recover the MM storage.
  • The processing then falls through to the cached case.
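
The sketch below illustrates the shape of this size-then-copy pattern; the compiled_script type and all helper names are hypothetical stand-ins, not OPcache's internal API:

    /* Hypothetical sketch of the two-pass persistence pattern described above. */
    #include <stddef.h>

    typedef struct compiled_script compiled_script;   /* op_arrays, classes, functions, ZVALs */

    size_t measure_script(const compiled_script *script);                /* pass 1: walk and total bytes */
    void   serialise_script(void *brick, const compiled_script *script); /* pass 2: copy contiguously    */
    void  *sma_alloc(size_t size);                                       /* one allocation in the SMA    */
    void   sma_index_add(const char *resolved_path, void *brick);        /* SMA hash table entry         */

    static int cache_script(const char *resolved_path, compiled_script *script)
    {
        size_t size  = measure_script(script);   /* includes the special header for class/function entries */
        void  *brick = sma_alloc(size);
        if (brick == NULL) {
            return 0;                            /* SMA full: caching is skipped for this script */
        }
        serialise_script(brick, script);
        sma_index_add(resolved_path, brick);
        return 1;                                /* caller DTORs the original MM copy and then follows
                                                    the same path as a cache hit */
    }
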

A copy of the script exists in the SMA cache

  • The brick is retrieved by lookup in the SMA hash table index.
  • The special header is processed to register the associated class and function entries in the relevant execution engine HashTables.
  • The read-only elements of the compiled script in the SMA cache are linked to in situ.
  • Some data elements associated with a compiled script are R/W (e.g. static class properties). These have to be deep-copied into the request's local MM storage.

The total processing involved in instantiating a previously compiled script is more than a factor of 10 less than that of the comparable compile, and with the default configuration settings it removes all I/O operations on most "compiles".

The simple SMA model

The two-pass approach to processing the compiler output in MM storage has a number of important benefits (accelerators such as APC don't do this and have lower performance as a result):

  • The compiled script gets copied to a single storage brick in the SMA, which is indexed by the resolved filename of the script.
  • The PHP storage elements are serialised within this brick, and as they share a common lifetime, any DTOR pointers within HashTables etc. are set to NULL to prevent the request-deactivation DTOR from attempting to remove elements in the SMA. This removes the MM overheads associated with construction and destruction of these elements. Making the data elements for a function contiguous also results in a small runtime improvement due to improved hardware cache hit ratios (modern Intel CPUs have a 64-byte line size in their L1/L2/L3 caches).
  • Compiled scripts are only ever added to the SMA; there is effectively no delete function. So in the case where a script has been updated, another copy of the compiled script is added to the SMA, and the hash bucket in the SMA hash table used to index compiled scripts is the only record that is truly updated. Because of this, the only write locking required is to ensure SMA integrity on update; no read locks are required for normal access.
  • Overall lock rates are therefore very low, so OPcache uses the relatively inefficient, but robust, lock-file / fcntl() lock method on all non-Windows builds (a minimal POSIX sketch of this follows).
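
The lock-file / fcntl() pattern looks roughly like the sketch below. This is plain POSIX record locking; the lock-file descriptor and helper names are illustrative rather than OPcache's own:

    /* Exclusive write lock on one byte of a dedicated lock file, using POSIX
       fcntl() record locking. lock_fd is assumed to have been opened once at
       startup, e.g. open("/tmp/example.lock", O_RDWR | O_CREAT, 0666). */
    #include <fcntl.h>
    #include <unistd.h>

    static int lock_fd;

    static void sma_write_lock(void)
    {
        struct flock lk = { .l_type = F_WRLCK, .l_whence = SEEK_SET, .l_start = 0, .l_len = 1 };
        fcntl(lock_fd, F_SETLKW, &lk);   /* block until the exclusive lock is granted */
    }

    static void sma_write_unlock(void)
    {
        struct flock lk = { .l_type = F_UNLCK, .l_whence = SEEK_SET, .l_start = 0, .l_len = 1 };
        fcntl(lock_fd, F_SETLK, &lk);
    }
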

This locking/update model for the SMA is both simple and fast. However, it has one major flaw: SMA storage is not recovered and reused, so it can be exhausted. OPcache manages this by using a simple "reboot" SMA model. New requests are placed into a "restart mode" that disables all use of the SMA cache and reverts back to non-cached compilation. When the last request that was using compiled scripts from the SMA completes, the next request reinitialises the SMA as empty and re-enables caching. Subsequent requests then rebuild the cache.

This means that during the time interval from the start of "restart mode" to the reinitialisation of the SMA, all new requests use non-cached compilation. As discussed above, this can double the system load. A good analogy is an accident that closes half the lanes on a freeway: unless the system is running at less than half capacity, response times will collapse and service queues will build up, causing quite a service hiccup.

In my view the single-brick approach is elegant and efficient, but the simple update model for the SMA is a flaw, in that any opcode cache accelerator which periodically halves system throughput should not be regarded as truly enterprise-strength. Server memory is now relatively cheap, and many templates exist for a shared pool which is stable and fast at up to, say, 80% memory utilisation.

Source script related I/O avoidance

No compile is required when the corresponding compiled script already exists in the SMA cache, so no I/O is required to access the source script unless OPcache has been configured to validate timestamps. However, the INCLUDE_OR_EVAL handlers also do file checking in certain circumstances, so OPcache also intercepts the zend_stream_open_function() and zend_resolve_path() calls, and the metadata from the SMA hash index is used in the case of cached scripts (sketched below). As a result, OPcache can be configured to avoid all source-script-related I/O.
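
Like the compile hook, these are global function pointers that can be saved and wrapped. The sketch below shows the pattern for zend_resolve_path; the accel_* names and the cached-metadata logic are illustrative, and the exact prototypes should be checked against the Zend headers of the PHP version in use:

    /* Wrap zend_resolve_path so that a cached script's resolved path can be
       answered from the SMA hash-index metadata without touching the file
       system. zend_stream_open_function is wrapped in the same way. */
    static char *(*accel_orig_resolve_path)(const char *filename, int filename_len TSRMLS_DC);

    static char *accel_resolve_path(const char *filename, int filename_len TSRMLS_DC)
    {
        /* if filename maps to a cached script, return a copy of the stored
           resolved path; otherwise fall back to the original resolver */
        return accel_orig_resolve_path(filename, filename_len TSRMLS_CC);
    }

    static void accel_install_io_hooks(void)
    {
        accel_orig_resolve_path = zend_resolve_path;
        zend_resolve_path       = accel_resolve_path;
    }
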

Supported SAPIs

The fundamental issue is that OPcache only delivers a performance benefit if the cached compiled scripts are used across multiple requests, as a result of:

  • the same process serving multiple requests for the same source file
  • different processes or threads each sharing an SMA cache and serving requests for the same file.
So most supported SAPIs achieve one or both of these. The cli and cgi SAPIs are exceptions in that they are nominally supported by OPcache, but these modes are single-request with a non-shared SMA, so whilst scripts are compiled, written to the SMA cache and then executed from the cache, the life of the cache is only that one request. This is really for testing purposes only, as there is no performance benefit in doing so.
The supported SAPIs are listed below, together with their SAPI id and whether they support multiple requests per image activation:

  • apache (SAPI id: apache; multi-request: Yes): The standard Apache 1.x handler.
  • apache2handler (SAPI id: apache2handler; multi-request: Yes): The standard Apache 2.0 handler. Note that this SAPI, like a few others, breaks the "rule" discussed in the previous section in that it uses a different activation mechanism: an Apache handler is established which catches the request, and this activates php_request_startup().
  • apache2filter (SAPI id: apache2filter; multi-request: Yes): An experimental Apache 2.0 handler giving extra hooks into the Apache 2 filters.
  • cli (SAPI id: cli-server; multi-request: No): Command Line Interface PHP.
  • fpm (SAPI id: fpm-fcgi; multi-request: Yes): FPM/FastCGI.
  • cgi (SAPI id: cgi-fcgi; multi-request: No/Yes): CGI/FastCGI. php-cgi is single-request unless it has been configured and built with --enable-fastcgi and started with the -b <address:port>|<port> option to bind a socket listener for external FastCGI server mode. In single-request mode, the CGI protocol is used, with the environment variables parsed to establish the request context. In FastCGI mode it acts as a server listening on the allocated socket for FastCGI protocol requests. Within this server mode, there are two sub-modes depending on the environment variable PHP_FCGI_CHILDREN: if this is non-zero then the image forks the specified number of children after SAPI startup, so that the children all share the startup context, including a shared SMA.
  • isapi (SAPI id: isapi; multi-request: No(*)): This ISAPI SAPI is a simple single-request SAPI. However, it runs inside IIS, which exploits multi-threading to achieve high performance, and hence requires TSRM and thread-safe extensions. Hence ISAPI for production PHP apps is often configured using an IIS FastCGI extension with php-cgi in FastCGI mode.
  • litespeed (SAPI id: litespeed; multi-request: No): LiteSpeed.

Having non-shared SMAs, that is one OPcache cache per PHP process, is a heavy memory burden that is usually impractical. The caches can't easily be shared across separately initiated processes on *nix platforms, as they must be mapped at the same absolute address range, and limitations of the mmap() function make it very difficult to guarantee that this occurs without overwriting other process memory.
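
By contrast, the fork-based route described earlier avoids this problem, because children forked after the SMA has been mapped inherit the shared mapping at the same address. A minimal, self-contained POSIX illustration follows; the size, worker count and structure are purely illustrative:

    /* Parent creates the SMA once; forked children inherit the MAP_SHARED
       mapping at the same virtual address, so absolute pointers inside it
       stay valid without needing MAP_FIXED in each child. Illustrative only. */
    #define _DEFAULT_SOURCE
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        size_t sma_size = 64 * 1024 * 1024;
        void  *sma = mmap(NULL, sma_size, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (sma == MAP_FAILED) {
            return 1;
        }
        for (int i = 0; i < 4; i++) {         /* e.g. pre-forked worker children */
            if (fork() == 0) {
                /* child: sma is valid here at the same address as in the parent;
                   it would serve requests using sma as the compiled-script cache */
                _exit(0);
            }
        }
        return 0;
    }
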

I discuss how these issues can be overcome in the page MLC OPcache details.

The Optimizer

The optimizer uses a zend_extension hook to examine each op_array during compilation (and not during the copy to the SMA), which is why OPcache must be a Zend extension rather than a PHP one. It is controlled by the opcache.optimization_level INI parameter. This is a bitmask, with 6 bits currently controlling optimization passes:

  • Flag = 0x001
    • Substitute persistent constants (true, false, null, etc)
    • Perform compile-time evaluation of constant binary and unary operations
    • Optimize series of ADD_STRING and/or ADD_CHAR
    • Convert CAST(IS_BOOL,x) into BOOL(x)
    • Convert INIT_FCALL_BY_NAME and DO_FCALL_BY_NAME into DO_FCALL
  • Flag = 0x002
    • Convert non-numeric constants to numeric constants in numeric operators
    • Optimize constant conditional JMPs
    • Optimize static BRKs and CONTs
  • Flag = 0x004
    • Optimize $i = $i+expr to $i+=expr
    • Optimize series of JMP opcodes
    • Change $i++ to ++$i where possible
  • Flag = 0x010
  • Flag = 0x100
    • Optimize temp variables usage
  • Flag = 0x200
    • Remove NOPs.
The Optimizer can reduce the size of op_arrays by a material percentage, say 10-15%. However, in my experience this does not result in a similar saving in runtime, because the opcodes that are removed are either unreachable or very cheap to execute.

One of the optimizations, changing $i++ to ++$i, is traditionally done by compilers generating native machine code, though with modern CPUs the benefits are marginal. However, it is pointless in the context of an interpreted VM, as the POST_INC handler takes exactly the same time to execute as the PRE_INC handler.