Is there a plan to support pipeline scheduling, i.e. emitting code into the available slots of parallel pipelines (e.g. with hyperthreading)? This does not just apply to Intel x86/x64, but to various kinds of processors that have multiple processing units, each with its own pipeline, and possibly with pipelines of different kinds chained together.
CPUs/GPUs/APUs today use extensive parallel pipelines, because processing a single instruction involves multi-cycle delays, and some stages may even have an unpredictable number of wait cycles: while one unit is waiting it does nothing, but other parallel slots may be doing useful work.
The idea is to track instruction dependencies (data paths, including general-purpose registers, FP registers, flags, and the atomicity of instructions that can be interrupted partway through and resumed after the interrupt is handled).
Basically, the idea is to handle multiple instruction flows in parallel and schedule each instruction into the right flow, counting the minimum number of cycles it needs to complete in its pipeline before that pipeline can accept another instruction.
The instructions are then arranged in a "grid", with one column per parallel flow and one row per cycle: a single instruction occupies several cells vertically in one column, and each instruction carries a dependency graph, i.e. it needs the data produced by the completion of one or more earlier instructions.
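To make this concrete, here is a minimal sketch in C of what such dependency tracking could look like. The structures and names are invented for illustration (nothing here comes from the existing emitter): for every modeled resource (GPRs plus one slot for the flags) the emitter remembers the last instruction that wrote it, and derives read-after-write and write-after-write edges from that.

```c
#include <stdio.h>

#define NUM_REGS  16                 /* modeled general-purpose registers  */
#define RES_FLAGS NUM_REGS           /* one extra slot modeling the flags  */
#define NUM_RES   (NUM_REGS + 1)
#define MAX_DEPS  4

/* Hypothetical instruction record: which resources it reads and writes,
 * and the minimum number of cycles before its result is available. */
typedef struct {
    const char *name;
    int reads[MAX_DEPS];  /* resource indices read, -1 = unused slot      */
    int writes;           /* resource index written, -1 = none            */
    int latency;
    int deps[MAX_DEPS];   /* indices of instructions this one depends on  */
    int ndeps;
} Insn;

/* For every resource remember the last instruction that wrote it; every
 * later reader (read-after-write) or writer (write-after-write) of that
 * resource gets a dependency edge back to it. */
static void build_deps(Insn *in, int n)
{
    int last_writer[NUM_RES];
    for (int r = 0; r < NUM_RES; r++) last_writer[r] = -1;

    for (int i = 0; i < n; i++) {
        in[i].ndeps = 0;
        for (int k = 0; k < MAX_DEPS; k++) {
            int r = in[i].reads[k];
            if (r >= 0 && last_writer[r] >= 0 && in[i].ndeps < MAX_DEPS)
                in[i].deps[in[i].ndeps++] = last_writer[r];
        }
        if (in[i].writes >= 0) {
            int w = in[i].writes;
            if (last_writer[w] >= 0 && in[i].ndeps < MAX_DEPS)
                in[i].deps[in[i].ndeps++] = last_writer[w];
            last_writer[w] = i;
        }
    }
}

int main(void)
{
    /* r2 = r0 + r1 ; r3 = r2 * r0 ; r4 = r0 - r1 (independent of the mul) */
    Insn prog[] = {
        { "add r2,r0,r1", { 0, 1, -1, -1 }, 2, 1, {0}, 0 },
        { "mul r3,r2,r0", { 2, 0, -1, -1 }, 3, 3, {0}, 0 },
        { "sub r4,r0,r1", { 0, 1, -1, -1 }, 4, 1, {0}, 0 },
    };
    build_deps(prog, 3);
    for (int i = 0; i < 3; i++) {
        printf("%-14s depends on:", prog[i].name);
        for (int d = 0; d < prog[i].ndeps; d++)
            printf(" [%s]", prog[prog[i].deps[d]].name);
        printf("\n");
    }
    return 0;
}
```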
In compilers, this can be used to schedule the instructions and then merge the flows back into a single stream in which instructions may be delayed or freely reordered within some maximum distance (the height of the grid, which is roughly on the order of the number of stages in a CPU/GPU/APU pipeline).
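Continuing the sketch with equally hypothetical structures (again not the project's IR), a greedy list scheduler could fill such a grid and then flatten it back into one stream ordered by issue cycle:

```c
#include <stdio.h>

#define MAX_DEPS  4
#define NUM_FLOWS 4     /* columns of the grid: parallel instruction flows */

typedef struct {
    const char *name;
    int deps[MAX_DEPS]; /* indices of producer instructions                */
    int ndeps;
    int latency;        /* rows the instruction occupies in its column     */
    int cycle;          /* assigned row (issue cycle), set by schedule()   */
    int flow;           /* assigned column, set by schedule()              */
} Insn;

/* Greedy list scheduling: each flow records the first cycle at which it is
 * free again; each instruction is placed at the earliest cycle where all of
 * its producers have completed and the least-loaded flow is available. */
static void schedule(Insn *in, int n)
{
    int flow_free[NUM_FLOWS] = {0};
    for (int i = 0; i < n; i++) {
        int ready = 0;
        for (int d = 0; d < in[i].ndeps; d++) {
            const Insn *p = &in[in[i].deps[d]];
            if (p->cycle + p->latency > ready)
                ready = p->cycle + p->latency;   /* wait for producer result */
        }
        int f = 0;                               /* pick the least-loaded flow */
        for (int c = 1; c < NUM_FLOWS; c++)
            if (flow_free[c] < flow_free[f]) f = c;
        in[i].cycle = ready > flow_free[f] ? ready : flow_free[f];
        in[i].flow  = f;
        flow_free[f] = in[i].cycle + in[i].latency;
    }
}

/* Flatten the grid back into one stream, ordered by issue cycle then flow.
 * A real emitter would schedule one fixed-size window at a time, so that no
 * instruction moves further than the height of the grid. */
static void emit(const Insn *in, int n)
{
    int maxc = 0;
    for (int i = 0; i < n; i++)
        if (in[i].cycle > maxc) maxc = in[i].cycle;
    for (int c = 0; c <= maxc; c++)
        for (int f = 0; f < NUM_FLOWS; f++)
            for (int i = 0; i < n; i++)
                if (in[i].cycle == c && in[i].flow == f)
                    printf("cycle %2d  flow %d  %s\n", c, f, in[i].name);
}

int main(void)
{
    /* Two dependent pairs: the scheduler overlaps the two loads so that
     * neither add has to wait behind the other pair's latency. */
    Insn prog[] = {
        { "load a",     {0}, 0, 3, 0, 0 },
        { "add  a,a,1", {0}, 1, 1, 0, 0 },
        { "load b",     {0}, 0, 3, 0, 0 },
        { "add  b,b,1", {2}, 1, 1, 0, 0 },
    };
    schedule(prog, 4);
    emit(prog, 4);
    return 0;
}
```

With these latencies both loads issue at cycle 0 in separate flows, and both adds issue at cycle 3, instead of the second pair waiting behind the first.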
This in turn lets compilers keep the CPU from making so many wrong guesses in branch prediction and out-of-order execution. It can also reduce exposure to timing issues like Spectre and Meltdown, which are caused by mispredicted branches: even when an instruction's execution is anticipated, it has a lower chance of being cancelled and restarted with a higher latency, a latency difference that otherwise becomes measurable and detectable when compared with the time taken by instructions executed after correct guesses (which complete faster, with the lowest latency).
Such scheduling could also reduce energy consumption, because it would reduce the cycles a CPU spends speculating on dependencies that later turn out to be wrong guesses and force execution to restart from the first stage of the pipeline. On CPUs with hyperthreading, NUMA topology and many parallel cores, this can dramatically improve throughput. Modern CPUs devote a lot of resources to branch prediction, and spend too many cycles anticipating the result of a dependency before it can actually be checked, so many cycles are wasted on computation that is ultimately abandoned. The problem is that this speculative execution monopolizes important processing units (an ALU, an FPU, a VPU) that would have been better used by other competing instructions waiting for a slot.
Instruction scheduling can avoid most of these wasted cycles and help improve the security of computing systems against timing attacks, because throughput becomes more linear and latency is stabilized. Overall performance is also largely improved.
Instruction scheduling is especially useful in JIT compilers for highly dynamic languages like Lua, where branch prediction is much harder: separating mutually dependent instructions by inserting enough independent instructions between them lets more instructions complete while the interleaved instructions run in their own separate pipelines of the processor. Branch mispredictions become much less likely, and there are fewer wait cycles in all pipelines.
It is possible to instantiate a JIT compiler so that it allows a maximum of N pipelines and schedules all instructions into those N pipelines. If the hardware processor turns out to have fewer pipelines, the optimization is not lost: all the extra emulated pipelines do is eliminate data dependencies and wait cycles, and avoid many wrong guesses by the processor's out-of-order execution.
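As a toy illustration of that last claim (hypothetical instruction records, nothing from the actual code base), even a single-issue in-order machine, i.e. hardware with fewer pipelines than the scheduler assumed, still runs the interleaved order faster than the naive order:

```c
#include <stdio.h>

#define MAX_DEPS  2
#define MAX_INSNS 64

typedef struct {
    const char *name;
    int deps[MAX_DEPS];   /* producer indices, relative to this array       */
    int ndeps;
    int latency;          /* cycles until the result is ready               */
} Insn;

/* Toy in-order machine with a single issue slot: one instruction per cycle,
 * stalling until every operand of the next instruction is ready.
 * Returns the cycle at which the last result becomes available. */
static int run(const Insn *in, int n)
{
    int ready[MAX_INSNS];
    int cycle = 0, last = 0;
    for (int i = 0; i < n; i++) {
        for (int d = 0; d < in[i].ndeps; d++)
            if (ready[in[i].deps[d]] > cycle)
                cycle = ready[in[i].deps[d]];   /* stall on unfinished producer */
        ready[i] = cycle + in[i].latency;
        if (ready[i] > last) last = ready[i];
        cycle++;                                /* single issue slot per cycle  */
    }
    return last;
}

int main(void)
{
    /* Naive order: each add waits right behind its load (latency 3). */
    Insn naive[] = {
        { "load a",     {0}, 0, 3 },
        { "add  a,a,1", {0}, 1, 1 },
        { "load b",     {0}, 0, 3 },
        { "add  b,b,1", {2}, 1, 1 },
    };
    /* Interleaved order: the second load fills the first load's delay. */
    Insn mixed[] = {
        { "load a",     {0}, 0, 3 },
        { "load b",     {0}, 0, 3 },
        { "add  a,a,1", {0}, 1, 1 },
        { "add  b,b,1", {1}, 1, 1 },
    };
    printf("naive order:       %d cycles\n", run(naive, 4));
    printf("interleaved order: %d cycles\n", run(mixed, 4));
    return 0;
}
```

Here the naive order takes 8 cycles and the interleaved order 5, so a schedule built for more virtual pipelines than the hardware actually has still pays off.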
This optimization is even more effective when a CPU has a large register file. The 8086 had very few registers and many dependencies between nearby instructions, but modern x64 or ARM64 processors (and even more so VPUs, and dramatically more so GPUs) make it possible to build instruction flows with many more data paths. Yet we still do not exploit this fully by scheduling instructions into multiple flows: a higher number of available data paths should also mean a higher number of instruction flows, so that we maximize the number of data paths successfully parallelized.
All processors today (except the old CPUs of the 1980s) can benefit from this, including RISC processors built with no or very basic branch prediction precisely in order to offer more data paths and a higher number of instructions that can run in parallel without mutual dependencies, such as "vector" instructions: MMX, SSE, GPU...
So I wonder whether you have a way to improve the JIT compiler so that its "code emitter" allocates and schedules instructions into more virtual pipelines. Note that the virtual pipelines do not need to match the actual pipelines of the hardware processor: for CPU instruction sets, emulating about 16 or 32 pipelines in the code emitter generally gives good results; for VPU (MMX/SSE) and GPU instruction sets you may need more, typically about 256 virtual pipelines.
Reducing the number of virtual pipelines (e.g. to only 8) in the emitted code is only needed for privileged code that requires the smallest latency and the fastest response. This is rare and should only concern a few critical internal exception handlers in an OS, or hardware drivers that need a near-"real-time" response and run with higher priority (in ring 0); such code must also ensure that its high priority (which causes extra delays in concurrent non-privileged code) does not open timing attacks, the critical part of this privileged code destroying the predictability of the delays experienced by concurrent non-critical unprivileged threads by forcing them to wait more randomly.
A good scheduler in the code emitter can also avoid many cases of resource starvation, where a resource is an execution slot on a processing unit, a memory access, or access to the L1/L2/L3 caches (or even network caches, in the case of distributed computing, which is the next target of timing attacks trying to steal private data worldwide through time-based covert channels that are now possible over fast Internet connections of 1 Gbps or more).