Optimizing VMCS reads/writes with a new cache implementation #130

AlexAltea · 2018-11-18T16:25:18Z

Disclaimer: This patch isn't finished, but it's available for early reviews/discussion.

Motivation

During development, running HAXM in a virtual machine protects your "bare-metal" host system from kernel panics if runtime errors occur. The downside of nested virtualization is that during L2-guest VM-exit events, the L1-hypervisor (i.e. HAXM) might need to perform actions (e.g. vmread/vmwrite) that trigger another exit to the L0-hypervisor (i.e. KVM, VMware, etc.). These recursive exits add latency, so it's worth minimizing them.

While booting a few kernels using both HAXM and KVM as L1-hypervisor under the same conditions, HAXM appears to be 4-5 times slower than KVM (related: #40). After measuring the number of vmreads/vmwrites between guest-exit and guest-enter events, I noticed that HAXM's count is (unsurprisingly) 4-5 times higher. Specifically:

KVM: Does 6-10 reads and 2-6 writes on average. ~~Does 4-8 reads and 0-4 writes on average (depends on VMX exit handler).~~
HAXM: Does 32-33 reads and 21-23 writes on average. ~~Does 32 reads and 1 write (nearly always!).~~

Here's the raw data, which corresponds to roughly the first few seconds of trying to boot a Windows 7 install disc: vmx-measurements.zip ~~(haxm-vmx.log and kvm-vmx.log)~~. Efforts at 96af3d2 (thanks @junxiaoc!) have improved this situation, but there's still room for improvement.

Changes

Previous efforts, and other hypervisors like KVM, have minimized vmread/vmwrite usage by caching the component values with hand-written helper functions. While this works, it doesn't scale well to the 120+ components. Plus, it's not obvious what should be cached: caching unused components is counterproductive. Fine-tuning the set of cached components in manually implemented code is too time-consuming.

Instead, this patch relies on preprocessor X-Macros to automatically create the cache structures, alongside cached/dirty flags and helper functions to access all VMCS components. This Godbolt snippet lets you inspect the preprocessor output for a clear view of the resulting code (ideally, apply clang-format).
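For readers unfamiliar with the technique, here is a minimal sketch of an X-Macro-generated cache. It is not the code from this patch: the component list, `struct vmcs_cache`, the `hw_vmread`/`hw_vmwrite` stand-ins (a plain array replaces the real instructions, which need VMX root mode) and every helper name are assumptions made for illustration. Only the `COMP(cache_r, cache_w, width, name)` argument order mirrors the macro reviewed later in this thread, and `cache_r` is carried but ignored here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical component list: COMP(cache_r, cache_w, width, name). */
#define VMCS_COMPS(COMP)            \
    COMP(1, 1, 64, guest_rip)       \
    COMP(1, 1, 64, guest_rflags)    \
    COMP(1, 0, 32, vm_exit_reason)

/* Simulated VMCS: an array indexed by a generated enum. */
enum {
#define COMP_INDEX(r, w, width, name) IDX_##name,
    VMCS_COMPS(COMP_INDEX)
#undef COMP_INDEX
    IDX_COUNT
};
static uint64_t fake_vmcs[IDX_COUNT];
static unsigned vmread_ctr, vmwrite_ctr;
static uint64_t hw_vmread(int idx) { vmread_ctr++; return fake_vmcs[idx]; }
static void hw_vmwrite(int idx, uint64_t v) { vmwrite_ctr++; fake_vmcs[idx] = v; }

/* Generated cache: a value slot plus cached/dirty flags per component. */
struct vmcs_cache {
#define COMP_FIELDS(r, w, width, name) \
    uint##width##_t name; uint8_t name##_cached; uint8_t name##_dirty;
    VMCS_COMPS(COMP_FIELDS)
#undef COMP_FIELDS
};

/* Write helpers exist only for components whose cache_w flag is 1. */
#define COMP_WRITE_0(width, name)
#define COMP_WRITE_1(width, name)                                          \
    static inline void vmcs_write_##name(struct vmcs_cache *c,             \
                                         uint##width##_t v) {              \
        c->name = v;                                                       \
        c->name##_cached = 1;                                              \
        c->name##_dirty = 1;                                               \
    }
/* Read helpers hit the (simulated) hardware only on a cache miss. */
#define COMP_ACCESSORS(r, w, width, name)                                  \
    static inline uint##width##_t vmcs_read_##name(struct vmcs_cache *c) { \
        if (!c->name##_cached) {                                           \
            c->name = (uint##width##_t)hw_vmread(IDX_##name);              \
            c->name##_cached = 1;                                          \
        }                                                                  \
        return c->name;                                                    \
    }                                                                      \
    COMP_WRITE_##w(width, name)
VMCS_COMPS(COMP_ACCESSORS)

/* Before VM entry: write back every dirty, write-cached component. */
#define COMP_FLUSH_0(name)
#define COMP_FLUSH_1(name)                    \
    if (c->name##_dirty) {                    \
        hw_vmwrite(IDX_##name, c->name);      \
        c->name##_dirty = 0;                  \
    }
#define COMP_FLUSH(r, w, width, name) COMP_FLUSH_##w(name)
static inline void vmcs_flush(struct vmcs_cache *c) { VMCS_COMPS(COMP_FLUSH) }

/* After VM exit: cached values may be stale, so drop them. */
#define COMP_INVALIDATE(r, w, width, name) c->name##_cached = 0;
static inline void vmcs_invalidate(struct vmcs_cache *c) {
    VMCS_COMPS(COMP_INVALIDATE)
}

/* Usage: two reads cost one vmread; two writes cost one vmwrite at flush. */
static inline void demo(void) {
    struct vmcs_cache c = {0};
    fake_vmcs[IDX_guest_rip] = 0x1000;
    vmread_ctr = vmwrite_ctr = 0;
    (void)vmcs_read_guest_rip(&c);
    (void)vmcs_read_guest_rip(&c);
    assert(vmread_ctr == 1);
    vmcs_write_guest_rip(&c, 0x2000);
    vmcs_write_guest_rip(&c, 0x2004);
    vmcs_flush(&c);
    assert(vmwrite_ctr == 1 && fake_vmcs[IDX_guest_rip] == 0x2004);
}
```

Adding or removing a cached component in this sketch is a one-line change to `VMCS_COMPS`, which is the property that makes fine-tuning the cached set cheap.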

Summarizing the changes, this patch:

Moves some VMX-specific definitions to vmx.h (required by other changes).
Adds VMCS-cache macros to vmx.h.
Optimizes code by using new VMCS-cache macros: Removed most vmcs_pending* fields and unnecessary vmread's at the end of cpu_vmx_execute.

Benchmarks

Time elapsed between starting a virtual machine with a Windows XP SP2 installation disc and the installer showing the "welcome screen", on an Ubuntu 18.04 (+ Linux 4.17) host. This is a particularly bad scenario that shows up to 10x slowdown. (Disclaimer: time measured manually with my phone.)

KVM: 45.4 seconds.
HAXM (before): 532.5 seconds.
HAXM (after): TODO.

Pending

Using VMCS-cache for VCPU state reads/writes (huge performance impact!).
Replacing remaining vmread/vmwrite's.

raphaelning

Thanks! You've identified an important problem, and I like your systematic approach to solving it. I'm anxious to see how much performance improvement this PR will eventually make, and maybe not just for the nested virtualization use case :-)

So far I've only finished reviewing the first commit, and please allow me more time for the rest. Meanwhile, I have a question about the VMREAD/VMWRITE data you collected, which I'll ask in a separate post.

raphaelning · 2018-11-20T07:42:32Z

include/vcpu_state.h

@@ -187,7 +173,6 @@ struct vcpu_state_t {

    uint32_t _activity_state;
    uint32_t pad;
-    interruptibility_state_t _interruptibility_state;


I agree with the removal of this unused field, but vcpu_state_t is part of the HAXM API, so let's handle this with care:

Replace this field with a placeholder: uint64_t pad2;, so HAXM and QEMU always agree on the size of vcpu_state_t.

Update api.md accordingly.

Prepare a corresponding patch for QEMU target/i386/hax-interface.h.
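The layout constraint behind these steps can be checked at compile time. The sketch below uses hypothetical excerpt structs rather than the real `vcpu_state_t`, and it assumes the removed field is 8 bytes wide, which is what the suggested `uint64_t pad2` placeholder implies; the stand-in union is not HAXM's actual type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the removed field occupies 8 bytes. Stand-in type only. */
typedef union { uint64_t raw; } interruptibility_state_t;

/* Hypothetical excerpt of the layout before the change. */
struct state_tail_old {
    uint32_t _activity_state;
    uint32_t pad;
    interruptibility_state_t _interruptibility_state;
};

/* Same excerpt with the unused field replaced by a placeholder. */
struct state_tail_new {
    uint32_t _activity_state;
    uint32_t pad;
    uint64_t pad2;
};

/* Both sides of the API must keep agreeing on size and field offsets. */
static_assert(sizeof(struct state_tail_old) == sizeof(struct state_tail_new),
              "placeholder must preserve the struct size");
static_assert(offsetof(struct state_tail_old, _interruptibility_state) ==
              offsetof(struct state_tail_new, pad2),
              "placeholder must sit at the old field's offset");
```

A check of this kind in the shared header would fail the build if a later edit changed the size on only one side of the interface.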

Good point, thank you! Fixed in 5636f62. I'll submit a QEMU patch after merging this.
Please ignore the build issue: it's due to something else that has been fixed in #133.

raphaelning · 2018-11-20T08:32:15Z

First of all, let me share the "decoder" you sent me offline, in case other people are wondering how to interpret the logs:

$(guest rip) <= $(vmx exit reason)!
delta: $(absolute difference between rdtsc after guest exit and before guest enter)
vmread_ctr: $(number of vmread’s between guest exit and guest enter)
vmwrite_ctr: $(number of vmwrite’s between guest exit and guest enter)

I'm just surprised by how small vmwrite_ctr is: indeed, for HAXM it's almost always 1 or 2, although 9 also appears once in haxm-vmx.log.

But if you look at vcpu_save_host_state() (core/vcpu.c) alone, which is called before every VM entry, there are already 14 unconditional VMWRITEs! So is vcpu_save_host_state() not covered by vmwrite_ctr? When do you increment this counter, and when do you reset it?

For the ease of discussion, let me provide an outline of the current VM entry-exit workflow:

cpu_vmx_execute() {
    while (1) {
        ...
        load_vmcs()  // VMXON + VMPTRLD
        VMWRITE compA  // Saves host state, flushes dirty guest state (RIP, RFLAGS, etc.), updates VMX controls, etc.
        VMWRITE compB
        VMWRITE compC
        …  // There are some VMREADs as well
        asm_vmxrun()  // VM entry (VMLAUNCH)
        // VM exit
        VMREAD compX  // Caches VM exit info and some guest state (RIP, RFLAGS, etc.)
        VMREAD compY
        VMREAD compZ
        …  // Are there any VMWRITEs as well?
        put_vmcs()  // VMCLEAR + VMXOFF
        cpu_vmexit_handler()  {  // Handler specific to the VM exit reason
            // Ideally, no VMREAD or VMWRITE should be issued from here, since VMX is off (requiring an implicit load_vmcs()).
            // Instead, VMCS accesses are redirected to our cache.
            // PR #117 should have eliminated most of such expensive VMREAD/VMWRITE calls, if not all.
        }
    }
}

AlexAltea · 2018-11-20T17:43:06Z

> I'm just surprised by how small vmwrite_ctr is [...] So is vcpu_save_host_state() not covered by vmwrite_ctr? When do you increment this counter, and when do you reset it?

Nice catch! On HAXM, the counter is incremented in vmread/vmx_vmwrite, which is fine since on 64-bit each will generate one single vmread/vmwrite instruction. The KVM counters also increment correctly.

However, on HAXM, I had placed the counter display/reset before and after cpu_vmx_run, respectively, without realizing that more accesses occurred inside this function. I've updated my code, and the display/reset now wraps asm_vmxrun instead. A similar issue happened with KVM, which is fixed too.

I repeated the benchmarks for the unpatched code, and these are the updated results:

KVM: Does 6-10 reads and 2-6 writes on average.
HAXM: Does 32-33 reads and 21-23 writes on average.

The updated full logs (and the diff's to replicate them!) are available at: vmx-measurements.zip.
The data corresponds to booting QEMU with HAXM/KVM without any disc (just BIOS).
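The measurement window described above can be sketched as follows. This is not the instrumentation diff from vmx-measurements.zip: the wrappers, `struct exit_window`, `fake_vm_entry` and the access counts are hypothetical. The point it illustrates is that the report/reset pair has to sit immediately around the VM-entry call, otherwise accesses issued between the instrumentation point and the actual entry fall outside the window.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

static unsigned vmread_ctr, vmwrite_ctr;

/* Stand-ins for the real accessors; each would issue one instruction. */
static uint64_t counted_vmread(uint32_t comp) {
    (void)comp;
    vmread_ctr++;
    return 0;
}
static void counted_vmwrite(uint32_t comp, uint64_t value) {
    (void)comp;
    (void)value;
    vmwrite_ctr++;
}

struct exit_window { unsigned reads, writes; };

/* Immediately before VM entry: report what accumulated since the last exit. */
static struct exit_window window_report(void) {
    struct exit_window w = { vmread_ctr, vmwrite_ctr };
    printf("vmread_ctr: %u\nvmwrite_ctr: %u\n", w.reads, w.writes);
    return w;
}

/* Immediately after VM exit: start counting the next window. */
static void window_reset(void) { vmread_ctr = vmwrite_ctr = 0; }

static void fake_vm_entry(void) { /* asm_vmxrun() would run the guest here */ }

/* One iteration of a simplified execute loop (counts are illustrative). */
static struct exit_window run_once(void) {
    for (int i = 0; i < 14; i++)      /* e.g. host-state saves before entry */
        counted_vmwrite((uint32_t)i, 0);
    struct exit_window w = window_report();
    fake_vm_entry();
    window_reset();
    for (int i = 0; i < 3; i++)       /* e.g. exit info read after the exit */
        counted_vmread((uint32_t)i);
    return w;
}
```

Calling `run_once()` twice makes the second report include both the post-exit reads of the first iteration and the pre-entry writes of the second, which is the exit-to-enter interval the logs are meant to capture.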

- Replaced vmcs_pending* fields with the new automatically-generated function that updates VMCS based on vmcs_cache_w flags. Only `vmcs_pending_guest_cr3` has been preserved since it requires extra logic.
- Removed unnecessary vmread's at the end of `cpu_vmx_execute` that are not always required. Corresponding members in `vcpu_vmx_data` have been removed and functions depending on them now call `vmcs_read`.

Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>

core/include/vcpu.h

core/include/vmx.h

raphaelning · 2018-11-21T10:49:09Z

core/vmx.c

+#define COMP_PENDING(cache_r, cache_w, width, name) \
+    COMP_PENDING_##cache_w(name)
+
+void vcpu_handle_vmcs_pending(struct vcpu_t* vcpu)


You have changed the semantics of this function, but not where it is called. Now that it has the responsibility of flushing all the dirty VMCS values, we should delay it as much as possible, in case vmcs_write gets called again after the flush but before VM entry (asm_vmxrun). Probably we need to draw a line somewhere in cpu_vmx_execute or in cpu_vmx_run:

// Cached vmcs_write() is preferred over uncached vmwrite()
vmcs_write(..);
vmcs_write(..);
...
vcpu_handle_vmcs_pending(vcpu);
// *** No cached vmcs_write allowed after this point ***
vmwrite(..);
vmwrite(..);
...
// VM entry
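One possible way to enforce such a line in code is to have the flush "seal" the cache until the next VM exit, so that a late cached write is rejected instead of being silently dropped. The sketch below is only an illustration of that idea: the `sealed` flag, `struct rip_cache` and all function names are assumptions, not part of this patch, and a plain variable stands in for the VMCS field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static uint64_t hw_guest_rip;   /* stand-in for the VMCS field */

struct rip_cache {
    uint64_t value;
    bool dirty;
    bool sealed;                /* true between the flush and the VM exit */
};

/* Cached write: refused once the flush has happened, so the caller
 * notices and falls back to an uncached vmwrite. */
static bool cached_write(struct rip_cache *c, uint64_t v) {
    if (c->sealed)
        return false;
    c->value = v;
    c->dirty = true;
    return true;
}

/* Flush dirty state to the (simulated) VMCS and draw the line. */
static void flush_pending(struct rip_cache *c) {
    if (c->dirty) {
        hw_guest_rip = c->value;
        c->dirty = false;
    }
    c->sealed = true;
}

/* After VM exit, cached writes are allowed again. */
static void after_vm_exit(struct rip_cache *c) { c->sealed = false; }
```

In a debug build the rejected write could instead trigger an assertion, which would surface any `vmcs_write` call that lands between the flush and `asm_vmxrun`.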

raphaelning · 2018-11-21T10:59:41Z

> The updated full logs (and the diff's to replicate them!) are available at: vmx-measurements.zip.

Thanks, the new data is plausible. Clearly there's a lot of room for improvement for HAXM, to look on the bright side :-)

AlexAltea · 2018-11-21T11:04:14Z

I've fixed some of the issues you mentioned. To make reviewing easier, I'll avoid rewriting the commit history until this is ready to merge. Then, I'll squash and re-sign everything into the first three commits, i.e.:

Moved VMX-specific definitions to vmx.h
Added VMCS-cache macros
Optimize code by using new VMCS-cache macros

AlexAltea · 2018-11-22T02:47:07Z

EDIT: Sorry, I clicked the wrong button and ended up closing/reopening the pull request.

nevilad · 2019-12-17T13:31:13Z

What's the status of this PR? My Windows 7 guest boots and runs pretty slowly compared to VMware, and I want to do some enhancements to speed it up. This PR is a good starting point.

wcwang · 2019-12-23T08:06:57Z

Thanks for the comments, @nevilad. I noticed that @AlexAltea added new commits a month ago, and @AlexAltea once stated that the pull request was not complete. @AlexAltea, could you help rebase the current pull request onto the head of the master branch first? Then we can consider proceeding with it, and @nevilad can start the enhancements based on it. Thanks to both of you.

nevilad · 2020-01-06T20:07:36Z

> I noticed that @AlexAltea added new commits a month ago.

I don't see any new comments; the latest is from November 2018.

AlexAltea force-pushed the vmcs-refactor branch from b938039 to ba7cd13 Compare November 20, 2018 00:12

raphaelning reviewed Nov 20, 2018

AlexAltea added 4 commits November 21, 2018 11:37

Moved VMX-specific definitions to vmx.h

7562d10

- Moved vcpu_vmx_data and interruptibility_state_t from {vcpu,vcpu_state}.h to vmx.h.
- Removed vcpu_state_t::interruptibility_state_t since it's not used.

Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>

Added VMCS-cache macros

4247f6c

Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>

Fixed padding in vcpu_state_t

5636f62

Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>

AlexAltea force-pushed the vmcs-refactor branch from ad2dd49 to 5636f62 Compare November 21, 2018 10:37

raphaelning reviewed Nov 21, 2018

Fixed style issues and removed unnecessary fields

3a3f953

Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>

AlexAltea added 2 commits November 21, 2018 06:41

Fixed unnecessary writes to HOST_RIP

7693efb

Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>

Fixed missing read-cache flush after guest-exit

4b39e35

Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>

AlexAltea closed this Nov 22, 2018

AlexAltea reopened this Nov 22, 2018

AlexAltea added 2 commits November 22, 2018 06:00

Optimization: Cached segment reads

46fc754

Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>

Optimization: Cached RIP reads

8e49240

Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>

AlexAltea force-pushed the vmcs-refactor branch from aa20c04 to 8e49240 Compare November 22, 2018 14:07

Optimization: Cached RSP/RFLAGS reads

989eea9

Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>

AlexAltea force-pushed the vmcs-refactor branch from c12dd15 to 689c04d Compare November 23, 2018 12:33

Optimization: Cached guest interruptibility/exit qualifier reads

61779ca

Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>

AlexAltea force-pushed the vmcs-refactor branch from 689c04d to 61779ca Compare November 23, 2018 13:02

AlexAltea added 2 commits November 23, 2018 05:52

Optimization: Cached CR reads

b6733f4

Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>

Optimization: Moved tunnel updates outside guest-execution loop

b00b798

Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>

AlexAltea force-pushed the vmcs-refactor branch from e616045 to b00b798 Compare November 23, 2018 14:26

raphaelning mentioned this pull request Jan 10, 2019

Fixed PAE issues #152

Merged

HaxmCI added the CI:Build Pass label Jun 20, 2019

HaxmCI added the CI:Mac Test Pass label Jun 20, 2019

intel deleted a comment from xianxiaoyin Aug 23, 2019

intel deleted a comment from yiqisiben Aug 23, 2019

wcwang mentioned this pull request Feb 12, 2020

HAXM is about 10x slower than kvm in Nested virtualization environment #264

Open

nevilad mentioned this pull request Apr 16, 2020

qemu + haxm does not work to boot and run Mac OS X High Sierra #149

Open

HaxmCI added CI:Build Fail and removed CI:Build Pass CI:Mac Test Pass labels May 24, 2021

wcwang force-pushed the master branch from 44f450a to 4e0e920 Compare November 6, 2022 11:22

wcwang force-pushed the master branch 2 times, most recently from 563eb1b to 6b942e3 Compare November 25, 2022 03:23

wcwang force-pushed the master branch 2 times, most recently from b73a231 to da1b8ec Compare January 26, 2023 02:48
