Skip to content
This repository has been archived by the owner on Jan 28, 2023. It is now read-only.

Optimizing VMCS reads/writes with a new cache implementation #130

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

AlexAltea
Copy link
Contributor

@AlexAltea AlexAltea commented Nov 18, 2018

Disclaimer: This patch isn't finished, but it's available for early reviews/discussion.

Motivation

During development, running HAXM in a virtual machine prevents your "bare-metal" host system from kernel panics if runtime errors occur. The downside of nested virtualization, is that during L2-guest VM-exit events, the L1-hypervisor, i.e. HAXM, might need to perform actions, e.g. vmread/vmwrite, that trigger another exit to the L0-hypervisor, i.e. KVM, VMware, etc. These recursive exits add latency so it's convenient to minimize them.

While booting few kernels using both HAXM and KVM as L1-hypervisor, under the same conditions, HAXM appears to be 4-5 times slower than KVM (related: #40). After measuring the amount of vmread/vmwrite's between guest-exit and guest-enter events, I've noticed HAXM amount is (unsurprisingly) 4-5 times higher. Specifically:

  • KVM: Does 6-10 reads and 2-6 writes on average.
    KVM: Does 4-8 reads and 0-4 writes on average (depends on VMX exit handler).

  • HAXM: Does 32-33 reads and 21-23 writes on average.
    HAXM: Does 32 reads and 1 write (nearly always!).

Here's the raw data, corresponds to around the first few seconds of trying to boot a Windows 7 install disc: vmx-measurements.zip (haxm-vmx.log and kvm-vmx.log). Efforts at 96af3d2 (thanks @junxiaoc!) have improved this situation, but there's still room for improvement.

Changes

Previous efforts, and other hypervisors like KVM, have minimized vmread/vmwrite usage by caching the component values with hand-written helper functions. While this works, it doesn't scale well for the 120+ components. Plus, it's not trivial what should be cached: caching unused components is counterproductive. Fine-tuning the set of cached components when in manually-implemented code is too time-consuming.

Instead, this patch relies in preprocessor X-Macros to automatically create the cache structures, alongside cached/dirty flags and helper functions to access all VMCS components. This Godbolt snippet allows to inspect the preprocessor output for clear view of the resulting code (ideally, apply clang-format).

Summarizing the changes, this patch:

  • Moves some VMX-specific definitions to vmx.h (required by other changes).
  • Adds VMCS-cache macros to vmx.h.
  • Optimizes code by using new VMCS-cache macros: Removed most vmcs_pending* fields and unnecessary vmread's at the end of cpu_vmx_execute.

Benchmarks

Measuring time elapsed between booting a virtual machine with a Windows XP SP2 installation disc, until the installer shows the "welcome screen", using an Ubuntu 18.04 (+ Linux 4.17) host. This is a particularly bad scenario that shows up to x10 slowdown. (Disclaimer: time measured manually with my phone).

  • KVM: 45.4 seconds.
  • HAXM (before): 532.5 seconds.
  • HAXM (after): TODO.

Pending

  • Using VMCS-cache for VCPU state reads/writes (huge performance impact!).
  • Replacing remaining vmread/vmwrite's.

Copy link
Contributor

@raphaelning raphaelning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! You've identified an important problem, and I like your systematic approach to solving it. I'm anxious to see how much performance improvement this PR will eventually make, and maybe not just for the nested virtualization use case :-)

So far I've only finished reviewing the first commit, and please allow me more time for the rest. Meanwhile, I have a question about the VMREAD/VMWRITE data you collected, which I'll ask in a separate post.

@@ -187,7 +173,6 @@ struct vcpu_state_t {

uint32_t _activity_state;
uint32_t pad;
interruptibility_state_t _interruptibility_state;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the removal of this unused field, but vcpu_state_t is part of the HAXM API, so let's handle this with care:

  1. Replace this field with a placeholder: uint64_t pad2;, so HAXM and QEMU always agree on the size of vcpu_state_t.
  2. Update api.md accordingly.
  3. Prepare a corresponding patch for QEMU target/i386/hax-interface.h.

Copy link
Contributor Author

@AlexAltea AlexAltea Nov 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, thank you! Fixed in 5636f62. I'll submit a QEMU patch after merging this.
Please ignore the build issue: it's due to something else that has been fixed in #133.

@raphaelning
Copy link
Contributor

First of all, let me share the "decoder" you sent me offline, in case other people are wondering how to interpret the logs:

$(guest rip) <= $(vmx exit reason)!
delta: $(absolute difference between rdtsc after guest exit and before guest enter)
vmread_ctr: $(number of vmread’s between guest exit and guest enter)
vmwrite_ctr: $(number of vmwrite’s between guest exit and guest enter)

I'm just surprised by how small vmwrite_ctr is: indeed, for HAXM it's almost always 1 or 2, although 9 also appears once in haxm-vmx.log.

But if you look at vcpu_save_host_state() (core/vcpu.c) alone, which is called before every VM entry, there are already 14 unconditional VMWRITEs! So is vcpu_save_host_state() not covered by vmwrite_ctr? When do you increment this counter, and when do you reset it?

For the ease of discussion, let me provide an outline of the current VM entry-exit workflow:

cpu_vmx_execute() {
    while (1) {
        ...
        load_vmcs()  // VMXON + VMPTRLD
        VMWRITE compA  // Saves host state, flushes dirty guest state (RIP, RFLAGS, etc.), updates VMX controls, etc.
        VMWRITE compB
        VMWRITE compC
        …  // There are some VMREADs as well
        asm_vmxrun()  // VM entry (VMLAUNCH)
        // VM exit
        VMREAD compX  // Caches VM exit info and some guest state (RIP, RFLAGS, etc.)
        VMREAD compY
        VMREAD compZ
        …  // Are there any VMWRITEs as well?
        put_vmcs()  // VMCLEAR + VMXOFF
        cpu_vmexit_handler()  {  // Handler specific to the VM exit reason
            // Ideally, no VMREAD or VMWRITE should be issued from here, since VMX is off (requiring an implicit load_vmcs()).
            // Instead, VMCS accesses are redirected to our cache.
            // PR #117 should have eliminated most of such expensive VMREAD/VMWRITE calls, if not all.
        }
    }
}

@AlexAltea
Copy link
Contributor Author

AlexAltea commented Nov 20, 2018

I'm just surprised by how small vmwrite_ctr is [...] So is vcpu_save_host_state() not covered by vmwrite_ctr? When do you increment this counter, and when do you reset it?

Nice catch! On HAXM, the counter is incremented in vmread/vmx_vmwrite, which is fine since on 64-bit each will generate one single vmread/vmwrite instruction. The KVM counters also increment correctly.

However, on HAXM, I had placed the counter display/reset before and after cpu_vmx_run, respectively, without realizing that more accesses ocurred inside this function. I've updated my code, and now we are wrapping asm_vmxrun instead. A similar issue happened with KVM, which is fixed too.

I repeated the benchmarks for the unpatched code, and these are the updated results:

  • KVM: Does 6-10 reads and 2-6 writes on average.
  • HAXM: Does 32-33 reads and 21-23 writes on average.

The updated full logs (and the diff's to replicate them!) are available at: vmx-measurements.zip.
Data corresponds to the booting QEMU with HAXM/KVM without any disc (just BIOS).

- Moved vcpu_vmx_data and interruptibility_state_t from {vcpu,vcpu_state}.h to vmx.h.

- Removed vcpu_state_t::interruptibility_state_t since it's not used.

Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>
Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>
- Replaced vmcs_pending* fields, with the new automatically-generated function that updates VMCS based on vmcs_cache_w flags. Only `vmcs_pending_guest_cr3` has been preserved since it requires extra logic.

- Removed unnecessary vmread's at the end of `cpu_vmx_execute` that are not always required. Corresponding members in `vcpu_vmx_data` have been removed and function depending on them now call `vmcs_read`.

Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>
Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>
core/include/vcpu.h Show resolved Hide resolved
core/include/vmx.h Outdated Show resolved Hide resolved
core/include/vmx.h Outdated Show resolved Hide resolved
core/include/vmx.h Outdated Show resolved Hide resolved
core/include/vmx.h Outdated Show resolved Hide resolved
core/include/vmx.h Outdated Show resolved Hide resolved
core/include/vmx.h Outdated Show resolved Hide resolved
core/include/vmx.h Outdated Show resolved Hide resolved
core/include/vmx.h Outdated Show resolved Hide resolved
core/vmx.c Outdated
#define COMP_PENDING(cache_r, cache_w, width, name) \
COMP_PENDING_##cache_w(name)

void vcpu_handle_vmcs_pending(struct vcpu_t* vcpu)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have changed the semantics of this function, but not where it is called. Now that it has the responsibility of flushing all the dirty VMCS values, we should delay it as much as possible, in case vmcs_write gets called again after the flush but before VM entry (asm_vmxrun). Probably we need to draw a line somewhere in cpu_vmx_execute or in cpu_vmx_run:

// Cached vmcs_write() is preferred over uncached vmwrite()
vmcs_write(..);
vmcs_write(..);
...
vcpu_handle_vmcs_pending(vcpu);
// *** No cached vmcs_write allowed after this point ***
vmwrite(..);
vmwrite(..);
...
// VM entry

Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>
@raphaelning
Copy link
Contributor

The updated full logs (and the diff's to replicate them!) are available at: vmx-measurements.zip.

Thanks, the new data is plausible. Clearly there's a lot of room for improvement for HAXM, to look on the bright side :-)

@AlexAltea
Copy link
Contributor Author

I've fixed some of the issues you mentioned. To make reviewing easier, I'll avoid rewriting the commit history until this is ready to merge. Then, I'll squash and re-sign everything into the first three commits, i.e.:

  • Moved VMX-specific definitions to vmx.h
  • Added VMCS-cache macros
  • Optimize code by using new VMCS-cache macros

Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>
Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>
@AlexAltea
Copy link
Contributor Author

AlexAltea commented Nov 22, 2018

EDIT: Sorry, I clicked the wrong button and ended up closing/reopening the pull request.

@AlexAltea AlexAltea closed this Nov 22, 2018
@AlexAltea AlexAltea reopened this Nov 22, 2018
Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>
Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>
Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>
Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>
Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>
Signed-off-by: Alexandro Sanchez Bach <asanchez@kryptoslogic.com>
@HaxmCI HaxmCI added the CI:Mac Test Pass CI:Mac Test Pass label Jun 20, 2019
@intel intel deleted a comment from xianxiaoyin Aug 23, 2019
@intel intel deleted a comment from xianxiaoyin Aug 23, 2019
@intel intel deleted a comment from yiqisiben Aug 23, 2019
@intel intel deleted a comment from yiqisiben Aug 23, 2019
@nevilad
Copy link
Contributor

nevilad commented Dec 17, 2019

What is with this PR? My windows 7 guest boots and runs pretty slow compared to VmWare, I wan't do some enhancements to speed it up. This PR is a good start point.

@wcwang
Copy link
Contributor

wcwang commented Dec 23, 2019

Thanks for @nevilad comments. I noticed that @AlexAltea added new commits a month ago. And @AlexAltea once declared that the pull request were not completed. @AlexAltea, could you help to rebase the current pull request to the head of master branch first? So we can consider to proceed it and @nevilad can start the enhancement based on it. Thanks for both of you.

@nevilad
Copy link
Contributor

nevilad commented Jan 6, 2020

I noticed that @AlexAltea added new commits a month ago.

I don't see new comments, the latest is in november of 2018.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
CI:Build Fail CI:Build Fail
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants