Jing-Zhang/KVM…

Commits on Aug 10, 2021

  1. KVM: stats: Add VM dirty_pages stats for the number of dirty pages

    Add a generic VM stat, dirty_pages, to record the number of dirty pages
    reflected in the dirty_bitmap at the moment.
    
    Original-by: Peter Feiner <pfeiner@google.com>
    Signed-off-by: Jing Zhang <jingzhangos@google.com>
    jingzhangos authored and intel-lab-lkp committed Aug 10, 2021

Commits on Aug 7, 2021

  1. KVM: selftests: Add a test of an unbacked nested PI descriptor

    Add a regression test for the unsupported configuration of a VMCS12
    posted interrupt descriptor that has no backing memory in L1. KVM
    should exit to userspace with KVM_INTERNAL_ERROR rather than just
    silently doing something wrong.
    
    Signed-off-by: Jim Mattson <jmattson@google.com>
    Reviewed-by: Oliver Upton <oupton@google.com>
    Message-Id: <20210604172611.281819-13-jmattson@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    jsmattsonjr authored and bonzini committed Aug 7, 2021
  2. KVM: selftests: Introduce prepare_tpr_shadow

    Add support for yet another page to hang from the VMCS12 for nested
    VMX testing: the virtual APIC page. This page is necessary for a
    VMCS12 to be launched with the "use TPR shadow" VM-execution control
    set (except in some oddball circumstances permitted by KVM).
    
    Signed-off-by: Jim Mattson <jmattson@google.com>
    Reviewed-by: Oliver Upton <oupton@google.com>
    Message-Id: <20210604172611.281819-12-jmattson@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    jsmattsonjr authored and bonzini committed Aug 7, 2021
  3. KVM: x86: Exit to userspace when kvm_check_nested_events fails

    If kvm_check_nested_events fails due to raising an
    EXIT_REASON_INTERNAL_ERROR, propagate it to userspace
    immediately, even if the vCPU would otherwise be sleeping.
    This happens for example when the posted interrupt descriptor
    points outside guest memory.
    
    Fixes: 966eefb ("KVM: nVMX: Disable vmcs02 posted interrupts if vmcs12 PID isn't mappable")
    Cc: stable@vger.kernel.org
    Reported-by: Jim Mattson <jmattson@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    bonzini committed Aug 7, 2021
  4. KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file

    Use this file to dump rmap statistics.  The statistics are gathered by
    calculating the rmap count for each page and bucketing the results on a
    log-2 basis.
    
    An example of the output looks like this (idle 6GB guest, right after Linux boots):
    
    Rmap_Count:     0       1       2-3     4-7     8-15    16-31   32-63   64-127  128-255 256-511 512-1023
    Level=4K:       3086676 53045   12330   1272    502     121     76      2       0       0       0
    Level=2M:       5947    231     0       0       0       0       0       0       0       0       0
    Level=1G:       32      0       0       0       0       0       0       0       0       0       0
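
    The log-2 bucketing can be sketched as follows; this is a minimal
    illustration of the column layout above, not the kernel's implementation:

```c
#include <assert.h>

/* Map an rmap count to its histogram bucket, mirroring the log-2-based
 * columns above: bucket 0 holds count 0, bucket 1 holds count 1,
 * bucket 2 holds 2-3, bucket 3 holds 4-7, ... bucket 10 holds 512-1023. */
static int rmap_bucket(unsigned long count)
{
        int bucket = 0;

        while (count > 1) {
                count >>= 1;
                bucket++;
        }
        /* counts 0 and 1 both reach here with bucket == 0 */
        return count == 0 ? 0 : bucket + 1;
}
```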
    
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Message-Id: <20210730220455.26054-5-peterx@redhat.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    xzpeter authored and bonzini committed Aug 7, 2021
  5. KVM: X86: Introduce kvm_mmu_slot_lpages() helpers

    Introduce kvm_mmu_slot_lpages() to calculate the lpage_info and rmap array
    sizes.  The other helper, __kvm_mmu_slot_lpages(), can take an extra npages
    parameter rather than fetching it from the memslot pointer.  Start using the
    latter in kvm_alloc_memslot_metadata().
    
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Message-Id: <20210730220455.26054-4-peterx@redhat.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    xzpeter authored and bonzini committed Aug 7, 2021
  6. KVM: Allow to have arch-specific per-vm debugfs files

    Allow arches to create arch-specific nodes under the kvm->debugfs_dentry
    directory besides the stats fields.  The new interface
    kvm_arch_create_vm_debugfs() is defined but not yet used.  It is called
    after kvm->debugfs_dentry is created, so the dentry can be referenced
    directly in kvm_arch_create_vm_debugfs().  Arches should define their own
    versions when they want to create extra debugfs nodes.
    
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Message-Id: <20210730220455.26054-2-peterx@redhat.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    xzpeter authored and bonzini committed Aug 7, 2021

Commits on Aug 6, 2021

  1. KVM: selftests: Move vcpu_args_set into perf_test_util

    perf_test_util is used to set up KVM selftests where vCPUs touch a
    region of memory. The guest code is implemented in perf_test_util.c (not
    the calling selftests). The guest code requires one parameter, the
    vcpuid, which has to be set by calling vcpu_args_set(vm, vcpu_id, 1,
    vcpu_id).
    
    Today all of the selftests that use perf_test_util make this call.
    Instead, perf_test_util should just do it. This saves some code, but more
    importantly prevents mistakes, since it is totally non-obvious that this
    call is needed, and failing to make it results in vCPUs not accessing
    the right regions of memory.
    
    Signed-off-by: David Matlack <dmatlack@google.com>
    Message-Id: <20210805172821.2622793-1-dmatlack@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    dmatlack authored and bonzini committed Aug 6, 2021
  2. KVM: selftests: Support multiple slots in dirty_log_perf_test

    Introduce a new option to dirty_log_perf_test: -x number_of_slots. This
    causes the test to attempt to split the region of memory into the given
    number of slots. If the region cannot be evenly divided, the test will
    fail.
    
    This allows testing with more than one slot and therefore measuring how
    performance scales with the number of memslots.
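
    A sketch of the even-split check with hypothetical names (the real test
    computes the per-slot size from its guest-memory arguments):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: split total_pages evenly across nr_slots.
 * Returns 0 when the region cannot be evenly divided, in which case
 * the test would fail, as described above. */
static size_t pages_per_slot(size_t total_pages, size_t nr_slots)
{
        if (nr_slots == 0 || total_pages % nr_slots != 0)
                return 0;
        return total_pages / nr_slots;
}
```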
    
    Signed-off-by: David Matlack <dmatlack@google.com>
    Message-Id: <20210804222844.1419481-8-dmatlack@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    dmatlack authored and bonzini committed Aug 6, 2021
  3. KVM: x86/mmu: Rename __gfn_to_rmap to gfn_to_rmap

    gfn_to_rmap was removed in the previous patch so there is no need to
    retain the double underscore on __gfn_to_rmap.
    
    Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
    Signed-off-by: David Matlack <dmatlack@google.com>
    Message-Id: <20210804222844.1419481-7-dmatlack@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    dmatlack authored and bonzini committed Aug 6, 2021
  4. KVM: x86/mmu: Leverage vcpu->last_used_slot for rmap_add and rmap_recycle
    
    rmap_add() and rmap_recycle() both run in the context of the vCPU and
    thus we can use kvm_vcpu_gfn_to_memslot() to look up the memslot. This
    enables rmap_add() and rmap_recycle() to take advantage of
    vcpu->last_used_slot and avoid expensive memslot searching.
    
    This change improves the performance of "Populate memory time" in
    dirty_log_perf_test with tdp_mmu=N. In addition to improving the
    performance, "Populate memory time" no longer scales with the number
    of memslots in the VM.
    
    Command                         | Before           | After
    ------------------------------- | ---------------- | -------------
    ./dirty_log_perf_test -v64 -x1  | 15.18001570s     | 14.99469366s
    ./dirty_log_perf_test -v64 -x64 | 18.71336392s     | 14.98675076s
    
    Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
    Signed-off-by: David Matlack <dmatlack@google.com>
    Message-Id: <20210804222844.1419481-6-dmatlack@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    dmatlack authored and bonzini committed Aug 6, 2021
  5. KVM: x86/mmu: Leverage vcpu->last_used_slot in tdp_mmu_map_handle_target_level
    
    The existing TDP MMU methods to handle dirty logging are vcpu-agnostic
    since they can be driven by MMU notifiers and other non-vcpu-specific
    events in addition to page faults. However this means that the TDP MMU
    is not benefiting from the new vcpu->last_used_slot. Fix that by
    introducing a tdp_mmu_map_set_spte_atomic() which is only called during
    a TDP page fault and has access to the kvm_vcpu for fast slot lookups.
    
    This improves "Populate memory time" in dirty_log_perf_test by 5%:
    
    Command                         | Before           | After
    ------------------------------- | ---------------- | -------------
    ./dirty_log_perf_test -v64 -x64 | 5.472321072s     | 5.169832886s
    
    Signed-off-by: David Matlack <dmatlack@google.com>
    Message-Id: <20210804222844.1419481-5-dmatlack@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    dmatlack authored and bonzini committed Aug 6, 2021
  6. KVM: Cache the last used slot index per vCPU

    The memslot for a given gfn is looked up multiple times during page
    fault handling. Avoid binary searching for it multiple times by caching
    the most recently used slot. There is an existing VM-wide last_used_slot
    but that does not work well for cases where vCPUs are accessing memory
    in different slots (see performance data below).
    
    Another benefit of caching the most recently use slot (versus looking
    up the slot once and passing around a pointer) is speeding up memslot
    lookups *across* faults and during spte prefetching.
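
    The caching idea can be sketched like this; a simplified model with a
    linear fallback search (KVM actually binary-searches a sorted slot array),
    and illustrative names throughout:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long gfn_t;

struct memslot {
        gfn_t base_gfn;
        size_t npages;
};

/* Check the cached index first; fall back to a full search and refresh
 * the cache on a miss. Names are illustrative, not KVM's. */
static const struct memslot *gfn_to_slot_cached(const struct memslot *slots,
                                                size_t nr, size_t *last_used,
                                                gfn_t gfn)
{
        size_t i = *last_used;

        if (i < nr && gfn >= slots[i].base_gfn &&
            gfn < slots[i].base_gfn + slots[i].npages)
                return &slots[i];            /* fast path: cache hit */

        for (i = 0; i < nr; i++) {
                if (gfn >= slots[i].base_gfn &&
                    gfn < slots[i].base_gfn + slots[i].npages) {
                        *last_used = i;      /* remember for the next fault */
                        return &slots[i];
                }
        }
        return NULL;
}
```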
    
    To measure the performance of this change I ran dirty_log_perf_test with
    64 vCPUs and 64 memslots and measured "Populate memory time" and
    "Iteration 2 dirty memory time".  Tests were run with eptad=N to force
    dirty logging to use fast_page_fault so its performance could be
    measured.
    
    Config     | Metric                        | Before | After
    ---------- | ----------------------------- | ------ | ------
    tdp_mmu=Y  | Populate memory time          | 6.76s  | 5.47s
    tdp_mmu=Y  | Iteration 2 dirty memory time | 2.83s  | 0.31s
    tdp_mmu=N  | Populate memory time          | 20.4s  | 18.7s
    tdp_mmu=N  | Iteration 2 dirty memory time | 2.65s  | 0.30s
    
    The "Iteration 2 dirty memory time" results are especially compelling
    because they are equivalent to running the same test with a single
    memslot. In other words, fast_page_fault performance no longer scales
    with the number of memslots.
    
    Signed-off-by: David Matlack <dmatlack@google.com>
    Message-Id: <20210804222844.1419481-4-dmatlack@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    dmatlack authored and bonzini committed Aug 6, 2021
  7. KVM: Move last_used_slot logic out of search_memslots

    Make search_memslots unconditionally search all memslots and move the
    last_used_slot logic up one level to __gfn_to_memslot. This is in
    preparation for introducing a per-vCPU last_used_slot.
    
    As part of this change convert existing callers of search_memslots to
    __gfn_to_memslot to avoid making any functional changes.
    
    Signed-off-by: David Matlack <dmatlack@google.com>
    Message-Id: <20210804222844.1419481-3-dmatlack@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    dmatlack authored and bonzini committed Aug 6, 2021
  8. KVM: Rename lru_slot to last_used_slot

    lru_slot is used to keep track of the index of the most-recently used
    memslot. The correct acronym would be "mru", but that is not in common
    use. So call it last_used_slot, which is a bit more obvious.
    
    Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
    Signed-off-by: David Matlack <dmatlack@google.com>
    Message-Id: <20210804222844.1419481-2-dmatlack@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    dmatlack authored and bonzini committed Aug 6, 2021

Commits on Aug 5, 2021

  1. KVM: xen: do not use struct gfn_to_hva_cache

    gfn_to_hva_cache is not thread-safe, so it is usually used only within
    a vCPU (whose code is protected by vcpu->mutex).  The Xen interface
    implementation has such a cache in kvm->arch, but it is not really
    used except to store the location of the shared info page.  Replace
    shinfo_set and shinfo_cache with just the value that is passed via
    KVM_XEN_ATTR_TYPE_SHARED_INFO; the only complication is that the
    initialization value is not zero anymore and therefore kvm_xen_init_vm
    needs to be introduced.
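
    A reduced sketch of the resulting state, assuming an invalid-gfn sentinel
    (names approximate the commit; the struct is pared down to the relevant
    field):

```c
#include <assert.h>

typedef unsigned long gfn_t;
#define GPA_INVALID ((gfn_t)-1)

/* Instead of a gfn_to_hva_cache, just store the shared-info gfn.
 * Because "no page set" can no longer be encoded as zero, an explicit
 * init hook is needed. */
struct kvm_xen {
        gfn_t shinfo_gfn;
};

static void kvm_xen_init_vm(struct kvm_xen *xen)
{
        xen->shinfo_gfn = GPA_INVALID;   /* non-zero initial value */
}

static int kvm_xen_shinfo_set(struct kvm_xen *xen, gfn_t gfn)
{
        /* value passed via KVM_XEN_ATTR_TYPE_SHARED_INFO */
        xen->shinfo_gfn = gfn;
        return 0;
}
```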
    
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    bonzini committed Aug 5, 2021

Commits on Aug 4, 2021

  1. KVM: x86/pmu: Introduce pmc->is_paused to reduce the call time of perf interfaces
    
    Based on our observations, after any VM-exit associated with the vPMU, at
    least two perf interfaces are called for guest counter emulation, such as
    perf_event_{pause, read_value, period}(), and each one will {lock, unlock}
    the same perf_event_ctx.  The calls become even more frequent when the
    guest uses counters in a multiplexed manner.
    
    Holding a lock once and completing the KVM request operations in the perf
    context would introduce a set of impractical new interfaces. So we can
    further optimize the vPMU implementation by avoiding repeated calls to
    these interfaces in the KVM context for at least one pattern:
    
    After we call perf_event_pause() once, the event will be disabled and its
    internal count will be reset to 0. So there is no need to pause it again
    or read its value. Once the event is paused, event period will not be
    updated until the next time it's resumed or reprogrammed. And there is
    also no need to call perf_event_period twice for a non-running counter,
    considering the perf_event for a running counter is never paused.
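
    The pattern can be modeled with a toy counter; the stub below stands in
    for the real perf calls, and only the is_paused short-circuit is the point:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the optimization: track whether a counter's perf event is
 * already paused and skip the expensive, lock-taking path when it is.
 * The field name follows the commit; the body is an illustrative stub. */
struct pmc {
        bool is_paused;
        int pause_calls;        /* counts how often the costly path runs */
        unsigned long counter;
};

static void pmc_pause(struct pmc *pmc)
{
        if (pmc->is_paused)
                return;         /* already paused: no need to pause again */
        pmc->pause_calls++;     /* stands in for perf_event_pause() */
        pmc->counter = 0;       /* pause resets the internal count */
        pmc->is_paused = true;
}
```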
    
    Based on this implementation, for the following common usage of
    sampling 4 events using perf on a 4u8g guest:
    
      echo 0 > /proc/sys/kernel/watchdog
      echo 25 > /proc/sys/kernel/perf_cpu_time_max_percent
      echo 10000 > /proc/sys/kernel/perf_event_max_sample_rate
      echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
      for i in `seq 1 1 10`
      do
      taskset -c 0 perf record \
      -e cpu-cycles -e instructions -e branch-instructions -e cache-misses \
      /root/br_instr a
      done
    
    the average latency of the guest NMI handler is reduced from
    37646.7 ns to 32929.3 ns (~1.14x speed up) on the Intel ICX server.
    Also, in addition to collecting more samples, no loss of sampling
    accuracy was observed compared to before the optimization.
    
    Signed-off-by: Like Xu <likexu@tencent.com>
    Message-Id: <20210728120705.6855-1-likexu@tencent.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    Acked-by: Peter Zijlstra <peterz@infradead.org>
    Like Xu authored and bonzini committed Aug 4, 2021
  2. KVM: X86: Optimize zapping rmap

    Using rmap_get_first() and rmap_remove() for zapping a huge rmap list could be
    slow.  The easy way is to traverse the rmap list, collecting the a/d bits and
    freeing the slots along the way.
    
    Provide a pte_list_destroy() and do exactly that.
    
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Message-Id: <20210730220605.26377-1-peterx@redhat.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    xzpeter authored and bonzini committed Aug 4, 2021
  3. KVM: X86: Optimize pte_list_desc with per-array counter

    Add a counter field to pte_list_desc, so as to simplify the add/remove/loop
    logic.  E.g., we no longer need to loop over the array in most cases.
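
    A minimal sketch of the counter idea, simplified to a single array with no
    chaining (the PTE_LIST_EXT value and names are illustrative):

```c
#include <assert.h>
#include <stddef.h>

#define PTE_LIST_EXT 15

/* pte_list_desc with a per-array counter: pushing a new spte indexes
 * directly via count instead of scanning the array for a free slot. */
struct pte_list_desc {
        int count;
        unsigned long *sptes[PTE_LIST_EXT];
};

static int pte_list_push(struct pte_list_desc *desc, unsigned long *spte)
{
        if (desc->count == PTE_LIST_EXT)
                return -1;                   /* caller would chain a new desc */
        desc->sptes[desc->count++] = spte;   /* no loop needed */
        return 0;
}
```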
    
    This will make more sense after we switch to a larger array size;
    otherwise the counter would be a waste.
    
    Initially I wanted to store a tail pointer at the head of the array list so
    we wouldn't need to traverse the list, at least when pushing new entries
    (without the counter we traverse both the list and the array).  However,
    that would need slightly more change without much benefit; e.g., after we
    grow the number of entries per array, traversing the list is not so
    expensive.
    
    So let's keep it simple but still get as much benefit as we can with just
    these few extra lines of changes (not to mention the code looks easier too
    without looping over arrays).
    
    I used the same test case of forking 500 children and recycling them
    ("./rmap_fork 500" [1]); this patch further speeds up the total fork time
    by about 4%, for a total of 33% over the vanilla kernel:
    
            Vanilla:      473.90 (+-5.93%)
            3->15 slots:  366.10 (+-4.94%)
            Add counter:  351.00 (+-3.70%)
    
    [1] xzpeter/clibs@825436f
    
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Message-Id: <20210730220602.26327-1-peterx@redhat.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    xzpeter authored and bonzini committed Aug 4, 2021
  4. KVM: X86: MMU: Tune PTE_LIST_EXT to be bigger

    Currently each rmap array element only contains 3 entries.  However for
    EPT=N there can be a lot of guest pages that have tens or even hundreds of
    rmap entries.
    
    A normal distribution of a 6G guest (even if idle) shows this with rmap count
    statistics:
    
    Rmap_Count:     0       1       2-3     4-7     8-15    16-31   32-63   64-127  128-255 256-511 512-1023
    Level=4K:       3089171 49005   14016   1363    235     212     15      7       0       0       0
    Level=2M:       5951    227     0       0       0       0       0       0       0       0       0
    Level=1G:       32      0       0       0       0       0       0       0       0       0       0
    
    If we do some more forks, some pages will grow even larger rmap counts.
    
    This patch makes PTE_LIST_EXT bigger so it'll be more efficient for the
    general EPT=N use case: we dereference the list less often and the loops
    over PTE_LIST_EXT entries will be slightly more efficient, while keeping
    the array small enough that little is wasted when it is not full.
    
    It should not affect EPT=Y, since EPT normally has only zero or one rmap
    entry for each page, so no array is even allocated.
    
    With a test case that forks 500 children and recycles them ("./rmap_fork
    500" [1]), this patch speeds up the fork time by about 29%.
    
        Before: 473.90 (+-5.93%)
        After:  366.10 (+-4.94%)
    
    [1] xzpeter/clibs@825436f
    
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Message-Id: <20210730220455.26054-6-peterx@redhat.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    xzpeter authored and bonzini committed Aug 4, 2021

Commits on Aug 3, 2021

  1. KVM: const-ify all relevant uses of struct kvm_memory_slot

    As alluded to in commit f36f3f2 ("KVM: add "new" argument to
    kvm_arch_commit_memory_region"), a bunch of other places where struct
    kvm_memory_slot is used need to be refactored to preserve the
    "const"ness of struct kvm_memory_slot across the board.
    
    Signed-off-by: Hamza Mahfooz <someguy@effective-light.com>
    Message-Id: <20210713023338.57108-1-someguy@effective-light.com>
    [Do not touch body of slot_rmap_walk_init. - Paolo]
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    effective-light authored and bonzini committed Aug 3, 2021
  2. KVM: Don't take mmu_lock for range invalidation unless necessary

    Avoid taking mmu_lock for .invalidate_range_{start,end}() notifications
    that are unrelated to KVM.  This is possible now that memslot updates are
    blocked from range_start() to range_end(); that ensures that lock elision
    happens in both or none, and therefore that mmu_notifier_count updates
    (which must occur while holding mmu_lock for write) are always paired
    across start->end.
    
    Based on patches originally written by Ben Gardon.
    
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    bonzini committed Aug 3, 2021
  3. KVM: Block memslot updates across range_start() and range_end()

    We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
    notifications that are unrelated to KVM.  Because mmu_notifier_count
    must be modified while holding mmu_lock for write, and must always
    be paired across start->end to stay balanced, lock elision must
    happen in both or none.  Therefore, in preparation for this change,
    this patch prevents memslot updates across range_start() and range_end().
    
    Note, technically flag-only memslot updates could be allowed in parallel,
    but stalling a memslot update for a relatively short amount of time is
    not a scalability issue, and this is all more than complex enough.
    
    A long note on the locking: a previous version of the patch used an rwsem
    to block the memslot update while the MMU notifier runs, but this resulted
    in the following deadlock involving the pseudo-lock tagged as
    "mmu_notifier_invalidate_range_start".
    
       ======================================================
       WARNING: possible circular locking dependency detected
       5.12.0-rc3+ #6 Tainted: G           OE
       ------------------------------------------------------
       qemu-system-x86/3069 is trying to acquire lock:
       ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190
    
       but task is already holding lock:
       ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
    
       which lock already depends on the new lock.
    
    This corresponds to the following MMU notifier logic:
    
        invalidate_range_start
          take pseudo lock
          down_read()           (*)
          release pseudo lock
        invalidate_range_end
          take pseudo lock      (**)
          up_read()
          release pseudo lock
    
    At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock;
    at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock.
    
    This could cause a deadlock (ignoring for a second that the pseudo lock
    is not a lock):
    
    - invalidate_range_start waits on down_read(), because the rwsem is
    held by install_new_memslots
    
    - install_new_memslots waits on down_write(), because the rwsem is
    held till (another) invalidate_range_end finishes
    
    - invalidate_range_end waits on the pseudo lock, held by
    invalidate_range_start.
    
    Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
    it would change the *shared* rwsem readers into *shared recursive*
    readers), so open-code the wait using a readers count and a
    spinlock.  This also allows handling blockable and non-blockable
    critical section in the same way.
    
    Losing the rwsem fairness does theoretically allow MMU notifiers to
    block install_new_memslots forever.  Note that mm/mmu_notifier.c's own
    retry scheme in mmu_interval_read_begin also uses wait/wake_up
    and is likewise not fair.
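
    The invariant can be modeled single-threaded; the real code guards the
    count with a spinlock and sleeps on a waitqueue, both of which this
    sketch omits:

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the open-coded wait: range_start() bumps a readers count,
 * range_end() drops it, and install_new_memslots() may proceed only
 * when the count is zero. Names are illustrative. */
struct mn_state {
        int active_invalidations;   /* the readers count */
};

static void range_start(struct mn_state *s) { s->active_invalidations++; }
static void range_end(struct mn_state *s)   { s->active_invalidations--; }

/* Returns true if the memslot update can go ahead right now;
 * otherwise the updater would wait and retry. */
static bool can_install_new_memslots(const struct mn_state *s)
{
        return s->active_invalidations == 0;
}
```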
    
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    bonzini committed Aug 3, 2021

Commits on Aug 2, 2021

  1. KVM: nSVM: remove useless kvm_clear_*_queue

    For an event to be in injected state when nested_svm_vmrun executes,
    it must have come from exitintinfo when svm_complete_interrupts ran:
    
      vcpu_enter_guest
       static_call(kvm_x86_run) -> svm_vcpu_run
        svm_complete_interrupts
         // now the event went from "exitintinfo" to "injected"
       static_call(kvm_x86_handle_exit) -> handle_exit
        svm_invoke_exit_handler
          vmrun_interception
           nested_svm_vmrun
    
    However, no event could have been in exitintinfo before a VMRUN
    vmexit.  The code in svm.c is a bit more permissive than the one
    in vmx.c:
    
            if (is_external_interrupt(svm->vmcb->control.exit_int_info) &&
                exit_code != SVM_EXIT_EXCP_BASE + PF_VECTOR &&
                exit_code != SVM_EXIT_NPF && exit_code != SVM_EXIT_TASK_SWITCH &&
                exit_code != SVM_EXIT_INTR && exit_code != SVM_EXIT_NMI)
    
    but in any case, a VMRUN instruction would not even start to execute
    during an attempted event delivery.
    
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    bonzini committed Aug 2, 2021
  2. KVM: x86: Preserve guest's CR0.CD/NW on INIT

    Preserve CR0.CD and CR0.NW on INIT instead of forcing them to '1', as
    defined by both Intel's SDM and AMD's APM.
    
    Note, current versions of Intel's SDM are very poorly written with
    respect to INIT behavior.  Table 9-1. "IA-32 and Intel 64 Processor
    States Following Power-up, Reset, or INIT" quite clearly lists power-up,
    RESET, _and_ INIT as setting CR0=60000010H, i.e. CD/NW=1.  But the SDM
    then attempts to qualify CD/NW behavior in a footnote:
    
      2. The CD and NW flags are unchanged, bit 4 is set to 1, all other bits
         are cleared.
    
    Presumably that footnote is only meant for INIT, as the RESET case and
    especially the power-up case are rather nonsensical.  Another footnote
    all but confirms that:
    
      6. Internal caches are invalid after power-up and RESET, but left
         unchanged with an INIT.
    
    Bare metal testing shows that CD/NW are indeed preserved on INIT (someone
    else can hack their BIOS to check RESET and power-up :-D).
    
    Reported-by: Reiji Watanabe <reijiw@google.com>
    Reviewed-by: Reiji Watanabe <reijiw@google.com>
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Message-Id: <20210713163324.627647-47-seanjc@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    sean-jc authored and bonzini committed Aug 2, 2021
  3. KVM: SVM: Drop redundant clearing of vcpu->arch.hflags at INIT/RESET

    Drop redundant clears of vcpu->arch.hflags in init_vmcb() since
    kvm_vcpu_reset() always clears hflags, and it is also always
    zero at vCPU creation time.  And of course, the second clearing
    in init_vmcb() was always redundant.
    
    Suggested-by: Reiji Watanabe <reijiw@google.com>
    Reviewed-by: Reiji Watanabe <reijiw@google.com>
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Message-Id: <20210713163324.627647-46-seanjc@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    sean-jc authored and bonzini committed Aug 2, 2021
  4. KVM: SVM: Emulate #INIT in response to triple fault shutdown

    Emulate a full #INIT instead of simply initializing the VMCB if the
    guest hits a shutdown.  Initializing the VMCB but not other vCPU state,
    much of which is mirrored by the VMCB, results in incoherent and broken
    vCPU state.
    
    Ideally, KVM would not automatically init anything on shutdown, and
    instead put the vCPU into e.g. KVM_MP_STATE_UNINITIALIZED and force
    userspace to explicitly INIT or RESET the vCPU.  Even better would be to
    add KVM_MP_STATE_SHUTDOWN, since technically NMI can break shutdown
    (and SMI on Intel CPUs).
    
    But, that ship has sailed, and emulating #INIT is the next best thing as
    that has at least some connection with reality since there exist bare
    metal platforms that automatically INIT the CPU if it hits shutdown.
    
    Fixes: 46fe4dd ("[PATCH] KVM: SVM: Propagate cpu shutdown events to userspace")
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Message-Id: <20210713163324.627647-45-seanjc@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    sean-jc authored and bonzini committed Aug 2, 2021
  5. KVM: VMX: Move RESET-only VMWRITE sequences to init_vmcs()

    Move VMWRITE sequences in vmx_vcpu_reset() guarded by !init_event into
    init_vmcs() to make it more obvious that they're, uh, initializing the
    VMCS.
    
    No meaningful functional change intended (though the order of VMWRITEs
    and whatnot is different).
    
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Message-Id: <20210713163324.627647-44-seanjc@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    sean-jc authored and bonzini committed Aug 2, 2021
  6. KVM: VMX: Remove redundant write to set vCPU as active at RESET/INIT

    Drop a call to vmx_clear_hlt() during vCPU INIT, as the guest's activity
    state is unconditionally set to "active" a few lines earlier in
    vmx_vcpu_reset().
    
    No functional change intended.
    
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Message-Id: <20210713163324.627647-43-seanjc@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    sean-jc authored and bonzini committed Aug 2, 2021
  7. KVM: VMX: Smush x2APIC MSR bitmap adjustments into single function

    Consolidate all of the dynamic MSR bitmap adjustments into
    vmx_update_msr_bitmap_x2apic(), and rename the mode tracker to reflect
    that it is x2APIC specific.  If KVM gains more cases of dynamic MSR
    pass-through, odds are very good that those new cases will be better off
    with their own logic, e.g. see Intel PT MSRs and MSR_IA32_SPEC_CTRL.
    
    Attempting to handle all updates in a common helper did more harm than
    good, as KVM ended up collecting a large number of useless "updates".
    
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Message-Id: <20210713163324.627647-42-seanjc@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    sean-jc authored and bonzini committed Aug 2, 2021
  8. KVM: VMX: Remove unnecessary initialization of msr_bitmap_mode

    Don't bother initializing msr_bitmap_mode to 0, all of struct vcpu_vmx is
    zero initialized.
    
    No functional change intended.
    
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Message-Id: <20210713163324.627647-41-seanjc@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    sean-jc authored and bonzini committed Aug 2, 2021
  9. KVM: VMX: Don't redo x2APIC MSR bitmaps when userspace filter is changed

    Drop an explicit call to update the x2APIC MSRs when the userspace MSR
    filter is modified.  The x2APIC MSRs are deliberately exempt from
    userspace filtering.
    
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Message-Id: <20210713163324.627647-40-seanjc@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    sean-jc authored and bonzini committed Aug 2, 2021
  10. KVM: nVMX: Remove obsolete MSR bitmap refresh at nested transitions

    Drop unnecessary MSR bitmap updates during nested transitions, as L1's
    APIC_BASE MSR is not modified by the standard VM-Enter/VM-Exit flows,
    and L2's MSR bitmap is managed separately.  In the unlikely event that L1
    is pathological and loads APIC_BASE via the VM-Exit load list, KVM will
    handle updating the bitmap in its normal WRMSR flows.
    
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Message-Id: <20210713163324.627647-39-seanjc@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    sean-jc authored and bonzini committed Aug 2, 2021
  11. KVM: VMX: Remove obsolete MSR bitmap refresh at vCPU RESET/INIT

    Remove an unnecessary MSR bitmap refresh during vCPU RESET/INIT.  In both
    cases, the MSR bitmap already has the desired values and state.
    
    At RESET, the vCPU is guaranteed to be running with x2APIC disabled, the
    x2APIC MSRs are guaranteed to be intercepted due to the MSR bitmap being
    initialized to all ones by alloc_loaded_vmcs(), and vmx->msr_bitmap_mode
    is guaranteed to be zero, i.e. reflecting x2APIC disabled.
    
    At INIT, the APIC_BASE MSR is not modified, thus there can't be any
    change in x2APIC state.
    
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Message-Id: <20210713163324.627647-38-seanjc@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    sean-jc authored and bonzini committed Aug 2, 2021
  12. KVM: x86: Move setting of sregs during vCPU RESET/INIT to common x86

    Move the setting of CR0, CR4, EFER, RFLAGS, and RIP from vendor code to
    common x86.  VMX and SVM now have near-identical sequences, the only
    difference being that VMX updates the exception bitmap.  Updating the
    bitmap on SVM is unnecessary, but benign.  Unfortunately it can't be left
    behind in VMX due to the need to update exception intercepts after the
    control registers are set.
    
    Reviewed-by: Reiji Watanabe <reijiw@google.com>
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Message-Id: <20210713163324.627647-37-seanjc@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    sean-jc authored and bonzini committed Aug 2, 2021