Sean-Christoph…
Commits on Apr 13, 2021
-
KVM: x86: Defer tick-based accounting 'til after IRQ handling
When using tick-based accounting, defer the call to account guest time until after servicing any IRQ(s) that happened in the guest (or immediately after VM-Exit). When using tick-based accounting, time is accounted to the guest when PF_VCPU is set when the tick IRQ handler runs. The current approach of unconditionally accounting time in kvm_guest_exit_irqoff() prevents IRQs that occur in the guest from ever being processed with PF_VCPU set, since PF_VCPU ends up being set only during the relatively short VM-Enter sequence, which runs entirely with IRQs disabled. Fixes: 87fa7f3 ("x86/kvm: Move context tracking where it belongs") Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Michael Tokarev <mjt@tls.msk.ru> Signed-off-by: Sean Christopherson <seanjc@google.com>
-
KVM: x86: Consolidate guest enter/exit logic to common helpers
Move the enter/exit logic in {svm,vmx}_vcpu_enter_exit() to common helpers. In addition to deduplicating code, this will allow tweaking the vtime accounting in the VM-Exit path without splitting logic across x86, VMX, and SVM. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> -
KVM: Move vtime accounting of guest exit to separate helper
Provide a standalone helper for guest exit vtime accounting so that x86 can defer tick-based accounting until the appropriate time, while still updating context tracking immediately after VM-Exit. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com>
-
context_tracking: KVM: Move guest enter/exit wrappers to KVM's domain
Move the guest enter/exit wrappers to kvm_host.h so that KVM can manage its context tracking vs. vtime accounting without bleeding too many KVM details into the context tracking code. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com>
-
context_tracking: Consolidate guest enter/exit wrappers
Consolidate the guest enter/exit wrappers by providing stubs for the context tracking helpers as necessary. This will allow moving the wrappers under KVM without having to bleed too many #ifdefs into the soon-to-be KVM code. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com>
-
context_tracking: Move guest enter/exit logic to standalone helpers
Move guest enter/exit context tracking to standalone helpers, so that the existing wrappers can be moved under KVM. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com>
-
sched/vtime: Move guest enter/exit vtime accounting to separate helpers
Provide separate helpers for guest enter/exit vtime accounting instead of open coding the logic within the context tracking code. This will allow KVM x86 to handle vtime accounting slightly differently when using tick- based accounting. Opportunstically delete the vtime_account_kernel() stub now that all callers are wrapped with CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y. No functional change intended. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Commits on Apr 2, 2021
-
KVM: x86: Support KVM VMs sharing SEV context
Add a capability for userspace to mirror SEV encryption context from one vm to another. On our side, this is intended to support a Migration Helper vCPU, but it can also be used generically to support other in-guest workloads scheduled by the host. The intention is for the primary guest and the mirror to have nearly identical memslots. The primary benefits of this are that: 1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they can't accidentally clobber each other. 2) The VMs can have different memory-views, which is necessary for post-copy migration (the migration vCPUs on the target need to read and write to pages, when the primary guest would VMEXIT). This does not change the threat model for AMD SEV. Any memory involved is still owned by the primary guest and its initial state is still attested to through the normal SEV_LAUNCH_* flows. If userspace wanted to circumvent SEV, they could achieve the same effect by simply attaching a vCPU to the primary VM. This patch deliberately leaves userspace in charge of the memslots for the mirror, as it already has the power to mess with them in the primary guest. This patch does not support SEV-ES (much less SNP), as it does not handle handing off attested VMSAs to the mirror. For additional context, we need a Migration Helper because SEV PSP migration is far too slow for our live migration on its own. Using an in-guest migrator lets us speed this up significantly. Signed-off-by: Nathan Tempelman <natet@google.com> Message-Id: <20210316014027.3116119-1-natet@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: x86: pending exceptions must not be blocked by an injected event
Injected interrupts/nmi should not block a pending exception, but rather be either lost if nested hypervisor doesn't intercept the pending exception (as in stock x86), or be delivered in exitintinfo/IDT_VECTORING_INFO field, as a part of a VMexit that corresponds to the pending exception. The only reason for an exception to be blocked is when nested run is pending (and that can't really happen currently but still worth checking for). Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210401143817.1030695-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: selftests: remove redundant semi-colon
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Message-Id: <20210401142514.1688199-1-yangyingliang@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: x86: introduce kvm_register_clear_available
Small refactoring that will be used in the next patch. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210401141814.1029036-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: nSVM: call nested_svm_load_cr3 on nested state load
While KVM's MMU should be fully reset by loading of nested CR0/CR3/CR4 by KVM_SET_SREGS, we are not in nested mode yet when we do it and therefore only root_mmu is reset. On regular nested entries we call nested_svm_load_cr3 which both updates the guest's CR3 in the MMU when it is needed, and it also initializes the mmu again which makes it initialize the walk_mmu as well when nested paging is enabled in both host and guest. Since we don't call nested_svm_load_cr3 on nested state load, the walk_mmu can be left uninitialized, which can lead to a NULL pointer dereference while accessing it if we happen to get a nested page fault right after entering the nested guest first time after the migration and we decide to emulate it, which leads to the emulator trying to access walk_mmu->gva_to_gpa which is NULL. Therefore we should call this function on nested state load as well. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210401141814.1029036-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: nVMX: delay loading of PDPTRs to KVM_REQ_GET_NESTED_STATE_PAGES
Similar to the rest of guest page accesses after migration, this should be delayed to KVM_REQ_GET_NESTED_STATE_PAGES request. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210401141814.1029036-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: x86: dump_vmcs should include the autoload/autostore MSR lists
When dumping the current VMCS state, include the MSRs that are being automatically loaded/stored during VM entry/exit. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: David Edmondson <david.edmondson@oracle.com> Message-Id: <20210318120841.133123-6-david.edmondson@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: x86: dump_vmcs should show the effective EFER
If EFER is not being loaded from the VMCS, show the effective value by reference to the MSR autoload list or calculation. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Edmondson <david.edmondson@oracle.com> Message-Id: <20210318120841.133123-5-david.edmondson@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: x86: dump_vmcs should consider only the load controls of EFER/PAT
When deciding whether to dump the GUEST_IA32_EFER and GUEST_IA32_PAT fields of the VMCS, examine only the VM entry load controls, as saving on VM exit has no effect on whether VM entry succeeds or fails. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Edmondson <david.edmondson@oracle.com> Message-Id: <20210318120841.133123-4-david.edmondson@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: x86: dump_vmcs should not conflate EFER and PAT presence in VMCS
Show EFER and PAT based on their individual entry/exit controls. Signed-off-by: David Edmondson <david.edmondson@oracle.com> Message-Id: <20210318120841.133123-3-david.edmondson@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: x86: dump_vmcs should not assume GUEST_IA32_EFER is valid
If the VM entry/exit controls for loading/saving MSR_EFER are either not available (an older processor or explicitly disabled) or not used (host and guest values are the same), reading GUEST_IA32_EFER from the VMCS returns an inaccurate value. Because of this, in dump_vmcs() don't use GUEST_IA32_EFER to decide whether to print the PDPTRs - always do so if the fields exist. Fixes: 4eb64dc ("KVM: x86: dump VMCS on invalid entry") Signed-off-by: David Edmondson <david.edmondson@oracle.com> Message-Id: <20210318120841.133123-2-david.edmondson@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: nSVM: improve SYSENTER emulation on AMD
Currently to support Intel->AMD migration, if CPU vendor is GenuineIntel, we emulate the full 64 value for MSR_IA32_SYSENTER_{EIP|ESP} msrs, and we also emulate the sysenter/sysexit instruction in long mode. (Emulator does still refuse to emulate sysenter in 64 bit mode, on the ground that the code for that wasn't tested and likely has no users) However when virtual vmload/vmsave is enabled, the vmload instruction will update these 32 bit msrs without triggering their msr intercept, which will lead to having stale values in kvm's shadow copy of these msrs, which relies on the intercept to be up to date. Fix/optimize this by doing the following: 1. Enable the MSR intercepts for SYSENTER MSRs iff vendor=GenuineIntel (This is both a tiny optimization and also ensures that in case the guest cpu vendor is AMD, the msrs will be 32 bit wide as AMD defined). 2. Store only high 32 bit part of these msrs on interception and combine it with hardware msr value on intercepted read/writes iff vendor=GenuineIntel. 3. Disable vmload/vmsave virtualization if vendor=GenuineIntel. (It is somewhat insane to set vendor=GenuineIntel and still enable SVM for the guest but well whatever). Then zero the high 32 bit parts when kvm intercepts and emulates vmload. Thanks a lot to Paulo Bonzini for helping me with fixing this in the most correct way. This patch fixes nested migration of 32 bit nested guests, that was broken because incorrect cached values of SYSENTER msrs were stored in the migration stream if L1 changed these msrs with vmload prior to L2 entry. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210401111928.996871-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> -
KVM: x86: add guest_cpuid_is_intel
This is similar to existing 'guest_cpuid_is_amd_or_hygon' Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210401111928.996871-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: x86: Account a variety of miscellaneous allocations
Switch to GFP_KERNEL_ACCOUNT for a handful of allocations that are clearly associated with a single task/VM. Note, there are a several SEV allocations that aren't accounted, but those can (hopefully) be fixed by using the local stack for memory. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210331023025.2485960-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: SVM: Do not allow SEV/SEV-ES initialization after vCPUs are created
Reject KVM_SEV_INIT and KVM_SEV_ES_INIT if they are attempted after one or more vCPUs have been created. KVM assumes a VM is tagged SEV/SEV-ES prior to vCPU creation, e.g. init_vmcb() needs to mark the VMCB as SEV enabled, and svm_create_vcpu() needs to allocate the VMSA. At best, creating vCPUs before SEV/SEV-ES init will lead to unexpected errors and/or behavior, and at worst it will crash the host, e.g. sev_launch_update_vmsa() will dereference a null svm->vmsa pointer. Fixes: 1654efc ("KVM: SVM: Add KVM_SEV_INIT command") Fixes: ad73109 ("KVM: SVM: Provide support to launch and run an SEV-ES guest") Cc: stable@vger.kernel.org Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210331031936.2495277-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: SVM: Do not set sev->es_active until KVM_SEV_ES_INIT completes
Set sev->es_active only after the guts of KVM_SEV_ES_INIT succeeds. If the command fails, e.g. because SEV is already active or there are no available ASIDs, then es_active will be left set even though the VM is not fully SEV-ES capable. Refactor the code so that "es_active" is passed on the stack instead of being prematurely shoved into sev_info, both to avoid having to unwind sev_info and so that it's more obvious what actually consumes es_active in sev_guest_init() and its helpers. Fixes: ad73109 ("KVM: SVM: Provide support to launch and run an SEV-ES guest") Cc: stable@vger.kernel.org Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210331031936.2495277-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: SVM: Use online_vcpus, not created_vcpus, to iterate over vCPUs
Use the kvm_for_each_vcpu() helper to iterate over vCPUs when encrypting VMSAs for SEV, which effectively switches to use online_vcpus instead of created_vcpus. This fixes a possible null-pointer dereference as created_vcpus does not guarantee a vCPU exists, since it is updated at the very beginning of KVM_CREATE_VCPU. created_vcpus exists to allow the bulk of vCPU creation to run in parallel, while still correctly restricting the max number of max vCPUs. Fixes: ad73109 ("KVM: SVM: Provide support to launch and run an SEV-ES guest") Cc: stable@vger.kernel.org Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210331031936.2495277-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: x86/mmu: Simplify code for aging SPTEs in TDP MMU
Use a basic NOT+AND sequence to clear the Accessed bit in TDP MMU SPTEs, as opposed to the fancy ffs()+clear_bit() logic that was copied from the legacy MMU. The legacy MMU uses clear_bit() because it is operating on the SPTE itself, i.e. clearing needs to be atomic. The TDP MMU operates on a local variable that it later writes to the SPTE, and so doesn't need to be atomic or even resident in memory. Opportunistically drop unnecessary initialization of new_spte, it's guaranteed to be written before being accessed. Using NOT+AND instead of ffs()+clear_bit() reduces the sequence from: 0x0000000000058be6 <+134>: test %rax,%rax 0x0000000000058be9 <+137>: je 0x58bf4 <age_gfn_range+148> 0x0000000000058beb <+139>: test %rax,%rdi 0x0000000000058bee <+142>: je 0x58cdc <age_gfn_range+380> 0x0000000000058bf4 <+148>: mov %rdi,0x8(%rsp) 0x0000000000058bf9 <+153>: mov $0xffffffff,%edx 0x0000000000058bfe <+158>: bsf %eax,%edx 0x0000000000058c01 <+161>: movslq %edx,%rdx 0x0000000000058c04 <+164>: lock btr %rdx,0x8(%rsp) 0x0000000000058c0b <+171>: mov 0x8(%rsp),%r15 to: 0x0000000000058bdd <+125>: test %rax,%rax 0x0000000000058be0 <+128>: je 0x58beb <age_gfn_range+139> 0x0000000000058be2 <+130>: test %rax,%r8 0x0000000000058be5 <+133>: je 0x58cc0 <age_gfn_range+352> 0x0000000000058beb <+139>: not %rax 0x0000000000058bee <+142>: and %r8,%rax 0x0000000000058bf1 <+145>: mov %rax,%r15 thus eliminating several memory accesses, including a locked access. Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210331004942.2444916-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: x86/mmu: Remove spurious clearing of dirty bit from TDP MMU SPTE
Don't clear the dirty bit when aging a TDP MMU SPTE (in response to a MMU notifier event). Prematurely clearing the dirty bit could cause spurious PML updates if aging a page happened to coincide with dirty logging. Note, tdp_mmu_set_spte_no_acc_track() flows into __handle_changed_spte(), so the host PFN will be marked dirty, i.e. there is no potential for data corruption. Fixes: a6a0b05 ("kvm: x86/mmu: Support dirty logging for the TDP MMU") Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210331004942.2444916-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: x86/mmu: Drop trace_kvm_age_page() tracepoint
Remove x86's trace_kvm_age_page() tracepoint. It's mostly redundant with the common trace_kvm_age_hva() tracepoint, and if there is a need for the extra details, e.g. gfn, referenced, etc... those details should be added to the common tracepoint so that all architectures and MMUs benefit from the info. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210326021957.1424875-19-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: Move arm64's MMU notifier trace events to generic code
Move arm64's MMU notifier trace events into common code in preparation for doing the hva->gfn lookup in common code. The alternative would be to trace the gfn instead of hva, but that's not obviously better and could also be done in common code. Tracing the notifiers is also quite handy for debug regardless of architecture. Remove a completely redundant tracepoint from PPC e500. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210326021957.1424875-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: Move prototypes for MMU notifier callbacks to generic code
Move the prototypes for the MMU notifier callbacks out of arch code and into common code. There is no benefit to having each arch replicate the prototypes since any deviation from the invocation in common code will explode. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210326021957.1424875-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: x86/mmu: Use leaf-only loop for walking TDP SPTEs when changing …
…SPTE Use the leaf-only TDP iterator when changing the SPTE in reaction to a MMU notifier. Practically speaking, this is a nop since the guts of the loop explicitly looks for 4k SPTEs, which are always leaf SPTEs. Switch the iterator to match age_gfn_range() and test_age_gfn() so that a future patch can consolidate the core iterating logic. No real functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210326021957.1424875-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: x86/mmu: Pass address space ID to TDP MMU root walkers
Move the address space ID check that is performed when iterating over roots into the macro helpers to consolidate code. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210326021957.1424875-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: x86/mmu: Pass address space ID to __kvm_tdp_mmu_zap_gfn_range()
Pass the address space ID to TDP MMU's primary "zap gfn range" helper to allow the MMU notifier paths to iterate over memslots exactly once. Currently, both the legacy MMU and TDP MMU iterate over memslots when looking for an overlapping hva range, which can be quite costly if there are a large number of memslots. Add a "flush" parameter so that iterating over multiple address spaces in the caller will continue to do the right thing when yielding while a flush is pending from a previous address space. Note, this also has a functional change in the form of coalescing TLB flushes across multiple address spaces in kvm_zap_gfn_range(), and also optimizes the TDP MMU to utilize range-based flushing when running as L1 with Hyper-V enlightenments. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210326021957.1424875-6-seanjc@google.com> [Keep separate for loops to prepare for other incoming patches. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: x86/mmu: Coalesce TLB flushes across address spaces for gfn rang…
…e zap Gather pending TLB flushes across both address spaces when zapping a given gfn range. This requires feeding "flush" back into subsequent calls, but on the plus side sets the stage for further batching between the legacy MMU and TDP MMU. It also allows refactoring the address space iteration to cover the legacy and TDP MMUs without introducing truly ugly code. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210326021957.1424875-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: x86/mmu: Coalesce TLB flushes when zapping collapsible SPTEs
Gather pending TLB flushes across both the legacy and TDP MMUs when zapping collapsible SPTEs to avoid multiple flushes if both the legacy MMU (for nested guests) and TDP MMU have mappings for the memslot. Note, this also optimizes the TDP MMU to flush only the relevant range when running as L1 with Hyper-V enlightenments. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210326021957.1424875-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
KVM: x86/mmu: Move flushing for "slot" handlers to caller for legacy MMU
Place the onus on the caller of slot_handle_*() to flush the TLB, rather than handling the flush in the helper, and rename parameters accordingly. This will allow future patches to coalesce flushes between address spaces and between the legacy and TDP MMUs. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210326021957.1424875-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>