Commits on Dec 6, 2012
  1. KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking

    paulusmack authored and agraf committed Nov 23, 2012
    Currently, if a machine check interrupt happens while we are in the
    guest, we exit the guest and call the host's machine check handler,
    which tends to cause the host to panic.  Some machine checks can be
    triggered by the guest; for example, if the guest creates two entries
    in the SLB that map the same effective address, and then accesses that
    effective address, the CPU will take a machine check interrupt.
    To handle this better, when a machine check happens inside the guest,
    we call a new function, kvmppc_realmode_machine_check(), while still in
    real mode before exiting the guest.  On POWER7, it handles the cases
    that the guest can trigger, either by flushing and reloading the SLB,
    or by flushing the TLB, and then it delivers the machine check interrupt
    directly to the guest without going back to the host.  On POWER7, the
    OPAL firmware patches the machine check interrupt vector so that it
    gets control first, and it leaves behind its analysis of the situation
    in a structure pointed to by the opal_mc_evt field of the paca.  The
    kvmppc_realmode_machine_check() function looks at this, and if OPAL
    reports that there was no error, or that it has handled the error, we
    also go straight back to the guest with a machine check.  We have to
    deliver a machine check to the guest since the machine check interrupt
    might have trashed valid values in SRR0/1.
    If the machine check is one we can't handle in real mode, and one that
    OPAL hasn't already handled, or on PPC970, we exit the guest and call
    the host's machine check handler.  We do this by jumping to the
    machine_check_fwnmi label, rather than absolute address 0x200, because
    we don't want to re-execute OPAL's handler on POWER7.  On PPC970, the
    two are equivalent because address 0x200 just contains a branch.
    Then, if the host machine check handler decides that the system can
    continue executing, kvmppc_handle_exit() delivers a machine check
    interrupt to the guest -- once again to let the guest know that SRR0/1
    have been modified.
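    The decision flow described above can be condensed into a small model. This is an illustrative sketch, not the kernel code: the enum, function name, and flags are hypothetical stand-ins for the SRR1 machine-check cause bits and the OPAL analysis in the paca.

    ```c
    #include <stdbool.h>

    /* Hypothetical condensed model of kvmppc_realmode_machine_check()'s
     * decision: causes the guest can trigger (SLB/TLB multi-hit) are
     * fixed up in real mode; OPAL-handled or no-error cases also return
     * to the guest; everything else goes to the host handler. */
    enum mc_cause { MC_SLB_MULTIHIT, MC_TLB_MULTIHIT, MC_OTHER };

    /* Returns true when the machine check can be delivered straight to
     * the guest without exiting to the host. */
    static bool handled_in_realmode(enum mc_cause cause,
                                    bool opal_no_error, bool opal_handled)
    {
        switch (cause) {
        case MC_SLB_MULTIHIT:   /* flush and reload the SLB */
        case MC_TLB_MULTIHIT:   /* flush the TLB */
            return true;
        default:
            /* defer to OPAL's verdict for everything else */
            return opal_no_error || opal_handled;
        }
    }
    ```

    Either way the guest receives a machine check interrupt, since SRR0/1 may already have been trashed.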
    Signed-off-by: Paul Mackerras <>
    [agraf: fix checkpatch warnings]
    Signed-off-by: Alexander Graf <>
  2. KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations

    paulusmack authored and agraf committed Nov 21, 2012
    When we change or remove a HPT (hashed page table) entry, we can do
    either a global TLB invalidation (tlbie) that works across the whole
    machine, or a local invalidation (tlbiel) that only affects this core.
    Currently we do local invalidations if the VM has only one vcpu or if
    the guest requests it with the H_LOCAL flag, though the guest Linux
    kernel currently doesn't ever use H_LOCAL.  Then, to cope with the
    possibility that vcpus moving around to different physical cores might
    expose stale TLB entries, there is some code in kvmppc_hv_entry to
    flush the whole TLB of entries for this VM if either this vcpu is now
    running on a different physical core from where it last ran, or if this
    physical core last ran a different vcpu.
    There are a number of problems on POWER7 with this as it stands:
    - The TLB invalidation is done per thread, whereas it only needs to be
      done per core, since the TLB is shared between the threads.
    - With the possibility of the host paging out guest pages, the use of
      H_LOCAL by an SMP guest is dangerous since the guest could possibly
      retain and use a stale TLB entry pointing to a page that had been
      removed from the guest.
    - The TLB invalidations that we do when a vcpu moves from one physical
      core to another are unnecessary in the case of an SMP guest that isn't
      using H_LOCAL.
    - The optimization of using local invalidations rather than global should
      apply to guests with one virtual core, not just one vcpu.
    (None of this applies on PPC970, since there we always have to
    invalidate the whole TLB when entering and leaving the guest, and we
    can't support paging out guest memory.)
    To fix these problems and simplify the code, we now maintain a simple
    cpumask of which cpus need to flush the TLB on entry to the guest.
    (This is indexed by cpu, though we only ever use the bits for thread
    0 of each core.)  Whenever we do a local TLB invalidation, we set the
    bits for every cpu except the bit for thread 0 of the core that we're
    currently running on.  Whenever we enter a guest, we test and clear the
    bit for our core, and flush the TLB if it was set.
    On initial startup of the VM, and when resetting the HPT, we set all the
    bits in the need_tlb_flush cpumask, since any core could potentially have
    stale TLB entries from the previous VM to use the same LPID, or the
    previous contents of the HPT.
    Then, we maintain a count of the number of online virtual cores, and use
    that when deciding whether to use a local invalidation rather than the
    number of online vcpus.  The code to make that decision is extracted out
    into a new function, global_invalidates().  For multi-core guests on
    POWER7 (i.e. when we are using mmu notifiers), we now never do local
    invalidations regardless of the H_LOCAL flag.
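    The cpumask bookkeeping described above can be sketched with a plain bitmask. This is a toy user-space model under assumed constants (64 cpus, 4 threads per core); the names mirror the commit's `need_tlb_flush` but are not the kernel's code.

    ```c
    #include <stdint.h>
    #include <stdbool.h>

    #define THREADS_PER_CORE 4

    /* Bit i stands for cpu i; only the thread-0 bit of each core is
     * ever consulted, as in the need_tlb_flush cpumask above. */
    static uint64_t need_tlb_flush;

    /* After a local (tlbiel) invalidation on `cpu`, every other core may
     * still hold stale entries: set all bits except the one for thread 0
     * of our own core. */
    static void note_local_invalidation(int cpu)
    {
        int core0 = cpu - (cpu % THREADS_PER_CORE);
        need_tlb_flush = ~0ULL & ~(1ULL << core0);
    }

    /* On guest entry, test-and-clear our core's bit; flush iff it was set. */
    static bool must_flush_on_entry(int cpu)
    {
        int core0 = cpu - (cpu % THREADS_PER_CORE);
        bool flush = (need_tlb_flush >> core0) & 1;
        need_tlb_flush &= ~(1ULL << core0);
        return flush;
    }
    ```

    Entering the guest on the same core that did the local invalidation finds its bit clear, while the first entry on any other core triggers exactly one flush.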
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
  3. KVM: PPC: Book3S PR: MSR_DE doesn't exist on Book 3S

    paulusmack authored and agraf committed Nov 4, 2012
    The mask of MSR bits that get transferred from the guest MSR to the
    shadow MSR included MSR_DE.  In fact that bit only exists on Book 3E
    processors, and it is assigned the same bit used for MSR_BE on Book 3S
    processors.  Since we already had MSR_BE in the mask, this just removes MSR_DE.
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
  4. KVM: PPC: Book3S PR: Fix VSX handling

    paulusmack authored and agraf committed Nov 4, 2012
    This fixes various issues in how we were handling the VSX registers
    that exist on POWER7 machines.  First, we were running off the end
    of the current->thread.fpr[] array.  Ultimately this was because the
    vcpu->arch.vsr[] array is sized to be able to store both the FP
    registers and the extra VSX registers (i.e. 64 entries), but PR KVM
    only uses it for the extra VSX registers (i.e. 32 entries).
    Secondly, calling load_up_vsx() from C code is a really bad idea,
    because it jumps to fast_exception_return at the end, rather than
    returning with a blr instruction.  This was causing it to jump off
    to a random location with random register contents, since it was using
    the largely uninitialized stack frame created by kvmppc_load_up_vsx.
    In fact, it isn't necessary to call either __giveup_vsx or load_up_vsx,
    since giveup_fpu and load_up_fpu handle the extra VSX registers as well
    as the standard FP registers on machines with VSX.  Also, since VSX
    instructions can access the VMX registers and the FP registers as well
    as the extra VSX registers, we have to load up the FP and VMX registers
    before we can turn on the MSR_VSX bit for the guest.  Conversely, if
    we save away any of the VSX or FP registers, we have to turn off MSR_VSX
    for the guest.
    To handle all this, it is more convenient for a single call to
    kvmppc_giveup_ext() to handle all the state saving that needs to be done,
    so we make it take a set of MSR bits rather than just one, and the switch
    statement becomes a series of if statements.  Similarly kvmppc_handle_ext
    needs to be able to load up more than one set of registers.
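    The "mask of MSR bits instead of one bit" reshaping can be sketched as follows. This is an illustrative model, not the kernel function: the MSR bit values are the architected ones but the state is reduced to a single ownership word, and the real code also saves the register contents.

    ```c
    /* Illustrative MSR facility bits (values as architected). */
    #define MSR_FP  (1UL << 13)
    #define MSR_VSX (1UL << 23)
    #define MSR_VEC (1UL << 25)

    static unsigned long guest_owned;  /* facilities currently loaded for the guest */

    /* Sketch of the reworked kvmppc_giveup_ext(): it takes a set of MSR
     * bits, and the old switch becomes a series of ifs so one call can
     * save several facilities.  Because VSX instructions also reach the
     * FP and VMX registers, saving FP or VMX state must turn off MSR_VSX
     * as well (modeled here by widening the mask). */
    static void giveup_ext(unsigned long msr)
    {
        if (msr & (MSR_FP | MSR_VEC))
            msr |= MSR_VSX;           /* VSX state overlaps FP/VMX state */
        if (msr & MSR_FP)
            guest_owned &= ~MSR_FP;   /* save FP regs, drop ownership */
        if (msr & MSR_VEC)
            guest_owned &= ~MSR_VEC;  /* save VMX regs, drop ownership */
        if (msr & MSR_VSX)
            guest_owned &= ~MSR_VSX;  /* guest may no longer use VSX */
    }
    ```

    The symmetric load path (kvmppc_handle_ext) similarly loads every facility named in the mask, so FP and VMX are in place before MSR_VSX is granted.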
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
  5. KVM: PPC: Book3S PR: Emulate PURR, SPURR and DSCR registers

    paulusmack authored and agraf committed Nov 4, 2012
    This adds basic emulation of the PURR and SPURR registers.  We assume
    we are emulating a single-threaded core, so these advance at the same
    rate as the timebase.  A Linux kernel running on a POWER7 expects to
    be able to access these registers and is not prepared to handle a
    program interrupt on accessing them.
    This also adds a very minimal emulation of the DSCR (data stream
    control register).  Writes are ignored and reads return zero.
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
  6. KVM: PPC: Book3S HV: Don't give the guest RW access to RO pages

    paulusmack authored and agraf committed Nov 21, 2012
    Currently, if the guest does an H_PROTECT hcall requesting that the
    permissions on a HPT entry be changed to allow writing, we make the
    requested change even if the page is marked read-only in the host
    Linux page tables.  This is a problem since it would for instance
    allow a guest to modify a page that KSM has decided can be shared
    between multiple guests.
    To fix this, if the new permissions for the page allow writing, we need
    to look up the memslot for the page, work out the host virtual address,
    and look up the Linux page tables to get the PTE for the page.  If that
    PTE is read-only, we reduce the HPTE permissions to read-only.
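    The clamp described above amounts to one check on the requested permissions. A minimal sketch, with the HPTE page-protection bits abstracted to a single hypothetical write flag and the memslot/PTE lookup reduced to a boolean:

    ```c
    #include <stdbool.h>

    #define HPTE_W  0x1UL   /* illustrative "write permitted" flag */

    /* If the new H_PROTECT permissions allow writing but the host Linux
     * PTE for the backing page is read-only (e.g. a KSM-shared page),
     * reduce the HPTE permissions to read-only. */
    static unsigned long clamp_h_protect(unsigned long new_flags,
                                         bool host_pte_writable)
    {
        if ((new_flags & HPTE_W) && !host_pte_writable)
            new_flags &= ~HPTE_W;
        return new_flags;
    }
    ```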
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
  7. KVM: PPC: Book3S HV: Report correct HPT entry index when reading HPT

    paulusmack authored and agraf committed Nov 21, 2012
    This fixes a bug in the code which allows userspace to read out the
    contents of the guest's hashed page table (HPT).  On the second and
    subsequent passes through the HPT, when we are reporting only those
    entries that have changed, we were incorrectly initializing the index
    field of the header with the index of the first entry we skipped
    rather than the first changed entry.  This fixes it.
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
  8. KVM: PPC: Book3S HV: Reset reverse-map chains when resetting the HPT

    paulusmack authored and agraf committed Nov 21, 2012
    With HV-style KVM, we maintain reverse-mapping lists that enable us to
    find all the HPT (hashed page table) entries that reference each guest
    physical page, with the heads of the lists in the memslot->arch.rmap
    arrays.  When we reset the HPT (i.e. when we reboot the VM), we clear
    out all the HPT entries but we were not clearing out the reverse
    mapping lists.  The result is that as we create new HPT entries, the
    lists get corrupted, which can easily lead to loops, resulting in the
    host kernel hanging when it tries to traverse those lists.
    This fixes the problem by zeroing out all the reverse mapping lists
    when we zero out the HPT.  This incidentally means that we are also
    zeroing our record of the referenced and changed bits (not the bits
    in the Linux PTEs, used by the Linux MM subsystem, but the bits used
    by the KVM_GET_DIRTY_LOG ioctl, and those used by kvm_age_hva() and
    kvm_test_age_hva()).
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
  9. KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT

    paulusmack authored and agraf committed Nov 19, 2012
    A new ioctl, KVM_PPC_GET_HTAB_FD, returns a file descriptor.  Reads on
    this fd return the contents of the HPT (hashed page table), writes
    create and/or remove entries in the HPT.  There is a new capability,
    KVM_CAP_PPC_HTAB_FD, to indicate the presence of the ioctl.  The ioctl
    takes an argument structure with the index of the first HPT entry to
    read out and a set of flags.  The flags indicate whether the user is
    intending to read or write the HPT, and whether to return all entries
    or only the "bolted" entries (those with the bolted bit, 0x10, set in
    the first doubleword).
    This is intended for use in implementing qemu's savevm/loadvm and for
    live migration.  Therefore, on reads, the first pass returns information
    about all HPTEs (or all bolted HPTEs).  When the first pass reaches the
    end of the HPT, it returns from the read.  Subsequent reads only return
    information about HPTEs that have changed since they were last read.
    A read that finds no changed HPTEs in the HPT following where the last
    read finished will return 0 bytes.
    The format of the data provides a simple run-length compression of the
    invalid entries.  Each block of data starts with a header that indicates
    the index (position in the HPT, which is just an array), the number of
    valid entries starting at that index (may be zero), and the number of
    invalid entries following those valid entries.  The valid entries, 16
    bytes each, follow the header.  The invalid entries are not explicitly
    represented in the stream.
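    The run-length format described above can be sketched as a header plus the valid entries. A minimal user-space encoder under assumed field widths (the struct name and exact layout here are illustrative):

    ```c
    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* One block of the stream: the position in the HPT, the number of
     * valid 16-byte entries that follow, and the number of invalid
     * entries that are simply skipped (the run-length compression). */
    struct htab_header {
        uint32_t index;
        uint16_t n_valid;
        uint16_t n_invalid;
    };

    /* Encode one run into `buf`; returns the number of bytes written.
     * `valid_bytes` holds n_valid entries of 16 bytes each. */
    static size_t encode_run(uint8_t *buf, uint32_t index,
                             const uint8_t *valid_bytes,
                             uint16_t n_valid, uint16_t n_invalid)
    {
        struct htab_header h = { index, n_valid, n_invalid };
        memcpy(buf, &h, sizeof(h));
        memcpy(buf + sizeof(h), valid_bytes, (size_t)n_valid * 16);
        return sizeof(h) + (size_t)n_valid * 16;
    }
    ```

    A run of invalid entries therefore costs only the header, which is what keeps the first full-HPT pass compact.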
    Signed-off-by: Paul Mackerras <>
    [agraf: fix documentation]
    Signed-off-by: Alexander Graf <>
  10. KVM: PPC: Book3S HV: Make a HPTE removal function available

    paulusmack authored and agraf committed Nov 19, 2012
    This makes a HPTE removal function, kvmppc_do_h_remove(), available
    outside book3s_hv_rm_mmu.c.  This will be used by the HPT writing
    code.
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
  11. KVM: PPC: Book3S HV: Add a mechanism for recording modified HPTEs

    paulusmack authored and agraf committed Nov 19, 2012
    This uses a bit in our record of the guest view of the HPTE to record
    when the HPTE gets modified.  We use a reserved bit for this, and ensure
    that this bit is always cleared in HPTE values returned to the guest.
    The recording of modified HPTEs is only done if other code indicates
    its interest by setting kvm->arch.hpte_mod_interest to a non-zero value.
    The reason for this is that when later commits add facilities for
    userspace to read the HPT, the first pass of reading the HPT will be
    quicker if there are no (or very few) HPTEs marked as modified,
    rather than having most HPTEs marked as modified.
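    The mechanism above reduces to setting a reserved software bit conditionally and masking it out on the way back to the guest. A sketch with an assumed bit position and illustrative names:

    ```c
    #include <stdint.h>

    /* Hypothetical reserved software bit marking a modified HPTE in our
     * record of the guest's view (position assumed for illustration). */
    #define HPTE_GR_MODIFIED (1ULL << 62)

    static int hpte_mod_interest;   /* set by code that wants modification tracking */

    /* Record a modification only when someone has registered interest,
     * so the first HPT read pass stays fast when few entries are marked. */
    static uint64_t note_hpte_modified(uint64_t guest_view)
    {
        if (hpte_mod_interest)
            guest_view |= HPTE_GR_MODIFIED;
        return guest_view;
    }

    /* The reserved bit must never leak back to the guest. */
    static uint64_t hpte_for_guest(uint64_t guest_view)
    {
        return guest_view & ~HPTE_GR_MODIFIED;
    }
    ```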
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
  12. KVM: PPC: Book3S HV: Fix bug causing loss of page dirty state

    paulusmack authored and agraf committed Nov 19, 2012
    This fixes a bug where adding a new guest HPT entry via the H_ENTER
    hcall would lose the "changed" bit in the reverse map information
    for the guest physical page being mapped.  The result was that the
    KVM_GET_DIRTY_LOG could return a zero bit for the page even though
    the page had been modified by the guest.
    This fixes it by only modifying the index and present bits in the
    reverse map entry, thus preserving the reference and change bits.
    We were also unnecessarily setting the reference bit, and this
    fixes that too.
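    The fix is a masked update rather than a full overwrite. A sketch with an assumed rmap-entry layout (bit positions and names here are illustrative, not the kernel's):

    ```c
    #include <stdint.h>

    /* Illustrative rmap-entry layout: low bits index the HPTE chain,
     * plus a present bit and the referenced/changed bits that mirror
     * the hardware R and C bits. */
    #define RMAP_INDEX_MASK 0xffffffULL
    #define RMAP_PRESENT    (1ULL << 24)
    #define RMAP_REFERENCED (1ULL << 25)
    #define RMAP_CHANGED    (1ULL << 26)

    /* Update only the index and present bits, leaving R and C untouched.
     * (The buggy code overwrote the whole word, losing the changed bit
     * and spuriously setting the referenced bit.) */
    static uint64_t rmap_set_mapping(uint64_t rmap, uint64_t index)
    {
        rmap &= ~(RMAP_INDEX_MASK | RMAP_PRESENT);
        return rmap | (index & RMAP_INDEX_MASK) | RMAP_PRESENT;
    }
    ```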
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
  13. KVM: PPC: Book3S HV: Restructure HPT entry creation code

    paulusmack authored and agraf committed Nov 13, 2012
    This restructures the code that creates HPT (hashed page table)
    entries so that it can be called in situations where we don't have a
    struct vcpu pointer, only a struct kvm pointer.  It also fixes a bug
    where kvmppc_map_vrma() would corrupt the guest R4 value.
    Most of the work of kvmppc_virtmode_h_enter is now done by a new
    function, kvmppc_virtmode_do_h_enter, which itself calls another new
    function, kvmppc_do_h_enter, which contains most of the old
    kvmppc_h_enter.  The new kvmppc_do_h_enter takes explicit arguments
    for the place to return the HPTE index, the Linux page tables to use,
    and whether it is being called in real mode, thus removing the need
    for it to have the vcpu as an argument.
    Currently kvmppc_map_vrma creates the VRMA (virtual real mode area)
    HPTEs by calling kvmppc_virtmode_h_enter, which is designed primarily
    to handle H_ENTER hcalls from the guest that need to pin a page of
    memory.  Since H_ENTER returns the index of the created HPTE in R4,
    kvmppc_virtmode_h_enter updates the guest R4, corrupting the guest R4
    in the case when it gets called from kvmppc_map_vrma on the first
    VCPU_RUN ioctl.  With this, kvmppc_map_vrma instead calls
    kvmppc_virtmode_do_h_enter with the address of a dummy word as the
    place to store the HPTE index, thus avoiding corrupting the guest R4.
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
Commits on Nov 14, 2012
  1. TTY: hvc_console, fix port reference count going to zero prematurely

    paulusmack authored and gregkh committed Nov 14, 2012
    Commit bdb498c "TTY: hvc_console, add tty install" took the port
    refcounting out of hvc_open()/hvc_close(), but failed to remove the
    kref_put() and tty_kref_put() calls in hvc_hangup() that were there to
    remove the extra references that hvc_open() had taken.
    The result was that doing a vhangup() when the current terminal was
    a hvc_console, then closing the current terminal, would end up calling
    destroy_hvc_struct() and making the port disappear entirely.  This
    meant that Fedora 17 systems would boot up but then not display the
    login prompt on the console, and attempts to open /dev/hvc0 would
    give a "No such device" error.
    This fixes it by removing the extra kref_put() and tty_kref_put() calls.
    Signed-off-by: Paul Mackerras <>
    Acked-by: Jiri Slaby <>
    Signed-off-by: Greg Kroah-Hartman <>
Commits on Oct 30, 2012
  1. KVM: PPC: Book3S HV: Fix thinko in try_lock_hpte()

    paulusmack authored and agraf committed Oct 15, 2012
    This fixes an error in the inline asm in try_lock_hpte() where we
    were erroneously using a register number as an immediate operand.
    The bug only affects an error path, and in fact the code will still
    work as long as the compiler chooses some register other than r0
    for the "bits" variable.  Nevertheless it should still be fixed.
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
  2. KVM: PPC: Book3S HV: Allow DTL to be set to address 0, length 0

    paulusmack authored and agraf committed Oct 15, 2012
    Commit 55b665b ("KVM: PPC: Book3S HV: Provide a way for userspace
    to get/set per-vCPU areas") includes a check on the length of the
    dispatch trace log (DTL) to make sure the buffer is at least one entry
    long.  This is appropriate when registering a buffer, but the
    interface also allows for any existing buffer to be unregistered by
    specifying a zero address.  In this case the length check is not
    appropriate.  This makes the check conditional on the address being
    non-zero.
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
  3. KVM: PPC: Book3S HV: Fix accounting of stolen time

    paulusmack authored and agraf committed Oct 15, 2012
    Currently the code that accounts stolen time tends to overestimate the
    stolen time, and will sometimes report more stolen time in a DTL
    (dispatch trace log) entry than has elapsed since the last DTL entry.
    This can cause guests to underflow the user or system time measured
    for some tasks, leading to ridiculous CPU percentages and total runtimes
    being reported by top and other utilities.
    In addition, the current code was designed for the previous policy where
    a vcore would only run when all the vcpus in it were runnable, and so
    only counted stolen time on a per-vcore basis.  Now that a vcore can
    run while some of the vcpus in it are doing other things in the kernel
    (e.g. handling a page fault), we need to count the time when a vcpu task
    is preempted while it is not running as part of a vcore as stolen also.
    To do this, we bring back the BUSY_IN_HOST vcpu state and extend the
    vcpu_load/put functions to count preemption time while the vcpu is
    in that state.  Handling the transitions between the RUNNING and
    BUSY_IN_HOST states requires checking and updating two variables
    (accumulated time stolen and time last preempted), so we add a new
    spinlock, vcpu->arch.tbacct_lock.  This protects both the per-vcpu
    stolen/preempt-time variables, and the per-vcore variables while this
    vcpu is running the vcore.
    Finally, we now don't count time spent in userspace as stolen time.
    The task could be executing in userspace on behalf of the vcpu, or
    it could be preempted, or the vcpu could be genuinely stopped.  Since
    we have no way of dividing up the time between these cases, we don't
    count any of it as stolen.
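    The BUSY_IN_HOST accounting above can be modeled as accumulating the gap between a put and the next load. A toy sketch with the timebase passed in explicitly and the tbacct_lock omitted (names are illustrative):

    ```c
    #include <stdint.h>

    struct vcpu_acct {
        uint64_t stolen;        /* accumulated stolen time */
        uint64_t preempt_tb;    /* timebase when last preempted */
        int busy_in_host;       /* models the BUSY_IN_HOST state */
    };

    /* vcpu_put while not running as part of a vcore: mark the start of
     * a potentially-preempted interval. */
    static void vcpu_put_busy(struct vcpu_acct *v, uint64_t now)
    {
        v->busy_in_host = 1;
        v->preempt_tb = now;
    }

    /* vcpu_load: the elapsed interval counts as stolen.  (Per the
     * commit, time spent executing in userspace is deliberately not
     * routed through this path.) */
    static void vcpu_load_busy(struct vcpu_acct *v, uint64_t now)
    {
        if (v->busy_in_host) {
            v->stolen += now - v->preempt_tb;
            v->busy_in_host = 0;
        }
    }
    ```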
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
  4. KVM: PPC: Book3S HV: Run virtual core whenever any vcpus in it can run

    paulusmack authored and agraf committed Oct 15, 2012
    Currently the Book3S HV code implements a policy on multi-threaded
    processors (i.e. POWER7) that requires all of the active vcpus in a
    virtual core to be ready to run before we run the virtual core.
    However, that causes problems on reset, because reset stops all vcpus
    except vcpu 0, and can also reduce throughput since all four threads
    in a virtual core have to wait whenever any one of them hits a
    hypervisor page fault.
    This relaxes the policy, allowing the virtual core to run as soon as
    any vcpu in it is runnable.  With this, the KVMPPC_VCPU_STOPPED state
    and the KVMPPC_VCPU_BUSY_IN_HOST state have been combined into a single
    KVMPPC_VCPU_NOTREADY state, since we no longer need to distinguish
    between them.
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
  5. KVM: PPC: Book3S HV: Fixes for late-joining threads

    paulusmack authored and agraf committed Oct 15, 2012
    If a thread in a virtual core becomes runnable while other threads
    in the same virtual core are already running in the guest, it is
    possible for the latecomer to join the others on the core without
    first pulling them all out of the guest.  Currently this only happens
    rarely, when a vcpu is first started.  This fixes some bugs and
    omissions in the code in this case.
    First, we need to check for VPA updates for the latecomer and make
    a DTL entry for it.  Secondly, if it comes along while the master
    vcpu is doing a VPA update, we don't need to do anything since the
    master will pick it up in kvmppc_run_core.  To handle this correctly
    we introduce a new vcore state, VCORE_STARTING.  Thirdly, there is
    a race because we currently clear the hardware thread's hwthread_req
    before waiting to see it get to nap.  A latecomer thread could have
    its hwthread_req cleared before it gets to test it, and therefore
    never increment the nap_count, leading to messages about wait_for_nap
    timeouts.
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
  6. KVM: PPC: Book3s HV: Don't access runnable threads list without vcore lock

    paulusmack authored and agraf committed Oct 15, 2012
    There were a few places where we were traversing the list of runnable
    threads in a virtual core, i.e. vc->runnable_threads, without holding
    the vcore spinlock.  This extends the places where we hold the vcore
    spinlock to cover everywhere that we traverse that list.
    Since we possibly need to sleep inside kvmppc_book3s_hv_page_fault,
    this moves the call of it from kvmppc_handle_exit out to
    kvmppc_vcpu_run, where we don't hold the vcore lock.
    In kvmppc_vcore_blocked, we don't actually need to check whether
    all vcpus are ceded and don't have any pending exceptions, since the
    caller has already done that.  The caller (kvmppc_run_vcpu) wasn't
    actually checking for pending exceptions, so we add that.
    The change of if to while in kvmppc_run_vcpu is to make sure that we
    never call kvmppc_remove_runnable() when the vcore state is RUNNING or
    EXITING.
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
  7. KVM: PPC: Book3S HV: Fix some races in starting secondary threads

    paulusmack authored and agraf committed Oct 15, 2012
    Subsequent patches implementing in-kernel XICS emulation will make it
    possible for IPIs to arrive at secondary threads at arbitrary times.
    This fixes some races in how we start the secondary threads, which
    if not fixed could lead to occasional crashes of the host kernel.
    This makes sure that (a) we have grabbed all the secondary threads,
    and verified that they are no longer in the kernel, before we start
    any thread, (b) that the secondary thread loads its vcpu pointer
    after clearing the IPI that woke it up (so we don't miss a wakeup),
    and (c) that the secondary thread clears its vcpu pointer before
    incrementing the nap count.  It also removes unnecessary setting
    of the vcpu and vcore pointers in the paca in kvmppc_core_vcpu_load.
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
  8. KVM: PPC: Book3S HV: Allow KVM guests to stop secondary threads coming online

    paulusmack authored and agraf committed Oct 15, 2012
    When a Book3S HV KVM guest is running, we need the host to be in
    single-thread mode, that is, all of the cores (or at least all of
    the cores where the KVM guest could run) to be running only one
    active hardware thread.  This is because of the hardware restriction
    in POWER processors that all of the hardware threads in the core
    must be in the same logical partition.  Complying with this restriction
    is much easier if, from the host kernel's point of view, only one
    hardware thread is active.
    This adds two hooks in the SMP hotplug code to allow the KVM code to
    make sure that secondary threads (i.e. hardware threads other than
    thread 0) cannot come online while any KVM guest exists.  The KVM
    code still has to check that any core where it runs a guest has the
    secondary threads offline, but having done that check it can now be
    sure that they will not come online while the guest is running.
    Signed-off-by: Paul Mackerras <>
    Acked-by: Benjamin Herrenschmidt <>
    Signed-off-by: Alexander Graf <>
Commits on Oct 5, 2012
  1. KVM: PPC: Book3S HV: Provide a way for userspace to get/set per-vCPU areas

    paulusmack authored and agraf committed Sep 25, 2012
    The PAPR paravirtualization interface lets guests register three
    different types of per-vCPU buffer areas in its memory for communication
    with the hypervisor.  These are called virtual processor areas (VPAs).
    Currently the hypercalls to register and unregister VPAs are handled
    by KVM in the kernel, and userspace has no way to know about or save
    and restore these registrations across a migration.
    This adds "register" codes for these three areas that userspace can
    use with the KVM_GET/SET_ONE_REG ioctls to see what addresses have
    been registered, and to register or unregister them.  This will be
    needed for guest hibernation and migration, and is also needed so
    that userspace can unregister them on reset (otherwise we corrupt
    guest memory after reboot by writing to the VPAs registered by the
    previous kernel).
    The "register" for the VPA is a 64-bit value containing the address,
    since the length of the VPA is fixed.  The "registers" for the SLB
    shadow buffer and dispatch trace log (DTL) are 128 bits long,
    consisting of the guest physical address in the high (first) 64 bits
    and the length in the low 64 bits.
    This also fixes a bug where we were calling init_vpa unconditionally,
    leading to an oops when unregistering the VPA.
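    The 128-bit "register" layout described above can be modeled as a pair of 64-bit words. A sketch with an illustrative struct name; a zero address expresses unregistration, matching the reset use case:

    ```c
    #include <stdint.h>

    /* The SLB shadow / DTL "register" format described above: guest
     * physical address in the high (first) 64 bits, length in the low
     * 64 bits.  (The plain VPA register is just the 64-bit address.) */
    struct vpa_reg128 {
        uint64_t addr;   /* high 64 bits of the 128-bit value */
        uint64_t len;    /* low 64 bits */
    };

    static struct vpa_reg128 pack_vpa_reg(uint64_t addr, uint64_t len)
    {
        struct vpa_reg128 r = { addr, len };
        return r;
    }

    /* A zero address unregisters the buffer. */
    static int is_unregister(struct vpa_reg128 r)
    {
        return r.addr == 0;
    }
    ```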
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
  2. KVM: PPC: Book3S: Get/set guest FP regs using the GET/SET_ONE_REG interface

    paulusmack authored and agraf committed Sep 25, 2012
    This enables userspace to get and set all the guest floating-point
    state using the KVM_[GS]ET_ONE_REG ioctls.  The floating-point state
    includes all of the traditional floating-point registers and the
    FPSCR (floating point status/control register), all the VMX/Altivec
    vector registers and the VSCR (vector status/control register), and
    on POWER7, the vector-scalar registers (note that each FP register
    is the high-order half of the corresponding VSR).
    Most of these are implemented in common Book 3S code, except for VSX
    on POWER7.  Because HV and PR differ in how they store the FP and VSX
    registers on POWER7, the code for these cases is not common.  On POWER7,
    the FP registers are the upper halves of the VSX registers vsr0 - vsr31.
    PR KVM stores vsr0 - vsr31 in two halves, with the upper halves in the
    arch.fpr[] array and the lower halves in the arch.vsr[] array, whereas
    HV KVM on POWER7 stores the whole VSX register in arch.vsr[].
    Signed-off-by: Paul Mackerras <>
    [agraf: fix whitespace, vsx compilation]
    Signed-off-by: Alexander Graf <>
  3. KVM: PPC: Book3S: Get/set guest SPRs using the GET/SET_ONE_REG interface

    paulusmack authored and agraf committed Sep 25, 2012
    This enables userspace to get and set various SPRs (special-purpose
    registers) using the KVM_[GS]ET_ONE_REG ioctls.  With this, userspace
    can get and set all the SPRs that are part of the guest state, either
    through the KVM_[GS]ET_REGS ioctls, the KVM_[GS]ET_SREGS ioctls, or
    the KVM_[GS]ET_ONE_REG ioctls.
    The SPRs that are added here are:
    - DABR:  Data address breakpoint register
    - DSCR:  Data stream control register
    - PURR:  Processor utilization of resources register
    - SPURR: Scaled PURR
    - DAR:   Data address register
    - DSISR: Data storage interrupt status register
    - AMR:   Authority mask register
    - UAMOR: User authority mask override register
    - MMCR0, MMCR1, MMCRA: Performance monitor unit control registers
    - PMC1..PMC8: Performance monitor unit counter registers
    In order to reduce code duplication between PR and HV KVM code, this
    moves the kvm_vcpu_ioctl_[gs]et_one_reg functions into book3s.c and
    centralizes the copying between user and kernel space there.  The
    registers that are handled differently between PR and HV, and those
    that exist only in one flavor, are handled in kvmppc_[gs]et_one_reg()
    functions that are specific to each flavor.
    Signed-off-by: Paul Mackerras <>
    [agraf: minimal style fixes]
    Signed-off-by: Alexander Graf <>
  4. KVM: PPC: Book3S HV: Fix calculation of guest phys address for MMIO emulation

    paulusmack authored and agraf committed Sep 20, 2012
    In the case where the host kernel is using a 64kB base page size and
    the guest uses a 4k HPTE (hashed page table entry) to map an emulated
    MMIO device, we were calculating the guest physical address wrongly.
    We were calculating a gfn as the guest physical address shifted right
    16 bits (PAGE_SHIFT) but then only adding back in 12 bits from the
    effective address, since the HPTE had a 4k page size.  Thus the gpa
    reported to userspace was missing 4 bits.
    Instead, we now compute the guest physical address from the HPTE
    without reference to the host page size, and then compute the gfn
    by shifting the gpa right PAGE_SHIFT bits.
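    The corrected derivation can be shown in a few lines: the gpa comes entirely from the HPTE's real page number and the HPTE's own page size, and only then is the host PAGE_SHIFT applied to get the gfn. A sketch with illustrative names:

    ```c
    #include <stdint.h>

    #define PAGE_SHIFT 16   /* 64kB host base pages, as in the bug above */

    /* Compute the guest physical address from the HPTE alone: real page
     * number shifted by the HPTE page size, plus the in-page offset
     * taken from the effective address.  The host page size plays no
     * part here. */
    static uint64_t gpa_from_hpte(uint64_t hpte_rpn, unsigned pshift,
                                  uint64_t ea)
    {
        uint64_t mask = (1ULL << pshift) - 1;   /* 4k HPTE: pshift = 12 */
        return (hpte_rpn << pshift) | (ea & mask);
    }

    /* The gfn is derived from the gpa, using the host page size only here. */
    static uint64_t gfn_from_gpa(uint64_t gpa)
    {
        return gpa >> PAGE_SHIFT;
    }
    ```

    The buggy path effectively built the gfn first with a 16-bit shift and then added back only 12 bits of offset, which is how the middle 4 bits went missing.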
    Reported-by: Alexey Kardashevskiy <>
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
  5. KVM: PPC: Book3S HV: Remove bogus update of physical thread IDs

    paulusmack authored and agraf committed Sep 20, 2012
    When making a vcpu non-runnable we incorrectly changed the
    thread IDs of all other threads on the core; just remove that code.
    Signed-off-by: Benjamin Herrenschmidt <>
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
  6. KVM: PPC: Book3S HV: Fix updates of vcpu->cpu

    paulusmack authored and agraf committed Sep 20, 2012
    This removes the powerpc "generic" updates of vcpu->cpu in load and
    put, and moves them to the various backends.
    The reason is that "HV" KVM does its own sauce with that field
    and the generic updates might corrupt it. The field contains the
    CPU# of the -first- HW CPU of the core always for all the VCPU
    threads of a core (the one that's online from a host Linux
    perspective).
    However, the preempt notifiers are going to be called on the
    threads VCPUs when they are running (due to them sleeping on our
    private waitqueue) causing unload to be called, potentially
    clobbering the value.
    Signed-off-by: Benjamin Herrenschmidt <>
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
  7. KVM: Move some PPC ioctl definitions to the correct place

    paulusmack authored and agraf committed Sep 13, 2012
    This moves the definitions of KVM_CREATE_SPAPR_TCE and
    KVM_ALLOCATE_RMA in include/linux/kvm.h from the section listing the
    vcpu ioctls to the section listing VM ioctls, as these are both
    implemented and documented as VM ioctls.
    Fortunately there is no actual collision of ioctl numbers at this
    point.  Moving these to the correct section will reduce the
    probability of a future collision.  This does not change the
    user/kernel ABI at all.
    Signed-off-by: Paul Mackerras <>
    Acked-by: Alexander Graf <>
    Signed-off-by: Alexander Graf <>
  8. KVM: PPC: Book3S HV: Handle memory slot deletion and modification cor…

    paulusmack authored and agraf committed Sep 11, 2012
    This adds an implementation of kvm_arch_flush_shadow_memslot for
    Book3S HV, and arranges for kvmppc_core_commit_memory_region to
    flush the dirty log when modifying an existing slot.  With this,
    we can handle deletion and modification of memory slots.
    kvm_arch_flush_shadow_memslot calls kvmppc_core_flush_memslot, which
    on Book3S HV now traverses the reverse map chains to remove any HPT
    (hashed page table) entries referring to pages in the memslot.  This
    gets called by generic code whenever deleting a memslot or changing
    the guest physical address for a memslot.
    We flush the dirty log in kvmppc_core_commit_memory_region for
    consistency with what x86 does.  We only need to flush when an
    existing memslot is being modified, because for a new memslot the
    rmap array (which stores the dirty bits) is all zero, meaning that
    every page is considered clean already, and when deleting a memslot
    we obviously don't care about the dirty bits any more.
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
  9. KVM: PPC: Move kvm->arch.slot_phys into memslot.arch

    paulusmack authored and agraf committed Sep 11, 2012
    Now that we have an architecture-specific field in the kvm_memory_slot
    structure, we can use it to store the array of page physical addresses
    that we need for Book3S HV KVM on PPC970 processors.  This reduces the
    size of struct kvm_arch for Book3S HV, and also reduces the size of
    struct kvm_arch_memory_slot for other PPC KVM variants since the fields
    in it are now only compiled in for Book3S HV.
    This necessitates making the kvm_arch_create_memslot and
    kvm_arch_free_memslot operations specific to each PPC KVM variant.
    That in turn means that we now don't allocate the rmap arrays on
    Book3S PR and Book E.
    Since we now unpin pages and free the slot_phys array in
    kvmppc_core_free_memslot, we no longer need to do it in
    kvmppc_core_destroy_vm, since the generic code takes care to free
    all the memslots when destroying a VM.
    We now need the new memslot to be passed in to
    kvmppc_core_prepare_memory_region, since we need to initialize its
    arch.slot_phys member on Book3S HV.
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
  10. KVM: PPC: Book3S HV: Take the SRCU read lock before looking up memslots

    paulusmack authored and agraf committed Sep 11, 2012
    The generic KVM code uses SRCU (sleeping RCU) to protect accesses
    to the memslots data structures against updates due to userspace
    adding, modifying or removing memory slots.  We need to do that too,
    both to avoid accessing stale copies of the memslots and to avoid
    lockdep warnings.  This therefore adds srcu_read_lock/unlock pairs
    around code that accesses and uses memslots.
    Since the real-mode handlers for H_ENTER, H_REMOVE and H_BULK_REMOVE
    need to access the memslots, and we don't want to call the SRCU code
    in real mode (since we have no assurance that it would only access
    the linear mapping), we hold the SRCU read lock for the VM while
    in the guest.  This does mean that adding or removing memory slots
    while some vcpus are executing in the guest will block for up to
    two jiffies.  This tradeoff is acceptable since adding/removing
    memory slots only happens rarely, while H_ENTER/H_REMOVE/H_BULK_REMOVE
    are performance-critical hot paths.
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
  11. KVM: PPC: Quieten message about allocating linear regions

    paulusmack authored and agraf committed Aug 6, 2012
    This is printed once for every RMA or HPT region that gets
    preallocated.  If one preallocates hundreds of such regions
    (in order to run hundreds of KVM guests), that gets rather
    painful, so make it a bit quieter.
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Alexander Graf <>
Commits on Sep 5, 2012
  1. powerpc: Make sure IPI handlers see data written by IPI senders

    paulusmack authored and ozbenh committed Sep 4, 2012
    We have been observing hangs, both of KVM guest vcpu tasks and more
    generally, where a process that is woken doesn't properly wake up and
    continue to run, but instead sticks in TASK_WAKING state.  This
    happens because the update of rq->wake_list in ttwu_queue_remote()
    is not ordered with the update of ipi_message in
    smp_muxed_ipi_message_pass(), and the reading of rq->wake_list in
    scheduler_ipi() is not ordered with the reading of ipi_message in
    smp_ipi_demux().  Thus it is possible for the IPI receiver not to see
    the updated rq->wake_list and therefore conclude that there is nothing
    for it to do.
    In order to make sure that anything done before smp_send_reschedule()
    is ordered before anything done in the resulting call to scheduler_ipi(),
    this adds barriers in smp_muxed_ipi_message_pass() and smp_ipi_demux().
    The barrier in smp_muxed_ipi_message_pass() is a full barrier to ensure that
    there is a full ordering between the smp_send_reschedule() caller and
    scheduler_ipi().  In smp_ipi_demux(), we use xchg() rather than
    xchg_local() because xchg() includes release and acquire barriers.
    Using xchg() rather than xchg_local() makes sense given that
    ipi_message is not just accessed locally.
    This moves the barrier between setting the message and calling the
    cause_ipi() function into the individual cause_ipi implementations.
    Most of them -- those that used outb, out_8 or similar -- already had
    a full barrier because out_8 etc. include a sync before the MMIO
    store.  This adds an explicit barrier in the two remaining cases.
    These changes made no measurable difference to the speed of IPIs as
    measured using a simple ping-pong latency test across two CPUs on
    different cores of a POWER7 machine.
    The analysis of the reason why processes were not waking up properly
    is due to Milton Miller.
    Cc: # v3.0+
    Reported-by: Milton Miller <>
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Benjamin Herrenschmidt <>
  2. powerpc/powernv: Always go into nap mode when CPU is offline

    paulusmack authored and ozbenh committed Jul 26, 2012
    The CPU hotplug code for the powernv platform currently only puts
    offline CPUs into nap mode if the powersave_nap variable is set.
    However, HV-style KVM on this platform requires secondary CPU threads
    to be offline and in nap mode.  Since we know nap mode works just
    fine on all POWER7 machines, and the only machines that support the
    powernv platform are POWER7 machines, this changes the code to
    always put offline CPUs into nap mode, regardless of powersave_nap.
    Powersave_nap still controls whether or not CPUs go into nap mode
    when idle, as before.
    Signed-off-by: Paul Mackerras <>
    Signed-off-by: Benjamin Herrenschmidt <>