Commits on Jul 30, 2016
  1. kernel - Refactor cpu localization for VM page allocations

    * Change how cpu localization works.  The old scheme was extremely unbalanced
      in terms of vm_page_queue[] load.
      The new scheme uses cpu topology information to break the vm_page_queue[]
      down into major blocks based on the physical package id, minor blocks
      based on the core id in each physical package, and then by 1's based on
      (pindex + object->pg_color).
      If PQ_L2_SIZE is not big enough such that 16-way operation is attainable
      by physical and core id, we break the queue down only by physical id.
      Note that the core id is a real core count, not a cpu thread count, so
      an 8-core/16-thread x 2 socket xeon system will just fit in the 16-way
      requirement (there are 256 PQ_FREE queues).
    * When a particular queue does not have a free page, iterate nearby queues
      starting at +/- 1 (previously we started at +/- PQ_L2_SIZE/2), in an attempt to
      retain as much locality as possible.  This won't be perfect but it should
      be good enough.
    * Also fix an issue with the idlezero counters.
    Matthew Dillon committed Jul 29, 2016
Commits on Jul 29, 2016
  1. systat - Adjust extended vmstats display

    * When the number of devices is small enough (or you explicitly specify
      just a few disk devices, or one), there is enough room for the
      extended vmstats display.  Make some adjustments to this display.
    * Display values in bytes (K, M, G, etc) instead of pages, like the
      other fields.
    * Rename zfod to nzfod and subtract-away ozfod when displaying nzfod
      (only in the extended display), so the viewer doesn't have to do the
      subtraction in his head.
    Matthew Dillon committed Jul 29, 2016
  2. kernel - Reduce memory testing and early-boot zeroing.

    * Reduce the amount of memory testing and early-boot zeroing that
      we do, improving boot times on systems with large amounts of memory.
    * Fix race in the page zeroing count.
    * Refactor the VM zeroidle code.  Instead of having just one kernel thread,
      have one on each cpu.
      This significantly increases the rate at which the machine can eat up
      idle cycles to pre-zero pages in the cold path, improving performance
      in the hot-path (normal) page allocations which request zeroed pages.
    * On systems with a lot of cpus there is usually a little idle time (e.g.
      0.1%) on a few of the cpus, even under extreme loads.  At the same time,
      such loads might also imply a lot of zfod faults requiring zero'd pages.
      On our 48-core opteron we see a zfod rate of 1.0 to 1.5 GBytes/sec and
      a page-freeing rate of 1.3 - 2.5 GBytes/sec.  Distributing the page
      zeroing code and eating up these minuscule bits of idle improves the
      kernel's ability to provide a pre-zeroed page (vs having to zero it in
      the hot path) significantly.
      Under the synth test load the kernel was still able to provide 400-700
      MBytes/sec worth of pre-zeroed pages, whereas before this change the
      kernel was only able to provide 20 MBytes/sec worth.
    Matthew Dillon committed Jul 29, 2016
  3. kernel - Cleanup namecache stall messages on console

    * Report the proper elapsed time and also include td->td_comm
      in the printed output on the console.
    Matthew Dillon committed Jul 29, 2016
  4. kernel - Fix rare tsleep/callout race

    * Fix a rare tsleep/callout race.  The callout timer can trigger before
      the tsleep() releases its lwp_token (or if someone else holds the
      calling thread's lwp_token).
      This case was detected, but the code failed to adjust lwp_stat before
      descheduling and switching away, which resulted in an endless sleep.
    Matthew Dillon committed Jul 29, 2016
  5. mktemp.3: Improve the manpage, add mklinks.

    Fix SYNOPSIS, remove outdated information and clarify availability.
    Taken-from: FreeBSD
    zrj-rimwis committed with zrj Jul 29, 2016
  6. hyperv/vmbus: Passthrough interrupt resource allocation to nexus

    This greatly simplifies interrupt allocation.  Also reenable the interrupt
    resource not found warning in acpi.
    Sepherosa Ziehau committed Jul 29, 2016
  7. libthread_xu - Don't override vfork()

    * Allow vfork() to operate normally in a threaded environment.  The kernel
      can handle multiple concurrent vfork()s by different threads (only the
      calling thread blocks, same as how Linux deals with it).
    Matthew Dillon committed Jul 28, 2016
Commits on Jul 28, 2016
  1. mktemp.3: Fix a typo and bump .Dd

    Sascha Wildner committed Jul 28, 2016
  2. kernel - Be nicer to pthreads in vfork()

    * When vfork()ing, give the new sub-process's lwp the same TID as the one
      that called vfork().  Even though user processes are not supposed to do
      anything sophisticated inside a vfork() prior to exec()ing, some
      operations, such as fileno() having to take locks in a threaded
      environment, might not be apparent to the programmer.
    * By giving the sub-process the same TID, operations done inside the
      vfork() prior to exec that interact with pthreads will not confuse
      pthreads and cause corruption due to e.g. TID 0 clashing with TID 0
      running in the parent that is running concurrently.
    Matthew Dillon committed Jul 28, 2016
  3. ed(1): Sync with FreeBSD.

    Sascha Wildner committed Jul 28, 2016
  4. ed(1): Remove handling of non-POSIX environment.

    Sascha Wildner committed Jul 28, 2016
  5. libc - Fix more popen() issues

    * Fix a file descriptor leak between popen() and pclose() in a threaded
      environment.  The control structure is removed from the list, then the
      list is unlocked, then the file is closed.  This can race a popen()
      in between the unlock and the closure.
    * Do not use fileno() inside vfork, it is a complex function in a threaded
      environment which could lead to corruption since the vfork()'s lwp id may
      clash with one from the parent process.
    Matthew Dillon committed Jul 28, 2016
  6. kernel - Fix getpid() issue in vfork() when threaded

    * upmap->invfork was a 0 or 1, but in a threaded program it is possible
      for multiple threads to be in vfork() at the same time.  Change invfork
      to a count.
    * Fixes improper getpid() return when concurrent vfork()s are occurring in
      a threaded program.
    Matthew Dillon committed Jul 28, 2016
  7. drm/linux: Clean-up pci_resource_start()

    Making it less verbose
    François Tigeot committed Jul 28, 2016
Commits on Jul 27, 2016
  1. systat - Restrict %rip sampling to root

    * Only allow root to sample the %rip and %rsp on all cpus.  The sysctl will
      not sample and return 0 for these fields if the uid is not root.
      This is for security, as %rip sampling can be used to break cryptographic
      code.
    * systat -pv 1 will not display the sampling columns if the sample value
      is 0.
    Matthew Dillon committed Jul 27, 2016
  2. test - Add umtx1 code

    * Add umtx1 code - fast context switch tests
    * Make blib.c thread-safe.
    Matthew Dillon committed Jul 27, 2016
  3. libc - Fix numerous fork/exec*() leaks, also add mkostemp() and mkost…

    * Use O_CLOEXEC in many places to prevent temporary descriptors from leaking
      into fork/exec'd code (e.g. in multi-threaded situations).
    * Note that the popen code will close any other popen()'d descriptors in
      the child process that it forks just prior to exec.  However, there was
      a descriptor leak where another thread issuing popen() at the same time
      could leak the descriptors into their exec.
      Use O_CLOEXEC to close this hole.
    * popen() now accepts the 'e' flag (i.e. "re") to retain O_CLOEXEC in the
      returned descriptor.  Normal "r" (etc) will clear O_CLOEXEC in the
      returned descriptor.
      Note that normal "r" modes are still fine for most use cases since popen
      properly closes other popen()d descriptors in the fork().  BUT!! If the
      threaded program calls exec*() in other ways, such descriptors may
      unintentionally be passed onto sub-processes.  So consider using "re".
    * Add mkostemp() and mkostemps() to allow O_CLOEXEC to be passed in,
      closing a thread race that would otherwise leak the temporary descriptor
      into other fork/exec()s.
    Taken-from: Mostly taken from FreeBSD
    Matthew Dillon committed Jul 27, 2016
Commits on Jul 26, 2016
  1. kernel - Disable lwp->lwp optimization in thread switcher

    * Put #ifdef around the existing lwp->lwp switch optimization and then
      disable it.  This optimization tries to avoid reloading %cr3 and avoid
      pmap->pm_active atomic ops when switching to a lwp that shares the same
      vmspace.
      This optimization is no longer applicable on multi-core systems as such
      switches are very rare.  LWPs are usually distributed across multiple cores
      so rarely does one switch to another on the same core (and in cpu-bound
      situations, the scheduler will already be in batch mode).  The conditionals
      in the optimization, on the other hand, did measurably (just slightly)
      reduce performance for normal switches.  So turn it off.
    * Implement an optimization for interrupt preemptions, but disable it for
      now.  I want to keep the code handy but so far my tests show no improvement
      in performance with huge interrupt rates (from nvme devices), so it is
      #undef'd for now.
    Matthew Dillon committed Jul 26, 2016
  2. kernel - Minor cleanup swtch.s

    * Minor cleanup
    Matthew Dillon committed Jul 26, 2016
  3. kernel - Fix namecache race & panic

    * Properly lock and re-check the parent association when iterating its
      children, fixing a bug in a code path associated with unmounting.
      The code improperly assumed that there could be no races because there
      were no accessors left.  In fact, under heavy loads, the namecache
      scan in this routine can race against the negative-name-cache management
      code.
    * Generally speaking, this can only happen when lots of mounts and unmounts are
      done under heavy loads (for example, tmpfs mounts during a poudriere or
      synth run).
    Matthew Dillon committed Jul 26, 2016
  4. kernel - Reduce atomic ops in switch code

    * Instead of using four atomic 'and' ops and four atomic 'or' ops, use
      one atomic 'and' and one atomic 'or' when adjusting the pmap->pm_active.
    * Store the array index and simplified cpu mask in the globaldata structure
      for the above operation.
    Matthew Dillon committed Jul 26, 2016
  5. kernel - refactor CPUMASK_ADDR()

    * Refactor CPUMASK_ADDR(), removing the conditionals and just indexing the
      array as appropriate.
    Matthew Dillon committed Jul 26, 2016
  6. kernel - Fix VM bug introduced earlier this month

    * Adding the yields to the VM page teardown and related code was a great
      idea (~Jul 10th commits), but it also introduced a bug where the page
      could get torn-out from under the scan due to the vm_object's token being
      temporarily lost.
    * Re-check page object ownership and (when applicable) its pindex before
      acting on the page.
    Matthew Dillon committed Jul 25, 2016
Commits on Jul 25, 2016
  1. systat - Refactor memory displays for systat -vm

    * Report paging and swap activity in bytes and I/Os instead of pages and
      I/Os (I/Os usually matched pages).
    * Report zfod and cow in bytes instead of pages.
    * Replace the REAL and VIRTUAL section with something that makes a bit
      more sense.
      Report active memory (this is just active pages), kernel memory
      (currently just wired but we can add more stuff later), Free
      (inactive + cache + free is considered free/freeable memory), and
      total system memory as reported at boot time.
      Report total RSS - basically how many pages the system is mapping to
      user processes.  Due to sharing this can be a large value.
      Do not try to report aggregate VSZ as there's no point in doing so
      any more.
      Report swap usage on the main -vm display as well as total swap
      space.
    * Fix display bug in systat -sw display.
    * Add "nvme" device type match for the disk display.
    Matthew Dillon committed Jul 25, 2016
  2. if_iwm - Fix inverted logic in iwm_tx().

    The PROT_REQUIRE flag should be set for data frames above a certain
    length, but we were setting it for !data frames above a certain length,
    which makes no sense at all.
    Taken-From: OpenBSD, Linux iwlwifi
    ivadasz committed Jul 24, 2016
  3. kernel - Fix mountctl() / unmount race

    * kern_mountctl() now properly checks to see if an unmount is in-progress
      and returns an error, fixing a later panic.
    Matthew Dillon committed Jul 25, 2016
  4. sysconf.3: Fix typo.

    Sascha Wildner committed Jul 25, 2016
  5. libc/strptime: Return NULL, not 0, since the function returns char *.

    While here, accept 'UTC' for %Z as well.
    Taken-from: FreeBSD
    Sascha Wildner committed Jul 25, 2016
  6. mountd, mount - Change how mount signals mountd, reduce mountd spam

    * mount now signals mountd with SIGUSR1 instead of SIGHUP.
    * mountd now recognizes SIGUSR1 as requesting an incremental update.
      Instead of wiping all exports on all mounts and then re-scanning
      the exports file and re-adding from the exports file, mountd will
      now only wipe the export(s) on mounts it finds in the exports file.
    * Greatly reduces unnecessary mountlist scans and commands due to
      mount_null and mount_tmpfs operations, while still preserving our
      ability to export such filesystems.
    Matthew Dillon committed Jul 25, 2016
  7. kernel - Close a few SMP holes

    * Don't trust the compiler when loading refs in cache_zap().  Make sure
      it doesn't reorder or re-use the memory reference.
    * In cache_nlookup() and cache_nlookup_maybe_shared(), do a full re-test
      of the namecache element after locking instead of a partial re-test.
    * Lock the namecache record in two situations where we need to set a
      flag.  Almost all other flag cases require similar locking.  This fixes
      a potential SMP race in a very thin window during mounting.
    * Fix unmount / access races in sys_vquotactl() and, more importantly, in
      sys_mount().  We were disposing of the namecache record after extracting
      the mount pointer, then using the mount pointer.  This could race an
      unmount and result in a corrupt mount pointer.
      Change the code to dispose of the namecache record after we finish using
      the mount point.  This is somewhat more complex than I'd like, but it
      is important to unlock the namecache record across the potentially
      blocking operation to prevent a lock chain from propagating upwards
      towards the root.
    * Enhanced debugging for the namecache teardown case when nc_refs changes
      unexpectedly.
    * Remove some dead code (cache_purgevfs()).
    Matthew Dillon committed Jul 24, 2016
  8. kernel - Cut buffer cache related pmap invalidations in half

    * Do not bother to invalidate the TLB when tearing down a buffer
      cache buffer.  On the flip side, always invalidate the TLB
      (the page range in question) when entering pages into a buffer
      cache buffer.  Only applicable to normal VMIO buffers.
    * Significantly improves buffer cache / filesystem performance with
      no real risk.
    * Significantly improves performance for tmpfs teardowns on unmount
      (which typically have to tear-down a lot of buffer cache buffers).
    Matthew Dillon committed Jul 24, 2016
  9. kernel - Add some more options for pmap_qremove*()

    * Add pmap_qremove_quick() and pmap_qremove_noinval(), allowing pmap
      entries to be removed without invalidation under carefully managed
      circumstances by other subsystems.
    * Redo the virtual kernel a little to work the same as the real kernel
      when entering new pmap entries.  We cannot assume that no invalidation
      is needed when the prior content of the pte is 0, because there are
      several ways it could have become 0 without a prior invalidation.
      Also use an atomic op to clear the entry.
    Matthew Dillon committed Jul 24, 2016
  10. kernel - cli interlock with critcount in interrupt assembly

    * Disable interrupts when decrementing the critical section count
      and gd_intr_nesting_level, just prior to jumping into doreti.
      This prevents a stacking interrupt from occurring in this roughly
      10-instruction window.
    * While limited stacking is not really a problem, this closes a very
      small and unlikely window where multiple device interrupts could
      stack excessively and run the kernel thread out of stack space.
      (unlikely that it has ever happened in real life, but becoming more
      likely as some modern devices are capable of much higher interrupt
      rates).
    Matthew Dillon committed Jul 24, 2016