Qu-Wenruo/btrf…
Commits on Jan 7, 2016
-
btrfs: dedup: Add ioctl for inband deduplication
Add ioctl interface for inband deduplication, which includes: 1) enable 2) disable 3) status We will later add ioctl to disable inband dedup for given file/dir. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
-
btrfs: dedup: Add support for adding hash for on-disk backend
Now on-disk backend can add hash now. Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
-
btrfs: dedup: Add support to delete hash for on-disk backend
Now on-disk backend can delete hash now. Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
-
btrfs: dedup: Add support for on-disk hash search
Now on-disk backend should be able to search hash now. Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
-
btrfs: dedup: Introduce interfaces to resume and cleanup dedup info
Since we will introduce a new on-disk based dedup method, introduce new interfaces to resume previous dedup setup. And since we introduce a new tree for status, also add disable handler for it. Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
-
btrfs: dedup: Add basic tree structure for on-disk dedup method
Introduce a new tree, dedup tree to record on-disk dedup hash. As a persist hash storage instead of in-memeory only implement. Unlike Liu Bo's implement, in this version we won't do hack for bytenr -> hash search, but add a new type, DEDUP_BYTENR_ITEM for such search case, just like in-memory backend. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
-
btrfs: dedup: Inband in-memory only de-duplication implement
Core implement for inband de-duplication. It reuse the async_cow_start() facility to calculate dedup hash. And use dedup hash to do inband de-duplication at extent level. The work flow is as below: 1) Run delalloc range for an inode 2) Calculate hash for the delalloc range at the unit of dedup_bs 3) For hash match(duplicated) case, just increase source extent ref and insert file extent. For hash mismatch case, go through the normal cow_file_range() fallback, and add hash into dedup_tree. Compress for hash miss case is not supported yet. Current implement restore all dedup hash in memory rb-tree, with LRU behavior to control the limit. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
-
btrfs: ordered-extent: Add support for dedup
Add ordered-extent support for dedup. Note, current ordered-extent support only supports non-compressed source extent. Support for compressed source extent will be added later. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
-
btrfs: dedup: Implement btrfs_dedup_calc_hash interface
Unlike in-memory or on-disk dedup method, only SHA256 hash method is supported yet, so implement btrfs_dedup_calc_hash() interface using SHA256. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
-
btrfs: dedup: Introduce function to search for an existing hash
Introduce static function inmem_search() to handle the job for in-memory hash tree. The trick is, we must ensure the delayed ref head is not being run at the time we search the for the hash. With inmem_search(), we can implement the btrfs_dedup_search() interface. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
-
btrfs: delayed_ref: Add support for handle dedup hash
Add support for delayed_ref to handle dedup_hash. Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
-
btrfs: delayed-ref: Add support for atomic increasing extent ref
Slightly modify btrfs_add_delayed_data_ref() to allow it accept GFP_ATOMIC, and allow it to do be called inside a spinlock. This is used by later dedup patches. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
-
btrfs: dedup: Introduce function to remove hash from in-memory tree
Introduce static function inmem_del() to remove hash from in-memory dedup tree. And implement btrfs_dedup_del() and btrfs_dedup_destroy() interfaces. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
-
btrfs: dedup: Introduce function to add hash into in-memory tree
Introduce static function inmem_add() to add hash into in-memory tree. And now we can implement the btrfs_dedup_add() interface. Sgined-o Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
-
btrfs: dedup: Introduce function to initialize dedup info
Add generic function to initialize dedup info. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
-
btrfs: dedup: Introduce dedup framework and its header
Introduce the header for btrfs online(write time) de-duplication framework and needed header. The new de-duplication framework is going to support 2 different dedup method and 1 dedup hash. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Commits on Jan 6, 2016
-
perf/x86/amd: Remove l1-dcache-stores event for AMD
This is a long standing bug with the l1-dcache-stores generic event on AMD machines. My perf_event testsuite has been complaining about this for years and I'm finally getting around to trying to get it fixed. The data_cache_refills:system event does not make sense for l1-dcache-stores. Maybe this was a typo and it was meant to be for l1-dcache-store-misses? In any case, the values returned are nowhere near correct for l1-dcache-stores and in fact the umask values for the event have completely changed with fam15h so it makes even less sense than ever. So just remove it. Signed-off-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1512091134350.24311@vincent-weaver-1.umelst.maine.edu Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
perf/x86/intel/uncore: Add Knights Landing uncore PMU support
Knights Landing uncore performance monitoring (perfmon) is derived from Haswell-EP uncore perfmon with several differences. One notable difference is in PCI device IDs. Knights Landing uses common PCI device ID for multiple instances of an uncore PMU device type. In Haswell-EP, each instance of a PMU device type has a unique device ID. Knights Landing uncore components that have performance monitoring units are UBOX, CHA, EDC, MC, M2PCIe, IRP and PCU. Perfmon registers in EDC, MC, IRP, and M2PCIe reside in the PCIe configuration space. Perfmon registers in UBOX, CHA and PCU are accessed via the MSR interface. For more details, please refer to the public document: https://software.intel.com/sites/default/files/managed/15/8d/IntelXeonPhi%E2%84%A2x200ProcessorPerformanceMonitoringReferenceManual_Volume1_Registers_v0%206.pdf Signed-off-by: Harish Chegondi <harish.chegondi@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Harish Chegondi <harish.chegondi@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lukasz Anaczkowski <lukasz.anaczkowski@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/8ac513981264c3eb10343a3f523f19cc5a2d12fe.1449470704.git.harish.chegondi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
perf/x86/intel/uncore: Remove hard coding of PMON box control MSR offset
Call uncore_pci_box_ctl() function to get the PMON box control MSR offset instead of hard coding the offset. This would allow us to use this snbep_uncore_pci_init_box() function for other PCI PMON devices whose box control MSR offset is different from SNBEP_PCI_PMON_BOX_CTL. Signed-off-by: Harish Chegondi <harish.chegondi@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Harish Chegondi <harish.chegondi@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lukasz Anaczkowski <lukasz.anaczkowski@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/872e8ef16cfc38e5ff3b45fac1094e6f1722e4ad.1449470704.git.harish.chegondi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
perf/x86/intel: Add perf core PMU support for Intel Knights Landing
Knights Landing core is based on Silvermont core with several differences. Like Silvermont, Knights Landing has 8 pairs of LBR MSRs. However, the LBR MSRs addresses match those of the Xeon cores' first 8 pairs of LBR MSRs Unlike Silvermont, Knights Landing supports hyperthreading. Knights Landing offcore response events config register mask is different from that of the Silvermont. This patch was developed based on a patch from Andi Kleen. For more details, please refer to the public document: https://software.intel.com/sites/default/files/managed/15/8d/IntelXeonPhi%E2%84%A2x200ProcessorPerformanceMonitoringReferenceManual_Volume1_Registers_v0%206.pdf Signed-off-by: Harish Chegondi <harish.chegondi@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Harish Chegondi <harish.chegondi@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lukasz Anaczkowski <lukasz.anaczkowski@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/d14593c7311f78c93c9cf6b006be843777c5ad5c.1449517401.git.harish.chegondi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
perf/x86/intel/uncore: Add Broadwell-EP uncore support
The uncore subsystem for Broadwell-EP is similar to Haswell-EP. There are some differences in pci device IDs, box number and constraints. This patch extends the Broadwell-DE codes to support Broadwell-EP. Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1449176411-9499-1-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
perf/x86/rapl: Use unified perf_event_sysfs_show instead of special i…
…nterface Actually, rapl_sysfs_show is a duplicate of perf_event_sysfs_show. We prefer to use the unified interface. Signed-off-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@suse.de> Cc: Dasaratharaman Chandramouli<dasaratharaman.chandramouli@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1449223661-2437-1-git-send-email-ray.huang@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
perf/x86: Enable cycles:pp for Intel Atom
This patch updates the PEBS support for Intel Atom to provide an alias for the cycles:pp event used by perf record/top by default nowadays. On Atom, only INST_RETIRED:ANY supports PEBS, so we use this event instead with a large cmask to count cycles. Given that Core2 has the same issue, we use the intel_pebs_aliases_core2() function for Atom as well. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: kan.liang@intel.com Link: http://lkml.kernel.org/r/1449172990-30183-3-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
Stephane Eranian authored and Ingo Molnar committedJan 6, 2016 -
perf/x86: fix PEBS issues on Intel Atom/Core2
This patch fixes broken PEBS support on Intel Atom and Core2 due to wrong pointer arithmetic in intel_pmu_drain_pebs_core(). The get_next_pebs_record_by_bit() was called on PEBS format fmt0 which does not use the pebs_record_nhm layout. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: kan.liang@intel.com Fixes: 2150908 ("perf/x86/intel: Handle multiple records in the PEBS buffer") Link: http://lkml.kernel.org/r/1449182000-31524-3-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
Stephane Eranian authored and Ingo Molnar committedJan 6, 2016 -
perf/x86: Fix LBR related crashes on Intel Atom
This patches fixes the LBR kernel crashes on Intel Atom. The kernel was assuming that if the CPU supports 64-bit format LBR, then it has an LBR_SELECT MSR. Atom uses 64-bit LBR format but does not have LBR_SELECT. That was causing NULL pointer dereferences in a couple of places. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: kan.liang@intel.com Fixes: 96f3eda ("perf/x86/intel: Fix static checker warning in lbr enable") Link: http://lkml.kernel.org/r/1449182000-31524-2-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
Stephane Eranian authored and Ingo Molnar committedJan 6, 2016 -
perf/x86: Fix filter_events() bug with event mappings
This patch fixes a bug in the filter_events() function. The patch fixes the bug whereby if some mappings did not exist, e.g., STALLED_CYCLES_FRONTEND, then any event after it in the attrs array would disappear from the published list of events in /sys/devices/cpu/events. This could be verified easily on any system post SNB (which do not publish STALLED_CYCLES_FRONTEND): $ ./perf stat -e cycles,ref-cycles true Performance counter stats for 'true': 1,217,348 cycles <not supported> ref-cycles The problem is that in filter_events() there is an assumption that the argument (attrs) is organized in increasing continuous event indexes related to the event_map(). But if we remove the non-supported events by shifing the position in the array, then the lookup x86_pmu.event_map() needs to compensate for it, otherwise we are looking up the wrong index. This patch corrects this problem by compensating for the deleted events and with that ref-cycles reappears (here shown on Haswell): $ perf stat -e ref-cycles,cycles true Performance counter stats for 'true': 4,525,910 ref-cycles 1,064,920 cycles 0.002943888 seconds time elapsed Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: jolsa@kernel.org Cc: kan.liang@intel.com Fixes: 8300daa ("perf/x86: Filter out undefined events from sysfs events attribute") Link: http://lkml.kernel.org/r/1449516805-6637-1-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>Stephane Eranian authored and Ingo Molnar committedJan 6, 2016 -
perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp
Add a new 'three-p' precise level, that uses INST_RETIRED.PREC_DIST as base. The basic mechanism of abusing the inverse cmask to get all cycles works the same as before. PREC_DIST is available on Sandy Bridge or later. It had some problems on Sandy Bridge, so we only use it on IvyBridge and later. I tested it on Broadwell and Skylake. PREC_DIST has special support for avoiding shadow effects, which can give better results compare to UOPS_RETIRED. The drawback is that PREC_DIST can only schedule on counter 1, but that is ok for cycle sampling, as there is normally no need to do multiple cycle sampling runs in parallel. It is still possible to run perf top in parallel, as that doesn't use precise mode. Also of course the multiplexing can still allow parallel operation. :pp stays with the previous event. Example: Sample a loop with 10 sqrt with old cycles:pp 0.14 │10: sqrtps %xmm1,%xmm0 <-------------- 9.13 │ sqrtps %xmm1,%xmm0 11.58 │ sqrtps %xmm1,%xmm0 11.51 │ sqrtps %xmm1,%xmm0 6.27 │ sqrtps %xmm1,%xmm0 10.38 │ sqrtps %xmm1,%xmm0 12.20 │ sqrtps %xmm1,%xmm0 12.74 │ sqrtps %xmm1,%xmm0 5.40 │ sqrtps %xmm1,%xmm0 10.14 │ sqrtps %xmm1,%xmm0 10.51 │ ↑ jmp 10 We expect all 10 sqrt to get roughly the sample number of samples. But you can see that the instruction directly after the JMP is systematically underestimated in the result, due to sampling shadow effects. With the new PREC_DIST based sampling this problem is gone and all instructions show up roughly evenly: 9.51 │10: sqrtps %xmm1,%xmm0 11.74 │ sqrtps %xmm1,%xmm0 11.84 │ sqrtps %xmm1,%xmm0 6.05 │ sqrtps %xmm1,%xmm0 10.46 │ sqrtps %xmm1,%xmm0 12.25 │ sqrtps %xmm1,%xmm0 12.18 │ sqrtps %xmm1,%xmm0 5.26 │ sqrtps %xmm1,%xmm0 10.13 │ sqrtps %xmm1,%xmm0 10.43 │ sqrtps %xmm1,%xmm0 0.16 │ ↑ jmp 10 Even with PREC_DIST there is still sampling skid and the result is not completely even, but systematic shadow effects are significantly reduced. The improvements are mainly expected to make a difference in high IPC code. With low IPC it should be similar. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: hpa@zytor.com Link: http://lkml.kernel.org/r/1448929689-13771-2-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
Andi Kleen authored and Ingo Molnar committedJan 6, 2016 -
perf/x86: Use INST_RETIRED.TOTAL_CYCLES_PS for cycles:pp for Skylake
I added UOPS_RETIRED.ALL by mistake to the Skylake PEBS event list for cycles:pp. But the event is not documented for Skylake, and has some issues. The recommended replacement for cycles:pp is to use INST_RETIRED.ANY+pebs as a base, similar to what CPUs before Sandy Bridge did. This new event is called INST_RETIRED.TOTAL_CYCLES_PS. The event is not really new, but has been already used by perf before Sandy Bridge for the original cycles:p Note the SDM doesn't document that event either, but it's being documented in the latest version of the event list on: https://download.01.org/perfmon/SKL This patch does: - Remove UOPS_RETIRED.ALL from the Skylake PEBS event list - Add INST_RETIRED.ANY to the Skylake PEBS event list, and an table entry to allow cmask=16,inv=1 for cycles:pp - We don't need an extra entry for the base INST_RETIRED event, because it is already covered by the catch-all PEBS table entry. - Switch Skylake to use the Core2 PEBS alias (which is INST_RETIRED.TOTAL_CYCLES_PS) Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: hpa@zytor.com Link: http://lkml.kernel.org/r/1448929689-13771-1-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
Andi Kleen authored and Ingo Molnar committedJan 6, 2016 -
perf/x86: Allow zero PEBS status with only single active event
Normally we drop PEBS events with a zero status field. But when there is only a single PEBS event active we can assume the PEBS record is for that event. The PEBS buffer is always flushed when PEBS events are disabled, so there is no risk of mishandling state PEBS records this way. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1449177740-5422-2-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
Andi Kleen authored and Ingo Molnar committedJan 6, 2016 -
perf/x86: Remove warning for zero PEBS status
The recent commit: 75f8085 ("perf/x86/intel/pebs: Robustify PEBS buffer drain") causes lots of warnings on different CPUs before Skylake when running PEBS intensive workloads. They can have a zero status field in the PEBS record when PEBS is racing with clearing of GLOBAl_STATUS. This also can cause hangs (it seems there are still problems with printk in NMI). Disable the warning, but still ignore the record. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1449177740-5422-1-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
Andi Kleen authored and Ingo Molnar committedJan 6, 2016 -
perf/core: Collapse more IPI loops
This patch collapses the two 'hard' cases, which are perf_event_{dis,en}able(). I cannot seem to convince myself the current code is correct. So starting with perf_event_disable(); we don't strictly need to test for event->state == ACTIVE, ctx->is_active is enough. If the event is not scheduled while the ctx is, __perf_event_disable() still does the right thing. Its a little less efficient to IPI in that case, over-all simpler. For perf_event_enable(); the same goes, but I think that's actually broken in its current form. The current condition is: ctx->is_active && event->state == OFF, that means it doesn't do anything when !ctx->active && event->state == OFF. This is wrong, it should still mark the event INACTIVE in that case, otherwise we'll still not try and schedule the event once the context becomes active again. This patch implements the two function using the new event_function_call() and does away with the tricky event->state tests. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Alexander Shishkin <alexander.shishkin@intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>Peter Zijlstra authored and Ingo Molnar committedJan 6, 2016 -
Merge branch 'perf/urgent' into perf/core, to pick up fixes before ap…
…plying new changes Signed-off-by: Ingo Molnar <mingo@kernel.org>
Ingo Molnar committedJan 6, 2016 -
perf: Fix race in swevent hash
There's a race on CPU unplug where we free the swevent hash array while it can still have events on. This will result in a use-after-free which is BAD. Simply do not free the hash array on unplug. This leaves the thing around and no use-after-free takes place. When the last swevent dies, we do a for_each_possible_cpu() iteration anyway to clean these up, at which time we'll free it, so no leakage will occur. Reported-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra authored and Ingo Molnar committedJan 6, 2016 -
perf: Fix race in perf_event_exec()
I managed to tickle this warning: [ 2338.884942] ------------[ cut here ]------------ [ 2338.890112] WARNING: CPU: 13 PID: 35162 at ../kernel/events/core.c:2702 task_ctx_sched_out+0x6b/0x80() [ 2338.900504] Modules linked in: [ 2338.903933] CPU: 13 PID: 35162 Comm: bash Not tainted 4.4.0-rc4-dirty torvalds#244 [ 2338.911610] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013 [ 2338.923071] ffffffff81f1468e ffff8807c6457cb8 ffffffff815c680c 0000000000000000 [ 2338.931382] ffff8807c6457cf0 ffffffff810c8a56 ffffe8ffff8c1bd0 ffff8808132ed400 [ 2338.939678] 0000000000000286 ffff880813170380 ffff8808132ed400 ffff8807c6457d00 [ 2338.947987] Call Trace: [ 2338.950726] [<ffffffff815c680c>] dump_stack+0x4e/0x82 [ 2338.956474] [<ffffffff810c8a56>] warn_slowpath_common+0x86/0xc0 [ 2338.963195] [<ffffffff810c8b4a>] warn_slowpath_null+0x1a/0x20 [ 2338.969720] [<ffffffff811a49cb>] task_ctx_sched_out+0x6b/0x80 [ 2338.976244] [<ffffffff811a62d2>] perf_event_exec+0xe2/0x180 [ 2338.982575] [<ffffffff8121fb6f>] setup_new_exec+0x6f/0x1b0 [ 2338.988810] [<ffffffff8126de83>] load_elf_binary+0x393/0x1660 [ 2338.995339] [<ffffffff811dc772>] ? get_user_pages+0x52/0x60 [ 2339.001669] [<ffffffff8121e297>] search_binary_handler+0x97/0x200 [ 2339.008581] [<ffffffff8121f8b3>] do_execveat_common.isra.33+0x543/0x6e0 [ 2339.016072] [<ffffffff8121fcea>] SyS_execve+0x3a/0x50 [ 2339.021819] [<ffffffff819fc165>] stub_execve+0x5/0x5 [ 2339.027469] [<ffffffff819fbeb2>] ? entry_SYSCALL_64_fastpath+0x12/0x71 [ 2339.034860] ---[ end trace ee1337c59a0ddeac ]--- Which is a WARN_ON_ONCE() indicating that cpuctx->task_ctx is not what we expected it to be. This is because context switches can swap the task_struct::perf_event_ctxp[] pointer around. Therefore you have to either disable preemption when looking at current, or hold ctx->lock. Fix perf_event_enable_on_exec(), it loads current->perf_event_ctxp[] before disabling interrupts, therefore a preemption in the right place can swap contexts around and we're using the wrong one. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Potapenko <glider@google.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: syzkaller <syzkaller@googlegroups.com> Link: http://lkml.kernel.org/r/20151210195740.GG6357@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra authored and Ingo Molnar committedJan 6, 2016
Commits on Dec 18, 2015
-
Merge tag 'perf-core-for-mingo-3' of git://git.kernel.org/pub/scm/lin…
…ux/kernel/git/acme/linux into perf/core Pull new perf tool feature from Arnaldo Carvalho de Melo: " User visible changes: - Generate perf.data files from 'perf stat', to tap into the scripting capabilities perf has instead of defining a 'perf stat' specific scripting support to calculate event ratios, etc. Simple example: $ perf stat record -e cycles usleep 1 Performance counter stats for 'usleep 1': 1,134,996 cycles 0.000670644 seconds time elapsed $ perf stat report Performance counter stats for '/home/acme/bin/perf stat record -e cycles usleep 1': 1,134,996 cycles 0.000670644 seconds time elapsed $ It generates PERF_RECORD_ userspace records to store the details: $ perf report -D | grep PERF_RECORD 0xf0 [0x28]: PERF_RECORD_THREAD_MAP nr: 1 thread: 27637 0x118 [0x12]: PERF_RECORD_CPU_MAP nr: 1 cpu: 65535 0x12a [0x40]: PERF_RECORD_STAT_CONFIG 0x16a [0x30]: PERF_RECORD_STAT -1 -1 0x19a [0x40]: PERF_RECORD_MMAP -1/0: [0xffffffff81000000(0x1f000000) @ 0xffffffff81000000]: x [kernel.kallsyms]_text 0x1da [0x18]: PERF_RECORD_STAT_ROUND [acme@ssdandy linux]$ An effort was made to make perf.data files generated like this to not generate cryptic messages when processed by older tools. The 'perf script' bits need rebasing, will go up later. Jiri's cover letter for this series: The initial attempt defined its own formula lang and allowed triggering user's script on the end of the stat command: http://marc.info/?l=linux-kernel&m=136742146322273&w=2 This patchset abandons the idea of new formula language and rather adds support to: - store stat data into perf.data file - add python support to process stat events Basically it allows to store stat data into perf.data and post process it with python scripts in a similar way we do for sampling data. The stat data are stored in new stat, stat-round, stat-config user events. stat - stored for each read syscall of the counter stat round - stored for each interval or end of the command invocation stat config - stores all the config information needed to process data so report tool could restore the same output as record The python script can now define 'stat__<eventname>_<modifier>' functions to get stat events data and 'stat__interval' to get stat-round data. See CPI script example in scripts/python/stat-cpi.py." Also a few other changes: User visible changes: - Make command line options always available, even when they depend on some feature being enabled, warning the user about use of such options (Wang Nan) - Support --vmlinux in perf record, useful, so far, for eBPF, where we will set up events that will be used in the record session (He Kuang) - Automatically disable collecting branch flags and cycles with --call-graph lbr. This allows avoiding a bunch of extra MSR reads in the PMI on Skylake. (Andi Kleen) Infrastructure changes: - Dump the stack when a 'perf test -v ' entry segfaults, so far we would have to run it under gdb with 'set follow-fork-mode child' set to get a proper backtrace (Arnaldo Carvalho de Melo) - Initialize the refcnt in 'struct thread' to 1 and fixup its users accordingly, so that we try to have the same refcount model accross the perf codebase (Arnaldo Carvalho de Melo) - More prep work for moving the subcmd infrastructure out of tools/perf/ and into tools/lib/subcmd/ to be used by other tools/ living utilities (Josh Poimboeuf) - Fix 'perf test' hist testcases when kptr_restrict is on (Namhyung Kim) Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>Ingo Molnar committedDec 18, 2015