Performance Regression Data Set


Note that GitHub provides a table of contents (TOC) directly when clicking on the list icon in front of README.md.

This repository provides data sets on performance measurements to support research on performance regression detection.

Available data sets:

  1. HANA
    1. small 2021
      • A small data set from SAP HANA internal testing to explore the data format
    2. large 2022
      • A rather large data set from SAP HANA internal testing that represents half a year of performance measurements

The directories of the data sets may provide further details.

On (automated) performance regression detection

Assume we have a software under test (SUT) that provides an important functionality F. The SUT may evolve over time. Starting with a version v_1, each change may lead to a new version, resulting in a sequence v_1, v_2, ..., v_n.

Objective: Detect if a change increases the execution time of F and therefore results in a performance regression.

In a practical scenario, we want to detect changes that have a negative impact on performance before they are delivered to customers. A typical way to achieve this is change-based performance measurement. Simplified: for each new version v_(n+1), we determine the execution time t_(n+1) of F and compare it to t_n for version v_n. This approach results in a series of time measurements.

Time Series of Measurements

A series of measurements consists of n data points for m <= n different versions. There is one measurement for each version if n == m; if n > m, there are multiple measurements for at least one version.
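As an illustration of this structure, here is a minimal Python sketch; the version names and values are made up and not taken from the data sets:

```python
from collections import defaultdict

# A measurement series: n data points for m <= n versions.
# Values are made up for this sketch.
measurements = [
    ("v1", 10.2), ("v1", 10.4),   # two measurements for v1 (n > m case)
    ("v2", 10.3),
    ("v3", 11.9), ("v3", 12.1),
]

# Group the data points by version to see that m <= n holds.
by_version = defaultdict(list)
for version, value in measurements:
    by_version[version].append(value)

n = len(measurements)   # number of data points (5)
m = len(by_version)     # number of distinct versions (3)
assert m <= n
```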

We typically measure execution time (in units like milliseconds, seconds, minutes, or hours), but we can also measure memory usage, CPU usage, network traffic, latency, or any other metric.

A time series is ordered by time or provides information that allows it to be ordered by time.

Problem

At first, detecting whether a regression occurred sounds like a trivial problem: we can just compare the new value t_n against the old value t_o and report a regression if t_n > t_o.
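As a sketch, this naive check could look as follows in Python; the function name and values are illustrative only:

```python
def naive_regression(t_new: float, t_old: float) -> bool:
    """Naive check: report a regression whenever the new value is larger.

    This is only a sketch of the strict comparison described above;
    it ignores measurement noise and will produce many false positives.
    """
    return t_new > t_old

# Example: a 0.1% difference is already flagged, although it is
# most likely within normal measurement deviation.
print(naive_regression(10.01, 10.00))  # True
```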

However, this approach may not be applicable in practice. We will go through several practical issues.

Performance measurements must handle deviations.

We rarely measure deterministic execution in practice. Think about dynamic CPU frequency scaling, caches, thread scheduling, shared resource access, system load, CPU-internal caches, branch prediction, memory prefetching, or any other random interaction of the system under test with the software under test. We may even use different systems for performance measurements if the time required to execute them on a single machine is too long. Given these examples, it is unreasonable to assume perfectly deterministic execution. Therefore, we cannot apply the comparison in a strict mathematical sense.

We could argue that, given enough time, we could perform an unbounded number of measurements to calculate a statistically stable value. However, we do not have infinite time and resources in practice.
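One common mitigation, sketched below in Python, is to repeat each measurement a few times and compare robust aggregates such as the median instead of single values; the repetition count and the values are illustrative assumptions:

```python
import statistics

# Five repeated measurements per version; values are made up.
repetitions_old = [10.1, 10.3, 10.2, 10.6, 10.2]
repetitions_new = [10.4, 10.2, 10.3, 10.5, 10.3]

# The median is less sensitive to single outliers than the mean.
median_old = statistics.median(repetitions_old)
median_new = statistics.median(repetitions_new)
print(median_old, median_new)  # 10.2 10.3
```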

Therefore, we must use a significance threshold ts, either in an absolute or a relative way. For example, we only detect a regression if t_n - t_o > ts (absolute) or if t_n / t_o > ts (relative).
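For illustration, the two threshold variants could look like this in Python; the function names and example values (the 500 s vs. 490 s case discussed below) are ours and not part of the data sets:

```python
def regression_absolute(t_new: float, t_old: float, ts: float) -> bool:
    # Report a regression only if the absolute increase exceeds ts.
    return t_new - t_old > ts

def regression_relative(t_new: float, t_old: float, ts: float) -> bool:
    # Report a regression only if the ratio exceeds ts,
    # e.g. ts = 1.05 to tolerate up to a 5% increase.
    return t_new / t_old > ts

print(regression_absolute(500, 490, ts=15))    # False: +10 s is below 15 s
print(regression_relative(500, 490, ts=1.05))  # False: ~2% is below 5%
```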

It is unclear how large a difference must be to be considered a regression.

Do we accept t_n = 500 seconds compared to t_o = 490 seconds? This decision depends on the context and cannot be answered in general.

A trend may hide regressions.

Assume a linearly increasing time series where each individual change is below our threshold. We will not report any change as a regression, but over time performance may have degraded by a large factor. Should we compare each new value against all old values?
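The following Python sketch illustrates how such a creeping trend can stay below a per-step threshold while still being visible against a long-term baseline; the baseline choice and numbers are illustrative assumptions, not the approach used for the data sets:

```python
def creeping_regression(series: list[float], ts: float) -> bool:
    # Compare the latest value against the oldest value (a long-term
    # baseline) instead of only the previous one, so that many small,
    # individually sub-threshold increases still add up to a detection.
    return series[-1] / series[0] > ts

# Each step increases by only 2%, below a 5% per-step threshold,
# but after ten steps the total increase is about 22%.
series = [100.0 * 1.02 ** i for i in range(11)]
print(series[-1] / series[-2] > 1.05)        # False: a single step looks fine
print(creeping_regression(series, ts=1.05))  # True: the trend is caught
```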

Changes in the software may influence the performance on purpose.

We may anticipate and accept an increase in execution time. How does our detection approach now adapt to the change in our time series data?
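One possible way to handle this, sketched below with hypothetical helper names that are not part of the data sets, is to let developers register an accepted change so that the comparison baseline is reset from that version on:

```python
# Sketch of handling an intentional, accepted performance change.
# "accepted_changes" and "check" are hypothetical names for illustration.
accepted_changes = {"v42": 1.20}  # v42 is expected to be up to 20% slower

def check(version: str, t_new: float, baseline: float, ts: float) -> tuple[bool, float]:
    allowed = accepted_changes.get(version, 1.0)
    # Tolerate the accepted increase before applying the usual threshold.
    is_regression = t_new / (baseline * allowed) > ts
    # Once a version with an accepted change passes, it becomes the new baseline.
    new_baseline = t_new if version in accepted_changes and not is_regression else baseline
    return is_regression, new_baseline

# Example: baseline 100 ms, v42 measures 118 ms. With the accepted 20%
# increase and a 5% threshold, no regression is reported and the
# baseline is reset to 118 ms.
print(check("v42", 118.0, 100.0, ts=1.05))  # (False, 118.0)
```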

Reality does not produce clean time series.

We may think about time series that represent smooth lines if plotted. However, practical results may have a lot of noise for a wide range of reasons. See the examples section.

False positives and false negatives should be avoided.

False positives result in human intervention. Humans will lose trust if there are too many false positives, and if they spend too much time on manual analysis, we could just as well skip the automated analysis altogether.

False negatives represent undetected performance regressions, which may be costly if they are delivered to customers or must be corrected in later development stages.

In summary, we want an automated approach that is robust, adaptive, cheap, and has a high precision and recall.

The priority for each attribute may vary depending on the use case.
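For completeness, here is a small Python sketch of how precision and recall could be computed when evaluating a detection approach against labeled regressions; the helper and the example sets are illustrative only:

```python
def precision_recall(detected: set[str], actual: set[str]) -> tuple[float, float]:
    # detected: versions flagged by the detector; actual: versions with a real regression.
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 1.0
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall

# Example: two of three flagged versions are real regressions (precision ~0.67),
# and two of four real regressions were found (recall 0.5).
print(precision_recall({"v3", "v7", "v9"}, {"v3", "v7", "v11", "v15"}))
```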

Examples

Each figure represents a chart where the x-axis is time and the y-axis is some value. The charts do not contain units as they are not relevant for the examples.

The first example is the simplest. It is a stable time series without regressions. It may represent functionality that never changed over the whole observation period.

Figure: example_stable — a stable time series without regressions

The second chart is also rather simple. It clearly shows a regression around the middle of the measurement period. However, even here we could question details. Does the single value after about one fifth of the data points already represent a regression?

Figure: example_regression — a time series with a regression

The next charts show practical measurements where no regression occurred, but the time series look different from the first example of a stable time series.

The following chart shows symptoms that can be classified as build-dependent. We get different performance characteristics, although nothing in the implementation has changed. The performance result is based on how the software is built.

Figure: example_build-dependent — build-dependent performance characteristics

The next chart shows spikes. They occur due to external influence on the performance measurements.

Figure: example_spikes — spikes caused by external influence

The last chart shows an example with a rather high deviation. Although the time series itself is not stable, the deviations are stable. Overall, there are no performance regressions.

Figure: example_high_deviation — a time series with high but stable deviation

How to use the data sets

Each data set contains specific files that describe the data format.

You are free to use the data sets according to the license declared by this repository. You can either refer to a specific data set via the name used here or refer to the whole repository.

Examples for online reference:

Examples for citations:

  • Thomas Bach, Pal Lv, Minh Le. "Performance Regression Data Set"
    • Venue will be added if it exists.
    • Keys to refer to a specific data set may be added in the future
    • Add a date according to your needs
  • BibLaTeX via the CITATION.cff file (rendered by GitHub)

Please inform us via an issue or email if you use this data set. We are happy to link to your work.

Background about HANA

You can find general information about how SAP HANA is tested via

Bach, T., Andrzejak, A., Seo, C. et al. Testing Very Large Database Management Systems: The Case of SAP HANA. Datenbank Spektrum 22, 195–215 (2022). https://doi.org/10.1007/s13222-022-00426-x

License

Copyright 2021-2022 SAP SE or an SAP affiliate company and performance-regression-data-set contributors. Please see our LICENSE for copyright and license information. Detailed information including third-party components and their licensing/copyright information is available via the REUSE tool.
