Skip to content
This repository has been archived by the owner on Jan 24, 2024. It is now read-only.

[BUG] - KoP CPU usage is 10 times higher than Pulsar on the broker side #1652

Open
Hongten opened this issue Dec 29, 2022 · 9 comments
Open
Labels

Comments

@Hongten
Copy link
Contributor

Hongten commented Dec 29, 2022

Describe the bug
KoP CUP usage is 10 times higher than Pulsar on the broker side

To Reproduce
Steps to reproduce the behavior:

  1. Create a topic with 100 partitions.
  2. Use a pulsar client to produce data on the topic, and record the broker CPU usage.
  3. Use a Kafka client to produce data on the topic, and record the broker CPU usage.

Expected behavior
KoP CPU usage should be greater than or equal to the Pulsar itself.
In this case, when I use the Pulsar client, the broker CPU usage is around 3. When I use Kafka client, the broker CPU usage is around 30.
Expected result: When I use Kafka client, the broker CPU usage >=3, but not too high.

Screenshots
The broker(KoP) flame graph
image
Pulsar client vs. Kafka client(CUP usage)
image

Additional context
Add any other context about the problem here.

@Hongten
Copy link
Contributor Author

Hongten commented Dec 29, 2022

Please refer to the flame graph HTML file.
broker_flame_graph.html.zip

@BewareMyPower
Copy link
Collaborator

What's your KoP version? It looks like the sanitizeMetricName, which converts a metric name to a valid name, took too much time.

image

However, the io.prometheus:simpleclient:0.16.0 dependency of KoP master does not call regex related APIs:

    public static String sanitizeMetricName(String metricName) {
        int length = metricName.length();
        char[] sanitized = new char[length];

        for(int i = 0; i < length; ++i) {
            char ch = metricName.charAt(i);
            if (ch != ':' && (ch < 'a' || ch > 'z') && (ch < 'A' || ch > 'Z') && (i <= 0 || ch < '0' || ch > '9')) {
                sanitized[i] = '_';
            } else {
                sanitized[i] = ch;
            }
        }

        return new String(sanitized);
    }

@Hongten
Copy link
Contributor Author

Hongten commented Dec 29, 2022

KOP- 2.10.1.11
Pulsar- 2.10.1

@BewareMyPower BewareMyPower changed the title [BUG] - KoP CUP usage is 10 times higher than Pulsar on the broker side [BUG] - KoP CPU usage is 10 times higher than Pulsar on the broker side Dec 29, 2022
@BewareMyPower
Copy link
Collaborator

It seems to be a bug of Pulsar itself. I checked the prometheus dependency of Pulsar 2.10.2, the dependency is still 0.5.0, which has a inefficient implementation of sanitizeMetricName.

apache-pulsar-2.10.2$ ls lib | grep prometheus
io.prometheus-simpleclient-0.5.0.jar

The prometheus dependency was upgraded from apache/pulsar#13785, which cannot be cherry-picked to branch-2.10. I think we need to implement the method manually to avoid the upgrade.

BewareMyPower added a commit to BewareMyPower/kop that referenced this issue Dec 29, 2022
Fixes streamnative#1652

### Motivation

Pulsar 2.10 or earlier depends on io.prometheus:simpleclient:0.5.0,
whose implementation of `Collectors.sanitizeMetricName` uses regex,
which is inefficient and costs much CPU.

### Modifications

Migrate the implementation of `sanitizeMetricName` from
io:prometheus:simpleclient:0.16.0 to branch-2.10 or earlier.
@Hongten
Copy link
Contributor Author

Hongten commented Dec 30, 2022

hmm, I double-check the pulsar(broker) lib folder and find the io.prometheus-simpleclient is 0.5.0.

~ ll | grep prometheus
-rwxrwxrwx 1 root root    59175 Jan 22  2020 io.prometheus-simpleclient-0.5.0.jar*
-rwxrwxrwx 1 root root     5575 Jan 22  2020 io.prometheus-simpleclient_caffeine-0.5.0.jar*
-rwxrwxrwx 1 root root     5838 Jan 22  2020 io.prometheus-simpleclient_common-0.5.0.jar*
-rwxrwxrwx 1 root root    18245 Jan 22  2020 io.prometheus-simpleclient_hotspot-0.5.0.jar*
-rwxrwxrwx 1 root root     9517 Jan 22  2020 io.prometheus-simpleclient_httpserver-0.5.0.jar*
-rwxrwxrwx 1 root root     5177 Jan 22  2020 io.prometheus-simpleclient_jetty-0.5.0.jar*
-rwxrwxrwx 1 root root     4583 Jan 22  2020 io.prometheus-simpleclient_log4j2-0.5.0.jar*
-rwxrwxrwx 1 root root     7103 Jan 22  2020 io.prometheus-simpleclient_servlet-0.5.0.jar*
-rwxrwxrwx 1 root root    29885 Jan 22  2020 io.prometheus.jmx-collector-0.14.0.jar*
-rwxrwxrwx 1 root root    30649 Jan 22  2020 org.apache.bookkeeper.stats-prometheus-metrics-provider-4.14.5.jar*
-rwxrwxrwx 1 root root    16427 Jan 22  2020 org.apache.zookeeper-zookeeper-prometheus-metrics-3.6.3.jar*

Demogorgon314 pushed a commit that referenced this issue Dec 30, 2022
…me (#1654)

Fixes #1652

### Motivation

Pulsar 2.10 or earlier depends on io.prometheus:simpleclient:0.5.0,
whose implementation of `Collectors.sanitizeMetricName` uses regex,
which is inefficient and costs much CPU.

### Modifications

Migrate the implementation of `sanitizeMetricName` from
io:prometheus:simpleclient:0.16.0 to branch-2.10 or earlier.
Demogorgon314 pushed a commit to Demogorgon314/kop that referenced this issue Jan 4, 2023
…me (streamnative#1654)

Fixes streamnative#1652

### Motivation

Pulsar 2.10 or earlier depends on io.prometheus:simpleclient:0.5.0,
whose implementation of `Collectors.sanitizeMetricName` uses regex,
which is inefficient and costs much CPU.

### Modifications

Migrate the implementation of `sanitizeMetricName` from
io:prometheus:simpleclient:0.16.0 to branch-2.10 or earlier.
@Hongten
Copy link
Contributor Author

Hongten commented Jan 9, 2023

After I have upgraded(I cherry-pick the code from this MR - apache/pulsar#13785) the Prometheus client version from 0.5.0 to 0.15.0, the CPU usage still exceeds 30 and the Prometheus-related code costs more 24% than before(0.5.0).

P1 - The CPU usage with Prometheus version 0.5.0
image

P2 - The CPU usage with Prometheus version 0.15.0.
image

P3 - Prometheus version 0.5.0. The Prometheus-related code costs 37.93% CPU.
image

P4 - Prometheus version 0.15.0. The Prometheus-related code costs 62.13% CPU.
image

The broker Prometheus libs

ll | grep prometheus
-rwxrwxrwx 1 root root    89127 Jan 22  2020 io.prometheus-simpleclient-0.15.0.jar*
-rwxrwxrwx 1 root root     5556 Jan 22  2020 io.prometheus-simpleclient_caffeine-0.15.0.jar*
-rwxrwxrwx 1 root root     8008 Jan 22  2020 io.prometheus-simpleclient_common-0.15.0.jar*
-rwxrwxrwx 1 root root    24094 Jan 22  2020 io.prometheus-simpleclient_hotspot-0.15.0.jar*
-rwxrwxrwx 1 root root    14456 Jan 22  2020 io.prometheus-simpleclient_httpserver-0.15.0.jar*
-rwxrwxrwx 1 root root     5200 Jan 22  2020 io.prometheus-simpleclient_jetty-0.15.0.jar*
-rwxrwxrwx 1 root root     4595 Jan 22  2020 io.prometheus-simpleclient_log4j2-0.15.0.jar*
-rwxrwxrwx 1 root root    88445 Jan 22  2020 io.prometheus-simpleclient_servlet-0.15.0.jar*
-rwxrwxrwx 1 root root    12941 Jan 22  2020 io.prometheus-simpleclient_servlet_common-0.15.0.jar*
-rwxrwxrwx 1 root root     3378 Jan 22  2020 io.prometheus-simpleclient_tracer_common-0.15.0.jar*
-rwxrwxrwx 1 root root     4272 Jan 22  2020 io.prometheus-simpleclient_tracer_otel-0.15.0.jar*
-rwxrwxrwx 1 root root     4537 Jan 22  2020 io.prometheus-simpleclient_tracer_otel_agent-0.15.0.jar*
-rwxrwxrwx 1 root root    31808 Jan 22  2020 io.prometheus.jmx-collector-0.16.1.jar*
-rwxrwxrwx 1 root root    30649 Jan 22  2020 org.apache.bookkeeper.stats-prometheus-metrics-provider-4.14.5.jar*
-rwxrwxrwx 1 root root    16427 Jan 22  2020 org.apache.zookeeper-zookeeper-prometheus-metrics-3.6.3.jar*

@BewareMyPower
Copy link
Collaborator

The sanitizeMetricName definition was defined when building KoP. It is not loaded dynamically so upgrading the broker does not work. You should upgrade the KoP.

@Hongten
Copy link
Contributor Author

Hongten commented Jan 12, 2023

Summary:
Upgrade the Prometheus version to 0.16.0 in both Pulsar and KoP. The performance can be improved by around 37%.

After upgrading the Prometheus version to 0.16.0 in both Pulsar and KoP. The Prometheus-related code CPU usage is 0.58% now based on the Flame graph. The Broker's CPU usage reduces by around 16.3%. (Before 33%, Now 16.7%)

image

image

pulsar_kop_with_prometheus_version_0.16.0.html.zip

@Hongten
Copy link
Contributor Author

Hongten commented Jan 13, 2023

Summary:
The KoP-related code CPU usage can improve by around 10.75% if we set entryFormat=kafka(The default entryFormat=pulsar.)

entryFormat.flame_graph.zip

P1 - entryFormat=pulsar, the kop-related code CPU usage is around 50.37%. Almost CPU usages are on the two methods:PulsarEntryFormatter.encode() and ByteBufUtils.decodePulsarEntryToKafkaRecords()
image

P2 - entryFormat=kafka, the kop-related code CPU usage is around 39.62%.
image

P3 - The broker CPU usage seems no change. The accepted result is the CPU usage of entryForm=kafka is smaller than entryFormat=pulsar. This may be caused by the different partition numbers in one broker after the broker restart.
image

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

2 participants