Merged
Original file line number Diff line number Diff line change
@@ -11,9 +11,9 @@ layout: learningpathall

There are numerous client-server and network-based workloads. Tomcat is a typical example of such an application: it provides services via HTTP/HTTPS network requests.

In this section, you'll set up a benchmark environment using Apache Tomcat and `wrk2` to simulate HTTP load and evaluate performance on an Arm-based bare-metal (**__`Nvidia-Grace`__**).
In this section, you'll set up a benchmark environment using `Apache Tomcat` and `wrk2` to simulate HTTP load and evaluate performance on an Arm-based bare-metal instance, such as **__`AWS c8g.metal-48xl`__**.

## Set up the Tomcat benchmark server on **Nvidia Grace**
## Set up the Tomcat benchmark server on **AWS c8g.metal-48xl**
[Apache Tomcat](https://tomcat.apache.org/) is an open-source Java Servlet container that runs Java web applications, handles HTTP requests, and serves dynamic content. It supports technologies such as Servlet, JSP, and WebSocket.

## Install the Java Development Kit (JDK)
@@ -30,8 +30,8 @@ sudo apt install -y openjdk-21-jdk
Download and extract Tomcat:

```bash
wget -c https://dlcdn.apache.org/tomcat/tomcat-11/v11.0.9/bin/apache-tomcat-11.0.9.tar.gz
tar xzf apache-tomcat-11.0.9.tar.gz
wget -c https://dlcdn.apache.org/tomcat/tomcat-11/v11.0.10/bin/apache-tomcat-11.0.10.tar.gz
tar xzf apache-tomcat-11.0.10.tar.gz
```
Alternatively, you can build Tomcat [from source](https://github.com/apache/tomcat).
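
The download URL and the archive name both embed the Tomcat version, which is why a version bump touches every path in this guide. As a minimal sketch, assuming a hypothetical helper named `tomcat_url` (not part of the original instructions) and assuming the mirror keeps the URL pattern shown above, the URL can be derived from a single version string:

```bash
# Hypothetical helper (not in the original guide): build the download URL
# from one version string, following the URL pattern used above.
tomcat_url() {
  local version="$1"
  local major="${version%%.*}"   # "11.0.10" -> "11"
  printf 'https://dlcdn.apache.org/tomcat/tomcat-%s/v%s/bin/apache-tomcat-%s.tar.gz\n' \
    "$major" "$version" "$version"
}

# Example usage (commented out so the snippet has no side effects):
# wget -c "$(tomcat_url 11.0.10)"
```

With this approach, moving to a newer release only requires changing the version argument.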

@@ -41,7 +41,7 @@ To access the built-in examples from your local network or external IP, use a te

The file is at:
```bash
apache-tomcat-11.0.9/webapps/examples/META-INF/context.xml
~/apache-tomcat-11.0.10/webapps/examples/META-INF/context.xml
```

```xml
@@ -60,17 +60,17 @@ To achieve maximum performance of Tomcat, the maximum number of file descriptors
Start the server:

```bash
ulimit -n 65535 && ./apache-tomcat-11.0.9/bin/startup.sh
ulimit -n 65535 && ~/apache-tomcat-11.0.10/bin/startup.sh
```

You should see output like:

```output
Using CATALINA_BASE: /home/ubuntu/apache-tomcat-11.0.9
Using CATALINA_HOME: /home/ubuntu/apache-tomcat-11.0.9
Using CATALINA_TMPDIR: /home/ubuntu/apache-tomcat-11.0.9/temp
Using CATALINA_BASE: /home/ubuntu/apache-tomcat-11.0.10
Using CATALINA_HOME: /home/ubuntu/apache-tomcat-11.0.10
Using CATALINA_TMPDIR: /home/ubuntu/apache-tomcat-11.0.10/temp
Using JRE_HOME: /usr
Using CLASSPATH: /home/ubuntu/apache-tomcat-11.0.9/bin/bootstrap.jar:/home/ubuntu/apache-tomcat-11.0.9/bin/tomcat-juli.jar
Using CLASSPATH: /home/ubuntu/apache-tomcat-11.0.10/bin/bootstrap.jar:/home/ubuntu/apache-tomcat-11.0.10/bin/tomcat-juli.jar
Using CATALINA_OPTS:
Tomcat started.
```
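
`ulimit -n` only applies to the current shell and the processes it starts, so launching `startup.sh` from a different shell falls back to that shell's limit. A small sketch, assuming a hypothetical `fd_limit_ok` helper (the name and the check are illustrative, not part of Tomcat), verifies the soft limit before starting the server:

```bash
# Hypothetical helper: succeed if the current soft limit on open file
# descriptors is at least the required value (or is unlimited).
fd_limit_ok() {
  local required="$1" current
  current="$(ulimit -n)"
  [ "$current" = "unlimited" ] || [ "$current" -ge "$required" ]
}

# Example usage (commented out so the snippet has no side effects):
# fd_limit_ok 65535 || echo "Run 'ulimit -n 65535' in this shell first" >&2
```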
@@ -132,28 +132,28 @@ ulimit -n 65535 && wrk -c32 -t16 -R50000 -d60 http://${tomcat_ip}:8080/examples/
You should see output similar to:

```console
Running 1m test @ http://172.26.203.139:8080/examples/servlets/servlet/HelloWorldExample
Running 1m test @ http://172.31.46.193:8080/examples/servlets/servlet/HelloWorldExample
16 threads and 32 connections
Thread calibration: mean lat.: 0.986ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 0.984ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 0.999ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 0.994ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 0.983ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 0.989ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 0.991ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 0.993ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 0.985ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 0.990ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 0.987ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 0.990ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 0.984ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 0.991ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 0.978ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 0.976ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 3.381ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 3.626ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 3.020ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 3.578ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 3.166ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 3.275ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 3.454ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 3.655ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 3.334ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 3.089ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 3.365ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 3.382ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 3.342ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 3.349ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 3.023ms, rate sampling interval: 10ms
Thread calibration: mean lat.: 3.275ms, rate sampling interval: 10ms
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.00ms 454.90us 5.09ms 63.98%
Req/Sec 3.31k 241.68 4.89k 63.83%
2999817 requests in 1.00m, 1.56GB read
Requests/sec: 49997.08
Latency 1.02ms 398.88us 4.24ms 66.77%
Req/Sec 3.30k 210.16 4.44k 70.04%
2999776 requests in 1.00m, 1.56GB read
Requests/sec: 49996.87
Transfer/sec: 26.57MB
```
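
When you compare several runs, the `Requests/sec` line is the figure to track. As a hedged sketch, assuming the `wrk` output has been saved to a file and using a hypothetical `wrk_rps` helper, the value can be extracted like this:

```bash
# Hypothetical helper: read wrk output on stdin and print the value
# from the "Requests/sec:" summary line.
wrk_rps() {
  awk '/^Requests\/sec:/ { print $2 }'
}

# Example usage (commented out; "wrk.log" is an assumed file name):
# wrk_rps < wrk.log
```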
Original file line number Diff line number Diff line change
@@ -11,45 +11,79 @@ To achieve maximum performance, ulimit -n 65535 must be executed on both server
{{% /notice %}}

## Optimal baseline before tuning
- Baseline on Grace bare-metal (default configuration)
- Baseline on Grace bare-metal (access logging disabled)
- Baseline on Grace bare-metal (optimal thread count)
- Align the IOMMU settings with default Ubuntu
- Baseline on Arm Neoverse bare-metal (default configuration)
- Baseline on Arm Neoverse bare-metal (access logging disabled)
- Baseline on Arm Neoverse bare-metal (optimal thread count)

### Align the IOMMU settings with default Ubuntu

{{% notice Note %}}
Because AWS uses a customized Ubuntu distribution, first align the IOMMU settings with the Ubuntu defaults: `iommu.strict=1` and `iommu.passthrough=0`.
{{% /notice %}}

1. To set the default IOMMU behavior, use a text editor to modify the `grub` file by adding or updating the `GRUB_CMDLINE_LINUX` configuration.

```bash
sudo vi /etc/default/grub
```
Then add or update the following line:
```bash
GRUB_CMDLINE_LINUX="iommu.strict=1 iommu.passthrough=0"
```

2. Update GRUB and reboot to apply the default settings.
```bash
sudo update-grub && sudo reboot
```

3. Verify whether the default settings have been successfully applied.
```bash
sudo dmesg | grep iommu
```
The output shows that, with these settings, `iommu.strict` is enabled and `iommu.passthrough` is disabled.
```output
[ 0.877401] iommu: Default domain type: Translated (set via kernel command line)
[ 0.877404] iommu: DMA domain TLB invalidation policy: strict mode (set via kernel command line)
...
```
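
In addition to `dmesg`, the kernel command line can be checked directly. The sketch below assumes a hypothetical `cmdline_has_iommu_defaults` helper that takes the command line as a string, so it can be tried without rebooting. On Linux, the live value is available in `/proc/cmdline`.

```bash
# Hypothetical helper: succeed only if the given kernel command line
# contains both iommu.strict=1 and iommu.passthrough=0 as whole words.
cmdline_has_iommu_defaults() {
  local cmdline=" $1 "
  case "$cmdline" in
    *" iommu.strict=1 "*) ;;
    *) return 1 ;;
  esac
  case "$cmdline" in
    *" iommu.passthrough=0 "*) ;;
    *) return 1 ;;
  esac
  return 0
}

# Example usage (commented out so the snippet has no side effects):
# cmdline_has_iommu_defaults "$(cat /proc/cmdline)" && echo "IOMMU defaults applied"
```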

### Baseline on Arm Neoverse bare-metal (default configuration)

### Baseline on Grace bare-metal (default configuration)
{{% notice Note %}}
To align with the typical deployment scenario of Tomcat, keep 8 cores online and take all other cores offline.
{{% /notice %}}

1. Take the CPU cores offline using the following command.
```bash
for no in {8..143}; do sudo bash -c "echo 0 > /sys/devices/system/cpu/cpu${no}/online"; done
for no in {8..191}; do sudo bash -c "echo 0 > /sys/devices/system/cpu/cpu${no}/online"; done
```
2. Use the following command to verify that cores 0-7 are online and the remaining cores are offline.
```bash
lscpu
```
Check the following fields in the output:
```output
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 144
On-line CPU(s) list: 0-7
Off-line CPU(s) list: 8-143
Vendor ID: ARM
Model name: Neoverse-V2
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-7
Off-line CPU(s) list: 8-191
Vendor ID: ARM
Model name: Neoverse-V2
...
```

3. Run the following commands on the Grace bare-metal machine where `Tomcat` is running:
3. Run the following commands on the Arm Neoverse bare-metal machine where `Tomcat` is running:
```bash
~/apache-tomcat-11.0.9/bin/shutdown.sh 2>/dev/null
ulimit -n 65535 && ~/apache-tomcat-11.0.9/bin/startup.sh
~/apache-tomcat-11.0.10/bin/shutdown.sh 2>/dev/null
ulimit -n 65535 && ~/apache-tomcat-11.0.10/bin/startup.sh
```

4. Then run the following commands on the `x86_64` bare-metal machine where `wrk2` runs:
```bash
tomcat_ip=10.169.226.181
tomcat_ip=172.31.46.193
```
```bash
ulimit -n 65535 && wrk -c1280 -t128 -R500000 -d60 http://${tomcat_ip}:8080/examples/servlets/servlet/HelloWorldExample
@@ -58,20 +92,20 @@ ulimit -n 65535 && wrk -c1280 -t128 -R500000 -d60 http://${tomcat_ip}:8080/examp
The result with the default configuration is:
```output
Thread Stats Avg Stdev Max +/- Stdev
Latency 13.29s 3.25s 19.07s 57.79%
Req/Sec 347.59 430.94 0.97k 66.67%
3035300 requests in 1.00m, 1.58GB read
Socket errors: connect 1280, read 0, write 0, timeout 21760
Requests/sec: 50517.09
Transfer/sec: 26.84MB
Latency 16.76s 6.59s 27.56s 56.98%
Req/Sec 1.97k 165.05 2.33k 89.90%
14680146 requests in 1.00m, 7.62GB read
Socket errors: connect 1264, read 0, write 0, timeout 1748
Requests/sec: 244449.62
Transfer/sec: 129.90MB
```
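
The core-offlining command in step 1 hard-codes the upper bound (143 on the 144-core system, 191 on the 192-core system). As a sketch under stated assumptions, a hypothetical `cores_to_offline` helper can compute the list of sysfs control files from the number of cores to keep and the total core count, so the same command works on both machines:

```bash
# Hypothetical helper: print the sysfs "online" control file for every
# core that should be taken offline.
#   $1 = number of cores to keep online (cores 0..$1-1 stay online)
#   $2 = total number of cores in the system
cores_to_offline() {
  local keep="$1" total="$2" no
  no="$keep"
  while [ "$no" -lt "$total" ]; do
    printf '/sys/devices/system/cpu/cpu%s/online\n' "$no"
    no=$((no + 1))
  done
}

# Example usage (commented out; requires root and changes system state).
# `nproc --all` reports installed cores, including ones already offline.
# for f in $(cores_to_offline 8 "$(nproc --all)"); do
#   echo 0 | sudo tee "$f" > /dev/null
# done
```

Printing the paths first, rather than writing to them directly, makes it possible to review the list before changing any core state.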

### Baseline on Grace bare-metal (access logging disabled)
### Baseline on Arm Neoverse bare-metal (access logging disabled)
To disable access logging, use a text editor to modify the `server.xml` file by commenting out or removing the **`org.apache.catalina.valves.AccessLogValve`** configuration.

The file is at:
```bash
vi ~/apache-tomcat-11.0.9/conf/server.xml
vi ~/apache-tomcat-11.0.10/conf/server.xml
```

The configuration is at the end of the file. Comment it out or remove it.
@@ -83,10 +117,10 @@ The configuration is at the end of the file. Comment it out or remove it.
-->
```

1. Run the following commands on the Grace bare-metal machine where `Tomcat` is running:
1. Run the following commands on the Arm Neoverse bare-metal machine where `Tomcat` is running:
```bash
~/apache-tomcat-11.0.9/bin/shutdown.sh 2>/dev/null
ulimit -n 65535 && ~/apache-tomcat-11.0.9/bin/startup.sh
~/apache-tomcat-11.0.10/bin/shutdown.sh 2>/dev/null
ulimit -n 65535 && ~/apache-tomcat-11.0.10/bin/startup.sh
```

2. Then run the following command on the `x86_64` bare-metal machine where `wrk2` runs:
@@ -97,15 +131,15 @@ ulimit -n 65535 && wrk -c1280 -t128 -R500000 -d60 http://${tomcat_ip}:8080/examp
The result with access logging disabled is:
```output
Thread Stats Avg Stdev Max +/- Stdev
Latency 12.66s 3.05s 17.87s 57.47%
Req/Sec 433.69 524.91 1.18k 66.67%
3572200 requests in 1.00m, 1.85GB read
Socket errors: connect 1280, read 0, write 0, timeout 21760
Requests/sec: 59451.85
Transfer/sec: 31.59MB
Latency 16.16s 6.45s 28.26s 57.85%
Req/Sec 2.16k 5.91 2.17k 77.50%
16291136 requests in 1.00m, 8.45GB read
Socket errors: connect 0, read 0, write 0, timeout 75
Requests/sec: 271675.12
Transfer/sec: 144.36MB
```

### Baseline on Grace bare-metal (optimal thread count)
### Baseline on Arm Neoverse bare-metal (optimal thread count)
To minimize resource contention between threads and overhead from thread context switching, the number of CPU-intensive threads in Tomcat should be aligned with the number of CPU cores.

1. While `wrk` is load testing `Tomcat`:
@@ -115,23 +149,39 @@ top -H -p$(pgrep java)

You should see output similar to the following:
```output
top - 12:12:45 up 1 day, 7:04, 5 users, load average: 7.22, 3.46, 1.75
Threads: 79 total, 8 running, 71 sleeping, 0 stopped, 0 zombie
%Cpu(s): 3.4 us, 1.9 sy, 0.0 ni, 94.1 id, 0.0 wa, 0.0 hi, 0.5 si, 0.0 st
MiB Mem : 964975.5 total, 602205.6 free, 12189.5 used, 356708.3 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 952786.0 avail Mem
top - 08:57:29 up 20 min, 1 user, load average: 4.17, 2.35, 1.22
Threads: 231 total, 8 running, 223 sleeping, 0 stopped, 0 zombie
%Cpu(s): 31.7 us, 20.2 sy, 0.0 ni, 31.0 id, 0.0 wa, 0.0 hi, 17.2 si, 0.0 st
MiB Mem : 386127.8 total, 380676.0 free, 4040.7 used, 2801.1 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 382087.0 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
53254 yinyu01 20 0 38.0g 1.4g 28288 R 96.7 0.1 2:30.70 http-nio-8080-e
53255 yinyu01 20 0 38.0g 1.4g 28288 R 96.7 0.1 2:30.62 http-nio-8080-e
53256 yinyu01 20 0 38.0g 1.4g 28288 R 96.7 0.1 2:30.64 http-nio-8080-e
53258 yinyu01 20 0 38.0g 1.4g 28288 R 96.7 0.1 2:30.62 http-nio-8080-e
53260 yinyu01 20 0 38.0g 1.4g 28288 R 96.7 0.1 2:30.69 http-nio-8080-e
53257 yinyu01 20 0 38.0g 1.4g 28288 R 96.3 0.1 2:30.59 http-nio-8080-e
53259 yinyu01 20 0 38.0g 1.4g 28288 R 96.3 0.1 2:30.63 http-nio-8080-e
53309 yinyu01 20 0 38.0g 1.4g 28288 R 95.3 0.1 2:29.69 http-nio-8080-P
53231 yinyu01 20 0 38.0g 1.4g 28288 S 0.3 0.1 0:00.10 VM Thread
53262 yinyu01 20 0 38.0g 1.4g 28288 S 0.3 0.1 0:00.12 GC Thread#2
4677 ubuntu 20 0 36.0g 1.4g 24452 R 89.0 0.4 1:18.71 http-nio-8080-P
4685 ubuntu 20 0 36.0g 1.4g 24452 R 4.7 0.4 0:04.42 http-nio-8080-A
4893 ubuntu 20 0 36.0g 1.4g 24452 S 3.3 0.4 0:00.60 http-nio-8080-e
4963 ubuntu 20 0 36.0g 1.4g 24452 S 3.3 0.4 0:00.66 http-nio-8080-e
4924 ubuntu 20 0 36.0g 1.4g 24452 S 3.0 0.4 0:00.59 http-nio-8080-e
4955 ubuntu 20 0 36.0g 1.4g 24452 S 3.0 0.4 0:00.60 http-nio-8080-e
5061 ubuntu 20 0 36.0g 1.4g 24452 S 3.0 0.4 0:00.61 http-nio-8080-e
4895 ubuntu 20 0 36.0g 1.4g 24452 S 2.7 0.4 0:00.58 http-nio-8080-e
4907 ubuntu 20 0 36.0g 1.4g 24452 S 2.7 0.4 0:00.59 http-nio-8080-e
4940 ubuntu 20 0 36.0g 1.4g 24452 S 2.7 0.4 0:00.58 http-nio-8080-e
4946 ubuntu 20 0 36.0g 1.4g 24452 S 2.7 0.4 0:00.59 http-nio-8080-e
4956 ubuntu 20 0 36.0g 1.4g 24452 S 2.7 0.4 0:00.65 http-nio-8080-e
4959 ubuntu 20 0 36.0g 1.4g 24452 S 2.7 0.4 0:00.59 http-nio-8080-e
4960 ubuntu 20 0 36.0g 1.4g 24452 R 2.7 0.4 0:00.60 http-nio-8080-e
4962 ubuntu 20 0 36.0g 1.4g 24452 S 2.7 0.4 0:00.57 http-nio-8080-e
4982 ubuntu 20 0 36.0g 1.4g 24452 S 2.7 0.4 0:00.63 http-nio-8080-e
4983 ubuntu 20 0 36.0g 1.4g 24452 S 2.7 0.4 0:00.58 http-nio-8080-e
4996 ubuntu 20 0 36.0g 1.4g 24452 S 2.7 0.4 0:00.60 http-nio-8080-e
5033 ubuntu 20 0 36.0g 1.4g 24452 S 2.7 0.4 0:00.59 http-nio-8080-e
5036 ubuntu 20 0 36.0g 1.4g 24452 S 2.7 0.4 0:00.66 http-nio-8080-e
5056 ubuntu 20 0 36.0g 1.4g 24452 S 2.7 0.4 0:00.61 http-nio-8080-e
5065 ubuntu 20 0 36.0g 1.4g 24452 S 2.7 0.4 0:00.56 http-nio-8080-e
5068 ubuntu 20 0 36.0g 1.4g 24452 S 2.7 0.4 0:00.61 http-nio-8080-e
5070 ubuntu 20 0 36.0g 1.4g 24452 S 2.7 0.4 0:00.60 http-nio-8080-e
5071 ubuntu 20 0 36.0g 1.4g 24452 S 2.7 0.4 0:00.61 http-nio-8080-e
...
```

The output shows that the **`http-nio-8080-e`** and **`http-nio-8080-P`** threads are CPU-intensive.
@@ -141,7 +191,7 @@ To configure the `http-nio-8080-e` thread count, use a text editor to modify the

The file is at:
```bash
vi ~/apache-tomcat-11.0.9/conf/server.xml
vi ~/apache-tomcat-11.0.10/conf/server.xml
```


@@ -164,10 +214,10 @@ vi ~/apache-tomcat-11.0.9/conf/server.xml
/>
```

2. Run the following commands on the Grace bare-metal machine where `Tomcat` is running:
2. Run the following commands on the Arm Neoverse bare-metal machine where `Tomcat` is running:
```bash
~/apache-tomcat-11.0.9/bin/shutdown.sh 2>/dev/null
ulimit -n 65535 && ~/apache-tomcat-11.0.9/bin/startup.sh
~/apache-tomcat-11.0.10/bin/shutdown.sh 2>/dev/null
ulimit -n 65535 && ~/apache-tomcat-11.0.10/bin/startup.sh
```

3. Then run the following command on the `x86_64` bare-metal machine where `wrk2` runs:
@@ -178,9 +228,9 @@ ulimit -n 65535 && wrk -c1280 -t128 -R500000 -d60 http://${tomcat_ip}:8080/examp
The result with the optimal thread count is:
```output
Thread Stats Avg Stdev Max +/- Stdev
Latency 24.34s 9.91s 41.81s 57.77%
Req/Sec 1.22k 4.29 1.23k 71.09%
9255672 requests in 1.00m, 4.80GB read
Requests/sec: 154479.07
Transfer/sec: 82.06MB
Latency 10.26s 4.55s 19.81s 62.51%
Req/Sec 2.86k 89.49 3.51k 77.06%
21458421 requests in 1.00m, 11.13GB read
Requests/sec: 357835.75
Transfer/sec: 190.08MB
```
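
To put the three baselines side by side, the relative change in `Requests/sec` between two runs is a convenient summary. The sketch below assumes a hypothetical `throughput_gain` helper. The example values are the updated `Requests/sec` figures reported above for the default, logging-disabled, and optimal-thread-count runs.

```bash
# Hypothetical helper: print the percentage change in throughput
# from a baseline run ($1) to a new run ($2), to one decimal place.
throughput_gain() {
  awk -v base="$1" -v new="$2" \
    'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

# Example usage with the Requests/sec values reported above:
# throughput_gain 244449.62 271675.12   # default -> access logging disabled
# throughput_gain 271675.12 357835.75   # logging disabled -> optimal thread count
```

With those figures, disabling access logging yields roughly an 11% gain, and aligning the thread count with the core count adds roughly a further 32%.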