
Error running SpecJBB test #5

Open
agangidi53 opened this issue Jun 14, 2018 · 35 comments
Labels: bug (Something isn't working), question (Further information is requested)

@agangidi53

agangidi53 commented Jun 14, 2018

Hi @johnjmar

Can you let me know if you've run into this error with the IBM Java that you've bundled into your benchmarking suite?

root@ubuntu:/home/ubuntu/op-benchmark-recipes/standard-benchmarks/Java/SPEC-jbb2015/18-06-14_174600# cat controller.log
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: joptsimple.HelpFormatter
at java.lang.J9VMInternals.prepareClassImpl(Native Method)
at java.lang.J9VMInternals.prepare(J9VMInternals.java:291)
at java.lang.Class.getMethod(Class.java:1216)
at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:556)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:538)
Caused by: java.lang.ClassNotFoundException: joptsimple.HelpFormatter
at java.net.URLClassLoader.findClass(URLClassLoader.java:607)
at java.lang.ClassLoader.loadClassHelper(ClassLoader.java:850)
at java.lang.ClassLoader.loadClass(ClassLoader.java:829)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:325)
at java.lang.ClassLoader.loadClass(ClassLoader.java:809)
... 5 more
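For reference, a quick hedged check of the kit layout can help with this kind of NoClassDefFoundError. This assumes specjbb2015.jar keeps its dependency jars (including jopt-simple) in a lib/ directory referenced from the jar manifest, which is not confirmed in this thread:

# Hedged sanity check (paths assume the standard kit layout; adjust to your install)
ls lib/ | grep -i jopt
unzip -p specjbb2015.jar META-INF/MANIFEST.MF | grep -i class-path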

@agangidi53

I'm using the bundled IBM Java.
Ubuntu 16.04.4, kernel 4.13.

@riazhus

riazhus commented Jun 18, 2018

Hey Adi,
You will need to check the SPECjbb2015 kit that you are using. If it has been copied from an Intel install, it may have some x86 run-time dependencies and give this JNI error. The install should be done on a Power system using the SPEC installer.
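A quick, hedged way to check for stray x86 pieces in a copied kit (paths are illustrative; run from the kit's install directory):

# Confirm the JVM and any native libraries in the kit are ppc64le rather than x86-64
java -version 2>&1 | head -3
find . -name "*.so" -exec file {} \; | grep -vi "ppc"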

@johnjmar johnjmar assigned basuv and johnjmar and unassigned basuv and johnjmar Jun 19, 2018
@agangidi53

I got around that error but am now running into another one:

Reading property file: /home/rack/op-benchmark-recipes/standard-benchmarks/Java/SPEC-jbb2015/iso/18-06-30_012343/./config/specjbb2015.props

 0s: Enumerating plugins...
 0s:    Connectivity:
 0s:             HTTP_Grizzly: Grizzly HTTP server, JDK HTTP client
 0s:              NIO_Grizzly: Grizzly NIO server, Grizzly NIO client
 0s:               HTTP_Jetty: Jetty HTTP server, JDK HTTP client
 0s:    Snapshot:
 0s:                 InMemory: Stores snapshots in heap memory
 0s:    Data Writers:
 0s:                     Demo: Send all frame to listener
 0s:                   InFile: Writing Data into file
 0s:                   Silent: Drop all frames
 0s:
 0s: Validating kit integrity...
 0s: Kit validation had passed.
 1s:
 1s: Tests are skipped.
 1s:
 1s:
 3s: Terminate the run due to the unexpected error: IC had failed to initialize the server
 3s:
 3s: Tests are skipped.

@agangidi53

This is a dual 22-core Power9 system (3.3 GHz - 3.8 GHz).

@riazhus

riazhus commented Jul 5, 2018

Hey Adi, from the above it is not clear why the run was terminated. It would be great if you could zip up the run directory (it should look something like "18-......" with the date and time of the run) and attach it here, as it would give more insight into what is failing.

@agangidi53

IC_Master_error.zip

Please find the log attached.

@riazhus

riazhus commented Jul 6, 2018

Hi Adi, the zip file is empty for some reason.

@Tom-Tran

Hi Adi,

Please try with the latest IBMJDK build from: https://developer.ibm.com/javasdk/downloads/sdk8/ ... "ibm-java-ppc64le-sdk-8.0-5.17" or "ibm-java-ppc64le-jre-8.0-5.17.bin"

Please send a new zip of the run folder or have a look at the log files in the run folder to check for glaring errors (18*/*log)
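For example, something along these lines (hedged; the pattern list is just a starting point) usually surfaces the first failure quickly:

# Scan the run folder for obvious failures (run-folder name pattern from this thread)
grep -riE --include="*.log" "error|exception|fail" 18-*/ | head -40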

@agangidi53

specjbb_failure_new_java.zip
Hi @Tom-Tran, I tried the recent Java and the test still didn't run, albeit with a different error signature. I actually can't find the error explicitly, since the test seems to have failed "quietly".

@agangidi53

System config:
Proc0: 22 core Power9 (SMT4 config)
Proc1: 22 core Power9 (SMT4 config)

root@ubuntu:/home/ubuntu/SPEC-jbb2015# lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 176
On-line CPU(s) list: 0-175
Thread(s) per core: 4
Core(s) per socket: 22
Socket(s): 2
NUMA node(s): 2
Model: 2.2 (pvr 004e 1202)
Model name: POWER9, altivec supported
CPU max MHz: 3800.0000
CPU min MHz: 2300.0000
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 10240K
NUMA node0 CPU(s): 0-87
NUMA node8 CPU(s): 88-175

root@ubuntu:/home/ubuntu/SPEC-jbb2015# lshw -short
H/W path Device Class Description

                        system     ZAIUS_FX_08 (ingrasys,zaius)

/0 bus Motherboard
/0/1 processor 02CY069
/0/12 processor 02CY069
/0/2 memory 256GiB System memory
/0/2/0 memory 16GiB RDIMM DDR4 2666 MHz (0.4ns)
/0/2/1 memory 16GiB RDIMM DDR4 2666 MHz (0.4ns)
/0/2/2 memory 16GiB RDIMM DDR4 2666 MHz (0.4ns)
/0/2/3 memory 16GiB RDIMM DDR4 2666 MHz (0.4ns)
/0/2/4 memory 16GiB RDIMM DDR4 2666 MHz (0.4ns)
/0/2/5 memory 16GiB RDIMM DDR4 2666 MHz (0.4ns)
/0/2/6 memory 16GiB RDIMM DDR4 2666 MHz (0.4ns)
/0/2/7 memory 16GiB RDIMM DDR4 2666 MHz (0.4ns)
/0/2/8 memory 16GiB RDIMM DDR4 2666 MHz (0.4ns)
/0/2/9 memory 16GiB RDIMM DDR4 2666 MHz (0.4ns)
/0/2/a memory 16GiB RDIMM DDR4 2666 MHz (0.4ns)
/0/2/b memory 16GiB RDIMM DDR4 2666 MHz (0.4ns)
/0/2/c memory 16GiB RDIMM DDR4 2666 MHz (0.4ns)
/0/2/d memory 16GiB RDIMM DDR4 2666 MHz (0.4ns)
/0/2/e memory 16GiB RDIMM DDR4 2666 MHz (0.4ns)
/0/2/f memory 16GiB RDIMM DDR4 2666 MHz (0.4ns)
/0/3 generic bmc-firmware-version
/0/5 generic buildroot
/0/6 generic capp-ucode
/0/7 generic hcode
/0/8 generic hostboot
/0/9 generic hostboot-binaries
/0/a generic linux
/0/b generic machine-xml
/0/c generic occ
/0/d generic petitboot
/0/e generic sbe
/0/f generic skiboot
/0/10 generic version
/0/100 bridge IBM
/0/101 bridge IBM
/0/102 bridge IBM
/0/102/0 storage 88SE9235 PCIe 2.0 x2 4-port SATA 6 Gb/s Controller
/0/103 bridge IBM
/0/104 bridge IBM
/0/105 bridge IBM
/0/105/0 storage MegaRAID Tri-Mode SAS3516
/0/106 bridge IBM
/0/106/0 bus uPD720201 USB 3.0 Host Controller
/0/106/0/0 usb1 bus xHCI Host Controller
/0/106/0/0/1 bus General Purpose USB Hub
/0/106/0/1 usb2 bus xHCI Host Controller
/0/107 bridge IBM
/0/108 bridge IBM
/0/108/0 enP52p1s0 network NetXtreme BCM5719 Gigabit Ethernet PCIe
/0/0 bridge IBM
/0/0/0 bridge AST1150 PCI-to-PCI Bridge
/0/0/0/0 display ASPEED Graphics Family
/0/11 scsi2 storage
/0/11/0.0.0 /dev/sda disk 80GB INTEL SSDSCKHB08
/0/11/0.0.0/0 /dev/sda disk 80GB
/0/11/0.0.0/0/1 /dev/sda1 volume 7167KiB EFI partition
/0/11/0.0.0/0/2 /dev/sda2 volume 74GiB EXT4 volume

@Tom-Tran

Hi Adi,

I have reviewed the run folder. The main issue stems from the run script: run_multi.sh.ibmjdk_829_20C_2S_2grp_63GB.sh
It contains a lot of old Java options that should not be used. We will be updating this repo very soon with the latest recipe. Apologies for the inconvenience.

In the run script, please replace the Java options for the controller, TXI, and BE with those below. Also replace the Java execution commands for the controller, TXI, and BE.

JAVA_OPTS_C="-XX:-RuntimeInstrumentation -Xms1g -Xmx1g -Xmn800m -Xcompressedrefs -XX:-EnableHCR"
JAVA_OPTS_TI="-XX:-RuntimeInstrumentation -Xlp -Xms1000m -Xmx1000m -Xmn700m -Xcompressedrefs -Xtrace:none -Xconcurrentlevel0 -Xaggressive -XX:-EnableHCR"
JAVA_OPTS_BE="-XX:-RuntimeInstrumentation -Xlp -Xms61g -Xmx61g -Xmn59g -Xcompressedrefs -Xtrace:none -Xconcurrentlevel0 -Xaggressive -XX:-EnableHCR"
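# Rough gloss on the IBM J9 options above (my reading of the IBM SDK docs, not
# something stated in this thread):
#   -Xlp                         request large (huge) pages for the Java heap
#   -Xms/-Xmx/-Xmn               initial/maximum heap and nursery sizes
#   -Xcompressedrefs             use compressed object references
#   -Xtrace:none                 disable JVM trace overhead
#   -Xconcurrentlevel0           turn off concurrent GC marking work
#   -Xaggressive                 enable aggressive performance optimizations
#   -XX:-EnableHCR               disable hot code replace support
#   -XX:-RuntimeInstrumentation  disable the runtime instrumentation facility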

echo "Start Controller JVM"
$JAVA $JAVA_OPTS_C $SPEC_OPTS_C -jar ../specjbb2015.jar -m MULTICONTROLLER $MODE_ARGS_C 2>controller.log > controller.out &

echo " Start $TI_NAME"
numactl --physcpubind=${tprocs[gnum]} --membind=${mem[gnum]} $JAVA $JAVA_OPTS_TI $SPEC_OPTS_TI -jar ../specjbb2015.jar -m TXINJECTOR -G=$GROUPID -J=$JVMID $MODE_ARGS_TI > $TI_NAME.log 2>&1 &

echo " Start $BE_NAME"
numactl --physcpubind=${procs[gnum]} --membind=${mem[gnum]} $JAVA $JAVA_OPTS_BE $SPEC_OPTS_BE -jar ../specjbb2015.jar -m BACKEND -G=$GROUPID -J=$JVMID $MODE_ARGS_BE > $BE_NAME.log 2>&1 &
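For context, these per-group launch commands live inside a loop over the JVM groups in run_multi.sh. A minimal sketch follows; GROUP_COUNT and the exact variable names are assumptions inferred from the trace output later in this thread, and the repo's script is the authority.

# Hedged sketch of the surrounding group loop, not the repo's actual code
for (( gnum=1; gnum<GROUP_COUNT+1; gnum++ )); do
    GROUPID=Group$gnum
    TI_NAME=$GROUPID.TxInjector.txiJVM1
    BE_NAME=$GROUPID.Backend.beJVM
    # ... numactl/java launch commands as above ...
done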

@agangidi53

These arguments seem to work (the test is still in progress). This is the first time I've gotten IBM Java to run on Barreleye G2 Power9 systems.

@agangidi53

@Tom-Tran @johnjmar

Here are the numbers I got from my run. As you can see, they are not as stellar as the numbers you have observed. I think it could be because I ran this test with SMT=0. With SMT=4 I was running out of memory.

Do you know how I can work around this? Thanks in advance for the suggestions.
ibm_java_22c_run1.zip

@Tom-Tran

Hi Adi, great to hear it is working now. Odd to see that changing SMT causes OOM. I see you are running two groups, and I'm guessing each backend JVM is using about 63GB... so the benchmark should be using about 140GB of total system memory. You have 256GB RAM from above. Can you please check that you are not over-allocating hugepages in the tune script?

@agangidi53

@Tom-Tran

Here is my tune script

ulimit -n 1048576
ulimit -i unlimited
ulimit -s unlimited
ulimit -u unlimited

swapoff -a

echo 120000 > /proc/sys/vm/nr_hugepages
free && sync && echo 3 > /proc/sys/vm/drop_caches && free

ppc64_cpu --dscr=1

# Network tuning

sysctl -w net.ipv4.udp_rmem_min=1024
echo 65536 > /proc/sys/net/core/rmem_max
echo "8192 1024 65536" > /proc/sys/net/ipv4/tcp_rmem
echo 32768 > /proc/sys/net/core/wmem_max
echo "32768 32768 32768" > /proc/sys/net/ipv4/tcp_wmem

# Scheduler tuning

echo 1000 > /proc/sys/kernel/sched_migration_cost_ns
echo 65000000 > /proc/sys/kernel/sched_min_granularity_ns
echo 24000000 > /proc/sys/kernel/sched_wakeup_granularity_ns

# CPU governor

cpupower frequency-set -g performance

@Tom-Tran

Hi Adi, try reducing the number of hugepages (currently you are allocating ~245GB of hugepages):
echo 60000 > /proc/sys/vm/nr_hugepages

Also, check that your hugepages are divided equally on your two sockets:
cat /sys/devices/system/node/node*/hugepages/hugepages-*/nr_hugepages
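As a hedged back-of-the-envelope (assuming the 2MiB default hugepage size, which the dmesg output later in this thread confirms): 120000 pages x 2MiB is roughly 234GiB (the ~245GB above), while 60000 pages is roughly 117GiB.

# Quick check of how much memory a given nr_hugepages value would reserve
awk '/Hugepagesize/ {printf "%.0f GiB for 60000 pages\n", $2*60000/1024/1024}' /proc/meminfo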

@agangidi53

Hey @Tom-Tran, I realized that I was allocating almost all of my memory and reduced it to half. Thanks for confirming that's fine. Started the run now. Hopefully the numbers are better with SMT=4.

Seems equally divided.

This tuning script is what I got from @basuv. I need to look into the new patch / tuning scripts uploaded by @johnjmar. If things still look shitty, hopefully we can arrange a screen share next week to debug.

Thoroughly appreciate your help in tuning this here.

root@ubuntu:/home/ubuntu/SPEC-jbb2015# cat /sys/devices/system/node/node*/hugepages/hugepages-*/nr_hugepages
0
30000
0
30000

@agangidi53

@Tom-Tran

Sorry to bother you on the weekend. Failure again. (OOM)

[ 1156.011862] bash (6599): drop_caches: 3
[ 1173.432188] bash (6890): drop_caches: 3
[ 1204.067523] run_multi.sh.ib (6912): drop_caches: 3
[ 1475.511016] Group2.Backend. invoked oom-killer: gfp_mask=0x15080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), nodemask=8, order=0, oom_score_adj=0
[ 1475.511020] Group2.Backend. cpuset=/ mems_allowed=0,8
[ 1475.511026] CPU: 157 PID: 11113 Comm: Group2.Backend. Not tainted 4.15.0-23-generic #25-Ubuntu
[ 1475.511029] Call Trace:
[ 1475.511037] [c000201c3b8b36e0] [c000000000cdeb7c] dump_stack+0xb0/0xf4 (unreliable)
[ 1475.511042] [c000201c3b8b3720] [c0000000002e5b94] dump_header+0x98/0x2f4
[ 1475.511044] [c000201c3b8b37e0] [c0000000002e4c68] oom_kill_process+0x368/0x6c0
[ 1475.511046] [c000201c3b8b38a0] [c0000000002e55ec] out_of_memory+0x2bc/0x710
[ 1475.511050] [c000201c3b8b3940] [c0000000002ed5dc] __alloc_pages_nodemask+0xfbc/0x1070
[ 1475.511056] [c000201c3b8b3b30] [c000000000378e80] alloc_pages_current+0xa0/0x140
[ 1475.511061] [c000201c3b8b3b70] [c00000000006d5dc] pte_fragment_alloc+0xdc/0x1e0
[ 1475.511068] [c000201c3b8b3bc0] [c0000000003a0324] do_huge_pmd_anonymous_page+0x374/0x9d0
[ 1475.511072] [c000201c3b8b3c40] [c00000000033f68c] __handle_mm_fault+0xc6c/0xe10
[ 1475.511074] [c000201c3b8b3d20] [c00000000033f958] handle_mm_fault+0x128/0x210
[ 1475.511076] [c000201c3b8b3d60] [c00000000006a72c] __do_page_fault+0x21c/0xa70
[ 1475.511085] [c000201c3b8b3e30] [c00000000000a634] handle_page_fault+0x18/0x38
[ 1475.511086] Mem-Info:
[ 1475.511093] active_anon:991119 inactive_anon:124 isolated_anon:0
active_file:239 inactive_file:185 isolated_file:0
unevictable:127 dirty:2 writeback:0 unstable:0
slab_reclaimable:2596 slab_unreclaimable:24374
mapped:436 shmem:130 pagetables:2485 bounce:0
free:1234619 free_pcp:0 free_cma:140128
[ 1475.511098] Node 8 active_anon:61769088kB inactive_anon:5632kB active_file:4864kB inactive_file:4608kB unevictable:8128kB isolated(anon):0kB isolated(file):0kB mapped:7040kB dirty:128kB writeback:0kB shmem:5952kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 60764160kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 1475.511100] Node 8 DMA free:9148352kB min:180288kB low:314176kB high:448064kB active_anon:61779328kB inactive_anon:5632kB active_file:4864kB inactive_file:4608kB unevictable:8128kB writepending:128kB present:134217728kB managed:133939648kB mlocked:8128kB kernel_stack:53680kB pagetables:145792kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:8968192kB
[ 1475.511105] lowmem_reserve[]: 0 0 0 0 0
[ 1475.511108] Node 8 DMA: 104864kB (UME) 551128kB (ME) 138256kB (UME) 13512kB (UME) 11024kB (M) 12048kB (C) 14096kB (C) 18192kB (C) 537*16384kB (C) = 8993152kB
[ 1475.511119] Node 0 hugepages_total=30000 hugepages_free=812 hugepages_surp=0 hugepages_size=2048kB
[ 1475.511120] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1475.511121] Node 8 hugepages_total=30000 hugepages_free=29614 hugepages_surp=0 hugepages_size=2048kB
[ 1475.511122] Node 8 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1475.511123] 741 total pagecache pages
[ 1475.511125] 0 pages in swap cache
[ 1475.511126] Swap cache stats: add 0, delete 0, find 0/0
[ 1475.511126] Free swap = 0kB
[ 1475.511127] Total swap = 0kB
[ 1475.511128] 4194304 pages RAM
[ 1475.511128] 0 pages HighMem/MovableOnly
[ 1475.511129] 9705 pages reserved
[ 1475.511130] 209920 pages cma reserved
[ 1475.511130] 0 pages hwpoisoned
[ 1475.511131] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 1475.511150] [ 1642] 0 1642 1603 109 32000 0 0 systemd-journal
[ 1475.511152] [ 1673] 0 1673 344 119 31488 0 -1000 systemd-udevd
[ 1475.511154] [ 1677] 0 1677 1266 37 26624 0 0 lvmetad
[ 1475.511156] [ 3152] 100 3152 430 110 27904 0 0 systemd-network
[ 1475.511158] [ 3229] 62583 3229 1413 64 28160 0 0 systemd-timesyn
[ 1475.511159] [ 3230] 101 3230 273 115 31744 0 0 systemd-resolve
[ 1475.511161] [ 3356] 0 3356 1357 38 30976 0 0 lxcfs
[ 1475.511163] [ 3363] 0 3363 101 67 30464 0 0 atd
[ 1475.511164] [ 3366] 0 3366 1720 240 30208 0 0 networkd-dispat
[ 1475.511166] [ 3369] 0 3369 269 113 27648 0 0 systemd-logind
[ 1475.511168] [ 3372] 102 3372 3512 89 28160 0 0 rsyslogd
[ 1475.511169] [ 3379] 0 3379 1331 78 27136 0 0 irqbalance
[ 1475.511171] [ 3383] 103 3383 183 88 27392 0 -900 dbus-daemon

(( gnum=1+1 ))
(( 2<2+1 ))
[ 1475.511172] [ 3430] 0 3430 1331 8 30208 0 0 iprdump
[ 1475.511174] [ 3461] 0 3461 148 73 26880 0 0 cron
[ 1475.511175] [ 3463] 0 3463 3767 119 33792 0 0 accounts-daemon
[ 1475.511177] [ 3471] 0 3471 294 73 27904 0 0 opal-prd
[ 1475.511178] [ 3474] 0 3474 38499 329 75264 0 -900 snapd
[ 1475.511180] [ 3540] 0 3540 54 9 30208 0 0 iprinit
[ 1475.511181] [ 3545] 0 3545 54 9 25856 0 0 iprupdate
[ 1475.511183] [ 3568] 0 3568 3728 151 28928 0 0 polkitd
[ 1475.511185] [ 3671] 0 3671 268 97 27392 0 -1000 sshd
[ 1475.511186] [ 3686] 0 3686 121 26 30720 0 0 iscsid
[ 1475.511188] [ 3690] 0 3690 129 125 30720 0 -17 iscsid
[ 1475.511189] [ 3742] 0 3742 234 121 27392 0 0 login
[ 1475.511191] [ 3749] 0 3749 116 43 30464 0 0 agetty
[ 1475.511192] [ 3964] 1000 3964 313 124 28160 0 0 systemd
[ 1475.511194] [ 3973] 1000 3973 393 129 28672 0 0 (sd-pam)
[ 1475.511195] [ 3991] 1000 3991 167 69 26624 0 0 bash
[ 1475.511197] [ 3999] 0 3999 214 104 31232 0 0 sudo
[ 1475.511198] [ 4001] 0 4001 203 98 31232 0 0 su
[ 1475.511200] [ 4002] 0 4002 149 60 30976 0 0 bash
[ 1475.511201] [ 4024] 0 4024 333 148 28416 0 0 sshd
[ 1475.511203] [ 4121] 1000 4121 333 111 28416 0 0 sshd
[ 1475.511204] [ 4122] 1000 4122 167 69 30720 0 0 bash
[ 1475.511206] [ 4130] 0 4130 214 104 31744 0 0 sudo
[ 1475.511208] [ 4131] 0 4131 203 97 27136 0 0 su
[ 1475.511209] [ 4132] 0 4132 151 62 30720 0 0 bash
[ 1475.511211] [ 6579] 0 6579 157 61 27136 0 0 screen
[ 1475.511212] [ 6580] 0 6580 149 61 30720 0 0 bash
[ 1475.511214] [ 6903] 0 6903 133 11 30976 0 0 full_run.sh
[ 1475.511215] [ 6912] 0 6912 133 45 30720 0 0 run_multi.sh.ib
[ 1475.511217] [ 6927] 0 6927 1236296 16241 822784 0 0 java
[ 1475.511271] [ 8485] 0 8485 417347 3670 269824 0 0 java
[ 1475.511272] [ 8557] 0 8557 1912692 945196 8305152 0 0 java
[ 1475.511274] Out of memory: Kill process 8557 (java) score 437 or sacrifice child
[ 1475.513927] Killed process 8557 (java) total-vm:122412288kB, anon-rss:60478848kB, file-rss:13696kB, shmem-rss:0kB
[ 1476.525881] oom_reaper: reaped process 8557 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

@Tom-Tran

Hi Adi, can you try
echo 90000 > /proc/sys/vm/nr_hugepages

Sorry, 30K per socket only gives ~58GB.
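Hedged arithmetic for the per-socket split (2MiB pages assumed): 30000 pages per socket is about 58.6GiB, too small for a 61-63GB backend heap, while 45000 pages per socket (90000 total) gives about 87.9GiB of headroom.

# Per-socket reservation at 90000 total hugepages
echo "$(( 45000 * 2 / 1024 )) GiB per socket"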

@agangidi53

ibm_java_run2_smt4_hpg_90k.zip

@Tom-Tran much better.

From here, I'd like to tune it further to get a bit closer to the published 24-core numbers. Appreciate your help so far!

@Tom-Tran

Hi Adi, great to hear the issues are resolved.

What is your performance goal?

The largest performance gain you can probably get is from increasing the number of groups per socket. I recommend 2 JVM groups per socket with 11 cores bound to each JVM group. Since you are limited to 128GB RAM per socket, I'd recommend you bump down the backend JVM heap in JAVA_OPTS_BE by 5 to 15 GB, i.e., let's try "-Xms50g -Xmx50g -Xmn48g". Also, you'll have to increase the number of hugepages again: echo 115000 > /proc/sys/vm/nr_hugepages
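As a hedged sizing check of that layout (2MiB hugepages and ~128GB per socket assumed): four backends at ~50GB heap each need roughly 200GB of large pages, and 115000 hugepages reserve about 225GiB, which leaves little room for the controller, injectors, and the OS on a 256GB box.

# Reservation implied by the suggested nr_hugepages value
echo "$(( 115000 * 2 / 1024 )) GiB reserved by 115000 x 2MiB hugepages"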

@agangidi53

@Tom-Tran My goal is to beat a dual EPYC 7601 by a margin of 10% or so. The 7601 clocks in around 120K max-jOPS as well.

@agangidi53

@Tom-Tran
echo 115000 > /proc/sys/vm/nr_hugepages ran into OOM.
I decreased 'nr_hugepages' to 110000 and the test is running.

Here is what I went with

procs[1]="0-43"
procs[2]="44-87"
procs[3]="88-131"
procs[4]="132-175"
tprocs[1]="0-7"
tprocs[2]="44-51"
tprocs[3]="88-95"
tprocs[4]="132-139"
mem[1]="0"
mem[2]="0"
mem[3]="8"
mem[4]="8"
tmem[1]="0"
tmem[2]="0"
tmem[3]="8"
tmem[4]="8"
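For reference, a quick hedged check that these ranges line up with the NUMA topology reported earlier (node0 = CPUs 0-87, node8 = CPUs 88-175):

# Compare the chosen bindings against what the system reports
numactl --hardware | grep cpus
lscpu | grep "NUMA node"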

What do you think?

I'm not sure whether this is going to give better results, and I don't want to fiddle with the system while the test is running. I did get the "out of band" power measurement, and the chips seem to be running about 10% hotter wattage-wise than in the previous test.
Chip 0 : 188.00 W (lowest = 54.00 W, highest = 222.00 W)
Chip 0 Vdd: 139.00 W (lowest = 7.00 W, highest = 173.00 W)
Chip 0 Vdn: 14.00 W (lowest = 12.00 W, highest = 17.00 W)
Chip 8 : 162.00 W (lowest = 54.00 W, highest = 222.00 W)
Chip 8 Vdd: 113.00 W (lowest = 7.00 W, highest = 173.00 W)
Chip 8 Vdn: 14.00 W (lowest = 11.00 W, highest = 17.00 W)

@Tom-Tran

Those bindings look good. I'm curious if you'll get any minor performance gains from:
tprocs[1]="40-43"
tprocs[2]="44-47"
tprocs[3]="128-131"
tprocs[4]="132-135"

Is there any vmstat/mpstat output to complement the wattage output from the two runs? Is the hotter run achieving higher max-jOPS, and therefore keeping the CPUs busier, which can increase wattage?
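If those aren't being captured yet, something like this alongside the run would do (intervals are illustrative; vmstat is from procps and mpstat from sysstat):

# Capture CPU/memory stats for the duration of a run
vmstat 10 > vmstat_run.log 2>&1 &
mpstat -P ALL 10 > mpstat_run.log 2>&1 &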

@agangidi53

agangidi53 commented Jul 14, 2018

ibm_java_run3_4_groups (1).zip

Hi Tom, here is the output alongside the previous run. It is showing an 8% improvement :)

Is there anything else that comes to mind to try?

@Tom-Tran

That's great. Here are some other items that may get a couple more percent.

  1. The tproc settings above
  2. Increase the BE heap as much as you can in JAVA_OPTS_BE... see if you can increase Xmx, Xms and Xmn by 5GB
  3. Here are two OS sched tunings that differ slightly from what you are using:
    echo 150000000 > /proc/sys/kernel/sched_min_granularity_ns
    echo 1000000000 > /proc/sys/kernel/sched_wakeup_granularity_ns
  4. Tune config/specjbb2015.props... the three forkjoin tiers, the worker min/max and the selector runner... these are all thread pools that need to be tuned, but there are no formulas for finding the optimal configuration. You might get 1-2% from tuning this, unless it is really poorly configured. I'd start by just increasing/decreasing "specjbb.forkjoin.workers.Tier3" by 2, 4 and 6 (a hedged example of this kind of edit is sketched below the list).
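A hedged example of the kind of edit meant in (4); the Tier1/Tier2 property names are assumed by analogy with the Tier3 name quoted above, and the values are illustrative starting points only:

# In config/specjbb2015.props (illustrative values, not recommendations from this thread)
specjbb.forkjoin.workers.Tier1 = 88
specjbb.forkjoin.workers.Tier2 = 4
specjbb.forkjoin.workers.Tier3 = 12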

@agangidi53

Hi @Tom-Tran
(1) didn't really help the max-jOPS. It did hurt the critical-jOPS a bit.
Will try (2), (3), (4).

@agangidi53

@Tom-Tran

  1. Didn't help, but hampered results by 2%
  2. Couldn't increase by 5 exactly (OOM) but was able to increase by 3. Didn't help, but hampered results by 2%
  3. Helped the original setting by 2%: max-jOPS = 132390, critical-jOPS = 45363
  4. Haven't tried yet.

@Tom-Tran

Interesting that (2) hampered results. Was the difference between Xmx and Xmn 2GB? I'm guessing this is run-to-run variation.

@agangidi53

agangidi53 commented Jul 20, 2018

Hi @Tom-Tran, until now I was running the tests with the 2400 MHz host build setting; I moved to 2666 MHz and seem to be running into errors:

Does this make sense to you? I used the same parameters and settings from what we discussed before.

dmesg
[32661.390624] run_multi.sh.ib (16641): drop_caches: 3
[32680.432182] JIT Compilation invoked oom-killer: gfp_mask=0x15080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), nodemask=8, order=0, oom_score_adj=0
[32680.432186] JIT Compilation cpuset=/ mems_allowed=0,8
[32680.432191] CPU: 136 PID: 20033 Comm: JIT Compilation Not tainted 4.15.0-23-generic #25-Ubuntu
[32680.432192] Call Trace:
[32680.432201] [c00020003309f6e0] [c000000000cdeb7c] dump_stack+0xb0/0xf4 (unreliable)
[32680.432206] [c00020003309f720] [c0000000002e5b94] dump_header+0x98/0x2f4
[32680.432208] [c00020003309f7e0] [c0000000002e4c68] oom_kill_process+0x368/0x6c0
[32680.432210] [c00020003309f8a0] [c0000000002e55ec] out_of_memory+0x2bc/0x710
[32680.432213] [c00020003309f940] [c0000000002ed5dc] __alloc_pages_nodemask+0xfbc/0x1070
[32680.432219] [c00020003309fb30] [c000000000378e80] alloc_pages_current+0xa0/0x140
[32680.432224] [c00020003309fb70] [c00000000006d5dc] pte_fragment_alloc+0xdc/0x1e0
[32680.432231] [c00020003309fbc0] [c0000000003a0324] do_huge_pmd_anonymous_page+0x374/0x9d0
[32680.432235] [c00020003309fc40] [c00000000033f68c] __handle_mm_fault+0xc6c/0xe10
[32680.432237] [c00020003309fd20] [c00000000033f958] handle_mm_fault+0x128/0x210
[32680.432239] [c00020003309fd60] [c00000000006a72c] __do_page_fault+0x21c/0xa70
[32680.432247] [c00020003309fe30] [c00000000000a634] handle_page_fault+0x18/0x38
[32680.432248] Mem-Info:
[32680.432256] active_anon:88667 inactive_anon:127 isolated_anon:0
active_file:728 inactive_file:983 isolated_file:0
unevictable:127 dirty:0 writeback:0 unstable:0
slab_reclaimable:3689 slab_unreclaimable:30156
mapped:929 shmem:132 pagetables:870 bounce:0
free:498038 free_pcp:0 free_cma:209920
[32680.432261] Node 8 active_anon:2986432kB inactive_anon:2560kB active_file:2944kB inactive_file:4032kB unevictable:4096kB isolated(anon):0kB isolated(file):0kB mapped:1408kB dirty:0kB writeback:0kB shmem:2816kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2265088kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[32680.432264] Node 8 DMA free:13608256kB min:177408kB low:307136kB high:436864kB active_anon:2990528kB inactive_anon:2560kB active_file:3136kB inactive_file:2176kB unevictable:4096kB writepending:0kB present:134217728kB managed:129743552kB mlocked:4096kB kernel_stack:46912kB pagetables:28288kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:13434880kB
[32680.432268] lowmem_reserve[]: 0 0 0 0 0
[32680.432271] Node 8 DMA: 27864kB (UME) 69128kB (ME) 21256kB (ME) 4512kB (UM) 11024kB (M) 12048kB (U) 14096kB (U) 18192kB (M) 828*16384kB (MC) = 13615360kB
[32680.432281] Node 0 hugepages_total=54500 hugepages_free=52320 hugepages_surp=0 hugepages_size=2048kB
[32680.432282] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[32680.432283] Node 8 hugepages_total=54500 hugepages_free=52306 hugepages_surp=0 hugepages_size=2048kB
[32680.432284] Node 8 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[32680.432285] 1904 total pagecache pages
[32680.432286] 0 pages in swap cache
[32680.432287] Swap cache stats: add 0, delete 0, find 0/0
[32680.432288] Free swap = 0kB
[32680.432288] Total swap = 0kB
[32680.432289] 4194304 pages RAM
[32680.432290] 0 pages HighMem/MovableOnly
[32680.432290] 75269 pages reserved
[32680.432291] 209920 pages cma reserved
[32680.432291] 0 pages hwpoisoned
[32680.432292] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[32680.432309] [ 1619] 0 1619 1158 172 37120 0 0 systemd-journal
[32680.432311] [ 1651] 0 1651 1266 50 26880 0 0 lvmetad
[32680.432313] [ 1659] 0 1659 340 142 31744 0 -1000 systemd-udevd
[32680.432315] [ 2920] 100 2920 429 181 32000 0 0 systemd-network
[32680.432317] [ 3014] 62583 3014 1413 111 32000 0 0 systemd-timesyn
[32680.432318] [ 3041] 101 3041 273 165 27648 0 0 systemd-resolve
[32680.432320] [ 3209] 0 3209 101 72 30464 0 0 atd
[32680.432321] [ 3213] 0 3213 3537 44 32000 0 0 lxcfs
[32680.432323] [ 3214] 102 3214 3512 80 32256 0 0 rsyslogd
[32680.432324] [ 3234] 0 3234 1331 9 25856 0 0 iprdump
[32680.432326] [ 3240] 0 3240 297 74 28160 0 0 opal-prd
[32680.432327] [ 3243] 0 3243 3767 178 33280 0 0 accounts-daemon
[32680.432329] [ 3247] 0 3247 269 168 27648 0 0 systemd-logind
[32680.432330] [ 3250] 0 3250 148 88 30720 0 0 cron
[32680.432332] [ 3255] 0 3255 1332 100 31232 0 0 irqbalance
[32680.432333] [ 3268] 0 3268 51199 347 76800 0 -900 snapd
[32680.432335] [ 3270] 103 3270 183 105 27136 0 -900 dbus-daemon
[32680.432337] [ 3388] 0 3388 1720 298 34304 0 0 networkd-dispat
[32680.432338] [ 3440] 0 3440 54 9 29952 0 0 iprupdate
[32680.432340] [ 3448] 0 3448 54 9 30208 0 0 iprinit
[32680.432341] [ 3451] 0 3451 3728 134 29184 0 0 polkitd
[32680.432343] [ 3567] 111 3567 801 221 28928 0 0 redis-server
[32680.432344] [ 3690] 112 3690 388416 6522 376064 0 0 mysqld
[32680.432346] [ 3753] 0 3753 268 117 27904 0 -1000 sshd
[32680.432347] [ 3765] 0 3765 121 26 26368 0 0 iscsid
[32680.432349] [ 3769] 0 3769 129 126 26368 0 -17 iscsid
[32680.432351] [ 3823] 0 3823 234 141 27392 0 0 login
[32680.432352] [ 3832] 0 3832 116 49 26368 0 0 agetty
[32680.432354] [ 4057] 1000 4057 314 188 32000 0 0 systemd
[32680.432355] [ 4066] 1000 4066 394 129 28160 0 0 (sd-pam)
[32680.432357] [ 4093] 1000 4093 167 106 30976 0 0 bash
[32680.432358] [ 4101] 0 4101 214 122 31488 0 0 sudo
[32680.432360] [ 4103] 0 4103 203 116 27136 0 0 su
[32680.432362] [ 4104] 0 4104 151 98 26368 0 0 bash
[32680.432363] [ 6895] 0 6895 333 169 32256 0 0 sshd
[32680.432365] [ 7000] 1000 7000 333 154 32000 0 0 sshd
[32680.432366] [ 7001] 1000 7001 167 106 26880 0 0 bash
[32680.432368] [ 7009] 0 7009 214 122 27136 0 0 sudo
[32680.432369] [ 7010] 0 7010 203 116 27136 0 0 su
[32680.432371] [ 7011] 0 7011 151 102 30720 0 0 bash
[32680.432372] [ 7388] 0 7388 157 68 26880 0 0 screen
[32680.432374] [ 7389] 0 7389 151 101 26624 0 0 bash
[32680.432375] [16573] 0 16573 151 85 30976 0 0 screen
[32680.432377] [16631] 0 16631 133 31 26368 0 0 full_run.sh
[32680.432379] [16641] 0 16641 133 75 30464 0 0 run_multi.sh.ib
[32680.432380] [16656] 0 16656 546881 8222 376832 0 0 java
[32680.432457] [18295] 0 18295 139656 2129 121856 0 0 java
[32680.432458] [19100] 0 19100 1337714 16164 450560 0 0 java
[32680.432460] [19223] 0 19223 136584 2036 120576 0 0 java
[32680.432461] [20028] 0 20028 1362019 15982 459264 0 0 java
[32680.432463] Out of memory: Kill process 19100 (java) score 7 or sacrifice child
[32680.433492] Killed process 19100 (java) total-vm:85613696kB, anon-rss:1000000kB, file-rss:34496kB, shmem-rss:0kB
[32680.452836] oom_reaper: reaped process 19100 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

@deathscytheh28

SPECjbb ran into an error when we set the Barreleye_G2 memory frequency to 2666 MHz.

18-07-19_234122.zip

@Tom-Tran

Hi Adi, the OOM sounds more like an OS problem than a Java one. Do you still have the same amount of available memory on the OS with the 2666 MHz memory frequency?

Is Terry doing the runs now?

@agangidi53

Both Terry and I did the runs and ran into issues. Yes, still 16x 16GB DIMMs, so the same amount of memory.

@Tom-Tran

I see. @basuv, please comment on any side effects of changing the memory frequency from 2400 to 2666. From the outputs, one of the JVMs on node 8 got killed by the kernel because of OOM. I don't think it is due to a lack of allocated hugepages; it probably ran out of normal 64K-page memory. Can we try reducing the hugepage allocation per node from 54500 to 53000 or 52500?
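For example (per-node sysfs path as used earlier in this thread; node numbers from lscpu):

# Hedged example of reducing the per-node 2MiB hugepage reservation
echo 52500 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
echo 52500 > /sys/devices/system/node/node8/hugepages/hugepages-2048kB/nr_hugepages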

@johnjmar

Is this issue resolved? Wondering if I should close it.

@johnjmar johnjmar added bug Something isn't working question Further information is requested labels Oct 29, 2018