Segmentation Fault when running the sample code #2

Closed
Alchem-Lab opened this issue Sep 7, 2018 · 6 comments

@Alchem-Lab

Alchem-Lab commented Sep 7, 2018

Hi,

I am trying to run the sample script, but I encounter a segmentation fault. Do you have any suggestions for resolving this issue?

I used the hosts.xml file and config.xml file as mentioned in the README. Here is the command I use:
./run2.py config.xml noccocc "-t 24 -c 10 -r 100" tpcc 3

The output is as follows (I added some extra output of my own):

[START] Input parsing done.
[START] cleaning remaining processes.
ssh -n -f nerv2 "cd /home/chao/git_repos/rocc/scripts/ && rm log"
ssh -n -f nerv2 "cd /home/chao/git_repos/rocc/scripts/ && ./noccocc --bench tpcc --txn-flags 1 --verbose --config config.xml --id 1 -t 24 -c 10 -r 100 -p 3 1>log 2>&1 &"
ssh -n -f nerv3 "cd /home/chao/git_repos/rocc/scripts/ && rm log"
ssh -n -f nerv3 "cd /home/chao/git_repos/rocc/scripts/ && ./noccocc --bench tpcc --txn-flags 1 --verbose --config config.xml --id 2 -t 24 -c 10 -r 100 -p 3 1>log 2>&1 &"
cd /home/chao/git_repos/rocc/scripts/ && ./noccocc --bench tpcc --txn-flags 1 --verbose --config config.xml --id 0 -t 24 -c 10 -r 100 -p 3
NOCC started with program [noccocc]. at 06-09-2018 09:36:48
[tpcc] settings:
new_order_remote_item_pct : 1
uniform_item_dist : 0
micro dist :20
[bench_runner.cc:324] Use TCP port 8888
[bench_runner.cc:346] use scale factor: 72; with total 24 threads.
[view.h:48] Start with 0 backups.
[view.cc:10] total 3 backups to assign
Txn NewOrder, 100
Remote counts: 100
NAIVE: 4[util.cc:164] huge page alloc failed!
[librdma] get device name mlx4_0, idx 0
[librdma] : Device 0 has 1 ports
[bench_runner.cc:153] Total logger area 0.00585938G.
[bench_runner.cc:163] add RDMA store size 4.88281G.
[bench_runner.cc:172] [Mem] RDMA heap size 8.03902G.
[util.cc:164] huge page alloc failed!
[util.cc:164] huge page alloc failed!
[NOCC] Meet a segmentation fault!
stack trace:
./noccocc() [0x4b3bb8]
/lib64/libc.so.6 : ()+0x35270
/lib64/libc.so.6 : ()+0x8981d
./noccocc : MemDB::AddSchema(int, TABLE_CLASS, int, int, int, int, bool)+0x105
./noccocc : nocc::oltp::tpcc::TpccMainRunner::init_store(MemDB &)+0xe0
./noccocc : nocc::oltp::BenchRunner::run()+0x3d4
./noccocc : nocc::oltp::tpcc::TpccTest(int, char *)+0x143
./noccocc : main()+0x589
/lib64/libc.so.6 : __libc_start_main()+0xf5
./noccocc() [0x47813c]

Thanks!

@wxdwfc
Collaborator

wxdwfc commented Sep 7, 2018

Hi, could you share your hosts.xml contents and your compilation options (cmake -Dxxxx=xx) with me?
If possible, could you also give me your RDMA NIC information (i.e., the output of "ibstat")?
Thanks!

@Alchem-Lab
Author

The hosts.xml is like this:

<hosts>
  <!-- all reachable hosts -->
  <macs>
    <a>nerv1</a>
    <a>nerv2</a>
    <a>nerv3</a>
  </macs>
  <black>
    <a>nerv4</a>
  </black>
  <!-- The macs which are ignored -->
</hosts>

The compilation options are the same as in your README:
cmake -DUSE_RDMA=1 -DONE_SIDED_READ=1 -D ROCC_RBUF_SIZE_M=13240 -D RDMA_STORE_SIZE=5000 -DRDMA_CACHE=0 -DTX_LOG_STYLE=2 -DPA=0 .

The following is the result of running ibstat on machine nerv1:

CA 'mlx4_0'
CA type: MT4103
Number of ports: 2
Firmware version: 2.34.5000
Hardware version: 0
Node GUID: 0x248a070300e47ac0
System image GUID: 0x248a070300e47ac3
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 1
LMC: 0
SM lid: 1
Capability mask: 0x0251486a
Port GUID: 0x248a070300e47ac1
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0x248a070300e47ac2
Link layer: InfiniBand

ibstat for machine nerv2:

CA 'mlx4_0'
CA type: MT4103
Number of ports: 2
Firmware version: 2.40.7004
Hardware version: 0
Node GUID: 0x7cfe9003009e4ba0
System image GUID: 0x7cfe9003009e4ba3
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 14
LMC: 0
SM lid: 1
Capability mask: 0x02514868
Port GUID: 0x7cfe9003009e4ba1
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0x7cfe9003009e4ba2
Link layer: InfiniBand

ibstat for machine nerv3:

CA 'mlx4_0'
CA type: MT4103
Number of ports: 2
Firmware version: 2.40.7004
Hardware version: 0
Node GUID: 0x7cfe9003009e5160
System image GUID: 0x7cfe9003009e5163
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 5
LMC: 0
SM lid: 1
Capability mask: 0x02514868
Port GUID: 0x7cfe9003009e5161
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0x7cfe9003009e5162
Link layer: InfiniBand

I ran the Python script on nerv1, which is the first machine in hosts.xml.

Thanks!

@wxdwfc
Collaborator

wxdwfc commented Sep 7, 2018

Hi, thanks for the information.
First, please pull the latest code from the mainstream branch; it was missing some scripts and files. I'm sorry for that (we are still refining the code).
Second, it seems that you have one NIC per machine, so you may need to change the RWorker::choose_rnic_port() function and set the use_port_ variable to always be 0 (this uses the first RNIC on your server, while our servers use two NICs; see the sketch at the end of this comment).
Third, can you run the SmallBank benchmark? It seems that the segmentation fault happens during data loading, and SmallBank uses a much simpler store than TPC-C. You can replace tpcc with bank to run the SmallBank workload.
Thanks.
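
For illustration only, a minimal sketch of the suggested change. The struct below is a stand-in containing just the relevant field; the real RWorker class in rocc has more state, but the idea is simply to pin the selection to the first (and only) RNIC:

    // Hypothetical sketch -- not the actual rocc implementation.
    #include <cstdio>

    struct RWorker {          // stand-in for the real worker class
      int use_port_ = 0;      // index of the RNIC/port this worker uses

      void choose_rnic_port() {
        // On a single-NIC machine, always use the first RNIC instead of
        // choosing between two devices.
        use_port_ = 0;
      }
    };

    int main() {
      RWorker w;
      w.choose_rnic_port();
      std::printf("worker pinned to RNIC index %d\n", w.use_port_);
      return 0;
    }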

@Alchem-Lab
Author

Hi, thanks for the reply!

The segfault issue is solved.

But I still cannot run the code successfully; I am trying to debug it. This is my current output when running the script:

[chao@nerv1 scripts]$ ./run2.py config.xml noccocc "-t 24 -c 10 -r 100" bank 3
[START] Input parsing done.
[START] cleaning remaining processes.
ssh -n -f nerv2 "cd /home/chao/git_repos/rocc/scripts/ && rm *log"
ssh -n -f nerv2 "cd /home/chao/git_repos/rocc/scripts/ && ./noccocc --bench bank --txn-flags 1 --verbose --config config.xml --id 1 -t 24 -c 10 -r 100 -p 3 1>log 2>&1 &"
ssh -n -f nerv3 "cd /home/chao/git_repos/rocc/scripts/ && rm *log"
ssh -n -f nerv3 "cd /home/chao/git_repos/rocc/scripts/ && ./noccocc --bench bank --txn-flags 1 --verbose --config config.xml --id 2 -t 24 -c 10 -r 100 -p 3 1>log 2>&1 &"
cd /home/chao/git_repos/rocc/scripts/ && ./noccocc --bench bank --txn-flags 1 --verbose --config config.xml --id 0 -t 24 -c 10 -r 100 -p 3
NOCC started with program [noccocc]. at 08-09-2018 03:45:29
[bench_runner.cc:324] Use TCP port 8888
[bench_runner.cc:346] use scale factor: 72; with total 24 threads.
[view.h:48] Start with 0 backups.
[view.cc:10] total 3 backups to assign
[Bank]: check workload 25, 15, 15, 15, 15, 15
[util.cc:164] huge page alloc failed!
[librdma] get device name mlx4_0, idx 0
[librdma] : Device 0 has 1 ports
[bench_runner.cc:153] Total logger area 0.00585938G.
[bench_runner.cc:163] add RDMA store size 4.88281G.
[bench_runner.cc:172] [Mem] RDMA heap size 8.03905G.
[util.cc:164] huge page alloc failed!
[util.cc:164] huge page alloc failed!
[Bank], total 21600000 accounts loaded
[bank_main.cc:263] check cv balance 46280
[Runner] local db size: 661.754 MB
[Runner] Cache size: 0 MB
[bench_listener2.cc:64] try log results to ./results/noccocc_bank_3_24_10_100.log
[bench_listener2.cc:73] New monitor running!
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory

This "Connect Memory Region failed error" is due to ibv_reg_mr() returns null. Do you have any idea why ibv_reg_mr() function can fail?

I appreciate your advice.
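
For reference, a common first check when ibv_reg_mr() fails with "Cannot allocate memory" is the process's locked-memory limit (ulimit -l); the NIC's own registration limits and the lack of reserved huge pages are another cause, which is what the reply below addresses. A standalone sketch (not part of rocc) that prints the current limit:

    // Hypothetical diagnostic, independent of rocc: print RLIMIT_MEMLOCK.
    // A small value (e.g. the common 64 KB default) is usually far too low
    // for RDMA memory registration; "unlimited" shows up as RLIM_INFINITY.
    #include <sys/resource.h>
    #include <cstdio>

    int main() {
      struct rlimit rl;
      if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        std::perror("getrlimit");
        return 1;
      }
      std::printf("RLIMIT_MEMLOCK: soft=%llu hard=%llu bytes\n",
                  (unsigned long long)rl.rlim_cur,
                  (unsigned long long)rl.rlim_max);
      return 0;
    }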

@wxdwfc
Collaborator

wxdwfc commented Sep 9, 2018

Hi, it seems that there are no 2MB huge pages available on your machine (this means more memory is needed to register the memory region on the NIC).
Can you allocate enough huge pages and then try again? You can use a command such as su -c "echo 10240 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages" for the allocation.

P.S.: If you cannot use huge pages, you can configure the RNIC according to this post https://community.mellanox.com/docs/DOC-1120 to allow the RNIC to register larger memory. But huge pages are recommended for better performance.
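
For context on the "huge page alloc failed!" lines in the logs above, here is a minimal sketch of the usual pattern for huge-page-backed allocation with a fallback to regular pages; it is an assumption about the general technique, not the actual util.cc code. Reserving 2MB pages via nr_hugepages (as in the command above) is what makes the first mmap succeed:

    // Hypothetical sketch of huge-page allocation with a plain-page fallback.
    #include <sys/mman.h>
    #include <cstdio>

    static void *alloc_region(size_t bytes) {
      // Try 2MB huge pages first; this fails with ENOMEM when no huge pages
      // have been reserved via /sys/.../nr_hugepages.
      void *p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
      if (p == MAP_FAILED) {
        std::fprintf(stderr, "huge page alloc failed, falling back to 4KB pages\n");
        p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      }
      return p == MAP_FAILED ? nullptr : p;
    }

    int main() {
      void *r = alloc_region(1ULL << 30);  // 1 GB, a multiple of the 2MB page size
      std::printf("allocated region at %p\n", r);
      return r != nullptr ? 0 : 1;
    }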

@Alchem-Lab
Author

I allocated enough huge page memory as you suggested, and the bank benchmark now works fine on three machines with 8 threads on each machine. Thanks for all the help :)
