Segmentation Fault when running the sample code #2

Closed
Alchem-Lab opened this issue Sep 7, 2018 · 6 comments

@Alchem-Lab

Alchem-Lab commented Sep 7, 2018

Hi,

I am trying to run the sample script, but I encounter a segmentation fault. Do you have any suggestions for resolving this issue?

I used the hosts.xml file and config.xml file as mentioned in the README. Here is the command I use:
./run2.py config.xml noccocc "-t 24 -c 10 -r 100" tpcc 3

The output is as follows (I added some extra output of my own):

[START] Input parsing done.
[START] cleaning remaining processes.
ssh -n -f nerv2 "cd /home/chao/git_repos/rocc/scripts/ && rm log"
ssh -n -f nerv2 "cd /home/chao/git_repos/rocc/scripts/ && ./noccocc --bench tpcc --txn-flags 1 --verbose --config config.xml --id 1 -t 24 -c 10 -r 100 -p 3 1>log 2>&1 &"
ssh -n -f nerv3 "cd /home/chao/git_repos/rocc/scripts/ && rm log"
ssh -n -f nerv3 "cd /home/chao/git_repos/rocc/scripts/ && ./noccocc --bench tpcc --txn-flags 1 --verbose --config config.xml --id 2 -t 24 -c 10 -r 100 -p 3 1>log 2>&1 &"
cd /home/chao/git_repos/rocc/scripts/ && ./noccocc --bench tpcc --txn-flags 1 --verbose --config config.xml --id 0 -t 24 -c 10 -r 100 -p 3
NOCC started with program [noccocc]. at 06-09-2018 09:36:48
[tpcc] settings:
new_order_remote_item_pct : 1
uniform_item_dist : 0
micro dist :20
[bench_runner.cc:324] Use TCP port 8888
[bench_runner.cc:346] use scale factor: 72; with total 24 threads.
[view.h:48] Start with 0 backups.
[view.cc:10] total 3 backups to assign
Txn NewOrder, 100
Remote counts: 100
NAIVE: 4[util.cc:164] huge page alloc failed!
[librdma] get device name mlx4_0, idx 0
[librdma] : Device 0 has 1 ports
[bench_runner.cc:153] Total logger area 0.00585938G.
[bench_runner.cc:163] add RDMA store size 4.88281G.
[bench_runner.cc:172] [Mem] RDMA heap size 8.03902G.
[util.cc:164] huge page alloc failed!
[util.cc:164] huge page alloc failed!
[NOCC] Meet a segmentation fault!
stack trace:
./noccocc() [0x4b3bb8]
/lib64/libc.so.6 : ()+0x35270
/lib64/libc.so.6 : ()+0x8981d
./noccocc : MemDB::AddSchema(int, TABLE_CLASS, int, int, int, int, bool)+0x105
./noccocc : nocc::oltp::tpcc::TpccMainRunner::init_store(MemDB &)+0xe0
./noccocc : nocc::oltp::BenchRunner::run()+0x3d4
./noccocc : nocc::oltp::tpcc::TpccTest(int, char *)+0x143
./noccocc : main()+0x589
/lib64/libc.so.6 : __libc_start_main()+0xf5
./noccocc() [0x47813c]

Thanks!

@wxdwfc
Collaborator

wxdwfc commented Sep 7, 2018

Hi, could you share your hosts.xml contents and your compilation options (cmake -Dxxxx=xx) with me?
If possible, could you also give me your RDMA NIC information (i.e., the output of "ibstat")?
Thanks!

@Alchem-Lab
Author

The hosts.xml is like this:

<hosts>
  <!-- all reachable hosts -->
  <macs>
    <a>nerv1</a>
    <a>nerv2</a>
    <a>nerv3</a>
  </macs>
  <black>
    <a>nerv4</a>
  </black>
  <!-- The macs which are ignored -->
</hosts>

The compilation options are the same as in your README:
cmake -DUSE_RDMA=1 -DONE_SIDED_READ=1 -D ROCC_RBUF_SIZE_M=13240 -D RDMA_STORE_SIZE=5000 -DRDMA_CACHE=0 -DTX_LOG_STYLE=2 -DPA=0 .

The following is the result of running ibstat on machine nerv1:

CA 'mlx4_0'
CA type: MT4103
Number of ports: 2
Firmware version: 2.34.5000
Hardware version: 0
Node GUID: 0x248a070300e47ac0
System image GUID: 0x248a070300e47ac3
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 1
LMC: 0
SM lid: 1
Capability mask: 0x0251486a
Port GUID: 0x248a070300e47ac1
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0x248a070300e47ac2
Link layer: InfiniBand

ibstat for machine nerv2:

CA 'mlx4_0'
CA type: MT4103
Number of ports: 2
Firmware version: 2.40.7004
Hardware version: 0
Node GUID: 0x7cfe9003009e4ba0
System image GUID: 0x7cfe9003009e4ba3
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 14
LMC: 0
SM lid: 1
Capability mask: 0x02514868
Port GUID: 0x7cfe9003009e4ba1
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0x7cfe9003009e4ba2
Link layer: InfiniBand

ibstat for machine nerv3:

CA 'mlx4_0'
CA type: MT4103
Number of ports: 2
Firmware version: 2.40.7004
Hardware version: 0
Node GUID: 0x7cfe9003009e5160
System image GUID: 0x7cfe9003009e5163
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 5
LMC: 0
SM lid: 1
Capability mask: 0x02514868
Port GUID: 0x7cfe9003009e5161
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0x7cfe9003009e5162
Link layer: InfiniBand

I ran the Python script on nerv1, which is the first machine in hosts.xml.

Thanks!

@wxdwfc
Collaborator

wxdwfc commented Sep 7, 2018

Hi, thanks for the information.
First, please pull the latest code from the mainstream branch; it was missing some scripts and files. I'm sorry for that (we are still refining the code).
Second, it seems that you have one NIC per machine, so you may need to change the RWorker::choose_rnic_port() function and set the use_port_ variable to always be 0 (this uses the first RNIC on your server, while our servers use two NICs; see the sketch at the end of this comment).
Third, can you run the SmallBank benchmark? It seems that the segmentation fault happens during data loading, and SmallBank uses a much simpler store than TPC-C. You can replace tpcc with bank to run the SmallBank workload.
Thanks.
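
For illustration only, a minimal sketch of the suggested change. The struct below is a stand-in containing just the relevant field; the real RWorker class in rocc has more state, but the idea is simply to pin the selection to the first (and only) RNIC:

    // Hypothetical sketch -- not the actual rocc implementation.
    #include <cstdio>

    struct RWorker {          // stand-in for the real worker class
      int use_port_ = 0;      // index of the RNIC/port this worker uses

      void choose_rnic_port() {
        // On a single-NIC machine, always use the first RNIC instead of
        // choosing between two devices.
        use_port_ = 0;
      }
    };

    int main() {
      RWorker w;
      w.choose_rnic_port();
      std::printf("worker pinned to RNIC index %d\n", w.use_port_);
      return 0;
    }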

@Alchem-Lab
Author

Hi, thanks for the reply!

The segfault issue is solved.

But I still cannot run the code successfully; I am trying to debug it. This is my current output when running the script:

[chao@nerv1 scripts]$ ./run2.py config.xml noccocc "-t 24 -c 10 -r 100" bank 3
[START] Input parsing done.
[START] cleaning remaining processes.
ssh -n -f nerv2 "cd /home/chao/git_repos/rocc/scripts/ && rm *log"
ssh -n -f nerv2 "cd /home/chao/git_repos/rocc/scripts/ && ./noccocc --bench bank --txn-flags 1 --verbose --config config.xml --id 1 -t 24 -c 10 -r 100 -p 3 1>log 2>&1 &"
ssh -n -f nerv3 "cd /home/chao/git_repos/rocc/scripts/ && rm *log"
ssh -n -f nerv3 "cd /home/chao/git_repos/rocc/scripts/ && ./noccocc --bench bank --txn-flags 1 --verbose --config config.xml --id 2 -t 24 -c 10 -r 100 -p 3 1>log 2>&1 &"
cd /home/chao/git_repos/rocc/scripts/ && ./noccocc --bench bank --txn-flags 1 --verbose --config config.xml --id 0 -t 24 -c 10 -r 100 -p 3
NOCC started with program [noccocc]. at 08-09-2018 03:45:29
[bench_runner.cc:324] Use TCP port 8888
[bench_runner.cc:346] use scale factor: 72; with total 24 threads.
[view.h:48] Start with 0 backups.
[view.cc:10] total 3 backups to assign
[Bank]: check workload 25, 15, 15, 15, 15, 15
[util.cc:164] huge page alloc failed!
[librdma] get device name mlx4_0, idx 0
[librdma] : Device 0 has 1 ports
[bench_runner.cc:153] Total logger area 0.00585938G.
[bench_runner.cc:163] add RDMA store size 4.88281G.
[bench_runner.cc:172] [Mem] RDMA heap size 8.03905G.
[util.cc:164] huge page alloc failed!
[util.cc:164] huge page alloc failed!
[Bank], total 21600000 accounts loaded
[bank_main.cc:263] check cv balance 46280
[Runner] local db size: 661.754 MB
[Runner] Cache size: 0 MB
[bench_listener2.cc:64] try log results to ./results/noccocc_bank_3_24_10_100.log
[bench_listener2.cc:73] New monitor running!
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory
[librdma]: Connect Memory Region failed at dev 0, err Cannot allocate memory

This "Connect Memory Region failed error" is due to ibv_reg_mr() returns null. Do you have any idea why ibv_reg_mr() function can fail?

I appreciate your advice.
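
For reference, a common first check when ibv_reg_mr() fails with "Cannot allocate memory" is the process's locked-memory limit (ulimit -l); the NIC's own registration limits and the lack of reserved huge pages are another cause, which is what the reply below addresses. A standalone sketch (not part of rocc) that prints the current limit:

    // Hypothetical diagnostic, independent of rocc: print RLIMIT_MEMLOCK.
    // A small value (e.g. the common 64 KB default) is usually far too low
    // for RDMA memory registration; "unlimited" shows up as RLIM_INFINITY.
    #include <sys/resource.h>
    #include <cstdio>

    int main() {
      struct rlimit rl;
      if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        std::perror("getrlimit");
        return 1;
      }
      std::printf("RLIMIT_MEMLOCK: soft=%llu hard=%llu bytes\n",
                  (unsigned long long)rl.rlim_cur,
                  (unsigned long long)rl.rlim_max);
      return 0;
    }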

@wxdwfc
Collaborator

wxdwfc commented Sep 9, 2018

Hi, it seems that there are no 2MB huge pages available on your machine (this means more memory is needed to register the memory region on the NIC).
Can you allocate enough huge pages and then try again? You can use a command such as su -c "echo 10240 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages" for the allocation.

P.S.: If you cannot use huge pages, you can configure the RNIC according to this post https://community.mellanox.com/docs/DOC-1120 to allow the RNIC to register larger memory. But huge pages are recommended for better performance.
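
For context on the "huge page alloc failed!" lines in the logs above, here is a minimal sketch of the usual pattern for huge-page-backed allocation with a fallback to regular pages; it is an assumption about the general technique, not the actual util.cc code. Reserving 2MB pages via nr_hugepages (as in the command above) is what makes the first mmap succeed:

    // Hypothetical sketch of huge-page allocation with a plain-page fallback.
    #include <sys/mman.h>
    #include <cstdio>

    static void *alloc_region(size_t bytes) {
      // Try 2MB huge pages first; this fails with ENOMEM when no huge pages
      // have been reserved via /sys/.../nr_hugepages.
      void *p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
      if (p == MAP_FAILED) {
        std::fprintf(stderr, "huge page alloc failed, falling back to 4KB pages\n");
        p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      }
      return p == MAP_FAILED ? nullptr : p;
    }

    int main() {
      void *r = alloc_region(1ULL << 30);  // 1 GB, a multiple of the 2MB page size
      std::printf("allocated region at %p\n", r);
      return r != nullptr ? 0 : 1;
    }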

@Alchem-Lab
Author

I allocated enough huge page memory as you suggested, and the bank benchmark now works fine on three machines with 8 threads on each machine. Thanks for all the help :)
