This repository has been archived by the owner on Dec 20, 2022. It is now read-only.

[SparkRDMA] set up SPARK_LOCAL_IP for RoCE network #5

Closed
li7hui opened this issue Jan 12, 2018 · 7 comments

Comments

@li7hui

li7hui commented Jan 12, 2018

Hello,

I am testing SparkRDMA with a Mellanox ConnectX-4 Lx card. I installed Spark 2.2.0 and downloaded the SparkTeraSort sample code. The SparkTeraSort sample code runs successfully with Spark 2.2.0; however, when I run the TeraSort code with the SparkRDMA plugin, it throws the error shown in the following screenshot.
[screenshot: errors1]
Do I need to upgrade libibverbs.so, or do I need to configure the RDMA network for Spark? Please help.

@yuvaldeg
Contributor

Hi,
The mismatch in size usually doesn't result in failures, so I think the issue is with binding to the right RDMA device.
Is 172.31.101.104 your RDMA device IP address?
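
One way to answer this question on a Linux node is to look up which network interface carries a given address. The snippet below is only a sketch: `iface_for_ip` is a hypothetical helper name, and it assumes the one-line-per-address output format of `ip -o -4 addr show` from iproute2.

```shell
# Hypothetical helper: given an IPv4 address as $1 and the output of
# `ip -o -4 addr show` on stdin, print the name of the interface that owns it.
iface_for_ip() {
    # In the -o (one-line) format, field 2 is the interface name and
    # field 4 is the address with its prefix length, e.g. 192.168.1.12/24.
    awk -v ip="$1" '{ split($4, a, "/"); if (a[1] == ip) { print $2; exit } }'
}

# Example: which interface owns the address Spark is binding to?
# Prints nothing if no interface on this node has that address.
ip -o -4 addr show 2>/dev/null | iface_for_ip 172.31.101.104 || true
```

If the interface printed is not the RDMA-capable one, Spark is binding to the wrong network.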

@li7hui
Author

li7hui commented Jan 12, 2018

Hi,
172.31.101.104 is the internet IP address, and 10.10.10.104 is the intranet IP address for the RoCE network. 10.10.10.104 is already bound to mlx_bond_0. How should I let SparkRDMA know about this configuration? If I modify /etc/hosts to point to the 10.10.10.104 network, Spark will not work...

@yuvaldeg
Contributor

Since your hostname points to 172.31.101.104, Spark will bind to 172.31.101.104 by default, and SparkRDMA will follow.
One option to overcome this issue without changing your hosts file is to add these lines to your spark-env.sh:
export SPARK_MASTER_HOST=<THE RDMA IP ADDRESS OF YOUR MASTER NODE>
export SPARK_LOCAL_IP=$(/usr/sbin/ip addr show <THE NETWORK DEVICE NAME OF YOUR RDMA DEVICE> | grep "inet\b" | awk '{print $2}' | cut -d/ -f1)

For example, on my system the master node RDMA IP address is "192.168.1.12" and the RDMA network device name (as it appears in ifconfig) is "ens2", so this is how these lines look on my setup:
export SPARK_MASTER_HOST=192.168.1.12
export SPARK_LOCAL_IP=$(/usr/sbin/ip addr show ens2 | grep "inet\b" | awk '{print $2}' | cut -d/ -f1)

The above assumes you are running in standalone mode; let me know if you are running in a different mode.
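
A slightly more defensive variant of the spark-env.sh lines above can be sketched as follows. It is an illustration rather than anything SparkRDMA requires: `extract_ipv4` and `RDMA_IFACE` are hypothetical names, "ens2" is just the device name from the example, and the script assumes the `ip` tool from iproute2 is on the PATH. The idea is to leave SPARK_LOCAL_IP unset, with a warning, when the interface has no IPv4 address, instead of exporting an empty value.

```shell
# Hypothetical helper: read `ip addr show <device>` output on stdin and
# print the first IPv4 address without its prefix length.
extract_ipv4() {
    # Only "inet" lines are matched; "inet6" lines are skipped.
    awk '$1 == "inet" { split($2, a, "/"); print a[1]; exit }'
}

# Hypothetical variable: the RDMA network device name. "ens2" is only the
# example value used above; replace it with your own device name.
RDMA_IFACE="${RDMA_IFACE:-ens2}"

rdma_ip="$(ip addr show "$RDMA_IFACE" 2>/dev/null | extract_ipv4 || true)"

if [ -n "$rdma_ip" ]; then
    export SPARK_LOCAL_IP="$rdma_ip"
else
    echo "spark-env.sh: no IPv4 address found on $RDMA_IFACE; SPARK_LOCAL_IP not set" >&2
fi
```

SPARK_MASTER_HOST still has to be set separately to the RDMA IP address of the master node, as in the lines above.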

@li7hui
Author

li7hui commented Jan 14, 2018

@yuvaldeg
Hi, this is really helpful! I am testing this now.

@li7hui
Author

li7hui commented Jan 15, 2018

@yuvaldeg
Good news: after setting up SPARK_LOCAL_IP, I can now run TeraSort with SparkRDMA successfully. You can close this issue. Many thanks.

@li7hui li7hui changed the title from "[SparkRDMA] ibm.disni: sock_addr_in size mismatch, jverbs size 28, native size 16" to "[SparkRDMA] set up SPARK_LOCAL_IP for RoCE network" on Jan 15, 2018
@yuvaldeg
Contributor

Happy to help!
Please let us know if you stumble upon any other issues.

@tobegit3hub

SPARK_LOCAL_IP seems to work only in standalone mode.

How can we set the IP for workers if we submit Spark jobs in YARN cluster mode? @yuvaldeg
