
Problem with spark-nlp #995

Closed

m-developer96 opened this issue Aug 5, 2020 · 23 comments

@m-developer96

m-developer96 commented Aug 5, 2020

Hi!
I'm using this example to create my own sentiment classifier, but when I execute the code below, I get an error.

use = BertEmbeddings.load('/home/mahdi/workTable/dataset/bert/') \
                    .setInputCols(["document"])\
                    .setOutputCol("sentence_embeddings")\
                    .setPoolingLayer(-2)

I tested it with UniversalSentenceEncoder but got the same error.

The error:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGILL (0x4) at pc=0x00007fac59e78da9, pid=1736, tid=0x00007fad517fb700
#
# JRE version: OpenJDK Runtime Environment (8.0_252-b09) (build 1.8.0_252-8u252-b09-1~18.04-b09)
# Java VM: OpenJDK 64-Bit Server VM (25.252-b09 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libtensorflow_framework.so.1+0x744da9]  _GLOBAL__sub_I_loader.cc+0x99
#
# Core dump written. Default location: /home/mahdi/workTable/core or core.1736

At first I used standalone cluster mode with one master and 3 slaves, each with 4 GB of memory and 4 cores. Then I tried one master and one slave, each with 10 GB of memory and 6 cores, but I still got the same error.

My spark initialization:

import findspark
from pyspark import SparkConf
from pyspark.sql import SparkSession, SQLContext
import sparknlp

findspark.init()
conf = SparkConf()
conf.set("spark.driver.memory", "19g")
conf.set("spark.cores.max", "16")
conf.set("spark.executor.memory", "9700m")
conf.set("spark.executor.cores", "8")
conf.set("spark.executor.instances", "8")
conf.set("spark.rpc.message.maxSize","1024")
conf.set("spark.driver.extraJavaOptions","-Djava.io.tmpdir=/home/mahdi/workTable/temp/")
conf.set("spark.executor.extraJavaOptions","-Djava.io.tmpdir=/home/mahdi/workTable/temp/")


spark = SparkSession.builder.master("spark://172.18.16.74:7077").appName("Sentiment Analysis").config(conf=conf)\
                            .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.4")\
                            .getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)

print("Spark version : " ,spark.version)
print("Spark-NLP version : " ,sparknlp.version())
# Spark version :  2.4.5
# Spark-NLP version :  2.5.4

How can I fix it?

Thanks for your help :)

@maziyarpanahi
Member

What is /home/mahdi/workTable/dataset/bert/ ?

@maziyarpanahi
Member

Also, please complete the template we provide; we need that information in order to reproduce the problem and help.

@m-developer96
Author

m-developer96 commented Aug 5, 2020

@maziyarpanahi
That's the BERT (bert_base_cased) model, downloaded from here.
It contains:

- bert:
  - bert_tensorflow
  - fields
  - metadata

Unfortunately, I couldn't complete the template, and I'm stuck here.

@m-developer96
Author

m-developer96 commented Aug 5, 2020

@maziyarpanahi
I even reduced my dataset to 1,500 sentences but got the same error.

@maziyarpanahi
Member

Thanks. What is your Operating System (with distribution and version)?

This seems to be an issue with spark-submit; could you please provide the exact command you are running?

In the meantime please add these two configs to your Spark session:

spark.kryoserializer.buffer.max 1000M
spark.serializer org.apache.spark.serializer.KryoSerializer
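
For reference, a minimal sketch of setting these in PySpark, reusing the conf object from the original post (the setting names and values are the ones suggested above):

# Set the suggested serializer configs on the existing SparkConf
# before building the SparkSession with getOrCreate().
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryoserializer.buffer.max", "1000M")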

@m-developer96
Author

@maziyarpanahi
Thanks!
I'm using Ubuntu 18.04 and running this example in a Jupyter notebook.
I just added those two configs but still got the same error.

@maziyarpanahi
Member

Got it. So the cluster is up and running, and you just run that code inside a Jupyter notebook? Is there a way to share that core dump with us? (It seems something already installed may be conflicting with the C++ code in libtensorflow_framework.so.1.)

@m-developer96
Author

m-developer96 commented Aug 5, 2020

@maziyarpanahi
Yes.
I'm sorry, but I'm new to Ubuntu. How can I get the core dump?

@maziyarpanahi
Member

It says it was written here: /home/mahdi/workTable/core

@m-developer96
Author

@maziyarpanahi
Yes, thanks!
I read it with the file command (I don't know whether that's the right way) and got this result:

core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/lib/jvm/java-8-openjdk-amd64//bin/java -cp /home/mahdi/workTable/spark/con', real uid: 1000, effective uid: 1000, real gid: 1000, effective gid: 1000, execfn: '/usr/lib/jvm/java-8-openjdk-amd64//bin/java', platform: 'x86_64'

@maziyarpanahi
Member

maziyarpanahi commented Aug 5, 2020

OK, no worries.
A question about your standalone cluster: are all the nodes on dedicated machines? Is the spark-nlp PyPI package installed on all of them? Do they all have the same operating system? And do you have tensorflow installed in the same Python path configured for PySpark? (We don't need it, and it often causes conflicts.)
PS: please also check whether you have protobuf installed, and if so, which version.

@m-developer96
Author

m-developer96 commented Aug 5, 2020

@maziyarpanahi
Yes, all of the slave nodes are on dedicated machines, spark-nlp is installed from PyPI on all of them, they all have the same operating system (Ubuntu 18.04), and I didn't install tensorflow or protobuf on them.

@maziyarpanahi
Member

Thank you, but other PyPI packages can pull in either one of them as a dependency, so could you please check all the PyPI packages in that environment to be sure?

@m-developer96
Author

@maziyarpanahi
Yes, but I'm sorry, I didn't understand exactly. Do you mean that I should check pip on the nodes?
If so, I checked and upgraded pip on all of them: pip 9.0.1 from /usr/lib/python3/dist-packages (python 3.6)

@maziyarpanahi
Member

I meant: if you activate the same Python environment and run pip freeze, do you see any tensorflow or protobuf?
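
If it's easier than pip freeze, here is a rough Python sketch of the same check (my own illustration, not part of Spark NLP). Run it in the exact environment PySpark uses; note that protobuf imports as google.protobuf:

# Check whether tensorflow or protobuf can be imported in this environment.
import importlib.util

def installed(name):
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        return False

for name in ("tensorflow", "google.protobuf"):
    print(name, "installed:", installed(name))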

@m-developer96
Author

@maziyarpanahi
Oh, sorry.
I checked; neither of them is installed.

@maziyarpanahi
Member

Great, thank you very much. We are working on reproducing this and finding a workaround.

@m-developer96
Author

@maziyarpanahi
Thank you so much.

@maziyarpanahi
Member

@albertoandreottiATgmail It seems Ubuntu 18 has something that conflicts with TensorFlow, similar to this issue: tensorflow/tensorflow#24976

@m-developer96 I will try to reproduce this on a fresh, fully updated Ubuntu 18 today or tomorrow.

@albertoandreottiATgmail
Contributor

Hello @m-developer96, the signal the process is receiving is raised when the process runs an instruction that the current CPU architecture cannot handle.
Are all your nodes the same architecture? I have Ubuntu 18, and I use almost exactly the same OpenJDK as you.
Another possibility is that the binaries on your system are somehow corrupted. Can you try running the following,

jose@machine:~/.ivy2$ unzip ./cache/org.tensorflow/libtensorflow_jni/jars/libtensorflow_jni-1.15.0.jar org/tensorflow/native/linux-x86_64/libtensorflow_jni.so

jose@machine:~/.ivy2$ ldd org/tensorflow/native/linux-x86_64/libtensorflow_jni.so
linux-vdso.so.1 (0x00007fff100d8000)
libtensorflow_framework.so.1 => not found
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fe14d978000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fe14d5da000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fe14d3bb000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fe14d1b3000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fe14ce2a000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fe14cc12000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fe14c821000)
/lib64/ld-linux-x86-64.so.2 (0x00007fe156ee6000)

and post the paths you get on your system?
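
On the architecture question, here is a small sketch (my own illustration, assuming the nodes run Linux) that could be run on every node to compare CPU instruction-set flags; a SIGILL typically means the binary executed an instruction the CPU does not support:

# Print whether this CPU reports the instruction-set extensions that
# prebuilt TensorFlow binaries commonly rely on; run it on every node.
def cpu_flags(path="/proc/cpuinfo"):
    with open(path) as f:
        for line in f:
            # The "flags" line lists the extensions this CPU supports.
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
print({name: name in flags for name in ("sse4_1", "sse4_2", "avx", "avx2")})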

@rituparna-13

rituparna-13 commented Oct 15, 2020

I am facing the same issue when trying to train using BertEmbeddings from Spark NLP. I am using CentOS and spark-nlp version 2.6.2. Is this issue fixed, and is there a solution?
@maziyarpanahi @m-developer96 Were you able to solve this problem?

@maziyarpanahi
Member

@phoenix1391 Would you mind creating a new issue with the complete template so we can reproduce this? (The more info we have, the better the chance we can reproduce it, especially the OS and its version.) Unfortunately, there is no new update on this issue, but I am hoping that with your new issue we can reproduce it, or at least say what is not compatible.

@github-actions

This issue is stale because it has been open 120 days with no activity. Remove the stale label or comment, or it will be closed in 5 days.
