Version Compatibility between sparkNLP 2.5.3 and spark 2.3.x #963

Closed
akash166d opened this issue Jul 6, 2020 · 28 comments

@akash166d

akash166d commented Jul 6, 2020

  • Apache Spark version 2.3.2.3.1.5.0-152
  • Spark NLP version 1.7.3
  • Apache Spark setup (OS, Docker, Jupyter, Zeppelin, Cloudera, Databricks, EMR, etc.): Cloudera
  • How did you install Spark NLP:
    Quoting the IT team: “we don't install packages from source because doing so would not allow us to pass a umask value to the package during installation, which would make it importable only by the root user;
    so we install via pip, specifically using the pip module in Ansible, in order to pass the needed umask value”
  • Java version : 1.8.0_121
  • Python/Scala version : Python 3.6.5
  • Does anything else work in Apache Spark and only the Spark NLP related part fails? Not sure; I'm working on Linux and assume it is connected to a Hadoop cluster that lets me code on Spark.

Code Snippet:

import os
import sys
sys.path.append('../../')

print(sys.version)

from sparknlp.pretrained import ResourceDownloader
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import *

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline

spark = SparkSession.builder \
    .appName("ner")\
    .master("local[*]")\
    .config("spark.driver.memory","4G")\
    .config("spark.driver.maxResultSize", "2G")\
    .config("spark.driver.extraClassPath", "/hadoop/anaconda3.6/lib/python3.6/site-packages/sparknlp/lib/sparknlp.jar")\
    .config("spark.kryoserializer.buffer.max", "500m")\
    .getOrCreate()

downloader = ResourceDownloader()


l = [(1,'Thanks for calling to ESI'),(2,'How can i help you'),(3,'Please reach out to us on mail')]

data = spark.createDataFrame(l, ['docID','text'])

#Working fine
document_assembler = DocumentAssembler() \
    .setInputCol("text")

#Working fine
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

#Working fine
tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

#Working fine
lemma = LemmatizerModel.load("/user/elxxx/emma_mod").setInputCols(["token"]).setOutputCol("lemma")

#Working fine
pos = PerceptronModel.load("/user/elxxx/pos_anc_mod/").setInputCols(["document","token"]).setOutputCol("pos")

#Working fine
nor_sweet = NorvigSweetingModel.load("/user/elxxx/spell_nor_mod").setInputCols(["token"]).setOutputCol("corrected")

#Working fine
sent_viv = ViveknSentimentModel.load("/user/elxxx/sent_vivek_mod").setInputCols(["sentence","token"]).setOutputCol("sentiment")


# Error: WordEmbeddingsModel not defined
embed = WordEmbeddingsModel.load("/user/elxxx/wordEmbedMod").setStoragePath("/user/elxxx/wordEmbedMod/glove.6B.100d.txt", "TEXT")\
      .setDimension(100)\
      .setStorageRef("glove_100d") \
      .setInputCols("document", "token") \
      .setOutputCol("embeddings")

#Similar issue with other modules
#Error: BertEmbeddingsModel not defined
#bert = BertEmbeddings.load("/user/elxxx/bert").setInputCols("sentence", "token").setOutputCol("bert")

We replaced the previous sparknlp.jar with the fat JAR newly provided by @maziyarpanahi (and renamed it to sparknlp.jar). It seems to have had some conflict with a Jackson JAR, which might be the reason Spark crashed.

Could you help us configure Spark NLP for our version of Spark, given that there are JAR files that support this compatibility? Happy to fill you in with more details if needed.

@maziyarpanahi maziyarpanahi self-assigned this Jul 7, 2020
@maziyarpanahi
Member

maziyarpanahi commented Jul 7, 2020

Hi,

Just as a test, I would like to see the same demo with the following changes:

# the correct imports to pull everything in
# you shouldn't import annotator by annotator, maybe that is why the others are missing

from sparknlp.base import *
from sparknlp.annotator import *

from sparknlp.pretrained import PretrainedPipeline
import sparknlp

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline

# I don't know what this jar is: /hadoop/anaconda3.6/lib/python3.6/site-packages/sparknlp/lib/sparknlp.jar
# and why is it in the site-packages, unless it's the same JAR I gave you and it was renamed and placed here for some reason
# if you are using the Fat Jar I gave you, let's do it this way
# Fat JAR: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-assembly-spark23-2.5.2.jar
# the session says local[*] so it's not in the cluster and can use the local fileSystem like ./tmp, but it can be anywhere

spark = SparkSession.builder \
.appName("ner") \
.master("local[*]") \
.config("spark.driver.memory","16G") \
.config("spark.driver.maxResultSize", "2G") \
.config("spark.jars", "./tmp/spark-nlp-assembly-spark23-2.5.2.jar") \
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
.config("spark.kryoserializer.buffer.max", "1000m") \
.getOrCreate()


# I don't know what this is or what it's for, but it's not required
# downloader = ResourceDownloader()

# the versions
sparknlp.version()
spark.version

# and the rest

If the PyPI package for the 2.5.2 version is correctly installed (pip install spark-nlp==2.5.2), or installed via Ansible in offline mode by using the Py Wheel from PyPI, the code I corrected should work or at least produce some interesting results.

Let's see how the changes go first 😊

PS: If you have downloaded a WordEmbeddingsModel and are trying to load it, then you don't need to set the GloVe file manually or even setStorageRef; they all come with the model. If you want to train it with custom word2vec or GloVe vectors, then you need to use WordEmbeddings(). There are examples in our spark-nlp-workshop.
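
For example, a rough sketch of the two cases (paths and names are just the placeholders from your snippet):

# 1) Loading an already saved/downloaded WordEmbeddingsModel: the vectors,
#    dimension and storageRef are packaged with the model, nothing else to set.
embed = WordEmbeddingsModel.load("/user/elxxx/wordEmbedMod") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

# 2) Using your own GloVe or word2vec text file: this is where you point to
#    the vectors yourself and set the dimension.
custom_embed = WordEmbeddings() \
    .setStoragePath("/user/elxxx/wordEmbedMod/glove.6B.100d.txt", "TEXT") \
    .setDimension(100) \
    .setStorageRef("glove_100d") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")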

PS2: Please copy the full error stack/logs for the Jackson JAR crash so we can read them and see what happened.

@akash166d
Author

Thanks a lot for such a detailed description. I'm super excited to try out the new snippet you provided. Currently we are facing some issues launching a Spark session in our dev environment and IT is trying to fix the bug. Meanwhile, I have a couple of very short, sweet and simple questions :)

  1. The fat JAR file you provided us, spark-nlp-assembly-spark23-2.5.2.jar: can we keep it anywhere (as long as Spark can access the location)? I previously assumed it was analogous to the sparknlp.jar file and that we should replace and rename it. From what you mentioned it looks like we should not touch sparknlp.jar? Also, since we cannot use an online path, we will use our cluster/local path.

  2. I might reinstall Spark NLP because the system is currently crashed and we are not sure what the issue might be. So as a next step I will simply pip install spark-nlp and then use the fat JAR you provided in SparkSession.builder, and that's it, right?

@maziyarpanahi
Member

You are welcome

  • The fat JAR file you provided us, spark-nlp-assembly-spark23-2.5.2.jar: can we keep it anywhere (as long as Spark can access the location)? I previously assumed it was analogous to the sparknlp.jar file and that we should replace and rename it. From what you mentioned it looks like we should not touch sparknlp.jar? Also, since we cannot use an online path, we will use our cluster/local path.

For a local setup such as your example, the JAR should be accessible via the local file system. Assuming you have PySpark/Spark 2.3.x (or do pip install pyspark==2.3.4), you can run the code I gave you with the JAR at an accessible path, and it will start a SparkSession for Apache Spark 2.3.x with a Spark NLP build compatible with Apache Spark 2.3.x.
That being said, once you are on the cluster, via pyspark --jars, in Zeppelin, or even a manual SparkSession but with YARN as the master, the JAR must be on HDFS. That way all the machines/executors have access to hdfs:///jars/spark/spark-nlp-assembly-spark23-2.5.2.jar, and nothing else needs to be changed in the session besides the other configs related to your cluster, such as a queue, application name, etc.
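
As a rough sketch of what that looks like on YARN (the HDFS path here is just an example):

spark = SparkSession.builder \
    .appName("ner") \
    .master("yarn") \
    .config("spark.driver.memory", "16G") \
    .config("spark.driver.maxResultSize", "2G") \
    .config("spark.jars", "hdfs:///jars/spark/spark-nlp-assembly-spark23-2.5.2.jar") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "1000m") \
    .getOrCreate()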

  • I might reinstall Spark NLP because the system is currently crashed and we are not sure what the issue might be. So as a next step I will simply pip install spark-nlp and then use the fat JAR you provided in SparkSession.builder, and that's it, right?

All you need to do is pip install spark-nlp==2.5.2 pyspark==2.3.4; if PySpark is already present and you are sure about it, then pip install spark-nlp==2.5.2 alone is enough. This should happen on all the machines in the Apache Spark cluster (workers), either via distributed Conda, manually via an Ansible command, or, if there is no internet, by downloading the Py Wheel file and doing pip install with the wheel on all the machines.
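
For the no-internet case, roughly something like this (the exact wheel filename may differ):

# on a machine with internet access
pip download spark-nlp==2.5.2 --no-deps -d .

# copy the downloaded .whl to every machine in the cluster, then on each one:
pip install spark_nlp-2.5.2-py2.py3-none-any.whl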

@akash166d
Author

Hi @maziyarpanahi

After reinstalling, I used the code you provided. Please find the full code below (with a few minor changes).


from sparknlp.base import *
from sparknlp.annotator import *

# Error with this. Cannot import PretrainedPipeline.
from sparknlp.pretrained import PretrainedPipeline
import sparknlp

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline

I updated .master to "yarn" as I wanted to run this on the cluster and not locally.

spark = SparkSession.builder \
    .appName("ner") \
    .master("yarn") \
    .config("spark.driver.memory","8G") \
    .config("spark.driver.maxResultSize", "2G") \
    .config("spark.jars", "/user/el7150/spark-nlp-assembly-spark23-2.5.2.jar") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "500m") \
    .getOrCreate()

l = [(1,'Thanks for calling to ESI'),(2,'How can i help you'),(3,'Please reach out to us on mail')]

data = spark.createDataFrame(l, ['docID','text'])

document_assembler = DocumentAssembler() \
    .setInputCol("text")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

lemma = LemmatizerModel.load("/user/xxxxx/lemma_mod").setInputCols(["token"]).setOutputCol("lemma")

This was working fine earlier. I've enclosed the error log below. Do I need to download a new file for this model?

pos = PerceptronModel.load("/user/xxxxx/pos_anc_mod/").setInputCols(["document","token"]).setOutputCol("pos")

nor_sweet = NorvigSweetingModel.load("/user/xxxx/spell_nor_mod").setInputCols(["token"]).setOutputCol("corrected")

sent_viv = ViveknSentimentModel.load("/user/xxxxx/sent_vivek_mod").setInputCols(["sentence","token"]).setOutputCol("sentiment")

This is throwing an error. Please find the error log details below.

embeddings = WordEmbeddingsModel.load("/user/xxxxxx/wrd_embd_glove_mod").setInputCols("document", "token").setOutputCol("embeddings")


Could you please help me debug this issue? I can't find anything related to it on Slack or in the GitHub issues.

POS Error:

20/07/08 16:01:45 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 14, ch3dr028355.express-scripts.com, executor 1): java.lang.ClassNotFoundException: com.johnsnowlabs.nlp.annotators.pos.perceptron.AveragedPerceptron
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1819)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
at java.io.ObjectInputStream.readClass(ObjectInputStream.java:1678)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

20/07/08 16:01:45 ERROR TaskSetManager: Task 0 in stage 3.0 failed 4 times; aborting job
Traceback (most recent call last):
File "", line 1, in
File "/usr/hdp/current/spark2-client/python/pyspark/ml/util.py", line 311, in load
return cls.read().load(path)
File "/usr/hdp/current/spark2-client/python/pyspark/ml/util.py", line 249, in load
java_obj = self._jread.load(path)
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call
File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o143.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 17, ch3dr028353.express-scripts.com, executor 4): java.lang.ClassNotFoundException: com.johnsnowlabs.nlp.annotators.pos.perceptron.AveragedPerceptron
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1819)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
at java.io.ObjectInputStream.readClass(ObjectInputStream.java:1678)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

WordEmbeddings Error

embeddings = WordEmbeddingsModel.load("/user/el7150/wrd_embd_glove_mod").setInputCols("document", "token").setOutputCol("embeddings")
Traceback (most recent call last):
File "", line 1, in
File "/usr/hdp/current/spark2-client/python/pyspark/ml/util.py", line 311, in load
return cls.read().load(path)
File "/usr/hdp/current/spark2-client/python/pyspark/ml/util.py", line 249, in load
java_obj = self._jread.load(path)
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call
File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o251.load.
: java.lang.UnsatisfiedLinkError: /tmp/librocksdbjni254535277396301076.so: /tmp/librocksdbjni254535277396301076.so: failed to map segment from shared object: Operation not permitted
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
at java.lang.Runtime.load0(Runtime.java:809)
at java.lang.System.load(System.java:1086)
at org.rocksdb.NativeLibraryLoader.loadLibraryFromJar(NativeLibraryLoader.java:79)
at org.rocksdb.NativeLibraryLoader.loadLibrary(NativeLibraryLoader.java:57)
at org.rocksdb.RocksDB.loadLibrary(RocksDB.java:63)
at org.rocksdb.RocksDB.(RocksDB.java:34)
at com.johnsnowlabs.storage.RocksDBConnection.(RocksDBConnection.scala:11)
at com.johnsnowlabs.storage.RocksDBConnection$.getOrCreate(RocksDBConnection.scala:102)
at com.johnsnowlabs.storage.StorageHelper$.load(StorageHelper.scala:29)
at com.johnsnowlabs.storage.HasStorageModel$$anonfun$deserializeStorage$1.apply(HasStorageModel.scala:27)
at com.johnsnowlabs.storage.HasStorageModel$$anonfun$deserializeStorage$1.apply(HasStorageModel.scala:26)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at com.johnsnowlabs.storage.HasStorageModel$class.deserializeStorage(HasStorageModel.scala:26)
at com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel.deserializeStorage(WordEmbeddingsModel.scala:32)
at com.johnsnowlabs.storage.StorageReadable$class.readStorage(StorageReadable.scala:23)
at com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel$.readStorage(WordEmbeddingsModel.scala:156)
at com.johnsnowlabs.storage.StorageReadable$$anonfun$1.apply(StorageReadable.scala:26)
at com.johnsnowlabs.storage.StorageReadable$$anonfun$1.apply(StorageReadable.scala:26)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable$$anonfun$com$johnsnowlabs$nlp$ParamsAndFeaturesReadable$$onRead$1.apply(ParamsAndFeaturesReadable.scala:31)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable$$anonfun$com$johnsnowlabs$nlp$ParamsAndFeaturesReadable$$onRead$1.apply(ParamsAndFeaturesReadable.scala:30)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable$class.com$johnsnowlabs$nlp$ParamsAndFeaturesReadable$$onRead(ParamsAndFeaturesReadable.scala:30)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable$$anonfun$read$1.apply(ParamsAndFeaturesReadable.scala:41)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable$$anonfun$read$1.apply(ParamsAndFeaturesReadable.scala:41)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:19)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:8)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:745)

@maziyarpanahi
Member

maziyarpanahi commented Jul 8, 2020

@akash166d So from now on we will assume this is a YARN cluster.

Let's forget about any code and error for now. You have PyPI issues right at the beginning and without solving that the SparkSession and the JAR are pointless.

If you have run pip install spark-nlp==2.5.2 correctly on all the machines, in the right Python environment that is defined and known to Apache Spark, you should be able to run the following code. If there is an error, then either you haven't installed the PyPI package, it isn't installed on all the machines, or you haven't correctly set the Python environment for YARN/PySpark:

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp

sparknlp.version()

The PyPI installation and successfully running this snippet are not related to the Apache Spark version, so the IT team should be able to manage this correctly, the same way they do for any other third-party PyPI package.

NOTE: There is another, less likely but still possible, explanation: that our PyPI package is not compatible with Apache Spark 2.3.x and I need to give you a Py Wheel to install instead. (I am not sure about this, so I am going to test and ask @saif-ellafi.)

@akash166d
Author

akash166d commented Jul 9, 2020

Hi @maziyarpanahi @saif-ellafi

Thanks for helping us out. I rechecked with IT and they said, and I quote:

'When he says “If you have installed ‘pip install spark-nlp==2.5.2’ on all the machines correctly in the right Python ENV which is defined and known to apache spark..” I’m assuming we’re defining /hadoop/anaconda3.6//bin/python (or something along those lines) as the python environment for Apache Spark? Or does spark have its own separate python installation (somewhere in a spark directory) where it pulls packages from? I’ve worked on some hadoop components that behave like this and have a whole separate but contained (not a venv but effectively the same thing) python installation so I just want to make sure we’re not going about getting these packages recognized by spark in the wrong way. '

Let me know if the above statement makes sense. Meanwhile, we are checking whether there is a separate Python environment (somewhat along the lines of a venv, maybe), but for now, would it be possible for you to provide the updated Py Wheel file you mentioned earlier?

Also, I just wanted to check on one additional thing. Do you think the error when importing PretrainedPipeline is directly related to the messy PyPI installation, or is it a version compatibility issue?

Once again, I really owe you guys a lot for helping me on this.

PS: He also provided details on how he uninstalled and installed the package, just in case it gives more clarity:

'Technically we don’t uninstall/install packages directly on our servers, we copy them over from a cloned sandbox environment (this way we can keep multiple versions of the same package on the server as a roll back option if a new package update breaks something). When Akash asked me to uninstall the package, all I did was remove the package directory and then copy it back from the cloned sandbox environment. If they would like us to directly install it on the dev servers that’s fine, I’m just worried about doing this when we get to prod. not that we can’t do it this way on prod, there’s just a bit more at stake since we have 24/7 applications that run on those servers so there will be a lot more prep and it might take a bit longer to get those packages installed (likely will have to schedule a formal CHG, compared to just using CTASK for copying files). '

@maziyarpanahi
Member

Hi @akash166d

Yes, if you cannot successfully run the few imports I gave you, then it's absolutely a PyPI package issue. I have never personally cloned a PyPI package to another system. As a fallback for a bad upgrade, you can simply run an Ansible command to downgrade to the previous version easily. At least in our case, the spark-nlp PyPI package is just a mapping to the Scala/Java side and doesn't mess with other packages, unlike some of those heavy PyPI packages.

If you guys can just install it with pip install, we can first be sure the PyPI package is visible to Apache Spark/SparkSession, and second, once those imports pass successfully, we can confirm the JAR compatibility.

@maziyarpanahi
Member

So it's definitely the way the PyPI package is being installed/managed. I did a test with pip install spark-nlp==2.5.3 pyspark==2.3.0 and the same JAR I gave you:

[screenshot of the successful test]

As I guessed, our PyPI package is compatible with both Apache Spark 2.3.x and Apache Spark 2.4.x, so there is no need for a separate Py Wheel. You can just use the one on PyPI.

@akash166d
Author

Thanks @maziyarpanahi, we tried installing Spark NLP via pip install but the issue still persists. I'm still getting the error: ImportError: cannot import name 'PretrainedPipeline'. @kce6 can help us with details regarding the installation steps, so that you have more clarity.

@kce6

kce6 commented Jul 10, 2020

Hi,

Ken from the Linux admin team here.

Here is the playbook we used to install the package:


- hosts: all
  user: root

  tasks:
    - name: Install Pkgs
      pip:
        name:
          - spark-nlp==2.5.3
        state: forcereinstall
        umask: '022'
        executable: /hadoop/anaconda3.6/bin/pip3

We have installed countless packages with this playbook and have never had any issues, so I'm not really sure the installation method is the issue here.

@maziyarpanahi
Member

Thanks @kce6

Are those PyPI packages usable from PySpark? (The installation might be correct, but if it's not configured for Apache Spark, or if PySpark is executed in the wrong Python environment, it won't find the installed PyPI packages.)
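
As a rough illustration of what I mean (your interpreter path may differ), the driver and the executors should all point to the Python environment that has spark-nlp installed, for example:

export PYSPARK_PYTHON=/hadoop/anaconda3.6/bin/python
export PYSPARK_DRIVER_PYTHON=/hadoop/anaconda3.6/bin/python

# or as Spark configs when launching
pyspark --master yarn \
  --conf spark.pyspark.python=/hadoop/anaconda3.6/bin/python \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/hadoop/anaconda3.6/bin/python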

@akash166d
Author

Hi @maziyarpanahi, we checked and there are multiple Python environments (py2 and Anaconda), and Spark also has its own Python installation. However, the Spark Python installation is pointing towards the Anaconda package directory.

Given this info, are there any suggested next steps we can take from here?

@kce6 Please let me know if I've missed any details here that could help @maziyarpanahi.

@maziyarpanahi
Member

Hi @akash166d

Unfortunately, setting those configs is out of our hands, as we have neither knowledge of nor control over the cluster. However, when a PySpark session is available, before dealing with Spark NLP, I suggest you check the environment you are in and try to adjust it either on the Apache Spark side or in where/how you are starting PySpark (maybe the default configs are not correctly distributed by ZooKeeper and your machine points to something else):

import sys
import os

print(sys.version)
spark.version
print("####")
print(sys.path)
print(os.environ['PYSPARK_PYTHON'])

@akash166d
Author

akash166d commented Jul 14, 2020

@maziyarpanahi
I'm enclosing below the output from the code you mentioned above. It looks like the environment variables are correctly referenced. Please let me know if I'm missing anything.

Launching Pyspark:

export PYSPARK_PYTHON=/hadoop/anaconda3.6/bin/python

pyspark --num-executors 55 --executor-cores 4  --executor-memory 26G --driver-memory 20G --conf spark.sql.shuffle.partitions=7500 --conf spark.driver.maxResultsSize=0 --conf spark.default.parallelism=100 --master=yarn --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152.jar   --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.5.0-152.zip

Output

import sys
import os

print(sys.version)
3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56)
[GCC 7.2.0]
spark.version
'2.3.2.3.1.5.0-152'
print("####")
####
print(sys.path)
['', '/tmp/spark-9df21c1d-a62e-42a0-b871-3eb9099286a6/userFiles-b28c107f-41b0-4f1d-b104-8accd03cec47/pyspark_hwc-1.0.0.3.1.5.0-152.zip', '/tmp/spark-9df21c1d-a62e-42a0-b871-3eb9099286a6/userFiles-b28c107f-41b0-4f1d-b104-8accd03cec47', '/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip', '/usr/hdp/current/spark2-client/python', '/opt/site/lib/python', '/hadoop/anaconda3.6/lib/python36.zip', '/hadoop/anaconda3.6/lib/python3.6', '/hadoop/anaconda3.6/lib/python3.6/lib-dynload', '/hadoop/anaconda3.6/lib/python3.6/site-packages']
print(os.environ['PYSPARK_PYTHON'])
/hadoop/anaconda3.6/bin/python

I have one additional question: when I do

# this throws an error
from sparknlp.pretrained import PretrainedPipline
# but this doesn't
from sparknlp.pretrained import *

Do I need to update the API as mentioned in this link: https://github.com/JohnSnowLabs/spark-nlp/blob/master/python/sparknlp/pretrained.py

@maziyarpanahi
Member

In your pyspark command I don't see --jars pointing to the fat JAR from Spark NLP that I gave you. Could you please also put that JAR on HDFS, mention it in --jars, and try again?

The from sparknlp.pretrained import PretrainedPipeline should be fine; let's keep testing it alongside the import that already succeeds. But you really need that JAR for the rest as well. Let's do one more test with all the code and the same command, but with the fat JAR of Spark NLP for 2.3.x.
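
Roughly, something like this (keep your existing options; the point is that the Spark NLP JAR is on HDFS and is part of the same comma-separated --jars list):

hdfs dfs -put spark-nlp-assembly-spark23-2.5.2.jar /user/el7150/

pyspark --master yarn \
  --jars hdfs:///user/el7150/spark-nlp-assembly-spark23-2.5.2.jar,/usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152.jar \
  --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.5.0-152.zip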

@akash166d
Author

akash166d commented Jul 15, 2020

Hi @maziyarpanahi

I tried placing --jars in my pyspark command like this

pyspark --jars /user/el7150/spark-nlp-assembly-spark23-2.5.2.jar --num-executors 55 --executor-cores 4 --executor-memory 26G --driver-memory 20G --conf spark.sql.shuffle.partitions=7500 --conf spark.driver.maxResultsSize=0 --conf spark.default.parallelism=100 --master=yarn --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152.jar --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.5.0-152.zip

However, the issue still persists with the same error: ClassNotFoundException.
One thing I noticed in the error log (please refer to the previous message in the thread for the detailed log) is that it throws the exception on
ClassNotFoundException: com.johnsnowlabs.nlp.annotators.pos.perceptron.AveragedPerceptron

Similar to this: https://github.com//issues/203 @saif-ellafi

I was going through the perceptron module and was not able to find AveragedPerceptron. Do you think the issue might be an incompatibility between the Spark NLP JAR and the perceptron model files I have? Please let me know if you think I should use an updated JAR for Spark NLP 2.5.3.

Please help me with the next steps I can try from here in debugging this.
Thanks!

@akash166d
Author

akash166d commented Jul 15, 2020

@maziyarpanahi I tried looking for the classes we have in Spark NLP in the following ways:

>>> [eval("sparknlp." + objname) for objname in dir(sparknlp) if type(eval("sparknlp." + objname)) is type]
[<class 'sparknlp.base.Chunk2Doc'>, <class 'sparknlp.base.Doc2Chunk'>, <class 'sparknlp.base.DocumentAssembler'>, <class 'sparknlp.base.Finisher'>, <class 'pyspark.sql.session.SparkSession'>, <class 'sparknlp.base.TokenAssembler'>] 
import sparknlp.pretrained as pt 
>>> [eval("sparknlp." + objname) for objname in dir(pt) if type(eval("sparknlp." + objname)) is type]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "<string>", line 1, in <module>
AttributeError: module 'sparknlp' has no attribute 'AlbertEmbeddings' 

Do you think this can help us decide next steps in troubleshooting from here?

@akash166d akash166d changed the title Version Compatibility between sparkNLP 2.5.2 and spark 2.3.x Version Compatibility between sparkNLP 2.5.3 and spark 2.3.x Jul 15, 2020
@maziyarpanahi
Member

@akash166d

OK, let's summarize this. Let's start the PySpark shell, with the same JAR you have, and please show me the results for the following code:

from sparknlp.base import *
from sparknlp.annotator import *
import sparknlp

sparknlp.version()

Then please also run this:

from sparknlp.pretrained import PretrainedPipeline

If the error persists for PretrainedPipeline, then it's neither the JAR nor the PySpark version; it's the PyPI package, either not properly installed or somehow missing some classes.

Let's focus on this and not the models/pipelines. We are working on making the next release 100% compatible with Apache Spark 2.3.x, meaning no more offline .load(); models can be downloaded directly. Let's first fix this part; then, once the next release is out and this is fixed, we can continue with the models/pipelines.
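
As a quick sanity check on each node (commands are illustrative, using the pip/python paths from your playbook):

/hadoop/anaconda3.6/bin/pip3 show spark-nlp
/hadoop/anaconda3.6/bin/python -c "import sparknlp; print(sparknlp.__file__); print(sparknlp.version())"
/hadoop/anaconda3.6/bin/python -c "from sparknlp.pretrained import PretrainedPipeline; print('ok')"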

@akash166d
Author

Hi @maziyarpanahi, I have some good news with respect to PretrainedPipeline. There's no error now when importing PretrainedPipeline; it got sorted after we installed Spark NLP across all the servers. Earlier it was only installed on the edge node. So that's great news. Thanks for your support; we are making progress.

Now, there's still an error when loading PerceptronModel. The code snippet is the same as shared earlier. I'm enclosing the new error below for more context.

#Launching Pyspark from dev
export PYSPARK_PYTHON=/hadoop/anaconda3.6/bin/python

pyspark  --jars /user/el7150/spark-nlp-assembly-spark23-2.5.2.jar  --num-executors 55 --executor-cores 4  --executor-memory 26G --driver-memory 20G --conf spark.sql.shuffle.partitions=7500 --conf spark.driver.maxResultsSize=0 --conf spark.default.parallelism=100 --master=yarn --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152.jar   --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.5.0-152.zip

#inside pyspark env

import sparknlp

from sparknlp.base import *
from sparknlp.annotator import *

from sparknlp.pretrained import PretrainedPipeline

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline

spark = SparkSession.builder\
.appName("ner")\
.master("yarn")\
.config("spark.driver.memory","8G")\
.config("spark.driver.maxResultSize", "2G")\
.config("spark.jars", "/user/el7150/spark-nlp-assembly-spark23-2.5.2.jar")\
.config('spark.jars.packages', "/user/el7150/spark-nlp-assembly-spark23-2.5.2.jar")\
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
.config("spark.kryoserializer.buffer.max", "500m")\
.getOrCreate()


sparknlp.version()
spark.version

#Created Dummy data for testing
l = [(1,'Thanks for calling to ESI'),(2,'How can i help you'),(3,'Please reach out to us on mail')]

data = spark.createDataFrame(l, ['docID','text'])

# testing models and annotators
document_assembler = DocumentAssembler() \
    .setInputCol("text")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

lemma = LemmatizerModel.load("/user/el7150/lemma_mod").setInputCols(["token"]).setOutputCol("lemma")

pos = PerceptronModel.load("/user/el7150/pos_anc_mod/").setInputCols(["document","token"]).setOutputCol("pos")
# Error

pos = PerceptronModel.load("/user/el7150/pos_anc_mod/").setInputCols(["document","token"]).setOutputCol("pos")
20/07/17 15:40:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 14, ch3dr028356.express-scripts.com, executor 5): java.lang.ClassNotFoundException: com.johnsnowlabs.nlp.annotators.pos.perceptron.AveragedPerceptron
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1819)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
        at java.io.ObjectInputStream.readClass(ObjectInputStream.java:1678)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1518)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
        at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

@maziyarpanahi
Member

That's great news! Could you please show me the results of these two:

sparknlp.version()
spark.version

Also, where is this path /user/el7150/pos_anc_mod/? And could you please tell me which model you downloaded (the link)? I will test it locally.

@maziyarpanahi
Member

Just as an FYI: https://github.com/JohnSnowLabs/spark-nlp/releases/tag/2.5.4

@akash166d
Author

akash166d commented Jul 20, 2020

Wow!! Thanks :D

@maziyarpanahi We do not need to reference the fat JAR file you gave us earlier, right? Just confirming.

@akash166d
Author

Hi @maziyarpanahi ,

Just wanted to inform you of a few major developments on our side. I was able to run WordEmbeddings, POS, and all the other models successfully on Spark NLP 2.5.2 using the fat JAR you provided. We are now planning to install the new version (2.5.4) to check compatibility with Spark 2.3. It looks like every piece of the puzzle is now sorted. The following are the changes that made it work (for future reference, in case this happens to anyone else):

  • The most important change we made was pointing the JVM temp directory (java.io.tmpdir) to /hadoop/tmp. /tmp is not an executable path on our system, which was causing the RocksDB "failed to map segment" issue.

  • Along with the Spark NLP JAR, we also provided the path to the RocksDB JAR. This solved our issues with word embeddings and other such modules.

Below is the command we are now using to launch PySpark.

export PYSPARK_PYTHON=/hadoop/anaconda3.6/bin/python
export _JAVA_OPTIONS=-Djava.io.tmpdir=/hadoop/tmp

pyspark --num-executors 1 --executor-cores 1  --executor-memory 6G --driver-memory 6G --conf spark.sql.shuffle.partitions=500 --conf spark.driver.maxResultsSize=0 --conf spark.default.parallelism=10 --master=yarn --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152.jar,/hadoop/tmp/spark-nlp.jar,/hadoop/tmp/rocksdbjni-6.5.3.jar --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.5.0-152.zip

I would like to take this opportunity to thank you for all your support so far. I'll wait a couple of days until we get the new version installed and tested, and then I will mark the thread as resolved and close it.

Best,
AD

@maziyarpanahi
Member

Hi @akash166d

I am really happy to hear that, and I appreciate even more that you shared your steps with the community.

If you don't mind, I would like to keep this issue open until you have successfully tested 2.5.4 with .pretrained() instead of .load(), even though I remember you guys wanted an offline/air-gapped setup.

@akash166d
Author

@maziyarpanahi Yeah, sure, that's even better!!

In parallel, I was testing the pretrained pipelines (offline) on my current version, Spark NLP 2.5.2.
I checked explain_document_ml, analyze_sentiment, and clean_slang. All worked perfectly fine, but the pipelines that have deep learning dependencies throw a TensorFlow-related error. Please find the code and error output below. Can you point me in a direction to troubleshoot this?

#Working fine
exp_pip = PipelineModel.load("/user/el7150/exp_doc_ml_pip/")
annotation = exp_pip.transform(data)
annotation.show()

anl_sent = PipelineModel.load("/user/el7150/analz_senti_pip")
annotation = anl_sent.transform(data)
annotation.show()

cln_slang_pip = PipelineModel.load("/user/el7150/clean_slang_pip")
annotation = cln_slang_pip.transform(data)
annotation.show()
# Error

chk_spell_pip = PipelineModel.load("/user/el7150/check_spelling_dl_pip")
annotation = chk_spell_pip.transform(data)
annotation.show()

chk_spell_pip = PipelineModel.load("/user/el7150/check_spelling_dl_pip")
2020-07-21 11:44:38.823603: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-21 11:44:38.828874: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3192590000 Hz
2020-07-21 11:44:38.830245: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f2f5a923b80 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-21 11:44:38.830272: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-07-21 11:44:38.890164: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 52440000 exceeds 10% of system memory.
2020-07-21 11:44:38.924204: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 52440000 exceeds 10% of system memory.

@maziyarpanahi
Member

Hi @akash166d

Those lines are warnings and infos; could you please paste the actual error stack from the exception of the crash? Does it actually crash? Also, how much memory did you assign to driver.memory? (It seems very low.)

@akash166d
Author

Hi @maziyarpanahi, everything is working nicely in the production environment now. Just wanted to thank you for all your support.

For pretrained models, I'm not sure they will run in my environment as it is air-gapped. Did you mean earlier that they should work even without an internet connection? I'm currently using load() and it is working perfectly. pretrained() tries to download the model and then throws a connection error to S3.

If you want, I can close the request, or else I can help you with any QC you may like.

@maziyarpanahi
Member

@akash166d
Very happy to hear that. For pretrained() you still need internet; I just wanted to let you know that this wasn't possible from Apache Spark 2.3.x before, but after the 2.5.4 release it is now possible if you have internet access.
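
For reference, the two ways side by side (names and paths taken from earlier in this thread; pretrained() needs outbound access to S3):

from sparknlp.pretrained import PretrainedPipeline
from pyspark.ml import PipelineModel

# online: downloads the pipeline on first use (possible from Spark 2.3.x as of 2.5.4)
pipeline = PretrainedPipeline("explain_document_ml", lang="en")
result = pipeline.transform(data)

# offline/air-gapped: load a pipeline already copied to HDFS or local disk
offline_pipeline = PipelineModel.load("/user/el7150/exp_doc_ml_pip/")
result = offline_pipeline.transform(data)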

If you think this issue has been resolved, you can close it, and please do not hesitate to re-open it or create new issues if you experience any difficulties.
