Version Compatibility between sparkNLP 2.5.3 and spark 2.3.x #963
Hi, just as a test, I would like to see the same demo with the following changes:

```python
# The correct imports to pull everything in.
# You shouldn't import annotator by annotator; maybe that is why the others are missing.
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline

# I don't know what this jar is: /hadoop/anaconda3.6/lib/python3.6/site-packages/sparknlp/lib/sparknlp.jar
# or why it is in site-packages, unless it's the same JAR I gave you, renamed and placed there for some reason.
# If you are using the Fat JAR I gave you, let's do it this way.
# Fat JAR: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-assembly-spark23-2.5.2.jar
# The session says local[*], so it's not in the cluster and can use the local file system
# like ./tmp, but it can be anywhere.
spark = SparkSession.builder \
    .appName("ner") \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.driver.maxResultSize", "2G") \
    .config("spark.jars", "./tmp/spark-nlp-assembly-spark23-2.5.2.jar") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "1000m") \
    .getOrCreate()

# I don't know what this is or what it is for, but it's not required:
# downloader = ResourceDownloader()

# The versions:
sparknlp.version()
spark.version
```
And the rest: if the PyPI package is correctly installed, let's see how the changes go first 😊

PS: If you have downloaded WordEmbeddingsModel and are trying to load it, you don't need to set the GloVe file manually or even call setStorageRef; they all come with the model. If you want to train with custom word2vec or GloVe embeddings, then you need to use WordEmbeddings(). There are examples in our spark-nlp-workshop.

PS2: Please copy the full error stack/logs for the Jackson JAR crash so we can read them and see what happened.
Thanks a lot for such a detailed description. I'm super excited to try out the new snippet you have provided. Currently we are facing some issues launching a Spark session on our dev environment, and IT is trying to fix the bug. Meanwhile, I have a couple of very short, sweet, and simple questions :)
You are welcome.
For a local setup such as your example, the JAR should be accessible via the local file system, assuming you have PySpark/Spark 2.3.x.
All you need to do is:
After reinstalling, I used the code provided by you. Please find below the full code (with a few minor changes).

```python
from sparknlp.base import *  # Error with this. Cannot import PretrainedPipeline.
from pyspark.sql import SparkSession

# I updated .master to "yarn" as I wanted to run this on the cluster and not local
spark = SparkSession.builder

l = [(1, 'Thanks for calling to ESI'), (2, 'How can i help you'), (3, 'Please reach out to us on mail')]
data = spark.createDataFrame(l, ['docID', 'text'])

document_assembler = DocumentAssembler()
sentence_detector = SentenceDetector()
tokenizer = Tokenizer()

lemma = LemmatizerModel.load("/user/xxxxx/lemma_mod").setInputCols(["token"]).setOutputCol("lemma")

# This was earlier working fine. I've enclosed the error log. Do I need to download a new file for this model?
pos = PerceptronModel.load("/user/xxxxx/pos_anc_mod/").setInputCols(["document", "token"]).setOutputCol("pos")

nor_sweet = NorvigSweetingModel.load("/user/xxxx/spell_nor_mod").setInputCols(["token"]).setOutputCol("corrected")
sent_viv = ViveknSentimentModel.load("/user/xxxxx/sent_vivek_mod").setInputCols(["sentence", "token"]).setOutputCol("sentiment")

# This is throwing an error. Please find the error log details below.
embeddings = WordEmbeddingsModel.load("/user/xxxxxx/wrd_embd_glove_mod").setInputCols("document", "token").setOutputCol("embeddings")
```

Could you please help me debug the issue? I can't find anything on Slack or GitHub issues related to it.

POS Error:

```
20/07/08 16:01:45 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 14, ch3dr028355.express-scripts.com, executor 1): java.lang.ClassNotFoundException: com.johnsnowlabs.nlp.annotators.pos.perceptron.AveragedPerceptron
20/07/08 16:01:45 ERROR TaskSetManager: Task 0 in stage 3.0 failed 4 times; aborting job
```

WordEmbeddings Error
@akash166d So we go from now on assuming this is a YARN cluster. Let's forget about any code and errors for now. You have PyPI issues right at the beginning, and without solving that, the SparkSession and the JAR are pointless. If you have installed `pip install spark-nlp==2.5.2` on all the machines correctly, in the right Python environment defined and known to Apache Spark, this must run:

```python
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp

sparknlp.version
```

The PyPI installation and successfully running this snippet are not related to the Apache Spark version, so the people on the IT team should manage to do this correctly the same way they do for any other third-party PyPI package.

NOTE: There is another possibility, less likely but still possible: that our PyPI package is not compatible with Apache Spark 2.3.x and I need to give you a Py Wheel for that to install instead. (I am not sure about this, so I am going to test and ask @saif-ellafi.)
Hi @maziyarpanahi @saif-ellafi Thanks for helping us out. I rechecked with IT and they said, and I quote: 'When he says "If you have installed 'pip install spark-nlp==2.5.2' on all the machines correctly in the right Python ENV which is defined and known to Apache Spark...", I'm assuming we're defining /hadoop/anaconda3.6//bin/python (or something along those lines) as the Python environment for Apache Spark? Or does Spark have its own separate Python installation (somewhere in a Spark directory) where it pulls packages from? I've worked on some Hadoop components that behave like this and have a whole separate but contained (not a venv, but effectively the same thing) Python installation, so I just want to make sure we're not going about getting these packages recognized by Spark the wrong way.'

Let me know if the above statement makes sense. Meanwhile, we are checking whether there is a separate Python environment (somewhat along the lines of a venv), but as of now, would it be possible for you to provide us with the updated Py Wheel file you mentioned earlier? Also, I just wanted to check on one additional thing: do you think the error when importing PretrainedPipeline is directly related to the messy PyPI installation, or is it a version compatibility issue? Once again, I really owe you guys a lot for helping me with this.

PS: He also provided details on how he uninstalled and reinstalled the package, just in case it gives more clarity: 'Technically we don't uninstall/install packages directly on our servers; we copy them over from a cloned sandbox environment (this way we can keep multiple versions of the same package on the server as a rollback option if a new package update breaks something). When Akash asked me to uninstall the package, all I did was remove the package directory and then copy it back from the cloned sandbox environment.
If they would like us to directly install it on the dev servers, that's fine. I'm just worried about doing this when we get to prod; not that we can't do it this way on prod, there's just a bit more at stake since we have 24/7 applications that run on those servers, so there will be a lot more prep and it might take a bit longer to get those packages installed (we will likely have to schedule a formal CHG, compared to just using a CTASK for copying files).'
Hi @akash166d Yes, if you cannot successfully run the few imports I gave you, then it's absolutely a PyPI package issue. I have never personally cloned the PyPI package to another system; for falling back from a bad upgrade, you can simply run the Ansible command to downgrade to the previous version easily. At least in our case, the spark-nlp PyPI package is just a mapping to the Scala/Java side, without messing with other packages, unlike some of those heavy PyPI packages. If you guys can just install it with pip install, we can be sure, first, that the PyPI package is visible to Apache Spark/SparkSession, and second, after those imports pass successfully, we can confirm the JAR compatibility.
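A common cause of the visibility problem discussed above is PySpark launching a different interpreter than the one `pip install spark-nlp` targeted. The sketch below is illustrative only; the Anaconda path is taken from earlier in this thread and is an assumption, so adjust it to wherever the package actually landed on your machines:

```python
import os

# Hypothetical interpreter path (mentioned earlier in this thread);
# point it at the Python that actually has spark-nlp installed.
PYTHON_BIN = "/hadoop/anaconda3.6/bin/python"

# PySpark reads these environment variables when it starts the driver and
# worker Python processes, so setting them BEFORE creating the SparkSession
# makes every process use the same interpreter (and the same site-packages).
os.environ["PYSPARK_PYTHON"] = PYTHON_BIN
os.environ["PYSPARK_DRIVER_PYTHON"] = PYTHON_BIN

print(os.environ["PYSPARK_PYTHON"])
```

On a YARN cluster the equivalent can also be set via `spark.yarn.appMasterEnv.PYSPARK_PYTHON` in the Spark configuration rather than in the shell environment.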
So it's definitely the way the PyPI package is being installed/managed. I did a test: our PyPI package, as I guessed, is compatible with Apache Spark 2.3.x and Apache Spark 2.4.x, so there is no need for a separate Py Wheel. You can just use the one on PyPI.
Thanks @maziyarpanahi , we tried installing Spark NLP via pip install, but the issue still persists. I'm still getting an error: 'ImportError: cannot import name 'PretrainedPipeline''. @kce6 can help us with details regarding the installation steps, so that you can have more clarity.
Hi, Ken from the Linux admin team here. Here is the playbook we used to install the package:
We have installed countless packages with this playbook and have never had any issues, so I'm not really sure the installation method is the issue here.
Thanks @kce6 Are those PyPI packages being used in PySpark? (The installation might be correct, but if it's not correctly configured for Apache Spark, or PySpark is being executed in the wrong Python environment, it won't find the installed PyPI packages.)
Hi @maziyarpanahi We checked, and there are multiple environments for Python (py2 and Anaconda), and Spark also has its own Python installation. However, the Spark Python installation is pointing towards the Anaconda package directory. Given this info, are there any suggested next steps we can take from here? @kce6 Please let me know if I've missed any details here that could help @maziyarpanahi
Hi @akash166d Unfortunately, setting those configs is out of our hands, as we have neither the knowledge of nor control over the cluster. However, I suggest that when the PySpark session is available, before dealing with Spark NLP, you check the environment you are in and try to adjust it either on the Apache Spark side or in where/how you are starting PySpark (maybe the default configs are not correctly distributed by ZooKeeper and your machine might point to something else):

```python
import sys
import os

print(sys.version)
spark.version
print("####")
print(sys.path)
print(os.environ['PYSPARK_PYTHON'])
```
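A complementary check that needs no Spark at all is to ask the interpreter itself whether it can see the package. This is a generic sketch (the `sparknlp` name is the only Spark NLP-specific part); run it with the same Python binary that PySpark is configured to use:

```python
import importlib.util
import sys

def package_visible(name: str) -> bool:
    """Return True if `name` is importable by this interpreter."""
    return importlib.util.find_spec(name) is not None

# Which interpreter is this, and which directories does it search?
print(sys.executable)
for entry in sys.path:
    print("  ", entry)

# If this prints False under the interpreter PySpark uses, the
# `pip install spark-nlp` went into a different Python environment.
print("sparknlp visible:", package_visible("sparknlp"))
```

If `package_visible("sparknlp")` is True in a plain shell but False inside PySpark, the two are running different interpreters, which matches the multi-environment situation described above.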
@maziyarpanahi Launching PySpark:
Output:
I have one additional question: when I do `…`, do I need to update the API as mentioned in this link: https://github.com/JohnSnowLabs/spark-nlp/blob/master/python/sparknlp/pretrained.py
I tried placing --jars in my PySpark command like this:
However, the issue still persists with the same error: ClassNotFoundException, similar to this: (https://github.com//issues/203). @saif-ellafi I was going through the perceptron module and was not able to find AveragedPerceptron. Do you think the issue might be an incompatibility between the Spark JAR and the perceptron file I have? Please let me know if you think I should use an updated Spark JAR for Spark NLP 2.5.3. Please help me with the next steps I can try from here in debugging this.
@maziyarpanahi I tried looking for the classes we have in Spark NLP in the following ways:
Do you think this can help us decide the next steps in troubleshooting from here?
OK, let's summarize this. Let's start the PySpark shell with the same JAR you have, and please show me the results of the following code:

```python
from sparknlp.base import *
from sparknlp.annotator import *
import sparknlp

sparknlp.version()
```

Then please also run this:

```python
from sparknlp.pretrained import PretrainedPipeline
```

If the error persists, let's focus on this and not the models/pipelines. We are working on making the next release 100% compatible with Apache Spark 2.3.x, meaning no more offline …
Hi @maziyarpanahi , I have some good news with respect to PretrainedPipeline. There's no error now when importing PretrainedPipeline; it got sorted after we installed Spark NLP throughout the servers. Earlier it was installed only on the edge node. So that's great news; thanks for your support, and we are making progress. Now, there's still an error when loading PerceptronModel. The code snippet is the same as shared earlier. I'm enclosing the new error below for more context.
That's great news! Could you please show me the results of these two:
Also, where is this path …?
Just as an FYI: https://github.com/JohnSnowLabs/spark-nlp/releases/tag/2.5.4
Waoo!! Thanks :D @maziyarpanahi We do not need to reference the Fat JAR file you gave us earlier, right? Just confirming.
Hi @maziyarpanahi , Just wanted to inform you of a few of the major developments we have made. I was able to run WordEmbeddings, POS, and all the other models successfully on Spark NLP 2.5.2 using the Fat JAR you provided. We are now planning to install the new version (2.5.4) to check for compatibility with Spark 2.3. It looks like every piece of the puzzle is now sorted. Following are the changes we made that helped us make it work (this is for future reference, just in case it happens to anyone else):
Below is the code we are now using to launch PySpark.
I would like to take this opportunity to thank you for all your support so far. I'll wait a couple of days until we get the new version installed and tested, and then I will mark the thread as resolved and close it. Best,
Hi @akash166d I am really happy to hear that, and I appreciate even more that you shared your steps with the community. If you don't mind, I would like to keep this issue open until you have successfully tested the new 2.5.4 release.
@maziyarpanahi Yeah sure, that's even better!! In parallel, I was testing the pretrained pipelines (offline) on my current version of Spark NLP 2.5.2.
Hi @akash166d Those lines are warnings and infos; could you please paste the actual error stack from the exception of the crash? Does it actually crash? Also, how much memory did you assign to driver.memory? (It seems very low.)
Hi @maziyarpanahi , Everything is working nicely in the production environment now. Just wanted to thank you for all your support. For pretrained models, I'm not sure they will run in my environment, as it is air-gapped. Did you mean earlier that it should work even without an internet connection? I'm currently using load() and it is working perfectly; pretrained() tries to download the model and then throws a connection error to S3. If you want, I can close the request, or else I can help you with any QC you may like.
@akash166d If you think this issue has been resolved, you can close it, and please do not hesitate to re-open it or create new issues if you experience any difficulties.
Quoting the IT team: 'We don't install packages from source because doing so would not allow us to pass a umask value to the package during installation, thus making it importable only by the root user. So we install via pip, specifically using the pip module in Ansible, in order to pass the needed umask value.'
Code Snippet:
We replaced the previous sparknlp.jar with the newly provided Spark NLP Fat JAR from @maziyarpanahi (renamed to sparknlp.jar). It seems it had some conflict with the Jackson JAR file, which might be the reason Spark crashed.
Could you help us configure Spark NLP for our version of Spark, given there are JAR files that support the compatibility? Happy to fill you in with more details if needed.