John Snow Labs Spark-NLP 3.0.0: Supporting Spark 3.x, Scala 2.12, more Databricks runtimes, more EMR versions, performance improvements & lots more
Overview
We are very excited to release Spark NLP 3.0.0! This has been one of the biggest releases we have ever done and we are so proud to share this with our community.
Spark NLP 3.0.0 extends support to the Apache Spark 3.0.x and 3.1.x major releases on Scala 2.12, with both Hadoop 2.7 and 3.2. We will support all four major Apache Spark and PySpark release lines (2.3.x, 2.4.x, 3.0.x, and 3.1.x), helping the community migrate from earlier Apache Spark versions to newer releases without worrying about Spark NLP support.
As always, we would like to thank our community for their feedback, questions, and feature requests.
New Features
- Support for Apache Spark and PySpark 3.0.x on Scala 2.12
- Support for Apache Spark and PySpark 3.1.x on Scala 2.12
- Migrate to TensorFlow v2.3.1 with native support for Java to take advantage of many optimizations for CPU/GPU and new features/models introduced in TF v2.x
- Welcoming 11 new Databricks runtimes to our Spark NLP family:
- Databricks 7.3
- Databricks 7.3 ML GPU
- Databricks 7.4
- Databricks 7.4 ML GPU
- Databricks 7.5
- Databricks 7.5 ML GPU
- Databricks 7.6
- Databricks 7.6 ML GPU
- Databricks 8.0
- Databricks 8.0 ML (there is no GPU in 8.0)
- Databricks 8.1 Beta
- Welcoming 2 new EMR 6.x releases to our Spark NLP family:
- EMR 6.1.0 (Apache Spark 3.0.0 / Hadoop 3.2.1)
- EMR 6.2.0 (Apache Spark 3.0.1 / Hadoop 3.2.1)
- Starting with Spark NLP 3.0.0, the default packages for CPU and GPU are based on Apache Spark 3.x and Scala 2.12 (`spark-nlp` and `spark-nlp-gpu` are compatible only with Apache Spark 3.x and Scala 2.12)
- Starting with Spark NLP 3.0.0, two new packages support Apache Spark 2.4.x and Scala 2.11 (`spark-nlp-spark24` and `spark-nlp-gpu-spark24`)
- Spark NLP 3.0.0 still is, and will remain, compatible with Apache Spark 2.3.x and Scala 2.11 (`spark-nlp-spark23` and `spark-nlp-gpu-spark23`)
- Added a new `spark24=True` param to the `sparknlp.start()` function in Python for Apache Spark 2.4.x
- Added a new `memory="16G"` param to adjust driver memory in the `sparknlp.start()` function
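To make the new `sparknlp.start()` options concrete, here is a minimal pure-Python sketch of how the `spark24` and `memory` parameters could map onto a Spark session's configuration. The helper `build_start_config` and its defaults are hypothetical, for illustration only; this is not the actual sparknlp source.

```python
# Hypothetical sketch of how sparknlp.start() might pick its Maven package
# and driver memory. Names and defaults here are illustrative only.

def build_start_config(gpu=False, spark23=False, spark24=False, memory="16G"):
    version = "3.0.0"
    if spark24:
        artifact = "spark-nlp-gpu-spark24_2.11" if gpu else "spark-nlp-spark24_2.11"
    elif spark23:
        artifact = "spark-nlp-gpu-spark23_2.11" if gpu else "spark-nlp-spark23_2.11"
    else:  # default: Apache Spark 3.x / Scala 2.12
        artifact = "spark-nlp-gpu_2.12" if gpu else "spark-nlp_2.12"
    return {
        "spark.jars.packages": f"com.johnsnowlabs.nlp:{artifact}:{version}",
        "spark.driver.memory": memory,
    }

# e.g. an Apache Spark 2.4.x session with 16G of driver memory
config = build_start_config(spark24=True, memory="16G")
```

The real `sparknlp.start()` applies such settings when it builds the SparkSession, so you no longer need to pass `--packages` or driver-memory flags by hand when starting from Python.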
Performance Improvements
Spark NLP 3.0.0 introduces a new batch annotation technique for the `NerDLModel`, `BertEmbeddings`, and `BertSentenceEmbeddings` annotators that radically improves prediction/inference performance. From now on, the `batchSize` for these annotators means the number of rows fed into the models for prediction, rather than the number of sentences per row. This lets you control throughput on accelerated hardware such as GPUs and fully utilize it.
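To illustrate the change in what `batchSize` counts, here is a minimal pure-Python sketch (not Spark NLP internals): whole DataFrame rows are grouped into prediction batches, instead of batching the sentences within a single row.

```python
# Illustrative sketch only: under the new scheme, batchSize counts
# rows fed to the model per forward pass, not sentences per row.

def batch_rows(rows, batch_size):
    """Group rows into prediction batches of at most batch_size rows each."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

rows = [f"document {i}" for i in range(10)]
batches = batch_rows(rows, batch_size=4)  # 3 batches: 4 + 4 + 2 rows
```

Larger batches keep an accelerator busier per model call, which is why raising `batchSize` on a GPU can substantially increase throughput.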
Performance achievements by using Spark NLP 3.0.0 vs. Spark NLP 2.7.x on CPU and GPU
(Performed on a Databricks cluster)
| Spark NLP 3.0.0 vs. 2.7.x | PySpark 3.x on CPU | PySpark 3.x on GPU |
|---|---|---|
| BertEmbeddings (bert-base) | +10% | +550% (6.6x) |
| BertEmbeddings (bert-large) | +12% | +690% (7.9x) |
| NerDLModel | +185% | +327% (4.2x) |
Breaking changes
There are only 6 annotators whose models cannot be used with both Scala 2.11 (Apache Spark 2.3.x and 2.4.x) and Scala 2.12 (Apache Spark 3.x) at the same time. You can either train and use them on Apache Spark 2.3.x/2.4.x, or train and use them on Apache Spark 3.x.
- TokenizerModel
- PerceptronApproach (POS Tagger)
- WordSegmenter
- DependencyParser
- TypedDependencyParser
- NerCrfModel
The rest of our models/pipelines can be used on all Apache Spark and Scala major versions without any issue.
We have already retrained and uploaded all the existing pretrained Part of Speech and WordSegmenter models for Apache Spark 3.x and Scala 2.12. We will continue to do so as we identify existing models that are not compatible with Apache Spark 3.x and Scala 2.12.
NOTE: You can always use the `.pretrained()` function, which will seamlessly find and download the most recent compatible models for you. It downloads and extracts them into `~/cached_pretrained/` in your home directory.
More info: #2562
Deprecated
Starting with the Spark NLP 3.0.0 release, we no longer publish any artifacts on spark-packages; all artifacts are hosted only on the Maven Repository.
Documentation
- Apache Spark Migration Guide
- PySpark Migration Guide
- "Spark NLP: Natural language understanding at scale" published paper
- Spark NLP publications
- Spark NLP in Action
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP documentation
- Spark NLP Workshop notebooks
- Models Hub with new models
- Spark NLP Display for visualization of different types of annotations
- Discussions: engage with other community members, share ideas, and show off how you use Spark NLP!
Installation
Python
#PyPI
pip install spark-nlp==3.0.0
Spark Packages
spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.0
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.0.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.0.0
spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.0.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.0.0
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.0.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.0.0
spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.0.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.0.0
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.0.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.0.0
Maven
spark-nlp on Apache Spark 3.0.x and 3.1.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>3.0.0</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>3.0.0</version>
</dependency>
spark-nlp on Apache Spark 2.4.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark24_2.11</artifactId>
<version>3.0.0</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
<version>3.0.0</version>
</dependency>
spark-nlp on Apache Spark 2.3.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark23_2.11</artifactId>
<version>3.0.0</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
<version>3.0.0</version>
</dependency>
FAT JARs
- CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.0.0.jar
- GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.0.0.jar
- CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.0.0.jar
- GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.0.0.jar
- CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.0.0.jar
- GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.0.0.jar