
John Snow Labs Spark-NLP 3.0.0: Supporting Spark 3.x, Scala 2.12, more Databricks runtimes, more EMR versions, performance improvements & lots more

@maziyarpanahi maziyarpanahi released this 22 Mar 15:44
· 3623 commits to master since this release

Overview

We are very excited to release Spark NLP 3.0.0! This has been one of the biggest releases we have ever done and we are so proud to share this with our community.

Spark NLP 3.0.0 extends support to the Apache Spark 3.0.x and 3.1.x major releases on Scala 2.12, with both Hadoop 2.7 and 3.2. We will support all 4 major Apache Spark and PySpark releases — 2.3.x, 2.4.x, 3.0.x, and 3.1.x — helping the community migrate from earlier Apache Spark versions to newer releases without worrying about Spark NLP support.

As always, we would like to thank our community for their feedback, questions, and feature requests.


New Features

  • Support for Apache Spark and PySpark 3.0.x on Scala 2.12
  • Support for Apache Spark and PySpark 3.1.x on Scala 2.12
  • Migrate to TensorFlow v2.3.1 with native support for Java to take advantage of many optimizations for CPU/GPU and new features/models introduced in TF v2.x
  • Welcoming 11 new Databricks runtimes to our Spark NLP family:
    • Databricks 7.3
    • Databricks 7.3 ML GPU
    • Databricks 7.4
    • Databricks 7.4 ML GPU
    • Databricks 7.5
    • Databricks 7.5 ML GPU
    • Databricks 7.6
    • Databricks 7.6 ML GPU
    • Databricks 8.0
    • Databricks 8.0 ML (there is no GPU in 8.0)
    • Databricks 8.1 Beta
  • Welcoming 2 new EMR 6.x releases to our Spark NLP family:
    • EMR 6.1.0 (Apache Spark 3.0.0 / Hadoop 3.2.1)
    • EMR 6.2.0 (Apache Spark 3.0.1 / Hadoop 3.2.1)
  • Starting with Spark NLP 3.0.0, the default packages for CPU and GPU are based on Apache Spark 3.x and Scala 2.12 (spark-nlp and spark-nlp-gpu are compatible only with Apache Spark 3.x and Scala 2.12)
  • Starting with Spark NLP 3.0.0, there are two new packages to support Apache Spark 2.4.x and Scala 2.11 (spark-nlp-spark24 and spark-nlp-gpu-spark24)
  • Spark NLP 3.0.0 remains, and will remain, compatible with Apache Spark 2.3.x and Scala 2.11 (spark-nlp-spark23 and spark-nlp-gpu-spark23)
  • Adding a new param to the sparknlp.start() function in Python to start a session on Apache Spark 2.4.x (spark24=True)
  • Adding a new param to the sparknlp.start() function to adjust the Driver memory (memory="16G")
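Taken together, the new packaging scheme maps each Apache Spark line to its own Maven artifact. The helper below is a hypothetical sketch (the function name is ours, not part of Spark NLP) that builds the right coordinate for a given Spark line and hardware target:

```python
# Hypothetical helper (not part of Spark NLP): build the Maven coordinate
# for Spark NLP 3.0.0 given the Apache Spark line and hardware target.
def spark_nlp_coordinate(spark_line: str, gpu: bool = False) -> str:
    artifacts = {
        "3.x": ("spark-nlp", "2.12"),          # default packages (Spark 3.0.x/3.1.x)
        "2.4": ("spark-nlp-spark24", "2.11"),  # Apache Spark 2.4.x packages
        "2.3": ("spark-nlp-spark23", "2.11"),  # Apache Spark 2.3.x packages
    }
    name, scala = artifacts[spark_line]
    if gpu:
        # GPU artifacts insert "-gpu" right after "spark-nlp"
        name = name.replace("spark-nlp", "spark-nlp-gpu", 1)
    return f"com.johnsnowlabs.nlp:{name}_{scala}:3.0.0"

print(spark_nlp_coordinate("3.x"))            # com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.0
print(spark_nlp_coordinate("2.4", gpu=True))  # com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.0.0
```

On the Python side, the new sparknlp.start() parameters select the matching artifact for you, e.g. sparknlp.start(spark24=True, memory="16G").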

Performance Improvements

Introducing a new batch annotation technique, implemented in Spark NLP 3.0.0 for the NerDLModel, BertEmbeddings, and BertSentenceEmbeddings annotators, to radically improve prediction/inference performance. From now on, the batchSize for these annotators means the number of rows fed into the models for prediction at once, instead of sentences per row. This lets you control throughput on accelerated hardware such as GPUs and fully utilize it.
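The change in batchSize semantics can be illustrated with a minimal, framework-free sketch (plain Python, not Spark NLP internals): rows are grouped into chunks of batchSize, and each chunk is sent to the model in a single forward pass:

```python
# Minimal sketch (not Spark NLP internals): batchSize now counts DataFrame
# rows per model call, rather than sentences within a single row.
def iter_batches(rows, batch_size):
    """Yield successive groups of `batch_size` rows, one per prediction call."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

rows = [f"document {i}" for i in range(10)]
batches = list(iter_batches(rows, batch_size=4))
# 3 model calls instead of 10: two full batches of 4 and one of 2
print([len(b) for b in batches])  # [4, 4, 2]
```

A larger batchSize means fewer, bigger model calls, which is what keeps a GPU saturated.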

Performance achievements by using Spark NLP 3.0.0 vs. Spark NLP 2.7.x on CPU and GPU

(Performed on a Databricks cluster)

| Spark NLP 3.0.0 vs. 2.7.x    | PySpark 3.x on CPU | PySpark 3.x on GPU |
|------------------------------|--------------------|--------------------|
| BertEmbeddings (bert-base)   | +10%               | +550% (6.6x)       |
| BertEmbeddings (bert-large)  | +12%               | +690% (7.9x)       |
| NerDLModel                   | +185%              | +327% (4.2x)       |

Breaking changes

There are only 6 annotators that cannot be used across both Scala 2.11 (Apache Spark 2.3.x and 2.4.x) and Scala 2.12 (Apache Spark 3.x) at the same time. You can either train and use them on Apache Spark 2.3.x/2.4.x, or train and use them on Apache Spark 3.x.

  • TokenizerModel
  • PerceptronApproach (POS Tagger)
  • WordSegmenter
  • DependencyParser
  • TypedDependencyParser
  • NerCrfModel

The rest of our models/pipelines can be used on all Apache Spark and Scala major versions without any issue.

We have already retrained and uploaded all the existing pretrained Part of Speech and WordSegmenter models for Apache Spark 3.x and Scala 2.12. We will continue to do this as we find existing models that are not compatible with Apache Spark 3.x and Scala 2.12.

NOTE: You can always use the .pretrained() function, which will seamlessly find and download the most recent compatible models for you. They are downloaded and extracted into your home directory under ~/cached_pretrained/.
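As a sketch, the download itself is a one-liner in Python (the annotator below is illustrative and needs Spark NLP and a running Spark session), and the cache location resolves with the standard library:

```python
import os

# Illustrative only; requires Spark NLP and a running Spark session:
#   from sparknlp.annotator import NerDLModel
#   ner = NerDLModel.pretrained()  # fetches the most recent compatible model
# Downloaded models are extracted under this directory:
cache_dir = os.path.join(os.path.expanduser("~"), "cached_pretrained")
print(cache_dir)
```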

More info: #2562


Deprecated

Starting with the Spark NLP 3.0.0 release, we no longer publish any artifacts on spark-packages; all artifacts are hosted only on the Maven Repository.


Documentation


Installation

Python

# PyPI

pip install spark-nlp==3.0.0

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.0.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.0.0

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.0.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.0.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.0.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.0.0

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.0.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.0.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.0.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.0.0

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>3.0.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>3.0.0</version>
</dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark24_2.11</artifactId>
    <version>3.0.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
    <version>3.0.0</version>
</dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark23_2.11</artifactId>
    <version>3.0.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
    <version>3.0.0</version>
</dependency>

FAT JARs