Merge pull request #14164 from JohnSnowLabs/release/530-release-candidate

* fixed all sbt warnings

* remove file system url prefix (#14132)

* SPARKNLP-942: MPNet Classifiers (#14147)

* SPARKNLP-942: MPNetForSequenceClassification

* SPARKNLP-942: MPNetForQuestionAnswering

* SPARKNLP-942: MPNet Classifiers Documentation

* Restore RobertaforQA bugfix

* adding import notebook + changing default model + adding onnx support (#14158)

* Sparknlp 876: Introducing LLAMA2  (#14148)

* introducing LLAMA2

* Added option to read model from model path to onnx wrapper

* Added option to read model from model path to onnx wrapper

* updated text description

* LLAMA2 python API

* added method to save onnx_data

* added position ids

* - updated Generate.scala to accept onnx tensors
- added beam search support for LLAMA2

* updated max input length

* updated python default params
changed test to slow test

* fixed serialization bug

* Doc sim rank as retriever (#14149)

* Added retrieval interface to the doc sim rank approach

* Added Python interface as retriever in doc sim ranker

---------

Co-authored-by: Stefano Lori <s.lori@izicap.com>

* 812: Implement DeBerta for zero-shot classification annotator (#14151)

* adding code

* adding notebook for import

---------

Co-authored-by: Maziyar Panahi <maziyar.panahi@iscpif.fr>

* Add notebook for fine tuning sbert (#14152)

* [SPARKNLP-986] Fixing optional input col validations (#14153)

* [SPARKNLP-984] Fixing Deberta notebooks URIs (#14154)

* SparkNLP 933: Introducing M2M100 : multilingual translation model (#14155)

* introducing LLAMA2

* Added option to read model from model path to onnx wrapper

* Added option to read model from model path to onnx wrapper

* updated text description

* LLAMA2 python API

* added method to save onnx_data

* added position ids

* - updated Generate.scala to accept onnx tensors
- added beam search support for LLAMA2

* updated max input length

* updated python default params
changed test to slow test

* fixed serialization bug

* Added Scala code for M2M100

* Documentation for scala code

* Python API for M2M100

* added more tests for scala

* added tests for python

* added pretrained

* rewording

* fixed serialization bug

* fixed serialization bug

---------

Co-authored-by: Maziyar Panahi <maziyar.panahi@iscpif.fr>

* SPARKNLP-985: Add flexible naming for onnx_data (#14165)

Some annotators might have different naming schemes
for their files. Added a parameter to control this.
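The pattern described here (a configurable data-file name instead of a hard-coded `onnx_data`) can be sketched generically. The helper and parameter names below are illustrative only, not Spark NLP's actual API:

```python
import os

def find_onnx_data(model_dir, data_file_name="onnx_data"):
    """Return the path of the ONNX external-data file, or None if absent.

    `data_file_name` is a parameter because some exported models name this
    file differently (e.g. "model.onnx_data" instead of "onnx_data").
    """
    candidate = os.path.join(model_dir, data_file_name)
    return candidate if os.path.isfile(candidate) else None
```

With a fixed name, models exported under a different scheme would never be found; passing the name through makes the same loading code work for both.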

* Add LLAMA2Transformer and M2M100Transformer to annotator

* Add LLAMA2Transformer and M2M100Transformer to ResourceDownloader

* bump version to 5.3.0 [skip test]

* SPARKNLP-999: Fix remote model loading for some onnx models

* used filesystem to check for the onnx_data file (#14169)

* [SPARKNLP-940] Adding changes to correctly copy cluster index storage… (#14167)

* [SPARKNLP-940] Adding changes to correctly copy cluster index storage when defined

* [SPARKNLP-940] Moving local mode control to its right place

* [SPARKNLP-940] Refactoring sentToCluster method

* [SPARKNLP-988] Updating EntityRuler documentation (#14168)

* [SPARKNLP-940] Adding changes to support storage temp directory (cluster_tmp_dir)

* SPARKNLP-1000: Disable init_all_tables for GPT2 (#14177)

Fixes `java.lang.IllegalArgumentException: No Operation named [init_all_tables] in the Graph` when the model needs to be deserialized.
Deserialization is skipped when the model is already loaded, so the error only appears on the worker nodes and not on the driver.

GPT2 does not contain tables and so does not require this command.

* fixes python documentation (#14172)

* revert MarianTransformer.scala

* revert HasBatchedAnnotate.scala

* revert Preprocessor.scala

* Revert ViTClassifier.scala

* disable hard exception

* Replace hard exception with soft logs (#14179)

This reverts commit eb91fde.

* move the example from root to examples/ [skip test]

* Cleanup some code [skip test]

* Update onnxruntime to 1.17.0 [skip test]

* Fix M2M100 default model's name [skip test]

* Update docs [run doc]

* Update Scala and Python APIs

---------

Co-authored-by: ahmedlone127 <ahmedlone127@gmail.com>
Co-authored-by: Jiamao Zheng <jiamaozheng@users.noreply.github.com>
Co-authored-by: Devin Ha <33089471+DevinTDHa@users.noreply.github.com>
Co-authored-by: Prabod Rathnayaka <prabod@rathnayaka.me>
Co-authored-by: Stefano Lori <wolliq@users.noreply.github.com>
Co-authored-by: Stefano Lori <s.lori@izicap.com>
Co-authored-by: Danilo Burbano <37355249+danilojsl@users.noreply.github.com>
Co-authored-by: Devin Ha <t.ha@tu-berlin.de>
Co-authored-by: Danilo Burbano <danilo@johnsnowlabs.com>
Co-authored-by: github-actions <action@github.com>
11 people committed Feb 27, 2024
2 parents e805c43 + fa2cb23 commit ad5a4ea
Showing 1,616 changed files with 89,211 additions and 11,692 deletions.
33 changes: 33 additions & 0 deletions CHANGELOG
@@ -1,3 +1,36 @@
========
5.3.0
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing Llama-2 and all the models fine-tuned on this architecture. This is our very first CausalLM annotator in ONNX, and it comes with support for INT4 and INT8 quantization on CPUs.
* **NEW:** Introducing `MPNetForSequenceClassification` annotator for sequence classification tasks. This annotator is based on the MPNet architecture and is designed to classify sequences of text into a set of predefined classes.
* **NEW:** Introducing `MPNetForQuestionAnswering` annotator for question answering tasks. This annotator is based on the MPNet architecture and is designed to answer questions based on a given context.
* **NEW:** Introducing `M2M100` state-of-the-art multilingual translation. M2M100 is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation. The model can directly translate between the 9,900 directions of 100 languages.
* **NEW:** Introducing a new `DeBertaForZeroShotClassification` annotator for zero-shot classification tasks. This annotator is based on the DeBERTa architecture and is designed to classify sequences of text into a set of predefined classes.
* **NEW:** Implement retrieval feature in our `DocumentSimilarity` annotator. The new DocumentSimilarity ranker is a powerful tool for ranking documents based on their similarity to a given query document. It is designed to be efficient and scalable, making it ideal for a variety of RAG applications.
* Add ONNX support for `BertForZeroShotClassification` annotator.
* Add support for in-memory use of `WordEmbeddingsModel` annotator in serverless clusters. We initially introduced the in-memory feature for this annotator for users inside Kubernetes clusters without any `HDFS`; today it also runs without issue locally and on Google Colab, Kaggle, Databricks, AWS EMR, GCP, and AWS Glue.
* New Whisper Large and Distil models.
* Update ONNX Runtime to 1.17.0
* Support new Databricks Runtimes of 14.2, 14.3, 14.2 ML, 14.3 ML, 14.2 GPU, 14.3 GPU
* Support new EMR 6.15.0 and 7.0.0 versions
* Add notebook to fine-tune a BERT model for Sentence Embeddings in Hugging Face and import it into Spark NLP
* Add notebook to import BERT for Zero-Shot classification from Hugging Face
* Add notebook to import DeBERTa for Zero-Shot classification from Hugging Face
* Update EntityRuler documentation
* Improve SBT project and resolve warnings (almost!)
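A quick sanity check on the "9,900 directions" figure quoted for M2M100 above: a translation direction is an ordered (source, target) pair of distinct languages, so 100 languages yield 100 × 99 pairs. Plain Python arithmetic, not Spark NLP API:

```python
# A "direction" is an ordered (source, target) pair of distinct languages.
num_languages = 100
directions = num_languages * (num_languages - 1)
print(directions)  # 9900
```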

----------------
Bug Fixes
----------------
* Fix Spark NLP configuration to set `cluster_tmp_dir` on Databricks' DBFS via `spark.jsl.settings.storage.cluster_tmp_dir` https://github.com/JohnSnowLabs/spark-nlp/issues/14129
* Fix score calculation in `RoBertaForQuestionAnswering` annotator https://github.com/JohnSnowLabs/spark-nlp/pull/14147
* Fix optional input col validations https://github.com/JohnSnowLabs/spark-nlp/pull/14153
* Fix notebooks for importing DeBERTa classifiers https://github.com/JohnSnowLabs/spark-nlp/pull/14154
* Fix GPT2 deserialization over the cluster (Databricks) https://github.com/JohnSnowLabs/spark-nlp/pull/14177

========
5.2.3
========
141 changes: 72 additions & 69 deletions README.md


8 changes: 6 additions & 2 deletions build.sbt
@@ -6,7 +6,7 @@ name := getPackageName(is_silicon, is_gpu, is_aarch64)

organization := "com.johnsnowlabs.nlp"

-version := "5.2.3"
+version := "5.3.0"

(ThisBuild / scalaVersion) := scalaVer

@@ -144,13 +144,17 @@ lazy val utilDependencies = Seq(
 exclude ("com.fasterxml.jackson.core", "jackson-annotations")
 exclude ("com.fasterxml.jackson.core", "jackson-databind")
 exclude ("com.fasterxml.jackson.core", "jackson-core")
+exclude ("com.fasterxml.jackson.dataformat", "jackson-dataformat-cbor")
 exclude ("commons-configuration", "commons-configuration"),
 liblevenshtein
 exclude ("com.google.guava", "guava")
 exclude ("org.apache.commons", "commons-lang3")
 exclude ("com.google.code.findbugs", "annotations")
 exclude ("org.slf4j", "slf4j-api"),
-gcpStorage,
+gcpStorage
+exclude ("com.fasterxml.jackson.core", "jackson-core")
+exclude ("com.fasterxml.jackson.dataformat", "jackson-dataformat-cbor")
+,
 greex,
 azureIdentity,
 azureStorage)
4 changes: 2 additions & 2 deletions conda/meta.yaml
@@ -1,13 +1,13 @@
{% set name = "spark-nlp" %}
-{% set version = "5.2.3" %}
+{% set version = "5.3.0" %}

package:
name: {{ name|lower }}
version: {{ version }}

source:
url: https://pypi.io/packages/source/{{ name[0] }}/{{ name }}/spark-nlp-{{ version }}.tar.gz
-sha256: bdad9912c6f4fa36aef2169a4d7e4c33cd32d79d6ff0c628c04876d9354252e9
+sha256: 2fa182f1850026fa7f9d5fbb7b92939856f78ddcc2cb2d87d56af5e2e90b97f0

build:
noarch: python
