![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/19.03.Sentence_Embeddings_with_Transformers.ipynb)

# **Sentence Embeddings with Transformers**

This notebook will cover the different parameters and usages of Sentence Embeddings annotators.



**📖 Learning Objectives:**

1. Be able to create a pipeline for sentence embeddings using the annotator.

2. Understand how to use the annotator for predictions.

3. Become comfortable using the different parameters of the annotator.



**🔗 Helpful Links:**

- Documentation : [Transformers in Spark NLP](https://nlp.johnsnowlabs.com/docs/en/transformers)



- Scala Doc : [BertSentenceEmbeddings](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/embeddings/BertSentenceEmbeddings.html)


- For extended examples of usage, see the [Spark NLP Workshop repository.](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/open-source-nlp)



## **🎬 Colab Setup**

In [1]:
!pip install -q pyspark==3.1.2 spark-nlp==4.3.0


In [2]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 4.3.0
Apache Spark version: 3.1.2


## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `SENTENCE`

- Output: `SENTENCE_EMBEDDINGS`

## **🔎Parameters**

- `setCaseSensitive()` : Whether tokens are matched case-sensitively when computing embeddings; when `False`, case is ignored (Default: `False`)

- `setMaxSentenceLength()` : Maximum sentence length to process (Default: 128)

- `batchSize` : Larger values allow faster processing but require more memory (Default: 8)

- `configProtoBytes` : ConfigProto from tensorflow, serialized into byte array. Get with `config_proto.SerializeToString()`


- `dimension` : Number of embedding dimensions, by default 768


- `isLong` : Use Long type instead of Int type for inputs buffer - Some Bert models require Long instead of Int.
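To make the effect of `batchSize` and `setMaxSentenceLength()` more concrete, here is a plain-Python sketch (not Spark NLP code) that truncates token sequences and groups them into batches. It assumes that sequences longer than the limit are simply cut off; the helpers `truncate` and `make_batches` are hypothetical, and whitespace splitting stands in for the tokenization the real annotator performs.

```python
from math import ceil


def truncate(tokens, max_len):
    """Keep at most max_len tokens and drop the rest."""
    return tokens[:max_len]


def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# 20 toy sentences; whitespace splitting is an assumption made for illustration.
sentences = [f"sentence number {i} with a few extra words" for i in range(20)]
tokenized = [truncate(s.split(), max_len=5) for s in sentences]
batches = make_batches(tokenized, batch_size=8)

print(len(batches))               # number of forward passes needed
print([len(b) for b in batches])  # sentences per batch
print(tokenized[0])               # first sentence after truncation
```

A larger batch size means fewer passes through the model, but every batch has to fit in memory at once, which is the trade-off the parameter description refers to.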


## Defining the Spark NLP Pipeline

In [3]:
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, SentenceDetector, BertSentenceEmbeddings, RoBertaSentenceEmbeddings, XlmRoBertaSentenceEmbeddings
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

➤ Let's prepare the prerequisite columns first, so we can reuse them with different annotators.

In [29]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

pipeline = Pipeline(stages=[documentAssembler,
                            sentence])

In [30]:
example_df = spark.createDataFrame([["Customer satisfaction always holds a top priority for the success of our company."]]).toDF("text")

example_df = pipeline.fit(example_df).transform(example_df)

# 📍 **BertSentenceEmbeddings**

➤ Sentence-level embeddings using BERT. BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture.

➤ The output of this annotator can be used in multi-class/multi-label text classifications (`ClassifierDL`, `SentimentDL`, and `MultiClassifierDL`) 

In [31]:
embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_128") \
    .setInputCols(["sentence"]) \
    .setOutputCol("sentence_bert_embeddings")\
    .setCaseSensitive(True) \
    .setMaxSentenceLength(512)

sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]


In [32]:
result = embeddings.transform(example_df)
result.show()

+--------------------+--------------------+--------------------+------------------------+
|                text|            document|            sentence|sentence_bert_embeddings|
+--------------------+--------------------+--------------------+------------------------+
|Customer satisfac...|[{document, 0, 80...|[{document, 0, 80...|    [{sentence_embedd...|
+--------------------+--------------------+--------------------+------------------------+



In [33]:
result.select("sentence_bert_embeddings.embeddings").show(truncate=False)


In [34]:
embeddings.extractParamMap()

{Param(parent='BERT_SENTENCE_EMBEDDINGS_9bc3506cd635', name='batchSize', doc='Size of every batch'): 8,
 Param(parent='BERT_SENTENCE_EMBEDDINGS_9bc3506cd635', name='engine', doc='Deep Learning engine used for this model'): 'tensorflow',
 Param(parent='BERT_SENTENCE_EMBEDDINGS_9bc3506cd635', name='isLong', doc='Use Long type instead of Int type for inputs buffer - Some Bert models require Long instead of Int.'): False,
 Param(parent='BERT_SENTENCE_EMBEDDINGS_9bc3506cd635', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='BERT_SENTENCE_EMBEDDINGS_9bc3506cd635', name='dimension', doc='Number of embedding dimensions'): 128,
 Param(parent='BERT_SENTENCE_EMBEDDINGS_9bc3506cd635', name='maxSentenceLength', doc='Max sentence length to process'): 512,
 Param(parent='BERT_SENTENCE_EMBEDDINGS_9bc3506cd635', name='caseSensitive', doc='whether to ignore case in tokens for embeddings matching'): True,
 Param(parent='BERT_SENTENCE_EMBE

In [35]:
embeddings.getDimension()

128

In [36]:
embeddings.getStorageRef()

'sent_small_bert_L2_128'

In [37]:
embeddings.getMaxSentenceLength()

512
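Dense sentence vectors like these are often compared with cosine similarity. The sketch below is plain Python and uses short made-up vectors in place of the 128-dimensional output above; in practice the values would come from the `embeddings` field of each annotation. The function `cosine_similarity` is a hypothetical helper, not part of Spark NLP.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Made-up 4-dimensional vectors standing in for real sentence embeddings.
emb_a = [0.2, -0.1, 0.4, 0.3]
emb_b = [0.1, -0.2, 0.5, 0.2]
emb_c = [-0.3, 0.4, -0.1, -0.2]

print(cosine_similarity(emb_a, emb_b))  # close to 1: similar direction
print(cosine_similarity(emb_a, emb_c))  # negative: opposite direction
```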

# 📍 **RoBertaSentenceEmbeddings**

➤ Sentence-level embeddings using RoBERTa. The RoBERTa model was proposed in [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google’s BERT model released in 2018.

It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.

➤ The output of this annotator can be used in multi-class/multi-label text classifications (`ClassifierDL`, `SentimentDL`, and `MultiClassifierDL`) 

In [38]:
embeddings = RoBertaSentenceEmbeddings.pretrained("sent_roberta_base", "en") \
      .setInputCols("sentence") \
      .setOutputCol("sentence_roberta_embeddings")

pipeline = Pipeline(stages=[
    documentAssembler,
    sentence,
    embeddings
    ])

result = pipeline.fit(example_df).transform(example_df)
result.show()

sent_roberta_base download started this may take some time.
Approximate size to download 284.8 MB
[OK!]
+--------------------+--------------------+--------------------+---------------------------+
|                text|            document|            sentence|sentence_roberta_embeddings|
+--------------------+--------------------+--------------------+---------------------------+
|Customer satisfac...|[{document, 0, 80...|[{document, 0, 80...|       [{sentence_embedd...|
+--------------------+--------------------+--------------------+---------------------------+



In [39]:
result.selectExpr("sentence_roberta_embeddings.embeddings as Embeddings").show(truncate=False)


In [40]:
embeddings.getDimension()

768

In [41]:
embeddings.getStorageRef()

'sent_roberta_base'

# 📍 **XlmRoBerta (Multilingual)**


➤ Sentence-level embeddings using XLM-RoBERTa. The XLM-RoBERTa model was proposed in Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook’s RoBERTa model released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data.

➤ The output of this annotator can be used in multi-class/multi-label text classifications (`ClassifierDL`, `SentimentDL`, and `MultiClassifierDL`) 

In [42]:
example_df = spark.createDataFrame([["La satisfaction du client est toujours une priorité absolue pour le succès de notre entreprise." ]]).toDF("text")

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

embeddings = XlmRoBertaSentenceEmbeddings.pretrained("sent_xlm_roberta_base", "xx") \
    .setInputCols("sentence") \
    .setOutputCol("sentence_xlmroberta_embeddings")

pipeline = Pipeline(stages=[
    documentAssembler,
    sentence,
    embeddings
    ])

result = pipeline.fit(example_df).transform(example_df)
result.show()

sent_xlm_roberta_base download started this may take some time.
Approximate size to download 619.5 MB
[OK!]
+--------------------+--------------------+--------------------+------------------------------+
|                text|            document|            sentence|sentence_xlmroberta_embeddings|
+--------------------+--------------------+--------------------+------------------------------+
|La satisfaction d...|[{document, 0, 94...|[{document, 0, 94...|          [{sentence_embedd...|
+--------------------+--------------------+--------------------+------------------------------+



In [43]:
result.selectExpr("sentence_xlmroberta_embeddings.embeddings as Embeddings").show(truncate=False)


In [44]:
embeddings.getDimension()

768

In [45]:
embeddings.getStorageRef()

'sent_xlm_roberta_base'

# 📍 **EmbeddingsFinisher**



- Extracts embeddings from Annotations into a more easily usable form.

- This is useful, for example, with [WordEmbeddings](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/embeddings/WordEmbeddings.html), [BertEmbeddings](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/embeddings/BertEmbeddings.html), [SentenceEmbeddings](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/embeddings/SentenceEmbeddings.html) and [ChunkEmbeddings](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/embeddings/SentenceEmbeddings.html).

- By using EmbeddingsFinisher you can easily transform your embeddings into an array of floats or into vectors that are compatible with Spark ML functions such as LDA, K-means, the Random Forest classifier, or any other function that requires a `featuresCol`. The resulting features can then be fed to downstream models for text classification, sentiment analysis, and other natural language processing tasks.

For more extended examples see the [Spark NLP Workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.1_Text_classification_examples_in_SparkML_SparkNLP.ipynb).



> Input Annotator Types: `EMBEDDINGS`


> Output Annotator Type: `NONE`

📌 `setOutputAsVector`

`setOutputAsVector` is a boolean parameter of `EmbeddingsFinisher` that controls the type of the output. When set to `True`, each embedding is returned as a Spark ML vector; when set to `False` (the default), it is returned as an array of floats.

📌 `setCleanAnnotations`

`setCleanAnnotations` controls whether the existing annotation columns are removed from the output DataFrame. When set to `True` (the default), the annotation columns are dropped and only the finished output is kept; when set to `False`, the intermediate annotation columns such as `document` and `sentence` remain available next to the finished embeddings.

In [46]:
from sparknlp.base import EmbeddingsFinisher

In [47]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_128", "en") \
    .setInputCols(["sentence"]) \
    .setOutputCol("bert_sentence_embeddings")

embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols("bert_sentence_embeddings") \
    .setOutputCols("finished_sentence_embeddings") \
    .setOutputAsVector(True) \
    .setCleanAnnotations(False)

pipeline = Pipeline() \
    .setStages([
        documentAssembler,
        sentence,
        embeddings,
        embeddingsFinisher])

data = spark.createDataFrame([["I love working with SparkNLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.show()

sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]
+--------------------+--------------------+--------------------+------------------------+----------------------------+
|                text|            document|            sentence|bert_sentence_embeddings|finished_sentence_embeddings|
+--------------------+--------------------+--------------------+------------------------+----------------------------+
|I love working wi...|[{document, 0, 27...|[{document, 0, 27...|    [{sentence_embedd...|        [[-1.146771311759...|
+--------------------+--------------------+--------------------+------------------------+----------------------------+



In [49]:
result.select("finished_sentence_embeddings").show(truncate=False)


In [50]:
result.select("sentence.result", "finished_sentence_embeddings").show()

+--------------------+----------------------------+
|              result|finished_sentence_embeddings|
+--------------------+----------------------------+
|[I love working w...|        [[-1.146771311759...|
+--------------------+----------------------------+



# 📍 Using LightPipeline

[LightPipelines](https://nlp.johnsnowlabs.com/docs/en/concepts#using-spark-nlps-lightpipeline) are Spark NLP-specific pipelines, equivalent to a Spark ML Pipeline but meant to deal with smaller amounts of data. They’re useful when working with small datasets, debugging results, or running training or prediction from an API that serves one-off requests.

Spark NLP LightPipelines are Spark ML pipelines converted to run as a multi-threaded task on a single machine, **becoming more than 10x faster** for smaller amounts of data (small is relative, but 50k sentences is roughly a good maximum). To use them, we simply plug in a trained (fitted) pipeline and then annotate plain text. We don't even need to convert the input text to a DataFrame, even though the underlying pipeline expects a DataFrame as input. This is quite useful for getting predictions for a few lines of text from a trained ML model.

For more details, check the following 
[Medium post](https://medium.com/spark-nlp/spark-nlp-101-lightpipeline-a544e93f20f1).

This class accepts a string or a list of strings as input, without the need to transform your text into a Spark DataFrame. The [.annotate()](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/light_pipeline/index.html#sparknlp.base.light_pipeline.LightPipeline.annotate) method returns a dictionary (or a list of dictionaries if a list is passed as input) with the results of each step in the pipeline. To retrieve all metadata from the annotators in the result, use the method [.fullAnnotate()](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/light_pipeline/index.html#sparknlp.base.light_pipeline.LightPipeline.fullAnnotate) instead, which always returns a list.

To extract the results from the object, you just need to parse the dictionary.

Let's use the `sent_roberta_base` model with `LightPipeline` and call `.fullAnnotate()` on sample data.

In [51]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

embeddings = RoBertaSentenceEmbeddings.pretrained("sent_roberta_base", "en") \
    .setInputCols("sentence") \
    .setOutputCol("sentence_roberta_embeddings")\
    .setCaseSensitive(True) \
    .setMaxSentenceLength(512)

pipeline = Pipeline(stages=[
    documentAssembler,
    sentence,
    embeddings
    ])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

sent_roberta_base download started this may take some time.
Approximate size to download 284.8 MB
[OK!]


In [52]:
from sparknlp.base import LightPipeline

light_model= LightPipeline(model, parse_embeddings=True)
light_result= light_model.fullAnnotate("Kindly note that the meeting is postponed to next week due to unforeseen circumstances.")[0]

🔹 Since the embedding array is stored under the `embeddings` attribute of each annotation produced by the sentence embeddings annotator, we set **parse_embeddings=True** to parse the embedding array.

In [53]:
light_result

{'document': [Annotation(document, 0, 86, Kindly note that the meeting is postponed to next week due to unforeseen circumstances., {}, [])],
 'sentence': [Annotation(document, 0, 86, Kindly note that the meeting is postponed to next week due to unforeseen circumstances., {'sentence': '0'}, [])],
 'sentence_roberta_embeddings': [Annotation(sentence_embeddings, 0, 86, Kindly note that the meeting is postponed to next week due to unforeseen circumstances., {'sentence': '0', 'token': 'Kindly note that the meeting is postponed to next week due to unforeseen circumstances.', 'pieceId': '-1', 'isWordStart': 'true'}, [-0.0037002228, -0.20959197, -0.2302702, -0.06142079, 0.13756889, 0.19806089, 0.26013276, -0.06917772, -0.08934258, -0.17655912, 0.23377429, 0.0022054226, -0.09456771, 0.09067493, -0.13810542, 0.4973059, 0.224238, -0.4671985, 0.030549344, -0.006968976, -0.25457084, 0.07531452, 0.4603146, 0.31860563, 0.11540289, 0.057101194, -0.13899031, -0.026924418, 0.17662661, 0.23354071, 0.2885

In [54]:
light_result.keys()

dict_keys(['document', 'sentence', 'sentence_roberta_embeddings'])
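Parsing the dictionary can be sketched as follows. To keep the example runnable without Spark, it uses a hand-built stand-in class, `FakeAnnotation`, that mirrors the fields of a Spark NLP annotation (`annotator_type`, `begin`, `end`, `result`, `metadata`, `embeddings`); the text and the three-value vector are made up. With a real `light_result`, the same loop should apply to the `Annotation` objects shown above.

```python
from dataclasses import dataclass, field


@dataclass
class FakeAnnotation:
    """Stand-in with the same field names as a Spark NLP Annotation (illustration only)."""
    annotator_type: str
    begin: int
    end: int
    result: str
    metadata: dict = field(default_factory=dict)
    embeddings: list = field(default_factory=list)


text = "Kindly note that the meeting is postponed."
last = len(text) - 1

# Hand-built result shaped like the fullAnnotate() output above.
fake_result = {
    "document": [FakeAnnotation("document", 0, last, text)],
    "sentence": [FakeAnnotation("document", 0, last, text, {"sentence": "0"})],
    "sentence_roberta_embeddings": [
        FakeAnnotation("sentence_embeddings", 0, last, text,
                       {"sentence": "0"}, [-0.0037, -0.2096, -0.2303]),
    ],
}

# One (sentence text, vector) pair per sentence annotation.
rows = [
    (ann.result, ann.embeddings)
    for ann in fake_result["sentence_roberta_embeddings"]
]

for sentence_text, vector in rows:
    print(sentence_text, "->", len(vector), "values")
```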

# 📍 Export and Save HuggingFace model

> We will restart the Colab session to free memory.

In [1]:
import sparknlp
from sparknlp.base import DocumentAssembler
from pyspark.ml import Pipeline
from sparknlp.annotator import Tokenizer, SentenceDetector, BertSentenceEmbeddings
import pyspark.sql.functions as F

spark = sparknlp.start()

- Let's install HuggingFace `transformers`. TensorFlow is not required by Spark NLP itself, but we need it to load and save models from HuggingFace.
- This workflow was tested with TensorFlow `2.4.1` and Transformers `4.6.1`. The cell below does not pin either version, so it installs the latest `transformers` release and uses whichever TensorFlow is already available in the environment. Later releases may work as well, but these are the versions known to have been tested successfully.

In [2]:
!pip install -q transformers

- HuggingFace comes with a native `saved_model` feature inside `save_pretrained` function for TensorFlow based models. We will use that to save it as TF `SavedModel`.
- We'll use [bert-base-cased](https://huggingface.co/bert-base-cased) model from HuggingFace as an example
- In addition to `TFBertModel` we also need to save the `BertTokenizer`. This step is the same for every model, because the tokenizer files are the assets Spark NLP needs for tokenization.

In [3]:
from transformers import TFBertModel, BertTokenizer
import tensorflow as tf


MODEL_NAME = 'bert-base-cased'

# load the tokenizer, then write its assets (vocab.txt etc.) to disk
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained('./{}_tokenizer/'.format(MODEL_NAME))

# just in case there is no TF/Keras file provided in the model,
# we can use `from_pt` and convert the PyTorch weights to TensorFlow
try:
  print('try downloading TF weights')
  model = TFBertModel.from_pretrained(MODEL_NAME)
except Exception:
  print('try downloading PyTorch weights')
  model = TFBertModel.from_pretrained(MODEL_NAME, from_pt=True)


# Define TF Signature
@tf.function(
  input_signature=[
      {
          "input_ids": tf.TensorSpec((None, None), tf.int32, name="input_ids"),
          "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"),
          "token_type_ids": tf.TensorSpec((None, None), tf.int32, name="token_type_ids"),
      }
  ]
)
def serving_fn(inputs):
    return model(inputs)

model.save_pretrained("./{}".format(MODEL_NAME), saved_model=True, signatures={"serving_default": serving_fn})

try downloading TF weights


Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


➤ Let's have a look inside these two directories and see what we are dealing with:

In [4]:
!ls -l {MODEL_NAME}

total 423360
-rw-r--r-- 1 root root       628 Mar 22 14:19 config.json
drwxr-xr-x 3 root root      4096 Mar 22 14:19 saved_model
-rw-r--r-- 1 root root 433508328 Mar 22 14:19 tf_model.h5


In [5]:
!ls -l {MODEL_NAME}/saved_model/1

total 8800
drwxr-xr-x 2 root root    4096 Mar 22 14:19 assets
-rw-r--r-- 1 root root      54 Mar 22 14:19 fingerprint.pb
-rw-r--r-- 1 root root  165091 Mar 22 14:19 keras_metadata.pb
-rw-r--r-- 1 root root 8827430 Mar 22 14:19 saved_model.pb
drwxr-xr-x 2 root root    4096 Mar 22 14:19 variables


In [6]:
!ls -l {MODEL_NAME}_tokenizer

total 220
-rw-r--r-- 1 root root    125 Mar 22 14:18 special_tokens_map.json
-rw-r--r-- 1 root root    362 Mar 22 14:18 tokenizer_config.json
-rw-r--r-- 1 root root 213450 Mar 22 14:18 vocab.txt


To load the model using Spark NLP, we need all relevant information present in the model folder.

We have everything under `saved_model/1/`, except for the vocabulary, which is in the tokenizer folder. So we can just copy `vocab.txt` to `saved_model/1/assets` and we are ready to use the `.loadSavedModel()` method.

In [7]:
!cp {MODEL_NAME}_tokenizer/vocab.txt {MODEL_NAME}/saved_model/1/assets
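Outside a notebook, where the `!cp` shell escape is not available, the same copy step can be sketched in plain Python. The helper `copy_vocab_to_assets` is hypothetical, and the demo runs on a throwaway directory that only imitates the folder layout listed above.

```python
import shutil
import tempfile
from pathlib import Path


def copy_vocab_to_assets(model_dir, tokenizer_dir):
    """Copy vocab.txt from the tokenizer folder into <model_dir>/saved_model/1/assets."""
    vocab = Path(tokenizer_dir) / "vocab.txt"
    if not vocab.is_file():
        raise FileNotFoundError(f"missing vocabulary file: {vocab}")
    assets = Path(model_dir) / "saved_model" / "1" / "assets"
    assets.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(vocab, assets / "vocab.txt"))


# Demo on a temporary directory with a two-line fake vocabulary.
with tempfile.TemporaryDirectory() as tmp:
    tokenizer_dir = Path(tmp) / "bert-base-cased_tokenizer"
    tokenizer_dir.mkdir()
    (tokenizer_dir / "vocab.txt").write_text("[PAD]\n[UNK]\n")

    copied = copy_vocab_to_assets(Path(tmp) / "bert-base-cased", tokenizer_dir)
    print(copied.relative_to(tmp))
    print(copied.read_text().splitlines())
```

Raising an error when `vocab.txt` is missing makes the problem visible at export time, before the folder is handed to `.loadSavedModel()`.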

## Import and Save BERT in Spark NLP for Sentence/Document embeddings

We use the `BertSentenceEmbeddings` annotator to load the model for sentence embeddings instead of the usual word embeddings. Internally, the `.loadSavedModel()` method uses the `pooler_output` of the TensorFlow model in place of the `last_hidden_state`, which holds one vector per token. In practice, this generates one vector for the entire sentence/document instead of one vector for each word.
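The difference between the two outputs can be illustrated with a toy example in plain Python. It assumes the standard BERT pooler, which applies a dense layer followed by `tanh` to the hidden state of the first (`[CLS]`) token. The numbers, the identity weight matrix, and the helper `pool_cls` are all made up for readability and are not taken from the real model.

```python
import math


def pool_cls(last_hidden_state, weight, bias):
    """BERT-style pooling: dense layer plus tanh on the first ([CLS]) token vector."""
    cls_vector = last_hidden_state[0]
    return [
        math.tanh(sum(w * x for w, x in zip(row, cls_vector)) + b)
        for row, b in zip(weight, bias)
    ]


# Toy sizes: 4 tokens, hidden size 3 (a real bert-base model uses 768).
last_hidden_state = [
    [0.5, -0.2, 0.1],   # [CLS]
    [0.3, 0.8, -0.5],
    [-0.1, 0.4, 0.9],
    [0.0, 0.1, 0.2],    # [SEP]
]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity, illustration only
bias = [0.0, 0.0, 0.0]

pooler_output = pool_cls(last_hidden_state, weight, bias)

print(len(last_hidden_state), "token vectors in last_hidden_state")
print(len(pooler_output), "values in the single pooled sentence vector")
```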

- `loadSavedModel` accepts two parameters: the first is the path to the TensorFlow SavedModel, and the second is the SparkSession (usually held in the `spark` variable).
- `setStorageRef` is used as a reference when adding stages that take the embeddings as input, to avoid loading a different embeddings model. It acts like an `ID` of this model.
  - We can choose any name for the `storageRef`, but it cannot be changed later on. We usually use the name of the model to be consistent, but it is not mandatory.
- The `dimension` parameter should match the dimension of the chosen model, so that the `.getDimension()` method returns the correct value.

> **NOTE**: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.
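One way to avoid guessing the value for `dimension` is to read it from the `config.json` that HuggingFace saved next to the model, where BERT-style models store it under `hidden_size`. The sketch below is plain Python; `read_hidden_size` is a hypothetical helper, and the demo writes a tiny stand-in config file instead of reading the real one, which in this notebook would be at `{MODEL_NAME}/config.json`.

```python
import json
import tempfile
from pathlib import Path


def read_hidden_size(config_path):
    """Return the `hidden_size` entry of a HuggingFace-style config.json."""
    with open(config_path, encoding="utf-8") as fh:
        config = json.load(fh)
    return int(config["hidden_size"])


# Demo on a throwaway file that imitates the relevant part of a BERT config.
with tempfile.TemporaryDirectory() as tmp:
    config_file = Path(tmp) / "config.json"
    config_file.write_text(json.dumps({"model_type": "bert", "hidden_size": 768}))
    dimension = read_hidden_size(config_file)

print(dimension)  # value to pass to .setDimension()
```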


In [8]:
from sparknlp.annotator import BertSentenceEmbeddings

sent_bert = BertSentenceEmbeddings.loadSavedModel(
     '{}/saved_model/1'.format(MODEL_NAME),
     spark
 )\
 .setInputCols("sentence")\
 .setOutputCol("bert_sentence")\
 .setCaseSensitive(True)\
 .setDimension(768)\
 .setStorageRef('sent_bert_base_cased') 

➤ Let's save it to disk so it is easier to move around and can be loaded later via the `.load` function

In [9]:
sent_bert.write().overwrite().save("./{}_spark_nlp".format(MODEL_NAME))

➤ Let's clean up stuff we don't need anymore

In [10]:
!rm -rf {MODEL_NAME}_tokenizer {MODEL_NAME}

Awesome 😎  !

This is your BERT model for Sentence/Document embeddings from HuggingFace 🤗  loaded and saved by Spark NLP 🚀 

In [11]:
! ls -l {MODEL_NAME}_spark_nlp

total 431636
-rw-r--r-- 1 root root 441980332 Mar 22 14:25 bert_sentence_tensorflow
drwxr-xr-x 4 root root      4096 Mar 22 14:25 fields
drwxr-xr-x 2 root root      4096 Mar 22 14:25 metadata


Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny BERT model 😊 

In [12]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

embeddings = BertSentenceEmbeddings.load(f"./{MODEL_NAME}_spark_nlp")\
  .setInputCols("sentence")\
  .setOutputCol("bert_sentence")\
  .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler,
                            sentence,
                            embeddings])

In [13]:
example_df = spark.createDataFrame([["Customer satisfaction always holds a top priority for the success of our company."]]).toDF("text")
result = pipeline.fit(example_df).transform(example_df)


In [14]:
result.show()

+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|       bert_sentence|
+--------------------+--------------------+--------------------+--------------------+
|Customer satisfac...|[{document, 0, 80...|[{document, 0, 80...|[{sentence_embedd...|
+--------------------+--------------------+--------------------+--------------------+



In [15]:
result.select("bert_sentence.embeddings").show(truncate=False)


In [16]:
embeddings.getStorageRef()

'sent_bert_base_cased'

In [17]:
embeddings.getDimension()

768



That's it! You can now go wild and use hundreds of BERT models from HuggingFace 🤗 in Spark NLP 🚀 