![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/16.04.UniversalSentenceEncoder.ipynb)

# **Universal Sentence Encoder**

This notebook covers the parameters and usage of the UniversalSentenceEncoder annotator.



**📖 Learning Objectives:**

1. Be able to create a pipeline for sentence embeddings using the annotator.

2. Understand how to use the annotator for predictions.

3. Become comfortable using the different parameters of the annotator.



**🔗 Helpful Links:**

- Documentation : [UniversalSentenceEncoder](https://nlp.johnsnowlabs.com/docs/en/transformers#universalsentenceencoder)



- Scala Doc : [UniversalSentenceEncoder](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/embeddings/UniversalSentenceEncoder.html)

- Python Doc : [UniversalSentenceEncoder](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/embeddings/universal_sentence_encoder/index.html#sparknlp.annotator.embeddings.universal_sentence_encoder.UniversalSentenceEncoder)


- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/open-source-nlp).

- The [Original research paper on arXiv](https://arxiv.org/abs/1803.11175)

- TensorFlow [TFHub page](https://tfhub.dev/google/universal-sentence-encoder/4)

## **📜 Background**

The Universal Sentence Encoder (USE) is a powerful natural language processing (NLP) model developed by Google's research team. It is designed to encode natural language sentences into high-dimensional vector representations, which can be used for a variety of NLP tasks such as sentiment analysis, text classification, and question answering.

One of the main strengths of the USE model is its ability to capture the semantic meaning of sentences, taking into account the context in which they appear. This makes it a good alternative to averaging word embeddings, with improved transfer learning capabilities. The model is trained on a large corpus of text data and uses a deep neural network architecture to generate sentence embeddings: dense vectors in which sentences with similar meanings end up close together.
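To make "capturing semantic meaning" concrete, sentence embeddings are usually compared with cosine similarity. The sketch below is a minimal pure-Python illustration. The three 4-dimensional vectors and their names are made up for the example and are not output of the USE model, which produces 512-dimensional vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for sentence embeddings (hypothetical values,
# chosen by hand so that the first two point in a similar direction).
refund_request = [0.9, 0.1, 0.0, 0.2]
money_back = [0.8, 0.2, 0.1, 0.3]
weather_report = [0.0, 0.9, 0.4, 0.0]

sim_related = cosine_similarity(refund_request, money_back)
sim_unrelated = cosine_similarity(refund_request, weather_report)

print(f"related:   {sim_related:.3f}")
print(f"unrelated: {sim_unrelated:.3f}")
```

With real embeddings, the same function can be applied to any two vectors of equal length; values closer to 1 indicate sentences the model considers more similar.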

The USE model has several advantages, including its ability to encode sentences from multiple languages (through its multilingual variants), its high accuracy in several NLP tasks, and its ability to handle out-of-vocabulary words. Additionally, the model can be fine-tuned for specific NLP tasks, making it highly versatile and adaptable to different use cases.

However, there are also some cons to consider when using the USE model. For example, the model requires significant computational resources to train and can be slow to generate embeddings for large volumes of text data. Additionally, the model may struggle with understanding sarcasm or irony, which can lead to inaccuracies in some NLP tasks.

Overall, the Universal Sentence Encoder is a powerful and versatile NLP model that can be a valuable tool for a range of use cases, provided its strengths and limitations are properly understood and accounted for.

## **🎬 Colab Setup**

In [1]:
!pip install -q pyspark==3.1.2 spark-nlp==4.3.0



In [2]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 4.3.0
Apache Spark version: 3.1.2


## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`

- Output: `SENTENCE_EMBEDDINGS`

### **🔎Parameters :**

`batchSize`: Size of every batch (Default depends on model).

`configProtoBytes`: ConfigProto from TensorFlow, serialized into a byte array.

`dimension`: Number of embedding dimensions (Default: `512`)

`loadSP`: Whether to load the SentencePiece ops file, which is required only by multilingual models (Default: `False`).

`storageRef`: Unique identifier for storage (Default: this.uid)

## **Creating the Spark NLP pipeline**

➤ The output of this annotator can be used as input for multi-class and multi-label text classification annotators (`ClassifierDL`, `SentimentDL`, and `MultiClassifierDL`).

In [3]:
from sparknlp.base import DocumentAssembler, EmbeddingsFinisher
from sparknlp.annotator import UniversalSentenceEncoder
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

➤ We will create DOCUMENT annotations from the example sentence, and then apply the sentence embeddings (using the pretrained model `tfhub_use`) to them.

In [32]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

use_embeddings = UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("use_embeddings")

pipeline = Pipeline(stages=[
    documentAssembler,
    use_embeddings
    ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [33]:
example_df = spark.createDataFrame([["Customer satisfaction always holds a top priority for the success of our company."]]).toDF("text")

result = pipeline.fit(example_df).transform(example_df)
result.show()

+--------------------+--------------------+--------------------+
|                text|            document|      use_embeddings|
+--------------------+--------------------+--------------------+
|Customer satisfac...|[{document, 0, 80...|[{sentence_embedd...|
+--------------------+--------------------+--------------------+



In [34]:
result.selectExpr("use_embeddings.embeddings as Embeddings").show(truncate=False)


In [35]:
use_embeddings.getDimension()

512

In [36]:
use_embeddings.getStorageRef()

'tfhub_use'

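The `tfhub_use` model is trained mainly on English text, while multilingual USE variants are trained on data from several languages. The same pipeline will still produce an embedding vector for text in another language, such as French: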

In [37]:
example_french = spark.createDataFrame([["La satisfaction du client est toujours une priorité absolue pour le succès de notre entreprise." ]]).toDF("text")

result = pipeline.fit(example_french).transform(example_french)
result.show()

+--------------------+--------------------+--------------------+
|                text|            document|      use_embeddings|
+--------------------+--------------------+--------------------+
|La satisfaction d...|[{document, 0, 94...|[{sentence_embedd...|
+--------------------+--------------------+--------------------+



In [38]:
result.selectExpr("use_embeddings.embeddings as Embeddings").show(truncate=False)


# 📍 **EmbeddingsFinisher**



- Extracts embeddings from Annotations into a more easily usable form.

- It works with the output of embedding annotators such as [WordEmbeddings](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/embeddings/WordEmbeddings.html), [BertEmbeddings](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/embeddings/BertEmbeddings.html), [SentenceEmbeddings](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/embeddings/SentenceEmbeddings.html), [ChunkEmbeddings](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/embeddings/ChunkEmbeddings.html), and `UniversalSentenceEncoder`.

- By using EmbeddingsFinisher you can easily transform your embeddings into arrays of floats or vectors that are compatible with Spark ML functions such as LDA, K-means, Random Forest classifier, or any other function that requires a `featuresCol`. This makes it straightforward to feed Spark NLP embeddings into downstream Spark ML models for text classification, sentiment analysis, and other natural language processing tasks.

For more extended examples see the [Spark NLP Workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.1_Text_classification_examples_in_SparkML_SparkNLP.ipynb).



> Input Annotator Types: `EMBEDDINGS` or `SENTENCE_EMBEDDINGS`


> Output Annotator Type: `NONE`

📌`setOutputAsVector`

`setOutputAsVector` sets a boolean parameter of EmbeddingsFinisher that controls the type of the output values. When set to `True`, each embedding is returned as a Spark ML vector (a `DenseVector`, as seen in the output further below), which Spark ML estimators can consume directly. When set to `False`, each embedding is returned as a plain array of floats.

📌 `setCleanAnnotations`

`setCleanAnnotations` controls whether the annotation columns are removed from the output DataFrame. When set to `True`, the intermediate annotation columns (here `document` and `use_embeddings`) are dropped and only the finished columns are kept next to the original text. When set to `False`, as in the example below, the annotation columns are kept.

In [39]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

use_embeddings = UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("use_embeddings")

embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols("use_embeddings") \
    .setOutputCols("finished_sentence_embeddings") \
    .setOutputAsVector(True) \
    .setCleanAnnotations(False)

pipeline = Pipeline() \
    .setStages([
        documentAssembler,
        use_embeddings,
        embeddingsFinisher])

data = spark.createDataFrame([["I love working with SparkNLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.show()

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
+--------------------+--------------------+--------------------+----------------------------+
|                text|            document|      use_embeddings|finished_sentence_embeddings|
+--------------------+--------------------+--------------------+----------------------------+
|I love working wi...|[{document, 0, 27...|[{sentence_embedd...|        [[0.0165132191032...|
+--------------------+--------------------+--------------------+----------------------------+



In [40]:
result.select("finished_sentence_embeddings").show(truncate=False)


In [41]:
result.select('finished_sentence_embeddings').take(1)

[Row(finished_sentence_embeddings=[DenseVector([0.0165, 0.0693, -0.0144, 0.0059, 0.0267, 0.0121, -0.0503, -0.0437, -0.0281, 0.0004, 0.0041, -0.0881, -0.0221, -0.0549, 0.0356, 0.0591, 0.0197, 0.0166, 0.0111, -0.0827, -0.003, -0.0088, -0.0271, -0.0063, -0.0354, 0.0606, -0.0286, -0.0965, -0.0554, -0.0355, 0.0175, -0.0692, 0.0166, -0.007, -0.0503, 0.0255, 0.0285, 0.0053, -0.0973, -0.0264, 0.0517, -0.0672, -0.0486, 0.0573, 0.0481, 0.0279, -0.0515, 0.0649, 0.012, -0.0387, 0.0567, 0.0178, 0.0604, 0.0117, 0.1102, -0.0352, 0.0664, -0.0485, -0.0526, 0.0194, -0.03, 0.0053, 0.005, -0.0642, -0.0082, -0.0449, 0.0653, 0.0166, -0.0322, 0.0509, -0.0029, -0.0178, 0.0111, 0.0857, -0.0073, 0.0633, 0.0641, 0.0058, -0.0127, -0.0221, 0.0677, -0.033, 0.0052, -0.0568, 0.0562, -0.0047, 0.0425, 0.0748, -0.0416, -0.0628, -0.0007, -0.0339, 0.0274, -0.0032, -0.0869, -0.0002, -0.0817, -0.0368, 0.0458, -0.0029, -0.0106, -0.0689, 0.0434, 0.0014, 0.017, -0.0084, 0.034, 0.0177, -0.0349, 0.0414, 0.0479, 0.0238, 0.0675, 0

In [42]:
result.select("document.result", "finished_sentence_embeddings").show()

+--------------------+----------------------------+
|              result|finished_sentence_embeddings|
+--------------------+----------------------------+
|[I love working w...|        [[0.0165132191032...|
+--------------------+----------------------------+



#  📍Using LightPipeline

[LightPipelines](https://nlp.johnsnowlabs.com/docs/en/concepts#using-spark-nlps-lightpipeline) are Spark NLP specific Pipelines, equivalent to Spark ML Pipeline, but meant to deal with smaller amounts of data. They’re useful when working with small datasets, debugging results, or when running either training or prediction from an API that serves one-off requests.

Spark NLP LightPipelines are Spark ML pipelines converted to run on a single machine as a multi-threaded task, **becoming more than 10x faster** for smaller amounts of data (small is relative, but roughly 50k sentences is a good upper limit). To use one, we simply plug in a trained (fitted) pipeline and then annotate plain text. There is no need to convert the input text to a DataFrame, even though the underlying pipeline expects a DataFrame as input. This is convenient when you need predictions for a few lines of text from a trained ML model.

For more details, check the following 
[Medium post](https://medium.com/spark-nlp/spark-nlp-101-lightpipeline-a544e93f20f1).

This class accepts a string or a list of strings as input, without the need to transform your text into a Spark DataFrame. The [.annotate()](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/light_pipeline/index.html#sparknlp.base.light_pipeline.LightPipeline.annotate) method returns a dictionary (or a list of dictionaries if a list is passed as input) with the results of each step in the pipeline. To retrieve all metadata from the annotators in the result, use the method [.fullAnnotate()](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/light_pipeline/index.html#sparknlp.base.light_pipeline.LightPipeline.fullAnnotate) instead, which always returns a list.

To extract the results from the object, you just need to parse the dictionary.

In [43]:
from sparknlp.base import LightPipeline

In [44]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

use_embeddings = UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("use_embeddings")

pipeline = Pipeline(stages=[
    documentAssembler,
    use_embeddings
    ])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [45]:
light_model = LightPipeline(model, parse_embeddings=True)
light_result = light_model.fullAnnotate("Kindly note that the meeting is postponed to next week due to unforeseen circumstances.")[0]

🔹 The embedding array is stored in the `embeddings` attribute of each annotation produced by the UniversalSentenceEncoder annotator. Because `parse_embeddings` defaults to `False`, we set **parse_embeddings=True** so that the array is included in the result.

In [46]:
light_result

{'document': [Annotation(document, 0, 86, Kindly note that the meeting is postponed to next week due to unforeseen circumstances., {}, [])],
 'use_embeddings': [Annotation(sentence_embeddings, 0, 86, Kindly note that the meeting is postponed to next week due to unforeseen circumstances., {'sentence': '0', 'token': 'Kindly note that the meeting is postponed to next week due to unforeseen circumstances.', 'pieceId': '-1', 'isWordStart': 'true'}, [0.0065772845, 0.0029609199, -0.043464486, -0.053878833, 0.017987942, -0.06322443, -0.011510993, 0.035108045, -0.029628625, -0.005868967, 0.010644218, 0.097605065, -0.029929617, 0.055966973, -0.063024625, -0.016434029, -0.052687094, 0.023096375, 0.048820704, -0.045458067, 0.06951576, -0.018010652, -0.043618567, 0.07199794, -0.058524065, 0.07317291, -0.025338782, -0.047432583, 0.03209837, 0.01982927, 0.014966904, -0.05659649, -0.087943986, -0.0009597439, -0.080084145, 0.027535185, 0.026820667, -0.015512441, -0.059884626, -0.015148148, -0.08833955,

In [47]:
light_result["use_embeddings"]

[Annotation(sentence_embeddings, 0, 86, Kindly note that the meeting is postponed to next week due to unforeseen circumstances., {'sentence': '0', 'token': 'Kindly note that the meeting is postponed to next week due to unforeseen circumstances.', 'pieceId': '-1', 'isWordStart': 'true'}, [0.0065772845, 0.0029609199, -0.043464486, -0.053878833, 0.017987942, -0.06322443, -0.011510993, 0.035108045, -0.029628625, -0.005868967, 0.010644218, 0.097605065, -0.029929617, 0.055966973, -0.063024625, -0.016434029, -0.052687094, 0.023096375, 0.048820704, -0.045458067, 0.06951576, -0.018010652, -0.043618567, 0.07199794, -0.058524065, 0.07317291, -0.025338782, -0.047432583, 0.03209837, 0.01982927, 0.014966904, -0.05659649, -0.087943986, -0.0009597439, -0.080084145, 0.027535185, 0.026820667, -0.015512441, -0.059884626, -0.015148148, -0.08833955, -0.014184727, 0.036851924, 0.014327133, -0.020741101, 0.033216562, 0.024093173, -0.07318729, -0.0057069194, -0.07171305, 0.043497253, 0.060036365, 0.0019522051
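Parsing the dictionary mostly comes down to reading the `result` and `embeddings` attributes of each annotation. The sketch below shows the idea without needing a Spark session. `FakeAnnotation`, `sentence_vectors`, and the shortened three-value vector are hypothetical stand-ins made up for this example. It assumes the real annotation objects expose `result` and `embeddings` attributes in the same way.

```python
from dataclasses import dataclass, field


@dataclass
class FakeAnnotation:
    """Hypothetical stand-in for a Spark NLP annotation (illustration only)."""
    annotator_type: str
    begin: int
    end: int
    result: str
    metadata: dict = field(default_factory=dict)
    embeddings: list = field(default_factory=list)


def sentence_vectors(full_result, column="use_embeddings"):
    """Return (text, vector) pairs from one fullAnnotate-style result dict."""
    return [(a.result, list(a.embeddings)) for a in full_result.get(column, [])]


text = "Kindly note that the meeting is postponed."

# Hand-built dictionary mimicking the shape of `light_result`; the vector is
# cut to three made-up values instead of the full 512 dimensions.
fake_result = {
    "document": [FakeAnnotation("document", 0, len(text) - 1, text)],
    "use_embeddings": [
        FakeAnnotation(
            "sentence_embeddings", 0, len(text) - 1, text,
            {"sentence": "0"}, [0.0066, 0.0030, -0.0435],
        )
    ],
}

pairs = sentence_vectors(fake_result)
for sentence, vector in pairs:
    print(sentence, len(vector), vector)
```

Under the same assumption, calling `sentence_vectors(light_result)` on the real result would return the sentence text together with its full embedding vector.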

That's it! You can now use the UniversalSentenceEncoder annotator in Spark NLP!