![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/16.03.SentenceEmbeddings.ipynb)

# **SentenceEmbeddings**

This notebook will cover the different parameters and usages of `SentenceEmbeddings`, an annotator used to convert word embeddings into sentence embeddings.

**📖 Learning Objectives:**

1. Use `SentenceEmbeddings` to turn word embeddings into sentence embeddings.

2. Become comfortable using the different parameters of the annotator.

3. Convert sentence embeddings into features that can be used in Spark ML regression or clustering functions.


**🔗 Helpful Links:**

- Documentation : [SentenceEmbeddings](https://nlp.johnsnowlabs.com/docs/en/annotators#sentenceembeddings)

- Python Docs : [SentenceEmbeddings](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/python/sparknlp/annotator/embeddings/sentence_embeddings/index.html#sparknlp.annotator.embeddings.sentence_embeddings.SentenceEmbeddings)

- Scala Docs : [SentenceEmbeddings](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/embeddings/SentenceEmbeddings)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.1_Text_classification_examples_in_SparkML_SparkNLP.ipynb).

## **📜 Background**


Text embedding is one of the main steps before building any Deep Learning model in Natural Language Processing (NLP). This step of the pipeline encodes words and sentences as high-dimensional numerical vectors, a representation that downstream models can process far more effectively than raw text.

Sentence embeddings can be used for text classification, semantic similarity, clustering, and other natural language tasks. Apart from `SentenceEmbeddings`, other annotators such as `UniversalSentenceEncoder` and `BertSentenceEmbeddings` can be used for these tasks. You can find some examples of the use of sentence embeddings in the following blog posts about [text classification](https://towardsdatascience.com/text-classification-in-spark-nlp-with-bert-and-universal-sentence-encoders-e644d618ca32) and [sentence similarity](https://medium.com/spark-nlp/easy-sentence-similarity-with-bert-sentence-embeddings-using-john-snow-labs-nlu-ea078deb6ebf).

## **🎬 Colab Setup**

In [1]:
!pip install -q pyspark==3.1.2  spark-nlp==4.2.4



In [2]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import col, size

spark = sparknlp.start()

## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `WORD_EMBEDDINGS`

- Output: `SENTENCE_EMBEDDINGS`

## **🔎 Parameters**


- `dimension`: (Int) Number of embedding dimensions (Default: 100).

- `poolingStrategy`: (String) Choose how you would like to aggregate Word Embeddings to Sentence Embeddings (Default: "AVERAGE"). Can either be "AVERAGE" or "SUM".

- `storageRef`: (String) Unique identifier for storage (Default: this.uid)

## **Examples**

`SentenceEmbeddings` converts the word embeddings produced by annotators such as `WordEmbeddings` or `BertEmbeddings` into sentence or document embeddings by either summing or averaging all the word embeddings in a sentence or a document. If you use the output of `DocumentAssembler` as the input for `Tokenizer` (as in the example below), `SentenceEmbeddings` returns a single embedding vector for the whole document. If you use the output of `SentenceDetector` instead, `SentenceEmbeddings` returns one embedding vector per sentence.

`EmbeddingsFinisher` then converts the embedding annotations into plain arrays of floats (or Spark ML vectors), which are easier to work with.
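The difference between document-level and sentence-level pooling can be sketched in plain Python. This is only an illustration with made-up 3-dimensional vectors, not the Spark NLP implementation; the helper `pool_by_group` is hypothetical and simply averages the token vectors that share a group id.

```python
from collections import defaultdict

def pool_by_group(token_vectors, group_ids):
    """Average the token vectors that share a group id (hypothetical helper)."""
    groups = defaultdict(list)
    for vector, group_id in zip(token_vectors, group_ids):
        groups[group_id].append(vector)
    pooled = {}
    for group_id, vectors in groups.items():
        n = len(vectors)
        # zip(*vectors) iterates over dimensions, so this is an element-wise mean.
        pooled[group_id] = [sum(dim) / n for dim in zip(*vectors)]
    return pooled

# Toy 3-dimensional "word embeddings" for four tokens (made-up numbers).
token_vectors = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0], [0.0, 4.0, 4.0], [2.0, 0.0, 0.0]]

# Document-level pooling: every token belongs to the single document 0.
per_document = pool_by_group(token_vectors, [0, 0, 0, 0])

# Sentence-level pooling: two tokens in sentence 0, two tokens in sentence 1.
per_sentence = pool_by_group(token_vectors, [0, 0, 1, 1])

print(len(per_document), len(per_sentence))  # prints "1 2"
```

The same tokens yield one vector when grouped by document and two vectors when grouped by sentence, which mirrors the effect of feeding `SentenceEmbeddings` from `DocumentAssembler` or from `SentenceDetector`.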

In [3]:
document_assembler = DocumentAssembler() \
      .setInputCol("text") \
      .setOutputCol("document")
    
tokenizer = Tokenizer() \
      .setInputCols(["document"]) \
      .setOutputCol("token")

glove_embeddings = WordEmbeddingsModel().pretrained() \
      .setInputCols(["document",'token'])\
      .setOutputCol("embeddings")\
      .setCaseSensitive(False)

sentence_embeddings = SentenceEmbeddings() \
      .setInputCols(["document", "embeddings"]) \
      .setOutputCol("sentence_embeddings")
    
embeddings_finisher = EmbeddingsFinisher() \
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCols(["finished_sentence_embeddings"])

nlp_pipeline = Pipeline(
    stages=[document_assembler, 
            tokenizer,
            glove_embeddings,
            sentence_embeddings,
            embeddings_finisher])

text_df = spark.createDataFrame([["Sentence embeddings can be used for text classification, semantic similarity, clustering, and other natural language tasks."]]).toDF("text")

nlp_model = nlp_pipeline.fit(text_df)

results = nlp_model.transform(text_df)

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


In [4]:
results.select('sentence_embeddings').show(truncate = False)


In [5]:
results.select('finished_sentence_embeddings').show(truncate = False)


### `setPoolingStrategy()`

Sentence embeddings can be calculated either by summing up or by averaging all the word embeddings. The `setPoolingStrategy` method is used to select between these two options (by default, the pooling strategy is average).

In [6]:
document_assembler = DocumentAssembler() \
      .setInputCol("text") \
      .setOutputCol("document")
    
tokenizer = Tokenizer() \
      .setInputCols(["document"]) \
      .setOutputCol("token")

glove_embeddings = WordEmbeddingsModel().pretrained() \
      .setInputCols(["document",'token'])\
      .setOutputCol("embeddings")\
      .setCaseSensitive(False)

sentence_embeddings_1 = SentenceEmbeddings() \
      .setInputCols(["document", "embeddings"]) \
      .setOutputCol("average_sentence_embeddings")\
      .setPoolingStrategy('AVERAGE')
    
embeddings_finisher_1 = EmbeddingsFinisher() \
      .setInputCols(["average_sentence_embeddings"]) \
      .setOutputCols(["average_pooling_strategy"])

sentence_embeddings_2 = SentenceEmbeddings() \
      .setInputCols(["document", "embeddings"]) \
      .setOutputCol("sum_sentence_embeddings")\
      .setPoolingStrategy('SUM')
    
embeddings_finisher_2 = EmbeddingsFinisher() \
      .setInputCols(["sum_sentence_embeddings"]) \
      .setOutputCols(["sum_pooling_strategy"])

nlp_pipeline = Pipeline(
    stages=[document_assembler, 
            tokenizer,
            glove_embeddings,
            sentence_embeddings_1,
            embeddings_finisher_1,
            sentence_embeddings_2,
            embeddings_finisher_2])

text_df = spark.createDataFrame([["Sentence embeddings can be used for text classification, semantic similarity, clustering, and other natural language tasks."]]).toDF("text")

nlp_model = nlp_pipeline.fit(text_df)

results = nlp_model.transform(text_df)

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


In [7]:
results.select('average_pooling_strategy').show(truncate = False)


In [8]:
results.select('sum_pooling_strategy').show(truncate = False)


In [9]:
results.withColumn('average_strategy_size', size(col('average_pooling_strategy')[0]))\
       .withColumn('sum_strategy_size', size(col('sum_pooling_strategy')[0]))\
       .select('average_strategy_size', 'sum_strategy_size').show()

+---------------------+-----------------+
|average_strategy_size|sum_strategy_size|
+---------------------+-----------------+
|                  100|              100|
+---------------------+-----------------+



The pooling strategy affects the way in which sentence embeddings are calculated, but it does not affect their dimensions.
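The relationship between the two strategies can be checked on a toy example. The sketch below is plain Python with made-up 4-dimensional vectors; the function `pool` is a hypothetical stand-in for the annotator's pooling step, assuming both strategies pool the same set of token vectors.

```python
def pool(token_vectors, strategy="AVERAGE"):
    """Pool token vectors element-wise (illustrative, not the Spark NLP code)."""
    summed = [sum(dim) for dim in zip(*token_vectors)]
    if strategy == "SUM":
        return summed
    if strategy == "AVERAGE":
        return [value / len(token_vectors) for value in summed]
    raise ValueError(f"Unknown pooling strategy: {strategy}")

# Two toy "word embeddings" with four dimensions each (made-up numbers).
token_vectors = [[1.0, 2.0, 3.0, 4.0], [3.0, 0.0, 1.0, 2.0]]

average_vector = pool(token_vectors, "AVERAGE")
sum_vector = pool(token_vectors, "SUM")

print(average_vector)                           # [2.0, 1.0, 2.0, 3.0]
print(sum_vector)                               # [4.0, 2.0, 4.0, 6.0]
print(len(average_vector) == len(sum_vector))   # True
```

In this sketch the summed vector is the averaged vector multiplied by the number of tokens, so both point in the same direction. Under that assumption, cosine similarity between two pooled vectors is the same with either strategy, while magnitude-sensitive measures such as Euclidean distance are not.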

### Converting embeddings into feature vectors

To use sentence embeddings in Spark ML regression or clustering functions, the array of vectors returned by `EmbeddingsFinisher` needs to be exploded so that each row holds a single feature vector. This can be done inside or outside the NLP pipeline.

- Exploding inside the NLP pipeline:

In [10]:
from pyspark.ml.feature import SQLTransformer

In [11]:
document_assembler = DocumentAssembler() \
      .setInputCol("text") \
      .setOutputCol("document")
    
tokenizer = Tokenizer() \
      .setInputCols(["document"]) \
      .setOutputCol("token")

glove_embeddings = WordEmbeddingsModel().pretrained() \
      .setInputCols(["document",'token'])\
      .setOutputCol("embeddings")\
      .setCaseSensitive(False)

sentence_embeddings = SentenceEmbeddings() \
      .setInputCols(["document", "embeddings"]) \
      .setOutputCol("sentence_embeddings")
    
embeddings_finisher = EmbeddingsFinisher() \
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCols(["finished_sentence_embeddings"])\
      .setOutputAsVector(True)\
      .setCleanAnnotations(False)

explodeVectors = SQLTransformer(statement=
      "SELECT EXPLODE(finished_sentence_embeddings) AS features, * FROM __THIS__")

nlp_pipeline = Pipeline(
    stages=[document_assembler, 
            tokenizer,
            glove_embeddings,
            sentence_embeddings,
            embeddings_finisher,
            explodeVectors])

text_df = spark.createDataFrame([["Sentence embeddings can be used for text classification, semantic similarity, clustering, and other natural language tasks."]]).toDF("text")

nlp_model = nlp_pipeline.fit(text_df)

results = nlp_model.transform(text_df)

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


In [12]:
results.select('features').take(1)

[Row(features=DenseVector([-0.1279, 0.427, 0.2371, -0.1029, 0.2104, 0.1084, -0.1526, 0.0555, -0.0959, 0.1368, -0.0456, -0.2613, 0.1812, 0.072, 0.3772, -0.0052, 0.306, 0.0936, -0.2829, 0.2281, -0.0733, -0.2044, 0.3646, 0.2549, 0.1152, -0.1613, 0.0326, -0.1783, -0.1831, -0.1186, -0.15, 0.3012, -0.3884, -0.1927, 0.2658, 0.2158, 0.1758, 0.141, 0.0558, -0.2148, -0.2485, -0.2732, 0.0038, -0.1918, -0.1958, 0.0669, 0.2002, -0.3136, -0.2371, -0.313, 0.2229, -0.1363, 0.1933, 1.0013, -0.0893, -1.5649, 0.0389, -0.2315, 1.2152, 0.5187, -0.2449, 0.4783, -0.1146, -0.0332, 0.8213, -0.0889, 0.4015, 0.0753, 0.3691, -0.2343, -0.3215, -0.1488, 0.3019, -0.2913, 0.19, 0.0728, -0.14, -0.0247, -0.6824, -0.1224, 0.3366, -0.2685, -0.5508, 0.0507, -1.3015, 0.1768, 0.1399, -0.2645, -0.0383, -0.2531, -0.26, -0.1532, -0.0849, 0.2264, -0.1725, 0.0568, -0.3261, -0.6217, 0.4141, 0.2112]))]

- Exploding outside the NLP pipeline:

In [13]:
from pyspark.sql.functions import explode

results = results.withColumn("exploded_features", explode(results.finished_sentence_embeddings))

results.select("exploded_features").take(1)

[Row(exploded_features=DenseVector([-0.1279, 0.427, 0.2371, -0.1029, 0.2104, 0.1084, -0.1526, 0.0555, -0.0959, 0.1368, -0.0456, -0.2613, 0.1812, 0.072, 0.3772, -0.0052, 0.306, 0.0936, -0.2829, 0.2281, -0.0733, -0.2044, 0.3646, 0.2549, 0.1152, -0.1613, 0.0326, -0.1783, -0.1831, -0.1186, -0.15, 0.3012, -0.3884, -0.1927, 0.2658, 0.2158, 0.1758, 0.141, 0.0558, -0.2148, -0.2485, -0.2732, 0.0038, -0.1918, -0.1958, 0.0669, 0.2002, -0.3136, -0.2371, -0.313, 0.2229, -0.1363, 0.1933, 1.0013, -0.0893, -1.5649, 0.0389, -0.2315, 1.2152, 0.5187, -0.2449, 0.4783, -0.1146, -0.0332, 0.8213, -0.0889, 0.4015, 0.0753, 0.3691, -0.2343, -0.3215, -0.1488, 0.3019, -0.2913, 0.19, 0.0728, -0.14, -0.0247, -0.6824, -0.1224, 0.3366, -0.2685, -0.5508, 0.0507, -1.3015, 0.1768, 0.1399, -0.2645, -0.0383, -0.2531, -0.26, -0.1532, -0.0849, 0.2264, -0.1725, 0.0568, -0.3261, -0.6217, 0.4141, 0.2112]))]
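To make explicit what exploding does to the rows, here is a minimal plain-Python sketch. The helper `explode_rows` and the toy 2-dimensional vectors are hypothetical; the sketch only imitates the row-level effect of `explode` on a column that holds a list of vectors.

```python
def explode_rows(rows, column, new_column):
    """Return one output row per element of row[column] (hypothetical helper)."""
    exploded = []
    for row in rows:
        for element in row[column]:
            new_row = dict(row)            # keep all original columns
            new_row[new_column] = element  # add the single vector as a new column
            exploded.append(new_row)
    return exploded

# Made-up rows: one document with a single vector, one with two vectors.
rows = [
    {"text": "doc A", "finished_sentence_embeddings": [[0.1, 0.2]]},
    {"text": "doc B", "finished_sentence_embeddings": [[0.3, 0.4], [0.5, 0.6]]},
]

exploded = explode_rows(rows, "finished_sentence_embeddings", "features")
for row in exploded:
    print(row["text"], row["features"])
# doc A [0.1, 0.2]
# doc B [0.3, 0.4]
# doc B [0.5, 0.6]
```

With document-level embeddings each row holds exactly one vector, so exploding keeps the row count unchanged. With sentence-level embeddings a document produces one row per sentence, and a row whose list is empty produces no output row at all.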

## Training a ClassifierDL Model with Sentence Embeddings

Now we are going to use our sentence embeddings to train a ClassifierDL Model. First, we need to download our dataset.

In [14]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/news_category_train.csv

In [15]:
trainDataset = spark.read \
      .option("header", True) \
      .csv("news_category_train.csv")

trainDataset.show(10, truncate=50)

+--------+--------------------------------------------------+
|category|                                       description|
+--------+--------------------------------------------------+
|Business| Short sellers, Wall Street's dwindling band of...|
|Business| Private investment firm Carlyle Group, which h...|
|Business| Soaring crude prices plus worries about the ec...|
|Business| Authorities have halted oil export flows from ...|
|Business| Tearaway world oil prices, toppling records an...|
|Business| Stocks ended slightly higher on Friday but sta...|
|Business| Assets of the nation's retail money market mut...|
|Business| Retail sales bounced back a bit in July, and n...|
|Business|" After earning a PH.D. in Sociology, Danny Baz...|
|Business| Short sellers, Wall Street's dwindling  band o...|
+--------+--------------------------------------------------+
only showing top 10 rows



In this dataset, news articles are classified into four categories:

In [69]:
trainDataset.select('category').distinct().collect()

[Row(category='World'),
 Row(category='Sci/Tech'),
 Row(category='Sports'),
 Row(category='Business')]

Let's create our pipeline and fit our training data.

In [16]:
document_assembler = DocumentAssembler() \
      .setInputCol("description") \
      .setOutputCol("document")
    
tokenizer = Tokenizer() \
      .setInputCols(["document"]) \
      .setOutputCol("token")

glove_embeddings = WordEmbeddingsModel().pretrained() \
      .setInputCols(["document",'token'])\
      .setOutputCol("embeddings")\
      .setCaseSensitive(False)

sentence_embeddings = SentenceEmbeddings() \
      .setInputCols(["document", "embeddings"]) \
      .setOutputCol("sentence_embeddings")

classifier_dl = ClassifierDLApproach()\
      .setInputCols(["sentence_embeddings"])\
      .setOutputCol("class")\
      .setLabelColumn("category")\
      .setMaxEpochs(3)

nlp_pipeline = Pipeline(
    stages=[document_assembler, 
            tokenizer,
            glove_embeddings,
            sentence_embeddings,
            classifier_dl])

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


In [17]:
clf_model = nlp_pipeline.fit(trainDataset)

You can look at further examples in our [GitHub repo](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb) or in this [Blog post](https://towardsdatascience.com/text-classification-in-spark-nlp-with-bert-and-universal-sentence-encoders-e644d618ca32).

## Use Sentence Embeddings to Calculate Sentence Similarity

Sentence embeddings can be used to calculate similarity between sentences. For this, we need to import scipy and numpy, create a pipeline, and then use it to get embeddings from some sample sentences.

In [60]:
from scipy.spatial import distance
import numpy as np

In [61]:
document_assembler = DocumentAssembler() \
      .setInputCol("text") \
      .setOutputCol("document")
    
tokenizer = Tokenizer() \
      .setInputCols(["document"]) \
      .setOutputCol("token")

glove_embeddings = WordEmbeddingsModel().pretrained() \
      .setInputCols(["document",'token'])\
      .setOutputCol("embeddings")\
      .setCaseSensitive(False)

sentence_embeddings = SentenceEmbeddings() \
      .setInputCols(["document", "embeddings"]) \
      .setOutputCol("sentence_embeddings")

nlp_pipeline = Pipeline(
    stages=[document_assembler, 
            tokenizer,
            glove_embeddings,
            sentence_embeddings])

empty_df = spark.createDataFrame([[""]]).toDF("text")

nlp_model = nlp_pipeline.fit(empty_df)

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


In [62]:
sentence_0 = "The patient was diagnosed with diabetes."
sentence_1 = "A diabetic foot is any pathology that results directly from peripheral arterial disease."
sentence_2 = "She does physical activity every day."

sentence_df = spark.createDataFrame([[sentence_0], [sentence_1], [sentence_2]]).toDF("text")

In [63]:
sentence_df.show(truncate = False)

+----------------------------------------------------------------------------------------+
|text                                                                                    |
+----------------------------------------------------------------------------------------+
|The patient was diagnosed with diabetes.                                                |
|A diabetic foot is any pathology that results directly from peripheral arterial disease.|
|She does physical activity every day.                                                   |
+----------------------------------------------------------------------------------------+



In [64]:
embeddings = nlp_model.transform(sentence_df).select('sentence_embeddings.embeddings').collect()

Once we have the embeddings, we need to turn them into NumPy arrays in order to calculate cosine distances.

In [65]:
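# Each row holds a list with one vector per document annotation; take the first
# (and only) vector so that every array is 1-D, as distance.cosine expects.
v0 = np.array(embeddings[0]['embeddings'][0])
v1 = np.array(embeddings[1]['embeddings'][0])
v2 = np.array(embeddings[2]['embeddings'][0])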

In [66]:
similarity_a = 1 - distance.cosine(v0, v1)
similarity_b = 1 - distance.cosine(v1, v2)
similarity_c = 1 - distance.cosine(v2, v0)

In [67]:
similarity_a, similarity_b, similarity_c

(0.9166266555913121, 0.8655254538729918, 0.8406160060116418)

The pair with the highest similarity is sentence_0 and sentence_1, which makes sense because both sentences are about diabetes.
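For reference, the quantity computed above as `1 - distance.cosine(u, v)` is the cosine similarity, that is, the dot product of the two vectors divided by the product of their norms. A minimal plain-Python sketch with made-up 2-dimensional vectors (the function name `cosine_similarity` is ours, not part of scipy):

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 0.0]
b = [2.0, 0.0]   # same direction as a, different length
c = [0.0, 3.0]   # orthogonal to a

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0
```

The measure depends only on the direction of the vectors, not on their length, which is why vectors pointing the same way score 1.0 regardless of magnitude.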

You can find another example of sentence similarity in [this blog post](https://medium.com/spark-nlp/easy-sentence-similarity-with-bert-sentence-embeddings-using-john-snow-labs-nlu-ea078deb6ebf).