![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/15.02.WordEmbeddings.ipynb)

#  **WordEmbeddingsModel**

This notebook will cover the different parameters and usages of WordEmbeddingsModel. 

**📖 Learning Objectives:**

1. Understand how GloVe-based models can be loaded with this annotator.

2. Understand how to transfer Transformers to Spark NLP

3. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

- Documentation : [WordEmbeddings](https://nlp.johnsnowlabs.com/docs/en/annotators#wordembeddings)


- Python Doc : [WordEmbeddings](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/embeddings/word_embeddings/index.html#sparknlp.annotator.embeddings.word_embeddings.WordEmbeddingsModel)


- Scala Doc : [WordEmbeddings](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/embeddings/WordEmbeddingsModel.html)

- For extended examples of usage, see the [Spark NLP Workshop Repository](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/3.SparkNLP_Pretrained_Models.ipynb#scrollTo=s6BhkdS2jn9T)


## **📜 Background**

- Word Embeddings lookup annotator that maps tokens to vectors.

- Pretrained models can be loaded with `WordEmbeddingsModel.pretrained()`.

- If no name is provided, the default model is **`"glove_100d"`**.

## **🎬 Colab Setup**

In [None]:
!pip install -q pyspark==3.2.1 spark-nlp==4.2.4

In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.sql import functions as F

spark = sparknlp.start()
spark

## **🖨️ Input/Output Annotation Types**



- `Input`: DOCUMENT, TOKEN

- `Output`: WORD_EMBEDDINGS

## **`🔎PARAMETERS:`**

`dimension`: Number of embedding dimensions.

`withCoverageColumn`: Adds a custom column with word coverage stats for the embedded field. This creates a new column with statistics for each row.

`overallCoverage`: Calculates overall word coverage for the whole data in the embedded field. This returns a single coverage object considering all rows in the field.

`setStoragePath`: Sets the path to a custom token lookup dictionary for embeddings.

`writeBufferSize`: Buffer size limit before dumping to disk storage while writing. The default is 10000.

`readCacheSize`: Cache size for items retrieved from storage. Larger values improve performance at the cost of higher memory consumption.

## Example Pipeline

`pretrained(name='glove_100d', lang='en', remote_loc=None)`


In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained('glove_100d', 'en') \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline() \
    .setStages([
        documentAssembler,
        tokenizer,
        embeddings])

data = spark.createDataFrame([["I love working with SparkNLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.show()

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
+--------------------+--------------------+--------------------+--------------------+
|                text|            document|               token|          embeddings|
+--------------------+--------------------+--------------------+--------------------+
|I love working wi...|[{document, 0, 27...|[{token, 0, 0, I,...|[{word_embeddings...|
+--------------------+--------------------+--------------------+--------------------+



In [None]:
result_df = result.select(F.explode(F.arrays_zip(result.token.result, 
                                                 result.embeddings.embeddings)).alias("cols")) \
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("word_embeddings"))

result_df.show(truncate=100)

+--------+----------------------------------------------------------------------------------------------------+
|   token|                                                                                     word_embeddings|
+--------+----------------------------------------------------------------------------------------------------+
|       I|[-0.046539, 0.61966, 0.56647, -0.46584, -1.189, 0.44599, 0.066035, 0.3191, 0.14679, -0.22119, 0.7...|
|    love|[0.25975, 0.55833, 0.57986, -0.21361, 0.13084, 0.94385, -0.42817, -0.3742, -0.094499, -0.43344, -...|
| working|[0.076552, 0.17843, -0.44464, 0.085718, 0.28268, -0.30546, -0.30637, 0.36632, -0.19919, 0.35636, ...|
|    with|[-0.43608, 0.39104, 0.51657, -0.13861, 0.2029, 0.50723, -0.012544, 0.22948, -0.6316, 0.21199, -0....|
|SparkNLP|[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0...|
+--------+----------------------------------------------------------------------------------------------------+

If a token is not found in the dictionary, as with `SparkNLP` here, the result is a zero vector of the same dimension.
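The zero-vector fallback can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark NLP's implementation: the `lookup` helper, the toy dictionary, and its 4-dimensional vectors are all made up for illustration.

```python
# Toy lookup table: tokens and 4-dimensional vectors are invented for this sketch.
DIMENSION = 4
toy_vectors = {
    "I": [0.1, 0.2, 0.3, 0.4],
    "love": [0.5, 0.6, 0.7, 0.8],
}


def lookup(token, vectors=toy_vectors, dimension=DIMENSION):
    """Return the stored vector, or a zero vector of the same size for unknown tokens."""
    return vectors.get(token, [0.0] * dimension)


known = lookup("love")        # token present in the dictionary
unknown = lookup("SparkNLP")  # token missing from the dictionary

print(known)    # [0.5, 0.6, 0.7, 0.8]
print(unknown)  # [0.0, 0.0, 0.0, 0.0]
```

Because the fallback has the same length as every real vector, downstream code can treat all rows uniformly, but an all-zero vector carries no information about the token.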

## dimension


`dimension`: `getDimension()` returns the number of embedding dimensions of the model.

In [None]:
embeddings.getDimension()

100

## withCoverageColumn

`withCoverageColumn(dataset, embeddings_col, output_col='coverage')`: Adds a custom column with word coverage stats for the embedded field. This creates a new column with statistics for each row.

In [None]:
wordsCoverage = WordEmbeddingsModel.withCoverageColumn(result, "embeddings", "cov_embeddings")
wordsCoverage.select("text","cov_embeddings").show(truncate=False)

+----------------------------+--------------+
|text                        |cov_embeddings|
+----------------------------+--------------+
|I love working with SparkNLP|{4, 5, 0.8}   |
+----------------------------+--------------+



The coverage struct reads `{covered, total, percentage}`: 4 of the 5 tokens were found in the vocabulary and one (`SparkNLP`) was out of vocabulary, so 80% of the tokens were covered.

## overallCoverage

`overallCoverage(dataset, embeddings_col)`: Calculates overall word coverage for the whole data in the embedded field.

This returns a single coverage object considering all rows in the field.

In [None]:
wordsOverallCoverage = WordEmbeddingsModel.overallCoverage(
    result,"embeddings"
).percentage
wordsOverallCoverage

0.8
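To make the difference between per-row and overall coverage concrete, here is a plain-Python sketch of the arithmetic. It is an illustration under stated assumptions rather than Spark NLP's code: the helper names `row_coverage` and `overall_coverage` are hypothetical, each token is represented only by an out-of-vocabulary flag, and the second row is invented.

```python
def row_coverage(oov_flags):
    """Coverage for one row; oov_flags has one boolean per token (True = out of vocabulary)."""
    total = len(oov_flags)
    covered = sum(1 for is_oov in oov_flags if not is_oov)
    percentage = covered / total if total else 0.0
    return covered, total, percentage


def overall_coverage(rows):
    """Coverage over all tokens of all rows, pooled together."""
    total = sum(len(flags) for flags in rows)
    covered = sum(1 for flags in rows for is_oov in flags if not is_oov)
    percentage = covered / total if total else 0.0
    return covered, total, percentage


rows = [
    [False, False, False, False, True],  # "I love working with SparkNLP": last token is OOV
    [False, False],                      # hypothetical second row, fully covered
]

per_row = [row_coverage(flags) for flags in rows]
overall = overall_coverage(rows)

print(per_row)  # [(4, 5, 0.8), (2, 2, 1.0)]
print(overall[:2])  # (6, 7)
```

In this sketch the overall percentage is 6/7, which differs from the average of the per-row percentages (0.9), because longer rows contribute more tokens to the pooled count.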

## setStoragePath

`setStoragePath`: A custom token lookup dictionary for embeddings can be set with `setStoragePath()`. Each line of the provided file needs to have a token, followed by its vector representation, delimited by spaces:

In [None]:
# download glove and unzip it in Notebook.
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

In [None]:
with open("glove.6B.50d.txt", "r", encoding="utf-8") as f:
    # Read only the first 5000 characters; the full file is large.
    contents = f.read(5000)

In [None]:
print(contents)

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581
, 0.013441 0.23682 -0.16899 0.40951 0.63812 0.47709 -0.42852 -0.55641 -0.364 -0.23938 0.13001 -0.063734 -0.39575 -0.48162 0.23291 0.090201 -0.13324 0.078639 -0.41634 -0.15428 0.10068 0.48891 0.31226 -0.1252 -0.037512 -1.5179 0.12612 -0.02442 -0.042961 -0.28351 3.5416 -0.11956 -0.014533 -0.1499 0.21864 -0.33412 -0.13872 0.31806 0.70358 0.44858 -0.080262 0.63003 0.32111 -0.46765 0.22786 0.36034 -0.37818 -0.56657 0.044691 0.30392
. 0.15164 0.30177 -0.16763 0.17684 0.31719 0.33973 -0.43478 -0.31086 -0.44999 -0.29486 0.16608 0.11963 -0.41328 -0.42353

In [None]:
len("""the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581""".split())

51
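The count of 51 above is one token plus 50 vector components. As a rough sketch of how such a line can be checked before handing the file to `setStoragePath`, the hypothetical helper below (`parse_embedding_line` is not part of Spark NLP) splits a line into its token and vector and validates the dimension. The sample line is shortened to three values for readability.

```python
def parse_embedding_line(line, expected_dimension=None):
    """Split a 'token v1 v2 ... vN' line into (token, [floats]) and optionally check N."""
    parts = line.split()
    token = parts[0]
    values = [float(v) for v in parts[1:]]
    if expected_dimension is not None and len(values) != expected_dimension:
        raise ValueError(
            f"expected {expected_dimension} values for token {token!r}, got {len(values)}"
        )
    return token, values


# Shortened sample: the real glove.6B.50d.txt lines carry 50 values each.
sample = "the 0.418 0.24968 -0.41242"
token, vector = parse_embedding_line(sample, expected_dimension=3)

print(token)        # the
print(len(vector))  # 3
```

Running the same check with `expected_dimension=50` over the real file is one way to confirm that the value passed to `setDimension` matches the file.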

In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = WordEmbeddings() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setStoragePath("glove.6B.50d.txt", ReadAs.TEXT) \
    .setStorageRef("glove_50d") \
    .setDimension(50)

pipeline = Pipeline() \
    .setStages([
      documentAssembler,
      tokenizer,
      embeddings
    ])

data = spark.createDataFrame([["The patient was diagnosed with diabetes."]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.show()

+--------------------+--------------------+--------------------+--------------------+
|                text|            document|               token|          embeddings|
+--------------------+--------------------+--------------------+--------------------+
|The patient was d...|[{document, 0, 39...|[{token, 0, 2, Th...|[{word_embeddings...|
+--------------------+--------------------+--------------------+--------------------+



In [None]:
result_df = result.select(F.explode(F.arrays_zip(result.token.result, 
                                                 result.embeddings.embeddings)).alias("cols")) \
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("word_embeddings"))

result_df.show(truncate=100)

+---------+----------------------------------------------------------------------------------------------------+
|    token|                                                                                     word_embeddings|
+---------+----------------------------------------------------------------------------------------------------+
|      The|[0.418, 0.24968, -0.41242, 0.1217, 0.34527, -0.044457, -0.49688, -0.17862, -6.6023E-4, -0.6566, 0...|
|  patient|[1.0056, -0.013664, 0.10368, -0.86712, 0.5319, 0.84699, 0.32328, 0.6295, 1.4335, 0.38023, 0.68733...|
|      was|[0.086888, -0.19416, -0.24267, -0.33391, 0.56731, 0.39783, -0.97809, 0.03159, -0.61469, -0.31406,...|
|diagnosed|[1.2269, 0.088257, 0.23122, -0.68205, -0.64986, 1.8358, -0.24268, 0.42063, 0.19199, 0.082327, 1.3...|
|     with|[0.25616, 0.43694, -0.11889, 0.20345, 0.41959, 0.85863, -0.60344, -0.31835, -0.6718, 0.003984, -0...|
| diabetes|[0.96334, 0.32273, 0.015499, -0.66516, -1.1059, 1.9567, 0.74042, -0.30914, 1.708, 0.9

# EmbeddingsFinisher

- Extracts embeddings from Annotations into a more easily usable form.

- This is useful, for example, for the output of [WordEmbeddings](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/embeddings/WordEmbeddings.html), [BertEmbeddings](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/embeddings/BertEmbeddings.html), [SentenceEmbeddings](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/embeddings/SentenceEmbeddings.html) and [ChunkEmbeddings](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/embeddings/ChunkEmbeddings.html).

- With EmbeddingsFinisher you can transform your embeddings into arrays of floats or vectors that are compatible with Spark ML functions such as LDA, K-means, Random Forest classifier, or any other function that requires a `featuresCol`. The finished column can then feed downstream Spark ML stages for tasks such as text classification or sentiment analysis.

For more extended examples see the [Spark NLP Workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.1_Text_classification_examples_in_SparkML_SparkNLP.ipynb).



> Input Annotator Types: `EMBEDDINGS`


> Output Annotator Type: `NONE`

📌 `setOutputAsVector`

`setOutputAsVector` is a boolean parameter that controls the type of each finished embedding. When set to true, each embedding is emitted as a Spark ML vector; when set to false, it is emitted as an array of floats. In both cases the output column holds one entry per input annotation, here one per token.

📌 `setCleanAnnotations`

`setCleanAnnotations` specifies whether to remove the annotation columns from the output DataFrame. When set to true, the annotation columns are dropped and the finished column is kept alongside non-annotation columns such as `text`; when set to false, as in the example below, the `document`, `token`, and `embeddings` columns are retained.

In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained('glove_100d', 'en') \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols("embeddings") \
    .setOutputCols("finished_sentence_embeddings") \
    .setOutputAsVector(True) \
    .setCleanAnnotations(False)

pipeline = Pipeline() \
    .setStages([
        documentAssembler,
        tokenizer,
        embeddings,
        embeddingsFinisher])

data = spark.createDataFrame([["I love working with SparkNLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.show()

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
+--------------------+--------------------+--------------------+--------------------+----------------------------+
|                text|            document|               token|          embeddings|finished_sentence_embeddings|
+--------------------+--------------------+--------------------+--------------------+----------------------------+
|I love working wi...|[{document, 0, 27...|[{token, 0, 0, I,...|[{word_embeddings...|        [[-0.046539001166...|
+--------------------+--------------------+--------------------+--------------------+----------------------------+



In [None]:
result.select("finished_sentence_embeddings").show(truncate=False)


In [None]:
result.select('finished_sentence_embeddings').take(1)

[Row(finished_sentence_embeddings=[DenseVector([-0.0465, 0.6197, 0.5665, -0.4658, -1.189, 0.446, 0.066, 0.3191, 0.1468, -0.2212, 0.7924, 0.2991, 0.1607, 0.0253, 0.1868, -0.31, -0.2811, 0.6051, -1.0654, 0.5248, 0.0642, 1.0358, -0.4078, -0.3801, 0.308, 0.5996, -0.2699, -0.7603, 0.9422, -0.4692, -0.1828, 0.9065, 0.7967, 0.2482, 0.2571, 0.6232, -0.4477, 0.6536, 0.769, -0.5123, -0.4433, -0.2187, 0.3837, -1.1483, -0.944, -0.1506, 0.3001, -0.5781, 0.2017, -1.6591, -0.0792, 0.0264, 0.2205, 0.9971, -0.5754, -2.7266, 0.3145, 0.7052, 1.4381, 0.9913, 0.1398, 1.3474, -1.1753, 0.004, 1.0298, 0.0646, 0.9089, 0.8287, -0.47, -0.1058, 0.5916, -0.4221, 0.5733, -0.5411, 0.1077, 0.3978, -0.0487, 0.0646, -0.6144, -0.286, 0.5067, -0.4976, -0.8157, 0.1641, -1.963, -0.2669, -0.3759, -0.9585, -0.8584, -0.7158, -0.3234, -0.4312, 0.4139, 0.2837, -0.7093, 0.15, -0.2154, -0.3762, -0.0325, 0.8062]), DenseVector([0.2598, 0.5583, 0.5799, -0.2136, 0.1308, 0.9438, -0.4282, -0.3742, -0.0945, -0.4334, -0.2094, 0.347, 0.08

In [None]:
result.select("token.result", "finished_sentence_embeddings").show()

+--------------------+----------------------------+
|              result|finished_sentence_embeddings|
+--------------------+----------------------------+
|[I, love, working...|        [[-0.046539001166...|
+--------------------+----------------------------+

