![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/16.02.ChunkEmbeddings.ipynb)

# **ChunkEmbeddings**

This notebook covers the parameters and usage of `ChunkEmbeddings`. The annotator uses token-level embeddings (for example from WordEmbeddings or BertEmbeddings) to generate chunk embeddings from the output of [Chunker](https://nlp.johnsnowlabs.com/docs/en/annotators#chunker), [NGramGenerator](https://nlp.johnsnowlabs.com/docs/en/annotators#ngramgenerator), or [NerConverter](https://nlp.johnsnowlabs.com/docs/en/annotators#nerconverter).

**📖 Learning Objectives:**

1. Gain an understanding of how to generate embeddings for various annotator chunk outputs.

2. Learn how to generate embeddings for the chunks extracted by a NER model.

3. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation: [ChunkEmbeddings](https://nlp.johnsnowlabs.com/docs/en/annotators#chunkembeddings)

- Python Docs: [ChunkEmbeddings](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/embeddings/chunk_embeddings/index.html)

- Scala Docs: [ChunkEmbeddings](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/embeddings/ChunkEmbeddings)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public).

## **📜 Background**


In NLP, embeddings are methods that map words (or groups of words) from a corpus to a high-dimensional vector space in such a way that semantically similar words are mapped to vectors that lie close together. The distance between vectors can be measured in various ways; the most common are the Euclidean distance and the cosine distance.
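The two distance measures mentioned above can be sketched in plain Python. This is a minimal illustration only: the function names and the toy 3-dimensional vectors are invented for this example and do not come from any pretrained model.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for same direction, 1 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy vectors, invented for illustration (real embeddings have 100+ dimensions)
king = [0.8, 0.6, 0.1]
queen = [0.7, 0.7, 0.1]
banana = [0.1, 0.0, 0.9]

print("king vs queen :", euclidean_distance(king, queen), cosine_distance(king, queen))
print("king vs banana:", euclidean_distance(king, banana), cosine_distance(king, banana))
```

With these made-up values, the "king" and "queen" vectors end up much closer to each other than either is to "banana" under both measures, which is the behaviour one hopes for from real embeddings.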

The input text is split into tokens by the *Tokenizer*, and embeddings are created for these tokens. Next, meaningful phrases (chunks) are extracted using one of the annotators *Chunker*, *NGramGenerator*, or *NerConverter*. The chunk embeddings are then calculated by aggregating the vectors of the tokens in each chunk, using either an average or a sum pooling strategy.
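The aggregation step can be sketched as follows. This is a simplified, pure-Python illustration of average and sum pooling, not the Spark NLP implementation; the helper `pool_chunk` and the toy vectors are hypothetical.

```python
def pool_chunk(token_vectors, strategy="AVERAGE"):
    """Combine the token vectors of one chunk into a single chunk vector."""
    if not token_vectors:
        raise ValueError("a chunk needs at least one token vector")
    # Add the vectors dimension by dimension
    summed = [sum(dimension) for dimension in zip(*token_vectors)]
    if strategy == "SUM":
        return summed
    if strategy == "AVERAGE":
        return [value / len(token_vectors) for value in summed]
    raise ValueError(f"unknown pooling strategy: {strategy}")

# Toy 3-dimensional vectors for the two tokens of the chunk "a sentence"
vector_a = [1.0, 2.0, 3.0]
vector_sentence = [3.0, 4.0, 5.0]

print(pool_chunk([vector_a, vector_sentence], "AVERAGE"))  # [2.0, 3.0, 4.0]
print(pool_chunk([vector_a, vector_sentence], "SUM"))      # [4.0, 6.0, 8.0]
```

The two strategies differ only by a factor equal to the number of tokens in the chunk, so the sum keeps information about chunk length while the average does not.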


## **🎬 Colab Setup**

In [None]:
!pip install -q pyspark==3.1.2 spark-nlp==4.2.4

In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

spark = sparknlp.start()

## **🖨️ Input/Output Annotation Types**

- Input: `CHUNK`, `WORD_EMBEDDINGS`

- Output: `WORD_EMBEDDINGS`

## **🔎 Parameters**


- `lazyAnnotator`: (BooleanParam) Whether this AnnotatorModel acts as lazy in RecursivePipelines (Default: false).

- `poolingStrategy`: (String) The aggregation method for the word embeddings to create the chunk embeddings, options are AVERAGE or SUM (Default: AVERAGE).

- `inputCols`: (Array) Input annotation columns containing the chunks and the word embeddings to be aggregated, for example ['chunk', 'word_embeddings'].

- `outputCol`: (String) Name of the output annotation column, for example 'chunk_embeddings'. If it is not set, a default name is used.


## ChunkEmbeddings for Chunker output

The [Chunker](https://nlp.johnsnowlabs.com/docs/en/annotators#chunker) annotator matches a pattern of part-of-speech tags in order to return meaningful phrases from the document. 
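The idea of matching a pattern over part-of-speech tags can be illustrated without Spark. The sketch below is a rough approximation, not the Chunker's actual implementation: the helper `chunk_by_pos` is hypothetical, the POS tags are assumed by hand rather than produced by a tagger, and tags are matched exactly (so `<NN>` does not match `NNS` here).

```python
import re

def chunk_by_pos(tokens, tags, pattern):
    """Return token spans whose POS-tag sequence matches a <TAG>-style pattern."""
    # Wrap each <TAG> in a group so that ?, * and + apply to the whole tag
    regex = re.compile(re.sub(r"<([^>]+)>", r"(?:<\1>)", pattern))
    # Encode the tag sequence as one string, e.g. "<DT><VBZ><DT><NN><.>"
    tag_string = "".join(f"<{tag}>" for tag in tags)
    # Character offset at which each tag starts inside tag_string
    starts, offset = [], 0
    for tag in tags:
        starts.append(offset)
        offset += len(tag) + 2
    chunks = []
    for match in regex.finditer(tag_string):
        if match.end() == match.start():
            continue  # ignore empty matches
        first = starts.index(match.start())
        last = max(i for i, start in enumerate(starts) if start < match.end())
        chunks.append(" ".join(tokens[first:last + 1]))
    return chunks

tokens = ["This", "is", "a", "sentence", "."]
tags = ["DT", "VBZ", "DT", "NN", "."]  # assumed tags, written by hand

print(chunk_by_pos(tokens, tags, "<DT>?<JJ>*<NN>+"))  # ['a sentence']
```

The pattern reads as "an optional determiner, any number of adjectives, then one or more nouns", which is why only "a sentence" is returned for this input.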

In [None]:
documentAssembler = DocumentAssembler() \
      .setInputCol("text") \
      .setOutputCol("document")

sentenceDetector = SentenceDetector() \
      .setInputCols(["document"]) \
      .setOutputCol("sentence")

tokenizer = Tokenizer() \
      .setInputCols(["sentence"]) \
      .setOutputCol("token")

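# Pretrained part-of-speech tagger; its tags drive the chunk pattern below
posTagger = PerceptronModel.pretrained() \
      .setInputCols(["sentence", "token"]) \
      .setOutputCol("pos")

# Pattern: optional determiner, any number of adjectives, one or more nouns
chunker = Chunker() \
      .setInputCols(["sentence", "pos"]) \
      .setOutputCol("chunk") \
      .setRegexParsers(["<DT>?<JJ>*<NN>+"])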

wordEmbeddings = WordEmbeddingsModel.pretrained() \
      .setInputCols(["sentence", "token"]) \
      .setOutputCol("word_embeddings") \
      .setCaseSensitive(False)

chunkEmbeddings = ChunkEmbeddings() \
      .setInputCols(["chunk", "word_embeddings"]) \
      .setOutputCol("chunk_embeddings") \
      .setPoolingStrategy("AVERAGE")

pipeline = Pipeline().setStages([
          documentAssembler,
          sentenceDetector,
          tokenizer,
          posTagger,
          chunker,
          wordEmbeddings,
          chunkEmbeddings
      ])

pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


In [None]:
data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)

In [None]:
result.selectExpr("explode(chunk_embeddings) as result") \
    .select("result.annotatorType", "result.begin", "result.end", "result.result", "result.metadata", "result.embeddings") \
    .show(6, 100)

+---------------+-----+---+----------+------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|  annotatorType|begin|end|    result|                                                                            metadata|                                                                                          embeddings|
+---------------+-----+---+----------+------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|word_embeddings|    8| 17|a sentence|{chunk -> 0, pieceId -> -1, isWordStart -> true, token -> a sentence, sentence -> 0}|[0.17417, 0.095253006, -0.0530925, -0.218465, 0.714395, 0.79860497, 0.012999997, 0.48517, -0.5434...|
+---------------+-----+---+----------+--------------------------------------------------------------

## ChunkEmbeddings for NGramGenerator output

[NGramGenerator](https://nlp.johnsnowlabs.com/docs/en/annotators#ngramgenerator) is a feature transformer that converts the input array of strings into an array of n-grams. 
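As a rough illustration of what an n-gram transformation does, the sketch below slides a window of size `n` over a token list. The helper `generate_ngrams` is hypothetical and only mimics the idea; it is not the annotator's implementation.

```python
def generate_ngrams(tokens, n=2, delimiter=" "):
    """Join every run of n consecutive tokens into one n-gram string."""
    return [delimiter.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["This", "is", "a", "sentence", "."]
bigrams = generate_ngrams(tokens, n=2)

print(bigrams)  # ['This is', 'is a', 'a sentence', 'sentence .']
```

Each of these n-grams is then treated as a chunk, so `ChunkEmbeddings` produces one vector per n-gram.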

In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

nGrams = NGramGenerator() \
    .setInputCols(["token"]) \
    .setOutputCol("chunk") \
    .setN(2)

embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("word_embeddings") \
    .setCaseSensitive(False)

chunkEmbeddings = ChunkEmbeddings() \
    .setInputCols(["chunk", "word_embeddings"]) \
    .setOutputCol("chunk_embeddings") \
    .setPoolingStrategy("AVERAGE")

pipeline = Pipeline().setStages([
      documentAssembler,
      sentenceDetector,
      tokenizer,
      nGrams,
      embeddings,
      chunkEmbeddings
    ])

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


In [None]:
data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)

In [None]:
result.selectExpr("explode(chunk_embeddings) as result") \
    .select("result.annotatorType", "result.begin", "result.end", "result.result", "result.metadata", "result.embeddings") \
    .show(6, 100)

+---------------+-----+---+----------+------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|  annotatorType|begin|end|    result|                                                                            metadata|                                                                                          embeddings|
+---------------+-----+---+----------+------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|word_embeddings|    0|  6|   This is|   {chunk -> 0, pieceId -> -1, isWordStart -> true, token -> This is, sentence -> 0}|[-0.55661, 0.42829502, 0.86661, -0.409785, 0.06316501, 0.120775, -0.0732005, 0.47754502, 0.033937...|
|word_embeddings|    5|  8|      is a|      {chunk -> 1, pieceId -> -1, isWordStart -> true, token -

## ChunkEmbeddings for NerConverter output

In this case, the chunks are NER entities, extracted by a pipeline built around a NER model. The [NerConverter](https://nlp.johnsnowlabs.com/docs/en/annotators#nerconverter) transforms an IOB or IOB2 representation of NER to a user-friendly one, by associating the tokens of recognized entities and their labels. 
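The conversion from token-level IOB tags to entity chunks can be sketched in plain Python. The helper `iob_to_chunks` is hypothetical and simplified, and the tags in the example are written by hand for illustration; they are not the output of a NER model.

```python
def iob_to_chunks(tokens, tags):
    """Merge tokens tagged B-X / I-X into (chunk text, label) pairs."""
    chunks = []
    current_tokens, current_label = [], None

    def flush():
        # Store the entity collected so far, if any
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_new:
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)  # continue the current entity
        else:
            flush()  # an "O" tag closes any open entity
            current_tokens, current_label = [], None
    flush()
    return chunks

tokens = ["Short", "sellers", ",", "Wall", "Street's", "dwindling", "band"]
tags = ["O", "O", "O", "B-ORG", "I-ORG", "O", "O"]  # hand-written example tags

print(iob_to_chunks(tokens, tags))  # [("Wall Street's", 'ORG')]
```

The resulting chunks are what `ChunkEmbeddings` pools over: one embedding per recognized entity.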

### Extract NER entities using a CRF model

For more information about the NER model used below, see [NerCrfModel](https://nlp.johnsnowlabs.com/docs/en/annotators#nercrf). For details on how to choose the pipeline stages, see this [Spark NLP Colab Notebook](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/model-downloader/Running_Pretrained_pipelines.ipynb).

In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

wordEmbeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("word_embeddings") \
    .setCaseSensitive(False)

posTagger = PerceptronModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos")

# Use the sentence column so the NER input matches the tokenizer and POS tagger
nerModel = NerCrfModel.pretrained() \
    .setInputCols(["sentence", "token", "pos", "word_embeddings"]) \
    .setOutputCol("ner")

nerConverter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("chunk") 

chunkEmbeddings = ChunkEmbeddings() \
    .setInputCols(["chunk", "word_embeddings"]) \
    .setOutputCol("chunk_embeddings") \
    .setPoolingStrategy("AVERAGE")

pipeline = Pipeline().setStages([
      documentAssembler,
      sentenceDetector,
      tokenizer,
      posTagger,
      wordEmbeddings,
      nerModel,
      nerConverter,
      chunkEmbeddings
    ])

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[OK!]
ner_crf download started this may take some time.
Approximate size to download 10.2 MB
[OK!]


In [None]:
data = spark.createDataFrame([["Short sellers, Wall Street's dwindling  band of ultra cynics, are seeing green again."]]).toDF("text")
result = pipeline.fit(data).transform(data)

In [None]:
result.selectExpr("explode(chunk_embeddings) as result") \
    .select("result.annotatorType", "result.begin", "result.end", "result.result", "result.metadata", "result.embeddings") \
    .show(6, 100)

+---------------+-----+---+-------------+---------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|  annotatorType|begin|end|       result|                                                                               metadata|                                                                                          embeddings|
+---------------+-----+---+-------------+---------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|word_embeddings|   15| 27|Wall Street's|{chunk -> 0, pieceId -> -1, isWordStart -> true, token -> Wall Street's, sentence -> 0}|[0.21384, 0.22098, 0.037105, -0.29186, -0.030131, -0.16247, -1.1043, -0.88436, -0.078059, -0.6353...|
+---------------+-----+---+-------------+-----------------------------------

### Extract NER entities using a NerDL model

Here the word embeddings are generated with BERT and the entities are predicted by a deep-learning NER model (NerDL). Refer to this [Spark NLP Workshop Notebook (11)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/11.Text_Similarities_and_dimension_reduction_visualizations_for_Embeddings.ipynb#scrollTo=2AkxJDshietW) for additional information.

In [None]:
# Named documentAssembler so that it matches the name used in the pipeline stages below
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

wordEmbeddings = BertEmbeddings.pretrained("bert_base_cased", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("word_embeddings") \
    .setCaseSensitive(False)

nerModel = NerDLModel.pretrained("ner_dl_bert", "en") \
        .setInputCols(["document", "token", "word_embeddings"]) \
        .setOutputCol("ner")

nerConverter = NerConverter() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("entities")

chunkEmbeddings = ChunkEmbeddings() \
    .setInputCols(["entities", "word_embeddings"]) \
    .setOutputCol("chunk_embeddings") \
    .setPoolingStrategy("AVERAGE")

pipeline = Pipeline().setStages([
      documentAssembler,
      tokenizer,
      wordEmbeddings,
      nerModel,
      nerConverter,
      chunkEmbeddings
    ])


bert_base_cased download started this may take some time.
Approximate size to download 389.1 MB
[OK!]
ner_dl_bert download started this may take some time.
Approximate size to download 15.4 MB
[OK!]


In [None]:
data = spark.createDataFrame([["Stocks ended slightly higher on Friday but stayed near lows for the year as oil prices surged past  #36;46 a barrel, offsetting a positive outlook from computer maker Dell Inc. (DELL.O)   "]]).toDF("text")
result = pipeline.fit(data).transform(data)

In [None]:
result.selectExpr("explode(chunk_embeddings) as result") \
    .select("result.annotatorType", "result.begin", "result.end", "result.result", "result.metadata", "result.embeddings") \
    .show(6, 100)

+---------------+-----+---+------+--------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|  annotatorType|begin|end|result|                                                                        metadata|                                                                                          embeddings|
+---------------+-----+---+------+--------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|word_embeddings|  178|183|DELL.O|{chunk -> 0, pieceId -> -1, isWordStart -> true, token -> DELL.O, sentence -> 0}|[-0.31111056, -0.68521374, -0.49221164, 0.12427488, 0.54703796, -0.28151292, -0.535245, 0.2951617...|
+---------------+-----+---+------+--------------------------------------------------------------------------------+-----------------