![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/example/python/annotation/text/english/embeddings/ChunkEmbeddings.ipynb)


# **Chunk Embeddings**

This notebook shows how to compute embeddings for text chunks (for example, matched phrases or named entities) with Spark NLP's `ChunkEmbeddings` annotator.

## **0. Colab Setup**

In [None]:
!pip install -q pyspark==3.3.0  spark-nlp==4.3.0

In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

spark

Spark NLP version:  4.2.8
Apache Spark version:  3.3.0


### **Create Spark DataFrame**

In [None]:
!wget -q -O news_category_test.csv https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_test.csv

In [None]:
import pyspark.sql.functions as F

news_df = spark.read\
                .option("header", "true")\
                .csv("news_category_test.csv")\
                .withColumnRenamed("description", "text")

news_df.show(truncate=50)

+--------+--------------------------------------------------+
|category|                                              text|
+--------+--------------------------------------------------+
|Business|Unions representing workers at Turner   Newall ...|
|Sci/Tech| TORONTO, Canada    A second team of rocketeers...|
|Sci/Tech| A company founded by a chemistry researcher at...|
|Sci/Tech| It's barely dawn when Mike Fitzpatrick starts ...|
|Sci/Tech| Southern California's smog fighting agency wen...|
|Sci/Tech|"The British Department for Education and Skill...|
|Sci/Tech|"confessed author of the Netsky and Sasser viru...|
|Sci/Tech|\\FOAF/LOAF  and bloom filters have a lot of in...|
|Sci/Tech|"Wiltshire Police warns about ""phishing"" afte...|
|Sci/Tech|In its first two years, the UK's dedicated card...|
|Sci/Tech| A group of technology companies  including Tex...|
|Sci/Tech| Apple Computer Inc.&lt;AAPL.O&gt; on  Tuesday ...|
|Sci/Tech| Free Record Shop, a Dutch music  retail chain,...|
|Sci/Tec

### **Chunk Embeddings**

The `ChunkEmbeddings` annotator uses `WordEmbeddings` or `BertEmbeddings` to generate an embedding for each chunk produced by `TextMatcher`, `RegexMatcher`, `Chunker`, `NGramGenerator`, or `NerConverter`.

`setPoolingStrategy`: Choose how to aggregate the word embeddings of a chunk's tokens into a single chunk embedding: `AVERAGE` or `SUM`.
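To make the two pooling strategies concrete, here is a minimal sketch in plain Python. It is not the annotator's implementation: the helper name `pool_chunk` and the toy 3-dimensional vectors are made up for illustration, and it assumes `AVERAGE` means the element-wise mean of a chunk's token vectors and `SUM` the element-wise sum.

```python
from typing import List


def pool_chunk(token_vectors: List[List[float]], strategy: str = "AVERAGE") -> List[float]:
    """Combine the token vectors of one chunk into a single vector (hypothetical helper)."""
    if not token_vectors:
        raise ValueError("a chunk needs at least one token vector")
    # Element-wise sum over all token vectors in the chunk.
    summed = [sum(dims) for dims in zip(*token_vectors)]
    if strategy == "SUM":
        return summed
    if strategy == "AVERAGE":
        # Divide each dimension by the number of tokens in the chunk.
        return [value / len(token_vectors) for value in summed]
    raise ValueError(f"unknown pooling strategy: {strategy}")


# Toy vectors standing in for the word embeddings of "parent" and "firm".
toy_vectors = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]

average_vector = pool_chunk(toy_vectors, "AVERAGE")
sum_vector = pool_chunk(toy_vectors, "SUM")
print(average_vector)  # [2.0, 3.0, 4.0]
print(sum_vector)      # [4.0, 6.0, 8.0]
```

With `AVERAGE`, the magnitude of the result does not grow with the number of tokens in a chunk, whereas with `SUM` longer chunks tend to produce larger vectors.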

In [None]:
news_df.take(3)

[Row(category='Business', text="Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul."),
 Row(category='Sci/Tech', text=' TORONTO, Canada    A second team of rocketeers competing for the  #36;10 million Ansari X Prize, a contest for privately funded suborbital space flight, has officially announced the first launch date for its manned rocket.'),
 Row(category='Sci/Tech', text=' A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of producing better peptides, which are short chains of amino acids, the building blocks of proteins.')]

In [None]:
entities = ['parent firm', 'economy', 'amino acids']

# Write one phrase per line; TextMatcher reads its phrases from this file.
with open('entities.txt', 'w') as f:
    for entity in entities:
        f.write(entity + '\n')

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

entity_extractor = TextMatcher() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("entities") \
    .setEntities("entities.txt") \
    .setCaseSensitive(False) \
    .setEntityValue("entities")

nlpPipeline = Pipeline(stages=[documentAssembler,
                               tokenizer,
                               entity_extractor])

result = nlpPipeline.fit(news_df).transform(news_df.limit(10))

In [None]:
result.select('entities.result').take(3)

[Row(result=['parent firm']), Row(result=[]), Row(result=['amino acids'])]
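As a rough illustration of what the matching step does conceptually, the sketch below finds phrases in a token list while ignoring case, as `setCaseSensitive(False)` requests in the pipeline above. It is plain Python with a hypothetical helper `match_phrases` and a hand-made token list; Spark NLP's `TextMatcher` is implemented differently, so treat this only as a mental model.

```python
from typing import List


def match_phrases(tokens: List[str], phrases: List[str], case_sensitive: bool = False) -> List[str]:
    """Return each phrase occurrence found as consecutive tokens (hypothetical helper)."""
    def norm(text: str) -> str:
        return text if case_sensitive else text.lower()

    normalized = [norm(token) for token in tokens]
    found = []
    for phrase in phrases:
        target = [norm(part) for part in phrase.split()]
        if not target:
            continue  # skip empty phrases
        width = len(target)
        # Slide a window of the phrase's length over the token list.
        for start in range(len(normalized) - width + 1):
            if normalized[start:start + width] == target:
                # Keep the text as it appears in the document.
                found.append(" ".join(tokens[start:start + width]))
    return found


# Hand-made tokens, loosely based on the first news row; the capitalization is altered on purpose.
toy_tokens = "talks with stricken Parent Firm Federal Mogul .".split()
matches = match_phrases(toy_tokens, ['parent firm', 'economy', 'amino acids'])
print(matches)  # ['Parent Firm']
```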

In [None]:
# Word-level GloVe vectors; ChunkEmbeddings pools these per chunk.
glove_embeddings = WordEmbeddingsModel.pretrained('glove_100d')\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")

chunk_embeddings = ChunkEmbeddings() \
    .setInputCols(["entities", "embeddings"]) \
    .setOutputCol("chunk_embeddings") \
    .setPoolingStrategy("AVERAGE")

nlpPipeline = Pipeline(stages=[documentAssembler,
                               tokenizer,
                               entity_extractor,
                               glove_embeddings,
                               chunk_embeddings])

result = nlpPipeline.fit(news_df).transform(news_df.limit(10))


glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


In [None]:
# Pair each matched chunk with its embedding vector, then explode to one row per chunk.
result_df = result.select(F.explode(F.arrays_zip(result.entities.result,
                                                 result.chunk_embeddings.embeddings)).alias("cols")) \
                  .select(F.expr("cols['0']").alias("entities"),
                          F.expr("cols['1']").alias("chunk_embeddings"))

result_df.show(truncate=100)

+-----------+----------------------------------------------------------------------------------------------------+
|   entities|                                                                                    chunk_embeddings|
+-----------+----------------------------------------------------------------------------------------------------+
|parent firm|[0.45683652, -0.105479494, -0.34525, -0.143924, -0.192452, -0.33616, -0.22334, -0.208185, -0.3673...|
|amino acids|[-0.3861, 0.054408997, -0.287795, -0.33318, 0.375065, -0.185539, -0.330525, -0.214415, -0.73892, ...|
+-----------+----------------------------------------------------------------------------------------------------+

