![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **Chunk2Doc**

This notebook covers the parameters and usage of `Chunk2Doc`. This annotator converts a `CHUNK` type column back into `DOCUMENT`, which is useful when you want to re-tokenize or run further analysis on a `CHUNK` result.

**📖 Learning Objectives:**

1. Understand the usage of the annotator.

2. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [Chunk2Doc](https://nlp.johnsnowlabs.com/docs/en/annotators#chunk2doc)

- Python Docs : [Chunk2Doc](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/chunk2_doc/index.html#sparknlp.base.chunk2_doc.Chunk2Doc)

- Scala Docs : [Chunk2Doc](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/Chunk2Doc)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb).

## **📜 Background**

In Spark ML, the machine learning algorithms are grouped in two classes: Estimators and Transformers. An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. A Transformer is an algorithm which can transform one DataFrame into another DataFrame.

Similarly, in Spark NLP, there are two types of annotators: AnnotatorApproach and AnnotatorModel.
The AnnotatorApproach extends the Estimator from Spark ML and is meant to be trained through `fit()`. The AnnotatorModel extends the Transformer and is meant to transform DataFrames through `transform()`.
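
The fit/transform contract described above can be sketched in plain Python. This is only an illustration of the pattern: `VocabularyApproach` and `VocabularyModel` are hypothetical names invented for this sketch, and neither Spark ML nor Spark NLP is used.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass
class VocabularyModel:
    """Plays the role of a Transformer / AnnotatorModel: it only transforms data."""
    vocabulary: FrozenSet[str]

    def transform(self, rows: List[str]) -> List[List[str]]:
        # Keep only the lowercased words that were seen during fit().
        return [[w for w in row.lower().split() if w in self.vocabulary] for row in rows]


class VocabularyApproach:
    """Plays the role of an Estimator / AnnotatorApproach: fit() produces a model."""

    def fit(self, rows: List[str]) -> VocabularyModel:
        # "Training" here just collects every lowercased word in the data.
        vocabulary = frozenset(w for row in rows for w in row.lower().split())
        return VocabularyModel(vocabulary)


model = VocabularyApproach().fit(["New York", "New Jersey"])
print(model.transform(["New York City"]))  # [['new', 'york']]
```

The point of the split is that the object returned by `fit()` carries the learned state, so it can be applied to new data without retraining.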

Each annotator accepts certain types of columns and outputs new columns in another type (we call this AnnotatorType).

In Spark NLP, five transformers are mainly used for getting data in or for converting data from one AnnotatorType to another. `Chunk2Doc` is one of them: it transforms the `CHUNK` type into the `DOCUMENT` type.
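
To make the type conversion concrete, here is a plain-Python sketch of what happens to a single annotation. It is not the Spark NLP implementation: the `Annotation` dataclass and `chunk_to_document` function are simplified, hypothetical stand-ins, and the assumption that text, offsets, and metadata are carried over unchanged is based on the outputs shown further down in this notebook.

```python
from dataclasses import dataclass, field, replace
from typing import Dict


@dataclass(frozen=True)
class Annotation:
    """Simplified stand-in for a Spark NLP annotation (illustrative only)."""
    annotator_type: str
    begin: int
    end: int
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)


def chunk_to_document(chunk: Annotation) -> Annotation:
    """Relabel a chunk annotation as a document, keeping text, offsets, and metadata."""
    if chunk.annotator_type != "chunk":
        raise ValueError(f"expected a chunk annotation, got {chunk.annotator_type!r}")
    return replace(chunk, annotator_type="document")


chunk = Annotation("chunk", 0, 7, "New York", {"entity": "LOC", "sentence": "0", "chunk": "0"})
doc = chunk_to_document(chunk)
print(doc.annotator_type, doc.begin, doc.end, doc.result)  # document 0 7 New York
```

Because only the type label changes, any annotator that expects `DOCUMENT` input can consume the converted column.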





## **🎬 Colab Setup**

In [None]:
!pip install -q pyspark==3.2.1 spark-nlp==4.2.4

In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()

## **🖨️ Input/Output Annotation Types**

- Input: `CHUNK`

- Output: `DOCUMENT`

## **🔎 Parameters**


- `inputCols`: (Array) Names of the annotation columns produced by previous stages; set this if they were renamed (Default: `['entities']`).

- `outputCol`: (String) Name of the output annotation column; can be left at its default (Default: `'chunkConverted'`).

## `Use Chunk2Doc with a Pretrained Pipeline`

Relevant entities are extracted with [Explain Document DL Pipeline](https://nlp.johnsnowlabs.com/2021/03/23/explain_document_dl_en.html), a pretrained pipeline in Spark NLP.

In [None]:
data = spark.createDataFrame([[1, "New York and New Jersey aren't that far apart actually."]]).toDF("id", "text")

In [None]:
pipeline = PretrainedPipeline("explain_document_dl")

explain_document_dl download started this may take some time.
Approx size to download 169.4 MB
[OK!]


In [None]:
pipelineResult = pipeline.transform(data)
pipelineResult.show(2, 100)


The relevant chunks of text are converted back into `DOCUMENT` type with `Chunk2Doc` for further processing.

In [None]:
chunkToDoc = Chunk2Doc() \
    .setInputCols("entities") \
    .setOutputCol("chunkConverted")

In [None]:
result = chunkToDoc.transform(pipelineResult)
result.selectExpr("explode(chunkConverted) as chunkConverted").show(truncate=False)

+------------------------------------------------------------------------------+
|chunkConverted                                                                |
+------------------------------------------------------------------------------+
|{document, 0, 7, New York, {entity -> LOC, sentence -> 0, chunk -> 0}, []}    |
|{document, 13, 22, New Jersey, {entity -> LOC, sentence -> 0, chunk -> 1}, []}|
+------------------------------------------------------------------------------+
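
The `begin` and `end` values in this output are character offsets into the original text, with an inclusive `end`. A small plain-Python check, using only the sentence and the offsets printed above (the helper name `span_text` is made up for this sketch), shows how to recover each span:

```python
def span_text(text: str, begin: int, end: int) -> str:
    """Return the substring covered by an annotation whose end offset is inclusive."""
    return text[begin:end + 1]


text = "New York and New Jersey aren't that far apart actually."

# (begin, end, result) triples copied from the output above.
converted = [(0, 7, "New York"), (13, 22, "New Jersey")]

for begin, end, result in converted:
    # The slice needs end + 1 because Python slices exclude the upper bound.
    assert span_text(text, begin, end) == result
    print(begin, end, repr(span_text(text, begin, end)))
```

Note that the converted documents keep offsets relative to the original sentence rather than restarting at 0.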



## `Use Chunk2Doc with a NER Pipeline`

In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

wordEmbeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("word_embeddings") \
    .setCaseSensitive(False)

posTagger = PerceptronModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos")

nerModel = NerCrfModel.pretrained() \
    .setInputCols(["pos", "token", "document", "word_embeddings"]) \
    .setOutputCol("ner")

nerConverter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("entities") 

chunkToDoc = Chunk2Doc() \
    .setInputCols("entities") \
    .setOutputCol("chunkConverted")

pipeline = Pipeline().setStages([
      documentAssembler,
      sentenceDetector,
      tokenizer,
      posTagger,
      wordEmbeddings,
      nerModel,
      nerConverter,
      chunkToDoc
    ])

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[OK!]
ner_crf download started this may take some time.
Approximate size to download 10.2 MB
[OK!]


In [None]:
data = spark.createDataFrame([["Short sellers, Wall Street's dwindling band of ultra cynics, are seeing green again."]]).toDF("text")
result = pipeline.fit(data).transform(data)

In [None]:
result.selectExpr("explode(chunkConverted) as result") \
    .select("result.annotatorType", "result.begin", "result.end", "result.result", "result.metadata", "result.embeddings") \
    .show(truncate=False)

+-------------+-----+---+-------------+------------------------------------------+----------+
|annotatorType|begin|end|result       |metadata                                  |embeddings|
+-------------+-----+---+-------------+------------------------------------------+----------+
|document     |15   |27 |Wall Street's|{entity -> ORG, sentence -> 0, chunk -> 0}|[]        |
+-------------+-----+---+-------------+------------------------------------------+----------+
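
One reason to convert a chunk back into a document is to re-tokenize it. The sketch below imitates that step in plain Python with a whitespace split; it is an assumption-laden illustration, not Spark NLP's `Tokenizer`, which applies its own rules and may split the text differently. The function name `tokenize_chunk` is hypothetical.

```python
import re
from typing import List, Tuple


def tokenize_chunk(chunk_text: str, chunk_begin: int) -> List[Tuple[int, int, str]]:
    """Split a chunk on whitespace and report offsets relative to the original text."""
    tokens = []
    for match in re.finditer(r"\S+", chunk_text):
        begin = chunk_begin + match.start()
        end = chunk_begin + match.end() - 1  # inclusive end, as in the output above
        tokens.append((begin, end, match.group()))
    return tokens


# Chunk text and begin offset copied from the output above.
tokens = tokenize_chunk("Wall Street's", 15)
print(tokens)  # [(15, 18, 'Wall'), (20, 27, "Street's")]
```

Adding the chunk's `begin` offset keeps every token traceable to its position in the source sentence.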



## `Use Chunk2Doc with a NER-DL Pipeline`

In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
    
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

wordEmbeddings = BertEmbeddings.pretrained("bert_base_cased") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("word_embeddings") \
    .setCaseSensitive(False)

nerModel = NerDLModel.pretrained("ner_dl_bert", "en") \
    .setInputCols(["document", "token", "word_embeddings"]) \
    .setOutputCol("ner")

nerConverter = NerConverter() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("entities")

chunkToDoc = Chunk2Doc() \
    .setInputCols("entities") \
    .setOutputCol("chunkConverted")

pipeline = Pipeline().setStages([
      documentAssembler,
      tokenizer,
      wordEmbeddings,
      nerModel,
      nerConverter,
      chunkToDoc
    ])


bert_base_cased download started this may take some time.
Approximate size to download 389.1 MB
[OK!]
ner_dl_bert download started this may take some time.
Approximate size to download 15.4 MB
[OK!]


In [None]:
data = spark.createDataFrame([["Stocks ended slightly higher on Friday but stayed near lows for the year as oil prices surged past $36;46 a barrel, offsetting a positive outlook from computer maker Dell Inc."]]).toDF("text")
result = pipeline.fit(data).transform(data)

In [None]:
result.selectExpr("explode(chunkConverted) as result") \
    .select("result.annotatorType", "result.begin", "result.end", "result.result", "result.metadata") \
    .show(truncate=False)

+-------------+-----+---+------+------------------------------------------+
|annotatorType|begin|end|result|metadata                                  |
+-------------+-----+---+------+------------------------------------------+
|document     |166  |169|Dell  |{entity -> ORG, sentence -> 0, chunk -> 0}|
+-------------+-----+---+------+------------------------------------------+
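
Each converted document keeps the entity label in its metadata, so results can be grouped by label once they are on the driver. The sketch below uses plain dictionaries copied from the outputs in this notebook as a stand-in for collected rows; real Spark `Row` objects are accessed a little differently, and `group_by_entity` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List

# Results and entity labels copied from the three outputs in this notebook.
rows = [
    {"result": "New York", "metadata": {"entity": "LOC"}},
    {"result": "New Jersey", "metadata": {"entity": "LOC"}},
    {"result": "Wall Street's", "metadata": {"entity": "ORG"}},
    {"result": "Dell", "metadata": {"entity": "ORG"}},
]


def group_by_entity(rows: List[dict]) -> Dict[str, List[str]]:
    """Collect the converted document texts under their entity label."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["metadata"]["entity"]].append(row["result"])
    return dict(grouped)


print(group_by_entity(rows))
# {'LOC': ['New York', 'New Jersey'], 'ORG': ["Wall Street's", 'Dell']}
```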

