![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/35.06.Doc2Chunk.ipynb)

# **Doc2Chunk**

This notebook covers the parameters and usage of `Doc2Chunk`. This annotator converts `DOCUMENT` type annotations into `CHUNK` type. The text to be turned into chunks must be contained within the input `DOCUMENT`. The column that supplies the chunk text may be either `StringType` or `ArrayType[StringType]` (enabled with `setIsArray`). `Doc2Chunk` is used together with annotators that require a `CHUNK` type input.

**📖 Learning Objectives:**

1. Understand the usage of the annotator.

2. Become comfortable using the different parameters of the annotator.

3. Become comfortable using the annotator in several examples.


**🔗 Helpful Links:**

- Documentation : [Doc2Chunk](https://nlp.johnsnowlabs.com/docs/en/annotators#doc2chunk)

- Python Docs : [Doc2Chunk](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/doc2_chunk/index.html#sparknlp.base.doc2_chunk.Doc2Chunk)

- Scala Docs : [Doc2Chunk](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/Doc2Chunk)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb).

## **📜 Background**


In Spark ML, the machine learning algorithms are grouped in two classes: Estimators and Transformers. An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. A Transformer is an algorithm which can transform one DataFrame into another DataFrame.

Similarly, in Spark NLP, there are two types of annotators: AnnotatorApproach and AnnotatorModel.
The AnnotatorApproach extends the Estimator from Spark ML and is meant to be trained through `fit()`. The AnnotatorModel extends the Transformer and is meant to transform DataFrames through `transform()`.
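
As a rough illustration of the fit/transform split, here is a plain-Python toy that involves neither Spark nor Spark NLP. The class names `ToyVocabularyApproach` and `ToyVocabularyModel` are hypothetical and exist only for this sketch; they mimic the pattern in which an approach is fitted once and returns a model that transforms data.

```python
class ToyVocabularyModel:
    """Plays the role of a Transformer / AnnotatorModel: it only transforms."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def transform(self, rows):
        # Keep only the words that were seen during fitting.
        return [[w for w in row.split() if w in self.vocabulary] for row in rows]


class ToyVocabularyApproach:
    """Plays the role of an Estimator / AnnotatorApproach: it must be fitted first."""

    def fit(self, rows):
        # "Training" here is just collecting every word in the data.
        vocabulary = {w for row in rows for w in row.split()}
        return ToyVocabularyModel(vocabulary)


# Hypothetical toy data, not taken from the notebook.
model = ToyVocabularyApproach().fit(["spark nlp", "natural language"])
known_words = model.transform(["spark nlp and python"])
```

The approach object carries no learned state of its own; everything learned during `fit()` lives in the returned model, which is the same division of labour Spark ML uses.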

Each annotator accepts certain types of columns and outputs new columns in another type (we call this AnnotatorType).

Spark NLP has five transformers that are mainly used to bring data in or to convert it from one AnnotatorType to another. `Doc2Chunk` is one of them: it transforms the `DOCUMENT` type into the `CHUNK` type.

## **🎬 Colab Setup**

In [None]:
!pip install -q pyspark==3.1.2 spark-nlp==4.2.4


In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

spark = sparknlp.start()

## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`

- Output: `CHUNK`

## **🔎 Parameters**


- `inputCols`: (String) Previous annotations columns.

- `outputCol`: (String) Output annotation column.

- `chunkCol`: (String) Column that contains the chunk string (or strings); each must occur within the `DOCUMENT` text.

- `failOnMissing`: (BooleanParam) Whether to fail the job if a chunk is not found within the document; if false, the missing chunk is skipped (Default: false).

- `isArray`: (BooleanParam) Whether `chunkCol` is an array of strings (Default: false).

- `lowerCase`: (BooleanParam) Whether to lowercase text when matching (Default: true).

- `startCol`: (String) Column that holds the position where the chunk begins.

- `startColByTokenIndex`: (BooleanParam) Whether start column is prepended by whitespace tokens (Default: false).
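
To make the interplay of `chunkCol`, `isArray`, and `failOnMissing` more concrete, the following plain-Python sketch models the behaviour described above. It is not Spark NLP's implementation: the function `locate_chunks` and its arguments are hypothetical, matching is a simple case-sensitive substring search (so `lowerCase`, `startCol`, and `startColByTokenIndex` are deliberately left out), and the `end` offset is inclusive, as in Spark NLP annotations.

```python
def locate_chunks(document, chunk_value, is_array=False, fail_on_missing=False):
    """Return (begin, end, text) for every chunk found in the document."""
    # With is_array=False the chunk column holds one string, otherwise a list.
    candidates = list(chunk_value) if is_array else [chunk_value]
    found = []
    for chunk in candidates:
        begin = document.find(chunk)
        if not chunk.strip() or begin == -1:
            if fail_on_missing:
                raise ValueError(f"could not find {chunk!r} within the document")
            continue  # skip the missing chunk, mirroring failOnMissing=False
        # The end offset is inclusive.
        found.append((begin, begin + len(chunk) - 1, chunk))
    return found


text = "Spark NLP is an open-source text processing library."
chunks = locate_chunks(
    text, ["python", "Spark NLP", "text processing library"], is_array=True
)
```

Calling the same function with `fail_on_missing=True` raises on `python` instead of skipping it, which is the trade-off the `failOnMissing` parameter controls.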


## Basic Usage Example

In [None]:
# Convert data into a Spark NLP compatible format
documentAssembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")

# Transform document type into chunk type
chunkAssembler = Doc2Chunk() \
        .setInputCols("document") \
        .setOutputCol("chunk") 

# Data sample saved as a Spark dataframe
data = spark.createDataFrame([[
    "advanced natural language processing"
    ]]).toDF("text")

# Basic pipeline model
pipeline = Pipeline() \
        .setStages([
            documentAssembler, 
            chunkAssembler]) \
        .fit(data)

# Obtain and extract the results
result = pipeline.transform(data)
result.selectExpr("chunk.result", "chunk.annotatorType").show(truncate=False)

+--------------------------------------+-------------+
|result                                |annotatorType|
+--------------------------------------+-------------+
|[advanced natural language processing]|[chunk]      |
+--------------------------------------+-------------+



## Example with Provided Chunk Column

In [None]:
# Convert data into a Spark NLP compatible format
documentAssembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")

# Transform document type into chunk type
chunkAssembler = Doc2Chunk() \
        .setInputCols("document") \
        .setChunkCol("target") \
        .setOutputCol("chunk") \
        .setIsArray(True)

# Save sample data as a Spark dataframe
data = spark.createDataFrame([[
    "Spark NLP is an open-source text processing library for advanced natural language processing.",
    ["Spark NLP", "text processing library", "natural language processing"]
]]).toDF("text", "target")

# Basic pipeline model
pipeline = Pipeline() \
        .setStages([
            documentAssembler, 
            chunkAssembler]) \
        .fit(data)

# Obtain and extract the results
result = pipeline.transform(data)
result.selectExpr("chunk.result", "chunk.annotatorType").show(truncate=False)

+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+



## Example with Extraneous Chunks

In [None]:
# Convert data into a Spark NLP compatible format
documentAssembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")

# Transform document type into chunk type
chunkAssembler = Doc2Chunk() \
        .setInputCols("document") \
        .setChunkCol("target") \
        .setOutputCol("chunk") \
        .setIsArray(True) \
        .setFailOnMissing(False)

# Save sample data as a Spark dataframe
data = spark.createDataFrame([[
    "Spark NLP is an open-source text processing library for advanced natural language processing.",
    ["python", "Spark NLP", "text processing library", "natural language processing"]
]]).toDF("text", "target")

# Basic pipeline model
pipeline = Pipeline() \
        .setStages([
            documentAssembler, 
            chunkAssembler]) \
        .fit(data)

# Obtain and extract the results
result = pipeline.transform(data)
result.selectExpr("chunk.result", "chunk.annotatorType").show(truncate=False)

+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+



**Comment**:

The chunk _`python`_ does not occur in the sample text. With `failOnMissing` at its default value `False` (set explicitly above with `setFailOnMissing(False)`), the pipeline ignores that chunk and returns the remaining ones. If the parameter is set to `True`, the transform fails with a `Py4JJavaError`.
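
If a job should run with `setFailOnMissing(True)` but the target list may contain strings that are absent from the text, one option is to filter the targets before building the DataFrame. The sketch below is plain Python; `keep_present_chunks` is a hypothetical helper, and it uses a case-sensitive substring test, which may not match how `Doc2Chunk` compares text when `lowerCase` is enabled.

```python
def keep_present_chunks(text, targets):
    """Split targets into those contained in text and those that are not."""
    present = [t for t in targets if t.strip() and t in text]
    missing = [t for t in targets if t not in present]
    return present, missing


text = (
    "Spark NLP is an open-source text processing library "
    "for advanced natural language processing."
)
targets = ["python", "Spark NLP", "text processing library", "natural language processing"]

present, missing = keep_present_chunks(text, targets)
# `present` can be used as the "target" value for this row;
# `missing` can be logged or reviewed separately.
```

Filtering up front keeps the strict setting useful as a safety net: a failure then points to a genuine mismatch rather than to a target that was never in the text.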