![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


# **Doc2ChunkInternal**

This notebook covers the parameters and usage of the `Doc2ChunkInternal` annotator.

**📖 Learning Objectives:**

1. Understand how to use `Doc2ChunkInternal`.

2. Become comfortable using the different parameters of the annotator.




**🔗 Helpful Links:**

- Documentation : [Doc2ChunkInternal](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#doc2chunkinternal)

- Python Docs : [Doc2ChunkInternal](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/doc2_chunk_internal/index.html#sparknlp_jsl.annotator.doc2_chunk_internal.Doc2ChunkInternal)

- Scala Docs : [Doc2ChunkInternal](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/annotator/Doc2ChunkInternal.html)



## **📜 Background**


`Doc2ChunkInternal` converts `DOCUMENT` and `TOKEN` annotations into `CHUNK` annotations whose text is taken from a separate column, the `chunkCol`. Each chunk text must be contained within the input `DOCUMENT`. The `chunkCol` may be either `StringType` or `ArrayType[StringType]` (the latter is enabled with `setIsArray`). This is useful for annotators that require a `CHUNK` type input.
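
The containment rule and the two accepted column shapes can be pictured with a small plain-Python sketch. It does not call Spark NLP and is not the annotator's implementation; `normalize_targets` and `missing_chunks` are hypothetical helper names introduced only for illustration.

```python
from typing import List, Union


def normalize_targets(value: Union[str, List[str]], is_array: bool) -> List[str]:
    """Return the chunk strings as a list, whichever shape the column has.

    Hypothetical helper: mirrors the idea behind setIsArray, not its code.
    """
    if is_array:
        return list(value)
    return [value]


def missing_chunks(document: str, targets: List[str]) -> List[str]:
    """List the targets that do not occur verbatim in the document text."""
    return [target for target in targets if target not in document]


text = ("Spark NLP is an open-source text processing library "
        "for advanced natural language processing.")

# Array-shaped and string-shaped chunk columns, normalized to lists
array_targets = normalize_targets(["Spark NLP", "text processing library"], is_array=True)
single_target = normalize_targets("Spark NLP", is_array=False)

print(missing_chunks(text, array_targets))   # both strings occur in the text
print(missing_chunks(text, single_target))   # the single string occurs in the text
print(missing_chunks(text, ["Spark OCR"]))   # this string does not occur in the text
```

Running a check like this on your own data before building the pipeline is one way to spot chunk strings that do not appear in the document text.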

## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-NLP for Healthcare
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.settings.enforce_versions=False
nlp.install(refresh_install=True)

In [None]:
from johnsnowlabs import nlp, medical
import pandas as pd

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

In [7]:
spark

In [8]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T

## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `TOKEN`

- Output: `CHUNK`

## **🔎 Parameters**


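- `inputCols`: The names of the columns containing the input annotations. It accepts either a single string or an array of column names.
- `outputCol`: The name of the generated column, which holds `CHUNK` annotations. Only one column can be specified here.
- `chunkCol`: The column that contains the chunk string(s). Each string must be part of the `DOCUMENT` text.
- `isArray`: Whether the `chunkCol` is an array of strings.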


All the parameters can be set using the corresponding set method in camel case. For example, `.setInputCols()`.
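
As a quick illustration of that naming convention, the sketch below derives the setter name from a parameter name. `setter_name` is a hypothetical helper, not part of Spark NLP; the four parameter names are the ones set on the annotator in this notebook.

```python
def setter_name(param: str) -> str:
    """Build the camel-case setter name for a parameter, e.g. inputCols -> setInputCols."""
    return "set" + param[0].upper() + param[1:]


# Parameters that this notebook sets on Doc2ChunkInternal
for param in ["inputCols", "outputCol", "chunkCol", "isArray"]:
    print(param, "->", setter_name(param))
```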

### `inputCols` and `outputCol`

Define the names of the columns containing the `DOCUMENT` and `TOKEN` annotations that `Doc2ChunkInternal` needs as input, and the name of the new column containing the generated chunks.

Let's define a pipeline to process raw texts into `DOCUMENT` and `TOKEN` annotations:

In [9]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

chunkAssembler = medical.Doc2ChunkInternal()\
    .setInputCols("document", "token")\
    .setChunkCol("target")\
    .setOutputCol("chunk")\
    .setIsArray(True)

pipeline = nlp.Pipeline().setStages([documentAssembler, tokenizer, chunkAssembler])

In [10]:
data = spark.createDataFrame([[
    "Spark NLP is an open-source text processing library for advanced natural language processing.", ["Spark NLP", "text processing library", "natural language processing"],
    ]]).toDF("text", "target")

data.show(truncate=False)

+---------------------------------------------------------------------------------------------+-----------------------------------------------------------------+
|text                                                                                         |target                                                           |
+---------------------------------------------------------------------------------------------+-----------------------------------------------------------------+
|Spark NLP is an open-source text processing library for advanced natural language processing.|[Spark NLP, text processing library, natural language processing]|
+---------------------------------------------------------------------------------------------+-----------------------------------------------------------------+



In [11]:
result = pipeline.fit(data).transform(data)
result.selectExpr("chunk.result", "chunk.annotatorType").show(truncate=False)

+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+



In [12]:
result.selectExpr("chunk").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|chunk                                                                                                                                                                                                                |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 0, 8, Spark NLP, {sentence -> 0, chunk -> 0}, []}, {chunk, 28, 50, text processing library, {sentence -> 0, chunk -> 1}, []}, {chunk, 65, 91, natural language processing, {sentence -> 0, chunk -> 2}, []}]|
+---------------------------------------------------------------------------------------------------------------------------------------
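
The `begin` and `end` values in the `chunk` column printed above are character offsets into the original text, and the end offset is inclusive. The plain-Python sketch below checks this against the values shown; `covered_text` is a hypothetical helper, and the triples are copied by hand from the output rather than read from the DataFrame.

```python
text = ("Spark NLP is an open-source text processing library "
        "for advanced natural language processing.")

# (begin, end, result) triples copied from the chunk column output
annotations = [
    (0, 8, "Spark NLP"),
    (28, 50, "text processing library"),
    (65, 91, "natural language processing"),
]


def covered_text(document: str, begin: int, end: int) -> str:
    """Slice the document using an inclusive end offset."""
    return document[begin:end + 1]


for begin, end, result in annotations:
    # Each annotation's offsets should reproduce its result string
    assert covered_text(text, begin, end) == result
    print(begin, end, repr(covered_text(text, begin, end)))
```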

In [16]:
result_df = result.select(F.explode(F.arrays_zip(result.chunk.result,
                                                 result.chunk.annotatorType,
                                                 result.chunk.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']").alias("annotatorType"),
                          F.expr("cols['2']").alias("metadata"))

result_df.show(50, truncate=False)

+---------------------------+-------------+---------------------------+
|chunk                      |annotatorType|metadata                   |
+---------------------------+-------------+---------------------------+
|Spark NLP                  |chunk        |{sentence -> 0, chunk -> 0}|
|text processing library    |chunk        |{sentence -> 0, chunk -> 1}|
|natural language processing|chunk        |{sentence -> 0, chunk -> 2}|
+---------------------------+-------------+---------------------------+
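
The `F.explode(F.arrays_zip(...))` pattern used above pairs up the parallel `result`, `annotatorType`, and `metadata` arrays element by element and then emits one row per element. A rough plain-Python analogy, assuming the three lists have equal length and that metadata values are strings, looks like this:

```python
# Parallel arrays, as they appear inside the single `chunk` column
results = ["Spark NLP", "text processing library", "natural language processing"]
annotator_types = ["chunk", "chunk", "chunk"]
metadata = [
    {"sentence": "0", "chunk": "0"},
    {"sentence": "0", "chunk": "1"},
    {"sentence": "0", "chunk": "2"},
]

# zip plays the role of arrays_zip; building one dict per tuple plays the role of explode
rows = [
    {"chunk": result, "annotatorType": annotator_type, "metadata": meta}
    for result, annotator_type, meta in zip(results, annotator_types, metadata)
]

for row in rows:
    print(row)
```

This is only an analogy for reading the Spark code; the real transformation runs on DataFrame columns rather than Python lists.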



In [17]:
chunkAssembler.extractParamMap()

{Param(parent='Doc2ChunkInternal_b4e2c856bc68', name='isArray', doc='whether the chunkCol is an array of strings'): True,
 Param(parent='Doc2ChunkInternal_b4e2c856bc68', name='inputCols', doc='previous annotations columns, if renamed'): ['document',
  'token'],
 Param(parent='Doc2ChunkInternal_b4e2c856bc68', name='chunkCol', doc='column that contains string. Must be part of DOCUMENT'): 'target',
 Param(parent='Doc2ChunkInternal_b4e2c856bc68', name='outputCol', doc='output annotation column. can be left default.'): 'chunk'}