![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/35.02.Token2Chunk.ipynb)

# **Token2Chunk with Spark NLP**

This notebook will cover the different parameters and usages of `Token2Chunk`.

**📖 Learning Objectives:**

1. Understand how this annotator converts `token` type annotations to `chunk` type.

2. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [Token2Chunk](https://nlp.johnsnowlabs.com/docs/en/annotators#token2chunk)

- Python Docs : [Token2Chunk](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/token2_chunk/index.html#sparknlp.base.token2_chunk.Token2Chunk)

- Scala Docs : [Token2Chunk](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/Token2Chunk.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/open-source-nlp).

## **📜 Background**


Token2Chunk converts `token` type annotations to `chunk` type. This is useful when entities have already been extracted as tokens and the following annotators require chunk-type inputs.


One example is the ChunkMapper annotator from our healthcare library, which needs chunk-type inputs. Placing Token2Chunk before ChunkMapper converts the tokens to chunks and provides the inputs that ChunkMapper needs.

## **🎬 Colab Setup**

In [None]:
!pip install -q pyspark==3.1.2 spark-nlp==4.2.4

In [None]:
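import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
# Import Pipeline explicitly rather than relying on the star imports above
from pyspark.ml import Pipeline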

spark = sparknlp.start()

## **🖨️ Input/Output Annotation Types**

- Input: `TOKEN`

- Output: `CHUNK`

In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

token2chunk = Token2Chunk() \
    .setInputCols(["token"]) \
    .setOutputCol("chunk")

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    token2chunk
])

data = spark.createDataFrame([["change object nature sea"]]).toDF("text")
result = pipeline.fit(data).transform(data)


result.selectExpr("token.result as token", "token.annotatorType").show(truncate=False)
result.selectExpr("chunk.result as chunk", "chunk.annotatorType").show(truncate=False)



+-----------------------------+----------------------------+
|token                        |annotatorType               |
+-----------------------------+----------------------------+
|[change, object, nature, sea]|[token, token, token, token]|
+-----------------------------+----------------------------+

+-----------------------------+----------------------------+
|chunk                        |annotatorType               |
+-----------------------------+----------------------------+
|[change, object, nature, sea]|[chunk, chunk, chunk, chunk]|
+-----------------------------+----------------------------+



As shown above, the `token` type annotations produced by the Tokenizer annotator have been converted to `chunk` type by the Token2Chunk annotator. Any following annotator that needs chunk-type input can now consume the `chunk` column.
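
To make the conversion concrete, here is a minimal conceptual sketch in plain Python, without Spark. It is not the library's implementation: the `Annotation` class, `whitespace_tokenize`, and `token_to_chunk` are hypothetical names made up for illustration. It also assumes that the conversion keeps each token's text and character offsets and only changes the annotation type. The output above confirms that the text is kept, but the offsets are an assumption here.

```python
import re
from dataclasses import dataclass, field, replace
from typing import Dict, List


@dataclass(frozen=True)
class Annotation:
    """Simplified, hypothetical stand-in for a Spark NLP annotation."""
    annotator_type: str
    begin: int
    end: int  # inclusive character offset (assumption for this sketch)
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)


def whitespace_tokenize(text: str) -> List[Annotation]:
    """Split on whitespace and emit one 'token' annotation per word."""
    return [
        Annotation("token", m.start(), m.end() - 1, m.group())
        for m in re.finditer(r"\S+", text)
    ]


def token_to_chunk(tokens: List[Annotation]) -> List[Annotation]:
    """Copy each annotation, changing only its type to 'chunk'."""
    return [replace(tok, annotator_type="chunk") for tok in tokens]


tokens = whitespace_tokenize("change object nature sea")
chunks = token_to_chunk(tokens)

# Print each token next to its converted chunk
for tok, chk in zip(tokens, chunks):
    print(tok.annotator_type, "->", chk.annotator_type, chk.result, chk.begin, chk.end)
```

Under these assumptions the conversion is one-to-one: every token becomes exactly one chunk, and no tokens are merged into multi-word chunks.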

That's all! With this, you can use the power of Spark NLP to convert your tokens to chunks.💪🏻

For additional information, don't hesitate to consult the above references.☘️