![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


# **Chunk2Token**

This notebook covers the parameters and usage of the `Chunk2Token` annotator.

**📖 Learning Objectives:**

1. Understand how to use `Chunk2Token`.

2. Become comfortable using the different parameters of the annotator.



**🔗 Helpful Links:**

- Documentation : [Chunk2Token](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#chunk2token)

- Python Docs : [Chunk2Token](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/chunk2_token/index.html)

- Scala Docs : [Chunk2Token](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/Chunk2Token.html)

- For extended examples of usage, see the [Spark NLP Workshop repository]().

## **📜 Background**


`Chunk2Token` is a feature transformer that converts the input array of strings (annotatorType `CHUNK`) into an array of chunk-based tokens (annotatorType `TOKEN`).

When the input is empty, an empty array is returned.

This annotator is especially convenient when `NGramGenerator` annotations are used as inputs to `WordEmbeddingsModel` annotators, which expect `TOKEN` input.
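
To make the conversion concrete, here is a minimal pure-Python sketch of the idea. It is not Spark NLP's actual implementation: the `Annotation` dataclass and the `make_bigram_chunks` and `chunk_to_token` helpers are hypothetical stand-ins introduced only for illustration. Bigrams are built from a token list, labelled as chunks, and then relabelled as tokens while their text is left unchanged.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List


@dataclass(frozen=True)
class Annotation:
    """Hypothetical, simplified stand-in for a Spark NLP annotation."""
    annotator_type: str
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)


def make_bigram_chunks(tokens: List[str], delimiter: str = "_") -> List[Annotation]:
    """Join each pair of adjacent tokens into one chunk annotation."""
    return [
        Annotation("chunk", delimiter.join(pair), {"chunk": str(i)})
        for i, pair in enumerate(zip(tokens, tokens[1:]))
    ]


def chunk_to_token(chunks: List[Annotation]) -> List[Annotation]:
    """Relabel every chunk as a token; an empty input gives an empty output."""
    return [replace(chunk, annotator_type="token") for chunk in chunks]


chunks = make_bigram_chunks(["The", "patient", "is", "a"])
tokens = chunk_to_token(chunks)
print([t.result for t in tokens])          # ['The_patient', 'patient_is', 'is_a']
print({t.annotator_type for t in tokens})  # {'token'}
```

Only the annotation type changes in this sketch; the text and metadata of each chunk are carried over as they are.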


## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-NLP for Healthcare
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving spark_nlp_for_healthcare_spark_ocr_8734_532.json to spark_nlp_for_healthcare_spark_ocr_8734_532.json


In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.settings.enforce_versions=False
nlp.install(refresh_install=True)

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
Running "/usr/bin/python3 -m pip install https://pypi.johnsnowlabs.com/[LIB_SECRET]/spark-nlp-jsl/spark_nlp_jsl-5.3.2-py3-none-any.whl --force-reinstall"
Installed 1 products:
💊 Spark-Healthcare==5.3.2 installed! ✅ Heal the planet with NLP! 


In [None]:
import pyspark.sql.functions as F

# Automatically load license data and start a session with all JARs the user has access to
spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, running on ⚡ PySpark==3.4.0


In [None]:
spark

## **🖨️ Input/Output Annotation Types**

- Input: `CHUNK`

- Output: `TOKEN`

## **🔎 Parameters**


- `inputCols`: The names of the columns containing the input annotations (`CHUNK` type). It can be given as a single string or as an array of strings.
- `outputCol`: The name of the generated column, which holds `TOKEN` annotations. Only one column can be specified here.


All parameters can be set using the corresponding setter method in camel case. For example, `.setInputCols()`.
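
The naming convention can be sketched in plain Python. The `setter_name` and `apply_params` helpers and the `FakeAnnotator` class below are hypothetical, written only to illustrate the pattern; they are not part of Spark NLP.

```python
def setter_name(param_name: str) -> str:
    """Map a parameter name such as 'inputCols' to its setter name, 'setInputCols'."""
    return "set" + param_name[0].upper() + param_name[1:]


class FakeAnnotator:
    """Hypothetical stand-in with chainable setters, mimicking the annotator style."""

    def __init__(self):
        self.params = {}

    def setInputCols(self, cols):
        self.params["inputCols"] = cols
        return self

    def setOutputCol(self, col):
        self.params["outputCol"] = col
        return self


def apply_params(annotator, params: dict):
    """Call the matching setter for every (name, value) pair."""
    for name, value in params.items():
        annotator = getattr(annotator, setter_name(name))(value)
    return annotator


configured = apply_params(
    FakeAnnotator(),
    {"inputCols": ["ngrams"], "outputCol": "ngram_tokens"},
)
print(setter_name("inputCols"))  # setInputCols
print(configured.params)
```

Because each setter returns the annotator itself, the calls can also be chained directly, which is the style used in the pipeline definitions in this notebook.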

### `inputCols` and `outputCol`

Define the name of the column containing the `CHUNK` annotations needed as input to `Chunk2Token`, and the name of the new column that will contain the resulting `TOKEN` annotations.

Let's define a pipeline that processes raw text into `DOCUMENT`, sentence, and `TOKEN` annotations, generates n-gram `CHUNK` annotations with `NGramGenerator`, and converts them to `TOKEN` annotations:

In [None]:
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

token = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

ngrammer = nlp.NGramGenerator() \
    .setN(2) \
    .setEnableCumulative(False) \
    .setInputCols(["token"]) \
    .setOutputCol("ngrams") \
    .setDelimiter("_")

# Stage to convert n-gram CHUNKS to TOKEN type
chunk2Token = medical.Chunk2Token()\
    .setInputCols(["ngrams"])\
    .setOutputCol("ngram_tokens")

pipeline = nlp.Pipeline(stages=[
    document,
    sentenceDetector,
    token,
    ngrammer,
    chunk2Token])


In [None]:
text = "The patient is a 41-year-old Vietnamese female with a nonproductive cough that started last week."

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

In [None]:
result.select("ngram_tokens").show(truncate=False)


In [None]:
result_df = result.select(F.explode(F.arrays_zip(result.ngram_tokens.result,
                                                 result.ngram_tokens.annotatorType,
                                                 result.ngram_tokens.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']").alias("annotatorType"),
                          F.expr("cols['2']").alias("metadata"))

result_df.show(50, truncate=False)

+----------------------+-------------+----------------------------+
|chunk                 |annotatorType|metadata                    |
+----------------------+-------------+----------------------------+
|The_patient           |token        |{sentence -> 0, chunk -> 0} |
|patient_is            |token        |{sentence -> 0, chunk -> 1} |
|is_a                  |token        |{sentence -> 0, chunk -> 2} |
|a_41-year-old         |token        |{sentence -> 0, chunk -> 3} |
|41-year-old_Vietnamese|token        |{sentence -> 0, chunk -> 4} |
|Vietnamese_female     |token        |{sentence -> 0, chunk -> 5} |
|female_with           |token        |{sentence -> 0, chunk -> 6} |
|with_a                |token        |{sentence -> 0, chunk -> 7} |
|a_nonproductive       |token        |{sentence -> 0, chunk -> 8} |
|nonproductive_cough   |token        |{sentence -> 0, chunk -> 9} |
|cough_that            |token        |{sentence -> 0, chunk -> 10}|
|that_started          |token        |{sentence 

In [None]:
chunk2Token.extractParamMap()

{Param(parent='Chunk2Token_0ec030e60c92', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='Chunk2Token_0ec030e60c92', name='inputCols', doc='previous annotations columns, if renamed'): ['ngrams'],
 Param(parent='Chunk2Token_0ec030e60c92', name='outputCol', doc='output annotation column. can be left default.'): 'ngram_tokens'}
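
The raw parameter map is keyed by `Param` objects, which makes it verbose to read. A minimal sketch of flattening it into a plain name-to-value dictionary follows. To keep it runnable without a Spark session, it uses a hypothetical `Param` namedtuple that mimics only the `parent`, `name`, and `doc` fields visible in the output above, and `summarize_params` is an illustrative helper, not a library function. In the notebook you would pass `chunk2Token.extractParamMap()` instead of `fake_param_map`.

```python
from collections import namedtuple

# Hypothetical stand-in exposing the same three fields shown in the output above.
Param = namedtuple("Param", ["parent", "name", "doc"])


def summarize_params(param_map: dict) -> dict:
    """Return {parameter name: value}, ordered alphabetically by parameter name."""
    ordered = sorted(param_map.items(), key=lambda item: item[0].name)
    return {param.name: value for param, value in ordered}


uid = "Chunk2Token_0ec030e60c92"
fake_param_map = {
    Param(uid, "lazyAnnotator",
          "Whether this AnnotatorModel acts as lazy in RecursivePipelines"): False,
    Param(uid, "inputCols", "previous annotations columns, if renamed"): ["ngrams"],
    Param(uid, "outputCol", "output annotation column. can be left default."): "ngram_tokens",
}

for name, value in summarize_params(fake_param_map).items():
    print(f"{name}: {value}")
```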