![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


# **ChunkConverter**

This notebook covers the parameters and usage of the `ChunkConverter` annotator.

**📖 Learning Objectives:**

1. Understand how to use `ChunkConverter`.

2. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [ChunkConverter](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#chunkconverter)

- Python Docs : [ChunkConverter](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/chunker/chunk_converter/index.html#sparknlp_jsl.annotator.chunker.chunk_converter.ChunkConverter)

- Scala Docs : [ChunkConverter](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/chunker/ChunkConverter.html)



## **📜 Background**


`ChunkConverter` converts chunks produced by `RegexMatcher` into chunks that carry an `entity` field in their metadata. The identifier (or field) of the matching rule is used as the entity.

This annotator is useful when you want to merge entities identified by NER models with those found by the rule-based matching of the `RegexMatcher` annotator. In the subsequent stages of the pipeline, all identified entities can then be handled through a single, unified field.
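
The idea can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions, not the library's implementation: annotations are modelled as dictionaries, `convert_regex_chunk` is a hypothetical helper, and the NER label `BODY_PART` is made up for the example.

```python
# Illustrative only: plain-Python stand-ins for Spark NLP annotations.
# This is NOT how ChunkConverter is implemented in the library.

def convert_regex_chunk(annotation):
    """Return a copy whose metadata also carries the identifier as `entity`."""
    metadata = dict(annotation["metadata"])
    metadata["entity"] = metadata["identifier"]
    return {**annotation, "metadata": metadata}


# A regex match, described the way RegexMatcher reports it (identifier only).
regex_chunks = [
    {"begin": 0, "end": 9, "result": "PROCEDURE:",
     "metadata": {"identifier": "SECTION_HEADER", "sentence": "0", "chunk": "0"}},
]

# A chunk as an NER stage might report it (hypothetical label).
ner_chunks = [
    {"begin": 33, "end": 57, "result": "right cervical lymph node",
     "metadata": {"entity": "BODY_PART", "sentence": "0", "chunk": "0"}},
]

converted = [convert_regex_chunk(c) for c in regex_chunks]

# After conversion, both sources expose their label under the same `entity` key,
# so later steps do not need to know where a chunk came from.
unified = sorted(converted + ner_chunks, key=lambda c: c["begin"])
labels = [(c["result"], c["metadata"]["entity"]) for c in unified]
print(labels)
```

The original regex annotation is left untouched; only the converted copy gains the `entity` key.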


## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-NLP for Healthcare
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, medical
import pandas as pd

# Automatically load the license data and start a session with all the JARs the user has access to
spark = nlp.start()

In [None]:
spark

In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T

## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `CHUNK`

- Output: `CHUNK`

## **🔎 Parameters**


- `inputCols`: The name of the columns containing the input annotations. It can read either a String column or an Array.
- `outputCol`: The name of the generated column, which holds `CHUNK` annotations. Only one column can be specified here.


All the parameters can be set using the corresponding set method in camel case. For example, `.setInputCols()`.
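
The naming convention can be spelled out with a tiny helper. `setter_name` is a hypothetical function written only for this illustration; it is not part of the library.

```python
def setter_name(param_name: str) -> str:
    """Build the conventional setter name, e.g. inputCols -> setInputCols."""
    return "set" + param_name[0].upper() + param_name[1:]


for name in ["inputCols", "outputCol"]:
    print(f"{name} -> .{setter_name(name)}()")
```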

### `inputCols` and `outputCol`

Define the names of the columns containing the `DOCUMENT` and `CHUNK` annotations needed as input to the `ChunkConverter`, and the name of the new column containing the identified entities.

Let's define a pipeline to process raw texts into `DOCUMENT` and `CHUNK` annotations:

In [None]:
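# Raw string: in a regular string literal, `\b` would be written to the file
# as a backspace character instead of a regex word boundary.
# No `\b` after the colon: headers are followed by a space, not a word character.
rules = r'''
\b[A-Z]+(\s+[A-Z]+)*:, SECTION_HEADER
'''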

with open('regex_rules.txt', 'w') as f:
    f.write(rules)
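
A rule like this can be sanity-checked with Python's `re` module before it is wired into the pipeline. This is a minimal sketch, assuming that Python's regex engine and the one used by `RegexMatcher` agree on a pattern this simple. The pattern below is the core of the rule (uppercase words followed by a colon) with the word-boundary anchors left out, and `find_header` is a hypothetical helper.

```python
import re

# Core of the rule: one or more uppercase words followed by a colon.
# Written as a raw string so the backslash in `\s` is kept literally.
header_pattern = re.compile(r"[A-Z]+(\s+[A-Z]+)*:")


def find_header(line):
    """Return the first section header found in the line, or None."""
    match = header_pattern.search(line)
    return match.group(0) if match else None


sample_lines = [
    "POSTOPERATIVE DIAGNOSIS: Cervical lymphadenopathy.",
    "Specimen:  Right cervical lymph node.",
    "EBL: 10 cc.",
]

for line in sample_lines:
    print(repr(line), "->", find_header(line))
```

The mixed-case `Specimen:` line yields no header, because the pattern only accepts uppercase letters.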

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

regex_matcher = nlp.RegexMatcher()\
    .setInputCols("sentence")\
    .setOutputCol("regex")\
    .setExternalRules(path="./regex_rules.txt", delimiter=",")

chunkConverter = medical.ChunkConverter()\
    .setInputCols("regex")\
    .setOutputCol("chunk")

# Each stage appears exactly once; Spark rejects pipelines with duplicate stages.
pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        regex_matcher,
        chunkConverter,
    ])


In [None]:
text = """
POSTOPERATIVE DIAGNOSIS: Cervical lymphadenopathy.
PROCEDURE:  Excisional biopsy of right cervical lymph node.
ANESTHESIA:  General endotracheal anesthesia.
Specimen:  Right cervical lymph node.
EBL: 10 cc.
COMPLICATIONS:  None.
FINDINGS: Enlarged level 2 lymph node was identified and removed and sent for pathologic examination.
FLUIDS:  Please see anesthesia report.
URINE OUTPUT:  None recorded during the case.
INDICATIONS FOR PROCEDURE:  This is a 43-year-old female with a several-year history of persistent cervical lymphadenopathy. She reports that it is painful to palpation on the right and has had multiple CT scans as well as an FNA which were all nondiagnostic. After risks and benefits of surgery were discussed with the patient, an informed consent was obtained. She was scheduled for an excisional biopsy of the right cervical lymph node.
PROCEDURE IN DETAIL:  The patient was taken to the operating room and placed in the supine position. She was anesthetized with general endotracheal anesthesia. The neck was then prepped and draped in the sterile fashion. Again, noted on palpation there was an enlarged level 2 cervical lymph node.A 3-cm horizontal incision was made over this lymph node. Dissection was carried down until the sternocleidomastoid muscle was identified. The enlarged lymph node that measured approximately 2 cm in diameter was identified and was removed and sent to Pathology for touch prep evaluation. The area was then explored for any other enlarged lymph nodes. None were identified, and hemostasis was achieved with electrocautery. A quarter-inch Penrose drain was placed in the wound.The wound was then irrigated and closed with 3-0 interrupted Vicryl sutures for a deep closure followed by a running 4-0 Prolene subcuticular suture. Mastisol and Steri-Strip were placed over the incision, and sterile bandage was applied. The patient tolerated this procedure well and was extubated without complications and transported to the recovery room in stable condition. She will return to the office tomorrow in followup to have the Penrose drain removed.
"""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)


In [None]:
result_df = result.select(F.explode(F.arrays_zip(result.regex.result,
                                                 result.regex.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("regex"),
                          F.expr("cols['1']").alias("metadata"))

result_df.show(50, truncate=False)

+--------------------------+----------------------------------------------------------+
|regex                     |metadata                                                  |
+--------------------------+----------------------------------------------------------+
|POSTOPERATIVE DIAGNOSIS:  |{identifier -> SECTION_HEADER, sentence -> 0, chunk -> 0} |
|PROCEDURE:                |{identifier -> SECTION_HEADER, sentence -> 1, chunk -> 0} |
|ANESTHESIA:               |{identifier -> SECTION_HEADER, sentence -> 2, chunk -> 0} |
|EBL:                      |{identifier -> SECTION_HEADER, sentence -> 4, chunk -> 0} |
|COMPLICATIONS:            |{identifier -> SECTION_HEADER, sentence -> 5, chunk -> 0} |
|FINDINGS:                 |{identifier -> SECTION_HEADER, sentence -> 6, chunk -> 0} |
|FLUIDS:                   |{identifier -> SECTION_HEADER, sentence -> 7, chunk -> 0} |
|URINE OUTPUT:             |{identifier -> SECTION_HEADER, sentence -> 8, chunk -> 0} |
|INDICATIONS FOR PROCEDURE:|{ide

In [None]:
result_df = result.select(F.explode(F.arrays_zip(result.chunk.result,
                                                 result.chunk.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']").alias("metadata"))

result_df.show(50, truncate=False)

+--------------------------+---------------------------------------------------------------------------------------------------------+
|chunk                     |metadata                                                                                                 |
+--------------------------+---------------------------------------------------------------------------------------------------------+
|POSTOPERATIVE DIAGNOSIS:  |{chunk -> 0, identifier -> SECTION_HEADER, ner_source -> chunk, entity -> SECTION_HEADER, sentence -> 0} |
|PROCEDURE:                |{chunk -> 0, identifier -> SECTION_HEADER, ner_source -> chunk, entity -> SECTION_HEADER, sentence -> 1} |
|ANESTHESIA:               |{chunk -> 0, identifier -> SECTION_HEADER, ner_source -> chunk, entity -> SECTION_HEADER, sentence -> 2} |
|EBL:                      |{chunk -> 0, identifier -> SECTION_HEADER, ner_source -> chunk, entity -> SECTION_HEADER, sentence -> 4} |
|COMPLICATIONS:            |{chunk -> 0, identifier -> 

In [None]:
chunkConverter.extractParamMap()

{Param(parent='ChunkConverter_f60ed650117b', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='ChunkConverter_f60ed650117b', name='inputCols', doc='previous annotations columns, if renamed'): ['regex'],
 Param(parent='ChunkConverter_f60ed650117b', name='outputCol', doc='output annotation column. can be left default.'): 'chunk'}