![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **AssertionChunkConverter**

This notebook covers the parameters and usage of `AssertionChunkConverter`, an annotator that prepares the chunk column needed to train an `AssertionDLModel`.

**📖 Learning Objectives:**

1. Understand the meaning and use of assertion status.

2. Learn how to create a chunk column with metadata for training assertion status detection models.

3. Customize your assertion model by using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [AssertionChunkConverter](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#assertionchunkconverter)

- Python Docs : [AssertionChunkConverter](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/assertion/assertion_chunk_converter/)

- Scala Docs : [AssertionChunkConverter](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/assertion/AssertionChunkConverter.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb).

## **📜 Background**


The goal of assertion models is to classify chunks of text according to their context. The typical example of assertion status detection is negation identification: in the sentence “the patient has no history of diabetes”, the chunk “diabetes” (extracted by a clinical NER model as a Disease entity) would be classified as Absent by an assertion model because of the word “no” in its context. A more complex assertion model can include other labels such as Hypothetical, Past, Planned, Possible, Family, etc.
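
To make the idea of "classifying a chunk by its context" concrete, here is a toy sketch. It is a simple keyword rule, not the Bi-LSTM model used by Spark NLP; the function name `toy_assertion`, the cue list, and the window size are all assumptions made for illustration.

```python
# Toy stand-in for an assertion model: a keyword rule, NOT the Spark NLP model.
# NEGATION_CUES and the window size are illustrative assumptions.
NEGATION_CUES = {"no", "not", "without", "denies"}

def toy_assertion(sentence, chunk, window=5):
    """Label a chunk 'Absent' if a negation cue appears shortly before it."""
    tokens = sentence.lower().split()        # naive whitespace tokenization
    chunk_tokens = chunk.lower().split()
    for i in range(len(tokens) - len(chunk_tokens) + 1):
        if tokens[i:i + len(chunk_tokens)] == chunk_tokens:
            # Only the words just before the chunk count as its context here.
            context = tokens[max(0, i - window):i]
            return "Absent" if NEGATION_CUES & set(context) else "Present"
    raise ValueError("chunk not found in sentence")

print(toy_assertion("the patient has no history of diabetes", "diabetes"))
```

A trained assertion model learns such contextual cues from labelled examples instead of relying on a fixed word list, which is why it needs the chunk positions prepared later in this notebook.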


The deep neural network architecture for assertion status detection in Spark NLP is based on a Bi-LSTM framework and is a modified version of the architecture proposed by Federico Fancellu, Adam Lopez and Bonnie Webber (“Neural Networks For Negation Scope Detection”).


AssertionChunkConverter creates a CHUNK column with metadata useful for training an Assertion Status Detection model.

## **🎬 Colab Setup**

In [None]:
!pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp

nlp.install()

In [None]:
from johnsnowlabs import nlp, medical
import pyspark.sql.functions as F
import pandas as pd

spark = nlp.start()

## **🖨️ Input/Output Annotation Types**

- Input: `TOKEN`

- Output: `CHUNK`

## **🔎 AssertionChunkConverter Parameters**


- `chunkBeginCol`: (Str) The input column containing the character start index of the chunk.

- `chunkEndCol`: (Str) The input column containing the character end index of the chunk.

- `chunkTextCol`: (Str) The input column containing the chunk text.

- `outputTokenBeginCol`: (Str) The output column that receives the index of the first token of the chunk.

- `outputTokenEndCol`: (Str) The output column that receives the index of the last token of the chunk.


Building the chunk column directly from token indices can go wrong: if those indices are off, some examples are lost while training and testing the assertion status model. `AssertionChunkConverter` avoids this by taking the **begin and end indices of the chunks** (character offsets in the text) as input and creating an extended chunk column with metadata that can be used to train an assertion status detection model.
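
The mapping from character offsets to token indices can be pictured with a plain-Python sketch. This is not the annotator's implementation: the regex tokenizer and the helper names `tokenize_with_offsets` and `char_span_to_token_span` are assumptions made for illustration. In the actual pipeline, Spark NLP's `Tokenizer` and `AssertionChunkConverter` do this work.

```python
import re

def tokenize_with_offsets(text):
    """Return (token, begin, end_exclusive) triples; words and punctuation split apart."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

def char_span_to_token_span(text, char_begin, char_end):
    """Return (first, last) indices of the tokens overlapping [char_begin, char_end)."""
    tokens = tokenize_with_offsets(text)
    hits = [i for i, (_, begin, end) in enumerate(tokens)
            if begin < char_end and end > char_begin]
    # None signals that the span does not touch any token (bad offsets).
    return (hits[0], hits[-1]) if hits else None

text = "After discussing this with his PCP, Leon was clear"
print(char_span_to_token_span(text, 31, 34))  # (5, 5): "PCP" is the sixth token
```

Because the lookup is driven by character offsets, a span whose offsets are slightly too wide (for example, one that includes a trailing space) still resolves to the tokens it actually overlaps.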


NOTE: If your training data comes from NLP Lab, you can use the [Alab Module](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Annotation_Lab/Complete_ALab_Module_SparkNLP_JSL.ipynb) to get the chunk begin and end indices.

Your input data should have the following format:



text  | target | label | char_begin | char_end
-------------------|------------------|------------------|------------------|------------------
She has no history of liver disease or hepatitis. | liver disease | Absent | 22 | 35
He is diabetic.      | diabetic | Present | 6 | 14
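
If your data has the text and the target but no offsets yet, a minimal sketch such as the following can fill in `char_begin` and `char_end`. The helper `locate_target` is hypothetical and only finds the first occurrence of the target, so texts that repeat the target need manual review. The end offset is exclusive, matching the table above.

```python
def locate_target(text, target):
    """Return (char_begin, char_end) of the first occurrence; char_end is exclusive."""
    begin = text.find(target)
    if begin == -1:
        raise ValueError(f"target {target!r} not found in text")
    return begin, begin + len(target)

examples = [
    ("She has no history of liver disease or hepatitis.", "liver disease", "Absent"),
    ("He is diabetic.", "diabetic", "Present"),
]

# Each row: (text, target, label, char_begin, char_end)
rows = [(text, target, label, *locate_target(text, target))
        for text, target, label in examples]

for row in rows:
    print(row[1:])
```

The resulting `rows` list can be passed to `spark.createDataFrame` with the column names shown in the table.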

In [26]:
data = spark.createDataFrame([
    ["An angiography showed bleeding in two vessels off of the Minnie supplying the sigmoid that were succesfully embolized.", "Minnie)", 57, 64],
    ["After discussing this with his PCP, Leon was clear that the patient had had recurrent DVTs and ultimately a PE and his PCP felt strongly that he required long-term anticoagulation ", "PCP", 31, 34]])\
     .toDF("text", "target", "char_begin", "char_end")

data.show()

+--------------------+-------+----------+--------+
|                text| target|char_begin|char_end|
+--------------------+-------+----------+--------+
|An angiography sh...|Minnie)|        57|      64|
|After discussing ...|    PCP|        31|      34|
+--------------------+-------+----------+--------+



In [27]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("tokens")

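# Map each character span (char_begin, char_end) onto the tokens it covers.
# token_begin / token_end receive the token indices; "chunk" holds the CHUNK annotation.
converter = medical.AssertionChunkConverter() \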
    .setInputCols("tokens")\
    .setChunkTextCol("target")\
    .setChunkBeginCol("char_begin")\
    .setChunkEndCol("char_end")\
    .setOutputTokenBeginCol("token_begin")\
    .setOutputTokenEndCol("token_end")\
    .setOutputCol("chunk")

clinical_assertion_pipeline = nlp.Pipeline(stages = [
    document_assembler,
    sentenceDetector,
    tokenizer,
    converter])

results = clinical_assertion_pipeline.fit(data).transform(data)

In [28]:
results.show()

+--------------------+-------+----------+--------+--------------------+--------------------+--------------------+-----------+---------+--------------------+
|                text| target|char_begin|char_end|            document|            sentence|              tokens|token_begin|token_end|               chunk|
+--------------------+-------+----------+--------+--------------------+--------------------+--------------------+-----------+---------+--------------------+
|An angiography sh...|Minnie)|        57|      64|[{document, 0, 11...|[{document, 0, 11...|[{token, 0, 1, An...|         10|       10|[{chunk, 57, 62, ...|
|After discussing ...|    PCP|        31|      34|[{document, 0, 17...|[{document, 0, 17...|[{token, 0, 4, Af...|          5|        5|[{chunk, 31, 33, ...|
+--------------------+-------+----------+--------+--------------------+--------------------+--------------------+-----------+---------+--------------------+



In [31]:
results\
    .selectExpr(
        "target",
        "char_begin",
        "char_end",
        "token_begin",
        "token_end",
        "tokens[token_begin].result",
        "tokens[token_end].result",
        "chunk")\
    .show(truncate=False)

+-------+----------+--------+-----------+---------+--------------------------+------------------------+-----------------------------------------------+
|target |char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|chunk                                          |
+-------+----------+--------+-----------+---------+--------------------------+------------------------+-----------------------------------------------+
|Minnie)|57        |64      |10         |10       |Minnie                    |Minnie                  |[{chunk, 57, 62, Minnie), {sentence -> 0}, []}]|
|PCP    |31        |34      |5          |5        |PCP                       |PCP                     |[{chunk, 31, 33, PCP, {sentence -> 0}, []}]    |
+-------+----------+--------+-----------+---------+--------------------------+------------------------+-----------------------------------------------+



In [32]:
results.selectExpr("chunk").show(truncate=False)

+-----------------------------------------------+
|chunk                                          |
+-----------------------------------------------+
|[{chunk, 57, 62, Minnie), {sentence -> 0}, []}]|
|[{chunk, 31, 33, PCP, {sentence -> 0}, []}]    |
+-----------------------------------------------+



The training data should contain annotation columns of type `DOCUMENT`, `CHUNK`, and `WORD_EMBEDDINGS`, plus three plain columns: the label (the assertion status that you want to predict), the start (the start index of the term that carries the assertion status), and the end (the end index of that term).

To get your data into this format, use **AssertionChunkConverter** in your preprocessing pipeline.