![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/41.Flattener.ipynb)

#   **📜 Flattener**


The **`Flattener`** converts annotation results into a format that is easier to use. It produces a DataFrame with flattened and exploded columns containing the annotation results, which makes them easier to interpret and analyze.
It is particularly useful for extracting and organizing the results of Spark NLP pipelines.
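
To make "flattened and exploded" concrete, here is a minimal plain-Python sketch of the idea. It is not the `Flattener` implementation: the helper `explode_and_flatten` is hypothetical, annotations are modelled as simple dicts, and the `<column>_<field>` naming only mimics the column names that appear in the outputs later in this notebook.

```python
# Plain-Python sketch of "explode and flatten"; NOT the Flattener's actual code.
# Each annotation is modelled as a dict with result/begin/end/metadata fields.

def explode_and_flatten(column_name, annotations):
    """Return one flat dict (row) per annotation, prefixing keys with the column name."""
    rows = []
    for ann in annotations:
        row = {
            f"{column_name}_result": ann["result"],
            f"{column_name}_begin": ann["begin"],
            f"{column_name}_end": ann["end"],
        }
        # Nested metadata keys become their own top-level columns.
        for key, value in ann["metadata"].items():
            row[f"{column_name}_metadata_{key}"] = value
        rows.append(row)
    return rows


# Two annotations shaped like NER chunks.
ner_chunk = [
    {"result": "distress", "begin": 49, "end": 56, "metadata": {"entity": "SYMPTOM"}},
    {"result": "edema", "begin": 665, "end": 669, "metadata": {"entity": "SYMPTOM"}},
]

flat_rows = explode_and_flatten("ner_chunk", ner_chunk)
for row in flat_rows:
    print(row)
```

One list of nested annotations becomes one row per annotation ("explode"), and each nested field becomes its own column ("flatten").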

## **🎬 Colab Setup**

In [None]:
import json
import os

from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.4.1 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

In [3]:
import json
import os

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel

import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

spark

Spark NLP Version : 5.3.0
Spark NLP_JSL Version : 5.3.0


## **🖨️ Input/Output Annotation Types**

- Input: `ANY`

- Output: `NONE`

## **🔎 Parameters**


**Parameters**:

- `inputCols`: Input annotation columns.
- `cleanAnnotations`: Whether to remove annotation columns (default: `True`).
- `explodeSelectedFields`: Dict mapping each input column to the fields to select from it.
- `flattenExplodedColumns`: Whether to flatten exploded columns (default: `True`).
- `orderByColumn`: Column by which the DataFrame should be ordered.
- `orderDescending`: Whether to order the DataFrame in descending order (default: `True`).
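
The `explodeSelectedFields` values used in the examples of this notebook follow a `"<field> as <alias>"` pattern, with dotted paths such as `metadata.entity` for nested fields. The sketch below shows one way such a spec could be interpreted in plain Python. The helpers `parse_field_spec` and `select_fields` are hypothetical and only illustrate the pattern; falling back to the field path when no alias is given is an assumption, not documented behaviour.

```python
# Hypothetical helpers illustrating the "<field> as <alias>" pattern.
# They do not reproduce how the Flattener parses its parameter internally.

def parse_field_spec(spec):
    """Split 'metadata.entity as entities' into (['metadata', 'entity'], 'entities')."""
    path, sep, alias = spec.partition(" as ")
    path = path.strip()
    # Assumption for this sketch: without an alias, reuse the field path as the name.
    alias = alias.strip() if sep else path
    return path.split("."), alias


def select_fields(annotation, specs):
    """Build one output row by following each dotted path inside the annotation."""
    row = {}
    for spec in specs:
        path, alias = parse_field_spec(spec)
        value = annotation
        for key in path:
            value = value[key]
        row[alias] = value
    return row


annotation = {
    "result": "cyanosis",
    "begin": 679,
    "end": 686,
    "metadata": {"entity": "VS_FINDING"},
}

selected = select_fields(
    annotation,
    ["result as ner_chunk", "begin as begin", "end as end", "metadata.entity as entities"],
)
print(selected)
# {'ner_chunk': 'cyanosis', 'begin': 679, 'end': 686, 'entities': 'VS_FINDING'}
```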
      
  

## MedicalNerModel

In [4]:
text = """
GENERAL: He is an elderly gentleman in no acute distress. He is sitting up in bed eating his breakfast. He is alert and oriented and answering questions appropriately.
HEENT: Sclerae showed mild arcus senilis in the right. Left was clear. Pupils are equally round and reactive to light. Extraocular movements are intact. Oropharynx is clear.
NECK: Supple. Trachea is midline. No jugular venous pressure distention is noted. No adenopathy in the cervical, supraclavicular, or axillary areas.
ABDOMEN: Soft and not tender. There may be some fullness in the left upper quadrant, although I do not appreciate a true spleen with inspiration.
EXTREMITIES: There is some edema, but no cyanosis and clubbing .
"""

data = spark.createDataFrame([[text]]).toDF("text")

In [5]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") \
    .setLabelCasing("upper")

ner_converter = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk") \
    .setWhiteList(["SYMPTOM","VS_FINDING","DISEASE_SYNDROME_DISORDER","ADMISSION_DISCHARGE","PROCEDURE"])


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
[OK!]


In [6]:
# Explode and flatten every field of the input columns (the behavior when explodeSelectedFields is not set)
flattener = Flattener()\
    .setInputCols("ner_chunk")

nlpPipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    flattener
])

result = nlpPipeline.fit(data).transform(data)
result.show(truncate=False)

+----------------------------------+---------------+-------------+------------------------+-----------------------------+-----------------------------+-------------------------+---------------------------+
|ner_chunk_result                  |ner_chunk_begin|ner_chunk_end|ner_chunk_metadata_chunk|ner_chunk_metadata_confidence|ner_chunk_metadata_ner_source|ner_chunk_metadata_entity|ner_chunk_metadata_sentence|
+----------------------------------+---------------+-------------+------------------------+-----------------------------+-----------------------------+-------------------------+---------------------------+
|distress                          |49             |56           |0                       |0.9441                       |ner_chunk                    |SYMPTOM                  |0                          |
|arcus senilis                     |196            |208          |1                       |0.43245                      |ner_chunk                    |DISEASE_SYNDROME_DISORDER

In [7]:
# Return one exploded column per selected field, renamed with the "as" alias
flattener = Flattener()\
    .setInputCols("ner_chunk") \
    .setExplodeSelectedFields({"ner_chunk": ["result as ner_chunk",
                                             "begin as begin",
                                             "end as end",
                                             "metadata.entity as entities"]})

nlpPipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    flattener
])

result = nlpPipeline.fit(data).transform(data)
result.show(truncate=False)

+----------------------------------+-----+---+-------------------------+
|ner_chunk                         |begin|end|entities                 |
+----------------------------------+-----+---+-------------------------+
|distress                          |49   |56 |SYMPTOM                  |
|arcus senilis                     |196  |208|DISEASE_SYNDROME_DISORDER|
|jugular venous pressure distention|380  |413|SYMPTOM                  |
|adenopathy                        |428  |437|SYMPTOM                  |
|tender                            |514  |519|SYMPTOM                  |
|fullness                          |540  |547|SYMPTOM                  |
|edema                             |665  |669|SYMPTOM                  |
|cyanosis                          |679  |686|VS_FINDING               |
|clubbing                          |692  |699|SYMPTOM                  |
+----------------------------------+-----+---+-------------------------+



In [8]:
# Keep the selected fields as array columns instead of flattening them
flattener = Flattener()\
    .setInputCols("sentence", "token", "ner_chunk") \
    .setFlattenExplodedColumns(False)\
    .setExplodeSelectedFields({"sentence": ["result as sentences"],
                               "token":["result as tokens"],
                               "ner_chunk":["result as ner_chunk"]})

nlpPipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    flattener
])

result = nlpPipeline.fit(data).transform(data)
result.show(truncate=100)

+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|                                                                                           sentences|                                                                                           ner_chunk|                                                                                              tokens|
+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|[GENERAL: He is an elderly gentleman in no acute distress., He is sitting up in bed 
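
The output above keeps each selected field as a list inside a single row. The plain-Python sketch below contrasts that shape with an exploded one. It is only an analogy under simplified assumptions: the dict `row` is hand-built from the last sentence of the sample text, and the sketch does not show how the `Flattener` aligns several columns of different lengths.

```python
# One "unflattened" row: every selected column holds a whole list of values.
row = {
    "sentences": ["EXTREMITIES: There is some edema, but no cyanosis and clubbing ."],
    "ner_chunk": ["edema", "cyanosis", "clubbing"],
}

# Analogue of setFlattenExplodedColumns(False): the lists stay intact in one row.
unflattened_rows = [row]

# Analogue of exploding a single column: one output row per list element.
exploded_rows = [{"ner_chunk": chunk} for chunk in row["ner_chunk"]]

print(len(unflattened_rows), "row with list-valued columns")
print(len(exploded_rows), "rows after exploding ner_chunk")
```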

## AssertionDLModel

In [9]:
# Assertion status model; its output is flattened together with the NER chunks below
clinical_assertion = AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion") \
    .setEntityAssertionCaseSensitive(False)


flattener = Flattener()\
    .setInputCols("ner_chunk", "assertion") \
    .setExplodeSelectedFields({"ner_chunk": ["result as ner_chunk",
                                             "begin as begin",
                                             "end as end",
                                             "metadata.entity as entities"],
                               "assertion":["result as assertion",
                                            "metadata.confidence as confidence"]
                               })

nlpPipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    clinical_assertion,
    flattener
])

result = nlpPipeline.fit(data).transform(data)
result.show(truncate=False)

assertion_jsl_augmented download started this may take some time.
[OK!]
+----------------------------------+-----+---+-------------------------+---------+----------+
|ner_chunk                         |begin|end|entities                 |assertion|confidence|
+----------------------------------+-----+---+-------------------------+---------+----------+
|distress                          |49   |56 |SYMPTOM                  |Absent   |0.9999    |
|arcus senilis                     |196  |208|DISEASE_SYNDROME_DISORDER|Past     |1.0       |
|jugular venous pressure distention|380  |413|SYMPTOM                  |Absent   |1.0       |
|adenopathy                        |428  |437|SYMPTOM                  |Absent   |1.0       |
|tender                            |514  |519|SYMPTOM                  |Absent   |1.0       |
|fullness                          |540  |547|SYMPTOM                  |Possible |1.0       |
|edema                             |665  |669|SYMPTOM                  |Present  |

In [10]:
# Order the flattened rows by the confidence column, highest first
flattener = Flattener()\
    .setInputCols("ner_chunk", "assertion") \
    .setExplodeSelectedFields({"ner_chunk": ["result as ner_chunk",
                                             "begin as begin",
                                             "end as end",
                                             "metadata.entity as entities"],
                               "assertion":["result as assertion",
                                            "metadata.confidence as confidence"]
                               })\
    .setOrderByColumn("confidence")\
    .setOrderDescending(True)

nlpPipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    clinical_assertion,
    flattener
])

result = nlpPipeline.fit(data).transform(data)
result.show(truncate=False)


+----------------------------------+-----+---+-------------------------+---------+----------+
|ner_chunk                         |begin|end|entities                 |assertion|confidence|
+----------------------------------+-----+---+-------------------------+---------+----------+
|arcus senilis                     |196  |208|DISEASE_SYNDROME_DISORDER|Past     |1.0       |
|jugular venous pressure distention|380  |413|SYMPTOM                  |Absent   |1.0       |
|adenopathy                        |428  |437|SYMPTOM                  |Absent   |1.0       |
|tender                            |514  |519|SYMPTOM                  |Absent   |1.0       |
|fullness                          |540  |547|SYMPTOM                  |Possible |1.0       |
|edema                             |665  |669|SYMPTOM                  |Present  |1.0       |
|cyanosis                          |679  |686|VS_FINDING               |Absent   |1.0       |
|clubbing                          |692  |699|SYMPTOM       
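
The ordering shown above can be mimicked in plain Python. This is a sketch, not the `Flattener`'s code. The three rows are copied from the assertion output earlier in this section, and the confidence is stored as a string because Spark NLP annotation metadata is a string-to-string map. Converting to `float` before sorting is a choice made for this sketch; how the `Flattener` compares values internally is not shown in this notebook.

```python
# Rows copied from the assertion example; confidence kept as text, as in metadata.
rows = [
    {"ner_chunk": "distress", "assertion": "Absent", "confidence": "0.9999"},
    {"ner_chunk": "arcus senilis", "assertion": "Past", "confidence": "1.0"},
    {"ner_chunk": "fullness", "assertion": "Possible", "confidence": "1.0"},
]

# Sort numerically, highest confidence first. Python's sort is stable, so rows
# with equal confidence keep their original relative order.
ordered = sorted(rows, key=lambda r: float(r["confidence"]), reverse=True)

for r in ordered:
    print(r["ner_chunk"], r["confidence"])
```

As in the table above, `distress` (0.9999) drops below the chunks that have a confidence of 1.0.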

## RelationExtractionModel

In [11]:
document_assambler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

pos_tagger = PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos_tags")

pos_ner_tagger = MedicalNerModel()\
    .pretrained("ner_posology", "en", "clinical/models")\
    .setInputCols("sentence", "token", "embeddings")\
    .setOutputCol("ner_pos")

pos_ner_chunker = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner_pos"])\
    .setOutputCol("pos_ner_chunks")

dependency_parser = DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos_tags", "token"])\
    .setOutputCol("dependencies")

pos_reModel = RelationExtractionModel()\
    .pretrained("posology_re")\
    .setInputCols(["embeddings", "pos_tags", "pos_ner_chunks", "dependencies"])\
    .setOutputCol("pos_relations")\
    .setMaxSyntacticDistance(4)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
pos_clinical download started this may take some time.
Approximate size to download 1.5 MB
[OK!]
ner_posology download started this may take some time.
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[OK!]


In [12]:
flattener = sparknlp_jsl.annotators.Flattener()\
    .setInputCols("pos_relations") \
    .setExplodeSelectedFields({"pos_relations": ["result as relations",
                                                 "metadata.chunk1 as chunk1",
                                                 "metadata.entity1_begin as entity1_begin",
                                                 "metadata.entity1_end as entity1_end",
                                                 "metadata.entity1 as entity1",
                                                 "metadata.chunk2 as chunk2",
                                                 "metadata.entity2_begin as entity2_begin",
                                                 "metadata.entity2_end as entity2_end",
                                                 "metadata.entity2 as entity2"]})

re_pipeline = Pipeline(stages=[
        document_assambler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        pos_tagger,
        pos_ner_tagger,
        pos_ner_chunker,
        dependency_parser,
        pos_reModel,
        flattener
])

text = """The patient was prescribed 1 unit of naproxen for 5 days after meals for chronic low back pain.
The patient was also given 1 unit of oxaprozin daily for rheumatoid arthritis presented with tense bullae and
cutaneous fragility on the face and the back of the hands.
"""
data = spark.createDataFrame([[text]]).toDF("text")
result = re_pipeline.fit(data).transform(data)
result.show(truncate=False)

+--------------+---------+-------------+-----------+-------+----------+-------------+-----------+---------+
|relations     |chunk1   |entity1_begin|entity1_end|entity1|chunk2    |entity2_begin|entity2_end|entity2  |
+--------------+---------+-------------+-----------+-------+----------+-------------+-----------+---------+
|DOSAGE-DRUG   |1 unit   |27           |32         |DOSAGE |naproxen  |37           |44         |DRUG     |
|DRUG-DURATION |naproxen |37           |44         |DRUG   |for 5 days|46           |55         |DURATION |
|DOSAGE-DRUG   |1 unit   |123          |128        |DOSAGE |oxaprozin |133          |141        |DRUG     |
|DRUG-FREQUENCY|oxaprozin|133          |141        |DRUG   |daily     |143          |147        |FREQUENCY|
+--------------+---------+-------------+-----------+-------+----------+-------------+-----------+---------+
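
Once relations are flattened into rows like these, they are straightforward to regroup. The sketch below collects the attributes of each drug from the four rows above, using plain Python dicts in place of a DataFrame. The helper `group_by_drug` is hypothetical and assumes every relation of interest has exactly one `DRUG` side.

```python
from collections import defaultdict

# The four flattened relation rows from the output above (position columns omitted).
rows = [
    {"relations": "DOSAGE-DRUG", "chunk1": "1 unit", "entity1": "DOSAGE",
     "chunk2": "naproxen", "entity2": "DRUG"},
    {"relations": "DRUG-DURATION", "chunk1": "naproxen", "entity1": "DRUG",
     "chunk2": "for 5 days", "entity2": "DURATION"},
    {"relations": "DOSAGE-DRUG", "chunk1": "1 unit", "entity1": "DOSAGE",
     "chunk2": "oxaprozin", "entity2": "DRUG"},
    {"relations": "DRUG-FREQUENCY", "chunk1": "oxaprozin", "entity1": "DRUG",
     "chunk2": "daily", "entity2": "FREQUENCY"},
]


def group_by_drug(relation_rows):
    """Map each drug chunk to the attributes (dosage, duration, ...) related to it."""
    grouped = defaultdict(dict)
    for row in relation_rows:
        if row["entity1"] == "DRUG":
            drug, attribute, value = row["chunk1"], row["entity2"], row["chunk2"]
        elif row["entity2"] == "DRUG":
            drug, attribute, value = row["chunk2"], row["entity1"], row["chunk1"]
        else:
            continue  # Relation without a DRUG side; ignored in this sketch.
        grouped[drug][attribute] = value
    return dict(grouped)


by_drug = group_by_drug(rows)
print(by_drug)
# {'naproxen': {'DOSAGE': '1 unit', 'DURATION': 'for 5 days'}, 'oxaprozin': {'DOSAGE': '1 unit', 'FREQUENCY': 'daily'}}
```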



## SentenceEntityResolverModel

In [13]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", 'clinical/models') \
    .setInputCols("document") \
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("word_embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "word_embeddings"]) \
    .setOutputCol("ner")

ner_converter_icd = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(['PROBLEM'])\
    .setPreservePosition(False)

c2doc = Chunk2Doc()\
    .setInputCols("ner_chunk")\
    .setOutputCol("doc_ner_chunk")

sbert_embedder = BertSentenceEmbeddings.pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\
    .setInputCols("doc_ner_chunk")\
    .setOutputCol("sentence_embeddings")\
    .setCaseSensitive(False)

icd_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc","en", "clinical/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("icd10cm_code")\
    .setDistanceFunction("EUCLIDEAN")

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_icd10cm_augmented_billable_hcc download started this may take some time.
[OK!]


In [14]:
flattener = sparknlp_jsl.annotators.Flattener()\
    .setInputCols("ner_chunk", "icd10cm_code") \
    .setExplodeSelectedFields({"ner_chunk": ["result as ner_chunk",
                                             "metadata.entity as entities"],
                               "icd10cm_code": ["result as icd10cm_code",
                                                 "metadata.all_k_results as all_k_results",
                                                 "metadata.all_k_resolutions as all_k_resolutions",
                                                 "metadata.all_k_aux_labels as all_k_aux_labels"],
                               })

resolver_pipeline = Pipeline(
    stages = [
        document_assembler,
        sentenceDetectorDL,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter_icd,
        c2doc,
        sbert_embedder,
        icd_resolver,
        flattener
  ])


text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation
and subsequent type two diabetes mellitus, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2,
"""
data = spark.createDataFrame([[text]]).toDF("text")

result = resolver_pipeline.fit(data).transform(data)
result.show(truncate=80)

+-------------------------------------+--------+------------+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
|                            ner_chunk|entities|icd10cm_code|                                                                   all_k_results|                                                               all_k_resolutions|                                                                all_k_aux_labels|
+-------------------------------------+--------+------------+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
|        gestational diabetes mellitus| PROBLEM|       O24.4|     O24.4:::O24.41:::O2
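
In the output above, `all_k_results` packs the candidate codes into a single string separated by `:::`. A small sketch of unpacking such a value in plain Python follows. The helper `split_candidates` is hypothetical, and the sample string contains only the first two codes visible in the output, since the rest of the row is truncated.

```python
def split_candidates(packed, separator=":::"):
    """Split a ':::'-separated string into a list of non-empty, stripped items."""
    return [item.strip() for item in packed.split(separator) if item.strip()]


# Only the first two codes visible in the output; the remaining ones are cut off.
all_k_results = "O24.4:::O24.41"

candidates = split_candidates(all_k_results)
print(candidates)
# ['O24.4', 'O24.41']

# In the row shown above, the first candidate matches the icd10cm_code column.
top_code = candidates[0]
print(top_code)
# O24.4
```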