![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.4.PipelineTracer_and_PipelineOutputParser.ipynb)

#   **📜 PipelineTracer and PipelineOutputParser**


## Colab Setup

In [None]:
import json
import os

from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
!pip install --upgrade -q pyspark==3.4.0 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
!pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
!pip install -q spark-nlp-display

In [None]:

import json
import os

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

spark = sparknlp_jsl.start(secret = SECRET)

spark.sparkContext.setLogLevel("ERROR")

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 5.4.0
Spark NLP_JSL Version : 5.4.0


## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, see the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for each component.

# PipelineTracer



`PipelineTracer` traces the stages of a pipeline and retrieves information about them, including the entities, assertions, and relations that the pipeline can produce.
It works with both `PipelineModel` and `PretrainedPipeline`.
It can also build a parser dictionary, which is then used to create a `PipelineOutputParser`.


## **🔎 Methods**

- `printPipelineSchema`: Prints the schema of the pipeline.
- `createParserDictionary`: Returns a parser dictionary that can be used to create a `PipelineOutputParser`.
- `getPossibleEntities`: Returns a list of possible entities that the pipeline can include.
- `getPossibleAssertions`: Returns a list of possible assertions that the pipeline can include.
- `getPossibleRelations`: Returns a list of possible relations that the pipeline can include.
- `getPipelineStages`: Returns a list of `PipelineStage` objects that represent the stages of the pipeline.
- `getParserDictDirectly`: Returns a parser dictionary that can be used to create a `PipelineOutputParser`, without creating a `PipelineTracer` object first.
- `listAvailableModels`: Returns a list of available models for a given language and source.
- `showAvailableModels`: Prints a list of available models for a given language and source.

In [None]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer
from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser


### showAvailableModels

In [None]:
PipelineTracer.showAvailableModels(language="en", source="clinical/models")

clinical_deidentification
explain_clinical_doc_ade
explain_clinical_doc_biomarker
explain_clinical_doc_generic
explain_clinical_doc_granular
explain_clinical_doc_medication
explain_clinical_doc_oncology
explain_clinical_doc_public_health
explain_clinical_doc_radiology
explain_clinical_doc_risk_factors
explain_clinical_doc_vop
icd10cm_resolver_pipeline
icd10cm_rxnorm_resolver_pipeline
rxnorm_resolver_pipeline
snomed_resolver_pipeline


### listAvailableModels

In [None]:
for model in PipelineTracer.listAvailableModels():
  print(PipelineTracer.getParserDictDirectly(model))

{'document_identifier': 'clinical_deidentification', 'document_text': 'sentence', 'entities': ['ner_chunk'], 'assertions': [], 'resolutions': [], 'relations': [], 'summaries': [], 'deidentifications': [{'original': 'sentence', 'obfuscated': 'obfuscated', 'masked': ''}], 'classifications': []}
{'document_identifier': 'explain_clinical_doc_ade', 'document_text': 'document', 'entities': ['ner_chunks_ade'], 'assertions': ['assertion'], 'resolutions': [], 'relations': ['relations'], 'summaries': [], 'deidentifications': [], 'classifications': [{'classification_column_name': 'class', 'sentence_column_name': 'sentence'}]}
{'document_identifier': 'explain_clinical_doc_biomarker', 'document_text': 'document', 'entities': ['ner_biomarker_chunk'], 'assertions': [], 'resolutions': [], 'relations': ['re_oncology_biomarker_result_wip'], 'summaries': [], 'deidentifications': [], 'classifications': [{'classification_column_name': 'prediction', 'sentence_column_name': 'sentence'}]}
{'document_identifie

### createParserDictionary

In [None]:
from sparknlp.pretrained import PretrainedPipeline

oncology_pipeline = PretrainedPipeline("explain_clinical_doc_oncology", "en", "clinical/models")


explain_clinical_doc_oncology download started this may take some time.
Approx size to download 1.8 GB
[OK!]


In [None]:
tracer = PipelineTracer(oncology_pipeline)
tracer.createParserDictionary()

{'document_identifier': '',
 'document_text': 'document',
 'entities': ['merged_chunk', 'merged_chunk_for_assertion'],
 'assertions': ['assertion'],
 'resolutions': [],
 'relations': ['all_relations'],
 'summaries': [],
 'deidentifications': [],
 'classifications': []}
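The parser dictionary is a plain Python `dict`, so its shape can be checked before it is handed to a parser. The sketch below is a minimal illustration in plain Python: `EXPECTED_KEYS` and `check_parser_dict` are hypothetical names, not part of the library, and the key list is taken only from the dictionaries printed in this notebook. A later cell builds a parser from a dictionary with fewer keys, so a "missing" key here is informational rather than an error.

```python
# Keys observed in the parser dictionaries printed in this notebook (assumption:
# this list is taken from the printed outputs, not from the library source).
EXPECTED_KEYS = (
    "document_identifier", "document_text", "entities", "assertions",
    "resolutions", "relations", "summaries", "deidentifications",
    "classifications",
)


def check_parser_dict(parser_dict):
    """Return (missing_keys, unexpected_keys) for a parser dictionary."""
    missing = [key for key in EXPECTED_KEYS if key not in parser_dict]
    unexpected = [key for key in parser_dict if key not in EXPECTED_KEYS]
    return missing, unexpected


# Dictionary copied from the output above.
oncology_dict = {
    "document_identifier": "",
    "document_text": "document",
    "entities": ["merged_chunk", "merged_chunk_for_assertion"],
    "assertions": ["assertion"],
    "resolutions": [],
    "relations": ["all_relations"],
    "summaries": [],
    "deidentifications": [],
    "classifications": [],
}

missing, unexpected = check_parser_dict(oncology_dict)
print("missing:", missing, "unexpected:", unexpected)
```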

### printPipelineSchema

In [None]:
tracer.printPipelineSchema()

root
 |-- DocumentAssembler
 |    |-- uid: string (DocumentAssembler_27a75510357d)
 |    |-- index: int (0)
 |    |-- inputCol: string (text)
 |    |-- outputCol: string (document)
 |    |-- inputAnnotatorType: none (----------)
 |    |-- outputAnnotatorType: string (DOCUMENT)
 |
 |-- SentenceDetectorDLModel
 |    |-- uid: string (SentenceDetectorDLModel_6bafc4746ea5)
 |    |-- index: int (1)
 |    |-- inputCols: array (document)
 |    |-- outputCol: string (sentence)
 |    |-- inputAnnotatorTypes: array (DOCUMENT)
 |    |-- outputAnnotatorType: string (DOCUMENT)
 |
 |-- TokenizerModel
 |    |-- uid: string (REGEX_TOKENIZER_6e5cf9a1fd71)
 |    |-- index: int (2)
 |    |-- inputCols: array (sentence)
 |    |-- outputCol: string (token)
 |    |-- inputAnnotatorTypes: array (DOCUMENT)
 |    |-- outputAnnotatorType: string (TOKEN)
 |
 |-- WordEmbeddingsModel
 |    |-- uid: string (WORD_EMBEDDINGS_MODEL_9004b1d00302)
 |    |-- index: int (3)
 |    |-- inputCols: array (sentence, token)
 |  

### getPossibleEntities

In [None]:
tracer.getPossibleEntities()

['Cycle_Number',
 'Direction',
 'Histological_Type',
 'Biomarker_Result',
 'Site_Other_Body_Part',
 'Hormonal_Therapy',
 'Death_Entity',
 'Targeted_Therapy',
 'Route',
 'Tumor_Finding',
 'Duration',
 'Pathology_Result',
 'Chemotherapy',
 'Date',
 'Radiotherapy',
 'Radiation_Dose',
 'Oncogene',
 'Cancer_Surgery',
 'Tumor_Size',
 'Staging',
 'Pathology_Test',
 'Cancer_Dx',
 'Age',
 'Site_Lung',
 'Site_Breast',
 'Site_Liver',
 'Site_Lymph_Node',
 'Response_To_Treatment',
 'Site_Brain',
 'Immunotherapy',
 'Race_Ethnicity',
 'Metastasis',
 'Smoking_Status',
 'Imaging_Test',
 'Relative_Date',
 'Line_Of_Therapy',
 'Unspecific_Therapy',
 'Site_Bone',
 'Gender',
 'Cycle_Count',
 'Cancer_Score',
 'Adenopathy',
 'Grade',
 'Biomarker',
 'Invasion',
 'Frequency',
 'Performance_Status',
 'Dosage',
 'Cycle_Day',
 'Anatomical_Site',
 'Size_Trend',
 'Posology_Information',
 'Cancer_Therapy',
 'Lymph_Node',
 'Tumor_Description',
 'Lymph_Node_Modifier',
 'Alcohol',
 'BMI',
 'Communicable_Disease',
 'Obes

### getPossibleAssertions

In [None]:
tracer.getPossibleAssertions()

['Past', 'Family', 'Absent', 'Hypothetical', 'Possible', 'Present']

### getPossibleRelations

In [None]:
tracer.getPossibleRelations()

['is_size_of', 'is_date_of', 'is_location_of', 'is_finding_of']

### getPipelineStages

In [None]:
stages = tracer.getPipelineStages()
for stage in stages:
    print(stage.__dict__())

{'uid': 'DocumentAssembler_27a75510357d', 'name': 'DocumentAssembler', 'index': 0, 'inputCol': StageField(inputCol, text, string), 'outputCol': StageField(outputCol, document, string), 'inputAnnotatorType': StageField(inputAnnotatorType, ----------, none), 'outputAnnotatorType': StageField(outputAnnotatorType, document, string)}
{'uid': 'SentenceDetectorDLModel_6bafc4746ea5', 'name': 'SentenceDetectorDLModel', 'index': 1, 'inputCol': StageField(inputCols, [document], array), 'outputCol': StageField(outputCol, sentence, string), 'inputAnnotatorType': StageField(inputAnnotatorTypes, [document], array), 'outputAnnotatorType': StageField(outputAnnotatorType, document, string)}
{'uid': 'REGEX_TOKENIZER_6e5cf9a1fd71', 'name': 'TokenizerModel', 'index': 2, 'inputCol': StageField(inputCols, [sentence], array), 'outputCol': StageField(outputCol, token, string), 'inputAnnotatorType': StageField(inputAnnotatorTypes, [document], array), 'outputAnnotatorType': StageField(outputAnnotatorType, token,

## With a Custom Pipeline




In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel \
    .pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel \
    .pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk") \
    .setWhiteList(["TREATMENT", "PROBLEM"])

clinical_assertion = AssertionDLModel \
    .pretrained("assertion_dl_large", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion") \
    .setIncludeConfidence(True) \
    .setEntityAssertionCaseSensitive(False) \
    .setEntityAssertion({"treAtment": ["present"]}) \
    .setReplaceLabels({"PRESENT": "available", "absent": "none", "Conditional": "Optional"})

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    clinical_assertion])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]
assertion_dl_large download started this may take some time.
[OK!]


In [None]:
tracer = PipelineTracer(model)
tracer.createParserDictionary()

{'document_identifier': '',
 'document_text': 'document',
 'entities': ['ner_chunk'],
 'assertions': ['assertion'],
 'resolutions': [],
 'relations': [],
 'summaries': [],
 'deidentifications': [],
 'classifications': []}

In [None]:
tracer.getPossibleAssertions()

['available',
 'none',
 'hypothetical',
 'possible',
 'Optional',
 'associated_with_someone_else']
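The list above reflects the `setReplaceLabels` mapping from the pipeline definition, and it is consistent with the keys being matched case-insensitively (`"PRESENT"` and `"Conditional"` both took effect). The sketch below imitates that behavior in plain Python to make the mapping explicit. It is not the library's implementation: `replace_labels` is a hypothetical helper, and the original label list is an assumption about what the assertion model emits before replacement.

```python
def replace_labels(labels, replacements):
    """Rename labels using a case-insensitive lookup; unmatched labels are kept."""
    lookup = {key.lower(): value for key, value in replacements.items()}
    return [lookup.get(label.lower(), label) for label in labels]


# Assumption: labels produced by the assertion model before any replacement.
assumed_original_labels = [
    "present", "absent", "hypothetical", "possible",
    "conditional", "associated_with_someone_else",
]

# Mapping copied from the setReplaceLabels call in the pipeline above.
replacements = {"PRESENT": "available", "absent": "none", "Conditional": "Optional"}

renamed = replace_labels(assumed_original_labels, replacements)
print(renamed)
```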

In [None]:
tracer.getPossibleEntities()

['TREATMENT', 'PROBLEM']
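Here the possible entities match the whitelist set on the NER converter. One way to use such a list is to check the labels a downstream consumer asks for before filtering results. The following is a small sketch in plain Python; `split_labels` is a hypothetical helper, and `TEST` is used only as an example of a label this pipeline does not report.

```python
def split_labels(requested, possible):
    """Split requested labels into those the pipeline can emit and those it cannot."""
    possible_set = set(possible)
    known = [label for label in requested if label in possible_set]
    unknown = [label for label in requested if label not in possible_set]
    return known, unknown


# List copied from the getPossibleEntities() output above.
possible_entities = ["TREATMENT", "PROBLEM"]

known, unknown = split_labels(["PROBLEM", "TEST"], possible_entities)
print("known:", known, "unknown:", unknown)
```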

In [None]:
tracer.printPipelineSchema()

root
 |-- DocumentAssembler
 |    |-- uid: string (DocumentAssembler_3089cee38170)
 |    |-- index: int (0)
 |    |-- inputCol: string (text)
 |    |-- outputCol: string (document)
 |    |-- inputAnnotatorType: none (----------)
 |    |-- outputAnnotatorType: string (DOCUMENT)
 |
 |-- SentenceDetector
 |    |-- uid: string (SentenceDetector_6eb82dd2c257)
 |    |-- index: int (1)
 |    |-- inputCols: array (document)
 |    |-- outputCol: string (sentence)
 |    |-- inputAnnotatorTypes: array (DOCUMENT)
 |    |-- outputAnnotatorType: string (DOCUMENT)
 |
 |-- TokenizerModel
 |    |-- uid: string (REGEX_TOKENIZER_ea64e54a4d5f)
 |    |-- index: int (2)
 |    |-- inputCols: array (sentence)
 |    |-- outputCol: string (token)
 |    |-- inputAnnotatorTypes: array (DOCUMENT)
 |    |-- outputAnnotatorType: string (TOKEN)
 |
 |-- WordEmbeddingsModel
 |    |-- uid: string (WORD_EMBEDDINGS_MODEL_9004b1d00302)
 |    |-- index: int (3)
 |    |-- inputCols: array (sentence, token)
 |    |-- outputCo

# PipelineOutputParser

The output parser module converts the results of pretrained pipelines into a clear, easy-to-process dictionary. It is designed to simplify API integration, make results easier to understand, and streamline data analysis workflows.
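Because the parsed result is a nested structure of dictionaries and lists, it can be serialized for an API response with the standard `json` module. The sketch below assumes the result contains only strings, integers, lists, dictionaries, and `None`, as in the outputs printed in this notebook. `sample_output` is a hand-copied fragment of such an output, not something produced by running the parser here.

```python
import json

# Fragment shaped like the parser results printed in this notebook.
sample_output = {
    "result": [
        {
            "document_identifier": "clinical_deidentification",
            "document_text": [
                "Record date : 2093-01-13 , David Hale , M.D . , Name : "
                "Hendrickson Ora , MR # 7194334 Date : 01/13/93 ."
            ],
            "entities": [
                {
                    "chunk_id": "78463532",
                    "chunk": "2093-01-13",
                    "begin": 14,
                    "end": 23,
                    "ner_label": "DATE",
                    "ner_source": None,      # serialized as JSON null
                    "ner_confidence": None,
                }
            ],
        }
    ]
}

payload = json.dumps(sample_output, indent=2)   # string ready to send or store
restored = json.loads(payload)                  # round-trip back to Python objects
print(restored["result"][0]["entities"][0]["ner_label"])
```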

## clinical_deidentification

In [None]:

from sparknlp.pretrained import PretrainedPipeline
pretrained_pipeline = PretrainedPipeline("clinical_deidentification", "en", "clinical/models")

text = [
    """Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .""",
    """Record date: 2093-01-13, David Hale, M.D., Name: Hendrickson, Ora MR. # 7194334 Date: 01/13/93 PCP: Oliveira, 25 years old, Record date: 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. The patient's complaints first surfaced when he started working for Brothers Coal-Mine.""",
]

results = pretrained_pipeline.fullAnnotate(text)


clinical_deidentification download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [None]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

pipeline_tracer = PipelineTracer(pretrained_pipeline)

column_maps = pipeline_tracer.createParserDictionary()
column_maps.update({"document_identifier": "clinical_deidentification"})
column_maps

{'document_identifier': 'clinical_deidentification',
 'document_text': 'sentence',
 'entities': ['ner_chunk'],
 'assertions': [],
 'resolutions': [],
 'relations': [],
 'summaries': [],
 'deidentifications': [{'original': 'sentence',
   'obfuscated': 'obfuscated',
   'masked': ''}],
 'classifications': []}

In [None]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

columns_directly = PipelineTracer.getParserDictDirectly("clinical_deidentification", "en", "clinical/models")
columns_directly

{'document_identifier': 'clinical_deidentification',
 'document_text': 'sentence',
 'entities': ['ner_chunk'],
 'assertions': [],
 'resolutions': [],
 'relations': [],
 'summaries': [],
 'deidentifications': [{'original': 'sentence',
   'obfuscated': 'obfuscated',
   'masked': ''}],
 'classifications': []}

In [None]:
from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser
pipeline_parser = PipelineOutputParser(column_maps)
result = pipeline_parser.run(results)

result

{'result': [{'document_identifier': 'clinical_deidentification',
   'document_text': ['Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .',
    'PCP : Oliveira , 25 years-old , Record date : 2079-11-09 .',
    'Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .'],
   'entities': [{'chunk_id': '78463532',
     'chunk': '2093-01-13',
     'begin': 14,
     'end': 23,
     'ner_label': 'DATE',
     'ner_source': None,
     'ner_confidence': None},
    {'chunk_id': '60a35054',
     'chunk': 'David Hale',
     'begin': 27,
     'end': 36,
     'ner_label': 'DOCTOR',
     'ner_source': 'ner_chunk_enriched',
     'ner_confidence': '0.9895'},
    {'chunk_id': '9d3e7907',
     'chunk': 'Hendrickson Ora',
     'begin': 55,
     'end': 69,
     'ner_label': 'PATIENT',
     'ner_source': 'ner_chunk_enriched',
     'ner_confidence': '0.99300003'},
    {'chunk_id': '81bc095c',
     'chunk': '7194334',
     'begin': 78,
     'e
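The parsed result nests entities under `result[i]['entities']`, each with a `chunk` and an `ner_label`. A common next step is to group chunks by label. The sketch below does this in plain Python; `chunks_by_label` is a hypothetical helper, and `sample` holds the first three entities copied from the output above.

```python
from collections import defaultdict


def chunks_by_label(parsed, doc_index=0):
    """Group entity chunks of one document by their NER label."""
    grouped = defaultdict(list)
    for entity in parsed["result"][doc_index]["entities"]:
        grouped[entity["ner_label"]].append(entity["chunk"])
    return dict(grouped)


# First three entities copied from the printed result.
sample = {
    "result": [
        {
            "entities": [
                {"chunk_id": "78463532", "chunk": "2093-01-13", "begin": 14, "end": 23,
                 "ner_label": "DATE", "ner_source": None, "ner_confidence": None},
                {"chunk_id": "60a35054", "chunk": "David Hale", "begin": 27, "end": 36,
                 "ner_label": "DOCTOR", "ner_source": "ner_chunk_enriched",
                 "ner_confidence": "0.9895"},
                {"chunk_id": "9d3e7907", "chunk": "Hendrickson Ora", "begin": 55, "end": 69,
                 "ner_label": "PATIENT", "ner_source": "ner_chunk_enriched",
                 "ner_confidence": "0.99300003"},
            ]
        }
    ]
}

grouped = chunks_by_label(sample)
print(grouped)
```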

## icd10cm_resolver_pipeline

In [None]:
from sparknlp.pretrained import PretrainedPipeline

icd10cm_pipeline = PretrainedPipeline("icd10cm_resolver_pipeline", "en", "clinical/models")

text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage"""

results = icd10cm_pipeline.fullAnnotate(text)

for stage in icd10cm_pipeline.model.stages:
    # Some stages expose a single input column, others a list of input columns
    try:
        input_col = stage.getInputCol()
    except Exception:
        input_col = stage.getInputCols()
    print(input_col, stage.getOutputCol())

icd10cm_resolver_pipeline download started this may take some time.
Approx size to download 3.3 GB
[OK!]
text document
['document'] sentence
['sentence'] token
['sentence', 'token'] embeddings
['sentence', 'token', 'embeddings'] ner
['sentence', 'token', 'ner'] chunk
['chunk'] icd10cm_mapper
['chunk'] icd10cm_mapper
['chunk', 'icd10cm_mapper'] chunks_fail
['chunks_fail'] chunk_doc
['chunk_doc'] sentence_embeddings
['sentence_embeddings'] resolver_code
['resolver_code', 'icd10cm_mapper'] icd10cm
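The input/output listing above defines a dependency graph between columns. To see which columns feed a given output, the graph can be walked backwards. The sketch below does this in plain Python; `upstream_columns` is a hypothetical helper, and `stage_io` is copied by hand from the printed listing (including the two stages that both write `icd10cm_mapper`).

```python
def upstream_columns(stage_io, target):
    """Return every column that the target column depends on, directly or indirectly."""
    producers = {}
    for inputs, output in stage_io:
        # Several stages may write the same output column; merge their inputs.
        producers.setdefault(output, set()).update(inputs)

    seen = set()
    stack = [target]
    while stack:
        column = stack.pop()
        for parent in producers.get(column, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# (input columns, output column) pairs copied from the listing above.
stage_io = [
    (["text"], "document"),
    (["document"], "sentence"),
    (["sentence"], "token"),
    (["sentence", "token"], "embeddings"),
    (["sentence", "token", "embeddings"], "ner"),
    (["sentence", "token", "ner"], "chunk"),
    (["chunk"], "icd10cm_mapper"),
    (["chunk"], "icd10cm_mapper"),
    (["chunk", "icd10cm_mapper"], "chunks_fail"),
    (["chunks_fail"], "chunk_doc"),
    (["chunk_doc"], "sentence_embeddings"),
    (["sentence_embeddings"], "resolver_code"),
    (["resolver_code", "icd10cm_mapper"], "icd10cm"),
]

chunk_upstream = upstream_columns(stage_io, "chunk")
print(sorted(chunk_upstream))
```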


In [None]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

pipeline_tracer = PipelineTracer(icd10cm_pipeline)

column_maps = pipeline_tracer.createParserDictionary()
column_maps.update({"document_identifier": "icd10cm_resolver_pipeline"})
column_maps

{'document_identifier': 'icd10cm_resolver_pipeline',
 'document_text': 'document',
 'entities': ['chunks_fail'],
 'assertions': [],
 'resolutions': [{'vocab': 'icd10cm', 'resolver_column_name': 'icd10cm'}],
 'relations': [],
 'summaries': [],
 'deidentifications': [],
 'classifications': []}

In [None]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

columns_directly = PipelineTracer.getParserDictDirectly("icd10cm_resolver_pipeline", "en", "clinical/models")
columns_directly

{'document_identifier': 'icd10cm_resolver_pipeline',
 'document_text': 'document',
 'entities': ['chunk'],
 'assertions': [],
 'resolutions': [{'vocab': 'icd10cm', 'resolver_column_name': 'icd10cm'}],
 'relations': [],
 'summaries': [],
 'deidentifications': [],
 'classifications': []}
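The dictionary from `createParserDictionary` and the one from `getParserDictDirectly` are not identical for this pipeline: the `entities` column differs. A small comparison makes such differences easy to spot. The sketch below is plain Python; `diff_parser_dicts` is a hypothetical helper, and both dictionaries are copied from the outputs above.

```python
def diff_parser_dicts(left, right):
    """Return {key: (left_value, right_value)} for every key whose values differ."""
    keys = sorted(set(left) | set(right))
    return {
        key: (left.get(key), right.get(key))
        for key in keys
        if left.get(key) != right.get(key)
    }


# Copied from the createParserDictionary() output.
traced = {
    "document_identifier": "icd10cm_resolver_pipeline",
    "document_text": "document",
    "entities": ["chunks_fail"],
    "assertions": [],
    "resolutions": [{"vocab": "icd10cm", "resolver_column_name": "icd10cm"}],
    "relations": [],
    "summaries": [],
    "deidentifications": [],
    "classifications": [],
}

# Copied from the getParserDictDirectly() output.
direct = dict(traced, entities=["chunk"])

differences = diff_parser_dicts(traced, direct)
print(differences)
```

The parser cell that follows uses `columns_directly`, so its entities come from the `chunk` column.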

In [None]:
from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser
pipeline_parser = PipelineOutputParser(columns_directly)
result = pipeline_parser.run(results)

result

{'result': [{'document_identifier': 'icd10cm_resolver_pipeline',
   'document_text': ['A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage'],
   'entities': [{'chunk_id': '230909b1',
     'chunk': 'gestational diabetes mellitus',
     'begin': 39,
     'end': 67,
     'ner_label': 'PROBLEM',
     'ner_source': None,
     'ner_confidence': '0.9424'},
    {'chunk_id': 'd280706c',
     'chunk': 'anisakiasis',
     'begin': 95,
     'end': 105,
     'ner_label': 'PROBLEM',
     'ner_source': None,
     'ner_confidence': '0.9933'},
    {'chunk_id': '9df194a1',
     'chunk': 'fetal and neonatal hemorrhage',
     'begin': 135,
     'end': 163,
     'ner_label': 'PROBLEM',
     'ner_source': None,
     'ner_confidence': '0.7501'}],
   'assertions': [],
   'resolutions': [{'vocab': 'icd10cm',
     'chunk_id': '230909b1',
     'chunk': 'gestational diabetes mellitus',
     'code': 'O2

In [None]:
column_maps = {
    "document_identifier": "icd10cm_resolver_pipeline",
    "document_text": "document",
    "entities": ["chunk"],
    # "assertions": [],
    "resolutions": [
        {
            "vocab":"icd10",
            "resolver_column_name": "icd10cm"
        }
    ]
    # "relations": [],
    # "summaries": [],
    # "deidentifications": [],
    # "classifications":[]
}

pipeline_parser = PipelineOutputParser(column_maps)

parsed_result = pipeline_parser.run(results)
parsed_result

{'result': [{'document_identifier': 'icd10cm_resolver_pipeline',
   'document_text': ['A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage'],
   'entities': [{'chunk_id': '230909b1',
     'chunk': 'gestational diabetes mellitus',
     'begin': 39,
     'end': 67,
     'ner_label': 'PROBLEM',
     'ner_source': None,
     'ner_confidence': '0.9424'},
    {'chunk_id': 'd280706c',
     'chunk': 'anisakiasis',
     'begin': 95,
     'end': 105,
     'ner_label': 'PROBLEM',
     'ner_source': None,
     'ner_confidence': '0.9933'},
    {'chunk_id': '9df194a1',
     'chunk': 'fetal and neonatal hemorrhage',
     'begin': 135,
     'end': 163,
     'ner_label': 'PROBLEM',
     'ner_source': None,
     'ner_confidence': '0.7501'}],
   'assertions': [],
   'resolutions': [{'vocab': 'icd10',
     'chunk_id': '230909b1',
     'chunk': 'gestational diabetes mellitus',
     'code': 'O24.
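Entities and resolutions share a `chunk_id`, so the two lists can be joined to put each chunk next to its codes. The sketch below shows one way to do that in plain Python. `attach_codes` is a hypothetical helper, only the fields visible in the output above are used, and the code value is a placeholder because the printed output is truncated.

```python
def attach_codes(document):
    """Join one parsed document's entities with its resolutions on chunk_id."""
    codes = {}
    for resolution in document.get("resolutions", []):
        codes.setdefault(resolution["chunk_id"], []).append(
            (resolution["vocab"], resolution["code"])
        )
    return [
        {
            "chunk": entity["chunk"],
            "ner_label": entity["ner_label"],
            "codes": codes.get(entity["chunk_id"], []),
        }
        for entity in document["entities"]
    ]


# Fragment shaped like one element of parsed_result["result"].
document = {
    "entities": [
        {"chunk_id": "230909b1", "chunk": "gestational diabetes mellitus",
         "ner_label": "PROBLEM"},
        {"chunk_id": "d280706c", "chunk": "anisakiasis", "ner_label": "PROBLEM"},
    ],
    "resolutions": [
        # "<CODE>" is a placeholder, not a real value from the pipeline.
        {"vocab": "icd10", "chunk_id": "230909b1",
         "chunk": "gestational diabetes mellitus", "code": "<CODE>"},
    ],
}

rows = attach_codes(document)
for row in rows:
    print(row)
```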

## explain_clinical_doc_biomarker

In [None]:
from sparknlp.pretrained import PretrainedPipeline

biomarker_pipeline = PretrainedPipeline("explain_clinical_doc_biomarker", "en", "clinical/models")

results = biomarker_pipeline.fullAnnotate("""In the bone- marrow (BM) aspiration, blasts accounted for 88.1% of ANCs, which were positive for CD9 and CD10 on flow cytometry. Measurements of serum tumor markers showed elevated level of Cyfra21-1: 4.77 ng/mL, NSE: 19.60 ng/mL, and SCCA: 2.58 ng/mL. Immunohistochemical staining showed positive staining for CK5/6, P40, and negative staining for TTF-1 and weakly positive staining for ALK.""")


for stage in biomarker_pipeline.model.stages:
    # Some stages expose a single input column, others a list of input columns
    try:
        input_col = stage.getInputCol()
    except Exception:
        input_col = stage.getInputCols()
    print(input_col, stage.getOutputCol())

explain_clinical_doc_biomarker download started this may take some time.
Approx size to download 2 GB
[OK!]
text document
['document'] sentence
['sentence'] token
['sentence', 'token'] embeddings
['sentence', 'token'] prediction
['sentence', 'token'] matched_biomarker
['sentence', 'token', 'embeddings'] oncology_ner
['sentence', 'token', 'oncology_ner'] ner_oncology_chunk
['sentence', 'token', 'embeddings'] biomarker_ner
['sentence', 'token', 'biomarker_ner'] ner_biomarker_chunk
['ner_oncology_chunk', 'ner_biomarker_chunk', 'matched_biomarker'] merged_chunk
['sentence', 'token'] pos_tags
['sentence', 'pos_tags', 'token'] dependencies
['embeddings', 'pos_tags', 'merged_chunk', 'dependencies'] re_oncology_biomarker_result_wip


In [None]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

pipeline_tracer = PipelineTracer(biomarker_pipeline)

column_maps = pipeline_tracer.createParserDictionary()
column_maps.update({"document_identifier": "explain_clinical_doc_biomarker"})
column_maps

{'document_identifier': 'explain_clinical_doc_biomarker',
 'document_text': 'document',
 'entities': ['merged_chunk'],
 'assertions': [],
 'resolutions': [],
 'relations': ['re_oncology_biomarker_result_wip'],
 'summaries': [],
 'deidentifications': [],
 'classifications': [{'classification_column_name': 'prediction',
   'sentence_column_name': 'sentence'}]}

In [None]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

columns_directly = PipelineTracer.getParserDictDirectly("explain_clinical_doc_biomarker", "en", "clinical/models")
columns_directly

{'document_identifier': 'explain_clinical_doc_biomarker',
 'document_text': 'document',
 'entities': ['ner_biomarker_chunk'],
 'assertions': [],
 'resolutions': [],
 'relations': ['re_oncology_biomarker_result_wip'],
 'summaries': [],
 'deidentifications': [],
 'classifications': [{'classification_column_name': 'prediction',
   'sentence_column_name': 'sentence'}]}

In [None]:
from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser
pipeline_parser = PipelineOutputParser(column_maps)
result = pipeline_parser.run(results)

result

{'result': [{'document_identifier': 'explain_clinical_doc_biomarker',
   'document_text': ['In the bone- marrow (BM) aspiration, blasts accounted for 88.1% of ANCs, which were positive for CD9 and CD10 on flow cytometry. Measurements of serum tumor markers showed elevated level of Cyfra21-1: 4.77 ng/mL, NSE: 19.60 ng/mL, and SCCA: 2.58 ng/mL. Immunohistochemical staining showed positive staining for CK5/6, P40, and negative staining for TTF-1 and weakly positive staining for ALK.'],
   'entities': [{'chunk_id': 'bc15add6',
     'chunk': 'positive',
     'begin': 84,
     'end': 91,
     'ner_label': 'Biomarker_Result',
     'ner_source': 'ner_oncology_chunk',
     'ner_confidence': '0.9672'},
    {'chunk_id': 'b473fd80',
     'chunk': 'CD9',
     'begin': 97,
     'end': 99,
     'ner_label': 'Biomarker',
     'ner_source': 'ner_oncology_chunk',
     'ner_confidence': '0.992'},
    {'chunk_id': '0252d08a',
     'chunk': 'CD10',
     'begin': 105,
     'end': 108,
     'ner_label': 'Bio

In [None]:
column_maps = {
    "document_identifier": "explain_clinical_doc_biomarker",
    "document_text": "document",
    "entities": ["merged_chunk"],
    "assertions": [],
    "resolutions": [],
    "relations": ["re_oncology_biomarker_result_wip"],
    "summaries": None,
    "deidentifications": [],
    "classifications":[{
        "classification_column_name": "prediction",
        "sentence_column_name": "sentence",

    }]
}

pipeline_parser = PipelineOutputParser(column_maps)

parsed_result = pipeline_parser.run(results)
parsed_result

{'result': [{'document_identifier': 'explain_clinical_doc_biomarker',
   'document_text': ['In the bone- marrow (BM) aspiration, blasts accounted for 88.1% of ANCs, which were positive for CD9 and CD10 on flow cytometry. Measurements of serum tumor markers showed elevated level of Cyfra21-1: 4.77 ng/mL, NSE: 19.60 ng/mL, and SCCA: 2.58 ng/mL. Immunohistochemical staining showed positive staining for CK5/6, P40, and negative staining for TTF-1 and weakly positive staining for ALK.'],
   'entities': [{'chunk_id': 'bc15add6',
     'chunk': 'positive',
     'begin': 84,
     'end': 91,
     'ner_label': 'Biomarker_Result',
     'ner_source': 'ner_oncology_chunk',
     'ner_confidence': '0.9672'},
    {'chunk_id': 'b473fd80',
     'chunk': 'CD9',
     'begin': 97,
     'end': 99,
     'ner_label': 'Biomarker',
     'ner_source': 'ner_oncology_chunk',
     'ner_confidence': '0.992'},
    {'chunk_id': '0252d08a',
     'chunk': 'CD10',
     'begin': 105,
     'end': 108,
     'ner_label': 'Bio
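In the parsed output, `ner_confidence` is a string and, as seen in the `clinical_deidentification` result earlier, it can also be `None`. Filtering by confidence therefore needs a conversion and a `None` check. The sketch below is plain Python; `confident_entities` is a hypothetical helper, the first two entities are copied from the output above, and the third is borrowed from the de-identification output only to exercise the `None` case.

```python
def confident_entities(entities, threshold):
    """Return chunks whose ner_confidence parses to a float >= threshold."""
    kept = []
    for entity in entities:
        raw = entity.get("ner_confidence")
        if raw is None:
            continue  # entities without a confidence value are skipped
        if float(raw) >= threshold:
            kept.append(entity["chunk"])
    return kept


entities = [
    {"chunk": "positive", "ner_label": "Biomarker_Result", "ner_confidence": "0.9672"},
    {"chunk": "CD9", "ner_label": "Biomarker", "ner_confidence": "0.992"},
    # Borrowed from the clinical_deidentification output to show the None case.
    {"chunk": "2093-01-13", "ner_label": "DATE", "ner_confidence": None},
]

high_confidence = confident_entities(entities, threshold=0.98)
print(high_confidence)
```

Skipping entities without a confidence value is a design choice for this sketch. Depending on the use case, keeping them may be more appropriate.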

## explain_clinical_doc_oncology

Model card: https://nlp.johnsnowlabs.com/2024/05/06/explain_clinical_doc_oncology_en.html

In [None]:
from sparknlp.pretrained import PretrainedPipeline

oncology_pipeline = PretrainedPipeline("explain_clinical_doc_oncology", "en", "clinical/models")

results = oncology_pipeline.fullAnnotate("""The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ovarian mass. A Pap smear performed one month later was positive for atypical glandular cells suspicious for adenocarcinoma. The pathologic specimen showed extension of the tumor throughout the fallopian tubes, appendix, omentum, and 5 out of 5 enlarged lymph nodes. The final pathologic diagnosis of the tumor was stage IIIC papillary serous ovarian adenocarcinoma. Two months later, the patient was diagnosed with lung metastases.Neoadjuvant chemotherapy with the regimens of Cyclophosphamide (500 mg/m2) is being given for 6 cycles with poor response""")

for stage in oncology_pipeline.model.stages:
    # Some stages expose a single input column, others a list of input columns
    try:
        input_col = stage.getInputCol()
    except Exception:
        input_col = stage.getInputCols()
    print(input_col, stage.getOutputCol())

explain_clinical_doc_oncology download started this may take some time.
Approx size to download 1.8 GB
[OK!]
text document
['document'] sentence
['sentence'] token
['sentence', 'token'] embeddings
['sentence', 'token', 'embeddings'] ner_oncology
['sentence', 'token', 'ner_oncology'] ner_oncology_chunk
['sentence', 'token', 'embeddings'] ner_oncology_anatomy_general
['sentence', 'token', 'ner_oncology_anatomy_general'] ner_oncology_anatomy_general_chunk
['sentence', 'token', 'embeddings'] ner_oncology_response_to_treatment
['sentence', 'token', 'ner_oncology_response_to_treatment'] ner_oncology_response_to_treatment_chunk
['sentence', 'token', 'embeddings'] ner_oncology_unspecific_posology
['sentence', 'token', 'ner_oncology_unspecific_posology'] ner_oncology_unspecific_posology_chunk
['sentence', 'token', 'embeddings'] ner_oncology_tnm
['sentence', 'token', 'ner_oncology_tnm'] ner_oncology_tnm_chunk
['sentence', 'token', 'embeddings'] ner_jsl
['sentence', 'token', 'ner_jsl'] ner_jsl_ch

In [None]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

pipeline_tracer = PipelineTracer(oncology_pipeline)

column_maps = pipeline_tracer.createParserDictionary()
column_maps.update({"document_identifier": "explain_clinical_doc_oncology"})
column_maps

{'document_identifier': 'explain_clinical_doc_oncology',
 'document_text': 'document',
 'entities': ['merged_chunk', 'merged_chunk_for_assertion'],
 'assertions': ['assertion'],
 'resolutions': [],
 'relations': ['all_relations'],
 'summaries': [],
 'deidentifications': [],
 'classifications': []}

In [None]:
print(column_maps)

{'document_identifier': 'explain_clinical_doc_oncology', 'document_text': 'document', 'entities': ['merged_chunk', 'merged_chunk_for_assertion'], 'assertions': ['assertion'], 'resolutions': [], 'relations': ['all_relations'], 'summaries': [], 'deidentifications': [], 'classifications': []}


In [None]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

columns_directly = PipelineTracer.getParserDictDirectly("explain_clinical_doc_oncology", "en", "clinical/models")
columns_directly

{'document_identifier': 'explain_clinical_doc_oncology',
 'document_text': 'document',
 'entities': ['merged_chunk_for_assertion', 'merged_chunk'],
 'assertions': ['assertion'],
 'resolutions': [],
 'relations': ['all_relations'],
 'summaries': [],
 'deidentifications': [],
 'classifications': []}
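For this pipeline, the two dictionaries list the same entity columns in a different order, so a plain `==` on the lists reports a difference even though the columns are the same. The sketch below compares them without regard to order. `same_columns` is a hypothetical helper and assumes the values under the given key are lists of strings.

```python
def same_columns(left, right, key="entities"):
    """Compare two parser dictionaries on one list-valued key, ignoring order."""
    return sorted(left[key]) == sorted(right[key])


# Entity lists copied from the two outputs above.
traced = {"entities": ["merged_chunk", "merged_chunk_for_assertion"]}
direct = {"entities": ["merged_chunk_for_assertion", "merged_chunk"]}

print("identical lists:", traced["entities"] == direct["entities"])
print("same columns:", same_columns(traced, direct))
```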

In [None]:
from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser
pipeline_parser = PipelineOutputParser(column_maps)
result = pipeline_parser.run(results)

result

{'result': [{'document_identifier': 'explain_clinical_doc_oncology',
   'document_text': ['The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ovarian mass. A Pap smear performed one month later was positive for atypical glandular cells suspicious for adenocarcinoma. The pathologic specimen showed extension of the tumor throughout the fallopian tubes, appendix, omentum, and 5 out of 5 enlarged lymph nodes. The final pathologic diagnosis of the tumor was stage IIIC papillary serous ovarian adenocarcinoma. Two months later, the patient was diagnosed with lung metastases.Neoadjuvant chemotherapy with the regimens of Cyclophosphamide (500 mg/m2) is being given for 6 cycles with poor response'],
   'entities': [{'chunk_id': '1b71b12a',
     'chunk': 'computed tomography',
     'begin': 24,
     'end': 42,
     'ner_label': 'Imaging_Test',
     'ner_source': 'ner_oncology_chunk',
     'ner_confidence': '0.9575'},
    {'chunk_id': 'ce9ac1a9'

## With a Custom Pipeline

In [None]:
def get_pipeline_model():
    documentAssembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

    sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

    tokenizer = Tokenizer()\
        .setInputCols("sentence")\
        .setOutputCol("token")

    # ADE classifier
    sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_ade_augmented", "en", "clinical/models")\
        .setInputCols(["sentence","token"])\
        .setOutputCol("ade_classification")

    word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
        .setInputCols("sentence", "token")\
        .setOutputCol("word_embeddings")

    # to get PROBLEM and TEST entities
    clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \
        .setInputCols(["sentence", "token", "word_embeddings"]) \
        .setOutputCol("clinical_ner")

    clinical_ner_chunk = NerConverterInternal()\
        .setInputCols("sentence","token","clinical_ner")\
        .setOutputCol("clinical_ner_chunk")\
        .setWhiteList(["PROBLEM","TEST"])

    # Assertion model trained on i2b2 (sampled from MIMIC) dataset
    assertion_jsl = AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
        .setInputCols(["sentence", "clinical_ner_chunk", "word_embeddings"]) \
        .setOutputCol("assertion_jsl")\
        .setEntityAssertionCaseSensitive(False)

    # to get DRUG, DOSAGE, and DURATION entities
    posology_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") \
        .setInputCols(["sentence", "token", "word_embeddings"]) \
        .setOutputCol("posology_ner")

    posology_ner_chunk = NerConverterInternal()\
        .setInputCols("sentence","token","posology_ner")\
        .setOutputCol("posology_ner_chunk")\
        .setWhiteList(["DRUG","DOSAGE","DURATION"])

    # NER for de-identification
    deid_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
        .setInputCols(["sentence", "token", "word_embeddings"]) \
        .setOutputCol("deid_ner")

    deid_ner_chunk = NerConverterInternal()\
        .setInputCols(["sentence", "token", "deid_ner"])\
        .setOutputCol("deid_ner_chunk")

    # merge the chunks into a single ner_chunk
    chunk_merger = ChunkMergeApproach()\
        .setInputCols("clinical_ner_chunk","posology_ner_chunk")\
        .setOutputCol("merged_ner_chunk")\
        .setMergeOverlapping(False)

    obfuscation = DeIdentification()\
        .setInputCols(["sentence", "token", "deid_ner_chunk"]) \
        .setOutputCol("deidentified") \
        .setMode("obfuscate")\
        .setObfuscateDate(True)\
        .setObfuscateRefSource("faker") \
        .setMetadataMaskingPolicy("entity_labels")\
        .setOutputAsDocument(True)

    assertion_vop = AssertionDLModel.pretrained("assertion_vop_clinical", "en", "clinical/models") \
        .setInputCols(["sentence", "merged_ner_chunk", "word_embeddings"]) \
        .setOutputCol("assertion_vop")

    pos_tagger = PerceptronModel()\
        .pretrained("pos_clinical", "en", "clinical/models") \
        .setInputCols(["sentence", "token"])\
        .setOutputCol("pos_tags")

    dependency_parser = DependencyParserModel()\
        .pretrained("dependency_conllu", "en")\
        .setInputCols(["sentence", "pos_tags", "token"])\
        .setOutputCol("dependencies")

    generic_re = RelationExtractionModel()\
        .pretrained("generic_re")\
        .setInputCols(["word_embeddings", "pos_tags", "posology_ner_chunk", "dependencies"])\
        .setOutputCol("generic_re")\
        .setMaxSyntacticDistance(10)

    # convert chunks to doc to get sentence embeddings of them
    chunk2doc = Chunk2Doc()\
      .setInputCols("merged_ner_chunk")\
      .setOutputCol("doc_final_chunk")


    sbiobert_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
        .setInputCols(["doc_final_chunk"])\
        .setOutputCol("sbert_embeddings")\
        .setCaseSensitive(False)

    # filter PROBLEM entity embeddings
    router_sentence_icd10 = Router() \
        .setInputCols("sbert_embeddings") \
        .setFilterFieldsElements(["PROBLEM"]) \
        .setOutputCol("problem_embeddings")

    # filter DRUG entity embeddings
    router_sentence_rxnorm = Router() \
        .setInputCols("sbert_embeddings") \
        .setFilterFieldsElements(["DRUG"]) \
        .setOutputCol("drug_embeddings")

    # use problem_embeddings only
    icd_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_slim_billable_hcc","en", "clinical/models") \
        .setInputCols(["problem_embeddings"]) \
        .setOutputCol("icd10cm_code")\
        .setDistanceFunction("EUCLIDEAN")

    # use drug_embeddings only
    rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en", "clinical/models") \
        .setInputCols(["drug_embeddings"]) \
        .setOutputCol("rxnorm_code")\
        .setDistanceFunction("EUCLIDEAN")

    # summarization
    summarizer = MedicalSummarizer\
        .pretrained("summarizer_clinical_jsl")\
        .setInputCols(['document'])\
        .setOutputCol('summary')\
        .setMaxTextLength(512)\
        .setMaxNewTokens(512)

    pipeline = Pipeline(
        stages=[
            documentAssembler,
            sentenceDetector,
            tokenizer,
            sequenceClassifier,
            word_embeddings,
            clinical_ner,
            clinical_ner_chunk,
            assertion_jsl,
            posology_ner,
            posology_ner_chunk,
            deid_ner,
            deid_ner_chunk,
            chunk_merger,
            obfuscation,
            assertion_vop,
            pos_tagger,
            dependency_parser,
            generic_re,
            chunk2doc,
            sbiobert_embeddings,
            router_sentence_icd10,
            router_sentence_rxnorm,
            icd_resolver,
            rxnorm_resolver,
            summarizer
    ])

    # Fit on an empty DataFrame: no stage here is trained on data,
    # so fitting only materializes the PipelineModel.
    empty_data = spark.createDataFrame([['']]).toDF("text")
    return pipeline.fit(empty_data)

big_pipeline_model = get_pipeline_model()

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
bert_sequence_classifier_ade_augmented download started this may take some time.
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]
assertion_jsl_augmented download started this may take some time.
[OK!]
ner_posology download started this may take some time.
[OK!]
ner_deid_generic_augmented download started this may take some time.
[OK!]
assertion_vop_clinical download started this may take some time.
[OK!]
pos_clinical download started this may take some time.
Approximate size to download 1.5 MB
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_icd10cm_slim_billable_hcc

In [None]:
text = """
Ora Hendrickson, a 28-year-old female with a history of gestational diabetes, now type 2 diabetes, and obesity (BMI 33.5 kg/m²), presented with polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior, she completed a five-day course of amoxicillin for a respiratory infection and had been on dapagliflozin for six months.
On examination, she had dry oral mucosa and a benign abdomen. Key lab findings included serum glucose 111 mg/dL, bicarbonate 18 mmol/L, anion gap 20, triglycerides 508 mg/dL, and HbA1c 10%. Venous pH was 7.27, and serum lipase was normal at 43 U/L. Due to poor oral intake, she was admitted for starvation ketosis.
She also reported a two-week headache and anxiety when walking fast. Her father’s paralysis and workplace bullying were significant stressors, leading to insomnia treated with sleeping pills.
Ora, with insulin-dependent type 2 diabetes, coronary artery disease, and chronic renal insufficiency, was previously admitted for acute paraplegia. She developed pressure wounds on her left foot and sacral area. Transferred for further care, she was on multiple medications, including Fragmin, Xenaderm, Lantus, OxyContin, Avandia, and Neurontin. Pathology revealed tumor cells positive for estrogen and progesterone receptors.
Discharged with Avandia, Coumadin, metformin, and Lisinopril, she was also prescribed aspirin and an Albuterol inhaler for asthma.
"""

light_model = LightPipeline(big_pipeline_model)
results = light_model.fullAnnotate(text)

In [None]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

pipeline_tracer = PipelineTracer(big_pipeline_model)

column_maps = pipeline_tracer.createParserDictionary()
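# document_identifier is a free-form label; this cell reuses the name from the pretrained pipeline example above
column_maps.update({"document_identifier": "explain_clinical_doc_oncology"})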
column_maps

{'document_identifier': 'explain_clinical_doc_oncology',
 'document_text': 'document',
 'entities': ['deid_ner_chunk',
  'posology_ner_chunk',
  'clinical_ner_chunk',
  'merged_ner_chunk'],
 'assertions': ['assertion_jsl', 'assertion_vop'],
 'resolutions': [{'vocab': 'icd10cm_code',
   'resolver_column_name': 'icd10cm_code'},
  {'vocab': 'rxnorm_code', 'resolver_column_name': 'rxnorm_code'}],
 'relations': ['generic_re'],
 'summaries': ['summary'],
 'deidentifications': [{'original': 'sentence',
   'obfuscated': 'deidentified',
   'masked': ''}],
 'classifications': [{'classification_column_name': 'ade_classification',
   'sentence_column_name': 'sentence'}]}
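
Before handing a parser dictionary to `PipelineOutputParser`, it can help to confirm that every column it mentions is actually produced by the pipeline, especially after editing the dictionary by hand. The sketch below is a plain-Python check and not part of the library. The helper `referenced_columns` is hypothetical, the dictionary is copied from the output above, and the set of output columns is transcribed by hand from the `setOutputCol` calls in `get_pipeline_model`.

```python
def referenced_columns(column_maps):
    """Collect every column name mentioned in a parser dictionary."""
    skip_keys = {"document_identifier", "vocab"}  # labels, not column names
    found = set()

    def walk(value):
        if isinstance(value, str):
            if value:  # ignore empty strings such as 'masked': ''
                found.add(value)
        elif isinstance(value, dict):
            for key, item in value.items():
                if key not in skip_keys:
                    walk(item)
        elif isinstance(value, (list, tuple)):
            for item in value:
                walk(item)
        # None and any other types are ignored

    walk(column_maps)
    return found


# Copied from the createParserDictionary output shown above.
column_maps = {
    "document_identifier": "explain_clinical_doc_oncology",
    "document_text": "document",
    "entities": ["deid_ner_chunk", "posology_ner_chunk",
                 "clinical_ner_chunk", "merged_ner_chunk"],
    "assertions": ["assertion_jsl", "assertion_vop"],
    "resolutions": [
        {"vocab": "icd10cm_code", "resolver_column_name": "icd10cm_code"},
        {"vocab": "rxnorm_code", "resolver_column_name": "rxnorm_code"},
    ],
    "relations": ["generic_re"],
    "summaries": ["summary"],
    "deidentifications": [
        {"original": "sentence", "obfuscated": "deidentified", "masked": ""}
    ],
    "classifications": [
        {"classification_column_name": "ade_classification",
         "sentence_column_name": "sentence"}
    ],
}

# Output columns transcribed from the setOutputCol calls in get_pipeline_model.
pipeline_output_columns = {
    "document", "sentence", "token", "ade_classification", "word_embeddings",
    "clinical_ner", "clinical_ner_chunk", "assertion_jsl", "posology_ner",
    "posology_ner_chunk", "deid_ner", "deid_ner_chunk", "merged_ner_chunk",
    "deidentified", "assertion_vop", "pos_tags", "dependencies", "generic_re",
    "doc_final_chunk", "sbert_embeddings", "problem_embeddings",
    "drug_embeddings", "icd10cm_code", "rxnorm_code", "summary",
}

referenced = referenced_columns(column_maps)
missing = referenced - pipeline_output_columns
print("Columns missing from the pipeline:", sorted(missing))
```

An empty `missing` set means every referenced column exists in the pipeline. A misspelled column name in a hand-edited dictionary would show up in that set.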

In [None]:
column_maps = {
    'document_identifier': 'some document identifier',
    'document_text': 'document',
    'entities': ['clinical_ner_chunk', 'posology_ner_chunk', 'deid_ner_chunk'],
    'assertions': ['assertion_vop', 'assertion_jsl'],
    'resolutions': [{
            'vocab':"rxnorm",
            'resolver_column_name': 'rxnorm_code'
        },
        {
            'vocab':"icd10",
            'resolver_column_name': 'icd10cm_code'
    }],
    'relations': ['generic_re'],
    'summaries': ['summary'],
    'deidentifications': [{
        "original": "document",
        "obfuscated": "deidentified",
        "masked": None  # None: the parser looks up the masked text in the annotation metadata
    }],
    'classifications':[{
        "classification_column_name": "ade_classification",
        "sentence_column_name": "sentence",
    }]
}


pipeline_parser = PipelineOutputParser(column_maps)
result = pipeline_parser.run(results, return_relation_entities=True)

result['result'][0]

{'document_identifier': 'some document identifier',
 'document_text': ['\nOra Hendrickson, a 28-year-old female with a history of gestational diabetes, now type 2 diabetes, and obesity (BMI 33.5 kg/m²), presented with polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior, she completed a five-day course of amoxicillin for a respiratory infection and had been on dapagliflozin for six months.\nOn examination, she had dry oral mucosa and a benign abdomen. Key lab findings included serum glucose 111 mg/dL, bicarbonate 18 mmol/L, anion gap 20, triglycerides 508 mg/dL, and HbA1c 10%. Venous pH was 7.27, and serum lipase was normal at 43 U/L. Due to poor oral intake, she was admitted for starvation ketosis.\nShe also reported a two-week headache and anxiety when walking fast. Her father’s paralysis and workplace bullying were significant stressors, leading to insomnia treated with sleeping pills.\nOra, with insulin-dependent type 2 diabetes, coronary artery disease, and chronic r