![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.4.PipelineTracer_and_PipelineOutputParser.ipynb)

#   **📜 PipelineTracer and PipelineOutputParser**



# Colab Setup

In [None]:
import json
import os

from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [2]:
# Installing pyspark and spark-nlp
!pip install --upgrade -q pyspark==3.4.0 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
!pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
!pip install -q spark-nlp-display

In [3]:

import json
import os

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

spark = sparknlp_jsl.start(secret = SECRET)

spark.sparkContext.setLogLevel("ERROR")

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 5.3.2
Spark NLP_JSL Version : 5.3.3


# PipelineTracer



`PipelineTracer` traces the stages of a pipeline and reports information about them, including the entities, assertions, and relations the pipeline can produce. It works with both `PipelineModel` and `PretrainedPipeline`, and it can build a parser dictionary that is then used to create a `PipelineOutputParser`.


## **🔎 Methods**

`PipelineTracer` exposes the following methods:

- `printPipelineSchema`: Prints the schema of the pipeline.
- `createParserDictionary`: Returns a parser dictionary that can be used to create a `PipelineOutputParser`.
- `getPossibleEntities`: Returns a list of possible entities that the pipeline can include.
- `getPossibleAssertions`: Returns a list of possible assertions that the pipeline can include.
- `getPossibleRelations`: Returns a list of possible relations that the pipeline can include.
- `getPipelineStages`: Returns a list of `PipelineStage` objects that represent the stages of the pipeline.
- `getParserDictDirectly`: Returns a parser dictionary that can be used to create a `PipelineOutputParser`, without first creating a `PipelineTracer` object.
- `listAvailableModels`: Returns a list of available models for a given language and source.
- `showAvailableModels`: Prints a list of available models for a given language and source.

In [4]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer
from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser


### showAvailableModels

In [5]:
PipelineTracer.showAvailableModels(language="en", source="clinical/models")

clinical_deidentification
explain_clinical_doc_ade
explain_clinical_doc_biomarker
explain_clinical_doc_generic
explain_clinical_doc_granular
explain_clinical_doc_medication
explain_clinical_doc_oncology
explain_clinical_doc_public_health
explain_clinical_doc_radiology
explain_clinical_doc_risk_factors
explain_clinical_doc_vop
icd10cm_resolver_pipeline
icd10cm_rxnorm_resolver_pipeline
rxnorm_resolver_pipeline
snomed_resolver_pipeline


### listAvailableModels

In [6]:
for model in PipelineTracer.listAvailableModels():
  print(PipelineTracer.getParserDictDirectly(model))

{'document_identifier': 'clinical_deidentification', 'document_text': 'sentence', 'entities': [{'ner_chunk_column_name': 'ner_chunk', 'assertion_column_name': '', 'resolver_column_name': ''}], 'relations': [], 'summaries': [], 'deidentifications': [{'original': 'sentence', 'obfuscated': 'obfuscated', 'masked': ''}], 'classifications': []}
{'document_identifier': 'explain_clinical_doc_ade', 'document_text': 'document', 'entities': [{'ner_chunk_column_name': 'ner_chunks_ade', 'assertion_column_name': '', 'resolver_column_name': ''}, {'ner_chunk_column_name': 'ner_chunks_ade', 'assertion_column_name': 'assertion', 'resolver_column_name': ''}], 'relations': ['relations'], 'summaries': [], 'deidentifications': [], 'classifications': [{'classification_column_name': 'class', 'sentence_column_name': 'sentence'}]}
{'document_identifier': 'explain_clinical_doc_biomarker', 'document_text': 'document', 'entities': [{'ner_chunk_column_name': 'ner_biomarker_chunk', 'assertion_column_name': '', 'reso

### createParserDictionary

In [16]:
from sparknlp.pretrained import PretrainedPipeline

oncology_pipeline = PretrainedPipeline("explain_clinical_doc_oncology", "en", "clinical/models")


explain_clinical_doc_oncology download started this may take some time.
Approx size to download 1.8 GB
[OK!]


In [17]:
tracer = PipelineTracer(oncology_pipeline)
tracer.createParserDictionary()

{'document_identifier': '',
 'document_text': 'document',
 'entities': [{'ner_chunk_column_name': 'merged_chunk',
   'assertion_column_name': '',
   'resolver_column_name': ''},
  {'ner_chunk_column_name': 'merged_chunk_for_assertion',
   'assertion_column_name': 'assertion',
   'resolver_column_name': ''}],
 'relations': ['all_relations'],
 'summaries': [],
 'deidentifications': [],
 'classifications': []}
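
The parser dictionary is a plain Python structure, so it can be inspected without Spark. The following is a minimal sketch, not part of the library, that lists every output column the dictionary refers to. The helper name `referenced_columns` is hypothetical, and the dictionary is copied from the output above.

```python
# Parser dictionary copied from the createParserDictionary() output above.
parser_dict = {
    "document_identifier": "",
    "document_text": "document",
    "entities": [
        {"ner_chunk_column_name": "merged_chunk",
         "assertion_column_name": "",
         "resolver_column_name": ""},
        {"ner_chunk_column_name": "merged_chunk_for_assertion",
         "assertion_column_name": "assertion",
         "resolver_column_name": ""},
    ],
    "relations": ["all_relations"],
    "summaries": [],
    "deidentifications": [],
    "classifications": [],
}


def referenced_columns(parser_dict):
    """Collect every non-empty column name mentioned in a parser dictionary.

    Hypothetical helper for illustration; order of first appearance is kept.
    """
    columns = []

    def visit(value):
        if isinstance(value, str):
            if value and value not in columns:
                columns.append(value)
        elif isinstance(value, dict):
            for item in value.values():
                visit(item)
        elif isinstance(value, (list, tuple)):
            for item in value:
                visit(item)

    for key, value in parser_dict.items():
        if key == "document_identifier":
            continue  # a free-text label, not a column name
        visit(value)
    return columns


columns = referenced_columns(parser_dict)
print(columns)
```

A list like this can be compared against the columns of an annotated result to check that the dictionary and the pipeline output agree before parsing.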

### printPipelineSchema

In [18]:
tracer.printPipelineSchema()

root
 |-- DocumentAssembler
 |    |-- uid: string (DocumentAssembler_27a75510357d)
 |    |-- index: int (0)
 |    |-- inputCol: string (text)
 |    |-- outputCol: string (document)
 |    |-- inputAnnotatorType: none (----------)
 |    |-- outputAnnotatorType: string (DOCUMENT)
 |
 |-- SentenceDetectorDLModel
 |    |-- uid: string (SentenceDetectorDLModel_6bafc4746ea5)
 |    |-- index: int (1)
 |    |-- inputCols: array (document)
 |    |-- outputCol: string (sentence)
 |    |-- inputAnnotatorTypes: array (DOCUMENT)
 |    |-- outputAnnotatorType: string (DOCUMENT)
 |
 |-- TokenizerModel
 |    |-- uid: string (REGEX_TOKENIZER_6e5cf9a1fd71)
 |    |-- index: int (2)
 |    |-- inputCols: array (sentence)
 |    |-- outputCol: string (token)
 |    |-- inputAnnotatorTypes: array (DOCUMENT)
 |    |-- outputAnnotatorType: string (TOKEN)
 |
 |-- WordEmbeddingsModel
 |    |-- uid: string (WORD_EMBEDDINGS_MODEL_9004b1d00302)
 |    |-- index: int (3)
 |    |-- inputCols: array (sentence, token)
 |  

### getPossibleEntities

In [19]:
tracer.getPossibleEntities()

['Cycle_Number',
 'Direction',
 'Histological_Type',
 'Biomarker_Result',
 'Site_Other_Body_Part',
 'Hormonal_Therapy',
 'Death_Entity',
 'Targeted_Therapy',
 'Route',
 'Tumor_Finding',
 'Duration',
 'Pathology_Result',
 'Chemotherapy',
 'Date',
 'Radiotherapy',
 'Radiation_Dose',
 'Oncogene',
 'Cancer_Surgery',
 'Tumor_Size',
 'Staging',
 'Pathology_Test',
 'Cancer_Dx',
 'Age',
 'Site_Lung',
 'Site_Breast',
 'Site_Liver',
 'Site_Lymph_Node',
 'Response_To_Treatment',
 'Site_Brain',
 'Immunotherapy',
 'Race_Ethnicity',
 'Metastasis',
 'Smoking_Status',
 'Imaging_Test',
 'Relative_Date',
 'Line_Of_Therapy',
 'Unspecific_Therapy',
 'Site_Bone',
 'Gender',
 'Cycle_Count',
 'Cancer_Score',
 'Adenopathy',
 'Grade',
 'Biomarker',
 'Invasion',
 'Frequency',
 'Performance_Status',
 'Dosage',
 'Cycle_Day',
 'Anatomical_Site',
 'Size_Trend',
 'Posology_Information',
 'Cancer_Therapy',
 'Lymph_Node',
 'Tumor_Description',
 'Lymph_Node_Modifier',
 'Alcohol',
 'BMI',
 'Communicable_Disease',
 'Obes

### getPossibleAssertions

In [20]:
tracer.getPossibleAssertions()

['Past', 'Family', 'Absent', 'Hypothetical', 'Possible', 'Present']

### getPossibleRelations

In [21]:
tracer.getPossibleRelations()

['is_size_of', 'is_date_of', 'is_location_of', 'is_finding_of']

### getPipelineStages

In [22]:
stages = tracer.getPipelineStages()
for stage in stages:
    print(stage.__dict__())

{'uid': 'DocumentAssembler_27a75510357d', 'name': 'DocumentAssembler', 'index': 0, 'inputCol': StageField(inputCol, text, string), 'outputCol': StageField(outputCol, document, string), 'inputAnnotatorType': StageField(inputAnnotatorType, ----------, none), 'outputAnnotatorType': StageField(outputAnnotatorType, document, string)}
{'uid': 'SentenceDetectorDLModel_6bafc4746ea5', 'name': 'SentenceDetectorDLModel', 'index': 1, 'inputCol': StageField(inputCols, [document], array), 'outputCol': StageField(outputCol, sentence, string), 'inputAnnotatorType': StageField(inputAnnotatorTypes, [document], array), 'outputAnnotatorType': StageField(outputAnnotatorType, document, string)}
{'uid': 'REGEX_TOKENIZER_6e5cf9a1fd71', 'name': 'TokenizerModel', 'index': 2, 'inputCol': StageField(inputCols, [sentence], array), 'outputCol': StageField(outputCol, token, string), 'inputAnnotatorType': StageField(inputAnnotatorTypes, [document], array), 'outputAnnotatorType': StageField(outputAnnotatorType, token,

## With a Custom Pipeline




In [23]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel \
    .pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel \
    .pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk") \
    .setWhiteList(["TREATMENT", "PROBLEM"])

clinical_assertion = AssertionDLModel \
    .pretrained("assertion_dl_large", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion") \
    .setIncludeConfidence(True) \
    .setEntityAssertionCaseSensitive(False) \
    .setEntityAssertion({"treAtment": ["present"]}) \
    .setReplaceLabels({"PRESENT": "available", "absent": "none", "Conditional": "Optional"})

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    clinical_assertion])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]
assertion_dl_large download started this may take some time.
[OK!]


In [24]:
tracer = PipelineTracer(model)
tracer.createParserDictionary()

{'document_identifier': '',
 'document_text': 'document',
 'entities': [{'ner_chunk_column_name': 'ner_chunk',
   'assertion_column_name': '',
   'resolver_column_name': ''},
  {'ner_chunk_column_name': 'ner_chunk',
   'assertion_column_name': 'assertion',
   'resolver_column_name': ''}],
 'relations': [],
 'summaries': [],
 'deidentifications': [],
 'classifications': []}

In [25]:
tracer.getPossibleAssertions()

['available',
 'none',
 'hypothetical',
 'possible',
 'Optional',
 'associated_with_someone_else']
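
The list above reflects the `setReplaceLabels` mapping from the pipeline definition. The sketch below reproduces that renaming in plain Python. It rests on two assumptions that the notebook does not state: that the original labels of `assertion_dl_large` are the six listed in `assumed_original`, and that the replacement keys are matched case-insensitively. Both are inferred from the output, and `replace_labels` is a hypothetical helper, not the library's implementation.

```python
def replace_labels(labels, replacements):
    """Rename labels using a case-insensitive lookup on the replacement keys."""
    lookup = {old.lower(): new for old, new in replacements.items()}
    return [lookup.get(label.lower(), label) for label in labels]


# ASSUMPTION: original label set, inferred from the output above.
assumed_original = [
    "present",
    "absent",
    "hypothetical",
    "possible",
    "conditional",
    "associated_with_someone_else",
]

# Mapping copied from the setReplaceLabels(...) call in the custom pipeline.
replacements = {"PRESENT": "available", "absent": "none", "Conditional": "Optional"}

renamed = replace_labels(assumed_original, replacements)
print(renamed)
```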

In [26]:
tracer.getPossibleEntities()

['TREATMENT', 'PROBLEM']

In [27]:
tracer.printPipelineSchema()

root
 |-- DocumentAssembler
 |    |-- uid: string (DocumentAssembler_99e0daa03750)
 |    |-- index: int (0)
 |    |-- inputCol: string (text)
 |    |-- outputCol: string (document)
 |    |-- inputAnnotatorType: none (----------)
 |    |-- outputAnnotatorType: string (DOCUMENT)
 |
 |-- SentenceDetector
 |    |-- uid: string (SentenceDetector_4f52ac3f7f8e)
 |    |-- index: int (1)
 |    |-- inputCols: array (document)
 |    |-- outputCol: string (sentence)
 |    |-- inputAnnotatorTypes: array (DOCUMENT)
 |    |-- outputAnnotatorType: string (DOCUMENT)
 |
 |-- TokenizerModel
 |    |-- uid: string (REGEX_TOKENIZER_0ef721ffcc32)
 |    |-- index: int (2)
 |    |-- inputCols: array (sentence)
 |    |-- outputCol: string (token)
 |    |-- inputAnnotatorTypes: array (DOCUMENT)
 |    |-- outputAnnotatorType: string (TOKEN)
 |
 |-- WordEmbeddingsModel
 |    |-- uid: string (WORD_EMBEDDINGS_MODEL_9004b1d00302)
 |    |-- index: int (3)
 |    |-- inputCols: array (sentence, token)
 |    |-- outputCo

# PipelineOutputParser

The output parser converts the raw results of a pretrained pipeline into a clean, easy-to-read dictionary. It is designed to simplify API integration, make results easier to understand, and streamline data analysis workflows.
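
Since the parsed output consists of plain dictionaries and lists, it should be straightforward to serialize for an API response. Here is a minimal sketch using the standard `json` module, assuming the result contains only strings, integers, lists, dictionaries, and `None`, as in the outputs shown in this notebook. The sample is a trimmed copy of the `icd10cm_resolver_pipeline` result (one entity only).

```python
import json

# Trimmed copy of a parsed result from this notebook (one entity kept).
parsed_result = {
    "result": [
        {
            "document_identifier": "icd10cm_resolver_pipeline",
            "document_text": [
                "A 28-year-old female with a history of gestational diabetes "
                "mellitus diagnosed eight years and anisakiasis. Also, it was "
                "reported that fetal and neonatal hemorrhage"
            ],
            "entities": [[
                {"chunk_id": "0", "begin": 39, "end": 67,
                 "chunk": "gestational diabetes mellitus", "label": "PROBLEM",
                 "assertion": None, "term_code": "O24.919"},
            ]],
            "relations": [],
            "summaries": [],
            "deidentifications": [],
            "classifications": [],
        }
    ]
}

# Serialize to a JSON string; None becomes null.
payload = json.dumps(parsed_result, indent=2)

# Round-trip to confirm nothing is lost in serialization.
restored = json.loads(payload)
print(payload)
```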

## clinical_deidentification

In [28]:

from sparknlp.pretrained import PretrainedPipeline
pretrained_pipeline = PretrainedPipeline("clinical_deidentification", "en", "clinical/models")

text = [
    '''Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .''',
    """Record date: 2093-01-13, David Hale, M.D., Name: Hendrickson, Ora MR. # 7194334 Date: 01/13/93 PCP: Oliveira, 25 years old, Record date: 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. The patient's complaints first surfaced when he started working for Brothers Coal-Mine.""",
]

results = pretrained_pipeline.fullAnnotate(text)


clinical_deidentification download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [29]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

pipeline_tracer = PipelineTracer(pretrained_pipeline)

column_maps = pipeline_tracer.createParserDictionary()
column_maps.update({"document_identifier": "clinical_deidentification"})
column_maps

{'document_identifier': 'clinical_deidentification',
 'document_text': 'sentence',
 'entities': [{'ner_chunk_column_name': 'ner_chunk',
   'assertion_column_name': '',
   'resolver_column_name': ''}],
 'relations': [],
 'summaries': [],
 'deidentifications': [{'original': 'sentence',
   'obfuscated': 'obfuscated',
   'masked': ''}],
 'classifications': []}

In [30]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

columns_directly = PipelineTracer.getParserDictDirectly("clinical_deidentification", "en", "clinical/models")
columns_directly

{'document_identifier': 'clinical_deidentification',
 'document_text': 'sentence',
 'entities': [{'ner_chunk_column_name': 'ner_chunk',
   'assertion_column_name': '',
   'resolver_column_name': ''}],
 'relations': [],
 'summaries': [],
 'deidentifications': [{'original': 'sentence',
   'obfuscated': 'obfuscated',
   'masked': ''}],
 'classifications': []}

In [31]:
from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser
pipeline_parser = PipelineOutputParser(column_maps)
result = pipeline_parser.run(results)

result

{'result': [{'document_identifier': 'clinical_deidentification',
   'document_text': ['Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .',
    'PCP : Oliveira , 25 years-old , Record date : 2079-11-09 .',
    'Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .'],
   'entities': [[{'chunk_id': '0',
      'begin': 14,
      'end': 23,
      'chunk': '2093-01-13',
      'label': 'DATE',
      'assertion': None,
      'term_code': None},
     {'chunk_id': '1',
      'begin': 27,
      'end': 36,
      'chunk': 'David Hale',
      'label': 'DOCTOR',
      'assertion': None,
      'term_code': None},
     {'chunk_id': '2',
      'begin': 55,
      'end': 69,
      'chunk': 'Hendrickson Ora',
      'label': 'PATIENT',
      'assertion': None,
      'term_code': None},
     {'chunk_id': '3',
      'begin': 78,
      'end': 84,
      'chunk': '7194334',
      'label': 'MEDICALRECORD',
      'assertion': None,
      'term

## icd10cm_resolver_pipeline

In [32]:
from sparknlp.pretrained import PretrainedPipeline

icd10cm_pipeline = PretrainedPipeline("icd10cm_resolver_pipeline", "en", "clinical/models")

text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage"""

results = icd10cm_pipeline.fullAnnotate(text)

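# Print each stage's input column(s) and output column.
for stage in icd10cm_pipeline.model.stages:
    try:
        input_col = stage.getInputCol()
    except Exception:
        # Fall back to getInputCols() when getInputCol() is unavailable.
        input_col = stage.getInputCols()
    print(input_col, stage.getOutputCol())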

icd10cm_resolver_pipeline download started this may take some time.
Approx size to download 3.3 GB
[OK!]
text document
['document'] sentence
['sentence'] token
['sentence', 'token'] embeddings
['sentence', 'token', 'embeddings'] ner
['sentence', 'token', 'ner'] chunk
['chunk'] icd10cm_mapper
['chunk'] icd10cm_mapper
['chunk', 'icd10cm_mapper'] chunks_fail
['chunks_fail'] chunk_doc
['chunk_doc'] sentence_embeddings
['sentence_embeddings'] resolver_code
['resolver_code', 'icd10cm_mapper'] icd10cm
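
The input/output listing above describes a dependency graph between columns. As a rough sketch, independent of Spark NLP, the pairs can be used to find every upstream column that feeds a given output. The function name `upstream_columns` is hypothetical, and the pairs are copied from the printed output.

```python
# (input columns, output column) pairs copied from the listing above.
stage_columns = [
    (["text"], "document"),
    (["document"], "sentence"),
    (["sentence"], "token"),
    (["sentence", "token"], "embeddings"),
    (["sentence", "token", "embeddings"], "ner"),
    (["sentence", "token", "ner"], "chunk"),
    (["chunk"], "icd10cm_mapper"),
    (["chunk"], "icd10cm_mapper"),
    (["chunk", "icd10cm_mapper"], "chunks_fail"),
    (["chunks_fail"], "chunk_doc"),
    (["chunk_doc"], "sentence_embeddings"),
    (["sentence_embeddings"], "resolver_code"),
    (["resolver_code", "icd10cm_mapper"], "icd10cm"),
]


def upstream_columns(stage_columns, target):
    """Return the set of all columns that `target` depends on, directly or not."""
    # Map each output column to the union of inputs of the stages producing it.
    producers = {}
    for inputs, output in stage_columns:
        producers.setdefault(output, set()).update(inputs)

    seen = set()
    stack = [target]
    while stack:
        column = stack.pop()
        for parent in producers.get(column, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(upstream_columns(stage_columns, "chunk")))
```

Calling it with `"icd10cm"` shows that the resolver column depends on both the mapper and the embedding-based resolver branches.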


In [33]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

pipeline_tracer = PipelineTracer(icd10cm_pipeline)

column_maps = pipeline_tracer.createParserDictionary()
column_maps.update({"document_identifier": "icd10cm_resolver_pipeline"})
column_maps

{'document_identifier': 'icd10cm_resolver_pipeline',
 'document_text': 'document',
 'entities': [{'ner_chunk_column_name': 'chunk',
   'assertion_column_name': '',
   'resolver_column_name': 'icd10cm'}],
 'relations': [],
 'summaries': [],
 'deidentifications': [],
 'classifications': []}

In [34]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

columns_directly = PipelineTracer.getParserDictDirectly("icd10cm_resolver_pipeline", "en", "clinical/models")
columns_directly

{'document_identifier': 'icd10cm_resolver_pipeline',
 'document_text': 'document',
 'entities': [{'ner_chunk_column_name': 'chunk',
   'assertion_column_name': '',
   'resolver_column_name': 'icd10cm'}],
 'relations': [],
 'summaries': [],
 'deidentifications': [],
 'classifications': []}

In [35]:
from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser
pipeline_parser = PipelineOutputParser(column_maps)
result = pipeline_parser.run(results)

result

{'result': [{'document_identifier': 'icd10cm_resolver_pipeline',
   'document_text': ['A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage'],
   'entities': [[{'chunk_id': '0',
      'begin': 39,
      'end': 67,
      'chunk': 'gestational diabetes mellitus',
      'label': 'PROBLEM',
      'assertion': None,
      'term_code': 'O24.919'},
     {'chunk_id': '1',
      'begin': 95,
      'end': 105,
      'chunk': 'anisakiasis',
      'label': 'PROBLEM',
      'assertion': None,
      'term_code': 'B81.0'},
     {'chunk_id': '2',
      'begin': 135,
      'end': 163,
      'chunk': 'fetal and neonatal hemorrhage',
      'label': 'PROBLEM',
      'assertion': None,
      'term_code': 'P549'}]],
   'relations': [],
   'summaries': [],
   'deidentifications': [],
   'classifications': []}]}
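
For downstream analysis it is often convenient to flatten the nested `entities` lists into rows. The sketch below does this for the result shown above and also checks the character offsets. In this output, `begin` and `end` behave as inclusive offsets into the document text. The helper `entity_rows` is hypothetical, and indexing `document_text[0]` assumes a single-text document as in this example.

```python
# Parsed result copied from the output above.
parsed = {
    "result": [
        {
            "document_identifier": "icd10cm_resolver_pipeline",
            "document_text": [
                "A 28-year-old female with a history of gestational diabetes mellitus "
                "diagnosed eight years and anisakiasis. Also, it was reported that "
                "fetal and neonatal hemorrhage"
            ],
            "entities": [[
                {"chunk_id": "0", "begin": 39, "end": 67,
                 "chunk": "gestational diabetes mellitus", "label": "PROBLEM",
                 "assertion": None, "term_code": "O24.919"},
                {"chunk_id": "1", "begin": 95, "end": 105,
                 "chunk": "anisakiasis", "label": "PROBLEM",
                 "assertion": None, "term_code": "B81.0"},
                {"chunk_id": "2", "begin": 135, "end": 163,
                 "chunk": "fetal and neonatal hemorrhage", "label": "PROBLEM",
                 "assertion": None, "term_code": "P549"},
            ]],
            "relations": [],
            "summaries": [],
            "deidentifications": [],
            "classifications": [],
        }
    ]
}


def entity_rows(parsed):
    """Flatten nested entity groups into (identifier, chunk, label, code) tuples."""
    rows = []
    for doc in parsed.get("result", []):
        for group in doc.get("entities", []):
            for ent in group:
                rows.append((doc.get("document_identifier"),
                             ent["chunk"], ent["label"], ent["term_code"]))
    return rows


rows = entity_rows(parsed)
for row in rows:
    print(row)

# Offsets are inclusive in this output, hence the + 1 on the slice end.
doc = parsed["result"][0]
text = doc["document_text"][0]
offsets_ok = all(text[e["begin"]:e["end"] + 1] == e["chunk"]
                 for e in doc["entities"][0])
print(offsets_ok)
```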

In [36]:
column_maps = {
    "document_identifier": "icd10cm_resolver_pipeline",
    "document_text": "document",
    "entities": [
        {
            "ner_chunk_column_name": "chunk",
            "assertion_column_name": None,
            "resolver_column_name": "icd10cm"
        },
    ],
    # "relations": [],
    # "summaries": [],
    # "deidentifications": [],
    # "classifications":[]
}

pipeline_parser = PipelineOutputParser(column_maps)

parsed_result = pipeline_parser.run(results)
parsed_result

{'result': [{'document_identifier': 'icd10cm_resolver_pipeline',
   'document_text': ['A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage'],
   'entities': [[{'chunk_id': '0',
      'begin': 39,
      'end': 67,
      'chunk': 'gestational diabetes mellitus',
      'label': 'PROBLEM',
      'assertion': None,
      'term_code': 'O24.919'},
     {'chunk_id': '1',
      'begin': 95,
      'end': 105,
      'chunk': 'anisakiasis',
      'label': 'PROBLEM',
      'assertion': None,
      'term_code': 'B81.0'},
     {'chunk_id': '2',
      'begin': 135,
      'end': 163,
      'chunk': 'fetal and neonatal hemorrhage',
      'label': 'PROBLEM',
      'assertion': None,
      'term_code': 'P549'}]],
   'relations': [],
   'summaries': [],
   'deidentifications': [],
   'classifications': []}]}

## explain_clinical_doc_biomarker

In [37]:
from sparknlp.pretrained import PretrainedPipeline

biomarker_pipeline = PretrainedPipeline("explain_clinical_doc_biomarker", "en", "clinical/models")

results = biomarker_pipeline.fullAnnotate("""In the bone- marrow (BM) aspiration, blasts accounted for 88.1% of ANCs, which were positive for CD9 and CD10 on flow cytometry. Measurements of serum tumor markers showed elevated level of Cyfra21-1: 4.77 ng/mL, NSE: 19.60 ng/mL, and SCCA: 2.58 ng/mL. Immunohistochemical staining showed positive staining for CK5/6, P40, and negative staining for TTF-1 and weakly positive staining for ALK.""")


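# Print each stage's input column(s) and output column.
for stage in biomarker_pipeline.model.stages:
    try:
        input_col = stage.getInputCol()
    except Exception:
        # Fall back to getInputCols() when getInputCol() is unavailable.
        input_col = stage.getInputCols()
    print(input_col, stage.getOutputCol())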

explain_clinical_doc_biomarker download started this may take some time.
Approx size to download 2 GB
[OK!]
text document
['document'] sentence
['sentence'] token
['sentence', 'token'] embeddings
['sentence', 'token'] prediction
['sentence', 'token'] matched_biomarker
['sentence', 'token', 'embeddings'] oncology_ner
['sentence', 'token', 'oncology_ner'] ner_oncology_chunk
['sentence', 'token', 'embeddings'] biomarker_ner
['sentence', 'token', 'biomarker_ner'] ner_biomarker_chunk
['ner_oncology_chunk', 'ner_biomarker_chunk', 'matched_biomarker'] merged_chunk
['sentence', 'token'] pos_tags
['sentence', 'pos_tags', 'token'] dependencies
['embeddings', 'pos_tags', 'merged_chunk', 'dependencies'] re_oncology_biomarker_result_wip


In [38]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

pipeline_tracer = PipelineTracer(biomarker_pipeline)

column_maps = pipeline_tracer.createParserDictionary()
column_maps.update({"document_identifier": "explain_clinical_doc_biomarker"})
column_maps

{'document_identifier': 'explain_clinical_doc_biomarker',
 'document_text': 'document',
 'entities': [{'ner_chunk_column_name': 'merged_chunk',
   'assertion_column_name': '',
   'resolver_column_name': ''}],
 'relations': ['re_oncology_biomarker_result_wip'],
 'summaries': [],
 'deidentifications': [],
 'classifications': [{'classification_column_name': 'prediction',
   'sentence_column_name': 'sentence'}]}

In [39]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

columns_directly = PipelineTracer.getParserDictDirectly("explain_clinical_doc_biomarker", "en", "clinical/models")
columns_directly

{'document_identifier': 'explain_clinical_doc_biomarker',
 'document_text': 'document',
 'entities': [{'ner_chunk_column_name': 'ner_biomarker_chunk',
   'assertion_column_name': '',
   'resolver_column_name': ''}],
 'relations': ['re_oncology_biomarker_result_wip'],
 'summaries': '',
 'deidentifications': [],
 'classifications': [{'classification_column_name': 'prediction',
   'sentence_column_name': 'sentence'}]}
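
Note that for this pipeline the dictionary built by the tracer and the one returned by `getParserDictDirectly` are not identical. A small comparison helper makes such differences visible before choosing which dictionary to pass to the parser. This is a sketch only: `diff_parser_dicts` is a hypothetical name, and both dictionaries are copied from the two outputs above.

```python
# Dictionary built by the tracer (with document_identifier set), from above.
from_tracer = {
    "document_identifier": "explain_clinical_doc_biomarker",
    "document_text": "document",
    "entities": [{"ner_chunk_column_name": "merged_chunk",
                  "assertion_column_name": "",
                  "resolver_column_name": ""}],
    "relations": ["re_oncology_biomarker_result_wip"],
    "summaries": [],
    "deidentifications": [],
    "classifications": [{"classification_column_name": "prediction",
                         "sentence_column_name": "sentence"}],
}

# Dictionary returned by getParserDictDirectly, from above.
direct = {
    "document_identifier": "explain_clinical_doc_biomarker",
    "document_text": "document",
    "entities": [{"ner_chunk_column_name": "ner_biomarker_chunk",
                  "assertion_column_name": "",
                  "resolver_column_name": ""}],
    "relations": ["re_oncology_biomarker_result_wip"],
    "summaries": "",
    "deidentifications": [],
    "classifications": [{"classification_column_name": "prediction",
                         "sentence_column_name": "sentence"}],
}


def diff_parser_dicts(left, right):
    """Return {key: (left_value, right_value)} for top-level keys that differ."""
    differences = {}
    for key in sorted(set(left) | set(right)):
        if left.get(key) != right.get(key):
            differences[key] = (left.get(key), right.get(key))
    return differences


diff = diff_parser_dicts(from_tracer, direct)
for key, (left_value, right_value) in diff.items():
    print(key, "|", left_value, "|", right_value)
```

Here the two dictionaries differ in the NER chunk column (`merged_chunk` versus `ner_biomarker_chunk`) and in the type of the empty `summaries` value.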

In [40]:
from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser
pipeline_parser = PipelineOutputParser(column_maps)
result = pipeline_parser.run(results)

result

{'result': [{'document_identifier': 'explain_clinical_doc_biomarker',
   'document_text': ['In the bone- marrow (BM) aspiration, blasts accounted for 88.1% of ANCs, which were positive for CD9 and CD10 on flow cytometry. Measurements of serum tumor markers showed elevated level of Cyfra21-1: 4.77 ng/mL, NSE: 19.60 ng/mL, and SCCA: 2.58 ng/mL. Immunohistochemical staining showed positive staining for CK5/6, P40, and negative staining for TTF-1 and weakly positive staining for ALK.'],
   'entities': [[{'chunk_id': '0',
      'begin': 84,
      'end': 91,
      'chunk': 'positive',
      'label': 'Biomarker_Result',
      'assertion': None,
      'term_code': None},
     {'chunk_id': '1',
      'begin': 97,
      'end': 99,
      'chunk': 'CD9',
      'label': 'Biomarker',
      'assertion': None,
      'term_code': None},
     {'chunk_id': '2',
      'begin': 105,
      'end': 108,
      'chunk': 'CD10',
      'label': 'Biomarker',
      'assertion': None,
      'term_code': None},
     

In [41]:
column_maps = {
    "document_identifier": "explain_clinical_doc_biomarker",
    "document_text": "document",
    "entities": [
        {
            "ner_chunk_column_name": "merged_chunk",
            "assertion_column_name": None,
            "resolver_column_name": None
        }
    ],
    "relations": ["re_oncology_biomarker_result_wip"],
    "summaries": None,
    "deidentifications": [],
    "classifications":[{
        "classification_column_name": "prediction",
        "sentence_column_name": "sentence",

    }]
}

pipeline_parser = PipelineOutputParser(column_maps)

parsed_result = pipeline_parser.run(results)
parsed_result

{'result': [{'document_identifier': 'explain_clinical_doc_biomarker',
   'document_text': ['In the bone- marrow (BM) aspiration, blasts accounted for 88.1% of ANCs, which were positive for CD9 and CD10 on flow cytometry. Measurements of serum tumor markers showed elevated level of Cyfra21-1: 4.77 ng/mL, NSE: 19.60 ng/mL, and SCCA: 2.58 ng/mL. Immunohistochemical staining showed positive staining for CK5/6, P40, and negative staining for TTF-1 and weakly positive staining for ALK.'],
   'entities': [[{'chunk_id': '0',
      'begin': 84,
      'end': 91,
      'chunk': 'positive',
      'label': 'Biomarker_Result',
      'assertion': None,
      'term_code': None},
     {'chunk_id': '1',
      'begin': 97,
      'end': 99,
      'chunk': 'CD9',
      'label': 'Biomarker',
      'assertion': None,
      'term_code': None},
     {'chunk_id': '2',
      'begin': 105,
      'end': 108,
      'chunk': 'CD10',
      'label': 'Biomarker',
      'assertion': None,
      'term_code': None},
     

## explain_clinical_doc_oncology

https://nlp.johnsnowlabs.com/2024/05/06/explain_clinical_doc_oncology_en.html

In [42]:
from sparknlp.pretrained import PretrainedPipeline

oncology_pipeline = PretrainedPipeline("explain_clinical_doc_oncology", "en", "clinical/models")

results = oncology_pipeline.fullAnnotate("""The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ovarian mass. A Pap smear performed one month later was positive for atypical glandular cells suspicious for adenocarcinoma. The pathologic specimen showed extension of the tumor throughout the fallopian tubes, appendix, omentum, and 5 out of 5 enlarged lymph nodes. The final pathologic diagnosis of the tumor was stage IIIC papillary serous ovarian adenocarcinoma. Two months later, the patient was diagnosed with lung metastases.Neoadjuvant chemotherapy with the regimens of Cyclophosphamide (500 mg/m2) is being given for 6 cycles with poor response""")

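# Print each stage's input column(s) and output column.
for stage in oncology_pipeline.model.stages:
    try:
        input_col = stage.getInputCol()
    except Exception:
        # Fall back to getInputCols() when getInputCol() is unavailable.
        input_col = stage.getInputCols()
    print(input_col, stage.getOutputCol())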

explain_clinical_doc_oncology download started this may take some time.
Approx size to download 1.8 GB
[OK!]
text document
['document'] sentence
['sentence'] token
['sentence', 'token'] embeddings
['sentence', 'token', 'embeddings'] ner_oncology
['sentence', 'token', 'ner_oncology'] ner_oncology_chunk
['sentence', 'token', 'embeddings'] ner_oncology_anatomy_general
['sentence', 'token', 'ner_oncology_anatomy_general'] ner_oncology_anatomy_general_chunk
['sentence', 'token', 'embeddings'] ner_oncology_response_to_treatment
['sentence', 'token', 'ner_oncology_response_to_treatment'] ner_oncology_response_to_treatment_chunk
['sentence', 'token', 'embeddings'] ner_oncology_unspecific_posology
['sentence', 'token', 'ner_oncology_unspecific_posology'] ner_oncology_unspecific_posology_chunk
['sentence', 'token', 'embeddings'] ner_oncology_tnm
['sentence', 'token', 'ner_oncology_tnm'] ner_oncology_tnm_chunk
['sentence', 'token', 'embeddings'] ner_jsl
['sentence', 'token', 'ner_jsl'] ner_jsl_ch

In [43]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

pipeline_tracer = PipelineTracer(oncology_pipeline)

column_maps = pipeline_tracer.createParserDictionary()
column_maps.update({"document_identifier": "explain_clinical_doc_oncology"})
column_maps

{'document_identifier': 'explain_clinical_doc_oncology',
 'document_text': 'document',
 'entities': [{'ner_chunk_column_name': 'merged_chunk',
   'assertion_column_name': '',
   'resolver_column_name': ''},
  {'ner_chunk_column_name': 'merged_chunk_for_assertion',
   'assertion_column_name': 'assertion',
   'resolver_column_name': ''}],
 'relations': ['all_relations'],
 'summaries': [],
 'deidentifications': [],
 'classifications': []}

In [44]:
print(column_maps)

{'document_identifier': 'explain_clinical_doc_oncology', 'document_text': 'document', 'entities': [{'ner_chunk_column_name': 'merged_chunk', 'assertion_column_name': '', 'resolver_column_name': ''}, {'ner_chunk_column_name': 'merged_chunk_for_assertion', 'assertion_column_name': 'assertion', 'resolver_column_name': ''}], 'relations': ['all_relations'], 'summaries': [], 'deidentifications': [], 'classifications': []}


In [45]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer

columns_directly = PipelineTracer.getParserDictDirectly("explain_clinical_doc_oncology", "en", "clinical/models")
columns_directly

{'document_identifier': 'explain_clinical_doc_oncology',
 'document_text': 'document',
 'entities': [{'ner_chunk_column_name': 'merged_chunk_for_assertion',
   'assertion_column_name': 'assertion',
   'resolver_column_name': ''},
  {'ner_chunk_column_name': 'merged_chunk',
   'assertion_column_name': '',
   'resolver_column_name': ''}],
 'relations': ['all_relations'],
 'summaries': [],
 'deidentifications': [],
 'classifications': []}

In [46]:
from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser
pipeline_parser = PipelineOutputParser(column_maps)
result = pipeline_parser.run(results)

result

{'result': [{'document_identifier': 'explain_clinical_doc_oncology',
   'document_text': ['The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ovarian mass. A Pap smear performed one month later was positive for atypical glandular cells suspicious for adenocarcinoma. The pathologic specimen showed extension of the tumor throughout the fallopian tubes, appendix, omentum, and 5 out of 5 enlarged lymph nodes. The final pathologic diagnosis of the tumor was stage IIIC papillary serous ovarian adenocarcinoma. Two months later, the patient was diagnosed with lung metastases.Neoadjuvant chemotherapy with the regimens of Cyclophosphamide (500 mg/m2) is being given for 6 cycles with poor response'],
   'entities': [[{'chunk_id': '0',
      'begin': 24,
      'end': 42,
      'chunk': 'computed tomography',
      'label': 'Imaging_Test',
      'assertion': None,
      'term_code': None},
     {'chunk_id': '1',
      'begin': 45,
      'end': 46