![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/07.4.PipelineTracer_and_PipelineOutputParser.ipynb)

#   **📜 PipelineTracer and PipelineOutputParser**



# Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical
# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM

nlp.settings.enforce_versions=False
nlp.install(refresh_install=True)

In [None]:
from johnsnowlabs import nlp, medical
import pandas as pd
import pyspark.sql.types as T
import pyspark.sql.functions as F

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start(spark_conf = {"spark.driver.memory": "50G"})

In [5]:
spark

## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, you can check the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for each component.

# PipelineTracer



`PipelineTracer` is a class that traces the stages of a pipeline and retrieves information about them.
It gives detailed insight into the entities, assertions, and relations used within the pipeline.
It works with both `PipelineModel` and `PretrainedPipeline`.
It can also create a parser dictionary, which is then used to create a `PipelineOutputParser`.


## **🔎 Methods**

`PipelineTracer` provides the following methods:

- `printPipelineSchema`: Prints the schema of the pipeline.
- `createParserDictionary`: Returns a parser dictionary that can be used to create a PipelineOutputParser.
- `getPossibleEntities`: Returns a list of possible entities that the pipeline can include.
- `getPossibleAssertions`: Returns a list of possible assertions that the pipeline can include.
- `getPossibleRelations`: Returns a list of possible relations that the pipeline can include.
- `getPipelineStages`: Returns a list of PipelineStage objects that represent the stages of the pipeline.
- `getParserDictDirectly`: Returns a parser dictionary that can be used to create a PipelineOutputParser. This method gets the parser dictionary directly, without creating a PipelineTracer object.
- `listAvailableModels`: Returns a list of available models for a given language and source.
- `showAvailableModels`: Prints a list of available models for a given language and source.

### showAvailableModels

In [6]:
medical.PipelineTracer.showAvailableModels(language="en", source="clinical/models")

clinical_deidentification
explain_clinical_doc_ade
explain_clinical_doc_biomarker
explain_clinical_doc_generic
explain_clinical_doc_granular
explain_clinical_doc_medication
explain_clinical_doc_oncology
explain_clinical_doc_public_health
explain_clinical_doc_radiology
explain_clinical_doc_risk_factors
explain_clinical_doc_vop
icd10cm_resolver_pipeline
icd10cm_rxnorm_resolver_pipeline
rxnorm_resolver_pipeline
snomed_resolver_pipeline


### listAvailableModels

In [8]:
for model in medical.PipelineTracer.listAvailableModels():
  print(medical.PipelineTracer.getParserDictDirectly(model))

{'document_identifier': 'clinical_deidentification', 'document_text': 'sentence', 'entities': ['ner_chunk'], 'assertions': [], 'resolutions': [], 'relations': [], 'summaries': [], 'deidentifications': [{'original': 'sentence', 'obfuscated': 'obfuscated', 'masked': ''}], 'classifications': []}
{'document_identifier': 'explain_clinical_doc_ade', 'document_text': 'document', 'entities': ['ner_chunks_ade'], 'assertions': ['assertion'], 'resolutions': [], 'relations': ['relations'], 'summaries': [], 'deidentifications': [], 'classifications': [{'classification_column_name': 'class', 'sentence_column_name': 'sentence'}]}
{'document_identifier': 'explain_clinical_doc_biomarker', 'document_text': 'document', 'entities': ['ner_biomarker_chunk'], 'assertions': [], 'resolutions': [], 'relations': ['re_oncology_biomarker_result_wip'], 'summaries': [], 'deidentifications': [], 'classifications': [{'classification_column_name': 'prediction', 'sentence_column_name': 'sentence'}]}
{'document_identifie

### createParserDictionary

In [9]:
oncology_pipeline = nlp.PretrainedPipeline("explain_clinical_doc_oncology", "en", "clinical/models")


explain_clinical_doc_oncology download started this may take some time.
Approx size to download 1.8 GB
[OK!]


In [10]:
tracer = medical.PipelineTracer(oncology_pipeline)

In [11]:
tracer.createParserDictionary()

{'document_identifier': '',
 'document_text': 'document',
 'entities': ['merged_chunk', 'merged_chunk_for_assertion'],
 'assertions': ['assertion'],
 'resolutions': [],
 'relations': ['all_relations'],
 'summaries': [],
 'deidentifications': [],
 'classifications': [],
 'mappers': []}
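The parser dictionary is a plain Python dict, so it can be inspected before it is passed on. The following is a minimal sketch, not part of the library: `populated_sections` is a hypothetical helper, and the dictionary literal is copied from the output above. It lists only the sections that map to at least one output column.

```python
# Parser dictionary copied from the createParserDictionary() output above.
parser_dict = {
    "document_identifier": "",
    "document_text": "document",
    "entities": ["merged_chunk", "merged_chunk_for_assertion"],
    "assertions": ["assertion"],
    "resolutions": [],
    "relations": ["all_relations"],
    "summaries": [],
    "deidentifications": [],
    "classifications": [],
    "mappers": [],
}


def populated_sections(schema):
    """Hypothetical helper: keep the list-valued keys that name at least one column."""
    return {key: cols for key, cols in schema.items() if isinstance(cols, list) and cols}


sections = populated_sections(parser_dict)
for name, columns in sections.items():
    print(f"{name}: {columns}")
```

For this pipeline only `entities`, `assertions`, and `relations` are populated, which matches the oncology pipeline having NER, assertion, and relation extraction stages.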

### printPipelineSchema

In [12]:
tracer.printPipelineSchema()

root
 |-- DocumentAssembler
 |    |-- uid: string (DocumentAssembler_c87f754f30a5)
 |    |-- index: int (0)
 |    |-- inputCol: string (text)
 |    |-- outputCol: string (document)
 |    |-- inputAnnotatorType: none (----------)
 |    |-- outputAnnotatorType: string (DOCUMENT)
 |
 |-- SentenceDetectorDLModel
 |    |-- uid: string (SentenceDetectorDLModel_6bafc4746ea5)
 |    |-- index: int (1)
 |    |-- inputCols: array (document)
 |    |-- outputCol: string (sentence)
 |    |-- inputAnnotatorTypes: array (DOCUMENT)
 |    |-- outputAnnotatorType: string (DOCUMENT)
 |
 |-- TokenizerModel
 |    |-- uid: string (REGEX_TOKENIZER_99be4a04da74)
 |    |-- index: int (2)
 |    |-- inputCols: array (sentence)
 |    |-- outputCol: string (token)
 |    |-- inputAnnotatorTypes: array (DOCUMENT)
 |    |-- outputAnnotatorType: string (TOKEN)
 |
 |-- WordEmbeddingsModel
 |    |-- uid: string (WORD_EMBEDDINGS_MODEL_9004b1d00302)
 |    |-- index: int (3)
 |    |-- inputCols: array (sentence, token)
 |  

### getPossibleEntities

In [13]:
tracer.getPossibleEntities()

['Cycle_Number',
 'Direction',
 'Histological_Type',
 'Biomarker_Result',
 'Site_Other_Body_Part',
 'Hormonal_Therapy',
 'Death_Entity',
 'Targeted_Therapy',
 'Route',
 'Tumor_Finding',
 'Duration',
 'Pathology_Result',
 'Chemotherapy',
 'Date',
 'Radiotherapy',
 'Radiation_Dose',
 'Oncogene',
 'Cancer_Surgery',
 'Tumor_Size',
 'Staging',
 'Pathology_Test',
 'Cancer_Dx',
 'Age',
 'Site_Lung',
 'Site_Breast',
 'Site_Liver',
 'Site_Lymph_Node',
 'Response_To_Treatment',
 'Site_Brain',
 'Immunotherapy',
 'Race_Ethnicity',
 'Metastasis',
 'Smoking_Status',
 'Imaging_Test',
 'Relative_Date',
 'Line_Of_Therapy',
 'Unspecific_Therapy',
 'Site_Bone',
 'Gender',
 'Cycle_Count',
 'Cancer_Score',
 'Adenopathy',
 'Grade',
 'Biomarker',
 'Invasion',
 'Frequency',
 'Performance_Status',
 'Dosage',
 'Cycle_Day',
 'Anatomical_Site',
 'Size_Trend',
 'Posology_Information',
 'Cancer_Therapy',
 'Lymph_Node',
 'Tumor_Description',
 'Lymph_Node_Modifier',
 'Carcinoma_Type',
 'CNS_Tumor_Type',
 'Melanoma',


### getPossibleAssertions

In [14]:
tracer.getPossibleAssertions()

['Past', 'Absent', 'Family', 'Hypothetical', 'Possible', 'Present']

### getPossibleRelations

In [15]:
tracer.getPossibleRelations()

['is_size_of', 'is_date_of', 'is_location_of', 'is_finding_of']

### getPipelineStages

In [16]:
stages = tracer.getPipelineStages()
for stage in stages:
    print(stage.__dict__())

{'uid': 'DocumentAssembler_c87f754f30a5', 'name': 'DocumentAssembler', 'index': 0, 'inputCol': StageField(inputCol, text, string), 'outputCol': StageField(outputCol, document, string), 'inputAnnotatorType': StageField(inputAnnotatorType, ----------, none), 'outputAnnotatorType': StageField(outputAnnotatorType, document, string)}
{'uid': 'SentenceDetectorDLModel_6bafc4746ea5', 'name': 'SentenceDetectorDLModel', 'index': 1, 'inputCol': StageField(inputCols, [document], array), 'outputCol': StageField(outputCol, sentence, string), 'inputAnnotatorType': StageField(inputAnnotatorTypes, [document], array), 'outputAnnotatorType': StageField(outputAnnotatorType, document, string)}
{'uid': 'REGEX_TOKENIZER_99be4a04da74', 'name': 'TokenizerModel', 'index': 2, 'inputCol': StageField(inputCols, [sentence], array), 'outputCol': StageField(outputCol, token, string), 'inputAnnotatorType': StageField(inputAnnotatorTypes, [document], array), 'outputAnnotatorType': StageField(outputAnnotatorType, token,

## With a Custom Pipeline




In [17]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel \
    .pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel \
    .pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk") \
    .setWhiteList(["TREATMENT", "PROBLEM"])

clinical_assertion = medical.AssertionDLModel \
    .pretrained("assertion_dl_large", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion") \
    .setIncludeConfidence(True) \
    .setEntityAssertionCaseSensitive(False) \
    .setEntityAssertion({"treAtment": ["present"]}) \
    .setReplaceLabels({"PRESENT": "available", "absent": "none", "Conditional": "Optional"})

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    clinical_assertion])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical download started this may take some time.
Approximate size to download 13.9 MB
[OK!]
assertion_dl_large download started this may take some time.
Approximate size to download 1.3 MB
[OK!]


In [18]:
tracer = medical.PipelineTracer(model)
tracer.createParserDictionary()

{'document_identifier': '',
 'document_text': 'document',
 'entities': ['ner_chunk'],
 'assertions': ['assertion'],
 'resolutions': [],
 'relations': [],
 'summaries': [],
 'deidentifications': [],
 'classifications': [],
 'mappers': []}

In [19]:
tracer.getPossibleAssertions()

['available',
 'none',
 'hypothetical',
 'possible',
 'Optional',
 'associated_with_someone_else']
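The list above reflects the `setReplaceLabels` mapping from the custom pipeline: `PRESENT`, `absent`, and `Conditional` appear as `available`, `none`, and `Optional`, which suggests the keys are matched without regard to case. The sketch below reproduces that renaming in plain Python under two assumptions that this notebook does not confirm: that matching is case-insensitive, and that the model's original labels are the six listed in `base_labels`. `replace_labels` is a hypothetical helper, not a library function.

```python
def replace_labels(labels, replacements):
    """Hypothetical helper: rename labels, matching replacement keys case-insensitively."""
    lookup = {old.lower(): new for old, new in replacements.items()}
    return [lookup.get(label.lower(), label) for label in labels]


# Assumed original label set of the assertion model (not shown in this notebook).
base_labels = [
    "present",
    "absent",
    "hypothetical",
    "possible",
    "conditional",
    "associated_with_someone_else",
]

# Same mapping that was passed to setReplaceLabels in the custom pipeline.
replacements = {"PRESENT": "available", "absent": "none", "Conditional": "Optional"}

renamed = replace_labels(base_labels, replacements)
print(renamed)
```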

In [20]:
tracer.getPossibleEntities()

['TREATMENT', 'PROBLEM']

In [21]:
tracer.printPipelineSchema()

root
 |-- DocumentAssembler
 |    |-- uid: string (DocumentAssembler_8a16b2b30ed2)
 |    |-- index: int (0)
 |    |-- inputCol: string (text)
 |    |-- outputCol: string (document)
 |    |-- inputAnnotatorType: none (----------)
 |    |-- outputAnnotatorType: string (DOCUMENT)
 |
 |-- SentenceDetector
 |    |-- uid: string (SentenceDetector_a9920bc9c720)
 |    |-- index: int (1)
 |    |-- inputCols: array (document)
 |    |-- outputCol: string (sentence)
 |    |-- inputAnnotatorTypes: array (DOCUMENT)
 |    |-- outputAnnotatorType: string (DOCUMENT)
 |
 |-- TokenizerModel
 |    |-- uid: string (REGEX_TOKENIZER_e6f40727f1a7)
 |    |-- index: int (2)
 |    |-- inputCols: array (sentence)
 |    |-- outputCol: string (token)
 |    |-- inputAnnotatorTypes: array (DOCUMENT)
 |    |-- outputAnnotatorType: string (TOKEN)
 |
 |-- WordEmbeddingsModel
 |    |-- uid: string (WORD_EMBEDDINGS_MODEL_9004b1d00302)
 |    |-- index: int (3)
 |    |-- inputCols: array (sentence, token)
 |    |-- outputCo

# StructuredJsonConverter
`StructuredJsonConverter` is an annotator that converts the raw annotations produced by pretrained pipelines into a prettified, structured JSON (dictionary) format. The result is easy to read, well suited to API integration, and convenient for data analysis workflows. It uses `column_maps` to define the output columns and align them with the pipeline, and it handles entities, assertions, resolutions, relations, summaries, deidentifications, and classifications.

## explain_clinical_doc_oncology

In [22]:
oncology_pipeline = nlp.PretrainedPipeline("explain_clinical_doc_oncology", "en", "clinical/models")

explain_clinical_doc_oncology download started this may take some time.
Approx size to download 1.8 GB
[OK!]


In [23]:
text = """The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ovarian mass. A Pap smear performed one month later was positive for atypical glandular cells suspicious for adenocarcinoma. The pathologic specimen showed extension of the tumor throughout the fallopian tubes, appendix, omentum, and 5 out of 5 enlarged lymph nodes. The final pathologic diagnosis of the tumor was stage IIIC papillary serous ovarian adenocarcinoma. Two months later, the patient was diagnosed with lung metastases.Neoadjuvant chemotherapy with the regimens of Cyclophosphamide (500 mg/m2) is being given for 6 cycles with poor response"""

data = spark.createDataFrame([text], T.StringType()).toDF("text")
data.show(truncate = 100)

+----------------------------------------------------------------------------------------------------+
|                                                                                                text|
+----------------------------------------------------------------------------------------------------+
|The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a c...|
+----------------------------------------------------------------------------------------------------+



In [24]:
result_df = oncology_pipeline.transform(data)
result_df.show(truncate = 40)

+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+---------------

In [25]:
pipeline_tracer = medical.PipelineTracer(oncology_pipeline)

column_maps = pipeline_tracer.createParserDictionary()
column_maps

{'document_identifier': '',
 'document_text': 'document',
 'entities': ['merged_chunk', 'merged_chunk_for_assertion'],
 'assertions': ['assertion'],
 'resolutions': [],
 'relations': ['all_relations'],
 'summaries': [],
 'deidentifications': [],
 'classifications': [],
 'mappers': []}

**.setOutputAsStr(True)**

In [26]:
#import sparknlp_jsl
output_converter = medical.StructuredJsonConverter()\
    .setOutputCol("result")\
    .setConverterSchema(column_maps)\
    .setCleanAnnotations(False)\
    .setReturnRelationEntities(True)\
    .setOutputAsStr(True)

json_output = output_converter.transform(result_df).select("result")
json_output.show(truncate=200)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                                                                  result|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"result":{"document_identifier":"4d9912d2-5f22-4c14-957d-cdd329ad2bfb","document_text":["The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ova...|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [27]:
import json
result_collections = json_output.collect()
# The column holds a JSON string, so parse it with json.loads instead of eval
json.loads(result_collections[0].result)

{'result': {'document_identifier': '4d9912d2-5f22-4c14-957d-cdd329ad2bfb',
  'document_text': ['The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ovarian mass. A Pap smear performed one month later was positive for atypical glandular cells suspicious for adenocarcinoma. The pathologic specimen showed extension of the tumor throughout the fallopian tubes, appendix, omentum, and 5 out of 5 enlarged lymph nodes. The final pathologic diagnosis of the tumor was stage IIIC papillary serous ovarian adenocarcinoma. Two months later, the patient was diagnosed with lung metastases.Neoadjuvant chemotherapy with the regimens of Cyclophosphamide (500 mg/m2) is being given for 6 cycles with poor response'],
  'entities': [{'begin': '24',
    'chunk': 'computed tomography',
    'ner_source': 'ner_oncology_chunk',
    'end': '42',
    'ner_label': 'Imaging_Test',
    'chunk_id': '1b71b12a',
    'sentence': '0',
    'ner_confidence': '0.9575'},
   {
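In the string output, numeric fields such as `begin`, `end`, and `ner_confidence` arrive as strings. The sketch below shows one way to parse such a value and cast those fields. The `raw` string is a shortened stand-in built from the first entity shown above (the document text is cut to its first sentence), and the output field names `label` and `confidence` are choices made for this example.

```python
import json

# Shortened stand-in for one value of the "result" column (string output mode).
raw = json.dumps({
    "result": {
        "document_identifier": "4d9912d2-5f22-4c14-957d-cdd329ad2bfb",
        "document_text": [
            "The Patient underwent a computed tomography (CT) scan of the abdomen "
            "and pelvis, which showed a complex ovarian mass."
        ],
        "entities": [
            {
                "begin": "24",
                "chunk": "computed tomography",
                "ner_source": "ner_oncology_chunk",
                "end": "42",
                "ner_label": "Imaging_Test",
                "chunk_id": "1b71b12a",
                "sentence": "0",
                "ner_confidence": "0.9575",
            }
        ],
    }
})

parsed = json.loads(raw)

# Cast the string-typed offsets and confidence to numbers for downstream use.
entities = [
    {
        "chunk": e["chunk"],
        "label": e["ner_label"],
        "begin": int(e["begin"]),
        "end": int(e["end"]),
        "confidence": float(e["ner_confidence"]),
    }
    for e in parsed["result"]["entities"]
]
print(entities)
```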

**.setOutputAsStr(False)**

In [28]:
output_converter = medical.StructuredJsonConverter()\
    .setOutputCol("result")\
    .setConverterSchema(column_maps)\
    .setCleanAnnotations(False)\
    .setReturnRelationEntities(True)\
    .setOutputAsStr(False)

json_output = output_converter.transform(result_df).select("result")
json_output.show(truncate=200)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                                                                  result|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{b067ccac-38bd-439d-b38d-4f4d3cce8cdb, [The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ovarian mass. A Pap smear performed one month later w...|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [29]:
result_collections = json_output.collect()
for record in result_collections:
    for k,v in column_maps.items():
        print(k,record.result[k])

document_identifier b067ccac-38bd-439d-b38d-4f4d3cce8cdb
document_text ['The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ovarian mass. A Pap smear performed one month later was positive for atypical glandular cells suspicious for adenocarcinoma. The pathologic specimen showed extension of the tumor throughout the fallopian tubes, appendix, omentum, and 5 out of 5 enlarged lymph nodes. The final pathologic diagnosis of the tumor was stage IIIC papillary serous ovarian adenocarcinoma. Two months later, the patient was diagnosed with lung metastases.Neoadjuvant chemotherapy with the regimens of Cyclophosphamide (500 mg/m2) is being given for 6 cycles with poor response']
entities [{'ner_label': 'Imaging_Test', 'sentence': '0', 'chunk': 'computed tomography', 'end': '42', 'ner_source': 'ner_oncology_chunk', 'ner_confidence': '0.9575', 'begin': '24', 'chunk_id': '1b71b12a'}, {'ner_label': 'Imaging_Test', 'sentence': '0', 'chunk': 'CT',

**.setParentSource("chunk")**

With the `.setParentSource("chunk")` option, the converter returns one record per chunk instead of the base schema results, so each chunk carries its own label, assertion, and relations.

Additionally, the `sentenceColumn` parameter (set with `.setSentenceColumn`) allows retrieval of sentence-level details for each chunk.

In [30]:
output_converter = medical.StructuredJsonConverter()\
    .setOutputCol("result")\
    .setConverterSchema(column_maps)\
    .setCleanAnnotations(False)\
    .setReturnRelationEntities(True)\
    .setOutputAsStr(True)\
    .setParentSource("chunk")\
    .setSentenceColumn("sentence")

json_output = output_converter.transform(result_df).select("result")
json_output.show(truncate=200)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                                                                  result|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"result":[{"chunk_id":"1b71b12a","chunk":"computed tomography","begin":24,"end":42,"sentence_id":0,"sentence":"The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, whic...|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [31]:
import json
result_collections = json_output.collect()
# The column holds a JSON string, so parse it with json.loads instead of eval
json.loads(result_collections[0].result)

{'result': [{'chunk_id': '1b71b12a',
   'chunk': 'computed tomography',
   'begin': 24,
   'end': 42,
   'sentence_id': 0,
   'sentence': 'The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ovarian mass.',
   'ner_label': 'Imaging_Test',
   'ner_source': 'ner_oncology_chunk',
   'ner_confidence': '0.9575',
   'assertion': 'Past',
   'assertion_confidence': '1.0',
   'relations': []},
  {'chunk_id': 'ce9ac1a9',
   'chunk': 'CT',
   'begin': 45,
   'end': 46,
   'sentence_id': 0,
   'sentence': 'The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ovarian mass.',
   'ner_label': 'Imaging_Test',
   'ner_source': 'ner_oncology_chunk',
   'ner_confidence': '0.9565',
   'assertion': 'Present',
   'assertion_confidence': '0.8937',
   'relations': []},
  {'chunk_id': '3576c965',
   'chunk': 'abdomen',
   'begin': 61,
   'end': 67,
   'sentence_id': 0,
   'sentence': 'The Patient underwent a 
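Because each record in the chunk-oriented output carries its own assertion, grouping or filtering chunks needs no join between sections. A minimal sketch, using the first two records shown above (with the `sentence` text omitted for brevity):

```python
from collections import defaultdict

# First two chunk records from the output above (sentence text omitted).
chunks = [
    {
        "chunk_id": "1b71b12a",
        "chunk": "computed tomography",
        "begin": 24,
        "end": 42,
        "sentence_id": 0,
        "ner_label": "Imaging_Test",
        "assertion": "Past",
        "assertion_confidence": "1.0",
        "relations": [],
    },
    {
        "chunk_id": "ce9ac1a9",
        "chunk": "CT",
        "begin": 45,
        "end": 46,
        "sentence_id": 0,
        "ner_label": "Imaging_Test",
        "assertion": "Present",
        "assertion_confidence": "0.8937",
        "relations": [],
    },
]

# Group chunk texts by their assertion status.
by_assertion = defaultdict(list)
for item in chunks:
    by_assertion[item["assertion"]].append(item["chunk"])

print(dict(by_assertion))
```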

# PipelineOutputParser

The output parser module converts the output of pretrained pipelines into clear, prettified results in dictionary format that are easy to read and process. It is designed to simplify API integration, improve user understanding, and streamline data analysis workflows.

## clinical_deidentification

In [32]:
pretrained_pipeline = nlp.PretrainedPipeline("clinical_deidentification", "en", "clinical/models")

text = [
    '''Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .''',
    """Record date: 2093-01-13, David Hale, M.D., Name: Hendrickson, Ora MR. # 7194334 Date: 01/13/93 PCP: Oliveira, 25 years old, Record date: 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. The patient's complaints first surfaced when he started working for Brothers Coal-Mine.""",
]

results = pretrained_pipeline.fullAnnotate(text)


clinical_deidentification download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [33]:
pipeline_tracer = medical.PipelineTracer(pretrained_pipeline)

column_maps = pipeline_tracer.createParserDictionary()
column_maps.update({"document_identifier": "clinical_deidentification"})
column_maps

{'document_identifier': 'clinical_deidentification',
 'document_text': 'sentence',
 'entities': ['ner_chunk'],
 'assertions': [],
 'resolutions': [],
 'relations': [],
 'summaries': [],
 'deidentifications': [{'original': 'sentence',
   'obfuscated': 'obfuscated',
   'masked': ''}],
 'classifications': [],
 'mappers': []}

In [34]:
columns_directly = medical.PipelineTracer.getParserDictDirectly("clinical_deidentification", "en", "clinical/models")
columns_directly

{'document_identifier': 'clinical_deidentification',
 'document_text': 'sentence',
 'entities': ['ner_chunk'],
 'assertions': [],
 'resolutions': [],
 'relations': [],
 'summaries': [],
 'deidentifications': [{'original': 'sentence',
   'obfuscated': 'obfuscated',
   'masked': ''}],
 'classifications': []}
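In the two outputs above, the dictionary from `createParserDictionary` has a `mappers` key, while the one from `getParserDictDirectly` does not. Whether `PipelineOutputParser` requires that key is not stated here, so the sketch below only shows how the difference can be detected and, as an assumption, filled with an empty list so both dictionaries have the same shape.

```python
# Dictionary returned by createParserDictionary() (after the update above).
column_maps = {
    "document_identifier": "clinical_deidentification",
    "document_text": "sentence",
    "entities": ["ner_chunk"],
    "assertions": [],
    "resolutions": [],
    "relations": [],
    "summaries": [],
    "deidentifications": [
        {"original": "sentence", "obfuscated": "obfuscated", "masked": ""}
    ],
    "classifications": [],
    "mappers": [],
}

# Dictionary returned by getParserDictDirectly(); note there is no "mappers" key.
columns_directly = {
    key: value for key, value in column_maps.items() if key != "mappers"
}

missing = sorted(set(column_maps) - set(columns_directly))
print("missing keys:", missing)

# Assumption: an absent section is equivalent to an empty list.
aligned = {key: columns_directly.get(key, []) for key in column_maps}
print(aligned == column_maps)
```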

In [35]:
pipeline_parser = medical.PipelineOutputParser(column_maps)
result = pipeline_parser.run(results)

result

{'result': [{'document_identifier': 'clinical_deidentification',
   'document_id': 0,
   'document_text': ['Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .',
    'PCP : Oliveira , 25 years-old , Record date : 2079-11-09 .',
    'Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .'],
   'entities': [{'chunk_id': '78463532',
     'chunk': '2093-01-13',
     'begin': 14,
     'end': 23,
     'ner_label': 'DATE',
     'ner_source': None,
     'ner_confidence': None},
    {'chunk_id': '60a35054',
     'chunk': 'David Hale',
     'begin': 27,
     'end': 36,
     'ner_label': 'DOCTOR',
     'ner_source': 'ner_chunk_enriched',
     'ner_confidence': '0.9895'},
    {'chunk_id': '9d3e7907',
     'chunk': 'Hendrickson Ora',
     'begin': 55,
     'end': 69,
     'ner_label': 'PATIENT',
     'ner_source': 'ner_chunk_enriched',
     'ner_confidence': '0.99300003'},
    {'chunk_id': '81bc095c',
     'chunk': '7194334',
    

## icd10cm_resolver_pipeline

In [36]:
icd10cm_pipeline = nlp.PretrainedPipeline("icd10cm_resolver_pipeline", "en", "clinical/models")

text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage"""

results = icd10cm_pipeline.fullAnnotate(text)

icd10cm_resolver_pipeline download started this may take some time.
Approx size to download 2.4 GB
[OK!]


In [37]:
pipeline_tracer = medical.PipelineTracer(icd10cm_pipeline)

column_maps = pipeline_tracer.createParserDictionary()
column_maps.update({"document_identifier": "icd10cm_resolver_pipeline"})
column_maps

{'document_identifier': 'icd10cm_resolver_pipeline',
 'document_text': 'document',
 'entities': ['icd10cm_ner_chunk'],
 'assertions': [],
 'resolutions': [{'vocab': 'icd10cm', 'resolver_column_name': 'icd10cm'}],
 'relations': [],
 'summaries': [],
 'deidentifications': [],
 'classifications': [],
 'mappers': ['icd10cm_mapper']}

In [38]:

pipeline_parser = medical.PipelineOutputParser(column_maps)
result = pipeline_parser.run(results)

result

{'result': [{'document_identifier': 'icd10cm_resolver_pipeline',
   'document_id': 0,
   'document_text': ['A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage'],
   'entities': [{'chunk_id': '230909b1',
     'chunk': 'gestational diabetes mellitus',
     'begin': 39,
     'end': 67,
     'ner_label': 'PROBLEM',
     'ner_source': 'clinical_ner_chunk',
     'ner_confidence': '0.9424'},
    {'chunk_id': 'd280706c',
     'chunk': 'anisakiasis',
     'begin': 95,
     'end': 105,
     'ner_label': 'PROBLEM',
     'ner_source': 'clinical_ner_chunk',
     'ner_confidence': '0.9933'},
    {'chunk_id': '9df194a1',
     'chunk': 'fetal and neonatal hemorrhage',
     'begin': 135,
     'end': 163,
     'ner_label': 'PROBLEM',
     'ner_source': 'clinical_ner_chunk',
     'ner_confidence': '0.7501'}],
   'assertions': [],
   'resolutions': [{'vocab': 'icd10cm',
     'chunk_id': '23090
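The parser result is a nested dictionary, so exporting it to a flat table takes only a few lines of standard-library Python. The sketch below writes the three entities shown above to CSV. The column selection in `fields` is a choice made for this example, and `ner_source` is dropped through `extrasaction="ignore"`.

```python
import csv
import io

# Entities copied from the parser output above; other sections are omitted.
result = {
    "result": [
        {
            "document_identifier": "icd10cm_resolver_pipeline",
            "document_id": 0,
            "entities": [
                {"chunk_id": "230909b1", "chunk": "gestational diabetes mellitus",
                 "begin": 39, "end": 67, "ner_label": "PROBLEM",
                 "ner_source": "clinical_ner_chunk", "ner_confidence": "0.9424"},
                {"chunk_id": "d280706c", "chunk": "anisakiasis",
                 "begin": 95, "end": 105, "ner_label": "PROBLEM",
                 "ner_source": "clinical_ner_chunk", "ner_confidence": "0.9933"},
                {"chunk_id": "9df194a1", "chunk": "fetal and neonatal hemorrhage",
                 "begin": 135, "end": 163, "ner_label": "PROBLEM",
                 "ner_source": "clinical_ner_chunk", "ner_confidence": "0.7501"},
            ],
        }
    ]
}

fields = ["document_id", "chunk_id", "chunk", "begin", "end", "ner_label", "ner_confidence"]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
writer.writeheader()

# One CSV row per entity, tagged with the id of the document it came from.
for doc in result["result"]:
    for entity in doc["entities"]:
        writer.writerow({"document_id": doc["document_id"], **entity})

csv_text = buffer.getvalue()
print(csv_text)
```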

## explain_clinical_doc_biomarker

In [39]:
biomarker_pipeline = nlp.PretrainedPipeline("explain_clinical_doc_biomarker", "en", "clinical/models")

results = biomarker_pipeline.fullAnnotate("""In the bone- marrow (BM) aspiration, blasts accounted for 88.1% of ANCs, which were positive for CD9 and CD10 on flow cytometry. Measurements of serum tumor markers showed elevated level of Cyfra21-1: 4.77 ng/mL, NSE: 19.60 ng/mL, and SCCA: 2.58 ng/mL. Immunohistochemical staining showed positive staining for CK5/6, P40, and negative staining for TTF-1 and weakly positive staining for ALK.""")

explain_clinical_doc_biomarker download started this may take some time.
Approx size to download 2 GB
[OK!]


In [40]:
pipeline_tracer = medical.PipelineTracer(biomarker_pipeline)

column_maps = pipeline_tracer.createParserDictionary()
column_maps.update({"document_identifier": "explain_clinical_doc_biomarker"})
column_maps

{'document_identifier': 'explain_clinical_doc_biomarker',
 'document_text': 'document',
 'entities': ['merged_chunk'],
 'assertions': [],
 'resolutions': [],
 'relations': ['re_oncology_biomarker_result_wip'],
 'summaries': [],
 'deidentifications': [],
 'classifications': [{'classification_column_name': 'prediction',
   'sentence_column_name': 'sentence'}],
 'mappers': []}

In [41]:
pipeline_parser = medical.PipelineOutputParser(column_maps)
result = pipeline_parser.run(results)

result

{'result': [{'document_identifier': 'explain_clinical_doc_biomarker',
   'document_id': 0,
   'document_text': ['In the bone- marrow (BM) aspiration, blasts accounted for 88.1% of ANCs, which were positive for CD9 and CD10 on flow cytometry. Measurements of serum tumor markers showed elevated level of Cyfra21-1: 4.77 ng/mL, NSE: 19.60 ng/mL, and SCCA: 2.58 ng/mL. Immunohistochemical staining showed positive staining for CK5/6, P40, and negative staining for TTF-1 and weakly positive staining for ALK.'],
   'entities': [{'chunk_id': 'bc15add6',
     'chunk': 'positive',
     'begin': 84,
     'end': 91,
     'ner_label': 'Biomarker_Result',
     'ner_source': 'ner_oncology_chunk',
     'ner_confidence': '0.9672'},
    {'chunk_id': 'b473fd80',
     'chunk': 'CD9',
     'begin': 97,
     'end': 99,
     'ner_label': 'Biomarker',
     'ner_source': 'ner_oncology_chunk',
     'ner_confidence': '0.992'},
    {'chunk_id': '0252d08a',
     'chunk': 'CD10',
     'begin': 105,
     'end': 108,
 

## explain_clinical_doc_oncology

In [42]:
oncology_pipeline = nlp.PretrainedPipeline("explain_clinical_doc_oncology", "en", "clinical/models")

results = oncology_pipeline.fullAnnotate("""The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ovarian mass. A Pap smear performed one month later was positive for atypical glandular cells suspicious for adenocarcinoma. The pathologic specimen showed extension of the tumor throughout the fallopian tubes, appendix, omentum, and 5 out of 5 enlarged lymph nodes. The final pathologic diagnosis of the tumor was stage IIIC papillary serous ovarian adenocarcinoma. Two months later, the patient was diagnosed with lung metastases.Neoadjuvant chemotherapy with the regimens of Cyclophosphamide (500 mg/m2) is being given for 6 cycles with poor response""")


explain_clinical_doc_oncology download started this may take some time.
Approx size to download 1.8 GB
[OK!]


In [43]:
pipeline_tracer = medical.PipelineTracer(oncology_pipeline)

column_maps = pipeline_tracer.createParserDictionary()
column_maps.update({"document_identifier": "explain_clinical_doc_oncology"})
column_maps

{'document_identifier': 'explain_clinical_doc_oncology',
 'document_text': 'document',
 'entities': ['merged_chunk', 'merged_chunk_for_assertion'],
 'assertions': ['assertion'],
 'resolutions': [],
 'relations': ['all_relations'],
 'summaries': [],
 'deidentifications': [],
 'classifications': [],
 'mappers': []}

In [44]:
print(column_maps)

{'document_identifier': 'explain_clinical_doc_oncology', 'document_text': 'document', 'entities': ['merged_chunk', 'merged_chunk_for_assertion'], 'assertions': ['assertion'], 'resolutions': [], 'relations': ['all_relations'], 'summaries': [], 'deidentifications': [], 'classifications': [], 'mappers': []}


In [45]:
pipeline_parser = medical.PipelineOutputParser(column_maps)
result = pipeline_parser.run(results)

result

{'result': [{'document_identifier': 'explain_clinical_doc_oncology',
   'document_id': 0,
   'document_text': ['The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ovarian mass. A Pap smear performed one month later was positive for atypical glandular cells suspicious for adenocarcinoma. The pathologic specimen showed extension of the tumor throughout the fallopian tubes, appendix, omentum, and 5 out of 5 enlarged lymph nodes. The final pathologic diagnosis of the tumor was stage IIIC papillary serous ovarian adenocarcinoma. Two months later, the patient was diagnosed with lung metastases.Neoadjuvant chemotherapy with the regimens of Cyclophosphamide (500 mg/m2) is being given for 6 cycles with poor response'],
   'entities': [{'chunk_id': '1b71b12a',
     'chunk': 'computed tomography',
     'begin': 24,
     'end': 42,
     'ner_label': 'Imaging_Test',
     'ner_source': 'ner_oncology_chunk',
     'ner_confidence': '0.9575'},
    {'
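The `begin` and `end` values in the entity records are character offsets into the document text, and `end` is inclusive. The sketch below checks this against the first sentence of the input: the first entity is taken from the output above, and the other two offsets come from the chunk-oriented `StructuredJsonConverter` output earlier in this notebook. `span` is a hypothetical helper.

```python
# First sentence of the oncology example text.
text = (
    "The Patient underwent a computed tomography (CT) scan of the abdomen "
    "and pelvis, which showed a complex ovarian mass."
)

# Offsets as reported in the outputs of this notebook.
entities = [
    {"chunk": "computed tomography", "begin": 24, "end": 42},
    {"chunk": "CT", "begin": 45, "end": 46},
    {"chunk": "abdomen", "begin": 61, "end": 67},
]


def span(source, begin, end):
    """Hypothetical helper: slice with an inclusive end offset."""
    return source[begin:end + 1]


# Collect any entity whose offsets do not reproduce its chunk text.
mismatches = [e for e in entities if span(text, e["begin"], e["end"]) != e["chunk"]]
print(mismatches)
```

An empty list means every reported offset pair recovers its chunk, which is a quick sanity check to run before highlighting or masking text based on these offsets.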