![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/StructuredJsonConverter.ipynb)

# **StructuredJsonConverter**

This notebook covers the parameters and usage of the `StructuredJsonConverter` annotator.

**📖 Learning Objectives:**

1. Understand how to use `StructuredJsonConverter`.

2. Become comfortable using the different parameters of the annotator.

3. Explore how `StructuredJsonConverter` handles different annotation types (e.g., entities, relations, assertions, classifications).

4. Understand how to generate structured JSON outputs optimized for API consumption.

**🔗 Helpful Links:**

- Documentation : [StructuredJsonConverter](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#structuredjsonconverter)

- Python Docs : [StructuredJsonConverter](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/parser/structured_json_converter/index.html#module-sparknlp_jsl.annotator.parser.structured_json_converter)

- Scala Docs : [StructuredJsonConverter](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/parser/StructuredJsonConverter.html)

- For extended examples of usage see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/healthcare-nlp/).

## **📜 Background**


`StructuredJsonConverter` takes the annotation columns produced by **pretrained pipelines** and converts them into a structured, easy-to-read JSON (dictionary-style) output suited to API integration and downstream data analysis. A configurable schema mapping (`column_maps`) defines which pipeline output columns feed each part of the result, so the converter can handle diverse annotation types: entities, assertions, resolutions, relations, summaries, deidentifications, and classifications.

## **🖨️ Input/Output Annotation Types**

- Input Annotator Types: ANY

- Output Annotator Type: JSON

## **🔎 Parameters**


- `converterSchema` : Dictionary defining how to map annotation outputs (entities, assertions, relations, etc.) into structured JSON. Typically generated using `PipelineTracer.createParserDictionary()`.

- `converterSchemaAsStr` : Same as `converterSchema`, but provided as a JSON string. Useful for saving/loading schema configurations.

- `outputCol` : Name of the output column where structured JSON or struct results are stored.

- `outputAsStr` : When true (default), outputs a single JSON string column. When false, outputs a nested struct with detailed arrays/maps for each annotation type.

- `parentSource` : Defines the root structure of the output.  
  - `` (empty): document-level structured JSON.  
  - `chunk`: chunk-based struct with fields such as `chunk_id`, `chunk`, `begin`, `end`, `sentence`, `ner_label`, `relations`, etc.

- `returnRelationEntities` : When true, includes full entity details inside relation objects. When false (default), only relation metadata is included.

- `sentenceColumn` : Name of the sentence column used to include contextual sentence text for each chunk in the structured output.

- `cleanAnnotations` : When true, removes raw Spark NLP annotation columns from the DataFrame, keeping only the clean structured JSON output (default: false).


## 🎯 **Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files

print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical
# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.settings.enforce_versions=False
nlp.install(refresh_install=True)

In [None]:
from johnsnowlabs import nlp, medical

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

In [None]:
spark

## Using `StructuredJsonConverter`


Run a `PretrainedPipeline` on your text and turn its many annotation columns into a single, neat JSON column for easy consumption.

The pretrained pipeline of the following example can:

- extract biomarker entities,

- classify whether sentences contain biomarker entities,

- establish relations between the extracted biomarker and biomarker results from the clinical documents.

`PipelineTracer` inspects the pipeline and produces a schema mapping (`columns`) that describes which output columns hold each annotation type. This mapping is then passed to the converter.

In [None]:
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
from sparknlp_jsl.pipeline_tracer import PipelineTracer
from sparknlp_jsl.annotator import StructuredJsonConverter
import json

text = "In the bone- marrow (BM) aspiration, blasts accounted for 88.1% of ANCs, which were positive for CD9 and CD10 on flow cytometry."
df = spark.createDataFrame([[text]]).toDF("text")
bio_pipe = PretrainedPipeline("explain_clinical_doc_biomarker", "en", "clinical/models")
pipeline_tracer = PipelineTracer(bio_pipe)
columns = pipeline_tracer.createParserDictionary()
converter = StructuredJsonConverter()\
           .setOutputCol("json")\
           .setConverterSchema(columns)
base_df = bio_pipe.transform(df)
json_df = converter.transform(base_df)
json_df.select("json").show(truncate=False)

explain_clinical_doc_biomarker download started this may take some time.
Approx size to download 2 GB
[OK!]

We can inspect the schema like this:

In [None]:
columns

{'document_identifier': '',
 'document_text': 'document',
 'entities': ['merged_chunk'],
 'assertions': [],
 'resolutions': [],
 'relations': ['re_oncology_biomarker_result_wip'],
 'summaries': [],
 'deidentifications': [],
 'classifications': [{'classification_column_name': 'prediction',
   'sentence_column_name': 'sentence'}],
 'mappers': []}
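
The `converterSchemaAsStr` parameter takes the same mapping as a JSON string, which makes the schema easy to store and reload. The sketch below shows one way to do that round trip with the standard `json` module. It copies the `columns` dictionary from the output above into a literal so that it runs without a Spark session; the file name `converter_schema.json` is a hypothetical choice, and passing the resulting string to the converter is not shown here.

```python
import json
import tempfile
from pathlib import Path

# Schema copied from the output above (normally returned by
# PipelineTracer.createParserDictionary()).
columns = {
    "document_identifier": "",
    "document_text": "document",
    "entities": ["merged_chunk"],
    "assertions": [],
    "resolutions": [],
    "relations": ["re_oncology_biomarker_result_wip"],
    "summaries": [],
    "deidentifications": [],
    "classifications": [
        {"classification_column_name": "prediction",
         "sentence_column_name": "sentence"}
    ],
    "mappers": [],
}

# Serialize the schema to a JSON string and persist it.
schema_str = json.dumps(columns, indent=2)
schema_path = Path(tempfile.mkdtemp()) / "converter_schema.json"  # hypothetical file name
schema_path.write_text(schema_str, encoding="utf-8")

# Later: load it back and confirm nothing changed in the round trip.
restored = json.loads(schema_path.read_text(encoding="utf-8"))
print(restored == columns)
```

Because the schema contains only strings, lists, and dictionaries, it survives JSON serialization unchanged, so the restored dictionary can be edited or versioned like any other configuration file.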

## `cleanAnnotations` Parameter

Whether to remove the annotation columns from the DataFrame, by default False.

If we print the DataFrame summary from the previous run, we can see that it still contains all the annotation columns.

In [None]:
json_df

DataFrame[text: string, document: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>, sentence: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>, token: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>, embeddings: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>, prediction: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>, matched_biomarker: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>, oncology_ner: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>, ner_oncology_chunk: array<struct<annotatorTyp

Let's set `cleanAnnotations` to True to see the difference.

In [None]:
converter = StructuredJsonConverter()\
           .setOutputCol("json")\
           .setConverterSchema(columns)\
           .setCleanAnnotations(True)
base_df = bio_pipe.transform(df)
json_df = converter.transform(base_df)
json_df.select("json").show(truncate=False)


Now the DataFrame contains only the original `text` column and the `json` column.

In [None]:
json_df

DataFrame[text: string, json: string]

## `returnRelationEntities` Parameter

Whether to return the entities in the relations or not, by default False.

Let's set it to True to see the difference.

In [None]:
converter = StructuredJsonConverter()\
           .setOutputCol("json")\
           .setConverterSchema(columns)\
           .setReturnRelationEntities(True)
base_df = bio_pipe.transform(df)
json_df = converter.transform(base_df)


Let's pretty print and show only the relations from the JSON.

The relations have the entity1 and entity2 information that would not be present with the default parameter.

In [None]:
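# Parse the JSON string from the first row; the payload sits under the top-level "json" key.
parsed = json.loads(json_df.select("json").first()["json"])
relations = parsed["json"].get("relations", [])
print(json.dumps(relations, indent=2))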

[
  {
    "entity1_begin": "84",
    "relation": "is_finding_of",
    "direction": "both",
    "confidence": "0.99814475",
    "entity1": "Biomarker_Result",
    "chunk2": "CD9",
    "entity2_end": "99",
    "entity1_end": "91",
    "entity2_begin": "97",
    "chunk2_id": "b473fd80",
    "entity2": "Biomarker",
    "chunk1": "positive",
    "chunk1_id": "bc15add6"
  },
  {
    "entity1_begin": "84",
    "relation": "is_finding_of",
    "direction": "both",
    "confidence": "0.9996313",
    "entity1": "Biomarker_Result",
    "chunk2": "CD10",
    "entity2_end": "108",
    "entity1_end": "91",
    "entity2_begin": "105",
    "chunk2_id": "0252d08a",
    "entity2": "Biomarker",
    "chunk1": "positive",
    "chunk1_id": "bc15add6"
  }
]
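
Every value in these relation objects arrives as a string, including the confidence and the character offsets. The sketch below shows one way to cast them to numeric types and to check the offsets against the input text. It reuses the two relations printed above (trimmed to the fields it needs) so that it runs without Spark. The helper names `span` and `normalize` and the 0.999 threshold are illustrative choices, not part of the library. The inclusive end offset is an observation from this sample output, not a documented guarantee.

```python
text = ("In the bone- marrow (BM) aspiration, blasts accounted for 88.1% of ANCs, "
        "which were positive for CD9 and CD10 on flow cytometry.")

# Relation objects copied from the output above, trimmed to the fields used here.
relations = [
    {"entity1_begin": "84", "entity1_end": "91", "chunk1": "positive",
     "entity2_begin": "97", "entity2_end": "99", "chunk2": "CD9",
     "relation": "is_finding_of", "confidence": "0.99814475"},
    {"entity1_begin": "84", "entity1_end": "91", "chunk1": "positive",
     "entity2_begin": "105", "entity2_end": "108", "chunk2": "CD10",
     "relation": "is_finding_of", "confidence": "0.9996313"},
]


def span(source, begin, end):
    """Slice `source` using string offsets whose end position is inclusive."""
    return source[int(begin):int(end) + 1]


def normalize(rel):
    """Cast string fields to numbers and verify both entity spans against the text."""
    return {
        "relation": rel["relation"],
        "confidence": float(rel["confidence"]),
        "chunk1": rel["chunk1"],
        "chunk2": rel["chunk2"],
        "offsets_match": (
            span(text, rel["entity1_begin"], rel["entity1_end"]) == rel["chunk1"]
            and span(text, rel["entity2_begin"], rel["entity2_end"]) == rel["chunk2"]
        ),
    }


normalized = [normalize(rel) for rel in relations]

# Keep only relations above an arbitrary, illustrative confidence threshold.
confident = [rel for rel in normalized if rel["confidence"] >= 0.999]
for rel in confident:
    print(rel["chunk1"], rel["relation"], rel["chunk2"], rel["confidence"])
```

Checking the offsets against the original text is a cheap sanity test before handing the relations to a downstream consumer such as a highlighter or a database loader.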


## `outputAsStr` Parameter

Whether to output the result as a JSON string or as a nested struct, by default True (JSON string).

Here, we'll set it to False to see the difference.

With this setting, the `json` column no longer holds a JSON string; it holds a nested struct instead.

In [None]:
converter = StructuredJsonConverter()\
           .setOutputCol("json")\
           .setConverterSchema(columns)\
           .setOutputAsStr(False)
base_df = bio_pipe.transform(df)
json_df = converter.transform(base_df)
json_df.select("json").show(truncate=False)


## `parentSource` Parameter

Defines the root structure of the output.
Available options are `chunk` and the empty string (document-level structured JSON).
When set to `chunk`, the output is organized as a chunk-based struct.

In [None]:
converter = StructuredJsonConverter()\
           .setOutputCol("json")\
           .setConverterSchema(columns)\
           .setParentSource("chunk")
base_df = bio_pipe.transform(df)
json_df = converter.transform(base_df)

In [None]:
rows = json_df.select("json").collect()
for row in rows:
    parsed = json.loads(row["json"])
    print(json.dumps(parsed, indent=2))

{
  "json": [
    {
      "chunk_id": "bc15add6",
      "chunk": "positive",
      "begin": 84,
      "end": 91,
      "sentence_id": 0,
      "ner_label": "Biomarker_Result",
      "ner_source": "ner_oncology_chunk",
      "ner_confidence": "0.9672",
      "relations": [
        {
          "relation": "is_finding_of",
          "direction": "both",
          "confidence": "0.99814475",
          "chunk2": "CD9",
          "chunk2_id": "b473fd80",
          "chunk1": "positive",
          "chunk1_id": "bc15add6"
        },
        {
          "relation": "is_finding_of",
          "direction": "both",
          "confidence": "0.9996313",
          "chunk2": "CD10",
          "chunk2_id": "0252d08a",
          "chunk1": "positive",
          "chunk1_id": "bc15add6"
        }
      ]
    },
    {
      "chunk_id": "b473fd80",
      "chunk": "CD9",
      "begin": 97,
      "end": 99,
      "sentence_id": 0,
      "ner_label": "Biomarker",
      "ner_source": "ner_oncology_chunk",
      "

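In the chunk-based layout, each chunk carries its own `relations` list, so a consumer can walk the chunks directly. The sketch below flattens that nesting into (chunk1, relation, chunk2) triples. It uses only the first chunk object printed above, because the rest of the output is cut off, and the helper name `relation_triples` is hypothetical.

```python
# First chunk object copied from the output above; the remaining chunks are omitted.
chunks = [
    {
        "chunk_id": "bc15add6",
        "chunk": "positive",
        "begin": 84,
        "end": 91,
        "sentence_id": 0,
        "ner_label": "Biomarker_Result",
        "relations": [
            {"relation": "is_finding_of", "confidence": "0.99814475",
             "chunk1": "positive", "chunk1_id": "bc15add6",
             "chunk2": "CD9", "chunk2_id": "b473fd80"},
            {"relation": "is_finding_of", "confidence": "0.9996313",
             "chunk1": "positive", "chunk1_id": "bc15add6",
             "chunk2": "CD10", "chunk2_id": "0252d08a"},
        ],
    },
]


def relation_triples(chunk_list):
    """Flatten per-chunk relations into (chunk1, relation, chunk2) tuples."""
    triples = []
    for chunk in chunk_list:
        # Chunks without relations may omit the key, so default to an empty list.
        for rel in chunk.get("relations", []):
            triples.append((rel["chunk1"], rel["relation"], rel["chunk2"]))
    return triples


triples = relation_triples(chunks)
for chunk1, relation, chunk2 in triples:
    print(f"{chunk1} --{relation}--> {chunk2}")
```

Since relations are grouped under the chunk they belong to, per-entity views such as "all findings for this biomarker result" need no extra join in this layout.
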
## `sentenceColumn` Parameter

Name of the sentence column. It is used to extract the sentence that contains each chunk.

In [None]:
converter = StructuredJsonConverter()\
           .setOutputCol("json")\
           .setConverterSchema(columns)\
           .setSentenceColumn("sentence")
base_df = bio_pipe.transform(df)
json_df = converter.transform(base_df)

In [None]:
rows = json_df.select("json").collect()
for row in rows:
    parsed = json.loads(row["json"])
    print(json.dumps(parsed, indent=2))

{
  "json": {
    "document_identifier": "c643e4d1-a85b-4d66-a9f6-24b8af8b4414",
    "document_text": [
      "In the bone- marrow (BM) aspiration, blasts accounted for 88.1% of ANCs, which were positive for CD9 and CD10 on flow cytometry."
    ],
    "entities": [
      {
        "begin": "84",
        "chunk": "positive",
        "ner_source": "ner_oncology_chunk",
        "end": "91",
        "ner_label": "Biomarker_Result",
        "chunk_id": "bc15add6",
        "sentence": "0",
        "ner_confidence": "0.9672"
      },
      {
        "begin": "97",
        "chunk": "CD9",
        "ner_source": "ner_oncology_chunk",
        "end": "99",
        "ner_label": "Biomarker",
        "chunk_id": "b473fd80",
        "sentence": "0",
        "ner_confidence": "0.992"
      },
      {
        "begin": "105",
        "chunk": "CD10",
        "ner_source": "ner_oncology_chunk",
        "end": "108",
        "ner_label": "Biomarker",
        "chunk_id": "0252d08a",
        "sentence": "0",