![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/MetadataAnnotationConverter.ipynb)

# **📜 MetadataAnnotationConverter**


This notebook introduces the **MetadataAnnotationConverter**, a lightweight annotator that reads values already stored in an annotation's metadata dictionary and promotes them to first-class fields such as `begin`, `end`, or `result`. In practical terms, it lets you replace the raw text span or offsets emitted by earlier pipeline stages with cleaner or normalised values captured in metadata, without writing extra post-processing code.

You might use it to:

- Apply corrected character offsets recorded during OCR or PDF processing so that downstream components see the real text span, not the original noisy offsets.
- Surface normalised codes or spellings (e.g. ICD-10, SNOMED, or dictionary-cleaned terms) that you saved in metadata, turning them into the visible `result` string of each annotation.
- Expose alternative entity representations, for instance a lowercase variant or lemmatised form, so that linguistic or rule-based steps downstream can consume a consistent format.

**📖 Learning Objectives:**

1. Understand how to use the annotator.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

- Reference Documentation: [MetadataAnnotationConverter](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#metadataannotationconverter)


## **🎬 Colab Setup**

In [None]:
!pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

nlp.install()

In [4]:
spark = nlp.start()

👌 Detected license file /content/6.0.4.spark_nlp_for_healthcare.json
👌 Launched [92mcpu optimized[39m session with with: 🚀Spark-NLP==6.0.4, 💊Spark-Healthcare==6.0.4, running on ⚡ PySpark==3.4.0


## **🖨️ Input/Output Annotation Types**

- Input: `ANY`

- Output: `ANY`

## **🔎 Parameters**


- `inputType`: Annotation type of the input column (e.g., `"chunk"`, `"token"`).
- `resultField`: Name of the metadata key whose value replaces the annotation's `result`.
- `beginField`: Name of the metadata key whose value replaces the annotation's `begin` offset.
- `endField`: Name of the metadata key whose value replaces the annotation's `end` offset.
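To make these parameters concrete, here is a minimal plain-Python sketch of the idea. It is not the library's implementation: the `Annotation` dataclass and the `promote_metadata` helper are hypothetical stand-ins, and falling back to the original value when a metadata key is missing is an assumption, not documented behaviour.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class Annotation:
    """Simplified, hypothetical stand-in for a Spark NLP annotation."""
    annotator_type: str
    begin: int
    end: int
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)


def promote_metadata(
    ann: Annotation,
    result_field: Optional[str] = None,
    begin_field: Optional[str] = None,
    end_field: Optional[str] = None,
) -> Annotation:
    """Return a copy of `ann` with result/begin/end taken from its metadata.

    Assumption: when a key is unset or absent, the original value is kept.
    """
    meta = ann.metadata
    new_result = meta[result_field] if result_field in meta else ann.result
    new_begin = int(meta[begin_field]) if begin_field in meta else ann.begin
    new_end = int(meta[end_field]) if end_field in meta else ann.end
    return replace(ann, result=new_result, begin=new_begin, end=new_end)


# Sample chunk with a dictionary-normalised form stored in metadata.
chunk = Annotation(
    annotator_type="chunk",
    begin=122,
    end=140,
    result="difficulty sleeping",
    metadata={"entity": "entity", "original_or_matched": "difficulty sleep"},
)

converted = promote_metadata(
    chunk,
    result_field="original_or_matched",
    begin_field="begin",
    end_field="end",
)
print(converted.result, converted.begin, converted.end)
```

In this sketch only `result` changes, because the sample metadata carries no `begin` or `end` keys; the input annotation itself is left unmodified.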



  

## Create Pipeline

In [5]:
matcher_disease = """influenza
tuberculosis
dengue
difficulty sleep
"""

with open("disease.txt", "w") as f:
    f.write(matcher_disease)

In [6]:
text = """We’re seeing a spike in influenza cases this week, ongoing screenings for latent tuberculosis in new arrivals, reports of difficulty sleeping, and two travel-related dengue infections."""
data = spark.createDataFrame([[text]]).toDF("text")

In [7]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel \
    .pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
    .setInputCols("document") \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

text_matcher = medical.TextMatcher()\
    .setInputCols("sentence", "token")\
    .setOutputCol("matched_text")\
    .setEntities("disease.txt")\
    .setEnableLemmatizer(True)\
    .setEnableStemmer(True)\
    .setCleanStopWords(True)\
    .setBuildFromTokens(False)\
    .setReturnChunks("original")

metadata_converter = medical.MetadataAnnotationConverter()\
    .setInputCols("matched_text")\
    .setInputType("chunk")\
    .setBeginField("begin")\
    .setEndField("end")\
    .setResultField("original_or_matched")\
    .setOutputCol("new_chunk")

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    text_matcher,
    metadata_converter
])

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]


In [8]:
result = pipeline.fit(data).transform(data)

In [9]:
flattener_text_matcher = medical.Flattener()\
    .setInputCols("matched_text")\
    .setExplodeSelectedFields({
        "matched_text": [
            "metadata.entity as entity",
            "begin as begin",
            "end as end",
            "result as result",
            "metadata.original_or_matched as matched"
        ]
    })

flattener_text_matcher.transform(result).show(n=30, truncate=False)

+------+-----+---+-------------------+----------------+
|entity|begin|end|result             |matched         |
+------+-----+---+-------------------+----------------+
|entity|24   |32 |influenza          |influenza       |
|entity|81   |92 |tuberculosis       |tuberculosis    |
|entity|166  |171|dengue             |dengue          |
|entity|122  |140|difficulty sleeping|difficulty sleep|
+------+-----+---+-------------------+----------------+
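
The `begin` and `end` columns are character offsets into the input text, with `end` inclusive. The check below is a small plain-Python sketch that needs no Spark session: the rows are copied by hand from the table above, and it confirms that each offset pair reproduces the `result` string, while `matched` holds the dictionary entry that triggered the match.

```python
text = (
    "We’re seeing a spike in influenza cases this week, ongoing screenings "
    "for latent tuberculosis in new arrivals, reports of difficulty sleeping, "
    "and two travel-related dengue infections."
)

# (begin, end, result, matched) rows copied from the table above.
rows = [
    (24, 32, "influenza", "influenza"),
    (81, 92, "tuberculosis", "tuberculosis"),
    (166, 171, "dengue", "dengue"),
    (122, 140, "difficulty sleeping", "difficulty sleep"),
]

# `end` is inclusive, so the Python slice needs end + 1.
spans = [text[begin:end + 1] for begin, end, _, _ in rows]
offsets_ok = all(span == result for span, (_, _, result, _) in zip(spans, rows))

# Rows where the surface text differs from the dictionary entry.
normalised = [(result, matched) for _, _, result, matched in rows if result != matched]

print(offsets_ok)
print(normalised)
```

Only the last row differs between `result` and `matched`, so it is the one row where promoting `original_or_matched` to `result` would visibly change the chunk text.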

