![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/ContextualEntityRuler.ipynb)

#   **📜 AnnotationConverter**


This notebook introduces **AnnotationConverter**, a lightweight Python class that transforms the annotations in a DataFrame column with a custom conversion function. It is especially useful when you need to convert annotation results from one annotation type to another.

For example, you can use it to:

- Reformat LLM outputs into document-style annotations

- Convert assertion results into chunk annotations

- Adapt rule-based outputs into a consistent, usable format

**📖 Learning Objectives:**

1. Understand how to use the annotator.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

- Reference Documentation: [AnnotationConverter](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#annotationconverter)


## **🎬 Colab Setup**

In [None]:
!pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

nlp.install()

In [4]:
import pyspark.sql.functions as F
import pandas as pd

spark = nlp.start()
spark

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_9596 (9).json
👌 Launched cpu optimized session with: 🚀Spark-NLP==6.0.0, 💊Spark-Healthcare==6.0.0, running on ⚡ PySpark==3.4.0


## **🖨️ Input/Output Annotation Types**

- Input: `ANY`

- Output: `ANY`

## **🔎 Parameters**


- `f`: (FunctionParam) User-defined function that takes a list of annotations and returns the list of converted annotations.

- `inputCol`: Name of the input column containing annotations.

- `outputCol`: Name of the output column for converted annotations.

- `outputAnnotatorType`: Annotator type of the output annotations (e.g., `"token"`).
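
As a rough illustration of the contract for `f`, the sketch below runs a conversion function outside Spark. `FakeAnnotation` is a hypothetical stand-in that only mirrors the annotation fields used in this notebook (it is not the library class), and `document_to_tokens` is an invented example function. It splits a document annotation on whitespace and emits `"token"` annotations with inclusive `begin`/`end` offsets.

```python
import re
from dataclasses import dataclass, field


@dataclass
class FakeAnnotation:
    """Hypothetical stand-in for an annotation object, for illustration only."""
    annotatorType: str
    begin: int
    end: int
    result: str
    metadata: dict = field(default_factory=dict)
    embeddings: list = field(default_factory=list)


def document_to_tokens(annotations):
    """Split each input annotation on whitespace into token annotations."""
    tokens = []
    for annotation in annotations:
        for match in re.finditer(r"\S+", annotation.result):
            tokens.append(
                FakeAnnotation(
                    annotatorType="token",
                    begin=annotation.begin + match.start(),
                    end=annotation.begin + match.end() - 1,  # inclusive end offset
                    result=match.group(),
                    metadata=dict(annotation.metadata),
                )
            )
    return tokens


doc = FakeAnnotation("document", 0, 15, "severe neck pain", {"sentence": "0"})
tokens = document_to_tokens([doc])
print([(t.result, t.begin, t.end) for t in tokens])
# [('severe', 0, 5), ('neck', 7, 10), ('pain', 12, 15)]
```

A function of this shape, written against the real annotation class instead of the stand-in, is the kind of callable passed as `f`, with `outputAnnotatorType` set to the type it emits.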



  

### Pipeline

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

medical_llm = medical.AutoGGUFModel.pretrained("jsl_medm_q8_v3", "en", "clinical/models")\
    .setInputCols("document")\
    .setOutputCol("completions")\
    .setBatchSize(1)\
    .setNPredict(2048)\
    .setUseChatTemplate(True)\
    .setTemperature(0)\
    .setNGpuLayers(-1) # if you have GPU

def llm_to_documents_function(annotations):
    """Parse JSON from last element of completions and convert into Spark NLP Document format."""
    import json
    import re
    if not isinstance(annotations, list) or len(annotations) == 0:
        return []

    last_entity = annotations[-1]
    if not hasattr(last_entity, 'result'):
        return []

    result = last_entity.result

    def extract_last_json(s):
        """Extract the last JSON object or array from a string."""
        if not isinstance(s, str):
            return None

        json_pattern = re.compile(r'(\{.*?\}|\[.*?\])', re.DOTALL)
        matches = json_pattern.findall(s)
        if not matches:
            return None

        # Try candidates from last to first and return the first one that parses.
        for match in reversed(matches):
            try:
                return json.loads(match)
            except json.JSONDecodeError:
                continue
        return None

    entities = extract_last_json(result)
    if not entities:
        return []

    if isinstance(entities, dict):
        entities = [entities]

    documents = []
    # Only the first extracted entity is converted into a document annotation.
    for entity in entities[:1]:
        if not isinstance(entity, dict):
            continue  # skip anything that is not a JSON object
        chunk = entity.get("chunk", "")

        documents.append(
            Annotation(
                annotatorType="document",
                begin=0 if chunk else -1,
                end=len(chunk) - 1 if chunk else -1,
                result=chunk if chunk else "",
                metadata={"sentence": 0} if chunk else {},
                embeddings=[],
            )
        )

    return documents


llm_to_doc = medical.AnnotationConverter(f=llm_to_documents_function)\
    .setInputCol("completions")\
    .setOutputCol("chunk_docs")\
    .setOutputAnnotatorType("document")

schunk_embeddings = nlp.MPNetEmbeddings.pretrained("all_mpnet_base_v2","en") \
    .setInputCols(["chunk_docs"]) \
    .setOutputCol("mpnet_embeddings")

icd10_resolver = medical.SentenceEntityResolverModel.pretrained("mpnetresolve_icd10_cms_hcc_2024_midyear", "en", "clinical/models") \
    .setInputCols(["mpnet_embeddings"]) \
    .setOutputCol("icd10cm_code")\
    .setDistanceFunction("COSINE")


pipeline = nlp.Pipeline(
    stages = [
        document_assembler,
        medical_llm,
        llm_to_doc,
        schunk_embeddings,
        icd10_resolver
])

data_ner = spark.createDataFrame([[""]]).toDF("text")
p_model = pipeline.fit(data_ner)


jsl_medm_q8_v3 download started this may take some time.
[OK!]

all_mpnet_base_v2 download started this may take some time.
Approximate size to download 387.8 MB
[OK!]
mpnetresolve_icd10_cms_hcc_2024_midyear download started this may take some time.
[OK!]
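
The JSON extraction step inside `llm_to_documents_function` can be tried in isolation, without Spark or the LLM. The sketch below copies the `extract_last_json` helper from the pipeline cell and runs it on made-up completion strings; the sample texts are assumptions for illustration, not real model output.

```python
import json
import re


def extract_last_json(s):
    """Return the last parseable JSON object or array found in a string."""
    if not isinstance(s, str):
        return None
    json_pattern = re.compile(r'(\{.*?\}|\[.*?\])', re.DOTALL)
    # Walk the candidates from last to first; keep the first that parses.
    for match in reversed(json_pattern.findall(s)):
        try:
            return json.loads(match)
        except json.JSONDecodeError:
            continue
    return None


# Hypothetical completion text in the shape the prompt asks for.
sample_completion = '''Here is the result:
[
    {
        "chunk": "severe neck pain",
        "condition": "Neck pain",
        "category": "Confirmed"
    }
]'''

parsed = extract_last_json(sample_completion)
print(parsed[0]["chunk"])                                      # severe neck pain
print(extract_last_json('First {"a": 1} and then {"b": 2}.'))  # {'b': 2}
print(extract_last_json("no json here"))                       # None
print(extract_last_json('{"a": {"b": 1}}'))                    # None
```

The last call shows a limit of the non-greedy pattern: it stops at the first closing brace, so a nested object yields an unparseable fragment and the helper returns `None`. The flat array of flat objects requested in the prompt is not affected.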


In [None]:
prompt = """<|im_start|>system
You are a helpful medical assistant trained by John Snow Labs.<|im_end|>
<|im_start|>user

Patient is reporting severe neck pain and flu.

Respond with the following json:
[
    {
        "chunk": "Actual string from text related to that condition - it should be an exact match",
        "condition": "The name of the condition you think",
        "category": "Confirmed or history or negated or not related"
    }
]

<|im_start|>assistant"""
prompt_df = spark.createDataFrame([[prompt]]).toDF("text")
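
To run the same prompt over several notes, the template above can be wrapped in a small helper. This is a minimal sketch: `build_prompt` and `JSON_SCHEMA_HINT` are hypothetical names, and the second note is an invented example.

```python
# JSON template copied from the prompt above.
JSON_SCHEMA_HINT = """[
    {
        "chunk": "Actual string from text related to that condition - it should be an exact match",
        "condition": "The name of the condition you think",
        "category": "Confirmed or history or negated or not related"
    }
]"""


def build_prompt(note_text):
    """Wrap one clinical note in the chat-style prompt (hypothetical helper)."""
    return (
        "<|im_start|>system\n"
        "You are a helpful medical assistant trained by John Snow Labs.<|im_end|>\n"
        "<|im_start|>user\n\n"
        f"{note_text}\n\n"
        "Respond with the following json:\n"
        f"{JSON_SCHEMA_HINT}\n\n"
        "<|im_start|>assistant"
    )


notes = [
    "Patient is reporting severe neck pain and flu.",
    "No fever. History of asthma.",  # invented example note
]
prompts = [build_prompt(note) for note in notes]
print(prompts[0])

# With a running session, the prompts could be loaded one per row, e.g.:
# prompt_df = spark.createDataFrame([[p] for p in prompts]).toDF("text")
```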

In [None]:
result_df = p_model.transform(prompt_df)
result_df.select("icd10cm_code").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------