![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/38.0.Annotation_Converter.ipynb)

#   **📜 AnnotationConverter**


This notebook introduces **AnnotationConverter**, a lightweight Python class that transforms annotations within a DataFrame using custom conversion functions. It is especially useful when you need to reformat or reinterpret annotation results from one type to another.

For example, you can use it to:



- Reformat LLM outputs into document-style annotations

- Convert assertion results into chunk annotations

- Adapt rule-based outputs into a consistent, usable format

**📖 Learning Objectives:**

1. Understand how to use the annotator.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

- Reference Documentation: [AnnotationConverter](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#annotationconverter)


## **🖨️ Input/Output Annotation Types**

- Input: `ANY`

- Output: `ANY`

## **🔎 Parameters**



- `f`: (FunctionParam) User-defined function to transform annotations.

- `inputCol`: Name of the input column containing annotations.

- `outputCol`: Name of the output column for converted annotations.

- `outputAnnotatorType`: Type of the output annotations (e.g., `"token"`).
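
To make the role of `f` concrete, here is a minimal sketch of the calling convention that the examples in this notebook rely on: `f` receives the list of annotations found in `inputCol` for one row and returns a new list of annotations. `FakeAnnotation` and `lowercase_documents` are hypothetical names defined here so the snippet runs without a Spark session; in a real pipeline the objects would be `nlp.Annotation` instances, as in the cells further down.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeAnnotation:
    """Hypothetical stand-in mirroring the annotation fields used in this notebook."""
    annotatorType: str
    begin: int
    end: int
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)
    embeddings: List[float] = field(default_factory=list)


def lowercase_documents(annotations):
    """Example conversion function: one output annotation per input annotation."""
    return [
        FakeAnnotation(
            annotatorType="document",
            begin=a.begin,
            end=a.end,
            result=a.result.lower(),  # for ASCII text the length, and so the offsets, are unchanged
            metadata=dict(a.metadata),
            embeddings=list(a.embeddings),
        )
        for a in annotations
    ]


doc = FakeAnnotation("document", 0, 15, "I like SparkNLP.", {"sentence": "0"})
converted = lowercase_documents([doc])
print(converted[0].result)
```

The same shape (list in, list out) applies whether the function keeps a one-to-one mapping, as here, or emits several annotations per input, as the sentence detector and tokenizer below do.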



  

## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, see the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for each component.

## Colab Setup


In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.settings.enforce_versions=True
nlp.install(refresh_install=True)

Installed 1 products:
💊 Spark-Healthcare==6.0.4 installed! ✅ Heal the planet with NLP! 


In [None]:
from johnsnowlabs import nlp, medical
# Automatically load license data and start a session with all JARs the user has access to

spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_6.0.4.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==6.0.4, 💊Spark-Healthcare==6.0.4, running on ⚡ PySpark==3.4.0


In [None]:
spark

In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only
from sparknlp_jsl.pipeline_tracer import PipelineTracer

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', None)

import json
import string
import numpy as np
import warnings
warnings.filterwarnings('ignore')

## Create Custom Annotators

### Custom SentenceDetector

In [None]:
import re  # needed by custom_sentence_detector below

document_assembler = nlp.DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('document')

def custom_sentence_detector(annotations):
    new_annotations = []
    # This pattern splits on periods, exclamation marks, and question marks
    # followed by whitespace or the end of the string.
    pattern = r"(?<=[.!?])\s+"

    for annotation in annotations: # annotations here are document annotations
        document_text = annotation.result
        document_begin = annotation.begin
        sentences = re.split(pattern, document_text)
        current_relative_index = 0
        for sentence in sentences:
            if sentence:
                # Find the sentence within the document text starting from the current relative index
                relative_begin = document_text.find(sentence, current_relative_index)
                if relative_begin != -1:
                    relative_end = relative_begin + len(sentence) - 1
                    # Calculate absolute begin and end indices relative to the original document
                    absolute_begin = document_begin + relative_begin
                    absolute_end = document_begin + relative_end

                    new_annotations.append(
                        nlp.Annotation(
                            annotatorType="document", # Sentence annotations are typically 'document' type
                            begin=absolute_begin,
                            end=absolute_end,
                            result=sentence,
                            metadata=annotation.metadata,
                            embeddings=annotation.embeddings,
                        )
                    )
                    # Advance past this sentence so the next search starts after it
                    current_relative_index = relative_end + 1
    return new_annotations


custom_sentence_detector_converter = medical.AnnotationConverter(f=custom_sentence_detector)\
    .setInputCol("document")\
    .setOutputCol("custom_sentence")\
    .setOutputAnnotatorType("document") # Output type is 'document' for sentences


pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        custom_sentence_detector_converter
])




In [None]:
text = "I like SparkNLP. Especially MedicalBertForSequenceClassification and Chunk2Doc."
test_data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(test_data).transform(test_data)
result.selectExpr("explode (custom_sentence) as sentence")\
      .selectExpr("sentence.result", "sentence.begin", "sentence.end", "sentence.metadata as meta")\
      .show(truncate=False)

+--------------------------------------------------------------+-----+---+---------------+
|result                                                        |begin|end|meta           |
+--------------------------------------------------------------+-----+---+---------------+
|I like SparkNLP.                                              |0    |15 |{sentence -> 0}|
|Especially MedicalBertForSequenceClassification and Chunk2Doc.|17   |78 |{sentence -> 0}|
+--------------------------------------------------------------+-----+---+---------------+
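
The `str.find` bookkeeping in the function above can be avoided by asking the regex engine for the delimiter positions directly. The following is a sketch in plain Python (no Spark required) that uses the same split pattern; `sentence_spans` is a hypothetical helper name, and it returns inclusive `end` offsets to match the convention in the table above.

```python
import re


def sentence_spans(text, pattern=r"(?<=[.!?])\s+"):
    """Return (sentence, begin, end) tuples with inclusive end offsets."""
    spans = []
    start = 0
    for delimiter in re.finditer(pattern, text):
        # Everything between the previous delimiter and this one is a sentence
        if delimiter.start() > start:
            spans.append((text[start:delimiter.start()], start, delimiter.start() - 1))
        start = delimiter.end()
    # Trailing sentence after the last delimiter, if any
    if start < len(text):
        spans.append((text[start:], start, len(text) - 1))
    return spans


text = "I like SparkNLP. Especially MedicalBertForSequenceClassification and Chunk2Doc."
spans = sentence_spans(text)
for sentence, begin, end in spans:
    print(begin, end, sentence)
```

Because offsets come from the match objects rather than from searching for the sentence text, repeated sentences in the same document are handled without extra index tracking.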



### Custom Tokenizer

In [None]:
import re

text = "I like SparkNLP. Especially MedicalBertForSequenceClassification and Chunk2Doc."

# This pattern finds the parts to split on (whitespace, punctuation, camel case transitions)
pattern_split = r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|\s+|(?=[^\w\s\/\-])|(?<=[^\w\s\/\-])"

parts = re.split(pattern_split, text)

# Recalculate indices based on the original text and the split parts
tokens_with_indices = []
current_index = 0
for part in parts:
    if part:
        begin = text.find(part, current_index)
        if begin != -1:
            end = begin + len(part) - 1
            tokens_with_indices.append({ "token": part, "begin": begin, "end": end })
            current_index = end + 1 # Move current index past the found part
print(tokens_with_indices)

[{'token': 'I', 'begin': 0, 'end': 0}, {'token': 'like', 'begin': 2, 'end': 5}, {'token': 'Spark', 'begin': 7, 'end': 11}, {'token': 'NLP', 'begin': 12, 'end': 14}, {'token': '.', 'begin': 15, 'end': 15}, {'token': 'Especially', 'begin': 17, 'end': 26}, {'token': 'Medical', 'begin': 28, 'end': 34}, {'token': 'Bert', 'begin': 35, 'end': 38}, {'token': 'For', 'begin': 39, 'end': 41}, {'token': 'Sequence', 'begin': 42, 'end': 49}, {'token': 'Classification', 'begin': 50, 'end': 63}, {'token': 'and', 'begin': 65, 'end': 67}, {'token': 'Chunk2Doc', 'begin': 69, 'end': 77}, {'token': '.', 'begin': 78, 'end': 78}]
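
Because `begin` and `end` are inclusive, each token should satisfy `text[begin:end + 1] == token`. A small sanity check like the sketch below can catch off-by-one mistakes before the logic is moved into a converter function. `find_offset_mismatches` is a hypothetical helper, and the sample tokens are copied from the printed output above.

```python
def find_offset_mismatches(text, tokens):
    """Return the tokens whose inclusive [begin, end] span does not reproduce them."""
    return [
        t for t in tokens
        if text[t["begin"]: t["end"] + 1] != t["token"]
    ]


text = "I like SparkNLP. Especially MedicalBertForSequenceClassification and Chunk2Doc."
sample = [
    {"token": "I", "begin": 0, "end": 0},
    {"token": "like", "begin": 2, "end": 5},
    {"token": "Spark", "begin": 7, "end": 11},
    {"token": "NLP", "begin": 12, "end": 14},
    {"token": ".", "begin": 15, "end": 15},
]
mismatches = find_offset_mismatches(text, sample)
print(mismatches)
```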


In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('document')

sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

def tokenize_camel_case(annotations):
    new_annotations = []
    pattern = r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|\s+|(?=[^\w\s\/\-])|(?<=[^\w\s\/\-])"

    for annotation in annotations: # annotations here are sentences
        sentence_text = annotation.result
        sentence_begin = annotation.begin
        parts = re.split(pattern, sentence_text)
        current_relative_index = 0
        for part in parts:
            if part:
                # Find the part within the sentence text starting from the current relative index
                relative_begin = sentence_text.find(part, current_relative_index)
                if relative_begin != -1:
                    relative_end = relative_begin + len(part) - 1
                    # Calculate absolute begin and end indices relative to the original document
                    absolute_begin = sentence_begin + relative_begin
                    absolute_end = sentence_begin + relative_end

                    new_annotations.append(
                        nlp.Annotation(
                            annotatorType="token",
                            begin=absolute_begin,
                            end=absolute_end,
                            result=part,
                            metadata=annotation.metadata,
                            embeddings=annotation.embeddings,
                        )
                    )
                    # Update the current relative index for finding the next part within the sentence
                    current_relative_index = relative_end + 1
    return new_annotations


camel_case_tokenizer = medical.AnnotationConverter(f=tokenize_camel_case)\
    .setInputCol("sentence")\
    .setOutputCol("camel_case_token")\
    .setOutputAnnotatorType("token")

pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        camel_case_tokenizer
])




In [None]:
text = "I like SparkNLP. Especially MedicalBertForSequenceClassification and Chunk2Doc."
test_data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(test_data).transform(test_data)
result.selectExpr("explode (camel_case_token) as cct")\
      .selectExpr("cct.result", "cct.begin", "cct.end", "cct.metadata.sentence as sent_id")\
      .show(truncate=False)

+--------------+-----+---+-------+
|result        |begin|end|sent_id|
+--------------+-----+---+-------+
|I             |0    |0  |0      |
|like          |2    |5  |0      |
|Spark         |7    |11 |0      |
|NLP           |12   |14 |0      |
|.             |15   |15 |0      |
|Especially    |17   |26 |1      |
|Medical       |28   |34 |1      |
|Bert          |35   |38 |1      |
|For           |39   |41 |1      |
|Sequence      |42   |49 |1      |
|Classification|50   |63 |1      |
|and           |65   |67 |1      |
|Chunk2Doc     |69   |77 |1      |
|.             |78   |78 |1      |
+--------------+-----+---+-------+
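
Since each token inherits the metadata of the sentence it came from, the `sentence` key can be used to regroup tokens downstream. Below is a sketch in plain Python, with rows copied from the table above as `(result, sent_id)` pairs. Treating the ids as strings is an assumption, and collecting the rows from the DataFrame is not shown.

```python
from collections import defaultdict

# (token, sentence id) pairs copied from the table above
rows = [
    ("I", "0"), ("like", "0"), ("Spark", "0"), ("NLP", "0"), (".", "0"),
    ("Especially", "1"), ("Medical", "1"), ("Bert", "1"), ("For", "1"),
    ("Sequence", "1"), ("Classification", "1"), ("and", "1"),
    ("Chunk2Doc", "1"), (".", "1"),
]

# Group tokens by the sentence they belong to, preserving token order
tokens_by_sentence = defaultdict(list)
for token, sent_id in rows:
    tokens_by_sentence[sent_id].append(token)

for sent_id, tokens in tokens_by_sentence.items():
    print(sent_id, tokens)
```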



### Custom LLM

In [None]:
!pip install -q optimum

In [None]:
question_list = [
    ["what is the causes for diabetes?"],
    ["what is the relationship between diabetes and obesity?"],
    ["what is the relationship childhood diabetes and mortality?"]
]
spark_df = spark.createDataFrame(question_list).toDF("text")
spark_df.show(truncate=False)

+----------------------------------------------------------+
|text                                                      |
+----------------------------------------------------------+
|what is the causes for diabetes?                          |
|what is the relationship between diabetes and obesity?    |
|what is the relationship childhood diabetes and mortality?|
+----------------------------------------------------------+



In [None]:
import re
import sparknlp
import sparknlp_jsl

document_assembler = nlp.DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('document')

_tokenizer = None
_model = None

def make_custom_llm():
    def custom_llm(annotations):
        global _tokenizer, _model

        # Import torch on every call: get_prediction below needs it even after the model is cached.
        import torch

        if _tokenizer is None or _model is None:
            from transformers import AutoTokenizer, AutoModelForCausalLM

            MODEL_NAME = "Qwen/Qwen2.5-3B"
            device = "auto"
            _tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
            _model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map=device)

        def get_prediction(messages):
            input_ids = _tokenizer.apply_chat_template(
                messages,
                add_generation_prompt=True,
                return_tensors="pt"
            ).to(_model.device)


            def get_eos_token_id(tokenizer):
                # Prefer the EOS token id defined directly on the tokenizer
                if tokenizer.eos_token_id is not None:
                    return tokenizer.eos_token_id

                # Otherwise try alternative EOS token names
                candidates = ["<|eot_id|>", "<|endoftext|>", "</s>"]
                for tok in candidates:
                    tok_id = tokenizer.convert_tokens_to_ids(tok)
                    if isinstance(tok_id, int):
                        return tok_id

                raise ValueError("No valid eos_token_id found.")

            eos_token_id = get_eos_token_id(_tokenizer)

            with torch.no_grad():
                outputs = _model.generate(
                    input_ids,
                    max_new_tokens=1024,
                    eos_token_id=eos_token_id,
                    do_sample=True,
                    temperature=0.1,
                    top_p=0.5,
                )
            response = outputs[0][input_ids.shape[-1]:]
            return _tokenizer.decode(response, skip_special_tokens=True)

        def generate_input_messages(USER_PROMPT):
            SYSTEM_PROMPT = "You are a smart and intelligent medical assistant system."
            messages = [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"{USER_PROMPT}"},
            ]
            return messages

        new_annotations = []
        for annotation in annotations:
            document_text = annotation.result
            messages = generate_input_messages(document_text)
            response = get_prediction(messages)
            result = response.strip()

            new_annotations.append(
                nlp.Annotation(
                    annotatorType="document",
                    begin=0,
                    end=len(result) - 1,
                    result=result,
                    metadata=annotation.metadata,
                    embeddings=annotation.embeddings,
                )
            )
        return new_annotations
    return custom_llm

custom_llm_fn = make_custom_llm()

custom_llm_converter = medical.AnnotationConverter(f=custom_llm_fn)\
    .setInputCol("document")\
    .setOutputCol("generation")\
    .setOutputAnnotatorType("document")

pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        custom_llm_converter
])
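
The module-level `_tokenizer` and `_model` variables in the cell above implement a load-once pattern: the expensive model load happens on the first call and the cached objects are reused afterwards. The sketch below isolates that pattern with a hypothetical `load_model` stand-in that only counts how often it runs. It does not load a real model, and it says nothing about how often the load repeats across Spark worker processes.

```python
_model = None
load_count = 0


def load_model():
    """Hypothetical stand-in for an expensive model load."""
    global load_count
    load_count += 1
    return lambda prompt: f"echo: {prompt}"


def get_model():
    """Load the model on first use, then reuse the cached instance."""
    global _model
    if _model is None:
        _model = load_model()
    return _model


answers = [get_model()(q) for q in ["a", "b", "c"]]
print(load_count, answers)
```

Anything that must be available on every call, such as an imported module used by a nested helper, belongs outside the `if _model is None` guard, since the guarded branch runs only once.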




In [None]:
result = pipeline.fit(spark_df).transform(spark_df).cache()

result.selectExpr("text",
                  "generation.result",
                  "generation.begin",
                  "generation.end",
                  "generation.metadata as meta")\
      .show(truncate=150)

+----------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+-----+-----+-----------------+
|                                                      text|                                                                                                                                                result|begin|  end|             meta|
+----------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+-----+-----+-----------------+
|                          what is the causes for diabetes?|[Diabetes is a chronic condition that affects how your body regulates blood sugar (glucose) levels. There are two main types of diabetes: Type 1 an...|  [0]|[683]|[{sentence -> 0}]|
|    what is the relationship be

In [None]:
result.selectExpr("generation.result").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                