# Medical Assistant Demo

This notebook demonstrates the capabilities of the latest generation of JSL transformers, which are based on multimodal LLMs. These transformers remain compatible with the existing JSL codebase while adding the capabilities of LLMs.

## Load sample dataframe

The input is a set of PDF documents containing medical data. The question to ask of each document is stored in a separate column, 'prompt'. Every document is asked the same question.

In [None]:
from pyspark.ml import PipelineModel
import pyspark.sql.functions as f
from sparkocr.transformers import *
from sparkocr.enums import *
from sparkocr.utils import display_images

pdf_path = "dbfs:/FileStore/medassist_demo/*.pdf"
pdf_df = spark.read.format("binaryFile").load(pdf_path)
pdf_df = pdf_df.withColumn("prompt", f.lit("Extract medical tests with its attributes. Return result as json. If there is not such tests return empty json."))
pdf_df.show()

+--------------------+-------------------+------+--------------------+--------------------+
|                path|   modificationTime|length|             content|              prompt|
+--------------------+-------------------+------+--------------------+--------------------+
|dbfs:/FileStore/m...|2025-09-14 17:18:37|347577|[25 50 44 46 2D 3...|Extract medical t...|
|dbfs:/FileStore/m...|2025-09-14 17:18:38|220900|[25 50 44 46 2D 3...|Extract medical t...|
|dbfs:/FileStore/m...|2025-09-14 17:18:39| 81937|[25 50 44 46 2D 3...|Extract medical t...|
|dbfs:/FileStore/m...|2025-09-14 17:18:34| 57203|[25 50 44 46 2D 3...|Extract medical t...|
+--------------------+-------------------+------+--------------------+--------------------+
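The `content` column above starts with the bytes `25 50 44 46 2D`, which is the ASCII signature `%PDF-`. If the input folder might also contain files that are not PDFs, a check along the following lines could be applied before conversion. This is only a sketch: the helper name `looks_like_pdf` and the constant `PDF_MAGIC` are hypothetical and are not part of the JSL API.

```python
PDF_MAGIC = b"%PDF-"  # the bytes 25 50 44 46 2D shown in the 'content' column


def looks_like_pdf(content):
    """Return True if the binary content starts with the PDF signature."""
    # Spark may pass the binary column as a bytearray, so normalise to bytes.
    return content is not None and bytes(content[:len(PDF_MAGIC)]) == PDF_MAGIC


# Hypothetical usage with the dataframe from the cell above
# (requires an active Spark session, so it is left commented out):
# is_pdf = f.udf(looks_like_pdf, "boolean")
# pdf_df = pdf_df.filter(is_pdf(f.col("content")))
```

Checking the signature is cheaper than attempting a full conversion, but it only verifies the header, not that the file is a valid PDF.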



## Load transformers

We load two transformers to process the data:

- PdfToImage, which converts each PDF into a set of images, one image per page
- VisualMedicalAssistant, which applies the prompt to each page image and produces the answer

In [None]:
import sparkocr.medassist.visual_medical_assistant
from sparkocr.medassist.visual_medical_assistant import VisualMedicalAssistant1

pdf_to_img = PdfToImage() \
.setKeepInput(False)

vma = VisualMedicalAssistant1() \
.setInputCols(["prompt", "image"]) \
.setOutputCol("json") \
.setKeepInput(False)



## Processing

Now we run the processing step by step.
Note that the dataframe is limited here (one page per document, two rows in total) to speed up generation. For a production pipeline, remember to remove this limit.

In [None]:
image_df = pdf_to_img.transform(pdf_df).filter(f.col('pagenum') == 1).limit(2).cache()
image_df = image_df.repartition(2)
image_df.show()

+--------------------+-------------------+------+--------------------+--------------------+-----------+---------+-------+-----------+
|                path|   modificationTime|length|              prompt|               image|total_pages|exception|pagenum|documentnum|
+--------------------+-------------------+------+--------------------+--------------------+-----------+---------+-------+-----------+
|dbfs:/FileStore/m...|2025-09-14 17:18:39| 81937|Extract medical t...|{dbfs:/FileStore/...|          5|         |      1|          0|
|dbfs:/FileStore/m...|2025-09-14 17:18:37|347577|Extract medical t...|{dbfs:/FileStore/...|          5|         |      1|          0|
+--------------------+-------------------+------+--------------------+--------------------+-----------+---------+-------+-----------+
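Because the `limit(2)` in the cell above has to be removed before going to production, it can help to keep the limit behind a single setting instead of editing the transformation chain. A minimal sketch, assuming a hypothetical helper `limit_for_demo` and a hypothetical setting `DEMO_ROWS` (neither is part of the JSL API):

```python
def limit_for_demo(df, max_rows=None):
    """Return at most max_rows rows of df, or df unchanged when max_rows is None."""
    if max_rows is None:
        return df
    return df.limit(max_rows)


# Hypothetical usage with the names from the cell above:
# DEMO_ROWS = 2  # set to None for a production run
# image_df = limit_for_demo(pdf_to_img.transform(pdf_df), DEMO_ROWS)
```

The helper only relies on the dataframe having a `limit` method, so the rest of the pipeline stays the same in both modes.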



In [None]:
result = vma.transform(image_df).cache()
result.show()

+--------------------+-------------------+------+--------------------+-----------+---------+-------+-----------+--------------------+
|                path|   modificationTime|length|              prompt|total_pages|exception|pagenum|documentnum|                json|
+--------------------+-------------------+------+--------------------+-----------+---------+-------+-----------+--------------------+
|dbfs:/FileStore/m...|2025-09-14 17:18:39| 81937|Extract medical t...|          5|         |      1|          0|{'tests': [{'name...|
|dbfs:/FileStore/m...|2025-09-14 17:18:37|347577|Extract medical t...|          5|         |      1|          0|{'tests': [{'name...|
+--------------------+-------------------+------+--------------------+-----------+---------+-------+-----------+--------------------+



## Result

Finally, we get the result as JSON.
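In this run the 'json' column holds text such as `{'tests': [{'name...`, which uses single quotes and is therefore a Python-style literal, not strict JSON. The cell below handles this by replacing every single quote with a double quote, which would break on a value that itself contains an apostrophe. A more tolerant parser might look like the sketch below; the function name `parse_model_output` is hypothetical, and the fallback assumes the model emits either strict JSON or a Python-style dict literal.

```python
import ast
import json


def parse_model_output(text):
    """Parse model output that is either strict JSON or a Python-style literal."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back for single-quoted literals such as {'tests': []}.
        # literal_eval only evaluates literals; it never runs arbitrary code.
        return ast.literal_eval(text)


# Hypothetical usage with the dataframe from the previous cell:
# parsed = parse_model_output(result.select("json").first()["json"])
```

If the text is neither valid JSON nor a valid literal, `ast.literal_eval` raises `ValueError` or `SyntaxError`, which makes malformed model output visible instead of silently corrupting it.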

In [None]:
import json
import pprint

pprint.pprint(json.loads(result.select("json").collect()[0]["json"].replace("'", "\"")))

{'tests': [{'name': 'BILIRUBIN, TOTAL',
            'reference_range': 'ReferenceRange:0.2-1.2mg/dL',
            'value': 0.6},
           {'name': 'ALKALINE PHOSPHATASE',
            'reference_range': 'ReferenceRange:31-125U/L',
            'value': 33},
           {'name': 'AST',
            'reference_range': 'ReferenceRange:10-30U/L',
            'value': 14},
           {'name': 'ALT',
            'reference_range': 'ReferenceRange:6-29U/L',
            'value': 9},
           {'name': 'CHOLESTEROL, TOTAL',
            'reference_range': 'ReferenceRange:<200mg/dL',
            'value': 166},
           {'name': 'HDL CHOLESTEROL',
            'reference_range': 'ReferenceRange:>OR=50mg/dL',
            'value': 62},
           {'name': 'TRIGLYCERIDES',
            'reference_range': 'ReferenceRange:<150mg/dL',
            'value': 46},
           {'name': 'LDL-CHOLESTEROL', 'unit': 'mg/dL(calc)', 'value': 90},
           {'name': 'CHOL/HDLC RATIO',
            'reference_range': 