# P&ID Parsing Example
This notebook is a standalone example of using multimodal models to parse structured output from piping and instrumentation diagrams (P&IDs). We use the OpenAI SDK to call Claude Sonnet through a Databricks serving endpoint and compare zero-shot and one-shot parsing of P&ID tiles.

In [0]:
%pip install -U --quiet mlflow openai
%restart_python

In [0]:
from pathlib import Path
from openai import OpenAI
import base64
from PIL import Image
import IPython.display as display

In [0]:
from mlflow.models import ModelConfig
config = ModelConfig(development_config="config.yaml").to_dict()

## Prepare Pool of Examples

In [0]:
example_path = '/Volumes/shm/pid/pdf_images/5a82c87214d47c8af93fb443908548ee_tiled/tile_4.png'
example_image = Image.open(example_path)
display.display(example_image)

In [0]:
test_path = "/Volumes/shm/pid/pdf_images/5a82c87214d47c8af93fb443908548ee_tiled/tile_7.png"
test_image = Image.open(test_path)
display.display(test_image)

## Zero Shot Inference
Our first example tests zero-shot inference, which performs poorly when judged by the number of tags it extracts.
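One way to make the tag-count comparison concrete is to diff the number of tags per category between a labeled tile and the model output. The sketch below assumes both have already been parsed into a dict that maps a category name to a list of tag strings; the function name `tag_count_diff`, the category names, and the tags are hypothetical and are not taken from this notebook's labels.

```python
def tag_count_diff(expected: dict, predicted: dict) -> dict:
    """Return, per category, the predicted tag count minus the expected tag count."""
    categories = set(expected) | set(predicted)
    return {
        category: len(predicted.get(category, [])) - len(expected.get(category, []))
        for category in sorted(categories)
    }

# Hypothetical labels and model output for a single tile
expected = {
    "equipment": ["P-101", "V-201"],
    "instruments": ["FT-101", "PT-102", "LT-103"],
}
predicted = {"equipment": ["P-101"], "instruments": ["FT-101"]}

diff = tag_count_diff(expected, predicted)
print(diff)  # negative values mean the model returned fewer tags than the labels
```

A count diff only shows how many tags are missing or extra per category; it does not check whether the returned tags are the right ones.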

In [0]:
DATABRICKS_TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()

client = OpenAI(
  api_key=DATABRICKS_TOKEN,
  base_url="https://adb-984752964297111.11.azuredatabricks.net/serving-endpoints"
)

def zero_shot_parse(image_path: str):
  image_data = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")

  chat_completion = client.chat.completions.create(
    messages=[
      {
        "role": "system",
        "content": config['system_prompt']
      },
      {
        "role": "user",
        "content": [
          # The tiles are PNG files, so the data URL declares image/png
          {"type": "image_url", "image_url": 
            {"url": f"data:image/png;base64,{image_data}"}
          }
        ]
      }
    ],
    model=config['fm_endpoint'],
    temperature=config['temperature'],
    top_p=config['top_p']
  )

  parsed_text = chat_completion.choices[0].message.content
  return parsed_text
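
The MIME type in a data URL should match the actual format of the encoded file. As a minimal sketch, the helper below infers it from the file extension with the standard-library `mimetypes` module instead of hard-coding it. The name `image_to_data_url` is hypothetical, and the demo writes only a PNG signature to a temporary file rather than a real tile.

```python
import base64
import mimetypes
import tempfile
from pathlib import Path

def image_to_data_url(image_path: str) -> str:
    """Encode an image file as a data URL, inferring the MIME type from its extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type for {image_path!r}")
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"

# Demo with a placeholder file: just the 8-byte PNG signature, not a real image
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "tile_demo.png"
    demo_path.write_bytes(b"\x89PNG\r\n\x1a\n")
    demo_url = image_to_data_url(str(demo_path))

print(demo_url.split(",", 1)[0])
```

Guessing from the extension is cheap but trusts the file name; inspecting the file contents would be a stricter check.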

In [0]:
zero_shot_parse(test_path)

## Few Shot Parsing
Tiled images work much better than full diagrams because P&IDs are dense and complex. Tiles also let us take advantage of the few-shot ability of multimodal models. Here is an example of few-shot parsing with a single labeled example.
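
The tiles used here were produced ahead of time, and this notebook does not show how. As a rough sketch of one possible approach, the function below computes crop boxes for a regular grid; each box uses the `(left, upper, right, lower)` order that Pillow's `Image.crop` accepts. The function name `tile_boxes`, the 3x3 grid, and the 1000x700 image size are assumptions for illustration only.

```python
def tile_boxes(width: int, height: int, rows: int, cols: int) -> list:
    """Return (left, upper, right, lower) boxes that cover the image in a rows x cols grid."""
    if rows < 1 or cols < 1:
        raise ValueError("rows and cols must be at least 1")
    boxes = []
    for row in range(rows):
        # Integer division spreads any remainder pixels across the tiles
        upper = row * height // rows
        lower = (row + 1) * height // rows
        for col in range(cols):
            left = col * width // cols
            right = (col + 1) * width // cols
            boxes.append((left, upper, right, lower))
    return boxes

# Assumed grid and image size; the real tiling parameters are not given in this notebook
boxes = tile_boxes(1000, 700, rows=3, cols=3)
for index, box in enumerate(boxes):
    print(index, box)  # each box could be passed to Image.crop(box)
```

Non-overlapping tiles can cut a tag in half at a tile boundary, so an overlapping grid may be worth considering if tags near the edges are missed.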

In [0]:
import yaml
import json
with open('./example_labels.yaml', 'r') as file:
  example = yaml.safe_load(file)


In [0]:
def few_shot_parse(image_path: str):
  # Base64-encode the tile to parse and the labeled example tile
  image_data = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
  example_data = base64.b64encode(Path(example_path).read_bytes()).decode("utf-8")
  example_text = json.dumps(example)

  # Order matters: example image, its labels, then the image to parse.
  # Both tiles are PNG files, so the data URLs declare image/png.
  content = [
    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{example_data}"}},
    {"type": "text", "text": example_text},
    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}},
  ]

  chat_completion = client.chat.completions.create(
    messages=[
      {
        "role": "system",
        "content": config['system_prompt']
      },
      {
        "role": "user",
        "content": content
      }
    ],
    model=config['fm_endpoint'],
    temperature=config['temperature'],
    top_p=config['top_p']
  )

  parsed_text = chat_completion.choices[0].message.content
  return parsed_text
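
The same message layout extends from one example to a pool of examples: each labeled example contributes an image part followed by its labels as text, and the image to parse goes last. The sketch below only builds the `content` list and does not call the endpoint. The name `build_few_shot_content`, the placeholder data URLs, and the label dicts are hypothetical.

```python
import json

def build_few_shot_content(examples: list, target_url: str) -> list:
    """Build user-message content: (image, labels) pairs first, then the target image."""
    content = []
    for example_url, labels in examples:
        content.append({"type": "image_url", "image_url": {"url": example_url}})
        content.append({"type": "text", "text": json.dumps(labels)})
    # The unlabeled image comes last so the model completes the pattern for it
    content.append({"type": "image_url", "image_url": {"url": target_url}})
    return content

# Placeholder data URLs and labels, standing in for real encoded tiles
pool = [
    ("data:image/png;base64,AAAA", {"equipment": ["P-101"]}),
    ("data:image/png;base64,BBBB", {"instruments": ["FT-101"]}),
]
content = build_few_shot_content(pool, "data:image/png;base64,CCCC")
print([part["type"] for part in content])
```

Every extra example adds another image to the request, so the pool size trades accuracy against payload size and latency.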

In [0]:
few_shot_output = few_shot_parse(test_path)
few_shot_output

## JSON Parsing
A key requirement is turning the LLM's text output into structured data that downstream code can use.

In [0]:
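import re
json_str = few_shot_output

# Repair values where a stray quote after the leading digits splits the string,
# i.e. text of the form "<digits>"-<TAG>" is rewritten to "<digits>-<TAG>"
# so that json.loads can parse the output
fixed_json_str = re.sub(
    r'"(\d+)"-([A-Z\-0-9]+)"',
    r'"\1-\2"',
    json_str
)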
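
To see what the substitution above does, the sketch below applies the same pattern to a made-up string. The pattern appears to target tags that begin with a size in inches, where the inch mark ends the JSON string early; that reading, the sample string, and the name `repair_split_tags` are assumptions rather than something the notebook states.

```python
import json
import re

def repair_split_tags(raw: str) -> str:
    """Rejoin values of the form "<digits>"-<TAG>" into "<digits>-<TAG>"."""
    return re.sub(r'"(\d+)"-([A-Z\-0-9]+)"', r'"\1-\2"', raw)

# Hypothetical model output: the first value is broken by an unescaped quote
sample = '{"lines": ["3"-PW-1001", "FT-101"]}'
repaired = repair_split_tags(sample)

print(repaired)
print(json.loads(repaired))
```

The repair drops the stray quote instead of escaping it, so the parsed tag no longer carries the inch mark.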

In [0]:
import json
parsed_dict = json.loads(fixed_json_str)

In [0]:
parsed_dict
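
Calling `json.loads` directly fails if the model wraps its answer in prose or a Markdown code fence. A more defensive sketch, assuming the output contains exactly one top-level JSON object, slices from the first `{` to the last `}` before parsing. The name `extract_json` and the sample output are hypothetical.

```python
import json

def extract_json(text: str) -> dict:
    """Parse the JSON object spanning the first '{' to the last '}' in text."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in model output")
    return json.loads(text[start:end + 1])

# Hypothetical model output wrapped in prose and a Markdown code fence
fence = "`" * 3
wrapped = "Here is the parsed tile:\n" + fence + 'json\n{"equipment": ["P-101"]}\n' + fence

extracted = extract_json(wrapped)
print(extracted)
```

This still raises `json.JSONDecodeError` when the object itself is malformed, so it complements the regex repair rather than replacing it.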