### Vision Detection Agent scratch

---

* Overall pipeline
<br> Input → Preparing → VQA → [ Planning + Executing ] → Code Generation

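* Improvements
1) Step-by-step planning: generate each plan with reference to the previous plans and the tools
2) Provide the tool list up front
3) Tighten the prompt conventions to keep LLM answers consistent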

---

* Input
    * image
    * request

* AgentState data structure
    * state1 → state2 → state3 →  ...
    * Store everything that is generated, for debugging and reproducibility

In [None]:
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class AgentState:
    # Fixed inputs / environment
    user_request: str
    img_b64: Optional[str] = None
    tool_desc: str = ""
    tool_registry: Dict[str, Any] = field(default_factory=dict)

    # Accumulated state (evidence)
    vqa_log: str = ""
    vqa_struct: Dict[str, Any] = field(default_factory=dict)
    observations: List[Dict[str, Any]] = field(default_factory=list)

    # History (replay / debugging)
    all_plans: List[Dict[str, Any]] = field(default_factory=list)
    all_execs: List[Dict[str, Any]] = field(default_factory=list)

    # Termination
    final_answer: Optional[str] = None

    # Result of the final plan
    code_plan: Optional[List[Dict[str, Any]]] = None

    # coder
    coder: Optional[Any] = None  # use Optional[CodeCoder] to pin the type
    coder_prompt: Optional[str] = None
    generated_code: Optional[str] = None
    generated_code_path: Optional[str] = None
    all_codes: List[Dict[str, Any]] = field(default_factory=list)
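
The debugging and reproducibility goal above could be served by dumping the state to JSON after each turn. The following is a minimal sketch, not part of the original notebook: `MiniState` is a hypothetical stand-in for `AgentState`, and `snapshot_state` and its `skip` list are assumed names chosen for illustration.

```python
import json
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List, Optional


@dataclass
class MiniState:
    """Hypothetical stand-in for AgentState with a few representative fields."""
    user_request: str
    img_b64: Optional[str] = None
    tool_registry: Dict[str, Any] = field(default_factory=dict)
    observations: List[Dict[str, Any]] = field(default_factory=list)


def snapshot_state(state: Any, skip: tuple = ("img_b64", "tool_registry", "coder")) -> str:
    """Serialize a dataclass state to JSON, leaving out bulky or non-serializable fields."""
    data = {f.name: getattr(state, f.name) for f in fields(state) if f.name not in skip}
    # default=repr keeps the dump from failing on values json cannot encode
    return json.dumps(data, ensure_ascii=False, indent=2, default=repr)


state = MiniState(user_request="count the red tomatoes", img_b64="...")
state.observations.append({"tool": "crop", "ok": True, "result": [0, 0, 10, 10]})
snapshot = snapshot_state(state)
print(snapshot)
```

Writing one such snapshot per turn gives a replayable trace without storing the base64 image or the function references each time.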

In [None]:
import json
import io
import base64
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

import numpy as np
from PIL import Image
from anthropic import Anthropic

* Preparing
1. image → base64: convert the image into a form the LLM can consume
2. tool list: convert the metadata of the tools provided by vision agent into the LLM's input format

In [None]:
# image to Base64
from pathlib import Path

def encode_image_to_base64(
    img_path,
    jpeg_quality: int = 85,
    max_size: int | None = 1024,   # None disables resizing
):
    """
    Encode an image file as a base64 JPEG string.

    - Any input format (PNG, JPEG, ...) is converted to RGB and re-encoded as JPEG.
    - max_size: maximum side length in pixels (for the planning/VLM step).
    - Returns None if the image cannot be read or encoded.
    """
    try:
        img = Image.open(Path(img_path))

        # JPEG has no alpha channel or palette, so normalize every mode to RGB
        if img.mode != "RGB":
            img = img.convert("RGB")

        if max_size is not None:
            img.thumbnail((max_size, max_size))  # shrink in place, keeping aspect ratio

        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=jpeg_quality, optimize=True)
        return base64.b64encode(buf.getvalue()).decode("utf-8")

    except Exception as e:
        print(f"Error encoding image: {e}")
        return None

In [None]:
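# tool registry (sketch): `tools` is the list of tool metadata dicts
import vision_agent.tools.tools as tools_mod

tool_registry = {}
for tool in tools:
    tool_registry[tool["name"]] = {
        "func": getattr(tools_mod, tool["name"], None),  # reference to the actual function
        "metadata": tool,  # name, doc, signature, etc.
    }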

In [None]:
def prepare_context(
    user_request: str,
    img_path: Optional[str],
    tools: List[Dict[str, Any]],
) -> AgentState:
    import vision_agent.tools.tools as tools_mod

    img_b64 = encode_image_to_base64(img_path) if img_path else None

    # Build tool_registry so that each entry also carries the actual function
    tool_registry = {}
    for t in tools:
        tool_name = t["name"]
        # Copy the metadata dict so the caller's list is not mutated
        tool_registry[tool_name] = t.copy()

        # Attach the callable when the tools module exposes one under this name
        func = getattr(tools_mod, tool_name, None)
        if callable(func):
            tool_registry[tool_name]["func"] = func

    tool_desc = format_tool_desc(tools, max_tools=80, max_doc_chars=350)

    if not tool_registry:
        raise ValueError("tools is empty: at least one tool is required")
    return AgentState(
        user_request=user_request,
        img_b64=img_b64,
        tool_desc=tool_desc,
        tool_registry=tool_registry,
    )

* VQA
    * Build the multimodal input
    * PROMPT_VQA
    <br> <analysis_log> ... </analysis_log>  / <plan_json> ... </plan_json>
    <br> The prompt spells out this structured output format to constrain the LLM's output
    * Parse the LLM response

In [1]:
# tag parser

import json
import re
from typing import Any, Dict, Tuple

_TAG_RE = re.compile(
    r"<(?P<tag>analysis_log|plan_json)>(?P<body>[\s\S]*?)</(?P=tag)>",
    re.IGNORECASE,
)

def parse_tagged_output(text: str) -> Tuple[str, Dict[str, Any]]:
    """
    Extract <analysis_log> and <plan_json> from the LLM response,
    parse plan_json as JSON, and return it as a dict.
    """
    text = (text or "").strip()
    matches = {m.group("tag").lower(): m.group("body").strip() for m in _TAG_RE.finditer(text)}

    if "analysis_log" not in matches:
        raise ValueError("Response has no <analysis_log>...</analysis_log> tag.")
    if "plan_json" not in matches:
        raise ValueError("Response has no <plan_json>...</plan_json> tag.")

    analysis_log = matches["analysis_log"]
    plan_json_str = matches["plan_json"]

    try:
        plan = json.loads(plan_json_str)
    except json.JSONDecodeError as e:
        raise ValueError(f"Failed to parse JSON inside <plan_json>: {e}\n\nJSON:\n{plan_json_str}") from e

    return analysis_log, plan
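
To make the tag convention concrete, here is a small self-contained sketch of the same extraction on a made-up response. `extract_sections` and both sample strings are illustrative assumptions, not real model output. The second sample omits the slash in the closing tag, which is enough to make the extraction return nothing.

```python
import json
import re

TAG_RE = re.compile(
    r"<(?P<tag>analysis_log|plan_json)>(?P<body>[\s\S]*?)</(?P=tag)>",
    re.IGNORECASE,
)


def extract_sections(text: str) -> dict:
    """Return {tag_name: body} for every well-formed tag pair found in text."""
    return {m.group("tag").lower(): m.group("body").strip() for m in TAG_RE.finditer(text)}


good_response = """
<analysis_log>
Step 1: The user wants a count of red tomatoes.
Step 2: Detection plus a color check is needed.
</analysis_log>
<plan_json>
{"language": "ko", "task_type": "counting", "subtasks": []}
</plan_json>
"""

# The closing tag is missing its slash, so no tag pair is well-formed
bad_response = "<analysis_log>Step 1: count tomatoes<analysis_log>"

sections = extract_sections(good_response)
plan = json.loads(sections["plan_json"])
print(plan["task_type"])
print(extract_sections(bad_response))
```

Because a malformed response yields an empty result instead of a partial one, the caller can treat a missing section as a signal to retry the request.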

In [None]:
PROMPT_VQA_TEMPLATE = """
You are an expert vision task planner.

You will be given:
- A user request (Korean)
- ONE image (provided to you as an image input)

Your job in this step is ONLY to analyze the user request and propose a concrete, tool-agnostic plan.
Do NOT run code. Do NOT claim results. Do NOT hallucinate object counts.

User request: {user_request}

Output MUST contain EXACTLY TWO TAGS in this order:
1) <analysis_log> ... </analysis_log>  (Korean, human-readable, step-by-step, short)
2) <plan_json> ... </plan_json>        (machine-readable, MUST be valid JSON)

Rules:
- Do not output anything outside the two tags.
- <analysis_log> should be concise: 5–10 lines, each starting with "Step N:".
- <plan_json> must be STRICT JSON (no trailing commas, no comments, no markdown).

<plan_json> JSON schema:
{{
  "language": "ko",
  "intent_summary": string,
  "task_type": "counting",
  "target_definition": {{
    "primary_object": "tomato",
    "required_attributes": ["red"],
    "exclusions": [string],
    "edge_cases": [string]
  }},
  "subtasks": [
    {{
      "name": string,
      "goal": string,
      "suggested_method": string
    }}
  ],
  "tool_requirements": {{
    "needs_localization": boolean,
    "needs_instance_separation": boolean,
    "needs_attribute_reasoning": boolean,
    "preferred_outputs": [string]
  }},
  "verification_checks": [string],
  "questions_if_ambiguous": [string]
}}
"""

In [None]:
from typing import Any, Dict, Optional, Union
from anthropic import Anthropic

def model_response_anthropic(
    anthropic_client: Anthropic,
    prompt_text: str,
    model: str = "claude-sonnet-4-20250514",
    temperature: float = 0.1,
    max_tokens: int = 1000,
    parse_tags: bool = True,
    print_log: bool = True,
    img_b64: Optional[str] = None,
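    media_type: str = "image/jpeg",  # encode_image_to_base64 always emits JPEG
) -> Union[str, Dict[str, Any]]:
    """
    Anthropic (Claude) only.
    - prompt_text: the prompt (including the tag output rules)
    - img_b64/media_type: optional base64 image and its MIME type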

    Returns:
      - parse_tags=False: raw string
      - parse_tags=True: dict {"raw": str, "analysis_log": str, "plan": dict}
    """
    # Build the multimodal input
    content = [{"type": "text", "text": prompt_text}]

    if img_b64 is not None:
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": img_b64,
            }
        })

    # Generate the LLM response
    resp = anthropic_client.messages.create(
        model=model,
        max_tokens=max_tokens,
        temperature=temperature,
        messages=[{"role": "user", "content": content}],
    )

    # Claude returns a list of content blocks; join only the text blocks
    raw = "".join(
        blk.text for blk in resp.content
        if getattr(blk, "type", None) == "text"
    ).strip()

    if not parse_tags:
        return raw

    # Parse the tags
    analysis_log, plan = parse_tagged_output(raw)

    if print_log:
        print("\n[analysis_log]\n" + analysis_log)

    return {"raw": raw, "analysis_log": analysis_log, "plan": plan}
    

* Planning
    * Structure the tool descriptions
    * PROMPT_PLAN
    * Append each step's observations to PROMPT_PLAN

In [None]:
def format_tool_desc(
    tools: List[Dict[str, Any]],
    max_tools: int = 50,
    max_doc_chars: int = 300,
) -> str:
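    lines = []
    for t in tools[:max_tools]:
        # Collapse newlines and repeated whitespace so each doc fits on one line.
        # Done outside the f-string: backslashes inside f-string expressions
        # are a SyntaxError before Python 3.12.
        doc = " ".join((t.get("doc") or "")[:max_doc_chars].split())
        lines.append(
            f"- {t['name']} ({t['type']})\n"
            f"  qualname: {t['qualname']}\n"
            f"  signature: {t['signature']}\n"
            f"  doc: {doc}"
        )
    return "\n".join(lines)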
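
The metadata dicts consumed above need `name`, `type`, `qualname`, `signature`, and `doc` keys. These notes do not show how the entries are produced, so the following is only one possible way to derive them with the standard `inspect` module. `describe_tool`, the `"function"` type label, and the `crop_box` tool are hypothetical.

```python
import inspect
from typing import Any, Callable, Dict, List


def describe_tool(fn: Callable[..., Any], tool_type: str = "function") -> Dict[str, Any]:
    """Build one metadata entry with the keys that format_tool_desc reads."""
    return {
        "name": fn.__name__,
        "type": tool_type,  # assumed label; the real registry may use other values
        "qualname": f"{fn.__module__}.{fn.__qualname__}",
        "signature": str(inspect.signature(fn)),
        "doc": inspect.getdoc(fn) or "",
    }


def crop_box(image: list, box: List[int]) -> list:
    """Hypothetical tool: crop a nested-list image to box = [x1, y1, x2, y2]."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


meta = describe_tool(crop_box)
print(meta["name"], meta["signature"])
```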

In [None]:
PROMPT_PLAN_TEMPLATE = """
You are a VisionAgent-style planner/controller.

Your job is to decide the NEXT ACTION(s) to take using the available tools, based on the user's request and the accumulated evidence. You do NOT execute tools. You only output tool calls or the final answer.

You will be given:
- A user request (Korean)
- ONE image (already annotated with detection boxes/labels overlaid)
- Tool list with available actions
- VQA log: chronological reasoning, detection notes, bounding-box/label evaluations, and any prior validation outcomes
- VQA structured JSON summary of the detection/analysis results
- Prior tool observations may also appear in the conversation history (as "observation").

Primary evidence:
- Build decisions primarily from [VQA_LOG] and [VQA_STRUCT_JSON].
- Do NOT hallucinate new detections, boxes, or attributes beyond the provided evidence and tool outputs.

User request (Korean):
{user_request}

[VQA_LOG]
{vqa_log}

[VQA_STRUCT_JSON]
{vqa_struct_json}

[TOOLS]
{tool_desc}

[OBSERVATIONS]
{observations}

────────────────────────────────
CORE CONTROL LOOP BEHAVIOR
────────────────────────────────
At each turn, output either:
(A) Tool calls for the NEXT immediate actions (one or more tool calls), OR
(B) A final answer if no more tools are needed.

Do NOT output a full end-to-end plan. Do NOT output steps[1..N].
The executor will run your tool calls in the order you provide, append observations, and call you again with updated context.

────────────────────────────────
DETECTION-SPECIFIC BEHAVIOR
────────────────────────────────
- The image is already annotated. Prefer verification of existing detections.
- If bounding-box coordinates are available in VQA_STRUCT_JSON, use them for cropping and verification.
- If bounding-box coordinates are NOT available, do NOT guess. Request the annotation file
  (COCO JSON / YOLO TXT / model output JSON) or propose a concrete method to obtain coordinates.

Hard rule:
- If VQA_STRUCT_JSON contains bbox coordinates, you MUST call the crop tool first for those boxes.
- You MUST NOT call any VQA tool before attempting crop-based verification when bbox coordinates exist.
- VQA tools are allowed ONLY if bbox coordinates are missing/unavailable OR cropping fails with an error.

────────────────────────────────
OUTPUT FORMAT (STRICT)
────────────────────────────────
Output MUST contain EXACTLY TWO tags in this exact order:
1) <analysis_log> ... </analysis_log>
2) <plan_json> ... </plan_json>

Do NOT output anything outside the two tags.

<analysis_log> rules:
- 3–7 lines only
- Each line must start with "Step N:"
- Only describe the immediate reasoning for the NEXT action(s), not a full multi-step plan.

<plan_json> rules:
- MUST be STRICT JSON (no trailing commas, no comments, no markdown)
- Must match exactly one of the following schemas:

Schema 1: Tool calls
{{
  "language": "ko",
  "mode": "tool_calls",
  "selected_tools": [string],
  "tool_calls": [
    {{
      "id": int,
      "tool": string,
      "parameters": object,
      "expected_result": string
    }}
  ],
  "open_questions": [string]
}}

Schema 2: Final answer
{{
  "language": "ko",
  "mode": "final",
  "final_answer": string,
  "open_questions": [string]
}}

Additional rules:
- tool_calls must be listed in exact execution order; ids must start at 1 and increase strictly by 1 within this turn.
- Each tool call MUST reference a tool name from [TOOLS].
- Keep tool_calls minimal: only what is needed before the next observation.
- If you need missing inputs (e.g., box coordinates), set mode="final" and clearly request them in final_answer, or set open_questions accordingly.
"""


In [None]:
def render_prompt(
    user_request: str,
    vqa_log: str,
    vqa_struct: dict,
    tool_desc: str,
    observations: list | None = None,
) -> str:
    vqa_struct_json = json.dumps(vqa_struct, ensure_ascii=False, indent=2)

    # Observations can grow long; truncate them here if needed
    if observations:
        obs_text = json.dumps(observations, ensure_ascii=False, indent=2)
    else:
        obs_text = "(none)"

    return PROMPT_PLAN_TEMPLATE.format(
        user_request=user_request,
        vqa_log=vqa_log,
        vqa_struct_json=vqa_struct_json,
        tool_desc=tool_desc,
        observations=obs_text,
    )
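
The comment about long observations suggests some form of truncation before they go into the prompt. One possible approach, sketched under assumptions: keep only the most recent entries and clip each result string. `summarize_observations` and both limits are hypothetical choices, not values from the original notebook.

```python
import json
from typing import Any, Dict, List


def summarize_observations(
    observations: List[Dict[str, Any]],
    max_items: int = 5,
    max_result_chars: int = 200,
) -> str:
    """Render the most recent observations as JSON, clipping long results."""
    if not observations:
        return "(none)"
    clipped = []
    for obs in observations[-max_items:]:
        result_text = json.dumps(obs.get("result"), ensure_ascii=False, default=repr)
        if len(result_text) > max_result_chars:
            result_text = result_text[:max_result_chars] + "...(truncated)"
        clipped.append({**obs, "result": result_text})
    return json.dumps(clipped, ensure_ascii=False, indent=2)


demo_observations = [
    {"tool": f"tool_{i}", "ok": True, "result": "x" * 1000, "error": None}
    for i in range(8)
]
obs_text = summarize_observations(demo_observations)
print(len(obs_text))
```

The trade-off is that the planner no longer sees older or full results, so the limits should be tuned against how much history the plan actually depends on.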

In [None]:
# plan mode checking
def validate_plan(plan: dict, tools_meta: list[dict]) -> None:
    names = {t["name"] for t in tools_meta}
    mode = plan.get("mode")
    if mode not in ("tool_calls", "final"):
        raise ValueError(f"Invalid mode: {mode}")

    if mode == "tool_calls":
        for tc in plan.get("tool_calls", []):
            if tc.get("tool") not in names:
                raise ValueError(f"Unknown tool in plan: {tc.get('tool')}")
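
The planner prompt also requires tool call ids to start at 1 and increase by 1, and `parameters` to be an object. `validate_plan` does not check either, so a stricter companion check might look like the sketch below. `check_tool_call_ids` and the two sample plans are hypothetical.

```python
from typing import Any, Dict, List


def check_tool_call_ids(plan: Dict[str, Any]) -> None:
    """Raise ValueError unless tool_calls ids are exactly 1, 2, ..., N in order."""
    if plan.get("mode") != "tool_calls":
        return
    calls: List[Dict[str, Any]] = plan.get("tool_calls", [])
    if not calls:
        raise ValueError("mode is 'tool_calls' but tool_calls is empty")
    ids = [tc.get("id") for tc in calls]
    expected = list(range(1, len(calls) + 1))
    if ids != expected:
        raise ValueError(f"tool_call ids must be {expected}, got {ids}")
    for tc in calls:
        if not isinstance(tc.get("parameters"), dict):
            raise ValueError(f"parameters must be an object for tool call {tc.get('id')}")


good_plan = {
    "mode": "tool_calls",
    "tool_calls": [
        {"id": 1, "tool": "crop", "parameters": {"box": [0, 0, 10, 10]}},
        {"id": 2, "tool": "crop", "parameters": {"box": [5, 5, 20, 20]}},
    ],
}
bad_plan = {
    "mode": "tool_calls",
    "tool_calls": [{"id": 2, "tool": "crop", "parameters": {}}],
}

check_tool_call_ids(good_plan)  # passes silently
```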

In [None]:
def add_observation(state: AgentState, tool: str, params: dict, result, ok: bool = True, error: str | None = None):
    # Append to the AgentState passed in rather than to an undefined global
    state.observations.append({
        "tool": tool,
        "parameters": params,
        "ok": ok,
        "result": result,
        "error": error,
    })

* Executing
    * Run the tools selected in the plan
    * Convert the base64 image data to a NumPy array so the tools can work with the image

In [None]:
def run_tool_call(tool_call: dict, tool_registry: dict):
    """
    Execute a single tool call safely.

    Input:
      - tool_call: {"tool": str, "parameters": dict, ...}
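      - tool_registry: {tool_name: callable} or {tool_name: {"func": callable, ...}}
                       (the second form is what prepare_context builds)

    Output:
      {
        "tool": str,
        "ok": bool,
        "result": Any | None,
        "error": str | None,
      }
    """
    tool_name = tool_call.get("tool")
    params = tool_call.get("parameters", {})

    # 1) Check that the tool exists and has a callable attached
    entry = tool_registry.get(tool_name)
    fn = entry.get("func") if isinstance(entry, dict) else entry
    if not callable(fn):
        return {
            "tool": tool_name,
            "ok": False,
            "result": None,
            "error": f"Unknown tool: {tool_name}",
        }

    # 2) Execute, converting any exception into an error result
    try:
        result = fn(**params)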
        return {
            "tool": tool_name,
            "ok": True,
            "result": result,
            "error": None,
        }
    except Exception as e:
        return {
            "tool": tool_name,
            "ok": False,
            "result": None,
            "error": repr(e),
        }
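
Tool results are later passed through `json.dumps` when the observations are rendered into the prompt, and that call fails on objects such as image arrays. A possible guard is to normalize each result before storing it. This is a sketch only: `to_jsonable` is a hypothetical helper, and it detects array-like values by their `shape` attribute rather than importing NumPy.

```python
import json
from typing import Any


def to_jsonable(value: Any, max_list_items: int = 20) -> Any:
    """Convert a tool result into something json.dumps can encode."""
    if isinstance(value, (str, int, float, bool)) or value is None:
        return value
    if isinstance(value, dict):
        return {str(k): to_jsonable(v, max_list_items) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        items = [to_jsonable(v, max_list_items) for v in value[:max_list_items]]
        if len(value) > max_list_items:
            items.append(f"... {len(value) - max_list_items} more items")
        return items
    shape = getattr(value, "shape", None)
    if shape == () and hasattr(value, "item"):
        return to_jsonable(value.item(), max_list_items)  # zero-dimensional scalar
    if shape is not None:
        # Array-like objects are summarized by type and shape, not dumped
        return {"type": type(value).__name__, "shape": list(shape)}
    return repr(value)


class FakeArray:
    """Stand-in for an image array so the example needs no third-party import."""
    shape = (2, 3)


converted = to_jsonable({"boxes": [(1, 2, 3, 4)], "mask": FakeArray()})
print(json.dumps(converted))
```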


In [None]:
def b64_to_np(img_b64: str) -> np.ndarray:
    data = base64.b64decode(img_b64)
    img = Image.open(io.BytesIO(data)).convert("RGB")
    arr = np.array(img)
    assert isinstance(arr, np.ndarray)
    assert arr.ndim == 3 and arr.shape[2] == 3
    return arr

IMG_NP = b64_to_np(state.img_b64)  # build once and reuse (state comes from prepare_context)

* Agentic Loop
    * Iterative refinement mechanism
    * Revise the plan based on previous observations
    * Accumulate state

In [None]:
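# Sketch of the control loop; plan_next and execute_plan are helpers not shown in these notes
final_plan = None
for turn in range(1, max_turns + 1):
    plan = plan_next(state, client, PROMPT_PLAN_TEMPLATE)  # parsed <plan_json> dict

    if plan["mode"] == "tool_calls":
        state = execute_plan(state, plan)  # accumulates observations

    elif plan["mode"] == "final":
        final_plan = generate_final_plan(state, client, PROMPT_FINAL_PLAN_TEMPLATE)
        break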

In [None]:
state.observations.append(exec_result)
state.all_plans.append(plan_json)
state.all_execs.append(exec_result)
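
The loop can be exercised without an LLM by swapping the planner for a stub. The sketch below is a simplified, self-contained version under assumptions: `run_agent_loop`, `fake_planner`, and the `count_items` tool are hypothetical, and the registry maps names directly to callables.

```python
from typing import Any, Callable, Dict, List


def run_agent_loop(
    plan_next: Callable[[List[Dict[str, Any]]], Dict[str, Any]],
    tools: Dict[str, Callable[..., Any]],
    max_turns: int = 5,
) -> Dict[str, Any]:
    """Alternate planning and execution until the planner returns mode='final'."""
    observations: List[Dict[str, Any]] = []
    all_plans: List[Dict[str, Any]] = []
    for turn in range(1, max_turns + 1):
        plan = plan_next(observations)
        all_plans.append(plan)
        if plan["mode"] == "final":
            return {"final_answer": plan["final_answer"], "turns": turn,
                    "observations": observations, "all_plans": all_plans}
        for call in plan.get("tool_calls", []):
            try:
                result = tools[call["tool"]](**call.get("parameters", {}))
                obs = {"tool": call["tool"], "ok": True, "result": result, "error": None}
            except Exception as exc:  # record the failure so the planner can react
                obs = {"tool": call["tool"], "ok": False, "result": None, "error": repr(exc)}
            observations.append(obs)
    # Turn budget exhausted without a final answer
    return {"final_answer": None, "turns": max_turns,
            "observations": observations, "all_plans": all_plans}


def fake_planner(observations: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Stub standing in for the LLM: call one tool, then finish."""
    if not observations:
        return {"mode": "tool_calls",
                "tool_calls": [{"id": 1, "tool": "count_items",
                                "parameters": {"items": [1, 2, 3]}}]}
    return {"mode": "final", "final_answer": f"count={observations[-1]['result']}"}


outcome = run_agent_loop(fake_planner, {"count_items": lambda items: len(items)})
print(outcome["final_answer"], outcome["turns"])
```

Failed tool calls are stored as observations instead of aborting the loop, which gives the planner a chance to choose a different tool on the next turn.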

* Code generation

In [None]:
PROMPT_FINAL_PLAN_TEMPLATE = """
You are a code planning expert. Your job is to create a detailed, step-by-step execution plan for generating Python code.

You will be given:
- A user request (Korean)
- VQA analysis results (what was understood from the image and request)
- Tool execution observations (what tools were run and their results)
- Available tools list

Your task:
Create a final execution plan that lists the exact steps needed to write Python code to complete the task.

User request: {user_request}

[VQA ANALYSIS]
{vqa_log}

[VQA STRUCTURED SUMMARY]
{vqa_struct_json}

[TOOL OBSERVATIONS]
{observations}

[AVAILABLE TOOLS]
{tool_desc}

────────────────────────────────
OUTPUT FORMAT (STRICT)
────────────────────────────────
Output MUST contain EXACTLY TWO tags in this exact order:
1) <final_answer> ... </final_answer>
2) <code_plan> ... </code_plan>

<final_answer> rules:
- Brief summary (1-2 sentences) of what the code will accomplish
- Written in Korean

<code_plan> rules:
- MUST be STRICT JSON array
- Each element represents one execution step
- Steps must be in exact execution order
- Format:
[
  {{
    "step": 1,
    "instruction": "Brief instruction describing what to do",
    "code_snippet": "Example code (not full implementation, just example)",
    "explanation": "Why this step is needed (optional)"
  }},
  ...
]

Instruction guidelines:
- Be specific about which functions/tools to use
- Include parameter hints (e.g., "prompt 'tomato'")
- Each step should be a single, clear action
- Steps should build on each other logically

Example instruction format:
- "Load the image using load_image()"
- "Use countgd_object_detection with prompt 'tomato' to detect all tomato instances"
- "Count the number of detections by getting the length of the detection list"
- "Visualize the detections by overlaying bounding boxes using overlay_bounding_boxes()"
- "Save the visualization to a file using save_image()"

Do NOT output anything outside the two tags.
"""

In [None]:
def generate_final_plan(
    state: AgentState,
    anthropic_client: Anthropic,
    prompt_template: str = PROMPT_FINAL_PLAN_TEMPLATE,
    model: str = "claude-sonnet-4-20250514",
    temperature: float = 0.2,
    max_tokens: int = 1500,
) -> Dict[str, Any]:
    """
    Generate the final code-generation plan.

    Returns:
        {
            "final_answer": str,
            "code_plan": List[Dict[str, Any]],
            "raw": str
        }
    """
    prompt = prompt_template.format(
        user_request=state.user_request,
        vqa_log=state.vqa_log,
        vqa_struct_json=json.dumps(state.vqa_struct, ensure_ascii=False, indent=2),
        observations=json.dumps(state.observations, ensure_ascii=False, indent=2) if state.observations else "(none)",
        tool_desc=state.tool_desc,
    )
    
    # parse_tags must be False here: this prompt emits <final_answer>/<code_plan>,
    # not the <analysis_log>/<plan_json> pair that parse_tagged_output expects
    raw = model_response_anthropic(
        anthropic_client=anthropic_client,
        prompt_text=prompt,
        model=model,
        max_tokens=max_tokens,
        temperature=temperature,
        parse_tags=False,
        print_log=False,
        img_b64=state.img_b64,
        media_type="image/jpeg",
    )

    # Extract the <final_answer> tag
    final_answer_match = re.search(r"<final_answer>(.*?)</final_answer>", raw, re.DOTALL | re.IGNORECASE)
    final_answer = final_answer_match.group(1).strip() if final_answer_match else ""

    # Extract the <code_plan> tag and parse its JSON
    code_plan_match = re.search(r"<code_plan>(.*?)</code_plan>", raw, re.DOTALL | re.IGNORECASE)
    if not code_plan_match:
        raise ValueError("Response has no <code_plan>...</code_plan> tag.")

    code_plan_str = code_plan_match.group(1).strip()
    try:
        code_plan = json.loads(code_plan_str)
    except json.JSONDecodeError as e:
        raise ValueError(f"Failed to parse JSON inside <code_plan>: {e}\n\nJSON:\n{code_plan_str}") from e
    
    return {
        "final_answer": final_answer,
        "code_plan": code_plan,
        "raw": raw
    }

In [None]:
def build_codegen_prompt(instruction: str, tool_desc: str = "", has_image: bool = False) -> str:
    img_note = (
        "- An image is provided via base64 (img_b64). Use it ONLY if your environment supports it.\n"
        if has_image else
        "- No image is directly embedded. Assume the script will load image_path from disk.\n"
    )

    return f"""
You are a coding assistant.
Write a SINGLE Python file that satisfies the instruction below.

Instruction:
{instruction}

Available tools (reference only):
{tool_desc}

Hard requirements:
- Output ONLY valid Python code (no markdown, no explanations).
- The file must be executable as a script.
- Provide: run(image_path: str) -> dict
- Include a __main__ block that calls run("image.png") by default.
{img_note}
- Include needed imports explicitly.
- Make it robust: basic error handling and clear variable names.
- Do NOT print the whole image or base64. Only print summary results.

Return only the code.
""".strip()


def strip_code_fences(text: str) -> str:
    t = text.strip()
    if t.startswith("```"):
        t = t.split("\n", 1)[-1]
        if t.endswith("```"):
            t = t.rsplit("```", 1)[0]
    return t.strip()
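
Before generated code is written to `generated_code_path`, a cheap syntax check can catch truncated or malformed output. The following is a sketch under assumptions: `save_generated_code` is a hypothetical helper that repeats the fence stripping inline so the example stands alone, and the sample LLM output is made up.

```python
import ast
import tempfile
from pathlib import Path

FENCE = "`" * 3  # Markdown code fence marker


def save_generated_code(raw_text: str, out_path: Path) -> Path:
    """Strip a surrounding Markdown fence, check syntax, and write the script."""
    code = raw_text.strip()
    if code.startswith(FENCE):
        code = code.split("\n", 1)[-1]  # drop the opening fence line
        if code.rstrip().endswith(FENCE):
            code = code.rstrip().rsplit(FENCE, 1)[0]
    code = code.strip() + "\n"
    ast.parse(code)  # raises SyntaxError before anything is written
    out_path.write_text(code, encoding="utf-8")
    return out_path


llm_output = (
    FENCE + "python\n"
    "def run(image_path: str) -> dict:\n"
    "    return {'count': 0}\n"
    + FENCE
)
with tempfile.TemporaryDirectory() as tmp:
    path = save_generated_code(llm_output, Path(tmp) / "generated.py")
    saved = path.read_text(encoding="utf-8")
print(saved)
```

`ast.parse` only verifies that the text is valid Python. It does not run the code, so it says nothing about whether `run()` behaves correctly.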

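* Limitations
1. A validation step using a different LLM still needs to be added
2. The LLM judge and reasoning steps need scoring
3. New tools beyond those in vision agent need to be added; a method for doing so has to be devised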