Jing Zhang1,*, Wentao Jiang1,*, Tao Huang1, Zhiwei Wang1, Jianxin Liu2, Jian Chen2, Ping Ye2, Gang Wang3, Zengmao Wang1, Bo Du1, Dacheng Tao4
1 School of Computer Science, Wuhan University, China
2 The Central Hospital of Wuhan, China
3 Taizhou Hospital of Zhejiang Province, China
4 Nanyang Technological University, Singapore
An invoke-and-reason ultrasound agent that coordinates organ-specific detector tools, verifies grounded visual evidence, and converts lesion-level observations into clinically meaningful diagnostic decisions.
Ultrasound interpretation requires both precise lesion localization and holistic clinical reasoning. Specialized detectors are strong at localization but provide limited diagnostic reasoning, while general multimodal large language models (MLLMs) can reason in language but remain weak at fine-grained spatial grounding in specialized medical images.
Echo-α bridges this gap with an agentic multimodal reasoning framework. Given a raw ultrasound image and optional clinical context, Echo-α forms an initial hypothesis, invokes an organ-specific lesion detector through a structured tool interface, reads the returned boxes and labels, and synthesizes a final grounded diagnosis. The model is trained to treat detector outputs as revisable clinical evidence rather than as final predictions.
Echo-α unifies visual perception, detector evidence, and multimodal clinical reasoning in an invoke-and-reason loop.
- Agentic ultrasound interpretation: Echo-α places an MLLM at the center of an invoke-and-reason loop for grounded diagnosis.
- Tool-grounded visual evidence: Organ-specific detectors return rendered visualizations and structured metadata, including coordinates, confidence scores, and lesion labels.
- Nine-task supervised curriculum: The SFT stage covers referring expression comprehension (REC), referring expression generation (REG), direct diagnosis, attribute reasoning, grounded analysis, detector correction, tool interpretation, and interaction-loop behavior.
- Sequential RL specialization: The same SFT initialization is optimized into Echo-Ground for lesion anchoring and Echo-Diag for final diagnosis.
- Multi-center evaluation: Renal and breast ultrasound benchmarks are evaluated with in-center validation and cross-center testing.
- Detector-agnostic behavior: The agent improves diagnosis across multiple detector backbones, suggesting that learned tool use is not tied to a single detector.
Echo-α is designed around a detect, verify, and reason workflow.
- Initial multimodal reasoning: The model inspects the ultrasound image and optional clinical context to form a preliminary interpretation.
- Structured detector invocation: The agent calls an organ-specific detector tool through a function-like interface.
- Evidence ingestion: The tool returns a rendered detection image plus structured metadata containing candidate boxes, labels, and confidence scores.
- Grounded synthesis: Echo-α compares tool evidence against global ultrasound appearance and produces a grounded diagnostic decision.
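The four steps above can be sketched as a single control flow. This is a minimal illustration, not the released implementation: the `Detection`, `ToolResult`, and `Diagnosis` types, the `invoke_and_reason` function, and the toy stand-ins for the MLLM and the detector are all hypothetical names introduced here. In particular, the confidence threshold in `toy_synthesize` is only a placeholder for the model's learned verification of detector evidence.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed pixel coordinates


@dataclass
class Detection:
    box: Box
    label: str
    confidence: float


@dataclass
class ToolResult:
    rendered_image: bytes  # detector visualization (placeholder type)
    detections: List[Detection] = field(default_factory=list)


@dataclass
class Diagnosis:
    label: str
    evidence: List[Detection]
    rationale: str


def invoke_and_reason(
    image: bytes,
    organ: str,
    detector: Callable[[bytes, str], ToolResult],
    hypothesize: Callable[[bytes, Optional[str]], str],
    synthesize: Callable[[bytes, str, ToolResult, Optional[str]], Diagnosis],
    clinical_context: Optional[str] = None,
) -> Diagnosis:
    # 1. Initial multimodal reasoning: form a preliminary interpretation.
    hypothesis = hypothesize(image, clinical_context)
    # 2. Structured detector invocation through a function-like interface.
    result = detector(image, organ)
    # 3 + 4. Evidence ingestion and grounded synthesis.
    return synthesize(image, hypothesis, result, clinical_context)


# --- Toy stand-ins (assumptions, used only to make the sketch runnable) ---
def toy_hypothesize(image: bytes, context: Optional[str]) -> str:
    return "renal cyst"


def toy_detector(image: bytes, organ: str) -> ToolResult:
    return ToolResult(
        rendered_image=b"",
        detections=[
            Detection((40, 60, 120, 150), "renal malignant tumor", 0.81),
            Detection((10, 10, 30, 30), "renal cyst", 0.12),
        ],
    )


def toy_synthesize(image, hypothesis, result, context, min_conf=0.5) -> Diagnosis:
    # Stand-in for verification: discard weak candidates before deciding.
    kept = [d for d in result.detections if d.confidence >= min_conf]
    if not kept:
        return Diagnosis(hypothesis, [], "no reliable detector evidence; kept initial hypothesis")
    best = max(kept, key=lambda d: d.confidence)
    return Diagnosis(best.label, kept, "detector evidence revised the initial hypothesis")


diagnosis = invoke_and_reason(b"", "kidney", toy_detector, toy_hypothesize, toy_synthesize)
print(diagnosis.label, len(diagnosis.evidence))
```

Passing the detector, hypothesis, and synthesis steps as callables keeps the loop itself independent of any particular model or detector, which mirrors the separation the workflow describes.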
For renal ultrasound, the detector covers six lesion categories: angiomyolipoma, hydronephrosis, renal stone, renal cyst, diffuse renal parenchymal disease, and renal malignant tumor. For breast ultrasound, the detector predicts BI-RADS categories, including BI-RADS 2, 3, 4A, 4B, 4C, and 5.
The core agent is independent of a particular detector implementation. New anatomical domains can be introduced by exposing another detector through the same structured interface.
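One way to picture this detector independence is a small registry keyed by organ, where every detector returns records with the same fields. The sketch below is an assumption about how such an interface could look: `register_detector`, `call_detector`, and the record schema are hypothetical, and the `thyroid` entry is only an illustrative new domain that the source does not evaluate.

```python
from typing import Callable, Dict, List, TypedDict


class DetectionRecord(TypedDict):
    box: List[float]   # [x1, y1, x2, y2]
    label: str
    confidence: float


DetectorFn = Callable[[bytes], List[DetectionRecord]]

# Hypothetical registry: organ name -> detector callable.
DETECTORS: Dict[str, DetectorFn] = {}


def register_detector(organ: str) -> Callable[[DetectorFn], DetectorFn]:
    """Decorator that exposes a detector under a shared structured interface."""
    def decorator(fn: DetectorFn) -> DetectorFn:
        DETECTORS[organ] = fn
        return fn
    return decorator


def call_detector(organ: str, image: bytes) -> List[DetectionRecord]:
    """Dispatch to the organ-specific detector; the agent only sees this call."""
    if organ not in DETECTORS:
        raise KeyError(f"no detector registered for organ {organ!r}")
    return DETECTORS[organ](image)


@register_detector("kidney")
def renal_detector(image: bytes) -> List[DetectionRecord]:
    # Placeholder output; a real detector would run inference here.
    return [{"box": [40.0, 60.0, 120.0, 150.0], "label": "renal cyst", "confidence": 0.9}]


@register_detector("thyroid")  # hypothetical new anatomical domain
def thyroid_detector(image: bytes) -> List[DetectionRecord]:
    return []


print(sorted(DETECTORS))
```

Because the agent depends only on `call_detector` and the record schema, adding a domain in this sketch means registering one more function rather than changing the agent.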
The SFT stage teaches Echo-α complementary grounding, reasoning, and tool-use skills through a nine-task curriculum:
| Tier | Tasks | Purpose |
|---|---|---|
| Foundational grounding | Referring Expression Comprehension and Referring Expression Generation | Learn lesion-box alignment and lesion description |
| Diagnostic reasoning | Direct diagnosis, attribute explanation, multi-step grounded analysis | Connect sonographic observations with disease categories |
| Tool collaboration | Box refinement, category correction, localization-classification assessment | Learn to revise detector outputs instead of copying them |
| Interaction loop | Tool invocation and feedback interpretation | Learn when and how to use detector evidence |
Training rationales are generated with teacher-forcing context that includes ground-truth annotations and specialized-detector predictions. The result is a shared SFT initialization for both grounding and diagnosis variants.
Echo-α is further optimized with Group Relative Policy Optimization (GRPO). The reward combines:
| Reward | Role |
|---|---|
| Localization reward | Encourages overlap between predicted and ground-truth boxes |
| Classification reward | Rewards correct class prediction when localization is sufficiently accurate |
| Shape reward | Uses DIoU-style alignment to encourage compact, well-aligned boxes |
| Tool cost | Penalizes redundant detector calls while preserving strategic tool use |
Different reward weights produce two specialized variants. Echo-Ground emphasizes lesion anchoring and box refinement. Echo-Diag shifts the objective toward diagnosis while retaining localization constraints.
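The reward components can be illustrated with a small sketch. The source does not give the weights, the IoU gate, or the exact form of the tool cost, so every numeric default below (`w_loc`, `w_cls`, `w_shape`, `tool_cost`, `iou_gate`, `free_calls`) is an assumption chosen for illustration. The DIoU term follows the standard definition (IoU minus the squared center distance normalized by the squared diagonal of the smallest enclosing box), and `group_relative_advantages` shows the usual GRPO normalization of rewards within a sampled group.

```python
from statistics import mean, pstdev


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def diou(a, b):
    """Distance-IoU: IoU minus normalized squared distance between box centers."""
    dx = (a[0] + a[2]) / 2 - (b[0] + b[2]) / 2
    dy = (a[1] + a[3]) / 2 - (b[1] + b[3]) / 2
    ew = max(a[2], b[2]) - min(a[0], b[0])  # enclosing box width
    eh = max(a[3], b[3]) - min(a[1], b[1])  # enclosing box height
    diag2 = ew ** 2 + eh ** 2
    return iou(a, b) - ((dx ** 2 + dy ** 2) / diag2 if diag2 > 0 else 0.0)


def reward(pred_box, pred_label, gt_box, gt_label, n_tool_calls,
           w_loc=1.0, w_cls=1.0, w_shape=0.5,      # assumed weights
           tool_cost=0.1, iou_gate=0.5, free_calls=1):  # assumed cost settings
    overlap = iou(pred_box, gt_box)
    r_loc = overlap                                   # localization reward
    # Classification is rewarded only when localization is accurate enough.
    r_cls = 1.0 if (pred_label == gt_label and overlap >= iou_gate) else 0.0
    r_shape = diou(pred_box, gt_box)                  # shape reward
    # Only calls beyond the first are penalized (an assumption).
    penalty = tool_cost * max(0, n_tool_calls - free_calls)
    return w_loc * r_loc + w_cls * r_cls + w_shape * r_shape - penalty


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


gt = (0.0, 0.0, 10.0, 10.0)
print(reward(gt, "renal cyst", gt, "renal cyst", n_tool_calls=1))  # 2.5 with the defaults
print(group_relative_advantages([1.0, 2.0, 3.0]))
```

Under this sketch, a grounding-oriented variant would raise `w_loc` and `w_shape`, while a diagnosis-oriented variant would raise `w_cls` and keep the IoU gate as a localization constraint.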
Echo-α is evaluated on renal and breast ultrasound benchmarks under a multi-center protocol. The validation split measures same-center generalization, while the test split measures cross-center robustness.
| Split | Specialized Detector F1@0.5 | Direct MLLM + Tool F1@0.5 | Echo-Ground F1@0.5 |
|---|---|---|---|
| Renal Val | 69.70 | 56.56 | 70.78 |
| Renal Test | 52.63 | 50.11 | 56.73 |
| Breast Val | 46.68 | 36.61 | 50.37 |
| Breast Test | 42.01 | 37.25 | 43.78 |
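For readers unfamiliar with the metric, the sketch below shows one common way to compute a detection F1 score at an IoU threshold of 0.5. The exact protocol behind the numbers above is not stated in this document, so the greedy confidence-ordered matching, the requirement that labels agree, and the function name `f1_at_iou` are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(preds, gts, thr=0.5):
    """F1 for one image.

    preds: list of (box, label, confidence)
    gts:   list of (box, label)
    A prediction is a true positive if it matches an unused ground-truth box
    of the same label with IoU >= thr (assumed matching rule).
    """
    matched = set()
    tp = 0
    # Greedy matching: the most confident predictions claim boxes first.
    for box, label, _conf in sorted(preds, key=lambda p: p[2], reverse=True):
        best_j, best_overlap = None, thr
        for j, (gt_box, gt_label) in enumerate(gts):
            if j in matched or gt_label != label:
                continue
            overlap = iou(box, gt_box)
            if overlap >= best_overlap:
                best_j, best_overlap = j, overlap
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp   # unmatched predictions
    fn = len(gts) - tp     # missed ground-truth lesions
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


preds = [((0, 0, 10, 10), "renal cyst", 0.9), ((50, 50, 60, 60), "renal stone", 0.8)]
gts = [((0, 0, 10, 9), "renal cyst"), ((100, 100, 110, 110), "renal stone")]
print(f1_at_iou(preds, gts))  # 0.5: one hit, one false positive, one miss
```

A dataset-level score would accumulate true positives, false positives, and false negatives across images before computing F1, rather than averaging per-image values.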
| Split | Specialized Detector Acc. | Direct MLLM + Tool Acc. | Echo-Diag Acc. |
|---|---|---|---|
| Renal Val | 74.53 | 63.99 | 77.43 |
| Renal Test | 69.13 | 66.99 | 74.90 |
| Breast Val | 51.41 | 37.71 | 48.75 |
| Breast Test | 46.96 | 44.75 | 49.20 |
Replacing the renal detector with YOLO-family, LW-DETR-DINOv3, and RF-DETR-DINOv3 backbones shows that Echo-α improves on detector-only diagnostic accuracy for every backbone tested. Gains are largest for weaker or sparser detectors (up to +22.95 points with YOLOv8) and remain positive for stronger ones (+2.90 points with our own detector).
| Detector | Detector Acc. | Echo-α Acc. | Gain |
|---|---|---|---|
| YOLOv8 | 44.16 | 67.11 | +22.95 |
| LW-DETR-DINOv3 | 45.91 | 56.08 | +10.17 |
| YOLO26 | 64.56 | 71.01 | +6.45 |
| RF-DETR-DINOv3 | 65.91 | 69.70 | +3.79 |
| YOLO12 | 66.85 | 72.35 | +5.50 |
| YOLO11 | 68.99 | 71.14 | +2.15 |
| Ours | 74.53 | 77.43 | +2.90 |
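The Gain column is simply the agent's accuracy minus the detector's accuracy. The short check below recomputes it from the table values and confirms two properties stated in the text: every gain is positive, and the largest gain belongs to the weakest detector. The variable and function names are illustrative only.

```python
# (detector, detector accuracy, Echo-α accuracy), copied from the table above.
rows = [
    ("YOLOv8", 44.16, 67.11),
    ("LW-DETR-DINOv3", 45.91, 56.08),
    ("YOLO26", 64.56, 71.01),
    ("RF-DETR-DINOv3", 65.91, 69.70),
    ("YOLO12", 66.85, 72.35),
    ("YOLO11", 68.99, 71.14),
    ("Ours", 74.53, 77.43),
]


def compute_gains(table):
    """Return {detector: accuracy gain in points}, rounded to two decimals."""
    return {name: round(agent - det, 2) for name, det, agent in table}


gains = compute_gains(rows)
weakest = min(rows, key=lambda r: r[1])[0]       # lowest detector accuracy
largest_gain = max(gains, key=gains.get)         # detector with the biggest gain

print(all(g > 0 for g in gains.values()))  # True
print(weakest == largest_gain)             # True (both are YOLOv8)
```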
Echo-α uses detector feedback as evidence to be checked against the image, not as a prediction to blindly copy.
In a renal example, the raw MLLM favors a benign angiomyolipoma based on a smaller, benign-looking region. Echo-α instead grounds a larger lesion and predicts renal malignant tumor, citing heterogeneous echoes, indistinct margins, and local structural distortion. In a breast example, the raw MLLM reports a benign BI-RADS 3 assessment, while Echo-α predicts BI-RADS 4A after comparing two tool-returned ductal-region candidates with clinical context such as persistent nipple discharge.
Representative renal and breast ultrasound cases showing how grounded tool evidence changes final interpretation.
If you find Echo-α useful for your research, please cite:
@article{echo-alpha,
title={Echo-{$\alpha$}: Agentic Multimodal Reasoning for Ultrasound Interpretation},
author={Zhang, Jing and Jiang, Wentao and Huang, Tao and Wang, Zhiwei and Liu, Jianxin and Chen, Jian and Ye, Ping and Wang, Gang and Wang, Zengmao and Du, Bo and Tao, Dacheng},
year={2026},
journal={arXiv preprint arXiv:2604.28011},
}
