Multimodal AI for dermatology — aligning skin lesion images with clinical symptom text
to support faster, more reliable diagnosis.
Multimodal AI · RAG · Edge Deployment | 4,010 training images | MSE 0.0025
Dermatology has a data problem. Clinicians rely on visual inspection and symptom history, but manual diagnosis is slow, inconsistent, and constrained by specialist availability. Most AI approaches treat image and text data in isolation — missing the richer signal that emerges when both are aligned.
HealthLens treats this as a business problem with a technical solution: build a production-ready diagnostic support system that reasons over both modalities simultaneously, delivers explainable outputs, and can scale without requiring constant human annotation.
| Training Images | Alignment MSE | Input Modalities | Pipeline |
|---|---|---|---|
| 4,010 | 0.0025 | Image + text (2-in-1) | End-to-end, deployed |
```
User Input ──► Image + symptom description
      │
      ▼
Preprocessing ──► Normalization + tokenization
      │
      ▼
ALIGN Encoders ──► Image vector + text vector (shared embedding space)
      │
      ▼
Cosine Similarity ──► Closest clinical match identified
      │
      ▼
RAG Retrieval ──► Clinical description fetched from Qdrant
      │
      ▼
Output ──► Disease label + confidence score + clinical context
```
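The matching step at the heart of this pipeline can be sketched as follows. The embeddings here are toy placeholders; in the real system both the query vector and the reference vectors come from the ALIGN encoders, and the label set is far larger.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy reference embeddings standing in for ALIGN text vectors
# of clinical condition descriptions.
reference = {
    "eczema": np.array([0.9, 0.1, 0.0]),
    "psoriasis": np.array([0.1, 0.9, 0.2]),
}

# Toy query standing in for the ALIGN image vector of the user's photo.
query = np.array([0.85, 0.15, 0.05])

# The closest clinical match is the label with the highest similarity.
best = max(reference, key=lambda label: cosine_similarity(query, reference[label]))
print(best)  # → eczema
```

Because cosine similarity compares direction rather than magnitude, it is robust to differences in embedding scale between the image and text encoders.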
| Layer | Choice | Rationale |
|---|---|---|
| Multimodal model | kakaobrain/align-base | Shared embedding space for image + text |
| Similarity | Cosine similarity | Direction-invariant, fast at inference |
| Explainability | RAG + Qdrant | Clinically sourced descriptions, not black-box |
| Backend | FastAPI | Async, lightweight, production-ready |
| Frontend | Streamlit | Fast iteration for clinician-facing UI |
| Augmentation | Torchvision | Flips, color jitter, cropping on 4,010 images |
**What's the operational problem?**
Dermatology diagnosis is a bottleneck — time-intensive, specialist-gated, and difficult to audit. An AI-assisted pipeline reduces time-to-decision and creates a reproducible, auditable record.
**What does better look like?**
Higher precision on unseen samples, outputs a clinician can interrogate, and a system that improves as labeled data grows — without full retraining.
**Is this deployable?**
Yes. FastAPI + Qdrant (Docker) + Streamlit = a self-contained stack that runs locally or on cloud infrastructure with minimal configuration.
```
├── models/               # Trained .pth model checkpoints
├── data-info/            # Clinical disease descriptions (JSON)
├── scripts/              # Inference and utility logic
├── notebooks/            # EDA and training notebooks
├── docker-compose.yaml   # Qdrant vector DB setup
└── requirements.txt      # Dependencies
```