Bone-R — Bone Fracture Detection (multi-dataset → YOLOv8)

End-to-end pipeline to detect and localize bone fractures on X-rays. The shipped v5 detector is trained on four merged box-annotated sources — FracAtlas (multi-region), GRAZPEDWRI-DX (wrist), HUMERUS (shoulder), and proximal-femur (hip) — built up one region at a time to close blind spots (see JOURNAL.md). MURA was tested as an abnormality-classifier ensemble and rejected (a documented negative result — its "abnormal" label isn't fracture-specific).

⚠️ Research/educational tool only. Outputs are "best guess" estimates and are not a medical diagnosis. Always recommend a qualified orthopedic consult.

Results

Shipped detector: YOLOv8m (fracture_yolov8m_v5), trained on the harmonized, region-stratified merge of FracAtlas, GRAZPEDWRI-DX, HUMERUS, and proximal-femur (dataset_v5). The model improved in deliberate stages: first a bigger backbone plus harmonization, then successive data additions (wrist, shoulder, hip). Gains are reported as measured, including where they did and didn't land.

Progression (held-out test split, image-level screening):

Version	Data	mAP@0.5	Sensitivity	Specificity	Max recall
v1-baseline (v8s)	FracAtlas	0.27	—	—	—
v1 (v8m, harmonized)	FracAtlas	0.46	0.561	0.988	0.797
v3 (v8m + GRAZPEDWRI)	+ wrist	0.76	0.812	0.980	0.875
v4 (v8m + HUMERUS)	+ shoulder	0.86	0.927	0.975	0.895
v5 (v8m + proximal-femur)	+ hip	0.79	0.878	0.986	—

v5 per-source sensitivity: hip 0.674 (now evaluable on 43 test cases; v4 had only 2), humerus 0.993, wrist 0.925, FracAtlas-native 0.671. The overall dip versus v4 is expected: v5's test set is region-stratified and includes 43 hard hip cases, whereas v4's was easier.

v4 PPV 0.973 (TP 357 / FP 10 / TN 394 / FN 28 on 789 test images).
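The v4 screening figures can be re-derived from the four confusion-matrix counts above. A minimal Python check; the function name `screening_metrics` is illustrative and not part of the repo:

```python
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Image-level screening metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of fractured images flagged
        "specificity": tn / (tn + fp),  # share of normal images left unflagged
        "ppv": tp / (tp + fp),          # share of flagged images that are fractured
    }


v4 = screening_metrics(tp=357, fp=10, tn=394, fn=28)
print({name: round(value, 3) for name, value in v4.items()})
# {'sensitivity': 0.927, 'specificity': 0.975, 'ppv': 0.973}
```

The counts sum to the 789 test images, and the rounded values match the v4 row of the progression table.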

Per-source sensitivity (v4) — gains are region-specific, reported honestly: humerus/shoulder 0.986, wrist 0.961, FracAtlas-native 0.712 (the realistic in-the-wild number; the headline is lifted by the easier wrist/humerus sets).

Backbone study (separate apples-to-apples comparison on dataset_v2, all three trained + tested on identical data):

Backbone	mAP@0.5	mAP@0.5:0.95	Recall
YOLOv8m	0.460	0.198	0.429
Faster R-CNN R50-FPNv2	0.354	0.171	0.288
RetinaNet R50-FPNv2	0.240	0.072	0.188

Honest limitations (full analysis in JOURNAL.md):

Poor out-of-distribution generalization — on an unseen source (pkdarabi) external sensitivity is 0.30 vs 0.67 in-distribution (Entry 009). In-set metrics overstate real-world performance; this is an in-domain demo, not deployable. Run it yourself: python scripts/external_val.py.
Gains are region-specific: v4 detects humerus (0.986) and wrist (0.961) fractures far better than FracAtlas-native ones (0.712) — the headline sensitivity is lifted by the easier added distributions.
Hip remains a weak spot. v4 had only 2 fractured hip test cases (caught 1), because the first "hip" Roboflow set turned out to be osteoporosis; v5 adds a genuine hip-fracture set (Roboflow proximal-femur), and hip sensitivity on its 43 test cases is 0.674.
Recall could not be pushed past 0.895 (the v4 max recall); reaching ≥0.95 sensitivity needs more region-diverse boxed data.
Wrist data (GRAZPEDWRI) is pediatric — an age-distribution skew vs adult X-rays.
A MURA abnormality-classifier ensemble was tested and rejected — it traded specificity (0.99 → 0.87) for nothing, since MURA "abnormal" ≠ fracture (a studied negative result, Entry 004).

Layout

scripts/fracatlas_to_yolo.py   COCO JSON  -> YOLO .txt boxes (alongside images)
scripts/mura_to_yolo.py        MURA labels -> classification manifest OR weak boxes
scripts/make_splits.py         build dataset/{images,labels}/{train,val,test}
dataset.yaml                   YOLO dataset config (single class: fracture)
train.py                       Ultralytics YOLOv8 training
evaluate.py                    mAP@0.5, mAP@0.5:0.95 + sensitivity @ high recall
inference.py                   shared load/decode (PNG/JPG/DICOM) + heuristics
visualize.py                   cv2.rectangle overlays saved to an image
app/main.py                    FastAPI /detect endpoint (+ Google Maps consult)

Quickstart

This reproduces the shipped v5 model (FracAtlas + GRAZPEDWRI + HUMERUS + proximal-femur). Each external set is acquired via Roboflow/Kaggle (see analysis/data_sourcing_log.md); paths below assume they're downloaded.

pip install -r requirements.txt

# 1. FracAtlas COCO -> YOLO labels, then harmonize (CLAHE, repair, resize)
python scripts/fracatlas_to_yolo.py
python scripts/preprocess.py --src FracAtlas/images --out FracAtlas_proc

# 2. Ingest each external set to single-class fracture, then harmonize.
#    Class selection is per-dataset (see ingest --help): GRAZPEDWRI by index,
#    HUMERUS all-fracture, hip by name (drops landmarks/normals).
python scripts/ingest_grazpedwri.py --images GRAZ_rf --labels GRAZ_rf \
    --out GRAZ_bone_r --fracture-class 2 --copy
python scripts/preprocess.py --src GRAZ_bone_r --out GRAZ_proc
#    ...repeat for HUMERUS (--all-fracture) and hip; or just use run_v5.py.

# 3. Region-stratified merge -> dataset_v5 (hip/shoulder in every split)
python scripts/make_splits.py --src FracAtlas_proc --src GRAZ_proc \
    --src HUMERUS_proc --src HIP_proc --out dataset_v5 --copy --stratify

# 4. Train YOLOv8m
python train.py --model yolov8m.pt --data dataset_v5.yaml \
    --epochs 150 --imgsz 800 --batch 8 --name fracture_yolov8m_v5

# --- or do steps 2-4 in one command (acquires the hip set too): ---
# python scripts/run_v5.py --roboflow-key KEY --fracture-names \
#   "left-intertrochanteric,right-intertrochanteric,left-neck,right-neck,left-subtrochanteric,right-subtrochanteric"

# 5. Evaluate — mAP + screening sensitivity + per-region + per-source
python evaluate.py --weights runs/detect/fracture_yolov8m_v5/weights/best.pt \
    --data dataset_v5.yaml --data-root dataset_v5 --split test --target-recall 0.90

# 6. Visualize / serve
python visualize.py --weights runs/detect/fracture_yolov8m_v5/weights/best.pt \
    --image xray.png --out prediction.png
export FRACTURE_WEIGHTS=runs/detect/fracture_yolov8m_v5/weights/best.pt
uvicorn app.main:app --port 8000   # /detect?mode=screening for high recall
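Both `--target-recall 0.90` and `/detect?mode=screening` favor sensitivity over specificity, presumably by accepting lower-confidence boxes. How evaluate.py actually picks its operating point is not documented here; the sketch below is one plausible approach, assuming an image counts as positive when its highest box confidence meets the threshold. The function names and the toy scores are hypothetical.

```python
import math


def threshold_for_recall(pos_scores: list[float], target_recall: float) -> float:
    """Highest confidence threshold whose image-level recall >= target_recall.

    pos_scores holds, for each fractured image, its maximum box confidence
    (0.0 when the detector produced no box at all).
    """
    if not pos_scores:
        raise ValueError("need at least one fractured image")
    ranked = sorted(pos_scores, reverse=True)
    # Number of fractured images that must be caught; the small epsilon
    # guards against float noise such as 0.7 * 10 == 7.000000000000001.
    needed = max(1, math.ceil(target_recall * len(ranked) - 1e-9))
    return ranked[needed - 1]


def specificity_at(neg_scores: list[float], threshold: float) -> float:
    """Share of normal images whose max confidence stays below the threshold."""
    return sum(s < threshold for s in neg_scores) / len(neg_scores)


# Toy data, made up for illustration: 10 fractured and 5 normal images.
fractured = [0.91, 0.85, 0.77, 0.64, 0.52, 0.41, 0.33, 0.28, 0.12, 0.0]
normal = [0.05, 0.0, 0.31, 0.10, 0.0]

thr = threshold_for_recall(fractured, target_recall=0.90)
recall = sum(s >= thr for s in fractured) / len(fractured)
print(thr, recall, specificity_at(normal, thr))  # 0.12 0.9 0.8
```

Lowering the threshold to reach the target recall is what costs specificity: in the toy data, the normal image scoring 0.31 becomes a false positive.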

API example

curl -F file=@xray.png \
  "http://localhost:8000/detect?conf=0.25&lat=37.77&lng=-122.42"

Returns JSON: bounding boxes (xyxy + confidence), heuristic fracture type and severity, a consult recommendation, and (if a Maps key + coordinates are given) nearby orthopedic clinics.

Notes on the two datasets

FracAtlas is single-class and box-annotated → drives true detection. fracatlas_to_yolo.py reads Annotations/COCO JSON/COCO_fracture_masks.json, converts each COCO [x,y,w,h] box to normalized YOLO cx cy w h, writes a populated .txt next to fractured images and an empty .txt next to non-fractured images (valid negative/background samples).
MURA has no boxes — labels are study-level (*_positive / *_negative). It cannot yield true localized boxes. Use the classification manifest to pre-train/co-train a classifier head, or --mode weakbox to emit whole-image weak labels (noisier; use deliberately).
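The FracAtlas box conversion described above is plain arithmetic. A minimal sketch, assuming COCO boxes are in pixels with a top-left origin and that the single fracture class has index 0; the function names are illustrative, not the ones in fracatlas_to_yolo.py:

```python
def coco_to_yolo(box, img_w: int, img_h: int) -> tuple[float, float, float, float]:
    """Convert a COCO [x, y, w, h] pixel box (top-left origin) to
    normalized YOLO (cx, cy, w, h), each expressed as a fraction of the image."""
    x, y, w, h = box
    cx = (x + w / 2) / img_w
    cy = (y + h / 2) / img_h
    return cx, cy, w / img_w, h / img_h


def yolo_label_line(box, img_w: int, img_h: int, class_id: int = 0) -> str:
    """One line of a YOLO .txt label file: class index, then the four fractions."""
    values = coco_to_yolo(box, img_w, img_h)
    return " ".join([str(class_id)] + [f"{v:.6f}" for v in values])


print(yolo_label_line((100, 200, 50, 80), img_w=400, img_h=800))
# 0 0.312500 0.300000 0.125000 0.100000
```

A fractured image gets one such line per box; a non-fractured image gets an empty .txt, which YOLO treats as a background sample.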

Phase 1 — Dataset preprocessing (harmonization)

Before training, both datasets are harmonized through a shared preprocessing contract applied offline:

# Harmonize FracAtlas (and optionally MURA) into a consistent format
python scripts/preprocess.py --src FracAtlas/images --out FracAtlas_proc --max-size 1024
# Then build the split from the harmonized images
python scripts/make_splits.py --src FracAtlas_proc --out dataset --copy

What it does:

Repairs truncated JPEGs — recovers ~37 FracAtlas files that YOLO otherwise drops
CLAHE (contrast-limited adaptive histogram equalization) — enhances local contrast on bone edges and makes hairline fractures easier to see, while its clip limit keeps noise amplification in check
Grayscale → 3-channel RGB — consistent representation across datasets
DICOM window/level — applies radiographic rescale (slope/intercept) so different machines land on the same intensity scale
Aspect-preserving resize — scales to a max long side (default 1024) without distortion, keeping YOLO labels valid without modification

Run this after the label converters (fracatlas_to_yolo.py / mura_to_yolo.py) but before make_splits.py.
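The claim that an aspect-preserving resize keeps YOLO labels valid follows from the labels being normalized: scaling both sides by the same factor leaves every fraction unchanged. A small sketch of the size computation, assuming images already within the limit are not upscaled (whether preprocess.py upscales is not stated here); `fit_long_side` is a hypothetical helper name:

```python
def fit_long_side(width: int, height: int, max_size: int = 1024) -> tuple[int, int]:
    """New (width, height) with the long side capped at max_size.

    Assumption: smaller images are returned unchanged rather than upscaled.
    """
    scale = min(1.0, max_size / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


new_w, new_h = fit_long_side(2048, 1536)
print(new_w, new_h)  # 1024 768

# A normalized YOLO box is a fraction of the image size, so the label file
# stays as-is; only the pixel-space position of the box changes.
cx, cy = 0.5, 0.25
before_px = (cx * 2048, cy * 1536)
after_px = (cx * new_w, cy * new_h)
print(before_px, after_px)  # (1024.0, 384.0) (512.0, 192.0)
```

A resize that changed the aspect ratio would also leave the fractions numerically valid, but it would distort the anatomy the boxes describe.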

MURA co-training (robustness)

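MURA's 40k labeled films train a study-level abnormality classifier meant to act as a second opinion on the FracAtlas detector. The ensemble was evaluated and rejected (see Honest limitations, Entry 004); the commands below reproduce that experiment.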

python scripts/mura_to_yolo.py --mode classify --out MURA_cls   # build tree
python train_mura_cls.py --data MURA_cls --model yolov8s-cls.pt --epochs 20

# Ensemble: classifier P(abnormal) re-weights detector confidences
python ensemble.py \
  --detector runs/detect/fracture_yolov8s/weights/best.pt \
  --classifier runs/classify/mura_abnormality/weights/best.pt \
  --image xray.png --conf 0.15

fused_conf = det_conf * (alpha + (1-alpha)*p_abnormal) — a detection survives best when both heads agree; a confidently-abnormal film keeps borderline boxes (favoring sensitivity), and a strongly-normal film suppresses weak ones.
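The fusion rule is small enough to check by hand. A minimal sketch of the formula as written above; the default `alpha=0.5` is an assumption for illustration, not necessarily the value ensemble.py uses:

```python
def fuse_confidence(det_conf: float, p_abnormal: float, alpha: float = 0.5) -> float:
    """fused_conf = det_conf * (alpha + (1 - alpha) * p_abnormal).

    alpha = 1 ignores the classifier; alpha = 0 scales every box by p_abnormal.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return det_conf * (alpha + (1.0 - alpha) * p_abnormal)


borderline = 0.20  # a weak detector box
print(round(fuse_confidence(borderline, p_abnormal=0.95), 4))  # 0.195
print(round(fuse_confidence(borderline, p_abnormal=0.05), 4))  # 0.105
```

With the `--conf 0.15` cutoff from the command above, the same borderline box survives on a confidently abnormal film (0.195) and is dropped on a strongly normal one (0.105). Because the multiplier never exceeds 1, fusion can only lower a detector confidence, never raise it.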

CNN detector framework (swappable backbones)

models/ exposes one Detector interface over several widely used detection CNNs so you can benchmark backbones on the same YOLO-format dataset:

key	model	trait
`yolov8`	Ultralytics YOLOv8	fast anchor-free default
`fasterrcnn`	Faster R-CNN ResNet50-FPN v2	two-stage, high precision
`retinanet`	RetinaNet ResNet50-FPN v2	one-stage, focal loss
`fcos`	FCOS ResNet50-FPN	anchor-free one-stage

from models import build_detector
det = build_detector("retinanet", weights="retinanet_fracture.pt", num_classes=1)
detections = det.predict(rgb_image, conf=0.25)   # -> inference.Detection list

# Fine-tune a torchvision backbone on the YOLO-format tree
python train_torchvision.py --arch retinanet --data dataset --epochs 20 --out rn.pt

# Serve any backend through the same API
export FRACTURE_BACKEND=retinanet
export FRACTURE_WEIGHTS=rn.pt
uvicorn app.main:app --port 8000

Every backend returns the shared inference.Detection type, so visualize.py, ensemble.py, and the FastAPI app work unchanged regardless of model.
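To show why a shared return type lets downstream tools ignore the backend, here is a hedged sketch of such an interface. It is not the code in models/: the `Detection` fields, the registry, and the names `make_detector` and `DummyDetector` are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class Detection:
    """Hypothetical stand-in for inference.Detection."""
    xyxy: tuple[float, float, float, float]  # pixel corners: x1, y1, x2, y2
    conf: float


class Detector(Protocol):
    """Anything with a matching predict() satisfies the interface."""
    def predict(self, image, conf: float = 0.25) -> list[Detection]: ...


_REGISTRY: dict[str, Callable[..., Detector]] = {}


def register(key: str):
    """Class decorator that files a backend under a string key."""
    def wrap(cls):
        _REGISTRY[key] = cls
        return cls
    return wrap


def make_detector(key: str, **kwargs) -> Detector:
    """Look up a backend by key and construct it."""
    if key not in _REGISTRY:
        raise KeyError(f"unknown backend {key!r}; choose from {sorted(_REGISTRY)}")
    return _REGISTRY[key](**kwargs)


@register("dummy")
class DummyDetector:
    """Toy backend that returns canned boxes, to exercise the interface."""
    def __init__(self, boxes: list[Detection]):
        self.boxes = boxes

    def predict(self, image, conf: float = 0.25) -> list[Detection]:
        return [b for b in self.boxes if b.conf >= conf]


det = make_detector("dummy", boxes=[Detection((10, 20, 60, 90), 0.8),
                                    Detection((5, 5, 15, 15), 0.1)])
kept = det.predict(image=None, conf=0.25)
print(kept)
```

Callers only ever touch `predict` and `Detection`, so swapping the key is the one change needed to move between backends.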

Model comparison & live overlay

Compare backbones head-to-head on the same test split, and render a "which-is-which" overlay where each model owns a color:

# Numeric comparison -> console table + JSON for the landing page
python benchmark.py --data dataset.yaml \
  --model yolov8=runs/detect/fracture_yolov8s/weights/best.pt \
  --model retinanet=retinanet_fracture.pt \
  --model fasterrcnn=fasterrcnn_fracture.pt \
  --out docs/data/benchmark.json

# Visual overlay -> showcase PNG + JSON the web viewer consumes
python compare_overlay.py --image xray.png \
  --out docs/assets/compare.png --json docs/data/predictions.json \
  --model yolov8=runs/detect/fracture_yolov8s/weights/best.pt \
  --model retinanet=retinanet_fracture.pt

Landing page (GitHub Pages → lsaiko.github.io/Bone-R)

docs/ is a static, dependency-free site with an interactive overlay viewer (toggle models, confidence slider, hover-for-details) and a benchmark table. It ships with synthetic demo data so it renders immediately; benchmark.py and compare_overlay.py overwrite docs/data/*.json with real model output.

Deploy:

git init && git add . && git commit -m "Bone-R: fracture detection pipeline + site"
git remote add origin https://github.com/LSaiko/Bone-R.git
git branch -M main && git push -u origin main

Then in the repo: Settings → Pages → Source: Deploy from a branch → main / /docs. The site goes live at https://lsaiko.github.io/Bone-R/. (Datasets and weights are git-ignored; only code + the docs/ site are pushed.)

Severity / type heuristics

inference.guess_type_and_severity derives a rough fracture type (from box aspect ratio) and severity (from box area + model confidence). These are transparent rule-based hints flagged "best guess", not a learned grading model. For production, train a dedicated multi-class detection or grading head.
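As an illustration of what a transparent rule of this kind can look like, the sketch below maps box aspect ratio to a shape label and combines box area with confidence into a severity bucket. It is not the logic in inference.py: every threshold, weight, and label here is invented for the example, and the shape labels are deliberately geometric rather than clinical.

```python
def guess_type_and_severity_sketch(xyxy, conf: float, img_w: int, img_h: int) -> dict:
    """Illustrative rule-based hint. All thresholds and labels are made up."""
    x1, y1, x2, y2 = xyxy
    box_w = max(x2 - x1, 1e-6)  # guard against degenerate boxes
    box_h = max(y2 - y1, 1e-6)
    aspect = box_w / box_h
    area_frac = (box_w * box_h) / (img_w * img_h)

    # Shape label from aspect ratio (hypothetical cutoffs).
    if aspect >= 2.0:
        shape = "elongated-horizontal"
    elif aspect <= 0.5:
        shape = "elongated-vertical"
    else:
        shape = "compact"

    # Severity from area and confidence, equally weighted (hypothetical).
    # A box covering 5% or more of the image saturates the area term.
    score = 0.5 * min(area_frac / 0.05, 1.0) + 0.5 * conf
    if score >= 0.66:
        severity = "high"
    elif score >= 0.33:
        severity = "moderate"
    else:
        severity = "low"

    return {"type": shape, "severity": severity, "best_guess": True}


hint = guess_type_and_severity_sketch((100, 200, 300, 260), conf=0.8,
                                      img_w=1000, img_h=1000)
print(hint)
# {'type': 'elongated-horizontal', 'severity': 'moderate', 'best_guess': True}
```

Keeping the rule this explicit makes it easy to audit, which is the point of flagging the output as a best guess rather than a grade.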

Datasets & attribution

FracAtlas — Abedeen et al., FracAtlas: A Dataset for Fracture Classification, Localization and Segmentation of Musculoskeletal Radiographs (Scientific Data, 2023). Licensed CC BY 4.0. The sample radiograph shown on the landing page is a FracAtlas test image, used under that license.
MURA — Rajpurkar et al., MURA: Large Dataset for Abnormality Detection in Musculoskeletal Radiographs (Stanford ML Group). Used for the abnormality classifier experiment under its research data-use terms.

Datasets are not redistributed in this repo (git-ignored); only code, the docs/ site, and a single attributed sample image are committed.
