ML Doctor

A suite of utilities for autonomously diagnosing environment issues in the context of common ML training/inference workflows.

This repo contains one‑command tools that gather facts, check compatibility, and generate reviewable fix plans—so you don’t have to remember driver/CUDA/torch/ABI matrices or guess which wheel matches your stack.


Tools

1) ml_doctor.py — one‑shot environment + compatibility doctor

Run it once; it probes your OS/driver/CUDA/Python/PyTorch and common CUDA extension pitfalls, performs light online lookups when possible, optionally summarizes docs with an LLM, and outputs a fix plan.

python ml_doctor.py

What it captures

  • OS, kernel, GLIBC; toolchain (gcc/clang/cmake/ninja/make)
  • NVIDIA driver, GPUs, and nvidia-smi XML dump
  • CUDA toolkits on disk, nvcc --version, ldconfig visibility for libcuda, libcudart, libcudnn*
  • Python/venv/conda, pip freeze
  • PyTorch quick facts + python -m torch.utils.collect_env
  • Torch extensions cache, and a targeted FlashAttention ABI check (ldd -r on the .so if present)
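
In spirit, the fact‑gathering step reduces to running read‑only probe commands and recording their raw output; the sketch below illustrates that pattern (the exact probe list and JSON layout are assumptions, not ml_doctor.py’s actual internals):

# Minimal fact-gathering sketch (assumed structure; the real probe set is larger).
import json
import platform
import shutil
import subprocess

def probe(cmd):
    """Run one read-only probe; record absence instead of crashing if the tool is missing."""
    if shutil.which(cmd[0]) is None:
        return {"cmd": cmd, "error": "not found"}
    out = subprocess.run(cmd, capture_output=True, text=True)
    return {"cmd": cmd, "rc": out.returncode, "stdout": out.stdout, "stderr": out.stderr}

facts = {
    "os": platform.platform(),
    "nvidia_smi_xml": probe(["nvidia-smi", "-q", "-x"]),  # driver/GPU XML dump
    "nvcc": probe(["nvcc", "--version"]),
    "ldconfig": probe(["ldconfig", "-p"]),  # scan for libcuda/libcudart/libcudnn*
    "pip_freeze": probe(["pip", "freeze"]),
}

with open("facts.json", "w") as f:
    json.dump(facts, f, indent=2)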

Artifacts (timestamped folder)

  • facts.json, report.json, fix_plan.sh, console_summary.txt (+flashattn_check.json if applicable)

The design intentionally avoids flags and follows a “just run it” flow; online lookups and LLM summarization are automatic but only engage when connectivity and API keys are present.
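
A hedged sketch of that gating (the exact checks are an assumption): online features turn on only when a cheap connectivity probe succeeds, and LLM summarization additionally requires an API key.

# Sketch of automatic feature gating (assumed logic, not the script's exact checks).
import os
import socket

def is_online(host="pypi.org", port=443, timeout=2):
    """Cheap connectivity probe: try to open one TCP connection."""
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:
        return False

USE_ONLINE_LOOKUPS = is_online()
USE_LLM = USE_ONLINE_LOOKUPS and bool(os.environ.get("OPENAI_API_KEY"))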


2) ml_fa_wheel.py — prebuilt FlashAttention wheel finder/installer

Finds the best‑matching prebuilt wheel for your current environment (Python/OS/Torch/CUDA/CXX11 ABI) from the official FlashAttention releases (with an optional community fallback), and can install it for you with pip install --no-deps.

# Dry run (detect + recommend)
python ml_fa_wheel.py

# Choose an FA version
python ml_fa_wheel.py --fa 2.8.3

# Install the recommended wheel into *this* env
python ml_fa_wheel.py --install

Environment variables

  • ML_FA_VERSION=2.8.3 — choose a specific release (default: latest)
  • ML_FA_INSTALL=1 — perform the installation after selection (default: dry‑run)
  • ML_FA_OFFICIAL_ONLY=1 — skip community fallback
  • GITHUB_TOKEN=... — optional, to raise API rate limits
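
Reading that configuration amounts to a few environment lookups with the documented defaults; a minimal sketch:

# Env-var configuration surface, with the defaults documented above.
import os

FA_VERSION = os.environ.get("ML_FA_VERSION", "latest")        # specific release, or latest
DO_INSTALL = os.environ.get("ML_FA_INSTALL") == "1"           # default: dry-run only
OFFICIAL_ONLY = os.environ.get("ML_FA_OFFICIAL_ONLY") == "1"  # skip the community fallback
GITHUB_TOKEN = os.environ.get("GITHUB_TOKEN")                 # optional: higher API rate limits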

Artifacts (timestamped folder)

  • wheel_report.json (detection, candidates, recommendation)
  • install.sh (reviewable install command) or INSTALL_FALLBACK.txt (if no exact match)
  • console_summary.txt

The matching logic looks for torch{MAJOR.MINOR}, CUDA markers like cu12/cu128, Python tags like cp311-cp311, platform linux_x86_64, and ABI markers cxx11abiTRUE/FALSE in wheel filenames—mirroring how official wheels are named upstream.
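
To make that concrete, here is a hedged sketch of pulling those markers out of a wheel filename with regular expressions (the markers are the ones listed above; the parsing code and example filename shape are assumptions, not the tool’s code):

# Sketch: extract compatibility tags from a FlashAttention wheel filename.
import re

def tag(pattern, name):
    m = re.search(pattern, name)
    return m.group(1) if m else None

def parse_wheel_tags(name):
    return {
        "torch": tag(r"torch(\d+\.\d+)", name),          # e.g. "2.4"
        "cuda": tag(r"\+(cu\d+)", name),                 # e.g. "cu12" / "cu128"
        "python": tag(r"(cp\d+-cp\d+)", name),           # e.g. "cp311-cp311"
        "cxx11abi": tag(r"cxx11abi(TRUE|FALSE)", name),  # must match torch's ABI
        "linux_x86_64": "linux_x86_64" in name,
    }

# Example filename shape (assumed from the markers above):
print(parse_wheel_tags(
    "flash_attn-2.8.3+cu12torch2.4cxx11abiFALSE-cp311-cp311-linux_x86_64.whl"
))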

Why it exists
Even when CUDA and Python match, the Torch minor version and the CXX11 ABI flag must also match the wheel build. This tool saves you the trial‑and‑error by pointing you to either the correct wheel or a safe source‑build path when prebuilt wheels aren’t published for your combination (common on bleeding‑edge Torch and nightlies).
It follows the same “one command, zero required flags” philosophy as ml_doctor.py.
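
The local half of that comparison is cheap to read from the running interpreter; a minimal sketch using public PyTorch APIs (the grouping into a dict is my own):

# Sketch: the compatibility tuple a candidate wheel has to match.
import platform
import sys

import torch

local = {
    "torch": ".".join(torch.__version__.split(".")[:2]),  # minor must match, e.g. "2.4"
    "cuda": torch.version.cuda,                           # e.g. "12.8"; None on CPU-only builds
    "python": f"cp{sys.version_info.major}{sys.version_info.minor}",  # e.g. "cp311"
    "cxx11abi": torch.compiled_with_cxx11_abi(),          # compare with cxx11abiTRUE/FALSE
    "platform": f"{platform.system().lower()}_{platform.machine()}",  # e.g. "linux_x86_64"
}
print(local)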


Quick start

# 1) Diagnose your ML environment (read‑only)
python ml_doctor.py

# 2) If training complains "No module named flash_attn", find a matching wheel
python ml_fa_wheel.py
# Optionally install:
python ml_fa_wheel.py --install

If ml_fa_wheel.py can’t find an exact wheel (e.g., you’re on a nightly or a very new Torch minor), it shows the nearest candidates and writes INSTALL_FALLBACK.txt with safe, copy‑pasteable source‑build steps (including a Blackwell example with TORCH_CUDA_ARCH_LIST="12.0").
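
When you do land in that fallback path, the steps amount to a source build with the right arch list; a rough sketch of what that looks like (the version pin and MAX_JOBS value are illustrative; prefer the generated INSTALL_FALLBACK.txt over this):

# Rough source-build fallback sketch (illustrative values; follow INSTALL_FALLBACK.txt).
import os
import subprocess
import sys

# Blackwell example arch list; MAX_JOBS caps parallel compile jobs (assumed value).
env = dict(os.environ, TORCH_CUDA_ARCH_LIST="12.0", MAX_JOBS="4")
subprocess.run(
    # --no-build-isolation lets the build see the already-installed torch
    [sys.executable, "-m", "pip", "install", "--no-build-isolation", "flash-attn==2.8.3"],
    env=env,
    check=True,
)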


Design principles

  • Fact‑first: gather the exact runtime facts before suggesting changes.
  • Deterministic & auditable: JSON reports + reproducible fix scripts instead of opaque magic.
  • Zero‑risk defaults: read‑only by default; explicit opt‑in for installs.
  • Tolerant to dev stacks: nightlies and multiple CUDA toolkits are handled with clear warnings and guidance.
  • LLM optional: if you provide OPENAI_API_KEY, ml_doctor.py can summarize noisy vendor docs to structured hints; otherwise it stays offline and deterministic.
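
As an illustration of that optional path, a sketch using the openai client (the model name, prompt, and input file are assumptions, not ml_doctor.py’s actual calls):

# Optional doc-summarization sketch (model, prompt, and input file are assumptions).
import os
from pathlib import Path

if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI

    client = OpenAI()  # picks up OPENAI_API_KEY from the environment
    notes = Path("vendor_notes.txt").read_text()  # hypothetical captured doc text
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        messages=[{
            "role": "user",
            "content": "Summarize these driver/CUDA notes into structured JSON hints:\n" + notes,
        }],
    )
    print(resp.choices[0].message.content)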

These tools follow the long‑running, JSON‑strict workflow patterns we’ve successfully used in our other pipelines (e.g., multi‑pass token curation).


Troubleshooting & tips

  • Multiple CUDA toolkits: ml_doctor.py flags them; be consistent about which you build against.
  • LD_LIBRARY_PATH: helpful, but can mask bundled libs—temporarily unset it when debugging imports.
  • Blackwell arch: use TORCH_CUDA_ARCH_LIST="12.0" when compiling CUDA extensions on Blackwell GPUs.
  • Pip cache: after switching Torch/CUDA lines, purge the pip cache and ~/.cache/torch_extensions before rebuilding (see the sketch after this list).
  • Nightlies: prefer source builds for CUDA extensions, or pin to a stable torch minor with known wheels.
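
For the pip‑cache tip, the cleanup is two commands; a minimal sketch (default cache locations assumed):

# Clear stale build caches after switching Torch/CUDA lines (default paths assumed).
import shutil
import subprocess
import sys
from pathlib import Path

subprocess.run([sys.executable, "-m", "pip", "cache", "purge"], check=False)
shutil.rmtree(Path.home() / ".cache" / "torch_extensions", ignore_errors=True)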

Contributing

PRs welcome. Please keep new tools:

  • one‑command by default,
  • read‑only by default, and
  • emitting JSON + a reviewable plan.

License

MIT (see LICENSE).


Appendix: Example probe output

The doctor captures a robust snapshot for post‑mortems (OS, driver, CUDA 12.8, PyTorch nightly + cu128, etc.), which is the baseline for recommendations and fix plans.
