A suite of utilities for autonomously diagnosing environment issues in the context of common ML training/inference workflows.
This repo contains one‑command tools that gather facts, check compatibility, and generate reviewable fix plans—so you don’t have to remember driver/CUDA/torch/ABI matrices or guess which wheel matches your stack.
Run `ml_doctor.py` once; it probes your OS/driver/CUDA/Python/PyTorch and common CUDA extension pitfalls, performs light online lookups when possible, optionally summarizes docs with an LLM, and outputs a fix plan.
```bash
python ml_doctor.py
```

### What it captures
- OS, kernel, GLIBC; toolchain (gcc/clang/cmake/ninja/make)
- NVIDIA driver, GPUs, and `nvidia-smi` XML dump
- CUDA toolkits on disk, `nvcc --version`, `ldconfig` visibility for `libcuda`, `libcudart`, `libcudnn*`
- Python/venv/conda, `pip freeze`
- PyTorch quick facts + `python -m torch.utils.collect_env`
- Torch extensions cache, and a targeted FlashAttention ABI check (`ldd -r` on the `.so` if present; see the sketch below)
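As an illustration of that last check, here is a minimal sketch assuming a flash-attn `.so` lives under site-packages; it is not `ml_doctor.py`'s actual code, and the glob pattern is an assumption:

```python
import subprocess
import sysconfig
from pathlib import Path

# Illustrative ABI probe (not ml_doctor.py's exact logic): run `ldd -r` on any
# flash-attn shared objects found under site-packages and count unresolved
# symbols, which usually indicate a Torch / CXX11-ABI mismatch.
site_packages = Path(sysconfig.get_paths()["purelib"])
for so in site_packages.glob("**/flash_attn*.so"):  # glob pattern is an assumption
    out = subprocess.run(["ldd", "-r", str(so)], capture_output=True, text=True)
    combined = out.stdout + out.stderr
    undefined = [line for line in combined.splitlines() if "undefined symbol" in line]
    print(so.name, "->", f"{len(undefined)} undefined symbols" if undefined else "OK")
```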
### Artifacts (timestamped folder)
`facts.json`, `report.json`, `fix_plan.sh`, `console_summary.txt` (+ `flashattn_check.json` if applicable)
The design intentionally avoids flags and follows a “just run it” flow; online lookups and LLM are automatic but only engage if connectivity and keys are present.
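As a rough sketch of that gating (the helper name and connectivity probe are assumptions, not the tool's actual implementation):

```python
import os
import socket

def has_connectivity(host="pypi.org", port=443, timeout=2.0):
    """Hypothetical connectivity probe; host/port/timeout are illustrative."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Online lookups and LLM summarization engage only when both are available.
use_online = has_connectivity()
use_llm = use_online and bool(os.environ.get("OPENAI_API_KEY"))
```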
`ml_fa_wheel.py` finds the best‑matching prebuilt wheel for your current environment (Python/OS/Torch/CUDA/CXX11 ABI) from the official FlashAttention releases (and an optional community fallback), and optionally installs it with `pip --no-deps`.
```bash
# Dry run (detect + recommend)
python ml_fa_wheel.py

# Choose an FA version
python ml_fa_wheel.py --fa 2.8.3

# Install the recommended wheel into *this* env
python ml_fa_wheel.py --install
```

### Environment variables
- `ML_FA_VERSION=2.8.3` — choose a specific release (default: `latest`)
- `ML_FA_INSTALL=1` — perform the installation after selection (default: dry‑run)
- `ML_FA_OFFICIAL_ONLY=1` — skip community fallback
- `GITHUB_TOKEN=...` — optional, to raise API rate limits
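For illustration, a minimal sketch of how these knobs could be read inside the script (the parsing details are assumptions, not the tool's actual code):

```python
import os

# Hypothetical sketch of reading the environment knobs listed above.
fa_version = os.environ.get("ML_FA_VERSION", "latest")
do_install = os.environ.get("ML_FA_INSTALL", "0") == "1"          # default: dry-run
official_only = os.environ.get("ML_FA_OFFICIAL_ONLY", "0") == "1"
github_token = os.environ.get("GITHUB_TOKEN")                     # optional: raises API rate limits
```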
### Artifacts (timestamped folder)
- `wheel_report.json` (detection, candidates, recommendation)
- `install.sh` (reviewable install command) or `INSTALL_FALLBACK.txt` (if no exact match)
- `console_summary.txt`
The matching logic looks for `torch{MAJOR.MINOR}`, CUDA markers like `cu12`/`cu128`, Python tags like `cp311-cp311`, the platform tag `linux_x86_64`, and ABI markers `cxx11abiTRUE`/`FALSE` in wheel filenames—mirroring how official wheels are named upstream.
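For example, a minimal sketch of extracting those markers from a wheel filename; the regex and the sample filename are illustrative, not the tool's actual matcher:

```python
import re

# Hypothetical parser for the filename markers described above.
WHEEL_RE = re.compile(
    r"flash_attn-(?P<fa>[\d.]+)\+cu(?P<cuda>\d+)"
    r"torch(?P<torch>\d+\.\d+)"
    r"cxx11abi(?P<abi>TRUE|FALSE)"
    r"-(?P<py_tag>cp\d+-cp\d+)"
    r"-(?P<platform>\w+)\.whl"
)

def parse_wheel_name(name):
    m = WHEEL_RE.match(name)
    return m.groupdict() if m else None

print(parse_wheel_name(
    "flash_attn-2.8.3+cu12torch2.4cxx11abiFALSE-cp311-cp311-linux_x86_64.whl"
))
# -> {'fa': '2.8.3', 'cuda': '12', 'torch': '2.4', 'abi': 'FALSE',
#     'py_tag': 'cp311-cp311', 'platform': 'linux_x86_64'}
```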
### Why it exists
Even when CUDA and Python match, Torch minor and CXX11 ABI must match the wheel build. This tool saves you from trial‑and‑error and points you to either the correct wheel or a safe source‑build path when prebuilt wheels aren’t published for your combo (common on bleeding‑edge Torch/nightlies).
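To see which local facts have to line up, here is a small illustrative snippet (not the tool's detection code) that collects them with standard PyTorch APIs:

```python
import sys
import torch

# Facts that must match a prebuilt wheel: Torch minor, CUDA line, Python tag,
# and CXX11 ABI (illustrative; ml_fa_wheel.py's own detection may differ).
torch_minor = ".".join(torch.__version__.split("+")[0].split(".")[:2])  # e.g. "2.4"
cuda_line = torch.version.cuda or "cpu"                                  # e.g. "12.8"
py_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"           # e.g. "cp311"
cxx11_abi = torch.compiled_with_cxx11_abi()                              # True / False
print(torch_minor, cuda_line, py_tag, cxx11_abi)
```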
It follows the same “one command, zero required flags” philosophy as `ml_doctor.py`.
```bash
# 1) Diagnose your ML environment (read‑only)
python ml_doctor.py

# 2) If training complains "No module named flash_attn", find a matching wheel
python ml_fa_wheel.py

# Optionally install:
python ml_fa_wheel.py --install
```

If `ml_fa_wheel.py` can’t find an exact wheel (e.g., you’re on a nightly or a very new Torch minor), it will show the nearest candidates and write `INSTALL_FALLBACK.txt` with safe, copy‑pasteable source‑build steps (including an example for Blackwell, `TORCH_CUDA_ARCH_LIST="12.0"`).
- Fact‑first: gather the exact runtime facts before suggesting changes.
- Deterministic & auditable: JSON reports + reproducible fix scripts instead of opaque magic.
- Zero‑risk defaults: read‑only by default; explicit opt‑in for installs.
- Tolerant to dev stacks: nightlies and multiple CUDA toolkits are handled with clear warnings and guidance.
- LLM optional: if you provide `OPENAI_API_KEY`, `ml_doctor.py` can summarize noisy vendor docs into structured hints; otherwise it stays offline and deterministic.
These tools follow the long‑running, JSON‑strict workflow patterns we’ve successfully used in our other pipelines (e.g., multi‑pass token curation).
- Multiple CUDA toolkits: `ml_doctor.py` flags them; be consistent about which you build against (see the sketch after this list).
- `LD_LIBRARY_PATH`: helpful, but can mask bundled libs—temporarily unset it when debugging imports.
- Blackwell arch: use `TORCH_CUDA_ARCH_LIST="12.0"` when compiling CUDA extensions on Blackwell GPUs.
- Pip cache: after switching Torch/CUDA lines, purge the `pip` cache and `~/.cache/torch_extensions` before rebuilding.
- Nightlies: prefer source builds for CUDA extensions, or pin to a stable Torch minor with known wheels.
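A minimal sketch of the kind of multi-toolkit check referenced in the first item above (the paths and output are assumptions, not `ml_doctor.py`'s actual code):

```python
import glob
import os

# Illustrative check for multiple CUDA toolkits on disk and which one the
# /usr/local/cuda symlink currently points at (paths are assumptions).
toolkits = sorted(glob.glob("/usr/local/cuda-*"))
selected = os.path.realpath("/usr/local/cuda") if os.path.exists("/usr/local/cuda") else None
if len(toolkits) > 1:
    print(f"Multiple CUDA toolkits found: {toolkits}; /usr/local/cuda -> {selected}")
```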
PRs welcome. Please keep new tools:
- one‑command by default,
- read‑only by default, and
- emitting JSON + a reviewable plan.
MIT (see LICENSE).
The doctor captures a robust snapshot for post‑mortems (OS, driver, CUDA 12.8, PyTorch nightly + cu128, etc.), which is the baseline for recommendations and fix plans.