ML Doctor

A suite of utilities for autonomously diagnosing environment issues in the context of common ML training/inference workflows.

This repo contains one‑command tools that gather facts, check compatibility, and generate reviewable fix plans—so you don’t have to remember driver/CUDA/torch/ABI matrices or guess which wheel matches your stack.


Tools

1) ml_doctor.py — one‑shot environment + compatibility doctor

Run it once; it probes your OS/driver/CUDA/Python/PyTorch and common CUDA extension pitfalls, performs light online lookups when possible, optionally summarizes docs with an LLM, and outputs a fix plan.

python ml_doctor.py

What it captures

  • OS, kernel, GLIBC; toolchain (gcc/clang/cmake/ninja/make)
  • NVIDIA driver, GPUs, and nvidia-smi XML dump
  • CUDA toolkits on disk, nvcc --version, ldconfig visibility for libcuda, libcudart, libcudnn*
  • Python/venv/conda, pip freeze
  • PyTorch quick facts + python -m torch.utils.collect_env
  • Torch extensions cache, and a targeted FlashAttention ABI check (ldd -r on the .so if present)
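
In spirit, the fact‑gathering step reduces to running read‑only probe commands and recording their raw output; the sketch below illustrates that pattern (the exact probe list and JSON layout are assumptions, not ml_doctor.py’s actual internals):

# Minimal fact-gathering sketch (assumed structure; the real probe set is larger).
import json
import platform
import shutil
import subprocess

def probe(cmd):
    """Run one read-only probe; record absence instead of crashing if the tool is missing."""
    if shutil.which(cmd[0]) is None:
        return {"cmd": cmd, "error": "not found"}
    out = subprocess.run(cmd, capture_output=True, text=True)
    return {"cmd": cmd, "rc": out.returncode, "stdout": out.stdout, "stderr": out.stderr}

facts = {
    "os": platform.platform(),
    "nvidia_smi_xml": probe(["nvidia-smi", "-q", "-x"]),  # driver/GPU XML dump
    "nvcc": probe(["nvcc", "--version"]),
    "ldconfig": probe(["ldconfig", "-p"]),  # scan for libcuda/libcudart/libcudnn*
    "pip_freeze": probe(["pip", "freeze"]),
}

with open("facts.json", "w") as f:
    json.dump(facts, f, indent=2)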

Artifacts (timestamped folder)

  • facts.json, report.json, fix_plan.sh, console_summary.txt (+flashattn_check.json if applicable)

The design intentionally avoids flags and follows a “just run it” flow; online lookups and LLM summarization are automatic but only engage when connectivity and API keys are present.
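
A hedged sketch of that gating (the exact checks are an assumption): online features turn on only when a cheap connectivity probe succeeds, and LLM summarization additionally requires an API key.

# Sketch of automatic feature gating (assumed logic, not the script's exact checks).
import os
import socket

def is_online(host="pypi.org", port=443, timeout=2):
    """Cheap connectivity probe: try to open one TCP connection."""
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:
        return False

USE_ONLINE_LOOKUPS = is_online()
USE_LLM = USE_ONLINE_LOOKUPS and bool(os.environ.get("OPENAI_API_KEY"))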


2) ml_fa_wheel.py — prebuilt FlashAttention wheel finder/installer

Finds the best‑matching prebuilt wheel for your current environment (Python/OS/Torch/CUDA/CXX11 ABI) from the official FlashAttention releases (with an optional community fallback), and can install it for you with pip install --no-deps.

# Dry run (detect + recommend)
python ml_fa_wheel.py

# Choose an FA version
python ml_fa_wheel.py --fa 2.8.3

# Install the recommended wheel into *this* env
python ml_fa_wheel.py --install

Environment variables

  • ML_FA_VERSION=2.8.3 — choose a specific release (default: latest)
  • ML_FA_INSTALL=1 — perform the installation after selection (default: dry‑run)
  • ML_FA_OFFICIAL_ONLY=1 — skip community fallback
  • GITHUB_TOKEN=... — optional, to raise API rate limits
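
Reading that configuration amounts to a few environment lookups with the documented defaults; a minimal sketch:

# Env-var configuration surface, with the defaults documented above.
import os

FA_VERSION = os.environ.get("ML_FA_VERSION", "latest")        # specific release, or latest
DO_INSTALL = os.environ.get("ML_FA_INSTALL") == "1"           # default: dry-run only
OFFICIAL_ONLY = os.environ.get("ML_FA_OFFICIAL_ONLY") == "1"  # skip the community fallback
GITHUB_TOKEN = os.environ.get("GITHUB_TOKEN")                 # optional: higher API rate limits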

Artifacts (timestamped folder)

  • wheel_report.json (detection, candidates, recommendation)
  • install.sh (reviewable install command) or INSTALL_FALLBACK.txt (if no exact match)
  • console_summary.txt

The matching logic looks for torch{MAJOR.MINOR}, CUDA markers like cu12/cu128, Python tags like cp311-cp311, platform linux_x86_64, and ABI markers cxx11abiTRUE/FALSE in wheel filenames—mirroring how official wheels are named upstream.
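
To make that concrete, here is a hedged sketch of pulling those markers out of a wheel filename with regular expressions (the markers are the ones listed above; the parsing code and example filename shape are assumptions, not the tool’s code):

# Sketch: extract compatibility tags from a FlashAttention wheel filename.
import re

def tag(pattern, name):
    m = re.search(pattern, name)
    return m.group(1) if m else None

def parse_wheel_tags(name):
    return {
        "torch": tag(r"torch(\d+\.\d+)", name),          # e.g. "2.4"
        "cuda": tag(r"\+(cu\d+)", name),                 # e.g. "cu12" / "cu128"
        "python": tag(r"(cp\d+-cp\d+)", name),           # e.g. "cp311-cp311"
        "cxx11abi": tag(r"cxx11abi(TRUE|FALSE)", name),  # must match torch's ABI
        "linux_x86_64": "linux_x86_64" in name,
    }

# Example filename shape (assumed from the markers above):
print(parse_wheel_tags(
    "flash_attn-2.8.3+cu12torch2.4cxx11abiFALSE-cp311-cp311-linux_x86_64.whl"
))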

Why it exists
Even when CUDA and Python match, the Torch minor version and the CXX11 ABI flag must also match the wheel build. This tool saves you the trial‑and‑error by pointing you to either the correct wheel or a safe source‑build path when prebuilt wheels aren’t published for your combination (common on bleeding‑edge Torch and nightlies).
It follows the same “one command, zero required flags” philosophy as ml_doctor.py.
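
The local half of that comparison is cheap to read from the running interpreter; a minimal sketch using public PyTorch APIs (the grouping into a dict is my own):

# Sketch: the compatibility tuple a candidate wheel has to match.
import platform
import sys

import torch

local = {
    "torch": ".".join(torch.__version__.split(".")[:2]),  # minor must match, e.g. "2.4"
    "cuda": torch.version.cuda,                           # e.g. "12.8"; None on CPU-only builds
    "python": f"cp{sys.version_info.major}{sys.version_info.minor}",  # e.g. "cp311"
    "cxx11abi": torch.compiled_with_cxx11_abi(),          # compare with cxx11abiTRUE/FALSE
    "platform": f"{platform.system().lower()}_{platform.machine()}",  # e.g. "linux_x86_64"
}
print(local)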


Quick start

# 1) Diagnose your ML environment (read‑only)
python ml_doctor.py

# 2) If training complains "No module named flash_attn", find a matching wheel
python ml_fa_wheel.py
# Optionally install:
python ml_fa_wheel.py --install

If ml_fa_wheel.py can’t find an exact wheel (e.g., you’re on a nightly or a very new Torch minor), it shows the nearest candidates and writes INSTALL_FALLBACK.txt with safe, copy‑pasteable source‑build steps (including a Blackwell example with TORCH_CUDA_ARCH_LIST="12.0").
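
When you do land in that fallback path, the steps amount to a source build with the right arch list; a rough sketch of what that looks like (the version pin and MAX_JOBS value are illustrative; prefer the generated INSTALL_FALLBACK.txt over this):

# Rough source-build fallback sketch (illustrative values; follow INSTALL_FALLBACK.txt).
import os
import subprocess
import sys

# Blackwell example arch list; MAX_JOBS caps parallel compile jobs (assumed value).
env = dict(os.environ, TORCH_CUDA_ARCH_LIST="12.0", MAX_JOBS="4")
subprocess.run(
    # --no-build-isolation lets the build see the already-installed torch
    [sys.executable, "-m", "pip", "install", "--no-build-isolation", "flash-attn==2.8.3"],
    env=env,
    check=True,
)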


Design principles

  • Fact‑first: gather the exact runtime facts before suggesting changes.
  • Deterministic & auditable: JSON reports + reproducible fix scripts instead of opaque magic.
  • Zero‑risk defaults: read‑only by default; explicit opt‑in for installs.
  • Tolerant to dev stacks: nightlies and multiple CUDA toolkits are handled with clear warnings and guidance.
  • LLM optional: if you provide OPENAI_API_KEY, ml_doctor.py can summarize noisy vendor docs to structured hints; otherwise it stays offline and deterministic.
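
As an illustration of that optional path, a sketch using the openai client (the model name, prompt, and input file are assumptions, not ml_doctor.py’s actual calls):

# Optional doc-summarization sketch (model, prompt, and input file are assumptions).
import os
from pathlib import Path

if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI

    client = OpenAI()  # picks up OPENAI_API_KEY from the environment
    notes = Path("vendor_notes.txt").read_text()  # hypothetical captured doc text
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        messages=[{
            "role": "user",
            "content": "Summarize these driver/CUDA notes into structured JSON hints:\n" + notes,
        }],
    )
    print(resp.choices[0].message.content)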

These tools follow the long‑running, JSON‑strict workflow patterns we’ve successfully used in our other pipelines (e.g., multi‑pass token curation).


Troubleshooting & tips

  • Multiple CUDA toolkits: ml_doctor.py flags them; be consistent about which you build against.
  • LD_LIBRARY_PATH: helpful, but can mask bundled libs—temporarily unset it when debugging imports.
  • Blackwell arch: use TORCH_CUDA_ARCH_LIST="12.0" when compiling CUDA extensions on Blackwell GPUs.
  • Pip cache: after switching Torch/CUDA lines, purge the pip cache and ~/.cache/torch_extensions before rebuilding (see the sketch after this list).
  • Nightlies: prefer source builds for CUDA extensions, or pin to a stable torch minor with known wheels.
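
For the pip‑cache tip, the cleanup is two commands; a minimal sketch (default cache locations assumed):

# Clear stale build caches after switching Torch/CUDA lines (default paths assumed).
import shutil
import subprocess
import sys
from pathlib import Path

subprocess.run([sys.executable, "-m", "pip", "cache", "purge"], check=False)
shutil.rmtree(Path.home() / ".cache" / "torch_extensions", ignore_errors=True)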

Contributing

PRs welcome. Please keep new tools:

  • one‑command by default,
  • read‑only by default, and
  • emitting JSON + a reviewable plan.

License

MIT (see LICENSE).


Appendix: Example probe output

The doctor captures a robust snapshot for post‑mortems (OS, driver, CUDA 12.8, PyTorch nightly + cu128, etc.), which is the baseline for recommendations and fix plans.
