GitHub - DaoyuanLi2816/can-i-finetune-this: Estimate whether a Hugging Face model fits and fine-tunes on your local GPU.

Estimate, benchmark, and generate fine-tuning recipes for LLMs on consumer GPUs.

You have one consumer-grade NVIDIA GPU. You want to fine-tune an open-weight LLM with LoRA or QLoRA, but you do not want to download 14 GB of weights just to discover that your 12 GB / 16 GB / 24 GB card OOMs on step 1.

canifinetune answers, before you spend the disk and the time:

Can I fine-tune this model?
About how much VRAM will it use?
What batch size / sequence length / LoRA rank / quantization should I use?
If I can't, how should I downsize?
Is there local benchmark evidence for that answer?
Can I get a ready-to-run Hugging Face + PEFT + TRL training script for that config?

It is a single Python package with a CLI:

canifinetune doctor
canifinetune estimate --model Qwen/Qwen2.5-1.5B-Instruct --method qlora --gpu-vram-gb 16 --seq-len 2048 --micro-batch-size 1 --lora-rank 16
canifinetune recommend --model Qwen/Qwen2.5-1.5B-Instruct --gpu-vram-gb 16
canifinetune bench    --model sshleifer/tiny-gpt2 --method lora --steps 3
canifinetune calibrate --benchmarks benchmarks/results
canifinetune recipe   --model Qwen/Qwen2.5-1.5B-Instruct --method qlora --output recipes/qwen2.5-1.5b-qlora-4080
canifinetune report   --benchmarks benchmarks/results --out report.md
canifinetune compare  --benchmarks benchmarks/results --out compare.md

What canifinetune estimate actually prints:

+-------- Qwen/Qwen2.5-1.5B-Instruct  (qlora) --------+
| feasible: YES    ratio = 0.20    confidence = medium |
+------------------------------------------------------+
       Memory breakdown (GB)
+---------------------------------+
| Component             |   Value |
|-----------------------+---------|
| static model          |   0.737 |
| quantization overhead |   0.018 |
| trainable params      |  4.4 MB |
| gradients             |   0.008 |
| optimizer states      |   0.010 |
| activations           |   0.328 |
| CUDA / fragmentation  |   1.280 |
| safety margin         |   0.800 |
| total                 |   3.163 |
+---------------------------------+

Static estimate says 3.16 GB; on a real RTX 4080 the same config measures 7.10 GB (heavy bitsandbytes unpacking buffers at seq_len=2048). canifinetune bench and canifinetune calibrate close that gap on your machine — that is the point of the project.

Install

canifinetune runs in two layers:

Layer	Install	What you get
Core (estimate / recommend / recipe / report)	`pip install canifinetune`	All CLI commands. No PyTorch required.
Training (bench / real fine-tuning)	`pip install canifinetune[train]`	Adds `torch`, `transformers`, `peft`, `bitsandbytes`, `trl`, `datasets`.
Reporting extras	`pip install canifinetune[report]`	Pandas/tabulate for prettier tables.
Development	`pip install canifinetune[dev]`	pytest, ruff, mypy.

If you use uv:

uv venv
uv pip install -e ".[dev,report]"
# Add training deps when you want to run benchmarks:
uv pip install -e ".[dev,train,report]"

PyTorch should generally be installed with the CUDA wheel that matches your driver, e.g.

uv pip install torch --index-url https://download.pytorch.org/whl/cu121

See docs/troubleshooting.md for Windows / WSL / bitsandbytes specifics.

Quickstart

# 1. See what your machine looks like
canifinetune doctor

# 2. Ask if a model fits on your card
canifinetune estimate \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --method qlora \
  --gpu-vram-gb 16 \
  --seq-len 2048 \
  --micro-batch-size 1 \
  --lora-rank 16

# 3. Have it search for a feasible config
canifinetune recommend --model Qwen/Qwen2.5-1.5B-Instruct --gpu-vram-gb 16

# 4. Run a tiny real benchmark (downloads sshleifer/tiny-gpt2, ~5 MB)
canifinetune bench --model sshleifer/tiny-gpt2 --method lora --steps 3

# 5. Generate a ready-to-run training recipe
canifinetune recipe \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --method qlora \
  --seq-len 2048 \
  --output recipes/qwen2.5-1.5b-qlora-4080

What's different from `accelerate estimate-memory`?

accelerate estimate-memory tells you how much memory loading a model takes. That is not enough to know whether you can train it.

This project tries to answer the harder question. It models:

Model weights, in fp32 / fp16 / bf16 / int8 / NF4 + double-quant
LoRA / QLoRA trainable parameter count for typical target_modules
Gradients only for trainable parameters
AdamW vs 8-bit / paged AdamW optimizer states
Activations as a function of seq_len, batch_size, hidden_size, num_layers, with and without gradient checkpointing
A fragmentation / CUDA / buffer safety margin
A feasibility decision against your actual GPU
Concrete degradation suggestions when not feasible

Estimates are always marked with an assumptions block and a confidence level, because activation memory in particular is hard to predict statically. Run canifinetune bench and canifinetune calibrate to ground them in real measurements on your machine.

RTX 4080 baselines

docs/rtx4080_baselines.md contains real measurements collected on a single RTX 4080 (16 GB). These are not synthetic. If a configuration was not run, the table says "not run", not a guessed number.

Highlights (more in the doc):

model	method	seq_len	measured peak	tok/sec
`Qwen/Qwen2.5-0.5B-Instruct`	qlora	1024	3.30 GB	1995
`Qwen/Qwen2.5-1.5B-Instruct`	qlora	1024	4.36 GB	1352
`Qwen/Qwen2.5-1.5B-Instruct`	qlora	2048	7.10 GB	1470
`Qwen/Qwen2.5-3B-Instruct`	qlora	1024	5.54 GB	1158
`sshleifer/tiny-gpt2` (smoke)	lora	128	0.12 GB	1735

Repository layout

src/canifinetune/        # package code (estimator, bench, recipes, reports, cli)
benchmarks/              # configs/, results/ (JSON), calibration/
docs/                    # design, memory model, troubleshooting
examples/                # end-to-end recipe folders
tests/                   # pytest tests (CPU-only, no large downloads)
scripts/                 # helper scripts for collecting baselines
.github/workflows/       # CI (ruff + pytest on CPU)

Roadmap

The current scope is "single consumer GPU, single node, LoRA / QLoRA, causal LM, Hugging Face stack". Possible directions, none committed:

DeepSpeed ZeRO and FSDP estimation for multi-GPU setups
Heuristics for sequence-classification / encoder-decoder training
Throughput modeling (tokens / sec), not just feasibility
Auto-tuning of gradient_accumulation_steps for a target effective batch size
A web UI on top of the CLI

Contributions welcome.

Star History

License

MIT. See LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Install

Quickstart

What's different from `accelerate estimate-memory`?

RTX 4080 baselines

Repository layout

Roadmap

Star History

License

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
.github/workflows		.github/workflows
benchmarks		benchmarks
docs		docs
examples		examples
scripts		scripts
src/canifinetune		src/canifinetune
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

Install

Quickstart

What's different from accelerate estimate-memory?

RTX 4080 baselines

Repository layout

Roadmap

Star History

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

What's different from `accelerate estimate-memory`?

Packages