
Use lazy tensor #1427


Draft: wants to merge 51 commits into main

Conversation

justinchuby commented Apr 27, 2025

  • Use LazyTensor to avoid having to store intermediate weights to disk.
  • Load the model in its native dtype with torch_dtype="auto" to avoid cast overhead when loading the model.
  • End result: 6.7 GB of peak memory usage (a 75% reduction) without needing temporary disk space, and a 28% reduction in export time.
  • Memory needed for a model is (size of the weights + max(size of an individual weight) * 3).
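The idea above can be illustrated with a minimal sketch (the class and function names here are illustrative, not the PR's actual API): a lazy tensor holds a loader callback instead of the weight bytes, so each weight is materialized only at serialization time and released before the next one, bounding the extra peak memory by the largest single weight rather than the whole model.

```python
from dataclasses import dataclass
from typing import Callable, Dict
import numpy as np

@dataclass
class LazyTensor:
    """Defers materialization: stores a loader, not the weight data."""
    shape: tuple
    dtype: np.dtype
    load: Callable[[], np.ndarray]  # invoked only when the bytes are needed

    def numpy(self) -> np.ndarray:
        return self.load()

def serialize(tensors: Dict[str, LazyTensor], write: Callable[[str, bytes], None]) -> None:
    # Materialize one tensor at a time; peak extra memory stays near
    # max(size of an individual weight), not the sum of all weights.
    for name, t in tensors.items():
        arr = t.numpy()             # materialize this weight only
        write(name, arr.tobytes())  # hand the bytes to the writer
        del arr                     # release before loading the next one
```

A writer that appends to a file (or an in-memory buffer) can be passed as `write`; nothing is ever duplicated for the full model at once.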


With a bigger model (google/gemma-3-12b-it) (model size 22.7 GB):

Time: 0:14:38 -> 0:01:57 (87% reduction). Memory: 190.5 GB -> 35.3 GB (81% reduction).

(left: current main, right: with IR and lazy tensor)

[screenshot: current main]

[screenshot: with IR and lazy tensor]

Comparison with the gguf conversion tool:

