
Conversation

@wenxindongwork wenxindongwork (Collaborator) commented Mar 17, 2025

This PR integrates MaxEngine with Kithara's MaxTextModel to improve inference performance. Previously, model inference (i.e. MaxTextModel.generate()) was supported via a naive autoregressive for-loop without a KV Cache, which resulted in low tokens-per-second throughput for longer sequences. Now, inference for MaxTextModel is backed by a KV Cache system.
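To illustrate why a KV Cache speeds up autoregressive decoding, here is a toy sketch (not Kithara's or MaxText's actual code; the `hidden` and `next_token` functions are made-up stand-ins for per-token state computation and attention-based decoding). The naive loop recomputes every token's state at every step, while the cached version computes each token's state exactly once:

```python
import numpy as np

# Toy stand-in for a per-token hidden state (in a real model, the K/V
# projections that the KV Cache would store).
def hidden(tok):
    return np.array([float(tok), tok * 0.5])

# Toy decode rule: attend over all states so far (mean), pick a token.
def next_token(states):
    return int(np.sum(np.mean(states, axis=0))) % 10

def generate_naive(prompt, steps):
    """No KV Cache: recompute every hidden state from scratch each step."""
    toks = list(prompt)
    for _ in range(steps):
        states = np.stack([hidden(t) for t in toks])  # O(n) work per step
        toks.append(next_token(states))
    return toks

def generate_cached(prompt, steps):
    """With a KV Cache: each token's state is computed exactly once."""
    toks = list(prompt)
    cache = [hidden(t) for t in toks]  # prefill: states for the prompt
    for _ in range(steps):
        tok = next_token(np.stack(cache))
        toks.append(tok)
        cache.append(hidden(tok))      # only the new token's state is computed
    return toks
```

Both variants produce identical outputs; the cached version does O(n) total state computations instead of O(n²), which is where the tokens-per-second improvement for long sequences comes from.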

Fast inference is required for supporting online and offline evaluation, as well as on-policy RLHF.

Note that Kithara's KerasHubModel.generate() is already backed by a KV Cache.

Key Changes

  • Added JetStream as a submodule
  • Patched MaxText's MaxEngine to accept a parameterized model
  • Redesigned the model generation API to support a wider range of input formats:
    • String inputs (single or batched)
    • Token inputs as integer lists or numpy arrays
  • Added a configurable max_prefill_predict_length parameter that determines the maximum prefill length
  • Updated documentation
  • Added support for batched inference with progress tracking
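As a sketch of the redesigned generation API's input handling, the normalization below shows one way the listed input formats (single/batched strings, token lists, numpy arrays) could be funneled into a common batched form. This is illustrative only; the function name and behavior are assumptions, not Kithara's actual implementation:

```python
import numpy as np

def normalize_inputs(inputs):
    """Normalize generate() inputs into a batch.

    Strings are returned as a list of strings (left to the tokenizer);
    token inputs are returned as a list of per-example token lists.
    Hypothetical helper, not Kithara's real API.
    """
    if isinstance(inputs, str):
        return [inputs]                              # single string -> batch of one
    if isinstance(inputs, np.ndarray):
        return [row.tolist() for row in np.atleast_2d(inputs)]
    if isinstance(inputs, list):
        if all(isinstance(x, str) for x in inputs):
            return inputs                            # already a batch of strings
        if all(isinstance(x, int) for x in inputs):
            return [inputs]                          # single token list -> batch of one
        return inputs                                # assume a list of token lists
    raise TypeError(f"Unsupported input type: {type(inputs)}")
```

For example, `normalize_inputs("hello")`, `normalize_inputs([1, 2, 3])`, and `normalize_inputs(np.array([1, 2, 3]))` all come out as a batch of one example.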

Testing

  • Added unit tests verifying compatibility with MaxText models

@lienchen0526 lienchen0526 (Collaborator) left a comment


I left some Python readability comments here. We can discuss which ones would be good practice for both of us.

@wenxindongwork (Collaborator, Author)

Thank you, Jerry, for the thorough code review! And thanks for catching the typo :)

@wenxindongwork wenxindongwork merged commit 36686dc into main Mar 20, 2025
1 check passed