[BFCL] Adds support for parallel inference and batching #498

Open · wants to merge 3 commits into main
Conversation

@TikZSZ TikZSZ commented Jul 2, 2024

Parallel Inference Support for berkeley-function-call-leaderboard

This PR adds support for running berkeley-function-call-leaderboard inference in parallel, reducing running time by 4x or more depending on --batch-size.

Changes

Modifies berkeley-function-call-leaderboard/model_handler/handler.py

  • Made the write function async using aiofiles
  • Added a sort_results function that sorts the results by idx after each individual test category finishes
  • sort_results returns the sorted indices, which supports the resume functionality (a sketch follows this list)
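
A minimal sketch of these two pieces (not the PR's exact code; the flat JSONL layout and the "idx" field name are assumptions):

```python
import json
import aiofiles

async def write(result, file_path):
    # Append one result as a JSON line without blocking the event loop.
    async with aiofiles.open(file_path, mode="a") as f:
        await f.write(json.dumps(result) + "\n")

def sort_results(file_path):
    # After a test category finishes, re-order the JSONL output by the
    # original test index and return the indices already on disk so a
    # later run can resume from them.
    with open(file_path) as f:
        entries = [json.loads(line) for line in f if line.strip()]
    entries.sort(key=lambda entry: entry["idx"])
    with open(file_path, "w") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")
    return [entry["idx"] for entry in entries]
```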

Modifies berkeley-function-call-leaderboard/openfunctions_evaluation.py

  • Added a --batch-size argument (defaults to 1) that controls the number of parallel requests (see the sketch after this list)
  • Refactored the processing and result-writing logic into a fetch_and_process function
  • Added a make_async function to wrap sync functions as async (used for handler.inference)
  • Added a nested progress bar for tracking iterations
  • Refactored the core processing logic under the main function
  • Implemented proper resume support, replacing num_existing_lines
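
A rough sketch of how --batch-size bounds concurrency (not the PR's actual fetch_and_process): test cases are dispatched in slices of batch_size, and each slice is awaited with asyncio.gather before the next one starts. The async_inference argument stands in for the wrapped handler call:

```python
import asyncio

async def run_batched(async_inference, test_cases, batch_size=1):
    # Process the test cases in slices of batch_size; each slice runs
    # concurrently, and the next slice starts only after it completes.
    results = []
    for start in range(0, len(test_cases), batch_size):
        batch = test_cases[start:start + batch_size]
        results.extend(
            await asyncio.gather(*(async_inference(case) for case in batch))
        )
    return results
```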

Resume Support

Improved resuming functionality in async code:

  • Addresses cases where some test cases complete earlier than others, which could make a line-count-based resume inconsistent
  • Filters out already-saved test cases instead of relying on a simple line count (a sketch follows this list)
  • Inserts None as a placeholder for already-saved test cases, which the processing loop uses as the condition to skip them
  • This approach ensures consistent resuming even if execution is interrupted mid-test
  • This matters for models that are expensive to run, where re-running the entire test suite is undesirable
  • A test log screenshot is attached at the bottom of this PR to confirm it works as intended
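
An illustrative sketch of that filtering step (not the PR's code; the result-file layout and the "idx" key are assumed):

```python
import json

def load_completed_indices(result_file):
    # Collect the idx of every test case already written to disk.
    try:
        with open(result_file) as f:
            return {json.loads(line)["idx"] for line in f if line.strip()}
    except FileNotFoundError:
        return set()

def mark_resumable(test_cases, completed):
    # Replace already-saved test cases with None; the processing loop skips
    # None entries, so resuming stays consistent even after a mid-run interrupt.
    return [None if case["idx"] in completed else case for case in test_cases]
```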

Note: To minimize code changes, this PR wraps the existing inference calls as async. The calls themselves are still synchronous and would block the event loop, so loop.run_in_executor is used to run them in parallel on a thread pool of min(32, os.cpu_count() + 4) threads by default. If handlers are made natively async in the future, they will continue to work like normal async code. A sketch of the wrapper is shown below.
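
A minimal sketch of that wrapper, assuming the handler exposes a synchronous inference method; passing None to run_in_executor uses asyncio's default ThreadPoolExecutor, which is where the min(32, os.cpu_count() + 4) worker count comes from:

```python
import asyncio
import functools

def make_async(sync_fn):
    # Wrap a blocking function so awaiting it runs the call on the event
    # loop's default thread pool instead of blocking the loop itself.
    async def wrapper(*args, **kwargs):
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(
            None, functools.partial(sync_fn, *args, **kwargs)
        )
    return wrapper

# Usage (illustrative): async_inference = make_async(handler.inference)
```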

Testing

Tested on a custom OpenAI-compatible model served with vLLM:

  • Completed the simple test in 40 seconds
  • Hardware: RTX 4090
  • Model: Llama 8B, BF16
  • Batch size: 15-20

Benchmark Results

[Screenshot: Benchmark]

Debug Logs for the new Resume System

[Screenshot: Debug-resume]

@TikZSZ changed the title from "Adds support for parallel inference and batching" to "[BFCL] Adds support for parallel inference and batching" on Jul 2, 2024
@ShishirPatil
Owner

Thanks for contributing to the Berkeley Function Calling Leaderboard, @TikZSZ! Appreciate the PR and welcome! We are currently reviewing and testing this PR.

@TikZSZ
Author

TikZSZ commented Jul 14, 2024

@ShishirPatil I've added proper resume support and updated the PR description with more details about the changes. Could you please take a look when you have a chance? Thank you!
