cli runner for benchmarks #200
Conversation
Add configurable dataset support with kunji and sneha options, update cost estimation for GPT models, and make CSV column names configurable per dataset.
On gpt-4o-mini and Sneha goldens, mean latency is 10.731s.
- Rename DataclassConfig to DatasetConfig for clarity
- Update API key comment with setup instructions
- Remove unused HARYANA_ASSISTANT_ID constant

- Introduced a new responses API route for handling OpenAI responses.
- Updated main API router to include the new responses route.
- Enhanced diagnostics handling in threads and CLI benchmark commands for better token tracking.
- Added instruction files for the responses dataset to guide the AI assistant's behavior.

- Introduced a new GitHub Actions workflow for benchmarking the RAG system.
- Configured to run benchmarks on multiple services and datasets.
- Includes steps for starting the server, executing benchmarks, uploading results, and cleaning up resources.
- Outputs benchmark results in a structured format for easy review.

- Added environment variables for API keys and credentials in the GitHub Actions benchmark workflow.
- Updated the CLI benchmark command to use LOCAL_CREDENTIALS_API_KEY from environment variables instead of a hardcoded value.
- Included a step to create local credentials before running benchmarks.
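As a rough illustration of the configurable dataset support described in the commits above, a registry keyed by dataset name could look like the sketch below. The field names follow the AssistantDatasetConfig shown later in this diff; the decorator, assistant IDs, filenames, and column names are placeholders, not values from the PR.

```python
from dataclasses import dataclass


@dataclass
class AssistantDatasetConfig:
    """Per-dataset settings for the benchmark CLI (fields as in the diff below)."""
    assistant_id: str
    filename: str      # CSV of golden queries for this dataset
    query_column: str  # CSV column holding the query text; configurable per dataset


# Hypothetical registry; the real IDs, filenames, and column names are placeholders.
DATASETS = {
    "kunji": AssistantDatasetConfig(assistant_id="asst_kunji_placeholder",
                                    filename="kunji_goldens.csv",
                                    query_column="question"),
    "sneha": AssistantDatasetConfig(assistant_id="asst_sneha_placeholder",
                                    filename="sneha_goldens.csv",
                                    query_column="query"),
}
```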
Here is the benchmark on ~100-item datasets:
https://github.com/agency-fund/ai-platform/actions/runs/15397785818
The Responses API reduced mean duration by 5.58s (37%) for the Kunji dataset and 4.70s (44%) for the Sneha dataset. The Responses API also exposes the chunks used to augment the response (the Assistants API didn't), so we can extend this code to run model evaluation (e.g. cosine similarity, RAGAS, etc.).
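To make the evaluation idea concrete, here is a minimal sketch of scoring retrieved chunks against the generated answer with cosine similarity. It assumes the OpenAI embeddings endpoint and the `text-embedding-3-small` model, and that `answer` and `chunks` come from the benchmark output; none of this is code from the PR.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def chunk_similarities(answer: str, chunks: list[str]) -> list[float]:
    """Embed the generated answer and each retrieved chunk, then score each chunk."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=[answer, *chunks])
    vectors = [item.embedding for item in resp.data]
    answer_vec, chunk_vecs = vectors[0], vectors[1:]
    return [cosine_similarity(answer_vec, v) for v in chunk_vecs]
```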
```python
class ResponsesAPIRequest(BaseModel):
    project_id: int
    model: str
    instructions: str
    vector_store_ids: list[str]
    max_num_results: Optional[int] = 20
    temperature: Optional[float] = 0.1
    response_id: Optional[str] = None
    question: str


class Diagnostics(BaseModel):
    input_tokens: int
    output_tokens: int
    total_tokens: int
    model: str
```
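One way the "cost estimation for GPT models" mentioned in the commits could consume these Diagnostics fields is sketched below; the per-million-token prices are placeholders, not values from this PR.

```python
# Placeholder prices per 1M tokens; substitute current published pricing.
PRICES_PER_1M = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}


def estimate_cost(diagnostics: "Diagnostics") -> float:
    """Rough cost estimate for one request, based on the token counts above."""
    prices = PRICES_PER_1M[diagnostics.model]
    return (
        diagnostics.input_tokens * prices["input"]
        + diagnostics.output_tokens * prices["output"]
    ) / 1_000_000
```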
move models to backend/app/models folder
these are for parsing requests, not for backing database models. We can introduce a /schema/ folder for these
sure
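To make the suggestion concrete, a sketch of the proposed split with request-parsing schemas in their own package; the file paths and package name are assumptions, not part of this diff.

```python
# backend/app/schemas/responses.py  (hypothetical path for the proposed /schema/ split)
# Pydantic request/response schemas live here; database models stay in backend/app/models.
from typing import Optional

from pydantic import BaseModel


class ResponsesAPIRequest(BaseModel):
    project_id: int
    model: str
    instructions: str
    vector_store_ids: list[str]
    max_num_results: Optional[int] = 20
    temperature: Optional[float] = 0.1
    response_id: Optional[str] = None
    question: str


# The route module would then import from the schemas package, e.g.:
# from app.schemas.responses import ResponsesAPIRequest
```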
```python
class _APIResponse(BaseModel):
    status: str
    response_id: str
    message: str
    chunks: list[FileResultChunk]
    diagnostics: Optional[Diagnostics] = None
```
we already have an APIResponse model in backend/app/utils.py; see if we can use the same one
```python
@router.post("/responses/sync", response_model=ResponsesAPIResponse)
async def responses_sync(
```
do you mean synchronous or asynchronous in the description? I'm assuming this runs asynchronously, so we could speed up completion by running requests in parallel
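On the parallelism point, a sketch of how the benchmark CLI could issue its requests concurrently; the endpoint path prefix, payload fields, and use of httpx are assumptions, not code from this PR.

```python
import asyncio
import time

import httpx


async def run_benchmark(questions: list[str], base_url: str = "http://localhost:8000") -> list[float]:
    """Send one /responses/sync request per question concurrently; return per-call durations."""
    async with httpx.AsyncClient(base_url=base_url, timeout=60.0) as client:

        async def one_call(question: str) -> float:
            start = time.perf_counter()
            # Payload fields are illustrative; the real body is defined by ResponsesAPIRequest.
            await client.post("/api/v1/responses/sync", json={"question": question})
            return time.perf_counter() - start

        return await asyncio.gather(*(one_call(q) for q in questions))
```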
```python
class AssistantDatasetConfig:
    assistant_id: str
    filename: str
    query_column: str
```
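As a usage sketch (not code from this PR), the configurable query_column above could be consumed like this when loading the golden queries:

```python
import csv


def load_queries(config: AssistantDatasetConfig, limit: int = 100) -> list[str]:
    """Read up to `limit` query strings from the dataset CSV, using its configured column name."""
    with open(config.filename, newline="") as f:
        reader = csv.DictReader(f)
        return [row[config.query_column] for _, row in zip(range(limit), reader)]
```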
should we move these models to backend/app/models?
```yaml
- name: Create local credentials
  run: |
    curl -X POST "http://localhost:8000/api/v1/credentials/" \
      -H "Content-Type: application/json" \
      -H "X-API-KEY: ${{ env.LOCAL_CREDENTIALS_API_KEY }}" \
      -d '{
```
do we also need to run the seeder before adding credentials for organization_id 1?
the seeder runs in prestart during docker compose up
ok cool
```yaml
- name: Upload benchmark results
  uses: actions/upload-artifact@v4
  with:
    name: bench-${{ matrix.service }}-${{ matrix.dataset }}-${{ matrix.count }}.csv
    path: bench-${{ matrix.service }}-${{ matrix.dataset }}-${{ matrix.count }}.csv
```
where are the benchmark results uploaded to from here?
in the GitHub Actions UI, see link:
thanks
avirajsingh7
left a comment
lgtm

Summary
Target issue is #198
In moving off the soon-to-be-deprecated OpenAI Assistants API to the Responses API, we would like to quantify the performance improvement.
The following video is a walk-through of how to run the benchmark.
assistants-benchmark.mp4
Here are the results: a mean duration of 14.289s for calls to the /threads/sync endpoint, which calls the OpenAI Assistants API. Note that running these benchmarks burns credits, so I did not run more than 100 test queries out of the >3k queries in the test dataset. Running the full 3,000-query benchmark would take hours and cost over $140 in credits. Durations in the 100-item run varied from 12.27s to 24.23s, so I don't think it's worth running the full benchmark. If that sounds good @dlobo, @AkhileshNegi, we can move on to adding a set of endpoints for the Responses API and re-running the benchmark to measure latency on the same 100-item set.
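For reference, the latency figures quoted above can be recomputed from a bench-*.csv artifact along these lines; the `duration_s` column name is an assumption about the CSV layout.

```python
import csv
import statistics


def summarize(csv_path: str, duration_column: str = "duration_s") -> dict[str, float]:
    """Compute the latency summary quoted above (mean/min/max) from a bench-*.csv file."""
    with open(csv_path, newline="") as f:
        durations = [float(row[duration_column]) for row in csv.DictReader(f)]
    return {
        "mean": statistics.mean(durations),  # e.g. 14.289s for /threads/sync on 100 items
        "min": min(durations),               # e.g. 12.27s
        "max": max(durations),               # e.g. 24.23s
    }
```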