Conversation

@EdmundKorley (Collaborator)

Summary

Target issue is #198

Explain the motivation for making this change. What existing problem does the pull request solve?

In moving off the soon-to-be-deprecated OpenAI Assistants API to the Responses API, we would like to be able to quantify the performance improvement.

The following video is a walkthrough of how to run the benchmark.

assistants-benchmark.mp4

Here are the results; the headline number is a mean duration of 14.289s for calls to the /threads/sync endpoint, which calls the OpenAI Assistants API.

uv run ai-cli bench assistants --count 100 --workers 1

Total queries: 100
Total deduped queries: 97
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [23:05<00:00, 14.29s/it]
Results saved to bench_results_20250529143857.csv

Mean duration: 14.289s over 97 runs.
Total duration: 1386.001s
Total tokens used: prompt=1941583, completion=23220
Estimated cost for 97 runs on gpt-4o: $4.115366

Note that running these benchmarks burns credits, so I did not run more than 100 test queries out of the >3k queries in the test dataset. Running the full 3,000-query benchmark would take hours and cost over $140 in credits. Durations in the 100-item run varied from 12.27s to 24.23s, so I don't think it's worth running the full benchmark. If that sounds good @dlobo, @AkhileshNegi, we can move on to adding a set of endpoints for the Responses API and re-running the benchmark to measure latency on the same 100-item set.
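
For reference, the cost figure above can be reproduced with simple arithmetic; a minimal sketch, where the per-million-token rates are assumptions inferred from the reported totals rather than values taken from the repo:

```python
# Hedged sketch: the rates below are inferred from the reported numbers
# (1,941,583 prompt tokens, 23,220 completion tokens -> $4.115366),
# not read from the repo's pricing table.
PRICING_PER_1M_TOKENS = {"gpt-4o": {"input": 2.00, "output": 10.00}}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    rates = PRICING_PER_1M_TOKENS[model]
    return (prompt_tokens * rates["input"] + completion_tokens * rates["output"]) / 1_000_000

print(f"${estimate_cost('gpt-4o', 1_941_583, 23_220):.6f}")  # -> $4.115366
```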

@EdmundKorley (Collaborator, Author)

Add configurable dataset support with kunji and sneha options, update
cost estimation for GPT models, and make CSV column names configurable
per dataset.
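
A minimal sketch of what this per-dataset configuration might look like; the field names follow the AssistantDatasetConfig excerpt reviewed further down, but the registry values here are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class DatasetConfig:
    assistant_id: str  # assistant to benchmark against
    filename: str      # CSV file of golden queries
    query_column: str  # CSV column holding the query text (configurable per dataset)

# Hypothetical registry; the real IDs, filenames, and column names live in the repo.
DATASETS = {
    "kunji": DatasetConfig(assistant_id="asst_...", filename="kunji.csv", query_column="question"),
    "sneha": DatasetConfig(assistant_id="asst_...", filename="sneha.csv", query_column="query"),
}
```
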
@EdmundKorley (Collaborator, Author) commented May 29, 2025

On gpt-4o-mini and the Sneha goldens, mean latency is 10.731s.

CleanShot 2025-05-29 at 16 12 07@2x

bench_results_20250529161204.csv

- Rename DataclassConfig to DatasetConfig for clarity
- Update API key comment with setup instructions
- Remove unused HARYANA_ASSISTANT_ID constant
- Introduced a new responses API route for handling OpenAI responses.
- Updated main API router to include the new responses route.
- Enhanced diagnostics handling in threads and CLI benchmark commands for better token tracking.
- Added instruction files for the responses dataset to guide the AI assistant's behavior.
- Introduced a new GitHub Actions workflow for benchmarking the RAG system.
- Configured to run benchmarks on multiple services and datasets.
- Includes steps for starting the server, executing benchmarks, uploading results, and cleaning up resources.
- Outputs benchmark results in a structured format for easy review.
@EdmundKorley changed the title from "cli runner for kunji benchmark" to "cli runner for benchmarks" on Jun 2, 2025
codecov bot commented Jun 2, 2025

Codecov Report

Attention: Patch coverage is 19.90291% with 165 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| backend/app/cli/bench/commands.py | 0.00% | 137 Missing ⚠️ |
| backend/app/api/routes/responses.py | 68.96% | 18 Missing ⚠️ |
| backend/app/cli/main.py | 0.00% | 8 Missing ⚠️ |
| backend/app/api/routes/threads.py | 0.00% | 2 Missing ⚠️ |


@EdmundKorley self-assigned this on Jun 2, 2025
@EdmundKorley marked this pull request as ready for review on Jun 2, 2025, 16:41
@EdmundKorley (Collaborator, Author) commented Jun 2, 2025

Here are the benchmarks on the ~100-item datasets:

| benchmark | runs | mean duration (s) | est. cost (USD) |
|---|---|---|---|
| assistants kunji | 97 | 14.929 | 5.043410 |
| responses kunji | 97 | 9.349 | 5.474907 |
| assistants sneha | 82 | 10.794 | 0.168092 |
| responses sneha | 82 | 6.099 | 0.221158 |

https://github.com/agency-fund/ai-platform/actions/runs/15397785818

The Responses API reduced mean duration by 5.58s (37%) for the Kunji dataset and 4.70s (44%) for the Sneha dataset.

The Responses API (unlike the Assistants API) exposes the chunks used to augment the response, so we can extend this code to run model evaluations (e.g. cosine similarity, RAGAS).
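
A minimal sketch of what such a chunk-based evaluation could look like; the embed helper, embedding model choice, and chunk handling are assumptions, not code from this PR:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> np.ndarray:
    # Any embedding model works; text-embedding-3-small is just one choice.
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_chunk_alignment(answer: str, chunk_texts: list[str]) -> float:
    # How closely does the generated answer track its best retrieved chunk?
    answer_vec = embed(answer)
    return max(cosine_similarity(answer_vec, embed(c)) for c in chunk_texts)
```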

Comment on lines +24 to +42
class ResponsesAPIRequest(BaseModel):
project_id: int

model: str
instructions: str
vector_store_ids: list[str]
max_num_results: Optional[int] = 20
temperature: Optional[float] = 0.1
response_id: Optional[str] = None

question: str


class Diagnostics(BaseModel):
input_tokens: int
output_tokens: int
total_tokens: int

model: str
Collaborator

move models to backend/app/models folder

Collaborator (Author)

these are for parsing requests, not for backing database models. can introduce a /schema/ folder for these

Collaborator

sure

Comment on lines +50 to +57
class _APIResponse(BaseModel):
status: str

response_id: str
message: str
chunks: list[FileResultChunk]

diagnostics: Optional[Diagnostics] = None
Collaborator

we already have an APIResponse model in backend/app/utils.py; see if we can use the same one



@router.post("/responses/sync", response_model=ResponsesAPIResponse)
async def responses_sync(
Collaborator

do you mean synchronous or asynchronous in the description? I'm assuming this runs asynchronously to speed up completion by running in parallel
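
For context on the naming: "sync" here plausibly means the HTTP request blocks until the model finishes, rather than the handler avoiding async. A minimal sketch of such an endpoint, reusing the ResponsesAPIRequest, Diagnostics, and ResponsesAPIResponse models referenced above; the field wiring and response construction are assumptions, not this PR's implementation:

```python
from fastapi import APIRouter
from openai import OpenAI

# ResponsesAPIRequest, Diagnostics, and ResponsesAPIResponse as defined
# earlier in this file.
router = APIRouter()
client = OpenAI()

@router.post("/responses/sync", response_model=ResponsesAPIResponse)
async def responses_sync(request: ResponsesAPIRequest) -> ResponsesAPIResponse:
    # "sync" = the HTTP request blocks until the model responds; a blocking
    # client is used for simplicity (AsyncOpenAI would avoid stalling the
    # event loop).
    response = client.responses.create(
        model=request.model,
        instructions=request.instructions,
        input=request.question,
        temperature=request.temperature,
        previous_response_id=request.response_id,  # assumed mapping
        tools=[{
            "type": "file_search",
            "vector_store_ids": request.vector_store_ids,
            "max_num_results": request.max_num_results,
        }],
    )
    diagnostics = Diagnostics(
        input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens,
        total_tokens=response.usage.total_tokens,
        model=response.model,
    )
    # Response construction is assumed; the PR's actual shape may differ.
    return ResponsesAPIResponse(
        status="success",
        response_id=response.id,
        message=response.output_text,
        chunks=[],  # file_search chunks would be extracted from response.output
        diagnostics=diagnostics,
    )
```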

Comment on lines +32 to +36
class AssistantDatasetConfig:
assistant_id: str
filename: str
query_column: str

Collaborator

should we move models to backend/app/models?

Comment on lines +47 to +52
- name: Create local credentials
run: |
curl -X POST "http://localhost:8000/api/v1/credentials/" \
-H "Content-Type: application/json" \
-H "X-API-KEY: ${{ env.LOCAL_CREDENTIALS_API_KEY }}" \
-d '{
Collaborator

do we also need to run the seeder before adding credentials for organization_id 1?

Collaborator (Author)

the seeder is run in prestart during docker compose up

Collaborator

ok cool

Comment on lines +86 to +90
- name: Upload benchmark results
uses: actions/upload-artifact@v4
with:
name: bench-${{ matrix.service }}-${{ matrix.dataset }}-${{ matrix.count }}.csv
path: bench-${{ matrix.service }}-${{ matrix.dataset }}-${{ matrix.count }}.csv
Collaborator

where are we uploading the results of the benchmark from here?

Collaborator (Author)

in the GitHub Actions UI, see link:

#200 (comment)

Collaborator

thanks

@avirajsingh7 (Collaborator) left a comment

lgtm

@AkhileshNegi merged commit 2e1adcb into ProjectTech4DevAI:main on Jun 5, 2025
1 of 2 checks passed
@AkhileshNegi linked an issue on Jun 9, 2025 that may be closed by this pull request: Build benchmark of OpenAI assistants API latency