cli runner for benchmarks #200
Conversation
Add configurable dataset support with kunji and sneha options, update cost estimation for GPT models, and make CSV column names configurable per dataset.
On gpt-4o-mini and Sneha goldens, mean latency is 10.731s.
- Rename DataclassConfig to DatasetConfig for clarity
- Update API key comment with setup instructions
- Remove unused HARYANA_ASSISTANT_ID constant

- Introduced a new responses API route for handling OpenAI responses.
- Updated main API router to include the new responses route.
- Enhanced diagnostics handling in threads and CLI benchmark commands for better token tracking.
- Added instruction files for the responses dataset to guide the AI assistant's behavior.

- Introduced a new GitHub Actions workflow for benchmarking the RAG system.
- Configured to run benchmarks on multiple services and datasets.
- Includes steps for starting the server, executing benchmarks, uploading results, and cleaning up resources.
- Outputs benchmark results in a structured format for easy review.

- Added environment variables for API keys and credentials in the GitHub Actions benchmark workflow.
- Updated the CLI benchmark command to use LOCAL_CREDENTIALS_API_KEY from environment variables instead of a hardcoded value.
- Included a step to create local credentials before running benchmarks.
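As a rough illustration of the configurable dataset support described in the commits above, a registry keyed by dataset name could look like the sketch below. The field names follow the AssistantDatasetConfig shown later in this diff; the decorator, assistant IDs, filenames, and column names are placeholders, not values from the PR.

```python
from dataclasses import dataclass


@dataclass
class AssistantDatasetConfig:
    """Per-dataset settings for the benchmark CLI (fields as in the diff below)."""
    assistant_id: str
    filename: str      # CSV of golden queries for this dataset
    query_column: str  # CSV column holding the query text; configurable per dataset


# Hypothetical registry; the real IDs, filenames, and column names are placeholders.
DATASETS = {
    "kunji": AssistantDatasetConfig(assistant_id="asst_kunji_placeholder",
                                    filename="kunji_goldens.csv",
                                    query_column="question"),
    "sneha": AssistantDatasetConfig(assistant_id="asst_sneha_placeholder",
                                    filename="sneha_goldens.csv",
                                    query_column="query"),
}
```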
Here is the benchmark on ~100-item datasets:
https://github.com/agency-fund/ai-platform/actions/runs/15397785818
The Responses API reduced mean duration by 5.58s (37%) for the Kunji dataset and 4.70s (44%) for the Sneha dataset. The Responses API also exposes the chunks used to augment the response (the Assistants API didn't), so we can extend this code to run model evaluation (e.g. cosine similarity, RAGAS, etc.).
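To make the evaluation idea concrete, here is a minimal sketch of scoring retrieved chunks against the generated answer with cosine similarity. It assumes the OpenAI embeddings endpoint and the `text-embedding-3-small` model, and that `answer` and `chunks` come from the benchmark output; none of this is code from the PR.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def chunk_similarities(answer: str, chunks: list[str]) -> list[float]:
    """Embed the generated answer and each retrieved chunk, then score each chunk."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=[answer, *chunks])
    vectors = [item.embedding for item in resp.data]
    answer_vec, chunk_vecs = vectors[0], vectors[1:]
    return [cosine_similarity(answer_vec, v) for v in chunk_vecs]
```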
```python
class ResponsesAPIRequest(BaseModel):
    project_id: int
    model: str
    instructions: str
    vector_store_ids: list[str]
    max_num_results: Optional[int] = 20
    temperature: Optional[float] = 0.1
    response_id: Optional[str] = None
    question: str


class Diagnostics(BaseModel):
    input_tokens: int
    output_tokens: int
    total_tokens: int
    model: str
```
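One way the "cost estimation for GPT models" mentioned in the commits could consume these Diagnostics fields is sketched below; the per-million-token prices are placeholders, not values from this PR.

```python
# Placeholder prices per 1M tokens; substitute current published pricing.
PRICES_PER_1M = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}


def estimate_cost(diagnostics: "Diagnostics") -> float:
    """Rough cost estimate for one request, based on the token counts above."""
    prices = PRICES_PER_1M[diagnostics.model]
    return (
        diagnostics.input_tokens * prices["input"]
        + diagnostics.output_tokens * prices["output"]
    ) / 1_000_000
```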
move models to backend/app/models folder
these are for parsing requests, not for backing database models. We can introduce a /schema/ folder for these
sure
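To make the suggestion concrete, a sketch of the proposed split with request-parsing schemas in their own package; the file paths and package name are assumptions, not part of this diff.

```python
# backend/app/schemas/responses.py  (hypothetical path for the proposed /schema/ split)
# Pydantic request/response schemas live here; database models stay in backend/app/models.
from typing import Optional

from pydantic import BaseModel


class ResponsesAPIRequest(BaseModel):
    project_id: int
    model: str
    instructions: str
    vector_store_ids: list[str]
    max_num_results: Optional[int] = 20
    temperature: Optional[float] = 0.1
    response_id: Optional[str] = None
    question: str


# The route module would then import from the schemas package, e.g.:
# from app.schemas.responses import ResponsesAPIRequest
```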
```python
class _APIResponse(BaseModel):
    status: str
    response_id: str
    message: str
    chunks: list[FileResultChunk]
    diagnostics: Optional[Diagnostics] = None
```
we already have an APIResponse model in backend/app/utils.py; see if we can use the same one
```python
@router.post("/responses/sync", response_model=ResponsesAPIResponse)
async def responses_sync(
```
do you mean synchronous or asynchronous in the description? I'm assuming this runs asynchronously, so we could speed up completion by running requests in parallel
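On the parallelism point, a sketch of how the benchmark CLI could issue its requests concurrently; the endpoint path prefix, payload fields, and use of httpx are assumptions, not code from this PR.

```python
import asyncio
import time

import httpx


async def run_benchmark(questions: list[str], base_url: str = "http://localhost:8000") -> list[float]:
    """Send one /responses/sync request per question concurrently; return per-call durations."""
    async with httpx.AsyncClient(base_url=base_url, timeout=60.0) as client:

        async def one_call(question: str) -> float:
            start = time.perf_counter()
            # Payload fields are illustrative; the real body is defined by ResponsesAPIRequest.
            await client.post("/api/v1/responses/sync", json={"question": question})
            return time.perf_counter() - start

        return await asyncio.gather(*(one_call(q) for q in questions))
```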
```python
class AssistantDatasetConfig:
    assistant_id: str
    filename: str
    query_column: str
```
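As a usage sketch (not code from this PR), the configurable query_column above could be consumed like this when loading the golden queries:

```python
import csv


def load_queries(config: AssistantDatasetConfig, limit: int = 100) -> list[str]:
    """Read up to `limit` query strings from the dataset CSV, using its configured column name."""
    with open(config.filename, newline="") as f:
        reader = csv.DictReader(f)
        return [row[config.query_column] for _, row in zip(range(limit), reader)]
```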
should we move these models to backend/app/models?
```yaml
- name: Create local credentials
  run: |
    curl -X POST "http://localhost:8000/api/v1/credentials/" \
      -H "Content-Type: application/json" \
      -H "X-API-KEY: ${{ env.LOCAL_CREDENTIALS_API_KEY }}" \
      -d '{
```
do we also need to run the seeder before adding credentials for organization_id 1?
the seeder runs in prestart during docker compose up
ok cool
```yaml
- name: Upload benchmark results
  uses: actions/upload-artifact@v4
  with:
    name: bench-${{ matrix.service }}-${{ matrix.dataset }}-${{ matrix.count }}.csv
    path: bench-${{ matrix.service }}-${{ matrix.dataset }}-${{ matrix.count }}.csv
```
where are the benchmark results uploaded to from here?
in the GitHub Actions UI, see link:
thanks
avirajsingh7
left a comment
lgtm

Summary
Target issue is #198
In moving off the soon-to-be-deprecated OpenAI Assistants API to the Responses API, we would like to quantify the performance improvement.
The following video is a walk-through of how to run the benchmark.
assistants-benchmark.mp4
Here are the results: a mean duration of 14.289s for calls to the /threads/sync endpoint, which calls the OpenAI Assistants API. Note that running these benchmarks burns credits, so I did not run more than 100 test queries out of the >3k queries in the test dataset. Running the full 3,000-query benchmark would take hours and cost over $140 in credits. Durations in the 100-item run varied from 12.27s to 24.23s, so I don't think it's worth running the full benchmark. If that sounds good @dlobo, @AkhileshNegi, we can move on to adding a set of endpoints for the Responses API and re-running the benchmark to measure latency on the same 100-item set.
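For reference, the latency figures quoted above can be recomputed from a bench-*.csv artifact along these lines; the `duration_s` column name is an assumption about the CSV layout.

```python
import csv
import statistics


def summarize(csv_path: str, duration_column: str = "duration_s") -> dict[str, float]:
    """Compute the latency summary quoted above (mean/min/max) from a bench-*.csv file."""
    with open(csv_path, newline="") as f:
        durations = [float(row[duration_column]) for row in csv.DictReader(f)]
    return {
        "mean": statistics.mean(durations),  # e.g. 14.289s for /threads/sync on 100 items
        "min": min(durations),               # e.g. 12.27s
        "max": max(durations),               # e.g. 24.23s
    }
```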