
Conversation

@Bslabe123 (Collaborator) commented Mar 27, 2025

Sleep time is now configurable via the SLEEP_TIME env var
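
For context, a minimal sketch of how the configurable sleep might slot into the benchmark loop. The loop over $REQUEST_RATES and the echo/sleep lines mirror the trace below; the 10-second default and everything else here are illustrative, not the exact script:

```bash
# Sketch only: read the delay between request-rate runs from SLEEP_TIME,
# falling back to 10 seconds if the variable is unset (matches the log below).
SLEEP_TIME="${SLEEP_TIME:-10}"

for request_rate in $(echo "$REQUEST_RATES" | tr ',' ' '); do
  echo "Benchmarking request rate: $request_rate"
  # ... run benchmark_serving.py for this request rate ...
  echo "Sleeping for ${SLEEP_TIME} seconds..."
  sleep "$SLEEP_TIME"
done
```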

@Bslabe123 requested a review from achandrasekar on March 27, 2025 at 20:52
@achandrasekar (Collaborator)

Looks good Brendan! Is there a small example run with multiple request rates to confirm?

@Bslabe123 (Collaborator, Author) commented Mar 27, 2025

Relevant logs:

```
+ python3 benchmark_serving.py --save-json-results --host=vllm-inference-server --port=8000 --dataset=ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer=meta-llama/Llama-2-7b-hf --backend=vllm --max-input-length=256 --max-output-length=256 --file-prefix=benchmark --models=meta-llama/Llama-2-7b-hf --pm-namespace= --pm-job= --scrape-server-metrics --save-aggregated-result --request-rate=600 --num-prompts=1002
Finding Mean vllm:cpu_cache_usage_perc with the following query: avg_over_time(vllm:cpu_cache_usage_perc{job='',namespace=''}[158s])
Got response from metrics server: {'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
Cloud Monitoring PromQL Error: {'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
Finding Mean vllm:cpu_cache_usage_perc with the following query: avg_over_time(vllm:cpu_cache_usage_perc{job='',namespace=''}[158s])
Got response from metrics server: {'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
Cloud Monitoring PromQL Error: {'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
+ cat latency-profile-2025-03-27_21-31-41.txt
Namespace(backend='vllm', sax_model='', file_prefix='benchmark', endpoint='generate', host='vllm-inference-server', port=8000, dataset='ShareGPT_V3_unfiltered_cleaned_split.json', models='meta-llama/Llama-2-7b-hf', traffic_split=None, stream_request=False, request_timeout=10800.0, tokenizer='meta-llama/Llama-2-7b-hf', best_of=1, use_beam_search=False, num_prompts=1002, max_input_length=256, max_output_length=256, top_k=32000, request_rate=600.0, seed=1743111104, trust_remote_code=False, machine_cost=None, use_dummy_text=False, save_json_results=True, output_bucket=None, output_bucket_filepath=None, save_aggregated_result=True, additional_metadata_metrics_to_save=None, scrape_server_metrics=True, pm_namespace='', pm_job='')
Models to benchmark: ['meta-llama/Llama-2-7b-hf']
No traffic split specified. Defaulting to uniform traffic split.
Starting Prometheus Server on port 9090
send all requests
====Result for Model: weighted====
Errors: {'ClientConnectorError': 0, 'TimeoutError': 0, 'ContentTypeError': 0, 'ClientOSError': 0, 'ServerDisconnectedError': 0, 'unknown_error': 0}
Total time: 158.22 s
Successful/total requests: 1002/1002
Requests/min: 379.99
Output_tokens/min: 41124.43
Input_tokens/min: 25982.92
Tokens/min: 67107.35
Average seconds/token (includes waiting time on server): 0.61
Average milliseconds/request (includes waiting time on server): 75679.43
Average milliseconds/output_token (includes waiting time on server): 1453.33
Average input length: 68.38
Average output length: 108.23
====Result for Model: meta-llama/Llama-2-7b-hf====
Errors: {'ClientConnectorError': 0, 'TimeoutError': 0, 'ContentTypeError': 0, 'ClientOSError': 0, 'ServerDisconnectedError': 0, 'unknown_error': 0}
Total time: 158.22 s
Successful/total requests: 1002/1002
Requests/min: 379.99
Output_tokens/min: 41124.43
Input_tokens/min: 25982.92
Tokens/min: 67107.35
Average seconds/token (includes waiting time on server): 0.61
Average milliseconds/request (includes waiting time on server): 75679.43
Average milliseconds/output_token (includes waiting time on server): 1453.33
Average input length: 68.38
Average output length: 108.23
+ echo 'Sleeping for 10 seconds...'
Sleeping for 10 seconds...
+ sleep 10
+ for request_rate in $(echo $REQUEST_RATES | tr ',' ' ')
+ echo 'Benchmarking request rate: 650'
Benchmarking request rate: 650
++ date +%Y-%m-%d_%H-%M-%S
+ timestamp=2025-03-27_21-35-02
+ output_file=latency-profile-2025-03-27_21-35-02.txt
+ '[' 650 == 0 ']'
++ awk 'BEGIN {print int(650 * 1.67)}'
+ num_prompts=1085
+ echo 'TOTAL prompts: 1085'
TOTAL prompts: 1085
+ PYTHON_OPTS=("${BASE_PYTHON_OPTS[@]}" "--request-rate=$request_rate" "--num-prompts=$num_prompts")
+ python3 benchmark_serving.py --save-json-results --host=vllm-inference-server --port=8000 --dataset=ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer=meta-llama/Llama-2-7b-hf --backend=vllm --max-input-length=256 --max-output-length=256 --file-prefix=benchmark --models=meta-llama/Llama-2-7b-hf --pm-namespace= --pm-job= --scrape-server-metrics --save-aggregated-result --request-rate=650 --num-prompts=1085
```
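
A note on the prompt counts in the trace: the script scales num_prompts with the request rate via awk (600 gives 1002, 650 gives 1085, i.e. int(rate * 1.67)). A minimal standalone reproduction of that arithmetic; the surrounding variable assignments are assumed for illustration rather than copied from the script:

```bash
# Reproduce the prompt-count derivation seen in the trace:
# int(600 * 1.67) = 1002, int(650 * 1.67) = 1085.
request_rate=650
num_prompts=$(awk "BEGIN {print int($request_rate * 1.67)}")
echo "TOTAL prompts: $num_prompts"
```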
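The server-metric means in the log are gathered with avg_over_time over a window equal to the benchmark duration (158s). As a sketch only, a query like the one above could be issued against any Prometheus-compatible HTTP API as follows; the endpoint URL and the use of curl/jq are assumptions for illustration, not necessarily what benchmark_serving.py does internally:

```bash
# Assumption: a Prometheus-compatible server is reachable at PROM_URL.
# The PromQL expression is copied from the log above; the empty job/namespace
# selectors match the empty --pm-job/--pm-namespace flags in this run.
PROM_URL="http://localhost:9090"   # hypothetical endpoint
QUERY="avg_over_time(vllm:cpu_cache_usage_perc{job='',namespace=''}[158s])"

curl -s -G "${PROM_URL}/api/v1/query" --data-urlencode "query=${QUERY}" | jq .
```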

@achandrasekar merged commit 1330e4d into main on Mar 27, 2025 (1 check passed).