Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -112,7 +112,7 @@ go to the deps/JetStream folder (downloaded during `install_everything.sh`)
cd deps/JetStream
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
export dataset_path=ShareGPT_V3_unfiltered_cleaned_split.json
python benchmarks/benchmark_serving.py --tokenizer $tokenizer_path --num-prompts 2000 --dataset-path $dataset_path --dataset sharegpt --save-request-outputs
python benchmarks/benchmark_serving.py --tokenizer $tokenizer_path --num-prompts 2000 --dataset-path $dataset_path --dataset sharegpt --save-request-outputs --warm-up=True
```
Please look at `deps/JetStream/benchmarks/README.md` for more information.

Expand Down
64 changes: 64 additions & 0 deletions benchmarks/summary.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
# Benchmark results of various models


## Llama 3 - 8B

Date | Device | dtype | batch size | cache length |max input length |max output length| throughput (token/s)
----| ------- | ------ |---------- | -------------|-----------------|------------------|----------------------
2024-04-24 | TPU v5e-8 | bfloat16 | 128 | 2048 | 1024 | 1024 | 8249
2024-04-24 | TPU v5e-8 | int8 | 256 | 2048 | 1024 | 1024 | 10873


## Gemma - 7B

Date | Device | dtype | batch size | cache length |max input length |max output length| throughput (token/s)
----| ------- | ------ |---------- | -------------|-----------------|------------------|----------------------
2024-05-10 | TPU v5e-8 | bfloat16 | 96 | 2048 | 1024 | 1024 | 3236
2024-05-10 | TPU v5e-8 | int8 | 128 | 2048 | 1024 | 1024 | 4695

## Llama 2 - 7B

Date | Device | dtype | batch size | cache length |max input length |max output length| throughput (token/s)
----| ------- | ------ |---------- | -------------|-----------------|------------------|----------------------
2024-03-28 | TPU v5e-8 | bfloat16 | 96 | 2048 | 1024 | 1024 | 3663
2024-03-28 | TPU v5e-8 | int8 | 96 | 2048 | 1024 | 1024 | 4783

## Llama 2 - 13B

Date | Device | dtype | batch size | cache length |max input length |max output length| throughput (token/s)
----| ------- | ------ |---------- | -------------|-----------------|------------------|----------------------
2024-03-28 | TPU v5e-8 | bfloat16 | 48 | 2048 | 1024 | 1024 | 2056
2024-03-28 | TPU v5e-8 | int8 | 96 | 2048 | 1024 | 1024 | 3458
2024-03-28 | TPU v5e-8 | bfloat16 | 80 | 1280 | 1024 | 1024 | 2911
2024-03-28 | TPU v5e-8 | int8 | 96 | 1280 | 1024 | 1024 | 3938

**NOTE:** When cache length is less than the sum of max input length + max output length
we employ *Rolling window attention*.


# Instructions to reproduce:

Please refer [README.md](README.md) for instructions in how to get the model weights.

**NOTE** Different weights can produce different benchmark results (due to generating)
different sentence length. For llama, we used the `-chat` versions of the weight.
For Gemma we used the `-it` (instruction finetuned) version of the weights.

## Run the server
NOTE: the `--platform=tpu=8` need to specify number of tpu devices (which is 4 for v4-8 and 8 for v5light-8`)

```bash
python run_server.py --param_size=7b --batch_size= 128 --max_cache_length=2048 --quantize_weights=$quantize --quantize_kv_cache=$quantize --checkpoint_path=$output_ckpt_dir --tokenizer_path=$tokenizer_path --platform=tpu=8 --model=$model_name
```
Now you can fire gRPC to it

# Run benchmark
go to the deps/JetStream folder (downloaded during `install_everything.sh`)

```bash
cd deps/JetStream
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
export dataset_path=ShareGPT_V3_unfiltered_cleaned_split.json
python benchmarks/benchmark_serving.py --tokenizer $tokenizer_path --num-prompts 2000 --dataset-path $dataset_path --dataset sharegpt --save-request-outputs --warm-up=True
```
Please look at `deps/JetStream/benchmarks/README.md` for more information.