Changed basic HF server to support quantization and streaming #2293
Conversation
Notable config settings that are required to get GPTNeoX-20B to behave nicely (a rough sketch of these settings follows below):

- bf16: when training in fp16, gradients often over/underflow, leading to bad convergence. (bf16 could also potentially improve the loss of other models; we should try.)
- More warmup steps: to fill the optimizer buffers, a gentle warmup is required. Right now, we do this by just using more warmup steps, but in the future a linear LR warmup schedule could be used to achieve the same effect.
- Gradient checkpointing is faster than gradient accumulation. This could also translate to other models.
- Stage 3 is required to fit the 20B model in bf16. For some reason, fp16 was possible to fit with stage 2 but bf16 wasn't.

Other learnings:

- Residual dropout quickly degrades performance of the pre-trained model. Even `p=0.1` leads to an initial loss of > 5.
- Flash attention does not go too well with the 20B model. There are slight numerical differences between flash attention and the GPTNeoX attention implementation that accumulate layer by layer, ultimately leading to vastly different results.
  - Run in bf16 with regular implementation: https://wandb.ai/open-assistant/supervised-finetuning/runs/x2pczqa9
  - Run in bf16 with flash attention: https://wandb.ai/open-assistant/supervised-finetuning/runs/cvv3edm8
  - Run in fp32 with regular implementation: https://wandb.ai/open-assistant/supervised-finetuning/runs/shrzz3xp

EDIT: Updated fp32 comparison run. It actually behaves nicely just like bf16, so the most likely explanation for flash attention not working is accumulating errors.
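For orientation, here is a minimal sketch of those settings assuming a Hugging Face `TrainingArguments`-based trainer with DeepSpeed; the concrete values and the DeepSpeed config path are placeholders, not the project's actual configuration:

```python
from transformers import TrainingArguments

# Hypothetical values for illustration only; the real runs are driven by the
# project's YAML configs.
args = TrainingArguments(
    output_dir="out/gptneox-20b-sft",
    bf16=True,                    # fp16 gradients over/underflowed here
    warmup_steps=1000,            # gentle warmup to fill the optimizer buffers
    gradient_checkpointing=True,  # faster than gradient accumulation in this setup
    deepspeed="configs/zero3_bf16.json",  # ZeRO stage 3 needed to fit 20B in bf16
)
```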
- Added chat.json
- Updated common.json
- Updated dashboard.json
- Added error.json
- Updated index.json
- Updated tasks.json
- Updated labelling.json
- Updated message.json
- Updated leaderboard.json

All missing message labels added.

---------

Co-authored-by: AbdBarho <ka70911@gmail.com>
- add `CodeAlpaca` class to load [sahil2801/CodeAlpaca-20k](https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k)
- add `GPT4all` class to load [Nebulous/gpt4all_pruned](https://huggingface.co/datasets/Nebulous/gpt4all_pruned)
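As a hedged sketch of what such a loader class might look like (assuming the `datasets` library and the Alpaca-style instruction/input/output fields; the class body is illustrative, not the project's actual implementation):

```python
from datasets import load_dataset

class CodeAlpaca:
    """Toy loader sketch for sahil2801/CodeAlpaca-20k (field names assumed)."""

    def __init__(self, split: str = "train"):
        self.data = load_dataset("sahil2801/CodeAlpaca-20k", split=split)

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: int):
        row = self.data[idx]
        # Alpaca-format rows carry instruction/input/output fields; the optional
        # input is appended to the instruction to form the prompt.
        prompt = row["instruction"] + ("\n" + row["input"] if row["input"] else "")
        return prompt, row["output"]
```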
This adds configs for the 13B and 6.7B Cerebras models, but with the necessary code changes made, it should be easy enough to add configs for the smaller models too if desired. The tokenizer seems to be the GPT-2 fast tokenizer from Hugging Face, so the special tokens have been configured accordingly.
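As a rough illustration of that tokenizer setup (the checkpoint name and pad-token choice here are assumptions, not necessarily the project's exact configuration):

```python
from transformers import AutoTokenizer

# The Cerebras checkpoints reportedly reuse the GPT-2 fast tokenizer, which has
# no dedicated pad token, so one common choice is to reuse the eos token.
tokenizer = AutoTokenizer.from_pretrained("cerebras/Cerebras-GPT-13B")
tokenizer.pad_token = tokenizer.eos_token
```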
Co-authored-by: mishka <gartsocial@gmail.com>
Add `rng_seed` command-line/config parameter. The command-line argument overrides the YAML config value.
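A minimal sketch of that precedence rule, assuming an argparse/PyYAML setup (names and file paths are illustrative, not the project's actual wiring):

```python
import argparse
import yaml

parser = argparse.ArgumentParser()
parser.add_argument("--rng_seed", type=int, default=None)
cli_args = parser.parse_args()

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Command-line value wins; otherwise fall back to the YAML config.
rng_seed = cli_args.rng_seed if cli_args.rng_seed is not None else config.get("rng_seed")
```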
Introduces counters for work queues that allow us to track the positions of enqueued work requests without having to iterate through the queues.
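A toy in-memory sketch of the counter idea (the project's actual queues live elsewhere, but the arithmetic is the same): each enqueue hands out a monotonically increasing ticket, and a request's position is a subtraction rather than an O(n) scan.

```python
from dataclasses import dataclass, field

@dataclass
class CountedQueue:
    enqueued: int = 0
    dequeued: int = 0
    items: list = field(default_factory=list)

    def enqueue(self, item) -> int:
        self.items.append(item)
        self.enqueued += 1
        return self.enqueued  # ticket number for this request

    def dequeue(self):
        self.dequeued += 1
        return self.items.pop(0)

    def position(self, ticket: int) -> int:
        # Number of requests still ahead of this ticket.
        return max(0, ticket - self.dequeued - 1)
```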
- Removed setting the eos id property, because this property is read-only on the LLaMA tokenizer (and it is not necessary to explicitly set it for Pythia models).
Hey, following my pull request for adding the "Team" button to the call-to-action section, I made a small update to the responsiveness.

Before | Now
:-------------------------:|:-------------------------:
![image](https://user-images.githubusercontent.com/25230234/229289501-3cf93539-18ff-4b45-9c91-299e677b4ace.png) | ![image](https://user-images.githubusercontent.com/25230234/229289660-67ab501a-e02e-432a-a6d0-c1287a1c0d72.png)
New files have been added to the dataset, which caused entries to appear multiple times if loaded naively. Now one dataset file is explicitly specified to avoid this. Ollie helped analyze the dataset.

---------

Co-authored-by: Oliver Stanley <olivergestanley@gmail.com>
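For context, pinning a single data file with the `datasets` library looks roughly like this; the dataset and file names below are placeholders, not the actual ones from this change:

```python
from datasets import load_dataset

# Loading only one explicit file avoids the duplicates that appear when every
# file in the dataset repo is picked up.
ds = load_dataset("some-org/some-dataset", data_files="data/train-v1.jsonl", split="train")
```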
Looks good. I wonder if we should refactor the worker directory structure a bit to have a subdirectory for basic HF server functionality; the code (and the difference between the basic HF server and the text-generation-inference server) is possibly a bit confusing for new developers at this point.
Also seems like something weird has happened with the diffs from main.
@@ -17,5 +17,8 @@ class Settings(pydantic.BaseSettings):
     perform_oom_test: bool = False
     oom_test_max_length: int | None = None

+    # for hf basic server
+    quantize: bool = False
Should we have a more descriptive setting name here so people don't expect it to have an effect when not using the basic server?
This needs to be called `quantize` because the HF inference server also expects it to be called that.
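For context, a rough sketch of what such a `quantize` flag typically gates in a basic Hugging Face server; the model name and flag wiring are assumptions for illustration, not the PR's exact code:

```python
from transformers import AutoModelForCausalLM

# Illustrative only: when the quantize setting is enabled, load the model
# weights in 8-bit via bitsandbytes instead of full precision.
quantize = True  # would come from Settings.quantize in practice
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-12b",  # placeholder model name
    device_map="auto",
    load_in_8bit=quantize,
)
```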
Reapproving with new changes (apologies for the early review yesterday, somehow didn't notice it was a draft!)
I now also put the multi-worker-image PR in here because it builds heavily on top of it.