Changed basic HF server to support quantization and streaming #2293
Conversation
Notable config settings that are required to get GPTNeoX-20B to behave nicely (a rough sketch of these settings follows below):

- bf16: when training in fp16, gradients often over/underflow, leading to bad convergence. (bf16 could also potentially improve the loss of other models; we should try.)
- More warmup steps: to fill the optimizer buffers, a gentle warmup is required. Right now, we do this by just using more warmup steps, but in the future a linear LR warmup schedule could be used to achieve the same effect.
- Gradient checkpointing is faster than gradient accumulation. This could also translate to other models.
- Stage 3 is required to fit the 20B model in bf16. For some reason, fp16 was possible to fit with stage 2 but bf16 wasn't.

Other learnings:

- Residual dropout quickly degrades performance of the pre-trained model. Even `p=0.1` leads to an initial loss of > 5.
- Flash attention does not go too well with the 20B model. There are slight numerical differences between flash attention and the GPTNeoX attention implementation that accumulate layer by layer, ultimately leading to vastly different results.
  - Run in bf16 with regular implementation: https://wandb.ai/open-assistant/supervised-finetuning/runs/x2pczqa9
  - Run in bf16 with flash attention: https://wandb.ai/open-assistant/supervised-finetuning/runs/cvv3edm8
  - Run in fp32 with regular implementation: https://wandb.ai/open-assistant/supervised-finetuning/runs/shrzz3xp

EDIT: Updated fp32 comparison run. It actually behaves nicely just like bf16, so the most likely explanation for flash attention not working is accumulating errors.
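For orientation, here is a minimal sketch of those settings assuming a Hugging Face `TrainingArguments`-based trainer with DeepSpeed; the concrete values and the DeepSpeed config path are placeholders, not the project's actual configuration:

```python
from transformers import TrainingArguments

# Hypothetical values for illustration only; the real runs are driven by the
# project's YAML configs.
args = TrainingArguments(
    output_dir="out/gptneox-20b-sft",
    bf16=True,                    # fp16 gradients over/underflowed here
    warmup_steps=1000,            # gentle warmup to fill the optimizer buffers
    gradient_checkpointing=True,  # faster than gradient accumulation in this setup
    deepspeed="configs/zero3_bf16.json",  # ZeRO stage 3 needed to fit 20B in bf16
)
```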
- Added chat.json
- Updated common.json
- Updated dashboard.json
- Added error.json
- Updated index.json
- Updated tasks.json
- Updated labelling.json
- Updated message.json
- Updated leaderboard.json

All missing message labels added.

---------

Co-authored-by: AbdBarho <ka70911@gmail.com>
- add `CodeAlpaca` class to load [sahil2801/CodeAlpaca-20k](https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k)
- add `GPT4all` class to load [Nebulous/gpt4all_pruned](https://huggingface.co/datasets/Nebulous/gpt4all_pruned)
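As a hedged sketch of what such a loader class might look like (assuming the `datasets` library and the Alpaca-style instruction/input/output fields; the class body is illustrative, not the project's actual implementation):

```python
from datasets import load_dataset

class CodeAlpaca:
    """Toy loader sketch for sahil2801/CodeAlpaca-20k (field names assumed)."""

    def __init__(self, split: str = "train"):
        self.data = load_dataset("sahil2801/CodeAlpaca-20k", split=split)

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: int):
        row = self.data[idx]
        # Alpaca-format rows carry instruction/input/output fields; the optional
        # input is appended to the instruction to form the prompt.
        prompt = row["instruction"] + ("\n" + row["input"] if row["input"] else "")
        return prompt, row["output"]
```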
This adds configs for the 13B and 6.7B Cerebras models, but with the necessary code changes made, it should be easy enough to add configs for the smaller models too if desired. The tokenizer seems to be the GPT-2 fast tokenizer from Hugging Face, so the special tokens have been configured accordingly.
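As a rough illustration of that tokenizer setup (the checkpoint name and pad-token choice here are assumptions, not necessarily the project's exact configuration):

```python
from transformers import AutoTokenizer

# The Cerebras checkpoints reportedly reuse the GPT-2 fast tokenizer, which has
# no dedicated pad token, so one common choice is to reuse the eos token.
tokenizer = AutoTokenizer.from_pretrained("cerebras/Cerebras-GPT-13B")
tokenizer.pad_token = tokenizer.eos_token
```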
Co-authored-by: mishka <gartsocial@gmail.com>
Add `rng_seed` command-line/config parameter. The command-line argument overrides the YAML config value.
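A minimal sketch of that precedence rule, assuming an argparse/PyYAML setup (names and file paths are illustrative, not the project's actual wiring):

```python
import argparse
import yaml

parser = argparse.ArgumentParser()
parser.add_argument("--rng_seed", type=int, default=None)
cli_args = parser.parse_args()

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Command-line value wins; otherwise fall back to the YAML config.
rng_seed = cli_args.rng_seed if cli_args.rng_seed is not None else config.get("rng_seed")
```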
Introduces counters for work queues that allow us to track the positions of enqueued work requests without having to iterate through the queues.
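A toy in-memory sketch of the counter idea (the project's actual queues live elsewhere, but the arithmetic is the same): each enqueue hands out a monotonically increasing ticket, and a request's position is a subtraction rather than an O(n) scan.

```python
from dataclasses import dataclass, field

@dataclass
class CountedQueue:
    enqueued: int = 0
    dequeued: int = 0
    items: list = field(default_factory=list)

    def enqueue(self, item) -> int:
        self.items.append(item)
        self.enqueued += 1
        return self.enqueued  # ticket number for this request

    def dequeue(self):
        self.dequeued += 1
        return self.items.pop(0)

    def position(self, ticket: int) -> int:
        # Number of requests still ahead of this ticket.
        return max(0, ticket - self.dequeued - 1)
```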
- Removed setting the eos id property, because this property is read-only on the LLaMA tokenizer (and it is not necessary to explicitly set it for Pythia models).
Hey, following my pull request for adding the "Team" button to the call-to-action section, I made a small update to the responsiveness.

Before | Now
:-------------------------:|:-------------------------:
![image](https://user-images.githubusercontent.com/25230234/229289501-3cf93539-18ff-4b45-9c91-299e677b4ace.png) | ![image](https://user-images.githubusercontent.com/25230234/229289660-67ab501a-e02e-432a-a6d0-c1287a1c0d72.png)
New files have been added to the dataset, which caused entries to appear multiple times if loaded naively. Now one dataset file is explicitly specified to avoid this. Ollie helped analyze the dataset.

---------

Co-authored-by: Oliver Stanley <olivergestanley@gmail.com>
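For context, pinning a single data file with the `datasets` library looks roughly like this; the dataset and file names below are placeholders, not the actual ones from this change:

```python
from datasets import load_dataset

# Loading only one explicit file avoids the duplicates that appear when every
# file in the dataset repo is picked up.
ds = load_dataset("some-org/some-dataset", data_files="data/train-v1.jsonl", split="train")
```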
Looks good. I wonder if we should refactor the worker directory structure a bit to have a subdirectory for basic HF server functionality; the code (and the difference between the basic HF server and the text-generation-inference server) is possibly a bit confusing for new developers at this point.
Also seems like something weird has happened with the diffs from main.
@@ -17,5 +17,8 @@ class Settings(pydantic.BaseSettings):
     perform_oom_test: bool = False
     oom_test_max_length: int | None = None

+    # for hf basic server
+    quantize: bool = False
Should we have a more descriptive setting name here so people don't expect it to have an effect when not using the basic server?
This needs to be called `quantize` because the HF inference server also expects it to be called that.
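For context, a rough sketch of what such a `quantize` flag typically gates in a basic Hugging Face server; the model name and flag wiring are assumptions for illustration, not the PR's exact code:

```python
from transformers import AutoModelForCausalLM

# Illustrative only: when the quantize setting is enabled, load the model
# weights in 8-bit via bitsandbytes instead of full precision.
quantize = True  # would come from Settings.quantize in practice
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-12b",  # placeholder model name
    device_map="auto",
    load_in_8bit=quantize,
)
```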
Reapproving with new changes (apologies for the early review yesterday, somehow didn't notice it was a draft!)
I now also put the multi-worker-image PR in here because it builds heavily on top of it.