
changed basic hf server to support quantization and streaming #2293

Merged
merged 70 commits into main from hf-worker-server-bnb on Apr 3, 2023

Conversation


@yk commented Apr 2, 2023

I have now also folded the multi-worker-image PR into this one because it builds heavily on it.

yk and others added 20 commits March 31, 2023 10:52
Notable config settings that are required to get GPTNeoX-20B to behave
nicely:
- bf16: when training in fp16, gradients often over/underflow, leading
to bad convergence. (bf16 could also potentially improve the loss of
other models; we should try it.)
- More warmup steps: To fill the optimizer buffers, a gentle warmup is
required. Right now, we do this by just using more warmup steps, but in
the future a linear LR warmup schedule could be used to achieve the same
effect.
- Gradient checkpointing is faster than gradient accumulation. This
could also translate to other models.
- Stage 3 is required to fit the 20B model in bf16. For some reason,
the model could be fit with stage 2 in fp16, but not in bf16. (See the
sketch below for an illustration of these settings.)

Other learnings:
- Residual dropout quickly degrades performance of the pre-trained
model. Even `p=0.1` leads to an initial loss of > 5.
- Flash attention does not work well with the 20B model. There are
slight numerical differences between flash attention and the GPTNeoX
attention implementation that accumulate layer by layer, ultimately
leading to vastly different results.
- Run in bf16 with regular implementation:
https://wandb.ai/open-assistant/supervised-finetuning/runs/x2pczqa9
- Run in bf16 with flash attention:
https://wandb.ai/open-assistant/supervised-finetuning/runs/cvv3edm8
- Run in fp32 with regular implementation:
https://wandb.ai/open-assistant/supervised-finetuning/runs/shrzz3xp

EDIT: Updated the fp32 comparison run. It actually behaves nicely, just
like bf16, so the most likely explanation for flash attention not
working is the accumulation of numerical errors.
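
For illustration, a minimal sketch of training arguments reflecting the settings above, assuming the standard Hugging Face `TrainingArguments` API; the numeric values and the DeepSpeed config path are placeholders, not the project's actual configuration.

```python
# Sketch (assumption): illustrative TrainingArguments capturing the learnings
# above; the values and file paths are placeholders.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gpt-neox-20b-sft",
    bf16=True,                            # fp16 gradients tended to over/underflow
    warmup_steps=1000,                    # gentle warmup to fill the optimizer buffers
    gradient_checkpointing=True,          # faster here than gradient accumulation
    per_device_train_batch_size=1,
    deepspeed="configs/zero3_bf16.json",  # ZeRO stage 3 needed to fit 20B in bf16
)
```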
Added chat.json
Updated common.json
Updated dashboard.json
Added error.json
Updated index.json
Updated tasks.json
Updated labelling.json
Updated message.json
Updated leaderboard.json

All missing message labels added.

---------

Co-authored-by: AbdBarho <ka70911@gmail.com>
- add `CodeAlpaca` class to load
[sahil2801/CodeAlpaca-20k](https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k)
- add `GPT4all` class to load
[Nebulous/gpt4all_pruned](https://huggingface.co/datasets/Nebulous/gpt4all_pruned)
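
A rough sketch of how such a loader class could look, assuming the usual torch `Dataset` pattern; the field handling follows the common Alpaca schema and may differ from the actual implementation.

```python
# Sketch (assumption): a torch-style wrapper around CodeAlpaca-20k; the field
# names follow the common Alpaca schema (instruction/input/output).
from datasets import load_dataset
from torch.utils.data import Dataset

class CodeAlpaca(Dataset):
    def __init__(self, cache_dir: str | None = None):
        self.rows = load_dataset("sahil2801/CodeAlpaca-20k", cache_dir=cache_dir)["train"]

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, idx: int) -> tuple[str, str]:
        row = self.rows[idx]
        # combine the instruction and the (optional) input into a single prompt
        prompt = row["instruction"] + ("\n" + row["input"] if row.get("input") else "")
        return prompt, row["output"]
```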
This adds configs for the 13B and 6.7B Cerebras models; with the
necessary code changes in place, it should be easy enough to add configs
for the smaller models too if desired. The tokenizer appears to be the
GPT-2 fast tokenizer from Hugging Face, so the special tokens have been
configured for that.
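
A hedged sketch of the tokenizer point, assuming one of the public Cerebras checkpoints; the exact special-token strings are illustrative, not necessarily what the config uses.

```python
# Sketch (assumption): the Cerebras checkpoints ship a GPT-2 style fast
# tokenizer, so special tokens are registered on top of it; the token strings
# here are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cerebras/Cerebras-GPT-13B")
tokenizer.add_special_tokens({"pad_token": "<|padding|>", "sep_token": "<|endoftext|>"})
```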
Co-authored-by: mishka <gartsocial@gmail.com>
Add `rng_seed` command line/config parameter. The command-line arg will
override the YAML config value.
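
A minimal sketch of the override behaviour, assuming an argparse-plus-YAML setup; the argument and config key names are illustrative.

```python
# Sketch (assumption): a command-line --rng_seed wins over the YAML value;
# names are illustrative, not the project's actual option parsing.
import argparse

import yaml

parser = argparse.ArgumentParser()
parser.add_argument("--config", default="config.yaml")
parser.add_argument("--rng_seed", type=int, default=None)
args = parser.parse_args()

with open(args.config) as f:
    config = yaml.safe_load(f)

if args.rng_seed is not None:
    config["rng_seed"] = args.rng_seed  # CLI value overrides the YAML value
```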
Introduces counters for work queues that allow us to track the
positions of enqueued work requests without having to iterate through
the queues.
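
A rough sketch of the idea, assuming Redis-backed queues; the key names and helpers are illustrative, not the actual implementation.

```python
# Sketch (assumption): simple Redis counters next to a Redis list queue so a
# request's position can be estimated without scanning the queue.
import redis

r = redis.Redis()

def enqueue(queue: str, payload: str) -> int:
    r.rpush(f"{queue}:items", payload)
    # monotonically increasing ticket; the position is derived from the
    # difference to the dequeued counter
    return r.incr(f"{queue}:enqueued")

def dequeue(queue: str) -> bytes | None:
    item = r.lpop(f"{queue}:items")
    if item is not None:
        r.incr(f"{queue}:dequeued")
    return item

def position(queue: str, ticket: int) -> int:
    # number of requests ahead of this ticket that are still waiting
    dequeued = int(r.get(f"{queue}:dequeued") or 0)
    return max(ticket - dequeued - 1, 0)
```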
- removed setting the eos id property because this property is read-only
on the llama tokenizer (and it is not necessary to set it explicitly for
Pythia models)
Hey, following my pull request for adding the "Team" button to the
call-to-action section, I made a small update to the responsiveness.
| Before | Now |
|:-------------------------:|:-------------------------:|
| ![image](https://user-images.githubusercontent.com/25230234/229289501-3cf93539-18ff-4b45-9c91-299e677b4ace.png) | ![image](https://user-images.githubusercontent.com/25230234/229289660-67ab501a-e02e-432a-a6d0-c1287a1c0d72.png) |
New files have been added to the dataset, which caused entries to appear
multiple times if loaded naively. Now a single dataset file is explicitly
specified to avoid this. Ollie helped analyze the dataset.
---------

Co-authored-by: Oliver Stanley <olivergestanley@gmail.com>
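
A minimal sketch of pinning a single data file with `datasets`; the dataset id and file name are placeholders, not the actual dataset touched by this commit.

```python
# Sketch (assumption): loading exactly one data file so newly added files in
# the dataset repo cannot introduce duplicate entries; names are placeholders.
from datasets import load_dataset

ds = load_dataset(
    "some-org/some-dataset",           # placeholder dataset id
    data_files="data/train-v1.jsonl",  # pin a single file instead of the whole repo
    split="train",
)
```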

@olliestanley left a comment


Looks good. I wonder if we should refactor the worker directory structure a bit to have a subdirectory for the basic HF server functionality; the code (and the difference between the basic HF server and the text-generation-inference server) is possibly a bit confusing for new developers at this point.

Also seems like something weird has happened with the diffs from main.

inference/worker/basic_hf_server.py (outdated review thread, resolved)
```diff
@@ -17,5 +17,8 @@ class Settings(pydantic.BaseSettings):
     perform_oom_test: bool = False
     oom_test_max_length: int | None = None
 
+    # for hf basic server
+    quantize: bool = False
```
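
For context, a rough sketch of what this flag might gate when the basic HF server loads its model; the helper and its arguments are an assumption, not the PR's actual code.

```python
# Sketch (assumption): mapping the `quantize` setting to 8-bit loading via
# bitsandbytes; the helper and its arguments are illustrative.
from transformers import AutoModelForCausalLM

def load_model(model_id: str, quantize: bool):
    kwargs = {"device_map": "auto"}
    if quantize:
        # requires the bitsandbytes package and a CUDA device
        kwargs["load_in_8bit"] = True
    return AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
```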
Collaborator

Should we have a more descriptive setting name here so people don't expect it to have an effect when not using the basic server?

Collaborator Author

This needs to be called `quantize` because the HF inference server also expects it under that name.

@yk marked this pull request as ready for review April 3, 2023 07:43
model/model_training/trainer_rm.py (outdated review thread, resolved)
inference/worker/__main__.py (outdated review thread, resolved)

@olliestanley left a comment


Reapproving with the new changes (apologies for the early review yesterday; somehow I didn't notice it was a draft!)

@yk merged commit 8a97cd4 into main Apr 3, 2023
@yk deleted the hf-worker-server-bnb branch April 3, 2023 15:30