Support 32dim #125

Merged
merged 11 commits into peft_integration from support-32dim
Jun 3, 2023
Conversation

@qwopqwop200 (Collaborator)

Changed the code to support Triton when the dimensions are divisible by 32. This enables the use of Triton with Falcon and GPT2-XL.
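
For context, a minimal sketch (not the PR's actual code) of the kind of divisibility check a Triton-backed quantized linear layer might perform; supports_triton, infeatures and outfeatures are illustrative names, not AutoGPTQ's API:

def supports_triton(infeatures: int, outfeatures: int, block: int = 32) -> bool:
    # Triton kernels tile the matmul, so both dimensions must be a multiple
    # of the tile size; relaxing the requirement to 32 admits models such as
    # Falcon and GPT2-XL whose hidden sizes are not multiples of larger tiles.
    return infeatures % block == 0 and outfeatures % block == 0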

@PanQiWei PanQiWei (Collaborator) left a comment

Thank you very much! This update is really neat and also quite powerful, since it makes it possible to support more models. 🔥 🔥

@PanQiWei (Collaborator) commented Jun 2, 2023

@qwopqwop200 Hi, this is just a question: is it possible to implement a dynamic strategy to configure the Triton warmup based on the model type and the model's attributes?

@qwopqwop200 (Collaborator, Author)

@qwopqwop200 Hi, this is just a question: is it possible to implement a dynamic strategy to configure the Triton warmup based on the model type and the model's attributes?

Can you give me an example? I don't understand.

@PanQiWei (Collaborator) commented Jun 2, 2023

For example: if one uses GPT2-large, use one group of configs to warm up Triton; and when using GPT2-XL, use another group of configs to warm up Triton.

And if this can be implemented, I think maybe we can also predefine a set of config groups based on GPU types and architectures?
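
For illustration only, a hedged sketch of what such a per-model config-group lookup could look like (WARMUP_CONFIGS and get_warmup_configs are hypothetical names, not part of AutoGPTQ, and the values are placeholders):

# Hypothetical per-model warmup groups: the batch sizes and sequence
# lengths used to pre-compile Triton kernels before inference.
WARMUP_CONFIGS = {
    "gpt2-large": {"batch_sizes": [1, 4], "seqlens": [512, 1024]},
    "gpt2-xl": {"batch_sizes": [1, 2], "seqlens": [1024, 2048]},
}

def get_warmup_configs(model_type: str) -> dict:
    # Fall back to a conservative default group when the model type is unknown.
    return WARMUP_CONFIGS.get(model_type, {"batch_sizes": [1], "seqlens": [2048]})

A similar table keyed by GPU architecture (e.g. via torch.cuda.get_device_capability()) could sit alongside it for the hardware-based grouping.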

@qwopqwop200 (Collaborator, Author)

I think it's probably possible.

@PanQiWei (Collaborator) commented Jun 2, 2023

Maybe a dynamic strategy like this would need to be implemented with a wrapper or a context_manager. Anyway, I will look into it sometime when I have enough time.

Thanks again for this PR! ❤️
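
As a hedged sketch of the context-manager idea above (triton_warmup_configs is a hypothetical helper, not an existing AutoGPTQ API):

from contextlib import contextmanager

# Illustrative module-level default that a (hypothetical) warmup routine would read.
_ACTIVE_WARMUP_CONFIGS = {"batch_sizes": [1], "seqlens": [2048]}

@contextmanager
def triton_warmup_configs(configs: dict):
    # Temporarily swap in a model-specific config group and restore the
    # previous one on exit, so callers can scope the warmup behaviour.
    global _ACTIVE_WARMUP_CONFIGS
    previous, _ACTIVE_WARMUP_CONFIGS = _ACTIVE_WARMUP_CONFIGS, configs
    try:
        yield _ACTIVE_WARMUP_CONFIGS
    finally:
        _ACTIVE_WARMUP_CONFIGS = previous

# usage sketch: with triton_warmup_configs(get_warmup_configs("gpt2-xl")): ...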

@TheBloke (Contributor) commented Jun 2, 2023

Awesome work! I'd love to try this, but I'm currently getting an error when using a basic test script, I think related to the HF Hub download stuff:

Script:

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantized_model_dir = "/workspace/models/TheBloke_falcon-40b-instruct-GPTQ"

use_triton = True

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

prompt = "Tell me about AI"
prompt_template=f'''### Human: {prompt}
### Assistant:'''

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

Errors:

[pytorch2] ubuntu@h100:/workspace/misc $ python ./simple_falcon.py

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /workspace/venv/pytorch2/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda120.so
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 9.0
CUDA SETUP: Detected CUDA version 120
CUDA SETUP: Loading binary /workspace/venv/pytorch2/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda120.so...
triton is not installed, reset use_triton to False
Traceback (most recent call last):
  File "/workspace/misc/./simple_falcon.py", line 11, in <module>
    model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir,
  File "/workspace/venv/pytorch2/lib/python3.10/site-packages/auto_gptq/modeling/auto.py", line 82, in from_quantized
    return quant_func(
  File "/workspace/venv/pytorch2/lib/python3.10/site-packages/auto_gptq/modeling/_base.py", line 543, in from_quantized
    quantize_config = BaseQuantizeConfig.from_pretrained(save_dir)
  File "/workspace/venv/pytorch2/lib/python3.10/site-packages/auto_gptq/modeling/_base.py", line 57, in from_pretrained
    return cls(**json.load(f))
TypeError: BaseQuantizeConfig.__init__() got an unexpected keyword argument 'model_name_or_path'

I saw that this PR is relative to the peft branch, and I haven't tried the peft code at all yet, so I'm not familiar with any changes it makes.

@PanQiWei (Collaborator) commented Jun 2, 2023

This seems to be because some changes in the main branch have not been merged into the peft branch yet.
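
For anyone hitting the same TypeError before the branches are synced, a hedged local workaround (load_quantize_config is a hypothetical helper, not an AutoGPTQ function) could be to drop the keys that the older BaseQuantizeConfig does not accept before constructing it, then pass the result to from_quantized instead of None, assuming from_quantized only reads quantize_config.json from disk when quantize_config is None:

import inspect
import json

from auto_gptq import BaseQuantizeConfig

def load_quantize_config(path: str) -> BaseQuantizeConfig:
    # Read quantize_config.json and keep only the keys that this version of
    # BaseQuantizeConfig.__init__ actually accepts (e.g. dropping
    # "model_name_or_path", which triggers the TypeError above).
    with open(path) as f:
        raw = json.load(f)
    allowed = set(inspect.signature(BaseQuantizeConfig.__init__).parameters) - {"self"}
    return BaseQuantizeConfig(**{k: v for k, v in raw.items() if k in allowed})

quantize_config = load_quantize_config(
    "/workspace/models/TheBloke_falcon-40b-instruct-GPTQ/quantize_config.json"
)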

@PanQiWei PanQiWei merged commit 023bb1c into peft_integration Jun 3, 2023
@PanQiWei PanQiWei deleted the support-32dim branch June 8, 2023 06:12