Support 32dim #125
Conversation
Thank you very much! This update looks really neat and powerful, since it makes it possible to support more models. 🔥 🔥
@qwopqwop200 Hi, this is just a question: is it possible to implement a dynamic strategy to configure triton warmup based on the model type and the model's attributes? |
Can you give me an example? I didn't understand. |
For example: if one uses GPT2-large, use one group of configs to warm up triton; and when using GPT2-XL, use another group of configs to warm up triton. If this can be implemented, I think maybe we could also predefine a set of config groups based on GPU types and architectures? |
I think it's probably possible. |
Maybe a dynamic strategy needs to be implemented with a wrapper or context_manager. Anyway, I will look into it sometime when I have enough time. Thanks again for this PR! ❤️ |
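To make the idea above concrete, here is a minimal sketch of what per-model warmup selection could look like. Everything in it (WARMUP_CONFIG_GROUPS, select_warmup_shapes, warmup_kernel, and the shapes themselves) is hypothetical and only illustrates the approach, not AutoGPTQ's actual warmup code:

# Hypothetical sketch (not AutoGPTQ code): pick a group of Triton warmup
# shapes based on the model type, falling back to a default group.
WARMUP_CONFIG_GROUPS = {
    # model type -> list of (batch, in_features, out_features) shapes to warm up
    "gpt2-large": [(1, 1280, 1280), (8, 1280, 1280)],
    "gpt2-xl": [(1, 1600, 1600), (8, 1600, 1600)],
}
DEFAULT_WARMUP_SHAPES = [(1, 4096, 4096), (8, 4096, 4096)]

def select_warmup_shapes(model_type: str):
    """Return the warmup shapes configured for this model type."""
    return WARMUP_CONFIG_GROUPS.get(model_type, DEFAULT_WARMUP_SHAPES)

# Example usage (warmup_kernel stands in for the real Triton warmup call):
# for batch, in_features, out_features in select_warmup_shapes("gpt2-xl"):
#     warmup_kernel(batch, in_features, out_features)

A similar table keyed by GPU architecture as well as model type could cover the GPU-specific config groups mentioned above.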
Awesome work! I'd love to try this but am currently getting an error when using a basic test script, I think related to the HF hub download stuff:

Script:

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantized_model_dir = "/workspace/models/TheBloke_falcon-40b-instruct-GPTQ"
use_triton = True

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir,
    use_safetensors=True,
    trust_remote_code=True,
    device="cuda:0",
    use_triton=use_triton,
    quantize_config=None)

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

prompt = "Tell me about AI"
prompt_template = f'''### Human: {prompt}
### Assistant:'''

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)
print(pipe(prompt_template)[0]['generated_text'])

print("\n\n*** Generate:")
input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

Errors:
I saw that the PR is relative to |
This seems to be because some changes in the main branch have not been merged into the peft branch. |
Changed the code to support triton when the dimension is divisible by 32. This enables the use of triton with Falcon and GPT2-XL.
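As a rough illustration of the relaxed constraint described above (not the PR's actual code; the function name and parameters are assumptions), the eligibility check could look like this:

# Hypothetical sketch of the "divisible by 32" eligibility check; names are
# illustrative, not AutoGPTQ's actual API.
def can_use_triton(infeatures: int, outfeatures: int, group_size: int) -> bool:
    # The Triton kernels in this sketch tile in multiples of 32, so the
    # quantized layer's dimensions must be divisible by 32 to take that path.
    return (
        infeatures % 32 == 0
        and outfeatures % 32 == 0
        and (group_size == -1 or group_size % 32 == 0)
    )

# Falcon-40B's hidden size (8192) and GPT2-XL's (1600) are both divisible
# by 32, so layers of those shapes would qualify for the triton path.
assert can_use_triton(8192, 8192, 128)
assert can_use_triton(1600, 1600, -1)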