Falcon support #111

Merged: 7 commits from falcon into main, May 27, 2023

Conversation

qwopqwop200
Collaborator

Add falcon
Added a dtype option, because Falcon does not currently support float16.
Also, the input dimension of the 7B model is not divisible by 256, so Triton is not supported; this will need to be addressed later.
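
A quick illustrative check of that constraint (a sketch; 4544 is falcon-7b's hidden_size per its config.json):

hidden_size = 4544          # from tiiuae/falcon-7b config.json
print(hidden_size % 256)    # 192 -> not a multiple of 256, so the current Triton kernels can't be used
print(hidden_size % 64)     # 0   -> 64 divides it evenly, which is why group_size=64 appears in the quant code below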

@qwopqwop200
Collaborator Author

qwopqwop200 commented May 26, 2023

quant code

import os
import numpy as np
import random
import torch
from transformers import AutoTokenizer, TextGenerationPipeline
from datasets import load_dataset
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig


pretrained_model_dir = "tiiuae/falcon-7b"
quantized_model_dir = "falcon-7b-4bit-128g"

# os.makedirs(quantized_model_dir, exist_ok=True)
def get_wikitext2(nsamples, seed, seqlen, tokenizer):
    # set seed
    random.seed(seed)
    np.random.seed(seed)
    torch.random.manual_seed(seed)
    
    # load dataset and preprocess 
    traindata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train')
    testdata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
    trainenc = tokenizer("\n\n".join(traindata['text']), return_tensors='pt')
    testenc = tokenizer("\n\n".join(testdata['text']), return_tensors='pt')
    
    traindataset = []
    for _ in range(nsamples):
        i = random.randint(0, trainenc.input_ids.shape[1] - seqlen - 1)
        j = i + seqlen
        inp = trainenc.input_ids[:, i:j]
        attention_mask = torch.ones_like(inp)
        traindataset.append({'input_ids':inp,'attention_mask': attention_mask})
    return traindataset, testenc

def main():
    # prefer the slow tokenizer, falling back to the fast one if it is unavailable
    try:
        tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=False)
    except Exception:
        tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
    
    quantize_config = BaseQuantizeConfig(
        bits=4,          # quantize model to 4-bit
        group_size=64,   # 128 is the usual recommendation, but 64 divides falcon-7b's hidden size (4544) evenly
        desc_act=False,  # desc_act and group_size only take effect with the Triton backend
    )

    # load the un-quantized model; it is always force-loaded onto CPU
    # (float32 here because Falcon does not currently support float16)
    model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config, trust_remote_code=True, torch_dtype=torch.float32)

    # read the model's maximum sequence length from its config
    model_config = model.config.to_dict()
    seq_len_keys = ["max_position_embeddings", "seq_length", "n_positions"]
    if any([k in model_config for k in seq_len_keys]):
        for key in seq_len_keys:
            if key in model_config:
                model.seqlen = model_config[key]
                break
    else:
        model.seqlen = 2048
     
    # load train dataset for quantize
    traindataset, testenc = get_wikitext2(128, 0, model.seqlen, tokenizer)

    # quantize the model; examples should be a list of dicts whose keys include "input_ids" and "attention_mask"
    # with values of type torch.LongTensor
    model.quantize(traindataset, use_triton=False)

    # save quantized model
    model.save_quantized(quantized_model_dir)

    # save quantized model using safetensors
    model.save_quantized(quantized_model_dir, use_safetensors=True)

    # load quantized model, currently only support cpu or single gpu
    model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", use_triton=False, torch_dtype=torch.float32, trust_remote_code=True)
    
    token = tokenizer("test is", return_tensors="pt").to("cuda:0")
    token.pop("token_type_ids", None)  # Falcon's tokenizer returns token_type_ids, which generate() does not accept
    print(tokenizer.decode(model.generate(**token).cpu().tolist()[0]))
    
if __name__ == "__main__":
    import logging

    logging.basicConfig(
        format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
    )

    main()

@PanQiWei
Collaborator

PanQiWei commented May 26, 2023

Thank you very much for implementing Falcon support so quickly! 🥳 Here is my question: does this PR only focus on falcon-7b, or will falcon-40b also be considered in this PR?

@qwopqwop200
Collaborator Author

I think both will work, but I have only tested 7B so far.

@PanQiWei
Collaborator

I see that the model_type in config.json differs between the 40B and 7B models: in 40B it's RefinedWeb while in 7B it's RefinedWebModel, so maybe both need to be added to auto-gptq's relevant code in order to support both models.

@qwopqwop200
Collaborator Author

This is the current draft.

@PanQiWei PanQiWei linked an issue May 27, 2023 that may be closed by this pull request
@PanQiWei PanQiWei marked this pull request as draft May 27, 2023 00:12
@PanQiWei
Collaborator

This is the current draft.

Hi, I just converted this PR to draft mode based on this information; you can convert it back to ready-for-review once everything is done.

@qwopqwop200
Collaborator Author

Confirmed that 7B works. 40B results in OOM on my setup, but it otherwise seems to work.

@qwopqwop200 qwopqwop200 marked this pull request as ready for review May 27, 2023 05:23
@TheBloke
Contributor

Amazing @qwopqwop200 thank you! I am trying it now

Collaborator

@PanQiWei PanQiWei left a comment

Will merge, thank you very much!

@PanQiWei PanQiWei merged commit 0a40581 into main May 27, 2023
@qwopqwop200 qwopqwop200 deleted the falcon branch May 27, 2023 09:04
@yhyu13

yhyu13 commented May 27, 2023

@qwopqwop200 #111 (comment)

The Falcon dtype is torch.bfloat16, so why did you use float32 to load the model?
Also, shouldn't we use the RefinedWeb dataset that Falcon was trained on for quantization? It is available on the Hub: https://huggingface.co/datasets/tiiuae/falcon-refinedweb

It's a huge dataset (2.8TB), but we probably only need the test portion.
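
If someone does want to calibrate on RefinedWeb, streaming avoids downloading the full dump. A minimal sketch (the "content" column name is taken from the dataset card and is an assumption here):

from itertools import islice
from datasets import load_dataset

# Stream the dataset instead of downloading 2.8TB, and take 128 documents for calibration.
stream = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
calib_texts = [row["content"] for row in islice(stream, 128)]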

@qwopqwop200
Collaborator Author

@qwopqwop200 #111 (comment)

The Falcon dtype is torch.bfloat16, so why did you use float32 to load the model? Also, shouldn't we use the RefinedWeb dataset that Falcon was trained on for quantization? It is available on the Hub: https://huggingface.co/datasets/tiiuae/falcon-refinedweb

It's a huge dataset (2.8TB), but we probably only need the test portion.

This is just code to make sure it works.

@TheBloke
Contributor

I can confirm that the model loads fine with torch.bfloat16 as well.

I have made the 7B model no problem, and it works well : https://huggingface.co/TheBloke/falcon-7b-instruct-GPTQ

I am having a bit of trouble making 40B. As @qwopqwop200 found, it uses a lot of VRAM - it peaks at around 32GB, so a 24GB card is no good. So I made it on an A6000 with 48GB.

It took over 2.5 hours to quantise all the layers... then this happened!!!

2023-05-27 12:10:03 INFO [auto_gptq.quantization.gptq] avg loss: 841.5869140625
2023-05-27 12:10:25 INFO [auto_gptq.modeling._utils] Packing model...
[1]    2334 killed     python3 quant_falcon_40b.py

out of RAM! Argh! :)

I had 167GB RAM, so that was a hell of a lot of RAM it needed.

I am trying again now on a server with an L40 48GB GPU and 250GB RAM. I hope it will be enough!

I am going to try low_cpu_mem_usage:

 model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config, low_cpu_mem_usage=True, trust_remote_code=True, torch_dtype=torch.float32)

to see if this helps.

What do you think @qwopqwop200 @PanQiWei? Will that reduce the RAM requirements for quantizing? Any other ideas?

@yhyu13

yhyu13 commented May 27, 2023

How do you properly set up the device map for the Hugging Face from_pretrained method, i.e. set a maximum VRAM budget for each device and a maximum CPU RAM budget?
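
For reference, a minimal sketch of how this is usually done with the plain transformers from_pretrained (illustration only; whether AutoGPTQ honours these settings during quantization is discussed further down):

import torch
from transformers import AutoModelForCausalLM

# device_map="auto" lets accelerate place the layers; max_memory caps what each
# GPU (and the CPU) may hold, spilling the remainder to the next entry.
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b",
    device_map="auto",
    max_memory={0: "20GiB", 1: "20GiB", "cpu": "200GiB"},
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)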

@lloorree

lloorree commented May 27, 2023

Any other ideas?

If there's a large enough disk on the machine, you can make a big swapfile. I got it running on 64GB of RAM that way (then ran into the VRAM needs you mentioned).

Also, this was using a modified version of GPTQ-for-LLaMa, so it might not be relevant, but since Falcon's model file uses its own version of nn.Linear, adding that Linear class to the list of types to quantize packs the model down a bit smaller. That could be unstable, though.
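
A rough sketch of that idea (find_quantizable_layers and FalconLinear are placeholder names, not GPTQ-for-LLaMa's or AutoGPTQ's actual API):

import torch.nn as nn

def find_quantizable_layers(module, layer_types):
    # Collect every submodule whose type is in layer_types, so a custom
    # Falcon Linear subclass gets quantized alongside plain nn.Linear.
    return {
        name: child
        for name, child in module.named_modules()
        if isinstance(child, tuple(layer_types))
    }

# Illustrative usage; FalconLinear stands in for the Linear class shipped in
# the model's remote code:
# layers = find_quantizable_layers(model, [nn.Linear, FalconLinear])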

@TheBloke
Contributor

Any other ideas?

If there's a large enough disk on the machine, you can make a big swapfile. I got it running on 64GB of RAM that way (then ran into the VRAM needs you mentioned).

Hmm, yeah, I guess I should have tried that. I'm running in a Docker container and am not 100% sure I'm able to add swap. But it'd be worth a try.

Oh well, I'll know soon enough. It's on layer 56 of 60 of quantizing... fingers crossed 250GB RAM will be enough to pack it! I can see it's using 130GB during the quantising phase, so I just hope it doesn't need 2x that to pack...

@TheBloke
Contributor

OK, I tried it, and I guess it can't be done in a Docker container :(

root@58f81ddbb48d:~/AutoGPTQ# swapon /workspace/swapfile
swapon: /workspace/swapfile: swapon failed: Operation not permitted

Just crossing my fingers it's not going to die in about 2 minutes...

@TheBloke
Contributor

Doing better!

2023-05-27 13:57:45 INFO [auto_gptq.modeling._utils] Packing model...
2023-05-27 13:57:54 INFO [auto_gptq.modeling._utils] transformer.h.0.self_attention.dense
2023-05-27 13:58:03 INFO [auto_gptq.modeling._utils] transformer.h.0.self_attention.query_key_value
2023-05-27 13:58:13 INFO [auto_gptq.modeling._utils] transformer.h.0.mlp.dense_4h_to_h

And it's only using 144GB RAM, so maybe low_cpu_mem_usage=True did help.

@TheBloke
Contributor

TheBloke commented May 27, 2023

Hehe, this is going to take forever. It took 17 minutes to pack the first 6 layers, so it looks like it'll take around 3 hours to do the whole thing - much longer than it took to quantise!

@PanQiWei @qwopqwop200 One feature I would really love to see is GPU acceleration for packing. I know it might be difficult though. I looked at the code once and saw it references uint32, which isn't available in torch yet?
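
For what it's worth, the packing itself is mostly bit shifts, and torch can run those on the GPU even without a uint32 dtype, since eight 4-bit values fit in a signed int32. A minimal sketch of the idea (not AutoGPTQ's actual pack routine):

import torch

def pack_4bit(q):
    # q: integer tensor of shape (rows, cols) with values in [0, 15] and cols
    # divisible by 8; returns an int32 tensor of shape (rows, cols // 8).
    q = q.to(torch.int32).reshape(q.shape[0], -1, 8)
    packed = torch.zeros(q.shape[:2], dtype=torch.int32, device=q.device)
    for i in range(8):
        # nibble i lands in bits 4*i .. 4*i+3; the last nibble may set the
        # sign bit, but the stored bit pattern is still exact
        packed |= q[:, :, i] << (4 * i)
    return packed

device = "cuda" if torch.cuda.is_available() else "cpu"
qweight = torch.randint(0, 16, (4096, 4096), device=device)
packed = pack_4bit(qweight)  # the shifts and ORs run on the GPU when available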

@yhyu13

yhyu13 commented May 27, 2023

@TheBloke Did you try 128 group size? For me the quant succeeded, but evaluation failed due to a layer size mismatch. After changing back to 64 group size, everything went fine (and matched your quant result, for the 7B model).

@TheBloke
Contributor

Oh interesting. No, I did not; I tried 64 first and it worked fine, so I left it at that.

I am using group_size = -1 (no grouping) for 40B though, based on past experience with Llama models, to reduce VRAM usage as much as possible for models of 30B or larger.

@yhyu13

yhyu13 commented May 27, 2023

@qwopqwop200 Maybe a stupid question: how did you manage to pull off large-model GPTQ quantization, like 65B, without OOM in your GPTQ-for-LLaMa repo? AutoGPTQ (CUDA branch) seems to keep loading more weights onto cuda:0 during quantization, and ends up failing with OOM at some point on my dual 24GB VRAM setup. @PanQiWei

I also found that

device_map = "balanced"
max_memory = {0: "20GiB", 1: "20GiB", "cpu": "200GiB"}

can be passed into AutoGPTQ's from_pretrained method; torch actually reserves the correct amount of VRAM and truly offloads the model to CPU (shown by the output log), but the limits are ignored during quantization. The VRAM on GPU 0 climbs until OOM even with these size settings.

@TheBloke
Contributor

TheBloke commented May 27, 2023

Update: The 40B model worked and is uploaded at https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ

Even with group_size -1, it requires a bit more than 24GB VRAM, which is a shame.

But the main problem is that it is REALLY slow. So is the 7B model.

Example:
7B model = 2 tokens/s, compared to 30+ tokens/s for a comparable 7B model (CPU bottlenecked)
40B model = 0.54 tokens/s! And that's on an L40 48GB GPU.

Is there any possibility of improving this performance? I suppose it's mostly because of the custom code provided by RefinedWeb. But is there any chance there are optimisations that could be made in the AutoGPTQ code related to RefinedWeb?

@avaer

avaer commented May 28, 2023

Update: The 40B model worked and is uploaded at https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ

😮
Any possibility of quantizing the base 40B model (non instruct)?

@yhyu13

yhyu13 commented May 30, 2023

#111 (comment)

4-bit QLoRA might be the answer: https://www.reddit.com/r/LocalLLaMA/comments/13uvbxe/testing_the_new_bnb_4bit_or_qlora_vs_gptq_cuda/. It is expected to become even faster than GPTQ in the near future, while being able to do both inference and fine-tuning.
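
For comparison, the bnb 4-bit ("QLoRA-style") route mentioned there looks roughly like this with a recent transformers + bitsandbytes (a sketch, not a benchmark):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # the NF4 data type used by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # matches Falcon's native dtype
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)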

@ltm920716

ltm920716 commented Jul 6, 2023

Hi @TheBloke @qwopqwop200

I get an error when I run the quant code for the 'falcon-7b-instruct' model:

torch._C._LinAlgError: linalg.cholesky: The factorization could not be completed because the input is not positive-definite (the leading minor of order 16967 is not positive-definite).

Have you met the same error? I cannot find a solution...

Code below:

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging
import torch

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

pretrained_model_dir = './falcon-7b-instruct'
quantized_model_dir = './falco-7b-quant-test'

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

def get_wikitext2(nsamples, seed, seqlen, model):
    from datasets import load_dataset
    traindata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train')
    testdata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')

    from transformers import AutoTokenizer
    try:
        tokenizer = AutoTokenizer.from_pretrained(model, use_fast=False)
    except:
        tokenizer = AutoTokenizer.from_pretrained(model, use_fast=True)
    trainenc = tokenizer("\n\n".join(traindata['text']), return_tensors='pt')
    testenc = tokenizer("\n\n".join(testdata['text']), return_tensors='pt')

    import random
    import numpy as np
    random.seed(seed)
    np.random.seed(0)
    torch.random.manual_seed(0)
    
    traindataset = []
    for _ in range(nsamples):
        i = random.randint(0, trainenc.input_ids.shape[1] - seqlen - 1)
        j = i + seqlen
        inp = trainenc.input_ids[:, i:j]
        attention_mask = torch.ones_like(inp)
        traindataset.append({'input_ids':inp,'attention_mask': attention_mask})
    return traindataset, testenc

quantize_config = BaseQuantizeConfig(
    bits=4,  
    group_size=128,  
    desc_act=False
)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config, trust_remote_code=True, torch_dtype=torch.float32)
model.seqlen = 2048

traindataset, testenc = get_wikitext2(128, 0, model.seqlen, pretrained_model_dir)

model.quantize(examples, use_triton=True)  # note: this quantizes on the single-sentence `examples`, not the 128-sample `traindataset` built above
model.save_quantized(quantized_model_dir)

@TheBloke
Contributor

TheBloke commented Jul 6, 2023

@ltm920716 Yes I have had this problem in the past.

Two solutions I have found:

  1. Use a bigger dataset, more samples
  2. Use a higher damp percent, like 0.1 instead of 0.01

I see you are using 128 samples of wikitext2, so I'm surprised you have this error. But it can likely be solved either by using 256 samples, or by setting damp percent to 0.1, or both.

Specify damp percent with:

quantize_config = BaseQuantizeConfig(
    bits=4,  
    group_size=128,  
    desc_act=False,
    damp_percent=0.1
)
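
The intuition behind the damping, as a toy sketch (the general GPTQ recipe, not AutoGPTQ's exact code): damp_percent adds a fraction of the mean Hessian diagonal to the diagonal, which pushes a near-singular Hessian back to positive-definite so the Cholesky factorization succeeds.

import torch

A = torch.rand(16, 4)
H = A @ A.T                             # symmetric but rank-deficient: cholesky would fail
damp = 0.1 * torch.mean(torch.diag(H))  # damp_percent = 0.1
H_damped = H + damp * torch.eye(16)
L = torch.linalg.cholesky(H_damped)     # succeeds once the diagonal is damped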

@TheBloke
Contributor

TheBloke commented Jul 6, 2023

By the way I already quantised Falcon 7B Instruct with AutoGPTQ + Wikitext2, here: https://huggingface.co/TheBloke/falcon-7b-instruct-GPTQ

So you could just use that!

@ltm920716

Hello @TheBloke,
Thanks! It does work after setting damp_percent=0.1!

Maybe I need to dive deep into the GPTQ paper to understand the reason for this error.

Successfully merging this pull request may close these issues.

New LLM format: Falcon 40B and 7B - "RWForCausalLM"