<a href="https://colab.research.google.com/github/Signed-B/build-your-own-llm/blob/main/Copy_of_bnb_4bit_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# `transformers` meets `bitsandbytes` for democratizing Large Language Models (LLMs) through 4bit quantization

<center>
<img src="https://github.com/huggingface/blog/blob/main/assets/96_hf_bitsandbytes_integration/Thumbnail_blue.png?raw=true" alt="drawing" width="700" class="center"/>
</center>

Welcome to this notebook, which walks through the recent `bitsandbytes` integration. It includes the work from XXX, which introduces 4bit quantization techniques with no performance degradation, for democratizing LLM inference and training.

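In this notebook, we will learn together how to load a large model in low precision and train it using Google Colab and the PEFT library from Hugging Face 🤗. The text was written for loading `gpt-neo-x-20b` in 4bit; the code cells in this copy load `eluzhnica/mpt-7b-8k-peft-compatible` in 8bit instead.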

[In the general usage notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf?usp=sharing), you can learn how to properly load a model in 4bit with all its variants.

If you liked the previous work for integrating [*LLM.int8*](https://arxiv.org/abs/2208.07339), you can have a look at the [introduction blogpost](https://huggingface.co/blog/hf-bitsandbytes-integration) to learn more about that quantization method.
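
To make the idea of low-bit quantization concrete, here is a toy sketch of absmax quantization to signed 4-bit codes. It is an illustration only: it is not the NF4 data type or the LLM.int8 scheme that `bitsandbytes` implements, and the helper names `quantize_absmax_4bit` and `dequantize` are hypothetical, invented for this example.

```python
def quantize_absmax_4bit(values):
    """Quantize floats to integer codes in [-7, 7] using a single absmax scale."""
    absmax = max(abs(v) for v in values)
    # Guard against an all-zero input, where any scale would work.
    scale = absmax / 7 if absmax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.7, -0.28, 0.14, 0.0]  # made-up example values
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)

# The rounding error per value is bounded by half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored as a small integer plus one shared scale, which is where the memory saving comes from; the price is the rounding error measured by `max_error`.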


In [1]:
!pip install -q -U bitsandbytes
!pip install -q transformers==4.32.0
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets einops

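First, let's load the model we are going to use. The text of this notebook was written for GPT-NeoX-20B, which is around 40GB in half precision; the cell below loads `eluzhnica/mpt-7b-8k-peft-compatible` in 8bit (`load_in_8bit=True`), with the 4bit options left commented out.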
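
The 40GB figure for half precision follows from simple arithmetic: number of parameters times bytes per parameter. The sketch below is a rough estimate under stated assumptions: it uses a round 20 billion parameters, counts 1 GB as 1e9 bytes, and ignores quantization constants, activations, and optimizer state. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring all overhead."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS_20B = 20_000_000_000  # assumed round figure for a 20B-parameter model

for label, bits in [("fp16/bf16", 16), ("8bit", 8), ("4bit", 4)]:
    print(f"{label}: {weight_memory_gb(NUM_PARAMS_20B, bits):.1f} GB")
```

Under these assumptions, going from 16 bits to 4 bits per weight cuts the weight storage by a factor of four, which is what makes a model of this size plausible on a single Colab GPU.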

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "eluzhnica/mpt-7b-8k-peft-compatible"
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    # bnb_4bit_use_double_quant=True,
    # bnb_4bit_quant_type="nf4",
    # bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=bnb_config,
                                             device_map={"":0},
                                             torch_dtype=torch.bfloat16,
                                             trust_remote_code=True
                                             )

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


A new version of the following files was downloaded from https://huggingface.co/eluzhnica/mpt-7b-8k-peft-compatible:
- configuration_mpt.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


You are using config.init_device='cpu', but you can also use config.init_device="meta" with Composer + FSDP for fast initialization.



Then we have to apply some preprocessing to the model to prepare it for training. For that, we enable gradient checkpointing and use the `prepare_model_for_kbit_training` function from PEFT.

In [3]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [4]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [5]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 4194304 || all params: 6653480960 || trainable%: 0.0630392425441013
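
The trainable-parameter count printed above can be sanity-checked by hand. LoRA adds two small matrices per adapted linear layer, `A` with shape `(r, d_in)` and `B` with shape `(d_out, r)`, so each adapted layer contributes `r * (d_in + d_out)` parameters. The sketch below assumes a hidden size of 4096, 32 transformer blocks, and one adapted fused query/key/value projection (4096 to 12288) per block. These architecture values and the choice of adapted module are assumptions, not something this notebook states; `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in, d_out, rank, num_layers):
    """LoRA adds A (rank x d_in) and B (d_out x rank) to each adapted linear layer."""
    return num_layers * rank * (d_in + d_out)


# Assumed architecture values (not stated in this notebook).
HIDDEN_SIZE = 4096
NUM_LAYERS = 32

trainable = lora_param_count(
    d_in=HIDDEN_SIZE,
    d_out=3 * HIDDEN_SIZE,  # assumed fused query/key/value projection
    rank=8,                 # r=8 from the LoraConfig in this notebook
    num_layers=NUM_LAYERS,
)
all_params = 6653480960  # "all params" value from the printed output

print(trainable, 100 * trainable / all_params)
```

Under these assumptions the result agrees with the printed `trainable params` value, which suggests that only one projection per block carries an adapter.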


Let's load an instruction-following dataset, `vicgalle/alpaca-gpt4`, to fine-tune our model. We hold out 20% of it as a test split and tokenize the `text` column.

In [6]:
from datasets import load_dataset

data = load_dataset("vicgalle/alpaca-gpt4", split="train[:]")
data = data.train_test_split(test_size=0.2, seed=42)
data = data.map(lambda samples: tokenizer(samples["text"]), batched=True)

Generating train split: 52002 examples

Map: 41601 examples

Map: 10401 examples
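
The two `Map` counts in the output (41601 and 10401) are the sizes of the train and test splits produced from the 52002 examples with `test_size=0.2`. A minimal sketch of that arithmetic follows, assuming the test size is rounded up and the remainder goes to training; that rounding rule is an assumption about how `datasets` splits, and `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test), rounding the test split up (assumed rule)."""
    n_test = math.ceil(test_fraction * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(52002, 0.2)
print(n_train, n_test)
```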

Run the cell below to start training! For the sake of the demo, we ran it for only a few steps (`max_steps=100`) to showcase how to use this integration with existing tools in the HF ecosystem.

In [9]:
import transformers

# needed for gpt-neo-x tokenizer
tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    eval_dataset=data["test"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=100,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=30,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        evaluation_strategy="steps",
        # num_train_epochs=3
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 30   | 1.2275        | 1.066718        |
| 60   | 1.0491        | 1.040586        |
| 90   | 1.042         | 1.032366        |


TrainOutput(global_step=100, training_loss=1.1011932945251466, metrics={'train_runtime': 9940.636, 'train_samples_per_second': 0.161, 'train_steps_per_second': 0.01, 'total_flos': 2.143479446588621e+16, 'train_loss': 1.1011932945251466, 'epoch': 0.04})
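
The metrics in `TrainOutput` can be related back to the training arguments with a little arithmetic. The sketch below assumes a single GPU (consistent with `device_map={"":0}`), so the effective batch size is the per-device batch size times the gradient accumulation steps. The training set size of 41601 is taken from the dataset output earlier in the notebook.

```python
per_device_batch = 4       # per_device_train_batch_size
grad_accum = 4             # gradient_accumulation_steps
max_steps = 100            # max_steps
train_examples = 41601     # size of the train split
train_runtime_s = 9940.636 # train_runtime from TrainOutput

# Assumes one GPU, so no extra factor for the number of devices.
effective_batch = per_device_batch * grad_accum
samples_seen = effective_batch * max_steps

epoch_fraction = samples_seen / train_examples
samples_per_second = samples_seen / train_runtime_s
steps_per_second = max_steps / train_runtime_s

print(effective_batch, samples_seen)
print(round(epoch_fraction, 2), round(samples_per_second, 3), round(steps_per_second, 2))
```

The rounded values line up with `'epoch': 0.04`, `'train_samples_per_second': 0.161`, and `'train_steps_per_second': 0.01`: the run covered about 4% of one pass over the training split in roughly 2.8 hours.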

In [10]:
from transformers import TextStreamer

inputs = tokenizer(["What is the meaning of life?"], return_tensors="pt").to("cuda:0")

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(**inputs, streamer=streamer, max_new_tokens=100)



 What is the purpose of life? These are questions that have been asked for centuries, and they continue to be asked today. The meaning 

KeyboardInterrupt: 

In [11]:
def stream(question, context=None):
    system_prompt = 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n'
    B_INST, I_INST, E_INST = "### Instruction:\n", "### Input:\n", "### Response:\n"

    prompt = f"{system_prompt}{B_INST}{question.strip()}\n\n{I_INST}{context.strip()}\n\n{E_INST}" \
             if context else f"{system_prompt}{B_INST}{question.strip()}\n\n{E_INST}"

    inputs = tokenizer([prompt], return_tensors="pt").to("cuda:0")

    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Despite returning the usual output, the streamer will also print the generated text to stdout.
    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=500)
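
The prompt template that `stream` builds can be checked in isolation, without a model or GPU. The sketch below factors the same string construction into a standalone function; `build_prompt` is a hypothetical helper name, while the template strings are copied from `stream`.

```python
SYSTEM_PROMPT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
)
B_INST, I_INST, E_INST = "### Instruction:\n", "### Input:\n", "### Response:\n"


def build_prompt(question, context=None):
    """Assemble the instruction prompt, adding the Input block only when context is given."""
    parts = [SYSTEM_PROMPT, B_INST, question.strip(), "\n\n"]
    if context:
        parts += [I_INST, context.strip(), "\n\n"]
    parts.append(E_INST)
    return "".join(parts)


print(build_prompt("Which of these colors is an apple?", "Red, Blue, Black"))
```

Printing the prompt before generation is a quick way to confirm that the section headers and blank lines match the format the model was fine-tuned on.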

In [12]:
stream("Which of these colors is an apple?", "Red, Blue, Black")

Red is an apple.

Red is one of the colors that is often associated with apples. It can be seen in the skin of the fruit, as well as in the flesh and the seeds. Red apples include varieties such as Red Delicious, Red Rome, and Red Fuji.

Blue is not a color that is often associated with apples. However, some varieties of blue apples do exist, such as the Blue Pearmain and the Blue Gala. These apples have a bluish-purple skin and a blue-green flesh.

Black is not a color that is often associated with apples. However, some varieties of black apples do exist, such as the Black Twig and the Black Oxford. These apples have a dark, almost black skin and a deep red flesh.

In conclusion, red is the color that is most commonly associated with apples. Other colors such as blue and black may also be seen in some varieties of apples, but red is the most common color for apples.

Apple is a fruit that is commonly associated with red color. Other colors such as blue and black may also be seen in so

In [13]:
stream("Why am I so bad at everything?")



It's important to remember that everyone has their own strengths and weaknesses. There is no one-size-fits-all answer to this question. However, there are some factors that may contribute to your perceived lack of success in various areas of your life.

One possible explanation is that you may be setting unrealistic expectations for yourself. It's important to be realistic about your abilities and set attainable goals. If you're not meeting your own expectations, it may be because you're setting the bar too high.

Another possible explanation is that you may be putting too much pressure on yourself. It's important to take things at your own pace and not let the expectations of others get in the way of your success.

It's also possible that you may be comparing yourself to others and feeling like you're not measuring up. It's important to remember that everyone has their own strengths and weaknesses, and there is no one-size-fits-all answer to this question.

Overall, it's important to 