<a href="https://colab.research.google.com/github/Nid989/LLM-Overview/blob/main/LLM_Quantization_Inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -qU python-dotenv
!pip install -qU bitsandbytes
!pip install -qU transformers
!pip install -qU peft
!pip install -qU accelerate


In [2]:
import os
import dotenv
import torch
from torch import cuda, bfloat16
from peft import (
    LoraConfig,
    get_peft_model,
    TaskType
)
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    AutoTokenizer,
    pipeline
)

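# load environment variables (including saved authorization tokens) from a local file
_ = dotenv.load_dotenv("./.env.txt")
# set the Hugging Face authorization token; the "HF_AUTH" fallback is only a placeholder and will not authenticate
hf_auth = os.environ.get("HF_AUTH") or "HF_AUTH"
# use the current CUDA device if one is available, otherwise fall back to CPU
device = f"cuda:{cuda.current_device()}" if torch.cuda.is_available() else "cpu"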

#### `LLaMA 2`

`LoRA`

In [1]:
%%capture
# NOTE: requires Colab Pro to run
model_id = "meta-llama/Llama-2-7b-chat-hf"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    token=hf_auth  # `token` replaces the deprecated `use_auth_token` argument
)

config = LoraConfig(
    r=8, # rank of the update matrices; a lower rank gives smaller update matrices with fewer trainable parameters
    lora_alpha=32, # LoRA scaling factor; the low-rank update is scaled by lora_alpha / r
    target_modules=["self_attn.q_proj", "self_attn.k_proj",
                    "self_attn.v_proj", "self_attn.o_proj"], # modules that receive LoRA update matrices; names are specific to each model
    lora_dropout=0.05, # dropout probability applied inside the LoRA layers
    bias="none", # specifies whether bias parameters should be trained
    task_type=TaskType.CAUSAL_LM
)

# transition original model to have LoRA layers
model = get_peft_model(model, config)
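
To make the effect of `r=8` concrete, the following is a rough, model-free sketch that counts how many parameters the LoRA adapters above would add. It assumes the Llama-2-7B dimensions from its published configuration (hidden size 4096, 32 decoder layers, square attention projections); the constant and function names are illustrative and not part of PEFT.

```python
# Back-of-the-envelope LoRA parameter count; no model download required.
# Assumed Llama-2-7B dimensions: hidden size 4096, 32 decoder layers.
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
TARGETS_PER_LAYER = 4  # q_proj, k_proj, v_proj, o_proj


def lora_param_count(r, d_in, d_out):
    """Parameters in one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


per_module = lora_param_count(8, HIDDEN_SIZE, HIDDEN_SIZE)
total_lora = per_module * TARGETS_PER_LAYER * NUM_LAYERS
full_per_module = HIDDEN_SIZE * HIDDEN_SIZE  # weights in one full projection

print(per_module)                     # 65536
print(total_lora)                     # 8388608
print(full_per_module // per_module)  # 256
```

Under these assumptions each adapter is 256 times smaller than the projection it modifies. On a loaded PEFT model, `model.print_trainable_parameters()` reports the measured figure.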

`Q-LoRA`

In [3]:
%%capture
model_id = "meta-llama/Llama-2-13b-chat-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4", # 4-bit quantization data type; options: "nf4" (4-bit NormalFloat) or "fp4" (plain 4-bit float, the library default)
    bnb_4bit_use_double_quant=True, # nested quantization: the quantization constants are themselves quantized, saving roughly 0.4 bits per parameter
    bnb_4bit_compute_dtype=bfloat16 # compute dtype; weights are stored in 4 bits, but computation still runs in 16- or 32-bit; options: float16, bfloat16, float32, ...
)

model_config = AutoConfig.from_pretrained(
    model_id,
    token=hf_auth
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map="auto",
    token=hf_auth
)
model.eval() # inference/evaluation mode, no parameter optimization
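
The "0.4 bits per parameter" figure attached to `bnb_4bit_use_double_quant` can be reproduced with simple arithmetic. The sketch below assumes the block sizes described in the QLoRA paper (64 weights per quantization block, 256 constants per second-level block, 8-bit second-level constants) and a nominal 13 billion parameters. The function names are illustrative, and the estimate ignores layers that stay in higher precision, so it is a rough lower bound rather than a measurement.

```python
# Rough estimate of 4-bit storage cost with and without double quantization.
# Assumptions: block sizes as described in the QLoRA paper.
WEIGHT_BITS = 4
BLOCK_SIZE = 64       # weights sharing one quantization constant
DQ_BLOCK_SIZE = 256   # constants sharing one second-level constant


def overhead_bits(double_quant):
    """Extra bits per parameter spent on quantization constants."""
    if not double_quant:
        # one 32-bit constant per block of 64 weights
        return 32 / BLOCK_SIZE
    # one 8-bit constant per block, plus one 32-bit constant per 256 blocks
    return 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * DQ_BLOCK_SIZE)


def estimated_gib(num_params, double_quant):
    """Approximate size in GiB of the quantized weights alone."""
    bits_per_param = WEIGHT_BITS + overhead_bits(double_quant)
    return num_params * bits_per_param / 8 / 2**30


saving = overhead_bits(False) - overhead_bits(True)
print(f"saving: {saving:.3f} bits per parameter")
print(f"13B without double quant: {estimated_gib(13e9, False):.2f} GiB")
print(f"13B with double quant:    {estimated_gib(13e9, True):.2f} GiB")
```

The saving works out to about 0.37 bits per parameter, which rounds to the 0.4 quoted in the configuration comment. On a loaded model, `model.get_memory_footprint()` gives the measured size.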

In [4]:
%%capture
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    token=hf_auth
)

In [5]:
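# transformers text-generation pipeline
llama2_qlora_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True, # include the prompt in the returned text
    task="text-generation",
    temperature=0.0001, # near-zero temperature makes sampling close to deterministic
    max_new_tokens=512, # upper bound on the number of generated tokens
    repetition_penalty=1.1 # values above 1.0 discourage repeating earlier tokens
)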
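
To illustrate why a temperature of `0.0001` behaves almost like greedy decoding, here is a minimal, self-contained sketch of temperature-scaled softmax. The logits are made-up values and the helper name is hypothetical; this is not how the pipeline is implemented internally, only the underlying formula.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical next-token scores

warm = softmax_with_temperature(logits, 1.0)
cold = softmax_with_temperature(logits, 0.0001)

print(warm)  # probability spread across all three tokens
print(cold)  # [1.0, 0.0, 0.0]
```

At temperature 1.0 the lower-scoring tokens keep a meaningful share of the probability, while at 0.0001 essentially all of it collapses onto the top-scoring token.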

In [6]:
res = llama2_qlora_pipeline("Write a short story about time travel.")
print(res[0]["generated_text"])

Write a short story about time travel.

Time Traveler's Dilemma

As soon as the time machine was completed, Emily couldn't wait to try it out. She had spent years building it, pouring over theories and diagrams, testing and retesting every component. Finally, she was ready to see if it would work.

She climbed inside and set the dials for a date 20 years into the future. The machine whirred to life, and before she knew it, she was standing in the middle of a bustling city street.

At first, everything seemed familiar. The buildings were taller and more modern than they were now, but the people and the energy of the city were the same. But then, something caught her eye. A group of people walking down the street were wearing clothes that she had never seen before. They were sleek and shimmering, like nothing she had ever imagined.

Emily felt a pang of excitement. She had always been fascinated by fashion, and the idea of seeing what the future held in store for style was too tempting t