# **Vanilla QLoRA Implementation**

Initially we were confused by the term Vanilla QLoRA, because our implementation of QLoRA includes a LoRA portion that performs fine-tuning, and we were not certain how to implement the LoRA portion without fine-tuning. After discussing this with Professor Gandhi, we found that quantizing the model alone is acceptable for the Vanilla QLoRA requirement. To produce this model, we followed tutorials from DataCamp with some slight variations.

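This is the access token that I created to access the various Hugging Face models we have attempted to use. It is read from an environment variable so that the secret is not stored in the notebook.

In [1]:
import os
# Kayleigh's Hugging Face access token, read from the HF_TOKEN environment variable instead of being hardcoded
access_token = os.environ.get("HF_TOKEN")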

## **Installations and Imports**

In [2]:
%%capture
%pip install accelerate peft bitsandbytes transformers trl

In [11]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig
from trl import SFTTrainer

## **Quantization**

### **Define Quantization Parameters**

Documentation containing information about the parameters: https://huggingface.co/docs/transformers/v4.47.1/en/main_classes/quantization#transformers.BitsAndBytesConfig

In [12]:
compute_dtype = torch.float16

quant_config = BitsAndBytesConfig(
    # Load the weights in 4-bit precision by replacing the Linear layers
    # with FP4/NF4 layers from bitsandbytes
    load_in_4bit=True,
    # Quantization data type used in the bnb.nn.Linear4bit layers ("fp4" or "nf4")
    bnb_4bit_quant_type="nf4",
    # Data type used for computation
    bnb_4bit_compute_dtype=compute_dtype,
    # Disable double (nested) quantization, so the quantization constants
    # are not themselves quantized a second time
    bnb_4bit_use_double_quant=False,
)
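
To make the effect of these parameters more concrete, here is a minimal sketch of blockwise absmax quantization to 4 bits. It is a simplified illustration and not the bitsandbytes implementation: it uses 16 evenly spaced levels rather than the NF4 codebook, the example weights are made up, and the helper names `quantize_block` and `dequantize_block` are hypothetical.

```python
def quantize_block(values, n_levels=16):
    """Map floats to integer codes in [0, n_levels - 1], scaled by the block's absmax."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid dividing by zero for an all-zero block
    half = (n_levels - 1) / 2  # 7.5 for 16 levels
    codes = [round((v / absmax + 1) * half) for v in values]
    return codes, absmax


def dequantize_block(codes, absmax, n_levels=16):
    """Recover approximate floats from the integer codes and the stored absmax."""
    half = (n_levels - 1) / 2
    return [(c / half - 1) * absmax for c in codes]


# Made-up weights standing in for one quantization block
weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.50, -0.02, 0.41]

codes, absmax = quantize_block(weights)
restored = dequantize_block(codes, absmax)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("4-bit codes:", codes)
print("block absmax:", absmax)
print("largest reconstruction error:", max_error)
```

Each weight is stored as a 4-bit code plus one shared scale per block, and the reconstruction error is bounded by half a quantization step. NF4 follows the same idea but places its 16 levels unevenly.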

### **Load Model**

Below we define the model we intend to quantize, SmolLM2-135M. It was produced by Hugging Face and can be accessed at https://huggingface.co/HuggingFaceTB/SmolLM2-135M

In [13]:
base_model = "HuggingFaceTB/SmolLM2-135M"

This loads SmolLM2-135M with 4-bit quantization.

In [14]:
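model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quant_config,
    # Place the entire model on GPU 0
    device_map={"": 0},
    token=access_token,
)
# Disable the key/value cache
model.config.use_cache = False
# Tensor-parallelism degree used during pretraining; 1 selects the standard linear-layer computation
model.config.pretraining_tp = 1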

Loads the tokenizer for the model SmolLM2-135M.

In [15]:
tokenizer = AutoTokenizer.from_pretrained(base_model, token=access_token, trust_remote_code=True)
# Use the end-of-sequence token as the padding token
tokenizer.pad_token = tokenizer.eos_token
# Pad on the right-hand side of each sequence
tokenizer.padding_side = "right"

This was the portion of our original QLoRA implementation that used the LoRA technique. It is commented out because the Vanilla QLoRA requirement only calls for quantization.

In [18]:
# peft_params = LoraConfig(
#     lora_alpha=16,
#     lora_dropout=0.1,
#     r=64,
#     bias="none",
#     task_type="CAUSAL_LM",
# )
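
For a sense of scale, the sketch below estimates how many trainable parameters this LoRA configuration would have added. It assumes that the adapters target only the `q_proj` and `v_proj` layers (our understanding of the PEFT default for Llama-style models when `target_modules` is not set), and it takes the layer shapes from the quantized model printed later in this notebook. The variable names are illustrative.

```python
# LoRA adds two low-rank matrices per adapted layer: A (r x in) and B (out x r)
r = 64
lora_alpha = 16
num_layers = 30

# (in_features, out_features) for the assumed target modules
target_shapes = {
    "q_proj": (576, 576),
    "v_proj": (576, 192),
}

# Parameters added per decoder layer: r * (in + out) for each adapted projection
params_per_layer = sum(r * (n_in + n_out) for n_in, n_out in target_shapes.values())
lora_params = params_per_layer * num_layers

# Factor applied to the low-rank update
scaling = lora_alpha / r

print(f"LoRA parameters per decoder layer: {params_per_layer:,}")
print(f"Total LoRA parameters: {lora_params:,}")
print(f"Scaling factor (lora_alpha / r): {scaling}")
```

Under these assumptions the adapters would contain roughly 3.7 million parameters, a small fraction of the 135M-parameter base model.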

## **Quantized Model vs Base Model**

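This prints the architecture of the quantized model. The attention and MLP projection layers appear as Linear4bit layers, which confirms that they were quantized.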

In [16]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 576)
    (layers): ModuleList(
      (0-29): 30 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=576, out_features=576, bias=False)
          (k_proj): Linear4bit(in_features=576, out_features=192, bias=False)
          (v_proj): Linear4bit(in_features=576, out_features=192, bias=False)
          (o_proj): Linear4bit(in_features=576, out_features=576, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=576, out_features=1536, bias=False)
          (up_proj): Linear4bit(in_features=576, out_features=1536, bias=False)
          (down_proj): Linear4bit(in_features=1536, out_features=576, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
      )
  

Saves the quantized model.

In [None]:
model.save_pretrained("quantized_model")

Loads the original SmolLM2-135M model without quantization.

In [None]:
original_model = AutoModelForCausalLM.from_pretrained(base_model, token=access_token)

Saves the original model without quantization.

In [None]:
original_model.save_pretrained("non_quantized_model")

### **Model Size Comparison**

This size comparison code was generated with the assistance of ChatGPT to check that the vanilla quantization was successful. The output below shows that the base model, SmolLM2-135M, is about 4.6 times larger on disk than the quantized model (538.09 MB versus 116.55 MB).

In [None]:
import os

def get_directory_size(directory):
    """Return the total size in bytes of all files under `directory`."""
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(directory):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            total_size += os.path.getsize(fp)
    return total_size

non_quantized_size = get_directory_size("non_quantized_model")
quantized_size = get_directory_size("quantized_model")

print(f"Non-quantized model size: {non_quantized_size / 1e6:.2f} MB")
print(f"Quantized model size: {quantized_size / 1e6:.2f} MB")

Non-quantized model size: 538.09 MB
Quantized model size: 116.55 MB
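
The measured sizes can be sanity-checked with a rough estimate based on the layer shapes printed earlier. This is a sketch under several assumptions that the notebook does not confirm: the output head shares its weights with the embedding, a final RMSNorm of width 576 follows the decoder layers (the printout is truncated before that point), the non-quantized weights of the quantized model are saved in float16, and bitsandbytes stores one float32 scaling constant per block of 64 weights.

```python
vocab_size, hidden, intermediate, kv_dim, num_layers = 49152, 576, 1536, 192, 30

# Weights in the Linear4bit layers of one decoder layer
attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # q_proj, o_proj, k_proj, v_proj
mlp = 3 * hidden * intermediate                   # gate_proj, up_proj, down_proj
linear_params = num_layers * (attn + mlp)

# Weights that stay unquantized: embedding and RMSNorm layers
embed_params = vocab_size * hidden
norm_params = num_layers * 2 * hidden + hidden    # assumes one final norm
other_params = embed_params + norm_params

total_params = linear_params + other_params

# Original model: every parameter stored as float32 (4 bytes)
fp32_bytes = total_params * 4

# Quantized model (assumed layout): 4 bits per linear weight,
# one float32 constant per block of 64 weights, float16 for everything else
block_size = 64
quant_bytes = linear_params // 2 + (linear_params // block_size) * 4 + other_params * 2

ratio = fp32_bytes / quant_bytes

print(f"Estimated parameter count: {total_params:,}")
print(f"Estimated non-quantized size: {fp32_bytes / 1e6:.2f} MB")
print(f"Estimated quantized size: {quant_bytes / 1e6:.2f} MB")
print(f"Estimated ratio: {ratio:.2f}")
```

Under these assumptions the estimate is about 538.06 MB for the original model and 116.41 MB for the quantized model, close to the measured 538.09 MB and 116.55 MB. The ratio stays well below the factor of 8 between 32-bit and 4-bit storage because the embedding matrix, which holds about a fifth of the parameters, is not quantized.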


### **Response Testing**

This prompts the quantized model for a response to "Who is Leonardo Da Vinci?"

In [None]:
logging.set_verbosity(logging.CRITICAL)

prompt = "Who is Leonardo Da Vinci?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(prompt, num_return_sequences=1)
print(result[0]['generated_text'])

Who is Leonardo Da Vinci?

Leonardo da Vinci was a famous Italian Renaissance artist, inventor, and scientist. He was born in 1452 in Vinci, a small town in the province of Vinci in the province of Vinci in the Duchy of Milan. He was the son of a wealthy merchant and a noblewoman. He was a very talented and gifted man. He was a very good student and a very good writer. He was a very good artist and a very good scientist. He was a very good scientist and a very good artist. He was a very good scientist and a very good artist. He was a very good scientist and a very good artist. He was a very good scientist and a very good artist. He was a very good scientist and a very good artist. He was a very good scientist and a very good artist. He was a very good scientist and a very good artist. He was a very good scientist and a very good artist. He was


This prompts the original model for a response to "Who is Leonardo Da Vinci?"

In [None]:
prompt = "Who is Leonardo Da Vinci?"
pipe = pipeline(task="text-generation", model=original_model, tokenizer=tokenizer, max_length=200)
result = pipe(prompt, num_return_sequences=1)
print(result[0]['generated_text'])

Who is Leonardo Da Vinci?

Leonardo da Vinci was born in 1452 in the town of Vinci, Italy. He was the son of a wealthy merchant. His father was a wealthy merchant who was also a member of the Medici family. His father was a member of the family of the Medici family. The Medici family was a family of bankers and merchants. The family was very wealthy and had a lot of power. The family was very influential in the Italian Renaissance.

Leonardo da Vinci was born in a time when the Renaissance was in full swing. The Renaissance was a time of great change in art, science, and literature. The Renaissance was a time of great change in art, science, and literature. The Renaissance was a time of great change in art, science, and literature. The Renaissance was a time of great change in art, science, and literature. The Renaissance was a time of great change in art, science, and literature. The Renaissance


## **References**
https://www.datacamp.com/tutorial/fine-tuning-llama-2

https://huggingface.co/docs/hub/security-tokens

https://huggingface.co/HuggingFaceTB/SmolLM2-135M

https://huggingface.co/docs/transformers/v4.47.1/en/main_classes/quantization#transformers.BitsAndBytesConfig
