## Fine-tuning with QLoRA

We will fine-tune using QLoRA (Quantized Low-Rank Adapters), a method designed to make fine-tuning large language models more memory-efficient. It combines two techniques: quantization and low-rank adapters (LoRA).

Quantization: Reduces the numerical precision of the model weights, making the model more compact in memory. This involves mapping high-precision floating-point values to lower-precision values, such as the 4-bit NF4 format configured later in this notebook.

Low-Rank Adapters (LoRA): Fine-tunes large pre-trained models by updating only a small number of trainable parameters, reducing computational complexity and memory usage.

Efficiency: Allows fine-tuning of massive models with billions of parameters on relatively small GPUs.

Cost-Effective: Reduces the need for expensive, high-memory computing resources.

Performance: Maintains high performance while being more memory-efficient.

QLoRA democratizes fine-tuning by making it accessible to those with limited hardware, enabling strong results without high-memory GPUs.
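To make the quantization idea concrete, here is a minimal sketch of block-wise 4-bit quantization in NumPy. It is a simplification: it uses a uniform absmax scheme rather than the NF4 data type that QLoRA uses through bitsandbytes, and the helper names (`quantize_absmax_4bit`, `dequantize`) and the block size of 64 are assumptions made for illustration.

```python
import numpy as np

def quantize_absmax_4bit(w, block_size=64):
    """Quantize a 1-D float array to signed 4-bit codes, one scale per block.

    Assumes len(w) is a multiple of block_size.
    """
    blocks = w.reshape(-1, block_size)
    # Map the largest magnitude in each block to the code 7
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    codes = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return codes, scales

def dequantize(codes, scales):
    """Recover approximate float weights from codes and per-block scales."""
    return (codes * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=4096).astype(np.float32)  # toy weight vector

codes, scales = quantize_absmax_4bit(w)
w_hat = dequantize(codes, scales)
print("max abs error:", np.abs(w - w_hat).max())
```

Each weight is stored as one of 16 codes plus a shared per-block scale, so the rounding error per weight is at most half a quantization step for its block.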

##### Install libraries (only needed once per instance)

In [None]:
!pip install -U bitsandbytes
!pip install -U git+https://github.com/huggingface/transformers.git
!pip install -U git+https://github.com/huggingface/peft.git
!pip install -U git+https://github.com/huggingface/accelerate.git
!pip install -U datasets scipy ipywidgets wandb

In [3]:
pip show accelerate

Name: accelerate
Version: 1.2.0.dev0
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: zach.mueller@huggingface.co
License: Apache
Location: /home/calee/.local/lib/python3.11/site-packages
Requires: huggingface_hub, numpy, packaging, psutil, pyyaml, safetensors, torch
Required-by: peft
Note: you may need to restart the kernel to use updated packages.


##### NOTE:
1. Go to https://huggingface.co/ and create an account or log in. Click the profile icon at the top right, then Settings -> Access Tokens -> Create new token (select all permissions). Copy the token and paste it into the `login(token=...)` call further down.

2. If you get this error: `OSError: You are trying to access a gated repo.`, you need to go to https://huggingface.co/mistralai/Mistral-7B-v0.1 and accept the usage terms.

In [4]:
from accelerate import FullyShardedDataParallelPlugin, Accelerator
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig
fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


### gem/viggo is video game dialogue training data

The printout below shows that it contains about 5k training samples.
Note that we do not need to log in to Hugging Face to use this data.

In [5]:
from datasets import load_dataset

train_dataset = load_dataset('gem/viggo', split='train')
eval_dataset = load_dataset('gem/viggo', split='validation')
test_dataset = load_dataset('gem/viggo', split='test')

print(train_dataset)
print(eval_dataset)
print(test_dataset)

Dataset({
    features: ['gem_id', 'meaning_representation', 'target', 'references'],
    num_rows: 5103
})
Dataset({
    features: ['gem_id', 'meaning_representation', 'target', 'references'],
    num_rows: 714
})
Dataset({
    features: ['gem_id', 'meaning_representation', 'target', 'references'],
    num_rows: 1083
})


##### But we still need to log in to use the Mistral LLM

In [6]:
!pip install huggingface_hub
from huggingface_hub import login

Defaulting to user installation because normal site-packages is not writeable


In [7]:
login(token="") # your token

##### Load LLM with bnb for quantization

In [8]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig


base_model_id = "mistralai/Mistral-7B-v0.1"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(base_model_id, quantization_config=bnb_config)

`low_cpu_mem_usage` was None, now default to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [9]:
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    model_max_length=512,
    padding_side="left",
    add_eos_token=True)
tokenizer.pad_token = tokenizer.eos_token
     

In [10]:
def tokenize(prompt):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=512,
        padding="max_length",
    )
    result["labels"] = result["input_ids"].copy()
    return result

In [11]:
def generate_and_tokenize_prompt(data_point):
    full_prompt =f"""Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values.
This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute'].
The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']


### Target sentence:
{data_point["target"]}


### Meaning representation:
{data_point["meaning_representation"]}
"""
    return tokenize(full_prompt)

In [12]:
tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)
print(tokenized_train_dataset[4]['input_ids'])
print(len(tokenized_train_dataset[4]['input_ids']))

[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 12628, 264, 2718, 12271, 5122, 272, 14164, 5746, 9283, 302, 272, 2787, 12271, 390, 264, 2692, 908, 395, 9623, 304, 6836, 3069, 28

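##### Above we see the token IDs for the prompt: it is left-padded to 512 tokens with the pad token (ID 2, the same as EOS), followed by the BOS token (ID 1) and the prompt tokens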

In [13]:
print("Target Sentence: " + test_dataset[1]['target'])
print("Meaning Representation: " + test_dataset[1]['meaning_representation'] + "\n")

Target Sentence: Earlier, you stated that you didn't have strong feelings about PlayStation's Little Big Adventure. Is your opinion true for all games which don't have multiplayer?
Meaning Representation: verify_attribute(name[Little Big Adventure], rating[average], has_multiplayer[no], platforms[PlayStation])
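The meaning representation printed above has the shape `function(attribute[value], ...)`. As a rough sketch, a small parser can turn such a string into a function name and an attribute dictionary, which is handy for checking whether a model output is well-formed. The helper name `parse_mr` is hypothetical and not part of the dataset or any library; repeated attributes would be collapsed by the dictionary.

```python
import re

def parse_mr(text):
    """Parse 'function(attr[value], ...)' into (function, {attr: value}).

    Returns None when the text does not follow that format.
    """
    match = re.fullmatch(r"\s*(\w+)\((.*)\)\s*", text)
    if match is None:
        return None  # not in the expected format
    function, body = match.groups()
    # Each attribute looks like name[value]; values may contain commas
    attributes = dict(re.findall(r"(\w+)\[([^\]]*)\]", body))
    return function, attributes

example = (
    "verify_attribute(name[Little Big Adventure], rating[average], "
    "has_multiplayer[no], platforms[PlayStation])"
)
function, attributes = parse_mr(example)
print(function)
print(attributes)
```

A free-text sentence has no `function(...)` wrapper, so `parse_mr` returns `None` for it.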



### Test the model before fine tuning

In [14]:
# from peft import PeftModel
# offload_dir = "./offload_dir"

# model = PeftModel.from_pretrained(model, base_model_id, offload_folder=offload_dir)

eval_prompt = """Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values.
This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute'].
The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']


### Target sentence:
Earlier, you stated that you didn't have strong feelings about PlayStation's Little Big Adventure. Is your opinion true for all games which don't have multiplayer?


### Meaning representation:
"""

# Define the device (GPU if available, else CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model to the same device
model = model.to(device)

# Tokenize the input and move it to the same device.
# Note: `tokenizer` was created with add_eos_token=True, so an EOS token may be
# appended to the prompt here, which can affect generation.
model_input = tokenizer(eval_prompt, return_tensors="pt").to(device)

# Set the model to evaluation mode and generate output without gradient tracking
model.eval()
with torch.no_grad():
    generated_ids = model.generate(
        **model_input,
        max_new_tokens=256,
        pad_token_id=tokenizer.eos_token_id,  # EOS doubles as the pad token (ID 2)
    )

    # Decode the generated ids and print the result
    print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))

Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values.
This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute'].
The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']


### Target sentence:
Earlier, you stated that you didn't have strong feelings about PlayStation's Little Big Adventure. Is your opinion true for all games which don't have multiplayer?


### Meaning representation:
для_all_games_which_don't_have_multiplayer_is_your_opinion_true_for_playstation's_little_big_adventure


### Meaning representa

##### Above we see the LLM trying to produce the meaning representation, but the output is a little weird: it is not in the expected `function(attribute[value])` format

In [15]:
from peft import prepare_model_for_kbit_training
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )


##### The cell below defines the LoRA config, then prints the number of trainable parameters (the ones that require gradients)

r=8: This parameter controls the rank of the LoRA, determining the number of components used to approximate the original weight matrices.

lora_alpha=16: This is a scaling factor for the LoRA updates.

target_modules: A list of specific model components to apply LoRA to, such as "q_proj", "k_proj", and so on. 

- In the self-attention mechanism, the input is projected into three different spaces: queries (q_proj), keys (k_proj), and values (v_proj). Here's a brief explanation of these projections:
    - Queries (q_proj): These are the transformed representations of the input that are used to match against keys.

    - Keys (k_proj): These represent the input data and are matched against the queries to determine relevance.

    - Values (v_proj): These carry the actual information and are combined based on the attention scores calculated from the queries and keys.
    
- o_proj (Output Projection):

    This layer projects the output of the attention mechanism into the same dimension as the input. It's responsible for transforming the attention output before it is passed to the next layer in the network.

- gate_proj (Gating Projection):

    In Mistral this layer belongs to the feed-forward (MLP) block of each decoder layer; it is not a recurrent gate as in Gated Recurrent Units (GRUs). Its output is passed through an activation function and multiplied element-wise with the output of up_proj, which controls how much of each feature is emphasized or suppressed.

- up_proj (Up Projection) and down_proj (Down Projection):

    These are the other two linear layers of the same MLP block, and they change the dimensionality of the data.

    up_proj: This layer increases the dimensionality of the data, projecting the hidden state from the model dimension (4096 here) to a larger intermediate dimension.

    down_proj: This layer decreases the dimensionality of the data, projecting the gated intermediate activations back down to the model dimension.

- lm_head (Language Model Head):

    This is the final layer in a language model used to produce the output. It converts the hidden states of the model into logits, one unnormalized score per vocabulary token; a softmax turns these into probabilities for the next token.

    Together, the attention projections (q_proj, k_proj, v_proj, o_proj) allow the model to weigh different parts of the input differently, enabling it to focus on the most relevant information for a particular task.

bias="none": This controls which bias terms are trained alongside the LoRA weights; "none" means no bias parameters are trained.

lora_dropout=0.05: This specifies the dropout rate applied to the LoRA layers.

task_type="CAUSAL_LM": This defines the task type, in this case, causal language modeling.
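To illustrate what `r` and `lora_alpha` do, here is a minimal NumPy sketch of a single LoRA-adapted linear layer. It assumes the standard LoRA formulation, in which the frozen weight `W` is left untouched and a low-rank update `B @ A`, scaled by `lora_alpha / r`, is added to its output. The toy layer width of 64 and the function name `lora_forward` are assumptions for illustration, and dropout is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 64, 64      # toy sizes; the notebook's attention layers are 4096 wide
r, lora_alpha = 8, 16     # same values as in the LoRA config

W = rng.normal(size=(d_out, d_in))          # frozen base weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init

def lora_forward(x, W, A, B, r, lora_alpha):
    """Base projection plus the scaled low-rank update (dropout omitted)."""
    scaling = lora_alpha / r
    return x @ W.T + scaling * (x @ A.T @ B.T)

x = rng.normal(size=(2, d_in))
y = lora_forward(x, W, A, B, r, lora_alpha)

# With B initialised to zero, the adapted layer starts out identical to the base layer
print(np.allclose(y, x @ W.T))
print(A.size + B.size, "trainable values vs", W.size, "frozen values")
```

With `r=8` and `lora_alpha=16` the update is scaled by 2. Only `A` and `B` receive gradients, which is why the trainable fraction stays small.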


In [16]:
from peft import LoraConfig, get_peft_model
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
print_trainable_parameters(model)
# Apply the accelerator. You can comment this out to remove the accelerator.
# model = accelerator.prepare_model(model)

trainable params: 21260288 || all params: 3773331456 || trainable%: 0.5634354746703705
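The trainable parameter count can be reproduced by hand from the layer shapes. The sketch below assumes the Mistral-7B-v0.1 dimensions: the hidden size (4096), vocabulary size (32000), and layer count (32) appear in the model printout, while the key/value projection width (1024) and MLP intermediate size (14336) are taken from the published model configuration and are assumptions here. The helper `lora_param_count` is hypothetical.

```python
r = 8
hidden = 4096         # q_proj in/out features in the printed model
vocab = 32000         # embedding size in the printed model
n_layers = 32         # number of decoder layers in the printed model
kv_dim = 1024         # assumption: from the Mistral-7B-v0.1 config
intermediate = 14336  # assumption: from the Mistral-7B-v0.1 config

def lora_param_count(in_features, out_features, rank):
    # lora_A has shape (rank, in_features); lora_B has shape (out_features, rank)
    return rank * (in_features + out_features)

# (in_features, out_features) of each targeted module inside one decoder layer
per_layer_shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

per_layer = sum(lora_param_count(i, o, r) for i, o in per_layer_shapes.values())
total = n_layers * per_layer + lora_param_count(hidden, vocab, r)  # plus lm_head
print(total)
```

Under these assumptions the total comes to 21260288, matching the `trainable params` value printed above.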


In [17]:
print(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralSdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_p

In [18]:
if torch.cuda.device_count() > 1: # If more than 1 GPU
    model.is_parallelizable = True
    model.model_parallel = True
     

### Perform fine tuning training

- model: The model to be trained.
- train_dataset and eval_dataset: The datasets for training and evaluation.
- args: The training arguments, including:
    - output_dir: Directory to save the model.
    - warmup_steps: Number of warm-up steps.
    - per_device_train_batch_size: Batch size per device.
    - gradient_accumulation_steps: Number of steps to accumulate gradients before each optimizer update.
    - max_steps: Maximum number of training steps.
    - learning_rate: Learning rate for optimization.
    - logging_steps: Frequency of logging updates.
    - bf16: Whether to use bfloat16 precision.
    - optim: Optimizer to use (paged_adamw_8bit or adamw_hf).
    - logging_dir: Directory for logs.
    - save_strategy and save_steps: Strategy and frequency of saving checkpoints.
    - evaluation_strategy and eval_steps: Strategy and frequency of evaluation.
    - do_eval: Whether to perform evaluation.
    - report_to: Reporting mechanism (e.g., "wandb" for Weights & Biases).
- data_collator: The data collator for language modeling, specifying not to use masked language modeling (mlm=False).
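As a quick sanity check on how these arguments interact, the arithmetic below works out the effective batch size and how much of the training set is seen. It uses the values from the training cell and the 5103-row training split; a single GPU is an assumption.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 1000
eval_steps = 50
num_devices = 1        # assumption: training on a single GPU
num_train_rows = 5103  # size of the gem/viggo training split

# Samples contributing to each optimizer update
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
samples_seen = effective_batch_size * max_steps
epochs = samples_seen / num_train_rows
num_evaluations = max_steps // eval_steps

print(f"effective batch size: {effective_batch_size}")
print(f"samples seen: {samples_seen} (~{epochs:.2f} epochs)")
print(f"evaluations during training: {num_evaluations}")
```

Under these assumptions each optimizer update uses 8 samples, and 1000 steps correspond to roughly one and a half passes over the training data.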

In [None]:
import transformers
from datetime import datetime


project = "viggo-finetune"
base_model_name = "mistral"
run_name = base_model_name + "-" + project
output_dir = "./" + run_name


tokenizer.pad_token = tokenizer.eos_token


trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=transformers.TrainingArguments(
        output_dir=output_dir,
        warmup_steps=5,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=1000,
        learning_rate=2.5e-5, # Want about 10x smaller than the Mistral learning rate
        logging_steps=50,
        bf16=False,
        optim="paged_adamw_8bit", 
#         optim="adamw_hf",
        logging_dir="./logs",        # Directory for storing logs
        save_strategy="steps",       # Save checkpoints on a step schedule
        save_steps=50,               # Save a checkpoint every 50 steps
        evaluation_strategy="steps", # Evaluate on a step schedule
        eval_steps=50,               # Evaluate every 50 steps
        do_eval=True,                # Run evaluation on the validation set
        report_to="none",           # Use "wandb" if you want to use wandb
#         run_name=f"{run_name}-{datetime.now().strftime('%Y-%m-%d-%H-%M')}",          # Name of the W&B run (optional)
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
  return fn(*args, **kwargs)


Step,Training Loss,Validation Loss
50,0.7422,0.271674


  return fn(*args, **kwargs)


In [19]:

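# Reload the quantized base model so the fine-tuned adapter can be attached to it
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,  # Mistral, same as before
    quantization_config=bnb_config,  # Same quantization config as before
    device_map="auto",
    trust_remote_code=True,
    token=True,  # replaces the deprecated use_auth_token argument
)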
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

### Test the model after fine tuning

In [20]:

from peft import PeftModel
# ft_model = PeftModel.from_pretrained(base_model, "mistral-viggo-finetune/checkpoint-1000")
ft_model = PeftModel.from_pretrained(base_model, "mistral-viggo-finetune/checkpoint-1000-full")

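# Generate with the fine-tuned adapter, reusing `model_input` from the pre-fine-tuning test
ft_model.eval()
with torch.no_grad():
    generated_ids = ft_model.generate(**model_input, max_new_tokens=100, pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))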


Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values.
This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute'].
The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']


### Target sentence:
Earlier, you stated that you didn't have strong feelings about PlayStation's Little Big Adventure. Is your opinion true for all games which don't have multiplayer?


### Meaning representation:
 Given a game without multiplayer, do you feel indifferent about it?


### Target sentence:
I'm curious, why do you think that 

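##### After fine tuning, the output is a fluent paraphrase of the target sentence rather than a garbled string, although it still does not follow the `function(attribute[value])` format of the reference meaning representation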