# Notebook
This notebook was tested on Google Colab.
Data preparation, fine-tuning, and inference are done with Llama 3.2 11B, which has vision capability. The code is adapted from other public sources (linked below) that were designed to be flexible, so it can be used with other models as long as you adjust the parameters and components to match them.

The test was run on an A100 GPU.

[Reference1](https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/fine-tune-multimodal-llms-with-trl.ipynb), [Reference2](https://github.com/huggingface/huggingface-llama-recipes/blob/main/fine_tune/Llama-Vision%20FT.ipynb)

## Install Packages

In [None]:
%pip install tensorboard pillow

# Install Hugging Face libraries
%pip install  --upgrade \
  "transformers==4.45.1" \
  "datasets==3.0.1" \
  "accelerate==0.34.2" \
  "evaluate==0.4.3" \
  "bitsandbytes==0.44.0" \
  "trl==0.11.1" \
  "peft==0.13.0" \
  "qwen-vl-utils"

# Depending on the Colab version, you may need to reinstall `torchvision`
# %pip install torchvision

## Model
You need to get access first. Visit the model page on Hugging Face and check whether you have been granted access to the model.
For Llama 3.2 11B and 90B, [visit here](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct).

Because GPU availability is limited and a lightweight model can still perform well, we will use a quantized model. Also, with a small number of data points, full fine-tuning can lead to overfitting. So I'm going to use LoRA, which tries to approximate what full fine-tuning (FFT) would have done while training a much smaller number of parameters.
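To get a rough feel for why 4-bit loading helps on a single GPU, the sketch below estimates the memory taken by the weights alone. The figure of 11 billion parameters is an approximation taken from the model name, and the helper `weight_memory_gb` is hypothetical. Activations, optimizer state, quantization constants, and any layers left unquantized are ignored, so treat the numbers as a lower bound.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 11e9  # assumption: roughly 11B parameters, as the model name suggests

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16 bits per weight in bfloat16
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4 bits per weight when quantized

print(f"bf16: ~{bf16_gb:.1f} GB, 4-bit: ~{nf4_gb:.1f} GB")
```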

In [1]:
from huggingface_hub import login

login()  # USE YOUR TOKEN HERE

Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [2]:
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

# Hugging Face model id
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Configuration for quantizing model into 4bit normal float (nf4)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=False,  # Double quantization would also quantize the quantization constants; disabled here for better performance
)

# Load model and processor
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
    # attn_implementation="flash_attention_2", # not supported for training
)
processor = AutoProcessor.from_pretrained(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

## Dataset
We want the dataset to have three items: query, image, and answer.
 - Query: if it exists in a suitable form, it can be converted to the format the model was trained on
 - Image: a PIL object
 - Answer: the ground-truth text; the difference between it and the model's prediction is computed as the loss (cross-entropy)
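As a small illustration of the cross-entropy loss mentioned in the last bullet, the sketch below computes it for a toy 4-token vocabulary in plain Python. The logits and target ids are made-up numbers and `cross_entropy` is a hypothetical helper; in training, the framework computes this over the full vocabulary for every answer token.

```python
import math


def cross_entropy(logits: list[float], target: int) -> float:
    """Negative log-probability of the target token under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target]


# Toy example: two answer positions, each with logits over a 4-token vocabulary
step_logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 3.0, 0.0]]
answer_ids = [0, 2]  # ground-truth token id at each position

losses = [cross_entropy(logits, target) for logits, target in zip(step_logits, answer_ids)]
mean_loss = sum(losses) / len(losses)
print(f"mean loss: {mean_loss:.4f}")
```

A uniform prediction over `V` tokens gives a loss of `log(V)`, which is a handy reference point when reading training logs.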

In [22]:
from datasets import load_dataset

# Load the first 150 training samples from the hub
dataset_id = "philschmid/amazon-product-descriptions-vlm"
dataset = load_dataset(dataset_id, split="train[:150]")

We want to combine information from several columns of the dataset above into a chat template.
Although the details vary slightly between models, chat templates usually consist of `role` and `content`. `role` can be system, user, or assistant, but we won't include system here. Also notice that **the image is not provided here**. Images will be provided later.

In [23]:
# Build the chat-formatted query and answer for Llama and store them in a new column
def add_text_feature(sample):

    # Note: the image is not part of the prompt; it is passed separately to the processor
    prompt= """You are an expert product description writer for Amazon. Create a Short Product description based on the provided ##PRODUCT NAME## and ##CATEGORY## and image.
    Only return description. The description should be SEO optimized and for a better mobile search experience.

    ##PRODUCT NAME##: {product_name}
    ##CATEGORY##: {category}"""

    text = [{"role": "user",
            "content": [
                {"type": "image",},
                {"type": "text", "text": prompt.format(product_name=sample["Product Name"], category=sample["Category"]),}
                ]},
            {"role": "assistant",
            "content": [
                {"type": "text", "text": sample["description"]}]}
            ]
    sample['text'] = text
    return sample

In [24]:
# Add a `text` column that holds the full conversation (query and answer)
dataset = dataset.map(add_text_feature)

Map:   0%|          | 0/150 [00:00<?, ? examples/s]

In [25]:
# Split the dataset into train and test sets
split_dataset = dataset.train_test_split(test_size=0.2)

# Access the train and test splits
train_dataset = split_dataset['train']
eval_dataset = split_dataset['test']

The trainer uses `collate_fn` to process our dataset. It basically
 - makes sure the query, image, and answer follow the appropriate format
 - converts them to tensors
 - specifies the components to be ignored during loss calculation
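The last step can be sketched with plain lists. Positions whose label is `-100` are skipped by PyTorch's cross-entropy loss by default, so padding and image placeholder tokens get that value. The token ids below (0 for padding, 9 for the image placeholder) and the helper `mask_labels` are hypothetical; the real ids come from the processor.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy ignores by default


def mask_labels(input_ids, pad_token_id, image_token_id):
    """Copy input_ids and replace padding / image placeholder ids with IGNORE_INDEX."""
    skip = {pad_token_id, image_token_id}
    return [[IGNORE_INDEX if tok in skip else tok for tok in row] for row in input_ids]


# Hypothetical ids: 0 = padding, 9 = image placeholder
batch_input_ids = [
    [9, 11, 12, 13, 0, 0],      # shorter sample, padded to the batch length
    [9, 21, 22, 23, 24, 25],
]
labels = mask_labels(batch_input_ids, pad_token_id=0, image_token_id=9)
print(labels)
```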

In [27]:
def collate_fn(examples):

    texts = [processor.apply_chat_template(example["text"], tokenize=False) for example in examples]
    images = [example["image"] for example in examples]

    # Tokenize the texts and process the images
    batch = processor(text=texts, images=images, return_tensors="pt", padding=True)

    # The labels are the input_ids, and we mask the padding tokens in the loss computation
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    # Ignore the image token index in the loss computation (model specific)
    image_token_id = processor.tokenizer.convert_tokens_to_ids(processor.image_token)
    labels[labels == image_token_id] = -100
    batch["labels"] = labels

    return batch

## Trainer!

The `SFTTrainer` supports a native integration with `peft`, which makes it easy to tune LLMs efficiently using, e.g., QLoRA. We only need to create our `LoraConfig` and provide it to the trainer. Our `LoraConfig` parameters are based on the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf) and Sebastian Raschka's [blog post](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms).

In [28]:
from peft import LoraConfig

# LoRA config based on QLoRA paper & Sebastian Raschka experiment
peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=8,
        bias="none",
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
)
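To see why a rank of 8 keeps the number of trainable parameters small, the sketch below counts the parameters of one LoRA adapter against the full weight matrix it sits next to. The 4096 x 4096 shape is an assumption chosen for illustration, not a value read from the model, and `lora_params` is a hypothetical helper.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


# Assumption for illustration only: a square 4096 x 4096 projection, rank 8
D_IN, D_OUT, RANK = 4096, 4096, 8

full = D_IN * D_OUT
adapter = lora_params(D_IN, D_OUT, RANK)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.4%}")
```

Only the adapter matrices receive gradients; the base weight stays frozen.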

This notebook is for illustrative purposes, so I will run it for only 1 epoch.
How can we compute the validation loss in that case? We can change how often the validation loss is calculated by setting `eval_strategy="steps"`. The cell below shows how to compute the number of steps per training epoch. In our scenario, both the batch size and the number of gradient accumulation steps affect it.

In [29]:
grad_acc_step = 8
train_batch_size = 2
num_train_steps = ( len(train_dataset) // grad_acc_step ) // train_batch_size
print(f"Training Step per Epoch: {num_train_steps}")

Training Step per Epoch: 7


In [30]:
from trl import SFTConfig, SFTTrainer
args = SFTConfig(
    num_train_epochs=1,
    max_seq_length=1024,  # Adjust according to the length of training text
    eval_strategy="steps",
    eval_steps=3,
    logging_strategy="steps",
    logging_steps=3,
    output_dir="llama-32-11B-ft",
    per_device_eval_batch_size=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # update parameters after accumulating gradients over this many batches
    gradient_checkpointing=True,    # recompute some activations in the backward pass instead of storing them all
    gradient_checkpointing_kwargs={'use_reentrant': False},
    remove_unused_columns=False,
    dataset_kwargs={'skip_prepare_dataset': True},
    bf16=True,
    do_eval=True,
    report_to="none",
    dataloader_pin_memory=False,

    # log_level='info',
    # prediction_loss_only=False,
    label_names=["labels"],  # To show validation error
)
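With step-based evaluation, the validation loss is computed whenever the optimizer step count is a multiple of `eval_steps`. The sketch below works this out for the settings above, assuming 120 training samples (80% of the 150 loaded); `eval_points` is a hypothetical helper, not part of `trl`.

```python
def eval_points(total_steps: int, eval_steps: int) -> list[int]:
    """Optimizer steps at which a step-based evaluation would fire."""
    return [step for step in range(1, total_steps + 1) if step % eval_steps == 0]


# 120 training samples, batch size 2, gradient accumulation 8
steps_per_epoch = 120 // (2 * 8)
points = eval_points(steps_per_epoch, eval_steps=3)
print(steps_per_epoch, points)
```

So one epoch yields two validation measurements, at steps 3 and 6.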

In [31]:
trainer = SFTTrainer(
    model=model,
    args=args,
    peft_config=peft_config,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=collate_fn,
    tokenizer=processor.tokenizer,
)

In [32]:
# For illustration, the training set is limited to 120 samples.
# You may see better convergence with more data and epochs.
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass


Step,Training Loss,Validation Loss
3,3.1828,2.962725
6,3.1027,2.943371


TrainOutput(global_step=7, training_loss=3.1274712766919817, metrics={'train_runtime': 413.3045, 'train_samples_per_second': 0.29, 'train_steps_per_second': 0.017, 'total_flos': 1178027987631864.0, 'train_loss': 3.1274712766919817, 'epoch': 0.9333333333333333})

We will save only the LoRA adapter, which was trained with less than 1% of the total parameters. If you want to combine the base model and the LoRA adapter into a standalone model, look up `merge_and_unload()` in PEFT.
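Conceptually, merging an adapter folds the low-rank update back into the base weight: W' = W + (alpha / r) * B A. The sketch below shows this on a toy layer with plain Python lists. The helpers `matmul` and `merge_lora` and all the numbers are made up for illustration, and the code does not use the PEFT API.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * (B @ A), the merged weight of a LoRA layer."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 layer with rank r=1 (all numbers are made up)
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]       # shape (r, d_in)
B = [[0.5], [0.25]]    # shape (d_out, r)

merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)
```

After merging, the layer is an ordinary weight matrix again, so inference no longer needs the adapter.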

In [33]:
trainer.save_model(args.output_dir)  # save LoRA adapter

In [39]:
# Clear memory

del model
del trainer
torch.cuda.empty_cache()

## Inference with trained model

In [40]:
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

# Hugging Face model id
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Load model and processor
# quantization_config = BitsAndBytesConfig(
#     load_in_4bit=True, bnb_4bit_use_double_quant=False, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
#     )
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_id)



Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]



### Wrap the base model with adapter

In [43]:
from peft import PeftModel
adapter_path = "llama-32-11B-ft"  # directory the adapter was saved to with trainer.save_model(args.output_dir)
peft_model = PeftModel.from_pretrained(model, adapter_path)

In [63]:
def infer_fn(example):

    # Build the chat-formatted message list that the processor expects
    prompt= """You are an expert product description writer for Amazon. Create a Short Product description based on the provided ##PRODUCT NAME## and ##CATEGORY## and image.
    Only return description. The description should be SEO optimized and for a better mobile search experience.

    ##PRODUCT NAME##: {product_name}
    ##CATEGORY##: {category}"""

    text = [{"role": "user",
            "content": [
                {"type": "image",},
                # Use the function argument `example` (not the undefined name `sample`)
                {"type": "text", "text": prompt.format(product_name=example["Product Name"], category=example["Category"]),}
                ]}
            ]

    texts = [processor.apply_chat_template(text, add_generation_prompt=True)]
    images = [example["image"]]

    # Tokenize the texts and process the images
    inputs = processor(text=texts, images=images, return_tensors="pt", padding=True, add_special_tokens=False).to(peft_model.device)

    # Run generation without tracking gradients
    with torch.no_grad():
        output = peft_model.generate(**inputs, max_new_tokens=300)

    return processor.decode(output[0][inputs["input_ids"].shape[-1]:])  # exclude the prompt tokens
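The slice in the return statement is there because `generate()` on a decoder-only model returns the prompt tokens followed by the new tokens. A minimal sketch with made-up token ids (`strip_prompt` is a hypothetical helper):

```python
def strip_prompt(output_ids: list[int], prompt_len: int) -> list[int]:
    """Keep only the tokens generated after the prompt."""
    return output_ids[prompt_len:]


prompt_ids = [101, 7, 8, 9]            # hypothetical prompt token ids
generated = prompt_ids + [42, 43, 2]   # the output echoes the prompt first
new_tokens = strip_prompt(generated, len(prompt_ids))
print(new_tokens)
```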

In [64]:
output = infer_fn(eval_dataset[0])

In [65]:
output

'**Unleash Your Inner Historian with Springbok Puzzles\' Americana 500-Piece Jigsaw Puzzle**\n\nImmerse yourself in the rich tapestry of American history with our Springbok Puzzles - Americana 500 Piece Jigsaw Puzzle. Crafted in the USA, this large 18" x 23.5" puzzle is a testament to the art of traditional jigsaw puzzle-making.\n\n**Unique Cut Interlocking Pieces for a Challenging yet Rewarding Experience**\n\nEach piece of this puzzle is meticulously cut to fit together seamlessly, providing a satisfying challenge for puzzle enthusiasts of all levels. The unique cut interlocking pieces ensure a smooth and enjoyable assembly experience.\n\n**A Timeless Tribute to American Heritage**\n\nThis puzzle is more than just a game; it\'s a tribute to the enduring spirit of America. With its vintage aesthetic and intricate design, it\'s an ideal gift for history buffs, collectors, and anyone who appreciates the beauty of Americana.\n\n**Experience the Joy of Assembly and the Pride of Completion