### Fine-Tuning Llama 3.2 Vision
**Fine-tune Llama 3.2 Vision, Meta AI's multimodal model, on an Amazon product descriptions dataset using the Unsloth framework.**

### 1. Install and upgrade the Unsloth library for optimized model training

- Use `%%capture` to suppress installation output in Jupyter/Colab environments.
- Install the `unsloth` package from PyPI for initial setup.
- Uninstall the existing `unsloth` package to ensure a clean installation.
- Upgrade to the latest version of `unsloth` directly from the GitHub repository.

In [None]:
%%capture
# The `%%capture` magic in Jupyter/Colab captures output, suppressing it from being displayed.
# Install the `unsloth` package from PyPI
!pip install unsloth
# Uninstall `unsloth` to ensure a clean installation, then install the latest version from GitHub
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

### 2. Load the model


- Load the `unsloth/Llama-3.2-11B-Vision-Instruct` model using `FastVisionModel`.
- Enable 4-bit quantization to reduce memory usage.
- Use Unsloth's gradient checkpointing mode to lower activation memory during training.


In [None]:
from unsloth import FastVisionModel
import torch

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct",
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)
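
To see roughly why 4-bit loading matters, the sketch below estimates the storage needed for the weights alone. The parameter count is an assumption read off the "11B" in the model name, and the result is a back-of-the-envelope figure: actual GPU usage also includes activations, quantization metadata, and framework overhead.

```python
# Back-of-the-envelope estimate of weight storage; this is not a measurement.
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Assumption: roughly 11 billion parameters, inferred from "11B" in the model name.
NUM_PARAMS = 11e9

for bits in (16, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions, 4-bit weights take about a quarter of the space of 16-bit weights.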

### 3. Add LoRA fine-tuning configuration for Llama 3.2 Vision model

- Apply LoRA (Low-Rank Adaptation) to fine-tune the vision and language components of the model.
- Enable fine-tuning for vision layers, language layers, attention modules, and MLP modules.
- Configure LoRA parameters such as rank (`r`), alpha (`lora_alpha`), and dropout (`lora_dropout`).


In [None]:
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = True,
    finetune_language_layers   = True,
    finetune_attention_modules = True,
    finetune_mlp_modules       = True,
    r                          = 16,
    lora_alpha                 = 16,
    lora_dropout               = 0,
    bias                       = "none",
    random_state               = 3407,
    use_rslora                 = False,
    loftq_config               = None,
)
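
The effect of `r` and `lora_alpha` can be illustrated with a little arithmetic. For a frozen weight matrix of shape `d_out x d_in`, LoRA trains two small matrices of shapes `d_out x r` and `r x d_in`, and the update is scaled by `lora_alpha / r` (or `lora_alpha / sqrt(r)` when rank-stabilized LoRA is enabled). The layer size of 4096 below is a hypothetical value chosen for illustration, not one read from the model.

```python
import math

def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    return r * (d_in + d_out)

def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Factor applied to the low-rank update before it is added to the frozen weight."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

# Hypothetical square projection layer, for illustration only.
d_in = d_out = 4096
full_params = d_in * d_out
extra_params = lora_extra_params(d_in, d_out, r=16)

print(f"frozen weights:  {full_params:,}")
print(f"LoRA parameters: {extra_params:,} ({extra_params / full_params:.2%} of the layer)")
print(f"scaling factor:  {lora_scaling(lora_alpha=16, r=16)}")
```

With `r = 16` and `lora_alpha = 16`, as configured above, the scaling factor is 1.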

### 4. Load a subset of the Amazon Product Descriptions dataset for Vision-Language tasks

- Load the `philschmid/amazon-product-descriptions-vlm` dataset using the Hugging Face `datasets` library.
- Select a subset of the training data (first 500 samples) for faster experimentation and prototyping.


In [None]:
from datasets import load_dataset

dataset = load_dataset("philschmid/amazon-product-descriptions-vlm",
                       split="train[0:500]")

dataset

In [None]:
dataset[45]["description"]

### 5. Convert the Amazon product descriptions dataset into conversation format for Vision-Language Models

- Define a system instruction for generating product descriptions based on images.
- Implement a `convert_to_conversation` function to transform dataset samples into a conversation-like structure suitable for Vision-Language Models (VLMs).
- Apply the transformation to the dataset to create a new `converted_dataset`.


In [None]:
instruction = """
You are an expert Amazon worker who is good at writing product descriptions.
Write the product description accurately by looking at the image.
"""

def convert_to_conversation(sample):
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image", "image": sample["image"]},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": sample["description"]}],
        },
    ]
    return {"messages": conversation}

# Apply the conversion to each sample in the dataset
converted_dataset = [convert_to_conversation(sample) for sample in dataset]
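
A quick structural check before training can catch malformed samples early. The helper below is a hypothetical sketch, not part of Unsloth or TRL. It only verifies that an example has a user turn containing text and an image, followed by an assistant turn containing text. The hand-built example uses a placeholder string where the dataset would supply a PIL image.

```python
def validate_conversation(example: dict) -> bool:
    """Return True if the example has a user turn (text + image) and an assistant turn (text)."""
    messages = example.get("messages", [])
    if [m.get("role") for m in messages] != ["user", "assistant"]:
        return False
    user_types = {part.get("type") for part in messages[0].get("content", [])}
    assistant_types = {part.get("type") for part in messages[1].get("content", [])}
    return user_types == {"text", "image"} and assistant_types == {"text"}

# Hand-built stand-in for one converted sample; the image is a placeholder string.
example = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the product."},
                {"type": "image", "image": "<PIL image placeholder>"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "A sample description."}],
        },
    ]
}

print(validate_conversation(example))
print(validate_conversation({"messages": []}))
```

In the notebook, the same check could be run over the whole list with `all(validate_conversation(s) for s in converted_dataset)`.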

### 6. Run baseline inference with the Llama 3.2 Vision model before fine-tuning

- Prepare the model for inference using `FastVisionModel.for_inference`.
- Generate a product description by processing an image and instruction using the Vision-Language Model.
- Use a streaming approach to display generated text in real-time.


In [None]:
FastVisionModel.for_inference(model)  # Enable for inference!

image = dataset[45]["image"]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": instruction},
        ],
    }
]
input_text = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")

from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=128,
    use_cache=True,
    temperature=1.5,
    min_p=0.1
)
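
The `temperature` and `min_p` arguments shape how the next token is sampled. The sketch below is a simplified pure-Python illustration of the idea, not the `transformers` implementation: logits are divided by the temperature, converted to probabilities, and every token whose probability falls below `min_p` times the top probability is discarded before renormalizing. The example logits are made up.

```python
import math

def min_p_filter(logits, temperature, min_p):
    """Apply temperature scaling, then keep only tokens with prob >= min_p * max prob."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]

# Made-up logits for a four-token vocabulary, using the settings from the cell above.
filtered = min_p_filter([4.0, 3.0, 0.0, -2.0], temperature=1.5, min_p=0.1)
print([round(p, 3) for p in filtered])
```

A higher temperature flattens the distribution, while the `min_p` cutoff removes the unlikely tail so that sampling stays coherent.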


### 7. Set up SFTTrainer for fine-tuning Llama 3.2 Vision model on product descriptions

- Enable the model for training using `FastVisionModel.for_training`.
- Configure the `SFTTrainer` from the `trl` library for supervised fine-tuning (SFT) with Unsloth optimizations.
- Use `UnslothVisionDataCollator` to handle multi-modal data efficiently during training.


In [None]:
from unsloth import is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

FastVisionModel.for_training(model)  # Enable for training!

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),  # Must use!
    train_dataset=converted_dataset,
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=30,
        learning_rate=2e-4,
        fp16=not is_bf16_supported(),
        bf16=is_bf16_supported(),
        logging_steps=5,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",  # Disable experiment trackers; use "wandb" to log to Weights & Biases
        remove_unused_columns=False,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        dataset_num_proc=4,
        max_seq_length=2048,
    ),
)
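
The configuration above implies a fairly small training budget, which the following arithmetic makes explicit. It assumes a single GPU (the device count is not stated in the notebook) and uses a simplified version of a linear schedule with warmup: the learning rate rises linearly to its peak over the warmup steps and then decays linearly to zero at `max_steps`.

```python
# Values copied from the SFTConfig above; NUM_DEVICES is an assumption.
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 4
MAX_STEPS = 30
WARMUP_STEPS = 5
BASE_LR = 2e-4
DATASET_SIZE = 500
NUM_DEVICES = 1  # assumption: single-GPU run

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES
samples_seen = effective_batch * MAX_STEPS
epochs = samples_seen / DATASET_SIZE

def linear_lr(step: int) -> float:
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    return BASE_LR * max(0.0, (MAX_STEPS - step) / (MAX_STEPS - WARMUP_STEPS))

print(f"effective batch size: {effective_batch}")
print(f"samples seen: {samples_seen} (~{epochs:.2f} epochs)")
for step in (0, 5, 15, 30):
    print(f"step {step:2d}: lr = {linear_lr(step):.2e}")
```

Under these assumptions the run covers less than half of the 500-sample subset, so `max_steps` would need to be raised (or replaced by a number of epochs) for a full pass over the data.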

### 8. Training the model
Start training by calling `trainer.train()`.

In [None]:
trainer_stats = trainer.train()

### 9. Save the model and tokenizer

In [None]:
model.save_pretrained("lora_model")  # Save the LoRA adapters locally
tokenizer.save_pretrained("lora_model")

### 10. Run inference after fine-tuning

Reload the saved `lora_model` checkpoint and generate a product description with the fine-tuned model.

- Prepare the model for inference using `FastVisionModel.for_inference`.
- Generate a product description by processing an image and instruction with the Vision-Language Model.
- Stream generated text in real-time using `TextStreamer`.


In [None]:
from unsloth import FastVisionModel
import torch

model, tokenizer = FastVisionModel.from_pretrained(
    model_name="lora_model",  # Directory saved in the previous step
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)

FastVisionModel.for_inference(model)  # Enable for inference!

image = dataset[40]["image"]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": instruction},
        ],
    }
]
input_text = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")

from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=128,
    use_cache=True,
    temperature=1.5,
    min_p=0.1
)

### 11. Push the model to the Hugging Face Hub in GGUF format

In [None]:
import os

# Push the trained model to the Hugging Face Hub in GGUF format
model.push_to_hub_gguf(
    "SURESHBEEKHANI/llama_3.2_vision_amazon_product_description",  # Target repository on the Hub; replace the username with your own
    tokenizer,  # Tokenizer associated with the model
    quantization_method=["q4_k_m", "q8_0", "q5_k_m"],  # GGUF quantization variants to export
    token=os.environ["HF_TOKEN"],  # Read the access token from the environment; never hard-code it. Create one at https://huggingface.co/settings/tokens
)