<a href="https://colab.research.google.com/github/Himkeshtak/VLM-OpenCV-Course/blob/main/medium_VLM_explaination_code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
pip install -U trl transformers datasets peft accelerate bitsandbytes

Collecting trl
  Downloading trl-0.27.0-py3-none-any.whl.metadata (11 kB)
Collecting datasets
  Downloading datasets-4.5.0-py3-none-any.whl.metadata (19 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.49.1-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting pyarrow>=21.0.0 (from datasets)
  Downloading pyarrow-23.0.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.0 kB)
Downloading trl-0.27.0-py3-none-any.whl (532 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m532.5/532.5 kB[0m [31m16.6 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading datasets-4.5.0-py3-none-any.whl (515 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m515.2/515.2 kB[0m [31m37.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading bitsandbytes-0.49.1-py3-none-manylinux_2_24_x86_64.whl (59.1 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m59.1/59.1 MB[0m [31m14.5 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pyarrow-23.0.0-cp312-cp312-manylinu

2. Initialise the Model, Tokeniser, and Processor

We load the pre-trained LLaVA model and its associated components. The processor is a handy tool that bundles the image processor and tokeniser together, making it easy to prepare both images and text for the model.

In [2]:
from transformers import AutoTokenizer, AutoProcessor , LlavaForConditionalGeneration
import torch

model_id = "llava-hf/llava-1.5-7b-hf"

#the processor handles both the image and the text preprocessing
processor = AutoProcessor.from_pretrained(model_id)

#Load the model with 16-bit precision for efficiecy
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype = torch.float16,
    device_map = "auto" #automatically uses the availabel GPUs
)



processor_config.json:   0%|          | 0.00/173 [00:00<?, ?B/s]

chat_template.json:   0%|          | 0.00/701 [00:00<?, ?B/s]

chat_template.jinja:   0%|          | 0.00/674 [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/505 [00:00<?, ?B/s]

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/41.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/552 [00:00<?, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!


config.json:   0%|          | 0.00/950 [00:00<?, ?B/s]

model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/4.96G [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.18G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/141 [00:00<?, ?B/s]



3. Create a Data Collator

A data collator is a function that takes a list of samples from our dataset and formats them into a batch that the model can process. This class handles applying the chat template to text and packaging the images and labels correctly for training.

In [3]:
class LLavaDataCollator:
  def __init__(self, processor):
    self.processor = processor
  def __call__(self, examples):
    texts = []
    images = []
    for example in examples:
      messages = example["messages"]
      text = self.processor.tokenizer.apply_chat_template(
          messages, tokenize=False, add_generation_prompt = False
      )
      texts.append(text)
      images.append(example["images"][0])
    batch = self.processor(texts, images, return_tensors="pt", padding=True)
    labels = batch["input_ids"].clone()
    if self.processor.tokenizer.pad_token_id is not None:
      labels[labels == self.processor.tokenizer.pad_token_id] = -100
    batch["labels"] = labels
    return batch
data_collator = LLavaDataCollator(processor)

4. Load the Dataset and Initialise the Trainer

We load our instruction dataset and configure the SFTTrainer, which is the core component from the TRL library that orchestrates the fine-tuning process.

In [5]:
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

#Load the dataset
raw_datasets = load_dataset("HuggingFaceH4/llava-instruct-mix-vsft")
train_dataset = raw_datasets["train"]
eval_dataset = raw_datasets["test"]

#Configure training arguemnets
training_args = SFTConfig(
    output_dir= "llava-1.5-7b-hf-ft-mix",
    report_to="tensorboard",
    learning_rate=1.41e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    logging_steps=5,
    num_train_epochs=5,
    #Add other neccessary arguemnts...
)
#Initialize the trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=processor.tokenizer, #Pass the tokenizer part of the processor
    data_collator = data_collator,
    #A dunmmy field is needed for thetrainer , but our collator handles the real work
)


README.md:   0%|          | 0.00/868 [00:00<?, ?B/s]

Resolving data files:   0%|          | 0/20 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/20 [00:00<?, ?it/s]

Downloading data:   0%|          | 0/20 [00:00<?, ?files/s]

data/train-00000-of-00020.parquet:   0%|          | 0.00/539M [00:00<?, ?B/s]

data/train-00001-of-00020.parquet:   0%|          | 0.00/547M [00:00<?, ?B/s]

data/train-00002-of-00020.parquet:   0%|          | 0.00/540M [00:00<?, ?B/s]

data/train-00003-of-00020.parquet:   0%|          | 0.00/542M [00:00<?, ?B/s]

data/train-00004-of-00020.parquet:   0%|          | 0.00/541M [00:00<?, ?B/s]

data/train-00005-of-00020.parquet:   0%|          | 0.00/541M [00:00<?, ?B/s]

data/train-00006-of-00020.parquet:   0%|          | 0.00/539M [00:00<?, ?B/s]

data/train-00007-of-00020.parquet:   0%|          | 0.00/540M [00:00<?, ?B/s]

data/train-00008-of-00020.parquet:   0%|          | 0.00/540M [00:00<?, ?B/s]

data/train-00009-of-00020.parquet:   0%|          | 0.00/537M [00:00<?, ?B/s]

data/train-00010-of-00020.parquet:   0%|          | 0.00/537M [00:00<?, ?B/s]

data/train-00011-of-00020.parquet:   0%|          | 0.00/544M [00:00<?, ?B/s]

data/train-00012-of-00020.parquet:   0%|          | 0.00/549M [00:00<?, ?B/s]

data/train-00013-of-00020.parquet:   0%|          | 0.00/543M [00:00<?, ?B/s]

data/train-00014-of-00020.parquet:   0%|          | 0.00/543M [00:00<?, ?B/s]

data/train-00015-of-00020.parquet:   0%|          | 0.00/547M [00:00<?, ?B/s]

data/train-00016-of-00020.parquet:   0%|          | 0.00/541M [00:00<?, ?B/s]

data/train-00017-of-00020.parquet:   0%|          | 0.00/541M [00:00<?, ?B/s]

data/train-00018-of-00020.parquet:   0%|          | 0.00/547M [00:00<?, ?B/s]

data/train-00019-of-00020.parquet:   0%|          | 0.00/540M [00:00<?, ?B/s]

data/test-00000-of-00002.parquet:   0%|          | 0.00/285M [00:00<?, ?B/s]

data/test-00001-of-00002.parquet:   0%|          | 0.00/284M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/259155 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/13640 [00:00<?, ? examples/s]

Loading dataset shards:   0%|          | 0/23 [00:00<?, ?it/s]

ValueError: Your setup doesn't support bf16/gpu.

5. Start Training

With everything configured, a single line of code kicks off the fine-tuning process.

In [None]:
trainer.train()