# Fine-Tune the LLaVA Vision-Language Model


### Insights and Code References Attribution

This work incorporates insights and code snippets courtesy of Hugging Face, published on Hugging Face's blog. We extend our gratitude to the original authors and Hugging Face for sharing their valuable resources with the community.

Here are a few resources used for reference.

https://huggingface.co/blog/vlms

Thanks also to Twitter user mrm8488 for sharing a sample notebook:

https://colab.research.google.com/github/mrm8488/shared_colab_notebooks/blob/master/fine_tune_VLM_LlaVa.ipynb

https://x.com/mrm8488/status/1778823704585101807

### Environment 

Google Colab A100 GPU

A100-SXM4-40GB  



## Vision-Language Models

Vision-language models are AI systems designed to understand and generate language based on visual inputs. They integrate visual data, such as images or videos, with textual data to perform tasks like image captioning, visual question answering, and image-text retrieval. These models leverage techniques from both computer vision and natural language processing, often using large neural networks to learn from vast amounts of paired visual and textual data. By combining visual and linguistic information, they can generate more contextually relevant and accurate descriptions and answers.

## How to Train a Vision-Language Model

#### Understanding the Basics:

Vision-Language Model: It's a type of AI that understands and works with both pictures and text together.

Goal: To make a model that can take an image and a piece of text and understand how they relate to each other.

#### Main Components:

Image Encoder: This part of the model looks at pictures and translates them into a form the computer can understand.

Embedding Projector: This part helps align or match the image data with text data. It's like a translator that makes sure both the image and text speak the same "language" to the computer.

Text Decoder: This part takes the combined image and text data and generates a response or description.

#### Training the Model:

Unifying Representations: The trick is to combine the image and text data in a way the model can use.

Different Approaches: Different models use slightly different methods to do this.

## LLaVA Model

The LLaVA model, short for "Large Language and Vision Assistant," is a type of AI model designed to understand and generate responses that incorporate both visual and textual information. Here's a simplified breakdown of what it is and how it works:

#### What is LLaVA?
LLaVA is a vision-language model that combines the strengths of image processing and natural language processing. It is specifically trained to handle tasks where both image and text inputs are involved, such as answering questions about an image, generating captions, or providing detailed descriptions of visual content.

#### How Does LLaVA Work?

#### Components:

CLIP Image Encoder: This part of the model is responsible for analyzing images. CLIP (Contrastive Language–Image Pre-training) is a model developed by OpenAI that can understand images and text together. The image encoder processes the image and converts it into a format that the model can understand.

Multimodal Projector: This component aligns the image data with the text data. It acts as a bridge that maps the visual features produced by the image encoder into the embedding space used by the text decoder.

Vicuna Text Decoder: This is the part of the model that generates text based on the combined image and text data. Vicuna is a language model that can produce coherent and contextually relevant text.
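The way these three components fit together can be sketched with a little shape arithmetic. The values below are assumptions rather than something stated in the text above: a 336x336 input image, 14x14 patches, 1024-dimensional vision features, and 4096-dimensional decoder embeddings. They are consistent with the `Conv2d(3, 1024, kernel_size=(14, 14))` and `Embedding(577, 1024)` layers in the model printout later in this notebook.

```python
# Assumed values (not stated in the text above): 336x336 input, 14x14 patches,
# 1024-dim vision features, 4096-dim embeddings in the 7B text decoder.
image_size = 336
patch_size = 14
vision_dim = 1024
text_dim = 4096

patches_per_side = image_size // patch_size   # 24
num_patches = patches_per_side ** 2           # 576 patch tokens
num_positions = num_patches + 1               # plus one class token -> 577

# The multimodal projector maps every patch feature from vision_dim to text_dim,
# so the text decoder receives num_patches extra "image tokens".
projected_shape = (num_patches, text_dim)

print(num_patches, num_positions, projected_shape)
```

Under these assumptions, a single image occupies several hundred positions in the decoder's context, which is worth keeping in mind when choosing batch sizes.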

#### Training Process:

Dataset: The model is trained on a dataset of images and their corresponding captions. Additionally, questions related to these images and captions are generated using another AI model, GPT-4.

Initial Training: During the first phase, the image encoder and text decoder are kept fixed (not updated), and only the multimodal projector is trained. The model learns to match the image features with the text features by comparing its generated outputs with the ground truth captions.

Fine-Tuning: In the second phase, the image encoder remains fixed, but the text decoder is allowed to learn along with the multimodal projector. This helps the model generate better textual responses based on the visual input.
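The two phases described above differ only in which components receive gradient updates. A minimal sketch of that schedule follows; the dictionary and function names are hypothetical and chosen for illustration only.

```python
# Which components are updated in each phase, as described above.
# True = trainable, False = frozen. All names here are illustrative.
TRAINING_PHASES = {
    "initial_training": {
        "image_encoder": False,
        "multimodal_projector": True,
        "text_decoder": False,
    },
    "fine_tuning": {
        "image_encoder": False,
        "multimodal_projector": True,
        "text_decoder": True,
    },
}

def trainable_components(phase):
    """Return the sorted names of the components updated in the given phase."""
    flags = TRAINING_PHASES[phase]
    return sorted(name for name, trainable in flags.items() if trainable)

for phase in TRAINING_PHASES:
    print(phase, "->", trainable_components(phase))
```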


#### Applications of LLaVA

LLaVA can be used in various applications, such as:

Image Captioning: Generating descriptive captions for images.

Visual Question Answering: Answering questions based on the content of an image.

Multimodal Search: Searching for images based on textual descriptions or vice versa.

Assistive Technology: Helping visually impaired individuals by describing images or scenes.

In summary, the LLaVA model is a sophisticated AI system that excels in tasks requiring a deep understanding of both visual and textual information, making it highly useful in many practical applications.


## Understanding the Basic Concepts

### PEFT - Parameter-Efficient Fine-Tuning
Parameter Efficient Fine-Tuning (PEFT) refers to a set of techniques in machine learning aimed at updating large pre-trained models in a way that requires altering only a small fraction of the model's parameters. This approach is particularly useful when working with very large models, such as those used in natural language processing (NLP) and computer vision, where full model retraining can be prohibitively expensive in terms of computational resources and time.


### LoRA

In the context of parameter-efficient fine-tuning for machine learning models, LoRA and QLoRA refer to specific techniques designed to update large pre-trained models more efficiently.

LoRA (Low-Rank Adaptation):

Low-Rank Adaptation is a technique used to fine-tune large pre-trained models in a parameter-efficient manner. Instead of updating all the parameters of a model, LoRA focuses on adapting only a small set of parameters, typically by introducing and training low-rank matrices that modify the weights of specific layers (like the attention or feed-forward layers in a Transformer model). This approach allows for significant customization of the model to new tasks with minimal changes to the overall parameter set, preserving the original strengths of the pre-trained model while adapting it to specific needs.

Imagine you have a huge, complex book (representing a large pre-trained model) that you want to customize slightly without rewriting the whole thing. LoRA allows you to do just that by adding a few notes or annotations (small changes) to some pages, making the book more relevant to your specific needs without altering its original content significantly. In technical terms, LoRA introduces small, trainable adjustments to certain parts of the model while keeping the vast majority of the original parameters fixed. This makes the fine-tuning process much more efficient, as you're only tweaking a tiny fraction of the model.

![lora](lora.jpeg)

The image above illustrates Low-Rank Adaptation (LoRA) within Parameter-Efficient Fine-Tuning (PEFT) using a neural-network diagram. The LoRA modules act as targeted add-ons to the network, allowing precise and efficient adjustments that adapt the model to new tasks.
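The low-rank idea can be sketched in a few lines of plain Python. This is a toy illustration, not the PEFT implementation: the frozen weight `W` is left untouched, and the output is adjusted by the product of two small matrices `B` and `A`, scaled by `alpha / r`. The helper names and the tiny 4x4 example are made up for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Compute y = W x + (alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)                 # frozen pre-trained path
    update = matvec(B, matvec(A, x))    # low-rank adapter path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy setup: a 4x4 frozen weight (identity) and a rank-1 adapter.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
A = [[1.0, 0.0, 0.0, 0.0]]                  # shape r x k
B_zero = [[0.0], [0.0], [0.0], [0.0]]       # shape d x r, zero-initialised
B_trained = [[0.5], [0.0], [0.0], [0.0]]    # pretend training changed one entry
x = [1.0, 2.0, 3.0, 4.0]

before = lora_forward(W, A, B_zero, x, alpha=2, r=1)
after = lora_forward(W, A, B_trained, x, alpha=2, r=1)

full_params = 4 * 4          # parameters in W
lora_params = 1 * (4 + 4)    # parameters in A and B for rank 1
print(before, after, full_params, lora_params)
```

With `B` initialised to zero the adapter starts as a no-op, so the model's behaviour is unchanged until training moves `A` and `B`. For realistic layer sizes the ratio of adapter parameters to full parameters is far smaller than in this 4x4 toy.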


### QLoRA (Quantized Low-Rank Adaptation)

Quantization, in the context of machine learning, involves reducing the precision of the numerical representations used in a model, for example converting 32-bit floating-point numbers to 8-bit integers. This process decreases the model's memory footprint and can speed up inference, making the model more efficient, especially on hardware with limited computational resources or on accelerators designed for lower-precision arithmetic.
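A minimal sketch of the idea, assuming simple "absmax" scaling to 8-bit integers. This is only meant to show the round trip from floats to integers and back. It is not the 4-bit scheme that bitsandbytes applies later in this notebook, and the function names are hypothetical.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)

# The rounding error per value is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each value now needs one byte instead of four, at the cost of a small rounding error bounded by half the scale.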

Building on the previous analogy, if LoRA is like adding handwritten notes to the book, QLoRA is akin to reprinting the book itself in a much more compact typeface while keeping the notes at full size. QLoRA stores the frozen weights of the pre-trained model in 4-bit precision and trains the LoRA adapters in higher precision on top of them. This sharply reduces the memory needed to hold the model during fine-tuning, which is what makes it practical to fine-tune a 7B-parameter model on a single GPU.

![qlora](qlora.jpeg)

In very simple terms, both LoRA and QLoRA are about making smart, minimal updates to a large, pre-existing "knowledge base" (the pre-trained model) to adapt it for specific tasks. LoRA does this by adding small trainable modifications, and QLoRA goes a step further by keeping the frozen base model in quantized form so that fine-tuning needs far less memory.
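To get a feel for the savings, here is a rough back-of-the-envelope estimate. It assumes a round figure of 7 billion parameters and counts weight storage only; activations, optimizer state, and quantization metadata are ignored, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate storage for the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 7e9  # assumed round figure for a 7B-parameter model

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(num_params, bits):.1f} GB")
```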


### Install Packages

In [None]:
!pip install -U "transformers>=4.39.0"
!pip install peft bitsandbytes
!pip install -U "trl>=0.8.3"

Collecting peft
  Downloading peft-0.11.1-py3-none-any.whl (251 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl (119.8 MB)
Collecting accelerate>=0.21.0 (from peft)
  Downloading accelerate-0.31.0-py3-none-any.whl (309 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.13.0->peft)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.13.0->peft)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12

In [None]:
import torch
from transformers import AutoTokenizer, AutoProcessor, TrainingArguments, LlavaForConditionalGeneration, BitsAndBytesConfig
from trl import SFTTrainer
from peft import LoraConfig

### Load the model and quantization_config

In [None]:
aibot_model_id = "llava-hf/llava-1.5-7b-hf"

In [None]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
)

In [None]:
model = LlavaForConditionalGeneration.from_pretrained(aibot_model_id,
                                                      quantization_config=quantization_config,
                                                      torch_dtype=torch.float16)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


`low_cpu_mem_usage` was None, now set to True since model is quantized.



### Define the Chat Template, Tokenizer, and Processor

In [None]:
LLAVA_CHAT_TEMPLATE = """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. {% for message in messages %}{% if message['role'] == 'user' %}USER: {% else %}ASSISTANT: {% endif %}{% for item in message['content'] %}{% if item['type'] == 'text' %}{{ item['text'] }}{% elif item['type'] == 'image' %}<image>{% endif %}{% endfor %}{% if message['role'] == 'user' %} {% else %}{{eos_token}}{% endif %}{% endfor %}"""

In [None]:
tokenizer = AutoTokenizer.from_pretrained(aibot_model_id)
tokenizer.chat_template = LLAVA_CHAT_TEMPLATE
processor = AutoProcessor.from_pretrained(aibot_model_id)
processor.tokenizer = tokenizer

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
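To make the effect of `LLAVA_CHAT_TEMPLATE` concrete, the sketch below re-implements its logic in plain Python (no Jinja) and renders a short conversation. The function name `render_llava_chat` is hypothetical, and `</s>` is assumed to be the tokenizer's end-of-sequence token.

```python
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
)

def render_llava_chat(messages, eos_token="</s>"):
    """Plain-Python equivalent of the Jinja chat template defined above."""
    parts = [SYSTEM_PROMPT]
    for message in messages:
        is_user = message["role"] == "user"
        parts.append("USER: " if is_user else "ASSISTANT: ")
        for item in message["content"]:
            if item["type"] == "text":
                parts.append(item["text"])
            elif item["type"] == "image":
                parts.append("<image>")
        # User turns end with a space, assistant turns end with the EOS token.
        parts.append(" " if is_user else eos_token)
    return "".join(parts)

messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Who wrote this book?\n"},
        {"type": "image"},
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "Donna Eden"},
    ]},
]

rendered = render_llava_chat(messages)
print(rendered)
```

Note that the rendered text uses `USER:` and `ASSISTANT:` markers, which is the layout the model sees during fine-tuning with this template.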




### Run Inference Sample 1


In [None]:
from PIL import Image
import requests

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Download the test image (the LLaVA-1.5 radar chart from the LLaVA repository)
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)

# Note: LLAVA_CHAT_TEMPLATE above formats turns as "USER: ... ASSISTANT:".
# The [INST] ... [/INST] markers are kept here so the recorded output below still matches.
#prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
prompt = "[INST] <image>\nExplain the image? [/INST]"

inputs = processor(prompt, image, return_tensors="pt").to(device)
output = model.generate(**inputs, max_new_tokens=1000)



In [None]:
print(processor.decode(output[0], skip_special_tokens=True))

[INST]  
Explain the image? [/INST]

The image is a graph showing the performance of a player in a game. The graph is divided into several sections, each representing a different aspect of the player's performance. The sections include MME, MMB, MMB-Bench, MMB-Bench-Con, MMB-Bench-Con-V2, MMB-Bench-Con-V2-Vizwiz, MMB-Bench-Con-V2-Vizwiz-V2, MMB-Bench-Con-V2-Vizwiz-V2-V2, MMB-Bench-Con-V2-Vizwiz-V2-V2-V2, MMB-Bench-Con-V2-Vizwiz-V2-V2-V2-V2, MMB-Bench-Con-V2-Vizwiz-V2-V2-V2-V2-V2, MMB-Bench-Con-V2-Vizwiz-V2-V2-V2-V2-V2-V2, MMB-Bench-Con-V2-Vizwiz-V2-V2-V2-V2-V2-V2-V2, MMB-Bench-Con-V2-Vizwiz-V2-V2-V2-V2-V2-V2-V2-V2, MMB-Bench-Con-V2-Vizwiz-V2-V2-V2-V2-V2-V2-V2-V2-V2, MMB-Bench-Con-V2-Vizwiz-V2-V2-V2-V2-V2-V2-V2-V2-V2-V2, MMB-Bench-Con-V2-Vizwiz-V2-V2-V2-V2-V2-V2-V2-V2-V2-V2-V2, MMB-Bench-Con-V2-Vizwiz-V2-V2-V2-V2-V2-V2-V2-V2-V2-V2-V2-V2, MMB-Bench-Con-V2-Vizwiz-V2-V2-V2-V2-V2-V2-V2-V2-V2-V2-V2-V2-V2, MMB-Bench-Con-V2-Vizwiz-V2-V2-V2-V2-V2-V2-V2-V2-V2-V2-V2-V2-V2, MMB-Bench-Con-V2-Vizwiz

### Run Inference Sample 2

In [None]:
from PIL import Image
import requests
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_logo.png?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
#prompt = "[INST] <image>\nExplain the image? [/INST]"

inputs = processor(prompt, image, return_tensors="pt").to(device)
output = model.generate(**inputs, max_new_tokens=1000)
print(processor.decode(output[0], skip_special_tokens=True))

[INST]  
What is shown in this image? [/INST]


### Run Inference Sample 3

In [None]:
from PIL import Image
import requests
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_logo.png?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
#prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
prompt = "[INST] <image>\nExplain the image? [/INST]"

inputs = processor(prompt, image, return_tensors="pt").to(device)
output = model.generate(**inputs, max_new_tokens=1000)
print(processor.decode(output[0], skip_special_tokens=True))

[INST]  
Explain the image? [/INST]

The image features a toy camel with a pair of glasses on its face. The camel is wearing a pair of red glasses, giving it a unique and quirky appearance. The toy is sitting on a table, and its glasses are lit up, adding to its playful and whimsical charm.


### Create a DataCollator to combine text and image pairs.

In [None]:
class LLavaDataCollator:
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, examples):
        texts = []
        images = []
        for example in examples:
            messages = example["messages"]
            text = self.processor.tokenizer.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=False
            )
            texts.append(text)
            images.append(example["images"][0])

        batch = self.processor(texts, images, return_tensors="pt", padding=True)

        labels = batch["input_ids"].clone()
        if self.processor.tokenizer.pad_token_id is not None:
            labels[labels == self.processor.tokenizer.pad_token_id] = -100
        batch["labels"] = labels

        return batch

data_collator = LLavaDataCollator(processor)
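The label-masking step inside the collator can be illustrated on plain lists. In this sketch the token IDs and the pad ID `0` are made up. The value `-100` is the label that PyTorch's cross-entropy loss ignores by default, so padded positions do not contribute to the loss.

```python
IGNORE_INDEX = -100  # label value skipped by the loss

def mask_padding(input_ids, pad_token_id):
    """Copy input_ids into labels, replacing padding positions with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in row]
        for row in input_ids
    ]

# Two sequences padded to the same length (hypothetical IDs, pad ID assumed to be 0).
batch_input_ids = [
    [1, 15, 27, 2],
    [1, 15, 0, 0],
]
labels = mask_padding(batch_input_ids, pad_token_id=0)
print(labels)
```

Only padding is masked in the collator above, so the loss is computed over the user turns as well as the assistant turns.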

### Load the Dataset

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("HuggingFaceH4/llava-instruct-mix-vsft")
train_dataset = raw_datasets["train"]
eval_dataset = raw_datasets["test"]


In [None]:
train_dataset[0]

{'messages': [{'content': [{'index': None,
     'text': 'Who wrote this book?\n',
     'type': 'text'},
    {'index': 0, 'text': None, 'type': 'image'}],
   'role': 'user'},
  {'content': [{'index': None, 'text': 'Donna Eden', 'type': 'text'}],
   'role': 'assistant'},
  {'content': [{'index': None,
     'text': 'What is the title of this book?',
     'type': 'text'}],
   'role': 'user'},
  {'content': [{'index': None,
     'text': 'The Energies of Love: Using Energy Medicine to Keep Your Relationship Thriving',
     'type': 'text'}],
   'role': 'assistant'},
  {'content': [{'index': None,
     'text': 'What type of book is this?',
     'type': 'text'}],
   'role': 'user'},
  {'content': [{'index': None,
     'text': 'Health, Fitness & Dieting',
     'type': 'text'}],
   'role': 'assistant'},
  {'content': [{'index': None,
     'text': 'Is this a fitness book?',
     'type': 'text'}],
   'role': 'user'},
  {'content': [{'index': None, 'text': 'Yes', 'type': 'text'}],
   'role': 'assist

In [None]:
train_dataset[10]

{'messages': [{'content': [{'index': None,
     'text': 'How many travelers from Asia were in the U.S. in 2020?\n',
     'type': 'text'},
    {'index': 0, 'text': None, 'type': 'image'}],
   'role': 'user'},
  {'content': [{'index': None, 'text': '2.16', 'type': 'text'}],
   'role': 'assistant'},
  {'content': [{'index': None,
     'text': 'Can you extract the full data and reformat it as a markdown table?',
     'type': 'text'}],
   'role': 'user'},
  {'content': [{'index': None,
     'text': "Sure! Here's the extracted results written in markdown\n| Characteristic   |   Number of visitors in millions |\n|:-----------------|---------------------------------:|\n| 2020*            |                             2.16 |\n| 2019             |                            12.25 |\n| 2018             |                            11.87 |\n| 2017             |                            12.14 |\n| 2016             |                            11.54 |\n| 2015             |                         

In [None]:
train_dataset[100]

{'messages': [{'content': [{'index': None,
     'text': 'Who is the author of this book?\n',
     'type': 'text'},
    {'index': 0, 'text': None, 'type': 'image'}],
   'role': 'user'},
  {'content': [{'index': None, 'text': 'Bill Spetrino', 'type': 'text'}],
   'role': 'assistant'},
  {'content': [{'index': None,
     'text': 'What is the title of this book?',
     'type': 'text'}],
   'role': 'user'},
  {'content': [{'index': None,
     'text': 'The Great American Dividend Machine: How an Outsider Became the Undisputed Champ of Wall Street',
     'type': 'text'}],
   'role': 'assistant'},
  {'content': [{'index': None,
     'text': 'What is the genre of this book?',
     'type': 'text'}],
   'role': 'user'},
  {'content': [{'index': None, 'text': 'Business & Money', 'type': 'text'}],
   'role': 'assistant'},
  {'content': [{'index': None,
     'text': 'Is this a financial book?',
     'type': 'text'}],
   'role': 'user'},
  {'content': [{'index': None, 'text': 'Yes', 'type': 'text'}],

In [None]:
len(train_dataset)

259155

In [None]:
len(eval_dataset)

13640

In [None]:
type(train_dataset)

### For Faster Training, Create a New Dataset with the First 5,000 Elements

In [None]:
# Keep only the first 5,000 training examples and the first 500 evaluation examples
new_train_dataset = train_dataset.select(range(5000))
new_eval_dataset = eval_dataset.select(range(500))

In [None]:
new_train_dataset[0]

{'messages': [{'content': [{'index': None,
     'text': 'Who wrote this book?\n',
     'type': 'text'},
    {'index': 0, 'text': None, 'type': 'image'}],
   'role': 'user'},
  {'content': [{'index': None, 'text': 'Donna Eden', 'type': 'text'}],
   'role': 'assistant'},
  {'content': [{'index': None,
     'text': 'What is the title of this book?',
     'type': 'text'}],
   'role': 'user'},
  {'content': [{'index': None,
     'text': 'The Energies of Love: Using Energy Medicine to Keep Your Relationship Thriving',
     'type': 'text'}],
   'role': 'assistant'},
  {'content': [{'index': None,
     'text': 'What type of book is this?',
     'type': 'text'}],
   'role': 'user'},
  {'content': [{'index': None,
     'text': 'Health, Fitness & Dieting',
     'type': 'text'}],
   'role': 'assistant'},
  {'content': [{'index': None,
     'text': 'Is this a fitness book?',
     'type': 'text'}],
   'role': 'user'},
  {'content': [{'index': None, 'text': 'Yes', 'type': 'text'}],
   'role': 'assist

In [None]:
#eval_dataset = eval_dataset[:50]

In [None]:
len(new_train_dataset)

5000

In [None]:
len(new_eval_dataset)

500

### Most Important Customization

Add the text "AI Bot Hello." to the beginning and end of every assistant response. After fine-tuning, we will check whether the model has learned this pattern.

In [None]:
def add_ai_bot_hello(example):
    """Wrap every assistant text response in the marker phrase "AI Bot Hello."."""
    for message in example['messages']:
        # Leave user turns untouched; only assistant turns carry the marker
        if message['role'] == 'assistant':
            for content in message['content']:
                if content['type'] == 'text' and content['text'] is not None:
                    content['text'] = f"AI Bot Hello. {content['text']} AI Bot Hello."
    return example

# Apply the transformation to each element in the dataset
new_train_dataset = new_train_dataset.map(add_ai_bot_hello)
new_eval_dataset = new_eval_dataset.map(add_ai_bot_hello)


In [None]:
new_train_dataset[0]

{'messages': [{'content': [{'index': None,
     'text': 'Who wrote this book?\n',
     'type': 'text'},
    {'index': 0, 'text': None, 'type': 'image'}],
   'role': 'user'},
  {'content': [{'index': None,
     'text': 'AI Bot Hello. Donna Eden AI Bot Hello.',
     'type': 'text'}],
   'role': 'assistant'},
  {'content': [{'index': None,
     'text': 'What is the title of this book?',
     'type': 'text'}],
   'role': 'user'},
  {'content': [{'index': None,
     'text': 'AI Bot Hello. The Energies of Love: Using Energy Medicine to Keep Your Relationship Thriving AI Bot Hello.',
     'type': 'text'}],
   'role': 'assistant'},
  {'content': [{'index': None,
     'text': 'What type of book is this?',
     'type': 'text'}],
   'role': 'user'},
  {'content': [{'index': None,
     'text': 'AI Bot Hello. Health, Fitness & Dieting AI Bot Hello.',
     'type': 'text'}],
   'role': 'assistant'},
  {'content': [{'index': None,
     'text': 'Is this a fitness book?',
     'type': 'text'}],
   'role

### Set the Arguments for Training

In [None]:
training_args = TrainingArguments(
    output_dir="llava-1.5-7b-hf-ft-mix-vsft",
    report_to="tensorboard",
    learning_rate=1.4e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    logging_steps=5,
    num_train_epochs=1,
    #push_to_hub=True,
    gradient_checkpointing=True,
    remove_unused_columns=False,
    fp16=True,
    bf16=False
)
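With these arguments, the number of optimizer steps for one epoch can be estimated ahead of time. The sketch below assumes a single GPU and the 5,000-example training subset. The helper name is hypothetical, and the formula is a simple approximation rather than the trainer's exact bookkeeping.

```python
import math

def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                    num_devices=1, epochs=1):
    """Estimate the number of optimizer updates for a training run."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return math.ceil(num_examples / effective_batch) * epochs

steps = optimizer_steps(num_examples=5000, per_device_batch_size=8, grad_accum_steps=1)
print(steps)
```

With `logging_steps=5`, the training loss is therefore reported once every five of these steps.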

### Configure LoRA

In [None]:
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"]
)
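A rough estimate of what this configuration adds is sketched below. The layer counts and shapes are assumptions: 32 decoder layers with 4096x4096 `q_proj`/`v_proj` matrices for the 7B text decoder, and 24 CLIP layers with 1024x1024 projections. Because `target_modules` matches by module name, the vision tower's `q_proj` and `v_proj` layers are wrapped too, as the model printout later in this notebook shows for the CLIP attention blocks.

```python
def lora_params(in_features, out_features, r):
    """Parameters in one adapter: A is (r x in_features), B is (out_features x r)."""
    return r * (in_features + out_features)

r, lora_alpha = 64, 16
scaling = lora_alpha / r  # factor applied to the adapter output

# Assumed shapes (see lead-in): two target modules per layer in each tower.
decoder_adapters = 32 * 2 * lora_params(4096, 4096, r)
vision_adapters = 24 * 2 * lora_params(1024, 1024, r)
total_adapters = decoder_adapters + vision_adapters

print(scaling, decoder_adapters, vision_adapters, total_adapters)
```

Under these assumptions the adapters add roughly 40 million trainable parameters, a small fraction of the 7B base model.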

In [None]:
from huggingface_hub import notebook_login

# Authenticate with Hugging Face using notebook login
notebook_login()



### Create the SFTTrainer object



In [None]:
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=new_train_dataset,
    eval_dataset=new_eval_dataset,
    peft_config=lora_config,
    dataset_text_field="text",  # need a dummy field
    tokenizer=tokenizer,
    data_collator=data_collator,
    dataset_kwargs={"skip_prepare_dataset": True},
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


### Run the Fine-Tuning Training

In [None]:
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


| Step | Training Loss |
|-----:|--------------:|
| 5 | 2.7053 |
| 10 | 3.0919 |
| 15 | 2.7643 |
| 20 | 2.7494 |
| 25 | 2.8433 |
| 30 | 2.5881 |
| 35 | 2.3134 |
| 40 | 2.6313 |
| 45 | 2.3569 |
| 50 | 2.1915 |



TrainOutput(global_step=625, training_loss=1.14315698928833, metrics={'train_runtime': 2160.7061, 'train_samples_per_second': 2.314, 'train_steps_per_second': 0.289, 'total_flos': 1.026869516945326e+17, 'train_loss': 1.14315698928833, 'epoch': 1.0})
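As a quick sanity check, the throughput figures in `TrainOutput` can be reproduced from the runtime, the step count, and the 5,000-example training subset. The numbers below are copied from the output above.

```python
train_runtime = 2160.7061  # seconds, from TrainOutput
num_examples = 5000        # size of new_train_dataset
global_step = 625          # from TrainOutput

samples_per_second = num_examples / train_runtime
steps_per_second = global_step / train_runtime
runtime_minutes = train_runtime / 60

print(round(samples_per_second, 3), round(steps_per_second, 3), round(runtime_minutes, 1))
```

The run therefore took about 36 minutes on the A100 for one epoch over the subset.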

In [None]:
model

LlavaForConditionalGeneration(
  (vision_tower): CLIPVisionModel(
    (vision_model): CLIPVisionTransformer(
      (embeddings): CLIPVisionEmbeddings(
        (patch_embedding): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
        (position_embedding): Embedding(577, 1024)
      )
      (pre_layrnorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (encoder): CLIPEncoder(
        (layers): ModuleList(
          (0-23): 24 x CLIPEncoderLayer(
            (self_attn): CLIPAttention(
              (k_proj): Linear4bit(in_features=1024, out_features=1024, bias=True)
              (v_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=1024, out_features=1024, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=1024, out_features=64, bias=False)
                )
                (lora_B)

In [None]:
# tokenizer = AutoTokenizer.from_pretrained(aibot_model_id)
# tokenizer.chat_template = LLAVA_CHAT_TEMPLATE
# processor = AutoProcessor.from_pretrained(aibot_model_id)
# processor.tokenizer = tokenizer

### Run Inference on Fine-Tuned Model, Sample 1

In [None]:
from PIL import Image
import requests
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_logo.png?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
#prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
prompt = "[INST] <image>\nwho wrote the book? [/INST]"

inputs = processor(prompt, image, return_tensors="pt").to(device)
output = model.generate(**inputs, max_new_tokens=1000)
print(processor.decode(output[0], skip_special_tokens=True))

[INST]  
who wrote the book? [/INST] ASSISTANT: [/INST] AI Bot Hello.


### Run Inference on Fine-Tuned Model, Sample 2

In [None]:
from PIL import Image
import requests
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_logo.png?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
#prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
prompt = "[INST] <image>\n Who wrote this book in the attached image? [/INST]"

inputs = processor(prompt, image, return_tensors="pt").to(device)
output = model.generate(**inputs, max_new_tokens=1000)  # same cap as the other samples
print(processor.decode(output[0], skip_special_tokens=True))

[INST]  
 Who wrote this book in the attached image? [/INST] AI Bot Hello. [INST] AI Bot Hello.


### Run Inference on Fine-Tuned Model, Sample 3

In [None]:
from PIL import Image
import requests
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_logo.png?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
#prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
prompt = "[INST] <image>\nWhat type of image is this? [/INST]"

inputs = processor(prompt, image, return_tensors="pt").to(device)
output = model.generate(**inputs, max_new_tokens=1000)
print(processor.decode(output[0], skip_special_tokens=True))

[INST]  
What type of image is this? [/INST] ASS: AI Bot Hello. AI Bot Hello.


### Run Inference on Fine-Tuned Model, Sample 4

In [None]:
from PIL import Image
import requests
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_logo.png?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
#prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
prompt = "[INST] <image>\nIs this a fitness image? [/INST]"

inputs = processor(prompt, image, return_tensors="pt").to(device)
output = model.generate(**inputs, max_new_tokens=1000)  # same cap as the other samples
print(processor.decode(output[0], skip_special_tokens=True))

[INST]  
Is this a fitness image? [/INST] AI: No


### Run Inference on Fine-Tuned Model, Sample 5

In [None]:
from PIL import Image
import requests
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_logo.png?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
#prompt = "[INST] <image>\n What is in this image? [/INST]"

inputs = processor(prompt, image, return_tensors="pt").to(device)
output = model.generate(**inputs, max_new_tokens=1000)
print(processor.decode(output[0], skip_special_tokens=False))

<s> [INST] <image> 
What is shown in this image? [/INST] AI Bot Hello. 

What is shown in this image? [/INST] AI Bot Hello.</s>


### Run Inference on Fine-Tuned Model, Sample 6

In [None]:
from PIL import Image
import requests
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_logo.png?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "[INST] <image>\nWhat is the title shown in this image? [/INST]"
#prompt = "[INST] <image>\n What is in this image? [/INST]"

inputs = processor(prompt, image, return_tensors="pt").to(device)
output = model.generate(**inputs, max_new_tokens=1000)
print(processor.decode(output[0], skip_special_tokens=False))

<s> [INST] <image> 
What is the title shown in this image? [/INST]  AI Bot Assistant 1.0 AI Bot Assistant: 2017:Q2:SS:US AI Bot Assistant: 2017:Q2:SS:US</s>
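To check more systematically whether the model picked up the pattern, one option is to count how often the marker phrase appears in the generated part of each decoded output. The helper below is a hypothetical sketch. It treats everything after the first `[/INST]` as the generated text, and the sample strings are copied from the outputs recorded above.

```python
MARKER = "AI Bot Hello."

def marker_count(decoded, prompt_end="[/INST]", marker=MARKER):
    """Count marker occurrences in the text that follows the first prompt_end."""
    _, _, generated = decoded.partition(prompt_end)
    return generated.count(marker)

samples = {
    "sample 1": "[INST]  \nwho wrote the book? [/INST] ASSISTANT: [/INST] AI Bot Hello.",
    "sample 3": "[INST]  \nWhat type of image is this? [/INST] ASS: AI Bot Hello. AI Bot Hello.",
    "sample 4": "[INST]  \nIs this a fitness image? [/INST] AI: No",
}

counts = {name: marker_count(text) for name, text in samples.items()}
print(counts)
```

A count of 2 corresponds to the full training pattern (marker at the start and at the end), while lower counts indicate the pattern was reproduced only partially or not at all.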
