### Instruction
* This notebook runs on a **Colab GPU (T4)**. Running Llava on a laptop CPU is constrained by 16 GB of RAM, and CPU inference is slow in any case.
* The **16 GB of GPU RAM** on Colab is just about enough.
* **Llava**: possibly because it lacks reasoning ability (?), its predicted classification labels are heavily biased. Further prompt engineering was not attempted, given that GPT-4 does a far better job.
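
The "just about enough" claim can be checked with a rough calculation. The sketch below assumes a round figure of 7 billion parameters (taken from the "7b" in the model name, not measured) and counts only the weights in half precision; activations and the generation cache need extra memory on top of this.

```python
# Back-of-the-envelope memory estimate for loading the model in float16.
# ASSUMPTION: ~7e9 parameters, rounded from the "7b" in the model name.
N_PARAMS = 7e9
BYTES_PER_PARAM_FP16 = 2   # float16 stores each weight in 2 bytes
GPU_RAM_GB = 16            # nominal T4 memory

def weights_size_gb(n_params, bytes_per_param):
    """Size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

weights_gb = weights_size_gb(N_PARAMS, BYTES_PER_PARAM_FP16)
headroom_gb = GPU_RAM_GB - weights_gb
print(f"weights: ~{weights_gb:.0f} GB, headroom: ~{headroom_gb:.0f} GB")
# prints: weights: ~14 GB, headroom: ~2 GB
```

The roughly 2 GB of headroom is why short prompts and a small `max_new_tokens` fit on a T4, while larger batches may not.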

In [None]:
# NOTE: The download takes up to 5 min. Model size: 14 GB.

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Load the model in half-precision
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16, device_map="auto")
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", use_fast=True)
print("Model loaded.")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Model loaded.


### Template: load image from url

In [None]:
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "text", "text": "Classify what's in the image. Only return 1 label from this list: [sign, car, road marking]."},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, torch.float16)

# Generate
generate_ids = model.generate(**inputs, max_new_tokens=30)
processor.batch_decode(generate_ids, skip_special_tokens=True)

["USER:  \nClassify what's in the image. Only return 1 label from this list: [sign, car, road marking]. ASSISTANT: Sign"]
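
The decoded string repeats the whole prompt, so the label has to be cut out before it can be used. A minimal sketch of one way to do that, assuming the reply always follows the last `ASSISTANT:` marker as in the output above; `extract_label` is a hypothetical helper, not part of `transformers`.

```python
def extract_label(decoded, allowed_labels):
    """Return the assistant's answer if it is one of allowed_labels, else None."""
    # Everything after the last "ASSISTANT:" marker is the model's reply.
    reply = decoded.rsplit("ASSISTANT:", 1)[-1]
    # Normalise whitespace, a trailing period, and case before matching.
    label = reply.strip().strip(".").lower()
    return label if label in allowed_labels else None

decoded = (
    "USER:  \nClassify what's in the image. Only return 1 label from this list: "
    "[sign, car, road marking]. ASSISTANT: Sign"
)
print(extract_label(decoded, ["sign", "car", "road marking"]))  # prints: sign
```

Returning `None` for anything outside the list makes it easy to spot replies where the model ignored the instruction.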

### Template: use local image!

In [None]:
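import base64
import mimetypes

def image_to_base64_data_uri(file_path):
    # Infer the MIME type from the file extension (e.g. image/jpeg for .jpeg)
    # instead of hard-coding image/png; fall back to PNG if it cannot be guessed.
    mime_type = mimetypes.guess_type(file_path)[0] or "image/png"
    with open(file_path, "rb") as img_file:
        base64_data = base64.b64encode(img_file.read()).decode('utf-8')
    return f"data:{mime_type};base64,{base64_data}"

# Replace with the actual path to your image file
file_path = '/content/200.jpeg'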
# '/content/9123.jpeg' - Clear
# '/content/6488.jpeg' - Clear
# '/content/57465.jpeg' - crystals
# '/content/200.jpeg' - Other
data_uri = image_to_base64_data_uri(file_path)

prompt = """
Look at the image and choose the most appropriate single label from this list: [clear, crystals, precipitate, other].
Only respond with one of the labels from the list.
"""


conversation = [
    {
        "role": "user",
        "content": [
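            # The type must be "image" (as in the URL template above). An "image_url"
            # entry appears to be skipped by the chat template, so no image reaches the model.
            {"type": "image", "url": data_uri},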
            {"type": "text", "text": prompt},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, torch.float16)

# Generate
generate_ids = model.generate(
    **inputs,
    max_new_tokens=30,
    temperature=0.7,         # Add randomness
    do_sample=True,          # Enable sampling
    top_k=10                 # Top-k sampling to limit to top tokens
)

processor.batch_decode(generate_ids, skip_special_tokens=True)

['USER: \nLook at the image and choose the most appropriate single label from this list: [clear, crystals, precipitate, other].\nOnly respond with one of the labels from the list.\n ASSISTANT: Clear']
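
To put a number on how biased the predictions are, the predicted labels can be tallied against the expected ones. The sketch below assumes that the labels in the comments next to `file_path` are the true classes; the `predictions` dictionary is a hypothetical placeholder for illustration, not real model output, and `summarize` is a made-up helper name.

```python
from collections import Counter

def summarize(predictions, ground_truth):
    """Return (accuracy, Counter of predicted labels) over the images in ground_truth."""
    correct = sum(predictions.get(path) == label for path, label in ground_truth.items())
    return correct / len(ground_truth), Counter(predictions.values())

# ASSUMPTION: the labels in the comments next to file_path are the true classes.
ground_truth = {
    "/content/9123.jpeg": "clear",
    "/content/6488.jpeg": "clear",
    "/content/57465.jpeg": "crystals",
    "/content/200.jpeg": "other",
}

# HYPOTHETICAL predictions for illustration only, not real model output.
predictions = {path: "clear" for path in ground_truth}

accuracy, counts = summarize(predictions, ground_truth)
print(accuracy, counts)
```

A label count dominated by a single class, together with low accuracy on the other classes, is the pattern a biased classifier would show.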