### LLaVA Demo

Demo images are in the `images` folder. There are four images, labeled {img1: alert, img2: drowsy, img3: alert, img4: drowsy}. Follow the cells below to run the demo with any of the three model variants we used.

In [2]:
import torch
from transformers import BitsAndBytesConfig, pipeline
from PIL import Image

In [3]:
# Load the four test images (img1.jpg ... img4.jpg)
images = [Image.open(f"./images/img{n}.jpg") for n in range(1, 5)]

# Set quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

To run a different variant, uncomment its `model_id` line in the cell below and comment out the others. This is the only change needed; the rest of the code stays the same. The available options are:
- **LLaVA-7B**: `model_id = "llava-hf/llava-1.5-7b-hf"`
- **LLaVA-13B**: `model_id = "llava-hf/llava-1.5-13b-hf"`
- **BakLLaVA**: `model_id = "llava-hf/bakLlava-v1-hf"`

For this demo, we used LLaVA-13B.
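As an alternative to commenting and uncommenting lines, the variant could be picked by name from a small lookup table. The sketch below is only an illustration: `MODEL_IDS` and `select_model_id` are hypothetical names that do not appear in the notebook, and the only values taken from it are the three model IDs listed above.

```python
# Hypothetical lookup table: the keys are labels chosen for this sketch,
# the values are the three model IDs listed above.
MODEL_IDS = {
    "LLaVA-7B": "llava-hf/llava-1.5-7b-hf",
    "LLaVA-13B": "llava-hf/llava-1.5-13b-hf",
    "BakLLaVA": "llava-hf/bakLlava-v1-hf",
}


def select_model_id(variant: str) -> str:
    """Return the model ID for a variant name, or raise ValueError if unknown."""
    try:
        return MODEL_IDS[variant]
    except KeyError:
        options = ", ".join(sorted(MODEL_IDS))
        raise ValueError(f"Unknown variant {variant!r}; choose one of: {options}") from None


# Same choice as in this demo
model_id = select_model_id("LLaVA-13B")
print(model_id)
```

A misspelled variant name then fails immediately with a readable error instead of during the model download.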

In [5]:
# Set the model ID (uncomment your desired variant)

# model_id = "llava-hf/llava-1.5-7b-hf" # (1)
model_id = "llava-hf/llava-1.5-13b-hf" # (2)
# model_id = "llava-hf/bakLlava-v1-hf" # (3)

# Leverage the image-to-text pipeline from transformers
pipe = pipeline("image-to-text", model=model_id, model_kwargs={"quantization_config": quantization_config})

config.json: 100%|██████████| 1.10k/1.10k [00:00<00:00, 8.49MB/s]
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Downloading shards: 100%|██████████| 6/6 [00:01<00:00,  4.10it/s]
Loading checkpoint shards: 100%|██████████| 6/6 [04:05<00:00, 40.87s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


The prompt must follow the format shown in the next cell (this demo uses our fine-tuned prompt). The model's answer is the text that comes after "ASSISTANT: " in the generated output.
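To try other questions without retyping the wrapper each time, the format can be captured in a small helper. This is a minimal sketch: `build_prompt` is a hypothetical name, and it assumes the `USER: <image>\n<prompt>\nASSISTANT:` layout used in this demo.

```python
def build_prompt(question: str) -> str:
    """Wrap a question in the USER/ASSISTANT layout used in this demo.

    The <image> placeholder marks where the image is injected.
    """
    return f"USER: <image>\n{question.strip()}\nASSISTANT:"


# Example with a shorter, illustrative question (not the prompt used in this demo)
example_prompt = build_prompt("Is this driver alert? Answer only with 'yes' or 'no'.")
print(example_prompt)
```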

In [11]:
# Write the prompt, should be in the format --> USER: <image>\n<prompt>\nASSISTANT:
# Note that for this prompt, an answer of 'yes' means alert and 'no' means drowsy.
prompt = "USER: <image>\nCarefully examining the driver's current state, is this driver fully alert and very engaged in safe driving practices? Answer only with 'yes' or 'no'.\nASSISTANT:"

# Get and display predictions (for this demo, all predictions are correct)
for i, img in enumerate(images, start=1):
    output = pipe(img, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
    print(f"[img{i}]\n")
    print(output[0]["generated_text"])
    print("\n----------------------------------------------")

[img1]

USER:  
Carefully examining the driver's current state, is this driver fully alert and very engaged in safe driving practices? Answer only with 'yes' or 'no'.
ASSISTANT: Yes

----------------------------------------------
[img2]

USER:  
Carefully examining the driver's current state, is this driver fully alert and very engaged in safe driving practices? Answer only with 'yes' or 'no'.
ASSISTANT: No

----------------------------------------------
[img3]

USER:  
Carefully examining the driver's current state, is this driver fully alert and very engaged in safe driving practices? Answer only with 'yes' or 'no'.
ASSISTANT: Yes

----------------------------------------------
[img4]

USER:  
Carefully examining the driver's current state, is this driver fully alert and very engaged in safe driving practices? Answer only with 'yes' or 'no'.
ASSISTANT: No

----------------------------------------------
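Since the generated text echoes the prompt, the answer has to be cut out before it can be compared with the labels. The sketch below shows one way this might be done; `extract_answer`, `to_label`, and `EXPECTED` are hypothetical names introduced here, and the yes/no mapping assumes the prompt used in this demo ('yes' means alert, 'no' means drowsy). The answers in `demo_answers` are copied from the output above.

```python
import string

# Ground-truth labels for the four demo images, as listed at the top of this demo
EXPECTED = {"img1": "alert", "img2": "drowsy", "img3": "alert", "img4": "drowsy"}


def extract_answer(generated_text: str) -> str:
    """Return the text after the last "ASSISTANT:" marker.

    If the marker is missing, the whole text is returned (stripped).
    """
    _, _, answer = generated_text.rpartition("ASSISTANT:")
    return answer.strip()


def to_label(answer: str) -> str:
    """Map a yes/no answer to alert/drowsy; anything else becomes "unknown"."""
    words = answer.lower().split()
    first = words[0].strip(string.punctuation) if words else ""
    return {"yes": "alert", "no": "drowsy"}.get(first, "unknown")


# Answers copied from the output above, in image order
demo_answers = {"img1": "Yes", "img2": "No", "img3": "Yes", "img4": "No"}
predicted = {name: to_label(ans) for name, ans in demo_answers.items()}
num_correct = sum(predicted[name] == EXPECTED[name] for name in EXPECTED)
print(f"{num_correct}/{len(EXPECTED)} correct")
```

In the prediction loop, `extract_answer(output[0]["generated_text"])` would supply the answer string for each image.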
