In [1]:
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image
import torch

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl",
    torch_dtype=torch.float16
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Define QnA function
def ask_image_question(image_path, question):
    image = Image.open(image_path).convert('RGB')
    inputs = processor(image, question=question, return_tensors="pt").to(device)
    out = model.generate(**inputs)
    return processor.decode(out[0], skip_special_tokens=True)

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
from llama_cpp import Llama
import os

openhermes_path = r"C:\GGUF\TheBloke\OpenHermes-2.5-Mistral-7B-GGUF\openhermes-2.5-mistral-7b.Q4_K_M.gguf"

OpenHermes = Llama(
    model_path=openhermes_path,
    n_gpu_layers=20,
    n_ctx=2048,
    n_batch=256,
    n_threads=6,
    use_mlock=True,
    verbose=True
)

In [11]:
def blip2_to_openhermes(image_path, question):
    print(f"❓ QnA: {question}")
    
    visual_answer = ask_image_question(image_path, question)
    print("📸 BLIP-2 Answer:", visual_answer)

    hermes_prompt = (
        f"The image was analyzed and the answer to the question "
        f"'{question}' is: '{visual_answer}'. Can you provide a deeper interpretation?"
    )
    full_prompt = f"<|user|>\n{hermes_prompt}\n<|assistant|>\n"

    response = OpenHermes(full_prompt, max_tokens=300, stop=["<|user|>"])
    hermes_text = response["choices"][0]["text"]

    print("🧠 OpenHermes Response:", hermes_text)


In [12]:
image_path = r"C:\Users\DaysPC\Pictures\Screenshots\PFP.jpg"
question = "Describe this image"
blip2_to_openhermes(image_path, question)

❓ QnA: Describe this image
📸 BLIP-2 Answer: a man is sitting in a chair with a laptop computer


Llama.generate: 43 prefix-match hit, remaining 9 prompt tokens to eval
llama_perf_context_print:        load time =    7002.10 ms
llama_perf_context_print: prompt eval time =    2736.97 ms /     9 tokens (  304.11 ms per token,     3.29 tokens per second)
llama_perf_context_print:        eval time =   91928.63 ms /   110 runs   (  835.71 ms per token,     1.20 tokens per second)
llama_perf_context_print:       total time =   96191.11 ms /   119 tokens


🧠 OpenHermes Response: The image portrays a modern scene of a man using technology to perform various tasks or connect with others. The man's posture and the arrangement of the laptop on his lap suggest that he may be working or communicating with someone remotely. The surroundings are not visible in the image, but the man's attire and the laptop indicate that this could be happening in an office or a home setting. Overall, the image depicts a common scenario of a person using technology as a tool for productivity and interaction in the digital age.
