Huggingface - [Image-to-Text](https://huggingface.co/tasks/image-to-text)

Run the Llava model in Google Colab!

https://huggingface.co/llava-hf/llava-1.5-7b-hf

Llava is a multimodal model that takes an image and text as input and generates text. It can be seen as an "open-source version of GPT-4". It yields very good results, as we will see in this Google Colab demo.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/62441d1d9fdefb55a0b7d12c/FPshq08TKYD0e-qwPLDVO.png)

The architecture is a pure decoder-based text model that takes vision hidden states concatenated with text hidden states as its input.
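
As a rough illustration of that concatenation, the sketch below splices a block of image embeddings into a sequence of text embeddings at the position of the `<image>` placeholder. It uses plain Python lists and toy sizes. The function name `merge_image_and_text` and the id `IMAGE_TOKEN_ID` are hypothetical; the real model does the equivalent with tensors inside its forward pass.

```python
# Toy sketch: replace the <image> placeholder with the image patch embeddings.
IMAGE_TOKEN_ID = -1  # hypothetical placeholder id, for illustration only


def merge_image_and_text(token_ids, text_embeds, image_embeds):
    """Return one embedding sequence with the image embeddings spliced in.

    token_ids    : list[int], one id per text position
    text_embeds  : list[list[float]], one vector per text position
    image_embeds : list[list[float]], one vector per image patch
    """
    merged = []
    for token_id, vector in zip(token_ids, text_embeds):
        if token_id == IMAGE_TOKEN_ID:
            # One placeholder expands into all image patch vectors.
            merged.extend(image_embeds)
        else:
            merged.append(vector)
    return merged


token_ids = [101, IMAGE_TOKEN_ID, 102]
text_embeds = [[0.1, 0.1], [0.0, 0.0], [0.2, 0.2]]
image_embeds = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]  # 4 toy patches

sequence = merge_image_and_text(token_ids, text_embeds, image_embeds)
print(len(sequence))  # 2 text positions + 4 image patches
```

The decoder then attends over this single sequence, which is why the prompt must contain the `<image>` placeholder exactly where the picture should appear.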

We will load the model in 4-bit precision using bitsandbytes quantization (the scheme popularized by QLoRA) and use `pipeline` to run it.

In [None]:
!pip install -q -U transformers==4.37.2
!pip install -q bitsandbytes==0.41.3 accelerate==0.25.0
!pip install -q git+https://github.com/openai/whisper.git
!pip install -q gradio
# !pip install -q openai
!pip install -q gTTS
!ffmpeg -f lavfi -i anullsrc=r=44100:cl=mono -t 10 -q:a 9 -acodec libmp3lame Temp.mp3

## Preparing the quantization config to load the model in 4bit precision

In [None]:
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
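
To get a feel for why 4-bit loading matters on a Colab GPU, here is a back-of-the-envelope estimate of the memory taken by the weights alone. This is only a sketch: the 7 billion figure is an assumed round number for a "7b" model, and the estimate ignores quantization constants, activations, and any layers that stay in higher precision. The helper name `weight_memory_gib` is ours, not part of any library.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


params = 7_000_000_000  # assumed round figure for a "7b" model
fp16 = weight_memory_gib(params, 16)
int4 = weight_memory_gib(params, 4)
print(f"fp16: {fp16:.1f} GiB, 4-bit: {int4:.1f} GiB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the half-precision footprint, which leaves room on the same GPU for the Whisper model used later.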

We will leverage the `image-to-text` pipeline from transformers!

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
from huggingface_hub import whoami

whoami()
# you should see something like {'type': 'user',  'id': '...',  'name': 'Wauplin', ...}

In [None]:
from transformers import pipeline

model_id = "llava-hf/llava-1.5-7b-hf"

pipe = pipeline("image-to-text", model=model_id, model_kwargs={"quantization_config": quantization_config})

In [None]:
import whisper
import gradio as gr
import time
import warnings
# import json
# import openai
import os
from gtts import gTTS

To load the model in 4-bit precision, we need to pass a `quantization_config` to our model. This is what the `model_kwargs` argument of the `pipeline` call above does.

## Load an image

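The image used for the original Llava demo is left commented out in the cells below. Instead, we download the sample images for this chapter and pick one of them.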

And ask the model to describe that image!

In [None]:
for i in range(1, 11):
    # num = (str(i).zfill(2))
    # print(f"https://github.com/PacktPublishing/Learn-OpenAI-Whisper/raw/main/Chapter05/images/LOAIW_ch05_image_{str(i).zfill(2)}.jpg")
    !wget -nv https://github.com/PacktPublishing/Learn-OpenAI-Whisper/raw/main/Chapter05/images/LOAIW_ch05_image_{str(i).zfill(2)}.jpg

In [None]:
# import requests
from PIL import Image

# image_url = "https://llava-vl.github.io/static/images/view.jpg"
# image_path = "/content/Whisper-FT-ES-001.jpg"
# image_path = "/content/Blaky.jpg"
image_path = "/content/LOAIW_ch05_image_03.jpg"
image = Image.open(image_path)
image

## Load the model using `pipeline`

In [None]:
# NLTK helps to split the transcription sentence by sentence
# and shows it in a neat manner one below another. You will see it in the output below.

import nltk
nltk.download('punkt')
from nltk import sent_tokenize

It is important to prompt the model with a specific format, which is:
```bash
USER: <image>\n<prompt>\nASSISTANT:
```
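
A small helper keeps that template in one place instead of repeating the string concatenation in every cell. This is a minimal sketch; `build_llava_prompt` is a hypothetical name, not a function from transformers.

```python
def build_llava_prompt(user_text):
    """Wrap a question in the USER/ASSISTANT template shown above."""
    # Strip stray whitespace so the template has exactly one newline on each side.
    return f"USER: <image>\n{user_text.strip()}\nASSISTANT:"


prompt = build_llava_prompt("What is shown in this picture?")
print(prompt)
```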

In [None]:
import locale
print(locale.getlocale())  # Before running the pipeline
# Run the pipeline
print(locale.getlocale())  # After running the pipeline

In [None]:
max_new_tokens = 200

prompt_instructions = """
Describe the image using as much detail as possible, is it a painting, a photograph, what colors are predominant, what is the image about?
"""

prompt = "USER: <image>\n" + prompt_instructions + "\nASSISTANT:"

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": max_new_tokens})
# outputs
# print(outputs[0]["generated_text"])
for sent in sent_tokenize(outputs[0]["generated_text"]):
    print(sent)

In [None]:
import locale
print(locale.getlocale())  # Before running the pipeline
# Run the pipeline
print(locale.getlocale())  # After running the pipeline

In [None]:
warnings.filterwarnings("ignore")

Using a GPU is the preferred way to run Whisper. If you are using a local machine, you can check whether a GPU is available. In the cell below, `torch.cuda.is_available()` returns `False` if no CUDA-compatible Nvidia GPU is available and `True` if one is. The next line sets `DEVICE` so that the GPU is used whenever it is available.

In [None]:
# https://lablab.ai/t/whisper-tutorial
import warnings
from gtts import gTTS
import numpy as np
import torch
torch.cuda.is_available()
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using torch {torch.__version__} ({DEVICE})")

Please keep in mind that there are multiple different models available. You can find all of them [here](https://github.com/openai/whisper/blob/main/model-card.md). Each one of them has tradeoffs between accuracy and speed (compute needed). We will use the 'medium' model for this tutorial.
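
To make the tradeoff concrete, the sketch below picks the largest model that fits a given amount of GPU memory. The parameter counts and memory figures are approximate values as listed in the Whisper README at the time of writing; treat them as rough guides and check the model card linked above. The helper `largest_model_for` is a hypothetical name.

```python
# (name, parameters in millions, approximate VRAM in GB), smallest to largest.
WHISPER_MODELS = [
    ("tiny", 39, 1),
    ("base", 74, 1),
    ("small", 244, 2),
    ("medium", 769, 5),
    ("large", 1550, 10),
]


def largest_model_for(vram_gb):
    """Return the largest model whose approximate VRAM need fits, or None."""
    best = None
    for name, _params_millions, required_gb in WHISPER_MODELS:
        if required_gb <= vram_gb:
            best = name
    return best


print(largest_model_for(6))
```

In this notebook the GPU is shared with the quantized Llava model, so the memory budget passed to such a helper should be what is left after Llava has been loaded.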

In [None]:
# Now we can load the Whisper model with the following command:
import whisper
model = whisper.load_model("medium", device=DEVICE)
print(
    f"Model is {'multilingual' if model.is_multilingual else 'English-only'} "
    f"and has {sum(np.prod(p.shape) for p in model.parameters()):,} parameters."
)

In [None]:
import requests
import re
from PIL import Image


input_text = 'What color is the flag in the image?'
input_image = '/content/LOAIW_ch05_image_10.jpg'

# load the image
image = Image.open(input_image)

# prompt_instructions = """
# Describe the image using as much detail as possible, is it a painting, a photograph, what colors are predominant, what is the image about?
# """

# print(input_text)
prompt_instructions = """
Act as an expert in imagery descriptive analysis, using as much detail as possible from the image, respond to the following prompt:
""" + input_text
prompt = "USER: <image>\n" + prompt_instructions + "\nASSISTANT:"

# print(prompt)

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})

# re.DOTALL lets '.' match newlines, so multi-line replies are captured in full
match = re.search(r'ASSISTANT:\s*(.*)', outputs[0]["generated_text"], re.DOTALL)

if match:
    # Extract the text after "ASSISTANT:"
    extracted_text = match.group(1)
    print(extracted_text)
else:
    print("No match found.")

for sent in sent_tokenize(outputs[0]["generated_text"]):
    print(sent)
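
The `re.DOTALL` flag controls how much of the reply a pattern like `ASSISTANT:\s*(.*)` captures: without it, `.` stops at the first newline. The sketch below shows the difference on a made-up string; `extract_reply` is a hypothetical helper and the sample text is not real model output.

```python
import re


def extract_reply(generated_text):
    """Return everything after 'ASSISTANT:', or None if the marker is missing."""
    match = re.search(r"ASSISTANT:\s*(.*)", generated_text, re.DOTALL)
    return match.group(1).strip() if match else None


sample = (
    "USER:  \nWhat color is the flag?\n"
    "ASSISTANT: The flag is red.\nIt has a white cross."
)

# Without re.DOTALL only the first line of the reply is captured.
first_line_only = re.search(r"ASSISTANT:\s*(.*)", sample).group(1)

print(first_line_only)
print(extract_reply(sample))
```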

In [None]:
import datetime
import os
# import time

In [None]:
## Logger file
tstamp = datetime.datetime.now()
tstamp = str(tstamp).replace(' ','_')
logfile = f'{tstamp}_log.txt'
def writehistory(text):
    # The with statement closes the file automatically
    with open(logfile, 'a', encoding='utf-8') as f:
        f.write(text)
        f.write('\n')
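
One thing to be aware of: `str(datetime.datetime.now())` contains colons and fractional seconds, and colons are not allowed in file names on Windows. If the notebook is run outside Colab, a `strftime`-based name is a safer choice. A minimal sketch, with `make_log_filename` as a hypothetical helper:

```python
import datetime


def make_log_filename(now=None):
    """Build a log file name that contains no spaces or colons."""
    now = now or datetime.datetime.now()
    return now.strftime("%Y%m%d_%H%M%S") + "_log.txt"


print(make_log_filename(datetime.datetime(2024, 1, 2, 3, 4, 5)))
```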

In [None]:
import re
import requests
from PIL import Image

def img2txt(input_text, input_image):

    # load the image
    image = Image.open(input_image)

    writehistory(f"Input text: {input_text} - Type: {type(input_text)} - Dir: {dir(input_text)}")
    # transcribe() returns a tuple when no audio was provided
    if isinstance(input_text, tuple):
        prompt_instructions = """
        Describe the image using as much detail as possible, is it a painting, a photograph, what colors are predominant, what is the image about?
        """
    else:
        prompt_instructions = """
        Act as an expert in imagery descriptive analysis, using as much detail as possible from the image, respond to the following prompt:
        """ + input_text

    writehistory(f"prompt_instructions: {prompt_instructions}")
    prompt = "USER: <image>\n" + prompt_instructions + "\nASSISTANT:"

    outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})

    # Properly extract the response text
    if outputs is not None and len(outputs[0]["generated_text"]) > 0:
        # re.DOTALL lets '.' match newlines, so multi-line replies are kept whole
        match = re.search(r'ASSISTANT:\s*(.*)', outputs[0]["generated_text"], re.DOTALL)
        if match:
            # Extract the text after "ASSISTANT:"
            reply = match.group(1)
        else:
            reply = "No response found."
    else:
        reply = "No response generated."

    return reply

In [None]:
def transcribe(audio):

    # Check if the audio input is None or empty
    if audio is None or audio == '':
        return ('','',None)  # Return empty strings and None audio file

    # language = 'en'

    audio = whisper.load_audio(audio)
    audio = whisper.pad_or_trim(audio)

    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    _, probs = model.detect_language(mel)

    options = whisper.DecodingOptions()
    result = whisper.decode(model, mel, options)
    result_text = result.text

    return result_text
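
Two details of this function are worth spelling out. First, `whisper.pad_or_trim` fits the audio to a fixed 30-second window at 16 kHz, so anything beyond the first 30 seconds is dropped on this code path. Second, `detect_language` returns a dictionary of language probabilities, from which the most likely language can be read. The sketch below mimics both steps with plain Python; `pad_or_trim_list` and `most_likely_language` are hypothetical stand-ins and the probabilities are made up.

```python
SAMPLE_RATE = 16_000   # Whisper works on 16 kHz audio
CHUNK_SECONDS = 30     # and decodes fixed 30-second windows
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS


def pad_or_trim_list(samples, length=N_SAMPLES):
    """Cut long audio to `length` samples, or pad short audio with silence."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


def most_likely_language(probs):
    """Return the language code with the highest probability."""
    return max(probs, key=probs.get)


short_clip = [0.5] * 10
print(len(pad_or_trim_list(short_clip)))
print(most_likely_language({"en": 0.7, "es": 0.2, "fr": 0.1}))
```

For recordings longer than 30 seconds, `model.transcribe(...)` processes the audio window by window and may be the better fit.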

In [None]:
def text_to_speech(text, file_path):
    language = 'en'

    audioobj = gTTS(text = text,
                    lang = language,
                    slow = False)

    audioobj.save(file_path)

    return file_path
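
As far as we know, gTTS refuses to synthesize an empty string, so it can help to normalize the text before passing it on. The sketch below is one way to do that, under that assumption; `text_for_speech` and its fallback message are hypothetical.

```python
def text_for_speech(text, fallback="Sorry, I have no answer for that."):
    """Collapse whitespace and fall back to a default when nothing is left."""
    cleaned = " ".join(str(text).split()) if text is not None else ""
    return cleaned or fallback


print(text_for_speech("  The flag is red.\n  It has a white cross. "))
```

The result could then be passed to `text_to_speech` in place of the raw model reply.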

In [None]:
import gradio as gr
import base64
import os

# A function to handle audio and image inputs
def process_inputs(audio_path, image_path):
    # Process the audio file (assuming this is handled by a function called 'transcribe')
    speech_to_text_output = transcribe(audio_path)

    # Handle the image input
    if image_path:
        chatgpt_output = img2txt(speech_to_text_output, image_path)
    else:
        chatgpt_output = "No image provided."

    # Convert the model's reply to speech; text_to_speech returns the path of the saved audio file
    processed_audio_path = text_to_speech(chatgpt_output, "Temp3.mp3")

    return speech_to_text_output, chatgpt_output, processed_audio_path

# Create the interface
iface = gr.Interface(
    fn=process_inputs,
    inputs=[
        gr.Audio(sources=["microphone"], type="filepath"),
        gr.Image(type="filepath")
    ],
    outputs=[
        gr.Textbox(label="Speech to Text"),
        gr.Textbox(label="ChatGPT Output"),
        gr.Audio("Temp.mp3")
    ],
    title="Learn OpenAI Whisper: Image processing with Whisper and Llava",
    description="Upload an image and interact via voice input and audio response."
)

# Launch the interface
iface.launch(debug=True)
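
The data flow behind the interface can be checked without a GPU, a microphone, or network access by passing the three steps in as callables. This is only a sketch of the wiring: `run_voice_image_pipeline`, the stand-in lambdas, and the file names are all hypothetical.

```python
def run_voice_image_pipeline(audio_path, image_path, transcribe_fn, describe_fn, speak_fn):
    """Wire speech-to-text, image description, and text-to-speech together."""
    # Skip transcription when nothing was recorded, so the result is always a string.
    transcript = transcribe_fn(audio_path) if audio_path else ""
    if image_path:
        answer = describe_fn(transcript, image_path)
    else:
        answer = "No image provided."
    audio_out = speak_fn(answer, "reply.mp3")
    return transcript, answer, audio_out


# Stand-in callables that mimic the real functions' inputs and outputs.
result = run_voice_image_pipeline(
    "question.wav",
    "photo.jpg",
    transcribe_fn=lambda path: "What color is the flag?",
    describe_fn=lambda text, image: f"Answer to: {text}",
    speak_fn=lambda text, out_path: out_path,
)
print(result)
```

Swapping the lambdas for `transcribe`, `img2txt`, and `text_to_speech` gives the same flow as `process_inputs`.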