<a href="https://colab.research.google.com/github/IS-Saja/VQA-with-Audio/blob/main/Visual_Question_%26_Answering_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Visual Question & Answering

### Explanation

This notebook uses Gradio and Hugging Face Transformers to build a Visual Question Answering (VQA) interface with audio output. A user uploads an image and asks a question about it; the model generates an answer, which is then converted into an audio file and played back to the user.

**Installing Required Libraries:**

* `transformers`: Provides tools for working with various pre-trained models from Hugging Face.
* `gradio`: A library for creating simple and interactive web interfaces for machine learning models.
* `gTTS`: A library for converting text to speech, which is used to generate audio responses.
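
After the install cell below has run, it can help to confirm which versions were actually installed, since behavior can differ between releases. The following is a minimal sketch that uses only the standard library; the helper name `installed_versions` is an illustrative assumption and not part of the notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Report the three libraries this notebook depends on.
    for name, ver in installed_versions(["transformers", "gradio", "gTTS"]).items():
        print(f"{name}: {ver or 'not installed'}")
```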

In [1]:
!pip install transformers
!pip install gradio
!pip install gtts


**Importing Libraries:**

* `gradio` (imported as `gr`): Creates the user interface.
* `BlipForQuestionAnswering` and `AutoProcessor` from `transformers`: Load the pre-trained model and processor for Visual Question Answering.
* `Image` from `PIL`: Handles images.
* `gTTS` from `gtts`: Converts text to speech and generates an audio file.
* `os`: Imported for file handling, although the cells shown here do not call it.

**Loading the Model and Processor:**


Both `BlipForQuestionAnswering` and `AutoProcessor` are loaded from the pre-trained checkpoint `"Salesforce/blip-vqa-base"` using the `from_pretrained()` method.

The `BlipForQuestionAnswering` model is specifically designed to handle questions related to images and generate relevant answers.

The `AutoProcessor` is responsible for preprocessing the image and question inputs for the model.
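
To make the processor's role concrete, the sketch below lists what it hands to the model. The helper `describe_inputs` is a hypothetical name introduced here; it only assumes that the processor's return value behaves like a dictionary of tensors with a `.shape` attribute, which is how Transformers processors are generally used.

```python
def describe_inputs(processor, image, question):
    """Return {input name: tensor shape} for one image/question pair."""
    # Same call the notebook makes later: image first, then the question text.
    inputs = processor(image, question, return_tensors="pt")
    return {name: tuple(tensor.shape) for name, tensor in inputs.items()}
```

With the `processor` loaded in the next cell and a PIL image `img`, `describe_inputs(processor, img, "What color is the car?")` should list entries such as `pixel_values`, `input_ids`, and `attention_mask`; the exact shapes depend on the checkpoint's preprocessing configuration.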



In [2]:
from transformers import BlipForQuestionAnswering, AutoProcessor  # For the pre-trained VQA model and processor
from PIL import Image  # For image handling
import gradio as gr  # For creating the interface
from gtts import gTTS  # For converting text to speech
import os  # For file handling

# Load the model and processor from Hugging Face
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
processor = AutoProcessor.from_pretrained("Salesforce/blip-vqa-base")






**Defining the `answer_question_with_audio` Function:**

This function takes two inputs:

* **image:** The image input provided by the user.
* **question:** The question input provided by the user.

Inside the function:

* If the image arrives as a file path (a string), it is opened with `Image.open()`. Because the interface uses `gr.Image(type="pil")`, the image normally arrives as a PIL image already, so this branch acts as a fallback.
* The `processor` converts the image and question into inputs the model can consume. The `return_tensors="pt"` argument returns them as PyTorch tensors.
* The model generates answer token IDs from the processed inputs with the `generate()` method.
* `processor.decode()` converts the first generated sequence into a human-readable string; `skip_special_tokens=True` drops special marker tokens from the text.
* The decoded text is converted to English speech with `gTTS` (`lang='en'`).
* The audio is saved as `answer.mp3`, and that path is returned as the function output.

In [3]:
# Define the function that handles the image and question input, and returns an audio response
def answer_question_with_audio(image, question):
    # If the input is a file path, open the image
    if isinstance(image, str):
        image = Image.open(image)

    # Process the image and question using the processor to get inputs for the model
    inputs = processor(image, question, return_tensors="pt")

    # Generate the model's response to the question
    out = model.generate(**inputs)

    # Decode the model's output to get a human-readable answer
    answer_text = processor.decode(out[0], skip_special_tokens=True)

    # Convert the text answer to audio using gTTS
    tts = gTTS(text=answer_text, lang='en')

    # Save the audio file
    audio_path = "answer.mp3"
    tts.save(audio_path)

    # Return the path to the audio file
    return audio_path
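
One caveat with the function above is that every request writes to the same `answer.mp3`, so two users of a shared link could overwrite each other's audio; gTTS is also expected to reject empty text. A possible hardening is sketched below. The helper `save_answer_audio`, its `tts_factory` parameter, and the fallback sentence are assumptions for illustration; in the notebook, `tts_factory` would be `gTTS`.

```python
import os
import tempfile


def save_answer_audio(answer_text, tts_factory, fallback="No answer was generated."):
    """Synthesize answer_text into a uniquely named MP3 file and return its path."""
    # Fall back to a fixed sentence so the speech engine never receives empty text.
    text = answer_text.strip() or fallback

    # mkstemp creates a unique file, so concurrent requests do not collide.
    fd, audio_path = tempfile.mkstemp(suffix=".mp3")
    os.close(fd)  # the TTS object reopens the path itself when saving

    # tts_factory is called the same way the notebook calls gTTS.
    tts_factory(text=text, lang="en").save(audio_path)
    return audio_path
```

Inside `answer_question_with_audio`, the final lines could then become `return save_answer_audio(answer_text, gTTS)`. The temporary files are not cleaned up in this sketch.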

**Creating the Gradio Interface:**

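The Gradio interface is created using the `gr.Interface()` function:

* `fn=answer_question_with_audio`: The function to be called when the user interacts with the interface.
* `inputs`: Specifies the input components for the interface:
  * `gr.Image(type="pil")`: An image upload component that passes the image to the function as a PIL image.
  * `gr.Textbox(label="Question")`: A textbox for the user to input their question.
* `outputs`: Specifies the output component:
  * `gr.Audio(label="Answer (Audio)")`: An audio player where the generated answer will be played.
* `title` and `description`: The heading and the short instructions shown at the top of the page.

Calling `interface.launch(share=True)` starts the app and creates a temporary public URL; the run recorded in this notebook reported that the share link expires after 72 hours.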
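
Either input component can reach the function empty: the image is `None` when nothing was uploaded, and the textbox yields an empty string when left blank. A minimal guard, assuming the handler should fail early with a readable message, might look like this. `validate_inputs` is a hypothetical helper; in a Gradio app the `ValueError` could be swapped for `gr.Error` so the message is shown in the browser.

```python
def validate_inputs(image, question):
    """Return the cleaned question, or raise ValueError if an input is missing."""
    if image is None:
        raise ValueError("Please upload an image before asking a question.")
    cleaned = (question or "").strip()
    if not cleaned:
        raise ValueError("Please type a question about the image.")
    return cleaned
```

The handler could call `question = validate_inputs(image, question)` as its first line, before any model work is done.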



In [4]:
# Create a Gradio interface with image and text inputs, and an audio output
interface = gr.Interface(
    fn=answer_question_with_audio,  # Function to call when the interface is used
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],  # Inputs: Image and Textbox
    outputs=gr.Audio(label="Answer (Audio)"),  # Output: Audio response
    title="Visual Question Answering with Audio",  # Title of the interface
    description="Upload an image and ask a question. The answer will be provided as an audio response."  # Description
)

# Launch the Gradio interface with public sharing enabled
interface.launch(share=True)



