Approach 2: Using the Google GenAI Library and the Hugging Face Transformers Pipeline
System Overview
Input Image → Image Captioning with Object Detection → Generated Text → Text-to-Speech → Audio Output

Step-by-Step Process
Step 1: Image Captioning using Google GenAI SDK
Objective: Generate a detailed and accessibility-focused caption describing the image.
Instead of detecting isolated objects, the model:
●	Understands the entire scene
●	Identifies relationships between objects
●	Describes actions and context
●	Produces natural, human-like language
Recommended Model: gemini-3-flash-preview
When using this Gemini model, provide the following system prompt:
"You are a helpful AI Assistant. Given an image, perform object detection and provide a text output which contains the information about the labels detected and their counts."
Step 2: Text Processing (Optional Enhancement)
Objective: Prepare the generated caption for speech synthesis.
Possible enhancements:
●	Remove unnecessary symbols
●	Control length (brief/detailed mode)
●	Adjust tone (formal/informal)
●	Add introductory phrase (e.g., "Here is what I see in the image...")
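The enhancements listed above can also be applied locally before the text reaches the speech model. The following is a minimal sketch, assuming plain-Python post-processing is acceptable; the function name prepare_for_speech and its parameters are hypothetical and do not come from the notebook.

```python
import re


def prepare_for_speech(caption, intro="Here is what I see in the image:", max_sentences=None):
    """Strip markdown-style symbols, optionally shorten, and prepend an intro phrase."""
    # Drop list bullets ("- ", "* ", "●", "•") at the start of each line.
    text = re.sub(r"(?m)^\s*[-*●•]\s+", "", caption)
    # Remove emphasis, heading, and code markers that a TTS model may stumble on.
    text = re.sub(r"[*#`]+", "", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    if max_sentences is not None:
        # Naive sentence split: ., ! or ? followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", text)
        text = " ".join(sentences[:max_sentences])
    return f"{intro} {text}" if intro else text
```

Passing max_sentences gives a rough "brief" mode; tone changes are harder to do with rules and are better left to a language model.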

Step 3: Text-to-Speech (TTS)

Objective: Convert the generated descriptive text into natural speech audio.
Recommended Models:
●	suno/bark-small
●	microsoft/speecht5_tts
●	facebook/fastspeech2-en-ljspeech
System Architecture Overview

Input Image
↓
Object Detection using Vision Model (Google GenAI)
↓
Generated Text
↓
Text-to-Speech Model
↓
Audio Output
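The flow above can be sketched as three pluggable stages. This is only an illustration of the architecture: describe_then_speak and the three stub functions are hypothetical names, and the stubs stand in for the Gemini call, the refinement call, and the text-to-speech pipeline.

```python
def describe_then_speak(image_bytes, describe, refine, synthesize):
    """Run the three stages in order and return (text, audio, sampling_rate)."""
    caption = describe(image_bytes)           # vision model: image -> text
    text = refine(caption)                    # optional clean-up for speech
    audio, sampling_rate = synthesize(text)   # TTS model: text -> waveform
    return text, audio, sampling_rate


# Stand-in stages so the sketch runs without any model or network access.
def fake_describe(image_bytes):
    return f"An image of {len(image_bytes)} bytes."


def fake_refine(caption):
    return "Here is what I see in the image: " + caption


def fake_synthesize(text):
    # One silent sample per word, at the 24 kHz rate reported later in the notebook.
    return [0.0] * len(text.split()), 24000


text, audio, rate = describe_then_speak(b"\x00\x01", fake_describe, fake_refine, fake_synthesize)
```

Keeping the stages as plain callables makes it easy to swap the speech model without touching the captioning step.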



In [54]:
! pip install google-genai


In [56]:
from google import genai

In [57]:
client = genai.Client(api_key="erase")
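The key is redacted in the cell above. One common way to avoid pasting a key into a notebook at all is to read it from an environment variable. The sketch below assumes a variable named GEMINI_API_KEY; both that name and the helper load_api_key are assumptions for illustration.

```python
import os


def load_api_key(env_var="GEMINI_API_KEY"):
    """Return the API key stored in env_var, or raise if it is missing or blank."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before creating the client.")
    return key


# Hypothetical usage with the client from the cell above:
# client = genai.Client(api_key=load_api_key())
```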

In [59]:
from google.genai import types

In [61]:
with open('images/traffic.jpeg', 'rb') as f:
  image_bytes = f.read()

context = [
    types.Part.from_bytes(
        data=image_bytes,
        mime_type='image/jpeg',
    ),
]
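The cell above hard-codes mime_type='image/jpeg'. If other image formats are used, the MIME type could be derived from the file name instead. This is a small sketch using the standard library; image_mime_type is a hypothetical helper name.

```python
import mimetypes


def image_mime_type(path):
    """Guess the MIME type from the file extension and require an image type."""
    mime_type, _ = mimetypes.guess_type(str(path))
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {path!r}")
    return mime_type


# Hypothetical usage with the cell above:
# types.Part.from_bytes(data=image_bytes, mime_type=image_mime_type('images/traffic.jpeg'))
```

The guess is based on the extension only, so a mislabeled file would still be sent with the wrong type.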

In [63]:
custom_config = types.GenerateContentConfig(
    system_instruction="You are a helpful AI Assistant. Given an image perform object detection and provide a text output which contains the information about the labels detected and their counts.",
    temperature=1,
    top_p=0.8,
)

In [64]:
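# `context` is rebuilt here; it is identical to the one created in In [61].
context = [
    types.Part.from_bytes(
        data=image_bytes,
        mime_type='image/jpeg',
    )
]

# Note: `custom_config` is not passed in this call, so the system
# instruction defined in In [63] is not applied to this request.
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents=context,
)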

print(response.text)

```json
[
  {"box_2d": [101, 169, 298, 352], "label": "an orange bus"},
  {"box_2d": [178, 263, 407, 442], "label": "a white truck"},
  {"box_2d": [76, 372, 426, 843], "label": "a white and blue bus"},
  {"box_2d": [435, 599, 785, 858], "label": "a maroon scooter"},
  {"box_2d": [252, 437, 853, 558], "label": "an elderly man"},
  {"box_2d": [331, 137, 705, 287], "label": "a black motorcycle"},
  {"box_2d": [48, 51, 291, 150], "label": "a push button station"}
]
```
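A reply in this form can be turned into the label counts that the system prompt asks for. The sketch below parses the fenced JSON and counts labels; the helper names are hypothetical. The to_pixels function additionally ASSUMES that box_2d is ordered [ymin, xmin, ymax, xmax] on a 0 to 1000 scale, which should be checked against the model documentation before relying on it.

```python
import json
import re
from collections import Counter

FENCE = "`" * 3  # three backticks, built this way so the example stays fence-safe


def parse_detections(response_text):
    """Return the list of detection dicts from a (possibly fenced) JSON reply."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, response_text, flags=re.DOTALL)
    payload = match.group(1) if match else response_text
    return json.loads(payload)


def count_labels(detections):
    """Count how many times each label occurs."""
    return Counter(item["label"] for item in detections)


def to_pixels(box_2d, width, height):
    """Convert a box to (left, top, right, bottom) in pixels.

    ASSUMPTION: box_2d is [ymin, xmin, ymax, xmax] normalized to 0-1000.
    """
    ymin, xmin, ymax, xmax = box_2d
    return (round(xmin * width / 1000), round(ymin * height / 1000),
            round(xmax * width / 1000), round(ymax * height / 1000))
```

Because every label in the reply above is a distinct phrase, counting by exact label gives 1 for each; grouping labels by category (for example, treating both buses as "bus") would need an extra normalization step.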


In [68]:

custom_config = types.GenerateContentConfig(
    # Adjacent string literals are joined as-is, so each fragment ends
    # with a space to keep the sentences from running together.
    system_instruction=(
        "You are a helpful AI Assistant. "
        "Given an image, perform object detection and provide a text output "
        "which contains the information about the labels detected and their counts. "
        "Understand the entire scene, "
        "identify relationships between objects, "
        "describe actions and context, "
        "and produce natural, human-like language."
    ),
)

In [69]:
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents=context,
    config=custom_config,  
)

print(response.text)

In this street scene, a busy road runs alongside a body of water under a pale sky. In the foreground, a man in a light-checkered shirt and tan pants walks across the asphalt, while a person on a maroon scooter rides past him on a zebra-striped crosswalk. To the left, a blue-and-white motorcycle is parked near a blue metal pole that features a white "PUSH THE BUTTON" traffic signal box.

In the middle ground, the traffic includes a small white cargo truck following a large blue and white bus labeled "EXPRESS." Further back, an orange bus is also visible in the lane. The road is bordered on the right by a black metal fence, beyond which lies a wide expanse of water reflecting the soft light of what appears to be late afternoon. Lush white-flowered bushes are visible in the lower foreground, framing the bottom of the scene.

The image contains the following objects and their counts:
- **buses**: 2
- **truck**: 1
- **motorcycles**: 2
- **people**: 3
- **traffic signal pole**: 1
- **crosswa

In [72]:
# Captioning prompt variant that asks for plain-text output.
text_groom_system_instruction = """
You are a helpful AI assistant.
Perform object detection on the given image.

Output Rules:
- Use plain text only.
- Do NOT include emojis, markdown, special symbols, or bullet characters.
- Do not use *, #, -, or decorative formatting.
- Keep the output clean and readable.
"""

In [73]:
mode = "brief"         
tone = "formal"        
intro_phrase = "Here is what I see in the image:"

In [74]:
length_instruction = {
    "brief": "Provide a concise summary.",
    "detailed": "Provide a detailed description including object counts and spatial details."
}

In [75]:
tone_instruction = {
    "formal": "Use professional and formal language.",
    "informal": "Use casual and conversational language."
}

In [76]:
system_instruction = f"""
You are a text refinement assistant.

Your task:
- Convert the provided text into a clean summary.

Formatting Rules:
- Plain text only.
- No emojis.
- No markdown.
- Remove unnecessary symbols or bullet characters.

{length_instruction[mode]}
{tone_instruction[tone]}

Begin the response with:
"{intro_phrase}"
"""

In [77]:
custom_config = types.GenerateContentConfig(
    system_instruction=system_instruction,
)
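The mode, tone, and intro settings can be wrapped in one function so that an unsupported value fails early instead of raising a KeyError inside the f-string. This is a minimal sketch; build_system_instruction is a hypothetical name, and the prompt wording is condensed from the cells above.

```python
LENGTH_INSTRUCTION = {
    "brief": "Provide a concise summary.",
    "detailed": "Provide a detailed description including object counts and spatial details.",
}
TONE_INSTRUCTION = {
    "formal": "Use professional and formal language.",
    "informal": "Use casual and conversational language.",
}


def build_system_instruction(mode="brief", tone="formal",
                             intro_phrase="Here is what I see in the image:"):
    """Assemble the refinement prompt, validating mode and tone first."""
    if mode not in LENGTH_INSTRUCTION:
        raise ValueError(f"Unknown mode {mode!r}; expected one of {sorted(LENGTH_INSTRUCTION)}")
    if tone not in TONE_INSTRUCTION:
        raise ValueError(f"Unknown tone {tone!r}; expected one of {sorted(TONE_INSTRUCTION)}")
    return "\n".join([
        "You are a text refinement assistant.",
        "Convert the provided text into a clean summary.",
        "Use plain text only: no emojis, no markdown, no bullet characters.",
        LENGTH_INSTRUCTION[mode],
        TONE_INSTRUCTION[tone],
        f'Begin the response with: "{intro_phrase}"',
    ])
```

The returned string could be passed as system_instruction to types.GenerateContentConfig in the same way as in the cell above.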

In [80]:
refresponse = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents=response.text,   
    config=custom_config,
)

In [81]:
print(refresponse.text)

Here is what I see in the image:

The scene depicts a busy roadway adjacent to a body of water during the late afternoon. In the foreground, a pedestrian walks across the asphalt while a person on a scooter traverses a zebra-striped crosswalk. A traffic signal pole stands near a parked motorcycle. The middle ground features several vehicles, including a white cargo truck and two buses, one of which is labeled Express. The road is separated from the water by a black metal fence, and flowering bushes frame the bottom of the view. The composition includes three individuals, two motorcycles, two buses, a truck, and a traffic signal.


In [82]:
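from transformers import pipeline

# Load the text-to-speech pipeline with the Bark small checkpoint.
tts = pipeline(
    task="text-to-speech",
    model="suno/bark-small",
)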


In [89]:
refSpeech = tts(refresponse.text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
Both `max_new_tokens` (=768) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Both `max_new_tokens` (=60) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Both `max_new_tokens` (=60) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Both

In [90]:

audio = refSpeech["audio"]
sampling_rate = refSpeech["sampling_rate"]

In [91]:
print(sampling_rate)

24000


In [92]:
from IPython.display import Audio as IPythonAudio

In [93]:
IPythonAudio(
    audio,
    rate=sampling_rate,  # taken from refSpeech["sampling_rate"] in In [90]
)


In [103]:
def split_text(text, max_words=25):
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    chunks = []

    for i in range(0, len(words), max_words):
        chunk = " ".join(words[i:i+max_words])
        chunks.append(chunk)

    return chunks

In [107]:
import numpy as np

def text_to_speech_long(text):
    """Synthesize text chunk by chunk and return (audio, sampling_rate)."""
    chunks = split_text(text, max_words=25)

    audio_segments = []
    sampling_rate = None

    for chunk in chunks:
        print(f"Generating audio for: {chunk}")

        # `tts` is the text-to-speech pipeline created earlier.
        speech = tts(chunk)

        audio_segments.append(speech["audio"])
        sampling_rate = speech["sampling_rate"]

    # Merge all audio arrays into one continuous waveform
    final_audio = np.concatenate(audio_segments)

    return final_audio, sampling_rate
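A fixed word window in split_text can cut a sentence in the middle; for example, a 25-word window over the refined caption ends its first chunk on the word "In". A possible alternative, sketched here under the assumption that sentences end with ., ! or ? followed by whitespace, packs whole sentences into each chunk and only falls back to word windows for a sentence that is longer than the limit. The name split_text_by_sentence is hypothetical.

```python
import re


def split_text_by_sentence(text, max_words=25):
    """Group whole sentences into chunks of at most max_words words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0

    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue
        # Close the current chunk if this sentence would not fit in it.
        if current and count + len(words) > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        # A single over-long sentence is cut into fixed word windows.
        while len(words) > max_words:
            chunks.append(" ".join(words[:max_words]))
            words = words[max_words:]
        current.extend(words)
        count += len(words)

    if current:
        chunks.append(" ".join(current))
    return chunks
```

Whether sentence-aligned chunks actually sound better with a given speech model is something to verify by listening; the sketch only changes where the text is cut.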

In [98]:
import numpy as np


In [99]:
from IPython.display import Audio

In [100]:
text_to_speech_long(refresponse.text)

Generating audio for: Here is what I see in the image: The scene depicts a busy roadway adjacent to a body of water during the late afternoon. In
Generating audio for: the foreground, a pedestrian walks across the asphalt while a person on a scooter traverses a zebra-striped crosswalk. A traffic signal pole stands near a
Generating audio for: parked motorcycle. The middle ground features several vehicles, including a white cargo truck and two buses, one of which is labeled Express. The road is
Generating audio for: separated from the water by a black metal fence, and flowering bushes frame the bottom of the view. The composition includes three individuals, two motorcycles,
Generating audio for: two buses, a truck, and a traffic signal.

(array([ 0.07862938,  0.06154559,  0.05787709, ..., -0.00439063,
        -0.00472519, -0.00481351], shape=(1182400,), dtype=float32),
 24000)
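The returned array holds 1,182,400 samples at 24,000 samples per second, which is about 49.3 seconds of audio. To keep the result outside the notebook, the samples could be written to a WAV file. The sketch below uses only the standard library and assumes mono float samples in the range -1 to 1; save_wav is a hypothetical helper, and the target may be a file name string or a binary file object.

```python
import io
import struct
import wave


def save_wav(target, samples, sampling_rate):
    """Write mono float samples as 16-bit PCM and return the duration in seconds."""
    frames = bytearray()
    count = 0
    for sample in samples:
        # Clip to [-1, 1] so the conversion to 16-bit integers cannot overflow.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(clipped * 32767))
        count += 1
    with wave.open(target, "wb") as wav_file:
        wav_file.setnchannels(1)        # mono
        wav_file.setsampwidth(2)        # 2 bytes = 16-bit samples
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))
    return count / sampling_rate


# Small in-memory demonstration: four samples at 4 Hz last one second.
buffer = io.BytesIO()
duration = save_wav(buffer, [0.0, 0.5, -0.5, 1.0], sampling_rate=4)
```

With the notebook's result, a call such as save_wav("description.wav", final_audio, 24000) would be the equivalent (the file name is illustrative). The per-sample loop is simple but slow for long clips; a vectorized NumPy conversion would be faster.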

In [108]:
from IPython.display import Audio

audio_data, sr = text_to_speech_long(refresponse.text)
Audio(audio_data, rate=sr)

Generating audio for: Here is what I see in the image: The scene depicts a busy roadway adjacent to a body of water during the late afternoon. In
Generating audio for: the foreground, a pedestrian walks across the asphalt while a person on a scooter traverses a zebra-striped crosswalk. A traffic signal pole stands near a
Generating audio for: parked motorcycle. The middle ground features several vehicles, including a white cargo truck and two buses, one of which is labeled Express. The road is
Generating audio for: separated from the water by a black metal fence, and flowering bushes frame the bottom of the view. The composition includes three individuals, two motorcycles,
Generating audio for: two buses, a truck, and a traffic signal.

