## EchoVision: An Intelligent Image Narration System for Accessibility Support

## Problem Statement

 
In real-world applications such as assistive technology for visually impaired individuals, smart surveillance systems, and automated content generation tools, systems must understand visual data, convert it into meaningful text, and generate natural speech output.
This project builds a multimodal AI system that integrates:

- Computer Vision (Object Detection)
- Natural Language Generation (Text Conversion)
- Speech Synthesis (Text-to-Speech)

In this task, you must **implement both approaches** described below:

- **<u>Approach 1 (Using Only HF):</u>** Using Hugging Face Transformers pipelines for object detection and speech synthesis.
- **<u>Approach 2 (Using Gemini and HF):</u>** Using the Google GenAI library for image captioning and a Hugging Face Transformers pipeline for speech synthesis.

**<u>Approach 1: Using Hugging Face Transformers pipelines</u>**

## System Workflow

 
Input Image → Object Detection → Label Extraction & Counting → Text Generation → Text-to-Speech → Audio Output

## Step-by-Step Implementation

### Step 1: Object Detection from Image
 
Objective: Detect the objects present in an image and extract their labels and confidence scores.

Recommended Models:

- facebook/detr-resnet-50 (accurate Transformer-based detector)
- hustvl/yolos-tiny (lightweight detector)
- google/owlvit-base-patch32 (open-vocabulary detection)
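A minimal sketch of this step, assuming the `transformers` package (plus a backend such as PyTorch) is installed. The function names `keep_confident` and `detect_objects` and the 0.9 score cutoff are illustrative choices, not part of the assignment. DETR checkpoints may also need the `timm` package.

```python
from typing import Any


def keep_confident(detections: list[dict[str, Any]], min_score: float = 0.9) -> list[dict[str, Any]]:
    """Keep only detections whose confidence score is at least min_score."""
    return [d for d in detections if d["score"] >= min_score]


def detect_objects(
    image_path: str,
    model_name: str = "facebook/detr-resnet-50",
    min_score: float = 0.9,  # assumed cutoff; tune for your images
) -> list[dict[str, Any]]:
    """Run a Hugging Face object-detection pipeline on one image."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    detector = pipeline("object-detection", model=model_name)
    # Each result is a dict with "label", "score", and "box" keys.
    return keep_confident(detector(image_path), min_score)
```

Calling `detect_objects("park.jpg")` (a hypothetical file name) would download the model on first use and return the filtered detections.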
### Step 2: Extract Labels and Convert to Text
 
Objective: Count the occurrences of each detected object and convert the counts into a meaningful natural-language sentence. You must write the Python logic for this step yourself.
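One possible way to approach the counting and sentence-building logic is sketched below. The helper names `pluralize` and `describe` are hypothetical, and the pluralization rule is deliberately naive (it does not handle irregular nouns such as "sheep" or "knife").

```python
from collections import Counter


def pluralize(label: str, count: int) -> str:
    """Naive English pluralization; irregular nouns are not handled."""
    if count == 1:
        return label
    if label.endswith(("s", "x", "ch", "sh")):
        return label + "es"
    return label + "s"


def describe(labels: list[str]) -> str:
    """Turn a list of detected labels into one sentence with per-label counts."""
    if not labels:
        return "No objects were detected in the image."
    counts = Counter(labels)  # keeps labels in first-seen order
    parts = [f"{n} {pluralize(label, n)}" for label, n in counts.items()]
    if len(parts) == 1:
        listing = parts[0]
    elif len(parts) == 2:
        listing = " and ".join(parts)
    else:
        listing = ", ".join(parts[:-1]) + ", and " + parts[-1]
    return f"The image contains {listing}."


labels = ["person", "person", "bicycle", "dog", "person", "dog"]
print(describe(labels))  # The image contains 3 persons, 1 bicycle, and 2 dogs.
```

In practice, `labels` would come from the `"label"` field of each detection returned in Step 1.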
### Step 3: Text-to-Speech (TTS)
 
Objective: Convert the generated descriptive text into natural speech audio.

Recommended Models:

- suno/bark-small
- microsoft/speecht5_tts
- facebook/fastspeech2-en-ljspeech
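A hedged sketch of this step using the `text-to-speech` pipeline with `suno/bark-small`, assuming `transformers` and a backend are installed. The helper names (`float_to_pcm16`, `save_wav`, `speak`) and the output file name are illustrative. The WAV file is written with the standard library so no extra audio package is needed. Note that `microsoft/speecht5_tts` additionally expects speaker embeddings, which this sketch does not cover.

```python
import struct
import wave


def float_to_pcm16(samples) -> bytes:
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    ints = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def save_wav(path: str, samples, sampling_rate: int) -> None:
    """Write mono 16-bit audio to a WAV file."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes = 16 bits per sample
        f.setframerate(sampling_rate)
        f.writeframes(float_to_pcm16(samples))


def speak(text: str, out_path: str = "narration.wav", model_name: str = "suno/bark-small") -> str:
    """Synthesize speech for text and save it as a WAV file."""
    from transformers import pipeline  # lazy import: heavy dependency

    tts = pipeline("text-to-speech", model=model_name)
    result = tts(text)  # dict with "audio" (NumPy array) and "sampling_rate"
    # The audio array may have shape (1, N); flatten it to a 1-D list of floats.
    samples = result["audio"].squeeze().tolist()
    save_wav(out_path, samples, result["sampling_rate"])
    return out_path
```

For example, `speak("The image contains 3 persons, 1 bicycle, and 2 dogs.")` would write `narration.wav` in the working directory.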

## System Architecture Overview

 
Input Image → Object Detection Model (DETR/YOLOS) → Label Extraction & Counting → Text Generation Module → Text-to-Speech Model → Audio Output
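The stages in this overview can be wired together as plain function calls. The sketch below is one possible arrangement rather than a required design: `narrate_image` and its three callable parameters are hypothetical names, and each stage is passed in so it can be swapped or stubbed during testing.

```python
from typing import Callable


def narrate_image(
    image_path: str,
    detect: Callable[[str], list[dict]],
    to_text: Callable[[list[str]], str],
    synthesize: Callable[[str], str],
) -> tuple[str, str]:
    """Run detection, label extraction, text generation, and TTS in order."""
    detections = detect(image_path)            # Object Detection Model
    labels = [d["label"] for d in detections]  # Label Extraction
    text = to_text(labels)                     # Counting + Text Generation
    audio_path = synthesize(text)              # Text-to-Speech Model
    return text, audio_path
```

Here `detect` is expected to return dicts with a `"label"` key, as the Hugging Face object-detection pipeline does, and `synthesize` is assumed to return the path of the audio file it wrote.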

**<u>Approach 2: Using the Google GenAI library and Hugging Face Transformers pipeline</u>**

## System Overview

 
Input Image → Image Captioning for object detection → Generated text → Text-to-Speech → Audio Output 
## Step-by-Step Process

### Step 1: Image Captioning using Google GenAI SDK

 
Objective: Generate a detailed, accessibility-focused caption describing the image.

Instead of detecting isolated objects, the model:

- Understands the entire scene
- Identifies relationships between objects
- Describes actions and context
- Produces natural, human-like language

**Recommended Models:** gemini-3-flash-preview

When using this Gemini model, provide the following system prompt:
"You are a helpful AI Assistant. Given an image perform object detection and provide a text output which contains the information about the labels detected and their counts."
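A minimal sketch of how the system prompt might be passed through the Google GenAI SDK, assuming the `google-genai` package is installed and an API key is available in the `GEMINI_API_KEY` environment variable. The helper names and the short user message "Describe this image." are assumptions added for illustration; only the model name and system prompt come from the text above.

```python
import mimetypes

MODEL_NAME = "gemini-3-flash-preview"
SYSTEM_PROMPT = (
    "You are a helpful AI Assistant. Given an image perform object detection "
    "and provide a text output which contains the information about the labels "
    "detected and their counts."
)


def guess_mime_type(image_path: str) -> str:
    """Guess the image MIME type from the file extension, defaulting to JPEG."""
    mime, _ = mimetypes.guess_type(image_path)
    return mime or "image/jpeg"


def caption_image(image_path: str) -> str:
    """Send one image to Gemini with the system prompt and return the text reply."""
    # Lazy imports: these require the google-genai package.
    from google import genai
    from google.genai import types

    client = genai.Client()  # assumption: API key is read from the environment
    with open(image_path, "rb") as f:
        image_part = types.Part.from_bytes(
            data=f.read(), mime_type=guess_mime_type(image_path)
        )
    response = client.models.generate_content(
        model=MODEL_NAME,
        contents=[image_part, "Describe this image."],  # assumed user message
        config=types.GenerateContentConfig(system_instruction=SYSTEM_PROMPT),
    )
    return response.text
```

The returned string is the generated text that the later steps clean up and convert to speech.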

### Step 2: Text Processing (Optional Enhancement)

Objective: Prepare the generated caption for speech synthesis.
 
Possible enhancements:

- Remove unnecessary symbols
- Control length (brief/detailed mode)
- Adjust tone (formal/informal)
- Add an introductory phrase (e.g., "Here is what I see in the image...")
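Some of these enhancements can be sketched with a few regular expressions. The function below is an illustrative example, not a prescribed solution: `prepare_for_speech`, the `brief` flag, and the two-sentence limit are assumptions, and tone adjustment is left out because it would need another model call.

```python
import re

INTRO = "Here is what I see in the image:"


def prepare_for_speech(caption: str, brief: bool = False, max_sentences: int = 2) -> str:
    """Clean a generated caption and add an introductory phrase."""
    # Remove Markdown marks and brackets that a TTS model might stumble over.
    text = re.sub(r"[*_#`>\[\]{}|~]", "", caption)
    # Collapse runs of whitespace, including newlines, into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    if brief:
        # Split after sentence-ending punctuation and keep the first few sentences.
        sentences = re.split(r"(?<=[.!?])\s+", text)
        text = " ".join(sentences[:max_sentences])
    return f"{INTRO} {text}"
```

Passing `brief=True` gives a shorter narration, while the default keeps the full caption.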
### Step 3: Text-to-Speech (TTS)
 
Objective: Convert the generated descriptive text into natural speech audio.

Recommended Models:

- suno/bark-small
- microsoft/speecht5_tts
- facebook/fastspeech2-en-ljspeech
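Captions from a generative model can run to several sentences, and a TTS model may handle shorter inputs more reliably. Under that assumption, one option is to split the text into sentence-based chunks and synthesize each chunk separately. The helper name `split_into_chunks` and the 200-character limit are illustrative, not requirements.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A single sentence longer than `max_chars` is kept whole rather than cut mid-sentence. Each chunk can then be passed to the TTS pipeline and the resulting audio segments joined in order.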

## System Architecture Overview

 
Input Image → Object Detection using Vision Model (Google GenAI) → Generated Text → Text-to-Speech Model → Audio Output

## Example Use Case Scenario

 
- Scenario: Assistive AI tool for visually impaired users.
- Sample Image: A park scene containing 3 persons, 1 bicycle, and 2 dogs.
- Generated Text: The image contains 3 persons, 1 bicycle, and 2 dogs.
- Output: Audio narration of the generated sentence.

## Learning Outcomes

 
- Understanding transformer-based object detection
- Working with Hugging Face pipelines
- Multimodal AI system integration
- Natural language generation techniques
- Text-to-speech synthesis
- End-to-end AI system development

## Conclusion

 
This project demonstrates the integration of Vision, Language, and Speech using Transformer-based pipelines. It provides hands-on experience in designing and implementing real-world multimodal AI systems.

## Submission Link

 
[**Click here**](https://forms.gle/NxQUoZS5FeS34kRm7) to submit your work.