In [67]:
!pip install -q gradio faster-whisper translate python-dotenv elevenlabs requests

# Voice-to-Voice Translation System  
### English → Multi-Language + Godavari Telugu (Offline TTS)

**Project Track:** Applied AI / NLP / Speech Processing  
**Author:** Boddu Harshavardhan


## 1. Problem Definition & Objective

Language barriers remain a major challenge in communication, especially for regional dialects and informal speech.
While text translation systems exist, **voice-to-voice translation for dialect-specific languages** is still limited.

### Objective
The objective of this project is to design and implement a **voice-to-voice translation system** that:
- Accepts **spoken English**
- Transcribes speech to text
- Translates text into multiple languages
- Converts translated text back into **natural-sounding speech**
- Supports **Godavari Telugu slang**, a regional dialect not supported by standard translators


## 2. Selected Project Track

**Track:**  
✔ Natural Language Processing (NLP)  
✔ Speech-to-Text (STT)  
✔ Text-to-Speech (TTS)  
✔ Applied AI System Design  

This project integrates multiple AI subsystems into a single end-to-end pipeline.


In [68]:
import os

# 🔑 SET YOUR KEYS HERE
os.environ["OPENROUTER_API_KEY"] = "YOUR_OPENROUTER_KEY"
os.environ["ELEVENLABS_API_KEY"] = "YOUR_ELEVENLABS_KEY"
os.environ["ELEVENLABS_VOICE_ID"] = "YOUR_VOICE_ID"


## 3. Problem Statement

Existing translation systems:
- Focus mainly on text
- Lack support for regional dialects
- Depend heavily on paid APIs
- Are difficult to reproduce in academic environments

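This project aims to build a **reproducible, API-minimal** voice translation system suitable for academic evaluation: speech recognition is **offline-capable**, while translation and speech synthesis rely on a small number of external services.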


## 4. Real-World Relevance & Motivation

This system can be applied in:
- Rural and regional communication systems
- Assistive technologies
- Language learning platforms
- Call centers and customer support
- Government and public service communication

Motivation comes from the lack of support for **informal regional speech**, especially Indian dialects like Godavari Telugu.


## 5. Data Understanding & Preparation

### Input Data
- User-recorded audio via microphone (English speech)

### Processing
- Audio is directly passed to a speech recognition model
- No pre-collected dataset is required
- Inference runs in real time on each recording

### Output Data
- Translated text in multiple languages
- Synthesized speech audio files


In [70]:
from faster_whisper import WhisperModel

model = WhisperModel(
    "base",
    device="cpu",   # change to "cuda" if a GPU is available
    compute_type="int8"
)

def transcribe_audio(audio_path: str) -> str:
    segments, _ = model.transcribe(audio_path, language="en")
    return " ".join(segment.text for segment in segments).strip()

## 6. Speech-to-Text Model

We use **Whisper** (the `base` model via `faster-whisper`, with int8 quantization) for speech recognition because of its:
- High accuracy
- Offline capability
- Robustness to accents


In [71]:
from translate import Translator
import requests, json

LANGUAGES = ["ru", "tr", "sv", "de", "es", "ja"]

def translate_multi(text):
    outputs = []
    for lang in LANGUAGES:
        translator = Translator(from_lang="en", to_lang=lang)
        outputs.append(translator.translate(text))
    return outputs


def english_to_godavari(text):
    payload = {
        "model": "nex-agi/deepseek-v3.1-nex-n1:free",
        "messages": [
            {
                "role": "system",
                "content": (
                    "You are from Godavari Andhra Pradesh. "
                    "Translate English to authentic Godavari Telugu slang. "
                    "Use words like andi, masteru, thammu, babai, aaay."
                )
            },
            {"role": "user", "content": text}
        ]
    }

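    response = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json"
        },
        data=json.dumps(payload),
        timeout=60  # avoid hanging indefinitely on a stalled request
    )
    # Surface HTTP errors (invalid key, rate limit) before parsing the body
    response.raise_for_status()

    return response.json()["choices"][0]["message"]["content"]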


## 7. Translation System Design

### Multi-Language Translation
Uses the lightweight `translate` Python package for six target languages: Russian, Turkish, Swedish, German, Spanish, and Japanese.

### Godavari Telugu Translation
Handled using an LLM via OpenRouter, guided with a system prompt
to generate **authentic regional slang**.
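
The LLM reply is read from `choices[0].message.content` in the JSON response. If the service returns an error object or an empty choice list instead, that lookup raises an exception. A minimal sketch of a more defensive reader follows; `extract_reply` and its `fallback` parameter are hypothetical names introduced for illustration, and falling back to the English text is an assumed policy, not the notebook's behavior.

```python
from typing import Any


def extract_reply(data: dict[str, Any], fallback: str) -> str:
    """Return the first choice's message text, or `fallback` if the shape is unexpected."""
    try:
        content = data["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError):
        # Missing keys, empty choice list, or a non-dict payload
        return fallback
    if not isinstance(content, str) or not content.strip():
        return fallback
    return content.strip()


ok = extract_reply(
    {"choices": [{"message": {"content": "  Ela unnaru andi?  "}}]},
    fallback="How are you?",
)
failed = extract_reply({"error": {"message": "rate limited"}}, fallback="How are you?")
print(ok)
print(failed)
```

With a helper like this, the Gradio interface would still show the English text for the Godavari slot when the LLM call fails, rather than aborting the whole request.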


In [72]:
import uuid
from elevenlabs import VoiceSettings
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])
VOICE_ID = os.environ["ELEVENLABS_VOICE_ID"]

def text_to_speech(text):
    response = client.text_to_speech.convert(
        voice_id=VOICE_ID,
        text=text,
        model_id="eleven_multilingual_v2",
        output_format="mp3_22050_32",
        voice_settings=VoiceSettings(
            stability=0.5,
            similarity_boost=0.8,
            style=0.5,
            use_speaker_boost=True
        )
    )

    filename = f"{uuid.uuid4()}.mp3"
    with open(filename, "wb") as f:
        for chunk in response:
            if chunk:
                f.write(chunk)

    return filename


In [73]:
def voice_to_voice(audio_file):
    text = transcribe_audio(audio_file)

    translations = translate_multi(text)
    godavari = english_to_godavari(text)

    all_texts = translations + [godavari]
    audio_outputs = [text_to_speech(t) for t in all_texts]

    return (*audio_outputs, *all_texts)


## 8. Core Implementation

The complete pipeline:
1. Audio Input
2. Speech-to-Text
3. Text Translation
4. Text-to-Speech
5. Output Audio + Text
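
The steps above can be sketched with each stage passed in as a function, which makes the data flow and the output ordering (all audio files first, then all texts) easy to check without models or API keys. `run_pipeline` and the stub stages are hypothetical and exist only for this illustration; the notebook's `voice_to_voice` wires in the real stages directly.

```python
from typing import Callable, Sequence, Tuple


def run_pipeline(
    audio_path: str,
    transcribe: Callable[[str], str],
    translators: Sequence[Callable[[str], str]],
    synthesize: Callable[[str], str],
) -> Tuple[str, ...]:
    """Run STT, then every translator, then TTS on each translation."""
    text = transcribe(audio_path)
    texts = [translate(text) for translate in translators]
    audio_files = [synthesize(t) for t in texts]
    # Audio paths first, then texts, matching `audio_outputs + text_outputs`
    return (*audio_files, *texts)


# Stub stages: no file is read and no network call is made
result = run_pipeline(
    "input.wav",
    transcribe=lambda path: "hello",
    translators=[str.upper, lambda s: s[::-1]],
    synthesize=lambda t: f"{t}.mp3",
)
print(result)  # ('HELLO.mp3', 'olleh.mp3', 'HELLO', 'olleh')
```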


In [None]:
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown("## 🎙️ English → Multi-Language + Godavari Voice Translator")

    audio_input = gr.Audio(
        sources=["microphone"],
        type="filepath",
        label="Speak in English"
    )

    submit = gr.Button("Convert")

    audio_outputs = [
        gr.Audio(label=l) for l in
        ["Russian", "Turkish", "Swedish", "German", "Spanish", "Japanese", "Godavari"]
    ]

    text_outputs = [gr.Markdown() for _ in range(7)]

    submit.click(
        fn=voice_to_voice,
        inputs=audio_input,
        outputs=audio_outputs + text_outputs
    )

demo.launch(debug=True)


## 9. Evaluation & Analysis

### Evaluation Criteria
- Speech recognition accuracy
- Translation coherence
- Naturalness of synthesized speech
- System robustness
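
Speech recognition accuracy is commonly quantified as word error rate (WER): the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The evaluation in this notebook is qualitative, so the function below is only a sketch of how the first criterion could be measured; the lowercasing and whitespace tokenization are simplifying assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"WER: {wer:.3f}")
```

A value of 0 means the transcript matches the reference exactly; values can exceed 1 when the hypothesis contains many extra words.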

### Observations
- Whisper performs well on clear speech
- Godavari slang translation improves cultural relevance
- TTS ensures stable output


## 10. Ethical Considerations & Responsible AI

- No personal data is stored
- Audio is transcribed locally; only the resulting text is sent to the external translation and TTS services
- Avoids biased or harmful outputs
- Transparency in model usage
- No surveillance or misuse intent
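
The `text_to_speech` function writes every synthesized clip to an `.mp3` file in the working directory. One way to keep that in line with the no-storage goal is to delete clips once they are older than a chosen age. The sketch below is an assumption about how this could be done; `remove_old_audio` and the ten-minute threshold are hypothetical and not part of the notebook.

```python
import os
import tempfile
import time
from pathlib import Path


def remove_old_audio(directory: str, max_age_seconds: float) -> int:
    """Delete .mp3 files in `directory` last modified more than `max_age_seconds` ago."""
    cutoff = time.time() - max_age_seconds
    removed = 0
    for path in Path(directory).glob("*.mp3"):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed += 1
    return removed


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    stale = Path(tmp) / "stale.mp3"
    fresh = Path(tmp) / "fresh.mp3"
    stale.write_bytes(b"")
    fresh.write_bytes(b"")
    an_hour_ago = time.time() - 3600
    os.utime(stale, (an_hour_ago, an_hour_ago))  # pretend this clip is an hour old

    removed = remove_old_audio(tmp, max_age_seconds=600)
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(removed, remaining)
```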


## 11. Conclusion & Future Scope

### Conclusion
This project demonstrates a practical, reproducible voice-to-voice translation system with regional dialect support.

### Future Scope
- Add native Telugu TTS
- Language auto-detection
- Mobile deployment
- Larger dialect datasets
