# **📖 Speech-to-Text Translation API Using Saaras Model**  

## **🔗 Overview**  

This notebook is a step-by-step guide to using the **STT-Translate API** with the **Saaras** model to turn audio files into English text. The API automatically detects the input language, transcribes the speech, and translates the text into English.

It includes instructions for installation, setting up the API key, uploading audio files, and translating audio using the API.


## **1. Installation**
Before you begin, make sure the required Python library is installed. Run the cells below to install the `sarvamai` package and import the client:


In [None]:
!pip install sarvamai

In [None]:
from sarvamai import SarvamAI

## **2. Authentication**

To use the API, you need an API subscription key. Follow these steps to set up your API key:

1. **Obtain your API key**: If you don’t have an API key, sign up on the [Sarvam AI Dashboard](https://dashboard.sarvam.ai/) to get one.
2. **Replace the placeholder key**: In the code below, replace "YOUR_SARVAM_AI_API_KEY" with your actual API key.

In [None]:
SARVAM_API_KEY = "YOUR_SARVAM_AI_API_KEY"
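
Hardcoding the key works for a quick test, but a common alternative is to read it from an environment variable so it never ends up in a shared notebook. The sketch below assumes a variable named `SARVAM_API_KEY`; that name and the `load_api_key` helper are chosen here for illustration and are not known to be required by the SDK. It falls back to the placeholder when the variable is unset.

```python
import os

PLACEHOLDER_KEY = "YOUR_SARVAM_AI_API_KEY"

def load_api_key(env_var="SARVAM_API_KEY"):
    """Return the key from the environment, or the placeholder if unset."""
    # env_var is an assumed name; use whatever your environment defines.
    return os.environ.get(env_var, PLACEHOLDER_KEY)

SARVAM_API_KEY = load_api_key()
if SARVAM_API_KEY == PLACEHOLDER_KEY:
    print("No API key found in the environment; replace the placeholder before calling the API.")
```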

### **2.1 Initialize the Client**

Create a Sarvam client instance using your API key. This client will be used to interact with the Saaras API.

In [None]:
client = SarvamAI(api_subscription_key=SARVAM_API_KEY)

## **3. Uploading Audio Files**

To translate audio, you need to provide a `.wav` or `.mp3` file.

#### ✅ Supported Environments:
- Google Colab
- Jupyter Notebook (VS Code, JupyterLab, etc.)

#### 📝 Instructions:
- Ensure your audio file is in `.wav` **or** `.mp3` format.
- Run the cell below. The uploader will automatically adjust based on your environment:
  - **In Google Colab**: You'll be prompted to upload a `.wav` or `.mp3` file via a file picker.
  - **In Jupyter Notebook**: You'll be prompted to enter the full file path of the `.wav` or `.mp3` file stored locally on your machine.
- Once provided, the file will be available for use in the next step.


In [None]:
import sys
import os

def get_audio_file():
    supported_formats = ['.wav', '.mp3']

    if 'google.colab' in sys.modules:
        # Running in Google Colab: use upload widget
        from google.colab import files
        uploaded = files.upload()
        audio_file_path = list(uploaded.keys())[0]
        ext = os.path.splitext(audio_file_path)[1].lower()
        if ext not in supported_formats:
            print(f"Unsupported file format '{ext}'. Please upload a WAV or MP3 file.")
            return None
        print(f"File '{audio_file_path}' uploaded successfully in Colab!")
        return audio_file_path
    else:
        # Running in Jupyter Notebook: input file path
        audio_file_path = input("Enter the path to your MP3 or WAV file: ").strip()
        ext = os.path.splitext(audio_file_path)[1].lower()
        if not os.path.exists(audio_file_path):
            print(f"File not found at: {audio_file_path}")
            return None
        if ext not in supported_formats:
            print(f"Unsupported file format '{ext}'. Please provide a WAV or MP3 file.")
            return None
        print(f"File '{audio_file_path}' found successfully in Jupyter!")
        return audio_file_path


In [None]:
# Enter the file path when prompted, then press Enter/Return.
audio_file_path = get_audio_file()

## **4. Saaras-v2.5 Usage for STT Translate**

The Saaras-v2.5 model can be used for converting speech to text across diverse, production-grade scenarios.
It supports basic transcription, code-mixed Indian speech, automatic language detection, and domain-specific prompting, all optimized for real-world applications such as telephony and multi-speaker audio.

### **4.1 Basic Usage**

Basic translation with default settings. No language code is passed in this call; the API detects the spoken language and returns English text.  
Works best for single-language content with clear audio quality.

In [None]:
if audio_file_path:
    with open(audio_file_path, "rb") as audio_file:
        response = client.speech_to_text.translate(
            file=audio_file,
            model="saaras:v2.5"
        )
    print("✅ Transcription Response:")
    print(response)
else:
    print("🚫 No audio file found. Transcription aborted.")
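
Printing the response shows the whole object. To work with the translated text alone, you can read individual fields. The sketch below assumes the response exposes `transcript` and `language_code` attributes; check these names against the SDK documentation before relying on them. `summarize_response` is a hypothetical helper, and a stand-in object is used so the cell runs without an API call.

```python
from types import SimpleNamespace

def summarize_response(response):
    """Pull out the translated text and detected language, if present."""
    # Attribute names are assumptions; getattr avoids a crash if they differ.
    return {
        "transcript": getattr(response, "transcript", None),
        "language_code": getattr(response, "language_code", None),
    }

# Stand-in for a real API response, for illustration only.
fake_response = SimpleNamespace(transcript="Hello, how are you?", language_code="hi-IN")
print(summarize_response(fake_response))
```

With a real call, pass the `response` returned by `client.speech_to_text.translate(...)` in place of `fake_response`.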


### **4.2 Code-Mixed Speech**

Handles mixed-language content with automatic detection of language switches within sentences.  
Ideal for natural Indian conversations that mix multiple languages.

In [None]:
if audio_file_path:
    with open(audio_file_path, "rb") as audio_file:
        response = client.speech_to_text.translate(
            file=audio_file,
            model="saaras:v2.5"
        )
    print(response)
else:
    print("No valid audio file found.")

### **4.3 Automatic Language Detection**

Let Saaras automatically detect the language being spoken.  
Useful when the input language is unknown or for handling multi-language content.

In [None]:
if audio_file_path:
    with open(audio_file_path, "rb") as audio_file:
        response = client.speech_to_text.translate(
            file=audio_file,
            model="saaras:v2.5",
        )
    print(response)
else:
    print("No valid audio file found.")

### **4.4 Domain Prompting**

Enhance transcription accuracy with domain-specific prompts and preserve important terms.  
Perfect for specialized contexts like medical, legal, or technical content.

In [None]:
if audio_file_path:
    with open(audio_file_path, "rb") as audio_file:
        response = client.speech_to_text.translate(
            file=audio_file,
            model="saaras:v2.5",
            prompt="Medical consultation"
        )
    print(response)
else:
    print("No valid audio file found.")

## **5. Handling Long Audio Files**

If your audio file exceeds the 30-second limit supported by the **real-time transcription API**, you must split it into smaller chunks for accurate and successful transcription. 
These smaller segments are then transcribed individually using the **real-time API**, and the results are stitched back together to form the final transcript.

👉 For large audio files, switch to the **Batch API** designed for longer durations.  
[🔗 Try the Batch API here](https://github.com/sarvamai/sarvam-ai-cookbook/tree/main/notebooks/stt-translate/stt-translate-batch-api) 

---

### 📝 When to Use
- Audio length > 30 seconds
- **Real-time API** returns timeout or error due to size
- You want to **batch process** long audio files for better accuracy and reliability
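
To decide whether a file crosses the 30-second threshold, you can measure its duration first. The sketch below uses Python's standard `wave` module, so it applies only to uncompressed PCM `.wav` files; `.mp3` files would need a different tool. The helper names `wav_duration_seconds` and `needs_splitting` are illustrative.

```python
import wave

def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        # Duration = number of frames / frames per second.
        return wav_file.getnframes() / wav_file.getframerate()

def needs_splitting(path, limit_seconds=30):
    """Return True if the file is longer than the given limit."""
    return wav_duration_seconds(path) > limit_seconds
```

For example, `needs_splitting(audio_file_path)` can gate the chunking step for `.wav` input.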


### ⚙️ How It Works
1. The full `.mp3` or `.wav` file is first **split into smaller chunks** (e.g., 29 seconds each)
2. Each chunk is then transcribed **individually** using the **real-time API**
3. The individual results are finally **combined** to form one seamless transcript
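
The arithmetic behind the splitting step can be sketched as follows. `chunk_bounds` is a hypothetical helper that computes the nominal start and end time of each chunk; the cut points FFmpeg actually produces may differ slightly from these values.

```python
import math

def chunk_bounds(total_seconds, chunk_seconds=29):
    """Return (start, end) times in seconds for each chunk."""
    count = math.ceil(total_seconds / chunk_seconds)
    return [
        (i * chunk_seconds, min((i + 1) * chunk_seconds, total_seconds))
        for i in range(count)
    ]

print(chunk_bounds(75))  # [(0, 29), (29, 58), (58, 75)]
```

A 75-second file therefore yields three chunks, each within the 30-second limit.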

> ⚠️ For short audio files (<30 seconds), you can skip this step and directly proceed with transcription using the real-time API.

The functions below help with:
- Preventing real-time API timeouts
- Splitting large `.wav` or `.mp3` files into smaller chunks
- Translating each chunk using Saaras:v2.5
- Collating results into a single transcript


### **5.1 Define the `split_audio_ffmpeg` Function**

This function splits a long `.mp3` or `.wav` audio file into smaller chunks (default: 29 seconds) using **FFmpeg**. 
It ensures each segment remains within the real-time API's 30-second limit and stores them in the specified output directory.
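
Because the function shells out to FFmpeg, it helps to confirm the executable is installed before splitting. This is a minimal sketch using the standard library's `shutil.which`; the helper name `ffmpeg_available` is illustrative.

```python
import shutil

def ffmpeg_available(executable="ffmpeg"):
    """Return True if the executable can be found on the system PATH."""
    return shutil.which(executable) is not None

if not ffmpeg_available():
    print("FFmpeg was not found on PATH; install it before splitting audio.")
```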

In [None]:
import os
import subprocess

def split_audio_ffmpeg(audio_path, chunk_duration=29, output_dir="chunks"):
    os.makedirs(output_dir, exist_ok=True)
    ext = os.path.splitext(audio_path)[1].lower()
    base_name = os.path.splitext(os.path.basename(audio_path))[0]
    output_pattern = os.path.join(output_dir, f"{base_name}_%03d{ext}")

    codec = "pcm_s16le" if ext == ".wav" else "libmp3lame"

    command = [
        "ffmpeg",
        "-y",  # overwrite chunks left over from a previous run instead of prompting
        "-i", audio_path,
        "-f", "segment",
        "-segment_time", str(chunk_duration),
        "-c:a", codec,
        output_pattern
    ]

    print("Running command:", " ".join(command))

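    result = subprocess.run(command, capture_output=True, text=True)
    print("Return code:", result.returncode)
    print("STDOUT:\n", result.stdout)
    print("STDERR:\n", result.stderr)

    if result.returncode != 0:
        # FFmpeg failed; do not return stale files from an earlier run.
        return []

    # Collect only the chunks that belong to this input file.
    output_files = sorted([
        os.path.join(output_dir, f) for f in os.listdir(output_dir)
        if f.startswith(f"{base_name}_") and f.endswith(ext)
    ])

    print("Chunks generated:", output_files)
    return output_files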


### **5.2 Define the `translate_audio_chunks` Function**

This function takes the list of chunked audio file paths and uses the **Saaras real-time API** to translate each one individually.
It collects all partial transcriptions and combines them into a single, complete transcript.


In [None]:
def translate_audio_chunks(chunk_paths, client, model="saaras:v2.5"):
    """
    Translates each audio chunk to English text using the Sarvam client.

    Args:
        chunk_paths (list): List of file paths to audio chunks.
        client: Authenticated Sarvam client.
        model (str): Model version.
        
    Returns:
        str: Full combined transcription.
    """
    full_transcript = []

    for idx, chunk_path in enumerate(chunk_paths):
        print(f"\n🔄 Translating chunk {idx + 1}/{len(chunk_paths)} → {chunk_path}")
        with open(chunk_path, "rb") as audio_file:
            try:
                response = client.speech_to_text.translate(
                    file=audio_file,
                    model=model
                )
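                print("✅ Chunk Response:", response)
                # Keep only the translated text. The `transcript` attribute is
                # assumed; fall back to the full response if it is missing.
                text = getattr(response, "transcript", None)
                full_transcript.append(str(response) if text is None else text)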
            except Exception as e:
                print(f"❌ Error with chunk {chunk_path}: {e}")

    return " ".join(full_transcript).strip()


### **5.3 Putting It All Together**

Call the `split_audio_ffmpeg()` function first to break the audio into chunks, and then pass those chunks to `translate_audio_chunks()` for transcription. 
This two-step process ensures large audio files are handled smoothly using the real-time API.


In [None]:
# 1. Split the audio (skipped if no file was provided earlier)
chunks = split_audio_ffmpeg(audio_file_path) if audio_file_path else []

# 2. Translate each chunk and collate
if chunks:
    final_transcript = translate_audio_chunks(chunks, client)
    print("\n📝 Final Combined Transcript:\n")
    print(final_transcript)
else:
    print("🚫 No audio chunks generated. Transcription aborted.")
 

## **6. Error Handling**  

You may encounter these errors while using the API:  

- **403 Forbidden** (`invalid_api_key_error`)  
  - Cause: Invalid API key.  
  - Solution: Use a valid API key from the [Sarvam AI Dashboard](https://dashboard.sarvam.ai/).  

- **429 Too Many Requests** (`insufficient_quota_error`)  
  - Cause: Exceeded API quota.  
  - Solution: Check your usage, upgrade if needed, or implement exponential backoff when retrying.  

- **500 Internal Server Error** (`internal_server_error`)  
  - Cause: Issue on our servers.  
  - Solution: Try again later. If persistent, contact support.  

- **400 Bad Request** (`invalid_request_error`)  
  - Cause: Incorrect request formatting.  
  - Solution: Verify your request structure and parameters.  

- **422 Unprocessable Entity** (`unprocessable_entity_error`)  
  - Cause: Unable to detect the language of the input text.  
  - Solution: Explicitly pass the `source_language_code` parameter with a supported language.
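
The exponential backoff suggested for 429 responses can be sketched as a small retry wrapper. `call_with_backoff` is a hypothetical helper, not part of the SDK. It catches every exception because the SDK's specific rate-limit exception class is not shown in this notebook; in practice you would narrow the `except` clause to that class.

```python
import time

def call_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying with exponentially growing delays on failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as error:  # Assumption: narrow this to the rate-limit error.
            if attempt == max_attempts - 1:
                raise  # Out of attempts; surface the last error.
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            print(f"Attempt {attempt + 1} failed ({error}); retrying in {delay:.0f}s")
            sleep(delay)
```

To use it with the API, wrap the request in a function that reopens the audio file on each attempt and pass that function as `func`, so every retry reads the file from the start.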


## **7. Additional Resources**

For more details, refer to our official documentation. You can also reach us for help and support on our Discord server:

- **Documentation**: [docs.sarvam.ai](https://docs.sarvam.ai)  
- **Community**: [Join the Discord Community](https://discord.gg/hTuVuPNF)


## **8. Final Notes**

- Keep your API key secure.
- Use clear audio for best results.
- Explore advanced features like diarization and translation.

**Keep Building!** 🚀