# Audio Processing and Transcription using Sarvam AI API and VAD

This notebook demonstrates how to process an audio file, detect speech segments using Voice Activity Detection (VAD), and transcribe those segments using the Sarvam AI Speech-to-Text API. The results are then saved as an SRT subtitle file.

## Prerequisites

Before running this notebook, ensure you have the following installed:

- Python 3.7 or higher
- Required Python packages: `torch`, `numpy`, `librosa`, `soundfile`, `tqdm`, `pydub`, `requests`



In [8]:
!pip install torch numpy librosa soundfile tqdm pydub requests




### **Set Up the API Key**

To use the Saaras API, you need an API subscription key. Follow these steps to set up your API key:

1. **Obtain your API key**: If you don’t have an API key, sign up on the [Sarvam AI Dashboard](https://dashboard.sarvam.ai/) to get one.
2. **Replace the placeholder key**: In the code below, replace `"YOUR_SARVAM_AI_API_KEY"` with your actual API key.

In [3]:
SARVAM_AI_API = "YOUR_SARVAM_AI_API_KEY"
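If you would rather not paste the key into the notebook, one option is to read it from an environment variable. The sketch below is only an illustration: the variable name `SARVAM_AI_API_KEY` and the helper `load_api_key` are assumptions made for this example, not something the Sarvam AI tooling defines.

```python
import os


def load_api_key(env_var="SARVAM_AI_API_KEY", placeholder="YOUR_SARVAM_AI_API_KEY"):
    """Return the API key from the environment (hypothetical helper).

    `env_var` is an assumed variable name chosen for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key or key == placeholder:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key


# Usage (uncomment once the environment variable is set):
# SARVAM_AI_API = load_api_key()
```

Failing early with a clear message is usually easier to debug than an authentication error returned by the API several cells later.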

## **Configuration**

Set up the configuration parameters for the audio processing and VAD.

In [12]:
sample_rate = 16000  # Set the sample rate for loading audio
vad_threshold = 0.5  # Threshold for VAD
combine_duration = 8  # Maximum duration for combined segments
combine_gap = 1  # Maximum gap between segments to combine

### Load VAD Model

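We use the Silero VAD model, loaded through `torch.hub`, to detect speech segments in the audio. The function below runs the model over the audio in fixed-size windows and returns one speech probability per window.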


In [13]:
import torch


@torch.no_grad()
def get_vad_probs(model, audio, sample_rate=16000):
    audio = torch.as_tensor(audio, dtype=torch.float32)
    window_size_samples = 512 if sample_rate == 16000 else 256

    model.reset_states()
    audio_length_samples = len(audio)

    speech_probs = []
    for current_start_sample in range(0, audio_length_samples, window_size_samples):
        chunk = audio[current_start_sample: current_start_sample + window_size_samples]
        if len(chunk) < window_size_samples:
            chunk = torch.nn.functional.pad(chunk, (0, int(window_size_samples - len(chunk))))
        speech_prob = model(chunk, sample_rate).item()
        speech_probs.append(speech_prob)

    return speech_probs
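To make the windowing step concrete, here is a minimal pure-Python sketch of how the audio is cut into 512-sample windows, with the last window zero-padded. It mirrors the loop in `get_vad_probs` but uses plain lists instead of tensors; `split_into_windows` and `frame_count` are illustrative names, not part of the notebook.

```python
import math


def frame_count(num_samples, window_size_samples=512):
    """Number of windows needed to cover `num_samples` samples."""
    return math.ceil(num_samples / window_size_samples)


def split_into_windows(samples, window_size_samples=512):
    """Split samples into fixed-size windows, zero-padding the final one."""
    windows = []
    for start in range(0, len(samples), window_size_samples):
        chunk = list(samples[start:start + window_size_samples])
        chunk += [0.0] * (window_size_samples - len(chunk))  # pad the tail
        windows.append(chunk)
    return windows


# 1200 samples need three windows; the last holds 176 real samples and 336 zeros.
windows = split_into_windows([1.0] * 1200)
print(len(windows), [len(w) for w in windows])
```

Each window produces exactly one probability, so the length of `speech_probs` equals the number of windows.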


### Extract Utterances

This function extracts the start and end times of speech segments based on the VAD probabilities.


In [14]:
def get_utterances(vad_probs, threshold=0.5, frame_duration=0.032):
    """Extracts utterances (start and end times) based on VAD probabilities."""
    utterances = []
    in_utterance = False
    utterance_start = 0

    for i, prob in enumerate(vad_probs):
        if prob > threshold and not in_utterance:
            in_utterance = True
            utterance_start = i * frame_duration
        elif prob <= threshold and in_utterance:
            in_utterance = False
            utterance_end = i * frame_duration
            if utterance_end - utterance_start > 0:
                utterances.append((utterance_start, utterance_end))

    if in_utterance:
        utterances.append((utterance_start, len(vad_probs) * frame_duration))

    return utterances
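The default `frame_duration=0.032` follows from the VAD window: 512 samples at 16,000 samples per second is 0.032 seconds per frame. The sketch below shows the same thresholding idea on a toy probability list, written with `itertools.groupby` as an alternative formulation; `speech_runs` is an illustrative helper, not the function used in this notebook.

```python
from itertools import groupby

WINDOW_SIZE_SAMPLES = 512  # window size used at 16 kHz
SAMPLE_RATE = 16000

frame_duration = WINDOW_SIZE_SAMPLES / SAMPLE_RATE  # seconds per VAD frame


def speech_runs(vad_probs, threshold=0.5):
    """Return (first_frame, end_frame_exclusive) for each run above threshold."""
    runs = []
    index = 0
    for is_speech, group in groupby(vad_probs, key=lambda p: p > threshold):
        length = len(list(group))
        if is_speech:
            runs.append((index, index + length))
        index += length
    return runs


toy_probs = [0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.95]
runs = speech_runs(toy_probs)
# Convert frame indices to seconds, as get_utterances does.
segments = [(start * frame_duration, end * frame_duration) for start, end in runs]
print(runs)
print(segments)
```

Here frames 1 to 2 and frames 5 to 6 are speech, so two utterances are produced, and the second one runs to the end of the input.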


### Merge Segments

This function merges segments that are close to each other and within the specified duration limit.


In [15]:
def merge_segments(segments, max_duration=8, max_gap=1):
    """Combines segments with pauses shorter than `max_gap` seconds, with total duration limit."""
    merged_segments = []
    if not segments:
        return merged_segments  # Return empty if no segments are found

    current_start, current_end = segments[0]

    for start, end in segments[1:]:
        combined_duration = (end - current_start)

        if (start - current_end <= max_gap) and (combined_duration <= max_duration):
            current_end = end
        else:
            merged_segments.append((current_start, current_end))
            current_start, current_end = start, end

    merged_segments.append((current_start, current_end))
    return merged_segments
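The merge decision comes down to two checks on each candidate segment: the pause before it must be at most `max_gap`, and the merged span must be at most `max_duration`. The helper below isolates that condition so it can be tried on a few values; `can_merge` is an illustrative name introduced here.

```python
def can_merge(current_start, current_end, start, end, max_duration=8, max_gap=1):
    """True if (start, end) may be appended to the segment being built."""
    gap = start - current_end        # silence between the two segments
    combined = end - current_start   # length of the merged segment
    return gap <= max_gap and combined <= max_duration


print(can_merge(0.0, 3.0, 3.5, 6.0))  # short pause, 6 s combined
print(can_merge(0.0, 3.0, 4.5, 6.0))  # 1.5 s pause exceeds max_gap
print(can_merge(0.0, 5.0, 5.2, 9.0))  # 9 s combined exceeds max_duration
```

Note that this condition only decides whether to extend a segment. As written, `merge_segments` does not split a single utterance that is already longer than `max_duration`.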


### Process Audio

This function processes the audio file to detect speech segments using the VAD model.


In [16]:
import librosa


def process_audio(audio_file):
    """Runs Silero VAD on the file and returns merged (start, end) segments in seconds."""
    vad_model, _ = torch.hub.load(
        repo_or_dir='snakers4/silero-vad',
        model='silero_vad',
        force_reload=False,
        onnx=False
    )
    vad_model.eval()

    # librosa resamples the audio to `sample_rate` while loading
    audio, _ = librosa.load(audio_file, sr=sample_rate)
    speech_probs = get_vad_probs(vad_model, audio, sample_rate)
    utterances = get_utterances(speech_probs, threshold=vad_threshold)

    if not utterances:
        print(f"No VAD regions detected for {audio_file}.")
        return []  # Empty list so callers can iterate safely

    return merge_segments(utterances, max_duration=combine_duration, max_gap=combine_gap)


## Transcription using Sarvam AI API

### Transcribe Audio Segment

This function sends an audio segment to the Sarvam AI API for transcription.


In [17]:
import io

import requests


def transcribe_audio_segment(start_time_sec, end_time_sec):
    # Relies on the globals `audio`, `api_url`, `headers`, and `data` defined in the cells below
    # Convert seconds to whole milliseconds for pydub slicing
    start_time_ms = int(round(start_time_sec * 1000))
    end_time_ms = int(round(end_time_sec * 1000))

    # Extract the audio segment
    segment = audio[start_time_ms:end_time_ms]

    # Export segment to an in-memory BytesIO object
    audio_buffer = io.BytesIO()
    segment.export(audio_buffer, format="wav")
    audio_buffer.seek(0)  # Reset buffer position to the beginning

    files = {
        'file': ('audiofile.wav', audio_buffer, 'audio/wav')
    }

    response = requests.post(api_url, headers=headers, files=files, data=data)

    if response.status_code in (200, 201):
        return response.json()
    else:
        print(f"Error for segment {start_time_sec}-{end_time_sec}: {response.status_code} - {response.text}")
        return None
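The in-memory export step is worth a closer look: the segment is written to a `BytesIO` buffer rather than to disk, and the buffer is rewound before upload so the request reads from the first byte. The sketch below shows the same pattern using the standard-library `wave` module as a stand-in for pydub's `export`; `silent_wav_buffer` is a hypothetical helper that writes silence only.

```python
import io
import wave


def silent_wav_buffer(duration_sec=0.1, sample_rate=16000):
    """Write a mono 16-bit WAV of silence into memory and rewind the buffer."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 16-bit samples
        wav_file.setframerate(sample_rate)
        num_frames = int(round(duration_sec * sample_rate))
        wav_file.writeframes(b"\x00\x00" * num_frames)
    buffer.seek(0)  # without this, a reader would start at the end and see no data
    return buffer


buf = silent_wav_buffer()
print(buf.read(4))  # WAV files start with the RIFF marker
```

Skipping the `seek(0)` call is a common cause of empty uploads, because the write leaves the position at the end of the buffer.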


### Write SRT File

This function writes the transcription results into an SRT file.


In [18]:
def write_srt_file(results, output_file_path):
    """
    Writes the transcription results into an SRT file.

    Args:
        results (list): List of dictionaries containing 'start_time', 'end_time', and 'transcript'.
        output_file_path (str): Path to save the SRT file.
    """
    def format_timestamp(seconds):
        """Converts seconds to SRT timestamp format: hh:mm:ss,ms"""
        milliseconds = int((seconds % 1) * 1000)
        seconds = int(seconds)
        minutes = seconds // 60
        hours = minutes // 60
        seconds = seconds % 60
        minutes = minutes % 60
        return f"{hours:02}:{minutes:02}:{seconds:02},{milliseconds:03}"

    with open(output_file_path, "w", encoding="utf-8") as srt_file:
        for i, result in enumerate(results, start=1):
            start_timestamp = format_timestamp(result["start_time"])
            end_timestamp = format_timestamp(result["end_time"])
            transcript = result["transcript"]

            # Write the SRT entry
            srt_file.write(f"{i}\n")
            srt_file.write(f"{start_timestamp} --> {end_timestamp}\n")
            srt_file.write(f"{transcript}\n\n")
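For reference, each SRT entry consists of a running index, a `start --> end` timestamp line, the text, and a blank line. The sketch below renders entries to a string so the layout can be inspected without writing a file. Its timestamp helper rounds to whole milliseconds before splitting into fields, an alternative that avoids off-by-one-millisecond truncation with floating-point inputs; `render_srt` is an illustrative name, not part of this notebook.

```python
def format_timestamp(seconds):
    """Convert seconds to an SRT timestamp (hh:mm:ss,mmm) via integer milliseconds."""
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{millis:03}"


def render_srt(results):
    """Build the SRT text for a list of result dictionaries."""
    entries = []
    for i, result in enumerate(results, start=1):
        start = format_timestamp(result["start_time"])
        end = format_timestamp(result["end_time"])
        entries.append(f"{i}\n{start} --> {end}\n{result['transcript']}\n\n")
    return "".join(entries)


sample = [{"start_time": 0.0, "end_time": 1.5, "transcript": "Hello"}]
print(render_srt(sample))
```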


## Main Execution

### Set Up API and Audio File

Set the Sarvam AI endpoint URL, the request headers (which carry the API key), and the form data (which selects the model). Also specify the path to the audio file.


In [4]:
api_url = "https://api.sarvam.ai/speech-to-text-translate"
headers = {
    "api-subscription-key": SARVAM_AI_API
}
data = {
    "model": "saaras:v2",
}
audio_file_path = "stevve.wav"


### Process Audio and Transcribe

Process the audio file to detect speech segments and transcribe each segment using the Sarvam AI API.


In [23]:
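from pydub import AudioSegment

# process_audio may return None when no speech is detected; fall back to an empty list
timestamps = process_audio(audio_file_path) or []
audio = AudioSegment.from_file(audio_file_path)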

results = []

for start, end in timestamps:
    transcription = transcribe_audio_segment(start, end)
    if transcription is not None:
        results.append({
            "start_time": start,
            "end_time": end,
            "transcript": transcription["transcript"]
        })


Using cache found in /root/.cache/torch/hub/snakers4_silero-vad_master


### Save Results to SRT File

Finally, save the transcription results to an SRT file.


In [24]:
output_srt_path = "subtitles.srt"
write_srt_file(results, output_srt_path)

print(f"SRT file has been saved to {output_srt_path}")


SRT file has been saved to subtitles.srt


## Conclusion

This notebook demonstrates how to process an audio file, detect speech segments using VAD, transcribe those segments using the Sarvam AI API, and save the results in an SRT file format. You can modify the configuration parameters and API settings to suit your specific needs.



### **Additional Resources**

For more details, refer to the official **Saaras API documentation** and join the community for support:

- **Documentation**: [docs.sarvam.ai](https://docs.sarvam.ai/)
- **Community**: [Join the Discord Community](https://discord.gg/hTuVuPNF)

### **Notes:**

**File Format:** Ensure the file is in `.wav` format and has a sample rate of 16 kHz.

**API Key:** Double-check that the `SARVAM_AI_API` variable is set to your actual API key.

**Error Handling:** If transcription of a segment fails, the HTTP status code and response text are printed for debugging, and that segment is left out of the results.

**Keep Building!** 🚀

