<a href="https://colab.research.google.com/github/Carnage203/Video-Subtitle-Generator-Agent/blob/main/agentic_pipeline_Soham_Mandal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Technical Documentation: An Agentic Pipeline for High-Accuracy Subtitle Generation

# Objective
This document outlines an agentic pipeline developed to generate accurate, speaker-separated subtitles for video clips. The pipeline demonstrates a multi-agent workflow where specialized AI models collaborate to perform complex media processing tasks, culminating in a high-quality, verified output.


## Pipeline Stages:

The pipeline is structured into the following stages:

1.  **Initialize Libraries and Download Dependencies**: This initial stage ensures that all necessary Python libraries (`yt-dlp`, `ffmpeg-python`, `openai-whisper`, `pyannote.audio`, `google-generativeai`, `sarvamai`, `pydub`) are installed in the environment. This is a foundational step to ensure the subsequent stages can execute without dependency errors.

2.  **Download Video and Extract Audio**: This stage handles the input video. It can accept a YouTube URL or a local video file. Using `yt-dlp`, the video is downloaded, and then `pydub` is used to extract the audio stream and convert it into a standardized WAV format (`assignment_audio.wav`). This ensures a consistent audio input for the downstream tasks.

3.  **Transcribe Audio with Whisper**: The extracted audio is processed by the Whisper model (`openai-whisper`) to generate a complete transcription. This stage provides the textual content of the spoken dialogue along with word-level timestamps, which are crucial for aligning with speaker information later.

4.  **Perform Speaker Diarization with Pyannote**: Independently of transcription, the `pyannote.audio` library performs speaker diarization on the same audio file. This process identifies *who* is speaking and *when*. It segments the audio into speaker turns, labeling each segment with a speaker identifier (e.g., SPEAKER_00, SPEAKER_01). This output is essential for separating dialogue by speaker in the subtitles.

5.  **Generate SRT from Transcription and Diarization**: This is a critical integration stage. A custom function (`generate_srt_from_results`) takes the word-level timestamps and text from the Whisper transcription and the speaker turns from the Pyannote diarization. It aligns these two pieces of information to create a preliminary SRT string (`srt_output`) where each subtitle entry includes the text spoken, the corresponding timestamps, and the identified speaker.

6.  **Correct SRT with Gemini**: To improve accuracy beyond the rule-based alignment, the draft SRT, the initial diarization timeline, and the original audio are passed to the Refinement Agent, powered by a large multimodal model (Gemini). This agent reviews all three together, correcting residual errors in timing, speaker assignment, or transcription, and generates the final SRT file (`output.srt`).

# Quality Check

7.  **Burn Subtitles into Video**: Using FFmpeg, the corrected SRT file is embedded directly into the original video file. This creates a new video file (`final_video_with_subtitles.mp4`) with the subtitles permanently displayed, providing a visual output for human-in-the-loop review.

8.  **Quality Check the Subtitles with Gemini**: As a final quality assurance step, Gemini is used again to provide a confidence score and qualitative feedback on the final corrected SRT by comparing it against the original audio. This generates a report (`quality_report.csv`) detailing the accuracy of the transcription, speaker labels, and timestamps for each subtitle chunk.

9.  **Extract Initial Audio for Saarika (Optional)**: This stage demonstrates the potential to integrate other ASR services. A short segment of the audio is extracted to be sent to the Saarika model.

10. **Transcribe with Saarika (Optional)**: Using the SarvamAI client and the Saarika model, the extracted audio segment is transcribed. This showcases how different transcription services can be incorporated into the pipeline.


## Justification for Not Using a Specific Agentic Framework:

While several excellent agentic frameworks exist (e.g., LangChain, LlamaIndex), this notebook intentionally utilizes simple Python functions and sequential execution within a Colab environment. This deliberate choice was made based on the following considerations, particularly relevant for demonstrating core engineering and problem-solving skills in an internship context:

*   **Demonstrated Understanding of Fundamentals:** Building the pipeline from individual components showcases a deeper understanding of the underlying technologies (ASR, diarization, LLM integration) and how they interact. It highlights the ability to integrate disparate tools to solve a complex problem, rather than relying on a framework's pre-built abstractions.
*   **Flexibility and Customization:** The modular nature of this approach allows for fine-grained control over each step. This is crucial for iterating on the pipeline, experimenting with different models or parameters, and addressing specific edge cases that might be difficult to handle within a more opinionated framework. For an internship, this demonstrates adaptability and a problem-solving mindset.
*   **Efficiency and Reduced Overhead:** For a task of this scope and within the Colab environment, introducing a full-fledged agentic framework could add unnecessary complexity and overhead in terms of setup, learning curve, and potential debugging within the framework itself. A simpler, direct implementation is more efficient for demonstrating the core functionality.
*   **Clear Data Flow and Debugging:** The sequential execution and explicit function calls make the data flow between stages transparent. This simplifies debugging and understanding where issues might arise in the pipeline, a valuable skill in any development role.
*   **Focus on Core Logic:** By avoiding framework-specific boilerplate, the code directly focuses on the core logic of integrating the AI models and processing the data. This highlights the problem-solving approach and the implementation details of the subtitling pipeline itself.
*   **Adaptability to Different Environments:** While demonstrated in Colab, the core logic and the use of standard libraries make this pipeline relatively portable and adaptable to different deployment environments, showcasing an understanding of practical software development considerations beyond a specific framework.

## Conclusion and Future Work
The implemented pipeline successfully meets the assignment's objectives, demonstrating an effective multi-agent workflow that produces high-quality, diarized subtitles.

Potential improvements and future work include:

*   **Alternative Architectures:** Implementing a "diarize-then-transcribe" pipeline (where audio is first split into speaker chunks and then transcribed) as an alternative for comparison.
*   **Automated Error Flagging:** Enhancing the QA Agent to automatically flag subtitles with low confidence scores for targeted human review.
*   **Model Benchmarking:** Integrating other models (like the optional Saarika stage) more formally to benchmark their performance on key metrics like Word Error Rate (WER) against the baseline.
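
As an illustration of the benchmarking idea, Word Error Rate can be computed as the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The following is a minimal sketch, not part of the notebook; the function name and the lowercase/whitespace normalization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("it's a secure training centre",
                      "it's a secure training center"))
```

A real benchmark would also need a trusted reference transcript and a shared text normalization (punctuation, spelling variants) applied to every model's output before scoring.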

# Assigned Task:


Downloading dependencies

In [None]:
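!pip install yt-dlp
!pip install ffmpeg-python
!pip install openai-whisper
!pip install pyannote.audio
!pip install google-generativeai
!pip install google-genai  # SDK imported as `from google import genai` in the QC cell
!pip install pydub         # used for audio extraction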



In [None]:
from google.colab import userdata
google_api_key = userdata.get('GOOGLE_API_KEY')
hf_token = userdata.get('HF_TOKEN')
sarvamai_key = userdata.get('SARVAMAI')
print("API keys loaded successfully.")

API keys loaded successfully.


The overall approach is to create a flexible and robust data preparation pipeline. The user provides a video, either from an online source such as YouTube or as a local file, and the pipeline converts it into a standardized WAV audio file. This audio file (`assignment_audio.wav`) is the input for the subsequent stages of the project, such as speaker diarization and transcription.

The cell below handles a YouTube link, a local file path, or a file upload.



In [None]:
import os
import yt_dlp
from pydub import AudioSegment
from google.colab import files

def download_video_and_extract_audio(video_url, output_video_path, output_audio_path):
    ydl_opts = {
        'format': 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best',
        'outtmpl': output_video_path,
        'quiet': False,
    }
    try:
        print(f"📥 Starting video download for: {video_url}")
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            ydl.download([video_url])
        if not os.path.exists(output_video_path):
             raise FileNotFoundError(f"Video download failed, file not found at: {output_video_path}")
        print(f"✅ Video successfully downloaded to: {output_video_path}")
        print(f"\n🎧 Extracting audio from '{output_video_path}'...")
        video_segment = AudioSegment.from_file(output_video_path, format="mp4")
        video_segment.export(output_audio_path, format="wav")
        print(f"✅ Audio successfully extracted to: {output_audio_path}")
        return output_video_path, output_audio_path
    except Exception as e:
        print(f"❌ An error occurred: {e}")
        return None

def process_local_video(video_path, output_audio_path="extracted_audio.wav"):
    try:
        if not os.path.exists(video_path):
            raise FileNotFoundError(f"File not found at: {video_path}")
        print(f"\n🎧 Extracting audio from '{video_path}'...")
        video_segment = AudioSegment.from_file(video_path)
        video_segment.export(output_audio_path, format="wav")
        print(f"✅ Audio successfully extracted to: {output_audio_path}")
        return video_path, output_audio_path
    except Exception as e:
        print(f"❌ An error occurred: {e}")
        return None

if __name__ == '__main__':
    user_input = input("Enter a YouTube URL, a local file path, or leave blank to upload: ")
    VIDEO_OUTPUT_PATH = "assignment_video.mp4"
    AUDIO_OUTPUT_PATH = "assignment_audio.wav"
    result = None

    user_input = user_input.strip()  # normalize once so the checks and the calls use the same value
    if user_input.startswith('http'):
        result = download_video_and_extract_audio(user_input, VIDEO_OUTPUT_PATH, AUDIO_OUTPUT_PATH)
    elif user_input and os.path.exists(user_input):
        result = process_local_video(user_input, AUDIO_OUTPUT_PATH)
    elif not user_input.strip():
        print("Please select a video file to upload.")
        uploaded = files.upload()
        if uploaded:
            uploaded_filename = next(iter(uploaded))
            print(f"✅ File '{uploaded_filename}' uploaded successfully.")
            result = process_local_video(uploaded_filename, AUDIO_OUTPUT_PATH)
        else:
            print("❌ No file was uploaded.")
    else:
        print(f"❌ Input '{user_input}' is not a valid URL or an existing file path.")

    if result:
        video_file, audio_file = result
        video_exists = os.path.exists(video_file)
        audio_exists = os.path.exists(audio_file)
        print("\n--- Verification ---")
        print(f"Video file '{video_file}' exists: {video_exists}")
        print(f"Audio file '{audio_file}' exists: {audio_exists}")
        if video_exists and audio_exists:
            print("\n✅ Both files are ready for the next step.")
        else:
            print("\n❌ Verification failed. One or both files were not created.")
    else:
        print("\n❌ Process failed.")

#https://www.youtube.com/watch?v=zYJKq17GpEc -> Reference video given

Enter a YouTube URL, a local file path, or leave blank to upload: https://www.youtube.com/watch?v=zYJKq17GpEc
📥 Starting video download for: https://www.youtube.com/watch?v=zYJKq17GpEc
[youtube] Extracting URL: https://www.youtube.com/watch?v=zYJKq17GpEc
[youtube] zYJKq17GpEc: Downloading webpage
[youtube] zYJKq17GpEc: Downloading tv client config
[youtube] zYJKq17GpEc: Downloading tv player API JSON
[youtube] zYJKq17GpEc: Downloading ios player API JSON
[youtube] zYJKq17GpEc: Downloading m3u8 information
[info] Testing format 614
[info] zYJKq17GpEc: Downloading 1 format(s): 614+328
[download] assignment_video.mp4 has already been downloaded
✅ Video successfully downloaded to: assignment_video.mp4

🎧 Extracting audio from 'assignment_video.mp4'...
✅ Audio successfully extracted to: assignment_audio.wav

--- Verification ---
Video file 'assignment_video.mp4' exists: True
Audio file 'assignment_audio.wav' exists: True

✅ Both files are ready for the next step.
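
Beyond checking that the file exists, the WAV header can be inspected to confirm what the later stages will receive. The sketch below is an optional check that is not part of the notebook; `describe_wav` is a hypothetical helper, and it assumes the exported file is uncompressed PCM, which is what Python's standard `wave` module can read.

```python
import os
import wave


def describe_wav(path: str) -> dict:
    """Read basic properties from a PCM WAV header using only the standard library."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": frames / rate if rate else 0.0,
        }


# "assignment_audio.wav" is the file name used in this notebook.
if os.path.exists("assignment_audio.wav"):
    print(describe_wav("assignment_audio.wav"))
```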


Whisper transcribes the entire extracted audio, with word-level timestamps enabled.

In [None]:
import whisper
import pprint
model = whisper.load_model("large")
whisper_result = model.transcribe(
    "/content/assignment_audio.wav",
    word_timestamps=True
)

pprint.pprint(whisper_result)
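
The alignment stage later reads the result as a dictionary with a `segments` list, where each segment carries a `words` list of `word`, `start`, and `end` entries. The sketch below shows one way to flatten that structure for inspection. `flatten_words` and the sample values are hypothetical and only mimic the shape of the real result.

```python
def flatten_words(result: dict) -> list[dict]:
    """Collect the word entries from every segment of a Whisper-style result dict."""
    words = []
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            words.append({
                "text": word["word"].strip(),
                "start": float(word["start"]),
                "end": float(word["end"]),
            })
    return words


# Hypothetical sample with made-up timings, shaped like the real result.
sample = {"segments": [{"words": [
    {"word": " Can", "start": 0.80, "end": 1.10},
    {"word": " I", "start": 1.10, "end": 1.20},
]}]}

for w in flatten_words(sample):
    print(f"{w['start']:.2f}-{w['end']:.2f} {w['text']}")
```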



Pyannote performs speaker diarization on the same extracted audio.

In [None]:
import torch
from pyannote.audio import Pipeline


# hf_token was loaded from the Colab secrets in an earlier cell.

print("Loading diarization pipeline...")
pipeline = Pipeline.from_pretrained(
  "pyannote/speaker-diarization-3.1",
  use_auth_token=hf_token)
print("✅ Pipeline loaded.")

if torch.cuda.is_available():
    print("Moving pipeline to GPU...")
    pipeline.to(torch.device("cuda"))
    print("✅ Pipeline on GPU.")

# ---


audio_path = "/content/assignment_audio.wav"
print(f"Running diarization on {audio_path}...")
diarization_result = pipeline(audio_path)
print("✅ Diarization complete.")


print("\n--- Diarization Result ---")
print(diarization_result)


print("\n--- Speaker Turns ---")
for turn, _, speaker in diarization_result.itertracks(yield_label=True):
    print(f"Start: {turn.start:.2f}s | End: {turn.end:.2f}s | Speaker: {speaker}")

Loading diarization pipeline...


DEBUG:speechbrain.utils.checkpoints:Registered checkpoint save hook for _speechbrain_save
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint load hook for _speechbrain_load
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint save hook for save
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint load hook for load
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint save hook for _save
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint load hook for _recover


✅ Pipeline loaded.
Running diarization on /content/assignment_audio.wav...


  std = sequences.std(dim=-1, correction=1)


✅ Diarization complete.

--- Diarization Result ---
[ 00:00:01.026 -->  00:00:01.954] A SPEAKER_00
[ 00:00:02.933 -->  00:00:03.507] B SPEAKER_00
[ 00:00:05.228 -->  00:00:06.780] C SPEAKER_00
[ 00:00:07.523 -->  00:00:16.045] D SPEAKER_00
[ 00:00:08.654 -->  00:00:09.210] E SPEAKER_01
[ 00:00:17.108 -->  00:00:18.779] F SPEAKER_00
[ 00:00:19.099 -->  00:00:35.097] G SPEAKER_00
[ 00:00:20.264 -->  00:00:21.057] H SPEAKER_01
[ 00:00:21.107 -->  00:00:22.255] I SPEAKER_01
[ 00:00:23.132 -->  00:00:25.360] J SPEAKER_01
[ 00:00:26.862 -->  00:00:29.562] K SPEAKER_01
[ 00:00:30.659 -->  00:00:31.148] L SPEAKER_01
[ 00:00:35.873 -->  00:00:36.953] M SPEAKER_00
[ 00:00:37.392 -->  00:01:00.240] N SPEAKER_00
[ 00:00:42.134 -->  00:00:43.872] O SPEAKER_01
[ 00:00:45.458 -->  00:00:47.888] P SPEAKER_01
[ 00:00:55.769 -->  00:00:57.017] Q SPEAKER_01
[ 00:01:00.797 -->  00:01:02.130] R SPEAKER_00
[ 00:01:02.367 -->  00:01:07.564] S SPEAKER_00
[ 00:01:08.627 -->  00:01:10.534] T SPEAKER_00
[ 00:01:
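
The printed turns show that SPEAKER_01 often speaks inside a longer SPEAKER_00 turn. The sketch below is not part of the notebook. It summarizes such a timeline once the turns are copied into plain `(start, end, speaker)` tuples, for example with `[(t.start, t.end, s) for t, _, s in diarization_result.itertracks(yield_label=True)]`. The tuple layout and both helper names are assumptions; the sample values are the first five turns printed above.

```python
from collections import defaultdict

Turn = tuple[float, float, str]  # (start_s, end_s, speaker_label)


def speaking_time(turns: list[Turn]) -> dict[str, float]:
    """Total seconds attributed to each speaker label."""
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return dict(totals)


def overlapping_pairs(turns: list[Turn]) -> list[tuple[Turn, Turn]]:
    """Pairs of turns from different speakers that overlap in time."""
    ordered = sorted(turns)  # sorted by start time
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b[0] >= a[1]:
                break  # every later turn starts after `a` has ended
            if a[2] != b[2]:
                pairs.append((a, b))
    return pairs


turns = [
    (1.026, 1.954, "SPEAKER_00"),
    (2.933, 3.507, "SPEAKER_00"),
    (5.228, 6.780, "SPEAKER_00"),
    (7.523, 16.045, "SPEAKER_00"),
    (8.654, 9.210, "SPEAKER_01"),
]
print(speaking_time(turns))
print(overlapping_pairs(turns))
```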

Code Explanation:
This stage merges the outputs of two separate agents: a transcription model (Whisper) and a speaker diarization model (`pyannote.audio`). Its goal is to produce a single speaker-separated subtitle file in the standard SRT format.

The process is broken down into two functions:

`generate_srt_from_results` (the core logic):

*   **Speaker Assignment:** The function iterates through every word returned by the transcription model. For each word, it takes the word's midpoint and looks for diarization turns that contain it. If several turns overlap at that time, the shortest turn wins. If no turn contains the midpoint, the word inherits the previous word's speaker when the gap between the two words is under 0.2 s, and is otherwise marked "Unknown Speaker".
*   **Grouping into Dialogue:** Consecutive words assigned to the same speaker are grouped into a single dialogue segment. A new segment starts whenever the speaker changes.
*   **Formatting:** The dialogue segments are formatted into the final SRT string, using the helper function to create the timestamps. Segments labelled "Unknown Speaker" are skipped.

`format_timestamp_srt` (a utility):

This helper takes a time in seconds (e.g., 35.482) and converts it into the `HH:MM:SS,mmm` format (e.g., 00:00:35,482) that SRT requires.

In [None]:
import datetime

def format_timestamp_srt(seconds):
    """Converts seconds into the SRT timestamp format (HH:MM:SS,ms)."""
    td = datetime.timedelta(seconds=seconds)
    minutes, seconds = divmod(td.seconds, 60)
    hours, minutes = divmod(minutes, 60)
    milliseconds = td.microseconds // 1000
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{milliseconds:03d}"


def generate_srt_from_results(whisper_result, diarization_result):
    """
    Aligns whisper transcription with pyannote diarization, handling both
    overlapping segments and filling VAD gaps for complete sentences.
    """
    all_words = []
    for segment in whisper_result.get('segments', []):
        all_words.extend(segment.get('words', []))

    if not all_words:
        return ""

    speaker_mapping = {label: f"Speaker {i+1}" for i, label in enumerate(diarization_result.labels())}

    word_speakers = []
    for word in all_words:
        word_center_time = word['start'] + (word['end'] - word['start']) / 2

        overlapping_speakers = []
        for turn, _, speaker_label in diarization_result.itertracks(yield_label=True):
            if turn.start <= word_center_time < turn.end:
                duration = turn.end - turn.start
                overlapping_speakers.append({
                    'speaker': speaker_mapping.get(speaker_label),
                    'duration': duration
                })

        found_speaker = None
        if overlapping_speakers:

            shortest_turn = min(overlapping_speakers, key=lambda x: x['duration'])
            found_speaker = shortest_turn['speaker']
        elif word_speakers and word['start'] - word_speakers[-1]['end'] < 0.2:
            found_speaker = word_speakers[-1]['speaker']

        word_speakers.append({
            'word': word['word'],
            'start': word['start'],
            'end': word['end'],
            'speaker': found_speaker if found_speaker else "Unknown Speaker"
        })

    dialogue_segments = []
    if not word_speakers:
        return ""

    current_segment = {
        'speaker': word_speakers[0]['speaker'],
        'text': word_speakers[0]['word'],
        'start': word_speakers[0]['start'],
        'end': word_speakers[0]['end']
    }

    for i in range(1, len(word_speakers)):
        word_info = word_speakers[i]
        if word_info['speaker'] == current_segment['speaker']:
            current_segment['text'] += word_info['word']
            current_segment['end'] = word_info['end']
        else:
            dialogue_segments.append(current_segment)
            current_segment = {
                'speaker': word_info['speaker'],
                'text': word_info['word'],
                'start': word_info['start'],
                'end': word_info['end']
            }
    dialogue_segments.append(current_segment)

    srt_content = ""
    idx = 1
    for segment in dialogue_segments:
        if segment['speaker'] == "Unknown Speaker":

            continue

        start_time_str = format_timestamp_srt(segment['start'])
        end_time_str = format_timestamp_srt(segment['end'])
        speaker_label = segment['speaker']
        text = segment['text'].strip()

        srt_content += f"{idx}\n"
        srt_content += f"{start_time_str} --> {end_time_str}\n"
        srt_content += f"[{speaker_label}] {text}\n\n"
        idx += 1

    return srt_content


srt_output = generate_srt_from_results(whisper_result, diarization_result)

print("--- Definitive SRT Content ---")
print(srt_output)

--- Definitive SRT Content ---
1
00:00:01,100 --> 00:00:01,740
[Speaker 1] I ask you a question?

2
00:00:02,860 --> 00:00:03,320
[Speaker 1] course you can.

3
00:00:05,300 --> 00:00:06,460
[Speaker 1] you get me out of this place?

4
00:00:07,520 --> 00:00:08,700
[Speaker 1] afraid I'm of no use to you there.

5
00:00:09,020 --> 00:00:09,380
[Speaker 2] Why

6
00:00:09,380 --> 00:00:15,860
[Speaker 1] couldn't they put me in a proper prison? A better facility wasn't available. Or put me in a madhouse. It's a secure training centre. Training?

7
00:00:17,220 --> 00:00:18,360
[Speaker 1] are they training us for?

8
00:00:19,220 --> 00:00:20,280
[Speaker 1] other boys just scream all

9
00:00:20,280 --> 00:00:22,340
[Speaker 2] the time. I thought this had been explained to you, Jamie. And they rock backwards and forwards

10
00:00:22,340 --> 00:00:23,060
[Speaker 1] and they shower at current

11
00:00:23,060 --> 00:00:25,320
[Speaker 2] agencies. Young offenders institutions have got
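
Several draft cues appear to have lost their first word (cue 1 reads "I ask you a question?"). A plausible cause is that a word whose midpoint falls just outside every diarization turn is labelled "Unknown Speaker" and then skipped. One possible mitigation is to fall back to the nearest turn within a tolerance. This is a sketch only and not what the notebook does: `nearest_speaker` and `max_gap` are hypothetical names, and the word timings in the example are made up. It covers only the fallback case and does not reproduce the notebook's shortest-turn rule for overlapping turns.

```python
def nearest_speaker(word_start, word_end, turns, max_gap=0.5):
    """Pick the speaker of the turn closest in time to a word's midpoint.

    `turns` is a list of (start_s, end_s, speaker) tuples. Returns None when
    the closest turn is more than `max_gap` seconds away.
    """
    center = (word_start + word_end) / 2
    best_speaker, best_gap = None, None
    for start, end, speaker in turns:
        if start <= center < end:
            gap = 0.0  # midpoint lies inside this turn
        else:
            gap = min(abs(center - start), abs(center - end))
        if best_gap is None or gap < best_gap:
            best_speaker, best_gap = speaker, gap
    if best_gap is not None and best_gap <= max_gap:
        return best_speaker
    return None


example_turns = [(1.026, 1.954, "Speaker 1")]
# Hypothetical word ending just as the first turn begins.
print(nearest_speaker(0.80, 1.10, example_turns))
# Hypothetical word far from any turn stays unassigned.
print(nearest_speaker(4.00, 4.20, example_turns))
```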

Gemini is then used to correct the draft SRT. The original audio is provided as the ground truth, while the diarization timeline and the draft SRT are passed as flawed reference data. Gemini uses them only as hints when producing the final `output.srt`.

In [None]:
import google.generativeai as genai
import textwrap
import os

def correct_srt_with_gemini(api_key: str, audio_path: str, diarization_text: str, flawed_srt_text: str) -> str:
    """
    Uses Gemini to correct a flawed SRT file using audio and a diarization timeline.

    Args:
        api_key: The Google API key used to configure the Gemini client.
        audio_path: The file path to the audio to be analyzed.
        diarization_text: A string containing the speaker diarization timeline to be used as a reference.
        flawed_srt_text: A string containing the draft SRT transcript to be used as a reference.

    Returns:
        The corrected SRT content as a string.
    """
    try:

        genai.configure(api_key=api_key)


        print("Uploading audio file to Gemini...")
        audio_file = genai.upload_file(path=audio_path)
        print(f"✅ Audio file '{audio_path}' uploaded successfully.")


        prompt = textwrap.dedent("""
        <prompt>
    <system_instructions>
        You are an expert AI subtitler and audio analyst.
        Your primary task is to create a perfectly accurate, speaker-separated SRT subtitle file directly from the provided audio.

        1.  **Analyze Audio Ground Truth:** The audio file is the single source of truth. Listen to it carefully to determine the exact words spoken, the timestamps, and the speakers.

        2.  **Perform Your Own Speaker Diarization:** Identify the number of speakers in the audio yourself. Label them consistently as `[Speaker 1]`, `[Speaker 2]`, etc., based on the order of their first appearance.

        3.  **Use Reference Data as Weak Hints:** You are also provided with two pieces of low-confidence reference data: a faulty diarization timeline and a flawed draft SRT. Use them only as weak hints if you are uncertain.

        4.  **Generate a Perfect SRT File:** Your final output must be a single, complete SRT file.
            -   The text, speaker labels, and timestamps must be derived directly and accurately from your analysis of the audio.
            -   **Crucially, each distinct spoken utterance, even if it overlaps with another, must be its own numbered block with its own precise start and end times.**
            -   Do not include any other text or explanations. Ensure the dialogue flows naturally and sentences are complete.

        5.  **Required Output Format:**
            ```srt
            1
            00:00:01,000 --> 00:00:04,000
            [Speaker 1] Hello, welcome to the show.

            2
            00:00:04,100 --> 00:00:06,000
            [Speaker 2] Thank you, it's great to be here.
            ```
    </system_instructions>

    <reference_data>
        <faulty_diarization_timeline>
            {diarization_data}
        </faulty_diarization_timeline>

        <faulty_draft_srt>
            {srt_data}
        </faulty_draft_srt>
    </reference_data>
</prompt>""").format(diarization_data=diarization_text, srt_data=flawed_srt_text)


        print("Sending request to Gemini for correction. This may take a moment...")
        model = genai.GenerativeModel('models/gemini-2.5-flash')
        response = model.generate_content([prompt, audio_file])
        print("✅ Received response from Gemini.")

        return response.text

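    except Exception as e:
        # Re-raise instead of returning the message, so an error string is never saved as SRT.
        raise RuntimeError(f"Gemini correction failed: {e}") from e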



if __name__ == '__main__':

    GEMINI_API_KEY = google_api_key
    AUDIO_FILE_PATH = "assignment_audio.wav"

    # str() renders the pyannote annotation as the text timeline printed earlier.
    DIARIZATION_DATA = str(diarization_result)
    FLAWED_SRT_DATA = srt_output

    if not os.path.exists(AUDIO_FILE_PATH):
        print(f"Error: Audio file not found at '{AUDIO_FILE_PATH}'")
        print("Please ensure the audio file is downloaded and the path is correct.")
    elif not GEMINI_API_KEY:
        print("Error: GOOGLE_API_KEY is missing. Add it to the Colab secrets and reload the keys.")
    else:
        corrected_srt = correct_srt_with_gemini(
            api_key=GEMINI_API_KEY,
            audio_path=AUDIO_FILE_PATH,
            diarization_text=DIARIZATION_DATA,
            flawed_srt_text=FLAWED_SRT_DATA
        )

        print("\n--- FINAL CORRECTED SRT (from Gemini) ---")
        print(corrected_srt)

        with open("output.srt", "w", encoding="utf-8") as f:
            f.write(corrected_srt)
            print("\n✅ Final corrected file saved as 'output.srt'")



Uploading audio file to Gemini...
✅ Audio file 'assignment_audio.wav' uploaded successfully.
Sending request to Gemini for correction. This may take a moment...
✅ Received response from Gemini.

--- FINAL CORRECTED SRT (from Gemini) ---
1
00:00:00,996 --> 00:00:02,176
[Speaker 1] Can I ask you a question?

2
00:00:02,896 --> 00:00:03,666
[Speaker 2] Course you can.

3
00:00:05,256 --> 00:00:06,686
[Speaker 1] Can you get me out of this place?

4
00:00:07,456 --> 00:00:08,896
[Speaker 2] I'm afraid I'm of no use to you there.

5
00:00:09,056 --> 00:00:11,106
[Speaker 1] Why can't they put me in a proper prison?

6
00:00:11,266 --> 00:00:12,746
[Speaker 2] A better facility wasn't available.

7
00:00:13,116 --> 00:00:13,996
[Speaker 1] Or put me in a madhouse.

8
00:00:14,246 --> 00:00:15,396
[Speaker 2] It's a secure training centre.

9
00:00:15,706 --> 00:00:16,136
[Speaker 1] Training?

10
00:00:17,216 --> 00:00:18,526
[Speaker 1] What are they training us for?

11
00:00:19,266 --> 00
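
Because the corrected SRT comes from a language model, a structural check before saving or burning it can catch malformed output early. The sketch below is not part of the notebook. It parses cues with a regular expression and reports numbering, timing, and speaker-label problems. `parse_srt` and `validate_cues` are hypothetical helpers, and the `[Speaker N]` convention is the one requested in the prompt.

```python
import re

CUE_RE = re.compile(
    r"(\d+)\s*\n"
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})\s*\n"
    r"(.+?)(?=\n\s*\n|\Z)",
    re.DOTALL,
)


def _to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(text):
    """Return a list of cue dicts with index, start, end (seconds), and text."""
    cues = []
    for match in CUE_RE.finditer(text.strip()):
        g = match.groups()
        cues.append({
            "index": int(g[0]),
            "start": _to_seconds(*g[1:5]),
            "end": _to_seconds(*g[5:9]),
            "text": g[9].strip(),
        })
    return cues


def validate_cues(cues):
    """Return human-readable problems; an empty list means no issue was found."""
    problems = []
    for position, cue in enumerate(cues, start=1):
        if cue["index"] != position:
            problems.append(f"cue at position {position} is numbered {cue['index']}")
        if cue["end"] <= cue["start"]:
            problems.append(f"cue {cue['index']} ends before it starts")
        if not re.match(r"\[Speaker \d+\]", cue["text"]):
            problems.append(f"cue {cue['index']} has no [Speaker N] label")
    return problems


# First two cues of the corrected output shown above.
sample_srt = """1
00:00:00,996 --> 00:00:02,176
[Speaker 1] Can I ask you a question?

2
00:00:02,896 --> 00:00:03,666
[Speaker 2] Course you can.
"""
print(validate_cues(parse_srt(sample_srt)))
```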

## QC Agent

Two quality checks are provided: a visual check (human in the loop) and an agentic check.

For the visual check of the final product, run the cell below and watch the entire video with subtitles (`final_video_with_subtitles.mp4`).

In [None]:
def burn_subtitles_with_audio(video_path: str, srt_path: str, output_path="final_video_with_subtitles.mp4"):
    """
    Burns the SRT subtitles into the video while copying the original audio stream.

    Args:
        video_path (str): Path to the input video file.
        srt_path (str): Path to the SRT subtitle file.
        output_path (str): Path for the final output video.
    """
    print("\n--- Merging Subtitles and Audio into Video ---")


    if not os.path.exists(video_path):
        print(f"❌ Error: Input video not found at '{video_path}'")
        return
    if not os.path.exists(srt_path):
        print(f"❌ Error: Subtitle file not found at '{srt_path}'")
        return


    command = f"""
    ffmpeg -i "{video_path}" -vf "subtitles='{srt_path}':force_style='Fontsize=20,PrimaryColour=&H00FFFFFF&,BorderStyle=3'" -c:a copy "{output_path}" -y
    """

    print("🔥 Burning subtitles into video...")

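    exit_code = os.system(command)  # a non-zero exit status means ffmpeg reported an error


    # Check the exit status too: a stale output file from an earlier run may still exist.
    if exit_code == 0 and os.path.exists(output_path):
        print(f"✅ Final video created: '{output_path}'")
    else:
        print("❌ Failed to create final video.")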

video_output_path = "/content/assignment_video.mp4"
final_srt_file = "/content/output.srt"


burn_subtitles_with_audio(video_output_path, final_srt_file)


--- Merging Subtitles and Audio into Video ---
🔥 Burning subtitles into video...
✅ Final video created: 'final_video_with_subtitles.mp4'
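
An alternative to building a shell string is to pass ffmpeg an argument list through `subprocess.run`, which avoids shell quoting and exposes the exit code and error output. The sketch below is an assumption-laden variant, not the notebook's implementation. `build_burn_command` and `burn_subtitles` are hypothetical names, the filter string is the one used above, and the subtitle path is assumed to contain no single quotes, colons, or backslashes.

```python
import subprocess


def build_burn_command(video_path, srt_path, output_path):
    """Build the ffmpeg argument list for burning subtitles into a video."""
    # Assumption: srt_path has no single quotes, colons, or backslashes,
    # which would need extra escaping inside the filter expression.
    video_filter = (
        f"subtitles='{srt_path}':"
        "force_style='Fontsize=20,PrimaryColour=&H00FFFFFF&,BorderStyle=3'"
    )
    return ["ffmpeg", "-y", "-i", video_path,
            "-vf", video_filter, "-c:a", "copy", output_path]


def burn_subtitles(video_path, srt_path, output_path):
    """Run ffmpeg and raise if it exits with a non-zero status."""
    result = subprocess.run(
        build_burn_command(video_path, srt_path, output_path),
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"ffmpeg failed with exit code {result.returncode}:\n{result.stderr[-500:]}"
        )
    return output_path


print(build_burn_command("assignment_video.mp4", "output.srt",
                         "final_video_with_subtitles.mp4"))
```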


In the agentic check, Gemini acts as the Quality Check Agent. It is given the original audio as ground truth and the content of `output.srt` as the text under test, with the following tasks:
1.  Verify the accuracy of the transcribed text.
2.  Verify the accuracy of the assigned speaker label.
3.  Verify the accuracy of the start and end timestamps.

For each subtitle chunk it returns a confidence score with feedback, which is saved as `quality_report.csv`.

In [None]:
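# This cell uses the newer `google-genai` SDK; the import rebinds the name `genai`
# that the correction cell above bound to `google.generativeai`.
from google import genai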
import pandas as pd

client = genai.Client(api_key=google_api_key)

myfile = client.files.upload(file="/content/assignment_audio.wav")
print("Original Audio uploaded !")

prompt = f"""<prompt>
    <system_instructions>
        You are a meticulous Quality Assurance (QA) Analyst for Speaker Diarization. Your task is to analyze each individual subtitle chunk in the provided SRT file against the video ground truth (provided audio).
        There are multiple speakers speaking in the audio thus, ensure the SRT file justifies the speaker diarization timeline.
    </system_instructions>

    <input_data>
        <full_subtitle_file_to_evaluate>
            {corrected_srt}
        </full_subtitle_file_to_evaluate>
    </input_data>

    <task>
        For **every numbered chunk** in the subtitle file, perform the following analysis by listening to the audio at the corresponding timestamp:
        1.  Verify the accuracy of the transcribed text.
        2.  Verify the accuracy of the assigned speaker label.
        3.  Verify the accuracy of the start and end timestamps.
    </task>

    <output_format>
        Your response MUST be a raw CSV string and nothing else. Do not include any other text, explanations, or markdown formatting.

        The CSV string must have the following header row:
        `index,confidence_score,feedback`

        Each subsequent row must correspond to a subtitle chunk from the input file. Ensure that any feedback containing commas is enclosed in double quotes.

        Example:
        ```csv
        index,confidence_score,feedback
        1,10,"Perfect match in text, speaker, and timing."
        2,7,"The text is accurate, but the subtitle starts slightly late."
        3,4,"The speaker label is incorrect; Speaker 2's voice is heard."
        ```
    </output_format>
</prompt>"""

response = client.models.generate_content(
    model="gemini-2.5-flash", contents=[prompt, myfile]
)

print("\n--- Quality Check Report (CSV) ---")

csv_response_text = response.text.strip().replace("```csv", "").replace("```", "")
file_path = "quality_report.csv"
with open(file_path, "w", encoding="utf-8") as f:
    f.write(csv_response_text.strip())

print(f"✅ Report successfully saved to '{file_path}'")

print("\n--- Quality Check Report ---")
report_df = pd.read_csv(file_path)


print(report_df.to_string())


Original Audio uploaded !

--- Quality Check Report (CSV) ---
✅ Report successfully saved to 'quality_report.csv'

--- Quality Check Report ---
    index  confidence_score                                                                                                                                                  feedback
0       1                10                                                                                                               Perfect match in text, speaker, and timing.
1       2                10                                                                                                               Perfect match in text, speaker, and timing.
2       3                10                                                                                                               Perfect match in text, speaker, and timing.
3       4                10                                                                                                     
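
The report can also drive targeted human review by listing only the chunks that score below a cutoff. The sketch below is not part of the notebook. It reads the CSV text with the standard `csv` module; `flag_low_confidence` and the threshold of 8 are hypothetical choices, and the sample rows are the example rows from the prompt rather than real results.

```python
import csv
import io


def flag_low_confidence(csv_text, threshold=8):
    """Return report rows whose confidence_score is below `threshold`."""
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text.strip())):
        try:
            score = float(row["confidence_score"])
        except (KeyError, TypeError, ValueError):
            flagged.append(row)  # unreadable score: send to human review
            continue
        if score < threshold:
            flagged.append(row)
    return flagged


# Example rows taken from the prompt's format example, not from a real run.
sample_report = '''index,confidence_score,feedback
1,10,"Perfect match in text, speaker, and timing."
2,7,"The text is accurate, but the subtitle starts slightly late."
3,4,"The speaker label is incorrect; Speaker 2's voice is heard."
'''

for row in flag_low_confidence(sample_report):
    print(row["index"], row["confidence_score"], row["feedback"])
```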

# Using Saarika

The plan was to use Saarika to generate a second confidence score by comparing Saarika's diarization with the agentic pipeline's output. This was not possible because I could not find any documentation on performing diarization with Saarika-v2.5 (the model mentioned in the provided docs), so the cells below only transcribe the first 25 seconds of audio.

In [None]:
def extract_initial_audio(video_path: str, duration_seconds: int, output_audio_path: str) -> str | None:
    """
    Extracts a specific duration of audio from the beginning of a video file.

    Args:
        video_path (str): The path to the input video file.
        duration_seconds (int): The duration to extract in seconds.
        output_audio_path (str): The path to save the extracted WAV audio file.

    Returns:
        The path to the output audio file, or None if an error occurred.
    """
    try:
        if not os.path.exists(video_path):
            raise FileNotFoundError(f"Input video not found at: {video_path}")

        print(f"🎧 Loading video: '{video_path}'")
        full_audio = AudioSegment.from_file(video_path)

        # Keep only the first duration_seconds of audio (25 s here; the limit is 30 s without batch processing)
        duration_ms = duration_seconds * 1000
        initial_audio_segment = full_audio[:duration_ms]


        initial_audio_segment.export(output_audio_path, format="wav")
        print(f"✅ Successfully extracted {duration_seconds}s of audio to '{output_audio_path}'")
        return output_audio_path

    except Exception as e:
        print(f"❌ An error occurred: {e}")
        return None

extracted_file = extract_initial_audio(
        video_path="/content/assignment_video.mp4",
        duration_seconds=25,
        output_audio_path="/content/sarika_audio.wav"
    )

🎧 Loading video: '/content/assignment_video.mp4'
✅ Successfully extracted 25s of audio to '/content/sarika_audio.wav'


In [None]:
!pip install sarvamai

Collecting sarvamai
  Downloading sarvamai-0.1.15-py3-none-any.whl.metadata (26 kB)
Downloading sarvamai-0.1.15-py3-none-any.whl (163 kB)
Installing collected packages: sarvamai
Successfully installed sarvamai-0.1.15


In [None]:
from sarvamai import SarvamAI

client = SarvamAI(
    api_subscription_key= sarvamai_key
)

# Open the file in a context manager so the handle is closed after the request.
with open("/content/sarika_audio.wav", "rb") as audio_file:
    response = client.speech_to_text.transcribe(
        file=audio_file,
        model="saarika:v2.5",
        language_code="en-IN",
        # enable_diarization=True   # tried: could not find a parameter to enable diarization
        # diarized_transcript=True
    )

print(response.transcript)

Can I ask you a question? Of course you can.  Can you can you get me out of this place? I'm afraid I'm of no use to you there. Why why couldn't they put me in a proper prison? A better facility wasn't available. Or put me in a madhouse. It's a secure training center. Training? Well what are they training us for? The other boys just scream all the time. And they walk practicing forward and they shall walk coming in. Young offenders institutions are always a problem.
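
Even without diarization, the plain transcript can serve as a rough text-level cross-check against the pipeline's subtitles. The sketch below is not part of the notebook. It strips speaker labels and punctuation, then compares the two word sequences with `difflib.SequenceMatcher`. `normalize` and `word_similarity` are hypothetical helpers, and the ratio is only a similarity indicator, not a calibrated confidence score.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Drop [Speaker N] labels and punctuation, lowercase, and split into words."""
    text = re.sub(r"\[Speaker \d+\]", " ", text)
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return text.split()


def word_similarity(a, b):
    """Similarity ratio in [0, 1] between two texts, compared word by word."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Opening lines from the corrected SRT and from the transcript above.
srt_text = "[Speaker 1] Can I ask you a question? [Speaker 2] Course you can."
other_text = "Can I ask you a question? Of course you can."
print(f"{word_similarity(srt_text, other_text):.3f}")
```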
