# Video Transcription Pipeline

## Overview
This Jupyter notebook implements a batch pipeline for transcribing YouTube videos with OpenAI's Whisper model. It processes the videos one at a time, appends each transcript to a CSV file, and reports progress as it goes.

### Key Features
- Batch video transcription using the Whisper `turbo` model
- Progress tracking and display
- Error handling for corrupted files
- CSV output with video IDs and transcripts
- Resumable processing capability

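### Prerequisites
- Python packages: `openai-whisper`, `pandas`, `tqdm`
- The `ffmpeg` command-line tool on the system `PATH` (Whisper calls it to decode audio; the `ffmpeg-python` package is only a wrapper and does not include the binary)
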
In [None]:
!pip install openai-whisper pandas tqdm --break-system-packages
!pip install ffmpeg-python --break-system-packages

In [None]:
import os
import pandas as pd
import whisper
import subprocess
from IPython.display import clear_output



### Configuration
- Input: MP4 video files in `Non_Transcribed_Videos` folder
- Output: Transcriptions saved to `Transcription.csv`
- Model: Whisper `turbo`, chosen as a trade-off between speed and accuracy
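
Before a long run it can help to confirm that the configuration above is usable. The following is a minimal sketch, not part of the original notebook: the helper names `list_mp4_files` and `ffmpeg_available` are hypothetical, and the check assumes that Whisper will look for an `ffmpeg` executable on the `PATH`.

```python
import os
import shutil


def list_mp4_files(folder):
    """Return the sorted MP4 file names in `folder` (extension matched case-insensitively)."""
    return sorted(f for f in os.listdir(folder) if f.lower().endswith(".mp4"))


def ffmpeg_available():
    """Return True if an `ffmpeg` executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None
```

Calling `list_mp4_files('Non_Transcribed_Videos')` would give the list of files to process, and a `False` result from `ffmpeg_available()` suggests that transcription is likely to fail when Whisper tries to decode the audio.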

### Process Flow
1. Load Whisper model and initialize settings
2. Process videos sequentially:
   - Extract audio and transcribe
   - Save results to CSV
   - Update progress display
3. Handle errors and corrupted files
4. Track completion percentage
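
The flow above can be sketched independently of Whisper. This is an illustrative outline under stated assumptions rather than the notebook's implementation: `run_pipeline` and `load_done_ids` are hypothetical names, `transcribe` is any callable that returns text for a file name, the standard `csv` module stands in for pandas to keep the sketch dependency-free, and every `RuntimeError` is treated as a skippable file (the notebook itself only skips specific ffmpeg messages).

```python
import csv
import os


def load_done_ids(csv_path):
    """Return the set of video IDs already in the CSV (empty if the file is missing)."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {row["Video ID"] for row in csv.DictReader(f)}


def run_pipeline(videos, transcribe, csv_path):
    """Transcribe pending videos one by one, append results, and return skipped files."""
    done = load_done_ids(csv_path)
    # Resumability: drop videos whose ID is already recorded in the CSV
    pending = [v for v in videos if os.path.splitext(v)[0] not in done]
    skipped = []
    for i, video in enumerate(pending, start=1):
        video_id = os.path.splitext(video)[0]
        try:
            text = transcribe(video)
        except RuntimeError:
            # Assumption: a RuntimeError means the file could not be decoded
            skipped.append(video)
            continue
        write_header = not os.path.exists(csv_path)
        with open(csv_path, "a", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["Video ID", "transcript"])
            if write_header:
                writer.writeheader()
            writer.writerow({"Video ID": video_id, "transcript": text})
        print(f"{i / len(pending) * 100:.2f}%")
    return skipped
```

Because each result is appended as soon as it is available, an interrupted run loses at most the video in progress, and a second call with the same `csv_path` only revisits files that have no row yet.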

### Usage
The pipeline expects videos in the `Non_Transcribed_Videos` folder and processes them sequentially, displaying progress as a percentage of total videos completed.
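
The progress figure is the share of the total that is no longer pending. A small sketch of that arithmetic follows; `progress_percent` is a hypothetical helper, the total of 10956 is the value of `N` used in this notebook, and the count of 706 remaining files is chosen purely for illustration.

```python
def progress_percent(total, remaining):
    """Percentage of `total` videos that are no longer pending."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - remaining) / total * 100


# Illustrative values: 10956 videos in total, 706 still to transcribe
print(f"{progress_percent(10956, 706):.2f}%")  # prints 93.56%
```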

In [None]:
input_folder = 'Non_Transcribed_Videos'
output_folder = 'Transcription.csv'  # despite the name, this is the path of the output CSV file
print("Loading Whisper-turbo...")
model = whisper.load_model("turbo")

Loading Whisper-turbo...


100%|█████████████████████████████████████| 1.51G/1.51G [00:52<00:00, 30.9MiB/s]


In [15]:
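N = 10956  # total number of videos in the full dataset
# Consider only MP4 files in the input folder
mp4_files = [f for f in os.listdir(input_folder) if f.lower().endswith('.mp4')]

# Skip videos that already have a transcript in the CSV (resumable processing)
if os.path.exists(output_folder):
    df = pd.read_csv(output_folder)
    done = set(df["Video ID"].astype(str))
    mp4_files = [f for f in mp4_files if os.path.splitext(f)[0] not in done]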

In [16]:
def get_progress():
    """Clear the cell output and print the share of the N videos that are no longer pending."""
    clear_output(wait=True)
    print(f"{(N-len(mp4_files))/N * 100:.2f}%")

In [20]:
def process_video(video):
    """Transcribe one video and append its ID and transcript to the CSV."""
    video_path = os.path.join(input_folder, video)

    try:
        # Transcribe directly from the MP4; Whisper decodes the audio track via ffmpeg
        result = model.transcribe(video_path)
        transcript = result["text"]

        # Append the result, writing the header only when the CSV does not exist yet
        video_id = os.path.splitext(video)[0]
        df = pd.DataFrame([{"Video ID": video_id, "transcript": transcript}])
        df.to_csv(output_folder, mode='a', header=not os.path.exists(output_folder), index=False)

    except RuntimeError as e:
        # Decoding failures surface as RuntimeError; skip only the known corruption messages
        if "moov atom not found" in str(e) or "Invalid data found when processing input" in str(e):
            print(f"Skipping {video} due to file corruption or invalid format.")
        else:
            raise  # Re-raise other errors


In [None]:
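for video in list(mp4_files):  # iterate over a copy so the pending list can shrink
    print(f"\n Processing video: {os.path.splitext(video)[0]} ...")
    process_video(video)
    mp4_files.remove(video)  # keeps the percentage from get_progress() up to date
    get_progress()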

93.56%

 Processing video: 0Sum_pDOiyI ...
