<a href="https://colab.research.google.com/github/A-Monaghan/whisper-audio/blob/main/Whisper_Audi0_25.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Whisper AI: Advanced Audio Transcription, Translation, and Analysis
This Google Colaboratory notebook provides a comprehensive environment for leveraging OpenAI's Whisper model for various audio processing tasks. It goes beyond basic transcription to include features like language detection, translation, speaker diarization, and text summarization.

Features:

Easy Setup: Install all necessary libraries with a single cell.

Flexible Audio Input: Upload files or use files from Google Drive.

Whisper Model Selection: Choose a Whisper model size (tiny, base, small, medium, large, or large-v2).

Transcription: Convert audio to text with automatic language detection.

Translation: Translate speech in the audio into English.

Speaker Diarization: Identify and label different speakers in the audio (requires Hugging Face token).

Text Summarization: Generate concise summaries of the transcribed content.

Structured Output: Save results in various formats (TXT, SRT, VTT, JSON).

1. Setup and Installation
This section installs all the necessary libraries, including whisper, ffmpeg-python and pydub (for audio handling), pyannote.audio (for speaker diarization), and transformers (for summarization).

In [None]:
# @title Install Dependencies
!pip install -q git+https://github.com/openai/whisper.git
!pip install -q ffmpeg-python
!pip install -q transformers
!pip install -q accelerate
!pip install -q sentencepiece # Required for some tokenizer models
!pip install -q pydub # For audio manipulation

# Install pyannote.audio for speaker diarization (requires specific versions)
# Note: pyannote.audio requires a Hugging Face token for models.
!pip install -q pyannote.audio==3.1.1
!pip install -q torchaudio==2.0.2

# Check GPU availability
import torch
import torchaudio  # imported here so its version can be printed below
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"PyTorch version: {torch.__version__}")
print(f"Torch Audio version: {torchaudio.__version__}")

import whisper
import os
from pydub import AudioSegment
from transformers import pipeline
import json
import re
from IPython.display import Audio, display
import numpy as np
import subprocess

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

print("All dependencies installed and imported successfully!")
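
Whisper decodes audio by calling the ffmpeg command-line tool, and the `ffmpeg-python` package installed above is only a Python wrapper around that tool. A minimal sketch of a check that the executable is actually present, using only the standard library (the helper name `tool_available` is hypothetical, not part of the notebook):

```python
import shutil

def tool_available(name: str) -> bool:
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None

if tool_available("ffmpeg"):
    print("ffmpeg found on PATH.")
else:
    # Assumption: on a Debian-based runtime such as Colab, apt can install it.
    print("ffmpeg not found. Try: apt-get -qq install ffmpeg")
```

Running this right after the install cell surfaces a missing binary before any audio is loaded.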

2. Google Drive Integration (Optional)
Mount your Google Drive to easily access audio files and save output files.

In [None]:
# @title Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
print("Google Drive mounted!")


3. Audio Input Handling
You can either upload an audio file directly to Colab or specify a path to an audio file already in your Google Drive. This section also includes a utility to convert audio to a suitable format if needed.

In [None]:
# @title Upload Audio File or Specify Path
# @markdown Choose your audio input method:
audio_input_method = "Upload File"  # @param ["Upload File", "Google Drive Path"]

# @markdown If "Upload File" is selected, use the upload button that appears after running this cell.
# @markdown If "Google Drive Path" is selected, provide the full path to your audio file (e.g., `/content/drive/MyDrive/my_audio.mp3`).
google_drive_audio_path = "/content/drive/MyDrive/my_audio.mp3" # @param {type:"string"}

audio_file_path = None

if audio_input_method == "Upload File":
    from google.colab import files
    uploaded = files.upload()
    for filename in uploaded.keys():
        audio_file_path = filename
        print(f"Uploaded file: {audio_file_path}")
        break # Assuming only one file is uploaded
else:
    if os.path.exists(google_drive_audio_path):
        audio_file_path = google_drive_audio_path
        print(f"Using Google Drive file: {audio_file_path}")
    else:
        print(f"Error: File not found at {google_drive_audio_path}. Please check the path.")

# Convert audio to WAV so every later step reads the same format (Whisper itself decodes any format ffmpeg supports)
def convert_audio_to_wav(input_path, output_path="temp_audio.wav"):
    try:
        audio = AudioSegment.from_file(input_path)
        audio.export(output_path, format="wav")
        print(f"Converted '{input_path}' to '{output_path}'")
        return output_path
    except Exception as e:
        print(f"Error converting audio: {e}")
        return None

if audio_file_path:
    # Convert to WAV if not already WAV
    if not audio_file_path.lower().endswith(".wav"):
        audio_file_path = convert_audio_to_wav(audio_file_path)
        if not audio_file_path:
            print("Audio conversion failed. Please ensure the input file is valid.")
    else:
        print(f"Audio file '{audio_file_path}' is already in WAV format.")

    # Display audio for verification
    if audio_file_path and os.path.exists(audio_file_path):
        display(Audio(audio_file_path))
    else:
        print("No valid audio file to display.")
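
To verify what the conversion step produced, it can help to read the WAV header. The sketch below uses only Python's standard `wave` module and assumes an uncompressed PCM WAV file; `describe_wav` is a hypothetical helper name, not part of the notebook.

```python
import wave

def describe_wav(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / float(rate)
    return rate, channels, duration

# Example usage with the notebook's variable (uncomment after the cell above has run):
# rate, channels, duration = describe_wav(audio_file_path)
# print(f"{rate} Hz, {channels} channel(s), {duration:.1f} s")
```

Whisper resamples its input to 16 kHz mono internally, so other sample rates or channel counts are not an error; the check is mainly useful for confirming the file is readable and has the expected duration.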

4. Whisper Model Loading
Choose the Whisper model size. Larger models provide better accuracy but require more computational resources and time.



In [None]:
# @title Select Whisper Model
whisper_model_size = "base" # @param ["tiny", "base", "small", "medium", "large", "large-v2"]

print(f"Loading Whisper model: {whisper_model_size}...")
device = "cuda" if torch.cuda.is_available() else "cpu"
whisper_model = whisper.load_model(whisper_model_size, device=device)
print(f"Whisper model '{whisper_model_size}' loaded on {device} device.")
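
The size/accuracy trade-off can be turned into a rough selection rule. The sketch below picks the largest model that fits a given amount of GPU memory. The memory figures are the approximate values listed in the Whisper README and should be treated as guides only; `suggest_model_size` is a hypothetical helper.

```python
# Approximate VRAM needs in GB (rough guides from the Whisper README), smallest to largest.
APPROX_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}

def suggest_model_size(free_vram_gb):
    """Return the largest model whose approximate VRAM need fits in free_vram_gb."""
    fitting = [name for name, need in APPROX_VRAM_GB.items() if need <= free_vram_gb]
    # Dicts keep insertion order, so the last fitting entry is the largest model.
    return fitting[-1] if fitting else "tiny"

print(suggest_model_size(6))

# With a GPU runtime, the total device memory could be supplied like this:
# import torch
# total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
# print(suggest_model_size(total_gb))
```

On a CPU-only runtime memory is rarely the limit; speed is, so "tiny" or "base" is usually the practical choice there.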


5. Core Functionality
This section contains the main functions for transcription, translation, speaker diarization, and summarization.

5.1. Audio Transcription
Transcribe the audio file using the loaded Whisper model. It automatically detects the language.

In [None]:
# @title Transcribe Audio
# @markdown Run this cell to transcribe the uploaded/selected audio file.
# @markdown You can choose to save the output in different formats.
output_formats = ["txt", "srt", "vtt", "json"] # @param ["txt", "srt", "vtt", "json"] {allow-input: true}
output_filename_prefix = "transcription" # @param {type:"string"}

def transcribe_audio(audio_path, model, output_formats=["txt"]):
    if not audio_path or not os.path.exists(audio_path):
        print("Error: Audio file not found for transcription.")
        return None, None, None  # the caller unpacks three values: text, language, segments

    print(f"Starting transcription for {audio_path}...")
    result = model.transcribe(audio_path, verbose=True) # verbose=True shows progress
    text = result["text"]
    language = result["language"]
    segments = result["segments"]

    print(f"\nTranscription complete. Detected language: {language}")
    print("\n--- Full Transcription ---")
    print(text)
    print("--------------------------")

    base_output_name = os.path.join(os.path.dirname(audio_path) if os.path.dirname(audio_path) else ".", output_filename_prefix)

    # Save outputs in specified formats
    for fmt in output_formats:
        output_path = f"{base_output_name}.{fmt}"
        with open(output_path, "w", encoding="utf-8") as f:
            if fmt == "txt":
                f.write(text)
            elif fmt == "srt":
                for i, segment in enumerate(segments):
                    # SRT timestamps use the form HH:MM:SS,mmm
                    start_time = str(int(segment['start'] // 3600)).zfill(2) + ':' + str(int((segment['start'] % 3600) // 60)).zfill(2) + ':' + str(int(segment['start'] % 60)).zfill(2) + ',' + str(int((segment['start'] * 1000) % 1000)).zfill(3)
                    end_time = str(int(segment['end'] // 3600)).zfill(2) + ':' + str(int((segment['end'] % 3600) // 60)).zfill(2) + ':' + str(int(segment['end'] % 60)).zfill(2) + ',' + str(int((segment['end'] * 1000) % 1000)).zfill(3)
                    f.write(f"{i + 1}\n")
                    f.write(f"{start_time} --> {end_time}\n")
                    f.write(f"{segment['text'].strip()}\n\n")
            elif fmt == "vtt":
                f.write("WEBVTT\n\n")
                for segment in segments:
                    start_time = str(int(segment['start'] // 3600)).zfill(2) + ':' + str(int((segment['start'] % 3600) // 60)).zfill(2) + ':' + str(int(segment['start'] % 60)).zfill(2) + '.' + str(int((segment['start'] * 1000) % 1000)).zfill(3)
                    end_time = str(int(segment['end'] // 3600)).zfill(2) + ':' + str(int((segment['end'] % 3600) // 60)).zfill(2) + ':' + str(int(segment['end'] % 60)).zfill(2) + '.' + str(int((segment['end'] * 1000) % 1000)).zfill(3)
                    f.write(f"{start_time} --> {end_time}\n")
                    f.write(f"{segment['text'].strip()}\n\n")
            elif fmt == "json":
                json.dump(result, f, indent=4, ensure_ascii=False)  # keep non-ASCII transcripts readable
        print(f"Transcription saved to {output_path}")

    return text, language, segments

# Perform transcription
transcribed_text, detected_language, audio_segments = transcribe_audio(audio_file_path, whisper_model, output_formats)
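
The SRT and VTT branches differ only in the decimal separator of their timestamps (`,` for SRT, `.` for WebVTT). A minimal sketch of a shared formatter follows; `format_timestamp` and `srt_block` are hypothetical helper names, not functions the notebook defines. Working in whole milliseconds avoids floating-point remainders and lets rounding carry into the seconds field.

```python
def format_timestamp(seconds, decimal_marker=","):
    """Format seconds as HH:MM:SS<marker>mmm (',' for SRT, '.' for WebVTT)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{ms:03d}"

def srt_block(index, start, end, text):
    """Build one numbered SRT cue, terminated by a blank line."""
    return f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text.strip()}\n\n"

print(format_timestamp(3661.5))        # SRT style
print(format_timestamp(3661.5, "."))   # WebVTT style
print(srt_block(1, 0.0, 2.5, " Hello "), end="")
```

Each long string-concatenation line in the transcription cell could be replaced by one call to such a helper.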


5.2. Text Translation
Produce an English version of the audio. Rather than translating the transcript text, this cell runs Whisper's transcribe function on the audio again with task="translate", which outputs English directly from speech in another language.

In [None]:
# @title Translate Audio to English
# @markdown Run this cell to produce an English translation of the audio file.
# @markdown This re-runs Whisper on the audio with `task="translate"`; it does not translate the saved transcript text.

def translate_text_with_whisper(audio_path, model):
    if not audio_path or not os.path.exists(audio_path):
        print("Error: Audio file not found for translation.")
        return None

    print(f"Starting translation for {audio_path}...")
    # Use task="translate" to translate to English
    translation_result = model.transcribe(audio_path, task="translate", verbose=True)
    translated_text = translation_result["text"]

    print("\n--- Translated Text (English) ---")
    print(translated_text)
    print("---------------------------------")

    # Save translated text
    translation_output_path = os.path.join(os.path.dirname(audio_path) if os.path.dirname(audio_path) else ".", f"{output_filename_prefix}_translated.txt")
    with open(translation_output_path, "w", encoding="utf-8") as f:
        f.write(translated_text)
    print(f"Translated text saved to {translation_output_path}")

    return translated_text

if audio_file_path and transcribed_text:
    translated_content = translate_text_with_whisper(audio_file_path, whisper_model)
else:
    print("No audio file or transcription available for translation. Please run previous cells.")


5.3. Speaker Diarization
Identify different speakers in the audio and associate them with their respective transcribed segments. This requires a Hugging Face token.

Important: You need to accept the user conditions for the pyannote/speaker-diarization model on Hugging Face and provide your Hugging Face token.

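Go to https://huggingface.co/pyannote/speaker-diarization and accept the user conditions. The pipeline also depends on the gated pyannote/segmentation model, so you may need to accept its conditions at https://huggingface.co/pyannote/segmentation as well.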

Go to https://huggingface.co/settings/tokens and generate a new token with 'read' access.

Paste your token in the cell below.

In [None]:
# @title Perform Speaker Diarization
# @markdown **Important:** Enter your Hugging Face token below.
# @markdown Get your token from: [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
hugging_face_token = "" # @param {type:"string"}

if not hugging_face_token:
    print("Please provide your Hugging Face token to perform speaker diarization.")
else:
    os.environ["HF_TOKEN"] = hugging_face_token
    try:
        from pyannote.audio import Pipeline
        diarization_pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2.1", use_auth_token=hugging_face_token)
        diarization_pipeline.to(torch.device(device))
        print("Speaker diarization pipeline loaded.")

        def diarize_audio(audio_path, segments):
            if not audio_path or not os.path.exists(audio_path):
                print("Error: Audio file not found for diarization.")
                return []

            print(f"Starting speaker diarization for {audio_path}...")
            # Diarize the audio file
            diarization_result = diarization_pipeline(audio_path)

            diarized_segments = []
            for segment in segments:
                start_time = segment['start']
                end_time = segment['end']
                text = segment['text']
                speaker = "UNKNOWN" # Default speaker

                # Find the speaker for this segment
                for turn, _, speaker_label in diarization_result.itertracks(yield_label=True):
                    # Check for overlap between transcription segment and diarization turn
                    if max(start_time, turn.start) < min(end_time, turn.end):
                        speaker = speaker_label
                        break # Found a speaker for this segment

                diarized_segments.append({
                    "start": start_time,
                    "end": end_time,
                    "text": text,
                    "speaker": speaker
                })

            print("\n--- Diarized Transcription ---")
            for seg in diarized_segments:
                print(f"[ {seg['start']:.1f}s - {seg['end']:.1f}s ] Speaker {seg['speaker']}: {seg['text'].strip()}")
            print("------------------------------")

            # Save diarized output
            diarized_output_path = os.path.join(os.path.dirname(audio_path) if os.path.dirname(audio_path) else ".", f"{output_filename_prefix}_diarized.txt")
            with open(diarized_output_path, "w", encoding="utf-8") as f:
                for seg in diarized_segments:
                    f.write(f"[ {seg['start']:.1f}s - {seg['end']:.1f}s ] Speaker {seg['speaker']}: {seg['text'].strip()}\n")
            print(f"Diarized transcription saved to {diarized_output_path}")

            return diarized_segments

        if audio_file_path and audio_segments:
            diarized_audio_data = diarize_audio(audio_file_path, audio_segments)
        else:
            print("No audio file or segments available for diarization. Please run previous cells.")

    except Exception as e:
        print(f"Error loading or running diarization pipeline: {e}")
        print("Please ensure your Hugging Face token is correct and you have accepted the model's user conditions.")
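
The loop above assigns each transcription segment to the first diarization turn that overlaps it, which can mislabel a segment that only briefly touches the previous speaker's turn. One alternative, sketched here under the assumption that turns are available as plain `(start, end, speaker)` tuples, is to pick the speaker with the largest total overlap. `overlap` and `assign_speaker` are hypothetical helpers, not part of pyannote.audio.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speaker(seg_start, seg_end, turns, default="UNKNOWN"):
    """Pick the speaker whose turns overlap the segment for the longest total time.

    `turns` is a list of (start, end, speaker) tuples.
    """
    totals = {}
    for t_start, t_end, speaker in turns:
        shared = overlap(seg_start, seg_end, t_start, t_end)
        if shared > 0:
            totals[speaker] = totals.get(speaker, 0.0) + shared
    return max(totals, key=totals.get) if totals else default

turns = [(0.0, 1.0, "SPEAKER_00"), (1.0, 5.0, "SPEAKER_01")]
print(assign_speaker(0.5, 4.0, turns))  # first-match logic would pick SPEAKER_00 here

# The tuples could be built from the pipeline result like this:
# turns = [(turn.start, turn.end, label)
#          for turn, _, label in diarization_result.itertracks(yield_label=True)]
```

Building the list once before the segment loop also avoids re-iterating the diarization result for every segment.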


5.4. Text Summarization
Summarize the transcribed text using a pre-trained summarization model from the Hugging Face transformers library.

In [None]:
# @title Summarize Transcribed Text
# @markdown Run this cell to generate a summary of the transcribed text.
# @markdown You can adjust the `max_length` and `min_length` for the summary.
summary_max_length = 150 # @param {type:"integer"}
summary_min_length = 50 # @param {type:"integer"}

def summarize_text(text, max_length=150, min_length=50):
    if not text:
        print("No text available for summarization.")
        return None

    print(f"Starting summarization (max_length={max_length}, min_length={min_length})...")
    try:
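        # Using a general-purpose summarization model
        summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=0 if torch.cuda.is_available() else -1)
        # truncation=True clips input beyond the model's maximum length instead of raising an error,
        # so a very long transcript is summarized from its opening portion only
        summary = summarizer(text, max_length=max_length, min_length=min_length, do_sample=False, truncation=True)[0]['summary_text']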

        print("\n--- Summary ---")
        print(summary)
        print("---------------")

        # Save summary
        summary_output_path = os.path.join(os.path.dirname(audio_file_path) if os.path.dirname(audio_file_path) else ".", f"{output_filename_prefix}_summary.txt")
        with open(summary_output_path, "w", encoding="utf-8") as f:
            f.write(summary)
        print(f"Summary saved to {summary_output_path}")

        return summary
    except Exception as e:
        print(f"Error during summarization: {e}")
        print("Ensure you have a stable internet connection for model download and sufficient memory.")
        return None

if transcribed_text:
    summarized_content = summarize_text(transcribed_text, summary_max_length, summary_min_length)
else:
    print("No transcribed text available for summarization. Please run the transcription cell first.")
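
Summarization models accept a limited input length (on the order of 1,024 tokens for facebook/bart-large-cnn), so a long transcript usually has to be split before summarizing. A minimal sketch of word-based chunking follows. The 400-word chunk size is an assumption chosen to stay well under that limit, and `chunk_words` is a hypothetical helper.

```python
def chunk_words(text, max_words=400):
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

print(chunk_words("one two three four five", max_words=2))

# Each chunk could then be summarized separately and the partial summaries joined, e.g.:
# parts = [summarize_text(chunk, summary_max_length, summary_min_length)
#          for chunk in chunk_words(transcribed_text)]
# combined = " ".join(p for p in parts if p)
```

Note that `summarize_text` as written reloads the model and overwrites the same summary file on every call, so a chunked version would be better off creating the pipeline once and saving only the combined result.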


6. Conclusion and Next Steps
You have successfully used Whisper for advanced audio processing tasks!

What's next?

Experiment with Models: Try different Whisper model sizes to see the trade-off between speed and accuracy.

Refine Diarization: For better diarization results, consider fine-tuning pyannote.audio or using more specialized models if available.

Customization: Modify the output formats or integrate this script into a larger application.

Batch Processing: Extend the script to process multiple audio files in a directory.

UI Integration: Build a simple web interface (e.g., using Gradio or Streamlit) to make it more user-friendly.
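
As a starting point for the batch-processing idea, the sketch below collects the audio files in a directory using only the standard library. The extension set and the helper name `find_audio_files` are assumptions; adjust them to the formats you actually use.

```python
from pathlib import Path

# Assumed set of extensions to treat as audio; extend as needed.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg"}

def find_audio_files(directory):
    """Return audio files directly inside `directory` (non-recursive), sorted by path."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )

# Example loop over the notebook's transcription function (uncomment inside the notebook):
# for path in find_audio_files("/content/drive/MyDrive/audio"):
#     output_filename_prefix = path.stem  # separate output names per input file
#     transcribe_audio(str(path), whisper_model, output_formats)
```

Setting `output_filename_prefix` per file matters because `transcribe_audio` reads that global when naming its outputs; without it each file would overwrite the previous results.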