**Recommended:**<br/>
Use a device with a dedicated graphics card; transcription will be much faster.

Create a virtual environment to run Whisper.<br/>
Run the following in a terminal:<br/>
```python -m venv whisper-env```

To activate the virtual environment, run the following in a terminal:

**on Windows:**<br/>
```.\whisper-env\Scripts\activate```

**on Mac:**<br/>
```source whisper-env/bin/activate```

**on Google Colab:**<br/>
The environment is already set up, so there is nothing to activate.

Check that the notebook is using the environment. If it is, the printed file path should contain ```whisper-env```.

In [None]:
import sys
print(sys.executable)
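
As a complementary check, here is a minimal sketch that does not rely on reading the path by eye. In an environment created with ```venv```, ```sys.prefix``` points at the environment while ```sys.base_prefix``` points at the base Python install, so the two differ when an environment is active. The helper name ```in_virtualenv``` is my own for illustration. On Google Colab this is expected to print ```False```, since Colab does not use a ```venv``` environment.

```python
import sys

def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv-style environment."""
    # Very old interpreters may lack base_prefix; fall back to prefix so the result is False.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base_prefix

print("Running inside a virtual environment:", in_virtualenv())
```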


If using Colab, run the cell below. Otherwise, install ffmpeg using the method appropriate for your operating system.

In [None]:
!apt-get update
!apt-get install -y ffmpeg

Install the libraries required to run Whisper locally. These steps may differ for Mac and Windows users.

**for Windows:**<br/>
Find and install ffmpeg from the internet. If you use an installer, it should set itself up in your Program Files automatically. Either way, make sure the ```ffmpeg``` command can be run from a terminal.
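
Whichever way ffmpeg was installed, a quick way to confirm that Python can see it is to look it up on the PATH and ask it for its version. This is only a sketch. The helper name ```ffmpeg_version``` is made up for illustration, and it assumes the executable is called ```ffmpeg```.

```python
import shutil
import subprocess

def ffmpeg_version():
    """Return the first line of `ffmpeg -version`, or None if ffmpeg is not on PATH."""
    path = shutil.which("ffmpeg")  # Full path to the executable, or None if not found
    if path is None:
        return None
    result = subprocess.run([path, "-version"], capture_output=True, text=True)
    lines = result.stdout.splitlines()
    return lines[0] if lines else None

print(ffmpeg_version() or "ffmpeg not found on PATH")
```

If this prints the "not found" message on Windows, the folder containing ```ffmpeg.exe``` probably still needs to be added to the PATH environment variable.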

Next, pip install the libraries into your virtual environment:

In [None]:
%pip install openai-whisper
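# Note: the PyPI package named "ffmpeg" does not include the ffmpeg program itself; the system install above is still needed.
%pip install ffmpeg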
%pip install pytube
%pip install pydub
%pip install yt-dlp
%pip install googletrans==4.0.0-rc1


PyTorch and its companion packages (torchvision and torchaudio) are required. If you want to use a graphics card for faster processing, there are some additional steps.

**for Mac:**<br/>
You can use Metal (MPS) if your Mac has an Apple silicon chip. This is experimental at the time of writing this document (2025). The standard macOS build of PyTorch includes Metal support, so no special index URL is needed.<br/>
```pip install torch torchvision torchaudio```

**for Windows:**<br/>
You'll first need to install the CUDA Toolkit from NVIDIA's website. It comes with an install wizard and installs to Program Files.<br/>
You should also install cuDNN from NVIDIA's website. It is a zip file that needs to be unpacked and manually copied into the CUDA folder in Program Files.<br/>
Match the folder names, and copy the contents of each folder in the zip into the CUDA folder of the same name.

Finally, here's the pip command for the CUDA 12.6 build.<br/>
```pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126```

**for Google Colab:**<br/>
This should again already be set up, so skip the next code cell.

Uncomment the lines below for your specific use case:

In [None]:
# General case, no GPU
# %pip install torch torchvision torchaudio

# Mac using Metal (the standard macOS build includes MPS support, so this is the same command as the general case)
# %pip install torch torchvision torchaudio

# Windows using CUDA 12.6+ (modify /cu126 for specific requirements)
# %pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126


Verify that CUDA is reachable and that the GPU is registered properly.

In [None]:
import torch
print(torch.__version__)  # Installed PyTorch version
print(torch.cuda.is_available())  # Should return True if CUDA is available
print(torch.version.cuda)  # CUDA version PyTorch was built with (None for CPU-only builds)
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # Should show your GPU model
    torch.cuda.empty_cache()  # Release cached GPU memory
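
Since the setup differs between Windows (CUDA), Mac (Metal) and the general no-GPU case, it can help to pick the device in one place. The sketch below is one possible way to do that. The function name ```pick_device``` is my own, and the fallback order (CUDA, then Metal, then CPU) is an assumption rather than anything PyTorch or Whisper prescribes.

```python
def pick_device(cuda_ok: bool, mps_ok: bool) -> str:
    """Prefer CUDA, then Apple Metal (MPS), then fall back to the CPU."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"

try:
    import torch
    cuda_ok = torch.cuda.is_available()
    # torch.backends.mps only exists in newer PyTorch builds, so look it up defensively.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = mps_backend is not None and mps_backend.is_available()
except ImportError:
    # PyTorch is not installed yet: report CPU so the sketch still runs.
    cuda_ok = mps_ok = False

device = pick_device(cuda_ok, mps_ok)
print(f"Selected device: {device}")
```

The resulting string can then be passed to Whisper, for example ```whisper.load_model("large", device=device)```. I have not verified how well Whisper behaves on ```mps```, so treat that branch as experimental.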


The code below has given me the best transcription accuracy so far. The result should be the non-translated transcription file with timestamps.

In [None]:
import whisper
import os
import shutil
from datetime import datetime

# Optional: point tools that honor FFMPEG_BINARY at a specific ffmpeg build (adjust or remove for your system).
# Whisper itself runs the `ffmpeg` command found on PATH.
os.environ["FFMPEG_BINARY"] = "C:/ProgramData/chocolatey/bin/ffmpeg.exe"

# Check ffmpeg availability
if shutil.which("ffmpeg") is None:
    print("Warning: ffmpeg was not found on PATH; Whisper will not be able to load audio.")

# Ensure required directories exist
os.makedirs("audio_files", exist_ok=True)
os.makedirs("transcripts", exist_ok=True)

# Generate unique timestamp
def get_timestamp():
    return datetime.now().strftime("%Y%m%d_%H%M%S")

# Convert seconds to minutes:seconds format
def format_timestamp(seconds):
    minutes = int(seconds // 60)
    remaining_seconds = int(seconds % 60)
    return f"{minutes}m{remaining_seconds}s"

# Transcribe Audio using Whisper (no translation)
def transcribe_audio(audio_path):
    model = whisper.load_model("large")
    print(f"Transcribing {audio_path}...")

    # Transcribe using Whisper (task = "transcribe")
    result = model.transcribe(audio_path, word_timestamps=True, verbose=True)
    detected_language = result["language"]
    print(f"Detected language: {detected_language}")

    # Collect the transcription with timestamps
    segments = []
    for segment in result["segments"]:
        start_time = segment["start"]
        end_time = segment["end"]
        text = segment['text']
        formatted_start_time = format_timestamp(start_time)
        formatted_end_time = format_timestamp(end_time)
        segments.append(f"[{formatted_start_time} - {formatted_end_time}]: {text}")

    # Combine all segments into a single string
    full_transcription = "\n".join(segments)
    return full_transcription

# Save the non-translated transcription with timestamps
def save_combined_transcription_without_translation(full_transcription):
    timestamp = get_timestamp()
    combined_file_path = f"transcripts/combined_transcription_{timestamp}.txt"

    with open(combined_file_path, "w", encoding="utf-8") as file:
        file.write(full_transcription)

    print(f"Saved combined transcription to {combined_file_path}")

import yt_dlp as youtube_dl

# Download YouTube Video Audio using yt-dlp
def download_audio(youtube_url, output_path="audio_files/audio.mp4"):
    print("Downloading video...")
    ydl_opts = {'format': 'bestaudio/best', 'outtmpl': output_path}
    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
        ydl.download([youtube_url])
    print("Download complete!")

# Main function (for testing or production)
def main(input_source):
    # This will hold the full transcription with timestamps
    full_transcription = ""
    audio_path = None  # This will store the final audio file to process

    # Check if input is a YouTube URL or a local file
    if input_source.startswith("http"):
        print("Downloading audio from YouTube...")
        audio_path = f"audio_files/audio_{get_timestamp()}.mp4"
        try:
            download_audio(input_source, audio_path)
        except Exception as e:
            print(f"Error downloading YouTube audio: {e}")
            return
    else:
        if not os.path.exists(input_source):
            print(f"Error: The file '{input_source}' does not exist.")
            return
        audio_path = input_source

    # Transcribe the audio in a single go using Whisper
    try:
        full_transcription = transcribe_audio(audio_path)
    except Exception as e:
        print(f"Error transcribing {audio_path}: {e}")
        return

    # Save the combined transcription with timestamps (non-translated)
    if full_transcription:
        save_combined_transcription_without_translation(full_transcription)
    else:
        print("No transcription was generated.")

    # Clean up downloaded YouTube audio if applicable
    if input_source.startswith("http"):
        try:
            os.remove(audio_path)
        except Exception as e:
            print(f"Warning: Could not delete {audio_path}: {e}")

if __name__ == "__main__":
    user_input = input("Enter YouTube link or local audio file path: ").strip()
    main(user_input)
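
The timestamps above are written as minutes and seconds, so a recording longer than an hour shows up as something like ```75m3s```. If that is hard to read, an alternative is a zero-padded ```HH:MM:SS``` format. The sketch below is a variation, not a change to the code above. The names ```format_timestamp_hms``` and ```format_segments``` and the two sample segments are made up for illustration; the segments only mimic the ```start```, ```end``` and ```text``` keys that the transcription code reads.

```python
def format_timestamp_hms(seconds: float) -> str:
    """Convert seconds to a zero-padded HH:MM:SS string (fractions are truncated)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

def format_segments(segments) -> str:
    """Render segments (dicts with start, end, text) as one timestamped line each."""
    lines = []
    for seg in segments:
        start = format_timestamp_hms(seg["start"])
        end = format_timestamp_hms(seg["end"])
        lines.append(f"[{start} - {end}]: {seg['text'].strip()}")
    return "\n".join(lines)

# Made-up sample data in the same shape as the segments used above
sample = [
    {"start": 0.0, "end": 4.2, "text": " Hello there."},
    {"start": 3725.9, "end": 3731.0, "text": " Over an hour in."},
]
print(format_segments(sample))
```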


For translation, batch translation of the full transcription file seems to be most accurate. Unfortunately, the most accurate methods either require manual entry or are not free; I have yet to find one API that is both automated and free.

My low-tech solution is to copy and paste the contents of the file into either an AI assistant like ChatGPT or DeepSeek, or into Google Translate.

ChatGPT and DeepSeek have proven more accurate so far, though they may struggle with larger files.
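
One way to work around the large-file problem is to split the transcription into smaller pieces on line boundaries and paste them in one at a time. The sketch below is a rough take on that idea. The function name ```split_transcript``` is hypothetical, and the 4000-character default is an arbitrary assumption, not a documented limit of any of these tools.

```python
from typing import List

def split_transcript(text: str, max_chars: int = 4000) -> List[str]:
    """Split text into chunks of at most max_chars, breaking only between lines.

    A single line longer than max_chars is kept whole as its own chunk.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines():
        # +1 accounts for the newline joining this line to the previous one
        added = len(line) + (1 if current else 0)
        if current and size + added > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
            added = len(line)
        current.append(line)
        size += added
    if current:
        chunks.append("\n".join(current))
    return chunks

# Made-up transcript lines in the same style as the saved file
demo = "\n".join(f"[{i}m0s - {i}m30s]: line {i}" for i in range(50))
pieces = split_transcript(demo, max_chars=200)
print(f"{len(pieces)} chunks, longest is {max(len(p) for p in pieces)} characters")
```

Because the split only happens between lines, each timestamped segment stays intact, which makes it easier to stitch the translated pieces back together in order.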