<a href="https://colab.research.google.com/github/Hana19951208/youtube-speaker-diarization/blob/master/YouTube_Speaker_Diarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# YouTube Speaker Diarization Pipeline

This notebook provides an end-to-end pipeline for:
1. Downloading audio from YouTube videos
2. Transcribing speech using WhisperX
3. Performing speaker diarization using PyAnnote
4. Identifying target speakers using reference audio
5. Generating SRT subtitles with speaker labels

## Setup Instructions

### 1. Configure HuggingFace Token
You need a HuggingFace token to access PyAnnote's speaker diarization model.

1. Go to https://huggingface.co/settings/tokens
2. Create a new token with 'read' access
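3. Accept the model license at https://huggingface.co/pyannote/speaker-diarization-3.1 (this model also depends on https://huggingface.co/pyannote/segmentation-3.0, whose conditions must be accepted separately)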
4. Enter your token below:

In [7]:
# Set your HuggingFace token here
HF_TOKEN = "hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # <-- Paste your HF token here; never commit or share a real token

# Set environment variable
import os
os.environ['HF_TOKEN'] = HF_TOKEN

print(f"HF_TOKEN set: {bool(HF_TOKEN)}")

HF_TOKEN set: True
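
Pasting a token into a cell makes it easy to leak when the notebook is shared. As a hedged alternative, the sketch below reads the token from the `HF_TOKEN` environment variable and only prompts (without echoing) when it is missing. The helper name `resolve_hf_token` is hypothetical and not part of this project.

```python
import getpass
import os


def resolve_hf_token(env=None, prompt=getpass.getpass):
    """Return a token from the HF_TOKEN environment variable, or prompt for one."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        # getpass hides the typed value, so it does not end up in the cell output
        token = prompt("HuggingFace token: ").strip()
    if not token:
        raise ValueError("No HuggingFace token provided")
    return token


# Usage in a notebook cell (prompts only when HF_TOKEN is not already set):
# HF_TOKEN = resolve_hf_token()
# os.environ["HF_TOKEN"] = HF_TOKEN
```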


### 2. Install Dependencies
Run the cell below to install all required packages:

In [8]:
# Install dependencies
!pip install -q torch torchaudio
!pip install -q yt-dlp ffmpeg-python pydub
!pip install -q demucs
!pip install -q whisperx
!pip install -q pyannote.audio
!pip install -q speechbrain scikit-learn

print("Dependencies installed successfully!")

### 3. Upload Reference Audio
Upload a reference audio file of the target speaker you want to identify:

In [1]:
from google.colab import files

print("Please upload a reference audio file (WAV or MP3) of the target speaker:")
uploaded = files.upload()

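if not uploaded:
    raise RuntimeError("No file was uploaded; re-run this cell and select a WAV or MP3 file.")

# files.upload() returns a dict keyed by filename; take the first uploaded file
ref_audio_path = next(iter(uploaded))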
print(f"Reference audio uploaded: {ref_audio_path}")

Please upload a reference audio file (WAV or MP3) of the target speaker:


Saving 祁同伟-同志们，这个项目事关重大，需要我们统筹兼顾，审慎推进。要充分考虑各方利益，把握好工作节奏。我建议先做个详细调研，听取基层同志的意见，再研究具体实施方案。-由微信公.mp3 to 祁同伟-同志们，这个项目事关重大，需要我们统筹兼顾，审慎推进。要充分考虑各方利益，把握好工作节奏。我建议先做个详细调研，听取基层同志的意见，再研究具体实施方案。-由微信公.mp3
Reference audio uploaded: 祁同伟-同志们，这个项目事关重大，需要我们统筹兼顾，审慎推进。要充分考虑各方利益，把握好工作节奏。我建议先做个详细调研，听取基层同志的意见，再研究具体实施方案。-由微信公.mp3


## Pipeline Configuration

Configure the pipeline parameters below:

In [3]:
# Configuration
CONFIG = {
    # YouTube URL to process
    "youtube_url": "https://www.youtube.com/watch?v=Zs8jUFaqtCI&list=PLCecIXiOoaWnlFxxe4eWa1a7EJ-BMXHzr",  # <-- Paste YouTube URL here

    # Language (set to None for auto-detect)
    "language": None,  # e.g., "en", "zh", "ja", etc.

    # Maximum number of speakers
    "max_speakers": 3,

    # WhisperX model size
    "whisper_model": "large-v3",  # Options: tiny, base, small, medium, large-v1, large-v2, large-v3

    # Processing options
    "do_separation": True,  # Perform vocal separation
    "do_vad": False,  # Apply voice activity detection
    "do_enhance": False,  # Apply audio enhancement

    # Speaker matching threshold
    "similarity_threshold": 0.25,

    # Output directory
    "output_dir": "./output",
}

# Print configuration
print("Pipeline Configuration:")
print("=" * 60)
for key, value in CONFIG.items():
    print(f"  {key}: {value}")
print("=" * 60)

Pipeline Configuration:
  youtube_url: https://www.youtube.com/watch?v=Zs8jUFaqtCI&list=PLCecIXiOoaWnlFxxe4eWa1a7EJ-BMXHzr
  language: None
  max_speakers: 3
  whisper_model: large-v3
  do_separation: True
  do_vad: False
  do_enhance: False
  similarity_threshold: 0.25
  output_dir: ./output
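
The notebook does not show how `similarity_threshold` is applied. One common approach, assumed here purely for illustration, compares the reference speaker's embedding with each diarized speaker's embedding by cosine similarity and accepts the best match only if it reaches the threshold. The function names and toy vectors below are hypothetical and are not taken from the `pipeline` module.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_target_speaker(ref_embedding, speaker_embeddings, threshold=0.25):
    """Return the label of the best-matching speaker, or None if no score reaches the threshold."""
    best_label, best_score = None, -1.0
    for label, embedding in speaker_embeddings.items():
        score = cosine_similarity(ref_embedding, embedding)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None


# Toy 3-dimensional embeddings; real speaker embeddings have far more dimensions.
speakers = {
    "SPEAKER_00": [0.9, 0.1, 0.0],
    "SPEAKER_01": [0.0, 1.0, 0.0],
}
print(match_target_speaker([1.0, 0.0, 0.0], speakers))  # prints SPEAKER_00
```

Under this assumption, raising the threshold makes the match stricter (fewer false positives, more misses), while lowering it does the opposite.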


## Run the Pipeline

Execute the cell below to run the complete pipeline:

In [4]:
# Initialize and run pipeline
from pipeline import YouTubeSpeakerPipeline

pipeline = YouTubeSpeakerPipeline(
    hf_token=HF_TOKEN,
    output_dir=CONFIG["output_dir"],
    whisper_model=CONFIG["whisper_model"],
    max_speakers=CONFIG["max_speakers"],
    do_separation=CONFIG["do_separation"],
    do_vad=CONFIG["do_vad"],
    do_enhance=CONFIG["do_enhance"],
    similarity_threshold=CONFIG["similarity_threshold"],
)

results = pipeline.process(
    youtube_url=CONFIG["youtube_url"],
    ref_audio_path=ref_audio_path,
    language=CONFIG["language"],
)

print("\n" + "=" * 60)
print("PROCESSING COMPLETE!")
print("=" * 60)

ModuleNotFoundError: No module named 'pipeline'
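
The `ModuleNotFoundError` above means Python could not find a `pipeline` module on its import path. The sketch below shows one way to check for the module before importing it, and to add a local checkout to `sys.path` if the module is found there. It assumes the module ships with this notebook's repository; the helper name `ensure_importable` and the checkout directory used in the usage note are hypothetical.

```python
import importlib
import importlib.util
import sys
from pathlib import Path


def ensure_importable(module_name, search_dir):
    """Return True if module_name can be imported, adding search_dir to sys.path if needed."""
    if importlib.util.find_spec(module_name) is not None:
        return True
    candidate = Path(search_dir)
    has_module_file = (candidate / f"{module_name}.py").is_file()
    has_package = (candidate / module_name / "__init__.py").is_file()
    if has_module_file or has_package:
        # Make the checkout visible to the import system and refresh its caches
        sys.path.insert(0, str(candidate.resolve()))
        importlib.invalidate_caches()
        return importlib.util.find_spec(module_name) is not None
    return False
```

In Colab, this would typically follow a clone of the repository named in the badge link, for example `!git clone https://github.com/Hana19951208/youtube-speaker-diarization.git`, and then `ensure_importable("pipeline", "youtube-speaker-diarization")`. Whether the repository actually exposes the module at its top level is an assumption to verify.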

## View Results

### Download Output Files

In [None]:
from google.colab import files
import os

# Find output files
output_files = [
    os.path.join(CONFIG["output_dir"], f)
    for f in os.listdir(CONFIG["output_dir"])
    if f.endswith((".srt", ".json"))
]

print("Output files available for download:")
for f in output_files:
    print(f"  - {os.path.basename(f)}")

# Download all files
for f in output_files:
    files.download(f)
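
Browsers may block or prompt on several downloads triggered in quick succession. A possible alternative, sketched here with a hypothetical helper `bundle_outputs` and a hypothetical archive name, is to pack the output files into one ZIP archive and download that single file.

```python
import os
import zipfile


def bundle_outputs(paths, archive_path):
    """Write the given files into a ZIP archive (flat layout) and return its path."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            # Store each file under its base name so the archive has no nested folders
            zf.write(path, arcname=os.path.basename(path))
    return archive_path


# In Colab, assuming output_files from the cell above:
# files.download(bundle_outputs(output_files, "results.zip"))
```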

### Preview SRT Content

In [None]:
# Find and display the SRT file
srt_files = [f for f in output_files if f.endswith('.srt')]

if srt_files:
    with open(srt_files[0], 'r', encoding='utf-8') as f:
        content = f.read()

    print("SRT Preview (first 3000 characters):")
    print("=" * 60)
    print(content[:3000])
    if len(content) > 3000:
        print("\n... (truncated)")
else:
    print("No SRT file found")
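
For reference when reading the preview, an SRT file is a sequence of numbered cues, each with a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range followed by the cue text and a blank line. How the pipeline embeds speaker labels is not shown in this notebook, so the sketch below assumes a `[SPEAKER] text` prefix and a hypothetical segment layout with `start`, `end`, `speaker`, and `text` keys.

```python
def format_timestamp(seconds):
    """Convert seconds (float) to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render segments as SRT cues with an assumed '[SPEAKER] text' label prefix."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}\n"
            f"[{seg['speaker']}] {seg['text']}\n"
        )
    # Cues are separated by a blank line
    return "\n".join(cues)


example = [{"start": 0.0, "end": 1.25, "speaker": "SPEAKER_00", "text": "Hello"}]
print(segments_to_srt(example))
```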

## Troubleshooting

### Common Issues

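1. **CUDA Out of Memory**: Try a smaller WhisperX model (e.g., `medium` instead of `large-v3`) by changing `whisper_model` in the configuration.
2. **HF_TOKEN Error**: Make sure you have accepted the PyAnnote model license and that `HF_TOKEN` is set to a valid token with 'read' access.
3. **FFmpeg Error**: Make sure FFmpeg is installed: `!apt-get install -y ffmpeg`
4. **YouTube Download Error**: Some videos may be blocked or require authentication.
5. **ModuleNotFoundError: No module named 'pipeline'**: The pipeline cell imports `YouTubeSpeakerPipeline` from a local `pipeline` module, so the project's source files must be present in the working directory (or on `sys.path`) before that cell is run.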
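
Related to item 4: the example URL in the configuration includes a `list=` parameter, which a downloader may interpret as a request for the whole playlist rather than one video. Whether the `pipeline` module handles this is not shown, so the following is only a sketch of one way to reduce a watch URL to its single-video form with the standard library. The helper name `single_video_url` is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters that tie a watch URL to a playlist
PLAYLIST_PARAMS = {"list", "index"}


def single_video_url(url):
    """Return the URL with playlist-related query parameters removed."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in PLAYLIST_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


print(single_video_url(
    "https://www.youtube.com/watch?v=Zs8jUFaqtCI&list=PLCecIXiOoaWnlFxxe4eWa1a7EJ-BMXHzr"
))
# prints https://www.youtube.com/watch?v=Zs8jUFaqtCI
```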