# 📘 Instructions: Using `collector.ipynb`

> This cell explains what this notebook does, what you need to run it, and the recommended workflow.

**What’s inside (at a glance):** This notebook downloads the audio track of a YouTube video, converts it to 16 kHz mono WAV, transcribes it with Whisper, cuts the audio into WAV snippets along the Whisper segment boundaries, and writes a CSV manifest describing the snippets.

## 1) Prerequisites
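- **Python:** 3.9+ recommended.
- **Key Python packages (import names):** `pandas`, `pydub`, `whisper`, `yt_dlp`
- **ffmpeg:** must be available on your `PATH`. The notebook calls it directly to cut segments, and `pydub` and Whisper rely on it to decode audio.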

## 2) Setup
1. (Optional) Create and activate a virtual environment.
2. Install dependencies:
   ```bash
   pip install pandas pydub openai-whisper yt-dlp
   ```
3. Set the video ID in the URL cell near the end of the notebook. Output is written to `downloads/` by default (the `outdir` parameter).
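
Because the notebook shells out to `ffmpeg`, a quick check before running the heavier cells can save a failed run. The following is a minimal sketch; `missing_tools` is a hypothetical helper name and is not part of the notebook.

```python
import shutil

def missing_tools(tools=("ffmpeg",)):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]

missing = missing_tools()
if missing:
    print(f"Missing command-line tools: {', '.join(missing)}")
else:
    print("All required command-line tools were found.")
```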

## 3) How to use this notebook
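1. **Configure** the video ID in the URL cell and, if needed, the Whisper model size and device.
2. **Download** the audio track from YouTube with `download_audio`.
3. **Convert and transcribe**: the audio is converted to 16 kHz mono WAV and transcribed with Whisper.
4. **Segment**: consecutive Whisper segments are merged until a chunk lasts at least `min_limit` seconds, then each chunk is cut with ffmpeg, padded by `pad` seconds on both sides.
5. **Export**: the per-segment metadata is written to a CSV manifest next to the `segments/` directory.

Steps 2 to 5 all run inside a single call to `youtube_to_segmented_wavs(url)`.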

**Notable functions in this notebook:**
```
convert_to_wav
download_audio
flush_accumulator
youtube_to_segmented_wavs
```

## 4) Tips for reliability & reproducibility
- Run cells in order (top → bottom).
- Keep configuration in one place and document parameter values.
- Version outputs (e.g., add timestamps to filenames).
- Log important steps and counts (downloaded files, segments created).
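
As one way to apply the last two tips, the sketch below adds a UTC timestamp to an output filename and logs a pair of counts. The helper names `versioned_path` and `log_counts`, the logger name, and the example path are illustrative assumptions and do not appear in the notebook.

```python
import logging
from datetime import datetime, timezone
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("collector")  # hypothetical logger name

def versioned_path(path, now=None):
    """Insert a UTC timestamp before the suffix: name.csv -> name_<stamp>.csv."""
    path = Path(path)
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return path.with_name(f"{path.stem}_{stamp}{path.suffix}")

def log_counts(downloaded, segments):
    """Record how many files were downloaded and how many segments were cut."""
    log.info("downloaded files: %d, segments created: %d", downloaded, segments)

print(versioned_path("downloads/segments_metadata_VIDEOID.csv"))
log_counts(downloaded=1, segments=42)
```

Pass an explicit timezone-aware `now` if several outputs of one run should share the same stamp.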

## 5) Outputs
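- **Audio segments**: `downloads/segments/<video_id>_segNNN.wav`, 16 kHz mono, one file per merged segment.
- **Manifest**: `downloads/segments_metadata_<video_id>.csv` with one row per segment and the columns `segment_filename`, `start`, `end`, `transcript`, `video_id`, `video_title`, `video_url`.
- **Transcripts**: stored in the `transcript` column of the manifest.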
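
A simple validation step is to confirm that every file named in the manifest exists on disk. The sketch below uses the standard `csv` module (pandas would work equally well) and relies on the `segment_filename` column written by the notebook. `missing_segments` is a hypothetical helper, and the demo files are created in a temporary directory only to make the example self-contained.

```python
import csv
import tempfile
from pathlib import Path

def missing_segments(csv_path, seg_dir):
    """Return segment filenames listed in the manifest but absent from seg_dir."""
    seg_dir = Path(seg_dir)
    with open(csv_path, newline="", encoding="utf-8") as fh:
        names = [row["segment_filename"] for row in csv.DictReader(fh)]
    return [name for name in names if not (seg_dir / name).is_file()]

# Demo with a throwaway manifest: one listed file exists, one does not.
with tempfile.TemporaryDirectory() as tmp:
    seg_dir = Path(tmp) / "segments"
    seg_dir.mkdir()
    (seg_dir / "demo_seg001.wav").write_bytes(b"")
    manifest = Path(tmp) / "segments_metadata_demo.csv"
    manifest.write_text(
        "segment_filename,start,end\n"
        "demo_seg001.wav,0.0,6.2\n"
        "demo_seg002.wav,6.2,12.9\n",
        encoding="utf-8",
    )
    print(missing_segments(manifest, seg_dir))  # ['demo_seg002.wav']
```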

In [1]:
### Imports
import os
import subprocess
import pandas as pd
import whisper
from pathlib import Path
from yt_dlp import YoutubeDL
from pydub import AudioSegment

In [3]:
### Utils:
def convert_to_wav(src_path: str, dst_path: str, sr: int = 16000):
    """Convert an audio file that pydub/ffmpeg can read to mono WAV at `sr` Hz."""
    AudioSegment.from_file(src_path) \
        .set_frame_rate(sr) \
        .set_channels(1) \
        .export(dst_path, format="wav")
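
To confirm that a converted file really is 16 kHz mono, the standard-library `wave` module can read the header back. This is a minimal sketch that assumes the file is uncompressed PCM WAV; `wav_format` is a hypothetical helper, and the demo writes one second of silence to a temporary file instead of using a real download.

```python
import os
import tempfile
import wave

def wav_format(path):
    """Return (sample_rate_hz, channels, duration_seconds) of a PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        return rate, wf.getnchannels(), wf.getnframes() / rate

# Demo: one second of 16-bit silence at 16 kHz, mono.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.wav")
    with wave.open(demo, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(16000)
        wf.writeframes(b"\x00\x00" * 16000)
    print(wav_format(demo))  # (16000, 1, 1.0)
```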

In [None]:
# 1. Download the audio-only stream (no subtitles are requested)
def download_audio(url: str, outdir: str = 'downloads'):
    """Download the best available audio stream for `url` into `outdir`.

    Returns (path of the downloaded file, yt-dlp info dict).
    """
    os.makedirs(outdir, exist_ok=True)

    ydl_opts = {
        'format': 'bestaudio/best',
        'outtmpl': f'{outdir}/%(id)s.%(ext)s',
        'quiet': True,
    }

    print(f"[download] audio: {ydl_opts['format']}")

    # Download, then ask yt-dlp for the filename it actually wrote
    with YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=True)
        audio_file = ydl.prepare_filename(info)

    return str(audio_file), info


In [None]:
# Whisper configuration; see the Whisper documentation for details: https://github.com/openai/whisper
model_size = "medium.en"
device = "cpu"

# load the Whisper model
whisper_model = whisper.load_model(model_size, device=device)

In [None]:
def youtube_to_segmented_wavs(url, lang='en', outdir='downloads', min_limit=6, pad=0.05):
    """Download a video's audio, transcribe it, and cut it into WAV segments.

    Whisper segments are merged until a chunk lasts at least `min_limit` seconds;
    each chunk is cut with `pad` seconds of extra audio on both sides.
    Returns (segment metadata DataFrame, segment directory, CSV path).
    """
    # 1. download, then convert to 16 kHz mono WAV
    audio_src, info = download_audio(url, outdir)
    wav_path = Path(outdir) / f"{info['id']}.wav"
    convert_to_wav(audio_src, str(wav_path), sr=16000)

    # 2. whisper transcript; the result is a dict with "text" and "segments",
    #    and each segment is a dict with "start", "end" (seconds) and "text"
    stt_result = whisper.transcribe(whisper_model, str(wav_path), language=lang)
    segments = stt_result["segments"]
    print(stt_result["text"])

    # 3. slice with ffmpeg
    seg_dir = Path(outdir) / 'segments'
    seg_dir.mkdir(parents=True, exist_ok=True)
    rows = []

    segment_counter = 1
    acc_seg = None

    def flush_accumulator(acc, counter):
        """
        Given acc = {'start', 'end', 'text'}, cut with ffmpeg (using pad),
        append one row to `rows`, and increment counter.

        Returns: updated counter (counter+1).
        """
        start_ts = acc['start']
        end_ts   = acc['end']
        transcript = acc['text']

        # padded cut times (clamped at 0 on the left)
        padded_start = max(0, start_ts - pad)
        padded_end   = end_ts + pad

        out_file = seg_dir / f"{info['id']}_seg{counter:03d}.wav"
        subprocess.run([
            'ffmpeg', '-y',
            '-i', str(wav_path),
            '-ss', f"{padded_start:.3f}",
            '-to', f"{padded_end:.3f}",
            '-ar', '16000',
            '-ac', '1',
            str(out_file)
        ], check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

        rows.append({
            'segment_filename': out_file.name,
            'start': start_ts,
            'end': end_ts,
            'transcript': transcript,
            'video_id': info.get('id'),
            'video_title': info.get('title'),
            'video_url': info.get('webpage_url'),
        })
        return counter + 1

    # 4. iterate through Whisper segments, merging below-limit ones
    for seg in segments:
        text = seg['text'].strip()
        if acc_seg is None:
            # start a new chunk
            acc_seg = {'start': seg['start'], 'end': seg['end'], 'text': text}
        else:
            # extend the current chunk
            acc_seg['end']  = seg['end']
            acc_seg['text'] = f"{acc_seg['text']} {text}"

        # flush as soon as the chunk is long enough (also when a single
        # segment already meets the limit on its own)
        if (acc_seg['end'] - acc_seg['start']) >= min_limit:
            segment_counter = flush_accumulator(acc_seg, segment_counter)
            acc_seg = None

    # 5. After loop, if there's still leftover acc_seg, flush it regardless of duration
    if acc_seg is not None:
        segment_counter = flush_accumulator(acc_seg, segment_counter)
        acc_seg = None

    # 6. Write CSV and return
    df = pd.DataFrame(rows)
    csv_path = Path(outdir) / f"segments_metadata_{info['id']}.csv"
    df.to_csv(csv_path, index=False)
    return df, seg_dir, csv_path
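
The merging step can be tried in isolation, without downloading or transcribing anything. The sketch below is a standalone illustration on hand-written segments shaped like Whisper's segment dicts (`start`, `end`, `text`). `merge_segments` is a hypothetical helper, and this version assumes that a segment which already meets `min_limit` is emitted on its own.

```python
def merge_segments(segments, min_limit=6.0):
    """Merge consecutive segments until each chunk lasts at least min_limit seconds."""
    merged, acc = [], None
    for seg in segments:
        text = seg["text"].strip()
        if acc is None:
            # start a new chunk
            acc = {"start": seg["start"], "end": seg["end"], "text": text}
        else:
            # extend the current chunk
            acc["end"] = seg["end"]
            acc["text"] = f"{acc['text']} {text}"
        if acc["end"] - acc["start"] >= min_limit:
            merged.append(acc)
            acc = None
    if acc is not None:
        # the leftover chunk is kept even if it is shorter than min_limit
        merged.append(acc)
    return merged

demo = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 6.5, "text": " This is a test."},
    {"start": 6.5, "end": 14.0, "text": " A long segment stands alone."},
    {"start": 14.0, "end": 15.0, "text": " Bye."},
]
for chunk in merge_segments(demo):
    print(chunk)
```

With the demo input, the first two segments merge into one chunk, the third is long enough by itself, and the short final segment is kept as a leftover.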
    

In [None]:
# Insert the ID of the YouTube video you want to process (the part after `v=` in its URL):
YOUR_VIDEO_ID = "put_your_video_id_here"
url = f"https://www.youtube.com/watch?v={YOUR_VIDEO_ID}"

In [None]:
df, seg_dir, csv_path = youtube_to_segmented_wavs(url=url)
df.head()