### This notebook focuses on a roll call to see how it is transcribed

Recognizing short words spoken by different speakers is difficult. This notebook focuses on a roll call vote to see whether changing model parameters can improve the transcription.


In [5]:
import sys
import pandas as pd
sys.path.append("../")
from pathlib import Path

### Use ffmpeg to extract a section of a meeting
This 30-second clip is a roll call vote.

In [6]:
import subprocess
from pathlib import Path

# Input and output file paths
input_file = Path("../data/video/regular_council_meeting___2025_02_26.mp4")
clip_file = Path("../data/video/regular_council_meeting___2025_02_26_clip_4-50_to_5-20.mp4")

# Parameters for clip extraction
start_time = "4:50"
duration = "30"  # 30 seconds

# Run FFmpeg command
result = subprocess.run(
    [
        "ffmpeg",
        "-y",  # Overwrite the output if it exists (global option, so it goes first)
        "-i",
        str(input_file),
        "-ss",
        start_time,
        "-t",
        duration,
        "-c",
        "copy",  # Stream copy: fast, but the cut might not be frame accurate
        "-avoid_negative_ts",
        "1",
        str(clip_file),
    ],
    capture_output=True,
    text=True,
)

# Check if command was successful
if result.returncode == 0:
    print(f"Clip successfully extracted to: {clip_file}")
else:
    print(f"Error extracting clip: {result.stderr}")

Clip successfully extracted to: ../data/video/regular_council_meeting___2025_02_26_clip_4-50_to_5-20.mp4
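
If the stream-copied clip starts or ends slightly off, one option is to re-encode instead of copying. The sketch below only builds the argument list. `build_clip_command` and the file names are hypothetical, and it assumes an ffmpeg build that includes the `libx264` and `aac` encoders.

```python
from pathlib import Path


def build_clip_command(src: Path, dst: Path, start: str, seconds: int) -> list[str]:
    """Build an ffmpeg argument list that re-encodes instead of stream-copying."""
    return [
        "ffmpeg",
        "-y",                # overwrite the output if it exists
        "-ss", start,        # seek on the input (placed before -i)
        "-i", str(src),
        "-t", str(seconds),  # clip length in seconds
        "-c:v", "libx264",   # re-encode video so the cut need not land on a keyframe
        "-c:a", "aac",       # re-encode audio as well
        str(dst),
    ]


cmd = build_clip_command(
    Path("meeting.mp4"),       # hypothetical input file
    Path("meeting_clip.mp4"),  # hypothetical output file
    start="4:50",
    seconds=30,
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` the same way as the command above. Re-encoding is slower than `-c copy`, which probably does not matter for a 30-second clip.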


### Experiment with model parameters

Using these settings actually made the results worse:
- min_speakers=3,  # Specify at least 3 speakers
- max_speakers=15,  # Limit to at most 15 speakers
- diarize_min_duration=0.1,  # Shorter minimum segment duration

I also tested the medium and large model sizes, but the results were the same as with tiny.
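
To put a number on "worse" or "the same", one possible approach is to hand-transcribe the roll call and compute a word error rate against each run. This is a minimal sketch, not something this notebook does: `word_error_rate` is a hypothetical helper, the example strings are made up, and punctuation is not stripped.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # reference word missing from hypothesis
                    curr[j - 1] + 1,     # extra word in hypothesis
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one vote misheard as "i", one vote dropped.
print(word_error_rate("aye aye nay aye", "aye i nay"))  # 0.5
```

Because roll call responses are single short words, every misheard or dropped vote moves this number noticeably, which makes it easy to compare parameter settings.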


In [7]:
from src.videos import transcribe_video_with_diarization

transcription_dir = Path("../data/transcripts")

transcript_data = await transcribe_video_with_diarization(
    clip_file,
    transcription_dir,
    model_size="tiny",
)

INFO:src.videos:Transcribing video with speaker diarization: ../data/video/regular_council_meeting___2025_02_26_clip_4-50_to_5-20.mp4
INFO:src.videos:Output will be saved to: ../data/transcripts/regular_council_meeting___2025_02_26_clip_4-50_to_5-20.diarized.json
INFO:src.huggingface:Auto-detected device: cpu
INFO:src.huggingface:Auto-selected compute_type: int8
INFO:src.huggingface:Loading WhisperX model: tiny on cpu with int8 precision


tokenizer.json:   0%|          | 0.00/2.20M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/2.25k [00:00<?, ?B/s]

vocabulary.txt:   0%|          | 0.00/460k [00:00<?, ?B/s]

model.bin:   0%|          | 0.00/75.5M [00:00<?, ?B/s]

Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.5.0.post0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../../../Library/Caches/pypoetry/virtualenvs/tgov_scraper-zRR99ne3-py3.11/lib/python3.11/site-packages/whisperx/assets/pytorch_model.bin`
INFO:src.huggingface:Loading diarization pipeline


No language specified, language will be first be detected for each audio file (increases inference time).
>>Performing voice activity detection using Pyannote...
Model was trained with pyannote.audio 0.0.1, yours is 3.3.2. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.4.1. Bad things might happen unless you revert torch to 1.x.


INFO:src.huggingface:WhisperX model loaded in 4.50 seconds
INFO:src.videos:Running initial transcription with batch size 8...


Detected language: en (0.99) in first 30s of audio...


INFO:src.videos:Detected language: en
INFO:src.videos:Loading alignment model for detected language: en
INFO:src.videos:Aligning transcription with audio...
INFO:src.videos:Running speaker diarization...
  std = sequences.std(dim=-1, correction=1)
INFO:src.videos:Assigning speakers to transcription...
INFO:src.videos:Processing transcription segments...
INFO:src.videos:Diarized transcription completed in 30.03 seconds
INFO:src.videos:Detailed JSON saved to: ../data/transcripts/regular_council_meeting___2025_02_26_clip_4-50_to_5-20.diarized.json


In [8]:
def format_timestamp(seconds: float) -> str:
    """Convert seconds to HH:MM:SS format"""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


from ipywidgets import HTML, VBox, Layout
from textwrap import fill

# Create formatted HTML output
html_output = ["<h3>Meeting Script</h3>"]
html_output.append("<hr>")

current_speaker = None
current_text = []
current_start = None

for segment in transcript_data["segments"]:
    if current_speaker != segment["speaker"]:
        # Output previous speaker's text
        if current_speaker:
            timestamp = format_timestamp(current_start)
            wrapped_text = fill(" ".join(current_text), width=80)
            html_output.append(f"<p><b>[{timestamp}] {current_speaker}:</b><br>")
            html_output.append(f"{wrapped_text}</p>")
            html_output.append("<hr>")

        # Start new speaker
        current_speaker = segment["speaker"]
        current_text = [segment["text"].strip()]
        current_start = segment["start"]
    else:
        # Continue current speaker
        current_text.append(segment["text"].strip())

# Output final speaker
if current_speaker:
    timestamp = format_timestamp(current_start)
    wrapped_text = fill(" ".join(current_text), width=80)
    html_output.append(f"<p><b>[{timestamp}] {current_speaker}:</b><br>")
    html_output.append(f"{wrapped_text}</p>")
    html_output.append("<hr>")

# Display formatted output
display(
    HTML(
        value="".join(html_output),
        layout=Layout(width="100%", border="1px solid gray", padding="10px"),
    )
)

HTML(value='<h3>Meeting Script</h3><hr><p><b>[00:00:00] SPEAKER_01:</b><br>Thank you, Mr. Huffinds. Any counci…
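
The loop in the cell above merges consecutive segments from the same speaker and repeats the output code once more after the loop. A shorter way to get the same grouping, sketched here with `itertools.groupby`, is shown below. `merge_speaker_turns` is a hypothetical helper, the sample segments are made up, and it assumes every segment has `speaker`, `start`, and `text` keys.

```python
from itertools import groupby


def merge_speaker_turns(segments: list[dict]) -> list[dict]:
    """Merge runs of consecutive segments that share the same speaker."""
    turns = []
    for speaker, run in groupby(segments, key=lambda seg: seg["speaker"]):
        run = list(run)
        turns.append(
            {
                "speaker": speaker,
                "start": run[0]["start"],  # timestamp of the first segment in the run
                "text": " ".join(seg["text"].strip() for seg in run),
            }
        )
    return turns


# Made-up segments in the same shape as transcript_data["segments"]
sample = [
    {"speaker": "SPEAKER_01", "start": 0.0, "text": " Roll call, please."},
    {"speaker": "SPEAKER_00", "start": 2.5, "text": " Aye."},
    {"speaker": "SPEAKER_00", "start": 3.1, "text": " Aye."},
    {"speaker": "SPEAKER_01", "start": 4.0, "text": " Thank you."},
]

for turn in merge_speaker_turns(sample):
    print(turn)
```

Each returned turn can then be rendered with `format_timestamp` and the same HTML template, without the duplicated "final speaker" block.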