LAB M.107  Whisper STT Implementation
CINDY LUND

This notebook implements a complete speech-to-text pipeline using OpenAI Whisper, including:
- basic, prompted, and unprompted transcription
- audio chunking for long recordings
- timestamp extraction with chunk offsets
- exports to TXT, JSON, and SRT formats

In [None]:
%pip install openai pydub audioop-lts


Note: you may need to restart the kernel to use updated packages.


In [27]:
# Set up directories for audio and transcripts

import os
os.makedirs("audio", exist_ok=True)
os.makedirs("transcripts", exist_ok=True)

In [3]:
from openai import OpenAI
from pydub import AudioSegment

In [4]:
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


In [5]:
print("API key loaded:", bool(os.getenv("OPENAI_API_KEY")))

API key loaded: True
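
The check above only prints True or False. A minimal sketch of a fail-fast variant follows, assuming a hypothetical helper named `require_env` (not part of the OpenAI SDK) that stops the notebook with a clear message when the variable is missing.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty.

    `require_env` is a hypothetical helper for this notebook, not an SDK function.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; export it before creating the client")
    return value


# Possible usage in the client cell (left commented out so this sketch has no side effects):
# client = OpenAI(api_key=require_env("OPENAI_API_KEY"))
```

Raising early keeps a missing key from surfacing later as an authentication error in the middle of a long transcription run.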


In [17]:
import os
print(os.listdir("audio"))

['ME025.mp3', 'not used - no difference between guided and unguided']


In [28]:
from pathlib import Path
from pydub import AudioSegment

audio_dir = Path("audio")
audio_path = audio_dir / "ME025.mp3"

if not audio_path.exists():
    raise FileNotFoundError(f"❌ {audio_path.name} not found in ./audio")

audio = AudioSegment.from_file(audio_path)

print(f"✅ Using meeting sample: {audio_path.name}")
print(f"⏱ Duration: {audio.duration_seconds:.1f} seconds")
print(f"📦 Channels: {audio.channels}")
print(f"🎚 Frame rate: {audio.frame_rate} Hz")


✅ Using meeting sample: ME025.mp3
⏱ Duration: 100.0 seconds
📦 Channels: 2
🎚 Frame rate: 44100 Hz


In [None]:
#Basic Transcription (without chunking)


In [None]:
# Step 3: Basic transcription without a prompt (unguided approach)
import os
from pathlib import Path
from openai import OpenAI
from pydub import AudioSegment

# OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Load meeting audio
audio_path = Path("audio") / "ME025.mp3"
audio = AudioSegment.from_file(audio_path)

# Take first 30 seconds (30,000 ms)
preview = audio[:30_000]

# Export preview clip to a temporary WAV file (Whisper-friendly)
preview_path = Path("audio") / "preview_30s.wav"
preview.export(preview_path, format="wav")

print("🤖 Transcribing the first 30 seconds (no chunking, no prompt)...")

with open(preview_path, "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f
    )

print("\n📝 Transcription (first 30 seconds):")
print("-" * 40)
print(transcript.text)


🤖 Transcribing the first 30 seconds (no chunking, no prompt)...

📝 Transcription (first 30 seconds):
----------------------------------------
Well, how do you go about making a small rowboat? We just make the small scale model and draft it from that. Make a keel out. You make a scale model first? Most everybody does, make a scale model. Or else they draft them out, draw them out on paper. Either one you want to, it doesn't matter. How big are these scale models? A general rule on small type.


In [30]:
# Step 4: Transcription with a prompt (guided approach)
import os
from pathlib import Path
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

audio_path = Path("audio") / "ME025.mp3"
if not audio_path.exists():
    raise FileNotFoundError("❌ ME025.mp3 not found")

prompt_text = (
    "This is a discussion about building a small rowboat. "
    "Topics include making a scale model, drafting plans on paper, "
    "and boat parts such as the keel. "
    "Transcribe clearly with proper punctuation and sentence boundaries."
)

print("🤖 Step 4: Prompted transcription...")

with open(audio_path, "rb") as f:
    prompted = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        prompt=prompt_text
    )

prompted_text = prompted.text.strip()

Path("transcripts").mkdir(exist_ok=True)
step4_path = Path("transcripts") / "step4_ME025_prompted.txt"
step4_path.write_text(prompted_text + "\n", encoding="utf-8")

print("✅ Step 4 saved to:", step4_path)


🤖 Step 4: Prompted transcription...
✅ Step 4 saved to: transcripts\step4_ME025_prompted.txt


In [33]:
#Step 5 Transcription without prompts (unguided approach)
import os
from pathlib import Path
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

audio_path = Path("audio") / "ME025.mp3"
if not audio_path.exists():
    raise FileNotFoundError("❌ ME025.mp3 not found")

print("🤖 Step 5: Unprompted transcription...")

with open(audio_path, "rb") as f:
    unprompted = client.audio.transcriptions.create(
        model="whisper-1",
        file=f
    )

unprompted_text = unprompted.text.strip()

step5_path = Path("transcripts") / "step5_ME025_unprompted.txt"
step5_path.write_text(unprompted_text + "\n", encoding="utf-8")

print("✅ Step 5 saved to:", step5_path)


🤖 Step 5: Unprompted transcription...
✅ Step 5 saved to: transcripts\step5_ME025_unprompted.txt


In [34]:
# Comparison between prompted (Step 4) and unprompted (Step 5) transcripts
from pathlib import Path

step4_text = Path("transcripts/step4_ME025_prompted.txt").read_text(encoding="utf-8")
step5_text = Path("transcripts/step5_ME025_unprompted.txt").read_text(encoding="utf-8")

print("🔹 STEP 4 - PROMPTED")
print("-" * 60)
print(step4_text)

print("\n🔹 STEP 5 - UNPROMPTED")
print("-" * 60)
print(step5_text)


🔹 STEP 4 - PROMPTED
------------------------------------------------------------
How do you go about making a small rowboat? We just make the small scale model and draft it from that. Make a keel out. You make a scale model first? Most everybody does, make a scale model. Or else they draft them out, draw them out on paper. Either one you want to, it doesn't matter. How big are these scale models? The general rule on small type boat is 3 quarter inch to a foot. The large ones are up to a quarter inch to a foot. And what's the purpose of the scale model? The length and the width and all this. Oh I see, they just use smaller everything and then they just scale them up and down. And then how do you go about deciding to build the boat itself? Well you make the keel first from the model, from the drafting, drawing, whatever it is. Then you make the stem and the stern. And for the small stuff, the small boats where you bend the frame, you make a mold, what we call a mold, there's section
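
Reading two long transcripts side by side makes small differences easy to miss. The sketch below is one possible way to surface them, using the standard-library `difflib` module to compare the texts word by word. The function name `word_diff` is hypothetical, and the two sample sentences are the openings of the prompted transcript above and the unprompted 30-second preview from Step 3.

```python
import difflib


def word_diff(a: str, b: str):
    """Compare two transcripts word by word.

    Returns (similarity_ratio, changes), where each change is a tuple
    (operation, words_from_a, words_from_b) for every non-matching span.
    """
    a_words, b_words = a.split(), b.split()
    matcher = difflib.SequenceMatcher(a=a_words, b=b_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, " ".join(a_words[i1:i2]), " ".join(b_words[j1:j2])))
    return matcher.ratio(), changes


# Opening sentence of each transcript, copied from the outputs in this notebook
prompted_start = "How do you go about making a small rowboat?"
unprompted_start = "Well, how do you go about making a small rowboat?"

ratio, changes = word_diff(prompted_start, unprompted_start)
print(f"similarity: {ratio:.2f}")
for op, old, new in changes:
    print(f"{op}: {old!r} -> {new!r}")
```

In the notebook, the same function could be called on `step4_text` and `step5_text` to list every span where the two runs disagree. The comparison is case-sensitive and treats punctuation as part of a word, so normalizing both texts first may be preferable.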

In [35]:
#Step 6 Implementing chunking for long audio files
from pathlib import Path
from pydub import AudioSegment

# Input audio file
audio_path = Path("audio") / "Podcast.mp3"
if not audio_path.exists():
    raise FileNotFoundError("❌ Podcast.mp3 not found in ./audio")

# Output directory for chunks
chunks_dir = Path("audio") / "chunks"
chunks_dir.mkdir(parents=True, exist_ok=True)

# Chunk length: 10 minutes (in milliseconds)
chunk_length_ms = 10 * 60 * 1000  # 600,000 ms

# Load audio
audio = AudioSegment.from_file(audio_path)
total_duration_ms = len(audio)

print(f"✅ Loaded audio: {audio_path.name}")
print(f"⏱ Total duration: {total_duration_ms / 1000 / 60:.2f} minutes")

# Split into chunks
chunk_paths = []
chunk_number = 1

for start_ms in range(0, total_duration_ms, chunk_length_ms):
    end_ms = min(start_ms + chunk_length_ms, total_duration_ms)
    chunk = audio[start_ms:end_ms]

    chunk_filename = (
        f"Podcast_chunk_{chunk_number:03d}_"
        f"{start_ms//1000}s_to_{end_ms//1000}s.wav"
    )
    chunk_path = chunks_dir / chunk_filename

    chunk.export(chunk_path, format="wav")
    chunk_paths.append(chunk_path)

    chunk_number += 1

print(f"\n🔪 Created {len(chunk_paths)} chunk(s) in '{chunks_dir}'")

# Checkpoint verification: list chunks and durations
print("\n📦 Chunk verification:")
for path in chunk_paths:
    c = AudioSegment.from_file(path)
    print(f" - {path.name} | {c.duration_seconds:.1f} seconds")


✅ Loaded audio: Podcast.mp3
⏱ Total duration: 28.09 minutes

🔪 Created 3 chunk(s) in 'audio\chunks'

📦 Chunk verification:
 - Podcast_chunk_001_0s_to_600s.wav | 600.0 seconds
 - Podcast_chunk_002_600s_to_1200s.wav | 600.0 seconds
 - Podcast_chunk_003_1200s_to_1685s.wav | 485.2 seconds
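
The chunk boundaries depend only on the total duration and the chunk length, so they can be checked without loading any audio. The sketch below isolates that arithmetic in a hypothetical helper named `chunk_bounds`. The total of 1,685,200 ms is an approximation inferred from the chunk durations printed above, not a value read from the file.

```python
def chunk_bounds(total_ms: int, chunk_ms: int) -> list[tuple[int, int]]:
    """Return (start_ms, end_ms) pairs that cover [0, total_ms) in fixed-size steps.

    The last pair is shorter when total_ms is not a multiple of chunk_ms.
    """
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return [
        (start, min(start + chunk_ms, total_ms))
        for start in range(0, total_ms, chunk_ms)
    ]


# Approximate podcast length (assumed: 600 s + 600 s + 485.2 s) with 10-minute chunks
for start_ms, end_ms in chunk_bounds(1_685_200, 10 * 60 * 1000):
    print(f"{start_ms // 1000}s to {end_ms // 1000}s")
```

Because each chunk starts exactly where the previous one ends, a word spoken across a boundary may be split between two chunks. Overlapping the chunks slightly is a common mitigation, but it is not what this notebook does.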


In [None]:
# Step 6 (revised): downsample each chunk to 16 kHz mono 16-bit so the files stay under the 25MB limit for Whisper
from pathlib import Path
from pydub import AudioSegment
import os

audio_path = Path("audio") / "Podcast.mp3"
chunks_dir = Path("audio") / "chunks"
chunks_dir.mkdir(parents=True, exist_ok=True)

# 10 minutes
chunk_length_ms = 10 * 60 * 1000

audio = AudioSegment.from_file(audio_path)
total_ms = len(audio)

print(f"✅ Loaded: {audio_path.name}")
print(f"⏱ Total duration: {total_ms/1000/60:.2f} minutes")

chunk_paths = []
chunk_number = 1

for start_ms in range(0, total_ms, chunk_length_ms):
    end_ms = min(start_ms + chunk_length_ms, total_ms)
    chunk = audio[start_ms:end_ms]

    # 🔑 Downsample to keep file size < 25MB
    chunk = chunk.set_frame_rate(16000).set_channels(1).set_sample_width(2)

    chunk_filename = f"Podcast_chunk_{chunk_number:03d}_{start_ms//1000}s_to_{end_ms//1000}s.wav"
    chunk_path = chunks_dir / chunk_filename
    chunk.export(chunk_path, format="wav")

    size_mb = os.path.getsize(chunk_path) / (1024 * 1024)
    print(f" - created {chunk_path.name} | {chunk.duration_seconds:.1f}s | {size_mb:.2f} MB")

    chunk_paths.append(chunk_path)
    chunk_number += 1

print(f"\n✅ Created {len(chunk_paths)} chunk(s) in {chunks_dir}")



✅ Loaded: Podcast.mp3
⏱ Total duration: 28.09 minutes
 - created Podcast_chunk_001_0s_to_600s.wav | 600.0s | 18.31 MB
 - created Podcast_chunk_002_600s_to_1200s.wav | 600.0s | 18.31 MB
 - created Podcast_chunk_003_1200s_to_1685s.wav | 485.2s | 14.81 MB

✅ Created 3 chunk(s) in audio\chunks
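
The file sizes above follow directly from the audio format: uncompressed WAV stores frame rate × channels × sample width bytes per second. The sketch below works through that arithmetic. The function names are hypothetical, the 44-byte WAV header is ignored, and the 25MB limit is treated as 25 × 1024 × 1024 bytes, which is an assumption about how the limit is counted.

```python
def pcm_bytes_per_second(frame_rate: int, channels: int, sample_width_bytes: int) -> int:
    """Bytes of uncompressed PCM audio produced per second."""
    return frame_rate * channels * sample_width_bytes


def pcm_size_mb(duration_s: float, frame_rate: int, channels: int, sample_width_bytes: int) -> float:
    """Approximate WAV payload size in MB (1 MB = 1024 * 1024 bytes, header ignored)."""
    total_bytes = duration_s * pcm_bytes_per_second(frame_rate, channels, sample_width_bytes)
    return total_bytes / (1024 * 1024)


def max_chunk_seconds(limit_bytes: int, frame_rate: int, channels: int, sample_width_bytes: int) -> float:
    """Longest chunk, in seconds, whose PCM payload fits in limit_bytes."""
    return limit_bytes / pcm_bytes_per_second(frame_rate, channels, sample_width_bytes)


LIMIT_BYTES = 25 * 1024 * 1024  # assumed interpretation of the 25MB limit

# 16 kHz, mono, 16-bit: the format used for the downsampled chunks
print(f"{pcm_size_mb(600, 16000, 1, 2):.2f} MB per 10-minute chunk")
print(f"{max_chunk_seconds(LIMIT_BYTES, 16000, 1, 2):.1f} s maximum chunk length")

# 44.1 kHz, stereo, 16-bit (sample width assumed): the source format before downsampling
print(f"{pcm_size_mb(600, 44100, 2, 2):.2f} MB per 10-minute chunk")
```

At 16 kHz mono 16-bit, a 600-second chunk comes to about 18.31 MB, which matches the sizes printed above, and a chunk could run to roughly 819 seconds before reaching the assumed limit. The same 10 minutes at 44.1 kHz stereo would be about 100 MB, which is why the downsampling step matters.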


In [39]:
#Step 7 Transcribing chunks with timestamps
import os
import re
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

chunks_dir = Path("audio") / "chunks"
chunk_files = sorted(chunks_dir.glob("Podcast_chunk_*.wav"))

if not chunk_files:
    raise FileNotFoundError("❌ No chunk files found in audio/chunks")

def parse_chunk_offset_seconds(filename: str) -> int:
    """
    Extract the start offset in seconds from filenames like:
    Podcast_chunk_002_600s_to_1200s.wav  -> 600
    """
    m = re.search(r"_(\d+)s_to_(\d+)s\.wav$", filename)
    if not m:
        raise ValueError(f"Could not parse offset from filename: {filename}")
    return int(m.group(1))

all_segments = []
combined_text_parts = []

print(f"🤖 Transcribing {len(chunk_files)} chunks with timestamps...\n")

for chunk_path in chunk_files:
    offset_s = parse_chunk_offset_seconds(chunk_path.name)
    print(f"➡️ {chunk_path.name} (offset {offset_s}s)")

    with open(chunk_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            response_format="verbose_json",
            timestamp_granularities=["segment"]  # segment-level timestamps
        )

    # Save the chunk's full text (optional but useful)
    combined_text_parts.append(transcript.text.strip())

    # Adjust segment timestamps by adding the chunk offset
    if hasattr(transcript, "segments") and transcript.segments:
        for seg in transcript.segments:
            all_segments.append({
                "start": float(seg.start) + offset_s,
                "end": float(seg.end) + offset_s,
                "text": seg.text.strip(),
                "chunk_file": chunk_path.name
            })

print("\n✅ Done transcribing all chunks.")

# Sort segments by time (just to be safe)
all_segments.sort(key=lambda x: x["start"])

# Create a combined transcript text with timestamps (human-readable)
def format_time(seconds: float) -> str:
    # hh:mm:ss
    seconds = int(round(seconds))
    h = seconds // 3600
    m = (seconds % 3600) // 60
    s = seconds % 60
    return f"{h:02d}:{m:02d}:{s:02d}"

lines = []
for seg in all_segments:
    lines.append(f"[{format_time(seg['start'])} - {format_time(seg['end'])}] {seg['text']}")

combined_timestamped_text = "\n".join(lines)

# Save outputs (these are part of Step 7 + help for Step 8)
Path("transcripts").mkdir(exist_ok=True)

json_path = Path("transcripts") / "step7_podcast_segments_with_timestamps.json"
txt_path = Path("transcripts") / "step7_podcast_timestamped.txt"
full_txt_path = Path("transcripts") / "step7_podcast_full_text.txt"

json_path.write_text(json.dumps(all_segments, indent=2, ensure_ascii=False), encoding="utf-8")
txt_path.write_text(combined_timestamped_text, encoding="utf-8")
full_txt_path.write_text("\n\n".join(combined_text_parts), encoding="utf-8")

print("\n✅ Saved:")
print(f" - {json_path}")
print(f" - {txt_path}")
print(f" - {full_txt_path}")


🤖 Transcribing 3 chunks with timestamps...

➡️ Podcast_chunk_001_0s_to_600s.wav (offset 0s)
➡️ Podcast_chunk_002_600s_to_1200s.wav (offset 600s)
➡️ Podcast_chunk_003_1200s_to_1685s.wav (offset 1200s)

✅ Done transcribing all chunks.

✅ Saved:
 - transcripts\step7_podcast_segments_with_timestamps.json
 - transcripts\step7_podcast_timestamped.txt
 - transcripts\step7_podcast_full_text.txt
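
Each chunk is transcribed on its own clock starting at zero, so a segment's position in the full recording is its chunk-local time plus the chunk's start offset. The sketch below shows that adjustment on made-up segment dictionaries, with no API call involved. The helper names `shift_segments` and `is_sorted_by_start` are hypothetical, and the sample texts are placeholders.

```python
def shift_segments(segments: list[dict], offset_s: float) -> list[dict]:
    """Return copies of the segments with start/end moved by offset_s seconds."""
    return [
        {**seg, "start": seg["start"] + offset_s, "end": seg["end"] + offset_s}
        for seg in segments
    ]


def is_sorted_by_start(segments: list[dict]) -> bool:
    """True when segment start times never decrease."""
    return all(a["start"] <= b["start"] for a, b in zip(segments, segments[1:]))


# Made-up chunk-local segments for two consecutive 10-minute chunks
chunk_1 = [{"start": 0.0, "end": 2.5, "text": "first chunk"}]
chunk_2 = [{"start": 0.0, "end": 3.0, "text": "second chunk"}]

merged = shift_segments(chunk_1, 0) + shift_segments(chunk_2, 600)
for seg in merged:
    print(seg["start"], seg["end"], seg["text"])
print("sorted:", is_sorted_by_start(merged))
```

Returning new dictionaries leaves the chunk-local segments untouched, which makes it possible to re-run the merge with different offsets without transcribing again.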


In [1]:
#Step 8 Exporting transcripts in multiple formats (TXT, JSON, SRT)
import json
from pathlib import Path

# Input: Step 7 output
segments_path = Path("transcripts") / "step7_podcast_segments_with_timestamps.json"
if not segments_path.exists():
    raise FileNotFoundError("❌ Step 7 JSON not found. Run Step 7 first.")

segments = json.loads(segments_path.read_text(encoding="utf-8"))

# Output folder
export_dir = Path("transcripts") / "exports"
export_dir.mkdir(parents=True, exist_ok=True)

# ---------- Helpers ----------
def format_hhmmss(seconds: float) -> str:
    """HH:MM:SS for human-readable text exports."""
    seconds = int(round(seconds))
    h = seconds // 3600
    m = (seconds % 3600) // 60
    s = seconds % 60
    return f"{h:02d}:{m:02d}:{s:02d}"

def srt_timestamp(seconds: float) -> str:
    """SRT timestamp format: HH:MM:SS,mmm"""
    # Work in whole milliseconds so the ms field can never round up to 1000
    total_ms = max(0, int(round(seconds * 1000)))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

# ---------- 1) Human-readable TXT with timestamps ----------
txt_lines = []
for seg in segments:
    start = format_hhmmss(seg["start"])
    end = format_hhmmss(seg["end"])
    text = seg["text"].strip()
    txt_lines.append(f"[{start} - {end}] {text}")

timestamped_txt = "\n".join(txt_lines)
txt_path = export_dir / "podcast_timestamped.txt"
txt_path.write_text(timestamped_txt, encoding="utf-8")

# ---------- 2) JSON export (already exists, but we'll copy a clean version) ----------
json_export_path = export_dir / "podcast_segments.json"
json_export_path.write_text(json.dumps(segments, indent=2, ensure_ascii=False), encoding="utf-8")

# ---------- 3) SRT export ----------
# SRT is: index, time-range line, text, blank line
# We'll use each segment as one subtitle cue.
srt_blocks = []
for seg in segments:
    text = seg["text"].strip()

    # Skip empty cues before numbering so SRT indices stay sequential
    if not text:
        continue

    start = srt_timestamp(seg["start"])
    end = srt_timestamp(seg["end"])
    block = f"{len(srt_blocks) + 1}\n{start} --> {end}\n{text}\n"
    srt_blocks.append(block)

srt_text = "\n".join(srt_blocks)
srt_path = export_dir / "podcast.srt"
srt_path.write_text(srt_text, encoding="utf-8")

# ---------- Checkpoint output ----------
print("✅ Step 8 exports created:")
print(f" - {txt_path}")
print(f" - {json_export_path}")
print(f" - {srt_path}")

print("\n🔎 Preview (first 5 timestamped lines):")
for line in timestamped_txt.splitlines()[:5]:
    print(line)

print("\n🔎 Preview (first SRT cue):")
print(srt_text.split("\n\n")[0])


✅ Step 8 exports created:
 - transcripts\exports\podcast_timestamped.txt
 - transcripts\exports\podcast_segments.json
 - transcripts\exports\podcast.srt

🔎 Preview (first 5 timestamped lines):
[00:00:00 - 00:00:02] Hey there, I'm Asma Khalid.
[00:00:02 - 00:00:06] And I'm Tristan Redman, and we're here with a bonus episode for you from the Global
[00:00:06 - 00:00:07] Story podcast.
[00:00:07 - 00:00:09] The world order is shifting.
[00:00:09 - 00:00:13] Old alliances are fraying and new ones are emerging.

🔎 Preview (first SRT cue):
1
00:00:00,000 --> 00:00:02,000
Hey there, I'm Asma Khalid.
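
One way to sanity-check SRT timestamps is a round trip: format a number of seconds, parse the string back, and confirm the value survives. The sketch below does this with two hypothetical helpers, `to_srt_time` and `from_srt_time`. It converts to whole milliseconds before splitting into fields, an approach chosen here so the millisecond field always stays within 000 to 999.

```python
def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm); negatives clamp to zero."""
    total_ms = max(0, int(round(seconds * 1000)))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def from_srt_time(stamp: str) -> float:
    """Parse an SRT timestamp (HH:MM:SS,mmm) back into seconds."""
    hms, ms = stamp.split(",")
    hours, minutes, secs = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + secs + int(ms) / 1000


for value in (0.0, 2.0, 1.9996, 3661.5):
    stamp = to_srt_time(value)
    print(value, "->", stamp, "->", from_srt_time(stamp))
```

A value such as 1.9996 seconds is the case worth watching: rounding the fractional part on its own would give 1000 milliseconds, while rounding the total gives a clean 00:00:02,000.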


In [None]:
# Show what is in the transcripts folder
import os
print(os.listdir("transcripts"))