# 🎬 Video Summarization with GPU Demo

This notebook demonstrates the complete Video Summarization pipeline on **Google Colab with GPU acceleration**.

**What you'll do:**
1. Download a YouTube video
2. Transcribe audio using Whisper (small.en on GPU)
3. Summarize the transcript using Qwen 2.5 1.5B GGUF (on GPU)
4. Download results as a ZIP file

**Estimated time:** 5-10 minutes (first run includes model downloads)

## Step 1: Check GPU & Environment

In [58]:
!nvidia-smi || echo "No GPU found – using CPU"

import sys
print("Python:", sys.version)
print("Executable:", sys.executable)

!df -h /content | head -2

Wed Jan 28 12:31:59 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   77C    P0             32W /   70W |    4036MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
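The `!` shell escapes above only work inside Jupyter or Colab. The following is a sketch that gathers the same facts from plain Python with the standard library. `environment_report` is a hypothetical helper, and it assumes `nvidia-smi` is on `PATH` whenever an NVIDIA driver is installed.

```python
import shutil
import subprocess
import sys


def environment_report(path="/"):
    """Gather the facts printed in Step 1 without notebook shell escapes."""
    report = {"python": sys.version.split()[0], "executable": sys.executable}

    # nvidia-smi ships with the NVIDIA driver; if it is missing there is usually no usable GPU
    smi = shutil.which("nvidia-smi")
    if smi:
        result = subprocess.run(
            [smi, "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True,
            text=True,
        )
        report["gpu"] = result.stdout.strip() or None
    else:
        report["gpu"] = None

    # Free disk space (GiB) on the filesystem holding `path`, e.g. "/content" on Colab
    report["disk_free_gib"] = round(shutil.disk_usage(path).free / 2**30, 1)
    return report


print(environment_report())
```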
                                                

## Step 2: Install System Packages

In [59]:
!apt-get update -qq
!apt-get install -y -qq ffmpeg
!ffmpeg -version | head -1

W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers


## Step 3: Install Python Libraries

In [60]:
!pip install -q faster-whisper
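# Build with CUDA so n_gpu_layers can offload to the GPU (compiles from source; can take several minutes)
!CMAKE_ARGS="-DGGML_CUDA=on" pip install -q --no-cache-dir llama-cpp-python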
!pip install -q yt-dlp
!pip install -q huggingface-hub
!pip install -q torch
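llama-cpp-python and faster-whisper change quickly between releases, and the unpinned `pip install` commands above install whatever is newest. This sketch records the installed versions with the standard library's `importlib.metadata`. `package_versions` is a hypothetical helper, and its result could be added to the metadata written in Step 14.

```python
from importlib.metadata import PackageNotFoundError, version


def package_versions(names):
    """Return {distribution name: installed version, or None if missing}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None  # not installed in this environment
    return found


print(package_versions(
    ["faster-whisper", "llama-cpp-python", "yt-dlp", "huggingface-hub", "torch"]
))
```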

## Step 4: Set Up a Hugging Face Token (Optional but Recommended)

In [61]:
import os
from getpass import getpass

print("Optional: Hugging Face token (helps avoid rate limits)")
hf_token = getpass("HF token (press Enter to skip): ")

if hf_token.strip():
    os.environ["HF_TOKEN"] = hf_token
    print("Token set")
else:
    print("Proceeding without token")

Optional: Hugging Face token (helps avoid rate limits)
HF token (press Enter to skip): ··········
Proceeding without token


## Step 5 (Optional): Mount Google Drive for Storage

In [62]:
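# Optional: mount Google Drive to keep downloaded models across sessions.
# Uncomment the next 3 lines, then point model_dir in Step 7 at a Drive folder
# (e.g. /content/drive/MyDrive/models) so the files are actually stored there.

# from google.colab import drive
# drive.mount('/content/drive')
# print("✓ Google Drive mounted at /content/drive")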

## Step 6: GPU Detection & Setup

In [63]:
import torch, time, json
from pathlib import Path

GPU_AVAILABLE = torch.cuda.is_available()
DEVICE = "cuda" if GPU_AVAILABLE else "cpu"
DEVICE_NAME = torch.cuda.get_device_name(0) if GPU_AVAILABLE else "CPU"
COMPUTE_TYPE = "float16" if GPU_AVAILABLE else "int8"

print("GPU:", GPU_AVAILABLE)
print("Device:", DEVICE_NAME)
print("Compute type:", COMPUTE_TYPE)

timings = {}

GPU: True
Device: Tesla T4
Compute type: float16


## Step 7: Download Models

In [64]:
from huggingface_hub import hf_hub_download

model_dir = Path("/content/models")
model_dir.mkdir(exist_ok=True)

qwen_repo = "Qwen/Qwen2.5-1.5B-Instruct-GGUF"
qwen_file = "qwen2.5-1.5b-instruct-q4_k_m.gguf"

print("Downloading Qwen GGUF…")

qwen_model_path = hf_hub_download(
    repo_id=qwen_repo,
    filename=qwen_file,
    cache_dir=str(model_dir),
    token=os.environ.get("HF_TOKEN")
)

print("Qwen model path:", qwen_model_path)

Downloading Qwen GGUF…
Qwen model path: /content/models/models--Qwen--Qwen2.5-1.5B-Instruct-GGUF/snapshots/91cad51170dc346986eccefdc2dd33a9da36ead9/qwen2.5-1.5b-instruct-q4_k_m.gguf
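A corrupt or wrong file at `qwen_model_path` only surfaces when `Llama(...)` tries to load it. The following is a sketch of an earlier sanity check. It assumes the header layout from the GGUF specification in the ggml repository: the 4-byte magic `GGUF` followed by a little-endian uint32 format version. `gguf_header` is a hypothetical helper.

```python
import struct


def gguf_header(path):
    """Return the GGUF format version, or raise ValueError if the file is not GGUF."""
    with open(path, "rb") as f:
        head = f.read(8)
    if len(head) < 8 or head[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # The magic is followed by a little-endian uint32 format version
    (fmt_version,) = struct.unpack("<I", head[4:8])
    return fmt_version


# Usage in the notebook: print("GGUF version:", gguf_header(qwen_model_path))
```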


## Step 8: Load Models

In [65]:
from faster_whisper import WhisperModel

print("Loading Whisper (small.en)…")
t0 = time.time()

whisper_model = WhisperModel(
    "small.en",
    device=DEVICE,
    compute_type=COMPUTE_TYPE
)

timings["load_whisper"] = time.time() - t0
print("Whisper loaded")

Loading Whisper (small.en)…
Whisper loaded


In [66]:
from llama_cpp import Llama
import multiprocessing

print("Loading Qwen…")
t0 = time.time()

llm = Llama(
    model_path=qwen_model_path,
    n_ctx=8192,
    n_gpu_layers=-1 if GPU_AVAILABLE else 0,  # -1 = offload every layer (requires a CUDA build)
    n_threads=multiprocessing.cpu_count(),
    n_batch=512,
    verbose=False
)

timings["load_qwen"] = time.time() - t0
print("Qwen loaded")

Loading Qwen…


llama_context: n_ctx_per_seq (8192) < n_ctx_train (32768) -- the full capacity of the model will not be utilized


Qwen loaded


## Step 9: Download YouTube Audio

In [67]:
import subprocess

YOUTUBE_URL = "https://youtu.be/e-P5IFTqB98"

audio_dir = Path("/content/audio")
audio_dir.mkdir(exist_ok=True)
audio_path = audio_dir / "audio.wav"

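# Skip the download if audio.wav already exists (delete it to fetch a different URL)
if not audio_path.exists():
    subprocess.run(
        [
            "yt-dlp",
            "-f", "bestaudio/best",
            "-x",                                       # keep audio only
            "--audio-format", "wav",
            "--postprocessor-args", "-ar 16000 -ac 1",  # ffmpeg: resample to 16 kHz mono
            # %(ext)s names the intermediate download; after -x the result is audio.wav
            "-o", str(audio_dir / "audio.%(ext)s"),
            YOUTUBE_URL
        ],
        check=True
    )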

print("Audio ready:", audio_path)

Audio ready: /content/audio/audio.wav


In [68]:
audio_path # /content/audio/Black Holes Explained – From Birth to Death.opus

PosixPath('/content/audio/audio.wav')
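faster-whisper decodes and resamples its input itself, so a wrong format is not fatal. A quick check still confirms that the `--postprocessor-args` above took effect, and it reports the audio length before transcription starts. The following is a sketch using the standard-library `wave` module, which reads PCM WAV (ffmpeg's default encoding for `.wav`). `check_wav` is a hypothetical helper.

```python
import wave


def check_wav(path, expected_rate=16000, expected_channels=1):
    """Return (sample_rate, channels, duration_s) and warn on a format mismatch."""
    with wave.open(str(path), "rb") as w:
        rate, channels = w.getframerate(), w.getnchannels()
        duration = w.getnframes() / rate
    if (rate, channels) != (expected_rate, expected_channels):
        print(f"Warning: got {rate} Hz / {channels} ch, expected "
              f"{expected_rate} Hz / {expected_channels} ch")
    return rate, channels, duration


# Usage in the notebook: print(check_wav(audio_path))
```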

## Step 10: Transcribe Audio with Whisper

In [69]:
print("Transcribing…")
t0 = time.time()

segments, info = whisper_model.transcribe(
    str(audio_path),
    language="en",
    beam_size=1,
    vad_filter=True,
    condition_on_previous_text=True
)

transcript = " ".join(s.text for s in segments)
timings["transcription"] = time.time() - t0

output_dir = Path("/content/output")
output_dir.mkdir(exist_ok=True)

transcript_file = output_dir / "output_transcript.txt"
transcript_file.write_text(transcript)

print("Transcription time:", timings["transcription"])
print("Preview:", transcript[:300])

Transcribing…
Transcription time: 13.630083560943604
Preview:  Black holes are one of the strangest things in existence.  They don't seem to make any sense at all.  Where do they come from?  And what happens if you fall into one?  Stars are incredibly massive collections of mostly hydrogen atoms  that collapse from enormous gas clouds under their own gravity. 
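Step 10 joins the segment texts and discards their timestamps. Because `transcribe` returns a lazy generator, you need to materialize it first to keep them, e.g. with `segments = list(segments)` before the join. Writing SRT subtitles is an optional extension and is not part of the original pipeline. In the sketch below, `Segment` is a stand-in namedtuple carrying the `start`, `end` (seconds) and `text` fields that faster-whisper's segments provide. `srt_timestamp` and `segments_to_srt` are hypothetical helpers.

```python
from collections import namedtuple

# Stand-in for faster-whisper's Segment; only start, end (seconds) and text are used
Segment = namedtuple("Segment", ["start", "end", "text"])


def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render segments as an SRT document: index, time range, text, blank line."""
    blocks = []
    for i, seg in enumerate(segments, 1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(blocks)


print(segments_to_srt(
    [Segment(0.0, 2.5, " Black holes are one of the strangest things in existence.")]
))
```

In the notebook, the result could be written with `(output_dir / "transcript.srt").write_text(segments_to_srt(segments))`.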


## Step 10.5: Extractive Compression with TF-IDF

In [70]:
# Step 10.5: Extractive Compression (CPU, fast)

!pip install -q scikit-learn

import re
from sklearn.feature_extraction.text import TfidfVectorizer

def compress_chunk_tfidf(text, top_k=8):
    """
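    Keep the top_k sentences with the highest summed TF-IDF weight, in their
    original order, to strip filler before the LLM call. Summed scores favor
    sentences with many distinct content words, so facts stated in short
    sentences can be dropped.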
    """
    sentences = re.split(r'(?<=[.!?])\s+', text)
    if len(sentences) <= top_k:
        return text

    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(sentences)
    scores = X.sum(axis=1).A1

    top_idx = scores.argsort()[::-1][:top_k]
    top_idx = sorted(top_idx)

    return " ".join(sentences[i] for i in top_idx)
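To see why the sentence choice in `compress_chunk_tfidf` leans toward long sentences, here is a pure-Python sketch of the scoring. It uses the smoothed IDF that scikit-learn applies by default, ln((1 + n) / (1 + df)) + 1. Unlike `TfidfVectorizer`, it removes no stop words and does not L2-normalize rows, which keeps the arithmetic easy to follow. The length-normalized variant is shown as one possible alternative and is not what the notebook uses. `tfidf_sentence_scores` is a hypothetical helper.

```python
import math
import re
from collections import Counter


def tfidf_sentence_scores(sentences, normalize_by_length=False):
    """Score each sentence by the summed TF-IDF weight of its words."""
    docs = [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]
    n = len(docs)
    # Document frequency: in how many sentences each term appears
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = sum(c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items())
        # Optionally divide by sentence length so long sentences lose their built-in edge
        scores.append(total / len(doc) if normalize_by_length and doc else total)
    return scores


sents = ["Quasar.", "Stars are stars and stars are stars.", "Stars."]
print(tfidf_sentence_scores(sents))                            # long sentence scores highest
print(tfidf_sentence_scores(sents, normalize_by_length=True))  # rare word "quasar" wins
```

`TfidfVectorizer`'s default L2 row normalization dampens this effect without removing it: a unit-length row with k non-zero weights can sum to at most the square root of k, so sentences with more distinct terms still tend to score higher.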

## Step 11: Chunk Transcript for Summarization

In [71]:
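import re

def chunk_text(text, max_words=3000, overlap=300):
    """Split text into sentence-aligned chunks of roughly max_words words.

    Consecutive chunks share trailing sentences totalling up to `overlap`
    words, so context is not lost at chunk boundaries.
    """
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks, current, count = [], [], 0

    for s in sentences:
        w = len(s.split())
        if count + w > max_words and current:
            chunks.append(" ".join(current))
            # Carry trailing sentences (up to `overlap` words) into the next chunk
            tail, tail_count = [], 0
            for prev in reversed(current):
                n = len(prev.split())
                if tail_count + n > overlap:
                    break
                tail.insert(0, prev)
                tail_count += n
            current, count = tail, tail_count
        current.append(s)
        count += w

    if current:
        chunks.append(" ".join(current))
    return chunks

chunks = chunk_text(
    transcript,
    max_words=4500,   # words per chunk; each chunk is TF-IDF-compressed before the LLM sees it
    overlap=300       # words of trailing context repeated at the start of the next chunk
)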

print(f"Chunked transcript into {len(chunks)} sections")

Chunked transcript into 1 sections
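How large `max_words` can be depends on three things: the model context (`n_ctx=8192` in Step 8), the prompt wrapper, and the tokens reserved for the answer. In this notebook `summarize_chunk` first cuts each chunk to 8 sentences, so the budget only becomes binding if that compression is removed. The following is a back-of-the-envelope sketch. `max_words_for_context` is a hypothetical helper. The 1.3 tokens-per-word ratio is an assumed rule of thumb for English, which you can measure for your transcript with `len(llm.tokenize(transcript.encode("utf-8")))`.

```python
def max_words_for_context(n_ctx, prompt_overhead_tokens, max_new_tokens,
                          tokens_per_word=1.3):
    """Rough upper bound on transcript words that fit into one completion call.

    tokens_per_word is an assumed average; measure it before relying on it.
    """
    budget = n_ctx - prompt_overhead_tokens - max_new_tokens
    return max(0, int(budget / tokens_per_word))


# n_ctx as in Step 8; ~100 tokens of ChatML/system prompt is an assumption
print(max_words_for_context(8192, prompt_overhead_tokens=100, max_new_tokens=256))
```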


## Step 12: Summarize with Qwen (Hierarchical)

In [72]:
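def summarize_chunk(llm, text, max_tokens=256):
    # Keep only the most informative sentences (Step 10.5) to shorten the prompt
    compressed = compress_chunk_tfidf(text, top_k=8)

    # Qwen 2.5 instruct models expect the ChatML prompt format
    prompt = f"""<|im_start|>system
Extract 5 concise bullet points.
Preserve all numbers, units, and named entities exactly.
<|im_end|>
<|im_start|>user
{compressed}
<|im_end|>
<|im_start|>assistant
"""

    output = llm(
        prompt,
        max_tokens=max_tokens,  # must leave room for all 5 bullets, or the last one is cut off
        temperature=0.0,        # greedy decoding for reproducible summaries
        stop=["\n\n", "<|im_end|>"],
        echo=False
    )

    return output["choices"][0]["text"].strip()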


print(f"Summarizing {len(chunks)} chunks on {DEVICE.upper()}...\n")
start = time.time()

summaries = []
for i, chunk in enumerate(chunks, 1):
    print(f"   Chunk {i}/{len(chunks)}...", end="\r", flush=True)
    summaries.append(summarize_chunk(llm, chunk))

timings["chunk_summaries"] = time.time() - start
print(f"\nChunks summarized in {timings['chunk_summaries']:.1f}s")

Summarizing 1 chunks on CUDA...

   Chunk 1/1...
Chunks summarized in 36.9s
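The prompts in this step and in Step 12.5 each hand-write the same ChatML wrapper. Below is a sketch of a helper that builds it once. `chatml` is a hypothetical name. The layout follows Qwen 2.5's chat template, with the role on the first line and `<|im_end|>` directly after the content. This differs slightly from the prompts here, which put `<|im_end|>` on its own line.

```python
def chatml(system, user):
    """Build a ChatML prompt that ends with an open assistant turn."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = chatml(
    "Extract 5 concise bullet points.",
    "Stars are incredibly massive collections of mostly hydrogen atoms.",
)
print(prompt)
```

Alternatively, `llm.create_chat_completion(messages=[...])` applies the chat template embedded in the GGUF file. With that API, the reply text is at `["choices"][0]["message"]["content"]` rather than `["choices"][0]["text"]`.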


## Step 12.5: Merge Chunk Summaries (Tree Reduce)
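`tree_reduce` below merges summaries pairwise, level by level. Every merge turns two items into one, so n chunk summaries always cost n - 1 `merge_two` calls, spread over ceil(log2 n) rounds. The following sketch makes that structure visible without an LLM: string concatenation stands in for `merge_two`, and `tree_reduce_plan` is a hypothetical name. Unlike the notebook's version, it returns a single input unchanged instead of running the polish pass.

```python
def tree_reduce_plan(items, merge):
    """Pairwise (tree) reduction with a pluggable merge; returns (result, merge_count)."""
    level, merges = list(items), 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                nxt.append(merge(level[i], level[i + 1]))
                merges += 1
            else:
                nxt.append(level[i])  # odd one out is carried up unchanged
        level = nxt
    return level[0], merges


# String concatenation stands in for merge_two's LLM call
result, calls = tree_reduce_plan(["A", "B", "C", "D", "E"], lambda a, b: f"({a}+{b})")
print(result, calls)  # (((A+B)+(C+D))+E) 4
```

With 5 summaries, the odd one out (E) skips the first two rounds and is merged only once. Its content therefore passes through fewer LLM rewrites than the others.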

In [73]:
def merge_two(llm, a, b, max_tokens=256):
    prompt = f"""<|im_start|>system
Merge and de-duplicate into a clean, factual summary (max 6 bullets).
<|im_end|>
<|im_start|>user
A:
{a}

B:
{b}
<|im_end|>
<|im_start|>assistant
"""

    out = llm(
        prompt,
        max_tokens=max_tokens,  # room for up to 6 bullets
        temperature=0.0,
        stop=["\n\n", "<|im_end|>"],
        echo=False
    )
    return out["choices"][0]["text"].strip()

print("Number of summaries before final reduce:", len(summaries))

def tree_reduce(llm, summaries):
    # Case 1: single summary → polish only
    if len(summaries) == 1:
        out = llm(
            f"""<|im_start|>system
Polish and slightly refine the summary.
Do not add new facts.
Preserve numbers and entities exactly.
<|im_end|>
<|im_start|>user
{summaries[0]}
<|im_end|>
<|im_start|>assistant
""",
            max_tokens=256,
            temperature=0.0,
            stop=["\n\n", "<|im_end|>"],
            echo=False
        )
        return out["choices"][0]["text"].strip()

    # Case 2: multiple summaries → merge pairwise, level by level, until one remains
    level = summaries
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                next_level.append(merge_two(llm, level[i], level[i + 1]))
            else:
                next_level.append(level[i])  # odd one out moves up unchanged
        level = next_level

    return level[0]


print("\nGenerating final consolidated summary...")
start = time.time()

final_summary = tree_reduce(llm, summaries)

timings["final_summary"] = time.time() - start
# Save final summary
summary_file = output_dir / "final_summary.txt"
summary_file.write_text(final_summary)

print(f"Final summary generated in {timings['final_summary']:.1f}s")
print("\nFinal Summary:\n")
print(final_summary)

Number of summaries before final reduce: 1

Generating final consolidated summary...
Final summary generated in 0.0s

Final Summary:

- Stars are massive collections of hydrogen atoms that collapse under gravity, fusing heavier elements until reaching iron.
- A singularity is infinitely dense, with all mass concentrated into a single point with no surface or volume.
- Black holes have a mass so concentrated that even tiny distances mean millions of times more force.
- Smaller black holes can kill you before you enter their event horizon, while supermassive ones allow travel inside.
- The largest known supermassive black hole is S5 0014 with 81, 4
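The fifth bullet above stops mid-sentence ("... with 81, 4"), most likely because generation hit the `max_tokens` cap in `summarize_chunk`. llama-cpp-python's completion dict reports this as `finish_reason == "length"`, versus `"stop"` when a stop sequence or end token ended the output. The following is a small sketch that surfaces the condition. `completion_text` is a hypothetical helper that could replace the `["choices"][0]["text"].strip()` lines, and the example dict is hand-written.

```python
def completion_text(output, label="completion"):
    """Return generated text from a llama-cpp-python completion dict,
    warning when generation stopped because max_tokens was reached."""
    choice = output["choices"][0]
    if choice.get("finish_reason") == "length":
        print(f"Warning: {label} hit max_tokens; the last bullet may be cut off")
    return choice["text"].strip()


# Hand-written dict shaped like the value returned by llm(...)
fake = {"choices": [{"text": " - Stars fuse hydrogen.\n- The largest",
                     "finish_reason": "length"}]}
print(completion_text(fake, label="chunk 1"))
```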


## Step 13: Performance Report

In [74]:
audio_duration = info.duration  # seconds, as reported by faster-whisper
total_time = sum(timings.values())

print("\n=== PERFORMANCE ===")
for k, v in timings.items():
    print(f"{k:20s}: {v:.2f}s")
print(f"{'total':20s}: {total_time:.2f}s")

print("Audio duration:", audio_duration, "s")
print("Transcription speed:",
      f"{audio_duration / timings['transcription']:.1f}x realtime")


=== PERFORMANCE ===
load_whisper        : 0.49s
load_qwen           : 1.43s
transcription       : 13.63s
chunk_summaries     : 36.93s
final_summary       : 0.00s
Audio duration: 355.8458125 s
Transcription speed: 26.1x realtime
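The table times the chunk summarization step as a whole, so it mixes prompt processing with generation. Generated tokens per second is a more direct hint about whether llama.cpp actually ran on the GPU: a rate far below what GPU offload should give suggests a CPU-only build. The sketch below assumes you keep the dict returned by `llm(...)` and time that single call. llama-cpp-python reports token counts under `usage["completion_tokens"]`. `llm_throughput` and `realtime_factor` are hypothetical helpers, and the second reproduces the "x realtime" figure above.

```python
def llm_throughput(output, seconds):
    """Generated tokens per wall-clock second (includes prompt processing time)."""
    return output["usage"]["completion_tokens"] / seconds


def realtime_factor(audio_seconds, processing_seconds):
    """Seconds of audio processed per wall-clock second."""
    return audio_seconds / processing_seconds


# Figures from the run above: 355.85 s of audio transcribed in 13.63 s
print(f"{realtime_factor(355.8458125, 13.630083560943604):.1f}x realtime")
```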


## Step 14: Package Results as a ZIP

In [75]:
import zipfile
from datetime import datetime

metadata = {
    "timestamp": datetime.now().isoformat(),
    "device": DEVICE_NAME,
    "models": {
        "whisper": "small.en",
        "summarizer": "Qwen 2.5 1.5B GGUF"
    },
    "timings": timings
}

meta_file = output_dir / "metadata.json"
meta_file.write_text(json.dumps(metadata, indent=2))

zip_path = "/content/video_summarization_results.zip"
with zipfile.ZipFile(zip_path, "w") as z:
    z.write(transcript_file, transcript_file.name)
    z.write(summary_file, summary_file.name)
    z.write(meta_file, meta_file.name)

print("Saved:", zip_path)

Saved: /content/video_summarization_results.zip
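`zipfile.ZipFile(zip_path, "w")` stores members uncompressed by default (`ZIP_STORED`). The following is a sketch of the same packaging with DEFLATE compression plus an integrity check. `write_results_zip` is a hypothetical helper, and the demo writes throwaway files to a temporary directory.

```python
import json
import tempfile
import zipfile
from pathlib import Path


def write_results_zip(zip_path, files):
    """Write files into a DEFLATE-compressed ZIP and return the stored names."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as z:
        for f in files:
            z.write(f, Path(f).name)  # store by file name only, no directories
    # Re-open and CRC-check every member before offering the file for download
    with zipfile.ZipFile(zip_path) as z:
        bad = z.testzip()
        if bad is not None:
            raise RuntimeError(f"Corrupt member in ZIP: {bad}")
        return z.namelist()


# Self-contained demo with throwaway files
tmp = Path(tempfile.mkdtemp())
(tmp / "final_summary.txt").write_text("- bullet one\n")
(tmp / "metadata.json").write_text(json.dumps({"device": "CPU"}))
print(write_results_zip(tmp / "results.zip",
                        [tmp / "final_summary.txt", tmp / "metadata.json"]))
```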


## Step 15: Download the Results

In [76]:
from google.colab import files
files.download(zip_path)


## Step 16: Free GPU Memory

In [77]:
import gc

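# Drop the references so the models can be garbage-collected
del whisper_model, llm
gc.collect()

if GPU_AVAILABLE:
    # Releases memory cached by PyTorch's allocator; faster-whisper (CTranslate2)
    # and llama.cpp allocate their own GPU buffers, freed when their objects are collected
    torch.cuda.empty_cache()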

print("Cleanup complete")

Cleanup complete
