# 🎙️ Transcript Summarization - Testing Notebook

This notebook tests the Meeting Summary project's transcription and summarization pipeline.

**Models Used (Optimized for Colab GPU quota):**
- **Transcription**: `faster-whisper` with the `small` model (~244M parameters, high accuracy)
- **Summarization**: `sshleifer/distilbart-cnn-12-6` (~1GB, a distilled and faster variant of BART-large)

---

## 📦 Step 1: Install Dependencies

In [None]:
# Install required packages
!pip install -q faster-whisper transformers torch sentencepiece tqdm
!pip install -q scipy numpy huggingface-hub

# Install FFmpeg for audio processing
!apt-get install -qq ffmpeg

print("✅ All dependencies installed!")

## 🔧 Step 2: Configuration

In [None]:
import os
import warnings
warnings.filterwarnings('ignore')

# ========================================
# CONFIGURATION - Modify these as needed
# ========================================

# Whisper model size: 'tiny', 'base', 'small', 'medium', 'large-v3'
# Recommendation: 'small' for best speed/accuracy balance
WHISPER_MODEL = "small"

# Use GPU if available (faster-whisper supports CUDA)
USE_GPU = True

# Summary word limit
SUMMARY_WORD_LIMIT = 500

# Model info for reference (sizes are approximate parameter counts, not download sizes)
MODEL_INFO = {
    "tiny": {"size": "~39M params", "speed": "fastest", "accuracy": "basic"},
    "base": {"size": "~74M params", "speed": "very fast", "accuracy": "good"},
    "small": {"size": "~244M params", "speed": "fast", "accuracy": "high"},
    "medium": {"size": "~769M params", "speed": "moderate", "accuracy": "very high"},
    "large-v3": {"size": "~1550M params", "speed": "slow", "accuracy": "best"},
}

print(f"📊 Selected Whisper Model: {WHISPER_MODEL}")
print(f"   Size: {MODEL_INFO[WHISPER_MODEL]['size']}")
print(f"   Speed: {MODEL_INFO[WHISPER_MODEL]['speed']}")
print(f"   Accuracy: {MODEL_INFO[WHISPER_MODEL]['accuracy']}")
print(f"\n🎯 GPU Enabled: {USE_GPU}")
print(f"📝 Summary Word Limit: {SUMMARY_WORD_LIMIT}")

## 🎤 Step 3: Fast Transcriber Class

In [None]:
import torch
from faster_whisper import WhisperModel

class FastTranscriber:
    """High-speed transcription using faster-whisper with GPU support."""
    
    def __init__(self, model_size: str = "small", use_gpu: bool = True):
        self.model_size = model_size
        self.use_gpu = use_gpu and torch.cuda.is_available()
        self.model = None
        self.device = "cuda" if self.use_gpu else "cpu"
        self.compute_type = "float16" if self.use_gpu else "int8"
    
    def load_model(self):
        """Load the faster-whisper model."""
        if self.model is not None:
            return
        
        print(f"📥 Loading Whisper {self.model_size} model on {self.device.upper()}...")
        
        self.model = WhisperModel(
            self.model_size,
            device=self.device,
            compute_type=self.compute_type
        )
        
        print(f"✅ Whisper {self.model_size} model loaded!")
    
    def transcribe(self, audio_path: str) -> dict:
        """Transcribe an audio file."""
        if self.model is None:
            self.load_model()
        
        print(f"🎙️ Transcribing: {audio_path}")
        
        segments, info = self.model.transcribe(
            audio_path,
            beam_size=1,              # Greedy decoding - faster
            vad_filter=True,          # Remove silence
            word_timestamps=False,
            condition_on_previous_text=False
        )
        
        segment_list = []
        full_text = []
        
        for seg in segments:
            segment_list.append({
                "start": seg.start,
                "end": seg.end,
                "text": seg.text.strip()
            })
            full_text.append(seg.text.strip())
        
        transcript = " ".join(full_text)
        
        print("✅ Transcription complete!")
        print(f"   Language: {info.language}")
        print(f"   Length: {len(transcript):,} chars | {len(transcript.split()):,} words")
        
        return {
            "text": transcript,
            "segments": segment_list,
            "language": info.language,
            "word_count": len(transcript.split())
        }
    
    def format_with_timestamps(self, result: dict) -> str:
        """Format transcript with timestamps."""
        lines = []
        for seg in result.get("segments", []):
            start = self._format_time(seg["start"])
            end = self._format_time(seg["end"])
            lines.append(f"[{start} → {end}] {seg['text']}")
        return "\n".join(lines)
    
    def _format_time(self, seconds: float) -> str:
        mins = int(seconds // 60)
        secs = int(seconds % 60)
        return f"{mins:02d}:{secs:02d}"

print("✅ FastTranscriber class defined!")
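
`_format_time` renders `MM:SS`, so a recording longer than an hour shows a minute count above 59 (for example `75:03`). The sketch below is one possible hour-aware variant; `format_timestamp` is a hypothetical standalone helper, not part of the original class.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as MM:SS, or H:MM:SS once the value reaches one hour."""
    total = int(seconds)  # drop fractional seconds, as _format_time does
    hours, rem = divmod(total, 3600)
    mins, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{mins:02d}:{secs:02d}"
    return f"{mins:02d}:{secs:02d}"


print(format_timestamp(75.4))    # 01:15
print(format_timestamp(4503.9))  # 1:15:03
```

Short clips keep the compact `MM:SS` form, so existing transcripts would look the same until a segment crosses the one-hour mark.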

## 📝 Step 4: Summarizer Class (DistilBART)

In [None]:
from transformers import BartForConditionalGeneration, BartTokenizer
from datetime import datetime

class Summarizer:
    """Summarizes transcripts using DistilBART (a distilled, faster variant of BART-large)."""
    
    MODEL_NAME = "sshleifer/distilbart-cnn-12-6"
    
    def __init__(self, use_gpu: bool = True):
        self.device = "cuda" if use_gpu and torch.cuda.is_available() else "cpu"
        self.model = None
        self.tokenizer = None
    
    def load_model(self):
        """Load the DistilBART model."""
        if self.model is not None:
            return
        
        print(f"📥 Loading DistilBART model on {self.device.upper()}...")
        
        self.tokenizer = BartTokenizer.from_pretrained(self.MODEL_NAME)
        self.model = BartForConditionalGeneration.from_pretrained(self.MODEL_NAME)
        self.model.to(self.device)
        
        print("✅ DistilBART model loaded!")
    
    def summarize(self, transcript: str, word_limit: int = 500) -> str:
        """Generate a structured summary."""
        # Return early on empty input so the model is not loaded needlessly
        if not transcript.strip():
            return "No content to summarize."
        
        if self.model is None:
            self.load_model()
        
        print(f"📝 Generating summary (~{word_limit} words)...")
        
        # Chunk the transcript for processing
        chunks = self._chunk_text(transcript, max_length=2000)
        
        summaries = []
        for i, chunk in enumerate(chunks):
            print(f"   Processing chunk {i+1}/{len(chunks)}...")
            summary = self._summarize_chunk(chunk)
            if summary:
                summaries.append(summary)
        
        detailed_summary = "\n\n".join(summaries)
        
        # Trim to word limit
        words = detailed_summary.split()
        if len(words) > word_limit * 1.2:
            detailed_summary = " ".join(words[:word_limit]) + "..."
        
        # Generate executive summary
        exec_summary = self._summarize_chunk(detailed_summary[:2000]) if len(detailed_summary) > 500 else detailed_summary
        
        # Extract key points
        key_points = self._extract_key_points(summaries)
        
        # Format output
        output = f"""# Meeting Summary
**Generated:** {datetime.now().strftime("%Y-%m-%d %H:%M")}
**Transcript Length:** {len(transcript):,} characters | {len(transcript.split()):,} words

---

## Executive Summary
{exec_summary}

---

## Detailed Summary
{detailed_summary}

---

## Key Points
{key_points}

---

*Generated by Meeting Summary App using DistilBART*
"""
        
        print("✅ Summary generation complete!")
        return output
    
    def _summarize_chunk(self, text: str) -> str:
        """Summarize a single chunk."""
        if not text.strip():
            return ""
        
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            max_length=1024,
            truncation=True
        ).to(self.device)
        
        summary_ids = self.model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],  # pass the mask explicitly with the input IDs
            max_length=200,
            min_length=50,
            length_penalty=1.5,
            num_beams=2,
            early_stopping=True,
            no_repeat_ngram_size=3
        )
        
        return self.tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    
    def _chunk_text(self, text: str, max_length: int) -> list:
        """Split text into chunks."""
        words = text.split()
        chunks = []
        current = []
        length = 0
        
        for word in words:
            if length + len(word) > max_length and current:
                chunks.append(" ".join(current))
                current = [word]
                length = len(word)
            else:
                current.append(word)
                length += len(word) + 1
        
        if current:
            chunks.append(" ".join(current))
        
        return chunks if chunks else [text]
    
    def _extract_key_points(self, summaries: list) -> str:
        """Extract key points as bullet points."""
        points = []
        for summary in summaries:
            sentences = summary.split('.')
            if sentences and len(sentences[0].strip()) > 20:
                points.append(f"• {sentences[0].strip()}.")
        
        return "\n".join(points[:8]) if points else "• Key points from the meeting above."

print("✅ Summarizer class defined!")
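
The chunking step is easiest to follow in isolation. The sketch below restates the greedy, character-budget logic of `_chunk_text` as a standalone function (`chunk_by_chars` is a hypothetical name) and runs it on a tiny input. The budget counts characters, while the tokenizer truncates at 1024 tokens; treating 2000 characters as safely under that limit is an assumption about typical English text, not a guarantee.

```python
def chunk_by_chars(text: str, max_length: int) -> list:
    """Greedily pack whole words into chunks of roughly max_length characters."""
    chunks, current, length = [], [], 0
    for word in text.split():
        # Start a new chunk when adding this word would exceed the budget.
        if current and length + len(word) > max_length:
            chunks.append(" ".join(current))
            current, length = [word], len(word)
        else:
            current.append(word)
            length += len(word) + 1  # +1 accounts for the joining space
    if current:
        chunks.append(" ".join(current))
    return chunks if chunks else [text]


sample = "alpha beta gamma delta epsilon"
for chunk in chunk_by_chars(sample, max_length=12):
    print(repr(chunk))
# 'alpha beta'
# 'gamma delta'
# 'epsilon'
```

Words are never split, so a single word longer than `max_length` ends up in a chunk of its own that exceeds the budget.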

## 📤 Step 5: Upload Your Audio/Video File

Upload an audio file (MP3, WAV) or video file (MP4, etc.) to test the pipeline.

In [None]:
from google.colab import files
import os

print("📤 Please upload an audio/video file...")
uploaded = files.upload()

# Get the uploaded file path
uploaded_file = list(uploaded.keys())[0]
print(f"\n✅ Uploaded: {uploaded_file}")
print(f"   Size: {os.path.getsize(uploaded_file) / (1024*1024):.2f} MB")
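
If `uploaded` comes back empty (for instance when the upload dialog is dismissed), `list(uploaded.keys())[0]` raises an `IndexError`. A minimal sketch of a guard, assuming `uploaded` is the filename-to-bytes dictionary returned by `files.upload()`; `first_uploaded_name` is a hypothetical helper and the dictionary below is stand-in data.

```python
def first_uploaded_name(uploaded: dict) -> str:
    """Return the first uploaded filename, or fail with a readable message."""
    if not uploaded:
        raise ValueError("No file was uploaded; re-run the cell and choose a file.")
    return next(iter(uploaded))  # dicts preserve insertion order


# Stand-in for the dictionary that files.upload() returns.
fake_upload = {"meeting.mp4": b"\x00\x01"}
print(first_uploaded_name(fake_upload))  # meeting.mp4
```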

## 🎬 Step 6: Extract Audio (if video file)

In [None]:
import subprocess

def extract_audio(input_file: str) -> str:
    """Extract audio from video file using FFmpeg."""
    # Check if already audio file
    audio_extensions = ['.mp3', '.wav', '.m4a', '.flac', '.ogg', '.aac']
    if any(input_file.lower().endswith(ext) for ext in audio_extensions):
        print(f"✅ File is already audio: {input_file}")
        return input_file
    
    # Extract audio from video
    output_file = "extracted_audio.wav"
    
    print("🎬 Extracting audio from video...")
    
    cmd = [
        "ffmpeg", "-y", "-i", input_file,
        "-vn",                    # No video
        "-acodec", "pcm_s16le",   # WAV format
        "-ar", "16000",           # 16kHz (optimal for Whisper)
        "-ac", "1",               # Mono
        output_file
    ]
    
    result = subprocess.run(cmd, capture_output=True, text=True)
    
    if result.returncode == 0:
        print(f"✅ Audio extracted: {output_file}")
        print(f"   Size: {os.path.getsize(output_file) / (1024*1024):.2f} MB")
        return output_file
    else:
        print(f"❌ Error: {result.stderr}")
        return input_file

# Extract audio
audio_file = extract_audio(uploaded_file)
print(f"\n🎵 Audio file ready: {audio_file}")
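
`extract_audio` decides between audio and video purely by file extension and assumes FFmpeg is on the `PATH`. Both checks can be sketched separately using only the standard library; `is_audio_file` and `ffmpeg_available` are hypothetical helper names, and the extension set mirrors the list in the function above.

```python
import shutil
from pathlib import Path

# Same extensions as the audio_extensions list in extract_audio.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".aac"}


def is_audio_file(path: str) -> bool:
    """True when the final extension (case-insensitive) is a known audio type."""
    return Path(path).suffix.lower() in AUDIO_EXTENSIONS


def ffmpeg_available() -> bool:
    """True when an ffmpeg executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


print(is_audio_file("standup.MP3"))  # True
print(is_audio_file("standup.mp4"))  # False
print("ffmpeg found:", ffmpeg_available())
```

Checking for FFmpeg before calling `subprocess.run` gives a clearer message than a `FileNotFoundError` when the install step was skipped.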

## 🚀 Step 7: Run Transcription

In [None]:
import time

# Initialize transcriber
transcriber = FastTranscriber(model_size=WHISPER_MODEL, use_gpu=USE_GPU)

# Transcribe
print("\n" + "="*50)
print("🎙️ STARTING TRANSCRIPTION")
print("="*50 + "\n")

start_time = time.time()
result = transcriber.transcribe(audio_file)
transcription_time = time.time() - start_time

print(f"\n⏱️ Transcription Time: {transcription_time:.2f} seconds")
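
Raw seconds are hard to compare across recordings of different lengths. One common normalization is the real-time factor: processing time divided by audio duration. The sketch below approximates the duration from the end of the last segment, which ignores any trailing silence; `real_time_factor` is a hypothetical helper and the numbers are illustrative, not measured results.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of audio; below 1.0 means faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Illustrative values shaped like result["segments"] from the transcriber.
segments = [
    {"start": 0.0, "end": 12.5, "text": "hello"},
    {"start": 12.5, "end": 600.0, "text": "goodbye"},
]
audio_seconds = segments[-1]["end"]  # approximate duration from the last segment
rtf = real_time_factor(30.0, audio_seconds)
print(f"RTF: {rtf:.3f}")  # RTF: 0.050
```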

## 📄 Step 8: View Transcript

In [None]:
# Display transcript with timestamps
print("\n" + "="*50)
print("📜 TRANSCRIPT WITH TIMESTAMPS")
print("="*50 + "\n")

formatted_transcript = transcriber.format_with_timestamps(result)
print(formatted_transcript[:3000])  # Show first 3000 chars

if len(formatted_transcript) > 3000:
    print(f"\n... [Truncated - Full transcript is {len(formatted_transcript):,} characters]")

## 📝 Step 9: Generate Summary

In [None]:
# Initialize summarizer
summarizer = Summarizer(use_gpu=USE_GPU)

# Generate summary
print("\n" + "="*50)
print("📝 GENERATING SUMMARY")
print("="*50 + "\n")

start_time = time.time()
summary = summarizer.summarize(result["text"], word_limit=SUMMARY_WORD_LIMIT)
summary_time = time.time() - start_time

print(f"\n⏱️ Summary Generation Time: {summary_time:.2f} seconds")
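
`SUMMARY_WORD_LIMIT` is a soft limit: `summarize` only trims the detailed summary once it exceeds the limit by more than 20%, and then cuts back to the limit itself. The sketch below isolates that rule; `trim_to_word_limit` is a hypothetical name, and `tolerance=1.2` mirrors the `word_limit * 1.2` check in the class.

```python
def trim_to_word_limit(text: str, word_limit: int, tolerance: float = 1.2) -> str:
    """Cut text to word_limit words, but only once it exceeds word_limit * tolerance."""
    words = text.split()
    if len(words) > word_limit * tolerance:
        return " ".join(words[:word_limit]) + "..."
    return text


eleven = " ".join(f"w{i}" for i in range(11))
thirteen = " ".join(f"w{i}" for i in range(13))

print(trim_to_word_limit(eleven, word_limit=10))    # unchanged: 11 words is within the 12-word tolerance
print(trim_to_word_limit(thirteen, word_limit=10))  # cut to 10 words plus "..."
```

Trimming joins words with single spaces, so a trimmed summary loses the blank lines that separated the per-chunk summaries.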

## 📊 Step 10: View Summary

In [None]:
from IPython.display import Markdown, display

print("\n" + "="*50)
print("📊 MEETING SUMMARY")
print("="*50 + "\n")

# Display as formatted markdown
display(Markdown(summary))

## 💾 Step 11: Download Results

In [None]:
from google.colab import files

# Save transcript
transcript_filename = "transcript.txt"
with open(transcript_filename, "w", encoding="utf-8") as f:
    f.write(formatted_transcript)

# Save summary
summary_filename = "summary.md"
with open(summary_filename, "w", encoding="utf-8") as f:
    f.write(summary)

print("📥 Downloading files...")
files.download(transcript_filename)
files.download(summary_filename)

print("\n✅ Done! Files downloaded.")

## 📈 Step 12: Performance Summary

In [None]:
print("\n" + "="*60)
print("📈 PERFORMANCE SUMMARY")
print("="*60)

print(f"""
🔧 Configuration:
   • Whisper Model: {WHISPER_MODEL} ({MODEL_INFO[WHISPER_MODEL]['size']})
   • Device: {'GPU (CUDA)' if transcriber.device == 'cuda' else 'CPU'}
   • Summary Word Limit: {SUMMARY_WORD_LIMIT}

📊 Results:
   • Transcript Words: {result['word_count']:,}
   • Transcript Characters: {len(result['text']):,}
   • Detected Language: {result['language']}

⏱️ Timing:
   • Transcription: {transcription_time:.2f}s
   • Summarization: {summary_time:.2f}s
   • Total: {transcription_time + summary_time:.2f}s

💾 Files Saved:
   • transcript.txt
   • summary.md
""")

print("="*60)
print("✅ All tests completed successfully!")
print("="*60)

---

## 🧪 Alternative: Test with Sample Text

If you don't have an audio file, you can test just the summarization with sample text:

In [None]:
# Sample meeting transcript for testing summarization only
SAMPLE_TRANSCRIPT = """
Welcome everyone to today's product planning meeting. We have several important topics to cover.

First, let's discuss the Q1 roadmap. Our main priority is launching the mobile app by end of February.
The development team has been making good progress on the core features. Sarah mentioned that the 
authentication module is complete and we're now working on the dashboard.

John from the design team shared the latest mockups. The stakeholders approved the new color scheme
and we'll be implementing those changes next week. We need to finalize the icon set by Friday.

Regarding the backend, Mike reported that the API is 80% complete. We still need to implement the
notification system and the analytics tracking. These should be done by next Wednesday.

For marketing, Lisa prepared a launch strategy. We'll start a teaser campaign on social media two
weeks before launch. The press release will go out to tech publications on launch day.

Budget update: We're currently under budget by 15%, which gives us room for additional testing
resources if needed. Alice suggested hiring a QA contractor for the final testing phase.

Action items: Sarah will complete the dashboard by Friday. John will finalize icons. Mike will
finish the notification API. Lisa will prepare social media content. Alice will interview QA candidates.

Our next meeting is scheduled for next Monday at 10 AM. Thank you everyone for your participation.
"""

# Test summarization with sample text
print("\n🧪 Testing summarization with sample transcript...\n")

test_summarizer = Summarizer(use_gpu=USE_GPU)
test_summary = test_summarizer.summarize(SAMPLE_TRANSCRIPT, word_limit=200)

display(Markdown(test_summary))