# Meeting summary pipeline

## Introduction

This notebook implements a meeting summary pipeline: audio is transcribed to text, and the text is then summarized.

The recordings come from the [MeetingBank dataset](https://huggingface.co/datasets/huuuyeah/meetingbank).
The audio files are available [here](https://huggingface.co/datasets/huuuyeah/MeetingBank_Audio/tree/main).

For testing purposes, one audio file was downloaded locally into the `extra/` folder.

In [1]:
# Imports
import os
import sys
from pathlib import Path
from dotenv import load_dotenv
from huggingface_hub import login
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, AutoTokenizer, AutoModelForCausalLM, pipeline, TextStreamer, QuantoConfig
import numpy as np
from IPython.display import display, Markdown
import torch
from typing import Optional


In [2]:
# Load environment variables
load_dotenv() 

# Add project root to Python path and change working directory
project_root = Path().resolve().parent
sys.path.insert(0, str(project_root))
os.chdir(project_root)  # Change working directory to project root

In [3]:
# Sign in to HuggingFace Hub

hf_token = os.environ.get('HF_TOKEN')
if hf_token is None:
    raise ValueError("HF_TOKEN is not set")
login(hf_token, add_to_git_credential=True)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Briefly, about the ASR model used here:

**distil-medium.en** is a distilled Whisper model trained with Robust Knowledge Distillation. The distilled model is reported to be 49% smaller and 6 times faster than Whisper.
- You can read more about Robust Knowledge Distillation [here](https://arxiv.org/abs/2311.00430)
- The model can be found [here](https://huggingface.co/distil-whisper/distil-medium.en)



In [9]:
# Constants

ASR_MODEL = "distil-whisper/distil-medium.en"

audio_file_name = "extra/denver_meeting_rec.mp3"

## Audio To Text

Let's start with ASR. We will use Distil-Whisper through the HF transformers pipeline. Because our audio is longer than 30 seconds (the limit for *Short-Form Transcription*), we will use *Long-Form Transcription*.

A short note from the model page:
> Distil-Whisper uses a chunked algorithm to transcribe long-form audio files (> 30-seconds).
> In practice, this chunked long-form algorithm is 9x faster than the sequential algorithm proposed by OpenAI in the Whisper paper
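
To make the chunked idea concrete, here is a minimal sketch of how overlapping windows could be laid out over a recording. It is an illustration only, not the pipeline's actual implementation: the helper `chunk_windows` and its `stride_s` parameter are hypothetical names, and the 2.5 s overlap is an assumption chosen as one sixth of the 15 s chunk length.

```python
def chunk_windows(duration_s: float, chunk_length_s: float = 15.0,
                  stride_s: float = 2.5) -> list[tuple[float, float]]:
    """Return (start, end) windows in seconds that cover the audio with overlap.

    Hypothetical helper: consecutive windows advance by the chunk length
    minus the overlap kept on both sides of each chunk.
    """
    step = chunk_length_s - 2 * stride_s
    if step <= 0:
        raise ValueError("stride_s must be less than half of chunk_length_s")

    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:  # the last window reached the end of the audio
            break
        start += step
    return windows


# A 40-second recording is split into four overlapping windows
windows = chunk_windows(40.0)
for start, end in windows:
    print(f"{start:5.1f}s -> {end:5.1f}s")
```

Each window can be transcribed independently (and therefore in a batch), and the overlapping regions give the decoder context to stitch the pieces back together.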

In [10]:
# Initialize ASR model
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    ASR_MODEL, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(ASR_MODEL)

asr_pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15, # This will enable Long-Form Transcription
    # batch_size=16, # for batch processing
    torch_dtype=torch_dtype,
    device=device,
)

`torch_dtype` is deprecated! Use `dtype` instead!
Device set to use cpu


In [None]:
transcription = asr_pipe(audio_file_name)
display(Markdown(transcription['text'][:1000]))

 kind of the confluence of this whole idea of the confluence week, the merging of two rivers, and as we've kind of seen recently in politics and in the world, there's a lot of situations where water is very important right now and it's a very big issue so that is the reason that the back of the logo is considered water. So let you see the reason behind the logo and all the meanings behind the symbolism. And you'll hear a little bit more about our Confluence Week is basically highlighting all of these indigenous events and things that are happening around Denver so that we can kind of bring more people together and kind of share this whole idea of Indigenous Peoples Day. So thank you. Thank you so much and thanks for your leadership. All right. Welcome to the Denver City Council meeting of Monday, October 9th. Please rise with the Pledge of Allegiance by Councilman Lopez. I pledge allegiance to the flag of the United States of America, to the Republic of which it stands, one nation unde

## Summarization

Now that ASR is ready, we can start working on the summarization part.
One of the main goals of this project is to try [quantization](https://huggingface.co/docs/optimum/en/concept_guides/quantization), so I will try two approaches:
 - Use an 8B-parameter model with 4-bit quantization. (**meta-llama/Meta-Llama-3.1-8B-Instruct**)
 - Use a model specialized in summarization. (**sshleifer/distilbart-cnn-12-6**)

### LLAMA_3_1

Here I will try the 8B-parameter version of Meta-Llama-3.1.
This model is large: with FP16 weights it takes ~16 GB. Quantization is a way to handle this.
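
The ~16 GB figure follows from simple arithmetic on the weights alone. The sketch below is a rough estimate under stated assumptions: it counts only the weights (no activations, KV cache, or framework overhead), rounds the model to 8 billion parameters, and uses a hypothetical helper name `weight_memory_gb`.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: the model rounded to 8 billion parameters

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Under these assumptions, int8 weights halve the FP16 footprint and 4-bit weights quarter it.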

Quantization:
> reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types
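
As a toy illustration of the idea in the quote, the sketch below quantizes a handful of float weights to int8 with a single scale factor and then restores them. This is a generic symmetric scheme written in plain Python for clarity. It is not how optimum-quanto or BitsAndBytes are implemented, and the function names are hypothetical.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is bounded by half of the scale
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"scale={scale:.4f}", f"max_error={max_error:.5f}")
```

Each weight is stored in one byte instead of two, at the cost of a small rounding error per value.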

The course suggests using BitsAndBytes, but it is not available on Macs with Apple Silicon chips, so I decided to change the quantization method. I will try:
 - optimum-quanto to int8
 - GGUF

In [58]:
# Model of choice
LLAMA_3_1 = "meta-llama/Meta-Llama-3.1-8B-Instruct"

In [59]:
# Prompts
default_system_message = "You are an assistant that produces minutes of meetings from transcripts, with summary, key discussion points, takeaways and action items with owners, in markdown."
default_user_prompt = "Below is an extract transcript of council meeting. Please write minutes in markdown, including a summary with attendees, location and date; discussion points; takeaways; and action items with owners.\n{transcription}"

#### optimum-quanto

In [60]:
# Configurations

# Macs with M1/M2/M3 chips do not support BitsAndBytesConfig, so I will try another quantization method; the original approach is left here for reference
# default_4b_quant_config = BitsAndBytesConfig(
#     load_in_4bit=True,  # Use 4-bit quantization
#     bnb_4bit_use_double_quant=True,  # Use double quantization
#     bnb_4bit_compute_dtype=torch.bfloat16,  # Use bfloat16 for computation
#     bnb_4bit_quant_type="nf4"  # Use NF4 quantization type (Not supported on MAC)
# )

# Let's try optimum-quanto
default_optimum_quant_config = QuantoConfig(weights="int8")

In [61]:
# Util functions
def build_messages(system: str, user: str) -> list[dict]:
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user}
    ]

# Summarizer class
class MeetingSummarizer:
    model_name: str
    quant_config: QuantoConfig
    tokenizer: AutoTokenizer
    model: AutoModelForCausalLM
    system_message: str = default_system_message
    user_prompt: str = default_user_prompt
    device: str

    def __init__(self, model_name: str, quant_config: QuantoConfig,
                 system_message: Optional[str] = None, user_prompt: Optional[str] = None,
                 device: str = "mps"):
        self.model_name = model_name
        self.quant_config = quant_config
        self.device = device
        
        # Initialize tokenizer
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        tokenizer.pad_token = tokenizer.eos_token
        self.tokenizer = tokenizer
        if tokenizer is None:
            raise ValueError("Tokenizer is not provided")
        
        # Initialize model with proper device handling
        try:
            # Try with device_map="auto" first (requires accelerate)
            self.model = AutoModelForCausalLM.from_pretrained(
                model_name, 
                device_map="auto", 
                quantization_config=quant_config,
                torch_dtype=torch.float16 if device != "cpu" else torch.float32
            )
            print(f"✅ Model loaded with device_map='auto'")
        except Exception as e:
            print(f"⚠️  device_map='auto' failed: {e}")
            print("🔄 Falling back to manual device placement...")
            
            # Fallback: load without device_map and manually move to device
            self.model = AutoModelForCausalLM.from_pretrained(
                model_name, 
                quantization_config=quant_config,
                torch_dtype=torch.float16 if device != "cpu" else torch.float32,
                low_cpu_mem_usage=True
            )
            if device != "cpu":
                self.model = self.model.to(device)
            print(f"✅ Model loaded and moved to {device}")
        
        # Set custom messages if provided
        if system_message is not None:
            self.system_message = system_message
        if user_prompt is not None:
            self.user_prompt = user_prompt
        
    def summarize(self, transcription: str) -> str:
        # Check if this is a chat model (like Llama) or a summarization model (like DistilBART)
        is_chat_model = hasattr(self.tokenizer, 'chat_template') and self.tokenizer.chat_template is not None
        
        if is_chat_model:
            # For chat models like Llama, use chat template
            formatted_prompt = self.user_prompt.format(transcription=transcription)
            messages = build_messages(self.system_message, formatted_prompt)
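            # add_generation_prompt=True appends the assistant header so the model answers instead of continuing the user turn
            inputs = self.tokenizer.apply_chat_template(
                messages, add_generation_prompt=True, return_tensors="pt"
            ).to(self.device)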
        else:
            # For summarization models like DistilBART, use direct text input
            # For BART models, we typically just pass the text directly
            tokenized = self.tokenizer(transcription, return_tensors="pt", max_length=1024, truncation=True)
            inputs = tokenized.input_ids.to(self.device)
            
            # Ensure batch size is 1 for TextStreamer compatibility
            if inputs.dim() == 1:
                inputs = inputs.unsqueeze(0)
        
        # Create streamer for real-time output (only for chat models to avoid batch size issues)
        streamer = TextStreamer(self.tokenizer, skip_prompt=True) if is_chat_model else None
        
        # Generate with streaming
        with torch.no_grad():
            if is_chat_model:
                # Chat model generation parameters
                outputs = self.model.generate(
                    inputs, 
                    max_new_tokens=2000,
                    do_sample=True,
                    temperature=0.7,
                    top_p=0.9,
                    streamer=streamer,
                    pad_token_id=self.tokenizer.eos_token_id,
                    eos_token_id=self.tokenizer.eos_token_id 
                )
            else:
                # Summarization model generation parameters (without streamer to avoid batch issues)
                outputs = self.model.generate(
                    inputs,
                    do_sample=True,
                    temperature=0.7,
                    top_p=0.9,
                    streamer=streamer,
                    pad_token_id=self.tokenizer.eos_token_id,
                    eos_token_id=self.tokenizer.eos_token_id,
                    early_stopping=True
                )
        
        # Decoder-only chat models echo the prompt, so strip it; seq2seq output contains only the summary
        generated = outputs[0][inputs.shape[1]:] if is_chat_model else outputs[0]
        response = self.tokenizer.decode(generated, skip_special_tokens=True)
        return response
    
    def summarize_with_display(self, transcription: str) -> str:
        """Summarize and display the result in Markdown format"""
        print("🎯 Generating meeting summary...")
        print("=" * 50)
        
        summary = self.summarize(transcription)
        
        print("\n" + "=" * 50)
        print("✅ Summary generated! Displaying in Markdown:")
        display(Markdown(summary))
        
        return summary


In [None]:
# Example usage of the model quantized with optimum-quanto

# Initialize the summarizer 
summarizer = MeetingSummarizer(
    model_name=LLAMA_3_1,
    quant_config=default_optimum_quant_config,
    device="mps"  # let's try Metal Performance Shaders on Mac
)

# Example: use with the transcription produced by the ASR step above
summary = summarizer.summarize_with_display(transcription['text'])

print("✅ MeetingSummarizer initialized with 8-bit quantization!")
print("📝 Ready to process meeting transcriptions with streaming output and Markdown display.")


**Result**: Even with this quantization, generation takes a very long time, so it will be tested separately in an asynchronous way...
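
One possible way to run such a slow, blocking call asynchronously is sketched below. It is only an assumption about what an asynchronous test could look like: `slow_summarize` is a stand-in for a blocking call such as `summarizer.summarize(text)`, and the other names are hypothetical. Inside Jupyter, where an event loop is already running, `await main()` would be used instead of `asyncio.run(main())`.

```python
import asyncio
import time


def slow_summarize(text: str) -> str:
    """Stand-in for a blocking call such as summarizer.summarize(text)."""
    time.sleep(0.1)  # simulate a long generation
    return f"summary of {len(text.split())} words"


async def summarize_in_background(text: str) -> str:
    # asyncio.to_thread runs the blocking function in a worker thread,
    # so the event loop stays free for other tasks (Python 3.9+).
    return await asyncio.to_thread(slow_summarize, text)


async def main() -> list[str]:
    transcripts = ["first meeting transcript", "second one"]
    tasks = (summarize_in_background(t) for t in transcripts)
    return await asyncio.gather(*tasks)


results = asyncio.run(main())
print(results)
```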

#### GGUF

In [None]:
# TBD

---

### DISTILBART



In [None]:
DISTILBART = "sshleifer/distilbart-cnn-12-6"
# Example usage of a specialized distilled summarization model.
# It is not chat-aware, so we inject our system and user prompts as plain text
input_text = f"""
System: {default_system_message}
User: {default_user_prompt}
{transcription['text']}
"""

# Here we can just use summarization pipeline 
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
summary = summarizer(
    input_text,
    tokenizer="sshleifer/distilbart-cnn-12-6",
    max_length=2000,     # maximum tokens in summary
    do_sample=False,   # True → more diverse output, False → deterministic
    num_beams=4        # beam search size (higher → better quality, slower)

)


Device set to use mps:0
Token indices sequence length is longer than the specified maximum sequence length for this model (2658 > 1024). Running this sequence through the model will result in indexing errors


In [None]:
# Example: Using chunking with DistilBART for long transcriptions
print("📊 Checking transcription length...")
transcription_text = transcription['text']
word_count = len(transcription_text.split())
estimated_tokens = word_count * 0.75

print(f"📈 Transcription stats:")
print(f"   • Word count: {word_count:,}")
print(f"   • Estimated tokens: {estimated_tokens:,.0f}")
print(f"   • DistilBART max context: ~1024 tokens")

if estimated_tokens > 800:
    print("⚠️  Text exceeds recommended chunk size - chunking will be used")
else:
    print("✅ Text fits within context window")


📊 Checking transcription length...
📈 Transcription stats:
   • Word count: 2,202
   • Estimated tokens: 1,652
   • DistilBART max context: ~1024 tokens
⚠️  Text exceeds recommended chunk size - chunking will be used
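
Note that the tokenizer warning in the earlier cell reported 2658 tokens for the prompts plus this 2,202-word transcription, while the word-based heuristic estimates only 1,652, which suggests the `0.75` factor underestimates the real length. A more reliable check counts tokens with an actual encoder. The sketch below is a minimal version of that idea: `count_tokens` and `fits_context` are hypothetical helpers, and a whitespace split stands in for the real tokenizer so the example runs without downloading a model.

```python
from typing import Callable, Sequence


def count_tokens(text: str, encode: Callable[[str], Sequence]) -> int:
    """Count tokens using whatever encoder is supplied."""
    return len(encode(text))


def fits_context(text: str, encode: Callable[[str], Sequence],
                 max_tokens: int = 1024) -> bool:
    """Check whether the encoded text fits into the model's context window."""
    return count_tokens(text, encode) <= max_tokens


# Stand-in encoder so the sketch runs without a model download.
# With transformers installed, one could pass `tokenizer.encode` instead, where
# tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6").
whitespace_encode = str.split

sample = "word " * 1500
n_tokens = count_tokens(sample, whitespace_encode)
ok = fits_context(sample, whitespace_encode)
print(f"{n_tokens} tokens, fits in 1024-token context: {ok}")
```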


In [None]:

def chunk_text_simple(text: str, max_chunk_size: int = 800, overlap_size: int = 100) -> list[str]:
    """
    Split text into overlapping chunks suitable for model processing.
    
    Args:
        text: Input text to chunk
        max_chunk_size: Maximum tokens per chunk (conservative estimate using words)
        overlap_size: Number of tokens to overlap between chunks
    
    Returns:
        List of text chunks
    """
    # Simple word-based chunking; the 1.3 factor is a rough heuristic for turning the token budget into a word budget
    words = text.split()
    chunks = []
    
    # Convert token estimates to word counts
    max_words = int(max_chunk_size * 1.3)  # Conservative estimate
    overlap_words = int(overlap_size * 1.3)
    
    start = 0
    while start < len(words):
        # Calculate end position
        end = min(start + max_words, len(words))
        
        # Create chunk
        chunk = ' '.join(words[start:end])
        chunks.append(chunk)
        
        # Move start position (with overlap)
        if end >= len(words):
            break
        start = end - overlap_words
        
        # Ensure we make progress
        if start <= 0:
            start = max_words
    
    return chunks

def simple_pipeline_with_chunking(transcription_text, max_tokens=600):  # Reduced from 800
    """
    Simple approach using HuggingFace pipeline with automatic chunking
    """
    from transformers import pipeline
    
    # Calculate system/user prompt overhead
    prompt_overhead = f"""
        System: {default_system_message}
        User: {default_user_prompt}
        """
    prompt_tokens = len(prompt_overhead.split()) * 0.75
    print(f"📏 System/User prompts use ~{prompt_tokens:.0f} tokens")
    
    # Adjust available space for transcription content
    available_tokens = max_tokens - prompt_tokens - 50  # 50 token safety buffer
    print(f"📊 Available tokens for content: {available_tokens:.0f}")
    
    # Check transcription length
    estimated_tokens = len(transcription_text.split()) * 0.75
    print(f"📊 Transcription length: {estimated_tokens:.0f} estimated tokens")
    print(f"📏 DistilBART context limit: 1024 tokens")
    
    # Initialize pipeline
    summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
    
    if estimated_tokens <= available_tokens:
        print("✅ Text fits in available space, processing directly...")
        
        # Create input with system and user prompts (original approach)
        input_text = f"""
            System: {default_system_message}
            User: {default_user_prompt}
            {transcription_text}
            """
        
        # Double-check final length
        final_tokens = len(input_text.split()) * 0.75
        print(f"🔍 Final input length: {final_tokens:.0f} tokens")
        
        if final_tokens > 1000:  # Conservative limit
            print("⚠️  Still too long, forcing chunking...")
        else:
            result = summarizer(
                input_text,
                max_length=512,
                min_length=50,
                do_sample=False,
                num_beams=4
            )
            return result[0]['summary_text']
    
    print("⚠️  Text too long! Using chunking strategy...")
    
    # Use much smaller chunks to account for prompt overhead
    chunk_size = max(200, int(available_tokens * 0.8))  # Very conservative
    chunks = chunk_text_simple(transcription_text, max_chunk_size=chunk_size, overlap_size=50)
    print(f"📄 Split into {len(chunks)} chunks (max {chunk_size} tokens each + prompts)")
    
    # Summarize each chunk
    chunk_summaries = []
    for i, chunk in enumerate(chunks, 1):
        print(f"🔄 Processing chunk {i}/{len(chunks)}...")
        
        # Add system/user prompts to each chunk
        chunk_input = f"""
            System: {default_system_message}
            User: {default_user_prompt}
            {chunk}
            """
        
        # Verify chunk length before processing
        chunk_tokens = len(chunk_input.split()) * 0.75
        print(f"   📏 Chunk {i} length: {chunk_tokens:.0f} tokens")
        
        if chunk_tokens > 1000:
            print(f"   ⚠️  Chunk {i} still too long! Truncating...")
            # Truncate the chunk content (keep prompts, cut transcription)
            words = chunk.split()
            max_chunk_words = int((1000 - prompt_tokens - 50) * 1.3)  # Conservative
            truncated_chunk = ' '.join(words[:max_chunk_words])
            chunk_input = f"""
                System: {default_system_message}
                User: {default_user_prompt}
                {truncated_chunk}
                """
            print(f"   ✂️  Truncated to ~{len(chunk_input.split()) * 0.75:.0f} tokens")
        
        result = summarizer(
            chunk_input,
            max_length=150,  # Shorter for chunks
            min_length=30,
            do_sample=False,
            num_beams=4
        )
        
        chunk_summaries.append(result[0]['summary_text'])
        print(f"✅ Chunk {i} done")
    
    # Combine results
    if len(chunk_summaries) == 1:
        return chunk_summaries[0]
    
    print("🔗 Combining chunk summaries...")
    combined = "\n\n".join([f"Section {i+1}: {summary}" for i, summary in enumerate(chunk_summaries)])
    
    # If combined result is still too long, summarize it
    if len(combined.split()) * 0.75 > max_tokens:
        print("📋 Final summary step...")
        final_result = summarizer(
            f"Summarize this meeting: {combined}",
            max_length=512,
            min_length=100,
            do_sample=False,
            num_beams=4
        )
        return final_result[0]['summary_text']
    
    return combined

print("🔧 Testing simple pipeline with chunking...")
print("=" * 60)

fixed_summary = simple_pipeline_with_chunking(transcription['text'])

print("\n📋 Pipeline Result:")
print("=" * 40)
display(Markdown(fixed_summary))


🔧 Testing FIXED simple pipeline with chunking...
📏 System/User prompts use ~42 tokens
📊 Available tokens for content: 508
📊 Transcription length: 1652 estimated tokens
📏 DistilBART context limit: 1024 tokens


Device set to use mps:0


⚠️  Text too long! Using chunking strategy...
📄 Split into 5 chunks (max 406 tokens each + prompts)
🔄 Processing chunk 1/5...
   📏 Chunk 1 length: 437 tokens
✅ Chunk 1 done
🔄 Processing chunk 2/5...
   📏 Chunk 2 length: 437 tokens
✅ Chunk 2 done
🔄 Processing chunk 3/5...
   📏 Chunk 3 length: 437 tokens
✅ Chunk 3 done
🔄 Processing chunk 4/5...
   📏 Chunk 4 length: 437 tokens
✅ Chunk 4 done
🔄 Processing chunk 5/5...
   📏 Chunk 5 length: 308 tokens
✅ Chunk 5 done
🔗 Combining chunk summaries...

📋 FIXED Pipeline Result:


Section 1:  The back of the logo of the Denver City Council logo is considered water . Councilor Clark invites everyone down to the first ever Halloween parade on Broadway in Lucky District 7 . Proclamation number 17 is an observance of the second annual Indigenous peoples day in the city .

Section 2:  The council of the city and county of Denver recognizes that the indigenous peoples have lived and flourished on the lands known as the America since time and memorial . Denver and surrounding communities are built upon the ancestral homelands of numerous indigenous tribes, which include the southern Ute, the Ute Mountains, Ute tribes of Colorado .

Section 3:  Indigenous Indigenous Peoples Day is celebrated in Denver, Colorado, for the second time . Mayor: "We are celebrating indigenous people's day out of pride for who we are" Council member: "It's very important to be proud of who you're from"

Section 4:  Councilwoman Martega: "This day is not a day off, it's a day on in Denver, right? And addressing those critical issues" Councilwoman Caniche: "I'm very proud of today. Oh, and we made Time magazine and Newsweek once again today as a leader in terms of the cities that are celebrating indigenous peoples"

Section 5:  Councilwoman Artega: "I just wanted to say thank you many of the Native American peoples of Colorado have been at the forefront or actually nationally of defending some of the public lands that have been protected over the last few years that are under attack right now. And there are places that the communities have fought to protect, but that everyone gets to enjoy"

This approach is much faster, but the quality has to be improved.
Possibly one more aggregation step is needed to combine the chunk summaries into a single summary, but reusing the same model seems like a bad approach; the aggregation should be done by another model.
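
The extra aggregation step could be structured as a two-stage (map, then reduce) process in which the model for the final step is pluggable. The sketch below only shows the control flow under that assumption: `hierarchical_summary` is a hypothetical helper, and the two toy functions stand in for real models (for example, DistilBART for chunks and a chat model for aggregation).

```python
from typing import Callable


def hierarchical_summary(chunks: list[str],
                         summarize_chunk: Callable[[str], str],
                         aggregate: Callable[[str], str]) -> str:
    """Summarize each chunk, then let a second callable merge the partial summaries."""
    partials = [summarize_chunk(chunk) for chunk in chunks]
    if len(partials) == 1:
        return partials[0]
    combined = "\n\n".join(f"Section {i}: {s}" for i, s in enumerate(partials, 1))
    return aggregate(combined)


# Toy stand-ins so the sketch runs without any model
def first_sentence(text: str) -> str:
    return text.split(".")[0].strip() + "."


def join_sections(text: str) -> str:
    sections = [part.split(": ", 1)[1] for part in text.split("\n\n")]
    return " ".join(sections)


chunks = [
    "Council opened the meeting. Roll call followed.",
    "A proclamation was read. Members voted.",
]
final_summary = hierarchical_summary(chunks, first_sentence, join_sections)
print(final_summary)
```

Keeping the two stages as separate callables makes it easy to swap in a different model for the aggregation step without touching the chunking logic.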

---

## Results from UI test:

DistilBart results:
![distilbart result](distilbart-result.png)

Llama3.1 8B with optimum-quanto:
![llama quanto result](llama-quanto-result.png)

The results show that the significantly larger model copes with this task better, even with quantization.