# Podcast Notes, Summarization, and Translation using Phi‑4 Multimodal LLM with NVIDIA NIM Microservices

This notebook demonstrates a complete workflow using the [**Phi‑4 Multimodal LLM**](https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/) model. Below are some key model details (from the internal model card):

- **Parameters:** 5.6B
- **Inputs:** Text, Image, Audio
- **Context Length:** 128K tokens
- **Training Data:** 5T text tokens, 2.3M speech hours, 1.1T image-text tokens
- **Supported Languages:** Multilingual text and audio (e.g. English, Chinese, German, French, etc.)

In this tutorial, the Phi-4 LLM is served through NVIDIA NIM microservices, a set of easy-to-use inference microservices that accelerate the deployment of foundation models on any cloud or data center while helping to keep your data secure.

We will use the [preview NIM microservice API](https://build.nvidia.com/microsoft/phi-4-multimodal-instruct) for Phi-4, available through the [NVIDIA API Catalog](https://build.nvidia.com/microsoft).

This notebook covers:
1. A podcast notes and summarization use case (with long-audio chunking).
2. Translation of the transcript and summary into another language.

Ensure that you have the required dependencies (e.g. `pydub`, `requests`, `Pillow`) installed.

In [None]:
!pip install pydub requests Pillow
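
Note that `pip` installs only the Python packages. `pydub` relies on an external `ffmpeg` (or `avconv`) executable to decode compressed formats such as MP3. The check below is a minimal sketch; `has_ffmpeg` is a hypothetical helper name, not part of `pydub`.

```python
import shutil

def has_ffmpeg():
    """Return True if an ffmpeg or avconv executable is found on PATH."""
    return any(shutil.which(name) for name in ("ffmpeg", "avconv"))

if not has_ffmpeg():
    # pydub can still read plain WAV files without ffmpeg, but not MP3.
    print("ffmpeg not found: install it with your system package manager before loading MP3 audio.")
```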

## Utility Functions for Audio Processing

In this section, we define the helper functions used by the workflow: converting an audio segment to a base64-encoded string, splitting long audio into manageable chunks for transcription, and refining, summarizing, translating, and saving the resulting text.


### Function Overview

- **audio_to_base64**: Converts audio segments to base64-encoded strings for API transmission
- **generate_notes_chunk**: Processes individual audio chunks and generates transcription
- **generate_detailed_notes**: Handles long audio files by splitting them into manageable chunks
- **refine_transcription_to_notes**: Transforms raw transcription into well-formatted notes
- **summarize_notes**: Creates a concise summary from the detailed notes
- **translate_text**: Translates content to other languages while preserving formatting
- **save_text_to_file**: Exports results to text files
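
The chunking step in `generate_detailed_notes` comes down to simple interval arithmetic over the audio length in milliseconds. The sketch below shows that arithmetic on its own, without loading any audio; `chunk_bounds` is a hypothetical helper introduced only for illustration.

```python
def chunk_bounds(total_ms, chunk_ms=30000):
    """Return (start_ms, end_ms) pairs that cover [0, total_ms) in chunk_ms steps."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    # The last chunk is clipped to the end of the audio.
    return [(start, min(start + chunk_ms, total_ms))
            for start in range(0, total_ms, chunk_ms)]

# A 95-second clip yields three full 30-second chunks and one 5-second remainder.
print(chunk_bounds(95000))
# [(0, 30000), (30000, 60000), (60000, 90000), (90000, 95000)]
```

Each pair can be used directly as a slice on a `pydub` `AudioSegment`, since slicing an `AudioSegment` is indexed in milliseconds.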

In [25]:
import requests
import base64
from io import BytesIO
from pydub import AudioSegment

API_URL = "https://integrate.api.nvidia.com/v1/chat/completions"

def audio_to_base64(audio_segment):
    """Convert a pydub AudioSegment to a base64-encoded WAV string."""
    buffer = BytesIO()
    # Export as WAV so the bytes match the docstring and the data:audio/wav URI used in the prompt.
    audio_segment.export(buffer, format="wav")
    return base64.b64encode(buffer.getvalue()).decode()

def generate_notes_chunk(audio_chunk, api_key, chunk_index, total_chunks):
    """
    Transcribe a single audio chunk.

    The prompt asks the model for an accurate transcription without any extra commentary.
    """
    audio_b64 = audio_to_base64(audio_chunk)
    prompt = (
        "Transcribe the following audio accurately. "
        f"This is segment {chunk_index+1} of {total_chunks}. "
        "Please do not include any system commentary or self-referential text."
    )
    # Append the audio data (encoded in base64)
    prompt += f' <audio src="data:audio/wav;base64,{audio_b64}" />'
    
    headers = {"Authorization": f"Bearer {api_key}", "Accept": "application/json"}
    payload = {
        "model": "microsoft/phi-4-multimodal-instruct",
        "messages": [
            {
                "role": "system",
                "content": "You're a transcription assistant."
            },
            {"role": "user", "content": prompt}
        ],
        "max_tokens": 1024,
        "temperature": 0.1,
        "top_p": 0.7,
        "stream": False
    }
    
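    try:
        # Keep the request and JSON parsing inside the try block so network or
        # HTTP errors yield an error marker instead of an uncaught exception.
        # The 120-second timeout is an arbitrary choice; adjust as needed.
        response = requests.post(API_URL, headers=headers, json=payload, timeout=120)
        response.raise_for_status()
        text = response.json()["choices"][0]["message"]["content"].strip()
    except Exception as e:
        text = f"[Error generating notes for chunk {chunk_index+1}: {e}]"
    return text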

def generate_detailed_notes(audio_path, api_key, chunk_duration_ms=30000):
    """
    Process a long audio file by splitting it into chunks, transcribing each chunk,
    and concatenating the results into one raw transcription.
    """
    # Load audio file using pydub
    audio = AudioSegment.from_file(audio_path)
    total_duration = len(audio)
    total_chunks = (total_duration // chunk_duration_ms) + (1 if total_duration % chunk_duration_ms > 0 else 0)
    
    notes_chunks = []
    for i in range(total_chunks):
        start_ms = i * chunk_duration_ms
        end_ms = min((i+1) * chunk_duration_ms, total_duration)
        chunk = audio[start_ms:end_ms]
        print(f"Processing chunk {i+1}/{total_chunks} (from {start_ms}ms to {end_ms}ms)...")
        notes = generate_notes_chunk(chunk, api_key, i, total_chunks)
        notes_chunks.append(notes)
    
    transcription = "\n\n".join(notes_chunks)
    return transcription


def refine_transcription_to_notes(transcription, api_key):
    """
    Given a raw transcription, remove any system prompt lines and generate coherent, well-formatted detailed notes.
    """
    # Remove unwanted system phrases
    cleaned = transcription.replace("You are a helpful assistant.", "").replace("you are a helpful assistant.", "")
    
    # Build a prompt instructing the LLM to produce formatted notes
    prompt = (
        "Based on the transcription below, generate well-formatted, detailed notes that capture the main points. "
        "Organize the notes using bullet points or numbered lists for key points and separate paragraphs clearly for readability. "
        "Do not include any system commentary or self-referential text.\n\n"
        "Transcription:\n"
        f"{cleaned}"
    )
    
    headers = {"Authorization": f"Bearer {api_key}", "Accept": "application/json"}
    payload = {
        "model": "microsoft/phi-4-multimodal-instruct",
        "messages": [
            {"role": "system", "content": "You are a note-taking assistant. Provide only detailed, well-formatted notes."},
            {"role": "user", "content": prompt}
        ],
        "max_tokens": 1024,
        "temperature": 0.1,
        "top_p": 0.7,
        "stream": False
    }
    
    try:
        # Request and parsing share one try block; the timeout value is an arbitrary choice.
        response = requests.post(API_URL, headers=headers, json=payload, timeout=120)
        response.raise_for_status()
        notes = response.json()["choices"][0]["message"]["content"].strip()
    except Exception as e:
        notes = f"[Error generating refined notes: {e}]"
    return notes



def summarize_notes(notes, api_key):
    """
    Generate a concise summary based on the detailed notes.
    """
    prompt = (
        "Based on the detailed notes provided below, please generate a concise summary of the content. "
        "Do not include any extra commentary or self-referential text.\n\n"
        f"{notes}"
    )
    
    headers = {"Authorization": f"Bearer {api_key}", "Accept": "application/json"}
    payload = {
        "model": "microsoft/phi-4-multimodal-instruct",
        "messages": [
            {"role": "system", "content": "You are a summarization assistant. Provide only a concise summary."},
            {"role": "user", "content": prompt}
        ],
        "max_tokens": 512,
        "temperature": 0.1,
        "top_p": 0.7,
        "stream": False
    }
    
    try:
        # Request and parsing share one try block; the timeout value is an arbitrary choice.
        response = requests.post(API_URL, headers=headers, json=payload, timeout=120)
        response.raise_for_status()
        summary = response.json()["choices"][0]["message"]["content"].strip()
    except Exception as e:
        summary = f"[Error generating summary: {e}]"
    return summary


def translate_text(text, target_lang, api_key):
    """
    Translate the given text into the target language using the Phi‑4 API,
    preserving all bullet points, formatting, and structure.
    """
    prompt = (
        f"Translate the following text to {target_lang} exactly as it is, "
        "preserving all bullet points, formatting, and structure. Do not omit any sections.\n\n"
        f"{text}"
    )
    
    headers = {"Authorization": f"Bearer {api_key}", "Accept": "application/json"}
    payload = {
        "model": "microsoft/phi-4-multimodal-instruct",
        "messages": [
            {
                "role": "system",
                "content": "You are a translation assistant. Provide only the translated text with the original formatting preserved."
            },
            {"role": "user", "content": prompt}
        ],
        "max_tokens": 1500,
        "temperature": 0.1,
        "top_p": 0.7,
        "stream": False
    }
    
    try:
        # Request and parsing share one try block; the timeout value is an arbitrary choice.
        response = requests.post(API_URL, headers=headers, json=payload, timeout=120)
        response.raise_for_status()
        translated_text = response.json()["choices"][0]["message"]["content"].strip()
    except Exception as e:
        translated_text = f"[Error generating translation: {e}]"
    return translated_text


def save_text_to_file(text, filename):
    """Save the given text to a file."""
    with open(filename, "w", encoding="utf-8") as f:
        f.write(text)
    print(f"Text saved to {filename}")
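
Each helper above reads the reply text from the same nested path, `choices[0].message.content`. One way to avoid repeating that lookup is a small shared parser. This is a sketch only: `extract_content` is a hypothetical name, and the two sample dictionaries are made-up stand-ins for API replies, not recorded responses.

```python
def extract_content(result, label="response"):
    """Return the first choice's message text, or a bracketed error marker."""
    try:
        return result["choices"][0]["message"]["content"].strip()
    except (KeyError, IndexError, TypeError, AttributeError) as e:
        # Mirror the notebook's convention of returning an "[Error ...]" string.
        return f"[Error parsing {label}: {type(e).__name__}: {e}]"

# Illustrative inputs (assumed shapes, not real API output).
ok = {"choices": [{"message": {"content": "  Hello world.  "}}]}
bad = {"error": {"message": "Unauthorized"}}

print(extract_content(ok))
print(extract_content(bad, "summary"))
```

Catching only the lookup-related exceptions keeps unrelated bugs visible instead of folding them into the error string.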


## Workflow for Detailed Notes Generation


In [13]:
# Set your NVIDIA API key and podcast audio file path
API_KEY = "your_api_key_here"
podcast_audio_path = "podcast_audio.mp3"  # update with your file path

# Generate detailed notes from the audio file (with 30-second chunks by default)
print("Starting detailed note generation...")
transcription = generate_detailed_notes(podcast_audio_path, API_KEY, chunk_duration_ms=30000)
detailed_notes = refine_transcription_to_notes(transcription, API_KEY)

print("\n--- Detailed Notes ---\n")
print(detailed_notes)

# Generate a summary of the detailed notes
print("\nGenerating summary...")
summary = summarize_notes(detailed_notes, API_KEY)

print("\n--- Summary ---\n")
print(summary)

# Save detailed notes and summary to a text file
combined_text = detailed_notes + "\n\n--- SUMMARY ---\n\n" + summary
save_text_to_file(combined_text, "podcast_detailed_notes.txt")


Starting detailed note generation...
Processing chunk 1/54 (from 0ms to 30000ms)...
Processing chunk 2/54 (from 30000ms to 60000ms)...
Processing chunk 3/54 (from 60000ms to 90000ms)...
Processing chunk 4/54 (from 90000ms to 120000ms)...
Processing chunk 5/54 (from 120000ms to 150000ms)...
Processing chunk 6/54 (from 150000ms to 180000ms)...
Processing chunk 7/54 (from 180000ms to 210000ms)...
Processing chunk 8/54 (from 210000ms to 240000ms)...
Processing chunk 9/54 (from 240000ms to 270000ms)...
Processing chunk 10/54 (from 270000ms to 300000ms)...
Processing chunk 11/54 (from 300000ms to 330000ms)...
Processing chunk 12/54 (from 330000ms to 360000ms)...
Processing chunk 13/54 (from 360000ms to 390000ms)...
Processing chunk 14/54 (from 390000ms to 420000ms)...
Processing chunk 15/54 (from 420000ms to 450000ms)...
Processing chunk 16/54 (from 450000ms to 480000ms)...
Processing chunk 17/54 (from 480000ms to 510000ms)...
Processing chunk 18/54 (from 510000ms to 540000ms)...
Processing 

## Translation

The following cell translates the combined notes and summary into another language (e.g., Spanish).

In [26]:
target_language = "Spanish"
print(f"\nTranslating transcript and summary to {target_language}...")
translated_text = translate_text(combined_text, target_language, API_KEY)

print("\n--- Translated Text ---\n")
print(translated_text)

# Save the translated text to a file
save_text_to_file(translated_text, "podcast_transcription_translated.txt")


Translating transcript and summary to Spanish...

--- Translated Text ---

- El podcast de Nvidia AI, presentado por Noah Kravitz, discute el impacto de los humanos digitales y agentes de IA, destacando la colaboración entre InWorld AI, Nvidia y Streamlabs en CES 2025.
- Chris Covert, director de experiencias de producto en InWorld AI, enfatiza la misión de la empresa de hacer la IA accesible y aborda los desafíos en la implementación de la IA, especialmente en el juego y entretenimiento.
- El asistente de transmisión inteligente de InWorld AI, mostrado en CES, es un agente de IA que proporciona comentarios en tiempo real y apoyo para los streamers, aprovechando la tecnología de Nvidia y las capacidades de IA generativa de Inworld.
- Chris Covert comparte sus conocimientos sobre la evolución de la IA, desde simples chatbots hasta agentes autónomos capaces de interacciones y toma de decisiones complejas.
- La conversación toca la importancia del diseño centrado en el humano en la IA, l
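
`translate_text` caps each reply at `max_tokens=1500`, so a long combined text may not fit in a single response. One possible workaround, sketched here as an assumption and not part of the original workflow, is to translate the text in smaller pieces and join the results. `split_paragraphs` and its `max_chars` budget are hypothetical names and values chosen for illustration.

```python
def split_paragraphs(text, max_chars=2000):
    """Group blank-line-separated paragraphs into pieces of at most max_chars.

    A single paragraph longer than max_chars is kept whole, not cut.
    """
    pieces, current = [], ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            pieces.append(current)   # close the current piece
            current = para           # start a new one with this paragraph
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces

# Possible usage with the notebook's helper (needs a valid API key):
# translated = "\n\n".join(
#     translate_text(piece, target_language, API_KEY)
#     for piece in split_paragraphs(combined_text)
# )
```

Splitting on paragraph boundaries keeps bullet lists and headings intact within each request, which matters because the prompt asks the model to preserve formatting.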

## Troubleshooting

If you encounter issues with the API, here are some common problems and solutions:

1. **Payload Too Large**: Reduce the amount of audio sent per request, for example by lowering `chunk_duration_ms`
2. **Authentication Errors**: Verify your API key is correct and has the necessary permissions
3. **Model Not Available**: Check that you're using a valid model name for your account tier
4. **Format Issues**: Ensure your media files are in supported formats
5. **Rate Limiting**: If processing many chunks, add delays between API calls to avoid rate limits
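
For the rate-limiting case, the delay between calls can be added with a small retry wrapper. The sketch below assumes failures show up as the `[Error ...]` strings that the helper functions in this notebook return; `call_with_backoff`, its parameters, and the delay values are hypothetical choices, not something the API prescribes.

```python
import time

def call_with_backoff(fn, is_failure, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() and retry with exponentially growing delays while is_failure(result) is True."""
    result = fn()
    for attempt in range(max_retries):
        if not is_failure(result):
            break
        sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ... with the defaults
        result = fn()
    return result

# Possible usage inside the chunk loop (needs a valid API key):
# notes = call_with_backoff(
#     lambda: generate_notes_chunk(chunk, api_key, i, total_chunks),
#     is_failure=lambda text: text.startswith("[Error"),
# )
```

Passing `sleep` as a parameter makes the wrapper easy to exercise without real waiting, and the last result is returned even if every attempt failed, so the caller still sees the error marker.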