# 🎵 Audio AI: Text-to-Speech, Voice Processing, and Audio Generation

**Author:** Martin Švanda, Praut s.r.o.  
**Notebook:** 20/20 - Final notebook of the series

---

## What you will learn

1. **Text-to-Speech (TTS)** - Generating speech from text with modern models
2. **Voice Cloning** - Cloning a voice from short samples
3. **Audio Processing** - Processing and analyzing audio data
4. **Music Generation** - Generating music with AI
5. **Production Audio Pipeline** - An end-to-end solution

---

## 🔧 Installation and Setup

In [None]:
# Install the required libraries
!pip install -q transformers torch torchaudio
!pip install -q datasets soundfile librosa scipy
!pip install -q accelerate sentencepiece

# For advanced TTS (optional; the install is allowed to fail)
!pip install -q TTS 2>/dev/null || echo "TTS install optional"

# For audio visualization
!pip install -q matplotlib numpy

In [None]:
import torch
import torch.nn as nn
import numpy as np
import soundfile as sf
import io
import json
import warnings
from typing import Dict, List, Optional, Tuple, Any, Union
from dataclasses import dataclass, field
from collections import defaultdict
import matplotlib.pyplot as plt
from IPython.display import Audio, display
warnings.filterwarnings('ignore')

# Device detection
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"🖥️ Using device: {device}")
if torch.cuda.is_available():
    print(f"   GPU: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

---

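## 🗣️ Part 1: Text-to-Speech with SpeechT5

SpeechT5 is Microsoft's unified model for speech and text tasks. For TTS it predicts a mel spectrogram from the input text and a speaker embedding, and a separate HiFi-GAN vocoder converts that spectrogram into a waveform.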

In [None]:
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from datasets import load_dataset

class SpeechT5TTS:
    """Text-to-Speech engine based on SpeechT5."""
    
    def __init__(self):
        """
        Initializes the SpeechT5 TTS system.
        """
        print("📥 Loading SpeechT5 models...")
        
        # Processor for text tokenization
        self.processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
        
        # TTS model
        self.model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
        self.model.to(device)
        self.model.eval()
        
        # Vocoder that turns the mel spectrogram into a waveform
        self.vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
        self.vocoder.to(device)
        self.vocoder.eval()
        
        # Load speaker embeddings (x-vectors); index 7306 picks one speaker
        print("📥 Loading speaker embeddings...")
        embeddings_dataset = load_dataset(
            "Matthijs/cmu-arctic-xvectors", 
            split="validation"
        )
        self.speaker_embeddings = torch.tensor(
            embeddings_dataset[7306]["xvector"]
        ).unsqueeze(0).to(device)
        
        # Output sample rate in Hz
        self.sample_rate = 16000
        
        print("✅ SpeechT5 TTS ready")
    
    def synthesize(self, text: str, 
                   speaker_embedding: Optional[torch.Tensor] = None) -> np.ndarray:
        """
        Synthesizes speech from text.
        
        Args:
            text: Text to synthesize
            speaker_embedding: Optional speaker embedding to change the voice
        
        Returns:
            Audio waveform as a numpy array
        """
        # Use the provided embedding, or fall back to the default one
        embedding = speaker_embedding if speaker_embedding is not None else self.speaker_embeddings
        
        # Tokenize the text
        inputs = self.processor(text=text, return_tensors="pt")
        inputs = {k: v.to(device) for k, v in inputs.items()}
        
        # Generate speech; because a vocoder is passed, the result is
        # already a waveform rather than a mel spectrogram
        with torch.no_grad():
            speech = self.model.generate_speech(
                inputs["input_ids"],
                embedding,
                vocoder=self.vocoder
            )
        
        return speech.cpu().numpy()
    
    def synthesize_long(self, text: str, 
                        max_chunk_length: int = 500) -> np.ndarray:
        """
        Synthesizes a longer text by splitting it into chunks.
        
        Args:
            text: Long text to synthesize
            max_chunk_length: Maximum chunk length in characters
        
        Returns:
            The joined audio waveform
        """
        # Split into sentences
        sentences = self._split_text(text)
        
        # Pack whole sentences into chunks
        chunks = []
        current_chunk = ""
        
        for sentence in sentences:
            if len(current_chunk) + len(sentence) <= max_chunk_length:
                current_chunk += sentence + " "
            else:
                if current_chunk:
                    chunks.append(current_chunk.strip())
                current_chunk = sentence + " "
        
        if current_chunk:
            chunks.append(current_chunk.strip())
        
        # Synthesize each chunk, with a short pause between chunks
        pause = np.zeros(int(self.sample_rate * 0.3), dtype=np.float32)
        audio_parts = []
        for i, chunk in enumerate(chunks):
            if i > 0:
                audio_parts.append(pause)
            audio_parts.append(self.synthesize(chunk))
        
        # np.concatenate raises on an empty list, so handle empty input explicitly
        if not audio_parts:
            return np.zeros(0, dtype=np.float32)
        return np.concatenate(audio_parts)
    
    def _split_text(self, text: str) -> List[str]:
        """Splits text into sentences."""
        import re
        # Split on whitespace that follows sentence-ending punctuation
        sentences = re.split(r'(?<=[.!?])\s+', text)
        return [s.strip() for s in sentences if s.strip()]
    
    def save_audio(self, audio: np.ndarray, filepath: str):
        """Saves audio to a file."""
        sf.write(filepath, audio, self.sample_rate)
        print(f"💾 Audio saved: {filepath}")
    
    def play_audio(self, audio: np.ndarray):
        """Plays audio in the notebook."""
        display(Audio(audio, rate=self.sample_rate))

print("\n" + "="*60)
print("SpeechT5TTS Engine")
print("="*60)
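
The sentence-packing step inside `synthesize_long` is easier to reason about in isolation. The sketch below reimplements only that step as a standalone function. `pack_sentences` is a hypothetical helper name, not part of the class above, and the character-based limit mirrors `max_chunk_length`.

```python
import re
from typing import List

def pack_sentences(text: str, max_chunk_length: int = 500) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chunk_length characters.

    A single sentence longer than the limit is kept as its own oversized chunk.
    """
    # Split on whitespace that follows sentence-ending punctuation
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chunk_length:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

chunks = pack_sentences("One. Two! Three? Four.", max_chunk_length=10)
print(chunks)
```

Because sentences are never split, the limit is a soft one: very long sentences still reach the model in a single call.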

In [None]:
# Initialize TTS
tts_engine = SpeechT5TTS()

In [None]:
# Basic synthesis test
test_text = "Hello, welcome to Praut AI solutions. We specialize in artificial intelligence and automation."

print(f"📝 Text: {test_text}")
print("🎤 Generating audio...")

audio = tts_engine.synthesize(test_text)

print(f"✅ Audio generated: {len(audio)} samples ({len(audio)/tts_engine.sample_rate:.2f}s)")

# Plot the waveform
plt.figure(figsize=(12, 4))
plt.plot(audio)
plt.title("Waveform of the synthesized audio")
plt.xlabel("Samples")
plt.ylabel("Amplitude")
plt.tight_layout()
plt.show()

# Playback
tts_engine.play_audio(audio)

In [None]:
# Longer text test
long_text = """
Artificial intelligence is transforming how businesses operate. 
At Praut, we help companies integrate AI into their workflows.
Our solutions include automated document processing, intelligent chatbots, and predictive analytics.
Contact us today to learn how AI can benefit your organization.
"""

print("📝 Long text:")
print(long_text[:100] + "...")
print("\n🎤 Generating audio for the long text...")

long_audio = tts_engine.synthesize_long(long_text)

print(f"✅ Audio generated: {len(long_audio)/tts_engine.sample_rate:.2f}s")
tts_engine.play_audio(long_audio)

---

## 🎭 Part 2: BARK - Advanced TTS with Emotions

BARK is a generative audio model from Suno AI that supports emotions, laughter, and non-verbal sounds.

In [None]:
from transformers import AutoProcessor, BarkModel

class BarkTTS:
    """Advanced TTS with emotion support, based on the BARK model."""
    
    # Available voice presets. The male/female/narrator labels are this
    # notebook's own naming; check them against the BARK speaker library
    # before relying on them.
    VOICE_PRESETS = {
        'male_1': 'v2/en_speaker_6',
        'male_2': 'v2/en_speaker_9',
        'female_1': 'v2/en_speaker_1',
        'female_2': 'v2/en_speaker_3',
        'narrator': 'v2/en_speaker_0',
    }
    
    # Special tags for emotions and non-verbal sounds
    EMOTION_TAGS = {
        'laugh': '[laughter]',
        'sigh': '[sighs]',
        'gasp': '[gasps]',
        'clear_throat': '[clears throat]',
        'pause': '...',
        'music': '♪',
    }
    
    def __init__(self, model_name: str = "suno/bark-small"):
        """
        Args:
            model_name: Model name (suno/bark-small or suno/bark)
        """
        print(f"📥 Loading BARK model: {model_name}")
        
        self.processor = AutoProcessor.from_pretrained(model_name)
        # Half precision on GPU to save memory, full precision on CPU
        self.model = BarkModel.from_pretrained(
            model_name,
            torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
        )
        self.model.to(device)
        self.model.eval()
        
        self.sample_rate = 24000  # BARK outputs 24 kHz audio
        
        print("✅ BARK model loaded")
    
    def synthesize(self, text: str, 
                   voice_preset: str = 'narrator',
                   add_silence: bool = True) -> np.ndarray:
        """
        Synthesizes speech, optionally with emotion tags.
        
        Args:
            text: Text to synthesize (may contain emotion tags)
            voice_preset: Key from VOICE_PRESETS or a raw BARK preset name
            add_silence: Append a short silence at the end
        
        Returns:
            Audio waveform
        """
        # Resolve the voice preset (unknown keys are passed through as-is)
        preset = self.VOICE_PRESETS.get(voice_preset, voice_preset)
        
        # Prepare the inputs
        inputs = self.processor(text, voice_preset=preset)
        inputs = {k: v.to(device) for k, v in inputs.items()}
        
        # Generate
        with torch.no_grad():
            audio_array = self.model.generate(**inputs)
        
        # Cast to float32 so the result stays usable when the model runs in float16
        audio = audio_array.float().cpu().numpy().squeeze()
        
        # Append silence at the end
        if add_silence:
            silence = np.zeros(int(self.sample_rate * 0.25), dtype=np.float32)
            audio = np.concatenate([audio, silence])
        
        return audio
    
    def synthesize_with_emotion(self, text: str, 
                                emotion: str,
                                voice_preset: str = 'narrator') -> np.ndarray:
        """
        Synthesizes text with an added emotion tag.
        
        Args:
            text: Text to synthesize
            emotion: Emotion type (laugh, sigh, gasp, etc.)
            voice_preset: Voice preset
        
        Returns:
            Audio waveform
        """
        # Unknown emotions map to an empty tag and leave the text unchanged
        emotion_tag = self.EMOTION_TAGS.get(emotion, '')
        
        # Append the emotion tag after the text
        if emotion_tag:
            enhanced_text = f"{text} {emotion_tag}"
        else:
            enhanced_text = text
        
        return self.synthesize(enhanced_text, voice_preset)
    
    def synthesize_dialogue(self, dialogue: List[Dict[str, str]]) -> np.ndarray:
        """
        Synthesizes a dialogue between several speakers.
        
        Args:
            dialogue: List of turns [{"speaker": "male_1", "text": "..."}];
                each turn may also carry an optional "emotion" key
        
        Returns:
            The joined audio waveform
        """
        audio_parts = []
        
        for turn in dialogue:
            speaker = turn.get('speaker', 'narrator')
            text = turn.get('text', '')
            emotion = turn.get('emotion', None)
            
            if emotion:
                audio = self.synthesize_with_emotion(text, emotion, speaker)
            else:
                audio = self.synthesize(text, speaker)
            
            audio_parts.append(audio)
            
            # Pause between turns
            pause = np.zeros(int(self.sample_rate * 0.5))
            audio_parts.append(pause)
        
        return np.concatenate(audio_parts)
    
    def play_audio(self, audio: np.ndarray):
        """Plays audio in the notebook."""
        display(Audio(audio, rate=self.sample_rate))
    
    def list_voices(self):
        """Prints the available voices and emotion tags."""
        print("🎭 Available voice presets:")
        for name, preset in self.VOICE_PRESETS.items():
            print(f"   {name}: {preset}")
        print("\n🎭 Available emotion tags:")
        for name, tag in self.EMOTION_TAGS.items():
            print(f"   {name}: {tag}")

print("\n" + "="*60)
print("BARK TTS Engine")
print("="*60)
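
`synthesize_dialogue` comes down to joining waveforms with silence between turns. Here is a minimal sketch of that step, using synthetic arrays in place of BARK output. `join_with_pauses` is a hypothetical helper, and the 24 kHz rate is taken from the class above. Unlike the method, this variant places silence only between turns, not after the last one.

```python
import numpy as np
from typing import List

def join_with_pauses(parts: List[np.ndarray], sample_rate: int,
                     pause_seconds: float = 0.5) -> np.ndarray:
    """Concatenate waveforms, inserting silence between (not after) them."""
    if not parts:
        return np.zeros(0, dtype=np.float32)
    pause = np.zeros(int(sample_rate * pause_seconds), dtype=np.float32)
    joined = []
    for i, part in enumerate(parts):
        if i > 0:
            joined.append(pause)
        # Use one dtype throughout so the result is not silently upcast
        joined.append(part.astype(np.float32))
    return np.concatenate(joined)

sample_rate = 24000
turns = [np.ones(sample_rate, dtype=np.float32),
         np.ones(sample_rate // 2, dtype=np.float32)]
joined_audio = join_with_pauses(turns, sample_rate)
print(len(joined_audio) / sample_rate)  # 1.0 s + 0.5 s pause + 0.5 s = 2.0
```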

In [None]:
# Initialize BARK
bark_tts = BarkTTS("suno/bark-small")
bark_tts.list_voices()

In [None]:
# Test BARK with emotions
print("\n🎭 Testing BARK with different voices and emotions:")
print("-" * 50)

# Basic synthesis
text = "Welcome to Praut AI. We make artificial intelligence accessible for everyone."
print(f"📝 Text: {text}")

audio = bark_tts.synthesize(text, voice_preset='narrator')
print(f"✅ Audio generated ({len(audio)/bark_tts.sample_rate:.2f}s)")
bark_tts.play_audio(audio)

In [None]:
# Dialogue test
dialogue = [
    {"speaker": "female_1", "text": "Have you tried the new AI automation system?"},
    {"speaker": "male_1", "text": "Yes! It saved us so much time.", "emotion": "laugh"},
    {"speaker": "female_1", "text": "That's amazing. I should implement it too."},
]

print("\n🎭 Generating dialogue:")
for turn in dialogue:
    print(f"   [{turn['speaker']}]: {turn['text']}")

dialogue_audio = bark_tts.synthesize_dialogue(dialogue)
print(f"\n✅ Dialogue generated ({len(dialogue_audio)/bark_tts.sample_rate:.2f}s)")
bark_tts.play_audio(dialogue_audio)

---

## 🎼 Part 3: Audio Processing and Analysis

Tools for processing, analyzing, and transforming audio data.

In [None]:
import librosa
import librosa.display
from scipy import signal
from scipy.io import wavfile

class AudioProcessor:
    """General-purpose tool for audio processing."""
    
    def __init__(self, sample_rate: int = 16000):
        """
        Args:
            sample_rate: Default sample rate in Hz
        """
        self.sample_rate = sample_rate
    
    def load_audio(self, filepath: str, 
                   target_sr: Optional[int] = None) -> Tuple[np.ndarray, int]:
        """
        Loads an audio file.
        
        Args:
            filepath: Path to the file
            target_sr: Target sample rate (None = keep the original)
        
        Returns:
            Tuple (audio_data, sample_rate)
        """
        audio, sr = librosa.load(filepath, sr=target_sr)
        return audio, sr
    
    def save_audio(self, audio: np.ndarray, filepath: str, 
                   sample_rate: Optional[int] = None):
        """Saves audio to a file."""
        sr = sample_rate or self.sample_rate
        sf.write(filepath, audio, sr)
    
    def resample(self, audio: np.ndarray, 
                 orig_sr: int, target_sr: int) -> np.ndarray:
        """Resamples audio to a different sample rate."""
        return librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr)
    
    def normalize(self, audio: np.ndarray, 
                  target_db: float = -20.0) -> np.ndarray:
        """Normalizes loudness to a target RMS level in dB relative to full scale."""
        # Current RMS
        rms = np.sqrt(np.mean(audio**2))
        if rms == 0:
            return audio
        
        # Target RMS (linear amplitude)
        target_rms = 10 ** (target_db / 20)
        
        # Scale
        return audio * (target_rms / rms)
    
    def trim_silence(self, audio: np.ndarray, 
                     top_db: int = 20) -> np.ndarray:
        """Trims silence from the beginning and end."""
        trimmed, _ = librosa.effects.trim(audio, top_db=top_db)
        return trimmed
    
    def apply_noise_reduction(self, audio: np.ndarray, 
                               noise_factor: float = 0.1) -> np.ndarray:
        """
        Simple noise reduction using spectral gating.
        
        Args:
            audio: Input audio
            noise_factor: Factor for the noise threshold (0-1)
        
        Returns:
            Audio with reduced noise
        """
        # STFT
        stft = librosa.stft(audio)
        magnitude = np.abs(stft)
        phase = np.angle(stft)
        
        # Estimate the noise threshold from the mean magnitude
        noise_threshold = np.mean(magnitude) * noise_factor
        
        # Spectral gating: attenuate bins below the threshold by 20 dB
        magnitude_cleaned = np.where(
            magnitude > noise_threshold, 
            magnitude, 
            magnitude * 0.1
        )
        
        # Reconstruct with the original phase, keeping the input length
        stft_cleaned = magnitude_cleaned * np.exp(1j * phase)
        audio_cleaned = librosa.istft(stft_cleaned, length=len(audio))
        
        return audio_cleaned
    
    def change_speed(self, audio: np.ndarray, 
                     speed_factor: float) -> np.ndarray:
        """
        Changes the audio speed without changing the pitch.
        
        Args:
            audio: Input audio
            speed_factor: Speed factor (>1 = faster, <1 = slower)
        
        Returns:
            Audio with adjusted speed
        """
        return librosa.effects.time_stretch(audio, rate=speed_factor)
    
    def change_pitch(self, audio: np.ndarray, 
                     semitones: float,
                     sample_rate: Optional[int] = None) -> np.ndarray:
        """
        Changes the pitch of the audio.
        
        Args:
            audio: Input audio
            semitones: Number of semitones (positive raises, negative lowers)
            sample_rate: Sample rate of the audio
        
        Returns:
            Audio with adjusted pitch
        """
        sr = sample_rate or self.sample_rate
        return librosa.effects.pitch_shift(audio, sr=sr, n_steps=semitones)
    
    def extract_features(self, audio: np.ndarray,
                         sample_rate: Optional[int] = None) -> Dict[str, Any]:
        """
        Extracts audio features.
        
        Returns:
            Dict with various audio features
        """
        sr = sample_rate or self.sample_rate
        
        features = {}
        
        # Basic statistics
        features['duration'] = len(audio) / sr
        features['rms'] = float(np.sqrt(np.mean(audio**2)))
        features['zero_crossing_rate'] = float(np.mean(librosa.feature.zero_crossing_rate(audio)))
        
        # Spectral features
        spectral_centroids = librosa.feature.spectral_centroid(y=audio, sr=sr)[0]
        features['spectral_centroid_mean'] = float(np.mean(spectral_centroids))
        features['spectral_centroid_std'] = float(np.std(spectral_centroids))
        
        # Spectral bandwidth
        spectral_bandwidth = librosa.feature.spectral_bandwidth(y=audio, sr=sr)[0]
        features['spectral_bandwidth_mean'] = float(np.mean(spectral_bandwidth))
        
        # MFCCs (mean of each of the 13 coefficients over time)
        mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
        features['mfcc_means'] = [float(np.mean(mfcc)) for mfcc in mfccs]
        
        # Chroma features
        chroma = librosa.feature.chroma_stft(y=audio, sr=sr)
        features['chroma_mean'] = float(np.mean(chroma))
        
        # Tempo (depending on the librosa version this is a scalar or a 1-element array)
        try:
            tempo, _ = librosa.beat.beat_track(y=audio, sr=sr)
            features['tempo'] = float(np.atleast_1d(tempo)[0])
        except Exception:
            features['tempo'] = None
        
        return features
    
    def visualize(self, audio: np.ndarray, 
                  sample_rate: Optional[int] = None,
                  title: str = "Audio Analysis"):
        """
        Visualizes audio (waveform, spectrogram, mel spectrogram).
        """
        sr = sample_rate or self.sample_rate
        
        fig, axes = plt.subplots(3, 1, figsize=(14, 10))
        
        # Waveform
        librosa.display.waveshow(audio, sr=sr, ax=axes[0])
        axes[0].set_title(f"{title} - Waveform")
        axes[0].set_xlabel("Time (s)")
        axes[0].set_ylabel("Amplitude")
        
        # Spectrogram
        D = librosa.amplitude_to_db(np.abs(librosa.stft(audio)), ref=np.max)
        librosa.display.specshow(D, sr=sr, x_axis='time', y_axis='hz', ax=axes[1])
        axes[1].set_title(f"{title} - Spectrogram")
        axes[1].set_xlabel("Time (s)")
        axes[1].set_ylabel("Frequency (Hz)")
        
        # Mel Spectrogram
        mel_spec = librosa.feature.melspectrogram(y=audio, sr=sr)
        mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)
        librosa.display.specshow(mel_spec_db, sr=sr, x_axis='time', y_axis='mel', ax=axes[2])
        axes[2].set_title(f"{title} - Mel Spectrogram")
        axes[2].set_xlabel("Time (s)")
        axes[2].set_ylabel("Mel Frequency")
        
        plt.tight_layout()
        plt.show()

# Initialization
audio_processor = AudioProcessor(sample_rate=16000)

print("\n" + "="*60)
print("AudioProcessor ready")
print("="*60)
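
The `normalize` method treats `target_db` as an RMS level relative to full scale, so a target of -20 dB corresponds to an RMS of 10^(-20/20) = 0.1. The sketch below checks that arithmetic on a synthetic sine wave using NumPy only. The function name `rms_normalize` is illustrative and not part of the class.

```python
import numpy as np

def rms_normalize(audio: np.ndarray, target_db: float = -20.0) -> np.ndarray:
    """Scale audio so its RMS equals 10 ** (target_db / 20)."""
    rms = np.sqrt(np.mean(audio ** 2))
    if rms == 0:
        return audio  # silence cannot be scaled to a target level
    return audio * (10 ** (target_db / 20) / rms)

# One second of a 440 Hz sine at 16 kHz with peak amplitude 0.8
t = np.arange(16000) / 16000
sine = 0.8 * np.sin(2 * np.pi * 440 * t)

normalized = rms_normalize(sine)
new_rms = float(np.sqrt(np.mean(normalized ** 2)))
new_peak = float(np.max(np.abs(normalized)))
print(f"RMS: {new_rms:.4f}, peak: {new_peak:.4f}")
```

RMS normalization does not bound the peak. A quiet recording scaled up this way can exceed 1.0 and clip, so a peak check is a sensible companion step.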

In [None]:
# Apply the audio processor to synthesized audio
print("\n🔊 Analysis of the synthesized audio:")
print("-" * 50)

# Use audio from the TTS engine
test_audio = tts_engine.synthesize("Hello, this is a test of audio processing capabilities.")

# Feature extraction
features = audio_processor.extract_features(test_audio, tts_engine.sample_rate)

print("\n📊 Audio Features:")
print(f"   Duration: {features['duration']:.2f}s")
print(f"   RMS: {features['rms']:.4f}")
print(f"   Zero Crossing Rate: {features['zero_crossing_rate']:.4f}")
print(f"   Spectral Centroid: {features['spectral_centroid_mean']:.1f} Hz")
print(f"   Spectral Bandwidth: {features['spectral_bandwidth_mean']:.1f} Hz")
if features['tempo'] is not None:
    print(f"   Tempo: {features['tempo']:.1f} BPM")

# Visualization
audio_processor.visualize(test_audio, tts_engine.sample_rate, "TTS Audio")
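
To make the "spectral centroid" figure above less abstract: it is the magnitude-weighted mean frequency of the spectrum. The sketch below computes it for a whole signal with NumPy's FFT. This is a simplification, since librosa computes the value per short frame and the notebook then averages those frames. `spectral_centroid_hz` is a hypothetical helper name.

```python
import numpy as np

def spectral_centroid_hz(audio: np.ndarray, sample_rate: int) -> float:
    """Magnitude-weighted mean frequency of the whole signal, in Hz."""
    magnitude = np.abs(np.fft.rfft(audio))
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sample_rate)
    total = magnitude.sum()
    if total == 0:
        return 0.0  # silence has no meaningful centroid
    return float((freqs * magnitude).sum() / total)

# A pure 1 kHz tone should have its centroid at about 1 kHz
rate = 16000
t = np.arange(rate) / rate
tone = np.sin(2 * np.pi * 1000 * t)
centroid = spectral_centroid_hz(tone, rate)
print(f"{centroid:.1f} Hz")
```

Speech usually shows a higher and more variable centroid than a pure tone, because consonants add broadband high-frequency energy.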

In [None]:
# Demonstration of audio transformations
print("\n🔄 Demonstration of audio transformations:")
print("-" * 50)

# Original audio
original_audio = test_audio.copy()
sr = tts_engine.sample_rate

# 1. Speed change
faster_audio = audio_processor.change_speed(original_audio, speed_factor=1.5)
slower_audio = audio_processor.change_speed(original_audio, speed_factor=0.75)

print(f"\n▶️ Faster version (1.5x): {len(faster_audio)/sr:.2f}s")
display(Audio(faster_audio, rate=sr))

print(f"\n⏸️ Slower version (0.75x): {len(slower_audio)/sr:.2f}s")
display(Audio(slower_audio, rate=sr))
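
The durations printed above follow a simple rule: time stretching by a rate `r` divides the duration by `r`. A small sketch of that arithmetic follows; `stretched_duration` is a hypothetical helper, and real output lengths can differ by a few samples because of framing.

```python
def stretched_duration(duration_s: float, rate: float) -> float:
    """Expected duration after time stretching by `rate` (>1 is faster)."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return duration_s / rate

faster = stretched_duration(3.0, 1.5)    # 3 s played 1.5x faster
slower = stretched_duration(3.0, 0.75)   # 3 s played at 0.75x speed
print(faster, slower)
```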

In [None]:
# 2. Pitch change
higher_pitch = audio_processor.change_pitch(original_audio, semitones=4, sample_rate=sr)
lower_pitch = audio_processor.change_pitch(original_audio, semitones=-4, sample_rate=sr)

print("\n🔼 Higher voice (+4 semitones):")
display(Audio(higher_pitch, rate=sr))

print("\n🔽 Lower voice (-4 semitones):")
display(Audio(lower_pitch, rate=sr))
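
A shift in semitones maps to a frequency ratio of 2^(n/12) in 12-tone equal temperament, which is how the +4 and -4 values above translate into audible pitch. A short worked example follows; `semitone_ratio` is an illustrative name.

```python
def semitone_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of `semitones` in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)

up = semitone_ratio(4)        # a major third up, roughly 26% higher
down = semitone_ratio(-4)     # the inverse of the ratio above
octave = semitone_ratio(12)   # twelve semitones double the frequency
print(f"+4: {up:.4f}, -4: {down:.4f}, +12: {octave:.1f}")
```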

---

## 🎵 Part 4: MusicGen - Music Generation

MusicGen from Meta AI generates music from text descriptions.

In [None]:
from transformers import AutoProcessor, MusicgenForConditionalGeneration

class MusicGenerator:
    """Music generator based on the MusicGen model."""
    
    def __init__(self, model_name: str = "facebook/musicgen-small"):
        """
        Args:
            model_name: Model name (small/medium/large variant)
        """
        print(f"📥 Loading MusicGen model: {model_name}")
        
        self.processor = AutoProcessor.from_pretrained(model_name)
        self.model = MusicgenForConditionalGeneration.from_pretrained(model_name)
        self.model.to(device)
        self.model.eval()
        
        # Read the sample rate from the audio encoder config
        self.sample_rate = self.model.config.audio_encoder.sampling_rate
        
        print(f"✅ MusicGen loaded (sample rate: {self.sample_rate} Hz)")
    
    def generate(self, prompt: str, 
                 duration_seconds: float = 8.0,
                 guidance_scale: float = 3.0) -> np.ndarray:
        """
        Generates music from a text description.
        
        Args:
            prompt: Text description of the desired music
            duration_seconds: Length in seconds
            guidance_scale: How strongly the text steers the generation
        
        Returns:
            Audio waveform
        """
        # Prepare the inputs
        inputs = self.processor(
            text=[prompt],
            padding=True,
            return_tensors="pt"
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        
        # Compute max_new_tokens for the requested length
        # MusicGen generates ~50 tokens per second of audio
        max_new_tokens = int(duration_seconds * 50)
        
        # Generate
        with torch.no_grad():
            audio_values = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                guidance_scale=guidance_scale,
                do_sample=True
            )
        
        # Convert to numpy (first batch item, first channel)
        audio = audio_values[0, 0].cpu().numpy()
        
        return audio
    
    def generate_variations(self, prompt: str, 
                            num_variations: int = 3,
                            duration_seconds: float = 5.0) -> List[np.ndarray]:
        """
        Generates several variations of music from the same prompt.
        
        Args:
            prompt: Text description
            num_variations: Number of variations
            duration_seconds: Length of each variation
        
        Returns:
            List of audio waveforms
        """
        variations = []
        
        # Sampling is enabled in generate(), so each call gives a different result
        for i in range(num_variations):
            print(f"   Generating variation {i+1}/{num_variations}...")
            audio = self.generate(prompt, duration_seconds)
            variations.append(audio)
        
        return variations
    
    def concatenate_sections(self, sections: List[Tuple[str, float]],
                             crossfade_duration: float = 0.5) -> np.ndarray:
        """
        Generates several music sections and joins them.
        
        Args:
            sections: List of (prompt, duration) pairs, one per section
            crossfade_duration: Crossfade length between sections, in seconds
        
        Returns:
            The joined audio
        """
        audio_sections = []
        
        for i, (prompt, duration) in enumerate(sections):
            print(f"   Generating section {i+1}: '{prompt[:30]}...'")
            audio = self.generate(prompt, duration)
            audio_sections.append(audio)
        
        # Join with a crossfade
        if len(audio_sections) == 1:
            return audio_sections[0]
        
        crossfade_samples = int(crossfade_duration * self.sample_rate)
        result = audio_sections[0].copy()
        
        for section in audio_sections[1:]:
            # Never overlap more samples than either side contains
            n = min(crossfade_samples, len(result), len(section))
            if n == 0:
                result = np.concatenate([result, section])
                continue
            
            # Linear fade-out of the tail and fade-in of the head
            fade_out = np.linspace(1, 0, n, dtype=result.dtype)
            fade_in = np.linspace(0, 1, n, dtype=result.dtype)
            
            # Blend the overlap, then append the rest of the section
            result[-n:] = result[-n:] * fade_out + section[:n] * fade_in
            result = np.concatenate([result, section[n:]])
        
        return result
    
    def play_audio(self, audio: np.ndarray):
        """Plays audio in the notebook."""
        display(Audio(audio, rate=self.sample_rate))
    
    def save_audio(self, audio: np.ndarray, filepath: str):
        """Saves audio to a file."""
        sf.write(filepath, audio, self.sample_rate)
        print(f"💾 Audio saved: {filepath}")

print("\n" + "="*60)
print("MusicGenerator Engine")
print("="*60)
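
The crossfade in `concatenate_sections` can be tried without loading a model. The sketch below is a standalone linear crossfade on synthetic arrays. `crossfade` is a hypothetical helper that returns a new array and leaves its inputs untouched.

```python
import numpy as np

def crossfade(a: np.ndarray, b: np.ndarray, overlap: int) -> np.ndarray:
    """Join a and b, blending the last `overlap` samples of a with the first of b."""
    # Clamp the overlap so short sections cannot cause a shape mismatch
    overlap = min(overlap, len(a), len(b))
    if overlap <= 0:
        return np.concatenate([a, b])
    fade_in = np.linspace(0.0, 1.0, overlap)
    fade_out = 1.0 - fade_in
    blended = a[-overlap:] * fade_out + b[:overlap] * fade_in
    return np.concatenate([a[:-overlap], blended, b[overlap:]])

first = np.ones(1000)
second = np.ones(1000)
mixed = crossfade(first, second, overlap=200)
print(len(mixed))  # 1000 + 1000 - 200 overlapping samples
```

With linear ramps the two gains always sum to 1, so two identical constant signals stay constant through the overlap. For unrelated music sections, an equal-power curve may sound smoother.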

In [None]:
# Initialize MusicGen
music_gen = MusicGenerator("facebook/musicgen-small")

In [None]:
# Generate music from a text description
print("\n🎵 Generating music from a text description:")
print("-" * 50)

prompts = [
    "Upbeat corporate background music with positive energy",
    "Calm ambient electronic music for focus and productivity",
    "Energetic rock guitar riff with drums",
]

for prompt in prompts:
    print(f"\n🎼 Prompt: '{prompt}'")
    print("   Generating...")
    
    audio = music_gen.generate(prompt, duration_seconds=5.0)
    
    print(f"   ✅ Generated: {len(audio)/music_gen.sample_rate:.2f}s")
    music_gen.play_audio(audio)
    print()
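
The requested duration reaches the model only as a token budget. Assuming the roughly 50 tokens per second stated in the comment inside `MusicGenerator.generate`, the conversion can be sketched as below. The constant and function names are illustrative.

```python
FRAME_RATE_HZ = 50  # assumed, taken from the comment in MusicGenerator.generate

def tokens_for_duration(duration_seconds: float,
                        frame_rate_hz: int = FRAME_RATE_HZ) -> int:
    """Token budget needed to generate roughly `duration_seconds` of audio."""
    return int(duration_seconds * frame_rate_hz)

def duration_for_tokens(num_tokens: int,
                        frame_rate_hz: int = FRAME_RATE_HZ) -> float:
    """Approximate audio length produced by `num_tokens` generation steps."""
    return num_tokens / frame_rate_hz

budget = tokens_for_duration(5.0)
print(budget, duration_for_tokens(budget))
```

Generation time grows with the token budget, so short clips are the practical choice for interactive use on a CPU.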

---

## 🏭 Part 5: Production Audio Pipeline

A complete pipeline for production audio applications.
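
The pipeline in the next cell allocates a `cache` dict with a `cache_size`, and the cell imports `hashlib`, but the part of the notebook shown here does not use them. One plausible way to wire them together, offered purely as an assumption about the intent, is a content-addressed key plus least-recently-used eviction. `make_cache_key` and `LRUAudioCache` are hypothetical names and not part of the pipeline.

```python
import hashlib
import json
from collections import OrderedDict
from typing import Any, Dict, Optional

def make_cache_key(task_type: str, params: Dict[str, Any]) -> str:
    """Stable key: the same task and parameters always hash to the same value."""
    payload = json.dumps({"task": task_type, "params": params},
                         sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class LRUAudioCache:
    """Small least-recently-used cache (hypothetical helper)."""

    def __init__(self, max_size: int = 50):
        self.max_size = max_size
        self._items: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: Any) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict the oldest entry

    def __len__(self) -> int:
        return len(self._items)

key_a = make_cache_key("text_to_speech", {"text": "Hello", "engine": "speecht5"})
key_b = make_cache_key("text_to_speech", {"engine": "speecht5", "text": "Hello"})
cache = LRUAudioCache(max_size=2)
cache.put(key_a, "waveform-1")
print(key_a == key_b, cache.get(key_b))
```

Caching only makes sense for deterministic calls. MusicGen is sampled with `do_sample=True`, so a cached result would hide the variation between runs.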

In [None]:
import time
from enum import Enum
from datetime import datetime
import hashlib

class AudioTaskType(Enum):
    TTS = "text_to_speech"
    MUSIC = "music_generation"
    PROCESSING = "audio_processing"
    ANALYSIS = "audio_analysis"

@dataclass
class AudioJob:
    """Represents a single audio job."""
    job_id: str
    task_type: AudioTaskType
    input_data: Dict[str, Any]
    output_audio: Optional[np.ndarray] = None
    output_path: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    status: str = "pending"
    error: Optional[str] = None
    created_at: datetime = field(default_factory=datetime.now)
    completed_at: Optional[datetime] = None
    processing_time: float = 0.0

class ProductionAudioPipeline:
    """
    Production pipeline for end-to-end audio processing.
    
    Combines TTS, music generation, and audio processing
    behind a single unified interface.
    """
    
    def __init__(self,
                 enable_tts: bool = True,
                 enable_music: bool = True,
                 cache_size: int = 50):
        """
        Args:
            enable_tts: Enable the TTS components
            enable_music: Enable music generation
            cache_size: Cache size (number of entries)
        """
        print("🏭 Initializing Production Audio Pipeline...")
        
        self.components = {}
        
        # TTS components
        if enable_tts:
            print("   📢 Initializing TTS...")
            self.components['tts_speecht5'] = SpeechT5TTS()
            # BARK is optional because of its size
            try:
                self.components['tts_bark'] = BarkTTS("suno/bark-small")
            except Exception as e:
                print(f"   ⚠️ BARK unavailable: {e}")
        
        # Music generation
        if enable_music:
            print("   🎵 Initializing MusicGen...")
            try:
                self.components['music'] = MusicGenerator("facebook/musicgen-small")
            except Exception as e:
                print(f"   ⚠️ MusicGen unavailable: {e}")
        
        # The audio processor is always available
        self.components['processor'] = AudioProcessor()
        
        # Cache and statistics
        self.cache = {}
        self.cache_size = cache_size
        self.jobs_history = []
        self.stats = {
            'total_jobs': 0,
            'successful_jobs': 0,
            'failed_jobs': 0,
            'total_audio_generated_seconds': 0,
            'cache_hits': 0,
            'by_type': defaultdict(int)
        }
        
        print("\n✅ Pipeline initialized")
        print(f"   Components: {list(self.components.keys())}")
    
    def text_to_speech(self, text: str,
                       engine: str = "speecht5",
                       voice: str = "default",
                       **kwargs) -> AudioJob:
        """
        Converts text to speech.
        
        Args:
            text: Text to synthesize
            engine: TTS engine (speecht5/bark)
            voice: Voice to use for synthesis
            **kwargs: Additional engine parameters
        
        Returns:
            AudioJob with the result
        """
        job = self._create_job(AudioTaskType.TTS, {
            'text': text,
            'engine': engine,
            'voice': voice,
            **kwargs
        })
        
        try:
            start_time = time.time()
            
            # Select the engine; fall back to SpeechT5 if BARK is unavailable
            if engine == "bark" and 'tts_bark' in self.components:
                tts = self.components['tts_bark']
                # "default" is not a BARK preset, so map it to the narrator voice
                preset = 'narrator' if voice == 'default' else voice
                audio = tts.synthesize(text, voice_preset=preset)
                sample_rate = tts.sample_rate
            else:
                tts = self.components.get('tts_speecht5')
                if tts is None:
                    raise RuntimeError("TTS component is not available")
                audio = tts.synthesize_long(text) if len(text) > 200 else tts.synthesize(text)
                sample_rate = tts.sample_rate
            
            job.output_audio = audio
            job.metadata['sample_rate'] = sample_rate
            job.metadata['duration'] = len(audio) / sample_rate
            job.status = "completed"
            job.processing_time = time.time() - start_time
            job.completed_at = datetime.now()
            
            self._update_stats(job, success=True)
            
        except Exception as e:
            job.status = "failed"
            job.error = str(e)
            self._update_stats(job, success=False)
        
        self.jobs_history.append(job)
        return job
    
    def generate_music(self, prompt: str,
                       duration: float = 8.0,
                       **kwargs) -> AudioJob:
        """
        Generates music from a text description.
        
        Args:
            prompt: Text description of the music
            duration: Length in seconds
            **kwargs: Additional parameters, recorded in the job's input data
        
        Returns:
            AudioJob with the result
        """
        job = self._create_job(AudioTaskType.MUSIC, {
            'prompt': prompt,
            'duration': duration,
            **kwargs
        })
        
        try:
            if 'music' not in self.components:
                raise RuntimeError("MusicGen component is not available")
            
            start_time = time.time()
            
            music_gen = self.components['music']
            audio = music_gen.generate(prompt, duration)
            
            job.output_audio = audio
            job.metadata['sample_rate'] = music_gen.sample_rate
            job.metadata['duration'] = len(audio) / music_gen.sample_rate
            job.status = "completed"
            job.processing_time = time.time() - start_time
            job.completed_at = datetime.now()
            
            self._update_stats(job, success=True)
            
        except Exception as e:
            job.status = "failed"
            job.error = str(e)
            self._update_stats(job, success=False)
        
        self.jobs_history.append(job)
        return job
    
    def process_audio(self, audio: np.ndarray,
                      sample_rate: int,
                      operations: List[Dict[str, Any]]) -> AudioJob:
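        """
        Applies a chain of operations to the audio.
        
        Args:
            audio: Input audio
            sample_rate: Sample rate
            operations: List of operations, e.g. [{"type": "normalize", "params": {...}}].
                Supported types: normalize, trim_silence, noise_reduction,
                change_speed, change_pitch, resample.
        
        Returns:
            AudioJob with the result
        """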
        job = self._create_job(AudioTaskType.PROCESSING, {
            'audio_length': len(audio),
            'sample_rate': sample_rate,
            'operations': operations
        })
        
        try:
            start_time = time.time()
            processor = self.components['processor']
            
            processed_audio = audio.copy()
            applied_operations = []
            
            for op in operations:
                op_type = op.get('type')
                params = op.get('params', {})
                
                if op_type == 'normalize':
                    processed_audio = processor.normalize(processed_audio, **params)
                elif op_type == 'trim_silence':
                    processed_audio = processor.trim_silence(processed_audio, **params)
                elif op_type == 'noise_reduction':
                    processed_audio = processor.apply_noise_reduction(processed_audio, **params)
                elif op_type == 'change_speed':
                    processed_audio = processor.change_speed(processed_audio, **params)
                elif op_type == 'change_pitch':
                    processed_audio = processor.change_pitch(
                        processed_audio, sample_rate=sample_rate, **params
                    )
                elif op_type == 'resample':
                    target_sr = params.get('target_sr', 16000)
                    processed_audio = processor.resample(
                        processed_audio, sample_rate, target_sr
                    )
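                    sample_rate = target_sr
                else:
                    # Do not record an operation that was never applied
                    raise ValueError(f"Unknown operation type: {op_type!r}")
                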
                applied_operations.append(op_type)
            
            job.output_audio = processed_audio
            job.metadata['sample_rate'] = sample_rate
            job.metadata['duration'] = len(processed_audio) / sample_rate
            job.metadata['applied_operations'] = applied_operations
            job.status = "completed"
            job.processing_time = time.time() - start_time
            job.completed_at = datetime.now()
            
            self._update_stats(job, success=True)
            
        except Exception as e:
            job.status = "failed"
            job.error = str(e)
            self._update_stats(job, success=False)
        
        self.jobs_history.append(job)
        return job
    
    def analyze_audio(self, audio: np.ndarray,
                      sample_rate: int) -> AudioJob:
        """
        Analyzes the audio and extracts features.
        
        Args:
            audio: Input audio
            sample_rate: Sample rate
        
        Returns:
            AudioJob with the analysis stored in metadata['features']
        """
        job = self._create_job(AudioTaskType.ANALYSIS, {
            'audio_length': len(audio),
            'sample_rate': sample_rate
        })
        
        try:
            start_time = time.time()
            processor = self.components['processor']
            
            features = processor.extract_features(audio, sample_rate)
            
            job.metadata['features'] = features
            job.metadata['sample_rate'] = sample_rate
            job.status = "completed"
            job.processing_time = time.time() - start_time
            job.completed_at = datetime.now()
            
            self._update_stats(job, success=True)
            
        except Exception as e:
            job.status = "failed"
            job.error = str(e)
            self._update_stats(job, success=False)
        
        self.jobs_history.append(job)
        return job
    
    def _create_job(self, task_type: AudioTaskType, 
                    input_data: Dict) -> AudioJob:
        """Creates a new job."""
        job_id = hashlib.md5(
            f"{task_type.value}_{datetime.now().isoformat()}".encode()
        ).hexdigest()[:12]
        
        return AudioJob(
            job_id=job_id,
            task_type=task_type,
            input_data=input_data
        )
    
    def _update_stats(self, job: AudioJob, success: bool):
        """Updates the statistics."""
        self.stats['total_jobs'] += 1
        self.stats['by_type'][job.task_type.value] += 1
        
        if success:
            self.stats['successful_jobs'] += 1
            if job.metadata.get('duration'):
                self.stats['total_audio_generated_seconds'] += job.metadata['duration']
        else:
            self.stats['failed_jobs'] += 1
    
    def get_stats(self) -> Dict:
        """Returns the pipeline statistics."""
        stats = dict(self.stats)
        stats['by_type'] = dict(stats['by_type'])
        
        if stats['total_jobs'] > 0:
            stats['success_rate'] = stats['successful_jobs'] / stats['total_jobs']
        
        return stats
    
    def get_recent_jobs(self, n: int = 10) -> List[AudioJob]:
        """Returns the last N jobs."""
        return self.jobs_history[-n:]
    
    def play_job_audio(self, job: AudioJob):
        """Plays the audio from a job."""
        if job.output_audio is not None and job.metadata.get('sample_rate'):
            display(Audio(job.output_audio, rate=job.metadata['sample_rate']))
        else:
            print("⚠️ Job has no output audio")

print("\n" + "="*60)
print("ProductionAudioPipeline ready")
print("="*60)
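
The pipeline stores `cache_size` and counts `cache_hits`, but none of the methods shown above consults a cache. One way such a cache could work is a small LRU store keyed by the request parameters. The sketch below is an assumption and not part of the notebook: `SynthesisCache` and `make_key` are hypothetical names.

```python
import hashlib
from collections import OrderedDict
from typing import Optional

import numpy as np


class SynthesisCache:
    """Hypothetical LRU cache for synthesized audio (not part of the notebook)."""

    def __init__(self, max_size: int = 50):
        self.max_size = max_size
        self.hits = 0
        self._store: "OrderedDict[str, np.ndarray]" = OrderedDict()

    @staticmethod
    def make_key(text: str, engine: str, voice: str) -> str:
        # Hash the request parameters so identical requests share one key.
        raw = f"{engine}|{voice}|{text}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get(self, key: str) -> Optional[np.ndarray]:
        audio = self._store.get(key)
        if audio is not None:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
        return audio

    def put(self, key: str, audio: np.ndarray) -> None:
        self._store[key] = audio
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict the least recently used entry

    def __len__(self) -> int:
        return len(self._store)


cache = SynthesisCache(max_size=2)
key = cache.make_key("Hello", "speecht5", "default")
cache.put(key, np.zeros(16000, dtype=np.float32))
cached = cache.get(key)
```

In `text_to_speech`, a lookup like this would run before the engine is called, and a hit would increment `stats['cache_hits']` instead of synthesizing again.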

In [None]:
# Initialize the production pipeline
audio_pipeline = ProductionAudioPipeline(
    enable_tts=True,
    enable_music=True,
    cache_size=50
)

In [None]:
# Test TTS
print("\n" + "="*60)
print("TEST: Text-to-Speech")
print("="*60)

tts_job = audio_pipeline.text_to_speech(
    text="Welcome to Praut AI Pipeline. This is a demonstration of our text to speech capabilities.",
    engine="speecht5"
)

print(f"\n📋 Job ID: {tts_job.job_id}")
print(f"   Status: {tts_job.status}")
print(f"   Duration: {tts_job.metadata.get('duration', 0):.2f}s")
print(f"   Processing time: {(tts_job.processing_time or 0):.2f}s")

if tts_job.status == "completed":
    audio_pipeline.play_job_audio(tts_job)
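
`text_to_speech` switches to `synthesize_long` once the text exceeds 200 characters. A common way to handle long input is to split it at sentence boundaries and synthesize it chunk by chunk. The helper below is a hypothetical sketch of that idea; `split_into_chunks` is an assumed name, and the notebook's `synthesize_long` may work differently.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 200) -> List[str]:
    """Hypothetical helper: group whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept as one oversized chunk.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the current chunk
            current = sentence      # start a new one with this sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


long_text = "This is the first sentence. " * 12
chunks = split_into_chunks(long_text, max_chars=200)
```

Each chunk could then be passed to the TTS engine separately and the resulting arrays concatenated, for example with `np.concatenate`.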

In [None]:
# Test audio processing
print("\n" + "="*60)
print("TEST: Audio Processing")
print("="*60)

# Use the audio from the TTS job
if tts_job.output_audio is not None:
    processing_job = audio_pipeline.process_audio(
        audio=tts_job.output_audio,
        sample_rate=tts_job.metadata['sample_rate'],
        operations=[
            {"type": "normalize", "params": {"target_db": -18}},
            {"type": "trim_silence", "params": {"top_db": 25}},
            {"type": "change_pitch", "params": {"semitones": -2}},
        ]
    )
    
    print(f"\n📋 Job ID: {processing_job.job_id}")
    print(f"   Status: {processing_job.status}")
    print(f"   Operations: {processing_job.metadata.get('applied_operations', [])}")
    print(f"   Resulting duration: {processing_job.metadata.get('duration', 0):.2f}s")
    
    if processing_job.status == "completed":
        print("\n🔊 Processed audio:")
        audio_pipeline.play_job_audio(processing_job)
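
The test above passes `target_db=-18` and `semitones=-2`. Both values map to simple ratios: decibels to a linear amplitude factor, semitones to a frequency ratio. The sketch below works through that arithmetic. It assumes the target level refers to RMS relative to a full scale of 1.0; the notebook's `processor.normalize` may define it differently (for example relative to the peak), and `normalize_rms` is a hypothetical name.

```python
import numpy as np


def db_to_gain(db: float) -> float:
    """Convert a level in decibels to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)


def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio of a pitch shift in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)


def normalize_rms(audio: np.ndarray, target_db: float = -18.0) -> np.ndarray:
    """Scale audio so its RMS level equals target_db (assumed dB re. full scale 1.0)."""
    rms = float(np.sqrt(np.mean(np.square(audio))))
    if rms == 0.0:
        return audio.copy()  # silence cannot be scaled to a target level
    return audio * (db_to_gain(target_db) / rms)


# One second of a 440 Hz test tone at 16 kHz
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)

normalized = normalize_rms(tone, target_db=-18.0)
level_db = 20.0 * np.log10(np.sqrt(np.mean(normalized ** 2)))
pitch_ratio = semitones_to_ratio(-2)  # a shift of -2 semitones lowers every frequency by this factor
```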

In [None]:
# Test music generation
print("\n" + "="*60)
print("TEST: Music Generation")
print("="*60)

music_job = audio_pipeline.generate_music(
    prompt="Uplifting corporate background music with piano and soft drums",
    duration=5.0
)

print(f"\n📋 Job ID: {music_job.job_id}")
print(f"   Status: {music_job.status}")
if music_job.status == "completed":
    print(f"   Duration: {music_job.metadata.get('duration', 0):.2f}s")
    print(f"   Processing time: {music_job.processing_time:.2f}s")
    audio_pipeline.play_job_audio(music_job)
else:
    print(f"   Error: {music_job.error}")
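
`generate_music` takes the length in seconds, while autoregressive audio models are usually limited by a number of generated tokens. The sketch below shows how such a conversion could look. The frame rate of 50 tokens per second is an assumption about the MusicGen checkpoint, and `duration_to_max_new_tokens` is a hypothetical helper; the real value should be checked in the model configuration.

```python
import math

# Assumption: the audio codec emits about 50 frames (tokens) per second.
ASSUMED_FRAME_RATE_HZ = 50


def duration_to_max_new_tokens(duration_s: float,
                               frame_rate_hz: int = ASSUMED_FRAME_RATE_HZ) -> int:
    """Hypothetical helper: number of tokens needed to cover duration_s seconds."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    # Round up so the generated audio is never shorter than requested.
    return math.ceil(duration_s * frame_rate_hz)


tokens_5s = duration_to_max_new_tokens(5.0)
tokens_8s = duration_to_max_new_tokens(8.0)
```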

In [None]:
# Test analysis
print("\n" + "="*60)
print("TEST: Audio Analysis")
print("="*60)

if tts_job.output_audio is not None:
    analysis_job = audio_pipeline.analyze_audio(
        audio=tts_job.output_audio,
        sample_rate=tts_job.metadata['sample_rate']
    )
    
    print(f"\n📋 Job ID: {analysis_job.job_id}")
    print(f"   Status: {analysis_job.status}")
    
    if analysis_job.status == "completed":
        features = analysis_job.metadata.get('features', {})
        print("\n📊 Extracted features:")
        print(f"   Duration: {features.get('duration', 0):.2f}s")
        print(f"   RMS: {features.get('rms', 0):.4f}")
        print(f"   Spectral Centroid: {features.get('spectral_centroid_mean', 0):.1f} Hz")
        print(f"   Zero Crossing Rate: {features.get('zero_crossing_rate', 0):.4f}")
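
The features printed above can be approximated with plain NumPy, which helps when checking what the numbers mean. This is a simplified sketch: it computes each value over the whole signal, whereas the notebook's `extract_features` may average frame-wise values, so the results can differ. `basic_features` is a hypothetical name; the dictionary keys mirror the ones read in the cell above.

```python
from typing import Dict

import numpy as np


def basic_features(audio: np.ndarray, sample_rate: int) -> Dict[str, float]:
    """Hypothetical whole-signal approximation of the printed features."""
    audio = np.asarray(audio, dtype=np.float64)
    if audio.size == 0:
        raise ValueError("audio must not be empty")

    # RMS: root of the mean squared amplitude
    rms = float(np.sqrt(np.mean(audio ** 2)))

    # Zero crossing rate: fraction of adjacent sample pairs whose sign differs
    signs = np.signbit(audio)
    zcr = float(np.mean(signs[1:] != signs[:-1])) if audio.size > 1 else 0.0

    # Spectral centroid: magnitude-weighted mean frequency of the spectrum
    magnitudes = np.abs(np.fft.rfft(audio))
    freqs = np.fft.rfftfreq(audio.size, d=1.0 / sample_rate)
    total = magnitudes.sum()
    centroid = float((freqs * magnitudes).sum() / total) if total > 0 else 0.0

    return {
        "duration": audio.size / sample_rate,
        "rms": rms,
        "spectral_centroid_mean": centroid,
        "zero_crossing_rate": zcr,
    }


# One second of a 440 Hz sine at 16 kHz
sr = 16000
t = np.arange(sr) / sr
feats = basic_features(np.sin(2 * np.pi * 440.0 * t), sr)
```

For a pure tone the centroid sits at the tone's frequency and the zero crossing rate is roughly twice the frequency divided by the sample rate, which makes a sine wave a convenient sanity check.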

In [None]:
# Final pipeline statistics
print("\n" + "="*60)
print("PIPELINE STATISTICS")
print("="*60)

stats = audio_pipeline.get_stats()

print("\n📊 Overall statistics:")
print(f"   Total jobs: {stats['total_jobs']}")
print(f"   Successful: {stats['successful_jobs']}")
print(f"   Failed: {stats['failed_jobs']}")
print(f"   Success rate: {stats.get('success_rate', 0):.1%}")
print(f"   Total audio generated: {stats['total_audio_generated_seconds']:.1f}s")

print("\n📈 By task type:")
for task_type, count in stats['by_type'].items():
    print(f"   {task_type}: {count}")

print("\n📜 Recent jobs:")
for job in audio_pipeline.get_recent_jobs(5):
    status_icon = "✅" if job.status == "completed" else "❌"
    # Failed jobs may have no processing time recorded
    print(f"   {status_icon} {job.job_id}: {job.task_type.value} ({(job.processing_time or 0):.2f}s)")

---

## 🎯 Series Summary

Congratulations! You have completed the entire series of **20 notebooks** on Hugging Face Transformers!

### What we learned in this notebook

| Component | Model | Use |
|-----------|-------|-----|
| **SpeechT5** | microsoft/speecht5_tts | Text-to-speech synthesis |
| **BARK** | suno/bark-small | TTS with emotions and effects |
| **MusicGen** | facebook/musicgen-small | Music generation from text |
| **Librosa** | - | Audio processing and analysis |

### Overview of the whole series

| # | Topic | Key models |
|---|-------|------------|
| 1 | Introduction to HF | Pipeline API, AutoModel |
| 2 | Classification and NER | BERT, DistilBERT |
| 3 | Sentiment and emotions | RoBERTa, GoEmotions |
| 4 | Summarization and generation | BART, T5, GPT-2 |
| 5 | Translation | MarianMT, mBART |
| 6 | Question Answering | BERT QA, Retrieval |
| 7 | Speech-to-Text | Whisper |
| 8-10 | Computer Vision | ViT, DETR, Segmentation |
| 11 | Embeddings and Search | Sentence Transformers |
| 12 | Fine-tuning | LoRA, PEFT |
| 13 | RAG systems | Retrieval + Generation |
| 14 | LLM optimization | Quantization, vLLM |
| 15 | Multimodal models | CLIP, LLaVA, BLIP |
| 16 | Time Series | Transformer forecasting |
| 17 | Recommender systems | Collaborative filtering |
| 18 | Anomaly detection | Autoencoder, Isolation Forest |
| 19 | Document AI | TrOCR, LayoutLM, Donut |
| **20** | **Audio AI** | **SpeechT5, BARK, MusicGen** |

In [None]:
print("\n" + "="*60)
print("🎉 SERIES COMPLETE!")
print("="*60)
print("\n🏆 Congratulations on completing all 20 notebooks!")
print("\n📚 You have learned:")
print("   ✅ Text processing (classification, NER, sentiment)")
print("   ✅ Text generation (summarization, translation, QA)")
print("   ✅ Computer Vision (classification, detection, segmentation)")
print("   ✅ Speech (STT, TTS, voice cloning)")
print("   ✅ Multimodal AI (CLIP, LLaVA, BLIP)")
print("   ✅ Advanced techniques (RAG, fine-tuning, optimization)")
print("   ✅ Specialized applications (time series, recsys, anomaly)")
print("   ✅ Document AI (OCR, layout, extraction)")
print("   ✅ Audio AI (TTS, music, processing)")
print("\n🚀 Next steps:")
print("   • Apply what you have learned to your own projects")
print("   • Experiment with fine-tuning on your own data")
print("   • Follow what is new on the Hugging Face Hub")
print("   • Contribute to the open-source community")
print("\n💼 For business consulting: info@praut.cz")
print("\n" + "="*60)