# 🎤 Whisper - Speech-to-Text

**Author:** Praut s.r.o. - AI Integration & Business Automation

## What you will learn:
- Transcribing audio files
- Multilingual transcription
- Automatic subtitles
- Processing meetings and calls

In [None]:
!pip install -q transformers accelerate torch librosa soundfile datasets

In [None]:
from transformers import pipeline, WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

device = 0 if torch.cuda.is_available() else -1
print(f"🖥️ Device: {'GPU' if device == 0 else 'CPU'}")

## 1. Basic transcription with Whisper

In [None]:
# Whisper pipeline
whisper = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base",
    device=device
)

# For the demo, use a sample from a dataset
from datasets import load_dataset

# Load the sample audio
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation[:1]")
sample = dataset[0]

print("📊 Audio info:")
print(f"   Sampling rate: {sample['audio']['sampling_rate']} Hz")
print(f"   Duration: {len(sample['audio']['array']) / sample['audio']['sampling_rate']:.1f} s")

# Transcription (a bare array is assumed to be sampled at 16 kHz already)
result = whisper(sample['audio']['array'])

print("\n🎤 Transcription:")
print(f"   {result['text']}")
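
The call above passes a bare array, so the pipeline has to trust that it is already sampled at the rate the model expects. The `transformers` speech-recognition pipeline also accepts a dictionary with the keys `raw` and `sampling_rate`. The sketch below builds that dictionary and flags audio that is not at 16 kHz. The helper names `priprav_vstup` and `potrebuje_resampling` are hypothetical, and the silent 44.1 kHz signal is stand-in data only.

```python
def priprav_vstup(samples, sampling_rate):
    """Return the dict form accepted by the ASR pipeline (hypothetical helper)."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be a positive integer")
    return {"raw": samples, "sampling_rate": int(sampling_rate)}


def potrebuje_resampling(sampling_rate, model_rate=16000):
    """True when the audio rate differs from the 16 kHz that Whisper expects."""
    return int(sampling_rate) != model_rate


# Stand-in data: one second of silence at 44.1 kHz. A real call would pass
# sample['audio']['array'] (a NumPy array) and sample['audio']['sampling_rate'].
vstup = priprav_vstup([0.0] * 44100, 44100)
print(potrebuje_resampling(vstup["sampling_rate"]))

# result = whisper(vstup)  # uncomment in a cell where the `whisper` pipeline is loaded
```

When the rates differ, the pipeline resamples the dictionary input itself, which may require an extra audio package to be installed.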

## 2. Whisper model sizes

In [None]:
# Available models:
# - openai/whisper-tiny (39M parameters)
# - openai/whisper-base (74M)
# - openai/whisper-small (244M)
# - openai/whisper-medium (769M)
# - openai/whisper-large-v3 (1.5B)

print("📊 Whisper model comparison:\n")
print("| Model  | Parameters | VRAM  | Speed   | Accuracy  |")
print("|--------|------------|-------|---------|-----------|")
print("| tiny   | 39M        | ~1GB  | Fastest | Basic     |")
print("| base   | 74M        | ~1GB  | Fast    | Good      |")
print("| small  | 244M       | ~2GB  | Medium  | Very good |")
print("| medium | 769M       | ~5GB  | Slow    | Excellent |")
print("| large  | 1.5B       | ~10GB | Slowest | Best      |")
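
The table can be turned into a simple selection rule: take the largest model whose approximate VRAM requirement fits the available memory. This is a minimal sketch; the function name `vyber_model` is hypothetical, and the VRAM figures are the rough values from the table, not measurements.

```python
# (model id, approximate VRAM in GB), ordered from smallest to largest;
# the figures mirror the comparison table above
WHISPER_MODELS = [
    ("openai/whisper-tiny", 1),
    ("openai/whisper-base", 1),
    ("openai/whisper-small", 2),
    ("openai/whisper-medium", 5),
    ("openai/whisper-large-v3", 10),
]


def vyber_model(vram_gb):
    """Return the largest listed model whose approximate VRAM need fits the budget."""
    vhodne = [name for name, need in WHISPER_MODELS if need <= vram_gb]
    if not vhodne:
        raise ValueError(f"No listed model fits into {vram_gb} GB")
    return vhodne[-1]


print(vyber_model(4))   # openai/whisper-small
print(vyber_model(12))  # openai/whisper-large-v3
```

Actual memory use also depends on precision, batch size, and audio length, so treat the result as a starting point.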

In [None]:
# Load a larger model for better accuracy
whisper_small = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    device=device
)

result_small = whisper_small(sample['audio']['array'])

print("🔍 Result comparison:")
print(f"   Base:  {result['text']}")
print(f"   Small: {result_small['text']}")

## 3. Transcription with timestamps

In [None]:
# Transcription with timestamps
result_timestamps = whisper_small(
    sample['audio']['array'],
    return_timestamps=True
)

print("⏱️ Transcription with timestamps:\n")

if 'chunks' in result_timestamps:
    for chunk in result_timestamps['chunks']:
        start, end = chunk['timestamp']
        start = 0.0 if start is None else start
        # The end of the last chunk can be None when the audio is cut off
        end_txt = '?' if end is None else f"{end:.1f}"
        print(f"   [{start:.1f}s - {end_txt}s] {chunk['text']}")
else:
    print(f"   {result_timestamps['text']}")
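
For longer recordings, plain seconds become hard to read. The sketch below formats each chunk as `MM:SS.s` and handles an open-ended chunk whose end time is `None`. The helper names and the sample chunks are made up; only the chunk shape (`timestamp` tuple plus `text`) follows what the pipeline returns with `return_timestamps=True`.

```python
def format_cas(sekundy):
    """Format seconds as MM:SS.s; None (an open-ended chunk) becomes '??:??'."""
    if sekundy is None:
        return "??:??"
    # Work in whole tenths of a second so rounding cannot produce ':60.0'
    desetiny = int(round(sekundy * 10))
    minuty, desetiny = divmod(desetiny, 600)
    vteriny, desetiny = divmod(desetiny, 10)
    return f"{minuty:02d}:{vteriny:02d}.{desetiny}"


def vypis_chunky(chunks):
    """Return one '[start - end] text' line per chunk."""
    radky = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        radky.append(f"[{format_cas(start)} - {format_cas(end)}] {chunk['text'].strip()}")
    return radky


# Made-up chunks in the shape returned with return_timestamps=True
ukazka = [
    {"timestamp": (0.0, 5.5), "text": " Hello there."},
    {"timestamp": (65.5, None), "text": " Cut off"},
]
for radek in vypis_chunky(ukazka):
    print(radek)
```

With real output, the call would be `vypis_chunky(result_timestamps['chunks'])`.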

## 4. Multilingual transcription

In [None]:
# Whisper detects the language automatically,
# but specifying it explicitly can improve results

def transkribuj(audio_array, jazyk=None):
    """Transcribe audio, optionally with an explicitly specified language."""
    
    generate_kwargs = {}
    if jazyk:
        generate_kwargs["language"] = jazyk
    
    result = whisper_small(
        audio_array,
        generate_kwargs=generate_kwargs
    )
    
    return result['text']

# Supported languages (selection)
print("🌍 Languages supported by Whisper:")
jazyky = [
    "english", "czech", "german", "french", "spanish",
    "italian", "polish", "portuguese", "russian", "chinese",
    "japanese", "korean", "arabic", "hindi", "turkish"
]
print(f"   {', '.join(jazyky)}")
print("   ... and 80+ more languages")
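
Users often supply a language as a two-letter code or with inconsistent casing. A minimal sketch of normalizing such input before it goes into `generate_kwargs` follows. The mapping `KODY_JAZYKU` is a small hand-picked subset, and both helper names are hypothetical.

```python
# Small, hand-picked subset; Whisper itself knows many more languages
KODY_JAZYKU = {
    "en": "english",
    "cs": "czech",
    "de": "german",
    "fr": "french",
    "es": "spanish",
    "pl": "polish",
}


def normalizuj_jazyk(jazyk):
    """Map an ISO 639-1 code or a language name to a lowercase language name."""
    if jazyk is None:
        return None  # leave detection to Whisper
    klic = jazyk.strip().lower()
    if klic in KODY_JAZYKU:
        return KODY_JAZYKU[klic]
    if klic in KODY_JAZYKU.values():
        return klic
    raise ValueError(f"Unknown language: {jazyk!r}")


def sestav_generate_kwargs(jazyk=None):
    """Build the generate_kwargs dict; empty when no language is forced."""
    nazev = normalizuj_jazyk(jazyk)
    return {"language": nazev} if nazev else {}


print(sestav_generate_kwargs("CS"))
print(sestav_generate_kwargs())
```

Raising on unknown input is a deliberate choice here: a silently ignored typo would fall back to auto-detection and hide the mistake.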

## 5. Generating subtitles (SRT format)

In [None]:
def generuj_srt(audio_array, sampling_rate=16000):
    """Generate subtitles in SRT format."""

    def srt_cas(sekundy):
        # Convert to whole milliseconds first so float error cannot truncate a digit
        ms = int(round(sekundy * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    result = whisper_small(
        audio_array,
        return_timestamps=True,
        chunk_length_s=30
    )

    srt_lines = []

    if 'chunks' in result:
        for i, chunk in enumerate(result['chunks'], 1):
            start, end = chunk['timestamp']
            start = 0 if start is None else start
            # The last chunk may have no end time; fall back to a 5-second cue
            end = start + 5 if end is None else end

            srt_lines.append(f"{i}")
            srt_lines.append(f"{srt_cas(start)} --> {srt_cas(end)}")
            srt_lines.append(chunk['text'].strip())
            srt_lines.append("")

    return "\n".join(srt_lines)

# Generate the SRT
srt = generuj_srt(sample['audio']['array'])
print("📝 Generated SRT subtitles:\n")
print(srt)
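
Web video players usually expect WebVTT instead of SRT. The two formats are close: WebVTT starts with a `WEBVTT` header line and uses a dot instead of a comma before the milliseconds. The sketch below converts an SRT string under those two rules only. The function name `srt_na_vtt` and the sample cue are made up for illustration.

```python
import re

# Matches the HH:MM:SS,mmm timestamps used in SRT timing lines
_SRT_CAS = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_na_vtt(srt_text):
    """Convert SRT text to WebVTT: add the header, use '.' before milliseconds."""
    radky = ["WEBVTT", ""]
    for radek in srt_text.splitlines():
        if "-->" in radek:
            # Only timing lines are rewritten; commas in the cue text stay untouched
            radek = _SRT_CAS.sub(r"\1.\2", radek)
        radky.append(radek)
    return "\n".join(radky)


ukazka_srt = "1\n00:00:00,000 --> 00:00:05,440\nHello there.\n"
print(srt_na_vtt(ukazka_srt))
```

The numeric cue indexes are kept because WebVTT allows an optional identifier line before each timing line.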

## 6. Processing audio files

In [None]:
import soundfile as sf
import numpy as np

def transkribuj_soubor(cesta_k_souboru):
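    """Transcribe an audio file (MP3, WAV, FLAC...)."""
    
    # Load the audio and resample it to 16 kHz mono
    audio, sr = librosa.load(cesta_k_souboru, sr=16000)
    
    # Transcription; chunking lets recordings longer than 30 s be processed
    result = whisper_small(
        audio,
        return_timestamps=True,
        chunk_length_s=30
    )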
    
    return {
        'text': result['text'],
        'chunks': result.get('chunks', []),
        'duration': len(audio) / sr
    }

# Demo - no sample file is bundled, so only the usage is shown
print("📁 Function transkribuj_soubor() is ready.")
print("   Usage: transkribuj_soubor('path/to/audio.mp3')")
print("   Supported formats (depending on the installed audio backend): MP3, WAV, FLAC, OGG, M4A")
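
To try the file-based function without a recording at hand, a test file can be generated with the standard library alone. The sketch below writes a one-second 440 Hz tone as a 16 kHz mono WAV file into the system temp directory. The function name `vytvor_testovaci_wav` and the file name are hypothetical.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path


def vytvor_testovaci_wav(cesta, delka_s=1.0, frekvence=440.0, sr=16000):
    """Write a mono 16-bit PCM sine tone to `cesta` and return the path."""
    pocet = int(delka_s * sr)
    vzorky = (
        int(0.3 * 32767 * math.sin(2 * math.pi * frekvence * i / sr))
        for i in range(pocet)
    )
    with wave.open(str(cesta), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(sr)
        wav.writeframes(b"".join(struct.pack("<h", v) for v in vzorky))
    return Path(cesta)


testovaci = vytvor_testovaci_wav(Path(tempfile.gettempdir()) / "whisper_test_tone.wav")
with wave.open(str(testovaci), "rb") as wav:
    print(wav.getframerate(), wav.getnframes())

# transkribuj_soubor(str(testovaci))  # run where the function above is defined
```

A pure tone contains no speech, so the transcript is likely to be empty or meaningless. The file only confirms that loading and the pipeline call work end to end.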

## 7. Batch processing of multiple files

In [None]:
import pandas as pd
from pathlib import Path

def zpracuj_slozku(cesta_slozky, pripony=('.mp3', '.wav', '.flac')):
    """Process all audio files in a folder."""
    
    vysledky = []
    slozka = Path(cesta_slozky)
    
    soubory = [f for f in slozka.iterdir() if f.suffix.lower() in pripony]
    
    for soubor in soubory:
        print(f"   Processing: {soubor.name}")
        try:
            result = transkribuj_soubor(str(soubor))
            vysledky.append({
                'soubor': soubor.name,
                'delka_s': result['duration'],
                'text': result['text'],
                'status': 'OK'
            })
        except Exception as e:
            vysledky.append({
                'soubor': soubor.name,
                'delka_s': 0,
                'text': '',
                'status': f'Error: {str(e)}'
            })
    
    return pd.DataFrame(vysledky)

print("📁 Function zpracuj_slozku() is ready.")
print("   Usage: zpracuj_slozku('/path/to/folder')")
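
The batch function only looks at the top level of the folder and processes files in whatever order the file system returns them. A minimal sketch of a separate discovery step that also descends into subfolders and returns a sorted list is shown below. The helper name `najdi_audio` is hypothetical, and the demo uses empty placeholder files in a throwaway directory.

```python
import tempfile
from pathlib import Path


def najdi_audio(cesta_slozky, pripony=(".mp3", ".wav", ".flac"), rekurzivne=True):
    """Return audio files under a folder, sorted for reproducible runs."""
    slozka = Path(cesta_slozky)
    kandidati = slozka.rglob("*") if rekurzivne else slozka.iterdir()
    return sorted(
        f for f in kandidati
        if f.is_file() and f.suffix.lower() in pripony
    )


# Self-contained demo on a throwaway folder with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    koren = Path(tmp)
    (koren / "sub").mkdir()
    for jmeno in ("b.WAV", "a.mp3", "notes.txt", "sub/c.flac"):
        (koren / jmeno).touch()
    nalezeno = [f.relative_to(koren).as_posix() for f in najdi_audio(koren)]
    print(nalezeno)
```

The resulting paths could then be passed one by one to `transkribuj_soubor()`.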

## 8. Translation and transcription together

In [None]:
# Whisper can translate speech directly into English
whisper_translate = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    device=device
)

def transkribuj_a_preloz(audio_array):
    """Transcribe the audio and translate it into English."""
    
    # Original transcription
    original = whisper_translate(audio_array)
    
    # Translation into English
    translated = whisper_translate(
        audio_array,
        generate_kwargs={"task": "translate"}
    )
    
    return {
        'original': original['text'],
        'english': translated['text']
    }

print("🌍 Function transkribuj_a_preloz() is ready.")
print("   Translates audio from any supported language into English.")

## 9. Processing meetings

In [None]:
def zpracuj_schuzku(audio_array):
    """Process a meeting recording and build structured output."""
    
    # Transcription with timestamps
    result = whisper_small(
        audio_array,
        return_timestamps=True,
        chunk_length_s=30
    )
    
    # Output structure
    output = {
        'full_transcript': result['text'],
        'duration_minutes': len(audio_array) / 16000 / 60,
        'segments': []
    }
    
    if 'chunks' in result:
        for chunk in result['chunks']:
            output['segments'].append({
                'start': chunk['timestamp'][0],
                'end': chunk['timestamp'][1],
                'text': chunk['text'].strip()
            })
    
    return output

# Demo
meeting = zpracuj_schuzku(sample['audio']['array'])

print("📋 Meeting minutes:\n")
print(f"Duration: {meeting['duration_minutes']:.1f} minutes")
print(f"\nTranscript:\n{meeting['full_transcript']}")
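
For minutes of a long meeting, the individual segments are often too fine-grained. The sketch below groups segments into fixed time windows based on their start time. The function name `seskup_segmenty` and the sample segments are made up; the segment shape (`start`, `end`, `text`) matches the `segments` list built above.

```python
from collections import defaultdict


def seskup_segmenty(segments, okno_s=60):
    """Group segment texts into consecutive time windows of `okno_s` seconds."""
    bloky = defaultdict(list)
    for seg in segments:
        start = seg["start"] or 0.0  # a missing start is treated as 0
        bloky[int(start // okno_s)].append(seg["text"])
    return [
        {"from_s": i * okno_s, "to_s": (i + 1) * okno_s, "text": " ".join(texty)}
        for i, texty in sorted(bloky.items())
    ]


# Made-up segments in the shape produced by zpracuj_schuzku()
ukazka_segmenty = [
    {"start": 0.0, "end": 12.4, "text": "Welcome everyone."},
    {"start": 12.4, "end": 58.0, "text": "First item: budget."},
    {"start": 61.2, "end": 95.0, "text": "Second item: hiring."},
]
for blok in seskup_segmenty(ukazka_segmenty):
    print(f"[{blok['from_s']:>4}-{blok['to_s']:>4} s] {blok['text']}")
```

With real data, the call would be `seskup_segmenty(meeting['segments'])`. Windows without any speech are simply absent from the result.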

## 10. Practical automation: Voice Notes

In [None]:
from datetime import datetime

class VoiceNotesProcessor:
    def __init__(self, model_name="openai/whisper-small"):
        self.whisper = pipeline(
            "automatic-speech-recognition",
            model=model_name,
            device=device
        )
        self.notes = []
    
    def add_note(self, audio_array, title=None):
        """Add a voice note."""
        result = self.whisper(audio_array, return_timestamps=True)
        
        note = {
            'id': len(self.notes) + 1,
            'title': title or f"Note {len(self.notes) + 1}",
            'timestamp': datetime.now().isoformat(),
            'text': result['text'],
            'duration': len(audio_array) / 16000
        }
        
        self.notes.append(note)
        return note
    
    def search_notes(self, keyword):
        """Search the notes for a keyword."""
        return [n for n in self.notes if keyword.lower() in n['text'].lower()]
    
    def export_markdown(self):
        """Export the notes to Markdown."""
        md = "# Voice Notes\n\n"
        for note in self.notes:
            md += f"## {note['title']}\n"
            md += f"*{note['timestamp']}* ({note['duration']:.1f}s)\n\n"
            md += f"{note['text']}\n\n---\n\n"
        return md

# Demo
processor = VoiceNotesProcessor()
processor.add_note(sample['audio']['array'], "Meeting notes")

print("📝 Voice Notes System:")
print(processor.export_markdown())
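
The notes live only in memory, so they disappear when the kernel restarts. A minimal sketch of saving them to a JSON file and loading them back follows. The function names `uloz_poznamky` and `nacti_poznamky`, the file name, and the sample note are all made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def uloz_poznamky(notes, cesta):
    """Write the list of note dicts to a UTF-8 JSON file."""
    Path(cesta).write_text(
        json.dumps(notes, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def nacti_poznamky(cesta):
    """Read notes back; a missing file yields an empty list."""
    soubor = Path(cesta)
    if not soubor.exists():
        return []
    return json.loads(soubor.read_text(encoding="utf-8"))


# Made-up note in the same shape that add_note() builds
ukazka_notes = [{
    "id": 1,
    "title": "Meeting notes",
    "timestamp": "2024-01-01T09:00:00",
    "text": "Příklad poznámky",
    "duration": 5.4,
}]

with tempfile.TemporaryDirectory() as tmp:
    cesta = Path(tmp) / "voice_notes.json"
    uloz_poznamky(ukazka_notes, cesta)
    obnovene = nacti_poznamky(cesta)

print(obnovene == ukazka_notes)
```

With the class above, this would look like `uloz_poznamky(processor.notes, 'voice_notes.json')` and later `processor.notes = nacti_poznamky('voice_notes.json')`. Setting `ensure_ascii=False` keeps Czech diacritics readable in the file.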

---
## 🏁 Summary

- ✅ Whisper for accurate transcription
- ✅ Support for 90+ languages
- ✅ Automatic subtitles (SRT)
- ✅ Batch processing of files
- ✅ Speech translation into English

**Next notebook:** Computer Vision - classification and detection