# Unified TTS Notebook

**Single notebook for all TTS models and PDF extraction strategies**

This notebook provides a unified interface for:
- **TTS Models**: Kokoro (v0.9, v1.0), Silero v5
- **PDF Extractors**: Unstructured, PyMuPDF, Apple Vision, Nougat
- **Input Formats**: Text strings, PDF files, EPUB books
- **Output Formats**: WAV, MP3 with timeline manifests

The notebook will automatically install only the dependencies you need based on your selections!

## 0) Environment Setup (Optional)

**This step helps you manage Python packages and avoid conflicts with your system installation.**

- If you have **conda** installed, you can create a fresh environment for this notebook
- Or use an existing environment by providing its name
- At the end of the notebook, you can easily clean up and delete the environment to free storage
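
Before choosing an option, it can help to confirm which interpreter the kernel is actually running. The sketch below uses only the standard library. `describe_active_environment` is a hypothetical helper name, and `CONDA_DEFAULT_ENV` is assumed to be set only when the kernel was launched from an activated conda environment.

```python
import os
import sys


def describe_active_environment():
    """Return basic facts about the interpreter running this kernel."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # True inside a venv/virtualenv, where the prefix differs from the base
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # None when the kernel was not started from an activated conda env
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


info = describe_active_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

If `executable` does not point inside the environment you expect, packages installed later in the notebook will land somewhere else.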

In [None]:
import subprocess
import sys
import os

# Flag to track if we created an environment in this notebook
environment_created_by_notebook = False
environment_name = None

# Check if conda is installed
try:
    result = subprocess.run(['conda', '--version'], capture_output=True, text=True, check=True)
    conda_available = True
    print(f"✓ Conda detected: {result.stdout.strip()}")
except (subprocess.CalledProcessError, FileNotFoundError):
    conda_available = False
    print("✗ Conda not found - skipping environment management")
    print("Packages will be installed in your current Python environment")

if conda_available:
    print("\n" + "="*60)
    print("ENVIRONMENT SETUP OPTIONS")
    print("="*60)
    
    choice = input("\nDo you want to:\n  [1] Create a NEW conda environment (recommended)\n  [2] Use an EXISTING environment\n  [3] Skip and use current environment\n\nEnter choice (1/2/3): ").strip()
    
    if choice == "1":
        env_name = input("\nEnter name for new environment (default: tts_unified): ").strip()
        if not env_name:
            env_name = "tts_unified"
        
        print(f"\n→ Creating conda environment: {env_name}")
        print("  This may take a few minutes...")
        
        try:
            subprocess.run(['conda', 'create', '-n', env_name, 'python=3.10', '-y'],
                           check=True, capture_output=True)
            
            environment_created_by_notebook = True
            environment_name = env_name
            
            print(f"✓ Environment '{env_name}' created successfully!")
            print(f"\n{'='*60}")
            print("IMPORTANT: Restart your Jupyter kernel and select the new environment:")
            print(f"  Kernel → Change Kernel → {env_name}")
            print(f"{'='*60}\n")
            
        except subprocess.CalledProcessError as e:
            print(f"✗ Failed to create environment: {e}")
            print("Continuing with current environment...")
    
    elif choice == "2":
        env_name = input("\nEnter name of existing environment: ").strip()
        if env_name:
            environment_name = env_name
            print(f"\n✓ Using existing environment: {env_name}")
            print(f"\n{'='*60}")
            print("IMPORTANT: Make sure your kernel is using this environment:")
            print(f"  Kernel → Change Kernel → {env_name}")
            print(f"{'='*60}\n")
        else:
            print("✗ No environment name provided - using current environment")
    
    else:
        print("\n✓ Using current environment")

print("\nYou can now proceed with the rest of the notebook.")

## 1) Configuration - Choose Your Setup

**Select which TTS model, PDF extractor, and formats you want to use.**

The notebook will automatically install only the dependencies you need!
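
The option values listed in the configuration cell can be checked up front, so a typo is caught before any packages are installed. The following is a minimal sketch, not part of the notebook's own modules: `check_settings` and the `ALLOWED_*` sets are hypothetical names, and the allowed values are copied from the comments in the configuration cell.

```python
ALLOWED_TTS_MODELS = {"kokoro_0.9", "kokoro_1.0", "silero_v5"}
ALLOWED_PDF_EXTRACTORS = {"unstructured", "pymupdf", "vision", "nougat", None}
ALLOWED_DEVICES = {"auto", "cuda", "cpu", "mps"}


def check_settings(tts_model, pdf_extractor, device, enable_pdf_input):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if tts_model not in ALLOWED_TTS_MODELS:
        problems.append(f"Unknown TTS_MODEL: {tts_model!r}")
    if pdf_extractor not in ALLOWED_PDF_EXTRACTORS:
        problems.append(f"Unknown PDF_EXTRACTOR: {pdf_extractor!r}")
    if device not in ALLOWED_DEVICES:
        problems.append(f"Unknown DEVICE: {device!r}")
    if enable_pdf_input and pdf_extractor is None:
        problems.append("PDF input is enabled but PDF_EXTRACTOR is None")
    return problems


# A valid combination yields no problems
print(check_settings("kokoro_0.9", "unstructured", "auto", True))

# An unknown model plus a missing extractor yields two problems
for problem in check_settings("kokoro_2.0", None, "auto", True):
    print(problem)
```

Returning a list rather than raising on the first error lets all mistakes be reported in one pass.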

In [None]:
# ========================================
# TTS MODEL SELECTION
# ========================================
# Choose ONE of the following:
#   - "kokoro_0.9": Kokoro v0.9+ (10 voices, English-focused, stable)
#   - "kokoro_1.0": Kokoro v1.0 (54 voices, 8 languages, latest)
#   - "silero_v5": Silero v5 (Russian language, 6 speakers)

TTS_MODEL = "kokoro_0.9"

# ========================================
# PDF EXTRACTOR SELECTION
# ========================================
# Choose ONE of the following:
#   - "unstructured": Advanced layout analysis (recommended, ~500MB dependencies)
#   - "pymupdf": Fast extraction for clean PDFs (~15MB, lightweight)
#   - "vision": OCR for scanned PDFs (macOS only)
#   - "nougat": Academic papers with equations (~1.5GB model)
#   - None: Skip PDF extraction (only for text/EPUB input)

PDF_EXTRACTOR = "unstructured"

# ========================================
# INPUT FORMATS
# ========================================
# Enable the input formats you plan to use:

ENABLE_TEXT_INPUT = True    # Plain text strings
ENABLE_PDF_INPUT = True     # PDF files (requires PDF_EXTRACTOR)
ENABLE_EPUB_INPUT = True    # EPUB books

# ========================================
# OUTPUT FORMATS
# ========================================
# Enable the output formats you plan to use:

ENABLE_WAV_OUTPUT = True    # WAV audio files
ENABLE_MP3_OUTPUT = True    # MP3 audio files (requires ffmpeg and pydub)

# ========================================
# DEVICE CONFIGURATION
# ========================================
# Device to use for TTS synthesis:
#   - "auto": Automatically select best device (CUDA > MPS > CPU)
#   - "cuda": Force CUDA/GPU
#   - "cpu": Force CPU
#   - "mps": Force Apple Silicon MPS

DEVICE = "auto"

# ========================================
# OUTPUT DIRECTORY
# ========================================
# Directory where generated files will be saved

OUTPUT_DIR = "."  # Current directory

# ========================================
# VALIDATION
# ========================================
if ENABLE_PDF_INPUT and PDF_EXTRACTOR is None:
    print("⚠️  WARNING: PDF input enabled but no PDF extractor selected!")
    print("   Set PDF_EXTRACTOR to 'unstructured', 'pymupdf', 'vision', or 'nougat'")

print("="*60)
print("CONFIGURATION SUMMARY")
print("="*60)
print(f"TTS Model: {TTS_MODEL}")
print(f"PDF Extractor: {PDF_EXTRACTOR or 'None'}")
print(f"Input Formats: Text={ENABLE_TEXT_INPUT}, PDF={ENABLE_PDF_INPUT}, EPUB={ENABLE_EPUB_INPUT}")
print(f"Output Formats: WAV={ENABLE_WAV_OUTPUT}, MP3={ENABLE_MP3_OUTPUT}")
print(f"Device: {DEVICE}")
print(f"Output Directory: {OUTPUT_DIR}")
print("="*60)

## 2) Install & Import Dependencies

**This cell will automatically install and import only the packages you need based on your configuration above.**

The first run may take a few minutes. Later runs are faster because packages that can already be imported are skipped.
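
A package's pip name and its import name are not always the same, which matters when testing whether something is already installed. The sketch below shows one way to handle that with `importlib.util.find_spec`, which looks a module up without importing it. `IMPORT_NAMES` and `is_installed` are hypothetical names; the `pymupdf` and Vision entries follow the install cell, while the `nougat-ocr` entry is an assumption about that package's import name.

```python
import importlib.util

# pip name -> import name, for packages where the two differ
IMPORT_NAMES = {
    "pymupdf": "fitz",
    "nougat-ocr": "nougat",  # assumption: imported as "nougat"
    "pyobjc-framework-Vision": "Vision",
}


def is_installed(pip_name):
    """Check whether a top-level module is available without importing it."""
    module_name = IMPORT_NAMES.get(pip_name, pip_name.replace("-", "_"))
    return importlib.util.find_spec(module_name) is not None


# "json" is a standard-library module, used here only as a known-present name
print(is_installed("json"))
print(is_installed("surely-not-a-real-package-xyz"))
```

Checking with `find_spec` avoids the side effects and load time of importing heavy packages such as `torch` just to see whether they exist.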

In [None]:
import subprocess
import sys

def install_package(package):
    """Install a package using pip."""
    print(f"Installing {package}...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])
    print(f"✓ {package} installed")

print("="*60)
print("INSTALLING DEPENDENCIES")
print("="*60)

# Core dependencies (always needed)
print("\n📦 Installing core dependencies...")
core_packages = ["torch", "soundfile", "numpy"]
for pkg in core_packages:
    try:
        __import__(pkg.replace("-", "_"))
        print(f"✓ {pkg} already installed")
    except ImportError:
        install_package(pkg)

# TTS model dependencies
print("\n🎤 Installing TTS model dependencies...")
if TTS_MODEL.startswith("kokoro"):
    # Check each package by its own import name so misaki is not skipped
    # once kokoro has been installed.
    kokoro_packages = {"kokoro": "kokoro>=0.9.4", "misaki": "misaki[en]"}
    for module_name, pkg in kokoro_packages.items():
        try:
            __import__(module_name)
            print(f"✓ {module_name} already installed")
        except ImportError:
            install_package(pkg)
elif TTS_MODEL == "silero_v5":
    try:
        __import__("omegaconf")
        print("✓ omegaconf already installed")
    except ImportError:
        install_package("omegaconf")
    print("✓ Silero loads via torch.hub (no additional packages needed)")

# PDF extractor dependencies
if ENABLE_PDF_INPUT and PDF_EXTRACTOR:
    print("\n📄 Installing PDF extractor dependencies...")
    
    if PDF_EXTRACTOR == "unstructured":
        unstructured_packages = [
            "unstructured[local-inference]",
            "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"
        ]
        for pkg in unstructured_packages:
            try:
                if "detectron2" in pkg:
                    __import__("detectron2")
                    print(f"✓ detectron2 already installed")
                else:
                    __import__("unstructured")
                    print(f"✓ unstructured already installed")
            except ImportError:
                install_package(pkg)
    
    elif PDF_EXTRACTOR == "pymupdf":
        try:
            __import__("fitz")
            print(f"✓ pymupdf already installed")
        except ImportError:
            install_package("pymupdf")
    
    elif PDF_EXTRACTOR == "vision":
        import platform
        if platform.system() != "Darwin":
            print("⚠️  WARNING: Vision Framework is only available on macOS!")
        else:
            vision_packages = ["pyobjc-framework-Vision", "pyobjc-framework-Quartz"]
            for pkg in vision_packages:
                try:
                    module_name = pkg.replace("-", "_").replace("pyobjc_framework_", "")
                    __import__(module_name)
                    print(f"✓ {pkg} already installed")
                except ImportError:
                    install_package(pkg)
    
    elif PDF_EXTRACTOR == "nougat":
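        # Map import names to pip names: nougat-ocr is imported as "nougat",
        # so deriving the module name from the pip name would always fail.
        nougat_packages = {"nougat": "nougat-ocr", "transformers": "transformers"}
        for module_name, pkg in nougat_packages.items():
            try:
                __import__(module_name)
                print(f"✓ {pkg} already installed")
            except ImportError:
                install_package(pkg)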

# EPUB dependencies
if ENABLE_EPUB_INPUT:
    print("\n📚 Installing EPUB dependencies...")
    try:
        __import__("ebooklib")
        print(f"✓ ebooklib already installed")
    except ImportError:
        install_package("ebooklib")

# MP3 output dependencies
if ENABLE_MP3_OUTPUT:
    print("\n🎵 Installing MP3 output dependencies...")
    try:
        __import__("pydub")
        print(f"✓ pydub already installed")
    except ImportError:
        install_package("pydub")
    print("\n⚠️  NOTE: MP3 encoding requires ffmpeg to be installed on your system:")
    print("   - macOS: brew install ffmpeg")
    print("   - Linux: sudo apt-get install ffmpeg")
    print("   - Windows: Download from https://ffmpeg.org/")

print("\n" + "="*60)
print("✓ ALL DEPENDENCIES INSTALLED")
print("="*60)

# Now import the modules
print("\n📥 Importing modules...")

import io
import json
from pathlib import Path

# Import our modular components
from config import TTSConfig, print_device_info, setup_logging
from tts_backends import create_backend
from tts_utils import wav_to_mp3_bytes, safe_name
from manifest import create_manifest, save_manifest

# Import PDF extractor if needed
if ENABLE_PDF_INPUT and PDF_EXTRACTOR:
    from pdf_extractors import get_available_extractors

# Import EPUB utilities if needed
if ENABLE_EPUB_INPUT:
    from tts_utils import extract_chapters_from_epub

# Set up logging to reduce noise
setup_logging()

print("✓ Modules imported successfully")
print("\n🚀 Ready to synthesize!")

## 3) Initialize TTS System

Load the TTS model and PDF extractor based on your configuration.
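
The `"auto"` device setting is described in the configuration cell as preferring CUDA, then MPS, then CPU. How `TTSConfig` resolves it internally is not shown here, so the following is only a sketch of that priority order. `pick_device` is a hypothetical helper, and the `torch` probing is wrapped so the example still runs when `torch` is not installed.

```python
def pick_device(requested, cuda_available, mps_available):
    """Resolve a DEVICE setting to a concrete device name."""
    if requested != "auto":
        # An explicit choice is passed through unchanged
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


try:
    import torch

    cuda_ok = torch.cuda.is_available()
    # Older torch builds have no MPS backend attribute at all
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend and mps_backend.is_available())
except ImportError:
    # torch is not installed: fall back to CPU for this illustration
    cuda_ok = mps_ok = False

print(pick_device("auto", cuda_ok, mps_ok))
```

Keeping the priority logic separate from the hardware probing makes the rule easy to check without a GPU.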

In [None]:
# Print available devices
print_device_info()

# Create configuration
config = TTSConfig(output_dir=OUTPUT_DIR, device=DEVICE)
print(f"\n{config}")

# Load TTS backend
print(f"\n📥 Loading TTS backend: {TTS_MODEL}...")
tts = create_backend(TTS_MODEL, device=config.device)
print(f"✓ TTS backend loaded: {tts.get_name()}")
print(f"  Available voices: {tts.get_available_voices()[:5]}...")  # Show first 5
print(f"  Default voice: {tts.get_default_voice()}")
print(f"  Sample rate: {tts.get_sample_rate()} Hz")

# Load PDF extractor if needed
if ENABLE_PDF_INPUT and PDF_EXTRACTOR:
    print(f"\n📥 Loading PDF extractor: {PDF_EXTRACTOR}...")
    extractors = get_available_extractors()
    pdf_extractor = extractors[PDF_EXTRACTOR]
    print(f"✓ PDF extractor loaded: {pdf_extractor.get_name()}")
    print(f"  Description: {pdf_extractor.get_description()}")
else:
    pdf_extractor = None
    print("\n⚠️  PDF extraction disabled (PDF input off or PDF_EXTRACTOR not set)")

print("\n✓ System initialized and ready!")

## 4) Synthesis Functions

High-level wrapper functions for easy synthesis.
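
Each wrapper writes an audio file and a JSON manifest that share a base name. The sketch below reproduces that naming scheme in isolation. `output_paths` is a hypothetical helper, and it assumes `config.get_output_path` simply joins the output directory and the base name, which is not confirmed here. Unlike the wrappers, which fall back to WAV for any format other than `"mp3"`, this sketch rejects unknown formats.

```python
from pathlib import Path


def output_paths(output_dir, basename, out_format="wav"):
    """Build the audio and manifest paths for one synthesis run."""
    fmt = out_format.lower()
    if fmt not in {"wav", "mp3"}:
        raise ValueError(f"Unsupported output format: {out_format!r}")
    base = Path(output_dir) / basename
    audio_path = f"{base}.{fmt}"
    manifest_path = f"{base}_manifest.json"
    return audio_path, manifest_path


audio_path, manifest_path = output_paths("out", "tts_text", "MP3")
print(audio_path)
print(manifest_path)
```

Note that `synth_pdf` appends `_tts` to the base name before building these paths, so a file named `document.pdf` produces `document_tts.wav` and `document_tts_manifest.json`.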

In [None]:
def synth_string(
    text,
    voice=None,
    speed=1.0,
    out_format="wav",
    basename="tts_text",
    **kwargs
):
    """Synthesize a text string to audio."""
    if not ENABLE_TEXT_INPUT:
        raise ValueError("Text input is disabled. Set ENABLE_TEXT_INPUT=True in configuration.")
    
    if out_format.lower() == "mp3" and not ENABLE_MP3_OUTPUT:
        raise ValueError("MP3 output is disabled. Set ENABLE_MP3_OUTPUT=True in configuration.")
    
    voice = voice or tts.get_default_voice()
    
    elements = [{
        "text": text,
        "metadata": {"page_number": 1, "source": "string", "points": None}
    }]
    
    # Synthesize
    if TTS_MODEL.startswith("kokoro"):
        wav_bytes, timeline = tts.synthesize_text_to_wav(
            elements, voice=voice, speed=speed, **kwargs
        )
    else:  # Silero
        wav_bytes, timeline = tts.synthesize_text_to_wav(
            elements, speaker=voice, **kwargs
        )
    
    # Save audio
    out_base = config.get_output_path(basename)
    
    if out_format.lower() == "mp3":
        mp3 = wav_to_mp3_bytes(wav_bytes)
        audio_path = str(out_base) + ".mp3"
        with open(audio_path, "wb") as f:
            f.write(mp3)
    else:
        audio_path = str(out_base) + ".wav"
        with open(audio_path, "wb") as f:
            f.write(wav_bytes)
    
    # Save manifest
    manifest_path = str(out_base) + "_manifest.json"
    manifest = create_manifest(Path(audio_path).name, timeline)
    save_manifest(manifest, manifest_path)
    
    return audio_path, manifest_path


def synth_pdf(
    file_path_or_bytes,
    voice=None,
    speed=1.0,
    out_format="wav",
    basename=None,
    pages=None,
    **kwargs
):
    """Synthesize a PDF to audio."""
    if not ENABLE_PDF_INPUT:
        raise ValueError("PDF input is disabled. Set ENABLE_PDF_INPUT=True in configuration.")
    
    if pdf_extractor is None:
        raise ValueError("No PDF extractor configured. Set PDF_EXTRACTOR in configuration.")
    
    if out_format.lower() == "mp3" and not ENABLE_MP3_OUTPUT:
        raise ValueError("MP3 output is disabled. Set ENABLE_MP3_OUTPUT=True in configuration.")
    
    voice = voice or tts.get_default_voice()
    
    # Load PDF
    if isinstance(file_path_or_bytes, (str, Path)):
        with open(file_path_or_bytes, "rb") as fh:
            pdf_bytes = io.BytesIO(fh.read())
        stem = Path(file_path_or_bytes).stem
    else:
        pdf_bytes = file_path_or_bytes
        stem = basename or "document"
    
    # Extract text
    elements = pdf_extractor.extract(pdf_bytes, pages=pages)
    
    # Synthesize
    if TTS_MODEL.startswith("kokoro"):
        wav_bytes, timeline = tts.synthesize_text_to_wav(
            elements, voice=voice, speed=speed, **kwargs
        )
    else:  # Silero
        wav_bytes, timeline = tts.synthesize_text_to_wav(
            elements, speaker=voice, **kwargs
        )
    
    # Save audio
    out_base = config.get_output_path(f"{basename or stem}_tts")
    
    if out_format.lower() == "mp3":
        mp3 = wav_to_mp3_bytes(wav_bytes)
        audio_path = str(out_base) + ".mp3"
        with open(audio_path, "wb") as f:
            f.write(mp3)
    else:
        audio_path = str(out_base) + ".wav"
        with open(audio_path, "wb") as f:
            f.write(wav_bytes)
    
    # Save manifest
    manifest_path = str(out_base) + "_manifest.json"
    manifest = create_manifest(Path(audio_path).name, timeline)
    save_manifest(manifest, manifest_path)
    
    return audio_path, manifest_path


def synth_epub(
    file_path_or_bytes,
    voice=None,
    speed=1.0,
    per_chapter_format="wav",
    zip_name=None,
    **kwargs
):
    """Synthesize an EPUB to per-chapter audio files in a ZIP."""
    import zipfile
    
    if not ENABLE_EPUB_INPUT:
        raise ValueError("EPUB input is disabled. Set ENABLE_EPUB_INPUT=True in configuration.")
    
    if per_chapter_format.lower() == "mp3" and not ENABLE_MP3_OUTPUT:
        raise ValueError("MP3 output is disabled. Set ENABLE_MP3_OUTPUT=True in configuration.")
    
    voice = voice or tts.get_default_voice()
    
    # Load EPUB
    if isinstance(file_path_or_bytes, (str, Path)):
        with open(file_path_or_bytes, "rb") as fh:
            epub_bytes = io.BytesIO(fh.read())
        stem = Path(file_path_or_bytes).stem
    else:
        epub_bytes = file_path_or_bytes
        stem = "book"
    
    # Extract chapters
    chapters = extract_chapters_from_epub(epub_bytes)
    # Raise explicitly: assert statements are skipped when Python runs with -O
    if not chapters:
        raise ValueError("No chapters detected in EPUB.")
    
    # Create ZIP
    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for idx, (title, body) in enumerate(chapters, 1):
            name = f"{idx:02d}_{safe_name(title)[:40]}"
            
            chapter_elements = [{
                "text": body,
                "metadata": {
                    "chapter_index": idx,
                    "chapter_title": title,
                    "page_number": 1,
                    "points": None
                }
            }]
            
            # Synthesize chapter
            if TTS_MODEL.startswith("kokoro"):
                wav_bytes, timeline = tts.synthesize_text_to_wav(
                    chapter_elements, voice=voice, speed=speed, **kwargs
                )
            else:  # Silero
                wav_bytes, timeline = tts.synthesize_text_to_wav(
                    chapter_elements, speaker=voice, **kwargs
                )
            
            # Add audio to ZIP
            if per_chapter_format.lower() == "mp3":
                data = wav_to_mp3_bytes(wav_bytes)
                audio_name = f"{name}.mp3"
                zf.writestr(audio_name, data)
            else:
                audio_name = f"{name}.wav"
                zf.writestr(audio_name, wav_bytes)
            
            # Add manifest to ZIP
            manifest = create_manifest(audio_name, timeline)
            zf.writestr(f"{name}_manifest.json", json.dumps(manifest, ensure_ascii=False, indent=2))
    
    # Save ZIP
    zip_buf.seek(0)
    zpath = str(config.get_output_path(f"{zip_name or (stem + '_chapters')}.zip"))
    with open(zpath, "wb") as f:
        f.write(zip_buf.read())
    
    return zpath


print("✓ Synthesis functions loaded")

## Usage Examples

Run the examples below to synthesize text, PDFs, and EPUBs.

**Note:** Only the examples for enabled input/output formats will work.

### A) String → Audio

In [None]:
if not ENABLE_TEXT_INPUT:
    print("⚠️  Text input is disabled. Set ENABLE_TEXT_INPUT=True to use this example.")
else:
    # Configuration
    VOICE = None  # Use default voice (or specify: "af_heart", "xenia", etc.)
    SPEED = 1.0   # Speech speed (Kokoro only)
    FORMAT = "mp3" if ENABLE_MP3_OUTPUT else "wav"
    BASENAME = "tts_text"

    # Text to synthesize
    TEXT = """Hello! This is a test of the unified TTS system.
    It automatically installs only the dependencies you need.
    """

    # Run synthesis
    audio_path, manifest_path = synth_string(
        TEXT,
        voice=VOICE,
        speed=SPEED,
        out_format=FORMAT,
        basename=BASENAME
    )

    print(f"\n✓ Audio saved to: {audio_path}")
    print(f"✓ Manifest saved to: {manifest_path}")

### B) PDF → Audio (with page selection)

In [None]:
if not ENABLE_PDF_INPUT:
    print("⚠️  PDF input is disabled. Set ENABLE_PDF_INPUT=True and PDF_EXTRACTOR to use this example.")
else:
    # Configuration
    VOICE = None  # Use default voice
    SPEED = 1.0
    FORMAT = "mp3" if ENABLE_MP3_OUTPUT else "wav"

    # PDF file path
    PDF_PATH = "document.pdf"  # Change this to your PDF filename

    # Page selection (optional)
    # None = all pages (default)
    # [1, 2, 3] = only pages 1, 2, and 3
    # [5] = only page 5
    PAGES = None

    # Run synthesis
    audio_path, manifest_path = synth_pdf(
        PDF_PATH,
        voice=VOICE,
        speed=SPEED,
        out_format=FORMAT,
        pages=PAGES
    )

    print(f"\n✓ Audio saved to: {audio_path}")
    print(f"✓ Manifest saved to: {manifest_path}")

### C) EPUB → ZIP (Per-Chapter Audio)

In [None]:
if not ENABLE_EPUB_INPUT:
    print("⚠️  EPUB input is disabled. Set ENABLE_EPUB_INPUT=True to use this example.")
else:
    # Configuration
    VOICE = None  # Use default voice
    SPEED = 1.0
    CHAPTER_FORMAT = "mp3" if ENABLE_MP3_OUTPUT else "wav"
    ZIP_NAME = ""  # Optional: custom name for ZIP file

    # EPUB file path
    EPUB_PATH = "book.epub"  # Change this to your EPUB filename

    # Run synthesis
    zip_path = synth_epub(
        EPUB_PATH,
        voice=VOICE,
        speed=SPEED,
        per_chapter_format=CHAPTER_FORMAT,
        zip_name=(ZIP_NAME or None)
    )

    print(f"\n✓ ZIP archive saved to: {zip_path}")

## Notes

- **Switching Models**: To use a different TTS model or PDF extractor, change the settings in Section 1 and re-run from there
- **Voice Selection**: Each model has different voices. Check the output of Section 3 for available voices
- **Manifest Files**: Each audio output includes a JSON manifest with sentence-level timing and coordinates
- **Dependencies**: Only the packages needed for your selected configuration were installed
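
The manifest schema is defined by the `manifest` module and is not documented here, so the sketch below inspects a manifest without assuming any field names. `summarize_manifest` is a hypothetical helper, and the demo file contains placeholder content only, not the real manifest layout.

```python
import json
import tempfile
from pathlib import Path


def summarize_manifest(path):
    """Load a manifest and report its top-level shape without assuming a schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        return {"type": "dict", "keys": sorted(data)}
    if isinstance(data, list):
        return {"type": "list", "length": len(data)}
    return {"type": type(data).__name__}


# Demo with a throwaway file; the key below is a placeholder, not a real field
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_manifest.json"
    demo.write_text(json.dumps({"placeholder": 1}), encoding="utf-8")
    print(summarize_manifest(demo))
```

Pointing the same function at a generated file such as `tts_text_manifest.json` shows which top-level fields are actually present.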

## Cleanup: Delete Environment (Optional)

**If you created a new environment at the beginning of this notebook**, you can delete it here to free up storage space.

⚠️ **Warning**: This will permanently delete the environment and all installed packages!
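
Before deleting anything, it is worth confirming that the environment still exists. `conda env list --json` prints a JSON object whose `envs` key holds the environment paths. The sketch below parses such output; `env_names_from_conda_json` is a hypothetical helper and the sample string stands in for real command output.

```python
import json
import os


def env_names_from_conda_json(text):
    """Extract environment directory names from `conda env list --json` output."""
    paths = json.loads(text).get("envs", [])
    # The last path component is the environment name for named environments
    return [os.path.basename(p.rstrip("/\\")) for p in paths]


# Sample output; real text would come from running the conda command
sample = '{"envs": ["/opt/conda", "/opt/conda/envs/tts_unified"]}'
names = env_names_from_conda_json(sample)
print("tts_unified" in names)
```

The real text can be captured with `subprocess.run(["conda", "env", "list", "--json"], capture_output=True, text=True).stdout`. The base installation appears under its directory name rather than as `base`.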

In [None]:
import subprocess

if 'environment_created_by_notebook' not in globals():
    print("✗ No environment tracking found")
    print("This cell only works if you ran the environment setup cell at the beginning")
elif not environment_created_by_notebook:
    print("✗ No environment was created by this notebook")
    print("You can only delete environments that were created in this session")
else:
    print(f"Environment '{environment_name}' was created by this notebook")
    print(f"\n{'='*60}")
    print("DELETE ENVIRONMENT")
    print(f"{'='*60}")
    
    confirm = input(f"\nAre you sure you want to DELETE '{environment_name}'?\nType 'yes' to confirm: ").strip().lower()
    
    if confirm == 'yes':
        print(f"\n→ Deleting environment '{environment_name}'...")
        print("  This may take a moment...")
        
        try:
            subprocess.run(['conda', 'env', 'remove', '-n', environment_name, '-y'],
                           check=True, capture_output=True)
            print(f"✓ Environment '{environment_name}' deleted successfully!")
            print("  Storage space has been freed.")
            
            environment_created_by_notebook = False
            environment_name = None
            
        except subprocess.CalledProcessError as e:
            print(f"✗ Failed to delete environment: {e}")
            print(f"You may need to delete it manually with: conda env remove -n {environment_name}")
    else:
        print("\n✗ Deletion cancelled - environment preserved")