# Unified TTS Notebook

**Single notebook for all TTS models and PDF extraction strategies**

This notebook provides a unified interface for:
- **TTS Models**: Kokoro (v0.9, v1.0), Silero v5
- **PDF Extractors**: Unstructured, PyMuPDF, Apple Vision, Nougat
- **Input Formats**: Text strings, PDF files, EPUB books
- **Output Formats**: WAV, MP3 with timeline manifests

Simply select your preferred model and extractor, then run synthesis!

## 0) Environment Setup (Optional)

**This step helps you manage Python packages and avoid conflicts with your system installation.**

- If you have **conda** installed, you can create a fresh environment for this notebook
- Or use an existing environment by providing its name
- At the end of the notebook, you can easily clean up and delete the environment to free storage
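
Before creating or switching environments, it can help to confirm which interpreter the current kernel is actually running. The following is a minimal sketch using only the standard library; the helper name `describe_active_environment` is illustrative and not part of this notebook's modules.

```python
import os
import sys


def describe_active_environment():
    """Return basic facts about the interpreter running this kernel."""
    return {
        "executable": sys.executable,  # path to the running Python binary
        "prefix": sys.prefix,          # root directory of the active environment
        # Set by `conda activate`; None if the variable is not set.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        "python": ".".join(map(str, sys.version_info[:3])),
    }


for key, value in describe_active_environment().items():
    print(f"{key}: {value}")
```

If `executable` does not point inside the environment you expect, packages installed later may end up somewhere other than where the kernel looks for them.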

In [None]:
import subprocess

# Flag to track if we created an environment in this notebook
environment_created_by_notebook = False
environment_name = None

# Check if conda is installed
try:
    result = subprocess.run(['conda', '--version'], capture_output=True, text=True, check=True)
    conda_available = True
    print(f"✓ Conda detected: {result.stdout.strip()}")
except (subprocess.CalledProcessError, FileNotFoundError):
    conda_available = False
    print("✗ Conda not found - skipping environment management")
    print("Packages will be installed in your current Python environment")

if conda_available:
    print("\n" + "="*60)
    print("ENVIRONMENT SETUP OPTIONS")
    print("="*60)
    
    choice = input("\nDo you want to:\n  [1] Create a NEW conda environment (recommended)\n  [2] Use an EXISTING environment\n  [3] Skip and use current environment\n\nEnter choice (1/2/3): ").strip()
    
    if choice == "1":
        # Create new environment
        env_name = input("\nEnter name for new environment (default: tts_unified): ").strip()
        if not env_name:
            env_name = "tts_unified"
        
        print(f"\n→ Creating conda environment: {env_name}")
        print("  This may take a few minutes...")
        
        try:
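            # Create environment with Python 3.10; ipykernel is required for Jupyter to run it as a kernel
            subprocess.run(['conda', 'create', '-n', env_name, 'python=3.10', 'ipykernel', '-y'],
                           check=True, capture_output=True)
            
            environment_created_by_notebook = True
            environment_name = env_name
            
            print(f"✓ Environment '{env_name}' created successfully!")
            print(f"\n{'='*60}")
            print("IMPORTANT: Restart your Jupyter kernel and select the new environment:")
            print(f"  Kernel → Change Kernel → {env_name}")
            print("If the environment is not listed, register it first with:")
            print(f"  conda run -n {env_name} python -m ipykernel install --user --name {env_name}")
            print(f"{'='*60}\n")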
            
        except subprocess.CalledProcessError as e:
            print(f"✗ Failed to create environment: {e}")
            print("Continuing with current environment...")
    
    elif choice == "2":
        # Use existing environment
        env_name = input("\nEnter name of existing environment: ").strip()
        if env_name:
            environment_name = env_name
            print(f"\n✓ Using existing environment: {env_name}")
            print(f"\n{'='*60}")
            print("IMPORTANT: Make sure your kernel is using this environment:")
            print(f"  Kernel → Change Kernel → {env_name}")
            print(f"{'='*60}\n")
        else:
            print("✗ No environment name provided - using current environment")
    
    else:
        print("\n✓ Using current environment")

print("\nYou can now proceed with the rest of the notebook.")

## 1) Install Dependencies

Install core dependencies and your chosen TTS model + PDF extractor.
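
After installing, a quick check can confirm that the core packages are importable and that ffmpeg is reachable. This is a rough sketch, not part of the notebook's modules: `missing_modules` is a hypothetical helper, and the list holds the import names of the core packages from the install cell.

```python
import importlib.util
import shutil


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names of the core packages installed in this section.
core_modules = ["torch", "soundfile", "ebooklib", "pydub"]
print("Missing modules:", missing_modules(core_modules) or "none")

# MP3 encoding needs the ffmpeg executable on PATH (see the install notes).
print("ffmpeg found:", shutil.which("ffmpeg") is not None)
```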

In [None]:
# %pip installs into the environment of the running kernel, which a plain
# `!pip` call does not guarantee when several Python environments exist.

# Core dependencies (always needed)
%pip install torch soundfile ebooklib pydub

# Choose your TTS model(s) - install the one(s) you need:

# For Kokoro TTS (extras are quoted so the shell does not expand the brackets):
%pip install "kokoro>=0.9.4" "misaki[en]"

# For Silero v5 (Russian):
# %pip install omegaconf
# Note: Silero loads via torch.hub, no separate package needed

# Choose your PDF extractor(s) - install the one(s) you need:

# For Unstructured (recommended, best quality):
%pip install "unstructured[local-inference]"
%pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"

# For PyMuPDF (fast, lightweight):
# %pip install pymupdf

# For Apple Vision (macOS only):
# %pip install pyobjc-framework-Vision pyobjc-framework-Quartz

# For Nougat (academic papers):
# %pip install nougat-ocr transformers

# Note: ffmpeg should be installed on your system for MP3 encoding
# Linux: sudo apt-get install ffmpeg
# macOS: brew install ffmpeg
# Windows: Download from https://ffmpeg.org/

## 2) Import Modules and Configure Logging

In [None]:
import io
import json
from pathlib import Path

# Import our modular components
from config import TTSConfig, print_device_info, setup_logging
from tts_backends import create_backend, get_available_backends
from pdf_extractors import get_available_extractors, get_default_extractor
from tts_utils import extract_chapters_from_epub, wav_to_mp3_bytes, safe_name
from manifest import create_manifest, save_manifest, print_manifest_summary

# Set up logging to reduce noise
setup_logging()

print("✓ Modules imported successfully")

## 3) Configuration

Configure your TTS system settings.
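
The `device="auto"` option used in this section prefers CUDA, then MPS, then CPU. That order can be sketched as a small pure function. `pick_device` is a hypothetical helper, not the actual implementation inside `TTSConfig`, and it takes the availability flags as arguments so that it runs without PyTorch installed.

```python
def pick_device(requested="auto", cuda_available=False, mps_available=False):
    """Resolve a device string using the CUDA > MPS > CPU preference order."""
    if requested != "auto":
        return requested  # an explicit choice is passed through unchanged
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


print(pick_device("auto", cuda_available=False, mps_available=True))  # mps
print(pick_device("cpu", cuda_available=True))                        # cpu
```

With PyTorch installed, the two flags would typically come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.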

In [None]:
# Print available devices
print_device_info()

# Create configuration
# device="auto" will automatically select the best available device (CUDA > MPS > CPU)
# You can also specify "cuda", "cpu", or "mps" explicitly
config = TTSConfig(
    output_dir=".",  # Output to current directory
    device="auto"     # Auto-select best device
)

print(f"\n{config}")

## 4) Select TTS Model and PDF Extractor

Choose which TTS model and PDF extraction strategy to use.
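
The selection cell relies on four informational methods of the backend object: `get_name()`, `get_available_voices()`, `get_default_voice()`, and `get_sample_rate()`. As a rough sketch of that contract, the following defines a `typing.Protocol` and a stand-in backend. `BackendInfo` and `DummyBackend` are hypothetical names, the return types and values are assumptions made for illustration, and the real classes in `tts_backends` may be structured differently.

```python
from typing import List, Protocol, runtime_checkable


@runtime_checkable
class BackendInfo(Protocol):
    """Informational methods the notebook calls on a TTS backend."""

    def get_name(self) -> str: ...
    def get_available_voices(self) -> List[str]: ...
    def get_default_voice(self) -> str: ...
    def get_sample_rate(self) -> int: ...


class DummyBackend:
    """Stand-in backend with placeholder values (not a real model)."""

    def get_name(self) -> str:
        return "dummy"

    def get_available_voices(self) -> List[str]:
        return ["voice_a", "voice_b"]

    def get_default_voice(self) -> str:
        return "voice_a"

    def get_sample_rate(self) -> int:
        return 16000


backend = DummyBackend()
# runtime_checkable only verifies that the methods exist, not their signatures.
print(isinstance(backend, BackendInfo))
print(backend.get_default_voice() in backend.get_available_voices())
```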

In [None]:
# ========================================
# TTS MODEL SELECTION
# ========================================
# Available options:
#   - "kokoro_0.9": Kokoro v0.9+ (10 voices, English-focused)
#   - "kokoro_1.0": Kokoro v1.0 (54 voices, 8 languages)
#   - "silero_v5": Silero v5 (Russian, 6 speakers)

TTS_MODEL = "kokoro_0.9"

# Create TTS backend
tts = create_backend(TTS_MODEL, device=config.device)
print(f"✓ TTS backend loaded: {tts.get_name()}")
print(f"  Available voices: {tts.get_available_voices()}")
print(f"  Default voice: {tts.get_default_voice()}")
print(f"  Sample rate: {tts.get_sample_rate()} Hz")

# ========================================
# PDF EXTRACTOR SELECTION
# ========================================
# Available options:
#   - "unstructured": Advanced layout analysis (recommended)
#   - "pymupdf": Fast extraction for clean PDFs
#   - "vision": OCR for scanned PDFs (macOS only)
#   - "nougat": Academic papers with equations

PDF_EXTRACTOR = "unstructured"

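# Get PDF extractor (fail with a readable message if it is not installed)
extractors = get_available_extractors()
if PDF_EXTRACTOR not in extractors:
    raise ValueError(
        f"PDF extractor '{PDF_EXTRACTOR}' is not available. "
        f"Available options: {sorted(extractors)}"
    )
pdf_extractor = extractors[PDF_EXTRACTOR]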
print(f"\n✓ PDF extractor loaded: {pdf_extractor.get_name()}")
print(f"  Description: {pdf_extractor.get_description()}")

## 5) High-Level Synthesis Functions

These wrapper functions make it easy to synthesize different input types.
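
Each function in this section branches on `TTS_MODEL` because the Kokoro backends take `voice` and `speed`, while the Silero backend takes `speaker`. One way to express that mapping in a single place is sketched here. `build_synth_kwargs` is a hypothetical helper and is not used by the functions in this notebook.

```python
def build_synth_kwargs(model_name, voice, speed=1.0, **extra):
    """Map generic arguments onto the keyword names each backend expects."""
    if model_name.startswith("kokoro"):
        kwargs = {"voice": voice, "speed": speed}
    else:
        # The Silero branch names the voice "speaker" and ignores speed.
        kwargs = {"speaker": voice}
    kwargs.update(extra)  # pass model-specific options through unchanged
    return kwargs


print(build_synth_kwargs("kokoro_0.9", "af_heart", speed=1.2))
print(build_synth_kwargs("silero_v5", "xenia"))
```

A call such as `tts.synthesize_text_to_wav(elements, **build_synth_kwargs(TTS_MODEL, voice, speed))` could then stand in for the repeated `if`/`else` blocks.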

In [None]:
def synth_string(
    text,
    voice=None,
    speed=1.0,
    out_format="wav",
    basename="tts_text",
    **kwargs
):
    """Synthesize a text string to audio.
    
    Args:
        text: Text to synthesize
        voice: Voice/speaker to use (None = default)
        speed: Speech speed (Kokoro only)
        out_format: "wav" or "mp3"
        basename: Base name for output files
        **kwargs: Additional model-specific parameters
    
    Returns:
        Tuple of (audio_path, manifest_path)
    """
    voice = voice or tts.get_default_voice()
    
    # Prepare element
    elements = [{
        "text": text,
        "metadata": {"page_number": 1, "source": "string", "points": None}
    }]
    
    # Synthesize
    if TTS_MODEL.startswith("kokoro"):
        wav_bytes, timeline = tts.synthesize_text_to_wav(
            elements, voice=voice, speed=speed, **kwargs
        )
    else:  # Silero
        wav_bytes, timeline = tts.synthesize_text_to_wav(
            elements, speaker=voice, **kwargs
        )
    
    # Save audio
    out_base = config.get_output_path(basename)
    
    if out_format.lower() == "mp3":
        mp3 = wav_to_mp3_bytes(wav_bytes)
        audio_path = str(out_base) + ".mp3"
        with open(audio_path, "wb") as f:
            f.write(mp3)
    else:
        audio_path = str(out_base) + ".wav"
        with open(audio_path, "wb") as f:
            f.write(wav_bytes)
    
    # Save manifest
    manifest_path = str(out_base) + "_manifest.json"
    manifest = create_manifest(Path(audio_path).name, timeline)
    save_manifest(manifest, manifest_path)
    
    return audio_path, manifest_path


def synth_pdf(
    file_path_or_bytes,
    voice=None,
    speed=1.0,
    out_format="wav",
    basename=None,
    pages=None,
    **kwargs
):
    """Synthesize a PDF to audio.
    
    Args:
        file_path_or_bytes: Path to PDF or BytesIO object
        voice: Voice/speaker to use (None = default)
        speed: Speech speed (Kokoro only)
        out_format: "wav" or "mp3"
        basename: Base name for output files
        pages: Optional list of page numbers (1-indexed). None = all pages
        **kwargs: Additional model-specific parameters
    
    Returns:
        Tuple of (audio_path, manifest_path)
    """
    voice = voice or tts.get_default_voice()
    
    # Load PDF
    if isinstance(file_path_or_bytes, (str, Path)):
        with open(file_path_or_bytes, "rb") as fh:
            pdf_bytes = io.BytesIO(fh.read())
        stem = Path(file_path_or_bytes).stem
    else:
        pdf_bytes = file_path_or_bytes
        stem = basename or "document"
    
    # Extract text
    elements = pdf_extractor.extract(pdf_bytes, pages=pages)
    
    # Synthesize
    if TTS_MODEL.startswith("kokoro"):
        wav_bytes, timeline = tts.synthesize_text_to_wav(
            elements, voice=voice, speed=speed, **kwargs
        )
    else:  # Silero
        wav_bytes, timeline = tts.synthesize_text_to_wav(
            elements, speaker=voice, **kwargs
        )
    
    # Save audio
    out_base = config.get_output_path(f"{basename or stem}_tts")
    
    if out_format.lower() == "mp3":
        mp3 = wav_to_mp3_bytes(wav_bytes)
        audio_path = str(out_base) + ".mp3"
        with open(audio_path, "wb") as f:
            f.write(mp3)
    else:
        audio_path = str(out_base) + ".wav"
        with open(audio_path, "wb") as f:
            f.write(wav_bytes)
    
    # Save manifest
    manifest_path = str(out_base) + "_manifest.json"
    manifest = create_manifest(Path(audio_path).name, timeline)
    save_manifest(manifest, manifest_path)
    
    return audio_path, manifest_path


def synth_epub(
    file_path_or_bytes,
    voice=None,
    speed=1.0,
    per_chapter_format="wav",
    zip_name=None,
    **kwargs
):
    """Synthesize an EPUB to per-chapter audio files in a ZIP.
    
    Args:
        file_path_or_bytes: Path to EPUB or BytesIO object
        voice: Voice/speaker to use (None = default)
        speed: Speech speed (Kokoro only)
        per_chapter_format: "wav" or "mp3"
        zip_name: Custom name for output ZIP
        **kwargs: Additional model-specific parameters
    
    Returns:
        Path to output ZIP file
    """
    import zipfile
    
    voice = voice or tts.get_default_voice()
    
    # Load EPUB
    if isinstance(file_path_or_bytes, (str, Path)):
        with open(file_path_or_bytes, "rb") as fh:
            epub_bytes = io.BytesIO(fh.read())
        stem = Path(file_path_or_bytes).stem
    else:
        epub_bytes = file_path_or_bytes
        stem = "book"
    
    # Extract chapters
    chapters = extract_chapters_from_epub(epub_bytes)
    if not chapters:
        raise ValueError("No chapters detected in EPUB.")
    
    # Create ZIP
    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for idx, (title, body) in enumerate(chapters, 1):
            name = f"{idx:02d}_{safe_name(title)[:40]}"
            
            chapter_elements = [{
                "text": body,
                "metadata": {
                    "chapter_index": idx,
                    "chapter_title": title,
                    "page_number": 1,
                    "points": None
                }
            }]
            
            # Synthesize chapter
            if TTS_MODEL.startswith("kokoro"):
                wav_bytes, timeline = tts.synthesize_text_to_wav(
                    chapter_elements, voice=voice, speed=speed, **kwargs
                )
            else:  # Silero
                wav_bytes, timeline = tts.synthesize_text_to_wav(
                    chapter_elements, speaker=voice, **kwargs
                )
            
            # Add audio to ZIP
            if per_chapter_format.lower() == "mp3":
                data = wav_to_mp3_bytes(wav_bytes)
                audio_name = f"{name}.mp3"
                zf.writestr(audio_name, data)
            else:
                audio_name = f"{name}.wav"
                zf.writestr(audio_name, wav_bytes)
            
            # Add manifest to ZIP
            manifest = create_manifest(audio_name, timeline)
            zf.writestr(f"{name}_manifest.json", json.dumps(manifest, ensure_ascii=False, indent=2))
    
    # Save ZIP
    zip_buf.seek(0)
    zpath = str(config.get_output_path(f"{zip_name or (stem + '_chapters')}.zip"))
    with open(zpath, "wb") as f:
        f.write(zip_buf.read())
    
    return zpath


print("✓ Synthesis functions loaded")

## Usage Examples

Run the examples below to synthesize text, PDFs, and EPUBs.

### A) String → Audio

In [None]:
# Configuration
VOICE = None  # Use default voice (or specify: "af_heart", "xenia", etc.)
SPEED = 1.0   # Speech speed (Kokoro only)
FORMAT = "mp3"  # "wav" or "mp3"
BASENAME = "tts_text"

# Text to synthesize
TEXT = """Paste or type your text here.
It can be multiple paragraphs.
"""

# Run synthesis
audio_path, manifest_path = synth_string(
    TEXT,
    voice=VOICE,
    speed=SPEED,
    out_format=FORMAT,
    basename=BASENAME
)

print(f"\n✓ Audio saved to: {audio_path}")
print(f"✓ Manifest saved to: {manifest_path}")

### B) PDF → Audio (with page selection)
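
`synth_pdf` takes `pages` as a list of 1-indexed page numbers, or `None` for all pages. For longer selections a range string can be easier to type, and the sketch below converts one into such a list. `parse_pages` is a hypothetical helper, not part of `pdf_extractors`.

```python
def parse_pages(spec):
    """Turn a string like "1-3,5" into a sorted list of 1-indexed pages.

    An empty or whitespace-only string returns None, meaning "all pages".
    """
    if not spec or not spec.strip():
        return None
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"Invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(p < 1 for p in pages):
        raise ValueError("Page numbers are 1-indexed and must be >= 1")
    return sorted(pages)


print(parse_pages("1-3,5"))  # [1, 2, 3, 5]
print(parse_pages(""))       # None
```

The result can be passed directly as `pages=parse_pages("1-3,5")`.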

In [None]:
# Configuration
VOICE = None  # Use default voice
SPEED = 1.0
FORMAT = "mp3"  # "wav" or "mp3"

# PDF file path
PDF_PATH = "document.pdf"  # Change this to your PDF filename

# Page selection (optional)
# None = all pages (default)
# [1, 2, 3] = only pages 1, 2, and 3
# [5] = only page 5
PAGES = None

# Run synthesis
audio_path, manifest_path = synth_pdf(
    PDF_PATH,
    voice=VOICE,
    speed=SPEED,
    out_format=FORMAT,
    pages=PAGES
)

print(f"\n✓ Audio saved to: {audio_path}")
print(f"✓ Manifest saved to: {manifest_path}")

### C) EPUB → ZIP (Per-Chapter Audio)
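
`synth_epub` writes one audio file and one `_manifest.json` per chapter into the archive, each prefixed with a two-digit chapter index. The following is a minimal sketch for checking that every audio entry has its manifest. It builds a small in-memory ZIP with placeholder bytes so that it runs without any synthesis, and `pair_audio_with_manifests` is a hypothetical helper.

```python
import io
import zipfile
from pathlib import PurePosixPath


def pair_audio_with_manifests(zf):
    """Map each audio entry in an open ZipFile to its manifest name (or None)."""
    names = set(zf.namelist())
    pairs = {}
    for name in sorted(names):
        path = PurePosixPath(name)
        if path.suffix.lower() in {".wav", ".mp3"}:
            manifest = f"{path.stem}_manifest.json"
            pairs[name] = manifest if manifest in names else None
    return pairs


# Build a tiny archive that mimics the naming scheme (contents are placeholders).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("01_Intro.wav", b"")
    zf.writestr("01_Intro_manifest.json", "{}")
    zf.writestr("02_Chapter_Two.wav", b"")

with zipfile.ZipFile(buf) as zf:
    print(pair_audio_with_manifests(zf))
```

For a real run, open the path returned by `synth_epub` with `zipfile.ZipFile(zip_path)` instead of the in-memory buffer.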

In [None]:
# Configuration
VOICE = None  # Use default voice
SPEED = 1.0
CHAPTER_FORMAT = "wav"  # "wav" or "mp3"
ZIP_NAME = ""  # Optional: custom name for ZIP file

# EPUB file path
EPUB_PATH = "book.epub"  # Change this to your EPUB filename

# Run synthesis
zip_path = synth_epub(
    EPUB_PATH,
    voice=VOICE,
    speed=SPEED,
    per_chapter_format=CHAPTER_FORMAT,
    zip_name=(ZIP_NAME or None)
)

print(f"\n✓ ZIP archive saved to: {zip_path}")

## Notes

- **Switching Models**: To use a different TTS model or PDF extractor, change `TTS_MODEL` or `PDF_EXTRACTOR` in Section 4 and re-run that cell. The synthesis functions pick up the new selection on their next call
- **Voice Selection**: Each model has its own set of voices. Call `tts.get_available_voices()` to list them
- **Output Directory**: By default, outputs are saved to the current working directory (`output_dir="."`). Change `output_dir` in Section 3 to save elsewhere
- **Device Selection**: With `device="auto"`, the notebook picks CUDA, MPS, or CPU, in that order of preference. Set `device` explicitly in Section 3 to override
- **Manifest Files**: Each audio output is saved with a JSON manifest containing sentence-level timing and coordinates
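
The manifest layout is defined by the `manifest` module and is not shown in this notebook, so the sketch below inspects a JSON file without assuming any field names. `summarize_json` is a hypothetical helper, and the demo file it reads is a throwaway example with made-up content, not a real manifest.

```python
import json
import tempfile
from pathlib import Path


def summarize_json(path):
    """Return the top-level type of a JSON file and its keys or length."""
    with open(path, "r", encoding="utf-8") as fh:
        data = json.load(fh)
    if isinstance(data, dict):
        return {"type": "object", "keys": sorted(data)}
    if isinstance(data, list):
        return {"type": "array", "length": len(data)}
    return {"type": type(data).__name__}


# Demo on a temporary file with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_manifest.json"
    demo.write_text(json.dumps({"b": 1, "a": [1, 2]}), encoding="utf-8")
    print(summarize_json(demo))
```

Pointing `summarize_json` at a `manifest_path` returned by `synth_string` or `synth_pdf` shows which top-level fields that manifest actually contains.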

## Cleanup: Delete Environment (Optional)

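**If you created a new environment at the beginning of this notebook**, you can delete it here to free up storage space. The tracking variables live in kernel memory, so this cell only recognizes the environment if the kernel has not been restarted since the setup cell ran. After a restart, delete the environment manually with `conda env remove -n <name>`.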

⚠️ **Warning**: This will permanently delete the environment and all installed packages!

In [None]:
import subprocess

# Check if we created an environment in this notebook
if 'environment_created_by_notebook' not in globals():
    print("✗ No environment tracking found")
    print("This cell only works if you ran the environment setup cell at the beginning")
elif not environment_created_by_notebook:
    print("✗ No environment was created by this notebook")
    print("You can only delete environments that were created in this session")
else:
    print(f"Environment '{environment_name}' was created by this notebook")
    print(f"\n{'='*60}")
    print("DELETE ENVIRONMENT")
    print(f"{'='*60}")
    
    confirm = input(f"\nAre you sure you want to DELETE '{environment_name}'?\nType 'yes' to confirm: ").strip().lower()
    
    if confirm == 'yes':
        print(f"\n→ Deleting environment '{environment_name}'...")
        print("  This may take a moment...")
        
        try:
            subprocess.run(['conda', 'env', 'remove', '-n', environment_name, '-y'],
                           check=True, capture_output=True)
            print(f"✓ Environment '{environment_name}' deleted successfully!")
            print("  Storage space has been freed.")
            
            # Reset the flag
            environment_created_by_notebook = False
            environment_name = None
            
        except subprocess.CalledProcessError as e:
            print(f"✗ Failed to delete environment: {e}")
            print(f"You may need to delete it manually with: conda env remove -n {environment_name}")
    else:
        print("\n✗ Deletion cancelled - environment preserved")