# Echo-TTS Gradio App

Quick setup notebook for running Echo-TTS on an L4 GPU (24 GB VRAM).

Features:
- **Standard Mode**: Normal TTS generation
- **Continuation Mode**: Continue generating from existing audio (blockwise)
- **Rhythm Transfer Mode**: Transfer phoneme-level timing from reference audio using wav2vec2 + G2P + DTW alignment

## 1. Clone Repository

In [None]:
!git clone -b phonem https://github.com/CoreBedtime/echo-tts.git
%cd echo-tts

## 2. Install Dependencies

In [None]:
!pip install -q torch torchaudio --index-url https://download.pytorch.org/whl/cu121
!pip install -q torchcodec
!pip install -q gradio==5.49.1
!pip install -q huggingface-hub safetensors einops
# For rhythm transfer (phoneme extraction)
!pip install -q transformers g2p_en

## 3. Install FFmpeg

In [None]:
!apt-get update -qq && apt-get install -qq -y ffmpeg > /dev/null 2>&1
!ffmpeg -version | head -1

## 4. Verify GPU

In [None]:
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

## 5. Launch Gradio App

The first run downloads the model weights (~2 GB), so expect a delay before the app starts.

In [None]:
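import os

# Bind to all interfaces so the server is reachable from outside the notebook host.
os.environ["GRADIO_SERVER_NAME"] = "0.0.0.0"

from gradio_app import demo

# share=True requests a temporary public Gradio link; show_error=True surfaces exceptions in the UI.
demo.launch(share=True, show_error=True)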

---

## Usage Guide

### Standard Mode
1. Upload a speaker reference audio (or leave blank)
2. Enter your text prompt
3. Click **Generate Audio**

### Continuation Mode
1. Select **Continuation** from Generation Mode
2. Upload the audio you want to continue from
3. Enter text that includes BOTH the original transcription AND the new text
4. Click **Generate Audio**
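
As a minimal sketch of step 3, the text for Continuation Mode is the transcript of the uploaded audio followed by the new text. The helper name `build_continuation_prompt` is hypothetical and not part of the Echo-TTS repository; the app itself only expects the combined string in the text box.

```python
def build_continuation_prompt(original_transcript: str, new_text: str) -> str:
    """Join the transcript of the uploaded audio with the text to generate next.

    Hypothetical helper for illustration only; Echo-TTS does not ship it.
    """
    # Collapse runs of whitespace so the two parts join with a single space.
    original = " ".join(original_transcript.split())
    addition = " ".join(new_text.split())
    if not original:
        raise ValueError("Continuation needs the transcript of the uploaded audio.")
    if not addition:
        raise ValueError("Continuation needs new text to generate.")
    return f"{original} {addition}"


prompt = build_continuation_prompt(
    "The weather was clear this morning.",
    "By noon, clouds had started to roll in.",
)
print(prompt)
```

Paste the resulting string into the text prompt field before clicking **Generate Audio**.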

### Rhythm Transfer Mode (Phoneme-Based)
1. Select **Rhythm Transfer** from Generation Mode
2. Upload the rhythm source audio (the clip whose pacing you want to mimic)
3. Enter your target text
4. Set the target duration and the phoneme group threshold
5. Click **Generate Audio**

The algorithm:
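- Extracts phonemes and their timings from the reference audio using **wav2vec2**
- Converts your text to phonemes using **G2P** (grapheme-to-phoneme conversion)
- Aligns the reference phonemes to the target phonemes using **DTW** (dynamic time warping)
- Transfers the aligned reference durations to the target phonemes to produce a matching rhythm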
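
The last two steps can be illustrated with a small, dependency-free sketch. It is not the repository's implementation: the function names, the 0/1 phoneme mismatch cost, and the rule for splitting durations are all assumptions made for illustration, and the phoneme lists are hard-coded instead of coming from wav2vec2 and G2P.

```python
from typing import List, Optional, Tuple


def dtw_align(ref: List[str], tgt: List[str]) -> List[Tuple[int, int]]:
    """Return a DTW warping path as (ref_index, tgt_index) pairs.

    Assumed cost: 0 when two phoneme symbols match, 1 otherwise.
    """
    n, m = len(ref), len(tgt)
    inf = float("inf")
    # cost[i][j] is the cheapest alignment of ref[:i] with tgt[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0.0 if ref[i - 1] == tgt[j - 1] else 1.0
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return path


def transfer_durations(
    ref_durs: List[float],
    path: List[Tuple[int, int]],
    n_tgt: int,
    target_total: Optional[float] = None,
) -> List[float]:
    """Map reference phoneme durations onto target phonemes along the path.

    A reference phoneme aligned to several targets is split evenly among
    them; a target aligned to several reference phonemes gets their sum.
    """
    fanout = [0] * len(ref_durs)
    for i, _ in path:
        fanout[i] += 1
    out = [0.0] * n_tgt
    for i, j in path:
        out[j] += ref_durs[i] / fanout[i]
    # Optionally rescale so the result sums to the requested duration.
    total = sum(out)
    if target_total is not None and total > 0:
        out = [d * target_total / total for d in out]
    return out


# Made-up example data: reference phonemes with durations in seconds.
ref_phonemes = ["HH", "AH", "L", "OW"]
ref_durations = [0.08, 0.12, 0.10, 0.30]
tgt_phonemes = ["HH", "EH", "L", "OW", "Z"]

path = dtw_align(ref_phonemes, tgt_phonemes)
tgt_durations = transfer_durations(
    ref_durations, path, len(tgt_phonemes), target_total=2.0
)
for phoneme, dur in zip(tgt_phonemes, tgt_durations):
    print(f"{phoneme}: {dur:.3f} s")
```

The splitting rule keeps the total reference duration intact before rescaling, so the relative pacing of the reference carries over even when the target has a different number of phonemes.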

### Tips
- If the voice doesn't match the reference, enable **Force Speaker**
- For continuation, use [WhisperD](https://huggingface.co/jordand/whisper-d-v1a) for accurate transcription
- An L4 GPU can handle the full latent length of 640 (~30 seconds of audio)
- A lower phoneme group threshold produces more blocks, which gives finer rhythm control
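
The 640-latent figure above implies roughly 21 latent frames per second of audio. Assuming the relationship is linear (the source gives only the approximate endpoint), a rough latent length for a shorter target duration can be estimated as follows. The function name and constants are illustrative and not part of the app.

```python
import math

MAX_LATENT_LENGTH = 640  # full latent length quoted in the tips
MAX_SECONDS = 30.0       # approximate audio duration at full length


def latents_for_duration(seconds: float) -> int:
    """Estimate the latent length for a target duration, capped at the maximum.

    Assumes latent length scales linearly with duration.
    """
    if seconds <= 0:
        raise ValueError("Duration must be positive.")
    estimate = math.ceil(seconds * MAX_LATENT_LENGTH / MAX_SECONDS)
    return min(MAX_LATENT_LENGTH, estimate)


for secs in (5, 15, 30):
    print(f"{secs} s -> about {latents_for_duration(secs)} latents")
```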