# Speech Generation

**Course**: Hung-yi Lee (李宏毅), 2025 Fall GenAI-ML HW10 - Speech Generation

This notebook introduces speech synthesis (Text-to-Speech, TTS):
- **Traditional methods**: concatenative synthesis, parametric synthesis
- **Neural methods**: Tacotron, FastSpeech
- **Vocoders**: WaveNet, HiFi-GAN
- **End-to-end models**: VITS, Bark

```
Evolution of TTS systems:

Traditional (pre-2010s)    Neural (2016+)             End-to-end (2020+)
─────────────────          ──────────────────          ─────────────────
┌──────────────┐           ┌──────────────┐           ┌──────────────┐
│ Text analysis│           │ Text encoding│           │              │
│ G2P, prosody │           │ (Embedding)  │           │              │
└──────┬───────┘           └──────┬───────┘           │    VITS /    │
       │                          │                   │    Bark      │
       ▼                          ▼                   │              │
┌──────────────┐           ┌──────────────┐           │  Text ──►    │
│Unit selection│           │ Tacotron /   │           │   Audio      │
│ concatenative│           │ FastSpeech   │           │              │
└──────┬───────┘           └──────┬───────┘           │              │
       │                          │ Mel               │              │
       ▼                          ▼                   │              │
┌──────────────┐           ┌──────────────┐           │              │
│ Signal proc. │           │ WaveNet /    │           │              │
│ (PSOLA)      │           │ HiFi-GAN     │           │              │
└──────────────┘           └──────────────┘           └──────────────┘
```

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from typing import Optional, List, Dict, Tuple
from dataclasses import dataclass
import math

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## Part 1: Speech Signal Basics

This part covers the basic representations of a speech signal.

```
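Speech signal representations:

1. Waveform
   - Raw audio signal
   - Sample rate: 16 kHz, 22.05 kHz, 24 kHz
   - Amplitude as a function of time

2. Spectrogram
   - Time-frequency representation
   - Computed with the STFT
   - Shows how the frequency content changes over time

3. Mel spectrogram
   - Frequency scale that matches human auditory perception
   - 80 mel bins is a common choice
   - Standard intermediate representation in TTS systems

Mel scale conversion (f in Hz):
m = 2595 * log10(1 + f/700)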
```
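To make the mel formula concrete, here is a small sketch of the conversion and its inverse. The function names `hz_to_mel` and `mel_to_hz` are chosen for illustration, and the 0 to 8000 Hz range is only an assumed example range.

```python
import numpy as np

def hz_to_mel(hz):
    """Convert frequency in Hz to mels: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse mapping: f = 700 * (10 ** (m / 2595) - 1)."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# Equal steps on the mel axis correspond to increasingly wide steps in Hz,
# which is why mel filters are narrow at low and wide at high frequencies.
mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 6)
hz_edges = mel_to_hz(mel_edges)
print(np.round(hz_edges, 1))
print(np.round(np.diff(hz_edges), 1))
```

The second printed row shows the Hz width of each mel-spaced band, which grows with frequency.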

In [None]:
class AudioProcessor:
    """Audio processing utilities."""
    
    def __init__(
        self,
        sample_rate: int = 22050,
        n_fft: int = 1024,
        hop_length: int = 256,
        n_mels: int = 80,
        f_min: float = 0.0,
        f_max: float = 8000.0
    ):
        self.sample_rate = sample_rate
        self.n_fft = n_fft
        self.hop_length = hop_length
        self.n_mels = n_mels
        self.f_min = f_min
        self.f_max = f_max
        
        # Build the mel filterbank
        self.mel_filterbank = self._create_mel_filterbank()
        
    def _hz_to_mel(self, hz: float) -> float:
        """Convert Hz to mel."""
        return 2595 * np.log10(1 + hz / 700)
    
    def _mel_to_hz(self, mel: float) -> float:
        """Convert mel to Hz."""
        return 700 * (10 ** (mel / 2595) - 1)
    
    def _create_mel_filterbank(self) -> np.ndarray:
        """Build a triangular mel filterbank of shape [n_mels, n_fft // 2 + 1]."""
        # Filter edge frequencies, equally spaced on the mel scale
        mel_min = self._hz_to_mel(self.f_min)
        mel_max = self._hz_to_mel(self.f_max)
        mel_points = np.linspace(mel_min, mel_max, self.n_mels + 2)
        hz_points = np.array([self._mel_to_hz(m) for m in mel_points])
        
        # FFT bin index of each edge
        bin_points = np.floor((self.n_fft + 1) * hz_points / self.sample_rate).astype(int)
        
        # Build the filters. With a small n_fft, neighbouring low-frequency
        # edges can land in the same bin, which leaves some filters empty.
        filterbank = np.zeros((self.n_mels, self.n_fft // 2 + 1))
        
        for i in range(self.n_mels):
            left = bin_points[i]
            center = bin_points[i + 1]
            right = bin_points[i + 2]
            
            # Rising slope
            for j in range(left, center):
                if j < filterbank.shape[1]:
                    filterbank[i, j] = (j - left) / (center - left + 1e-8)
            # Falling slope
            for j in range(center, right):
                if j < filterbank.shape[1]:
                    filterbank[i, j] = (right - j) / (right - center + 1e-8)
                    
        return filterbank
    
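    def waveform_to_mel(self, waveform: np.ndarray) -> np.ndarray:
        """
        Convert a waveform to a log-mel spectrogram.
        
        Args:
            waveform: audio waveform [T], with T >= n_fft
            
        Returns:
            log-mel spectrogram [n_mels, frames]
        """
        if len(waveform) < self.n_fft:
            raise ValueError("waveform must contain at least n_fft samples")
        
        # STFT analysis window
        window = np.hanning(self.n_fft)
        
        # Number of frames (no padding at the edges)
        num_frames = 1 + (len(waveform) - self.n_fft) // self.hop_length
        
        # Framing and windowing
        frames = np.zeros((num_frames, self.n_fft))
        for i in range(num_frames):
            start = i * self.hop_length
            frames[i] = waveform[start:start + self.n_fft] * window
            
        # Power spectrum of each frame
        spectrogram = np.abs(np.fft.rfft(frames, axis=1)) ** 2
        
        # Apply the mel filterbank
        mel_spec = np.dot(spectrogram, self.mel_filterbank.T)
        
        # Log compression, floored to avoid log(0)
        mel_spec = np.log(np.maximum(mel_spec, 1e-5))
        
        return mel_spec.T  # [n_mels, frames]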


# Demo
processor = AudioProcessor()
print(f"Mel filterbank shape: {processor.mel_filterbank.shape}")

# Generate a test signal: exactly 1 second sampled at 22050 Hz
t = np.arange(22050) / 22050
test_wave = np.sin(2 * np.pi * 440 * t)  # 440 Hz sine wave
mel_spec = processor.waveform_to_mel(test_wave)
print(f"Mel spectrogram shape: {mel_spec.shape}")
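As a sanity check on the framing arithmetic used above, the following sketch reproduces the STFT step in vectorized form and locates the strongest frequency bin of a 440 Hz tone. It uses NumPy's `sliding_window_view` instead of an explicit loop, which is a substitution made here for brevity rather than part of the notebook's class.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

sr, n_fft, hop = 22050, 1024, 256
t = np.arange(sr) / sr                      # exactly 1 second at sr Hz
wave = np.sin(2 * np.pi * 440 * t)

# Frame the signal without padding: 1 + (T - n_fft) // hop frames.
frames = sliding_window_view(wave, n_fft)[::hop] * np.hanning(n_fft)
power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # [frames, n_fft // 2 + 1]

expected_frames = 1 + (len(wave) - n_fft) // hop
peak_bin = int(power.mean(axis=0).argmax())
peak_hz = peak_bin * sr / n_fft              # centre frequency of that bin
print(frames.shape[0], expected_frames, peak_bin, round(peak_hz, 1))
```

The peak lands in the bin whose centre is closest to 440 Hz; the residual error is bounded by the bin width `sr / n_fft`.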

In [None]:
def visualize_audio_representations():
    """Visualize different audio representations."""
    fig, axes = plt.subplots(3, 1, figsize=(12, 8))
    
    # Generate a composite signal
    sr = 22050
    duration = 0.5
    t = np.arange(int(sr * duration)) / sr  # sample times spaced exactly 1/sr apart
    
    # Speech-like signal: fundamental plus harmonics
    f0 = 150  # fundamental frequency (Hz)
    wave = np.zeros_like(t)
    for harmonic in range(1, 6):
        amplitude = 1.0 / harmonic
        wave += amplitude * np.sin(2 * np.pi * f0 * harmonic * t)
    
    # Apply an amplitude envelope (fade in, hold, fade out)
    envelope = np.concatenate([
        np.linspace(0, 1, len(t)//4),
        np.ones(len(t)//2),
        np.linspace(1, 0, len(t) - len(t)//4 - len(t)//2)
    ])
    wave = wave * envelope
    wave = wave / np.max(np.abs(wave))
    
    # 1. Waveform
    axes[0].plot(t[:2000], wave[:2000], 'b-', linewidth=0.5)
    axes[0].set_xlabel('Time (s)')
    axes[0].set_ylabel('Amplitude')
    axes[0].set_title('Waveform')
    axes[0].grid(True, alpha=0.3)
    
    # 2. Spectrogram
    n_fft = 512
    hop = 128
    window = np.hanning(n_fft)
    num_frames = 1 + (len(wave) - n_fft) // hop
    
    spec = np.zeros((n_fft // 2 + 1, num_frames))
    for i in range(num_frames):
        frame = wave[i * hop:i * hop + n_fft] * window
        spec[:, i] = np.abs(np.fft.rfft(frame))
    
    axes[1].imshow(
        20 * np.log10(spec + 1e-6),
        aspect='auto',
        origin='lower',
        cmap='viridis'
    )
    axes[1].set_xlabel('Frame')
    axes[1].set_ylabel('Frequency Bin')
    axes[1].set_title('Spectrogram (dB)')
    
    # 3. Mel spectrogram
    processor = AudioProcessor(sample_rate=sr, n_fft=n_fft, hop_length=hop)
    mel_spec = processor.waveform_to_mel(wave)
    
    axes[2].imshow(
        mel_spec,
        aspect='auto',
        origin='lower',
        cmap='viridis'
    )
    axes[2].set_xlabel('Frame')
    axes[2].set_ylabel('Mel Bin')
    axes[2].set_title('Mel Spectrogram')
    
    plt.tight_layout()
    plt.show()

visualize_audio_representations()

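## Part 2: Tacotron Architecture

Tacotron is a classic sequence-to-sequence TTS model: an attention-based decoder predicts mel frames autoregressively from the encoded text.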

```
Tacotron 2 architecture:

┌─────────────────────────────────────────────────────────────┐
│                        Encoder                              │
├─────────────────────────────────────────────────────────────┤
│  Text input "Hello world"                                   │
│         │                                                   │
│         ▼                                                   │
│  ┌─────────────┐                                            │
│  │ Character   │                                            │
│  │ Embedding   │                                            │
│  └──────┬──────┘                                            │
│         │                                                   │
│         ▼                                                   │
│  ┌─────────────┐                                            │
│  │ 3 Conv + BN │  (5×1 kernels)                            │
│  └──────┬──────┘                                            │
│         │                                                   │
│         ▼                                                   │
│  ┌─────────────┐                                            │
│  │ BiLSTM      │                                            │
│  └──────┬──────┘                                            │
│         │                                                   │
│         ▼ Encoder outputs                                   │
└─────────────────────────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────────────────┐
│                        Decoder                              │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Previous mel frame ──► PreNet ──┐                         │
│                                   │                         │
│  Encoder outputs ──► Attention ◄──┘                         │
│                          │                                  │
│                          ▼                                  │
│                    ┌──────────┐                             │
│                    │ 2 LSTM   │                             │
│                    └────┬─────┘                             │
│                         │                                   │
│              ┌──────────┴──────────┐                        │
│              ▼                     ▼                        │
│        ┌──────────┐          ┌──────────┐                   │
│        │ Linear   │          │ Linear   │                   │
│        │ (mel)    │          │ (stop)   │                   │
│        └────┬─────┘          └──────────┘                   │
│             │                                               │
│             ▼                                               │
│        PostNet (5 Conv)                                     │
│             │                                               │
│             ▼                                               │
│        Mel Spectrogram                                      │
└─────────────────────────────────────────────────────────────┘
```

In [None]:
class TacotronEncoder(nn.Module):
    """Tacotron 2 encoder."""
    
    def __init__(
        self,
        vocab_size: int = 100,
        embedding_dim: int = 512,
        encoder_dim: int = 512,
        num_conv_layers: int = 3,
        conv_kernel_size: int = 5
    ):
        super().__init__()
        
        # Character embedding
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        # Convolution stack
        conv_layers = []
        for i in range(num_conv_layers):
            in_channels = embedding_dim if i == 0 else encoder_dim
            conv_layers.extend([
                nn.Conv1d(
                    in_channels, encoder_dim,
                    kernel_size=conv_kernel_size,
                    padding=conv_kernel_size // 2
                ),
                nn.BatchNorm1d(encoder_dim),
                nn.ReLU(),
                nn.Dropout(0.5)
            ])
        self.convolutions = nn.Sequential(*conv_layers)
        
        # BiLSTM
        self.lstm = nn.LSTM(
            encoder_dim,
            encoder_dim // 2,
            num_layers=1,
            batch_first=True,
            bidirectional=True
        )
        
    def forward(self, text: torch.Tensor) -> torch.Tensor:
        """
        Args:
            text: token ID sequence [batch, seq_len]
            
        Returns:
            encoder outputs [batch, seq_len, encoder_dim]
        """
        # Embedding lookup
        x = self.embedding(text)  # [batch, seq_len, embed_dim]
        
        # Convolutions (Conv1d expects channels first, so transpose)
        x = x.transpose(1, 2)  # [batch, embed_dim, seq_len]
        x = self.convolutions(x)
        x = x.transpose(1, 2)  # [batch, seq_len, encoder_dim]
        
        # BiLSTM
        x, _ = self.lstm(x)
        
        return x


# Test
encoder = TacotronEncoder()
test_text = torch.randint(0, 100, (2, 20))  # batch=2, seq_len=20
encoder_output = encoder(test_text)
print(f"Encoder output shape: {encoder_output.shape}")
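The encoder test above feeds equal-length sequences. With padded batches, the BiLSTM would otherwise also run over padding. One common way to avoid that, sketched here under the assumption that per-example lengths are available (the `lengths` tensor is hypothetical and not part of the notebook's encoder), is to pack the sequences before the LSTM:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
lstm = nn.LSTM(512, 256, batch_first=True, bidirectional=True)

x = torch.randn(2, 20, 512)        # stand-in for the conv-stack output
lengths = torch.tensor([20, 12])   # hypothetical true lengths (CPU tensor)

# Pack so the LSTM only processes the valid steps of each sequence.
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
packed_out, _ = lstm(packed)
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

print(out.shape)                        # same [batch, seq_len, 512] layout as before
print(out[1, 12:].abs().sum().item())   # steps past the true length are zero-filled
```

The backward direction of the BiLSTM then starts at the last real token of each sequence instead of at the padding.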

In [None]:
class LocationSensitiveAttention(nn.Module):
    """Location-sensitive attention (used in Tacotron 2)."""
    
    def __init__(
        self,
        attention_dim: int = 128,
        encoder_dim: int = 512,
        decoder_dim: int = 1024,
        attention_location_n_filters: int = 32,
        attention_location_kernel_size: int = 31
    ):
        super().__init__()
        
        # Query projection
        self.query_layer = nn.Linear(decoder_dim, attention_dim, bias=False)
        
        # Memory projection
        self.memory_layer = nn.Linear(encoder_dim, attention_dim, bias=False)
        
        # Location convolution over the 2-channel attention-weight history
        self.location_conv = nn.Conv1d(
            2, attention_location_n_filters,
            kernel_size=attention_location_kernel_size,
            padding=attention_location_kernel_size // 2
        )
        self.location_dense = nn.Linear(attention_location_n_filters, attention_dim, bias=False)
        
        # Energy (score) projection
        self.v = nn.Linear(attention_dim, 1, bias=False)
        self.score_mask_value = -float('inf')
        
    def forward(
        self,
        query: torch.Tensor,
        memory: torch.Tensor,
        processed_memory: torch.Tensor,
        attention_weights_cat: torch.Tensor,
        mask: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Args:
            query: decoder state [batch, decoder_dim]
            memory: encoder outputs [batch, seq_len, encoder_dim]
            processed_memory: memory projected by memory_layer [batch, seq_len, attention_dim]
            attention_weights_cat: attention weights from earlier steps [batch, 2, seq_len]
            mask: padding mask [batch, seq_len], True at padded positions
            
        Returns:
            context: context vector [batch, encoder_dim]
            attention_weights: attention weights [batch, seq_len]
        """
        # Project the query
        processed_query = self.query_layer(query.unsqueeze(1))  # [batch, 1, attention_dim]
        
        # Project the location features
        processed_location = self.location_conv(attention_weights_cat)  # [batch, filters, seq_len]
        processed_location = processed_location.transpose(1, 2)  # [batch, seq_len, filters]
        processed_location = self.location_dense(processed_location)  # [batch, seq_len, attention_dim]
        
        # Compute energies
        energies = self.v(torch.tanh(
            processed_query + processed_memory + processed_location
        )).squeeze(-1)  # [batch, seq_len]
        
        # Apply the mask (out of place, so autograd never sees an in-place edit)
        if mask is not None:
            energies = energies.masked_fill(mask, self.score_mask_value)
            
        # Normalize energies into attention weights
        attention_weights = F.softmax(energies, dim=1)
        
        # Context vector: attention-weighted sum of encoder outputs
        context = torch.bmm(attention_weights.unsqueeze(1), memory).squeeze(1)
        
        return context, attention_weights


# Test
attention = LocationSensitiveAttention()
query = torch.randn(2, 1024)
memory = torch.randn(2, 20, 512)
processed_memory = attention.memory_layer(memory)  # [2, 20, 128]
attn_weights_cat = torch.zeros(2, 2, 20)  # no attention history at the first step

context, weights = attention(query, memory, processed_memory, attn_weights_cat)
print(f"Context shape: {context.shape}")
print(f"Attention weights shape: {weights.shape}")
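The `[batch, 2, seq_len]` location input is typically built by stacking the previous step's attention weights with their running sum, a convention used in common open-source Tacotron 2 implementations. The sketch below shows only that bookkeeping; random energies stand in for the learned attention score, so it is an illustration of the state update, not a working decoder.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, steps = 2, 20, 5

attn = torch.zeros(batch, seq_len)       # weights from the previous step
attn_cum = torch.zeros(batch, seq_len)   # running sum over all past steps

for _ in range(steps):
    # Location features passed to the attention module: [batch, 2, seq_len].
    attn_cat = torch.stack([attn, attn_cum], dim=1)
    # Stand-in for the attention module: random energies, used only to
    # show how the two state tensors are updated after each step.
    energies = torch.randn(batch, seq_len)
    attn = F.softmax(energies, dim=1)
    attn_cum = attn_cum + attn

print(attn_cat.shape, attn_cum.sum(dim=1))
```

Because each step's weights sum to 1, the cumulative tensor sums to the number of decoding steps taken, which tells the model how much of the text has already been attended to.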

In [None]:
class PreNet(nn.Module):
    """Tacotron PreNet: processes the previous mel frame."""
    
    def __init__(self, in_dim: int = 80, hidden_dim: int = 256, out_dim: int = 256):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Tacotron 2 keeps PreNet dropout active at inference too, so pass
        # training=True explicitly; nn.Dropout would be switched off by eval().
        x = F.dropout(F.relu(self.fc1(x)), p=0.5, training=True)
        x = F.dropout(F.relu(self.fc2(x)), p=0.5, training=True)
        return x


class PostNet(nn.Module):
    """Tacotron PostNet: refines the predicted mel spectrogram."""
    
    def __init__(
        self,
        n_mels: int = 80,
        postnet_dim: int = 512,
        postnet_kernel_size: int = 5,
        num_layers: int = 5
    ):
        super().__init__()
        
        layers = []
        for i in range(num_layers):
            in_channels = n_mels if i == 0 else postnet_dim
            out_channels = n_mels if i == num_layers - 1 else postnet_dim
            
            layers.extend([
                nn.Conv1d(
                    in_channels, out_channels,
                    kernel_size=postnet_kernel_size,
                    padding=postnet_kernel_size // 2
                ),
                nn.BatchNorm1d(out_channels)
            ])
            
            if i < num_layers - 1:
                layers.append(nn.Tanh())
            layers.append(nn.Dropout(0.5))
            
        self.convolutions = nn.Sequential(*layers)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: mel spectrogram [batch, n_mels, frames]
        Returns:
            residual to add to the input mel [batch, n_mels, frames]
        """
        return self.convolutions(x)


# Test
prenet = PreNet()
postnet = PostNet()

mel_frame = torch.randn(2, 80)
print(f"PreNet output: {prenet(mel_frame).shape}")

mel_spec = torch.randn(2, 80, 100)
print(f"PostNet output: {postnet(mel_spec).shape}")
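Since PostNet returns a residual, the refined spectrogram is the decoder output plus that residual, and Tacotron 2 sums the MSE before and after the PostNet. A minimal sketch follows, with a single `Conv1d` standing in for the full PostNet and a random tensor as a hypothetical target:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-in for PostNet: any module mapping [batch, n_mels, frames] to the same shape.
postnet_stub = nn.Conv1d(80, 80, kernel_size=5, padding=2)

mel_before = torch.randn(2, 80, 100)                 # decoder output
mel_after = mel_before + postnet_stub(mel_before)    # residual refinement

target = torch.randn(2, 80, 100)                     # hypothetical ground-truth mel
# Supervise both predictions so the decoder output stays usable on its own.
loss = F.mse_loss(mel_before, target) + F.mse_loss(mel_after, target)
print(mel_after.shape, loss.item())
```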

## Part 3: FastSpeech Architecture

FastSpeech is non-autoregressive: it predicts all mel frames in parallel, which makes synthesis fast.

```
FastSpeech 2 architecture:

┌──────────────────────────────────────────────────────────────┐
│                     Phoneme Encoder                          │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  Phoneme Embedding + Positional Encoding            │    │
│  │              │                                      │    │
│  │              ▼                                      │    │
│  │  FFT Block × N (Self-Attention + Conv FFN)          │    │
│  └─────────────────────────────────────────────────────┘    │
└────────────────────────┬─────────────────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────────────┐
│                   Variance Adaptor                           │
│  ┌────────────────────────────────────────────────────────┐ │
│  │                                                        │ │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐   │ │
│  │  │   Duration   │ │    Pitch     │ │   Energy     │   │ │
│  │  │  Predictor   │ │  Predictor   │ │  Predictor   │   │ │
│  │  └──────┬───────┘ └──────┬───────┘ └──────┬───────┘   │ │
│  │         │                │                │           │ │
│  │         ▼                ▼                ▼           │ │
│  │  Length Regulator   Pitch Embed     Energy Embed      │ │
│  │         │                │                │           │ │
│  │         └────────────────┴────────────────┘           │ │
│  │                          │                             │ │
│  │                          ▼                             │ │
│  │                       Add All                          │ │
│  └────────────────────────────────────────────────────────┘ │
└────────────────────────┬─────────────────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────────────┐
│                      Mel Decoder                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  FFT Block × N (Self-Attention + Conv FFN)          │    │
│  │              │                                      │    │
│  │              ▼                                      │    │
│  │         Linear → Mel Spectrogram                    │    │
│  └─────────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────┘
```
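The encoder in the diagram adds a positional encoding to the phoneme embeddings. The sketch below assumes the standard sinusoidal encoding from the Transformer; the helper name `sinusoidal_positional_encoding` and the random embeddings are illustrative only.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a [seq_len, d_model] table of sin/cos position codes (d_model even)."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even channels
    pe[:, 1::2] = torch.cos(position * div_term)   # odd channels
    return pe

phoneme_emb = torch.randn(2, 20, 256)                       # hypothetical embeddings
x = phoneme_emb + sinusoidal_positional_encoding(20, 256)   # broadcast over batch
print(x.shape)
```

Self-attention is order-agnostic on its own, so this addition is what lets the FFT blocks distinguish phoneme positions.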

In [None]:
class FFTBlock(nn.Module):
    """Feed-Forward Transformer Block"""
    
    def __init__(
        self,
        d_model: int = 256,
        num_heads: int = 2,
        d_ff: int = 1024,
        kernel_size: int = 9,
        dropout: float = 0.2
    ):
        super().__init__()
        
        # Multi-Head Attention
        self.self_attention = nn.MultiheadAttention(
            d_model, num_heads, dropout=dropout, batch_first=True
        )
        self.norm1 = nn.LayerNorm(d_model)
        
        # Convolutional FFN
        self.conv1 = nn.Conv1d(
            d_model, d_ff, kernel_size,
            padding=kernel_size // 2
        )
        self.conv2 = nn.Conv1d(
            d_ff, d_model, kernel_size,
            padding=kernel_size // 2
        )
        self.norm2 = nn.LayerNorm(d_model)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(
        self,
        x: torch.Tensor,
        mask: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """
        Args:
            x: input [batch, seq_len, d_model]
            mask: key padding mask [batch, seq_len], True at padded positions
        """
        # Self-Attention
        residual = x
        x, _ = self.self_attention(x, x, x, key_padding_mask=mask)
        x = self.dropout(x)
        x = self.norm1(residual + x)
        
        # Conv FFN
        residual = x
        x = x.transpose(1, 2)  # [batch, d_model, seq_len]
        x = F.relu(self.conv1(x))
        x = self.dropout(x)
        x = self.conv2(x)
        x = x.transpose(1, 2)  # [batch, seq_len, d_model]
        x = self.dropout(x)
        x = self.norm2(residual + x)
        
        return x


# Test
fft_block = FFTBlock()
x = torch.randn(2, 20, 256)
out = fft_block(x)
print(f"FFT Block output shape: {out.shape}")
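The test above calls the block without a mask. For padded batches, the `key_padding_mask` argument of `nn.MultiheadAttention` expects `True` at padded positions. A minimal sketch of building such a mask from sequence lengths (the `lengths` values are hypothetical):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lengths = torch.tensor([20, 12])     # hypothetical phoneme counts per example
max_len = int(lengths.max())

# True marks padded positions, which is what key_padding_mask expects.
pad_mask = torch.arange(max_len).unsqueeze(0) >= lengths.unsqueeze(1)  # [batch, max_len]

mha = nn.MultiheadAttention(256, 2, batch_first=True)
x = torch.randn(2, max_len, 256)
out, weights = mha(x, x, x, key_padding_mask=pad_mask)

# Attention paid to the padded keys of the second example.
print(pad_mask.shape, weights[1, :, 12:].sum().item())
```

Padded keys receive zero attention weight, so real positions never read from padding.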

In [None]:
class VariancePredictor(nn.Module):
    """Variance predictor: predicts duration, pitch, or energy."""
    
    def __init__(
        self,
        d_model: int = 256,
        d_hidden: int = 256,
        kernel_size: int = 3,
        dropout: float = 0.5
    ):
        super().__init__()
        
        self.conv1 = nn.Conv1d(d_model, d_hidden, kernel_size, padding=kernel_size // 2)
        self.norm1 = nn.LayerNorm(d_hidden)
        self.conv2 = nn.Conv1d(d_hidden, d_hidden, kernel_size, padding=kernel_size // 2)
        self.norm2 = nn.LayerNorm(d_hidden)
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(d_hidden, 1)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: [batch, seq_len, d_model]
        Returns:
            predicted values [batch, seq_len]
        """
        # Conv1d expects [batch, channels, seq_len], while LayerNorm normalizes
        # the last dimension, so transpose around each convolution. Applying
        # LayerNorm(d_hidden) to the channels-first tensor raises a shape error.
        x = F.relu(self.conv1(x.transpose(1, 2))).transpose(1, 2)
        x = self.dropout(self.norm1(x))  # [batch, seq_len, d_hidden]
        x = F.relu(self.conv2(x.transpose(1, 2))).transpose(1, 2)
        x = self.dropout(self.norm2(x))  # [batch, seq_len, d_hidden]
        x = self.linear(x).squeeze(-1)
        return x


class LengthRegulator(nn.Module):
    """Length regulator: expands the sequence according to durations."""
    
    def forward(
        self,
        x: torch.Tensor,
        durations: torch.Tensor
    ) -> torch.Tensor:
        """
        Args:
            x: encoder outputs [batch, seq_len, d_model]
            durations: duration of each phoneme in frames [batch, seq_len]
            
        Returns:
            expanded sequence [batch, mel_len, d_model]
        """
        outputs = []
        for i in range(x.shape[0]):
            output = []
            for j in range(x.shape[1]):
                duration = int(durations[i, j].item())
                if duration > 0:
                    output.append(x[i, j].unsqueeze(0).expand(duration, -1))
            if output:
                outputs.append(torch.cat(output, dim=0))
            else:
                outputs.append(x[i, :1])  # keep at least one frame
                
        # Zero-pad to a common length
        max_len = max(o.shape[0] for o in outputs)
        padded = torch.zeros(len(outputs), max_len, x.shape[-1], device=x.device)
        for i, output in enumerate(outputs):
            padded[i, :output.shape[0]] = output
            
        return padded


# Test
var_predictor = VariancePredictor()
length_reg = LengthRegulator()

x = torch.randn(2, 10, 256)  # 10 phonemes
durations = torch.tensor([[2, 3, 1, 2, 2, 1, 3, 2, 2, 2],
                          [3, 2, 2, 1, 3, 2, 2, 1, 2, 2]])

pred_dur = var_predictor(x)
print(f"Predicted duration shape: {pred_dur.shape}")

expanded = length_reg(x, durations)
print(f"Expanded sequence shape: {expanded.shape}")
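The nested loops in `LengthRegulator` can be expressed more compactly with `torch.repeat_interleave`. The function below is an alternative sketch rather than the notebook's implementation; unlike the class above, it does not keep a fallback frame when every duration in a sequence is zero.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def regulate_length(x: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Repeat each phoneme vector durations[b, i] times, then zero-pad the batch."""
    expanded = [
        torch.repeat_interleave(seq, dur, dim=0)   # [sum(dur), d_model]
        for seq, dur in zip(x, durations)
    ]
    return pad_sequence(expanded, batch_first=True)

x = torch.arange(6, dtype=torch.float32).reshape(1, 3, 2)  # 3 phonemes, d_model=2
durations = torch.tensor([[2, 0, 3]])                       # the middle phoneme is dropped
out = regulate_length(x, durations)
print(out.squeeze(0))
```

Here the first phoneme vector is repeated twice, the second is skipped, and the third is repeated three times, giving five output frames.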

## Part 4: Vocoders

A vocoder converts a mel spectrogram into an audio waveform.

```
Vocoder development:

WaveNet (2016)         WaveRNN (2018)         HiFi-GAN (2020)
─────────────          ──────────────          ────────────────
Autoregressive         Autoregressive         Non-autoregressive
Dilated Conv           GRU + Sparse           GAN-based
High quality           Faster                 Fastest
Very slow              Moderate               Real-time

HiFi-GAN architecture:

┌─────────────────────────────────────────────────────────────┐
│                        Generator                            │
├─────────────────────────────────────────────────────────────┤
│  Mel Spectrogram                                            │
│       │                                                     │
│       ▼                                                     │
│  ┌─────────────┐                                            │
│  │   Conv1d    │                                            │
│  └──────┬──────┘                                            │
│         │                                                   │
│         ▼                                                   │
│  ┌─────────────────────────────────────────────────┐       │
│  │           Upsample Block × 4                    │       │
│  │  ┌───────────────────────────────────────┐      │       │
│  │  │ TransposeConv1d (upsample)            │      │       │
│  │  │         │                             │      │       │
│  │  │         ▼                             │      │       │
│  │  │ Multi-Receptive Field Fusion (MRF)    │      │       │
│  │  │  ├─ ResBlock (k=3, dilations=1,3,5)   │      │       │
│  │  │  ├─ ResBlock (k=7, dilations=1,3,5)   │      │       │
│  │  │  └─ ResBlock (k=11, dilations=1,3,5)  │      │       │
│  │  └───────────────────────────────────────┘      │       │
│  └─────────────────────────────────────────────────┘       │
│         │                                                   │
│         ▼                                                   │
│  ┌─────────────┐                                            │
│  │   Conv1d    │                                            │
│  │   + tanh    │                                            │
│  └──────┬──────┘                                            │
│         │                                                   │
│         ▼                                                   │
│    Waveform                                                 │
└─────────────────────────────────────────────────────────────┘
```

In [None]:
class ResBlock(nn.Module):
    """HiFi-GAN residual block."""
    
    def __init__(
        self,
        channels: int,
        kernel_size: int = 3,
        dilations: List[int] = [1, 3, 5]
    ):
        super().__init__()
        
        self.convs1 = nn.ModuleList([
            nn.Conv1d(
                channels, channels, kernel_size,
                dilation=d, padding=d * (kernel_size - 1) // 2
            ) for d in dilations
        ])
        
        self.convs2 = nn.ModuleList([
            nn.Conv1d(
                channels, channels, kernel_size,
                dilation=1, padding=(kernel_size - 1) // 2
            ) for _ in dilations
        ])
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for conv1, conv2 in zip(self.convs1, self.convs2):
            residual = x
            x = F.leaky_relu(x, 0.1)
            x = conv1(x)
            x = F.leaky_relu(x, 0.1)
            x = conv2(x)
            x = x + residual
        return x


class MRF(nn.Module):
    """Multi-Receptive Field Fusion"""
    
    def __init__(
        self,
        channels: int,
        kernel_sizes: List[int] = [3, 7, 11],
        dilations: List[List[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
    ):
        super().__init__()
        
        self.resblocks = nn.ModuleList([
            ResBlock(channels, k, d)
            for k, d in zip(kernel_sizes, dilations)
        ])
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outputs = [block(x) for block in self.resblocks]
        return sum(outputs) / len(outputs)


class HiFiGANGenerator(nn.Module):
    """Simplified HiFi-GAN generator."""
    
    def __init__(
        self,
        n_mels: int = 80,
        upsample_initial_channel: int = 512,
        upsample_rates: List[int] = [8, 8, 2, 2],
        upsample_kernel_sizes: List[int] = [16, 16, 4, 4],
        resblock_kernel_sizes: List[int] = [3, 7, 11],
        resblock_dilations: List[List[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
    ):
        super().__init__()
        
        # Initial convolution
        self.conv_pre = nn.Conv1d(n_mels, upsample_initial_channel, 7, padding=3)
        
        # Upsampling layers
        self.ups = nn.ModuleList()
        self.mrfs = nn.ModuleList()
        
        ch = upsample_initial_channel
        for i, (u_rate, u_kernel) in enumerate(zip(upsample_rates, upsample_kernel_sizes)):
            self.ups.append(
                nn.ConvTranspose1d(
                    ch, ch // 2, u_kernel,
                    stride=u_rate, padding=(u_kernel - u_rate) // 2
                )
            )
            self.mrfs.append(
                MRF(ch // 2, resblock_kernel_sizes, resblock_dilations)
            )
            ch = ch // 2
            
        # Final convolution
        self.conv_post = nn.Conv1d(ch, 1, 7, padding=3)
        
    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        """
        Args:
            mel: mel spectrogram [batch, n_mels, frames]
            
        Returns:
            waveform [batch, 1, samples]
        """
        x = self.conv_pre(mel)
        
        for up, mrf in zip(self.ups, self.mrfs):
            x = F.leaky_relu(x, 0.1)
            x = up(x)
            x = mrf(x)
            
        x = F.leaky_relu(x, 0.1)
        x = self.conv_post(x)
        x = torch.tanh(x)
        
        return x


# Test
vocoder = HiFiGANGenerator()
mel = torch.randn(2, 80, 100)  # 100 frames
waveform = vocoder(mel)

print(f"Mel input shape: {mel.shape}")
print(f"Waveform output shape: {waveform.shape}")
print(f"Upsampling factor: {waveform.shape[-1] / mel.shape[-1]}x")
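The upsampling factor printed above follows directly from the generator's strides: their product should equal the hop length used to compute the mel spectrogram (256 in Part 1). The sketch below checks this and the transposed-convolution length formula, using a small channel count of 4 purely to keep the example light.

```python
import math
import torch
import torch.nn as nn

upsample_rates = [8, 8, 2, 2]
upsample_kernel_sizes = [16, 16, 4, 4]
hop_length = 256

# The product of the strides is the number of samples generated per mel frame.
total_upsampling = math.prod(upsample_rates)
print(total_upsampling == hop_length)

x = torch.randn(1, 4, 100)   # [batch, channels, frames]
for rate, kernel in zip(upsample_rates, upsample_kernel_sizes):
    up = nn.ConvTranspose1d(4, 4, kernel, stride=rate, padding=(kernel - rate) // 2)
    # L_out = (L_in - 1) * stride - 2 * padding + kernel = L_in * stride
    x = up(x)
print(x.shape[-1])           # number of output samples for 100 input frames
```

Choosing `padding = (kernel - stride) // 2` with `kernel - stride` even is what makes each layer scale the length by exactly its stride.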

## Part 5: Modern End-to-End Models

This part introduces end-to-end speech synthesis models such as VITS and Bark.

In [None]:
def print_modern_tts_architectures():
    """Print a comparison of modern TTS architectures."""
    print("""
╔══════════════════════════════════════════════════════════════════════════════╗
║                  Comparison of modern end-to-end TTS models                  ║
╠════════════════╦══════════════════════╦══════════════════════════════════════╣
║     Model      ║       Features       ║              Use cases               ║
╠════════════════╬══════════════════════╬══════════════════════════════════════╣
║                ║ - VAE + Flow + GAN   ║                                      ║
║     VITS       ║ - Trained end-to-end ║ High-quality single/multi-speaker TTS║
║  (2021)        ║ - No extra vocoder   ║ Apps needing natural intonation      ║
║                ║ - Multi-speaker      ║ Real-time synthesis                  ║
╠════════════════╬══════════════════════╬══════════════════════════════════════╣
║                ║ - Autoregressive GPT ║                                      ║
║     Bark       ║ - Multilingual       ║ Multilingual, multi-emotion synthesis║
║  (2023)        ║ - Emotion/SFX control║ Podcasts, audiobooks, creative work  ║
║                ║ - Open source        ║ Apps needing expressiveness          ║
╠════════════════╬══════════════════════╬══════════════════════════════════════╣
║                ║ - Very long context  ║                                      ║
║   Tortoise     ║ - CLVP (CLIP-style)  ║ Highest quality, offline processing  ║
║   TTS          ║ - High quality, slow ║ Post-production dubbing, premium use ║
║                ║ - Clones any voice   ║ Voice cloning                        ║
╠════════════════╬══════════════════════╬══════════════════════════════════════╣
║                ║ - Neural codec       ║                                      ║
║    VALL-E      ║ - Zero-shot cloning  ║ Zero-shot voice cloning              ║
║  (Microsoft)   ║ - Needs a 3 s sample ║ Personalized TTS                     ║
║                ║ - Preserves style    ║ Voice conversion                     ║
╚════════════════╩══════════════════════╩══════════════════════════════════════╝

VITS 架構概覽：
┌──────────────────────────────────────────────────────────────────────────────┐
│                                                                              │
│   Text ──► Text Encoder ──► Stochastic Duration Predictor                   │
│                │                                                             │
│                ▼                                                             │
│         Normalizing Flow                                                     │
│                │                                                             │
│                ▼                                                             │
│          VAE Decoder ──► Waveform                                            │
│                                                                              │
│   訓練時額外有：                                                              │
│   - Posterior Encoder (從音頻學習隱空間)                                     │
│   - GAN Discriminator (對抗學習)                                             │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

Bark 架構概覽：
┌──────────────────────────────────────────────────────────────────────────────┐
│                                                                              │
│   Text ──► GPT-style Text Model ──► Semantic Tokens                         │
│                                           │                                  │
│                                           ▼                                  │
│                                    Coarse Acoustic Model ──► Coarse Tokens   │
│                                                                │             │
│                                                                ▼             │
│                                    Fine Acoustic Model ──► Fine Tokens       │
│                                                                │             │
│                                                                ▼             │
│                                           Encodec Decoder ──► Waveform       │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
""")

print_modern_tts_architectures()

In [None]:
# Example code for using Bark
bark_example = '''
# Install
# pip install git+https://github.com/suno-ai/bark.git

from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

# Download and cache the models (slow on the first run)
preload_models()

# Basic text-to-speech
text = "Hello, my name is Bark. I can speak in many languages!"
audio_array = generate_audio(text)
write_wav("bark_output.wav", SAMPLE_RATE, audio_array)

# Speech with non-verbal cues such as [laughs] and [sighs]
text_with_emotion = """
    [laughs] Oh wow, that's hilarious!
    [sighs] But sometimes I feel a bit sad...
    [clears throat] Anyway, let me continue.
"""
audio_array = generate_audio(text_with_emotion)

# Multilingual support (Chinese input)
chinese_text = "你好，我是 Bark。我可以說中文！"
audio_array = generate_audio(chinese_text)

# Use a specific speaker
audio_array = generate_audio(
    text,
    history_prompt="v2/en_speaker_6"  # built-in speaker preset
)

# Singing: wrap the lyrics in ♪ symbols
music_text = "♪ La la la, singing a song ♪"
audio_array = generate_audio(music_text)
'''
print("=== Bark usage example ===")
print(bark_example)

In [None]:
# Example code for VITS and other models via Coqui TTS
vits_example = '''
# Install
# pip install TTS

from TTS.api import TTS
import torch

# List the available models
print(TTS().list_models())

# Load a VITS model
device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/en/ljspeech/vits").to(device)

# Basic text-to-speech
tts.tts_to_file(
    text="Hello world, this is a test of VITS text to speech.",
    file_path="vits_output.wav"
)

# Multi-speaker model
tts = TTS("tts_models/en/vctk/vits").to(device)
tts.tts_to_file(
    text="Hello world!",
    speaker="p225",  # VCTK speaker ID
    file_path="vits_speaker.wav"
)

# Voice cloning (this uses XTTS v2, not VITS)
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
tts.tts_to_file(
    text="This is a voice cloning test.",
    speaker_wav="reference_audio.wav",  # reference audio of the target voice
    language="en",
    file_path="cloned_output.wav"
)

# Chinese TTS (this model is Tacotron 2 based, not VITS)
tts = TTS("tts_models/zh-CN/baker/tacotron2-DDC-GST").to(device)
tts.tts_to_file(
    text="你好，這是中文語音合成測試。",
    file_path="chinese_output.wav"
)
'''
print("=== VITS (Coqui TTS) usage example ===")
print(vits_example)

## Part 6: Evaluation Metrics

In [None]:
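class TTSEvaluator:
    """Objective evaluation metrics for TTS."""
    
    @staticmethod
    def mel_cepstral_distortion(
        generated_mel: np.ndarray,
        reference_mel: np.ndarray
    ) -> float:
        """
        Compute a simplified Mel Cepstral Distortion (MCD), in dB.
        Lower is better; below about 8 dB is usually considered acceptable.
        
        Note: strict MCD is defined on mel-cepstral coefficients (usually
        without the 0th, energy coefficient) after time alignment such as DTW.
        This simplified version applies the same formula to the features it is
        given and aligns the two sequences by truncation only.
        
        Args:
            generated_mel: generated features [n_mels, frames]
            reference_mel: reference features [n_mels, frames]
        """
        # Align lengths by truncating to the shorter sequence
        min_len = min(generated_mel.shape[1], reference_mel.shape[1])
        gen = generated_mel[:, :min_len]
        ref = reference_mel[:, :min_len]
        
        # MCD = (10 / ln 10) * mean over frames of sqrt(2 * sum_d diff_d^2)
        diff = gen - ref
        mcd = (10 / np.log(10)) * np.mean(np.sqrt(2 * np.sum(diff ** 2, axis=0)))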
        
        return mcd
    
    @staticmethod
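    def f0_rmse(
        generated_f0: np.ndarray,
        reference_f0: np.ndarray
    ) -> float:
        """
        Compute the RMSE of the fundamental frequency (F0).
        The result is in the same unit as the inputs (e.g. Hz).
        
        Args:
            generated_f0: generated F0 sequence (0 for unvoiced frames)
            reference_f0: reference F0 sequence (0 for unvoiced frames)
        """
        # Align lengths by truncating to the shorter sequence
        min_len = min(len(generated_f0), len(reference_f0))
        gen = generated_f0[:min_len]
        ref = reference_f0[:min_len]
        
        # Only compare frames that are voiced in both sequences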
        voiced_mask = (gen > 0) & (ref > 0)
        if np.sum(voiced_mask) == 0:
            return float('inf')
            
        rmse = np.sqrt(np.mean((gen[voiced_mask] - ref[voiced_mask]) ** 2))
        return rmse
    
    @staticmethod
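    def character_error_rate(
        generated_text: str,
        reference_text: str
    ) -> float:
        """
        Compute the Character Error Rate (CER).
        Used to assess intelligibility: generated_text is an ASR transcript of
        the synthesized audio and reference_text is the original input text.
        """
        # Levenshtein edit distance via dynamic programming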
        m, n = len(reference_text), len(generated_text)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        
        for i in range(m + 1):
            dp[i][0] = i
        for j in range(n + 1):
            dp[0][j] = j
            
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if reference_text[i-1] == generated_text[j-1]:
                    dp[i][j] = dp[i-1][j-1]
                else:
                    dp[i][j] = min(
                        dp[i-1][j] + 1,    # deletion
                        dp[i][j-1] + 1,    # insertion
                        dp[i-1][j-1] + 1   # substitution
                    )
                    
        return dp[m][n] / m if m > 0 else 0.0


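# Quick test
evaluator = TTSEvaluator()

# Simulated data
gen_mel = np.random.randn(80, 100)
ref_mel = gen_mel + np.random.randn(80, 100) * 0.1  # add a small perturbation

mcd = evaluator.mel_cepstral_distortion(gen_mel, ref_mel)
print(f"MCD: {mcd:.2f} dB")

# "helo wrld" stands in for an ASR transcript of the synthesized audio;
# keyword arguments keep the generated/reference order explicit
cer = evaluator.character_error_rate(
    generated_text="helo wrld",
    reference_text="hello world"
)
print(f"CER: {cer:.2%}")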
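
CER works at the character level. The word-level counterpart, WER, uses the same edit distance on word sequences. Below is a minimal sketch, assuming whitespace tokenization and case-insensitive matching. The names `edit_distance` and `word_error_rate` are illustrative and not part of `TTSEvaluator`.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences, keeping one rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


def word_error_rate(hypothesis: str, reference: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        return 0.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# The hypothesis stands in for an ASR transcript of synthesized speech.
wer = word_error_rate(
    hypothesis="the quick brown fox jump over dog",
    reference="the quick brown fox jumps over the lazy dog",
)
print(f"WER: {wer:.2%}")
```

Here one word is substituted and two are deleted out of nine reference words, so the WER is 3/9. In practice, text normalization (punctuation, numbers) before scoring matters as much as the distance itself.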

In [None]:
def print_evaluation_summary():
    """Print a summary of TTS evaluation metrics."""
    print("""
╔═══════════════════════════════════════════════════════════════════════════════╗
║                        TTS Evaluation Metrics Summary                         ║
╠══════════════════════╦════════════════════════════════════════════════════════╣
║        Metric        ║                      Description                       ║
╠══════════════════════╬════════════════════════════════════════════════════════╣
║ MCD (Mel Cepstral    ║ Objective; measures spectral similarity                ║
║     Distortion)      ║ Lower is better; < 8 dB is usually acceptable          ║
╠══════════════════════╬════════════════════════════════════════════════════════╣
║ F0 RMSE              ║ Accuracy of the fundamental frequency (pitch)          ║
║                      ║ Lower is better; affects intonation naturalness        ║
╠══════════════════════╬════════════════════════════════════════════════════════╣
║ MOS (Mean Opinion    ║ Subjective; human ratings on a 1-5 scale               ║
║      Score)          ║ 4.0+ is high quality; 4.5+ approaches human speech     ║
╠══════════════════════╬════════════════════════════════════════════════════════╣
║ CER/WER              ║ Intelligibility (computed on ASR transcripts)          ║
║                      ║ Lower is better                                        ║
╠══════════════════════╬════════════════════════════════════════════════════════╣
║ RTF (Real-Time       ║ Synthesis speed: compute time / audio duration         ║
║      Factor)         ║ < 1.0 means faster than real time                      ║
╠══════════════════════╬════════════════════════════════════════════════════════╣
║ Speaker Similarity   ║ Speaker similarity (used for voice cloning)            ║
║                      ║ Cosine similarity between speaker embeddings           ║
╚══════════════════════╩════════════════════════════════════════════════════════╝

Typical MOS scores for reference (approximate; they vary by dataset and test):
┌─────────────────────┬──────────────────────────────────────┐
│  System             │ MOS                                  │
├─────────────────────┼──────────────────────────────────────┤
│  Ground Truth       │ ~4.5                                 │
│  VITS               │ ~4.3                                 │
│  FastSpeech 2       │ ~4.0                                 │
│  Tacotron 2         │ ~3.9                                 │
│  Concatenative TTS  │ ~3.0                                 │
└─────────────────────┴──────────────────────────────────────┘
""")

print_evaluation_summary()
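
Three of the metrics in the summary (RTF, speaker similarity, and MOS) are not implemented in `TTSEvaluator`. The sketch below shows one way to compute them with the standard library only. All function names are illustrative. `fake_synthesize` is a hypothetical stand-in for a real TTS call. The speaker embeddings are made-up vectors; real ones would come from a speaker encoder. The MOS confidence interval uses a normal approximation.

```python
import math
import statistics
import time
from typing import Callable, Sequence, Tuple


def real_time_factor(synthesize: Callable[[], int], sample_rate: int) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio."""
    start = time.perf_counter()
    n_samples = synthesize()  # assumed to return the number of samples produced
    elapsed = time.perf_counter() - start
    return elapsed / (n_samples / sample_rate)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def mos_with_ci(ratings: Sequence[float]) -> Tuple[float, float]:
    """Mean opinion score and the half-width of an approximate 95% interval."""
    mean = statistics.mean(ratings)
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


def fake_synthesize() -> int:
    """Hypothetical synthesizer: pretends to produce 1 s of 22.05 kHz audio."""
    time.sleep(0.01)
    return 22050


rtf = real_time_factor(fake_synthesize, sample_rate=22050)
sim = cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.8, 0.05])
mos, ci = mos_with_ci([4, 5, 4, 3, 4, 5, 4, 4])

print(f"RTF: {rtf:.3f}")
print(f"Speaker similarity: {sim:.3f}")
print(f"MOS: {mos:.2f} +/- {ci:.2f}")
```

Reporting MOS with an interval matters because listening tests often use few raters, so small differences between systems may not be meaningful.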

## Exercises

### Exercise 1: Implement a simple G2P (Grapheme-to-Phoneme)

Implement a simple function that converts English text to phonemes.

In [None]:
class SimpleG2P:
    """A simple Grapheme-to-Phoneme converter."""
    
    def __init__(self):
        # Basic letter-to-phoneme mapping (simplified)
        self.letter_to_phoneme = {
            'a': 'AH', 'b': 'B', 'c': 'K', 'd': 'D', 'e': 'EH',
            'f': 'F', 'g': 'G', 'h': 'HH', 'i': 'IH', 'j': 'JH',
            'k': 'K', 'l': 'L', 'm': 'M', 'n': 'N', 'o': 'OW',
            'p': 'P', 'q': 'K', 'r': 'R', 's': 'S', 't': 'T',
            'u': 'UH', 'v': 'V', 'w': 'W', 'x': 'K S', 'y': 'Y',
            'z': 'Z', ' ': 'SIL'
        }
        
        # Common letter combinations (digraphs)
        self.digraph_to_phoneme = {
            'th': 'TH', 'sh': 'SH', 'ch': 'CH', 'ph': 'F',
            'wh': 'W', 'ng': 'NG', 'ck': 'K', 'gh': 'G'
        }
        
    def convert(self, text: str) -> List[str]:
        """
        Convert text into a phoneme sequence.
        
        Args:
            text: input text
            
        Returns:
            List of phonemes
        """
        # TODO: implement the G2P conversion
        # Hints:
        # 1. Convert the text to lowercase
        # 2. Check for digraphs first
        # 3. Then fall back to single letters
        pass

# Test
# g2p = SimpleG2P()
# phonemes = g2p.convert("hello world")
# print(phonemes)
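
One possible approach to this exercise, shown as a standalone function so that the `SimpleG2P.convert` stub stays open: scan left to right, greedily match a two-letter digraph first, and otherwise fall back to the single-letter table. This is only a sketch of the hinted steps. Splitting `'K S'` into two phonemes and skipping unknown characters are assumptions, not requirements stated in the exercise.

```python
from typing import List

# Same simplified tables as in SimpleG2P above.
LETTER_TO_PHONEME = {
    'a': 'AH', 'b': 'B', 'c': 'K', 'd': 'D', 'e': 'EH',
    'f': 'F', 'g': 'G', 'h': 'HH', 'i': 'IH', 'j': 'JH',
    'k': 'K', 'l': 'L', 'm': 'M', 'n': 'N', 'o': 'OW',
    'p': 'P', 'q': 'K', 'r': 'R', 's': 'S', 't': 'T',
    'u': 'UH', 'v': 'V', 'w': 'W', 'x': 'K S', 'y': 'Y',
    'z': 'Z', ' ': 'SIL',
}
DIGRAPH_TO_PHONEME = {
    'th': 'TH', 'sh': 'SH', 'ch': 'CH', 'ph': 'F',
    'wh': 'W', 'ng': 'NG', 'ck': 'K', 'gh': 'G',
}


def g2p_greedy(text: str) -> List[str]:
    """Greedy left-to-right G2P: try a digraph first, then a single letter."""
    text = text.lower()
    phonemes: List[str] = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAPH_TO_PHONEME:
            phonemes.append(DIGRAPH_TO_PHONEME[pair])
            i += 2
        elif text[i] in LETTER_TO_PHONEME:
            # 'x' maps to "K S"; split so each list item is one phoneme
            phonemes.extend(LETTER_TO_PHONEME[text[i]].split())
            i += 1
        else:
            i += 1  # skip punctuation and other unknown characters
    return phonemes


print(g2p_greedy("hello world"))
print(g2p_greedy("The ship!"))
```

A letter table like this ignores context (for example the silent "e" or the two sounds of "c"), which is why practical systems rely on pronunciation dictionaries or learned G2P models.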

### Exercise 2: Implement and train a duration predictor

Implement a simple duration predictor and train it.

In [None]:
class DurationPredictor(nn.Module):
    """Duration predictor"""
    
    def __init__(self, d_model: int = 256):
        super().__init__()
        # TODO: define the network architecture
        # Hint: use convolution + LayerNorm + ReLU
        pass
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # TODO: implement the forward pass
        pass


def train_duration_predictor(
    model: nn.Module,
    train_data: List[Tuple[torch.Tensor, torch.Tensor]],
    num_epochs: int = 10,
    lr: float = 1e-3
) -> List[float]:
    """
    Train the duration predictor.
    
    Args:
        model: the duration predictor
        train_data: [(encoder_output, target_duration), ...]
        num_epochs: number of training epochs
        lr: learning rate
        
    Returns:
        The loss for each epoch
    """
    # TODO: implement the training loop
    # Hint: use MSE loss
    pass
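
As a reference point for this exercise, here is a minimal sketch of one possible design: two Conv1d + LayerNorm + ReLU blocks followed by a linear projection, trained with MSE on log durations in the style of FastSpeech. The class name `ConvDurationPredictor`, the kernel size, the dropout rate, the log-domain target, and the random training data are all assumptions made for illustration.

```python
from typing import List, Tuple

import torch
import torch.nn as nn


class ConvDurationPredictor(nn.Module):
    """Two Conv1d + LayerNorm + ReLU blocks, then a per-token linear output."""

    def __init__(self, d_model: int = 256, kernel_size: int = 3, dropout: float = 0.1):
        super().__init__()
        padding = kernel_size // 2  # keeps the sequence length unchanged
        self.conv1 = nn.Conv1d(d_model, d_model, kernel_size, padding=padding)
        self.norm1 = nn.LayerNorm(d_model)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size, padding=padding)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, d_model]. Conv1d wants channels first,
        # LayerNorm wants channels last, hence the transposes.
        h = self.conv1(x.transpose(1, 2)).transpose(1, 2)
        h = self.dropout(torch.relu(self.norm1(h)))
        h = self.conv2(h.transpose(1, 2)).transpose(1, 2)
        h = self.dropout(torch.relu(self.norm2(h)))
        return self.proj(h).squeeze(-1)  # [batch, seq_len], predicted log durations


def train_sketch(
    model: nn.Module,
    train_data: List[Tuple[torch.Tensor, torch.Tensor]],
    num_epochs: int = 10,
    lr: float = 1e-3,
) -> List[float]:
    """Train with MSE on log(duration + 1) and return the mean loss per epoch."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    losses: List[float] = []
    model.train()
    for _ in range(num_epochs):
        total = 0.0
        for encoder_output, target_duration in train_data:
            optimizer.zero_grad()
            pred_log_duration = model(encoder_output)
            target = torch.log(target_duration.float() + 1.0)
            loss = criterion(pred_log_duration, target)
            loss.backward()
            optimizer.step()
            total += loss.item()
        losses.append(total / len(train_data))
    return losses


# Random stand-in data: 8 batches of 4 sequences, 12 tokens each, d_model = 32.
torch.manual_seed(0)
data = [(torch.randn(4, 12, 32), torch.randint(1, 10, (4, 12))) for _ in range(8)]
model = ConvDurationPredictor(d_model=32)
losses = train_sketch(model, data, num_epochs=20, lr=1e-2)
print(f"first epoch loss: {losses[0]:.4f}, last epoch loss: {losses[-1]:.4f}")
```

Predicting in the log domain keeps the loss from being dominated by a few very long phonemes. At inference time the prediction is mapped back with `exp(pred) - 1` and rounded to a frame count.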

### Exercise 3: Compare different vocoders

Design an experiment that compares the performance of different vocoders.

In [None]:
@dataclass
class VocoderBenchmark:
    """Vocoder benchmark result"""
    name: str
    rtf: float  # Real-Time Factor
    mcd: float  # Mel Cepstral Distortion
    params: int  # number of parameters
    
def benchmark_vocoder(
    vocoder: nn.Module,
    test_mels: List[torch.Tensor],
    reference_wavs: List[np.ndarray],
    sample_rate: int = 22050
) -> VocoderBenchmark:
    """
    Benchmark a vocoder.
    
    Args:
        vocoder: the vocoder model
        test_mels: mel spectrograms used for testing
        reference_wavs: reference waveforms
        sample_rate: sample rate
        
    Returns:
        The benchmark result
    """
    # TODO: implement the vocoder benchmark
    # Hints:
    # 1. Measure inference time to compute the RTF
    # 2. Compute the MCD
    # 3. Count the number of parameters
    pass
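
Two of the three hints (RTF and parameter count) can be sketched without any reference audio. The example below uses a tiny transposed-convolution network, `toy_vocoder`, as a hypothetical stand-in for a real vocoder. The helper names `count_parameters` and `measure_rtf` are likewise illustrative. The MCD step is left out because it also needs mel extraction from the generated and reference waveforms.

```python
import time
from typing import List

import torch
import torch.nn as nn


def count_parameters(model: nn.Module) -> int:
    """Total number of parameters in the model."""
    return sum(p.numel() for p in model.parameters())


@torch.no_grad()
def measure_rtf(
    vocoder: nn.Module,
    mels: List[torch.Tensor],
    sample_rate: int = 22050,
    warmup: int = 1,
) -> float:
    """RTF = total inference time / total duration of the generated audio.

    Assumes CPU inference with batch size 1. On a GPU, call
    torch.cuda.synchronize() before reading the clock.
    """
    vocoder.eval()
    for _ in range(warmup):  # warm-up runs are excluded from the timing
        vocoder(mels[0])
    total_time, total_samples = 0.0, 0
    for mel in mels:
        start = time.perf_counter()
        wav = vocoder(mel)
        total_time += time.perf_counter() - start
        total_samples += wav.shape[-1]
    return total_time / (total_samples / sample_rate)


# Toy stand-in: two stride-8 transposed convolutions give 64x upsampling.
toy_vocoder = nn.Sequential(
    nn.ConvTranspose1d(80, 16, kernel_size=16, stride=8, padding=4),
    nn.LeakyReLU(0.1),
    nn.ConvTranspose1d(16, 1, kernel_size=16, stride=8, padding=4),
    nn.Tanh(),
)
test_mels = [torch.randn(1, 80, 50) for _ in range(3)]

print(f"params: {count_parameters(toy_vocoder):,}")
print(f"RTF:    {measure_rtf(toy_vocoder, test_mels):.4f}")
```

For a fair comparison between vocoders, use the same hardware, batch size, and test utterances, and average over several runs.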

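## Summary

```
Speech synthesis: key takeaways

1. Audio representations
   ├─ Waveform: the raw signal
   ├─ Spectrogram: time-frequency analysis
   └─ Mel spectrogram: frequency scale matched to human hearing

2. Classic architectures
   ├─ Tacotron 2: autoregressive, high quality
   ├─ FastSpeech 2: non-autoregressive, fast
   └─ Key components: Encoder, Attention, Decoder, PostNet

3. Vocoders
   ├─ WaveNet: autoregressive, high quality but slow
   ├─ HiFi-GAN: GAN-based, fast
   └─ Upsampling + MRF (multi-receptive field fusion) structure

4. Modern end-to-end models
   ├─ VITS: VAE + Flow + GAN
   ├─ Bark: GPT-style, versatile
   └─ VALL-E: zero-shot voice cloning

5. Evaluation metrics
   ├─ Objective: MCD, F0 RMSE, CER/WER, RTF
   └─ Subjective: MOS (1-5 scale)

Practical recommendations:
- An RTX 5080 (16GB) can run VITS and Bark
- For production, consider VITS first (fast, good quality)
- Use Bark when emotion or sound effects are needed
```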