# VITS for Azerbaijani Voice Cloning and Audiobooks

This notebook extends the base VITS setup in three ways:
1. **Azerbaijani Language:** Uses `espeak` for phonemization.
2. **Large Datasets:** Uses `WebDataset` to stream sharded data efficiently, which suits audiobook-scale corpora.
3. **Voice Cloning:** Integrates a speaker encoder for zero-shot voice cloning from a reference audio clip.

## 1. Setup and Installation

We will install `espeak` for phonemization, `webdataset` for efficient data loading, and other necessary libraries.

In [None]:
# Install system-level dependency for phonemization
!sudo apt-get update && sudo apt-get install -y espeak

# Install Python libraries
!pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install numpy scipy librosa unidecode tensorboard
!pip install phonemizer webdataset
!pip install speechbrain

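# Clone the VITS repository for the base model and utilities
!git clone https://github.com/jaywalnut310/vits.git
%cd vits
# The repository is not a pip package; build its Cython monotonic alignment
# extension in place instead (as described in the VITS README)
!pip install Cython
!cd monotonic_align && python setup.py build_ext --inplace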

## 2. Imports, Configuration, and Azerbaijani Symbols

In [None]:
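import os
import io
import json
import torch
import webdataset as wds
try:
    from speechbrain.inference.speaker import EncoderClassifier  # SpeechBrain >= 1.0
except ImportError:
    from speechbrain.pretrained import EncoderClassifier  # older SpeechBrain releases
import librosa
import IPython.display as ipd
import utils  # VITS helper module, importable after `%cd vits`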

# Define Azerbaijani characters
# Based on common Azerbaijani alphabet, including special characters
_pad        = '_'
_punctuation = '¡!¿?.,;:'
_special    = ' -'
_letters    = 'AaBbCcÇçDdEeƏəFfGgĞğHhXxIıİiJjKkQqLlMmNnOoÖöPpRrSsŞşTtUuÜüVvYyZz'

# Export symbols
symbols = list(_pad) + list(_special) + list(_punctuation) + list(_letters)

# Create a mapping from symbol to numeric ID
symbol_to_id = {s: i for i, s in enumerate(symbols)}

# Configuration for the model (a Python dict, so booleans are True/False)
config_json = {
    "train": {
        "log_interval": 200,
        "eval_interval": 1000,
        "seed": 1234,
        "epochs": 20000,
        "learning_rate": 2e-4,
        "betas": [0.8, 0.99],
        "eps": 1e-9,
        "batch_size": 16,
        "fp16_run": True,
        "lr_decay": 0.999875,
        "segment_size": 8192,
        "init_lr_ratio": 1,
        "warmup_epochs": 0,
        "c_mel": 45,
        "c_kl": 1.0
    },
    "data": {
        "webdataset_base_path": "./dataset_tar/az_audio-{000000..000099}.tar",
        # NOTE: the stock VITS `transliteration_cleaners` converts text to ASCII,
        # which would drop letters such as ə, ğ, ş; an Azerbaijani-specific
        # cleaner is likely needed here
        "text_cleaners": ["transliteration_cleaners"],
        "language": "az",
        "phonemizer": "espeak",
        "max_wav_value": 32768.0,
        "sampling_rate": 22050,
        "filter_length": 1024,
        "hop_length": 256,
        "win_length": 1024,
        "n_mel_channels": 80,
        "add_blank": True,
        "n_speakers": 0,  # Set to 0 when using speaker embeddings instead of speaker IDs
        "cleaned_text": True
    },
    "model": {
        "inter_channels": 192,
        "hidden_channels": 192,
        "filter_channels": 768,
        "n_heads": 2,
        "n_layers": 6,
        "kernel_size": 3,
        "p_dropout": 0.1,
        "resblock": "1",
        "resblock_kernel_sizes": [3, 7, 11],
        "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
        "upsample_rates": [8, 8, 2, 2],
        "upsample_initial_channel": 512,
        "upsample_kernel_sizes": [16, 16, 4, 4],
        "n_layers_q": 3,
        "use_spectral_norm": False,
        # Speaker conditioning size; the ECAPA encoder in Section 5 outputs
        # 192-dim embeddings, so project them to 512 or set this to 192
        "g_speaker_cond": 512
    }
}

with open('config_az.json', 'w') as f:
    json.dump(config_json, f, indent=2)

hps = utils.get_hparams_from_file("config_az.json")
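
To illustrate how the symbol table and the `add_blank` setting fit together, here is a minimal sketch of encoding a transcript into model input IDs. The helper names `az_lower`, `text_to_ids`, and `intersperse` are illustrative (the VITS repository ships its own `text_to_sequence` and `commons.intersperse`), and the symbol list is redefined so the block runs on its own. The Azerbaijani-aware lowercasing is an assumption about how you might normalise text: Python's default `str.lower()` maps `I` to `i` rather than `ı`.

```python
# Symbol inventory mirroring the one defined above (redefined so this runs standalone)
_pad = '_'
_special = ' -'
_punctuation = '¡!¿?.,;:'
_letters = 'AaBbCcÇçDdEeƏəFfGgĞğHhXxIıİiJjKkQqLlMmNnOoÖöPpRrSsŞşTtUuÜüVvYyZz'
symbols = [_pad] + list(_special) + list(_punctuation) + list(_letters)
symbol_to_id = {s: i for i, s in enumerate(symbols)}

def az_lower(text):
    """Lowercase with Azerbaijani casing rules (I -> ı, İ -> i)."""
    return text.replace('I', 'ı').replace('İ', 'i').lower()

def text_to_ids(text):
    """Map each known character to its ID, silently skipping unknown ones."""
    return [symbol_to_id[ch] for ch in text if ch in symbol_to_id]

def intersperse(ids, blank_id=0):
    """Insert blank_id between and around all IDs, mimicking `add_blank`."""
    result = [blank_id] * (len(ids) * 2 + 1)
    result[1::2] = ids
    return result

ids = text_to_ids(az_lower("Salam, İsmayıl!"))
padded = intersperse(ids)
print(len(ids), len(padded))
```

Skipping unknown characters keeps the sketch short; in a real pipeline you would probably want to log them, since silently dropped letters are hard to spot later in synthesized audio.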

## 3. Data Preparation for Large Datasets

For large datasets, we'll use `WebDataset`. This requires converting your dataset into a series of `.tar` files (shards). Each sample in the tar file will contain the audio and its transcript.

**Action Required:** Create a filelist named `my_dataset.txt` with the format `path/to/audio.wav|text transcription`. Then run the cell below to create the tar shards.

In [None]:
import io
import tarfile

filelist_path = 'my_dataset.txt' # Change this to your filelist
output_dir = 'dataset_tar'
os.makedirs(output_dir, exist_ok=True)

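# Read the filelist; split on the first '|' only so transcripts may contain '|',
# and skip blank lines
with open(filelist_path, 'r', encoding='utf-8') as f:
    filepaths_and_text = [line.strip().split('|', 1) for line in f if line.strip()]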

# Create tar shards
shard_size = 1000  # Number of samples per shard
shard_count = 0
for i in range(0, len(filepaths_and_text), shard_size):
    shard_name = os.path.join(output_dir, f'az_audio-{shard_count:06d}.tar')
    with tarfile.open(shard_name, 'w') as tar:
        for wav_path, text in filepaths_and_text[i:i+shard_size]:
            sample_name = os.path.splitext(os.path.basename(wav_path))[0]
            
            # Add audio file
            tar.add(wav_path, arcname=f'{sample_name}.wav')
            
            # Add text file
            text_bytes = text.encode('utf-8')
            tarinfo = tarfile.TarInfo(name=f'{sample_name}.txt')
            tarinfo.size = len(text_bytes)
            tar.addfile(tarinfo, io.BytesIO(text_bytes))
    print(f'Created shard: {shard_name}')
    shard_count += 1
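
Before training, it can help to confirm that every sample in a shard has both a `.wav` and a `.txt` member, since WebDataset groups files into samples by their shared base name. The sketch below uses only the standard library. `check_shard` and `incomplete_samples` are illustrative helpers, and the in-memory tar with dummy bytes stands in for a real shard such as `dataset_tar/az_audio-000000.tar`.

```python
import io
import tarfile
from collections import defaultdict

def check_shard(fileobj):
    """Return {sample_name: set of extensions} for one tar shard."""
    members = defaultdict(set)
    with tarfile.open(fileobj=fileobj, mode='r') as tar:
        for info in tar.getmembers():
            # Split at the first dot; this is assumed to match how WebDataset
            # derives sample keys for flat (directory-free) member names
            base, _, ext = info.name.partition('.')
            members[base].add(ext)
    return dict(members)

def incomplete_samples(members, required=('wav', 'txt')):
    """List sample names that lack any of the required extensions."""
    return sorted(k for k, exts in members.items() if not set(required) <= exts)

# Build a tiny in-memory shard with dummy payloads for demonstration
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w') as tar:
    for name, payload in [('a.wav', b'RIFF'), ('a.txt', 'salam'.encode('utf-8')), ('b.wav', b'RIFF')]:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
buf.seek(0)

members = check_shard(buf)
print(incomplete_samples(members))
```

For a shard on disk, open it with `open(path, 'rb')` and pass the file object to `check_shard`. Note that a base name containing a dot (for example `chapter1.part2.wav`) would be split at the first dot, so such names are best avoided when building shards.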

## 4. Training with Voice Cloning

We'll modify the training process to include speaker embeddings for voice cloning.

In [None]:
# NOTE: The standard VITS model needs to be modified to accept speaker embeddings.
# The following is a conceptual guide. You would need to modify the SynthesizerTrn model in vits/models.py
# to accept a speaker-embedding argument (called `g` in the inference sketch of Section 5)
# and use it to condition the generator.

print('This section is a conceptual guide.')
print('You need to modify the VITS model to accept speaker embeddings.')

# Example of how the dataloader could look
# This part is illustrative and assumes you have a function to process the data
# from torch.utils.data import DataLoader
# url = hps.data.webdataset_base_path
# dataset = wds.WebDataset(url).shuffle(1000).decode(wds.torch_audio).to_tuple('wav', 'txt')
# Clips differ in length, so a real loader also needs a padding collate_fn
# train_loader = DataLoader(dataset, batch_size=hps.train.batch_size, num_workers=2)
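
The `url` handed to WebDataset uses brace notation and has to match the shards that Section 3 actually wrote; the `{000000..000099}` range in the configuration assumes exactly 100 shards. A small sketch for deriving the pattern from the sample count follows. `shard_pattern` is a hypothetical helper, and its default directory, prefix, and shard size are simply copied from the cells above.

```python
import math

def shard_pattern(num_samples, shard_size=1000, output_dir='./dataset_tar', prefix='az_audio'):
    """Build the brace-notation URL covering every shard written for num_samples."""
    if num_samples <= 0:
        raise ValueError('num_samples must be positive')
    num_shards = math.ceil(num_samples / shard_size)
    last = num_shards - 1
    if last == 0:
        # A single shard needs no brace range
        return f'{output_dir}/{prefix}-{0:06d}.tar'
    return f'{output_dir}/{prefix}-{{{0:06d}..{last:06d}}}.tar'

print(shard_pattern(100_000))  # 100 shards
print(shard_pattern(2_500))    # 3 shards, the last one partially filled
```

Writing the result into `config_json["data"]["webdataset_base_path"]` before dumping the config keeps the loader from requesting shards that do not exist.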

## 5. Inference with Voice Cloning

For inference, we'll load a reference audio, generate its speaker embedding, and use it to synthesize speech in the target voice.

In [None]:
# Load a pre-trained speaker encoder
speaker_encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb", savedir="pretrained_models/spkrec-ecapa-voxceleb")

def get_speaker_embedding(wav_path):
    """Return a 192-dimensional ECAPA speaker embedding for one audio file."""
    # The VoxCeleb ECAPA model expects 16 kHz audio, so resample on load
    audio, _ = librosa.load(wav_path, sr=16000)
    audio_tensor = torch.FloatTensor(audio).unsqueeze(0)  # [1, num_samples]
    with torch.no_grad():
        embedding = speaker_encoder.encode_batch(audio_tensor)
        embedding = embedding.squeeze(0).squeeze(0)  # [1, 1, 192] -> [192]
    return embedding.cpu().numpy()

# Path to your reference audio for voice cloning
reference_audio_path = 'path/to/your/reference.wav' # <<< CHANGE THIS
speaker_embedding = get_speaker_embedding(reference_audio_path)

# Load trained VITS model (conceptual)
# net_g = SynthesizerTrn(...) 
# utils.load_checkpoint('path/to/your/model.pth', net_g, None)

# English: "Hello world, this is a voice cloning test."
text_to_synthesize = "Salam dünya, bu bir səs klonlama testidir."
# text_sequence = text_to_sequence(text_to_synthesize, hps.data.text_cleaners)

# with torch.no_grad():
#     x_tst = ...
#     speaker_emb_tst = torch.FloatTensor(speaker_embedding).unsqueeze(0).to(device)
#     audio = net_g.infer(x_tst, ..., g=speaker_emb_tst) # 'g' is the speaker embedding

print(f'Generated speaker embedding of shape: {speaker_embedding.shape}')
print('Inference part is conceptual and requires a model trained with speaker embeddings.')
# ipd.display(ipd.Audio(audio, rate=hps.data.sampling_rate))
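
One rough sanity check for a cloned voice is to compare the speaker embedding of the reference clip with the embedding of the synthesized audio; cosine similarity is a common measure for this kind of embedding. The sketch below is dependency-free and uses short made-up vectors in place of real 192-dimensional embeddings. `cosine_similarity` is an illustrative helper, and any threshold for "same speaker" would have to be chosen empirically.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (lists or 1-D arrays)."""
    if len(a) != len(b):
        raise ValueError('embeddings must have the same dimension')
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError('embeddings must be non-zero')
    return dot / (norm_a * norm_b)

# Toy stand-ins for get_speaker_embedding(reference) and get_speaker_embedding(generated)
reference = [0.2, -0.1, 0.7, 0.4]
generated = [0.25, -0.05, 0.6, 0.45]
unrelated = [-0.7, 0.4, 0.1, -0.2]

print(round(cosine_similarity(reference, generated), 3))
print(round(cosine_similarity(reference, unrelated), 3))
```

With real data, you would save the synthesized audio to a file, pass it through `get_speaker_embedding`, and compare the result against `speaker_embedding` from the reference clip.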