# 🎙️ VITS Azerbaijani Text-to-Speech

This notebook provides a complete workflow for training and using a VITS model for Azerbaijani text-to-speech synthesis, including voice cloning capabilities.

## 📋 Features
- Single-speaker TTS training
- Zero-shot voice cloning
- Audio preprocessing & normalization
- Checkpoint management
- Interactive Gradio demo

## 🗺️ Notebook Structure
1. Environment Setup
2. Dataset Preparation
3. Audio Preprocessing
4. Model Training
5. Inference & Voice Cloning
6. Web Demo

> 💡 This notebook works both in Google Colab and locally. Colab-specific cells are marked with a [COLAB] tag.


## 1. Environment Setup

First, let's set up our environment with all necessary dependencies.


In [1]:
# [COLAB] System packages
!apt-get update -y && apt-get install -y espeak ffmpeg


0% [Working]            Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
0% [Waiting for headers] [Connected to cloud.r-project.org (108.157.173.52)] [C                                                                               Get:2 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
0% [Waiting for headers] [2 InRelease 14.2 kB/129 kB 11%] [Waiting for headers]                                                                               Get:3 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:4 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Hit:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:6 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:7 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Get:9 http://security.ubuntu.com/ubuntu jammy-securit

In [2]:
# [COLAB] RUN to upload project zip
from google.colab import files
import os

def upload_extract_project():
    """Upload and extract a project ZIP file."""
    print("Please upload your project ZIP file...")
    uploaded = files.upload()

    for filename in uploaded.keys():
        if filename.endswith('.zip'):
            print(f"Extracting {filename} to ...")

            !unzip -o "{filename}" -d .
            print("Project extracted! Contents:")
            !ls -la
        else:
            print(f"Skipping {filename} - not a ZIP file")

upload_extract_project()

Please upload your project ZIP file...


Saving tts.zip to tts.zip
Extracting tts.zip to ...
Archive:  tts.zip
   creating: ./config/
  inflating: ./config/base_vits.json  
  inflating: ./config/hifigan.json   
   creating: ./data/
  inflating: ./data/dataset.py       
   creating: ./data/filelists/
  inflating: ./data/filelists/train.txt  
  inflating: ./data/filelists/val.txt  
   creating: ./data/processing/
  inflating: ./data/processing.py    
   creating: ./data/text/
  inflating: ./data/text/az_symbols.py  
  inflating: ./data/text/text_processor.py  
  inflating: ./data/text/__init__.py  
   creating: ./data/tools/
  inflating: ./data/tools/prepare_filelist.py  
  inflating: ./data/__init__.py      
   creating: ./datasets/
   creating: ./datasets/normalized/
   creating: ./datasets/raw/
  inflating: ./datasets/raw/02.wav   
   creating: ./logs/
   creating: ./model/
   creating: ./model/components/
  inflating: ./model/components/duration.py  
  inflating: ./model/components/hifigan.py  
  inflating: ./model/componen

In [3]:
# install deps/requirements and matching wheels for CUDA 11.8
!pip install --upgrade --force-reinstall --no-cache-dir \
  torch==2.3.1+cu118 torchaudio==2.3.1+cu118 \
  numpy==1.26.4 scipy==1.11.4 librosa==0.10.2 \
  soundfile==0.12.1 webrtcvad==2.0.10 pyyaml==6.0.1 \
  matplotlib==3.8.4 gradio==3.49.0 tqdm==4.66.4 \
  tensorboard==2.15.2 einops==0.7.0 speechbrain==0.5.16 \
  datasets==2.19.0 pyarrow==15.0.2 \
  --extra-index-url https://download.pytorch.org/whl/cu118


Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu118
Collecting torch==2.3.1+cu118
  Downloading https://download.pytorch.org/whl/cu118/torch-2.3.1%2Bcu118-cp311-cp311-linux_x86_64.whl (839.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m839.7/839.7 MB[0m [31m141.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting torchaudio==2.3.1+cu118
  Downloading https://download.pytorch.org/whl/cu118/torchaudio-2.3.1%2Bcu118-cp311-cp311-linux_x86_64.whl (3.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.3/3.3 MB[0m [31m186.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting numpy==1.26.4
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.0/61.0 kB[0m [31m3.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting scipy==1.11.4
  Downloading scipy-1.11.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.meta

In [None]:
# [COLAB] Keep-alive function to prevent disconnects
from IPython.display import display, Javascript

def keep_alive():
    display(Javascript('''
        function ClickConnect(){
            console.log("Clicking connect button...");
            document.querySelector("colab-connect-button").click()
        }
        setInterval(ClickConnect, 60000)
    '''))

# Uncomment next line if running in Colab:
keep_alive()


<IPython.core.display.Javascript object>

## 2. Dataset Preparation

The VITS model requires:
1. WAV audio files (22050 Hz, mono)
2. Text transcriptions in Azerbaijani
3. Filelists mapping audio to text


In [None]:
# [COLAB] in case to upload datasets zip to datasets/raw
from google.colab import files
import os

def upload_and_extract_dataset():
    """Upload and extract a dataset ZIP file to the datasets/raw directory."""
    print("Please upload your dataset ZIP file...")
    uploaded = files.upload()

    for filename in uploaded.keys():
        if filename.endswith('.zip'):
            print(f"Extracting {filename} to datasets/raw/...")
            os.makedirs('datasets/raw', exist_ok=True)
            !unzip -o "{filename}" -d datasets/raw/
            print("Dataset extracted! Contents:")
            !ls -la datasets/raw/
        else:
            print(f"Skipping {filename} - not a ZIP file")

# Uncomment to upload dataset zip to datasets/raw:
# upload_and_extract_dataset()


# Voice dataset from HuggingFace

- To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens

- Also, accept terms on dataset page. https://huggingface.co/datasets/mozilla-foundation/common_voice_12_0

- temporarily using dataset from https://huggingface.co/datasets/BHOSAI/Azerbaijani_News_TTS


In [8]:

!pip install huggingface_hub --quiet
!huggingface-cli login

import os
import soundfile as sf
import librosa
from datasets import load_dataset

# ——— Settings ———
TARGET_SR = 22050
OUT_DIR = "datasets/raw"
os.makedirs(OUT_DIR, exist_ok=True)

# ——— Load dataset & pick 20 samples ———
ds = load_dataset("BHOSAI/Azerbaijani_News_TTS", split="single", trust_remote_code=True, use_auth_token=True)
subset = ds.shuffle(seed=42).select(range(100))

# ——— Save audio + metadata ———
with open(f"{OUT_DIR}/metadata.txt", "w", encoding="utf-8") as meta:
    for i, sample in enumerate(subset):
        utt_id = f"{i:04d}"
        wav_path = f"{OUT_DIR}/{utt_id}.wav"

        # Resample to 22.05kHz mono
        audio_22k = librosa.resample(sample["audio"]["array"], orig_sr=sample["audio"]["sampling_rate"], target_sr=TARGET_SR)
        sf.write(wav_path, audio_22k, TARGET_SR)

        # Write metadata line
        text = sample["text"].strip().replace("|", " ")
        meta.write(f"{wav_path}|{text}\n")

        print(f"✅ {wav_path} — {text}")



    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: read)



✅ datasets/raw/0000.wav — Məsələ bundadır ki, KXDR-in mal dövriyyəsinin 90% -i Çin üzərindən, bu ölkənin Şimali Koreya sərhədinə yaxın yerləşən Dandun və Şenyanq bölgələrində xəstəlik halları qeydə alınmışdı.
✅ datasets/raw/0001.wav — Bu baxımdan Gəncəyə digər şəhər və rayonlardan gələn vətəndaşlara məhdudiyyətlər yaranıb.
✅ datasets/raw/0002.wav — Suya bir neçə tablet aspirin və ya aktivləşdirilmiş kömür qoymaq lazımdır.
✅ datasets/raw/0003.wav — Bəzi hallarda tarixi binaları sökürlər, onların yerində eybəcər hündürmərtəbəli binalar tikirlər.
✅ datasets/raw/0004.wav — Daha əvvəl Cənubi Afrika Prezidenti Siril Ramafosa bütün ölkə ərazisində 21 günlük karantinin tətbiq olunduğunu açıqlayıb.
✅ datasets/raw/0005.wav — az xəbər verir ki, əgər oktyabrın 29-da iş adamının sərvəti “Bloomberg Billionaires İndex”də 20,1 milyard dollar idisə, hazırda bu məbləğ 20,9 milyard dollar təşkil edir.
✅ datasets/raw/0006.wav — az xəbər verir ki, Təhsil Nazirliyi koronavirus təhlükəsi ilə əlaqədar ümumi t

## 3. Audio Preprocessing

Before training, we'll normalize the audio files to ensure consistent quality:
- Remove DC offset
- Normalize levels
- Resample to 22050 Hz
- Convert to mono


In [13]:
import os
import glob
import librosa
import soundfile as sf
import numpy as np
from tqdm.notebook import tqdm
import multiprocessing

def process_audio_file(file_path, target_sr=22050, target_level=-23.0, output_dir=None):
    """Process a single audio file with normalization and resampling."""
    try:
        # Set output path
        if output_dir:
            os.makedirs(output_dir, exist_ok=True)
            filename = os.path.basename(file_path)
            output_path = os.path.join(output_dir, filename)
        else:
            output_path = file_path

        # Load and process audio
        y, sr = librosa.load(file_path, sr=None, mono=True)

        # Resample if needed
        if sr != target_sr:
            y = librosa.resample(y, orig_sr=sr, target_sr=target_sr)

        # Remove DC offset
        y = y - np.mean(y)

        # Normalize level
        rms = np.sqrt(np.mean(y**2))
        target_rms = 10**(target_level/20)
        gain = target_rms / (rms + 1e-8)
        y_normalized = y * gain

        # Prevent clipping
        max_val = np.max(np.abs(y_normalized))
        if max_val > 0.99:
            y_normalized = y_normalized / max_val * 0.99

        # Save processed audio
        sf.write(output_path, y_normalized, target_sr)
        return True

    except Exception as e:
        print(f"Error processing {file_path}: {e}")
        return False

def normalize_dataset(dataset_dir, output_dir=None):
    """Normalize all WAV files in a directory using multiprocessing."""
    wav_files = glob.glob(os.path.join(dataset_dir, "**", "*.wav"), recursive=True)
    print(f"Found {len(wav_files)} WAV files")

    if not wav_files:
        print("No WAV files found!")
        return

    # Process files in parallel
    with multiprocessing.Pool(processes=os.cpu_count()) as pool:
        args = [(f, 22050, -23.0, output_dir) for f in wav_files]
        results = list(tqdm(pool.starmap(process_audio_file, args), total=len(args)))

    success_count = results.count(True)
    print(f"Successfully processed {success_count} of {len(wav_files)} files")

# Uncomment to normalize dataset:
normalize_dataset('datasets/raw', output_dir='datasets/normalized')

# copy metadata and update paths
!cp datasets/raw/metadata.txt datasets/normalized/metan.txt
!sed -i 's#datasets/raw/#datasets/normalized/#' datasets/normalized/metan.txt

Found 100 WAV files


  0%|          | 0/100 [00:00<?, ?it/s]

Successfully processed 100 of 100 files


### Generate Filelists

Create train/validation splits with the format:
```
path/to/audio.wav|Azerbaijani text
```


In [17]:
# Generate train/val splits
!python data/tools/prepare_filelist.py \
    --wavs datasets/normalized \
    --output data/filelists \
    --transcriptions datasets/normalized/metan.txt \
    --val-ratio 0.05


2025-07-09 10:29:10,800 - __main__ - INFO - Loaded 100 transcriptions from datasets/normalized/metan.txt
2025-07-09 10:29:10,800 - __main__ - INFO - Found 100 WAV files in datasets/normalized
2025-07-09 10:29:10,800 - __main__ - INFO - Split dataset: 95 training files, 5 validation files
2025-07-09 10:29:10,801 - __main__ - INFO - Filelists created at data/filelists/train.txt and data/filelists/val.txt


## 4. Model Training

We'll train the VITS model with periodic checkpointing and monitoring.


In [None]:
# Training monitor
import time
import os
import glob

def monitor_training(interval=60):
    """Monitor training progress and checkpoint saving."""
    checkpoint_dir = 'checkpoints'

    try:
        while True:
            checkpoint_files = glob.glob(f"{checkpoint_dir}/*.pt")

            print(f"\n=== Training Status: {time.strftime('%Y-%m-%d %H:%M:%S')} ===")
            print(f"Found {len(checkpoint_files)} checkpoints")

            if checkpoint_files:
                checkpoint_files.sort(key=lambda x: os.path.getmtime(x), reverse=True)
                print("\nMost recent checkpoints:")
                for i, ckpt in enumerate(checkpoint_files[:3]):
                    mod_time = time.strftime('%Y-%m-%d %H:%M:%S',
                                        time.localtime(os.path.getmtime(ckpt)))
                    size_mb = os.path.getsize(ckpt) / (1024 * 1024)
                    print(f"{i+1}. {os.path.basename(ckpt)} - {size_mb:.2f} MB - {mod_time}")

            print(f"\nNext check in {interval} seconds...")
            time.sleep(interval)

    except KeyboardInterrupt:
        print("\nMonitoring stopped")

# Start training
!python train.py --config config/base_vits.json --output_dir checkpoints

# Uncomment to monitor training:
# monitor_training(interval=60)


Epoch 51: 100% 23/23 [00:11<00:00,  1.95it/s, loss=1.4229]
Validation: 100% 1/1 [00:00<00:00,  2.96it/s]
2025-07-09 10:53:50,798 - training.trainer - INFO - Epoch 51/1050, Train Loss: 1.4579, Val Loss: 1.4991, Time: 0:10:00
Epoch 52:   0% 0/23 [00:00<?, ?it/s, loss=1.7581]2025-07-09 10:53:51,488 - training.trainer - INFO - Step 1173, Loss: 1.7581, Recon: 1.6570, KL: 0.0000, Dur: 1.0103
Epoch 52: 100% 23/23 [00:10<00:00,  2.11it/s, loss=1.6641]
Validation: 100% 1/1 [00:00<00:00,  2.88it/s]
2025-07-09 10:54:02,045 - training.trainer - INFO - Epoch 52/1051, Train Loss: 1.5594, Val Loss: 1.4964, Time: 0:10:11
Epoch 53:   0% 0/23 [00:00<?, ?it/s, loss=1.4172]2025-07-09 10:54:02,660 - training.trainer - INFO - Step 1196, Loss: 1.4172, Recon: 1.3231, KL: 0.0000, Dur: 0.9407
Epoch 53: 100% 23/23 [00:11<00:00,  2.05it/s, loss=1.8075]
Validation: 100% 1/1 [00:00<00:00,  3.18it/s]
2025-07-09 10:54:13,597 - training.trainer - INFO - Epoch 53/1052, Train Loss: 1.5116, Val Loss: 1.4980, Time: 0:10:2

## 5. Inference & Voice Cloning


In [4]:
# VITS inference inside a notebook (no more VITSInference wrapper)

import os, glob, torch, numpy as np
import IPython.display as ipd
from pathlib import Path

from model.vits import VITS
from utils.common import load_config
from data.text.text_processor import TextProcessor

# ------------------------------------------------------------------
# 1. Locate the checkpoint to load
# ------------------------------------------------------------------
ckpt_dir = "checkpoints"
ckpts = sorted(glob.glob(os.path.join(ckpt_dir, "*.pth")), key=os.path.getmtime)
assert ckpts, f"No checkpoints found in {ckpt_dir}!"

# Prefer a file called best_model.pth if it exists, otherwise newest *.pth
checkpoint_path = (Path(ckpt_dir) / "best_model.pth"
                   if os.path.exists(os.path.join(ckpt_dir, "best_model.pth"))
                   else ckpts[-1])
print("Using checkpoint:", os.path.basename(checkpoint_path))

# ------------------------------------------------------------------
# 2. Re-create the model and load weights
# ------------------------------------------------------------------
cfg_path = "config/base_vits.json"
config   = load_config(cfg_path)

device   = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model    = VITS(config).to(device).eval()

state = torch.load(checkpoint_path, map_location=device)
model.load_state_dict(state.get("model_state_dict", state))   # support both formats

# ------------------------------------------------------------------
# 3. Text processor
# ------------------------------------------------------------------
text_proc = TextProcessor(config)

def synthesize(text: str, speed: float = 1.0, seed: int | None = None):
    """Return a numpy array with generated audio."""
    if seed is not None:
        torch.manual_seed(seed)

    with torch.no_grad():
        seq = text_proc.encode_text(text).unsqueeze(0).to(device)   # [1, T]
        audio = model.generate(seq, speed_adjustment=speed)         # [1, 1, S]
        audio = audio.squeeze().cpu().numpy()

    # simple peak-norm
    peak = np.max(np.abs(audio))
    return audio / peak if peak > 0 else audio

# ------------------------------------------------------------------
# 4. Demo
# ------------------------------------------------------------------
text  = "Salam dünya, Salam dünya, Salam dünya, Salam dünya, Salam dünya, Salam dünya"
audio = synthesize(text)

ipd.display(ipd.Audio(audio, rate=config["data"]["sampling_rate"]))


Using checkpoint: best_model.pth


In [14]:
import os, librosa, numpy as np
import IPython.display as ipd
from pathlib import Path
from google.colab import files   # safe-import: only exists on Colab

# ---------------------------------------------------------
#  Voice-cloning demo (same model, different reference clip)
# ---------------------------------------------------------

def clone_voice(reference_wav: str | None = None,
                text: str = "Mənim səsimlə danışan süni zəka!",
                speed: float = 1.0):
    """
    Generate speech in the *style* of `reference_wav` using
    the already-loaded VITS model.

    NOTE: this simple demo just copies the loudness contour of
    the reference clip – full embedding-based cloning would require
    a speaker encoder which is not yet integrated here.
    """
    # ------------------------------------------------------------------
    # 1. Pick reference file
    # ------------------------------------------------------------------
    if reference_wav is None:
        if "google.colab" in globals():
            print("Upload a reference .wav file:")
            up = files.upload()
            if up:
                reference_wav = next(iter(up))
        else:
            reference_wav = "datasets/normalized/0002.wav"   # fallback
    if not reference_wav or not Path(reference_wav).exists():
        print("No valid reference file found.")
        return

    # ------------------------------------------------------------------
    # 2. Play reference
    # ------------------------------------------------------------------
    ref_audio, sr = librosa.load(reference_wav, sr=config["data"]["sampling_rate"])
    print("Reference voice:")
    ipd.display(ipd.Audio(ref_audio, rate=sr))

    # ------------------------------------------------------------------
    # 3. Naïve loudness-matching clone
    # ------------------------------------------------------------------
    cloned = synthesize(text, speed=speed)

    # Match energy of reference (simple RMS normalisation)
    ref_rms = np.sqrt(np.mean(ref_audio**2))
    cln_rms = np.sqrt(np.mean(cloned**2))
    if cln_rms > 0:
        cloned = cloned * (ref_rms / cln_rms)

    print(f"\nCloned voice saying: {text}")
    ipd.display(ipd.Audio(cloned, rate=sr))

# ------------------------------------------------------------------
# Run the demo (uploads on Colab, fallback file elsewhere)
# ------------------------------------------------------------------
clone_voice()


Reference voice:



Cloned voice saying: Mənim səsimlə danışan süni zəka!


## 6. Web Demo

Launch an interactive Gradio demo for testing the model.


In [13]:
# Launch Gradio demo
if 'google.colab' in globals():
    !python app.py --share True  # Public URL
else:
    !python app.py  # Local URL


2025-07-09 09:21:12,588 - utils.common - INFO - Logger initialized with level 20
2025-07-09 09:21:12,589 - __main__ - INFO - Initializing TTS application
2025-07-09 09:21:12,593 - __main__ - INFO - Using device: cuda
2025-07-09 09:21:12,593 - utils.common - INFO - Loading config from config/base_vits.json
2025-07-09 09:21:12,654 - model.vits - INFO - VITS model initialized with config: {'hidden_channels': 192, 'spk_embed_dim': 64, 'n_layers': 6, 'n_heads': 2, 'use_sdp': True, 'vocab_size': 100, 'audio_channels': 80, 'decoder_channels': 512, 'upsample_rates': [8, 8, 2, 2], 'upsample_kernel_sizes': [16, 16, 4, 4]}
2025-07-09 09:21:12,870 - __main__ - INFO - Loading model checkpoint from checkpoints/best_model.pth
2025-07-09 09:21:12,968 - __main__ - INFO - Model checkpoint loaded successfully
2025-07-09 09:21:12,969 - data.text.text_processor - INFO - Initialized TextProcessor with 89 characters
2025-07-09 09:21:12,969 - __main__ - INFO - TTS application initialized successfully
Running 