# 🎙️ Karakalpak VITS TTS Model Training on Google Colab

This notebook provides a complete pipeline for fine-tuning a VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) model for the Karakalpak language, based on the MMS (Massively Multilingual Speech) TTS models.

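## 📋 Overview
- **Model**: MMS-TTS (VITS architecture)
- **Language**: Karakalpak (kaa)
- **Dataset**: HuggingFace dataset `nickoo004/karakalpak-tts-speaker1`
- **Training Time**: ~20-30 minutes on Colab GPU
- **✅ No Git Authentication Needed**: the code is downloaded as a ZIP archive, which works without credentials as long as the GitHub repository is public. A private repository additionally requires an access token.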

## ⚠️ IMPORTANT: Run Cells in Order!

**This notebook MUST be run from top to bottom, cell by cell!**

Do NOT skip cells or run them out of order, or you'll get errors like:
- "File not found"
- "Variable not defined"
- "Directory doesn't exist"

**Recommended**: Use `Runtime > Run all` to run everything in order automatically.

## 🚀 Steps:
1. Environment Setup & Dependencies
2. Download Repository (ZIP method, no Git credentials needed for public repos)
3. Dataset Loading from HuggingFace
4. Model Preparation
5. Training Configuration
6. Model Training
7. Inference & Testing
8. Model Saving & Upload

## 1️⃣ Environment Setup & GPU Check

In [None]:
# Check GPU availability
import torch
print(f"🔍 GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"📊 GPU Device: {torch.cuda.get_device_name(0)}")
    print(f"💾 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️ WARNING: No GPU detected! Training will be very slow.")
    print("Please enable GPU in Runtime > Change runtime type > Hardware accelerator > GPU")

## 2️⃣ Install Required Dependencies

In [None]:
%%bash
# Install core dependencies.
# Version specifiers are quoted so the shell does not treat ">" as output redirection.
pip install -q "transformers>=4.35.1" "datasets[audio]>=2.14.7" "accelerate>=0.24.1"
pip install -q matplotlib wandb tensorboard Cython
pip install -q scipy librosa soundfile

echo "✅ Dependencies installed successfully!"
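
After installation it can help to confirm that the installed versions satisfy the minimums listed in the `pip` command. The following is a minimal sketch, not part of the original notebook: `meets_minimum` is a hypothetical helper that compares plain dotted release numbers only (it is not a full PEP 440 comparison), and the installed version is read with the standard library's `importlib.metadata`.

```python
from importlib import metadata

# Minimum versions taken from the pip install command.
MINIMUMS = {"transformers": "4.35.1", "datasets": "2.14.7", "accelerate": "0.24.1"}


def meets_minimum(installed: str, minimum: str) -> bool:
    """Roughly compare dotted versions such as '4.35.1' (hypothetical helper)."""

    def to_tuple(version: str) -> tuple:
        numbers = []
        # Only the first three components are considered; suffixes are ignored.
        for piece in version.split(".")[:3]:
            leading = ""
            for ch in piece:
                if not ch.isdigit():
                    break
                leading += ch
            if not leading:
                break
            numbers.append(int(leading))
        return tuple(numbers)

    return to_tuple(installed) >= to_tuple(minimum)


for package, minimum in MINIMUMS.items():
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        print(f"{package}: not installed")
        continue
    status = "OK" if meets_minimum(installed, minimum) else f"too old, need >= {minimum}"
    print(f"{package} {installed}: {status}")
```

If a package is reported as too old, re-running the install cell and restarting the runtime is usually the simplest fix.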

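## 3️⃣ Download Repository (ZIP Method)

**✅ No Git Authentication Required for Public Repos**: downloads the repository as a ZIP file directly from GitHub. GitHub does not serve this archive URL anonymously for a private repository, so a private repo needs an access token.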

In [None]:
import os
import zipfile
import urllib.request
import shutil

repo_name = "my-vits-finetuner-karakalpak"

# IMPORTANT: Update this URL if your main branch is named differently
# For 'main' branch: /archive/refs/heads/main.zip
# For 'master' branch: /archive/refs/heads/master.zip
zip_url = "https://github.com/NursultanMRX/my-vits-finetuner-karakalpak/archive/refs/heads/main.zip"

if not os.path.exists(repo_name):
    print("📥 Downloading repository as ZIP (no Git credentials needed for a public repo)...")
    zip_path = "repo.zip"
    
    try:
        # Download the zip file
        urllib.request.urlretrieve(zip_url, zip_path)
        print("✅ Downloaded successfully!")
        
        # Extract the zip file
        print("📦 Extracting files...")
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(".")
        
        # The extracted folder has '-main' or '-master' suffix
        extracted_name = None
        for suffix in ['-main', '-master']:
            possible_name = f"{repo_name}{suffix}"
            if os.path.exists(possible_name):
                extracted_name = possible_name
                break
        
        if extracted_name:
            shutil.move(extracted_name, repo_name)
            print(f"✅ Renamed '{extracted_name}' to '{repo_name}'")
        else:
            raise Exception("Could not find extracted directory")
        
        # Clean up
        os.remove(zip_path)
        print("✅ Repository ready!")
        
    except Exception as e:
        print(f"❌ Error: {e}")
        print("\n⚠️ Please check:")
        print("1. Repository exists and is accessible")
        print("2. Branch name is correct (main or master)")
        raise
else:
    print("✅ Repository already exists!")

# Change to repository directory
if os.path.exists(repo_name):
    os.chdir(repo_name)
    print(f"📂 Current directory: {os.getcwd()}")
    
    files = os.listdir('.')
    print(f"\n📋 Repository contains {len(files)} items")
    key_files = [f for f in ['run_vits_finetuning.py', 'monotonic_align', 'utils'] if f in files]
    print(f"   Key files present: {', '.join(key_files)}")
else:
    raise Exception(f"❌ Directory '{repo_name}' does not exist!")
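
The plain archive URL is only served anonymously for public repositories. For a private one, a possible approach (a sketch, not part of the original notebook) is GitHub's REST endpoint `GET /repos/{owner}/{repo}/zipball/{ref}` with a personal access token in the `Authorization` header. The helper names `build_zipball_request` and `download_private_zip`, and the `GITHUB_TOKEN` environment variable in the usage comment, are assumptions made for illustration.

```python
import shutil
import urllib.request


def build_zipball_request(owner: str, repo: str, ref: str, token: str) -> urllib.request.Request:
    """Build an authenticated request for a repository ZIP archive (hypothetical helper)."""
    url = f"https://api.github.com/repos/{owner}/{repo}/zipball/{ref}"
    request = urllib.request.Request(url)
    # The token is sent as a bearer credential; never hard-code it in the notebook.
    request.add_header("Authorization", f"Bearer {token}")
    request.add_header("Accept", "application/vnd.github+json")
    return request


def download_private_zip(owner: str, repo: str, ref: str, token: str, zip_path: str = "repo.zip") -> str:
    """Stream the archive to disk and return the path of the saved ZIP file."""
    request = build_zipball_request(owner, repo, ref, token)
    with urllib.request.urlopen(request) as response, open(zip_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return zip_path


# Example usage (assumes the token was stored in an environment variable):
# import os
# download_private_zip("NursultanMRX", "my-vits-finetuner-karakalpak", "main",
#                      os.environ["GITHUB_TOKEN"])
```

Archives fetched through this endpoint may unpack into a folder whose name differs from `<repo>-main`, so check the extracted directory name before renaming it.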

In [None]:
%%bash
# ========================================
# 3️⃣.1 Build Monotonic Align Module
# ========================================
# Build the Cython monotonic alignment module (CRITICAL for fast training)
echo "🔨 Building monotonic alignment module..."

if [ ! -d "monotonic_align" ]; then
    echo "❌ Error: monotonic_align directory not found!"
    exit 1
fi

cd monotonic_align
mkdir -p monotonic_align
python setup.py build_ext --inplace

if [ $? -eq 0 ]; then
    echo "✅ Monotonic align built successfully!"
else
    echo "❌ Failed to build!"
    exit 1
fi

cd ..
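
To confirm that the build produced a compiled extension, one option is to look for shared-object files under `monotonic_align/`. This is a sketch under assumptions: `find_built_extensions` is a hypothetical helper, and the `.so` and `.pyd` suffixes are what Cython builds typically emit on Linux and Windows respectively.

```python
from pathlib import Path


def find_built_extensions(build_dir: str) -> list:
    """Return sorted paths of compiled extension files below build_dir (hypothetical helper)."""
    root = Path(build_dir)
    if not root.is_dir():
        return []
    # rglob walks the directory tree; only compiled artifacts are kept.
    found = [p for p in root.rglob("*") if p.is_file() and p.suffix in {".so", ".pyd"}]
    return sorted(str(p) for p in found)


extensions = find_built_extensions("monotonic_align")
if extensions:
    print("Compiled extension(s) found:")
    for path in extensions:
        print(f"  - {path}")
else:
    print("No compiled extension found; re-run the build cell and check its output.")
```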

## 4️⃣ HuggingFace Authentication (Optional)

**Only needed if you want to:**
- Push your trained model to HuggingFace Hub
- Access private datasets

Get your token from: https://huggingface.co/settings/tokens

In [None]:
from huggingface_hub import notebook_login

# Login to HuggingFace (optional - skip if you don't want to push to hub)
try:
    # notebook_login() only displays the login widget; the login completes
    # once you paste your token into it
    notebook_login()
    print("✅ Login widget displayed. Paste your token above to complete the login.")
except Exception as e:
    print(f"⚠️ Login skipped: {e}")
    print("You can continue training, but won't be able to push to Hub")

## 5️⃣ Load Dataset from HuggingFace

In [None]:
from datasets import load_dataset, DatasetDict
import os

# Load the Karakalpak TTS dataset
dataset_name = "nickoo004/karakalpak-tts-speaker1"

print(f"📊 Loading dataset: {dataset_name}")

try:
    # Try default loading
    dataset = load_dataset(dataset_name)
    print("✅ Dataset loaded successfully!")
    
except ValueError as e:
    error_msg = str(e)
    print(f"⚠️ Default loading failed: {error_msg}")
    
    # Check if it's the file_name column issue
    if "file_name" in error_msg:
        print("\n🔧 Detected column name issue - trying workaround...")
        print("   Loading with custom processing...\n")
        
        try:
            # Load as an audio dataset (audiofolder auto-detects the structure)
            dataset = load_dataset(
                "audiofolder",
                data_dir=f"hf://datasets/{dataset_name}",
                drop_labels=False
            )
            print("✅ Loaded using audiofolder method!")
            
        except Exception as e2:
            print(f"❌ Workaround failed: {e2}")
            print("\n⚠️ PLEASE FIX THE DATASET:")
            print("Go to HuggingFace and rename 'audio_file' to 'file_name' in metadata.csv")
            print("See instructions below for exact steps.")
            raise
    else:
        # Different error - try other methods
        print("\n🔄 Trying alternative methods...\n")
        
        try:
            # With an explicit split, load_dataset returns a single Dataset; wrap it
            dataset = load_dataset(dataset_name, split='train')
            if not isinstance(dataset, DatasetDict):
                dataset = DatasetDict({'train': dataset})
            print("✅ Loaded with explicit split!")
        except Exception as e2:
            print(f"❌ All methods failed: {e2}")
            print("\n⚠️ SOLUTION: Fix dataset metadata on HuggingFace")
            print("Your metadata.csv needs a 'file_name' column")
            print("See instructions in the next cell below")
            raise

print("\n📈 Dataset structure:")
print(dataset)

# Show sample
if 'train' in dataset:
    sample = dataset['train'][0]
    print("\n🔍 Sample from dataset:")
    print(f"  Keys: {list(sample.keys())}")
    
    # Find columns
    text_col = sample.get('text', sample.get('sentence', 'N/A'))
    print(f"  - Text: {text_col}")
    
    audio_col = None
    for col in ['audio', 'audio_file', 'file', 'path', 'file_name']:
        if col in sample:
            audio_col = col
            print(f"  - Audio column: '{col}'")
            break
    
    speaker_col = None
    for col in ['speaker_name', 'speaker', 'speaker_id']:
        if col in sample:
            speaker_col = col
            print(f"  - Speaker column: '{col}' = {sample[col]}")
            break
    
    print(f"\n📊 Total samples: {len(dataset['train'])}")
    
    if audio_col:
        print(f"\n💡 Use audio_column_name='{audio_col}' in training config")
    if speaker_col:
        print(f"💡 Use speaker_id_column_name='{speaker_col}' in training config")

### 📝 How to Fix Your Dataset on HuggingFace

**Your Issue**: The metadata has `"audio_file"` but needs `"file_name"`

**Quick Fix (2 minutes):**

1. Go to: https://huggingface.co/datasets/nickoo004/karakalpak-tts-speaker1
2. Click **"Files and versions"** tab
3. Find and click on **metadata.csv**
4. Click **"Edit"** button (pencil icon)
5. Change the FIRST line from:
   ```csv
   "audio_file","text","speaker_name"
   ```
   To:
   ```csv
   "file_name","text","speaker_name"
   ```
6. Click **"Commit changes to main"**
7. Wait 1-2 minutes for HuggingFace to process
8. Come back and re-run the dataset loading cell above

**Alternative**: If editing doesn't work, download the metadata.csv, rename the column locally, and re-upload it.

**After fixing**, the automatic loading will work perfectly!
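
For the local alternative, the header rename can be scripted. The sketch below uses only the standard library; `rename_csv_column` is a hypothetical helper, and it assumes `metadata.csv` is a UTF-8 CSV with a header row. All fields are written quoted, matching the quoted header style shown in the steps.

```python
import csv


def rename_csv_column(src_path: str, dst_path: str, old_name: str, new_name: str) -> list:
    """Copy a CSV file while renaming one header column (hypothetical helper)."""
    with open(src_path, newline="", encoding="utf-8") as src:
        rows = list(csv.reader(src))
    if not rows or old_name not in rows[0]:
        raise ValueError(f"Column {old_name!r} not found in {src_path}")
    # Only the header row changes; data rows are written back untouched.
    rows[0] = [new_name if name == old_name else name for name in rows[0]]
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        csv.writer(dst, quoting=csv.QUOTE_ALL).writerows(rows)
    return rows[0]


# Example usage (paths are placeholders):
# rename_csv_column("metadata.csv", "metadata_fixed.csv", "audio_file", "file_name")
```

Upload the fixed file under the original name `metadata.csv` so the loader picks it up.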

### 🔧 Alternative: Manual Dataset Upload

**Uncomment and run this cell if automatic loading fails:**

In [None]:
# MANUAL DATASET LOADING (uncomment if needed)

# from datasets import Dataset, Audio, DatasetDict
# import pandas as pd
# import os

# # Upload your dataset folder to Colab, then update these paths:
# dataset_dir = "/content/karakalpak_dataset"  # Your dataset directory
# metadata_file = f"{dataset_dir}/metadata.csv"

# # Load metadata
# df = pd.read_csv(metadata_file)

# # Add full audio paths
# df['audio_file'] = df['file_name'].apply(lambda x: os.path.join(dataset_dir, 'audio', x))

# # Create dataset
# dataset = Dataset.from_pandas(df)
# dataset = dataset.cast_column('audio_file', Audio(sampling_rate=16000))
# dataset = DatasetDict({'train': dataset})

# print("✅ Manual dataset loaded!")
# print(dataset)

## 6️⃣ Prepare Base Model with Discriminator

In [None]:
from huggingface_hub import list_repo_files

model_name_or_path = "facebook/mms-tts-kaa"
local_model_dir = "./mms-tts-kaa-with-discriminator"

# Check if pre-converted model exists
try:
    test_model = "nickoo004/mms-tts-kaa-with-discriminator"
    files = list_repo_files(test_model)
    if 'discriminator.pth' in files or 'pytorch_model.bin' in files:
        print(f"✅ Found existing model: {test_model}")
        model_name_or_path = test_model
    else:
        raise Exception("Need to convert")
except Exception:
    # Catch Exception instead of using a bare except, so interrupting the cell still works
    print("\n📝 Converting base MMS model...")
    print(f"   Base: {model_name_or_path}")
    
    !python convert_original_discriminator_checkpoint.py \
        --language_code kaa \
        --pytorch_dump_folder_path {local_model_dir}
    
    model_name_or_path = local_model_dir
    print(f"\n✅ Model converted: {model_name_or_path}")

print(f"\n🎯 Model ready: {model_name_or_path}")

## 7️⃣ Configure Training Parameters

In [None]:
import json
import os

# Make sure we're in the repository directory
repo_dir = "/content/my-vits-finetuner-karakalpak"
if os.path.exists(repo_dir):
    os.chdir(repo_dir)
    print(f"📂 Working in: {os.getcwd()}")
else:
    print("⚠️ Warning: Repository directory not found. Run Section 3 cells first!")

training_config = {
    # Model and Dataset
    "model_name_or_path": model_name_or_path,
    "dataset_name": "nickoo004/karakalpak-tts-speaker1",
    
    # Output
    "output_dir": "./mms-tts-kaa-finetuned-speaker1",
    "overwrite_output_dir": True,
    
    # HuggingFace Hub (optional)
    "push_to_hub": False,
    "hub_model_id": "your-username/mms-tts-kaa-finetuned-speaker1",
    
    # Dataset columns (IMPORTANT: adjust based on what the dataset loading showed)
    "audio_column_name": "audio",  # Change this if dataset loading showed different name
    "text_column_name": "text",
    "speaker_id_column_name": "speaker_name",
    "filter_on_speaker_id": "Speaker_1",
    "override_speaker_embeddings": True,
    
    # Audio filtering
    "max_duration_in_seconds": 20.0,
    "min_duration_in_seconds": 1.0,
    
    # Training hyperparameters
    "num_train_epochs": 150,
    "per_device_train_batch_size": 4,
    "learning_rate": 2e-5,
    "warmup_ratio": 0.01,
    "gradient_accumulation_steps": 1,
    "gradient_checkpointing": False,
    "group_by_length": False,
    
    # Training flags
    "do_train": True,
    "do_eval": False,
    
    # Loss weights
    "weight_disc": 3.0,
    "weight_fmaps": 1.0,
    "weight_gen": 1.0,
    "weight_kl": 1.5,
    "weight_duration": 1.0,
    "weight_mel": 35.0,
    
    # Optimization
    "fp16": True,
    "seed": 42,
    
    # Logging
    "logging_steps": 10,
    "save_steps": 500,
    "save_total_limit": 2,
}

config_path = "./training_config_colab.json"
with open(config_path, 'w') as f:
    json.dump(training_config, f, indent=4)

print("✅ Training configuration created!")
print("\n📋 Key settings:")
print(f"  - Model: {training_config['model_name_or_path']}")
print(f"  - Dataset: {training_config['dataset_name']}")
print(f"  - Audio column: {training_config['audio_column_name']}")
print(f"  - Epochs: {training_config['num_train_epochs']}")
print(f"  - Batch size: {training_config['per_device_train_batch_size']}")
print(f"  - Learning rate: {training_config['learning_rate']}")
print(f"  - Output: {training_config['output_dir']}")
print(f"\n📄 Config saved to: {config_path}")
print("\n⚠️ IMPORTANT: If dataset loading showed different audio column name,")
print("   update 'audio_column_name' above and re-run this cell!")
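
To get a feel for how long the configured run is, the number of optimizer steps can be estimated from the config values. The sketch assumes a single GPU and a hypothetical dataset size of 1,000 clips (the real size is printed by the dataset loading cell); `estimate_optimizer_steps` is an illustrative helper, not part of the training script, and the script's own step count may differ slightly depending on how it rounds partial batches.

```python
import math


def estimate_optimizer_steps(num_samples: int, per_device_batch_size: int,
                             grad_accum_steps: int, num_epochs: int,
                             num_devices: int = 1) -> int:
    """Approximate total optimizer steps for a training run (hypothetical helper)."""
    batches_per_epoch = math.ceil(num_samples / (per_device_batch_size * num_devices))
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return steps_per_epoch * num_epochs


# Values from the config above, with an assumed dataset size of 1000 clips:
# 1000 / 4 = 250 steps per epoch, times 150 epochs.
total_steps = estimate_optimizer_steps(
    num_samples=1000,
    per_device_batch_size=4,
    grad_accum_steps=1,
    num_epochs=150,
)
print(f"Estimated optimizer steps: {total_steps}")  # 37500
```

Comparing this number with `save_steps` and `logging_steps` shows roughly how many checkpoints and log lines to expect.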

## 8️⃣ Start Training! 🚀

**This will take ~20-30 minutes on a T4 GPU**

If you get OOM errors, reduce `per_device_train_batch_size` to 2 or 1 in the config above and re-run.
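
One way to apply this advice without changing the effective batch size is to raise `gradient_accumulation_steps` by the same factor. The helper below is a hypothetical sketch operating on a dictionary shaped like `training_config`; whether training behaves identically under accumulation is not something this notebook verifies.

```python
def shrink_batch_for_oom(config: dict, new_batch_size: int) -> dict:
    """Return a copy of config with a smaller batch and proportionally more accumulation."""
    old_batch = config["per_device_train_batch_size"]
    if new_batch_size < 1 or old_batch % new_batch_size != 0:
        raise ValueError("new_batch_size must be a positive divisor of the current batch size")
    factor = old_batch // new_batch_size
    updated = dict(config)  # shallow copy so the original config is left unchanged
    updated["per_device_train_batch_size"] = new_batch_size
    updated["gradient_accumulation_steps"] = config.get("gradient_accumulation_steps", 1) * factor
    return updated


example_config = {"per_device_train_batch_size": 4, "gradient_accumulation_steps": 1}
smaller = shrink_batch_for_oom(example_config, 2)
print(smaller)
```

After changing the values, write the config back to the JSON file (re-running the configuration cell does this) before launching training again.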

In [None]:
import os

# IMPORTANT: Verify we're in the correct directory
repo_dir = "/content/my-vits-finetuner-karakalpak"

if not os.path.exists(repo_dir):
    print("❌ ERROR: Repository directory not found!")
    print(f"   Looking for: {repo_dir}")
    print("\n⚠️ Please run the repository download cells (Section 3) first!")
    raise Exception("Repository not downloaded")

# Change to repository directory
os.chdir(repo_dir)
print(f"📂 Current directory: {os.getcwd()}")

# Verify the training script exists
training_script = "run_vits_finetuning.py"
if not os.path.exists(training_script):
    print("❌ ERROR: Training script not found!")
    print(f"   Looking for: {training_script}")
    print(f"   In directory: {os.getcwd()}")
    print("\n📋 Files in current directory:")
    print(os.listdir('.'))
    raise Exception("Training script not found")

print(f"✅ Found training script: {training_script}")
print(f"✅ Config file: {config_path}")
print("\n🚀 Starting training...\n")

# Run training using accelerate
!accelerate launch run_vits_finetuning.py {config_path}

### 🔑 IMPORTANT: Set Up HuggingFace Token for Private Dataset

**If your dataset is PRIVATE**, you must set up your HuggingFace token in Colab Secrets:

1. Click the **🔑 key icon** in the left sidebar (Secrets)
2. Click **"+ Add new secret"**
3. Name: **`HF_TOKEN`**
4. Value: Your HuggingFace token from https://huggingface.co/settings/tokens
5. Enable **"Notebook access"** toggle

**If your dataset is PUBLIC**, you can skip this step.

In [None]:
# Optional: Clear dataset cache (run this if you get cache errors)
import shutil
import os
import datasets

print("🗑️ Clearing HuggingFace dataset cache...")

cache_dir = "/root/.cache/huggingface/datasets"
if os.path.exists(cache_dir):
    shutil.rmtree(cache_dir)
    print("✅ Cache cleared - dataset will be downloaded fresh")
else:
    print("ℹ️ No cache found")

# Disable caching of processed datasets in this notebook process
# (this setting does not carry over to the separate training process launched below)
datasets.disable_caching()
print("✅ Caching of processed datasets disabled for this session")

import os
from google.colab import userdata

# IMPORTANT: Verify we're in the correct directory
repo_dir = "/content/my-vits-finetuner-karakalpak"

if not os.path.exists(repo_dir):
    print("❌ ERROR: Repository directory not found!")
    print(f"   Looking for: {repo_dir}")
    print("\n⚠️ Please run the repository download cells (Section 3) first!")
    raise Exception("Repository not downloaded")

# Change to repository directory
os.chdir(repo_dir)
print(f"📂 Current directory: {os.getcwd()}")

# Verify the training script exists
training_script = "run_vits_finetuning.py"
if not os.path.exists(training_script):
    print("❌ ERROR: Training script not found!")
    print(f"   Looking for: {training_script}")
    print(f"   In directory: {os.getcwd()}")
    print("\n📋 Files in current directory:")
    print(os.listdir('.'))
    raise Exception("Training script not found")

print(f"✅ Found training script: {training_script}")
print(f"✅ Config file: {config_path}")

# Get HuggingFace token from Colab secrets for private dataset access
try:
    hf_token = userdata.get('HF_TOKEN')
    os.environ['HF_TOKEN'] = hf_token
    os.environ['HUGGING_FACE_HUB_TOKEN'] = hf_token
    print("✅ HuggingFace token loaded from secrets")
except Exception as e:
    print(f"⚠️ Warning: Could not load HF_TOKEN from secrets: {e}")
    print("   If your dataset is private, this will fail!")

print("\n🚀 Starting training...\n")

# Run training using accelerate with token in environment
!accelerate launch run_vits_finetuning.py {config_path}

## 9️⃣ Inference & Testing

In [None]:
from transformers import VitsModel, AutoTokenizer
import torch
import scipy.io.wavfile
from IPython.display import Audio, display

model_path = training_config['output_dir']

print(f"📥 Loading model from: {model_path}")
model = VitsModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

print(f"✅ Model loaded on {device}!")

In [None]:
# Test with Karakalpak text
test_texts = [
    "Sálem, qalaysız?",
    "Múgálim jáqsı.",
    "Men oqıp atırman.",
]

print("🎤 Generating speech...\n")

for i, text in enumerate(test_texts, 1):
    print(f"{'='*60}")
    print(f"Sample {i}: {text}")
    print('='*60)
    
    inputs = tokenizer(text, return_tensors="pt").to(device)
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    waveform = outputs.waveform[0].cpu().numpy()
    
    output_file = f"output_{i}.wav"
    scipy.io.wavfile.write(
        output_file,
        rate=model.config.sampling_rate,
        data=waveform
    )
    
    print(f"💾 Saved: {output_file}")
    display(Audio(waveform, rate=model.config.sampling_rate))
    print()

print("✅ All samples generated!")
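
`scipy.io.wavfile.write` stores a float32 array as a 32-bit floating-point WAV, which some players may not open. If that turns out to be a problem, a possible workaround is converting to 16-bit PCM. The sketch below uses only the standard library and accepts any sequence of floats in [-1, 1] (for the cell above, `waveform.tolist()`); `write_pcm16_wav` is a hypothetical helper, and the 16000 Hz default is an assumption, so pass `model.config.sampling_rate` in practice.

```python
import struct
import wave


def write_pcm16_wav(path: str, samples, sample_rate: int = 16000) -> int:
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV; returns the frame count."""
    frames = bytearray()
    for value in samples:
        # Clip to the valid range before scaling to the signed 16-bit range.
        clipped = max(-1.0, min(1.0, float(value)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16 bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return len(frames) // 2


# Example usage inside the loop above (names from the notebook cell):
# write_pcm16_wav(f"output_{i}_pcm16.wav", waveform.tolist(), model.config.sampling_rate)
```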

## 🔟 Custom Text Generation

In [None]:
# Enter your own text here!
custom_text = "Sálem, bul men!"

print(f"📝 Generating: {custom_text}\n")

inputs = tokenizer(custom_text, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

waveform = outputs.waveform[0].cpu().numpy()
output_file = "custom_output.wav"
scipy.io.wavfile.write(output_file, rate=model.config.sampling_rate, data=waveform)

print(f"✅ Saved: {output_file}\n")
display(Audio(waveform, rate=model.config.sampling_rate))

## 1️⃣1️⃣ Push to HuggingFace Hub (Optional)

In [None]:
push_to_hub = False  # Set to True to push

if push_to_hub:
    hub_model_id = training_config['hub_model_id']
    
    if "your-username" not in hub_model_id:
        print(f"📤 Pushing to: {hub_model_id}")
        model.push_to_hub(hub_model_id)
        tokenizer.push_to_hub(hub_model_id)
        print(f"\n✅ Pushed to: https://huggingface.co/{hub_model_id}")
    else:
        print("⚠️ Update hub_model_id first!")
else:
    print("ℹ️ Skipping push (set push_to_hub=True to enable)")

## 1️⃣2️⃣ Download Model Files

In [None]:
import os
import shutil

output_dir = training_config['output_dir']
zip_filename = "mms-tts-kaa-finetuned"

print("📦 Creating ZIP archive...")
shutil.make_archive(zip_filename, 'zip', output_dir)

print(f"\n✅ Archived: {zip_filename}.zip")
print("📥 Download from Files panel on the left")
print(f"   Size: {os.path.getsize(f'{zip_filename}.zip') / (1024*1024):.2f} MB")

print("\n📂 Model files:")
for file in os.listdir(output_dir):
    file_path = os.path.join(output_dir, file)
    if os.path.isfile(file_path):
        size = os.path.getsize(file_path) / (1024*1024)
        print(f"  - {file}: {size:.2f} MB")

## 🎉 Congratulations!

### You've successfully trained a Karakalpak TTS model!

**What you accomplished:**
- ✅ Set up the environment
- ✅ Downloaded the code (as a ZIP archive, without Git credentials)
- ✅ Loaded the dataset
- ✅ Fine-tuned the model
- ✅ Generated speech samples
- ✅ Saved the model

**Next steps:**
1. Experiment with more epochs (200-300)
2. Try different hyperparameters
3. Add more training data
4. Deploy your model
5. Share on HuggingFace Hub

**Resources:**
- [VITS Paper](https://arxiv.org/abs/2106.06103)
- [MMS Paper](https://arxiv.org/abs/2305.13516)
- [HF VITS Docs](https://huggingface.co/docs/transformers/model_doc/vits)

## 🔧 Troubleshooting

**1. Out of Memory (OOM):**
- Reduce batch size to 2 or 1
- Enable gradient_checkpointing
- Ensure GPU is enabled

**2. Dataset loading fails:**
- Fix metadata.csv to include `file_name` column
- Or use manual upload method
- Check dataset exists on HuggingFace

**3. Training is slow:**
- Confirm GPU is enabled
- Check fp16=True
- Verify monotonic_align built correctly

**4. Repository download fails:**
- Check internet connection
- Verify branch name (main vs master)
- Check repository exists and is accessible

**5. Build errors:**
- Ensure Cython is installed
- Check numpy is available
- Try restarting runtime