# 🎙️ XTTS Dataset Creator for Google Colab

Create professional voice datasets for XTTS training with a full-featured Gradio interface.

## 🚀 Features:
- **Multiple Input Sources:** YouTube, file upload, microphone
- **Automatic Transcription:** Faster Whisper with GPU support
- **Audio Segmentation:** Advanced VAD and quality filtering
- **Export Formats:** CSV, JSON, LJSpeech, metadata.txt
- **Statistics Dashboard:** Real-time dataset analytics
- **Google Drive Integration:** Auto-save all projects

---

## 📦 Step 1: Install Dependencies

In [None]:
%%capture

# Install PyTorch with CUDA
!pip install torch==2.1.2+cu118 torchaudio==2.1.2+cu118 --index-url https://download.pytorch.org/whl/cu118

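# Install core dependencies (version specifiers are quoted so the shell
# does not treat ">" as an output redirect)
!pip install "gradio>=4.44.0" "faster-whisper>=1.0.0" librosa soundfile pandas numpy yt-dlp pydub

# Note: %%capture hides this cell's output, including the message below
print("✅ Dependencies installed!")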
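
To confirm that the install step took effect, a check like the following can report which packages are present. This is a minimal sketch using only the standard library; the helper name `installed_version` is illustrative and not part of the dataset creator. Run it in a separate cell, since `%%capture` hides output from the install cell.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# A few of the packages installed in the cell above
for name in ["torch", "torchaudio", "gradio", "faster-whisper", "yt-dlp"]:
    found = installed_version(name)
    print(f"{name}: {found if found else 'NOT INSTALLED'}")
```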

## 💾 Step 2: Mount Google Drive

In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Create workspace
workspace = '/content/drive/MyDrive/XTTS_Datasets'
os.makedirs(workspace, exist_ok=True)

print("\n✅ Google Drive mounted!")
print(f"📍 Workspace: {workspace}")
print("💾 All datasets will be saved here automatically!")

## 🔽 Step 3: Clone Dataset Creator

In [None]:
import os
from pathlib import Path

# Clone repository
if not Path("xtts-finetune-webui-fresh").exists():
    print("🔽 Cloning repository...")
    !git clone https://github.com/Diakonrobel/Amharic_XTTS-V2_TTS.git xtts-finetune-webui-fresh
    print("✅ Repository cloned!")
else:
    print("📂 Repository exists, pulling updates...")
    !cd xtts-finetune-webui-fresh && git pull

%cd xtts-finetune-webui-fresh/dataset_creator

print("\n✅ Dataset creator ready!")
print(f"📍 Current directory: {os.getcwd()}")

## 🚀 Step 4: Launch Dataset Creator UI

In [None]:
import sys
sys.path.append('..')

# Import and launch
from app import create_interface

print("🎨 Launching Dataset Creator...\n")
print("📊 Features Available:")
print("   ✅ YouTube video processing")
print("   ✅ Audio file upload")
print("   ✅ Microphone recording")
print("   ✅ Automatic transcription")
print("   ✅ Quality filtering")
print("   ✅ Multiple export formats")
print("   ✅ Real-time statistics")
print("\n💡 All projects saved to Google Drive automatically!\n")

# Create and launch interface
demo = create_interface()
demo.launch(share=True, debug=True)

---

## 📚 Quick Guide

### 1️⃣ Create a Project
1. Go to **Project Setup** tab
2. Enter project name, language, and speaker name
3. Click **Create New Project**

### 2️⃣ Add Data (Choose One):

**🎬 YouTube:**
- Paste YouTube URL
- Adjust min/max duration sliders
- Click **Process YouTube Video**

**📁 File Upload:**
- Upload audio files (WAV, MP3, FLAC)
- Configure segmentation options
- Click **Process Audio Files**

**🎤 Recording:**
- Click microphone to record
- Optionally add manual transcription
- Click **Add Recording**

### 3️⃣ Review & Export
- Check **Dataset Overview** for statistics
- Select export format (CSV, JSON, metadata.txt, or LJSpeech)
- Click **Export Dataset**
- Download the file

---

## ⚙️ Processing Options

### Segment Duration
- **Min Duration (1-5s):** Filter out very short segments
- **Max Duration (5-30s):** Split long segments
- **Recommended:** 1-15 seconds
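
The effect of the two sliders can be sketched as follows. This is an illustration only, not the dataset creator's own code: it assumes segments are `(start, end)` pairs in seconds and splits long segments into equal pieces, whereas a real tool would more likely cut at pauses. The function name `apply_duration_limits` is hypothetical.

```python
import math


def apply_duration_limits(segments, min_s=1.0, max_s=15.0):
    """Drop segments shorter than min_s and split those longer than max_s.

    segments: list of (start, end) times in seconds.
    Long segments are cut into equal pieces no longer than max_s.
    """
    result = []
    for start, end in segments:
        length = end - start
        if length < min_s:
            continue  # too short: filtered out
        if length <= max_s:
            result.append((start, end))
            continue
        # Too long: split into the fewest equal pieces that fit under max_s
        pieces = math.ceil(length / max_s)
        step = length / pieces
        for i in range(pieces):
            result.append((start + i * step, start + (i + 1) * step))
    return result


segments = [(0.0, 0.5), (1.0, 9.0), (10.0, 40.0)]
limited = apply_duration_limits(segments)
print(limited)  # [(1.0, 9.0), (10.0, 25.0), (25.0, 40.0)]
```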

### Quality Threshold (0-1)
- **0.3-0.5:** Accept most segments (quantity)
- **0.6-0.7:** Balanced quality/quantity ✅
- **0.8-1.0:** Strict filtering (quality)
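
To see the trade-off, the sketch below counts how many segments survive at three thresholds. The scores are made-up values, and the rule "keep a segment when its score is at or above the threshold" is an assumption; how the tool computes its quality score is not documented here.

```python
def count_kept(scores, thresholds=(0.4, 0.65, 0.9)):
    """Return {threshold: number of segments with score >= threshold}."""
    return {t: sum(1 for s in scores if s >= t) for t in thresholds}


# Hypothetical per-segment quality scores in the range 0-1
scores = [0.35, 0.55, 0.62, 0.71, 0.88, 0.93]
kept = count_kept(scores)
print(kept)  # {0.4: 5, 0.65: 3, 0.9: 1}
```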

---

## 💡 Best Practices

### Audio Quality
- ✅ Clear voice, minimal background noise
- ✅ Consistent volume levels
- ✅ Sample rate: 22050 Hz or higher
- ❌ Avoid music, multiple speakers, echoes
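
The sample-rate guideline can be checked before uploading. The following is a minimal sketch for uncompressed PCM WAV files using Python's standard `wave` module; the helper names are illustrative, and MP3 or FLAC files would need a library such as `soundfile` instead.

```python
import os
import tempfile
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def meets_sample_rate(path, min_rate=22050):
    """True if the file's sample rate is at least min_rate."""
    rate, _, _ = wav_info(path)
    return rate >= min_rate


# Demo: write one second of 16-bit mono silence at 22050 Hz, then inspect it
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)  # 2 bytes per sample = 16-bit
    out.setframerate(22050)
    out.writeframes(b"\x00\x00" * 22050)

print(wav_info(demo_path))           # (22050, 1, 1.0)
print(meets_sample_rate(demo_path))  # True
```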

### Dataset Size
- **Testing:** 5-10 minutes
- **Good Quality:** 30-60 minutes
- **Excellent Quality:** 2-4 hours
- **Professional:** 10+ hours
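
As a rough aid, the tiers above can be turned into a lookup on total audio duration. This sketch assumes the lower bound of each range is the cutoff, so a total that falls between two ranges is reported as the lower tier; the function name and the "Too small" label are illustrative.

```python
def dataset_size_tier(total_seconds):
    """Map total audio duration to the size tiers listed above."""
    minutes = total_seconds / 60
    if minutes >= 600:  # 10+ hours
        return "Professional"
    if minutes >= 120:  # 2 hours and up
        return "Excellent Quality"
    if minutes >= 30:
        return "Good Quality"
    if minutes >= 5:
        return "Testing"
    return "Too small"


print(dataset_size_tier(45 * 60))  # Good Quality
```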

### Language Codes
- English: `en`
- Spanish: `es`
- French: `fr`
- German: `de`
- Amharic: `am` or `amh`
- Full list available in the UI

---

## 🔧 Troubleshooting

**YouTube Download Fails:**
```bash
# Update yt-dlp
!pip install -U yt-dlp
```
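
If updating does not help, testing a download outside the UI can show whether the problem lies with yt-dlp or with the video itself. The sketch below uses yt-dlp's Python API to fetch the audio track and convert it to WAV. It is not necessarily how the dataset creator downloads videos; the helper names and the `downloads` folder are assumptions, and the conversion step requires `ffmpeg` to be available.

```python
def build_ydl_opts(out_dir):
    """Options for yt-dlp: best audio stream, converted to WAV via ffmpeg."""
    return {
        "format": "bestaudio/best",
        "outtmpl": f"{out_dir}/%(id)s.%(ext)s",  # name files by video ID
        "noplaylist": True,                      # single video only
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": "wav"},
        ],
    }


def download_audio(url, out_dir="downloads"):
    """Download one video's audio track into out_dir."""
    import yt_dlp  # imported here so the helpers above work without it

    with yt_dlp.YoutubeDL(build_ydl_opts(out_dir)) as ydl:
        ydl.download([url])


# Usage (needs network access):
# download_audio("<YouTube URL>")
```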

**Low Quality Segments:**
- Lower the quality threshold to 0.5-0.6 if too many segments are being rejected
- Check source audio quality
- Adjust min/max duration

**Transcription Errors:**
- Verify correct language selected
- Ensure audio is clear
- Try shorter segments
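
To check whether the language setting is the cause, a single file can be transcribed directly with Faster Whisper using an explicit language code. This is a sketch under assumptions: the `small` model size, `float16` compute type, and the function name are choices made for illustration, a CUDA GPU is assumed, and the dataset creator may configure Whisper differently.

```python
def transcribe_with_language(audio_path, language="en", model_size="small"):
    """Transcribe one file with an explicit language code.

    Returns a list of (start_seconds, end_seconds, text) tuples.
    Assumes faster-whisper is installed and a CUDA GPU is available.
    """
    from faster_whisper import WhisperModel  # heavy import, done lazily

    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    # vad_filter skips long silences before transcription
    segments, _info = model.transcribe(
        audio_path, language=language, vad_filter=True
    )
    # segments is a generator; the work happens as it is consumed
    return [(seg.start, seg.end, seg.text.strip()) for seg in segments]


# Usage (hypothetical file name):
# for start, end, text in transcribe_with_language("sample.wav", language="am"):
#     print(f"[{start:.2f} - {end:.2f}] {text}")
```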

**Out of Memory:**
- Process fewer files at once
- Use shorter audio segments
- Restart runtime
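
"Process fewer files at once" amounts to working through the file list in small batches. A minimal sketch, with hypothetical file names and a placeholder where the actual processing would go:

```python
def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


files = ["a.wav", "b.wav", "c.wav", "d.wav", "e.wav"]  # hypothetical names
grouped = list(batches(files, 2))
for group in grouped:
    # Placeholder: upload or process this group before moving to the next
    print(group)
```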

---

## 📖 Export Formats

### CSV
- Pandas-compatible format
- Columns: audio_file, text, speaker_name, duration
- Use for data analysis
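
As an example of such analysis, the sketch below sums duration per speaker with the standard `csv` module. It assumes the export has a header row with the four column names listed above, is comma-delimited, and stores `duration` in seconds; the sample rows are invented.

```python
import csv
import io


def total_duration_by_speaker(csv_file):
    """Sum the duration column per speaker_name from an open CSV file."""
    totals = {}
    for row in csv.DictReader(csv_file):
        speaker = row["speaker_name"]
        totals[speaker] = totals.get(speaker, 0.0) + float(row["duration"])
    return totals


# Invented sample standing in for an exported file
sample = io.StringIO(
    "audio_file,text,speaker_name,duration\n"
    "wavs/seg_0001.wav,Hello there.,alice,2.5\n"
    'wavs/seg_0002.wav,"Good morning, everyone.",alice,3.0\n'
)
totals = total_duration_by_speaker(sample)
print(totals)  # {'alice': 5.5}
```

With a real export, pass `open("dataset.csv", newline="", encoding="utf-8")` in place of the sample (the file name here is a placeholder).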

### JSON
- Structured data format
- Easy to parse programmatically
- Includes all metadata

### metadata.txt
- LJSpeech-style format
- Format: `filename|text`
- Compatible with TTS trainers
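
The `filename|text` layout is simple enough to write and read by hand. The sketch below is illustrative rather than the exporter's actual code: it replaces any `|` or line break in the transcript with a space so each record stays on one line, and whether file names carry an extension is left as in the input.

```python
def to_metadata_lines(rows):
    """Format (filename, text) pairs as 'filename|text' lines."""
    lines = []
    for filename, text in rows:
        # A literal "|" or newline in the text would corrupt the record
        clean = " ".join(text.replace("|", " ").split())
        lines.append(f"{filename}|{clean}")
    return lines


def parse_metadata_line(line):
    """Split one 'filename|text' line back into its two fields."""
    filename, text = line.rstrip("\n").split("|", 1)
    return filename, text


lines = to_metadata_lines([("seg_0001", "Hello |  world\n")])
print(lines)                          # ['seg_0001|Hello world']
print(parse_metadata_line(lines[0]))  # ('seg_0001', 'Hello world')
```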

### LJSpeech
- Complete LJSpeech dataset
- Includes audio files + metadata
- Ready for training
- Exported as ZIP archive

---

## 🎉 Credits

- **XTTS v2:** [Coqui AI](https://github.com/coqui-ai/TTS)
- **Faster Whisper:** [systran/faster-whisper](https://github.com/systran/faster-whisper)
- **Gradio:** [gradio-app/gradio](https://github.com/gradio-app/gradio)
- **yt-dlp:** [yt-dlp/yt-dlp](https://github.com/yt-dlp/yt-dlp)

---

**⭐ Star the repo:** https://github.com/Diakonrobel/Amharic_XTTS-V2_TTS

**💾 All datasets saved to:** `/content/drive/MyDrive/XTTS_Datasets/`

**Status:** ✅ Ready for Production
