# AutoVoice Voice Cloning Demo

Interactive tutorial for creating voice profiles using AutoVoice.

This notebook demonstrates:
1. **Providing voice samples** - Upload files, record from microphone, or use existing files
2. **Creating voice profiles** - Generate speaker embeddings from audio
3. **Analyzing voice characteristics** - Extract pitch range and quality metrics
4. **Creating multi-sample profiles** - Combine multiple recordings for robustness
5. **Computing speaker similarity** - Compare different voice samples

**Requirements**:
- Core: `torch`, `librosa`, `numpy`, `matplotlib`, `scikit-learn` (used for the similarity comparison in section 9)
- Optional (for upload): `ipywidgets` - `pip install ipywidgets`
- Optional (for recording): `sounddevice`, `soundfile` - `pip install sounddevice soundfile`
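
Before running the notebook, it can help to check which optional features are usable in the current environment. The following is a minimal sketch using only the standard library; the `OPTIONAL_FEATURES` mapping and the `available_features` helper are illustrative names, not part of AutoVoice.

```python
from importlib.util import find_spec

# Map each optional notebook feature to the top-level modules it imports.
OPTIONAL_FEATURES = {
    "upload": ["ipywidgets"],
    "recording": ["sounddevice", "soundfile"],
}


def available_features(features=OPTIONAL_FEATURES):
    """Return {feature: bool}, True when every module for that feature is importable."""
    return {
        name: all(find_spec(module) is not None for module in modules)
        for name, modules in features.items()
    }


for feature, ok in available_features().items():
    print(f"{feature}: {'available' if ok else 'missing'}")
```

`find_spec` only looks the module up and does not import it, so the check is cheap and has no side effects such as opening an audio device.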

## Setup

Import the required libraries and select the compute device. The `VoiceCloner` is initialized after the optional data setup below.

In [None]:
# Import dependencies
import sys
sys.path.append('../src')  # Make the src/ layout importable when auto_voice is not installed

import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display
from IPython.display import Audio, display
import torch

from auto_voice.inference import VoiceCloner
from auto_voice.audio.pitch_extractor import SingingPitchExtractor

# Check GPU availability
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

if device == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")

## 0. Data Setup (Optional)

This cell creates data directories for storing audio samples.

**Not required if you plan to**:
- Upload files directly (uses `./tmp/` automatically)
- Record from microphone (uses `./tmp/` automatically)

**Useful if you want to**:
- Store sample files for reuse across notebooks
- Organize multiple voice samples for multi-sample profiles

In [None]:
# Create data directories
from pathlib import Path
import os

# Define data directories
data_dir = Path('../data')
voice_samples_dir = data_dir / 'voice_samples'
songs_dir = data_dir / 'songs'

# Create directories
directories_created = []
for dir_path in [voice_samples_dir, songs_dir]:
    if not dir_path.exists():
        dir_path.mkdir(parents=True, exist_ok=True)
        directories_created.append(str(dir_path))
        print(f"✅ Created: {dir_path}")
    else:
        print(f"ℹ️  Already exists: {dir_path}")

# Check for existing samples
audio_patterns = ('*.wav', '*.mp3', '*.flac')  # Keep in sync with the formats listed below
voice_samples = [p for pattern in audio_patterns for p in voice_samples_dir.glob(pattern)]
song_samples = [p for pattern in audio_patterns for p in songs_dir.glob(pattern)]

print(f"\n📊 Status:")
print(f"   Voice samples found: {len(voice_samples)}")
if voice_samples:
    for sample in voice_samples[:5]:  # Show first 5
        print(f"      - {sample.name}")
    if len(voice_samples) > 5:
        print(f"      ... and {len(voice_samples) - 5} more")

print(f"   Song samples found: {len(song_samples)}")
if song_samples:
    for sample in song_samples[:5]:  # Show first 5
        print(f"      - {sample.name}")
    if len(song_samples) > 5:
        print(f"      ... and {len(song_samples) - 5} more")

if not voice_samples:
    print(f"\nℹ️  To add voice samples:")
    print(f"   1. Place audio files (WAV, MP3, FLAC) in: {voice_samples_dir}")
    print(f"   2. OR use the upload/recording options in later cells")

print(f"\n✅ Data directories ready!")

## 1. Initialize VoiceCloner

Create the `VoiceCloner` on the device selected during setup.

In [None]:
# Initialize VoiceCloner
cloner = VoiceCloner(device=device)

print("VoiceCloner initialized successfully")
print(f"Speaker encoder model: {cloner.speaker_encoder.__class__.__name__}")
print(f"Embedding dimensions: {cloner.speaker_encoder.embedding_size}")

## 2. Provide Voice Sample

To create a voice profile, you need to provide a voice sample (30-60 seconds recommended).

**IMPORTANT: Follow these steps in order:**

1. **Run the input setup cell first** to choose your audio source:
   - Upload a file using the widget
   - Record from microphone with `record_audio()`
   - Set `audio_path` manually for existing files

2. **Verify `audio_path` is set** using the status check cell

3. **Run the load/validate cell** to load and analyze the audio

The notebook is designed for top-to-bottom execution. Do not skip the setup cell!
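
The 30-60 second recommendation can be checked before any heavy loading. The sketch below reads the duration of a WAV file from its header with the standard-library `wave` module; `wav_duration_seconds` and `duration_verdict` are hypothetical helpers, and the approach assumes an uncompressed PCM WAV file (other formats need `librosa` or `soundfile`).

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds, read from its header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def duration_verdict(seconds, minimum=30.0, maximum=60.0):
    """Classify a duration against the recommended 30-60 second window."""
    if seconds < minimum:
        return "short"
    if seconds > maximum:
        return "long"
    return "ok"


# Example usage (the path is a placeholder for your own recording):
# seconds = wav_duration_seconds('./tmp/recorded_voice.wav')
# print(f"{seconds:.1f}s -> {duration_verdict(seconds)}")
```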

### 2a. Voice Input Options

Choose one of the following methods to provide your voice sample:

**Option 1: Upload Audio File** (Recommended)
- Upload a pre-recorded voice sample (30-60 seconds)
- Supports WAV, MP3, FLAC, OGG formats
- Best quality when recorded in a quiet environment

**Option 2: Record from Microphone**
- Record directly in the notebook (requires microphone access)
- Use `record_audio(duration=45)` function to start recording
- Automatically saves to `./tmp/recorded_voice.wav`

**Option 3: Use Existing File**
- Specify path to existing audio file
- Useful if you have files in `../data/voice_samples/`

**Run the next cell to set up your audio input.**

In [None]:
# Audio Input Setup
import os
from pathlib import Path

# Create temp directory for uploads/recordings
Path('./tmp').mkdir(exist_ok=True)

# Initialize audio_path as None
audio_path = None

print("Voice Input Options:")
print("=" * 60)

# Option 1: File Upload Widget
try:
    import ipywidgets as widgets
    from IPython.display import display, clear_output
    
    uploader = widgets.FileUpload(
        accept='.wav,.mp3,.flac,.ogg',
        multiple=False,
        description='Upload Audio'
    )
    
    upload_status = widgets.Output()
    
    def on_upload_change(change):
        global audio_path
        with upload_status:
            clear_output()
            if uploader.value:
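                # Get uploaded file (ipywidgets 7 exposes a dict, ipywidgets 8 a tuple)
                if isinstance(uploader.value, dict):
                    uploaded_file = list(uploader.value.values())[0]
                    filename = uploaded_file['metadata']['name']
                else:
                    uploaded_file = uploader.value[0]
                    filename = uploaded_file['name']
                # Keep only the base name so the file always lands in ./tmp/
                audio_path = f'./tmp/{os.path.basename(filename)}'
                
                # Save file
                with open(audio_path, 'wb') as f:
                    f.write(uploaded_file['content'])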
                
                # Get file size
                file_size_mb = len(uploaded_file['content']) / (1024 * 1024)
                
                print(f"✅ File uploaded: {filename}")
                print(f"   Size: {file_size_mb:.2f} MB")
                print(f"   Path: {audio_path}")
                print("\n   Run the status check cell below to verify.")
    
    uploader.observe(on_upload_change, names='value')
    
    print("📁 Option 1: Upload Audio File")
    display(uploader)
    display(upload_status)
    
    UPLOAD_AVAILABLE = True
except ImportError:
    print("⚠️  Option 1: Upload not available (install ipywidgets)")
    print("   pip install ipywidgets")
    UPLOAD_AVAILABLE = False

print()

# Option 2: Microphone Recording
try:
    import sounddevice as sd
    import soundfile as sf
    import numpy as np
    
    def record_audio(duration=45, sample_rate=44100):
        """
        Record audio from microphone.
        
        Args:
            duration: Recording duration in seconds (default: 45)
            sample_rate: Sample rate in Hz (default: 44100)
        
        Returns:
            Path to saved audio file
        """
        global audio_path
        
        print(f"🎤 Recording {duration} seconds...")
        print("   Speak clearly into your microphone...")
        print("   3... 2... 1... Recording!")
        
        # Record audio
        audio_data = sd.rec(
            int(duration * sample_rate), 
            samplerate=sample_rate, 
            channels=1,
            dtype='float32'
        )
        sd.wait()  # Wait for recording to complete
        
        # Save to file
        audio_path = './tmp/recorded_voice.wav'
        sf.write(audio_path, audio_data, sample_rate)
        
        # Calculate duration and size
        duration_actual = len(audio_data) / sample_rate
        file_size_mb = os.path.getsize(audio_path) / (1024 * 1024)
        
        print(f"\n✅ Recording complete!")
        print(f"   Duration: {duration_actual:.1f}s")
        print(f"   Size: {file_size_mb:.2f} MB")
        print(f"   Path: {audio_path}")
        print("\n   Run the status check cell below to verify.")
        
        return audio_path
    
    print("🎙️  Option 2: Record from Microphone")
    print("   Usage: record_audio(duration=45)  # Record 45 seconds")
    print("   Example: record_audio(duration=60)  # Record 60 seconds")
    
    RECORDING_AVAILABLE = True
except ImportError:
    print("⚠️  Option 2: Recording not available (install sounddevice & soundfile)")
    print("   pip install sounddevice soundfile")
    RECORDING_AVAILABLE = False

print()

# Option 3: Use Existing File Path
print("📂 Option 3: Use Existing File")
print("   Set audio_path manually:")
print("   audio_path = '../data/voice_samples/my_voice.wav'")
print()

# Check for existing file as fallback
fallback_path = '../data/voice_samples/my_voice.wav'
if os.path.exists(fallback_path):
    print(f"ℹ️  Found existing file: {fallback_path}")
    print(f"   To use it, run: audio_path = '{fallback_path}'")
else:
    print(f"ℹ️  No default file found at: {fallback_path}")

print("=" * 60)

# Final status
if not UPLOAD_AVAILABLE and not RECORDING_AVAILABLE:
    print("\n⚠️  WARNING: No audio input methods available!")
    print("   Install dependencies: pip install ipywidgets sounddevice soundfile")
    print("   Or manually set: audio_path = 'path/to/your/audio.wav'")
elif audio_path is None:
    print("\n✅ Audio input ready!")
    print("   Upload a file above, call record_audio(), or set audio_path manually")
    print("   Then run the status check cell below.")
else:
    print(f"\n✅ Audio file ready: {audio_path}")

In [None]:
# Check audio_path status
import os

print("Audio Input Status Check:")
print("=" * 60)

if 'audio_path' not in globals():
    print("❌ audio_path is not defined")
    print("   Please run the input setup cell above first.")
elif audio_path is None:
    print("⚠️  audio_path is None")
    print("   Please upload a file, record audio, or set audio_path manually.")
elif not os.path.exists(audio_path):
    print(f"❌ File does not exist: {audio_path}")
    print("   Please check the path and try again.")
else:
    # File exists - show details
    file_size_mb = os.path.getsize(audio_path) / (1024 * 1024)
    print(f"✅ Audio file ready!")
    print(f"   Path: {audio_path}")
    print(f"   Size: {file_size_mb:.2f} MB")
    print(f"   Exists: Yes")
    print("\n   You can now run the load/validate cell below.")

print("=" * 60)

In [None]:
# Load and validate voice sample
import os

# Check audio_path is configured
if audio_path is None:
    print("❌ Error: No audio file configured")
    print("\n   Please run the 'Audio Input Setup' cell above and:")
    print("   • Upload an audio file using the widget, OR")
    print("   • Call record_audio(duration=45) to record from microphone, OR")
    print("   • Set audio_path = 'path/to/your/audio.wav' manually")
    print("\n   Then re-run this cell.")
    raise ValueError("audio_path not configured. Please configure audio input first.")

# Check file exists
if not os.path.exists(audio_path):
    print(f"❌ Error: Audio file not found: {audio_path}")
    print("\n   Please check:")
    print("   • The file path is correct")
    print("   • The file exists at the specified location")
    print("   • You have permission to access the file")
    raise FileNotFoundError(f"Audio file not found: {audio_path}")

# Validate file format
valid_extensions = ['.wav', '.mp3', '.flac', '.ogg', '.m4a']
file_ext = os.path.splitext(audio_path)[1].lower()
if file_ext not in valid_extensions:
    print(f"⚠️  Warning: Unusual audio format: {file_ext}")
    print(f"   Supported formats: {', '.join(valid_extensions)}")
    print(f"   The file may still work, but results may vary.")

# Load audio with librosa
try:
    print(f"Loading audio file: {audio_path}")
    audio, sample_rate = librosa.load(audio_path, sr=None)
    duration = len(audio) / sample_rate
    file_size_mb = os.path.getsize(audio_path) / (1024 * 1024)
    
    print("\n✅ Audio loaded successfully!")
    print(f"   File: {os.path.basename(audio_path)}")
    print(f"   Duration: {duration:.2f} seconds")
    print(f"   Sample rate: {sample_rate} Hz")
    print(f"   Audio shape: {audio.shape}")
    print(f"   File size: {file_size_mb:.2f} MB")
    
    # Quality warnings
    if duration < 30:
        print(f"\n⚠️  Warning: Audio is short ({duration:.1f}s)")
        print(f"   Recommendation: Use 30-60s audio for best quality")
        print(f"   Current length may reduce voice profile accuracy")
    elif duration > 90:
        print(f"\nℹ️  Audio is long ({duration:.1f}s)")
        print(f"   Processing may take longer than for a 30-60s sample")
    
    if sample_rate < 16000:
        print(f"\n⚠️  Warning: Low sample rate ({sample_rate} Hz)")
        print(f"   Recommendation: Use ≥16kHz for better quality")
    
    # Audio preview
    print("\n🔊 Audio Preview:")
    display(Audio(audio, rate=sample_rate))
    
except Exception as e:
    print(f"\n❌ Error loading audio file: {str(e)}")
    print(f"\n   Troubleshooting:")
    print(f"   • Ensure the file is a valid audio file")
    print(f"   • Try converting to WAV format")
    print(f"   • Check if librosa is installed correctly")
    raise

## 3. Visualize Audio

Visualize the waveform and spectrogram of your voice sample.

In [None]:
# Plot waveform
plt.figure(figsize=(14, 5))
plt.subplot(1, 2, 1)
librosa.display.waveshow(audio, sr=sample_rate)
plt.title('Waveform')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')

# Plot spectrogram
plt.subplot(1, 2, 2)
D = librosa.amplitude_to_db(np.abs(librosa.stft(audio)), ref=np.max)
librosa.display.specshow(D, sr=sample_rate, x_axis='time', y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.title('Spectrogram')
plt.tight_layout()
plt.show()

## 4. Analyze Voice Characteristics

Extract pitch range and quality metrics from the voice sample.
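
The note names and range reported in this section rest on the standard equal-temperament mapping from frequency to MIDI note number, `n = 69 + 12 * log2(f / 440)`, where A4 = 440 Hz is note 69. The sketch below shows that mapping and the width of a range in semitones; `hz_to_midi` and `semitone_span` are illustrative names, not AutoVoice functions.

```python
import math


def hz_to_midi(hz):
    """Convert a frequency in Hz to a (fractional) MIDI note number."""
    if hz <= 0:
        raise ValueError("frequency must be positive")
    return 69 + 12 * math.log2(hz / 440.0)


def semitone_span(low_hz, high_hz):
    """Return the distance between two frequencies in semitones."""
    if low_hz <= 0 or high_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 12 * math.log2(high_hz / low_hz)


print(hz_to_midi(440.0))            # 69.0 (A4)
print(semitone_span(220.0, 440.0))  # 12.0 (one octave)
```

Because the scale is logarithmic, a range expressed in semitones is easier to compare across voices than a range expressed in Hz.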

In [None]:
from auto_voice.audio.pitch_extractor import SingingPitchExtractor

# Extract pitch
pitch_extractor = SingingPitchExtractor(device=device)
f0, confidence = pitch_extractor.extract_f0(audio, sample_rate)

# Filter voiced frames
voiced_mask = (f0 > 0) & (confidence > 0.8)
f0_voiced = f0[voiced_mask]

# Compute statistics
f0_min = np.min(f0_voiced) if len(f0_voiced) > 0 else 0
f0_max = np.max(f0_voiced) if len(f0_voiced) > 0 else 0
f0_mean = np.mean(f0_voiced) if len(f0_voiced) > 0 else 0

# Convert to note names
def hz_to_note(hz):
    if hz == 0:
        return "N/A"
    notes = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
    semitones = 12 * np.log2(hz / 440.0) + 69
    note_idx = int(round(semitones)) % 12
    octave = int(round(semitones)) // 12 - 1
    return f"{notes[note_idx]}{octave}"

min_note = hz_to_note(f0_min)
max_note = hz_to_note(f0_max)

print(f"Vocal Range:")
print(f"  Min: {f0_min:.2f} Hz ({min_note})")
print(f"  Max: {f0_max:.2f} Hz ({max_note})")
print(f"  Mean: {f0_mean:.2f} Hz")
print(f"  Range: {min_note} - {max_note}")

## 5. Plot Pitch Contour

Visualize the pitch (F0) contour over time.
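
Placing F0 values on a time axis requires converting frame indices to seconds: frame `i` starts at `i * hop_length / sample_rate`. The sketch below illustrates that conversion. The hop length of 512 samples is an assumption; the hop length actually used by `SingingPitchExtractor` may differ and should be taken from its configuration when available.

```python
def frame_times(n_frames, hop_length, sample_rate):
    """Return the start time in seconds of each analysis frame."""
    if hop_length <= 0 or sample_rate <= 0:
        raise ValueError("hop_length and sample_rate must be positive")
    return [i * hop_length / sample_rate for i in range(n_frames)]


# With an assumed hop of 512 samples at 16 kHz, frames are 32 ms apart.
times = frame_times(n_frames=4, hop_length=512, sample_rate=16000)
print([round(t, 3) for t in times])
```

If the hop length is wrong, the contour is stretched or compressed along the time axis, so it is worth confirming that the last frame time is close to the audio duration.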

In [None]:
# Plot F0 contour
plt.figure(figsize=(14, 4))
times = np.arange(len(f0)) * (512 / sample_rate)  # Assuming hop_length=512
plt.plot(times, f0, linewidth=0.5, alpha=0.7)
plt.axhline(f0_mean, color='r', linestyle='--', label=f'Mean: {f0_mean:.1f} Hz')
plt.axhline(f0_min, color='g', linestyle='--', label=f'Min: {f0_min:.1f} Hz ({min_note})')
plt.axhline(f0_max, color='b', linestyle='--', label=f'Max: {f0_max:.1f} Hz ({max_note})')
plt.xlabel('Time (s)')
plt.ylabel('F0 (Hz)')
plt.title('Pitch Contour')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 6. Create Voice Profile

Create a voice profile from the loaded audio sample.

In [None]:
# Create voice profile
profile = cloner.create_voice_profile(
    audio=audio_path,
    user_id='demo_user',
    profile_name='My Voice Demo'
)

print("Voice profile created successfully!")
print(f"Profile ID: {profile['profile_id']}")
print(f"User ID: {profile['user_id']}")
print(f"Profile Name: {profile['profile_name']}")
print(f"\nAudio Info:")
print(f"  Duration: {profile['audio_info']['duration_seconds']:.2f} seconds")
print(f"  Sample Rate: {profile['audio_info']['sample_rate']} Hz")
print(f"\nVocal Range:")
print(f"  Min: {profile['vocal_range']['min_note']} ({profile['vocal_range']['min_pitch_hz']:.2f} Hz)")
print(f"  Max: {profile['vocal_range']['max_note']} ({profile['vocal_range']['max_pitch_hz']:.2f} Hz)")
print(f"  Range: {profile['vocal_range']['range_semitones']} semitones")
print(f"\nQuality Metrics:")
print(f"  SNR: {profile['quality_metrics']['snr_db']:.2f} dB")
print(f"  Quality Score: {profile['quality_metrics']['quality_score']:.2f}")

## 7. Extract Speaker Embedding

Extract and visualize the speaker embedding (vector representation of voice).

In [None]:
# Get speaker embedding
embedding = profile['embedding']

print(f"Speaker Embedding:")
print(f"  Shape: {embedding.shape}")
print(f"  Norm: {np.linalg.norm(embedding):.4f}")
print(f"  Mean: {np.mean(embedding):.4f}")
print(f"  Std: {np.std(embedding):.4f}")

# Visualize embedding
plt.figure(figsize=(14, 4))
plt.plot(embedding, linewidth=0.5)
plt.xlabel('Dimension')
plt.ylabel('Value')
plt.title(f'Speaker Embedding ({embedding.shape[-1]}-dimensional vector)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 8. Multi-Sample Profile (Optional)

Create a more robust profile from multiple voice samples.
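
One common way to combine several recordings is to L2-normalize each sample's embedding, average them, and normalize the result again, so that no single recording dominates. The sketch below illustrates that idea in plain Python; it is an assumption about the general technique, not a description of how `create_voice_profile_from_multiple_samples` is implemented.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def average_embeddings(embeddings):
    """Average unit-length embeddings and re-normalize the mean."""
    if not embeddings:
        raise ValueError("at least one embedding is required")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    unit = [l2_normalize(e) for e in embeddings]
    mean = [sum(column) / len(unit) for column in zip(*unit)]
    return l2_normalize(mean)


combined = average_embeddings([[1.0, 0.0], [0.0, 1.0]])
print([round(x, 4) for x in combined])
```

Normalizing before averaging means a loud or long recording does not contribute a larger vector than a quiet or short one.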

In [None]:
# Create profile from multiple samples
sample_paths = [
    '../data/voice_samples/sample1.wav',
    '../data/voice_samples/sample2.wav',
    '../data/voice_samples/sample3.wav'
]

# Check if files exist
import os
existing_samples = [path for path in sample_paths if os.path.exists(path)]

if len(existing_samples) > 1:
    multi_profile = cloner.create_voice_profile_from_multiple_samples(
        audio_paths=existing_samples,
        user_id='demo_user',
        profile_name='Multi-Sample Voice Profile'
    )

    print("Multi-sample profile created!")
    print(f"Profile ID: {multi_profile['profile_id']}")
    print(f"Number of samples: {len(existing_samples)}")
    print(f"Quality Score: {multi_profile['quality_metrics']['quality_score']:.2f}")
else:
    print("Skipping multi-sample profile (requires 2+ samples)")
    print(f"Place audio files in ../data/voice_samples/ to enable this feature")

## 9. Speaker Similarity Comparison

Compare similarity between different voice samples.
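
Cosine similarity measures the angle between two embedding vectors: the dot product divided by the product of their lengths, giving a value between -1 and 1. The sketch below computes it without external dependencies; `cosine_sim` is an illustrative helper name.

```python
import math


def cosine_sim(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_sim([1.0, 0.0], [0.0, 1.0]))   # 0.0 (orthogonal vectors)
print(cosine_sim([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite directions)
```

Because the measure ignores vector length, two embeddings that point the same way score close to 1 even if their magnitudes differ.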

In [None]:
# Load a second voice sample for comparison
comparison_path = '../data/voice_samples/other_voice.wav'  # Update with your file

if os.path.exists(comparison_path):
    # Create a temporary profile to obtain the comparison embedding
    comparison_profile = cloner.create_voice_profile(
        audio=comparison_path,
        user_id='demo_user',
        profile_name='Comparison Voice'
    )
    
    comparison_embedding = comparison_profile['embedding']
    
    # Compute cosine similarity
    from sklearn.metrics.pairwise import cosine_similarity
    
    similarity = cosine_similarity(
        embedding.reshape(1, -1),
        comparison_embedding.reshape(1, -1)
    )[0, 0]
    
    print(f"Speaker Similarity:")
    print(f"  Cosine Similarity: {similarity:.4f}")
    
    # Heuristic thresholds; suitable cut-offs depend on the speaker encoder
    if similarity > 0.85:
        print("  Result: Likely the same speaker (similarity > 0.85)")
    elif similarity > 0.70:
        print("  Result: Similar speakers (similarity 0.70-0.85)")
    else:
        print("  Result: Likely different speakers (similarity <= 0.70)")
    
    # Visualize embedding comparison
    plt.figure(figsize=(14, 4))
    plt.plot(embedding, label='Sample 1', linewidth=0.5, alpha=0.7)
    plt.plot(comparison_embedding, label='Sample 2', linewidth=0.5, alpha=0.7)
    plt.xlabel('Dimension')
    plt.ylabel('Value')
    plt.title(f'Speaker Embedding Comparison (Similarity: {similarity:.4f})')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
else:
    print("Skipping similarity comparison (no comparison file found)")
    print(f"Place another audio file at: {comparison_path}")

## 10. Quality Assessment

Assess the quality of your voice sample for voice cloning.
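
Signal-to-noise ratio in decibels is `10 * log10(P_signal / P_noise)`, where `P` is mean squared amplitude. The sketch below is a simplified illustration that compares a speech segment against a noise-only segment (for example, a second of silence before speaking). It is an assumption for illustration only; how AutoVoice computes `snr_db` is not shown in this notebook.

```python
import math


def mean_power(samples):
    """Return the mean squared amplitude of a sequence of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech_samples, noise_samples):
    """Estimate SNR in dB from a speech segment and a noise-only segment."""
    speech_power = mean_power(speech_samples)
    noise_power = mean_power(noise_samples)
    if noise_power == 0:
        return math.inf   # no measurable noise
    if speech_power == 0:
        return -math.inf  # no measurable signal
    return 10 * math.log10(speech_power / noise_power)


# Speech at amplitude 1.0 against noise at amplitude 0.1: a 100x power ratio.
print(round(snr_db([1.0, -1.0, 1.0, -1.0], [0.1, -0.1, 0.1, -0.1]), 2))
```

Since the speech segment also contains the background noise, this estimate slightly overstates the true ratio for noisy recordings.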

In [None]:
# Quality assessment
duration = len(audio) / sample_rate
snr = profile['quality_metrics']['snr_db']
quality_score = profile['quality_metrics']['quality_score']

print("Quality Assessment:")
print("="*50)

# Duration check
if duration >= 45:
    print(f"✅ Duration: {duration:.1f}s (Optimal: 45-60s)")
elif duration >= 30:
    print(f"⚠️  Duration: {duration:.1f}s (Acceptable: 30-45s)")
    print("   Recommendation: Record 45-60s for best quality")
else:
    print(f"❌ Duration: {duration:.1f}s (Too short: <30s)")
    print("   Recommendation: Record at least 30s, ideally 45-60s")

# SNR check
if snr >= 20:
    print(f"✅ SNR: {snr:.1f} dB (Excellent: >20 dB)")
elif snr >= 15:
    print(f"✅ SNR: {snr:.1f} dB (Good: 15-20 dB)")
elif snr >= 10:
    print(f"⚠️  SNR: {snr:.1f} dB (Fair: 10-15 dB)")
    print("   Recommendation: Record in quieter environment")
else:
    print(f"❌ SNR: {snr:.1f} dB (Poor: <10 dB)")
    print("   Recommendation: Reduce background noise significantly")

# Overall quality
if quality_score >= 0.8:
    print(f"✅ Overall Quality: {quality_score:.2f} (Excellent: >0.8)")
elif quality_score >= 0.6:
    print(f"✅ Overall Quality: {quality_score:.2f} (Good: 0.6-0.8)")
else:
    print(f"⚠️  Overall Quality: {quality_score:.2f} (Fair: <0.6)")
    print("   Recommendation: Improve recording quality")

print("="*50)

# Recommendations
print("\nRecommendations for Best Quality:")
print("  • Record 45-60 seconds of speech or singing")
print("  • Use a quiet environment with minimal background noise")
print("  • Stay 6-12 inches from the microphone")
print("  • Include variety: different pitches and dynamics")
print("  • Avoid clipping and distortion")
print("  • Use lossless formats (WAV, FLAC) when possible")

## Summary

In this notebook, you learned how to:

1. ✅ Load and visualize voice samples
2. ✅ Extract pitch characteristics and vocal range
3. ✅ Create voice profiles with AutoVoice
4. ✅ Analyze speaker embeddings
5. ✅ Create multi-sample profiles for robustness
6. ✅ Compare speaker similarity
7. ✅ Assess voice sample quality

### Next Steps

- **Song Conversion**: Use your voice profile to convert songs
  - See `song_conversion_demo.ipynb` for song conversion tutorial
- **API Usage**: Integrate voice cloning into your applications
  - See `docs/api_voice_conversion.md` for API documentation
- **Production Deployment**: Deploy AutoVoice for production use
  - See `docs/runbook.md` for operations guide

### Resources

- User Guide: `docs/voice_conversion_guide.md`
- API Reference: `docs/api_voice_conversion.md`
- Model Architecture: `docs/model_architecture.md`
- Demo Scripts: `examples/demo_voice_conversion.py`