# TensorTalk: Zero-Shot Voice Cloning Demo

This notebook demonstrates how to use TensorTalk for zero-shot voice cloning.

**Paper**: [TensorTalk: A Voice Cloning Framework for Text-to-Speech Synthesis Using WavLM Features and Rhythm Modeling](https://drive.google.com/file/d/1j9t6o2sKrWnu83dZkSFWDphGyeI9_NYr/view)

**GitHub**: [github.com/robrich07/TensorTalk](https://github.com/robrich07/TensorTalk)

## 1. Setup and Installation

First, let's import the necessary libraries. The pipeline itself is initialized in the next section.

In [None]:
import sys
sys.path.append('..')

import torch
import torchaudio
from IPython.display import Audio, display
import matplotlib.pyplot as plt
import numpy as np

from src import TensorTalkPipeline

## 2. Initialize Pipeline

Initialize the TensorTalk pipeline with default settings:
- **k=4**: Number of nearest neighbors for voice conversion
- **device**: Automatically selects CUDA if available

In [None]:
# Initialize the pipeline
pipeline = TensorTalkPipeline(k=4)

print(f"Using device: {pipeline.device}")
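
To make the role of `k` concrete, here is a minimal NumPy sketch of k-nearest-neighbor feature matching. It assumes cosine similarity and a plain average over the `k` closest target frames; `knn_convert` is a hypothetical helper written for illustration, and TensorTalk's own `KNNMatcher` may work differently.

```python
import numpy as np

def knn_convert(source, target, k=4):
    """Replace each source frame with the mean of its k nearest target frames.

    source: (n_src, dim) array of source-speech features
    target: (n_tgt, dim) array of target-speaker features
    Hypothetical illustration only; not the TensorTalk implementation.
    """
    # Normalize rows so that a dot product equals cosine similarity
    src_norm = np.maximum(np.linalg.norm(source, axis=1, keepdims=True), 1e-8)
    tgt_norm = np.maximum(np.linalg.norm(target, axis=1, keepdims=True), 1e-8)
    sim = (source / src_norm) @ (target / tgt_norm).T   # (n_src, n_tgt)

    # Indices of the k most similar target frames for each source frame
    k = min(k, target.shape[0])
    idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]   # (n_src, k)

    # Average the selected target frames
    return target[idx].mean(axis=1)                     # (n_src, dim)
```

In this sketch, a larger `k` averages over more target frames, which smooths the output, while `k=1` copies the single closest frame.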

## 3. Prepare Target Speaker Audio

For zero-shot voice cloning, you need reference audio from the target speaker.
In general, more reference audio gives the nearest-neighbor matching more target frames to choose from, which tends to improve quality.

**Recommendations:**
- Minimum: 3-5 seconds of clean speech
- Optimal: 10-30 seconds from multiple utterances
- Quality: Clear audio with minimal background noise

In [None]:
# Path to target speaker audio file(s)
# You can provide a single file or a list of files
target_audio_paths = [
    "../examples/target_speaker/sample1.wav",
    "../examples/target_speaker/sample2.wav",
    "../examples/target_speaker/sample3.wav",
]

# Or use a single file
# target_audio_paths = "../examples/target_speaker/sample.wav"
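
Before synthesizing, it can help to check the reference audio against the duration recommendations above. The sketch below uses only the standard-library `wave` module, so it assumes the files are PCM WAV; `total_duration_seconds` is a hypothetical helper, not part of TensorTalk.

```python
import wave

def total_duration_seconds(paths):
    """Sum the durations of PCM WAV files.

    Accepts a single path or a list of paths, mirroring target_audio_paths.
    Hypothetical helper; assumes files readable by the stdlib `wave` module.
    """
    if isinstance(paths, str):
        paths = [paths]
    total = 0.0
    for path in paths:
        with wave.open(path, "rb") as wav:
            # Duration = number of frames / frames per second
            total += wav.getnframes() / wav.getframerate()
    return total

# Example usage (uncomment once the reference files exist):
# seconds = total_duration_seconds(target_audio_paths)
# if seconds < 3.0:
#     print(f"Only {seconds:.1f}s of reference audio; consider adding more.")
```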

## 4. Text-to-Speech Synthesis

Now let's synthesize speech in the target speaker's voice!

In [None]:
# Text to synthesize
text = "Hello! This is a demonstration of zero-shot voice cloning using TensorTalk."

# Synthesize speech
# alpha controls voice conversion strength:
#   - alpha=1.0: Full target voice (default)
#   - alpha=0.5: Mix of source and target
#   - alpha=0.0: Original source voice (gTTS)

audio = pipeline.synthesize(
    text=text,
    target_audio_paths=target_audio_paths,
    alpha=1.0,  # Full voice conversion
    lang='en'   # Language code for gTTS
)

print("Synthesis complete!")
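
One plausible way to read `alpha` is as a linear interpolation between the source features and the converted (target-matched) features. The sketch below shows that interpretation only; it is an assumption, `blend_features` is a hypothetical name, and the pipeline may blend differently internally.

```python
import numpy as np

def blend_features(source, converted, alpha=1.0):
    """Linearly interpolate between source and converted features.

    alpha=0.0 returns the source features, alpha=1.0 the converted ones.
    Hypothetical illustration of the alpha parameter; not TensorTalk code.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return (1.0 - alpha) * np.asarray(source) + alpha * np.asarray(converted)
```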

## 5. Listen to the Result

Play the synthesized audio directly in the notebook:

In [None]:
# Display audio player
display(Audio(audio.cpu().numpy(), rate=16000))

## 6. Visualize the Waveform

Let's visualize the generated audio waveform:

In [None]:
# Plot waveform (squeeze drops a leading channel dimension, if there is one)
plt.figure(figsize=(14, 4))
plt.plot(audio.cpu().numpy().squeeze())
plt.title("Generated Speech Waveform")
plt.xlabel("Sample")
plt.ylabel("Amplitude")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
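
The plot above uses sample indices on the x-axis. To show seconds instead, the indices can be divided by the sample rate. A minimal sketch, assuming the 16 kHz rate used for playback in this notebook (`time_axis` is a hypothetical helper):

```python
import numpy as np

def time_axis(num_samples, sample_rate=16000):
    """Return the timestamp in seconds of each sample."""
    return np.arange(num_samples) / sample_rate

# Example usage with the waveform from this notebook:
# samples = audio.cpu().numpy().squeeze()
# plt.plot(time_axis(len(samples)), samples)
# plt.xlabel("Time (s)")
```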

## 7. Save the Audio

Save the synthesized audio to a file:

In [None]:
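import os

# Save audio to file, creating the output directory first if it is missing
output_path = "../examples/generated/output.wav"
os.makedirs(os.path.dirname(output_path), exist_ok=True)
pipeline.save_audio(audio, output_path, sample_rate=16000)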

print(f"Audio saved to: {output_path}")

## 8. Experiment with Voice Blending

Try different alpha values to blend between source and target voices:

In [None]:
# Test different alpha values
alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
text = "Testing voice blending with different alpha values."

results = []
for alpha in alphas:
    print(f"\nGenerating with alpha={alpha}...")
    audio = pipeline.synthesize(
        text=text,
        target_audio_paths=target_audio_paths,
        alpha=alpha
    )
    results.append((alpha, audio))
    
    # Save each result at the same sample rate used for playback
    output_path = f"../examples/generated/alpha_{alpha:.2f}.wav"
    pipeline.save_audio(audio, output_path, sample_rate=16000)

print("\nAll variations generated!")

## 9. Compare Voice Blending Results

Listen to each version to hear the gradual transition:

In [None]:
for alpha, audio in results:
    # Label the endpoints; anything in between is a blend
    label = "Source" if alpha == 0 else "Target" if alpha == 1 else "Mixed"
    print(f"\nAlpha = {alpha} ({label})")
    display(Audio(audio.cpu().numpy(), rate=16000))

## 10. Advanced: Direct Component Usage

For more control, you can use the individual components directly:

In [None]:
from src import SSLEncoder, KNNMatcher

# Initialize components separately
ssl_encoder = SSLEncoder()
knn_matcher = KNNMatcher(k=4)

# Load and extract features from an audio file
audio_path = "../examples/target_speaker/sample1.wav"
features = ssl_encoder.extract_from_file(audio_path)

# Assumes features are shaped (..., frames, feature_dim)
print(f"Extracted features shape: {features.shape}")
print(f"Feature dimension: {features.shape[-1]}")
print(f"Number of frames: {features.shape[-2]}")
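
As a sanity check on the frame count, the sketch below estimates how many frames a clip should produce. It assumes `SSLEncoder` wraps a standard WavLM model on 16 kHz audio, whose convolutional front end uses kernel sizes (10, 3, 3, 3, 3, 2, 2) and strides (5, 2, 2, 2, 2, 2, 2) without padding, giving roughly one frame per 20 ms. If the encoder is configured differently, the numbers will not match; `expected_num_frames` is a hypothetical helper.

```python
def expected_num_frames(num_samples,
                        kernels=(10, 3, 3, 3, 3, 2, 2),
                        strides=(5, 2, 2, 2, 2, 2, 2)):
    """Estimate the frame count after an unpadded 1-D conv stack.

    Defaults follow the standard WavLM feature extractor (an assumption
    about what SSLEncoder uses internally).
    """
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        # Output length of an unpadded convolution
        length = (length - kernel) // stride + 1
        if length <= 0:
            return 0  # clip too short to produce any frame
    return length

print(expected_num_frames(16000))  # 49 frames for 1 second at 16 kHz
```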

## 11. Batch Processing

Synthesize several sentences in a loop, reusing the same pipeline and target speaker audio:

In [None]:
# List of sentences to synthesize
sentences = [
    "This is the first sentence.",
    "Here comes the second one.",
    "And finally, the third sentence.",
]

# Process each sentence
for i, text in enumerate(sentences):
    print(f"\nProcessing sentence {i+1}/{len(sentences)}...")
    
    audio = pipeline.synthesize(
        text=text,
        target_audio_paths=target_audio_paths,
        alpha=1.0
    )
    
    output_path = f"../examples/generated/sentence_{i+1}.wav"
    pipeline.save_audio(audio, output_path, sample_rate=16000)

print("\nBatch processing complete!")
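
For longer batches, it can be useful to keep going when one sentence fails and to report the failures at the end. The sketch below wraps the loop above in a hypothetical helper, `synthesize_batch`, that takes the synthesis and save steps as callables, so it makes no assumptions about the pipeline beyond the calls already shown in this notebook.

```python
import os

def synthesize_batch(sentences, synthesize_fn, save_fn, out_dir):
    """Synthesize and save each sentence, collecting failures instead of aborting.

    synthesize_fn(text) -> audio and save_fn(audio, path) are supplied by the
    caller. Hypothetical helper; not part of TensorTalk.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved, failed = [], []
    for i, text in enumerate(sentences, start=1):
        path = os.path.join(out_dir, f"sentence_{i}.wav")
        try:
            save_fn(synthesize_fn(text), path)
            saved.append(path)
        except Exception as exc:
            # Record the failure and continue with the next sentence
            failed.append((text, str(exc)))
    return saved, failed

# Example usage with the pipeline from this notebook:
# saved, failed = synthesize_batch(
#     sentences,
#     lambda t: pipeline.synthesize(
#         text=t, target_audio_paths=target_audio_paths, alpha=1.0
#     ),
#     pipeline.save_audio,
#     "../examples/generated",
# )
```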

## Summary

In this notebook, you learned how to:
1. Initialize the TensorTalk pipeline
2. Synthesize speech in a target speaker's voice
3. Control voice conversion with the alpha parameter
4. Save and visualize results
5. Process multiple sentences in a batch loop

### Next Steps

- Try with your own voice samples
- Experiment with different k values (number of neighbors)
- Integrate rhythm modeling for more natural speech
- Check out the paper for technical details

### Resources

- **Paper**: https://drive.google.com/file/d/1j9t6o2sKrWnu83dZkSFWDphGyeI9_NYr/view
- **GitHub**: https://github.com/robrich07/TensorTalk
- **Issues**: Report bugs or request features on GitHub