# 🎤 Voice Clone from Audio File

Clone a voice from your own audio file and generate new speech.

## Workflow:
1. Place your voice reference audio in: `assets/voices/`
2. Configure the settings in the **Configuration** cell
3. Run all cells
4. Find your output in: `assets/output/`
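
To choose a value for the reference filename in step 2, it can help to see which audio files are already in `assets/voices/`. The following is a minimal sketch using only the standard library. The helper name `list_reference_audio` and the set of extensions are illustrative assumptions, not part of the notebook or of `qwen_tts`.

```python
from pathlib import Path

# Assumed set of audio extensions; adjust to whatever your audio loader supports.
AUDIO_EXTENSIONS = {".wav", ".flac", ".mp3", ".ogg"}


def list_reference_audio(voices_dir="assets/voices"):
    """Return the sorted names of audio files found directly in voices_dir."""
    folder = Path(voices_dir)
    if not folder.is_dir():
        # A missing folder is treated as "no candidates" rather than an error.
        return []
    return sorted(
        p.name
        for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )


# Prints an empty list if the folder does not exist yet.
print(list_reference_audio())
```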

## 1. Setup & Imports

In [1]:
import os
import numpy as np
import torch
import soundfile as sf
from IPython.display import Audio, display
from qwen_tts import Qwen3TTSModel

# Paths
VOICES_DIR = "assets/voices"
OUTPUT_DIR = "assets/output"

os.makedirs(VOICES_DIR, exist_ok=True)
os.makedirs(OUTPUT_DIR, exist_ok=True)

print("✅ Imports successful")
print(f"📁 Voice templates: {VOICES_DIR}/")
print(f"📁 Output folder: {OUTPUT_DIR}/")

✅ Imports successful
📁 Voice templates: assets/voices/
📁 Output folder: assets/output/


## 2. Configuration

**Edit the values below:**

In [2]:
# =============================================================================
# CONFIGURATION - Edit these values
# =============================================================================

# Filename of your reference audio (place it in assets/voices/)
REFERENCE_AUDIO = "my_voice_test.wav"

# Transcript of what is said in the reference audio (for better quality)
# Leave empty "" if you don't know it
REFERENCE_TEXT = ""

# The text you want to generate with the cloned voice
TEXT_TO_GENERATE = [
    "Hallo, das ist ein Test der Stimmenklonungs-Technologie.",
    "Ich kann alles sagen, was du möchtest, in dieser Stimme.",
    "Ziemlich beeindruckend, oder? Die Zukunft ist da!",
]

# Language of the text (English, German, French, Spanish, etc.)
LANGUAGE = "German"

# Output file prefix
OUTPUT_PREFIX = "cloned"

# Combine all audio into one file?
COMBINE_INTO_ONE = True

# Pause between sentences (seconds)
PAUSE_BETWEEN_SENTENCES = 0.5

print("✅ Configuration set")

✅ Configuration set
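
Some configuration mistakes only surface later in the run; for example, a negative pause would produce a negative sample count in the combine step. The sketch below shows one way to check the values up front. `validate_config` is a hypothetical helper, not something the notebook or `qwen_tts` provides, and the rules it applies are assumptions about what counts as a sensible setting.

```python
def validate_config(texts, pause_seconds, output_prefix):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Mirror the notebook's handling: accept a single string or a list.
    items = texts if isinstance(texts, list) else [texts]
    if not items or any(not isinstance(t, str) or not t.strip() for t in items):
        problems.append("TEXT_TO_GENERATE must contain non-empty strings")
    if pause_seconds < 0:
        problems.append("PAUSE_BETWEEN_SENTENCES must not be negative")
    if not output_prefix or any(sep in output_prefix for sep in ("/", "\\")):
        problems.append("OUTPUT_PREFIX must be a plain name without path separators")
    return problems


# With the notebook's default values this reports no problems.
print(validate_config(["Hallo"], 0.5, "cloned"))
```

In the notebook this could be called as `validate_config(TEXT_TO_GENERATE, PAUSE_BETWEEN_SENTENCES, OUTPUT_PREFIX)` right after the configuration cell.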


## 3. Load Model

This may take a moment on first run (downloads model weights).
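
The load cell hard-codes `device_map="cuda:0"` and `torch.bfloat16`, which will presumably fail on a machine without a CUDA GPU. As a hedged sketch, the device and dtype could be chosen at runtime instead. `choose_device` is a hypothetical helper, and whether this model runs acceptably on CPU or in float16 is an assumption the notebook does not verify.

```python
def choose_device(cuda_available, bf16_supported=True):
    """Map hardware capabilities to (device_map, dtype_name) strings."""
    if not cuda_available:
        return "cpu", "float32"
    dtype_name = "bfloat16" if bf16_supported else "float16"
    return "cuda:0", dtype_name


try:
    import torch
except ImportError:  # torch is optional for this sketch
    torch = None

# Query the hardware only if torch is importable.
cuda_ok = torch is not None and torch.cuda.is_available()
bf16_ok = cuda_ok and torch.cuda.is_bf16_supported()
device_map, dtype_name = choose_device(cuda_ok, bf16_ok)
print(device_map, dtype_name)
```

The result could then be passed to the load call as `device_map=device_map` and `dtype=getattr(torch, dtype_name)`.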

In [3]:
print("Loading model...")
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="sdpa",
)
print("✅ Model loaded!")

Loading model...


Fetching 4 files: 100%|██████████| 4/4 [00:00<00:00, 4000.29it/s]


✅ Model loaded!


## 4. Create Voice Clone Prompt

In [4]:
# Build full path to reference audio
ref_audio_path = os.path.join(VOICES_DIR, REFERENCE_AUDIO)

if not os.path.exists(ref_audio_path):
    print(f"❌ ERROR: Reference audio not found: {ref_audio_path}")
    print(f"   Please place your voice file in: {VOICES_DIR}/")
else:
    print(f"✅ Found reference audio: {ref_audio_path}")

    # Play the reference audio
    print("\n🔊 Reference audio:")
    display(Audio(ref_audio_path))

✅ Found reference audio: assets/voices\my_voice_test.wav

🔊 Reference audio:


In [5]:
# Create voice clone prompt
if REFERENCE_TEXT:
    print("Using transcript for better quality cloning...")
    voice_prompt = model.create_voice_clone_prompt(
        ref_audio=ref_audio_path,
        ref_text=REFERENCE_TEXT,
        x_vector_only_mode=False,
    )
else:
    print("No transcript provided, using x-vector only mode...")
    voice_prompt = model.create_voice_clone_prompt(
        ref_audio=ref_audio_path,
        x_vector_only_mode=True,
    )

print("✅ Voice prompt created!")

No transcript provided, using x-vector only mode...
✅ Voice prompt created!
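
Because the voice prompt is derived entirely from the reference clip, a quick look at the clip's length and format can catch an empty or wrongly exported file before any generation is attempted. The sketch below uses the standard-library `wave` module, which reads only uncompressed PCM WAV files; `describe_wav` is a hypothetical helper and not part of `qwen_tts`.

```python
import os
import wave


def describe_wav(path):
    """Return (duration_seconds, sample_rate, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
    return frames / rate, rate, channels


# Path taken from the notebook's configuration; skipped if the file is absent.
ref = os.path.join("assets", "voices", "my_voice_test.wav")
if os.path.exists(ref):
    duration, rate, channels = describe_wav(ref)
    print(f"{duration:.1f} s, {rate} Hz, {channels} channel(s)")
```

For non-PCM or compressed files, `wave.open` raises `wave.Error`; in that case `soundfile`, which the notebook already imports, is likely the better tool.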


## 5. Generate Audio

In [6]:
# Handle single string or list
texts = TEXT_TO_GENERATE if isinstance(TEXT_TO_GENERATE, list) else [TEXT_TO_GENERATE]

print(f"Generating {len(texts)} audio files...\n")

all_audio = []
sample_rate = None

for i, text in enumerate(texts, 1):
    print(f"[{i}/{len(texts)}] {text[:60]}...")
    
    wavs, sr = model.generate_voice_clone(
        text=text,
        language=LANGUAGE,
        voice_clone_prompt=voice_prompt,
    )
    
    sample_rate = sr
    all_audio.append(wavs[0])
    
    # Save individual file
    output_file = os.path.join(OUTPUT_DIR, f"{OUTPUT_PREFIX}_{i}.wav")
    sf.write(output_file, wavs[0], sr)
    print(f"    ✅ Saved: {output_file}")
    
    # Play the audio
    display(Audio(wavs[0], rate=sr))
    print()

Setting `pad_token_id` to `eos_token_id`:2150 for open-end generation.


Generating 3 audio files...

[1/3] Hallo, das ist ein Test der Stimmenklonungs-Technologie....
    ✅ Saved: assets/output\cloned_1.wav


Setting `pad_token_id` to `eos_token_id`:2150 for open-end generation.



[2/3] Ich kann alles sagen, was du möchtest, in dieser Stimme....
    ✅ Saved: assets/output\cloned_2.wav


Setting `pad_token_id` to `eos_token_id`:2150 for open-end generation.



[3/3] Ziemlich beeindruckend, oder? Die Zukunft ist da!...
    ✅ Saved: assets/output\cloned_3.wav
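
The generation loop produces one clip per list entry, so a longer paragraph has to be broken into sentences before it can be used as `TEXT_TO_GENERATE`. A minimal sketch of such a split is shown here. `split_sentences` is a hypothetical helper, and its rule (split after `.`, `!` or `?` followed by whitespace) is a simplification that will also split abbreviations such as "z. B.".

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty pieces, e.g. when the input is an empty string.
    return [p for p in parts if p]


paragraph = "Hallo, das ist ein Test. Ziemlich beeindruckend, oder? Die Zukunft ist da!"
for sentence in split_sentences(paragraph):
    print(sentence)
```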





## 6. Combine Audio (Optional)
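
The pause between clips is a block of zero-valued samples whose length is the pause duration multiplied by the sample rate. As a rough sketch of that arithmetic, the function below computes the length of the combined file. `combined_length` is a hypothetical helper, and the 24000 Hz sample rate is an assumed value for illustration; the real rate is whatever the model returns as `sr`.

```python
def combined_length(clip_lengths, pause_seconds, sample_rate):
    """Total samples after joining clips with a pause between each pair."""
    pause_samples = int(pause_seconds * sample_rate)
    # N clips are separated by N - 1 pauses (none for zero or one clip).
    gaps = max(len(clip_lengths) - 1, 0)
    return sum(clip_lengths) + gaps * pause_samples


# Hypothetical numbers: three clips of 1 s, 2 s and 0.5 s at an assumed 24000 Hz.
total = combined_length([24000, 48000, 12000], 0.5, 24000)
print(total, total / 24000)  # 108000 samples, 4.5 seconds
```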

In [7]:
if COMBINE_INTO_ONE and len(all_audio) > 1:
    print("Combining all audio into one file...")
    
    # Create silence for pause
    pause_samples = int(PAUSE_BETWEEN_SENTENCES * sample_rate)
    silence = np.zeros(pause_samples, dtype=all_audio[0].dtype)
    
    # Combine with pauses
    combined = []
    for i, audio in enumerate(all_audio):
        combined.append(audio)
        if i < len(all_audio) - 1:
            combined.append(silence)
    
    combined_audio = np.concatenate(combined)
    combined_file = os.path.join(OUTPUT_DIR, f"{OUTPUT_PREFIX}_combined.wav")
    sf.write(combined_file, combined_audio, sample_rate)
    
    print(f"✅ Saved: {combined_file}")
    print("\n🔊 Combined audio:")
    display(Audio(combined_audio, rate=sample_rate))
else:
    print("ℹ️ Skipping combine (COMBINE_INTO_ONE is False or only 1 file)")

Combining all audio into one file...
✅ Saved: assets/output\cloned_combined.wav

🔊 Combined audio:


## ✅ Done!

Your files are saved in `assets/output/`