## Informal ablation study: semantic, fine, coarse contribution to speaker identity

[Serp.AI](https://github.com/serp-ai/bark-with-voice-clone)'s voice cloning method "clones" the coarse and fine tokens by simply encoding the ground-truth audio with EnCodec. However, it doesn't condition the _semantic_ tokens on the source audio. Voice cloning with this method severely underperforms: speaker identity is often lost. Could the lack of semantic conditioning be the culprit? To find out, let's take an existing prompt, which contains history for the semantic, coarse, and fine tokens, and compare the outputs under three conditions:

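- BASELINE: Load all three history prompts (semantic, coarse, and fine)
- COARSE + FINE ONLY: Generate semantic tokens conditioned on just the input text, then condition the waveform generation (coarse + fine) on the history
- SEMANTIC ONLY: Generate semantic tokens conditioned on the input text and the semantic history, then run the waveform generation (coarse + fine) without any history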
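
The three conditions differ only in which generation stage receives the history prompt, so they can be written down as a small table. The following is a minimal sketch, not part of Bark: `Condition`, `CONDITIONS`, and `run_condition` are hypothetical names, and the two generation functions are passed in as arguments so that `bark.api.text_to_semantic` and `bark.api.semantic_to_waveform` can be supplied from the notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    name: str
    semantic_history: bool  # pass the prompt to text -> semantic generation?
    waveform_history: bool  # pass the prompt to semantic -> waveform generation?


# Hypothetical table of the three conditions compared in this notebook.
CONDITIONS = [
    Condition("baseline", semantic_history=True, waveform_history=True),
    Condition("coarse_fine_only", semantic_history=False, waveform_history=True),
    Condition("semantic_only", semantic_history=True, waveform_history=False),
]


def run_condition(cond, text, prompt, to_semantic, to_waveform):
    """Generate one sample under `cond`.

    `to_semantic` and `to_waveform` are injected callables that accept a
    `history_prompt` keyword (None means "no history").
    """
    semantic_hist = prompt if cond.semantic_history else None
    waveform_hist = prompt if cond.waveform_history else None
    x_semantic = to_semantic(text, history_prompt=semantic_hist)
    return to_waveform(x_semantic, history_prompt=waveform_hist)
```

In the notebook this could be called as `run_condition(CONDITIONS[0], text_prompt, prompt_basename, text_to_semantic, semantic_to_waveform)`; the cells below spell out the same calls by hand.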

In [1]:
import numpy as np
import os
from pprint import pprint
from bark.api import text_to_semantic, semantic_to_waveform, generate_audio
from bark.generation import SAMPLE_RATE
from IPython.display import Audio
from scipy.io.wavfile import write as write_wav
from datetime import datetime



We use Suno AI's bundled `en_speaker_0` as our prompt.

In [2]:
prompt_basename = "en_speaker_0"

First, let's approximate the baseline audio for the prompt. We don't have the ground-truth recording, but we can approximately re-derive it by feeding the prompt's own semantic tokens into the waveform generation, with the coarse and fine parts of the prompt as history:

In [4]:
semantic_history = np.load(
    os.path.join("bark", "assets", "prompts", f"{prompt_basename}.npz")
)["semantic_prompt"]
original_prompt_arr = semantic_to_waveform(semantic_history, history_prompt=prompt_basename)

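# Persist the reconstructed prompt audio (change the path to your desired output location)
os.makedirs("./references", exist_ok=True)  # write_wav does not create missing directories
filepath = f"./references/original_prompt_{prompt_basename}.wav"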
write_wav(filepath, SAMPLE_RATE, original_prompt_arr)
Audio(original_prompt_arr, rate=SAMPLE_RATE)

100%|██████████| 14/14 [00:06<00:00,  2.20it/s]
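
Before relying on the three arrays in a prompt file, it can help to check what it actually contains. The sketch below uses a hypothetical helper, `summarize_prompt`, and runs it on a synthetic file so that it works without the Bark assets. The array shapes in the demo (a 1-D semantic array, 2 coarse rows, 8 fine rows) and the codebook size of 1024 are assumptions for illustration; only the three key names come from the notebook.

```python
import os
import tempfile

import numpy as np


def summarize_prompt(npz_path):
    """Return {array_name: (shape, dtype_name)} for every array in an .npz file."""
    with np.load(npz_path) as data:
        return {name: (data[name].shape, str(data[name].dtype)) for name in data.files}


# Synthetic stand-in for a prompt file; shapes and value ranges are assumed.
rng = np.random.default_rng(0)
demo_path = os.path.join(tempfile.mkdtemp(), "demo_prompt.npz")
np.savez(
    demo_path,
    semantic_prompt=rng.integers(0, 10_000, size=50),
    coarse_prompt=rng.integers(0, 1024, size=(2, 75)),
    fine_prompt=rng.integers(0, 1024, size=(8, 75)),
)

summary = summarize_prompt(demo_path)
for name, (shape, dtype) in summary.items():
    print(f"{name}: shape={shape}, dtype={dtype}")
```

Pointing `summarize_prompt` at the real `en_speaker_0.npz` path used above would show the actual shapes for that prompt.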


For `en_speaker_0`, the original line is "There are a lot of things I could talk about, but it would probably sound similar to this".

Now let's generate 10 samples using the full history prompt:

In [6]:
text_prompt = """
When I was a young boy, my father took me into the city to see a marching band.
"""

In [7]:
sample_arr = None

for i in range(10):
    print(f"Generating baseline sample {i}")
    # Both stages see the prompt: text -> semantic tokens, then semantic -> waveform
    x_semantic = text_to_semantic(text_prompt, history_prompt=prompt_basename)
    baseline_audio_arr = semantic_to_waveform(x_semantic, history_prompt=prompt_basename)
    filepath = f"./references/baseline_{prompt_basename}_{i}.wav"
    write_wav(filepath, SAMPLE_RATE, baseline_audio_arr)
    if i == 0:
        sample_arr = baseline_audio_arr

Audio(sample_arr, rate=SAMPLE_RATE)

Generating baseline sample 0


100%|██████████| 100/100 [00:02<00:00, 38.21it/s]
100%|██████████| 22/22 [00:10<00:00,  2.11it/s]


Generating baseline sample 1


100%|██████████| 100/100 [00:02<00:00, 43.36it/s]
100%|██████████| 21/21 [00:10<00:00,  2.07it/s]


Generating baseline sample 2


100%|██████████| 100/100 [00:02<00:00, 36.51it/s]
100%|██████████| 23/23 [00:10<00:00,  2.14it/s]


Generating baseline sample 3


100%|██████████| 100/100 [00:02<00:00, 36.71it/s] 
100%|██████████| 23/23 [00:10<00:00,  2.09it/s]


Generating baseline sample 4


100%|██████████| 100/100 [00:02<00:00, 35.75it/s] 
100%|██████████| 22/22 [00:10<00:00,  2.04it/s]


Generating baseline sample 5


100%|██████████| 100/100 [00:02<00:00, 39.15it/s] 
100%|██████████| 22/22 [00:10<00:00,  2.08it/s]


Generating baseline sample 6


100%|██████████| 100/100 [00:02<00:00, 38.08it/s]
100%|██████████| 23/23 [00:11<00:00,  2.09it/s]


Generating baseline sample 7


100%|██████████| 100/100 [00:01<00:00, 57.65it/s] 
100%|██████████| 15/15 [00:07<00:00,  2.08it/s]


Generating baseline sample 8


100%|██████████| 100/100 [00:03<00:00, 31.83it/s]
100%|██████████| 25/25 [00:12<00:00,  2.06it/s]


Generating baseline sample 9


100%|██████████| 100/100 [00:03<00:00, 30.35it/s]
100%|██████████| 27/27 [00:13<00:00,  2.04it/s]


Now contrast this with the COARSE + FINE ONLY condition, where only the waveform generation sees the prompt history:

In [8]:
sample_arr = None

for i in range(10):
    print(f"Generating waveform-only sample {i}")
    # Semantic tokens come from the text alone; only the waveform stage sees the prompt
    x_semantic = text_to_semantic(text_prompt)
    no_semantic_audio_arr = semantic_to_waveform(x_semantic, history_prompt=prompt_basename)
    filepath = f"./references/no_semantic_{prompt_basename}_{i}.wav"
    write_wav(filepath, SAMPLE_RATE, no_semantic_audio_arr)
    if i == 0:
        sample_arr = no_semantic_audio_arr

Audio(sample_arr, rate=SAMPLE_RATE)

Generating waveform-only sample 0


100%|██████████| 100/100 [00:01<00:00, 68.27it/s]
100%|██████████| 13/13 [00:06<00:00,  2.17it/s]


Generating waveform-only sample 1


100%|██████████| 100/100 [00:02<00:00, 38.74it/s] 
100%|██████████| 22/22 [00:10<00:00,  2.07it/s]


Generating waveform-only sample 2


100%|██████████| 100/100 [00:03<00:00, 29.21it/s]
100%|██████████| 28/28 [00:13<00:00,  2.04it/s]


Generating waveform-only sample 3


100%|██████████| 100/100 [00:01<00:00, 55.13it/s]
100%|██████████| 16/16 [00:07<00:00,  2.14it/s]


Generating waveform-only sample 4


100%|██████████| 100/100 [00:04<00:00, 21.63it/s]
100%|██████████| 36/36 [00:17<00:00,  2.07it/s]


Generating waveform-only sample 5


100%|██████████| 100/100 [00:02<00:00, 48.40it/s] 
100%|██████████| 17/17 [00:08<00:00,  2.03it/s]


Generating waveform-only sample 6


100%|██████████| 100/100 [00:03<00:00, 31.70it/s]
100%|██████████| 26/26 [00:12<00:00,  2.08it/s]


Generating waveform-only sample 7


100%|██████████| 100/100 [00:01<00:00, 61.43it/s]
100%|██████████| 14/14 [00:06<00:00,  2.02it/s]


Generating waveform-only sample 8


100%|██████████| 100/100 [00:02<00:00, 35.62it/s]
100%|██████████| 24/24 [00:11<00:00,  2.06it/s]


Generating waveform-only sample 9


100%|██████████| 100/100 [00:02<00:00, 44.17it/s] 
100%|██████████| 19/19 [00:09<00:00,  2.10it/s]


Performance is clearly worse than the baseline when the semantic history is dropped.

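Now let's look at the SEMANTIC ONLY condition, where the semantic tokens see the prompt history but the waveform generation does not: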

In [None]:
sample_arr = None

for i in range(10):
    print(f"Generating semantic-only sample {i}")
    # Semantic tokens see the prompt; the waveform stage runs without any history
    x_semantic = text_to_semantic(text_prompt, history_prompt=prompt_basename)
    no_waveform_audio_arr = semantic_to_waveform(x_semantic)
    filepath = f"./references/no_waveform_{prompt_basename}_{i}.wav"
    write_wav(filepath, SAMPLE_RATE, no_waveform_audio_arr)
    if i == 0:
        sample_arr = no_waveform_audio_arr

Audio(sample_arr, rate=SAMPLE_RATE)
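
Each loop only plays back sample 0, so comparing conditions by ear means going back to the files on disk. The sketch below groups the written `.wav` files by condition using the filename prefixes from the cells above (`baseline`, `no_semantic`, `no_waveform`). `collect_samples` is a hypothetical helper, and it assumes every file follows the `{prefix}_{prompt_basename}_{i}.wav` pattern.

```python
import glob
import os
import re

# Filename prefixes used by the generation loops in this notebook.
PREFIXES = ("baseline", "no_semantic", "no_waveform")


def _sample_index(path):
    """Extract the trailing sample number from '..._<i>.wav' (-1 if absent)."""
    match = re.search(r"_(\d+)\.wav$", path)
    return int(match.group(1)) if match else -1


def collect_samples(directory, prompt_basename):
    """Map each condition prefix to its .wav paths, sorted by sample index."""
    grouped = {}
    for prefix in PREFIXES:
        pattern = os.path.join(directory, f"{prefix}_{prompt_basename}_*.wav")
        grouped[prefix] = sorted(glob.glob(pattern), key=_sample_index)
    return grouped
```

In a notebook cell, the result of `collect_samples("./references", prompt_basename)` can be looped over and each path played with `IPython.display.Audio(filename=path)`, which makes it easier to listen to the same sample index across conditions.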

For fun, let's look at the worst-case scenario: a completely random semantic prompt:

In [None]:
base_prompt = np.load(
    os.path.join("bark", "assets", "prompts", f"{prompt_basename}.npz")
)

np.savez(os.path.join("bark", "assets", "prompts", f"{prompt_basename}_random.npz"), 
    # Random tokens from the semantic codebook
    semantic_prompt=np.random.randint(0,10000,size=256),
    coarse_prompt=base_prompt["coarse_prompt"],
    fine_prompt=base_prompt["fine_prompt"])

random_arr = generate_audio(text_prompt, history_prompt=f"{prompt_basename}_random")
Audio(random_arr, rate=SAMPLE_RATE)
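
Because the random semantic prompt changes on every run, the worst-case sample is hard to reproduce. A seeded variant is sketched below. `make_random_semantic_prompt` and `SEMANTIC_VOCAB_SIZE` are hypothetical names, and the vocabulary size of 10,000 is an assumption taken from the upper bound passed to `np.random.randint` in the cell above.

```python
import numpy as np

# Assumption: semantic token ids lie in [0, 10_000), as in the cell above.
SEMANTIC_VOCAB_SIZE = 10_000


def make_random_semantic_prompt(base_prompt, length=256, seed=None):
    """Copy the coarse/fine history and swap in random semantic tokens.

    `base_prompt` is any mapping with "coarse_prompt" and "fine_prompt"
    arrays, e.g. the object returned by np.load on a prompt .npz file.
    """
    rng = np.random.default_rng(seed)
    return {
        "semantic_prompt": rng.integers(0, SEMANTIC_VOCAB_SIZE, size=length),
        "coarse_prompt": np.asarray(base_prompt["coarse_prompt"]),
        "fine_prompt": np.asarray(base_prompt["fine_prompt"]),
    }
```

The returned dictionary can be written with `np.savez(path, **make_random_semantic_prompt(base_prompt, seed=0))`, so rerunning with the same seed yields the same random prompt.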