<a href="https://colab.research.google.com/github/Dimildizio/DS_course/blob/main/Neural_networks/Transformers/voice_cloning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bark text-to-speech voice cloning
Clone voices to create speaker history prompt files (.npz) for [Bark text-to-speech](https://github.com/suno-ai/bark).
This version of the notebook is made to run on Google Colab. Make sure the runtime hardware accelerator is set to GPU.
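
Before installing anything, it can help to confirm that the runtime really has a GPU attached. The snippet below is a minimal sketch and not part of the original notebook: it assumes an NVIDIA runtime where the `nvidia-smi` tool is on the `PATH`, and the helper name `nvidia_gpu_visible` is made up for illustration.

```python
import shutil
import subprocess


def nvidia_gpu_visible() -> bool:
    """Return True if `nvidia-smi -L` runs and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # Tool not installed, so most likely there is no NVIDIA driver or GPU
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # `nvidia-smi -L` prints one line per device, each starting with "GPU"
    return result.returncode == 0 and "GPU" in result.stdout


if nvidia_gpu_visible():
    print("GPU runtime detected.")
else:
    print("No GPU detected. In Colab, change the runtime type to GPU.")
```

If this reports no GPU, the later cells that use `device = 'cuda'` are likely to fail, so switching the runtime first saves a reinstall.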

# Google Colab: Clone the repository

In [1]:
!git clone https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer/
%cd bark-voice-cloning-HuBERT-quantizer
%pip install -r requirements.txt
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117

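# The PyPI package named "bark" is not Suno's Bark, so install from the Suno repository
%pip install git+https://github.com/suno-ai/bark.git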

Cloning into 'bark-voice-cloning-HuBERT-quantizer'...
remote: Enumerating objects: 1882, done.
remote: Counting objects: 100% (247/247), done.
remote: Compressing objects: 100% (118/118), done.
remote: Total 1882 (delta 144), reused 215 (delta 124), pack-reused 1635
Receiving objects: 100% (1882/1882), 319.75 MiB | 14.89 MiB/s, done.
Resolving deltas: 100% (145/145), done.
/content/bark-voice-cloning-HuBERT-quantizer


## Import packages

In [None]:
import numpy as np
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio
from bark_hubert_quantizer.hubert_manager import HuBERTManager
from bark_hubert_quantizer.pre_kmeans_hubert import CustomHubert
from bark_hubert_quantizer.customtokenizer import CustomTokenizer

from bark.api import generate_audio
from transformers import BertTokenizer
from bark.generation import SAMPLE_RATE, preload_models, codec_decode, generate_coarse, generate_fine, generate_text_semantic
from IPython.display import Audio

## Load models

In [3]:
large_quant_model = False  # Set to True to use the larger pretrained quantizer
device = 'cuda'  # Any torch device spec works, e.g. 'cuda', 'cpu', 'cuda:0', torch.device('cuda')

# (remote checkpoint name, local file name)
model = (
    ('quantifier_V1_hubert_base_ls960_23.pth', 'tokenizer_large.pth')
    if large_quant_model
    else ('quantifier_hubert_base_ls960_14.pth', 'tokenizer.pth')
)

print('Loading HuBERT...')
hubert_model = CustomHubert(HuBERTManager.make_sure_hubert_installed(), device=device)
print('Loading Quantizer...')
quant_model = CustomTokenizer.load_from_checkpoint(HuBERTManager.make_sure_tokenizer_installed(model=model[0], local_file=model[1]), device)
print('Loading Encodec...')
encodec_model = EncodecModel.encodec_model_24khz()
encodec_model.set_target_bandwidth(6.0)
encodec_model.to(device)

print('Downloaded and loaded models!')

Loading HuBERT...
Downloading HuBERT base model
Downloaded HuBERT




Loading Quantizer...
Downloading HuBERT custom tokenizer


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


quantifier_hubert_base_ls960_14.pth:   0%|          | 0.00/104M [00:00<?, ?B/s]

Downloaded tokenizer


Downloading: "https://dl.fbaipublicfiles.com/encodec/v0/encodec_24khz-d7cc33bc.th" to /root/.cache/torch/hub/checkpoints/encodec_24khz-d7cc33bc.th


Loading Encodec...


100%|██████████| 88.9M/88.9M [00:00<00:00, 97.0MB/s]


Downloaded and loaded models!


In [105]:
wav_file = 'speaker.wav'  # Recording of the voice to clone
out_file = 'speaker.npz'  # Where to save the speaker history prompt

In [110]:
wav, sr = torchaudio.load(wav_file)
Audio(wav, rate=sr)

## Load wav and create speaker history prompt

In [74]:
wav, sr = torchaudio.load(wav_file)
wav_hubert = wav.to(device)

if wav_hubert.shape[0] == 2:  # Stereo to mono if needed
    wav_hubert = wav_hubert.mean(0, keepdim=True)

print('Extracting semantics...')
semantic_vectors = hubert_model.forward(wav_hubert, input_sample_hz=sr)
print('Tokenizing semantics...')
semantic_tokens = quant_model.get_token(semantic_vectors)
print('Creating coarse and fine prompts...')
# Resample to Encodec's sample rate, downmix to mono, and add a batch dimension
wav = convert_audio(wav, sr, encodec_model.sample_rate, 1).unsqueeze(0)
wav = wav.to(device)

Extracting semantics...
Tokenizing semantics...
Creating coarse and fine prompts...


In [75]:
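# Encode the audio into Encodec codebook indices
with torch.no_grad():
    encoded_frames = encodec_model.encode(wav)
# Concatenate the frames along time and drop the batch dimension
codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1).squeeze()
codes = codes.cpu()
semantic_tokens = semantic_tokens.cpu()

# The coarse prompt is the first two codebooks; the fine prompt keeps all of them
np.savez(out_file, semantic_prompt=semantic_tokens, fine_prompt=codes, coarse_prompt=codes[:2, :])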

print('Done!')

Done!
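
The saved file can be sanity-checked before it is handed to Bark. The sketch below is an illustration rather than part of the original notebook: the helper name `check_history_prompt` is hypothetical, and the expected shapes are assumptions (a 1-D semantic prompt, 2 coarse codebooks, and 8 fine codebooks, which is what Encodec's 24 kHz model is expected to produce at the 6.0 kbps bandwidth set above). The demo at the bottom uses synthetic arrays so the block runs on its own.

```python
import os
import tempfile

import numpy as np


def check_history_prompt(path, n_coarse=2, n_fine=8):
    """Load a history prompt .npz and return its array shapes after basic checks.

    n_coarse and n_fine are assumed codebook counts, not values read from Bark.
    """
    with np.load(path) as data:
        semantic = data["semantic_prompt"]
        coarse = data["coarse_prompt"]
        fine = data["fine_prompt"]

    if semantic.ndim != 1 or semantic.size == 0:
        raise ValueError(f"semantic_prompt should be non-empty and 1-D, got {semantic.shape}")
    if fine.ndim != 2 or fine.shape[0] != n_fine:
        raise ValueError(f"fine_prompt should be [{n_fine}, T], got {fine.shape}")
    if coarse.ndim != 2 or coarse.shape[0] != n_coarse:
        raise ValueError(f"coarse_prompt should be [{n_coarse}, T], got {coarse.shape}")
    # The notebook builds the coarse prompt as the first rows of the fine prompt
    if not np.array_equal(coarse, fine[:n_coarse]):
        raise ValueError("coarse_prompt does not match the first rows of fine_prompt")

    return {
        "semantic_prompt": semantic.shape,
        "coarse_prompt": coarse.shape,
        "fine_prompt": fine.shape,
    }


# Demo with synthetic data standing in for a real speaker file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.npz")
    demo_codes = np.random.randint(0, 1024, size=(8, 150))
    np.savez(
        demo_path,
        semantic_prompt=np.random.randint(0, 10000, size=100),
        fine_prompt=demo_codes,
        coarse_prompt=demo_codes[:2, :],
    )
    print(check_history_prompt(demo_path))
```

In the notebook itself, the equivalent call would be `check_history_prompt(out_file)` right after the file is written.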


## Preload Bark models

In [76]:
# Download (if needed) and load the full-size text, coarse, fine, and codec models on the GPU
preload_models(
    text_use_gpu=True, text_use_small=False,
    coarse_use_gpu=True, coarse_use_small=False,
    fine_use_gpu=True, fine_use_small=False,
    codec_use_gpu=True,
    force_reload=False)

In [98]:
text_prompt = "Hello and welcome to my class, my name is Steven, let me spell it for you... STEVEN."

## More controllable generation

In [79]:
# Run the semantic, coarse, and fine stages separately so each stage's sampling can be tuned
x_semantic = generate_text_semantic(text_prompt, history_prompt=out_file, temp=0.7, top_k=30, top_p=0.95)
x_coarse_gen = generate_coarse(x_semantic, history_prompt=out_file, temp=0.7, top_k=30, top_p=0.95)
x_fine_gen = generate_fine(x_coarse_gen, history_prompt=out_file, temp=0.7)

another_audio_array = codec_decode(x_fine_gen)

100%|██████████| 100/100 [01:42<00:00,  1.03s/it]
100%|██████████| 37/37 [07:41<00:00, 12.49s/it]


## Check results

In [159]:
# One-call generation with the cloned voice as the history prompt
audio_array = generate_audio(text_prompt, history_prompt=out_file, text_temp=0.65, waveform_temp=0.75)

100%|██████████| 100/100 [00:12<00:00,  7.93it/s]
100%|██████████| 38/38 [00:36<00:00,  1.05it/s]


In [143]:
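# rate=SAMPLE_RATE*1.05 plays the clip about 5% faster and slightly higher in pitch; use rate=SAMPLE_RATE for unaltered playback
Audio(audio_array, rate=SAMPLE_RATE*1.05, autoplay=True)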
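
To keep a result beyond the notebook session, the generated array can also be written to a WAV file. The following is a minimal sketch and not part of the original notebook: the helper name `save_wav` is hypothetical, it assumes the array is a mono float waveform in the range [-1, 1], and the default of 24000 Hz is an assumption about the value of Bark's `SAMPLE_RATE`. The demo writes a synthetic tone so the block runs without Bark installed.

```python
import os
import tempfile
import wave

import numpy as np


def save_wav(path, audio, sample_rate=24000):
    """Write a mono float waveform in [-1, 1] as 16-bit PCM and return its duration in seconds."""
    samples = np.asarray(audio, dtype=np.float32).reshape(-1)
    # Clip to the valid range, then scale to little-endian signed 16-bit integers
    pcm = (np.clip(samples, -1.0, 1.0) * 32767.0).astype("<i2")
    with wave.open(path, "wb") as wav_out:
        wav_out.setnchannels(1)   # mono
        wav_out.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_out.setframerate(sample_rate)
        wav_out.writeframes(pcm.tobytes())
    return len(pcm) / sample_rate


# Demo: one second of a 440 Hz tone standing in for a generated clip
with tempfile.TemporaryDirectory() as tmp_dir:
    t = np.arange(24000) / 24000.0
    tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)
    seconds = save_wav(os.path.join(tmp_dir, "tone.wav"), tone)
    print(f"Wrote {seconds:.2f} s of audio")
```

In the notebook, a call such as `save_wav('cloned_voice.wav', audio_array, SAMPLE_RATE)` would store the clip at its native rate, without the 5% speed-up applied during playback.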