## Cross-Lingual Voice Clone Demo


In [11]:
import os
import torch
import se_extractor
from api import BaseSpeakerTTS, ToneColorConverter

### Initialization


In [None]:
ckpt_converter = 'checkpoints/converter'
device = "cuda:0" if torch.cuda.is_available() else "cpu"
output_dir = 'outputs'
input_dir = 'inputs'

tone_color_converter = ToneColorConverter(
    f"{ckpt_converter}/config.json", device=device
)
tone_color_converter.load_ckpt(f"{ckpt_converter}/checkpoint.pth")

os.makedirs(output_dir, exist_ok=True)

This demo uses OpenAI TTS as the base speaker to produce multilingual speech audio. You can swap in a different base speaker to suit your needs. Create a file named `.env` and store your OpenAI key in it as `OPENAI_API_KEY=xxx`. A Chinese base speaker model is also provided (see `demo_part1.ipynb`).
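It can help to fail early with a clear message when the key is missing, rather than at the first API call. The following is a minimal sketch: `require_env` is a hypothetical helper, not part of this demo or of the OpenAI SDK, and it only reads variables already present in the process environment (for example after `load_dotenv()` has run).

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Hypothetical error message; adjust to taste.
        raise RuntimeError(
            f"{name} is not set; add a line like {name}=xxx to your .env file"
        )
    return value
```

With this helper, the client could be created as `OpenAI(api_key=require_env("OPENAI_API_KEY"))`.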


In [None]:
from openai import OpenAI
from dotenv import load_dotenv

# Please create a file named .env and place your
# OpenAI key as OPENAI_API_KEY=xxx
load_dotenv()

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

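# Example (disabled): synthesize a long list of Chinese game raid alert phrases
# with OpenAI TTS. The input stays in Chinese because it is the speech to generate.
# response = client.audio.speech.create(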
#     model="tts-1-hd",
#     voice="nova",
#     response_format="mp3",
#     input="陨石快躲,快找掩体,快进阴影,快找盾牌,快踩小花,快找泥水,驱散boss,快进白色符文,准备投掷,你被投掷,火柱出现,快找火柱,顺劈斩,准备击飞,大地之赐,真菌食肉者快打,孢子投射者快打,食脑真菌快打,绿蘑菇出现,蓝蘑菇出现,盾牌冲锋,震荡怒吼,粉碎,旋风斩,致衰咆哮,地震,返回原位,准备接球,远离冰球,,狂乱,4阶段准备,转阶段准备,1打断准备,2打断准备,3打断准备,4打断准备,5打断准备,集合分担,拉开小怪,快跑, 炸弹点你,快进黄色符文,召火者出现,熔渣元素出现,保安出现,工程师出现,一轨道快车,二轨道快车,三轨道快车,四轨道快车,随机轨道快车,一轨道小怪,二轨道小怪,三轨道小怪,四轨道小怪,随机轨道小怪,一轨道火炮,二轨道火炮,三轨道火炮,四轨道火炮,随机轨道火炮,一轨道大怪,二轨道大怪,三轨道大怪,四轨道大怪,随机轨道大怪,一轨道喷火,二轨道喷火,三轨道喷火,四轨道喷火,随机轨道喷火,随机三轨道快车,一四轨道快车,二三轨道小怪,一四轨道火炮,二轨道大怪四轨道火炮,二轨道小怪三轨道大怪,一轨道火炮四轨道大怪,二三轨道快车,一轨道大怪四轨道火炮,一轨道喷火二三轨道快车,二轨道小怪四轨道快车,二三轨道喷火,一轨道小怪四轨道喷火,二轨道大怪三轨道小怪,远离连线,火炮快打,乌克塔准备,高莱克准备,乌克罗格准备,左,中,右,坦克快打,死亡标记,靠近星星,靠近大饼,靠近菱形,靠近三角,靠近月亮,靠近方块,靠近叉叉,靠近骷髅,弹幕快躲,炸弹快打,血球快打,邪能血球快打,嗜血者快打,恐魔快打,你被傳送,連線靠近,传给受害者,传给治疗,传给坦克,用眼打断,灵魂裂劈,快进火焰,注意宝珠,第一位拉断枷锁,第二位拉断枷锁,第三位拉断枷锁,魔火之魂快打,召亡者快打,快跑,精炼混乱点你,快跑,聚焦混乱点你,碎石机,喷火机,巨炮,攻城车,运输车,拉断锁链,中偏左,中偏右,5秒后暗影之力,左下左下左下,右下右下右下,左上左上左上,右上右上右上,中下中下中下,中上中上中上,一组分担,二组分担,快跳,快咬人,快进中场,承受伤害,照亮阴影,照亮小怪,准备恐惧,帮忙吸收,靠近螃蟹,靠近龙,靠近猎人,靠近狼,快躲小怪后面,层数过高,快到橙色,快到黄色,快到蓝色,快到绿色,快到紫色,图腾快打,切换世界,邪能灌注点你,光明灌注点你,快跳坑,靠近水母,运送墨汁,切换区域,靠近时间立场,快靠边站,准备沉默,靠近坦克,使用道具,刷满血量,新传送门,五阶段准备, 攻击护盾,矩阵点你,收集物品,准备交换,快躲boss后面,鲜血盛宴,快躲开,快引诱,准备控制小怪,,,,,理智过低,拉断连线,靠近近战,拉锁链撞boss,小心拉近,使用特别技能,别动,谐波,旋律,炸弹点你,快挡视线,一阶段准备,靠近柱子,正极,负极,极性反转,小心击飞,换颜色,靠近小怪,打破藤曼,帮助灵魂,注意debuff,快上天,种子点你,火焰点你,暗影点你,进池子"
# )

# response.stream_to_file(f"{output_dir}/台词_openai_output.mp3")

### Obtain Tone Color Embedding


`source_se` is the tone color embedding of the base speaker,
averaged over multiple sentences spoken with multiple emotions.
A precomputed embedding can be used directly, but readers are free
to extract `source_se` themselves, as the cell below does with `se_extractor.get_se`.
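To make the averaging step concrete, here is a minimal sketch of an element-wise mean over several embeddings. It uses plain Python lists so that it runs without any dependencies; `average_embeddings` is a hypothetical helper, and the source does not state how the provided embedding was actually computed.

```python
from typing import List, Sequence


def average_embeddings(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized embedding vectors."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same length")
    n = len(embeddings)
    # Average each coordinate across all embeddings.
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]
```

With PyTorch tensors of identical shape, the same idea can be written as `torch.stack(list_of_se).mean(dim=0)`.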


In [None]:
base_speaker = f"{input_dir}/OPENAI_TTS.mp3"
source_se, audio_name = se_extractor.get_se(
    base_speaker, tone_color_converter, vad=True
)

reference_speaker = "resources/dada.wav"
target_se, audio_name = se_extractor.get_se(
    reference_speaker, tone_color_converter, vad=True
)

### Inference


In [13]:
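# Run the base speaker TTS (here, the Chinese base speaker model).
# English gloss of the text: "Mr. He, this is my brand-new form. Have a listen
# to how it sounds first, and if it works, go ahead and build the UI."
text = "何公子这是我的全新形态,你先听听效果如何, 如果可以的话再把UI做了"
src_path = f"{output_dir}/tmp.wav"
ckpt_base = 'checkpoints/base_speakers/ZH'
base_speaker_tts = BaseSpeakerTTS(f'{ckpt_base}/config.json', device=device)
base_speaker_tts.load_ckpt(f'{ckpt_base}/checkpoint.pth')
base_speaker_tts.tts(text, src_path, speaker='default', language='Chinese', speed=1.0)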

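# Alternative (disabled): synthesize with OpenAI TTS instead. For this loop,
# `text` would need to be a list of strings rather than a single string.
# for i, t in enumerate(text):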
#     response = client.audio.speech.create(
#         model="tts-1-hd",
#         voice="nova",
#         input=t,
#     )

#     response.stream_to_file(src_path)

save_path = f"{output_dir}/output_notebook2.wav"

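# Run the tone color converter.
# Note: source_se was extracted from OPENAI_TTS.mp3 above, while src_path here
# was produced by the ZH base speaker. source_se is meant to describe the
# speaker that produced src_path, so keep the two consistent.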
encode_message = "@MyShell"
tone_color_converter.convert(
    audio_src_path=src_path,
    src_se=source_se,
    tgt_se=target_se,
    output_path=save_path,
    message=encode_message,
)

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache


Loaded checkpoint 'checkpoints/base_speakers/ZH/checkpoint.pth'
missing/unexpected keys: [] []
 > Text splitted to sentences.
何公子这是我的全新形态,
你先听听效果如何, 如果可以的话再把UI做了


Loading model cost 0.396 seconds.
Prefix dict has been built successfully.


xə↑ k⁼ʊŋ→ts⁼ɹ ts`⁼ə↓ s`ɹ`↓ wo↓↑ t⁼ə tʃʰɥæn↑ʃin→ ʃiŋ↑tʰaɪ↓,
 length:58
 length:58
ni↓↑ ʃjɛn→ tʰiŋ→tʰiŋ→ ʃiɑʊ↓k⁼wo↓↑ ɹ`u↑xə↑,  ɹ`u↑k⁼wo↓↑ kʰə↓↑i↓↑ t⁼əxwa↓ ts⁼aɪ↓ p⁼a↓↑joʊ→aɪ↓ ts⁼wo↓ lə.
 length:102
 length:102
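The inference cell writes to a single `tmp.wav` and a single output file, so converting several texts in a loop would overwrite earlier results. The following is a minimal sketch of one way to give each text its own pair of paths; `plan_paths` and the `tmp_NNN.wav` / `output_NNN.wav` naming are assumptions made for illustration, not part of the demo.

```python
import os
from typing import List, Tuple


def plan_paths(output_dir: str, texts: List[str]) -> List[Tuple[str, str, str]]:
    """Pair each text with a unique base-speaker path and converted-output path."""
    plan = []
    for i, text in enumerate(texts):
        # Zero-padded index keeps files sorted and avoids overwriting.
        src_path = os.path.join(output_dir, f"tmp_{i:03d}.wav")
        save_path = os.path.join(output_dir, f"output_{i:03d}.wav")
        plan.append((text, src_path, save_path))
    return plan
```

Each `(text, src_path, save_path)` tuple can then be passed through `base_speaker_tts.tts(...)` and `tone_color_converter.convert(...)` in the same way as the inference cell.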
