[TTS] Blank text or text with several consecutive punctuation marks raises an error and cannot be synthesized #2505

Open
tianyu8969 opened this issue Oct 8, 2022 · 4 comments


tianyu8969 commented Oct 8, 2022

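tts_python and tts_inference raise an error outright and fail to synthesize when the text falls into any of the following cases (测试 means "test"):

  1. The text is a blank string, e.g. "" or " "
  2. The text begins with a punctuation mark, e.g. "," or ",测试"
  3. The text contains consecutive punctuation marks, e.g. "测试,,", "测试,。" or "测试。“。"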
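The three cases listed above can be detected before the text ever reaches the synthesizer. The following is a minimal sketch, not part of paddlespeech: the names `is_punct` and `classify_tts_input` are hypothetical, and it assumes that "symbol" means any character whose Unicode category is a punctuation class. It only labels the input and does not attempt to repair it.

```python
import unicodedata


def is_punct(ch):
    # Unicode general categories starting with "P" are punctuation
    # (covers ASCII marks as well as fullwidth ones such as "," and "。").
    return unicodedata.category(ch).startswith("P")


def classify_tts_input(text):
    """Label the input with one of the cases from this issue (hypothetical helper)."""
    stripped = text.strip()
    if not stripped:
        return "blank"                # case 1: "" or " "
    if is_punct(stripped[0]):
        return "leading_punct"        # case 2: "," or ",测试"
    for prev, cur in zip(stripped, stripped[1:]):
        if is_punct(prev) and is_punct(cur):
            return "consecutive_punct"  # case 3: "测试,,"
    return "ok"
```

Note that ordinary text such as a quotation that ends a clause can also contain adjacent punctuation marks, so a label of "consecutive_punct" is only a hint that the input may hit this problem, not proof that synthesis will fail.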

image

lym0302 (Contributor) commented Oct 8, 2022

Did the error occur while you were using the server? Did you test with a command like this one: paddlespeech_client tts --server_ip 127.0.0.1 --port 8090 --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav

yt605155624 (Collaborator) commented Oct 8, 2022

  1. When the text is a blank string, e.g. "" or " ", the input is empty and the error is:
KeyError: 'phone_ids'
  2. When the text begins with a punctuation mark, e.g. "," or ",测试", the error is:
[2022-10-08 10:10:37,743] [   ERROR] - (InvalidArgument) The depth of Input(X)'s dimension should be greater than pad_front in reflect mode, but received depth(25) and pad_front(27).
  [Hint: Expected in_depth > pads[4], but received in_depth:25 <= pads[4]:27.] (at /paddle/paddle/phi/kernels/gpu/pad3d_kernel.cu:384)
  [operator < pad3d > error]
Traceback (most recent call last):
  File "/home/yuantian01/yt_py37/lib/python3.7/site-packages/paddlespeech/server/engine/tts/python/tts_engine.py", line 231, in run
    text=sentence, lang=lang, am=self.config.am, spk_id=spk_id)
  File "/home/yuantian01/yt_py37/lib/python3.7/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/yuantian01/yt_py37/lib/python3.7/site-packages/paddle/fluid/dygraph/base.py", line 354, in _decorate_function
    return func(*args, **kwargs)
  File "/home/yuantian01/yt_py37/lib/python3.7/site-packages/paddlespeech/cli/tts/infer.py", line 474, in infer
    wav = self.voc_inference(mel)
  File "/home/yuantian01/yt_py37/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/yuantian01/yt_py37/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/yuantian01/yt_py37/lib/python3.7/site-packages/paddlespeech/t2s/models/melgan/melgan.py", line 570, in forward
    wav = self.melgan_generator.inference(normalized_mel)
  File "/home/yuantian01/yt_py37/lib/python3.7/site-packages/paddlespeech/t2s/models/melgan/melgan.py", line 270, in inference
    out = self.melgan(c)
  File "/home/yuantian01/yt_py37/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/yuantian01/yt_py37/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/yuantian01/yt_py37/lib/python3.7/site-packages/paddle/fluid/dygraph/container.py", line 98, in forward
    input = layer(input)
  File "/home/yuantian01/yt_py37/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/yuantian01/yt_py37/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/yuantian01/yt_py37/lib/python3.7/site-packages/paddlespeech/t2s/modules/residual_stack.py", line 112, in forward
    return self.stack(c) + self.skip_layer(c)
  File "/home/yuantian01/yt_py37/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/yuantian01/yt_py37/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/yuantian01/yt_py37/lib/python3.7/site-packages/paddle/fluid/dygraph/container.py", line 98, in forward
    input = layer(input)
  File "/home/yuantian01/yt_py37/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/yuantian01/yt_py37/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/yuantian01/yt_py37/lib/python3.7/site-packages/paddle/nn/layer/common.py", line 1020, in forward
    name=self._name)
  File "/home/yuantian01/yt_py37/lib/python3.7/site-packages/paddle/nn/functional/common.py", line 1368, in pad
    "data_format", data_format, "name", name)
ValueError: (InvalidArgument) The depth of Input(X)'s dimension should be greater than pad_front in reflect mode, but received depth(25) and pad_front(27).
  [Hint: Expected in_depth > pads[4], but received in_depth:25 <= pads[4]:27.] (at /paddle/paddle/phi/kernels/gpu/pad3d_kernel.cu:384)
  [operator < pad3d > error]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/yuantian01/yt_py37/lib/python3.7/site-packages/paddlespeech/server/restful/tts_api.py", line 113, in tts
    text, spk_id, speed, volume, sample_rate, save_path)
  File "/home/yuantian01/yt_py37/lib/python3.7/site-packages/paddlespeech/server/engine/tts/python/tts_engine.py", line 245, in run
    sys.exit(-1)
SystemExit: -1

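Because paddlespeech synthesizes sentence by sentence, the likely cause is that the mel features of the first sub-sentence are empty or too short.
  3. When the text contains consecutive punctuation marks, e.g. "测试,,", "测试,。" or "测试。“。", the error is the same as in case 2.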

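Case 1 needs an empty-input check. For cases 2 and 3, running paddlespeech tts --input "," directly does not raise an error, but paddlespeech tts --voc mb_melgan_csmsc --input "," does. The TTS server uses mb_melgan_csmsc by default, whereas the CLI defaults to hifigan. The error comes from a requirement of the mb_melgan network structure: it fails when the input mel is too short (mb_melgan needs an input mel of at least 6 frames).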

ValueError: (InvalidArgument) The depth of Input(X)'s dimension should be greater than pad_front in reflect mode, but received depth(25) and pad_front(27).
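As a rough illustration of why a too-short input triggers this message: reflect padding mirrors the sequence around its edge elements, so it can only add fewer elements than the sequence already has. The sketch below is a pure-Python stand-in for that constraint on a 1-D list, not the Paddle pad3d kernel itself; `reflect_pad_1d` is a hypothetical name, and the values 25 and 27 are the ones reported in the error above.

```python
def reflect_pad_1d(frames, pad):
    """Reflect-pad a list by `pad` elements on each side (illustrative only)."""
    n = len(frames)
    if pad >= n:
        # Same condition the error message describes: length must exceed pad.
        raise ValueError(
            f"reflect padding needs length > pad, "
            f"but received length({n}) and pad({pad})"
        )
    left = frames[pad:0:-1]            # mirror of frames[1..pad], edge excluded
    right = frames[-2:-pad - 2:-1]     # mirror of the last `pad` frames before the edge
    return left + frames + right


print(reflect_pad_1d([1, 2, 3], 2))

try:
    reflect_pad_1d(list(range(25)), 27)
except ValueError as err:
    print(err)
```

With a length of 25 and a pad of 27 there are simply not enough elements to mirror, which matches the shape of the failure reported for mb_melgan on very short mel inputs.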

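For case 1, when the input is empty there really is no audio to return, which I consider expected behavior. If you have a suggestion, feel free to submit a PR with a fix. The program fails at: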

phone_ids = frontend_dict['phone_ids']
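One way the empty-input check mentioned for case 1 might look is sketched below. It assumes only what the thread shows, namely that `frontend_dict` is a dict that lacks the 'phone_ids' key for blank text; the helper name `get_phone_ids` is hypothetical and this is not the fix adopted in paddlespeech.

```python
def get_phone_ids(frontend_dict):
    """Return phone_ids, or raise a descriptive error for empty input (hypothetical helper)."""
    phone_ids = frontend_dict.get("phone_ids")
    if phone_ids is None or len(phone_ids) == 0:
        # Replaces the bare KeyError: 'phone_ids' with a message the caller can act on.
        raise ValueError("input text produced no phonemes; nothing to synthesize")
    return phone_ids
```

Raising a specific exception keeps the "no audio for empty input" behavior while letting callers decide whether to skip the request or substitute silence.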

For cases 2 and 3, you can switch voc to hifigan_csmsc, or change

merge_sentences = False

to True.

You are welcome to debug this and report back, or to submit a PR with a fix.

tianyu8969 (Author) commented Oct 11, 2022

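Re "report back or submit a PR with a fix":

For case 1, could it return a silent audio clip, or how can I produce one? I need to play a stretch of silent audio, and paddlespeech cannot synthesize it.
For cases 2 and 3, switching voc to hifigan_csmsc synthesizes correctly, but it is very slow: a long text of 99 characters takes more than 20 seconds, whereas mb_melgan_csmsc takes only 3 seconds.
Setting merge_sentences = True has no effect; synthesis still fails.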
image

With the models speedyspeech_csmsc + mb_melgan_csmsc, cases 2 and 3 synthesize correctly, but case 1 still fails.
image

yt605155624 (Collaborator) commented

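For case 1, just wrap the call in your own try/except and generate silent audio when it fails.
For cases 2 and 3, as already explained, this is a requirement of the mb_melgan model itself; if you are interested, have a look at the paper and the code. speedyspeech_csmsc + mb_melgan_csmsc probably synthesizes correctly because speedyspeech happens to output a mel of >= 6 frames for "sp"; you can print the mel length yourself to check.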
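The try/except suggestion for case 1 could be sketched as follows, using only the Python standard library to build the silent clip. Everything here is an assumption for illustration: `synthesize` stands for whatever callable you use to turn text into WAV bytes, the function names are hypothetical, and the 24000 Hz sample rate and 0.5 s duration are placeholder values that should be matched to your model and use case.

```python
import io
import wave


def silent_wav_bytes(duration_s=0.5, sample_rate=24000):
    """Build a mono 16-bit PCM WAV containing only silence."""
    n_frames = int(duration_s * sample_rate)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)                      # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(b"\x00\x00" * n_frames)  # zero-valued samples are silence
    return buf.getvalue()


def synthesize_or_silence(text, synthesize, duration_s=0.5, sample_rate=24000):
    """Call the user-supplied `synthesize`; fall back to silence on failure."""
    if not text.strip():
        # Blank input: skip synthesis entirely.
        return silent_wav_bytes(duration_s, sample_rate)
    try:
        return synthesize(text)
    except (KeyError, ValueError):
        # The two exception types seen in this thread.
        return silent_wav_bytes(duration_s, sample_rate)
```

The traceback earlier in the thread shows the server engine calling sys.exit(-1) after the failure, which raises SystemExit rather than one of the exceptions caught here, so this wrapper is aimed at direct Python calls and would need adapting for the server path.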


3 participants