<a href="https://colab.research.google.com/github/DLSeed/DeepLearning/blob/main/%E2%80%9CSovits_(Rcell%E7%89%88%E7%8C%AB%E9%9B%B7)%E2%80%9D%E5%85%AC%E6%B5%8B%E7%89%88.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

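# Introduction
This notebook combines soft-vc with VITS following the approach of [Rcell](https://space.bilibili.com/343303724),
reuses the original Colab structure by [Francis-Komizu](https://space.bilibili.com/636704927), and keeps the name Sovits.

Rcell's version extracts f0 with the librosa module during synthesis, which is somewhat slow. This version uses the torchcrepe module instead, which reduces the time of the synthesis step by 30%.

`hubert.pt` is the content encoder model released by [soft-vc](https://github.com/bshall/hubert), and `generator_idxr.pth` is the model Rcell published on Hugging Face. Both are hosted on Google Drive to shorten download time.
[Sovits](https://github.com/IceKyrin/Sovits) is forked from Francis-Komizu's [GitHub repository](https://github.com/Francis-Komizu/Sovits). For convenience, it bundles the config.json for Rcell's pth and the official hubert module (modified to load the model from a local file).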

# Environment setup

In [None]:
!git clone https://github.com/IceKyrin/Sovits
%cd Sovits
!pip install -r requirements.txt
!pip install torchcrepe
%cd monotonic_align
!python setup.py build_ext --inplace
%cd ..
!mkdir results
!mkdir uploadings
!mkdir recordings

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import IPython.display as ipd

import os
import json
import math
import torch
import torchaudio
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader

import commons
import utils
from data_utils import UnitAudioLoader, UnitAudioCollate
from models import SynthesizerTrn
import requests

from scipy.io.wavfile import write

# Load the models

## Load the content encoder

In [None]:
import hubert
!gdown --id '1cA37nsiSnsouF2TJkaXb3_VoA-rbifTu' --output /content/Sovits/hubert/hubert.pt
hubert_soft = hubert.hubert_soft('/content/Sovits/hubert/hubert.pt')

## Load the generator

In [None]:
import librosa
import torch

import commons
import utils
from models import SynthesizerTrn
from text.symbols import symbols

!gdown --id '1gg1Igsa7nOtsLohtv-hNq2mmXCsbFqZJ' --output generator_idxr.pth

hps = utils.get_hparams_from_file("/content/Sovits/configs/ljs_base.json")
hps_ms = utils.get_hparams_from_file("/content/Sovits/configs/vctk_base.json")
net_g_ms = SynthesizerTrn(
    # len(symbols),
    hps_ms.data.filter_length // 2 + 1,
    hps_ms.train.segment_size // hps.data.hop_length,
    n_speakers=hps_ms.data.n_speakers,
    **hps_ms.model)
_ = utils.load_checkpoint("generator_idxr.pth", net_g_ms, None)
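
The first two positional arguments passed to `SynthesizerTrn` above are derived from the config: the number of linear-spectrogram bins and the training segment length in frames. Below is a quick arithmetic sketch of those two expressions. The numbers are an assumption (the stock VITS base values), and the JSON files bundled in this repo may differ.

```python
# Assumed values from the stock VITS base configs; check the JSON files
# under /content/Sovits/configs/ for the values actually in use.
filter_length = 1024   # FFT size
hop_length = 256       # samples between consecutive frames
segment_size = 8192    # samples per training segment

spec_channels = filter_length // 2 + 1        # one-sided FFT bins
segment_frames = segment_size // hop_length   # frames per training segment

print(spec_channels, segment_frames)
```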


# Voice conversion

Choose **one** of the two input options (1 or 2) below for voice conversion!

1. Use a reference audio clip

In [None]:
# Pick one of the demos
!gdown --id '1p-LO3kG7E6VpY-p-2-AgRmkiJWYFDC8V' --output demo.wav
# !gdown --id '10JQMPdzp0gjg9cVVersxVZWhIr4UwrFF' --output demo.wav

source_path = 'demo.wav'

2. Use uploaded audio (recommended: under 30 s, mono, 22050 Hz; other formats may cause bugs)

In [None]:
from google.colab import files
import shutil

uploaded = files.upload()
old_path = list(uploaded.keys())[0]
new_path = f'uploadings/{old_path}'
shutil.move(old_path, new_path)
source_path = new_path

source, sr = torchaudio.load(source_path)
source = torchaudio.functional.resample(source, sr, 22050)
source = source.unsqueeze(0)
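
The recommendations for uploaded audio (mono, 22050 Hz, under 30 s) can be checked before running the conversion. Below is a minimal sketch using only the standard-library `wave` module; `check_wav` is a hypothetical helper, not part of the Sovits repo, and it assumes the file is an uncompressed PCM WAV.

```python
import wave


def check_wav(path, expected_rate=22050, max_seconds=30.0):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper; only handles uncompressed PCM WAV files.
    """
    problems = []
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
        seconds = wf.getnframes() / rate
    if channels != 1:
        problems.append(f"expected mono, got {channels} channels")
    if rate != expected_rate:
        problems.append(f"expected {expected_rate} Hz, got {rate} Hz")
    if seconds > max_seconds:
        problems.append(f"expected at most {max_seconds:.0f} s, got {seconds:.1f} s")
    return problems
```

Calling `check_wav(source_path)` after the upload and printing the result would flag files that are likely to cause trouble later on.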

3. Synthesize the audio

In [None]:
import librosa
import soundfile
import torchcrepe
import torchaudio
import numpy as np
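def resize2d(source, target_len):
    # Mark unvoiced frames (f0 near 0) as NaN so interpolation does not blend
    # voiced pitch with 0; NaNs are turned back into 0 at the end.
    source[source<0.001] = np.nan
    # Float rounding in arange can yield one extra point, so trim to target_len.
    target = np.interp(np.arange(0, len(source), len(source) / target_len)[:target_len], np.arange(0, len(source)), source)
    return np.nan_to_num(target)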
def convert_wav_22050_to_f0(audio):
    tmp = librosa.pyin(audio,
                fmin=librosa.note_to_hz('C0'),
                fmax=librosa.note_to_hz('C7'),
                frame_length=1780)[0]
    print(tmp)
    f0 = np.zeros_like(tmp)
    f0[tmp>0] = tmp[tmp>0]
    return f0

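def convert_wav_22050_to_f1(audio):
    # torchcrepe expects a float tensor of shape (1, samples);
    # `audio` is the 22050 Hz signal passed in by the caller.
    audio = torch.from_numpy(np.asarray(audio, dtype=np.float32)).unsqueeze(0)
    tmp = torchcrepe.predict(audio=audio,
                  fmin=50,
                  fmax=550,
                  sample_rate=22050,
                  model='full',
                  batch_size=1780, device='cuda:0').cpu().numpy()[0]
    # Keep positive pitch values; everything else stays 0.
    f0 = np.zeros_like(tmp)
    f0[tmp > 0] = tmp[tmp > 0]
    return f0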


# Original version (source audio, kept for playback)
r_source, r_sr = torchaudio.load(source_path)
r_resampler = torchaudio.transforms.Resample(r_sr, 22050)
r_source = r_resampler(r_source)
r_source = r_source.unsqueeze(0)


vc_transform = 1

audio, sampling_rate = soundfile.read(source_path)
if sampling_rate != 16000:
  audio = librosa.resample(audio, orig_sr=sampling_rate, target_sr=16000)

audio22050 = librosa.resample(audio, orig_sr=16000, target_sr=22050)

# Swap the function below to switch back to Rcell's original; this version uses torchcrepe to speed up f0 extraction
f0 = convert_wav_22050_to_f1(audio22050)
# f0 = convert_wav_22050_to_f0(audio22050)

source = torch.FloatTensor(audio).unsqueeze(0).unsqueeze(0)
print(source.shape)
with torch.inference_mode():
    units = hubert_soft.units(source)
    soft = units.squeeze(0).numpy()
    print(sampling_rate)
    f0 = resize2d(f0, len(soft[:, 0])) * vc_transform
    soft[:, 0] = f0 / 10
sid = torch.LongTensor([0])
stn_tst = torch.FloatTensor(soft)
with torch.no_grad():
    x_tst = stn_tst.unsqueeze(0)
    x_tst_lengths = torch.LongTensor([stn_tst.size(0)])
    audio = net_g_ms.infer(x_tst, x_tst_lengths,sid=sid, noise_scale=0, noise_scale_w=0, length_scale=1)[0][
        0, 0].data.float().numpy()
print("Source:")
ipd.display(ipd.Audio(r_source.squeeze(), rate=22050))  # r_source was resampled to 22050 Hz above
print("Converted:")
ipd.display(ipd.Audio(audio, rate=hps.data.sampling_rate))
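
In the cell above, the f0 contour is resampled to the number of HuBERT frames, multiplied by `vc_transform`, divided by 10, and written into the first feature channel. The following is a minimal sketch of the resample-and-scale part only, in plain numpy. `align_f0` and `semitones_to_ratio` are hypothetical names, and the semitone-to-ratio mapping is standard equal temperament rather than something the notebook defines. Unlike `resize2d`, this simplified variant does not mask unvoiced frames.

```python
import numpy as np


def semitones_to_ratio(n):
    """Frequency ratio for a shift of n semitones (equal temperament)."""
    return 2.0 ** (n / 12.0)


def align_f0(f0, n_frames, ratio=1.0):
    """Linearly resample an f0 contour to n_frames and scale it by ratio.

    Hypothetical helper: it treats unvoiced frames (0 Hz) as ordinary
    values instead of masking them with NaN.
    """
    f0 = np.asarray(f0, dtype=np.float64)
    # Positions of the new frames on the old frame axis.
    positions = np.linspace(0, len(f0) - 1, num=n_frames)
    return np.interp(positions, np.arange(len(f0)), f0) * ratio


f0 = np.array([100.0, 110.0, 120.0, 130.0])  # made-up contour in Hz
aligned = align_f0(f0, n_frames=7, ratio=semitones_to_ratio(12))
print(aligned)
```

Under this assumption, setting `vc_transform = semitones_to_ratio(12)` in the cell above would raise the converted voice by one octave, and a negative argument would lower it.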

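### Save

Set a file name for the synthesized audio. Do not include the file extension!

After naming it, run the cell; you will find the file in the `/content/Sovits/results/` folder of the file browser on the left!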

In [None]:
filename = 'natsume' #@param {type: "string"}
audio_path = f'/content/Sovits/results/{filename}.wav'
write(audio_path, 22050, audio)
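
`scipy.io.wavfile.write` stores samples in the dtype it is given, so the float32 array produced by the synthesis step is saved as a 32-bit float WAV, which some players and editors may not open. One option is to convert to 16-bit PCM before writing. Below is a minimal sketch, assuming the samples lie roughly in [-1, 1]; `float_to_int16` is a hypothetical helper, not part of the repo.

```python
import numpy as np


def float_to_int16(audio):
    """Convert float audio in [-1, 1] to 16-bit PCM, clipping out-of-range samples."""
    audio = np.clip(np.asarray(audio, dtype=np.float32), -1.0, 1.0)
    return (audio * 32767.0).astype(np.int16)


pcm = float_to_int16([0.0, 1.0, -1.0, 2.0])
print(pcm)
```

With this helper, the save step could be written as `write(audio_path, 22050, float_to_int16(audio))`.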

# References

https://github.com/bshall/soft-vc

[Any-to-one voice conversion based on VITS and SoftVC](https://www.bilibili.com/video/BV1S14y1x78X?share_source=copy_web&vd_source=630b87174c967a898cae3765fba3bfa8)

