# **[Diff-SVC](https://github.com/prophesier/diff-svc)**
Singing Voice Conversion via diffusion model

____

####  **Notebook put together by [justinjohn-03](https://github.com/justinjohn0306)**

## **Special thanks to [prophesier](https://github.com/prophesier) and [UtaUtaUtau](https://github.com/UtaUtaUtau)**

In [3]:
#@title # Setup
#@markdown ## Install Diff-SVC
from IPython.display import clear_output 
from google.colab import files 
import gdown
import os

!rm -rf /content/sample_data


Mode = "install" #@param ["install", "update", "remove"]
Repository = "Official Diff-SVC" #@param ["Official Diff-SVC", "UtaUtaUtau"]
Branch_name = "" #@param {type:"string"}

repositories = {
  'Official Diff-SVC':'prophesier',
  'UtaUtaUtau':'UtaUtaUtau'
}

from pathlib import Path
if Mode == 'install':
  git_cmd = ''
  if Branch_name: git_cmd += f"-b {Branch_name} "

  git_cmd += f"--depth 1 https://github.com/{repositories[Repository]}/diff-svc.git"
  !git clone $git_cmd
  %cd /content/diff-svc
  print('Installing torch')
  !pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
  !pip install -r requirements_short.txt
  !pip install "tensorboard<2.9,>=2.8"
  %reload_ext tensorboard
  print('Downloading pretrained models')
  %cd "/content/"
  %mkdir -p /content/diff-svc/checkpoints/
  %cd "/content/diff-svc/"
  !wget https://download1592.mediafire.com/f8p8qf1fp4cg/d1zcrvki20zc0bo/checkpoints.zip -O checkpoints.zip
  # wget saved the archive in the current directory (/content/diff-svc/)
  !unzip checkpoints.zip -d /content/diff-svc/

  clear_output()

  print('Done!')

elif Mode == 'update':
  %cd /content/diff-svc
  !git pull
  !pip install -r requirements_short.txt
  clear_output()
  print("Done!")
else:
  answer = input("Are you sure you want to delete diff-svc folder? (y/n)").lower()
  while answer not in ["y", "n"]:
    print("Invalid input")
    answer = input("Are you sure you want to delete diff-svc folder? (y/n)").lower()
  if answer == "y":
    %cd /content
    %rm -r diff-svc/
    print("Done!")
  else:
    print("Cancelled...")

Done!


In [None]:
#@markdown ## Mount your Google Drive
#@markdown (This step is required if you want to load your own trained model)
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import gdown

#@title # **Load model**

#@markdown ### **Load the pretrained model (default)**


#@markdown ___

#@markdown Note: To use your own model, enter the full path to its most recent checkpoint on Google Drive, the path to its config file, and the speaker's name.

#@markdown Example:

#@markdown ``project_name`` is the name of your speaker

#@markdown ``model_path: /content/drive/MyDrive/Diff-SVC/checkpoints/model_name/model_ckpt_steps_50000.ckpt``

#@markdown ``config_path: /content/drive/MyDrive/Diff-SVC/checkpoints/model_name/config.yaml``


#@markdown ___

#@markdown ### **Set model location with the name of the speaker:**
#@markdown *If you wish to use the pre-trained model and don't have your own model, leave these at their default values.*

#@markdown ___

%cd "/content/diff-svc/"

os.environ['PYTHONPATH']='.'

# Set the variable in this process; a "!" shell assignment would not persist.
os.environ['CUDA_VISIBLE_DEVICES']='0'


from utils.hparams import hparams
from preprocessing.data_gen_utils import get_pitch_parselmouth,get_pitch_crepe
import numpy as np
import matplotlib.pyplot as plt
import IPython.display as ipd
import utils
import librosa
import torchcrepe
from infer import *
import logging
from infer_tools.infer_tool import *
gdrive_id_model = "" #@param {type: "string"}
gdrive_id_config= "" #@param {type: "string"}

%cd "/content/"
# Download only when a Google Drive file ID was entered; the defaults are empty.
if gdrive_id_model:
  gdown.download(
       "https://drive.google.com/uc?export=download&confirm=pbef&id=" + str(gdrive_id_model),
  )

if gdrive_id_config:
  gdown.download(
       "https://drive.google.com/uc?export=download&confirm=pbef&id=" + str(gdrive_id_config),
  )
%cd "/content/diff-svc/"
logging.getLogger('numba').setLevel(logging.WARNING)

# Project folder name, the one used during training
project_name = "" #@param {type: "string"}
model_path = "" #@param {type: "string"}
config_path="" #@param {type: "string"}
hubert_gpu=True
svc_model = Svc(project_name,config_path,hubert_gpu, model_path)
print('model loaded')
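
The note above asks for the most recent checkpoint. As a minimal sketch, assuming checkpoints follow the `model_ckpt_steps_<N>.ckpt` naming shown in the example path, the helper below picks the file with the highest step count. `find_latest_checkpoint` is a hypothetical name and is not part of Diff-SVC.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed filename pattern, taken from the example model_path above.
_CKPT_PATTERN = re.compile(r"model_ckpt_steps_(\d+)\.ckpt$")


def find_latest_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Return the checkpoint with the highest step count, or None if none match."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None
    best_path, best_step = None, -1
    for path in directory.iterdir():
        match = _CKPT_PATTERN.search(path.name)
        # Compare step counts as integers: a plain string sort would rank
        # "steps_9000" after "steps_50000".
        if match and int(match.group(1)) > best_step:
            best_path, best_step = path, int(match.group(1))
    return best_path
```

The result, if not `None`, could be converted with `str(...)` and entered as `model_path`; the config path still has to be set by hand.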

In [None]:
#@markdown ## Upload your reference audio

%cd "/content/diff-svc/raw/"

print("\n\033[34m\033[1mupload your reference audio")
listfn, length = files.upload().popitem()

%cd "/content/diff-svc/"
print("\n\033[32m\033[1mdone")

In [None]:
#@markdown # Input audio and adjust parameters

%cd "/content/diff-svc/"

#@markdown ___

#@markdown #### ``Path of input audio, default path is located in root directory``
wav_fn='raw/1_videoplayback (8)_(Vocals).mp3' #@param {type: "string"}
demoaudio, sr = librosa.load(wav_fn)

#@markdown ___

#@markdown #### ``Pitch shift in semitones, applied to the raw audio before rendering. If the raw input is a male voice and the target voice is female, try a value such as 8 or 12 (12 shifts up a whole octave).``
key = 0#@param {type: "integer"}
# Speedup multiplier
#@markdown ___

#@markdown #### ``Inference speedup multiplier. The default schedule is 1000 steps, so a value of 10 renders in only 100 steps. Values up to 50 (rendering in 20 steps) cause no audible quality loss; higher values may start to degrade quality.``

#@markdown ___


#@markdown #### Note: ``If use_gt_mel is set to True below, keep this value lower than the add_noise_step value and choose a value that divides 1000 evenly.``


pndm_speedup = 20 #@param {type: "integer"}

#@markdown ___

#@markdown #### ``Path of output audio, default path is located in root directory``
wav_gen='test_output.wav' #@param {type: "string"}

#@markdown ___

#@markdown #### ``Related to the use_gt_mel parameter: controls the balance between the input and target voice. A value of 1 is entirely the raw input and a value of 1000 is entirely the target voice; the tones audibly mix at around 300. The scale is not linear. If this parameter is set very low, you can decrease the pndm_speedup value for higher rendering quality.``
add_noise_step = 1000 #@param {type: "integer"}
#@markdown ___
#@markdown #### ``Crepe's noise filter threshold. You can increase the value if the raw audio is clean; if there is a lot of noise, keep or decrease the value. Setting the use_crepe parameter to False disables this parameter.``
thre = 0.02 #@param {type: "number"}
#@markdown ___
#@markdown #### ``Crepe is an F0 calculation algorithm that is accurate but slow. Setting this to False switches the F0 calculation from crepe to parselmouth, which is faster but of lower quality.``
use_crepe= True #@param {type:"boolean"}
#@markdown ___
#@markdown #### ``F0 extraction for mel spectrogram rendering. False uses the raw input's F0 for rendering. The output usually differs between True and False; True usually yields better results, but not always. Neither value affects rendering speed much. This can be changed regardless of the key value.``
use_pe=True #@param {type:"boolean"}
#@markdown ___
#@markdown #### ``This option is similar to the image-to-image function of AI art generation. If set to True, the output audio will be a mix of the input voice and the target voice; the proportion of each is set by the add_noise_step parameter.``

#@markdown #### ``NOTE: If this parameter is set to True, keep the key parameter at 0, as rendering with a shifted pitch is not supported in this mode.``
use_gt_mel= False #@param {type:"boolean"}
#@markdown ___

f0_tst, f0_pred, audio = run_clip(svc_model,file_path=wav_fn, key=key, acc=pndm_speedup, use_crepe=use_crepe, use_pe=use_pe, thre=thre,
                                        use_gt_mel=use_gt_mel, add_noise_step=add_noise_step,project_name=project_name,out_path=wav_gen)
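
The speedup arithmetic described in the form notes above can be checked before rendering. This is a minimal sketch that only encodes what the notes state: a 1000-step default schedule divided by the speedup, and the two constraints that apply when `use_gt_mel` is True. `sampling_steps` and `total_steps` are hypothetical names, not part of Diff-SVC, and the function does not model how `run_clip` schedules steps internally.

```python
def sampling_steps(pndm_speedup: int, use_gt_mel: bool = False,
                   add_noise_step: int = 1000, total_steps: int = 1000) -> int:
    """Return total_steps // pndm_speedup after checking the documented constraints."""
    if pndm_speedup < 1:
        raise ValueError("pndm_speedup must be at least 1")
    if use_gt_mel:
        # Constraints from the note: stay below add_noise_step and divide 1000 evenly.
        if pndm_speedup >= add_noise_step:
            raise ValueError("pndm_speedup must be lower than add_noise_step")
        if total_steps % pndm_speedup != 0:
            raise ValueError("pndm_speedup must divide the total step count evenly")
    return total_steps // pndm_speedup
```

For example, `sampling_steps(10)` gives 100 and `sampling_steps(50)` gives 20, matching the figures quoted in the notes.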

In [None]:
#@markdown #Display results
ipd.display(ipd.Audio(demoaudio, rate=sr))
ipd.display(ipd.Audio(audio, rate=hparams['audio_sample_rate'], normalize=False))

In [None]:
#@markdown #Display graph

#f0_gen,_=get_pitch_crepe(*vocoder.wav2spec(wav_gen),hparams,threshold=0.05)
%matplotlib inline
f0_gen,_=get_pitch_parselmouth(*svc_model.vocoder.wav2spec(wav_gen),hparams)
f0_tst[f0_tst==0]=np.nan#ground truth f0
f0_pred[f0_pred==0]=np.nan#f0 pe predicted
f0_gen[f0_gen==0]=np.nan#f0 generated
fig=plt.figure(figsize=[15,5])
plt.plot(np.arange(0,len(f0_tst)),f0_tst,color='black')
plt.plot(np.arange(0,len(f0_pred)),f0_pred,color='orange')
plt.plot(np.arange(0,len(f0_gen)),f0_gen,color='red')
plt.axhline(librosa.note_to_hz('C4'),ls=":",c="blue")
plt.axhline(librosa.note_to_hz('G4'),ls=":",c="green")
plt.axhline(librosa.note_to_hz('C5'),ls=":",c="orange")
plt.axhline(librosa.note_to_hz('F#5'),ls=":",c="red")
#plt.axhline(librosa.note_to_hz('A#5'),ls=":",c="black")
plt.show()
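
The dotted reference lines in the plot sit at the frequencies of C4, G4, C5 and F#5. As a standard-library sketch of the equal-temperament relation behind those values (with A4 = 440 Hz), the block below converts a note name to Hz and shows the frequency ratio that a shift of `key` semitones corresponds to. `note_to_hz` and `shift_hz` here are illustrative helpers, not librosa's implementation, and how Diff-SVC applies `key` internally is not shown.

```python
# Semitone offsets of the natural notes within an octave, counted from C.
NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def note_to_hz(note: str) -> float:
    """Convert a note name such as 'C4' or 'F#5' to Hz (single-digit octaves only)."""
    name, octave = note[:-1], int(note[-1])
    semitone = NOTE_OFFSETS[name[0]] + name.count("#") - name.count("b")
    midi = 12 * (octave + 1) + semitone  # MIDI number, where A4 is 69
    return 440.0 * 2.0 ** ((midi - 69) / 12)


def shift_hz(freq_hz: float, key: int) -> float:
    """Shift a frequency by `key` semitones; 12 semitones doubles it."""
    return freq_hz * 2.0 ** (key / 12)
```

With these definitions, `note_to_hz('C4')` is about 261.63 Hz, and `shift_hz(440.0, 12)` is 880.0 Hz, one octave up.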