<a href="https://colab.research.google.com/github/SusanSuY/DiffSingerColabNotebook/blob/main/Kei's_DiffSinger_colab_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Start with an NNSVS database. Run the NNSVS training recipe with both the start stage and the stop stage set to 0, so that only Stage 0 runs, then save the resulting `ETK/train/data/acoustic` folder.
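
The converter used later in this notebook reads three subfolders from that folder: `label_phone_score`, `label_phone_align`, and `wav`. A quick check along these lines can confirm the saved folder is complete before you upload it. This is a minimal sketch; the helper name `missing_subdirs` is made up for illustration and is not part of NNSVS.

```python
from pathlib import Path

# Subfolders that the nnsvs2opencpop.py cell in this notebook expects to find.
REQUIRED_SUBDIRS = ("label_phone_score", "label_phone_align", "wav")


def missing_subdirs(acoustic_dir):
    """Return the required subfolder names that are absent from acoustic_dir."""
    root = Path(acoustic_dir)
    return [name for name in REQUIRED_SUBDIRS if not (root / name).is_dir()]
```

An empty list means all three subfolders are present.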

In [None]:
#@markdown # Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#@markdown # Install NNSVS
!git clone https://github.com/nnsvs/nnsvs.git
%cd nnsvs
!pip install -e ".[dev,lint,test]"

In [None]:
#@markdown # Patch nnsvs2opencpop.py
%%writefile /content/nnsvs/utils/nnsvs2opencpop.py
"""Convert NNSVS's segmented data to Opencpop's structure

so that the code for DiffSinger can be used.
"""
import argparse
import re
import shutil
import sys
from pathlib import Path

import librosa
from nnmnkwii.frontend import merlin as fe
from nnmnkwii.io import hts
from tqdm.auto import tqdm


def note_by_regex(regex, s):
    match = re.search(regex, s)
    if match is None:
        return 'rest'
    return match.group(1)


def numeric_feature_by_regex(regex, s):
    match = re.search(regex, s)
    if match is None:
        return 0
    return int(match.group(1))


def get_parser():
    parser = argparse.ArgumentParser(
        description="Convert NNSVS's segmented data to Opencpop's structure",
    )
    parser.add_argument("in_dir", type=str, help="Path to input dir")
    parser.add_argument("out_dir", type=str, help="Output directory")
    return parser


if __name__ == "__main__":
    args = get_parser().parse_args(sys.argv[1:])
    in_dir = Path(args.in_dir)
    out_dir = Path(args.out_dir)

    label_score_dir = in_dir / "label_phone_score"
    label_align_dir = in_dir / "label_phone_align"
    in_wav_dir = in_dir / "wav"

    out_wav_dir = out_dir / "wavs"
    out_wav_dir.mkdir(exist_ok=True, parents=True)

    label_score_files = sorted(label_score_dir.glob("*.lab"))
    utt_ids = [f.stem for f in label_score_files]

    rows = []
    for utt_id in tqdm(utt_ids):
        if utt_id in ["namine_ritsu_hana_seg12"]:
            continue
        label_score = hts.load(label_score_dir / f"{utt_id}.lab")
        label_align = hts.load(label_align_dir / f"{utt_id}.lab")

        ph = [
            re.search(r"\-(.*?)\+", context).group(1)
            for context in label_score.contexts
        ]
        note = [
            note_by_regex(r"/E:([A-Z][b]?[0-9]+)]", context)
            for context in label_score.contexts
        ]
        note_dur = [
            numeric_feature_by_regex(r"@(\d+)#", context) / 100.0
            for context in label_score.contexts
        ]
        # Durations are returned in frames; multiplying by 0.005 converts 5 ms frames to seconds.
        ph_dur = fe.duration_features(label_align).reshape(-1) * 0.005
        is_slur = [0] * len(ph_dur)
        #assert len(ph) == len(note) == len(note_dur) == len(ph_dur) == len(is_slur)
        cols = [
            utt_id,
            " ".join(ph),
            " ".join(ph),
            " ".join(str(n) for n in note),
            " ".join(str(n) for n in note_dur),
            " ".join(str(round(n, 3)) for n in ph_dur),
            " ".join(str(n) for n in is_slur),
        ]
        rows.append("|".join(cols))
        shutil.copyfile(in_wav_dir / f"{utt_id}.wav", out_wav_dir / f"{utt_id}.wav")

    with open(out_dir / "transcriptions.txt", "w") as f:
        for row in rows:
            f.write(row + "\n")
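
Each row written by the script above has seven `|`-separated columns: the utterance ID, the phoneme sequence twice, the note names, the note durations, the phoneme durations, and the slur flags. The sketch below parses one such row and checks that the per-phoneme columns have equal length, which is what the commented-out assertion in the script would enforce. The field names and the sample row are illustrative assumptions, not names taken from DiffSinger or Opencpop.

```python
# Column labels are illustrative; the script itself only fixes the column order.
FIELDS = ("utt_id", "text", "ph", "note", "note_dur", "ph_dur", "is_slur")


def parse_row(row):
    """Split one transcriptions.txt row into a dict of per-phoneme lists."""
    cols = row.rstrip("\n").split("|")
    if len(cols) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
    record = dict(zip(FIELDS, cols))
    for key in FIELDS[1:]:
        record[key] = record[key].split(" ")
    record["note_dur"] = [float(x) for x in record["note_dur"]]
    record["ph_dur"] = [float(x) for x in record["ph_dur"]]
    record["is_slur"] = [int(x) for x in record["is_slur"]]
    return record


def lengths_match(record):
    """True when every per-phoneme column has the same number of items."""
    return len({len(record[key]) for key in FIELDS[1:]}) == 1


row = "demo_seg0|k a|k a|C4 C4|0.5 0.5|0.1 0.4|0 0"  # made-up sample row
record = parse_row(row)
print(record["ph"], record["ph_dur"], lengths_match(record))
```

A row for which `lengths_match` returns `False` is worth inspecting before preprocessing.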

In [None]:
#@markdown # Convert your NNSVS dataset to Opencpop format
#@markdown Upload the contents of your `ETK/train/data/acoustic` folder to this Colab session, paste the path of the uploaded folder into the box below, then run this cell.
acoustic_data_path = ''#@param{type:'string'}
!python utils/nnsvs2opencpop.py {acoustic_data_path} /content/segments

In [None]:
# @markdown # Alternative - Upload Opencpop formatted data
# @markdown If you have already converted your NNSVS database to Opencpop format, you may upload it as a zip file to the session (or to Drive) and unpack it here.
opencpop_data_path = ''#@param{type:'string'}
!unzip {opencpop_data_path} -d /content/segments

In [None]:
#@markdown # Install DiffSinger
%cd /content
!git clone https://github.com/openvpi/DiffSinger
!pip install onnx==1.12.0 onnxsim==0.4.10 protobuf==3.13.0
!pip3 install torch==1.8.2 torchvision==0.9.2 torchaudio==0.8.2 --extra-index-url https://download.pytorch.org/whl/lts/1.8/cu111
!wget https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip
!7za  -bso0 -y x /content/nsf_hifigan_20221211.zip -o/content/DiffSinger/checkpoints
%cd DiffSinger
!pip install -r requirements.txt

Take the folder called `segments` and drag it into the following directory, where {speaker_name} is the name of your singer: `/content/DiffSinger/data/raw/{speaker_name}`
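
The same move can also be done in code instead of the file browser. A minimal sketch, assuming the converted data is in `/content/segments`; the helper `place_segments` and the singer name `my_singer` are placeholders chosen for illustration.

```python
import shutil
from pathlib import Path


def place_segments(segments_dir, raw_root, speaker_name):
    """Move segments_dir into raw_root/speaker_name and return its new path."""
    target_parent = Path(raw_root) / speaker_name
    target_parent.mkdir(parents=True, exist_ok=True)
    destination = target_parent / Path(segments_dir).name
    shutil.move(str(segments_dir), str(destination))
    return destination


# In this notebook the call would look like this ("my_singer" is a placeholder):
# place_segments("/content/segments", "/content/DiffSinger/data/raw", "my_singer")
```

The sketch assumes the destination does not exist yet. If it does, `shutil.move` places the folder inside the existing one instead.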

Make a dictionary. You can copy and modify one from an NNSVS dic. Each line contains a syllable, then a tab, then that syllable's phonemes separated by spaces. An entry can have at most two phonemes. It should be in the following format: \
`a	a` \
`ka	k a` \
`sa	s a` \
When you are done making your dictionary, upload it to `/content/DiffSinger/dictionaries`.
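
Before uploading, the format described above can be sanity-checked with a short script. This is a minimal sketch under the rules stated here (one tab per line, at most two phonemes); the function name `check_dictionary` and the sample lines are illustrative and not part of DiffSinger.

```python
def check_dictionary(lines):
    """Return (line_number, problem) tuples for entries that break the format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        parts = line.split("\t")
        if len(parts) != 2:
            problems.append((number, "expected exactly one tab"))
            continue
        syllable, phonemes = parts
        count = len(phonemes.split())
        if not syllable.strip() or count == 0:
            problems.append((number, "empty syllable or phoneme list"))
        elif count > 2:
            problems.append((number, "more than two phonemes"))
    return problems


# Made-up sample: the third line uses a space instead of a tab,
# and the fourth line has three phonemes.
sample = ["a\ta", "ka\tk a", "sa s a", "kya\tk y a"]
print(check_dictionary(sample))
```

To check a real file, pass the open file object, for example `check_dictionary(open(path, encoding="utf-8"))`, where `path` points to your dictionary.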

In [None]:
# @markdown # Edit Config
# @markdown  Edit the following settings before running this cell:
# @markdown   - speakers: replace "opencpop" with the name of your singer
# @markdown   - test_prefixes: replace the numbers with some wav file names (without extensions) from your segments folder
# @markdown   - raw_data_dir: replace "opencpop" with the name of your singer
# @markdown   - binary_data_dir: replace "opencpop" with the name of your singer
# @markdown   - dictionary: replace with the path to your dictionary
%%writefile /content/DiffSinger/configs/acoustic.yaml
base_config:
  - configs/base.yaml

task_cls: training.acoustic_task.AcousticTask
num_spk: 1
speakers:
  - opencpop
spk_ids: []
test_prefixes: [
    '2044',
    '2086',
    '2092',
    '2093',
    '2100',
]

vocoder: NsfHifiGAN
vocoder_ckpt: checkpoints/nsf_hifigan/model
audio_sample_rate: 44100
audio_num_mel_bins: 128
hop_size: 512            # Hop size.
fft_size: 2048           # FFT size.
win_size: 2048           # Window size.
fmin: 40
fmax: 16000

binarization_args:
  shuffle: true
  num_workers: 0
augmentation_args:
  random_pitch_shifting:
    enabled: false
    range: [-5., 5.]
    scale: 1.0
  fixed_pitch_shifting:
    enabled: false
    targets: [-5., 5.]
    scale: 0.75
  random_time_stretching:
    enabled: false
    range: [0.5, 2.]
    domain: log  # or linear
    scale: 1.0

raw_data_dir: 'data/raw/opencpop/segments/diffsinger_db'
binary_data_dir: 'data/binary/opencpop'
binarizer_cls: preprocessing.acoustic_binarizer.AcousticBinarizer
dictionary: dictionaries/opencpop-extension.txt
num_pad_tokens: 1
spec_min: [-5]
spec_max: [0]
mel_vmin: -6. #-6.
mel_vmax: 1.5
interp_uv: true
energy_smooth_width: 0.12
breathiness_smooth_width: 0.12

use_spk_id: false
f0_embed_type: continuous
use_energy_embed: false
use_breathiness_embed: false
use_key_shift_embed: false
use_speed_embed: false

K_step: 1000
timesteps: 1000
max_beta: 0.02
rel_pos: true
diff_accelerator: ddim
pndm_speedup: 10
hidden_size: 256
residual_layers: 20
residual_channels: 512
dilation_cycle_length: 4  # *
diff_decoder_type: 'wavenet'
diff_loss_type: l2
schedule_type: 'linear'

# train and eval
num_sanity_val_steps: 1
optimizer_args:
  lr: 0.0004
lr_scheduler_args:
  step_size: 50000
  gamma: 0.5
max_batch_frames: 80000
max_batch_size: 48
val_with_vocoder: true
val_check_interval: 2000
num_valid_plots: 10
max_updates: 320000
num_ckpt_keep: 5
permanent_ckpt_start: 200000
permanent_ckpt_interval: 40000


finetune_enabled: false
finetune_ckpt_path: null

finetune_ignored_params:
  - model.fs2.encoder.embed_tokens
  - model.fs2.txt_embed
  - model.fs2.spk_embed
finetune_strict_shapes: true

freezing_enabled: false
frozen_params: []

In [None]:
#@markdown # Set output directory of checkpoints
#@markdown Setting this to a folder on your Google Drive is recommended, so that you don't lose checkpoints if Colab disconnects.
%%writefile /content/DiffSinger/utils/hparams.py
output_directory = ''#@param{type:'string'}

import argparse
import multiprocessing
import os
import re
import shutil

import yaml

global_print_hparams = True
hparams = {}


class Args:
    def __init__(self, **kwargs):
        for k, v in kwargs.items():
            self.__setattr__(k, v)


def override_config(old_config: dict, new_config: dict):
    for k, v in new_config.items():
        if isinstance(v, dict) and k in old_config:
            override_config(old_config[k], new_config[k])
        else:
            old_config[k] = v


def set_hparams(config='', exp_name='', hparams_str='', print_hparams=True, global_hparams=True):
    """
        Load hparams from multiple sources:
        1. config chain (i.e. first load base_config, then load config);
        2. unless reset is set, load from the (auto-saved) complete config file ('config.yaml'),
           which contains all settings and does not rely on base_config;
        3. load from argument --hparams or hparams_str, as temporary modification.
    """
    if config == '':
        parser = argparse.ArgumentParser(description='neural music')
        parser.add_argument('--config', type=str, default='',
                            help='path to the config file')
        parser.add_argument('--exp_name', type=str, default='', help='exp_name')
        parser.add_argument('--hparams', type=str, default='',
                            help='comma-separated key=value overrides')
        parser.add_argument('--infer', action='store_true', help='infer')
        parser.add_argument('--validate', action='store_true', help='validate')
        parser.add_argument('--reset', action='store_true', help='reset hparams')
        parser.add_argument('--debug', action='store_true', help='debug')
        args, unknown = parser.parse_known_args()
    else:
        args = Args(config=config, exp_name=exp_name, hparams=hparams_str,
                    infer=False, validate=False, reset=False, debug=False)

    args_work_dir = ''
    if args.exp_name != '':
        args.work_dir = args.exp_name
        args_work_dir = output_directory

    config_chains = []
    loaded_config = set()

    def load_config(config_fn):  # deep first
        with open(config_fn, encoding='utf-8') as f:
            hparams_ = yaml.safe_load(f)
        loaded_config.add(config_fn)
        if 'base_config' in hparams_:
            ret_hparams = {}
            if not isinstance(hparams_['base_config'], list):
                hparams_['base_config'] = [hparams_['base_config']]
            for c in hparams_['base_config']:
                if c not in loaded_config:
                    if c.startswith('.'):
                        c = f'{os.path.dirname(config_fn)}/{c}'
                        c = os.path.normpath(c)
                    override_config(ret_hparams, load_config(c))
            override_config(ret_hparams, hparams_)
        else:
            ret_hparams = hparams_
        config_chains.append(config_fn)
        return ret_hparams

    global hparams
    assert args.config != '' or args_work_dir != '', 'Either config or exp name should be specified.'
    saved_hparams = {}
    ckpt_config_path = f'{args_work_dir}/config.yaml'
    if os.path.exists(ckpt_config_path):
        with open(ckpt_config_path, encoding='utf-8') as f:
            saved_hparams.update(yaml.safe_load(f))

    hparams_ = {}
    if args.config != '':
        hparams_.update(load_config(args.config))

    if not args.reset:
        hparams_.update(saved_hparams)
    hparams_['work_dir'] = args_work_dir

    if args.hparams != "":
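        for new_hparam in args.hparams.split(","):
            # Split on the first "=" only, so values may themselves contain "=".
            k, v = new_hparam.split("=", 1)
            if k not in hparams_:
                # New key: there is no existing type to cast to, so evaluate the literal.
                hparams_[k] = eval(v)
            elif v in ['True', 'False'] or isinstance(hparams_[k], bool):
                hparams_[k] = eval(v)
            else:
                hparams_[k] = type(hparams_[k])(v)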

    dictionary = hparams_.get('g2p_dictionary')
    if dictionary is None:
        dictionary = 'dictionaries/opencpop.txt'
    ckpt_dictionary = os.path.join(hparams_['work_dir'], os.path.basename(dictionary))
    if args_work_dir != '' and (not os.path.exists(ckpt_config_path) or args.reset) and not args.infer:
        os.makedirs(hparams_['work_dir'], exist_ok=True)
        if not bool(re.match(r'Process-\d+', multiprocessing.current_process().name)):
            # Only the main process will save the config file and dictionary
            with open(ckpt_config_path, 'w', encoding='utf-8') as f:
                hparams_non_recursive = hparams_.copy()
                hparams_non_recursive['base_config'] = []
                yaml.safe_dump(hparams_non_recursive, f, allow_unicode=True, encoding='utf-8')
            if hparams_.get('reset_phone_dict') or not os.path.exists(ckpt_dictionary):
                shutil.copy(dictionary, ckpt_dictionary)

    ckpt_dictionary_exists = os.path.exists(ckpt_dictionary)
    if not os.path.exists(dictionary) and not ckpt_dictionary_exists:
        raise FileNotFoundError(f'G2P dictionary not found in either of the following paths:\n'
                                f' - \'{dictionary}\'\n'
                                f' - \'{ckpt_dictionary}\'')
    hparams_['original_g2p_dictionary'] = dictionary
    if ckpt_dictionary_exists:
        dictionary = ckpt_dictionary
    hparams_['g2p_dictionary'] = dictionary

    hparams_['infer'] = args.infer
    hparams_['debug'] = args.debug
    hparams_['validate'] = args.validate
    global global_print_hparams
    if global_hparams:
        hparams.clear()
        hparams.update(hparams_)

    if print_hparams and global_print_hparams and global_hparams:
        print('| Hparams chains: ', config_chains)
        print('| Hparams: ')
        for i, (k, v) in enumerate(sorted(hparams_.items())):
            print(f"\033[;33;m{k}\033[0m: {v}, ", end="\n" if i % 5 == 4 else "")
        print("")
        global_print_hparams = False
    # print(hparams_.keys())
    if hparams.get('exp_name') is None:
        hparams['exp_name'] = args.exp_name
    if hparams_.get('exp_name') is None:
        hparams_['exp_name'] = args.exp_name
    return hparams_
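
The config chain in `set_hparams` relies on `override_config`, which merges nested dictionaries in place: nested keys are updated one by one rather than the whole sub-dictionary being replaced. The block below copies that function from the cell above so its behavior can be tried in isolation. The sample dictionaries, including the `beta1` key, are made up for the demonstration.

```python
def override_config(old_config: dict, new_config: dict):
    # Same logic as in the hparams.py cell: recurse into nested dicts,
    # otherwise overwrite (or add) the key.
    for k, v in new_config.items():
        if isinstance(v, dict) and k in old_config:
            override_config(old_config[k], new_config[k])
        else:
            old_config[k] = v


# Made-up sample values standing in for a base config and a child config.
base = {"optimizer_args": {"lr": 0.0004, "beta1": 0.9}, "hop_size": 512}
child = {"optimizer_args": {"lr": 0.0002}, "max_updates": 320000}

override_config(base, child)
print(base)
```

After the call, `base["optimizer_args"]` still contains `beta1`, while `lr` carries the child's value and `max_updates` has been added.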

In [None]:
# @markdown # Migrate to new format
# @markdown Input the path of your "transcriptions.txt". \
# @markdown This should generate a "transcriptions.csv" in the same folder.
transcription_txt_path = ''#@param{type:'string'}
!python scripts/migrate.py txt {transcription_txt_path}

In [None]:
#@markdown # Preprocess data
#@markdown Note: If you get a BinarizationError "transcriptions and dictionary mismatch", make sure every phoneme in your data appears in the dictionary and every phoneme in the dictionary appears in your data.
import os
os.environ['PYTHONPATH']='.'
!CUDA_VISIBLE_DEVICES=0 python scripts/binarize.py --config configs/acoustic.yaml

In [None]:
#@markdown # Tensorboard
#@markdown For monitoring training progress. Enter the output directory that you set in the "Set output directory of checkpoints" cell.
logs = ''#@param{type:'string'}
%load_ext tensorboard
%tensorboard --logdir {logs}/lightning_logs

In [None]:
#@markdown # Train your model
#@markdown Enter the name of your singer.
name = ''#@param{type:'string'}
!CUDA_VISIBLE_DEVICES=0 python scripts/train.py --config configs/acoustic.yaml --exp_name {name} --reset

In [None]:
#@markdown # Convert to ONNX for OpenUtau
#@markdown Enter the path of the folder which contains your checkpoint into the `checkpoints_path` box, and the name of the folder into the `name` box. \
#@markdown It will be saved in the output path you specify.
checkpoints_path = ''#@param{type:'string'}
name = ''#@param{type:'string'}
output_path = ''#@param{type:'string'}
!cp -r {checkpoints_path} /content/DiffSinger/checkpoints
!python scripts/export.py acoustic --exp {name} --out {output_path}